A few days ago, monitoring on one of our Oracle databases (Oracle Database 10g Release 10.2.0.4.0 - 64bit Production) raised the error "PMON failed to acquire latch, see PMON dump", and connections to the database failed briefly. The specific errors in the alert log were as follows:
Tue Dec 20 09:13:16 2016
PMON failed to acquire latch, see PMON dump
Tue Dec 20 09:14:16 2016
PMON failed to acquire latch, see PMON dump
Tue Dec 20 09:15:55 2016
PMON failed to acquire latch, see PMON dump
Tue Dec 20 09:17:15 2016
PMON failed to acquire latch, see PMON dump
Tue Dec 20 09:17:24 2016
WARNING: inbound connection timed out (ORA-3136)
Tue Dec 20 09:18:23 2016
PMON failed to acquire latch, see PMON dump
Tue Dec 20 09:19:24 2016
PMON failed to acquire latch, see PMON dump
The generated trace file epps_pmon_4988.trc contains more detail. You can see that the PMON process was unable to acquire the 'Child shared pool' latch: it was blocked by the process with pid=19, ospid=5022, and that process is a dispatcher process (a sketch of the V$ queries for cross-checking this live follows the dump below).
*** 2016-12-20 09:14:16.575
PMON unable to acquire latch 600edfa8 Child shared pool level=7 child#=1
Location from where latch is held: kghfrunp: alloc: session dur:
Context saved from call: 0
state=busy, wlstate=free
waiters [orapid (seconds since: put on list, posted, alive check)]:
33 (3, 1482196555, 3)
10 (3, 1482196555, 3)
25 (3, 1482196555, 3)
13 (3, 1482196555, 3)
waiter count=4
gotten 861091119 times wait, failed first 7114074 sleeps 1392223
gotten 0 times nowait, failed: 0
possible holder pid = 19 ospid=5022
----------------------------------------
SO: 0x40979aec8, type: 2, owner: (nil), flag: INIT/-/-/0x00
(process) Oracle pid=19, calls cur/top: (nil)/0x409c92608, flag: (80) DISPATCHER
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 236
last post received-location: kmcpdp
last process to post me: 4097a64a0 106 64
last post sent: 0 0 229
last post sent-location: kmcmbf: not KMCVCFTOS
last process posted by me: 4097a64a0 106 64
(latch info) wait_event=0 bits=a0
holding (efd=4) 600edfa8 Child shared pool level=7 child#=1
Location from where latch is held: kghfrunp: alloc: session dur:
Context saved from call: 0
state=busy, wlstate=free
waiters [orapid (seconds since: put on list, posted, alive check)]:
33 (3, 1482196555, 3)
10 (3, 1482196555, 3)
25 (3, 1482196555, 3)
13 (3, 1482196555, 3)
waiter count=4
holding (efd=4) 3fff78210 Child library cache level=5 child#=2
Location from where latch is held: kghfrunp: clatch: wait:
Context saved from call: 0
state=busy, wlstate=free
waiters [orapid (seconds since: put on list, posted, alive check)]:
15 (3, 1482196555, 3)
17 (3, 1482196555, 3)
12 (3, 1482196555, 3)
waiter count=3
Process Group: DEFAULT, pseudo proc: 0x4098bc190
O/S info: user: oracle, term: UNKNOWN, ospid: 5022
OSD pid info: Unix process pid: 5022, image: [email protected] (D007)
Short stack dump:
ksdxfstk()+32<-ksdxcb()+1547<-sspuser()+111<-__restore_rt()+0<-kghfrunp()+1506<-kghfnd()+1389<-kghalo()+587<-kmnsbm()+578<-nsbal()+428<-nsbalc()+123<-nsdo()+17278<-nsopen()+2315<-nsanswer()+512<-kmnans()+37<-kmdahd()+385<-kmdmai()+5220<-kmmrdp()+564<-opirip()+1193<-opidrv()+582<-sou2o()+114<-opimai_real()+317<-main()+116<-__libc_start_main()+244<-_start()+41
Dump of memory from 0x0000000409747C68 to 0x0000000409747E70
409747C60 00000001 00000000 [........]
409747C70 FE9BEE10 00000003 0000003A 0003129B [........:.......]
409747C80 FEA7D5D0 00000003 0000003A 0003129B [........:.......]
409747C90 FE9DAD30 00000003 0000003A 0003129B [0.......:.......]
Repeat 2 times
409747CC0 FEAB01F0 00000003 0000003A 0003129B [........:.......]
409747CD0 FE9DAD30 00000003 0000003A 0003129B [0.......:.......]
409747CE0 FEA44E70 00000003 0000003A 0003129B [pN......:.......]
409747CF0 FEAA6FF0 00000003 0000003A 0003129B [.o......:.......]
409747D00 FEAB8AD0 00000003 0000003A 0003129B [........:.......]
409747D10 FEA14FF0 00000003 0000003A 0003129B [.O......:.......]
409747D20 FE9A77F0 00000003 0000003A 0003129B [.w......:.......]
Repeat 1 times
409747D40 FEA3CEB0 00000003 0000003A 0003129B [........:.......]
Repeat 1 times
409747D60 FE9C64B0 00000003 0000003A 0003129B [.d......:.......]
Repeat 1 times
409747D80 FEA062B0 00000003 0000003A 0003129B [.b......:.......]
Repeat 3 times
409747DC0 FEAA6FF0 00000003 0000003A 0003129B [.o......:.......]
409747DD0 FEA8F9D0 00000003 0000003A 0003129B [........:.......]
409747DE0 FE9F7570 00000003 0000003A 0003129B [pu......:.......]
409747DF0 FEA91530 00000003 0000003A 0003129B [0.......:.......]
409747E00 FE9BEE10 00000003 0000003A 0003129B [........:.......]
409747E10 FE9BB750 00000003 0000003A 0003129B [P.......:.......]
409747E20 FEA90C10 00000003 0000003A 0003129B [........:.......]
409747E30 FEA8B9F0 00000003 0000003A 0003129B [........:.......]
409747E40 FE9C5270 00000003 0000003A 0003129B [pR......:.......]
409747E50 FEAE12B0 00000003 0000003A 0003129B [........:.......]
409747E60 FE9C5270 00000003 0000003A 0003129B [pR......:.......]
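Had the hang been caught while it was still in progress, the holder identified in the trace could have been cross-checked against the V$ views. A minimal sketch, assuming SYSDBA access (the ospid 5022 is the one reported in the trace above):

-- List current latch holders while the hang is in progress
SELECT pid, sid, laddr, name
  FROM v$latchholder;

-- Confirm the holder's identity from its OS pid (5022 per the trace);
-- the PROGRAM columns show it is the D007 dispatcher
SELECT p.pid, p.spid, p.program AS os_program,
       s.sid, s.program AS session_program
  FROM v$process p, v$session s
 WHERE s.paddr(+) = p.addr
   AND p.spid = '5022';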
Because the problem was not noticed while it was happening, no hanganalyze traces were collected, so digging any deeper for the root cause is now difficult. A manual AWR snapshot was taken at 9:26, so the snapshot interval 9:00 ~ 9:26 just covers the problem window. In the AWR report generated for this period, latch: library cache and latch: shared pool were the top wait events.
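For the record, the snapshot and report were produced with the standard tooling; a minimal sketch, assuming SYSDBA access, with the hanganalyze dump that ideally would have been taken shown as comments:

-- Take a manual AWR snapshot (this is the 9:26 snapshot mentioned above)
EXEC DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT();

-- Then generate the AWR report interactively, picking the two snapshots
-- that bracket the problem window:
-- @?/rdbms/admin/awrrpt.sql

-- What ideally would have been collected while the hang was still live:
-- SQL> oradebug setmypid
-- SQL> oradebug hanganalyze 3
-- SQL> oradebug dump systemstate 266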
During the problem window the database server was fairly idle.
An ASH report was also generated for 20-Dec-16 09:11:16 to 20-Dec-16 09:21:16. In that report, latch: library cache and latch: shared pool were again the main wait events, yet Avg Active Sessions was very small.
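The ASH report for an arbitrary window can be reproduced the same way; a minimal sketch, where 1234567890 and 1 are placeholders for your own dbid (from V$DATABASE) and instance number (from V$INSTANCE):

-- Interactive ASH report:
-- @?/rdbms/admin/ashrpt.sql

-- Or directly through the API (available in 10.2):
SELECT output
  FROM TABLE(DBMS_WORKLOAD_REPOSITORY.ASH_REPORT_TEXT(
         1234567890,                                        -- dbid (placeholder)
         1,                                                 -- instance number
         TO_DATE('2016-12-20 09:11:16', 'YYYY-MM-DD HH24:MI:SS'),
         TO_DATE('2016-12-20 09:21:16', 'YYYY-MM-DD HH24:MI:SS')));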
This made a bug seem very likely, so I later searched Oracle MetaLink for related bugs and found the following material:
Bug 7039896 Spin under kghquiesce_regular_extent holding shared pool latch with AMM
Bug 6488694 - DATABASE HUNG WITH PMON FAILED TO ACQUIRE LATCH MESSAGE
Note 7039896.8 - Bug 7039896 - Spin under kghquiesce_regular_extent holding shared pool latch with AMM
"Pmon Failed To Acquire Latch" Messages in Alert Log - Database Hung (Doc ID 468740.1)
Hang (Involving Shared Resource)
A process may hold a shared resource a lot longer than normally expected, leading to many other processes having to wait for that resource. Such a resource could be a lock, a library cache pin, a latch, etc. The overall symptom is that typically numerous processes all appear to be stuck, although some processes may continue unhindered if they do not need the blocked resource.
A process enters a tight CPU loop so appears to hang but is actually consuming CPU.
This issue can result in latch contention within the database.
Waits for "latch: shared pool"
Our database version is Oracle Database 10g Release 10.2.0.4.0 - 64bit Production, so Bug 7039896 does affect this database, and the observed symptoms match it well; the one discrepancy is that the MMAN process was not involved here. I also checked V$SGA_RESIZE_OPS (the query is sketched below), and no SGA component grew or shrank during that window. The case is also very similar to Bug 6488694, but again the trace file shows no connection to the MMAN process. This was the first occurrence of the problem, and it has not recurred in the days since, which makes a bug all the more likely as the cause. Of course, we should find a time to apply the patch for Bug 7039896.
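The V$SGA_RESIZE_OPS check mentioned above is a simple query; a minimal sketch restricted to a window around the incident:

-- Any SGA component grow/shrink operations around the problem window?
SELECT component, oper_type, oper_mode, initial_size, final_size, status,
       TO_CHAR(start_time, 'YYYY-MM-DD HH24:MI:SS') AS started,
       TO_CHAR(end_time, 'YYYY-MM-DD HH24:MI:SS') AS ended
  FROM v$sga_resize_ops
 WHERE start_time >= TO_DATE('2016-12-20 09:00:00', 'YYYY-MM-DD HH24:MI:SS')
   AND start_time <  TO_DATE('2016-12-20 09:30:00', 'YYYY-MM-DD HH24:MI:SS')
 ORDER BY start_time;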
Also, while researching this problem, I came across a detailed official document on how to diagnose and handle 'PMON failed to acquire latch, see PMON dump'. I originally meant to include it here, but it is better kept as a PDF; it can be downloaded from the link below if needed.