问题描述
这是一个朋友单位的库,据说该库hung了一段时间,从alert log来看,中间有2小时是没有任何信息,难道hung了2个小时吗 ? alert log 如下:
Tue Oct 28 15:23:21 2014 kkjcre1p: unable to spawn jobq slave process Tue Oct 28 15:23:21 2014 Errors in file /u01/app/oracle/admin/xxxx/bdump/xxxx_cjq0_9568296.trc: Tue Oct 28 17:22:04 2014 System State dumped to trace file Tue Oct 28 17:25:21 2014 LGWR: terminating instance due to error 472 Instance terminated by LGWR, pid = 10748094 Tue Oct 28 17:26:41 2014 Starting ORACLE instance (normal) sskgpgetexecname failed to get name LICENSE_MAX_SESSION = 0 LICENSE_SESSIONS_WARNING = 0 Picked latch-free SCN scheme 3 Using LOG_ARCHIVE_DEST_10 parameter default value as USE_DB_RECOVERY_FILE_DEST Autotune of undo retention is turned on.
专家解答
上述的信息是很简单的,提到了system state dump,下面我们来看下朋友提供的dump. 首先利用脚本先格式化一下:
Starting Systemstate 1 ...................................................... Ass.Awk Version 1.0.38 ~~~~~~~~~~~~~~~~~~~~~~ Source file : xxxx_ora_33030248.trc System State 1 (2014-10-28 17:22:04.803) ~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~ 1: [DEAD] 2: waiting for 'cursor: pin X' [Mutex ad39e34] seq=37912 3: waiting for 'rdbms ipc message' seq=14319 4: waiting for 'rdbms ipc message' seq=1742 5: waiting for 'rdbms ipc message' seq=36300 6: waiting for 'rdbms ipc message' seq=11154 7: waiting for 'rdbms ipc message' seq=40518 8: waiting for 'latch: library cache' [Latch 70000006b9d8008] seq=26959 9: waiting for 'rdbms ipc message' seq=1897 10: waiting for 'latch: library cache' [Latch 70000006b9d8008] seq=46586 11: waiting for 'latch: library cache' [Latch 70000006b9d8008] seq=42779 12: waiting for 'rdbms ipc message' seq=1704 13: 14: 15: waiting for 'Streams AQ: qmn slave idle wait' seq=60 16: waiting for 'latch: library cache' [Latch 70000006b9d8008] seq=23668 17: waiting for 'SQL*Net message from client' seq=33 18: waiting for 'latch: library cache' [Latch 70000006b9d8008] seq=25574 Cmd: Insert 19: last wait for 'latch: library cache' [DEAD] 20: waiting for 'Streams AQ: qmn coordinator idle wait' seq=7 ........ 24: waiting for 'latch: library cache' [Latch 70000006b9d8008] seq=20357 25: waiting for 'Streams AQ: waiting for time management or cleanup tasks' seq=2929 26: waiting for 'latch: library cache' [Latch 70000006b9d8008] seq=23830 Cmd: Select 27: waiting for 'latch: library cache' [Latch 70000006b9d8008] seq=18517 ....... 31: waiting for 'latch: library cache' [Latch 70000006b9d8008] seq=16015 32: last wait for 'latch: library cache'[Latch 70000006b9d8008] ........ 54: waiting for 'latch: library cache' [Latch 70000006b9d8008] seq=408 Blockers ~~~~~~~~ Above is a list of all the processes. If they are waiting for a resource then it will be given in square brackets. Below is a summary of the waited upon resources, together with the holder of that resource. Notes: ~~~~~ o A process id of '???' implies that the holder was not found in the systemstate. (The holder may have released the resource before we dumped the state object tree of the blocking process). o Lines with 'Enqueue conversion' below can be ignored *unless* other sessions are waiting on that resource too. For more, see http://dlsunuk11.uk.oracle.com/Public/TOOLS/Ass.html#enqcnv) Resource Holder State Mutex ad39e34 19: last wait for 'latch: library cache' Latch 70000006b9d8008 19: last wait for 'latch: library cache' Latch 70000006b9d8008 32: 32: is waiting for 19: 32: Some of the above latches may be child latches. Please check the section named 'Child Latch Report' below for further notes. Blockers According to Tracefile Wait Info: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1. This may not work for 64bit platforms. See bug 2902997 for details. 2. If the blocking process is shown as 0 then that session may no longer be present. 3. If resources are held across code layers then sometimes the tracefile wait info will not recognise the problem. No blockers seen. Object Names ~~~~~~~~~~~~ Mutex ad39e34 Latch 70000006b9d8008 Child library cache Child Latch Report ~~~~~~~~~~~~~~~~~~ Some processes are being blocked waiting for child latches. At the moment this script does not detect the blocker because the child latch address differs to the parent latch address. To manually detect the blocker please take the following steps : 1. Determine the TYPE of latch (Eg library cache) that is involved. 2. Search the source trace file for a target of : holding.*Parent.*library cache (Assuming we have a child library cache and have vi-like regular expressions) If this shows nothing then the blocker may have released the resource before we got to dump the state object tree of the blocked process. A list of processes that hold parent latches is given below : No processes found. Latch Wait List Information ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Pid 8 is blocked waiting for latch 'Child library cache' with 38 waiters. Possible holder is Pid 10027060. ....... Pid 54 is blocked waiting for latch Child library cache with 38 waiters. Possible holder is Pid 10027060. 39 entries seen Summary of Wait Events Seen (count>10) ~~~~~~~~~~~~~~~~~~~~~~~~~~~ 37 : 'latch: library cache' ------------------ooOoo------------------ For the LATEST version of this utility see http://dlsunuk11.uk.oracle.com/Public/Utils.html#ass For additional documentation see http://dlsunuk11.uk.oracle.com/Public/TOOLS/Ass.html Suggested improvements, bugs etc. should be sent to kevin.p.quinn@oracle.com End of report. 18605 Lines Processed.
很显然,从脚本格式化之后的信息来看,这部分信息是我们要分析的关键:
Resource Holder State Mutex ad39e34 19: last wait for 'latch: library cache' Latch 70000006b9d8008 19: last wait for 'latch: library cache' Latch 70000006b9d8008 32: 32: is waiting for 19: 32:
19和32都是holder,而32 正在等待19. 注意,这里的19,32其实都是值的orapid。
我们可以看到,orapid 32正在正待Latch 70000006b9d8008,而orapid 32在等待19,那么orapid 19可能就是我们需要分析的关键了。 从这里来看,orapid 19似乎同时持有了一个mutex和latch,下面我们来分别看下orapid 19和32 的详细信息:
PROCESS 32: ---------------------------------------- SO: 70000006f69ac40, type: 2, owner: 0, flag: INIT/-/-/0x00 (process) Oracle pid=32, calls cur/top: 0/7000000697bf6d0, flag: (0) - int error: 0, call error: 0, sess error: 0, txn error 0 (post info) last post received: 0 0 0 last post received-location: No post last process to post me: none last post sent: 0 0 0 last post sent-location: No post last process posted by me: none (latch info) wait_event=0 bits=0 Location from where call was made: kghfrunp: clatch: wait: waiting for 70000006b9d8008 Child library cache level=5 child#=4 Location from where latch is held: kghfrunp: clatch: nowait: Context saved from call: 0 state=busy, wlstate=free waiters [orapid (seconds since: put on list, posted, alive check)]: 18 (7124, 1414488125, 2) 16 (7121, 1414488125, 2) 26 (7085, 1414488125, 2) 10 (7040, 1414488125, 2) 21 (6878, 1414488125, 2) 22 (6764, 1414488125, 2) 11 (6749, 1414488125, 2) 23 (6725, 1414488125, 2) 8 (6497, 1414488125, 2) 24 (6125, 1414488125, 2) 27 (5570, 1414488125, 2) 28 (5525, 1414488125, 2) 29 (5414, 1414488125, 2) 30 (4925, 1414488125, 2) 31 (4817, 1414488125, 2) 32 (4571, 1414488125, 2) ....... 53 (518, 1414488125, 2) 54 (125, 1414488125, 2) waiter count=38 gotten 8804924 times wait, failed first 526 sleeps 874 gotten 81617 times nowait, failed: 3348 possible holder pid = 19 ospid=10027060 on wait list for 70000006b9d8008 acquiring 70000006b9d8008 Process Group: DEFAULT, pseudo proc: 70000006f783298 O/S info: user: oracle, term: UNKNOWN, ospid: 26542208 OSD pid info: Unix process pid: 26542208, image: oracle@fbtest ......... ---------------------------------------- SO: 70000006fa49ad8, type: 4, owner: 70000006f69ac40, flag: INIT/-/-/0x00 (session) sid: 525 trans: 0, creator: 70000006f69ac40, flag: (1) USR/- -/-/-/-/-/- DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000 txn branch: 0 oct: 0, prv: 0, sql: 0, psql: 70000006341a4e8, user: 0/SYS O/S info: user: , term: , ospid: , machine: program: last wait for 'latch: library cache' blocking sess=0x0 seq=15189 wait_time=292981 seconds since wait started=4569 address=70000006b9d8008, number=d7, tries=3b54 Dumping Session Wait History for 'latch: library cache' count=1 wait_time=292981 address=70000006b9d8008, number=d7, tries=3b54 。。。。。。 for 'latch: library cache' count=1 wait_time=292990 address=70000006b9d8008, number=d7, tries=3b4b
我们可以看到orapid 32是一个用户进程,正在等待library cache latch,latch地址是:70000006b9d8008
从possible holder pid = 19 ospid=10027060 可以看出,这个latch正在被orapid 19所持有,下面我们搜索关键字:process 19
PROCESS 19: ---------------------------------------- SO: 70000006f6945e0, type: 2, owner: 0, flag: INIT/-/-/0x00 (process) Oracle pid=19, calls cur/top: 0/7000000692e8ad0, flag: (1) DEAD int error: 0, call error: 0, sess error: 0, txn error 0 (post info) last post received: 0 0 0 last post received-location: No post last process to post me: none last post sent: 0 0 0 last post sent-location: No post last process posted by me: none (latch info) wait_event=0 bits=200 holding (efd=5) 70000006b9d8008 Child library cache level=5 child#=4 Location from where latch is held: kghfrunp: clatch: nowait: Context saved from call: 0 state=busy, wlstate=free waiters [orapid (seconds since: put on list, posted, alive check)]: 18 (7124, 1414488125, 2) 16 (7121, 1414488125, 2) 26 (7085, 1414488125, 2) 10 (7040, 1414488125, 2) 21 (6878, 1414488125, 2) 22 (6764, 1414488125, 2) 11 (6749, 1414488125, 2) 23 (6725, 1414488125, 2) 8 (6497, 1414488125, 2) 24 (6125, 1414488125, 2) 27 (5570, 1414488125, 2) 28 (5525, 1414488125, 2) 29 (5414, 1414488125, 2) 30 (4925, 1414488125, 2) 31 (4817, 1414488125, 2) 32 (4571, 1414488125, 2) ....... 54 (125, 1414488125, 2) waiter count=38 Process Group: DEFAULT, pseudo proc: 70000006f783298 O/S info: user: oracle, term: UNKNOWN, ospid: 10027060 (DEAD) OSD pid info: Unix process pid: 10027060, image: oracle@fbtest (J001) .......... Repeat 27 times ---------------------------------------- SO: 70000006fa55b80, type: 4, owner: 70000006f6945e0, flag: INIT/-/-/0x00 (session) sid: 534 trans: 0, creator: 70000006f6945e0, flag: (1) USR/- -/-/-/-/-/- DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000 txn branch: 0 oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS O/S info: user: , term: , ospid: , machine: program: last wait for 'latch: library cache' blocking sess=0x0 seq=14 wait_time=507 seconds since wait started=7167 address=70000006b9d7f68, number=d7, tries=0 Dumping Session Wait History for 'latch: library cache' count=1 wait_time=507 address=70000006b9d7f68, number=d7, tries=0 for 'latch: library cache' count=1 wait_time=328 address=70000006b9d7f68, number=d7, tries=0 for 'latch: library cache' count=1 wait_time=340 address=70000006b9d7f68, number=d7, tries=0 for 'latch: library cache' count=1 wait_time=1644 address=70000006b9d8008, number=d7, tries=0 for 'latch: library cache' count=1 wait_time=1245 address=70000006b9d8008, number=d7, tries=0 for 'latch: library cache' count=1 wait_time=1633 address=70000006b9d8008, number=d7, tries=0 for 'latch: library cache' count=1 wait_time=315 address=70000006b9d80a8, number=d7, tries=0 for 'latch: library cache' count=1 wait_time=655 address=70000006b9d80a8, number=d7, tries=0 for 'latch: library cache' count=1 wait_time=4888 address=70000006b9d80a8, number=d7, tries=0 for 'latch: library cache' count=1 wait_time=176 address=70000006b9d80a8, number=d7, tries=0 ........ temporary object counter: 0 ---------------------------------------- UOL used : 1 locks(used=0, free=0) KGX Atomic Operation Log 70000006e404f90 Mutex 70000003eec4be0(534, 0) idn ad39e34 oper EXCL Cursor Pin uid 534 efd 5 whr 11 slp 0 KGX Atomic Operation Log 70000006e404fd8 Mutex 0(0, 0) idn 0 oper NONE Library Cache uid 534 efd 0 whr 0 slp 0 KGX Atomic Operation Log 70000006e405020 Mutex 0(0, 0) idn 0 oper NONE Library Cache uid 534 efd 0 whr 0 slp 0
我们可以看到orapid 19这个进程正在持有llibrary cache latch 70000006b9d8008没有释放,到前面的orapid 32一直等待。
然而我们可以看到,orapid 19虽然持有了70000006b9d8008这个latch,然而同时它也在等待一个library cache latch 70000006b9d7f68
我们可以看到这个orapid 19的进程其实是J001,其状态已经为dead了。同时我们分析下面的mutex信息,发现该进程同时还持有了mutex: Mutex 70000003eec4be0(534, 0) idn ad39e34 oper EXCL
从这句话我们可以看出,orapid 19还持有了mutex ad39e34,而且mode为exel,即排他模式. 根据这个mutex 地址,我们进行搜索,发现 Oracle Pmon进程正在申请这个Mutex,如下:
PROCESS 2: ---------------------------------------- SO: 70000006f68c000, type: 2, owner: 0, flag: INIT/-/-/0x00 (process) Oracle pid=2, calls cur/top: 70000006fab6310/70000006fab6310, flag: (e) SYSTEM int error: 0, call error: 0, sess error: 0, txn error 0 (post info) last post received: 0 0 33 last post received-location: ksrpublish last process to post me: 70000006f692660 3 2 last post sent: 0 0 48 last post sent-location: ksoreq_reply last process posted by me: 70000006f7809e0 1 0 (latch info) wait_event=0 bits=0 Process Group: DEFAULT, pseudo proc: 70000006f783298 O/S info: user: oracle, term: UNKNOWN, ospid: 9895986 OSD pid info: Unix process pid: 9895986, image: oracle@fbtest (PMON) ........ ---------------------------------------- SO: 70000006fd51908, type: 11, owner: 70000006f68c000, flag: INIT/-/-/0x00 (broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: 70000006f68c000, event: 1, last message event: 13, last message waited event: 13, messages read: 1 channel: (70000006fd77710) scumnt mount lock scope: 1, event: 13, last mesage event: 13, publishers/subscribers: 0/11, messages published: 1 ---------------------------------------- SO: 70000006fa71d08, type: 4, owner: 70000006f68c000, flag: INIT/-/-/0x00 (session) sid: 555 trans: 0, creator: 70000006f68c000, flag: (51) USR/- BSY/-/-/-/-/- DID: 0000-0002-00000002, short-term DID: 0000-0000-00000000 txn branch: 0 oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS service name: SYS$BACKGROUND waiting for 'cursor: pin X' blocking sess=0x0 seq=37912 wait_time=0 seconds since wait started=0 idn=ad39e34, value=21600000000, where|sleeps=b000a93dd Dumping Session Wait History for 'cursor: pin X' count=1 wait_time=9782 idn=ad39e34, value=21600000000, where|sleeps=b000a93dc .......... idn=ad39e34, value=21600000000, where|sleeps=b000a93d4 for 'cursor: pin X' count=1 wait_time=9782 idn=ad39e34, value=21600000000, where|sleeps=b000a93d3 temporary object counter: 0 ---------------------------------------- UOL used : 1 locks(used=0, free=0) KGX Atomic Operation Log 70000006eb507f8 Mutex 70000003eec4be0(534, 0) idn ad39e34 oper GET_EXCL Cursor Pin uid 555 efd 5 whr 11 slp 5127 KGX Atomic Operation Log 70000006eb50840 Mutex 0(0, 0) idn 0 oper NONE Library Cache uid 555 efd 0 whr 0 slp 0 KGX Atomic Operation Log 70000006eb50888 Mutex 0(0, 0) idn 0 oper NONE Library Cache uid 555 efd 0 whr 0 slp 0 ----------------------------------------
可以看到pmon进程一直在等待这个mutex,也需要申请获得排他模式的mutex(类似传统的排他模式的library cache pin).
而由于这个mutex被前面的orapid 19(j001)锁持有,没有释放,导致pmon进程挂起。
最后我们简单的总结下:
1、大量的进程在等待library cache latch,而该latch被orapid 19(J001)进程所持有.
2、J001进程持有library cache latch不释放,同时还申请持有mutex,该进程状态是dead
3、Pmon进程申请排它模式的Mutex,而该mutex被J001进程持有,且状态也是排它模式.
4、按理说J001进程死掉之后,其持有的latch信息应该都会被pmon进程所清理释放掉,因为 这本身就是pmon进程的分内之事,然而由于mutex的原因,导致pmon进程挂起。
5、最后由于pmon进程也无法获得mutex,导致无法清理j0001进程,最后导致j0001阻塞了大量会话, 出现大面积library cache latch等待,最后整个系统hung.
6、个人认为问题的根据还是mutex的问题,library cache latch只是表象.