暂无图片
暂无图片
1
暂无图片
暂无图片
暂无图片

The cause of system hung is cursor: pin X ?

原创 Roger 2014-10-31
979
这是一个朋友单位的库,据说该库hung了一段时间,从alert log来看,中间有2小时是没有任何信息,难道hung了2个小时吗 ? alert log 如下:
Tue Oct 28 15:23:21 2014
kkjcre1p: unable to spawn jobq slave process
Tue Oct 28 15:23:21 2014
Errors in file /u01/app/oracle/admin/xxxx/bdump/xxxx_cjq0_9568296.trc:

Tue Oct 28 17:22:04 2014
System State dumped to trace file
Tue Oct 28 17:25:21 2014
LGWR: terminating instance due to error 472
Instance terminated by LGWR, pid = 10748094
Tue Oct 28 17:26:41 2014
Starting ORACLE instance (normal)
sskgpgetexecname failed to get name
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_10 parameter default value as USE_DB_RECOVERY_FILE_DEST
Autotune of undo retention is turned on.

上述的信息是很简单的,提到了system state dump,下面我们来看下朋友提供的dump. 首先利用脚本先格式化一下:
Starting Systemstate 1
......................................................
Ass.Awk Version 1.0.38
~~~~~~~~~~~~~~~~~~~~~~
Source file : xxxx_ora_33030248.trc

System State 1 (2014-10-28 17:22:04.803)
~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~
1: [DEAD]
2: waiting for 'cursor: pin X' [Mutex ad39e34] seq=37912
3: waiting for 'rdbms ipc message' seq=14319
4: waiting for 'rdbms ipc message' seq=1742
5: waiting for 'rdbms ipc message' seq=36300
6: waiting for 'rdbms ipc message' seq=11154
7: waiting for 'rdbms ipc message' seq=40518
8: waiting for 'latch: library cache' [Latch 70000006b9d8008] seq=26959
9: waiting for 'rdbms ipc message' seq=1897
10: waiting for 'latch: library cache' [Latch 70000006b9d8008] seq=46586
11: waiting for 'latch: library cache' [Latch 70000006b9d8008] seq=42779
12: waiting for 'rdbms ipc message' seq=1704
13:
14:
15: waiting for 'Streams AQ: qmn slave idle wait' seq=60
16: waiting for 'latch: library cache' [Latch 70000006b9d8008] seq=23668
17: waiting for 'SQL*Net message from client' seq=33
18: waiting for 'latch: library cache' [Latch 70000006b9d8008] seq=25574
Cmd: Insert
19: last wait for 'latch: library cache' [DEAD]
20: waiting for 'Streams AQ: qmn coordinator idle wait' seq=7
........
24: waiting for 'latch: library cache' [Latch 70000006b9d8008] seq=20357
25: waiting for 'Streams AQ: waiting for time management or cleanup tasks' seq=2929
26: waiting for 'latch: library cache' [Latch 70000006b9d8008] seq=23830
Cmd: Select
27: waiting for 'latch: library cache' [Latch 70000006b9d8008] seq=18517
.......
31: waiting for 'latch: library cache' [Latch 70000006b9d8008] seq=16015
32: last wait for 'latch: library cache'[Latch 70000006b9d8008]
........
54: waiting for 'latch: library cache' [Latch 70000006b9d8008] seq=408

Blockers
~~~~~~~~

Above is a list of all the processes. If they are waiting for a resource
then it will be given in square brackets. Below is a summary of the
waited upon resources, together with the holder of that resource.
Notes:
~~~~~
o A process id of '???' implies that the holder was not found in the
systemstate. (The holder may have released the resource before we
dumped the state object tree of the blocking process).
o Lines with 'Enqueue conversion' below can be ignored *unless*
other sessions are waiting on that resource too. For more, see
http://dlsunuk11.uk.oracle.com/Public/TOOLS/Ass.html#enqcnv)

Resource Holder State
Mutex ad39e34 19: last wait for 'latch: library cache'
Latch 70000006b9d8008 19: last wait for 'latch: library cache'
Latch 70000006b9d8008 32: 32: is waiting for 19: 32:

Some of the above latches may be child latches. Please check the section
named 'Child Latch Report' below for further notes.

Blockers According to Tracefile Wait Info:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1. This may not work for 64bit platforms. See bug 2902997 for details.
2. If the blocking process is shown as 0 then that session may no longer be
present.
3. If resources are held across code layers then sometimes the tracefile wait
info will not recognise the problem.

No blockers seen.

Object Names
~~~~~~~~~~~~
Mutex ad39e34
Latch 70000006b9d8008 Child library cache

Child Latch Report
~~~~~~~~~~~~~~~~~~
Some processes are being blocked waiting for child latches.

At the moment this script does not detect the blocker because the
child latch address differs to the parent latch address. To manually
detect the blocker please take the following steps :
1. Determine the TYPE of latch (Eg library cache) that is involved.
2. Search the source trace file for a target of :
holding.*Parent.*library cache
(Assuming we have a child library cache and have vi-like regular expressions)

If this shows nothing then the blocker may have released the resource
before we got to dump the state object tree of the blocked process.

A list of processes that hold parent latches is given below :

No processes found.

Latch Wait List Information
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Pid 8 is blocked waiting for latch 'Child library cache' with 38 waiters.
Possible holder is Pid 10027060.
.......
Pid 54 is blocked waiting for latch Child library cache with 38 waiters.
Possible holder is Pid 10027060.
39 entries seen

Summary of Wait Events Seen (count>10)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
37 : 'latch: library cache'

------------------ooOoo------------------
For the LATEST version of this utility see
http://dlsunuk11.uk.oracle.com/Public/Utils.html#ass

For additional documentation see
http://dlsunuk11.uk.oracle.com/Public/TOOLS/Ass.html

Suggested improvements, bugs etc. should be sent to kevin.p.quinn@oracle.com

End of report. 18605 Lines Processed.

很显然,从脚本格式化之后的信息来看,这部分信息是我们要分析的关键:
                    Resource Holder State
Mutex ad39e34 19: last wait for 'latch: library cache'
Latch 70000006b9d8008 19: last wait for 'latch: library cache'
Latch 70000006b9d8008 32: 32: is waiting for 19: 32:

 

19和32都是holder,而32 正在等待19. 注意,这里的19,32其实都是值的orapid。
我们可以看到,orapid 32正在正待Latch 70000006b9d8008,而orapid 32在等待19,那么orapid 19可能就是我们需要分析的关键了。 从这里来看,orapid 19似乎同时持有了一个mutex和latch,下面我们来分别看下orapid 19和32 的详细信息:
PROCESS 32:
----------------------------------------
SO: 70000006f69ac40, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=32, calls cur/top: 0/7000000697bf6d0, flag: (0) -
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 0
last post received-location: No post
last process to post me: none
last post sent: 0 0 0
last post sent-location: No post
last process posted by me: none
(latch info) wait_event=0 bits=0
Location from where call was made: kghfrunp: clatch: wait:
waiting for 70000006b9d8008 Child library cache level=5 child#=4
Location from where latch is held: kghfrunp: clatch: nowait:
Context saved from call: 0
state=busy, wlstate=free
waiters [orapid (seconds since: put on list, posted, alive check)]:
18 (7124, 1414488125, 2)
16 (7121, 1414488125, 2)
26 (7085, 1414488125, 2)
10 (7040, 1414488125, 2)
21 (6878, 1414488125, 2)
22 (6764, 1414488125, 2)
11 (6749, 1414488125, 2)
23 (6725, 1414488125, 2)
8 (6497, 1414488125, 2)
24 (6125, 1414488125, 2)
27 (5570, 1414488125, 2)
28 (5525, 1414488125, 2)
29 (5414, 1414488125, 2)
30 (4925, 1414488125, 2)
31 (4817, 1414488125, 2)
32 (4571, 1414488125, 2)
.......
53 (518, 1414488125, 2)
54 (125, 1414488125, 2)
waiter count=38
gotten 8804924 times wait, failed first 526 sleeps 874
gotten 81617 times nowait, failed: 3348
possible holder pid = 19 ospid=10027060
on wait list for 70000006b9d8008
acquiring 70000006b9d8008
Process Group: DEFAULT, pseudo proc: 70000006f783298
O/S info: user: oracle, term: UNKNOWN, ospid: 26542208
OSD pid info: Unix process pid: 26542208, image: oracle@fbtest
.........
----------------------------------------
SO: 70000006fa49ad8, type: 4, owner: 70000006f69ac40, flag: INIT/-/-/0x00
(session) sid: 525 trans: 0, creator: 70000006f69ac40, flag: (1) USR/- -/-/-/-/-/-
DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000
txn branch: 0
oct: 0, prv: 0, sql: 0, psql: 70000006341a4e8, user: 0/SYS
O/S info: user: , term: , ospid: , machine:
program:
last wait for 'latch: library cache' blocking sess=0x0 seq=15189 wait_time=292981 seconds since wait started=4569
address=70000006b9d8008, number=d7, tries=3b54
Dumping Session Wait History
for 'latch: library cache' count=1 wait_time=292981
address=70000006b9d8008, number=d7, tries=3b54
。。。。。。
for 'latch: library cache' count=1 wait_time=292990
address=70000006b9d8008, number=d7, tries=3b4b

我们可以看到orapid 32是一个用户进程,正在等待library cache latch,latch地址是:70000006b9d8008
从possible holder pid = 19 ospid=10027060 可以看出,这个latch正在被orapid 19所持有,下面我们搜索关键字:process 19
PROCESS 19:
----------------------------------------
SO: 70000006f6945e0, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=19, calls cur/top: 0/7000000692e8ad0, flag: (1) DEAD
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 0
last post received-location: No post
last process to post me: none
last post sent: 0 0 0
last post sent-location: No post
last process posted by me: none
(latch info) wait_event=0 bits=200
holding (efd=5) 70000006b9d8008 Child library cache level=5 child#=4
Location from where latch is held: kghfrunp: clatch: nowait:
Context saved from call: 0
state=busy, wlstate=free
waiters [orapid (seconds since: put on list, posted, alive check)]:
18 (7124, 1414488125, 2)
16 (7121, 1414488125, 2)
26 (7085, 1414488125, 2)
10 (7040, 1414488125, 2)
21 (6878, 1414488125, 2)
22 (6764, 1414488125, 2)
11 (6749, 1414488125, 2)
23 (6725, 1414488125, 2)
8 (6497, 1414488125, 2)
24 (6125, 1414488125, 2)
27 (5570, 1414488125, 2)
28 (5525, 1414488125, 2)
29 (5414, 1414488125, 2)
30 (4925, 1414488125, 2)
31 (4817, 1414488125, 2)
32 (4571, 1414488125, 2)
.......
54 (125, 1414488125, 2)
waiter count=38
Process Group: DEFAULT, pseudo proc: 70000006f783298
O/S info: user: oracle, term: UNKNOWN, ospid: 10027060 (DEAD)
OSD pid info: Unix process pid: 10027060, image: oracle@fbtest (J001)
..........
Repeat 27 times
----------------------------------------
SO: 70000006fa55b80, type: 4, owner: 70000006f6945e0, flag: INIT/-/-/0x00
(session) sid: 534 trans: 0, creator: 70000006f6945e0, flag: (1) USR/- -/-/-/-/-/-
DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000
txn branch: 0
oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
O/S info: user: , term: , ospid: , machine:
program:
last wait for 'latch: library cache' blocking sess=0x0 seq=14 wait_time=507 seconds since wait started=7167
address=70000006b9d7f68, number=d7, tries=0
Dumping Session Wait History
for 'latch: library cache' count=1 wait_time=507
address=70000006b9d7f68, number=d7, tries=0
for 'latch: library cache' count=1 wait_time=328
address=70000006b9d7f68, number=d7, tries=0
for 'latch: library cache' count=1 wait_time=340
address=70000006b9d7f68, number=d7, tries=0
for 'latch: library cache' count=1 wait_time=1644
address=70000006b9d8008, number=d7, tries=0
for 'latch: library cache' count=1 wait_time=1245
address=70000006b9d8008, number=d7, tries=0
for 'latch: library cache' count=1 wait_time=1633
address=70000006b9d8008, number=d7, tries=0
for 'latch: library cache' count=1 wait_time=315
address=70000006b9d80a8, number=d7, tries=0
for 'latch: library cache' count=1 wait_time=655
address=70000006b9d80a8, number=d7, tries=0
for 'latch: library cache' count=1 wait_time=4888
address=70000006b9d80a8, number=d7, tries=0
for 'latch: library cache' count=1 wait_time=176
address=70000006b9d80a8, number=d7, tries=0
........
temporary object counter: 0
----------------------------------------
UOL used : 1 locks(used=0, free=0)
KGX Atomic Operation Log 70000006e404f90
Mutex 70000003eec4be0(534, 0) idn ad39e34 oper EXCL
Cursor Pin uid 534 efd 5 whr 11 slp 0
KGX Atomic Operation Log 70000006e404fd8
Mutex 0(0, 0) idn 0 oper NONE
Library Cache uid 534 efd 0 whr 0 slp 0
KGX Atomic Operation Log 70000006e405020
Mutex 0(0, 0) idn 0 oper NONE
Library Cache uid 534 efd 0 whr 0 slp 0

 

我们可以看到orapid 19这个进程正在持有llibrary cache latch 70000006b9d8008没有释放,到前面的orapid 32一直等待。
然而我们可以看到,orapid 19虽然持有了70000006b9d8008这个latch,然而同时它也在等待一个library cache latch 70000006b9d7f68
我们可以看到这个orapid 19的进程其实是J001,其状态已经为dead了。同时我们分析下面的mutex信息,发现该进程同时还持有了mutex: Mutex 70000003eec4be0(534, 0) idn ad39e34 oper EXCL
从这句话我们可以看出,orapid 19还持有了mutex ad39e34,而且mode为exel,即排他模式. 根据这个mutex 地址,我们进行搜索,发现 Oracle Pmon进程正在申请这个Mutex,如下:
PROCESS 2:
----------------------------------------
SO: 70000006f68c000, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=2, calls cur/top: 70000006fab6310/70000006fab6310, flag: (e) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 33
last post received-location: ksrpublish
last process to post me: 70000006f692660 3 2
last post sent: 0 0 48
last post sent-location: ksoreq_reply
last process posted by me: 70000006f7809e0 1 0
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 70000006f783298
O/S info: user: oracle, term: UNKNOWN, ospid: 9895986
OSD pid info: Unix process pid: 9895986, image: oracle@fbtest (PMON)
........
----------------------------------------
SO: 70000006fd51908, type: 11, owner: 70000006f68c000, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: 70000006f68c000,
event: 1, last message event: 13,
last message waited event: 13, messages read: 1
channel: (70000006fd77710) scumnt mount lock
scope: 1, event: 13, last mesage event: 13,
publishers/subscribers: 0/11,
messages published: 1
----------------------------------------
SO: 70000006fa71d08, type: 4, owner: 70000006f68c000, flag: INIT/-/-/0x00
(session) sid: 555 trans: 0, creator: 70000006f68c000, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0000-0002-00000002, short-term DID: 0000-0000-00000000
txn branch: 0
oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
service name: SYS$BACKGROUND
waiting for 'cursor: pin X' blocking sess=0x0 seq=37912 wait_time=0 seconds since wait started=0
idn=ad39e34, value=21600000000, where|sleeps=b000a93dd
Dumping Session Wait History
for 'cursor: pin X' count=1 wait_time=9782
idn=ad39e34, value=21600000000, where|sleeps=b000a93dc
..........
idn=ad39e34, value=21600000000, where|sleeps=b000a93d4
for 'cursor: pin X' count=1 wait_time=9782
idn=ad39e34, value=21600000000, where|sleeps=b000a93d3
temporary object counter: 0
----------------------------------------
UOL used : 1 locks(used=0, free=0)
KGX Atomic Operation Log 70000006eb507f8
Mutex 70000003eec4be0(534, 0) idn ad39e34 oper GET_EXCL
Cursor Pin uid 555 efd 5 whr 11 slp 5127
KGX Atomic Operation Log 70000006eb50840
Mutex 0(0, 0) idn 0 oper NONE
Library Cache uid 555 efd 0 whr 0 slp 0
KGX Atomic Operation Log 70000006eb50888
Mutex 0(0, 0) idn 0 oper NONE
Library Cache uid 555 efd 0 whr 0 slp 0
----------------------------------------

可以看到pmon进程一直在等待这个mutex,也需要申请获得排他模式的mutex(类似传统的排他模式的library cache pin).
而由于这个mutex被前面的orapid 19(j001)锁持有,没有释放,导致pmon进程挂起。
最后我们简单的总结下:
1、大量的进程在等待library cache latch,而该latch被orapid 19(J001)进程所持有.

2、J001进程持有library cache latch不释放,同时还申请持有mutex,该进程状态是dead

3、Pmon进程申请排它模式的Mutex,而该mutex被J001进程持有,且状态也是排它模式.

4、按理说J001进程死掉之后,其持有的latch信息应该都会被pmon进程所清理释放掉,因为   这本身就是pmon进程的分内之事,然而由于mutex的原因,导致pmon进程挂起。

5、最后由于pmon进程也无法获得mutex,导致无法清理j0001进程,最后导致j0001阻塞了大量会话,   出现大面积library cache latch等待,最后整个系统hung.
6、个人认为问题的根据还是mutex的问题,library cache latch只是表象.
「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论