Oracle 11.2.0.3.9 RAC is hang (for Linux)

Roger 2019-06-17

1385

问题描述

一个朋友的库出现异常，6月30号就出现hang的问题，当时也给了建议直接kill会话解决了，没想到现在又出现了。简单的帮忙分析了一下原因，如下是alert log的信息：

Mon Jul 14 18:50:06 2014
Archived Log entry 3689 added for thread 1 sequence 2223 ID 0x1db958b2 dest 1:
Mon Jul 14 19:40:29 2014
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_lmhb_17008.trc  (incident=224129):
ORA-00445: background process "GCR0" did not start after 120 seconds
Incident details in: /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/incident/incdir_224129/xxxx1_lmhb_17008_i224129.trc
Mon Jul 14 19:40:30 2014
Dumping diagnostic data in directory=[cdmp_20140714194030], requested by (instance=1, osid=17008 (LMHB)), summary=[incident=224129].
Mon Jul 14 19:40:46 2014
LCK0 (ospid: 17064) waits for latch 'shared pool' for 169 secs.
Mon Jul 14 19:41:29 2014
PMON failed to acquire latch, see PMON dump
Mon Jul 14 19:42:01 2014
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_lmhb_17008.trc  (incident=224130):
ORA-29771: process USER (OSID 4221) blocks LCK0 (OSID 17064) for more than 70 seconds
Incident details in: /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/incident/incdir_224130/xxxx1_lmhb_17008_i224130.trc
Mon Jul 14 19:42:29 2014
PMON failed to acquire latch, see PMON dump
Mon Jul 14 19:43:10 2014
USER (ospid: 4221) is blocking LCK0 (ospid: 17064) in a wait
LMHB (ospid: 17008) kills USER (ospid: 4221).

专家解答

从上面的alert log信息来看,可以看到GCR0进程启动失败，且核心进程LCK0也在等待latch:shared pool. 更重要的一点是：
连pmon进程也无法获得latch了。很明显，这个数据库hang住了。

对于Oracle hang问题,大家所了解做法是,先登录数据库，然后找到阻塞源头kill会话。

那么如果数据库hang住了之后，怎么登录呢？对于Oracle 10g以及之后的版本可以通过sqlplus -prelim / as sysdba方式登录.

然后进行hanganalyze dump,可惜是朋友这里操作了之后看到任何信息,那么只能进行systemstate dump和processstate dump了。

首先我们来看下PMON进程在等待为什么，为什么无法获得latch ？

PROCESS 2: PMON
 ----------------------------------------
 SO: 0x55ccf1b98, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
 proc=0x55ccf1b98, name=process, file=ksu.h LINE:12616, pg=0
 (process) Oracle pid:2, ser:1, calls cur/top: 0x55d7c76b0/0x55d7c76b0
 flags : (0xe) SYSTEM
 flags2: (0x0),  flags3: (0x10)
intr error: 0, call error: 0, sess error: 0, txn error 0
 intr queue: empty
 ksudlp FALSE at location: 0
 (post info) last post received: 0 0 146
 last post received-location: kjm.h LINE:1248 ID:kjfmHealth_Latch_killHolder: post pmon after latch holder kill
 last process to post me: 544d25368 1 2
 last post sent: 0 0 52
 last post sent-location: ksv2.h LINE:1682 ID:ksvpst: cldel - post master
 last process posted by me: 544d25368 1 2
 (latch info) wait_event=0 bits=0
 Location from where call was made: kgh.h LINE:6483 ID:kghfre: Chunk Header
 Context saved from call: 22949992976
 waiting for 601091f8 Child shared pool level=7 child#=2
Location from where latch is held: kgh.h LINE:6463 ID:kghalo:
Context saved from call: 0
 state=busy [holder orapid=146] wlstate=free [value=0]
 waiters [orapid (seconds since: put on list, posted, alive check)]:
 205 (1830, 1405339770, 1769)
 30 (1829, 1405339770, 1650)
 208 (1793, 1405339770, 328)
 66 (1770, 1405339770, 1770)
 104 (1743, 1405339770, 1743)
 16 (1536, 1405339770, 1536)
 96 (1489, 1405339770, 1489)
 58 (270, 1405339770, 270)
 waiter count=8
 gotten 2181033884 times wait, failed first 7066852 sleeps 865327
 gotten 215950 times nowait, failed: 464
 possible holder pid = 146 ospid=13135
 Process Group: DEFAULT, pseudo proc: 0x540f7d060
 O/S info: user: oracle, term: UNKNOWN, ospid: 16945
OSD pid info: Unix process pid: 16945, image: oracle@itvxxxx1 (PMON)

可以看到pmon进程无法获得shared pool latch,而该资源(601091f8)正在被orapid=146所持有。下面来看下orapid 146在干什么？

PROCESS 146:
----------------------------------------
 SO: 0x55cd17458, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
 proc=0x55cd17458, name=process, file=ksu.h LINE:12616, pg=0
 (process) Oracle pid:146, ser:39, calls cur/top: 0x26c90ee80/0x26c9099e0
 flags : (0x0) -
 flags2: (0x0),  flags3: (0x10)
intr error: 0, call error: 0, sess error: 0, txn error 0
 intr queue: empty
 ksudlp FALSE at location: 0
 (post info) last post received: 0 0 259
 last post received-location: kgl.h LINE:8714 ID:kgllkdl: post after freeing latch
 last process to post me: 548d29e00 4 0
 last post sent: 0 0 26
 last post sent-location: ksa2.h LINE:285 ID:ksasnd
 last process posted by me: 55ccf6f08 1 6
 (latch info) wait_event=0 bits=80
 holding    (efd=17) 601091f8 Child shared pool level=7 child#=2
Location from where latch is held: kgh.h LINE:6463 ID:kghalo:
Context saved from call: 0
 state=busy [holder orapid=146] wlstate=free [value=0]
 waiters [orapid (seconds since: put on list, posted, alive check)]:
 205 (2115, 1405340055, 2054)
 30 (2114, 1405340055, 1935)
 208 (2078, 1405340055, 162)
 66 (2055, 1405340055, 2055)
 104 (2028, 1405340055, 2028)
 16 (1821, 1405340055, 1821)
 96 (1774, 1405340055, 1774)
 58 (555, 1405340055, 555)
 65 (255, 1405340055, 255)
 waiter count=9
 Process Group: DEFAULT, pseudo proc: 0x540f7d060
 O/S info: user: grid, term: UNKNOWN, ospid: 13135
OSD pid info: Unix process pid: 13135, image: oracle@itvxxxx1

ok,我们看到该进程正在持有 601091f8这个Child shared pool。朋友这里单独对orapid=146这个会话进行了dump，下面来深入分析一下：

SO: 0x55cd17458, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
 proc=0x55cd17458, name=process, file=ksu.h LINE:12616, pg=0
 (process) Oracle pid:146, ser:39, calls cur/top: 0x26c90ee80/0x26c9099e0
 flags : (0x0) -
 flags2: (0x0),  flags3: (0x10)
intr error: 0, call error: 0, sess error: 0, txn error 0
 intr queue: empty
 ksudlp FALSE at location: 0
 (post info) last post received: 0 0 259
 last post received-location: kgl.h LINE:8714 ID:kgllkdl: post after freeing latch
 last process to post me: 548d29e00 4 0
 last post sent: 0 0 26
 last post sent-location: ksa2.h LINE:285 ID:ksasnd
 last process posted by me: 55ccf6f08 1 6
 (latch info) wait_event=0 bits=80
 holding    (efd=17) 601091f8 Child shared pool level=7 child#=2
Location from where latch is held: kgh.h LINE:6463 ID:kghalo:
Context saved from call: 0
 state=busy [holder orapid=146] wlstate=free [value=0]
 waiters [orapid (seconds since: put on list, posted, alive check)]:
 205 (3715, 1405341655, 3654)
 30 (3714, 1405341655, 3535)
 208 (3678, 1405341655, 406)
 66 (3655, 1405341655, 3655)
 104 (3628, 1405341655, 3628)
 16 (3421, 1405341655, 3421)
 96 (3374, 1405341655, 3374)
 58 (2155, 1405341655, 2155)
 65 (1855, 1405341655, 1855)
 64 (956, 1405341655, 956)
 69 (551, 1405341655, 551)
 waiter count=11
 Process Group: DEFAULT, pseudo proc: 0x540f7d060
 O/S info: user: grid, term: UNKNOWN, ospid: 13135
OSD pid info: Unix process pid: 13135, image: oracle@itvxxxx1
 。。。。。。。。。
 。。。。。。。。。
 SO: 0x55d09c5e0, type: 4, owner: 0x55cd17458, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
 proc=0x55cd17458, name=session, file=ksu.h LINE:12624, pg=0
 (session) sid: 385 ser: 3787 trans: (nil), creator: 0x55cd17458
 flags: (0x45) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
 flags2: (0x40009) -/-/INC
 DID: , short-term DID:
txn branch: (nil)
 oct: 3, prv: 0, sql: 0x54f4c5250, psql: 0x4bc1de318, user: 34/ITVMANAGER
 ksuxds FALSE at location: 0
 service name: xxxx
 client details:
 O/S info: user: root, term: unknown, ospid: 1234
 machine: localhost.localdomain program: JDBC Thin Client
 application name: JDBC Thin Client, hash value=2546894660
 Current Wait Stack:
 0: waiting for 'library cache: mutex X'
 idn=0x1a43f94e, value=0x76b00000000, where=0x4f
 wait_id=7745 seq_num=7746 snap_id=1
 wait times: snap=1 min 38 sec, exc=1 min 38 sec, total=1 min 38 sec
 wait times: max=infinite, heur=61 min 56 sec
 wait counts: calls=8957 os=8957
 in_wait=1 iflags=0x15b2
 There is at least one session blocking this session.
 Dumping 1 direct blocker(s):
 inst: 1, sid: 1899, ser: 35707
 Dumping final blocker:
 inst: 1, sid: 2654, ser: 39173

可以看到该进程正在等待library cache: mutex X,而从最后的几行信息可以看到提示，fina blocker为：
inst: 1, sid: 2654, ser: 39173 （注意：11gR2之前这个信息是没有的，可见11gR2比较强大了）

既然这里提到最终的blocker 会话是2654，那么我们搜索关键字：sid: 2654 找到如下信息：

SO: 0x5417cacb0, type: 4, owner: 0x55cd163a8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
 proc=0x55cd163a8, name=session, file=ksu.h LINE:12624, pg=0
 (session) sid: 2654 ser: 39173 trans: (nil), creator: 0x55cd163a8
 flags: (0x45) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
 flags2: (0x9) -/-/INC
 DID: , short-term DID:
txn branch: (nil)
 oct: 3, prv: 0, sql: 0x34f05aa78, psql: 0x29ac2e878, user: 34/xxxxxx

 ksuxds FALSE at location: 0
 service name: xxxx
 client details:
 O/S info: user: root, term: unknown, ospid: 1234
 machine: localhost.localdomain program: JDBC Thin Client
 application name: JDBC Thin Client, hash value=2546894660
 Current Wait Stack:
 0: waiting for 'latch: shared pool'
 address=0x60109298, number=0x133, tries=0x0
 wait_id=1048 seq_num=1049 snap_id=1
 wait times: snap=42 min 12 sec, exc=42 min 12 sec, total=42 min 12 sec
 wait times: max=infinite, heur=42 min 12 sec
 wait counts: calls=0 os=0
 in_wait=1 iflags=0x2520
 There is at least one session blocking this session.
 Dumping 1 direct blocker(s):
 inst: 1, sid: 1716, ser: 19027
 Dumping final blocker:
 inst: 1, sid: 1716, ser: 19027
 There are 57 sessions blocked by this session.

这里又提示该进程的最终blocker为 sid 1716。而2654 会话等待等待latch: shared pool。下面继续搜索sid: 1716

SO: 0x541505e80, type: 4, owner: 0x540d8abf0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
 proc=0x540d8abf0, name=session, file=ksu.h LINE:12624, pg=0
 (session) sid: 1716 ser: 19027 trans: (nil), creator: 0x540d8abf0
 flags: (0x45) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
 flags2: (0x40009) -/-/INC
 DID: , short-term DID:
txn branch: (nil)
 oct: 3, prv: 0, sql: 0x55ba986f0, psql: 0x34ea4cf40, user: 34/xxxxxxxx
 ksuxds FALSE at location: 0
 service name: xxxx
 client details:
 O/S info: user: root, term: unknown, ospid: 1234
 machine: localhost.localdomain program: JDBC Thin Client
 application name: JDBC Thin Client, hash value=2546894660
 Current Wait Stack:
 0: waiting for 'library cache: mutex X'
 idn=0x1a43f94e, value=0x76b00000000, where=0x4f
 wait_id=32063 seq_num=32067 snap_id=1
 wait times: snap=5 min 22 sec, exc=5 min 22 sec, total=5 min 22 sec
 wait times: max=infinite, heur=41 min 33 sec
 wait counts: calls=29210 os=29210
 in_wait=1 iflags=0x15b2
 There is at least one session blocking this session.
 Dumping 1 direct blocker(s):
 inst: 1, sid: 1899, ser: 35707
 Dumping final blocker:
 inst: 1, sid: 1899, ser: 35707
 There are 56 sessions blocked by this session.
 Dumping one waiter:
 inst: 1, sid: 2654, ser: 39173
 wait event: 'latch: shared pool'
 p1: 'address'=0x60109298
 p2: 'number'=0x133
 p3: 'tries'=0x0
 row_wait_obj#: 19692, block#: 3852, row#: 0, file# 50
 min_blocked_time: 2483 secs, waiter_cache_ver: 11844
 Wait State:
 fixed_waits=0 flags=0x22 boundary=(nil)/-1

可以看到sid 1716 在等待library cache: mutex X，而该会话的最终blocker为 sid：1899，继续搜索sid：1899

SO: 0x54553f600, type: 4, owner: 0x55cd152f8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
 proc=0x55cd152f8, name=session, file=ksu.h LINE:12624, pg=0
 (session) sid: 1899 ser: 35707 trans: (nil), creator: 0x55cd152f8
 flags: (0x45) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
 flags2: (0x40009) -/-/INC
 DID: , short-term DID:
txn branch: (nil)
 oct: 3, prv: 0, sql: 0x555f6e3b0, psql: 0x55a0664c0, user: 34/xxxxxxx
 ksuxds FALSE at location: 0
 service name: xxxx
 client details:
 O/S info: user: root, term: unknown, ospid: 1234
 machine: localhost.localdomain program: JDBC Thin Client
 application name: JDBC Thin Client, hash value=2546894660
 Current Wait Stack:
 0: waiting for 'library cache: mutex X'
 idn=0x1d694212, value=0xa5e00000000, where=0x7f
 wait_id=928 seq_num=929 snap_id=1
 wait times: snap=6 min 0 sec, exc=6 min 0 sec, total=6 min 0 sec
 wait times: max=infinite, heur=42 min 12 sec
 wait counts: calls=32681 os=32681
 in_wait=1 iflags=0x15b2
 There is at least one session blocking this session.
 Dumping 1 direct blocker(s):
 inst: 1, sid: 2654, ser: 39173
 Dumping final blocker:
 inst: 1, sid: 2654, ser: 39173
 There are 57 sessions blocked by this session.

我们可以清楚的看到，最后又回到了sid:2654. 感觉饶了一圈又回来了。如果你感觉这样分析有点费劲，那么可以通过
Oracle systemstate dump的格式化脚本将前面的systemstate dump的trace格式化，格式化之后，可以你会发现如下的信息：

Starting Systemstate 1
..............................................................................
..................
.............................................................
...............................................................................
...............................................................................
...
Ass.Awk Version 1.0.39
~~~~~~~~~~~~~~~~~~~~~~
Source file : /soft/xxxx1_ora_23533.trc

System State 1  (2014-07-14 20:09:30.961)
~~~~~~~~~~~~~~   ~~~~~~~~~~~~~~~~~~~~~~~
1:                                      [DEAD]
2:  waiting for 'latch: shared pool'   [Latch 601091f8]
3:  waiting for 'rdbms ipc message'
4:  waiting for 'VKTM Logical Idle Wait'
。。。。。。。
。。。。。。。

Resource Holder State
 Latch 601091f8   146: 146: is waiting for 138:
 Latch 60109298   105: 105: is waiting for 138:
 PIN: Handle=0x55b1664c8    ??? Blocker
 Latch 60109158    97: 97: is waiting for 138:
 Enq PV-00000000-00000000    16: 16: is waiting for 146:
 Latch 53330ec30    54: 54: is waiting for Rcache object=0x491dfa5e0
 Rcache object=0x491dfa5e0    ??? Blocker
 Mutex 4f3f1fc3    97: 97: is waiting for 138:
 Mutex 1a43f94e   138: 138: is waiting for 142:
 LOCK: Handle=0x54f06d310   138: 138: is waiting for 142:
 Mutex 1d694212   142: 142: is waiting for 105:
 Mutex b182f101   103: 103: is waiting for 105:
 Latch 6000a018   208: 208: is waiting for 146:
 Latch 60010f38   226: 226: is waiting for 97:

我们把上面的信息简单整理一下，如下：

146(library cache: mutex X) 等待 138 (library cache: mutex X)
138(library cache: mutex X) 等待 142 (latch: shared pool)
142(latch: shared pool) 等待 105 (library cache: mutex X)
105(library cache: mutex X) 等待 138 (library cache: mutex X)

我们可以看出，很明显Oracle在这里形成了死锁，即library cache: mutex X 和 latch：shared pool构成了死锁.

当然这里如果是要解决问题，那么简单，将138,142,105 3个会话kill掉即可。

凭直觉我们可能就知道，这样的问题很可能是Oracle bug导致，注意朋友这里的环境已经是11.2.0.3.9了。

搜索了MOS，我认为90%的情况是这个Bug 17588480 – library cache mutex/shared pool latch deadlock (文档 ID 17588480.8)

可惜的是朋友这里没有做errorstack，无法准确定位到是不是这个bug，较为遗憾。

oracle

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者

Oracle 11.2.0.3.9 RAC is hang (for Linux)

问题描述

专家解答

评论