
Case Study: Oracle RAC Instance Startup Hangs

Original article by liaju, 2021-02-02

APPLIES TO:

Oracle Database - Enterprise Edition - Version 10.1.0.2 to 10.2.0.5 [Release 10.1 to 10.2]
Information in this document applies to any platform.

PURPOSE

This article troubleshoots an Oracle RAC instance startup issue through a case study. The case study walks through the troubleshooting steps in detail and shows how the solution was eventually reached.

TROUBLESHOOTING STEPS


Reported Issue


It is a 2-node RAC. The instance previously crashed, and attempts to start it up simply hang. The alert log of the instance shows that the MOUNT completed successfully, but the 'alter database open' never completes: there are no further entries in the alert log indicating a successful database open.

The other instance of the RAC is just running fine.


Troubleshooting

Step 1: Gather hanganalyze and systemstate dump

A hang analysis and a system state dump were taken on the hanging instance by logging in as SYSDBA with a preliminary connection. The following commands were used.

sqlplus -prelim / as sysdba

SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug -g all hanganalyze 3
SQL> oradebug -g all dump systemstate 266

<<<<< Waited some time before proceeding to take another set.

SQL> oradebug -g all hanganalyze 3
SQL> oradebug -g all dump systemstate 266


Because the hanganalyze was invoked from the second instance, that instance's DIAG trace contains both snapshots of the hanganalyze output.

Each instance's systemstate dump goes into its own instance's DIAG trace file.


Step 2: Hang Analysis Dump

The hanganalyze output in the second instance's DIAG trace is analyzed.

*** 2011-01-27 23:38:34.405
==============
HANG ANALYSIS: level 3 syslevel 1
==============
Open chains found:
Chain 1 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
<0/349/1390/No Wait> -- <0/237/63068/enq: JI - contention>
Chain 2 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
<0/420/5232/SQL*Net message from client>
-- <1/652/5/0x390047a0/17986/enq: TT - contention>
-- <0/491/1/enq: TT - contention>

Chain 3 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
<0/548/7793/No Wait> -- <0/31/15037/latch: cache buffers chains>
. . .
. . .

*** 2011-01-27 23:43:17.061
==============
HANG ANALYSIS: level 3 syslevel 1
==============
Open chains found:
Chain 1 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
<0/420/5232/SQL*Net message from client>
-- <1/652/5/0x390047a0/17986/enq: TT - contention>
-- <0/491/1/enq: TT - contention>



Comparing both HANG ANALYSIS entries in the DIAG trace, it is clear that the chain involving sid 652 (Chain 2 in the first snapshot, Chain 1 in the second) remains constant.

<cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :   <<< Column description for the entries below
<0/420/5232/SQL*Net message from client>            <<< Holder session
-- <1/652/5/0x390047a0/17986/enq: TT - contention>  <<< Waiter session
-- <0/491/1/enq: TT - contention>                   <<< Waiter session

It shows that:
On instance 0 (cnode), the first instance, sid 420 is waiting on 'SQL*Net message from client', which means the session is idle, just waiting for user input.

On instance 1, the second instance, which hangs during startup, sid 652 is waiting on 'enq: TT - contention'. This session in turn waits for sid 420 on instance 0. The output also shows that sid 491 is queued behind it.

With the hanganalyze output, we now know that we need to concentrate on only two sessions: sid 652 on instance 1, which is waiting for sid 420 on instance 0.

Next we proceed to analyze the system state dumps to get more details about these two sessions.
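As a side note, the slash-delimited chain entries above can be picked apart mechanically. A minimal sketch in Python (the field names follow the column description in the trace; the helper name and parsing logic are illustrative, not an Oracle tool):

```python
# Parse a hanganalyze chain entry such as
# <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> into its fields.
# Entries may omit proc_ptr/ospid, so only the first three fields
# and the trailing wait event are treated as fixed positions.

def parse_chain_entry(entry: str) -> dict:
    fields = entry.strip().strip("<>").split("/")
    return {
        "cnode": int(fields[0]),    # instance number (0-based)
        "sid": int(fields[1]),      # session id
        "sess_srno": fields[2],     # session serial#
        "wait_event": fields[-1],   # last field is the wait event
    }

waiter = parse_chain_entry("<1/652/5/0x390047a0/17986/enq: TT - contention>")
holder = parse_chain_entry("<0/420/5232/SQL*Net message from client>")
print(waiter["sid"], waiter["wait_event"])  # 652 enq: TT - contention
print(holder["cnode"], holder["sid"])       # 0 420
```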



Step 3: Waiter Session on Hanging Instance

Searching for the string 'sid: 652' in instance 1's (the second instance's) DIAG trace shows this session's state object details, as shown below.

Session State Object:

SO: 5391774e8, type: 4, owner: 5390047a0, flag: INIT/-/-/0x00
(session) sid: 652 trans: 53a4fc170, creator: 5390047a0, flag: (80000041) USR/- BSY/-/-/-/-/-
. . .
waiting for 'enq: TT - contention' blocking sess=0x0 seq=700 wait_time=0 seconds since wait started=1041
name|mode=54540006, tablespace ID=0, operation=0
Dumping Session Wait History
for 'enq: TT - contention' count=1 wait_time=57165
name|mode=54540006, tablespace ID=0, operation=0
for 'enq: TT - contention' count=1 wait_time=497859
name|mode=54540006, tablespace ID=0, operation=0
for 'enq: TT - contention' count=1 wait_time=498081
name|mode=54540006, tablespace ID=0, operation=0



name|mode=54540006: the name takes up the first four hex digits and the mode takes up the remaining four.

name=5454: 0x54 is the ASCII code for the letter 'T', hence the name is actually TT. TT is the tablespace enqueue, which serializes DDL operations against a tablespace.

mode=0006, i.e. 6. Mode 6 is the Exclusive lock mode.

tablespace ID=0: 'select ts#, name from v$tablespace' shows that ts# 0 is the SYSTEM tablespace.

In Summary, this session is waiting for Exclusive Lock on SYSTEM tablespace.
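The decoding above can be expressed mechanically. A minimal Python illustration (the helper name is made up for this note):

```python
def decode_enqueue(name_mode: str) -> tuple:
    """Decode an 8-hex-digit 'name|mode' value from a systemstate dump.

    The first four hex digits are two ASCII bytes giving the enqueue
    name; the last four are the requested/held lock mode.
    """
    name_hex, mode_hex = name_mode[:4], name_mode[4:]
    name = bytes.fromhex(name_hex).decode("ascii")  # 0x54 0x54 -> "TT"
    mode = int(mode_hex, 16)                        # 0006 -> 6
    return name, mode

print(decode_enqueue("54540006"))  # ('TT', 6): TT enqueue, mode 6 (Exclusive)
```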


There are two more places in the systemstate dumps where the enqueue information can be verified: the enqueue global resource dump and the enqueue state object.

Enqueue Global Resource Dump:

----------resource 0x4b62f5648----------------------
resname : [0x0][0x0],[TT]
Local node : 1   <<< This node is instance 1
dir_node : 0
master_node : 0  <<< Master node is instance 0, the other instance.
. . .
GRANTED_Q :
CONVERT_Q:
lp 5392f3ff8 gl KJUSERNL rl KJUSEREX rp 4b62f5648 [0x0][0x0],[TT]
master 0 gl owner 5391774e8 possible pid 17986 xid 2001-001F-0000001B bast 0 rseq 6 mseq 0 history 0x49a5149a
convert opt KJUSERGETVALUE KJUSERNODEADLOCKWAIT


It shows that the resource '[0x0][0x0],[TT]' is actually mastered on instance 0 (master_node : 0), and this session is in the CONVERT_Q waiting to obtain an Exclusive lock (KJUSEREX). We need to look at the resource dump on the master node to get the complete holder and waiter session details.



Enqueue State Object:

SO: 53918e230, type: 5, owner: 53918c8b8, flag: INIT/-/-/0x00
(enqueue) TT-00000000-00000000 DID: 0002-001F-0000001B
lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x3
res: 0x4cb28d418, req: X, lock_flag: 0x0
own: 0x5391774e8, sess: 0x5391774e8, proc: 0x5390047a0, prv: 0x4cb28d438
slk: 0x5392f3ff8


The enqueue state object also shows that this session is waiting to get an Exclusive lock (req: X) on the resource TT-00000000-00000000.

These state object/resource dumps differ only in how they represent the same resource:

The Session State Object shows it as 'name|mode=54540006, tablespace ID=0, operation=0'
The Global Resource Dump shows it as '[0x0][0x0],[TT]'
The Enqueue State Object shows it as 'TT-00000000-00000000'
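For illustration, all three spellings of the same resource can be generated from the enqueue name and the two lock identifiers (id1 and id2, both 0 here). A sketch (the helper name is made up; the formats are simply read off the trace excerpts above):

```python
def resource_representations(name: str, id1: int, id2: int) -> dict:
    """Render one enqueue resource in the three formats seen in the dumps."""
    return {
        # Global resource dump: [0xid1][0xid2],[NAME]
        "global_resource": f"[{id1:#x}][{id2:#x}],[{name}]",
        # Enqueue state object: NAME-id1-id2, ids as 8-digit hex
        "state_object": f"{name}-{id1:08X}-{id2:08X}",
        # Session wait line packs the name's ASCII bytes with the mode;
        # mode 6 (Exclusive) is what the waiter requested in this case
        "name|mode (mode 6)": f"{name.encode('ascii').hex()}{6:04x}",
    }

print(resource_representations("TT", 0, 0))
```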


Step 4: Blocker Session on Running Instance

Search for the string 'sid: 420' in the first instance's DIAG trace to get the session state object details.

Session State Object:

SO: 52f0bcb70, type: 4, owner: 52f00fd78, flag: INIT/-/-/0x00
(session) sid: 420 trans: 52fca9d68, creator: 52f00fd78, flag: (8000041) USR/- -/-/-/-/-/-
. . .
waiting for 'SQL*Net message from client' blocking sess=0x0 seq=61945 wait_time=0 seconds since wait started=28834
driver id=54435000, #bytes=1, =0
Dumping Session Wait History
for 'SQL*Net more data to client' count=1 wait_time=32
driver id=54435000, #bytes=7fc, =0
for 'SQL*Net more data to client' count=1 wait_time=27
driver id=54435000, #bytes=7d9, =0


This shows that the session is waiting on 'SQL*Net message from client', which is one of the idle wait events. In other words, this session is just waiting for user input and is not doing any work in the database.

However, from the hanganalyze output we know that instance 1's sid 652 is waiting for an Exclusive TT lock that depends on this session. Hence, this session must be holding the TT lock on the SYSTEM tablespace in an incompatible mode. Searching further down for the Global Resource Dump and the Enqueue State Object shows:


Global Resource Dump:

----------resource 0x48a77c228----------------------
resname : [0x0][0x0],[TT]
Local node : 0      <<< This node is instance 0
dir_node : 0
master_node : 0     <<< master node is instance 0, this is the master node.


GRANTED_Q :
lp 52f237628 gl KJUSERPR rp 48a77c228 [0x0][0x0],[TT]
master 0 gl owner 52f0bcb70 possible pid 4110 xid 1008-0084-000E16DC bast 0 rseq 8445 mseq 0 history 0x495149a5
open opt KJUSERDEADLOCK KJUSERIDLK
CONVERT_Q:
lp 571d845f8 gl KJUSERNL rl KJUSEREX rp 48a77c228 [0x0][0x0],[TT]
master 0 owner 1 bast 1 rseq 6 mseq 0x1 history 0x77d497ad
convert opt KJUSERGETVALUE
lp 52f2c75f8 gl KJUSERNL rl KJUSERPR rp 48a77c228 [0x0][0x0],[TT]
master 0 gl owner 52f11bb48 possible pid 7615 xid 1001-0018-00000003 bast 0 rseq 8445 mseq 0 history 0x1495149a
convert opt KJUSERGETVALUE KJUSERNODEADLOCKWAIT


This shows that the session is holding the resource [0x0][0x0],[TT] in Protected Read (KJUSERPR) mode, as it appears under GRANTED_Q. Since this is the master node for the resource (master_node : 0), the dump shows all granted (holder) sessions and all waiter sessions.


Enqueue State Object:

SO: 52f132038, type: 5, owner: 52f0bcb70, flag: INIT/-/-/0x00
(enqueue) TT-00000000-00000000 DID: 0001-0084-000E16DC
lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x3
res: 0x4b91e0050, mode: S, lock_flag: 0x0
own: 0x52f0bcb70, sess: 0x52f0bcb70, proc: 0x52f00fd78, prv: 0x4b91e0060
slk: 0x52f237628


It shows that this session is holding a Shared lock (mode: S) on the resource TT-00000000-00000000.

Because this session holds a Shared lock on the resource, any session requesting an Exclusive lock must wait until the Shared lock is released.
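The blocking follows directly from enqueue mode compatibility: Shared mode is compatible with other shared holders but not with Exclusive. A toy compatibility check (the matrix below covers only the modes seen in this case; it is an illustration, not Oracle's full enqueue compatibility matrix):

```python
# Minimal enqueue-mode compatibility table for the modes seen here:
# NULL (0), shared/protected read (4), exclusive (6).
COMPATIBLE = {
    (0, 0): True, (0, 4): True,  (0, 6): True,
    (4, 0): True, (4, 4): True,  (4, 6): False,
    (6, 0): True, (6, 4): False, (6, 6): False,
}

def must_wait(held_mode: int, requested_mode: int) -> bool:
    """Return True if a request must queue behind the current holder."""
    return not COMPATIBLE[(held_mode, requested_mode)]

# sid 420 holds the TT enqueue in mode 4 (S); sid 652 requests mode 6 (X)
print(must_wait(4, 6))  # True: the startup session queues in CONVERT_Q
print(must_wait(4, 4))  # False: another shared reader would not block
```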



Step 5:  Conclusion

Instance 1's startup hangs because one of its sessions is waiting to obtain an Exclusive lock on the SYSTEM tablespace.
A session on instance 0 is holding a Shared lock on the SYSTEM tablespace, thereby blocking the session that is requesting the Exclusive lock.

Instance startup may require an Exclusive lock on the SYSTEM tablespace, especially after a crash: it is typically acquired after instance recovery, while private and public rollback segments are allocated and initialized for the instance being started, and is held exclusively for the duration of that operation.

The solution here is to kill the holder session, allowing the other instance to obtain the Exclusive lock and complete its startup.



