
Analyzing Database Blocking with hanganalyze

dm5250 2025-06-06

    I happened to be reviewing file-header-related material recently and took the chance to sort out the dump-related topics as well. This post records the process of analyzing a blocked session with hanganalyze and dump systemstate.

1. Database blocking

Open two sessions to simulate database blocking. (Strictly speaking this produces a blocking lock wait rather than a true deadlock, which Oracle would detect and break automatically with ORA-00060, but the hang analysis workflow below applies to both.)

--session 1: update the row with object_id=101 and do not commit
SQL> show user;
USER is "DC"
SQL> select count(*) from test;

  COUNT(*)
----------
     86307


SQL> select OBJECT_NAME from test where object_id=101;

OBJECT_NAME
--------------------------------------------------------------------------------
FIXED_OBJ$

SQL> update /* session 1 */ test set object_name='dc' where object_id=101;

1 row updated.

SQL> 



--session 2: update the same row (object_id=101) and do not commit
[oracle@db01 ~]$ dc

SQL*Plus: Release 11.2.0.4.0 Production on Fri Jun 6 14:24:47 2025

Copyright (c) 1982, 2013, Oracle.  All rights reserved.


Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> update /* session 2 */ test set object_name='hp' where object_id=101;

 As you can see, session 2's update hangs because session 1 has not committed. If enough connections queue up behind the same lock, the database as a whole can become effectively unusable.


2. hanganalyze dump

To analyze why the database is currently hanging, take a hanganalyze dump. If a normal login is no longer possible at this point, connect with "sqlplus -prelim / as sysdba" instead.

--hanganalyze levels
 1-2   Only HANGANALYZE output, no process dump at all
 3     Level 2 + Dump only processes thought to be in a hang (IN_HANG state)
 4     Level 3 + Dump leaf nodes (blockers) in wait chains (LEAF, LEAF_NW, IGN_DMP states)
 5     Level 4 + Dump all processes involved in wait chains (NLEAF state)
 10    Dump all processes (IGN state)

SQL> oradebug setmypid
Statement processed.
SQL> oradebug hanganalyze 3
Hang Analysis in /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_30882.trc
SQL> exit

--contents of the dump trace
[oracle@db01 ~]$ more /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_30882.trc
*** 2025-06-06 15:03:44.078
Processing Oradebug command 'hanganalyze 3'

*** 2025-06-06 15:03:44.131
===============================================================================
HANG ANALYSIS:
  instances (db_name.oracle_sid): orcl.orcl
  oradebug_node_dump_level: 3
  analysis initiated by oradebug
  os thread scheduling delay history: (sampling every 1.000000 secs)
    0.000000 secs at [ 15:03:43 ]
      NOTE: scheduling delay has not been sampled for 0.587238 secs
    0.000000 secs from [ 15:03:39 - 15:03:44 ], 5 sec avg
    0.000000 secs from [ 15:02:44 - 15:03:44 ], 1 min avg
    0.000000 secs from [ 14:58:44 - 15:03:44 ], 5 min avg
  vktm time drift history
    09:52:22.682 HR fwd drift by 1075918 us at 1749174742681732
    09:52:26.334 HR fwd drift by 3653060 us at 1749174746334792
    09:52:26.334 LR time stalled at 1749174746
    09:52:27.335 LR stall ended at 1749174747 drift 1 us
===============================================================================
 
Chains most likely to have caused the hang:
 [a] Chain 1 Signature: 'SQL*Net message from client'<='enq: TX - row lock contention'
     Chain 1 Signature Hash: 0x38c48850
 
===============================================================================
Non-intersecting chains:
 
-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 1 (orcl.orcl)
                   os id: 28050
              process id: 20, oracle@db01 (TNS V1-V3)
              session id: 12
        session serial #: 827
    }
    is waiting for 'enq: TX - row lock contention' with wait info:
    {
                      p1: 'name|mode'=0x54580006
                      p2: 'usn<<16 | slot'=0x2000e
                      p3: 'sequence'=0x81d
            time in wait: 20.313051 sec
           timeout after: never
                 wait id: 122
                blocking: 0 sessions
             current sql: update /* session 2 */ test set object_name='hp' where object_id=101
             short stack: ksedsts()+465<-ksdxfstk()+32<-ksdxcb()+1927<-sspuser()+112<-__sighandler()<-semtimedop()+10<-skgpwwait()+160<-ksliwat()+2022<-kslwaitctx()+163<-ksqcmi()+2848<-ksqgtlctx()+3501<-ksqgelctx()+557<-ktuGetTxForXid()+131<-ktcwit1()+336<-kdddgb()+8364<-kdusru()+461<-updrowFastPath()+1209<-qerupFetch()+2366<-updaul()+1321<-updThreePhaseExe()+318<-updexe()+418<-opiexe()+10378<-kpoal8()+2118<-opiodr()+917<-ttcpip()+2183<-opitsk()+1710<-opiino()+969<-opiodr()+917<-opidrv()+570<-sou2o()+103<-opimai_real()+133<-ssthrdmain()+26
            wait history:
              * time between current wait and wait #1: 0.000685 sec
              1.       event: 'SQL*Net message from client'
                 time waited: 1 min 9 sec
                     wait id: 121             p1: 'driver id'=0x62657100
                                              p2: '#bytes'=0x1
              * time between wait #1 and #2: 0.000004 sec
              2.       event: 'SQL*Net message to client'
                 time waited: 0.000001 sec
                     wait id: 120             p1: 'driver id'=0x62657100
                                              p2: '#bytes'=0x1
              * time between wait #2 and #3: 0.000021 sec
              3.       event: 'log file sync'
                 time waited: 0.000372 sec
                     wait id: 119             p1: 'buffer#'=0x75c
                                              p2: 'sync scn'=0x381858
    }
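The p1/p2/p3 values in the wait info above are bit-packed, as their labels ('name|mode', 'usn<<16 | slot') hint. Here is a minimal Python sketch of the decoding; decode_tx_wait is an illustrative helper name of my own, not an Oracle API:

```python
# Decode the p1/p2/p3 parameters of an 'enq: TX - row lock contention' wait.
# decode_tx_wait is an illustrative helper name, not an Oracle API.

def decode_tx_wait(p1: int, p2: int, p3: int):
    # p1 'name|mode': two ASCII bytes of the enqueue type in the high 16 bits,
    # requested lock mode in the low 16 bits (6 = exclusive).
    name = bytes([(p1 >> 24) & 0xFF, (p1 >> 16) & 0xFF]).decode("ascii")
    mode = p1 & 0xFFFF
    # p2 'usn<<16 | slot': undo segment number and transaction table slot.
    usn, slot = p2 >> 16, p2 & 0xFFFF
    # p3 'sequence': wrap sequence of that transaction slot.
    return name, mode, usn, slot, p3

# Values from the trace above: p1=0x54580006, p2=0x2000e, p3=0x81d
print(decode_tx_wait(0x54580006, 0x2000E, 0x81D))
# -> ('TX', 6, 2, 14, 2077): a TX enqueue requested in mode 6 (exclusive),
# i.e. the session is waiting on the transaction with XID 2.14.2077.
```

The decoded usn/slot/sequence values can then be matched against v$transaction (xidusn, xidslot, xidsqn) to find the blocking transaction.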
    and is blocked by
 => Oracle session identified by:
    {
                instance: 1 (orcl.orcl)
                   os id: 28307
              process id: 21, oracle@db01 (TNS V1-V3)
              session id: 762
        session serial #: 11101
    }
    which is waiting for 'SQL*Net message from client' with wait info:
    {
                      p1: 'driver id'=0x62657100
                      p2: '#bytes'=0x1
            time in wait: 35.218218 sec
           timeout after: never
                 wait id: 42
                blocking: 1 session
             current sql: <none>
             short stack: ksedsts()+465<-ksdxfstk()+32<-ksdxcb()+1927<-sspuser()+112<-__sighandler()<-read()+14<-ntpfprd()+117<-nsbasic_brc()+376<-nsbrecv()+69<-nioqrc()+495<-opikndf2()+978<-opitsk()+831<-opiino()+969<-opiodr()+917<-opidrv()+570<-sou2o()+103<-opimai_real()+133<-ssthrdmain()+265<-main()+201<-__libc_start_main()+245
            wait history:
              * time between current wait and wait #1: 0.000004 sec
              1.       event: 'SQL*Net message to client'
                 time waited: 0.000001 sec
                     wait id: 41              p1: 'driver id'=0x62657100
                                              p2: '#bytes'=0x1
              * time between wait #1 and #2: 0.004361 sec
              2.       event: 'SQL*Net message from client'
                 time waited: 49.733491 sec
                     wait id: 40              p1: 'driver id'=0x62657100
                                              p2: '#bytes'=0x1
              * time between wait #2 and #3: 0.000004 sec
              3.       event: 'SQL*Net message to client'
                 time waited: 0.000001 sec
                     wait id: 39              p1: 'driver id'=0x62657100
                                              p2: '#bytes'=0x1
    }
 
Chain 1 Signature: 'SQL*Net message from client'<='enq: TX - row lock contention'
Chain 1 Signature Hash: 0x38c48850
-------------------------------------------------------------------------------
 
===============================================================================
Extra information that will be dumped at higher levels:
[level  4] :   1 node dumps -- [LEAF] [LEAF_NW] 
[level  5] :   1 node dumps -- [NO_WAIT] [INVOL_WT] [SINGLE_NODE] [NLEAF] [SINGLE_NODE_NW] 
 
State of ALL nodes
([nodenum]/cnode/sid/sess_srno/session/ospid/state/[adjlist]):
[11]/1/12/827/0xf2c6e940/28050/NLEAF/[761]
[761]/1/762/11101/0xf1653540/28307/LEAF/
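The "State of ALL nodes" section is also easy to read programmatically: each line follows the [nodenum]/cnode/sid/sess_srno/session/ospid/state/[adjlist] layout given in its header, and a LEAF node with an empty adjacency list is an ultimate blocker. A rough parsing sketch, with field names of my own choosing:

```python
# Parse 'State of ALL nodes' lines from a hanganalyze trace and pick out
# LEAF nodes (ultimate blockers). Layout per the trace header:
# [nodenum]/cnode/sid/sess_srno/session/ospid/state/[adjlist]

def parse_nodes(lines):
    nodes = []
    for line in lines:
        f = line.strip().split("/")
        adj = f[7].strip("[]")
        nodes.append({
            "node": int(f[0].strip("[]")),
            "sid": int(f[2]),
            "serial": int(f[3]),
            "ospid": f[5],
            "state": f[6],
            "blocked_by": int(adj) if adj else None,  # None => blocks others, waits on no one
        })
    return nodes

# The two node lines from the trace above.
trace = [
    "[11]/1/12/827/0xf2c6e940/28050/NLEAF/[761]",
    "[761]/1/762/11101/0xf1653540/28307/LEAF/",
]
blockers = [n for n in parse_nodes(trace) if n["state"] == "LEAF"]
print(blockers)  # the LEAF node: sid 762, serial# 11101, the root blocker
```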

Analyzing the trace shows that the root of the blocking chain is the session with session id 762, serial # 11101. After killing that session, session 2 is no longer hung.
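Once the blocker's sid and serial# are known, the kill statement follows mechanically. A sketch that composes it (on RAC you would use the 'sid,serial#,@inst_id' form to target a specific instance):

```python
# Compose the ALTER SYSTEM KILL SESSION statement for the root blocker
# identified in the trace (sid 762, serial# 11101).

def kill_session_sql(sid: int, serial: int, immediate: bool = True) -> str:
    stmt = f"alter system kill session '{sid},{serial}'"
    return stmt + " immediate" if immediate else stmt

print(kill_session_sql(762, 11101))
# -> alter system kill session '762,11101' immediate
```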


3. systemstate analysis

Now let's look at analyzing the same hang with a systemstate dump.

SQL> oradebug setmypid
Statement processed.
SQL> oradebug dump systemstate 10
Statement processed.
SQL> oradebug tracefile_name
/u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_32013.trc

--format the trace with the ass109.awk script
[oracle@db01 ~]$ awk -f ass109.awk /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_32013.trc

Starting Systemstate 1
.................................
Ass.Awk Version 1.0.9 - Processing /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_32013.trc

System State 1
~~~~~~~~~~~~~~~~
1:                                      
2:  0: waiting for 'pmon timer'         
3:  0: waiting for 'rdbms ipc message'  
4:  0: waiting for 'VKTM Logical Idle Wait' 
5:  0: waiting for 'rdbms ipc message'  
6:  0: waiting for 'DIAG idle wait'     
7:  0: waiting for 'rdbms ipc message'  
8:  0: waiting for 'DIAG idle wait'     
9:  0: waiting for 'rdbms ipc message'  
10: 0: waiting for 'rdbms ipc message'  
11: 0: waiting for 'rdbms ipc message'  
12: 0: waiting for 'rdbms ipc message'  
13: 0: waiting for 'smon timer'         
14: 0: waiting for 'rdbms ipc message'  
15: 0: waiting for 'rdbms ipc message'  
16: 0: waiting for 'rdbms ipc message'  
17:                                     
18:                                     
19: 0: waiting for 'rdbms ipc message'  
20: 0: waiting for 'SQL*Net message from client' 
21: 0: waiting for 'SQL*Net message from client' 
22: 0: waiting for 'Space Manager: slave idle wait' 
23: 0: waiting for 'rdbms ipc message'  
24: 0: waiting for 'rdbms ipc message'  
25: 0: waiting for 'rdbms ipc message'  
26: 0: waiting for 'rdbms ipc message'  
27: 0: waiting for 'Streams AQ: qmn coordinator idle wait' 
28: 0: waiting for 'rdbms ipc message'  
29: 0: waiting for 'enq: TX - row lock contention'[Enqueue TX-00050005-0000081E] 
30: 0: waiting for 'SQL*Net message from client' 
31:                                     
38: 0: waiting for 'Streams AQ: qmn slave idle wait' 
39: 0: waiting for 'Streams AQ: waiting for time management or cleanup tasks' 
Blockers
~~~~~~~~

        Above is a list of all the processes. If they are waiting for a resource
        then it will be given in square brackets. Below is a summary of the
        waited upon resources, together with the holder of that resource.
        Notes:
        ~~~~~
         o A process id of '???' implies that the holder was not found in the
           systemstate.

                    Resource Holder State
Enqueue TX-00050005-0000081E    20: 0: waiting for 'SQL*Net message from client'

Object Names
~~~~~~~~~~~~
Enqueue TX-00050005-0000081E                                  


20142 Lines Processed.

In the formatted output, pid 29 is requesting Enqueue TX-00050005-0000081E and is blocked by pid 20. Map that pid back to a session:
SQL> select s.sid,s.serial# from v$session s,v$process p where s.PADDR=p.ADDR and p.pid=20;

       SID    SERIAL#
---------- ----------
        12        831
These were freshly reconnected sessions for this simulation, so the sid/serial# differ from the hanganalyze section above. After killing the root blocker (or once session 1 commits), session 2 returns to normal.
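Note that the resource name TX-00050005-0000081E encodes the same XID components seen in the hanganalyze p2/p3 values: the first hex group is usn<<16 | slot and the second is the sequence. A small decoding sketch (the helper name is my own):

```python
# Decode a systemstate TX enqueue resource name such as
# 'TX-00050005-0000081E' into (type, usn, slot, sequence).

def decode_tx_resource(name: str):
    typ, id1, id2 = name.split("-")
    id1, id2 = int(id1, 16), int(id2, 16)
    # id1 packs usn<<16 | slot; id2 is the transaction slot sequence.
    return typ, id1 >> 16, id1 & 0xFFFF, id2

print(decode_tx_resource("TX-00050005-0000081E"))
# -> ('TX', 5, 5, 2078): undo segment 5, slot 5, sequence 2078
```

The decoded fields line up with v$lock (id1/id2 of the TX enqueue) and v$transaction (xidusn, xidslot, xidsqn) for cross-checking.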
Originally published by the author on 墨天轮 (modb.pro).