【经验】Resource Manage导致数据库hang异常分析

原创张维照 2019-12-26

2568

本文节选自《云和恩墨技术通讯》（12月刊）
下载链接：https://www.modb.pro/doc/1593

当数据库出现严重的性能问题hang的时候，我们非常需要通过systemstate dump来知道进程在做什么，在等待什么，谁是资源的持有者，阻塞源等等。在出现上述问题时，及时收集systemstate dump非常有助于问题原因的分析。在此我们分享一个近期的客户案例，供大家参考。

问题描述

2019年11月17日下午15：30左右开始，前端应用出现普通用户连接不上数据库的情况，在数据库主机上使用普通用户登录时出现hang死状态，但使用system用户可以成功登录数据库。DB alert日志中出现大量ORA-12012自动执行对象作业出错，ORA-04021等待锁定对象时发生超时等报错，数据库CKPT进程超时400多分钟。

问题分析

查看当前监听状态正常：

查看数据库活动会话情况：

注：其他类似信息已截断，等待事件主要有library cache lock, reliable message, enq: HW –contention, 并行相关的等待，无block_session信息。

做数据库hanganayly dump 进一步定位故障原因：

SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug hanganalyze 3

上面堵塞链条，显示library cache lock, reliable message, enq: HW –contention,buffer busy waits， latch free 都是被堵塞的级联等待。

-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 2 (xxxx,xxx)
                   os id: 5789
              process id: 256, oracle@db2
              session id: 5
        session serial #: 65411
    }
    is waiting for 'library cache lock' with wait info:
    {
                      p1: 'handle address'=0xa0f5a2338
                      p2: 'lock address'=0x770eea08
                      p3: '100*mode+namespace'=0x7f0002
            time in wait: 11 min 49 sec (last interval)
            time in wait: 13 min 32 sec (total)
           timeout after: 1 min 27 sec
                 wait id: 2
                blocking: 0 sessions
             current sql: <none>
           …….
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 2 (xxxx,xxx)
                   os id: 13392
              process id: 267, oracle@db2
              session id: 1575
        session serial #: 36526
    }
    which is waiting for 'enq: HW - contention' with wait info:
    {
                      p1: 'name|mode'=0x48570006
                      p2: 'table space #'=0x4
                      p3: 'block'=0x13074d0
            time in wait: 431 min 12 sec
           timeout after: never
                 wait id: 4
                blocking: 126 sessions
             current sql: update user$ set spare6=DECODE(to_char(:2, 'YYYY-MM-DD'), '0000-00-00', to_date(NULL), :2) where user#=:1
             short stack: ksedsts()+244<-ksdxfstk()+58<-ksdxcb()+918<-sspuser()+224<-__sighandler()<-semtimedop()+10<-skgpwwait()+200<-ksliwat()+2091<-kslwaitctx()+161<-kjusuc()+8387<-ksipgetctxia()+4251<-ksqcmi()+24244<-ksqgtlctx()+4034<-ksqgelctx()+633<-ktusp_starttx()+213<-ktuaex1r()+354<-ktrsexec()+371<-ktuaspResumableInPdb()+11<-ktuAddExtent()+287<-ktusmaex()+352<-ktusmasp1r()+238<-ktrsexec()+1210<-ktuaspResumableInPdb()+11<-ktusmasp()+576<-ktuchg2()+9193<-ktbchg2()+240<-kdu_array_flush_retry()+2377<-kdu_array_flush1()+140<-qerupFetch
            wait history:
              * time between current wait and wait #1: 0.003152 sec
              1.       event: 'enq: IV -  contention'
                 time waited: 0.000500 sec
                     wait id: 3                p1: 'type|mode'=0x49560005
                                               p2: 'id1'=0x53594e43
                                               p3: 'id2'=0x17
              * time between wait #1 and #2: 0.000115 sec
              2.       event: 'enq: IV -  contention'
                 time waited: 0.001472 sec
                     wait id: 2                p1: 'type|mode'=0x49560005
                                               p2: 'id1'=0x4c4f434b
                                               p3: 'id2'=0x17
              * time between wait #2 and #3: 0.007973 sec
              3.       event: 'SQL*Net message from client'
                 time waited: 0.000730 sec
                     wait id: 1                p1: 'driver id'=0x74637000
                                               p2: '#bytes'=0x1
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 2 (xxxx,xxx))
                   os id: 11109
              process id: 880, oracle@db2
              session id: 2321
        session serial #: 24259
    }
    which is waiting for 'buffer busy waits' with wait info:
    {
                      p1: 'file#'=0x4
                      p2: 'block#'=0x2
                      p3: 'class#'=0xd
            time in wait: 431 min 35 sec
           timeout after: never
                 wait id: 208
                blocking: 237 sessions
…………………..
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 2 (xxxx,xxx))
                   os id: 25271
              process id: 30, oracle@db2 (SMON)
              session id: 4261
        session serial #: 51091
    }
………
    {
                instance: 2 (xxxx,xxx)
                   os id: 25267
              process id: 28, oracle@db2 (CKPT)
              session id: 3977
        session serial #: 23943
    }
    which is waiting for 'latch free' with wait info:
    {
                      p1: 'address'=0x6000c1d8
                      p2: 'number'=0x15
                      p3: 'tries'=0x0
            time in wait: 434 min 12 sec
           timeout after: never
                 wait id: 9858273
                blocking: 500 sessions
             current sql: <none>
            …….
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 2 (xxxx,xxx)
                   os id: 25223
              process id: 9, oracle@db2 (DBRM)
              session id: 1279
        session serial #: 22797
    }
    which is not in a wait:
    {
               last wait: 434 min 49 sec ago
                blocking: 505 sessions
             current sql: <none>
             short stack: <none: error encountered - ORA-32515: cannot issue ORADEBUG command 'SHORT_STACK' to process 'Unix process pid: 25223, image: oracle@db2 (DBRM)'; prior command execution time exceeds 30000 ms>
            wait history:
              1.       event: 'rdbms ipc message'
                 time waited: 2.999246 sec
                     wait id: 1117963          p1: 'timeout'=0x12c
              * time between wait #1 and #2: 0.002750 sec
              2.       event: 'rdbms ipc message'
                 time waited: 2.439357 sec
                     wait id: 1117962          p1: 'timeout'=0xf4
              * time between wait #2 and #3: 0.000046 sec
              3.       event: 'rdbms ipc message'
                 time waited: 0.565102 sec
                     wait id: 1117961          p1: 'timeout'=0x12c
    }
 
Chain 1 Signature: <not in a wait><='latch free'<='reliable message'<='buffer busy waits'<='enq: HW - contention'<='library cache lock'

注：上面是会话堵塞链，问题的根源是DBRM进程， DBRM是Resource Manager的后台进程。
查看当时的resource_managed 参数：

SQL> show parameter resource

NAME                                               TYPE        VALUE
------------------------------------    ----------- ------------------------------
resource_limit                                   boolean     TRUE
resource_manager_cpu_allocation    integer     32
resource_manager_plan                    string      SCHEDULER[0x444A]:DEFAULT_MAIN
                                                                          TENANCE_PLAN

当前数据库未禁用resource manager, 用户进程和后台进程CKPT被DBRM进程堵塞。

问题解决

DBRM资源管理的后台进程长时间堵塞了ckpt 后台核心进程，导致一系列级联堵塞，影响了数据库前端应用和登录。当前版本虽然没有发现已知bug，但是建议禁用resource manage问题特性。

alter system set resource_manager_plan='';

禁用resource manage后数据库和应用恢复正常，数据库异常等待消失。

其他建议：禁止应用中写高并发。该环境应用中使用太高的并行hint，如果再加上并发，并行度如果没有限制的话，那占用的CPU资源就是并行数*并发数据，很快会把CPU资源占满，影响全库性能。

oracle

最后修改时间：2019-12-26 11:58:52

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者

【经验】Resource Manage导致数据库hang异常分析

问题描述

问题分析

问题解决

评论