
Cherish this one: you rarely see an AIX case these days



   

    Today a member of one of my WeChat groups messaged me about a case and asked for advice: during the busy morning hours, one of their database nodes had restarted on its own. Let's start with the database alert log.

    Thu Feb 01 09:31:20 2024
    Archived Log entry 1886895 added for thread 2 sequence 1027967 ID 0x436ee63 dest 1:
    Thu Feb 01 09:32:38 2024


    ***********************************************************************


    Fatal NI connect error 12170.


    VERSION INFORMATION:
    TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
    TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
    Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
    Time: 01-FEB-2024 09:32:38
    Tracing not turned on.
    Tns error struct:
    ns main err code: 12535


    TNS-12535: TNS:operation timed out
    ns secondary err code: 12606
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
    Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=10.133.210.112)(PORT=56224))
    WARNING: inbound connection timed out (ORA-3136)
    Thu Feb 01 09:32:39 2024
    ......
    ***********************************************************************
    Thu Feb 01 09:32:45 2024
    Thu Feb 01 09:32:45 2024


    Fatal NI connect error 12170.
    Thu Feb 01 09:32:45 2024
    Thu Feb 01 09:32:45 2024


    Fatal NI connect error 12170.


    Fatal NI connect error 12170.
    Thu Feb 01 09:32:45 2024
    Thu Feb 01 09:32:45 2024
    Thu Feb 01 09:32:45 2024
    Thu Feb 01 09:32:45 2024
    Thu Feb 01 09:32:45 2024


    Fatal NI connect error 12170.
    Thu Feb 01 09:32:45 2024


    ***********************************************************************


    ***********************************************************************
    ......


    Thu Feb 01 10:20:10 2024
    PMON failed to acquire latch, see PMON dump
    Thu Feb 01 10:21:10 2024
    PMON failed to acquire latch, see PMON dump
    Thu Feb 01 10:21:43 2024
    Errors in file app/oracle/diag/rdbms/xxxxcogstb/xxxxcog2/trace/xxxxcog2_smco_36831644.trc (incident=2667623):
    ORA-00445: background process "W004" did not start after 120 seconds
    Thu Feb 01 10:22:01 2024
    Killing enqueue blocker (pid=2425816) on resource CF-00000000-00000000 by (pid=36569482)
    Thu Feb 01 10:22:22 2024
    PMON failed to acquire latch, see PMON dump
    Thu Feb 01 10:23:05 2024
    IPC Send timeout detected. Receiver ospid 38601620 [oracle@tystyxr2 (LCK0)]
    Receiver is waiting for a latch, dumping latch state for ospid 38601620 [oracle@tystyxr2 (LCK0)] 38601620
    Thu Feb 01 10:23:05 2024
    Errors in file app/oracle/diag/rdbms/xxxxcogstb/xxxxcog2/trace/xxxxcog2_ora_43189812.trc:
    Thu Feb 01 10:23:07 2024
    Errors in file app/oracle/diag/rdbms/xxxxcogstb/xxxxcog2/trace/xxxxcog2_lck0_38601620.trc:
    Dumping diagnostic data in directory=[cdmp_20240201102307], requested by (instance=2, osid=43189812), summary=[abnormal process termination].
    Thu Feb 01 10:23:09 2024
    ORA-00020: maximum number of processes (8000) exceeded
    ORA-20 errors will not be written to the alert log for
    the next minute. Please look at trace files to see all
    the ORA-20 errors.
    Thu Feb 01 10:23:26 2024
    Process PZ53 submission failed with error = 20
    Thu Feb 01 10:24:10 2024
    ORA-00020: maximum number of processes (8000) exceeded
    ORA-20 errors will not be written to the alert log for
    the next minute. Please look at trace files to see all
    the ORA-20 errors.
    Thu Feb 01 10:24:14 2024
    .....
    WARNING: inbound connection timed out (ORA-3136)
    Thu Feb 01 10:33:46 2024
    NOTE: ASMB terminating
    Errors in file app/oracle/diag/rdbms/xxxxcogstb/xxxxcog2/trace/xxxxcog2_asmb_2425816.trc:
    ORA-15064: communication failure with ASM instance
    ORA-03135: connection lost contact
    Process ID:
    Session ID: 993 Serial number: 44
    Errors in file app/oracle/diag/rdbms/xxxxcogstb/xxxxcog2/trace/xxxxcog2_asmb_2425816.trc:
    ORA-15064: communication failure with ASM instance
    ORA-03135: connection lost contact
    Process ID:
    Session ID: 993 Serial number: 44
    ASMB (ospid: 2425816): terminating the instance due to error 15064
    Thu Feb 01 10:33:46 2024
    System state dump requested by (instance=2, osid=2425816 (ASMB)), summary=[abnormal instance termination].
    System State dumped to trace file app/oracle/diag/rdbms/xxxxcogstb/xxxxcog2/trace/xxxxcog2_diag_6422788.trc
    Thu Feb 01 10:33:47 2024
    ORA-1092 : opitsk aborting process
    Thu Feb 01 10:33:47 2024
    ORA-1092 : opitsk aborting process
    ......
    Thu Feb 01 10:33:50 2024
    ORA-1092 : opitsk aborting process
    Instance terminated by ASMB, pid = 2425816
    USER (ospid: 13895476): terminating the instance
    Instance terminated by USER, pid = 13895476
    Thu Feb 01 10:34:08 2024
    Starting ORACLE instance (normal)
    LICENSE_MAX_SESSION = 0
    LICENSE_SESSIONS_WARNING = 0
    Picked latch-free SCN scheme 3
    Autotune of undo retention is turned on.
    LICENSE_MAX_USERS = 0
    SYS auditing is disabled
    Starting up:
    Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
    With the Partitioning, Real Application Clusters, OLAP, Data Mining
    and Real Application Testing options.
    ORACLE_HOME = app/product/11.2.0/db


        At first glance, the database ran out of connection slots, eventually hung, and the cluster then got into a split-brain situation. Because of the rebootless restart capability introduced in Oracle 11gR2, Clusterware restarts the instance rather than rebooting the whole node, which is why the database instance on this node was restarted. (With the database hung and IPC send timeouts showing up, this outcome was inevitable.)

         The immediate reason the instance was restarted is that the critical background process ASMB was killed, and ASMB was killed because it was holding the CF (control file) enqueue without releasing it. Starting with Oracle 11gR2, any process that holds the CF enqueue for more than 120 seconds without releasing it will be killed. This mechanism is governed by the following parameters:

      NAME                                               VALUE    DESCRIB
      -------------------------------------------------- -------- ------------------------------------------------------------
      _controlfile_enqueue_timeout                       900      control file enqueue timeout in seconds
      _controlfile_enqueue_holding_time                  120      control file enqueue max holding time in seconds
      _controlfile_enqueue_holding_time_tracking_size    10       control file enqueue holding time tracking size
      _controlfile_enqueue_dump                          FALSE    dump the system states after controlfile enqueue timeout
      _kill_controlfile_enqueue_blocker                  TRUE     enable killing controlfile enqueue blocker on timeout
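
      For reference, a listing like the one above can be pulled from the x$ views; below is a minimal sketch run as SYSDBA (these are undocumented underscore parameters, so treat this as read-only diagnostics and change them only under Oracle Support guidance):

# Hidden parameters live in x$ksppi (names/descriptions) and x$ksppcv (current values).
sqlplus -S / as sysdba <<'EOF'
set linesize 200 pagesize 100
col name format a50
col value format a10
col describ format a60
select i.ksppinm name, v.ksppstvl value, i.ksppdesc describ
  from x$ksppi i, x$ksppcv v
 where i.indx = v.indx
   and i.ksppinm in ('_controlfile_enqueue_timeout',
                     '_controlfile_enqueue_holding_time',
                     '_controlfile_enqueue_holding_time_tracking_size',
                     '_controlfile_enqueue_dump',
                     '_kill_controlfile_enqueue_blocker');
EOF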

       

        However, I don't think this is the root cause: the restart did not happen until after 10:20, while the alert log shows the hang starting at around 09:30. Once a database is nearly hung, all sorts of operations will naturally fail, so we still need to work out why it hung in the first place. A systemstate dump was taken before the instance crashed, so we can analyze that file (as an aside, it would be nice if domestic databases had a similar dump capability).
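
        (For reference, when an instance is still responsive such a dump can also be taken by hand; the sketch below shows the usual oradebug sequence for a RAC-wide systemstate at level 258. In this case the dump had already been generated automatically, so we simply read the existing trace file.)

# Take a global systemstate dump; level 258 = 256 (short stacks) + 2 (basic process info).
sqlplus -S / as sysdba <<'EOF'
oradebug setmypid
oradebug unlimit
oradebug -g all dump systemstate 258
oradebug tracefile_name
EOF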

        *** 2024-02-01 09:42:11.679
        Process diagnostic dump for oracle@tystyxr2 (ARC9), OS id=42795032,
        pid: 65, proc_ser: 1, sid: 6111, sess_ser: 1
        -------------------------------------------------------------------------------
        os thread scheduling delay history: (sampling every 1.000000 secs)
        0.000000 secs at [ 09:42:10 ]
        NOTE: scheduling delay has not been sampled for 0.983271 secs
        0.000000 secs from [ 09:42:07 - 09:42:11 ], 5 sec avg
        0.000000 secs from [ 09:41:12 - 09:42:11 ], 1 min avg
        0.000000 secs from [ 09:37:12 - 09:42:11 ], 5 min avg
        loadavg : 9.27 12.95 13.12
        swap info: free_mem = 249.07M rsv = 100.00M
        alloc = 539.64M avail = 25600.00M swap_free = 25060.36M
        F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
        240001 A oracle 42795032 1 0 60 20 6566e5590 242412 Dec 08 - 11:47 ora_arc9_xxxxcog2


        From the dump, at 09:42:11 free physical memory (free_mem) was down to only 249 MB, essentially exhausted. Is memory exhaustion the problem? Let's not jump to conclusions and first see what the ARCH process was doing.

          *** 2024-02-01 09:42:11.809
          ----------------------------------------
          SO: 0x700001de22f6b68, type: 2, owner: 0x0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
          proc=0x700001de22f6b68, name=process, file=ksu.h LINE:12616 ID:, pg=0
          (process) Oracle pid:65, ser:1, calls cur/top: 0x700001c2270a3b8/0x700001c22545558
          flags : (0x2) SYSTEM
          flags2: (0x0), flags3: (0x10)
          intr error: 0, call error: 0, sess error: 0, txn error 0
          intr queue: empty
          ksudlp FALSE at location: 0
          (post info) last post received: 0 0 26
          last post received-location: ksa2.h LINE:285 ID:ksasnd
          last process to post me: 700001dd22daec0 1 6
          last post sent: 0 0 47
          last post sent-location: ksv2.h LINE:1667 ID:ksvpst: send
          last process posted by me: 700001df22ff268 1 2
          (latch info) wait_event=0 bits=0
          Process Group: DEFAULT, pseudo proc: 0x700001dd2779610
          O/S info: user: oracle, term: UNKNOWN, ospid: 42795032
              OSD pid info: Unix process pid: 42795032, image: oracle@xxxxx2 (ARC9)
          ----------------------------------------
          SO: 0x700001d832813e0, type: 9, owner: 0x700001de22f6b68, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
          proc=0x700001de22f6b68, name=FileIdentificatonBlock, file=ksfd2.h LINE:312 ID:, pg=0
          (FIB) 700001d832813e0 flags=192 reference cnt=0 incno=86462 seqno=1
          fname=
          fno=0 lblksz=0 fsiz=0
          ############# kfiofib = 0x700001d83281498 #################
          File number = 0.0
          File type = 3
          Flags = 8
          Blksize = 0
          File size = 0 blocks
          Blk one offset = 0
          Redundancy = 0
          Physical blocksz = 0
          Open name          = +xxx_COG/xxxxcogstb/onlinelog/group_8.513.980062059 
          Fully-qualified nm =
          Mapid = 0
          Slave ID = 0
          Connection = 0x0x0
          ......
          SO: 0x700001de3213538, type: 4, owner: 0x700001de22f6b68, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
          proc=0x700001de22f6b68, name=session, file=ksu.h LINE:12624 ID:, pg=0
          (session) sid: 6111 ser: 1 trans: 0x0, creator: 0x700001de22f6b68
          flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
          flags2: (0x409) -/-/INC
          DID: , short-term DID:
          txn branch: 0x0
          oct: 0, prv: 0, sql: 0x0, psql: 0x0, user: 0/SYS
          ksuxds FALSE at location: 0
          service name: SYS$BACKGROUND
          Current Wait Stack:
          1: waiting for 'KSV master wait'
          =0x0, =0x0, =0x0
          wait_id=2277702 seq_num=62811 snap_id=2
          wait times: snap=1 min 13 sec, exc=3 min 16 sec, total=3 min 16 sec
          wait times: max=infinite, heur=3 min 16 sec
          wait counts: calls=66 os=66
          in_wait=1 iflags=0x1520
          0: waiting for 'Disk file operations I/O'
          FileOperation=0x2, fileno=0x0, filetype=0x3
          wait_id=2277701 seq_num=62808 snap_id=1
          wait times: snap=0.000000 sec, exc=0.000133 sec, total=3 min 16 sec
          wait times: max=infinite, heur=3 min 16 sec
          wait counts: calls=0 os=0
          in_wait=1 iflags=0x15a0
          There are 299 sessions blocked by this session.
          Dumping one waiter:
          inst: 1, sid: 3573, ser: 1
          wait event: 'enq: CF - contention'
          p1: 'name|mode'=0x43460005
          p2: '0'=0x0
          p3: 'operation'=0x0
          row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0
          min_blocked_time: 146 secs, waiter_cache_ver: 42721
          Wait State:
          fixed_waits=0 flags=0x22 boundary=0x0/-1


          The information above is very clear: at this point the ARC9 process was already blocking 299 sessions. It was presumably holding the CF (control file) enqueue, which is why all of its waiters were stuck on 'enq: CF - contention'.
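
          (If the instance is still answering queries at this point, the same blocking picture can usually be confirmed straight from gv$session; a minimal sketch, with column choices of my own rather than from the original case:)

# List sessions stuck on the control-file enqueue together with the session blocking each of them.
sqlplus -S / as sysdba <<'EOF'
set linesize 200 pagesize 100
select inst_id, sid, serial#, program, event,
       blocking_instance, blocking_session, seconds_in_wait
  from gv$session
 where event = 'enq: CF - contention'
 order by seconds_in_wait desc;
EOF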

           Digging further, by 09:49 the LCK0 process was blocking a few thousand sessions. How could the database not hang?

            *** 2024-02-01 09:49:59.659
            ----------------------------------------
            SO: 0x700001e02316b10, type: 2, owner: 0x0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
            proc=0x700001e02316b10, name=process, file=ksu.h LINE:12616 ID:, pg=0
            (process) Oracle pid:48, ser:186, calls cur/top: 0x700001c21ecee40/0x700001c21ecee40
            flags : (0x6) SYSTEM
            flags2: (0x0), flags3: (0x10)
            intr error: 0, call error: 0, sess error: 0, txn error 0
            intr queue: empty
            ksudlp FALSE at location: 0
            (post info) last post received: 0 0 81
            last post received-location: kji.h LINE:3422 ID:kjatb: wake up enqueue blocker
            last process to post me: 700001db22e17d0 2 6
            last post sent: 0 0 22
            last post sent-location: ksb2.h LINE:855 ID:ksbria
            last process posted by me: 700001e02316b10 186 6
            (latch info) wait_event=0 bits=0
            Location from where call was made: kqr.h LINE:2283 ID:kqrbip:
            waiting for 700001d429d69f0 Child row cache objects level=4 child#=13
            Location from where latch is held: kgh.h LINE:6541 ID:kghfrunp: clatch: wait:
            Context saved from call: 0
            state=busy [holder orapid=2290] wlstate=free [value=0]
            waiters [orapid (seconds since: put on list, posted, alive check)]:
            1456 (93, 1706752199, 0)
            1556 (93, 1706752199, 0)
            1560 (93, 1706752199, 0)
            ......
            2645 (3, 1706752199, 0)
            2634 (3, 1706752199, 0)
            77 (3, 1706752199, 0)
            2635 (3, 1706752199, 0)
            waiter count=2182
            gotten 500724120 times wait, failed first 130283057 sleeps 135954343
            gotten 760588 times nowait, failed: 539552515
            possible holder pid = 2290 ospid=52888000
            on wait list for 700001d429d69f0
            Process Group: DEFAULT, pseudo proc: 0x700001dd2779610
            O/S info: user: oracle, term: UNKNOWN, ospid: 38601620
                OSD pid info: Unix process pid: 38601620, image: oracle@xxxxx2 (LCK0)  


               We can see that LCK0 could not acquire the latch and was blocked by Oracle process 2290 (ospid 52888000). So what was 2290 doing, and what kind of process was it?
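
               (The hang analysis output below already answers this for us; for a live instance, the ospid from the dump can also be mapped back to its session with a quick join between gv$process and gv$session. A small sketch:)

# Map the ospid seen in the dump (52888000) back to its Oracle process and session.
sqlplus -S / as sysdba <<'EOF'
set linesize 200 pagesize 100
select p.inst_id, p.pid, p.spid, s.sid, s.serial#, s.username, s.program, s.event
  from gv$process p, gv$session s
 where s.paddr   = p.addr
   and s.inst_id = p.inst_id
   and p.spid    = '52888000';
EOF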

                      Oracle session identified by:
              {
              instance: 2 (xxxx.xxxxcog2)
              os id: 52888000 (DEAD)
              process id: 2290, oracle@xxxxx2
              session id: 10734
              session serial #: 1
              }
              is waiting for 'latch: row cache objects' with wait info:
              {
              p1: 'address'=0x700001d429d69f0
              p2: 'number'=0x118
              p3: 'tries'=0x0
              time in wait: 31 min 21 sec
              heur. time in wait: 39 min 41 sec
              timeout after: never
              wait id: 10
              blocking: 0 sessions
              wait history:
              * time between current wait and wait #1: 0.000041 sec
              1. event: 'latch: row cache objects'
              time waited: 8 min 19 sec
              wait id: 9 p1: 'address'=0x700001d429d69f0
              p2: 'number'=0x118
              p3: 'tries'=0x0
              * time between wait #1 and #2: 0.017833 sec
              2. event: 'latch: shared pool'
              time waited: 0.017542 sec
              wait id: 8 p1: 'address'=0x700000000109158
              p2: 'number'=0x133
              p3: 'tries'=0x0
              * time between wait #2 and #3: 0.000032 sec
              3. event: 'latch: row cache objects'
              time waited: 10.802931 sec
              wait id: 7 p1: 'address'=0x700001d429d69f0
              p2: 'number'=0x118
              p3: 'tries'=0x0
              }


              Chain 8148 Signature: 'latch: row cache objects'
              Chain 8148 Signature Hash: 0xdbc4bc81


                  Process 2290 looks like an ordinary user session. Working backwards from its wait time (31 minutes before 2024-02-01 09:49:59.659), the system must already have been in trouble at around 09:18. Let's keep that time in mind.

                    To sum up, we need to understand why the database was already in trouble before 09:32. Could the archive destination have filled up, so that ARCH could not archive? I asked the user: the archive disk group still had plenty of free space, so that can be ruled out. The dump we saw earlier already hinted at a problem, so could it be that abnormal waits made connections pile up until memory was exhausted?

                    Now let's cross-check the listener log and the OSWatcher (OSW) data.

                [root@xxx tmp]# tail -50000 listener.log |fgrep "01-FEB-2024 09:" |fgrep "establish" |awk '{print $1 " " $2}' |awk -F: '{print $1 ":" $2 }' |sort |uniq -c
                6 01-FEB-2024 09:00
                289 01-FEB-2024 09:01
                17 01-FEB-2024 09:02
                1 01-FEB-2024 09:04
                1 01-FEB-2024 09:06
                9 01-FEB-2024 09:07
                7 01-FEB-2024 09:08
                3 01-FEB-2024 09:09
                4 01-FEB-2024 09:10
                1 01-FEB-2024 09:11
                3 01-FEB-2024 09:14
                191 01-FEB-2024 09:16
                20 01-FEB-2024 09:17
                20 01-FEB-2024 09:18
                2 01-FEB-2024 09:19
                1 01-FEB-2024 09:20
                1 01-FEB-2024 09:21
                3 01-FEB-2024 09:23
                3 01-FEB-2024 09:24
                2 01-FEB-2024 09:25
                2 01-FEB-2024 09:26
                3 01-FEB-2024 09:27
                1 01-FEB-2024 09:28
                1 01-FEB-2024 09:30
                219 01-FEB-2024 09:31
                16 01-FEB-2024 09:32
                23 01-FEB-2024 09:33
                27 01-FEB-2024 09:34
                1 01-FEB-2024 09:35
                16 01-FEB-2024 09:36
                30 01-FEB-2024 09:37
                459 01-FEB-2024 09:38
                397 01-FEB-2024 09:39
                ......
                256 01-FEB-2024 09:58
                274 01-FEB-2024 09:59

                From the listener log, the peak is 459 new connections in a minute and the quiet minutes see only a few dozen, which by itself is not extreme; but if the surge lasts long enough, the process slots fill up easily. From 09:31 to 10:23 is about 52 minutes, and the listener log shows an average of 200-300 connections per minute, which works out to over ten thousand connections in roughly 50 minutes. With processes=8000 on this system, it is no surprise the limit was blown through quickly.
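
                To get a feel for how quickly the process slots were being consumed, the same listener.log can be summed over the whole window; a rough sketch along the lines of the command above (the tail depth is only a guess):

# Count connection establish requests between 09:31 and 10:23 (timestamp is field 2 of each line).
tail -200000 listener.log | fgrep "01-FEB-2024" | fgrep "establish" \
  | awk '{t = substr($2, 1, 5); if (t >= "09:31" && t <= "10:23") n++} END {print n, "connections between 09:31 and 10:23"}'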

                Let's look at the OSW data for a few of these points in time:

                  zzz ***Thu Feb 1 09:31:21 CST 2024


                  System configuration: lcpu=128 mem=262144MB ent=8.00


                  kthr memory page faults cpu
                  ----- ----------- ------------------------ ------------ -----------------------
                  r b avm fre re pi po fr sr cy in sy cs us sy id wa pc ec
                  2 0 44950952 436130 0 0 0 0 0 0 11095 96606 39410 27 10 62 1 5.44 68.0
                  0 0 44955637 431359 0 0 0 0 0 0 14971 116316 53392 41 12 46 1 7.81 97.6
                  10 0 44952580 434027 0 0 0 0 0 0 17366 81886 42874 34 11 55 1 6.50 81.2
                  zzz ***Thu Feb 1 09:31:41 CST 2024


                  System configuration: lcpu=128 mem=262144MB ent=8.00


                  kthr memory page faults cpu
                  ----- ----------- ------------------------ ------------ -----------------------
                  r b avm fre re pi po fr sr cy in sy cs us sy id wa pc ec
                  13 0 45316871 78841 0 0 0 0 0 0 14896 122684 58719 40 14 45 0 8.61 107.6
                  7 0 45318075 77057 0 0 0 0 0 0 14918 151949 68454 41 14 44 1 10.19 127.4
                  15 0 45321226 73073 0 0 0 0 0 0 14001 157227 56818 41 14 45 0 10.20 127.5
                  zzz ***Thu Feb 1 09:32:01 CST 2024

                  You can see that at 09:31 free memory suddenly dropped from about 1.6 GB to under 300 MB (the fre column is in 4 KB pages), which is very little indeed. The sudden drop lines up with the spike in new connections: the listener statistics above show more than 200 connections established during the 09:31 minute, so the fall in free memory is hardly surprising.
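
                  (To find every point where memory ran this low, the whole OSW vmstat archive can be scanned for samples with a small fre value; a sketch, with the archive path and threshold assumed. On AIX, fre is in 4 KB pages, so 125,000 pages is roughly 500 MB.)

# Print the OSW sample timestamp whenever the vmstat 'fre' column (4th field of data rows) is low.
awk '/^zzz/ {ts = $0} $1 ~ /^[0-9]+$/ && $4 + 0 < 125000 {print ts, "fre=" $4}' \
    /opt/oswbb/archive/oswvmstat/*vmstat*.dat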

                  The connection count at 09:01 was also high; was free memory even lower then? You guessed it.

                    zzz ***Thu Feb 1 09:01:30 CST 2024


                    System configuration: lcpu=128 mem=262144MB ent=8.00


                    kthr memory page faults cpu
                    ----- ----------- ------------------------ ------------ -----------------------
                    r b avm fre re pi po fr sr cy in sy cs us sy id wa pc ec
                    37 0 45190991 49283 0 0 0 38426 197710 0 15199 253559 73319 42 17 39 1 18.41 230.2
                    48 1 45247351 53540 0 0 0 60598 147876 0 12026 249960 68464 44 17 38 1 17.80 222.6
                    45 0 45253280 61481 0 0 0 16086 15853 0 12514 291799 82776 45 16 38 1 17.95 224.4
                    zzz ***Thu Feb 1 09:01:50 CST 2024

                    Free memory was basically below 200 MB, on a machine with 256 GB of physical RAM. Could the operating system parameters be misconfigured?

                    Sure enough, this is yet another customer following the usual "AIX best practice", with those two parameters set very high.

                    In fact, the default values of these two parameters are on the high side. I don't recommend keeping them that high; we used to set them to 15%-20%.
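
                    (The screenshot naming the two parameters is not included here. On AIX, the file-system cache limits maxperm% and maxclient%, which default to 90%, are the usual candidates for this 15%-20% advice, so purely as a hypothetical illustration:)

# Hypothetical illustration only: the exact parameters from this case are in the missing screenshot.
vmo -a | egrep "minperm%|maxperm%|maxclient%|lru_file_repage"    # inspect current AIX VMM tuning
# vmo -p -o maxperm%=20 -o maxclient%=20    # example of lowering them persistently; on newer AIX levels these are restricted tunables, so validate with the AIX admin first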


                      


                    This article is reposted from Roger的数据库专栏 (Roger's database column).
