
Today a netizen from a WeChat group messaged me privately about a case, asking for some advice: during the busy morning hours, one of their database nodes had restarted on its own. Let's start with the database alert log.
Thu Feb 01 09:31:20 2024
Archived Log entry 1886895 added for thread 2 sequence 1027967 ID 0x436ee63 dest 1:
Thu Feb 01 09:32:38 2024
***********************************************************************
Fatal NI connect error 12170.

  VERSION INFORMATION:
        TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
        TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
        Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
  Time: 01-FEB-2024 09:32:38
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535
TNS-12535: TNS:operation timed out
    ns secondary err code: 12606
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=10.133.210.112)(PORT=56224))
WARNING: inbound connection timed out (ORA-3136)
Thu Feb 01 09:32:39 2024
......
***********************************************************************
Thu Feb 01 09:32:45 2024
Thu Feb 01 09:32:45 2024
Fatal NI connect error 12170.
Thu Feb 01 09:32:45 2024
Thu Feb 01 09:32:45 2024
Fatal NI connect error 12170.
Fatal NI connect error 12170.
Thu Feb 01 09:32:45 2024
Thu Feb 01 09:32:45 2024
Thu Feb 01 09:32:45 2024
Thu Feb 01 09:32:45 2024
Thu Feb 01 09:32:45 2024
Fatal NI connect error 12170.
Thu Feb 01 09:32:45 2024
***********************************************************************
***********************************************************************
......
Thu Feb 01 10:20:10 2024
PMON failed to acquire latch, see PMON dump
Thu Feb 01 10:21:10 2024
PMON failed to acquire latch, see PMON dump
Thu Feb 01 10:21:43 2024
Errors in file app/oracle/diag/rdbms/xxxxcogstb/xxxxcog2/trace/xxxxcog2_smco_36831644.trc  (incident=2667623):
ORA-00445: background process "W004" did not start after 120 seconds
Thu Feb 01 10:22:01 2024
Killing enqueue blocker (pid=2425816) on resource CF-00000000-00000000 by (pid=36569482)
Thu Feb 01 10:22:22 2024
PMON failed to acquire latch, see PMON dump
Thu Feb 01 10:23:05 2024
IPC Send timeout detected. Receiver ospid 38601620 [oracle@tystyxr2 (LCK0)]
Receiver is waiting for a latch, dumping latch state for ospid 38601620 [oracle@tystyxr2 (LCK0)] 38601620
Thu Feb 01 10:23:05 2024
Errors in file app/oracle/diag/rdbms/xxxxcogstb/xxxxcog2/trace/xxxxcog2_ora_43189812.trc:
Thu Feb 01 10:23:07 2024
Errors in file app/oracle/diag/rdbms/xxxxcogstb/xxxxcog2/trace/xxxxcog2_lck0_38601620.trc:
Dumping diagnostic data in directory=[cdmp_20240201102307], requested by (instance=2, osid=43189812), summary=[abnormal process termination].
Thu Feb 01 10:23:09 2024
ORA-00020: maximum number of processes (8000) exceeded
ORA-20 errors will not be written to the alert log for
the next minute. Please look at trace files to see all
the ORA-20 errors.
Thu Feb 01 10:23:26 2024
Process PZ53 submission failed with error = 20
Thu Feb 01 10:24:10 2024
ORA-00020: maximum number of processes (8000) exceeded
ORA-20 errors will not be written to the alert log for
the next minute. Please look at trace files to see all
the ORA-20 errors.
Thu Feb 01 10:24:14 2024
.....
WARNING: inbound connection timed out (ORA-3136)
Thu Feb 01 10:33:46 2024
NOTE: ASMB terminating
Errors in file app/oracle/diag/rdbms/xxxxcogstb/xxxxcog2/trace/xxxxcog2_asmb_2425816.trc:
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Process ID:
Session ID: 993 Serial number: 44
Errors in file app/oracle/diag/rdbms/xxxxcogstb/xxxxcog2/trace/xxxxcog2_asmb_2425816.trc:
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Process ID:
Session ID: 993 Serial number: 44
ASMB (ospid: 2425816): terminating the instance due to error 15064
Thu Feb 01 10:33:46 2024
System state dump requested by (instance=2, osid=2425816 (ASMB)), summary=[abnormal instance termination].
System State dumped to trace file app/oracle/diag/rdbms/xxxxcogstb/xxxxcog2/trace/xxxxcog2_diag_6422788.trc
Thu Feb 01 10:33:47 2024
ORA-1092 : opitsk aborting process
Thu Feb 01 10:33:47 2024
ORA-1092 : opitsk aborting process
......
Thu Feb 01 10:33:50 2024
ORA-1092 : opitsk aborting process
Instance terminated by ASMB, pid = 2425816
USER (ospid: 13895476): terminating the instance
Instance terminated by USER, pid = 13895476
Thu Feb 01 10:34:08 2024
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 3
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options.
ORACLE_HOME = app/product/11.2.0/db
At first glance the story is roughly this: the database connection count filled up, the database eventually hung, and the cluster then ran into a split-brain situation. Because of the rebootless restart feature introduced in Oracle 11gR2, the whole node/clusterware stack is not rebooted; only the instance is restarted, which is why the database instance on this node got bounced. (With the database hung and IPC send timeouts firing, this outcome was inevitable.)
The immediate reason the instance was restarted is that the critical background process ASMB got killed, and ASMB got killed because it was holding the CF (control file) enqueue without releasing it. Starting with Oracle 11gR2, any process that holds the CF enqueue for more than 120 seconds without releasing it will be killed. This mechanism is controlled by the following hidden parameters (a query sketch for listing them follows the table):
NAME                                              VALUE                DESCRIB
------------------------------------------------- -------------------- --------------------------------------------------------------------------------
_controlfile_enqueue_timeout                      900                  control file enqueue timeout in seconds
_controlfile_enqueue_holding_time                 120                  control file enqueue max holding time in seconds
_controlfile_enqueue_holding_time_tracking_size   10                   control file enqueue holding time tracking size
_controlfile_enqueue_dump                         FALSE                dump the system states after controlfile enqueue timeout
_kill_controlfile_enqueue_blocker                 TRUE                 enable killing controlfile enqueue blocker on timeout
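For reference, a listing like this is normally produced by querying the fixed tables x$ksppi / x$ksppcv as SYS; a minimal sketch (column formatting left aside):

-- Minimal sketch (run as SYS): list the control file enqueue related hidden parameters.
-- x$ksppi carries the parameter names/descriptions, x$ksppcv the current values.
SELECT i.ksppinm  AS name,
       v.ksppstvl AS value,
       i.ksppdesc AS describ
  FROM x$ksppi  i,
       x$ksppcv v
 WHERE i.indx = v.indx
   AND i.ksppinm LIKE '%controlfile_enqueue%';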
However, I don't think this is the root cause. The restart only happened after 10:20, while the alert log shows hang symptoms starting as early as around 09:30. Once a database is close to hanging, all kinds of operations will naturally fail, so we still need to dig into why it hung in the first place. A systemstate dump was taken before the database crashed, so we can analyze that file (as an aside, it would be nice if domestic databases had a similar dump capability).
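In this case the systemstate dump was requested automatically by ASMB as it terminated the instance. For reference, the same kind of dump can also be taken manually from SQL*Plus while a hang is still in progress; a minimal sketch using standard oradebug commands:

-- Minimal sketch: manually take a systemstate dump while the database is hanging.
-- Level 266 = 10 + 256 (adds short stacks); on RAC, "oradebug -g all dump systemstate 266" covers every instance.
SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug dump systemstate 266
SQL> oradebug tracefile_name        -- path of the trace file that received the dump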
*** 2024-02-01 09:42:11.679
Process diagnostic dump for oracle@tystyxr2 (ARC9), OS id=42795032,
pid: 65, proc_ser: 1, sid: 6111, sess_ser: 1
-------------------------------------------------------------------------------
os thread scheduling delay history: (sampling every 1.000000 secs)
  0.000000 secs at [ 09:42:10 ]
  NOTE: scheduling delay has not been sampled for 0.983271 secs
  0.000000 secs from [ 09:42:07 - 09:42:11 ], 5 sec avg
  0.000000 secs from [ 09:41:12 - 09:42:11 ], 1 min avg
  0.000000 secs from [ 09:37:12 - 09:42:11 ], 5 min avg
loadavg : 9.27 12.95 13.12
swap info: free_mem = 249.07M rsv = 100.00M alloc = 539.64M avail = 25600.00M swap_free = 25060.36M
     F S UID      PID  PPID  C PRI NI ADDR      SZ     WCHAN STIME  TTY TIME  CMD
240001 A oracle 42795032  1  0  60 20 6566e5590 242412       Dec 08  -  11:47 ora_arc9_xxxxcog2
From the dump, at 09:42:11 free physical memory (free_mem) was down to only 249M, essentially exhausted. So was it memory exhaustion? Let's not jump to conclusions yet and first look more closely at what the ARC process was doing.
*** 2024-02-01 09:42:11.809
----------------------------------------
SO: 0x700001de22f6b68, type: 2, owner: 0x0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
 proc=0x700001de22f6b68, name=process, file=ksu.h LINE:12616 ID:, pg=0
(process) Oracle pid:65, ser:1, calls cur/top: 0x700001c2270a3b8/0x700001c22545558
          flags : (0x2) SYSTEM
          flags2: (0x0),  flags3: (0x10)
          intr error: 0, call error: 0, sess error: 0, txn error 0
          intr queue: empty
  ksudlp FALSE at location: 0
  (post info) last post received: 0 0 26
              last post received-location: ksa2.h LINE:285 ID:ksasnd
              last process to post me: 700001dd22daec0 1 6
              last post sent: 0 0 47
              last post sent-location: ksv2.h LINE:1667 ID:ksvpst: send
              last process posted by me: 700001df22ff268 1 2
  (latch info) wait_event=0 bits=0
    Process Group: DEFAULT, pseudo proc: 0x700001dd2779610
    O/S info: user: oracle, term: UNKNOWN, ospid: 42795032
    OSD pid info: Unix process pid: 42795032, image: oracle@xxxxx2 (ARC9)
    ----------------------------------------
    SO: 0x700001d832813e0, type: 9, owner: 0x700001de22f6b68, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
     proc=0x700001de22f6b68, name=FileIdentificatonBlock, file=ksfd2.h LINE:312 ID:, pg=0
    (FIB) 700001d832813e0 flags=192 reference cnt=0 incno=86462 seqno=1
      fname= fno=0 lblksz=0 fsiz=0
    ############# kfiofib = 0x700001d83281498 #################
      File number      = 0.0
      File type        = 3
      Flags            = 8
      Blksize          = 0
      File size        = 0 blocks
      Blk one offset   = 0
      Redundancy       = 0
      Physical blocksz = 0
      Open name        = +xxx_COG/xxxxcogstb/onlinelog/group_8.513.980062059
      Fully-qualified nm =
      Mapid            = 0
      Slave ID         = 0
      Connection       = 0x0x0
    ......
    SO: 0x700001de3213538, type: 4, owner: 0x700001de22f6b68, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
     proc=0x700001de22f6b68, name=session, file=ksu.h LINE:12624 ID:, pg=0
    (session) sid: 6111 ser: 1 trans: 0x0, creator: 0x700001de22f6b68
              flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
              flags2: (0x409) -/-/INC
              DID: , short-term DID:
              txn branch: 0x0
              oct: 0, prv: 0, sql: 0x0, psql: 0x0, user: 0/SYS
    ksuxds FALSE at location: 0
    service name: SYS$BACKGROUND
    Current Wait Stack:
     1: waiting for 'KSV master wait'
        =0x0, =0x0, =0x0
        wait_id=2277702 seq_num=62811 snap_id=2
        wait times: snap=1 min 13 sec, exc=3 min 16 sec, total=3 min 16 sec
        wait times: max=infinite, heur=3 min 16 sec
        wait counts: calls=66 os=66
        in_wait=1 iflags=0x1520
     0: waiting for 'Disk file operations I/O'
        FileOperation=0x2, fileno=0x0, filetype=0x3
        wait_id=2277701 seq_num=62808 snap_id=1
        wait times: snap=0.000000 sec, exc=0.000133 sec, total=3 min 16 sec
        wait times: max=infinite, heur=3 min 16 sec
        wait counts: calls=0 os=0
        in_wait=1 iflags=0x15a0
    There are 299 sessions blocked by this session.
    Dumping one waiter:
      inst: 1, sid: 3573, ser: 1
      wait event: 'enq: CF - contention'
        p1: 'name|mode'=0x43460005
        p2: '0'=0x0
        p3: 'operation'=0x0
      row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0
      min_blocked_time: 146 secs, waiter_cache_ver: 42721
    Wait State:
      fixed_waits=0 flags=0x22 boundary=0x0/-1
The information above is very clear: at this point the ARC process was already blocking 299 sessions. It was evidently holding the CF (control file) enqueue, which is why all of its waiters show up waiting on 'enq: CF - contention'.
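If the system is still responsive while this is happening, the same blocking picture can be pulled live from gv$session; a minimal sketch (11g column names):

-- Minimal sketch: list sessions stuck on the CF enqueue and who is blocking them (RAC-wide).
SELECT inst_id, sid, serial#, program,
       blocking_instance, blocking_session, seconds_in_wait
  FROM gv$session
 WHERE event = 'enq: CF - contention'
 ORDER BY seconds_in_wait DESC;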
Digging further, we can see that by around 09:49 the LCK0 process was blocking a couple of thousand sessions. How could the database not hang?
*** 2024-02-01 09:49:59.659
----------------------------------------
SO: 0x700001e02316b10, type: 2, owner: 0x0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
 proc=0x700001e02316b10, name=process, file=ksu.h LINE:12616 ID:, pg=0
(process) Oracle pid:48, ser:186, calls cur/top: 0x700001c21ecee40/0x700001c21ecee40
          flags : (0x6) SYSTEM
          flags2: (0x0),  flags3: (0x10)
          intr error: 0, call error: 0, sess error: 0, txn error 0
          intr queue: empty
  ksudlp FALSE at location: 0
  (post info) last post received: 0 0 81
              last post received-location: kji.h LINE:3422 ID:kjatb: wake up enqueue blocker
              last process to post me: 700001db22e17d0 2 6
              last post sent: 0 0 22
              last post sent-location: ksb2.h LINE:855 ID:ksbria
              last process posted by me: 700001e02316b10 186 6
  (latch info) wait_event=0 bits=0
    Location from where call was made: kqr.h LINE:2283 ID:kqrbip:
      waiting for 700001d429d69f0 Child row cache objects level=4 child#=13
        Location from where latch is held: kgh.h LINE:6541 ID:kghfrunp: clatch: wait:
        Context saved from call: 0
        state=busy [holder orapid=2290] wlstate=free [value=0]
        waiters [orapid (seconds since: put on list, posted, alive check)]:
         1456 (93, 1706752199, 0)
         1556 (93, 1706752199, 0)
         1560 (93, 1706752199, 0)
         ......
         2645 (3, 1706752199, 0)
         2634 (3, 1706752199, 0)
         77 (3, 1706752199, 0)
         2635 (3, 1706752199, 0)
         waiter count=2182
        gotten 500724120 times wait, failed first 130283057 sleeps 135954343
        gotten 760588 times nowait, failed: 539552515
      possible holder pid = 2290 ospid=52888000
    on wait list for 700001d429d69f0
    Process Group: DEFAULT, pseudo proc: 0x700001dd2779610
    O/S info: user: oracle, term: UNKNOWN, ospid: 38601620
    OSD pid info: Unix process pid: 38601620, image: oracle@xxxxx2 (LCK0)
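To see the same pile-up live (thousands of sessions stacked behind 'latch: row cache objects'), a quick count of non-idle waits per event would show it; a minimal sketch:

-- Minimal sketch: count sessions per non-idle wait event across the cluster.
SELECT inst_id, event, COUNT(*) AS sessions
  FROM gv$session
 WHERE wait_class <> 'Idle'
 GROUP BY inst_id, event
 ORDER BY sessions DESC;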
So LCK0 could not get the latch; it was blocked by the process with Oracle pid 2290. What was 2290 doing, and what kind of process was it?
Oracle session identified by:
    {
                instance: 2 (xxxx.xxxxcog2)
                   os id: 52888000 (DEAD)
              process id: 2290, oracle@xxxxx2
              session id: 10734
        session serial #: 1
    }
is waiting for 'latch: row cache objects' with wait info:
    {
                      p1: 'address'=0x700001d429d69f0
                      p2: 'number'=0x118
                      p3: 'tries'=0x0
            time in wait: 31 min 21 sec
      heur. time in wait: 39 min 41 sec
           timeout after: never
                 wait id: 10
                blocking: 0 sessions
            wait history:
              * time between current wait and wait #1: 0.000041 sec
              1.       event: 'latch: row cache objects'
                 time waited: 8 min 19 sec
                     wait id: 9        p1: 'address'=0x700001d429d69f0
                                       p2: 'number'=0x118
                                       p3: 'tries'=0x0
              * time between wait #1 and #2: 0.017833 sec
              2.       event: 'latch: shared pool'
                 time waited: 0.017542 sec
                     wait id: 8        p1: 'address'=0x700000000109158
                                       p2: 'number'=0x133
                                       p3: 'tries'=0x0
              * time between wait #2 and #3: 0.000032 sec
              3.       event: 'latch: row cache objects'
                 time waited: 10.802931 sec
                     wait id: 7        p1: 'address'=0x700001d429d69f0
                                       p2: 'number'=0x118
                                       p3: 'tries'=0x0
    }
Chain 8148 Signature: 'latch: row cache objects'
Chain 8148 Signature Hash: 0xdbc4bc81
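Normally one would map such an OS pid back to its session by joining v$process and v$session; a minimal sketch (in this case the OS process was already marked DEAD, so the dump is the only evidence left):

-- Minimal sketch: map an OS pid (spid) to its session, program and current SQL on one instance.
SELECT s.sid, s.serial#, s.username, s.program, s.machine, s.event, s.sql_id
  FROM v$session s, v$process p
 WHERE s.paddr = p.addr
   AND p.spid  = '52888000';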
It looks like pid 2290 was just an ordinary user session. Working backwards from its time in wait (31 minutes back from 2024-02-01 09:49:59.659), the system must already have been in trouble at around 09:18. Let's keep that timestamp in mind.
To sum up, we need to figure out why the database was already in trouble before 09:32. Could the archive destination have filled up, leaving the ARC processes unable to archive? I asked the netizen: the archive disk group still had plenty of free space, so that can be ruled out. The dump we looked at earlier already hinted at a problem, though. Could it be that abnormal waits caused connections to pile up until memory was exhausted?
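Archive space is also easy to double-check directly against the ASM disk group; a minimal sketch (the disk group name 'XXX_ARCH' is a placeholder):

-- Minimal sketch: confirm the archive destination disk group still has free space.
SELECT name, state, total_mb, free_mb,
       ROUND(free_mb / total_mb * 100, 1) AS pct_free
  FROM v$asm_diskgroup
 WHERE name = 'XXX_ARCH';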
Now let's look at the listener log together with the OSWatcher data.
[root@xxx tmp]# tail -50000 listener.log |fgrep "01-FEB-2024 09:" |fgrep "establish" |awk '{print $1 " " $2}' |awk -F: '{print $1 ":" $2 }' |sort |uniq -c
      6 01-FEB-2024 09:00
    289 01-FEB-2024 09:01
     17 01-FEB-2024 09:02
      1 01-FEB-2024 09:04
      1 01-FEB-2024 09:06
      9 01-FEB-2024 09:07
      7 01-FEB-2024 09:08
      3 01-FEB-2024 09:09
      4 01-FEB-2024 09:10
      1 01-FEB-2024 09:11
      3 01-FEB-2024 09:14
    191 01-FEB-2024 09:16
     20 01-FEB-2024 09:17
     20 01-FEB-2024 09:18
      2 01-FEB-2024 09:19
      1 01-FEB-2024 09:20
      1 01-FEB-2024 09:21
      3 01-FEB-2024 09:23
      3 01-FEB-2024 09:24
      2 01-FEB-2024 09:25
      2 01-FEB-2024 09:26
      3 01-FEB-2024 09:27
      1 01-FEB-2024 09:28
      1 01-FEB-2024 09:30
    219 01-FEB-2024 09:31
     16 01-FEB-2024 09:32
     23 01-FEB-2024 09:33
     27 01-FEB-2024 09:34
      1 01-FEB-2024 09:35
     16 01-FEB-2024 09:36
     30 01-FEB-2024 09:37
    459 01-FEB-2024 09:38
    397 01-FEB-2024 09:39
    ......
    256 01-FEB-2024 09:58
    274 01-FEB-2024 09:59
From the listener log, the peak was 459 connections per minute and the quiet minutes saw only a few dozen, which by itself is not that high. But if connections keep piling up for long enough, the limit is easily reached. From 09:31 to 10:23 is roughly 52 minutes; at an average of 200-300 connections per minute from the listener log, 50 minutes adds up to over ten thousand connections. With processes=8000 configured here, the limit was naturally blown through quickly. Not hard to understand!
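Whether connections are really eating into the processes limit can also be tracked against gv$resource_limit while this is going on; a minimal sketch:

-- Minimal sketch: current and high-water process/session usage vs. the configured limits (RAC-wide).
SELECT inst_id, resource_name, current_utilization, max_utilization, limit_value
  FROM gv$resource_limit
 WHERE resource_name IN ('processes', 'sessions');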
Let's look at the OSWatcher vmstat data around a few of these timestamps:
zzz ***Thu Feb 1 09:31:21 CST 2024
System configuration: lcpu=128 mem=262144MB ent=8.00
kthr    memory              page              faults              cpu
----- ----------- ------------------------ ------------ -----------------------
 r  b    avm    fre  re pi po fr sr cy    in     sy    cs us sy id wa    pc    ec
 2  0 44950952 436130  0  0  0  0  0  0 11095  96606 39410 27 10 62  1  5.44  68.0
 0  0 44955637 431359  0  0  0  0  0  0 14971 116316 53392 41 12 46  1  7.81  97.6
10  0 44952580 434027  0  0  0  0  0  0 17366  81886 42874 34 11 55  1  6.50  81.2
zzz ***Thu Feb 1 09:31:41 CST 2024
System configuration: lcpu=128 mem=262144MB ent=8.00
kthr    memory              page              faults              cpu
----- ----------- ------------------------ ------------ -----------------------
 r  b    avm    fre  re pi po fr sr cy    in     sy    cs us sy id wa    pc    ec
13  0 45316871  78841  0  0  0  0  0  0 14896 122684 58719 40 14 45  0  8.61 107.6
 7  0 45318075  77057  0  0  0  0  0  0 14918 151949 68454 41 14 44  1 10.19 127.4
15  0 45321226  73073  0  0  0  0  0  0 14001 157227 56818 41 14 45  0 10.20 127.5
zzz ***Thu Feb 1 09:32:01 CST 2024
We can see that at 09:31 free memory suddenly dropped from about 1.6G to roughly 280M (fre is counted in 4K pages), i.e. almost nothing was left free. The sudden drop is because the connection count jumped at that moment: from the listener statistics above, more than 200 connections were established during minute 31, so the fall in free memory is to be expected.
Minute 09:01 also saw a high connection count. Was free memory even lower at that point? You guessed it.
zzz ***Thu Feb 1 09:01:30 CST 2024
System configuration: lcpu=128 mem=262144MB ent=8.00
kthr    memory              page              faults              cpu
----- ----------- ------------------------ ------------ -----------------------
 r  b    avm    fre  re pi po    fr     sr cy    in     sy    cs us sy id wa    pc    ec
37  0 45190991  49283  0  0  0 38426 197710  0 15199 253559 73319 42 17 39  1 18.41 230.2
48  1 45247351  53540  0  0  0 60598 147876  0 12026 249960 68464 44 17 38  1 17.80 222.6
45  0 45253280  61481  0  0  0 16086  15853  0 12514 291799 82776 45 16 38  1 17.95 224.4
zzz ***Thu Feb 1 09:01:50 CST 2024
Free memory was basically below 200MB, on a machine with 256G of physical memory. Could the operating system parameters be misconfigured?
Sure enough, this turned out to be yet another customer following the so-called AIX "best practice": these two parameters had been set very high.

In fact the default values of these two parameters are already on the high side; I would not recommend setting them this high. We used to set them to 15%-20%.





