暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

lms进程异常 导致实例重启一例

原创 Roger 2011-07-11
1399
下午公司内部技术部培训时,同事接到某客户的电话,某rac的其中一个接到重启了,
希望让我们帮忙分析一下具体原因,同事把相关的日志也转发给我了,简单的分析了一下,

alert log如下:

Mon Jul 11 12:47:14 2011
Thread 2 advanced to log sequence 53143
  Current log# 3 seq# 53143 mem# 0: /dev/rdb_redo21_500m
Mon Jul 11 12:47:54 2011
ARC3: Standby redo logfile selected for thread 2 sequence 53142 for destination LOG_ARCHIVE_DEST_2
Mon Jul 11 12:52:48 2011
Errors in file /ora/app/admin/oraapp/bdump/oraapp2_lms0_1044892.trc:
ORA-00600: internal error code, arguments: [kjbrchkmode], [24], [0x700000041FD2CE0], [0x7000000E4CE1530],

 [], [], [], []
Mon Jul 11 12:52:51 2011
Errors in file /ora/app/admin/oraapp/bdump/oraapp2_lms0_1044892.trc:
ORA-00600: internal error code, arguments: [kjbrchkmode], [24], [0x700000041FD2CE0], [0x7000000E4CE1530],

[], [], [], []
Mon Jul 11 12:52:52 2011
Trace dumping is performing id=[cdmp_20110711125251]
Mon Jul 11 12:52:55 2011
Errors in file /ora/app/admin/oraapp/bdump/oraapp2_lms0_1044892.trc:
ORA-00600: internal error code, arguments: [kjbrchkmode], [24], [0x700000041FD2CE0], [0x7000000E4CE1530],

[], [], [], []
Mon Jul 11 12:52:55 2011
LMS0: terminating instance due to error 484
Instance terminated by LMS0, pid = 1044892
Mon Jul 11 12:53:41 2011
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 en6 192.168.182.0 configured from OCR for use as a cluster interconnect
Interface type 1 en4 10.66.81.0 configured from OCR for use as  a public interface
Picked latch-free SCN scheme 3
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled

从上面的信息来看,该节点在12点52分出现ora-600内部错误,仅仅不到10s的时间,就导致lms0进程异常中止,
最后导致该实例重启。

我们来看看对于的trace信息:
*** 2011-07-11 12:52:48.437
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [kjbrchkmode], [24], [0x700000041FD2CE0], [0x7000000E4CE1530], [], [], [], []
----- Call Stack Trace -----
calling              call     entry                argument values in hex     
location             type     point                (? means dubious value)    
-------------------- -------- -------------------- ----------------------------
ksedst+001c          bl       ksedst1              FFFFFFFFFFFC560 ? 000000000 ?
ksedmp+0290          bl       ksedst               104811C98 ?
ksfdmp+0018          bl       03F40AE0            
kgerinv+00dc         bl       _ptrgl              
kgeasnmierr+0040     bl       kgerinv              000000000 ? 000000000 ?
                                                   000000000 ? 11021E068 ?
                                                   FFFFFFFFFFFCCA0 ?
kjbrchkmode+0230     bl       03F3FD0C            
kjbriconsign+00d8    bl       kjbrchkmode          000000000 ? FFFFFFFFFFFCCA8 ?
kjbmiconsign+01f0    bl       kjbriconsign         10517A4BC ? 000000000 ?
                                                   000000002 ? 42C98000E0000 ?
                                                   0FFFFCCF0 ?
kjbmpocr+1114        bl       kjbmiconsign         110BFB640 ? 11021E068 ?
                                                   114C3177CB0279 ? 110C9F850 ?
                                                   FFFFFFFFFFFCCA0 ?
kjmxmpm+0238         bl       03F3FB24            
kjmpmsgi+050c        bl       kjmxmpm              0A25B7EE4 ? 600000000 ?
                                                   A25B00ED00000000 ?
                                                   000000000 ?
kjmsm+177c           bl       kjmpmsgi             110C6D820 ? 1EAD6A410 ?
                                                   100000000 ? 000000001 ?
                                                   0FFFFD780 ? 000000000 ?
ksbrdp+03e0          bl       _ptrgl              
opirip+03fc          bl       01FAFCDC            
opidrv+0448          bl       opirip               1102736F0 ? 410275030 ?
                                                   FFFFFFFFFFFF310 ?
sou2o+0090           bl       opidrv               320318411C ? 400000010 ?
                                                   FFFFFFFFFFFF310 ?
opimai_real+0150     bl       01FAB1D4            
main+0098            bl       opimai_real          000000000 ? 000000000 ?
__start+0098         bl       main                 000000000 ? 000000000 ?

 


*** 2011-07-11 12:52:51.089
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [kjbrchkmode], [24], [0x700000041FD2CE0], [0x7000000E4CE1530], [], [], [], []
----- Call Stack Trace -----
calling              call     entry                argument values in hex     
location             type     point                (? means dubious value)    
-------------------- -------- -------------------- ----------------------------
ksedst+001c          bl       ksedst1              FFFFFFFFFFFC9E0 ? 000000000 ?
ksedmp+0290          bl       ksedst               104811C98 ?
kjmpmsgi+0620        bl       03F40AE0            
kjmsm+177c           bl       kjmpmsgi             110C6D820 ? 1EAD6A410 ?
                                                   100000000 ? 000000001 ?
                                                   0FFFFD780 ? 000000000 ?
ksbrdp+03e0          bl       _ptrgl              
opirip+03fc          bl       01FAFCDC            
opidrv+0448          bl       opirip               1102736F0 ? 410275030 ?
                                                   FFFFFFFFFFFF310 ?
sou2o+0090           bl       opidrv               320318411C ? 400000010 ?
                                                   FFFFFFFFFFFF310 ?
opimai_real+0150     bl       01FAB1D4            
main+0098            bl       opimai_real          000000000 ? 000000000 ?
__start+0098         bl       main                 000000000 ? 000000000 ?


前后2次的ora--00600信息的,虽然call stack信息有点微小差别,不过基本上一致,根据600错误以及call stack
我们在metalink上查到一个bug,对比了一下call stack信息,90{39ecd679003247f2ed728ad9c7ed019a369dd84d0731b449c26bf628d3c1a20b}一致。虽然不能肯定的下结论,但是感觉也有点
八九不离十了的味道了。

下面咱们继续,既然是lms0进程异常中止,那么我们这里就很有必要来分析一下lms0进程的状态了,
于是在trace中搜索 process或lms0 找到如下信息:

PROCESS STATE
-------------
Process global information:
     process: 7000000ea3a7dd8, call: 7000000ea5c9a88, xact: 0, curses: 7000000ea578b08, usrses: 7000000ea580548
  ----------------------------------------
  SO: 7000000ea3a7dd8, type: 2, owner: 0, flag: INIT/-/-/0x00  ---从这里owner 为0 看出该so 已经处于顶层了
  (process) Oracle pid=7, calls cur/top: 7000000ea5c9a88/7000000ea5c7680, flag: (6) SYSTEM
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 0 0 90
              last post received-location: KJCS Post snd proxy to flush msg
              last process to post me: 7000000ea3a8d98 1 6
              last post sent: 0 0 93
              last post sent-location: kjbrcvdscn: scn acked
              last process posted by me: 7000000ea3a8d98 1 6
    (latch info) wait_event=0 bits=10
      holding    (efd=5) 7000000e8f688a0 Child gcs resource hash level=4 child#=20
        Location from where latch is held: kjbmpconvert:
        Context saved from call: 0
        state=busy, wlstate=free
    Process Group: DEFAULT, pseudo proc: 7000000e7418bc0
    O/S info: user: oracle, term: UNKNOWN, ospid: 1044892
    OSD pid info: Unix process pid: 1044892, image: oracle@t3smis_db_b (LMS0)
   
  
这里的gcs resource hash 是latch 名称,其level为4,可以查询v$latch.
搜索字符串7000000ea3a7dd8,发现如下内容:

SO: 7000000e1963000, type: 20, owner: 7000000ea3a7dd8, flag: -/-/-/0x00
namespace [KSXP] key   = [ 32 32 30 47 45 53 52 30 30 31 00 ]
    ----------------------------------------
    SO: 7000000ea580548, type: 4, owner: 7000000ea3a7dd8, flag: INIT/-/-/0x00
    (session) sid: 550 trans: 0, creator: 7000000ea3a7dd8, flag: (51) USR/- BSY/-/-/-/-/-
              DID: 0002-0007-00000003, short-term DID: 0000-0000-00000000
              txn branch: 0
              oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
    last wait for 'gcs remote message' blocking sess=0x0 seq=62654 wait_time=18631 seconds since wait started=4
                waittime=18, poll=0, event=0
    Dumping Session Wait History
     for 'gcs remote message' count=1 wait_time=18631
                waittime=18, poll=0, event=0
     for 'gcs remote message' count=1 wait_time=751
                waittime=18, poll=0, event=0
     for 'gcs remote message' count=1 wait_time=248
                waittime=18, poll=0, event=0
     for 'gcs remote message' count=1 wait_time=74
                waittime=18, poll=0, event=0
     for 'gcs remote message' count=1 wait_time=2099
                waittime=18, poll=0, event=0
     for 'gcs remote message' count=1 wait_time=7055
                waittime=18, poll=0, event=0
     for 'gcs remote message' count=1 wait_time=7010
                waittime=18, poll=0, event=0
     for 'gcs remote message' count=1 wait_time=56
                waittime=18, poll=0, event=0
     for 'gcs remote message' count=1 wait_time=6202
                waittime=18, poll=0, event=0
     for 'gcs remote message' count=1 wait_time=803
                waittime=18, poll=0, event=0
    temporary object counter: 0
      ----------------------------------------
      UOL used : 0 locks(used=0, free=0)
      KGX Atomic Operation Log 7000000e8fa1998
       Mutex 0(0, 0) idn 0 oper NONE
       Library Cache uid 550 efd 0 whr 0 slp 0
      KGX Atomic Operation Log 7000000e8fa19e0
       Mutex 0(0, 0) idn 0 oper NONE
       Library Cache uid 550 efd 0 whr 0 slp 0
      KGX Atomic Operation Log 7000000e8fa1a28
       Mutex 0(0, 0) idn 0 oper NONE
       Library Cache uid 550 efd 0 whr 0 slp 0
      ----------------------------------------
      SO: 7000000e08f57d8, type: 41, owner: 7000000ea580548, flag: INIT/-/-/0x00
      (dummy) nxc=0, nlb=0  
    ----------------------------------------
    SO: 7000000e765b078, type: 11, owner: 7000000ea3a7dd8, flag: INIT/-/-/0x00
    (broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: 7000000ea3a7dd8,
                       event: 7, last message event: 93,
                       last message waited event: 93, messages read: 1
                       channel: (7000000e767c640) system events broadcast channel
                                scope: 2, event: 38038, last mesage event: 93,
                                publishers/subscribers: 0/141,
                                messages published: 1
    ----------------------------------------
    SO: 7000000ea93e9e0, type: 19, owner: 7000000ea3a7dd8, flag: INIT/-/-/0x00
     GES MSG BUFFERS: st=emp chunk=0x0 hdr=0x0 lnk=0x0 flags=0x0 inc=16
      outq=2 sndq=0 opid=7 prmb=0x0
      mbg[i]=(4626201 97644168) mbg[b]=(54246 41376233) mbg[r]=(0 0)
      fmq[i]=(30 22) fmq[b]=(20 8) fmq[r]=(0 0)
      mop[s]=64557405 mop[q]=104801381 pendq=0 zmbq=0
    ------------process 0x7000000ea93e9e0--------------------
    proc version      : 0
    Local node        : 1
    pid               : 1044892
    lkp_node          : 1
    svr_mode          : 0
    proc state        : KJP_NORMAL
    Last drm hb acked : 0
    Total accesses    : 351576
    Imm.  accesses    : 351543
    Locks on ASTQ     : 0
    Locks Pending AST : 0
    Granted locks     : 0
    AST_Q:
    PENDING_Q:
    GRANTED_Q:
    KJM HIST LMS0:
      1:1 12:1:10 7:1 6:1 10:0 2:0 19:0 18:2 17:2 16:2
      15:2 12:18639 7:0 6:15 10:0 17:1 16:1 14:73:4 1:0 12:1:7
      7:0 6:0 10:1 17:1 16:1 12:753 7:1 6:0 10:0 17:2
      16:1 13:65521:4 20:84:3 1:0 12:1:7 7:1 6:0 10:0 17:2 16:1
      12:250 7:1 6:16 10:0 17:2 16:1 14:73:5 1:1 12:1:9 7:0
      6:0 10:1 17:1 16:1 12:78 7:0 6:0 10:1 17:2 21:32
      16:2 12:2105 7:0 6:14
     DEFER MSG QUEUE ON LMS0 IS EMPTY
     SEQUENCES:
      0:116284752.0  1:0.0
    ----------------------------------------
    SO: 7000000ea5c7680, type: 3, owner: 7000000ea3a7dd8, flag: INIT/-/-/0x00
    (call) sess: cur 7000000ea580548, rec 7000000ea578b08, usr 7000000ea580548; depth: 0
      ----------------------------------------
      SO: 7000000ea5c9a88, type: 3, owner: 7000000ea5c7680, flag: INIT/-/-/0x00
      (call) sess: cur 7000000ea578b08, rec 7000000ea578b08, usr 7000000ea580548; depth: 1
      ----------------------------------------
      SO: 7000000ea578b08, type: 4, owner: 7000000ea5c7680, flag: INIT/-/-/0x00
      (session) sid: 538 trans: 0, creator: 0, flag: (2) -/REC -/-/-/-/-/-
                DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000
                txn branch: 0
                oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
      temporary object counter: 0
    ----------------------------------------
    SO: 7000000e7842e80, type: 16, owner: 7000000ea3a7dd8, flag: INIT/-/-/0x00
    (osp req holder)
   
   从上面的rec 7000000ea578b08 我们可以判断为这个是一个递归操作,而后面的depth 1则是表示
   递归操作的层次。
   无奈之下,我猜测那个600错误中的字符串700000041fd2ce0 应该是个类似latch地址的东西,或者说是

一个资源的标示,于是搜索字符串700000041fd2ce0,发现如下信息:
    *** 2011-07-11 12:52:52.561
GLOBAL CACHE ELEMENT DUMP (address: 700000041fd2c70):
  id1: 0x42c98 id2: 0xe0000 obj: 0 block: (14/273560)
  lock: NL rls: 0x0000 acq: 0x0000 latch: 3
  flags: 0x41 fair: 0 recovery: 0
  bscn: 0x0.0 bctx: 0 write: 0 scan: 0x2000000 xflg: 0 xid: 0x0.0.0
  lcp: 0 lnk: [NULL] lch: [700000053f3aa30,700000053f3aa30]
  LIST OF BUFFERS LINKED TO THIS GLOBAL CACHE ELEMENT:
    flg: 0x08000000 state: SCURRENT mode: NULL
      addr: 700000053f3a930 obj: 47733 cls: DATA bscn: 0x2.cb164f84
 GCS UNKNOWN 700000041fd2ce0,0 sq[0,0] resp[0,0x0.0] pkey 0
   grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL
   master 158 owner 0 sid 0 remote[70206dbf7,0] hist 0x6006c
   history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 16 flags 0xc replay# -501677968
   disk: 0xdd6e.07000000 write request: 0x05dc.00000089
   pi scn: 0x0002.0045d1a8
   msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0
Dump of buffer cache at level 1 for tsn=9, rdba=58993816
BH (700000053f3a930) file#: 14 rdba: 0x03842c98 (14/273560) class: 1 ba: 700000053602000
  set: 10 blksize: 8192 bsi: 0 set-flg: 0 pwbcnt: 0
  dbwrid: 0 obj: 47733 objn: 47733 tsn: 9 afn: 14
  hash: [7000000e1d2bb78,7000000e1d2bb78] lru: [700000014f61a38,70000003bf83688]
  lru-flags: hot_buffer
  ckptq: [NULL] fileq: [NULL] objq: [700000020f63830,700000052f14028]
  st: SCURRENT md: NULL tch: 0 le: 700000041fd2c70
  flags: remote_transfer
  LRBA: [0x0.0.0] HSCN: [0xffff.ffffffff] HSUB: [65535]
   GCS RESOURCE 7000000e4ce1530 hashq [7000000d01f0ff0,7000000e4493420] name[0x42c98.e0000] pkey 47733
     grant 7000000e2190070 cvt 0 send 0,0 write 0,0@65535
     flag 0x0 mdrole 0x1 mode 1 scan 0 role LOCAL
     disk: 0x0000.00000000 write: 0x0000.00000000 cnt 0x0 hist 0x0
     xid 0x0000.000.00000000 sid 0
     pkey 47733
     hv 89 [stat 0x0, 1->1, wm 32767, RMno 0, reminc 12, dom 0]
     kjga st 0x4, step 0.0.0, cinc 16, rmno 1550, flags 0x0
     lb 0, hb 0, myb 3475, drmb 3475, apifrz 0
     GCS SHADOW 7000000e2190070,1272 sq[700000041fd2ce0,7000000e4ce1550] resp[7000000e4ce1530,0x42c98.e0000] pkey 47733
       grant 1 cvt 0 mdrole 0x21 st 0x40 GRANTQ rl LOCAL
       master 1 owner 0 sid 0 remote[700000045fb5940,1518] hist 0x8208225f
       history 0x1f.0x9.0x2.0x2.0x2.0x2. cflag 0x0 sender 0 flags 0x0 replay# 0
       disk: 0x0000.00000000 write request: 0x0000.00000000
       pi scn: 0x0000.00000000
       msgseq 0x3 updseq 0x0 reqids[1996,0,0] infop 0x0
     GCS UNKNOWN 700000041fd2ce0,0 sq[0,0] resp[0,0x0.0] pkey 0
       grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL
       master 158 owner 0 sid 0 remote[70206dbf7,0] hist 0x6006c
       history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 16 flags 0xc replay# -501677968
       disk: 0xdd6e.07000000 write request: 0x05dc.00000089
       pi scn: 0x0002.0045d1a8
       msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0
     GCS UNKNOWN 7000000dd6e9e10,0 sq[0,0] resp[0,0x0.0] pkey 0
       grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL
       master 147 owner 0 sid 0 remote[70206dc00,0] hist 0x6006c
       history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 24 flags 0xc replay# 0
       disk: 0xdd6e.07000000 write request: 0x05dc.00000089
       pi scn: 0x0002.0045daa8
       msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0
     GCS UNKNOWN 7000000dd6e9318,0 sq[0,0] resp[0,0x0.0] pkey 0
       grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL
       master 148 owner 0 sid 0 remote[70206dbff,0] hist 0x6006c
       history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 200 flags 0xc replay# 0
       disk: 0xdd6e.07000000 write request: 0x05dc.00000089
       pi scn: 0x0002.0045d9a8
       msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0
     GCS UNKNOWN 7000000dd6e94c8,0 sq[0,0] resp[0,0x0.0] pkey 0
       grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL
       master 118 owner 0 sid 0 remote[70206dbf6,0] hist 0x6006c
       history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 8 flags 0xc replay# 0
       disk: 0xdda6.07000000 write request: 0x05dc.00000089
       pi scn: 0x0002.0045d0a8
       msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0
     GCS UNKNOWN 7000000dda67608,0 sq[0,0] resp[0,0x0.0] pkey 0
       grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL
       master 158 owner 0 sid 0 remote[70206dbf5,0] hist 0x6006c
       history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 96 flags 0xc replay# 0
       disk: 0xdd4b.07000000 write request: 0x05dc.00000089
       pi scn: 0x0002.0045cfa8
       msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0
          ABORTING
lms abort after exception 600
KJC Communication Dump:
  state 0x5  flags 0x0  mode 0x0  inst 1  inc 16
  nrcv 3  nsp 3  nrcvbuf 1000
  reg_msg: sz 456  cur 229 (s:0 i:229) max 967  ini 1750
  big_msg: sz 8240  cur 21 (s:0 i:21) max 529  ini 1750
  rsv_msg: sz 8240  cur 0 (s:0 i:0) max 0  tot 1000
  rcvr: id 0  orapid 6  ospid 1098058
  rcvr: id 2  orapid 8  ospid 463194
  rcvr: id 1  orapid 7  ospid 1044892
  send proxy: id 0  ndst 1 (0:0 )
  send proxy: id 2  ndst 1 (0:2 )
  send proxy: id 1  ndst 1 (0:1 )
GES resource limits:
  ges resources: cur 0 max 0 ini 11712
  ges enqueues: cur 0 max 0 ini 17740
  ges cresources: cur 977 max 2335
  gcs resources: cur 91982 max 185193 ini 185193
  gcs shadows: cur 72232 max 135376 ini 185193
KJCTS state: seq-check:no  timeout:yes  waitticks:0x3  highload no
GES destination context:
Dest 0  rcvr 0  inc 16  state 0x10041  tstate 0x0

这里我发现GCS RESOURCE 7000000e4ce1530 这里的7000000e4ce1530 正是600错误中的错误字符串。
我们回头再来认真的看看上面的600错误,如下:
ORA-00600: internal error code, arguments: [kjbrchkmode], [24], [0x700000041FD2CE0], [0x7000000E4CE1530], [], [], [], []

其中字符串代码,700000041FD2CE0  7000000E4CE1530

*** 2011-07-11 06:46:53.780
kjdrvalidRMno: msg type 34 from node 0 dropped
   FUSION MSG 11094cc68,34 from[0,1073492] ver[16,1536] ln 128 sq[0,8]
        CLOSE [0x1794.80000, 47523] shadow [0,0] seq 0x2 act 1
          client [700000026fd0078,1025] reqid 1369 ordered 0
          grant 1 convert 0 role 0
          pi [0x0.0x0] flags 0x0 state 0x20
          disk scn 0x0.0 writereq scn 0x0.0 rreqid 0
          msgRM# 1536 bkt# 3826 drmbkt# 3826
     pkey 47523, stat 5, masters[32767, 32767->1], reminc 0, RM# 0 flg 0x0
     hv 47 [stat 0x0, 1->1, wm 32767, RMno 0, reminc 14, dom 0]
     kjga st 0x4, step 0.34.0, cinc 16, rmno 1536, flags 0x20
     lb 3072, hb 4095, myb 3826, drmb 3826, apifrz 1
kjmvalidate: drm drop a message RMno 1536 from 0 type 34
  mver 16 myver 16 seq 0.109921737 rseq 0.109921736 flag x6d
kjdrvalidRMno: msg type 34 from node 0 dropped
   FUSION MSG 11094cce8,34 from[0,1073492] ver[16,1536] ln 128 sq[0,8]
        CLOSE [0x178c.80000, 47523] shadow [0,0] seq 0x2 act 1
          client [70000005bfc6010,1475] reqid 1921 ordered 0
          grant 1 convert 0 role 0
          pi [0x0.0x0] flags 0x0 state 0x20
          disk scn 0x0.0 writereq scn 0x0.0 rreqid 0
          msgRM# 1536 bkt# 3825 drmbkt# 3825
     pkey 47523, stat 5, masters[32767, 32767->1], reminc 0, RM# 0 flg 0x0
     hv 47 [stat 0x0, 1->1, wm 32767, RMno 0, reminc 14, dom 0]
     kjga st 0x4, step 0.34.0, cinc 16, rmno 1536, flags 0x20
     lb 3072, hb 4095, myb 3825, drmb 3825, apifrz 1
kjmvalidate: drm drop a message RMno 1536 from 0 type 34
  mver 16 myver 16 seq 0.109921738 rseq 0.109921737 flag x6d
kjdrvalidRMno: msg type 34 from node 0 dropped
   FUSION MSG 11094cd68,34 from[0,1073492] ver[16,1536] ln 128 sq[0,8]
        CLOSE [0x17cb.80000, 47523] shadow [0,0] seq 0x2 act 1
          client [70000001fff22f0,1158] reqid 1407 ordered 0
          grant 1 convert 0 role 0
          pi [0x0.0x0] flags 0x0 state 0x20
          disk scn 0x0.0 writereq scn 0x0.0 rreqid 0
          msgRM# 1536 bkt# 3833 drmbkt# 3833
     pkey 47523, stat 5, masters[32767, 32767->1], reminc 0, RM# 0 flg 0x0
     hv 47 [stat 0x0, 1->1, wm 32767, RMno 0, reminc 14, dom 0]
     kjga st 0x4, step 0.34.0, cinc 16, rmno 1536, flags 0x20
     lb 3072, hb 4095, myb 3833, drmb 3833, apifrz 1
kjmvalidate: drm drop a message RMno 1536 from 0 type 34
  mver 16 myver 16 seq 0.109921739 rseq 0.109921738 flag x6d
kjdrvalidRMno: msg type 34 from node 0 dropped
   FUSION MSG 11094cde8,34 from[0,1073492] ver[16,1536] ln 128 sq[0,8]
        CLOSE [0x17c3.80000, 47523] shadow [0,0] seq 0x2 act 1
          client [70000005bff7d90,1289] reqid 1807 ordered 0
          grant 1 convert 0 role 0
          pi [0x0.0x0] flags 0x0 state 0x20
          disk scn 0x0.0 writereq scn 0x0.0 rreqid 0
          msgRM# 1536 bkt# 3832 drmbkt# 3832
     pkey 47523, stat 5, masters[32767, 32767->1], reminc 0, RM# 0 flg 0x0
     hv 47 [stat 0x0, 1->1, wm 32767, RMno 0, reminc 14, dom 0]
     kjga st 0x4, step 0.34.0, cinc 16, rmno 1536, flags 0x20
     lb 3072, hb 4095, myb 3832, drmb 3832, apifrz 1
kjmvalidate: drm drop a message RMno 1536 from 0 type 34
  mver 16 myver 16 seq 0.109921740 rseq 0.109921739 flag x6d
*** 2011-07-11 12:52:48.432
 GCS UNKNOWN 700000041fd2ce0,0 sq[0,0] resp[0,0x0.0] pkey 0
   grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL
   master 158 owner 0 sid 0 remote[70206dbf7,0] hist 0x6006c
   history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 16 flags 0xc replay# -501677968
   disk: 0xdd6e.07000000 write request: 0x05dc.00000089
   pi scn: 0x0002.0045d1a8
   msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0
 GCS RESOURCE 7000000e4ce1530 hashq [7000000d01f0ff0,7000000e4493420] name[0x42c98.e0000] pkey 47733
   grant 7000000e2190070 cvt 0 send 0,0 write 0,0@65535
   flag 0x0 mdrole 0x1 mode 1 scan 0 role LOCAL
   disk: 0x0000.00000000 write: 0x0000.00000000 cnt 0x0 hist 0x0
   xid 0x0000.000.00000000 sid 0
   pkey 47733
   hv 89 [stat 0x0, 1->1, wm 32767, RMno 0, reminc 12, dom 0]
   kjga st 0x4, step 0.0.0, cinc 16, rmno 1550, flags 0x0
   lb 0, hb 0, myb 3475, drmb 3475, apifrz 0
   GCS SHADOW 7000000e2190070,1272 sq[700000041fd2ce0,7000000e4ce1550] resp[7000000e4ce1530,0x42c98.e0000] pkey 47733
     grant 1 cvt 0 mdrole 0x21 st 0x40 GRANTQ rl LOCAL
     master 1 owner 0 sid 0 remote[700000045fb5940,1518] hist 0x8208225f
     history 0x1f.0x9.0x2.0x2.0x2.0x2. cflag 0x0 sender 0 flags 0x0 replay# 0
     disk: 0x0000.00000000 write request: 0x0000.00000000
     pi scn: 0x0000.00000000
     msgseq 0x3 updseq 0x0 reqids[1996,0,0] infop 0x0
   GCS UNKNOWN 700000041fd2ce0,0 sq[0,0] resp[0,0x0.0] pkey 0
     grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL
     master 158 owner 0 sid 0 remote[70206dbf7,0] hist 0x6006c
     history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 16 flags 0xc replay# -501677968
     disk: 0xdd6e.07000000 write request: 0x05dc.00000089
     pi scn: 0x0002.0045d1a8
     msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0
   GCS UNKNOWN 7000000dd6e9e10,0 sq[0,0] resp[0,0x0.0] pkey 0
     grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL
     master 147 owner 0 sid 0 remote[70206dc00,0] hist 0x6006c
     history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 24 flags 0xc replay# 0
     disk: 0xdd6e.07000000 write request: 0x05dc.00000089
     pi scn: 0x0002.0045daa8
     msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0
   GCS UNKNOWN 7000000dd6e9318,0 sq[0,0] resp[0,0x0.0] pkey 0
     grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL
     master 148 owner 0 sid 0 remote[70206dbff,0] hist 0x6006c
     history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 200 flags 0xc replay# 0
     disk: 0xdd6e.07000000 write request: 0x05dc.00000089
     pi scn: 0x0002.0045d9a8
     msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0
   GCS UNKNOWN 7000000dd6e94c8,0 sq[0,0] resp[0,0x0.0] pkey 0
     grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL
     master 118 owner 0 sid 0 remote[70206dbf6,0] hist 0x6006c
     history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 8 flags 0xc replay# 0
     disk: 0xdda6.07000000 write request: 0x05dc.00000089
     pi scn: 0x0002.0045d0a8
     msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0
   GCS UNKNOWN 7000000dda67608,0 sq[0,0] resp[0,0x0.0] pkey 0
     grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL
     master 158 owner 0 sid 0 remote[70206dbf5,0] hist 0x6006c
     history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 96 flags 0xc replay# 0
     disk: 0xdd4b.07000000 write request: 0x05dc.00000089
     pi scn: 0x0002.0045cfa8
     msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0
        ABORTING
*** 2011-07-11 12:52:48.437
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [kjbrchkmode], [24], [0x700000041FD2CE0], [0x7000000E4CE1530], [], [], [], []

从上看,我感觉似乎跟DRM有点关系,至于是不是一定有关系,我不敢肯定,隐隐的感觉有点关联。

到最后,我们有必要来回顾一下lms进程的作用,说的简单一点,该进程的作用就是管理集群节点之间数据block的访问,并在节点之间传递cache buffer

的块镜像,,当lms进程跨节点请求数据时,为了确保数据的一致性,在同一时刻,保证只能有一个实例去更新一个数据块,该进程跟lmd进程协同作用,

所以通常来说,lms出问题,lmd也会出问题。

我们再回到这个问题上来,lms进程为啥为异常abort?从上面gcs信息来看,是由于请求的资源和所获取到的资源不一致导致,而又由于需要

保证数据的一致性,进而导致lms abort。 正常情况下,7000000E4CE1530应该跟700000041FD2CE0一样才对。

当然上面也是我的个人猜测,如果大家有什么高见,欢迎讨论。

最后我们给客户的答复是上面判断的那个bug,虽然有有点无奈,不过也没办法。关于该bug的详细描述如下:
BOTH RAC INSTANCES CRASHED WITH ORA-600 [kjbrchkmode:last] [ID 987822.1]

--------------------------------------------------------------------------------
 
  修改时间 27-AUG-2010     类型 PROBLEM     状态 PUBLISHED  

In this Document
  Symptoms
  Cause
  Solution

--------------------------------------------------------------------------------


Symptoms
RAC Instance crashed after the LMS process terminated with  ORA-600 [kjbrchkmode:last], [60], [0x5F91E9390], [2], [], [], [], []

The corresponding LMS tracefile showed the following call stack:

ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [kjbrchkmode:last], [60], [0x5F91E9390], [2], [], [], [], []
----- Call Stack Trace -----
ksedmp kgerinv kgeasnmierr kjbrchkmode kjbmprlst kjmxmpm kjmpmsgi kjmsm ksbrdp opirip opidrv sou2o opimai_real main

Cause
The issue is caused by:
BUG 6239052 INSTANCE CRASH DUE TO ORA-600 [KJBRCHKMODE LAST]


The fix for bug 6239052 is included in patchset 10.2.0.4, and 11g onwards.

 
「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论