Rac 实例被莫名重启案例分析

Roger 2019-06-17

1147

问题描述

某客户的一套rac，其中node1 被莫名的重启了，如下：

Wed Aug 14 08:25:54 2013
Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_lms8_20657014.trc:
ORA-29740: evicted by member , group incarnation
Wed Aug 14 08:25:54 2013
Doing block recovery for file 204 block 35113
Wed Aug 14 08:26:00 2013
Shutting down instance (abort)
License high water mark = 5209
Wed Aug 14 08:26:03 2013
Termination issued to instance processes. Waiting for the processes to exit
Wed Aug 14 08:26:09 2013
Instance termination failed to kill one or more processes
Instance terminated by LMON, pid = 15573208
Wed Aug 14 08:26:10 2013
Instance terminated by USER, pid = 15335602
Wed Aug 14 08:27:17 2013
Starting ORACLE instance (normal)

同时在该时间点，rac的另外一个节点进行了reconfig，信息如下：

<pre class="brush:php">Wed Aug 14 08:23:20 2013
IPC Send timeout detected.Sender: ospid 454822
Receiver: inst 1 binc 57382989 ospid 14856422
Wed Aug 14 08:23:29 2013
IPC Send timeout to 0.4 inc 4 for msg type 65516 from opid 10
Wed Aug 14 08:23:29 2013
Communications reconfiguration: instance_number 1
Wed Aug 14 08:23:30 2013
Trace dumping is performing id=[cdmp_20130814082329]
Wed Aug 14 08:25:19 2013
Evicting instance 1 from cluster
Wed Aug 14 08:26:09 2013
Reconfiguration started (old inc 4, new inc 8)
List of nodes:
 1
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Wed Aug 14 08:26:10 2013
 LMS 5: 23 GCS shadows cancelled, 1 closed
Wed Aug 14 08:26:10 2013

专家解答

从上面这部分信息看不出什么东西，我们仔细来看下node1 的alert log信息：

<pre class="brush:php">Wed Aug 14 08:23:20 2013
IPC Send timeout detected. Receiver ospid 14856422     ++++根据ospid我们发现，14856422 正是lms3.
Receiver is waiting for a latch dumping latch state for receiver -16444
Wed Aug 14 08:23:50 2013
Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_lms3_14856422.trc:
Wed Aug 14 08:23:54 2013
Trace dumping is performing id=[cdmp_20130814082329]
Wed Aug 14 08:25:51 2013
Errors in file /oraclelog/ngcrmdb/udump/ngcrmdb1_ora_7504228.trc:
Wed Aug 14 08:25:54 2013
Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_lmon_15573208.trc:
ORA-29740: evicted by member 1, group incarnation 6
Wed Aug 14 08:25:54 2013                              +++++lmon 进程终止，抛出ora-29740错误
LMON: terminating instance due to error 29740
Wed Aug 14 08:25:54 2013
Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_pmon_6075054.trc:
ORA-29740: evicted by member , group incarnation
……..
……
Wed Aug 14 08:25:54 2013
Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_lms9_14266524.trc:
ORA-29740: evicted by member , group incarnation
Wed Aug 14 08:25:54 2013
Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_lms8_20657014.trc:
ORA-29740: evicted by member , group incarnation
Wed Aug 14 08:25:54 2013
Doing block recovery for file 204 block 35113
Wed Aug 14 08:26:00 2013
Shutting down instance (abort)

同时，我们查看node 2节点的alert log也发现出现了node 1节点lms3进程超时的信息，如下：

Wed Aug 14 08:23:20 2013
IPC Send timeout detected.Sender: ospid 454822     +++++ crm2 节点的lms3进程
Receiver: inst 1 binc 57382989 ospid 14856422      +++++ crm1 节点的lms3进程
Wed Aug 14 08:23:29 2013
IPC Send timeout to 0.4 inc 4 for msg type 65516 from opid 10
Wed Aug 14 08:23:29 2013
Communications reconfiguration: instance_number 1
Wed Aug 14 08:23:30 2013
Trace dumping is performing id=[cdmp_20130814082329]
Wed Aug 14 08:25:19 2013
Evicting instance 1 from cluster
Wed Aug 14 08:26:09 2013
Reconfiguration started (old inc 4, new inc 8)
List of nodes:
 1
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Wed Aug 14 08:26:10 2013
 LMS 5: 23 GCS shadows cancelled, 1 closed
Wed Aug 14 08:26:10 2013

通过node 2节点的trace 文件 /oraclelog/ngcrmdb/bdump/ngcrmdb2_lms3_454822.trc，可以确认，

<pre class="brush:php">Dump file /oraclelog/ngcrmdb/bdump/ngcrmdb2_lms3_454822.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining Scoring Engine
and Real Application Testing options
ORACLE_HOME = /oracle/app/product/10.2.0/db_1
System name:	AIX
Node name:	ngcrmdb2
Release:	3
Version:	5
Machine:	00C8A6E24C00
Instance name: ngcrmdb2
Redo thread mounted by this instance: 0 &lt;none&gt;
Oracle process number: 10
Unix process pid: 454822, image: oracle@ngcrmdb2 (LMS3)

......
......
*** 2013-08-12 10:34:58.266
kjbmskipref(0x312f.3b10000)(seqs 0x13361338)(rids 0x13f813f9)(from 0x192)
*** 2013-08-14 08:23:20.471
*** 2013-08-14 08:23:20.474
IPC Send timeout detected.Sender: ospid 454822
Receiver: inst 1 binc 57382989 ospid 14856422
SKGXPCTX: 0x1104103b0 ctx
WAIT HISTORY
Time(msec)	 Wait Type	 Return Code
----------	 ---------	 ------------
14		 NORMAL		 POSTED
30		 NORMAL		 TIMEDOUT
30		 NORMAL		 TIMEDOUT
30		 NORMAL		 TIMEDOUT
30		 NORMAL		 TIMEDOUT
30		 NORMAL		 TIMEDOUT
30		 NORMAL		 TIMEDOUT
30		 NORMAL		 TIMEDOUT
30		 NORMAL		 TIMEDOUT
30		 NORMAL		 TIMEDOUT
30		 NORMAL		 TIMEDOUT
30		 NORMAL		 TIMEDOUT
30		 NORMAL		 TIMEDOUT
30		 NORMAL		 TIMEDOUT
30		 NORMAL		 TIMEDOUT
30		 NORMAL		 TIMEDOUT
wait delta 0 sec (3 msec) ctx ts 0x57cd2bd2 last ts 0x57cd2bd2
user cpu time since last wait 0 sec 0 ticks
system cpu time since last wait 0 sec 0 ticks
locked 1
blocked 8
timed wait receives 87
admno 0xbc1dc98 admport:
SSKGXPT 0x104111c0 flags  	socket no 11 	IP 192.168.5.2 	UDP 33191
context timestamp 0x57cd2bd2
buffers queued on port 1104cff40
    sconno     accono   ertt  state   seq#   sent  async   sync rtrans   acks
0x7dd68acd 0x12d3831a   1024      3  42918 292492860 292492982      0   2695 286577042
slot 0 rqh=1110ca6c8
seq=42912 len=176 accno=0x12d3831a start TS=0x57cd289a rt TS=0x57cd2c9a X CNT=1
slot 1 rqh=1110ca2e0
seq=42913 len=176 accno=0x12d3831a start TS=0x57cd289a rt TS=0x57cd2c9a X CNT=1
slot 2 rqh=1110c9ef8
……

我们可以看出是因为实例之间lms3进程通信交互出现超时导致crm1节点

被驱逐重启。根据crm1节点的alert log信息，我们可以整理一个如下的时间关系:

Aug 14 08:23:20 –lms3 Time out（lmon check发现）

Aug 14 08:25:54 –抛出ORA-29740: evicted by member 1, group incarnation 6错误，lmon 进程终止

Aug 14 08:25:54 –完成block recover（Doing block recovery for file 204 block 35113），由smon完成.

Aug 14 08:26:00 –crm1实例 abort关闭.

通过node 1节点的告警日志，我们可以发现，在Aug 14 08:23:20 时刻，lms3进程出现time out。通常来讲，lms进程出现超时，可能是由于系统压力过大，即load过高导致。下面通过分析lgwr,lmd以及lmon 进程的trace来证实这一点。首先我们来分析一下lgwr trace：

*** 2013-08-13 02:29:49.603
Warning: log write time 1500ms, size 278KB
*** 2013-08-13 03:24:00.496
Warning: log write time 1440ms, size 13397KB
*** 2013-08-13 03:24:04.213
Warning: log write time 540ms, size 18775KB
*** 2013-08-13 03:24:05.298
Warning: log write time 1080ms, size 10027KB
*** 2013-08-13 03:24:09.590
Warning: log write time 540ms, size 10124KB
*** 2013-08-13 03:24:10.114
Warning: log write time 520ms, size 18577KB
*** 2013-08-13 03:24:14.393
Warning: log write time 540ms, size 16529KB
*** 2013-08-13 03:24:15.400
Warning: log write time 1010ms, size 13973KB
*** 2013-08-13 03:24:16.813
Warning: log write time 500ms, size 20055KB
*** 2013-08-13 03:24:19.572
Warning: log write time 1360ms, size 11320KB
*** 2013-08-13 03:24:22.962
Warning: log write time 760ms, size 18383KB
*** 2013-08-13 03:24:25.383
Warning: log write time 510ms, size 13886KB
*** 2013-08-13 03:24:26.791
Warning: log write time 1410ms, size 14385KB
*** 2013-08-13 03:24:41.486
Warning: log write time 890ms, size 11268KB
*** 2013-08-13 03:24:42.371
Warning: log write time 880ms, size 27178KB
*** 2013-08-13 03:24:55.580
Warning: log write time 570ms, size 14009KB
*** 2013-08-13 03:24:58.326
Warning: log write time 2730ms, size 31518KB
*** 2013-08-13 09:53:36.967
Warning: log write time 1740ms, size 5KB
*** 2013-08-13 12:10:41.138
Warning: log write time 580ms, size 2KB
*** 2013-08-13 14:39:31.894
Warning: log write time 1080ms, size 102KB
*** 2013-08-14 00:51:42.142
Warning: log write time 800ms, size 347KB
*** 2013-08-14 01:20:31.054
Warning: log write time 510ms, size 1455KB
*** 2013-08-14 01:24:35.160
Warning: log write time 500ms, size 463KB
*** 2013-08-14 01:24:36.246
Warning: log write time 1080ms, size 887KB
*** 2013-08-14 01:24:37.135
Warning: log write time 890ms, size 2201KB
*** 2013-08-14 02:04:42.532
Warning: log write time 590ms, size 14KB
*** 2013-08-14 02:06:04.027
Warning: log write time 570ms, size 43KB
*** 2013-08-14 03:25:27.431
Warning: log write time 500ms, size 13805KB
*** 2013-08-14 03:25:42.751
Warning: log write time 650ms, size 1919KB
*** 2013-08-14 03:25:52.230
Warning: log write time 1130ms, size 35575KB
*** 2013-08-14 03:25:52.737
Warning: log write time 500ms, size 9288KB
*** 2013-08-14 03:25:54.660
Warning: log write time 1020ms, size 9277KB
*** 2013-08-14 03:25:55.581
Warning: log write time 910ms, size 36265KB
*** 2013-08-14 03:25:56.369
Warning: log write time 500ms, size 36267KB
*** 2013-08-14 03:25:58.585
Warning: log write time 1010ms, size 14202KB
tkcrrsarc: (WARN) Failed to find ARCH for message (message:0x1)
tkcrrpa: (WARN) Failed initial attempt to send ARCH message (message:0x1)
*** 2013-08-14 03:59:30.005
Warning: log write time 540ms, size 89KB
*** 2013-08-14 08:25:54.390
error 29740 detected in background process
ORA-29740: evicted by member , group incarnation

lgwr 进程trace仅仅截取了一小部分，从这部分内容来看，在凌晨03:25~03:59这个时间段，出现了log write 超时。通过对比可以发现，其实每天的03:25左右，lgwr write超时都是最严重的，说明这个时间点系统的IO压力是最大的.然而03:59~ 08:25之间并没有出现别的信息，说明这个时间端内，系统的IO压力应该不大，但是这并不能说明系统的负载不高。下面我们继续分析lmon 进程的trace内容Trace ngcrmdb1_lmon_6627610.trc内容如下：

Unix process pid: 6627610, image: oracle@ngcrmdb1 (LMON)

*** SERVICE NAME:() 2013-08-14 08:27:21.684
*** SESSION ID:(13202.5) 2013-08-14 08:27:21.684
GES resources 242662 pool 56
GES enqueues 378762
GES IPC: Receivers 15  Senders 15
GES IPC: Buffers  Receive 1000  Send (i:24750 b:1913) Reserve 1000
GES IPC: Msg Size  Regular 408  Batch 8192
Batching factor: enqueue replay 201, ack 224
Batching factor: cache replay 126 size per lock 64
kjxggin: receive buffer size = 32768
high load threshold = 200704                    ++++ load过高，已经超过threshold
*** 2013-08-14 08:27:26.865
kjxgmrcfg: Reconfiguration started, reason 1    +++++ 实例进行reconfiguartion，说明这个时间点，crm1节点被restart了
kjxgmcs: Setting state to 0 0.                  +++++ reason为0，表示不属于cluster了
*** 2013-08-14 08:27:26.867
 Name Service frozen
kjxgmcs: Setting state to 0 1.          +++++ 节点被重新加入到cluster，状态改变为1（表示被monitor）
kjxgrssvote: reconfig bitmap chksum 0xfa98ff9 cnt 2 master 0 ret 0
kjfcpiora: published my fusion master weight 7699028
kjfcpiora: published my enqueue weight 242662
kjfcpiora: publish my flogb 16
kjfcpiora: publish my cluster_database_instances parameter = 2
kjxggpoll: change poll time to 50 ms
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 10 2.
 kjfmuin: bitmap 0 1
kjfmmhi: received msg from 0 (inc 10)
 kjfmmhi: received msg from 1 (inc 4)
 Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 10 3.
 Name Service recovery started
 Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 10 4.
 Multicasted all local name entries for publish
 Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 10 5.
 Name Service normal
 Name Service recovery done
*** 2013-08-14 08:27:27.464
kjxgmps: proposing substate 6
kjxgmcs: Setting state to 10 6.
*** 2013-08-14 08:27:27.565
*** 2013-08-14 08:27:27.565
Reconfiguration started (old inc 0, new inc 10)
Dynamic remastering is disabled
Synchronization timeout interval: 900 sec
List of nodes:
 0 1
*** 2013-08-14 08:27:27.565
kjxggpoll: change poll time to 600 ms
 Global Resource Directory frozen
node 0
release 10 2 0 4
node 1
release 10 2 0 4
 number of mastership buckets = 128
。。。。。。
 Post SMON to start 1st pass IR
 Fix write in gcs resources
 sent syncr inc 10 lvl 8 to 0 (10,24/0/0)
sent synca inc 10 lvl 8 (10,24/0/0)
*** 2013-08-14 08:27:36.170                     +++++reconfig完成
Reconfiguration complete
*   domain 0 valid?: 1
*** 2013-08-14 08:27:41.581
kjxgrtmc2: mounting member 0 thread 1           +++++mount thread（crm1）

从该trace来看，2013-08-14 08:27:26开始reconfig，到2013-08-14 08:27:41 reconfig完成。一共经历了15秒。从上面信息来判断，

lms3进程压力偏大，然后这里的load仅仅是指单个进程，而并非整个系统。在rac中，lmon进程主要作用是监控集群的global

enqueues和resources资源，且会执行global enqueue recovery。实例异常终止后，lmon进程会进行gcs内存方面的处理。当一个实例

加入或者离开集群后，lmon进程会对lock和resource进行reconfiguration.

此外lmon 进程还会在cluster中不同的实例间进行通讯检查，如果发现对方通讯超时，就会发出节点eviction，将该node驱逐。

在另外一个ngcrmdb1_lmon_15573208.trc 中记录了lms 进程的详细信息，如下：

*** 2013-07-28 07:12:17.257
Reconfiguration complete
*   domain 0 valid?: 1
*** 2013-08-14 08:20:30.089
kjfmrcvrchk: receiver LMS[3] check timed out (1376439430.1376439630.0).  +++发现lms3 time out
kjfmrcvrchk: Dumping callstack of lms3
Submitting asynchronized dump request [20]
*** 2013-08-14 08:23:29.891
kjxgrgetresults: Detect reconfig from 1, seq 5, reason 3         +++进行reconfig 保护，避免重启
kjxgrrcfgchk: Initiating reconfig, reason 3
kjxgrrcfgchk: Increased trace level for comm reconfig from 0 to 10
kjxgrrcfgchk: prev pstate 5  mapsz 512
kjxgrrcfgchk: new  bmp: 0 1
kjxgrrcfgchk: work bmp: 0 1
kjxgrrcfgchk: rr  bmp: 0 1
kjxgrrcfgchk: the instance is NOT in good health.
*** 2013-08-14 08:23:30.496
kjxgrgetresults: Detect reconfig from 1, seq 5, reason 3  +++reason 3，表示集群中的node通讯失败
*** 2013-08-14 08:23:31.100
kjxgrgetresults: Detect reconfig from 1, seq 5, reason 3
.......
.......
kjxgrgetresults: Detect reconfig from 1, seq 5, reason 3
*** 2013-08-14 08:23:37.161
kjxgrnbrhbmsg: Sending HB chk req to 1, HB 822424272
*** 2013-08-14 08:23:37.162
kjxgrmsghndlr: Recvd msg 1104e60a0 type 5 from 1, seq (4,5)
*** 2013-08-14 08:23:37.162
Received heartbeat ack from 1
*** 2013-08-14 08:23:37.767
kjxgrgetresults: Detect reconfig from 1, seq 5, reason 3
*** 2013-08-14 08:23:38.391
.......
.......
kjxgrgetresults: Detect reconfig from 1, seq 5, reason 3
*** 2013-08-14 08:23:42.040
kjxgrnbrhbmsg: Sending HB chk req to 1, HB 822424272
*** 2013-08-14 08:23:42.041
kjxgrmsghndlr: Recvd msg 1104e6640 type 5 from 1, seq (4,5)
*** 2013-08-14 08:23:42.041
Received heartbeat ack from 1
*** 2013-08-14 08:23:42.646
kjxgrgetresults: Detect reconfig from 1, seq 5, reason 3
*** 2013-08-14 08:23:43.291
kjxgrgetresults: Detect reconfig from 1, seq 5, reason 3
*** 2013-08-14 08:23:43.897
.......
.......
*** 2013-08-14 08:25:14.505
kjxgrnbrhbmsg: Sending HB chk req to 1, HB 822424272
*** 2013-08-14 08:25:14.506
kjxgrmsghndlr: Recvd msg 1104eda80 type 5 from 1, seq (4,5)
*** 2013-08-14 08:25:14.506
Received heartbeat ack from 1
*** 2013-08-14 08:25:15.110
kjxgrgetresults: Detect reconfig from 1, seq 5, reason 3
.......
.......
*** 2013-08-14 08:25:19.400
kjxgrgetresults: Detect reconfig from 1, seq 5, reason 3
*** 2013-08-14 08:25:19.400
kjxgrnbrhbmsg: Sending HB chk req to 1, HB 822424272
*** 2013-08-14 08:25:19.402
kjxgrmsghndlr: Recvd msg 1104ee020 type 5 from 1, seq (4,5)
*** 2013-08-14 08:25:19.402
Received heartbeat ack from 1
*** 2013-08-14 08:25:20.005
kjxgrdtrt: Evicted by 1, seq (6, 4)
IMR state information
  Member 0, thread 1, state 0x4:4, flags 0x0040
  RR seq commit 4  cur 4
  Propstate 5  prv 4  pending 0
  system load 3402 (normal)

  Member information:
    Member 0, incarn 2, version 0x36b98a7, thrd 1
      prev thrd 65535, status 0x0007 (JRM.), err 0x0000
    Member 1, incarn 4, version 0x370baab, thrd 2
      prev thrd 65535, status 0x0007 (JRM.), err 0x0000

根据上面信息整理一个如下时间关系：

2013-08-14 08:20:30   ----check lms3 出现time out
2013-08-14 08:23:29   ----提高process level，进行reconfig detect保护
2013-08-14 08:25:20   ----发出Evicted 申请

前面我们已经知道了是由于lms3进程交互出现超时导致实例被重启。那么crm系统在该时间点load真的很高吗？

我们通过数据来进行验证：我们重点来看下lms3 进程的diag信息，如下：

<pre class="brush:php">*** 2013-08-14 08:25:27.917
Dumping diagnostic information for ospid 14856422:
OS pid = 14856422
loadavg : 2.49 3.72 5.99
swap info: free_mem = 7493.79M rsv = 256.00M
           alloc = 2105.66M avail = 65536.00M swap_free = 63430.34M
       F S      UID      PID     PPID   C PRI NI ADDR    SZ    WCHAN    STIME    TTY  TIME CMD
  240103 A   oracle 14856422        1   4  39 -- 1f059f6510 110936            Jul 28      - 1414:58 ora_lms3_ngcrmdb1
open: The file access permissions do not allow the specified action.

可以看到，该进程上次启动时间是7月28号，该进程的pri值是39. 该进程的平均load为：2.49 3.72 5.99

从这里看来，操作系统load 并不高。通过grep 该trace，可以看到其他lms进程的loadavg 都相差不大，如下：

[root@killdb ~]# cat ngcrmdb1_lmon_15573208.trc |grep 'loadavg :'
loadavg : 2.21 3.71 6.01
loadavg : 2.35 3.72 6.00
loadavg : 2.35 3.72 6.00
loadavg : 2.49 3.72 5.99
loadavg : 2.49 3.72 5.99
loadavg : 2.49 3.72 5.99
loadavg : 2.85 3.78 6.00
loadavg : 2.85 3.78 6.00
loadavg : 2.86 3.76 5.98
loadavg : 2.86 3.76 5.98
loadavg : 2.86 3.76 5.98
loadavg : 2.79 3.74 5.96
loadavg : 2.79 3.74 5.96
loadavg : 2.81 3.72 5.94
loadavg : 2.81 3.72 5.94
loadavg : 2.81 3.72 5.94

到这里，我们可以发现，lms3 进程load其实并不算高，那么为什么会等待超时呢？以至于rac实例之间lms3

进程交互出现超时导致实例被驱逐重启。那么lms3 进程等待什么呢？我们来看下node1 lms3进程的信息：

可以发现lms3 一直在等待cache buffer chains，且wait time很长，以至于出现超时. 一般情况下，cbc latch的持有时间
都是非常短暂的，通常都在毫秒级别，然而这里其等待时间却高达几百秒.前面发现lms 3进程申请的资源70000150072db50
被 orapid 1386持有，那么我们观察一下orapid 1386是什么进程：
<pre class="brush:php">SO: 70000154841a800, type: 2, owner: 0, flag: INIT/-/-/0x00
  (process) Oracle pid=1386, calls cur/top: 70000146b345730/700001469390a10, flag: (0) -
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 109 0 4
              last post received-location: kslpsr
              last process to post me: 7000015453a5098 1 6
              last post sent: 0 0 24
              last post sent-location: ksasnd
              last process posted by me: 7000015453a5098 1 6
  (latch info) wait_event=0 bits=2
    holding    (efd=16) 70000150072db50 Child cache buffers chains level=1 child#=476313
        Location from where latch is held: kcbgtcr: kslbegin excl:
        Context saved from call: 557847035
        state=busy(exclusive) (val=0x200000000000056a) holder orapid = 1386
        waiters [orapid (seconds since: put on list, posted, alive check)]:
         1594 (1051, 1376439951, 2)
         53 (1033, 1376439951, 2)
         98 (1032, 1376439951, 2)
         1502 (1011, 1376439951, 2)
         69 (1009, 1376439951, 2)
         124 (960, 1376439951, 2)
         213 (950, 1376439951, 2)
         191 (948, 1376439951, 2)
         75 (712, 1376439951, 2)
         1359 (695, 1376439951, 2)
         424 (650, 1376439951, 2)
         1572 (636, 1376439951, 2)
         89 (591, 1376439951, 2)
         5168 (588, 1376439951, 2)
         33 (521, 1376439951, 2)        ++++33，就是lms3 进程
         1296 (394, 1376439951, 2)
         4049 (334, 1376439951, 2)
         163 (333, 1376439951, 2)
         4437 (91, 1376439951, 2)
         4459 (32, 1376439951, 2)
         waiter count=20
    Process Group: DEFAULT, pseudo proc: 7000015446c94e8
    O/S info: user: oracle, term: UNKNOWN, ospid: 7504228
    OSD pid info: Unix process pid: 7504228, image: <a href="mailto:oracle@ngcrmdb1">oracle@ngcrmdb1</a>

可以清楚的看到，lms3进程确实在等待pid 1386这个进程，换句话讲，lms3 进程在等待cbc latch，而该cbc latch resource
却被pid 1386所持有，一直没有释放；而且不仅仅是lms3进程在等待，还有其他19个进程也在等待.Pid 1386 进程的历史等待
事件如下：
<pre class="brush:php">SO: 700001542fc0208, type: 4, owner: 70000154841a800, flag: INIT/-/-/0x00
  (session) sid: 11786 trans: 7000014f368a0a8, creator: 70000154841a800, flag: (100045) USR/- BSY/-/-/-/-/-
            DID: 0001-056A-00000008, short-term DID: 0001-056A-00000009
            txn branch: 0
            oct: 7, prv: 0, sql: 70000145dad36e8, psql: 700001459f4cc00, user: 41/UOP_CRM1
  O/S info: user: webapp, term: unknown, ospid: 1234, machine: ngibweb3
            program: JDBC Thin Client
  client info: logon 10.200.141.139
  application name: JDBC Thin Client, hash value=2546894660
  last wait for 'SQL*Net message from client' blocking sess=0x0 seq=43475 wait_time=340 seconds since wait started=1051
          driver id=28444553, #bytes=1, =0
  Dumping Session Wait History
   for 'SQL*Net message from client' count=1 wait_time=340
          driver id=28444553, #bytes=1, =0
   for 'SQL*Net message to client' count=1 wait_time=2
          driver id=28444553, #bytes=1, =0
   for 'SQL*Net message from client' count=1 wait_time=149098
          driver id=28444553, #bytes=1, =0
   for 'SQL*Net message to client' count=1 wait_time=3
          driver id=28444553, #bytes=1, =0
   for 'SQL*Net message from client' count=1 wait_time=269
          driver id=28444553, #bytes=1, =0
   for 'SQL*Net message to client' count=1 wait_time=1
          driver id=28444553, #bytes=1, =0
   for 'SQL*Net message from client' count=1 wait_time=318
          driver id=28444553, #bytes=1, =0
   for 'SQL*Net message to client' count=1 wait_time=1
          driver id=28444553, #bytes=1, =0
   for 'log file sync' count=1 wait_time=378
          buffer#=6e6d, =0, =0
   for 'SQL*Net message from client' count=1 wait_time=1031
          driver id=28444553, #bytes=1, =0
  temporary object counter: 0

……….

从上面信息可以发现，orapid 1386这个session是地址10.200.141.139 登录的，主机名为：ngibweb3。
从上面信息来看，该进程的等待event是: SQL*Net message from client ，且该进程持有cbc latch长时间不释放。
对于进程持有cbc latch长时间不释放，主要有几种可能性：

1. 进程hang住;
2. oracle bug或操作系统hang住.
Lms3 进程等待超过5分钟，且cbc resource被该进程持有，由此来看，该session 进程可能hang住了.

那么如何确认该进程可能已经hang住了？

*** 2013-08-14 08:06:53.459
Dumping diagnostic information for ospid 7504228:
OS pid = 7504228
loadavg : 9.49 9.52 8.51
swap info: free_mem = 9215.18M rsv = 256.00M
           alloc = 2086.34M avail = 65536.00M swap_free = 63449.66M
       F S      UID      PID     PPID   C PRI NI ADDR    SZ    WCHAN    STIME    TTY  TIME CMD
  240001 A   oracle  7504228        1  32  76 20 1af21a7510 96324 f10006000ae020c8   Jul 28      - 568:26 oraclengcrmdb1 (LOCAL=NO)
...........
...........
SO: 700001542fc0208, type: 4, owner: 70000154841a800, flag: INIT/-/-/0x00
  (session) sid: 11786 trans: 7000014f368ee48, creator: 70000154841a800, flag: (100045) USR/- BSY/-/-/-/-/-
            DID: 0001-056A-00000008, short-term DID: 0001-056A-00000009
            txn branch: 0
            oct: 3, prv: 0, sql: 70000145c8bbba0, psql: 70000145c969a38, user: 41/UOP_CRM1
  O/S info: user: webapp, term: unknown, ospid: 1234, machine: ngibweb3
            program: JDBC Thin Client
  client info: logon 10.200.141.139
  application name: JDBC Thin Client, hash value=2546894660
  last wait for 'SQL*Net message from client' blocking sess=0x0 seq=41178 wait_time=325 seconds since wait started=0
          driver id=28444553, #bytes=1, =0
  Dumping Session Wait History
   for 'SQL*Net message from client' count=1 wait_time=325
          driver id=28444553, #bytes=1, =0
   for 'SQL*Net message to client' count=1 wait_time=2
          driver id=28444553, #bytes=1, =0
   for 'SQL*Net message from client' count=1 wait_time=2148096
          driver id=28444553, #bytes=1, =0
   for 'SQL*Net message to client' count=1 wait_time=4
          driver id=28444553, #bytes=1, =0
   for 'SQL*Net message from client' count=1 wait_time=345
          driver id=28444553, #bytes=1, =0
   for 'SQL*Net message to client' count=1 wait_time=1
          driver id=28444553, #bytes=1, =0
   for 'SQL*Net message from client' count=1 wait_time=386
          driver id=28444553, #bytes=1, =0
   for 'SQL*Net message to client' count=1 wait_time=1
          driver id=28444553, #bytes=1, =0
   for 'log file sync' count=1 wait_time=482
          buffer#=693f, =0, =0
   for 'SQL*Net message from client' count=1 wait_time=1267
          driver id=28444553, #bytes=1, =0
  temporary object counter: 0

*** 2013-08-14 08:08:13.068
Dumping diagnostic information for ospid 7504228:
OS pid = 7504228
loadavg : 9.51 9.48 8.57
swap info: free_mem = 9158.54M rsv = 256.00M
           alloc = 2086.33M avail = 65536.00M swap_free = 63449.67M
       F S      UID      PID     PPID   C PRI NI ADDR    SZ    WCHAN    STIME    TTY  TIME CMD
  240001 A   oracle  7504228        1  33  76 20 1af21a7510 98492            Jul 28      - 568:29 oraclengcrmdb1 (LOCAL=NO)
open: The file access permissions do not allow the specified action.
procstack: write(/proc/7504228/ctl): The requested resource is busy.
7504228: oraclengcrmdb1 (LOCAL=NO)
0x0000000000000000  ????????() + ??
........
........
SO: 700001542fc0208, type: 4, owner: 70000154841a800, flag: INIT/-/-/0x00
  (session) sid: 11786 trans: 7000014f368ee48, creator: 70000154841a800, flag: (100045) USR/- BSY/-/-/-/-/-
            DID: 0001-056A-00000008, short-term DID: 0001-056A-00000009
            txn branch: 0
            oct: 3, prv: 0, sql: 70000145ac1ef50, psql: 70000145db36f00, user: 41/UOP_CRM1
  O/S info: user: webapp, term: unknown, ospid: 1234, machine: ngibweb3
            program: JDBC Thin Client
  client info: logon 10.200.141.139
  application name: JDBC Thin Client, hash value=2546894660
  last wait for 'gc current request' blocking sess=0x0 seq=42098 wait_time=432 seconds since wait started=0
          file#=23f, block#=fe61a, id#=2000001
  Dumping Session Wait History
   for 'gc current block 2-way' count=1 wait_time=432
          =23f, =fe61a, =2000001
   for 'gc current block 2-way' count=1 wait_time=305
          =1be, =64f8, =2000001
   for 'gc cr block 2-way' count=1 wait_time=423
          =1be, =64f8, =1
   for 'gc current block 2-way' count=1 wait_time=482
          =2b3, =24e03, =2000001
   for 'gc current block 2-way' count=1 wait_time=327
          =1be, =64b7, =2000001
   for 'gc cr block 2-way' count=1 wait_time=413
          =1be, =64b7, =1
   for 'gc current block 2-way' count=1 wait_time=366
          =23f, =fe681, =2000001
   for 'gc current block 2-way' count=1 wait_time=418
          =1be, =64a9, =2000001
   for 'gc cr block 2-way' count=1 wait_time=453
          =1be, =64a9, =1
   for 'gc current block 2-way' count=1 wait_time=454
          =1b4, =fa1c2, =2000001
  temporary object counter: 0

*** 2013-08-14 08:08:17.929
Dumping diagnostic information for ospid 7504228:
OS pid = 7504228
loadavg : 9.63 9.50 8.59
swap info: free_mem = 9263.69M rsv = 256.00M
           alloc = 2086.33M avail = 65536.00M swap_free = 63449.67M
       F S      UID      PID     PPID   C PRI NI ADDR    SZ    WCHAN    STIME    TTY  TIME CMD
  240001 A   oracle  7504228        1  41  80 20 1af21a7510 96448            Jul 28      - 568:31 oraclengcrmdb1 (LOCAL=NO)
open: The file access permissions do not allow the specified action.
procstack: write(/proc/7504228/ctl): The requested resource is busy.
7504228: oraclengcrmdb1 (LOCAL=NO)
0x0000000000000000  ????????() + ??
*** 2013-08-14 08:08:18.810
...........
...........
SO: 700001542fc0208, type: 4, owner: 70000154841a800, flag: INIT/-/-/0x00
  (session) sid: 11786 trans: 7000014f368a0a8, creator: 70000154841a800, flag: (100045) USR/- BSY/-/-/-/-/-
            DID: 0001-056A-00000008, short-term DID: 0001-056A-00000009
            txn branch: 0
            oct: 7, prv: 0, sql: 70000145dad36e8, psql: 700001459f4cc00, user: 41/UOP_CRM1
  O/S info: user: webapp, term: unknown, ospid: 1234, machine: ngibweb3
            program: JDBC Thin Client
  client info: logon 10.200.141.139
  application name: JDBC Thin Client, hash value=2546894660
  last wait for 'SQL*Net message from client' blocking sess=0x0 seq=43475 wait_time=340 seconds since wait started=0
          driver id=28444553, #bytes=1, =0
  Dumping Session Wait History
   for 'SQL*Net message from client' count=1 wait_time=340
          driver id=28444553, #bytes=1, =0
   for 'SQL*Net message to client' count=1 wait_time=2
          driver id=28444553, #bytes=1, =0
   for 'SQL*Net message from client' count=1 wait_time=149098
          driver id=28444553, #bytes=1, =0
   for 'SQL*Net message to client' count=1 wait_time=3
          driver id=28444553, #bytes=1, =0
   for 'SQL*Net message from client' count=1 wait_time=269
          driver id=28444553, #bytes=1, =0
   for 'SQL*Net message to client' count=1 wait_time=1
          driver id=28444553, #bytes=1, =0
   for 'SQL*Net message from client' count=1 wait_time=318
          driver id=28444553, #bytes=1, =0
   for 'SQL*Net message to client' count=1 wait_time=1
          driver id=28444553, #bytes=1, =0
   for 'log file sync' count=1 wait_time=378
          buffer#=6e6d, =0, =0
   for 'SQL*Net message from client' count=1 wait_time=1031
          driver id=28444553, #bytes=1, =0
  temporary object counter: 0

*** 2013-08-14 08:25:49.363
Dumping diagnostic information for ospid 7504228:
OS pid = 7504228
loadavg : 2.81 3.72 5.94
swap info: free_mem = 7515.38M rsv = 256.00M
           alloc = 2106.36M avail = 65536.00M swap_free = 63429.64M
       F S      UID      PID     PPID   C PRI NI ADDR    SZ    WCHAN    STIME    TTY  TIME CMD
  240001 T   oracle  7504228        1   0  60 20 1af21a7510 96324            Jul 28      - 568:31 oraclengcrmdb1 (LOCAL=NO)
.......
.......
SO: 700001542fc0208, type: 4, owner: 70000154841a800, flag: INIT/-/-/0x00
  (session) sid: 11786 trans: 7000014f368a0a8, creator: 70000154841a800, flag: (100045) USR/- BSY/-/-/-/-/-
            DID: 0001-056A-00000008, short-term DID: 0001-056A-00000009
            txn branch: 0
            oct: 7, prv: 0, sql: 70000145dad36e8, psql: 700001459f4cc00, user: 41/UOP_CRM1
  O/S info: user: webapp, term: unknown, ospid: 1234, machine: ngibweb3
            program: JDBC Thin Client
  client info: logon 10.200.141.139
  application name: JDBC Thin Client, hash value=2546894660
  last wait for 'SQL*Net message from client' blocking sess=0x0 seq=43475 wait_time=340 seconds since wait started=1051
          driver id=28444553, #bytes=1, =0
  Dumping Session Wait History
   for 'SQL*Net message from client' count=1 wait_time=340
          driver id=28444553, #bytes=1, =0
   for 'SQL*Net message to client' count=1 wait_time=2
          driver id=28444553, #bytes=1, =0
   for 'SQL*Net message from client' count=1 wait_time=149098
          driver id=28444553, #bytes=1, =0
   for 'SQL*Net message to client' count=1 wait_time=3
          driver id=28444553, #bytes=1, =0
   for 'SQL*Net message from client' count=1 wait_time=269
          driver id=28444553, #bytes=1, =0
   for 'SQL*Net message to client' count=1 wait_time=1
          driver id=28444553, #bytes=1, =0
   for 'SQL*Net message from client' count=1 wait_time=318
          driver id=28444553, #bytes=1, =0
   for 'SQL*Net message to client' count=1 wait_time=1
          driver id=28444553, #bytes=1, =0
   for 'log file sync' count=1 wait_time=378
          buffer#=6e6d, =0, =0
   for 'SQL*Net message from client' count=1 wait_time=1031
          driver id=28444553, #bytes=1, =0
  temporary object counter: 0

可以看到最后的2个时间点，2013-08-14 08:08:18和2013-08-14 08:25:49，该进程都的time cmd值都没有发生变化，

且其状态都是busy. 同时执行的SQL都是70000145dad36e8，我们来看SQL：70000145dad36e8 是什么语句？

<pre class="brush:php">Cursor#6(1104a1958) state=BOUND curiob=1105400e8
 curflg=cd fl2=0 par=0 ses=700001542fc0208
 sqltxt(70000145dad36e8)=
DELETE /*+ index(A IDX_TI_B_MCAS_UDR_3) */ FROM TI_B_MCAS_UDR A WHERE DECODE(A.SERV_TYPE,'VGOP',A.SERV_TYPE||A.BIZ_TYPE_CODE,A.SERV_TYPE) NOT IN ('DSMP', '1258','CMMB','VGOP60','VGOP10','NIAP','RINP','STMM','PIMM','MGIS','PPOC','MCSP','MDOP','MBLG','SURF','NFCP','MCLD','MEBP')
  hash=048fa56d3419181a96e347d96053b942
  parent=70000145dab3218 maxchild=03 plk=70000146b2e2208 ppn=n
cursor instantiation=1105400e8 used=1376438898
 child#2(7000011f63f85e0) pcs=7000012b5494ea8
  clk=7000013ac106198 ci=7000011ddc3d0b0 pn=70000146bb6b560 ctx=70000140e841ab8
 kgsccflg=0 llk[1105400f0,1105400f0] idx=0
 xscflg=c0100276 fl2=c000400 fl3=228208c fl4=100
 Frames pfr 110540080 siz=11728 efr 11053ff90 siz=11672
 Cursor frame dump
  enxt: 4.0x00000480  enxt: 3.0x00000af8  enxt: 2.0x00000ec8  enxt: 1.0x00000f58
  pnxt: 2.0x00000008  pnxt: 1.0x00000030
 kxscphp  1105384e8 siz=984 inu=344 nps=344

可以看到该进程从8:08就在执行一个delete操作，直到8:25分,经历了17分钟. 我们从trace信息来看，该进程确实被挂起了1051秒。

那么，最后我们通过该进程的trace来确认一下，这个session在干什么？信息如下：

Dump file /oraclelog/ngcrmdb/udump/ngcrmdb1_ora_7504228.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining Scoring Engine
and Real Application Testing options
ORACLE_HOME = /oracle/app/product/10.2.0/db_1
System name:	AIX
Node name:	ngcrmdb1
Release:	3
Version:	5
Machine:	00C8A6F24C00
Instance name: ngcrmdb1
Redo thread mounted by this instance: 1
Oracle process number: 1386
Unix process pid: 7504228, image: oracle@ngcrmdb1

*** 2013-08-14 08:25:51.146
*** ACTION NAME:() 2013-08-14 08:25:51.119
*** MODULE NAME:(JDBC Thin Client) 2013-08-14 08:25:51.119
*** SERVICE NAME:(ngcrmdb) 2013-08-14 08:25:51.119
*** SESSION ID:(11786.20) 2013-08-14 08:25:51.119
Received ORADEBUG command 'dump errorstack 3' from process Unix process pid: 5886552, image:
*** 2013-08-14 08:25:51.146
ksedmp: internal or fatal error
Current SQL statement for this session:
DELETE /*+ index(A IDX_TI_B_MCAS_UDR_3) */ FROM TI_B_MCAS_UDR A WHERE DECODE(A.SERV_TYPE,'VGOP',A.SERV_TYPE||A.BIZ_TYPE_CODE,A.SERV_TYPE) NOT IN ('DSMP', '1258','CMMB','VGOP60','VGOP10','NIAP','RINP','STMM','PIMM','MGIS','PPOC','MCSP','MDOP','MBLG','SURF','NFCP','MCLD','MEBP')
----- PL/SQL Call Stack -----
  object      line  object
  handle    number  name
700001459e79290        66  procedure UOP_CRM1.P_UIP_MERGER_MCASDATA
70000145c91ebc8         1  anonymous block

从进程的trace信息来看，跟我们前面的分析符合，可以看到该session在执行一个delete操作,该sql执行计划下：

<pre class="brush:php">============
Plan Table
============
-----------------------------------------------------------+-----------------------------------+
| Id  | Operation                     | Name               | Rows  | Bytes | Cost  | Time      |
-----------------------------------------------------------+-----------------------------------+
| 0   | DELETE STATEMENT              |                    |       |       |   39K |           |
| 1   |  DELETE                       | TI_B_MCAS_UDR      |       |       |       |           |
| 2   |   TABLE ACCESS BY INDEX ROWID | TI_B_MCAS_UDR      |     1 |    47 |   39K |  00:08:05 |
| 3   |    INDEX FULL SCAN            | IDX_TI_B_MCAS_UDR_3|  169K |       |  1505 |  00:00:19 |
-----------------------------------------------------------+-----------------------------------+

可以看到，该sql执行过差主要是因为进行了index full scan，需要注意的是,index full scan 进行的单块读。且由于
该环境是RAC，oracle需要在实例之间进行block的传输，确保数据的一致性以及cache fusion的作用，而该操作主要是
lms进程来完成的。然而该进程长时间持有cbc latch不释放，可能是由于该session 已经本身已经hang住。

最终由于node 1节点lms3 进程长时间获得不到cbc latch而出现超时,被lmon检查发现,以至于最后被lmon进程将实例重启。

oracle

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者

Rac 实例被莫名重启案例分析

问题描述

专家解答

评论