The database instance Crash because the CPU High ?

原创 Roger 2014-10-10

1038

某系统的其中一个RAC节点的db实例被干掉并自动重启了，如下是该实例的alert log信息：

Wed Oct 08 12:59:51 CST 2014
Thread 1 advanced to log sequence 543 (LGWR switch)
Current log# 1 seq# 543 mem# 0: +DGSYS/hisdb/onlinelog/group_1.261.856010381
Wed Oct 08 13:13:45 CST 2014
WARNING: inbound connection timed out (ORA-3136)
Wed Oct 08 13:16:56 CST 2014
LMON (ospid: 17329) is not heartbeating for 204 seconds.
LMON is not healthy and has no heartbeat.
Please check LMD0/LMS0 and DIAG trace files for detail.
Wed Oct 08 13:17:02 CST 2014
LMS0 (ospid: 17333) is terminating the instance.
LMS0: terminating instance due to error 484
Wed Oct 08 13:17:05 CST 2014
System state dump is made for local instance
System State dumped to trace file /opt/oracle/admin/hisdb/bdump/hisdb1_diag_17325.trc
Wed Oct 08 13:17:06 CST 2014
Shutting down instance (abort)
License high water mark = 1068
Wed Oct 08 13:17:07 CST 2014
Instance terminated by LMS0, pid = 17333
Wed Oct 08 13:17:11 CST 2014
Instance terminated by USER, pid = 15788
Wed Oct 08 13:17:17 CST 2014
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 eth1 10.0.0.0 configured from OCR for use as a cluster interconnect
Interface type 1 eth0 10.3.36.192 configured from OCR for use as  a public interface
Picked latch-free SCN scheme 3
Wed Oct 08 13:17:27 CST 2014
Using LOG_ARCHIVE_DEST_1 parameter default value as /opt/oracle/product/10.2.0/db/dbs/arch
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.5.0.
System parameters with non-default values:
。。。。。。。
Wed Oct 08 13:17:39 CST 2014
ALTER DATABASE OPEN
Picked broadcast on commit scheme to generate SCNs
Wed Oct 08 13:17:39 CST 2014
SUCCESS: diskgroup MSDATA was mounted
SUCCESS: diskgroup DGDATA was mounted
Wed Oct 08 13:17:39 CST 2014
Thread 1 opened at log sequence 544
Current log# 2 seq# 544 mem# 0: +DGSYS/hisdb/onlinelog/group_2.262.856010381
Successful open of redo thread 1
Wed Oct 08 13:17:39 CST 2014
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Wed Oct 08 13:17:39 CST 2014
SMON: enabling cache recovery
Wed Oct 08 13:17:40 CST 2014
Successfully onlined Undo Tablespace 1.
Wed Oct 08 13:17:40 CST 2014
SMON: enabling tx recovery
Wed Oct 08 13:17:40 CST 2014
Database Characterset is ZHS16GBK
Opening with internal Resource Manager plan
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
QMNC started with pid=28, OS id=16432
Wed Oct 08 13:17:42 CST 2014
Completed: ALTER DATABASE OPEN

我们可以看到，该实例在13:17:07 秒被LMS0进程强行终止掉了，接着该实例在13:17:17 被正常启动。
在该实例被强行终止之前，有一点关键信息是值得我们注意的，如下：

Wed Oct 08 13:13:45 CST 2014
WARNING: inbound connection timed out (ORA-3136)
Wed Oct 08 13:16:56 CST 2014
LMON (ospid: 17329) is not heartbeating for 204 seconds.
LMON is not healthy and has no heartbeat.
Please check LMD0/LMS0 and DIAG trace files for detail.
Wed Oct 08 13:17:02 CST 2014
LMS0 (ospid: 17333) is terminating the instance.
LMS0: terminating instance due to error 484

从这部分信息，我们可以大致判断，在13:16:56时，Oracle已经发现LMON进程长时间没有检测到心跳了，这个时间长达204秒。
如果根据时间向前推进，在13:13:32时间点，实际上Lmon进程就开始出现异常了。我们也可以看到在13:13:45时间点，出现了
一个ora-3136错误。一般来讲，这个waring跟系统的负载可能有极大的关系，例如资源使用极高，可能出现超时的情况。

从alert log信息来看，Oracle 让我们去查看LMD0/LMS0 以及diag的信息来进行进一步的分析。那么我们首先就来看一下LMD0进程的信息：

*** 2014-10-08 12:47:22.077
Setting 3-way CR grants to 1 global-lru off? 0
*** 2014-10-08 13:16:58.621
KJM_HISTORY: RCVR STALL OP(12) context 5 elapsed 202974048 us
KJM HIST LMD0:
12:5:202974048 7:6 6:3 10:4 2:6 1:2 11:2 15:2 12:80846 7:4
6:4 10:9 11:2 15:4 12:78700 7:8 6:1 10:8 11:2 15:2
12:79509 7:3 6:1 10:3 11:2 15:2 12:78710 7:2 6:2 10:4
11:3 15:1 12:79073 7:3 6:2 10:5 11:2 15:2 12:79846 7:6
6:1 10:14 11:2 15:5 12:78560 7:6 6:3 10:9 11:2 15:5
12:79708 7:5 6:1 10:6 11:0 15:1 12:79261 7:3 6:1 10:10
11:1 15:4 12:79432 7:4
----------------------------------------
SO: 0x478e7ff10, type: 4, owner: 0x4798e0088, flag: INIT/-/-/0x00
(session) sid: 3301 trans: (nil), creator: 0x4798e0088, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000
txn branch: (nil)
oct: 0, prv: 0, sql: (nil), psql: (nil), user: 0/SYS
last wait for 'ges remote message' wait_time=3 min 22 sec, seconds since wait started=258
waittime=40, loop=0, p3=0
blocking sess=0x(nil) seq=8230
Dumping Session Wait History
for 'ges remote message' count=1 wait_time=3 min 22 sec
waittime=40, loop=0, p3=0
for 'ges remote message' count=1 wait_time=0.080830 sec
waittime=40, loop=0, p3=0
for 'ges remote message' count=1 wait_time=0.078682 sec
waittime=40, loop=0, p3=0
for 'ges remote message' count=1 wait_time=0.079500 sec
waittime=40, loop=0, p3=0
for 'ges remote message' count=1 wait_time=0.078702 sec
waittime=40, loop=0, p3=0
for 'ges remote message' count=1 wait_time=0.079063 sec
waittime=40, loop=0, p3=0
for 'ges remote message' count=1 wait_time=0.079834 sec
waittime=40, loop=0, p3=0
for 'ges remote message' count=1 wait_time=0.078517 sec
waittime=40, loop=0, p3=0
for 'ges remote message' count=1 wait_time=0.079689 sec
waittime=40, loop=0, p3=0
for 'ges remote message' count=1 wait_time=0.079250 sec
waittime=40, loop=0, p3=0

从LMD0进程的信息来看，可以发现该进程（lmd0）已经等待了258秒，等待事件为ges remote message，除此之外，没有其他的信息了。

既然这样，那我们继续来看下LMS0进程的信息：

*** 2014-09-30 10:01:46.825
DRM(3) win(1) lms 0 finished replaying gcs resources
lms 0 finished fixing gcs write protocol
DRM(3) win(2) lms 0 finished replaying gcs resources
lms 0 finished fixing gcs write protocol
DRM(3) win(3) lms 0 finished replaying gcs resources
lms 0 finished fixing gcs write protocol
DRM(3) win(4) lms 0 finished replaying gcs resources
lms 0 finished fixing gcs write protocol
DRM(3) win(5) lms 0 finished replaying gcs resources
lms 0 finished fixing gcs write protocol
DRM(3) win(6) lms 0 finished replaying gcs resources
lms 0 finished fixing gcs write protocol
DRM(3) win(7) lms 0 finished replaying gcs resources
lms 0 finished fixing gcs write protocol
DRM(3) win(8) lms 0 finished replaying gcs resources
lms 0 finished fixing gcs write protocol
*** 2014-10-08 13:16:56.712
kjfdchklmon: LMON is not heartbeating for 204 sec (1412745209.1412745413.0).
kjfdchklmon: Dumping callstack of lmon
Submitting asynchronized dump request [20]
kjfdchklmon: LMON is not healthy. kill instance.
ksuitm: waiting up to [5] seconds before killing DIAG(17325)

LMS0进程的信息似乎也没有太多的价值。最后我们只能看看diag的信息了，Oracle在将所有进程kill之前进行了一次dump。

*** 2014-10-08 13:16:56.894
Process diagnostic dump for oracle@hisdbs1 (LMON), OS id=17329,
pid: 5, proc_ser: 1, sid: 3302, sess_ser: 1
-------------------------------------------------------------------------------
loadavg : 62.17 35.29 18.12
Memory (Avail / Total) = 5976.90M / 64455.59M
Swap (Avail / Total) = 21215.88M /  24003.36M
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
Short stack dump: ORA-00072: ?? "Unix process pid: 17329, image: oracle@hisdbs1 (LMON)" ???

-------------------------------------------------------------------------------
Process diagnostic dump actual duration=8.010000 sec
(max dump time=30.000000 sec)
*** 2014-10-08 13:17:04.908
----------------------------------------
SO: 0x47e8f2a38, type: 2, owner: (nil), flag: INIT/-/-/0x00
(process) Oracle pid=5, calls cur/top: 0x4762511b8/0x4762511b8, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 91
last post received-location: kjfzack: post pmon for dead process cleanup
last process to post me: 4798e0088 1 6
last post sent: 0 0 108
last post sent-location: kjmpost: post lmd
last process posted by me: 4798e0088 1 6
(latch info) wait_event=0 bits=0
Location from where call was made: ksqgel: create enqueue: parent obj
Context saved from call: 19162010040
waiting for 600098c8 enqueues level=5
Location from where latch is held: ksqgel: create enqueue: parent obj
Context saved from call: 16790760304
state=busy, wlstate=free
waiters [orapid (seconds since: put on list, posted, alive check)]:
5 (216, 1412745425, 6)           ---LMON
774 (216, 1412745425, 216)
15 (216, 1412745425, 153)        ---ckpt
813 (216, 1412745425, 216)
18 (216, 1412745425, 141)        ---CJQ0
25 (216, 1412745425, 216)        ---ASMB
886 (27, 1412745425, 27)
280 (15, 1412745425, 15)
783 (9, 1412745425, 9)
800 (9, 1412745425, 9)
waiter count=10
gotten 100670517 times wait, failed first 9774 sleeps 746
gotten 0 times nowait, failed: 0
possible holder pid = 792 ospid=15095
on wait list for 600098c8
Process Group: DEFAULT, pseudo proc: 0x478a24a58
O/S info: user: oracle, term: UNKNOWN, ospid: 17329 (DEAD)
OSD pid info: Unix process pid: 17329, image: oracle@hisdbs1 (LMON)
PSO child state object changes :
Dump of memory from 0x000000047E8930D8 to 0x000000047E8932E0
47E8930D0                   00000008 00000000          [........]
。。。。。。
Repeat 23 times
----------------------------------------
SO: 0x477e7c9c0, type: 4, owner: 0x47e8f2a38, flag: INIT/-/-/0x00
(session) sid: 3302 trans: (nil), creator: 0x47e8f2a38, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0001-0005-00000006, short-term DID: 0000-0000-00000000
txn branch: (nil)
oct: 0, prv: 0, sql: (nil), psql: (nil), user: 0/SYS
waiting for 'latch free' wait_time=0, seconds since wait started=210
address=600098c8, number=13, tries=2b
blocking sess=0x(nil) seq=28098
Dumping Session Wait History
for 'latch free' count=1 wait_time=4.885477 sec
address=600098c8, number=13, tries=2a
for 'latch free' count=1 wait_time=4.902795 sec
address=600098c8, number=13, tries=29
for 'latch free' count=1 wait_time=4.884726 sec
address=600098c8, number=13, tries=28
for 'latch free' count=1 wait_time=4.883317 sec
address=600098c8, number=13, tries=27
for 'latch free' count=1 wait_time=4.887379 sec
address=600098c8, number=13, tries=26
for 'latch free' count=1 wait_time=4.885649 sec
address=600098c8, number=13, tries=25
for 'latch free' count=1 wait_time=4.886556 sec
address=600098c8, number=13, tries=24
for 'latch free' count=1 wait_time=4.888166 sec
address=600098c8, number=13, tries=23
for 'latch free' count=1 wait_time=4.885367 sec
address=600098c8, number=13, tries=22
for 'latch free' count=1 wait_time=4.885849 sec
address=600098c8, number=13, tries=21

。。。。。。
------------process 0x0x474316de0--------------------
proc version      : 0
Local node        : 0
pid               : 17329
lkp_node          : 0
svr_mode          : 0
proc state        : KJP_NORMAL
Last drm hb acked : 0
Total accesses    : 2
Imm.  accesses    : 1
Locks on ASTQ     : 0
Locks Pending AST : 0
Granted locks     : 0
AST_Q:
PENDING_Q:
GRANTED_Q:
Performing diagnostic data dump for this instance

从前面的load来看，load信息如下：
loadavg : 62.17 35.29 18.12

这里简单解释一下，这3个loadavg值表示该主机分别在1分钟，5分钟，15分钟内的平均load值。显然，这里我们不看后面2个指标了。
根据时间点，我们这里看第一个loadavg值是62.17. 这个值已经超出CPU的数倍了，由此我们不难判断在13:16:56向前推进1分钟的时间内
系统的load是极高的。

对于LMON进程，主要是监控RAC的GES信息，当然其作用不仅仅局限于此，其还负责检查集群中每个节点的健康情况，当有节点出现故障是，
负责进行reconfig以及GRD(global resource Directory)的恢复等等。我们知道AC的脑裂机制，如果IO fencing是Oracle本身来完成，也
就是说由CLusterware来完成。那么Lmon进程检查到实例级别出现脑裂时，会通知clusterware来进行脑裂操作，然而其并不会等待clusterware
的处理结果。当等待超过一定时间，那么LMON进程会自动触发IMR(instance membership recovery)，这实际上也就是我们所说的
Instance membership reconfig。

对于报错中提到的LMON进程无法获得心跳了，那么LMON进程的心跳是什么呢？ lmon进程主要通过2种心跳机制来判断集群节点的状态：

1) 网络心跳（主要是通过ping进行检测）
2) 控制文件磁盘心跳，类似ckpt进程每3s更新一次controlfile的机制。

那么什么情况下，会导致LMON无法获得心跳呢？

一般来讲有可能是该进程无法获得latch或者系统的资源不足，导致LMON进程无法正常工作。而这里我们也确实发现系统的load极高。
从上面的进程详细信息来看，我们也确实发现该进程一直在等待一个latch，同时有其他的10个进程正在等待LMON进程，这其中的等待
就包括LMON,ASMB，ckpt进程等等。从diag的dump来看，提示lmon进程已经等待了216秒。

从process dump我们还可以看到该进程一直无法获得的latch是可能是被如下进程所持有：

possible holder pid = 792 ospid=15095

不过可惜的是，我搜索diag日志，居然找不到pid 792这个会话。我们知道diag的trace中的process state dump信息是根据PID排序的。
我发现dump内容到process 69就结束了。从PID来看，该实例在出问题的时候，起码都800个process以上，dump的内容不可能只有这么一点。

从2节点的核心进程的日志也看不到有价值的信息。后面监控了一段时间发现该节点cpu几乎都是90{39ecd679003247f2ed728ad9c7ed019a369dd84d0731b449c26bf628d3c1a20b}以上，很多时候甚至100{39ecd679003247f2ed728ad9c7ed019a369dd84d0731b449c26bf628d3c1a20b}。

检查发现是CBC latch，而且都是一些报表的SQL。

对于这个case，我仍然感觉可能是资源异常导致LMON进程异常，最后Oracle 检测到该进程长时间没有心跳，最终终止了该实例。

基于这个假设，我打算在自己的VM Rac环境(Linux +10.2.0.5)测试一下，如果某个节点CPU使用率极高的情况下，是否会导致LMON进程异常。

不过遗憾的是，我并没有模拟出这种情况，如下是我的模拟过程：

利用shell 来模拟cpu的消耗：

[root@rac1 ~]# cat kill_cpu.sh
#! /bin/sh
# filename killcpu.sh
for i in `seq $1`
do
echo -ne "
i=0;
while true
do
i=i+1;
done" | /bin/sh &
pid_array[$i]=$! ;
done
for i in "${pid_array[@]}"; do
echo 'kill ' $i ';';
done

调用该脚本即可：
[root@rac1 ~]# ./kill_cpu.sh 1
kill 17449 ;

这样测试了10分钟，发现Rac毫无反应，于是我又模拟了latch: cache buffer chains. 以此来尽可能的将该节点的CPU消耗光。

SQL> create table killdb(no int,object_name varchar2(50));

Table created.

SQL> declare i int;
2  begin
3  for i in 1..5 loop
4  insert into killdb  select  rownum as no,object_name from dba_objects;
5  end loop;
6  end;
7  /

PL/SQL procedure successfully completed.

SQL> select count(1) from killdb;

COUNT(1)
----------
250645

SQL> commit;

Commit complete.

SQL> create or replace procedure full_scan is
2   i int;
3   icount int;
4   begin
5   for i in 1..1000 loop
6   select count(*) into icount from killdb;
7   end loop;
8   end;
9   /

Procedure created.

SQL> var job_no number;
SQL> begin
2  for idx in 1..30 loop
3  dbms_job.submit(:job_no,'full_scan;');
4  commit;
5  end loop ;
6  end;
7  /

PL/SQL procedure successfully completed.

SQL> select event,count(1) from v$session where wait_class#<>6 group by event;

EVENT                                                              COUNT(1)
---------------------------------------------------------------- ----------
os thread startup                                                         1
null event                                                                1
latch: cache buffers chains                                              26
gc current request                                                        1
SQL*Net message to client                                                 1

如下是该节点的CPU使用情况：

08:42:47 PM  CPU   {39ecd679003247f2ed728ad9c7ed019a369dd84d0731b449c26bf628d3c1a20b}user   {39ecd679003247f2ed728ad9c7ed019a369dd84d0731b449c26bf628d3c1a20b}nice    {39ecd679003247f2ed728ad9c7ed019a369dd84d0731b449c26bf628d3c1a20b}sys {39ecd679003247f2ed728ad9c7ed019a369dd84d0731b449c26bf628d3c1a20b}iowait    {39ecd679003247f2ed728ad9c7ed019a369dd84d0731b449c26bf628d3c1a20b}irq   {39ecd679003247f2ed728ad9c7ed019a369dd84d0731b449c26bf628d3c1a20b}soft  {39ecd679003247f2ed728ad9c7ed019a369dd84d0731b449c26bf628d3c1a20b}steal   {39ecd679003247f2ed728ad9c7ed019a369dd84d0731b449c26bf628d3c1a20b}idle    intr/s
08:42:48 PM  all   99.01    0.00    0.99    0.00    0.00    0.00    0.00    0.00   1005.94
08:42:49 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   1044.44
......省略部分内容
08:43:08 PM  all   99.01    0.00    0.99    0.00    0.00    0.00    0.00    0.00   1003.96
08:43:09 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   1038.38
08:43:10 PM  all   99.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00   1019.00
08:43:11 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   1032.00
08:43:12 PM  all   99.01    0.00    0.99    0.00    0.00    0.00    0.00    0.00   1000.99
08:43:13 PM  all   95.96    0.00    4.04    0.00    0.00    0.00    0.00    0.00   1202.02
08:43:14 PM  all   93.07    0.00    5.94    0.00    0.00    0.99    0.00    0.00   1057.43
08:43:15 PM  all   86.32    0.00   12.63    0.00    0.00    1.05    0.00    0.00   1080.00
08:43:16 PM  all   96.97    0.00    3.03    0.00    0.00    0.00    0.00    0.00   1058.59
08:44:48 PM  all   99.25    0.00    0.61    0.00    0.08    0.07    0.00    0.00   1026.50
08:46:16 PM  all   99.24    0.00    0.61    0.00    0.05    0.10    0.00    0.00   1024.49
08:46:44 PM  all   99.06    0.00    0.73    0.00    0.07    0.15    0.00    0.00   1042.57
08:46:45 PM  all   91.43    0.00    7.62    0.00    0.00    0.95    0.00    0.00   1140.95
08:46:46 PM  all   82.61    0.00   17.39    0.00    0.00    0.00    0.00    0.00   1042.39
08:46:47 PM  all   57.14    0.00   42.86    0.00    0.00    0.00    0.00    0.00   1068.25
08:46:48 PM  all   52.00    0.00   48.00    0.00    0.00    0.00    0.00    0.00   1092.00
08:46:49 PM  all   84.34    0.00   14.46    0.00    0.00    1.20    0.00    0.00   1090.36
08:46:50 PM  all   90.82    0.00    7.14    0.00    1.02    1.02    0.00    0.00   1137.76
08:46:51 PM  all   89.69    0.00   10.31    0.00    0.00    0.00    0.00    0.00   1040.21
08:46:52 PM  all   95.65    0.00    4.35    0.00    0.00    0.00    0.00    0.00   1052.17
08:46:53 PM  all   99.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00   1012.00
08:46:54 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   1028.00
08:46:55 PM  all   99.01    0.00    0.99    0.00    0.00    0.00    0.00    0.00   1007.92
08:46:56 PM  all   98.92    0.00    1.08    0.00    0.00    0.00    0.00    0.00   1052.69

我们可以看到在08:43:16 到 08:46:44 之间，rac 节点1其实hang了很长时间，可惜这个数据库节点都没有挂。

我只能说Oracle 太强悍了，我檫。

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者

The database instance Crash because the CPU High ?

评论