Continuing from the previous article, which covered which processes can trigger a node reboot and which logs to collect afterwards for analysis, we now go through each scenario in turn.
1.3 Diagnosing OCSSD evictions
1.3.1 Main causes of OCSSD evictions
Under normal conditions the RAC nodes exchange network heartbeats at regular intervals. When a network fault prevents the nodes from communicating, one of them is evicted. Before a node is evicted for a network interruption or delay, 30 consecutive failed checks (the default setting) must occur.
Likewise, if a node cannot read or write the voting disks, i.e. it cannot maintain heartbeats with a majority of the voting files, the node will be evicted.
In addition, the restart mechanism changed starting with 11g RAC: 11gR2 Clusterware performs a "rebootless restart", first trying to restart only the clusterware stack without rebooting the node; only when restarting the stack fails is the node (the operating system) rebooted. There are also cases where, for example, the database LMON process asks CSS to evict an instance from the cluster; if that eviction times out, the node is rebooted.
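The timeouts involved can be confirmed on a live cluster with crsctl (a quick sanity check; run as the grid owner or root, and note that the values vary with version and configuration):
# Network heartbeat timeout (misscount, 30s by default) and the disk
# heartbeat timeouts CSSD uses when deciding to evict a node
crsctl get css misscount
crsctl get css disktimeout
crsctl get css reboottime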
Other causes include:
Sudden failure or hang of the ocssd process.
Oracle bugs.
Hardware problems.
1.3.2 Voting disk (disk heartbeat) failures
The cluster must be able to access the minimum required number of voting files, otherwise the node stops working. When a node aborts for this reason, $GRID_HOME/log/<hostname>/alert<nodename>.log contains a CRS-1606 error, for example:
2013-01-26 10:15:47.177
[cssd(3743)]CRS-1606:The number of voting files available, 1, is less than the minimum number of voting files required, 2, resulting in CSSD termination to ensure data integrity; details at (:CSSNM00018:) in /u01/app/11.2.0/grid/log/apdbc76n1/cssd/ocssd.log
Common causes:
A problem with the storage underlying the voting-file disks; or, on versions below 11.2.0.3.4, hitting bug 13869978.
Troubleshooting approach:
1. Check whether the current voting files are accessible (a consolidated command sketch follows at the end of this list):
a. crsctl query css votedisk
b. Check that the permissions on the voting-file disks are correct.
c. If the voting files are in ASM, use "lsdsk -k -G diskgroup_name"; if ASMLib is used, run "oracleasm querydisk /dev/* 2>/dev/null" as root to show which raw device corresponds to each ASMLib label, or see Document 811457.1.
d. To check whether a raw device is readable, use dd, e.g. dd if=/dev/raw/raw1 of=/dev/null count=100 bs=1024, being careful not to overwrite the raw device.
2. Check whether bug 13869978 was hit; the fix is included in patch set 11.2.0.3.4 and later.
3. Review the following log files:
Operating system logs on each node:
AIX: Use command "errpt -a" to show messages.
Linux:/var/log/messages
Solaris:/var/adm/messages
HP-UX:/var/adm/syslog/syslog.log
Windows: Check the System Event log
Storage logs.
If the voting files are on NFS, check the NFS logs; if NFS over the network is used, also check the MTU and similar settings. The 恒生 implementation standard generally does not use NFS, so this is normally not an issue on our systems.
Cluster alert log
crsd log
cssd log
4. Review the I/O load on the voting disks
If the disks holding the voting files are extremely busy, the cluster may get no response from the voting files for a long time, which can also lead to a node eviction. Use CHM or OSWatcher to check the I/O load around the time of the eviction. Once the relevant disks are identified, check the following three areas:
a. iostat
Check whether the disk response time is 20 ms or more, whether the device is 50% busy or more, and whether the average wait time is 20 ms or more.
b. ps or top
A process in state "D" is waiting on I/O.
Check the archived ps data near the time of the problem for processes in state "D".
c. netstat
If voting disk(s) are on NFS, check the following:
Check netstat for the interface which the node uses to communicate with the NFS server. Look for any errors, dropped packets, etc. Also check the NFS server's historical netstat and iostat data, if available.
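The checks in steps 1 and 4 above can be run roughly as follows (a minimal sketch; /dev/raw/raw1 is the example device from step 1d, replace it with your actual voting-file device):
# List the voting files and their states
crsctl query css votedisk
# Verify ownership/permissions on the underlying device
ls -l /dev/raw/raw1
# Read-only test of the device with dd; never write to a voting device
dd if=/dev/raw/raw1 of=/dev/null bs=1024 count=100
# Sample extended I/O statistics around the eviction time; look for response
# times above 20 ms or utilization above 50% on the voting-file disks
iostat -x 5 3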
1.3.3 Network (interconnect) failures
1. Determine which node was evicted and when.
2. Check whether the private network is working correctly.
3. The logs usually show a characteristic pattern.
An old set of log entries happens to be at hand, so it is pasted here.
On the surviving node, $GRID_HOME/log/<hostname>/alert<hostname>.log should look similar to the following:
2012-07-14 19:24:18.420
[cssd(6192)]CRS-1612:Network communication with node racnode02 (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.500 seconds
2012-07-14 19:24:25.422
[cssd(6192)]CRS-1611:Network communication with node racnode02 (2) missing for 75% of timeout interval. Removal of this node from cluster in 7.500 seconds
2012-07-14 19:24:30.424
[cssd(6192)]CRS-1610:Network communication with node racnode02 (2) missing for 90% of timeout interval. Removal of this node from cluster in 2.500 seconds
2012-07-14 19:24:32.925
[cssd(6192)]CRS-1607:Node racnode02 is being evicted in cluster incarnation 179915229; details at (:CSSNM00007:) in /u01/app/gridhome/log/racnode01/cssd/ocssd.log.
On the evicted node, $GRID_HOME/log/<hostname>/alert<hostname>.log should contain something like:
2012-07-14 19:24:29.282
[cssd(8625)]CRS-1608:This node was evicted by node 1, racnode01; details at (:CSSNM00005:) in /u01/app/gridhome/log/racnode02/cssd/ocssd.log.
2012-07-14 19:24:29.282
[cssd(8625)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/gridhome/log/racnode02/cssd/ocssd.log
In addition, when the eviction is caused by the network, node 2 is usually the one evicted automatically, even if it is node 1's network that has the problem.
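A few quick private-network checks (interface names and addresses below are placeholders; adapt them to your environment):
# Which network did Clusterware register as the cluster interconnect?
oifcfg getif
# Basic reachability of the other node over the private network
ping -c 3 <private IP of the remote node>
# Error, drop and reassembly counters on the private interface / IP stack
ifconfig <private interface>
netstat -s | egrep -i "reassembl|retrans|dropped"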
1.3.4 Evictions escalating to a node (OS) reboot
When an instance has a problem, another instance will try to evict it. If the problem instance does not terminate itself, the instance issuing the eviction sends a "member kill" request to the ocssd.bin process of its local node's CRS, which forwards the "member kill" to the ocssd.bin process on the problem node. The problem node's ocssd.bin then issues "kill -9" against the local instance's background processes to terminate the instance. A "member kill" has a 30-second limit, however: if one or more of the problem instance's processes cannot be killed within 30 seconds, the "member kill" times out. Once that happens, the node that issued the "member kill" escalates it in the same way into a "node kill" request to the problem node's ocssd.bin. When the problem node receives the "node kill" over the network or disk heartbeat, ocssd.bin first tries to shut itself down cleanly; if ocssd.bin can shut down gracefully (which requires the resources that depend on it, such as the DB, ASM and CRS, to stop successfully), CRS tries to restart all daemons instead of rebooting the operating system. Only if any of those resources, or ocssd.bin itself, cannot be shut down cleanly is the node's operating system rebooted.
Common cause: a "member kill" escalating to a "node kill"
1) On Unix or Linux, one or more of the database instance's processes are in a state that "kill -9" cannot terminate directly, for example state "D", which means "uninterruptible sleep (usually IO)". A process enters state D because it has issued an I/O request and is waiting for the OS to complete it; until that I/O completes, the process cannot be killed. So when ocssd.bin tries to kill a process that is waiting on I/O and the I/O does not complete within 30 seconds or less, the "member kill" request times out and is escalated to a "node kill" request (see the ps sketch after this list).
2) On Unix or Linux, if one or more Oracle database or ASM processes are killed but not reaped by a parent process, they become zombie processes. The parent of all Oracle database and ASM processes is the OS init daemon (OS pid 1), so killed processes must be cleaned up by init; until that happens the killed processes remain zombies, with state Z. OSWatcher output shows the process state. The init daemon belongs to the operating system, so if init cannot reap these killed processes quickly, it usually points to an OS problem such as scheduling, and the OS administrator should be engaged to resolve it.
3) The load on the problem node is extremely high (CPU at 100%, or memory pressure so severe that there is heavy paging and constant swap activity), so ocssd.bin cannot kill the database processes it needs to remove within 30 seconds or less. Once the "member kill" times out, it is escalated to a "node kill". We then need to find out why OS CPU or memory usage was so high; OSWatcher output provides this kind of information.
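The process states described above can be spotted with a one-liner like the following (standard Linux ps options; on other platforms the column names differ):
# List processes stuck in uninterruptible I/O wait ("D") or left as zombies ("Z");
# these are the processes a "member kill" cannot remove within 30 seconds
ps -eo pid,ppid,stat,wchan:20,args | awk '$3 ~ /^(D|Z)/'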
Identifying the problem:
To confirm that a node reboot (or CRS restarting over and over) was the result of a "member kill" request escalating to a "node kill", check the following three logs (a grep sketch follows this list):
1. DB alert.log of the node that issued the eviction request, usually the healthy node:
Remote instance kill is issued
2. cssd.log (ocssd.log) of the node that issued the eviction request, usually the healthy node:
clssgmMbrKillEsc: Escalating node <node number of the problem node> Member request 0x00000001 Member success
clssnmMarkNodeForRemoval: node <node number of the problem node>, <the problem node name> marked for removal
clssnmKillNode: node <node number of the problem node> (<the problem node name>) kill initiated
3. Clusterware alert.log: $GRID_HOME/log/<nodename>/alert<nodename>.log
CRS-1663: Member kill issued
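The escalation can be confirmed quickly by searching the three logs for the messages quoted above, for example:
# Healthy node: database alert log and ocssd.log
grep "Remote instance kill" <db alert.log>
grep -E "clssgmMbrKillEsc|clssnmMarkNodeForRemoval|clssnmKillNode" $GRID_HOME/log/<hostname>/cssd/ocssd.log
# Clusterware alert log (either node)
grep "CRS-1663" $GRID_HOME/log/<hostname>/alert<hostname>.log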
1.4 Diagnosing cssagent/cssmonitor evictions
cssagent and cssmonitor must receive heartbeats from CSSD at regular intervals. If the interval since the last CSSD heartbeat grows too long, cssagent or cssmonitor terminates (evicts) the node. When a node aborts for this reason, the clusterware alert log contains a CRS-8011 message:
[ohasd(11243)]CRS-8011:reboot advisory message from host: racnode01, component: cssagent, with timestamp: L-2009-05-05-10:03:25.340
[ohasd(11243)]CRS-8013:reboot advisory message text: Rebooting after limit 28500 exceeded; disk timeout 27630, network timeout 28500, last heartbeat from CSSD at epoch seconds 1241543005.340, 4294967295 milliseconds ago based on invariant clock value of 93235653
The message above may appear on the local node or on a remote node.
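Searching each node's clusterware alert log for the reboot advisory makes it easy to see which component initiated the reboot and when:
# Run on every node; CRS-8011/CRS-8013 identify the component (cssagent,
# cssmonitor, ...) and the timestamp of the reboot advisory
grep -E "CRS-8011|CRS-8013" $GRID_HOME/log/<hostname>/alert<hostname>.log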
Most evictions of this type are related to bugs; see 1.4.1 through 1.4.6 below.
Main causes
The root cause is usually OS resource exhaustion or excessive load, which prevents the ocssd heartbeat process or cssagent from being scheduled or from working properly; what normally has to be resolved is the OS resource contention.
Troubleshooting approach
1. Collect OS resource statistics around the critical time.
CHM / OSWatcher (customers are advised to configure these on their RAC nodes).
From the collected historical data (see the earlier section for how to collect it), the general approach is to look at the CPU and memory load just before the reboot: was the OS load very high, was CPU utilization very high or was a clusterware process using a lot of CPU, was memory utilization very high with severe paging or swapping; hardware failures can also cause such problems. The following excerpt is from Metalink (a sketch for scanning OSWatcher archives follows the excerpt):
Check archived CPU stats (eg. vmstat, top) for CPU load near 100% or run queue very high:
Check historical "top" data for processes using most CPU. If an Oracle clusterware process is using unusually high CPU, see point (2) below.
Check system CPU capacity.
Check historical "ps" data for unusual amount of processes eg. login storm, too many PQ slaves, etc.
Check archived memory and paging stats (eg. vmstat) for system out of memory and/or swap:
Check historical "top", "ps" data for processes using most memory.
Check if RSS of top memory consuming processes was increasing over time.
OS getting locked up in a driver or hardware: Check with sysadmin and/or OS support.
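A rough sketch for scanning OSWatcher archives around the reboot time; the archive root and sub-directory names below follow the typical oswbb layout and are assumptions, so adjust them to your installation:
# Assumed OSWatcher archive root; adjust to your environment
cd /u01/oswbb/archive
# CPU: load average and top consumers just before the reboot
grep -i "load average" oswtop/*.dat | tail -50
# Memory / swap pressure
grep -i "swap" oswtop/*.dat | tail -50
# Run queue and paging from the vmstat samples
tail -200 oswvmstat/*.dat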
1.4.1 BUG: CRSD consuming excessive resources
In 11gR2 GI the crsd.bin process consumes very high CPU, and srvctl, crsctl and other tools that interact with crsd respond very slowly. In rare cases it is so slow that crsd appears to hang. This usually occurs when database versions earlier than 11.2 are in use.
crsd.log reports errors similar to the following:
[ OCRAPI][4113]SLOS : SLOS: cat=-2, opn=malloc, dep=0, loc=addgrpnm10 out of memory(557984)
[ OCRUTL][4113]u_fill_errorbuf: ErrorInfo : [[SLOS: cat=-2, opn=malloc, dep=0, loc=addgrpnm10 out of memory(557984)]]
This is mainly related to the following bugs:
Caused by bug 10019726, bug 12615394, and bug 12767563.
1.4.2 BUG ORAAGENT or ORAROOTAGENT 占用大量资源(CPU,Memory etc)
oraagent.binor orarootagent.bin 进程占用非常高的cpu或者内存资源,当集群一直运行时,这两个进程就会用光操作的资源,结果导致节点驱逐或者重启。
如用 "top" 或者其它内存监控工具,我们会发现非常高的直拨内存使用率,并随着时间不断增加。
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7242 oracle 18 0 308g 953m 14m S 83.2 0.7 166:28.72 oraagent.bin
In the example above, 308 GB of virtual memory is allocated to the agent process while it is consuming 83.2% of the CPU.
This situation is also caused by bugs; for details see the Metalink documents listed in the references.
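A simple way to confirm the leak is to sample the agent's memory footprint over time (process name taken from the example above; run on the affected node):
# Print VSZ/RSS of oraagent.bin every 10 minutes; a steadily growing
# virtual size over hours or days matches the behaviour described above
while true; do
  date
  ps -o pid,vsz,rss,pcpu,args -p "$(pgrep -d, -f oraagent.bin)"
  sleep 600
done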
1.4.3 BUG: High memory usage in GIPC code
High memory usage in the GIPC code shows up as one or more processes (octssd.bin, crsd.bin, etc.) consuming a large amount of memory and, over time, exhausting the operating system's resources.
For example:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7242 oracle 18 0 308g 953m 14m S 83.2 0.7 166:28.72 octssd.bin
In the example above, 308 GB of virtual memory has been allocated.
Clusterware traces may contain the following:
[GIPCXCPT][1100249408] gipcDestroyPtrF [gipcEndpointProcess : gipcEndpoint.c : 259]: EXCEPTION[ ret gipcretInvalidObject (3) ] failure to destroy obj 0x7fabe412acb0 [000000000006951c] { gipcReceiveRequest: peerName 'ipc', data 0x7fabe40ce1c8, len 10240, olen 672, off 0, parentEndp 0x8b5040, ret gipcretSuccess (0), objFlags 0x0, reqFlags 0xac }, flags 0x0
[GIPCXCPT][1100249408] gipcObjectCheckF [gipcDestroyPtrF : gipcInternal.c : 2138]: invalid object magic, obj 0xb2f908, magic èè/è, realMagic gipcObjs (63706967736a624f), type gipcObject, ret gipcretInvalidObject (3)
On Windows, a handle leak can be observed.
This is caused by bugs; for details see the Metalink articles in the references.
1.4.4 BUG: Node eviction caused by OLOGGERD high CPU
The GI alert logs contain the following errors and one of the nodes is evicted:
Node 1 GI Alert log
------------------------
CRS-1612:Network communication with node prodrac2 (2) missing for 50% of timeout interval. Removal of this node from cluster in 29.240 seconds
..
CRS-1610:Network communication with node prodrac2 (2) missing for 90% of timeout interval. Removal of this node from cluster in 3.740 seconds
CRS-1607:Node utx2db02 is being evicted in cluster incarnation 278185525; details at (:CSSNM00007:) in /orabase1/app/11.2.0.3/grid_6/log/utx2db01/cssd/ocssd.log.
Node 2 GI Alert log
------------------------
CRS-1610:Network communication with node prodrac1 (1) missing for 90% of timeout interval. Removal of this node from cluster in 3.740 seconds
CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /orabase1/app/11.2.0.3/grid_6/log/utx2db02/cssd/ocssd.log.
CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /orabase1/app/11.2.0.3/grid_6/log/utx2db02/cssd/ocssd.log
Top output shows that the Cluster Health Monitor (CHM) daemon ologgerd is using high CPU and starts spinning before the reboot:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31439 root RT 0 375m 142m 58m S 161.8 0.1 359:42.49 /orabase1/app/11.2.0.3/grid_6/bin/ologgerd -m utx2db01 -r -d /orabase1/app/11.2.0.3/grid_
ologgerd uses high CPU and does a lot of I/O to the disk where the BDB (the Berkeley DB used by CHM) resides.
This is due to BUG 13867435 OLOGGERD USING A LOT OF RESOURCES.
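If ologgerd is confirmed as the top consumer, an interim measure often used for this bug (verify against the MOS note for bug 13867435 before applying) is to stop, and optionally disable, the CHM resource as root:
# Check, then stop, the Cluster Health Monitor resource on the affected node
crsctl stat res ora.crf -init
crsctl stop res ora.crf -init
# Optionally keep it disabled until the fix for bug 13867435 is installed
crsctl modify res ora.crf -attr "ENABLED=0" -init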
1.4.5 BUG AIX
Apply fix for Bug 13940331. Without this fix, OCSSD threads do not run in RT. This has caused node reboots by cssagent and cssmonitor for several customers. See Document 13940331.8 for more information. The patch can be downloaded from ARU 13940331.
Apply all recommended fixes from Document 1427855.1.
1.4.6 RHEL 6.6: IPC Send timeout/node eviction etc. with high packet reassembles failure
For releases 11.2.0.1 and later on Red Hat or Oracle Linux, after the OS kernel is upgraded to 6.6, the logs on the problem or rebooted node show errors such as:
Fri May 01 03:05:48 2015
IPC Send timeout detected. Receiver ospid 28660 [oracle@xxxxx (LMS0)]
Fri May 01 03:05:48 2015
Errors in file /xddv1covd/oracle/diag/rdbms/xrcovd/XRCOVD3/trace/XRCOVD3_lms0_28660.trc:
IPC Send timeout detected. Receiver ospid 28670 [oracle@xxxxx (LMS1)]
Fri May 01 03:05:53 2015
Errors in file /xddv1covd/oracle/diag/rdbms/xrcovd/XRCOVD3/trace/XRCOVD3_lms1_28670.trc:
Fri May 01 03:06:00 2015
IPC Send timeout detected. Receiver ospid 31414 [oracle@xxxxx (PZ98)]
Fri May 01 03:06:00 2015
Errors in file /xddv1covd/oracle/diag/rdbms/xrcovd/XRCOVD3/trace/XRCOVD3_pz98_31414.trc:
Fri May 01 03:06:13 2015
IPC Send timeout detected. Receiver ospid 1835 [oracle@xxxxx (PZ97)]
Fri May 01 03:06:13 2015
Errors in file /xddv1covd/oracle/diag/rdbms/xrcovd/XRCOVD3/trace/XRCOVD3_pz97_1835.trc:
Fri May 01 03:06:43 2015
Received an instance abort message from instance 1
Please check instance 1 alert and LMON trace files for detail.
LMS0 (ospid: 28660): terminating the instance due to error 481
Fri May 01 03:06:43 2015
System state dump requested by (instance=3, osid=28660 (LMS0)), summary=[abnormal instance termination].
System State dumped to trace file /xddv1covd/oracle/diag/rdbms/xrcovd/XRCOVD3/trace/XRCOVD3_diag_28625.trc
While this is happening, "netstat" shows a huge jump in "packet reassembles failed":
==>> before the issue, the following number is more or less stable or increasing slowly
6817 packet reassembles failed
....
==>> in 30 minutes it increased by 50
6867 packet reassembles failed
==>> now the issue is happening and in 10 seconds it increased by 7533 - 6867 = 666
7533 packet reassembles failed
==>> in another 10 seconds it increased by 9630 - 7533 = 2097
9630 packet reassembles failed
Other symptoms: 1. node evictions; 2. the instance or node cannot rejoin the cluster until the node is rebooted, while the "packet reassembles failed" counter keeps climbing.
Cause:
RHEL 6.6 contains a few ipfrag fixes and increased the default ipfrag_*_thresh values:
cat /proc/sys/net/ipv4/ipfrag_low_thresh
3145728
cat /proc/sys/net/ipv4/ipfrag_high_thresh
4194304
However, the issue still happens. For Oracle Linux running the Red Hat compatible kernel, it is tracked as:
BUG 21036841 - LCOV5/7/17 SERVER CRASHED AFTER PATCH UPGRADE AND KERNEL UPGRADE
Workaround:
Enable jumbo frames, or increase the values of the following two parameters (an application sketch follows):
net.ipv4.ipfrag_high_thresh = 16M
net.ipv4.ipfrag_low_thresh = 15M
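A minimal sketch of applying and verifying the workaround on Linux (the byte values correspond to 16 MB and 15 MB; run as root):
# Apply at runtime
sysctl -w net.ipv4.ipfrag_high_thresh=16777216
sysctl -w net.ipv4.ipfrag_low_thresh=15728640
# Persist across reboots
echo "net.ipv4.ipfrag_high_thresh = 16777216" >> /etc/sysctl.conf
echo "net.ipv4.ipfrag_low_thresh = 15728640" >> /etc/sysctl.conf
sysctl -p
# Confirm the counter stops jumping
netstat -s | grep "packet reassembles failed"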
References:
Troubleshooting Clusterware Node Evictions (Reboots) (Doc ID 1050693.1)
RHEL 6.6: IPC Send timeout/node eviction etc with high packet reassembles failure (Doc ID 2008933.1)
Troubleshooting ORA-29740 in a RAC Environment (Doc ID 219361.1)
Top 5 issues for Instance Eviction (Doc ID 1374110.1)
Top 5 RAC Instance Crash Issues (Doc ID 1375405.1)
LMON received an instance eviction notification from instance n (Doc ID 1440892.1)
Troubleshooting Instance Evictions (Instance terminates with ORA-29740, Instance Abort, Instance Kill) (Doc ID 1549135.1)
Database Instance Hung and One Of RAC Nodes Rebooted as Membership Kill Escalated (Doc ID 1493110.1)
Troubleshooting 10g and 11.1 Clusterware Reboots (Doc ID 265769.1)
Oracle Clusterware 10gR2/ 11gR1/ 11gR2/ 12cR1 Diagnostic Collection Guide (Doc ID 330358.1)




