四九年入国军
2022-06-01
AIX 7.1 + Oracle 11.2.0.4 RAC: looking for the cause of node 2's automatic host reboot

Node 2 (32.161) rebooted at 10:11 this morning. On node 2 only the logs written after the reboot are available; there is nothing from before the reboot.

Clue 1:

      2022-06-01 10:11:59.193: [cssd(11207262)]CRS-1612:Network communication with node ncshisdb2 (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.848 seconds

Node 1 logged this warning. My understanding is that this heartbeat loss occurred while node 2 was already in the process of rebooting.

Clue 2:

Starting with 11.2.0.2, when any of the following happens, Grid Infrastructure (GI) restarts the clusterware stack instead of rebooting the node (see the command sketch after this list):

1. A node loses network heartbeats continuously for longer than misscount.

2. A node cannot access a majority of the voting files (VF).

3. A member kill is escalated to a node kill.

In earlier versions, the clusterware (CRS) would simply reboot the node in these situations.
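These thresholds and the voting file status can be verified on a running node with the standard crsctl commands (a minimal sketch; values shown are the usual 11.2 defaults, not taken from this environment):

# CSS network heartbeat timeout (misscount), 30 seconds by default in 11.2
crsctl get css misscount

# confirm that all voting files are online and accessible
crsctl query css votedisk

# overall state of the clusterware stack on the local node
crsctl check crs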

So why did my node 2 get a full machine reboot rather than just a clusterware restart? I'd appreciate an expert's explanation.
7 replies
四九年入国军
Uploaded attachment: rac日志.zip
Uncopyrightable

The timeout is probably related to the network or to the connection to the shared disks.

Starting with 11.2.0.2, because of the new rebootless restart feature, the node is not rebooted; instead the clusterware stack is restarted.

I can't find the exact reference right now; hopefully other experts can add to this~


Rebootless Node Fencing

In versions before 11.2.0.2 Oracle Clusterware tried to prevent a split-brain with a fast reboot (better: reset) of the server(s) without waiting for ongoing I/O operations or synchronization of the file systems. This mechanism has been changed in version 11.2.0.2 (first 11g Release 2 patch set). After deciding which node to evict, the clusterware:

. attempts to shut down all Oracle resources/processes on the server (especially processes generating I/Os)

. will stop itself on the node

. afterwards Oracle High Availability Service Daemon (OHASD) will try to start the Cluster Ready Services (CRS) stack again. Once the cluster interconnect is back online, all relevant cluster resources on that node will automatically start

. kill the node if stop of resources or processes generating I/O is not possible (hanging in kernel mode, I/O path, etc.)

This behavior change is particularly useful for non-cluster aware applications.


Prior to 11g R2, during voting disk failures the node would be rebooted to protect the integrity of the cluster. But the underlying problem is not necessarily just a communication issue: the node may be hanging, or an I/O operation may be hanging, so the decision to reboot can potentially be the wrong one. So Oracle Clusterware now fences the node without rebooting it. This is a big achievement and a change in the way the cluster is designed.

The reason to avoid the reboot is that during a reboot, resources need to be re-mastered and the remaining nodes have to re-form the cluster. In a large cluster with many nodes this can be a very expensive operation, so Oracle fences the node by killing the offending processes: the clusterware stack on that node is shut down, but the node itself is not. Once the I/O path or the network heartbeat is available again, the clusterware is started again. The data is still protected, but without the pain of rebooting nodes. In the cases where a reboot is needed to protect integrity, the clusterware will still decide to reboot the node.
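If the rebootless path had been taken, the host would have stayed up while OHASD restarted the stack. On a node that is up, this can be distinguished from a host reboot roughly like this (a minimal sketch with standard 11.2 crsctl and OS commands; not output from this system):

# lower-stack (OHASD-managed) resources such as ora.cssd, ora.ctssd, ora.crsd
crsctl stat res -t -init

# compare OS uptime with the start time of the clusterware daemons:
# a stack-only restart leaves OS uptime untouched, a host reboot resets it
uptime
ps -ef | egrep 'ohasd|ocssd' | grep -v grep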

Root__Liu

Could you provide the ASM logs, GPnP logs, and GIPC logs from both nodes, the database alert logs, the host (OS) logs from both nodes, and OSWatcher output or other host-level monitoring data?


From the attached files, my preliminary judgment is that node 2's host went down and rebooted first, and the cluster was the victim. In other words, the host rebooted for some reason; the clusterware was running normally when the host crashed, which is why nothing was written to its logs, and the messages in node 1's log only appeared once node 1 detected that node 2's network was unreachable.

We need more information from you to analyze this further.
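For reference, on 11.2 Grid Infrastructure those clusterware logs normally sit under the Grid home, and on AIX the error report usually records why a host went down (paths and commands below are the usual defaults, assumed rather than confirmed for this system):

# clusterware logs, per node, under the Grid home
#   $GRID_HOME/log/<hostname>/alert<hostname>.log    clusterware alert log
#   $GRID_HOME/log/<hostname>/cssd/ocssd.log         CSS daemon (heartbeats, evictions)
#   $GRID_HOME/log/<hostname>/gpnpd/gpnpd.log        GPnP daemon
#   $GRID_HOME/log/<hostname>/gipcd/gipcd.log        GIPC daemon
# ASM and database alert logs are under the ADR ($ORACLE_BASE/diag/...)

# AIX host-level evidence for the reboot
errpt -a | more        # full error report, including crash/reboot entries
last reboot            # reboot history from /var/adm/wtmp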



First the heartbeat check failed, so node 2 was to be evicted and its clusterware restarted; but node 2 could not be terminated, the clusterware restart failed, and in the end the machine itself was rebooted.

手机用户8432

Checks:

1. cluvfy comp clocksync -n all — verify that clock synchronization is healthy on every node.

2. crsctl start res ora.ctssd -init — then look at the CTSS running mode.

If CTSS runs in active mode, check whether the time on the two nodes is consistent; if it runs in observer mode, check whether the NTP service is working properly. (A short sketch of these checks follows.)
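A quick way to run those checks (standard 11.2 and AIX commands; assumed defaults, not verified against this system):

# reports whether CTSS is running in active or observer mode
crsctl check ctss

# clock synchronization check across all cluster nodes
cluvfy comp clocksync -n all

# NTP status on AIX (relevant when CTSS is in observer mode)
lssrc -s xntpd
ntpq -p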

四九年入国军

Node 2 had already rebooted shortly after 10:11:59; node 1's heartbeat-loss messages are from 10:12, by which time the machine was already down.

四九年入国军
Question closed: if it remains unresolved I will open a new thread.