Recently, the host operations team found that one heartbeat (interconnect) NIC on the physical server hosting Oracle RAC node 2 was flapping — intermittently up and down — and eventually went down for good, so the card had to be replaced.
The steps were as follows:
1. Back up the gpnp profile (on every node)
Node 1:
[root@shenrac01 peer]# pwd
/u01/app/11.2.0/grid/gpnp/shen3rac01/profiles/peer
[root@shenrac01 peer]# cp profile.xml profile.xml_bak
[root@shenrac01 peer]# ll
total 16
-rw-r--r-- 1 grid oinstall 1963 Jun 20 2017 profile.old
-rw-r--r-- 1 grid oinstall 1905 Jun 20 2017 profile_orig.xml
-rw-r--r-- 1 grid oinstall 1963 Jun 20 2017 profile.xml
-rw-r--r-- 1 root root 1963 Apr 8 20:10 profile.xml_bak
Node 2:
[root@shenrac02 bin]# cd /u01/app/11.2.0/grid/gpnp/shen3rac02/profiles/peer
[root@shenrac02 peer]# cp profile.xml profile.xml_bak
In the 11.2 architecture, the private network configuration is stored not only in the OCR but also in the gpnp profile. It was not needed in this exercise, but backing it up first is a sensible precaution.
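If you script this backup, a timestamp in the backup name avoids overwriting an earlier copy on repeated runs. A minimal sketch — the GRID_HOME path is the one from this article, and the scratch-directory part exists only to keep the sketch self-contained; on a real node you would keep PEER_DIR pointing at the actual gpnp directory:

```shell
# Real location of the gpnp profile (path as used in this article):
GRID_HOME=${GRID_HOME:-/u01/app/11.2.0/grid}
NODE=$(hostname -s)
PEER_DIR="$GRID_HOME/gpnp/$NODE/profiles/peer"

# For this self-contained sketch, operate on a scratch copy instead:
PEER_DIR=$(mktemp -d)
printf '<gpnp:GPnP-Profile/>\n' > "$PEER_DIR/profile.xml"

# Timestamped backup so successive runs never clobber each other
STAMP=$(date +%Y%m%d_%H%M%S)
cp -p "$PEER_DIR/profile.xml" "$PEER_DIR/profile.xml.$STAMP"
ls "$PEER_DIR"
```

Run it as the grid owner (the article's profile.xml_bak ended up owned by root, which is harmless for a backup but worth noticing).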
2. Check the interface and private interconnect information on each node
Node 1:
[root@shenrac01 peer]# cd /u01/app/11.2.0/grid/bin
[root@shenrac01 bin]# ./oifcfg getif
eth3 172.11.10.0 global cluster_interconnect
eth5 172.11.11.0 global cluster_interconnect
bond0 172.28.11.0 global public
[root@shenrac01 bin]# ./oifcfg iflist -p -n
eth3 172.11.10.0 PRIVATE 255.255.255.0
eth5 172.11.11.0 PRIVATE 255.255.255.0
eth5 169.254.128.0 UNKNOWN 255.255.128.0
eth5 169.254.0.0 UNKNOWN 255.255.128.0
bond0 172.28.11.0 PRIVATE 255.255.255.0
Node 2:
[root@shenrac02 peer]# cd /u01/app/11.2.0/grid/bin
[root@shenrac02 bin]# ./oifcfg getif
^C
[root@shenrac02 bin]# ./oifcfg iflist -p -n
eth3 172.11.10.0 PRIVATE 255.255.255.0
eth5 172.11.11.0 PRIVATE 255.255.255.0
eth5 169.254.0.0 UNKNOWN 255.255.128.0
eth5 169.254.128.0 UNKNOWN 255.255.128.0
bond0 172.28.11.0 PRIVATE 255.255.255.0
Notice that node 2 could no longer return its configured interface list (oifcfg getif hung and had to be interrupted), and the HAIP address that belonged to heartbeat NIC eth3 had failed over to heartbeat NIC eth5.
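The HAIP drift is easy to spot mechanically: any 169.254.x address reported by oifcfg iflist -p -n is a HAIP, and the first column tells you which physical interface currently carries it. A small sketch against the node 1 output above (the sample output is embedded here so the sketch is self-contained; in practice you would pipe in the live command):

```shell
# Sample `oifcfg iflist -p -n` output, copied from node 1 above
iflist='eth3  172.11.10.0  PRIVATE  255.255.255.0
eth5  172.11.11.0  PRIVATE  255.255.255.0
eth5  169.254.128.0  UNKNOWN  255.255.128.0
eth5  169.254.0.0  UNKNOWN  255.255.128.0
bond0  172.28.11.0  PRIVATE  255.255.255.0'

# HAIP addresses live in 169.254.0.0/16; field 1 is the carrying interface
haip_ifs=$(printf '%s\n' "$iflist" | awk '$2 ~ /^169\.254\./ {print $1}' | sort -u)
echo "HAIP currently on: $haip_ifs"
```

With both HAIP subnets on eth5 and none on eth3, the failover away from the dead card is unambiguous.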
3. Check the current cluster state on node 2
[root@shenrac02 ~]# cd /u01/app/11.2.0/grid/bin
[root@shenrac02 bin]# ./crsctl status res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
[root@shenrac02 bin]# su - oracle
[oracle@shenrac02 ~]$ lsnrctl status
LSNRCTL for Linux: Version 11.2.0.4.0 - Production on 08-APR-2018 20:18:43
Copyright (c) 1991, 2013, Oracle. All rights reserved.
Connecting to (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521))
TNS-12541: TNS:no listener
TNS-12560: TNS:protocol adapter error
TNS-00511: No listener
Linux Error: 111: Connection refused
[oracle@shenrac02 ~]$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.4.0 Production on Sun Apr 8 20:21:07 2018
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to an idle instance.
Node 2 had been evicted from the cluster, and its listener, database instance, and other resources were already down. (Strictly speaking this should not have happened: with two redundant heartbeat paths, one failed NIC should leave the other carrying the interconnect. The fault was noticed late, though, and the logs from the time of the original failure were no longer available, so this needs further observation. Also, the state check could have been done in one step — crsctl status res -t -init shows all the lower-stack resources as well; that didn't occur to us at the time.)
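When crsctl itself is not responding, a quick OS-level cross-check is to look for the instance and listener processes directly. A hypothetical sketch — the sample ps listing below is made up to mimic the state above, where only the lower clusterware stack survives and both ora_pmon and tnslsnr are gone:

```shell
# Made-up `ps -ef` excerpt mimicking an evicted node (not from the incident)
ps_out='root   4521     1  0 /u01/app/11.2.0/grid/bin/ohasd.bin reboot
grid   4702     1  0 /u01/app/11.2.0/grid/bin/gipcd.bin'

# Count instance background and listener processes (|| true keeps a
# zero-match grep from aborting the script under `set -e`)
pmon=$(printf '%s\n' "$ps_out" | grep -c 'ora_pmon_' || true)
lsnr=$(printf '%s\n' "$ps_out" | grep -c 'tnslsnr' || true)
echo "pmon processes: $pmon, listeners: $lsnr"
```

Zero pmon processes and zero listeners is consistent with the "connected to an idle instance" and TNS-12541 output above.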
4. Disable automatic cluster startup on node 2
[root@dbtest2 bin]# cd /u01/app/11.2.0/grid/bin
[root@dbtest2 bin]# ./crsctl disable crs
5. Shut down the node 2 server and replace the NIC
The new interface name must be identical to the old one. Complete all network configuration changes at the OS level — on this platform that means updating /etc/udev/rules.d/70-persistent-net.rules for the new card (handled by a dedicated sysadmin) — and then confirm that the interface is up and reachable from every node:
$ ifconfig -a
$ ping <private hostname>
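The critical OS-level detail is that the udev rule must map the new card's MAC address to the old interface name; otherwise the interface comes up under a different name and no longer matches the cluster interconnect configuration. A sketch of that check — the rule line and MAC value are made-up examples, only the file path is the one from this article:

```shell
# Made-up sample line from /etc/udev/rules.d/70-persistent-net.rules
rules='SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="00:1a:2b:3c:4d:5e", NAME="eth3"'

# MAC of the replacement card; on a real host read it from `ip link show eth3`
new_mac="00:1a:2b:3c:4d:5e"

# Pull the MAC the udev rule maps to eth3 and compare
mapped=$(printf '%s\n' "$rules" | grep -o 'ATTR{address}=="[^"]*"' | cut -d'"' -f2)
if [ "$mapped" = "$new_mac" ]; then
  echo "udev rule matches the new card"
else
  echo "MISMATCH: rule has $mapped, card has $new_mac"
fi
```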
6. Try to start the cluster normally and see whether it comes up
[root@shenrac02 bin]# cd /u01/app/11.2.0/grid/bin
[root@shenrac02 bin]# ./crsctl start crs
[root@shenrac02 bin]# ./crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
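With CRS still not up, the next place to look is the clusterware log tree. In the standard 11.2 layout the per-daemon logs sit under $GRID_HOME/log/<hostname>/; a small sketch that prints the paths consulted in the next two steps (layout assumed from the standard 11.2 install, verify on your own system):

```shell
GRID_HOME=${GRID_HOME:-/u01/app/11.2.0/grid}
NODE=$(hostname -s)

# Logs read in steps 7 and 8, plus the clusterware alert log
logs=$(for f in "crsd/crsd.log" "gipcd/gipcd.log" "alert$NODE.log"; do
  echo "$GRID_HOME/log/$NODE/$f"
done)
printf '%s\n' "$logs"
```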
7. Check the crsd log
2018-04-08 23:59:39.083: [GIPCXCPT][1669314304] gipchaInternalResolve: failed to resolve ret gipcretKeyNotFound (36), host 'shenrac02', port 'b48c-3c02-ef3f-a10a', hctx 0x2962490 [0000000000000010] { gipchaContext : host 'shenrac02', name 'c8f7-4d68-c487-d92d', luid '5d313c36-00000000', numNode 0, numInf 2, usrFlags 0x0, flags 0x5 }, ret gipcretKeyNotFound(36)
2018-04-08 23:59:39.083: [GIPCHGEN][1669314304] gipchaResolveF [gipcmodGipcResolve : gipcmodGipc.c : 809]: EXCEPTION[ ret gipcretKeyNotFound (36) ] failed to resolve ctx 0x2962490[0000000000000010] { gipchaContext : host 'shenrac02', name 'c8f7-4d68-c487-d92d', luid '5d313c36-00000000', numNode 0, numInf 2, usrFlags 0x0, flags 0x5 }, host 'shenrac02', port 'b48c-3c02-ef3f-a10a', flags 0x0
The gipcret errors above implicated gipc name resolution, so we moved on to the gipcd log.
8. Check the gipcd log
2018-04-10 00:08:39.765: [GIPCDMON][2694625024] gipcdMonitorCssCheck: found node shenrac01
2018-04-10 00:08:39.766: [GIPCDMON][2694625024] gipcdMonitorCssCheck: updating timeout node shenrac01
2018-04-10 00:08:39.766: [GIPCDMON][2694625024] gipcdMonitorCssCheck: updating timeout node shenrac01
2018-04-10 00:08:39.766: [GIPCDMON][2694625024] gipcdMonitorCssCheck: found node shenrac02
2018-04-10 00:08:39.766: [GIPCDMON][2694625024] gipcdMonitorFailZombieNodes: skipping live node 'shenrac01', time 0 ms, endp 0000000000000000, 0000000000000933
2018-04-10 00:08:39.766: [GIPCDMON][2694625024] gipcdMonitorFailZombieNodes: skipping live node 'shenrac01', time 0 ms, endp 0000000000000000, 0000000000000ab7
2018-04-10 00:08:39.882: [GIPCDCLT][2698827520] gipcdClientThread: req from local client of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000121
2018-04-10 00:08:39.882: [GIPCDCLT][2698827520] gipcdClientThread: req from local client of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000121
2018-04-10 00:08:41.109: [GIPCDCLT][2698827520] gipcdClientThread: req from local client of type gipcdmsgtypeInterfaceMetrics, endp 000000000000063a
2018-04-10 00:08:41.109: [GIPCDCLT][2698827520] gipcdClientThread: req from local client of type gipcdmsgtypeInterfaceMetrics, endp 000000000000063a
2018-04-10 00:08:42.082: [ CLSINET][2694625024] Returning NETDATA: 2 interfaces
Web searches turned up no concrete cause or fix, so we consulted an Oracle engineer. Their suspicion was that the GI gipcd process itself was misbehaving: it could discover the node but treated it as a zombie node:
"gipcdMonitorFailZombieNodes"
This resembles a known gipcd bug: Bug 16981204 : LNX64-11204-GIPC: GIPCD LOG GROWS UP TOO FAST, ABOUT 11M EVERY 4 HOURS
Their suggested fix was to recover the gipcd process: (1) kill the gipcd process on node 1 — it restarts automatically and does not affect the node 1 instance; or (2) restart the GI stack on node 1.
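Suggestion (1) boils down to finding the gipcd.bin PID and killing it, after which ohasd is expected to respawn it (per the engineer's advice above). Demonstrated here on a made-up process listing so the sketch is self-contained; on a live node you would feed in real ps output and double-check the PID before killing anything:

```shell
# Made-up `ps -eo pid,comm` excerpt standing in for the live process list
ps_sample=' 4702 gipcd.bin
 4711 ocssd.bin
 4720 crsd.bin'

# Pick out the gipcd.bin PID by exact command-name match
gipcd_pid=$(printf '%s\n' "$ps_sample" | awk '$2 == "gipcd.bin" {print $1}')
echo "would run: kill -9 $gipcd_pid"
```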
9. Resolution
(1) Killing the gipcd process on node 1 had no effect; node 2 still would not come up.
(2) After restarting node 1's GI, the node 2 cluster stack started normally.
Note: in production, data safety comes first. Before restarting the sole surviving node, assess the business impact and put risk controls in place — take a full backup, rely on a Data Guard standby, or apply other protective measures.
This article was shared from the WeChat public account DBA小白成长记; in case of infringement, contact service001@enmotech.com for removal.