Recently, the host operations team found that one heartbeat (interconnect) NIC on the physical server hosting Oracle RAC node 2 was flapping — intermittently up and down — and eventually went down for good, so the card had to be replaced.
The steps were as follows:
1. Back up the gpnp profile (on every node)
Node 1:
[root@shenrac01 peer]# pwd
/u01/app/11.2.0/grid/gpnp/shen3rac01/profiles/peer
[root@shenrac01 peer]# cp profile.xml profile.xml_bak
[root@shenrac01 peer]# ll
total 16
-rw-r--r-- 1 grid oinstall 1963 Jun 20 2017 profile.old
-rw-r--r-- 1 grid oinstall 1905 Jun 20 2017 profile_orig.xml
-rw-r--r-- 1 grid oinstall 1963 Jun 20 2017 profile.xml
-rw-r--r-- 1 root root 1963 Apr 8 20:10 profile.xml_bak
Node 2:
[root@shenrac02 bin]# cd /u01/app/11.2.0/grid/gpnp/shen3rac02/profiles/peer
[root@shenrac02 peer]# cp profile.xml profile.xml_bak
In the 11.2 architecture, the private network configuration is stored not only in the OCR but also in the gpnp profile. It was not needed in this exercise, but backing it up first is a sensible precaution.
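If you script this backup, a timestamp in the backup name avoids overwriting an earlier copy on repeated runs. A minimal sketch — the GRID_HOME path is the one from this article, and the scratch-directory part exists only to keep the sketch self-contained; on a real node you would keep PEER_DIR pointing at the actual gpnp directory:

```shell
# Real location of the gpnp profile (path as used in this article):
GRID_HOME=${GRID_HOME:-/u01/app/11.2.0/grid}
NODE=$(hostname -s)
PEER_DIR="$GRID_HOME/gpnp/$NODE/profiles/peer"

# For this self-contained sketch, operate on a scratch copy instead:
PEER_DIR=$(mktemp -d)
printf '<gpnp:GPnP-Profile/>\n' > "$PEER_DIR/profile.xml"

# Timestamped backup so successive runs never clobber each other
STAMP=$(date +%Y%m%d_%H%M%S)
cp -p "$PEER_DIR/profile.xml" "$PEER_DIR/profile.xml.$STAMP"
ls "$PEER_DIR"
```

Run it as the grid owner (the article's profile.xml_bak ended up owned by root, which is harmless for a backup but worth noticing).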
2. Check the interface and private interconnect information on each node
Node 1:
[root@shenrac01 peer]# cd /u01/app/11.2.0/grid/bin
[root@shenrac01 bin]# ./oifcfg getif
eth3 172.11.10.0 global cluster_interconnect
eth5 172.11.11.0 global cluster_interconnect
bond0 172.28.11.0 global public
[root@shenrac01 bin]# ./oifcfg iflist -p -n
eth3 172.11.10.0 PRIVATE 255.255.255.0
eth5 172.11.11.0 PRIVATE 255.255.255.0
eth5 169.254.128.0 UNKNOWN 255.255.128.0
eth5 169.254.0.0 UNKNOWN 255.255.128.0
bond0 172.28.11.0 PRIVATE 255.255.255.0
Node 2:
[root@shenrac02 peer]# cd /u01/app/11.2.0/grid/bin
[root@shenrac02 bin]# ./oifcfg getif
^C
[root@shenrac02 bin]# ./oifcfg iflist -p -n
eth3 172.11.10.0 PRIVATE 255.255.255.0
eth5 172.11.11.0 PRIVATE 255.255.255.0
eth5 169.254.0.0 UNKNOWN 255.255.128.0
eth5 169.254.128.0 UNKNOWN 255.255.128.0
bond0 172.28.11.0 PRIVATE 255.255.255.0
Notice that node 2 could no longer return its configured interface list (oifcfg getif hung and had to be interrupted), and the HAIP address that belonged to heartbeat NIC eth3 had failed over to heartbeat NIC eth5.
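The HAIP drift is easy to spot mechanically: any 169.254.x address reported by oifcfg iflist -p -n is a HAIP, and the first column tells you which physical interface currently carries it. A small sketch against the node 1 output above (the sample output is embedded here so the sketch is self-contained; in practice you would pipe in the live command):

```shell
# Sample `oifcfg iflist -p -n` output, copied from node 1 above
iflist='eth3  172.11.10.0  PRIVATE  255.255.255.0
eth5  172.11.11.0  PRIVATE  255.255.255.0
eth5  169.254.128.0  UNKNOWN  255.255.128.0
eth5  169.254.0.0  UNKNOWN  255.255.128.0
bond0  172.28.11.0  PRIVATE  255.255.255.0'

# HAIP addresses live in 169.254.0.0/16; field 1 is the carrying interface
haip_ifs=$(printf '%s\n' "$iflist" | awk '$2 ~ /^169\.254\./ {print $1}' | sort -u)
echo "HAIP currently on: $haip_ifs"
```

With both HAIP subnets on eth5 and none on eth3, the failover away from the dead card is unambiguous.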
3. Check the current cluster state on node 2
[root@shenrac02 ~]# cd /u01/app/11.2.0/grid/bin
[root@shenrac02 bin]# ./crsctl status res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
[root@shenrac02 bin]# su - oracle
[oracle@shenrac02 ~]$ lsnrctl status
LSNRCTL for Linux: Version 11.2.0.4.0 - Production on 08-APR-2018 20:18:43
Copyright (c) 1991, 2013, Oracle. All rights reserved.
Connecting to (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521))
TNS-12541: TNS:no listener
TNS-12560: TNS:protocol adapter error
TNS-00511: No listener
Linux Error: 111: Connection refused
[oracle@shenrac02 ~]$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.4.0 Production on Sun Apr 8 20:21:07 2018
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to an idle instance.
Node 2 had been evicted from the cluster, and its listener, database instance, and other resources were already down. (Strictly speaking this should not have happened: with two redundant heartbeat paths, one failed NIC should leave the other carrying the interconnect. The fault was noticed late, though, and the logs from the time of the original failure were no longer available, so this needs further observation. Also, the state check could have been done in one step — crsctl status res -t -init shows all the lower-stack resources as well; that didn't occur to us at the time.)
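When crsctl itself is not responding, a quick OS-level cross-check is to look for the instance and listener processes directly. A hypothetical sketch — the sample ps listing below is made up to mimic the state above, where only the lower clusterware stack survives and both ora_pmon and tnslsnr are gone:

```shell
# Made-up `ps -ef` excerpt mimicking an evicted node (not from the incident)
ps_out='root   4521     1  0 /u01/app/11.2.0/grid/bin/ohasd.bin reboot
grid   4702     1  0 /u01/app/11.2.0/grid/bin/gipcd.bin'

# Count instance background and listener processes (|| true keeps a
# zero-match grep from aborting the script under `set -e`)
pmon=$(printf '%s\n' "$ps_out" | grep -c 'ora_pmon_' || true)
lsnr=$(printf '%s\n' "$ps_out" | grep -c 'tnslsnr' || true)
echo "pmon processes: $pmon, listeners: $lsnr"
```

Zero pmon processes and zero listeners is consistent with the "connected to an idle instance" and TNS-12541 output above.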
4. Disable automatic cluster startup on node 2
[root@dbtest2 bin]# cd /u01/app/11.2.0/grid/bin
[root@dbtest2 bin]# ./crsctl disable crs
5. Shut down the node 2 server and replace the NIC
The new interface name must be identical to the old one. Complete all network configuration changes at the OS level — on this platform that means updating /etc/udev/rules.d/70-persistent-net.rules for the new card (handled by a dedicated sysadmin) — and then confirm that the interface is up and reachable from every node:
$ ifconfig -a
$ ping <private hostname>
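The critical OS-level detail is that the udev rule must map the new card's MAC address to the old interface name; otherwise the interface comes up under a different name and no longer matches the cluster interconnect configuration. A sketch of that check — the rule line and MAC value are made-up examples, only the file path is the one from this article:

```shell
# Made-up sample line from /etc/udev/rules.d/70-persistent-net.rules
rules='SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="00:1a:2b:3c:4d:5e", NAME="eth3"'

# MAC of the replacement card; on a real host read it from `ip link show eth3`
new_mac="00:1a:2b:3c:4d:5e"

# Pull the MAC the udev rule maps to eth3 and compare
mapped=$(printf '%s\n' "$rules" | grep -o 'ATTR{address}=="[^"]*"' | cut -d'"' -f2)
if [ "$mapped" = "$new_mac" ]; then
  echo "udev rule matches the new card"
else
  echo "MISMATCH: rule has $mapped, card has $new_mac"
fi
```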
6. Try to start the cluster normally and see whether it comes up
[root@shenrac02 bin]# cd /u01/app/11.2.0/grid/bin
[root@shenrac02 bin]# ./crsctl start crs
[root@shenrac02 bin]# ./crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
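With CRS still not up, the next place to look is the clusterware log tree. In the standard 11.2 layout the per-daemon logs sit under $GRID_HOME/log/<hostname>/; a small sketch that prints the paths consulted in the next two steps (layout assumed from the standard 11.2 install, verify on your own system):

```shell
GRID_HOME=${GRID_HOME:-/u01/app/11.2.0/grid}
NODE=$(hostname -s)

# Logs read in steps 7 and 8, plus the clusterware alert log
logs=$(for f in "crsd/crsd.log" "gipcd/gipcd.log" "alert$NODE.log"; do
  echo "$GRID_HOME/log/$NODE/$f"
done)
printf '%s\n' "$logs"
```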
7. Check the crsd log
2018-04-08 23:59:39.083: [GIPCXCPT][1669314304] gipchaInternalResolve: failed to resolve ret gipcretKeyNotFound (36), host 'shenrac02', port 'b48c-3c02-ef3f-a10a', hctx 0x2962490 [0000000000000010] { gipchaContext : host 'shenrac02', name 'c8f7-4d68-c487-d92d', luid '5d313c36-00000000', numNode 0, numInf 2, usrFlags 0x0, flags 0x5 }, ret gipcretKeyNotFound(36)
2018-04-08 23:59:39.083: [GIPCHGEN][1669314304] gipchaResolveF [gipcmodGipcResolve : gipcmodGipc.c : 809]: EXCEPTION[ ret gipcretKeyNotFound (36) ] failed to resolve ctx 0x2962490[0000000000000010] { gipchaContext : host 'shenrac02', name 'c8f7-4d68-c487-d92d', luid '5d313c36-00000000', numNode 0, numInf 2, usrFlags 0x0, flags 0x5 }, host 'shenrac02', port 'b48c-3c02-ef3f-a10a', flags 0x0
The gipcret errors above implicated gipc name resolution, so we moved on to the gipcd log.
8. Check the gipcd log
2018-04-10 00:08:39.765: [GIPCDMON][2694625024] gipcdMonitorCssCheck: found node shenrac01
2018-04-10 00:08:39.766: [GIPCDMON][2694625024] gipcdMonitorCssCheck: updating timeout node shenrac01
2018-04-10 00:08:39.766: [GIPCDMON][2694625024] gipcdMonitorCssCheck: updating timeout node shenrac01
2018-04-10 00:08:39.766: [GIPCDMON][2694625024] gipcdMonitorCssCheck: found node shenrac02
2018-04-10 00:08:39.766: [GIPCDMON][2694625024] gipcdMonitorFailZombieNodes: skipping live node 'shenrac01', time 0 ms, endp 0000000000000000, 0000000000000933
2018-04-10 00:08:39.766: [GIPCDMON][2694625024] gipcdMonitorFailZombieNodes: skipping live node 'shenrac01', time 0 ms, endp 0000000000000000, 0000000000000ab7
2018-04-10 00:08:39.882: [GIPCDCLT][2698827520] gipcdClientThread: req from local client of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000121
2018-04-10 00:08:39.882: [GIPCDCLT][2698827520] gipcdClientThread: req from local client of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000121
2018-04-10 00:08:41.109: [GIPCDCLT][2698827520] gipcdClientThread: req from local client of type gipcdmsgtypeInterfaceMetrics, endp 000000000000063a
2018-04-10 00:08:41.109: [GIPCDCLT][2698827520] gipcdClientThread: req from local client of type gipcdmsgtypeInterfaceMetrics, endp 000000000000063a
2018-04-10 00:08:42.082: [ CLSINET][2694625024] Returning NETDATA: 2 interfaces
Web searches turned up no concrete cause or fix, so we consulted an Oracle engineer. Their suspicion was that the GI gipcd process itself was misbehaving: it could discover the node but treated it as a zombie node:
"gipcdMonitorFailZombieNodes"
This resembles a known gipcd bug: Bug 16981204 : LNX64-11204-GIPC: GIPCD LOG GROWS UP TOO FAST, ABOUT 11M EVERY 4 HOURS
Their suggested fix was to recover the gipcd process: (1) kill the gipcd process on node 1 — it restarts automatically and does not affect the node 1 instance; or (2) restart the GI stack on node 1.
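Suggestion (1) boils down to finding the gipcd.bin PID and killing it, after which ohasd is expected to respawn it (per the engineer's advice above). Demonstrated here on a made-up process listing so the sketch is self-contained; on a live node you would feed in real ps output and double-check the PID before killing anything:

```shell
# Made-up `ps -eo pid,comm` excerpt standing in for the live process list
ps_sample=' 4702 gipcd.bin
 4711 ocssd.bin
 4720 crsd.bin'

# Pick out the gipcd.bin PID by exact command-name match
gipcd_pid=$(printf '%s\n' "$ps_sample" | awk '$2 == "gipcd.bin" {print $1}')
echo "would run: kill -9 $gipcd_pid"
```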
9. Resolution
(1) Killing the gipcd process on node 1 had no effect; node 2 still would not come up.
(2) After restarting node 1's GI, the node 2 cluster stack started normally.
Note: in production, data safety comes first. Before restarting the sole surviving node, assess the business impact and put risk controls in place — take a full backup, rely on a Data Guard standby, or apply other protective measures.
This article was shared from the WeChat public account DBA小白成长记; in case of infringement, contact service001@enmotech.com for removal.