Problem Description
There is an 11.2 RAC environment with a very odd problem: every time the cluster restarts, CRS on node 1 takes about 30 minutes to come up, and during that window it can only be seen looping through retries in the GPnP startup phase. This looks very much like MOS bug 12356910, but installing that patch did not resolve the issue. We also raised the log level of the process (crsctl set log gpnp "GPNP=5") without finding the root cause. Once CRS is up, gpnptool find can retrieve the list of the other nodes, and no cluster with the same name was discovered. We performed a Deconfigure/Reconfigure of CRS, and in the end even reinstalled CRS on a replacement machine and remounted the DB: everything was normal during testing, but as soon as we switched back to the IPs previously used in production, the problem reappeared.

I have seen a similar case that was eventually solved by segregating the network with a VLAN, but this environment's network is complex, the VLAN boundaries are hard to draw, and a VLAN would plant a mine for later. The environment is Oracle 11.2.0.3, 2-node RAC on HP-UX IA 11.31. This version is past Oracle's support window, so we can no longer get development to pin down the bug; the best options are probably to upgrade, or to change the IPs.

GPnP normally uses the public IP to send UDP broadcast packets; within the same network segment, the mDNS service (mdnsd.bin) of any host running Oracle CRS resolves the local cluster_id and cluster_name to find the other nodes of the same cluster. So in theory changing the public IP should be enough. However, middleware in the application tier still connects to the database via the public IP and cannot be sorted out and changed on short notice. How can we solve the slow CRS startup while avoiding middleware changes, or at least buy the middleware team time to sort things out?
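For reference, the peer-discovery check mentioned above looks roughly like this on a healthy two-node cluster; the output shape is illustrative, and the host names, ports and PIDs below are lab placeholders, not values from this case:

[grid@node1 ~]$ gpnptool find
Found 2 instances of service 'gpnp'.
 mdns:service:gpnp._tcp.local.://node1:17280/agent=gpnpd,cname=anbob-cluster,host=node1,pid=4066/gpnpd h:node1 c:anbob-cluster
 mdns:service:gpnp._tcp.local.://node2:26719/agent=gpnpd,cname=anbob-cluster,host=node2,pid=3208/gpnpd h:node2 c:anbob-cluster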
Expert Answer
Here is one approach.
First, an excerpt of the CRS error logs. The following is a merged fragment of all log files from the same time window, formatted with our own script.
2016-09-21 00:26:06.870@./ohasd/ohasd.log [ CRSPE][29] {073} State change received from qdyya1 for ora.diskmon 1 1
2016-09-21 00:26:06.871@./agent/ohasd/orarootagent_root/orarootagent_root.log [ AGFW][10] {073} Agent is exiting with exit code 1
2016-09-21 00:26:06.871@./agent/ohasd/orarootagent_root/orarootagent_root.log [ AGFW][10] {073} Agent is shutting down.
2016-09-21 00:26:06.871@./ohasd/ohasd.log [ CRSPE][29] {073} RI [ora.diskmon 1 1] new external state [OFFLINE] old value [ONLINE] on qdyya1 label = []
2016-09-21 00:26:06.871@./ohasd/ohasd.log [ CRSPE][29] {073} RI [ora.diskmon 1 1] new target state [OFFLINE] old value [ONLINE]
2016-09-21 00:26:06.872@./ohasd/ohasd.log [ CRSPE][29] {073} Processing unplanned state change for [ora.diskmon 1 1]
2016-09-21 00:26:06.872@./ohasd/ohasd.log [ CRSOCR][27] {073} Multi Write Batch processing...
2016-09-21 00:26:06.875@./ohasd/ohasd.log [ CRSOCR][27] {073} Multi Write Batch done.
2016-09-21 00:26:06.879@./ohasd/ohasd.log [ CRSPE][29] {073} Failover cannot be completed for [ora.diskmon 1 1]. Stopping it and the resource tree
2016-09-21 00:26:06.879@./ohasd/ohasd.log [ CRSPE][29] {073} Target is not ONLINE, not recovering [ora.diskmon 1 1]
2016-09-21 00:26:06.880@./ohasd/ohasd.log [ CRSPE][29] {073} Op 600000000260ac20 has 4 WOs
2016-09-21 00:26:06.887@./ohasd/ohasd.log [ CRSPE][29] {073} ICE has queued an operation. Details Operation [STOP of [ora.diskmon 1 1] on [qdyya1] 600000000260ac20] cannot run cause it needs R lock for
2016-09-21 00:26:06.985@./ohasd/ohasd.log [ CRSCOMM][21] Ipc Client disconnected.
2016-09-21 00:26:06.985@./ohasd/ohasd.log [ CRSCOMM][21] IpcL connection to member 7 has been removed
2016-09-21 00:26:06.985@./ohasd/ohasd.log [ CRSCOMM][21][FFAIL] Ipc Couldnt clscreceive message, no message 11
2016-09-21 00:26:06.985@./ohasd/ohasd.log [ CRSCOMM][21][FFAIL] IpcL Listener got clsc error 11 for memNum. 7
2016-09-21 00:26:06.985@./ohasd/ohasd.log [CLSFRAME][21] Disconnected from AGENT process {Relative|Node0|Process7|Type3}
2016-09-21 00:26:06.985@./ohasd/ohasd.log [CLSFRAME][21] Removing IPC Member{Relative|Node0|Process7|Type3}
2016-09-21 00:26:06.986@./ohasd/ohasd.log [ AGFW][24] {0090} /oracle/app/11.2.0.3/grid/bin/orarootagent_root disconnected.
2016-09-21 00:26:06.986@./ohasd/ohasd.log [ AGFW][24] {0090} Agent /oracle/app/11.2.0.3/grid/bin/orarootagent_root[6161] stopped!
2016-09-21 00:26:06.986@./ohasd/ohasd.log [ AGFW][24] {0090} Agfw Proxy Server received process disconnected notification, count=1
2016-09-21 00:26:06.986@./ohasd/ohasd.log [ CRSPE][29] {0088} Disconnected from server
2016-09-21 00:26:06.986@./ohasd/ohasd.log [ CRSCOMM][24] {0090} IpcL removeConnection Member 7 does not exist.
2016-09-21 00:26:07.179@./cssd/ocssd.l01 [ GPNP][1]clsgpnpm_newWiredMsg [at clsgpnpm.c741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http//www.grid-pnp.org/2005/12/gpnp-errors#"]
2016-09-21 00:26:08.241@./ohasd/ohasd.log [ CRSCCL][18]clsgpnpm_newWiredMsg [at clsgpnpm.c741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http//www.grid-pnp.org/2005/12/gpnp-errors#"]
2016-09-21 00:26:08.310@./gpnpd/gpnpd.log [ OCRMSG][3]GIPC error [29] msg [gipcretConnectionRefused]
2016-09-21 00:26:09.189@./cssd/ocssd.l01 [ GPNP][1]clsgpnpm_newWiredMsg [at clsgpnpm.c741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http//www.grid-pnp.org/2005/12/gpnp-errors#"]
2016-09-21 00:26:10.262@./ohasd/ohasd.log [ CRSCCL][18]clsgpnpm_newWiredMsg [at clsgpnpm.c741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http//www.grid-pnp.org/2005/12/gpnp-errors#"]
2016-09-21 00:26:11.199@./cssd/ocssd.l01 [ GPNP][1]clsgpnpm_newWiredMsg [at clsgpnpm.c741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http//www.grid-pnp.org/2005/12/gpnp-errors#"]
2016-09-21 00:26:12.281@./ohasd/ohasd.log [ CRSCCL][18]clsgpnpm_newWiredMsg [at clsgpnpm.c741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http//www.grid-pnp.org/2005/12/gpnp-errors#"]
2016-09-21 00:26:13.219@./cssd/ocssd.l01 [ GPNP][1]clsgpnpm_newWiredMsg [at clsgpnpm.c741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http//www.grid-pnp.org/2005/12/gpnp-errors#"]
2016-09-21 00:26:14.301@./ohasd/ohasd.log [ CRSCCL][18]clsgpnpm_newWiredMsg [at clsgpnpm.c741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http//www.grid-pnp.org/2005/12/gpnp-errors#"]
2016-09-21 00:26:15.229@./cssd/ocssd.l01 [ GPNP][1]clsgpnpm_newWiredMsg [at clsgpnpm.c741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http//www.grid-pnp.org/2005/12/gpnp-errors#"]
2016-09-21 00:26:16.321@./ohasd/ohasd.log [ CRSCCL][18]clsgpnpm_newWiredMsg [at clsgpnpm.c741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http//www.grid-pnp.org/2005/12/gpnp-errors#"]
2016-09-21 00:26:17.239@./cssd/ocssd.l01 [ GPNP][1]clsgpnpm_newWiredMsg [at clsgpnpm.c741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http//www.grid-pnp.org/2005/12/gpnp-errors#"]
2016-09-21 00:26:18.341@./ohasd/ohasd.log [ CRSCCL][18]clsgpnpm_newWiredMsg [at clsgpnpm.c741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http//www.grid-pnp.org/2005/12/gpnp-errors#"]
2016-09-21 00:26:19.249@./cssd/ocssd.l01 [ GPNP][1]clsgpnpm_newWiredMsg [at clsgpnpm.c741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http//www.grid-pnp.org/2005/12/gpnp-errors#"]
2016-09-21 00:26:20.361@./ohasd/ohasd.log [ CRSCCL][18]clsgpnpm_newWiredMsg [at clsgpnpm.c741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http//www.grid-pnp.org/2005/12/gpnp-errors#"]
2016-09-21 00:26:21.260@./cssd/ocssd.l01 [ GPNP][1]clsgpnpm_newWiredMsg [at clsgpnpm.c741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http//www.grid-pnp.org/2005/12/gpnp-errors#"]
2016-09-21 00:26:22.382@./ohasd/ohasd.log [ CRSCCL][18]clsgpnpm_newWiredMsg [at clsgpnpm.c741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http//www.grid-pnp.org/2005/12/gpnp-errors#"]
2016-09-21 00:26:22.452@./gpnpd/gpnpd.log [ OCRMSG][3]GIPC error [29] msg [gipcretConnectionRefused]
2016-09-21 00:26:23.279@./cssd/ocssd.l01 [ GPNP][1]clsgpnpm_newWiredMsg [at clsgpnpm.c741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http//www.grid-pnp.org/2005/12/gpnp-errors#"]
2016-09-21 00:26:24.401@./ohasd/ohasd.log [ CRSCCL][18]clsgpnpm_newWiredMsg [at clsgpnpm.c741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http//www.grid-pnp.org/2005/12/gpnp-errors#"]
2016-09-21 00:26:25.289@./cssd/ocssd.l01 [ GPNP][1]clsgpnpm_newWiredMsg [at clsgpnpm.c741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http//www.grid-pnp.org/2005/12/gpnp-errors#"]
2016-09-21 00:26:26.421@./ohasd/ohasd.log [ CRSCCL][18]clsgpnpm_newWiredMsg [at clsgpnpm.c741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http//www.grid-pnp.org/2005/12/gpnp-errors#"]
2016-09-21 00:26:27.299@./cssd/ocssd.l01 [ GPNP][1]clsgpnpm_newWiredMsg [at clsgpnpm.c741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http//www.grid-pnp.org/2005/12/gpnp-errors#"]
2016-09-21 00:26:28.441@./ohasd/ohasd.log [ CRSCCL][18]clsgpnpm_newWiredMsg [at clsgpnpm.c741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http//www.grid-pnp.org/2005/12/gpnp-errors#"]
2016-09-21 00:26:29.309@./cssd/ocssd.l01 [ GPNP][1]clsgpnpm_newWiredMsg [at clsgpnpm.c741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http//www.grid-pnp.org/2005/12/gpnp-errors#"]
2016-09-21 00:26:30.461@./ohasd/ohasd.log [ CRSCCL][18]clsgpnpm_newWiredMsg [at clsgpnpm.c741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http//www.grid-pnp.org/2005/12/gpnp-errors#"]
2016-09-21 00:26:30.963@./agent/ohasd/oraagent_grid/oraagent_grid.l01 [ora.mdnsd][8] {002} [check] clsdmc_respget return status=0, ecode=0
...
--- The log shows the retries continuing for about half an hour before GPnP finally gives up and CRS starts.
The method itself is actually quite simple:
Because CRS records the public network by interface name, the idea is: change the original NIC to the new IP, then add another physical NIC and give it the original public IP. The new IP becomes the CRS public IP, while the original IP serves a manually added Oracle listener that keeps the existing applications working. Below is a test in my virtual machine.
Because only the public IP changes, staying on the same NIC and the same subnet, CRS itself needs no reconfiguration; the clusterware just has to be restarted once around the OS-level IP change. All of the following is done on node 1 only. The steps are as follows (a condensed command sketch follows the list):
1. Shut down the Oracle Clusterware stack
2. Change to the new IP address at the network layer, and update DNS and the /etc/hosts file to reflect the change
3. Add a new network interface and assign it the original IP
4. Restart the Oracle Clusterware stack
5. Manually add an additional Oracle listener on the original IP
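On the Linux test system used below, the whole sequence condenses to roughly the following sketch; the interface names match the demo, and on HP-UX the OS-level commands would of course differ:

[root@node1 ~]# crsctl stop crs                  # step 1
# steps 2-3: swap the addresses of eth0 (old public) and eth3 (new NIC)
# in their ifcfg files, update DNS and /etc/hosts, then bounce both NICs
[root@node1 ~]# ifdown eth0; ifdown eth3
[root@node1 ~]# ifup eth0; ifup eth3
[root@node1 ~]# crsctl start crs                 # step 4
[grid@node1 ~]$ lsnrctl start listener2          # step 5, once listener.ora has the new entry

The /etc/hosts entries before the change: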
192.168.1.116 node1 node1.anbob.com    # to be changed to 192.168.1.16
192.168.1.126 node2 node2.anbob.com
192.168.1.216 node1-vip
192.168.1.226 node2-vip
172.168.1.116 node1-priv
172.168.1.126 node2-priv
192.168.1.200 anbob-cluster anbob-cluster-scan
[root@node1 ~]# ifconfig
eth0      Link encap:Ethernet  HWaddr 08:00:27:5F:EC:1A
          inet addr:192.168.1.116  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fe5f:ec1a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:17081 errors:0 dropped:0 overruns:0 frame:0
          TX packets:13480 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1405616 (1.3 MiB)  TX bytes:6599214 (6.2 MiB)

eth0:1    Link encap:Ethernet  HWaddr 08:00:27:5F:EC:1A
          inet addr:192.168.1.216  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

eth1      Link encap:Ethernet  HWaddr 08:00:27:93:EE:5E
          inet addr:172.168.1.116  Bcast:172.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fe93:ee5e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:27881 errors:0 dropped:0 overruns:0 frame:0
          TX packets:12911 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:15779267 (15.0 MiB)  TX bytes:3670283 (3.5 MiB)

eth1:1    Link encap:Ethernet  HWaddr 08:00:27:93:EE:5E
          inet addr:169.254.193.48  Bcast:169.254.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

eth2      Link encap:Ethernet  HWaddr 08:00:27:73:D2:CE
          inet addr:192.168.56.20  Bcast:192.168.56.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fe73:d2ce/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3014 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1794 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:272606 (266.2 KiB)  TX bytes:304859 (297.7 KiB)

eth3      Link encap:Ethernet  HWaddr 08:00:27:7C:D7:DA
          inet addr:192.168.1.16  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fe7c:d7da/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:16960 errors:0 dropped:0 overruns:0 frame:0
          TX packets:268 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1385423 (1.3 MiB)  TX bytes:36566 (35.7 KiB)
Note:
eth0 carries the public IP, eth1 the cluster_interconnect (private IP); eth2 is a host-only network between my physical machine and the VM (ignore it); eth3 is the newly added physical NIC.
[root@node1 ~]# crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

[grid@node1 ~]$ crsctl query crs activeversion
Oracle Clusterware active version on the cluster is [11.2.0.3.0]

[grid@node1 ~]$ crsctl query crs releaseversion
Oracle High Availability Services release version on the local node is [11.2.0.3.0]

[grid@node1 ~]$ oifcfg getif
eth0  192.168.1.0  global  public
eth1  172.168.1.0  global  cluster_interconnect

[grid@node1 ~]$ srvctl config nodeapps -a
Network exists: 1/192.168.1.0/255.255.255.0/eth0, type static
VIP exists: /node1-vip/192.168.1.216/192.168.1.0/255.255.255.0/eth0, hosting node node1
VIP exists: /node2-vip/192.168.1.226/192.168.1.0/255.255.255.0/eth0, hosting node node2

[grid@node1 ~]$ lsnrctl status
LSNRCTL for Linux: Version 11.2.0.3.0 - Production on 01-NOV-2016 15:18:18
Copyright (c) 1991, 2011, Oracle.  All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER)))
STATUS of the LISTENER
------------------------
Alias                     LISTENER
Version                   TNSLSNR for Linux: Version 11.2.0.3.0 - Production
Start Date                01-NOV-2016 15:15:46
Uptime                    0 days 0 hr. 2 min. 34 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /u01/app/11.2.0.3/grid/network/admin/listener.ora
Listener Log File         /u01/app/grid/diag/tnslsnr/node1/listener/alert/log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.1.116)(PORT=1521)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.1.216)(PORT=1521)))
Services Summary...
Service "+ASM" has 1 instance(s).
  Instance "+ASM1", status READY, has 1 handler(s) for this service...
The command completed successfully

[root@node1 ~]# crsctl stop crs
Note:
Node 2 needs no changes. After adding the new NIC on node 1 and stopping CRS there, the IP addresses of eth0 and eth3 can be swapped at the OS level.
[root@node1 ~]# ifdown eth3
[root@node1 ~]# ifdown eth0
[root@node1 ~]# ifup eth0
[root@node1 ~]# ifup eth3
[root@node1 ~]# ifconfig
eth0      Link encap:Ethernet  HWaddr 08:00:27:5F:EC:1A
          inet addr:192.168.1.16  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fe5f:ec1a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:40411 errors:0 dropped:0 overruns:0 frame:0
          TX packets:31081 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3345829 (3.1 MiB)  TX bytes:14142844 (13.4 MiB)

eth1      Link encap:Ethernet  HWaddr 08:00:27:93:EE:5E
          inet addr:172.168.1.116  Bcast:172.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fe93:ee5e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:64788 errors:0 dropped:0 overruns:0 frame:0
          TX packets:35273 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:36053768 (34.3 MiB)  TX bytes:10979476 (10.4 MiB)

eth2      Link encap:Ethernet  HWaddr 08:00:27:73:D2:CE
          inet addr:192.168.56.20  Bcast:192.168.56.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fe73:d2ce/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:6400 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3946 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:560483 (547.3 KiB)  TX bytes:698482 (682.1 KiB)

eth3      Link encap:Ethernet  HWaddr 08:00:27:7C:D7:DA
          inet addr:192.168.1.116  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fe7c:d7da/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:42434 errors:0 dropped:0 overruns:0 frame:0
          TX packets:498 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3363456 (3.2 MiB)  TX bytes:71937 (70.2 KiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:35592 errors:0 dropped:0 overruns:0 frame:0
          TX packets:35592 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:30547907 (29.1 MiB)  TX bytes:30547907 (29.1 MiB)

[root@node1 ~]# vi /etc/hosts
# 192.168.1.116 node1 node1.anbob.com
192.168.1.16  node1 node1.anbob.com
192.168.1.126 node2 node2.anbob.com
192.168.1.216 node1-vip
192.168.1.226 node2-vip
172.168.1.116 node1-priv
172.168.1.126 node2-priv
192.168.1.200 anbob-cluster anbob-cluster-scan

[root@node1 ~]# crsctl start crs

[grid@node1 ~]$ lsnrctl status
LSNRCTL for Linux: Version 11.2.0.3.0 - Production on 01-NOV-2016 15:47:26
Copyright (c) 1991, 2011, Oracle.  All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER)))
STATUS of the LISTENER
------------------------
Alias                     LISTENER
Version                   TNSLSNR for Linux: Version 11.2.0.3.0 - Production
Start Date                01-NOV-2016 15:47:21
Uptime                    0 days 0 hr. 0 min. 18 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /u01/app/11.2.0.3/grid/network/admin/listener.ora
Listener Log File         /u01/app/11.2.0.3/grid/log/diag/tnslsnr/node1/listener/alert/log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.1.16)(PORT=1521)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.1.216)(PORT=1521)))
Services Summary...
Service "+ASM" has 1 instance(s).
  Instance "+ASM1", status READY, has 1 handler(s) for this service...
The command completed successfully
Note:
After changing the IP at the OS level and updating /etc/hosts, start CRS: the listener now listens on the new public IP. If static registration is in use anywhere, remember to update the parameter files by hand.
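For instance, if the database hard-codes the old address in LOCAL_LISTENER (a hypothetical setting, not used in this demo), it would need something like:

SQL> alter system set local_listener='(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.1.16)(PORT=1521))' scope=both sid='anbob1';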
Now manually add a listener on the original IP with static registration by appending the following to listener.ora; note that mine is named listener2.
SID_LIST_LISTENER2 =
  (SID_LIST =
    (SID_DESC =
      (GLOBAL_DBNAME = anbob)
      # this is the DB ORACLE_HOME, not the GI home
      (ORACLE_HOME = /u01/app/oracle/product/11.2.0/db_1)
      (SID_NAME = anbob1)
    )
  )

LISTENER2 =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.1.116)(PORT = 1521)(IP = FIRST))
    )
  )

[grid@node1 admin]$ lsnrctl start listener2
LSNRCTL for Linux: Version 11.2.0.3.0 - Production on 01-NOV-2016 19:29:28
Copyright (c) 1991, 2011, Oracle.  All rights reserved.
Starting /u01/app/11.2.0.3/grid/bin/tnslsnr: please wait...
TNSLSNR for Linux: Version 11.2.0.3.0 - Production
System parameter file is /u01/app/11.2.0.3/grid/network/admin/listener.ora
Log messages written to /u01/app/11.2.0.3/grid/log/diag/tnslsnr/node1/listener2/alert/log.xml
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.1.116)(PORT=1521)))
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.1.116)(PORT=1521)(IP=FIRST)))
STATUS of the LISTENER
------------------------
Alias                     listener2
Version                   TNSLSNR for Linux: Version 11.2.0.3.0 - Production
Start Date                01-NOV-2016 19:29:29
Uptime                    0 days 0 hr. 0 min. 2 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /u01/app/11.2.0.3/grid/network/admin/listener.ora
Listener Log File         /u01/app/11.2.0.3/grid/log/diag/tnslsnr/node1/listener2/alert/log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.1.116)(PORT=1521)))
Services Summary...
Service "anbob" has 1 instance(s).
  Instance "anbob1", status UNKNOWN, has 1 handler(s) for this service...
The command completed successfully
Note:
A new listener named listener2 has been added by hand, listening on the original public IP and the same port. Note that we deliberately do not register this listener with CRS: the new NIC exists only on node 1, and adding a new listener resource would also require a new network resource, with the risk that CRS again uses the original public IP and the startup problem returns. The listener therefore has to be kept alive by a shell script of our own; a minimal sketch follows.
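A minimal watchdog sketch, assuming it runs from the grid user's crontab and that the paths match this demo (they are assumptions for any other environment):

#!/bin/sh
# check_listener2.sh -- restart the manually-added listener2 if it is down
ORACLE_HOME=/u01/app/11.2.0.3/grid
export ORACLE_HOME
LOG=/tmp/check_listener2.log

# the [t] trick keeps grep from matching its own process entry
if ! ps -ef | grep "[t]nslsnr listener2" > /dev/null 2>&1
then
    echo "`date`: listener2 not running, restarting" >> $LOG
    $ORACLE_HOME/bin/lsnrctl start listener2 >> $LOG 2>&1
fi

Scheduled once a minute, e.g.: * * * * * /home/grid/check_listener2.sh

With the listener kept alive this way, we can test connections from the client side.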
[grid@node1 admin]$ ps -ef | grep lsnr
grid      4070     1  0 19:29 ?        00:00:00 /u01/app/11.2.0.3/grid/bin/tnslsnr listener2 -inherit
grid      5439     1  0 19:43 ?        00:00:00 /u01/app/11.2.0.3/grid/bin/tnslsnr LISTENER_SCAN1 -inherit
grid      5476     1  0 19:43 ?        00:00:00 /u01/app/11.2.0.3/grid/bin/tnslsnr LISTENER -inherit

[grid@node1 admin]$ netstat -an | grep 1521
tcp        0      0 192.168.1.216:1521          0.0.0.0:*                   LISTEN
tcp        0      0 192.168.1.16:1521           0.0.0.0:*                   LISTEN
tcp        0      0 192.168.1.200:1521          0.0.0.0:*                   LISTEN
tcp        0      0 192.168.1.116:1521          0.0.0.0:*                   LISTEN
tcp        0      0 192.168.1.216:1521          192.168.1.216:58796         ESTABLISHED
tcp        0      0 192.168.1.216:58796         192.168.1.216:1521          ESTABLISHED
tcp        0      0 192.168.1.200:1521          192.168.1.200:60114         ESTABLISHED
tcp        0      0 192.168.1.216:58825         192.168.1.216:1521          ESTABLISHED
tcp        0      0 192.168.1.200:60114         192.168.1.200:1521          ESTABLISHED
tcp        0      0 192.168.1.216:1521          192.168.1.216:58825         ESTABLISHED
unix  3      [ ]         STREAM     CONNECTED     61521

# Add the following to tnsnames.ora:
# new public ip
anbob16 =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.1.16)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = anbob)
    )
  )

# original public ip
anbob116 =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.1.116)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = anbob)
    )
  )

[oracle@node1 admin]$ sqlplus anbob/anbob@anbob16
SQL*Plus: Release 11.2.0.3.0 Production on Tue Nov 1 20:21:20 2016
Copyright (c) 1982, 2011, Oracle.  All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
SQL>

[oracle@node1 admin]$ sqlplus anbob/anbob@anbob116
SQL*Plus: Release 11.2.0.3.0 Production on Tue Nov 1 20:25:17 2016
Copyright (c) 1982, 2011, Oracle.  All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
SQL>
Note:
Testing shows that both the new and the original public IP accept connections. The original VIP still sits on eth0 with an unchanged address, so it also remains reachable through the listener; that test is not shown here.
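As a transition aid for the middleware, a client-side alias can also list both addresses, so connections fail over automatically if the old listener is eventually retired (an illustrative entry, not part of the original test; service name as in the demo):

anbob_both =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (FAILOVER = on)
      (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.1.116)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.1.16)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = anbob)
    )
  )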
When I started testing I first tried adding the new NIC without swapping IPs, i.e. node 1 on eth3/eth1 while node 2 stayed on eth0/eth1. That of course failed. What I found:
1. Modifying the public interface:
oifcfg delif -global eth0 -n node1
oifcfg setif -global eth3/192.168.1.0:public
cascades the change to the other node, and it does so even if you use a per-node setting instead of -global.
2. If the other node is down, the change above completes without error, but trying to roll back and delete eth3 then fails as follows:
[grid@node1 ~]$ oifcfg delif -global eth3
PRIF-33: Failed to set or delete interface because hosts could not be discovered
CRS-02307: No GPnP services on requested remote hosts.
PRIF-32: Error in checking for profile availability for host node2
CRS-02306: GPnP service on host "node2" not found.
The fix is to bring the other node back up, make sure all nodes are in RUNNING state, and add -force to the delete (see the sketch after this list).
3. If the other node is down, operation #1 still completes, but after node 2 starts, its VIP will not fail back, because the VIP depends on the interface name and eth3 does not exist on node 2.
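For item 2 above, the forced rollback would look roughly like this (flag placement per the 11.2 oifcfg usage; verify with oifcfg -help on your release):

[grid@node1 ~]$ oifcfg delif -global eth3 -force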
Note! Sometimes the cause is a GIPC communication problem between the nodes; in that case you can try killing the gipcd.bin process on the nodes other than the problem node. The process restarts automatically after being killed, and CRS availability is not affected.
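For example, on the healthy node (the PID comes from the ps output; ohasd respawns gipcd.bin on its own):

[root@node2 ~]# ps -ef | grep [g]ipcd.bin      # note the PID of gipcd.bin
[root@node2 ~]# kill -9 <pid>                  # the daemon is respawned within seconds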
--------- Update 2018-10 ---------
There is a MOS note that closely matches this problem and version, for bug 12356910:
CSSD Fails to Start After Repeated Message: Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) (Doc ID 1588034.1)
Solution
The fix is included in 11.2.0.4 and 12.1; apply interim patch 12356910 if the business is impacted.
The workaround is to restart GI on all nodes.
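That is, as root on each node in turn:

[root@node1 ~]# crsctl stop crs
[root@node1 ~]# crsctl start crs
(repeat on node2)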