“IPC send timeout error” 导致RAC的节点挂起

Roger 2019-06-17

4328

问题描述

一同事管理的一套3 nodes rac，上周出现故障，由于其中一个节点出现IPC send timeout，导致另外2个node挂住，应用无法访问，最后只能重启node。

根据 alert 整理了一个简单的时间列表：

###### sndb3 ######

Fri Nov 25 09:26:08 2011   IPC Send timeout detected. Receiver ospid 340396   -- sndb3
..........   
Fri Nov 25 09:28:54 2011   IPC Send timeout detected. Receiver ospid 340396   -- sndb3

Fri Nov 25 10:04:19 2011   Trace dumping is performing id=[cdmp_20111125100426]

Fri Nov 25 10:10:07 2011   Warning: cluster_database_instances (2) of node 0 is <= my node id (2) 

Fri Nov 25 10:12:20 2011   Reconfiguration started  -- 开始  reconfig 此时 List of nodes: 0 2

Fri Nov 25 10:28:05 2011   Reconfiguration started  -- 第2次 reconfig List of nodes: 0 1 2

Fri Nov 25 10:49:03 2011   Reconfiguration started  -- 第3次 reconfig List of nodes: 2

Fri Nov 25 10:49:50 2011   Reconfiguration started  -- 第4次 reconfig List of nodes: 1 2

Fri Nov 25 10:55:17 2011   Reconfiguration started  -- 第5次 reconfig List of nodes: 2

Fri Nov 25 10:55:18 2011   Reconfiguration complete 

Fri Nov 25 10:55:19 2011   Shutting down instance (abort)  -- abort 数据库

Fri Nov 25 10:56:47 2011   Starting ORACLE instance -- 正常启动数据库

###### sndb1 ######

Fri Nov 25 09:26:29 2011   IPC Send timeout detected.Sender: ospid 352952
                           Receiver: inst 3 binc -981855203 ospid 340396

Fri Nov 25 09:26:32 2011   ORA-27506: IPC error connecting to a port

Fri Nov 25 09:26:33 2011   ORA-07445: exception encountered: core dump [] [] [] [] [] []

Fri Nov 25 09:26:38 2011   IPC Send timeout detected.Sender: ospid 258258
                           Receiver: inst 3 binc -981855203 ospid 340396

Fri Nov 25 09:27:05 2011   ORA-27506: IPC error connecting to a port

                           ORA-07445: exception encountered: core dump [] [] [] [] [] []

Fri Nov 25 09:42:09 2011   ORA-07445: exception encountered: core dump [] [] [] [] [] []
                           ORA-03135: connection lost contact
                           ORA-03135: connection lost contact

Fri Nov 25 10:12:41 2011   Reconfiguration started  -- 第1次 reconfig List of nodes: 0 2

Fri Nov 25 10:22:55 2011   IPC Send timeout detected.Sender: ospid 127746
                           Receiver: inst 3 binc -981855203 ospid 340396
.............
.............

Fri Nov 25 10:22:58 2011   IPC Send timeout detected.Sender: ospid 299594
                           Receiver: inst 3 binc -981855203 ospid 340396

Fri Nov 25 10:22:58 2011   ORA-27506: IPC error connecting to a port

Fri Nov 25 10:22:59 2011   ORA-07445: exception encountered: core dump [] [] [] [] [] []
.............

Fri Nov 25 10:23:25 2011   ORA-07445: exception encountered: core dump [] [] [] [] [] []

Fri Nov 25 10:28:26 2011   Reconfiguration started -- 第2次 reconfig List of nodes: 0 1 2

Fri Nov 25 10:33:55 2011   Reconfiguration complete

Fri Nov 25 10:39:14 2011   IPC Send timeout detected.Sender: ospid 377580
                           Receiver: inst 3 binc -981855203 ospid 340396

Fri Nov 25 10:39:18 2011   ORA-27506: IPC error connecting to a port
                           ORA-07445: exception encountered: core dump [] [] [] [] [] []

Fri Nov 25 10:49:19 2011   ORA-29746: Cluster Synchronization Service is being shut down.

Fri Nov 25 10:49:20 2011   ORA-27300: OS system dependent operation:invalid_process failed with status: 0
                           ORA-27301: OS failure message: Error 0
                           ORA-27302: failure occurred at: skgpstime3
                           ORA-27144: attempt to kill process failed
                           ORA-29746: Cluster Synchronization Service is being shut down.
                           Instance terminated by LMON, pid = 365348

Fri Nov 25 11:02:12 2011   Starting ORACLE instance (normal) -- 正常启动数据库实例 sndb1

###### sndb2 ######

Fri Nov 25 09:26:20 2011   Trace dumping is performing id=[cdmp_20111125092634]

Fri Nov 25 09:26:24 2011   IPC Send timeout detected.Sender: ospid 352604
                           Receiver: inst 3 binc -981855203 ospid 340396

Fri Nov 25 09:26:50 2011   ORA-27506: IPC error connecting to a port
                           ORA-07445: exception encountered: core dump [] [] [] [] [] []

Fri Nov 25 10:04:24 2011   ORA-07445: exception encountered: core dump [] [] [] [] [] []

Fri Nov 25 10:04:25 2011   ORA-07445: exception encountered: core dump [] [] [] [] [] []
                           ORA-03113: end-of-file on communication channel

Fri Nov 25 10:10:13 2011   ORA-07445: exception encountered: core dump [] [] [] [] [] []
                           ORA-03135: connection lost contact
                           ORA-03135: connection lost contact

Fri Nov 25 10:12:23 2011   Shutting down instance (abort)  -- abort 数据库

Fri Nov 25 10:12:23 2011   ORA-27300: OS system dependent operation:invalid_process failed with status: 0
                           ORA-27301: OS failure message: Error 0
                           ORA-27302: failure occurred at: skgpstime3
                           ORA-27144: attempt to kill process failed

Fri Nov 25 10:12:58 2011   Instance terminated by USER, pid = 304124

Fri Nov 25 10:27:52 2011   Starting ORACLE instance (normal) -- 正常启动数据库

Fri Nov 25 10:28:10 2011   lmon registered with NM - instance id 2 (internal mem no 1)

Fri Nov 25 10:28:12 2011   Reconfiguration started  -- 第1次 reconfig  List of nodes:0 1 2

Fri Nov 25 10:33:23 2011   IPC Send timeout detected.Sender: ospid 529138

Fri Nov 25 10:33:26 2011   ORA-27506: IPC error connecting to a port

Fri Nov 25 10:33:34 2011   ORA-00482: LMD* process terminated with error

Fri Nov 25 10:33:36 2011   Shutting down instance (abort)  -- 第2次 abort 数据库
...............
...............
Fri Nov 25 10:34:47 2011   ORA-27300: OS system dependent operation:invalid_process failed with status: 0
                           ORA-27301: OS failure message: Error 0
                           ORA-27302: failure occurred at: skgpstime3
                           ORA-27144: attempt to kill process failed

Fri Nov 25 10:34:53 2011   Instance terminated by USER, pid = 446862

Fri Nov 25 10:49:39 2011   Starting ORACLE instance (normal) -- 第2次正常启动数据库

Fri Nov 25 10:49:57 2011   Reconfiguration started -- 第2次 reconfig List of nodes: 1 2

Fri Nov 25 10:56:39 2011   Instance terminated by USER, pid = 205414

Fri Nov 25 11:03:59 2011   Starting ORACLE instance (normal) -- 第3次正常启动数据库

Fri Nov 25 11:04:17 2011   Reconfiguration started -- 第3次 reconfig List of nodes: 0 1 2

专家解答

整体的梳理情况如上，据同事反映，该rac在25号9~10点之间，由于报错ipc send out，导致1,2两个node hang住，

他的直接描述是应用都无法连接1,2 node了，最后他通过srvctl强制停掉1,2 节点。

我们知道，出现ipc超时，而已引起系统系统挂住的情况，可能有如下几种：

1. cpu负载很高

2. 内存不足

3. 网络问题，如心跳异常

4. 系统参数设置，比如udp，tcp相关参数设置过小

5. crs和db版本不一致（这点我不确定，是因为这里的情况是crs 10.2.0.4, db 10.2.0.1), 这点是我临时加上的。

下面我们来一一排查进行分析：

sndb3 节点在故障之间的 osw 监控信息如下：

zzz ***Fri Nov 25 09:25:46 EAT 2011

System configuration: lcpu=16 mem=63936MB

kthr    memory              page              faults        cpu    
----- ----------- ------------------------ ------------ -----------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
 0  0 4002889 11233515   0   0   0   0    0   0 6238 240716 20451 11  6 81  2
 2  0 4002865 11233537   0   0   0   0    0   0 6215 240551 21467 10  6 82  2
 2  0 4002895 11233502   0   0   0   0    0   0 5594 239027 19584 11  5 82  2

zzz ***Fri Nov 25 09:26:16 EAT 2011

System configuration: lcpu=16 mem=63936MB

kthr    memory              page              faults        cpu    
----- ----------- ------------------------ ------------ -----------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
 0  1 4002579 11229653   0   0   0   0    0   0 1950 31763 8699  4  2 88  5
 0  1 4002657 11229073   0   0   0   0    0   0 2325 27718 9141  3  2 90  5
 2  2 4002661 11228588   0   0   0   0    0   0 1891 22869 8332  4  1 90  5

zzz ***Fri Nov 25 09:26:46 EAT 2011

System configuration: lcpu=16 mem=63936MB

kthr    memory              page              faults        cpu    
----- ----------- ------------------------ ------------ -----------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
 1  0 4009439 11223429   0   0   0   0    0   0 820 21974 4532  4  2 95  0
 1  0 4009445 11223419   0   0   0   0    0   0 797 13064 4497  3  1 96  0
 1  0 4009445 11223416   0   0   0   0    0   0 621 11103 3988  3  1 96  0

zzz ***Fri Nov 25 09:27:16 EAT 2011

System configuration: lcpu=16 mem=63936MB

kthr    memory              page              faults        cpu    
----- ----------- ------------------------ ------------ -----------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
 3  1 4012753 11217337   0   0   0   0    0   0 303 229232 3042  6 20 73  0
 2  0 4012780 11216595   0   0   0   0    0   0 612 107627 3761  5  9 86  0
 2  0 4012780 11216592   0   0   0   0    0   0 770 12774 4312  3  1 96  0

从上可以看到，sndb3 在ipc send超时出现时，系统cpu和内存均非常空闲，下面来检查网络：

zzzFri Nov 25 09:25:46 EAT 2011

Name  Mtu   Network     Address            Ipkts Ierrs    Opkts Oerrs  Coll
en3   1500  link#2      0.21.5e.6.8.37    56004124     0   948458    10     0
                        01:00:5e:00:00:01
en3   1500  192.168.0   192.168.0.3       56004124     0   948458    10     0
                        224.0.0.1          
en0   1500  link#3      0.21.5e.6.8.f2    623180180     0 574030148     3     0
                        01:00:5e:00:00:01
en0   1500  10.10.10    10.10.10.12       623180180     0 574030148     3     0
                        224.0.0.1          
en1   1500  link#4      0.21.5e.6.8.f3           0     0  1401015 1401015     0
                        01:00:5e:00:00:01
en1   1500  10.76.200   10.76.200.132            0     0  1401015 1401015     0
                        224.0.0.1          
en2   1500  link#5      0.21.5e.6.8.36    435553601     0 119387543     3     0
                        01:00:5e:00:00:01
en2   1500  10.76.1     10.76.1.143       435553601     0 119387543     3     0
                        224.0.0.1          
en2   1500  10.76.1     10.76.1.144       435553601     0 119387543     3     0
                        224.0.0.1          
lo0   16896 link#1                        14167657     0 14181993     0     0
lo0   16896 127         127.0.0.1         14167657     0 14181993     0     0
                        224.0.0.1          
lo0   16896 ::1                           14167657     0 14181993     0     0
                        ff01::1          
                        ff02::1:ff00:1   
                        ff02::1          
icmp:
	95502471 calls to icmp_error
	0 errors not generated because old message was icmp
	Output histogram:
		echo reply: 4295876
		destination unreachable: 226000
	872 messages with bad code fields
	0 messages < minimum length
	12 bad checksums
	0 messages with bad length
	Input histogram:
		echo reply: 1458209
		destination unreachable: 56666
		echo: 4295876
		time exceeded: 191
	4295876 message responses generated
igmp:
	0 messages received
	0 messages received with too few bytes
	0 messages received with bad checksum
	0 membership queries received
	0 membership queries received with invalid field(s)
	0 membership reports received
	0 membership reports received with invalid field(s)
	0 membership reports received for groups to which we belong
	8 membership reports sent
tcp:
	122743924 packets sent
		116431735 data packets (2302053184 bytes)
		25761 data packets (10003811 bytes) retransmitted
		3553016 ack-only packets (3361721 delayed)
		0 URG only packets
		13 window probe packets
		2423518 window update packets
		309881 control packets
		28566873 large sends
		1403066833 bytes sent using largesend
		65160 bytes is the biggest largesend
	386118499 packets received
		369946371 acks (for 2302274662 bytes)
		74335 duplicate acks
		0 acks for unsent data
		98730125 packets (1777353999 bytes) received in-sequence
		23016 completely duplicate packets (3194142 bytes)
		0 old duplicate packets
		20 packets with some dup. data (8594 bytes duped)
		35986 out-of-order packets (2602936 bytes)
		0 packets (0 bytes) of data after window
		0 window probes
		49947 window update packets
		7 packets received after close
		0 packets with bad hardware assisted checksum
		0 discarded for bad checksums
		0 discarded for bad header offset fields
		0 discarded because packet too short
		13 discarded by listeners
		0 discarded due to listener's queue full
		283911590 ack packet headers correctly predicted
		15339170 data packet headers correctly predicted
	59734 connection requests
	59629 connection accepts
	83810 connections established (including accepts)
	119948 connections closed (including 1831 drops)
	0 connections with ECN capability
	0 times responded to ECN
	35446 embryonic connections dropped
	351698548 segments updated rtt (of 112846079 attempts)
	0 segments with congestion window reduced bit set
	0 segments with congestion experienced bit set
	0 resends due to path MTU discovery
	36306 path MTU discovery terminations due to retransmits
	203836 retransmit timeouts
		439 connections dropped by rexmit timeout
	189 fast retransmits
		26 when congestion window less than 4 segments
	134 newreno retransmits
	27 times avoided false fast retransmits
	30 persist timeouts
		0 connections dropped due to persist timeout
	46287 keepalive timeouts
		9999 keepalive probes sent
		36288 connections dropped by keepalive
	0 times SACK blocks array is extended
	0 times SACK holes array is extended
	0 packets dropped due to memory allocation failure
	0 connections in timewait reused
	0 delayed ACKs for SYN
	0 delayed ACKs for FIN
	0 send_and_disconnects
	0 spliced connections
	0 spliced connections closed
	0 spliced connections reset
	0 spliced connections timeout
	0 spliced connections persist timeout
	0 spliced connections keepalive timeout
udp:
	702526878 datagrams received
	0 incomplete headers
	3 bad data length fields
	5 bad checksums
	95502471 dropped due to no socket
	47780240 broadcast/multicast datagrams dropped due to no socket
	100987 socket buffer overflows <== Note here.
	559143172 delivered
	559613903 datagrams output
ip:
	1106726708 total packets received
	0 bad header checksums
	0 with size smaller than minimum
	0 with data size < data length
	0 with header length < data size
	0 with data length < header length
	0 with bad options
	0 with incorrect version number
	13553929 fragments received
	9 fragments dropped (dup or out of space)
	841 fragments dropped after timeout
	3122140 packets reassembled ok
	1094399939 packets for this host
	56393 packets for unknown/unsupported protocol
	0 packets forwarded
	207190 packets not forwardable
	0 redirects sent
	693606490 packets sent from this host
	991 packets sent with fabricated ip header
	0 output packets dropped due to no bufs, etc.
	0 output packets discarded due to no route
	4245736 output datagrams fragmented
	20081318 fragments created
	396 datagrams that can't be fragmented
	1631395 IP Multicast packets dropped due to no receiver
	0 successful path MTU discovery cycles
	0 path MTU rediscovery cycles attempted
	0 path MTU discovery no-response estimates
	0 path MTU discovery response timeouts
	0 path MTU discovery decreases detected
	0 path MTU discovery packets sent
	0 path MTU discovery memory allocation failures
	0 ipintrq overflows
	2 with illegal source
	0 packets processed by threads
	0 packets dropped by threads
	14925 packets dropped due to the full socket receive buffer
	0 dead gateway detection packets sent
	0 dead gateway detection packet allocation failures
	0 dead gateway detection gateway allocation failures

ipv6:
	0 total packets received
	0 with size smaller than minimum
	0 with data size < data length
	0 with incorrect version number
	0 with illegal source
	0 input packets without enough memory
	0 fragments received
	0 fragments dropped (dup or out of space)
	0 fragments dropped after timeout
	0 packets reassembled ok
	0 packets for this host
	0 packets for unknown/unsupported protocol
	0 packets forwarded
	0 packets not forwardable
	0 too big packets not forwarded
	0 packets sent from this host
	0 packets sent with fabricated ipv6 header
	0 output packets dropped due to no bufs, etc.
	0 output packets without enough memory
	0 output packets discarded due to no route
	0 output datagrams fragmented
	0 fragments created
	0 packets dropped due to the full socket receive buffer
	0 packets not delivered due to bad raw IPv6 checksum
icmpv6:
	0 calls to icmp6_error
	0 errors not generated because old message was icmpv6
	Output histogram:
		unreachable: 0
		packets too big: 0
		time exceeded: 0
		parameter problems: 0
		redirects: 0
		echo requests: 0
		echo replies: 0
		group queries: 0
		group reports: 0
		group terminations: 0
		router solicitations: 0
		router advertisements: 0
		neighbor solicitations: 0
		neighbor advertisements: 0
	0 messages with bad code fields
	0 messages < minimum length
	0 bad checksums
	0 messages with bad length
	Input histogram:
		unreachable: 0
		packets too big: 0
		time exceeded: 0
		parameter problems: 0
		echo requests: 0
		echo replies: 0
		group queries: 0
			bad group queries: 0
		group reports: 0
			bad group reports: 0
			our groups' reports: 0
		group terminations: 0
		bad group terminations: 0
		router solicitations: 0
		bad router solicitations: 0
		router advertisements: 0
		bad router advertisements: 0
		neighbor solicitations: 0
		bad neighbor solicitations: 0
		neighbor advertisements: 0
		bad neighbor advertisements: 0
		redirects: 0
		bad redirects: 0
		mobility calls when not started: 0
		home agent address discovery requests: 0
		bad home agent address discovery requests: 0
		home agent address discovery replies: 0
		bad home agent address discovery replies: 0
		prefix solicitations: 0
		bad prefix solicitations: 0
		prefix advertisements: 0
		bad prefix advertisements: 0
	0 message responses generated

从上面的 100987 socket buffer overflows 信息来看，似乎有包溢出，从这里看可能是udp相关参数设置有问题。

下面再来分析下crs的日志以及lmd，lmon等关键性进程的trace：

*** 2011-11-25 09:26:38.408
IPC Send timeout detected.Sender: ospid 258258
Receiver: inst 3 binc -981855203 ospid 340396
SKGXPCTX: 0x11053d158 ctx
wait delta 0 sec (14 msec) ctx ts 0x4b421 last ts 0x4b422
user cpu time since last wait 0 sec 0 ticks
system cpu time since last wait 0 sec 0 ticks
locked 1
blocked 1
timed wait receives 0
admno 0x3c67bf54 admport:
SSKGXPT 0x1053d4a4 flags  active network 0
info for network 0
	socket no 11 	IP 10.10.10.10 	UDP 36952
	HACMP network_id 0	sflags SSKGXPT_UP
context timestamp 0x4b421
buffers queued on port 11057a888
    sconno     accono   ertt  state   seq#   sent  async   sync rtrans   acks
       ach     accono     sconno      admno  state   seq#    rcv rtrans   acks
0x1105d9050 0x2c61e946 0x48f33837 0x1368ad14     40  32764      1      0      1 
Submitting synchronized dump request [268435460]
*** 2011-11-25 09:27:05.144
ORA-27506: IPC error connecting to a port
Assertion failed: 0, file  skgxpu.c, line 8568
*** 2011-11-25 09:27:05.144
NOTICE: Process received signal SIGLOST. Therefore, the call
        stack does not point to the location of the problem.
        Take a look at /var/adm/messages to find why this 
        process got SIGLOST
Exception signal: 6	0x9000000002832b4 (pthread_kill+0080) e8410028
Registers:

#if defined (SIGLOST)	/* resource lost (eg, record-lock lost) */
  signal_names[SIGLOST] = "SIGLOST";
#endif

###### sndb3 的lmon trace如下 ######

*** 2011-11-25 10:12:20.202
kjfcrfg: DRM window size = 2048->2048 (min lognb = 14) 
*** 2011-11-25 10:12:20.210
Reconfiguration started (old inc 6, new inc 8)
Synchronization timeout interval: 1020 sec
List of nodes:
 0 2
*** 2011-11-25 10:12:20.213
 Global Resource Directory frozen
node 0
node 2
 asby init, 0/0/x2 
 asby returns, 0/0/x2/false 
* Domain maps before reconfiguration: 
*   DOMAIN 0 (valid 1): 0 1 2 
* End of domain mappings 
* kjbdomrcfg2: domain 0 invalid = TRUE 
* Domain maps after recomputation: 
*   DOMAIN 0 (valid 0): 0 2 
* End of domain mappings 
 Dead  inst 1 
 Join  inst 
 Exist inst 0 2 
KSXPLOOKUP: warning cnh incarnation number may be stale 8 8 6
KSXPLOOKUP: warning cnh incarnation number may be stale 8 8 6
KSXPLOOKUP: warning cnh incarnation number may be stale 8 8 6
KSXPLOOKUP: warning cnh incarnation number may be stale 8 8 6
KSXPLOOKUP: warning cnh incarnation number may be stale 8 8 6
 Active Sendback Threshold = 50 %
 Communication channels reestablished
 sent syncr inc 8 lvl 1 to 0 (8,5/0/0) 
 synca inc 8 lvl 1 rcvd (8.5) 
 replayed dom info (8.6) 
 received master 0 (8.6) 
 sent syncr inc 8 lvl 2 to 0 (8,7/0/0) 
 synca inc 8 lvl 2 rcvd (8.7) 
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 Set master node info 
 sent syncr inc 8 lvl 3 to 0 (8,13/0/0) 
 synca inc 8 lvl 3 rcvd (8.13) 
 Submitted all remote-enqueue requests
kjfcrfg: Number of mesgs sent to node 0 = 2204
 sent syncr inc 8 lvl 4 to 0 (8,15/0/0) 
 synca inc 8 lvl 4 rcvd (8.15) 
 Dwn-cvts replayed, VALBLKs dubious
 sent syncr inc 8 lvl 5 to 0 (8,18/0/0) 
 synca inc 8 lvl 5 rcvd (8.18) 
 All grantable enqueues granted
 sent syncr inc 8 lvl 6 to 0 (8,20/0/0) 
 synca inc 8 lvl 6 rcvd (8.20) 
*** 2011-11-25 10:12:22.061
 Submitted all GCS cache requests
 sent syncr inc 8 lvl 7 to 0 (8,22/0/0) 
 synca inc 8 lvl 7 rcvd (8.22) 
 Fix write in gcs resources
 sent syncr inc 8 lvl 8 to 0 (8,24/0/0) 
 synca inc 8 lvl 8 rcvd (8.24) 
*** 2011-11-25 10:12:23.459

其他2个node的crsd ocssd 以及lmon lmd等trace也没有发现有价值的信息，而且在9:26~10:12之间都没任何信息。

只有lmd trace中有点有价值的信息：

SKGXPDOCON:connection refused: out of  memory
SSKGXPID 0xffffb640 network 0	Internet address 10.10.10.11 	UDP port number 61800
SKGXPDOCON:connection refused: out of  memory
SSKGXPID 0xffffb640 network 0	Internet address 10.10.10.11 	UDP port number 61800
*** 2011-11-25 09:26:20.449
SKGXPDOCON:connection refused: out of  memory
SSKGXPID 0xffffb640 network 0	Internet address 10.10.10.10 	UDP port number 36992
SKGXPDOCON:connection refused: out of  memory
SSKGXPID 0xffffb640 network 0	Internet address 10.10.10.10 	UDP port number 33548
*** 2011-11-25 09:27:15.598
Received ORADEBUG command 'dump errorstack 1' from process Unix process pid: 275164, image: 
*** 2011-11-25 09:27:15.598

另外通过oraadebug ipc，我们也能看到rac心跳确实是使用的udp协议，如下：

Shared Memory:
ID		KEY
10485768	0xe1892820
Maximum processes:               = 750
ksxpdmp:  facility 0 (?) (0x1, 0x0) counts 0, 0
ksxpdmp:  Dumping the osd context
SKGXPCTX: 0x1105a4c78 ctx
wait delta 13 sec (13149 msec) ctx ts 0xb last ts 0xb
user cpu time since last wait -171798692 sec -1 ticks
system cpu time since last wait 0 sec 0 ticks
locked 1
blocked 0
timed wait receives 0
admno 0x104b9dec admport:
SSKGXPT 0x105a4fc4 flags SSKGXPT_READPENDING active network 0
info for network 0
	socket no 7 	IP 10.10.10.12 	UDP 55895 <== Note here.
	HACMP network_id 0	sflags SSKGXPT_UP
context timestamp 0xb
	no ports
    sconno     accono   ertt  state   seq#   sent  async   sync rtrans   acks
0x3115c6c7 0x6e0d721f     32      3  32765      2      2      0      0      2

另外os方面也没有发现有价值的信息，故障期间的 awr top5 如下：

<div id="B" style="height:auto; width:auto;OVERFLOW-y:auto;border:red 1px solid;margin:10px">
<img src="http://www.killdb.com/wp-content/uploads/2011/11/IPC_send_timeout_error_top_5.jpg" alt="" title="IPC_send_timeout_error_top_5" width="500" height="162" class="aligncenter size-full wp-image-854" />
</div>

关于ges/gcs enter server mode 这两个等待事件表现正常，表现为该实例在进行recover，smaon进程表现出的等待为该event。

下面是google到的解释：

In a RAC instance the during the instance recovery, first pass scan can be delayed by 300ms-1.5s waiting 
on GRD (Global resource directory). During that time SMAON enters the wait event which is called "ges enter server mode" .

另外有点值得关注是，SQ相关等待较高。

通过下面的Global Cache and Enqueue Services - Messaging Statistics 也能看出cluster interconnect 通信消耗还是挺大的。

另外mos上找到一个相关的文档，不过跟这个情况看上去不符合，该文档如下：
<span style="font-size: 12px; color: #0000ff; font-family: monospace;">
IPC Send timeout detected and Instance eviction on RDBMS and ASM 10.2.0.1 RAC [ID 370182.1]
</span>

最后给出的邮件回复如下：

该参数设置过小可能会导致rac node挂起, 故认为第2种可能性更大。
 
当前系统该值为：

            udp_recvspace = 655360
            udp_sendspace = 65536
 
根据官方文档，建议修改为如下值：

            udp_recvspace = 2621440(该参数可以不调整)
            udp_sendspace = 262144
 
udp_sendspace 官方文档推荐值是(DB_BLOCK_SIZE * DB_MULTIBLOCK_READ_COUNT) + 4 KB


参数调整以后，要监控一段时间。
 
补充：从awr来看，top 5中 enq: SQ - contention 等待较高，建议调大sequence的cache值（默认值20较小）

TOYS_SEQ_ZF 下存在sequence，通过如下方式进行修改：
 
alter sequence owner.sequence_name cache 1000;

检查sequence cache值：

set lines 200
set pagesize 1000
col owner for a20
col SEQUENCE_NAME for a30

select owner,sequence_name,MAX_VALUE,INCREMENT_BY,CYCLE_FLAG,ORDER_FLAG,CACHE_SIZE 
  from dba_sequence;

oracle

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者

“IPC send timeout error” 导致RAC的节点挂起

问题描述

专家解答

评论