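InfiniBand is used extensively in Oracle's engineered systems (Exadata, BDA, PCA, and others). A handful of common commands is enough to confirm its health and to troubleshoot problems.
The monitoring commands are as follows: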
[root@dm01sw-ib1 ~]# showunhealthy
OK - No unhealthy sensors
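The result should be OK - No unhealthy sensors. If it is not, run env_test on the InfiniBand switch to find the specific sensor that is failing. Note that this command cannot detect the state of the InfiniBand switch's power supplies. To check the power supply status, run the following command: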
[root@dm01sw-ib1 ~]# checkpower
PSU 0 present OK
PSU 1 present OK
All PSUs OK
The command above should return All PSUs OK.
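The two checks above can be combined into one pass/fail test. The following is only a minimal sketch: check_switch_health is a hypothetical helper name, and it assumes the two commands print exactly the success messages quoted in this article.

```shell
# Sketch: decide switch health from the text printed by showunhealthy and checkpower.
# check_switch_health is a hypothetical helper; it reads the combined output on stdin.
# Possible use on the switch:  { showunhealthy; checkpower; } | check_switch_health
check_switch_health() {
    local text status
    text=$(cat)
    status=0
    if ! printf '%s\n' "$text" | grep -q 'No unhealthy sensors'; then
        echo "WARN: unhealthy sensors reported; run env_test on the switch"
        status=1
    fi
    if ! printf '%s\n' "$text" | grep -q 'All PSUs OK'; then
        echo "WARN: power supply problem reported by checkpower"
        status=1
    fi
    if [ "$status" -eq 0 ]; then
        echo "OK: sensors and PSUs healthy"
    fi
    return "$status"
}
```

The function returns a non-zero status on any warning, so it can be called from cron or from a monitoring agent that only looks at exit codes.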
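Sometimes we also need to look at the link, transmit, receive, relay, and buffer error counters of each port in the InfiniBand network. Running the following command on a database node or on an InfiniBand switch covers this: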
[root@dm01sw-ib1 ~]# ibqueryerrors.pl -s RcvSwRelayErrors,RcvRemotePhysErrors,XmtDiscards,XmtConstraintErrors,RcvConstraintErrors, ExcBufOverrunErrors,VL15Dropped
Suppressing: RcvSwRelayErrors,RcvRemotePhysErrors,XmtDiscards,XmtConstraintErrors,RcvConstraintErrors
Errors for 0x00212846901ea0a0 "SUN DCS 36P QDR dm01sw-ib3 10.242.65.9"
GUID 0x00212846901ea0a0 port 17: [VL15Dropped == 4]
GUID 0x00212846901ea0a0 port 25: [RcvErrors == 218]
GUID 0x00212846901ea0a0 port 27: [RcvErrors == 144]
GUID 0x00212846901ea0a0 port 28: [RcvErrors == 188]
GUID 0x00212846901ea0a0 port 30: [ExcBufOverrunErrors == 1] [RcvErrors == 678] [LinkRecovers == 1]
GUID 0x00212846901ea0a0 port 31: [VL15Dropped == 13]
Errors for 0x002128468eada0a0 "SUN DCS 36P QDR dm01sw-ib2 10.242.65.8"
GUID 0x002128468eada0a0 port 7: [ExcBufOverrunErrors == 3] [RcvErrors == 1299] [LinkRecovers == 3]
GUID 0x002128468eada0a0 port 9: [RcvErrors == 225]
GUID 0x002128468eada0a0 port 10: [ExcBufOverrunErrors == 3] [RcvErrors == 1434] [LinkRecovers == 3]
GUID 0x002128468eada0a0 port 12: [ExcBufOverrunErrors == 4] [RcvErrors == 2382] [LinkRecovers == 4]
GUID 0x002128468eada0a0 port 13: [LinkDowned == 1]
GUID 0x002128468eada0a0 port 14: [LinkDowned == 1]
GUID 0x002128468eada0a0 port 15: [LinkDowned == 1]
GUID 0x002128468eada0a0 port 16: [LinkDowned == 1]
GUID 0x002128468eada0a0 port 17: [LinkDowned == 1]
GUID 0x002128468eada0a0 port 31: [LinkDowned == 1]
Errors for 0x002128469566a0a0 "SUN DCS 36P QDR dm01sw-ib1 10.242.65.7"
GUID 0x002128469566a0a0 port 19: [LinkDowned == 3]
GUID 0x002128469566a0a0 port 21: [LinkDowned == 2]
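A listing like this gets long on a full rack, so it can help to filter it down to the ports where one counter exceeds a limit. The sketch below assumes the line layout shown in this article ("Errors for ..." followed by "GUID ... port N: [Counter == value]" lines); report_ib_errors and its two arguments are hypothetical names, not part of the InfiniBand tools.

```shell
# Sketch: list ports whose COUNTER is above LIMIT in ibqueryerrors.pl output.
# report_ib_errors is a hypothetical helper; it reads the report on stdin.
# Possible use:  ibqueryerrors.pl | report_ib_errors RcvErrors 500
report_ib_errors() {
    awk -v counter="$1" -v limit="$2" '
        /^Errors for / {
            # Keep the quoted node description of the current switch.
            name = $0
            sub(/^[^"]*"/, "", name)
            sub(/".*$/, "", name)
            next
        }
        /^GUID / {
            port = $4
            sub(/:.*$/, "", port)              # "25:[RcvErrors" -> "25"
            pat = "\\[" counter " *== *[0-9]+\\]"
            if (match($0, pat)) {
                value = substr($0, RSTART, RLENGTH)
                sub(/^.*== */, "", value)      # drop "[Counter == "
                sub(/\]$/, "", value)          # drop the closing bracket
                if (value + 0 > limit + 0)
                    printf "%s port %s: %s=%d\n", name, port, counter, value
            }
        }
    '
}
```

The limit is a site-specific judgment call; the function only reports values and leaves it to the operator to decide which ones matter.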
Run ibstatus on the database nodes and storage nodes to query the status of the local InfiniBand ports:
# ibstatus
Infiniband device 'mlx4_0' port 1 status:
        default gid: fe80:0000:0000:0000:0021:2800:01a1:3fed
        base lid: 0x2
        sm lid: 0x1
        state: 4: ACTIVE
        phys state: 5: LinkUp
        rate: 40Gb/sec (4X QDR)
        link_layer: IB
Infiniband device 'mlx4_0' port 2 status:
        default gid: fe80:0000:0000:0000:0021:2800:01a1:3fee
        base lid: 0x5
        sm lid: 0x1
        state: 4: ACTIVE
        phys state: 5: LinkUp
        rate: 40Gb/sec (4X QDR)
        link_layer: IB
The expected healthy result is:
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 40Gb/sec (4X QDR)
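Comparing every port against these three expected values by eye is tedious, so here is a rough sketch that does it with awk. check_ibstatus is a hypothetical helper name, and the test for "4X QDR" simply mirrors the expected rate quoted above; adjust it for other hardware.

```shell
# Sketch: flag ports whose ibstatus block differs from the expected values.
# check_ibstatus is a hypothetical helper; it reads ibstatus output on stdin.
# Possible use:  ibstatus | check_ibstatus
check_ibstatus() {
    awk -v q="'" '
        function report() {
            if (port == "") return
            sub(/ $/, "", bad)
            if (bad == "") printf "%s: OK\n", port
            else { printf "%s: PROBLEM (%s)\n", port, bad; failed = 1 }
        }
        /^Infiniband device/ {
            report()                      # close the previous port block
            dev = $3
            gsub(q, "", dev)              # strip the quotes around the device name
            port = dev " port " $5
            bad = ""
            next
        }
        /^[ \t]*state:/      && $0 !~ /ACTIVE/ { bad = bad "state " }
        /^[ \t]*phys state:/ && $0 !~ /LinkUp/ { bad = bad "phys-state " }
        /^[ \t]*rate:/       && $0 !~ /4X QDR/ { bad = bad "rate " }
        END { report(); exit (failed ? 1 : 0) }
    '
}
```

Each port gets one line of output, and the exit status is non-zero if any port deviates.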
Status of the InfiniBand ports:
# ifconfig ib0
ib0       Link encap:InfiniBand HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
          RX packets:133506 errors:0 dropped:0 overruns:0 frame:0
          TX packets:114796 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1024
          RX bytes:33936833 (32.3 MiB) TX bytes:33524268 (31.9 MiB)
# ifconfig ib1
ib1       Link encap:InfiniBand HWaddr 80:00:00:49:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
          RX packets:1702 errors:0 dropped:1702 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1024
          RX bytes:194784 (190.2 KiB) TX bytes:0 (0.0 b)
# ifconfig bondib0
bondib0   Link encap:InfiniBand HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:192.168.10.9 Bcast:192.168.11.255 Mask:255.255.252.0
          inet6 addr:fe80:221:2800:1a1:ffd/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
          RX packets:135278 errors:0 dropped:1702 overruns:0 frame:0
          TX packets:114847 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:34150057 (32.5 MiB) TX bytes:33540576 (31.9 MiB)
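For the bonded interface bondib0 the healthy status is UP BROADCAST RUNNING MASTER MULTICAST (its member interfaces ib0 and ib1 show SLAVE instead of MASTER). The errors, dropped, and overruns values also need to be watched closely.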
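Watching errors, dropped, and overruns can be scripted. The sketch below assumes the older ifconfig layout shown in this article ("RX packets:... errors:... dropped:..."); newer net-tools releases print a different layout that it does not parse. check_if_counters is a hypothetical helper name.

```shell
# Sketch: print every non-zero errors/dropped/overruns counter from ifconfig output.
# check_if_counters is a hypothetical helper; it reads ifconfig output on stdin.
# Possible use:  ifconfig bondib0 | check_if_counters
check_if_counters() {
    awk '
        /Link encap:/ { iface = $1 }          # first word is the interface name
        /^[ \t]*(RX|TX) packets:/ {
            dir = $1
            for (i = 2; i <= NF; i++) {
                split($i, kv, ":")            # "dropped:1702" -> kv[1], kv[2]
                if ((kv[1] == "errors" || kv[1] == "dropped" || kv[1] == "overruns") && kv[2] + 0 > 0) {
                    printf "%s %s %s=%d\n", iface, dir, kv[1], kv[2]
                    found = 1
                }
            }
        }
        END { exit (found ? 1 : 0) }
    '
}
```

A single non-zero value is not automatically a fault. What matters is whether the value keeps growing between runs.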
Use rds-ping and ping to test whether the links between the database node and all of the storage nodes are working:
# rds-ping 192.168.10.9
1: 61 usec
2: 55 usec
3: 53 usec
……
# ping 192.168.10.9
PING 192.168.10.9 (192.168.10.9) 56(84) bytes of data.
64 bytes from 192.168.10.9: icmp_seq=1 ttl=64 time=1.80 ms
64 bytes from 192.168.10.9: icmp_seq=2 ttl=64 time=0.078 ms
64 bytes from 192.168.10.9: icmp_seq=3 ttl=64 time=0.083 ms
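With many storage nodes it is easier to loop over their addresses than to test them one at a time. The following is only a sketch: check_links is a hypothetical helper, the probe command is passed in as its first argument, and the addresses in the usage comment are placeholders to replace with the storage nodes' real InfiniBand IPs. The probe must terminate on its own, for example ping with a packet count.

```shell
# Sketch: run a probe command against each address and summarize the results.
# check_links is a hypothetical helper:  check_links PROBE ADDR [ADDR...]
# Possible use:  check_links "ping -c 3" 192.168.10.9 192.168.10.10
check_links() {
    local probe addr failures
    probe=$1
    shift
    failures=0
    for addr in "$@"; do
        # $probe is intentionally unquoted so that "ping -c 3" splits into words.
        if $probe "$addr" >/dev/null 2>&1; then
            echo "$addr reachable"
        else
            echo "$addr UNREACHABLE"
            failures=$((failures + 1))
        fi
    done
    [ "$failures" -eq 0 ]
}
```

Passing the probe as a parameter keeps the loop identical for ping and rds-ping, and makes it easy to try the function with a stub before pointing it at production nodes.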
View the port error counters:
# perfquery
# Port counters: Lid 40 port 1 (CapMask: 0x1400)
PortSelect:......................1
CounterSelect:...................0x0000
SymbolErrorCounter:..............0 #####
LinkErrorRecoveryCounter:........0
LinkDownedCounter:...............0 #####
PortRcvErrors:...................0 #####
PortRcvRemotePhysicalErrors:.....0
PortRcvSwitchRelayErrors:........0
PortXmitDiscards:................0
PortXmitConstraintErrors:........0
PortRcvConstraintErrors:.........0
CounterSelect2:..................0x00
LocalLinkIntegrityErrors:........0 #####
ExcessiveBufferOverrunErrors:....0 #####
VL15Dropped:.....................0
PortXmitData:....................4294967295
PortRcvData:.....................4294967295
PortXmitPkts:....................648093271
PortRcvPkts:.....................285784546
Monitor the SymbolErrorCounter, LinkDownedCounter, PortRcvErrors, LocalLinkIntegrityErrors, and ExcessiveBufferOverrunErrors counters (marked with ##### in the output above). Under normal conditions they should not increase.
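Since the point is growth rather than absolute values, one way to check is to save two perfquery snapshots some time apart and compare the five counters. The sketch below does that with awk. counter_growth and the snapshot file names are hypothetical, and the parser assumes the "Name:....value" layout shown in this article.

```shell
# Sketch: compare two saved perfquery outputs and report watched counters that grew.
# counter_growth is a hypothetical helper:  counter_growth BEFORE_FILE AFTER_FILE
# Possible use:
#   perfquery > /tmp/pq.before; sleep 300; perfquery > /tmp/pq.after
#   counter_growth /tmp/pq.before /tmp/pq.after
counter_growth() {
    awk '
        BEGIN {
            n = split("SymbolErrorCounter LinkDownedCounter PortRcvErrors LocalLinkIntegrityErrors ExcessiveBufferOverrunErrors", names, " ")
            for (i = 1; i <= n; i++) watched[names[i]] = 1
        }
        {
            name = $0
            sub(/:.*$/, "", name)
            if (!(name in watched)) next
            value = $0
            sub(/^[^:]*:[. ]*/, "", value)    # drop the name and the dot leader
            value += 0                        # keep only the leading number
            if (FNR == NR) {
                before[name] = value          # first file: remember the baseline
            } else if (value > before[name]) {
                printf "%s grew from %d to %d\n", name, before[name], value
                grew = 1
            }
        }
        END { exit (grew ? 1 : 0) }
    ' "$1" "$2"
}
```

The interval between the two snapshots is arbitrary here; a longer interval catches slow-growing counters at the cost of a slower alert.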
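We can also log in to the web-based management interface of the InfiniBand switch and configure SNMP under Configuration -> System Management Access -> SNMP, which brings the InfiniBand switch into the network monitoring platform.
In this web console, the System Monitoring section holds all system-related monitoring information, including sensors and event logs.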
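The /opt/oracle.SupportTools/ibdiagtools directory on the database nodes provides a set of diagnostic tools for detecting InfiniBand faults. The most commonly used are verify_topology and infinicheck.
infinicheck checks the maximum throughput of the InfiniBand network. It must be run while the system is idle, otherwise it may affect normal business workloads. Run it first with -z to clean up the files generated by the previous run, then run it again without the option.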
# /opt/oracle.SupportTools/ibdiagtools/infinicheck -z
# /opt/oracle.SupportTools/ibdiagtools/infinicheck