Author: Wang Kun. WeChat official account: rundba. Please credit the source when reposting.
To repost on an official account, contact wx: landnow.

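Background: ib01 and ib02 were previously cabled directly through connector 10, and the link worked normally. The cable was later moved to connector 5.
Normally, a cable can be moved to any port while the equipment is powered off, and service resumes without further action.
After the move to connector 5, however, the link LED stayed off and the InfiniBand health check reported an error. Clearing the error and enabling automatic link on connector 5 restored normal operation.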

1. Error after moving to connector 5
Check the health status; an error is reported:
[root@exasw-ibb01 ~]# showunhealthy
WARNING Autodisabled ports
FAILURE - 1 sensors NOT OK
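The health check above can be turned into a simple pass/fail signal. The following is a minimal Python sketch, assuming the output of showunhealthy has already been captured as text; the function name is_healthy is hypothetical, and the only output shapes it knows are the two that appear in this article.

```python
def is_healthy(showunhealthy_output: str) -> bool:
    """Return True only when the captured text reports no unhealthy sensors.

    Assumption: a healthy switch prints a line starting with "OK", and any
    problem shows up as a WARNING or FAILURE line, as in this article.
    """
    text = showunhealthy_output.strip()
    if "WARNING" in text or "FAILURE" in text:
        return False
    return text.startswith("OK")


# The two outputs shown in this article.
before_fix = "WARNING Autodisabled ports\nFAILURE - 1 sensors NOT OK"
after_fix = "OK - No unhealthy sensors"

print(is_healthy(before_fix))
print(is_healthy(after_fix))
```

Checking for WARNING and FAILURE first matters because the failure line also contains the substring "OK".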
2. Environment test
Run env_test:
[root@exasw-ibb01 ~]# env_test
Environment test started:
Starting Environment Daemon test:
Environment daemon running
Environment Daemon test returned OK
Starting Voltage test:
Voltage ECB OK
Measured 3.3V Main = 3.28 V
Measured 3.3V Standby = 3.35 V
Measured 12V = 11.90 V
Measured 5V = 4.99 V
Measured VBAT = 3.03 V
Measured 2.5V = 2.49 V
Measured 1.8V = 1.78 V
Measured I4 1.2V = 1.22 V
Voltage test returned OK
Starting PSU test:
PSU 0 present OK
PSU 1 present OK
PSU test returned OK
Starting Temperature test:
Back temperature 35
Front temperature 37
SP temperature 51
Switch temperature 48, maxtemperature 49
Temperature test returned OK
Starting FAN test:
Fan 0 not present
Fan 1 running at rpm 12426
Fan 2 running at rpm 12317
Fan 3 running at rpm 12099
Fan 4 not present
FAN test returned OK
Starting Connector test:
Connector test returned OK
Starting Onboard ibdevice test:
Switch OK
All Internal ibdevices OK
Onboard ibdevice test returned OK
Starting SSD test:
SSD test returned OK
Starting Auto-link-disable test:
WARNING Autodisabled ports
Auto-link-disable test returned 1 faults
Environment test FAILED    # the test failed
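Because env_test prints many sub-tests, it helps to pull out only the ones that reported faults. The sketch below is a hedged illustration in Python, assuming the output has been saved as text with one message per line; the function name failed_tests is hypothetical, and the pattern is based only on the "returned OK" and "returned 1 faults" lines shown in this article.

```python
import re
from typing import Dict

# Matches lines such as "Auto-link-disable test returned 1 faults".
# Lines ending in "returned OK" do not match because they carry no number.
FAULT_LINE = re.compile(r"^(.+?) test returned (\d+) faults?$")


def failed_tests(env_test_output: str) -> Dict[str, int]:
    """Map each sub-test name to its fault count, for sub-tests with faults."""
    failures = {}
    for line in env_test_output.splitlines():
        match = FAULT_LINE.match(line.strip())
        if match:
            failures[match.group(1)] = int(match.group(2))
    return failures


# Excerpt of the output shown in this article.
sample = """\
Starting SSD test:
SSD test returned OK
Starting Auto-link-disable test:
WARNING Autodisabled ports
Auto-link-disable test returned 1 faults
Environment test FAILED
"""

print(failed_tests(sample))
```

For the excerpt above, the only entry returned is the Auto-link-disable sub-test, which is the same fault that ILOM reports in the next step.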
3. Inspect the fault
An auto-link-disable fault is reported:
spsh-> show faulty
Target                   | Property                       | Value
-------------------------+--------------------------------+--------------------------------------
/SP/faultmgmt/0          | fru                            | SYS
/SP/faultmgmt/0/faults/0 | class                          | fault.device.ib.auto-link-disable    # auto-link-disable fault reported here
/SP/faultmgmt/0/faults/0 | sunw-msg-id                    | ---
/SP/faultmgmt/0/faults/0 | component                      | SYS
/SP/faultmgmt/0/faults/0 | uuid                           | cf425a70-59e4-6711-cb37-a48938f5e257
/SP/faultmgmt/0/faults/0 | timestamp                      | 2020-06-11/09:30:51
/SP/faultmgmt/0/faults/0 | fru_serial_number              | AK00276771
/SP/faultmgmt/0/faults/0 | fru_part_number                | 7052970
/SP/faultmgmt/0/faults/0 | fru_name                       | Sun Datacenter InfiniBand Switch 36
/SP/faultmgmt/0/faults/0 | fru_manufacturer               | Sun Microsystems
/SP/faultmgmt/0/faults/0 | system_component_manufacturer  | Sun Microsystems
/SP/faultmgmt/0/faults/0 | system_component_name          | Sun Datacenter InfiniBand Switch 36
/SP/faultmgmt/0/faults/0 | system_component_part_number   | 7052970
/SP/faultmgmt/0/faults/0 | system_component_serial_number | AK00276771
/SP/faultmgmt/0/faults/0 | chassis_manufacturer           | Sun Microsystems
/SP/faultmgmt/0/faults/0 | chassis_name                   | Sun Datacenter InfiniBand Switch 36
/SP/faultmgmt/0/faults/0 | chassis_part_number            | 7052970
/SP/faultmgmt/0/faults/0 | chassis_serial_number          | AK00276771
/SP/faultmgmt/0/faults/0 | system_manufacturer            | Sun Microsystems
/SP/faultmgmt/0/faults/0 | system_name                    | Sun Datacenter InfiniBand Switch 36
/SP/faultmgmt/0/faults/0 | system_part_number             | 7052970
/SP/faultmgmt/0/faults/0 | system_serial_number           | AK00276771
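The fault class and the uuid are the two values needed later (the uuid is the argument to fmadm repair). As a rough sketch, the pipe-delimited table printed by show faulty can be parsed in Python as follows. The function name parse_show_faulty is hypothetical, and the parser assumes every data row has exactly three fields separated by "|", as in the output above.

```python
from typing import Dict


def parse_show_faulty(output: str) -> Dict[str, Dict[str, str]]:
    """Group Property -> Value pairs by fault target (rows under .../faults/<n>)."""
    faults: Dict[str, Dict[str, str]] = {}
    for line in output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 3:
            continue  # separator line, blank line, or anything unexpected
        target, prop, value = fields
        if "/faults/" in target:
            faults.setdefault(target, {})[prop] = value
    return faults


# Excerpt of the table shown in this article.
sample = """\
Target                   | Property | Value
-------------------------+----------+--------------------------------------
/SP/faultmgmt/0          | fru      | SYS
/SP/faultmgmt/0/faults/0 | class    | fault.device.ib.auto-link-disable
/SP/faultmgmt/0/faults/0 | uuid     | cf425a70-59e4-6711-cb37-a48938f5e257
"""

for target, props in parse_show_faulty(sample).items():
    print(target, props["class"], props["uuid"])
```

Keying the result by target keeps the properties of separate faults apart if more than one fault is listed.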
4. Clear historical errors and counters
Clear the historical errors:
[root@exasw-iba01 ~]# ibclearerrors
## Summary: 7 nodes cleared 0 errors
Clear the historical counters:
[root@exasw-iba01 ~]# ibclearcounters
## Summary: 7 nodes cleared 0 errors
5. List the InfiniBand ports that are currently linked
Connector 5A is shown as connected:
[root@exasw-iba01 ~]# listlinkup
Connector 0A Not present
Connector 1A Not present
Connector 2A Not present
Connector 3A Not present
Connector 4A Not present
Connector 5A Present <-> Switch Port 30 is up (Enabled)
Connector 6A Present <-> Switch Port 35 is up (Enabled)
Connector 7A Present <-> Switch Port 33 is up (Enabled)
Connector 8A Present <-> Switch Port 31 is up (Enabled)
Connector 9A Present <-> Switch Port 14 is up (Enabled)
Connector 10A Not present
Connector 11A Present <-> Switch Port 18 is up (Enabled)
Connector 12A Present <-> Switch Port 11 is up (Enabled)
Connector 13A Present <-> Switch Port 09 is up (Enabled)
Connector 14A Present <-> Switch Port 07 is up (Enabled)
Connector 15A Present <-> Switch Port 05 is up (Enabled)
Connector 16A Present <-> Switch Port 03 is up (Enabled)
Connector 17A Present <-> Switch Port 01 is up (Enabled)
Connector 0B Not present
Connector 1B Not present
Connector 2B Not present
Connector 3B Not present
Connector 4B Not present
Connector 5B Not present
Connector 6B Not present
Connector 7B Not present
Connector 8B Not present
Connector 9B Not present
Connector 10B Not present
Connector 11B Not present
Connector 12B Not present
Connector 13B Not present
Connector 14B Not present
Connector 15B Not present
Connector 16B Not present
Connector 17B Not present
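To confirm quickly which connectors carry a link, the listlinkup output can be reduced to a connector-to-port mapping. The following Python sketch assumes the output is available as text and only recognizes the two line shapes shown above ("Not present" and "Present <-> Switch Port N is STATE (ADMIN)"); any other line shape is ignored. The name connected_links is hypothetical.

```python
import re
from typing import Dict, Tuple

# Matches e.g. "Connector 5A Present <-> Switch Port 30 is up (Enabled)".
LINK_LINE = re.compile(
    r"^Connector (\d+[AB]) Present <-> Switch Port (\d+) is (\w+) \((\w+)\)$"
)


def connected_links(listlinkup_output: str) -> Dict[str, Tuple[int, str, str]]:
    """Map connector name -> (switch port, link state, admin state)."""
    links = {}
    for line in listlinkup_output.splitlines():
        match = LINK_LINE.match(line.strip())
        if match:
            connector, port, state, admin = match.groups()
            links[connector] = (int(port), state, admin)
    return links


# Excerpt of the output shown in this article.
sample = """\
Connector 4A Not present
Connector 5A Present <-> Switch Port 30 is up (Enabled)
Connector 10A Not present
Connector 15A Present <-> Switch Port 05 is up (Enabled)
"""

links = connected_links(sample)
print(links)
print("5A" in links, "10A" in links)
```

In this excerpt connector 5A maps to switch port 30 and is up, while connector 10A, the previous location of the cable, is absent from the mapping.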
6. Clear the Fault Management Shell alert
Enable automatic link on port 5:
enableswitchport --automatic 5    # wrong syntax; the connector must be given as 5A
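The mistake above (passing 5 instead of 5A) is easy to catch before the command is run. The Python sketch below validates a connector name against the naming seen in the listlinkup output in step 5, where connectors run from 0A to 17A and from 0B to 17B. The range and the function name is_connector_name are assumptions drawn from that output, not from the switch documentation.

```python
import re

# A connector name is a number followed by the row letter A or B,
# as printed by listlinkup in this article (0A..17A, 0B..17B).
CONNECTOR_NAME = re.compile(r"^(\d{1,2})([AB])$")


def is_connector_name(arg: str) -> bool:
    """Return True for names such as "5A" or "10B", False for a bare "5"."""
    match = CONNECTOR_NAME.match(arg)
    return bool(match) and 0 <= int(match.group(1)) <= 17


for candidate in ["5", "5A", "10A", "18A"]:
    print(candidate, is_connector_name(candidate))
```

A check like this would reject the bare 5 used in the first attempt and accept the 5A and 10A used in step 7.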
Check the faults again:
spsh-> show faulty
If ILOM still shows the ib auto-link-disable fault, clear it from the Fault Management Shell.
Log in to ILOM:
# spsh
Start a Fault Management session (CLI):
-> start SP/faultmgmt/shell
Are you sure you want to start SP/faultmgmt/shell (y/n)? y
faultmgmtsp> fmadm faulty
faultmgmtsp> fmadm repair cf425a70-59e4-6711-cb37-a48938f5e257
Verify that no faults remain:
faultmgmtsp> fmadm faulty
No problems found    # no faults
faultmgmtsp> exit
-> show faulty    # empty output
-> exit
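7. Alert clears after enabling auto link on the ports
Check the health status; the error is still reported:
[root@exasw-ibb01 ~]# showunhealthy
WARNING Autodisabled ports
FAILURE - 1 sensors NOT OK
Enable auto link on both connector 5A and connector 10A:
enableswitchport --automatic 5A
enableswitchport --automatic 10A
Check the health status again; the error is gone:
[root@exasw-iba01 ~]# showunhealthy
OK - No unhealthy sensors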
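8. Summary
After the cable was moved to a different port, the link failed to come up. Further investigation showed that auto link had been disabled on the port. Clearing the errors again and re-enabling auto link made the error disappear.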




