敲敲放通: High-End Routers, AI Action Check and HG Fault Check During Inspection
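Problem Description
During an inspection of an SR8806-X-S, two items reported problems: the AI action check on the main control board in slot 0, and the HG fault check between slot 0 and slot 4.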
Process Analysis
AI action check: the record is recent and the alarm severity is critical.
The command display hardware internal diag hardware-diag-action information slot X reports the alarm "master board reject loading by AI."
===============display hardware internal diag hardware-diag-action information slot 0 ===============
--------------------Executed action records:--------------------
--------------------Unexecuted action records:------------------
Slot 0:
1. 2022-12-30, 22:32:27 master board reject loading by AI.
reason:HG_ERR. chip 0: the action unexecuted 38 times.
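When many devices are inspected, records like the one above can be pulled out of captured output with a small script. The following is a minimal sketch, not a vendor tool: the function name parse_action_records and the regular expressions are assumptions derived only from the three lines shown in this case, and other software versions may format the output differently.

```python
import re

# Sample text copied from the unexecuted action record in this case.
SAMPLE = """\
Slot 0:
1. 2022-12-30, 22:32:27 master board reject loading by AI.
reason:HG_ERR. chip 0: the action unexecuted 38 times.
"""

# Patterns inferred from the sample only (assumption, not a documented format).
SLOT_RE = re.compile(r"^Slot (\d+):$")
RECORD_RE = re.compile(
    r"^\d+\.\s+(?P<date>\d{4}-\d{2}-\d{2}),\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<action>.+?)\.?$"
)
REASON_RE = re.compile(
    r"^reason:(?P<reason>\w+)\.\s+chip\s+(?P<chip>\d+):\s+"
    r"the action unexecuted\s+(?P<count>\d+)\s+times\.$"
)


def parse_action_records(text):
    """Return one dict per action record found in the captured output."""
    records = []
    slot = None
    for raw in text.splitlines():
        line = raw.strip()
        m = SLOT_RE.match(line)
        if m:
            slot = int(m.group(1))
            continue
        m = RECORD_RE.match(line)
        if m:
            records.append({
                "slot": slot,
                "date": m.group("date"),
                "time": m.group("time"),
                "action": m.group("action"),
            })
            continue
        m = REASON_RE.match(line)
        if m and records:
            # The reason line belongs to the record that precedes it.
            records[-1].update(
                reason=m.group("reason"),
                chip=int(m.group("chip")),
                count=int(m.group("count")),
            )
    return records


if __name__ == "__main__":
    for rec in parse_action_records(SAMPLE):
        print(rec)
```

A record whose reason is HG_ERR is a hint to look at the HG fault check next, as this case does.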
HG fault check: HG isolate/recover actions were found on slot 0 and slot 4.
===============display hardware internal hgmonitor action 0 ===============
23:35:43:977767 12/30/2022: unit 0 port 22 is recovered normal by local.
23:35:49:870161 12/30/2022: unit 0 port 22 is isolated by local.
===============display hardware internal hgmonitor action 4 ===============
23:35:44:688003 12/30/2022: unit 0 port 70 is recovered normal by rpc.
23:35:50:581055 12/30/2022: unit 0 port 70 is isolated by rpc.
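The two logs above can be compared by script to see which HG ports ended up isolated on each slot. The sketch below is illustrative only: the helper names parse_hg_events and ports_left_isolated are hypothetical, and the timestamp layout (HH:MM:SS:microseconds followed by MM/DD/YYYY) is an assumption read off the four lines in this case.

```python
import re
from datetime import datetime

# hgmonitor action output from this case, keyed by slot number.
HG_LOGS = {
    0: """\
23:35:43:977767 12/30/2022: unit 0 port 22 is recovered normal by local.
23:35:49:870161 12/30/2022: unit 0 port 22 is isolated by local.
""",
    4: """\
23:35:44:688003 12/30/2022: unit 0 port 70 is recovered normal by rpc.
23:35:50:581055 12/30/2022: unit 0 port 70 is isolated by rpc.
""",
}

# Pattern inferred from the sample lines only (assumption).
EVENT_RE = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}:\d{6} \d{2}/\d{2}/\d{4}): "
    r"unit (?P<unit>\d+) port (?P<port>\d+) is "
    r"(?P<state>recovered normal|isolated) by (?P<source>\w+)\.$"
)


def parse_hg_events(slot, text):
    """Parse the hgmonitor action lines of one slot into event dicts."""
    events = []
    for raw in text.splitlines():
        m = EVENT_RE.match(raw.strip())
        if not m:
            continue  # skip banners and anything that is not an event line
        events.append({
            "slot": slot,
            "time": datetime.strptime(m.group("ts"), "%H:%M:%S:%f %m/%d/%Y"),
            "unit": int(m.group("unit")),
            "port": int(m.group("port")),
            "state": m.group("state"),
            "source": m.group("source"),
        })
    return events


def ports_left_isolated(logs):
    """Return (slot, unit, port) tuples whose most recent event is 'isolated'."""
    latest = {}
    for slot, text in logs.items():
        for ev in sorted(parse_hg_events(slot, text), key=lambda e: e["time"]):
            latest[(slot, ev["unit"], ev["port"])] = ev["state"]
    return sorted(key for key, state in latest.items() if state == "isolated")


if __name__ == "__main__":
    for slot, unit, port in ports_left_isolated(HG_LOGS):
        print(f"slot {slot} unit {unit} port {port}: last action was isolation")
```

Sorting by timestamp before taking the last state keeps the result correct even if the captured lines are not in chronological order.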
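Solution
The AI action check alarm is caused by this HG down event, so the two alarms appear together.
1. Check that the boards in slot 0 and slot 4 are fully seated and that their screws are tightened. // In this case the fault cleared after the screws were tightened.
2. If the fault persists after the screws are tightened, the HG ports between slot 0 and slot 4 are still reporting faults, so cross-verify to locate the faulty component. // Before replacing any part, inspect the connectors for slot 0 and slot 4 (on both the boards and the chassis slots) to confirm there are no bent pins or other abnormalities and to rule out a chassis problem.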