
Zhang Wenning, IRF stack split on S6800 devices at a site


Network topology and description

/

Alarm information

/

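Problem description

On October 3, two S6800 devices that were running normally suddenly suffered an IRF stack split at 18:19. The stack then recovered on its own at about 18:24, with no manual intervention: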

%Oct  3 18:19:27:691 2022 HK-FT-0201-E02-H6800QTH3-LA-01 BFD/4/BFD_MAD_INTERFACE_CHANGE_STATE: BFD MAD function enabled on Vlan-interface199 changed to the faulty state.

%Oct  3 18:19:29:987 2022 HK-FT-0201-E02-H6800QTH3-LA-01 DRVPLAT/4/DrvDebug:

 The port Forty1/0/53 can't receive irf pkt and has been changed to inactive status, please check.

%Oct  3 18:19:29:987 2022 HK-FT-0201-E02-H6800QTH3-LA-01 DRVPLAT/4/DrvDebug:

 The port Forty1/0/54 can't receive irf pkt, please check.

%Oct  3 18:19:40:615 2022 HK-FT-0201-E02-H6800QTH3-LA-01 DRVPLAT/4/DrvDebug:

 The port Forty1/0/54 can't receive irf pkt, please check. This message repeated 1 times in last 10 seconds.

%Oct  3 18:19:40:568 2022 HK-FT-0201-E02-H6800QTH3-LA-01 STM/2/STM_LINK_TIMEOUT: IRF port 1 went down because the heartbeat timed out.

%Oct  3 18:19:40:573 2022 HK-FT-0201-E02-H6800QTH3-LA-01 STM/3/STM_LINK_DOWN: IRF port 1 went down.

%Oct  3 18:19:40:650 2022 HK-FT-0201-E02-H6800QTH3-LA-01 LAGG/6/LAGG_INACTIVE_PHYSTATE: Member port XGE2/0/1 of aggregation group BAGG1 changed to the inactive state, because the physical state of the port is down.

%Oct  3 18:19:40:663 2022 HK-FT-0201-E02-H6800QTH3-LA-01 DEV/3/BOARD_REMOVED: Board was removed from slot 2, type is  S6800-54QT.

 

%Oct  3 18:22:41:332 2022 HK-FT-0201-E02-H6800QTH3-LA-01 STM/6/STM_LINK_UP: IRF port 1 came up.

%Oct  3 18:22:41:636 2022 HK-FT-0201-E02-H6800QTH3-LA-01 IFNET/3/PHY_UPDOWN: Physical state on the interface FortyGigE1/0/54 changed to up.

%Oct  3 18:22:41:637 2022 HK-FT-0201-E02-H6800QTH3-LA-01 IFNET/5/LINK_UPDOWN: Line protocol state on the interface FortyGigE1/0/54 changed to up.

%Oct  3 18:23:22:656 2022 HK-FT-0201-E02-H6800QTH3-LA-01 DEV/2/BOARD_STATE_FAULT: Board state changed to Fault on slot 2, type is unknown.

%Oct  3 18:23:29:277 2022 HK-FT-0201-E02-H6800QTH3-LA-01 DEV/5/BOARD_STATE_NORMAL: Board state changed to Normal on slot 2, type is  S6800-54QT.
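To make the timeline in the log excerpt above easier to read, here is a minimal Python sketch that parses a few of those lines and measures how long the IRF port stayed down. The regular expression is only an assumption based on the header shape visible in this excerpt, and the names `parse_line` and `first_seen` are hypothetical helpers, not part of any H3C tool.

```python
import re
from datetime import datetime

# Header shape assumed from the excerpt:
# "%Oct  3 18:19:27:691 2022 <host> MODULE/SEVERITY/MNEMONIC: text"
LOG_RE = re.compile(
    r"^%(?P<mon>\w{3})\s+(?P<day>\d{1,2}) "
    r"(?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2}):(?P<ms>\d{3}) (?P<year>\d{4}) "
    r"(?P<host>\S+) (?P<module>\w+)/(?P<sev>\d)/(?P<mnemonic>\w+):"
)

def parse_line(line):
    """Return (timestamp, mnemonic) for a log header line, or None."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(
        f"{m['year']} {m['mon']} {m['day']} {m['h']}:{m['m']}:{m['s']}.{m['ms']}",
        "%Y %b %d %H:%M:%S.%f",
    )
    return ts, m["mnemonic"]

def first_seen(text):
    """Map each mnemonic to the timestamp of its first occurrence."""
    seen = {}
    for line in text.splitlines():
        parsed = parse_line(line)
        if parsed:
            ts, mnemonic = parsed
            seen.setdefault(mnemonic, ts)
    return seen

# Four lines copied from the excerpt above.
SAMPLE = """\
%Oct  3 18:19:27:691 2022 HK-FT-0201-E02-H6800QTH3-LA-01 BFD/4/BFD_MAD_INTERFACE_CHANGE_STATE: BFD MAD function enabled on Vlan-interface199 changed to the faulty state.
%Oct  3 18:19:40:573 2022 HK-FT-0201-E02-H6800QTH3-LA-01 STM/3/STM_LINK_DOWN: IRF port 1 went down.
%Oct  3 18:22:41:332 2022 HK-FT-0201-E02-H6800QTH3-LA-01 STM/6/STM_LINK_UP: IRF port 1 came up.
%Oct  3 18:23:29:277 2022 HK-FT-0201-E02-H6800QTH3-LA-01 DEV/5/BOARD_STATE_NORMAL: Board state changed to Normal on slot 2, type is  S6800-54QT.
"""

events = first_seen(SAMPLE)
irf_outage = events["STM_LINK_UP"] - events["STM_LINK_DOWN"]
total = events["BOARD_STATE_NORMAL"] - events["BFD_MAD_INTERFACE_CHANGE_STATE"]
print(irf_outage.total_seconds())  # 180.759
print(total.total_seconds())       # 241.586
```

By these timestamps the IRF port was down for about three minutes, and roughly four minutes passed between the BFD MAD fault and slot 2 returning to the Normal state.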

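Process analysis

If this had been a stack split followed by a merge, the reboot cause should have been recorded as an STM IRF merge. The cause recorded after the restart is instead warm reboot, which strongly indicates that Slot 2 itself had failed before the split: at that point neither the IRF heartbeat packets nor the MAD function could be exchanged normally, and Slot 2 detected its own fault and rebooted itself without ever perceiving a stack split. A further look at the reboot records from before the restart turned up no information, and secondary_log holds only a single boot record, just as it would after a cold start. History interrupt does contain entries, but they are garbled, which suggests that the high memory either failed to record them or recorded them incorrectly.

So although this was a warm reboot, the recorded information looks like that of a cold reboot. Based on past experience, this generally happens only on devices with a PCIe problem. To eliminate the hidden risk, we recommend returning slot 2 for analysis.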
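The reasoning above can be summarized as a small decision sketch. This is only an illustration of the argument, not a vendor diagnostic: the `RebootEvidence` fields, the `assess` function, and the matching on the substring "merge" are all hypothetical, and the exact reboot-cause string a device prints for an IRF merge is an assumption.

```python
from dataclasses import dataclass

@dataclass
class RebootEvidence:
    reboot_cause: str                # e.g. "WarmReboot" from the Slot 2 version output
    kernel_records_present: bool     # any output from the display kernel commands
    last_running_info_present: bool  # any rows in the last running info table
    secondary_log_boots: int         # boot records found in the secondary log buffer

def assess(e):
    """Classify a member reboot following the reasoning in this case (hypothetical)."""
    cause = e.reboot_cause.replace(" ", "").lower()
    # Assumption: a split-and-merge reboot is recorded with "merge" in the cause.
    if "merge" in cause:
        return "split-and-merge reboot"
    traces_missing = (
        not e.kernel_records_present
        and not e.last_running_info_present
        and e.secondary_log_boots <= 1
    )
    if cause == "warmreboot" and traces_missing:
        return "self-initiated reboot with cold-start-like records; suspect hardware"
    return "inconclusive"

# Evidence as described in this case for Slot 2.
slot2 = RebootEvidence("WarmReboot", False, False, 1)
print(assess(slot2))
```

With the evidence reported for Slot 2, the sketch lands on the hardware-suspicion branch, which matches the recommendation to return the unit.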

Slot 2:

Uptime is 0 weeks,0 days,1 hour,26 minutes

S6800-54QT with 2 Processor

BOARD TYPE:         S6800-54QT

DRAM:               4096M bytes

FLASH:              1024M bytes

PCB 1 Version:      VER.A

PCB 2 Version:      VER.A

FPGA Version:       NONE

Bootrom Version:    229

CPLD 1 Version:     002

CPLD 2 Version:     002

Release Version:    H3C S6800-54QT-2609

Patch Version:      Release 2609H09

Reboot Cause:       WarmReboot

[SubSlot 0] 48XGT+6QSFP Plus

None of the display kernel information was recorded.

  ===============display kernel deadloop 20 verbose slot 2 =============== 

No information to display.

=================================================================

  ===============display kernel exception 10 verbose slot 2 =============== 

No information to display.

=================================================================

  ===============display kernel reboot 20 verbose slot 2 =============== 

No information to display.

The interrupt information recorded before the reboot is garbled:

  ===============display reboot interrupt 2=============== 

============ History interrupt info of slot 2 ============

Last 200 interrupts time:

Irq ID   jiffies         year/month/day hour:min:sec   Count       1 

  12     0x3c65cda6     1766/02/03     10:19:05       1 

  09     0x73c6a59ac     2022/10/03     10:19:05       1 

  10     0x73c68962f     0998/10/03     10:19:04       1 

  13     0x73c662d32     1830/10/01     02:19:07       1 

  13     0x71c69a00f     1894/10/03     10:19:08       1 

  09     0x73c6a556b     1254/10/03     00:03:09       1 

  05     0x73c6578ea     2022/10/03     10:19:10       1 

  13     0x73c663439     1254/10/03     10:19:10       1 

  10     0x73c6a7ab8     2020/10/01     08:03:10       1 

  10     0x73c6a7e90     1478/10/03     08:19:11       1 

  13     0x3c61e42b     1958/10/03     10:03:04       1 

  13     0x3c64f189     1734/10/03     08:19:13       1 

  05     0x73c6a057e     1382/10/03     02:03:12       1 

  13     0x73c2a041d     2018/10/03     10:01:14       1 

  13     0x73c658c3c     1254/10/01     08:03:13       1 

  08     0x73c09ce40     2018/08/03     10:19:15       1 

  08     0x73c649410     1990/10/02     10:02:01       1 

  05     0x73c2a47b5     1732/10/03     00:18:57       1 

  13     0x734693223     1988/10/01     10:19:00

=================================================================
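One way to quantify how garbled the table above is: compare each recorded date with the date of the incident. The sketch below does this for four rows copied from the table. The row pattern is an assumption based on the columns shown here, and `suspicious_rows` is a hypothetical helper.

```python
import re
from datetime import date

# Row shape assumed from the table: "Irq ID  jiffies  year/month/day  hour:min:sec  Count"
ROW_RE = re.compile(
    r"^\s*(?P<irq>\d+)\s+(?P<jiffies>0x[0-9a-fA-F]+)\s+"
    r"(?P<y>\d{4})/(?P<mo>\d{2})/(?P<d>\d{2})\s+(?P<time>\d{2}:\d{2}:\d{2})"
)

def suspicious_rows(text, expected):
    """Return the rows whose recorded date differs from the expected date."""
    bad = []
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if not m:
            continue  # skip headers and separators
        try:
            recorded = date(int(m["y"]), int(m["mo"]), int(m["d"]))
        except ValueError:
            recorded = None  # not even a valid calendar date
        if recorded != expected:
            bad.append(line.strip())
    return bad

# Four rows copied from the history interrupt table above.
SAMPLE = """\
  12     0x3c65cda6     1766/02/03     10:19:05       1
  09     0x73c6a59ac     2022/10/03     10:19:05       1
  10     0x73c68962f     0998/10/03     10:19:04       1
  05     0x73c6578ea     2022/10/03     10:19:10       1
"""

bad = suspicious_rows(SAMPLE, date(2022, 10, 3))
print(len(bad))  # 2
for row in bad:
    print(row)
```

In the full table only two rows carry the date 2022/10/03. The rest show years such as 1766, 0998 and 1254, which is consistent with the records having been corrupted.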

 

The jiffies and task-switch information from before the reboot is empty.

  ===============display reboot last-time  2=============== 

slot 2 Last Running Info:

CPU           Time                  jiffies        TASK

=================================================================

============================================================

The secondary log buffer also holds the content of only a single boot:

  ===============printk irq trace info on slot 2=============== 

  ===============printk log buffer info on slot 2=============== 

<4>---------- secondary log buffer [1] ----------

<6>[    0.000000] 0:Initializing cgroup subsys cpuset <6>0:done

<6>[    0.000000] 0:Initializing cgroup subsys cpu <6>0:done

<5>[    0.000000] 0:Linux version (none) (CMO@host) (gcc version 4.4.5 20100516 (prerelease) (GCC) ) #2 SMP Tue Nov 7 16:00:00 CST 2017

<4>[    0.000000] Standard version 0.50

Solution

Return the Slot 2 device for repair.

Source: Zhiliao Community (知了社区), under the Creative Commons Attribution-ShareAlike 3.0 China Mainland license; original title: 某局点S6800设备堆叠分裂问题