
张文宁: Widespread outage of downstream services after an IRF split on an S12500X-AF at a customer site


Network Topology and Description

/

Alarm Information

/

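Problem Description

The on-site team reported that the NMS detected log alarms from the superspine IRF device (S12500X-AF), and that the financial cloud and big data services attached below it reported faults one after another. Power-cycling the entire superspine device did not restore service. The team finally powered off and isolated chassis 1; with chassis 3 running alone, services returned to normal. The interruption lasted about 0.5 hours. The fault symptoms were as follows:

At about 08:44 on July 4, the NMS received a large number of abnormal logs from the superspine, and the service side reported faults at the same time.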

This log alarm is sent once per minute

Query range: 2022-07-04 08:39 CST to 2022-07-04 08:44 CST

Time: 2022-07-04 08:44 CST

Module: DEV

Summary: BOARD_REMOVED

Hostname: DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U

Hit count: 25

Content: 2022-07-04T08:44:03+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10DEV/3/BOARD_REMOVED: -DevIP=10.191.252.1; Board was removed from chassis 3 slot 0, type is LSXM1SUPH1.

 

This log alarm is sent once per minute

Query range: 2022-07-04 08:39 CST to 2022-07-04 08:44 CST

Time: 2022-07-04 08:44 CST

Module: DRVPLAT

Summary: DrvDebug

Hostname: DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U

Hit count: 6

Content: 2022-07-04T08:44:09+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10DRVPLAT/4/DrvDebug: -DevIP=10.191.252.1;   WARNING: Heartbeat with chassis 1 slot 2 timed out.

Process Analysis

The logs on the NMS side show that chassis 3 underwent an IRF split at 08:44:

Jul 4, 2022 @ 08:44:49.463   DEV Critical     BOARD_STATE_FAULT   2022-07-04T08:44:04+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10DEV/2/BOARD_STATE_FAULT: -DevIP=10.191.252.1; Board state changed to Fault on chassis 3 slot 15, type is LSXM1SFH08D1.

Jul 4, 2022 @ 08:44:49.463   DEV Critical     BOARD_STATE_FAULT   2022-07-04T08:44:04+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10DEV/2/BOARD_STATE_FAULT: -DevIP=10.191.252.1; Board state changed to Fault on chassis 3 slot 14, type is LSXM1SFH08D1.

Jul 4, 2022 @ 08:44:49.441   DEV Critical     BOARD_STATE_FAULT   2022-07-04T08:44:04+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10DEV/2/BOARD_STATE_FAULT: -DevIP=10.191.252.1; Board state changed to Fault on chassis 3 slot 13, type is LSXM1SFH08D1.

Jul 4, 2022 @ 08:44:49.432   DEV Critical     BOARD_STATE_FAULT   2022-07-04T08:44:04+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10DEV/2/BOARD_STATE_FAULT: -DevIP=10.191.252.1; Board state changed to Fault on chassis 3 slot 12, type is LSXM1SFH08D1.

Jul 4, 2022 @ 08:44:49.425   DEV Critical     BOARD_STATE_FAULT   2022-07-04T08:44:04+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10DEV/2/BOARD_STATE_FAULT: -DevIP=10.191.252.1; Board state changed to Fault on chassis 3 slot 11, type is LSXM1SFH08D1.

Jul 4, 2022 @ 08:44:49.410   DEV Critical     BOARD_STATE_FAULT   2022-07-04T08:44:04+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10DEV/2/BOARD_STATE_FAULT: -DevIP=10.191.252.1; Board state changed to Fault on chassis 3 slot 10, type is LSXM1SFH08D1.

Jul 4, 2022 @ 08:44:49.360   DEV Critical     BOARD_STATE_FAULT   2022-07-04T08:44:04+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10DEV/2/BOARD_STATE_FAULT: -DevIP=10.191.252.1; Board state changed to Fault on chassis 3 slot 5, type is LSXM1CGQ18QGHB1.

Jul 4, 2022 @ 08:44:49.293   DEV Critical     BOARD_STATE_FAULT   2022-07-04T08:44:04+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10DEV/2/BOARD_STATE_FAULT: -DevIP=10.191.252.1; Board state changed to Fault on chassis 3 slot 4, type is LSXM1TGS48QGHA1.

Jul 4, 2022 @ 08:44:49.273   DEV Critical     BOARD_STATE_FAULT   2022-07-04T08:44:04+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10DEV/2/BOARD_STATE_FAULT: -DevIP=10.191.252.1; Board state changed to Fault on chassis 3 slot 3, type is LSXM1QGS24HB1.

Jul 4, 2022 @ 08:44:49.266   DEV Critical     BOARD_STATE_FAULT   2022-07-04T08:44:04+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10DEV/2/BOARD_STATE_FAULT: -DevIP=10.191.252.1; Board state changed to Fault on chassis 3 slot 2, type is LSXM1CGQ18QGHB1.

Jul 4, 2022 @ 08:44:49.250   DEV Critical     BOARD_STATE_FAULT   2022-07-04T08:44:04+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10DEV/2/BOARD_STATE_FAULT: -DevIP=10.191.252.1; Board state changed to Fault on chassis 3 slot 1, type is LSXM1SUPH1.
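When a burst like the one above arrives, it can help to group the faulted boards by chassis before reading further. The following is a minimal Python sketch, assuming the log lines are available as plain text in the same format as shown here; the function name `faulted_boards` is made up for illustration and is not part of any H3C tool.

```python
import re
from collections import defaultdict

# Matches the message body of the BOARD_STATE_FAULT entries shown above.
# Requiring the trailing colon skips the earlier summary column, which
# repeats the mnemonic without a colon.
FAULT_RE = re.compile(
    r"BOARD_STATE_FAULT: .*?Board state changed to Fault on "
    r"chassis (\d+) slot (\d+), type is ([\w-]+)"
)


def faulted_boards(lines):
    """Return {chassis: [(slot, board_type), ...]} sorted by slot."""
    by_chassis = defaultdict(list)
    for line in lines:
        m = FAULT_RE.search(line)
        if m:
            chassis, slot, board = int(m.group(1)), int(m.group(2)), m.group(3)
            by_chassis[chassis].append((slot, board))
    return {c: sorted(boards) for c, boards in by_chassis.items()}
```

Run over the entries above, every result falls under chassis 3, which is the pattern expected when a whole member chassis drops out of the IRF fabric rather than a single board failing.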

 

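The logs of the downstream spine also show that chassis 3 was placed in the MAD DOWN state (chassis 3 has the larger member ID), and that the interface on chassis 1 slot 2 went down as well: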

%Jul  4 08:45:37:740 2022 DCXN-CLOSS-APP-spine-SW01_IDC02-S1301-1728U LLDP/5/LLDP_NEIGHBOR_AGE_OUT: -Slot=2; Nearest bridge agent neighbor aged out on port HundredGigE2/0/5 (IfIndex 503), neighbor's chassis ID is 78aa-82cf-5e00, port ID is HundredGigE3/2/0/5.

 

%Jul  4 08:45:28:740 2022 DCXN-CLOSS-APP-spine-SW01_IDC02-S1301-1728U LLDP/5/LLDP_NEIGHBOR_AGE_OUT: -Slot=2; Nearest bridge agent neighbor aged out on port HundredGigE2/0/4 (IfIndex 498), neighbor's chassis ID is 78aa-82cf-5e00, port ID is HundredGigE1/2/0/5.

 

%Jul  4 08:44:03:942 2022 DCXN-CLOSS-APP-spine-SW01_IDC02-S1301-1728U BGP/5/BGP_STATE_CHANGED:

 BGP.: 10.191.2.13 state has changed from ESTABLISHED to IDLE for physical interface configuration changed.

 

%Jul  4 08:44:03:940 2022 DCXN-CLOSS-APP-spine-SW01_IDC02-S1301-1728U IFNET/5/LINK_UPDOWN: Line protocol state on the interface HundredGigE2/0/5 changed to down.

%Jul  4 08:44:03:938 2022 DCXN-CLOSS-APP-spine-SW01_IDC02-S1301-1728U IFNET/3/PHY_UPDOWN: Physical state on the interface HundredGigE2/0/5 changed to down.

%Jul  4 08:43:29:305 2022 DCXN-CLOSS-APP-spine-SW01_IDC02-S1301-1728U IFNET/5/LINK_UPDOWN: Line protocol state on the interface HundredGigE2/0/4 changed to down.

 

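After the split, chassis 1 ran standalone. The logs show that at 08:44 chassis 1 slot 2 repeatedly logged heartbeat timeouts and eventually entered the Fault state: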

Jul 4, 2022 @ 08:44:51.305   DEV Critical     BOARD_STATE_FAULT   2022-07-04T08:44:47+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10DEV/2/BOARD_STATE_FAULT: -DevIP=10.191.252.1; Board state changed to Fault on chassis 1 slot 2, type is LSXM1CGQ18QGHB1.

Jul 4, 2022 @ 08:44:51.278   STM Error        STM_LINK_DOWN 2022-07-04T08:44:47+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10STM/3/STM_LINK_DOWN: -DevIP=10.191.252.1; IRF port 1 went down.

Jul 4, 2022 @ 08:44:51.260   DRVPLAT Warning  DrvDebug        2022-07-04T08:44:39+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10DRVPLAT/4/DrvDebug: -DevIP=10.191.252.1;   WARNING: Heartbeat with chassis 1 slot 2 timed out.

Jul 4, 2022 @ 08:44:51.258   DRVPLAT Warning  DrvDebug        2022-07-04T08:44:38+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10DRVPLAT/4/DrvDebug: -DevIP=10.191.252.1-Chassis=1-Slot=15;   WARNING: Bcast IPC packets from chassis 1 slot 2 to chassis 1 slot 15 were blocked.

Jul 4, 2022 @ 08:44:51.256   DRVPLAT Warning  DrvDebug        2022-07-04T08:44:38+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10DRVPLAT/4/DrvDebug: -DevIP=10.191.252.1-Chassis=1-Slot=15;   WARNING: Ucast IPC packets from chassis 1 slot 2 to chassis 1 slot 15 were blocked.

Jul 4, 2022 @ 08:44:51.254   DRVMNT Error        ERRORCODE   2022-07-04T08:44:38+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10DRVMNT/3/ERRORCODE: -DevIP=10.191.252.1-Chassis=1-Slot=15; MdcId=1; ErrCode=0x6e06, GOLD: Ipc Block.

Jul 4, 2022 @ 08:44:51.186   DRVPLAT Warning  DrvDebug        2022-07-04T08:44:29+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10DRVPLAT/4/DrvDebug: -DevIP=10.191.252.1;   WARNING: Heartbeat with chassis 1 slot 2 timed out.

Jul 4, 2022 @ 08:44:51.109   DRVPLAT Warning  DrvDebug        2022-07-04T08:44:19+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10DRVPLAT/4/DrvDebug: -DevIP=10.191.252.1;   WARNING: Heartbeat with chassis 1 slot 2 timed out.

Jul 4, 2022 @ 08:44:50.612   DRVPLAT Warning  DrvDebug        2022-07-04T08:44:09+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10DRVPLAT/4/DrvDebug: -DevIP=10.191.252.1;   WARNING: Heartbeat with chassis 1 slot 2 timed out.
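The device-side timestamps of the heartbeat warnings above (08:44:09, 08:44:19, 08:44:29, 08:44:39) are 10 seconds apart, ending shortly before the board goes to Fault at 08:44:47. A small Python sketch for extracting that cadence from raw lines is given below. It assumes the same line format as in this post; the helper name `heartbeat_gaps` is hypothetical.

```python
import re
from datetime import datetime

# Captures the device-side timestamp (not the NMS receive time) and the
# chassis/slot named in the heartbeat warning.
HB_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})"
    r".*Heartbeat with chassis (\d+) slot (\d+) timed out"
)


def heartbeat_gaps(lines):
    """Return {(chassis, slot): [seconds between consecutive timeouts]}."""
    seen = {}
    for line in lines:
        m = HB_RE.search(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
            key = (int(m.group(2)), int(m.group(3)))
            seen.setdefault(key, []).append(ts)
    gaps = {}
    for key, stamps in seen.items():
        stamps.sort()  # NMS listings are often newest-first
        gaps[key] = [int((b - a).total_seconds())
                     for a, b in zip(stamps, stamps[1:])]
    return gaps
```

A steady run of evenly spaced timeouts against one slot, as seen here, points at that board rather than at a transient glitch.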

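The analysis above shows that, at the time of the fault, chassis 1 slot 2 suffered a hardware fault that blocked IPC communication, and the IRF cables were all concentrated on chassis 1 slot 2, so the IRF fabric split. Because chassis 3 has the larger member ID, and the MAD mechanism in the older software version running on site places all service ports of the higher-numbered chassis into the MAD DOWN state, services could only be forwarded through chassis 1.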
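The ID-only arbitration described in this paragraph can be illustrated with a toy model. This is only a sketch of the behaviour the analysis attributes to the older software version, not H3C's implementation; the `Fragment` class, its `healthy` field, and `resolve_split` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    master_member_id: int  # member (chassis) ID of the fragment's master
    healthy: bool = True   # hypothetical flag; the ID-only rule never reads it


def resolve_split(fragments):
    """Keep the fragment with the lowest member ID active and put every
    other fragment's service ports into MAD DOWN, regardless of health."""
    active = min(fragments, key=lambda f: f.master_member_id)
    return {
        f.master_member_id: "ACTIVE" if f is active else "MAD_DOWN"
        for f in fragments
    }


# Chassis 1 has the faulty board, yet it wins purely on member ID.
print(resolve_split([Fragment(1, healthy=False), Fragment(3)]))
```

With these inputs the model keeps chassis 1 active and shuts chassis 3, which is the situation seen on site: the healthy chassis is the one taken out of service.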

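In addition, the site lacked redundancy: the superspine's only link through chassis 1 to the downstream spine devices was port 1/2/0/5, and chassis 1 slot 2 had a hardware fault. A large amount of service traffic therefore could not be forwarded, which matches the faults reported on site.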

 

To restore services urgently, at 08:55 the on-site team power-cycled both chassis 1 and chassis 3:

Jul 4, 2022 @ 08:55:30.751   DEV Critical     POWER_FAILED      2022-07-04T08:55:31+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10DEV/2/POWER_FAILED: -DevIP=10.191.252.1; Chassis 1 power 4 failed.

Jul 4, 2022 @ 08:55:28.346   DEV Critical     POWER_FAILED      2022-07-04T08:55:28+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10DEV/2/POWER_FAILED: -DevIP=10.191.252.1; Chassis 1 power 3 failed.

Jul 4, 2022 @ 08:55:25.941   DEV Critical     POWER_FAILED      2022-07-04T08:55:26+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10DEV/2/POWER_FAILED: -DevIP=10.191.252.1; Chassis 1 power 2 failed.

Jul 4, 2022 @ 08:55:24.336   DEV Critical     POWER_FAILED      2022-07-04T08:55:24+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10DEV/2/POWER_FAILED: -DevIP=10.191.252.1; Chassis 1 power 1 failed.

 

At 09:15, the boards in chassis 3 came up one after another and reached the Normal state, and services were temporarily restored:

Jul 4, 2022 @ 09:15:03.675   IFNET       Error        PHY_UPDOWN        2022-07-04T09:09:17+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10IFNET/3/PHY_UPDOWN: -DevIP=10.191.252.1; Physical state on the interface Ten-GigabitEthernet3/4/0/2 changed to up.

Jul 4, 2022 @ 09:15:03.570   IFNET       Notice     LINK_UPDOWN      2022-07-04T09:09:17+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10IFNET/5/LINK_UPDOWN: -DevIP=10.191.252.1; Line protocol state on the interface Ten-GigabitEthernet3/4/0/3 changed to up.

Jul 4, 2022 @ 09:15:03.568   IFNET       Error        PHY_UPDOWN        2022-07-04T09:09:17+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10IFNET/3/PHY_UPDOWN: -DevIP=10.191.252.1; Physical state on the interface Ten-GigabitEthernet3/4/0/3 changed to up.

Jul 4, 2022 @ 09:15:03.464   IFNET       Notice     LINK_UPDOWN      2022-07-04T09:09:17+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10IFNET/5/LINK_UPDOWN: -DevIP=10.191.252.1; Line protocol state on the interface Ten-GigabitEthernet3/4/0/40 changed to up.

Jul 4, 2022 @ 09:15:03.461   IFNET       Error        PHY_UPDOWN        2022-07-04T09:09:17+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10IFNET/3/PHY_UPDOWN: -DevIP=10.191.252.1; Physical state on the interface Ten-GigabitEthernet3/4/0/40 changed to up.

Jul 4, 2022 @ 09:15:03.056   IFNET       Notice     LINK_UPDOWN      2022-07-04T09:09:17+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10IFNET/5/LINK_UPDOWN: -DevIP=10.191.252.1; Line protocol state on the interface Ten-GigabitEthernet3/4/0/6 changed to up.

Jul 4, 2022 @ 09:15:03.053   IFNET       Notice     LINK_UPDOWN      2022-07-04T09:09:17+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10IFNET/5/LINK_UPDOWN: -DevIP=10.191.252.1; Line protocol state on the interface Route-Aggregation5 changed to up.

 

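Immediately afterwards, however, the boards in chassis 1 also loaded and returned to Normal, except for chassis 1 slot 2, which had a hardware fault and could not reach Normal even after the power cycle. Once chassis 1 had fully restarted, the MAD mechanism was triggered again and put all service ports of chassis 3 into MAD DOWN, so services were impaired a second time: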

Jul 4, 2022 @ 09:28:10.217   BFD  Notice     BFD_MAD_INTERFACE_CHANGE_STATE      2022-07-04T09:19:27+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10BFD/5/BFD_MAD_INTERFACE_CHANGE_STATE: -DevIP=10.191.252.1; BFD MAD function enabled on Vlan-interface4000 changed to the normal state.

Jul 4, 2022 @ 09:28:08.113   HA    Notice     HA_BATCHBACKUP_FINISHED       2022-07-04T09:19:25+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10HA/5/HA_BATCHBACKUP_FINISHED: -DevIP=10.191.252.1; Batch backup of standby board in chassis 1 slot 1 has finished.

Jul 4, 2022 @ 09:28:07.090   HA    Notice     HA_BATCHBACKUP_STARTED        2022-07-04T09:19:24+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10HA/5/HA_BATCHBACKUP_STARTED: -DevIP=10.191.252.1; Batch backup of standby board in chassis 1 slot 1 started.

Jul 4, 2022 @ 09:27:37.858   BFD  Notice     BFD_CHANGE_FSM         2022-07-04T09:18:55+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10BFD/5/BFD_CHANGE_FSM: -DevIP=10.191.252.1; Sess[10.191.1.21/10.191.1.22, LD/RD:8003/8030, Interface:RAGG151, SessType:Ctrl, LinkType:INET], Ver:1, Sta: INIT->UP, Diag: 0 (No Diagnostic)

Jul 4, 2022 @ 09:27:37.856   BFD  Notice     BFD_CHANGE_FSM         2022-07-04T09:18:55+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10BFD/5/BFD_CHANGE_FSM: -DevIP=10.191.252.1; Sess[10.191.1.21/10.191.1.22, LD/RD:8003/8030, Interface:RAGG151, SessType:Ctrl, LinkType:INET], Ver:1, Sta: DOWN->INIT, Diag: 0 (No Diagnostic)

Jul 4, 2022 @ 09:27:37.854   BGP Notice     BGP_STATE_CHANGED  2022-07-04T09:18:55+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10BGP/5/BGP_STATE_CHANGED: -DevIP=10.191.252.1;   BGP.DATA: 10.191.1.22 State is changed from OPENCONFIRM to ESTABLISHED.

Jul 4, 2022 @ 09:27:37.853   SYSLOG    Informational SYSLOG_RESTART   2022-07-04T08:52:59+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10SYSLOG/6/SYSLOG_RESTART: -DevIP=10.191.252.1; System restarted -- H3C Comware Software.

Jul 4, 2022 @ 09:27:28.457   IFNET       Notice     LINK_UPDOWN      2022-07-04T09:27:28+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10IFNET/5/LINK_UPDOWN: -DevIP=10.191.252.1; Line protocol state on the interface Vlan-interface4000 changed to up.

Jul 4, 2022 @ 09:27:28.455   IFNET       Error        PHY_UPDOWN        2022-07-04T09:27:28+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10IFNET/3/PHY_UPDOWN: -DevIP=10.191.252.1; Physical state on the interface Vlan-interface4000 changed to up.

Jul 4, 2022 @ 09:27:28.453   IFNET       Notice     LINK_UPDOWN      2022-07-04T09:27:28+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10IFNET/5/LINK_UPDOWN: -DevIP=10.191.252.1; Line protocol state on the interface Ten-GigabitEthernet3/4/0/52 changed to up.

 

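The downstream spine logs contain no up event for 2/0/4, which confirms that chassis 1 slot 2 still could not reach Normal after the power cycle. Meanwhile, 2/0/5 went down again, which confirms that MAD DOWN was triggered again once chassis 1 had restarted: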

%Jul  4 09:27:28:706 2022 DCXN-CLOSS-APP-spine-SW01_IDC02-S1301-1728U IFNET/5/LINK_UPDOWN: Line protocol state on the interface HundredGigE2/0/5 changed to down.

%Jul  4 09:27:28:704 2022 DCXN-CLOSS-APP-spine-SW01_IDC02-S1301-1728U IFNET/3/PHY_UPDOWN: Physical state on the interface HundredGigE2/0/5 changed to down.

 

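Service forwarding was therefore still failing. At 09:29 the on-site team powered off both chassis again and this time did not power chassis 1 back on: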

Jul 4, 2022 @ 09:29:04.901   DEV Critical     POWER_FAILED      2022-07-04T09:20:22+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10DEV/2/POWER_FAILED: -DevIP=10.191.252.1; Chassis 1 power 4 failed.

Jul 4, 2022 @ 09:29:03.297   DEV Critical     POWER_FAILED      2022-07-04T09:20:20+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10DEV/2/POWER_FAILED: -DevIP=10.191.252.1; Chassis 1 power 3 failed.

Jul 4, 2022 @ 09:29:02.494   DEV Critical     POWER_FAILED      2022-07-04T09:20:19+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10DEV/2/POWER_FAILED: -DevIP=10.191.252.1; Chassis 1 power 2 failed.

Jul 4, 2022 @ 09:29:01.691   DEV Critical     POWER_FAILED      2022-07-04T09:20:19+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10DEV/2/POWER_FAILED: -DevIP=10.191.252.1; Chassis 1 power 1 failed.

 

At 09:47, chassis 3 finished starting up and all services were finally restored:

Jul 4, 2022 @ 09:47:37.393   IFNET       Error        PHY_UPDOWN        2022-07-04T09:47:36+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10IFNET/3/PHY_UPDOWN: -DevIP=10.191.252.1; Physical state on the interface Ten-GigabitEthernet3/4/0/3 changed to up.

Jul 4, 2022 @ 09:47:37.288   LLDP         Informational LLDP_CREATE_NEIGHBOR     2022-07-04T09:47:36+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10LLDP/6/LLDP_CREATE_NEIGHBOR: -DevIP=10.191.252.1-Chassis=3-Slot=4; Nearest bridge agent neighbor created on port Ten-GigabitEthernet3/4/0/40 (IfIndex 12588), neighbor's chassis ID is ac74-092c-e08d, port ID is Ten-GigabitEthernet1/0/28.

Jul 4, 2022 @ 09:47:37.287   SYSLOG    Informational SYSLOG_RESTART   2022-07-04T09:37:30+08:00DCXN-CLOSS-superspine-SW01_IDC03-S13021301-2132U %%10SYSLOG/6/SYSLOG_RESTART: -DevIP=10.191.252.1; System restarted -- H3C Comware Software.

 

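Finally, the isolated chassis 1 was powered on again. Chassis 1 slot 2 remained in the Fault state throughout, which shows that the board in chassis 1 slot 2 on the live network had failed completely: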

Slot   Type                State    Subslot  Soft Ver             Patch Ver

1/0    LSXM1SUPH1          Master   0        S12508X-AF-2713      None     

1/1    LSXM1SUPH1          Standby  0        S12508X-AF-2713      None     

1/2    NONE                Fault    0        NONE                 None     

1/3    LSXM1QGS24HB1       Normal   0        S12508X-AF-2713      None  

 

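In summary, a hardware fault that blocked IPC on chassis 1 slot 2 caused the IRF split. The MAD mechanism then shut down the service ports of chassis 3, leaving only chassis 1 able to forward traffic. Because chassis 1's only downlink to the spine was a single port on chassis 1 slot 2, with no redundancy, services were impaired on a large scale.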

 

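Notes on the IPC hardware fault:

IPC communication can fail for many reasons, such as a faulty IPC chip component, a chip component that fails to forward, or a CPU fault that causes abnormal packet sending and receiving. The recorded exception information can be viewed when the board is in the Normal state. Because this board can no longer reach Normal, it must be returned for hardware analysis.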

Solution

     Resolution

The board in chassis 1 slot 2 has a hardware fault; replace it with a spare part.


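Other optimization recommendations

1. Deploy the IRF links across multiple boards to add redundancy, so that a single point of failure cannot cause an IRF split.

2. Deploy the uplink and downlink ports across multiple boards to add redundancy, so that a single point of failure cannot impair forwarding for the whole device.

3. Upgrade to version R2719P01; the new version supports MAD health checks.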
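Recommendations 1 and 2 can be checked offline from a port inventory. The Python sketch below flags any link group in which all the ports contributed by one chassis sit on the same slot. It assumes four-part IRF interface names of the form chassis/slot/subslot/port, as in HundredGigE1/2/0/5; the function name, the group names, and any port not quoted in the logs above are hypothetical.

```python
import re
from collections import defaultdict

# chassis/slot/subslot/port at the end of an interface name
PORT_RE = re.compile(r"(\d+)/(\d+)/\d+/\d+$")


def single_points_of_failure(groups):
    """Return sorted (group, chassis, slot) triples where every port that a
    chassis contributes to a link group is on one and the same slot."""
    findings = []
    for name, ports in groups.items():
        slots = defaultdict(set)
        for port in ports:
            m = PORT_RE.search(port)
            if not m:
                raise ValueError(f"unrecognised port name: {port}")
            slots[int(m.group(1))].add(int(m.group(2)))
        for chassis, used in slots.items():
            if len(used) == 1:
                findings.append((name, chassis, next(iter(used))))
    return sorted(findings)


inventory = {
    # Ports seen in the spine's LLDP logs in this case
    "to-spine-SW01": ["HundredGigE1/2/0/5", "HundredGigE3/2/0/5"],
    # Hypothetical layout spread across two boards per chassis
    "to-spine-redundant": ["HundredGigE1/2/0/5", "HundredGigE1/5/0/1",
                           "HundredGigE3/2/0/5", "HundredGigE3/5/0/1"],
}
print(single_points_of_failure(inventory))
```

The check is done per chassis on purpose: after a split each chassis stands alone, so a group that spans two chassis but uses one slot in each is still a single point of failure for whichever chassis stays active.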

Source: 知了社区 (Zhiliao Community), licensed under the Creative Commons Attribution-ShareAlike 3.0 China Mainland license.