宣轩: 3PAR 8200 node 0 warm reboot
Problem Description
Node 0 of the 3PAR 8200 underwent a warm reboot.
Analysis
Release version 3.3.1 (MU5)
Patches: P126,P132,P135,P140,P146,P150,P151,P155,P156,P164,P170,P172,P173
showeeprom shows that node 0 rebooted at 2022-08-25 11:13:47 CST, five minutes after its last recorded panic at 11:08:47 CST:
Node 0
--------
Board revision: 0920-200048.B4
Assembly: FXN 2019/17 Serial 438870
System serial: CN792305WP
System W19: 0x23EDE
BIOS version: 5.5.7
OS version: 3.3.1.648
Reset reason: ALIVE_L
Last boot: 2022-08-25 11:13:47 CST
Last cluster join: 2022-08-25 11:14:07 CST
Last panic: 2022-08-25 11:08:47 CST
Last panic request: Never
Error ignore code: 00
SMI context: 00
Last HBA mode: 2a000000
BIOS state: 80 ff 24 27 28 29 2a 2c
TPD state: ff ff ff ff ff ff ff ff
Code 128 (BIOS update) - Subcode 0x2050507 (2050404) 2022-08-12 21:53:23 CST
Code 128 (BIOS update) - Subcode 0x2050404 (2050236) 2020-08-04 11:03:01 CST
Code 61 (AC Power Loss) - Subcode 0x0 (0) 2020-04-14 16:22:50 CST
Code 61 (AC Power Loss) - Subcode 0x0 (0) 2019-06-09 21:01:54 CST
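The timing of the event can be read off the showeeprom timestamps above. The following is a minimal Python sketch with the three values copied by hand from that output; the variable names are illustrative only. The trailing "CST" is dropped because all three timestamps share the same zone and only the differences matter.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Values copied from the showeeprom output (zone suffix omitted on purpose).
last_panic = datetime.strptime("2022-08-25 11:08:47", FMT)
last_boot = datetime.strptime("2022-08-25 11:13:47", FMT)
last_join = datetime.strptime("2022-08-25 11:14:07", FMT)

# Time the node was down between the panic and the next boot,
# and time it then took to rejoin the cluster.
panic_to_boot = last_boot - last_panic
boot_to_join = last_join - last_boot

print(f"panic -> boot: {panic_to_boot.total_seconds():.0f} s")
print(f"boot -> cluster join: {boot_to_join.total_seconds():.0f} s")
```

For these values the node booted 300 s after the panic and rejoined the cluster 20 s later.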
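The console log under \INSPLO~4.194\var\core\nemoe\NODE0-~1\N0_fa_2022-08-25_11_08_48\ shows that at the time of the failure CPU 5 was stuck in a soft lockup for 61 s. The stuck process was the kernel worker kworker/5:1 (PID 1871), and the work item it was running was the periodic rtc_timer task (rtc_timer_do_work):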
3PAR(R) InForm(tm) OS 3.3.1.648 CN792305WP-0 ttyS0
CN792305WP-0 login: [1084521.180755] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 61s! [kworker/5:1:1871]
[1084521.201385] Kernel panic[5]: softlockup: hung tasks
[1084521.211421] CPU: 5 PID: 1871 Comm: kworker/5:1 Tainted: P O L ------------ 3.10.0 #1
[1084521.229231] Hardware name: HP Romley Platform, BIOS UDK_05.05.07 2019-11-01
[1084521.243413] Workqueue: events rtc_timer_do_work
[1084521.252765] Call Trace:
[1084521.257961] <IRQ> [<ffffffff817688b6>] dump_stack+0x19/0x1b
[1084521.269735] [<ffffffff81767370>] panic+0x14f/0x27f
[1084521.279770] [<ffffffff811478c7>] watchdog_timer_fn+0x227/0x230
[1084521.291879] [<ffffffff811476a0>] ? watchdog_enable+0xa0/0xa0
[1084521.303642] [<ffffffff810ee18f>] __hrtimer_run_queues+0xaf/0x260
[1084521.316098] [<ffffffff8111bb9a>] ? ktime_get_update_offsets_now+0x5a/0x120
[1084521.330280] [<ffffffff810ee6f2>] hrtimer_interrupt+0xa2/0x1b0
[1084521.342217] [<ffffffff81096bbe>] local_apic_timer_interrupt+0x3e/0x70
[1084521.355536] [<ffffffff8177ce93>] smp_apic_timer_interrupt+0x43/0x60
[1084521.368509] [<ffffffff81779e42>] apic_timer_interrupt+0x162/0x170
[1084521.381134] <EOI> [<ffffffff8176f3a5>] ? _raw_spin_unlock_irqrestore+0x15/0x20
[1084521.396192] [<ffffffff810f3ac4>] __wake_up+0x44/0x50
[1084521.406573] [<ffffffff8153f8ef>] rtc_handle_legacy_irq+0x9f/0xc0
[1084521.419027] [<ffffffff8153f948>] rtc_uie_update_irq+0x18/0x20
[1084521.430962] [<ffffffff8153fa87>] rtc_timer_do_work+0xd7/0x1d0
[1084521.442897] [<ffffffff810715ec>] ? __switch_to+0x12c/0x4f0
[1084521.454314] [<ffffffff8176d492>] ? __schedule+0x492/0xb00
[1084521.465557] [<ffffffff810e2c64>] process_one_work+0x1c4/0x4e0
[1084521.477492] [<ffffffff810e3d61>] worker_thread+0x121/0x430
[1084521.488910] [<ffffffff810e3c40>] ? manage_workers.isra.28+0x2b0/0x2b0
[1084521.502228] [<ffffffff810ea762>] kthread+0xc2/0xd0
[1084521.512264] [<ffffffff810ea6a0>] ? flush_kthread_worker+0x80/0x80
[1084521.524890] [<ffffffff81778e77>] ret_from_fork_nospec_begin+0x21/0x21
[1084521.538210] [<ffffffff810ea6a0>] ? flush_kthread_worker+0x80/0x80
[1084521.550836] Kernel Offset: disabled
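The key facts in the trace above (CPU, stall duration, task name, PID) all sit in the "soft lockup" line, so they can be extracted mechanically when several nodes have to be checked. The following is a rough Python sketch; the regular expression is inferred from the single line shown here and is an assumption about the format, not a documented one. It also estimates when the kernel booted, assuming the bracketed number is seconds since boot and that the showeeprom "Last panic" time corresponds to this log line.

```python
import re
from datetime import datetime, timedelta

# The soft-lockup line as it appears in the console log above.
LINE = ("[1084521.180755] NMI watchdog: BUG: soft lockup - "
        "CPU#5 stuck for 61s! [kworker/5:1:1871]")

# Pattern inferred from this one sample line (an assumption, not a spec).
PATTERN = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\].*soft lockup - CPU#(?P<cpu>\d+) "
    r"stuck for (?P<secs>\d+)s! \[(?P<comm>.+):(?P<pid>\d+)\]"
)

def parse_soft_lockup(line):
    """Return a dict of fields from a soft-lockup line, or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {
        "uptime_s": float(m.group("uptime")),
        "cpu": int(m.group("cpu")),
        "stuck_s": int(m.group("secs")),
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
    }

info = parse_soft_lockup(LINE)
print(info)

# Estimate the kernel boot time: "Last panic" from showeeprom minus uptime.
last_panic = datetime.strptime("2022-08-25 11:08:47", "%Y-%m-%d %H:%M:%S")
approx_boot = last_panic - timedelta(seconds=int(info["uptime_s"]))
print("approximate kernel boot:", approx_boot)
```

For this log the estimate falls on 2022-08-12 at about 21:53, which lines up with the BIOS update entry (Code 128) recorded at 2022-08-12 21:53:23 CST. Under the stated assumptions, that suggests the node had been running continuously since that update when the lockup occurred.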
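Solution
The root cause is a defect in the Linux kernel function rtc_timer_do_work(): while processing periodic timer work it can enter an endless loop, which keeps the CPU stuck until the soft-lockup watchdog fires and panics the node.
A fix is expected in the next major release.
Temporary workaround: use the disable_soft_lockup.sh script so that a soft lockup no longer panics the node.
Procedure:
Log in to the 3PAR as root and run the following commands: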
cd /common/stbin
./disable_soft_lockup.sh --install    # install the workaround
./disable_soft_lockup.sh --verify     # verify: kernel.softlockup_panic should now be 0
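To double-check the result independently of the script, the sysctl can also be read straight from procfs. The following is a small Python sketch, assuming a Python interpreter is available on the host where the check is run and that the kernel exposes the standard /proc/sys/kernel/softlockup_panic file; the function name is illustrative and is not part of any 3PAR tooling.

```python
from pathlib import Path

# Standard procfs location of the kernel.softlockup_panic sysctl on Linux.
SYSCTL_PATH = "/proc/sys/kernel/softlockup_panic"

def softlockup_panic_disabled(path=SYSCTL_PATH):
    """Return True if the sysctl reads 0, i.e. a soft lockup no longer panics."""
    return Path(path).read_text().strip() == "0"

if __name__ == "__main__":
    p = Path(SYSCTL_PATH)
    if not p.exists():
        print("sysctl not present on this kernel")
    elif softlockup_panic_disabled():
        print("kernel.softlockup_panic = 0: workaround active")
    else:
        print("kernel.softlockup_panic != 0: soft lockups still panic the node")
```

With the value at 0 the kernel still logs the soft-lockup warning but does not panic, so the underlying rtc_timer_do_work() defect remains until the fixed release is installed.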