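Introduction
Recently our servers have crashed several times for various reasons. To find the cause of a crash, we need to collect the memory dump left behind after the system goes down. This article introduces a few common ways to do that.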
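Collecting with Alibaba Cloud's built-in command
Since we mainly use Alibaba Cloud servers, let's start with how to collect a crash memory dump on Alibaba Cloud.
Alibaba Cloud systems do not have the dump configuration enabled by default, so we need to enable it with the following command: acs-plugin-manager --exec --plugin=ecs_dump_collector --params="--enable"
To collect the memory dump, run the following command: acs-plugin-manager --exec --plugin=ecs_dump_collector --params="-c"
The command prints a confirmation message when the collection succeeds.
With the dump file in hand, we can pass it to experienced operations staff or Alibaba Cloud engineers to analyze the cause of the crash.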
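Using Crash + Kdump
What if we are not on Alibaba Cloud? In that case we can collect the crash dump with Crash + Kdump. Judging from the previous section, the Alibaba Cloud plugin appears to be built on these same two tools. Here is a brief introduction.
Installation
First install Crash and Kdump with the following commands: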
$ sudo apt install linux-crashdump
$ sudo apt install crash
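After installation, reboot the server so that the service takes effect.
Then check the configuration with the following command: sudo cat /etc/default/grub.d/kdump-tools.cfg
The output shows that the system reserves 192M of RAM for the dump-capture kernel.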
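To confirm after the reboot that the reservation actually took effect, one option is to look at the running kernel's command line and at the kexec state. The following is a minimal sketch, assuming the standard /proc/cmdline and /sys/kernel/kexec_crash_loaded interfaces are available. The function names crashkernel_param and kdump_loaded are illustrative and not part of kdump-tools.

```shell
#!/bin/sh
# Print the crashkernel= parameter found in a kernel command line file.
# Defaults to the running kernel's command line; returns 1 if absent.
crashkernel_param() {
    cmdline_file="${1:-/proc/cmdline}"
    # Word splitting on the file content yields one boot parameter per word
    for word in $(cat "$cmdline_file"); do
        case "$word" in
            crashkernel=*) printf '%s\n' "$word"; return 0 ;;
        esac
    done
    return 1
}

# Succeed only if the capture kernel is loaded (the state file contains 1).
kdump_loaded() {
    state_file="${1:-/sys/kernel/kexec_crash_loaded}"
    [ "$(cat "$state_file" 2>/dev/null)" = "1" ]
}

# Report on the live system, guarding against missing interfaces
if [ -r /proc/cmdline ] && param=$(crashkernel_param); then
    echo "reservation: $param"
else
    echo "no crashkernel= parameter on the kernel command line"
fi

if kdump_loaded; then
    echo "capture kernel loaded"
else
    echo "capture kernel NOT loaded"
fi
```

On Ubuntu, kdump-config show (shipped with kdump-tools) reports similar information in one place.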
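Collection
For testing, we can trigger a crash quickly with this command: sudo echo c > /proc/sysrq-trigger
Note that this form only works from a root shell, because the redirection is performed by the calling shell rather than by sudo. From a non-root shell, use echo c | sudo tee /proc/sysrq-trigger instead.
After the command runs, a directory named after the current time is created under /var/crash, and it contains the collected dump.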
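Once the machine has come back up, the newest dump directory can be located with a small helper. This is a sketch that assumes the timestamp-style directory names seen in this article (for example 202401061626), which sort chronologically as plain text. The function name latest_crash_dir is illustrative.

```shell
#!/bin/sh
# Print the newest timestamp-named dump directory under a crash root.
# Only directories whose names start with a digit are considered, so
# other files that may live in /var/crash are ignored.
latest_crash_dir() {
    crash_root="${1:-/var/crash}"
    for d in "$crash_root"/[0-9]*/; do
        # An unmatched glob stays literal and fails the -d check
        [ -d "$d" ] && printf '%s\n' "${d%/}"
    done | sort | tail -n 1
}

dir=$(latest_crash_dir)
if [ -n "$dir" ]; then
    echo "latest dump directory: $dir"
else
    echo "no crash dumps found"
fi
```

Listing the directory contents afterwards (sudo ls -lh on the printed path) shows the files described next.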
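dmesg.x is the kernel log at the time of the crash, and dump.x is the kernel memory snapshot (x is the timestamp). To make troubleshooting easier, we also need a vmlinux with debug symbols, which can be installed with the following commands: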
echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse
deb http://ddebs.ubuntu.com $(lsb_release -cs)-updates main restricted universe multiverse
deb http://ddebs.ubuntu.com $(lsb_release -cs)-proposed main restricted universe multiverse" | sudo tee -a /etc/apt/sources.list.d/ddebs.list
sudo apt install ubuntu-dbgsym-keyring
sudo apt-get update
sudo apt -y install linux-image-$(uname -r)-dbgsym
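Before moving on, it can help to check that the debug vmlinux really landed where crash will look for it. The sketch below assumes the /usr/lib/debug/boot/vmlinux-<release> layout used in this article. The function name debug_vmlinux_path is illustrative.

```shell
#!/bin/sh
# Build the expected path of the debug vmlinux for a kernel release.
# Defaults to the running kernel's release as reported by uname -r.
debug_vmlinux_path() {
    release="${1:-$(uname -r)}"
    printf '/usr/lib/debug/boot/vmlinux-%s\n' "$release"
}

vmlinux=$(debug_vmlinux_path)
if [ -f "$vmlinux" ]; then
    echo "debug symbols found: $vmlinux"
else
    echo "debug symbols missing: $vmlinux"
fi
```

The release that matters is the one of the kernel that crashed. If the machine has been upgraded since the dump was taken, pass that release explicitly, for example debug_vmlinux_path 5.4.0-166-generic.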
Once that is installed, we can analyze the dump file we just collected with this command:
sudo crash /usr/lib/debug/boot/vmlinux-5.4.0-166-generic /var/crash/202401061626/dump.202401061626
Output:
crash 7.2.8
Copyright (C) 2002-2020 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
WARNING: kernel relocated [300MB]: patching 115314 gdb minimal_symbol values
KERNEL: /usr/lib/debug/boot/vmlinux-5.4.0-166-generic
DUMPFILE: /var/crash/202401061626/dump.202401061626 [PARTIAL DUMP]
CPUS: 2
DATE: Sat Jan 6 16:26:23 2024
UPTIME: 00:06:35
LOAD AVERAGE: 0.00, 0.02, 0.02
TASKS: 148
NODENAME: iZ7xv1a0t15muqy1e2co1uZ
RELEASE: 5.4.0-166-generic
VERSION: #183-Ubuntu SMP Mon Oct 2 11:28:33 UTC 2023
MACHINE: x86_64 (2699 Mhz)
MEMORY: 3.9 GB
PANIC: "Kernel panic - not syncing: sysrq triggered crash"
PID: 3654
COMMAND: "echo"
TASK: ffff8c4a696d8000 [THREAD_INFO: ffff8c4a696d8000]
CPU: 0
STATE: TASK_RUNNING (PANIC)
Use the bt command to view the call stack at the time of the crash:
crash> bt
PID: 3654 TASK: ffff8c4a696d8000 CPU: 0 COMMAND: "echo"
#0 [ffffabde00513c68] machine_kexec at ffffffff93c6d063
#1 [ffffabde00513cc8] __crash_kexec at ffffffff93d4de32
#2 [ffffabde00513d98] panic at ffffffff946a69a8
#3 [ffffabde00513e18] sysrq_handle_crash at ffffffff94288205
#4 [ffffabde00513e28] __handle_sysrq.cold at ffffffff946ce885
#5 [ffffabde00513e60] write_sysrq_trigger at ffffffff94288a38
#6 [ffffabde00513e78] proc_reg_write at ffffffff93f69113
#7 [ffffabde00513e98] __vfs_write at ffffffff93ed1b6b
#8 [ffffabde00513ea8] vfs_write at ffffffff93ed2879
#9 [ffffabde00513ee0] ksys_write at ffffffff93ed4e07
#10 [ffffabde00513f20] __x64_sys_write at ffffffff93ed4e9a
#11 [ffffabde00513f30] do_syscall_64 at ffffffff93c04fe7
#12 [ffffabde00513f50] entry_SYSCALL_64_after_hwframe at ffffffff948000a4
RIP: 00007ff833b41077 RSP: 00007ffeadb56b78 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007ff833b41077
RDX: 0000000000000002 RSI: 000055b3e2648440 RDI: 0000000000000001
RBP: 000055b3e2648440 R8: 0000000000000000 R9: 0000000000000001
R10: 00007ff833c1c640 R11: 0000000000000246 R12: 0000000000000002
R13: 00007ff833c206a0 R14: 00007ff833c1c4a0 R15: 00007ff833c1b8a0
ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b
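The same session can also be run non-interactively, which is handy when a report has to be attached to a ticket. The sketch below relies on crash's -i option (read commands from a file) and -s option (suppress the startup banner), and on the built-in bt, log, ps and exit commands. The function names and the report file name are illustrative assumptions.

```shell
#!/bin/sh
# Write a crash command file: call stack, kernel log, process list, exit.
write_crash_commands() {
    cmd_file="$1"
    printf '%s\n' 'bt' 'log' 'ps' 'exit' > "$cmd_file"
}

# Run crash in batch mode and save everything it prints to a report file.
# Arguments: debug vmlinux, dump file, report file.
run_crash_report() {
    vmlinux="$1"
    dump="$2"
    report="$3"
    cmd_file=$(mktemp)
    write_crash_commands "$cmd_file"
    sudo crash -s -i "$cmd_file" "$vmlinux" "$dump" > "$report"
    status=$?
    rm -f "$cmd_file"
    return "$status"
}

# Example with the paths from this article:
# run_crash_report /usr/lib/debug/boot/vmlinux-5.4.0-166-generic \
#     /var/crash/202401061626/dump.202401061626 crash-report.txt
```

The trailing exit in the command file matters: without it crash would drop to its interactive prompt after running the listed commands.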
crash has many more capabilities; interested readers can explore them on their own.