Preface

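We have recently had several operating system crashes with different causes. To find out why a system went down, we need to collect the memory dump produced at the time of the crash. This article describes a few common ways to do that.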

Collecting with Alibaba Cloud's built-in command

Since we mainly use Alibaba Cloud servers, we start with how to collect a crash memory dump on Alibaba Cloud.

Alibaba Cloud systems do not have dump collection enabled by default. Enable it with the following command: acs-plugin-manager --exec --plugin=ecs_dump_collector --params="--enable"

img.png

To collect the memory dump, run: acs-plugin-manager --exec --plugin=ecs_dump_collector --params="-c" . Output like that shown below indicates that the collection succeeded.

img_1.png

img_2.png

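With the dump file in hand, we can pass it to an experienced operations engineer or to an Alibaba Cloud engineer to analyze the cause of the crash.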

Using Crash + Kdump

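How do we collect a crash dump if we are not on Alibaba Cloud? We can use Crash + Kdump: Kdump captures the dump when the kernel panics, and crash is used to analyze it afterwards. Judging from the output in the previous section, the Alibaba Cloud plugin is built on these same two tools. A brief introduction follows.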

Installation

First install Crash and Kdump with the following commands:

$ sudo apt install linux-crashdump
$ sudo apt install crash

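After installation, reboot the server so that the service takes effect; the memory reservation for the capture kernel is a boot parameter and only applies from the next boot.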

Then view the generated configuration with: sudo cat /etc/default/grub.d/kdump-tools.cfg

img_3.png img_4.png

The output shows that the system reserves 192M of RAM for the dump-capture kernel.
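The reservation can also be checked on the running system after the reboot. The following is a minimal sketch, not part of the original setup: the helper name crashkernel_param is made up for illustration, and it assumes the reservation shows up as a crashkernel= parameter in /proc/cmdline.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the crashkernel= parameter.
# With no argument it reads the command line of the running kernel.
crashkernel_param() {
    cmdline=${1:-$(cat /proc/cmdline 2>/dev/null)}
    for word in $cmdline; do
        case $word in
            crashkernel=*) printf '%s\n' "${word#crashkernel=}" ;;
        esac
    done
}

# Report what the running kernel was booted with
value=$(crashkernel_param)
if [ -n "$value" ]; then
    echo "crashkernel=$value"
else
    echo "no crashkernel= parameter on the kernel command line"
fi

# 1 means a capture kernel is loaded; the file is absent without kexec support
if [ -r /sys/kernel/kexec_crash_loaded ]; then
    echo "capture kernel loaded: $(cat /sys/kernel/kexec_crash_loaded)"
fi
```

If the parameter is missing or the loaded flag is 0, a crash will most likely not produce a dump. On Ubuntu, sudo kdump-config show from the kdump-tools package reports the same state in more detail.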

Collection

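For testing, a crash can be triggered quickly with: sudo echo c > /proc/sysrq-trigger . Note that the redirection is performed by the calling shell rather than by sudo, so this form only works from a root shell; from an unprivileged shell, echo c | sudo tee /proc/sysrq-trigger has the same effect. After the command runs, a directory named after the current time is created under /var/crash, and the collected dump is inside it.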

img_5.png
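When several crashes have been captured, the newest directory can be picked out by name. This is a small sketch under the assumption that directories follow the timestamp naming seen above (for example 202401061626), so that sorting them as text also sorts them by time; the helper name latest_crash_dir is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the newest timestamp-named subdirectory.
# Glob results are sorted, so the last match is the most recent dump.
latest_crash_dir() {
    crash_root=${1:-/var/crash}
    newest=
    for dir in "$crash_root"/[0-9]*/; do
        [ -d "$dir" ] || continue      # pattern stays literal when nothing matches
        newest=${dir%/}
    done
    [ -n "$newest" ] && printf '%s\n' "$newest"
}

dir=$(latest_crash_dir) || true
if [ -n "$dir" ]; then
    echo "latest dump directory: $dir"
    ls -lh "$dir"
else
    echo "no dump directory found under /var/crash"
fi
```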

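dmesg.x is the kernel log at the time of the crash, and dump.x is the dumped kernel snapshot. To make the analysis easier we also need vmlinux, the kernel image with debug symbols. Install it with the following commands: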

echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse
deb http://ddebs.ubuntu.com $(lsb_release -cs)-updates main restricted universe multiverse
deb http://ddebs.ubuntu.com $(lsb_release -cs)-proposed main restricted universe multiverse" | sudo tee -a /etc/apt/sources.list.d/ddebs.list
sudo apt install ubuntu-dbgsym-keyring
sudo apt-get update
sudo apt -y install linux-image-$(uname -r)-dbgsym

Once that is installed, we can analyze the dump file we just collected with:

sudo crash /usr/lib/debug/boot/vmlinux-5.4.0-166-generic /var/crash/202401061626/dump.202401061626 

Output:

crash 7.2.8
Copyright (C) 2002-2020  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.
 
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

WARNING: kernel relocated [300MB]: patching 115314 gdb minimal_symbol values

      KERNEL: /usr/lib/debug/boot/vmlinux-5.4.0-166-generic            
    DUMPFILE: /var/crash/202401061626/dump.202401061626  [PARTIAL DUMP]
        CPUS: 2
        DATE: Sat Jan  6 16:26:23 2024
      UPTIME: 00:06:35
LOAD AVERAGE: 0.00, 0.02, 0.02
       TASKS: 148
    NODENAME: iZ7xv1a0t15muqy1e2co1uZ
     RELEASE: 5.4.0-166-generic
     VERSION: #183-Ubuntu SMP Mon Oct 2 11:28:33 UTC 2023
     MACHINE: x86_64  (2699 Mhz)
      MEMORY: 3.9 GB
       PANIC: "Kernel panic - not syncing: sysrq triggered crash"
         PID: 3654
     COMMAND: "echo"
        TASK: ffff8c4a696d8000  [THREAD_INFO: ffff8c4a696d8000]
         CPU: 0
       STATE: TASK_RUNNING (PANIC)

Use the bt command to view the call stack at the time of the crash:

crash> bt
PID: 3654   TASK: ffff8c4a696d8000  CPU: 0   COMMAND: "echo"
 #0 [ffffabde00513c68] machine_kexec at ffffffff93c6d063
 #1 [ffffabde00513cc8] __crash_kexec at ffffffff93d4de32
 #2 [ffffabde00513d98] panic at ffffffff946a69a8
 #3 [ffffabde00513e18] sysrq_handle_crash at ffffffff94288205
 #4 [ffffabde00513e28] __handle_sysrq.cold at ffffffff946ce885
 #5 [ffffabde00513e60] write_sysrq_trigger at ffffffff94288a38
 #6 [ffffabde00513e78] proc_reg_write at ffffffff93f69113
 #7 [ffffabde00513e98] __vfs_write at ffffffff93ed1b6b
 #8 [ffffabde00513ea8] vfs_write at ffffffff93ed2879
 #9 [ffffabde00513ee0] ksys_write at ffffffff93ed4e07
#10 [ffffabde00513f20] __x64_sys_write at ffffffff93ed4e9a
#11 [ffffabde00513f30] do_syscall_64 at ffffffff93c04fe7
#12 [ffffabde00513f50] entry_SYSCALL_64_after_hwframe at ffffffff948000a4
    RIP: 00007ff833b41077  RSP: 00007ffeadb56b78  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 0000000000000002  RCX: 00007ff833b41077
    RDX: 0000000000000002  RSI: 000055b3e2648440  RDI: 0000000000000001
    RBP: 000055b3e2648440   R8: 0000000000000000   R9: 0000000000000001
    R10: 00007ff833c1c640  R11: 0000000000000246  R12: 0000000000000002
    R13: 00007ff833c206a0  R14: 00007ff833c1c4a0  R15: 00007ff833c1b8a0
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

crash has many more commands; interested readers can explore them on their own.
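As a starting point, crash can also run a fixed list of commands from a file through its -i option, which is handy for producing a first report without an interactive session. The sketch below reuses the vmlinux and dump paths from the session above; the helper name write_crash_commands, the choice of commands, and the report file name are illustrative assumptions rather than a recommended set.

```shell
#!/bin/sh
# Hypothetical helper: write a short list of crash commands to a file.
write_crash_commands() {
    cat > "$1" <<'EOF'
sys
bt
log
ps
kmem -i
quit
EOF
}

# Paths taken from the session above; adjust to the actual kernel and dump
VMLINUX=/usr/lib/debug/boot/vmlinux-5.4.0-166-generic
DUMP=/var/crash/202401061626/dump.202401061626

cmds=$(mktemp)
write_crash_commands "$cmds"

# Only run crash when the tool and both files are available (run as root)
if command -v crash >/dev/null 2>&1 && [ -r "$VMLINUX" ] && [ -r "$DUMP" ]; then
    crash -s -i "$cmds" "$VMLINUX" "$DUMP" > crash-report.txt
    echo "report written to crash-report.txt"
else
    echo "crash, vmlinux or dump not available; commands are in $cmds"
fi
```

Here sys prints the system summary, log prints the kernel log buffer, ps lists the tasks, and kmem -i summarizes memory usage; the final quit ends the session so the script does not wait for input.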

References