From crashdump to clarity: A kernel debugging adventure
Post by: Anton Eliasson
Image author: Konstantin Lanzet, CC BY-SA 3.0 < CC BY-SA 3.0 Deed | Attribution-ShareAlike 3.0 Unported | Creative Commons >, via Wikimedia Commons
When making mistakes in operating system kernel code there is often no recovery apart from a full system reboot. Unlike when userspace processes crash there is nobody to pick up the pieces after a kernel crash so a full system reboot is the only option.
There are a few candidates for debugging such problems:
- printf debugging
- hardware debuggers
Printf debugging is quick and easy but limited in that it can only print what you can imagine going wrong beforehand. Hardware debuggers can be used to inspect the full running state of the system. However, they only work in interactive sessions and are not useful for debugging crashes encountered in automated test systems. This leaves crashdumps.
A crashdump is essentially a full copy of DRAM and the CPU state at the point of a crash. AXIS OS has a bespoke crashdump feature that builds on collaboration between the Linux® kernel and our bootloader. The advantage of this solution compared to e.g. kdump in mainline Linux is that it can capture crashdumps no matter how broken the crashed system was, even if it was totally hung and rebooted by a hardware watchdog reset. This is used to collect crashdumps from devices in our testing labs.
The crashdump can be inspected in a debugger like gdb or the crash utility.
A case study
This is reconstructed from memory, old notes and from re-doing parts. There may be inaccuracies in some details. A few years ago I was tasked with debugging an issue on a product where some bootups would fail with errors in the SquashFS filesystem driver while decompressing the root filesystem, seemingly at random. The errors could be seen on the system’s serial console and looked like this:
[ 20.409287] SQUASHFS error: zlib end-of-stream not found (-3) [ 20.409335] SQUASHFS error: zlib decompression failed, data probably corrupt [ 20.409340] SQUASHFS error: Failed to read data block of (compressed) size 87427 starting at 0x17f712c
The system would then be unresponsive over the network since the boot hadn’t completed. The system-on-chip that this product uses didn’t have crashdumps implemented so initially I added prints to the SquashFS driver that dumped the first 128 bytes of the page that failed to decompress to the kernel log. To make debugging more convenient I also added an option to the SquashFS driver to panic the system on I/O errors instead of failing the operation. Fortunately the issue occurred quite often so it was easy to reproduce by simply rebooting the device in a loop until it crashed. I compared the output with the corresponding location in the filesystem image but found that they were identical. I tried dumping memory from a few other offsets in the page but got the same result.
In the end I decided that I needed working crashdumps in order to debug this further so I spent the time needed to port crashdump support to the system-on-chip and product in question. The details of the crashdump implementation may be of interest but is a topic for another story.
After reproducing the issue again I finally ended up with a crashdump on the device’s hard drive. Excited, I opened it in the crash utility:
crash-arm --readnow vmlinux-4.9.206-axis5 dump-20230519-123630.elf crash-arm 7.2.9 [...] GNU gdb (GDB) 7.6 [...] This GDB was configured as "--host=x86_64-unknown-linux-gnu --target=arm-elf-linux"... KERNEL: vmlinux-4.9.206-axis5 DUMPFILE: dump-20230519-123630.elf CPUS: 2 [OFFLINE: 1] DATE: Fri May 19 14:35:58 CEST 2023 UPTIME: 00:00:20 LOAD AVERAGE: 1.07, 0.24, 0.08 TASKS: 276 NODENAME: ax-accc8eda4b32 RELEASE: 4.9.206-axis5 VERSION: #1 SMP PREEMPT Thu Apr 27 01:32:02 UTC 2023 MACHINE: armv7l (unknown Mhz) MEMORY: 512 MB PANIC: "Kernel panic - not syncing: squashfs read failed" PID: 893 COMMAND: "smbd" TASK: d5479600 [THREAD_INFO: dc1b4000] CPU: 0 STATE: TASK_RUNNING (PANIC) crash-arm>
In this case a good place to start debugging is to inspect the contents of the page cache belonging to the root filesystem image in memory. The inode number of the superblock can be extracted like this:
crash-arm> mount MOUNT SUPERBLK TYPE DEVNAME DIRNAME df911200 df948000 rootfs rootfs / df6f1e00 df543000 squashfs /dev/root / df6f1f00 df949800 proc proc /proc df4a5000 df543800 sysfs sysfs /sys df4a5100 df9d0000 devtmpfs devtmpfs /dev df4a5200 df544000 tmpfs tmpfs /run df4a5300 df544800 tmpfs tmpfs /var/volatile [...] crash-arm> super_block.s_bdev df543000 s_bdev = 0xdf822400 crash-arm> block_device.bd_inode 0xdf822400 bd_inode = 0xdf822488
files -p 0xdf822488 command can now be used to inspect the page cache and the
rd command can then be used to dump memory and compare it to the filesystem image on disk. The SquashFS decompression error was at address 0x17f712c which corresponds to page index floor(0x17f712c / 4096) = 0x17f7 so this is a good place to start. However, the problem may lie elsewhere due to the fact that decompression errors can propagate from the initial error until the decompressor can no longer make sense of the data. Using this method it took me a few tries before I eventually found discrepancies between the filesystem image in memory and the corresponding data on disk.
I later learned that there is an easier way which uses the crash cacheutils extension. After building and loading it into crash with
extend cacheutils.so the
ccat command becomes available which can be used to dump the entire contents of the filesystem image like this:
crash-arm> ccat 0xdf822488 foo
The filesystem dump can then be compared with the filesystem image on disk for example using vimdiff:
$ od -v -Ax -tx4z foo > foo.hex $ od -v -Ax -tx4z rootfs.img > rootfs.hex $ vimdiff foo.hex rootfs.hex
With the help of colleagues I found that in many pages at offset 128 and offset 2024+128 there were 32 bytes overwritten with the same data. Those 32 bytes matched another location in the filesystem image and so were most likely stolen from there. This type of memory corruption bears the hallmark of a cache coherency issue, as 32 bytes is the cache line size on this platform. A colleague of mine who is familiar with the Arm® PL310 cache controller in this system-on-chip identified that it was missing a piece of configuration in its devicetree, specifically:
The PL310 cache controller has a peculiar default behavior of transforming non-cacheable memory transactions into cacheable no-allocate transactions (see this patch for further reading). This improves performance in some cases. However, in cases where the same memory has another cacheable mapping, which is the case for DMA buffers in Linux, the cache may have already been populated with stale data. Uncached reads will in this case return the cached stale data, leading to memory corruption. Setting the shared override bit in the cache controller disables this behavior and fixed the boot failures on this product.
The SquashFS option to panic the system on I/O errors didn’t apply to Linux mainline so it wasn’t easily upstreamable. Eventually it was rewritten and upstreamed by a colleague. Panicking and rebooting on serious errors is desirable on some embedded systems where the best action is to get back to a known working state as quickly as possible even if it means a short period of downtime. On a general-purpose PC, or in some safety-critical systems, it may instead be desirable to limp along in order to facilitate interactive debugging or to maintain at least partial functionality.
While the fix was simple in the end, the road leading to it was long and instructive. Going through the exercise of implementing crashdumps, capturing one and analyzing it was instructive and made me more prepared to debug similar issues in the future.
Linux® is the registered trademark of Linus Torvalds in the U.S. and other countries.
Arm is a registered trademark of Arm Limited (or its subsidiaries or affiliates) in the US and/or elsewhere.