The Xen Crashdump Analyser is written in (a subset of) C++, and has no dependencies. Grab g++ from your standard distro repositories, type make, and it should work.

To run successfully, the analyser needs to know the layout of some of Xen's internal structures, such as struct vcpu and struct domain. This is covered by the above patch, which adds some new offsets using the standard system and appends them to the hypervisor symbol table.
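In full, building and sanity-checking the analyser looks something like the following (the source directory name is illustrative; use wherever you unpacked or cloned the source):

```shell
# Build the analyser; plain g++ and make are the only requirements.
# 'xen-crashdump-analyser/' is an illustrative source directory name.
cd xen-crashdump-analyser/
make

# Confirm the binary built and runs.
./xen-crashdump-analyser --help
```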
The Xen Crashdump Analyser is designed to be run with access to /proc/vmcore of a crashed system. Typically, this means running in a kdump environment, but it is possible to copy /proc/vmcore off a crashed system for later analysis.
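For offline analysis, the core can simply be copied out of the kdump environment before rebooting; a sketch (the destination host and path are hypothetical):

```shell
# From within the kdump environment, preserve the core for later analysis.
# 'analysis-box' and the destination path are hypothetical.
scp /proc/vmcore analysis-box:/var/crash/$(hostname)-vmcore
```

Note that /proc/vmcore is an image of the crashed system's memory, so it can be very large; make sure the destination has sufficient space.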
XenServer uses a 64MB crash region, so the analyser has been designed from scratch with this constraint in mind. Furthermore, in the case of a hypervisor crash, no leftover pointers, data or pagetables can be trusted, which leads to a very defensive coding style when walking Xen memory to pull out state. As an upper stress test, I have verified correct functionality when running in a 64MB crash region (including the kernel and a regular root filesystem), analysing a crash with 40 domains, each with 4095 vcpus. (CentOS 6.2 wouldn't boot with 4096, but I was too busy to investigate why. Also, the server ran like treacle.)
To play with the crashdump analyser, you will need a 64bit Xen, compiled with the patch from above, running a classic-xen dom0 kernel. (The kexec functionality is not present in pvops currently, but should be appearing soon.) Specify crashkernel=<size>@<location> on Xen's command line. Load a crash kernel in the normal manner using kexec from the kexec-tools package. To crash the server, use echo c > /proc/sysrq-trigger or xl debug-keys c.
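Putting those steps together, a typical session looks something like this (the kernel paths, region size/location and append options are illustrative; pick values appropriate for your system):

```shell
# 1) Boot Xen with a reserved crash region, e.g. on the Xen command line:
#      crashkernel=64M@32M
#    (a XenServer-style 64MB region; the location is system-dependent)

# 2) From dom0, load the crash kernel using kexec from kexec-tools
#    (-p loads a panic/crash kernel; paths and options are illustrative):
kexec -p /boot/vmlinuz-kdump \
      --initrd=/boot/initrd-kdump.img \
      --append="root=/dev/sda1 ro console=ttyS0 irqpoll maxcpus=1"

# 3) Deliberately crash the box, either from dom0:
echo c > /proc/sysrq-trigger
#    ... or by sending Xen the 'c' debug key:
xl debug-keys c
```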
See ./xen-crashdump-analyser --help for full information, but in short:

./xen-crashdump-analyser -x /path/to/xen.symtab -d /path/to/dom0.symtab -o /path/to/output/directory
This will write a set of logs to the output directory:

- xen-crashdump-analyser.log - The analyser's log of what it is doing.
- xen.log - Xen's state: some general information, full pcpu state (regs, stack, code and calltrace), active vcpu state, and the console ring.
- dom0.log - Dom0's state: some domain information, full vcpu state (regs, stack, code and calltrace), and the console ring.
- dom$U.log - For each domU, a report similar to dom0's where possible (allowing for lack of a symbol table, inability to read HVM domain memory, etc.).

Here is an example analysis which formed the basis of my investigation into race conditions in the scheduler (http://lists.xen.org/archives/html/xen-devel/2013-02/msg01411.html).
I will freely admit that it is far from bug free, and the code is somewhat organic in places. So far, bugs get fixed and new features get added on a 'when I am not doing something more urgent' basis.

In the majority of software crashes, the Xen and dom0 state are sufficient to start working on the problem. The plan is to upstream the analyser into the main Xen repository, given its close ABI links with the hypervisor. It is presented here in the hope that you will try it out and find it useful.

I think the following features would be nice, and will see about implementing them (in my copious free time):