Documentation/oops-tracing.txt

NOTE: ksymoops is useless on 2.6. Please use the Oops in its original format
(from dmesg, etc.). Ignore any references in this or other docs to "decoding
the Oops" or "running it through ksymoops". If you post an Oops from 2.6 that
has been run through ksymoops, people will just tell you to repost it.

Quick Summary
-------------

Find the Oops and send it to the maintainer of the kernel area that seems to be
involved with the problem. Don't worry too much about getting the wrong person.
If you are unsure, send it to the person responsible for the code relevant to
what you were doing. If it occurs repeatably, try to describe how to recreate
it. That's worth even more than the oops itself.

If you are totally stumped as to whom to send the report, send it to
linux-kernel@vger.kernel.org. Thanks for your help in making Linux as
stable as humanly possible.

Where is the Oops?
------------------

Normally the Oops text is read from the kernel buffers by klogd and
handed to syslogd, which writes it to a syslog file, typically
/var/log/messages (this depends on /etc/syslog.conf). Sometimes klogd
dies, in which case you can run "dmesg > file" to read the data from
the kernel buffers and save it. Or you can "cat /proc/kmsg > file";
however, you have to break in to stop the transfer, since kmsg is a
"never ending file". If the machine has crashed so badly that you
cannot enter commands or the disk is not available, then you have
three options :-

(1) Hand copy the text from the screen and type it in after the machine
    has restarted. Messy, but it is the only option if you have not
    planned for a crash. Alternatively, you can take a picture of
    the screen with a digital camera - not nice, but better than
    nothing. If the messages scroll off the top of the console, you
    may find that booting with a higher resolution (e.g., vga=791)
    will allow you to read more of the text. (Caveat: this needs vesafb,
    so it won't help for 'early' oopses.)

(2) Boot with a serial console (see Documentation/serial-console.txt),
    run a null modem to a second machine and capture the output there
    using your favourite communication program. Minicom works well.

(3) Use Kdump (see Documentation/kdump/kdump.txt) and extract the
    kernel ring buffer from old memory using the dmesg gdb macro in
    Documentation/kdump/gdbmacros.txt.

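If the oops did make it into a syslog file, the interesting lines can be
cut out with standard tools. A minimal sketch (messages.sample is a
hypothetical stand-in for /var/log/messages, and the oops is assumed to
run from the "Oops:" line through the "Code:" line, as in the example
later in this document):

	# Print everything from the "Oops:" line through the "Code:" line.
	# messages.sample is a stand-in for /var/log/messages.
	sed -n '/Oops:/,/Code:/p' messages.sample
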
Full Information
----------------

NOTE: the message from Linus below applies to the 2.4 kernel. I have
preserved it for historical reasons, and because some of the information
in it still applies. In particular, please ignore any references to
ksymoops.

From: Linus Torvalds <torvalds@osdl.org>

How to track down an Oops.. [originally a mail to linux-kernel]

The main trick is having 5 years of experience with those pesky oops
messages ;-)

Actually, there are things you can do that make this easier. I have two
separate approaches:

	gdb /usr/src/linux/vmlinux
	gdb> disassemble <offending_function>

That's the easy way to find the problem, at least if the bug-report is
well made (like this one was - run through ksymoops to get the
information of which function and the offset in the function that it
happened in).

Oh, it helps if the report happens on a kernel that is compiled with the
same compiler and similar setups.

The other thing to do is disassemble the "Code:" part of the bug report:
ksymoops will do this too with the correct tools, but if you don't have
the tools you can just do a silly program:

	char str[] = "\xXX\xXX\xXX...";
	main(){}

and compile it with gcc -g and then do "disassemble str" (where the "XX"
stuff are the values reported by the Oops - you can just cut-and-paste
and do a replace of spaces to "\x" - that's what I do, as I'm too lazy
to write a program to automate this all).

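The cut-and-paste step can also be scripted with standard tools instead
of the dummy-program trick. A sketch, not part of the original mail,
using perl and binutils (the byte string is the "Code:" line from the
example oops later in this file, and the i386 target matches that
example):

	# Convert the space-separated hex bytes of a "Code:" line into a
	# raw binary, then disassemble it as i386 code.
	code="c7 00 05 00 00 00 eb 08 90 90 90 90 90 90 90 90 89 ec 5d c3"
	perl -e 'print pack "H*", shift' "$(printf '%s' "$code" | sed 's/ //g')" > code.bin
	objdump -D -b binary -m i386 code.bin
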
Finally, if you want to see where the code comes from, you can do

	cd /usr/src/linux
	make fs/buffer.s	# or whatever file the bug happened in

and then you get a better idea of what happens than with the gdb
disassembly.

Now, the trick is just then to combine all the data you have: the C
sources (and general knowledge of what it _should_ do), the assembly
listing and the code disassembly (and additionally the register dump you
also get from the "oops" message - that can be useful to see _what_ the
corrupted pointers were, and when you have the assembler listing you can
also match the other registers to whatever C expressions they were used
for).

Essentially, you just look at what doesn't match (in this case it was the
"Code" disassembly that didn't match with what the compiler generated).
Then you need to find out _why_ they don't match. Often it's simple - you
see that the code uses a NULL pointer and then you look at the code and
wonder how the NULL pointer got there, and if it's a valid thing to do
you just check against it..

Now, if somebody gets the idea that this is time-consuming and requires
some small amount of concentration, you're right. Which is why I will
mostly just ignore any panic reports that don't have the symbol table
info etc looked up: it simply gets too hard to look it up (I have some
programs to search for specific patterns in the kernel code segment, and
sometimes I have been able to look up those kinds of panics too, but
that really requires pretty good knowledge of the kernel just to be able
to pick out the right sequences etc..)

_Sometimes_ it happens that I just see the disassembled code sequence
from the panic, and I know immediately where it's coming from. That's when
I get worried that I've been doing this for too long ;-)

		Linus


---------------------------------------------------------------------------
Notes on Oops tracing with klogd:

In order to help Linus and the other kernel developers, there has been
substantial support incorporated into klogd for processing protection
faults. In order to have full support for address resolution, at least
version 1.3-pl3 of the sysklogd package should be used.

When a protection fault occurs, the klogd daemon automatically
translates important addresses in the kernel log messages to their
symbolic equivalents. This translated kernel message is then
forwarded through whatever reporting mechanism klogd is using. The
protection fault message can simply be cut out of the message files
and forwarded to the kernel developers.

Two types of address resolution are performed by klogd: static
translation and dynamic translation. Static translation uses the
System.map file in much the same manner that ksymoops does. In order
to do static translation, the klogd daemon must be able to find a
system map file at daemon initialization time. See the klogd man page
for information on how klogd searches for map files.

Dynamic address translation is important when kernel loadable modules
are being used. Since memory for kernel modules is allocated from the
kernel's dynamic memory pools, there are no fixed locations for either
the start of the module or for functions and symbols in the module.

The kernel supports system calls which allow a program to determine
which modules are loaded and their location in memory. Using these
system calls, the klogd daemon builds a symbol table which can be used
to debug a protection fault which occurs in a loadable kernel module.

At the very minimum, klogd will provide the name of the module which
generated the protection fault. There may be additional symbolic
information available if the developer of the loadable module chose to
export symbol information from the module.

Since the kernel module environment can be dynamic, there must be a
mechanism for notifying the klogd daemon when a change in the module
environment occurs. There are command line options available which
allow klogd to signal the currently executing daemon that symbol
information should be refreshed. See the klogd manual page for more
information.

A patch is included with the sysklogd distribution which modifies the
modules-2.0.0 package to automatically signal klogd whenever a module
is loaded or unloaded. Applying this patch provides essentially
seamless support for debugging protection faults which occur with
kernel loadable modules.

The following is an example of a protection fault in a loadable module
processed by klogd:
---------------------------------------------------------------------------
Aug 29 09:51:01 blizard kernel: Unable to handle kernel paging request at virtual address f15e97cc
Aug 29 09:51:01 blizard kernel: current->tss.cr3 = 0062d000, %cr3 = 0062d000
Aug 29 09:51:01 blizard kernel: *pde = 00000000
Aug 29 09:51:01 blizard kernel: Oops: 0002
Aug 29 09:51:01 blizard kernel: CPU:    0
Aug 29 09:51:01 blizard kernel: EIP:    0010:[oops:_oops+16/3868]
Aug 29 09:51:01 blizard kernel: EFLAGS: 00010212
Aug 29 09:51:01 blizard kernel: eax: 315e97cc   ebx: 003a6f80   ecx: 001be77b   edx: 00237c0c
Aug 29 09:51:01 blizard kernel: esi: 00000000   edi: bffffdb3   ebp: 00589f90   esp: 00589f8c
Aug 29 09:51:01 blizard kernel: ds: 0018   es: 0018   fs: 002b   gs: 002b   ss: 0018
Aug 29 09:51:01 blizard kernel: Process oops_test (pid: 3374, process nr: 21, stackpage=00589000)
Aug 29 09:51:01 blizard kernel: Stack: 315e97cc 00589f98 0100b0b4 bffffed4 0012e38e 00240c64 003a6f80 00000001
Aug 29 09:51:01 blizard kernel:        00000000 00237810 bfffff00 0010a7fa 00000003 00000001 00000000 bfffff00
Aug 29 09:51:01 blizard kernel:        bffffdb3 bffffed4 ffffffda 0000002b 0007002b 0000002b 0000002b 00000036
Aug 29 09:51:01 blizard kernel: Call Trace: [oops:_oops_ioctl+48/80] [_sys_ioctl+254/272] [_system_call+82/128]
Aug 29 09:51:01 blizard kernel: Code: c7 00 05 00 00 00 eb 08 90 90 90 90 90 90 90 90 89 ec 5d c3
---------------------------------------------------------------------------

Dr. G.W. Wettstein           Oncology Research Div. Computing Facility
Roger Maris Cancer Center    INTERNET: greg@wind.rmcc.com
820 4th St. N.
Fargo, ND  58122
Phone: 701-234-7556


---------------------------------------------------------------------------
Tainted kernels:

Some oops reports contain the string 'Tainted: ' after the program
counter. This indicates that the kernel has been tainted by some
mechanism. The string is followed by a series of position-sensitive
characters, each representing a particular tainted value.

  1: 'G' if all modules loaded have a GPL or compatible license, 'P' if
     any proprietary module has been loaded. Modules without a
     MODULE_LICENSE or with a MODULE_LICENSE that is not recognised by
     insmod as GPL compatible are assumed to be proprietary.

  2: 'F' if any module was force loaded by "insmod -f", ' ' if all
     modules were loaded normally.

  3: 'S' if the oops occurred on an SMP kernel running on hardware that
     hasn't been certified as safe to run multiprocessor. Currently
     this occurs only on various Athlons that are not SMP capable.

  4: 'R' if a module was force unloaded by "rmmod -f", ' ' if all
     modules were unloaded normally.

  5: 'M' if any processor has reported a Machine Check Exception,
     ' ' if no Machine Check Exceptions have occurred.

  6: 'B' if a page-release function has found a bad page reference or
     some unexpected page flags.

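The characters above correspond to bits of a taint bitmask kept by the
kernel (on kernels that provide it, the raw value can be read from
/proc/sys/kernel/tainted). A minimal sketch of decoding the six
positions, assuming the conventional bit order (bit 0 = 'P'/'G',
bit 1 = 'F', bit 2 = 'S', bit 3 = 'R', bit 4 = 'M', bit 5 = 'B'):

	# Decode the first six taint flags from a taint bitmask value.
	decode_taint() {
	    t=$1; s=""
	    # Position 1 is special: 'G' when the bit is clear, not a space.
	    if [ $((t & 1)) -ne 0 ]; then s="P"; else s="G"; fi
	    for pair in 2:F 4:S 8:R 16:M 32:B; do
	        bit=${pair%:*}; ch=${pair#*:}
	        if [ $((t & bit)) -ne 0 ]; then s="$s$ch"; else s="$s "; fi
	    done
	    printf '%s\n' "$s"
	}
	decode_taint 33    # proprietary module plus bad page: prints "P    B"
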
The primary reason for the 'Tainted: ' string is to tell kernel
debuggers if this is a clean kernel or if anything unusual has
occurred. Tainting is permanent: even if an offending module is
unloaded, the tainted value remains to indicate that the kernel is not
trustworthy.