
oops

I did another talk for The Greater London Linux User Group on the 23rd June 2001 on the subject of Linux Kernel Debugging. My slides and notes are online here; I hope people find them useful.

simon@urbanmyth.org


23rd June 2015

I was surprised to find the presentation still actively linked around the internet despite being 14 years old to the day, so I decided to reinstate it at its old URL.

Who knows, I may even update it if there is interest!


Presentation and my original presenter notes

  • about me
  • show you some things you may not have thought about and some techniques you can use
  • some bits are more complicated than others!
  • this is the kind of knowledge that separates sysadmins from operators
  • should be something for everyone
  • even if it's a sense of scale
  • cover a lot of ground
  • fastpath through the subject
  • long time on oopsen
  • debugging, a bit hard
  • rest of the time on lockups
  • quickly talk about making bug reports
  • can't show everything
  • just some basics to get you started
  • more than one way to do it!
  • do others if there is interest
  • UNIX debugging books
  • to simplify things
  • oops: tripped an alarm and flashing red lights
  • lockup: the kernel has tripped itself and can't recover
  • can be temporary like the wavelan problems we had last time
  • reboot: catastrophic failure, something very bad happened...
  • PSU upgrade causing reboots under load, not enough power
  • Q: user/sysadmin/developer
  • Q: crash?
  • Q: capture an oops?
  • Q: decode it?
  • Q: report it?
  • Q: fix it yourself?
  • spend some time on this
  • quite important
  • you should report because:
  • found rare bug
  • only happens on your hardware config
  • screen / cut and paste
  • ring buffer / dmesg
  • capture closer to decoding
  • decoding closer to fixing
  • fixing closer to developing
  • will this sound like yoda? (developing leads to suffering!!)
  • serial consoles [ IPL - kernel - lilo - null modem ]
  • descriptive huh?
  • formatting is off to get it on a slide and readable
  • Q: how many programmers?
  • Q: who has seen assembler?
  • system may continue running afterwards
  • undefined behaviour
  • xksymoops late 98
  • decode by hand so you understand it
  • use some of the techniques later on in lockups
  • NULL pointer
  • to a structure
  • accessing offset 14
  • Sun offset story
    • Solaris development cycle
    • crashed randomly all over the place
    • corruption in memory
    • corruption was always happening at the same offset
    • worked out which structures had members at that offset
    • worked out the places that altered those members
    • just checked those pieces of code
  • may trigger other oopsen in normally fine code
  • instruction pointer
  • where it went tits up
  • come back to this later
  • intel documents
  • could be useful once the function has been decoded
  • process information..
  • the stack is general purpose
  • smashing the kernel stack (8k ?)
  • limited in size
  • unreliable call trace
  • pretty useless, needs to be decoded
  • addresses that look right
  • return address is in middle of function
  • code is the next instructions to be executed
  • address on its own is almost useless
  • 3 - 4 gig range
  • get the sorted symbol table (System.map or /proc/ksyms)
  • the oops happened somewhere in a function
  • so the EIP lies between two addresses
  • derive the function and the offset into the function
  • can do that for all the call trace functions
  • you can see:
  • tar
  • mkdir
  • reiserfs filesystem (no I'm not having a go)
  • note the mov instruction, we'll come back to that
  • x
  • xbase 10!
  • this decoded information is useful to developers
  • you should be able to get to this point easily
  • x
  • debugging isn't a recipe you can just follow
  • there is more than one way to do it
  • need to adapt (use printk etc.)
  • understand what is going on
  • in this case we have an oops dump, so we'll concentrate on it
  • how you might use this information
  • don't worry if you get lost in this next bit
  • over quite quickly
  • find function name in source
  • grep or cscope
  • disassemble the .o file (or the kernel)
  • match offset to get where it failed
  • binutils, every system has it
  • offset was 0x298
  • tricky!
  • what it does
  • how it branches
  • more than one way to do it!
  • back to the objdump output
  • failed on the move
  • about to do a call
  • the function called had four arguments (push)
  • from:
  • looking at the branching
  • counting function calls
  • bloody obvious comment
  • include -g in CFLAGS
  • code before preprocessing
  • shows you exactly where it failed
  • it's loading an address into a register before calling it
  • you have that address at compile time
  • strange optimisation??
  • tada!
  • macros are a pain to debug
  • could run the source through the C preprocessor (-E)
  • coding style can make things worse; one line to 16k
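The by-hand decoding step in the notes above (take the sorted symbol table, find the two addresses the EIP lies between, derive the function and the offset into it) can be sketched in a few lines of Python. This is only an illustration: the symbol names and addresses are made up, not taken from the oops in the talk, and the input is assumed to use the three-column "address type name" layout of System.map.

```python
import bisect

# Hypothetical excerpt in System.map layout ("address type name").
# These symbols and addresses are invented for illustration only.
SYSTEM_MAP = """\
c0130000 T example_open
c0130400 T example_mkdir
c0130a00 T example_unlink
"""


def load_symbols(text):
    """Parse System.map style text into a list of (address, name), sorted by address."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, address):
    """Return (function name, offset into function), or None if below the first symbol.

    The function containing an address is the last symbol that starts at or
    before it. An address past the end of the last function would still
    resolve to that last symbol, so treat such results with suspicion.
    """
    starts = [start for start, _ in symbols]
    index = bisect.bisect_right(starts, address) - 1
    if index < 0:
        return None
    start, name = symbols[index]
    return name, address - start


symbols = load_symbols(SYSTEM_MAP)
name, offset = resolve(symbols, 0xC0130698)
print(f"{name}+{offset:#x}")  # prints "example_mkdir+0x298"
```

The same lookup works for every address in the call trace, which is how a list of bare numbers turns into a list of function+offset pairs that a developer can match against a disassembly.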


  • will stop there
  • a message from our sponsor...
  • Linus is notorious for being against in-kernel debuggers
    • to properly fix problems you need to
    • understand the written code
    • and what the code is supposed to be doing
  • can relax a bit now
  • this is a lot easier
  • lockups, hangs
  • in general lockups are caused by waiting for something
  • isn't going to happen
  • isn't going to be released
  • can lockup one or more processors on a system
  • like capturing oopsen, the goal is to find out where the code locked up
  • when you have the location you can answer why
  • bad hardware (like memory)
  • bad driver
  • bad microcode....
  • bad luck?
  • easier to debug when it's quickly reproducible
  • the print EIP patch can be used to debug the spontaneous reboot type bugs
  • software problems
  • test with keyboard lights
  • hardware hackers can make an NMI board from old ISA cards
  • find out where it's gone with magic sysreq
  • P show regs / EIP
  • very useful general purpose tool
  • slightly dangerous so it's usually turned off on most systems by default
  • /etc/sysctl.conf on red hat
  • you'll get an interrupt
  • a-b lock inversion deadlocks on MP systems
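The a-b lock inversion mentioned in the notes above can be shown with a small userspace analogy. This is a sketch using Python threads, not kernel code: the lock names, the barriers and the timeout are all assumptions made so that the demonstration is repeatable and terminates. In the kernel there is no timeout, so each processor would simply wait forever for a lock that isn't going to be released.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()

# Barriers make the bad interleaving happen every time instead of rarely.
both_hold_first = threading.Barrier(2)
both_tried_second = threading.Barrier(2)

results = {}


def worker(name, first, second):
    """Take `first`, then try to take `second` while the other thread does the reverse."""
    with first:
        # Wait until the other thread is also holding its first lock.
        both_hold_first.wait()
        # A real deadlock would block here forever; the timeout lets us observe it.
        got_second = second.acquire(timeout=0.2)
        results[name] = got_second
        # Do not release anything until both threads have made their attempt.
        both_tried_second.wait()
        if got_second:
            second.release()


t1 = threading.Thread(target=worker, args=("a_then_b", lock_a, lock_b))
t2 = threading.Thread(target=worker, args=("b_then_a", lock_b, lock_a))
t1.start()
t2.start()
t1.join()
t2.join()

# Both entries are False: neither thread could get its second lock.
print(results)
```

The usual fix is to pick one order (always a before b) and use it everywhere; with a consistent order the second acquire can always eventually succeed.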

  • NMI watchdog producing stack dumps
  • print EIP patch is applicable here
  • X lockups are common and difficult to debug
  • on console look like lockups with IRQ disabled
  • check over the network, eject pccards on notebooks, music carries on playing
  • can use the sysreq key to reset the keyboard into Raw mode and change VTs
  • NONONO!
  • read the REPORTING-BUGS file
    • oops data
    • kernel version
    • patches applied
    • kernel config file
    • hardware
    • MOST IMPORTANTLY WHAT YOU WERE DOING AT THE TIME!
  • type
  • gather
  • process
  • what to do with it
  • debugging basics
  • hopefully closer to developing code
  • or realise what's going wrong faster
  • go further? understand the kernel
  • learn basic languages (C and ASM)
  • experiment
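As a reminder of what goes into a bug report, the checklist from the notes above (oops data, kernel version, patches applied, kernel config file, hardware, and what you were doing at the time) can be turned into a small template builder. This is a hypothetical helper: the field keys and the output layout are my own invention and are not the format described in the REPORTING-BUGS file, which you should still read.

```python
import platform

# Checklist taken from the notes; the keys and layout are illustrative assumptions.
REPORT_FIELDS = [
    ("activity", "What you were doing at the time"),
    ("oops", "Oops data"),
    ("kernel_version", "Kernel version"),
    ("patches", "Patches applied"),
    ("config", "Kernel config file"),
    ("hardware", "Hardware"),
]


def missing_fields(details):
    """Return the keys of checklist items that have not been filled in."""
    return [key for key, _ in REPORT_FIELDS if not details.get(key)]


def build_report(details):
    """Render the checklist as plain text, flagging anything left blank."""
    lines = []
    for key, title in REPORT_FIELDS:
        lines.append(f"[{title}]")
        lines.append(details.get(key) or "(not provided)")
        lines.append("")
    return "\n".join(lines)


details = {
    "activity": "untarring an archive when the machine oopsed",
    # platform.release() reports the running kernel release on Linux.
    "kernel_version": platform.release(),
}
print(build_report(details))
print("still missing:", ", ".join(missing_fields(details)))
```

Putting "what you were doing" first is deliberate: it is the item the notes single out as most important, and the one most often left out.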


  • Simon Trimmer, 22 Jun 2015, 17:07