 | about me; show you some things you may not have thought about and some techniques you can use; some bits are more complicated than others! this is the kind of knowledge that separates sysadmins from operators; should be something for everyone, even if it's a sense of scale |
 | cover a lot of ground; fast path through the subject; long time on oopsen; debugging, a bit hard; rest of the time on lockups; quickly talk about making bug reports
|
 | can't show everything; just some basics to get you started; more than one way to do it! do others if there is interest; UNIX debugging books |
 | to simplify things; oops: tripped an alarm and flashing red lights; lockup: the kernel has tripped itself and can't recover; can be temporary, like the wavelan problems we had last time; reboot: catastrophic failure, something very bad happened... PSU upgrade causing reboots under load, not enough power; Q: user/sysadmin/developer Q: crash? Q: capture an oops? Q: decode it? Q: report it? Q: fix it yourself? |
 | spend some time on this; quite important; you should report because: found a rare bug; only happens on your hardware config |
 | screen / cut and paste; ring buffer / dmesg; capture closer to decoding; decoding closer to fixing; fixing closer to developing; will this sound like yoda? (developing leads to suffering!!) serial consoles [ IPL - kernel - lilo - null modem ] |
 | descriptive huh? formatting is off to get it on a slide and readable; Q: how many programmers? Q: who has seen assembler? |
 | system may continue running afterwards; undefined behaviour |
 | xksymoops late 98; decode by hand so you understand it; use some of the techniques later on in lockups |
 | NULL pointer to a structure; accessing offset 14; Sun offset story
- Solaris development cycle
- crashed randomly all over the place
- corruption in memory
- corruption was always happening at the same offset
- worked out which structures had members at that offset
- worked out the places that altered those members
- just checked those pieces of code
|
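The "which structures have a member at that offset" step from the Sun story can be sketched in Python with `ctypes`. This is only an illustration: the three structure layouts below are hypothetical, and the offset is taken as hexadecimal 0x14 on the assumption that the oops prints the faulting address in hex.

```python
import ctypes


# Hypothetical structure layouts, invented for illustration only.
class Buffer(ctypes.Structure):
    _fields_ = [
        ("next", ctypes.c_uint32),
        ("prev", ctypes.c_uint32),
        ("block", ctypes.c_uint32),
        ("size", ctypes.c_uint32),
        ("count", ctypes.c_uint32),
        ("state", ctypes.c_uint32),
    ]


class Request(ctypes.Structure):
    _fields_ = [
        ("cmd", ctypes.c_uint32),
        ("errors", ctypes.c_uint32),
        ("sector", ctypes.c_uint64),
        ("nr_sectors", ctypes.c_uint32),
        ("flags", ctypes.c_uint16),
        ("pad", ctypes.c_uint16),
    ]


class Timer(ctypes.Structure):
    _fields_ = [
        ("expires", ctypes.c_uint64),
        ("data", ctypes.c_uint64),
    ]


def members_at(structures, offset):
    """Return (structure name, member name) pairs whose member starts at offset."""
    matches = []
    for structure in structures:
        for field in structure._fields_:
            name = field[0]
            # ctypes exposes each field as a descriptor carrying its byte offset.
            if getattr(structure, name).offset == offset:
                matches.append((structure.__name__, name))
    return matches


if __name__ == "__main__":
    for struct_name, member in members_at([Buffer, Request, Timer], 0x14):
        print(f"{struct_name}.{member} lives at offset 0x14")
```

A NULL structure pointer plus a member offset gives a small faulting address, so a list like this narrows down which member the failing code was touching.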
 | may trigger other oopsen in normally fine code |
 | instruction pointer; where it went tits up; come back to this later |
 | intel documents; could be useful once the function has been decoded |
 | process information.. the stack is general purpose; smashing the kernel stack (8k ?); limited in size |
 | unreliable call trace; pretty useless, needs to be decoded; addresses that look right; return address is in the middle of a function; code is the next instructions to be executed |
 | address on its own is almost useless; 3 - 4 gig range; get the sorted symbol table (System.map or /proc/ksyms); the oops happened somewhere in a function, so the EIP lies between two addresses; derive the function and the offset into the function; can do that for all the call trace functions |
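The lookup described above (find the two symbol addresses the EIP lies between, then subtract) can be sketched as a binary search over the sorted symbol table. The symbol names and addresses in the excerpt are made up for illustration; a real run would read System.map instead.

```python
import bisect

# Hypothetical excerpt in System.map layout: "<hex address> <type> <symbol>".
SYSTEM_MAP = """\
c0100000 T _stext
c0105000 T sys_fork
c0105400 T sys_execve
c0130000 T sys_mkdir
"""


def load_symbols(text):
    """Parse System.map-style text into a sorted list of (address, name)."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, address):
    """Return (function, offset) for the last symbol at or below address."""
    addresses = [start for start, _ in symbols]
    index = bisect.bisect_right(addresses, address) - 1
    if index < 0:
        return None  # address is below the first known symbol
    start, name = symbols[index]
    return name, address - start


if __name__ == "__main__":
    table = load_symbols(SYSTEM_MAP)
    name, offset = resolve(table, 0xC0130298)
    print(f"{name}+{offset:#x}")
```

The same `resolve` call can be repeated for every address in the call trace. Note that it cannot tell whether an address lies past the end of the last function, only which symbol precedes it.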
 | you can see: tar, mkdir, reiserfs filesystem (no I'm not having a go); note the mov instruction, we'll come back to that |
x | xbase 10! this decoded information is useful to developers; you should be able to get to this point easily |
x | debugging isn't a recipe you can just follow; there is more than one way to do it; need to adapt (use printk etc.); understand what is going on; in this case we have an oops dump, so we'll concentrate on it; how you might use this information; don't worry if you get lost in this next bit; over quite quickly |
 | find function name in source; grep or cscope; disassemble the .o file (or the kernel); match offset to get where it failed |
 | binutils, every system has it; offset was 0x298 |
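Matching the offset against the disassembly can be done by eye, but as a rough sketch it amounts to: find the function's start address in the `objdump -d` listing, add the offset, and pick the instruction at that address. The listing below is a small hypothetical function, not the code from the oops, and the parser assumes the usual tab-separated `address: bytes instruction` layout.

```python
import re

# Hypothetical objdump -d style listing, invented for illustration.
LISTING = """\
00000100 <example_lookup>:
 100:\t55                   \tpush   %ebp
 101:\t89 e5                \tmov    %esp,%ebp
 103:\t8b 45 08             \tmov    0x8(%ebp),%eax
 106:\t8b 40 14             \tmov    0x14(%eax),%eax
 109:\tc9                   \tleave
 10a:\tc3                   \tret
"""

HEADER = re.compile(r"^([0-9a-f]+) <(\w+)>:$")


def find_instruction(listing, function, offset):
    """Return the instruction text at function+offset, or None if not found."""
    target = None
    for line in listing.splitlines():
        header = HEADER.match(line)
        if header:
            # Entering a new function: compute the target only for the one we want.
            if header.group(2) == function:
                target = int(header.group(1), 16) + offset
            else:
                target = None
            continue
        if target is None:
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # blank line or byte-continuation line
        address = int(fields[0].strip().rstrip(":"), 16)
        if address == target:
            return " ".join(fields[2].split())
    return None


if __name__ == "__main__":
    print(find_instruction(LISTING, "example_lookup", 0x6))
```

In this made-up listing the instruction at offset 6 is a `mov` through a structure pointer, the same shape of failure as a NULL pointer plus member offset.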
 | tricky! what it does; how it branches; more than one way to do it! |
 | back to the objdump output; failed on the move; about to do a call; the function called had four arguments (push) |
 | from: looking at the branching; counting function calls; bloody obvious comment |
  | include -g in CFLAGS; code before preprocessing; shows you exactly where it failed |
| |
 | it's loading an address into a register before calling it; you have that address at compile time; strange optimisation?? |
 | tada! macros are a pain to debug; could run the source through the C preprocessor (-E); coding style can make things worse: one line to 16k |
| will stop there; a message from our sponsor... Linus is notorious for being against in-kernel debuggers
- to properly fix problems you need to
- understand the written code
- and what the code is supposed to be doing
|
 | can relax a bit now; this is a lot easier |
 | lockups, hangs; in general lockups are caused by waiting for something that isn't going to happen or isn't going to be released; can lock up one or more processors on a system; like capturing oopsen, the goal is to find out where the code locked up; when you have the location you can answer why |
 | bad hardware (like memory); bad driver; bad microcode; ....bad luck? easier to debug when it's quickly reproducible; the print EIP patch can be used to debug the spontaneous reboot type bugs |
 | software problems; test with keyboard lights; hardware hackers can make an NMI board from old ISA cards; find out where it's gone with magic SysRq |
 | P show regs / EIP; very useful general purpose tool; slightly dangerous so it's turned off by default on most systems; /etc/sysctl.conf on red hat |
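Whether magic SysRq will be enabled at boot can be checked by looking for the `kernel.sysrq` key in the sysctl configuration. The sketch below parses sysctl.conf-style text; the sample contents are hypothetical, and on a real system the text would come from /etc/sysctl.conf.

```python
# Hypothetical file contents, invented for illustration.
SAMPLE = """\
# Hypothetical /etc/sysctl.conf contents
net.ipv4.ip_forward = 0
kernel.sysrq = 1
"""


def sysrq_setting(text):
    """Return the integer value of kernel.sysrq, or None if it is not set."""
    value = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#;":
            continue  # blank line or comment
        key, separator, raw = line.partition("=")
        if separator and key.strip() == "kernel.sysrq":
            value = int(raw.strip())  # a later entry overrides an earlier one
    return value


if __name__ == "__main__":
    setting = sysrq_setting(SAMPLE)
    if setting is None:
        print("kernel.sysrq is not set in this file")
    else:
        print(f"kernel.sysrq = {setting}")
```

A value of 0 means the key combination is disabled; a missing entry leaves the kernel's built-in default in place.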
 | you'll get an interrupt; a-b lock inversion deadlocks on MP systems |
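An a-b lock inversion means one code path takes lock A then lock B while another takes B then A; on an MP system each processor can end up holding one lock and waiting forever for the other. The toy checker below is not kernel code and not a real kernel facility, just a sketch of the idea: it remembers every "held, then acquired" pair and reports when the reverse pair shows up. The class and lock names are hypothetical.

```python
class LockOrderChecker:
    """Record lock acquisition order and report inverted pairs."""

    def __init__(self):
        # Set of (already held, newly acquired) lock name pairs seen so far.
        self.seen = set()

    def acquire(self, held, new):
        """Note that `new` is taken while `held` locks are owned.

        Returns the list of (held, new) pairs that invert an earlier order.
        """
        inversions = [(lock, new) for lock in held if (new, lock) in self.seen]
        for lock in held:
            self.seen.add((lock, new))
        return inversions


if __name__ == "__main__":
    checker = LockOrderChecker()

    # Code path one: takes A, then B while still holding A.
    checker.acquire([], "A")
    checker.acquire(["A"], "B")

    # Code path two: takes B, then A while still holding B.
    checker.acquire([], "B")
    for held, new in checker.acquire(["B"], "A"):
        print(f"inversion: {new} taken while holding {held}, "
              f"but {held} was earlier taken while holding {new}")
```

The usual cure is a fixed ordering: every path that needs both locks takes them in the same order.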
| NMI watchdog producing stack dumps; print EIP patch is applicable here; X lockups are common and difficult to debug; on console look like lockups with IRQ disabled; check over the network, eject pccards on notebooks, music carries on playing; can use the SysRq key to take the keyboard out of raw mode and change VTs |
 | NONONO! read the REPORTING-BUGS file
- oops data
- kernel version
- patches applied
- kernel config file
- hardware
- MOST IMPORTANTLY WHAT YOU WERE DOING AT THE TIME!
|
  | type; gather; process; what to do with it; debugging basics; hopefully closer to developing code; or realise what's going wrong faster; go further? understand the kernel; learn basic languages (C and ASM); experiment |
|