Commit | Line | Data |
---|---|---|
43019a56 IM |
1 | Table of contents |
2 | ================= | |
3 | ||
4 | Last updated: 20 December 2005 | |
5 | ||
6 | Contents | |
7 | ======== | |
8 | ||
9 | - Introduction | |
10 | - Devices not appearing | |
11 | - Finding patch that caused a bug | |
12 | -- Finding using git-bisect | |
13 | -- Finding it the old way | |
14 | - Fixing the bug | |
15 | ||
16 | Introduction | |
17 | ============ | |
18 | ||
19 | Always try the latest kernel from kernel.org and build from source. If you are | |
20 | not confident in doing that please report the bug to your distribution vendor | |
21 | instead of to a kernel developer. | |
22 | ||
23 | Finding bugs is not always easy. Have a go though. If you can't find it don't | |
24 | give up. Report as much as you have found to the relevant maintainer. See | |
25 | MAINTAINERS for who that is for the subsystem you have worked on. | |
26 | ||
27 | Before you submit a bug report read REPORTING-BUGS. | |
28 | ||
29 | Devices not appearing | |
30 | ===================== | |
31 | ||
32 | Often this is caused by udev. Check that first before blaming it on the | |
33 | kernel. | |
34 | ||
35 | Finding patch that caused a bug | |
36 | =============================== | |
37 | ||
38 | ||
39 | ||
40 | Finding using git-bisect | |
41 | ------------------------ | |
42 | ||
43 | Using the provided tools with git makes finding bugs easy provided the bug is | |
44 | reproducible. | |
45 | ||
46 | Steps to do it: | |
47 | - start using git for the kernel source | |
48 | - read the man page for git-bisect | |
49 | - have fun | |
50 | ||
51 | Finding it the old way | |
52 | ---------------------- | |
53 | ||
1da177e4 LT |
54 | [Sat Mar 2 10:32:33 PST 1996 KERNEL_BUG-HOWTO lm@sgi.com (Larry McVoy)] |
55 | ||
d81919c9 | 56 | This is how to track down a bug if you know nothing about kernel hacking. |
1da177e4 LT |
57 | It's a brute force approach but it works pretty well. |
58 | ||
59 | You need: | |
60 | ||
61 | . A reproducible bug - it has to happen predictably (sorry) | |
62 | . All the kernel tar files from a revision that worked to the | |
63 | revision that doesn't | |
64 | ||
65 | You will then do: | |
66 | ||
67 | . Rebuild a revision that you believe works, install, and verify that. | |
68 | . Do a binary search over the kernels to figure out which one | |
d81919c9 | 69 | introduced the bug. I.e., suppose 1.3.28 didn't have the bug, but |
1da177e4 LT |
70 | you know that 1.3.69 does. Pick a kernel in the middle and build |
71 | that, like 1.3.50. Build & test; if it works, pick the mid point | |
72 | between .50 and .69, else the mid point between .28 and .50. | |
73 | . You'll narrow it down to the kernel that introduced the bug. You | |
d81919c9 | 74 | can probably do better than this but it gets tricky. |
1da177e4 LT |
75 | |
76 | . Narrow it down to a subdirectory | |
77 | ||
78 | - Copy the kernel that works into "test". Let's say that 1.3.62 works, | |
79 | but 1.3.63 doesn't. So you diff -r those two kernels and come | |
80 | up with a list of directories that changed. For each of those | |
81 | directories: | |
82 | ||
83 | Copy the non-working directory next to the working directory | |
d81919c9 | 84 | as "dir.63". |
1da177e4 | 85 | One directory at a time, try moving the working directory to |
d81919c9 | 86 | "dir.62" and moving "dir.63" into its place: |
1da177e4 LT |
87 | |
88 | mv dir dir.62 | |
89 | mv dir.63 dir | |
90 | find dir -name '*.[oa]' -print | xargs rm -f | |
91 | ||
92 | And then rebuild and retest. Assuming that all related | |
d81919c9 CK |
93 | changes were contained in the sub directory, this should |
94 | isolate the change to a directory. | |
1da177e4 LT |
95 | |
96 | Problems: changes in header files may have occurred; I've | |
d81919c9 | 97 | found in my case that they were self explanatory - you may |
1da177e4 LT |
98 | or may not want to give up when that happens. |
99 | ||
100 | . Narrow it down to a file | |
101 | ||
102 | - You can apply the same technique to each file in the directory, | |
d81919c9 CK |
103 | hoping that the changes in that file are self contained. |
104 | ||
1da177e4 LT |
105 | . Narrow it down to a routine |
106 | ||
107 | - You can take the old file and the new file and manually create | |
108 | a merged file that has | |
109 | ||
110 | #ifdef VER62 | |
111 | routine() | |
112 | { | |
113 | ... | |
114 | } | |
115 | #else | |
116 | routine() | |
117 | { | |
118 | ... | |
119 | } | |
120 | #endif | |
121 | ||
122 | And then walk through that file, one routine at a time and | |
123 | prefix it with | |
124 | ||
125 | #define VER62 | |
126 | /* both routines here */ | |
127 | #undef VER62 | |
128 | ||
129 | Then recompile, retest, move the ifdefs until you find the one | |
130 | that makes the difference. | |
131 | ||
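The binary search over kernel revisions described above can be sketched as
follows. This is only an illustration: first_bad() and is_good() are
hypothetical names, and is_good() stands in for the manual work of building,
installing and testing one kernel. The pretend bug location (1.3.63) is made
up for the example.

```python
def first_bad(versions, is_good):
    """Return the first entry of `versions` for which is_good() is False.

    Assumes versions[0] is known to work, versions[-1] is known to be
    broken, and every revision after the first broken one is also broken.
    """
    lo, hi = 0, len(versions) - 1      # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(versions[mid]):
            lo = mid                   # bug was introduced after mid
        else:
            hi = mid                   # bug was introduced at or before mid
    return versions[hi]


# 1.3.28 works and 1.3.69 does not; pretend the bug arrived in 1.3.63.
versions = ["1.3.%d" % n for n in range(28, 70)]
tested = []


def is_good(version):
    # Hypothetical stand-in for "build, install, boot and test this kernel".
    tested.append(version)
    return int(version.split(".")[2]) < 63


culprit = first_bad(versions, is_good)
print(culprit, "found after", len(tested), "builds")
```

Each test halves the remaining range, so the 41 revisions between 1.3.28 and
1.3.69 need only a handful of builds rather than one per revision.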
132 | Finally, you take all the info that you have, kernel revisions, bug | |
d81919c9 | 133 | description, the extent to which you have narrowed it down, and pass |
1da177e4 LT |
134 | that off to whomever you believe is the maintainer of that section. |
135 | A post to linux.dev.kernel isn't such a bad idea if you've done some | |
136 | work to narrow it down. | |
137 | ||
138 | If you get it down to a routine, you'll probably get a fix in 24 hours. | |
139 | ||
140 | My apologies to Linus and the other kernel hackers for describing this | |
141 | brute force approach, it's hardly what a kernel hacker would do. However, | |
142 | it does work and it lets non-hackers help fix bugs. And it is cool | |
143 | because Linux snapshots will let you do this - something that you can't | |
144 | do with vendor supplied releases. | |
145 | ||
43019a56 IM |
146 | Fixing the bug |
147 | ============== | |
148 | ||
149 | Nobody is going to tell you how to fix bugs. Seriously. You need to work it | |
150 | out. But below are some hints on how to use the tools. | |
151 | ||
152 | To debug a kernel, use objdump and look for the hex offset from the crash | |
153 | output to find the relevant line of code/assembler. Without debug symbols, you | |
154 | will see the assembler code for the routine shown, but if your kernel has | |
155 | debug symbols the C code will also be available. (Debug symbols can be enabled | |
156 | in the kernel hacking menu of the menu configuration.) For example: | |
157 | ||
158 | objdump -r -S -l --disassemble net/dccp/ipv4.o | |
159 | ||
160 | NB.: you need to be at the top level of the kernel tree for this to pick up | |
161 | your C files. | |
162 | ||
163 | If you don't have access to the code you can also debug from the crash dump | |
164 | itself, e.g. the following crash dump output as shown by Dave Miller: | |
165 | ||
166 | > EIP is at ip_queue_xmit+0x14/0x4c0 | |
167 | > ... | |
168 | > Code: 44 24 04 e8 6f 05 00 00 e9 e8 fe ff ff 8d 76 00 8d bc 27 00 00 | |
169 | > 00 00 55 57 56 53 81 ec bc 00 00 00 8b ac 24 d0 00 00 00 8b 5d 08 | |
170 | > <8b> 83 3c 01 00 00 89 44 24 14 8b 45 28 85 c0 89 44 24 18 0f 85 | |
171 | > | |
172 | > Put the bytes into a "foo.s" file like this: | |
173 | > | |
174 | > .text | |
175 | > .globl foo | |
176 | > foo: | |
177 | > .byte .... /* bytes from Code: part of OOPS dump */ | |
178 | > | |
179 | > Compile it with "gcc -c -o foo.o foo.s" then look at the output of | |
180 | > "objdump --disassemble foo.o". | |
181 | > | |
182 | > Output: | |
183 | > | |
184 | > ip_queue_xmit: | |
185 | > push %ebp | |
186 | > push %edi | |
187 | > push %esi | |
188 | > push %ebx | |
189 | > sub $0xbc, %esp | |
190 | > mov 0xd0(%esp), %ebp ! %ebp = arg0 (skb) | |
191 | > mov 0x8(%ebp), %ebx ! %ebx = skb->sk | |
192 | > mov 0x13c(%ebx), %eax ! %eax = inet_sk(sk)->opt | |
193 | ||
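Typing the bytes into "foo.s" by hand is error prone, so the step above can
be scripted. The following is a minimal sketch, not an existing kernel tool:
parse_code() and to_asm() are hypothetical helper names, and the short byte
string is just the tail of the Code: line quoted above.

```python
import re


def parse_code(code):
    """Split an oops "Code:" line into hex bytes.

    Returns (hex_bytes, fault), where fault is the index of the byte the
    oops marked with <..>, or None if no byte was marked.
    """
    hex_bytes, fault = [], None
    pattern = r"<([0-9a-fA-F]{2})>|\b([0-9a-fA-F]{2})\b"
    for m in re.finditer(pattern, code.replace("Code:", " ")):
        if m.group(1):
            fault = len(hex_bytes)     # remember where the marked byte is
        hex_bytes.append(m.group(1) or m.group(2))
    return hex_bytes, fault


def to_asm(hex_bytes, symbol="foo"):
    """Build the text of an assembler file with one .byte line per 8 bytes."""
    lines = [".text", ".globl %s" % symbol, "%s:" % symbol]
    for i in range(0, len(hex_bytes), 8):
        chunk = ", ".join("0x" + b for b in hex_bytes[i:i + 8])
        lines.append("\t.byte " + chunk)
    return "\n".join(lines) + "\n"


code = "Code: 8b 5d 08 <8b> 83 3c 01 00 00"
hex_bytes, fault = parse_code(code)
asm = to_asm(hex_bytes)
print(asm)
```

Write the returned text to "foo.s" and continue with gcc and objdump as
described. In the dump above the marked byte is at ip_queue_xmit+0x14, so
the function itself starts 0x14 bytes before it; the bytes earlier than that
belong to whatever precedes the function and can be dropped from hex_bytes
before calling to_asm().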
926b2898 PE |
194 | In addition, you can use GDB to figure out the exact file and line |
195 | number of the OOPS from the vmlinux file. If you have | |
196 | CONFIG_DEBUG_INFO enabled, you can simply copy the EIP value from the | |
197 | OOPS: | |
198 | ||
199 | EIP: 0060:[<c021e50e>] Not tainted VLI | |
200 | ||
201 | And use GDB to translate that to human-readable form: | |
202 | ||
203 | gdb vmlinux | |
204 | (gdb) l *0xc021e50e | |
205 | ||
206 | If you don't have CONFIG_DEBUG_INFO enabled, you use the function | |
207 | offset from the OOPS: | |
208 | ||
209 | EIP is at vt_ioctl+0xda8/0x1482 | |
210 | ||
211 | And recompile the kernel with CONFIG_DEBUG_INFO enabled: | |
212 | ||
213 | make vmlinux | |
214 | gdb vmlinux | |
215 | (gdb) p vt_ioctl | |
216 | (gdb) l *(0x<address of vt_ioctl> + 0xda8) | |
dcc85cb6 RK |
217 | or, as one command |
218 | (gdb) l *(vt_ioctl + 0xda8) | |
219 | ||
220 | If you have a call trace, such as: | |
221 | >Call Trace: | |
222 | > [<ffffffff8802c8e9>] :jbd:log_wait_commit+0xa3/0xf5 | |
223 | > [<ffffffff810482d9>] autoremove_wake_function+0x0/0x2e | |
224 | > [<ffffffff8802770b>] :jbd:journal_stop+0x1be/0x1ee | |
225 | > ... | |
226 | this shows the problem in the :jbd: module. You can load that module in gdb | |
227 | and list the relevant code. | |
228 | gdb fs/jbd/jbd.ko | |
229 | (gdb) p log_wait_commit | |
230 | (gdb) l *(0x<address> + 0xa3) | |
231 | or | |
232 | (gdb) l *(log_wait_commit + 0xa3) | |
233 | ||
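Turning "symbol+offset/size" entries into gdb commands is mechanical. The
sketch below shows one way to do it; gdb_list_command() is a hypothetical
helper, and the regular expression only assumes the line formats quoted
above (symbols containing characters other than letters, digits and "_" are
not handled).

```python
import re

# Matches e.g. ":jbd:log_wait_commit+0xa3/0xf5" or "vt_ioctl+0xda8/0x1482".
SYMBOL_RE = re.compile(
    r"(?::(?P<module>\w+):)?"                # optional :module: prefix
    r"(?P<symbol>\w+)"
    r"\+(?P<offset>0x[0-9a-fA-F]+)"
    r"/(?P<size>0x[0-9a-fA-F]+)"
)


def gdb_list_command(line):
    """Return (module, gdb command) for an oops line, or None if no match.

    module is None when the symbol has no :module: prefix, i.e. when the
    object to load into gdb is vmlinux rather than a .ko file.
    """
    m = SYMBOL_RE.search(line)
    if m is None:
        return None
    command = "l *(%s + %s)" % (m.group("symbol"), m.group("offset"))
    return m.group("module"), command


for line in (
    "EIP is at vt_ioctl+0xda8/0x1482",
    " [<ffffffff8802c8e9>] :jbd:log_wait_commit+0xa3/0xf5",
):
    print(gdb_list_command(line))
```

The command string can then be typed at the (gdb) prompt after loading
vmlinux or the module's .ko file, as in the examples above.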
926b2898 | 234 | |
43019a56 IM |
235 | Another very useful option of the Kernel Hacking section in menuconfig is |
236 | Debug memory allocations. This will help you see whether data has been | |
237 | used before being initialised etc. To see the values that get assigned | |
238 | with this look at mm/slab.c and search for POISON_INUSE. When using this an | |
239 | Oops will often show the poisoned data instead of zero which is the default. | |
240 | ||
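Spotting poisoned data in an Oops can be sketched as below. The byte values
are assumptions taken from the kernel's poison definitions (0x5a for
POISON_INUSE, 0x6b for POISON_FREE); check them against your own tree
before relying on them. poison_kind() is a hypothetical helper.

```python
# Assumed poison bytes; verify against the definitions in your kernel tree.
POISON_BYTES = {
    0x5a: "POISON_INUSE (allocated but not yet initialised)",
    0x6b: "POISON_FREE (already freed)",
}


def poison_kind(value, width=4):
    """Describe `value` if all `width` bytes are the same poison byte.

    Returns None for anything else, including ordinary values such as 0.
    """
    raw = value.to_bytes(width, "little")
    if len(set(raw)) == 1:
        return POISON_BYTES.get(raw[0])
    return None


for reg in (0x6b6b6b6b, 0x5a5a5a5a, 0x00000000):
    print(hex(reg), poison_kind(reg))
```

Note that a faulting address is often a poisoned pointer plus a structure
offset, so the value in the Oops may differ from the pure pattern in its
low bytes; this sketch only recognises the exact pattern.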
241 | Once you have worked out a fix, please submit it upstream. After all, open | |
242 | source is about sharing what you do, and don't you want to be recognised for | |
243 | your genius? | |
244 | ||
245 | Please do read Documentation/SubmittingPatches though to help your code get | |
246 | accepted. |