Segmentation faults¶
In principle, pure Clean programs should not segfault, because the language abstracts away from memory management. When a segmentation fault occurs it is usually caused by corruption of the heap. This in turn can be due to several causes:
- Incorrect use of the foreign function interface (
ccall
andcentry
). - Incorrect use of inline ABC code (
code { ... }
blocks). - A bug in GraphCopy, if you use its deserialization functionality.
- A bug in the compiler, code generator, or run time system.
This page contains some suggestions to track down segmentation faults.
Before following the suggestions on this page, read the preliminaries to get as much debug information in your executable as possible.
Getting a stack trace¶
To get a stack trace, run the program in gdb
. Because Clean does not use any
standard calling convention, the bt
command does not work. Once you have hit
the segfault (use the r
command), use j write_profile_stack
to print the Clean stack:
(gdb) j write_profile_stack
Continuing at 0x40f856.
Stack trace:
<case>[line:10];9;2 [module: test]
Start [module: test]
start [module: System]
(Afterwards, the program will attempt to continue running, but the result is of no concern to us here; we have our stack trace.)
Be aware that that due to tail calls, not all functions may appear in the stack
trace. The example above shows that <case>[line:10];9;2
was called from
Start
. However, there may have been a number of tail calls in between. For
example, Start
may have called f
, which did a tail call to g
, which did a
tail call to <case>[line:10];9;2
.
Example¶
Consider the following example:
module segfault
import StdEnv
Start = {{#10,20,30} & [-1]=0}
(You need to compile this without checking for stack indices; the default in
clm
.)
Recall the node layout of arrays. The elements
are at offset 3. At offset 1 we have the length; at offset 2 we have the
element descriptor. The element descriptor is 0 for boxed arrays, or points to
a descriptor for unboxed arrays. We here create an unboxed array of type Int
,
which thus looks like this in memory:
0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
_ARRAY_ |
3 | INT |
10 | 20 | 30 |
By overwriting element -1 with 0 we are effectively overwriting the element descriptor, turning this array into a boxed array:
0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
_ARRAY_ |
3 | NULL |
10 | 20 | 30 |
When the array is printed, the run time system sees a boxed array and tries to
follow the elements as pointers. Because 10
does not point to a node in the
heap, the program crashes.
We compile this program with clm -tst -desc -exl -ns segfault -o segfault
to
include as much debug information as possible. We then run the program in
gdb
:
$ gdb -q ./segfault -ex r
Reading symbols from ./segfault...(no debugging symbols found)...done.
Starting program: /tmp/segfault
Program received signal SIGSEGV, Segmentation fault.
0x0000000000415ec3 in __print__graph ()
Use layout prev
to see the value of both the registers and the current code
being executed:
┌──Register group: general─────────────────────────────────────────────────────┐
│rax 0x0 0 rbx 0x3 3 │
│rcx 0xa 10 rdx 0x7ffff7e178c0 140737352136896 │
│rsi 0x7ffff79d4008 140737347665928 rdi 0x7ffff7a56058 140737348198488 │
│rbp 0x7fffffffdaa0 0x7fffffffdaa0 rsp 0x7fffffffda98 0x7fffffffda98 │
│r8 0x3 3 r9 0x77 119 │
│r10 0x46e010 4644880 r11 0x410 1040 │
│r12 0x7fffffffdc18 140737488346136 r13 0x7ffff79d4008 140737347665928 │
│r14 0x7ffff7a56058 140737348198488 r15 0x31f7 12791 │
│... ... ... ... ... ... │
└──────────────────────────────────────────────────────────────────────────────┘
│0x415ea8 <__print__graph> callq 0x415ebc <__print__graph+20>
│0x415ead <__print__graph+5> lea 0x42f6c(%rip),%rax # 0x458e20
│0x415eb4 <__print__graph+12> callq 0x4018e0 <print_sc>
│0x415eb9 <__print__graph+17> retq
│0x415eba <__print__graph+18> xchg %ax,%ax
│0x415ebc <__print__graph+20> mov $0x0,%rax
>│0x415ec3 <__print__graph+27> testb $0x2,(%rcx)
│0x415ec6 <__print__graph+30> jne 0x415ed0 <__print__graph+40>
│0x415ecc <__print__graph+36> push %rax
│0x415ecd <__print__graph+37> callq *(%rcx)
_print_graph
is defined in _system.abc
as:
.o 1 0
_print_graph
.d 1 0
jsr _print
.o 0 0
print_sc "\n"
.d 0 0
rtn
.o 1 0
_print
pushI 0 | push the bracket count
_continue_print
jsr_eval 0
| ...
Mapping this to the disassembly from gdb
we can see that we are in the
jsr_eval
instruction in _continue_print
(see the register
mapping to see that rax
is the top of the B stack and
rcx
is the top of the A stack). The current instruction, testb $0x2 (%rcx)
,
checks whether the node on top of the A stack is in head normal form by
checking whether its descriptor has its second least significant bit
set. However, rcx
is 10
, so the
dereferencing of the descriptor causes the segmentation fault. (Again, this is
because 10
is treated as a pointer now that the array is expected to be
boxed.)
To solve the issue we must understand why 10
ends up in the top of the A
stack, which should contain a pointer. To do so we need to figure out from
where we have entered _continue_print
. One method is to find all places where
_print
or _continue_print
is referenced in _system.abc
and export a
nearby label. This means exporting _print_args
, _print_rest_list
,
_no_comma_0
, and _print_rest_tuple
. We then need to place breakpoints on
these labels in gdb
to find where we are coming from:
$ gdb -q ./segfault -ex 'b __print__args' -ex 'b __print__rest__list' -ex 'b __no__comma__0' -ex 'b __print__rest__tuple' -ex r
Reading symbols from ./segfault...(no debugging symbols found)...done.
Breakpoint 1 at 0x4161c8
Breakpoint 2 at 0x416279
Breakpoint 3 at 0x41667c
Breakpoint 4 at 0x416ce7
Starting program: /tmp/segfault
Breakpoint 3, 0x000000000041667c in ::__no__comma(void) ()
(gdb) c
Continuing.
Program received signal SIGSEGV, Segmentation fault.
0x0000000000416127 in __print__graph ()
Clearly we are coming from _no_comma_0
. Reading through _system.abc
it
becomes clear that this routine is for the first element of boxed arrays.
There are now two options: either the array is indeed a boxed array, but the element got corrupted, or the array is unboxed, and the element descriptor got corrupted. To find out which, we need to know the addresses of both the element descriptor and the first element, and set up a watchpoint to be notified when they change.
(A third option is that the node is not an array at all, and that its
descriptor got corrupted with the value __ARRAY__+2
. This would seem a priori
less likely: for the element descriptor or the element to get corrupted can
happen with any value, but it would have to be exactly __ARRAY__+2
in the
descriptor for the node to be erroneously interpreted as an array—any other
value would cause a crash at an earlier stage.)
To find the relevant addresses we inspect the state at this breakpoint (r
;
layout prev
):
┌──Register group: general─────────────────────────────────────────────────────┐
│rax 0x0 0 rbx 0x3 3 │
│rcx 0x7ffff7a56028 140737348198440 rdx 0x7ffff7e178c0 140737352136896 │
│rsi 0x7ffff79d4000 140737347665920 rdi 0x7ffff7a56058 140737348198488 │
│rbp 0x7fffffffdaa0 0x7fffffffdaa0 rsp 0x7fffffffdab0 0x7fffffffdab0 │
│r8 0x3 3 r9 0x77 119 │
│r10 0x46f010 4648976 r11 0x410 1040 │
│r12 0x7fffffffdc18 140737488346136 r13 0x7ffff79d4008 140737347665928 │
│r14 0x7ffff7a56058 140737348198488 r15 0x31f7 12791 │
│... ... ... ... ... ... │
└──────────────────────────────────────────────────────────────────────────────┘
B+>│0x41667c <__no__comma__0> lea -0x10(%rsp),%rsp
│0x416681 <__no__comma__0+5> mov %rbx,(%rsp)
│0x416685 <__no__comma__0+9> mov %rax,0x8(%rsp)
│0x41668a <__no__comma__0+14> mov %rcx,(%rsi)
│0x41668d <__no__comma__0+17> mov 0x18(%rcx,%rax,8),%rcx
│0x416692 <__no__comma__0+22> add $0x8,%rsi
│0x416696 <__no__comma__0+26> callq 0x416120 <__print__graph+20>
From the code of _no_comma_0
it is clear that the array is in the top of the
A stack, i.e., rcx
:
_print_array_a
pop_b 1
pushI 0
push_a 0
push_arraysize _ 0 1
jmp _print_array_lp2
.o 1 2 i i
_print_array_lp1
eqI_b 0 1
jmp_true _no_comma_0
print_sc ","
_no_comma_0
push_b 1
push_a 0
select _ 0 1
.d 1 0
jsr _print
The instruction mov 0x18(%rcx,%rax,8),%rcx
, in which rcx
gets set to 10
,
is the equivalent of select _ 0 1
; it selects the element of the array (in
rcx
) pointed to by the top of the B stack (rax
). The offset 0x18
is
because array elements start 3 words into the array.
In other words, at this point, rcx
contains a pointer to the array itself.
The element descriptor is then at rcx+0x10
and the first element at
rcx+0x18
. We can set up watchpoints for this and rerun the program:
(gdb) watch *(int64_t*)0x7ffff7a56038
Hardware watchpoint 5: *(int64_t*)0x7ffff7a56038
(gdb) watch *(int64_t*)0x7ffff7a56040
Hardware watchpoint 6: *(int64_t*)0x7ffff7a56040
(gdb) r
We are now notified whenever the element descriptor (in ..38
) or the first
element (in ..40
) changes:
Hardware watchpoint 5: *(int64_t*)0x7ffff7a56038
Old value = <unreadable>
New value = 4560066
0x00000000004175d9 in e__segfault__nStart_P1 ()
(gdb) c
Continuing.
Hardware watchpoint 6: *(int64_t*)0x7ffff7a56040
Old value = <unreadable>
New value = 140737348198440
0x00000000004175e0 in e__segfault__nStart_P1 ()
(gdb) c
Continuing.
Hardware watchpoint 6: *(int64_t*)0x7ffff7a56040
Old value = 140737348198440
New value = 10
0x0000000000417600 in e__segfault__nStart_P1 ()
(gdb) c
Continuing.
Hardware watchpoint 5: *(int64_t*)0x7ffff7a56038
Old value = 4560066
New value = 0
0x0000000000417610 in e__segfault__nStart_P1 ()
(gdb) c
Continuing.
Breakpoint 3, 0x000000000041667c in ::__no__comma(void) ()
(gdb) c
Continuing.
Program received signal SIGSEGV, Segmentation fault.
0x0000000000416127 in __print__graph ()
We thus find the following sequence of events:
- The element descriptor is set to 4560066. Inspecting
objdump -t segfault
, which gives the addresses of exported symbols, we find that this isINT+2
, so we have an unboxed integer array. - The first element is set to 140737348198440.
- The first element is set to 10.
- The element descriptor is set to 0, making this a boxed array.
Although it is not exactly clear what step 2 is needed for, we can ignore it. We now know that this is supposed to be an unboxed array, but that it was erroneously made a boxed array in step 4. One can now follow the tips here to figure out where exactly in the Clean code this is done.
Crashes in the garbage collector¶
In the example above the program crashed because the corrupted node in the heap was used. It is of course possible that the node, after getting corrupted, is not used any more. In that case the program might not crash despite the heap corruption. In still other cases, it may be that the corrupted node remains in the live set until garbage collection runs. In this case, the garbage collector will visit the node and likely crash.
In practice, most crashes in the garbage collector boil down to heap corruption. The garbage collector has been stable for quite some time; it is relatively unlikely for a segfault to boil down to a bug in the garbage collector. As in the example above, we must first find the node that is corrupted and then the place that it got corrupted in.
It may be that the bug occurs sooner or later with a different garbage
collector. Clean has three garbage collectors.
It is useful to try to reproduce the bug with both the copying and the marking
collector. If the bug occurs in both environments, it is highly unlikely to be
due to a bug in either collector. Furthermore, you can now save some time by
using the garbage collector that crashes the soonest to debug further. In the
same vain it is also useful to experiment with different heap sizes. Be sure to
also change the initial heap size (-gci
). Programs that crash after a minimal
amount of garbage collection iterations are easiest to debug.
Reducing to a minimal example¶
It is tricky to reduce a crashing program to a minimal example, since even the slightest change will have effects on memory consumption, and hence when the garbage collector runs and which nodes it will visit. For these crashes it is less useful to reduce to a minimal example anyway, but should you feel the need to do so, you can test the program with a variety of heap sizes. For example:
for h in $(seq 200 600); do ./a.out -h "${h}k" || echo $h; done
This will test the program with heap sizes of 200, 201, .. 600 kB and output the heap sizes for which the program fails.
Finding the corrupt node¶
It is more important to find the node on which the garbage collector crashes.
Run the program in gdb
to find the exact location of the crash. From there
you will have to read the Assembly code a bit to figure out what is going on.
We need to determine the exact word that is corrupted.
For example:
-
A node is visited that has an illegal descriptor (e.g. a null pointer). GC crashes when dereferencing the descriptor to find the node arity. We are interested in the descriptor.
-
A node is visited that has an illegal descriptor, which happens to be an accessible pointer. GC dereferences it to find the arity, which is a ridiculously large number, and attempts to collect all 1234567 pointer arguments of the node (crashing only when dereferencing the arguments). We are interested in the descriptor.
-
A node is visited with a valid descriptor, but one of its pointer arguments points to an illegal location. GC crashes when following the pointer. We are interested in the argument.
Determine the exact address of the corrupted word. Place a watchpoint on this word:
(gdb) watch *(void**) 0x...
Then, rerun the program. You are now notified of every change to this value.
Typically only the last change is relevant (all the previous ones are because
the memory is used for other nodes in previous cycles between GC iterations).
You now have the exact location where the heap became corrupted. The
example above shows how to work with gdb
watchpoints.
Garbage collector crash example¶
We can slightly modify our example above to make sure it crashes in the garbage collector:
module segfault
import StdEnv
import StdOverloadedList
Start = Reverse [!{{#10,20,30} & [-1]=0} \\ _ <- [0..1000]]
The cause of the crash is the same: we have overwritten the element descriptor
of the unboxed array with a NULL
pointer, causing it to be interpreted as a
boxed array; because 10
is not a valid pointer, the garbage collector crashes
when it tries to follow it.
However, we here let the heap fill up with a list of these arrays (Reverse
requires its entire argument to be on the heap until it can begin to build its
result). We must use a head-strict list to make sure that the arrays are
evaluated while building the list. With a lazy list, the evaluation would be
postponed until the moment of printing, and the garbage collector would see
only thunks that would rewrite to these arrays.
Running this with a small heap shows that it crashes in the garbage collector:
$ clm -tst -desc -exl -ns segfault -o segfault
$ ./segfault -h 20k -gc -st
A stack: 24 bytes. BC stack: 168 bytes.
Segmentation fault (core dumped)
Because -st
prints stack sizes before garbage collection and -gc
prints
heap size after garbage collection, the fact that the crash occurs after the
stack sizes have been print but before the heap size has been print shows that
the crash occurs in the garbage collector. We can verify this with gdb
:
$ gdb -q ./segfault -ex 'r -h 20k'
Reading symbols from ./segfault...(no debugging symbols found)...done.
Starting program: /tmp/segfault -h 20k
Program received signal SIGSEGV, Segmentation fault.
0x0000000000402451 in continue_after_selector_2 ()
continue_after_selector_2
is defined in the copying collector:
continue_after_selector_2:
mov rcx,qword ptr [rdx]
test cl,2
je not_in_hnf_2
in_hnf_2:
Reading the code of the garbage collector, with its
design and the layout of
nodes in mind, we can see that rdx
points to a node to be visited and copied to the other semi-space. rcx
is
then the descriptor, and is tested for its second least significant bit to
determine whether it is a head normal form (hnf
) or not. The segmentation
fault occurs because the descriptor cannot be fetched because rdx
is not a
valid pointer:
(gdb) p $rdx
$1 = 10
This brings us back to the same point as in the example above: we
need to figure out why 10
is being treated as a reference to a node. To do
so, we need to place tactical breakpoints in the garbage collector to determine
the control flow up to continue_after_selector_2
.
If you do this you will find that we enter continue_after_selector_2
from
copy_lp2
right above, and that the value of 10 in rdx
is loaded from rbp
.
We can thus place a watchpoint on the value of rbp
at the time of the crash:
Program received signal SIGSEGV, Segmentation fault.
0x0000000000402451 in continue_after_selector_2 ()
(gdb) p/x $rbp
$1 = 0x4b32d8
(gdb) watch *(int64_t*)0x4b32d8
Hardware watchpoint 1: *(int64_t*)0x4b32d8
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /tmp/segfault -h 20k
Hardware watchpoint 1: *(int64_t*)0x4b32d8
Old value = <unreadable>
New value = 10
0x0000000000402b98 in cp_s_arg_lp2 ()
(gdb) c
Continuing.
Program received signal SIGSEGV, Segmentation fault.
0x0000000000402451 in continue_after_selector_2 ()
The value is loaded in cp_s_arg_lp2
. By reading the code for this routine we
find that it is used for strings and arrays. We can place breakpoints on the
relevant routines to find that we are coming from copy_array_a2
, which is
called by copy_array_2
in the case of a boxed array.
From here, debugging proceeds as in the example above: either the array is indeed boxed and the element got corrupted, or the array is unboxed and the element descriptor got corrupted. We can place watchpoints on both memory addresses and rerun the program to find out which.