Segmentation faults

In principle, pure Clean programs should not segfault, because the language abstracts away from memory management. When a segmentation fault occurs it is usually caused by corruption of the heap. This in turn can be due to several causes:

  • Incorrect use of the foreign function interface (ccall and centry).
  • Incorrect use of inline ABC code (code { ... } blocks).
  • A bug in GraphCopy, if you use its deserialization functionality.
  • A bug in the compiler, code generator, or run time system.

This page contains some suggestions to track down segmentation faults.

Before following the suggestions on this page, read the preliminaries to get as much debug information in your executable as possible.

Getting a stack trace

To get a stack trace, run the program in gdb. Because Clean does not use any standard calling convention, the bt command does not work. Once you have hit the segfault (use the r command), use j write_profile_stack to print the Clean stack:

(gdb) j write_profile_stack
Continuing at 0x40f856.
Stack trace:
<case>[line:10];9;2 [module: test]
Start [module: test]
start [module: System]

(Afterwards, the program will attempt to continue running, but the result is of no concern to us here; we have our stack trace.)

Be aware that that due to tail calls, not all functions may appear in the stack trace. The example above shows that <case>[line:10];9;2 was called from Start. However, there may have been a number of tail calls in between. For example, Start may have called f, which did a tail call to g, which did a tail call to <case>[line:10];9;2.

Example

Consider the following example:

module segfault

import StdEnv

Start = {{#10,20,30} & [-1]=0}

(You need to compile this without checking for stack indices; the default in clm.)

Recall the node layout of arrays. The elements are at offset 3. At offset 1 we have the length; at offset 2 we have the element descriptor. The element descriptor is 0 for boxed arrays, or points to a descriptor for unboxed arrays. We here create an unboxed array of type Int, which thus looks like this in memory:

0 1 2 3 4 5
_ARRAY_ 3 INT 10 20 30

By overwriting element -1 with 0 we are effectively overwriting the element descriptor, turning this array into a boxed array:

0 1 2 3 4 5
_ARRAY_ 3 NULL 10 20 30

When the array is printed, the run time system sees a boxed array and tries to follow the elements as pointers. Because 10 does not point to a node in the heap, the program crashes.

We compile this program with clm -tst -desc -exl -ns segfault -o segfault to include as much debug information as possible. We then run the program in gdb:

$ gdb -q ./segfault -ex r
Reading symbols from ./segfault...(no debugging symbols found)...done.
Starting program: /tmp/segfault

Program received signal SIGSEGV, Segmentation fault.
0x0000000000415ec3 in __print__graph ()

Use layout prev to see the value of both the registers and the current code being executed:

┌──Register group: general─────────────────────────────────────────────────────┐
│rax  0x0             0                  rbx  0x3             3                │
│rcx  0xa             10                 rdx  0x7ffff7e178c0  140737352136896  │
│rsi  0x7ffff79d4008  140737347665928    rdi  0x7ffff7a56058  140737348198488  │
│rbp  0x7fffffffdaa0  0x7fffffffdaa0     rsp  0x7fffffffda98  0x7fffffffda98   │
│r8   0x3             3                  r9   0x77            119              │
│r10  0x46e010        4644880            r11  0x410           1040             │
│r12  0x7fffffffdc18  140737488346136    r13  0x7ffff79d4008  140737347665928  │
│r14  0x7ffff7a56058  140737348198488    r15  0x31f7          12791            │
│...  ...             ...                ...  ...             ...              │
└──────────────────────────────────────────────────────────────────────────────┘
   │0x415ea8 <__print__graph>       callq  0x415ebc <__print__graph+20>
   │0x415ead <__print__graph+5>     lea    0x42f6c(%rip),%rax        # 0x458e20
   │0x415eb4 <__print__graph+12>    callq  0x4018e0 <print_sc>
   │0x415eb9 <__print__graph+17>    retq
   │0x415eba <__print__graph+18>    xchg   %ax,%ax
   │0x415ebc <__print__graph+20>    mov    $0x0,%rax
  >│0x415ec3 <__print__graph+27>    testb  $0x2,(%rcx)
   │0x415ec6 <__print__graph+30>    jne    0x415ed0 <__print__graph+40>
   │0x415ecc <__print__graph+36>    push   %rax
   │0x415ecd <__print__graph+37>    callq  *(%rcx)

_print_graph is defined in _system.abc as:

.o 1 0
_print_graph
.d 1 0
    jsr _print
.o 0 0
    print_sc    "\n"
.d 0 0
    rtn

.o 1 0
_print
    pushI   0   | push the bracket count
_continue_print
    jsr_eval    0
    | ...

Mapping this to the disassembly from gdb we can see that we are in the jsr_eval instruction in _continue_print (see the register mapping to see that rax is the top of the B stack and rcx is the top of the A stack). The current instruction, testb $0x2 (%rcx), checks whether the node on top of the A stack is in head normal form by checking whether its descriptor has its second least significant bit set. However, rcx is 10, so the dereferencing of the descriptor causes the segmentation fault. (Again, this is because 10 is treated as a pointer now that the array is expected to be boxed.)

To solve the issue we must understand why 10 ends up in the top of the A stack, which should contain a pointer. To do so we need to figure out from where we have entered _continue_print. One method is to find all places where _print or _continue_print is referenced in _system.abc and export a nearby label. This means exporting _print_args, _print_rest_list, _no_comma_0, and _print_rest_tuple. We then need to place breakpoints on these labels in gdb to find where we are coming from:

$ gdb -q ./segfault -ex 'b __print__args' -ex 'b __print__rest__list' -ex 'b __no__comma__0' -ex 'b __print__rest__tuple' -ex r
Reading symbols from ./segfault...(no debugging symbols found)...done.
Breakpoint 1 at 0x4161c8
Breakpoint 2 at 0x416279
Breakpoint 3 at 0x41667c
Breakpoint 4 at 0x416ce7
Starting program: /tmp/segfault

Breakpoint 3, 0x000000000041667c in ::__no__comma(void) ()
(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x0000000000416127 in __print__graph ()

Clearly we are coming from _no_comma_0. Reading through _system.abc it becomes clear that this routine is for the first element of boxed arrays.

There are now two options: either the array is indeed a boxed array, but the element got corrupted, or the array is unboxed, and the element descriptor got corrupted. To find out which, we need to know the addresses of both the element descriptor and the first element, and set up a watchpoint to be notified when they change.

(A third option is that the node is not an array at all, and that its descriptor got corrupted with the value __ARRAY__+2. This would seem a priori less likely: for the element descriptor or the element to get corrupted can happen with any value, but it would have to be exactly __ARRAY__+2 in the descriptor for the node to be erroneously interpreted as an array—any other value would cause a crash at an earlier stage.)

To find the relevant addresses we inspect the state at this breakpoint (r; layout prev):

┌──Register group: general─────────────────────────────────────────────────────┐
│rax  0x0             0                  rbx  0x3             3                │
│rcx  0x7ffff7a56028  140737348198440    rdx  0x7ffff7e178c0  140737352136896  │
│rsi  0x7ffff79d4000  140737347665920    rdi  0x7ffff7a56058  140737348198488  │
│rbp  0x7fffffffdaa0  0x7fffffffdaa0     rsp  0x7fffffffdab0  0x7fffffffdab0   │
│r8   0x3             3                  r9   0x77            119              │
│r10  0x46f010        4648976            r11  0x410           1040             │
│r12  0x7fffffffdc18  140737488346136    r13  0x7ffff79d4008  140737347665928  │
│r14  0x7ffff7a56058  140737348198488    r15  0x31f7          12791            │
│...  ...             ...                ...  ...             ...              │
└──────────────────────────────────────────────────────────────────────────────┘
B+>│0x41667c <__no__comma__0>       lea    -0x10(%rsp),%rsp
   │0x416681 <__no__comma__0+5>     mov    %rbx,(%rsp)
   │0x416685 <__no__comma__0+9>     mov    %rax,0x8(%rsp)
   │0x41668a <__no__comma__0+14>    mov    %rcx,(%rsi)
   │0x41668d <__no__comma__0+17>    mov    0x18(%rcx,%rax,8),%rcx
   │0x416692 <__no__comma__0+22>    add    $0x8,%rsi
   │0x416696 <__no__comma__0+26>    callq  0x416120 <__print__graph+20>

From the code of _no_comma_0 it is clear that the array is in the top of the A stack, i.e., rcx:

_print_array_a
    pop_b   1
    pushI   0
    push_a  0
    push_arraysize  _ 0 1
    jmp _print_array_lp2
.o 1 2 i i
_print_array_lp1
    eqI_b   0 1
    jmp_true    _no_comma_0
    print_sc    ","
_no_comma_0
    push_b  1
    push_a  0
    select  _ 0 1
.d 1 0
    jsr _print

The instruction mov 0x18(%rcx,%rax,8),%rcx, in which rcx gets set to 10, is the equivalent of select _ 0 1; it selects the element of the array (in rcx) pointed to by the top of the B stack (rax). The offset 0x18 is because array elements start 3 words into the array.

In other words, at this point, rcx contains a pointer to the array itself. The element descriptor is then at rcx+0x10 and the first element at rcx+0x18. We can set up watchpoints for this and rerun the program:

(gdb) watch *(int64_t*)0x7ffff7a56038
Hardware watchpoint 5: *(int64_t*)0x7ffff7a56038
(gdb) watch *(int64_t*)0x7ffff7a56040
Hardware watchpoint 6: *(int64_t*)0x7ffff7a56040
(gdb) r

We are now notified whenever the element descriptor (in ..38) or the first element (in ..40) changes:

Hardware watchpoint 5: *(int64_t*)0x7ffff7a56038

Old value = <unreadable>
New value = 4560066
0x00000000004175d9 in e__segfault__nStart_P1 ()
(gdb) c
Continuing.

Hardware watchpoint 6: *(int64_t*)0x7ffff7a56040

Old value = <unreadable>
New value = 140737348198440
0x00000000004175e0 in e__segfault__nStart_P1 ()
(gdb) c
Continuing.

Hardware watchpoint 6: *(int64_t*)0x7ffff7a56040

Old value = 140737348198440
New value = 10
0x0000000000417600 in e__segfault__nStart_P1 ()
(gdb) c
Continuing.

Hardware watchpoint 5: *(int64_t*)0x7ffff7a56038

Old value = 4560066
New value = 0
0x0000000000417610 in e__segfault__nStart_P1 ()
(gdb) c
Continuing.

Breakpoint 3, 0x000000000041667c in ::__no__comma(void) ()
(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x0000000000416127 in __print__graph ()

We thus find the following sequence of events:

  1. The element descriptor is set to 4560066. Inspecting objdump -t segfault, which gives the addresses of exported symbols, we find that this is INT+2, so we have an unboxed integer array.
  2. The first element is set to 140737348198440.
  3. The first element is set to 10.
  4. The element descriptor is set to 0, making this a boxed array.

Although it is not exactly clear what step 2 is needed for, we can ignore it. We now know that this is supposed to be an unboxed array, but that it was erroneously made a boxed array in step 4. One can now follow the tips here to figure out where exactly in the Clean code this is done.

Crashes in the garbage collector

In the example above the program crashed because the corrupted node in the heap was used. It is of course possible that the node, after getting corrupted, is not used any more. In that case the program might not crash despite the heap corruption. In still other cases, it may be that the corrupted node remains in the live set until garbage collection runs. In this case, the garbage collector will visit the node and likely crash.

In practice, most crashes in the garbage collector boil down to heap corruption. The garbage collector has been stable for quite some time; it is relatively unlikely for a segfault to boil down to a bug in the garbage collector. As in the example above, we must first find the node that is corrupted and then the place that it got corrupted in.

It may be that the bug occurs sooner or later with a different garbage collector. Clean has three garbage collectors. It is useful to try to reproduce the bug with both the copying and the marking collector. If the bug occurs in both environments, it is highly unlikely to be due to a bug in either collector. Furthermore, you can now save some time by using the garbage collector that crashes the soonest to debug further. In the same vain it is also useful to experiment with different heap sizes. Be sure to also change the initial heap size (-gci). Programs that crash after a minimal amount of garbage collection iterations are easiest to debug.

Reducing to a minimal example

It is tricky to reduce a crashing program to a minimal example, since even the slightest change will have effects on memory consumption, and hence when the garbage collector runs and which nodes it will visit. For these crashes it is less useful to reduce to a minimal example anyway, but should you feel the need to do so, you can test the program with a variety of heap sizes. For example:

for h in $(seq 200 600); do ./a.out -h "${h}k" || echo $h; done

This will test the program with heap sizes of 200, 201, .. 600 kB and output the heap sizes for which the program fails.

Finding the corrupt node

It is more important to find the node on which the garbage collector crashes. Run the program in gdb to find the exact location of the crash. From there you will have to read the Assembly code a bit to figure out what is going on. We need to determine the exact word that is corrupted.

For example:

  • A node is visited that has an illegal descriptor (e.g. a null pointer). GC crashes when dereferencing the descriptor to find the node arity. We are interested in the descriptor.

  • A node is visited that has an illegal descriptor, which happens to be an accessible pointer. GC dereferences it to find the arity, which is a ridiculously large number, and attempts to collect all 1234567 pointer arguments of the node (crashing only when dereferencing the arguments). We are interested in the descriptor.

  • A node is visited with a valid descriptor, but one of its pointer arguments points to an illegal location. GC crashes when following the pointer. We are interested in the argument.

Determine the exact address of the corrupted word. Place a watchpoint on this word:

(gdb) watch *(void**) 0x...

Then, rerun the program. You are now notified of every change to this value. Typically only the last change is relevant (all the previous ones are because the memory is used for other nodes in previous cycles between GC iterations). You now have the exact location where the heap became corrupted. The example above shows how to work with gdb watchpoints.

Garbage collector crash example

We can slightly modify our example above to make sure it crashes in the garbage collector:

module segfault

import StdEnv
import StdOverloadedList

Start = Reverse [!{{#10,20,30} & [-1]=0} \\ _ <- [0..1000]]

The cause of the crash is the same: we have overwritten the element descriptor of the unboxed array with a NULL pointer, causing it to be interpreted as a boxed array; because 10 is not a valid pointer, the garbage collector crashes when it tries to follow it.

However, we here let the heap fill up with a list of these arrays (Reverse requires its entire argument to be on the heap until it can begin to build its result). We must use a head-strict list to make sure that the arrays are evaluated while building the list. With a lazy list, the evaluation would be postponed until the moment of printing, and the garbage collector would see only thunks that would rewrite to these arrays.

Running this with a small heap shows that it crashes in the garbage collector:

$ clm -tst -desc -exl -ns segfault -o segfault
$ ./segfault -h 20k -gc -st
A stack: 24 bytes. BC stack: 168 bytes.
Segmentation fault (core dumped)

Because -st prints stack sizes before garbage collection and -gc prints heap size after garbage collection, the fact that the crash occurs after the stack sizes have been print but before the heap size has been print shows that the crash occurs in the garbage collector. We can verify this with gdb:

$ gdb -q ./segfault -ex 'r -h 20k'
Reading symbols from ./segfault...(no debugging symbols found)...done.
Starting program: /tmp/segfault -h 20k

Program received signal SIGSEGV, Segmentation fault.
0x0000000000402451 in continue_after_selector_2 ()

continue_after_selector_2 is defined in the copying collector:

continue_after_selector_2:
        mov     rcx,qword ptr [rdx]
        test    cl,2
        je      not_in_hnf_2
in_hnf_2:

Reading the code of the garbage collector, with its design and the layout of nodes in mind, we can see that rdx points to a node to be visited and copied to the other semi-space. rcx is then the descriptor, and is tested for its second least significant bit to determine whether it is a head normal form (hnf) or not. The segmentation fault occurs because the descriptor cannot be fetched because rdx is not a valid pointer:

(gdb) p $rdx
$1 = 10

This brings us back to the same point as in the example above: we need to figure out why 10 is being treated as a reference to a node. To do so, we need to place tactical breakpoints in the garbage collector to determine the control flow up to continue_after_selector_2.

If you do this you will find that we enter continue_after_selector_2 from copy_lp2 right above, and that the value of 10 in rdx is loaded from rbp. We can thus place a watchpoint on the value of rbp at the time of the crash:

Program received signal SIGSEGV, Segmentation fault.
0x0000000000402451 in continue_after_selector_2 ()
(gdb) p/x $rbp
$1 = 0x4b32d8
(gdb) watch *(int64_t*)0x4b32d8
Hardware watchpoint 1: *(int64_t*)0x4b32d8
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /tmp/segfault -h 20k

Hardware watchpoint 1: *(int64_t*)0x4b32d8

Old value = <unreadable>
New value = 10
0x0000000000402b98 in cp_s_arg_lp2 ()
(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x0000000000402451 in continue_after_selector_2 ()

The value is loaded in cp_s_arg_lp2. By reading the code for this routine we find that it is used for strings and arrays. We can place breakpoints on the relevant routines to find that we are coming from copy_array_a2, which is called by copy_array_2 in the case of a boxed array.

From here, debugging proceeds as in the example above: either the array is indeed boxed and the element got corrupted, or the array is unboxed and the element descriptor got corrupted. We can place watchpoints on both memory addresses and rerun the program to find out which.