Introduction

This section describes Clean’s run time system. A good understanding of it is extremely useful for writing efficient code and effective debugging.

The ABC machine

The ABC machine is the abstract machine for which the Clean compiler generates code. The code generator can generate actual machine code from ABC code for various platforms. Detailed information about the ABC machine is available on a separate page, but here we must treat the basics already.

The ABC machine is an abstract imperative machine for graph rewriting. The program state consists of the following elements:

  • The program that is being executed and the program counter.
  • The graph store or heap, containing the graph that is being rewritten.
  • The Argument stack to hold references to the graph store. This stack is used to pass node arguments to functions.
  • The Basic value stack to hold basic values (integers, characters, etc.) that are passed as arguments to functions.
  • The Control stack to hold return addresses.

The run time system is written in ABC, Clean, and Assembly. In _system.abc (in StdEnv), routines are defined that evaluate and print the result of the Start node. In the run-time-system repository the C and Assembly code can be found. (Check the Makefile for your platform to figure out which files are relevant.) The garbage collectors, code for higher order function applications, and some other things are written in Assembly; functions to deal with file I/O (among other things) are written in C.

Node types

There are two important node types to be aware of: thunks and head normal forms.

A head normal form is a node that cannot be evaluated, although its children might. An example is the _Cons node for a list element: due to lazy evaluation its children may not have been evaluated yet, but the node itself cannot be evaluated further. Head normal forms are ADT constructors, records, basic value nodes, and unsaturated closures (see below).

A thunk is a node that can be evaluated further. It is a representation of a function application, so it contains a reference to the function as well as its arguments. To evaluate it, the node entry point of the function is called. This entry point evaluates the function call and overwrites the thunk with the result. Overwriting the node prevents the duplication of work.

An unsaturated closure is an incomplete function application, like const 37. Because it cannot be evaluated (yet), it is represented as a head normal form.

Node layout

Each node contains a reference to a descriptor, containing information about the type and arity of the node, and a list of arguments.

To support overwriting of thunks by head normal forms in the node entry point, the following scheme is used:

  • Head normal forms with two or less arguments use 1 word (0 arguments), 2 words (1 argument), or 3 words (2 arguments).
  • Head normal forms with more than three arguments are split over two blocks. The first block contains the descriptor, the first argument, and a pointer to the second block. The second block contains the rest of the arguments.
  • Thunks contain a descriptor and the arguments, but are never split over two blocks. Thunks with less than two arguments are padded to consume at least 3 words so that they can be overwritten by any head normal form.

Arrays

To be able to overwrite a thunk with an array, arrays are always split over two nodes. The first node has the ARRAY descriptor and a pointer to the second node (it is a normal head normal form).

0 1
ARRAY descriptor (ptr to second node)

The memory layout of the second node depends on the type of the array.

Because the size of the second node is dependent on the length of the array, the garbage collector has to treat this node in a special way.

In case of the array being a boxed array, the second node has the following layout:

0 1 2 …
_ARRAY_ descriptor length elements …

For boxed arrays, the elements are pointers to the nodes of the elements.

Arrays may also be unboxed, which means that the values are included directly in the array, instead of the array consisting of pointers, which point to the values of the array.

Unboxed arrays may consist of Bools, Chars, Ints, Reals and records.

The memory layout for Bools, Chars, Ints and Reals is the same other than the descriptor being different.

Strings are unboxed arrays of Chars. In this case, the second node has the following layout:

0 1 2 …
_STRING_ descriptor length elements …

Unboxed Bool arrays are optimized to use 8 bits per value, they have the same layout as Strings except that the _ARRAY_BOOL_ descriptor is used.

Unboxed Int arrays have the same layout as Strings except that the _ARRAY_INT_ descriptor is used. 32 bits are used per value on 32-bit systems, 64 bits are used per value on 64-bit systems.

Unboxed Real arrays have the same layout as Strings except that the _ARRAY_REAL_ descriptor is used. Reals use 64 bits per value on both 32-bit and 64-bit systems.

Arrays of unboxed records have a special node layout, in this case an element descriptor is included which contains a description of the records present in the array.

The special node layout is needed because unboxed record arrays may use multiple words per elements. Therefore, an element descriptor is needed to provide information on the size of an element.

0 1 2 3 …
_ARRAY_R_ descriptor length element descriptor elements …

All arrays of arrays have an _ARRAY_ descriptor. In case the outer array is unboxed, the elements contain pointers to the internal array node directly. For instance, in the case of an unboxed array of strings ({#{#Char}}), the elements of the outer array would point to array nodes with a _STRING_ descriptor. Conversely, if the outer array were boxed, the elements would point to an array node with the ARRAY descriptor. This conforms to the layout for boxed arrays described above.

Unboxing

Basic values, records, and arrays can be unboxed, meaning that its value is included directly in the type in which it is unboxed (instead of a pointer to a boxing node).

For example, the heap for the boxed list [37] looks like this:

0 1 2 3 4
_Cons 3 (ptr) _Nil (ptr) INT 37 (value)

The heap for the unboxed list [#37] looks like this:

0 1 2
_Consi _Nil (ptr) 37 (value)

A different _Cons descriptor, _Consi, is needed so that the garbage collector knows not to treat the second argument as a pointer, so that the run time system knows how to print the value, etc.

Unboxed values always appear after boxed values.

Values are unboxed in the following circumstances:

  • As the strict argument of an ADT constructor or record.
  • In unboxed lists and arrays ([#], [#!], and {#}).
  • As the strict argument of a function. The arguments of the value are then passed on the A and B stacks, instead of a single argument for the value itself. For instance, with :: R = {x :: !Int, y :: Int}, a lazy argument of type R would be passed as a single pointer to the R node on the A stack. A strict argument of type R would be passed as the unboxed x value on the B stack and a pointer to the y value on the A stack (because it is lazy it must be boxed).

See the language report, §5.1.6 and §10.1.3, for more details.

Because the descriptors of ADT constructors do not permit strictness on ABC level, a record descriptor is generated for ADT constructors with strict arguments. Additionally, a separate curry table is generated to allow currying of these constructors, because genuine records cannot be curried.