Callgraph profiling

These notes describe the implementation of callgraph profiling in Clean. For a comparison with other profiling methods, see the Profiling introduction.

Just like the normal time profile, the callgraph profile tells you (1) how much time is spent in each function; (2) how many words are allocated; and (3) how many lazy/strict/curried calls there are to each function. Internally we keep track of where thunks were created, so that the costs to evaluate them are attributed to that stack instead of the stack of the evaluating function.

It looks like this:

Callgraph profile example


To use in the Clean IDE, enable “Time Profile, Call Graph, and Stack Trace” in “Project options” > “Compiler”. With cpm, use cpm project X.prj set -pg to enable or cpm project X.prj set -npg to disable. With clm, compile with -pg. (When compiling with clm, you may first need to run clm -pg -O _system.)

Run your program as normal. After termination, a file with the same name as the executable and the extension .pgcl is created. This is a binary file which requires a separate viewer program. convertprofile can generate two output formats:

  • An HTML file that can be inspected in the browser (-o FILE.html). It may happen that the generated HTML is so large that it crashes the browser. In this case you may want to run convertprofile with the --min-ticks or --min-words options to prune small functions (experiment with these options to obtain an HTML file of, say, 20MB) or with --exclude to exclude certain functions.

  • A callgrind output file that can be inspected in KCacheGrind (-o callgrind.out.NAME).

An attempt is made to compensate for the additional overhead costs in time. However, note that callgraph profiling alters the memory behaviour of your program, which may for example lead to different garbage collection patterns. Also be aware that the overhead correction is a best effort only and may lead to functions being attributed (near-)zero costs.

Known limitations

  • Callgraph profiling is supported on x64 (Linux, Mac, Windows) and x86 (Linux, Windows) architectures only.
  • The costs of the evaluation of CAFs are attributed to the call stack in which the evaluation is demanded. It would be more natural if these costs were attributed at the top level. For this, an ABC annotation .pcaf could be added to indicate which functions exist to evaluate a CAF.
  • The profiler keeps track of time overhead caused by the profiling code. However, the extra memory usage is not tracked.
  • Because the memory layout of nodes is changed, callgraph profiling cannot be used in combination with ABC interpreter interworking (the serialize and deserialize functions in ABC.Interpreter).

Future ideas

  • An interactive viewer in which the --min-ticks, --min-words, and --exclude options of the conversion tool can be changed interactively would be very useful.
  • It may be useful to be able to get an inverted callgraph to see the main caller contexts of a heavily used function.
  • The costs of partial function applications (i.e. curried calls) are currently attributed to the call stack in which they are applied. It might be useful to add an option to attribute these costs to the call stack where the closure is first created.
  • Part of the profiling code is written in C. Porting this to Assembly may reduce overhead.

Notes on the implementation


A benchmark of the compiler compiling itself shows a slowdown factor of 4.3; with regular time profiling the factor is 3.4. The output file was 26MB.

Note that more memory is used (see under Thunks and Uniqueness), so garbage collection patterns may differ.

Thunks

We extend the payload with an extra field (as in GHC). This is an unboxed field and is placed as the last field of the node. This avoids issues with nodes that are handled specially in the garbage collector: tuple selectors (arity -1), indirections (arity -2), record selectors of unboxed fields (arity -3), and record selectors of boxed fields (arity -4).

Thunks with only A-stack fields are transformed into thunks with one unboxed field. This means that their arity changes from n to 256+n+1.
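This arity change can be sketched as follows. Note that this is a hypothetical model, not actual compiler code; the 256*unboxed+total encoding is an assumption that is merely consistent with the 256+n+1 rule above.

```python
# Hypothetical sketch, not actual compiler code.  The 256+n+1 rule above
# is consistent with an encoded arity of 256*unboxed_fields + total_fields:
# a pointer-only thunk gains one unboxed (cost-centre) field, so it ends
# up with 1 unboxed field and n+1 fields in total.

def encode_arity(unboxed_fields: int, total_fields: int) -> int:
    return 256 * unboxed_fields + total_fields

def profiled_thunk_arity(n: int) -> int:
    """Arity of a thunk that originally had only n A-stack fields."""
    return encode_arity(1, n + 1)  # = 256 + n + 1

print([profiled_thunk_arity(n) for n in range(3)])  # prints [257, 258, 259]
```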

Partial function applications

Currently, cost centres are not stored in closures. This means that the costs of a curried call are attributed to the stack in which it is called. If an option is added to attribute these costs to the place where the closure was first created (see Future ideas), it would be best to use the first pointer argument to store the cost centre, as this simplifies currying.

hamming is a nice example for partial function applications:

ham :: [Int]
ham = y
where
    y = [1:merge (merge (map ((*) 2) y) (map ((*) 3) y)) (map ((*) 5) y)]

    merge f=:[a:b] g=:[c:d]
        | a < c = [a: merge b g]
        | a == c = merge f d
        | otherwise = [c: merge f d]

Where should the costs for the (*) applications be attributed?

  • ham.* (place of creation)
  • ham.merge.* (place of application)

(In this case, the place of application coincides with Haskell’s strategy: the common prefix of ham.* and ham.merge is ham, i.e. all of ham.* except its top element, so GHC’s append rule also yields ham.merge.*.)

Uniqueness

When the ReuseUniqueNodes optimisation is turned on (-ou), the compiler may overwrite a unique node with a thunk. Normally it can do so for thunks with up to 2 arguments. Because thunks are larger under callgraph profiling, it can only do so for thunks with up to 1 argument (excluding the cost centre pointer). This means that the optimisation is less beneficial, and more memory will be used when callgraph profiling is turned on.

GraphCopy

GraphCopy ignores the extra fields in thunks when serializing. When deserializing a thunk, it adds an extra field pointing to the current cost centre. This means that cost centres are not preserved over serialization cycles (which is not possible anyway if the data leaves the application), and that costs of the deserialized value are attributed to the GraphCopy function:

module with_graph_copy

import StdEnv, graph_copy

Start
    # (xs,i) = copy_from_string (copy_to_string [0..])
    = take 10 xs

Callgraph profiling with GraphCopy

The profile file

We use a custom binary file format with shared string stores to reduce size.

The format is as follows:

Offset Width Description
0 4 Magic number ("prof" in ASCII)
4 4 Version number (a positive integer, currently 2)
8 4 Number of modules
12 4 Number of cost centres
16 variable CPU frequency in ticks per second (introduced in version 2)
.. variable Estimated overhead in ticks per 1000 profiling calls (introduced in version 2)
.. variable Module names
.. variable Cost centres
.. variable Call graph root entry

Modules are NUL-terminated strings.

Cost centres are of the form:

Offset Width Description
0 variable Module ID (index for the module array; starting at 1)
.. variable NUL-terminated name

A call graph entry is of the form:

Offset Width Description
0 variable Cost centre ID (index for the cost centre array; starting at 1)
.. variable Number of ticks spent in this entry
.. variable Number of words allocated in this entry
.. variable Number of tail calls and returns from this entry
.. variable Number of strict calls entering this entry
.. variable Number of lazy calls entering this entry
.. variable Number of curried calls entering this entry
.. variable Number of children
.. variable Child entries

Integers encoded with a variable width are stored as a little-endian stream of bytes, of which only the lower seven bits contain payload. The MSB of each byte except the last in the stream is 1 (cf. the LEB128 encoding).
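As an illustration, a decoder for this encoding might look as follows. This is a sketch, not the actual convertprofile code:

```python
# Sketch of a decoder for the variable-width integers described above
# (not the actual convertprofile code): bytes are little-endian, the
# lower seven bits of each byte are payload, and the MSB is the
# continuation bit (1 on every byte except the last).

def decode_varint(data: bytes, offset: int = 0) -> tuple[int, int]:
    """Return (value, offset just past the integer)."""
    value = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:  # MSB clear: last byte of this integer
            return value, offset
        shift += 7

print(decode_varint(bytes([0x96, 0x01])))  # prints (150, 2)
```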

A first version was implemented in the following commits:

Supporting changes:

Later changes were made in:

Comparison with GHC

These are some notes on a similar implementation in GHC, just for reference.

It is compiled mainly from:

The core concept is that of cost centres (CC). These are roughly the same as functions, but one can use program annotations to mark subexpressions as individual cost centres as well.

The RTS keeps track of cost centre stacks (CCS). Costs are attributed to CCSs (cf. the current Clean time profiler which attributes costs to CCs). The top of the stack is sometimes called the Current Cost Centre (Stack) or CCC(S).

(Mutual) recursion

[A]lthough GHC does keep information about which groups of functions called each other recursively, this information isn’t displayed in the basic time and allocation profile, instead the call-graph is flattened into a tree as follows: a call to a function that occurs elsewhere on the current stack does not push another entry on the stack, instead the costs for this call are aggregated into the caller [2].

[2]: Note that this policy has changed slightly in GHC 7.4.1 relative to earlier versions, and may yet change further, feedback is welcome.
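The flattening rule quoted above can be sketched as follows (hypothetical helper names, not GHC's actual RTS code):

```python
# Sketch of GHC's flattening rule for (mutual) recursion, as quoted
# above: pushing a cost centre that already occurs on the current stack
# is a no-op, so the costs of the recursive call are aggregated into
# the existing entry instead of growing the stack.

def push_cc(stack: list[str], cc: str) -> list[str]:
    return stack if cc in stack else stack + [cc]

stack: list[str] = []
for cc in ["main", "f", "g", "f"]:  # f calls g, which calls f again
    stack = push_cc(stack, cc)
print(stack)  # prints ['main', 'f', 'g']
```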

Thunks

When compiling for profiling, thunks have an additional field at the start of the payload which holds a pointer to the corresponding CCS. When evaluating the thunk this CCS is pushed to the stack of CCSs. This way, costs are attributed to the CCS where the thunk was created rather than where it is evaluated.

Higher-order functions

From the first link:

At a function call, GHC takes the stack stored in the function being called (which for a top-level function will be empty), and appends it to the current stack, ignoring any prefix that is identical to a prefix of the current stack.

This seems to say that in f = g (const x) x the costs for f.const, which is applied in f.g, are attributed to f.g.const. f.const is “the stack stored in the function being called”, f.g is “the current stack”, and f is the common prefix.

If this is the correct interpretation, it looks like this rule allows you to keep track of where closures are defined as well as where they are applied.
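Under this interpretation, the append rule could be sketched as follows (our reading of the quoted rule, not GHC's actual implementation):

```python
# Sketch of the call rule quoted above (our reading, not GHC's actual
# implementation): append the stack stored in the called function to the
# current stack, dropping the longest prefix of the stored stack that is
# also a prefix of the current stack.

def append_stack(current: list[str], stored: list[str]) -> list[str]:
    common = 0
    while common < len(current) and common < len(stored) \
            and current[common] == stored[common]:
        common += 1
    return current + stored[common:]

# The example from the text: const's stored stack is f.const, the
# current stack is f.g, and the common prefix f is dropped.
print(append_stack(["f", "g"], ["f", "const"]))  # prints ['f', 'g', 'const']
```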