# Callgraph profiling
These notes describe the implementation of callgraph profiling in Clean. For a comparison with other profiling methods, see the Profiling introduction.
Just like the normal time profile, the callgraph profile tells you (1) how much time is spent in each function; (2) how many words are allocated; and (3) how many lazy/strict/curried calls there are to each function. Internally we keep track of where thunks were created, so that the costs to evaluate them are attributed to that stack instead of the stack of the evaluating function.
## Usage
To use in the Clean IDE, enable “Time Profile, Call Graph, and Stack Trace” in
“Project options” > “Compiler”. With cpm, use `cpm project X.prj set -pg` to
enable or `... -npg` to disable. With clm, compile with `-pg`. (When compiling
with clm, you may need to run `clm -pg -O _system` first.)
Run your program as normal. After termination, a file with the same name as
the executable and the extension `.pgcl` is created. This is a binary file
which requires a separate viewer program. `convertprofile` can generate two
output formats:

- An HTML file that can be inspected in the browser (`-o FILE.html`). The
  generated HTML may be so large that it crashes the browser. In this case you
  may want to run `convertprofile` with the `--min-ticks` or `--min-words`
  options to prune small functions (experiment with these options to obtain an
  HTML file of, say, 20 MB), or with `--exclude` to exclude certain functions.
- A callgrind output file that can be inspected in KCacheGrind
  (`-o callgrind.out.NAME`).
An attempt is made to compensate for the additional overhead costs in time. However, note that callgraph profiling alters the memory behaviour of your program, which may for example lead to different garbage collection patterns. Also be aware that the overhead correction is a best effort only and may lead to functions being attributed (near-)zero costs.
## Known limitations
- Callgraph profiling is supported on x64 (Linux, Mac, Windows) and x86 (Linux, Windows) architectures only.
- The costs of the evaluation of CAFs are attributed to the call stack in which
  the evaluation is demanded. It would be more natural if these costs were
  attributed at the top level. For this, an ABC annotation `.pcaf` could be
  added to indicate which functions exist to evaluate a CAF.
- The profiler keeps track of time overhead caused by the profiling code.
  However, the extra memory usage is not tracked.
- Because the memory layout of nodes is changed, callgraph profiling cannot be
  used in combination with ABC interpreter interworking (the `serialize` and
  `deserialize` functions in ABC.Interpreter).
## Future ideas
- An interactive viewer in which the `--min-ticks`, `--min-words`, and
  `--exclude` options of the conversion tool can be changed interactively would
  be very useful.
- It may be useful to be able to get an inverted callgraph to see the main
  caller contexts of a heavily used function.
- The costs of partial function applications (i.e. curried calls) are currently attributed to the call stack in which they are applied. It might be useful to add an option to attribute these costs to the call stack where the closure is first created.
- Part of the profiling code is written in C. Porting it to assembly may reduce overhead.
## Notes on the implementation

### Performance
A benchmark of the compiler compiling itself shows a slowdown factor of 4.3; with regular time profiling the factor is 3.4. The output file was 26 MB.
Note that more memory is used (see under Thunks and Uniqueness), so garbage collection patterns may differ.
### Thunks
We extend the payload with an extra field (as in GHC). This is an unboxed field and is placed as the last field of the node. This avoids issues with nodes that are handled specially in the garbage collector: tuple selectors (arity -1), indirections (arity -2), record selectors of unboxed fields (arity -3), and record selectors of boxed fields (arity -4).
Thunks with only A-stack fields are transformed into thunks with one unboxed field. This means that their arity changes from n to 256+n+1.
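As a small sketch of this arity change (the helper name is mine, not part of the run time system):

```python
def profiled_thunk_arity(n: int) -> int:
    """Hypothetical helper: the arity of a thunk that originally had only
    n A-stack (pointer) fields, after the cost-centre field is added as
    one unboxed field. Per the note above, the arity changes from
    n to 256 + n + 1."""
    return 256 + n + 1
```

For example, a thunk that previously had two A-stack fields is now encoded with arity 259.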
### Partial function applications
Currently, cost centres are not stored in closures. This means that the costs of a curried call are attributed to the stack in which it is called. If an option is added to attribute these costs to the place where the closure was first created (see Future ideas), the first pointer argument is the best place to store the cost centre, as this simplifies currying.
`hamming` is a nice example of partial function applications:
```clean
ham :: [Int]
ham = y
where
    y = [1:merge (merge (map ((*) 2) y) (map ((*) 3) y)) (map ((*) 5) y)]

    merge f=:[a:b] g=:[c:d]
        | a < c     = [a: merge b g]
        | a == c    = merge f d
        | otherwise = [c: merge f d]
```
Where should the costs for the `(*)` applications be attributed?

- `ham.*` (place of creation)
- `ham.merge.*` (place of application)
(In this case, place of application coincides with Haskell’s strategy, since
the common prefix between `ham.*` and `ham.merge` is the whole prefix of
`ham.*`.)
### Uniqueness
When the `ReuseUniqueNodes` optimisation is turned on (`-ou`), the compiler may
overwrite a unique node with a thunk. Normally it can do so for thunks with up
to 2 arguments. Because thunks are now larger, this can only be done for thunks
with up to 1 argument (excluding the cost centre pointer). This means that the
optimisation is less beneficial, and more memory is used when callgraph
profiling is turned on.
### GraphCopy
GraphCopy ignores the extra fields in thunks when serializing. When deserializing a thunk, it adds an extra field pointing to the current cost centre. This means that cost centres are not preserved over serialization cycles (which is not possible anyway if the data leaves the application), and that costs of the deserialized value are attributed to the GraphCopy function:
```clean
module with_graph_copy

import StdEnv, graph_copy

Start
    # (xs,i) = copy_from_string (copy_to_string [0..])
    = take 10 xs
```
### The profile file
We use a custom binary file format with shared string stores to reduce size.
The format is as follows:
| Offset | Width | Description |
|---|---|---|
| 0 | 4 | Magic number (`"prof"` in ASCII) |
| 4 | 4 | Version number (a positive integer, currently 2) |
| 8 | 4 | Number of modules |
| 12 | 4 | Number of cost centres |
| 16 | variable | CPU frequency in ticks per second (introduced in version 2) |
| .. | variable | Estimated overhead in ticks per 1000 profiling calls (introduced in version 2) |
| .. | variable | Module names |
| .. | variable | Cost centres |
| .. | variable | Call graph root entry |
Modules are `NUL`-terminated strings.
Cost centres are of the form:
| Offset | Width | Description |
|---|---|---|
| 0 | variable | Module ID (index for the module array; starting at 1) |
| .. | variable | `NUL`-terminated name |
A call graph entry is of the form:
| Offset | Width | Description |
|---|---|---|
| 0 | variable | Cost centre ID (index for the cost centre array; starting at 1) |
| .. | variable | Number of ticks spent in this entry |
| .. | variable | Number of words allocated in this entry |
| .. | variable | Number of tail calls and returns from this entry |
| .. | variable | Number of strict calls entering this entry |
| .. | variable | Number of lazy calls entering this entry |
| .. | variable | Number of curried calls entering this entry |
| .. | variable | Number of children |
| .. | variable | Child entries |
Integers encoded with a variable width are encoded as a little-endian stream of bytes, of which only the lower seven bits contain payload. The MSB of each byte except the last in the stream is 1. See https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128 for the inspiration.
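These variable-width integers can be read and written with a standard unsigned LEB128 routine. A minimal Python sketch (not the actual convertprofile implementation):

```python
def encode_uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128: little-endian
    groups of 7 payload bits, with the MSB of every byte except the
    last set to 1."""
    assert value >= 0
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)         # final byte: MSB is 0
            return bytes(out)

def decode_uleb128(data: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode one unsigned LEB128 integer starting at `offset`;
    return (value, offset just past the integer)."""
    value = shift = 0
    while True:
        byte = data[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, offset
```

For example, 624485 encodes to the three bytes `E5 8E 26`, and decoding those bytes yields 624485 again.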
## Links to the implementation
A first version was implemented in the following commits:
- Code generator: add extra unboxed field to thunks to keep track of cost centres
- Run time system: first implementation
- Compiler: add `-pg` option and modify `.comp` ABC header
- `clm`: add `-pg` and `-npg` options
- IDE and `cpm`: add settings and command line options
- GraphCopy support: ignore extra field in thunks
- StdEnv: make it easier to patch `_system.abc`
- Build scripts: add new run time system to clean-base
Supporting changes:
- Code generator: add instruction `get_thunk_arity`, corrected for the extra unboxed field
- ABC interpreter: add instruction `get_thunk_arity`
Later changes were made in:
- Code generator: x86 support
- Run time system: x86 support
- Build scripts: x86 support
- Run time system: add information to correct for overhead
## Comparison with GHC
These are some notes on a similar implementation in GHC, just for reference.
It is compiled mainly from:
The core concept is that of cost centres (CC). These are roughly the same as functions, but one can use program annotations to mark subexpressions as individual cost centres as well.
The RTS keeps track of cost centre stacks (CCS). Costs are attributed to CCSs (cf. the current Clean time profiler which attributes costs to CCs). The top of the stack is sometimes called the Current Cost Centre (Stack) or CCC(S).
### (Mutual) recursion
> [A]lthough GHC does keep information about which groups of functions called each other recursively, this information isn’t displayed in the basic time and allocation profile, instead the call-graph is flattened into a tree as follows: a call to a function that occurs elsewhere on the current stack does not push another entry on the stack, instead the costs for this call are aggregated into the caller [2].
>
> [2]: Note that this policy has changed slightly in GHC 7.4.1 relative to earlier versions, and may yet change further, feedback is welcome.
### Thunks
When compiling for profiling, thunks have an additional field at the start of the payload which holds a pointer to the corresponding CCS. When evaluating the thunk this CCS is pushed to the stack of CCSs. This way, costs are attributed to the CCS where the thunk was created rather than where it is evaluated.
### Higher-order functions
From the first link:
> At a function call, GHC takes the stack stored in the function being called (which for a top-level function will be empty), and appends it to the current stack, ignoring any prefix that is identical to a prefix of the current stack.
This seems to say that in `f = g (const x) x` the costs for `f.const`, which is
applied in `f.g`, are attributed to `f.g.const`. `f.const` is “the stack stored
in the function being called”, `f.g` is “the current stack”, and `f` is the
common prefix.
If this is the correct interpretation, it looks like this rule allows you to keep track of where closures are defined as well as where they are applied.
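Under this interpretation, the append rule can be sketched as follows (a toy model of the rule as quoted, not GHC’s actual implementation):

```python
def enter_call(current: list[str], stored: list[str]) -> list[str]:
    """Toy model of GHC's call rule as quoted above: append the stack
    stored in the function being called to the current stack, dropping
    the longest prefix of the stored stack that is also a prefix of
    the current stack."""
    k = 0
    while k < min(len(current), len(stored)) and current[k] == stored[k]:
        k += 1
    return current + stored[k:]

# With f = g (const x) x: entering const (stored stack f.const) while
# running in f.g drops the common prefix f, giving f.g.const.
```

A top-level function has an empty stored stack, so entering it leaves the current stack unchanged.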