
Chapter 24: Understanding the Garbage Collector


OCaml's runtime is a C library that handles most memory management for us. It exposes routines that our OCaml program calls to allocate heap blocks and fill the heap with OCaml values. What's left is for us to understand how the garbage collector manages the rest of the lifecycle. Note: pre-multicore OCaml garbage collection was NOT thread-safe; the modern one is, so these notes will contain information about the multicore GC as well (in addition to what's covered in the textbook).

The C runtime manages a heap that it obtains from the OS. It carves this heap into blocks, which it fills with OCaml values in response to allocation requests from the OCaml program.

Figure 1: Green-tea pickers in Hangzhou

Mark and Sweep Garbage Collection #

When an allocation request can't be met, the GC is invoked. The GC decides which values are live and which are dead (to be collected); the program can't explicitly free a value when it's done with it. (QQ: it claims that we can't force a collection. Is that really true? E.g. can't we just do a System.sleep or some kind of Os module related call that would trigger a GC collection?)

The GC doesn't statefully track liveness as the program runs; instead, a mark and sweep pass is performed regularly.

Scanning:

starting from a set of root values (e.g. the stack), the GC does a tree traversal to figure out which heap blocks are reachable and which are not. This yields a directed graph: an edge exists wherever one heap block holds a pointer to another.

  • dead blocks: can't be reached via the graph traversal \(\implies\) can be freed and reused by the application
  • alive blocks: reachable

Marking is done by setting the colour field in each block's header; see below for what the colours mean.

Generational Garbage Collection #

Generational GC: separate memory regions to hold blocks based on how long the blocks have been alive:

  • young: small, fixed-size minor heap where most blocks get allocated
  • old generation: larger, variable-sized, major heap for long-lived blocks
This relies on the Generational Hypothesis: in a typical functional programming style, young blocks die young and old blocks stay around longer than young ones. Because the lifecycle patterns are distinct, the memory layouts and GC algorithms for the two generations are different too.

The Gc Module and OCAMLRUNPARAM #

Runtime system behaviour can be altered or tuned using either the Gc module or the OCAMLRUNPARAM environment variable.

The env var lets us set GC parameters without recompiling (since it's consumed at startup rather than baked into the code); see the runtime system configuration parameters in the manual if interested.
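For example, launching a program as OCAMLRUNPARAM='s=512k' ./prog starts it with a 512k-word minor heap. A minimal stdlib sketch of the equivalent in-code setting (the snippets later in this chapter use Core's wrapper instead):

(* Equivalent of OCAMLRUNPARAM='s=512k'; minor_heap_size is measured in
   words, not bytes. *)
let () =
  let c = Gc.get () in
  Gc.set { c with Gc.minor_heap_size = 512 * 1024 }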

The Fast Minor Heap #

It's a single contiguous chunk of virtual memory (fixed size) that holds a sequence of OCaml blocks. If there is space, allocation is an \(O(1)\) operation taking a couple of CPU instructions \(\implies\) it's meant for short-lived values.

GC happens via copying collection: live blocks are copied to the major heap, so collecting b live blocks takes \(O(b)\) time. Given the Generational Hypothesis, b should be pretty small.

It's generally a stop-the-world action \(\implies\) that's why it needs to be FAST, so it doesn't block application execution for long.
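A minimal sketch (using the stdlib Gc module directly; note that open Core shadows Gc with its own wrapper): allocate a burst of short-lived values, force a minor collection with Gc.minor, and check how few words survive to be promoted:

(* Short-lived refs should die in the minor heap, so promoted words stay
   near zero even though millions of words were allocated. *)
let () =
  let before = Gc.quick_stat () in
  for _ = 1 to 1_000_000 do
    ignore (Sys.opaque_identity (ref 0))
  done;
  Gc.minor ();  (* force a minor (copying) collection *)
  let after = Gc.quick_stat () in
  Printf.printf "minor words allocated: %.0f\n"
    (after.Gc.minor_words -. before.Gc.minor_words);
  Printf.printf "words promoted: %.0f\n"
    (after.Gc.promoted_words -. before.Gc.promoted_words)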

Allocating on the Minor Heap #

The minor heap is obtained from the OS via a malloc; that call establishes the base and caml_young_end pointers into memory. caml_young_start sits slightly to the right of base to ensure the heap is word-aligned.

The minor heap grows (under OCaml runtime control) from right to left. Allocation is quick because it's a simple two-step action (sketched in OCaml below):

  1. decrement the current allocation pointer (curr_ptr), and
  2. check whether limit has been reached; if so, a collection is triggered.
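The runtime implements this in C by decrementing caml_young_ptr and comparing it against the limit; the fragment below is illustrative only, modelling the two pointers as plain integers rather than real addresses:

(* Illustrative sketch of bump allocation on the minor heap: the real logic
   lives in the C runtime and operates on caml_young_ptr. *)
let alloc ~limit ~curr ~size_in_words =
  let curr' = curr - size_in_words in  (* heap grows right to left *)
  if curr' < limit
  then `Trigger_minor_collection       (* no room: run the minor GC first *)
  else `Allocated curr'                (* room: the new block starts here *)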

QQ: the minor heap seems to behave like a stack of its own, given that it's contiguous and the pointer just gets decremented like that. Does my intuition or observation have any merit here?

Figure 2: Layout of the minor heap

Purpose of limit #

limit is a dynamic pointer here: if the runtime wants to schedule a minor heap collection, it just needs to set limit to end, and at the next allocation there won't be enough space for anything, so a collection is triggered (an early collection).

Table 1: Compiler introduces poll points

| Question | Answer |
| --- | --- |
| Why? | Introduced by the compiler to allow pre-emption from UNIX signals and other internal book-keeping, especially because our coding style might mean a long time passes between minor-heap allocations |
| What happens? | At a poll point, the allocation pointer is simply checked against limit |
| Where? | Typically at the start of every function and at the back edge of loops; placement depends on the dataflow, and the compiler optimises to minimise the number of poll points |

Controlling the Minor Heap Size #

As usual, we can use the Gc module or inject via OCAMLRUNPARAM:

open Core;;

let c = Gc.get ();;
val c : Gc.Control.t =
  {Core.Gc.Control.minor_heap_size = 262144; major_heap_increment = 15;
   space_overhead = 120; verbose = 0; max_overhead = 500;
   stack_limit = 1048576; allocation_policy = 2; window_size = 1;
   custom_major_ratio = 44; custom_minor_ratio = 100;
   custom_minor_max_size = 8192}


(* --- how to tune the size: *)
Gc.tune ~minor_heap_size:(262144 * 2) ();;
- : unit = ()
Code Snippet 1: Example of reading and tuning. Core defaults to 8MB; otherwise it's a 2MB default on 64-bit machines

Notes:

  1. Core increases the default minor heap size significantly over standard OCaml
  2. changing the GC size triggers an immediate minor heap collection

The Long-Lived Major Heap #

Variable-sized; this is where the bulk of the longer-lived, larger values in the program live. It can be any number of non-contiguous chunks of virtual memory, each containing live blocks with free memory alongside them to expand into.

The OCaml runtime maintains a free list data structure that acts as a lookup index for the free memory it has allocated; it is used to satisfy allocation requests for OCaml blocks.

It can be gigabytes in size. Collection works in phases and is also a stop-the-world action, so that blocks can be moved around without being observed by the live application.

| Phase | Scope | What it does |
| --- | --- | --- |
| mark | acts on a heap slice | scans the block graph and marks all live blocks by setting the colour tag of the block header |
| sweep | acts on a heap slice | sequentially scans the heap chunks and identifies dead blocks that weren't marked earlier |
| compact | acts on the entire memory at once | relocates live blocks into a freshly allocated heap to eliminate gaps in the free list; this prevents fragmentation of heap blocks in long-running programs and normally occurs much less frequently than the mark and sweep phases |

Since it’s a stop-the-world action, it is done incrementally over heap slices to avoid pausing the live application for too long.

Compaction touches the entire memory in one go, though it's a rare operation; it exists to defragment the heap.

Allocating on the Major Heap #

The major heap is a singly linked list of contiguous memory chunks, sorted in increasing order of virtual memory address.

Each chunk comes from malloc and has a header plus a data area that contains OCaml heap blocks.

Figure 3: Layout of a Contiguous Chunk allocated on the Major Heap

Typically this is the process for allocation:

  1. check the free list for a suitable region that fits
    1. if there is no room on the free list, the runtime expands the major heap: it allocates a fresh heap chunk that is large enough \(\implies\) the chunk is added to the free list \(\implies\) repeat the first step
    2. if room exists \(\implies\) allocate
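A small stdlib sketch of watching this happen: heap_words in Gc.stat is the total major heap size in words, and an allocation far larger than the minor heap's per-block limit goes straight to the major heap, forcing it to grow if the free list has no fit:

(* A 10M-element int array is far too large for the minor heap, so it is
   allocated directly on the major heap, expanding it if needed. *)
let () =
  let before = (Gc.stat ()).Gc.heap_words in
  let big = Sys.opaque_identity (Array.make 10_000_000 0) in
  let after = (Gc.stat ()).Gc.heap_words in
  Printf.printf "major heap grew by %d words\n" (after - before);
  ignore (Sys.opaque_identity big)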

Controlling the Heap Increment #

major_heap_increment is a value read by the OCaml runtime that controls major heap growth: values up to 1000 are interpreted as a percentage of the current heap size to add per expansion (the default of 15 above means 15%), while larger values are an absolute number of words.
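For instance, to make each expansion add 25% of the current heap size instead of the default 15% (a sketch assuming Core's Gc.tune accepts a ~major_heap_increment label, mirroring the ~minor_heap_size call above):

(* Values <= 1000 are percentages of the current heap size; values > 1000
   are absolute word counts. *)
Gc.tune ~major_heap_increment:25 ();;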

Memory Allocation Strategies #

Different strategies exist for different program behaviours. They all need to manage the following trade-off: allocation-cost vs heap fragmentation.

How they work:

  1. Best-fit strategy \(\implies\) best of both worlds
    1. small allocations (i.e. \(\le\) 16 words): use size-segregated free lists
    2. large allocations: use a splay tree (recency-biased) to select the smallest block that fits
  2. Next-fit strategy: holds a pointer to the last allocated block; scans forward through the free list, wrapping around if necessary \(\implies\) prioritises low allocation cost over fragmentation
  3. First-fit strategy: scans the free list from the beginning for the first block large enough \(\implies\) prioritises low fragmentation over allocation cost
Table 2: Comparison of strategies

| Strategy | Core Idea | Advantages | Disadvantages | Best Use Case |
| --- | --- | --- | --- | --- |
| Best-fit (default) | Find the smallest suitable block | Balanced performance; efficient for common small allocations; reduces wasted space | Moderate CPU cost; not the fastest | General-purpose workloads with mixed allocation sizes |
| Next-fit | Continue from last used position | Fast allocation; good cache locality; reuses nearby memory | Causes fragmentation, especially for large blocks at the start of the list | Workloads prioritizing speed and locality over fragmentation |
| First-fit | Take the first suitable block | Reduces fragmentation; fewer heap compactions | Slower allocation due to full scan | Real-time or fragmentation-sensitive workloads with varied sizes |
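The strategy is selected through the allocation_policy field of the GC control record (visible in Code Snippet 1 above). A stdlib sketch; 0 = next-fit, 1 = first-fit, 2 = best-fit (the default since OCaml 4.10), and note the field is ignored by the OCaml 5 runtime:

(* Switch the major-heap allocator to first-fit. *)
let () =
  let c = Gc.get () in
  Gc.set { c with Gc.allocation_policy = 1 }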

Marking and Scanning the Heap #

This happens over heap slices instead of the complete major heap at once. Each block has a 2-bit colour field in its header, which allows the stateful colour checks needed to pause and resume:

| Colour value | Meaning |
| --- | --- |
| blue | on the free list, not currently in use |
| white (during marking) | not yet reached, but possibly reachable |
| white (during sweeping) | unreachable, can be freed |
| black | reachable, and its fields have been scanned |

When allocated, all heap values are coloured white (possibly reachable, but not yet scanned).

The runtime tracks the rate of allocation (and the available memory) and calculates the size of the heap slice to sweep.

Marking uses a mark-stack:

  1. it starts from the always-alive root values, which are coloured black
  2. white blocks are blackened and pushed onto the mark stack
  3. if the mark stack overflows, pruning happens: the mark stack is emptied entirely, and the addresses of its blocks are summarized as start and end ranges in each heap chunk header.

Controlling Major Heap Collections #

It's possible to trigger a single major heap slice using Gc.major_slice, or a full major collection using Gc.full_major.

QQ: earlier, the text mentioned that it's not possible to explicitly trigger a collection. But here the major heap can be collected on demand (and implicitly the minor heap too, since this function also collects the minor heap)?
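A minimal sketch of the manual entry points (stdlib Gc): Gc.major_slice n does a minor collection plus enough major work to free roughly n words (0 means the default slice size), while Gc.full_major runs a minor collection and a complete major cycle:

let () =
  ignore (Gc.major_slice 0);  (* one slice of major GC work *)
  Gc.major ();                (* finish the current major cycle *)
  Gc.full_major ()            (* minor collection + full major cycle *)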

Heap Compaction #

It’s natural to end up having a fragmented major heap, so compaction is possible.

OCaml’s heap compaction happens in-place.

We can tune max_overhead, which is roughly a threshold on the ratio of free (wasted) memory to live data; when the estimated waste exceeds this percentage, compaction is triggered.
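Both knobs in one stdlib sketch: Gc.compact triggers a compaction immediately, while max_overhead tunes the automatic trigger (0 compacts after every major cycle; 1000000 disables compaction entirely):

let () =
  Gc.compact ();  (* full major collection followed by compaction, now *)
  let c = Gc.get () in
  (* Tolerate up to 256% wasted memory relative to live data before the
     runtime compacts automatically (the Core default above was 500). *)
  Gc.set { c with Gc.max_overhead = 256 }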

Intergenerational Pointers #

Problem: to know which minor heap blocks are live, the collector needs to check which of them are pointed to by major heap blocks. Without that information, every minor collection would also have to scan the entire major heap.

Solution: maintain a set of intergenerational pointers that acts as a remembered set. A write barrier runs on updates that could create such pointers, and that's how the set is kept up to date.
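A tiny illustration of when such a pointer appears (conceptual stdlib sketch; the promotion is forced here with a full collection):

(* cell survives a full collection, so it ends up on the major heap. The
   assignment then makes it point at a freshly allocated minor-heap list
   cell; the write barrier records that major->minor pointer in the
   remembered set. *)
let () =
  let cell = ref [] in
  Gc.full_major ();           (* cell is promoted to the major heap *)
  cell := [ Random.int 100 ]  (* write barrier fires here *)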

Mutable Write Barrier #

The write barrier can have a significant impact; that's why an immutable update can sometimes be faster than a mutable one.

The compiler tracks all mutable types, and at runtime caml_modify is called before a mutable change is made. Its job is to keep the remembered set consistent: it checks the location of the write target and the value being written into it.

caml_modify is efficient, but it can still be slower than simply allocating a fresh value on the fast minor heap and doing some extra minor collections.

#require "core_bench";;
open Core
open Core_bench

module Mutable = struct
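  (* Mutating version: each iteration updates the record in place. The boxed
     float store goes through the caml_modify write barrier; the int store
     is an immediate write. *)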
  type t =
    { mutable iters : int
    ; mutable count : float
    }

  let rec test t =
    if t.iters = 0
    then ()
    else (
      t.iters <- t.iters - 1;
      t.count <- t.count +. 1.0;
      test t)
end

module Immutable = struct
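  (* Allocating version: each iteration builds a fresh record on the minor
     heap instead of mutating in place. *)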
  type t =
    { iters : int
    ; count : float
    }

  let rec test t =
    if t.iters = 0
    then ()
    else test { iters = t.iters - 1; count = t.count +. 1.0 }
end

let () =
  let iters = 1_000_000 in
  let count = 0.0 in
  let tests =
    [ Bench.Test.create ~name:"mutable" (fun () ->
          Mutable.test { iters; count })
    ; Bench.Test.create ~name:"immutable" (fun () ->
          Immutable.test { iters; count })
    ]
  in
  Bench.make_command tests |> Command_unix.run
Code Snippet 2: Timed test: mutable vs immutable update
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌───────────┬──────────┬─────────┬──────────┬──────────┬────────────┐
│ Name      │ Time/Run │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├───────────┼──────────┼─────────┼──────────┼──────────┼────────────┤
│ mutable   │  52.14ms │  2.00Mw │   20.75w │   20.75w │    100.00% │
│ immutable │  41.62ms │  5.00Mw │  115.01w │  115.01w │     79.83% │
└───────────┴──────────┴─────────┴──────────┴──────────┴────────────┘
module Mutable :
  sig
    type t = { mutable iters : Base.int; mutable count : Base.float; }
    val test : t -> unit/2
  end
module Immutable :
  sig
    type t = { iters : Base.int; count : Base.float; }
    val test : t -> unit/2
  end
Code Snippet 3: Outcome of timed test

Observations:

  1. Space/time tradeoff: the mutable version takes longer to complete than the immutable version, but it also allocates fewer minor-heap words
  2. it's typically better to use immutable data structures than conventional mutable ones
  3. if mutation is rare, it's faster to tolerate the write-barrier hit and avoid all the extra allocations.

Attaching Finalizer Functions to Values #

For situations that are classic sources of memory leaks (e.g. closing file descriptors, or flushing the last remnants of a log buffer), it's useful to attach finalizers that free things up when values are collected.

Finalizers can run at any time in any thread, so they can be confusing in multi-threaded contexts.

Typically the GC calls finalization functions in the order of deallocation. If several values become unreachable during the same GC cycle, the finalization functions are called in the reverse order of the corresponding calls to add_finalizer. Each call to add_finalizer adds to the set of functions that run when the value becomes unreachable, so we can have multiple finalizers attached to the same heap block.

Every finalizer runs at most once: it runs when the GC determines that a block b is unreachable, at which point the GC removes b's finalizers from the finalizer set.

Program termination will NOT cause all the finalizers to be run before the runtime exits.

Remember OCaml finalizers here are not the same as C finalizers in the context of FFIs.
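A small sketch with the stdlib's Gc.finalise (Core's equivalents live in Core.Gc.Expert, discussed below); the wrapped record type here is hypothetical:

(* Close the underlying channel once the wrapper becomes unreachable. The
   GC decides when (if ever) this runs, so don't rely on it for prompt
   resource release. *)
type wrapped = { ic : in_channel; name : string }

let open_wrapped name =
  let w = { ic = open_in name; name } in
  Gc.finalise
    (fun w ->
      Printf.printf "finalizing %s\n" w.name;
      close_in w.ic)
    w;
  w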

What Values Can Be Finalized? #

  • Values that aren’t heap allocated can’t be finalized
    • ints, const constructors, bools, empty array, empty list, unit value
  • Some constant values (e.g. list of integer constants) can be heap-allocated but not deallocated until the end of the program — just use Core.Heap_block to check if it’s heap allocated and finalizable.
  • Finalizers need to be registered; the registration functions can be found in Core.Gc.Expert (see the sketch below)
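A sketch using Core, assuming Heap_block.create returns an option and Gc.Expert.add_finalizer takes the resulting Heap_block.t:

open Core

(* Register a finalizer only when the value is actually heap-allocated and
   therefore finalizable. *)
let maybe_add_finalizer v ~f =
  match Heap_block.create v with
  | None -> printf "not heap-allocated; cannot finalize\n"
  | Some block -> Gc.Expert.add_finalizer block f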