
Lazy Resolution: Resolve Once, Dispatch Forever

Deep dive into lazy resolution for C++ runtime dispatch. How a self-patching function pointer eliminates per-call overhead after a one-time resolution cost, with assembly analysis, benchmarks, and thread safety considerations.

What if a function pointer could resolve itself on first use, then dispatch at zero cost forever after?

In a previous post, I compared four approaches to runtime dispatch in C++: virtual functions, function pointers, std::variant, and a template-based technique called decoupled CRTP. The template approach matched raw function pointers in steady-state performance, but I glossed over the interesting part: how it resolves the right implementation at runtime without paying for it on every call.

This post takes that mechanism apart. We’ll walk through the three states of a self-patching function pointer, look at the assembly, measure the resolution cost, and see where this pattern shows up in production (OpenJDK’s GC barrier dispatch).

The Problem

You have a set of concrete implementations compiled at build time. Templates give you composition and inlining within each implementation. But the user picks which one to use at runtime, via a flag or config file.

The obvious solutions each have a cost on every call:

  • Virtual dispatch: two dependent memory loads (vptr, then vtable entry)
  • Switch/if-else: a branch on every invocation
  • std::variant + std::visit: depending on your stdlib, this can compile down to its own function pointer table with per-call overhead for the lambda capture
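
To make the third bullet concrete, here is a minimal sketch of the std::variant alternative (the barrier types are hypothetical stand-ins, not code from the previous post):

```cpp
#include <cassert>
#include <variant>

// Hypothetical barrier types, for illustration only.
struct EpsilonBarrier {
    void store(int* addr, int value) { *addr = value; }
};
struct SerialBarrier {
    int marks = 0;
    void store(int* addr, int value) { *addr = value; ++marks; }
};

using Barrier = std::variant<EpsilonBarrier, SerialBarrier>;

// std::visit re-selects the active alternative on EVERY call; that
// per-call selection is exactly what lazy resolution avoids.
void dispatch_store(Barrier& b, int* addr, int value) {
    std::visit([&](auto& impl) { impl.store(addr, value); }, b);
}
```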

What if you could decide once and never pay again?

The Pattern

The idea is simple. You have a function pointer that starts pointing at a resolver stub. The first call runs the resolver, which figures out the right implementation, patches the pointer to point directly at it, and forwards the call. Every subsequent call goes through the patched pointer, which is now a direct function call. No vtable, no branch, no switch.

A note on terminology: “GC barriers” are bookkeeping hooks a runtime inserts around heap stores (e.g. marking a card dirty), unrelated to CPU memory barriers or std::atomic_thread_fence. Different GC algorithms need different hooks: some need none, some only a post-store hook, some both pre- and post-store hooks. The exact semantics don’t matter here; what matters is that the choice is fixed at startup.

Here’s the minimal version. The previous post used printf for barrier side effects; here we use a volatile int sink instead, which gives the compiler a visible side effect without the I/O overhead that would dominate the benchmark.

using StoreFn = void(*)(int*, int);
static volatile int sink;

// Three concrete implementations
void epsilon_store(int* addr, int value) { *addr = value; }
void serial_store(int* addr, int value)  { *addr = value; sink = 1; }
void g1_store(int* addr, int value)      { sink = *addr; *addr = value; sink = 1; }

// Forward declaration
void store_init(int* addr, int value);

// The function pointer. Starts pointing at the resolver.
StoreFn _store_func = &store_init;

// Resolver: runs once, patches the pointer, forwards the call
void store_init(int* addr, int value) {
    StoreFn func;
    switch (runtime_gc_choice()) {
        case G1:      func = &g1_store;      break;
        case Serial:  func = &serial_store;  break;
        case Epsilon: func = &epsilon_store; break;
        default:      func = &epsilon_store; break;  // unreachable in practice; keeps func initialized
    }
    _store_func = func;    // patch
    func(addr, value);     // forward
}

// Public API
void store(int* addr, int value) {
    _store_func(addr, value);
}

Three States of the Pointer

The function pointer goes through three states during the program’s lifetime:

flowchart LR
    S(( )) -->|program start| A[Unresolved]
    A -->|first call| B[Resolving]
    B -->|pointer patched| C[Resolved]
    C --> H[ ] -->|all subsequent calls| C
    style H width:0px,height:0px,padding:0px,margin:0px,opacity:0
    linkStyle 4 stroke-dasharray: 0

State 1: Unresolved. The pointer holds the address of store_init. No runtime choice has been made yet.

State 2: Resolving. The first call to store() invokes store_init. It reads the runtime configuration, picks the right implementation, writes the function address into _store_func, and then tail-calls the resolved function so the first invocation still gets the right answer.

State 3: Resolved. The pointer now holds the address of the concrete implementation (e.g., &g1_store). Every subsequent call is a single indirect call through a global. The resolver is never called again.

In concrete terms, the global _store_func changes from holding the address of store_init (say, 0x401234) to holding the address of g1_store (say, 0x401180). That single pointer write is the entire cost of resolution.
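
That state change is directly observable. Here is a compilable sketch of the three states, with runtime_gc_choice stubbed to always pick Serial (an assumption for illustration):

```cpp
#include <cassert>

using StoreFn = void(*)(int*, int);
static volatile int sink;

void epsilon_store(int* addr, int value) { *addr = value; }
void serial_store(int* addr, int value)  { *addr = value; sink = 1; }

void store_init(int* addr, int value);      // resolver stub, defined below
StoreFn _store_func = &store_init;          // state 1: Unresolved

enum GcChoice { Epsilon, Serial };
GcChoice runtime_gc_choice() { return Serial; }   // stubbed: always Serial

void store_init(int* addr, int value) {     // state 2: Resolving
    StoreFn func = (runtime_gc_choice() == Serial) ? &serial_store
                                                   : &epsilon_store;
    _store_func = func;                     // the patch: one pointer write
    func(addr, value);                      // forward the first call
}

void store(int* addr, int value) {          // state 3 afterwards: Resolved
    _store_func(addr, value);
}
```

Comparing `_store_func` against `&store_init` before and after the first `store()` call shows the single pointer write that constitutes resolution.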

What the Compiler Generates

After resolution, the store() function compiles down to:

; store() - steady state
jmp   *_store_func(%rip)       ; tail-call through global function pointer

That’s it. One instruction. The pointer lives at a fixed RIP-relative address, so there’s no object dereference, no vptr load, no vtable indexing. Just a jump through a global.

Compare that to virtual dispatch:

; virtual dispatch
movq  (%rdi), %rax             ; LOAD 1: vptr from object
call  *16(%rax)                ; LOAD 2: vtable entry → indirect call

Two dependent loads. The second can’t start until the first completes.
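
For reference, this is the C++ shape that produces those two dependent loads (a hypothetical virtual-dispatch equivalent, not code from the post's benchmark):

```cpp
#include <cassert>

// Hypothetical virtual-dispatch version of the same barrier choice.
struct BarrierBase {
    virtual void store(int* addr, int value) = 0;
    virtual ~BarrierBase() = default;
};

struct SerialBarrier : BarrierBase {
    int marks = 0;
    void store(int* addr, int value) override { *addr = value; ++marks; }
};

// The call site cannot know the target: the compiler must load the vptr
// from *b, then the function address from the vtable, then call indirectly.
void dispatch(BarrierBase* b, int* addr, int value) {
    b->store(addr, value);
}
```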

The store_init resolver itself is straightforward:

; store_init() - runs once
store_init:
    movq  g_gc_name(%rip), %rdi    ; load the config string
    call  resolve                   ; returns function address in %rax
    movq  %rax, _store_func(%rip)  ; PATCH the global pointer
    jmp   *%rax                    ; tail-call the resolved function

After store_init patches _store_func, it’s never called again. The jmp *%rax at the end forwards the first call to the correct implementation without the caller knowing anything happened.

Benchmarks

All measurements on the same hardware as the previous post (Intel Xeon Gold 6130). 100M iterations, GCC 11 at -O2 -march=skylake-avx512, pinned to a single core with taskset. The new measurement here is the resolution cost (how long the first call takes when it triggers the resolver).

| GC | Resolution (first call) | Lazy steady-state | Direct function pointer | Difference |
|---|---|---|---|---|
| G1 | ~47 ns | 2.87 ns/call | 2.87 ns/call | ~0.00 ns |
| Serial | ~69 ns | 3.35 ns/call | 3.35 ns/call | ~0.00 ns |
| Epsilon | ~43 ns | 2.87 ns/call | 2.87 ns/call | ~0.00 ns |

The “Direct function pointer” column is a plain function pointer assigned once at startup (no lazy resolution machinery), called from the same global-pointer path. The “Difference” column is the steady-state cost of lazy resolution minus the direct function pointer: effectively zero.

The resolution cost is about 43-69 ns and it happens exactly once. That cost is dominated by the indirect branch through the unresolved pointer (a cold branch target the predictor hasn’t seen) plus the switch. The variance comes from CPU frequency scaling on the first call. After patching, the branch predictor learns the target and subsequent calls are indistinguishable from a direct function pointer. At 100M iterations, the one-time cost amortizes to well under 0.001 ns per call.
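
The amortization claim is just division: the worst-case one-time cost spread across the benchmark's call count.

```cpp
#include <cassert>

// Amortized overhead: one-time resolution cost spread over N calls.
double amortized_ns_per_call(double resolution_ns, double num_calls) {
    return resolution_ns / num_calls;
}
```

With the table's worst case of ~69 ns over 100M calls, this comes out below a thousandth of a nanosecond per call.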

For context, here’s how it stacks up against all four approaches from the previous post (G1 barriers, same machine):

| Approach | ns/call | Dispatch overhead | Notes |
|---|---|---|---|
| Direct call (baseline) | 1.44 ns | — | Non-polymorphic, compile-time known target |
| Function pointer | 2.39 ns | +0.95 ns | One indirect call via register, no composition |
| Decoupled CRTP (lazy) | 2.39 ns | +0.95 ns | One indirect call via global, composable barriers |
| Virtual dispatch | 2.87 ns | +1.43 ns | Two dependent loads (vptr + vtable entry) |
| std::variant + std::visit | 3.72 ns | +2.28 ns | Closed type set, stdlib-dependent overhead |

Lazy resolution matches function pointers in steady state, but it also gives you the composition that function pointers lack.

The benchmarks show the pattern works. But correctness matters more than speed. Two things can go wrong: concurrent resolution and type mismatches. The next two sections address each.

Thread Safety

In OpenJDK, this resolution happens during VM initialization, before any application threads start. The JVM is single-threaded at that point, so no synchronization is needed.

But what if your application isn’t single-threaded at initialization time? Two threads could call store() simultaneously, both seeing the unresolved pointer, both entering store_init. On x86, aligned pointer-sized writes are atomic at the hardware level, so you won’t get a torn pointer. But the C++ memory model doesn’t guarantee this.

The safe portable approach is std::atomic<StoreFn> with memory_order_relaxed:

std::atomic<StoreFn> _store_func{&store_init};

void store(int* addr, int value) {
    _store_func.load(std::memory_order_relaxed)(addr, value);
}

void store_init(int* addr, int value) {
    StoreFn func = resolve();
    _store_func.store(func, std::memory_order_relaxed);
    func(addr, value);
}

memory_order_relaxed is enough here because we don’t need ordering guarantees between threads. If two threads race to resolve, they’ll both compute the same answer (the GC choice is immutable after startup) and both write the same pointer. The worst case is that the resolver runs twice, which is harmless.

This only works when the resolver is idempotent: the same inputs must always produce the same output, with no observable side effects. If your resolution logic allocates resources, registers callbacks, or modifies shared state, a double-resolve becomes a bug. In that case, use std::call_once or an equivalent single-initialization guard instead of relaxed atomics.
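
A sketch of that call_once variant, under the same simplified setup as above (serial_store and the resolver's side effect are illustrative):

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

using StoreFn = void(*)(int*, int);

void serial_store(int* addr, int value) { *addr = value; }
void store_init(int* addr, int value);

std::once_flag resolve_once;
std::atomic<StoreFn> _store_func{&store_init};
int resolver_runs = 0;   // stands in for a non-idempotent side effect

void store_init(int* addr, int value) {
    // call_once serializes racing first callers: the body runs exactly once
    // even if several threads enter store_init before the patch lands.
    std::call_once(resolve_once, [] {
        ++resolver_runs;   // e.g. register a callback, allocate a resource
        _store_func.store(&serial_store, std::memory_order_release);
    });
    _store_func.load(std::memory_order_acquire)(addr, value);   // forward
}

void store(int* addr, int value) {
    _store_func.load(std::memory_order_acquire)(addr, value);
}
```

Note the acquire load on the hot path: it pairs with the release store so that a thread observing the patched pointer also observes the resolver's side effects, which relaxed ordering would not guarantee.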

Type Safety Without RTTI

In the previous post, I flagged a correctness concern with decoupled CRTP: the static_cast targets a global singleton whose concrete type is a runtime decision. If the resolver pairs the wrong type with the wrong AccessBarrier specialization, you get undefined behavior. How does the resolver know the type?

C++ RTTI (dynamic_cast) would work, but OpenJDK avoids it entirely. Instead, each BarrierSet carries a lightweight type identity based on a bitset:

// OpenJDK: utilities/fakeRttiSupport.hpp (simplified)
template <typename T, typename TagType>
class FakeRttiSupport {
    uintx     _tag_set;       // bitset: which types this instance "is-a"
    TagType   _concrete_tag;  // the leaf type

    bool has_tag(TagType tag) const {
        return (_tag_set & (1u << tag)) != 0;
    }

    FakeRttiSupport add_tag(TagType tag) const {
        return FakeRttiSupport(_concrete_tag, _tag_set | (1u << tag));
    }
};

Each class in the hierarchy adds its own tag during construction. When G1BarrierSet is created, the tag set accumulates through the constructor chain:

// G1BarrierSet starts with its own tag
G1BarrierSet::G1BarrierSet(...)
    : CardTableBarrierSet(..., FakeRtti(BarrierSet::G1BarrierSet))
    // tag_set = {G1BarrierSet}

// CardTableBarrierSet adds its tag
CardTableBarrierSet::CardTableBarrierSet(..., const FakeRtti& fake_rtti)
    : ModRefBarrierSet(..., fake_rtti.add_tag(BarrierSet::CardTableBarrierSet))
    // tag_set = {G1BarrierSet, CardTableBarrierSet}

// ModRefBarrierSet adds its tag
ModRefBarrierSet::ModRefBarrierSet(..., const FakeRtti& fake_rtti)
    : BarrierSet(..., fake_rtti.add_tag(BarrierSet::ModRef))
    // tag_set = {G1BarrierSet, CardTableBarrierSet, ModRef}

The tag set builds up as each constructor delegates to its parent:

| Constructor | Adds tag | Cumulative tag_set |
|---|---|---|
| G1BarrierSet | G1BarrierSet | {G1BarrierSet} |
| CardTableBarrierSet | CardTableBarrierSet | {G1BarrierSet, CardTableBarrierSet} |
| ModRefBarrierSet | ModRef | {G1BarrierSet, CardTableBarrierSet, ModRef} |
---
config:
  flowchart:
    curve: stepBefore
---
flowchart LR
    A["G1BarrierSet()\ntag_set: 001"] -->|add_tag| B["CardTableBarrierSet()\ntag_set: 011"]
    B -->|add_tag| C["ModRefBarrierSet()\ntag_set: 111"]
    C -->|add_tag| D["BarrierSet()\ntag_set: 111"]

By the time construction finishes, a G1BarrierSet has concrete_tag = G1BarrierSet and a tag set that includes every class in its inheritance chain. A checked downcast is then a bitwise AND:

template <typename T>
T* barrier_set_cast(BarrierSet* bs) {
    assert(bs->is_a(BarrierSet::GetName<T>::value));  // bitwise AND check
    return static_cast<T*>(bs);
}

This is cheaper than dynamic_cast (one AND vs. walking the type hierarchy) and works in codebases compiled with -fno-rtti. The tag set is bounded by the number of bits in uintx (typically 64), which is more than enough for a GC hierarchy.
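
A standalone model of the whole mechanism fits in a few lines (the names here are illustrative, not OpenJDK's exact API):

```cpp
#include <cassert>
#include <cstdint>

// Tag-bitset "fake RTTI": each type in the hierarchy owns one bit.
enum BarrierName : unsigned { ModRef = 0, CardTable = 1, G1 = 2 };

struct TagSet {
    std::uint64_t bits = 0;
    constexpr TagSet add(BarrierName n) const {
        return TagSet{bits | (std::uint64_t{1} << n)};
    }
    constexpr bool has(BarrierName n) const {   // one AND, one compare
        return (bits & (std::uint64_t{1} << n)) != 0;
    }
};

// Construction accumulates tags leaf-first along the constructor chain.
constexpr TagSet g1_tags        = TagSet{}.add(G1).add(CardTable).add(ModRef);
constexpr TagSet card_only_tags = TagSet{}.add(CardTable).add(ModRef);
```

The `has` check is the entire cost of a checked downcast, and since everything is constexpr here, the tag sets can even be verified at compile time.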

When to Use This

Lazy resolution works well when:

  • The choice is made once and never changes. GC selection, logging backend, rendering pipeline. If the strategy could change mid-execution, you need something else.
  • The call is on a hot path. If you’re calling this 100M times per second, the difference between one indirect call and two dependent loads (virtual dispatch) matters. If you’re calling it once per request, use virtual dispatch.
  • You want composition. This is what separates it from raw function pointers. The resolved function can be the endpoint of a template inheritance chain that composes concerns at compile time.

It’s overkill when:

  • You need per-object dispatch. Lazy resolution works on globals/singletons. If different objects need different strategies, virtual dispatch is the right tool.
  • The strategy set changes at runtime. Plugin systems where users can load new strategies dynamically need a different pattern.

If This Looks Familiar

If you’ve worked with ELF binaries, you’ve already used this pattern. PLT (Procedure Linkage Table) stubs in ELF dynamic linking work the same way: the first call to a dynamically linked function goes through a stub that resolves the symbol and patches the GOT (Global Offset Table) entry. Every subsequent call goes through the patched GOT entry directly. Lazy resolution applies the same principle at the application level, with the resolver under your control instead of the dynamic linker’s.

Key Takeaways

  1. Lazy resolution trades a one-time startup cost for zero per-call overhead. After resolution, it’s indistinguishable from a direct function pointer.
  2. The pattern has three moving parts: a function pointer, a resolver stub, and a patch. The resolver runs once and then gets out of the way.
  3. Combined with decoupled CRTP, lazy resolution gives you both composition (template inheritance chains) and zero-overhead dispatch. That’s the combination that makes it compelling over raw function pointers.
  4. OpenJDK has used this pattern since JDK 10 for GC barrier dispatch, with FakeRtti providing type-safe downcasts at the cost of a single bitwise AND, no C++ RTTI required.

Appendix: How OpenJDK Does It

The pattern above is a simplified version of what OpenJDK has used since JDK 10 for GC barrier dispatch. If you’re curious how this looks in a production codebase, here’s the mapping.

In OpenJDK, the equivalent of our _store_func lives in a RuntimeDispatch template. It holds one function pointer per access type (load, store, CAS, etc.), each starting at a resolver stub. The resolution switch lives in access.inline.hpp:

// OpenJDK: access.inline.hpp (simplified)
BarrierResolver::resolve_barrier_gc() {
    switch (BarrierSet::barrier_set()->kind()) {
        case BarrierSet::G1:
            return G1BarrierSet::AccessBarrier<decorators>::oop_store;
        case BarrierSet::CardTable:
            return CardTableBarrierSet::AccessBarrier<decorators>::oop_store;
        // ...
    }
}

The pieces map directly to our simplified version:

| Simplified version | OpenJDK |
|---|---|
| StoreFn _store_func | RuntimeDispatch::_store_func |
| store_init() | RuntimeDispatch::store_init() |
| switch (kind) | BarrierResolver::resolve_barrier_gc() |
| _store_func = func | patching in xxx_init |

Here’s how it all connects:

flowchart LR
    A["HeapAccess::store()"] --> B["RuntimeDispatch::store()"]
    B --> C{"_store_func"}
    C -.->|first call| D["store_init()"]
    D -.-> E["resolve_barrier_gc()"]
    E -.->|func ptr| D
    D -.->|patch + tail-call| G["G1::AccessBarrier::store()"]
    C -->|subsequent| G

The caller (HeapAccess::store) never changes. It always calls through RuntimeDispatch::store, which always calls through _store_func. The only thing that changes is what _store_func points to, and that changes exactly once.

The production implementation also includes a decorator-based template metaprogramming layer for composing barrier concerns, but that’s orthogonal to the lazy resolution pattern itself.
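
To give a feel for the template wrapping, here is a hypothetical sketch of a RuntimeDispatch-style class where each instantiation owns its own self-patching pointer (the int parameter stands in for OpenJDK's decorator set; this is not the production code):

```cpp
#include <cassert>

static volatile int sink;

// Each instantiation gets its own pointer, resolver, and resolved target.
template <int Kind>                 // stand-in for the decorator set
struct RuntimeDispatch {
    using StoreFn = void(*)(int*, int);

    static void resolved_store(int* addr, int value) {
        *addr = value;
        sink = Kind;                // visible per-kind side effect
    }

    static void store_init(int* addr, int value) {
        _store_func = &resolved_store;   // patch this instantiation only
        _store_func(addr, value);        // forward the first call
    }

    static inline StoreFn _store_func = &store_init;   // C++17 inline variable

    static void store(int* addr, int value) { _store_func(addr, value); }
};
```

Because the pointer is a static member of the template, resolving `RuntimeDispatch<1>` leaves `RuntimeDispatch<2>` untouched and still unresolved, mirroring how each access-type/decorator combination resolves independently.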


The benchmark code and all examples are in the companion repository. Measured on Intel Xeon Gold 6130, GCC 11.4, libstdc++, -O2 -march=skylake-avx512, pinned to a single core with taskset -c 0. Best of 3 runs reported.

This post is licensed under CC BY 4.0 by the author.