kit

kit
git clone https://git.ryansepassi.com/git/kit.git
Log | Files | Refs | README

LTO / Whole-Program Optimization (planned work)

This is the forward-looking plan for link-time optimization in kit: making a library or executable look like a single translation unit to the optimizer, so inlining, dead-code elimination, internalization, and the rest of the interprocedural family can cross TU boundaries. It deliberately does not target GCC/Clang LTO bitcode compatibility. The initial scope is kit invocations that provide all sources up front (kit cc *.c -O2 -flto -o prog); separately compiled IR objects are a later phase that reuses the same core.

The optimizer baseline this builds on — the recording IR, the recording/optimizing boundary, the finalize path, and the pass catalog — is in ../OPT.md and OPTIMIZER.md. The link-time symbol model is in LINKER.md. The CG/object lifetime boundary used by the remaining Phase 1 staging work is in CG_OBJ_LIFECYCLE.md. This document treats those as given and describes only the LTO-specific additions.

The headline finding from investigating the tree: most of the machinery for whole-program optimization already exists; it is just per-TU, single-arch, and partly unreached. LTO here is three concrete refactors plus wiring, not a new subsystem. The largest of the three is factoring the linker's symbol-resolution policy out so it can run at merge time as well as at link time.

Status (2026-06-04)

Phase 0 is complete and shipping; Phase 1's all-sources-up-front LTO path is implemented in this branch. The end state is not a C-only shortcut: every source-building verb routes through one staging engine, and every in-tree frontend declares either semantic CG staging or opaque-object participation. The link-picture-driven preserved/export prepass now feeds kit_cg_finish, and executable LTO internalizes non-preserved globals before the whole-module reachability walk. Where reality diverged from the original wording below:

Done

Phase 1 source-staging checklist

Baseline (what already exists)

A handful of facts about the current code path frame everything below.

The merged module, then, is not something LTO must build. It is something the recorder already builds and the finalize path already consumes — for one TU, on one arch. LTO is mostly about not tearing it down between TUs, generalizing the finalize pass, and applying real resolution policy as the merge happens.

1. Design decision: shared context, not clone-and-merge

Two architectures can make the optimizer see one module:

  1. Clone-and-merge. Each TU records into its own CgIrModule/ObjBuilder; an IR-linker deep-copies every function into a merged module, rebuilds the symbol table, and remaps every operand/reloc/alias to merged ids.
  2. Shared context. All TUs record into one live session — one ObjBuilder, one recorder, one CgIrModule — so globals unify in place via the existing decl interning and the finalize path sees the union directly.

We choose shared context. The comparison:

Shared context Clone + remap
Global identity Free (decl already interns by name) Rebuild symbol table + remap every operand/reloc/alias
Memory / time Record once, in place Duplicate all IR into a merge arena
Resolution policy Apply at the per-TU merge boundary Apply in the merge pass
Local distinctness Skip-intern locals (small CG change) Falls out of remap
Lifecycle cost Staging mode + cross-TU arena lifetime None — TUs stay independent
Net new code Mostly wiring + policy extraction A full cloner/remapper on the hot path
Serialized objects (Phase 2) Deserialize = replay records through the same recording/merge API A separate clone-from-bytes engine

Clone's only advantage is that TUs stay fully independent, so there is no staging lifecycle to manage. That is not worth re-implementing the symbol merge the linker already knows how to do, nor the per-TU IR duplication. The decisive row is the last one: shared context makes the recording/merge API the single funnel. A frontend feeds it; a .kit.ir deserializer feeds it the same way. There is no second merge engine to build for Phase 2 — fat-object LTO becomes "replay serialized decl/func records into the live shared module," reusing the same local-handling and resolution code paths verbatim.

The rest of this document describes the shared-context design.

2. Symbol identity: what unifies, what must stay distinct

Shared context gets global unification for free (§Baseline). The one correctness trap is local symbols, and there is exactly one rule to add.

kit_cg_decl interns every name through obj_symbol_find. For globals that is correct and desirable. For SB_LOCAL symbols it is wrong: two TUs each with static int x; (the frontend passes the bare name with LOCAL binding, lang/c/decl/decl.c:72) would collapse to one symbol. The same hazard exists for static functions, and for the per-TU counters behind block-scope statics (mint_static_local_sym, lang/c/parse/parse.c:660) and compound literals (mint_compound_literal_sym, lang/c/parse/parse_init.c:1012), which reset per parser and would produce colliding names like y.0 and __kit_compound_literal.2 across TUs.

Fix: for SB_LOCAL bindings, skip obj_symbol_find and always mint a fresh id. Consequences, all benign:

No frontend mangling is required. Anonymous read-only data (.Lkit_ro.N, src/cg/memory.c:102) needs no change at all once the session is shared, because the rodata_counter is no longer reset between TUs (see §4) and keeps climbing.

3. Resolution policy: factoring symresolve out of the linker

Today, if we naively share one ObjBuilder, we lose all symbol-resolution semantics: obj_symbol_define overwrites last-writer-wins, so two strong definitions silently clobber instead of raising an ODR error, strong-vs-weak becomes declaration-order dependent, and commons never merge. The precedence rules we need already exist in link_resolve_symbols (src/link/link_resolve.c:258): strong-vs-strong → ODR error (modulo COFF COMDAT/SELECTANY), strong beats weak, weak-weak keeps the first, common merging takes max size/align, and a definition beats a common.

Per the investigation, that logic is cleanly separable. The decision is pure over (name, bind, kind, size, align, common_align, defined?, in_comdat) tuples; only the bookkeeping — the globals SymHash, the per-input InputMap, COMDAT section discard, DSO iteration — is entangled with linker state.

Extract a small shared module, src/obj/symresolve.{h,c} (the obj layer is the natural home; both consumers sit above it):

// pure: no linker state, no allocation
SymMergeResult symresolve_merge(SymAttrs existing, SymAttrs incoming,
                                int coff_target);
// -> KEEP_EXISTING | REPLACE | MERGE_COMMON(size, align)
//  | COMDAT_DISCARD | ODR_ERROR

Move link_bind_strength, link_sym_is_def, and link_sym_is_spurious_undef (src/link/link_internal.h) alongside it. Then:

Two mechanical needs fall out of sharing one builder:

The opaque inputs in a link (libc, crt, kit archives, DSOs) are still resolved by the linker at link time against the single emitted LTO object. So the policy module has two call sites — recording-time merge among the LTO set, and link-time resolution against everything — which is the justification for extracting it rather than duplicating it.

4. The staging lifecycle

The lifecycle target for Phase 1 is documented in CG_OBJ_LIFECYCLE.md. The short version: ObjBuilder owns object lifetime, while KitCg borrows an object, records one or more semantic units, and finishes codegen into that object. kit_cg_finish is a CG flush/lowering/debug operation; it is not object finalization.

The old object-shaped bracket used to finalize (lowers + emits everything), null g->obj/g->target, and reset per-object state including rodata_counter (src/cg/session.c). The structural state is now a borrowed lifecycle:

The driver change is a shared staging engine: group every LTO-capable source input in command-line order, stage semantic frontends into the borrowed CG session and shared object, compile opaque frontends/objects as ordinary inputs, then finish CG once and substitute the resulting builder at the right place in the link order. The hook is build_compile_all in driver/cmd/build.c (shared by build-exe/build-lib/build-obj) and cc_run_link_exe — both already compile every source under one KitCompiler, which is the seam this loop replaces. (compile/compile_engine from the original plan were retired in favor of the build verbs on main.)

5. The export / preserved set

Internalizing a global — demoting it to hidden/local, which unlocks DCE and unconstrained inlining — is sound only when nothing outside the LTO set can reference it by name and it is not interposable. This is the one input that genuinely needs the full link picture, so it is computed at link time and handed to the LTO core. A symbol must be preserved if it is:

The linker already answers "is this symbol referenced from outside" for archive pull (scan_presence_before / member_satisfies, src/link/link_resolve.c:859, :923); the preserved set is the same question asked of the LTO set against the opaque inputs and the output-kind/visibility policy. Conservative default: internalize only for executable outputs or provably non-exported symbols.

Phase 1 implements this for all-sources-up-front executable LTO: the driver stages semantic sources, assembles the ordered link session, asks the linker for preserved LTO symbols, then passes those IDs to kit_cg_finish before object finalization. Relocatable and archive-member outputs remain conservative because later links may still reference globals by name. Shared-library LTO continues to reject until shared output policy is exercised.

6. The whole-program optimization core

With a merged module and a preserved set, the core is opt_emit_reachable_aarch64 generalized:

  1. Generalize the finalize sweep to all arches. Lift the ARM64-only path in opt_on_finalize into an arch-independent opt_whole_module_finalize, and switch x86-64/riscv64 from eager per-function emit to defer-to-finalize when the whole-program path is active. Keep -O0/-O1 streaming and the JIT/interp/run/dbg/emu paths on the existing eager path — LTO is an AOT concern. The one verification item is that nothing downstream depends on x64/rv64 eager emission (opt_maybe_capture_interp).
  2. Internalize non-preserved globals using the §5 set.
  3. GC unreachable functions and data (the existing reachability walk, now over the whole program).
  4. Lower the reachable set into a FuncSet and run opt_inline — the already-written, never-called whole-program inliner — with a real cost model.
  5. Emit one object and substitute it for the IR inputs before the final link.

Steps 1 and 4 are independently valuable and land first (Phase 0): they turn the unreached inliner and the generalized sweep into a tested, shipping path on a single TU before any cross-TU complexity exists.

7. Phased delivery

Phase 0 — Whole-translation-unit optimization. No merge, no serialization, no driver changes. Generalize the finalize sweep to all arches, switch x64/rv64 to defer-to-finalize under the whole-program path, and wire opt_inline over the reachable FuncSet. Delivers real cross-function inlining within a TU at -O2 on every arch, generalized dead-static elimination, and the inliner finally exercised on real code. Lowest risk — purely inside the optimizer — and it validates the deferred-emit path that Phase 1's staging lifecycle also relies on.

Phase 1 — Shared-context, all-sources-up-front LTO. The target case, kit cc *.c -flto -o prog and kit build-exe -flto (and build-lib/build-obj; build-obj replaced the retired compile). Build on Phase 0 by adding: (a) the symresolve extraction (§3), (b) the ObjBuilder name index (§3), (c) skip-intern for locals (§2), (d) the KitCg/ObjBuilder borrowed staging lifecycle and the driver loop that records N frontends into one session and finishes CG once (§4), (e) the preserved set fed from the assembled link into kit_cg_finish (§5). No cloner, no serialization, no archive support yet.

Phase 2 — Serialized IR objects (.kit.ir). Optional follow-on for separate compilation, archives, and build caches. kit cc -c -flto a.c emits a normal object whose symbol table is the real decl set — so the linker's symbol-driven archive pull works unchanged — plus a .kit.ir custom section (the object model already supports arbitrary SEC_OTHER sections) carrying a serialized CgIrModule. The linker detects the section and replays the records into the same shared context through the same recording/merge API, reusing skip-intern-locals and symresolve_merge verbatim. Archives of IR objects work because pull is symbol-table driven. Note: whole-program LTO is incompatible with the file-based incremental linker (LINKER.md); -flto forces a full link.

8. Optimizations unlocked

Inlining is the headline; the merged-module + FuncSet framework makes a whole interprocedural family expressible (listed as enabled, not committed):

9. Risks, semantics, and limitations

10. First slices

Two independently landable, low-risk steps that de-risk the whole direction before any LTO surface exists:

  1. Extract symresolve and refactor link_resolve_symbols onto it. Pure refactor, covered by test/link. Lands the load-bearing piece and improves the linker regardless of LTO.
  2. Add the ObjBuilder name -> id index behind the existing obj_symbol_find/obj_symbol_ex API. Drop-in; measurable on its own.

Then Phase 0 (generalize the sweep + wire opt_inline, gated behind -O2 / -fwhole-program), validated first on x86-64 with red-green tests in test/opt (a caller+callee that should fuse) and test/smoke/x64 (behavioral parity). Then the staging lifecycle and skip-intern-locals behind -flto, exercised first on a two-TU test/smoke case where a cross-TU callee inlines.

Open questions