kit

kit
git clone https://git.ryansepassi.com/git/kit.git
Log | Files | Refs | README

commit 438e6b094ea31ed5ec9787d451e59ddc9a7770f9
parent bde9b5848be4f7d9b2587abaaf7c4ad02417143c
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Thu,  4 Jun 2026 06:56:18 -0700

doc/plan: add LTO / whole-program optimization plan

Forward-looking design for making a library/executable look like a single
TU to the optimizer. Chooses shared-context recording (all sources record
into one session/ObjBuilder/CgIrModule) over clone-and-merge, since globals
already intern by name, the recorder already accumulates a whole module, and
the finalize sweep + unreached opt_inline already exist. Centers the work on
factoring the linker's resolution policy into a shared symresolve module
usable at merge time. Phased: whole-TU opt, all-sources LTO, serialized
.kit.ir objects.

Diffstat:
Adoc/plan/LTO.md | 369+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 369 insertions(+), 0 deletions(-)

diff --git a/doc/plan/LTO.md b/doc/plan/LTO.md @@ -0,0 +1,369 @@ +# LTO / Whole-Program Optimization (planned work) + +This is the forward-looking plan for link-time optimization in kit: making a +library or executable look like a single translation unit to the optimizer, so +inlining, dead-code elimination, internalization, and the rest of the +interprocedural family can cross TU boundaries. It deliberately does **not** +target GCC/Clang LTO bitcode compatibility. The initial scope is kit invocations +that provide all sources up front (`kit cc *.c -O2 -flto -o prog`); separately +compiled IR objects are a later phase that reuses the same core. + +The optimizer baseline this builds on — the recording IR, the +recording/optimizing boundary, the finalize path, and the pass catalog — is in +[../OPT.md](../OPT.md) and [OPTIMIZER.md](OPTIMIZER.md). The link-time symbol +model is in [LINKER.md](LINKER.md). This document treats those as given and +describes only the LTO-specific additions. + +The headline finding from investigating the tree: **most of the machinery for +whole-program optimization already exists; it is just per-TU, single-arch, and +partly unreached.** LTO here is three concrete refactors plus wiring, not a new +subsystem. The largest of the three is factoring the linker's symbol-resolution +policy out so it can run at merge time as well as at link time. + +## Baseline (what already exists) + +A handful of facts about the current code path frame everything below. + +- **Globals already intern by name within an `ObjBuilder`.** `kit_cg_decl` does + `obj_symbol_find` then reuse-or-create (`src/cg/session.c:198`). Two frontends + that `decl` `foo` into the *same* builder receive the *same* `ObjSymId`. The + CG and optimizer IRs reference call targets and globals by `ObjSymId` + (`IRCallAux.desc.callee.v.global.sym`), so a caller's `call foo` already points + at the id the definer will define — no remap, no clone. This is the load-bearing + fact for the whole design. +- **The recorder already accumulates a whole module.** One `CgIrRecorder` owns + one `CgIrModule` and appends every `func_begin`/`func_end` into it + (`src/cg/ir_recorder.c`), flushing only at `finalize`. `CgIrModule` + (`src/cg/ir.h:270`) holds all functions, aliases, and file-scope asm. Per + function it carries `call_refs` and `global_refs` symbol sets + (`src/cg/ir.h:247`) — the call/use graph is materialized during recording. +- **The optimizer already finalizes over the whole module — for one arch.** + `opt_on_finalize` (`src/opt/opt.c:566`) hands the entire `CgIrModule` to + `opt_emit_reachable_aarch64` (`src/opt/opt.c:495`), which seeds a root set + (non-`LOCAL` symbols, `KIT_CG_SYM_USED` locals, alias targets, exported data + relocs), walks each function's `call_refs`/`global_refs` plus the data-reloc + graph, **removes unreachable local symbols**, then lowers + optimizes + emits + only what is live. This is whole-program GC for one TU. x86-64 and riscv64 + instead emit eagerly per function in `opt_on_func` (`src/opt/opt.c:322`); they + have no module pass. +- **The whole-program inliner exists and is unreached.** `opt_inline(FuncSet*, + max_iters)` (`src/opt/pass_inline.c:667`) does topologically ordered, + growth-gated, call-graph inlining over a `FuncSet` of lowered `Func`s. Only the + streaming tiny variant `opt_try_tiny_inline` (cost cap 8, straightline only) + runs today. The real inliner has never had a caller. See OPTIMIZER.md §6. +- **The driver already shares one `KitCompiler` across sources** and keeps + objects in memory through link. `cc_run_link_exe` compiles each source to its + own in-memory `KitObjBuilder` under one compiler (`driver/cmd/cc.c:2655`, + `objs[]` at `:2585`) and hands the builders straight to the link session via + `kit_link_session_add_obj` (`:2735`/`:2771`) — no temp `.o` files. The + orchestration seam for LTO is already where it needs to be. +- **The obj layer has no resolution policy and no name index.** + `obj_symbol_define` is last-writer-wins with no precedence check + (`src/obj/obj.c:544`), and `obj_symbol_find` is a linear scan + (`src/obj/obj.c:534`). The only resolution rule anywhere below the linker is + the weak-demotion special case hand-coded into `kit_cg_decl` + (`src/cg/session.c:203`). All real precedence lives in `link_resolve_symbols` + (`src/link/link_resolve.c:258`). + +The merged module, then, is not something LTO must *build*. It is something the +recorder already builds and the finalize path already consumes — for one TU, on +one arch. LTO is mostly about *not tearing it down between TUs*, *generalizing +the finalize pass*, and *applying real resolution policy as the merge happens*. + +## 1. Design decision: shared context, not clone-and-merge + +Two architectures can make the optimizer see one module: + +1. **Clone-and-merge.** Each TU records into its own `CgIrModule`/`ObjBuilder`; + an IR-linker deep-copies every function into a merged module, rebuilds the + symbol table, and remaps every operand/reloc/alias to merged ids. +2. **Shared context.** All TUs record into *one* live session — one + `ObjBuilder`, one recorder, one `CgIrModule` — so globals unify in place via + the existing decl interning and the finalize path sees the union directly. + +We choose **shared context**. The comparison: + +| | Shared context | Clone + remap | +|---|---|---| +| Global identity | Free (decl already interns by name) | Rebuild symbol table + remap every operand/reloc/alias | +| Memory / time | Record once, in place | Duplicate all IR into a merge arena | +| Resolution policy | Apply at the per-TU merge boundary | Apply in the merge pass | +| Local distinctness | Skip-intern locals (small CG change) | Falls out of remap | +| Lifecycle cost | Staging mode + cross-TU arena lifetime | None — TUs stay independent | +| Net new code | Mostly wiring + policy extraction | A full cloner/remapper on the hot path | +| Serialized objects (Phase 2) | Deserialize = replay records through the same recording/merge API | A separate clone-from-bytes engine | + +Clone's only advantage is that TUs stay fully independent, so there is no +staging lifecycle to manage. That is not worth re-implementing the symbol merge +the linker already knows how to do, nor the per-TU IR duplication. The decisive +row is the last one: shared context makes the recording/merge API the single +funnel. A frontend feeds it; a `.kit.ir` deserializer feeds it the same way. +There is no second merge engine to build for Phase 2 — fat-object LTO becomes +"replay serialized decl/func records into the live shared module," reusing the +same local-handling and resolution code paths verbatim. + +The rest of this document describes the shared-context design. + +## 2. Symbol identity: what unifies, what must stay distinct + +Shared context gets global unification for free (§Baseline). The one correctness +trap is **local symbols**, and there is exactly one rule to add. + +`kit_cg_decl` interns *every* name through `obj_symbol_find`. For globals that is +correct and desirable. For `SB_LOCAL` symbols it is wrong: two TUs each with +`static int x;` (the frontend passes the bare name with LOCAL binding, +`lang/c/decl/decl.c:72`) would collapse to one symbol. The same hazard exists for +static functions, and for the per-TU counters behind block-scope statics +(`mint_static_local_sym`, `lang/c/parse/parse.c:660`) and compound literals +(`mint_compound_literal_sym`, `lang/c/parse/parse_init.c:1012`), which reset per +parser and would produce colliding names like `y.0` and +`__kit_compound_literal.2` across TUs. + +**Fix:** for `SB_LOCAL` bindings, skip `obj_symbol_find` and always mint a fresh +id. Consequences, all benign: + +- Two locals named `x` get distinct `ObjSymId`s. Duplicate `STB_LOCAL` names in + one object are legal in every format kit emits; locals never enter the global + name table; the optimizer indexes functions by id, not name. +- The frontend caches the id per `Decl`, so intra-TU reuse of a static is + unaffected — the second reference goes through the cached id, not a fresh decl. +- The static-vs-extern-same-name case resolves correctly: the static gets a fresh + id; an unrelated `extern foo` keeps the shared global id. + +No frontend mangling is required. Anonymous read-only data (`.Lkit_ro.N`, +`src/cg/memory.c:102`) needs no change at all once the session is shared, because +the `rodata_counter` is no longer reset between TUs (see §4) and keeps climbing. + +## 3. Resolution policy: factoring `symresolve` out of the linker + +Today, if we naively share one `ObjBuilder`, we lose all symbol-resolution +semantics: `obj_symbol_define` overwrites last-writer-wins, so two strong +definitions silently clobber instead of raising an ODR error, strong-vs-weak +becomes declaration-order dependent, and commons never merge. The precedence +rules we need already exist in `link_resolve_symbols` +(`src/link/link_resolve.c:258`): strong-vs-strong → ODR error (modulo COFF +COMDAT/SELECTANY), strong beats weak, weak-weak keeps the first, common merging +takes max size/align, and a definition beats a common. + +Per the investigation, that logic is cleanly separable. The *decision* is pure +over `(name, bind, kind, size, align, common_align, defined?, in_comdat)` +tuples; only the *bookkeeping* — the `globals` `SymHash`, the per-input +`InputMap`, COMDAT section discard, DSO iteration — is entangled with linker +state. + +**Extract a small shared module**, `src/obj/symresolve.{h,c}` (the obj layer is +the natural home; both consumers sit above it): + +```c +// pure: no linker state, no allocation +SymMergeResult symresolve_merge(SymAttrs existing, SymAttrs incoming, + int coff_target); +// -> KEEP_EXISTING | REPLACE | MERGE_COMMON(size, align) +// | COMDAT_DISCARD | ODR_ERROR +``` + +Move `link_bind_strength`, `link_sym_is_def`, and `link_sym_is_spurious_undef` +(`src/link/link_internal.h`) alongside it. Then: + +- **Refactor `link_resolve_symbols` onto it.** A pure cleanup with no behavior + change, fully covered by the `test/link` corpus, that gives the policy one + source of truth and leaves the linker better than we found it. +- **The LTO staging coordinator calls the same function** at the per-TU merge + boundary. Crucially this is a *binding-precedence* decision — which body wins — + not id remapping, because ids are already unified. When TU B contributes a body + for a global TU A already provided, `symresolve_merge` decides whether to keep + A's `CgIrFunc`/data, replace it with B's, merge commons, or raise ODR. The + loser's body is dropped from the module and its decl remains as a reference. + ODR conflicts are reported at the second definition's `SrcLoc` — better + diagnostics than the linker's post-hoc panic, because source locations still + exist at this point. + +Two mechanical needs fall out of sharing one builder: + +- **Give `ObjBuilder` a `name -> id` hash map.** With the whole program's symbols + in one builder, the linear `obj_symbol_find` (`src/obj/obj.c:534`) is O(n²) at + decl time. The assembler already carries its own `SymSymMap` precisely because + obj lacks one (`src/asm/asm.c`). Adding the index to the builder removes the + quadratic and hosts the resolution hook — a win even for ordinary single-TU + compiles, and it lets the assembler shed its private map later. +- **One open question on `define` timing.** During pure recording a global is + *declared* (with binding) but not *defined* in the obj sense until finalize + emits its section/offset. So the "which body wins" decision must run at the + staging boundary against the set of bodies a TU contributes (it has a + `CgIrFunc` or data record for the symbol), not at `obj_symbol_define` time. + This is the linker's per-input symbol merge applied to `CgIrModule` + contributions. + +The opaque inputs in a link (libc, crt, kit archives, DSOs) are still resolved by +the linker at link time against the single emitted LTO object. So the policy +module has two call sites — recording-time merge among the LTO set, and +link-time resolution against everything — which is the justification for +extracting it rather than duplicating it. + +## 4. The staging lifecycle + +`kit_cg_end_obj` finalizes (lowers + emits everything), nulls `g->obj`/ +`g->target`, and resets per-object state including `rodata_counter` +(`src/cg/session.c`). That bracket assumes exactly one TU. LTO needs a staging +mode: + +- **Record each TU into the live session without finalizing**; run a single + `finalize` after the last TU. A new session entry point (a "stage" variant of + `begin_obj`/`end_obj`, or a recorder flag that defers `finalize_recorded`) lets + the recorder accumulate all TUs into the one `CgIrModule`. +- **The recording arena must outlive any single TU.** The recorder and module are + arena-allocated from `c->tu` today (`opt_cgtarget_new`, `cg_ir_recorder_new`). + For LTO they must come from a cross-TU arena (a dedicated LTO arena, or + `c->global`) so the accumulated IR survives across the per-TU frontend runs. + This is the one sharp edge of the lifecycle change and should be settled first. +- **Each TU keeps its own frontend state.** The per-TU `Pool`, `DeclTable`, and + type interning stay independent; only the CG session and `ObjBuilder` are + shared. The shared `KitCompiler` already spans sources today, so `c->global` + name interning is already consistent across TUs. + +The driver/`compile_engine` change is a loop: open one staging session, run the C +frontend once per source against it, finalize once, hand the single resulting +builder to the link session in place of the per-source builders. + +## 5. The export / preserved set + +Internalizing a global — demoting it to hidden/local, which unlocks DCE and +unconstrained inlining — is sound only when nothing outside the LTO set can +reference it by name and it is not interposable. This is the one input that +genuinely needs the full link picture, so it is computed at link time and handed +to the LTO core. A symbol must be **preserved** if it is: + +- the entry symbol (`main`/`_start`), or in the dynamic export set; for `-shared`, + default-visibility symbols are interposable and must **not** be internalized or + inlined across unless `-fvisibility=hidden` / a version script / `-Bsymbolic` + says otherwise; +- referenced (undefined) by any **opaque** input — libc/crt calling `main`, a + kit archive member that is not IR, a DSO; +- `__attribute__((used))`, in an init/fini array, named in inline or file-scope + asm, an IFUNC resolver, or address-significant in an opaque input. + +The linker already answers "is this symbol referenced from outside" for archive +pull (`scan_presence_before` / `member_satisfies`, `src/link/link_resolve.c:859`, +`:923`); the preserved set is the same question asked of the LTO set against the +opaque inputs and the output-kind/visibility policy. Conservative default: +internalize only for executable outputs or provably non-exported symbols. + +## 6. The whole-program optimization core + +With a merged module and a preserved set, the core is `opt_emit_reachable_aarch64` +generalized: + +1. **Generalize the finalize sweep to all arches.** Lift the ARM64-only path in + `opt_on_finalize` into an arch-independent `opt_whole_module_finalize`, and + switch x86-64/riscv64 from eager per-function emit to defer-to-finalize when + the whole-program path is active. Keep `-O0`/`-O1` streaming and the + JIT/interp/`run`/`dbg`/`emu` paths on the existing eager path — LTO is an AOT + concern. The one verification item is that nothing downstream depends on x64/rv64 + eager emission (`opt_maybe_capture_interp`). +2. **Internalize** non-preserved globals using the §5 set. +3. **GC** unreachable functions and data (the existing reachability walk, now over + the whole program). +4. **Lower the reachable set into a `FuncSet`** and run `opt_inline` — the + already-written, never-called whole-program inliner — with a real cost model. +5. **Emit one object** and substitute it for the IR inputs before the final link. + +Steps 1 and 4 are independently valuable and land first (Phase 0): they turn the +unreached inliner and the generalized sweep into a tested, shipping path on a +single TU before any cross-TU complexity exists. + +## 7. Phased delivery + +**Phase 0 — Whole-translation-unit optimization.** No merge, no serialization, +no driver changes. Generalize the finalize sweep to all arches, switch +x64/rv64 to defer-to-finalize under the whole-program path, and wire `opt_inline` +over the reachable `FuncSet`. Delivers real cross-function inlining within a TU at +`-O2` on every arch, generalized dead-static elimination, and the inliner finally +exercised on real code. Lowest risk — purely inside the optimizer — and it +validates the deferred-emit path that Phase 1's staging lifecycle also relies on. + +**Phase 1 — Shared-context, all-sources-up-front LTO.** The target case, +`kit cc *.c -O2 -flto -o prog` and `kit compile`. Build on Phase 0 by adding: +(a) the `symresolve` extraction (§3), (b) the `ObjBuilder` name index (§3), +(c) skip-intern for locals (§2), (d) the `kit_cg` staging lifecycle and the +driver loop that records N frontends into one session and finalizes once (§4), +(e) the preserved set fed from the assembled link (§5). No cloner, no +serialization, no archive support yet. + +**Phase 2 — Serialized IR objects (`.kit.ir`).** Optional follow-on for separate +compilation, archives, and build caches. `kit cc -c -flto a.c` emits a normal +object whose symbol table is the real decl set — so the linker's symbol-driven +archive pull works unchanged — plus a `.kit.ir` custom section (the object model +already supports arbitrary `SEC_OTHER` sections) carrying a serialized +`CgIrModule`. The linker detects the section and **replays the records into the +same shared context** through the same recording/merge API, reusing +skip-intern-locals and `symresolve_merge` verbatim. Archives of IR objects work +because pull is symbol-table driven. Note: whole-program LTO is incompatible with +the file-based incremental linker (LINKER.md); `-flto` forces a full link. + +## 8. Optimizations unlocked + +Inlining is the headline; the merged-module + `FuncSet` framework makes a whole +interprocedural family *expressible* (listed as enabled, not committed): + +- **Cross-TU / whole-program inlining** — `opt_inline`, already written. +- **Internalization** to hidden/local for non-exported globals — enables DCE, + removes PLT/GOT indirection, frees intra-function optimization. +- **Whole-program dead code / data elimination** — the generalized sweep. +- Future: devirtualization / direct-call promotion, IPSCCP and cross-function + constant/range propagation, argument promotion, identical-code folding, + `const`/`pure` inference, global-to-local-constant propagation. + +## 9. Risks, semantics, and limitations + +- **Resolution fidelity.** ODR, weak/strong, common merging, COMDAT, IFUNC, + aliases, and visibility must match the linker exactly or LTO miscompiles — + hence the shared `symresolve` module rather than a re-implementation. +- **Interposition / shared libraries.** Never internalize or inline across an + interposable default-visibility boundary unless `-Bsymbolic`/hidden visibility + makes it safe. Default conservative for `-shared`. +- **Inline and file-scope asm** naming symbols are opaque references: treat as + roots, never rename, internalize, or DCE them. +- **Debug info.** Cross-TU inlines need `inlined_subroutine` DWARF with correct + file/line; `SrcLoc` must carry file identity through the merged module. + Acceptable initial limitation: degraded inlined debug info under `-g -flto`, + stated explicitly. +- **Compile time / memory.** The whole program lives in memory; `opt_inline`'s + growth gates bound blow-up; only the LTO set is optimized, opaque inputs stay + opaque. +- **Determinism.** Record and merge in input order; iterate stably. +- **Recording arena lifetime** (§4) is the one structural hazard — settle it + before building the staging loop. +- **TLS, varargs, atomics, computed goto, label-address tables** must survive the + shared module unchanged. Function-local label addresses are already + function-scoped; cross-function `data_addr`/`pcrel`/`symdiff` reference symbols, + which are already unified by id. + +## 10. First slices + +Two independently landable, low-risk steps that de-risk the whole direction +before any LTO surface exists: + +1. **Extract `symresolve` and refactor `link_resolve_symbols` onto it.** Pure + refactor, covered by `test/link`. Lands the load-bearing piece and improves + the linker regardless of LTO. +2. **Add the `ObjBuilder` name -> id index** behind the existing + `obj_symbol_find`/`obj_symbol_ex` API. Drop-in; measurable on its own. + +Then Phase 0 (generalize the sweep + wire `opt_inline`, gated behind `-O2` / +`-fwhole-program`), validated first on x86-64 with red-green tests in `test/opt` +(a caller+callee that should fuse) and `test/smoke/x64` (behavioral parity). Then +the staging lifecycle and skip-intern-locals behind `-flto`, exercised first on a +two-TU `test/smoke` case where a cross-TU callee inlines. + +## Open questions + +- **Define-timing for resolution** (§3): confirm the staging-boundary merge is the + right hook versus an `obj_symbol_define`-time check, given symbols are only + obj-defined at emit. +- **Recording arena** (§4): dedicated LTO arena vs `c->global`, and how the + per-TU frontend arenas interact with a cross-TU module. +- **`-flto` flag surface**: accept the GCC/Clang spelling for `cc`; decide the + kit-native spelling for `compile` and whether `-fwhole-program` is a distinct, + more aggressive internalization mode. +- **CG API exposure**: whether staging is internal to the driver/`compile_engine` + or a public `kit_cg`/`kit_compile` surface for embedders driving multi-TU LTO.