doc/plan: add LTO / whole-program optimization plan - kit

commit 438e6b094ea31ed5ec9787d451e59ddc9a7770f9
parent bde9b5848be4f7d9b2587abaaf7c4ad02417143c
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Thu,  4 Jun 2026 06:56:18 -0700

doc/plan: add LTO / whole-program optimization plan

Forward-looking design for making a library/executable look like a single
TU to the optimizer. Chooses shared-context recording (all sources record
into one session/ObjBuilder/CgIrModule) over clone-and-merge, since globals
already intern by name, the recorder already accumulates a whole module, and
the finalize sweep + unreached opt_inline already exist. Centers the work on
factoring the linker's resolution policy into a shared symresolve module
usable at merge time. Phased: whole-TU opt, all-sources LTO, serialized
.kit.ir objects.

Diffstat:
A doc/plan/LTO.md  | 369 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

1 file changed, 369 insertions(+), 0 deletions(-)
diff --git a/doc/plan/LTO.md b/doc/plan/LTO.md
@@ -0,0 +1,369 @@
+# LTO / Whole-Program Optimization (planned work)
+
+This is the forward-looking plan for link-time optimization in kit: making a
+library or executable look like a single translation unit to the optimizer, so
+inlining, dead-code elimination, internalization, and the rest of the
+interprocedural family can cross TU boundaries. It deliberately does **not**
+target GCC/Clang LTO bitcode compatibility. The initial scope is kit invocations
+that provide all sources up front (`kit cc *.c -O2 -flto -o prog`); separately
+compiled IR objects are a later phase that reuses the same core.
+
+The optimizer baseline this builds on — the recording IR, the
+recording/optimizing boundary, the finalize path, and the pass catalog — is in
+[../OPT.md](../OPT.md) and [OPTIMIZER.md](OPTIMIZER.md). The link-time symbol
+model is in [LINKER.md](LINKER.md). This document treats those as given and
+describes only the LTO-specific additions.
+
+The headline finding from investigating the tree: **most of the machinery for
+whole-program optimization already exists; it is just per-TU, single-arch, and
+partly unreached.** LTO here is three concrete refactors plus wiring, not a new
+subsystem. The largest of the three is factoring the linker's symbol-resolution
+policy out so it can run at merge time as well as at link time.
+
+## Baseline (what already exists)
+
+A handful of facts about the current code path frame everything below.
+
+- **Globals already intern by name within an `ObjBuilder`.** `kit_cg_decl` does
+  `obj_symbol_find` then reuse-or-create (`src/cg/session.c:198`). Two frontends
+  that `decl` `foo` into the *same* builder receive the *same* `ObjSymId`. The
+  CG and optimizer IRs reference call targets and globals by `ObjSymId`
+  (`IRCallAux.desc.callee.v.global.sym`), so a caller's `call foo` already points
+  at the id the definer will define — no remap, no clone. This is the load-bearing
+  fact for the whole design.
+- **The recorder already accumulates a whole module.** One `CgIrRecorder` owns
+  one `CgIrModule` and appends every `func_begin`/`func_end` into it
+  (`src/cg/ir_recorder.c`), flushing only at `finalize`. `CgIrModule`
+  (`src/cg/ir.h:270`) holds all functions, aliases, and file-scope asm. Per
+  function it carries `call_refs` and `global_refs` symbol sets
+  (`src/cg/ir.h:247`) — the call/use graph is materialized during recording.
+- **The optimizer already finalizes over the whole module — for one arch.**
+  `opt_on_finalize` (`src/opt/opt.c:566`) hands the entire `CgIrModule` to
+  `opt_emit_reachable_aarch64` (`src/opt/opt.c:495`), which seeds a root set
+  (non-`LOCAL` symbols, `KIT_CG_SYM_USED` locals, alias targets, exported data
+  relocs), walks each function's `call_refs`/`global_refs` plus the data-reloc
+  graph, **removes unreachable local symbols**, then lowers + optimizes + emits
+  only what is live. This is whole-program GC for one TU. x86-64 and riscv64
+  instead emit eagerly per function in `opt_on_func` (`src/opt/opt.c:322`); they
+  have no module pass.
+- **The whole-program inliner exists and is unreached.** `opt_inline(FuncSet*,
+  max_iters)` (`src/opt/pass_inline.c:667`) does topologically ordered,
+  growth-gated, call-graph inlining over a `FuncSet` of lowered `Func`s. Only the
+  streaming tiny variant `opt_try_tiny_inline` (cost cap 8, straightline only)
+  runs today. The real inliner has never had a caller. See OPTIMIZER.md §6.
+- **The driver already shares one `KitCompiler` across sources** and keeps
+  objects in memory through link. `cc_run_link_exe` compiles each source to its
+  own in-memory `KitObjBuilder` under one compiler (`driver/cmd/cc.c:2655`,
+  `objs[]` at `:2585`) and hands the builders straight to the link session via
+  `kit_link_session_add_obj` (`:2735`/`:2771`) — no temp `.o` files. The
+  orchestration seam for LTO is already where it needs to be.
+- **The obj layer has no resolution policy and no name index.**
+  `obj_symbol_define` is last-writer-wins with no precedence check
+  (`src/obj/obj.c:544`), and `obj_symbol_find` is a linear scan
+  (`src/obj/obj.c:534`). The only resolution rule anywhere below the linker is
+  the weak-demotion special case hand-coded into `kit_cg_decl`
+  (`src/cg/session.c:203`). All real precedence lives in `link_resolve_symbols`
+  (`src/link/link_resolve.c:258`).
+
+The merged module, then, is not something LTO must *build*. It is something the
+recorder already builds and the finalize path already consumes — for one TU, on
+one arch. LTO is mostly about *not tearing it down between TUs*, *generalizing
+the finalize pass*, and *applying real resolution policy as the merge happens*.
+
+## 1. Design decision: shared context, not clone-and-merge
+
+Two architectures can make the optimizer see one module:
+
+1. **Clone-and-merge.** Each TU records into its own `CgIrModule`/`ObjBuilder`;
+   an IR-linker deep-copies every function into a merged module, rebuilds the
+   symbol table, and remaps every operand/reloc/alias to merged ids.
+2. **Shared context.** All TUs record into *one* live session — one
+   `ObjBuilder`, one recorder, one `CgIrModule` — so globals unify in place via
+   the existing decl interning and the finalize path sees the union directly.
+
+We choose **shared context**. The comparison:
+
+| | Shared context | Clone + remap |
+|---|---|---|
+| Global identity | Free (decl already interns by name) | Rebuild symbol table + remap every operand/reloc/alias |
+| Memory / time | Record once, in place | Duplicate all IR into a merge arena |
+| Resolution policy | Apply at the per-TU merge boundary | Apply in the merge pass |
+| Local distinctness | Skip-intern locals (small CG change) | Falls out of remap |
+| Lifecycle cost | Staging mode + cross-TU arena lifetime | None — TUs stay independent |
+| Net new code | Mostly wiring + policy extraction | A full cloner/remapper on the hot path |
+| Serialized objects (Phase 2) | Deserialize = replay records through the same recording/merge API | A separate clone-from-bytes engine |
+
+Clone's only advantage is that TUs stay fully independent, so there is no
+staging lifecycle to manage. That is not worth re-implementing the symbol merge
+the linker already knows how to do, nor the per-TU IR duplication. The decisive
+row is the last one: shared context makes the recording/merge API the single
+funnel. A frontend feeds it; a `.kit.ir` deserializer feeds it the same way.
+There is no second merge engine to build for Phase 2 — fat-object LTO becomes
+"replay serialized decl/func records into the live shared module," reusing the
+same local-handling and resolution code paths verbatim.
+
+The rest of this document describes the shared-context design.
+
+## 2. Symbol identity: what unifies, what must stay distinct
+
+Shared context gets global unification for free (§Baseline). The one correctness
+trap is **local symbols**, and there is exactly one rule to add.
+
+`kit_cg_decl` interns *every* name through `obj_symbol_find`. For globals that is
+correct and desirable. For `SB_LOCAL` symbols it is wrong: two TUs each with
+`static int x;` (the frontend passes the bare name with LOCAL binding,
+`lang/c/decl/decl.c:72`) would collapse to one symbol. The same hazard exists for
+static functions, and for the per-TU counters behind block-scope statics
+(`mint_static_local_sym`, `lang/c/parse/parse.c:660`) and compound literals
+(`mint_compound_literal_sym`, `lang/c/parse/parse_init.c:1012`), which reset per
+parser and would produce colliding names like `y.0` and
+`__kit_compound_literal.2` across TUs.
+
+**Fix:** for `SB_LOCAL` bindings, skip `obj_symbol_find` and always mint a fresh
+id. Consequences, all benign:
+
+- Two locals named `x` get distinct `ObjSymId`s. Duplicate `STB_LOCAL` names in
+  one object are legal in every format kit emits; locals never enter the global
+  name table; the optimizer indexes functions by id, not name.
+- The frontend caches the id per `Decl`, so intra-TU reuse of a static is
+  unaffected — the second reference goes through the cached id, not a fresh decl.
+- The static-vs-extern-same-name case resolves correctly: the static gets a fresh
+  id; an unrelated `extern foo` keeps the shared global id.
+
+No frontend mangling is required. Anonymous read-only data (`.Lkit_ro.N`,
+`src/cg/memory.c:102`) needs no change at all once the session is shared, because
+the `rodata_counter` is no longer reset between TUs (see §4) and keeps climbing.
+
+## 3. Resolution policy: factoring `symresolve` out of the linker
+
+Today, if we naively share one `ObjBuilder`, we lose all symbol-resolution
+semantics: `obj_symbol_define` overwrites last-writer-wins, so two strong
+definitions silently clobber instead of raising an ODR error, strong-vs-weak
+becomes declaration-order dependent, and commons never merge. The precedence
+rules we need already exist in `link_resolve_symbols`
+(`src/link/link_resolve.c:258`): strong-vs-strong → ODR error (modulo COFF
+COMDAT/SELECTANY), strong beats weak, weak-weak keeps the first, common merging
+takes max size/align, and a definition beats a common.
+
+Per the investigation, that logic is cleanly separable. The *decision* is pure
+over `(name, bind, kind, size, align, common_align, defined?, in_comdat)`
+tuples; only the *bookkeeping* — the `globals` `SymHash`, the per-input
+`InputMap`, COMDAT section discard, DSO iteration — is entangled with linker
+state.
+
+**Extract a small shared module**, `src/obj/symresolve.{h,c}` (the obj layer is
+the natural home; both consumers sit above it):
+
+```c
+// pure: no linker state, no allocation
+SymMergeResult symresolve_merge(SymAttrs existing, SymAttrs incoming,
+                                int coff_target);
+// -> KEEP_EXISTING | REPLACE | MERGE_COMMON(size, align)
+//  | COMDAT_DISCARD | ODR_ERROR
+```
+
+Move `link_bind_strength`, `link_sym_is_def`, and `link_sym_is_spurious_undef`
+(`src/link/link_internal.h`) alongside it. Then:
+
+- **Refactor `link_resolve_symbols` onto it.** A pure cleanup with no behavior
+  change, fully covered by the `test/link` corpus, that gives the policy one
+  source of truth and leaves the linker better than we found it.
+- **The LTO staging coordinator calls the same function** at the per-TU merge
+  boundary. Crucially this is a *binding-precedence* decision — which body wins —
+  not id remapping, because ids are already unified. When TU B contributes a body
+  for a global TU A already provided, `symresolve_merge` decides whether to keep
+  A's `CgIrFunc`/data, replace it with B's, merge commons, or raise ODR. The
+  loser's body is dropped from the module and its decl remains as a reference.
+  ODR conflicts are reported at the second definition's `SrcLoc` — better
+  diagnostics than the linker's post-hoc panic, because source locations still
+  exist at this point.
+
+Two mechanical needs fall out of sharing one builder:
+
+- **Give `ObjBuilder` a `name -> id` hash map.** With the whole program's symbols
+  in one builder, the linear `obj_symbol_find` (`src/obj/obj.c:534`) is O(n²) at
+  decl time. The assembler already carries its own `SymSymMap` precisely because
+  obj lacks one (`src/asm/asm.c`). Adding the index to the builder removes the
+  quadratic and hosts the resolution hook — a win even for ordinary single-TU
+  compiles, and it lets the assembler shed its private map later.
+- **One open question on `define` timing.** During pure recording a global is
+  *declared* (with binding) but not *defined* in the obj sense until finalize
+  emits its section/offset. So the "which body wins" decision must run at the
+  staging boundary against the set of bodies a TU contributes (it has a
+  `CgIrFunc` or data record for the symbol), not at `obj_symbol_define` time.
+  This is the linker's per-input symbol merge applied to `CgIrModule`
+  contributions.
+
+The opaque inputs in a link (libc, crt, kit archives, DSOs) are still resolved by
+the linker at link time against the single emitted LTO object. So the policy
+module has two call sites — recording-time merge among the LTO set, and
+link-time resolution against everything — which is the justification for
+extracting it rather than duplicating it.
+
+## 4. The staging lifecycle
+
+`kit_cg_end_obj` finalizes (lowers + emits everything), nulls `g->obj`/
+`g->target`, and resets per-object state including `rodata_counter`
+(`src/cg/session.c`). That bracket assumes exactly one TU. LTO needs a staging
+mode:
+
+- **Record each TU into the live session without finalizing**; run a single
+  `finalize` after the last TU. A new session entry point (a "stage" variant of
+  `begin_obj`/`end_obj`, or a recorder flag that defers `finalize_recorded`) lets
+  the recorder accumulate all TUs into the one `CgIrModule`.
+- **The recording arena must outlive any single TU.** The recorder and module are
+  arena-allocated from `c->tu` today (`opt_cgtarget_new`, `cg_ir_recorder_new`).
+  For LTO they must come from a cross-TU arena (a dedicated LTO arena, or
+  `c->global`) so the accumulated IR survives across the per-TU frontend runs.
+  This is the one sharp edge of the lifecycle change and should be settled first.
+- **Each TU keeps its own frontend state.** The per-TU `Pool`, `DeclTable`, and
+  type interning stay independent; only the CG session and `ObjBuilder` are
+  shared. The shared `KitCompiler` already spans sources today, so `c->global`
+  name interning is already consistent across TUs.
+
+The driver/`compile_engine` change is a loop: open one staging session, run the C
+frontend once per source against it, finalize once, hand the single resulting
+builder to the link session in place of the per-source builders.
+
+## 5. The export / preserved set
+
+Internalizing a global — demoting it to hidden/local, which unlocks DCE and
+unconstrained inlining — is sound only when nothing outside the LTO set can
+reference it by name and it is not interposable. This is the one input that
+genuinely needs the full link picture, so it is computed at link time and handed
+to the LTO core. A symbol must be **preserved** if it is:
+
+- the entry symbol (`main`/`_start`), or in the dynamic export set; for `-shared`,
+  default-visibility symbols are interposable and must **not** be internalized or
+  inlined across unless `-fvisibility=hidden` / a version script / `-Bsymbolic`
+  says otherwise;
+- referenced (undefined) by any **opaque** input — libc/crt calling `main`, a
+  kit archive member that is not IR, a DSO;
+- `__attribute__((used))`, in an init/fini array, named in inline or file-scope
+  asm, an IFUNC resolver, or address-significant in an opaque input.
+
+The linker already answers "is this symbol referenced from outside" for archive
+pull (`scan_presence_before` / `member_satisfies`, `src/link/link_resolve.c:859`,
+`:923`); the preserved set is the same question asked of the LTO set against the
+opaque inputs and the output-kind/visibility policy. Conservative default:
+internalize only for executable outputs or provably non-exported symbols.
+
+## 6. The whole-program optimization core
+
+With a merged module and a preserved set, the core is `opt_emit_reachable_aarch64`
+generalized:
+
+1. **Generalize the finalize sweep to all arches.** Lift the ARM64-only path in
+   `opt_on_finalize` into an arch-independent `opt_whole_module_finalize`, and
+   switch x86-64/riscv64 from eager per-function emit to defer-to-finalize when
+   the whole-program path is active. Keep `-O0`/`-O1` streaming and the
+   JIT/interp/`run`/`dbg`/`emu` paths on the existing eager path — LTO is an AOT
+   concern. The one verification item is that nothing downstream depends on x64/rv64
+   eager emission (`opt_maybe_capture_interp`).
+2. **Internalize** non-preserved globals using the §5 set.
+3. **GC** unreachable functions and data (the existing reachability walk, now over
+   the whole program).
+4. **Lower the reachable set into a `FuncSet`** and run `opt_inline` — the
+   already-written, never-called whole-program inliner — with a real cost model.
+5. **Emit one object** and substitute it for the IR inputs before the final link.
+
+Steps 1 and 4 are independently valuable and land first (Phase 0): they turn the
+unreached inliner and the generalized sweep into a tested, shipping path on a
+single TU before any cross-TU complexity exists.
+
+## 7. Phased delivery
+
+**Phase 0 — Whole-translation-unit optimization.** No merge, no serialization,
+no driver changes. Generalize the finalize sweep to all arches, switch
+x64/rv64 to defer-to-finalize under the whole-program path, and wire `opt_inline`
+over the reachable `FuncSet`. Delivers real cross-function inlining within a TU at
+`-O2` on every arch, generalized dead-static elimination, and the inliner finally
+exercised on real code. Lowest risk — purely inside the optimizer — and it
+validates the deferred-emit path that Phase 1's staging lifecycle also relies on.
+
+**Phase 1 — Shared-context, all-sources-up-front LTO.** The target case,
+`kit cc *.c -O2 -flto -o prog` and `kit compile`. Build on Phase 0 by adding:
+(a) the `symresolve` extraction (§3), (b) the `ObjBuilder` name index (§3),
+(c) skip-intern for locals (§2), (d) the `kit_cg` staging lifecycle and the
+driver loop that records N frontends into one session and finalizes once (§4),
+(e) the preserved set fed from the assembled link (§5). No cloner, no
+serialization, no archive support yet.
+
+**Phase 2 — Serialized IR objects (`.kit.ir`).** Optional follow-on for separate
+compilation, archives, and build caches. `kit cc -c -flto a.c` emits a normal
+object whose symbol table is the real decl set — so the linker's symbol-driven
+archive pull works unchanged — plus a `.kit.ir` custom section (the object model
+already supports arbitrary `SEC_OTHER` sections) carrying a serialized
+`CgIrModule`. The linker detects the section and **replays the records into the
+same shared context** through the same recording/merge API, reusing
+skip-intern-locals and `symresolve_merge` verbatim. Archives of IR objects work
+because pull is symbol-table driven. Note: whole-program LTO is incompatible with
+the file-based incremental linker (LINKER.md); `-flto` forces a full link.
+
+## 8. Optimizations unlocked
+
+Inlining is the headline; the merged-module + `FuncSet` framework makes a whole
+interprocedural family *expressible* (listed as enabled, not committed):
+
+- **Cross-TU / whole-program inlining** — `opt_inline`, already written.
+- **Internalization** to hidden/local for non-exported globals — enables DCE,
+  removes PLT/GOT indirection, frees intra-function optimization.
+- **Whole-program dead code / data elimination** — the generalized sweep.
+- Future: devirtualization / direct-call promotion, IPSCCP and cross-function
+  constant/range propagation, argument promotion, identical-code folding,
+  `const`/`pure` inference, global-to-local-constant propagation.
+
+## 9. Risks, semantics, and limitations
+
+- **Resolution fidelity.** ODR, weak/strong, common merging, COMDAT, IFUNC,
+  aliases, and visibility must match the linker exactly or LTO miscompiles —
+  hence the shared `symresolve` module rather than a re-implementation.
+- **Interposition / shared libraries.** Never internalize or inline across an
+  interposable default-visibility boundary unless `-Bsymbolic`/hidden visibility
+  makes it safe. Default conservative for `-shared`.
+- **Inline and file-scope asm** naming symbols are opaque references: treat as
+  roots, never rename, internalize, or DCE them.
+- **Debug info.** Cross-TU inlines need `inlined_subroutine` DWARF with correct
+  file/line; `SrcLoc` must carry file identity through the merged module.
+  Acceptable initial limitation: degraded inlined debug info under `-g -flto`,
+  stated explicitly.
+- **Compile time / memory.** The whole program lives in memory; `opt_inline`'s
+  growth gates bound blow-up; only the LTO set is optimized, opaque inputs stay
+  opaque.
+- **Determinism.** Record and merge in input order; iterate stably.
+- **Recording arena lifetime** (§4) is the one structural hazard — settle it
+  before building the staging loop.
+- **TLS, varargs, atomics, computed goto, label-address tables** must survive the
+  shared module unchanged. Function-local label addresses are already
+  function-scoped; cross-function `data_addr`/`pcrel`/`symdiff` reference symbols,
+  which are already unified by id.
+
+## 10. First slices
+
+Two independently landable, low-risk steps that de-risk the whole direction
+before any LTO surface exists:
+
+1. **Extract `symresolve` and refactor `link_resolve_symbols` onto it.** Pure
+   refactor, covered by `test/link`. Lands the load-bearing piece and improves
+   the linker regardless of LTO.
+2. **Add the `ObjBuilder` name -> id index** behind the existing
+   `obj_symbol_find`/`obj_symbol_ex` API. Drop-in; measurable on its own.
+
+Then Phase 0 (generalize the sweep + wire `opt_inline`, gated behind `-O2` /
+`-fwhole-program`), validated first on x86-64 with red-green tests in `test/opt`
+(a caller+callee that should fuse) and `test/smoke/x64` (behavioral parity). Then
+the staging lifecycle and skip-intern-locals behind `-flto`, exercised first on a
+two-TU `test/smoke` case where a cross-TU callee inlines.
+
+## Open questions
+
+- **Define-timing for resolution** (§3): confirm the staging-boundary merge is the
+  right hook versus an `obj_symbol_define`-time check, given symbols are only
+  obj-defined at emit.
+- **Recording arena** (§4): dedicated LTO arena vs `c->global`, and how the
+  per-TU frontend arenas interact with a cross-TU module.
+- **`-flto` flag surface**: accept the GCC/Clang spelling for `cc`; decide the
+  kit-native spelling for `compile` and whether `-fwhole-program` is a distinct,
+  more aggressive internalization mode.
+- **CG API exposure**: whether staging is internal to the driver/`compile_engine`
+  or a public `kit_cg`/`kit_compile` surface for embedders driving multi-TU LTO.

	kit kit
	git clone https://git.ryansepassi.com/git/kit.git
	Log \| Files \| Refs \| README