LTO / Whole-Program Optimization (planned work)

This is the forward-looking plan for link-time optimization in kit: making a library or executable look like a single translation unit to the optimizer, so inlining, dead-code elimination, internalization, and the rest of the interprocedural family can cross TU boundaries. It deliberately does not target GCC/Clang LTO bitcode compatibility. The initial scope is kit invocations that provide all sources up front (kit cc *.c -O2 -flto -o prog); separately compiled IR objects are a later phase that reuses the same core.

The optimizer baseline this builds on — the recording IR, the recording/optimizing boundary, the finalize path, and the pass catalog — is in ../OPT.md and OPTIMIZER.md. The link-time symbol model is in LINKER.md. The CG/object lifetime boundary used by the remaining Phase 1 staging work is in CG_OBJ_LIFECYCLE.md. This document treats those as given and describes only the LTO-specific additions.

The headline finding from investigating the tree: most of the machinery for whole-program optimization already exists; it is just per-TU, single-arch, and partly unreached. LTO here is three concrete refactors plus wiring, not a new subsystem. The largest of the three is factoring the linker's symbol-resolution policy out so it can run at merge time as well as at link time.

Status (2026-06-04)

Phase 0 is complete and shipping; Phase 1's all-sources-up-front LTO path is implemented in this branch. The end state is not a C-only shortcut: every source-building verb routes through one staging engine, and every in-tree frontend declares either semantic CG staging or opaque-object participation. The link-picture-driven preserved/export prepass now feeds kit_cg_finish, and executable LTO internalizes non-preserved globals before the whole-module reachability walk. Where reality diverged from the original wording below:

The gate is -O1, not -O2. Whole-program optimization (deferred emit + module sweep + inliner) runs whenever the optimizer runs: o->whole_program = (level >= 1) in opt_cgtarget_new. -O2 is treated as -O1 for now. References to -O2/-fwhole-program gating below are superseded.
One arch path, no identity checks. The ARM64-only sweep is now opt_whole_module_finalize for every arch; src/opt has zero arch == KIT_ARCH_* checks. The sret arg-slot rule moved off arch identity to ABIFuncInfo.sret_consumes_int_arg (set per ABI impl). Remaining generic-layer arch identity (src/cg/type.c, src/cg/atomic.c, src/link/link_resolve.c) is tracked as separate cleanup, not part of LTO.
Cross-TU LTO will be opt-in behind -flto (revisit making it the -O1 default once proven) — resolves the flag-surface open question.
Frontend participation is explicit. C, Toy, and Wasm lower into a caller-owned open KitCg; asm is an opaque LTO participant and continues to compile as an ordinary object.
The lifecycle target is borrowed KitCg + caller-owned ObjBuilder, not a separate LTO unit abstraction. ObjBuilder owns object lifetime; KitCg records source units into a borrowed object and finishes semantic codegen with link-picture policy. See CG_OBJ_LIFECYCLE.md.
symresolve_merge signature as built is (SymAttrs existing, SymAttrs incoming) with in_comdat carried inside SymAttrs; no separate coff_target parameter (the COMDAT flags carry everything the decision needs).
Preserved/export internalization is part of Phase 1. The LTO CG finish path receives linker-computed preserved symbols for executable links, and cc -shared -flto remains disabled until shared-library output is exercised.

Done

§6.1 Generalize the finalize sweep to all arches — opt_whole_module_finalize (src/opt/opt.c); x64/rv64 defer-to-finalize; -O0 and the JIT/interp/run paths unchanged; opt_maybe_capture_interp still invoked per reachable func.
§6.4 Wire opt_inline over the reachable FuncSet — opt_run_o1_native split into opt_o1_native_prepare / opt_o1_native_finish; the sweep lowers the live set into one FuncSet, runs the inliner, then finishes each func.
Interposition soundness fix (strengthens §9): weak/interposable callees are never inlined — opt_cg_func_interposable marks them KIT_CG_INLINE_NEVER, honored by both the streaming tiny-inliner and the whole-program inliner. Caught by a strong-over-weak override case the prior (tiny-inliner) behavior miscompiled.
§3 symresolve extraction — src/obj/symresolve.{h,c}; link_resolve_symbols refactored onto symresolve_merge; link_bind_strength / link_sym_is_def / link_sym_is_spurious_undef are now wrappers. Behavior-preserving (test-link 122/0, test-macho 80/0, ODR/weak/common/COMDAT all covered).
§3 ObjBuilder name→id index — SymNameIndex in src/obj/obj.c; obj_symbol_find is an authoritative O(1) hash lookup with no linear scan, kept exact through obj_symbol_ex and obj_symbol_rename.
Tests — test/opt/whole_program_inline.sh (wired test-opt-whole-program-inline): static callee fuses on aa64/x64/rv64, weak callee kept out-of-line (interposition guard), opt.inline.inlined fires at -O1, and the kit-native build verbs (build-obj/build-exe) fuse too.
Build verbs participate. build-exe/build-lib/build-obj (which replaced compile on main) compile each source to an in-memory builder under one KitCompiler via build_compile_all (driver/cmd/build.c) and route through the shared kit_cg path, so per-TU whole-program optimization applies at -O1 with no verb-specific wiring. build_compile_all is also the single seam the Phase 1 cross-TU staging loop will hook (all three verbs at once); cc keeps its own cc_run_link_exe → link_engine path.
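The interposition guard from the list above can be sketched in a few lines. This is a minimal illustration, not kit's code: the `Sk*` types and functions are hypothetical stand-ins, and treating a default-visibility symbol in a shared object as interposable is an assumption taken from §5 rather than from opt_cg_func_interposable itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real kit types and flags are not shown here. */
typedef enum { SK_BIND_LOCAL, SK_BIND_GLOBAL, SK_BIND_WEAK } SkBind;
typedef enum { SK_INLINE_DEFAULT, SK_INLINE_NEVER } SkInlinePolicy;

typedef struct {
    SkBind bind;
    bool   default_visibility_in_dso; /* assumption: exported from a shared object */
} SkCallee;

/* A callee is interposable when a different definition may win at link or load time. */
static bool sk_callee_interposable(const SkCallee *c) {
    assert(c != NULL);
    return c->bind == SK_BIND_WEAK || c->default_visibility_in_dso;
}

/* Both inliners would consult one policy, so neither can fuse an interposable body. */
static SkInlinePolicy sk_inline_policy(const SkCallee *c) {
    return sk_callee_interposable(c) ? SK_INLINE_NEVER : SK_INLINE_DEFAULT;
}
```

Keeping the decision in one predicate is what lets a streaming inliner and a whole-program inliner agree: a strong-over-weak override then changes which body is called, never which body was already copied into the caller.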

Phase 1 source-staging checklist

Architecture lock-in. Phase 1 is implemented as a frontend staging and CG/ObjBuilder lifecycle refactor, not a C-driver shortcut. All source-building verbs (cc, build-exe, build-lib, build-obj) route through the same staging engine. Frontends explicitly declare how they participate: semantic kit_cg staging for frontends that lower through CG, or opaque-object participation for inputs that cannot expose semantic IR (notably asm). The change is not complete until every in-tree frontend is opted into one of those modes.
§2 Skip-intern locals. In kit_cg_decl (src/cg/session.c:198), for SB_LOCAL bindings skip obj_symbol_find and always mint a fresh id. Confirm the per-Decl id cache keeps intra-TU static reuse pointing at the cached id, and that single-TU behavior is unchanged (locals are already unique per name within a TU).
§4 Recording-arena lifetime — settle first. Choose dedicated LTO arena vs c->global for the recorder/CgIrModule so accumulated IR outlives each per-TU frontend run. This is the one structural hazard (§9).
§4 Source staging under the current CG API. Add a deferred-finalize mode to kit_cg: record N TUs into one shared session / ObjBuilder / CgIrModule without per-TU finalization, then finish CG and finalize the object once. Keep per-TU frontend state (Pool/DeclTable/type interning) independent.
§4 CG/ObjBuilder borrowed lifecycle. Replace the former object-shaped CG bracket with the lifecycle in CG_OBJ_LIFECYCLE.md: caller-owned ObjBuilder, borrowed KitCg, explicit unit boundaries, kit_cg_finish for semantic codegen policy, and caller-owned object finalization. One-TU and multi-TU builds now use the same ownership model.
§3/§4 Recording-time merge. At the per-TU staging boundary, when a TU contributes a body for a symbol already defined, call symresolve_merge to pick the winner; drop the loser's CgIrFunc/data and keep its decl as a reference; report ODR at the second definition's SrcLoc.
§4 Driver loop + -flto flag. Parse -flto in cc and the build verbs, thread an LTO flag through KitCodeOptions/the driver, and add the staging path: one shared session, frontend per source, one CG finish/object finalize, single builder to the link session. Hook it at build_compile_all (driver/cmd/build.c) so build-exe/lib/obj get it together, plus cc_run_link_exe. (The build verbs already share one KitCompiler, so the seam is in place.)
§5 Preserved/export set. Compute it from the assembled link (entry symbol, dynamic exports, undefs referenced by opaque inputs, used/init-fini/asm-named/IFUNC/address-significant symbols) and hand it to kit_cg_finish. Current Phase 1 behavior is conservative for relocatable/archive outputs, while executable outputs internalize non-preserved LTO definitions. Shared-library LTO remains disabled until shared output is exercised.
§6.2 Internalize non-preserved globals using the preserved set (unlocks cross-TU DCE and unconstrained inlining), then re-run GC.
Tests. A two-TU test/smoke (or test/link) case where a cross-TU callee inlines under -flto; a guard that a weak/exported cross-TU symbol is not inlined/internalized; cross-TU ODR reported at the right SrcLoc.

Baseline (what already exists)

A handful of facts about the current code path frame everything below.

Globals already intern by name within an ObjBuilder. kit_cg_decl does obj_symbol_find then reuse-or-create (src/cg/session.c:198). Two frontends that decl foo into the same builder receive the same ObjSymId. The CG and optimizer IRs reference call targets and globals by ObjSymId (IRCallAux.desc.callee.v.global.sym), so a caller's call foo already points at the id the definer will define — no remap, no clone. This is the load-bearing fact for the whole design.
The recorder already accumulates a whole module. One CgIrRecorder owns one CgIrModule and appends every func_begin/func_end into it (src/cg/ir_recorder.c), flushing only at finalize. CgIrModule (src/cg/ir.h:270) holds all functions, aliases, and file-scope asm. Per function it carries call_refs and global_refs symbol sets (src/cg/ir.h:247) — the call/use graph is materialized during recording.
The optimizer already finalizes over the whole module — for one arch. opt_on_finalize (src/opt/opt.c:566) hands the entire CgIrModule to opt_emit_reachable_aarch64 (src/opt/opt.c:495), which seeds a root set (non-LOCAL symbols, KIT_CG_SYM_USED locals, alias targets, exported data relocs), walks each function's call_refs/global_refs plus the data-reloc graph, removes unreachable local symbols, then lowers + optimizes + emits only what is live. This is whole-program GC for one TU. x86-64 and riscv64 instead emit eagerly per function in opt_on_func (src/opt/opt.c:322); they have no module pass.
The whole-program inliner exists and is unreached. opt_inline(FuncSet*, max_iters) (src/opt/pass_inline.c:667) does topologically ordered, growth-gated, call-graph inlining over a FuncSet of lowered Funcs. Only the streaming tiny variant opt_try_tiny_inline (cost cap 8, straightline only) runs today. The real inliner has never had a caller. See OPTIMIZER.md §6.
The driver already shares one KitCompiler across sources and keeps objects in memory through link. cc_run_link_exe compiles each source to its own in-memory KitObjBuilder under one compiler (driver/cmd/cc.c:2655, objs[] at :2585) and hands the builders straight to the link session via kit_link_session_add_obj (:2735/:2771) — no temp .o files. The orchestration seam for LTO is already where it needs to be.
The obj layer has no resolution policy and no name index. obj_symbol_define is last-writer-wins with no precedence check (src/obj/obj.c:544), and obj_symbol_find is a linear scan (src/obj/obj.c:534). The only resolution rule anywhere below the linker is the weak-demotion special case hand-coded into kit_cg_decl (src/cg/session.c:203). All real precedence lives in link_resolve_symbols (src/link/link_resolve.c:258).

The merged module, then, is not something LTO must build. It is something the recorder already builds and the finalize path already consumes — for one TU, on one arch. LTO is mostly about not tearing it down between TUs, generalizing the finalize pass, and applying real resolution policy as the merge happens.
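The reachability walk described for the finalize path can be sketched as a worklist mark over per-function reference sets. This is an illustrative model only: `SkFunc`, its fixed-size `refs` array, and `sk_mark_reachable` are hypothetical, and the real walk also follows global_refs and the data-reloc graph, which this sketch omits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_MAX_REFS 4

/* Hypothetical, simplified module: each function lists the functions it calls. */
typedef struct {
    bool   is_root;            /* non-local, "used", alias target, ... */
    size_t nrefs;
    size_t refs[SK_MAX_REFS];  /* indices of referenced functions */
} SkFunc;

/* Marks every function reachable from a root and returns the live count.
 * `live` and `stack` must each hold `n` entries. */
static size_t sk_mark_reachable(const SkFunc *funcs, size_t n,
                                bool *live, size_t *stack) {
    size_t top = 0, count = 0;
    assert(funcs != NULL && live != NULL && stack != NULL);
    for (size_t i = 0; i < n; i++) {
        live[i] = funcs[i].is_root;
        if (live[i]) stack[top++] = i;
    }
    while (top > 0) {
        const SkFunc *f = &funcs[stack[--top]];
        count++;
        for (size_t r = 0; r < f->nrefs; r++) {
            size_t callee = f->refs[r];
            /* Each index is pushed at most once, so the stack never exceeds n. */
            if (callee < n && !live[callee]) {
                live[callee] = true;
                stack[top++] = callee;
            }
        }
    }
    return count;
}
```

Anything left unmarked is a candidate for removal before lowering. Sharing one module across TUs changes only the size of the input, not the walk.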

1. Design decision: shared context, not clone-and-merge

Two architectures can make the optimizer see one module:

Clone-and-merge. Each TU records into its own CgIrModule/ObjBuilder; an IR-linker deep-copies every function into a merged module, rebuilds the symbol table, and remaps every operand/reloc/alias to merged ids.
Shared context. All TUs record into one live session — one ObjBuilder, one recorder, one CgIrModule — so globals unify in place via the existing decl interning and the finalize path sees the union directly.

We choose shared context. The comparison:

Aspect	Shared context	Clone + remap
Global identity	Free (decl already interns by name)	Rebuild symbol table + remap every operand/reloc/alias
Memory / time	Record once, in place	Duplicate all IR into a merge arena
Resolution policy	Apply at the per-TU merge boundary	Apply in the merge pass
Local distinctness	Skip-intern locals (small CG change)	Falls out of remap
Lifecycle cost	Staging mode + cross-TU arena lifetime	None; TUs stay independent
Net new code	Mostly wiring + policy extraction	A full cloner/remapper on the hot path
Serialized objects (Phase 2)	Deserialize = replay records through the same recording/merge API	A separate clone-from-bytes engine

Clone's only advantage is that TUs stay fully independent, so there is no staging lifecycle to manage. That is not worth re-implementing the symbol merge the linker already knows how to do, nor the per-TU IR duplication. The decisive row is the last one: shared context makes the recording/merge API the single funnel. A frontend feeds it; a .kit.ir deserializer feeds it the same way. There is no second merge engine to build for Phase 2 — fat-object LTO becomes "replay serialized decl/func records into the live shared module," reusing the same local-handling and resolution code paths verbatim.

The rest of this document describes the shared-context design.

2. Symbol identity: what unifies, what must stay distinct

Shared context gets global unification for free (§Baseline). The one correctness trap is local symbols, and there is exactly one rule to add.

kit_cg_decl interns every name through obj_symbol_find. For globals that is correct and desirable. For SB_LOCAL symbols it is wrong: two TUs each with static int x; (the frontend passes the bare name with LOCAL binding, lang/c/decl/decl.c:72) would collapse to one symbol. The same hazard exists for static functions, and for the per-TU counters behind block-scope statics (mint_static_local_sym, lang/c/parse/parse.c:660) and compound literals (mint_compound_literal_sym, lang/c/parse/parse_init.c:1012), which reset per parser and would produce colliding names like y.0 and __kit_compound_literal.2 across TUs.

Fix: for SB_LOCAL bindings, skip obj_symbol_find and always mint a fresh id. Consequences, all benign:

Two locals named x get distinct ObjSymIds. Duplicate STB_LOCAL names in one object are legal in every format kit emits; locals never enter the global name table; the optimizer indexes functions by id, not name.
The frontend caches the id per Decl, so intra-TU reuse of a static is unaffected — the second reference goes through the cached id, not a fresh decl.
The static-vs-extern-same-name case resolves correctly: the static gets a fresh id; an unrelated extern foo keeps the shared global id.

No frontend mangling is required. Anonymous read-only data (.Lkit_ro.N, src/cg/memory.c:102) needs no change at all once the session is shared, because the rodata_counter is no longer reset between TUs (see §4) and keeps climbing.
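A minimal sketch of the skip-intern rule, assuming a toy builder with a linear table. `SkBuilder`, `sk_decl`, and `sk_find_global` are hypothetical names used for illustration; they are not kit_cg_decl or obj_symbol_find, and the sketch assumes locals never take part in name lookup.

```c
#include <assert.h>
#include <string.h>

#define SK_MAX_SYMS 64

typedef enum { SK_LOCAL, SK_GLOBAL } SkBinding;
typedef struct { const char *name; SkBinding bind; } SkSym;
typedef struct { SkSym syms[SK_MAX_SYMS]; int count; } SkBuilder;

/* Returns the id of an existing global with this name, or -1. Locals never match. */
static int sk_find_global(const SkBuilder *b, const char *name) {
    for (int i = 0; i < b->count; i++)
        if (b->syms[i].bind == SK_GLOBAL && strcmp(b->syms[i].name, name) == 0)
            return i;
    return -1;
}

/* Globals intern by name; locals always get a fresh id. Returns -1 when full. */
static int sk_decl(SkBuilder *b, const char *name, SkBinding bind) {
    assert(b != NULL && name != NULL);
    if (bind == SK_GLOBAL) {
        int id = sk_find_global(b, name);
        if (id >= 0) return id;
    }
    if (b->count == SK_MAX_SYMS) return -1;
    b->syms[b->count] = (SkSym){ name, bind };
    return b->count++;
}
```

Two TUs that each declare a static `x` get two ids, two references to a global `foo` share one, and an unrelated extern `x` stays separate from both statics. Intra-TU reuse of a static would go through the frontend's per-Decl cache rather than a second `sk_decl` call.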

3. Resolution policy: factoring `symresolve` out of the linker

Today, if we naively share one ObjBuilder, we lose all symbol-resolution semantics: obj_symbol_define overwrites last-writer-wins, so two strong definitions silently clobber instead of raising an ODR error, strong-vs-weak becomes declaration-order dependent, and commons never merge. The precedence rules we need already exist in link_resolve_symbols (src/link/link_resolve.c:258): strong-vs-strong → ODR error (modulo COFF COMDAT/SELECTANY), strong beats weak, weak-weak keeps the first, common merging takes max size/align, and a definition beats a common.

Per the investigation, that logic is cleanly separable. The decision is pure over (name, bind, kind, size, align, common_align, defined?, in_comdat) tuples; only the bookkeeping — the globals SymHash, the per-input InputMap, COMDAT section discard, DSO iteration — is entangled with linker state.

Extract a small shared module, src/obj/symresolve.{h,c} (the obj layer is the natural home; both consumers sit above it):

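// pure: no linker state, no allocation
// (planned signature; as built there is no coff_target parameter and
//  in_comdat is carried inside SymAttrs, see Status)
SymMergeResult symresolve_merge(SymAttrs existing, SymAttrs incoming,
                                int coff_target);
// -> KEEP_EXISTING | REPLACE | MERGE_COMMON(size, align)
//  | COMDAT_DISCARD | ODR_ERROR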

Move link_bind_strength, link_sym_is_def, and link_sym_is_spurious_undef (src/link/link_internal.h) alongside it. Then:

Refactor link_resolve_symbols onto it. A pure cleanup with no behavior change, fully covered by the test/link corpus, that gives the policy one source of truth and leaves the linker better than we found it.
The LTO staging coordinator calls the same function at the per-TU merge boundary. Crucially this is a binding-precedence decision — which body wins — not id remapping, because ids are already unified. When TU B contributes a body for a global TU A already provided, symresolve_merge decides whether to keep A's CgIrFunc/data, replace it with B's, merge commons, or raise ODR. The loser's body is dropped from the module and its decl remains as a reference. ODR conflicts are reported at the second definition's SrcLoc — better diagnostics than the linker's post-hoc panic, because source locations still exist at this point.
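The precedence rules listed in this section can be sketched as a pure function over two attribute tuples. This is an illustration under stated assumptions, not the symresolve implementation: `SkAttrs`, `SkStrength`, and `sk_merge` are hypothetical; the sketch ranks a weak definition above a common (reading "a definition beats a common" as covering weak definitions); and it models the COMDAT exception as "both sides in a COMDAT group", which simplifies the real selection rules.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Ordered by precedence: a higher value wins over a lower one. */
typedef enum { SK_UNDEF, SK_COMMON, SK_WEAK, SK_STRONG } SkStrength;

typedef struct {
    SkStrength strength;
    bool       in_comdat;
    size_t     size, align;   /* meaningful for commons */
} SkAttrs;

typedef enum {
    SK_KEEP_EXISTING, SK_REPLACE, SK_MERGE_COMMON, SK_COMDAT_DISCARD, SK_ODR_ERROR
} SkAction;

typedef struct { SkAction action; size_t size, align; } SkMergeResult;

static size_t sk_max(size_t a, size_t b) { return a > b ? a : b; }

/* Pure decision: no tables, no allocation, no linker state. */
static SkMergeResult sk_merge(SkAttrs existing, SkAttrs incoming) {
    SkMergeResult r = { SK_KEEP_EXISTING, existing.size, existing.align };
    /* An undefined reference has no section, so it cannot sit in a COMDAT group. */
    assert(!(incoming.strength == SK_UNDEF && incoming.in_comdat));
    if (incoming.strength > existing.strength) {
        r.action = SK_REPLACE;
        r.size = incoming.size;
        r.align = incoming.align;
    } else if (incoming.strength == existing.strength) {
        if (existing.strength == SK_STRONG) {
            r.action = (existing.in_comdat && incoming.in_comdat)
                     ? SK_COMDAT_DISCARD : SK_ODR_ERROR;
        } else if (existing.strength == SK_COMMON) {
            r.action = SK_MERGE_COMMON;
            r.size  = sk_max(existing.size, incoming.size);
            r.align = sk_max(existing.align, incoming.align);
        }
        /* weak-weak and undef-undef: the first one stays */
    }
    return r;
}
```

Because the function only inspects its arguments, the same decision can be asked by a linker walking input objects and by a staging coordinator walking recorded bodies, with each caller doing its own bookkeeping on the result.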

Two mechanical needs fall out of sharing one builder:

Give ObjBuilder a name -> id hash map. With the whole program's symbols in one builder, the linear obj_symbol_find (src/obj/obj.c:534) costs O(n) per lookup and therefore O(n²) across all decls. The assembler already carries its own SymSymMap precisely because obj lacks one (src/asm/asm.c). Adding the index to the builder removes the quadratic cost and hosts the resolution hook. That is a win even for ordinary single-TU compiles, and it lets the assembler shed its private map later.
One open question on define timing. During pure recording a global is declared (with binding) but not defined in the obj sense until finalize emits its section/offset. So the "which body wins" decision must run at the staging boundary, against the set of bodies a TU contributes (a TU contributes a body when it has a CgIrFunc or data record for the symbol), not at obj_symbol_define time. This is the linker's per-input symbol merge applied to CgIrModule contributions.
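A name -> id index of the kind described above can be sketched as a small open-addressing table. This is a hedged illustration, not SymNameIndex: `SkNameIndex` and its functions are hypothetical, the capacity is fixed where a real index would grow, and rename or removal is not modeled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SK_CAP 256u              /* power of two; a real index would grow */

typedef struct {
    const char *keys[SK_CAP];    /* NULL marks an empty slot */
    int         ids[SK_CAP];
    unsigned    used;
} SkNameIndex;

/* FNV-1a over the bytes of a NUL-terminated name. */
static uint32_t sk_hash(const char *s) {
    uint32_t h = 2166136261u;
    for (; *s; s++) { h ^= (unsigned char)*s; h *= 16777619u; }
    return h;
}

/* Returns the id for `name`, or -1 if absent. Linear probing. */
static int sk_index_find(const SkNameIndex *ix, const char *name) {
    uint32_t i = sk_hash(name) & (SK_CAP - 1);
    while (ix->keys[i]) {
        if (strcmp(ix->keys[i], name) == 0) return ix->ids[i];
        i = (i + 1) & (SK_CAP - 1);
    }
    return -1;
}

/* Inserts or overwrites; the caller keeps `name` alive. */
static void sk_index_put(SkNameIndex *ix, const char *name, int id) {
    uint32_t i = sk_hash(name) & (SK_CAP - 1);
    assert(ix->used < SK_CAP - 1);   /* keep one slot empty so probes terminate */
    while (ix->keys[i] && strcmp(ix->keys[i], name) != 0)
        i = (i + 1) & (SK_CAP - 1);
    if (!ix->keys[i]) { ix->keys[i] = name; ix->used++; }
    ix->ids[i] = id;
}
```

With an index like this behind the lookup, decl cost stays roughly constant per symbol as the shared builder grows, instead of scaling with the number of symbols already declared.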

The opaque inputs in a link (libc, crt, kit archives, DSOs) are still resolved by the linker at link time against the single emitted LTO object. So the policy module has two call sites — recording-time merge among the LTO set, and link-time resolution against everything — which is the justification for extracting it rather than duplicating it.

4. The staging lifecycle

The lifecycle target for Phase 1 is documented in CG_OBJ_LIFECYCLE.md. The short version: ObjBuilder owns object lifetime, while KitCg borrows an object, records one or more semantic units, and finishes codegen into that object. kit_cg_finish is a CG flush/lowering/debug operation; it is not object finalization.

The old object-shaped bracket finalized the object (lowering and emitting everything), nulled g->obj/g->target, and reset per-object state including rodata_counter (src/cg/session.c). That structure is replaced by a borrowed lifecycle:

Record each TU as a unit in one live CG session without object finalization. Run a single kit_cg_finish after the last semantic source, then let the caller finalize the ObjBuilder. The shared path records N semantic frontends into one shared KitCg / ObjBuilder and finalizes once through the explicit lifecycle: kit_cg_begin, kit_cg_begin_unit, kit_cg_end_unit, kit_cg_finish, and kit_cg_detach/kit_cg_abort. Drivers collect sources and opaque inputs; they do not implement definition selection, IR lifetime, semantic finalization, or object finalization policy.
Frontend participation is explicit. KitFrontendVTable has a split contract: semantic frontends implement compile_cg, while opaque frontends implement compile_obj. C, Toy, and Wasm participate by emitting into a caller-owned open KitCg session; one-TU object builds are wrapped at the compile-session layer by creating an ObjBuilder, attaching KitCg for one unit, finishing CG, and then finalizing the object. Asm has no semantic CG representation, so its LTO participation mode is opaque: it compiles to an ordinary object and contributes references/definitions to the link picture but not to the merged optimization module. This keeps all verbs and all frontends on one declared path while allowing semantic frontend opt-in one at a time.
The recording arena must outlive any single TU. The recorder and module are arena-allocated from c->tu today (opt_cgtarget_new, cg_ir_recorder_new). In the current implementation c->tu is already compiler-session lifetime (not reset between source inputs), so Phase 1 uses it as the cross-TU recorder arena and documents that lifetime. If c->tu later becomes per-source again, the shared CG path must switch to an explicit cross-source arena; the frontend staging API must not depend on that allocator choice.
Each TU keeps its own frontend state. The per-TU Pool, DeclTable, and type interning stay independent; only the CG session and ObjBuilder are shared. The shared KitCompiler already spans sources today, so c->global name interning is already consistent across TUs.

The driver change is a shared staging engine: group every LTO-capable source input in command-line order, stage semantic frontends into the borrowed CG session and shared object, compile opaque frontends/objects as ordinary inputs, then finish CG once and substitute the resulting builder at the right place in the link order. The hook is build_compile_all in driver/cmd/build.c (shared by build-exe/build-lib/build-obj) and cc_run_link_exe — both already compile every source under one KitCompiler, which is the seam this loop replaces. (compile/compile_engine from the original plan were retired in favor of the build verbs on main.)
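The ordering constraints of the borrowed lifecycle can be modeled as a small state machine. This sketch captures only the call ordering described in this section; `SkSession` and the `sk_*` functions are hypothetical and do not reflect the real kit_cg signatures, and abort/detach are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the call ordering only; not the kit_cg API. */
typedef enum { SK_CG_IDLE, SK_CG_OPEN, SK_CG_IN_UNIT, SK_CG_FINISHED } SkCgState;

typedef struct {
    SkCgState state;
    int       units_recorded;
    bool      object_finalized;   /* owned by the caller, not by the CG session */
} SkSession;

static bool sk_begin(SkSession *s) {
    assert(s != NULL);
    if (s->state != SK_CG_IDLE) return false;
    s->state = SK_CG_OPEN;
    s->units_recorded = 0;
    s->object_finalized = false;
    return true;
}

static bool sk_begin_unit(SkSession *s) {
    if (s->state != SK_CG_OPEN) return false;
    s->state = SK_CG_IN_UNIT;
    return true;
}

static bool sk_end_unit(SkSession *s) {
    if (s->state != SK_CG_IN_UNIT) return false;
    s->state = SK_CG_OPEN;
    s->units_recorded++;
    return true;
}

/* Finishing codegen happens once, between units; it does not finalize the object. */
static bool sk_finish(SkSession *s) {
    if (s->state != SK_CG_OPEN) return false;
    s->state = SK_CG_FINISHED;
    return true;
}

/* The caller finalizes the object only after CG has finished. */
static bool sk_finalize_object(SkSession *s) {
    if (s->state != SK_CG_FINISHED || s->object_finalized) return false;
    s->object_finalized = true;
    return true;
}
```

A one-TU build and an N-TU build differ only in how many begin_unit/end_unit pairs run before the single finish, which is the point of using one ownership model for both.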

5. The export / preserved set

Internalizing a global — demoting it to hidden/local, which unlocks DCE and unconstrained inlining — is sound only when nothing outside the LTO set can reference it by name and it is not interposable. This is the one input that genuinely needs the full link picture, so it is computed at link time and handed to the LTO core. A symbol must be preserved if it is:

the entry symbol (main/_start), or in the dynamic export set; for -shared, default-visibility symbols are interposable and must not be internalized or inlined across unless -fvisibility=hidden / a version script / -Bsymbolic says otherwise;
referenced (undefined) by any opaque input — libc/crt calling main, a kit archive member that is not IR, a DSO;
__attribute__((used)), in an init/fini array, named in inline or file-scope asm, an IFUNC resolver, or address-significant in an opaque input.

The linker already answers "is this symbol referenced from outside" for archive pull (scan_presence_before / member_satisfies, src/link/link_resolve.c:859, :923); the preserved set is the same question asked of the LTO set against the opaque inputs and the output-kind/visibility policy. Conservative default: internalize only for executable outputs or provably non-exported symbols.

Phase 1 implements this for all-sources-up-front executable LTO: the driver stages semantic sources, assembles the ordered link session, asks the linker for preserved LTO symbols, then passes those IDs to kit_cg_finish before object finalization. Relocatable and archive-member outputs remain conservative because later links may still reference globals by name. Shared-library LTO continues to be rejected until shared output policy is exercised.
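The preservation rules in this section reduce to a per-symbol predicate once the link picture has been turned into facts. The following is a minimal sketch under that assumption: `SkSymFacts`, `SkOutputKind`, and `sk_must_preserve` are hypothetical, and how the linker would actually compute each fact is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { SK_OUT_EXE, SK_OUT_SHARED, SK_OUT_RELOCATABLE } SkOutputKind;

/* Hypothetical per-symbol facts a link picture could supply. */
typedef struct {
    bool is_entry;              /* main/_start */
    bool dynamic_export;
    bool referenced_by_opaque;  /* undefined in libc/crt, a non-IR archive member, a DSO */
    bool attr_used;             /* __attribute__((used)) */
    bool in_init_fini;
    bool named_in_asm;
    bool ifunc_resolver;
    bool address_significant;   /* in an opaque input */
} SkSymFacts;

/* True when the symbol must keep its global binding. */
static bool sk_must_preserve(const SkSymFacts *f, SkOutputKind out) {
    assert(f != NULL);
    /* Conservative default: only executable outputs internalize anything. */
    if (out != SK_OUT_EXE) return true;
    return f->is_entry || f->dynamic_export || f->referenced_by_opaque ||
           f->attr_used || f->in_init_fini || f->named_in_asm ||
           f->ifunc_resolver || f->address_significant;
}
```

Returning true for every non-executable output mirrors the conservative stance described above for relocatable and archive outputs; a later shared-library policy would refine that branch rather than the fact list.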

6. The whole-program optimization core

With a merged module and a preserved set, the core is opt_emit_reachable_aarch64 generalized:

1. Generalize the finalize sweep to all arches. Lift the ARM64-only path in opt_on_finalize into an arch-independent opt_whole_module_finalize, and switch x86-64/riscv64 from eager per-function emit to defer-to-finalize when the whole-program path is active. Keep -O0/-O1 streaming and the JIT/interp/run/dbg/emu paths on the existing eager path, because LTO is an AOT concern. The one verification item is that nothing downstream depends on x64/rv64 eager emission (opt_maybe_capture_interp).
2. Internalize non-preserved globals using the §5 set.
3. GC unreachable functions and data (the existing reachability walk, now over the whole program).
4. Lower the reachable set into a FuncSet and run opt_inline (the already-written, never-called whole-program inliner) with a real cost model.
5. Emit one object and substitute it for the IR inputs before the final link.

Steps 1 and 4 are independently valuable and land first (Phase 0): they turn the unreached inliner and the generalized sweep into a tested, shipping path on a single TU before any cross-TU complexity exists.
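The internalization step can be sketched as a single pass that demotes bindings before the reachability walk runs. This is an illustrative model: `SkGlobal` and `sk_internalize` are hypothetical names, and the `preserved` flag stands in for membership in the §5 set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { SK_B_LOCAL, SK_B_GLOBAL } SkBind;

typedef struct {
    SkBind bind;
    bool   defined_in_lto_set;  /* a body was recorded into the shared module */
    bool   preserved;           /* hypothetical: member of the preserved set */
} SkGlobal;

/* Demotes every non-preserved global defined in the LTO set to local.
 * Returns how many symbols were demoted. */
static size_t sk_internalize(SkGlobal *syms, size_t n) {
    size_t demoted = 0;
    assert(syms != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (syms[i].bind == SK_B_GLOBAL &&
            syms[i].defined_in_lto_set && !syms[i].preserved) {
            syms[i].bind = SK_B_LOCAL;
            demoted++;
        }
    }
    return demoted;
}
```

A demoted symbol stops being a root for the walk, so it survives only if something live references it. That is why internalization has to come before GC, and why GC is re-run after it.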

7. Phased delivery

Phase 0 — Whole-translation-unit optimization. No merge, no serialization, no driver changes. Generalize the finalize sweep to all arches, switch x64/rv64 to defer-to-finalize under the whole-program path, and wire opt_inline over the reachable FuncSet. Delivers real cross-function inlining within a TU at -O2 on every arch, generalized dead-static elimination, and the inliner finally exercised on real code. Lowest risk — purely inside the optimizer — and it validates the deferred-emit path that Phase 1's staging lifecycle also relies on.

Phase 1 — Shared-context, all-sources-up-front LTO. The target case, kit cc *.c -flto -o prog and kit build-exe -flto (and build-lib/build-obj; build-obj replaced the retired compile). Build on Phase 0 by adding: (a) the symresolve extraction (§3), (b) the ObjBuilder name index (§3), (c) skip-intern for locals (§2), (d) the KitCg/ObjBuilder borrowed staging lifecycle and the driver loop that records N frontends into one session and finishes CG once (§4), (e) the preserved set fed from the assembled link into kit_cg_finish (§5). No cloner, no serialization, no archive support yet.

Phase 2 — Serialized IR objects (.kit.ir). Optional follow-on for separate compilation, archives, and build caches. kit cc -c -flto a.c emits a normal object whose symbol table is the real decl set — so the linker's symbol-driven archive pull works unchanged — plus a .kit.ir custom section (the object model already supports arbitrary SEC_OTHER sections) carrying a serialized CgIrModule. The linker detects the section and replays the records into the same shared context through the same recording/merge API, reusing skip-intern-locals and symresolve_merge verbatim. Archives of IR objects work because pull is symbol-table driven. Note: whole-program LTO is incompatible with the file-based incremental linker (LINKER.md); -flto forces a full link.

8. Optimizations unlocked

Inlining is the headline; the merged-module + FuncSet framework makes a whole interprocedural family expressible (listed as enabled, not committed):

Cross-TU / whole-program inlining — opt_inline, already written.
Internalization to hidden/local for non-exported globals — enables DCE, removes PLT/GOT indirection, frees intra-function optimization.
Whole-program dead code / data elimination — the generalized sweep.
Future: devirtualization / direct-call promotion, IPSCCP and cross-function constant/range propagation, argument promotion, identical-code folding, const/pure inference, global-to-local-constant propagation.

9. Risks, semantics, and limitations

Resolution fidelity. ODR, weak/strong, common merging, COMDAT, IFUNC, aliases, and visibility must match the linker exactly or LTO miscompiles — hence the shared symresolve module rather than a re-implementation.
Interposition / shared libraries. Never internalize or inline across an interposable default-visibility boundary unless -Bsymbolic or hidden visibility makes it safe. The default for -shared is conservative.
Inline and file-scope asm. Symbols named in inline or file-scope asm are opaque references: treat them as roots and never rename, internalize, or DCE them.
Debug info. Cross-TU inlines need inlined_subroutine DWARF with correct file/line; SrcLoc must carry file identity through the merged module. Acceptable initial limitation: degraded inlined debug info under -g -flto, stated explicitly.
Compile time / memory. The whole program lives in memory; opt_inline's growth gates bound blow-up; only the LTO set is optimized, opaque inputs stay opaque.
Determinism. Record and merge in input order; iterate stably.
Recording arena lifetime (§4) is the one structural hazard — settle it before building the staging loop.
TLS, varargs, atomics, computed goto, label-address tables must survive the shared module unchanged. Function-local label addresses are already function-scoped; cross-function data_addr/pcrel/symdiff reference symbols, which are already unified by id.

10. First slices

Two independently landable, low-risk steps that de-risk the whole direction before any LTO surface exists:

Extract symresolve and refactor link_resolve_symbols onto it. Pure refactor, covered by test/link. Lands the load-bearing piece and improves the linker regardless of LTO.
Add the ObjBuilder name -> id index behind the existing obj_symbol_find/obj_symbol_ex API. Drop-in; measurable on its own.

Then Phase 0 (generalize the sweep + wire opt_inline, gated behind -O2 / -fwhole-program), validated first on x86-64 with red-green tests in test/opt (a caller+callee that should fuse) and test/smoke/x64 (behavioral parity). Then the staging lifecycle and skip-intern-locals behind -flto, exercised first on a two-TU test/smoke case where a cross-TU callee inlines.

Open questions

Define-timing for resolution (§3): confirm the staging-boundary merge is the right hook versus an obj_symbol_define-time check, given symbols are only obj-defined at emit.
Recording arena follow-through (§4): Phase 1 relies on c->tu having compiler-session lifetime for the cross-TU recorder/module. If frontend reset semantics later make c->tu per-source again, move the recorder/module to an explicit cross-source arena without changing the frontend staging API.
-flto flag surface (largely resolved — see Status): -flto opt-in on cc and the build verbs, decided per the Status section. Still open: whether -fwhole-program is a distinct, more aggressive internalization mode, and whether to make cross-TU LTO the -O1 default later.
CG API exposure: how much of the borrowed lifecycle (kit_cg_begin/kit_cg_begin_unit/kit_cg_finish/kit_cg_detach) remains internal to the driver (build.c's build_compile_all, cc_run_link_exe) versus becoming a public kit_cg/kit_compile surface for embedders driving multi-TU LTO.
