LTO / Whole-Program Optimization (planned work)
This is the forward-looking plan for link-time optimization in kit: making a
library or executable look like a single translation unit to the optimizer, so
inlining, dead-code elimination, internalization, and the rest of the
interprocedural family can cross TU boundaries. It deliberately does not
target GCC/Clang LTO bitcode compatibility. The initial scope is kit invocations
that provide all sources up front (kit cc *.c -O2 -flto -o prog); separately
compiled IR objects are a later phase that reuses the same core.
The optimizer baseline this builds on — the recording IR, the recording/optimizing boundary, the finalize path, and the pass catalog — is in ../OPT.md and OPTIMIZER.md. The link-time symbol model is in LINKER.md. The CG/object lifetime boundary used by the remaining Phase 1 staging work is in CG_OBJ_LIFECYCLE.md. This document treats those as given and describes only the LTO-specific additions.
The headline finding from investigating the tree: most of the machinery for whole-program optimization already exists; it is just per-TU, single-arch, and partly unreached. LTO here is three concrete refactors plus wiring, not a new subsystem. The largest of the three is factoring the linker's symbol-resolution policy out so it can run at merge time as well as at link time.
Status (2026-06-04)
Phase 0 is complete and shipping; Phase 1's all-sources-up-front LTO path is
implemented in this branch. The end state is not a C-only shortcut:
every source-building verb routes through one staging engine, and every
in-tree frontend declares either semantic CG staging or opaque-object
participation. The link-picture-driven preserved/export prepass now feeds
kit_cg_finish, and executable LTO internalizes non-preserved globals before
the whole-module reachability walk. Where reality diverged from the original
wording below:
- The gate is -O1, not -O2. Whole-program optimization (deferred emit + module sweep + inliner) runs whenever the optimizer runs: o->whole_program = (level >= 1) in opt_cgtarget_new. -O2 is treated as -O1 for now. References to -O2/-fwhole-program gating below are superseded.
- One arch path, no identity checks. The ARM64-only sweep is now opt_whole_module_finalize for every arch; src/opt has zero arch == KIT_ARCH_* checks. The sret arg-slot rule moved off arch identity to ABIFuncInfo.sret_consumes_int_arg (set per ABI impl). Remaining generic-layer arch identity (src/cg/type.c, src/cg/atomic.c, src/link/link_resolve.c) is tracked as separate cleanup, not part of LTO.
- Cross-TU LTO will be opt-in behind -flto (revisit making it the -O1 default once proven); this resolves the flag-surface open question.
- Frontend participation is explicit. C, Toy, and Wasm lower into a caller-owned open KitCg; asm is an opaque LTO participant and continues to compile as an ordinary object.
- The lifecycle target is borrowed KitCg + caller-owned ObjBuilder, not a separate LTO unit abstraction. ObjBuilder owns object lifetime; KitCg records source units into a borrowed object and finishes semantic codegen with link-picture policy. See CG_OBJ_LIFECYCLE.md.
- symresolve_merge signature as built is (SymAttrs existing, SymAttrs incoming) with in_comdat carried inside SymAttrs; no separate coff_target parameter (the COMDAT flags carry everything the decision needs).
- Preserved/export internalization is part of Phase 1. The LTO CG finish path receives linker-computed preserved symbols for executable links, and cc -shared -flto remains disabled until shared-library output is exercised.
Done
- §6.1 Generalize the finalize sweep to all arches: opt_whole_module_finalize (src/opt/opt.c); x64/rv64 defer-to-finalize; -O0 and the JIT/interp/run paths unchanged; opt_maybe_capture_interp still invoked per reachable func.
- §6.4 Wire opt_inline over the reachable FuncSet: opt_run_o1_native split into opt_o1_native_prepare/opt_o1_native_finish; the sweep lowers the live set into one FuncSet, runs the inliner, then finishes each func.
- Interposition soundness fix (strengthens §9): weak/interposable callees are never inlined. opt_cg_func_interposable marks them KIT_CG_INLINE_NEVER, honored by both the streaming tiny-inliner and the whole-program inliner. Caught by a strong-over-weak override case the prior (tiny-inliner) behavior miscompiled.
- §3 symresolve extraction: src/obj/symresolve.{h,c}; link_resolve_symbols refactored onto symresolve_merge; link_bind_strength/link_sym_is_def/link_sym_is_spurious_undef are now wrappers. Behavior-preserving (test-link 122/0, test-macho 80/0, ODR/weak/common/COMDAT all covered).
- §3 ObjBuilder name→id index: SymNameIndex in src/obj/obj.c; obj_symbol_find is an authoritative O(1) hash lookup with no linear scan, kept exact through obj_symbol_ex and obj_symbol_rename.
- Tests: test/opt/whole_program_inline.sh (wired test-opt-whole-program-inline): static callee fuses on aa64/x64/rv64, weak callee kept out-of-line (interposition guard), opt.inline.inlined fires at -O1, and the kit-native build verbs (build-obj/build-exe) fuse too.
- Build verbs participate. build-exe/build-lib/build-obj (which replaced compile on main) compile each source to an in-memory builder under one KitCompiler via build_compile_all (driver/cmd/build.c) and route through the shared kit_cg path, so per-TU whole-program optimization applies at -O1 with no verb-specific wiring. build_compile_all is also the single seam the Phase 1 cross-TU staging loop will hook (all three verbs at once); cc keeps its own cc_run_link_exe → link_engine path.
Phase 1 source-staging checklist
- Architecture lock-in. Phase 1 is implemented as a frontend staging and CG/ObjBuilder lifecycle refactor, not a C-driver shortcut. All source-building verbs (cc, build-exe, build-lib, build-obj) route through the same staging engine. Frontends explicitly declare how they participate: semantic kit_cg staging for frontends that lower through CG, or opaque-object participation for inputs that cannot expose semantic IR (notably asm). The change is not complete until every in-tree frontend is opted into one of those modes.
- §2 Skip-intern locals. In kit_cg_decl (src/cg/session.c:198), for SB_LOCAL bindings skip obj_symbol_find and always mint a fresh id. Confirm the per-Decl id cache keeps intra-TU static reuse pointing at the cached id, and that single-TU behavior is unchanged (locals are already unique per name within a TU).
- §4 Recording-arena lifetime (settle first). Choose dedicated LTO arena vs c->global for the recorder/CgIrModule so accumulated IR outlives each per-TU frontend run. This is the one structural hazard (§9).
- §4 Source staging under the current CG API. Add a deferred-finalize mode to kit_cg: record N TUs into one shared session / ObjBuilder / CgIrModule without per-TU finalization, then finish CG and finalize the object once. Keep per-TU frontend state (Pool/DeclTable/type interning) independent.
- §4 CG/ObjBuilder borrowed lifecycle. Replace the former object-shaped CG bracket with the lifecycle in CG_OBJ_LIFECYCLE.md: caller-owned ObjBuilder, borrowed KitCg, explicit unit boundaries, kit_cg_finish for semantic codegen policy, and caller-owned object finalization. One-TU and multi-TU builds now use the same ownership model.
- §3/§4 Recording-time merge. At the per-TU staging boundary, when a TU contributes a body for a symbol already defined, call symresolve_merge to pick the winner; drop the loser's CgIrFunc/data and keep its decl as a reference; report ODR at the second definition's SrcLoc.
- §4 Driver loop + -flto flag. Parse -flto in cc and the build verbs, thread an LTO flag through KitCodeOptions/the driver, and add the staging path: one shared session, frontend per source, one CG finish/object finalize, single builder to the link session. Hook it at build_compile_all (driver/cmd/build.c) so build-exe/lib/obj get it together, plus cc_run_link_exe. (The build verbs already share one KitCompiler, so the seam is in place.)
- §5 Preserved/export set. Compute from the assembled link (entry symbol, dynamic exports, undefs referenced by opaque inputs, used/init-fini/asm-named/IFUNC/address-significant) and hand it to kit_cg_finish. Current Phase 1 behavior is conservative for relocatable/archive outputs, while executable outputs internalize non-preserved LTO definitions. Shared-library LTO remains disabled until shared output is exercised.
- §6.2 Internalize non-preserved globals using the preserved set (unlocks cross-TU DCE and unconstrained inlining), then re-run GC.
- Tests. A two-TU test/smoke (or test/link) case where a cross-TU callee inlines under -flto; a guard that a weak/exported cross-TU symbol is not inlined/internalized; cross-TU ODR reported at the right SrcLoc.
Baseline (what already exists)
A handful of facts about the current code path frame everything below.
- Globals already intern by name within an ObjBuilder. kit_cg_decl does obj_symbol_find then reuse-or-create (src/cg/session.c:198). Two frontends that decl foo into the same builder receive the same ObjSymId. The CG and optimizer IRs reference call targets and globals by ObjSymId (IRCallAux.desc.callee.v.global.sym), so a caller's call foo already points at the id the definer will define: no remap, no clone. This is the load-bearing fact for the whole design.
- The recorder already accumulates a whole module. One CgIrRecorder owns one CgIrModule and appends every func_begin/func_end into it (src/cg/ir_recorder.c), flushing only at finalize. CgIrModule (src/cg/ir.h:270) holds all functions, aliases, and file-scope asm. Per function it carries call_refs and global_refs symbol sets (src/cg/ir.h:247), so the call/use graph is materialized during recording.
- The optimizer already finalizes over the whole module, but for one arch only. opt_on_finalize (src/opt/opt.c:566) hands the entire CgIrModule to opt_emit_reachable_aarch64 (src/opt/opt.c:495), which seeds a root set (non-LOCAL symbols, KIT_CG_SYM_USED locals, alias targets, exported data relocs), walks each function's call_refs/global_refs plus the data-reloc graph, removes unreachable local symbols, then lowers + optimizes + emits only what is live. This is whole-program GC for one TU. x86-64 and riscv64 instead emit eagerly per function in opt_on_func (src/opt/opt.c:322); they have no module pass.
- The whole-program inliner exists and is unreached. opt_inline(FuncSet*, max_iters) (src/pass_inline.c path as cited: src/opt/pass_inline.c:667) does topologically ordered, growth-gated, call-graph inlining over a FuncSet of lowered Funcs. Only the streaming tiny variant opt_try_tiny_inline (cost cap 8, straightline only) runs today. The real inliner has never had a caller. See OPTIMIZER.md §6.
- The driver already shares one KitCompiler across sources and keeps objects in memory through link. cc_run_link_exe compiles each source to its own in-memory KitObjBuilder under one compiler (driver/cmd/cc.c:2655, objs[] at :2585) and hands the builders straight to the link session via kit_link_session_add_obj (:2735/:2771), with no temp .o files. The orchestration seam for LTO is already where it needs to be.
- The obj layer has no resolution policy and no name index. obj_symbol_define is last-writer-wins with no precedence check (src/obj/obj.c:544), and obj_symbol_find is a linear scan (src/obj/obj.c:534). The only resolution rule anywhere below the linker is the weak-demotion special case hand-coded into kit_cg_decl (src/cg/session.c:203). All real precedence lives in link_resolve_symbols (src/link/link_resolve.c:258).
The merged module, then, is not something LTO must build. It is something the recorder already builds and the finalize path already consumes — for one TU, on one arch. LTO is mostly about not tearing it down between TUs, generalizing the finalize pass, and applying real resolution policy as the merge happens.
1. Design decision: shared context, not clone-and-merge
Two architectures can make the optimizer see one module:
- Clone-and-merge. Each TU records into its own CgIrModule/ObjBuilder; an IR-linker deep-copies every function into a merged module, rebuilds the symbol table, and remaps every operand/reloc/alias to merged ids.
- Shared context. All TUs record into one live session (one ObjBuilder, one recorder, one CgIrModule) so globals unify in place via the existing decl interning and the finalize path sees the union directly.
We choose shared context. The comparison:
| | Shared context | Clone + remap |
|---|---|---|
| Global identity | Free (decl already interns by name) | Rebuild symbol table + remap every operand/reloc/alias |
| Memory / time | Record once, in place | Duplicate all IR into a merge arena |
| Resolution policy | Apply at the per-TU merge boundary | Apply in the merge pass |
| Local distinctness | Skip-intern locals (small CG change) | Falls out of remap |
| Lifecycle cost | Staging mode + cross-TU arena lifetime | None — TUs stay independent |
| Net new code | Mostly wiring + policy extraction | A full cloner/remapper on the hot path |
| Serialized objects (Phase 2) | Deserialize = replay records through the same recording/merge API | A separate clone-from-bytes engine |
Clone's only advantage is that TUs stay fully independent, so there is no
staging lifecycle to manage. That is not worth re-implementing the symbol merge
the linker already knows how to do, nor the per-TU IR duplication. The decisive
row is the last one: shared context makes the recording/merge API the single
funnel. A frontend feeds it; a .kit.ir deserializer feeds it the same way.
There is no second merge engine to build for Phase 2 — fat-object LTO becomes
"replay serialized decl/func records into the live shared module," reusing the
same local-handling and resolution code paths verbatim.
The rest of this document describes the shared-context design.
2. Symbol identity: what unifies, what must stay distinct
Shared context gets global unification for free (§Baseline). The one correctness trap is local symbols, and there is exactly one rule to add.
kit_cg_decl interns every name through obj_symbol_find. For globals that is
correct and desirable. For SB_LOCAL symbols it is wrong: two TUs each with
static int x; (the frontend passes the bare name with LOCAL binding,
lang/c/decl/decl.c:72) would collapse to one symbol. The same hazard exists for
static functions, and for the per-TU counters behind block-scope statics
(mint_static_local_sym, lang/c/parse/parse.c:660) and compound literals
(mint_compound_literal_sym, lang/c/parse/parse_init.c:1012), which reset per
parser and would produce colliding names like y.0 and
__kit_compound_literal.2 across TUs.
Fix: for SB_LOCAL bindings, skip obj_symbol_find and always mint a fresh
id. Consequences, all benign:
- Two locals named x get distinct ObjSymIds. Duplicate STB_LOCAL names in one object are legal in every format kit emits; locals never enter the global name table; the optimizer indexes functions by id, not name.
- The frontend caches the id per Decl, so intra-TU reuse of a static is unaffected: the second reference goes through the cached id, not a fresh decl.
- The static-vs-extern-same-name case resolves correctly: the static gets a fresh id; an unrelated extern foo keeps the shared global id.
No frontend mangling is required. Anonymous read-only data (.Lkit_ro.N,
src/cg/memory.c:102) needs no change at all once the session is shared, because
the rodata_counter is no longer reset between TUs (see §4) and keeps climbing.
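The skip-intern rule can be illustrated with a toy symbol table. This is a minimal sketch, not kit's code: Builder, Sym, find_global, and decl are hypothetical stand-ins for ObjBuilder, obj_symbol_find, and kit_cg_decl, and the fixed-size array is an assumption made to keep the example short.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins; not kit's real ObjBuilder or kit_cg_decl. */
enum { MAX_SYMS = 64 };
typedef enum { BIND_GLOBAL, BIND_LOCAL } Bind;
typedef struct { const char *name; Bind bind; } Sym;
typedef struct { Sym syms[MAX_SYMS]; int count; } Builder;

/* Look up a global by name; locals are deliberately invisible here. */
static int find_global(const Builder *b, const char *name) {
    for (int i = 0; i < b->count; i++)
        if (b->syms[i].bind == BIND_GLOBAL && strcmp(b->syms[i].name, name) == 0)
            return i;
    return -1;
}

/* Declare a symbol: globals intern by name, locals always mint a fresh id. */
static int decl(Builder *b, const char *name, Bind bind) {
    if (bind == BIND_GLOBAL) {
        int id = find_global(b, name);
        if (id >= 0) return id;          /* reuse the shared global id */
    }
    assert(b->count < MAX_SYMS);         /* toy capacity limit */
    b->syms[b->count].name = name;
    b->syms[b->count].bind = bind;
    return b->count++;                   /* fresh id */
}
```

In this sketch two TUs that each declare a local x receive different ids, while two declarations of a global foo share one id, which mirrors the three consequences listed above. Intra-TU reuse of a static is assumed to go through the frontend's per-Decl cache rather than a second decl call.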
3. Resolution policy: factoring symresolve out of the linker
Today, if we naively share one ObjBuilder, we lose all symbol-resolution
semantics: obj_symbol_define overwrites last-writer-wins, so two strong
definitions silently clobber instead of raising an ODR error, strong-vs-weak
becomes declaration-order dependent, and commons never merge. The precedence
rules we need already exist in link_resolve_symbols
(src/link/link_resolve.c:258): strong-vs-strong → ODR error (modulo COFF
COMDAT/SELECTANY), strong beats weak, weak-weak keeps the first, common merging
takes max size/align, and a definition beats a common.
Per the investigation, that logic is cleanly separable. The decision is pure
over (name, bind, kind, size, align, common_align, defined?, in_comdat)
tuples; only the bookkeeping — the globals SymHash, the per-input
InputMap, COMDAT section discard, DSO iteration — is entangled with linker
state.
Extract a small shared module, src/obj/symresolve.{h,c} (the obj layer is
the natural home; both consumers sit above it):
// pure: no linker state, no allocation
SymMergeResult symresolve_merge(SymAttrs existing, SymAttrs incoming,
int coff_target);
// -> KEEP_EXISTING | REPLACE | MERGE_COMMON(size, align)
// | COMDAT_DISCARD | ODR_ERROR
Move link_bind_strength, link_sym_is_def, and link_sym_is_spurious_undef
(src/link/link_internal.h) alongside it. Then:
- Refactor link_resolve_symbols onto it. A pure cleanup with no behavior change, fully covered by the test/link corpus, that gives the policy one source of truth and leaves the linker better than we found it.
- The LTO staging coordinator calls the same function at the per-TU merge boundary. Crucially this is a binding-precedence decision (which body wins), not id remapping, because ids are already unified. When TU B contributes a body for a global TU A already provided, symresolve_merge decides whether to keep A's CgIrFunc/data, replace it with B's, merge commons, or raise ODR. The loser's body is dropped from the module and its decl remains as a reference. ODR conflicts are reported at the second definition's SrcLoc, which gives better diagnostics than the linker's post-hoc panic because source locations still exist at this point.
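The precedence rules quoted from link_resolve_symbols can be written as a pure function over two attribute records. The sketch below is illustrative only: Attrs, Verdict, MergeResult, and merge_attrs are hypothetical names, COMDAT handling is omitted, and the case of a weak definition meeting a common follows the text's "a definition beats a common" literally, which is an assumption about the real policy.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types; kit's real SymAttrs and SymMergeResult may differ. */
typedef enum { K_UNDEF, K_COMMON, K_WEAK_DEF, K_STRONG_DEF } Kind;
typedef struct { Kind kind; size_t size; size_t align; } Attrs;
typedef enum { KEEP_EXISTING, REPLACE, MERGE_COMMON, ODR_ERROR } Verdict;
typedef struct { Verdict verdict; size_t size; size_t align; } MergeResult;

static size_t max_sz(size_t a, size_t b) { return a > b ? a : b; }

/* Pure decision: no linker state, no allocation. */
static MergeResult merge_attrs(Attrs existing, Attrs incoming) {
    MergeResult r = { KEEP_EXISTING, existing.size, existing.align };
    assert(existing.kind <= K_STRONG_DEF && incoming.kind <= K_STRONG_DEF);

    if (incoming.kind == K_UNDEF)            /* a reference displaces nothing */
        return r;
    if (existing.kind == K_UNDEF ||          /* first real contribution */
        (existing.kind == K_COMMON && incoming.kind != K_COMMON) ||
        (existing.kind == K_WEAK_DEF && incoming.kind == K_STRONG_DEF)) {
        r.verdict = REPLACE;                 /* def beats common, strong beats weak */
        r.size = incoming.size;
        r.align = incoming.align;
        return r;
    }
    if (existing.kind == K_COMMON) {         /* common + common: take the maxima */
        r.verdict = MERGE_COMMON;
        r.size = max_sz(existing.size, incoming.size);
        r.align = max_sz(existing.align, incoming.align);
        return r;
    }
    if (existing.kind == K_STRONG_DEF && incoming.kind == K_STRONG_DEF)
        r.verdict = ODR_ERROR;               /* two strong definitions */
    return r;                                /* otherwise the first one wins */
}
```

Because the function takes and returns plain values, the same decision could in principle serve both call sites described here: the staging coordinator would drop the losing body on REPLACE or KEEP_EXISTING, and would report ODR_ERROR at the incoming definition's source location.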
Two mechanical needs fall out of sharing one builder:
- Give ObjBuilder a name -> id hash map. With the whole program's symbols in one builder, the linear obj_symbol_find (src/obj/obj.c:534) costs O(n²) overall at decl time. The assembler already carries its own SymSymMap precisely because obj lacks one (src/asm/asm.c). Adding the index to the builder removes the quadratic and hosts the resolution hook. That is a win even for ordinary single-TU compiles, and it lets the assembler shed its private map later.
- One open question on define timing. During pure recording a global is declared (with binding) but not defined in the obj sense until finalize emits its section/offset. So the "which body wins" decision must run at the staging boundary against the set of bodies a TU contributes (it has a CgIrFunc or data record for the symbol), not at obj_symbol_define time. This is the linker's per-input symbol merge applied to CgIrModule contributions.
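A name -> id index of the kind described above could look like the following open-addressing table. It is a sketch under stated assumptions: NameIndex, index_find, and index_put are hypothetical names, the capacity is fixed, and rename/delete (which the real index would need) are left out. The hash is 32-bit FNV-1a.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { INDEX_CAP = 256 };  /* power of two; a real index would grow */
typedef struct { const char *name; int id; } Slot;
typedef struct { Slot slots[INDEX_CAP]; int used; } NameIndex;

/* 32-bit FNV-1a over a NUL-terminated string. */
static uint32_t fnv1a(const char *s) {
    uint32_t h = 2166136261u;
    for (; *s; s++) { h ^= (unsigned char)*s; h *= 16777619u; }
    return h;
}

/* Returns the slot holding name, or the empty slot where it would go. */
static Slot *probe(NameIndex *ix, const char *name) {
    uint32_t i = fnv1a(name) & (INDEX_CAP - 1);
    while (ix->slots[i].name && strcmp(ix->slots[i].name, name) != 0)
        i = (i + 1) & (INDEX_CAP - 1);   /* linear probing */
    return &ix->slots[i];
}

/* Expected O(1) lookup; returns -1 when the name is absent. */
static int index_find(NameIndex *ix, const char *name) {
    Slot *s = probe(ix, name);
    return s->name ? s->id : -1;
}

/* Insert a new name or update the id of an existing one. */
static void index_put(NameIndex *ix, const char *name, int id) {
    Slot *s = probe(ix, name);
    if (!s->name) {
        assert(ix->used < INDEX_CAP - 1);  /* keep one slot empty so probes end */
        s->name = name;
        ix->used++;
    }
    s->id = id;
}
```

Under the skip-intern rule only global names would be inserted, so locals never collide in the table. The index stores borrowed name pointers, which assumes the names outlive the builder.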
The opaque inputs in a link (libc, crt, kit archives, DSOs) are still resolved by the linker at link time against the single emitted LTO object. So the policy module has two call sites — recording-time merge among the LTO set, and link-time resolution against everything — which is the justification for extracting it rather than duplicating it.
4. The staging lifecycle
The lifecycle target for Phase 1 is documented in
CG_OBJ_LIFECYCLE.md. The short version: ObjBuilder
owns object lifetime, while KitCg borrows an object, records one or more
semantic units, and finishes codegen into that object. kit_cg_finish is a CG
flush/lowering/debug operation; it is not object finalization.
The old object-shaped bracket used to finalize (lowers + emits everything),
null g->obj/g->target, and reset per-object state including
rodata_counter (src/cg/session.c). The structural state is now a borrowed
lifecycle:
- Record each TU as a unit in one live CG session without object finalization. Run a single kit_cg_finish after the last semantic source, then let the caller finalize the ObjBuilder. The shared path records N semantic frontends into one shared KitCg/ObjBuilder and finalizes once through the explicit lifecycle: kit_cg_begin, kit_cg_begin_unit, kit_cg_end_unit, kit_cg_finish, and kit_cg_detach/kit_cg_abort. Drivers collect sources and opaque inputs; they do not implement definition selection, IR lifetime, semantic finalization, or object finalization policy.
- Frontend participation is explicit. KitFrontendVTable has a split contract: semantic frontends implement compile_cg, while opaque frontends implement compile_obj. C, Toy, and Wasm participate by emitting into a caller-owned open KitCg session; one-TU object builds are wrapped at the compile-session layer by creating an ObjBuilder, attaching KitCg for one unit, finishing CG, and then finalizing the object. Asm has no semantic CG representation, so its LTO participation mode is opaque: it compiles to an ordinary object and contributes references/definitions to the link picture but not to the merged optimization module. This keeps all verbs and all frontends on one declared path while allowing semantic frontend opt-in one at a time.
- The recording arena must outlive any single TU. The recorder and module are arena-allocated from c->tu today (opt_cgtarget_new, cg_ir_recorder_new). In the current implementation c->tu is already compiler-session lifetime (not reset between source inputs), so Phase 1 uses it as the cross-TU recorder arena and documents that lifetime. If c->tu later becomes per-source again, the shared CG path must switch to an explicit cross-source arena; the frontend staging API must not depend on that allocator choice.
- Each TU keeps its own frontend state. The per-TU Pool, DeclTable, and type interning stay independent; only the CG session and ObjBuilder are shared. The shared KitCompiler already spans sources today, so c->global name interning is already consistent across TUs.
The driver change is a shared staging engine: group every LTO-capable source
input in command-line order, stage semantic frontends into the borrowed CG
session and shared object, compile opaque frontends/objects as ordinary inputs,
then finish CG once and substitute the resulting builder at the right place in
the link order. The hook is build_compile_all in driver/cmd/build.c (shared
by build-exe/build-lib/build-obj) and cc_run_link_exe — both already compile
every source under one KitCompiler, which is the seam this loop replaces.
(compile/compile_engine from the original plan were retired in favor of the
build verbs on main.)
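The ordering constraints of the borrowed lifecycle (begin, one unit per source, a single finish, then detach) can be modeled as a small state machine. This is a sketch only: the stage_* functions are hypothetical and merely mirror the order of kit_cg_begin, kit_cg_begin_unit, kit_cg_end_unit, kit_cg_finish, and kit_cg_detach; the real signatures are not shown in this document and are not assumed here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the staging order; not kit's API. */
typedef enum { ST_IDLE, ST_OPEN, ST_IN_UNIT, ST_FINISHED } StageState;
typedef struct { StageState state; int units_recorded; int finish_calls; } Stage;

static bool stage_begin(Stage *s) {
    if (s->state != ST_IDLE) return false;
    s->state = ST_OPEN;
    return true;
}
static bool stage_begin_unit(Stage *s) {
    if (s->state != ST_OPEN) return false;
    s->state = ST_IN_UNIT;
    return true;
}
static bool stage_end_unit(Stage *s) {
    if (s->state != ST_IN_UNIT) return false;
    s->units_recorded++;
    s->state = ST_OPEN;
    return true;
}
static bool stage_finish(Stage *s) {        /* legal only between units */
    if (s->state != ST_OPEN) return false;
    s->finish_calls++;
    s->state = ST_FINISHED;
    return true;
}
static bool stage_detach(Stage *s) {        /* caller finalizes the object after this */
    if (s->state != ST_FINISHED) return false;
    s->state = ST_IDLE;
    return true;
}

/* Driver-shaped loop: N sources in command-line order, one finish. */
static bool stage_all(Stage *s, int n_sources) {
    assert(n_sources >= 0);
    if (!stage_begin(s)) return false;
    for (int i = 0; i < n_sources; i++) {
        if (!stage_begin_unit(s)) return false;
        /* a semantic frontend would record its TU here */
        if (!stage_end_unit(s)) return false;
    }
    return stage_finish(s) && stage_detach(s);
}
```

The point of the model is that finish runs exactly once regardless of the number of units, and that a finish attempted inside an open unit is rejected rather than silently flushing a partial module.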
5. The export / preserved set
Internalizing a global — demoting it to hidden/local, which unlocks DCE and unconstrained inlining — is sound only when nothing outside the LTO set can reference it by name and it is not interposable. This is the one input that genuinely needs the full link picture, so it is computed at link time and handed to the LTO core. A symbol must be preserved if it is:
- the entry symbol (main/_start), or in the dynamic export set; for -shared, default-visibility symbols are interposable and must not be internalized or inlined across unless -fvisibility=hidden / a version script / -Bsymbolic says otherwise;
- referenced (undefined) by any opaque input: libc/crt calling main, a kit archive member that is not IR, a DSO;
- __attribute__((used)), in an init/fini array, named in inline or file-scope asm, an IFUNC resolver, or address-significant in an opaque input.
The linker already answers "is this symbol referenced from outside" for archive
pull (scan_presence_before / member_satisfies, src/link/link_resolve.c:859,
:923); the preserved set is the same question asked of the LTO set against the
opaque inputs and the output-kind/visibility policy. Conservative default:
internalize only for executable outputs or provably non-exported symbols.
Phase 1 implements this for all-sources-up-front executable LTO: the driver
stages semantic sources, assembles the ordered link session, asks the linker for
preserved LTO symbols, then passes those IDs to kit_cg_finish before object
finalization. Relocatable and archive-member outputs remain conservative because
later links may still reference globals by name. Shared-library LTO continues
to reject until shared output policy is exercised.
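The preservation rules reduce to a predicate over per-symbol facts gathered from the link picture. The sketch below assumes a hypothetical LinkFacts record and OutputKind enum; how kit actually represents these facts is not described here. It encodes the Phase 1 policy stated above: internalization is considered only for executable outputs, and everything is preserved otherwise.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical output kinds and per-symbol facts; names are illustrative. */
typedef enum { OUT_EXECUTABLE, OUT_RELOCATABLE, OUT_ARCHIVE_MEMBER } OutputKind;

typedef struct {
    bool is_entry;                    /* main or _start */
    bool dynamic_export;              /* in the dynamic export set */
    bool undef_in_opaque_input;       /* referenced by libc/crt, non-IR archive member, DSO */
    bool attr_used;                   /* __attribute__((used)) */
    bool in_init_fini;                /* listed in an init/fini array */
    bool named_in_asm;                /* named in inline or file-scope asm */
    bool ifunc_resolver;
    bool addr_significant_in_opaque;  /* address-significant in an opaque input */
} LinkFacts;

/* True when the symbol must keep its name and binding. */
static bool must_preserve(const LinkFacts *f, OutputKind out) {
    assert(f != 0);
    if (out != OUT_EXECUTABLE)
        return true;   /* later links may still reference globals by name */
    return f->is_entry || f->dynamic_export || f->undef_in_opaque_input ||
           f->attr_used || f->in_init_fini || f->named_in_asm ||
           f->ifunc_resolver || f->addr_significant_in_opaque;
}
```

Shared-library output is deliberately absent from the enum because Phase 1 rejects it before any such predicate would run.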
6. The whole-program optimization core
With a merged module and a preserved set, the core is opt_emit_reachable_aarch64
generalized:
1. Generalize the finalize sweep to all arches. Lift the ARM64-only path in opt_on_finalize into an arch-independent opt_whole_module_finalize, and switch x86-64/riscv64 from eager per-function emit to defer-to-finalize when the whole-program path is active. Keep -O0/-O1 streaming and the JIT/interp/run/dbg/emu paths on the existing eager path, since LTO is an AOT concern. The one verification item is that nothing downstream depends on x64/rv64 eager emission (opt_maybe_capture_interp).
2. Internalize non-preserved globals using the §5 set.
3. GC unreachable functions and data (the existing reachability walk, now over the whole program).
4. Lower the reachable set into a FuncSet and run opt_inline (the already-written, never-called whole-program inliner) with a real cost model.
5. Emit one object and substitute it for the IR inputs before the final link.
Steps 1 and 4 are independently valuable and land first (Phase 0): they turn the unreached inliner and the generalized sweep into a tested, shipping path on a single TU before any cross-TU complexity exists.
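The reachability walk at the center of the sweep is a standard worklist traversal from a root set over per-function reference lists. The following sketch uses a hypothetical FuncNode with a fixed-size refs array standing in for call_refs/global_refs; data relocs and aliases are left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>

enum { MAX_FUNCS = 32, MAX_REFS = 8 };

/* Hypothetical node: refs holds indices of functions this one references. */
typedef struct { int refs[MAX_REFS]; int n_refs; bool is_root; } FuncNode;

/* Marks live[i] for every function reachable from a root; returns the live count. */
static int mark_reachable(const FuncNode *funcs, int n, bool *live) {
    int stack[MAX_FUNCS], top = 0, count = 0;
    assert(n >= 0 && n <= MAX_FUNCS);
    for (int i = 0; i < n; i++) {          /* seed the worklist with the roots */
        live[i] = funcs[i].is_root;
        if (live[i]) stack[top++] = i;
    }
    while (top > 0) {
        int f = stack[--top];
        count++;
        for (int k = 0; k < funcs[f].n_refs; k++) {
            int callee = funcs[f].refs[k];
            assert(callee >= 0 && callee < n);
            if (!live[callee]) {           /* each node is pushed at most once */
                live[callee] = true;
                stack[top++] = callee;
            }
        }
    }
    return count;
}
```

In this model, internalizing a non-preserved global amounts to clearing its is_root flag before the walk, which is why internalization has to precede GC: a function that was a root only because it was global becomes collectable once nothing live references it.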
7. Phased delivery
Phase 0 — Whole-translation-unit optimization. No merge, no serialization,
no driver changes. Generalize the finalize sweep to all arches, switch
x64/rv64 to defer-to-finalize under the whole-program path, and wire opt_inline
over the reachable FuncSet. Delivers real cross-function inlining within a TU at
-O2 on every arch, generalized dead-static elimination, and the inliner finally
exercised on real code. Lowest risk — purely inside the optimizer — and it
validates the deferred-emit path that Phase 1's staging lifecycle also relies on.
Phase 1 — Shared-context, all-sources-up-front LTO. The target case,
kit cc *.c -flto -o prog and kit build-exe -flto (and build-lib/build-obj;
build-obj replaced the retired compile). Build on Phase 0 by adding:
(a) the symresolve extraction (§3), (b) the ObjBuilder name index (§3),
(c) skip-intern for locals (§2), (d) the KitCg/ObjBuilder borrowed staging
lifecycle and the driver loop that records N frontends into one session and
finishes CG once (§4), (e) the preserved set fed from the assembled link into
kit_cg_finish (§5). No cloner, no serialization, no archive support yet.
Phase 2 — Serialized IR objects (.kit.ir). Optional follow-on for separate
compilation, archives, and build caches. kit cc -c -flto a.c emits a normal
object whose symbol table is the real decl set — so the linker's symbol-driven
archive pull works unchanged — plus a .kit.ir custom section (the object model
already supports arbitrary SEC_OTHER sections) carrying a serialized
CgIrModule. The linker detects the section and replays the records into the
same shared context through the same recording/merge API, reusing
skip-intern-locals and symresolve_merge verbatim. Archives of IR objects work
because pull is symbol-table driven. Note: whole-program LTO is incompatible with
the file-based incremental linker (LINKER.md); -flto forces a full link.
8. Optimizations unlocked
Inlining is the headline; the merged-module + FuncSet framework makes a whole
interprocedural family expressible (listed as enabled, not committed):
- Cross-TU / whole-program inlining: opt_inline, already written.
- Internalization to hidden/local for non-exported globals: enables DCE, removes PLT/GOT indirection, frees intra-function optimization.
- Whole-program dead code / data elimination: the generalized sweep.
- Future: devirtualization / direct-call promotion, IPSCCP and cross-function constant/range propagation, argument promotion, identical-code folding, const/pure inference, global-to-local-constant propagation.
9. Risks, semantics, and limitations
- Resolution fidelity. ODR, weak/strong, common merging, COMDAT, IFUNC, aliases, and visibility must match the linker exactly or LTO miscompiles; hence the shared symresolve module rather than a re-implementation.
- Interposition / shared libraries. Never internalize or inline across an interposable default-visibility boundary unless -Bsymbolic/hidden visibility makes it safe. Default conservative for -shared.
- Inline and file-scope asm naming symbols are opaque references: treat as roots, never rename, internalize, or DCE them.
- Debug info. Cross-TU inlines need inlined_subroutine DWARF with correct file/line; SrcLoc must carry file identity through the merged module. Acceptable initial limitation: degraded inlined debug info under -g -flto, stated explicitly.
- Compile time / memory. The whole program lives in memory; opt_inline's growth gates bound blow-up; only the LTO set is optimized, opaque inputs stay opaque.
- Determinism. Record and merge in input order; iterate stably.
- Recording arena lifetime (§4) is the one structural hazard; settle it before building the staging loop.
- TLS, varargs, atomics, computed goto, label-address tables must survive the shared module unchanged. Function-local label addresses are already function-scoped; cross-function data_addr/pcrel/symdiff reference symbols, which are already unified by id.
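The interposition rule in the list above comes down to a guard that runs before any cost-based inlining decision. A minimal sketch, assuming a hypothetical CalleeInfo summary and cost cap (kit's actual mechanism is the KIT_CG_INLINE_NEVER marking described under Status):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical callee summary; field names are illustrative, not kit's. */
typedef struct {
    bool weak;          /* weak binding: a strong definition may override it */
    bool interposable;  /* default visibility across a shared-library boundary */
    int  cost;          /* size estimate consulted by the growth gate */
} CalleeInfo;

/* The soundness check comes first; cost is consulted only if it passes. */
static bool may_inline(const CalleeInfo *c, int cost_cap) {
    assert(c != 0);
    if (c->weak || c->interposable)
        return false;   /* the body seen now may not be the body that runs */
    return c->cost <= cost_cap;
}
```

Ordering the checks this way keeps the guard independent of the cost model, so a cheap weak callee is still rejected.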
10. First slices
Two independently landable, low-risk steps that de-risk the whole direction before any LTO surface exists:
- Extract symresolve and refactor link_resolve_symbols onto it. Pure refactor, covered by test/link. Lands the load-bearing piece and improves the linker regardless of LTO.
- Add the ObjBuilder name -> id index behind the existing obj_symbol_find/obj_symbol_ex API. Drop-in; measurable on its own.
Then Phase 0 (generalize the sweep + wire opt_inline, gated behind -O2 /
-fwhole-program), validated first on x86-64 with red-green tests in test/opt
(a caller+callee that should fuse) and test/smoke/x64 (behavioral parity). Then
the staging lifecycle and skip-intern-locals behind -flto, exercised first on a
two-TU test/smoke case where a cross-TU callee inlines.
Open questions
- Define-timing for resolution (§3): confirm the staging-boundary merge is the right hook versus an obj_symbol_define-time check, given symbols are only obj-defined at emit.
- Recording arena follow-through (§4): Phase 1 relies on c->tu having compiler-session lifetime for the cross-TU recorder/module. If frontend reset semantics later make c->tu per-source again, move the recorder/module to an explicit cross-source arena without changing the frontend staging API.
- -flto flag surface (largely resolved; see Status): -flto opt-in on cc and the build verbs, decided per the Status section. Still open: whether -fwhole-program is a distinct, more aggressive internalization mode, and whether to make cross-TU LTO the -O1 default later.
- CG API exposure: how much of the borrowed lifecycle (kit_cg_begin/kit_cg_begin_unit/kit_cg_finish/kit_cg_detach) remains internal to the driver (build.c's build_compile_all, cc_run_link_exe) versus becoming a public kit_cg/kit_compile surface for embedders driving multi-TU LTO.