commit 47cede9aca1511ee8a697c37bbf98aab26bb4ee3
parent 0ae44208d15bba91c4d3a9188848cb3871bf02bd
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Mon, 8 Jun 2026 11:39:35 -0700
plan: clean out completed plans
Diffstat:
5 files changed, 0 insertions(+), 1618 deletions(-)
diff --git a/doc/plan/BACKTRACE.md b/doc/plan/BACKTRACE.md
@@ -1,419 +0,0 @@
-# Plan: stack-trace builtins & runtime backtrace
-
-## Status — 2026-06-05 — L1 + L2 + L3a + `kit symbolize` + L3c (tool-side auto-backtrace) shipped (WS1–WS5); L3b remaining
-
-L3c (WS5) — **tool-side auto-backtrace for `kit run` and `kit dbg`** — is now
-shipped. Both tools print a symbolized frame-pointer-chain backtrace at a
-fault/trap, reusing the DWARF reader they already own and never crossing into
-`rt/`. The decisive finding: the CFI stepper `kit_dwarf_unwind_step`
-(`src/debug/dwarf_cfi.c:213`) takes **no memory provider**, so when the return
-address is spilled to the stack (the normal case) it returns pc=0 and the walk
-dies after the leaf — the existing `dbg bt` was effectively single-frame. The
-fix is to walk the **frame-pointer chain** (kit's uniform `fp[0]`=caller fp /
-`fp[1]`=saved ra record, no `.eh_frame` needed), the same walk `__kit_backtrace`
-does, lifted tool-side with a memory-read callback.
-
-- **Shared module** `driver/lib/backtrace.c` + `.h`: the FP-step kernel
- (`driver_bt_fp_step`, with `__kit_backtrace`'s guards) + arch FP-reg/ptr-size
- helpers + a PC-list symbolizer (`driver_backtrace_print_pcs`). Gated into the
- DBG/RUN tool builds (`mk/driver_srcs.mk`). Walks stop at the **kit-image
- boundary** (`kit_jit_runtime_to_image` == 0) so output ends at `main` and is
- host-independent (no libc/dyld trampoline noise).
-- **`kit dbg`** (`driver/cmd/dbg.c`): `dbg_cmd_bt` now advances via the FP-step
- kernel over `kit_jit_session_read_mem` (so it walks the whole stack, not just
- the leaf), and `dbg_render_stop` auto-invokes it on `KIT_STOP_SIGNAL`
- (faults + `__builtin_trap`/assert; not breakpoints/steps).
-- **`kit run`** (`driver/cmd/run.c` + `driver/env/posix_dbg.c`): a lightweight
- in-process crash guard (`driver_run_with_crash_guard`) installs
- SIGSEGV/SIGBUS/SIGILL/SIGFPE/SIGABRT/SIGTRAP handlers around the direct
- `entry_fn` call, reusing the existing `dbg_ucontext_to_frame` marshalling.
- Because `kit run` shares its stack with the program, the chain is captured
- **inside the handler** (before the post-`siglongjmp` stack is reused) and
- symbolized afterward in normal context; the process exits `128 + signo`.
- Windows has a no-op stub (vectored-handler port is a follow-up).
-- **Tests:** `test/dbg/cases/toy-trap-backtrace` (multi-frame trap → auto-bt +
- `bt`), updated `toy-trap-stop` golden, and a `kit run` crash lane in
- `test/driver/run.sh` (`run-backtrace-*`: non-zero exit + symbolized
- `bt_leaf/bt_mid/bt_root` + source file).
-
-Scope note: `kit emu` auto-backtrace is **out of scope** (the emulator doesn't
-retain the guest's DWARF after load); left as a follow-up alongside L3b.
-
-L3a (WS4) is now shipped on top of L1/L2:
-
-- **L3a print** `__kit_print_backtrace` — `rt/lib/stack/print_backtrace.c` walks
- via `__kit_backtrace(buf, 64, skip=1)` (skip hides the print frame, so `#0` is
- the caller) and writes one raw `#N 0x<hex>` line per frame to the **weak**
- `__kit_backtrace_write(const char*, size_t)` sink. Integer/hex formatting is
- hand-rolled (no printf/libc pulled into the panic path); the address uses
- `uintptr_t` so it is not truncated on LLP64. Declared in
- `rt/include/kit/backtrace.h`; added to `RT_BASE_SRCS`.
-- **Output sink (open question resolved):** weak no-op `__kit_backtrace_write`
- default, so freestanding images that never wire a sink still link; the host /
- `_start` overrides it to route bytes to `write(2)` (or a UART). Chosen over a
- mandatory explicit-sink param to keep freestanding builds link-clean.
-- **Assert hook (deferred from L2)** — `rt/lib/assert/assert.c::__kit_assert_fail`
- now emits a `kit: assertion failed: <expr>, file <file>, line <line>, function
- <func>` banner then `__kit_print_backtrace()` before `__builtin_trap()`, all
- through the same weak sink (printf-free). Pulling `__kit_assert_fail` therefore
- also pulls `print_backtrace.o` → `backtrace.o` from the archive — the intended
- wiring.
-- **Symbolization** is out-of-process via two hosted tools that share one
- DWARF-open + func/line core (`driver/lib/dwarfsym.c`):
- - `kit addr2line` — the faithful GNU/LLVM clone (bare addresses in,
- `file:line` out), unchanged in contract.
- - `kit symbolize` (`driver/cmd/symbolize.c`, **shipped**) — reads the raw
- `#N 0x<hex>` stream `__kit_print_backtrace` emits, finds the address on each
- line, resolves it through the same DWARF reader, and rewrites the line in
- place as `#0 0x401136 bt_leaf at addr2line_prog.c:51:3`, keeping the `#N`
- framing addr2line structurally can't. Lines with no address pass through
- verbatim. A single `-e <image>` today; multi-`-e`/module-map (for `libc.so`
- frames that need their own load slide) is the natural extension.
- Verified round-trip: a static non-PIE ELF prints its own trace at runtime, and
- the captured addresses resolve `bt_leaf`/`bt_mid`/`bt_root`/`test_main` through
- both `kit addr2line -f -e <image>` and `kit symbolize -e <image>` (outer
- no-`-g` frames show `??`).
-
-Tests (L3a): `test/rt/cases/print_backtrace.c` (in-process parse of the emitted
-`#N 0xADDR` lines, aa64/x64/rv64 under exec, exit 42) and `test/rt/addr2line.sh`
-+ `test/rt/addr2line_prog.c` (the symbolization round-trip, make target
-`test-rt-backtrace`). The round-trip script runs the captured stream through
-**both** lanes per arch/opt: `kit addr2line -f` over the bare addresses, and
-`kit symbolize` over the raw `#N 0xADDR` stream (asserting the `#N` framing is
-preserved and `<func> at file:line` is appended). `test/rt/smoke.c` also
-includes `<kit/backtrace.h>` so the header compiles on every rt-header target.
-
-**Opt coverage — the backtrace path passes at O0 *and* O1 on all three arches.**
-The rt-runtime corpus (`test/rt/run.sh`) and the addr2line round-trip
-(`test/rt/addr2line.sh`) now sweep both opt levels (`KIT_RT_OPT_LEVELS`), so
-`backtrace_capture` (L2) and `print_backtrace` (L3a) are exercised against
-optimized callers — all green at O0/O1. Sweeping O1 also surfaced **two
-unrelated, pre-existing kit bugs**, left red (not skipped) and logged in
-doc/plan/TODO.md: (1) **x86-64 `-g -O1` + the 4-operand register-pinned syscall
-idiom** aborts the compiler (`too many memory asm operands`,
-`src/arch/x64/native.c:4014`) — this is why the `x64/O1` lane of
-`test-rt-backtrace` is red, though x64/O1 backtrace correctness is still proven
-by `print_backtrace`/`backtrace_capture` (no asm); (2) **setjmp/longjmp is
-miscompiled at `-O1`** on every arch (`setjmp_runtime/O1` returns 1, not 42 —
-the second-return value isn't observed), failing `test-rt-runtime`.
-
-Remaining: **L3b** in-process self-symbolization (and the deferred `kit emu`
-auto-backtrace). WS5/L3c (tool-side auto-backtrace) is **done** — see Status.
-
-Implemented and tested through L2:
-
-- **L1 builtins** `__builtin_frame_address` / `__builtin_return_address` — two
- CG intrinsics (`KIT_CG_INTRIN_FRAME_ADDRESS` / `_RETURN_ADDRESS`), constant
- level carried as a single IMM operand, lowered as an unrolled FP walk on
- aarch64 / x86-64 / riscv (O0 and O1, same backend handler). The C target
- forwards `__builtin_*` to the host compiler; wasm reports unsupported; the C
- frontend validates the level via `eval_const_int`.
-- **L1 O1 modeling** — `IR_INTRINSIC` is already conservatively side-effecting
- in opt (never DCE'd / CSE'd / hoisted), so no new effect modeling was needed.
- The one real O1 hazard — riscv's frameless-leaf tier (`slim_prologue`) emits
- no prologue and never anchors `s0` — is handled by a new
- `NativeKnownFrameDesc.reads_frame` flag set during frame analysis when these
- intrinsics appear; aarch64/x64 keep the frame record in every prologue shape,
- so they need no change.
-- **L2 capture** `__kit_backtrace` — `rt/lib/stack/backtrace.c` +
- `rt/include/kit/backtrace.h`, in `RT_BASE_SRCS` for every variant.
-
-Open questions resolved while building:
-
-- **rv64 frame-record layout** — the psABI `ra@s0-8` / `fp@s0-16` guess in the
- L2 sketch is *wrong for kit*. kit's prologue stores the pair at and above s0:
- `[s0+0] = caller fp`, `[s0+ptr] = saved ra` (verified against
- `rv_build_prologue`). So the layout is uniform across all kit targets
- (`fp[0]`/`fp[1]` in units of `void*`) — `__kit_backtrace` needs no per-arch
- offset table at all, just index 0 and 1.
-- **wasm** — diagnose unsupported (confirmed acceptable); the capability hook
- returns false and the C frontend emits a clean error.
-- **leaf-frame omission** — handled via `reads_frame` (above).
-
-Tests: `test/rt/cases/backtrace_capture.c` (aa64/x64/rv64 under exec),
-`test/parse/cases/builtin_29..31_*` (+ `cases_err/..._nonconst`) across the
-D/R/E/J/C lanes at O0/O1, `test/toy/cases/154_frame_return_address.toy`.
-
-### Remaining tasks (L3)
-
-Nothing in L1/L2/L3a is outstanding. What's left is the rest of L3:
-
-- ~~**WS4 — L3a:**~~ **done** (see Status) — `__kit_print_backtrace()` + weak
- `__kit_backtrace_write` sink + assert-path hook + `kit addr2line` round-trip.
-- ~~**WS5 — `kit symbolize`:**~~ **done** (see Status) — the hosted batching
- symbolizer that reads the `#N 0x<hex>` stream and annotates it in place,
- sharing `driver/lib/dwarfsym.c` with `addr2line`. Tested by the second lane of
- `test/rt/addr2line.sh`.
-- ~~**WS5 — L3c (tool-side auto-backtrace):**~~ **done** (see Status) —
- `kit run` + `kit dbg` auto-print a symbolized FP-chain backtrace at a
- fault/trap via the shared `driver/lib/backtrace.c`; truncated at the kit-image
- boundary; never crosses into rt. `kit emu` auto-backtrace remains deferred (it
- doesn't retain the guest DWARF).
-- **L3b:** in-process self-symbolization (hosted-only `libkit_bt.a`); deferred
- until a concrete consumer needs in-binary symbolized panics.
-
-All Open-questions items are now resolved (for the L3a output sink, the weak
-default was chosen; see Open questions).
-
-## Overview
-
-kit has no way for compiled code to inspect its own call stack. This roadmap
-adds that capability in three layers: GCC-compatible primitive **builtins**
-(`__builtin_return_address`, `__builtin_frame_address`), a freestanding runtime
-**capture** function (`__kit_backtrace`), and a **symbolizing print** path
-(`__kit_print_backtrace`) that turns return addresses into `func at file:line`.
-
-Matching design docs once shipped: [../FRONTENDS.md](../FRONTENDS.md) (the
-builtins), [../RUNTIME.md](../RUNTIME.md) (the rt helpers), [../DWARF.md](../DWARF.md)
-(symbolization).
-
-## Why
-
-- **Portability.** `__builtin_return_address` / `__builtin_frame_address` are a
- de-facto part of the GCC/Clang surface. Real C code (libc `backtrace`,
- sanitizer shims, allocators, profilers, `unwind`-free panic handlers) uses
- them; kit currently can't compile any of it.
-- **Diagnostics.** `__kit_assert_fail` (`rt/lib/assert/assert.c`) and the
- emulator fault path (`src/emu/emu.c`, `compiler_panic`) currently die silently
- with `__builtin_trap()`. A backtrace at the trap point is the single biggest
- debuggability win for kit-compiled programs.
-- **It is cheap here, specifically.** kit maintains a frame pointer on **every**
- backend and has **no `-fomit-frame-pointer`** (x29 on aarch64, rbp on x64,
- s0/x8 on rv64; `AA_FP = 29` at `src/arch/aa64/native.c:61`). Every prologue
- stores a `{saved_fp, saved_ra}` frame record. Frame-pointer-chain walking is
- therefore *reliable*, with no unwind tables and no `.eh_frame` dependency.
-
-## What already exists (and what it can't do)
-
-- **`.eh_frame` CFI** is emitted by default for hosted targets
- (`src/arch/mc.c:736`, `mc_emit_eh_frame`), and **off for freestanding**.
-- **A CFI unwinder**, `kit_dwarf_unwind_step` (`src/debug/dwarf_cfi.c:213`),
- interprets FDE/CIE programs — but deliberately takes **no memory provider**, so
- it *cannot self-unwind a live stack*. It is built for the dbg/JIT path where a
- session reads target memory out-of-band (`driver/cmd/dbg.c:1010`, the `bt`
- command). It is not a candidate for in-process capture.
-- **Symbolization** (`kit_dwarf_addr_to_line`, `kit_dwarf_func_at`,
- `include/kit/dwarf.h:21`) is mature — it backs `addr2line`
- (`driver/cmd/addr2line.c`) and `dbg bt`. But it lives in **`libkit.a`, not the
- freestanding runtime `rt/`**. Pulling it into a freestanding image is a
- non-goal (see L3).
-- **The runtime has zero unwind/backtrace code today.** `rt/lib/stack/` exists
- but holds only the Windows `chkstk` helper — a natural home for the new
- capture code.
-
-Design consequence: **capture via the FP chain; symbolize via the existing
-DWARF reader, kept on the hosted side of the boundary.** Do *not* reuse the CFI
-unwinder for self-capture.
-
----
-
-## L1 — Primitive builtins (`__builtin_return_address`, `__builtin_frame_address`)
-
-GCC semantics: `__builtin_frame_address(n)` returns the frame address of the
-current function (n=0) or its n-th caller; `__builtin_return_address(n)` returns
-the return address into that frame. The level argument **must be an integer
-constant** (kit validates via the existing `eval_const_int()` path, as
-`__builtin_offsetof` already does at `parse_expr.c:1331`). Out-of-range / runaway
-walks are allowed to return a garbage-but-safe value or 0, matching GCC's caveat
-that nonzero levels are unpredictable and meant for debugging use only.
-
-### Lowering: two new CG intrinsics, FP-chain only
-
-Add to `KitCgIntrinsic` (`include/kit/cg.h:916`):
-
-```
-KIT_CG_INTRIN_FRAME_ADDRESS, /* pop level(u32 const); push void* */
-KIT_CG_INTRIN_RETURN_ADDRESS, /* pop level(u32 const); push void* */
-```
-
-Both lower through one shared FP-walk so level 0 and level N use the same path,
-and so level 0's return address comes from the **spilled** frame-record slot (not
-the live LR/RA, which may be clobbered mid-function):
-
-| arch | FP reg | `frame(0)` | walk one frame | return addr from frame F |
-|------|--------|-----------|----------------|--------------------------|
-| aarch64 | x29 | x29 | `fp = *(fp)` | `*(fp + 8)` (saved x30) |
-| x86-64 | rbp | rbp | `fp = *(fp)` | `*(fp + 8)` (pushed retaddr) |
-| rv64 | s0/x8 | s0 | `fp = *(fp)` | `*(fp + ptr)` (saved ra) |
-
-The table is **uniform** across kit's targets: the prologue stores
-`[fp+0] = caller fp`, `[fp+ptr] = saved ra` everywhere (verified against
-`rv_build_prologue` — note this differs from the RISC-V psABI's `ra@s0-8` /
-`fp@s0-16`, which an early draft of this table wrongly assumed).
-
-For a constant level the walk unrolls to `level` dependent loads (typically 0–2),
-so no loop is emitted. wasm has no FP chain → **diagnose unsupported**, exactly
-as the IRQ/cache intrinsics already do per-arch.
-
-### Files to touch (the standard "new value-producing intrinsic" path)
-
-- `include/kit/cg.h` — two enum entries + doc comments.
-- `src/cg/arith.c:1726` — two rows in the `KitCgIntrinsic → INTRIN_*` table.
-- `src/cg/cgtarget.h:148` — two `INTRIN_*` enum entries
- (`INTRIN_FRAME_ADDRESS`, `INTRIN_RETURN_ADDRESS`).
-- `lang/c/parse/parse_priv.h:231` + `parse.c:1526` — intern
- `__builtin_return_address` / `__builtin_frame_address` symbols.
-- `lang/c/parse/parse_expr.c` (in `try_parse_builtin_call`, ~1696–2018) — two
- handlers: parse the constant level via `eval_const_int`, then emit the
- intrinsic with result type `void*`. New `cg_adapter.c` helper
- `pcg_frame_or_return_address(p, kind, level)`.
-- Per-arch O0 lowering: `src/arch/aa64/native.c` (~3572), `src/arch/x64/native.c`
- (~3378), `src/arch/riscv/native.c` (~2992) — emit the FP-walk loads;
- `src/arch/wasm/emit.c` (~1590) + `src/arch/c_target/c_emit.c` (~2603) — handle
- or diagnose (C target can emit `__builtin_*` straight through to the host
- compiler).
-- Capability hooks: `src/arch/{aa64,x64,riscv,wasm}/arch.c` (alongside the
- existing `KIT_CG_INTRIN_TRAP` cases at e.g. `aa64/arch.c:197`).
-- **Optimizer (O1/O2) [done]:** in practice no new effect modeling was needed —
- `IR_INTRINSIC` is already conservatively side-effecting in opt (never DCE'd,
- CSE'd, or hoisted; see `pass_dce.c`), and the FP it reads is stable across the
- whole function, so scheduling is harmless. The one real O1 hazard turned out to
- be a backend frame issue, not an opt-modeling one: riscv's frameless-leaf tier
- (`slim_prologue`) emits no prologue and never anchors `s0`, so a leaf that reads
- its own frame would walk a stale `s0`. Fixed with a `NativeKnownFrameDesc.reads_frame`
- flag set in `pass_native_emit.c` frame analysis and ANDed into riscv's
- `slim_prologue` decision; aarch64/x64 keep the frame record in every prologue
- shape, so they need nothing. O1 smoke tests run on all three arches.
-
-### Tests (L1) [done]
-
-- `test/toy/cases/154_frame_return_address.toy` — CG-API case exercising both
- intrinsics at levels 0/1/2 (`@[.noinline]` chain pins the depth).
-- `test/parse/cases/builtin_29_return_address.c`, `builtin_30_frame_address.c`,
- and `builtin_31_return_address_anchor.c`; error case
- `cases_err/builtin_return_address_nonconst.c` for a non-constant level.
-- The plan's "anchor in caller's range" smoke check is `builtin_31` (run via the
- parse harness's qemu/podman exec lane on x64 + aa64 + rv64 at **O0 and O1**),
- not a `test/smoke` script. It anchors on the **caller's function address**, not
- a `&&label`: GNU labels-as-values whose address is taken but never `goto`'d
- break at O1 (`undefined reference to '.Lcfblk.N'`; see doc/plan/TODO.md).
-
----
-
-## L2 — Capture: `__kit_backtrace` (freestanding runtime fn)
-
-Surface decision (confirmed): **primitives are builtins; capture/print are
-runtime functions**, mirroring the GCC-builtin / glibc-`backtrace` split. New
-freestanding TU `rt/lib/stack/backtrace.c`, declared in a new public runtime
-header `rt/include/kit/backtrace.h`:
-
-```c
-/* Fill buf[0..max) with return addresses, innermost first, skipping the
- * innermost `skip` frames (skip >= 1 hides __kit_backtrace itself).
- * Returns the number of frames written. Freestanding: pure FP walk, no libc,
- * no DWARF, works on every target that keeps a frame pointer (all of kit's). */
-int __kit_backtrace(void** buf, int max, int skip);
-```
-
-Implementation is the L1 walk expressed in portable C: seed from
-`__builtin_frame_address(0)`, then loop `fp = *(void**)fp` reading the saved-RA
-slot, stopping on a NULL saved-RA (the synthetic stack origin), a NULL or
-non-increasing fp (stack grows down — detect cycles/garbage), a misaligned link,
-or `max`. **No per-arch knob is needed:** kit's frame layout is uniform, so the
-walk indexes `fp[0]` (caller fp) and `fp[1]` (saved ra) as `void**`, which scales
-to the target pointer width automatically — no offset table, no `#ifdef` cascade.
-`skip` discards the innermost N frames (a print wrapper passes `skip >= 1`).
-
-- `mk/rt.mk` — added `rt/lib/stack/backtrace.c` to `RT_BASE_SRCS` (built for
- every variant; `rt/lib/stack/` already compiled the Windows chkstk helper).
-- **Assert-path hook — landed in WS4 (was deferred):**
- `rt/lib/assert/assert.c::__kit_assert_fail` now emits a banner +
- `__kit_print_backtrace()` before `__builtin_trap()`. It needed the L3
- `__kit_print_backtrace()`, so it shipped with WS4 rather than L2.
-
-### Tests (L2) [done]
-
-- `test/rt/cases/backtrace_capture.c` — a known-depth `@[.noinline]` recursion;
- asserts depth, all return addresses non-null, that the recursive frames share a
- call site (proving the walk follows the chain), and the `skip`/`max` bounds;
- `return 42` on success. Runs under `test/rt/run.sh` on aa64/x64/rv64.
-
----
-
-## L3 — Symbolize & print: `__kit_print_backtrace`
-
-This is where the freestanding boundary bites: turning an address into
-`func at file:line` needs the DWARF reader, which is **libkit, not rt**. Three
-sub-options, ordered by how cleanly they respect that boundary. Recommend
-shipping **L3a now**, leaving L3b/L3c as documented extensions.
-
-- **L3a — raw print + out-of-process symbolization (shipped — WS4).**
- `__kit_print_backtrace()` lives in rt (`rt/lib/stack/print_backtrace.c`), walks
- via `__kit_backtrace`, and writes raw lines (`#0 0x401136`, …) to a
- host-provided sink (the weak `__kit_backtrace_write(const char*, size_t)` the
- host or `_start` wires to `write(2)`; freestanding default is a no-op).
- Symbolization is a separate hosted step through either `kit addr2line` (bare
- addresses) or `kit symbolize` (the raw `#N 0x<hex>` stream, annotated in place
- — **shipped**; see Status). Both share the DWARF-open + func/line core in
- `driver/lib/dwarfsym.c` and reuse the existing reader, so the freestanding
- image carries zero new symbolization code, matching how minimal panic handlers
- work in the wild.
-
-- **L3b — in-process self-symbolization (hosted-only).** A trimmed line/func
- reader (reusing `kit_dwarf_addr_to_line` + `kit_dwarf_func_at`) linked into a
- **hosted-only** archive — e.g. `libkit_bt.a` or a `*-hosted` rt variant — that
- opens the running image's own DWARF. Heavy (drags in the DWARF reader and an
- image-self-map); strictly opt-in, never in the freestanding default. Only build
- if a concrete consumer needs in-binary symbolized panics.
-
-- **L3c — tool-side auto-backtrace.** `kit run` / `kit emu` / `dbg` already own a
- DWARF reader and the `dbg bt` rendering path (`driver/cmd/dbg.c:1010`). Hook
- their fault/trap handlers (e.g. the `EMU_TRAP_FAULT` → `compiler_panic` site in
- `src/emu/emu.c`) to print a symbolized backtrace automatically. This is the
- highest-value, lowest-risk symbolized experience because it reuses everything
- and never crosses into rt. Largely independent of L1/L2 (the tools can unwind
- via their own session memory + `kit_dwarf_unwind_step`).
-
-### Tests (L3)
-
-- L3a [done]: `test/rt/addr2line.sh` (+ `addr2line_prog.c`) runs a kit-compiled
- program that prints its own trace, then symbolizes the captured stream two
- ways — `kit addr2line -f` over the bare addresses, and `kit symbolize` over the
- raw `#N 0xADDR` stream (asserting the `#N` framing survives and `<func> at
- file:line` is appended) — checking `bt_leaf`/`bt_mid`/`bt_root`/`test_main`
- appear (make target `test-rt-backtrace`, aa64/x64/rv64). In-process companion:
- `test/rt/cases/print_backtrace.c` parses the emitted `#N 0xADDR` lines.
-- L3c: a `kit emu` fault test asserting a symbolized frame line on stderr.
-
----
-
-## Suggested sequencing
-
-1. **WS1 — L1 primitives, O0** — all three native arches + parse/toy tests. ✅ done.
-2. **WS2 — L1 at O1/O2** — opt effect-modeling audit (turned out to need only the
- riscv frame-record fix) + O1 tests. ✅ done.
-3. **WS3 — L2 `__kit_backtrace`** in rt + capture test. ✅ done (assert-hook moved
- to WS4 — it needs the L3 print fn).
-4. **WS4 — L3a** raw print (`__kit_print_backtrace` + weak `__kit_backtrace_write`
- sink) + `kit addr2line` round-trip; wire the assert hook. ✅ done.
-5. **WS5 — `kit symbolize`** hosted batching symbolizer over the `#N 0x<hex>`
- stream, sharing `driver/lib/dwarfsym.c` with `addr2line`; second lane of
- `test/rt/addr2line.sh`. ✅ done.
-6. **WS5 — L3c** tool-side auto-backtrace for `kit run` + `kit dbg`. ✅ done
- (`kit emu` deferred — no retained guest DWARF).
-7. **L3b** deferred until a consumer needs in-binary symbolized panics.
-
-## Open questions
-
-None outstanding.
-
-Resolved in WS4:
-
-- ~~**Output sink for L3a:**~~ weak `__kit_backtrace_write` (no-op default) vs.
- requiring the host to pass a sink explicitly. **Chose the weak default** — it
- keeps freestanding builds linking with no sink, and a host / `_start`
- overrides it to route bytes to `write(2)` or a UART. (Resolved while building
- WS4.)
-
-Resolved while building L1/L2:
-
-- ~~**wasm:**~~ diagnose unsupported — confirmed acceptable; the capability hook
- returns false and the C frontend emits a clean error. (C target separately
- forwards `__builtin_*` to the host compiler.)
-- ~~**rv64 frame-record layout:**~~ verified against `rv_build_prologue` — kit
- stores `[s0+0]=caller fp`, `[s0+ptr]=saved ra` (NOT the psABI `ra@s0-8` /
- `fp@s0-16`), so the layout is uniform across targets.
-- ~~**leaf-frame omission:**~~ handled by `NativeKnownFrameDesc.reads_frame`, which
- forces riscv off its frameless-leaf tier when these intrinsics appear; aa64/x64
- always keep the frame record. (Level-0 reads the spilled slot via the FP, so no
- live-LR/RA fallback is needed.)
diff --git a/doc/plan/DIST_LIBRARY.md b/doc/plan/DIST_LIBRARY.md
@@ -1,290 +0,0 @@
-# Distribution as a library subsystem
-
-> **Status: implemented.** The migration below has landed (one commit): the
-> dist subsystem moved to `src/dist/` (+ top-level `vendor/`), exposed through
-> `<kit/cas.h>` / `<kit/package.h>` (`src/api/{cas,package}.c`), gated by
-> `KIT_CAS_ENABLED` / `KIT_PKG_ENABLED`; `kit cas` / `kit pkg` are thin
-> CLIs over the public API via a `KitCasHost` vtable (`driver/lib/dist_host.c`),
-> with operational errors flowing through `ctx->diag`. Verified green:
-> `test-driver-cas` (41) + `test-driver-pkg` (182) under ASan/UBSan.
->
-> **Deferred:** the v2 deletion + `3`-suffix rename (Stage 3, below) were *not*
-> done — the dead v2 code was carried over unchanged. On inspection the deletion
-> is more surgical than the line-ranges below imply: the v2 *extern* surface
-> (`DistManifest`/`DistArtifact`/`DistDependency`, `dist_manifest_*`,
-> `dist_kpkg2_*`, the v2 `DistKpkg*` structs + `dist_kpkg_*` v2 codecs) is
-> safely unreferenced, but the v2 and v3 manifest parsers **share** the static
-> helpers `set_err` / `trim_lead` / `trim_trail` / `copy_field` / `kind_valid`
-> in `src/dist/manifest.c` (only `parse_u64`, the v2 `finalize`, and
-> `dist_manifest_path_valid` are v2-only). The cleanup must keep the shared
-> helpers — verify the same in `src/dist/kpkg.c` — and recompile + rerun the
-> cas/pkg suites after.
-
-Signed, content-addressed distribution (`kit cas` / `kit pkg`) is today the
-only major capability that lives **entirely inside `driver/`** — its model,
-its vendored crypto/compression, and its create/verify/unpack pipelines all sit
-under `driver/dist/` and `driver/cmd/{cas,pkg}.c`. Every other capability is a
-libkit subsystem behind `include/kit/`, with the CLI tool a thin
-arg-parser on top. This doc captures the plan to bring distribution into the
-same shape: move the implementation into the library, expose it through two
-public headers, and reduce `cas.c`/`pkg.c` to flag-parsing + host wiring. The
-design it realizes is in [../DISTRIBUTE.md](../DISTRIBUTE.md); the precedent it
-follows is the `ar` subsystem (`src/api/archive.c` + `include/kit/archive.h`,
-gated by `KIT_AR_ENABLED` distinct from `KIT_TOOL_AR_ENABLED`).
-
-## Goal
-
-`libkit.a` gains a content-store API and a signed-package API, gated by their
-own subsystem flags so a minimal embedding pays nothing for them. The `kit
-cas` and `kit pkg` tools become thin CLIs that translate flags into public
-calls and supply host vtables — exactly like `ar`, `ld`, `objdump`. An embedder
-can create, sign, verify, inspect, and unpack packages, and drive a CAS, without
-the driver and without linking host crypto/compression.
-
-## Why this is the right shape (not CLI-only logic)
-
-Two layers are stacked under `driver/dist/`, and they have very different
-readiness:
-
-- **The `dist_*` byte model** (`driver/dist/*.c`, ~6.4k lines plus ~6.7k
- vendored) is **already written to the public boundary's contract.** It
- includes only `<kit/core.h>` plus its own headers — no `driver.h`/`env.h`.
- It sources no entropy and does no I/O except through `KitWriter` callbacks
- and a small host vtable (`DistCasHost` = `KitFileIO` + `mkdir_p` +
- `mark_executable`). This obeys the "host supplies all side effects" principle
- verbatim. Moving it is near-mechanical.
-
-- **The `pkg_*` / `cas_*` orchestration** (`driver/cmd/pkg.c` 2123 lines,
- `cas.c` 491 lines) holds the valuable pipelines — `pkg_create_targz`,
- `pkg_create_kpkg`, `pkg_verify_portable`, `pkg_verify_native`, blob
- reconstruction, trust/key resolution — but is entangled with the CLI. The
- glue to unwind, by call count in `pkg.c`:
- - `driver_errf` ×88 — stderr error reporting → structured error returns / the
- `KitContext` diag sink (the `dist_*` parsers already take
- `char* err, size_t errcap`).
- - `driver_mkdir_p`, `driver_mark_executable_output`,
- `driver_walk_regular_files` — host filesystem ops beyond `KitFileIO`.
- - `driver_random_bytes` ×2 — host CSPRNG, only for keygen.
- - `driver_getenv` ×2 — trust-file path defaulting
- (`$KIT_TRUSTED_KEYS` / `$HOME`); env-var *policy* that **stays in the
- driver**.
- - `driver_streq` / `driver_printf` / `driver_has_suffix` — arg parsing and
- stdout formatting; **stay in the driver**.
-
-The layering invariant forces the move: `driver/` may include only
-`<kit/*.h>`, and `src/api` may not include `driver/` headers — so a public
-boundary is impossible while the code sits in `driver/dist/`. Relocating to
-`src/` is a precondition, not a cleanup.
-
-## Target tree layout
-
-```
-vendor/ # top-level: pristine third-party trees
- monocypher/ # (moved from driver/dist/vendor/monocypher)
- lz4/ # (moved from driver/dist/vendor/lz4)
-include/kit/cas.h # content model: blob/tree hashing + CAS store
-include/kit/package.h # package model: manifest, sign/verify, create/unpack
-src/api/cas.c # public handles <-> internal (archive.c precedent)
-src/api/package.c
-src/dist/ # moved dist_* subsystem (private headers)
- dist.{c,h} blob.{c,h} tree.{c,h} cas.{c,h}
- manifest.{c,h} kpkg.{c,h} trust.{c,h}
- blake2b.{c,h} ed25519.{c,h} minisig.{c,h} b64.{c,h}
- deflate.{c,h} lz4.{c,h} tar.{c,h} # kit-maintained shims/extracts
-```
-
-Vendor split, confirmed by inspection: only **monocypher** and **lz4** are
-pristine third-party trees pulled in by `#include` — they move to a repo-root
-`vendor/`. `deflate.c` is a kit-maintained *extract* of miniz (already
-modified, not pristine), and `b64.c` / `tar.c` are self-contained — these stay
-in `src/dist/`. The shim includes that currently read
-`"vendor/monocypher/..."` (e.g. `blake2b.h`, `ed25519.c`) get rewritten to the
-new top-level path.
-
-## Config gating
-
-Add subsystem flags to `include/kit/config.h`, separate from the tool flags
-(mirroring `KIT_AR_ENABLED` vs `KIT_TOOL_AR_ENABLED`):
-
-```c
-#define KIT_CAS_ENABLED 1 /* content store: src/dist/{blob,tree,cas} + kit/cas.h */
-#define KIT_PKG_ENABLED 1 /* signed packages: adds manifest/kpkg/minisig/crypto + kit/package.h */
-```
-
-`KIT_PKG_ENABLED` implies `KIT_CAS_ENABLED` (packages are built over the
-content model). `KIT_TOOL_CAS_ENABLED` / `KIT_TOOL_PKG_ENABLED` stay and
-assert their subsystem flag. Off → the units (and the vendored crypto) drop
-entirely, so a minimal embedding carries no Ed25519/BLAKE2b/DEFLATE/LZ4. The
-Makefile's `LIB_SRCS_*` gains a dist regime that pulls `src/dist/*.c` plus the
-enabled `vendor/` trees.
-
-## Public API surface
-
-Two headers, mirroring DISTRIBUTE.md's content-model vs signed-package split.
-Model structs are exposed as POD (renamed to the `Kit*` convention); the
-vendored primitives and the kpkg wire codecs stay internal.
-
-### `include/kit/cas.h` — content model (self-verifying, no trust)
-
-- POD types: `KitTree`, `KitTreeEntry`, `KitBlobInfo`.
-- Pure hashing (no I/O): `kit_blob_id`, `kit_blob_root`, `kit_blob_info`,
- `kit_tree_id`, `kit_tree_emit`, `kit_tree_parse`, `kit_tree_find`.
-- A `KitCas` handle over `KitContext` + a host vtable:
- `kit_cas_open`, `kit_cas_put_blob` / `get_blob`,
- `kit_cas_put_tree` / `get_tree`, `kit_cas_add_tree_from_dir`,
- `kit_cas_verify_tree`, `kit_cas_materialize`.
-
-### `include/kit/package.h` — package model (signed)
-
-- POD model: `KitPackageManifest` with its outputs / artifacts / deps; a
- public `KitPackageEncoding` descriptor (region layout, chunk-index summary,
- external-fetch templates) so `inspect --encoding` and external-fetch planning
- are real library features.
-- Keys / trust: `KitMinisigKeypair`; `kit_pkg_keygen` (entropy injected via
- the host vtable, never read by the library); pubkey/seckey emit + parse;
- `kit_pkg_sign` / `kit_pkg_verify_signature`. Trust resolution takes
- **explicit** trusted-keys bytes — the library reads no env vars and no
- `$HOME`; the driver supplies the resolved path/bytes.
-- Pipelines as opts-struct calls: `kit_pkg_create` (format `kpkg`|`tar.gz`,
- native-shape `fat`|`metadata`|`thin`, compression, source = `--root` dir or
- `cas + tree`, external dir), `kit_pkg_verify`, `kit_pkg_unpack`,
- `kit_pkg_inspect`.
-
-### Kept internal (`src/dist/` private headers)
-
-All vendored code; the `dist_blake2b` / `dist_ed25519` / `dist_minisig` /
-`dist_b64` / `dist_gz` / `dist_lz4` / `dist_tar` shims; and the kpkg wire
-codecs (header / descriptor / index encode-decode). Rationale: raw crypto and
-on-wire binary layout are implementation detail — exposing them invites misuse
-and an API-stability burden. The logical model and pipelines are the contract.
-
-## New host capabilities
-
-The library reaches the host through `KitContext.file_io` (read/write,
-already present) plus one new vtable for the operations `KitFileIO` doesn't
-cover — every one of which the driver already implements:
-
-```c
-typedef struct KitDistHost {
- int (*mkdir_p)(void* user, const char* path);
- int (*mark_executable)(void* user, const char* path);
- int (*walk_regular_files)(void* user, const char* root, /* callback */ ...);
- int (*fill_random)(void* user, uint8_t* out, size_t n); /* keygen only */
- void* user;
-} KitDistHost;
-```
-
-`DistCasHost` already models `mkdir_p` + `mark_executable`; this generalizes it
-and adds the directory walk (`driver_walk_regular_files`) and CSPRNG
-(`driver_random_bytes`). Naming/placement TBD during Stage 2 (could fold the
-CAS-only subset into `KitCas` and keep `fill_random` package-side).
-
-## Error reporting (decided)
-
-Public dist calls **return `KitStatus`** and **emit human-readable detail
-through `ctx->diag`** — not through an err-buffer at the boundary. This is the
-established convention, not a new pattern:
-
-- `KitContext` carries `KitDiagSink* diag` directly (`core.h`), so the sink
- is reachable without a `KitCompiler` — exactly as the pure-byte subsystems
- (object, archive, dwarf) get it.
-- It mirrors the linker: `src/link/link_layout.c` emits operational errors such
- as "linker script: undefined symbol …" through the sink and returns a status.
- Package/CAS errors are the same shape — operational, no source position.
-- `KitStatus` already carries the right categories: `KIT_MALFORMED` (bad
- manifest/tree/signature), `KIT_NOT_FOUND` (missing blob/tree/key),
- `KIT_IO`, `KIT_INVALID` (unsafe path), `KIT_UNSUPPORTED` (encrypted
- seckey / scrypt). The status is the machine-readable category; the diag
- message is the actionable detail (`"blob root mismatch for: <path>"`).
-
-Mechanics:
-
-- **No source location.** Emit with a zero `KitSrcLoc` (file_id 0), as the
- linker does for non-source errors; the host stderr sink already tolerates it.
-- **The internal `dist_*` parsers keep their `(char* err, size_t errcap)`
- buffer** unchanged. The `src/api` wrapper catches that string and forwards it
- to `ctx->diag`, so the byte model barely changes and its detailed parse
- messages survive intact.
-- **A small internal `api_diagf(ctx, kind, fmt, …)` helper** over
- `ctx->diag->emit` (no-op when `diag` is NULL) packs varargs for the api layer.
-- **The 88 `driver_errf` sites split by ownership.** Operational/pipeline errors
- (create / verify / unpack / resolve) move into `src/api/package.c` as diag
- emits; pure arg-parse errors (`"unknown option"`, `"-o BASE is required"`)
- stay in `pkg.c` as `driver_errf`, because argument parsing is driver policy.
-- **Embedder control.** The sink bumps its `errors` counter and prints. For the
- CLI that is exactly today's `driver_errf` behavior. An embedder doing
- speculative verification supplies its own (or no) sink and reads only the
- `KitStatus`, so a failed verify stays quiet.
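-
-A rough sketch of the `api_diagf` shape described in the mechanics above.
-`DemoDiagSink`, `DemoContext`, and the `emit` signature are stand-ins invented
-for illustration; the real `KitDiagSink` layout is not shown in this plan, so
-treat every name here as hypothetical.
-
-```c
-#include <assert.h>
-#include <stdarg.h>
-#include <stdio.h>
-#include <string.h>
-
-/* Hypothetical stand-ins for KitDiagSink / KitContext. */
-typedef struct DemoDiagSink {
-    void (*emit)(struct DemoDiagSink* self, int kind, const char* msg);
-    int errors;     /* bumped by the sink, not by the helper */
-    char last[128]; /* last message, kept only so the sketch is testable */
-} DemoDiagSink;
-
-typedef struct DemoContext {
-    DemoDiagSink* diag; /* may be NULL: embedder reads status codes only */
-} DemoContext;
-
-/* Pack the varargs into one string and forward it to the sink.
- * A no-op when the context carries no sink. */
-static void demo_diagf(DemoContext* ctx, int kind, const char* fmt, ...) {
-    char buf[128];
-    va_list ap;
-    assert(fmt != NULL);
-    if (ctx == NULL || ctx->diag == NULL || ctx->diag->emit == NULL) return;
-    va_start(ap, fmt);
-    vsnprintf(buf, sizeof buf, fmt, ap); /* truncates long messages */
-    va_end(ap);
-    ctx->diag->emit(ctx->diag, kind, buf);
-}
-
-/* Example sink: count the error and remember the (possibly truncated) text. */
-static void demo_sink_emit(DemoDiagSink* self, int kind, const char* msg) {
-    size_t n = strlen(msg);
-    (void)kind;
-    if (n >= sizeof self->last) n = sizeof self->last - 1;
-    memcpy(self->last, msg, n);
-    self->last[n] = '\0';
-    self->errors++;
-}
-```
-
-In this sketch a CLI-style sink would also print the message, while an
-embedder doing speculative verification leaves `diag` NULL and reads only the
-returned status.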
-
-## Versioning: latest-only (decided)
-
-We support only the current on-disk format and drop all back-compat code. This
-is verified to be **pure deletion with zero behavioral change**: every v2 symbol
-(`dist_manifest_*`, the non-`3` `dist_kpkg_*`, `dist_kpkg2_*`, and the
-`DistManifest` / `DistArtifact` / `DistDependency` / `DistKpkgHeader` /
-`DistKpkgDescriptor` / `DistKpkgIndexRecord` structs) is referenced *only* in
-its own definition files — never by `pkg.c`, `cas.c`, or any test. The driver
-already emits and reads v3 exclusively.
-
-Dropping it pays off twice:
-
-1. Deletes the dead v2 structs / functions / constants from `manifest.c` and
- `kpkg.c`.
-2. Lets the survivors **shed the `3` suffix** as they go public:
- `DistPackageManifest` → `KitPackageManifest`, `DistKpkg3Header` →
- `KitPackageHeader`, internal `dist_kpkg3_*` → `dist_kpkg_*`. The versioned
- naming only existed to coexist with v2.
-
-**Precision:** drop the v2 *parse paths and C identifiers*, but keep the on-disk
-wire magic at `kpkg3\0` / `kit-package 3` / `kit-encoding 3`. "Latest
-version" means v3 on disk; renumbering the wire format would itself break
-anything already produced. We stop *accepting* v2 input; we do not renumber.
-
-## Staged plan (each stage builds green)
-
-1. **Vendor move.** `driver/dist/vendor/{monocypher,lz4}` → top-level
- `vendor/{monocypher,lz4}`; rewrite the shim `#include` paths; update the
- Makefile. Pure relocation, no API change — lands first to isolate path
- churn.
-2. **Lift-and-shift the content layer.** Move `dist.{c,h}` `blob` `tree` `cas`
- to `src/dist/`; add `src/api/cas.c` + `include/kit/cas.h` wrapping
- blob/tree/CAS; add `KIT_CAS_ENABLED`; repoint `driver/cmd/cas.c` at the
- public header + a host vtable. Smallest behavioral slice; proves the
- boundary end to end.
-3. **Drop v2 first, then extract the package pipelines.** Delete the dead v2
- code (see *Versioning* above) and shed the `3` suffix — a self-contained,
- zero-behavior-change cleanup that shrinks the surface before it moves. Then
- move `manifest` `kpkg` `minisig` `trust` `b64` `deflate` `lz4` `tar` +
- crypto shims to `src/dist/`; lift the `pkg_create_*` / `pkg_verify_*` /
- unpack / key-resolution logic out of `driver/cmd/pkg.c` into
- `src/api/package.c` behind `kit_pkg_*`, converting operational `driver_errf`
- → `api_diagf` (see *Error reporting* above) and `driver_*` fs/random →
- `KitDistHost`. `pkg.c` shrinks to arg parsing + host wiring + trust-path/env
- policy. This is the bulk of the work and the main risk.
-4. **Tests.** Keep `test/cas/run.sh` + `test/pkg/run.sh` as end-to-end CLI
- tests; optionally add unit tests that call the new public API directly (now
- possible — coverage was CLI-only before).
-5. **Docs.** Update `../DISTRIBUTE.md` paths (the layering diagram's
- `driver/dist/*` rows become `src/dist/*` + the two public headers), the
- `../DESIGN.md` layering box (the `driver/dist/` callout moves), and the
- `CLAUDE.md` code map (add `vendor/`, `src/dist/`, `kit/cas.h`,
- `kit/package.h`).
-
-## Risks / watch items
-
-- **Error reporting** is decided (see above): `KitStatus` + `ctx->diag`, no
- boundary err-buffers. Remaining care is mechanical — route the ~70 operational
- `driver_errf` sites to `api_diagf` while leaving arg-parse errors in `pkg.c`,
- and confirm the CLI's stderr output is unchanged by the existing
- `test/pkg/run.sh` corpus.
-- **Trust policy must not leak into the library.** `$KIT_TRUSTED_KEYS` /
- `$HOME` defaulting and `--tofu` write-back are *driver* policy; the library
- takes resolved bytes/paths and returns "would-pin this key id" decisions for
- the driver to act on. Keep `getenv` driver-side.
-- **Binary-format stability.** Once the manifest/tree/kpkg model is public, the
- determinism invariants in DISTRIBUTE.md become a public contract. With v2 gone
- there is only one format to preserve — keep the wire magic at v3 (do not
- renumber) and lock the bytes with the existing corpus before refactoring.
-- **Subsystem flag matrix.** Verify `KIT_PKG_ENABLED && !KIT_CAS_ENABLED`
- is a build-time error, and that both-off drops the vendored crypto so a
- no-dist embedding stays clean (assert as the other subsystems do).
diff --git a/doc/plan/LTO.md b/doc/plan/LTO.md
@@ -1,534 +0,0 @@
-# LTO / Whole-Program Optimization (planned work)
-
-This is the forward-looking plan for link-time optimization in kit: making a
-library or executable look like a single translation unit to the optimizer, so
-inlining, dead-code elimination, internalization, and the rest of the
-interprocedural family can cross TU boundaries. It deliberately does **not**
-target GCC/Clang LTO bitcode compatibility. The initial scope is kit invocations
-that provide all sources up front (`kit cc *.c -O2 -flto -o prog`); separately
-compiled IR objects are a later phase that reuses the same core.
-
-The optimizer baseline this builds on — the recording IR, the
-recording/optimizing boundary, the finalize path, and the pass catalog — is in
-[../OPT.md](../OPT.md) and [OPTIMIZER.md](OPTIMIZER.md). The link-time symbol
-model is in [LINKER.md](LINKER.md). The CG/object lifetime boundary used by the
-remaining Phase 1 staging work is in
-[CG_OBJ_LIFECYCLE.md](CG_OBJ_LIFECYCLE.md). This document treats those as given
-and describes only the LTO-specific additions.
-
-The headline finding from investigating the tree: **most of the machinery for
-whole-program optimization already exists; it is just per-TU, single-arch, and
-partly unreached.** LTO here is three concrete refactors plus wiring, not a new
-subsystem. The largest of the three is factoring the linker's symbol-resolution
-policy out so it can run at merge time as well as at link time.
-
-## Status (2026-06-04)
-
-**Phase 0 is complete and shipping; Phase 1's all-sources-up-front LTO path is
-implemented in this branch.** The end state is not a C-only shortcut:
-every source-building verb routes through one staging engine, and every
-in-tree frontend declares either semantic CG staging or opaque-object
-participation. The link-picture-driven preserved/export prepass now feeds
-`kit_cg_finish`, and executable LTO internalizes non-preserved globals before
-the whole-module reachability walk. Where reality diverged from the original
-wording below:
-
-- **The gate is `-O1`, not `-O2`.** Whole-program optimization (deferred emit +
- module sweep + inliner) runs whenever the optimizer runs:
- `o->whole_program = (level >= 1)` in `opt_cgtarget_new`. `-O2` is treated as
- `-O1` for now. References to `-O2`/`-fwhole-program` gating below are superseded.
-- **One arch path, no identity checks.** The ARM64-only sweep is now
- `opt_whole_module_finalize` for every arch; `src/opt` has zero
- `arch == KIT_ARCH_*` checks. The sret arg-slot rule moved off arch identity to
- `ABIFuncInfo.sret_consumes_int_arg` (set per ABI impl). Remaining generic-layer
- arch identity (`src/cg/type.c`, `src/cg/atomic.c`, `src/link/link_resolve.c`) is
- tracked as separate cleanup, not part of LTO.
-- **Cross-TU LTO will be opt-in behind `-flto`** (revisit making it the `-O1`
- default once proven) — resolves the flag-surface open question.
-- **Frontend participation is explicit.** C, Toy, and Wasm lower into a
- caller-owned open `KitCg`; asm is an opaque LTO participant and continues to
- compile as an ordinary object.
-- **The lifecycle target is borrowed `KitCg` + caller-owned `ObjBuilder`, not a
- separate LTO unit abstraction.** `ObjBuilder` owns object lifetime; `KitCg`
- records source units into a borrowed object and finishes semantic codegen with
- link-picture policy. See [CG_OBJ_LIFECYCLE.md](CG_OBJ_LIFECYCLE.md).
-- **`symresolve_merge` signature** as built is `(SymAttrs existing, SymAttrs
- incoming)` with `in_comdat` carried inside `SymAttrs`; no separate `coff_target`
- parameter (the COMDAT flags carry everything the decision needs).
-- **Preserved/export internalization is part of Phase 1.** The LTO CG finish
- path receives linker-computed preserved symbols for executable links, and
- `cc -shared -flto` remains disabled until shared-library output is exercised.
-
-### Done
-
-- [x] **§6.1 Generalize the finalize sweep to all arches** — `opt_whole_module_finalize`
- (`src/opt/opt.c`); x64/rv64 defer-to-finalize; `-O0` and the JIT/interp/run paths
- unchanged; `opt_maybe_capture_interp` still invoked per reachable func.
-- [x] **§6.4 Wire `opt_inline`** over the reachable `FuncSet` — `opt_run_o1_native`
- split into `opt_o1_native_prepare` / `opt_o1_native_finish`; the sweep lowers the
- live set into one FuncSet, runs the inliner, then finishes each func.
-- [x] **Interposition soundness fix** (strengthens §9): weak/interposable callees are
- never inlined — `opt_cg_func_interposable` marks them `KIT_CG_INLINE_NEVER`, honored by
- both the streaming tiny-inliner and the whole-program inliner. Caught by a
- strong-over-weak override case the prior (tiny-inliner) behavior miscompiled.
-- [x] **§3 `symresolve` extraction** — `src/obj/symresolve.{h,c}`;
- `link_resolve_symbols` refactored onto `symresolve_merge`; `link_bind_strength` /
- `link_sym_is_def` / `link_sym_is_spurious_undef` are now wrappers. Behavior-preserving
- (test-link 122/0, test-macho 80/0, ODR/weak/common/COMDAT all covered).
-- [x] **§3 `ObjBuilder` name→id index** — `SymNameIndex` in `src/obj/obj.c`;
- `obj_symbol_find` is an authoritative O(1) hash lookup with no linear scan, kept
- exact through `obj_symbol_ex` and `obj_symbol_rename`.
-- [x] **Tests** — `test/opt/whole_program_inline.sh` (wired `test-opt-whole-program-inline`):
- static callee fuses on aa64/x64/rv64, weak callee kept out-of-line (interposition
- guard), `opt.inline.inlined` fires at `-O1`, and the kit-native build verbs
- (`build-obj`/`build-exe`) fuse too.
-- [x] **Build verbs participate.** `build-exe`/`build-lib`/`build-obj` (which replaced
- `compile` on `main`) compile each source to an in-memory builder under one
- `KitCompiler` via `build_compile_all` (`driver/cmd/build.c`) and route through the
- shared `kit_cg` path, so per-TU whole-program optimization applies at `-O1` with no
- verb-specific wiring. `build_compile_all` is also the single seam the Phase 1
- cross-TU staging loop will hook (all three verbs at once); `cc` keeps its own
- `cc_run_link_exe` → `link_engine` path.
-
-### Phase 1 source-staging checklist
-
-- [x] **Architecture lock-in.** Phase 1 is implemented as a frontend staging
- and CG/ObjBuilder lifecycle refactor, not a C-driver shortcut. All
- source-building verbs (`cc`, `build-exe`, `build-lib`, `build-obj`) route
- through the same staging engine. Frontends explicitly declare how they
- participate: semantic `kit_cg` staging for frontends that lower through CG, or
- opaque-object participation for inputs that cannot expose semantic IR
- (notably asm). The change is not complete until every in-tree frontend is
- opted into one of those modes.
-- [x] **§2 Skip-intern locals.** In `kit_cg_decl` (`src/cg/session.c:198`), for
- `SB_LOCAL` bindings skip `obj_symbol_find` and always mint a fresh id. Confirm the
- per-`Decl` id cache keeps intra-TU static reuse pointing at the cached id, and that
- single-TU behavior is unchanged (locals are already unique per name within a TU).
-- [x] **§4 Recording-arena lifetime — settle first.** Choose dedicated LTO arena vs
- `c->global` for the recorder/`CgIrModule` so accumulated IR outlives each per-TU
- frontend run. This is the one structural hazard (§9).
-- [x] **§4 Source staging under the current CG API.** Add a deferred-finalize
- mode to `kit_cg`: record N TUs into one shared session / `ObjBuilder` /
- `CgIrModule` without per-TU finalization, then finish CG and finalize the
- object once. Keep per-TU frontend state (Pool/DeclTable/type interning)
- independent.
-- [x] **§4 CG/ObjBuilder borrowed lifecycle.** Replace the former
- object-shaped CG bracket with the lifecycle in
- [CG_OBJ_LIFECYCLE.md](CG_OBJ_LIFECYCLE.md): caller-owned `ObjBuilder`,
- borrowed `KitCg`, explicit unit boundaries, `kit_cg_finish` for semantic
- codegen policy, and caller-owned object finalization. One-TU and multi-TU
- builds now use the same ownership model.
-- [x] **§3/§4 Recording-time merge.** At the per-TU staging boundary, when a TU
- contributes a body for a symbol already defined, call `symresolve_merge` to pick the
- winner; drop the loser's `CgIrFunc`/data and keep its decl as a reference; report ODR
- at the second definition's `SrcLoc`.
-- [x] **§4 Driver loop + `-flto` flag.** Parse `-flto` in `cc` and the build verbs,
- thread an LTO flag through `KitCodeOptions`/the driver, and add the staging path:
- one shared session, frontend per source, one CG finish/object finalize, single
- builder to the link session. Hook it at `build_compile_all`
- (`driver/cmd/build.c`) so build-exe/lib/obj get it together, plus
- `cc_run_link_exe`. (The build verbs already share one `KitCompiler`, so the
- seam is in place.)
-- [x] **§5 Preserved/export set.** Compute from the assembled link (entry symbol,
- dynamic exports, undefs referenced by opaque inputs, `used`/init-fini/asm-named/IFUNC/
- address-significant) and hand it to `kit_cg_finish`. Current Phase 1 behavior
- is conservative for relocatable/archive outputs, while executable outputs
- internalize non-preserved LTO definitions. Shared-library LTO remains disabled
- until shared output is exercised.
-- [x] **§6.2 Internalize** non-preserved globals using the preserved set (unlocks
- cross-TU DCE and unconstrained inlining), then re-run GC.
-- [x] **Tests.** A two-TU `test/smoke` (or `test/link`) case where a cross-TU callee
- inlines under `-flto`; a guard that a weak/exported cross-TU symbol is *not*
- inlined/internalized; cross-TU ODR reported at the right `SrcLoc`.
-
-## Baseline (what already exists)
-
-A handful of facts about the current code path frame everything below.
-
-- **Globals already intern by name within an `ObjBuilder`.** `kit_cg_decl` does
- `obj_symbol_find` then reuse-or-create (`src/cg/session.c:198`). Two frontends
- that `decl` `foo` into the *same* builder receive the *same* `ObjSymId`. The
- CG and optimizer IRs reference call targets and globals by `ObjSymId`
- (`IRCallAux.desc.callee.v.global.sym`), so a caller's `call foo` already points
- at the id the definer will define — no remap, no clone. This is the load-bearing
- fact for the whole design.
-- **The recorder already accumulates a whole module.** One `CgIrRecorder` owns
- one `CgIrModule` and appends every `func_begin`/`func_end` into it
- (`src/cg/ir_recorder.c`), flushing only at `finalize`. `CgIrModule`
- (`src/cg/ir.h:270`) holds all functions, aliases, and file-scope asm. Per
- function it carries `call_refs` and `global_refs` symbol sets
- (`src/cg/ir.h:247`) — the call/use graph is materialized during recording.
-- **The optimizer already finalizes over the whole module — for one arch.**
- `opt_on_finalize` (`src/opt/opt.c:566`) hands the entire `CgIrModule` to
- `opt_emit_reachable_aarch64` (`src/opt/opt.c:495`), which seeds a root set
- (non-`LOCAL` symbols, `KIT_CG_SYM_USED` locals, alias targets, exported data
- relocs), walks each function's `call_refs`/`global_refs` plus the data-reloc
- graph, **removes unreachable local symbols**, then lowers + optimizes + emits
- only what is live. This is whole-program GC for one TU. x86-64 and riscv64
- instead emit eagerly per function in `opt_on_func` (`src/opt/opt.c:322`); they
- have no module pass.
-- **The whole-program inliner exists and is unreached.** `opt_inline(FuncSet*,
- max_iters)` (`src/opt/pass_inline.c:667`) does topologically ordered,
- growth-gated, call-graph inlining over a `FuncSet` of lowered `Func`s. Only the
- streaming tiny variant `opt_try_tiny_inline` (cost cap 8, straightline only)
- runs today. The real inliner has never had a caller. See OPTIMIZER.md §6.
-- **The driver already shares one `KitCompiler` across sources** and keeps
- objects in memory through link. `cc_run_link_exe` compiles each source to its
- own in-memory `KitObjBuilder` under one compiler (`driver/cmd/cc.c:2655`,
- `objs[]` at `:2585`) and hands the builders straight to the link session via
- `kit_link_session_add_obj` (`:2735`/`:2771`) — no temp `.o` files. The
- orchestration seam for LTO is already where it needs to be.
-- **The obj layer has no resolution policy and no name index.**
- `obj_symbol_define` is last-writer-wins with no precedence check
- (`src/obj/obj.c:544`), and `obj_symbol_find` is a linear scan
- (`src/obj/obj.c:534`). The only resolution rule anywhere below the linker is
- the weak-demotion special case hand-coded into `kit_cg_decl`
- (`src/cg/session.c:203`). All real precedence lives in `link_resolve_symbols`
- (`src/link/link_resolve.c:258`).
-
-The merged module, then, is not something LTO must *build*. It is something the
-recorder already builds and the finalize path already consumes — for one TU, on
-one arch. LTO is mostly about *not tearing it down between TUs*, *generalizing
-the finalize pass*, and *applying real resolution policy as the merge happens*.
-
-## 1. Design decision: shared context, not clone-and-merge
-
-Two architectures can make the optimizer see one module:
-
-1. **Clone-and-merge.** Each TU records into its own `CgIrModule`/`ObjBuilder`;
- an IR-linker deep-copies every function into a merged module, rebuilds the
- symbol table, and remaps every operand/reloc/alias to merged ids.
-2. **Shared context.** All TUs record into *one* live session — one
- `ObjBuilder`, one recorder, one `CgIrModule` — so globals unify in place via
- the existing decl interning and the finalize path sees the union directly.
-
-We choose **shared context**. The comparison:
-
-| | Shared context | Clone + remap |
-|---|---|---|
-| Global identity | Free (decl already interns by name) | Rebuild symbol table + remap every operand/reloc/alias |
-| Memory / time | Record once, in place | Duplicate all IR into a merge arena |
-| Resolution policy | Apply at the per-TU merge boundary | Apply in the merge pass |
-| Local distinctness | Skip-intern locals (small CG change) | Falls out of remap |
-| Lifecycle cost | Staging mode + cross-TU arena lifetime | None — TUs stay independent |
-| Net new code | Mostly wiring + policy extraction | A full cloner/remapper on the hot path |
-| Serialized objects (Phase 2) | Deserialize = replay records through the same recording/merge API | A separate clone-from-bytes engine |
-
-Clone's only advantage is that TUs stay fully independent, so there is no
-staging lifecycle to manage. That is not worth re-implementing the symbol merge
-the linker already knows how to do, nor the per-TU IR duplication. The decisive
-row is the last one: shared context makes the recording/merge API the single
-funnel. A frontend feeds it; a `.kit.ir` deserializer feeds it the same way.
-There is no second merge engine to build for Phase 2 — fat-object LTO becomes
-"replay serialized decl/func records into the live shared module," reusing the
-same local-handling and resolution code paths verbatim.
-
-The rest of this document describes the shared-context design.
-
-## 2. Symbol identity: what unifies, what must stay distinct
-
-Shared context gets global unification for free (§Baseline). The one correctness
-trap is **local symbols**, and there is exactly one rule to add.
-
-`kit_cg_decl` interns *every* name through `obj_symbol_find`. For globals that is
-correct and desirable. For `SB_LOCAL` symbols it is wrong: two TUs each with
-`static int x;` (the frontend passes the bare name with LOCAL binding,
-`lang/c/decl/decl.c:72`) would collapse to one symbol. The same hazard exists for
-static functions, and for the per-TU counters behind block-scope statics
-(`mint_static_local_sym`, `lang/c/parse/parse.c:660`) and compound literals
-(`mint_compound_literal_sym`, `lang/c/parse/parse_init.c:1012`), which reset per
-parser and would produce colliding names like `y.0` and
-`__kit_compound_literal.2` across TUs.
-
-**Fix:** for `SB_LOCAL` bindings, skip `obj_symbol_find` and always mint a fresh
-id. Consequences, all benign:
-
-- Two locals named `x` get distinct `ObjSymId`s. Duplicate `STB_LOCAL` names in
- one object are legal in every format kit emits; locals never enter the global
- name table; the optimizer indexes functions by id, not name.
-- The frontend caches the id per `Decl`, so intra-TU reuse of a static is
- unaffected — the second reference goes through the cached id, not a fresh decl.
-- The static-vs-extern-same-name case resolves correctly: the static gets a fresh
- id; an unrelated `extern foo` keeps the shared global id.
-
-No frontend mangling is required. Anonymous read-only data (`.Lkit_ro.N`,
-`src/cg/memory.c:102`) needs no change at all once the session is shared, because
-the `rodata_counter` is no longer reset between TUs (see §4) and keeps climbing.
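-
-The rule in this section can be sketched with a toy symbol table. Everything
-named `Demo*` / `demo_*` is hypothetical and is not the `ObjBuilder` API; the
-linear lookup is only for brevity, and having the lookup ignore locals is an
-assumption that mirrors "locals never enter the global name table".
-
-```c
-#include <assert.h>
-#include <string.h>
-
-enum { DEMO_MAX_SYMS = 64 };
-
-typedef struct DemoSymTab {
-    const char* names[DEMO_MAX_SYMS];
-    int is_local[DEMO_MAX_SYMS];
-    int count;
-} DemoSymTab;
-
-/* Name lookup that only ever matches non-local symbols. */
-static int demo_find_global(const DemoSymTab* t, const char* name) {
-    for (int i = 0; i < t->count; i++)
-        if (!t->is_local[i] && strcmp(t->names[i], name) == 0) return i;
-    return -1;
-}
-
-/* Globals intern by name (reuse-or-create); locals skip the lookup and
- * always mint a fresh id, so same-named statics from two TUs stay distinct. */
-static int demo_decl(DemoSymTab* t, const char* name, int is_local) {
-    if (!is_local) {
-        int id = demo_find_global(t, name);
-        if (id >= 0) return id;
-    }
-    assert(t->count < DEMO_MAX_SYMS);
-    t->names[t->count] = name;
-    t->is_local[t->count] = is_local;
-    return t->count++;
-}
-```
-
-A frontend would still cache the returned id per declaration, so a second
-reference to the same static never calls the decl path again.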
-
-## 3. Resolution policy: factoring `symresolve` out of the linker
-
-Today, if we naively share one `ObjBuilder`, we lose all symbol-resolution
-semantics: `obj_symbol_define` overwrites last-writer-wins, so two strong
-definitions silently clobber instead of raising an ODR error, strong-vs-weak
-becomes declaration-order dependent, and commons never merge. The precedence
-rules we need already exist in `link_resolve_symbols`
-(`src/link/link_resolve.c:258`): strong-vs-strong → ODR error (modulo COFF
-COMDAT/SELECTANY), strong beats weak, weak-weak keeps the first, common merging
-takes max size/align, and a definition beats a common.
-
-Per the investigation, that logic is cleanly separable. The *decision* is pure
-over `(name, bind, kind, size, align, common_align, defined?, in_comdat)`
-tuples; only the *bookkeeping* — the `globals` `SymHash`, the per-input
-`InputMap`, COMDAT section discard, DSO iteration — is entangled with linker
-state.
-
-**Extract a small shared module**, `src/obj/symresolve.{h,c}` (the obj layer is
-the natural home; both consumers sit above it):
-
-```c
-// pure: no linker state, no allocation
-SymMergeResult symresolve_merge(SymAttrs existing, SymAttrs incoming,
- int coff_target);
-// -> KEEP_EXISTING | REPLACE | MERGE_COMMON(size, align)
-// | COMDAT_DISCARD | ODR_ERROR
-```
-
-Move `link_bind_strength`, `link_sym_is_def`, and `link_sym_is_spurious_undef`
-(`src/link/link_internal.h`) alongside it. Then:
-
-- **Refactor `link_resolve_symbols` onto it.** A pure cleanup with no behavior
- change, fully covered by the `test/link` corpus, that gives the policy one
- source of truth and leaves the linker better than we found it.
-- **The LTO staging coordinator calls the same function** at the per-TU merge
- boundary. Crucially this is a *binding-precedence* decision — which body wins —
- not id remapping, because ids are already unified. When TU B contributes a body
- for a global TU A already provided, `symresolve_merge` decides whether to keep
- A's `CgIrFunc`/data, replace it with B's, merge commons, or raise ODR. The
- loser's body is dropped from the module and its decl remains as a reference.
- ODR conflicts are reported at the second definition's `SrcLoc` — better
- diagnostics than the linker's post-hoc panic, because source locations still
- exist at this point.
-
-Two mechanical needs fall out of sharing one builder:
-
-- **Give `ObjBuilder` a `name -> id` hash map.** With the whole program's symbols
- in one builder, the linear `obj_symbol_find` (`src/obj/obj.c:534`) makes decl
- time O(n²) overall, since each of the n lookups is an O(n) scan. The assembler
- already carries its own `SymSymMap` precisely because obj lacks one
- (`src/asm/asm.c`). Adding the index to the builder removes the quadratic cost
- and hosts the resolution hook. That is a win even for ordinary single-TU
- compiles, and it lets the assembler shed its private map later.
-- **One open question on `define` timing.** During pure recording a global is
- *declared* (with binding) but not *defined* in the obj sense until finalize
- emits its section/offset. So the "which body wins" decision must run at the
- staging boundary against the set of bodies a TU contributes (it has a
- `CgIrFunc` or data record for the symbol), not at `obj_symbol_define` time.
- This is the linker's per-input symbol merge applied to `CgIrModule`
- contributions.
-
-The opaque inputs in a link (libc, crt, kit archives, DSOs) are still resolved by
-the linker at link time against the single emitted LTO object. So the policy
-module has two call sites — recording-time merge among the LTO set, and
-link-time resolution against everything — which is the justification for
-extracting it rather than duplicating it.
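-
-The precedence rules listed in this section can be sketched as a pure function
-over two attribute records. This is an illustration under stated assumptions,
-not the extracted module: all `Demo*` names are hypothetical, a weak definition
-is treated like any other definition when it meets a common, and the COMDAT
-case is reduced to "both sides in a COMDAT group, discard the incoming copy".
-
-```c
-#include <assert.h>
-
-typedef enum { DEMO_UNDEF, DEMO_COMMON, DEMO_WEAK, DEMO_STRONG } DemoBind;
-
-typedef struct DemoSymAttrs {
-    DemoBind bind;
-    unsigned long size, align;
-    int in_comdat;
-} DemoSymAttrs;
-
-typedef enum {
-    DEMO_KEEP_EXISTING, DEMO_REPLACE, DEMO_MERGE_COMMON,
-    DEMO_COMDAT_DISCARD, DEMO_ODR_ERROR
-} DemoMergeKind;
-
-typedef struct DemoMergeResult {
-    DemoMergeKind kind;
-    unsigned long size, align; /* merged values for DEMO_MERGE_COMMON */
-} DemoMergeResult;
-
-static unsigned long demo_max(unsigned long a, unsigned long b) {
-    return a > b ? a : b;
-}
-
-/* Pure decision: no tables, no allocation, no linker state. */
-static DemoMergeResult demo_merge(DemoSymAttrs ex, DemoSymAttrs in) {
-    DemoMergeResult keep = { DEMO_KEEP_EXISTING, ex.size, ex.align };
-    DemoMergeResult take = { DEMO_REPLACE, in.size, in.align };
-    assert(ex.bind <= DEMO_STRONG && in.bind <= DEMO_STRONG);
-    if (in.bind == DEMO_UNDEF) return keep;   /* a reference displaces nothing */
-    if (ex.bind == DEMO_UNDEF) return take;   /* first body for this name */
-    if (ex.bind == DEMO_COMMON && in.bind == DEMO_COMMON) {
-        DemoMergeResult r = { DEMO_MERGE_COMMON,
-                              demo_max(ex.size, in.size),
-                              demo_max(ex.align, in.align) };
-        return r;                             /* commons take max size/align */
-    }
-    if (ex.bind == DEMO_COMMON) return take;  /* a definition beats a common */
-    if (in.bind == DEMO_COMMON) return keep;
-    if (ex.bind == DEMO_STRONG && in.bind == DEMO_STRONG) {
-        DemoMergeResult r = keep;
-        r.kind = (ex.in_comdat && in.in_comdat) ? DEMO_COMDAT_DISCARD
-                                                : DEMO_ODR_ERROR;
-        return r;
-    }
-    if (in.bind == DEMO_STRONG) return take;  /* strong beats weak */
-    return keep;                              /* strong-vs-weak, weak-vs-weak */
-}
-```
-
-Because the function depends only on its two arguments, the same code could in
-principle serve both call sites named above: the recording-time merge and the
-link-time resolution.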
-
-## 4. The staging lifecycle
-
-The lifecycle target for Phase 1 is documented in
-[CG_OBJ_LIFECYCLE.md](CG_OBJ_LIFECYCLE.md). The short version: `ObjBuilder`
-owns object lifetime, while `KitCg` borrows an object, records one or more
-semantic units, and finishes codegen into that object. `kit_cg_finish` is a CG
-flush/lowering/debug operation; it is not object finalization.
-
-The old object-shaped bracket used to finalize (lowering and emitting
-everything), null `g->obj`/`g->target`, and reset per-object state including
-`rodata_counter` (`src/cg/session.c`). That structure has been replaced by a
-borrowed lifecycle:
-
-- **Record each TU as a unit in one live CG session without object
- finalization.** Run a single `kit_cg_finish` after the last semantic source,
- then let the caller finalize the `ObjBuilder`. The shared path records N
- semantic frontends into one shared `KitCg` / `ObjBuilder` and finalizes once
- through the explicit lifecycle: `kit_cg_begin`, `kit_cg_begin_unit`,
- `kit_cg_end_unit`, `kit_cg_finish`, and `kit_cg_detach`/`kit_cg_abort`.
- Drivers collect sources and opaque inputs; they do not implement definition
- selection, IR lifetime, semantic finalization, or object finalization policy.
-- **Frontend participation is explicit.** `KitFrontendVTable` has a split
- contract: semantic frontends implement `compile_cg`, while opaque frontends
- implement `compile_obj`. C, Toy, and Wasm participate by emitting into a
- caller-owned open `KitCg` session; one-TU object builds are wrapped at the
- compile-session layer by creating an `ObjBuilder`, attaching `KitCg` for one
- unit, finishing CG, and then finalizing the object.
- Asm has no semantic CG representation, so its LTO participation mode is opaque:
- it compiles to an ordinary object and contributes references/definitions to the
- link picture but not to the merged optimization module. This keeps all verbs
- and all frontends on one declared path while allowing semantic frontend opt-in
- one at a time.
-- **The recording arena must outlive any single TU.** The recorder and module are
- arena-allocated from `c->tu` today (`opt_cgtarget_new`, `cg_ir_recorder_new`).
- In the current implementation `c->tu` is already compiler-session lifetime
- (not reset between source inputs), so Phase 1 uses it as the cross-TU recorder
- arena and documents that lifetime. If `c->tu` later becomes per-source again,
- the shared CG path must switch to an explicit cross-source arena; the frontend
- staging API must not depend on that allocator choice.
-- **Each TU keeps its own frontend state.** The per-TU `Pool`, `DeclTable`, and
- type interning stay independent; only the CG session and `ObjBuilder` are
- shared. The shared `KitCompiler` already spans sources today, so `c->global`
- name interning is already consistent across TUs.
-
-The driver change is a shared staging engine: group every LTO-capable source
-input in command-line order, stage semantic frontends into the borrowed CG
-session and shared object, compile opaque frontends/objects as ordinary inputs,
-then finish CG once and substitute the resulting builder at the right place in
-the link order. The hooks are `build_compile_all` in `driver/cmd/build.c` (shared
-by build-exe/build-lib/build-obj) and `cc_run_link_exe`; both already compile
-every source under one `KitCompiler`, which is the seam this loop replaces.
-(`compile`/`compile_engine` from the original plan were retired in favor of the
-build verbs on `main`.)
-
-## 5. The export / preserved set
-
-Internalizing a global — demoting it to hidden/local, which unlocks DCE and
-unconstrained inlining — is sound only when nothing outside the LTO set can
-reference it by name and it is not interposable. This is the one input that
-genuinely needs the full link picture, so it is computed at link time and handed
-to the LTO core. A symbol must be **preserved** if it is:
-
-- the entry symbol (`main`/`_start`), or in the dynamic export set; for `-shared`,
- default-visibility symbols are interposable and must **not** be internalized or
- inlined across unless `-fvisibility=hidden` / a version script / `-Bsymbolic`
- says otherwise;
-- referenced (undefined) by any **opaque** input — libc/crt calling `main`, a
- kit archive member that is not IR, a DSO;
-- `__attribute__((used))`, in an init/fini array, named in inline or file-scope
- asm, an IFUNC resolver, or address-significant in an opaque input.
-
-The linker already answers "is this symbol referenced from outside" for archive
-pull (`scan_presence_before` / `member_satisfies`, `src/link/link_resolve.c:859`,
-`:923`); the preserved set is the same question asked of the LTO set against the
-opaque inputs and the output-kind/visibility policy. Conservative default:
-internalize only for executable outputs or provably non-exported symbols.
-
-Phase 1 implements this for all-sources-up-front executable LTO: the driver
-stages semantic sources, assembles the ordered link session, asks the linker for
-preserved LTO symbols, then passes those IDs to `kit_cg_finish` before object
-finalization. Relocatable and archive-member outputs remain conservative because
-later links may still reference globals by name. Shared-library LTO continues
-to be rejected until the shared-output policy is exercised.
-
-## 6. The whole-program optimization core
-
-With a merged module and a preserved set, the core is `opt_emit_reachable_aarch64`
-generalized:
-
-1. **Generalize the finalize sweep to all arches.** Lift the ARM64-only path in
- `opt_on_finalize` into an arch-independent `opt_whole_module_finalize`, and
- switch x86-64/riscv64 from eager per-function emit to defer-to-finalize when
- the whole-program path is active. Keep `-O0`/`-O1` streaming and the
- JIT/interp/`run`/`dbg`/`emu` paths on the existing eager path — LTO is an AOT
- concern. The one verification item is that nothing downstream depends on x64/rv64
- eager emission (`opt_maybe_capture_interp`).
-2. **Internalize** non-preserved globals using the §5 set.
-3. **GC** unreachable functions and data (the existing reachability walk, now over
- the whole program).
-4. **Lower the reachable set into a `FuncSet`** and run `opt_inline` — the
- already-written, never-called whole-program inliner — with a real cost model.
-5. **Emit one object** and substitute it for the IR inputs before the final link.
-
-Steps 1 and 4 are independently valuable and land first (Phase 0): they turn the
-unreached inliner and the generalized sweep into a tested, shipping path on a
-single TU before any cross-TU complexity exists.
-
-## 7. Phased delivery
-
-**Phase 0 — Whole-translation-unit optimization.** No merge, no serialization,
-no driver changes. Generalize the finalize sweep to all arches, switch
-x64/rv64 to defer-to-finalize under the whole-program path, and wire `opt_inline`
-over the reachable `FuncSet`. Delivers real cross-function inlining within a TU at
-`-O2` on every arch, generalized dead-static elimination, and the inliner finally
-exercised on real code. Lowest risk — purely inside the optimizer — and it
-validates the deferred-emit path that Phase 1's staging lifecycle also relies on.
-
-**Phase 1 — Shared-context, all-sources-up-front LTO.** The target case,
-`kit cc *.c -flto -o prog` and `kit build-exe -flto` (and `build-lib`/`build-obj`;
-`build-obj` replaced the retired `compile`). Build on Phase 0 by adding:
-(a) the `symresolve` extraction (§3), (b) the `ObjBuilder` name index (§3),
-(c) skip-intern for locals (§2), (d) the `KitCg`/`ObjBuilder` borrowed staging
-lifecycle and the driver loop that records N frontends into one session and
-finishes CG once (§4), (e) the preserved set fed from the assembled link into
-`kit_cg_finish` (§5). No cloner, no serialization, no archive support yet.
-
-**Phase 2 — Serialized IR objects (`.kit.ir`).** Optional follow-on for separate
-compilation, archives, and build caches. `kit cc -c -flto a.c` emits a normal
-object whose symbol table is the real decl set — so the linker's symbol-driven
-archive pull works unchanged — plus a `.kit.ir` custom section (the object model
-already supports arbitrary `SEC_OTHER` sections) carrying a serialized
-`CgIrModule`. The linker detects the section and **replays the records into the
-same shared context** through the same recording/merge API, reusing
-skip-intern-locals and `symresolve_merge` verbatim. Archives of IR objects work
-because pull is symbol-table driven. Note: whole-program LTO is incompatible with
-the file-based incremental linker (LINKER.md); `-flto` forces a full link.
-
-## 8. Optimizations unlocked
-
-Inlining is the headline; the merged-module + `FuncSet` framework makes a whole
-interprocedural family *expressible* (listed as enabled, not committed):
-
-- **Cross-TU / whole-program inlining** — `opt_inline`, already written.
-- **Internalization** to hidden/local for non-exported globals — enables DCE,
- removes PLT/GOT indirection, frees intra-function optimization.
-- **Whole-program dead code / data elimination** — the generalized sweep.
-- Future: devirtualization / direct-call promotion, IPSCCP and cross-function
- constant/range propagation, argument promotion, identical-code folding,
- `const`/`pure` inference, global-to-local-constant propagation.
-
-## 9. Risks, semantics, and limitations
-
-- **Resolution fidelity.** ODR, weak/strong, common merging, COMDAT, IFUNC,
- aliases, and visibility must match the linker exactly or LTO miscompiles —
- hence the shared `symresolve` module rather than a re-implementation.
-- **Interposition / shared libraries.** Never internalize or inline across an
- interposable default-visibility boundary unless `-Bsymbolic`/hidden visibility
- makes it safe. Default conservative for `-shared`.
-- **Inline and file-scope asm.** Symbols named in inline or file-scope asm are
- opaque references: treat them as roots and never rename, internalize, or DCE them.
-- **Debug info.** Cross-TU inlines need `inlined_subroutine` DWARF with correct
- file/line; `SrcLoc` must carry file identity through the merged module.
- Acceptable initial limitation: degraded inlined debug info under `-g -flto`,
- stated explicitly.
-- **Compile time / memory.** The whole program lives in memory; `opt_inline`'s
- growth gates bound blow-up; only the LTO set is optimized, opaque inputs stay
- opaque.
-- **Determinism.** Record and merge in input order; iterate stably.
-- **Recording arena lifetime** (§4) is the one structural hazard — settle it
- before building the staging loop.
-- **TLS, varargs, atomics, computed goto, label-address tables** must survive the
- shared module unchanged. Function-local label addresses are already
- function-scoped; cross-function `data_addr`/`pcrel`/`symdiff` reference symbols,
- which are already unified by id.
-
-## 10. First slices
-
-Two independently landable, low-risk steps that de-risk the whole direction
-before any LTO surface exists:
-
-1. **Extract `symresolve` and refactor `link_resolve_symbols` onto it.** Pure
- refactor, covered by `test/link`. Lands the load-bearing piece and improves
- the linker regardless of LTO.
-2. **Add the `ObjBuilder` name -> id index** behind the existing
- `obj_symbol_find`/`obj_symbol_ex` API. Drop-in; measurable on its own.
-
-Then Phase 0 (generalize the sweep + wire `opt_inline`, gated behind `-O2` /
-`-fwhole-program`), validated first on x86-64 with red-green tests in `test/opt`
-(a caller+callee that should fuse) and `test/smoke/x64` (behavioral parity). Then
-the staging lifecycle and skip-intern-locals behind `-flto`, exercised first on a
-two-TU `test/smoke` case where a cross-TU callee inlines.
-
-## Open questions
-
-- **Define-timing for resolution** (§3): confirm the staging-boundary merge is the
- right hook versus an `obj_symbol_define`-time check, given symbols are only
- obj-defined at emit.
-- **Recording arena follow-through** (§4): Phase 1 relies on `c->tu` having
- compiler-session lifetime for the cross-TU recorder/module. If frontend reset
- semantics later make `c->tu` per-source again, move the recorder/module to an
- explicit cross-source arena without changing the frontend staging API.
-- **`-flto` flag surface** (largely resolved; see Status): `-flto` is opt-in on `cc`
- and the build verbs. Still open: whether `-fwhole-program` is a distinct, more
- aggressive internalization mode, and whether to make cross-TU LTO the `-O1`
- default later.
-- **CG API exposure**: how much of the borrowed lifecycle
- (`kit_cg_begin`/`kit_cg_begin_unit`/`kit_cg_finish`/`kit_cg_detach`) remains
- internal to the driver (`build.c`'s `build_compile_all`, `cc_run_link_exe`)
- versus becoming a public `kit_cg`/`kit_compile` surface for embedders driving
- multi-TU LTO.
diff --git a/doc/plan/README.md b/doc/plan/README.md
@@ -11,7 +11,6 @@ shrinks to whatever remains open.
| [RELEASE.md](RELEASE.md) | Cross-cutting initial-release punchlist: release scope, deferred features, and per-subsystem completion/validation items. | — |
| [OPTIMIZER.md](OPTIMIZER.md) | Completing the O2 SSA mid-end, expanded inlining, -O0/-O1 performance work, machine register-constraint improvements. | [../OPT.md](../OPT.md) |
| [LINKER.md](LINKER.md) | Incremental linking: the file-based object-link redesign and remaining non-ELF format coverage. | [../LINK.md](../LINK.md) |
-| [RELOC.md](RELOC.md) | Genericizing the canonical-`RelocKind` half of the relocation layer. WS-B/C/E all landed (per-arch `RelocDesc` table, byte-patcher partitioned per-arch, FreeBSD IFUNC/IRELATIVE); only optional WS-A enum collapse remains. | [../OBJ.md](../OBJ.md), [../LINK.md](../LINK.md) |
| [JIT.md](JIT.md) | Function-level hot reload, Go-runtime-style codegen support, and remaining JIT host-portability work. | [../JIT.md](../JIT.md) |
| [DEBUG.md](DEBUG.md) | The Windows debugger host adapter, x64/rv64 displaced single-step, profiling, and DWARF gaps. | [../DBG.md](../DBG.md), [../DWARF.md](../DWARF.md) |
| [WASM.md](WASM.md) | Completing the Wasm object backend and remaining parser/validator coverage. | [../WASM.md](../WASM.md) |
@@ -21,9 +20,6 @@ shrinks to whatever remains open.
| [BUILD.md](BUILD.md) | A new content-addressed build coordinator (Bazel/Nix-style incremental builds layered on the CAS) — storage state machine, caching algorithm, recipe protocol. Distinct from `../BUILD.md` (kit's own Makefile build). | — (new subsystem) |
| [BUILD_COMMANDS.md](BUILD_COMMANDS.md) | The kit-native `build-exe`/`build-lib`/`build-obj` verbs that replace `compile`: polyglot, in-memory compile+link with `--group` flag scoping and full link-flag control. Distinct from `BUILD.md` (the CAS coordinator). | [../DRIVER.md](../DRIVER.md) |
| [LLGEN_IMPORT.md](LLGEN_IMPORT.md) | Importing the standalone LL(1)/Pratt parser and lexer generator into libkit, including public API renames, file moves, build gates, and a `kit llgen` command. | — |
-| [BACKTRACE.md](BACKTRACE.md) | Stack-trace support: GCC-compatible `__builtin_return_address`/`__builtin_frame_address` primitives, a freestanding `__kit_backtrace` capture helper, and symbolized backtrace printing. L1–L3a/L3c shipped; L3b (in-process self-symbolization) deferred. | [../FRONTENDS.md](../FRONTENDS.md), [../RUNTIME.md](../RUNTIME.md), [../DWARF.md](../DWARF.md) |
-| [LTO.md](LTO.md) | Whole-program optimization: `symresolve` extraction, cross-TU inlining, internalization. Phase 0 (whole-TU opt) and Phase 1 (all-sources-up-front LTO) shipped; Phase 2 (serialized `.kit.ir` objects) open. | [../OPT.md](../OPT.md) |
| [CODEGEN.md](CODEGEN.md) | CG API interface cleanup: PLACE/VALUE centerpiece, op/intrinsic taxonomy, atomic/order/AsmDir unification, multi-result API, i128/f128-as-VALUE. Tracks 1/3/4/5/6/7 landed; Track 2 (binop/cmp split) and Track 1c open. | [../CODEGEN.md](../CODEGEN.md) |
-| [DIST_LIBRARY.md](DIST_LIBRARY.md) | Migrating the CAS/package distribution subsystem into libkit as a gated public API (`kit/cas.h`, `kit/package.h`). Main migration shipped; Stage 3 v2 dead-code deletion deferred. | [../DISTRIBUTE.md](../DISTRIBUTE.md) |
| [FREEBSD.md](FREEBSD.md) | FreeBSD target support: VM harness, triple parsing, runtime variants, COMDAT/`STB_GNU_UNIQUE` fixes. Static link blocked on archive weak-alias cycle (needs `--start-group` semantics); dynamic link and full VM validation remaining. | — |
| [TODO.md](TODO.md) | Open deferred fixes and code smells only. Completed items are removed instead of checked off. Not a roadmap; a current backlog. | — |
diff --git a/doc/plan/RELOC.md b/doc/plan/RELOC.md
@@ -1,371 +0,0 @@
-# Relocation-layer genericization (planned work)
-
-## Status — 2026-06-05 — WS-B (descriptor table) + WS-C (byte-patcher partition) + WS-E.2/E.3 (residual gates) landed; only the optional WS-A enum collapse remains
-
-This roadmap makes the **canonical-`RelocKind` half** of the relocation subsystem
-as modular as the wire half already is. The goal is the project's standing
-contract (see [../INTERFACES.md](../INTERFACES.md)): code that depends on a
-pluggable item — here, the target **arch** — must never switch on its identity,
-and adding or changing an arch's relocations must touch exactly **one place**.
-
-The "modularity wave" commits (`9d905b3c..769d6ae1`) already closed the two
-identity *switches* in the reloc path and moved the reloc-name table onto a
-per-arch hook, all via the incremental capability-hook style (narrow fields/hooks
-on the existing `LinkArchDesc` / `ObjElfArchOps` vtables). **What remains is the
-structural denormalization**: the per-kind static facts (width, GOT/TLS class) are
-still re-enumerated in generic switches, and the byte-patcher's ISA encoders still
-live in the format-neutral obj layer. This revision marks the landed items as
-baseline and rescopes the open work accordingly.
-
-Design docs this work feeds back into once shipped:
-[../OBJ.md](../OBJ.md) ("Relocation model and the shared byte-patcher"),
-[../LINK.md](../LINK.md) (the reloc passes), [../INTERFACES.md](../INTERFACES.md)
-(the backend contract).
-
-## Landed since this plan was first written (`9d905b3c..769d6ae1`)
-
-- **The one arch-identity switch is gone (was finding #25).** The
- `(target.arch == KIT_ARCH_X86_64) ? R_X64_TPOFF64 : R_AARCH64_TPOFF64` ternary in
- `link_emit_internal_tpoff64` is now `link_arch_desc_for(l->c)->tpoff64_reloc`, a
- new per-arch `LinkArchDesc` field (`src/link/link_arch.h`, populated in
- `src/arch/{aa64,x64,riscv}/link.c`). This is WS-A's *functional* fix via the
- field route rather than the value-class collapse — the collapse remains an
- optional cleanup (now WS-A below, downgraded).
-- **The FreeBSD static-IFUNC OS gate is gone (was finding #18).** `use_rela_iplt`
- now calls `obj_format_static_ifunc_via_rela_iplt(c)` (`src/obj/obj.h:819`, impl
- `src/obj/obj_secnames.c:371`) instead of `os == KIT_OS_FREEBSD && obj ==
- KIT_OBJ_ELF`. WS-E item 1 is **done**.
-- **The reloc-name table moved to a per-arch hook (was finding #24, partially).**
- `kit_obj_reloc_kind_name` no longer inlines an x86_64 table; it lowers the
- canonical kind via `reloc_to` and calls the new `ObjElfArchOps.reloc_name`
- (`src/obj/format.h:65`; impls `elf_{x86_64,aarch64,riscv}_reloc_name`). **But**
- the dispatch is still gated `if (fmt != KIT_OBJ_ELF || arch != KIT_ARCH_X86_64)
- return NULL;` (`src/api/object_file.c:384`): the aarch64/riscv `reloc_name`
- functions exist but are deliberately *not* consulted, because the rv64/aa64
- objdump golden corpus expects the arch-neutral spelling ("RV_CALL", not
- "R_RISCV_CALL"). So the name *table* is now per-arch data, but a residual
- two-axis identity gate remains, coupled to the test corpus. See WS-E item 3.
-
-Net: the reloc path now contains **no arch-identity branch**, but still
-denormalizes per-kind facts across generic switches (the structural work below).
-
-## The thesis (what still stands)
-
-A relocation kind is a single logical entity. Its static attributes used to live in
-parallel tables the compiler could not keep in sync; each now has one home:
-
-| Attribute | Lives in | Status |
-|-----------|----------|--------|
-| how to patch the bytes | per-arch `src/arch/<arch>/reloc.c` (`*_reloc_apply_insn`) + neutral `reloc_apply_neutral()` `src/obj/reloc_apply.c`; dispatched by `link_reloc_apply()` `src/link/link_reloc_apply.c` | **landed** — WS-C |
-| byte width | `RelocDesc.width` (per-arch `src/arch/<arch>/reloc.c` + neutral `src/obj/reloc.c`) | **landed** — WS-B |
-| uses GOT / is TLS-GOT | `RelocDesc.flags` `RELOC_USES_GOT`/`RELOC_IS_TLS_GOT` | **landed** — WS-B |
-| branch / got-load / tlvp / direct-page | `RelocDesc.flags` `RELOC_IS_BRANCH`/`USES_GOT`/`IS_TLVP`/`DIRECT_PAGE` | **landed** — WS-B |
-| display name | `ObjElfArchOps.reloc_name` `src/obj/format.h:65` (per-arch hook) | **landed** (with a residual gate — WS-E.3) |
-
-Before WS-B and WS-C, two generic switches (`reloc_width`, `reloc_uses_got`/`is_tls_got`)
-enumerated every arch's kinds, so adding an arch's relocation edited generic `link` code,
-and the GOT/branch classification was *answered twice*: once by those generic switches
-(consumed by the ELF/static GOT pass) and once by the per-arch `LinkArchDesc.is_*`
-hooks (consumed by the Mach-O linker). The byte-patcher's per-kind encoders, which are
-pure ISA knowledge, sat in the format-neutral `src/obj/reloc_apply.c`.
-
-## Baseline — already clean (context, not work)
-
-- **Per-(arch,format) wire translators** (`reloc_to`/`reloc_from`/`reloc_pcrel`/
- `reloc_length`, and now `reloc_name`) in `src/obj/{elf,macho,coff}/reloc_<arch>.c`,
- reached only through the format sub-ops (`src/obj/format.h:55-81`). Adding a format
- or an arch's wire encoding is a one-table change. These do **not** move; the
- per-arch reloc *name* legitimately belongs here, not in the descriptor below.
-- **The single-entry byte-patcher boundary.** `link_reloc_apply(c, kind, P, S, A, P)`
- is reused verbatim by the static linker, JIT linker, assembler, and emulator guest
- loader ([../OBJ.md](../OBJ.md): "one encoder, three loaders"). That **one-entry,
- one-encoder invariant is load-bearing** and WS-C preserves it: only the
- implementation behind the entry is partitioned, never the entry.
-- **`LinkArchDesc`** already carries per-arch PLT/IPLT geometry, stub emitters, the
- `is_*` classifiers, and now `tpoff64_reloc`. It is the proven home for per-arch
- link facts; WS-B extends it (or a descriptor it points to), it does not replace it.
-- **The canonical `RelocKind` enum** (`src/obj/obj.h:108`) — one global enum,
- backends emit canonical kinds — is correct and stays.
-
-## The end state (ownership)
-
-```
-src/obj/reloc_apply.c neutral core: reloc_apply_neutral() — byte encoders
- for the arch-independent data-word kinds (R_ABS*,
- R_REL*, R_PC*, R_TPOFF*, the x64 GOT/dynamic data
- slots, the RISC-V data ADD/SUB/SET arithmetic) + the
- ULEB128 codec. Pure obj-core, no link/arch dep.
-src/link/link_reloc_apply.c (NEW) the single public link_reloc_apply() dispatcher:
- neutral-then-arch. Housed in link (not obj-core)
- because resolving the per-arch slice needs
- link_arch_desc_for() — same boundary call as WS-B's
- reloc_desc() dispatcher.
-src/arch/<arch>/reloc.c that arch's RelocDesc rows (width + class flags, WS-B)
- AND its instruction-immediate byte encoders
- (*_reloc_apply_insn, WS-C), reached via
- LinkArchDesc.reloc_apply_insn. (R_PLT32's apply is the
- RISC-V AUIPC+JALR pair, so it lives in the rv hook with
- R_RV_CALL — not neutral, despite its neutral name.)
-src/obj/<fmt>/reloc_<arch>.c UNCHANGED — the per-(arch,fmt) wire translators,
- incl. the reloc_name spellings (already landed).
-src/obj/coff/reloc.c COFF-specific kinds' RelocDesc rows (format, not arch).
-```
-
-After this, adding an arch's relocation is **one row** (width + flags) in that
-arch's `reloc.c`, one byte encoder beside it, and one wire-translator entry — all
-arch-local. No generic file in `src/link` or `src/api` enumerates relocation kinds.
-
----
-
-## WS-A — Value-class kind collapse (addresses **A**) — *#25 done; collapse optional*
-
-**Status.** The identity switch (#25) is **fixed** via `LinkArchDesc.tpoff64_reloc`.
-What remains is the underlying naming smell, now *optional* and lower-value: the
-canonical enum still carries two byte-identical 64-bit-tpoff kinds, and RISC-V
-reuses the AArch64-named one cross-arch (`src/arch/riscv/link.c:131,149:
-.tpoff64_reloc = R_AARCH64_TPOFF64`).
-
-**Optional cleanup.** Collapse `R_X64_TPOFF64` + `R_AARCH64_TPOFF64` → a neutral
-`R_TPOFF64` (apply arm is shared already, `reloc_apply.c:98-99`). This additionally
-**retires the `tpoff64_reloc` field** — once all three arches name the same kind,
-`link_emit_internal_tpoff64` just writes `R_TPOFF64` and the per-arch field has no
-remaining variation. Touch-sites: `obj.h:198,284` (enum), `reloc_apply.c:98-99` +
-`reloc_width` (fold arms), `obj/elf/reloc_x86_64.c` (`R_TPOFF64 ↔
-ELF_R_X86_64_TPOFF64`; aa64 stays wire-less), `obj/elf/link.c:352,388` (the two
-arch-specific tpoff-classification helpers — verify the variant-I/II *coordinate*
-selection there keys on the ABI/arch context, not on the kind name, before
-merging), and `arch/{aa64,x64,riscv}/link.c` (drop `.tpoff64_reloc`).
-
-**Defer unless** doing WS-B/C anyway — it is pure tidiness now and best folded into
-that pass (the descriptor work touches the same enum + apply arms). No urgency:
-there is no remaining identity switch here.
-
-**Oracle.** `make test-link test-elf test-smoke-x64 test-smoke-rv64
-test-aa64-inline` + a TLS `test-toy` slice + `make bootstrap` (IE-model TLS).
-
----
-
-## WS-B — One per-arch `RelocDesc {width, flags}` table (addresses **B + C**) — *LANDED*
-
-**Status (landed).** `RelocDesc {u8 width; u8 flags}` resolved arch-aware by
-`reloc_desc(c, k)`:
-- neutral data-word kinds → `src/obj/reloc.c` (`reloc_desc_neutral`, pure obj-core);
-- per-arch slices → `src/arch/{aa64,x64,riscv}/reloc.c`, reached through a new
- `LinkArchDesc.reloc_desc` hook that replaces the five `is_*` hooks;
-- dispatcher + `reloc_kind_*` predicates → `src/link/link_reloc_desc.{h,c}`.
-
-Placement note: the **dispatcher** lives in `src/link`, not the plan's
-`src/obj/reloc.c`, because resolving the per-arch slice needs `link_arch_desc_for()`
-— housing it in obj-core would invert the obj→link boundary (CLAUDE.md). The neutral
-descriptor *data* is still pure obj-core (`src/obj/reloc.c`). The arch slice wins over
-neutral so `R_PLT32` can be a branch on x86-64/RISC-V but flag-free on AArch64 while
-sharing the neutral width.
-
-Deleted: `reloc_width` / `reloc_uses_got` / `reloc_is_tls_got` (link_reloc_layout) and
-`jit_reloc_width_local` (link_jit). Migrated consumers (GOT/stub/width passes,
-`link_jit`, and the Mach-O `is_*` call sites) read `reloc_kind_*`. Migration guard:
-`test/link/reloc_desc_test.c` — frozen-oracle parity over every kind × every backend
-arch (3016 checks). `rg "case R_(AARCH64|X64|RV)_" src/link` is now empty; full
-link/elf/macho/ar/isa/aa64-inline suites + `make bootstrap` (debug+release,
-byte-identical) pass. WS-A's enum collapse stays deferred — `tpoff64_reloc` remains a
-per-arch field.
-
-**Problem (original).** `reloc_width()` and `reloc_uses_got()`/`reloc_is_tls_got()` are generic
-switches re-enumerating every arch's kinds, and the GOT/branch classification is
-answered *twice* (those switches vs the per-arch `LinkArchDesc.is_*` hooks). Adding
-an arch's reloc edits generic `link_reloc_layout.c`; the two classification
-mechanisms can silently disagree.
-
-**Change.** One descriptor, owned per-arch, as the single source of a kind's static
-*structural* facts. **Name is excluded** — it already landed on the per-arch wire
-ops (`ObjElfArchOps.reloc_name`), which is its correct home; the descriptor carries
-only width + classification.
-
-```c
-/* src/obj/reloc.h (new) */
-typedef enum RelocDescFlag {
- RELOC_PCREL = 1u << 0,
- RELOC_USES_GOT = 1u << 1,
- RELOC_IS_TLS_GOT = 1u << 2,
- RELOC_IS_BRANCH = 1u << 3, /* needs a JIT/range veneer (== needs_jit_call_stub) */
- RELOC_IS_TLVP = 1u << 4, /* Mach-O TLV page/pageoff */
- RELOC_DIRECT_PAGE = 1u << 5, /* Mach-O ADRP-direct */
- RELOC_MARKER = 1u << 6, /* RELAX/ALIGN/TPREL_ADD — no bytes */
- RELOC_WIDTH_DYN = 1u << 7, /* ULEB128 — width read from bytes at apply */
-} RelocDescFlag;
-
-typedef struct RelocDesc { u8 width; u8 flags; } RelocDesc;
-
-const RelocDesc* reloc_desc(const Compiler* c, RelocKind k); /* caller holds target arch */
-```
-
-**Ownership / assembly.** `reloc_desc()` resolves neutral-core kinds from a table in
-`src/obj/reloc.c`; arch-family kinds dispatch to `link_arch_desc_for(c)->reloc_desc(k)`
-(a new `LinkArchDesc` hook returning that arch's slice, the same shape as the
-existing `is_*`/`tpoff64_reloc` entries); COFF-family kinds resolve from a COFF slice.
-Adding an arch is one slice in `src/arch/<arch>/reloc.c` — no generic edit.
-
-**Migrate consumers, then delete the generic switches:**
-- `reloc_width()` (`link_reloc_layout.c:256`) → delete; callers read
- `reloc_desc(c,k)->width`. Keep the `RELOC_WIDTH_DYN` sentinel + the ULEB128
- offset-bounds guard (`link_reloc_layout.c:1117-1126`).
-- `reloc_uses_got()`/`reloc_is_tls_got()` (`link_reloc_layout.c:392,380`) → delete;
- the GOT pass reads `reloc_desc(c,k)->flags & RELOC_USES_GOT / RELOC_IS_TLS_GOT`.
-- The four `LinkArchDesc.is_*` hooks (`link_arch.h:79-82`) + their impls in
- `src/arch/{aa64,x64,riscv}/link.c` → delete; the Mach-O linker callers
- (`src/obj/macho/link.c:420,492,566,1483,1496,1505,1514,1563`) read descriptor
- flags. `needs_jit_call_stub` (`link_reloc_layout.c:594,1095`) → `RELOC_IS_BRANCH`
- (it aliases `is_branch_reloc` on every arch today).
-
-End state: **no generic file classifies or sizes relocations by enumerating arch
-kinds, and each fact has exactly one source** — width/flags in the descriptor,
-name on the wire ops.
-
-**Exhaustiveness test (the red-green anchor).** Add `test/obj/reloc_desc` iterating
-**every** `RelocKind` for each enabled arch, asserting `reloc_desc()` returns a row
-(`width != 0` unless `MARKER`/`WIDTH_DYN`). Cross-check that, for every kind the old
-`reloc_width()` covered, the descriptor returns the *same* width (a migration guard).
-This makes "forgot a row" a failing test instead of a silent default. Write it red
-first.
-
-**Oracle.** The exhaustiveness/migration test, then `make test-link test-elf
-test-macho test-ar test-smoke-x64 test-smoke-rv64`, then `make bootstrap`
-(macOS/aa64 bootstrap drives the Mach-O GOT/TLVP/branch classifiers that the `is_*`
-deletion touches; byte-identity catches any width drift).
-
----
-
-## WS-C — Partition the byte-patcher per-arch behind the single entry (addresses **D**) — *LANDED*
-
-**Status (landed).** The instruction-immediate byte encoders moved into each
-backend as `*_reloc_apply_insn` (`src/arch/{aa64,x64,riscv}/reloc.c`), reached
-through a new `LinkArchDesc.reloc_apply_insn` hook (`src/link/link_arch.h`,
-wired in each arch's `link.c`). The format-neutral data-word arms (R_ABS/REL/PC/
-TPOFF writes, x64 GOT/dynamic slots, the RISC-V data ADD/SUB/SET arithmetic, and
-the ULEB128 codec) stay in obj-core as `reloc_apply_neutral()`
-(`src/obj/reloc_apply.c`), which has no link/arch dependency. The single public
-entry `link_reloc_apply()` moved to `src/link/link_reloc_apply.c` (neutral-then-
-arch dispatch) — *not* obj-core, because resolving the per-arch slice needs
-`link_arch_desc_for()`, the same boundary reason WS-B placed `reloc_desc()` in
-`src/link`. The dispatcher enumerates no kinds (`rg "case R_(AARCH64|X64|RV)_"
-src/link` is empty). x64 owns only `R_X64_PC8`; the wider x64 GOT/PLT/TPOFF data
-slots remained neutral. `R_PLT32` is applied as the RISC-V AUIPC+JALR pair so it
-lives in the rv hook beside `R_RV_CALL` (x64 never emits canonical `R_PLT32` — it
-emits `R_X64_PLT32` via `reloc_from`). Migration guard:
-`test/link/reloc_apply_test.c` (`test-link-reloc-apply`) — frozen pre-WS-C
-patched bytes for every instruction-immediate kind across aa64/x64/rv (50
-checks). The reloc_uleb128 c=NULL path still works (neutral never touches the
-compiler). Full link/elf/macho/ar/asm/isa/opt/coff/smoke matrix + bootstrap pass.
-
-**Problem (original).** `src/obj/reloc_apply.c` lives in the format-neutral obj layer but
-encodes pure ISA knowledge — AArch64 imm19/imm26/ADRP page math, RISC-V U/I/S/B/J
-immediate scatter and the 0x800 HI20 bias, x64 field writes. Adding an arch edits
-this shared file; the encoders belong in the backends, beside that arch's MC emitter
-and (post-WS-B) its `reloc.c` descriptor slice.
-
-**Constraint (must not break).** `link_reloc_apply(c, kind, ...)` stays the **one
-public entry**, called unchanged by all four loaders (`src/asm/asm.c:1296`,
-`src/emu/dl.c:15`, `src/link/link_jit.c`, `src/obj/{elf,macho,coff}/link.c`). The
-"one encoder, three loaders" invariant ([../OBJ.md](../OBJ.md)) is preserved — there
-is still exactly one encoder per kind; it moves to the owning backend.
-
-**Change.**
-1. Keep `link_reloc_apply` in `src/obj/reloc.c` as the dispatcher; it handles the
- **arch-neutral data-word arms inline** (`R_ABS32/64`, `R_REL*/PC*`, `R_TPOFF*`,
- `R_GOT32`, `R_PLT32`, the ULEB128 codec) — plain `wr_uN_le`, no ISA knowledge.
-2. Instruction-embedded kinds dispatch to a new `LinkArchDesc.reloc_apply_insn(c, k,
- P, S, A, P)` hook. Move the AArch64 arms to `src/arch/aa64/reloc.c`, RISC-V to
- `src/arch/riscv/reloc.c`, x64 instruction arms (`R_X64_PC8`) to
- `src/arch/x64/reloc.c`. `c` (hence `target.arch`) is available at every call site
- (verified: all `link_reloc_apply` callers pass a `Compiler*`).
-3. COFF-specific kinds route to a COFF encoder slice.
-
-Each backend's `reloc.c` then owns {desc rows (WS-B), class flags (WS-B), byte
-encoders (WS-C)} for its kinds — one file per arch.
-
-**Oracle.** Highest blast radius; lean on the WS-B exhaustiveness test + the full
-matrix: `make test-link test-elf test-macho test-isa test-asm test-smoke-x64
-test-smoke-rv64 test-aa64-inline`, the JIT/emu reloc paths (`test-cg-api`, a
-`run`/`emu` smoke), then **both** bootstrap chains (`make bootstrap-debug
-bootstrap-release`) — byte-identity over the compiler's own object output is the
-definitive proof no encoding shifted. Do this last, one arch at a time (neutral-core
-extraction first, then aa64, x64, rv), keeping old switch arms live until each arch's
-hook is proven, so every step bisects to one arch.
-
----
-
-## WS-E — Residual format gates (addresses **E**) — *all items LANDED*
-
-1. **FreeBSD static-IFUNC mechanism (#18).** **Done** — now
- `obj_format_static_ifunc_via_rela_iplt(c)` (`src/obj/obj_secnames.c:371`).
-2. **IRELATIVE wire type via hardcoded `KIT_OBJ_ELF`.** **Done.** The generic
- `link_elf_irelative_type` is deleted; the iplt pass now calls
- `obj_format_static_ifunc_irelative_type(l->c)` (sibling of the WS-E.1 predicate in
- `src/obj/obj_secnames.c`), which resolves the resolver reloc through the *target*
- object format (`c->target.obj`) rather than the literal `KIT_OBJ_ELF`. The generic
- link pass names no format constant.
-3. **`reloc_name` dispatch gate (#24 residual).** **Done.** `kit_obj_reloc_kind_name`
- (`src/api/object_file.c`) now guards only `if (fmt != KIT_OBJ_ELF) return NULL;` —
- the `arch != KIT_ARCH_X86_64` axis is gone, so aarch64/riscv ELF relocs print via
- their `ObjElfArchOps.reloc_name` tables (matching binutils objdump:
- `R_AARCH64_CALL26`, `R_RISCV_CALL`). One golden refreshed
- (`test/objdump/rv64/cases/03-reloc-annotations`: `RV_CALL` → `R_RISCV_CALL` in the
- `-r` records; the `-d` disasm annotation keeps the arch-neutral `[RV_CALL]`, which
- comes from the disassembler's `reloc_kind_name`, a separate path). Mach-O/COFF have
- no `reloc_name` table yet, so they still fall back to the neutral spelling.
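
The gate in item 3 can be pictured with a small sketch. All names here (`Name*`, `demo_*`, `NK_CALL`, the neutral spelling `"CALL"`, and the x64 row) are hypothetical stand-ins; only the format-only guard and the two binutils spellings `R_AARCH64_CALL26` and `R_RISCV_CALL` come from the text above.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { NFMT_ELF, NFMT_MACHO, NFMT_COFF } NameFmt;
typedef enum { NARCH_X64, NARCH_AA64, NARCH_RV64, NARCH_COUNT } NameArch;
typedef enum { NK_CALL, NK_COUNT } NameKind;

typedef const char *(*NameFn)(NameKind);

/* Per-arch ELF name tables, reduced to one kind each for illustration. */
static const char *name_x64(NameKind k)  { return k == NK_CALL ? "R_X86_64_PLT32"   : NULL; }
static const char *name_aa64(NameKind k) { return k == NK_CALL ? "R_AARCH64_CALL26" : NULL; }
static const char *name_rv64(NameKind k) { return k == NK_CALL ? "R_RISCV_CALL"     : NULL; }

static const NameFn demo_elf_names[NARCH_COUNT] = {
    [NARCH_X64]  = name_x64,
    [NARCH_AA64] = name_aa64,
    [NARCH_RV64] = name_rv64,
};

/* Format is the only gate; every ELF arch goes through its own table. */
static const char *demo_reloc_kind_name(NameFmt fmt, NameArch arch, NameKind k) {
    if (fmt != NFMT_ELF) return NULL;
    assert((unsigned)arch < NARCH_COUNT);
    return demo_elf_names[arch] ? demo_elf_names[arch](k) : NULL;
}

/* Hypothetical arch-neutral spelling used when no format table exists. */
static const char *demo_neutral_name(NameKind k) {
    return k == NK_CALL ? "CALL" : "?";
}

/* What a printer would show: the format-specific name, else the neutral one. */
static const char *demo_display_name(NameFmt fmt, NameArch arch, NameKind k) {
    const char *n = demo_reloc_kind_name(fmt, arch, k);
    return n ? n : demo_neutral_name(k);
}
```

With this shape, an ELF object on any of the three arches prints its binutils-style name, while a Mach-O or COFF object gets `NULL` from the lookup and falls back to the neutral spelling.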
-
-**Oracle.** `make test-link test-elf test-driver-objdump` — all pass (item 3's golden
-churn was the single rv64 reloc-annotations case, purely the reloc spelling). Item 2's
-FreeBSD static-IFUNC path is unexercised on the macOS host, but the change is a
-behaviour-preserving refactor (same per-arch `r_irelative`, resolved format == ELF
-wherever `use_rela_iplt` is true); deeper coverage is the FreeBSD VM lane
-(`scripts/freebsd_vm.sh` / `test-toy-freebsd-vm`, see [FREEBSD.md](FREEBSD.md)).
-
----
-
-## Sequencing & risk
-
-1. **WS-B** — the central remaining change: the `RelocDesc {width, flags}` table +
- exhaustiveness test, deleting both generic switches and the duplicating `is_*`
- hooks. This is now the highest-value open item (the identity switches are already
- gone). Fold WS-A's value-class collapse in here since it touches the same enum/arms.
-2. **WS-C** — **DONE.** Encoder partition behind the single entry, gated by the new
- `test/link/reloc_apply_test.c` frozen-bytes guard + bootstrap byte-identity.
-3. **WS-E.2 / WS-E.3** — **DONE.** (WS-E.3's binutils-spelling switch also required
- refreshing `test/smoke/rv64_tls_link.sh`'s reloc grep — `RV_TPREL_HI20` →
- `R_RISCV_TPREL_HI20` — a stale expectation it had missed.)
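
As an illustration of item 1, a `{width, flags}` descriptor table with an exhaustiveness check might look like the sketch below. The `Demo*` / `DK_*` / `DEMO_F_*` names and the row values are hypothetical; the real `RelocDesc` layout and flag set are not shown in this plan.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kind enum; DK_COUNT must stay last. */
typedef enum { DK_ABS32, DK_ABS64, DK_GOT32, DK_TLS_GOT32, DK_COUNT } DemoKind;

/* Class flags replace per-kind predicate switches. */
enum { DEMO_F_USES_GOT = 1u << 0, DEMO_F_TLS_GOT = 1u << 1 };

typedef struct { uint8_t width; uint8_t flags; } DemoRelocDesc;

/* One row per kind: patched width in bytes plus class flags. */
static const DemoRelocDesc demo_desc[DK_COUNT] = {
    [DK_ABS32]     = {4, 0},
    [DK_ABS64]     = {8, 0},
    [DK_GOT32]     = {4, DEMO_F_USES_GOT},
    [DK_TLS_GOT32] = {4, DEMO_F_USES_GOT | DEMO_F_TLS_GOT},
};

/* Consumers read the table instead of switching on the kind. */
static unsigned demo_width(DemoKind k)   { return demo_desc[k].width; }
static int demo_uses_got(DemoKind k)     { return (demo_desc[k].flags & DEMO_F_USES_GOT) != 0; }
static int demo_is_tls_got(DemoKind k)   { return (demo_desc[k].flags & DEMO_F_TLS_GOT) != 0; }

/* Exhaustiveness check: a zero width marks a row nobody filled in.
 * Returns the index of the first missing row, or -1 if every kind has one. */
static int demo_first_missing_row(const DemoRelocDesc *t, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (t[i].width == 0) return (int)i;
    return -1;
}
```

A test in this style fails as soon as a kind is added to the enum without a matching row, which is the red-green property the plan relies on.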
-
-**Risk controls.** Every WS is red-green: WS-B's exhaustiveness + width-migration test
-is written first and fails until each arch's slice is complete. The **bootstrap** is
-the load-bearing oracle — it patches every relocation kind the compiler emits for its
-own source, so a byte-identical stage2/stage3 proves the encoding path is unchanged.
-Per CLAUDE.md, prefer targeted suites during iteration (redirect output to a file);
-reserve `make bootstrap` for end-of-WS gates. Keep old paths live beside new within a
-WS (especially WS-C, per-arch) so any regression bisects to one arch's hook.
-
-## Done criteria
-
-All criteria below are met by WS-B + WS-C, except the optional WS-A enum collapse (still deferred).
-
-- ✓ No file under `src/link/` enumerates `RelocKind` arms: `reloc_width`,
- `reloc_uses_got`, `reloc_is_tls_got`, the `LinkArchDesc.is_*` hooks, **and the
- byte-patcher's instruction arms** are gone; consumers read the per-arch
- `RelocDesc` / call the per-arch `reloc_apply_insn`. (`rg "case R_(AARCH64|X64|RV)_"
- src/link` returns nothing — the WS-C dispatcher is case-free.)
-- ✓ Every relocation static fact has exactly one source: width + class flags in the
- per-arch `RelocDesc` slice, wire encoding + name in `src/obj/<fmt>/reloc_<arch>.c`,
- **and the instruction byte encoder in that arch's `reloc.c` `*_reloc_apply_insn`**.
-- ✓ `link_reloc_apply` remains the single public byte-patcher entry (now in
- `src/link/link_reloc_apply.c`); its instruction-encoding arms live in
- `src/arch/<arch>/reloc.c`, the obj layer keeps only the arch-neutral data-word arms
- (`reloc_apply_neutral`).
-- ✓ Adding a hypothetical new arch's relocation touches only that arch's
- `src/arch/<arch>/reloc.c` (one `RelocDesc` row + one `reloc_apply_insn` arm) and its
- `src/obj/<fmt>/reloc_<arch>.c` — guarded by `test/link/reloc_desc_test.c` (rows) and
- `test/link/reloc_apply_test.c` (bytes); no generic file needs edits.
-- (Optional/low-pri, **still open** — WS-A) the `tpoff64_reloc` field is retired by the
- `R_TPOFF64` collapse. (The `object_file.c` `reloc_name` gate removal + objdump golden
- refresh and `link_elf_irelative_type` already landed under WS-E.)
-- ✓ `make bootstrap-debug` reaches the byte-identical fixed point; the full
- link/elf/macho/coff/isa/asm/opt/smoke matrix passes. (Release bootstrap carries a
- PRE-EXISTING `.Lkit_jt.0` break unrelated to this work — gate on `bootstrap-debug`.)