kit

kit
git clone https://git.ryansepassi.com/git/kit.git
Log | Files | Refs | README

commit dcda57cc574b99b0149dc7d03be22a83e96f4f16
parent 7993dec9c21f1f66f79a0050c6579f5dd11d8ff6
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Sat,  9 May 2026 17:09:25 -0700

OPT.md plan

Diffstat:
Adoc/OPT.md | 437+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 437 insertions(+), 0 deletions(-)

diff --git a/doc/OPT.md b/doc/OPT.md @@ -0,0 +1,437 @@ +# OPT — implementation plan + +Scope: what it takes to land cfree's optimizer behind the +"presents-as-CGTarget" contract described in `doc/DESIGN.md` §5.1, §9. +The producer side is the wrapper plus IR (`src/opt/opt.h`, +`src/opt/ir.h`); the consumer is the wrapped target CGTarget and (at +-O2) the lowering pipeline that drives it. The whole `test/cg` corpus +already serves as the equivalence oracle — every case is built against +`CGTarget` directly today, so the same case with `opt_cgtarget` +inserted between `cg-runner` and the AArch64 backend must produce the +same observable behavior. + +Today the headers are real, the implementation is a single panic stub, +and the pipeline is wired to call into it the moment we drop the stub. +This plan starts at "wrapper that records nothing, replays directly" +and builds out to a real intra+IPO pipeline. + +--- + +## 1. What working opt must look like + +Two distinct replay engines, sharing one recording front-end. + +### 1.1 Level-1 (function-at-a-time, no SSA) + +``` +parse → cg → opt_cgtarget {record into Func} → on func_end: + optionally rewrite the recorded tape (peephole, constfold) + replay each Inst back as the matching CGTarget call → + wrapped target → MCEmitter → ObjBuilder +``` + +Replay is 1:1: each recorded `Inst` corresponds to one `target->method(...)` +call. No phis, no register allocation, no IR-level CFG transforms that +would break the correspondence. This preserves `doc/DESIGN.md` §8's +"function-at-a-time" streaming guarantee at -O1. + +### 1.2 Level-2 (TU-buffered, SSA, IPO, full lowering) + +``` +parse → cg → opt_cgtarget {record into Func} → on func_end: + store Func; do nothing else. + on cgtarget_finalize: + for each Func: build_cfg, build_ssa (incl. mem2reg), gvn, + copy_prop, dse, ssa_dce, licm, pressure_relief, + make_conventional_ssa, ssa_combine, undo_ssa, jump_opt + opt_inline + opt_cleanup (bounded iterations, -finline-iters=N) + for each Func: machinize, live_info, coalesce, regalloc, + combine, dce, opt_emit → wrapped target → MCEmitter +``` + +The level-2 path *cannot* preserve the 1:1 record→replay correspondence +once SSA/regalloc has run — the lowering pipeline is the only path +back to the wrapped CGTarget. This split is the single biggest +load-bearing decision in the design. + +### 1.3 Equivalence we commit to + +For every case in `test/cg/CORPUS.md` Groups A–Q, building with +`opt_cgtarget_new(c, target, level)` for `level ∈ {1, 2}` must produce +the same `test_main` exit code as building against the AArch64 target +directly. DWARF (W path) equivalence is weaker — opt at level 2 may +collapse line rows or move locations into loclists — but Group P at +level 0/1 must be byte-identical. + +--- + +## 2. Current state inventory + +### 2.1 Headers — real + +- `src/opt/opt.h` — `opt_cgtarget_new`, plus forward decls for every + pass listed in `doc/DESIGN.md` §9.2/§9.3/§9.4. +- `src/opt/ir.h` — `Func`/`Block`/`Inst`, `Val`, multi-def slot for + calls/CAS/overflow, typed aux structs (`IRCallAux`, `IRGepAux`, + `IRPhiAux`, `IRAsmAux`, `IRCasAux`, ...), frame-slot and param + tables on `Func`, value table (`val_def_block`/`val_def_inst`/ + `val_type`). +- `src/arch/arch.h` — `Reg` is `u32`, wide enough for the unbounded + virtuals the wrapper hands out; `REG_NONE = 0xffffffffu`. `RegClass` + has room (`u8 cls` on `Operand`). + +### 2.2 Implementations — stubs + +- `src/api/stubs.c:84-88` — `opt_cgtarget_new` panics with + "subsystem not implemented: opt". +- `src/api/pipeline.c:227-229` — pipeline already wraps when + `opts->opt_level > 0`. Drop the stub and the path goes live. +- `src/emu/emu.c:180` — emu also opts in to the wrapper at + `opt_level > 0`. Same drop-in point. +- `src/arch/aarch64.c:3009-3011` — `cgtarget_finalize` is a no-op for + the bare target; the wrapper will install its own `finalize` that + runs IPO + lowering at level 2. + +### 2.3 Producer-side wiring — already in place + +- `CGTarget` carries the full vtable the wrapper needs to intercept + (function lifecycle, frame slots, params, control flow, data + movement, arith/cmp/convert, calls/return, alloca, varargs, + setjmp/longjmp, atomics, intrinsics, asm_block, set_loc, finalize). + Nothing in the vtable assumes a direct backend. +- `CGTarget.set_loc` is sticky; `Inst.loc` already exists in the IR. + The wrapper stamps each recorded `Inst` with the most recent + `set_loc` value, and `opt_emit` plays it back via + `target->set_loc(target, inst.loc)` before each emit-side call. No + new plumbing needed. +- `Type*` interning is global (`Pool global`), so SSA value types and + call signatures can be compared by pointer identity throughout the + IR. + +### 2.4 Test surface + +- `test/cg/CORPUS.md` line 23: "`O` (opt-wrapped) lands once + `opt_cgtarget` is implemented." This is the equivalence oracle. +- `test/cg/harness/cg_runner.c` constructs a `CGTarget*` via + `cgtarget_new`. Adding an `--opt-level N` flag that wraps it with + `opt_cgtarget_new` is the smallest change to flip every case onto + the new path. +- Group R (deferred) — opt-wrapped equivalence — is the dedicated + pass-by-pass regression suite, but in practice we'd run all of A–Q + through `--opt-level 1` continuously from Phase 0 onward. + +--- + +## 3. Phased plan + +Each phase ends at a green test surface. Phases 0–3 are reversible — +strip the wrapper back to identity replay and the system still works. +Phase 4 is the wedge: once SSA/regalloc is in the pipeline, you need +the full lowering path to get bytes back. + +### Phase 0 — wrapper skeleton + equivalence harness + +Goal: replace the stub with a real `opt_cgtarget` that does *nothing +but forward*, and stand up the test harness that proves it. Without +the harness, "Phase 0 is correct" has no signal — so they land +together. + +Wrapper: + +- New `src/opt/opt.c`: `opt_cgtarget_new` returns a CGTarget whose + every vtable slot is a one-line `target->method(target, ...)`. No + IR built. +- `finalize` calls `target->finalize(target)`, then the wrapper's own + cleanup (none yet). +- `destroy` cascades: free wrapper state, then `cgtarget_free(target)`. +- Pipeline already wires this at `src/api/pipeline.c:227`. No driver + changes. + +Harness: + +- Add `--opt-level N` flag to `test/cg/harness/cg_runner.c`. When + `N > 0`, wrap the constructed `CGTarget*` with + `opt_cgtarget_new(c, target, N)` before running the case. +- `test/cg/run.sh`: each case in Groups A–Q runs at `--opt-level 0` + and `--opt-level 1`; exit codes must match. Add `--opt-level 2` + once Phase 4 lands. Group P (DWARF/W path) stays at level 0 only + for now — opt-level equivalence on DWARF is a Phase 5+ concern. + +Tests: full Groups A–Q D/R/E/J pass at `--opt-level 0` and +`--opt-level 1`. Any divergence means the forwarding lost data — fix +it before continuing. + +Exit criterion: dual-level corpus is green and stays that way. + +### Phase 1 — recording fidelity + +Goal: actually build `Func`/`Block`/`Inst` while still emitting +correct code. Replay 1:1 onto the wrapped target on `func_end`. + +- `src/opt/ir.c`: `ir_func_new`, `ir_block_new`, `ir_emit`, + `ir_emit_multi`, `ir_emit_const_i`, `ir_emit_const_bytes`, + `ir_frame_slot_new`, `ir_param_add`, `ir_set_terminator`. Arena is + per-`Func` from `compiler_arena_new` against `Compiler.tu`. +- `src/opt/opt.c`: each CGTarget method records into the current + `Func`'s current block. `Reg` returned from `alloc_reg` is the same + integer as the `Val` it defines (`Val == Reg`, modulo `VAL_NONE = 0` + / `REG_NONE = 0xffffffff` sentinels). Operand `OPK_REG` carries + that same integer; replay reads it back and reissues. +- `clobbers`/`spill_reg`/`reload_reg`/`free_reg` — under unbounded + virtuals these shouldn't be called. `free_reg` is documented as a + hint and ignored. The other three indicate CG misuse from the opt + side; panic loudly. +- `src/opt/replay.c`: walks a `Func` linearly, dispatches each + `Inst.op` back to the matching `target->method(...)` call. One + giant switch keyed on `IROp`. Multi-result ops (`IR_CALL`, + `IR_ATOMIC_CAS`, `IR_INTRINSIC` for `*_OVERFLOW`) read from + `defs[0..ndefs)`. +- `set_loc`: wrapper updates a `pending_loc` field; every subsequent + `ir_emit` stamps `Inst.loc = pending_loc`. Replay calls + `target->set_loc(target, inst.loc)` before each emit-side call. +- `finalize` at level 1: replay each buffered Func, then + `target->finalize(target)`. + +Tests: same corpus, `--opt-level 1`. The IR shape gets stress-tested +on the tricky cases — Group F bitfields, Group G calls (sret split, +byval, indirect), Group I alloca, Group J va_*, Group K atomics, Group +L intrinsics (especially multi-result overflow), Group M (deferred: +asm_block). Anything that loses data in record→replay shows up here. + +Exit criterion: `--opt-level 1` corpus is green and IR-allocated +arenas are reaped on panic via `compiler_defer`. + +### Phase 2 — IR pretty-printer + level-1 peepholes + +Goal: be able to look at IR; introduce trivial pre-SSA rewrites. + +- `src/opt/ir_print.c`: `ir_func_print(Writer*, const Func*)` for + ad-hoc debugging and Group R diff oracles. Format unspecified + beyond "stable enough for golden-file diffs". +- `cg-runner --opt-level 1 --dump-ir NAME` prints the recorded `Func` + before replay. +- A handful of safe linear-tape rewrites: fold `IR_CONST_I` + + `IR_IADD/ISUB/IMUL` of two constants into a fresh `IR_CONST_I`; + collapse trivially dead stores immediately overwritten in the same + block. Each rewrite preserves the recorded→replayable contract. + +Tests: corpus stays green. Add hand-picked cases (or new ones) where +the rewrite fires and the resulting code is observably smaller via +disassembly diff against `--opt-level 0`. + +### Phase 3 — SSA construction, dry-run + +Goal: build SSA without consuming it. Catches IR-shape bugs before +they matter. + +- `src/opt/pass_cfg.c`: `opt_build_cfg` — derives + `Block.preds`/`succ`/`nsucc` from terminators (`IR_BR`, `IR_CONDBR`, + `IR_RET`, `IR_LONGJMP`, `IR_INTRINSIC{TRAP,UNREACHABLE}`). + `IR_SETJMP` is a control barrier — splits its block but is not a + terminator (control falls through). +- `src/opt/pass_ssa.c`: `opt_build_ssa` — standard dominance-frontier + algorithm; promotes any `FrameSlot` whose `FSF_ADDR_TAKEN` bit was + never set (mem2reg folded in per `doc/DESIGN.md` §12). Inserts + `IR_PHI` instructions with `IRPhiAux` populated. +- At level 2 these run on `func_end` but their output is *discarded* + before replay. Goal is "no panics on the corpus", not "improved + code". + +Tests: at level 2, recorder runs `build_cfg` + `build_ssa` then +forces level-1 replay. Corpus stays green; any panic is a real bug +(unhandled IROp, malformed CFG from goto/switch lowering, address- +taken-detection miss). + +### Phase 4 — lowering pipeline (the wedge) + +Goal: real level-2 path: SSA → machinize → regalloc → emit. From +this point, level-2 cannot fall back to level-1 replay. + +- `src/opt/pass_machinize.c`: `opt_machinize(Func*, Target)` — ABI + lowering (calls into `TargetABI` for argument/return part + classification just like CG does today; the result lives in + `IRCallAux.abi` already), 2-op forms, calling-convention spill, + prolog/epilog placeholders. +- `src/opt/pass_live.c`: `opt_live_info` — standard backwards dataflow. +- `src/opt/pass_regalloc.c`: `opt_regalloc` — linear scan, no live-range + splitting initially. Allocates physical registers from the same + pool the AArch64 backend uses for scratch (i.e. asks the wrapped + target for clobbers and the param/sret physical mapping). +- `src/opt/pass_emit.c`: `opt_emit(Compiler*, Func*, CGTarget*)` — + walks the lowered IR and drives the wrapped target via the + emit-side CGTarget surface. This is symmetric with `replay.c` from + Phase 1: Inst → target call. Difference is that operands are now + physical Operands (post-RA) and prolog/epilog/spill insertion has + happened. +- Wrapper's `finalize` at level 2 runs the per-Func intra pipeline, + then for each Func runs the lowering pipeline, then calls + `target->finalize(target)`. + +Initial pass set inside the intra pipeline is *empty* — `build_ssa` +followed immediately by `undo_ssa`/`make_conventional_ssa`. The +lowering pipeline does the real work. This isolates the lowering bugs +from the optimization bugs. + +Tests: `--opt-level 2` corpus is green. Equivalence harness extended +to cover level 2. + +Exit criterion: level-2 corpus green, no regressions in the level-1 +or level-0 suites. + +### Phase 5 — intra-procedural passes, one at a time + +Goal: enable real optimizations. Each pass is independently +toggleable for bisection. + +Order (tracking `doc/DESIGN.md` §9.2): + +1. `opt_gvn` (with constprop, redundant-load elim folded in). +2. `opt_copy_prop` (with redundant-extension elim). +3. `opt_dse`. +4. `opt_ssa_dce`. +5. `opt_jump_opt`. +6. `opt_build_loop_tree` + `opt_licm`. +7. `opt_addr_xform`. +8. `opt_block_cloning`. +9. `opt_pressure_relief`. +10. Lowering-time additions: `opt_coalesce`, `opt_combine`, + `opt_dce` (post-RA), live-range splitting in `opt_regalloc`. + +Each pass lands behind a flag so the equivalence harness can run with +just-this-pass to localize bugs. No UB-exploiting transformations +(`doc/DESIGN.md` §9): no signed-overflow-is-unreachable, no +shift-by-≥-width-is-unreachable, no division-by-zero-is-unreachable, +no null-deref-is-unreachable. + +### Phase 6 — inter-procedural + +Goal: cross-function inlining + cleanup iteration. + +- `opt_inline`: bottom-up call-graph walk. SCCs (mutual recursion) + skipped. Heuristic: instruction count + call site count. Inlining + recognizes the `__cfree_setjmp` symbol by name as returns-twice and + refuses to inline across. +- `opt_cleanup`: subset re-run (gvn, copy_prop, ssa_dce, jump_opt, + licm if loops, addr_xform if uses remain). +- `-finline-iters=N` knob (default 1, hard cap enforced inside the + wrapper). + +Tests: corpus + a small set of cases hand-crafted to require +inlining-plus-cleanup to optimize (e.g. a small wrapper around a +constant returner whose body becomes a constant only after inlining). + +--- + +## 4. Module layout + +``` +src/opt/ + opt.h (already exists) + ir.h (already exists) + opt.c wrapper, vtable bindings, finalize dispatch + ir.c Func/Block/Inst plumbing, val table, arenas + ir_print.c pretty-printer (Phase 2) + replay.c level-1 replay (Phase 1) + pass_cfg.c opt_build_cfg + pass_ssa.c opt_build_ssa, opt_make_conventional_ssa, + opt_ssa_combine, opt_undo_ssa + pass_gvn.c opt_gvn + pass_copy.c opt_copy_prop + pass_dse.c opt_dse + pass_dce.c opt_ssa_dce, opt_dce + pass_jump.c opt_jump_opt + pass_loop.c build_loop_tree + opt_licm + pass_addr.c opt_addr_xform + pass_clone.c opt_block_cloning + pass_pressure.c opt_pressure_relief + pass_machinize.c opt_machinize + pass_live.c opt_live_info + pass_coalesce.c opt_coalesce + pass_regalloc.c opt_regalloc + pass_combine.c opt_combine + pass_emit.c opt_emit (lowering replay) + pass_inline.c opt_inline + opt_cleanup +``` + +This split lets each phase land as a small set of new files plus +edits to `opt.c`. No file accumulates more than one pass. + +--- + +## 5. Cross-cutting decisions + +### 5.1 Reg ↔ Val identity + +Each call to `wrapper->alloc_reg(class, type)` allocates a fresh `Val` +in the current `Func`'s value table and returns its integer as +`Reg`. `Operand{kind=OPK_REG, v.reg=R}` is interpreted on replay as +"the value defined at SSA id R." This collapses two parallel ID +spaces into one and avoids a side mapping. `REG_NONE` (`0xffffffff`) +and `VAL_NONE` (`0`) live at opposite ends of the range — neither +is allocated. + +### 5.2 Frame slots stay frame slots until SSA construction + +`cg_local`/`cg_param` always allocate frame slots through +`CGTarget.frame_slot`/`CGTarget.param`. The wrapper records them +verbatim. Loads/stores against frame slots become `IR_LOAD`/`IR_STORE` +with `MemAccess.alias = ALIAS_LOCAL`. `build_ssa` (Phase 3) is the +only place that promotes non-`FSF_ADDR_TAKEN` slots to SSA values. +Address-taken slots stay in memory and are reasoned about through +`MemAccess` alias roots (`doc/DESIGN.md` §5.6). + +This keeps the recorder dumb and pushes all decisions into one pass. + +### 5.3 Method intercepts the wrapper rejects + +Under unbounded virtuals, these CG-side mechanics are nonsense: + +- `clobbers` — meaningless without a finite physical register set. +- `spill_reg` / `reload_reg` — CG drives these for the -O0 value + stack; opt's wrapper has no value stack. + +`free_reg` is already a hint; the wrapper ignores it. + +The first three should panic with a clear "called X on opt_cgtarget" +message. They indicate CG is being driven in -O0 mode while opt is +attached, which means a wiring bug. + +### 5.4 set_loc fan-out + +The wrapper's `set_loc` updates a single `pending_loc` field on the +recorder state. `ir_emit` stamps `Inst.loc = pending_loc`. The +lowering replay (Phase 4) calls `target->set_loc(target, inst.loc)` +before each emit-side call. This is exactly the protocol +`doc/DWARF.md` §3 expects, and it preserves Group P (set_loc/debug) +behavior at level 2. + +### 5.5 No UB-exploiting passes + +`doc/DESIGN.md` §9 is binding. opt may not assume signed overflow, +shift-by-≥-width, division by zero, or null deref are unreachable. +WASM traps on the first three deterministically; real targets are +also more predictable this way. The "70% of -O2" goal is achievable +without these transformations. + +--- + +## 6. Open questions + +- **Linear scan vs graph coloring at -O2.** MIR uses fast linear + scan. The "70% of -O2" target is achievable with it. Graph coloring + is a follow-up if a benchmark gap demands it. +- **Loop tree at lowering vs after intra passes.** §9.4 builds the + loop tree inside the lowering pipeline (used by RA for split + decisions). §9.2 also builds one for LICM. Open: do these share an + arena-owned structure that survives both, or are they computed + twice? Rebuilding is cheap; sharing requires invalidation + discipline. Default to rebuilding until it matters. +- **Stackifier vs Relooper for WASM.** `doc/DESIGN.md` §14 leaves + this open. opt's flat-CFG IR makes either viable. Decision deferred + to when the WASM backend lands; the IR is structure-agnostic. +- **Inlining heuristic tuning.** Bench-driven; the knob + (`-finline-iters=N`, instruction-count budget) is exposed from + Phase 6 onward. +- **IR snapshot/golden-file format.** Phase 2's pretty-printer needs + a stable enough format for diff-based regression tests in Group R. + Decide format when Phase 2 lands.