OPT.md plan - kit

commit dcda57cc574b99b0149dc7d03be22a83e96f4f16
parent 7993dec9c21f1f66f79a0050c6579f5dd11d8ff6
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Sat,  9 May 2026 17:09:25 -0700

OPT.md plan

Diffstat:
A doc/OPT.md  | 437 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

1 file changed, 437 insertions(+), 0 deletions(-)
diff --git a/doc/OPT.md b/doc/OPT.md
@@ -0,0 +1,437 @@
+# OPT — implementation plan
+
+Scope: what it takes to land cfree's optimizer behind the
+"presents-as-CGTarget" contract described in `doc/DESIGN.md` §5.1, §9.
+The producer side is the wrapper plus IR (`src/opt/opt.h`,
+`src/opt/ir.h`); the consumer is the wrapped target CGTarget and (at
+-O2) the lowering pipeline that drives it. The whole `test/cg` corpus
+already serves as the equivalence oracle — every case is built against
+`CGTarget` directly today, so the same case with `opt_cgtarget`
+inserted between `cg-runner` and the AArch64 backend must produce the
+same observable behavior.
+
+Today the headers are real, the implementation is a single panic stub,
+and the pipeline is wired to call into it the moment we drop the stub.
+This plan starts at "wrapper that records nothing, replays directly"
+and builds out to a real intra+IPO pipeline.
+
+---
+
+## 1. What working opt must look like
+
+Two distinct replay engines, sharing one recording front-end.
+
+### 1.1 Level-1 (function-at-a-time, no SSA)
+
+```
+parse → cg → opt_cgtarget {record into Func} → on func_end:
+            optionally rewrite the recorded tape (peephole, constfold)
+            replay each Inst back as the matching CGTarget call →
+            wrapped target → MCEmitter → ObjBuilder
+```
+
+Replay is 1:1: each recorded `Inst` corresponds to one `target->method(...)`
+call. No phis, no register allocation, no IR-level CFG transforms that
+would break the correspondence. This preserves `doc/DESIGN.md` §8's
+"function-at-a-time" streaming guarantee at -O1.
+
+### 1.2 Level-2 (TU-buffered, SSA, IPO, full lowering)
+
+```
+parse → cg → opt_cgtarget {record into Func} → on func_end:
+            store Func; do nothing else.
+        on cgtarget_finalize:
+            for each Func: build_cfg, build_ssa (incl. mem2reg), gvn,
+                copy_prop, dse, ssa_dce, licm, pressure_relief,
+                make_conventional_ssa, ssa_combine, undo_ssa, jump_opt
+            opt_inline + opt_cleanup (bounded iterations, -finline-iters=N)
+            for each Func: machinize, live_info, coalesce, regalloc,
+                combine, dce, opt_emit → wrapped target → MCEmitter
+```
+
+The level-2 path *cannot* preserve the 1:1 record→replay correspondence
+once SSA/regalloc has run — the lowering pipeline is the only path
+back to the wrapped CGTarget. This split is the single biggest
+load-bearing decision in the design.
+
+### 1.3 Equivalence we commit to
+
+For every case in `test/cg/CORPUS.md` Groups A–Q, building with
+`opt_cgtarget_new(c, target, level)` for `level ∈ {1, 2}` must produce
+the same `test_main` exit code as building against the AArch64 target
+directly. DWARF (W path) equivalence is weaker — opt at level 2 may
+collapse line rows or move locations into loclists — but Group P at
+level 0/1 must be byte-identical.
+
+---
+
+## 2. Current state inventory
+
+### 2.1 Headers — real
+
+- `src/opt/opt.h` — `opt_cgtarget_new`, plus forward decls for every
+  pass listed in `doc/DESIGN.md` §9.2/§9.3/§9.4.
+- `src/opt/ir.h` — `Func`/`Block`/`Inst`, `Val`, multi-def slot for
+  calls/CAS/overflow, typed aux structs (`IRCallAux`, `IRGepAux`,
+  `IRPhiAux`, `IRAsmAux`, `IRCasAux`, ...), frame-slot and param
+  tables on `Func`, value table (`val_def_block`/`val_def_inst`/
+  `val_type`).
+- `src/arch/arch.h` — `Reg` is `u32`, wide enough for the unbounded
+  virtuals the wrapper hands out; `REG_NONE = 0xffffffffu`. `RegClass`
+  has room (`u8 cls` on `Operand`).
+
+### 2.2 Implementations — stubs
+
+- `src/api/stubs.c:84-88` — `opt_cgtarget_new` panics with
+  "subsystem not implemented: opt".
+- `src/api/pipeline.c:227-229` — pipeline already wraps when
+  `opts->opt_level > 0`. Drop the stub and the path goes live.
+- `src/emu/emu.c:180` — emu also opts in to the wrapper at
+  `opt_level > 0`. Same drop-in point.
+- `src/arch/aarch64.c:3009-3011` — `cgtarget_finalize` is a no-op for
+  the bare target; the wrapper will install its own `finalize` that
+  runs IPO + lowering at level 2.
+
+### 2.3 Producer-side wiring — already in place
+
+- `CGTarget` carries the full vtable the wrapper needs to intercept
+  (function lifecycle, frame slots, params, control flow, data
+  movement, arith/cmp/convert, calls/return, alloca, varargs,
+  setjmp/longjmp, atomics, intrinsics, asm_block, set_loc, finalize).
+  Nothing in the vtable assumes a direct backend.
+- `CGTarget.set_loc` is sticky; `Inst.loc` already exists in the IR.
+  The wrapper stamps each recorded `Inst` with the most recent
+  `set_loc` value, and `opt_emit` plays it back via
+  `target->set_loc(target, inst.loc)` before each emit-side call. No
+  new plumbing needed.
+- `Type*` interning is global (`Pool global`), so SSA value types and
+  call signatures can be compared by pointer identity throughout the
+  IR.
+
+### 2.4 Test surface
+
+- `test/cg/CORPUS.md` line 23: "`O` (opt-wrapped) lands once
+  `opt_cgtarget` is implemented." This is the equivalence oracle.
+- `test/cg/harness/cg_runner.c` constructs a `CGTarget*` via
+  `cgtarget_new`. Adding an `--opt-level N` flag that wraps it with
+  `opt_cgtarget_new` is the smallest change to flip every case onto
+  the new path.
+- Group R (deferred) — opt-wrapped equivalence — is the dedicated
+  pass-by-pass regression suite, but in practice we'd run all of A–Q
+  through `--opt-level 1` continuously from Phase 0 onward.
+
+---
+
+## 3. Phased plan
+
+Each phase ends at a green test surface. Phases 0–3 are reversible —
+strip the wrapper back to identity replay and the system still works.
+Phase 4 is the wedge: once SSA/regalloc is in the pipeline, you need
+the full lowering path to get bytes back.
+
+### Phase 0 — wrapper skeleton + equivalence harness
+
+Goal: replace the stub with a real `opt_cgtarget` that does *nothing
+but forward*, and stand up the test harness that proves it. Without
+the harness, "Phase 0 is correct" has no signal — so they land
+together.
+
+Wrapper:
+
+- New `src/opt/opt.c`: `opt_cgtarget_new` returns a CGTarget whose
+  every vtable slot is a one-line `target->method(target, ...)`. No
+  IR built.
+- `finalize` calls `target->finalize(target)`, then the wrapper's own
+  cleanup (none yet).
+- `destroy` cascades: free wrapper state, then `cgtarget_free(target)`.
+- Pipeline already wires this at `src/api/pipeline.c:227`. No driver
+  changes.
+
+Harness:
+
+- Add `--opt-level N` flag to `test/cg/harness/cg_runner.c`. When
+  `N > 0`, wrap the constructed `CGTarget*` with
+  `opt_cgtarget_new(c, target, N)` before running the case.
+- `test/cg/run.sh`: each case in Groups A–Q runs at `--opt-level 0`
+  and `--opt-level 1`; exit codes must match. Add `--opt-level 2`
+  once Phase 4 lands. Group P (DWARF/W path) stays at level 0 only
+  for now — opt-level equivalence on DWARF is a Phase 5+ concern.
+
+Tests: full Groups A–Q D/R/E/J pass at `--opt-level 0` and
+`--opt-level 1`. Any divergence means the forwarding lost data — fix
+it before continuing.
+
+Exit criterion: dual-level corpus is green and stays that way.
+
+### Phase 1 — recording fidelity
+
+Goal: actually build `Func`/`Block`/`Inst` while still emitting
+correct code. Replay 1:1 onto the wrapped target on `func_end`.
+
+- `src/opt/ir.c`: `ir_func_new`, `ir_block_new`, `ir_emit`,
+  `ir_emit_multi`, `ir_emit_const_i`, `ir_emit_const_bytes`,
+  `ir_frame_slot_new`, `ir_param_add`, `ir_set_terminator`. Arena is
+  per-`Func` from `compiler_arena_new` against `Compiler.tu`.
+- `src/opt/opt.c`: each CGTarget method records into the current
+  `Func`'s current block. `Reg` returned from `alloc_reg` is the same
+  integer as the `Val` it defines (`Val == Reg`, modulo `VAL_NONE = 0`
+  / `REG_NONE = 0xffffffff` sentinels). Operand `OPK_REG` carries
+  that same integer; replay reads it back and reissues.
+- `clobbers`/`spill_reg`/`reload_reg`/`free_reg` — under unbounded
+  virtuals these shouldn't be called. `free_reg` is documented as a
+  hint and ignored. The other three indicate CG misuse from the opt
+  side; panic loudly.
+- `src/opt/replay.c`: walks a `Func` linearly, dispatches each
+  `Inst.op` back to the matching `target->method(...)` call. One
+  giant switch keyed on `IROp`. Multi-result ops (`IR_CALL`,
+  `IR_ATOMIC_CAS`, `IR_INTRINSIC` for `*_OVERFLOW`) read from
+  `defs[0..ndefs)`.
+- `set_loc`: wrapper updates a `pending_loc` field; every subsequent
+  `ir_emit` stamps `Inst.loc = pending_loc`. Replay calls
+  `target->set_loc(target, inst.loc)` before each emit-side call.
+- `finalize` at level 1: replay each buffered Func, then
+  `target->finalize(target)`.
+
+Tests: same corpus, `--opt-level 1`. The IR shape gets stress-tested
+on the tricky cases — Group F bitfields, Group G calls (sret split,
+byval, indirect), Group I alloca, Group J va_*, Group K atomics, Group
+L intrinsics (especially multi-result overflow), Group M (deferred:
+asm_block). Anything that loses data in record→replay shows up here.
+
+Exit criterion: `--opt-level 1` corpus is green and IR-allocated
+arenas are reaped on panic via `compiler_defer`.
+
+### Phase 2 — IR pretty-printer + level-1 peepholes
+
+Goal: be able to look at IR; introduce trivial pre-SSA rewrites.
+
+- `src/opt/ir_print.c`: `ir_func_print(Writer*, const Func*)` for
+  ad-hoc debugging and Group R diff oracles. Format unspecified
+  beyond "stable enough for golden-file diffs".
+- `cg-runner --opt-level 1 --dump-ir NAME` prints the recorded `Func`
+  before replay.
+- A handful of safe linear-tape rewrites: fold `IR_CONST_I` +
+  `IR_IADD/ISUB/IMUL` of two constants into a fresh `IR_CONST_I`;
+  collapse trivially dead stores immediately overwritten in the same
+  block. Each rewrite preserves the recorded→replayable contract.
+
+Tests: corpus stays green. Add hand-picked cases (or new ones) where
+the rewrite fires and the resulting code is observably smaller via
+disassembly diff against `--opt-level 0`.
+
+### Phase 3 — SSA construction, dry-run
+
+Goal: build SSA without consuming it. Catches IR-shape bugs before
+they matter.
+
+- `src/opt/pass_cfg.c`: `opt_build_cfg` — derives
+  `Block.preds`/`succ`/`nsucc` from terminators (`IR_BR`, `IR_CONDBR`,
+  `IR_RET`, `IR_LONGJMP`, `IR_INTRINSIC{TRAP,UNREACHABLE}`).
+  `IR_SETJMP` is a control barrier — splits its block but is not a
+  terminator (control falls through).
+- `src/opt/pass_ssa.c`: `opt_build_ssa` — standard dominance-frontier
+  algorithm; promotes any `FrameSlot` whose `FSF_ADDR_TAKEN` bit was
+  never set (mem2reg folded in per `doc/DESIGN.md` §12). Inserts
+  `IR_PHI` instructions with `IRPhiAux` populated.
+- At level 2 these run on `func_end` but their output is *discarded*
+  before replay. Goal is "no panics on the corpus", not "improved
+  code".
+
+Tests: at level 2, recorder runs `build_cfg` + `build_ssa` then
+forces level-1 replay. Corpus stays green; any panic is a real bug
+(unhandled IROp, malformed CFG from goto/switch lowering, address-
+taken-detection miss).
+
+### Phase 4 — lowering pipeline (the wedge)
+
+Goal: real level-2 path: SSA → machinize → regalloc → emit. From
+this point, level-2 cannot fall back to level-1 replay.
+
+- `src/opt/pass_machinize.c`: `opt_machinize(Func*, Target)` — ABI
+  lowering (calls into `TargetABI` for argument/return part
+  classification just like CG does today; the result lives in
+  `IRCallAux.abi` already), 2-op forms, calling-convention spill,
+  prolog/epilog placeholders.
+- `src/opt/pass_live.c`: `opt_live_info` — standard backwards dataflow.
+- `src/opt/pass_regalloc.c`: `opt_regalloc` — linear scan, no live-range
+  splitting initially. Allocates physical registers from the same
+  pool the AArch64 backend uses for scratch (i.e. asks the wrapped
+  target for clobbers and the param/sret physical mapping).
+- `src/opt/pass_emit.c`: `opt_emit(Compiler*, Func*, CGTarget*)` —
+  walks the lowered IR and drives the wrapped target via the
+  emit-side CGTarget surface. This is symmetric with `replay.c` from
+  Phase 1: Inst → target call. Difference is that operands are now
+  physical Operands (post-RA) and prolog/epilog/spill insertion has
+  happened.
+- Wrapper's `finalize` at level 2 runs the per-Func intra pipeline,
+  then for each Func runs the lowering pipeline, then calls
+  `target->finalize(target)`.
+
+Initial pass set inside the intra pipeline is *empty* — `build_ssa`
+followed immediately by `undo_ssa`/`make_conventional_ssa`. The
+lowering pipeline does the real work. This isolates the lowering bugs
+from the optimization bugs.
+
+Tests: `--opt-level 2` corpus is green. Equivalence harness extended
+to cover level 2.
+
+Exit criterion: level-2 corpus green, no regressions in the level-1
+or level-0 suites.
+
+### Phase 5 — intra-procedural passes, one at a time
+
+Goal: enable real optimizations. Each pass is independently
+toggleable for bisection.
+
+Order (tracking `doc/DESIGN.md` §9.2):
+
+1. `opt_gvn` (with constprop, redundant-load elim folded in).
+2. `opt_copy_prop` (with redundant-extension elim).
+3. `opt_dse`.
+4. `opt_ssa_dce`.
+5. `opt_jump_opt`.
+6. `opt_build_loop_tree` + `opt_licm`.
+7. `opt_addr_xform`.
+8. `opt_block_cloning`.
+9. `opt_pressure_relief`.
+10. Lowering-time additions: `opt_coalesce`, `opt_combine`,
+    `opt_dce` (post-RA), live-range splitting in `opt_regalloc`.
+
+Each pass lands behind a flag so the equivalence harness can run with
+just-this-pass to localize bugs. No UB-exploiting transformations
+(`doc/DESIGN.md` §9): no signed-overflow-is-unreachable, no
+shift-by-≥-width-is-unreachable, no division-by-zero-is-unreachable,
+no null-deref-is-unreachable.
+
+### Phase 6 — inter-procedural
+
+Goal: cross-function inlining + cleanup iteration.
+
+- `opt_inline`: bottom-up call-graph walk. SCCs (mutual recursion)
+  skipped. Heuristic: instruction count + call site count. Inlining
+  recognizes the `__cfree_setjmp` symbol by name as returns-twice and
+  refuses to inline across.
+- `opt_cleanup`: subset re-run (gvn, copy_prop, ssa_dce, jump_opt,
+  licm if loops, addr_xform if uses remain).
+- `-finline-iters=N` knob (default 1, hard cap enforced inside the
+  wrapper).
+
+Tests: corpus + a small set of cases hand-crafted to require
+inlining-plus-cleanup to optimize (e.g. a small wrapper around a
+constant returner whose body becomes a constant only after inlining).
+
+---
+
+## 4. Module layout
+
+```
+src/opt/
+  opt.h            (already exists)
+  ir.h             (already exists)
+  opt.c            wrapper, vtable bindings, finalize dispatch
+  ir.c             Func/Block/Inst plumbing, val table, arenas
+  ir_print.c       pretty-printer (Phase 2)
+  replay.c         level-1 replay (Phase 1)
+  pass_cfg.c       opt_build_cfg
+  pass_ssa.c       opt_build_ssa, opt_make_conventional_ssa,
+                   opt_ssa_combine, opt_undo_ssa
+  pass_gvn.c       opt_gvn
+  pass_copy.c      opt_copy_prop
+  pass_dse.c       opt_dse
+  pass_dce.c       opt_ssa_dce, opt_dce
+  pass_jump.c      opt_jump_opt
+  pass_loop.c      build_loop_tree + opt_licm
+  pass_addr.c      opt_addr_xform
+  pass_clone.c     opt_block_cloning
+  pass_pressure.c  opt_pressure_relief
+  pass_machinize.c opt_machinize
+  pass_live.c      opt_live_info
+  pass_coalesce.c  opt_coalesce
+  pass_regalloc.c  opt_regalloc
+  pass_combine.c   opt_combine
+  pass_emit.c      opt_emit (lowering replay)
+  pass_inline.c    opt_inline + opt_cleanup
+```
+
+This split lets each phase land as a small set of new files plus
+edits to `opt.c`. No file accumulates more than one pass.
+
+---
+
+## 5. Cross-cutting decisions
+
+### 5.1 Reg ↔ Val identity
+
+Each call to `wrapper->alloc_reg(class, type)` allocates a fresh `Val`
+in the current `Func`'s value table and returns its integer as
+`Reg`. `Operand{kind=OPK_REG, v.reg=R}` is interpreted on replay as
+"the value defined at SSA id R." This collapses two parallel ID
+spaces into one and avoids a side mapping. `REG_NONE` (`0xffffffff`)
+and `VAL_NONE` (`0`) live at opposite ends of the range — neither
+is allocated.
+
+### 5.2 Frame slots stay frame slots until SSA construction
+
+`cg_local`/`cg_param` always allocate frame slots through
+`CGTarget.frame_slot`/`CGTarget.param`. The wrapper records them
+verbatim. Loads/stores against frame slots become `IR_LOAD`/`IR_STORE`
+with `MemAccess.alias = ALIAS_LOCAL`. `build_ssa` (Phase 3) is the
+only place that promotes non-`FSF_ADDR_TAKEN` slots to SSA values.
+Address-taken slots stay in memory and are reasoned about through
+`MemAccess` alias roots (`doc/DESIGN.md` §5.6).
+
+This keeps the recorder dumb and pushes all decisions into one pass.
+
+### 5.3 Method intercepts the wrapper rejects
+
+Under unbounded virtuals, these CG-side mechanics are nonsense:
+
+- `clobbers` — meaningless without a finite physical register set.
+- `spill_reg` / `reload_reg` — CG drives these for the -O0 value
+  stack; opt's wrapper has no value stack.
+
+`free_reg` is already a hint; the wrapper ignores it.
+
+The first three should panic with a clear "called X on opt_cgtarget"
+message. They indicate CG is being driven in -O0 mode while opt is
+attached, which means a wiring bug.
+
+### 5.4 set_loc fan-out
+
+The wrapper's `set_loc` updates a single `pending_loc` field on the
+recorder state. `ir_emit` stamps `Inst.loc = pending_loc`. The
+lowering replay (Phase 4) calls `target->set_loc(target, inst.loc)`
+before each emit-side call. This is exactly the protocol
+`doc/DWARF.md` §3 expects, and it preserves Group P (set_loc/debug)
+behavior at level 2.
+
+### 5.5 No UB-exploiting passes
+
+`doc/DESIGN.md` §9 is binding. opt may not assume signed overflow,
+shift-by-≥-width, division by zero, or null deref are unreachable.
+WASM traps on the first three deterministically; real targets are
+also more predictable this way. The "70% of -O2" goal is achievable
+without these transformations.
+
+---
+
+## 6. Open questions
+
+- **Linear scan vs graph coloring at -O2.** MIR uses fast linear
+  scan. The "70% of -O2" target is achievable with it. Graph coloring
+  is a follow-up if a benchmark gap demands it.
+- **Loop tree at lowering vs after intra passes.** §9.4 builds the
+  loop tree inside the lowering pipeline (used by RA for split
+  decisions). §9.2 also builds one for LICM. Open: do these share an
+  arena-owned structure that survives both, or are they computed
+  twice? Rebuilding is cheap; sharing requires invalidation
+  discipline. Default to rebuilding until it matters.
+- **Stackifier vs Relooper for WASM.** `doc/DESIGN.md` §14 leaves
+  this open. opt's flat-CFG IR makes either viable. Decision deferred
+  to when the WASM backend lands; the IR is structure-agnostic.
+- **Inlining heuristic tuning.** Bench-driven; the knob
+  (`-finline-iters=N`, instruction-count budget) is exposed from
+  Phase 6 onward.
+- **IR snapshot/golden-file format.** Phase 2's pretty-printer needs
+  a stable enough format for diff-based regression tests in Group R.
+  Decide format when Phase 2 lands.

	kit kit
	git clone https://git.ryansepassi.com/git/kit.git
	Log \| Files \| Refs \| README