OPT.md plan update - kit

commit 61ac2c5548bcea594508b8e586154d02ddea4f9e
parent 19a555939630998634e929b779ac915593859f36
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Mon, 11 May 2026 18:24:42 -0700

OPT.md plan update

Diffstat:
M doc/OPT.md  | 1021 ++++++++++++++++++++++++++++++++++++++++++++++---------------------------------

1 file changed, 601 insertions(+), 420 deletions(-)
diff --git a/doc/OPT.md b/doc/OPT.md
@@ -1,468 +1,649 @@
-# OPT — implementation plan
-
-Scope: what it takes to land cfree's optimizer behind the
-"presents-as-CGTarget" contract described in `doc/DESIGN.md` §5.1, §9.
-The producer side is the wrapper plus IR (`src/opt/opt.h`,
-`src/opt/ir.h`); the consumer is the wrapped target CGTarget and (at
--O2) the lowering pipeline that drives it. The whole `test/cg` corpus
-already serves as the equivalence oracle — every case is built against
-`CGTarget` directly today, so the same case with `opt_cgtarget`
-inserted between `cg-runner` and the AArch64 backend must produce the
-same observable behavior.
-
-Today the headers are real, the implementation is a single panic stub,
-and the pipeline is wired to call into it the moment we drop the stub.
-This plan starts at "wrapper that records nothing, replays directly"
-and builds out to a real intra+IPO pipeline.
+# OPT - MIR-informed implementation plan
+
+Scope: deliver cfree's optimized backend behind the `CGTarget` wrapper
+contract described in `doc/DESIGN.md` Section 9, using MIR JIT's optimizer as
+the model for phase order, level gating, and performance targets.
+
+This document is based on a source investigation of `~/tmp/mir/`, mainly
+`mir-gen.c`, `mir.c`, `mir.h`, `mir-gen.h`, `MIR.md`, `README.md`, and the
+`c2mir` driver. The important takeaway is that MIR keeps the optimizer short:
+`-O1` is a fast lowering/code-selection path, while the expensive
+SSA/value/memory/loop work is reserved for `-O2`.
 
 ---
 
-## 1. What working opt must look like
+## 1. Target Shape
+
+cfree exposes three levels:
 
-A single shared engine. CG drives `opt_cgtarget`; every CGTarget call
-lands as exactly one `Inst` in a per-function flat-CFG IR (§5.1). On
-`func_end` (intra-procedural) and `finalize` (inter-procedural), the
-wrapper runs an optimization schedule, lowers through machinize →
-regalloc → emit, and drives the wrapped target CGTarget to produce
-machine code.
+- `-O0`: direct backend path. CG drives the target `CGTarget` immediately.
+- `-O1`: MIR-style minimal optimizer. Record IR, lower it through the new
+  backend path, run liveness, simplified register allocation, post-RA combine,
+  and DCE. No SSA value passes and no inlining.
+- `-O2`: MIR-style full optimizer. Add SSA, GVN/constant propagation,
+  redundant load elimination, copy propagation, dead store elimination,
+  SSA DCE, LICM, pressure relief, address transformation, block cloning,
+  coalescing, live-range splitting, and inlining plus cleanup.
+
+The core contract stays simple:
 
 ```
-parse → cg → opt_cgtarget {record into Func} →
-        on func_end: build_cfg → build_ssa → <intra schedule> → <lower>
-        on finalize: <inter schedule> → for each dirty Func: <lower>
-                  → wrapped target CGTarget → MCEmitter → ObjBuilder
-
-  <lower> = make_conventional_ssa → ssa_combine → undo_ssa →
-            machinize → live_info → coalesce → regalloc → combine →
-            dce → opt_emit
+parse -> cg -> opt_cgtarget { record Func IR }
+
+-O1 func_end/finalize:
+    build_cfg
+    machinize
+    build_loop_tree
+    live_info
+    regalloc(simplified)
+    combine
+    dce
+    opt_emit -> wrapped CGTarget
+
+-O2 func_end/finalize:
+    build_cfg
+    block_cloning / addr_xform setup
+    build_ssa
+    gvn
+    copy_prop
+    dse
+    ssa_dce
+    build_loop_tree + licm
+    pressure_relief
+    make_conventional_ssa
+    ssa_combine
+    undo_ssa
+    jump_opt
+    machinize
+    build_loop_tree
+    coalesce
+    live_info
+    regalloc(with live-range splitting)
+    combine
+    dce
+    opt_emit -> wrapped CGTarget
+
+-O2 finalize before final lowering:
+    opt_inline
+    opt_cleanup(dirty callers)
 ```
 
-The level dial selects only the optimization schedule; everything
-else is shared.
+The level dial selects optimization work. The lowering backend should be
+shared by `-O1` and `-O2`; `-O1` must remain the bisection floor for bugs in
+machinize, liveness, allocation, emission, and target-specific rewrite.
 
-### 1.1 Level 1 — minimal
+---
 
-Intra: `build_cfg`, `build_ssa` (mem2reg), `jump_opt`. No GVN, no
-LICM, no DSE.
-Inter: none (no inlining; no cleanup iteration).
+## 2. MIR Findings
 
-Just enough to land code through the SSA pipeline at quality
-comparable to direct `CGTarget` lowering. The point of level 1 is
-isolation: when a bug reproduces here, it is in IR construction,
-SSA, or lowering — not in any of the value-changing passes. It is
-the working bisection floor for the rest of the pipeline.
+### 2.1 Where the Optimizer Lives
 
-### 1.2 Level 2 — full
+MIR's optimizer is concentrated in `mir-gen.c`; inlining and MIR-level
+simplification live in `mir.c`; `c2mir/c2mir-driver.c` maps command-line
+`-O` flags to `MIR_gen_set_optimize_level`.
 
-Intra (per `doc/DESIGN.md` §9.2): `build_cfg`, `build_ssa` (mem2reg),
-`gvn` (incl. constprop, redundant-load elim), `copy_prop` (incl.
-redundant-extension elim), `dse`, `ssa_dce`, `jump_opt`,
-`build_loop_tree` + `licm`, `addr_xform`, `block_cloning`,
-`pressure_relief`.
-Inter: `opt_inline` + `opt_cleanup` (bounded by `-finline-iters=N`,
-default 1).
+Key source facts:
 
-### 1.3 Equivalence we commit to
+- `MIR_gen_init` defaults `optimize_level` to `2`.
+- `MIR_gen_set_optimize_level(ctx, level)` only stores the requested level in
+  generator context.
+- `c2mir` treats bare `-O` as `-O2`.
+- Current `mir-gen.c` checks `optimize_level >= 2` for the full pass set.
+  Some docs/comments still describe a separate `-O3`, but the current source
+  does not gate additional passes on `>= 3`. For cfree, model `-O2` as MIR's
+  current full source-level pipeline.
 
-For every case in `test/cg/CORPUS.md` Groups A–Q, building with
-`opt_cgtarget_new(c, target, level)` for `level ∈ {1, 2}` must
-produce the same `test_main` exit code as building against the
-AArch64 target directly. Both levels emit through the SSA →
-machinize → regalloc → emit path, so neither is byte-equivalent to
-level 0 — exit-code equivalence is the contract.
+### 2.2 MIR Level Semantics
 
-DWARF (W path) equivalence is weaker. Level 1 aims for row-by-row
-parity with level 0 in Group P; level 2 may collapse line rows or
-move locations into loclists when value-changing passes fire.
-Neither is a hard contract until the DWARF cross-level harness lands
-(§3 Phase 5+).
+MIR documentation says:
 
----
+- Level `0`: only register allocator and machine code generator work.
+- Level `1`: adds code selection. It should produce more compact/faster code
+  than level 0 at nearly the same generation speed.
+- Level `2`: adds common subexpression elimination and conditional constant
+  propagation. MIR docs report level 1 generation speed around 50 percent
+  faster than level 2.
+- Current source at `optimize_level >= 2`: also runs block cloning, SSA
+  variable renaming, address transformation, DSE, LICM, pressure relief,
+  coalescing, and live-range splitting.
 
-## 2. Current state inventory
-
-### 2.1 Headers — real
-
-- `src/opt/opt.h` — `opt_cgtarget_new`, plus forward decls for every
-  pass listed in `doc/DESIGN.md` §9.2/§9.3/§9.4.
-- `src/opt/ir.h` — `Func`/`Block`/`Inst`, `Val`, multi-def slot for
-  calls/CAS/overflow, typed aux structs (`IRCallAux`, `IRGepAux`,
-  `IRPhiAux`, `IRAsmAux`, `IRCasAux`, ...), frame-slot and param
-  tables on `Func`, value table (`val_def_block`/`val_def_inst`/
-  `val_type`).
-- `src/arch/arch.h` — `Reg` is `u32`, wide enough for the unbounded
-  virtuals the wrapper hands out; `REG_NONE = 0xffffffffu`. `RegClass`
-  has room (`u8 cls` on `Operand`).
-
-### 2.2 Implementations — stubs
-
-- `src/api/stubs.c:84-88` — `opt_cgtarget_new` panics with
-  "subsystem not implemented: opt".
-- `src/api/pipeline.c:227-229` — pipeline already wraps when
-  `opts->opt_level > 0`. Drop the stub and the path goes live.
-- `src/emu/emu.c:180` — emu also opts in to the wrapper at
-  `opt_level > 0`. Same drop-in point.
-- `src/arch/aarch64.c:3009-3011` — `cgtarget_finalize` is a no-op for
-  the bare target; the wrapper will install its own `finalize` that
-  runs IPO + lowering at level 2.
-
-### 2.3 Producer-side wiring — already in place
-
-- `CGTarget` carries the full vtable the wrapper needs to intercept
-  (function lifecycle, frame slots, params, control flow, data
-  movement, arith/cmp/convert, calls/return, alloca, varargs,
-  setjmp/longjmp, atomics, intrinsics, asm_block, set_loc, finalize).
-  Nothing in the vtable assumes a direct backend.
-- `CGTarget.set_loc` is sticky; `Inst.loc` already exists in the IR.
-  The wrapper stamps each recorded `Inst` with the most recent
-  `set_loc` value, and `opt_emit` plays it back via
-  `target->set_loc(target, inst.loc)` before each emit-side call. No
-  new plumbing needed.
-- `Type*` interning is global (`Pool global`), so SSA value types and
-  call signatures can be compared by pointer identity throughout the
-  IR.
-
-### 2.4 Test surface
-
-- `test/cg/CORPUS.md` line 23: "`O` (opt-wrapped) lands once
-  `opt_cgtarget` is implemented." This is the equivalence oracle.
-- `test/cg/harness/cg_runner.c` constructs a `CGTarget*` via
-  `cgtarget_new`. Adding an `--opt-level N` flag that wraps it with
-  `opt_cgtarget_new` is the smallest change to flip every case onto
-  the new path.
-- Group R (deferred) — opt-wrapped equivalence — is the dedicated
-  pass-by-pass regression suite, but in practice we'd run all of A–Q
-  through `--opt-level 1` continuously from Phase 0 onward.
+For cfree:
+
+- `-O1` should optimize the backend mechanics without running SSA passes.
+- `-O2` should be the full quality mode.
+- Do not create a user-visible `-O3` until benchmarks identify a genuinely
+  useful extra tier.
+
+### 2.3 MIR Performance Bar
+
+MIR's README reports `c2m -eg` at about `0.91` geomean performance relative
+to GCC `-O2` on its 15 small C benchmarks, with QBE around `0.65` on the same
+table. Treat those numbers as directional rather than a literal cfree promise:
+MIR has a JIT-oriented IR and mature per-target rewrite logic. Still, they set
+the bar for this project:
+
+- `-O1`: compile-time-focused. It should be much closer to `-O0` compile time
+  than to `-O2`, while removing obvious backend artifacts through combine and
+  DCE.
+- `-O2`: performance-focused. The first external target is at least QBE-class
+  generated-code performance on the MIR `c-benchmarks` set; after that,
+  iteratively close the gap to MIR's published result.
+- Both levels must preserve the CG corpus exit-code contract before benchmark
+  tuning starts.
 
 ---
 
-## 3. Phased plan
-
-Each phase ends at a green test surface. Phases 0–3 are reversible —
-strip the wrapper back to identity replay and the system still works.
-Phase 4 is the wedge: once the SSA → lowering pipeline is the only
-path to bytes, both levels go through it. Levels 1 and 2 share the
-pipeline from Phase 4 onward and diverge only in which optimization
-passes the schedule includes (§1.1, §1.2).
-
-### Phase 0 — wrapper skeleton + equivalence harness
-
-Goal: replace the stub with a real `opt_cgtarget` that does *nothing
-but forward*, and stand up the test harness that proves it. Without
-the harness, "Phase 0 is correct" has no signal — so they land
-together.
-
-Wrapper:
-
-- New `src/opt/opt.c`: `opt_cgtarget_new` returns a CGTarget whose
-  every vtable slot is a one-line `target->method(target, ...)`. No
-  IR built.
-- `finalize` calls `target->finalize(target)`, then the wrapper's own
-  cleanup (none yet).
-- `destroy` cascades: free wrapper state, then `cgtarget_free(target)`.
-- Pipeline already wires this at `src/api/pipeline.c:227`. No driver
-  changes.
-
-Harness:
-
-- Add `--opt-level N` flag to `test/cg/harness/cg_runner.c`. When
-  `N > 0`, wrap the constructed `CGTarget*` with
-  `opt_cgtarget_new(c, target, N)` before running the case.
-- `test/cg/run.sh`: each case in Groups A–Q runs at `--opt-level 0`,
-  `1`, and `2`; exit codes must match across all three. Through
-  Phase 3, levels 1 and 2 share the recorder's 1:1 replay path with
-  level 2 additionally exercising `build_cfg` + `build_ssa` as a
-  dry-run (output discarded). Phase 4 promotes both levels to the
-  shared SSA → lower path. Group P (DWARF/W path) stays at level 0
-  only for now — opt-level equivalence on DWARF is a Phase 5+
-  concern.
-
-Tests: full Groups A–Q D/R/E/J pass at `--opt-level 0` and
-`--opt-level 1`. Any divergence means the forwarding lost data — fix
-it before continuing.
-
-Exit criterion: dual-level corpus is green and stays that way.
-
-### Phase 1 — recording fidelity
-
-Goal: actually build `Func`/`Block`/`Inst` while still emitting
-correct code. Replay 1:1 onto the wrapped target on `func_end`.
-
-- `src/opt/ir.c`: `ir_func_new`, `ir_block_new`, `ir_emit`,
-  `ir_emit_multi`, `ir_emit_const_i`, `ir_emit_const_bytes`,
-  `ir_frame_slot_new`, `ir_param_add`, `ir_set_terminator`. Arena is
-  per-`Func` from `compiler_arena_new` against `Compiler.tu`.
-- `src/opt/opt.c`: each CGTarget method records into the current
-  `Func`'s current block. `Reg` returned from `alloc_reg` is the same
-  integer as the `Val` it defines (`Val == Reg`, modulo `VAL_NONE = 0`
-  / `REG_NONE = 0xffffffff` sentinels). Operand `OPK_REG` carries
-  that same integer; replay reads it back and reissues.
-- `clobbers`/`spill_reg`/`reload_reg`/`free_reg` — under unbounded
-  virtuals these shouldn't be called. `free_reg` is documented as a
-  hint and ignored. The other three indicate CG misuse from the opt
-  side; panic loudly.
-- `src/opt/replay.c`: walks a `Func` linearly, dispatches each
-  `Inst.op` back to the matching `target->method(...)` call. One
-  giant switch keyed on `IROp`. Multi-result ops (`IR_CALL`,
-  `IR_ATOMIC_CAS`, `IR_INTRINSIC` for `*_OVERFLOW`) read from
-  `defs[0..ndefs)`.
-- `set_loc`: wrapper updates a `pending_loc` field; every subsequent
-  `ir_emit` stamps `Inst.loc = pending_loc`. Replay calls
-  `target->set_loc(target, inst.loc)` before each emit-side call.
-- `finalize` at level 1: replay each buffered Func, then
-  `target->finalize(target)`.
-
-Tests: same corpus, `--opt-level 1`. The IR shape gets stress-tested
-on the tricky cases — Group F bitfields, Group G calls (sret split,
-byval, indirect), Group I alloca, Group J va_*, Group K atomics, Group
-L intrinsics (especially multi-result overflow), Group M (deferred:
-asm_block). Anything that loses data in record→replay shows up here.
-
-Exit criterion: `--opt-level 1` corpus is green and IR-allocated
-arenas are reaped on panic via `compiler_defer`.
-
-### Phase 2 — IR pretty-printer + level-1 peepholes
-
-Goal: be able to look at IR; introduce trivial pre-SSA rewrites.
-
-- `src/opt/ir_print.c`: `ir_func_print(Writer*, const Func*)` for
-  ad-hoc debugging and Group R diff oracles. Format unspecified
-  beyond "stable enough for golden-file diffs".
-- `cg-runner --opt-level 1 --dump-ir NAME` prints the recorded `Func`
-  before replay.
-- A handful of safe linear-tape rewrites: fold `IR_CONST_I` +
-  `IR_IADD/ISUB/IMUL` of two constants into a fresh `IR_CONST_I`;
-  collapse trivially dead stores immediately overwritten in the same
-  block. Each rewrite preserves the recorded→replayable contract.
-
-Tests: corpus stays green. Add hand-picked cases (or new ones) where
-the rewrite fires and the resulting code is observably smaller via
-disassembly diff against `--opt-level 0`.
-
-### Phase 3 — SSA construction, dry-run
-
-Goal: build SSA without consuming it. Catches IR-shape bugs before
-they matter — Phase 4's lowering pipeline depends on the SSA being
-correct, so we shake out construction bugs separately.
-
-- `src/opt/pass_cfg.c`: `opt_build_cfg` — derives
-  `Block.preds`/`succ`/`nsucc` from terminators (`IR_BR`, `IR_CONDBR`,
-  `IR_CMP_BRANCH`, `IR_RET`, `IR_LONGJMP`, `IR_BREAK_TO`,
-  `IR_CONTINUE_TO`, `IR_INTRINSIC{TRAP,UNREACHABLE}`). `IR_SETJMP` is
-  a control barrier — splits its block but is not a terminator
-  (control falls through).
-- `src/opt/pass_ssa.c`: `opt_build_ssa` — standard dominance-frontier
-  algorithm; promotes any `FrameSlot` whose `FSF_ADDR_TAKEN` bit was
-  never set (mem2reg folded in per `doc/DESIGN.md` §12). Inserts
-  `IR_PHI` instructions with `IRPhiAux` populated.
-- These run at both levels on `func_end`, but their output is
-  *discarded* before replay until Phase 4 lands the lowering path.
-  Goal at this phase is "no panics on the corpus", not "improved
-  code".
-
-Tests: at levels 1 and 2, recorder runs `build_cfg` + `build_ssa`
-then falls back to the recorder's 1:1 replay. Corpus stays green;
-any panic is a real bug (unhandled IROp, malformed CFG from
-goto/switch lowering, address-taken-detection miss).
-
-### Phase 4 — lowering pipeline (the wedge)
-
-Goal: SSA → make_conventional_ssa → undo_ssa → machinize →
-live_info → coalesce → regalloc → combine → dce → opt_emit, replacing
-the recorder's 1:1 replay at *both* levels. From this point, neither
-level falls back to record→replay.
-
-- `src/opt/pass_machinize.c`: `opt_machinize(Func*, Target)` — ABI
-  lowering (calls into `TargetABI` for argument/return part
-  classification just like CG does today; the result lives in
-  `IRCallAux.abi` already), 2-op forms, calling-convention spill,
-  prolog/epilog placeholders.
-- `src/opt/pass_live.c`: `opt_live_info` — standard backwards dataflow.
-- `src/opt/pass_regalloc.c`: `opt_regalloc` — linear scan, no live-range
-  splitting initially. Allocates physical registers from the same
-  pool the AArch64 backend uses for scratch (i.e. asks the wrapped
-  target for clobbers and the param/sret physical mapping).
-- `src/opt/pass_emit.c`: `opt_emit(Compiler*, Func*, CGTarget*)` —
-  walks the lowered IR and drives the wrapped target via the
-  emit-side CGTarget surface. Inst → target call, but operands are
-  now physical Operands (post-RA) and prolog/epilog/spill insertion
-  has happened.
-- Wrapper's `func_end` runs `build_cfg` + `build_ssa` +
-  `make_conventional_ssa` + `undo_ssa` + the lowering pipeline at
-  both levels. No optimization passes between SSA build and undo at
-  this phase — the lowering pipeline does all the work, so we
-  isolate lowering bugs from optimization-pass bugs. Level 1 and
-  level 2 are functionally identical at the end of Phase 4.
-
-Tests: `--opt-level 1` and `--opt-level 2` corpora green. The
-recorder's old 1:1 replay path is removed; SSA → lower is the only
-path to bytes.
-
-Exit criterion: levels 1 and 2 both green, no regressions in the
-level-0 suite.
-
-### Phase 5 — intra-procedural passes, level-gated
-
-Goal: populate the optimization schedule. All new passes land in the
-*level-2* schedule only; level 1's schedule stays at the Phase 4 set
-(`build_cfg`, `build_ssa`, `jump_opt` once it's wired in) so it
-remains the bisection floor for IR/SSA/lowering bugs.
-
-Order (tracking `doc/DESIGN.md` §9.2; each pass lands behind a flag
-so the equivalence harness can run with just-this-pass to localize
-bugs):
-
-1. `opt_jump_opt` — moves into the level-1 schedule too (cheap,
-   high-value, doesn't change values).
-2. `opt_gvn` (with constprop, redundant-load elim folded in) —
-   level 2 only.
-3. `opt_copy_prop` (with redundant-extension elim) — level 2 only.
-4. `opt_dse` — level 2 only.
-5. `opt_ssa_dce` — level 2 only.
-6. `opt_build_loop_tree` + `opt_licm` — level 2 only.
-7. `opt_addr_xform` — level 2 only.
-8. `opt_block_cloning` — level 2 only.
-9. `opt_pressure_relief` — level 2 only.
-10. Lowering-time additions (run at both levels): `opt_coalesce`,
-    `opt_combine`, `opt_dce` (post-RA), live-range splitting in
-    `opt_regalloc`.
-
-No UB-exploiting transformations (`doc/DESIGN.md` §9): no
-signed-overflow-is-unreachable, no shift-by-≥-width-is-unreachable,
-no division-by-zero-is-unreachable, no null-deref-is-unreachable.
-
-### Phase 6 — inter-procedural (level 2 only)
-
-Goal: cross-function inlining + cleanup iteration. Level 1 stays
-intra-procedural — the inliner is the largest source of correctness
-risk and the largest divergence from level-0 codegen, so it earns
-its own gating.
-
-- `opt_inline`: bottom-up call-graph walk. SCCs (mutual recursion)
-  skipped. Heuristic: instruction count + call site count. Inlining
-  recognizes the `__cfree_setjmp` symbol by name as returns-twice and
-  refuses to inline across.
-- `opt_cleanup`: subset re-run (gvn, copy_prop, ssa_dce, jump_opt,
-  licm if loops, addr_xform if uses remain).
-- `-finline-iters=N` knob (default 1, hard cap enforced inside the
-  wrapper).
-
-Tests: corpus + a small set of cases hand-crafted to require
-inlining-plus-cleanup to optimize (e.g. a small wrapper around a
-constant returner whose body becomes a constant only after inlining).
+## 3. MIR Pipeline, Source Order
+
+This is the effective order in current `mir-gen.c` for full function
+generation.
+
+### 3.1 Common Setup
+
+MIR duplicates the function instruction list, allocates a function CFG object,
+and runs `build_func_cfg`.
+
+`build_func_cfg`:
+
+- Creates synthetic entry and exit blocks.
+- Splits basic blocks at labels, branches, returns, property branches, and
+  fallthrough boundaries.
+- Converts register operands to internal variable operands.
+- Marks address-producing instructions and addressable regs.
+- Adds edges for direct branches, switches, fallthrough, and possible indirect
+  jump targets.
+- Marks calls on blocks so memory availability and liveness can treat them as
+  barriers.
+- Removes unreachable blocks when `optimize_level > 0`.
+
+cfree's `opt_build_cfg` should follow this shape: construct explicit
+entry/exit blocks, keep call/memory barrier metadata on blocks, and make
+unreachable cleanup part of `-O1+`, not a late optional cleanup.
+
+### 3.2 Full `-O2` Pre-Lowering Passes
+
+MIR full mode then runs:
+
+1. `clone_bbs` before SSA when there are no address instructions.
+   MIR clones cold blocks after return back into hot predecessors when this
+   exposes optimization in the hot path. It uses a bounded growth factor and
+   skips back edges.
+
+2. `build_ssa`.
+   MIR uses a Braun-style construction: it creates optimized maximal SSA,
+   minimizes redundant phis, builds def-use edges, and optionally renames
+   variables. cfree should copy the useful property, not the representation:
+   every use should have cheap access to its defining inst, and every def
+   should have cheap access to its uses.
+
+3. `addr_xform` when address instructions exist.
+   MIR tries to eliminate `ADDR` pseudos that only feed memory operands. If an
+   address pseudo must remain addressable, MIR converts uses to memory loads
+   and stores, rebuilds SSA, then clones blocks. For cfree, this maps to
+   folding GEP/address-of chains into target memory operands where the target
+   can encode them, while keeping address-taken frame slots in memory.
+
+4. `gvn`.
+   MIR computes memory availability, dominators, and value numbers in postorder.
+   This pass includes constant propagation, branch folding, redundant expression
+   elimination, redundant load elimination, and store/load reuse. It uses alias
+   and nonalias tags on memory operands. Calls clear or conservatively restrict
+   memory availability.
+
+5. `copy_prop`.
+   MIR propagates copies, folds multiply/divide by powers of two, and removes
+   redundant extension chains. cfree should keep this as a separate pass after
+   GVN because it relies on SSA edges and target legality.
+
+6. `dse`.
+   MIR computes memory liveness by memory location, handles alloca escape
+   through calls, and removes stores whose memory location is not live. This
+   depends on GVN's memory numbering and alias classification.
+
+7. `ssa_dce`.
+   MIR deletes SSA instructions with unused outputs, while preserving calls,
+   allocas, varargs, returns, frame/stack effects, overflow sequences with live
+   overflow branches, and other side-effecting ops.
+
+8. `build_loop_tree + licm`.
+   MIR builds natural loops and preheaders. LICM skips branches, phis, calls,
+   memory, varargs, allocas, and potentially trapping divisions/mods. It is
+   pressure-sensitive: cheap single instructions are not moved unless their
+   inputs are worth moving too; multiplies are considered expensive enough to
+   move.
+
+9. `pressure_relief`.
+   MIR moves single-use immediate or constant-like moves closer to their use
+   when doing so reduces pressure and does not move work into a loop.
+
+10. `make_conventional_ssa`.
+    MIR lowers phis into edge/block moves, splitting critical edges when needed.
+
+11. `ssa_combine`.
+    MIR combines compare+branch pairs and folds address components into memory
+    operands while SSA edges are still available.
+
+12. `undo_ssa`.
+    MIR removes phi nodes and SSA edges.
+
+13. `jump_opt`.
+    MIR removes unreachable blocks, empty blocks, branches to the next
+    instruction, chains of labels, and branches to jumps.
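+
+The "branches to jumps" cleanup in `jump_opt` can be sketched as follows. This
+is a minimal illustration, not MIR's or cfree's code: `SketchBlock` and
+`thread_jump_target` are hypothetical names, and blocks are assumed to be
+addressed by array index.
+
+```c
+#include <stddef.h>
+
+/* Hypothetical, simplified block record used only for this sketch. */
+typedef struct {
+    int is_empty_jump;   /* nonzero: block holds only an unconditional jump */
+    size_t jump_target;  /* index of that jump's destination block */
+} SketchBlock;
+
+/* Follow a chain of jump-only blocks from `start` and return the index of
+ * the first block that is not a jump-only block. The hop limit bounds the
+ * walk at nblocks steps, so a cycle of empty jumps cannot loop forever. */
+size_t thread_jump_target(const SketchBlock *blocks, size_t nblocks,
+                          size_t start) {
+    size_t cur = start;
+    for (size_t hops = 0; hops < nblocks && blocks[cur].is_empty_jump; hops++)
+        cur = blocks[cur].jump_target;
+    return cur;
+}
+```
+
+A branch whose destination is `start` can then be retargeted to the returned
+index, after which the skipped jump-only blocks may become unreachable.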
+
+### 3.3 Lowering and Allocation
+
+After pre-lowering optimization, MIR runs the machine-dependent and allocation
+pipeline:
+
+1. `target_machinize`.
+   This performs ABI lowering, call lowering, target two-operand forms, and
+   other machine-dependent normalization.
+
+2. Build a loop tree for `-O1+`.
+   MIR uses loop depth for frequency/pressure estimates in liveness,
+   coalescing, and allocation.
+
+3. `collect_moves`, move-only liveness, conflict matrix, and `coalesce` at
+   `-O2`.
+   MIR aggressively coalesces move-related regs, prioritizing moves by loop
+   frequency and checking conflicts in a bounded matrix.
+
+4. Full `live_info`.
+   MIR computes `live_in`/`live_out`, register frequencies, live lengths, and
+   block pressure. It understands phis before SSA destruction and call-used
+   hard registers after machinize.
+
+5. `reg_alloc`.
+   MIR builds live ranges and assigns pseudos by priority: tied hard regs first,
+   then higher frequency, then shorter live length. At levels below 2 it uses a
+   simplified allocator. At full level it can split live ranges and place
+   spills/restores on edges or block boundaries.
+
+6. `rewrite`.
+   MIR rewrites pseudo regs to hard regs or stack slots, inserts reloads and
+   stores, handles call-clobbered saves, and deletes noop moves.
+
+7. `combine`.
+   This is target-aware code selection. It substitutes safe single-use moves,
+   collapses extension chains, commutes operands to expose legal combinations,
+   removes internal labels, and validates each rewrite with target legality.
+
+8. `dead_code_elimination`.
+   MIR performs post-RA DCE using live-out sets and preserves side effects.
+
+9. Prolog/epilog and machine instruction generation.
+
+cfree should keep this exact split: combine after register allocation is not a
+replacement for SSA `copy_prop`; it is a target legality/code-selection pass.
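+
+The `reg_alloc` priority order described above (tied hard regs first, then
+higher frequency, then shorter live length) can be expressed as a sort
+comparator. This is a sketch under assumptions: `AllocCand` and its fields are
+hypothetical, and the final tie-break on `id` is added here only to make the
+order deterministic.
+
+```c
+#include <stdlib.h>
+
+/* Hypothetical allocation candidate; not a type from MIR or cfree. */
+typedef struct {
+    unsigned id;            /* pseudo register number */
+    int tied_to_hard_reg;   /* nonzero if tied to a hard register */
+    unsigned long freq;     /* estimated use frequency (loop-depth weighted) */
+    unsigned live_length;   /* length of the live range */
+} AllocCand;
+
+/* Orders candidates so that higher-priority pseudos sort first. */
+static int alloc_cand_cmp(const void *pa, const void *pb) {
+    const AllocCand *a = pa, *b = pb;
+    int ta = a->tied_to_hard_reg != 0, tb = b->tied_to_hard_reg != 0;
+    if (ta != tb) return tb - ta;                     /* tied first */
+    if (a->freq != b->freq)
+        return a->freq > b->freq ? -1 : 1;            /* hotter first */
+    if (a->live_length != b->live_length)
+        return a->live_length < b->live_length ? -1 : 1;  /* shorter first */
+    return (a->id > b->id) - (a->id < b->id);         /* assumed tie-break */
+}
+
+/* Sort candidates into allocation order. */
+void sort_alloc_candidates(AllocCand *cands, size_t n) {
+    qsort(cands, n, sizeof *cands, alloc_cand_cmp);
+}
+```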
+
+### 3.4 Inlining
+
+MIR inlining is in `mir.c`, before generator optimization. It processes direct
+call-like instructions after MIR simplification, not inside `mir-gen.c`.
+
+Important MIR inliner behavior:
+
+- It can inline direct normal calls and explicit inline calls when the callee
+  is available.
+- It skips unresolved externals, self recursion, label-reference functions,
+  varargs, `jret`, and over-budget callees.
+- Default size budgets are small: normal calls have a much smaller cap than
+  explicit inline calls.
+- It limits caller growth relative to original caller size.
+- It renames callee registers, duplicates labels, materializes argument moves,
+  rewrites returns, and handles simple top-level constant-size `alloca`.
+
+cfree should run inlining on retained `Func` IR at `finalize` for `-O2`, then
+run cleanup on dirty callers. SCCs can be skipped for v1. The inliner must
+refuse setjmp/longjmp-sensitive and vararg cases until their semantics are
+explicitly tested.
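+
+The v1 refusal rules above can be collected into one eligibility predicate. A
+minimal sketch, assuming a hypothetical `CalleeInfo` summary and a caller-
+supplied size budget; none of these names exist in `src/opt/opt.h`.
+
+```c
+#include <stdbool.h>
+
+/* Hypothetical per-callee summary gathered before inlining. */
+typedef struct {
+    bool body_available;          /* callee Func IR is retained */
+    bool is_vararg;               /* callee takes variable arguments */
+    bool uses_setjmp_or_longjmp;  /* setjmp/longjmp-sensitive body */
+    bool in_recursive_scc;        /* self or mutual recursion */
+    unsigned inst_count;          /* callee size in IR instructions */
+} CalleeInfo;
+
+/* Returns true only when every v1 restriction is satisfied. */
+bool can_inline(const CalleeInfo *callee, unsigned size_budget) {
+    if (!callee->body_available) return false;
+    if (callee->is_vararg) return false;
+    if (callee->uses_setjmp_or_longjmp) return false;
+    if (callee->in_recursive_scc) return false;
+    return callee->inst_count <= size_budget;
+}
+```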
 
 ---
 
-## 4. Module layout
+## 4. cfree Level Schedules
+
+### 4.1 `-O1` Minimal Schedule
+
+`-O1` is not "half of SSA." It is MIR's cheap backend optimization tier.
+
+Required `-O1` schedule:
 
 ```
-src/opt/
-  opt.h            (already exists)
-  ir.h             (already exists)
-  opt.c            wrapper, vtable bindings, finalize dispatch
-  ir.c             Func/Block/Inst plumbing, val table, arenas
-  ir_print.c       pretty-printer (Phase 2)
-  replay.c         level-1 replay (Phase 1)
-  pass_cfg.c       opt_build_cfg
-  pass_ssa.c       opt_build_ssa, opt_make_conventional_ssa,
-                   opt_ssa_combine, opt_undo_ssa
-  pass_gvn.c       opt_gvn
-  pass_copy.c      opt_copy_prop
-  pass_dse.c       opt_dse
-  pass_dce.c       opt_ssa_dce, opt_dce
-  pass_jump.c      opt_jump_opt
-  pass_loop.c      build_loop_tree + opt_licm
-  pass_addr.c      opt_addr_xform
-  pass_clone.c     opt_block_cloning
-  pass_pressure.c  opt_pressure_relief
-  pass_machinize.c opt_machinize
-  pass_live.c      opt_live_info
-  pass_coalesce.c  opt_coalesce
-  pass_regalloc.c  opt_regalloc
-  pass_combine.c   opt_combine
-  pass_emit.c      opt_emit (lowering replay)
-  pass_inline.c    opt_inline + opt_cleanup
+build_cfg
+machinize
+build_loop_tree
+live_info
+regalloc(allow_live_range_split = false)
+combine
+dce
+opt_emit
 ```
 
-This split lets each phase land as a small set of new files plus
-edits to `opt.c`. No file accumulates more than one pass.
+Allowed `-O1` details:
+
+- Remove unreachable blocks during CFG construction.
+- Use loop depth only for frequency/pressure costing.
+- Run target-aware combine after register allocation.
+- Delete noop moves and dead post-RA definitions.
+- Use a priority-based allocator, but without coalescing and live-range
+  splitting in the first production version.
+
+Forbidden at `-O1`:
+
+- `build_ssa`
+- `gvn`
+- `copy_prop`
+- `dse`
+- `ssa_dce`
+- `licm`
+- `pressure_relief`
+- `coalesce`
+- `opt_inline`
+
+This keeps `-O1` useful and debuggable: if `-O1` fails, the bug is in
+recording, CFG, machinize, liveness, allocation, combine, DCE, or emission.
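+
+One way to enforce the required/forbidden split is a per-pass minimum level.
+The sketch below is illustrative only: `PassId`, `pass_min_level`, and
+`pass_enabled` are hypothetical names, and the table covers just the passes
+named in this section.
+
+```c
+#include <stdbool.h>
+
+/* Hypothetical pass identifiers mirroring the lists above. */
+typedef enum {
+    PASS_BUILD_CFG, PASS_MACHINIZE, PASS_BUILD_LOOP_TREE, PASS_LIVE_INFO,
+    PASS_REGALLOC, PASS_COMBINE, PASS_DCE, PASS_EMIT,
+    PASS_BUILD_SSA, PASS_GVN, PASS_COPY_PROP, PASS_DSE, PASS_SSA_DCE,
+    PASS_LICM, PASS_PRESSURE_RELIEF, PASS_COALESCE, PASS_INLINE,
+    PASS_COUNT
+} PassId;
+
+/* Lowest optimization level at which each pass may run. */
+static const int pass_min_level[PASS_COUNT] = {
+    [PASS_BUILD_CFG] = 1,       [PASS_MACHINIZE] = 1,
+    [PASS_BUILD_LOOP_TREE] = 1, [PASS_LIVE_INFO] = 1,
+    [PASS_REGALLOC] = 1,        [PASS_COMBINE] = 1,
+    [PASS_DCE] = 1,             [PASS_EMIT] = 1,
+    [PASS_BUILD_SSA] = 2,       [PASS_GVN] = 2,
+    [PASS_COPY_PROP] = 2,       [PASS_DSE] = 2,
+    [PASS_SSA_DCE] = 2,         [PASS_LICM] = 2,
+    [PASS_PRESSURE_RELIEF] = 2, [PASS_COALESCE] = 2,
+    [PASS_INLINE] = 2,
+};
+
+/* True if `pass` belongs to the schedule for `opt_level`. At level 0 no
+ * pass is enabled, matching the direct backend path. */
+bool pass_enabled(PassId pass, int opt_level) {
+    return opt_level >= pass_min_level[pass];
+}
+```
+
+A schedule driver could assert `pass_enabled` before each pass so that an
+`-O1` build fails fast if a forbidden pass is ever wired in.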
+
+### 4.2 `-O2` Full Schedule
+
+`-O2` uses MIR's full current source pipeline:
+
+```
+build_cfg
+if no address-transform candidates:
+    block_cloning
+build_ssa
+if address-transform candidates:
+    addr_xform
+    undo_ssa
+    block_cloning
+    build_ssa
+gvn
+copy_prop
+dse
+ssa_dce
+build_loop_tree
+licm
+pressure_relief
+make_conventional_ssa
+ssa_combine
+undo_ssa
+jump_opt
+machinize
+build_loop_tree
+collect_moves
+live_info(move vars only)
+coalesce
+live_info(all vars)
+regalloc(allow_live_range_split = true)
+combine
+dce
+opt_emit
+```
+
+Then, once inlining exists:
+
+```
+finalize:
+    opt_inline(FuncSet, max_iters)
+    for each dirty caller:
+        opt_cleanup
+    lower every dirty/not-yet-emitted Func through the full schedule
+```
+
+`opt_cleanup` should re-run the passes for which inlining exposes new
+opportunities: `build_cfg`, `build_ssa`, `gvn`, `copy_prop`, `dse`, `ssa_dce`,
+`licm` when loops exist, `pressure_relief`, `make_conventional_ssa`,
+`ssa_combine`, `undo_ssa`, and `jump_opt`.
+
+### 4.3 Transformations We Do Not Take
+
+`doc/DESIGN.md` is still binding: no UB-exploiting transforms. Do not assume
+that signed overflow, a shift by at least the operand width, division by zero,
+or a null dereference is unreachable. MIR is careful around potentially
+trapping division/modulo in LICM; cfree should be at least as conservative.
+
+MIR property instructions and lazy basic-block versioning are out of scope for
+the first optimized backend. They are a separate JIT specialization feature,
+not required for normal `cfree` object/executable codegen.
 
 ---
 
-## 5. Cross-cutting decisions
+## 5. Current cfree State
+
+The optimizer is no longer just a stub:
+
+- `src/opt/opt.c` implements the `CGTarget` wrapper.
+- `src/opt/ir.c` and `src/opt/ir.h` implement the recorded IR container.
+- `src/opt/pass_cfg.c` implements CFG construction.
+- `src/opt/pass_ssa.c` implements current SSA construction.
+- `test/cg/run.sh` supports `CFREE_OPT_LEVELS`; default is `0 1`, while
+  level `2` is opt-in today.
+
+Current behavior:
+
+- Level `1` records and replays 1:1 into the wrapped target.
+- Level `2` runs `opt_build_cfg + opt_build_ssa` as a dry run, discards that
+  result, then replays 1:1.
+- No real optimized lowering path exists yet.
+
+The next implementation work should replace replay with the MIR-style lowering
+pipeline, first at `-O1`, then at `-O2`.
+
+---
+
+## 6. Implementation Phases
+
+Each phase should end with targeted green tests. Prefer red-green tests for
+the exact pass being introduced, then expand through the CG corpus.
+
+### Phase A - Production `-O1` Lowering
 
-### 5.1 Reg ↔ Val identity
+Goal: make `-O1` stop replaying and emit through the optimized backend path
+without SSA value passes.
 
-Each call to `wrapper->alloc_reg(class, type)` allocates a fresh `Val`
-in the current `Func`'s value table and returns its integer as
-`Reg`. `Operand{kind=OPK_REG, v.reg=R}` is interpreted on replay as
-"the value defined at SSA id R." This collapses two parallel ID
-spaces into one and avoids a side mapping. `REG_NONE` (`0xffffffff`)
-and `VAL_NONE` (`0`) live at opposite ends of the range — neither
-is allocated.
+Implement:
 
-### 5.2 Frame slots stay frame slots until SSA construction
+- `opt_machinize`
+- `opt_live_info`
+- `opt_regalloc(..., false)`
+- `opt_combine`
+- `opt_dce`
+- `opt_emit`
 
-`cg_local`/`cg_param` always allocate frame slots through
-`CGTarget.frame_slot`/`CGTarget.param`. The wrapper records them
-verbatim. Loads/stores against frame slots become `IR_LOAD`/`IR_STORE`
-with `MemAccess.alias = ALIAS_LOCAL`. `build_ssa` (Phase 3) is the
-only place that promotes non-`FSF_ADDR_TAKEN` slots to SSA values.
-Address-taken slots stay in memory and are reasoned about through
-`MemAccess` alias roots (`doc/DESIGN.md` §5.6).
+Keep the allocator simple but not naive:
 
-This keeps the recorder dumb and pushes all decisions into one pass.
+- Build live ranges from block liveness.
+- Sort allocation candidates by tied hard-reg requirement, frequency, live
+  length, then stable id.
+- Assign hard regs when possible, stack slots otherwise.
+- Rewrite pseudos into hard regs/stack slots with reserved scratch regs for
+  reload/store addressing.
+- Delete noop moves after rewrite.
 
-### 5.3 Method intercepts the wrapper rejects
+Exit criteria:
 
-Under unbounded virtuals, these CG-side mechanics are nonsense:
+- `CFREE_OPT_LEVELS="0 1" make test-cg` passes for targeted AArch64 cases.
+- Add focused allocation cases for call-clobber saves, stack spills, tied
+  hard regs from inline asm, and values live through branches.
 
-- `clobbers` — meaningless without a finite physical register set.
-- `spill_reg` / `reload_reg` — CG drives these for the -O0 value
-  stack; opt's wrapper has no value stack.
+### Phase B - Full Allocation Infrastructure
 
-`free_reg` is already a hint; the wrapper ignores it.
+Goal: bring `-O2` allocation quality up to MIR's model before value passes
+start changing code aggressively.
 
-The first three should panic with a clear "called X on opt_cgtarget"
-message. They indicate CG is being driven in -O0 mode while opt is
-attached, which means a wiring bug.
+Implement:
 
-### 5.4 set_loc fan-out
+- Move collection.
+- Move-only liveness.
+- Conflict matrix for move-related regs.
+- Aggressive coalescing.
+- Live-range splitting and edge/block spill placement.
 
-The wrapper's `set_loc` updates a single `pending_loc` field on the
-recorder state. `ir_emit` stamps `Inst.loc = pending_loc`. The
-lowering replay (Phase 4) calls `target->set_loc(target, inst.loc)`
-before each emit-side call. This is exactly the protocol
-`doc/DWARF.md` §3 expects, and it preserves Group P (set_loc/debug)
-behavior at level 2.
+Exit criteria:
 
-### 5.5 No UB-exploiting passes
+- `-O1` remains green.
+- `-O2` can use the same lowering path with coalescing/splitting enabled, even
+  if all SSA value passes are still disabled.
 
-`doc/DESIGN.md` §9 is binding. opt may not assume signed overflow,
-shift-by-≥-width, division by zero, or null deref are unreachable.
-WASM traps on the first three deterministically; real targets are
-also more predictable this way. The "70% of -O2" goal is achievable
-without these transformations.
+### Phase C - SSA Value and Memory Passes
+
+Goal: enable the MIR full pre-lowering schedule for `-O2`.
+
+Implement in order:
+
+1. `opt_block_cloning`
+2. `opt_addr_xform`
+3. `opt_gvn`
+4. `opt_copy_prop`
+5. `opt_dse`
+6. `opt_ssa_dce`
+7. `opt_licm`
+8. `opt_pressure_relief`
+9. `opt_make_conventional_ssa`
+10. `opt_ssa_combine`
+11. `opt_undo_ssa`
+12. `opt_jump_opt`
+
+Do not batch these into one landing. Each pass needs a pass-local corpus case
+that fails red without the pass or its bug fix.
+
+Exit criteria:
+
+- `CFREE_OPT_LEVELS="0 1 2" make test-cg` passes for the relevant arch.
+- Pass-local dump tests prove the intended rewrite fires.
+
+### Phase D - Inlining and Cleanup
+
+Goal: add MIR-style small direct-call inlining for `-O2`.
+
+Implement:
+
+- Bottom-up call graph over retained `Func` IR.
+- SCC skip for v1.
+- Size budgets modeled on MIR: small budget for normal calls, larger budget
+  for explicit inline candidates once cfree tracks that source property.
+- Caller growth cap.
+- Register/value remapping, block/label remapping, parameter materialization,
+  return rewrite, and debug location preservation.
+- Conservative refusal for varargs, setjmp/longjmp, inline asm with hard
+  constraints, and functions whose frame/alloca behavior is not yet modeled.
+
+After each inline iteration, run `opt_cleanup` on dirty callers.
+
+Exit criteria:
+
+- Small wrapper-call cases optimize after inline+cleanup.
+- Recursive and mutually recursive cases are unchanged and correct.
+
+### Phase E - Benchmark Closure
+
+Goal: tune by measurement, not by adding passes speculatively.
+
+Benchmark set:
+
+- MIR `c-benchmarks`: `array`, `binary-trees`, `funnkuch-reduce`,
+  `hash`, `hash2`, `heapsort`, `lists`, `mandelbrot`, `matrix`,
+  `method-call`, `nbody`, `sieve`, `spectral-norm`, `strcat`, and any
+  benchmark that requires only supported libc/runtime features.
+- cfree-specific stress cases for ABI, TLS, atomics, and inline asm.
+
+Measure:
+
+- Compile wall time for `-O0`, `-O1`, `-O2`.
+- Executable run time against clang/gcc `-O2` when available.
+- Code size for hot text sections.
+- Pass counters: removed GVN expressions, folded branches, removed stores,
+  coalesced moves, spills/restores, split ranges, post-RA deleted moves.
+
+Target:
+
+- `-O1` should be the fast optimized tier and materially faster to compile
+  than `-O2`.
+- `-O2` should first reach QBE-class performance on the benchmark set, then
+  close toward MIR's published `c2m -eg` geomean.
+
+---
+
+## 7. Module Layout
+
+Keep the MIR pass boundaries as separate cfree modules:
+
+```
+src/opt/
+  opt.c             wrapper, vtable bindings, schedule dispatch
+  ir.c              Func/Block/Inst plumbing
+  ir_print.c        stable dumps for pass tests
+  pass_cfg.c        CFG, unreachable cleanup
+  pass_clone.c      block cloning
+  pass_ssa.c        SSA build, conventional SSA, undo SSA
+  pass_addr.c       address transformation
+  pass_gvn.c        GVN, constprop, redundant-load elimination
+  pass_copy.c       copy propagation, extension cleanup
+  pass_dse.c        dead store elimination
+  pass_dce.c        SSA DCE and post-RA DCE
+  pass_loop.c       loop tree and LICM
+  pass_pressure.c   pressure relief
+  pass_jump.c       jump optimization
+  pass_machinize.c  target ABI and machine-form lowering
+  pass_live.c       liveness, pressure, live ranges
+  pass_coalesce.c   move collection and coalescing
+  pass_regalloc.c   assignment, rewrite, splitting
+  pass_combine.c    target-aware code selection
+  pass_emit.c       drive wrapped CGTarget
+  pass_inline.c     inlining and cleanup
+```
+
+No pass should reach into the wrapped target's private implementation. Target
+specifics belong behind `Target`, `TargetABI`, or explicit helper hooks.
+
+---
+
+## 8. Pass Invariants
+
+- No VLAs and no global optimizer state. MIR uses generator context; cfree
+  should hang all pass state off `Compiler`, `Func`, `FuncSet`, or explicit
+  pass contexts.
+- `Reg` and `Val` identity remains valid while recording. Lowering may map
+  Vals to new physical registers or stack slots, but that mapping belongs to
+  allocation state.
+- Frame slots remain frame slots until a pass proves promotion is legal.
+- Calls are memory barriers unless alias information proves otherwise.
+- Inline asm is side-effecting and may pin hard regs; the allocator and
+  inliner must treat it conservatively.
+- `set_loc` is sticky on recorded insts. Every emitted instruction must forward
+  the active location to the wrapped target before machine emission.
+- Every pass either preserves CFG metadata or invalidates/rebuilds it before
+  the next consumer.
 
 ---
 
-## 6. Open questions
-
-- **Linear scan vs graph coloring at -O2.** MIR uses fast linear
-  scan. The "70% of -O2" target is achievable with it. Graph coloring
-  is a follow-up if a benchmark gap demands it.
-- **Loop tree at lowering vs after intra passes.** §9.4 builds the
-  loop tree inside the lowering pipeline (used by RA for split
-  decisions). §9.2 also builds one for LICM. Open: do these share an
-  arena-owned structure that survives both, or are they computed
-  twice? Rebuilding is cheap; sharing requires invalidation
-  discipline. Default to rebuilding until it matters.
-- **Stackifier vs Relooper for WASM.** `doc/DESIGN.md` §14 leaves
-  this open. opt's flat-CFG IR makes either viable. Decision deferred
-  to when the WASM backend lands; the IR is structure-agnostic.
-- **Inlining heuristic tuning.** Bench-driven; the knob
-  (`-finline-iters=N`, instruction-count budget) is exposed from
-  Phase 6 onward.
-- **IR snapshot/golden-file format.** Phase 2's pretty-printer needs
-  a stable enough format for diff-based regression tests in Group R.
-  Decide format when Phase 2 lands.
+## 9. Test Strategy
+
+Targeted commands:
+
+```
+CFREE_OPT_LEVELS="0 1" make test-cg
+CFREE_OPT_LEVELS="0 1 2" make test-cg
+CFREE_TEST_FILTER=<case> CFREE_OPT_LEVELS="0 1 2" make test-cg
+```
+
+Use pass dumps for red-green tests:
+
+- CFG: block/successor/predecessor shape.
+- SSA: phi placement and def-use chains.
+- GVN: redundant expression/load replaced and constant branch folded.
+- DSE: dead store removed while escaping stores remain.
+- LICM: safe invariant moved; trapping division and memory ops remain.
+- RA: expected spill/reload/coalescing counters.
+- Combine: target-legal fused instruction shape.
+
+Before broad runs, redirect output to a file and inspect the failure slice:
+
+```
+CFREE_TEST_FILTER=<case> CFREE_OPT_LEVELS="0 1 2" make test-cg >build/opt-test.log 2>&1
+tail -200 build/opt-test.log
+```
+
+The full CG corpus remains the equivalence oracle: levels `1` and `2` must
+produce the same observable `test_main` result as level `0`. DWARF equivalence
+can start weaker, but `set_loc` forwarding must not regress line-row emission.
