commit 61ac2c5548bcea594508b8e586154d02ddea4f9e
parent 19a555939630998634e929b779ac915593859f36
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Mon, 11 May 2026 18:24:42 -0700
OPT.md plan update
Diffstat:
| M | doc/OPT.md | | | 1021 | ++++++++++++++++++++++++++++++++++++++++++++++--------------------------------- |
1 file changed, 601 insertions(+), 420 deletions(-)
diff --git a/doc/OPT.md b/doc/OPT.md
@@ -1,468 +1,649 @@
-# OPT — implementation plan
-
-Scope: what it takes to land cfree's optimizer behind the
-"presents-as-CGTarget" contract described in `doc/DESIGN.md` §5.1, §9.
-The producer side is the wrapper plus IR (`src/opt/opt.h`,
-`src/opt/ir.h`); the consumer is the wrapped target CGTarget and (at
--O2) the lowering pipeline that drives it. The whole `test/cg` corpus
-already serves as the equivalence oracle — every case is built against
-`CGTarget` directly today, so the same case with `opt_cgtarget`
-inserted between `cg-runner` and the AArch64 backend must produce the
-same observable behavior.
-
-Today the headers are real, the implementation is a single panic stub,
-and the pipeline is wired to call into it the moment we drop the stub.
-This plan starts at "wrapper that records nothing, replays directly"
-and builds out to a real intra+IPO pipeline.
+# OPT - MIR-informed implementation plan
+
+Scope: deliver cfree's optimized backend behind the `CGTarget` wrapper
+contract described in `doc/DESIGN.md` Section 9, using MIR JIT's optimizer as
+the model for phase order, level gating, and performance targets.
+
+This document is based on a source investigation of `~/tmp/mir/`, mainly
+`mir-gen.c`, `mir.c`, `mir.h`, `mir-gen.h`, `MIR.md`, `README.md`, and the
+`c2mir` driver. The important takeaway is that MIR keeps the optimizer short:
+`-O1` is a fast lowering/code-selection path, while the expensive
+SSA/value/memory/loop work is reserved for `-O2`.
---
-## 1. What working opt must look like
+## 1. Target Shape
+
+cfree exposes three levels:
-A single shared engine. CG drives `opt_cgtarget`; every CGTarget call
-lands as exactly one `Inst` in a per-function flat-CFG IR (§5.1). On
-`func_end` (intra-procedural) and `finalize` (inter-procedural), the
-wrapper runs an optimization schedule, lowers through machinize →
-regalloc → emit, and drives the wrapped target CGTarget to produce
-machine code.
+- `-O0`: direct backend path. CG drives the target `CGTarget` immediately.
+- `-O1`: MIR-style minimal optimizer. Record IR, lower it through the new
+ backend path, run liveness, simplified register allocation, post-RA combine,
+ and DCE. No SSA value passes and no inlining.
+- `-O2`: MIR-style full optimizer. Add SSA, GVN/constant propagation,
+ redundant load elimination, copy propagation, dead store elimination,
+ SSA DCE, LICM, pressure relief, address transformation, block cloning,
+ coalescing, live-range splitting, and inlining plus cleanup.
+
+The core contract stays simple:
```
-parse → cg → opt_cgtarget {record into Func} →
- on func_end: build_cfg → build_ssa → <intra schedule> → <lower>
- on finalize: <inter schedule> → for each dirty Func: <lower>
- → wrapped target CGTarget → MCEmitter → ObjBuilder
-
- <lower> = make_conventional_ssa → ssa_combine → undo_ssa →
- machinize → live_info → coalesce → regalloc → combine →
- dce → opt_emit
+parse -> cg -> opt_cgtarget { record Func IR }
+
+-O1 func_end/finalize:
+ build_cfg
+ machinize
+ build_loop_tree
+ live_info
+ regalloc(simplified)
+ combine
+ dce
+ opt_emit -> wrapped CGTarget
+
+-O2 func_end/finalize:
+ build_cfg
+ block_cloning / addr_xform setup
+ build_ssa
+ gvn
+ copy_prop
+ dse
+ ssa_dce
+ build_loop_tree + licm
+ pressure_relief
+ make_conventional_ssa
+ ssa_combine
+ undo_ssa
+ jump_opt
+ machinize
+ build_loop_tree
+ coalesce
+ live_info
+ regalloc(with live-range splitting)
+ combine
+ dce
+ opt_emit -> wrapped CGTarget
+
+-O2 finalize before final lowering:
+ opt_inline
+ opt_cleanup(dirty callers)
```
-The level dial selects only the optimization schedule; everything
-else is shared.
+The level dial selects optimization work. The lowering backend should be
+shared by `-O1` and `-O2`; `-O1` must remain the bisection floor for bugs in
+machinize, liveness, allocation, emission, and target-specific rewrite.
-### 1.1 Level 1 — minimal
+---
-Intra: `build_cfg`, `build_ssa` (mem2reg), `jump_opt`. No GVN, no
-LICM, no DSE.
-Inter: none (no inlining; no cleanup iteration).
+## 2. MIR Findings
-Just enough to land code through the SSA pipeline at quality
-comparable to direct `CGTarget` lowering. The point of level 1 is
-isolation: when a bug reproduces here, it is in IR construction,
-SSA, or lowering — not in any of the value-changing passes. It is
-the working bisection floor for the rest of the pipeline.
+### 2.1 Where the Optimizer Lives
-### 1.2 Level 2 — full
+MIR's optimizer is concentrated in `mir-gen.c`; inlining and MIR-level
+simplification live in `mir.c`; `c2mir/c2mir-driver.c` maps command-line
+`-O` flags to `MIR_gen_set_optimize_level`.
-Intra (per `doc/DESIGN.md` §9.2): `build_cfg`, `build_ssa` (mem2reg),
-`gvn` (incl. constprop, redundant-load elim), `copy_prop` (incl.
-redundant-extension elim), `dse`, `ssa_dce`, `jump_opt`,
-`build_loop_tree` + `licm`, `addr_xform`, `block_cloning`,
-`pressure_relief`.
-Inter: `opt_inline` + `opt_cleanup` (bounded by `-finline-iters=N`,
-default 1).
+Key source facts:
-### 1.3 Equivalence we commit to
+- `MIR_gen_init` defaults `optimize_level` to `2`.
+- `MIR_gen_set_optimize_level(ctx, level)` only stores the requested level in
+ generator context.
+- `c2mir` treats bare `-O` as `-O2`.
+- Current `mir-gen.c` checks `optimize_level >= 2` for the full pass set.
+ Some docs/comments still describe a separate `-O3`, but the current source
+ does not gate additional passes on `>= 3`. For cfree, model `-O2` as MIR's
+ current full source-level pipeline.
-For every case in `test/cg/CORPUS.md` Groups A–Q, building with
-`opt_cgtarget_new(c, target, level)` for `level ∈ {1, 2}` must
-produce the same `test_main` exit code as building against the
-AArch64 target directly. Both levels emit through the SSA →
-machinize → regalloc → emit path, so neither is byte-equivalent to
-level 0 — exit-code equivalence is the contract.
+### 2.2 MIR Level Semantics
-DWARF (W path) equivalence is weaker. Level 1 aims for row-by-row
-parity with level 0 in Group P; level 2 may collapse line rows or
-move locations into loclists when value-changing passes fire.
-Neither is a hard contract until the DWARF cross-level harness lands
-(§3 Phase 5+).
+MIR documentation says:
----
+- Level `0`: only register allocator and machine code generator work.
+- Level `1`: adds code selection. It should produce more compact/faster code
+ than level 0 at nearly the same generation speed.
+- Level `2`: adds common subexpression elimination and conditional constant
+ propagation. MIR docs report level 1 generation speed around 50 percent
+ faster than level 2.
+- Current source-level `>=2`: also includes block cloning, SSA variable
+ renaming, address transformation, DSE, LICM, pressure relief, coalescing,
+ and live-range splitting.
-## 2. Current state inventory
-
-### 2.1 Headers — real
-
-- `src/opt/opt.h` — `opt_cgtarget_new`, plus forward decls for every
- pass listed in `doc/DESIGN.md` §9.2/§9.3/§9.4.
-- `src/opt/ir.h` — `Func`/`Block`/`Inst`, `Val`, multi-def slot for
- calls/CAS/overflow, typed aux structs (`IRCallAux`, `IRGepAux`,
- `IRPhiAux`, `IRAsmAux`, `IRCasAux`, ...), frame-slot and param
- tables on `Func`, value table (`val_def_block`/`val_def_inst`/
- `val_type`).
-- `src/arch/arch.h` — `Reg` is `u32`, wide enough for the unbounded
- virtuals the wrapper hands out; `REG_NONE = 0xffffffffu`. `RegClass`
- has room (`u8 cls` on `Operand`).
-
-### 2.2 Implementations — stubs
-
-- `src/api/stubs.c:84-88` — `opt_cgtarget_new` panics with
- "subsystem not implemented: opt".
-- `src/api/pipeline.c:227-229` — pipeline already wraps when
- `opts->opt_level > 0`. Drop the stub and the path goes live.
-- `src/emu/emu.c:180` — emu also opts in to the wrapper at
- `opt_level > 0`. Same drop-in point.
-- `src/arch/aarch64.c:3009-3011` — `cgtarget_finalize` is a no-op for
- the bare target; the wrapper will install its own `finalize` that
- runs IPO + lowering at level 2.
-
-### 2.3 Producer-side wiring — already in place
-
-- `CGTarget` carries the full vtable the wrapper needs to intercept
- (function lifecycle, frame slots, params, control flow, data
- movement, arith/cmp/convert, calls/return, alloca, varargs,
- setjmp/longjmp, atomics, intrinsics, asm_block, set_loc, finalize).
- Nothing in the vtable assumes a direct backend.
-- `CGTarget.set_loc` is sticky; `Inst.loc` already exists in the IR.
- The wrapper stamps each recorded `Inst` with the most recent
- `set_loc` value, and `opt_emit` plays it back via
- `target->set_loc(target, inst.loc)` before each emit-side call. No
- new plumbing needed.
-- `Type*` interning is global (`Pool global`), so SSA value types and
- call signatures can be compared by pointer identity throughout the
- IR.
-
-### 2.4 Test surface
-
-- `test/cg/CORPUS.md` line 23: "`O` (opt-wrapped) lands once
- `opt_cgtarget` is implemented." This is the equivalence oracle.
-- `test/cg/harness/cg_runner.c` constructs a `CGTarget*` via
- `cgtarget_new`. Adding an `--opt-level N` flag that wraps it with
- `opt_cgtarget_new` is the smallest change to flip every case onto
- the new path.
-- Group R (deferred) — opt-wrapped equivalence — is the dedicated
- pass-by-pass regression suite, but in practice we'd run all of A–Q
- through `--opt-level 1` continuously from Phase 0 onward.
+For cfree:
+
+- `-O1` should optimize the backend mechanics without running SSA passes.
+- `-O2` should be the full quality mode.
+- Do not create a user-visible `-O3` until benchmarks identify a genuinely
+ useful extra tier.
+
+### 2.3 MIR Performance Bar
+
+MIR's README reports `c2m -eg` at about `0.91` geomean performance relative
+to GCC `-O2` on its 15 small C benchmarks, with QBE around `0.65` on the same
+table. Treat those numbers as directional rather than a literal cfree promise:
+MIR has a JIT-oriented IR and mature per-target rewrite logic. Still, they set
+the bar for this project:
+
+- `-O1`: compile-time-focused. It should be much closer to `-O0` compile time
+ than to `-O2`, while removing obvious backend artifacts through combine and
+ DCE.
+- `-O2`: performance-focused. The first external target is at least QBE-class
+ generated-code performance on the MIR `c-benchmarks` set, then iteratively
+ close on MIR's published result.
+- Both levels must preserve the CG corpus exit-code contract before benchmark
+ tuning starts.
---
-## 3. Phased plan
-
-Each phase ends at a green test surface. Phases 0–3 are reversible —
-strip the wrapper back to identity replay and the system still works.
-Phase 4 is the wedge: once the SSA → lowering pipeline is the only
-path to bytes, both levels go through it. Levels 1 and 2 share the
-pipeline from Phase 4 onward and diverge only in which optimization
-passes the schedule includes (§1.1, §1.2).
-
-### Phase 0 — wrapper skeleton + equivalence harness
-
-Goal: replace the stub with a real `opt_cgtarget` that does *nothing
-but forward*, and stand up the test harness that proves it. Without
-the harness, "Phase 0 is correct" has no signal — so they land
-together.
-
-Wrapper:
-
-- New `src/opt/opt.c`: `opt_cgtarget_new` returns a CGTarget whose
- every vtable slot is a one-line `target->method(target, ...)`. No
- IR built.
-- `finalize` calls `target->finalize(target)`, then the wrapper's own
- cleanup (none yet).
-- `destroy` cascades: free wrapper state, then `cgtarget_free(target)`.
-- Pipeline already wires this at `src/api/pipeline.c:227`. No driver
- changes.
-
-Harness:
-
-- Add `--opt-level N` flag to `test/cg/harness/cg_runner.c`. When
- `N > 0`, wrap the constructed `CGTarget*` with
- `opt_cgtarget_new(c, target, N)` before running the case.
-- `test/cg/run.sh`: each case in Groups A–Q runs at `--opt-level 0`,
- `1`, and `2`; exit codes must match across all three. Through
- Phase 3, levels 1 and 2 share the recorder's 1:1 replay path with
- level 2 additionally exercising `build_cfg` + `build_ssa` as a
- dry-run (output discarded). Phase 4 promotes both levels to the
- shared SSA → lower path. Group P (DWARF/W path) stays at level 0
- only for now — opt-level equivalence on DWARF is a Phase 5+
- concern.
-
-Tests: full Groups A–Q D/R/E/J pass at `--opt-level 0` and
-`--opt-level 1`. Any divergence means the forwarding lost data — fix
-it before continuing.
-
-Exit criterion: dual-level corpus is green and stays that way.
-
-### Phase 1 — recording fidelity
-
-Goal: actually build `Func`/`Block`/`Inst` while still emitting
-correct code. Replay 1:1 onto the wrapped target on `func_end`.
-
-- `src/opt/ir.c`: `ir_func_new`, `ir_block_new`, `ir_emit`,
- `ir_emit_multi`, `ir_emit_const_i`, `ir_emit_const_bytes`,
- `ir_frame_slot_new`, `ir_param_add`, `ir_set_terminator`. Arena is
- per-`Func` from `compiler_arena_new` against `Compiler.tu`.
-- `src/opt/opt.c`: each CGTarget method records into the current
- `Func`'s current block. `Reg` returned from `alloc_reg` is the same
- integer as the `Val` it defines (`Val == Reg`, modulo `VAL_NONE = 0`
- / `REG_NONE = 0xffffffff` sentinels). Operand `OPK_REG` carries
- that same integer; replay reads it back and reissues.
-- `clobbers`/`spill_reg`/`reload_reg`/`free_reg` — under unbounded
- virtuals these shouldn't be called. `free_reg` is documented as a
- hint and ignored. The other three indicate CG misuse from the opt
- side; panic loudly.
-- `src/opt/replay.c`: walks a `Func` linearly, dispatches each
- `Inst.op` back to the matching `target->method(...)` call. One
- giant switch keyed on `IROp`. Multi-result ops (`IR_CALL`,
- `IR_ATOMIC_CAS`, `IR_INTRINSIC` for `*_OVERFLOW`) read from
- `defs[0..ndefs)`.
-- `set_loc`: wrapper updates a `pending_loc` field; every subsequent
- `ir_emit` stamps `Inst.loc = pending_loc`. Replay calls
- `target->set_loc(target, inst.loc)` before each emit-side call.
-- `finalize` at level 1: replay each buffered Func, then
- `target->finalize(target)`.
-
-Tests: same corpus, `--opt-level 1`. The IR shape gets stress-tested
-on the tricky cases — Group F bitfields, Group G calls (sret split,
-byval, indirect), Group I alloca, Group J va_*, Group K atomics, Group
-L intrinsics (especially multi-result overflow), Group M (deferred:
-asm_block). Anything that loses data in record→replay shows up here.
-
-Exit criterion: `--opt-level 1` corpus is green and IR-allocated
-arenas are reaped on panic via `compiler_defer`.
-
-### Phase 2 — IR pretty-printer + level-1 peepholes
-
-Goal: be able to look at IR; introduce trivial pre-SSA rewrites.
-
-- `src/opt/ir_print.c`: `ir_func_print(Writer*, const Func*)` for
- ad-hoc debugging and Group R diff oracles. Format unspecified
- beyond "stable enough for golden-file diffs".
-- `cg-runner --opt-level 1 --dump-ir NAME` prints the recorded `Func`
- before replay.
-- A handful of safe linear-tape rewrites: fold `IR_CONST_I` +
- `IR_IADD/ISUB/IMUL` of two constants into a fresh `IR_CONST_I`;
- collapse trivially dead stores immediately overwritten in the same
- block. Each rewrite preserves the recorded→replayable contract.
-
-Tests: corpus stays green. Add hand-picked cases (or new ones) where
-the rewrite fires and the resulting code is observably smaller via
-disassembly diff against `--opt-level 0`.
-
-### Phase 3 — SSA construction, dry-run
-
-Goal: build SSA without consuming it. Catches IR-shape bugs before
-they matter — Phase 4's lowering pipeline depends on the SSA being
-correct, so we shake out construction bugs separately.
-
-- `src/opt/pass_cfg.c`: `opt_build_cfg` — derives
- `Block.preds`/`succ`/`nsucc` from terminators (`IR_BR`, `IR_CONDBR`,
- `IR_CMP_BRANCH`, `IR_RET`, `IR_LONGJMP`, `IR_BREAK_TO`,
- `IR_CONTINUE_TO`, `IR_INTRINSIC{TRAP,UNREACHABLE}`). `IR_SETJMP` is
- a control barrier — splits its block but is not a terminator
- (control falls through).
-- `src/opt/pass_ssa.c`: `opt_build_ssa` — standard dominance-frontier
- algorithm; promotes any `FrameSlot` whose `FSF_ADDR_TAKEN` bit was
- never set (mem2reg folded in per `doc/DESIGN.md` §12). Inserts
- `IR_PHI` instructions with `IRPhiAux` populated.
-- These run at both levels on `func_end`, but their output is
- *discarded* before replay until Phase 4 lands the lowering path.
- Goal at this phase is "no panics on the corpus", not "improved
- code".
-
-Tests: at levels 1 and 2, recorder runs `build_cfg` + `build_ssa`
-then falls back to the recorder's 1:1 replay. Corpus stays green;
-any panic is a real bug (unhandled IROp, malformed CFG from
-goto/switch lowering, address-taken-detection miss).
-
-### Phase 4 — lowering pipeline (the wedge)
-
-Goal: SSA → make_conventional_ssa → undo_ssa → machinize →
-live_info → coalesce → regalloc → combine → dce → opt_emit, replacing
-the recorder's 1:1 replay at *both* levels. From this point, neither
-level falls back to record→replay.
-
-- `src/opt/pass_machinize.c`: `opt_machinize(Func*, Target)` — ABI
- lowering (calls into `TargetABI` for argument/return part
- classification just like CG does today; the result lives in
- `IRCallAux.abi` already), 2-op forms, calling-convention spill,
- prolog/epilog placeholders.
-- `src/opt/pass_live.c`: `opt_live_info` — standard backwards dataflow.
-- `src/opt/pass_regalloc.c`: `opt_regalloc` — linear scan, no live-range
- splitting initially. Allocates physical registers from the same
- pool the AArch64 backend uses for scratch (i.e. asks the wrapped
- target for clobbers and the param/sret physical mapping).
-- `src/opt/pass_emit.c`: `opt_emit(Compiler*, Func*, CGTarget*)` —
- walks the lowered IR and drives the wrapped target via the
- emit-side CGTarget surface. Inst → target call, but operands are
- now physical Operands (post-RA) and prolog/epilog/spill insertion
- has happened.
-- Wrapper's `func_end` runs `build_cfg` + `build_ssa` +
- `make_conventional_ssa` + `undo_ssa` + the lowering pipeline at
- both levels. No optimization passes between SSA build and undo at
- this phase — the lowering pipeline does all the work, so we
- isolate lowering bugs from optimization-pass bugs. Level 1 and
- level 2 are functionally identical at the end of Phase 4.
-
-Tests: `--opt-level 1` and `--opt-level 2` corpora green. The
-recorder's old 1:1 replay path is removed; SSA → lower is the only
-path to bytes.
-
-Exit criterion: levels 1 and 2 both green, no regressions in the
-level-0 suite.
-
-### Phase 5 — intra-procedural passes, level-gated
-
-Goal: populate the optimization schedule. All new passes land in the
-*level-2* schedule only; level 1's schedule stays at the Phase 4 set
-(`build_cfg`, `build_ssa`, `jump_opt` once it's wired in) so it
-remains the bisection floor for IR/SSA/lowering bugs.
-
-Order (tracking `doc/DESIGN.md` §9.2; each pass lands behind a flag
-so the equivalence harness can run with just-this-pass to localize
-bugs):
-
-1. `opt_jump_opt` — moves into the level-1 schedule too (cheap,
- high-value, doesn't change values).
-2. `opt_gvn` (with constprop, redundant-load elim folded in) —
- level 2 only.
-3. `opt_copy_prop` (with redundant-extension elim) — level 2 only.
-4. `opt_dse` — level 2 only.
-5. `opt_ssa_dce` — level 2 only.
-6. `opt_build_loop_tree` + `opt_licm` — level 2 only.
-7. `opt_addr_xform` — level 2 only.
-8. `opt_block_cloning` — level 2 only.
-9. `opt_pressure_relief` — level 2 only.
-10. Lowering-time additions (run at both levels): `opt_coalesce`,
- `opt_combine`, `opt_dce` (post-RA), live-range splitting in
- `opt_regalloc`.
-
-No UB-exploiting transformations (`doc/DESIGN.md` §9): no
-signed-overflow-is-unreachable, no shift-by-≥-width-is-unreachable,
-no division-by-zero-is-unreachable, no null-deref-is-unreachable.
-
-### Phase 6 — inter-procedural (level 2 only)
-
-Goal: cross-function inlining + cleanup iteration. Level 1 stays
-intra-procedural — the inliner is the largest source of correctness
-risk and the largest divergence from level-0 codegen, so it earns
-its own gating.
-
-- `opt_inline`: bottom-up call-graph walk. SCCs (mutual recursion)
- skipped. Heuristic: instruction count + call site count. Inlining
- recognizes the `__cfree_setjmp` symbol by name as returns-twice and
- refuses to inline across.
-- `opt_cleanup`: subset re-run (gvn, copy_prop, ssa_dce, jump_opt,
- licm if loops, addr_xform if uses remain).
-- `-finline-iters=N` knob (default 1, hard cap enforced inside the
- wrapper).
-
-Tests: corpus + a small set of cases hand-crafted to require
-inlining-plus-cleanup to optimize (e.g. a small wrapper around a
-constant returner whose body becomes a constant only after inlining).
+## 3. MIR Pipeline, Source Order
+
+This is the effective order in current `mir-gen.c` for full function
+generation.
+
+### 3.1 Common Setup
+
+MIR duplicates the function instruction list, allocates a function CFG object,
+and runs `build_func_cfg`.
+
+`build_func_cfg`:
+
+- Creates synthetic entry and exit blocks.
+- Splits basic blocks at labels, branches, returns, property branches, and
+ fallthrough boundaries.
+- Converts register operands to internal variable operands.
+- Marks address-producing instructions and addressable regs.
+- Adds edges for direct branches, switches, fallthrough, and possible indirect
+ jump targets.
+- Marks calls on blocks so memory availability and liveness can treat them as
+ barriers.
+- Removes unreachable blocks when `optimize_level > 0`.
+
+cfree's `opt_build_cfg` should follow this shape: construct explicit
+entry/exit blocks, keep call/memory barrier metadata on blocks, and make
+unreachable cleanup part of `-O1+`, not a late optional cleanup.
+
+### 3.2 Full `-O2` Pre-Lowering Passes
+
+MIR full mode then runs:
+
+1. `clone_bbs` before SSA when there are no address instructions.
+ MIR clones cold blocks after return back into hot predecessors when this
+ exposes optimization in the hot path. It uses a bounded growth factor and
+ skips back edges.
+
+2. `build_ssa`.
+ MIR uses a Braun-style construction: it creates optimized maximal SSA,
+ minimizes redundant phis, builds def-use edges, and optionally renames
+ variables. cfree should copy the useful property, not the representation:
+ every use should have cheap access to its defining inst, and every def
+ should have cheap access to its uses.
+
+3. `addr_xform` when address instructions exist.
+ MIR tries to eliminate `ADDR` pseudos that only feed memory operands. If an
+ address pseudo must remain addressable, MIR converts uses to memory loads
+ and stores, rebuilds SSA, then clones blocks. For cfree, this maps to
+ folding GEP/address-of chains into target memory operands where the target
+ can encode them, while keeping address-taken frame slots in memory.
+
+4. `gvn`.
+ MIR computes memory availability, dominators, and value numbers in postorder.
+ This pass includes constant propagation, branch folding, redundant expression
+ elimination, redundant load elimination, and store/load reuse. It uses alias
+ and nonalias tags on memory operands. Calls clear or conservatively restrict
+ memory availability.
+
+5. `copy_prop`.
+ MIR propagates copies, folds multiply/divide by powers of two, and removes
+ redundant extension chains. cfree should keep this as a separate pass after
+ GVN because it relies on SSA edges and target legality.
+
+6. `dse`.
+ MIR computes memory liveness by memory location, handles alloca escape
+ through calls, and removes stores whose memory location is not live. This
+ depends on GVN's memory numbering and alias classification.
+
+7. `ssa_dce`.
+ MIR deletes SSA instructions with unused outputs, while preserving calls,
+ allocas, varargs, returns, frame/stack effects, overflow sequences with live
+ overflow branches, and other side-effecting ops.
+
+8. `build_loop_tree + licm`.
+ MIR builds natural loops and preheaders. LICM skips branches, phis, calls,
+ memory, varargs, allocas, and potentially trapping divisions/mods. It is
+ pressure-sensitive: cheap single instructions are not moved unless their
+ inputs are worth moving too; multiplies are considered expensive enough to
+ move.
+
+9. `pressure_relief`.
+ MIR moves single-use immediate or constant-like moves closer to their use
+ when doing so reduces pressure and does not move work into a loop.
+
+10. `make_conventional_ssa`.
+ MIR lowers phis into edge/block moves, splitting critical edges when needed.
+
+11. `ssa_combine`.
+ MIR combines compare+branch pairs and folds address components into memory
+ operands while SSA edges are still available.
+
+12. `undo_ssa`.
+ MIR removes phi nodes and SSA edges.
+
+13. `jump_opt`.
+ MIR removes unreachable blocks, empty blocks, branches to the next
+ instruction, chains of labels, and branches to jumps.
+
+### 3.3 Lowering and Allocation
+
+After pre-lowering optimization, MIR runs the machine-dependent and allocation
+pipeline:
+
+1. `target_machinize`.
+ This performs ABI lowering, call lowering, target two-operand forms, and
+ other machine-dependent normalization.
+
+2. Build a loop tree for `-O1+`.
+ MIR uses loop depth for frequency/pressure estimates in liveness,
+ coalescing, and allocation.
+
+3. `collect_moves`, move-only liveness, conflict matrix, and `coalesce` at
+ `-O2`.
+ MIR aggressively coalesces move-related regs, prioritizing moves by loop
+ frequency and checking conflicts in a bounded matrix.
+
+4. Full `live_info`.
+ MIR computes `live_in`/`live_out`, register frequencies, live lengths, and
+ block pressure. It understands phis before SSA destruction and call-used
+ hard registers after machinize.
+
+5. `reg_alloc`.
+ MIR builds live ranges and assigns pseudos by priority: tied hard regs first,
+ then higher frequency, then shorter live length. At levels below 2 it uses a
+ simplified allocator. At full level it can split live ranges and place
+ spills/restores on edges or block boundaries.
+
+6. `rewrite`.
+ MIR rewrites pseudo regs to hard regs or stack slots, inserts reloads and
+ stores, handles call-clobbered saves, and deletes noop moves.
+
+7. `combine`.
+ This is target-aware code selection. It substitutes safe single-use moves,
+ collapses extension chains, commutes operands to expose legal combinations,
+ removes internal labels, and validates each rewrite with target legality.
+
+8. `dead_code_elimination`.
+ MIR performs post-RA DCE using live-out sets and preserves side effects.
+
+9. Prolog/epilog and machine instruction generation.
+
+cfree should keep this exact split: combine after register allocation is not a
+replacement for SSA `copy_prop`; it is a target legality/code-selection pass.
+
+### 3.4 Inlining
+
+MIR inlining is in `mir.c`, before generator optimization. It processes direct
+call-like instructions after MIR simplification, not inside `mir-gen.c`.
+
+Important MIR inliner behavior:
+
+- It can inline direct normal calls and explicit inline calls when the callee
+ is available.
+- It skips unresolved externals, self recursion, label-reference functions,
+ varargs, `jret`, and over-budget callees.
+- Default size budgets are small: normal calls have a much smaller cap than
+ explicit inline calls.
+- It limits caller growth relative to original caller size.
+- It renames callee registers, duplicates labels, materializes argument moves,
+ rewrites returns, and handles simple top-level constant-size `alloca`.
+
+cfree should run inlining on retained `Func` IR at `finalize` for `-O2`, then
+run cleanup on dirty callers. SCCs can be skipped for v1. The inliner must
+refuse setjmp/longjmp-sensitive and vararg cases until their semantics are
+explicitly tested.
---
-## 4. Module layout
+## 4. cfree Level Schedules
+
+### 4.1 `-O1` Minimal Schedule
+
+`-O1` is not "half of SSA." It is MIR's cheap backend optimization tier.
+
+Required `-O1` schedule:
```
-src/opt/
- opt.h (already exists)
- ir.h (already exists)
- opt.c wrapper, vtable bindings, finalize dispatch
- ir.c Func/Block/Inst plumbing, val table, arenas
- ir_print.c pretty-printer (Phase 2)
- replay.c level-1 replay (Phase 1)
- pass_cfg.c opt_build_cfg
- pass_ssa.c opt_build_ssa, opt_make_conventional_ssa,
- opt_ssa_combine, opt_undo_ssa
- pass_gvn.c opt_gvn
- pass_copy.c opt_copy_prop
- pass_dse.c opt_dse
- pass_dce.c opt_ssa_dce, opt_dce
- pass_jump.c opt_jump_opt
- pass_loop.c build_loop_tree + opt_licm
- pass_addr.c opt_addr_xform
- pass_clone.c opt_block_cloning
- pass_pressure.c opt_pressure_relief
- pass_machinize.c opt_machinize
- pass_live.c opt_live_info
- pass_coalesce.c opt_coalesce
- pass_regalloc.c opt_regalloc
- pass_combine.c opt_combine
- pass_emit.c opt_emit (lowering replay)
- pass_inline.c opt_inline + opt_cleanup
+build_cfg
+machinize
+build_loop_tree
+live_info
+regalloc(allow_live_range_split = false)
+combine
+dce
+opt_emit
```
-This split lets each phase land as a small set of new files plus
-edits to `opt.c`. No file accumulates more than one pass.
+Allowed `-O1` details:
+
+- Remove unreachable blocks during CFG construction.
+- Use loop depth only for frequency/pressure costing.
+- Run target-aware combine after register allocation.
+- Delete noop moves and dead post-RA definitions.
+- Use a priority-based allocator, but without coalescing and live-range
+ splitting in the first production version.
+
+Forbidden at `-O1`:
+
+- `build_ssa`
+- `gvn`
+- `copy_prop`
+- `dse`
+- `ssa_dce`
+- `licm`
+- `pressure_relief`
+- `coalesce`
+- `opt_inline`
+
+This keeps `-O1` useful and debuggable: if `-O1` fails, the bug is in
+recording, CFG, machinize, liveness, allocation, combine, DCE, or emission.
+
+### 4.2 `-O2` Full Schedule
+
+`-O2` uses MIR's full current source pipeline:
+
+```
+build_cfg
+if no address-transform candidates:
+ block_cloning
+build_ssa
+if address-transform candidates:
+ addr_xform
+ undo_ssa
+ block_cloning
+ build_ssa
+gvn
+copy_prop
+dse
+ssa_dce
+build_loop_tree
+licm
+pressure_relief
+make_conventional_ssa
+ssa_combine
+undo_ssa
+jump_opt
+machinize
+build_loop_tree
+collect_moves
+live_info(move vars only)
+coalesce
+live_info(all vars)
+regalloc(allow_live_range_split = true)
+combine
+dce
+opt_emit
+```
+
+Then, once inlining exists:
+
+```
+finalize:
+ opt_inline(FuncSet, max_iters)
+ for each dirty caller:
+ opt_cleanup
+ lower every dirty/not-yet-emitted Func through the full schedule
+```
+
+`opt_cleanup` should re-run the passes that inlining exposes value for:
+`build_cfg`, `build_ssa`, `gvn`, `copy_prop`, `dse`, `ssa_dce`, `licm` when
+loops exist, `pressure_relief`, `make_conventional_ssa`, `ssa_combine`,
+`undo_ssa`, and `jump_opt`.
+
+### 4.3 Transformations We Do Not Take
+
+`doc/DESIGN.md` is still binding: no UB-exploiting transforms. Do not assume
+signed overflow, shift-by-width, division by zero, or null dereference are
+unreachable. MIR is careful around potentially trapping division/modulo in
+LICM; cfree should be at least as conservative.
+
+MIR property instructions and lazy basic-block versioning are out of scope for
+the first optimized backend. They are a separate JIT specialization feature,
+not required for normal `cfree` object/executable codegen.
---
-## 5. Cross-cutting decisions
+## 5. Current cfree State
+
+The optimizer is no longer just a stub:
+
+- `src/opt/opt.c` implements the `CGTarget` wrapper.
+- `src/opt/ir.c` and `src/opt/ir.h` implement the recorded IR container.
+- `src/opt/pass_cfg.c` implements CFG construction.
+- `src/opt/pass_ssa.c` implements current SSA construction.
+- `test/cg/run.sh` supports `CFREE_OPT_LEVELS`; default is `0 1`, while
+ level `2` is opt-in today.
+
+Current behavior:
+
+- Level `1` records and replays 1:1 into the wrapped target.
+- Level `2` runs `opt_build_cfg + opt_build_ssa` as a dry run, discards that
+ result, then replays 1:1.
+- No real optimized lowering path exists yet.
+
+The next implementation work should replace replay with the MIR-style lowering
+pipeline, first at `-O1`, then at `-O2`.
+
+---
+
+## 6. Implementation Phases
+
+Each phase should end with targeted green tests. Prefer red-green tests for
+the exact pass being introduced, then expand through the CG corpus.
+
+### Phase A - Production `-O1` Lowering
-### 5.1 Reg ↔ Val identity
+Goal: make `-O1` stop replaying and emit through the optimized backend path
+without SSA value passes.
-Each call to `wrapper->alloc_reg(class, type)` allocates a fresh `Val`
-in the current `Func`'s value table and returns its integer as
-`Reg`. `Operand{kind=OPK_REG, v.reg=R}` is interpreted on replay as
-"the value defined at SSA id R." This collapses two parallel ID
-spaces into one and avoids a side mapping. `REG_NONE` (`0xffffffff`)
-and `VAL_NONE` (`0`) live at opposite ends of the range — neither
-is allocated.
+Implement:
-### 5.2 Frame slots stay frame slots until SSA construction
+- `opt_machinize`
+- `opt_live_info`
+- `opt_regalloc(..., false)`
+- `opt_combine`
+- `opt_dce`
+- `opt_emit`
-`cg_local`/`cg_param` always allocate frame slots through
-`CGTarget.frame_slot`/`CGTarget.param`. The wrapper records them
-verbatim. Loads/stores against frame slots become `IR_LOAD`/`IR_STORE`
-with `MemAccess.alias = ALIAS_LOCAL`. `build_ssa` (Phase 3) is the
-only place that promotes non-`FSF_ADDR_TAKEN` slots to SSA values.
-Address-taken slots stay in memory and are reasoned about through
-`MemAccess` alias roots (`doc/DESIGN.md` §5.6).
+Keep the allocator simple but not naive:
-This keeps the recorder dumb and pushes all decisions into one pass.
+- Build live ranges from block liveness.
+- Sort allocation candidates by tied hard-reg requirement, frequency, live
+ length, then stable id.
+- Assign hard regs when possible, stack slots otherwise.
+- Rewrite pseudos into hard regs/stack slots with reserved scratch regs for
+ reload/store addressing.
+- Delete noop moves after rewrite.
-### 5.3 Method intercepts the wrapper rejects
+Exit criteria:
-Under unbounded virtuals, these CG-side mechanics are nonsense:
+- `CFREE_OPT_LEVELS="0 1" make test-cg` passes for targeted AArch64 cases.
+- Add focused allocation cases for call-clobber saves, stack spills, tied
+ hard regs from inline asm, and values live through branches.
-- `clobbers` — meaningless without a finite physical register set.
-- `spill_reg` / `reload_reg` — CG drives these for the -O0 value
- stack; opt's wrapper has no value stack.
+### Phase B - Full Allocation Infrastructure
-`free_reg` is already a hint; the wrapper ignores it.
+Goal: bring `-O2` allocation quality up to MIR's model before value passes
+start changing code aggressively.
-The first three should panic with a clear "called X on opt_cgtarget"
-message. They indicate CG is being driven in -O0 mode while opt is
-attached, which means a wiring bug.
+Implement:
-### 5.4 set_loc fan-out
+- Move collection.
+- Move-only liveness.
+- Conflict matrix for move-related regs.
+- Aggressive coalescing.
+- Live-range splitting and edge/block spill placement.
-The wrapper's `set_loc` updates a single `pending_loc` field on the
-recorder state. `ir_emit` stamps `Inst.loc = pending_loc`. The
-lowering replay (Phase 4) calls `target->set_loc(target, inst.loc)`
-before each emit-side call. This is exactly the protocol
-`doc/DWARF.md` §3 expects, and it preserves Group P (set_loc/debug)
-behavior at level 2.
+Exit criteria:
-### 5.5 No UB-exploiting passes
+- `-O1` remains green.
+- `-O2` can use the same lowering path with coalescing/splitting enabled, even
+ if all SSA value passes are still disabled.
-`doc/DESIGN.md` §9 is binding. opt may not assume signed overflow,
-shift-by-≥-width, division by zero, or null deref are unreachable.
-WASM traps on the first three deterministically; real targets are
-also more predictable this way. The "70% of -O2" goal is achievable
-without these transformations.
+### Phase C - SSA Value and Memory Passes
+
+Goal: enable the MIR full pre-lowering schedule for `-O2`.
+
+Implement in order:
+
+1. `opt_block_cloning`
+2. `opt_addr_xform`
+3. `opt_gvn`
+4. `opt_copy_prop`
+5. `opt_dse`
+6. `opt_ssa_dce`
+7. `opt_licm`
+8. `opt_pressure_relief`
+9. `opt_make_conventional_ssa`
+10. `opt_ssa_combine`
+11. `opt_undo_ssa`
+12. `opt_jump_opt`
+
+Do not batch these into one landing. Each pass needs a pass-local corpus case
+that fails red without the pass or its bug fix.
+
+Exit criteria:
+
+- `CFREE_OPT_LEVELS="0 1 2" make test-cg` passes for the relevant arch.
+- Pass-local dump tests prove the intended rewrite fires.
+
+### Phase D - Inlining and Cleanup
+
+Goal: add MIR-style small direct-call inlining for `-O2`.
+
+Implement:
+
+- Bottom-up call graph over retained `Func` IR.
+- SCC skip for v1.
+- Size budgets modeled on MIR: small budget for normal calls, larger budget
+ for explicit inline candidates once cfree tracks that source property.
+- Caller growth cap.
+- Register/value remapping, block/label remapping, parameter materialization,
+ return rewrite, and debug location preservation.
+- Conservative refusal for varargs, setjmp/longjmp, inline asm with hard
+ constraints, and functions whose frame/alloca behavior is not yet modeled.
+
+After each inline iteration, run `opt_cleanup` on dirty callers.
+
+Exit criteria:
+
+- Small wrapper-call cases optimize after inline+cleanup.
+- Recursive and mutually recursive cases are unchanged and correct.
+
+### Phase E - Benchmark Closure
+
+Goal: tune by measurement, not by adding passes speculatively.
+
+Benchmark set:
+
+- MIR `c-benchmarks`: `array`, `binary-trees`, `funnkuch-reduce`,
+ `hash`, `hash2`, `heapsort`, `lists`, `mandelbrot`, `matrix`,
+ `method-call`, `nbody`, `sieve`, `spectral-norm`, `strcat`, and any
+ benchmark that requires only supported libc/runtime features.
+- cfree-specific stress cases for ABI, TLS, atomics, and inline asm.
+
+Measure:
+
+- Compile wall time for `-O0`, `-O1`, `-O2`.
+- Executable run time against clang/gcc `-O2` when available.
+- Code size for hot text sections.
+- Pass counters: removed GVN expressions, folded branches, removed stores,
+ coalesced moves, spills/restores, split ranges, post-RA deleted moves.
+
+Target:
+
+- `-O1` should be the fast optimized tier and materially faster to compile
+ than `-O2`.
+- `-O2` should first reach QBE-class performance on the benchmark set, then
+ close toward MIR's published `c2m -eg` geomean.
+
+---
+
+## 7. Module Layout
+
+Keep the MIR pass boundaries as separate cfree modules:
+
+```
+src/opt/
+ opt.c wrapper, vtable bindings, schedule dispatch
+ ir.c Func/Block/Inst plumbing
+ ir_print.c stable dumps for pass tests
+ pass_cfg.c CFG, unreachable cleanup
+ pass_clone.c block cloning
+ pass_ssa.c SSA build, conventional SSA, undo SSA
+ pass_addr.c address transformation
+ pass_gvn.c GVN, constprop, redundant-load elimination
+ pass_copy.c copy propagation, extension cleanup
+ pass_dse.c dead store elimination
+ pass_dce.c SSA DCE and post-RA DCE
+ pass_loop.c loop tree and LICM
+ pass_pressure.c pressure relief
+ pass_jump.c jump optimization
+ pass_machinize.c target ABI and machine-form lowering
+ pass_live.c liveness, pressure, live ranges
+ pass_coalesce.c move collection and coalescing
+ pass_regalloc.c assignment, rewrite, splitting
+ pass_combine.c target-aware code selection
+ pass_emit.c drive wrapped CGTarget
+ pass_inline.c inlining and cleanup
+```
+
+No pass should reach into the wrapped target's private implementation. Target
+specifics belong behind `Target`, `TargetABI`, or explicit helper hooks.
+
+---
+
+## 8. Pass Invariants
+
+- No VLAs and no global optimizer state. MIR uses generator context; cfree
+ should hang all pass state off `Compiler`, `Func`, `FuncSet`, or explicit
+ pass contexts.
+- `Reg` and `Val` identity remains valid while recording. Lowering may map
+ Vals to new physical registers or stack slots, but that mapping belongs to
+ allocation state.
+- Frame slots remain frame slots until a pass proves promotion is legal.
+- Calls are memory barriers unless alias information proves otherwise.
+- Inline asm is side-effecting and may pin hard regs; the allocator and
+ inliner must treat it conservatively.
+- `set_loc` is sticky on recorded insts. Every emitted instruction must forward
+ the active location to the wrapped target before machine emission.
+- Every pass either preserves CFG metadata or invalidates/rebuilds it before
+ the next consumer.
---
-## 6. Open questions
-
-- **Linear scan vs graph coloring at -O2.** MIR uses fast linear
- scan. The "70% of -O2" target is achievable with it. Graph coloring
- is a follow-up if a benchmark gap demands it.
-- **Loop tree at lowering vs after intra passes.** §9.4 builds the
- loop tree inside the lowering pipeline (used by RA for split
- decisions). §9.2 also builds one for LICM. Open: do these share an
- arena-owned structure that survives both, or are they computed
- twice? Rebuilding is cheap; sharing requires invalidation
- discipline. Default to rebuilding until it matters.
-- **Stackifier vs Relooper for WASM.** `doc/DESIGN.md` §14 leaves
- this open. opt's flat-CFG IR makes either viable. Decision deferred
- to when the WASM backend lands; the IR is structure-agnostic.
-- **Inlining heuristic tuning.** Bench-driven; the knob
- (`-finline-iters=N`, instruction-count budget) is exposed from
- Phase 6 onward.
-- **IR snapshot/golden-file format.** Phase 2's pretty-printer needs
- a stable enough format for diff-based regression tests in Group R.
- Decide format when Phase 2 lands.
+## 9. Test Strategy
+
+Targeted commands:
+
+```
+CFREE_OPT_LEVELS="0 1" make test-cg
+CFREE_OPT_LEVELS="0 1 2" make test-cg
+CFREE_TEST_FILTER=<case> CFREE_OPT_LEVELS="0 1 2" make test-cg
+```
+
+Use pass dumps for red-green tests:
+
+- CFG: block/successor/predecessor shape.
+- SSA: phi placement and def-use chains.
+- GVN: redundant expression/load replaced and constant branch folded.
+- DSE: dead store removed while escaping stores remain.
+- LICM: safe invariant moved; trapping division and memory ops remain.
+- RA: expected spill/reload/coalescing counters.
+- Combine: target-legal fused instruction shape.
+
+Before broad runs, redirect output to a file and inspect the failure slice:
+
+```
+CFREE_TEST_FILTER=<case> CFREE_OPT_LEVELS="0 1 2" make test-cg >build/opt-test.log 2>&1
+tail -200 build/opt-test.log
+```
+
+The full CG corpus remains the equivalence oracle: levels `1` and `2` must
+produce the same observable `test_main` result as level `0`. DWARF equivalence
+can start weaker, but `set_loc` forwarding must not regress line-row emission.