kit

kit
git clone https://git.ryansepassi.com/git/kit.git
Log | Files | Refs | README

commit 61ac2c5548bcea594508b8e586154d02ddea4f9e
parent 19a555939630998634e929b779ac915593859f36
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Mon, 11 May 2026 18:24:42 -0700

OPT.md plan update

Diffstat:
Mdoc/OPT.md | 1021++++++++++++++++++++++++++++++++++++++++++++++---------------------------------
1 file changed, 601 insertions(+), 420 deletions(-)

diff --git a/doc/OPT.md b/doc/OPT.md @@ -1,468 +1,649 @@ -# OPT — implementation plan - -Scope: what it takes to land cfree's optimizer behind the -"presents-as-CGTarget" contract described in `doc/DESIGN.md` §5.1, §9. -The producer side is the wrapper plus IR (`src/opt/opt.h`, -`src/opt/ir.h`); the consumer is the wrapped target CGTarget and (at --O2) the lowering pipeline that drives it. The whole `test/cg` corpus -already serves as the equivalence oracle — every case is built against -`CGTarget` directly today, so the same case with `opt_cgtarget` -inserted between `cg-runner` and the AArch64 backend must produce the -same observable behavior. - -Today the headers are real, the implementation is a single panic stub, -and the pipeline is wired to call into it the moment we drop the stub. -This plan starts at "wrapper that records nothing, replays directly" -and builds out to a real intra+IPO pipeline. +# OPT - MIR-informed implementation plan + +Scope: deliver cfree's optimized backend behind the `CGTarget` wrapper +contract described in `doc/DESIGN.md` Section 9, using MIR JIT's optimizer as +the model for phase order, level gating, and performance targets. + +This document is based on a source investigation of `~/tmp/mir/`, mainly +`mir-gen.c`, `mir.c`, `mir.h`, `mir-gen.h`, `MIR.md`, `README.md`, and the +`c2mir` driver. The important takeaway is that MIR keeps the optimizer short: +`-O1` is a fast lowering/code-selection path, while the expensive +SSA/value/memory/loop work is reserved for `-O2`. --- -## 1. What working opt must look like +## 1. Target Shape + +cfree exposes three levels: -A single shared engine. CG drives `opt_cgtarget`; every CGTarget call -lands as exactly one `Inst` in a per-function flat-CFG IR (§5.1). On -`func_end` (intra-procedural) and `finalize` (inter-procedural), the -wrapper runs an optimization schedule, lowers through machinize → -regalloc → emit, and drives the wrapped target CGTarget to produce -machine code. +- `-O0`: direct backend path. CG drives the target `CGTarget` immediately. +- `-O1`: MIR-style minimal optimizer. Record IR, lower it through the new + backend path, run liveness, simplified register allocation, post-RA combine, + and DCE. No SSA value passes and no inlining. +- `-O2`: MIR-style full optimizer. Add SSA, GVN/constant propagation, + redundant load elimination, copy propagation, dead store elimination, + SSA DCE, LICM, pressure relief, address transformation, block cloning, + coalescing, live-range splitting, and inlining plus cleanup. + +The core contract stays simple: ``` -parse → cg → opt_cgtarget {record into Func} → - on func_end: build_cfg → build_ssa → <intra schedule> → <lower> - on finalize: <inter schedule> → for each dirty Func: <lower> - → wrapped target CGTarget → MCEmitter → ObjBuilder - - <lower> = make_conventional_ssa → ssa_combine → undo_ssa → - machinize → live_info → coalesce → regalloc → combine → - dce → opt_emit +parse -> cg -> opt_cgtarget { record Func IR } + +-O1 func_end/finalize: + build_cfg + machinize + build_loop_tree + live_info + regalloc(simplified) + combine + dce + opt_emit -> wrapped CGTarget + +-O2 func_end/finalize: + build_cfg + block_cloning / addr_xform setup + build_ssa + gvn + copy_prop + dse + ssa_dce + build_loop_tree + licm + pressure_relief + make_conventional_ssa + ssa_combine + undo_ssa + jump_opt + machinize + build_loop_tree + coalesce + live_info + regalloc(with live-range splitting) + combine + dce + opt_emit -> wrapped CGTarget + +-O2 finalize before final lowering: + opt_inline + opt_cleanup(dirty callers) ``` -The level dial selects only the optimization schedule; everything -else is shared. +The level dial selects optimization work. The lowering backend should be +shared by `-O1` and `-O2`; `-O1` must remain the bisection floor for bugs in +machinize, liveness, allocation, emission, and target-specific rewrite. -### 1.1 Level 1 — minimal +--- -Intra: `build_cfg`, `build_ssa` (mem2reg), `jump_opt`. No GVN, no -LICM, no DSE. -Inter: none (no inlining; no cleanup iteration). +## 2. MIR Findings -Just enough to land code through the SSA pipeline at quality -comparable to direct `CGTarget` lowering. The point of level 1 is -isolation: when a bug reproduces here, it is in IR construction, -SSA, or lowering — not in any of the value-changing passes. It is -the working bisection floor for the rest of the pipeline. +### 2.1 Where the Optimizer Lives -### 1.2 Level 2 — full +MIR's optimizer is concentrated in `mir-gen.c`; inlining and MIR-level +simplification live in `mir.c`; `c2mir/c2mir-driver.c` maps command-line +`-O` flags to `MIR_gen_set_optimize_level`. -Intra (per `doc/DESIGN.md` §9.2): `build_cfg`, `build_ssa` (mem2reg), -`gvn` (incl. constprop, redundant-load elim), `copy_prop` (incl. -redundant-extension elim), `dse`, `ssa_dce`, `jump_opt`, -`build_loop_tree` + `licm`, `addr_xform`, `block_cloning`, -`pressure_relief`. -Inter: `opt_inline` + `opt_cleanup` (bounded by `-finline-iters=N`, -default 1). +Key source facts: -### 1.3 Equivalence we commit to +- `MIR_gen_init` defaults `optimize_level` to `2`. +- `MIR_gen_set_optimize_level(ctx, level)` only stores the requested level in + generator context. +- `c2mir` treats bare `-O` as `-O2`. +- Current `mir-gen.c` checks `optimize_level >= 2` for the full pass set. + Some docs/comments still describe a separate `-O3`, but the current source + does not gate additional passes on `>= 3`. For cfree, model `-O2` as MIR's + current full source-level pipeline. -For every case in `test/cg/CORPUS.md` Groups A–Q, building with -`opt_cgtarget_new(c, target, level)` for `level ∈ {1, 2}` must -produce the same `test_main` exit code as building against the -AArch64 target directly. Both levels emit through the SSA → -machinize → regalloc → emit path, so neither is byte-equivalent to -level 0 — exit-code equivalence is the contract. +### 2.2 MIR Level Semantics -DWARF (W path) equivalence is weaker. Level 1 aims for row-by-row -parity with level 0 in Group P; level 2 may collapse line rows or -move locations into loclists when value-changing passes fire. -Neither is a hard contract until the DWARF cross-level harness lands -(§3 Phase 5+). +MIR documentation says: ---- +- Level `0`: only register allocator and machine code generator work. +- Level `1`: adds code selection. It should produce more compact/faster code + than level 0 at nearly the same generation speed. +- Level `2`: adds common subexpression elimination and conditional constant + propagation. MIR docs report level 1 generation speed around 50 percent + faster than level 2. +- Current source-level `>=2`: also includes block cloning, SSA variable + renaming, address transformation, DSE, LICM, pressure relief, coalescing, + and live-range splitting. -## 2. Current state inventory - -### 2.1 Headers — real - -- `src/opt/opt.h` — `opt_cgtarget_new`, plus forward decls for every - pass listed in `doc/DESIGN.md` §9.2/§9.3/§9.4. -- `src/opt/ir.h` — `Func`/`Block`/`Inst`, `Val`, multi-def slot for - calls/CAS/overflow, typed aux structs (`IRCallAux`, `IRGepAux`, - `IRPhiAux`, `IRAsmAux`, `IRCasAux`, ...), frame-slot and param - tables on `Func`, value table (`val_def_block`/`val_def_inst`/ - `val_type`). -- `src/arch/arch.h` — `Reg` is `u32`, wide enough for the unbounded - virtuals the wrapper hands out; `REG_NONE = 0xffffffffu`. `RegClass` - has room (`u8 cls` on `Operand`). - -### 2.2 Implementations — stubs - -- `src/api/stubs.c:84-88` — `opt_cgtarget_new` panics with - "subsystem not implemented: opt". -- `src/api/pipeline.c:227-229` — pipeline already wraps when - `opts->opt_level > 0`. Drop the stub and the path goes live. -- `src/emu/emu.c:180` — emu also opts in to the wrapper at - `opt_level > 0`. Same drop-in point. -- `src/arch/aarch64.c:3009-3011` — `cgtarget_finalize` is a no-op for - the bare target; the wrapper will install its own `finalize` that - runs IPO + lowering at level 2. - -### 2.3 Producer-side wiring — already in place - -- `CGTarget` carries the full vtable the wrapper needs to intercept - (function lifecycle, frame slots, params, control flow, data - movement, arith/cmp/convert, calls/return, alloca, varargs, - setjmp/longjmp, atomics, intrinsics, asm_block, set_loc, finalize). - Nothing in the vtable assumes a direct backend. -- `CGTarget.set_loc` is sticky; `Inst.loc` already exists in the IR. - The wrapper stamps each recorded `Inst` with the most recent - `set_loc` value, and `opt_emit` plays it back via - `target->set_loc(target, inst.loc)` before each emit-side call. No - new plumbing needed. -- `Type*` interning is global (`Pool global`), so SSA value types and - call signatures can be compared by pointer identity throughout the - IR. - -### 2.4 Test surface - -- `test/cg/CORPUS.md` line 23: "`O` (opt-wrapped) lands once - `opt_cgtarget` is implemented." This is the equivalence oracle. -- `test/cg/harness/cg_runner.c` constructs a `CGTarget*` via - `cgtarget_new`. Adding an `--opt-level N` flag that wraps it with - `opt_cgtarget_new` is the smallest change to flip every case onto - the new path. -- Group R (deferred) — opt-wrapped equivalence — is the dedicated - pass-by-pass regression suite, but in practice we'd run all of A–Q - through `--opt-level 1` continuously from Phase 0 onward. +For cfree: + +- `-O1` should optimize the backend mechanics without running SSA passes. +- `-O2` should be the full quality mode. +- Do not create a user-visible `-O3` until benchmarks identify a genuinely + useful extra tier. + +### 2.3 MIR Performance Bar + +MIR's README reports `c2m -eg` at about `0.91` geomean performance relative +to GCC `-O2` on its 15 small C benchmarks, with QBE around `0.65` on the same +table. Treat those numbers as directional rather than a literal cfree promise: +MIR has a JIT-oriented IR and mature per-target rewrite logic. Still, they set +the bar for this project: + +- `-O1`: compile-time-focused. It should be much closer to `-O0` compile time + than to `-O2`, while removing obvious backend artifacts through combine and + DCE. +- `-O2`: performance-focused. The first external target is at least QBE-class + generated-code performance on the MIR `c-benchmarks` set, then iteratively + close on MIR's published result. +- Both levels must preserve the CG corpus exit-code contract before benchmark + tuning starts. --- -## 3. Phased plan - -Each phase ends at a green test surface. Phases 0–3 are reversible — -strip the wrapper back to identity replay and the system still works. -Phase 4 is the wedge: once the SSA → lowering pipeline is the only -path to bytes, both levels go through it. Levels 1 and 2 share the -pipeline from Phase 4 onward and diverge only in which optimization -passes the schedule includes (§1.1, §1.2). - -### Phase 0 — wrapper skeleton + equivalence harness - -Goal: replace the stub with a real `opt_cgtarget` that does *nothing -but forward*, and stand up the test harness that proves it. Without -the harness, "Phase 0 is correct" has no signal — so they land -together. - -Wrapper: - -- New `src/opt/opt.c`: `opt_cgtarget_new` returns a CGTarget whose - every vtable slot is a one-line `target->method(target, ...)`. No - IR built. -- `finalize` calls `target->finalize(target)`, then the wrapper's own - cleanup (none yet). -- `destroy` cascades: free wrapper state, then `cgtarget_free(target)`. -- Pipeline already wires this at `src/api/pipeline.c:227`. No driver - changes. - -Harness: - -- Add `--opt-level N` flag to `test/cg/harness/cg_runner.c`. When - `N > 0`, wrap the constructed `CGTarget*` with - `opt_cgtarget_new(c, target, N)` before running the case. -- `test/cg/run.sh`: each case in Groups A–Q runs at `--opt-level 0`, - `1`, and `2`; exit codes must match across all three. Through - Phase 3, levels 1 and 2 share the recorder's 1:1 replay path with - level 2 additionally exercising `build_cfg` + `build_ssa` as a - dry-run (output discarded). Phase 4 promotes both levels to the - shared SSA → lower path. Group P (DWARF/W path) stays at level 0 - only for now — opt-level equivalence on DWARF is a Phase 5+ - concern. - -Tests: full Groups A–Q D/R/E/J pass at `--opt-level 0` and -`--opt-level 1`. Any divergence means the forwarding lost data — fix -it before continuing. - -Exit criterion: dual-level corpus is green and stays that way. - -### Phase 1 — recording fidelity - -Goal: actually build `Func`/`Block`/`Inst` while still emitting -correct code. Replay 1:1 onto the wrapped target on `func_end`. - -- `src/opt/ir.c`: `ir_func_new`, `ir_block_new`, `ir_emit`, - `ir_emit_multi`, `ir_emit_const_i`, `ir_emit_const_bytes`, - `ir_frame_slot_new`, `ir_param_add`, `ir_set_terminator`. Arena is - per-`Func` from `compiler_arena_new` against `Compiler.tu`. -- `src/opt/opt.c`: each CGTarget method records into the current - `Func`'s current block. `Reg` returned from `alloc_reg` is the same - integer as the `Val` it defines (`Val == Reg`, modulo `VAL_NONE = 0` - / `REG_NONE = 0xffffffff` sentinels). Operand `OPK_REG` carries - that same integer; replay reads it back and reissues. -- `clobbers`/`spill_reg`/`reload_reg`/`free_reg` — under unbounded - virtuals these shouldn't be called. `free_reg` is documented as a - hint and ignored. The other three indicate CG misuse from the opt - side; panic loudly. -- `src/opt/replay.c`: walks a `Func` linearly, dispatches each - `Inst.op` back to the matching `target->method(...)` call. One - giant switch keyed on `IROp`. Multi-result ops (`IR_CALL`, - `IR_ATOMIC_CAS`, `IR_INTRINSIC` for `*_OVERFLOW`) read from - `defs[0..ndefs)`. -- `set_loc`: wrapper updates a `pending_loc` field; every subsequent - `ir_emit` stamps `Inst.loc = pending_loc`. Replay calls - `target->set_loc(target, inst.loc)` before each emit-side call. -- `finalize` at level 1: replay each buffered Func, then - `target->finalize(target)`. - -Tests: same corpus, `--opt-level 1`. The IR shape gets stress-tested -on the tricky cases — Group F bitfields, Group G calls (sret split, -byval, indirect), Group I alloca, Group J va_*, Group K atomics, Group -L intrinsics (especially multi-result overflow), Group M (deferred: -asm_block). Anything that loses data in record→replay shows up here. - -Exit criterion: `--opt-level 1` corpus is green and IR-allocated -arenas are reaped on panic via `compiler_defer`. - -### Phase 2 — IR pretty-printer + level-1 peepholes - -Goal: be able to look at IR; introduce trivial pre-SSA rewrites. - -- `src/opt/ir_print.c`: `ir_func_print(Writer*, const Func*)` for - ad-hoc debugging and Group R diff oracles. Format unspecified - beyond "stable enough for golden-file diffs". -- `cg-runner --opt-level 1 --dump-ir NAME` prints the recorded `Func` - before replay. -- A handful of safe linear-tape rewrites: fold `IR_CONST_I` + - `IR_IADD/ISUB/IMUL` of two constants into a fresh `IR_CONST_I`; - collapse trivially dead stores immediately overwritten in the same - block. Each rewrite preserves the recorded→replayable contract. - -Tests: corpus stays green. Add hand-picked cases (or new ones) where -the rewrite fires and the resulting code is observably smaller via -disassembly diff against `--opt-level 0`. - -### Phase 3 — SSA construction, dry-run - -Goal: build SSA without consuming it. Catches IR-shape bugs before -they matter — Phase 4's lowering pipeline depends on the SSA being -correct, so we shake out construction bugs separately. - -- `src/opt/pass_cfg.c`: `opt_build_cfg` — derives - `Block.preds`/`succ`/`nsucc` from terminators (`IR_BR`, `IR_CONDBR`, - `IR_CMP_BRANCH`, `IR_RET`, `IR_LONGJMP`, `IR_BREAK_TO`, - `IR_CONTINUE_TO`, `IR_INTRINSIC{TRAP,UNREACHABLE}`). `IR_SETJMP` is - a control barrier — splits its block but is not a terminator - (control falls through). -- `src/opt/pass_ssa.c`: `opt_build_ssa` — standard dominance-frontier - algorithm; promotes any `FrameSlot` whose `FSF_ADDR_TAKEN` bit was - never set (mem2reg folded in per `doc/DESIGN.md` §12). Inserts - `IR_PHI` instructions with `IRPhiAux` populated. -- These run at both levels on `func_end`, but their output is - *discarded* before replay until Phase 4 lands the lowering path. - Goal at this phase is "no panics on the corpus", not "improved - code". - -Tests: at levels 1 and 2, recorder runs `build_cfg` + `build_ssa` -then falls back to the recorder's 1:1 replay. Corpus stays green; -any panic is a real bug (unhandled IROp, malformed CFG from -goto/switch lowering, address-taken-detection miss). - -### Phase 4 — lowering pipeline (the wedge) - -Goal: SSA → make_conventional_ssa → undo_ssa → machinize → -live_info → coalesce → regalloc → combine → dce → opt_emit, replacing -the recorder's 1:1 replay at *both* levels. From this point, neither -level falls back to record→replay. - -- `src/opt/pass_machinize.c`: `opt_machinize(Func*, Target)` — ABI - lowering (calls into `TargetABI` for argument/return part - classification just like CG does today; the result lives in - `IRCallAux.abi` already), 2-op forms, calling-convention spill, - prolog/epilog placeholders. -- `src/opt/pass_live.c`: `opt_live_info` — standard backwards dataflow. -- `src/opt/pass_regalloc.c`: `opt_regalloc` — linear scan, no live-range - splitting initially. Allocates physical registers from the same - pool the AArch64 backend uses for scratch (i.e. asks the wrapped - target for clobbers and the param/sret physical mapping). -- `src/opt/pass_emit.c`: `opt_emit(Compiler*, Func*, CGTarget*)` — - walks the lowered IR and drives the wrapped target via the - emit-side CGTarget surface. Inst → target call, but operands are - now physical Operands (post-RA) and prolog/epilog/spill insertion - has happened. -- Wrapper's `func_end` runs `build_cfg` + `build_ssa` + - `make_conventional_ssa` + `undo_ssa` + the lowering pipeline at - both levels. No optimization passes between SSA build and undo at - this phase — the lowering pipeline does all the work, so we - isolate lowering bugs from optimization-pass bugs. Level 1 and - level 2 are functionally identical at the end of Phase 4. - -Tests: `--opt-level 1` and `--opt-level 2` corpora green. The -recorder's old 1:1 replay path is removed; SSA → lower is the only -path to bytes. - -Exit criterion: levels 1 and 2 both green, no regressions in the -level-0 suite. - -### Phase 5 — intra-procedural passes, level-gated - -Goal: populate the optimization schedule. All new passes land in the -*level-2* schedule only; level 1's schedule stays at the Phase 4 set -(`build_cfg`, `build_ssa`, `jump_opt` once it's wired in) so it -remains the bisection floor for IR/SSA/lowering bugs. - -Order (tracking `doc/DESIGN.md` §9.2; each pass lands behind a flag -so the equivalence harness can run with just-this-pass to localize -bugs): - -1. `opt_jump_opt` — moves into the level-1 schedule too (cheap, - high-value, doesn't change values). -2. `opt_gvn` (with constprop, redundant-load elim folded in) — - level 2 only. -3. `opt_copy_prop` (with redundant-extension elim) — level 2 only. -4. `opt_dse` — level 2 only. -5. `opt_ssa_dce` — level 2 only. -6. `opt_build_loop_tree` + `opt_licm` — level 2 only. -7. `opt_addr_xform` — level 2 only. -8. `opt_block_cloning` — level 2 only. -9. `opt_pressure_relief` — level 2 only. -10. Lowering-time additions (run at both levels): `opt_coalesce`, - `opt_combine`, `opt_dce` (post-RA), live-range splitting in - `opt_regalloc`. - -No UB-exploiting transformations (`doc/DESIGN.md` §9): no -signed-overflow-is-unreachable, no shift-by-≥-width-is-unreachable, -no division-by-zero-is-unreachable, no null-deref-is-unreachable. - -### Phase 6 — inter-procedural (level 2 only) - -Goal: cross-function inlining + cleanup iteration. Level 1 stays -intra-procedural — the inliner is the largest source of correctness -risk and the largest divergence from level-0 codegen, so it earns -its own gating. - -- `opt_inline`: bottom-up call-graph walk. SCCs (mutual recursion) - skipped. Heuristic: instruction count + call site count. Inlining - recognizes the `__cfree_setjmp` symbol by name as returns-twice and - refuses to inline across. -- `opt_cleanup`: subset re-run (gvn, copy_prop, ssa_dce, jump_opt, - licm if loops, addr_xform if uses remain). -- `-finline-iters=N` knob (default 1, hard cap enforced inside the - wrapper). - -Tests: corpus + a small set of cases hand-crafted to require -inlining-plus-cleanup to optimize (e.g. a small wrapper around a -constant returner whose body becomes a constant only after inlining). +## 3. MIR Pipeline, Source Order + +This is the effective order in current `mir-gen.c` for full function +generation. + +### 3.1 Common Setup + +MIR duplicates the function instruction list, allocates a function CFG object, +and runs `build_func_cfg`. + +`build_func_cfg`: + +- Creates synthetic entry and exit blocks. +- Splits basic blocks at labels, branches, returns, property branches, and + fallthrough boundaries. +- Converts register operands to internal variable operands. +- Marks address-producing instructions and addressable regs. +- Adds edges for direct branches, switches, fallthrough, and possible indirect + jump targets. +- Marks calls on blocks so memory availability and liveness can treat them as + barriers. +- Removes unreachable blocks when `optimize_level > 0`. + +cfree's `opt_build_cfg` should follow this shape: construct explicit +entry/exit blocks, keep call/memory barrier metadata on blocks, and make +unreachable cleanup part of `-O1+`, not a late optional cleanup. + +### 3.2 Full `-O2` Pre-Lowering Passes + +MIR full mode then runs: + +1. `clone_bbs` before SSA when there are no address instructions. + MIR clones cold blocks after return back into hot predecessors when this + exposes optimization in the hot path. It uses a bounded growth factor and + skips back edges. + +2. `build_ssa`. + MIR uses a Braun-style construction: it creates optimized maximal SSA, + minimizes redundant phis, builds def-use edges, and optionally renames + variables. cfree should copy the useful property, not the representation: + every use should have cheap access to its defining inst, and every def + should have cheap access to its uses. + +3. `addr_xform` when address instructions exist. + MIR tries to eliminate `ADDR` pseudos that only feed memory operands. If an + address pseudo must remain addressable, MIR converts uses to memory loads + and stores, rebuilds SSA, then clones blocks. For cfree, this maps to + folding GEP/address-of chains into target memory operands where the target + can encode them, while keeping address-taken frame slots in memory. + +4. `gvn`. + MIR computes memory availability, dominators, and value numbers in postorder. + This pass includes constant propagation, branch folding, redundant expression + elimination, redundant load elimination, and store/load reuse. It uses alias + and nonalias tags on memory operands. Calls clear or conservatively restrict + memory availability. + +5. `copy_prop`. + MIR propagates copies, folds multiply/divide by powers of two, and removes + redundant extension chains. cfree should keep this as a separate pass after + GVN because it relies on SSA edges and target legality. + +6. `dse`. + MIR computes memory liveness by memory location, handles alloca escape + through calls, and removes stores whose memory location is not live. This + depends on GVN's memory numbering and alias classification. + +7. `ssa_dce`. + MIR deletes SSA instructions with unused outputs, while preserving calls, + allocas, varargs, returns, frame/stack effects, overflow sequences with live + overflow branches, and other side-effecting ops. + +8. `build_loop_tree + licm`. + MIR builds natural loops and preheaders. LICM skips branches, phis, calls, + memory, varargs, allocas, and potentially trapping divisions/mods. It is + pressure-sensitive: cheap single instructions are not moved unless their + inputs are worth moving too; multiplies are considered expensive enough to + move. + +9. `pressure_relief`. + MIR moves single-use immediate or constant-like moves closer to their use + when doing so reduces pressure and does not move work into a loop. + +10. `make_conventional_ssa`. + MIR lowers phis into edge/block moves, splitting critical edges when needed. + +11. `ssa_combine`. + MIR combines compare+branch pairs and folds address components into memory + operands while SSA edges are still available. + +12. `undo_ssa`. + MIR removes phi nodes and SSA edges. + +13. `jump_opt`. + MIR removes unreachable blocks, empty blocks, branches to the next + instruction, chains of labels, and branches to jumps. + +### 3.3 Lowering and Allocation + +After pre-lowering optimization, MIR runs the machine-dependent and allocation +pipeline: + +1. `target_machinize`. + This performs ABI lowering, call lowering, target two-operand forms, and + other machine-dependent normalization. + +2. Build a loop tree for `-O1+`. + MIR uses loop depth for frequency/pressure estimates in liveness, + coalescing, and allocation. + +3. `collect_moves`, move-only liveness, conflict matrix, and `coalesce` at + `-O2`. + MIR aggressively coalesces move-related regs, prioritizing moves by loop + frequency and checking conflicts in a bounded matrix. + +4. Full `live_info`. + MIR computes `live_in`/`live_out`, register frequencies, live lengths, and + block pressure. It understands phis before SSA destruction and call-used + hard registers after machinize. + +5. `reg_alloc`. + MIR builds live ranges and assigns pseudos by priority: tied hard regs first, + then higher frequency, then shorter live length. At levels below 2 it uses a + simplified allocator. At full level it can split live ranges and place + spills/restores on edges or block boundaries. + +6. `rewrite`. + MIR rewrites pseudo regs to hard regs or stack slots, inserts reloads and + stores, handles call-clobbered saves, and deletes noop moves. + +7. `combine`. + This is target-aware code selection. It substitutes safe single-use moves, + collapses extension chains, commutes operands to expose legal combinations, + removes internal labels, and validates each rewrite with target legality. + +8. `dead_code_elimination`. + MIR performs post-RA DCE using live-out sets and preserves side effects. + +9. Prolog/epilog and machine instruction generation. + +cfree should keep this exact split: combine after register allocation is not a +replacement for SSA `copy_prop`; it is a target legality/code-selection pass. + +### 3.4 Inlining + +MIR inlining is in `mir.c`, before generator optimization. It processes direct +call-like instructions after MIR simplification, not inside `mir-gen.c`. + +Important MIR inliner behavior: + +- It can inline direct normal calls and explicit inline calls when the callee + is available. +- It skips unresolved externals, self recursion, label-reference functions, + varargs, `jret`, and over-budget callees. +- Default size budgets are small: normal calls have a much smaller cap than + explicit inline calls. +- It limits caller growth relative to original caller size. +- It renames callee registers, duplicates labels, materializes argument moves, + rewrites returns, and handles simple top-level constant-size `alloca`. + +cfree should run inlining on retained `Func` IR at `finalize` for `-O2`, then +run cleanup on dirty callers. SCCs can be skipped for v1. The inliner must +refuse setjmp/longjmp-sensitive and vararg cases until their semantics are +explicitly tested. --- -## 4. Module layout +## 4. cfree Level Schedules + +### 4.1 `-O1` Minimal Schedule + +`-O1` is not "half of SSA." It is MIR's cheap backend optimization tier. + +Required `-O1` schedule: ``` -src/opt/ - opt.h (already exists) - ir.h (already exists) - opt.c wrapper, vtable bindings, finalize dispatch - ir.c Func/Block/Inst plumbing, val table, arenas - ir_print.c pretty-printer (Phase 2) - replay.c level-1 replay (Phase 1) - pass_cfg.c opt_build_cfg - pass_ssa.c opt_build_ssa, opt_make_conventional_ssa, - opt_ssa_combine, opt_undo_ssa - pass_gvn.c opt_gvn - pass_copy.c opt_copy_prop - pass_dse.c opt_dse - pass_dce.c opt_ssa_dce, opt_dce - pass_jump.c opt_jump_opt - pass_loop.c build_loop_tree + opt_licm - pass_addr.c opt_addr_xform - pass_clone.c opt_block_cloning - pass_pressure.c opt_pressure_relief - pass_machinize.c opt_machinize - pass_live.c opt_live_info - pass_coalesce.c opt_coalesce - pass_regalloc.c opt_regalloc - pass_combine.c opt_combine - pass_emit.c opt_emit (lowering replay) - pass_inline.c opt_inline + opt_cleanup +build_cfg +machinize +build_loop_tree +live_info +regalloc(allow_live_range_split = false) +combine +dce +opt_emit ``` -This split lets each phase land as a small set of new files plus -edits to `opt.c`. No file accumulates more than one pass. +Allowed `-O1` details: + +- Remove unreachable blocks during CFG construction. +- Use loop depth only for frequency/pressure costing. +- Run target-aware combine after register allocation. +- Delete noop moves and dead post-RA definitions. +- Use a priority-based allocator, but without coalescing and live-range + splitting in the first production version. + +Forbidden at `-O1`: + +- `build_ssa` +- `gvn` +- `copy_prop` +- `dse` +- `ssa_dce` +- `licm` +- `pressure_relief` +- `coalesce` +- `opt_inline` + +This keeps `-O1` useful and debuggable: if `-O1` fails, the bug is in +recording, CFG, machinize, liveness, allocation, combine, DCE, or emission. + +### 4.2 `-O2` Full Schedule + +`-O2` uses MIR's full current source pipeline: + +``` +build_cfg +if no address-transform candidates: + block_cloning +build_ssa +if address-transform candidates: + addr_xform + undo_ssa + block_cloning + build_ssa +gvn +copy_prop +dse +ssa_dce +build_loop_tree +licm +pressure_relief +make_conventional_ssa +ssa_combine +undo_ssa +jump_opt +machinize +build_loop_tree +collect_moves +live_info(move vars only) +coalesce +live_info(all vars) +regalloc(allow_live_range_split = true) +combine +dce +opt_emit +``` + +Then, once inlining exists: + +``` +finalize: + opt_inline(FuncSet, max_iters) + for each dirty caller: + opt_cleanup + lower every dirty/not-yet-emitted Func through the full schedule +``` + +`opt_cleanup` should re-run the passes that inlining exposes value for: +`build_cfg`, `build_ssa`, `gvn`, `copy_prop`, `dse`, `ssa_dce`, `licm` when +loops exist, `pressure_relief`, `make_conventional_ssa`, `ssa_combine`, +`undo_ssa`, and `jump_opt`. + +### 4.3 Transformations We Do Not Take + +`doc/DESIGN.md` is still binding: no UB-exploiting transforms. Do not assume +signed overflow, shift-by-width, division by zero, or null dereference are +unreachable. MIR is careful around potentially trapping division/modulo in +LICM; cfree should be at least as conservative. + +MIR property instructions and lazy basic-block versioning are out of scope for +the first optimized backend. They are a separate JIT specialization feature, +not required for normal `cfree` object/executable codegen. --- -## 5. Cross-cutting decisions +## 5. Current cfree State + +The optimizer is no longer just a stub: + +- `src/opt/opt.c` implements the `CGTarget` wrapper. +- `src/opt/ir.c` and `src/opt/ir.h` implement the recorded IR container. +- `src/opt/pass_cfg.c` implements CFG construction. +- `src/opt/pass_ssa.c` implements current SSA construction. +- `test/cg/run.sh` supports `CFREE_OPT_LEVELS`; default is `0 1`, while + level `2` is opt-in today. + +Current behavior: + +- Level `1` records and replays 1:1 into the wrapped target. +- Level `2` runs `opt_build_cfg + opt_build_ssa` as a dry run, discards that + result, then replays 1:1. +- No real optimized lowering path exists yet. + +The next implementation work should replace replay with the MIR-style lowering +pipeline, first at `-O1`, then at `-O2`. + +--- + +## 6. Implementation Phases + +Each phase should end with targeted green tests. Prefer red-green tests for +the exact pass being introduced, then expand through the CG corpus. + +### Phase A - Production `-O1` Lowering -### 5.1 Reg ↔ Val identity +Goal: make `-O1` stop replaying and emit through the optimized backend path +without SSA value passes. -Each call to `wrapper->alloc_reg(class, type)` allocates a fresh `Val` -in the current `Func`'s value table and returns its integer as -`Reg`. `Operand{kind=OPK_REG, v.reg=R}` is interpreted on replay as -"the value defined at SSA id R." This collapses two parallel ID -spaces into one and avoids a side mapping. `REG_NONE` (`0xffffffff`) -and `VAL_NONE` (`0`) live at opposite ends of the range — neither -is allocated. +Implement: -### 5.2 Frame slots stay frame slots until SSA construction +- `opt_machinize` +- `opt_live_info` +- `opt_regalloc(..., false)` +- `opt_combine` +- `opt_dce` +- `opt_emit` -`cg_local`/`cg_param` always allocate frame slots through -`CGTarget.frame_slot`/`CGTarget.param`. The wrapper records them -verbatim. Loads/stores against frame slots become `IR_LOAD`/`IR_STORE` -with `MemAccess.alias = ALIAS_LOCAL`. `build_ssa` (Phase 3) is the -only place that promotes non-`FSF_ADDR_TAKEN` slots to SSA values. -Address-taken slots stay in memory and are reasoned about through -`MemAccess` alias roots (`doc/DESIGN.md` §5.6). +Keep the allocator simple but not naive: -This keeps the recorder dumb and pushes all decisions into one pass. +- Build live ranges from block liveness. +- Sort allocation candidates by tied hard-reg requirement, frequency, live + length, then stable id. +- Assign hard regs when possible, stack slots otherwise. +- Rewrite pseudos into hard regs/stack slots with reserved scratch regs for + reload/store addressing. +- Delete noop moves after rewrite. -### 5.3 Method intercepts the wrapper rejects +Exit criteria: -Under unbounded virtuals, these CG-side mechanics are nonsense: +- `CFREE_OPT_LEVELS="0 1" make test-cg` passes for targeted AArch64 cases. +- Add focused allocation cases for call-clobber saves, stack spills, tied + hard regs from inline asm, and values live through branches. -- `clobbers` — meaningless without a finite physical register set. -- `spill_reg` / `reload_reg` — CG drives these for the -O0 value - stack; opt's wrapper has no value stack. +### Phase B - Full Allocation Infrastructure -`free_reg` is already a hint; the wrapper ignores it. +Goal: bring `-O2` allocation quality up to MIR's model before value passes +start changing code aggressively. -The first three should panic with a clear "called X on opt_cgtarget" -message. They indicate CG is being driven in -O0 mode while opt is -attached, which means a wiring bug. +Implement: -### 5.4 set_loc fan-out +- Move collection. +- Move-only liveness. +- Conflict matrix for move-related regs. +- Aggressive coalescing. +- Live-range splitting and edge/block spill placement. -The wrapper's `set_loc` updates a single `pending_loc` field on the -recorder state. `ir_emit` stamps `Inst.loc = pending_loc`. The -lowering replay (Phase 4) calls `target->set_loc(target, inst.loc)` -before each emit-side call. This is exactly the protocol -`doc/DWARF.md` §3 expects, and it preserves Group P (set_loc/debug) -behavior at level 2. +Exit criteria: -### 5.5 No UB-exploiting passes +- `-O1` remains green. +- `-O2` can use the same lowering path with coalescing/splitting enabled, even + if all SSA value passes are still disabled. -`doc/DESIGN.md` §9 is binding. opt may not assume signed overflow, -shift-by-≥-width, division by zero, or null deref are unreachable. -WASM traps on the first three deterministically; real targets are -also more predictable this way. The "70% of -O2" goal is achievable -without these transformations. +### Phase C - SSA Value and Memory Passes + +Goal: enable the MIR full pre-lowering schedule for `-O2`. + +Implement in order: + +1. `opt_block_cloning` +2. `opt_addr_xform` +3. `opt_gvn` +4. `opt_copy_prop` +5. `opt_dse` +6. `opt_ssa_dce` +7. `opt_licm` +8. `opt_pressure_relief` +9. `opt_make_conventional_ssa` +10. `opt_ssa_combine` +11. `opt_undo_ssa` +12. `opt_jump_opt` + +Do not batch these into one landing. Each pass needs a pass-local corpus case +that fails red without the pass or its bug fix. + +Exit criteria: + +- `CFREE_OPT_LEVELS="0 1 2" make test-cg` passes for the relevant arch. +- Pass-local dump tests prove the intended rewrite fires. + +### Phase D - Inlining and Cleanup + +Goal: add MIR-style small direct-call inlining for `-O2`. + +Implement: + +- Bottom-up call graph over retained `Func` IR. +- SCC skip for v1. +- Size budgets modeled on MIR: small budget for normal calls, larger budget + for explicit inline candidates once cfree tracks that source property. +- Caller growth cap. +- Register/value remapping, block/label remapping, parameter materialization, + return rewrite, and debug location preservation. +- Conservative refusal for varargs, setjmp/longjmp, inline asm with hard + constraints, and functions whose frame/alloca behavior is not yet modeled. + +After each inline iteration, run `opt_cleanup` on dirty callers. + +Exit criteria: + +- Small wrapper-call cases optimize after inline+cleanup. +- Recursive and mutually recursive cases are unchanged and correct. + +### Phase E - Benchmark Closure + +Goal: tune by measurement, not by adding passes speculatively. + +Benchmark set: + +- MIR `c-benchmarks`: `array`, `binary-trees`, `funnkuch-reduce`, + `hash`, `hash2`, `heapsort`, `lists`, `mandelbrot`, `matrix`, + `method-call`, `nbody`, `sieve`, `spectral-norm`, `strcat`, and any + benchmark that requires only supported libc/runtime features. +- cfree-specific stress cases for ABI, TLS, atomics, and inline asm. + +Measure: + +- Compile wall time for `-O0`, `-O1`, `-O2`. +- Executable run time against clang/gcc `-O2` when available. +- Code size for hot text sections. +- Pass counters: removed GVN expressions, folded branches, removed stores, + coalesced moves, spills/restores, split ranges, post-RA deleted moves. + +Target: + +- `-O1` should be the fast optimized tier and materially faster to compile + than `-O2`. +- `-O2` should first reach QBE-class performance on the benchmark set, then + close toward MIR's published `c2m -eg` geomean. + +--- + +## 7. Module Layout + +Keep the MIR pass boundaries as separate cfree modules: + +``` +src/opt/ + opt.c wrapper, vtable bindings, schedule dispatch + ir.c Func/Block/Inst plumbing + ir_print.c stable dumps for pass tests + pass_cfg.c CFG, unreachable cleanup + pass_clone.c block cloning + pass_ssa.c SSA build, conventional SSA, undo SSA + pass_addr.c address transformation + pass_gvn.c GVN, constprop, redundant-load elimination + pass_copy.c copy propagation, extension cleanup + pass_dse.c dead store elimination + pass_dce.c SSA DCE and post-RA DCE + pass_loop.c loop tree and LICM + pass_pressure.c pressure relief + pass_jump.c jump optimization + pass_machinize.c target ABI and machine-form lowering + pass_live.c liveness, pressure, live ranges + pass_coalesce.c move collection and coalescing + pass_regalloc.c assignment, rewrite, splitting + pass_combine.c target-aware code selection + pass_emit.c drive wrapped CGTarget + pass_inline.c inlining and cleanup +``` + +No pass should reach into the wrapped target's private implementation. Target +specifics belong behind `Target`, `TargetABI`, or explicit helper hooks. + +--- + +## 8. Pass Invariants + +- No VLAs and no global optimizer state. MIR uses generator context; cfree + should hang all pass state off `Compiler`, `Func`, `FuncSet`, or explicit + pass contexts. +- `Reg` and `Val` identity remains valid while recording. Lowering may map + Vals to new physical registers or stack slots, but that mapping belongs to + allocation state. +- Frame slots remain frame slots until a pass proves promotion is legal. +- Calls are memory barriers unless alias information proves otherwise. +- Inline asm is side-effecting and may pin hard regs; the allocator and + inliner must treat it conservatively. +- `set_loc` is sticky on recorded insts. Every emitted instruction must forward + the active location to the wrapped target before machine emission. +- Every pass either preserves CFG metadata or invalidates/rebuilds it before + the next consumer. --- -## 6. Open questions - -- **Linear scan vs graph coloring at -O2.** MIR uses fast linear - scan. The "70% of -O2" target is achievable with it. Graph coloring - is a follow-up if a benchmark gap demands it. -- **Loop tree at lowering vs after intra passes.** §9.4 builds the - loop tree inside the lowering pipeline (used by RA for split - decisions). §9.2 also builds one for LICM. Open: do these share an - arena-owned structure that survives both, or are they computed - twice? Rebuilding is cheap; sharing requires invalidation - discipline. Default to rebuilding until it matters. -- **Stackifier vs Relooper for WASM.** `doc/DESIGN.md` §14 leaves - this open. opt's flat-CFG IR makes either viable. Decision deferred - to when the WASM backend lands; the IR is structure-agnostic. -- **Inlining heuristic tuning.** Bench-driven; the knob - (`-finline-iters=N`, instruction-count budget) is exposed from - Phase 6 onward. -- **IR snapshot/golden-file format.** Phase 2's pretty-printer needs - a stable enough format for diff-based regression tests in Group R. - Decide format when Phase 2 lands. +## 9. Test Strategy + +Targeted commands: + +``` +CFREE_OPT_LEVELS="0 1" make test-cg +CFREE_OPT_LEVELS="0 1 2" make test-cg +CFREE_TEST_FILTER=<case> CFREE_OPT_LEVELS="0 1 2" make test-cg +``` + +Use pass dumps for red-green tests: + +- CFG: block/successor/predecessor shape. +- SSA: phi placement and def-use chains. +- GVN: redundant expression/load replaced and constant branch folded. +- DSE: dead store removed while escaping stores remain. +- LICM: safe invariant moved; trapping division and memory ops remain. +- RA: expected spill/reload/coalescing counters. +- Combine: target-legal fused instruction shape. + +Before broad runs, redirect output to a file and inspect the failure slice: + +``` +CFREE_TEST_FILTER=<case> CFREE_OPT_LEVELS="0 1 2" make test-cg >build/opt-test.log 2>&1 +tail -200 build/opt-test.log +``` + +The full CG corpus remains the equivalence oracle: levels `1` and `2` must +produce the same observable `test_main` result as level `0`. DWARF equivalence +can start weaker, but `set_loc` forwarding must not regress line-row emission.