Split optimizer design and performance docs - kit

commit e462688b568fc13b28f3db4126fca1da77335b87
parent a28452e0597151ea9bb31eaf7c220e4ee5ee801c
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Fri, 22 May 2026 09:14:36 -0700

Split optimizer design and performance docs

Diffstat:
D doc/CONSTFOLD.md  | 600 -------------------------------------------------------------------------------
D doc/MIR_RA_REPORT.md  | 933 -------------------------------------------------------------------------------
M doc/OPT.md  | 1571 ++++++++++++++-----------------------------------------------------------------
D doc/OPT1.md  | 330 -------------------------------------------------------------------------------
A doc/OPT_PERF.md  | 200 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
D doc/PERF.md  | 285 -------------------------------------------------------------------------------
M scripts/opt_bench.sh  | 23 ++++++++++++++++++++++-

7 files changed, 490 insertions(+), 3452 deletions(-)
diff --git a/doc/CONSTFOLD.md b/doc/CONSTFOLD.md
@@ -1,600 +0,0 @@
-# CONSTFOLD - Vstack Constant Simplification Plan
-
-`-O1` needs a tiny constant simplification path that improves obvious scalar
-code without pulling the O2 SSA/value-optimization machinery into the fast
-backend tier. The right first home is the public CG API value stack in
-`src/api/cg.c`, beside the existing delayed compare representation used by
-compare-consuming branches.
-
-This plan covers integer constant folding and local expression delay in the
-vstack. It intentionally avoids global propagation, CFG-wide reasoning, and
-target-specific peepholes.
-
-## Status
-
-Implemented:
-
-- immediate-only integer binop folding for flag-free integer operations up to
-  64 bits;
-- immediate-only integer unop folding for flag-free integer operations up to
-  64 bits;
-- immediate-only integer and pointer-like compare folding;
-- scalar immediate return preservation through `cfree_cg_ret`;
-- CG API shape coverage for O1 literal folds;
-- delayed-compare materialization now reuses only owned operand registers;
-- delayed `SV_ARITH` for unary and binary arithmetic;
-- expression-local arithmetic chain folding;
-- straight-line local constant shadowing with conservative boundary
-  invalidation;
-- disassembly and metrics refresh for the phase 5 probes;
-- `ApiSValue` delayed compare/arithmetic payloads are now unioned;
-- local-shadow invalidation is routed through named memory, control, and
-  address-taking boundary helpers;
-- delayed arithmetic has a direct register-pressure materialization regression
-  test.
-
-Remaining:
-
-- consider a shadow generation counter if repeated `clear_all` scans show up in
-  future O1 metrics.
-
-## Current Shape
-
-The API value stack currently carries:
-
-- `SV_OPERAND`: an immediate, register, local, global, or indirect operand;
-- `SV_CMP`: a delayed integer compare with source operands and register
-  ownership bits.
-
-Integer constants enter the stack as `SV_OPERAND` plus `OPK_IMM`.
-
-Integer compare already has the desired delayed shape:
-
-```text
-push a
-push b
-int_cmp        -> pushes SV_CMP, emits no boolean materialization
-branch_true    -> consumes SV_CMP as target->cmp_branch
-```
-
-This avoids the old branch-consumer shape:
-
-```text
-cmp
-cset
-cmp #0
-b.cond
-```
-
-Arithmetic now has a first immediate-only fold. For flag-free integer
-operations where both operands are `OPK_IMM`, `api_cg_binop` and
-`api_cg_unop` fold before allocating a result register. `api_cg_cmp` likewise
-folds immediate-only integer and pointer-like compares. Non-immediate
-arithmetic still materializes immediately; delayed arithmetic is the next
-planned step.
-
-The visible O1 quality gap is:
-
-```c
-int x = 40;
-x += 2;
-return x;
-```
-
-The expression part can be fixed by immediate folding, but the full example
-also crosses a local store/load. That requires a narrow local constant shadow,
-not just expression-stack folding.
-
-## Goals
-
-- Fold immediate-only integer arithmetic before register allocation.
-- Delay simple integer arithmetic on the vstack when doing so can avoid
-  materialization or combine a following immediate.
-- Preserve the existing `SV_CMP` compare-branch behavior.
-- Improve straight-line scalar-local code such as `x = 40; x = x + 2`.
-- Keep the implementation target-neutral and bounded.
-- Do not introduce VLAs or global state.
-- Keep all new state attached to `CfreeCg`, `ApiSValue`, or `ApiSourceLocal`.
-
-## Non-Goals
-
-- No SSA construction.
-- No global constant propagation.
-- No CFG-wide dataflow.
-- No folding across labels, jumps, branches, calls, volatile accesses, inline
-  asm, or unknown memory effects.
-- No FP constant folding in this pass.
-- No i128 folding until there is a deliberate wide-integer helper.
-- No target-specific immediate legality checks in the vstack. The target
-  backend already decides whether an `OPK_IMM` source fits an instruction
-  encoding or must be materialized.
-
-## Design
-
-### 1. Immediate-Only Integer Folding
-
-Implemented in `src/api/cg.c` with helpers shaped as:
-
-```c
-static int api_try_fold_int_binop(CfreeCg*, BinOp, CfreeCgTypeId,
-                                  i64 a, i64 b, i64* out);
-static int api_try_fold_int_unop(CfreeCg*, UnOp, CfreeCgTypeId,
-                                 i64 a, i64* out);
-static int api_try_fold_int_cmp(CfreeCg*, CmpOp, CfreeCgTypeId,
-                                i64 a, i64 b, i64* out);
-```
-
-The binop/unop paths run before allocating a destination register. If both
-source operands are `OPK_IMM` and the operation is supported, they push a new
-`OPK_IMM` result and release the inputs.
-
-Implemented operations:
-
-- `BO_IADD`, `BO_ISUB`, `BO_IMUL`;
-- `BO_AND`, `BO_OR`, `BO_XOR`;
-- `BO_SHL`, `BO_SHR_U`, `BO_SHR_S`;
-- `UO_NEG`, `UO_BNOT`, `UO_NOT`;
-- integer compares.
-
-Still skipped:
-
-- signed and unsigned division/remainder, to avoid changing divide-by-zero
-  behavior and signed-overflow corner cases in the first patch;
-- any operation with trap, saturating, or exact/nowrap flags until flags are
-  actually modeled by the vstack fold;
-- FP operations;
-- i128 types.
-
-Evaluation rules:
-
-- Use unsigned arithmetic internally where possible so host signed overflow is
-  not invoked.
-- Mask the result to the destination integer width for widths below 64.
-- For arithmetic right shift, sign-extend the masked input before shifting.
-- Normalize `_Bool` results to `0` or `1`.
-- Treat pointer-typed `OPK_IMM` only as integer-sized values for compares.
-  Do not fold pointer arithmetic here.
-
-### 2. Immediate Return Preservation
-
-Implemented. `cfree_cg_ret` preserves non-aggregate scalar `OPK_IMM` values by
-passing them directly to `target->ret`.
-
-Required behavior:
-
-```text
-return 40 + 2
-  -> vstack folds to OPK_IMM 42
-  -> ret receives OPK_IMM 42
-  -> backend emits only return-value materialization
-```
-
-If the return value is delayed arithmetic or any non-immediate expression, it
-can still materialize through the normal register path.
-
-### 3. Delayed Arithmetic SValues
-
-Add a generic delayed arithmetic stack value after the immediate-only fold is
-stable. It should cover both unary and binary integer arithmetic, with an
-operation-kind discriminator and an optional second operand.
-
-Suggested shape:
-
-```c
-typedef enum ApiDelayedArithKind {
-  API_DELAYED_UNOP,
-  API_DELAYED_BINOP,
-} ApiDelayedArithKind;
-
-typedef enum ApiSValueKind {
-  SV_OPERAND,
-  SV_CMP,
-  SV_ARITH,
-} ApiSValueKind;
-
-typedef struct ApiSValue {
-  Operand op;
-  Operand cmp_a;
-  Operand cmp_b;
-  Operand arith_a;
-  Operand arith_b;
-  CfreeCgTypeId type;
-  CmpOp cmp_op;
-  BinOp arith_bin_op;
-  UnOp arith_un_op;
-  u8 kind;
-  u8 arith_kind;
-  u8 res;
-  u8 pinned;
-  u8 lvalue;
-  u8 cmp_a_owned;
-  u8 cmp_b_owned;
-  u8 arith_a_owned;
-  u8 arith_b_owned;
-  FrameSlot spill_slot;
-  CfreeCgLocal source_local;
-} ApiSValue;
-```
-
-If struct size becomes a concern, the compare and arithmetic payloads can be
-moved into a small union. Keep the first patch simple unless stack footprint
-shows up in measurements.
-
-Materialization mirrors `SV_CMP`:
-
-- `api_ensure_reg` dispatches `SV_ARITH` to `api_materialize_arith_to`.
-- `api_materialize_arith_to` calls `target->unop` when
-  `arith_kind == API_DELAYED_UNOP` and `target->binop` when
-  `arith_kind == API_DELAYED_BINOP`.
-- `api_release` releases owned operand registers for `SV_ARITH`.
-- `api_pick_victim` ignores delayed non-register values.
-- `api_alloc_reg_or_spill` may materialize a delayed arithmetic value if no
-  normal spill victim exists, just as it does for delayed compares.
-
-Creation rules:
-
-- Only create `SV_ARITH` for integer-class operations.
-- For binary operations, require at least one non-immediate source.
-  Immediate-only cases fold to `OPK_IMM` instead.
-- For unary operations, delay only non-immediate sources. Immediate-only cases
-  fold to `OPK_IMM` instead.
-- Own registers only when the source `ApiSValue` owns that register, following
-  the existing compare ownership pattern.
-- Never create delayed arithmetic for lvalues, aggregates, FP values, calls, or
-  memory operands that need a load.
-- Do not encode unary operations as synthetic binary operations. For example,
-  keep `-x` as `UO_NEG`, `~x` as `UO_BNOT`, and `!x` as `UO_NOT` or a delayed
-  compare only when the consumer naturally wants compare semantics.
-
-### 4. Local Arithmetic Chain Folding
-
-Once `SV_ARITH` exists, fold cheap expression-local chains before
-materializing:
-
-```text
-(x + C1) + C2 -> x + (C1 + C2)
-(x + C1) - C2 -> x + (C1 - C2)
-(x - C1) + C2 -> x + (C2 - C1)
-(x - C1) - C2 -> x - (C1 + C2)
-(x ^ C1) ^ C2 -> x ^ (C1 ^ C2)
-(x & C1) & C2 -> x & (C1 & C2)
-(x | C1) | C2 -> x | (C1 | C2)
-~~x -> x
-!!x -> x != 0, if the result is consumed as a branch or compare-like boolean
-```
-
-Keep this conservative:
-
-- only one delayed level;
-- only one real non-immediate operand;
-- no reassociation across non-immediate operands;
-- no multiply reassociation initially;
-- no transforms that depend on signed-overflow assumptions.
-- no unary chain transform that changes C truth-value semantics for non-boolean
-  integer results.
-
-If the folded constant becomes an identity, collapse the result:
-
-- `x + 0`, `x - 0`, `x ^ 0`, `x | 0` -> `x`;
-- `x & -1` -> `x` after width masking;
-- `x * 1` -> `x` if multiply folding is later added;
-- `x * 0` -> `0` only when flags and side effects are known safe.
-
-### 5. Straight-Line Local Constant Shadow
-
-To handle `int x = 40; x += 2; return x;`, track constants for source locals
-that are eligible for register-local treatment.
-
-Extend `ApiSourceLocal`:
-
-```c
-typedef struct ApiSourceLocal {
-  ...
-  i64 const_value;
-  u8 const_valid;
-  u8 pad[2];
-} ApiSourceLocal;
-```
-
-Rules:
-
-- A local can be shadowed only if it is scalar integer or pointer typed.
-- Address-taken locals are not shadowed.
-- Volatile accesses are not shadowed.
-- Parameters start unknown.
-- On store of an immediate to an eligible source local, set the shadow value.
-- On store of a non-immediate to that local, clear the shadow value.
-- On partial-width or mismatched-type access to that local, clear/do not use the
-  shadow value.
-- On load of an eligible source local with a valid shadow, push `OPK_IMM`
-  instead of loading/materializing the local.
-- On taking a local address, clear the local's shadow.
-- On unknown memory or control-flow boundary, clear all shadows.
-
-Clear all local shadows at:
-
-- label placement;
-- unconditional jump;
-- conditional branch;
-- switch;
-- scope break/continue control transfer if it can leave the current block;
-- call, unless later alias analysis proves the call cannot observe the local;
-- inline asm;
-- atomic operations and fences;
-- direct stores through `OPK_INDIRECT`;
-- address-taking of any shadowed local.
-
-This is intentionally a basic-block-local cache. It is not a dataflow fact.
-
-### 6. Interaction With Existing Register Locals
-
-The CG API already treats some scalar autos as source-register lvalues when the
-target returns `CG_LOCAL_STORAGE_REG`. Loads from those locals can avoid memory.
-The constant shadow sits above that:
-
-```text
-local has valid constant
-  -> push OPK_IMM
-else local stored in register
-  -> push register-backed rvalue
-else
-  -> normal frame load path
-```
-
-Stores to register locals still update the real storage when necessary. The
-shadow only affects subsequent vstack reads in the same straight-line region.
-
-### 7. Boundary And Correctness Rules
-
-Every helper that consumes stack values must understand the new delayed value:
-
-- `api_release`;
-- `api_ensure_reg`;
-- `api_force_reg`;
-- `api_class_of_sv`;
-- `api_reg_of_sv`;
-- `api_owned_reg_type`;
-- `api_pick_victim`;
-- `cfree_cg_dup`;
-- `cfree_cg_store`;
-- call argument preparation;
-- return handling;
-- branch handling.
-
-If an operation does not explicitly support delayed arithmetic, it should force
-materialization first. That keeps behavior local and avoids subtle ownership
-bugs.
-
-Do not fold across volatile memory. `CfreeCgMemAccess.flags` already carries
-`CFREE_CG_MEM_VOLATILE`; the local shadow and vstack folds must respect it.
-
-Do not depend on C signed-overflow undefined behavior. The CG API integer types
-are width-only, and the existing frontend currently passes `CFREE_CG_INTOP_NONE`
-for ordinary integer arithmetic. The fold should preserve the backend's
-current wrap-like behavior unless flags explicitly request something else.
-
-## Implementation Phases
-
-### Phase 1: Red Tests For Literal Folding
-
-Done. The CG API test now covers O1 literal fold shapes for:
-
-- direct API function returning `40 + 2`;
-- direct API function returning `(5 << 3) | 2`;
-- direct API compare through `40 + 2 == 42`;
-- direct API unary folding through `~~41 + 1`.
-
-Expected shape should check for fewer emitted arithmetic operations where the
-existing test harness can observe it. Runtime correctness alone is not enough.
-
-### Phase 2: Immediate-Only Fold
-
-Done. The fold helpers are wired into:
-
-- `api_cg_binop`;
-- `api_cg_unop`;
-- `api_cg_cmp`;
-- `cfree_cg_ret`.
-
-Validation run:
-
-```sh
-make test-cg-api test-opt
-CFREE_TEST_ALLOW_SKIP=1 make test-parse
-```
-
-Final observed parse result:
-
-```text
-2496 pass, 0 fail, 4 skip
-```
-
-### Phase 3: Delayed Arithmetic
-
-Done. Added `SV_ARITH`, operation-kind metadata, ownership helpers,
-materialization, release logic, and register-pressure materialization.
-
-Tests cover:
-
-- one delayed unary arithmetic value consumed by return;
-- one delayed binary arithmetic value consumed by return;
-- delayed arithmetic consumed by compare;
-- delayed arithmetic consumed by another binop with immediate folding;
-- delayed unary arithmetic consumed by another unop with chain folding;
-- delayed arithmetic forced by store path;
-- register-pressure materialization when no ordinary spill victim exists.
-
-Run:
-
-```sh
-make test-cg-api
-make test-opt
-```
-
-### Phase 4: Local Constant Shadow
-
-Done. Added local shadow state to `ApiSourceLocal` and invalidation helpers:
-
-```c
-static void api_local_const_clear(ApiSourceLocal*);
-static void api_local_const_clear_all(CfreeCg*);
-static void api_local_const_memory_boundary(CfreeCg*);
-static void api_local_const_control_boundary(CfreeCg*);
-static void api_local_const_address_taken(CfreeCg*, CfreeCgLocal);
-static int api_local_const_can_track(CfreeCg*, const ApiSourceLocal*,
-                                     CfreeCgMemAccess);
-static void api_local_const_store(CfreeCg*, CfreeCgLocal, CfreeCgTypeId, i64);
-static int api_local_const_load(CfreeCg*, CfreeCgLocal, CfreeCgTypeId,
-                                Operand* out);
-```
-
-Wire store/load behavior through `cfree_cg_store` and `cfree_cg_load`.
-
-Wire invalidation through:
-
-- `cfree_cg_label_place`;
-- `cfree_cg_jump`;
-- `api_branch_if`;
-- `cfree_cg_switch`;
-- call paths;
-- inline asm path;
-- atomics/fences;
-- indirect stores;
-- local address-taking.
-
-Tests cover:
-
-- `int x = 40; x = x + 2; return x;`;
-- shadow cleared by label/jump;
-- shadow cleared by branch;
-- shadow cleared by address-taking;
-- shadow cleared by volatile access;
-- shadow cleared by indirect store;
-- partial-width stores do not preserve a full-local shadow;
-- parameter locals are initially unknown.
-
-Run:
-
-```sh
-make test-cg-api
-make test-opt
-make test-parse
-```
-
-If the change affects emitted target code shape, also run one targeted smoke
-case per supported O1 target:
-
-```sh
-make test-smoke-x64
-make test-aa64-inline
-```
-
-and an RV64 targeted case if available locally.
-
-### Future Refactors
-
-Completed cleanup from the original phase 3/4 implementation:
-
-- `SV_CMP` and `SV_ARITH` payload fields now live in an `ApiSValue` union.
-- Local-shadow invalidation now goes through named boundary helpers:
-  `api_local_const_memory_boundary`, `api_local_const_control_boundary`, and
-  `api_local_const_address_taken`.
-- `test/api/cg_type_test.c` includes an explicit delayed-arithmetic
-  register-pressure case that forces materialization when no ordinary spill
-  victim exists.
-
-Still worth considering before growing the vstack simplifier further:
-
-- Consider a shadow generation counter if whole-function local counts grow
-  enough for repeated `clear_all` scans to show up in O1 metrics.
-- Consider one small builder helper in `test/api/cg_type_test.c` for creating
-  O1 one-parameter functions; the expanded constfold tests intentionally stayed
-  direct but now duplicate setup boilerplate.
-
-### Phase 5: Disassembly And Metrics
-
-Done. Re-ran the probe cases from `doc/OPT1.md`:
-
-- `6_5_17_compound_assign`;
-- a direct literal return case;
-- a compare/branch literal case;
-- one local-address-taken case.
-
-For each of AArch64, x64, and RV64, recorded:
-
-- `.text` size;
-- instruction count;
-- `mov`/`load_imm` count;
-- arithmetic instruction count;
-- spills/reloads;
-- rewrite inserted instruction count.
-
-Host JIT `run --time` samples recorded O1 wall time, live-range time,
-regalloc time, spills/reloads, and rewrite inserted instruction count. The
-current numbers are in `doc/PERF.md` under "Vstack Constfold Probe".
-
-## Expected Results
-
-For direct literal arithmetic:
-
-```c
-int test_main(void) { return 40 + 2; }
-```
-
-O1 should emit only return-value materialization for `42`, with no separate add.
-
-For scalar locals:
-
-```c
-int test_main(void) {
-  int x = 40;
-  x = x + 2;
-  return x;
-}
-```
-
-O1 should avoid the load/add/store/reload chain in the straight-line case. The
-best expected shape is return-value materialization for `42`, unless debug or
-local-storage requirements intentionally force a store.
-
-For address-taken locals:
-
-```c
-int test_main(void) {
-  int x = 40;
-  int* p = &x;
-  *p = *p + 2;
-  return x;
-}
-```
-
-The local shadow must not fold through the pointer access. Correctness is more
-important than preserving the constant.
-
-## Risks
-
-- Register ownership bugs in delayed arithmetic are the main implementation
-  risk. Reuse the compare ownership pattern and add tests where delayed
-  arithmetic values are forced, duplicated, dropped, and spilled.
-- Local shadow invalidation can become unsound if it is treated as a dataflow
-  fact. Keep it basic-block-local and clear aggressively.
-- Signed arithmetic folding can diverge from backend behavior if it uses host
-  signed overflow. Use unsigned operations plus width masking.
-- Return-path changes can accidentally bypass ABI requirements for aggregate
-  returns. Only preserve immediates for non-aggregate scalar returns.
-- Debug-info expectations for locals may eventually require materialized
-  storage even when constant-shadowed. Do not remove local declarations or
-  frame-slot allocation in this pass.
-
-## Acceptance Criteria
-
-- Immediate-only integer expressions fold before backend allocation.
-- Delayed compare-branch behavior remains covered and unchanged.
-- Delayed arithmetic materializes correctly in all consumers.
-- Straight-line scalar-local constants fold in the common `x = 40; x += 2`
-  shape.
-- Shadowed constants are cleared across control-flow, volatile, indirect
-  memory, call, asm, atomic, and address-taking boundaries.
-- Targeted tests pass for CG API, O1 opt, and parser coverage.
-- Disassembly shows fewer arithmetic/materialization instructions in the probe
-  cases without introducing target-specific vstack logic.
diff --git a/doc/MIR_RA_REPORT.md b/doc/MIR_RA_REPORT.md
@@ -1,933 +0,0 @@
-# MIR Data Structures and Register Allocation Report
-
-This report summarizes the data structures and algorithms in MIR that matter
-most for cfree's optimizer direction. The emphasis is structural: cfree should
-move toward MIR's allocation shape before spending more effort on small
-localized speedups in the current dense-conflict implementation.
-
-Primary MIR source references:
-
-- `~/tmp/mir/mir-gen.c`
-- `~/tmp/mir/mir.h`
-- `~/tmp/mir/mir-bitmap.h`
-- `~/tmp/mir/mir-varr.h`
-- `~/tmp/mir/mir-dlist.h`
-
-Related cfree notes:
-
-- `doc/OPT.md`: broad MIR-informed optimizer plan.
-- `doc/PERF.md`: current timing and scaling data.
-
-## Executive Summary
-
-MIR's register allocation pipeline is live-range based. It does not allocate by
-building a full pseudo-register interference matrix. MIR computes block live
-sets, converts them into compressed live ranges, assigns physical locations by
-checking occupied hard-register and stack locations at live-range program
-points, then rewrites instructions.
-
-cfree's original O1 path was conflict-matrix based. `opt_live_info` built a
-dense `nvals x words` pseudo interference matrix, stored per-instruction full
-`live_after` bitsets, and scanned `1..nvals` in hot loops. `doc/PERF.md` shows
-why that structure was the scaling problem for one large function.
-
-The main recommendation is to reshape cfree around MIR's allocation model:
-
-1. Separate block liveness from register-allocation analysis.
-2. Build live ranges from block liveness.
-3. Allocate against per-program-point occupied physical locations.
-4. Keep conflict matrices only for bounded, optional coalescing.
-5. Rewrite pseudos after assignment, with later live-range splitting for O2.
-
-This is a structural change, not an incremental optimization of the current
-matrix.
-
-As of the current O1 implementation, the dense pseudo-conflict allocator is no
-longer the normal path. O1 now has pass-local block liveness, compressed live
-ranges, split hard-register and stack-slot occupancy, assignment-map rewrite,
-per-location interval occupancy, and metrics for live/range/allocation/rewrite
-event traffic. The remaining work is no longer "delete the old matrix" or
-"replace O1 point-index occupancy"; those cleanup steps are complete. The next
-allocator work should move to O2 coalescing/splitting and generated-code
-quality, while keeping O1's fast MIR-shaped path as the baseline.
-
-## MIR Container Layer
-
-MIR uses small generic data-structure building blocks throughout `mir-gen.c`.
-They are simple but important because they determine how passes are organized.
-
-### VARR
-
-`mir-varr.h` defines typed growable arrays. A `VARR(T)` stores:
-
-- element count
-- capacity
-- element pointer
-- allocator
-
-Growth uses `realloc` and increases capacity by about 1.5x. MIR uses `VARR`
-for pass-owned vectors such as temporary instruction lists, live-range tables,
-sorted allocation candidates, and per-point used-location bitmaps.
-
-cfree already has arena-backed arrays and typed vector macros in several
-subsystems. We should not copy `VARR` mechanically, but we should mirror the
-ownership pattern: pass state owns resizable vectors explicitly instead of
-embedding all allocation artifacts permanently in `Func`.
-
-### DLIST
-
-`mir-dlist.h` defines intrusive typed doubly linked lists. MIR functions and
-basic blocks use linked instruction lists because passes insert, delete, and
-move instructions frequently:
-
-- function instruction list: `MIR_func.insns`
-- CFG block list: `func_cfg.bbs`
-- block instruction list: `bb.bb_insns`
-- edge lists: `bb.in_edges`, `bb.out_edges`
-
-cfree currently stores each `Block` as a contiguous `Inst*` array. That is good
-for replay and scanning, but less natural for MIR-style lowering, rewrite,
-spill insertion, edge splitting, and post-RA combine. For the first structural
-rewrite we can keep arrays if we introduce a pass-local edit buffer, but the
-final shape should support cheap insertion around instructions and on block
-boundaries.
-
-### Bitmap
-
-`mir-bitmap.h` is a growable sparse-ish bitmap built on `VARR<uint64_t>`.
-Important details:
-
-- It expands only as far as the highest needed bit.
-- `bitmap_clear` truncates length to zero.
-- binary/ternary operations trim trailing zero words.
-- iteration uses `FOREACH_BITMAP_BIT` to visit set bits.
-
-This is a key contrast with cfree's current fixed-width bitsets. In cfree,
-`opt_live_words = ceil(nvals / 64)`, and every live set for the function has
-that width. MIR's bitmaps still use dense words internally, but their active
-length tracks actual high bits, and many loops iterate set bits.
-
-cfree should add an optimizer bitmap abstraction before rewriting RA. It should
-support:
-
-- clear without touching max capacity
-- copy/ior/and-not with change reporting
-- set-bit iteration
-- optional trimming of trailing zero words
-
-The exact storage can remain arena-backed for pass lifetime, but the API should
-hide fixed full-function width from users.
-
-## MIR Function and CFG Shape
-
-MIR's generator owns a `gen_ctx` with pass subcontexts:
-
-- target context
-- data-flow context
-- SSA context
-- GVN context
-- live-range context
-- coalesce context
-- RA context
-- combine context
-
-This keeps pass state out of the public MIR function structure. The current
-function being generated stores a `func_cfg` in `func_item->data` during
-generation.
-
-### Instructions
-
-`MIR_insn` is a variable-sized object:
-
-```c
-struct MIR_insn {
-  void *data;
-  DLIST_LINK(MIR_insn_t) insn_link;
-  MIR_insn_code_t code : 32;
-  unsigned int nops : 32;
-  MIR_op_t ops[1];
-};
-```
-
-`data` points to pass-local metadata, usually `bb_insn` after CFG construction.
-This lets MIR keep the IR instruction object stable while attaching and
-replacing analysis state.
-
-cfree's `Inst` is a fixed struct with side arrays for operands and aux data.
-That is fine for recording, but pass state should move out of `Inst`, `Block`,
-and `Func` where possible. In particular, liveness, ranges, allocation
-assignment, and rewrite scratch state should be pass contexts, not persistent
-fields scattered through `Func`.
-
-### Basic Blocks
-
-MIR `bb` contains:
-
-- CFG links: in/out edge lists
-- instruction list
-- flags and reachability
-- four data-flow bitmaps: `in`, `out`, `gen`, `kill`
-- dominator bitmaps reused by several passes
-- loop-node pointer
-- max integer and FP pressure
-
-MIR reuses the bitmap fields for different data-flow problems through local
-macros such as `live_in`, `live_out`, `live_gen`, and `live_kill`.
-
-cfree currently stores `live_in`, `live_out`, `live_use`, `live_def`, and
-`live_after` directly on `Block`. That ties block storage to one analysis. The
-MIR-like direction is to keep persistent block structure focused on CFG and
-instruction ownership, and keep data-flow bitmaps in pass-local arrays indexed
-by block id.
-
-## MIR Pipeline Shape
-
-For the allocation-related portion, MIR runs:
-
-1. `target_machinize`
-2. build loop tree for O1+
-3. optional move collection and coalescing for O2
-4. full live info
-5. `reg_alloc`
-6. rewrite
-7. optional splitting for O2
-8. post-RA combine
-9. post-combine DCE
-
-This split is more important than individual implementation details. In MIR,
-liveness is a reusable input to multiple later phases, but register allocation
-does not depend on a full pseudo interference graph.
-
-For cfree, O1 should become:
-
-```text
-build_cfg
-machinize
-build_loop_tree
-live_blocks
-build_live_ranges
-assign_locations(no splitting)
-rewrite_locations
-combine
-dce
-emit
-```
-
-O2 can add coalescing, splitting, SSA value passes, and richer cleanup after
-this shape is stable.
-
-## MIR Liveness
-
-MIR's liveness is block-first:
-
-1. Walk each block backward to compute `live_gen` and `live_kill`.
-2. Solve backward data-flow:
-   - `live_out = union(successor.live_in)`
-   - `live_in = live_gen | (live_out & ~live_kill)`
-3. Track register frequency and block pressure while scanning.
-4. Iterate set bits for live-out pressure accounting.
-
-MIR also has a `scan_vars` mode. Normally `scan_vars_num == 0`, meaning all
-variables participate. For move coalescing, MIR can restrict liveness to only
-move-related variables.
-
-cfree should copy this separation. The first replacement for `opt_live_info`
-should not build:
-
-- `val_conflicts`
-- per-instruction `live_after`
-- spill/rewrite data
-
-It should only produce block live-in/live-out, use/def, frequency, and pressure
-inputs. Later passes can build the extra views they need.
-
-## MIR Live Ranges
-
-MIR represents each variable as a linked list of live ranges:
-
-```c
-struct live_range {
-  lr_bb_t lr_bb;
-  int start, finish;
-  int ref_cost;
-  live_range_t next;
-};
-```
-
-The range builder walks blocks backward from `live_out`, just like liveness,
-but emits range start/end points. It also records whole-block ranges using
-`lr_bb` for values that are live through a block without being referenced
-inside it.
-
-MIR then compresses live-range points:
-
-- gather all born/dead points
-- map old instruction points to a smaller point space
-- merge adjacent ranges where legal
-
-This compression matters. It reduces the size of the later `used_locs`
-structure and avoids making the allocator pay for every instruction position
-when only lifetime boundaries matter.
-
-cfree should introduce:
-
-```c
-typedef struct OptLiveRange {
-  Val val;
-  u32 start;
-  u32 end;
-  u32 next;
-  u32 first_block_span; /* optional: whole-block span list */
-} OptLiveRange;
-
-typedef struct OptLiveRangeSet {
-  OptLiveRange* ranges;
-  u32 nranges;
-  u32* first_range_by_preg;
-  u32 point_count;
-} OptLiveRangeSet;
-```
-
-The exact fields can change, but the structural point should not: allocation
-should reason from ranges, not from a dense value-value conflict matrix.
-
-## MIR Register Allocation
-
-MIR allocation has three core pieces:
-
-### Candidate Ordering
-
-MIR builds `sorted_regs` from pseudo regs, computes live length from ranges,
-and sorts allocation candidates. The priority includes tied hard-register
-requirements, frequency, and live length.
-
-cfree already has a similar priority idea in `val_higher_priority`. The
-problem is not ordering. The problem is what each candidate checks against.
-
-### Occupied Locations
-
-MIR maintains:
-
-- `used_locs`: vector of bitmaps indexed by compressed program point
-- `busy_used_locs`: extra occupancy for split-capable allocation
-- `conflict_locs1`: temporary bitmap of occupied physical locations
-
-For each pseudo range, MIR unions the occupied locations for all points in the
-range into `conflict_locs`, then asks for a hard register or stack slot not in
-that set.
-
-This is the central data-structure difference:
-
-- MIR: pseudo -> live ranges -> occupied hard locations by point
-- cfree today: pseudo -> dense conflicts with other pseudos -> scan assigned
-  pseudos to test hard-register conflicts
-
-The MIR shape bounds allocation checks by range length and physical-location
-bitmap width, not by all pseudo pairs. It is a better fit for large generated
-functions where many values never overlap.
-
-### Rewrite
-
-After assignment, MIR rewrites instruction operands:
-
-- hard-register assigned pseudos become hard regs
-- stack-assigned pseudos become loads/stores around uses/defs
-- some spilled uses can become memory operands directly if target-legal
-- call-clobbered live regs are saved/restored
-- noop moves are deleted
-
-cfree's current rewrite is simpler and tightly coupled to `live_after`. A
-MIR-like cfree rewrite should walk each block backward with a rolling live set,
-using allocation assignment and target scratch registers. It should not require
-persisting a full live-after bitmap for every instruction.
-
-## MIR Conflict Matrix Use
-
-MIR still has a conflict matrix, but only for coalescing. It is intentionally
-narrow:
-
-- collect moves
-- collect only source/destination regs from those moves
-- cap the number at `MIR_MAX_COALESCE_VARS`
-- build a matrix over this reduced scan-var space
-- coalesce non-conflicting move-related regs
-
-The comment in MIR says 10K scan vars is about 8 MB for the coalescing matrix.
-That cap is a design signal: even MIR avoids an unbounded full pseudo matrix.
-
-cfree should follow this model:
-
-- no full matrix in normal O1 allocation
-- optional move-only matrix in O2 coalescing
-- hard cap and metric counters for matrix size
-
-## cfree Mismatch Status
-
-The initial O1 implementation had five major structural mismatches:
-
-1. `opt_live_info` was overloaded.
-   It computed block liveness, per-value summaries, per-instruction live-after,
-   conflict matrix, call live-across summaries, and asm constraints in one pass.
-   This has been split into pass-local liveness, range construction,
-   allocation, and rewrite state.
-
-2. DDE paid for full allocation analysis.
-   Pre-DDE no longer builds allocation ranges or conflicts. It still runs block
-   liveness before DDE, and O1 runs block liveness again for regalloc after DDE
-   changes the instruction stream. Avoiding or reusing that second pass is a
-   remaining O1 cleanup item, not the main current bottleneck.
-
-3. Dense conflicts were the allocator's core dependency.
-   The normal O1 allocator no longer allocates or consumes `Func.val_conflicts`.
-   Dense matrices should stay out of O1 and return only as bounded, optional
-   move-coalescing support in O2.
-
-4. Hot loops scanned all vals.
-   Many of these loops now use operand-reference events, set-bit iteration, or
-   touched-value/generation lists. The final O1 cleanup removed the remaining
-   high-watermark live-across-call/rewrite scans that showed up in spill-heavy
-   wide functions.
-
-5. `Block.live_after` stored a full live bitmap per instruction.
-   Normal O1 no longer persists per-instruction full live-after storage.
-   Rewrite and DDE walk backward with rolling live state, and rewrite's
-   live-word traffic is now linear in the focused spill-pressure ladder.
-
-6. Point-index occupancy loops remained after the matrix was removed.
-   Normal O1 allocation now stores sorted live intervals per physical
-   hard-register and per stack spill slot. Assignment checks interval overlap
-   for each candidate range instead of OR-ing occupied-location bitmap rows for
-   every compressed point.
-
-## Proposed cfree Structural Target
-
-Introduce pass-local optimizer state:
-
-```c
-typedef struct OptBitset OptBitset;
-typedef struct OptDataflow OptDataflow;
-typedef struct OptLiveRanges OptLiveRanges;
-typedef struct OptAllocator OptAllocator;
-typedef struct OptRewrite OptRewrite;
-
-typedef struct OptPassCtx {
-  Compiler* c;
-  Func* f;
-  Arena* scratch;
-  OptDataflow live;
-  OptLiveRanges ranges;
-  OptAllocator alloc;
-} OptPassCtx;
-```
-
-Move these out of persistent `Func` where possible:
-
-- `live_in`, `live_out`, `live_use`, `live_def`
-- `live_after`
-- `val_conflicts`
-- allocation scratch arrays
-
-Keep these or equivalents as persistent-enough function metadata:
-
-- value type/class tables
-- block/instruction ownership
-- CFG edges
-- frame slots
-- target hard-register availability
-
-Allocation result should be a separate mapping:
-
-```c
-typedef enum OptLocKind {
-  OPT_LOC_NONE,
-  OPT_LOC_HARD,
-  OPT_LOC_STACK,
-} OptLocKind;
-
-typedef struct OptLoc {
-  u8 kind;
-  u8 cls;
-  Reg hard_reg;
-  FrameSlot spill_slot;
-} OptLoc;
-```
-
-Rewrite consumes `OptLoc[val]` and emits final target operands.
-
-## Implementation Direction
-
-This is the recommended structural sequence.
-
-### Phase 1: Data Structure Extraction
-
-Add pass-local bitset and data-flow structures. Keep current behavior running,
-but make new liveness code independent of `Block.live_*` fields.
-
-Deliverables:
-
-- `pass_live.c` owns block liveness storage.
-- bitset supports set-bit iteration and trimmed operations.
-- DDE uses block liveness only.
-- metrics count live words, active words, block live bytes, and set-bit scans.
-
-### Phase 2: Live Range Builder
-
-Build live ranges from block live sets. Do not allocate from them yet.
-
-Deliverables:
-
-- range list per value
-- compressed point numbering
-- metrics for point count, range count, average ranges per value, max range
-  length, and whole-block spans
-- dump output for range tests
-
-### Phase 3: Range-Based O1 Allocator
-
-Replace dense-conflict allocation with MIR-style occupied-location assignment.
-No live-range splitting yet.
-
-Deliverables:
-
-- `used_locs[point]` bitmaps for hard regs and stack slots
-- candidate ordering by tied hard reg, frequency, live length, stable id
-- hard-register assignment when possible
-- stack assignment otherwise
-- no `val_conflicts` allocation in normal O1
-
-### Phase 4: Rewrite Without `live_after`
-
-Rewrite by walking blocks and instructions with rolling live state plus
-assignment maps.
-
-Deliverables:
-
-- hard-reg operand rewrite
-- stack reload/store insertion
-- scratch-reg handling by target class
-- call-clobber preservation
-- noop move deletion
-- no required per-instruction full live-after storage
-
-### Phase 5: O1 Cleanup and Scaling
-
-After O1 has the right shape, remove transition scaffolding and fix the
-remaining O1 scaling buckets before adding O2 allocator features:
-
-- rewrite without extra per-instruction temporary list churn: done
-- delete the old dense liveness/conflict path from the normal O1 route: done
-- narrow or compress `used_locs` so stack occupancy does not scale as
-  `points * candidates`: done via split hard/stack occupancy and then
-  per-location interval sets
-
-Only after that should O2 add move collection, move-only liveness, capped
-coalescing matrices, live-range splitting, and edge/block spill placement. This
-mirrors MIR's division: the matrix is an optional coalescing tool, not the
-allocator's main model.
-
-## Implementation Checklist
-
-Use this as the working tracker for the structural rewrite. Each phase should
-land with focused tests and metrics before moving to the next phase.
-
-Compatibility policy: compatibility with the old O1 optimizer internals is not
-a goal. The public compiler/API behavior must stay correct, but internal helper
-APIs, dumps, tests, metrics, and persistent IR fields may be broken, renamed, or
-deleted when they preserve the dense-conflict shape or add runtime cost. Prefer
-a clean fast implementation over shims for old pass boundaries.
-
-### Phase 0: Baseline and Guardrails
-
-- [x] Preserve current O0 behavior as the reference path.
-- [x] Keep existing O1 tests green before structural changes begin.
-- [x] Add or identify stress inputs for:
-  - [x] one large straight-line function
-  - [x] many small functions
-  - [x] branch liveness
-  - [x] call-clobber preservation
-  - [x] spills
-  - [x] tied hard registers from inline asm
-- [x] Keep `cfree run --time -O1` metrics available throughout the rewrite.
-- [x] Add metrics names for the new shape before deleting old counters:
-  - [x] `opt.live.blocks`
-  - [x] `opt.live.active_words`
-  - [x] `opt.live.bitset_words_touched`
-  - [x] `opt.live.dataflow_iterations`
-  - [x] `opt.live.dataflow_block_visits`
-  - [x] `opt.ranges`
-  - [x] `opt.range_points`
-  - [x] `opt.range.point_visits`
-  - [x] `opt.range.preg_scans`
-  - [x] `opt.range.live_words_touched`
-  - [x] `opt.alloc.used_loc_words`
-  - [x] `opt.alloc.hard_loc_words`
-  - [x] `opt.alloc.stack_loc_words`
-  - [x] `opt.alloc.stack_slots`
-  - [x] `opt.alloc.hard_point_visits`
-  - [x] `opt.alloc.stack_point_visits`
-  - [x] `opt.alloc.hard_word_ors`
-  - [x] `opt.alloc.stack_word_ors`
-  - [x] `opt.alloc.hard_mark_points`
-  - [x] `opt.alloc.stack_mark_points`
-  - [x] `opt.alloc.spills`
-  - [x] `opt.dde.live_words_touched`
-  - [x] `opt.rewrite.reloads`
-  - [x] `opt.rewrite.stores`
-  - [x] `opt.rewrite.inserted_insts`
-  - [x] `opt.rewrite.live_words_touched`
-
-Implemented by `test/opt/phase0_guardrails.sh`, wired into `make test-opt`.
-The script runs generated O0/O1 guardrail programs for the stress shapes above
-and checks that `cfree run --time -O1` emits the reserved metric names. Inline
-asm tied-register stress is identified by the existing parser case
-`test/parse/cases/asm_01_grammar.c`; allocator-specific inline-asm execution
-coverage belongs in a later pass once the range allocator owns rewrite.
-
-### Phase 1: Pass-Local Liveness
-
-- [x] Add an optimizer bitset abstraction with:
-  - [x] set/clear/test
-  - [x] copy
-  - [x] union
-  - [x] union-and-not
-  - [x] change reporting
-  - [x] set-bit iteration
-  - [x] active-word tracking or trailing-zero trimming
-- [x] Add pass-local block liveness storage.
-- [x] Move block `use`/`def` calculation into the new liveness module.
-- [x] Compute `live_in`/`live_out` without allocating conflicts.
-- [x] Make pre-DDE use only block liveness.
-- [x] Keep old full `opt_live_info` available for regalloc during transition.
-      This transition debt was retired in phase 6.
-- [x] Add dump output for block liveness.
-- [x] Add tests for branch, loop, and call liveness.
-
-Exit criteria:
-
-- [x] Existing public O1 behavior is unchanged.
-- [x] Pre-DDE no longer builds `val_conflicts`.
-- [x] Metrics show separate block-liveness timing and no pre-DDE conflict bytes.
-
-### Phase 2: Live Range Model
-
-- [x] Define `OptLiveRange` and `OptLiveRangeSet`.
-- [x] Build ranges from block `live_out` with a backward block walk.
-- [x] Track per-value:
-  - [x] first range
-  - [x] live length
-  - [x] frequency/spill cost
-  - [x] live-across-call information
-- [x] Add compressed point numbering.
-- [x] Add whole-block span representation or an equivalent compact model.
-- [x] Add range dump output.
-- [x] Add range metrics:
-  - [x] total ranges
-  - [x] total points before compression
-  - [x] total points after compression
-  - [x] max ranges per value
-  - [x] max live length
-
-Exit criteria:
-
-- [x] Range dumps match expected lifetimes on small hand-written cases.
-- [x] Range construction works without using `Block.live_after`.
-- [x] Current allocator still works while range data is built side by side.
-
-### Phase 3: Range-Based Assignment
-
-- [x] Define `OptLoc` assignment mapping for every `Val`.
-- [x] Build allocation candidate list from values with ranges.
-- [x] Sort candidates by:
-  - [x] tied hard-register requirement
-  - [x] frequency/spill cost
-  - [x] shorter live length
-  - [x] stable value id
-- [x] Add `used_locs[point]` bitmaps.
-- [x] Mark hard-register live ranges in `used_locs`.
-- [x] For each candidate, union occupied locations across its ranges.
-- [x] Assign a legal hard register when available.
-- [x] Assign/reuse stack slots when no hard register is available.
-- [x] Preserve tied hard-register validation.
-- [x] Preserve forbidden hard-register constraints from asm/calls.
-- [x] Keep live-range splitting disabled for O1.
-
-Exit criteria:
-
-- [x] O1 regalloc no longer depends on `val_conflicts`.
-- [x] `opt.conflict_bytes` is zero or absent on the normal O1 allocation path.
-- [x] Branch, loop, call-clobber, spill, and tied-register tests pass.
-
-### Phase 4: Rewrite From Assignment Map
-
-- [x] Rewrite hard-assigned values to hard registers.
-- [x] Insert reloads for stack-assigned uses.
-- [x] Insert stores for stack-assigned defs.
-- [x] Reserve target scratch registers per class.
-- [x] Handle operands nested in:
-  - [x] normal operands
-  - [x] indirect operands
-  - [x] call descriptors
-  - [x] return descriptors
-  - [x] intrinsic descriptors
-  - [x] inline asm descriptors
-- [x] Preserve call-clobber saves/restores with a rolling live set.
-- [x] Delete noop moves after rewrite.
-- [x] Avoid persistent per-instruction full `live_after` storage.
-- [x] Add rewrite dump output.
-
-Exit criteria:
-
-- [x] O1 emits correct code without `Block.live_after`.
-- [x] Reload/store metrics are stable and understandable.
-- [x] Public O1 behavior and smoke JIT tests pass.
-
-### Phase 5: Reduce Phase-4 Rewrite Overhead
-
-Phase-4 timings show the no-`live_after` rewrite improves the large
-straight-line case but regresses many-small-function/function-table inputs. Keep
-the rolling-live design, but remove the temporary-list and reverse-copy cost
-introduced by the first implementation.
-
-- [x] Rewrite blocks with a single block-local output buffer or pre-sized edit
-      buffer.
-- [x] Avoid per-instruction `RewriteList` allocation/copy churn.
-- [x] Preserve rolling-live call-clobber behavior without persistent
-      `Block.live_after`.
-- [x] Keep scratch-register selection deterministic after the rewrite cleanup.
-- [x] Update tests that asserted old internal storage or exact intermediate
-      dump details.
-- [ ] Re-run the same-input phase-3 vs phase-5 timing comparison from
-      `doc/PERF.md`.
-
-Exit criteria:
-
-- [ ] Function-table/main-many-small `opt.regalloc` returns to phase-3 parity
-      or better.
-- [ ] Large straight-line phase-4 gains are retained.
-- [x] No persistent per-instruction full live-after sets are rebuilt in normal
-      O1.
-
-Implemented by rewriting each block into one backward-filled block output
-buffer. Per-instruction before/after insertion lists are reused per block, and
-the final reverse-copy pass is gone.
-
-### Phase 6: Delete Old Dense Path
-
-- [x] Remove normal-path allocation of `Func.val_conflicts`.
-- [x] Delete old all-values conflict construction from O1, including
-      compatibility-only helpers and tests.
-- [x] Remove or demote `opt_live_words`, `opt_conflict_words`,
-      `Block.live_in`, `Block.live_out`, `Block.live_use`, `Block.live_def`,
-      and `Block.live_after` from persistent IR state where no public behavior
-      requires them.
-- [x] Replace `opt_live_info` callers with pass-local liveness/ranges, then
-      delete or narrow `opt_live_info`.
-- [x] Make tests assert public behavior and new pass dumps/metrics, not dense
-      conflict matrix details.
-- [x] Update docs and metrics names to reflect range-based RA.
-
-Exit criteria:
-
-- [x] Normal O1 has no reachable dense pseudo-conflict construction.
-- [x] The source tree no longer contains tests whose purpose is preserving the
-      old dense interference model.
-- [x] `opt.conflict_bytes` is absent or always zero on normal O1 paths.
-
-Implemented by deleting `opt_live_info`, `Func.val_conflicts`,
-`Func.opt_conflict_words`, and persistent block liveness/live-after fields.
-Post-rewrite combine/DCE now build pass-local hard-register liveness when they
-need block live-out information.
-
-### Phase 7: Narrow `used_locs`
-
-Phase-3 and phase-4 timings show the remaining large-function bend tracks
-`opt.alloc.used_loc_words`. The first range allocator sizes occupied-location
-bitmaps as hard-register bits plus one possible stack bit per candidate; that
-preserves correctness but reintroduces superlinear memory/time for one large
-function.
-
-- [x] Split hard-register occupancy from stack-slot occupancy.
-- [x] Make stack occupancy demand-sized by allocated stack slots, not candidate
-      count.
-- [x] Consider sparse or interval-indexed stack occupancy for long straight-line
-      functions.
-- [x] Keep hard-register occupancy compact and class-aware.
-- [x] Add metrics separating hard occupancy words, stack occupancy words, and
-      allocated stack slot count.
-- [ ] Re-run straight-line and table-main ladders after each shape change.
-
-Exit criteria:
-
-- [ ] Large straight-line benchmark scales by range/point count, not
-      `points * candidates`, `nvals * live_words`, or dense conflict bytes.
-- [ ] `doc/PERF.md` can be updated with before/after O1 scaling numbers.
-- [ ] Table-main remains at phase-5 parity or better.
-
-Implemented with separate `hard_used_locs` and `stack_used_locs` allocator
-tables. Hard occupancy remains `OPT_REG_CLASSES * 32` bits per point; stack
-occupancy grows only when a new spill slot crosses a word boundary. The sparse
-interval form is deferred until timing data shows it is needed.
-
-### Phase 8: O1 MIR-Shape Cleanup
-
-The range allocator has the right high-level shape, but current timings show
-there is still O1 work to do before adding O2 coalescing or splitting. The next
-areas of focus are the remaining places where cfree is less MIR-shaped than it
-looks at the pass boundary: fixed-width live traffic, point-by-point occupancy
-loops, and rewrite liveness work.
-
-Implemented in the current cleanup pass:
-
-- [x] Emit event counters for live, range, allocation, DDE, and rewrite work.
-- [x] Remove the duplicate pre-DDE live-range build from the normal O1 path.
-- [x] Replace DDE per-instruction full-width use/def bitsets with
-      operand-reference events.
-- [x] Replace rewrite per-instruction full-width use/def bitsets with
-      operand-reference events.
-- [x] Track open/touched range values with generation marks during live-range
-      construction.
-- [x] Avoid assigning caller-saved registers to non-tied values that are known
-      to be live across calls.
-- [x] Update `doc/PERF.md` with current O1 scaling data and counter ratios.
-
-Phase 8 cleanup status:
-
-- [x] Make `OptBitset` operations truly MIR-like: track active length, trim
-      trailing zero words, and bound copy/ior/and-not work by active words
-      rather than the function's maximum value id.
-- [x] Reduce `opt.live_ranges.regalloc` on spill-heavy wide functions.
-      Current `opt.range.preg_scans` is linear, so investigate live-word
-      touches, range insertion/compression, final per-preg summaries, and
-      live-across-call bookkeeping in `src/opt/pass_live.c`.
-- [x] Reduce `opt.rewrite.live_words_touched`. Rewrite no longer stores
-      `Block.live_after`, but its rolling live set still grows faster than
-      input on spill-heavy wide functions.
-- [ ] Replace repeated point-index occupancy OR/mark loops with per-location
-      interval or event state once live-range/rewrite traffic is under control.
-      `opt.alloc.*point_visits` is linear now, but the absolute count is high.
-- [ ] Decide whether the second O1 block-liveness pass can be avoided after
-      DDE, either by reusing/revalidating liveness or by having DDE produce
-      enough local update information.
-- [x] Keep split hard-register and stack-slot occupancy. It fixed the old
-      `points * candidates` stack growth and should remain the O1 baseline.
-
-Exit criteria:
-
-- [x] Straight-line and table-shaped stress inputs remain near-linear at the
-      512-to-1024 step.
-- [x] Spill-pressure `opt.live_ranges.regalloc` and `opt.regalloc` stop
-      growing near `3x` on each input doubling.
-- [x] `opt.rewrite.live_words_touched` no longer grows superlinearly on the
-      spill ladder.
-- [x] O1 remains free of dense pseudo-conflict construction on the normal path.
-
-Implemented by active-length `OptBitset` copies, a rolling live-value list for
-live-across-call range bookkeeping, and rewrite call-save filtering that only
-checks hard-assigned caller-saved values known to be live across a call. The
-remaining point-index occupancy loops are linear and kept as future O2-quality
-work rather than an O1 blocker.
-
-### Phase 9: O2 Coalescing
-
-- [ ] Collect move candidates after machinize.
-- [ ] Add move-only liveness using `scan_vars`-style restricted variables.
-- [ ] Build a capped conflict matrix only for move-related values.
-- [ ] Add metrics for:
-  - [ ] move candidates
-  - [ ] coalescing scan vars
-  - [ ] coalescing matrix bytes
-  - [ ] coalesced moves
-- [ ] Coalesce non-conflicting move groups.
-- [ ] Keep the matrix cap explicit and configurable at compile time.
-
-Exit criteria:
-
-- [ ] Coalescing is optional and does not affect O1 compile-time shape.
-- [ ] Matrix memory is bounded and reported.
-
-### Phase 10: O2 Splitting and Spill Placement
-
-- [ ] Add split-capable occupancy state similar to MIR's `busy_used_locs`.
-- [ ] Identify profitable live-range gaps.
-- [ ] Place spills/restores on block boundaries.
-- [ ] Split critical edges when required for edge placement.
-- [ ] Add spill-placement dumps.
-- [ ] Add metrics for split ranges and edge/block spill placements.
-
-Exit criteria:
-
-- [ ] O2 can reduce spills without changing O1 assignment behavior.
-- [ ] Edge/block placement tests cover diamonds, loops, and critical edges.
-
-### Phase 11: Cleanup and Documentation
-
-- [ ] Update `doc/OPT.md` with the final implemented module names and pass
-      order.
-- [x] Update `doc/PERF.md` with new O1 scaling data.
-- [ ] Add a short allocator invariant section near the implementation.
-- [ ] Ensure pass dumps are stable enough for targeted regression tests.
-- [ ] Remove obsolete comments that describe the dense conflict allocator as
-      the normal O1 path.
-
-## Non-Goals For The First Rewrite
-
-Avoid spending early effort on:
-
-- micro-optimizing the existing full conflict matrix
-- adding sparse rows to `val_conflicts` as a long-term solution
-- tuning candidate priority before range-based allocation exists
-- preserving optimizer-internal compatibility with the dense allocator era
-- link/JIT performance
-- parser/CG data-structure changes
-
-Those may still matter later, but they do not fix the current structural
-scaling issue.
-
-## Design Risks
-
-### Instruction Storage
-
-cfree's block instruction arrays make insertion more awkward than MIR's linked
-lists. We can handle this with rewrite output buffers initially. If combine,
-spill insertion, and edge splitting become too cumbersome, revisit instruction
-storage after the range allocator lands.
-
-### Arena Lifetime
-
-MIR explicitly destroys many pass structures. cfree's arenas make allocation
-cheap but can hide large transient memory. The new pass contexts should use a
-scratch arena or explicit freeable heap where transient memory is substantial.
-
-### Debuggability
-
-The rewrite should add stable dumps early:
-
-- block live sets
-- live ranges
-- allocation map
-- inserted reloads/stores
-- final rewritten IR
-
-Without dumps, range allocation bugs will be hard to diagnose.
-
-### Target Coupling
-
-MIR's rewrite relies heavily on target hooks. cfree should keep target
-knowledge behind explicit helpers:
-
-- hard-register class availability
-- caller-saved/callee-saved masks
-- scratch registers
-- memory-operand legality
-- stack-slot addressability
-
-## Bottom Line
-
-MIR's RA performance shape comes from avoiding a full pseudo interference graph
-in the normal allocation path. cfree should mirror that, even when doing so
-breaks old optimizer-internal APIs, fields, dumps, or tests. The next serious
-O1 work should keep the pass-local liveness, compressed live ranges,
-occupied-location assignment, and assignment-map rewrite shape, then remove the
-transition scaffolding and remaining scaling buckets.
-
-Once that shape exists, incremental tuning will have a much better foundation.
diff --git a/doc/OPT.md b/doc/OPT.md
@@ -1,1371 +1,336 @@
-# OPT - MIR-informed implementation plan
+# OPT - Optimizer Design
 
-Scope: deliver cfree's optimized backend behind the `CGTarget` wrapper
-contract described in `doc/DESIGN.md` Section 9, using MIR JIT's optimizer as
-the model for phase order, level gating, and performance targets.
+This document describes cfree's optimized backend pipeline. It is intended to
+be the stable design reference for `-O1` and `-O2`: the IR shape, pass order,
+major analyses, and ownership boundaries. Performance measurements and current
+benchmark gaps live in `doc/OPT_PERF.md`.
 
-This document is based on a source investigation of `~/tmp/mir/`, mainly
-`mir-gen.c`, `mir.c`, `mir.h`, `mir-gen.h`, `MIR.md`, `README.md`, and the
-`c2mir` driver. The important takeaway is that MIR keeps the optimizer short:
-`-O1` is a fast lowering/code-selection path, while the expensive
-SSA/value/memory/loop work is reserved for `-O2`.
+The optimizer is implemented as a `CGTarget` wrapper. The C frontend and other
+frontends still emit through the public CG interface; the wrapper records that
+stream as cfree IR, runs the requested optimization schedule, then replays the
+lowered IR into the wrapped target.
 
----
+## Goals
 
-## 1. Target Shape
+- Keep `-O0` as the direct backend path.
+- Keep `-O1` as a fast optimized tier: shared lowering, liveness, allocation,
+  post-RA cleanup, and emission, but no SSA value optimization or inlining.
+- Make `-O2` the full quality tier: interprocedural retention for inlining,
+  SSA value/memory passes, loop cleanup, pressure relief, coalescing, live-range
+  splitting, and the same final lowering path as `-O1`.
+- Preserve module boundaries. Frontends own language semantics, ABI lowering
+  owns call/return shape, targets own encoding legality, and optimizer passes
+  operate on recorded CG-level IR.
 
-cfree exposes three levels:
+## Level Shape
 
-- `-O0`: direct backend path. CG drives the target `CGTarget` immediately.
-- `-O1`: MIR-style minimal optimizer. Record IR, lower it through the new
-  backend path, run liveness, simplified register allocation, post-RA combine,
-  and DCE. No SSA value passes and no inlining.
-- `-O2`: MIR-style full optimizer. Add SSA, GVN/constant propagation,
-  redundant load elimination, copy propagation, dead store elimination,
-  SSA DCE, LICM, pressure relief, address transformation, block cloning,
-  coalescing, live-range splitting, and inlining plus cleanup.
+`-O0`:
 
-The core contract stays simple:
+```text
+frontend -> CG -> target
+```
 
+`-O1`:
+
+```text
+frontend -> CG -> opt_cgtarget records Func IR
+
+func_end:
+  build/lower CFG
+  machinize
+  loop tree
+  liveness
+  dead-def elimination
+  register allocation without live-range splitting
+  post-RA combine
+  DCE
+  jump cleanup
+  emit to wrapped target
 ```
-parse -> cg -> opt_cgtarget { record Func IR }
-
--O1 func_end/finalize:
-    build_cfg
-    machinize
-    build_loop_tree
-    live_info
-    regalloc(simplified)
-    combine
-    dce
-    opt_emit -> wrapped CGTarget
-
--O2 func_end/finalize:
-    build_cfg
-    build_reg_ssa
-    block_cloning
-    build_ssa
-    gvn
-    copy_prop
-    dse
-    ssa_dce
-    build_loop_tree + licm
-    pressure_relief
-    make_conventional_ssa
-    ssa_combine
-    undo_ssa
-    jump_opt
-    machinize
-    build_loop_tree
-    coalesce
-    live_info
-    regalloc(with live-range splitting)
-    combine
-    dce
-    opt_emit -> wrapped CGTarget
-
--O2 finalize before final lowering:
-    opt_inline
-    opt_cleanup(dirty callers)
+
+`-O2`:
+
+```text
+frontend -> CG -> opt_cgtarget records all Func IR
+
+finalize:
+  inline retained functions
+  for each retained function:
+    SSA cleanup and value/memory optimization
+    lower through the shared backend pipeline
+    register allocation with coalescing and live-range splitting
+    emit to wrapped target
 ```
 
-The level dial selects optimization work. The lowering backend should be
-shared by `-O1` and `-O2`; `-O1` must remain the bisection floor for bugs in
-machinize, liveness, allocation, emission, and target-specific rewrite.
+Functions that contain constructs whose cloning or SSA semantics are not yet
+explicit, such as inline asm, label-address loads, or indirect branches, are
+conservatively routed through the non-SSA lowering path.
 
----
+## Core Data Structures
 
-## 2. MIR Findings
+### `OptImpl`
 
-### 2.1 Where the Optimizer Lives
+`OptImpl` is the `CGTarget` wrapper state in `src/opt/opt.c`.
 
-MIR's optimizer is concentrated in `mir-gen.c`; inlining and MIR-level
-simplification live in `mir.c`; `c2mir/c2mir-driver.c` maps command-line
-`-O` flags to `MIR_gen_set_optimize_level`.
+- `target` is the wrapped real backend target.
+- `level` selects the optimization schedule.
+- `f` and `cur` track the function and current block during recording.
+- `pending_loc` stores the latest source location and stamps new instructions.
+- `FuncSet funcs` retains all `-O2` functions until `finalize`, so inlining can
+  see both callees and callers before any function is emitted.
 
-Key source facts:
+### Function IR
 
-- `MIR_gen_init` defaults `optimize_level` to `2`.
-- `MIR_gen_set_optimize_level(ctx, level)` only stores the requested level in
-  generator context.
-- `c2mir` treats bare `-O` as `-O2`.
-- Current `mir-gen.c` checks `optimize_level >= 2` for the full pass set.
-  Some docs/comments still describe a separate `-O3`, but the current source
-  does not gate additional passes on `>= 3`. For cfree, model `-O2` as MIR's
-  current full source-level pipeline.
-
-### 2.2 MIR Level Semantics
-
-MIR documentation says:
-
-- Level `0`: only register allocator and machine code generator work.
-- Level `1`: adds code selection. It should produce more compact/faster code
-  than level 0 at nearly the same generation speed.
-- Level `2`: adds common subexpression elimination and conditional constant
-  propagation. MIR docs report level 1 generation speed around 50 percent
-  faster than level 2.
-- Current source-level `>=2`: also includes block cloning, SSA variable
-  renaming, address transformation, DSE, LICM, pressure relief, coalescing,
-  and live-range splitting.
-
-For cfree:
-
-- `-O1` should optimize the backend mechanics without running SSA passes.
-- `-O2` should be the full quality mode.
-- Do not create a user-visible `-O3` until benchmarks identify a genuinely
-  useful extra tier.
-
-### 2.3 MIR Performance Bar
-
-MIR's README reports `c2m -eg` at about `0.91` geomean performance relative
-to GCC `-O2` on its 15 small C benchmarks, with QBE around `0.65` on the same
-table. Treat those numbers as directional rather than a literal cfree promise:
-MIR has a JIT-oriented IR and mature per-target rewrite logic. Still, they set
-the bar for this project:
-
-- `-O1`: compile-time-focused. It should be much closer to `-O0` compile time
-  than to `-O2`, while removing obvious backend artifacts through combine and
-  DCE.
-- `-O2`: performance-focused. The first external target is at least QBE-class
-  generated-code performance on the MIR `c-benchmarks` set, then iteratively
-  close on MIR's published result.
-- Both levels must preserve the CG corpus exit-code contract before benchmark
-  tuning starts.
-
----
-
-## 3. MIR Pipeline, Source Order
-
-This is the effective order in current `mir-gen.c` for full function
-generation.
-
-### 3.1 Common Setup
-
-MIR duplicates the function instruction list, allocates a function CFG object,
-and runs `build_func_cfg`.
-
-`build_func_cfg`:
-
-- Creates synthetic entry and exit blocks.
-- Splits basic blocks at labels, branches, returns, property branches, and
-  fallthrough boundaries.
-- Converts register operands to internal variable operands.
-- Marks address-producing instructions and addressable regs.
-- Adds edges for direct branches, switches, fallthrough, and possible indirect
-  jump targets.
-- Marks calls on blocks so memory availability and liveness can treat them as
-  barriers.
-- Removes unreachable blocks when `optimize_level > 0`.
-
-cfree's `opt_build_cfg` should follow this shape: construct explicit
-entry/exit blocks, keep call/memory barrier metadata on blocks, and make
-unreachable cleanup part of `-O1+`, not a late optional cleanup.
-
-### 3.2 Full `-O2` Pre-Lowering Passes
-
-MIR full mode then runs:
-
-1. `clone_bbs` before SSA when there are no address instructions.
-   MIR clones cold blocks after return back into hot predecessors when this
-   exposes optimization in the hot path. It uses a bounded growth factor and
-   skips back edges.
-
-2. `build_ssa`.
-   MIR uses a Braun-style construction: it creates optimized maximal SSA,
-   minimizes redundant phis, builds def-use edges, and optionally renames
-   variables. cfree should copy the useful property, not the representation:
-   every use should have cheap access to its defining inst, and every def
-   should have cheap access to its uses.
-
-3. `addr_xform` when address instructions exist.
-   MIR tries to eliminate `ADDR` pseudos that only feed memory operands. If an
-   address pseudo must remain addressable, MIR converts uses to memory loads
-   and stores, rebuilds SSA, then clones blocks. For cfree, this maps to
-   folding GEP/address-of chains into target memory operands where the target
-   can encode them, while keeping address-taken frame slots in memory.
-
-4. `gvn`.
-   MIR computes memory availability, dominators, and value numbers in postorder.
-   This pass includes constant propagation, branch folding, redundant expression
-   elimination, redundant load elimination, and store/load reuse. It uses alias
-   and nonalias tags on memory operands. Calls clear or conservatively restrict
-   memory availability.
-
-   cfree should land this in two explicit slices. The first slice is scalar GVN
-   only: pure SSA value numbering and replacement for constants, arithmetic,
-   compares, conversions, and other side-effect-free register expressions. It
-   should not eliminate or forward `IR_LOAD`, interpret `IR_STORE`, or make
-   memory-dependent branch decisions.
-
-   Before the second slice touches memory, document and test the memory
-   numbering shape. That design must define memory version tokens by alias
-   class/root, how `MemAccess.alias`, address operands, address spaces, globals,
-   TLS, params, and unknown memory participate in keys, and the invalidation
-   rules for stores, calls, atomics, fences, volatile operations, and inline asm.
-   Redundant-load elimination and store/load reuse start only after that shape
-   is explicit and covered by pass-local dump tests.
-
-5. `copy_prop`.
-   MIR propagates copies, folds multiply/divide by powers of two, and removes
-   redundant extension chains. cfree should keep this as a separate pass after
-   GVN because it relies on SSA edges and target legality.
-
-6. `dse`.
-   MIR computes memory liveness by memory location, handles alloca escape
-   through calls, and removes stores whose memory location is not live. This
-   depends on GVN's memory numbering and alias classification.
-
-7. `ssa_dce`.
-   MIR deletes SSA instructions with unused outputs, while preserving calls,
-   allocas, varargs, returns, frame/stack effects, overflow sequences with live
-   overflow branches, and other side-effecting ops.
-
-8. `build_loop_tree + licm`.
-   MIR builds natural loops and preheaders. LICM skips branches, phis, calls,
-   memory, varargs, allocas, and potentially trapping divisions/mods. It is
-   pressure-sensitive: cheap single instructions are not moved unless their
-   inputs are worth moving too; multiplies are considered expensive enough to
-   move.
-
-9. `pressure_relief`.
-   MIR moves single-use immediate or constant-like moves closer to their use
-   when doing so reduces pressure and does not move work into a loop.
+The optimizer IR is defined in `src/opt/ir.h`.
 
-10. `make_conventional_ssa`.
-    MIR lowers phis into edge/block moves, splitting critical edges when needed.
+- `Func` owns the arena-backed IR for one function: blocks, frame slots,
+  pseudo-register metadata, value metadata, local declarations, emit order, and
+  pass scratch fields.
+- `Block` owns a linear instruction list plus CFG predecessor/successor edges.
+  `emit_order` records layout order separately from block id order.
+- `Inst` is one recorded operation. Most `IROp` values map directly to a single
+  CGTarget method. Source locations are stored on instructions.
+- `PReg` is the mutable pseudo-register id used by recorded CG IR and by the
+  lowering pipeline before register SSA.
+- `Val` is the SSA value id used after `opt_build_reg_ssa` and mem2reg SSA.
+  `VAL_NONE` and id zero are sentinels.
+- Per-op auxiliary structs preserve structured data for calls, returns, phis,
+  switches, inline asm, atomics, aggregate memory operations, intrinsics, and
+  debug/scope operations.
 
-11. `ssa_combine`.
-    MIR combines compare+branch pairs and folds address components into memory
-    operands while SSA edges are still available.
-
-12. `undo_ssa`.
-    MIR removes phi nodes and SSA edges.
-
-13. `jump_opt`.
-    MIR removes unreachable blocks, empty blocks, branches to the next
-    instruction, chains of labels, and branches to jumps.
-
-### 3.3 Lowering and Allocation
-
-After pre-lowering optimization, MIR runs the machine-dependent and allocation
-pipeline:
-
-1. `target_machinize`.
-   This performs ABI lowering, call lowering, target two-operand forms, and
-   other machine-dependent normalization.
-
-2. Build a loop tree for `-O1+`.
-   MIR uses loop depth for frequency/pressure estimates in liveness,
-   coalescing, and allocation.
-
-3. `collect_moves`, move-only liveness, conflict matrix, and `coalesce` at
-   `-O2`.
-   MIR aggressively coalesces move-related regs, prioritizing moves by loop
-   frequency and checking conflicts in a bounded matrix.
-
-4. Full `live_info`.
-   MIR computes `live_in`/`live_out`, register frequencies, live lengths, and
-   block pressure. It understands phis before SSA destruction and call-used
-   hard registers after machinize.
-
-5. `reg_alloc`.
-   MIR builds live ranges and assigns pseudos by priority: tied hard regs first,
-   then higher frequency, then shorter live length. At levels below 2 it uses a
-   simplified allocator. At full level it can split live ranges and place
-   spills/restores on edges or block boundaries.
-
-6. `rewrite`.
-   MIR rewrites pseudo regs to hard regs or stack slots, inserts reloads and
-   stores, handles call-clobbered saves, and deletes noop moves.
-
-7. `combine`.
-   This is target-aware code selection. It substitutes safe single-use moves,
-   collapses extension chains, commutes operands to expose legal combinations,
-   removes internal labels, and validates each rewrite with target legality.
+The representation deliberately keeps mode explicit. Before pseudo-register
+SSA, `OPK_REG` operands carry `PReg` ids. After pseudo-register SSA, they carry
+`Val` ids, and `Func.opt_reg_ssa` tells shared helpers which namespace is
+active.
 
-8. `dead_code_elimination`.
-   MIR performs post-RA DCE using live-out sets and preserves side effects.
+### Memory Shape
 
-9. Prolog/epilog and machine instruction generation.
+Memory operations carry `MemAccess`, including type, size, alignment, volatile
+or atomic properties, and alias root. Important alias roots include locals,
+globals, TLS, parameters, and unknown memory. Memory-aware passes must preserve
+observable memory: volatile, atomic, aggregate operations with side effects,
+calls that may clobber, fences, and inline asm.
 
-cfree should keep this exact split: combine after register allocation is not a
-replacement for SSA `copy_prop`; it is a target legality/code-selection pass.
+### Analysis State
 
-### 3.4 Inlining
+`OptAnalysis` holds per-pass order and dominance data:
 
-MIR inlining is in `mir.c`, before generator optimization. It processes direct
-call-like instructions after MIR simplification, not inside `mir-gen.c`.
+- postorder/reverse-postorder
+- reachability
+- immediate dominators and dominator children
+- dominance frontiers
 
-Important MIR inliner behavior:
+`Func.opt_valid_analyses` tracks coarse invalidation for CFG, def-use,
+dominators, and loop info. Passes that mutate control flow, operands, or
+instructions either rebuild the relevant analysis or invalidate it.
 
-- It can inline direct normal calls and explicit inline calls when the callee
-  is available.
-- It skips unresolved externals, self recursion, label-reference functions,
-  varargs, `jret`, and over-budget callees.
-- Default size budgets are small: normal calls have a much smaller cap than
-  explicit inline calls.
-- It limits caller growth relative to original caller size.
-- It renames callee registers, duplicates labels, materializes argument moves,
-  rewrites returns, and handles simple top-level constant-size `alloca`.
+The shared def-use representation records every use of every `Val`, including
+ordinary operands, indirect bases, phi inputs, call arguments/results,
+aggregate operands, inline asm operands, and intrinsic operands.
 
-cfree should run inlining on retained `Func` IR at `finalize` for `-O2`, then
-run cleanup on dirty callers. SCCs can be skipped for v1. The inliner must
-refuse setjmp/longjmp-sensitive and vararg cases until their semantics are
-explicitly tested.
+## Shared Lowering Pipeline
 
----
+`opt_run_lowering_pipeline` is used by both `-O1` and `-O2`.
 
-## 4. cfree Level Schedules
+1. `opt_build_cfg`
+   Builds explicit CFG edges from branches, returns, switches, fallthroughs,
+   scopes, indirect branches, and synthetic blocks. It also removes unreachable
+   connected blocks after cleanup.
 
-### 4.1 `-O1` Minimal Schedule
+2. `opt_jump_cleanup`
+   Performs CFG cleanup before and after lowering. The early form repairs and
+   simplifies CFG shape; the late form also optimizes layout fallthroughs.
 
-`-O1` is not "half of SSA." It is MIR's cheap backend optimization tier.
+3. `opt_simplify_local`
+   Performs local algebraic/address simplifications that do not require SSA.
 
-Required `-O1` schedule:
+4. `opt_machinize`
+   Lowers generic IR toward target-legal machine IR. This includes target
+   operand constraints, hard-register constraints, clobber information, and
+   target-specific rewrite decisions needed before allocation.
 
-```
-build_cfg
-machinize
-build_loop_tree
-live_info
-regalloc(allow_live_range_split = false)
-combine
-dce
-opt_emit
-```
+5. `opt_build_loop_tree`
+   Builds loop metadata from dominators for use by later analysis and metrics.
 
-Allowed `-O1` details:
+6. `opt_live_blocks`
+   Computes block liveness over pseudo-registers or SSA values, depending on
+   the active mode.
 
-- Remove unreachable blocks during CFG construction.
-- Use loop depth only for frequency/pressure costing.
-- Run target-aware combine after register allocation.
-- Delete noop moves and dead post-RA definitions.
-- Use a priority-based allocator, but without coalescing and live-range
-  splitting in the first production version.
+7. `opt_dead_def_elim_with_live`
+   Removes dead register definitions using live-out information while
+   preserving side-effecting instructions.
 
-Forbidden at `-O1`:
+8. `opt_regalloc`
+   Allocates pseudo-registers to hard registers or spill slots. `-O1` uses the
+   simple allocation mode. `-O2` enables coalescing and live-range splitting.
 
-- `build_ssa`
-- `gvn`
-- `copy_prop`
-- `dse`
-- `ssa_dce`
-- `licm`
-- `pressure_relief`
-- `coalesce`
-- `opt_inline`
+9. `opt_combine`
+   Runs post-allocation peephole combines over lowered IR.
 
-This keeps `-O1` useful and debuggable: if `-O1` fails, the bug is in
-recording, CFG, machinize, liveness, allocation, combine, DCE, or emission.
+10. `opt_dce`
+    Deletes now-dead non-side-effecting instructions after allocation and
+    combine.
 
-### 4.2 `-O2` Full Schedule
+11. `opt_emit`
+    Replays the final IR into the wrapped backend target.
 
-`-O2` uses MIR's full current source pipeline:
+The shared path is the bisection floor. Bugs in CFG construction, target
+machinization, liveness, allocation, rewrite, combine, or emission should
+usually reproduce at `-O1`.
 
-```
-build_cfg
-build_reg_ssa
-if no address-transform candidates:
-    block_cloning
-build_ssa
-if address-transform candidates:
-    addr_xform
-    undo_ssa
-    block_cloning
-    build_ssa
-gvn
-copy_prop
-dse
-ssa_dce
-build_loop_tree
-licm
-pressure_relief
-make_conventional_ssa
-ssa_combine
-undo_ssa
-jump_opt
-machinize
-build_loop_tree
-collect_moves
-live_info(move vars only)
-coalesce
-live_info(all vars)
-regalloc(allow_live_range_split = true)
-combine
-dce
-opt_emit
-```
+## `-O1` Pipeline
 
-Then, once inlining exists:
+`-O1` runs only the shared lowering pipeline. It intentionally skips:
 
-```
-finalize:
-    opt_inline(FuncSet, max_iters)
-    for each dirty caller:
-        opt_cleanup
-    lower every dirty/not-yet-emitted Func through the full schedule
-```
+- interprocedural retention
+- inlining
+- SSA construction
+- GVN
+- copy propagation beyond local cleanup
+- DSE
+- LICM
+- pressure relief
+- live-range splitting
 
-`opt_cleanup` should re-run the passes that inlining exposes value for:
-`build_cfg`, `build_ssa`, `gvn`, `copy_prop`, `dse`, `ssa_dce`, `licm` when
-loops exist, `pressure_relief`, `make_conventional_ssa`, `ssa_combine`,
-`undo_ssa`, and `jump_opt`.
+`-O1` should compile much closer to `-O0` than to `-O2`, while removing obvious
+backend artifacts and exercising the optimized allocation/emission path.
 
-Current implementation subset:
+## `-O2` Pipeline
 
-```
-build_cfg
-jump_cleanup(CFG)
-build_cfg
-verify(o2-pre-ssa-cfg)
-build_reg_ssa
-verify(o2-reg-ssa)
-block_cloning
-verify(o2-block-clone-cfg)
-build_ssa
-verify(o2-ssa)
-ssa_dce
-copy_cleanup
-verify(o2-pre-addr-cleanup)
-addr_xform
-verify(o2-addr-xform-ssa)
-ssa_dce
-verify(o2-addr-xform-dce)
-copy_cleanup
-verify(o2-addr-xform-copy-cleanup)
-simplify
-verify(o2-simplify-ssa)
-ssa_dce
-copy_cleanup
-verify(o2-simplify-cleanup)
-gvn
-verify(o2-gvn-ssa)
-copy_prop
-verify(o2-copy-prop-ssa)
-ssa_dce
-verify(o2-copy-dce)
-dse
-verify(o2-dse-ssa)
-ssa_dce
-verify(o2-dse-dce)
-build_loop_tree
-licm
-verify(o2-licm-ssa)
-pressure_relief
-verify(o2-pressure-relief-ssa)
-make_conventional_ssa
-verify(o2-conventional-ssa)
-ssa_combine
-verify(o2-ssa-combine)
-undo_ssa
-copy_cleanup
-verify(o2-undo-copy-cleanup)
-opt_jump_opt
-verify(o2-jump-opt)
-shared O1 lowering pipeline (machinize ... jump_cleanup ... opt_emit)
-```
+`-O2` retains every function until target finalization. This enables whole-TU
+inlining before any function is emitted.
 
-Every CFG or value-use mutation above is followed by the relevant verifier
-checkpoint in `opt_run_o2_pipeline`.
-
-### 4.3 Identity Simplification
-
-Identity simplification is implemented in three homes, depending on how much
-context the rewrite needs.
-
-1. CG-time stack/value simplification, before IR recording.
-   This is the earliest and best place for identities where the operation can
-   simply avoid materializing a result. It benefits `-O0`, `-O1`, and `-O2`,
-   and it prevents avoidable temporaries from ever entering opt IR. Keep this
-   limited to rewrites that are local to the two values currently on the CG
-   stack, do not require CFG/use-def information, and are valid for every
-   target backend.
-
-   Current CG folds immediate integer binops, collapses delayed integer
-   arithmetic chains, collapses integer identities through
-   `api_try_collapse_binop_identity`, and `api_cg_convert` elides same-type
-   conversions. The identity set now includes safe unflagged integer cases such
-   as `x * 1`, `1 * x`, `x / 1`, `x % 1 -> 0`, `x << 0`, `x >> 0`, zero
-   annihilators for multiply/and, and all-ones annihilators for or.
-
-   Do not put target-dependent legality here. Do not fold floating-point
-   identities here: signed zero, NaNs, exceptions, contraction, and rounding
-   flags make algebraic FP identities nontrivial. Also keep the existing
-   `!flags` rule for integer folding unless a flag's exact semantics are
-   explicitly modeled.
-
-2. A tiny non-SSA `opt_simplify_local` in the shared lowering pipeline.
-   This runs after the initial `build_cfg`/`jump_cleanup`/`build_cfg` cleanup
-   and before `machinize`, so it is available to `-O1` and to the final lowering
-   leg of `-O2` after SSA has been undone:
-
-   ```
-   build_cfg
-   jump_cleanup(CFG)
-   build_cfg
-   simplify_local
-   machinize
-   ...
-   ```
-
-   This pass stays local and cheap. It rewrites one instruction at a time:
-   neutral/annihilator integer binops, `x - x -> 0`, `x ^ x -> 0`, `x & x` and
-   `x | x` to copies, same-value integer compares, and exact no-op converts.
-   It does not chase definitions, compute dominance, rewrite arbitrary uses, or
-   become a substitute for SSA value passes. Its job is to reduce obvious work
-   before liveness/RA without changing the meaning of O1 as the cheap backend
-   tier.
-
-   O1 should not grow a general value optimizer. If an O1 simplification needs
-   def-use, dominance, copy propagation, memory reasoning, or loop information,
-   it belongs in O2 instead. Post-RA `combine` remains the target-aware cleanup
-   point for no-op moves, immediate folding into legal machine operands, and
-   backend artifacts exposed only after register assignment.
-
-3. An SSA `opt_simplify` in the O2 pre-lowering value pipeline.
-   This is the primary home for scalar and address identities that require
-   use-def information. It runs after mem2reg/address cleanup has exposed SSA
-   values and before GVN builds value-number keys:
-
-   ```
-   build_ssa
-   ssa_dce
-   copy_cleanup
-   addr_xform
-   ssa_dce
-   copy_cleanup
-   simplify
-   ssa_dce
-   copy_cleanup
-   gvn
-   ```
-
-   This pass rewrites the defining instruction to `IR_COPY` or `IR_LOAD_IMM`
-   and relies on `opt_ssa_dce`/`opt_copy_cleanup` to remove leftovers. It
-   shares the local rules and adds SSA-only rules for double bitwise-not,
-   add-zero address/value chains fed by SSA constants, and no-op bitcast chains
-   that reduce to a copy. GVN still owns constant folding, leader selection,
-   redundant expression/load elimination, and branch folding; `opt_simplify`
-   canonicalizes cheap identities so GVN has fewer equivalent shapes to learn.
-
-The practical rule is: if CG can avoid emitting IR without looking beyond the
-top CG values, do it in CG. If the rewrite is a single recorded instruction and
-does not need SSA, put it in shared `simplify_local` so O1 benefits. If the
-rewrite needs def-use, dominance, memory/address canonicalization, or cleanup
-after inlining, put it in O2 `opt_simplify`.
-
-### 4.4 Transformations We Do Not Take
-
-`doc/DESIGN.md` is still binding: no UB-exploiting transforms. Do not assume
-signed overflow, shift-by-width, division by zero, or null dereference are
-unreachable. MIR is careful around potentially trapping division/modulo in
-LICM; cfree should be at least as conservative.
-
-MIR property instructions and lazy basic-block versioning are out of scope for
-the first optimized backend. They are a separate JIT specialization feature,
-not required for normal `cfree` object/executable codegen.
-
----
-
-## 5. Current cfree State
-
-The optimizer is no longer just a stub:
-
-- `src/opt/opt.c` implements the `CGTarget` wrapper.
-- `src/opt/ir.c` and `src/opt/ir.h` implement the recorded IR container.
-- `src/opt/pass_cfg.c` implements CFG construction.
-- `src/opt/pass_ssa.c` implements pseudo-register SSA construction, mem2reg
-  SSA construction, conventional SSA edge-copy lowering, SSA undo, and stable
-  SSA dumps for pass tests.
-- `src/opt/pass_coalesce.c` implements O2 move collection, move-related
-  conflict analysis, and conservative coalescing.
-- `src/opt/pass_analysis.c` owns shared CFG order, dominators, dominance
-  frontiers, dominator children, def-use rebuilding, and verifier guardrails.
-- `src/opt/pass_simplify.c` implements CG-adjacent local identity cleanup for
-  the shared lowering path and SSA-aware identity canonicalization for O2 before
-  GVN.
-- `cfree cc` forwards `-O0`/`-O1`/`-O2` into `CfreeCodeOptions`, so native
-  frontends exercise the optimizer through the normal compile driver.
-- `test/toy/run.sh` supports `CFREE_TOY_OPT_LEVELS`; default is `0 1 2`, so
-  `make test-toy` exercises direct, O1, and O2 paths by default.
-
-Current behavior:
-
-- Level `1` records IR, builds CFG, machinizes, builds loop frequencies,
-  computes PReg liveness and live ranges, runs the current priority
-  allocator/rewrite path over mutable PRegs, combines, runs post-RA DCE,
-  cleans jumps/layout, and emits through the wrapped target. This path keeps
-  whole-PReg allocation and does not run coalescing or splitting.
-- Level `2` has its own scheduler entry point. It now converts mutable
-  pseudo-registers to SSA values, builds real mem2reg SSA, verifies phi/use-def
-  invariants, runs the first small O2 transforms, lowers phis through
-  conventional SSA, undoes SSA, and then routes through the shared lowering
-  path with O2 allocation policy enabled.
-- `opt_regalloc(..., allow_live_range_split)` now uses the O2 policy bit to
-  enable move coalescing and singleton live-range splitting. Coalescing only
-  considers valid same-class, same-type `IR_COPY` pairs, weights moves by block
-  frequency, builds a bounded conflict matrix for move-related PRegs, and
-  rejects incompatible tied-register or forbidden-register constraints.
-- O2 allocation assigns coalesced groups as a unit. Every member receives the
-  same hard register or spill slot, so coalesced copies rewrite to hard-reg
-  self copies and are removed by post-RA combine/DCE.
-- O2 live-range splitting is intentionally conservative in this slice: only
-  singleton allocation groups with multiple live ranges can split. Split PRegs
-  get a canonical spill home plus per-range segment records. Hard segments
-  insert boundary reload/store traffic through the existing rewrite buffers;
-  coalesced groups, call-crossing values, and O1 allocations remain whole.
-- `opt_build_ssa` is no longer just a shape check: promoted loads/stores are
-  rewritten to SSA values, phis have real `Val` defs and predecessor inputs,
-  and def-use chains are rebuilt for downstream SSA passes.
-- Instruction ids are stable across block-array rewrites, so analysis and
-  dumps no longer depend on raw `Inst*` addresses or positional references.
-- Loop tree construction uses shared `OptAnalysis` dominators instead of a
-  pass-local dominator implementation.
-- `Func.opt_valid_analyses` now tracks CFG, def-use, dominator, and loop-info
-  validity. Mutating passes invalidate the affected bits, rebuild passes mark
-  their products valid, and `opt_verify` checks cached def-use before rebuilding
-  it so stale use-site assumptions are caught.
-- The first SSA-backed value passes are in the O2 path: `opt_ssa_dce` removes
-  unused pure SSA defs/phis, and `opt_copy_cleanup` rewrites safe SSA copy users
-  through def-use chains while avoiding multi-def conventional-SSA edge copies.
-- The next pre-GVN O2 slice is also in the O2 path: `opt_block_cloning` clones
-  tiny non-loop join blocks under strict growth limits when cloned defs remain
-  local and external uses are dominated; `opt_addr_xform` folds simple local
-  `addr_of` pseudos into non-observable load/store memory operands. Volatile,
-  atomic, global, TLS, non-zero indirect-offset, and otherwise non-local address
-  cases remain materialized. Both passes rebuild/invalidate analysis state and
-  are bracketed by `opt_ssa_dce`, `opt_copy_cleanup`, and verifier checkpoints.
-- The first post-address value passes are in the O2 path: `opt_simplify`
-  canonicalizes cheap scalar/address identities before GVN; scalar GVN rewrites
-  dominated duplicate pure expressions and folds safe integer
-  constants/branches; the memory-aware GVN slice rewrites exact redundant
-  loads and store/load reuse through the documented alias/version model; and
-  `opt_copy_prop` propagates SSA copy chains plus collapses redundant integer
-  extension chains.
-- The first memory and loop passes are in the O2 path: `opt_dse` removes exact
-  overwritten or unread non-escaping local stores, `opt_licm` hoists safe loop
-  invariants while leaving trapping and memory operations in place, and
-  `opt_pressure_relief` sinks single-use immediate/constant defs with a
-  per-block linear rewrite plan.
-
-Temporary defensive/pessimizing checks to lift:
-
-- [ ] O2 scalar locals/params that would be register-backed at O1 are still
-  initially recorded as frame slots for mem2reg. Lift this when source-local
-  storage identity can be represented directly through pseudo-register SSA.
-- [ ] O2 functions containing `IR_ASM_BLOCK` route through the non-SSA lowering
-  path. Lift this when inline-asm tied operands, inout operands,
-  early-clobbers, fixed-register constraints, and clobber liveness are modeled
-  across pseudo-register SSA construction and destruction.
-- [ ] O2 functions containing `IR_LOAD_LABEL_ADDR` or `IR_INDIRECT_BRANCH`
-  route through the non-SSA lowering path. Lift this when label-address and
-  computed-goto targets have stable block/MCLabel semantics across O2 CFG
-  cleanup, block cloning, edge splitting, SSA destruction, and jump cleanup.
-- [ ] O2 switch table promotion is disabled for gapped dense ranges
-  (`span != ncases`). Lift this when jump-table default slots remain correctly
-  bound after O2 layout and jump cleanup, including interior default entries.
-- [ ] Jump forwarding refuses to forward through single-jump blocks that already
-  own an `MCLabel`. Lift or narrow this once data relocations and label tables
-  are retargeted when CFG cleanup forwards or removes labeled trampoline
-  blocks.
-
-Full O2 jump optimization is now in the O2 path after SSA destruction. It
-extends the shared O1 jump cleanup with switch/default target forwarding, empty
-fallthrough-chain forwarding, repeated branch-to-branch forwarding, and
-same-target conditional branch collapse while preserving the final layout
-cleanup for branch-to-physical-next deletion.
-
-The next implementation work should move to Phase E inlining and cleanup.
-
----
-
-## 6. Implementation Phases
-
-Each phase should end with targeted green tests. Prefer red-green tests for
-the exact pass being introduced, then expand through the CG corpus.
-
-### Phase A - Production `-O1` Lowering
-
-Status: implemented as the public level-1 path. Keep extending tests and
-target-specific legality checks here, but do not make O1 depend on SSA/value
-passes. The final O1 namespace is mutable `PReg`: `ir_ensure_preg` grows only
-the pseudo-register metadata table, and O1 liveness, allocation, rewrite, and
-metrics are PReg-indexed.
-
-Goal: make `-O1` stop replaying and emit through the optimized backend path
-without SSA value passes.
-
-Implement:
-
-- `opt_machinize`
-- `opt_live_info`
-- `opt_regalloc(..., false)`
-- `opt_combine`
-- `opt_dce`
-- `opt_emit`
-
-Keep the allocator simple but not naive:
-
-- Build live ranges from block liveness.
-- Sort allocation candidates by tied hard-reg requirement, frequency, live
-  length, then stable id.
-- Assign hard regs when possible, stack slots otherwise.
-- Rewrite pseudos into hard regs/stack slots with reserved scratch regs for
-  reload/store addressing.
-- Delete noop moves after rewrite.
-
-Exit criteria:
-
-- [x] `CFREE_TOY_OPT_LEVELS="0 1" make test-toy` passes for targeted cases.
-- [x] Focused allocation cases cover call-clobber saves, stack spills, tied
-  hard regs from inline asm, and values live through branches.
-
-### Phase B - O2 Analysis Substrate
-
-Status: implemented for the first O2 value-pass slice. `OptPassCtx`, the
-shared operand/reference walker,
-`OptAnalysis` CFG order/dominator/frontier data, stable instruction ids,
-production mem2reg SSA, def-use rebuilding, SSA dumps, and stronger
-`opt_verify` guardrails are in place. Loop tree construction now uses shared
-dominators. Passes now have explicit CFG, def-use, dominator, and loop-info
-validity bits, and the first SSA value passes did not require additional shared
-analysis fields.
-
-Goal: create the durable data structures and invalidation model that all O2
-passes will share before implementing transformations.
-
-Checklist:
-
-- [x] A small pass context for per-pass state, stage names, metrics, and
-  temporary arenas.
-- [x] A shared operand/reference walker for every pass that needs uses/defs,
-  including `CGABIValue`, calls, returns, inline asm, intrinsics, and indirect
-  bases.
-- [x] Shared analysis results for RPO/PO, dominators, dominance frontiers, and
-  dominator children.
-- [x] Loop tree construction moved onto shared dominator analysis.
-- [x] Stable instruction ids so long-lived analysis never depends on raw
-  `Inst*` addresses across block array rewrites.
-- [x] Real mem2reg phi `Val` definitions and rewritten promoted uses.
-- [x] Def-use chain rebuilding, including phi inputs and indirect bases.
-- [x] SSA dump coverage for phis and rewritten uses.
-- [x] `opt_verify(Func*, stage)` gates for CFG consistency, instruction id
-  consistency, phi arity, def-use consistency, and SSA value class checks.
-- [x] Formal analysis invalidation flags for passes that preserve, rebuild, or
-  invalidate CFG/value/use-def state.
-- [x] Additional shared analysis fields demanded by first value passes.
-
-Exit criteria:
-
-- [x] `-O1` stays green.
-- [x] `-O2` can run the substrate and verifier, then lower through the current
-  O1 backend path with no observable behavior change.
-- [x] SSA dump tests prove phis and rewritten uses on branch/loop locals.
-- [x] `make test-opt`
-- [x] Focused `CFREE_TOY_OPT_LEVELS="0 1 2" make test-toy` branch/loop subset.
-- [x] Full `CFREE_TOY_OPT_LEVELS="0 1 2" make test-toy`
-
-### Phase C - Full Allocation Infrastructure
-
-Status: implemented as the first O2 allocation-quality slice. O1 keeps the
-existing whole-PReg priority allocator. O2 enables conservative move
-coalescing, coalesced-group allocation, and singleton live-range splitting
-without changing public `libcfree` APIs or adding CFG edge splitting.
-
-Goal: bring `-O2` allocation quality up to MIR's model before value passes
-start changing code aggressively.
-
-Implement:
-
-- [x] Move collection.
-- [x] Bounded conflict matrix for move-related regs using existing live-range
-  overlap data.
-- [x] Conservative coalescing for same-class, same-type copies with compatible
-  fixed-register and forbidden-register constraints.
-- [x] Coalesced-group allocation before singleton allocation.
-- [x] O2-only split allocation representation with per-PReg segment lists and
-  canonical spill homes.
-- [x] Conservative singleton live-range splitting when whole hard allocation
-  fails.
-- [x] Boundary reload/store insertion for hard split segments through rewrite.
-- [x] Critical-edge CFG mutation for spill placement. Current placement stays
-  at unambiguous block/range boundaries.
-
-Exit criteria:
-
-- [x] `-O1` remains green.
-- [x] `-O2` can use the same lowering path with coalescing/splitting enabled,
-  while preserving the current small SSA cleanup passes.
-- [x] Focused `make test-opt` coverage for coalescing, conflict refusal,
-  incompatible constraints, O1 preservation, and singleton splitting.
-- [x] Toy pressure fixture:
-  `CFREE_TEST_FILTER=129_o2_pressure_coalesce CFREE_TOY_OPT_LEVELS="0 1 2"
-  make test-toy`
-- [x] Full `CFREE_TOY_OPT_LEVELS="0 1 2" make test-toy`
-
-### Phase D - SSA-Backed Value and Memory Passes
-
-Goal: enable the MIR full pre-lowering schedule for `-O2`, starting with small
-SSA-backed passes that prove the use-def substrate before heavier value and
-memory optimizations.
-
-Implement in order:
-
-1. [x] `opt_ssa_dce`
-2. [x] `opt_copy_cleanup`
-3. [x] `opt_block_cloning`
-4. [x] `opt_addr_xform`
-5. [x] scalar-only `opt_gvn`
-6. [x] `opt_copy_prop`
-7. [x] first memory-aware `opt_gvn` / redundant load elimination tests
-8. [x] CG identity expansion, shared `opt_simplify_local`, and O2
-   `opt_simplify` as described in Section 4.3
-9. [x] extended memory tracker for joins/path state
-10. [x] `opt_dse`
-11. [x] `opt_licm`
-12. [x] `opt_pressure_relief`
-13. [x] `opt_make_conventional_ssa`
-14. [x] `opt_ssa_combine`
-15. [x] `opt_undo_ssa`
-16. [x] full O2 `opt_jump_opt` beyond the shared O1 jump cleanup
-
-Do not batch these into one landing. Each pass needs a pass-local corpus case
-that fails red without the pass or its bug fix.
-
-The path-aware memory GVN slice is now landed. Keep later memory rewrites tied
-to the documented alias and memory numbering model: do not expand `IR_LOAD`
-rewrites or reason across `IR_STORE`, calls, atomics, fences, volatile
-operations, TLS, globals, or unknown memory without an explicit invalidation
-rule and pass-local tests.
-
-Completed pre-GVN slice exit criteria:
-
-- [x] `make test-opt`
-- [x] Focused toy O2 branch/join/address-memory subset:
-  `CFREE_TEST_FILTER=128_o2_branch_join_addr_mem CFREE_TOY_OPT_LEVELS="0 1 2"
-  make test-toy`
-- [x] Full `CFREE_TOY_OPT_LEVELS="0 1 2" make test-toy`
-- [x] Disassembly sanity check shows narrow churn: existing
-  `10_pointer_addr_deref` removes separate address materialization in O2, while
-  the new branch/join/address fixture has only minor register-choice churn.
-- [x] Pass-local dump tests prove the implemented SSA DCE, copy cleanup, block
-  cloning, and address-folding rewrites fire.
-
-Memory numbering and alias invalidation model for memory GVN:
-
-- Memory value numbering is keyed by `(address-space, alias root, byte range,
-  memory version, load type, size, alignment, flags)`. The address operand is
-  still part of the key unless the alias root is a singleton local/global/string
-  object with a proven constant offset. This keeps two unknown pointers in the
-  same alias class from being treated as the same location.
-- A memory version token is maintained per alias root plus one conservative
-  `unknown` token. `ALIAS_LOCAL`, `ALIAS_GLOBAL`, `ALIAS_PARAM`, `ALIAS_HEAP`,
-  and `ALIAS_STRING` each use their root id as the fine-grained version key;
-  `ALIAS_UNKNOWN`, missing alias data, unsupported address forms, and escaped
-  allocas use the unknown token. Address space participates before alias-root
-  comparison, so different address spaces never share a memory value number.
-- A store to a precise root increments that root's version. A store through
-  unknown memory increments the unknown token and invalidates every non-invariant
-  root that may alias unknown memory. A store to volatile memory is never
-  forwarded from or to, and also acts as an observable memory barrier for other
-  volatile operations.
-- Calls increment the unknown token and invalidate all roots that can escape:
-  globals, params, heap, TLS/global address roots, and any local whose address
-  has flowed to a call argument, return value, unknown/global store, aggregate
-  memory op, inline asm, intrinsic, or unsupported address use. Non-escaping
-  local roots keep their root version across ordinary calls. Later call attrs
-  can narrow this further.
-- Atomics, fences, inline asm with memory effects, `IR_AGG_COPY`,
-  `IR_AGG_SET`, bitfield stores, varargs memory operations, and runtime memory
-  intrinsics are full memory barriers until each has a precise alias rule.
-  Atomic loads are not redundant-load candidates unless a later model proves
-  the exact ordering legality.
-- TLS addresses are treated like globals for alias roots but are call-clobbered
-  unless the access is proven local-exec and the call cannot observe the thread
-  object. The first slice should conservatively invalidate them at calls.
-- Redundant-load elimination may replace `IR_LOAD` only when the prior value's
-  def dominates the use, the memory key including version is identical, neither
-  access is volatile/atomic, the load type and byte range match, and no
-  invalidating barrier lies between the two through the version model. Store/load
-  reuse uses the same key and additionally requires the stored operand's value
-  to dominate the load.
-- At block joins, memory availability is intersected across reachable
-  predecessors. An entry survives only when every reachable predecessor has the
-  same memory key available with the same `Val`; otherwise the joined load is
-  preserved. This slice does not invent memory phis.
-
-Current memory GVN slice:
-
-- [x] Pass-local tests cover repeated exact local loads, store/load reuse for
-  the same precise local, path-aware branch joins, loop-header backedge
-  conservatism, unknown-store barriers, call preservation for non-escaped
-  locals, escaped-local call clobbers, and volatile/atomic exclusions.
-- [x] The pass maintains per-root plus unknown versions, treats calls and
-  conservative memory ops as barriers, and can see through simple `addr_of` plus
-  zero-index/address-minus-zero materialization when both memory operations
-  refer to the same precise root.
-- [x] AArch64 disassembly sanity check: a direct address-taken store followed by
-  a load of the same address reuses the stored value, removing the post-store
-  `ldur`.
-- [x] The tracker uses memory state snapshots at block boundaries. Join blocks
-  reuse memory only when all reachable predecessors agree on the same available
-  key/value pair; otherwise no memory phi is invented and the load stays.
-- [x] Root escape information is precise enough for calls to preserve
-  non-escaping address-taken locals, while globals/params/heap/TLS, escaped
-  locals, and unknown roots remain conservatively clobbered.
-
-Current DSE slice:
-
-- [x] `opt_dse` runs in the O2 SSA schedule after GVN, copy propagation, and
-  SSA DCE, before conventional-SSA lowering.
-- [x] Backward memory liveness removes exact precise-root stores that are
-  overwritten before any overlapping read, and removes unread stores to
-  non-escaped locals at function exit.
-- [x] Calls, volatile/atomic accesses, aggregate memory ops, bitfield memory
-  ops, varargs memory ops, fences, inline asm, and intrinsics remain
-  conservative barriers. Escaped locals and externally observable roots stay
-  live across calls/exits.
-- [x] Pass-local tests cover same-block and all-path overwritten local stores,
-  unread non-escaped locals, escaped-local call preservation, and volatile
-  preservation.
-
-Current LICM and pressure relief slices:
-
-- [x] `opt_licm` runs after DSE/DCE and loop-tree construction, hoisting safe
-  loop-invariant scalar work to preheaders.
-- [x] LICM pass-local tests cover safe invariant hoisting and preservation of
-  memory loads and potentially trapping division.
-- [x] `opt_pressure_relief` runs after LICM, planning all legal same-block
-  single-use immediate/constant sinks from one def-use rebuild and applying
-  affected block rewrites in linear passes.
-- [x] Pressure-relief pass-local tests cover sinking, loop-boundary refusal,
-  multi-use preservation, memory barriers, and many same-block moves.
-
-Scalar/address identity simplification now runs before memory GVN. Red-green
-coverage includes integer neutral/annihilator identities, same-register
-integer identities, same-value integer compares, exact no-op conversions,
-double bitwise-not, add-zero address chains, and a toy end-to-end identity
-chain at `-O0`, `-O1`, and `-O2`. These rewrites stay typed and
-legality-aware: they do not erase operations with trapping, flagged overflow,
-FP rounding/NaN/sign-zero semantics, volatile/atomic memory effects, or target
-addressing constraints that have not been modeled.
-
-Remaining Phase D exit criteria:
-
-- [x] Scalar GVN pass-local dump tests prove redundant pure expressions are
-  replaced and constant branches fold.
-- [x] Memory numbering and alias invalidation are documented before any memory
-  GVN or redundant-load rewrite lands.
-- [x] `opt_copy_prop` pass-local tests prove SSA copy propagation and redundant
-  extension-chain cleanup, with a focused toy O2 fixture.
-- [x] AArch64 O2 disassembly sanity checks show `130_o2_copy_prop_ext`
-  collapsing the widening/copy chain to a single byte load/zero-extend path,
-  and memory GVN removing redundant post-store stack loads in covered cases.
-  `128_o2_branch_join_addr_mem` now covers both all-predecessors-agree and
-  predecessor-disagreement branch stores.
-- [x] Identity simplification has focused pass-local tests, a toy end-to-end
-  fixture, and an AArch64 disassembly sanity check showing identity arithmetic
-  removed from the generated code.
-- [x] LICM and pressure relief have focused pass-local tests and are in the
-  default O2 schedule.
-- [x] `opt_ssa_combine` has focused red-green tests before it enters the
-  default O2 schedule.
-- [x] Full O2 jump optimization has focused red-green tests before it enters
-  the default O2 schedule, covering switch target forwarding, empty
-  fallthrough-chain forwarding, repeated branch forwarding, and same-target
-  conditional collapse.
-- [x] Post-inline cleanup quality: after Phase E inlines a simple Toy wrapper,
-  the existing Phase D value passes should collapse the exposed constant/copy
-  chain and remove frame traffic. Covered example:
-
-  ```
-  fn add1(x: i32): i32 {
-    return x + (1 as i32);
-  }
-
-  pub fn main(): i32 {
-    let y: i32 = add1(41 as i32);
-    if y == (42 as i32) {
-      return 0 as i32;
-    }
-    return 1 as i32;
-  }
-  ```
-
-  `-O2` removes the wrapper call leftovers and AArch64 now lowers the focused
-  quality fixture's `main` to a constant return with no call, prologue, spill,
-  or reload traffic.
-
-### Phase E - Inlining and Cleanup
-
-Goal: add MIR-style small direct-call inlining for `-O2`.
-
-Design decision: inline before machine lowering, on retained raw `Func` IR, at
-`CGTarget.finalize`. Do not inline the post-SSA-destruction or post-machinize
-form. The current implementation emits at `func_end`; Phase E first has to
-make the `opt.h` contract real for `-O2`: `func_end` records and stores the
-function, while `finalize` performs interprocedural work, then runs the normal
-per-function O2 pipeline and emission in original function order.
-
-Status: implemented for the Phase E v1 shape. Retained `-O2` finalize
-scheduling is in place, `opt_cleanup` is the reusable O2 pre-lowering cleanup
-schedule, and `pass_inline.c` inlines small direct calls to retained, acyclic
-callees with void/scalar register returns. It supports straight-line wrappers
-and small branchy callees with multiple scalar returns. It intentionally refuses
-aggregate/sret/byval returns, tail calls, varargs, alloca, inline asm,
-computed goto, and other hard cases until those rewrite rules have focused
-tests.
-
-This matches MIR's useful property: inline while calls, parameters, frame
-slots, returns, labels, and source locations are still target-independent.
-Cloning target-lowered calls would couple the inliner to ABI move plans,
-hard-register constraints, and late spill/rewrite state. Cloning SSA would
-also require phi and def-use surgery across two functions. Raw IR cloning keeps
-Phase E mechanical and lets the existing O2 passes clean up the exposed
-constants, copies, dead stores, and dead blocks.
-
-Implementation shape:
-
-- [x] Add real `FuncSet` ownership to `OptImpl`: an arena-backed ordered
-  `Func**` plus retained-definition lookup by `ObjSymId`. No global state.
-  Record order is emission order at finalize.
-- [x] Split scheduling in `opt.c`:
-  - `func_end`: `opt_frame_home_addr_taken_locals`, classify eligibility, append
-    to `FuncSet`, and clear the current recorder state. For `-O1`, keeping the
-    current immediate emit path is acceptable; `-O2` must retain functions.
-  - `finalize`: run `opt_inline(FuncSet*, max_iters)`, then lower/emit every
-    retained function. Functions that still require the defensive non-SSA O2
-    route use the shared O1 lowering path at finalize.
-  - Refactor the current O2 scheduler into "O2 cleanup/pre-lowering" and
-    "lowering/emission" pieces so an inlined dirty function can reuse the
-    existing verifier-bracketed pass order.
-- [x] Build the call graph from direct `IR_CALL` sites whose callee is
-  `OPK_GLOBAL` with addend zero and whose symbol has a retained `Func`.
-  Indirect calls and unresolved/external symbols are ordinary non-inlineable
-  calls.
-- [x] Compute SCCs on the retained graph. Inline only calls to callees in already
-  processed acyclic components. Skip self recursion and mutually recursive SCCs
-  for v1 instead of trying partial recursive budgets.
-- [x] Rebuild the call graph at each inline iteration. Default `max_iters = 1`;
-  keep a small internal cap even if a later driver knob requests more.
-
-Eligibility and budgets:
-
-- [x] Candidate callee must be retained, same target/context, non-variadic, not
-  naked/interrupt/ifunc once those attributes reach `CGFuncDesc`, and not marked
-  `noinline` once C declaration flags are plumbed through CG.
-- [x] Candidate call site must be a normal direct call. Skip `CG_CALL_TAIL` and any
-  future musttail representation.
-- [x] V1 skips callees containing `IR_ALLOCA`, `IR_VA_*`, `IR_LOAD_LABEL_ADDR`,
-  `IR_INDIRECT_BRANCH`, `IR_ASM_BLOCK`, `INTRIN_SETJMP`, `INTRIN_LONGJMP`,
-  coroutine switch, or any op whose cloning semantics are not explicit.
-- [x] V1 supports void and scalar register returns first. Aggregate/sret/byval
-  returns can be enabled later after return materialization has focused tests.
-- [x] Normal-call budget should be deliberately small: start around 12 to 20
-  non-debug, non-label instructions, with a caller growth cap around 25 percent
-  and an absolute per-function growth cap. Once `inline` / `always_inline` /
-  `noinline` are carried into CG, use a larger explicit-inline budget and let
-  `always_inline` exceed the normal budget but still obey semantic refusal
-  rules.
-- [x] Keep the heuristic target-independent. Calls, memory barriers, atomics,
-  and fences remain semantic refusals in Phase E v1; multiple returns add cost,
-  and simple scalar instructions use the normal small-instruction budget.
-
-Inline rewrite:
-
-- [x] Split the caller block at the call site into `pre`, cloned callee body, and
-  `cont`. The original call instruction is removed.
-- [x] Allocate fresh caller-local ids for every cloned block, frame slot, scope,
-  pseudo-register, and SSA value id present in the callee. Preserve callee
-  instruction ids only as source material; cloned instructions get new stable
-  ids from the caller.
-- [x] Clone `emit_order` for cloned blocks in callee layout order, inserted between
-  `pre` and `cont`. Update CFG successors/predecessors through `opt_build_cfg`
-  after the rewrite rather than hand-maintaining all derived metadata.
-- [x] Materialize arguments at the clone entry. For register-like callee parameter
-  storage, emit `IR_COPY` from the call argument value to the remapped callee
-  parameter pseudo. For frame-backed parameters, emit stores to the remapped
-  parameter frame slots using the recorded `IRParam` storage and `MemAccess`
-  shape. Constants become `IR_LOAD_IMM` or direct stores as appropriate.
-- [x] Rewrite each cloned `IR_RET`:
-  - void return: replace with `IR_BR cont`.
-  - scalar register return: copy the returned value to the call site's result
-    operand, then branch to `cont`.
-  - multiple returns: all returns branch to the same continuation; if later SSA
-    needs a phi, the normal O2 mem2reg/reg-SSA construction will create it from
-    the inserted copies/stores.
-- [x] Keep ordinary continuation semantics for noreturn callees in Phase E v1.
-  Do not infer noreturn only from body shape until continuation omission has
-  focused tests.
-- [x] Preserve `Inst.loc` from cloned callee instructions. Argument materialization
-  uses the call-site location. This is enough for line-accurate emission now;
-  `DW_TAG_inlined_subroutine` can remain a later DWARF phase.
-
-Cleanup after inlining:
-
-- [x] Mark modified callers dirty and invalidate CFG, def-use, dominator, and loop
-  analyses.
-- [x] For each retained `-O2` function, run the existing O2 pre-lowering schedule
-  after all inline iterations. This is `opt_cleanup` in Phase E terms:
-  `build_cfg`, jump cleanup, reg-SSA, mem2reg SSA, simplify, GVN, copy
-  propagation, DSE, SSA DCE, LICM, pressure relief, conventional SSA,
-  SSA combine, undo SSA, and jump optimization, with the same verifier
-  checkpoints as the current O2 path.
-- [x] Then run the shared lowering pipeline once and emit. No function should be
-  emitted before all possible callers have had a chance to inline it.
-
-Testing plan:
-
-- [x] Unit/pass tests in `test/opt` for direct wrapper inline, two-return scalar
-  inline, refusal for recursion/SCCs, refusal for varargs/alloca/setjmp/asm,
-  and caller growth cap.
-- [x] End-to-end toy/C cases where `static int add1(int x) { return x + 1; }`
-  disappears at `-O2` but remains a call at `-O1`.
-- [x] Keep the Toy `add1(41) == 42` example from the Phase D checklist as a
-  disassembly sanity case: Phase E is responsible for removing the call; Phase
-  D cleanup is responsible for removing the exposed constant/copy/branch
-  leftovers.
-- [x] A chained-call case proves bottom-up order: `main -> f -> g` inlines `g`
-  into `f` before deciding whether `f` fits into `main`.
-- [x] A negative recursive case proves codegen is unchanged and correct.
-- [x] A focused disassembly sanity check on AArch64 or x64 verifies wrapper-call
-  removal after inline+cleanup without broad corpus churn.
-
-Exit criteria:
-
-- [x] `make test-opt`
-- [x] Focused wrapper/chained-call toy fixtures:
-  `CFREE_TEST_FILTER=<inline-case> CFREE_TOY_OPT_LEVELS="0 1 2" make test-toy`
-- [x] Full `CFREE_TOY_OPT_LEVELS="0 1 2" make test-toy`
-- [x] Recursive and mutually recursive cases are unchanged and correct.
-
-### Phase F - Benchmark Closure
-
-Goal: tune by measurement, not by adding passes speculatively.
-
-Benchmark set:
-
-- MIR `c-benchmarks`: `array`, `binary-trees`, `funnkuch-reduce`,
-  `hash`, `hash2`, `heapsort`, `lists`, `mandelbrot`, `matrix`,
-  `method-call`, `nbody`, `sieve`, `spectral-norm`, `strcat`, and any
-  benchmark that requires only supported libc/runtime features.
-- cfree-specific stress cases for ABI, TLS, atomics, and inline asm.
-
-Primary harness:
+### Inlining
 
-```
-make bench-opt
-```
+`opt_inline` operates on `FuncSet`.
 
-`scripts/opt_bench.sh` compares MIR `c2m`, `clang`, `gcc-15`, `cfree cc`,
-and `cfree run` at `-O0`, `-O1`, and `-O2` on MIR's `c-benchmarks` sources
-from `~/tmp/mir/c-benchmarks`. It writes per-case logs, `results.csv`, and a
-geomean summary under `build/bench/opt/`.
+- The call graph is built from direct `IR_CALL` sites whose callee is an
+  `OPK_GLOBAL` with addend zero and a retained `Func`.
+- Strongly connected components are used to avoid recursive inlining. Self and
+  mutually recursive SCCs are refused for v1.
+- The current iteration cap is small. The default is one inline iteration.
+- Candidate callees must be retained, target-compatible, non-variadic, not
+  recursive, and within a conservative growth budget.
+- V1 refuses callees containing alloca, varargs, setjmp/longjmp, inline asm,
+  label-address loads, indirect branches, coroutine switching, or unknown
+  cloning semantics.
+- Void and scalar register returns are supported. Aggregate/sret/byval returns
+  are left for future work.
+- The caller block is split into pre-call, cloned body, and continuation blocks.
+  Arguments are materialized in the clone entry, returns branch to the
+  continuation, and modified callers are marked dirty.
 
-The hosted `cfree cc` lane passes `--sysroot` and `-lc`; on Darwin the
-sysroot defaults to `xcrun --show-sdk-path`, and it can be overridden with
-`CFREE_OPT_BENCH_SYSROOT`. The `cfree run` lane is an in-process JIT run for
-direct comparison with MIR; it also passes `--sysroot` and `-lc`, expands the
-hosted libc headers/defines in the driver, and relies on host-symbol fallback
-for libc calls.
+After inlining, dirty functions run the normal `-O2` cleanup schedule.
 
-The benchmark intentionally records unsupported cases instead of stopping at
-the first failure. A cfree row that cannot compile a hosted benchmark is a
-`COMPILE_FAIL` data point, not a harness failure; this lets optimizer work
-track coverage and performance in the same run.
+### SSA Cleanup Schedule
 
-Useful focused runs:
+`opt_cleanup` is the `-O2` pre-lowering schedule.
 
-```
-CFREE_OPT_BENCHES="sieve spectral-norm" make bench-opt
-CFREE_OPT_BENCH_LEVELS="1 2" CFREE_OPT_BENCH_RUN_REPEATS=5 make bench-opt
-GCC=/opt/homebrew/bin/gcc-15 MIR_DIR=~/tmp/mir make bench-opt
-```
+1. `opt_build_cfg`, `opt_jump_cleanup`, `opt_build_cfg`
+   Canonicalize control flow before SSA.
 
-Measure:
-
-- Compile wall time for `-O0`, `-O1`, `-O2`.
-- Executable run time against `gcc-15 -O2`, `clang -O2`, and MIR `c2m -O2`
-  when available.
-- MIR-specific split: `compile_ms` is C-to-binary-MIR time, `codegen_ms` is
-  the JIT link/generation slice reported by `c2m -v`, and `runtime_ms` is the
-  generated function execution slice. Native compiler rows use compile+link
-  wall time for `compile_ms` and executable wall time for `runtime_ms`.
-- `cfree-run` uses `cfree run --bench-time`: `compile_ms` is compile+JIT time,
-  and `runtime_ms` is the in-process entry-call execution slice.
-- Code size for hot text sections.
-- Pass counters: removed GVN expressions, folded branches, removed stores,
-  coalesced moves, spills/restores, split ranges, post-RA deleted moves.
-
-Initial representative run, 2026-05-22:
-
-- Scope: `array`, `binary-trees`, `hash`, `hash2`, `matrix`, `nbody`,
-  `sieve`, and `spectral-norm`; levels `0 1 2`; one compile repeat and one
-  run repeat. Output was written to `build/bench/opt/results.csv` and
-  `build/bench/opt/summary.md`.
-- Coverage: 120 data rows; 91 `OK`, 18 `COMPILE_FAIL`, 9 `RUN_FAIL`, and
-  2 `OUTPUT_FAIL`.
-- `gcc-15` and `clang` completed all rows. cfree and MIR are blocked on some
-  hosted/math cases: `binary-trees`, `nbody`, and `spectral-norm` hit Darwin
-  `math.h`/builtin compatibility issues. cfree also has an `-O2` wrong-code
-  failure on `matrix`.
-- Runtime geomean versus `gcc-15 -O2` on completed rows:
-  `cfree-run` improved from `0.372x` at `-O0` to `0.445x` at `-O1` and
-  `0.480x` at `-O2`; MIR improved from `0.554x` to `0.609x` to `0.796x`.
-  `clang -O2` measured `1.076x`.
-- Compile/JIT geomean versus `gcc-15 -O2`: `cfree-run` was about `11.5x` to
-  `12.0x` faster, while MIR was about `2.7x` faster. On cases where both
-  completed, cfree-run compile/JIT time was roughly 4-5x faster than MIR.
-- Direct cfree-run versus MIR runtime on common successful cases:
-  `0.67x` at `-O0`, `0.73x` at `-O1`, and `0.57x` at `-O2`. This confirms
-  the current split: cfree's compile/JIT path is very fast, but generated-code
-  quality still trails MIR, especially at `-O2`.
-
-Immediate benchmark blockers:
-
-- [x] Fix the `matrix -O2` wrong-code regression before trusting O2 timing.
-- [x] Add or model the hosted math builtins needed by Darwin `math.h`,
-  including `__builtin_fabsf`, `__builtin_inf*`, `__builtin_huge_val*`,
-  `_Float16` declarations, and the float/double predefined limit macros used
-  by the system header.
-- Re-run the full MIR benchmark set after those blockers, then increase repeat
-  counts for stable numbers.
-
-Follow-up representative run after cfree blocker fixes, 2026-05-22:
-
-- Same 8-benchmark scope, levels `0 1 2`, one compile repeat, and one run
-  repeat.
-- Coverage: 120 data rows; 111 `OK`, 9 `COMPILE_FAIL`, 0 `RUN_FAIL`, and
-  0 `OUTPUT_FAIL`.
-- All `cfree` and `cfree-run` rows completed successfully, including
-  `matrix -O2`, `binary-trees`, `nbody`, and `spectral-norm`. The remaining
-  `COMPILE_FAIL` rows are all `mir-c2m` on the hosted/math benchmarks.
-- Updated runtime geomean versus `gcc-15 -O2` on the representative scope:
-  `cfree-run` measured `0.363x` at `-O0`, `0.519x` at `-O1`, and `0.481x`
-  at `-O2`; MIR measured `0.550x`, `0.574x`, and `0.786x` on its five
-  completed cases. `clang -O2` measured `1.072x`.
-
-Focused rerun of the formerly broken cfree rows, 2026-05-22:
-
-- Scope: `matrix`, `binary-trees`, `nbody`, and `spectral-norm`; levels
-  `0 1 2`; one compile repeat and one run repeat. Output was written to
-  `build/bench/opt/rerun-broken-cfree/results.csv`.
-- All `cfree` and `cfree-run` rows were `OK`.
-- Geomean speed ratios use each benchmark's `gcc-15 -O2` row as `1.0x` for
-  both compile time and run time:
-
-| tool | opt | compile speed | runtime speed |
-| --- | ---: | ---: | ---: |
-| `cfree` | `O0` | `7.050x` | `0.319x` |
-| `cfree` | `O1` | `7.062x` | `0.534x` |
-| `cfree` | `O2` | `6.831x` | `0.483x` |
-| `cfree-run` | `O0` | `13.709x` | `0.323x` |
-| `cfree-run` | `O1` | `14.286x` | `0.550x` |
-| `cfree-run` | `O2` | `11.791x` | `0.512x` |
-
-- On this focused scope, `O1` currently has the best cfree runtime geomean;
-  `O2` remains correct but is slower than `O1` on the measured mix.
-
-Target:
-
-- `-O1` should be the fast optimized tier and materially faster to compile
-  than `-O2`.
-- `-O2` should first reach QBE-class performance on the benchmark set, then
-  close toward MIR's published `c2m -eg` geomean.
-
----
-
-## 7. Module Layout
-
-Keep the MIR pass boundaries as separate cfree modules:
+2. `opt_build_reg_ssa`
+   Converts mutable pseudo-registers to SSA `Val` ids.
 
-```
-src/opt/
-  opt.c             wrapper, vtable bindings, schedule dispatch
-  ir.c              Func/Block/Inst plumbing
-  ir_print.c        stable dumps for pass tests
-  pass_cfg.c        CFG, unreachable cleanup
-  pass_o2.c         early O2 transforms: block cloning, address transform,
-                    scalar GVN, temporary DSE/cleanup stubs
-  pass_simplify.c   local and SSA-aware scalar/address identity simplification
-  pass_clone.c      block cloning (split out when pass size warrants it)
-  pass_ssa.c        SSA build, conventional SSA, undo SSA
-  pass_addr.c       address transformation (split out when pass size warrants it)
-  pass_gvn.c        GVN, constprop, redundant-load elimination
-  pass_copy.c       copy propagation, extension cleanup
-  pass_dse.c        dead store elimination
-  pass_dce.c        SSA DCE and post-RA DCE
-  pass_loop.c       loop tree and LICM
-  pass_pressure.c   pressure relief
-  pass_jump.c       jump optimization
-  pass_machinize.c  target ABI and machine-form lowering
-  pass_live.c       liveness, pressure, live ranges
-  pass_lower.c      current assignment/rewrite implementation
-  pass_coalesce.c   move collection and coalescing
-  pass_regalloc.c   assignment, rewrite, splitting (split out when warranted)
-  pass_combine.c    target-aware code selection
-  pass_emit.c       drive wrapped CGTarget
-  pass_inline.c     inlining and cleanup
-```
+3. `opt_block_cloning`
+   Clones small cold/return blocks into hot predecessors when bounded growth
+   exposes better local optimization.
 
-No pass should reach into the wrapped target's private implementation. Target
-specifics belong behind `Target`, `TargetABI`, or explicit helper hooks.
+4. `opt_build_ssa`
+   Promotes eligible frame-backed locals/params to mem2reg SSA, inserts phis,
+   and rewrites loads/stores to values.
 
----
+5. `opt_ssa_dce` and `opt_copy_cleanup`
+   Remove unused SSA definitions and redundant copies before heavier passes.
 
-## 8. Pass Invariants
+6. `opt_addr_xform`
+   Folds address-producing pseudos into direct memory operands where legal and
+   preserves address-taken storage in memory.
 
-- No VLAs and no global optimizer state. MIR uses generator context; cfree
-  should hang all pass state off `Compiler`, `Func`, `FuncSet`, or explicit
-  pass contexts.
-- `PReg` names mutable virtual-register storage before reg-SSA. `Val` names
-  SSA values after pseudo-register SSA construction. Lowering may map PRegs or
-  post-SSA Vals to physical registers or stack slots, but that mapping belongs
-  to allocation state.
-- Frame slots remain frame slots until a pass proves promotion is legal.
-- Calls are memory barriers unless alias information proves otherwise.
-- Inline asm is side-effecting and may pin hard regs; the allocator and
-  inliner must treat it conservatively.
-- `set_loc` is sticky on recorded insts. Every emitted instruction must forward
-  the active location to the wrapped target before machine emission.
-- Every pass either preserves CFG metadata or invalidates/rebuilds it before
-  the next consumer.
+7. `opt_simplify`
+   Applies SSA-aware identity simplification, constant simplification, and
+   local algebraic cleanup.
 
----
+8. `opt_gvn`
+   Performs scalar value numbering, constant propagation, branch folding,
+   redundant expression elimination, and memory-aware redundant load/store-load
+   reuse where alias/version rules permit it.
 
-## 9. Test Strategy
+9. `opt_copy_prop`
+   Propagates SSA copies and removes redundant extension/copy chains.
 
-Targeted commands:
+10. `opt_dse`
+    Removes stores proven overwritten or dead along all paths, while preserving
+    observable memory and call/escape boundaries.
 
-```
-CFREE_TOY_OPT_LEVELS="0 1 2" make test-toy
-CFREE_TEST_FILTER=<case> CFREE_TOY_OPT_LEVELS="0 1 2" make test-toy
-make test-opt
-```
+11. `opt_build_loop_tree`, `opt_licm`
+    Rebuild loop information and hoist safe invariant computations.
 
-Use pass dumps for red-green tests:
+12. `opt_pressure_relief`
+    Sinks legal same-block computations to reduce peak pressure before
+    allocation.
 
-- CFG: block/successor/predecessor shape.
-- SSA: phi placement and def-use chains.
-- Scalar GVN: redundant pure expression replaced and constant branch folded,
-  with def-use rebuilt and stale-analysis verifier coverage.
-- Memory GVN: redundant load/store-load reuse only after alias and memory
-  numbering are specified; tests must include volatile/atomic/call/unknown
-  memory barriers.
-- DSE: dead store removed while escaping stores remain.
-- LICM: safe invariant moved; trapping division and memory ops remain.
-- RA: expected spill/reload/coalescing counters.
-- Combine: target-legal fused instruction shape.
+13. `opt_make_conventional_ssa`
+    Lowers SSA phis into edge copies. These copies are marked
+    `IRF_NO_COALESCE`; coalescing them can collapse current and next
+    loop-carried values and miscompile loops.
+
+14. `opt_ssa_combine`
+    Combines patterns while conventional SSA edge copies are explicit.
 
-Before broad runs, redirect output to a file and inspect the failure slice:
+15. `opt_undo_ssa`, `opt_copy_cleanup`
+    Rewrites SSA values back into allocation-ready pseudo-register form.
+
+16. `opt_jump_opt`
+    Runs full O2 jump optimization before shared lowering.
+
+Verifier checkpoints bracket the important transitions: CFG, reg SSA, block
+cloning, mem2reg SSA, address transform, simplify, GVN, copy prop, DSE, LICM,
+pressure relief, conventional SSA, SSA combine, undo SSA, and jump opt.
 
-```
-CFREE_TEST_FILTER=<case> CFREE_TOY_OPT_LEVELS="0 1 2" make test-toy >build/opt-test.log 2>&1
-tail -200 build/opt-test.log
-```
+After cleanup, `-O2` enters the shared lowering pipeline with live-range
+splitting enabled.
+
+## Register Allocation
+
+Allocation uses target-provided register classes and hard-register sets.
+
+- Pseudo-register metadata records type, class, fixed/tied hard-register
+  constraints, and forbidden hard-register masks.
+- Liveness is represented as block live-in/live-out sets and live ranges over
+  instruction points.
+- Coalescing collects move-related candidates, builds a bounded conflict matrix
+  for those candidates, and merges only same-class, same-type registers with
+  compatible hard-register constraints and no live-range conflict.
+- Copies marked `IRF_NO_COALESCE` are not coalescing candidates.
+- `-O2` allocation may split live ranges and insert boundary reload/store
+  instructions. Critical edges are split when needed for placement.
+- Spill slots are represented as frame slots and rewritten as ordinary loads
+  and stores before emission.
+
+## Verification, Dumps, and Metrics
+
+`opt_verify(Func*, stage)` checks CFG reciprocity, reachable block shape,
+emit-order validity, instruction ids, operand namespaces, phi consistency, and
+def-use freshness. Passes should keep verifier failures local by using stage
+names that describe the last completed transformation.
+
+The optimizer also emits metrics through the compiler metrics interface. The
+driver's `cfree run --time` mode is the easiest way to inspect scoped timings
+such as frontend, pass totals, link, JIT, and execution.
+
+IR dumps and pass-local tests should prefer stable ids and pass-specific
+fixtures. End-to-end parse/toy tests should be focused on user-visible behavior
+and disassembly sanity checks, not broad churn.
+
+## Current Limits
 
-The full toy corpus remains the end-to-end equivalence oracle for frontend
-coverage: levels `1` and `2` must produce the same observable result as level
-`0`. The API and opt unit suites remain the narrower pass/CG contract oracle.
-DWARF equivalence can start weaker, but `set_loc` forwarding must not regress
-line-row emission.
+- `_Float16` is parsed for Darwin header compatibility but is not a real
+  half-precision ABI implementation.
+- Inlining v1 does not handle aggregate returns, sret/byval, recursive SCCs,
+  varargs, alloca, setjmp/longjmp, inline asm, or indirect-control constructs.
+- `cfree-run --bench-time` reports compile+JIT as one slice. Finer-grained
+  timing is available through `--time`, but benchmark CSVs do not yet split
+  cfree frontend, optimizer, link, and JIT into separate columns.
+- Performance goals, measurements, and tuning priorities live in
+  `doc/OPT_PERF.md`.
diff --git a/doc/OPT1.md b/doc/OPT1.md
@@ -1,330 +0,0 @@
-# OPT1 - Current O1 Backend Path
-
-`-O1` is cfree's fast optimized backend tier. It records frontend CG
-operations as IR, runs a small MIR-shaped lowering/allocation pipeline, rewrites
-virtual values to hard registers or spill slots, performs narrow post-RA
-cleanup, and emits through the target backend.
-
-This document has two jobs:
-
-1. document the implemented O1 pipeline; and
-2. record the completed O1 code-shape cleanup and remaining performance
-   watchpoints.
-
-Compile-time measurements and scaling notes live in `doc/PERF.md`. Broader
-optimizer direction lives in `doc/OPT.md`.
-
-## Goals
-
-O1 should remain:
-
-- correct across supported targets and ABI lowering paths;
-- materially better than O0 generated code;
-- close enough to O0 compile time for JIT use;
-- free of dense pseudo-register conflict matrices and whole-function
-  point/candidate products in hot paths;
-- simple enough that O2 can add coalescing, splitting, SSA, and value
-  optimization without destabilizing the O1 baseline.
-
-O1 is not intended to perform full SSA optimization, global value numbering,
-LICM, general inlining, or global block layout. Those belong in O2 unless a
-small local cleanup is required to keep O1 code shape reasonable.
-
-## Pipeline
-
-The normal O1 function-end path is:
-
-```text
-build_cfg
-jump_cleanup.cfg
-build_cfg
-machinize
-build_loop_tree
-live_blocks.pre_dde
-dead_def_elim
-regalloc:
-  live_blocks
-  live_ranges_build
-  assign_ranges
-  rewrite
-combine
-dce
-jump_cleanup.cfg
-build_cfg
-jump_cleanup.layout
-emit
-```
-
-Pass boundaries:
-
-- `machinize` applies target-specific ABI, call, and register constraints before
-  liveness and allocation.
-- `live_blocks.pre_dde` computes block liveness for pre-RA dead-def
-  elimination.
-- `dead_def_elim` removes dead virtual definitions using block liveness and
-  operand-reference events.
-- `regalloc` rebuilds liveness after DDE, constructs compressed live ranges,
-  assigns hard registers or spill slots, and rewrites operands.
-- `combine` is narrow post-RA cleanup: noop physical copy deletion, safe
-  single-use folds, redundant spill load/store cleanup, and similar local
-  rewrites.
-- `dce` is conservative after rewrite and removes only trivial dead IR shapes.
-- `jump_cleanup.cfg` forwards branches to branch-only blocks and performs local
-  compare-branch inversion when that removes an unconditional jump block.
-- `jump_cleanup.layout` runs immediately before emit and deletes unconditional
-  branches whose target is the physical fallthrough block.
-
-## Liveness And Ranges
-
-O1 liveness is pass-local. Persistent per-instruction live-after storage and
-dense value conflict matrices are not part of the normal path.
-
-`opt_live_blocks` computes block `live_use`, `live_def`, `live_in`, and
-`live_out` sets using `OptBitset`. `OptBitset` tracks active words, trims
-trailing zero words, and bounds copy/union work by active length where possible.
-
-`opt_live_ranges_build` turns block liveness plus backward instruction walks
-into compressed live ranges:
-
-- per-instruction references are collected as use/def value lists;
-- open and touched values are tracked with generation marks;
-- raw instruction points are compressed to live-range boundary points;
-- live-across-call frequency is tracked from a rolling live-value list;
-- per-value summaries record live length, use/def frequency, block frequency,
-  call-crossing frequency, and spill cost.
-
-The expected O1 invariant is `opt.conflict_bytes=0`.
-
-## Allocation
-
-The normal O1 allocator is range based. It does not build or consume a
-pseudo-register interference matrix.
-
-Candidates are ordered by:
-
-1. tied hard-register requirement;
-2. spill/frequency cost;
-3. shorter live length;
-4. stable value id.
-
-Each value is assigned either:
-
-- a target-provided hard register in the value's register class; or
-- a compatible `FS_SPILL` frame slot.
-
-Non-tied values known to be live across calls prefer callee-saved registers, but
-may use caller-saved registers when no better non-conflicting location is
-available. Correctness does not depend on avoiding caller-saved registers
-globally: rewrite preserves hard-assigned values only at calls whose
-target-provided clobber mask actually destroys that register.
-
-Occupancy is stored as sorted per-location interval sets:
-
-- one interval vector per physical hard-register location;
-- one interval vector per allocated stack spill slot.
-
-Conflict checks test whether any candidate live range overlaps the location's
-intervals. Marking inserts and merges intervals. The old point-index bitmap row
-OR/mark path is not part of normal O1.
-
-Metrics still report `opt.alloc.used_loc_words`, `opt.alloc.hard_loc_words`,
-`opt.alloc.stack_loc_words`, point visits, and mark counters for trend
-compatibility, but `opt.alloc.hard_word_ors` and
-`opt.alloc.stack_word_ors` should remain zero on the interval path.
-
-## Rewrite
-
-Rewrite consumes the assignment map and walks each block backward with rolling
-liveness:
-
-- hard-assigned pseudos become physical registers;
-- spilled uses receive reloads through target scratch registers;
-- spilled defs receive stores;
-- call save/restore insertion checks hard-assigned values known to be live
-  across some call, then intersects each value's assigned register with that
-  call's target-provided clobber mask;
-- inline-asm register constraints are applied only for functions containing
-  `IR_ASM_BLOCK`.
-
-Rewrite does not materialize per-instruction full live-after sets.
-
-## Target Contract
-
-O1 relies on each target backend to provide:
-
-- physical-register metadata per register class, including allocable,
-  caller-saved, callee-saved, argument, return, reserved, and preference/cost
-  flags;
-- legacy allocable hard-register pools per register class for direct-CG
-  compatibility and as a fallback while older backends migrate;
-- scratch-register pools disjoint from allocable pools;
-- caller-saved register classification and per-call clobber masks;
-- return-register masks and callee-save masks;
-- call-plan construction that describes ABI register destinations, outgoing
-  stack arguments, return registers, call clobbers, return masks, outgoing
-  stack size, and variadic metadata before liveness and allocation;
-- parameter storage binding through `CGTarget.param`, which may return either a
-  frame slot or register-backed local storage for simple direct ABI params;
-- optional hard-register reservation before backend `func_end`;
-- target legality for local folds performed by `opt_combine`.
-
-The current target pools are:
-
-- AArch64: integer `x19-x28`, FP `v8-v23`;
-- x64: integer `R13/R14/R15/R10`, FP `XMM6-XMM15`;
-- RV64: integer `s4-s11`, FP `fs4-fs11`.
-
-Backends own final prologue/epilogue emission and callee-saved register
-preservation. O1 calls `reserve_hard_regs` with hard registers still visible in
-replay after cleanup so backend save/restore decisions match emitted IR.
-
-Targets may provide a known-frame entry path. When `func_begin_known_frame` and
-`call_stack_size` are both available, O1 computes frame slots, outgoing call
-area size, and dynamic-allocation presence before body replay and passes that
-to the backend. AArch64, x64, and RV64 use this path to emit the exact prologue
-up front instead of reserving target NOP patch slots.
-
-## Current Generated-Code Shape
-
-Fresh May 2026 probes were run with `build/cfree` rebuilt at host `-O2`, on an
-arm64 machine, focusing on native AArch64 codegen. The same simple C probes
-were also compiled for x64/RV64 by the probe script, but AArch64 is the primary
-quality target here.
-
-Current progress:
-
-- Immediate scalar arithmetic folds through simple local shadows. For example,
-  `int x = 40; x += 2; return x;` lowers to immediate return materialization on
-  AArch64.
-- Simple loop locals stay in registers.
-- Branch-consuming compares lower to direct compare branches.
-- O1 branch cleanup removes branch-to-branch targets and unconditional branches
-  to physical fallthrough when it can do so locally.
-- Address-taken locals still go through memory, as expected for current O1.
-- The old prologue NOP patch path is gone; known-frame entry is used where the
-  target supports it.
-- Empty leaf functions omit avoidable frame setup and return directly.
-- Scalar return producers can be retargeted into ABI return registers, removing
-  target-level return self-copies.
-- Unused scalar parameters consume ABI locations without materializing moves,
-  and used stack-passed scalar parameters load directly from the incoming stack
-  area.
-
-Representative AArch64 `const_local` output:
-
-```asm
-mov w0, #0x2a
-ret
-```
-
-Representative AArch64 `scalar_add` output:
-
-```asm
-add w0, w0, w1
-ret
-```
-
-A ninth-argument AArch64 scalar probe now loads directly from the incoming
-stack area in a frameless leaf:
-
-```asm
-ldur w0, [sp]
-add w0, w0, #1
-ret
-```
-
-## Completed O1 Codegen Work
-
-The O1 cleanup list is now implemented for the current backend path:
-
-- Leaf functions can omit the frame on AArch64, x64, and RV64 when the known
-  frame is empty, the function has no calls or dynamic stack allocation, and the
-  backend has no callee-saved state to preserve. Returns in such functions emit
-  the target return instruction directly instead of branching to an adjacent
-  epilogue.
-- Post-RA cleanup removes physical self-copies, collapses short copy chains,
-  retargets safe single-use producers, and retargets adjacent scalar return
-  producers into ABI return registers when target metadata allows it.
-- Target physical-register metadata exposes caller-saved, callee-saved,
-  argument, return, reserved, and scratch policy. O1 prefers caller-saved
-  registers for non-call-crossing values, uses callee-saved registers for
-  call-crossing values when profitable, and ties non-call-crossing scalar
-  parameters to their incoming ABI registers when legal.
-- Non-address-taken scalar parameters remain register-backed. Unused
-  register-backed parameters consume their ABI locations without materializing
-  pointless moves, and stack-passed scalar parameters are loaded directly from
-  the incoming stack area. Frameless functions use `sp`/`rsp` for those loads;
-  framed functions use the target frame pointer as before.
-- Planned call replay is the normal O1 call path for supported target plans.
-  Replay uses a local parallel-move resolver for register moves, stack argument
-  stores, return extraction, cycles, and indirect callees that overlap argument
-  destinations.
-- Branch cleanup remains intentionally local: branch-target forwarding,
-  compare-branch inversion for local jump-block removal, unreachable pruning via
-  CFG, and physical fallthrough branch deletion.
-
-Representative AArch64 output after these slices:
-
-```asm
-f:
-  mov w0, #0x2a
-  ret
-
-g:
-  add w0, w0, w1
-  ret
-
-h: // ninth integer argument
-  ldur w0, [sp]
-  add w0, w0, #1
-  ret
-```
-
-Further improvements such as global block ordering, general coalescing, SSA
-value optimization, and deeper stack-heavy compile-time tuning belong to O2 or
-`doc/PERF.md`, not the O1 cleanup baseline.
-
-## Compile-Time Watchlist
-
-The current O1 compile-time shape is good for normal C ladders: many-function
-and direct-call generated inputs are approximately linear through 1024
-functions with `opt.conflict_bytes=0`.
-
-The remaining counter-level warning is stack-heavy pressure. In the current
-host-`-O2` native run:
-
-```text
-shape        step       opt.o1.total  regalloc  used_words
-spill_argc   512->1024      2.07x       2.23x      3.60x
-params       256->512       1.73x       1.80x      3.33x
-```
-
-Wall time is only mildly super-linear, but the `used_words` growth should be
-investigated before much larger stack-heavy inputs are treated as normal O1
-workloads. See `doc/PERF.md` for the full tables.
-
-## Tests And Metrics
-
-Focused O1 coverage lives in `test/opt/` and the phase0 guardrails. It covers
-branch liveness, call-clobber preservation, spill pressure, inline asm
-tied/fixed registers, pre-RA DDE, post-rewrite cleanup, and cross-target mock
-validation.
-
-Useful targeted runs:
-
-```sh
-make test-opt
-make test-cg-api
-make test-toy
-make test-aa64-inline
-make test-smoke-x64
-```
-
-`cfree run --time -O1` emits counters for:
-
-- function/block/inst/value counts;
-- block liveness active words and dataflow visits;
-- live range counts, compressed points, live-word touches, and value scans;
-- allocator hard/stack occupancy, interval probes/marks, spills;
-- rewrite reloads/stores/inserted instructions/live traffic;
-- link/JIT and compile frontend scopes.
diff --git a/doc/OPT_PERF.md b/doc/OPT_PERF.md
@@ -0,0 +1,200 @@
+# OPT Performance Tracking
+
+This document tracks optimizer performance, benchmark coverage, and the current
+gap to MIR and GCC. The optimizer design and pass order live in `doc/OPT.md`.
+
+## How To Run
+
+Primary harness:
+
+```sh
+make bench-opt
+```
+
+Useful focused runs:
+
+```sh
+CFREE_OPT_BENCHES="sieve spectral-norm" make bench-opt
+CFREE_OPT_BENCH_LEVELS="1 2" CFREE_OPT_BENCH_RUN_REPEATS=5 make bench-opt
+GCC=/opt/homebrew/bin/gcc-15 MIR_DIR=~/tmp/mir make bench-opt
+```
+
+The harness compares:
+
+- `gcc-15`
+- `clang`
+- `cfree cc`
+- `cfree run`
+- MIR `c2m`
+
+It writes `results.csv`, per-case logs, binaries, and `summary.md` under
+`build/bench/opt/` or `CFREE_OPT_BENCH_OUT`.
+
+## Reporting Model
+
+The summary reports compile time in two complementary ways.
+
+1. Unified geomean ratios
+
+   Compile-speed ratios use `compile_ms + codegen_ms` where both are
+   available. For MIR, this combines C-to-binary-MIR time and JIT link/generate
+   time so the headline number is comparable to `cfree-run` as a whole.
+
+2. Split compile timings
+
+   For MIR, `compile_ms` is C-to-binary-MIR time and `codegen_ms` is the
+   `c2m -v` link/JIT generation slice. This table explains whether a
+   compile-time difference comes from C parsing/lowering or from the optimized
+   generation pipeline.
+
+For `cfree-run`, `--bench-time` currently reports compile+JIT as one slice.
+Use `cfree run --time` for finer-grained local investigations. The benchmark
+CSV does not yet split cfree frontend, optimizer, link, and JIT into separate
+columns.
+
+MIR's benchmark binary is an optimized host executable. On the current Darwin
+machine, `/Users/ryan/tmp/mir/c2m` is built by MIR's makefile with
+`-O3 -DNDEBUG`.
+
+## Current Representative Run
+
+Date: 2026-05-22.
+
+Scope:
+
+- `array`
+- `binary-trees`
+- `hash`
+- `hash2`
+- `matrix`
+- `nbody`
+- `sieve`
+- `spectral-norm`
+
+Levels: `0 1 2`.
+
+Repeats: one compile repeat and one run repeat.
+
+Output:
+
+- `build/bench/opt/results.csv`
+- `build/bench/opt/summary.md`
+
+Coverage:
+
+| status | rows |
+| --- | ---: |
+| OK | 111 |
+| COMPILE_FAIL | 9 |
+| RUN_FAIL | 0 |
+| OUTPUT_FAIL | 0 |
+
+All `cfree` and `cfree-run` rows completed successfully, including the
+previously broken `matrix -O2`, `binary-trees`, `nbody`, and `spectral-norm`
+rows. The remaining `COMPILE_FAIL` rows are all MIR `c2m` on the Darwin
+hosted/math benchmarks.
+
+## Headline Geomeans
+
+Base: each benchmark's `gcc-15 -O2` row is `1.0x`.
+
+Representative 8-case scope:
+
+| tool | opt | cases | compile speed | runtime speed |
+| --- | ---: | ---: | ---: | ---: |
+| `cfree` | `O0` | 8 | `5.649x` | `0.346x` |
+| `cfree` | `O1` | 8 | `5.792x` | `0.502x` |
+| `cfree` | `O2` | 8 | `5.730x` | `0.468x` |
+| `cfree-run` | `O0` | 8 | `11.229x` | `0.363x` |
+| `cfree-run` | `O1` | 8 | `11.647x` | `0.519x` |
+| `cfree-run` | `O2` | 8 | `10.474x` | `0.481x` |
+| `mir-c2m` | `O0` | 5 | `2.476x` | `0.550x` |
+| `mir-c2m` | `O1` | 5 | `2.515x` | `0.574x` |
+| `mir-c2m` | `O2` | 5 | `2.437x` | `0.786x` |
+
+Interpretation:
+
+- cfree wins the whole compile/JIT headline mostly because its C frontend and
+  hosted-header path are much faster, and because MIR fails the Darwin
+  hosted/math rows.
+- On MIR-common cases where both systems have IR, MIR's optimized generation
+  pipeline is faster and produces better runtime code.
+- cfree `O1` currently has the best cfree runtime geomean on this mix.
+  `O2` is correct but slower than `O1` on the current benchmark set.
+
+## Formerly Broken cfree Rows
+
+Focused rerun:
+
+- Scope: `matrix`, `binary-trees`, `nbody`, `spectral-norm`
+- Levels: `0 1 2`
+- Output: `build/bench/opt/rerun-broken-cfree/results.csv`
+
+All `cfree` and `cfree-run` rows were `OK`.
+
+Base: each benchmark's `gcc-15 -O2` row is `1.0x`.
+
+| tool | opt | compile speed | runtime speed |
+| --- | ---: | ---: | ---: |
+| `cfree` | `O0` | `7.050x` | `0.319x` |
+| `cfree` | `O1` | `7.062x` | `0.534x` |
+| `cfree` | `O2` | `6.831x` | `0.483x` |
+| `cfree-run` | `O0` | `13.709x` | `0.323x` |
+| `cfree-run` | `O1` | `14.286x` | `0.550x` |
+| `cfree-run` | `O2` | `11.791x` | `0.512x` |
+
+## Optimizer/Link/JIT Split
+
+Focused timing pass:
+
+- Scope: `array`, `hash`, `hash2`, `matrix`, `sieve`
+- Output: `build/bench/opt/opt-link-split-common/summary.csv`
+- cfree measurement: sum of `opt.o*.total`, `link.resolve.total`, and JIT
+  scoped timings from `cfree run --time`
+- MIR measurement: `MIR link finish` from `c2m -v -O<n> file.bmir -eg`
+
+| opt | cfree opt+link+JIT ms | MIR link/JIT ms | result |
+| ---: | ---: | ---: | --- |
+| `O0` | `0.275` | `0.758` | cfree `2.76x` faster |
+| `O1` | `4.552` | `0.718` | MIR `6.34x` faster |
+| `O2` | `4.728` | `1.100` | MIR `4.30x` faster |
+
+This is the key split for tuning: cfree's final link/JIT mechanics are very
+fast, but MIR's optimized generation pipeline is substantially faster at
+`O1` and `O2`.
+
+## Goals
+
+Compile-time goals:
+
+- Preserve cfree's frontend and whole-pipeline compile/JIT advantage.
+- Add benchmark CSV columns for cfree frontend, optimizer, link, and JIT slices
+  instead of relying on ad hoc `--time` logs.
+- Reduce optimized-pass cost. The first target is closing the MIR link/JIT gap
+  on the common-case split measurement.
+- Keep `-O1` materially faster to compile than `-O2`.
+
+Runtime-quality goals:
+
+- Make `O2` reliably faster than `O1` on the representative benchmark mix.
+- Close toward MIR `O2` on MIR-common cases.
+- First external target: reach at least QBE-class performance on the MIR
+  `c-benchmarks` set, then close toward MIR's published `c2m -eg` geomean.
+
+Coverage goals:
+
+- Keep all cfree rows passing on the representative scope.
+- Expand the benchmark set back toward all MIR `c-benchmarks` that require only
+  supported hosted libc/runtime features.
+- Track MIR failures separately from cfree failures so cfree coverage is not
+  obscured by external tool limitations.
+
+## Current Tuning Priorities
+
+1. Investigate why `O2` trails `O1` on the current cfree runtime geomean.
+2. Profile expensive O2 passes on `hash`, `hash2`, and `matrix`, where the
+   opt/link split shows cfree optimized generation much slower than MIR.
+3. Improve generated-code quality before adding broad new passes. Prefer
+   targeted fixes backed by benchmark deltas and pass-local tests.
+4. Add finer cfree timing columns to `scripts/opt_bench.sh` once the driver can
+   expose parse, optimize, link, and JIT slices in parseable form.
diff --git a/doc/PERF.md b/doc/PERF.md
@@ -1,285 +0,0 @@
-# Cfree O1 Performance Notes
-
-This file records the current `cfree run --time -O1` shape and the remaining
-compile-time scaling risks. It is intentionally focused on the current O1 path;
-older allocator phase history is preserved in git and in `doc/MIR_RA_REPORT.md`.
-
-## Measurement Setup
-
-Fresh snapshot: May 17, 2026.
-Follow-up stack-assignment probe: May 21, 2026.
-
-- Host: `arm64` macOS.
-- Compiler under test: `build/cfree`, rebuilt as a Mach-O arm64 executable with
-  the compiler itself built by clang at host `-O2`.
-- Timing path: native `build/cfree run --time -O1`, which exercises the native
-  AArch64 JIT/codegen path on this machine.
-- Samples: three runs per point; tables use p50 milliseconds.
-- Raw local outputs from the run:
-  - `build/o1-probes/scaling.tsv`
-  - `build/o1-probes/spill_argc.tsv`
-  - `build/o1-current/params-hostO2.tsv`
-  - `build/o1-current/toy-hostO2.tsv`
-  - `build/o1-investigate/spill_argc_final.tsv`
-  - `build/o1-investigate/params_argc_final.tsv`
-
-The build log should show `clang -O2 ...` for driver, lib, C frontend, and Toy
-frontend objects before these timings are compared to future runs.
-
-## Metric Shape
-
-Top-level `cfree run --time` scopes:
-
-- `run.total`: full driver path after option/input loading through entry return.
-- `run.compile_and_jit`: compile all inputs and build the JIT image.
-- `compile.tu`: one translation unit to an object builder.
-- `compile.frontend`: registered frontend body.
-- `compile.c.*`: C frontend setup, PP option application, parse/codegen,
-  cleanup.
-- `opt.o1.total`: per-function O1 pipeline, nested under frontend codegen.
-- `opt.jump_cleanup`: cheap O1 CFG/layout branch cleanup.
-- `link_jit.all`: linker resolve plus JIT image materialization.
-- `run.jit_lookup` and `run.entry_call`: symbol lookup and execution.
-
-Important O1 counters:
-
-- `opt.funcs`, `opt.blocks`, `opt.insts`, `opt.pregs`
-- `opt.live.block_bytes`, `opt.live.active_words`,
-  `opt.live.bitset_words_touched`
-- `opt.ranges`, `opt.range_points`, `opt.range_raw_points`,
-  `opt.range.point_visits`, `opt.range.preg_scans`,
-  `opt.range.live_words_touched`
-- `opt.alloc.used_loc_words`, `opt.alloc.hard_loc_words`,
-  `opt.alloc.stack_loc_words`, `opt.alloc.stack_slots`
-- `opt.alloc.hard_point_visits`, `opt.alloc.stack_point_visits`,
-  `opt.alloc.hard_mark_points`, `opt.alloc.stack_mark_points`
-- `opt.alloc.spills`
-- `opt.rewrite.reloads`, `opt.rewrite.stores`,
-  `opt.rewrite.inserted_insts`, `opt.rewrite.live_words_touched`
-- `opt.conflict_bytes`
-
-`opt.conflict_bytes` is expected to stay zero on the normal O1 path. The old
-dense pseudo-register conflict matrix is no longer part of O1 allocation.
-
-## Normal Function-Count Ladders
-
-Two synthetic C ladders are used for broad scaling:
-
-- `straight`: N loop-heavy static functions, with `main` directly calling every
-  function. This creates N small functions plus one larger straight-line caller.
-- `table`: the same N functions, but `main` calls through a function-pointer
-  table. This keeps the caller compact and isolates "many functions" from "one
-  larger function".
-
-O1 scopes are summed across all functions.
-
-```text
-straight
-N     compile.tu  opt.o1.total  live_pre  live_reg  regalloc  insts  vals   ranges  spills  used_words  conflict
-64       17.442        15.776     2.495     3.941     8.740   1856    833    1789       0        3198         0
-128      31.210        28.372     4.453     7.191    15.631   3712   1665    3581       0        6398         0
-256      60.026        54.522     8.544    13.764    30.201   7424   3329    7165       0       12798         0
-512     119.735       108.974    16.987    27.253    61.085  14848   6657   14333       0       25598         0
-1024    234.857       210.831    32.958    52.953   115.918  29696  13313   28669       0       51198         0
-
-table
-N     compile.tu  opt.o1.total  live_pre  live_reg  regalloc  insts  vals   ranges  spills  used_words  conflict
-64       14.246        12.916     2.046     3.269     7.113   1620    713    1555       0        2720         0
-128      28.560        26.036     4.128     6.591    14.368   3220   1417    3091       0        5408         0
-256      57.608        52.647     8.371    13.331    28.973   6420   2825    6163       0       10784         0
-512     114.689       104.540    16.557    26.401    57.560  12820   5641   12307       0       21536         0
-1024    232.619       209.061    33.476    52.789   115.426  25620  11273   24595       0       43040         0
-```
-
-512 to 1024 ratios:
-
-```text
-shape     compile.tu  opt.o1.total  live_pre  live_reg  regalloc  used_words
-straight     1.96x        1.93x       1.94x     1.94x     1.90x      2.00x
-table        2.03x        2.00x       2.02x     2.00x     2.01x      2.00x
-```
-
-This is the desired current shape. Many functions and the normal direct-call
-large-caller case are essentially linear through 1024 generated functions.
-
-## Spill-Pressure Ladder
-
-The old `spill` row in `run_scaling.sh` initializes many locals from constants;
-current O1 folds most of that away, so it is useful as a frontend/source-size
-sanity check but not as a pressure test. The useful pressure ladder is
-`spill_argc`, which initializes each local from `argc` so values remain live.
-
-```text
-spill_argc
-N     compile.tu  opt.o1.total  live_pre  live_reg  regalloc  insts  vals  ranges  spills  reloads  inserted  used_words  conflict
-128       1.388         0.769     0.045     0.169     0.474    770   641     767     108      108       216        3076         0
-256       2.468         1.433     0.058     0.291     0.929   1538  1281    1535     236      236       472        9222         0
-512       4.658         2.750     0.072     0.526     1.825   3074  2561    3071     492      492       984       30730         0
-1024      9.894         5.681     0.110     1.058     4.073   6146  5121    6143    1004     1004      2008      110610         0
-```
-
-512 to 1024 ratios:
-
-```text
-compile.tu       2.12x
-opt.o1.total     2.07x
-live_pre         1.53x
-live_reg         2.01x
-regalloc         2.23x
-spills           2.04x
-used_words       3.60x
-```
-
-This was the main remaining mild super-linear signal in the May 17 snapshot.
-The May 21 follow-up found the direct cause: stack spill assignment used
-first-fit probing across existing spill slots. When many spilled ranges all
-overlapped, every new spill checked almost every previous stack slot before
-allocating a new one, so `opt.alloc.stack_point_visits` grew quadratically.
-
-The fix keeps hard-register assignment priority ordered, then assigns spilled
-values to stack slots in lifetime order using an active min-heap keyed by slot
-end position plus a free-slot list. Stack occupancy is still marked into the
-per-slot interval vectors for replay/debug metrics, but assignment no longer
-probes existing stack slots in the all-live case.
-
-May 21 follow-up, same `spill_argc` shape, single runs:
-
-```text
-spill_argc final
-N     compile.tu  opt.o1.total  live_pre  regalloc  used_words  hard_words  stack_words  stack_slots  hard_visits  stack_visits  hard_marks  stack_marks  spills  points
-128       3.037         0.853     0.114     0.396        1686         616         1070          107        11732             0         405          107     107     514
-256       3.989         1.214     0.082     0.634        3222         872         2350          235        23508             0         789          235     235    1026
-512       6.776         1.887     0.075     1.050        6294        1384         4910          491        47060             0        1557          491     491    2050
-1024     13.605         3.561     0.102     2.041       12438        2408        10030         1003        94164             0        3093         1003    1003    4098
-```
-
-512 to 1024 ratios:
-
-```text
-compile.tu       2.01x
-opt.o1.total     1.89x
-regalloc         1.94x
-spills           2.04x
-used_words       1.98x
-stack_marks      2.04x
-stack_visits     0
-```
-
-`used_words` now reports the allocator's actual interval-vector storage rather
-than the size of the old dense point-by-location bitmap model. It is expected
-to track interval/vector capacity and can move in allocation-size steps, but it
-no longer reflects a point-times-slot working set.
-
-## Parameter-Heavy Ladder
-
-This C ladder uses one function with N integer parameters and a caller that
-passes N constants. It stresses ABI lowering, stack arguments, and callee-side
-parameter use.
-
-```text
-params
-N     compile.tu  opt.o1.total  live_pre  live_reg  regalloc  insts  vals  ranges  spills  used_words  conflict
-8         0.816         0.577     0.091     0.143     0.305     37    26      32       0        72         0
-16        0.850         0.595     0.090     0.150     0.320     69    50      64       0       136         0
-32        0.932         0.644     0.093     0.162     0.357    133    98     128      12       392         0
-64        1.053         0.698     0.093     0.175     0.396    261   194     256      44       776         0
-128       1.367         0.866     0.096     0.211     0.510    517   386     512     108      2056         0
-256       1.979         1.209     0.099     0.281     0.740   1029   770    1024     236      6152         0
-512       3.484         2.088     0.110     0.434     1.334   2053  1538    2048     492     20488         0
-```
-
-256 to 512 ratios:
-
-```text
-compile.tu       1.76x
-opt.o1.total     1.73x
-live_pre         1.11x
-live_reg         1.54x
-regalloc         1.80x
-spills           2.08x
-used_words       3.33x
-```
-
-The timing buckets are sub-linear to mildly linear here because the fixed
-frontend/O1 setup costs are still visible. In the May 17 snapshot, `used_words`
-again grew faster than spills and ranges when many parameters forced stack
-traffic.
-
-May 21 follow-up used `argc`-dependent caller arguments to keep the call-side
-values live and prevent constant folding from collapsing the probe:
-
-```text
-params_argc final
-N     compile.tu  opt.o1.total  live_pre  regalloc  used_words  hard_words  stack_words  stack_slots  hard_visits  stack_visits  hard_marks  stack_marks  spills  points
-128       2.296         0.417     0.051     0.205        1436         376         1060          106         2969             0          25          106     106     133
-256       3.717         0.635     0.053     0.292        2716         376         2340          234         5913             0          25          234     234     261
-512       6.343         0.926     0.052     0.365        5276         376         4900          490        11801             0          25          490     490     517
-1024     13.602         2.428     0.080     0.834       10396         376        10020         1002        23577             0          25         1002    1002    1029
-```
-
-512 to 1024 ratios:
-
-```text
-compile.tu       2.14x
-opt.o1.total     2.62x
-regalloc         2.28x
-spills           2.04x
-used_words       1.97x
-stack_marks      2.04x
-stack_visits     0
-```
-
-The parameter probe still has noisy wall-time at these small absolute O1
-durations, but the allocator-side counters now have the desired linear shape.
-
-## Toy Samples
-
-The Toy frontend uses the same CG/O1/backend machinery, so these are useful
-small API-shape probes. O1 scopes are summed across functions.
-
-```text
-case                      compile.tu  opt.o1.total  live_pre  live_reg  regalloc  insts  vals  ranges  spills  used_words  conflict
-06_while_sum                  0.436         0.319     0.049     0.075     0.163     15     7      13       0        26         0
-09_function_params            0.645         0.569     0.091     0.142     0.302     10     8       8       0        22         0
-111_many_function_params      0.682         0.591     0.092     0.147     0.316     38    22      36       0        78         0
-123_spec_demo                 4.709         4.229     0.659     1.016     2.270    788   359     590       0      1168         0
-```
-
-The Toy samples do not show a compile-time scaling problem at this size. The
-larger spec demo is dominated by O1/regalloc, but counters are small and there
-is no conflict-matrix traffic.
-
-## Current Interpretation
-
-The current O1 compile-time story is good enough to shift attention toward
-generated-code quality:
-
-- Normal many-function and large-caller C inputs are approximately linear.
-- The old dense liveness/conflict path is gone from normal O1
-  (`opt.conflict_bytes=0`).
-- Stack spill assignment no longer does first-fit interval probing across every
-  existing spill slot; `opt.alloc.stack_point_visits` stays zero on the
-  stack-heavy May 21 ladders.
-- Link/JIT is not the bottleneck in these runs; the visible time is frontend
-  parse/codegen plus O1.
-- Spill-heavy and parameter-heavy allocator counters now scale by ranges,
-  stack slots, and interval marks rather than by point/candidate products.
-
-## Investigation Priorities
-
-1. Keep the stack-heavy May 21 shape as a regression guard.
-   `stack_point_visits` should remain zero on the normal O1 stack assignment
-   path, while `stack_marks`, `stack_slots`, spills, and actual interval
-   storage should remain approximately linear.
-
-2. Keep `opt.conflict_bytes=0` as a regression guard.
-   Reintroducing dense pseudo conflict storage would be a clear O1 regression.
-
-3. Keep normal O1 branch cleanup block-linear.
-   `jump_cleanup` should remain a local pass over blocks/terminators. Global
-   layout work belongs in O2.
-
-4. Split frontend residuals only after O1 code shape work.
-   The non-O1 compile residual is visible, but current scaling pressure is not
-   coming from object finalization or link/JIT. More parser/PP/CG counters would
-   be useful, but O1 generated-code cleanup has higher near-term value.
diff --git a/scripts/opt_bench.sh b/scripts/opt_bench.sh
@@ -379,7 +379,7 @@ lines = []
 lines.append("# OPT Benchmark Summary")
 lines.append("")
 lines.append(f"Base for speed ratios: `{base_tool} -O2`.")
-lines.append("For MIR, `compile_ms` is C-to-binary-MIR time and `codegen_ms` is the JIT link/generation slice reported by `c2m -v`; compile ratios use their sum. `cfree-run` uses `--bench-time`: `compile_ms` is compile+JIT time, and `runtime_ms` is the in-process entry-call execution slice.")
+lines.append("Unified compile-speed ratios use `compile_ms + codegen_ms` where both are available. For MIR, `compile_ms` is C-to-binary-MIR time and `codegen_ms` is the JIT link/generation slice reported by `c2m -v`; this keeps the headline number comparable to `cfree-run` as a whole while the split table below shows where time is spent. `cfree-run` uses `--bench-time`: `compile_ms` is compile+JIT time, and `runtime_ms` is the in-process entry-call execution slice.")
 lines.append("")
 lines.append("## Status")
 lines.append("")
@@ -415,6 +415,27 @@ for key in sorted(groups):
         f"{geo(run_ratios)} | {avg(comp_totals)} | {avg(run_times)} |"
     )
 lines.append("")
+lines.append("## Split Compile Timings")
+lines.append("")
+lines.append("| tool | opt | ok cases | avg compile/frontend ms | avg codegen/JIT ms | avg unified compile ms |")
+lines.append("| --- | ---: | ---: | ---: | ---: | ---: |")
+for key in sorted(groups):
+    vals = groups[key]
+    compile_times = []
+    codegen_times = []
+    comp_totals = []
+    for r in vals:
+        cm = fnum(r["compile_ms"])
+        cg = fnum(r["codegen_ms"])
+        total = None if cm is None else cm + (cg or 0.0)
+        compile_times.append(cm)
+        codegen_times.append(cg)
+        comp_totals.append(total)
+    lines.append(
+        f"| {key[0]} | {key[1]} | {len(vals)} | {avg(compile_times)} | "
+        f"{avg(codegen_times)} | {avg(comp_totals)} |"
+    )
+lines.append("")
 lines.append(f"Raw CSV: `{csv_path}`")
 with open(out_path, "w") as f:
     f.write("\n".join(lines) + "\n")

	kit kit
	git clone https://git.ryansepassi.com/git/kit.git
	Log \| Files \| Refs \| README

D	doc/CONSTFOLD.md	\|	600	-------------------------------------------------------------------------------
D	doc/MIR_RA_REPORT.md	\|	933	-------------------------------------------------------------------------------
M	doc/OPT.md	\|	1571	++++++++++++++-----------------------------------------------------------------
D	doc/OPT1.md	\|	330	-------------------------------------------------------------------------------
A	doc/OPT_PERF.md	\|	200	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
D	doc/PERF.md	\|	285	-------------------------------------------------------------------------------
M	scripts/opt_bench.sh	\|	23	++++++++++++++++++++++-