commit e462688b568fc13b28f3db4126fca1da77335b87
parent a28452e0597151ea9bb31eaf7c220e4ee5ee801c
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Fri, 22 May 2026 09:14:36 -0700
Split optimizer design and performance docs
Diffstat:
| D | doc/CONSTFOLD.md | | | 600 | ------------------------------------------------------------------------------- |
| D | doc/MIR_RA_REPORT.md | | | 933 | ------------------------------------------------------------------------------- |
| M | doc/OPT.md | | | 1571 | ++++++++++++++----------------------------------------------------------------- |
| D | doc/OPT1.md | | | 330 | ------------------------------------------------------------------------------- |
| A | doc/OPT_PERF.md | | | 200 | +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ |
| D | doc/PERF.md | | | 285 | ------------------------------------------------------------------------------- |
| M | scripts/opt_bench.sh | | | 23 | ++++++++++++++++++++++- |
7 files changed, 490 insertions(+), 3452 deletions(-)
diff --git a/doc/CONSTFOLD.md b/doc/CONSTFOLD.md
@@ -1,600 +0,0 @@
-# CONSTFOLD - Vstack Constant Simplification Plan
-
-`-O1` needs a tiny constant simplification path that improves obvious scalar
-code without pulling the O2 SSA/value-optimization machinery into the fast
-backend tier. The right first home is the public CG API value stack in
-`src/api/cg.c`, beside the existing delayed compare representation used by
-compare-consuming branches.
-
-This plan covers integer constant folding and local expression delay in the
-vstack. It intentionally avoids global propagation, CFG-wide reasoning, and
-target-specific peepholes.
-
-## Status
-
-Implemented:
-
-- immediate-only integer binop folding for flag-free integer operations up to
- 64 bits;
-- immediate-only integer unop folding for flag-free integer operations up to
- 64 bits;
-- immediate-only integer and pointer-like compare folding;
-- scalar immediate return preservation through `cfree_cg_ret`;
-- CG API shape coverage for O1 literal folds;
-- delayed-compare materialization now reuses only owned operand registers;
-- delayed `SV_ARITH` for unary and binary arithmetic;
-- expression-local arithmetic chain folding;
-- straight-line local constant shadowing with conservative boundary
- invalidation;
-- disassembly and metrics refresh for the phase 5 probes;
-- `ApiSValue` delayed compare/arithmetic payloads are now unioned;
-- local-shadow invalidation is routed through named memory, control, and
- address-taking boundary helpers;
-- delayed arithmetic has a direct register-pressure materialization regression
- test.
-
-Remaining:
-
-- consider a shadow generation counter if repeated `clear_all` scans show up in
- future O1 metrics.
-
-## Current Shape
-
-The API value stack currently carries:
-
-- `SV_OPERAND`: an immediate, register, local, global, or indirect operand;
-- `SV_CMP`: a delayed integer compare with source operands and register
- ownership bits.
-
-Integer constants enter the stack as `SV_OPERAND` plus `OPK_IMM`.
-
-Integer compare already has the desired delayed shape:
-
-```text
-push a
-push b
-int_cmp -> pushes SV_CMP, emits no boolean materialization
-branch_true -> consumes SV_CMP as target->cmp_branch
-```
-
-This avoids the old branch-consumer shape:
-
-```text
-cmp
-cset
-cmp #0
-b.cond
-```
-
-Arithmetic now has a first immediate-only fold. For flag-free integer
-operations where both operands are `OPK_IMM`, `api_cg_binop` and
-`api_cg_unop` fold before allocating a result register. `api_cg_cmp` likewise
-folds immediate-only integer and pointer-like compares. Non-immediate
-arithmetic still materializes immediately; delayed arithmetic is the next
-planned step.
-
-The visible O1 quality gap is:
-
-```c
-int x = 40;
-x += 2;
-return x;
-```
-
-The expression part can be fixed by immediate folding, but the full example
-also crosses a local store/load. That requires a narrow local constant shadow,
-not just expression-stack folding.
-
-## Goals
-
-- Fold immediate-only integer arithmetic before register allocation.
-- Delay simple integer arithmetic on the vstack when doing so can avoid
- materialization or combine a following immediate.
-- Preserve the existing `SV_CMP` compare-branch behavior.
-- Improve straight-line scalar-local code such as `x = 40; x = x + 2`.
-- Keep the implementation target-neutral and bounded.
-- Do not introduce VLAs or global state.
-- Keep all new state attached to `CfreeCg`, `ApiSValue`, or `ApiSourceLocal`.
-
-## Non-Goals
-
-- No SSA construction.
-- No global constant propagation.
-- No CFG-wide dataflow.
-- No folding across labels, jumps, branches, calls, volatile accesses, inline
- asm, or unknown memory effects.
-- No FP constant folding in this pass.
-- No i128 folding until there is a deliberate wide-integer helper.
-- No target-specific immediate legality checks in the vstack. The target
- backend already decides whether an `OPK_IMM` source fits an instruction
- encoding or must be materialized.
-
-## Design
-
-### 1. Immediate-Only Integer Folding
-
-Implemented in `src/api/cg.c` with helpers shaped as:
-
-```c
-static int api_try_fold_int_binop(CfreeCg*, BinOp, CfreeCgTypeId,
- i64 a, i64 b, i64* out);
-static int api_try_fold_int_unop(CfreeCg*, UnOp, CfreeCgTypeId,
- i64 a, i64* out);
-static int api_try_fold_int_cmp(CfreeCg*, CmpOp, CfreeCgTypeId,
- i64 a, i64 b, i64* out);
-```
-
-The binop/unop paths run before allocating a destination register. If both
-source operands are `OPK_IMM` and the operation is supported, they push a new
-`OPK_IMM` result and release the inputs.
-
-Implemented operations:
-
-- `BO_IADD`, `BO_ISUB`, `BO_IMUL`;
-- `BO_AND`, `BO_OR`, `BO_XOR`;
-- `BO_SHL`, `BO_SHR_U`, `BO_SHR_S`;
-- `UO_NEG`, `UO_BNOT`, `UO_NOT`;
-- integer compares.
-
-Still skipped:
-
-- signed and unsigned division/remainder, to avoid changing divide-by-zero
- behavior and signed-overflow corner cases in the first patch;
-- any operation with trap, saturating, or exact/nowrap flags until flags are
- actually modeled by the vstack fold;
-- FP operations;
-- i128 types.
-
-Evaluation rules:
-
-- Use unsigned arithmetic internally where possible so host signed overflow is
- not invoked.
-- Mask the result to the destination integer width for widths below 64.
-- For arithmetic right shift, sign-extend the masked input before shifting.
-- Normalize `_Bool` results to `0` or `1`.
-- Treat pointer-typed `OPK_IMM` only as integer-sized values for compares.
- Do not fold pointer arithmetic here.
-
-### 2. Immediate Return Preservation
-
-Implemented. `cfree_cg_ret` preserves non-aggregate scalar `OPK_IMM` values by
-passing them directly to `target->ret`.
-
-Required behavior:
-
-```text
-return 40 + 2
- -> vstack folds to OPK_IMM 42
- -> ret receives OPK_IMM 42
- -> backend emits only return-value materialization
-```
-
-If the return value is delayed arithmetic or any non-immediate expression, it
-can still materialize through the normal register path.
-
-### 3. Delayed Arithmetic SValues
-
-Add a generic delayed arithmetic stack value after the immediate-only fold is
-stable. It should cover both unary and binary integer arithmetic, with an
-operation-kind discriminator and an optional second operand.
-
-Suggested shape:
-
-```c
-typedef enum ApiDelayedArithKind {
- API_DELAYED_UNOP,
- API_DELAYED_BINOP,
-} ApiDelayedArithKind;
-
-typedef enum ApiSValueKind {
- SV_OPERAND,
- SV_CMP,
- SV_ARITH,
-} ApiSValueKind;
-
-typedef struct ApiSValue {
- Operand op;
- Operand cmp_a;
- Operand cmp_b;
- Operand arith_a;
- Operand arith_b;
- CfreeCgTypeId type;
- CmpOp cmp_op;
- BinOp arith_bin_op;
- UnOp arith_un_op;
- u8 kind;
- u8 arith_kind;
- u8 res;
- u8 pinned;
- u8 lvalue;
- u8 cmp_a_owned;
- u8 cmp_b_owned;
- u8 arith_a_owned;
- u8 arith_b_owned;
- FrameSlot spill_slot;
- CfreeCgLocal source_local;
-} ApiSValue;
-```
-
-If struct size becomes a concern, the compare and arithmetic payloads can be
-moved into a small union. Keep the first patch simple unless stack footprint
-shows up in measurements.
-
-Materialization mirrors `SV_CMP`:
-
-- `api_ensure_reg` dispatches `SV_ARITH` to `api_materialize_arith_to`.
-- `api_materialize_arith_to` calls `target->unop` when
- `arith_kind == API_DELAYED_UNOP` and `target->binop` when
- `arith_kind == API_DELAYED_BINOP`.
-- `api_release` releases owned operand registers for `SV_ARITH`.
-- `api_pick_victim` ignores delayed non-register values.
-- `api_alloc_reg_or_spill` may materialize a delayed arithmetic value if no
- normal spill victim exists, just as it does for delayed compares.
-
-Creation rules:
-
-- Only create `SV_ARITH` for integer-class operations.
-- For binary operations, require at least one non-immediate source.
- Immediate-only cases fold to `OPK_IMM` instead.
-- For unary operations, delay only non-immediate sources. Immediate-only cases
- fold to `OPK_IMM` instead.
-- Own registers only when the source `ApiSValue` owns that register, following
- the existing compare ownership pattern.
-- Never create delayed arithmetic for lvalues, aggregates, FP values, calls, or
- memory operands that need a load.
-- Do not encode unary operations as synthetic binary operations. For example,
- keep `-x` as `UO_NEG`, `~x` as `UO_BNOT`, and `!x` as `UO_NOT` or a delayed
- compare only when the consumer naturally wants compare semantics.
-
-### 4. Local Arithmetic Chain Folding
-
-Once `SV_ARITH` exists, fold cheap expression-local chains before
-materializing:
-
-```text
-(x + C1) + C2 -> x + (C1 + C2)
-(x + C1) - C2 -> x + (C1 - C2)
-(x - C1) + C2 -> x + (C2 - C1)
-(x - C1) - C2 -> x - (C1 + C2)
-(x ^ C1) ^ C2 -> x ^ (C1 ^ C2)
-(x & C1) & C2 -> x & (C1 & C2)
-(x | C1) | C2 -> x | (C1 | C2)
-~~x -> x
-!!x -> x != 0, if the result is consumed as a branch or compare-like boolean
-```
-
-Keep this conservative:
-
-- only one delayed level;
-- only one real non-immediate operand;
-- no reassociation across non-immediate operands;
-- no multiply reassociation initially;
-- no transforms that depend on signed-overflow assumptions.
-- no unary chain transform that changes C truth-value semantics for non-boolean
- integer results.
-
-If the folded constant becomes an identity, collapse the result:
-
-- `x + 0`, `x - 0`, `x ^ 0`, `x | 0` -> `x`;
-- `x & -1` -> `x` after width masking;
-- `x * 1` -> `x` if multiply folding is later added;
-- `x * 0` -> `0` only when flags and side effects are known safe.
-
-### 5. Straight-Line Local Constant Shadow
-
-To handle `int x = 40; x += 2; return x;`, track constants for source locals
-that are eligible for register-local treatment.
-
-Extend `ApiSourceLocal`:
-
-```c
-typedef struct ApiSourceLocal {
- ...
- i64 const_value;
- u8 const_valid;
- u8 pad[2];
-} ApiSourceLocal;
-```
-
-Rules:
-
-- A local can be shadowed only if it is scalar integer or pointer typed.
-- Address-taken locals are not shadowed.
-- Volatile accesses are not shadowed.
-- Parameters start unknown.
-- On store of an immediate to an eligible source local, set the shadow value.
-- On store of a non-immediate to that local, clear the shadow value.
-- On partial-width or mismatched-type access to that local, clear/do not use the
- shadow value.
-- On load of an eligible source local with a valid shadow, push `OPK_IMM`
- instead of loading/materializing the local.
-- On taking a local address, clear the local's shadow.
-- On unknown memory or control-flow boundary, clear all shadows.
-
-Clear all local shadows at:
-
-- label placement;
-- unconditional jump;
-- conditional branch;
-- switch;
-- scope break/continue control transfer if it can leave the current block;
-- call, unless later alias analysis proves the call cannot observe the local;
-- inline asm;
-- atomic operations and fences;
-- direct stores through `OPK_INDIRECT`;
-- address-taking of any shadowed local.
-
-This is intentionally a basic-block-local cache. It is not a dataflow fact.
-
-### 6. Interaction With Existing Register Locals
-
-The CG API already treats some scalar autos as source-register lvalues when the
-target returns `CG_LOCAL_STORAGE_REG`. Loads from those locals can avoid memory.
-The constant shadow sits above that:
-
-```text
-local has valid constant
- -> push OPK_IMM
-else local stored in register
- -> push register-backed rvalue
-else
- -> normal frame load path
-```
-
-Stores to register locals still update the real storage when necessary. The
-shadow only affects subsequent vstack reads in the same straight-line region.
-
-### 7. Boundary And Correctness Rules
-
-Every helper that consumes stack values must understand the new delayed value:
-
-- `api_release`;
-- `api_ensure_reg`;
-- `api_force_reg`;
-- `api_class_of_sv`;
-- `api_reg_of_sv`;
-- `api_owned_reg_type`;
-- `api_pick_victim`;
-- `cfree_cg_dup`;
-- `cfree_cg_store`;
-- call argument preparation;
-- return handling;
-- branch handling.
-
-If an operation does not explicitly support delayed arithmetic, it should force
-materialization first. That keeps behavior local and avoids subtle ownership
-bugs.
-
-Do not fold across volatile memory. `CfreeCgMemAccess.flags` already carries
-`CFREE_CG_MEM_VOLATILE`; the local shadow and vstack folds must respect it.
-
-Do not depend on C signed-overflow undefined behavior. The CG API integer types
-are width-only, and the existing frontend currently passes `CFREE_CG_INTOP_NONE`
-for ordinary integer arithmetic. The fold should preserve the backend's
-current wrap-like behavior unless flags explicitly request something else.
-
-## Implementation Phases
-
-### Phase 1: Red Tests For Literal Folding
-
-Done. The CG API test now covers O1 literal fold shapes for:
-
-- direct API function returning `40 + 2`;
-- direct API function returning `(5 << 3) | 2`;
-- direct API compare through `40 + 2 == 42`;
-- direct API unary folding through `~~41 + 1`.
-
-Expected shape should check for fewer emitted arithmetic operations where the
-existing test harness can observe it. Runtime correctness alone is not enough.
-
-### Phase 2: Immediate-Only Fold
-
-Done. The fold helpers are wired into:
-
-- `api_cg_binop`;
-- `api_cg_unop`;
-- `api_cg_cmp`;
-- `cfree_cg_ret`.
-
-Validation run:
-
-```sh
-make test-cg-api test-opt
-CFREE_TEST_ALLOW_SKIP=1 make test-parse
-```
-
-Final observed parse result:
-
-```text
-2496 pass, 0 fail, 4 skip
-```
-
-### Phase 3: Delayed Arithmetic
-
-Done. Added `SV_ARITH`, operation-kind metadata, ownership helpers,
-materialization, release logic, and register-pressure materialization.
-
-Tests cover:
-
-- one delayed unary arithmetic value consumed by return;
-- one delayed binary arithmetic value consumed by return;
-- delayed arithmetic consumed by compare;
-- delayed arithmetic consumed by another binop with immediate folding;
-- delayed unary arithmetic consumed by another unop with chain folding;
-- delayed arithmetic forced by store path;
-- register-pressure materialization when no ordinary spill victim exists.
-
-Run:
-
-```sh
-make test-cg-api
-make test-opt
-```
-
-### Phase 4: Local Constant Shadow
-
-Done. Added local shadow state to `ApiSourceLocal` and invalidation helpers:
-
-```c
-static void api_local_const_clear(ApiSourceLocal*);
-static void api_local_const_clear_all(CfreeCg*);
-static void api_local_const_memory_boundary(CfreeCg*);
-static void api_local_const_control_boundary(CfreeCg*);
-static void api_local_const_address_taken(CfreeCg*, CfreeCgLocal);
-static int api_local_const_can_track(CfreeCg*, const ApiSourceLocal*,
- CfreeCgMemAccess);
-static void api_local_const_store(CfreeCg*, CfreeCgLocal, CfreeCgTypeId, i64);
-static int api_local_const_load(CfreeCg*, CfreeCgLocal, CfreeCgTypeId,
- Operand* out);
-```
-
-Wire store/load behavior through `cfree_cg_store` and `cfree_cg_load`.
-
-Wire invalidation through:
-
-- `cfree_cg_label_place`;
-- `cfree_cg_jump`;
-- `api_branch_if`;
-- `cfree_cg_switch`;
-- call paths;
-- inline asm path;
-- atomics/fences;
-- indirect stores;
-- local address-taking.
-
-Tests cover:
-
-- `int x = 40; x = x + 2; return x;`;
-- shadow cleared by label/jump;
-- shadow cleared by branch;
-- shadow cleared by address-taking;
-- shadow cleared by volatile access;
-- shadow cleared by indirect store;
-- partial-width stores do not preserve a full-local shadow;
-- parameter locals are initially unknown.
-
-Run:
-
-```sh
-make test-cg-api
-make test-opt
-make test-parse
-```
-
-If the change affects emitted target code shape, also run one targeted smoke
-case per supported O1 target:
-
-```sh
-make test-smoke-x64
-make test-aa64-inline
-```
-
-and an RV64 targeted case if available locally.
-
-### Future Refactors
-
-Completed cleanup from the original phase 3/4 implementation:
-
-- `SV_CMP` and `SV_ARITH` payload fields now live in an `ApiSValue` union.
-- Local-shadow invalidation now goes through named boundary helpers:
- `api_local_const_memory_boundary`, `api_local_const_control_boundary`, and
- `api_local_const_address_taken`.
-- `test/api/cg_type_test.c` includes an explicit delayed-arithmetic
- register-pressure case that forces materialization when no ordinary spill
- victim exists.
-
-Still worth considering before growing the vstack simplifier further:
-
-- Consider a shadow generation counter if whole-function local counts grow
- enough for repeated `clear_all` scans to show up in O1 metrics.
-- Consider one small builder helper in `test/api/cg_type_test.c` for creating
- O1 one-parameter functions; the expanded constfold tests intentionally stayed
- direct but now duplicate setup boilerplate.
-
-### Phase 5: Disassembly And Metrics
-
-Done. Re-ran the probe cases from `doc/OPT1.md`:
-
-- `6_5_17_compound_assign`;
-- a direct literal return case;
-- a compare/branch literal case;
-- one local-address-taken case.
-
-For each of AArch64, x64, and RV64, recorded:
-
-- `.text` size;
-- instruction count;
-- `mov`/`load_imm` count;
-- arithmetic instruction count;
-- spills/reloads;
-- rewrite inserted instruction count.
-
-Host JIT `run --time` samples recorded O1 wall time, live-range time,
-regalloc time, spills/reloads, and rewrite inserted instruction count. The
-current numbers are in `doc/PERF.md` under "Vstack Constfold Probe".
-
-## Expected Results
-
-For direct literal arithmetic:
-
-```c
-int test_main(void) { return 40 + 2; }
-```
-
-O1 should emit only return-value materialization for `42`, with no separate add.
-
-For scalar locals:
-
-```c
-int test_main(void) {
- int x = 40;
- x = x + 2;
- return x;
-}
-```
-
-O1 should avoid the load/add/store/reload chain in the straight-line case. The
-best expected shape is return-value materialization for `42`, unless debug or
-local-storage requirements intentionally force a store.
-
-For address-taken locals:
-
-```c
-int test_main(void) {
- int x = 40;
- int* p = &x;
- *p = *p + 2;
- return x;
-}
-```
-
-The local shadow must not fold through the pointer access. Correctness is more
-important than preserving the constant.
-
-## Risks
-
-- Register ownership bugs in delayed arithmetic are the main implementation
- risk. Reuse the compare ownership pattern and add tests where delayed
- arithmetic values are forced, duplicated, dropped, and spilled.
-- Local shadow invalidation can become unsound if it is treated as a dataflow
- fact. Keep it basic-block-local and clear aggressively.
-- Signed arithmetic folding can diverge from backend behavior if it uses host
- signed overflow. Use unsigned operations plus width masking.
-- Return-path changes can accidentally bypass ABI requirements for aggregate
- returns. Only preserve immediates for non-aggregate scalar returns.
-- Debug-info expectations for locals may eventually require materialized
- storage even when constant-shadowed. Do not remove local declarations or
- frame-slot allocation in this pass.
-
-## Acceptance Criteria
-
-- Immediate-only integer expressions fold before backend allocation.
-- Delayed compare-branch behavior remains covered and unchanged.
-- Delayed arithmetic materializes correctly in all consumers.
-- Straight-line scalar-local constants fold in the common `x = 40; x += 2`
- shape.
-- Shadowed constants are cleared across control-flow, volatile, indirect
- memory, call, asm, atomic, and address-taking boundaries.
-- Targeted tests pass for CG API, O1 opt, and parser coverage.
-- Disassembly shows fewer arithmetic/materialization instructions in the probe
- cases without introducing target-specific vstack logic.
diff --git a/doc/MIR_RA_REPORT.md b/doc/MIR_RA_REPORT.md
@@ -1,933 +0,0 @@
-# MIR Data Structures and Register Allocation Report
-
-This report summarizes the data structures and algorithms in MIR that matter
-most for cfree's optimizer direction. The emphasis is structural: cfree should
-move toward MIR's allocation shape before spending more effort on small
-localized speedups in the current dense-conflict implementation.
-
-Primary MIR source references:
-
-- `~/tmp/mir/mir-gen.c`
-- `~/tmp/mir/mir.h`
-- `~/tmp/mir/mir-bitmap.h`
-- `~/tmp/mir/mir-varr.h`
-- `~/tmp/mir/mir-dlist.h`
-
-Related cfree notes:
-
-- `doc/OPT.md`: broad MIR-informed optimizer plan.
-- `doc/PERF.md`: current timing and scaling data.
-
-## Executive Summary
-
-MIR's register allocation pipeline is live-range based. It does not allocate by
-building a full pseudo-register interference matrix. MIR computes block live
-sets, converts them into compressed live ranges, assigns physical locations by
-checking occupied hard-register and stack locations at live-range program
-points, then rewrites instructions.
-
-cfree's original O1 path was conflict-matrix based. `opt_live_info` built a
-dense `nvals x words` pseudo interference matrix, stored per-instruction full
-`live_after` bitsets, and scanned `1..nvals` in hot loops. `doc/PERF.md` shows
-why that structure was the scaling problem for one large function.
-
-The main recommendation is to reshape cfree around MIR's allocation model:
-
-1. Separate block liveness from register-allocation analysis.
-2. Build live ranges from block liveness.
-3. Allocate against per-program-point occupied physical locations.
-4. Keep conflict matrices only for bounded, optional coalescing.
-5. Rewrite pseudos after assignment, with later live-range splitting for O2.
-
-This is a structural change, not an incremental optimization of the current
-matrix.
-
-As of the current O1 implementation, the dense pseudo-conflict allocator is no
-longer the normal path. O1 now has pass-local block liveness, compressed live
-ranges, split hard-register and stack-slot occupancy, assignment-map rewrite,
-per-location interval occupancy, and metrics for live/range/allocation/rewrite
-event traffic. The remaining work is no longer "delete the old matrix" or
-"replace O1 point-index occupancy"; those cleanup steps are complete. The next
-allocator work should move to O2 coalescing/splitting and generated-code
-quality, while keeping O1's fast MIR-shaped path as the baseline.
-
-## MIR Container Layer
-
-MIR uses small generic data-structure building blocks throughout `mir-gen.c`.
-They are simple but important because they determine how passes are organized.
-
-### VARR
-
-`mir-varr.h` defines typed growable arrays. A `VARR(T)` stores:
-
-- element count
-- capacity
-- element pointer
-- allocator
-
-Growth uses `realloc` and increases capacity by about 1.5x. MIR uses `VARR`
-for pass-owned vectors such as temporary instruction lists, live-range tables,
-sorted allocation candidates, and per-point used-location bitmaps.
-
-cfree already has arena-backed arrays and typed vector macros in several
-subsystems. We should not copy `VARR` mechanically, but we should mirror the
-ownership pattern: pass state owns resizable vectors explicitly instead of
-embedding all allocation artifacts permanently in `Func`.
-
-### DLIST
-
-`mir-dlist.h` defines intrusive typed doubly linked lists. MIR functions and
-basic blocks use linked instruction lists because passes insert, delete, and
-move instructions frequently:
-
-- function instruction list: `MIR_func.insns`
-- CFG block list: `func_cfg.bbs`
-- block instruction list: `bb.bb_insns`
-- edge lists: `bb.in_edges`, `bb.out_edges`
-
-cfree currently stores each `Block` as a contiguous `Inst*` array. That is good
-for replay and scanning, but less natural for MIR-style lowering, rewrite,
-spill insertion, edge splitting, and post-RA combine. For the first structural
-rewrite we can keep arrays if we introduce a pass-local edit buffer, but the
-final shape should support cheap insertion around instructions and on block
-boundaries.
-
-### Bitmap
-
-`mir-bitmap.h` is a growable sparse-ish bitmap built on `VARR<uint64_t>`.
-Important details:
-
-- It expands only as far as the highest needed bit.
-- `bitmap_clear` truncates length to zero.
-- binary/ternary operations trim trailing zero words.
-- iteration uses `FOREACH_BITMAP_BIT` to visit set bits.
-
-This is a key contrast with cfree's current fixed-width bitsets. In cfree,
-`opt_live_words = ceil(nvals / 64)`, and every live set for the function has
-that width. MIR's bitmaps still use dense words internally, but their active
-length tracks actual high bits, and many loops iterate set bits.
-
-cfree should add an optimizer bitmap abstraction before rewriting RA. It should
-support:
-
-- clear without touching max capacity
-- copy/ior/and-not with change reporting
-- set-bit iteration
-- optional trimming of trailing zero words
-
-The exact storage can remain arena-backed for pass lifetime, but the API should
-hide fixed full-function width from users.
-
-## MIR Function and CFG Shape
-
-MIR's generator owns a `gen_ctx` with pass subcontexts:
-
-- target context
-- data-flow context
-- SSA context
-- GVN context
-- live-range context
-- coalesce context
-- RA context
-- combine context
-
-This keeps pass state out of the public MIR function structure. The current
-function being generated stores a `func_cfg` in `func_item->data` during
-generation.
-
-### Instructions
-
-`MIR_insn` is a variable-sized object:
-
-```c
-struct MIR_insn {
- void *data;
- DLIST_LINK(MIR_insn_t) insn_link;
- MIR_insn_code_t code : 32;
- unsigned int nops : 32;
- MIR_op_t ops[1];
-};
-```
-
-`data` points to pass-local metadata, usually `bb_insn` after CFG construction.
-This lets MIR keep the IR instruction object stable while attaching and
-replacing analysis state.
-
-cfree's `Inst` is a fixed struct with side arrays for operands and aux data.
-That is fine for recording, but pass state should move out of `Inst`, `Block`,
-and `Func` where possible. In particular, liveness, ranges, allocation
-assignment, and rewrite scratch state should be pass contexts, not persistent
-fields scattered through `Func`.
-
-### Basic Blocks
-
-MIR `bb` contains:
-
-- CFG links: in/out edge lists
-- instruction list
-- flags and reachability
-- four data-flow bitmaps: `in`, `out`, `gen`, `kill`
-- dominator bitmaps reused by several passes
-- loop-node pointer
-- max integer and FP pressure
-
-MIR reuses the bitmap fields for different data-flow problems through local
-macros such as `live_in`, `live_out`, `live_gen`, and `live_kill`.
-
-cfree currently stores `live_in`, `live_out`, `live_use`, `live_def`, and
-`live_after` directly on `Block`. That ties block storage to one analysis. The
-MIR-like direction is to keep persistent block structure focused on CFG and
-instruction ownership, and keep data-flow bitmaps in pass-local arrays indexed
-by block id.
-
-## MIR Pipeline Shape
-
-For the allocation-related portion, MIR runs:
-
-1. `target_machinize`
-2. build loop tree for O1+
-3. optional move collection and coalescing for O2
-4. full live info
-5. `reg_alloc`
-6. rewrite
-7. optional splitting for O2
-8. post-RA combine
-9. post-combine DCE
-
-This split is more important than individual implementation details. In MIR,
-liveness is a reusable input to multiple later phases, but register allocation
-does not depend on a full pseudo interference graph.
-
-For cfree, O1 should become:
-
-```text
-build_cfg
-machinize
-build_loop_tree
-live_blocks
-build_live_ranges
-assign_locations(no splitting)
-rewrite_locations
-combine
-dce
-emit
-```
-
-O2 can add coalescing, splitting, SSA value passes, and richer cleanup after
-this shape is stable.
-
-## MIR Liveness
-
-MIR's liveness is block-first:
-
-1. Walk each block backward to compute `live_gen` and `live_kill`.
-2. Solve backward data-flow:
- - `live_out = union(successor.live_in)`
- - `live_in = live_gen | (live_out & ~live_kill)`
-3. Track register frequency and block pressure while scanning.
-4. Iterate set bits for live-out pressure accounting.
-
-MIR also has a `scan_vars` mode. Normally `scan_vars_num == 0`, meaning all
-variables participate. For move coalescing, MIR can restrict liveness to only
-move-related variables.
-
-cfree should copy this separation. The first replacement for `opt_live_info`
-should not build:
-
-- `val_conflicts`
-- per-instruction `live_after`
-- spill/rewrite data
-
-It should only produce block live-in/live-out, use/def, frequency, and pressure
-inputs. Later passes can build the extra views they need.
-
-## MIR Live Ranges
-
-MIR represents each variable as a linked list of live ranges:
-
-```c
-struct live_range {
- lr_bb_t lr_bb;
- int start, finish;
- int ref_cost;
- live_range_t next;
-};
-```
-
-The range builder walks blocks backward from `live_out`, just like liveness,
-but emits range start/end points. It also records whole-block ranges using
-`lr_bb` for values that are live through a block without being referenced
-inside it.
-
-MIR then compresses live-range points:
-
-- gather all born/dead points
-- map old instruction points to a smaller point space
-- merge adjacent ranges where legal
-
-This compression matters. It reduces the size of the later `used_locs`
-structure and avoids making the allocator pay for every instruction position
-when only lifetime boundaries matter.
-
-cfree should introduce:
-
-```c
-typedef struct OptLiveRange {
- Val val;
- u32 start;
- u32 end;
- u32 next;
- u32 first_block_span; /* optional: whole-block span list */
-} OptLiveRange;
-
-typedef struct OptLiveRangeSet {
- OptLiveRange* ranges;
- u32 nranges;
- u32* first_range_by_preg;
- u32 point_count;
-} OptLiveRangeSet;
-```
-
-The exact fields can change, but the structural point should not: allocation
-should reason from ranges, not from a dense value-value conflict matrix.
-
-## MIR Register Allocation
-
-MIR allocation has three core pieces:
-
-### Candidate Ordering
-
-MIR builds `sorted_regs` from pseudo regs, computes live length from ranges,
-and sorts allocation candidates. The priority includes tied hard-register
-requirements, frequency, and live length.
-
-cfree already has a similar priority idea in `val_higher_priority`. The
-problem is not ordering. The problem is what each candidate checks against.
-
-### Occupied Locations
-
-MIR maintains:
-
-- `used_locs`: vector of bitmaps indexed by compressed program point
-- `busy_used_locs`: extra occupancy for split-capable allocation
-- `conflict_locs1`: temporary bitmap of occupied physical locations
-
-For each pseudo range, MIR unions the occupied locations for all points in the
-range into `conflict_locs`, then asks for a hard register or stack slot not in
-that set.
-
-This is the central data-structure difference:
-
-- MIR: pseudo -> live ranges -> occupied hard locations by point
-- cfree today: pseudo -> dense conflicts with other pseudos -> scan assigned
- pseudos to test hard-register conflicts
-
-The MIR shape bounds allocation checks by range length and physical-location
-bitmap width, not by all pseudo pairs. It is a better fit for large generated
-functions where many values never overlap.
-
-### Rewrite
-
-After assignment, MIR rewrites instruction operands:
-
-- hard-register assigned pseudos become hard regs
-- stack-assigned pseudos become loads/stores around uses/defs
-- some spilled uses can become memory operands directly if target-legal
-- call-clobbered live regs are saved/restored
-- noop moves are deleted
-
-cfree's current rewrite is simpler and tightly coupled to `live_after`. A
-MIR-like cfree rewrite should walk each block backward with a rolling live set,
-using allocation assignment and target scratch registers. It should not require
-persisting a full live-after bitmap for every instruction.
-
-## MIR Conflict Matrix Use
-
-MIR still has a conflict matrix, but only for coalescing. It is intentionally
-narrow:
-
-- collect moves
-- collect only source/destination regs from those moves
-- cap the number at `MIR_MAX_COALESCE_VARS`
-- build a matrix over this reduced scan-var space
-- coalesce non-conflicting move-related regs
-
-The comment in MIR says 10K scan vars is about 8 MB for the coalescing matrix.
-That cap is a design signal: even MIR avoids an unbounded full pseudo matrix.
-
-cfree should follow this model:
-
-- no full matrix in normal O1 allocation
-- optional move-only matrix in O2 coalescing
-- hard cap and metric counters for matrix size
-
-## cfree Mismatch Status
-
-The initial O1 implementation had five major structural mismatches:
-
-1. `opt_live_info` was overloaded.
- It computed block liveness, per-value summaries, per-instruction live-after,
- conflict matrix, call live-across summaries, and asm constraints in one pass.
- This has been split into pass-local liveness, range construction,
- allocation, and rewrite state.
-
-2. DDE paid for full allocation analysis.
- Pre-DDE no longer builds allocation ranges or conflicts. It still runs block
- liveness before DDE, and O1 runs block liveness again for regalloc after DDE
- changes the instruction stream. Avoiding or reusing that second pass is a
- remaining O1 cleanup item, not the main current bottleneck.
-
-3. Dense conflicts were the allocator's core dependency.
- The normal O1 allocator no longer allocates or consumes `Func.val_conflicts`.
- Dense matrices should stay out of O1 and return only as bounded, optional
- move-coalescing support in O2.
-
-4. Hot loops scanned all vals.
- Many of these loops now use operand-reference events, set-bit iteration, or
- touched-value/generation lists. The final O1 cleanup removed the remaining
- high-watermark live-across-call/rewrite scans that showed up in spill-heavy
- wide functions.
-
-5. `Block.live_after` stored a full live bitmap per instruction.
- Normal O1 no longer persists per-instruction full live-after storage.
- Rewrite and DDE walk backward with rolling live state, and rewrite's
- live-word traffic is now linear in the focused spill-pressure ladder.
-
-6. Point-index occupancy loops remained after the matrix was removed.
- Normal O1 allocation now stores sorted live intervals per physical
- hard-register and per stack spill slot. Assignment checks interval overlap
- for each candidate range instead of OR-ing occupied-location bitmap rows for
- every compressed point.
-
-## Proposed cfree Structural Target
-
-Introduce pass-local optimizer state:
-
-```c
-typedef struct OptBitset OptBitset;
-typedef struct OptDataflow OptDataflow;
-typedef struct OptLiveRanges OptLiveRanges;
-typedef struct OptAllocator OptAllocator;
-typedef struct OptRewrite OptRewrite;
-
-typedef struct OptPassCtx {
- Compiler* c;
- Func* f;
- Arena* scratch;
- OptDataflow live;
- OptLiveRanges ranges;
- OptAllocator alloc;
-} OptPassCtx;
-```
-
-Move these out of persistent `Func` where possible:
-
-- `live_in`, `live_out`, `live_use`, `live_def`
-- `live_after`
-- `val_conflicts`
-- allocation scratch arrays
-
-Keep these or equivalents as persistent-enough function metadata:
-
-- value type/class tables
-- block/instruction ownership
-- CFG edges
-- frame slots
-- target hard-register availability
-
-Allocation result should be a separate mapping:
-
-```c
-typedef enum OptLocKind {
- OPT_LOC_NONE,
- OPT_LOC_HARD,
- OPT_LOC_STACK,
-} OptLocKind;
-
-typedef struct OptLoc {
- u8 kind;
- u8 cls;
- Reg hard_reg;
- FrameSlot spill_slot;
-} OptLoc;
-```
-
-Rewrite consumes `OptLoc[val]` and emits final target operands.
-
-## Implementation Direction
-
-This is the recommended structural sequence.
-
-### Phase 1: Data Structure Extraction
-
-Add pass-local bitset and data-flow structures. Keep current behavior running,
-but make new liveness code independent of `Block.live_*` fields.
-
-Deliverables:
-
-- `pass_live.c` owns block liveness storage.
-- bitset supports set-bit iteration and trimmed operations.
-- DDE uses block liveness only.
-- metrics count live words, active words, block live bytes, and set-bit scans.
-
-### Phase 2: Live Range Builder
-
-Build live ranges from block live sets. Do not allocate from them yet.
-
-Deliverables:
-
-- range list per value
-- compressed point numbering
-- metrics for point count, range count, average ranges per value, max range
- length, and whole-block spans
-- dump output for range tests
-
-### Phase 3: Range-Based O1 Allocator
-
-Replace dense-conflict allocation with MIR-style occupied-location assignment.
-No live-range splitting yet.
-
-Deliverables:
-
-- `used_locs[point]` bitmaps for hard regs and stack slots
-- candidate ordering by tied hard reg, frequency, live length, stable id
-- hard-register assignment when possible
-- stack assignment otherwise
-- no `val_conflicts` allocation in normal O1
-
-### Phase 4: Rewrite Without `live_after`
-
-Rewrite by walking blocks and instructions with rolling live state plus
-assignment maps.
-
-Deliverables:
-
-- hard-reg operand rewrite
-- stack reload/store insertion
-- scratch-reg handling by target class
-- call-clobber preservation
-- noop move deletion
-- no required per-instruction full live-after storage
-
-### Phase 5: O1 Cleanup and Scaling
-
-After O1 has the right shape, remove transition scaffolding and fix the
-remaining O1 scaling buckets before adding O2 allocator features:
-
-- rewrite without extra per-instruction temporary list churn: done
-- delete the old dense liveness/conflict path from the normal O1 route: done
-- narrow or compress `used_locs` so stack occupancy does not scale as
- `points * candidates`: done via split hard/stack occupancy and then
- per-location interval sets
-
-Only after that should O2 add move collection, move-only liveness, capped
-coalescing matrices, live-range splitting, and edge/block spill placement. This
-mirrors MIR's division: the matrix is an optional coalescing tool, not the
-allocator's main model.
-
-## Implementation Checklist
-
-Use this as the working tracker for the structural rewrite. Each phase should
-land with focused tests and metrics before moving to the next phase.
-
-Compatibility policy: compatibility with the old O1 optimizer internals is not
-a goal. The public compiler/API behavior must stay correct, but internal helper
-APIs, dumps, tests, metrics, and persistent IR fields may be broken, renamed, or
-deleted when they preserve the dense-conflict shape or add runtime cost. Prefer
-a clean fast implementation over shims for old pass boundaries.
-
-### Phase 0: Baseline and Guardrails
-
-- [x] Preserve current O0 behavior as the reference path.
-- [x] Keep existing O1 tests green before structural changes begin.
-- [x] Add or identify stress inputs for:
- - [x] one large straight-line function
- - [x] many small functions
- - [x] branch liveness
- - [x] call-clobber preservation
- - [x] spills
- - [x] tied hard registers from inline asm
-- [x] Keep `cfree run --time -O1` metrics available throughout the rewrite.
-- [x] Add metrics names for the new shape before deleting old counters:
- - [x] `opt.live.blocks`
- - [x] `opt.live.active_words`
- - [x] `opt.live.bitset_words_touched`
- - [x] `opt.live.dataflow_iterations`
- - [x] `opt.live.dataflow_block_visits`
- - [x] `opt.ranges`
- - [x] `opt.range_points`
- - [x] `opt.range.point_visits`
- - [x] `opt.range.preg_scans`
- - [x] `opt.range.live_words_touched`
- - [x] `opt.alloc.used_loc_words`
- - [x] `opt.alloc.hard_loc_words`
- - [x] `opt.alloc.stack_loc_words`
- - [x] `opt.alloc.stack_slots`
- - [x] `opt.alloc.hard_point_visits`
- - [x] `opt.alloc.stack_point_visits`
- - [x] `opt.alloc.hard_word_ors`
- - [x] `opt.alloc.stack_word_ors`
- - [x] `opt.alloc.hard_mark_points`
- - [x] `opt.alloc.stack_mark_points`
- - [x] `opt.alloc.spills`
- - [x] `opt.dde.live_words_touched`
- - [x] `opt.rewrite.reloads`
- - [x] `opt.rewrite.stores`
- - [x] `opt.rewrite.inserted_insts`
- - [x] `opt.rewrite.live_words_touched`
-
-Implemented by `test/opt/phase0_guardrails.sh`, wired into `make test-opt`.
-The script runs generated O0/O1 guardrail programs for the stress shapes above
-and checks that `cfree run --time -O1` emits the reserved metric names. Inline
-asm tied-register stress is identified by the existing parser case
-`test/parse/cases/asm_01_grammar.c`; allocator-specific inline-asm execution
-coverage belongs in a later pass once the range allocator owns rewrite.
-
-### Phase 1: Pass-Local Liveness
-
-- [x] Add an optimizer bitset abstraction with:
- - [x] set/clear/test
- - [x] copy
- - [x] union
- - [x] union-and-not
- - [x] change reporting
- - [x] set-bit iteration
- - [x] active-word tracking or trailing-zero trimming
-- [x] Add pass-local block liveness storage.
-- [x] Move block `use`/`def` calculation into the new liveness module.
-- [x] Compute `live_in`/`live_out` without allocating conflicts.
-- [x] Make pre-DDE use only block liveness.
-- [x] Keep old full `opt_live_info` available for regalloc during transition.
- This transition debt was retired in phase 6.
-- [x] Add dump output for block liveness.
-- [x] Add tests for branch, loop, and call liveness.
-
-Exit criteria:
-
-- [x] Existing public O1 behavior is unchanged.
-- [x] Pre-DDE no longer builds `val_conflicts`.
-- [x] Metrics show separate block-liveness timing and no pre-DDE conflict bytes.
-
-### Phase 2: Live Range Model
-
-- [x] Define `OptLiveRange` and `OptLiveRangeSet`.
-- [x] Build ranges from block `live_out` with a backward block walk.
-- [x] Track per-value:
- - [x] first range
- - [x] live length
- - [x] frequency/spill cost
- - [x] live-across-call information
-- [x] Add compressed point numbering.
-- [x] Add whole-block span representation or an equivalent compact model.
-- [x] Add range dump output.
-- [x] Add range metrics:
- - [x] total ranges
- - [x] total points before compression
- - [x] total points after compression
- - [x] max ranges per value
- - [x] max live length
-
-Exit criteria:
-
-- [x] Range dumps match expected lifetimes on small hand-written cases.
-- [x] Range construction works without using `Block.live_after`.
-- [x] Current allocator still works while range data is built side by side.
-
-### Phase 3: Range-Based Assignment
-
-- [x] Define `OptLoc` assignment mapping for every `Val`.
-- [x] Build allocation candidate list from values with ranges.
-- [x] Sort candidates by:
- - [x] tied hard-register requirement
- - [x] frequency/spill cost
- - [x] shorter live length
- - [x] stable value id
-- [x] Add `used_locs[point]` bitmaps.
-- [x] Mark hard-register live ranges in `used_locs`.
-- [x] For each candidate, union occupied locations across its ranges.
-- [x] Assign a legal hard register when available.
-- [x] Assign/reuse stack slots when no hard register is available.
-- [x] Preserve tied hard-register validation.
-- [x] Preserve forbidden hard-register constraints from asm/calls.
-- [x] Keep live-range splitting disabled for O1.
-
-Exit criteria:
-
-- [x] O1 regalloc no longer depends on `val_conflicts`.
-- [x] `opt.conflict_bytes` is zero or absent on the normal O1 allocation path.
-- [x] Branch, loop, call-clobber, spill, and tied-register tests pass.
-
-### Phase 4: Rewrite From Assignment Map
-
-- [x] Rewrite hard-assigned values to hard registers.
-- [x] Insert reloads for stack-assigned uses.
-- [x] Insert stores for stack-assigned defs.
-- [x] Reserve target scratch registers per class.
-- [x] Handle operands nested in:
- - [x] normal operands
- - [x] indirect operands
- - [x] call descriptors
- - [x] return descriptors
- - [x] intrinsic descriptors
- - [x] inline asm descriptors
-- [x] Preserve call-clobber saves/restores with a rolling live set.
-- [x] Delete noop moves after rewrite.
-- [x] Avoid persistent per-instruction full `live_after` storage.
-- [x] Add rewrite dump output.
-
-Exit criteria:
-
-- [x] O1 emits correct code without `Block.live_after`.
-- [x] Reload/store metrics are stable and understandable.
-- [x] Public O1 behavior and smoke JIT tests pass.
-
-### Phase 5: Reduce Phase-4 Rewrite Overhead
-
-Phase-4 timings show the no-`live_after` rewrite improves the large
-straight-line case but regresses many-small-function/function-table inputs. Keep
-the rolling-live design, but remove the temporary-list and reverse-copy cost
-introduced by the first implementation.
-
-- [x] Rewrite blocks with a single block-local output buffer or pre-sized edit
- buffer.
-- [x] Avoid per-instruction `RewriteList` allocation/copy churn.
-- [x] Preserve rolling-live call-clobber behavior without persistent
- `Block.live_after`.
-- [x] Keep scratch-register selection deterministic after the rewrite cleanup.
-- [x] Update tests that asserted old internal storage or exact intermediate
- dump details.
-- [ ] Re-run the same-input phase-3 vs phase-5 timing comparison from
- `doc/PERF.md`.
-
-Exit criteria:
-
-- [ ] Function-table/main-many-small `opt.regalloc` returns to phase-3 parity
- or better.
-- [ ] Large straight-line phase-4 gains are retained.
-- [x] No persistent per-instruction full live-after sets are rebuilt in normal
- O1.
-
-Implemented by rewriting each block into one backward-filled block output
-buffer. Per-instruction before/after insertion lists are reused per block, and
-the final reverse-copy pass is gone.
-
-### Phase 6: Delete Old Dense Path
-
-- [x] Remove normal-path allocation of `Func.val_conflicts`.
-- [x] Delete old all-values conflict construction from O1, including
- compatibility-only helpers and tests.
-- [x] Remove or demote `opt_live_words`, `opt_conflict_words`,
- `Block.live_in`, `Block.live_out`, `Block.live_use`, `Block.live_def`,
- and `Block.live_after` from persistent IR state where no public behavior
- requires them.
-- [x] Replace `opt_live_info` callers with pass-local liveness/ranges, then
- delete or narrow `opt_live_info`.
-- [x] Make tests assert public behavior and new pass dumps/metrics, not dense
- conflict matrix details.
-- [x] Update docs and metrics names to reflect range-based RA.
-
-Exit criteria:
-
-- [x] Normal O1 has no reachable dense pseudo-conflict construction.
-- [x] The source tree no longer contains tests whose purpose is preserving the
- old dense interference model.
-- [x] `opt.conflict_bytes` is absent or always zero on normal O1 paths.
-
-Implemented by deleting `opt_live_info`, `Func.val_conflicts`,
-`Func.opt_conflict_words`, and persistent block liveness/live-after fields.
-Post-rewrite combine/DCE now build pass-local hard-register liveness when they
-need block live-out information.
-
-### Phase 7: Narrow `used_locs`
-
-Phase-3 and phase-4 timings show the remaining large-function bend tracks
-`opt.alloc.used_loc_words`. The first range allocator sizes occupied-location
-bitmaps as hard-register bits plus one possible stack bit per candidate; that
-preserves correctness but reintroduces superlinear memory/time for one large
-function.
-
-- [x] Split hard-register occupancy from stack-slot occupancy.
-- [x] Make stack occupancy demand-sized by allocated stack slots, not candidate
- count.
-- [x] Consider sparse or interval-indexed stack occupancy for long straight-line
- functions.
-- [x] Keep hard-register occupancy compact and class-aware.
-- [x] Add metrics separating hard occupancy words, stack occupancy words, and
- allocated stack slot count.
-- [ ] Re-run straight-line and table-main ladders after each shape change.
-
-Exit criteria:
-
-- [ ] Large straight-line benchmark scales by range/point count, not
- `points * candidates`, `nvals * live_words`, or dense conflict bytes.
-- [ ] `doc/PERF.md` can be updated with before/after O1 scaling numbers.
-- [ ] Table-main remains at phase-5 parity or better.
-
-Implemented with separate `hard_used_locs` and `stack_used_locs` allocator
-tables. Hard occupancy remains `OPT_REG_CLASSES * 32` bits per point; stack
-occupancy grows only when a new spill slot crosses a word boundary. The sparse
-interval form is deferred until timing data shows it is needed.
-
-### Phase 8: O1 MIR-Shape Cleanup
-
-The range allocator has the right high-level shape, but current timings show
-there is still O1 work to do before adding O2 coalescing or splitting. The next
-areas of focus are the remaining places where cfree is less MIR-shaped than it
-looks at the pass boundary: fixed-width live traffic, point-by-point occupancy
-loops, and rewrite liveness work.
-
-Implemented in the current cleanup pass:
-
-- [x] Emit event counters for live, range, allocation, DDE, and rewrite work.
-- [x] Remove the duplicate pre-DDE live-range build from the normal O1 path.
-- [x] Replace DDE per-instruction full-width use/def bitsets with
- operand-reference events.
-- [x] Replace rewrite per-instruction full-width use/def bitsets with
- operand-reference events.
-- [x] Track open/touched range values with generation marks during live-range
- construction.
-- [x] Avoid assigning caller-saved registers to non-tied values that are known
- to be live across calls.
-- [x] Update `doc/PERF.md` with current O1 scaling data and counter ratios.
-
-Phase 8 cleanup status:
-
-- [x] Make `OptBitset` operations truly MIR-like: track active length, trim
- trailing zero words, and bound copy/ior/and-not work by active words
- rather than the function's maximum value id.
-- [x] Reduce `opt.live_ranges.regalloc` on spill-heavy wide functions.
- Current `opt.range.preg_scans` is linear, so investigate live-word
- touches, range insertion/compression, final per-preg summaries, and
- live-across-call bookkeeping in `src/opt/pass_live.c`.
-- [x] Reduce `opt.rewrite.live_words_touched`. Rewrite no longer stores
- `Block.live_after`, but its rolling live set still grows faster than
- input on spill-heavy wide functions.
-- [ ] Replace repeated point-index occupancy OR/mark loops with per-location
- interval or event state once live-range/rewrite traffic is under control.
- `opt.alloc.*point_visits` is linear now, but the absolute count is high.
-- [ ] Decide whether the second O1 block-liveness pass can be avoided after
- DDE, either by reusing/revalidating liveness or by having DDE produce
- enough local update information.
-- [x] Keep split hard-register and stack-slot occupancy. It fixed the old
- `points * candidates` stack growth and should remain the O1 baseline.
-
-Exit criteria:
-
-- [x] Straight-line and table-shaped stress inputs remain near-linear at the
- 512-to-1024 step.
-- [x] Spill-pressure `opt.live_ranges.regalloc` and `opt.regalloc` stop
- growing near `3x` on each input doubling.
-- [x] `opt.rewrite.live_words_touched` no longer grows superlinearly on the
- spill ladder.
-- [x] O1 remains free of dense pseudo-conflict construction on the normal path.
-
-Implemented by active-length `OptBitset` copies, a rolling live-value list for
-live-across-call range bookkeeping, and rewrite call-save filtering that only
-checks hard-assigned caller-saved values known to be live across a call. The
-remaining point-index occupancy loops are linear and kept as future O2-quality
-work rather than an O1 blocker.
-
-### Phase 9: O2 Coalescing
-
-- [ ] Collect move candidates after machinize.
-- [ ] Add move-only liveness using `scan_vars`-style restricted variables.
-- [ ] Build a capped conflict matrix only for move-related values.
-- [ ] Add metrics for:
- - [ ] move candidates
- - [ ] coalescing scan vars
- - [ ] coalescing matrix bytes
- - [ ] coalesced moves
-- [ ] Coalesce non-conflicting move groups.
-- [ ] Keep the matrix cap explicit and configurable at compile time.
-
-Exit criteria:
-
-- [ ] Coalescing is optional and does not affect O1 compile-time shape.
-- [ ] Matrix memory is bounded and reported.
-
-### Phase 10: O2 Splitting and Spill Placement
-
-- [ ] Add split-capable occupancy state similar to MIR's `busy_used_locs`.
-- [ ] Identify profitable live-range gaps.
-- [ ] Place spills/restores on block boundaries.
-- [ ] Split critical edges when required for edge placement.
-- [ ] Add spill-placement dumps.
-- [ ] Add metrics for split ranges and edge/block spill placements.
-
-Exit criteria:
-
-- [ ] O2 can reduce spills without changing O1 assignment behavior.
-- [ ] Edge/block placement tests cover diamonds, loops, and critical edges.
-
-### Phase 11: Cleanup and Documentation
-
-- [ ] Update `doc/OPT.md` with the final implemented module names and pass
- order.
-- [x] Update `doc/PERF.md` with new O1 scaling data.
-- [ ] Add a short allocator invariant section near the implementation.
-- [ ] Ensure pass dumps are stable enough for targeted regression tests.
-- [ ] Remove obsolete comments that describe the dense conflict allocator as
- the normal O1 path.
-
-## Non-Goals For The First Rewrite
-
-Avoid spending early effort on:
-
-- micro-optimizing the existing full conflict matrix
-- adding sparse rows to `val_conflicts` as a long-term solution
-- tuning candidate priority before range-based allocation exists
-- preserving optimizer-internal compatibility with the dense allocator era
-- link/JIT performance
-- parser/CG data-structure changes
-
-Those may still matter later, but they do not fix the current structural
-scaling issue.
-
-## Design Risks
-
-### Instruction Storage
-
-cfree's block instruction arrays make insertion more awkward than MIR's linked
-lists. We can handle this with rewrite output buffers initially. If combine,
-spill insertion, and edge splitting become too cumbersome, revisit instruction
-storage after the range allocator lands.
-
-### Arena Lifetime
-
-MIR explicitly destroys many pass structures. cfree's arenas make allocation
-cheap but can hide large transient memory. The new pass contexts should use a
-scratch arena or explicit freeable heap where transient memory is substantial.
-
-### Debuggability
-
-The rewrite should add stable dumps early:
-
-- block live sets
-- live ranges
-- allocation map
-- inserted reloads/stores
-- final rewritten IR
-
-Without dumps, range allocation bugs will be hard to diagnose.
-
-### Target Coupling
-
-MIR's rewrite relies heavily on target hooks. cfree should keep target
-knowledge behind explicit helpers:
-
-- hard-register class availability
-- caller-saved/callee-saved masks
-- scratch registers
-- memory-operand legality
-- stack-slot addressability
-
-## Bottom Line
-
-MIR's RA performance shape comes from avoiding a full pseudo interference graph
-in the normal allocation path. cfree should mirror that, even when doing so
-breaks old optimizer-internal APIs, fields, dumps, or tests. The next serious
-O1 work should keep the pass-local liveness, compressed live ranges,
-occupied-location assignment, and assignment-map rewrite shape, then remove the
-transition scaffolding and remaining scaling buckets.
-
-Once that shape exists, incremental tuning will have a much better foundation.
diff --git a/doc/OPT.md b/doc/OPT.md
@@ -1,1371 +1,336 @@
-# OPT - MIR-informed implementation plan
+# OPT - Optimizer Design
-Scope: deliver cfree's optimized backend behind the `CGTarget` wrapper
-contract described in `doc/DESIGN.md` Section 9, using MIR JIT's optimizer as
-the model for phase order, level gating, and performance targets.
+This document describes cfree's optimized backend pipeline. It is intended to
+be the stable design reference for `-O1` and `-O2`: the IR shape, pass order,
+major analyses, and ownership boundaries. Performance measurements and current
+benchmark gaps live in `doc/OPT_PERF.md`.
-This document is based on a source investigation of `~/tmp/mir/`, mainly
-`mir-gen.c`, `mir.c`, `mir.h`, `mir-gen.h`, `MIR.md`, `README.md`, and the
-`c2mir` driver. The important takeaway is that MIR keeps the optimizer short:
-`-O1` is a fast lowering/code-selection path, while the expensive
-SSA/value/memory/loop work is reserved for `-O2`.
+The optimizer is implemented as a `CGTarget` wrapper. The C frontend and other
+frontends still emit through the public CG interface; the wrapper records that
+stream as cfree IR, runs the requested optimization schedule, then replays the
+lowered IR into the wrapped target.
----
+## Goals
-## 1. Target Shape
+- Keep `-O0` as the direct backend path.
+- Keep `-O1` as a fast optimized tier: shared lowering, liveness, allocation,
+ post-RA cleanup, and emission, but no SSA value optimization or inlining.
+- Make `-O2` the full quality tier: interprocedural retention for inlining,
+ SSA value/memory passes, loop cleanup, pressure relief, coalescing, live-range
+ splitting, and the same final lowering path as `-O1`.
+- Preserve module boundaries. Frontends own language semantics, ABI lowering
+ owns call/return shape, targets own encoding legality, and optimizer passes
+ operate on recorded CG-level IR.
-cfree exposes three levels:
+## Level Shape
-- `-O0`: direct backend path. CG drives the target `CGTarget` immediately.
-- `-O1`: MIR-style minimal optimizer. Record IR, lower it through the new
- backend path, run liveness, simplified register allocation, post-RA combine,
- and DCE. No SSA value passes and no inlining.
-- `-O2`: MIR-style full optimizer. Add SSA, GVN/constant propagation,
- redundant load elimination, copy propagation, dead store elimination,
- SSA DCE, LICM, pressure relief, address transformation, block cloning,
- coalescing, live-range splitting, and inlining plus cleanup.
+`-O0`:
-The core contract stays simple:
+```text
+frontend -> CG -> target
+```
+`-O1`:
+
+```text
+frontend -> CG -> opt_cgtarget records Func IR
+
+func_end:
+ build/lower CFG
+ machinize
+ loop tree
+ liveness
+ dead-def elimination
+ register allocation without live-range splitting
+ post-RA combine
+ DCE
+ jump cleanup
+ emit to wrapped target
```
-parse -> cg -> opt_cgtarget { record Func IR }
-
--O1 func_end/finalize:
- build_cfg
- machinize
- build_loop_tree
- live_info
- regalloc(simplified)
- combine
- dce
- opt_emit -> wrapped CGTarget
-
--O2 func_end/finalize:
- build_cfg
- build_reg_ssa
- block_cloning
- build_ssa
- gvn
- copy_prop
- dse
- ssa_dce
- build_loop_tree + licm
- pressure_relief
- make_conventional_ssa
- ssa_combine
- undo_ssa
- jump_opt
- machinize
- build_loop_tree
- coalesce
- live_info
- regalloc(with live-range splitting)
- combine
- dce
- opt_emit -> wrapped CGTarget
-
--O2 finalize before final lowering:
- opt_inline
- opt_cleanup(dirty callers)
+
+`-O2`:
+
+```text
+frontend -> CG -> opt_cgtarget records all Func IR
+
+finalize:
+ inline retained functions
+ for each retained function:
+ SSA cleanup and value/memory optimization
+ lower through the shared backend pipeline
+ register allocation with coalescing and live-range splitting
+ emit to wrapped target
```
-The level dial selects optimization work. The lowering backend should be
-shared by `-O1` and `-O2`; `-O1` must remain the bisection floor for bugs in
-machinize, liveness, allocation, emission, and target-specific rewrite.
+Functions that contain constructs whose cloning or SSA semantics are not yet
+explicit, such as inline asm, label-address loads, or indirect branches, are
+conservatively routed through the non-SSA lowering path.
----
+## Core Data Structures
-## 2. MIR Findings
+### `OptImpl`
-### 2.1 Where the Optimizer Lives
+`OptImpl` is the `CGTarget` wrapper state in `src/opt/opt.c`.
-MIR's optimizer is concentrated in `mir-gen.c`; inlining and MIR-level
-simplification live in `mir.c`; `c2mir/c2mir-driver.c` maps command-line
-`-O` flags to `MIR_gen_set_optimize_level`.
+- `target` is the wrapped real backend target.
+- `level` selects the optimization schedule.
+- `f` and `cur` track the function and current block during recording.
+- `pending_loc` stores the latest source location and stamps new instructions.
+- `FuncSet funcs` retains all `-O2` functions until `finalize`, so inlining can
+ see both callees and callers before any function is emitted.
-Key source facts:
+### Function IR
-- `MIR_gen_init` defaults `optimize_level` to `2`.
-- `MIR_gen_set_optimize_level(ctx, level)` only stores the requested level in
- generator context.
-- `c2mir` treats bare `-O` as `-O2`.
-- Current `mir-gen.c` checks `optimize_level >= 2` for the full pass set.
- Some docs/comments still describe a separate `-O3`, but the current source
- does not gate additional passes on `>= 3`. For cfree, model `-O2` as MIR's
- current full source-level pipeline.
-
-### 2.2 MIR Level Semantics
-
-MIR documentation says:
-
-- Level `0`: only register allocator and machine code generator work.
-- Level `1`: adds code selection. It should produce more compact/faster code
- than level 0 at nearly the same generation speed.
-- Level `2`: adds common subexpression elimination and conditional constant
- propagation. MIR docs report level 1 generation speed around 50 percent
- faster than level 2.
-- Current source-level `>=2`: also includes block cloning, SSA variable
- renaming, address transformation, DSE, LICM, pressure relief, coalescing,
- and live-range splitting.
-
-For cfree:
-
-- `-O1` should optimize the backend mechanics without running SSA passes.
-- `-O2` should be the full quality mode.
-- Do not create a user-visible `-O3` until benchmarks identify a genuinely
- useful extra tier.
-
-### 2.3 MIR Performance Bar
-
-MIR's README reports `c2m -eg` at about `0.91` geomean performance relative
-to GCC `-O2` on its 15 small C benchmarks, with QBE around `0.65` on the same
-table. Treat those numbers as directional rather than a literal cfree promise:
-MIR has a JIT-oriented IR and mature per-target rewrite logic. Still, they set
-the bar for this project:
-
-- `-O1`: compile-time-focused. It should be much closer to `-O0` compile time
- than to `-O2`, while removing obvious backend artifacts through combine and
- DCE.
-- `-O2`: performance-focused. The first external target is at least QBE-class
- generated-code performance on the MIR `c-benchmarks` set, then iteratively
- close on MIR's published result.
-- Both levels must preserve the CG corpus exit-code contract before benchmark
- tuning starts.
-
----
-
-## 3. MIR Pipeline, Source Order
-
-This is the effective order in current `mir-gen.c` for full function
-generation.
-
-### 3.1 Common Setup
-
-MIR duplicates the function instruction list, allocates a function CFG object,
-and runs `build_func_cfg`.
-
-`build_func_cfg`:
-
-- Creates synthetic entry and exit blocks.
-- Splits basic blocks at labels, branches, returns, property branches, and
- fallthrough boundaries.
-- Converts register operands to internal variable operands.
-- Marks address-producing instructions and addressable regs.
-- Adds edges for direct branches, switches, fallthrough, and possible indirect
- jump targets.
-- Marks calls on blocks so memory availability and liveness can treat them as
- barriers.
-- Removes unreachable blocks when `optimize_level > 0`.
-
-cfree's `opt_build_cfg` should follow this shape: construct explicit
-entry/exit blocks, keep call/memory barrier metadata on blocks, and make
-unreachable cleanup part of `-O1+`, not a late optional cleanup.
-
-### 3.2 Full `-O2` Pre-Lowering Passes
-
-MIR full mode then runs:
-
-1. `clone_bbs` before SSA when there are no address instructions.
- MIR clones cold blocks after return back into hot predecessors when this
- exposes optimization in the hot path. It uses a bounded growth factor and
- skips back edges.
-
-2. `build_ssa`.
- MIR uses a Braun-style construction: it creates optimized maximal SSA,
- minimizes redundant phis, builds def-use edges, and optionally renames
- variables. cfree should copy the useful property, not the representation:
- every use should have cheap access to its defining inst, and every def
- should have cheap access to its uses.
-
-3. `addr_xform` when address instructions exist.
- MIR tries to eliminate `ADDR` pseudos that only feed memory operands. If an
- address pseudo must remain addressable, MIR converts uses to memory loads
- and stores, rebuilds SSA, then clones blocks. For cfree, this maps to
- folding GEP/address-of chains into target memory operands where the target
- can encode them, while keeping address-taken frame slots in memory.
-
-4. `gvn`.
- MIR computes memory availability, dominators, and value numbers in postorder.
- This pass includes constant propagation, branch folding, redundant expression
- elimination, redundant load elimination, and store/load reuse. It uses alias
- and nonalias tags on memory operands. Calls clear or conservatively restrict
- memory availability.
-
- cfree should land this in two explicit slices. The first slice is scalar GVN
- only: pure SSA value numbering and replacement for constants, arithmetic,
- compares, conversions, and other side-effect-free register expressions. It
- should not eliminate or forward `IR_LOAD`, interpret `IR_STORE`, or make
- memory-dependent branch decisions.
-
- Before the second slice touches memory, document and test the memory
- numbering shape. That design must define memory version tokens by alias
- class/root, how `MemAccess.alias`, address operands, address spaces, globals,
- TLS, params, and unknown memory participate in keys, and the invalidation
- rules for stores, calls, atomics, fences, volatile operations, and inline asm.
- Redundant-load elimination and store/load reuse start only after that shape
- is explicit and covered by pass-local dump tests.
-
-5. `copy_prop`.
- MIR propagates copies, folds multiply/divide by powers of two, and removes
- redundant extension chains. cfree should keep this as a separate pass after
- GVN because it relies on SSA edges and target legality.
-
-6. `dse`.
- MIR computes memory liveness by memory location, handles alloca escape
- through calls, and removes stores whose memory location is not live. This
- depends on GVN's memory numbering and alias classification.
-
-7. `ssa_dce`.
- MIR deletes SSA instructions with unused outputs, while preserving calls,
- allocas, varargs, returns, frame/stack effects, overflow sequences with live
- overflow branches, and other side-effecting ops.
-
-8. `build_loop_tree + licm`.
- MIR builds natural loops and preheaders. LICM skips branches, phis, calls,
- memory, varargs, allocas, and potentially trapping divisions/mods. It is
- pressure-sensitive: cheap single instructions are not moved unless their
- inputs are worth moving too; multiplies are considered expensive enough to
- move.
-
-9. `pressure_relief`.
- MIR moves single-use immediate or constant-like moves closer to their use
- when doing so reduces pressure and does not move work into a loop.
+The optimizer IR is defined in `src/opt/ir.h`.
-10. `make_conventional_ssa`.
- MIR lowers phis into edge/block moves, splitting critical edges when needed.
+- `Func` owns the arena-backed IR for one function: blocks, frame slots,
+ pseudo-register metadata, value metadata, local declarations, emit order, and
+ pass scratch fields.
+- `Block` owns a linear instruction list plus CFG predecessor/successor edges.
+ `emit_order` records layout order separately from block id order.
+- `Inst` is one recorded operation. Most `IROp` values map directly to a single
+ CGTarget method. Source locations are stored on instructions.
+- `PReg` is the mutable pseudo-register id used by recorded CG IR and by the
+ lowering pipeline before register SSA.
+- `Val` is the SSA value id used after `opt_build_reg_ssa` and mem2reg SSA.
+ `VAL_NONE` and id zero are sentinels.
+- Per-op auxiliary structs preserve structured data for calls, returns, phis,
+ switches, inline asm, atomics, aggregate memory operations, intrinsics, and
+ debug/scope operations.
-11. `ssa_combine`.
- MIR combines compare+branch pairs and folds address components into memory
- operands while SSA edges are still available.
-
-12. `undo_ssa`.
- MIR removes phi nodes and SSA edges.
-
-13. `jump_opt`.
- MIR removes unreachable blocks, empty blocks, branches to the next
- instruction, chains of labels, and branches to jumps.
-
-### 3.3 Lowering and Allocation
-
-After pre-lowering optimization, MIR runs the machine-dependent and allocation
-pipeline:
-
-1. `target_machinize`.
- This performs ABI lowering, call lowering, target two-operand forms, and
- other machine-dependent normalization.
-
-2. Build a loop tree for `-O1+`.
- MIR uses loop depth for frequency/pressure estimates in liveness,
- coalescing, and allocation.
-
-3. `collect_moves`, move-only liveness, conflict matrix, and `coalesce` at
- `-O2`.
- MIR aggressively coalesces move-related regs, prioritizing moves by loop
- frequency and checking conflicts in a bounded matrix.
-
-4. Full `live_info`.
- MIR computes `live_in`/`live_out`, register frequencies, live lengths, and
- block pressure. It understands phis before SSA destruction and call-used
- hard registers after machinize.
-
-5. `reg_alloc`.
- MIR builds live ranges and assigns pseudos by priority: tied hard regs first,
- then higher frequency, then shorter live length. At levels below 2 it uses a
- simplified allocator. At full level it can split live ranges and place
- spills/restores on edges or block boundaries.
-
-6. `rewrite`.
- MIR rewrites pseudo regs to hard regs or stack slots, inserts reloads and
- stores, handles call-clobbered saves, and deletes noop moves.
-
-7. `combine`.
- This is target-aware code selection. It substitutes safe single-use moves,
- collapses extension chains, commutes operands to expose legal combinations,
- removes internal labels, and validates each rewrite with target legality.
+The representation deliberately keeps mode explicit. Before pseudo-register
+SSA, `OPK_REG` operands carry `PReg` ids. After pseudo-register SSA, they carry
+`Val` ids, and `Func.opt_reg_ssa` tells shared helpers which namespace is
+active.
-8. `dead_code_elimination`.
- MIR performs post-RA DCE using live-out sets and preserves side effects.
+### Memory Shape
-9. Prolog/epilog and machine instruction generation.
+Memory operations carry `MemAccess`, including type, size, alignment, volatile
+or atomic properties, and alias root. Important alias roots include locals,
+globals, TLS, parameters, and unknown memory. Memory-aware passes must preserve
+observable memory: volatile, atomic, aggregate operations with side effects,
+calls that may clobber, fences, and inline asm.
-cfree should keep this exact split: combine after register allocation is not a
-replacement for SSA `copy_prop`; it is a target legality/code-selection pass.
+### Analysis State
-### 3.4 Inlining
+`OptAnalysis` holds per-pass order and dominance data:
-MIR inlining is in `mir.c`, before generator optimization. It processes direct
-call-like instructions after MIR simplification, not inside `mir-gen.c`.
+- postorder/reverse-postorder
+- reachability
+- immediate dominators and dominator children
+- dominance frontiers
-Important MIR inliner behavior:
+`Func.opt_valid_analyses` tracks coarse invalidation for CFG, def-use,
+dominators, and loop info. Passes that mutate control flow, operands, or
+instructions either rebuild the relevant analysis or invalidate it.
-- It can inline direct normal calls and explicit inline calls when the callee
- is available.
-- It skips unresolved externals, self recursion, label-reference functions,
- varargs, `jret`, and over-budget callees.
-- Default size budgets are small: normal calls have a much smaller cap than
- explicit inline calls.
-- It limits caller growth relative to original caller size.
-- It renames callee registers, duplicates labels, materializes argument moves,
- rewrites returns, and handles simple top-level constant-size `alloca`.
+The shared def-use representation records every use of every `Val`, including
+ordinary operands, indirect bases, phi inputs, call arguments/results,
+aggregate operands, inline asm operands, and intrinsic operands.
-cfree should run inlining on retained `Func` IR at `finalize` for `-O2`, then
-run cleanup on dirty callers. SCCs can be skipped for v1. The inliner must
-refuse setjmp/longjmp-sensitive and vararg cases until their semantics are
-explicitly tested.
+## Shared Lowering Pipeline
----
+`opt_run_lowering_pipeline` is used by both `-O1` and `-O2`.
-## 4. cfree Level Schedules
+1. `opt_build_cfg`
+ Builds explicit CFG edges from branches, returns, switches, fallthroughs,
+ scopes, indirect branches, and synthetic blocks. It also removes unreachable
+ connected blocks after cleanup.
-### 4.1 `-O1` Minimal Schedule
+2. `opt_jump_cleanup`
+ Performs CFG cleanup before and after lowering. The early form repairs and
+ simplifies CFG shape; the late form also optimizes layout fallthroughs.
-`-O1` is not "half of SSA." It is MIR's cheap backend optimization tier.
+3. `opt_simplify_local`
+ Performs local algebraic/address simplifications that do not require SSA.
-Required `-O1` schedule:
+4. `opt_machinize`
+ Lowers generic IR toward target-legal machine IR. This includes target
+ operand constraints, hard-register constraints, clobber information, and
+ target-specific rewrite decisions needed before allocation.
-```
-build_cfg
-machinize
-build_loop_tree
-live_info
-regalloc(allow_live_range_split = false)
-combine
-dce
-opt_emit
-```
+5. `opt_build_loop_tree`
+ Builds loop metadata from dominators for use by later analysis and metrics.
-Allowed `-O1` details:
+6. `opt_live_blocks`
+ Computes block liveness over pseudo-registers or SSA values, depending on
+ the active mode.
-- Remove unreachable blocks during CFG construction.
-- Use loop depth only for frequency/pressure costing.
-- Run target-aware combine after register allocation.
-- Delete noop moves and dead post-RA definitions.
-- Use a priority-based allocator, but without coalescing and live-range
- splitting in the first production version.
+7. `opt_dead_def_elim_with_live`
+ Removes dead register definitions using live-out information while
+ preserving side-effecting instructions.
-Forbidden at `-O1`:
+8. `opt_regalloc`
+ Allocates pseudo-registers to hard registers or spill slots. `-O1` uses the
+ simple allocation mode. `-O2` enables coalescing and live-range splitting.
-- `build_ssa`
-- `gvn`
-- `copy_prop`
-- `dse`
-- `ssa_dce`
-- `licm`
-- `pressure_relief`
-- `coalesce`
-- `opt_inline`
+9. `opt_combine`
+ Runs post-allocation peephole combines over lowered IR.
-This keeps `-O1` useful and debuggable: if `-O1` fails, the bug is in
-recording, CFG, machinize, liveness, allocation, combine, DCE, or emission.
+10. `opt_dce`
+ Deletes now-dead non-side-effecting instructions after allocation and
+ combine.
-### 4.2 `-O2` Full Schedule
+11. `opt_emit`
+ Replays the final IR into the wrapped backend target.
-`-O2` uses MIR's full current source pipeline:
+The shared path is the bisection floor. Bugs in CFG construction, target
+machinization, liveness, allocation, rewrite, combine, or emission should
+usually reproduce at `-O1`.
-```
-build_cfg
-build_reg_ssa
-if no address-transform candidates:
- block_cloning
-build_ssa
-if address-transform candidates:
- addr_xform
- undo_ssa
- block_cloning
- build_ssa
-gvn
-copy_prop
-dse
-ssa_dce
-build_loop_tree
-licm
-pressure_relief
-make_conventional_ssa
-ssa_combine
-undo_ssa
-jump_opt
-machinize
-build_loop_tree
-collect_moves
-live_info(move vars only)
-coalesce
-live_info(all vars)
-regalloc(allow_live_range_split = true)
-combine
-dce
-opt_emit
-```
+## `-O1` Pipeline
-Then, once inlining exists:
+`-O1` runs only the shared lowering pipeline. It intentionally skips:
-```
-finalize:
- opt_inline(FuncSet, max_iters)
- for each dirty caller:
- opt_cleanup
- lower every dirty/not-yet-emitted Func through the full schedule
-```
+- interprocedural retention
+- inlining
+- SSA construction
+- GVN
+- copy propagation beyond local cleanup
+- DSE
+- LICM
+- pressure relief
+- live-range splitting
-`opt_cleanup` should re-run the passes that inlining exposes value for:
-`build_cfg`, `build_ssa`, `gvn`, `copy_prop`, `dse`, `ssa_dce`, `licm` when
-loops exist, `pressure_relief`, `make_conventional_ssa`, `ssa_combine`,
-`undo_ssa`, and `jump_opt`.
+`-O1` should compile much closer to `-O0` than to `-O2`, while removing obvious
+backend artifacts and exercising the optimized allocation/emission path.
-Current implementation subset:
+## `-O2` Pipeline
-```
-build_cfg
-jump_cleanup(CFG)
-build_cfg
-verify(o2-pre-ssa-cfg)
-build_reg_ssa
-verify(o2-reg-ssa)
-block_cloning
-verify(o2-block-clone-cfg)
-build_ssa
-verify(o2-ssa)
-ssa_dce
-copy_cleanup
-verify(o2-pre-addr-cleanup)
-addr_xform
-verify(o2-addr-xform-ssa)
-ssa_dce
-verify(o2-addr-xform-dce)
-copy_cleanup
-verify(o2-addr-xform-copy-cleanup)
-simplify
-verify(o2-simplify-ssa)
-ssa_dce
-copy_cleanup
-verify(o2-simplify-cleanup)
-gvn
-verify(o2-gvn-ssa)
-copy_prop
-verify(o2-copy-prop-ssa)
-ssa_dce
-verify(o2-copy-dce)
-dse
-verify(o2-dse-ssa)
-ssa_dce
-verify(o2-dse-dce)
-build_loop_tree
-licm
-verify(o2-licm-ssa)
-pressure_relief
-verify(o2-pressure-relief-ssa)
-make_conventional_ssa
-verify(o2-conventional-ssa)
-ssa_combine
-verify(o2-ssa-combine)
-undo_ssa
-copy_cleanup
-verify(o2-undo-copy-cleanup)
-opt_jump_opt
-verify(o2-jump-opt)
-shared O1 lowering pipeline (machinize ... jump_cleanup ... opt_emit)
-```
+`-O2` retains every function until target finalization. This enables whole-TU
+inlining before any function is emitted.
-Every CFG or value-use mutation above is followed by the relevant verifier
-checkpoint in `opt_run_o2_pipeline`.
-
-### 4.3 Identity Simplification
-
-Identity simplification is implemented in three homes, depending on how much
-context the rewrite needs.
-
-1. CG-time stack/value simplification, before IR recording.
- This is the earliest and best place for identities where the operation can
- simply avoid materializing a result. It benefits `-O0`, `-O1`, and `-O2`,
- and it prevents avoidable temporaries from ever entering opt IR. Keep this
- limited to rewrites that are local to the two values currently on the CG
- stack, do not require CFG/use-def information, and are valid for every
- target backend.
-
- Current CG folds immediate integer binops, collapses delayed integer
- arithmetic chains, collapses integer identities through
- `api_try_collapse_binop_identity`, and `api_cg_convert` elides same-type
- conversions. The identity set now includes safe unflagged integer cases such
- as `x * 1`, `1 * x`, `x / 1`, `x % 1 -> 0`, `x << 0`, `x >> 0`, zero
- annihilators for multiply/and, and all-ones annihilators for or.
-
- Do not put target-dependent legality here. Do not fold floating-point
- identities here: signed zero, NaNs, exceptions, contraction, and rounding
- flags make algebraic FP identities nontrivial. Also keep the existing
- `!flags` rule for integer folding unless a flag's exact semantics are
- explicitly modeled.
-
-2. A tiny non-SSA `opt_simplify_local` in the shared lowering pipeline.
- This runs after the initial `build_cfg`/`jump_cleanup`/`build_cfg` cleanup
- and before `machinize`, so it is available to `-O1` and to the final lowering
- leg of `-O2` after SSA has been undone:
-
- ```
- build_cfg
- jump_cleanup(CFG)
- build_cfg
- simplify_local
- machinize
- ...
- ```
-
- This pass stays local and cheap. It rewrites one instruction at a time:
- neutral/annihilator integer binops, `x - x -> 0`, `x ^ x -> 0`, `x & x` and
- `x | x` to copies, same-value integer compares, and exact no-op converts.
- It does not chase definitions, compute dominance, rewrite arbitrary uses, or
- become a substitute for SSA value passes. Its job is to reduce obvious work
- before liveness/RA without changing the meaning of O1 as the cheap backend
- tier.
-
- O1 should not grow a general value optimizer. If an O1 simplification needs
- def-use, dominance, copy propagation, memory reasoning, or loop information,
- it belongs in O2 instead. Post-RA `combine` remains the target-aware cleanup
- point for no-op moves, immediate folding into legal machine operands, and
- backend artifacts exposed only after register assignment.
-
-3. An SSA `opt_simplify` in the O2 pre-lowering value pipeline.
- This is the primary home for scalar and address identities that require
- use-def information. It runs after mem2reg/address cleanup has exposed SSA
- values and before GVN builds value-number keys:
-
- ```
- build_ssa
- ssa_dce
- copy_cleanup
- addr_xform
- ssa_dce
- copy_cleanup
- simplify
- ssa_dce
- copy_cleanup
- gvn
- ```
-
- This pass rewrites the defining instruction to `IR_COPY` or `IR_LOAD_IMM`
- and relies on `opt_ssa_dce`/`opt_copy_cleanup` to remove leftovers. It
- shares the local rules and adds SSA-only rules for double bitwise-not,
- add-zero address/value chains fed by SSA constants, and no-op bitcast chains
- that reduce to a copy. GVN still owns constant folding, leader selection,
- redundant expression/load elimination, and branch folding; `opt_simplify`
- canonicalizes cheap identities so GVN has fewer equivalent shapes to learn.
-
-The practical rule is: if CG can avoid emitting IR without looking beyond the
-top CG values, do it in CG. If the rewrite is a single recorded instruction and
-does not need SSA, put it in shared `simplify_local` so O1 benefits. If the
-rewrite needs def-use, dominance, memory/address canonicalization, or cleanup
-after inlining, put it in O2 `opt_simplify`.
-
-### 4.4 Transformations We Do Not Take
-
-`doc/DESIGN.md` is still binding: no UB-exploiting transforms. Do not assume
-signed overflow, shift-by-width, division by zero, or null dereference are
-unreachable. MIR is careful around potentially trapping division/modulo in
-LICM; cfree should be at least as conservative.
-
-MIR property instructions and lazy basic-block versioning are out of scope for
-the first optimized backend. They are a separate JIT specialization feature,
-not required for normal `cfree` object/executable codegen.
-
----
-
-## 5. Current cfree State
-
-The optimizer is no longer just a stub:
-
-- `src/opt/opt.c` implements the `CGTarget` wrapper.
-- `src/opt/ir.c` and `src/opt/ir.h` implement the recorded IR container.
-- `src/opt/pass_cfg.c` implements CFG construction.
-- `src/opt/pass_ssa.c` implements pseudo-register SSA construction, mem2reg
- SSA construction, conventional SSA edge-copy lowering, SSA undo, and stable
- SSA dumps for pass tests.
-- `src/opt/pass_coalesce.c` implements O2 move collection, move-related
- conflict analysis, and conservative coalescing.
-- `src/opt/pass_analysis.c` owns shared CFG order, dominators, dominance
- frontiers, dominator children, def-use rebuilding, and verifier guardrails.
-- `src/opt/pass_simplify.c` implements CG-adjacent local identity cleanup for
- the shared lowering path and SSA-aware identity canonicalization for O2 before
- GVN.
-- `cfree cc` forwards `-O0`/`-O1`/`-O2` into `CfreeCodeOptions`, so native
- frontends exercise the optimizer through the normal compile driver.
-- `test/toy/run.sh` supports `CFREE_TOY_OPT_LEVELS`; default is `0 1 2`, so
- `make test-toy` exercises direct, O1, and O2 paths by default.
-
-Current behavior:
-
-- Level `1` records IR, builds CFG, machinizes, builds loop frequencies,
- computes PReg liveness and live ranges, runs the current priority
- allocator/rewrite path over mutable PRegs, combines, runs post-RA DCE,
- cleans jumps/layout, and emits through the wrapped target. This path keeps
- whole-PReg allocation and does not run coalescing or splitting.
-- Level `2` has its own scheduler entry point. It now converts mutable
- pseudo-registers to SSA values, builds real mem2reg SSA, verifies phi/use-def
- invariants, runs the first small O2 transforms, lowers phis through
- conventional SSA, undoes SSA, and then routes through the shared lowering
- path with O2 allocation policy enabled.
-- `opt_regalloc(..., allow_live_range_split)` now uses the O2 policy bit to
- enable move coalescing and singleton live-range splitting. Coalescing only
- considers valid same-class, same-type `IR_COPY` pairs, weights moves by block
- frequency, builds a bounded conflict matrix for move-related PRegs, and
- rejects incompatible tied-register or forbidden-register constraints.
-- O2 allocation assigns coalesced groups as a unit. Every member receives the
- same hard register or spill slot, so coalesced copies rewrite to hard-reg
- self copies and are removed by post-RA combine/DCE.
-- O2 live-range splitting is intentionally conservative in this slice: only
- singleton allocation groups with multiple live ranges can split. Split PRegs
- get a canonical spill home plus per-range segment records. Hard segments
- insert boundary reload/store traffic through the existing rewrite buffers;
- coalesced groups, call-crossing values, and O1 allocations remain whole.
-- `opt_build_ssa` is no longer just a shape check: promoted loads/stores are
- rewritten to SSA values, phis have real `Val` defs and predecessor inputs,
- and def-use chains are rebuilt for downstream SSA passes.
-- Instruction ids are stable across block-array rewrites, so analysis and
- dumps no longer depend on raw `Inst*` addresses or positional references.
-- Loop tree construction uses shared `OptAnalysis` dominators instead of a
- pass-local dominator implementation.
-- `Func.opt_valid_analyses` now tracks CFG, def-use, dominator, and loop-info
- validity. Mutating passes invalidate the affected bits, rebuild passes mark
- their products valid, and `opt_verify` checks cached def-use before rebuilding
- it so stale use-site assumptions are caught.
-- The first SSA-backed value passes are in the O2 path: `opt_ssa_dce` removes
- unused pure SSA defs/phis, and `opt_copy_cleanup` rewrites safe SSA copy users
- through def-use chains while avoiding multi-def conventional-SSA edge copies.
-- The next pre-GVN O2 slice is also in the O2 path: `opt_block_cloning` clones
- tiny non-loop join blocks under strict growth limits when cloned defs remain
- local and external uses are dominated; `opt_addr_xform` folds simple local
- `addr_of` pseudos into non-observable load/store memory operands. Volatile,
- atomic, global, TLS, non-zero indirect-offset, and otherwise non-local address
- cases remain materialized. Both passes rebuild/invalidate analysis state and
- are bracketed by `opt_ssa_dce`, `opt_copy_cleanup`, and verifier checkpoints.
-- The first post-address value passes are in the O2 path: `opt_simplify`
- canonicalizes cheap scalar/address identities before GVN; scalar GVN rewrites
- dominated duplicate pure expressions and folds safe integer
- constants/branches; the memory-aware GVN slice rewrites exact redundant
- loads and store/load reuse through the documented alias/version model; and
- `opt_copy_prop` propagates SSA copy chains plus collapses redundant integer
- extension chains.
-- The first memory and loop passes are in the O2 path: `opt_dse` removes exact
- overwritten or unread non-escaping local stores, `opt_licm` hoists safe loop
- invariants while leaving trapping and memory operations in place, and
- `opt_pressure_relief` sinks single-use immediate/constant defs with a
- per-block linear rewrite plan.
-
-Temporary defensive/pessimizing checks to lift:
-
-- [ ] O2 scalar locals/params that would be register-backed at O1 are still
- initially recorded as frame slots for mem2reg. Lift this when source-local
- storage identity can be represented directly through pseudo-register SSA.
-- [ ] O2 functions containing `IR_ASM_BLOCK` route through the non-SSA lowering
- path. Lift this when inline-asm tied operands, inout operands,
- early-clobbers, fixed-register constraints, and clobber liveness are modeled
- across pseudo-register SSA construction and destruction.
-- [ ] O2 functions containing `IR_LOAD_LABEL_ADDR` or `IR_INDIRECT_BRANCH`
- route through the non-SSA lowering path. Lift this when label-address and
- computed-goto targets have stable block/MCLabel semantics across O2 CFG
- cleanup, block cloning, edge splitting, SSA destruction, and jump cleanup.
-- [ ] O2 switch table promotion is disabled for gapped dense ranges
- (`span != ncases`). Lift this when jump-table default slots remain correctly
- bound after O2 layout and jump cleanup, including interior default entries.
-- [ ] Jump forwarding refuses to forward through single-jump blocks that already
- own an `MCLabel`. Lift or narrow this once data relocations and label tables
- are retargeted when CFG cleanup forwards or removes labeled trampoline
- blocks.
-
-Full O2 jump optimization is now in the O2 path after SSA destruction. It
-extends the shared O1 jump cleanup with switch/default target forwarding, empty
-fallthrough-chain forwarding, repeated branch-to-branch forwarding, and
-same-target conditional branch collapse while preserving the final layout
-cleanup for branch-to-physical-next deletion.
-
-The next implementation work should move to Phase E inlining and cleanup.
-
----
-
-## 6. Implementation Phases
-
-Each phase should end with targeted green tests. Prefer red-green tests for
-the exact pass being introduced, then expand through the CG corpus.
-
-### Phase A - Production `-O1` Lowering
-
-Status: implemented as the public level-1 path. Keep extending tests and
-target-specific legality checks here, but do not make O1 depend on SSA/value
-passes. The final O1 namespace is mutable `PReg`: `ir_ensure_preg` grows only
-the pseudo-register metadata table, and O1 liveness, allocation, rewrite, and
-metrics are PReg-indexed.
-
-Goal: make `-O1` stop replaying and emit through the optimized backend path
-without SSA value passes.
-
-Implement:
-
-- `opt_machinize`
-- `opt_live_info`
-- `opt_regalloc(..., false)`
-- `opt_combine`
-- `opt_dce`
-- `opt_emit`
-
-Keep the allocator simple but not naive:
-
-- Build live ranges from block liveness.
-- Sort allocation candidates by tied hard-reg requirement, frequency, live
- length, then stable id.
-- Assign hard regs when possible, stack slots otherwise.
-- Rewrite pseudos into hard regs/stack slots with reserved scratch regs for
- reload/store addressing.
-- Delete noop moves after rewrite.
-
-Exit criteria:
-
-- [x] `CFREE_TOY_OPT_LEVELS="0 1" make test-toy` passes for targeted cases.
-- [x] Focused allocation cases cover call-clobber saves, stack spills, tied
- hard regs from inline asm, and values live through branches.
-
-### Phase B - O2 Analysis Substrate
-
-Status: implemented for the first O2 value-pass slice. `OptPassCtx`, the
-shared operand/reference walker,
-`OptAnalysis` CFG order/dominator/frontier data, stable instruction ids,
-production mem2reg SSA, def-use rebuilding, SSA dumps, and stronger
-`opt_verify` guardrails are in place. Loop tree construction now uses shared
-dominators. Passes now have explicit CFG, def-use, dominator, and loop-info
-validity bits, and the first SSA value passes did not require additional shared
-analysis fields.
-
-Goal: create the durable data structures and invalidation model that all O2
-passes will share before implementing transformations.
-
-Checklist:
-
-- [x] A small pass context for per-pass state, stage names, metrics, and
- temporary arenas.
-- [x] A shared operand/reference walker for every pass that needs uses/defs,
- including `CGABIValue`, calls, returns, inline asm, intrinsics, and indirect
- bases.
-- [x] Shared analysis results for RPO/PO, dominators, dominance frontiers, and
- dominator children.
-- [x] Loop tree construction moved onto shared dominator analysis.
-- [x] Stable instruction ids so long-lived analysis never depends on raw
- `Inst*` addresses across block array rewrites.
-- [x] Real mem2reg phi `Val` definitions and rewritten promoted uses.
-- [x] Def-use chain rebuilding, including phi inputs and indirect bases.
-- [x] SSA dump coverage for phis and rewritten uses.
-- [x] `opt_verify(Func*, stage)` gates for CFG consistency, instruction id
- consistency, phi arity, def-use consistency, and SSA value class checks.
-- [x] Formal analysis invalidation flags for passes that preserve, rebuild, or
- invalidate CFG/value/use-def state.
-- [x] Additional shared analysis fields demanded by first value passes.
-
-Exit criteria:
-
-- [x] `-O1` stays green.
-- [x] `-O2` can run the substrate and verifier, then lower through the current
- O1 backend path with no observable behavior change.
-- [x] SSA dump tests prove phis and rewritten uses on branch/loop locals.
-- [x] `make test-opt`
-- [x] Focused `CFREE_TOY_OPT_LEVELS="0 1 2" make test-toy` branch/loop subset.
-- [x] Full `CFREE_TOY_OPT_LEVELS="0 1 2" make test-toy`
-
-### Phase C - Full Allocation Infrastructure
-
-Status: implemented as the first O2 allocation-quality slice. O1 keeps the
-existing whole-PReg priority allocator. O2 enables conservative move
-coalescing, coalesced-group allocation, and singleton live-range splitting
-without changing public `libcfree` APIs or adding CFG edge splitting.
-
-Goal: bring `-O2` allocation quality up to MIR's model before value passes
-start changing code aggressively.
-
-Implement:
-
-- [x] Move collection.
-- [x] Bounded conflict matrix for move-related regs using existing live-range
- overlap data.
-- [x] Conservative coalescing for same-class, same-type copies with compatible
- fixed-register and forbidden-register constraints.
-- [x] Coalesced-group allocation before singleton allocation.
-- [x] O2-only split allocation representation with per-PReg segment lists and
- canonical spill homes.
-- [x] Conservative singleton live-range splitting when whole hard allocation
- fails.
-- [x] Boundary reload/store insertion for hard split segments through rewrite.
-- [x] Critical-edge CFG mutation for spill placement. Current placement stays
- at unambiguous block/range boundaries.
-
-Exit criteria:
-
-- [x] `-O1` remains green.
-- [x] `-O2` can use the same lowering path with coalescing/splitting enabled,
- while preserving the current small SSA cleanup passes.
-- [x] Focused `make test-opt` coverage for coalescing, conflict refusal,
- incompatible constraints, O1 preservation, and singleton splitting.
-- [x] Toy pressure fixture:
- `CFREE_TEST_FILTER=129_o2_pressure_coalesce CFREE_TOY_OPT_LEVELS="0 1 2"
- make test-toy`
-- [x] Full `CFREE_TOY_OPT_LEVELS="0 1 2" make test-toy`
-
-### Phase D - SSA-Backed Value and Memory Passes
-
-Goal: enable the MIR full pre-lowering schedule for `-O2`, starting with small
-SSA-backed passes that prove the use-def substrate before heavier value and
-memory optimizations.
-
-Implement in order:
-
-1. [x] `opt_ssa_dce`
-2. [x] `opt_copy_cleanup`
-3. [x] `opt_block_cloning`
-4. [x] `opt_addr_xform`
-5. [x] scalar-only `opt_gvn`
-6. [x] `opt_copy_prop`
-7. [x] first memory-aware `opt_gvn` / redundant load elimination tests
-8. [x] CG identity expansion, shared `opt_simplify_local`, and O2
- `opt_simplify` as described in Section 4.3
-9. [x] extended memory tracker for joins/path state
-10. [x] `opt_dse`
-11. [x] `opt_licm`
-12. [x] `opt_pressure_relief`
-13. [x] `opt_make_conventional_ssa`
-14. [x] `opt_ssa_combine`
-15. [x] `opt_undo_ssa`
-16. [x] full O2 `opt_jump_opt` beyond the shared O1 jump cleanup
-
-Do not batch these into one landing. Each pass needs a pass-local corpus case
-that fails red without the pass or its bug fix.
-
-The path-aware memory GVN slice is now landed. Keep later memory rewrites tied
-to the documented alias and memory numbering model: do not expand `IR_LOAD`
-rewrites or reason across `IR_STORE`, calls, atomics, fences, volatile
-operations, TLS, globals, or unknown memory without an explicit invalidation
-rule and pass-local tests.
-
-Completed pre-GVN slice exit criteria:
-
-- [x] `make test-opt`
-- [x] Focused toy O2 branch/join/address-memory subset:
- `CFREE_TEST_FILTER=128_o2_branch_join_addr_mem CFREE_TOY_OPT_LEVELS="0 1 2"
- make test-toy`
-- [x] Full `CFREE_TOY_OPT_LEVELS="0 1 2" make test-toy`
-- [x] Disassembly sanity check shows narrow churn: existing
- `10_pointer_addr_deref` removes separate address materialization in O2, while
- the new branch/join/address fixture has only minor register-choice churn.
-- [x] Pass-local dump tests prove the implemented SSA DCE, copy cleanup, block
- cloning, and address-folding rewrites fire.
-
-Memory numbering and alias invalidation model for memory GVN:
-
-- Memory value numbering is keyed by `(address-space, alias root, byte range,
- memory version, load type, size, alignment, flags)`. The address operand is
- still part of the key unless the alias root is a singleton local/global/string
- object with a proven constant offset. This keeps two unknown pointers in the
- same alias class from being treated as the same location.
-- A memory version token is maintained per alias root plus one conservative
- `unknown` token. `ALIAS_LOCAL`, `ALIAS_GLOBAL`, `ALIAS_PARAM`, `ALIAS_HEAP`,
- and `ALIAS_STRING` each use their root id as the fine-grained version key;
- `ALIAS_UNKNOWN`, missing alias data, unsupported address forms, and escaped
- allocas use the unknown token. Address space participates before alias-root
- comparison, so different address spaces never share a memory value number.
-- A store to a precise root increments that root's version. A store through
- unknown memory increments the unknown token and invalidates every non-invariant
- root that may alias unknown memory. A store to volatile memory is never
- forwarded from or to, and also acts as an observable memory barrier for other
- volatile operations.
-- Calls increment the unknown token and invalidate all roots that can escape:
- globals, params, heap, TLS/global address roots, and any local whose address
- has flowed to a call argument, return value, unknown/global store, aggregate
- memory op, inline asm, intrinsic, or unsupported address use. Non-escaping
- local roots keep their root version across ordinary calls. Later call attrs
- can narrow this further.
-- Atomics, fences, inline asm with memory effects, `IR_AGG_COPY`,
- `IR_AGG_SET`, bitfield stores, varargs memory operations, and runtime memory
- intrinsics are full memory barriers until each has a precise alias rule.
- Atomic loads are not redundant-load candidates unless a later model proves
- the exact ordering legality.
-- TLS addresses are treated like globals for alias roots but are call-clobbered
- unless the access is proven local-exec and the call cannot observe the thread
- object. The first slice should conservatively invalidate them at calls.
-- Redundant-load elimination may replace `IR_LOAD` only when the prior value's
- def dominates the use, the memory key including version is identical, neither
- access is volatile/atomic, the load type and byte range match, and no
- invalidating barrier lies between the two through the version model. Store/load
- reuse uses the same key and additionally requires the stored operand's value
- to dominate the load.
-- At block joins, memory availability is intersected across reachable
- predecessors. An entry survives only when every reachable predecessor has the
- same memory key available with the same `Val`; otherwise the joined load is
- preserved. This slice does not invent memory phis.
-
-Current memory GVN slice:
-
-- [x] Pass-local tests cover repeated exact local loads, store/load reuse for
- the same precise local, path-aware branch joins, loop-header backedge
- conservatism, unknown-store barriers, call preservation for non-escaped
- locals, escaped-local call clobbers, and volatile/atomic exclusions.
-- [x] The pass maintains per-root plus unknown versions, treats calls and
- conservative memory ops as barriers, and can see through simple `addr_of` plus
- zero-index/address-minus-zero materialization when both memory operations
- refer to the same precise root.
-- [x] AArch64 disassembly sanity check: a direct address-taken store followed by
- a load of the same address reuses the stored value, removing the post-store
- `ldur`.
-- [x] The tracker uses memory state snapshots at block boundaries. Join blocks
- reuse memory only when all reachable predecessors agree on the same available
- key/value pair; otherwise no memory phi is invented and the load stays.
-- [x] Root escape information is precise enough for calls to preserve
- non-escaping address-taken locals, while globals/params/heap/TLS, escaped
- locals, and unknown roots remain conservatively clobbered.
-
-Current DSE slice:
-
-- [x] `opt_dse` runs in the O2 SSA schedule after GVN, copy propagation, and
- SSA DCE, before conventional-SSA lowering.
-- [x] Backward memory liveness removes exact precise-root stores that are
- overwritten before any overlapping read, and removes unread stores to
- non-escaped locals at function exit.
-- [x] Calls, volatile/atomic accesses, aggregate memory ops, bitfield memory
- ops, varargs memory ops, fences, inline asm, and intrinsics remain
- conservative barriers. Escaped locals and externally observable roots stay
- live across calls/exits.
-- [x] Pass-local tests cover same-block and all-path overwritten local stores,
- unread non-escaped locals, escaped-local call preservation, and volatile
- preservation.
-
-Current LICM and pressure relief slices:
-
-- [x] `opt_licm` runs after DSE/DCE and loop-tree construction, hoisting safe
- loop-invariant scalar work to preheaders.
-- [x] LICM pass-local tests cover safe invariant hoisting and preservation of
- memory loads and potentially trapping division.
-- [x] `opt_pressure_relief` runs after LICM, planning all legal same-block
- single-use immediate/constant sinks from one def-use rebuild and applying
- affected block rewrites in linear passes.
-- [x] Pressure-relief pass-local tests cover sinking, loop-boundary refusal,
- multi-use preservation, memory barriers, and many same-block moves.
-
-Scalar/address identity simplification now runs before memory GVN. Red-green
-coverage includes integer neutral/annihilator identities, same-register
-integer identities, same-value integer compares, exact no-op conversions,
-double bitwise-not, add-zero address chains, and a toy end-to-end identity
-chain at `-O0`, `-O1`, and `-O2`. These rewrites stay typed and
-legality-aware: they do not erase operations with trapping, flagged overflow,
-FP rounding/NaN/sign-zero semantics, volatile/atomic memory effects, or target
-addressing constraints that have not been modeled.
-
-Remaining Phase D exit criteria:
-
-- [x] Scalar GVN pass-local dump tests prove redundant pure expressions are
- replaced and constant branches fold.
-- [x] Memory numbering and alias invalidation are documented before any memory
- GVN or redundant-load rewrite lands.
-- [x] `opt_copy_prop` pass-local tests prove SSA copy propagation and redundant
- extension-chain cleanup, with a focused toy O2 fixture.
-- [x] AArch64 O2 disassembly sanity checks show `130_o2_copy_prop_ext`
- collapsing the widening/copy chain to a single byte load/zero-extend path,
- and memory GVN removing redundant post-store stack loads in covered cases.
- `128_o2_branch_join_addr_mem` now covers both all-predecessors-agree and
- predecessor-disagreement branch stores.
-- [x] Identity simplification has focused pass-local tests, a toy end-to-end
- fixture, and an AArch64 disassembly sanity check showing identity arithmetic
- removed from the generated code.
-- [x] LICM and pressure relief have focused pass-local tests and are in the
- default O2 schedule.
-- [x] `opt_ssa_combine` has focused red-green tests before it enters the
- default O2 schedule.
-- [x] Full O2 jump optimization has focused red-green tests before it enters
- the default O2 schedule, covering switch target forwarding, empty
- fallthrough-chain forwarding, repeated branch forwarding, and same-target
- conditional collapse.
-- [x] Post-inline cleanup quality: after Phase E inlines a simple Toy wrapper,
- the existing Phase D value passes should collapse the exposed constant/copy
- chain and remove frame traffic. Covered example:
-
- ```
- fn add1(x: i32): i32 {
- return x + (1 as i32);
- }
-
- pub fn main(): i32 {
- let y: i32 = add1(41 as i32);
- if y == (42 as i32) {
- return 0 as i32;
- }
- return 1 as i32;
- }
- ```
-
- `-O2` removes the wrapper call leftovers and AArch64 now lowers the focused
- quality fixture's `main` to a constant return with no call, prologue, spill,
- or reload traffic.
-
-### Phase E - Inlining and Cleanup
-
-Goal: add MIR-style small direct-call inlining for `-O2`.
-
-Design decision: inline before machine lowering, on retained raw `Func` IR, at
-`CGTarget.finalize`. Do not inline the post-SSA-destruction or post-machinize
-form. The current implementation emits at `func_end`; Phase E first has to
-make the `opt.h` contract real for `-O2`: `func_end` records and stores the
-function, while `finalize` performs interprocedural work, then runs the normal
-per-function O2 pipeline and emission in original function order.
-
-Status: implemented for the Phase E v1 shape. Retained `-O2` finalize
-scheduling is in place, `opt_cleanup` is the reusable O2 pre-lowering cleanup
-schedule, and `pass_inline.c` inlines small direct calls to retained, acyclic
-callees with void/scalar register returns. It supports straight-line wrappers
-and small branchy callees with multiple scalar returns. It intentionally refuses
-aggregate/sret/byval returns, tail calls, varargs, alloca, inline asm,
-computed goto, and other hard cases until those rewrite rules have focused
-tests.
-
-This matches MIR's useful property: inline while calls, parameters, frame
-slots, returns, labels, and source locations are still target-independent.
-Cloning target-lowered calls would couple the inliner to ABI move plans,
-hard-register constraints, and late spill/rewrite state. Cloning SSA would
-also require phi and def-use surgery across two functions. Raw IR cloning keeps
-Phase E mechanical and lets the existing O2 passes clean up the exposed
-constants, copies, dead stores, and dead blocks.
-
-Implementation shape:
-
-- [x] Add real `FuncSet` ownership to `OptImpl`: an arena-backed ordered
- `Func**` plus retained-definition lookup by `ObjSymId`. No global state.
- Record order is emission order at finalize.
-- [x] Split scheduling in `opt.c`:
- - `func_end`: `opt_frame_home_addr_taken_locals`, classify eligibility, append
- to `FuncSet`, and clear the current recorder state. For `-O1`, keeping the
- current immediate emit path is acceptable; `-O2` must retain functions.
- - `finalize`: run `opt_inline(FuncSet*, max_iters)`, then lower/emit every
- retained function. Functions that still require the defensive non-SSA O2
- route use the shared O1 lowering path at finalize.
- - Refactor the current O2 scheduler into "O2 cleanup/pre-lowering" and
- "lowering/emission" pieces so an inlined dirty function can reuse the
- existing verifier-bracketed pass order.
-- [x] Build the call graph from direct `IR_CALL` sites whose callee is
- `OPK_GLOBAL` with addend zero and whose symbol has a retained `Func`.
- Indirect calls and unresolved/external symbols are ordinary non-inlineable
- calls.
-- [x] Compute SCCs on the retained graph. Inline only calls to callees in already
- processed acyclic components. Skip self recursion and mutually recursive SCCs
- for v1 instead of trying partial recursive budgets.
-- [x] Rebuild the call graph at each inline iteration. Default `max_iters = 1`;
- keep a small internal cap even if a later driver knob requests more.
-
-Eligibility and budgets:
-
-- [x] Candidate callee must be retained, same target/context, non-variadic, not
- naked/interrupt/ifunc once those attributes reach `CGFuncDesc`, and not marked
- `noinline` once C declaration flags are plumbed through CG.
-- [x] Candidate call site must be a normal direct call. Skip `CG_CALL_TAIL` and any
- future musttail representation.
-- [x] V1 skips callees containing `IR_ALLOCA`, `IR_VA_*`, `IR_LOAD_LABEL_ADDR`,
- `IR_INDIRECT_BRANCH`, `IR_ASM_BLOCK`, `INTRIN_SETJMP`, `INTRIN_LONGJMP`,
- coroutine switch, or any op whose cloning semantics are not explicit.
-- [x] V1 supports void and scalar register returns first. Aggregate/sret/byval
- returns can be enabled later after return materialization has focused tests.
-- [x] Normal-call budget should be deliberately small: start around 12 to 20
- non-debug, non-label instructions, with a caller growth cap around 25 percent
- and an absolute per-function growth cap. Once `inline` / `always_inline` /
- `noinline` are carried into CG, use a larger explicit-inline budget and let
- `always_inline` exceed the normal budget but still obey semantic refusal
- rules.
-- [x] Keep the heuristic target-independent. Calls, memory barriers, atomics,
- and fences remain semantic refusals in Phase E v1; multiple returns add cost,
- and simple scalar instructions use the normal small-instruction budget.
-
-Inline rewrite:
-
-- [x] Split the caller block at the call site into `pre`, cloned callee body, and
- `cont`. The original call instruction is removed.
-- [x] Allocate fresh caller-local ids for every cloned block, frame slot, scope,
- pseudo-register, and SSA value id present in the callee. Preserve callee
- instruction ids only as source material; cloned instructions get new stable
- ids from the caller.
-- [x] Clone `emit_order` for cloned blocks in callee layout order, inserted between
- `pre` and `cont`. Update CFG successors/predecessors through `opt_build_cfg`
- after the rewrite rather than hand-maintaining all derived metadata.
-- [x] Materialize arguments at the clone entry. For register-like callee parameter
- storage, emit `IR_COPY` from the call argument value to the remapped callee
- parameter pseudo. For frame-backed parameters, emit stores to the remapped
- parameter frame slots using the recorded `IRParam` storage and `MemAccess`
- shape. Constants become `IR_LOAD_IMM` or direct stores as appropriate.
-- [x] Rewrite each cloned `IR_RET`:
- - void return: replace with `IR_BR cont`.
- - scalar register return: copy the returned value to the call site's result
- operand, then branch to `cont`.
- - multiple returns: all returns branch to the same continuation; if later SSA
- needs a phi, the normal O2 mem2reg/reg-SSA construction will create it from
- the inserted copies/stores.
-- [x] Keep ordinary continuation semantics for noreturn callees in Phase E v1.
- Do not infer noreturn only from body shape until continuation omission has
- focused tests.
-- [x] Preserve `Inst.loc` from cloned callee instructions. Argument materialization
- uses the call-site location. This is enough for line-accurate emission now;
- `DW_TAG_inlined_subroutine` can remain a later DWARF phase.
-
-Cleanup after inlining:
-
-- [x] Mark modified callers dirty and invalidate CFG, def-use, dominator, and loop
- analyses.
-- [x] For each retained `-O2` function, run the existing O2 pre-lowering schedule
- after all inline iterations. This is `opt_cleanup` in Phase E terms:
- `build_cfg`, jump cleanup, reg-SSA, mem2reg SSA, simplify, GVN, copy
- propagation, DSE, SSA DCE, LICM, pressure relief, conventional SSA,
- SSA combine, undo SSA, and jump optimization, with the same verifier
- checkpoints as the current O2 path.
-- [x] Then run the shared lowering pipeline once and emit. No function should be
- emitted before all possible callers have had a chance to inline it.
-
-Testing plan:
-
-- [x] Unit/pass tests in `test/opt` for direct wrapper inline, two-return scalar
- inline, refusal for recursion/SCCs, refusal for varargs/alloca/setjmp/asm,
- and caller growth cap.
-- [x] End-to-end toy/C cases where `static int add1(int x) { return x + 1; }`
- disappears at `-O2` but remains a call at `-O1`.
-- [x] Keep the Toy `add1(41) == 42` example from the Phase D checklist as a
- disassembly sanity case: Phase E is responsible for removing the call; Phase
- D cleanup is responsible for removing the exposed constant/copy/branch
- leftovers.
-- [x] A chained-call case proves bottom-up order: `main -> f -> g` inlines `g`
- into `f` before deciding whether `f` fits into `main`.
-- [x] A negative recursive case proves codegen is unchanged and correct.
-- [x] A focused disassembly sanity check on AArch64 or x64 verifies wrapper-call
- removal after inline+cleanup without broad corpus churn.
-
-Exit criteria:
-
-- [x] `make test-opt`
-- [x] Focused wrapper/chained-call toy fixtures:
- `CFREE_TEST_FILTER=<inline-case> CFREE_TOY_OPT_LEVELS="0 1 2" make test-toy`
-- [x] Full `CFREE_TOY_OPT_LEVELS="0 1 2" make test-toy`
-- [x] Recursive and mutually recursive cases are unchanged and correct.
-
-### Phase F - Benchmark Closure
-
-Goal: tune by measurement, not by adding passes speculatively.
-
-Benchmark set:
-
-- MIR `c-benchmarks`: `array`, `binary-trees`, `funnkuch-reduce`,
- `hash`, `hash2`, `heapsort`, `lists`, `mandelbrot`, `matrix`,
- `method-call`, `nbody`, `sieve`, `spectral-norm`, `strcat`, and any
- benchmark that requires only supported libc/runtime features.
-- cfree-specific stress cases for ABI, TLS, atomics, and inline asm.
-
-Primary harness:
+### Inlining
-```
-make bench-opt
-```
+`opt_inline` operates on `FuncSet`.
-`scripts/opt_bench.sh` compares MIR `c2m`, `clang`, `gcc-15`, `cfree cc`,
-and `cfree run` at `-O0`, `-O1`, and `-O2` on MIR's `c-benchmarks` sources
-from `~/tmp/mir/c-benchmarks`. It writes per-case logs, `results.csv`, and a
-geomean summary under `build/bench/opt/`.
+- The call graph is built from direct `IR_CALL` sites whose callee is an
+ `OPK_GLOBAL` with addend zero and a retained `Func`.
+- Strongly connected components are used to avoid recursive inlining. Self and
+ mutually recursive SCCs are refused for v1.
+- The current iteration cap is small. The default is one inline iteration.
+- Candidate callees must be retained, target-compatible, non-variadic, not
+ recursive, and within a conservative growth budget.
+- V1 refuses callees containing alloca, varargs, setjmp/longjmp, inline asm,
+ label-address loads, indirect branches, coroutine switching, or unknown
+ cloning semantics.
+- Void and scalar register returns are supported. Aggregate/sret/byval returns
+ are left for future work.
+- The caller block is split into pre-call, cloned body, and continuation blocks.
+ Arguments are materialized in the clone entry, returns branch to the
+ continuation, and modified callers are marked dirty.
-The hosted `cfree cc` lane passes `--sysroot` and `-lc`; on Darwin the
-sysroot defaults to `xcrun --show-sdk-path`, and it can be overridden with
-`CFREE_OPT_BENCH_SYSROOT`. The `cfree run` lane is an in-process JIT run for
-direct comparison with MIR; it also passes `--sysroot` and `-lc`, expands the
-hosted libc headers/defines in the driver, and relies on host-symbol fallback
-for libc calls.
+After inlining, dirty functions run the normal `-O2` cleanup schedule.
-The benchmark intentionally records unsupported cases instead of stopping at
-the first failure. A cfree row that cannot compile a hosted benchmark is a
-`COMPILE_FAIL` data point, not a harness failure; this lets optimizer work
-track coverage and performance in the same run.
+### SSA Cleanup Schedule
-Useful focused runs:
+`opt_cleanup` is the `-O2` pre-lowering schedule.
-```
-CFREE_OPT_BENCHES="sieve spectral-norm" make bench-opt
-CFREE_OPT_BENCH_LEVELS="1 2" CFREE_OPT_BENCH_RUN_REPEATS=5 make bench-opt
-GCC=/opt/homebrew/bin/gcc-15 MIR_DIR=~/tmp/mir make bench-opt
-```
+1. `opt_build_cfg`, `opt_jump_cleanup`, `opt_build_cfg`
+ Canonicalize control flow before SSA.
-Measure:
-
-- Compile wall time for `-O0`, `-O1`, `-O2`.
-- Executable run time against `gcc-15 -O2`, `clang -O2`, and MIR `c2m -O2`
- when available.
-- MIR-specific split: `compile_ms` is C-to-binary-MIR time, `codegen_ms` is
- the JIT link/generation slice reported by `c2m -v`, and `runtime_ms` is the
- generated function execution slice. Native compiler rows use compile+link
- wall time for `compile_ms` and executable wall time for `runtime_ms`.
-- `cfree-run` uses `cfree run --bench-time`: `compile_ms` is compile+JIT time,
- and `runtime_ms` is the in-process entry-call execution slice.
-- Code size for hot text sections.
-- Pass counters: removed GVN expressions, folded branches, removed stores,
- coalesced moves, spills/restores, split ranges, post-RA deleted moves.
-
-Initial representative run, 2026-05-22:
-
-- Scope: `array`, `binary-trees`, `hash`, `hash2`, `matrix`, `nbody`,
- `sieve`, and `spectral-norm`; levels `0 1 2`; one compile repeat and one
- run repeat. Output was written to `build/bench/opt/results.csv` and
- `build/bench/opt/summary.md`.
-- Coverage: 120 data rows; 91 `OK`, 18 `COMPILE_FAIL`, 9 `RUN_FAIL`, and
- 2 `OUTPUT_FAIL`.
-- `gcc-15` and `clang` completed all rows. cfree and MIR are blocked on some
- hosted/math cases: `binary-trees`, `nbody`, and `spectral-norm` hit Darwin
- `math.h`/builtin compatibility issues. cfree also has an `-O2` wrong-code
- failure on `matrix`.
-- Runtime geomean versus `gcc-15 -O2` on completed rows:
- `cfree-run` improved from `0.372x` at `-O0` to `0.445x` at `-O1` and
- `0.480x` at `-O2`; MIR improved from `0.554x` to `0.609x` to `0.796x`.
- `clang -O2` measured `1.076x`.
-- Compile/JIT geomean versus `gcc-15 -O2`: `cfree-run` was about `11.5x` to
- `12.0x` faster, while MIR was about `2.7x` faster. On cases where both
- completed, cfree-run compile/JIT time was roughly 4-5x faster than MIR.
-- Direct cfree-run versus MIR runtime on common successful cases:
- `0.67x` at `-O0`, `0.73x` at `-O1`, and `0.57x` at `-O2`. This confirms
- the current split: cfree's compile/JIT path is very fast, but generated-code
- quality still trails MIR, especially at `-O2`.
-
-Immediate benchmark blockers:
-
-- [x] Fix the `matrix -O2` wrong-code regression before trusting O2 timing.
-- [x] Add or model the hosted math builtins needed by Darwin `math.h`,
- including `__builtin_fabsf`, `__builtin_inf*`, `__builtin_huge_val*`,
- `_Float16` declarations, and the float/double predefined limit macros used
- by the system header.
-- Re-run the full MIR benchmark set after those blockers, then increase repeat
- counts for stable numbers.
-
-Follow-up representative run after cfree blocker fixes, 2026-05-22:
-
-- Same 8-benchmark scope, levels `0 1 2`, one compile repeat, and one run
- repeat.
-- Coverage: 120 data rows; 111 `OK`, 9 `COMPILE_FAIL`, 0 `RUN_FAIL`, and
- 0 `OUTPUT_FAIL`.
-- All `cfree` and `cfree-run` rows completed successfully, including
- `matrix -O2`, `binary-trees`, `nbody`, and `spectral-norm`. The remaining
- `COMPILE_FAIL` rows are all `mir-c2m` on the hosted/math benchmarks.
-- Updated runtime geomean versus `gcc-15 -O2` on the representative scope:
- `cfree-run` measured `0.363x` at `-O0`, `0.519x` at `-O1`, and `0.481x`
- at `-O2`; MIR measured `0.550x`, `0.574x`, and `0.786x` on its five
- completed cases. `clang -O2` measured `1.072x`.
-
-Focused rerun of the formerly broken cfree rows, 2026-05-22:
-
-- Scope: `matrix`, `binary-trees`, `nbody`, and `spectral-norm`; levels
- `0 1 2`; one compile repeat and one run repeat. Output was written to
- `build/bench/opt/rerun-broken-cfree/results.csv`.
-- All `cfree` and `cfree-run` rows were `OK`.
-- Geomean speed ratios use each benchmark's `gcc-15 -O2` row as `1.0x` for
- both compile time and run time:
-
-| tool | opt | compile speed | runtime speed |
-| --- | ---: | ---: | ---: |
-| `cfree` | `O0` | `7.050x` | `0.319x` |
-| `cfree` | `O1` | `7.062x` | `0.534x` |
-| `cfree` | `O2` | `6.831x` | `0.483x` |
-| `cfree-run` | `O0` | `13.709x` | `0.323x` |
-| `cfree-run` | `O1` | `14.286x` | `0.550x` |
-| `cfree-run` | `O2` | `11.791x` | `0.512x` |
-
-- On this focused scope, `O1` currently has the best cfree runtime geomean;
- `O2` remains correct but is slower than `O1` on the measured mix.
-
-Target:
-
-- `-O1` should be the fast optimized tier and materially faster to compile
- than `-O2`.
-- `-O2` should first reach QBE-class performance on the benchmark set, then
- close toward MIR's published `c2m -eg` geomean.
-
----
-
-## 7. Module Layout
-
-Keep the MIR pass boundaries as separate cfree modules:
+2. `opt_build_reg_ssa`
+ Converts mutable pseudo-registers to SSA `Val` ids.
-```
-src/opt/
- opt.c wrapper, vtable bindings, schedule dispatch
- ir.c Func/Block/Inst plumbing
- ir_print.c stable dumps for pass tests
- pass_cfg.c CFG, unreachable cleanup
- pass_o2.c early O2 transforms: block cloning, address transform,
- scalar GVN, temporary DSE/cleanup stubs
- pass_simplify.c local and SSA-aware scalar/address identity simplification
- pass_clone.c block cloning (split out when pass size warrants it)
- pass_ssa.c SSA build, conventional SSA, undo SSA
- pass_addr.c address transformation (split out when pass size warrants it)
- pass_gvn.c GVN, constprop, redundant-load elimination
- pass_copy.c copy propagation, extension cleanup
- pass_dse.c dead store elimination
- pass_dce.c SSA DCE and post-RA DCE
- pass_loop.c loop tree and LICM
- pass_pressure.c pressure relief
- pass_jump.c jump optimization
- pass_machinize.c target ABI and machine-form lowering
- pass_live.c liveness, pressure, live ranges
- pass_lower.c current assignment/rewrite implementation
- pass_coalesce.c move collection and coalescing
- pass_regalloc.c assignment, rewrite, splitting (split out when warranted)
- pass_combine.c target-aware code selection
- pass_emit.c drive wrapped CGTarget
- pass_inline.c inlining and cleanup
-```
+3. `opt_block_cloning`
+ Clones small cold/return blocks into hot predecessors when bounded growth
+ exposes better local optimization.
-No pass should reach into the wrapped target's private implementation. Target
-specifics belong behind `Target`, `TargetABI`, or explicit helper hooks.
+4. `opt_build_ssa`
+ Promotes eligible frame-backed locals/params to mem2reg SSA, inserts phis,
+ and rewrites loads/stores to values.
----
+5. `opt_ssa_dce` and `opt_copy_cleanup`
+ Remove unused SSA definitions and redundant copies before heavier passes.
-## 8. Pass Invariants
+6. `opt_addr_xform`
+ Folds address-producing pseudos into direct memory operands where legal and
+ preserves address-taken storage in memory.
-- No VLAs and no global optimizer state. MIR uses generator context; cfree
- should hang all pass state off `Compiler`, `Func`, `FuncSet`, or explicit
- pass contexts.
-- `PReg` names mutable virtual-register storage before reg-SSA. `Val` names
- SSA values after pseudo-register SSA construction. Lowering may map PRegs or
- post-SSA Vals to physical registers or stack slots, but that mapping belongs
- to allocation state.
-- Frame slots remain frame slots until a pass proves promotion is legal.
-- Calls are memory barriers unless alias information proves otherwise.
-- Inline asm is side-effecting and may pin hard regs; the allocator and
- inliner must treat it conservatively.
-- `set_loc` is sticky on recorded insts. Every emitted instruction must forward
- the active location to the wrapped target before machine emission.
-- Every pass either preserves CFG metadata or invalidates/rebuilds it before
- the next consumer.
+7. `opt_simplify`
+ Applies SSA-aware identity simplification, constant simplification, and
+ local algebraic cleanup.
----
+8. `opt_gvn`
+ Performs scalar value numbering, constant propagation, branch folding,
+ redundant expression elimination, and memory-aware redundant load/store-load
+ reuse where alias/version rules permit it.
-## 9. Test Strategy
+9. `opt_copy_prop`
+ Propagates SSA copies and removes redundant extension/copy chains.
-Targeted commands:
+10. `opt_dse`
+ Removes stores proven overwritten or dead along all paths, while preserving
+ observable memory and call/escape boundaries.
-```
-CFREE_TOY_OPT_LEVELS="0 1 2" make test-toy
-CFREE_TEST_FILTER=<case> CFREE_TOY_OPT_LEVELS="0 1 2" make test-toy
-make test-opt
-```
+11. `opt_build_loop_tree`, `opt_licm`
+ Rebuild loop information and hoist safe invariant computations.
-Use pass dumps for red-green tests:
+12. `opt_pressure_relief`
+ Sinks legal same-block computations to reduce peak pressure before
+ allocation.
-- CFG: block/successor/predecessor shape.
-- SSA: phi placement and def-use chains.
-- Scalar GVN: redundant pure expression replaced and constant branch folded,
- with def-use rebuilt and stale-analysis verifier coverage.
-- Memory GVN: redundant load/store-load reuse only after alias and memory
- numbering are specified; tests must include volatile/atomic/call/unknown
- memory barriers.
-- DSE: dead store removed while escaping stores remain.
-- LICM: safe invariant moved; trapping division and memory ops remain.
-- RA: expected spill/reload/coalescing counters.
-- Combine: target-legal fused instruction shape.
+13. `opt_make_conventional_ssa`
+ Lowers SSA phis into edge copies. These copies are marked
+ `IRF_NO_COALESCE`; coalescing them can collapse current and next
+ loop-carried values and miscompile loops.
+
+14. `opt_ssa_combine`
+ Combines patterns while conventional SSA edge copies are explicit.
-Before broad runs, redirect output to a file and inspect the failure slice:
+15. `opt_undo_ssa`, `opt_copy_cleanup`
+ Rewrites SSA values back into allocation-ready pseudo-register form.
+
+16. `opt_jump_opt`
+ Runs full O2 jump optimization before shared lowering.
+
+Verifier checkpoints bracket the important transitions: CFG, reg SSA, block
+cloning, mem2reg SSA, address transform, simplify, GVN, copy prop, DSE, LICM,
+pressure relief, conventional SSA, SSA combine, undo SSA, and jump opt.
-```
-CFREE_TEST_FILTER=<case> CFREE_TOY_OPT_LEVELS="0 1 2" make test-toy >build/opt-test.log 2>&1
-tail -200 build/opt-test.log
-```
+After cleanup, `-O2` enters the shared lowering pipeline with live-range
+splitting enabled.
+
+## Register Allocation
+
+Allocation uses target-provided register classes and hard-register sets.
+
+- Pseudo-register metadata records type, class, fixed/tied hard-register
+ constraints, and forbidden hard-register masks.
+- Liveness is represented as block live-in/live-out sets and live ranges over
+ instruction points.
+- Coalescing collects move-related candidates, builds a bounded conflict matrix
+ for those candidates, and merges only same-class, same-type registers with
+ compatible hard-register constraints and no live-range conflict.
+- Copies marked `IRF_NO_COALESCE` are not coalescing candidates.
+- `-O2` allocation may split live ranges and insert boundary reload/store
+ instructions. Critical edges are split when needed for placement.
+- Spill slots are represented as frame slots and rewritten as ordinary loads
+ and stores before emission.
+
+## Verification, Dumps, and Metrics
+
+`opt_verify(Func*, stage)` checks CFG reciprocity, reachable block shape,
+emit-order validity, instruction ids, operand namespaces, phi consistency, and
+def-use freshness. Passes should keep verifier failures local by using stage
+names that describe the last completed transformation.
+
+The optimizer also emits metrics through the compiler metrics interface. The
+driver's `cfree run --time` mode is the easiest way to inspect scoped timings
+such as frontend, pass totals, link, JIT, and execution.
+
+IR dumps and pass-local tests should prefer stable ids and pass-specific
+fixtures. End-to-end parse/toy tests should be focused on user-visible behavior
+and disassembly sanity checks, not broad churn.
+
+## Current Limits
-The full toy corpus remains the end-to-end equivalence oracle for frontend
-coverage: levels `1` and `2` must produce the same observable result as level
-`0`. The API and opt unit suites remain the narrower pass/CG contract oracle.
-DWARF equivalence can start weaker, but `set_loc` forwarding must not regress
-line-row emission.
+- `_Float16` is parsed for Darwin header compatibility but is not a real
+ half-precision ABI implementation.
+- Inlining v1 does not handle aggregate returns, sret/byval, recursive SCCs,
+ varargs, alloca, setjmp/longjmp, inline asm, or indirect-control constructs.
+- `cfree-run --bench-time` reports compile+JIT as one slice. Finer-grained
+ timing is available through `--time`, but benchmark CSVs do not yet split
+ cfree frontend, optimizer, link, and JIT into separate columns.
+- Performance goals, measurements, and tuning priorities live in
+ `doc/OPT_PERF.md`.
diff --git a/doc/OPT1.md b/doc/OPT1.md
@@ -1,330 +0,0 @@
-# OPT1 - Current O1 Backend Path
-
-`-O1` is cfree's fast optimized backend tier. It records frontend CG
-operations as IR, runs a small MIR-shaped lowering/allocation pipeline, rewrites
-virtual values to hard registers or spill slots, performs narrow post-RA
-cleanup, and emits through the target backend.
-
-This document has two jobs:
-
-1. document the implemented O1 pipeline; and
-2. record the completed O1 code-shape cleanup and remaining performance
- watchpoints.
-
-Compile-time measurements and scaling notes live in `doc/PERF.md`. Broader
-optimizer direction lives in `doc/OPT.md`.
-
-## Goals
-
-O1 should remain:
-
-- correct across supported targets and ABI lowering paths;
-- materially better than O0 generated code;
-- close enough to O0 compile time for JIT use;
-- free of dense pseudo-register conflict matrices and whole-function
- point/candidate products in hot paths;
-- simple enough that O2 can add coalescing, splitting, SSA, and value
- optimization without destabilizing the O1 baseline.
-
-O1 is not intended to perform full SSA optimization, global value numbering,
-LICM, general inlining, or global block layout. Those belong in O2 unless a
-small local cleanup is required to keep O1 code shape reasonable.
-
-## Pipeline
-
-The normal O1 function-end path is:
-
-```text
-build_cfg
-jump_cleanup.cfg
-build_cfg
-machinize
-build_loop_tree
-live_blocks.pre_dde
-dead_def_elim
-regalloc:
- live_blocks
- live_ranges_build
- assign_ranges
- rewrite
-combine
-dce
-jump_cleanup.cfg
-build_cfg
-jump_cleanup.layout
-emit
-```
-
-Pass boundaries:
-
-- `machinize` applies target-specific ABI, call, and register constraints before
- liveness and allocation.
-- `live_blocks.pre_dde` computes block liveness for pre-RA dead-def
- elimination.
-- `dead_def_elim` removes dead virtual definitions using block liveness and
- operand-reference events.
-- `regalloc` rebuilds liveness after DDE, constructs compressed live ranges,
- assigns hard registers or spill slots, and rewrites operands.
-- `combine` is narrow post-RA cleanup: noop physical copy deletion, safe
- single-use folds, redundant spill load/store cleanup, and similar local
- rewrites.
-- `dce` is conservative after rewrite and removes only trivial dead IR shapes.
-- `jump_cleanup.cfg` forwards branches to branch-only blocks and performs local
- compare-branch inversion when that removes an unconditional jump block.
-- `jump_cleanup.layout` runs immediately before emit and deletes unconditional
- branches whose target is the physical fallthrough block.
-
-## Liveness And Ranges
-
-O1 liveness is pass-local. Persistent per-instruction live-after storage and
-dense value conflict matrices are not part of the normal path.
-
-`opt_live_blocks` computes block `live_use`, `live_def`, `live_in`, and
-`live_out` sets using `OptBitset`. `OptBitset` tracks active words, trims
-trailing zero words, and bounds copy/union work by active length where possible.
-
-`opt_live_ranges_build` turns block liveness plus backward instruction walks
-into compressed live ranges:
-
-- per-instruction references are collected as use/def value lists;
-- open and touched values are tracked with generation marks;
-- raw instruction points are compressed to live-range boundary points;
-- live-across-call frequency is tracked from a rolling live-value list;
-- per-value summaries record live length, use/def frequency, block frequency,
- call-crossing frequency, and spill cost.
-
-The expected O1 invariant is `opt.conflict_bytes=0`.
-
-## Allocation
-
-The normal O1 allocator is range based. It does not build or consume a
-pseudo-register interference matrix.
-
-Candidates are ordered by:
-
-1. tied hard-register requirement;
-2. spill/frequency cost;
-3. shorter live length;
-4. stable value id.
-
-Each value is assigned either:
-
-- a target-provided hard register in the value's register class; or
-- a compatible `FS_SPILL` frame slot.
-
-Non-tied values known to be live across calls prefer callee-saved registers, but
-may use caller-saved registers when no better non-conflicting location is
-available. Correctness does not depend on avoiding caller-saved registers
-globally: rewrite preserves hard-assigned values only at calls whose
-target-provided clobber mask actually destroys that register.
-
-Occupancy is stored as sorted per-location interval sets:
-
-- one interval vector per physical hard-register location;
-- one interval vector per allocated stack spill slot.
-
-Conflict checks test whether any candidate live range overlaps the location's
-intervals. Marking inserts and merges intervals. The old point-index bitmap row
-OR/mark path is not part of normal O1.
-
-Metrics still report `opt.alloc.used_loc_words`, `opt.alloc.hard_loc_words`,
-`opt.alloc.stack_loc_words`, point visits, and mark counters for trend
-compatibility, but `opt.alloc.hard_word_ors` and
-`opt.alloc.stack_word_ors` should remain zero on the interval path.
-
-## Rewrite
-
-Rewrite consumes the assignment map and walks each block backward with rolling
-liveness:
-
-- hard-assigned pseudos become physical registers;
-- spilled uses receive reloads through target scratch registers;
-- spilled defs receive stores;
-- call save/restore insertion checks hard-assigned values known to be live
- across some call, then intersects each value's assigned register with that
- call's target-provided clobber mask;
-- inline-asm register constraints are applied only for functions containing
- `IR_ASM_BLOCK`.
-
-Rewrite does not materialize per-instruction full live-after sets.
-
-## Target Contract
-
-O1 relies on each target backend to provide:
-
-- physical-register metadata per register class, including allocable,
- caller-saved, callee-saved, argument, return, reserved, and preference/cost
- flags;
-- legacy allocable hard-register pools per register class for direct-CG
- compatibility and as a fallback while older backends migrate;
-- scratch-register pools disjoint from allocable pools;
-- caller-saved register classification and per-call clobber masks;
-- return-register masks and callee-save masks;
-- call-plan construction that describes ABI register destinations, outgoing
- stack arguments, return registers, call clobbers, return masks, outgoing
- stack size, and variadic metadata before liveness and allocation;
-- parameter storage binding through `CGTarget.param`, which may return either a
- frame slot or register-backed local storage for simple direct ABI params;
-- optional hard-register reservation before backend `func_end`;
-- target legality for local folds performed by `opt_combine`.
-
-The current target pools are:
-
-- AArch64: integer `x19-x28`, FP `v8-v23`;
-- x64: integer `R13/R14/R15/R10`, FP `XMM6-XMM15`;
-- RV64: integer `s4-s11`, FP `fs4-fs11`.
-
-Backends own final prologue/epilogue emission and callee-saved register
-preservation. O1 calls `reserve_hard_regs` with hard registers still visible in
-replay after cleanup so backend save/restore decisions match emitted IR.
-
-Targets may provide a known-frame entry path. When `func_begin_known_frame` and
-`call_stack_size` are both available, O1 computes frame slots, outgoing call
-area size, and dynamic-allocation presence before body replay and passes that
-to the backend. AArch64, x64, and RV64 use this path to emit the exact prologue
-up front instead of reserving target NOP patch slots.
-
-## Current Generated-Code Shape
-
-Fresh May 2026 probes were run with `build/cfree` rebuilt at host `-O2`, on an
-arm64 machine, focusing on native AArch64 codegen. The same simple C probes
-were also compiled for x64/RV64 by the probe script, but AArch64 is the primary
-quality target here.
-
-Current progress:
-
-- Immediate scalar arithmetic folds through simple local shadows. For example,
- `int x = 40; x += 2; return x;` lowers to immediate return materialization on
- AArch64.
-- Simple loop locals stay in registers.
-- Branch-consuming compares lower to direct compare branches.
-- O1 branch cleanup removes branch-to-branch targets and unconditional branches
- to physical fallthrough when it can do so locally.
-- Address-taken locals still go through memory, as expected for current O1.
-- The old prologue NOP patch path is gone; known-frame entry is used where the
- target supports it.
-- Empty leaf functions omit avoidable frame setup and return directly.
-- Scalar return producers can be retargeted into ABI return registers, removing
- target-level return self-copies.
-- Unused scalar parameters consume ABI locations without materializing moves,
- and used stack-passed scalar parameters load directly from the incoming stack
- area.
-
-Representative AArch64 `const_local` output:
-
-```asm
-mov w0, #0x2a
-ret
-```
-
-Representative AArch64 `scalar_add` output:
-
-```asm
-add w0, w0, w1
-ret
-```
-
-A ninth-argument AArch64 scalar probe now loads directly from the incoming
-stack area in a frameless leaf:
-
-```asm
-ldur w0, [sp]
-add w0, w0, #1
-ret
-```
-
-## Completed O1 Codegen Work
-
-The O1 cleanup list is now implemented for the current backend path:
-
-- Leaf functions can omit the frame on AArch64, x64, and RV64 when the known
- frame is empty, the function has no calls or dynamic stack allocation, and the
- backend has no callee-saved state to preserve. Returns in such functions emit
- the target return instruction directly instead of branching to an adjacent
- epilogue.
-- Post-RA cleanup removes physical self-copies, collapses short copy chains,
- retargets safe single-use producers, and retargets adjacent scalar return
- producers into ABI return registers when target metadata allows it.
-- Target physical-register metadata exposes caller-saved, callee-saved,
- argument, return, reserved, and scratch policy. O1 prefers caller-saved
- registers for non-call-crossing values, uses callee-saved registers for
- call-crossing values when profitable, and ties non-call-crossing scalar
- parameters to their incoming ABI registers when legal.
-- Non-address-taken scalar parameters remain register-backed. Unused
- register-backed parameters consume their ABI locations without materializing
- pointless moves, and stack-passed scalar parameters are loaded directly from
- the incoming stack area. Frameless functions use `sp`/`rsp` for those loads;
- framed functions use the target frame pointer as before.
-- Planned call replay is the normal O1 call path for supported target plans.
- Replay uses a local parallel-move resolver for register moves, stack argument
- stores, return extraction, cycles, and indirect callees that overlap argument
- destinations.
-- Branch cleanup remains intentionally local: branch-target forwarding,
- compare-branch inversion for local jump-block removal, unreachable pruning via
- CFG, and physical fallthrough branch deletion.
-
-Representative AArch64 output after these slices:
-
-```asm
-f:
- mov w0, #0x2a
- ret
-
-g:
- add w0, w0, w1
- ret
-
-h: // ninth integer argument
- ldur w0, [sp]
- add w0, w0, #1
- ret
-```
-
-Further improvements such as global block ordering, general coalescing, SSA
-value optimization, and deeper stack-heavy compile-time tuning belong to O2 or
-`doc/PERF.md`, not the O1 cleanup baseline.
-
-## Compile-Time Watchlist
-
-The current O1 compile-time shape is good for normal C ladders: many-function
-and direct-call generated inputs are approximately linear through 1024
-functions with `opt.conflict_bytes=0`.
-
-The remaining counter-level warning is stack-heavy pressure. In the current
-host-`-O2` native run:
-
-```text
-shape step opt.o1.total regalloc used_words
-spill_argc 512->1024 2.07x 2.23x 3.60x
-params 256->512 1.73x 1.80x 3.33x
-```
-
-Wall time is only mildly super-linear, but the `used_words` growth should be
-investigated before much larger stack-heavy inputs are treated as normal O1
-workloads. See `doc/PERF.md` for the full tables.
-
-## Tests And Metrics
-
-Focused O1 coverage lives in `test/opt/` and the phase0 guardrails. It covers
-branch liveness, call-clobber preservation, spill pressure, inline asm
-tied/fixed registers, pre-RA DDE, post-rewrite cleanup, and cross-target mock
-validation.
-
-Useful targeted runs:
-
-```sh
-make test-opt
-make test-cg-api
-make test-toy
-make test-aa64-inline
-make test-smoke-x64
-```
-
-`cfree run --time -O1` emits counters for:
-
-- function/block/inst/value counts;
-- block liveness active words and dataflow visits;
-- live range counts, compressed points, live-word touches, and value scans;
-- allocator hard/stack occupancy, interval probes/marks, spills;
-- rewrite reloads/stores/inserted instructions/live traffic;
-- link/JIT and compile frontend scopes.
diff --git a/doc/OPT_PERF.md b/doc/OPT_PERF.md
@@ -0,0 +1,200 @@
+# OPT Performance Tracking
+
+This document tracks optimizer performance, benchmark coverage, and the current
+gap to MIR and GCC. The optimizer design and pass order live in `doc/OPT.md`.
+
+## How To Run
+
+Primary harness:
+
+```sh
+make bench-opt
+```
+
+Useful focused runs:
+
+```sh
+CFREE_OPT_BENCHES="sieve spectral-norm" make bench-opt
+CFREE_OPT_BENCH_LEVELS="1 2" CFREE_OPT_BENCH_RUN_REPEATS=5 make bench-opt
+GCC=/opt/homebrew/bin/gcc-15 MIR_DIR=~/tmp/mir make bench-opt
+```
+
+The harness compares:
+
+- `gcc-15`
+- `clang`
+- `cfree cc`
+- `cfree run`
+- MIR `c2m`
+
+It writes `results.csv`, per-case logs, binaries, and `summary.md` under
+`build/bench/opt/` or `CFREE_OPT_BENCH_OUT`.
+
+## Reporting Model
+
+The summary reports compile time in two complementary ways.
+
+1. Unified geomean ratios
+
+ Compile-speed ratios use `compile_ms + codegen_ms` where both are
+ available. For MIR, this combines C-to-binary-MIR time and JIT link/generate
+ time so the headline number is comparable to `cfree-run` as a whole.
+
+2. Split compile timings
+
+ For MIR, `compile_ms` is C-to-binary-MIR time and `codegen_ms` is the
+ `c2m -v` link/JIT generation slice. This table explains whether a
+ compile-time difference comes from C parsing/lowering or from the optimized
+ generation pipeline.
+
+For `cfree-run`, `--bench-time` currently reports compile+JIT as one slice.
+Use `cfree run --time` for finer-grained local investigations. The benchmark
+CSV does not yet split cfree frontend, optimizer, link, and JIT into separate
+columns.
+
+MIR's benchmark binary is an optimized host executable. On the current Darwin
+machine, `/Users/ryan/tmp/mir/c2m` is built by MIR's makefile with
+`-O3 -DNDEBUG`.
+
+## Current Representative Run
+
+Date: 2026-05-22.
+
+Scope:
+
+- `array`
+- `binary-trees`
+- `hash`
+- `hash2`
+- `matrix`
+- `nbody`
+- `sieve`
+- `spectral-norm`
+
+Levels: `0 1 2`.
+
+Repeats: one compile repeat and one run repeat.
+
+Output:
+
+- `build/bench/opt/results.csv`
+- `build/bench/opt/summary.md`
+
+Coverage:
+
+| status | rows |
+| --- | ---: |
+| OK | 111 |
+| COMPILE_FAIL | 9 |
+| RUN_FAIL | 0 |
+| OUTPUT_FAIL | 0 |
+
+All `cfree` and `cfree-run` rows completed successfully, including the
+previously broken `matrix -O2`, `binary-trees`, `nbody`, and `spectral-norm`
+rows. The remaining `COMPILE_FAIL` rows are all MIR `c2m` on the Darwin
+hosted/math benchmarks.
+
+## Headline Geomeans
+
+Base: each benchmark's `gcc-15 -O2` row is `1.0x`.
+
+Representative 8-case scope:
+
+| tool | opt | cases | compile speed | runtime speed |
+| --- | ---: | ---: | ---: | ---: |
+| `cfree` | `O0` | 8 | `5.649x` | `0.346x` |
+| `cfree` | `O1` | 8 | `5.792x` | `0.502x` |
+| `cfree` | `O2` | 8 | `5.730x` | `0.468x` |
+| `cfree-run` | `O0` | 8 | `11.229x` | `0.363x` |
+| `cfree-run` | `O1` | 8 | `11.647x` | `0.519x` |
+| `cfree-run` | `O2` | 8 | `10.474x` | `0.481x` |
+| `mir-c2m` | `O0` | 5 | `2.476x` | `0.550x` |
+| `mir-c2m` | `O1` | 5 | `2.515x` | `0.574x` |
+| `mir-c2m` | `O2` | 5 | `2.437x` | `0.786x` |
+
+Interpretation:
+
+- cfree wins the whole compile/JIT headline mostly because its C frontend and
+ hosted-header path are much faster, and because MIR fails the Darwin
+ hosted/math rows.
+- On MIR-common cases where both systems have IR, MIR's optimized generation
+ pipeline is faster and produces better runtime code.
+- cfree `O1` currently has the best cfree runtime geomean on this mix.
+ `O2` is correct but slower than `O1` on the current benchmark set.
+
+## Formerly Broken cfree Rows
+
+Focused rerun:
+
+- Scope: `matrix`, `binary-trees`, `nbody`, `spectral-norm`
+- Levels: `0 1 2`
+- Output: `build/bench/opt/rerun-broken-cfree/results.csv`
+
+All `cfree` and `cfree-run` rows were `OK`.
+
+Base: each benchmark's `gcc-15 -O2` row is `1.0x`.
+
+| tool | opt | compile speed | runtime speed |
+| --- | ---: | ---: | ---: |
+| `cfree` | `O0` | `7.050x` | `0.319x` |
+| `cfree` | `O1` | `7.062x` | `0.534x` |
+| `cfree` | `O2` | `6.831x` | `0.483x` |
+| `cfree-run` | `O0` | `13.709x` | `0.323x` |
+| `cfree-run` | `O1` | `14.286x` | `0.550x` |
+| `cfree-run` | `O2` | `11.791x` | `0.512x` |
+
+## Optimizer/Link/JIT Split
+
+Focused timing pass:
+
+- Scope: `array`, `hash`, `hash2`, `matrix`, `sieve`
+- Output: `build/bench/opt/opt-link-split-common/summary.csv`
+- cfree measurement: sum of `opt.o*.total`, `link.resolve.total`, and JIT
+ scoped timings from `cfree run --time`
+- MIR measurement: `MIR link finish` from `c2m -v -O<n> file.bmir -eg`
+
+| opt | cfree opt+link+JIT ms | MIR link/JIT ms | result |
+| ---: | ---: | ---: | --- |
+| `O0` | `0.275` | `0.758` | cfree `2.76x` faster |
+| `O1` | `4.552` | `0.718` | MIR `6.34x` faster |
+| `O2` | `4.728` | `1.100` | MIR `4.30x` faster |
+
+This is the key split for tuning: cfree's final link/JIT mechanics are very
+fast, but MIR's optimized generation pipeline is substantially faster at
+`O1` and `O2`.
+
+## Goals
+
+Compile-time goals:
+
+- Preserve cfree's frontend and whole-pipeline compile/JIT advantage.
+- Add benchmark CSV columns for cfree frontend, optimizer, link, and JIT slices
+ instead of relying on ad hoc `--time` logs.
+- Reduce optimized-pass cost. The first target is closing the MIR link/JIT gap
+ on the common-case split measurement.
+- Keep `-O1` materially faster to compile than `-O2`.
+
+Runtime-quality goals:
+
+- Make `O2` reliably faster than `O1` on the representative benchmark mix.
+- Close toward MIR `O2` on MIR-common cases.
+- First external target: reach at least QBE-class performance on the MIR
+ `c-benchmarks` set, then close toward MIR's published `c2m -eg` geomean.
+
+Coverage goals:
+
+- Keep all cfree rows passing on the representative scope.
+- Expand the benchmark set back toward all MIR `c-benchmarks` that require only
+ supported hosted libc/runtime features.
+- Track MIR failures separately from cfree failures so cfree coverage is not
+ obscured by external tool limitations.
+
+## Current Tuning Priorities
+
+1. Investigate why `O2` trails `O1` on the current cfree runtime geomean.
+2. Profile expensive O2 passes on `hash`, `hash2`, and `matrix`, where the
+ opt/link split shows cfree optimized generation much slower than MIR.
+3. Improve generated-code quality before adding broad new passes. Prefer
+ targeted fixes backed by benchmark deltas and pass-local tests.
+4. Add finer cfree timing columns to `scripts/opt_bench.sh` once the driver can
+ expose parse, optimize, link, and JIT slices in parseable form.
diff --git a/doc/PERF.md b/doc/PERF.md
@@ -1,285 +0,0 @@
-# Cfree O1 Performance Notes
-
-This file records the current `cfree run --time -O1` shape and the remaining
-compile-time scaling risks. It is intentionally focused on the current O1 path;
-older allocator phase history is preserved in git and in `doc/MIR_RA_REPORT.md`.
-
-## Measurement Setup
-
-Fresh snapshot: May 17, 2026.
-Follow-up stack-assignment probe: May 21, 2026.
-
-- Host: `arm64` macOS.
-- Compiler under test: `build/cfree`, rebuilt as a Mach-O arm64 executable with
- the compiler itself built by clang at host `-O2`.
-- Timing path: native `build/cfree run --time -O1`, which exercises the native
- AArch64 JIT/codegen path on this machine.
-- Samples: three runs per point; tables use p50 milliseconds.
-- Raw local outputs from the run:
- - `build/o1-probes/scaling.tsv`
- - `build/o1-probes/spill_argc.tsv`
- - `build/o1-current/params-hostO2.tsv`
- - `build/o1-current/toy-hostO2.tsv`
- - `build/o1-investigate/spill_argc_final.tsv`
- - `build/o1-investigate/params_argc_final.tsv`
-
-The build log should show `clang -O2 ...` for driver, lib, C frontend, and Toy
-frontend objects before these timings are compared to future runs.
-
-## Metric Shape
-
-Top-level `cfree run --time` scopes:
-
-- `run.total`: full driver path after option/input loading through entry return.
-- `run.compile_and_jit`: compile all inputs and build the JIT image.
-- `compile.tu`: one translation unit to an object builder.
-- `compile.frontend`: registered frontend body.
-- `compile.c.*`: C frontend setup, PP option application, parse/codegen,
- cleanup.
-- `opt.o1.total`: per-function O1 pipeline, nested under frontend codegen.
-- `opt.jump_cleanup`: cheap O1 CFG/layout branch cleanup.
-- `link_jit.all`: linker resolve plus JIT image materialization.
-- `run.jit_lookup` and `run.entry_call`: symbol lookup and execution.
-
-Important O1 counters:
-
-- `opt.funcs`, `opt.blocks`, `opt.insts`, `opt.pregs`
-- `opt.live.block_bytes`, `opt.live.active_words`,
- `opt.live.bitset_words_touched`
-- `opt.ranges`, `opt.range_points`, `opt.range_raw_points`,
- `opt.range.point_visits`, `opt.range.preg_scans`,
- `opt.range.live_words_touched`
-- `opt.alloc.used_loc_words`, `opt.alloc.hard_loc_words`,
- `opt.alloc.stack_loc_words`, `opt.alloc.stack_slots`
-- `opt.alloc.hard_point_visits`, `opt.alloc.stack_point_visits`,
- `opt.alloc.hard_mark_points`, `opt.alloc.stack_mark_points`
-- `opt.alloc.spills`
-- `opt.rewrite.reloads`, `opt.rewrite.stores`,
- `opt.rewrite.inserted_insts`, `opt.rewrite.live_words_touched`
-- `opt.conflict_bytes`
-
-`opt.conflict_bytes` is expected to stay zero on the normal O1 path. The old
-dense pseudo-register conflict matrix is no longer part of O1 allocation.
-
-## Normal Function-Count Ladders
-
-Two synthetic C ladders are used for broad scaling:
-
-- `straight`: N loop-heavy static functions, with `main` directly calling every
- function. This creates N small functions plus one larger straight-line caller.
-- `table`: the same N functions, but `main` calls through a function-pointer
- table. This keeps the caller compact and isolates "many functions" from "one
- larger function".
-
-O1 scopes are summed across all functions.
-
-```text
-straight
-N compile.tu opt.o1.total live_pre live_reg regalloc insts vals ranges spills used_words conflict
-64 17.442 15.776 2.495 3.941 8.740 1856 833 1789 0 3198 0
-128 31.210 28.372 4.453 7.191 15.631 3712 1665 3581 0 6398 0
-256 60.026 54.522 8.544 13.764 30.201 7424 3329 7165 0 12798 0
-512 119.735 108.974 16.987 27.253 61.085 14848 6657 14333 0 25598 0
-1024 234.857 210.831 32.958 52.953 115.918 29696 13313 28669 0 51198 0
-
-table
-N compile.tu opt.o1.total live_pre live_reg regalloc insts vals ranges spills used_words conflict
-64 14.246 12.916 2.046 3.269 7.113 1620 713 1555 0 2720 0
-128 28.560 26.036 4.128 6.591 14.368 3220 1417 3091 0 5408 0
-256 57.608 52.647 8.371 13.331 28.973 6420 2825 6163 0 10784 0
-512 114.689 104.540 16.557 26.401 57.560 12820 5641 12307 0 21536 0
-1024 232.619 209.061 33.476 52.789 115.426 25620 11273 24595 0 43040 0
-```
-
-512 to 1024 ratios:
-
-```text
-shape compile.tu opt.o1.total live_pre live_reg regalloc used_words
-straight 1.96x 1.93x 1.94x 1.94x 1.90x 2.00x
-table 2.03x 2.00x 2.02x 2.00x 2.01x 2.00x
-```
-
-This is the desired current shape. Many functions and the normal direct-call
-large-caller case are essentially linear through 1024 generated functions.
-
-## Spill-Pressure Ladder
-
-The old `spill` row in `run_scaling.sh` initializes many locals from constants;
-current O1 folds most of that away, so it is useful as a frontend/source-size
-sanity check but not as a pressure test. The useful pressure ladder is
-`spill_argc`, which initializes each local from `argc` so values remain live.
-
-```text
-spill_argc
-N compile.tu opt.o1.total live_pre live_reg regalloc insts vals ranges spills reloads inserted used_words conflict
-128 1.388 0.769 0.045 0.169 0.474 770 641 767 108 108 216 3076 0
-256 2.468 1.433 0.058 0.291 0.929 1538 1281 1535 236 236 472 9222 0
-512 4.658 2.750 0.072 0.526 1.825 3074 2561 3071 492 492 984 30730 0
-1024 9.894 5.681 0.110 1.058 4.073 6146 5121 6143 1004 1004 2008 110610 0
-```
-
-512 to 1024 ratios:
-
-```text
-compile.tu 2.12x
-opt.o1.total 2.07x
-live_pre 1.53x
-live_reg 2.01x
-regalloc 2.23x
-spills 2.04x
-used_words 3.60x
-```
-
-This was the main remaining mild super-linear signal in the May 17 snapshot.
-The May 21 follow-up found the direct cause: stack spill assignment used
-first-fit probing across existing spill slots. When many spilled ranges all
-overlapped, every new spill checked almost every previous stack slot before
-allocating a new one, so `opt.alloc.stack_point_visits` grew quadratically.
-
-The fix keeps hard-register assignment priority ordered, then assigns spilled
-values to stack slots in lifetime order using an active min-heap keyed by slot
-end position plus a free-slot list. Stack occupancy is still marked into the
-per-slot interval vectors for replay/debug metrics, but assignment no longer
-probes existing stack slots in the all-live case.
-
-May 21 follow-up, same `spill_argc` shape, single runs:
-
-```text
-spill_argc final
-N compile.tu opt.o1.total live_pre regalloc used_words hard_words stack_words stack_slots hard_visits stack_visits hard_marks stack_marks spills points
-128 3.037 0.853 0.114 0.396 1686 616 1070 107 11732 0 405 107 107 514
-256 3.989 1.214 0.082 0.634 3222 872 2350 235 23508 0 789 235 235 1026
-512 6.776 1.887 0.075 1.050 6294 1384 4910 491 47060 0 1557 491 491 2050
-1024 13.605 3.561 0.102 2.041 12438 2408 10030 1003 94164 0 3093 1003 1003 4098
-```
-
-512 to 1024 ratios:
-
-```text
-compile.tu 2.01x
-opt.o1.total 1.89x
-regalloc 1.94x
-spills 2.04x
-used_words 1.98x
-stack_marks 2.04x
-stack_visits 0
-```
-
-`used_words` now reports the allocator's actual interval-vector storage rather
-than the size of the old dense point-by-location bitmap model. It is expected
-to track interval/vector capacity and can move in allocation-size steps, but it
-no longer reflects a point-times-slot working set.
-
-## Parameter-Heavy Ladder
-
-This C ladder uses one function with N integer parameters and a caller that
-passes N constants. It stresses ABI lowering, stack arguments, and callee-side
-parameter use.
-
-```text
-params
-N compile.tu opt.o1.total live_pre live_reg regalloc insts vals ranges spills used_words conflict
-8 0.816 0.577 0.091 0.143 0.305 37 26 32 0 72 0
-16 0.850 0.595 0.090 0.150 0.320 69 50 64 0 136 0
-32 0.932 0.644 0.093 0.162 0.357 133 98 128 12 392 0
-64 1.053 0.698 0.093 0.175 0.396 261 194 256 44 776 0
-128 1.367 0.866 0.096 0.211 0.510 517 386 512 108 2056 0
-256 1.979 1.209 0.099 0.281 0.740 1029 770 1024 236 6152 0
-512 3.484 2.088 0.110 0.434 1.334 2053 1538 2048 492 20488 0
-```
-
-256 to 512 ratios:
-
-```text
-compile.tu 1.76x
-opt.o1.total 1.73x
-live_pre 1.11x
-live_reg 1.54x
-regalloc 1.80x
-spills 2.08x
-used_words 3.33x
-```
-
-The timing buckets are sub-linear to mildly linear here because the fixed
-frontend/O1 setup costs are still visible. In the May 17 snapshot, `used_words`
-again grew faster than spills and ranges when many parameters forced stack
-traffic.
-
-May 21 follow-up used `argc`-dependent caller arguments to keep the call-side
-values live and prevent constant folding from collapsing the probe:
-
-```text
-params_argc final
-N compile.tu opt.o1.total live_pre regalloc used_words hard_words stack_words stack_slots hard_visits stack_visits hard_marks stack_marks spills points
-128 2.296 0.417 0.051 0.205 1436 376 1060 106 2969 0 25 106 106 133
-256 3.717 0.635 0.053 0.292 2716 376 2340 234 5913 0 25 234 234 261
-512 6.343 0.926 0.052 0.365 5276 376 4900 490 11801 0 25 490 490 517
-1024 13.602 2.428 0.080 0.834 10396 376 10020 1002 23577 0 25 1002 1002 1029
-```
-
-512 to 1024 ratios:
-
-```text
-compile.tu 2.14x
-opt.o1.total 2.62x
-regalloc 2.28x
-spills 2.04x
-used_words 1.97x
-stack_marks 2.04x
-stack_visits 0
-```
-
-The parameter probe still has noisy wall-time at these small absolute O1
-durations, but the allocator-side counters now have the desired linear shape.
-
-## Toy Samples
-
-The Toy frontend uses the same CG/O1/backend machinery, so these are useful
-small API-shape probes. O1 scopes are summed across functions.
-
-```text
-case compile.tu opt.o1.total live_pre live_reg regalloc insts vals ranges spills used_words conflict
-06_while_sum 0.436 0.319 0.049 0.075 0.163 15 7 13 0 26 0
-09_function_params 0.645 0.569 0.091 0.142 0.302 10 8 8 0 22 0
-111_many_function_params 0.682 0.591 0.092 0.147 0.316 38 22 36 0 78 0
-123_spec_demo 4.709 4.229 0.659 1.016 2.270 788 359 590 0 1168 0
-```
-
-The Toy samples do not show a compile-time scaling problem at this size. The
-larger spec demo is dominated by O1/regalloc, but counters are small and there
-is no conflict-matrix traffic.
-
-## Current Interpretation
-
-The current O1 compile-time story is good enough to shift attention toward
-generated-code quality:
-
-- Normal many-function and large-caller C inputs are approximately linear.
-- The old dense liveness/conflict path is gone from normal O1
- (`opt.conflict_bytes=0`).
-- Stack spill assignment no longer does first-fit interval probing across every
- existing spill slot; `opt.alloc.stack_point_visits` stays zero on the
- stack-heavy May 21 ladders.
-- Link/JIT is not the bottleneck in these runs; the visible time is frontend
- parse/codegen plus O1.
-- Spill-heavy and parameter-heavy allocator counters now scale by ranges,
- stack slots, and interval marks rather than by point/candidate products.
-
-## Investigation Priorities
-
-1. Keep the stack-heavy May 21 shape as a regression guard.
- `stack_point_visits` should remain zero on the normal O1 stack assignment
- path, while `stack_marks`, `stack_slots`, spills, and actual interval
- storage should remain approximately linear.
-
-2. Keep `opt.conflict_bytes=0` as a regression guard.
- Reintroducing dense pseudo conflict storage would be a clear O1 regression.
-
-3. Keep normal O1 branch cleanup block-linear.
- `jump_cleanup` should remain a local pass over blocks/terminators. Global
- layout work belongs in O2.
-
-4. Split frontend residuals only after O1 code shape work.
- The non-O1 compile residual is visible, but current scaling pressure is not
- coming from object finalization or link/JIT. More parser/PP/CG counters would
- be useful, but O1 generated-code cleanup has higher near-term value.
diff --git a/scripts/opt_bench.sh b/scripts/opt_bench.sh
@@ -379,7 +379,7 @@ lines = []
lines.append("# OPT Benchmark Summary")
lines.append("")
lines.append(f"Base for speed ratios: `{base_tool} -O2`.")
-lines.append("For MIR, `compile_ms` is C-to-binary-MIR time and `codegen_ms` is the JIT link/generation slice reported by `c2m -v`; compile ratios use their sum. `cfree-run` uses `--bench-time`: `compile_ms` is compile+JIT time, and `runtime_ms` is the in-process entry-call execution slice.")
+lines.append("Unified compile-speed ratios use `compile_ms + codegen_ms` where both are available. For MIR, `compile_ms` is C-to-binary-MIR time and `codegen_ms` is the JIT link/generation slice reported by `c2m -v`; this keeps the headline number comparable to `cfree-run` as a whole while the split table below shows where time is spent. `cfree-run` uses `--bench-time`: `compile_ms` is compile+JIT time, and `runtime_ms` is the in-process entry-call execution slice.")
lines.append("")
lines.append("## Status")
lines.append("")
@@ -415,6 +415,27 @@ for key in sorted(groups):
f"{geo(run_ratios)} | {avg(comp_totals)} | {avg(run_times)} |"
)
lines.append("")
+lines.append("## Split Compile Timings")
+lines.append("")
+lines.append("| tool | opt | ok cases | avg compile/frontend ms | avg codegen/JIT ms | avg unified compile ms |")
+lines.append("| --- | ---: | ---: | ---: | ---: | ---: |")
+for key in sorted(groups):
+ vals = groups[key]
+ compile_times = []
+ codegen_times = []
+ comp_totals = []
+ for r in vals:
+ cm = fnum(r["compile_ms"])
+ cg = fnum(r["codegen_ms"])
+ total = None if cm is None else cm + (cg or 0.0)
+ compile_times.append(cm)
+ codegen_times.append(cg)
+ comp_totals.append(total)
+ lines.append(
+ f"| {key[0]} | {key[1]} | {len(vals)} | {avg(compile_times)} | "
+ f"{avg(codegen_times)} | {avg(comp_totals)} |"
+ )
+lines.append("")
lines.append(f"Raw CSV: `{csv_path}`")
with open(out_path, "w") as f:
f.write("\n".join(lines) + "\n")