kit

kit
git clone https://git.ryansepassi.com/git/kit.git
Log | Files | Refs | README

commit e462688b568fc13b28f3db4126fca1da77335b87
parent a28452e0597151ea9bb31eaf7c220e4ee5ee801c
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Fri, 22 May 2026 09:14:36 -0700

Split optimizer design and performance docs

Diffstat:
Ddoc/CONSTFOLD.md | 600-------------------------------------------------------------------------------
Ddoc/MIR_RA_REPORT.md | 933-------------------------------------------------------------------------------
Mdoc/OPT.md | 1571++++++++++++++-----------------------------------------------------------------
Ddoc/OPT1.md | 330-------------------------------------------------------------------------------
Adoc/OPT_PERF.md | 200+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Ddoc/PERF.md | 285-------------------------------------------------------------------------------
Mscripts/opt_bench.sh | 23++++++++++++++++++++++-
7 files changed, 490 insertions(+), 3452 deletions(-)

diff --git a/doc/CONSTFOLD.md b/doc/CONSTFOLD.md @@ -1,600 +0,0 @@ -# CONSTFOLD - Vstack Constant Simplification Plan - -`-O1` needs a tiny constant simplification path that improves obvious scalar -code without pulling the O2 SSA/value-optimization machinery into the fast -backend tier. The right first home is the public CG API value stack in -`src/api/cg.c`, beside the existing delayed compare representation used by -compare-consuming branches. - -This plan covers integer constant folding and local expression delay in the -vstack. It intentionally avoids global propagation, CFG-wide reasoning, and -target-specific peepholes. - -## Status - -Implemented: - -- immediate-only integer binop folding for flag-free integer operations up to - 64 bits; -- immediate-only integer unop folding for flag-free integer operations up to - 64 bits; -- immediate-only integer and pointer-like compare folding; -- scalar immediate return preservation through `cfree_cg_ret`; -- CG API shape coverage for O1 literal folds; -- delayed-compare materialization now reuses only owned operand registers; -- delayed `SV_ARITH` for unary and binary arithmetic; -- expression-local arithmetic chain folding; -- straight-line local constant shadowing with conservative boundary - invalidation; -- disassembly and metrics refresh for the phase 5 probes; -- `ApiSValue` delayed compare/arithmetic payloads are now unioned; -- local-shadow invalidation is routed through named memory, control, and - address-taking boundary helpers; -- delayed arithmetic has a direct register-pressure materialization regression - test. - -Remaining: - -- consider a shadow generation counter if repeated `clear_all` scans show up in - future O1 metrics. - -## Current Shape - -The API value stack currently carries: - -- `SV_OPERAND`: an immediate, register, local, global, or indirect operand; -- `SV_CMP`: a delayed integer compare with source operands and register - ownership bits. - -Integer constants enter the stack as `SV_OPERAND` plus `OPK_IMM`. - -Integer compare already has the desired delayed shape: - -```text -push a -push b -int_cmp -> pushes SV_CMP, emits no boolean materialization -branch_true -> consumes SV_CMP as target->cmp_branch -``` - -This avoids the old branch-consumer shape: - -```text -cmp -cset -cmp #0 -b.cond -``` - -Arithmetic now has a first immediate-only fold. For flag-free integer -operations where both operands are `OPK_IMM`, `api_cg_binop` and -`api_cg_unop` fold before allocating a result register. `api_cg_cmp` likewise -folds immediate-only integer and pointer-like compares. Non-immediate -arithmetic still materializes immediately; delayed arithmetic is the next -planned step. - -The visible O1 quality gap is: - -```c -int x = 40; -x += 2; -return x; -``` - -The expression part can be fixed by immediate folding, but the full example -also crosses a local store/load. That requires a narrow local constant shadow, -not just expression-stack folding. - -## Goals - -- Fold immediate-only integer arithmetic before register allocation. -- Delay simple integer arithmetic on the vstack when doing so can avoid - materialization or combine a following immediate. -- Preserve the existing `SV_CMP` compare-branch behavior. -- Improve straight-line scalar-local code such as `x = 40; x = x + 2`. -- Keep the implementation target-neutral and bounded. -- Do not introduce VLAs or global state. -- Keep all new state attached to `CfreeCg`, `ApiSValue`, or `ApiSourceLocal`. - -## Non-Goals - -- No SSA construction. -- No global constant propagation. -- No CFG-wide dataflow. -- No folding across labels, jumps, branches, calls, volatile accesses, inline - asm, or unknown memory effects. -- No FP constant folding in this pass. -- No i128 folding until there is a deliberate wide-integer helper. -- No target-specific immediate legality checks in the vstack. The target - backend already decides whether an `OPK_IMM` source fits an instruction - encoding or must be materialized. - -## Design - -### 1. Immediate-Only Integer Folding - -Implemented in `src/api/cg.c` with helpers shaped as: - -```c -static int api_try_fold_int_binop(CfreeCg*, BinOp, CfreeCgTypeId, - i64 a, i64 b, i64* out); -static int api_try_fold_int_unop(CfreeCg*, UnOp, CfreeCgTypeId, - i64 a, i64* out); -static int api_try_fold_int_cmp(CfreeCg*, CmpOp, CfreeCgTypeId, - i64 a, i64 b, i64* out); -``` - -The binop/unop paths run before allocating a destination register. If both -source operands are `OPK_IMM` and the operation is supported, they push a new -`OPK_IMM` result and release the inputs. - -Implemented operations: - -- `BO_IADD`, `BO_ISUB`, `BO_IMUL`; -- `BO_AND`, `BO_OR`, `BO_XOR`; -- `BO_SHL`, `BO_SHR_U`, `BO_SHR_S`; -- `UO_NEG`, `UO_BNOT`, `UO_NOT`; -- integer compares. - -Still skipped: - -- signed and unsigned division/remainder, to avoid changing divide-by-zero - behavior and signed-overflow corner cases in the first patch; -- any operation with trap, saturating, or exact/nowrap flags until flags are - actually modeled by the vstack fold; -- FP operations; -- i128 types. - -Evaluation rules: - -- Use unsigned arithmetic internally where possible so host signed overflow is - not invoked. -- Mask the result to the destination integer width for widths below 64. -- For arithmetic right shift, sign-extend the masked input before shifting. -- Normalize `_Bool` results to `0` or `1`. -- Treat pointer-typed `OPK_IMM` only as integer-sized values for compares. - Do not fold pointer arithmetic here. - -### 2. Immediate Return Preservation - -Implemented. `cfree_cg_ret` preserves non-aggregate scalar `OPK_IMM` values by -passing them directly to `target->ret`. - -Required behavior: - -```text -return 40 + 2 - -> vstack folds to OPK_IMM 42 - -> ret receives OPK_IMM 42 - -> backend emits only return-value materialization -``` - -If the return value is delayed arithmetic or any non-immediate expression, it -can still materialize through the normal register path. - -### 3. Delayed Arithmetic SValues - -Add a generic delayed arithmetic stack value after the immediate-only fold is -stable. It should cover both unary and binary integer arithmetic, with an -operation-kind discriminator and an optional second operand. - -Suggested shape: - -```c -typedef enum ApiDelayedArithKind { - API_DELAYED_UNOP, - API_DELAYED_BINOP, -} ApiDelayedArithKind; - -typedef enum ApiSValueKind { - SV_OPERAND, - SV_CMP, - SV_ARITH, -} ApiSValueKind; - -typedef struct ApiSValue { - Operand op; - Operand cmp_a; - Operand cmp_b; - Operand arith_a; - Operand arith_b; - CfreeCgTypeId type; - CmpOp cmp_op; - BinOp arith_bin_op; - UnOp arith_un_op; - u8 kind; - u8 arith_kind; - u8 res; - u8 pinned; - u8 lvalue; - u8 cmp_a_owned; - u8 cmp_b_owned; - u8 arith_a_owned; - u8 arith_b_owned; - FrameSlot spill_slot; - CfreeCgLocal source_local; -} ApiSValue; -``` - -If struct size becomes a concern, the compare and arithmetic payloads can be -moved into a small union. Keep the first patch simple unless stack footprint -shows up in measurements. - -Materialization mirrors `SV_CMP`: - -- `api_ensure_reg` dispatches `SV_ARITH` to `api_materialize_arith_to`. -- `api_materialize_arith_to` calls `target->unop` when - `arith_kind == API_DELAYED_UNOP` and `target->binop` when - `arith_kind == API_DELAYED_BINOP`. -- `api_release` releases owned operand registers for `SV_ARITH`. -- `api_pick_victim` ignores delayed non-register values. -- `api_alloc_reg_or_spill` may materialize a delayed arithmetic value if no - normal spill victim exists, just as it does for delayed compares. - -Creation rules: - -- Only create `SV_ARITH` for integer-class operations. -- For binary operations, require at least one non-immediate source. - Immediate-only cases fold to `OPK_IMM` instead. -- For unary operations, delay only non-immediate sources. Immediate-only cases - fold to `OPK_IMM` instead. -- Own registers only when the source `ApiSValue` owns that register, following - the existing compare ownership pattern. -- Never create delayed arithmetic for lvalues, aggregates, FP values, calls, or - memory operands that need a load. -- Do not encode unary operations as synthetic binary operations. For example, - keep `-x` as `UO_NEG`, `~x` as `UO_BNOT`, and `!x` as `UO_NOT` or a delayed - compare only when the consumer naturally wants compare semantics. - -### 4. Local Arithmetic Chain Folding - -Once `SV_ARITH` exists, fold cheap expression-local chains before -materializing: - -```text -(x + C1) + C2 -> x + (C1 + C2) -(x + C1) - C2 -> x + (C1 - C2) -(x - C1) + C2 -> x + (C2 - C1) -(x - C1) - C2 -> x - (C1 + C2) -(x ^ C1) ^ C2 -> x ^ (C1 ^ C2) -(x & C1) & C2 -> x & (C1 & C2) -(x | C1) | C2 -> x | (C1 | C2) -~~x -> x -!!x -> x != 0, if the result is consumed as a branch or compare-like boolean -``` - -Keep this conservative: - -- only one delayed level; -- only one real non-immediate operand; -- no reassociation across non-immediate operands; -- no multiply reassociation initially; -- no transforms that depend on signed-overflow assumptions. -- no unary chain transform that changes C truth-value semantics for non-boolean - integer results. - -If the folded constant becomes an identity, collapse the result: - -- `x + 0`, `x - 0`, `x ^ 0`, `x | 0` -> `x`; -- `x & -1` -> `x` after width masking; -- `x * 1` -> `x` if multiply folding is later added; -- `x * 0` -> `0` only when flags and side effects are known safe. - -### 5. Straight-Line Local Constant Shadow - -To handle `int x = 40; x += 2; return x;`, track constants for source locals -that are eligible for register-local treatment. - -Extend `ApiSourceLocal`: - -```c -typedef struct ApiSourceLocal { - ... - i64 const_value; - u8 const_valid; - u8 pad[2]; -} ApiSourceLocal; -``` - -Rules: - -- A local can be shadowed only if it is scalar integer or pointer typed. -- Address-taken locals are not shadowed. -- Volatile accesses are not shadowed. -- Parameters start unknown. -- On store of an immediate to an eligible source local, set the shadow value. -- On store of a non-immediate to that local, clear the shadow value. -- On partial-width or mismatched-type access to that local, clear/do not use the - shadow value. -- On load of an eligible source local with a valid shadow, push `OPK_IMM` - instead of loading/materializing the local. -- On taking a local address, clear the local's shadow. -- On unknown memory or control-flow boundary, clear all shadows. - -Clear all local shadows at: - -- label placement; -- unconditional jump; -- conditional branch; -- switch; -- scope break/continue control transfer if it can leave the current block; -- call, unless later alias analysis proves the call cannot observe the local; -- inline asm; -- atomic operations and fences; -- direct stores through `OPK_INDIRECT`; -- address-taking of any shadowed local. - -This is intentionally a basic-block-local cache. It is not a dataflow fact. - -### 6. Interaction With Existing Register Locals - -The CG API already treats some scalar autos as source-register lvalues when the -target returns `CG_LOCAL_STORAGE_REG`. Loads from those locals can avoid memory. -The constant shadow sits above that: - -```text -local has valid constant - -> push OPK_IMM -else local stored in register - -> push register-backed rvalue -else - -> normal frame load path -``` - -Stores to register locals still update the real storage when necessary. The -shadow only affects subsequent vstack reads in the same straight-line region. - -### 7. Boundary And Correctness Rules - -Every helper that consumes stack values must understand the new delayed value: - -- `api_release`; -- `api_ensure_reg`; -- `api_force_reg`; -- `api_class_of_sv`; -- `api_reg_of_sv`; -- `api_owned_reg_type`; -- `api_pick_victim`; -- `cfree_cg_dup`; -- `cfree_cg_store`; -- call argument preparation; -- return handling; -- branch handling. - -If an operation does not explicitly support delayed arithmetic, it should force -materialization first. That keeps behavior local and avoids subtle ownership -bugs. - -Do not fold across volatile memory. `CfreeCgMemAccess.flags` already carries -`CFREE_CG_MEM_VOLATILE`; the local shadow and vstack folds must respect it. - -Do not depend on C signed-overflow undefined behavior. The CG API integer types -are width-only, and the existing frontend currently passes `CFREE_CG_INTOP_NONE` -for ordinary integer arithmetic. The fold should preserve the backend's -current wrap-like behavior unless flags explicitly request something else. - -## Implementation Phases - -### Phase 1: Red Tests For Literal Folding - -Done. The CG API test now covers O1 literal fold shapes for: - -- direct API function returning `40 + 2`; -- direct API function returning `(5 << 3) | 2`; -- direct API compare through `40 + 2 == 42`; -- direct API unary folding through `~~41 + 1`. - -Expected shape should check for fewer emitted arithmetic operations where the -existing test harness can observe it. Runtime correctness alone is not enough. - -### Phase 2: Immediate-Only Fold - -Done. The fold helpers are wired into: - -- `api_cg_binop`; -- `api_cg_unop`; -- `api_cg_cmp`; -- `cfree_cg_ret`. - -Validation run: - -```sh -make test-cg-api test-opt -CFREE_TEST_ALLOW_SKIP=1 make test-parse -``` - -Final observed parse result: - -```text -2496 pass, 0 fail, 4 skip -``` - -### Phase 3: Delayed Arithmetic - -Done. Added `SV_ARITH`, operation-kind metadata, ownership helpers, -materialization, release logic, and register-pressure materialization. - -Tests cover: - -- one delayed unary arithmetic value consumed by return; -- one delayed binary arithmetic value consumed by return; -- delayed arithmetic consumed by compare; -- delayed arithmetic consumed by another binop with immediate folding; -- delayed unary arithmetic consumed by another unop with chain folding; -- delayed arithmetic forced by store path; -- register-pressure materialization when no ordinary spill victim exists. - -Run: - -```sh -make test-cg-api -make test-opt -``` - -### Phase 4: Local Constant Shadow - -Done. Added local shadow state to `ApiSourceLocal` and invalidation helpers: - -```c -static void api_local_const_clear(ApiSourceLocal*); -static void api_local_const_clear_all(CfreeCg*); -static void api_local_const_memory_boundary(CfreeCg*); -static void api_local_const_control_boundary(CfreeCg*); -static void api_local_const_address_taken(CfreeCg*, CfreeCgLocal); -static int api_local_const_can_track(CfreeCg*, const ApiSourceLocal*, - CfreeCgMemAccess); -static void api_local_const_store(CfreeCg*, CfreeCgLocal, CfreeCgTypeId, i64); -static int api_local_const_load(CfreeCg*, CfreeCgLocal, CfreeCgTypeId, - Operand* out); -``` - -Wire store/load behavior through `cfree_cg_store` and `cfree_cg_load`. - -Wire invalidation through: - -- `cfree_cg_label_place`; -- `cfree_cg_jump`; -- `api_branch_if`; -- `cfree_cg_switch`; -- call paths; -- inline asm path; -- atomics/fences; -- indirect stores; -- local address-taking. - -Tests cover: - -- `int x = 40; x = x + 2; return x;`; -- shadow cleared by label/jump; -- shadow cleared by branch; -- shadow cleared by address-taking; -- shadow cleared by volatile access; -- shadow cleared by indirect store; -- partial-width stores do not preserve a full-local shadow; -- parameter locals are initially unknown. - -Run: - -```sh -make test-cg-api -make test-opt -make test-parse -``` - -If the change affects emitted target code shape, also run one targeted smoke -case per supported O1 target: - -```sh -make test-smoke-x64 -make test-aa64-inline -``` - -and an RV64 targeted case if available locally. - -### Future Refactors - -Completed cleanup from the original phase 3/4 implementation: - -- `SV_CMP` and `SV_ARITH` payload fields now live in an `ApiSValue` union. -- Local-shadow invalidation now goes through named boundary helpers: - `api_local_const_memory_boundary`, `api_local_const_control_boundary`, and - `api_local_const_address_taken`. -- `test/api/cg_type_test.c` includes an explicit delayed-arithmetic - register-pressure case that forces materialization when no ordinary spill - victim exists. - -Still worth considering before growing the vstack simplifier further: - -- Consider a shadow generation counter if whole-function local counts grow - enough for repeated `clear_all` scans to show up in O1 metrics. -- Consider one small builder helper in `test/api/cg_type_test.c` for creating - O1 one-parameter functions; the expanded constfold tests intentionally stayed - direct but now duplicate setup boilerplate. - -### Phase 5: Disassembly And Metrics - -Done. Re-ran the probe cases from `doc/OPT1.md`: - -- `6_5_17_compound_assign`; -- a direct literal return case; -- a compare/branch literal case; -- one local-address-taken case. - -For each of AArch64, x64, and RV64, recorded: - -- `.text` size; -- instruction count; -- `mov`/`load_imm` count; -- arithmetic instruction count; -- spills/reloads; -- rewrite inserted instruction count. - -Host JIT `run --time` samples recorded O1 wall time, live-range time, -regalloc time, spills/reloads, and rewrite inserted instruction count. The -current numbers are in `doc/PERF.md` under "Vstack Constfold Probe". - -## Expected Results - -For direct literal arithmetic: - -```c -int test_main(void) { return 40 + 2; } -``` - -O1 should emit only return-value materialization for `42`, with no separate add. - -For scalar locals: - -```c -int test_main(void) { - int x = 40; - x = x + 2; - return x; -} -``` - -O1 should avoid the load/add/store/reload chain in the straight-line case. The -best expected shape is return-value materialization for `42`, unless debug or -local-storage requirements intentionally force a store. - -For address-taken locals: - -```c -int test_main(void) { - int x = 40; - int* p = &x; - *p = *p + 2; - return x; -} -``` - -The local shadow must not fold through the pointer access. Correctness is more -important than preserving the constant. - -## Risks - -- Register ownership bugs in delayed arithmetic are the main implementation - risk. Reuse the compare ownership pattern and add tests where delayed - arithmetic values are forced, duplicated, dropped, and spilled. -- Local shadow invalidation can become unsound if it is treated as a dataflow - fact. Keep it basic-block-local and clear aggressively. -- Signed arithmetic folding can diverge from backend behavior if it uses host - signed overflow. Use unsigned operations plus width masking. -- Return-path changes can accidentally bypass ABI requirements for aggregate - returns. Only preserve immediates for non-aggregate scalar returns. -- Debug-info expectations for locals may eventually require materialized - storage even when constant-shadowed. Do not remove local declarations or - frame-slot allocation in this pass. - -## Acceptance Criteria - -- Immediate-only integer expressions fold before backend allocation. -- Delayed compare-branch behavior remains covered and unchanged. -- Delayed arithmetic materializes correctly in all consumers. -- Straight-line scalar-local constants fold in the common `x = 40; x += 2` - shape. -- Shadowed constants are cleared across control-flow, volatile, indirect - memory, call, asm, atomic, and address-taking boundaries. -- Targeted tests pass for CG API, O1 opt, and parser coverage. -- Disassembly shows fewer arithmetic/materialization instructions in the probe - cases without introducing target-specific vstack logic. diff --git a/doc/MIR_RA_REPORT.md b/doc/MIR_RA_REPORT.md @@ -1,933 +0,0 @@ -# MIR Data Structures and Register Allocation Report - -This report summarizes the data structures and algorithms in MIR that matter -most for cfree's optimizer direction. The emphasis is structural: cfree should -move toward MIR's allocation shape before spending more effort on small -localized speedups in the current dense-conflict implementation. - -Primary MIR source references: - -- `~/tmp/mir/mir-gen.c` -- `~/tmp/mir/mir.h` -- `~/tmp/mir/mir-bitmap.h` -- `~/tmp/mir/mir-varr.h` -- `~/tmp/mir/mir-dlist.h` - -Related cfree notes: - -- `doc/OPT.md`: broad MIR-informed optimizer plan. -- `doc/PERF.md`: current timing and scaling data. - -## Executive Summary - -MIR's register allocation pipeline is live-range based. It does not allocate by -building a full pseudo-register interference matrix. MIR computes block live -sets, converts them into compressed live ranges, assigns physical locations by -checking occupied hard-register and stack locations at live-range program -points, then rewrites instructions. - -cfree's original O1 path was conflict-matrix based. `opt_live_info` built a -dense `nvals x words` pseudo interference matrix, stored per-instruction full -`live_after` bitsets, and scanned `1..nvals` in hot loops. `doc/PERF.md` shows -why that structure was the scaling problem for one large function. - -The main recommendation is to reshape cfree around MIR's allocation model: - -1. Separate block liveness from register-allocation analysis. -2. Build live ranges from block liveness. -3. Allocate against per-program-point occupied physical locations. -4. Keep conflict matrices only for bounded, optional coalescing. -5. Rewrite pseudos after assignment, with later live-range splitting for O2. - -This is a structural change, not an incremental optimization of the current -matrix. - -As of the current O1 implementation, the dense pseudo-conflict allocator is no -longer the normal path. O1 now has pass-local block liveness, compressed live -ranges, split hard-register and stack-slot occupancy, assignment-map rewrite, -per-location interval occupancy, and metrics for live/range/allocation/rewrite -event traffic. The remaining work is no longer "delete the old matrix" or -"replace O1 point-index occupancy"; those cleanup steps are complete. The next -allocator work should move to O2 coalescing/splitting and generated-code -quality, while keeping O1's fast MIR-shaped path as the baseline. - -## MIR Container Layer - -MIR uses small generic data-structure building blocks throughout `mir-gen.c`. -They are simple but important because they determine how passes are organized. - -### VARR - -`mir-varr.h` defines typed growable arrays. A `VARR(T)` stores: - -- element count -- capacity -- element pointer -- allocator - -Growth uses `realloc` and increases capacity by about 1.5x. MIR uses `VARR` -for pass-owned vectors such as temporary instruction lists, live-range tables, -sorted allocation candidates, and per-point used-location bitmaps. - -cfree already has arena-backed arrays and typed vector macros in several -subsystems. We should not copy `VARR` mechanically, but we should mirror the -ownership pattern: pass state owns resizable vectors explicitly instead of -embedding all allocation artifacts permanently in `Func`. - -### DLIST - -`mir-dlist.h` defines intrusive typed doubly linked lists. MIR functions and -basic blocks use linked instruction lists because passes insert, delete, and -move instructions frequently: - -- function instruction list: `MIR_func.insns` -- CFG block list: `func_cfg.bbs` -- block instruction list: `bb.bb_insns` -- edge lists: `bb.in_edges`, `bb.out_edges` - -cfree currently stores each `Block` as a contiguous `Inst*` array. That is good -for replay and scanning, but less natural for MIR-style lowering, rewrite, -spill insertion, edge splitting, and post-RA combine. For the first structural -rewrite we can keep arrays if we introduce a pass-local edit buffer, but the -final shape should support cheap insertion around instructions and on block -boundaries. - -### Bitmap - -`mir-bitmap.h` is a growable sparse-ish bitmap built on `VARR<uint64_t>`. -Important details: - -- It expands only as far as the highest needed bit. -- `bitmap_clear` truncates length to zero. -- binary/ternary operations trim trailing zero words. -- iteration uses `FOREACH_BITMAP_BIT` to visit set bits. - -This is a key contrast with cfree's current fixed-width bitsets. In cfree, -`opt_live_words = ceil(nvals / 64)`, and every live set for the function has -that width. MIR's bitmaps still use dense words internally, but their active -length tracks actual high bits, and many loops iterate set bits. - -cfree should add an optimizer bitmap abstraction before rewriting RA. It should -support: - -- clear without touching max capacity -- copy/ior/and-not with change reporting -- set-bit iteration -- optional trimming of trailing zero words - -The exact storage can remain arena-backed for pass lifetime, but the API should -hide fixed full-function width from users. - -## MIR Function and CFG Shape - -MIR's generator owns a `gen_ctx` with pass subcontexts: - -- target context -- data-flow context -- SSA context -- GVN context -- live-range context -- coalesce context -- RA context -- combine context - -This keeps pass state out of the public MIR function structure. The current -function being generated stores a `func_cfg` in `func_item->data` during -generation. - -### Instructions - -`MIR_insn` is a variable-sized object: - -```c -struct MIR_insn { - void *data; - DLIST_LINK(MIR_insn_t) insn_link; - MIR_insn_code_t code : 32; - unsigned int nops : 32; - MIR_op_t ops[1]; -}; -``` - -`data` points to pass-local metadata, usually `bb_insn` after CFG construction. -This lets MIR keep the IR instruction object stable while attaching and -replacing analysis state. - -cfree's `Inst` is a fixed struct with side arrays for operands and aux data. -That is fine for recording, but pass state should move out of `Inst`, `Block`, -and `Func` where possible. In particular, liveness, ranges, allocation -assignment, and rewrite scratch state should be pass contexts, not persistent -fields scattered through `Func`. - -### Basic Blocks - -MIR `bb` contains: - -- CFG links: in/out edge lists -- instruction list -- flags and reachability -- four data-flow bitmaps: `in`, `out`, `gen`, `kill` -- dominator bitmaps reused by several passes -- loop-node pointer -- max integer and FP pressure - -MIR reuses the bitmap fields for different data-flow problems through local -macros such as `live_in`, `live_out`, `live_gen`, and `live_kill`. - -cfree currently stores `live_in`, `live_out`, `live_use`, `live_def`, and -`live_after` directly on `Block`. That ties block storage to one analysis. The -MIR-like direction is to keep persistent block structure focused on CFG and -instruction ownership, and keep data-flow bitmaps in pass-local arrays indexed -by block id. - -## MIR Pipeline Shape - -For the allocation-related portion, MIR runs: - -1. `target_machinize` -2. build loop tree for O1+ -3. optional move collection and coalescing for O2 -4. full live info -5. `reg_alloc` -6. rewrite -7. optional splitting for O2 -8. post-RA combine -9. post-combine DCE - -This split is more important than individual implementation details. In MIR, -liveness is a reusable input to multiple later phases, but register allocation -does not depend on a full pseudo interference graph. - -For cfree, O1 should become: - -```text -build_cfg -machinize -build_loop_tree -live_blocks -build_live_ranges -assign_locations(no splitting) -rewrite_locations -combine -dce -emit -``` - -O2 can add coalescing, splitting, SSA value passes, and richer cleanup after -this shape is stable. - -## MIR Liveness - -MIR's liveness is block-first: - -1. Walk each block backward to compute `live_gen` and `live_kill`. -2. Solve backward data-flow: - - `live_out = union(successor.live_in)` - - `live_in = live_gen | (live_out & ~live_kill)` -3. Track register frequency and block pressure while scanning. -4. Iterate set bits for live-out pressure accounting. - -MIR also has a `scan_vars` mode. Normally `scan_vars_num == 0`, meaning all -variables participate. For move coalescing, MIR can restrict liveness to only -move-related variables. - -cfree should copy this separation. The first replacement for `opt_live_info` -should not build: - -- `val_conflicts` -- per-instruction `live_after` -- spill/rewrite data - -It should only produce block live-in/live-out, use/def, frequency, and pressure -inputs. Later passes can build the extra views they need. - -## MIR Live Ranges - -MIR represents each variable as a linked list of live ranges: - -```c -struct live_range { - lr_bb_t lr_bb; - int start, finish; - int ref_cost; - live_range_t next; -}; -``` - -The range builder walks blocks backward from `live_out`, just like liveness, -but emits range start/end points. It also records whole-block ranges using -`lr_bb` for values that are live through a block without being referenced -inside it. - -MIR then compresses live-range points: - -- gather all born/dead points -- map old instruction points to a smaller point space -- merge adjacent ranges where legal - -This compression matters. It reduces the size of the later `used_locs` -structure and avoids making the allocator pay for every instruction position -when only lifetime boundaries matter. - -cfree should introduce: - -```c -typedef struct OptLiveRange { - Val val; - u32 start; - u32 end; - u32 next; - u32 first_block_span; /* optional: whole-block span list */ -} OptLiveRange; - -typedef struct OptLiveRangeSet { - OptLiveRange* ranges; - u32 nranges; - u32* first_range_by_preg; - u32 point_count; -} OptLiveRangeSet; -``` - -The exact fields can change, but the structural point should not: allocation -should reason from ranges, not from a dense value-value conflict matrix. - -## MIR Register Allocation - -MIR allocation has three core pieces: - -### Candidate Ordering - -MIR builds `sorted_regs` from pseudo regs, computes live length from ranges, -and sorts allocation candidates. The priority includes tied hard-register -requirements, frequency, and live length. - -cfree already has a similar priority idea in `val_higher_priority`. The -problem is not ordering. The problem is what each candidate checks against. - -### Occupied Locations - -MIR maintains: - -- `used_locs`: vector of bitmaps indexed by compressed program point -- `busy_used_locs`: extra occupancy for split-capable allocation -- `conflict_locs1`: temporary bitmap of occupied physical locations - -For each pseudo range, MIR unions the occupied locations for all points in the -range into `conflict_locs`, then asks for a hard register or stack slot not in -that set. - -This is the central data-structure difference: - -- MIR: pseudo -> live ranges -> occupied hard locations by point -- cfree today: pseudo -> dense conflicts with other pseudos -> scan assigned - pseudos to test hard-register conflicts - -The MIR shape bounds allocation checks by range length and physical-location -bitmap width, not by all pseudo pairs. It is a better fit for large generated -functions where many values never overlap. - -### Rewrite - -After assignment, MIR rewrites instruction operands: - -- hard-register assigned pseudos become hard regs -- stack-assigned pseudos become loads/stores around uses/defs -- some spilled uses can become memory operands directly if target-legal -- call-clobbered live regs are saved/restored -- noop moves are deleted - -cfree's current rewrite is simpler and tightly coupled to `live_after`. A -MIR-like cfree rewrite should walk each block backward with a rolling live set, -using allocation assignment and target scratch registers. It should not require -persisting a full live-after bitmap for every instruction. - -## MIR Conflict Matrix Use - -MIR still has a conflict matrix, but only for coalescing. It is intentionally -narrow: - -- collect moves -- collect only source/destination regs from those moves -- cap the number at `MIR_MAX_COALESCE_VARS` -- build a matrix over this reduced scan-var space -- coalesce non-conflicting move-related regs - -The comment in MIR says 10K scan vars is about 8 MB for the coalescing matrix. -That cap is a design signal: even MIR avoids an unbounded full pseudo matrix. - -cfree should follow this model: - -- no full matrix in normal O1 allocation -- optional move-only matrix in O2 coalescing -- hard cap and metric counters for matrix size - -## cfree Mismatch Status - -The initial O1 implementation had five major structural mismatches: - -1. `opt_live_info` was overloaded. - It computed block liveness, per-value summaries, per-instruction live-after, - conflict matrix, call live-across summaries, and asm constraints in one pass. - This has been split into pass-local liveness, range construction, - allocation, and rewrite state. - -2. DDE paid for full allocation analysis. - Pre-DDE no longer builds allocation ranges or conflicts. It still runs block - liveness before DDE, and O1 runs block liveness again for regalloc after DDE - changes the instruction stream. Avoiding or reusing that second pass is a - remaining O1 cleanup item, not the main current bottleneck. - -3. Dense conflicts were the allocator's core dependency. - The normal O1 allocator no longer allocates or consumes `Func.val_conflicts`. - Dense matrices should stay out of O1 and return only as bounded, optional - move-coalescing support in O2. - -4. Hot loops scanned all vals. - Many of these loops now use operand-reference events, set-bit iteration, or - touched-value/generation lists. The final O1 cleanup removed the remaining - high-watermark live-across-call/rewrite scans that showed up in spill-heavy - wide functions. - -5. `Block.live_after` stored a full live bitmap per instruction. - Normal O1 no longer persists per-instruction full live-after storage. - Rewrite and DDE walk backward with rolling live state, and rewrite's - live-word traffic is now linear in the focused spill-pressure ladder. - -6. Point-index occupancy loops remained after the matrix was removed. - Normal O1 allocation now stores sorted live intervals per physical - hard-register and per stack spill slot. Assignment checks interval overlap - for each candidate range instead of OR-ing occupied-location bitmap rows for - every compressed point. - -## Proposed cfree Structural Target - -Introduce pass-local optimizer state: - -```c -typedef struct OptBitset OptBitset; -typedef struct OptDataflow OptDataflow; -typedef struct OptLiveRanges OptLiveRanges; -typedef struct OptAllocator OptAllocator; -typedef struct OptRewrite OptRewrite; - -typedef struct OptPassCtx { - Compiler* c; - Func* f; - Arena* scratch; - OptDataflow live; - OptLiveRanges ranges; - OptAllocator alloc; -} OptPassCtx; -``` - -Move these out of persistent `Func` where possible: - -- `live_in`, `live_out`, `live_use`, `live_def` -- `live_after` -- `val_conflicts` -- allocation scratch arrays - -Keep these or equivalents as persistent-enough function metadata: - -- value type/class tables -- block/instruction ownership -- CFG edges -- frame slots -- target hard-register availability - -Allocation result should be a separate mapping: - -```c -typedef enum OptLocKind { - OPT_LOC_NONE, - OPT_LOC_HARD, - OPT_LOC_STACK, -} OptLocKind; - -typedef struct OptLoc { - u8 kind; - u8 cls; - Reg hard_reg; - FrameSlot spill_slot; -} OptLoc; -``` - -Rewrite consumes `OptLoc[val]` and emits final target operands. - -## Implementation Direction - -This is the recommended structural sequence. - -### Phase 1: Data Structure Extraction - -Add pass-local bitset and data-flow structures. Keep current behavior running, -but make new liveness code independent of `Block.live_*` fields. - -Deliverables: - -- `pass_live.c` owns block liveness storage. -- bitset supports set-bit iteration and trimmed operations. -- DDE uses block liveness only. -- metrics count live words, active words, block live bytes, and set-bit scans. - -### Phase 2: Live Range Builder - -Build live ranges from block live sets. Do not allocate from them yet. - -Deliverables: - -- range list per value -- compressed point numbering -- metrics for point count, range count, average ranges per value, max range - length, and whole-block spans -- dump output for range tests - -### Phase 3: Range-Based O1 Allocator - -Replace dense-conflict allocation with MIR-style occupied-location assignment. -No live-range splitting yet. - -Deliverables: - -- `used_locs[point]` bitmaps for hard regs and stack slots -- candidate ordering by tied hard reg, frequency, live length, stable id -- hard-register assignment when possible -- stack assignment otherwise -- no `val_conflicts` allocation in normal O1 - -### Phase 4: Rewrite Without `live_after` - -Rewrite by walking blocks and instructions with rolling live state plus -assignment maps. - -Deliverables: - -- hard-reg operand rewrite -- stack reload/store insertion -- scratch-reg handling by target class -- call-clobber preservation -- noop move deletion -- no required per-instruction full live-after storage - -### Phase 5: O1 Cleanup and Scaling - -After O1 has the right shape, remove transition scaffolding and fix the -remaining O1 scaling buckets before adding O2 allocator features: - -- rewrite without extra per-instruction temporary list churn: done -- delete the old dense liveness/conflict path from the normal O1 route: done -- narrow or compress `used_locs` so stack occupancy does not scale as - `points * candidates`: done via split hard/stack occupancy and then - per-location interval sets - -Only after that should O2 add move collection, move-only liveness, capped -coalescing matrices, live-range splitting, and edge/block spill placement. This -mirrors MIR's division: the matrix is an optional coalescing tool, not the -allocator's main model. - -## Implementation Checklist - -Use this as the working tracker for the structural rewrite. Each phase should -land with focused tests and metrics before moving to the next phase. - -Compatibility policy: compatibility with the old O1 optimizer internals is not -a goal. The public compiler/API behavior must stay correct, but internal helper -APIs, dumps, tests, metrics, and persistent IR fields may be broken, renamed, or -deleted when they preserve the dense-conflict shape or add runtime cost. Prefer -a clean fast implementation over shims for old pass boundaries. - -### Phase 0: Baseline and Guardrails - -- [x] Preserve current O0 behavior as the reference path. -- [x] Keep existing O1 tests green before structural changes begin. -- [x] Add or identify stress inputs for: - - [x] one large straight-line function - - [x] many small functions - - [x] branch liveness - - [x] call-clobber preservation - - [x] spills - - [x] tied hard registers from inline asm -- [x] Keep `cfree run --time -O1` metrics available throughout the rewrite. -- [x] Add metrics names for the new shape before deleting old counters: - - [x] `opt.live.blocks` - - [x] `opt.live.active_words` - - [x] `opt.live.bitset_words_touched` - - [x] `opt.live.dataflow_iterations` - - [x] `opt.live.dataflow_block_visits` - - [x] `opt.ranges` - - [x] `opt.range_points` - - [x] `opt.range.point_visits` - - [x] `opt.range.preg_scans` - - [x] `opt.range.live_words_touched` - - [x] `opt.alloc.used_loc_words` - - [x] `opt.alloc.hard_loc_words` - - [x] `opt.alloc.stack_loc_words` - - [x] `opt.alloc.stack_slots` - - [x] `opt.alloc.hard_point_visits` - - [x] `opt.alloc.stack_point_visits` - - [x] `opt.alloc.hard_word_ors` - - [x] `opt.alloc.stack_word_ors` - - [x] `opt.alloc.hard_mark_points` - - [x] `opt.alloc.stack_mark_points` - - [x] `opt.alloc.spills` - - [x] `opt.dde.live_words_touched` - - [x] `opt.rewrite.reloads` - - [x] `opt.rewrite.stores` - - [x] `opt.rewrite.inserted_insts` - - [x] `opt.rewrite.live_words_touched` - -Implemented by `test/opt/phase0_guardrails.sh`, wired into `make test-opt`. -The script runs generated O0/O1 guardrail programs for the stress shapes above -and checks that `cfree run --time -O1` emits the reserved metric names. Inline -asm tied-register stress is identified by the existing parser case -`test/parse/cases/asm_01_grammar.c`; allocator-specific inline-asm execution -coverage belongs in a later pass once the range allocator owns rewrite. - -### Phase 1: Pass-Local Liveness - -- [x] Add an optimizer bitset abstraction with: - - [x] set/clear/test - - [x] copy - - [x] union - - [x] union-and-not - - [x] change reporting - - [x] set-bit iteration - - [x] active-word tracking or trailing-zero trimming -- [x] Add pass-local block liveness storage. -- [x] Move block `use`/`def` calculation into the new liveness module. -- [x] Compute `live_in`/`live_out` without allocating conflicts. -- [x] Make pre-DDE use only block liveness. -- [x] Keep old full `opt_live_info` available for regalloc during transition. - This transition debt was retired in phase 6. -- [x] Add dump output for block liveness. -- [x] Add tests for branch, loop, and call liveness. - -Exit criteria: - -- [x] Existing public O1 behavior is unchanged. -- [x] Pre-DDE no longer builds `val_conflicts`. -- [x] Metrics show separate block-liveness timing and no pre-DDE conflict bytes. - -### Phase 2: Live Range Model - -- [x] Define `OptLiveRange` and `OptLiveRangeSet`. -- [x] Build ranges from block `live_out` with a backward block walk. -- [x] Track per-value: - - [x] first range - - [x] live length - - [x] frequency/spill cost - - [x] live-across-call information -- [x] Add compressed point numbering. -- [x] Add whole-block span representation or an equivalent compact model. -- [x] Add range dump output. -- [x] Add range metrics: - - [x] total ranges - - [x] total points before compression - - [x] total points after compression - - [x] max ranges per value - - [x] max live length - -Exit criteria: - -- [x] Range dumps match expected lifetimes on small hand-written cases. -- [x] Range construction works without using `Block.live_after`. -- [x] Current allocator still works while range data is built side by side. - -### Phase 3: Range-Based Assignment - -- [x] Define `OptLoc` assignment mapping for every `Val`. -- [x] Build allocation candidate list from values with ranges. -- [x] Sort candidates by: - - [x] tied hard-register requirement - - [x] frequency/spill cost - - [x] shorter live length - - [x] stable value id -- [x] Add `used_locs[point]` bitmaps. -- [x] Mark hard-register live ranges in `used_locs`. -- [x] For each candidate, union occupied locations across its ranges. -- [x] Assign a legal hard register when available. -- [x] Assign/reuse stack slots when no hard register is available. -- [x] Preserve tied hard-register validation. -- [x] Preserve forbidden hard-register constraints from asm/calls. -- [x] Keep live-range splitting disabled for O1. - -Exit criteria: - -- [x] O1 regalloc no longer depends on `val_conflicts`. -- [x] `opt.conflict_bytes` is zero or absent on the normal O1 allocation path. -- [x] Branch, loop, call-clobber, spill, and tied-register tests pass. - -### Phase 4: Rewrite From Assignment Map - -- [x] Rewrite hard-assigned values to hard registers. -- [x] Insert reloads for stack-assigned uses. -- [x] Insert stores for stack-assigned defs. -- [x] Reserve target scratch registers per class. -- [x] Handle operands nested in: - - [x] normal operands - - [x] indirect operands - - [x] call descriptors - - [x] return descriptors - - [x] intrinsic descriptors - - [x] inline asm descriptors -- [x] Preserve call-clobber saves/restores with a rolling live set. -- [x] Delete noop moves after rewrite. -- [x] Avoid persistent per-instruction full `live_after` storage. -- [x] Add rewrite dump output. - -Exit criteria: - -- [x] O1 emits correct code without `Block.live_after`. -- [x] Reload/store metrics are stable and understandable. -- [x] Public O1 behavior and smoke JIT tests pass. - -### Phase 5: Reduce Phase-4 Rewrite Overhead - -Phase-4 timings show the no-`live_after` rewrite improves the large -straight-line case but regresses many-small-function/function-table inputs. Keep -the rolling-live design, but remove the temporary-list and reverse-copy cost -introduced by the first implementation. - -- [x] Rewrite blocks with a single block-local output buffer or pre-sized edit - buffer. -- [x] Avoid per-instruction `RewriteList` allocation/copy churn. -- [x] Preserve rolling-live call-clobber behavior without persistent - `Block.live_after`. -- [x] Keep scratch-register selection deterministic after the rewrite cleanup. -- [x] Update tests that asserted old internal storage or exact intermediate - dump details. -- [ ] Re-run the same-input phase-3 vs phase-5 timing comparison from - `doc/PERF.md`. - -Exit criteria: - -- [ ] Function-table/main-many-small `opt.regalloc` returns to phase-3 parity - or better. -- [ ] Large straight-line phase-4 gains are retained. -- [x] No persistent per-instruction full live-after sets are rebuilt in normal - O1. - -Implemented by rewriting each block into one backward-filled block output -buffer. Per-instruction before/after insertion lists are reused per block, and -the final reverse-copy pass is gone. - -### Phase 6: Delete Old Dense Path - -- [x] Remove normal-path allocation of `Func.val_conflicts`. -- [x] Delete old all-values conflict construction from O1, including - compatibility-only helpers and tests. -- [x] Remove or demote `opt_live_words`, `opt_conflict_words`, - `Block.live_in`, `Block.live_out`, `Block.live_use`, `Block.live_def`, - and `Block.live_after` from persistent IR state where no public behavior - requires them. -- [x] Replace `opt_live_info` callers with pass-local liveness/ranges, then - delete or narrow `opt_live_info`. -- [x] Make tests assert public behavior and new pass dumps/metrics, not dense - conflict matrix details. -- [x] Update docs and metrics names to reflect range-based RA. - -Exit criteria: - -- [x] Normal O1 has no reachable dense pseudo-conflict construction. -- [x] The source tree no longer contains tests whose purpose is preserving the - old dense interference model. -- [x] `opt.conflict_bytes` is absent or always zero on normal O1 paths. - -Implemented by deleting `opt_live_info`, `Func.val_conflicts`, -`Func.opt_conflict_words`, and persistent block liveness/live-after fields. -Post-rewrite combine/DCE now build pass-local hard-register liveness when they -need block live-out information. - -### Phase 7: Narrow `used_locs` - -Phase-3 and phase-4 timings show the remaining large-function bend tracks -`opt.alloc.used_loc_words`. The first range allocator sizes occupied-location -bitmaps as hard-register bits plus one possible stack bit per candidate; that -preserves correctness but reintroduces superlinear memory/time for one large -function. - -- [x] Split hard-register occupancy from stack-slot occupancy. -- [x] Make stack occupancy demand-sized by allocated stack slots, not candidate - count. -- [x] Consider sparse or interval-indexed stack occupancy for long straight-line - functions. -- [x] Keep hard-register occupancy compact and class-aware. -- [x] Add metrics separating hard occupancy words, stack occupancy words, and - allocated stack slot count. -- [ ] Re-run straight-line and table-main ladders after each shape change. - -Exit criteria: - -- [ ] Large straight-line benchmark scales by range/point count, not - `points * candidates`, `nvals * live_words`, or dense conflict bytes. -- [ ] `doc/PERF.md` can be updated with before/after O1 scaling numbers. -- [ ] Table-main remains at phase-5 parity or better. - -Implemented with separate `hard_used_locs` and `stack_used_locs` allocator -tables. Hard occupancy remains `OPT_REG_CLASSES * 32` bits per point; stack -occupancy grows only when a new spill slot crosses a word boundary. The sparse -interval form is deferred until timing data shows it is needed. - -### Phase 8: O1 MIR-Shape Cleanup - -The range allocator has the right high-level shape, but current timings show -there is still O1 work to do before adding O2 coalescing or splitting. The next -areas of focus are the remaining places where cfree is less MIR-shaped than it -looks at the pass boundary: fixed-width live traffic, point-by-point occupancy -loops, and rewrite liveness work. - -Implemented in the current cleanup pass: - -- [x] Emit event counters for live, range, allocation, DDE, and rewrite work. -- [x] Remove the duplicate pre-DDE live-range build from the normal O1 path. -- [x] Replace DDE per-instruction full-width use/def bitsets with - operand-reference events. -- [x] Replace rewrite per-instruction full-width use/def bitsets with - operand-reference events. -- [x] Track open/touched range values with generation marks during live-range - construction. -- [x] Avoid assigning caller-saved registers to non-tied values that are known - to be live across calls. -- [x] Update `doc/PERF.md` with current O1 scaling data and counter ratios. - -Phase 8 cleanup status: - -- [x] Make `OptBitset` operations truly MIR-like: track active length, trim - trailing zero words, and bound copy/ior/and-not work by active words - rather than the function's maximum value id. -- [x] Reduce `opt.live_ranges.regalloc` on spill-heavy wide functions. - Current `opt.range.preg_scans` is linear, so investigate live-word - touches, range insertion/compression, final per-preg summaries, and - live-across-call bookkeeping in `src/opt/pass_live.c`. -- [x] Reduce `opt.rewrite.live_words_touched`. Rewrite no longer stores - `Block.live_after`, but its rolling live set still grows faster than - input on spill-heavy wide functions. -- [ ] Replace repeated point-index occupancy OR/mark loops with per-location - interval or event state once live-range/rewrite traffic is under control. - `opt.alloc.*point_visits` is linear now, but the absolute count is high. -- [ ] Decide whether the second O1 block-liveness pass can be avoided after - DDE, either by reusing/revalidating liveness or by having DDE produce - enough local update information. -- [x] Keep split hard-register and stack-slot occupancy. It fixed the old - `points * candidates` stack growth and should remain the O1 baseline. - -Exit criteria: - -- [x] Straight-line and table-shaped stress inputs remain near-linear at the - 512-to-1024 step. -- [x] Spill-pressure `opt.live_ranges.regalloc` and `opt.regalloc` stop - growing near `3x` on each input doubling. -- [x] `opt.rewrite.live_words_touched` no longer grows superlinearly on the - spill ladder. -- [x] O1 remains free of dense pseudo-conflict construction on the normal path. - -Implemented by active-length `OptBitset` copies, a rolling live-value list for -live-across-call range bookkeeping, and rewrite call-save filtering that only -checks hard-assigned caller-saved values known to be live across a call. The -remaining point-index occupancy loops are linear and kept as future O2-quality -work rather than an O1 blocker. - -### Phase 9: O2 Coalescing - -- [ ] Collect move candidates after machinize. -- [ ] Add move-only liveness using `scan_vars`-style restricted variables. -- [ ] Build a capped conflict matrix only for move-related values. -- [ ] Add metrics for: - - [ ] move candidates - - [ ] coalescing scan vars - - [ ] coalescing matrix bytes - - [ ] coalesced moves -- [ ] Coalesce non-conflicting move groups. -- [ ] Keep the matrix cap explicit and configurable at compile time. - -Exit criteria: - -- [ ] Coalescing is optional and does not affect O1 compile-time shape. -- [ ] Matrix memory is bounded and reported. - -### Phase 10: O2 Splitting and Spill Placement - -- [ ] Add split-capable occupancy state similar to MIR's `busy_used_locs`. -- [ ] Identify profitable live-range gaps. -- [ ] Place spills/restores on block boundaries. -- [ ] Split critical edges when required for edge placement. -- [ ] Add spill-placement dumps. -- [ ] Add metrics for split ranges and edge/block spill placements. - -Exit criteria: - -- [ ] O2 can reduce spills without changing O1 assignment behavior. -- [ ] Edge/block placement tests cover diamonds, loops, and critical edges. - -### Phase 11: Cleanup and Documentation - -- [ ] Update `doc/OPT.md` with the final implemented module names and pass - order. -- [x] Update `doc/PERF.md` with new O1 scaling data. -- [ ] Add a short allocator invariant section near the implementation. -- [ ] Ensure pass dumps are stable enough for targeted regression tests. -- [ ] Remove obsolete comments that describe the dense conflict allocator as - the normal O1 path. - -## Non-Goals For The First Rewrite - -Avoid spending early effort on: - -- micro-optimizing the existing full conflict matrix -- adding sparse rows to `val_conflicts` as a long-term solution -- tuning candidate priority before range-based allocation exists -- preserving optimizer-internal compatibility with the dense allocator era -- link/JIT performance -- parser/CG data-structure changes - -Those may still matter later, but they do not fix the current structural -scaling issue. - -## Design Risks - -### Instruction Storage - -cfree's block instruction arrays make insertion more awkward than MIR's linked -lists. We can handle this with rewrite output buffers initially. If combine, -spill insertion, and edge splitting become too cumbersome, revisit instruction -storage after the range allocator lands. - -### Arena Lifetime - -MIR explicitly destroys many pass structures. cfree's arenas make allocation -cheap but can hide large transient memory. The new pass contexts should use a -scratch arena or explicit freeable heap where transient memory is substantial. - -### Debuggability - -The rewrite should add stable dumps early: - -- block live sets -- live ranges -- allocation map -- inserted reloads/stores -- final rewritten IR - -Without dumps, range allocation bugs will be hard to diagnose. - -### Target Coupling - -MIR's rewrite relies heavily on target hooks. cfree should keep target -knowledge behind explicit helpers: - -- hard-register class availability -- caller-saved/callee-saved masks -- scratch registers -- memory-operand legality -- stack-slot addressability - -## Bottom Line - -MIR's RA performance shape comes from avoiding a full pseudo interference graph -in the normal allocation path. cfree should mirror that, even when doing so -breaks old optimizer-internal APIs, fields, dumps, or tests. The next serious -O1 work should keep the pass-local liveness, compressed live ranges, -occupied-location assignment, and assignment-map rewrite shape, then remove the -transition scaffolding and remaining scaling buckets. - -Once that shape exists, incremental tuning will have a much better foundation. diff --git a/doc/OPT.md b/doc/OPT.md @@ -1,1371 +1,336 @@ -# OPT - MIR-informed implementation plan +# OPT - Optimizer Design -Scope: deliver cfree's optimized backend behind the `CGTarget` wrapper -contract described in `doc/DESIGN.md` Section 9, using MIR JIT's optimizer as -the model for phase order, level gating, and performance targets. +This document describes cfree's optimized backend pipeline. It is intended to +be the stable design reference for `-O1` and `-O2`: the IR shape, pass order, +major analyses, and ownership boundaries. Performance measurements and current +benchmark gaps live in `doc/OPT_PERF.md`. -This document is based on a source investigation of `~/tmp/mir/`, mainly -`mir-gen.c`, `mir.c`, `mir.h`, `mir-gen.h`, `MIR.md`, `README.md`, and the -`c2mir` driver. The important takeaway is that MIR keeps the optimizer short: -`-O1` is a fast lowering/code-selection path, while the expensive -SSA/value/memory/loop work is reserved for `-O2`. +The optimizer is implemented as a `CGTarget` wrapper. The C frontend and other +frontends still emit through the public CG interface; the wrapper records that +stream as cfree IR, runs the requested optimization schedule, then replays the +lowered IR into the wrapped target. ---- +## Goals -## 1. Target Shape +- Keep `-O0` as the direct backend path. +- Keep `-O1` as a fast optimized tier: shared lowering, liveness, allocation, + post-RA cleanup, and emission, but no SSA value optimization or inlining. +- Make `-O2` the full quality tier: interprocedural retention for inlining, + SSA value/memory passes, loop cleanup, pressure relief, coalescing, live-range + splitting, and the same final lowering path as `-O1`. +- Preserve module boundaries. Frontends own language semantics, ABI lowering + owns call/return shape, targets own encoding legality, and optimizer passes + operate on recorded CG-level IR. -cfree exposes three levels: +## Level Shape -- `-O0`: direct backend path. CG drives the target `CGTarget` immediately. -- `-O1`: MIR-style minimal optimizer. Record IR, lower it through the new - backend path, run liveness, simplified register allocation, post-RA combine, - and DCE. No SSA value passes and no inlining. -- `-O2`: MIR-style full optimizer. Add SSA, GVN/constant propagation, - redundant load elimination, copy propagation, dead store elimination, - SSA DCE, LICM, pressure relief, address transformation, block cloning, - coalescing, live-range splitting, and inlining plus cleanup. +`-O0`: -The core contract stays simple: +```text +frontend -> CG -> target +``` +`-O1`: + +```text +frontend -> CG -> opt_cgtarget records Func IR + +func_end: + build/lower CFG + machinize + loop tree + liveness + dead-def elimination + register allocation without live-range splitting + post-RA combine + DCE + jump cleanup + emit to wrapped target ``` -parse -> cg -> opt_cgtarget { record Func IR } - --O1 func_end/finalize: - build_cfg - machinize - build_loop_tree - live_info - regalloc(simplified) - combine - dce - opt_emit -> wrapped CGTarget - --O2 func_end/finalize: - build_cfg - build_reg_ssa - block_cloning - build_ssa - gvn - copy_prop - dse - ssa_dce - build_loop_tree + licm - pressure_relief - make_conventional_ssa - ssa_combine - undo_ssa - jump_opt - machinize - build_loop_tree - coalesce - live_info - regalloc(with live-range splitting) - combine - dce - opt_emit -> wrapped CGTarget - --O2 finalize before final lowering: - opt_inline - opt_cleanup(dirty callers) + +`-O2`: + +```text +frontend -> CG -> opt_cgtarget records all Func IR + +finalize: + inline retained functions + for each retained function: + SSA cleanup and value/memory optimization + lower through the shared backend pipeline + register allocation with coalescing and live-range splitting + emit to wrapped target ``` -The level dial selects optimization work. The lowering backend should be -shared by `-O1` and `-O2`; `-O1` must remain the bisection floor for bugs in -machinize, liveness, allocation, emission, and target-specific rewrite. +Functions that contain constructs whose cloning or SSA semantics are not yet +explicit, such as inline asm, label-address loads, or indirect branches, are +conservatively routed through the non-SSA lowering path. ---- +## Core Data Structures -## 2. MIR Findings +### `OptImpl` -### 2.1 Where the Optimizer Lives +`OptImpl` is the `CGTarget` wrapper state in `src/opt/opt.c`. -MIR's optimizer is concentrated in `mir-gen.c`; inlining and MIR-level -simplification live in `mir.c`; `c2mir/c2mir-driver.c` maps command-line -`-O` flags to `MIR_gen_set_optimize_level`. +- `target` is the wrapped real backend target. +- `level` selects the optimization schedule. +- `f` and `cur` track the function and current block during recording. +- `pending_loc` stores the latest source location and stamps new instructions. +- `FuncSet funcs` retains all `-O2` functions until `finalize`, so inlining can + see both callees and callers before any function is emitted. -Key source facts: +### Function IR -- `MIR_gen_init` defaults `optimize_level` to `2`. -- `MIR_gen_set_optimize_level(ctx, level)` only stores the requested level in - generator context. -- `c2mir` treats bare `-O` as `-O2`. -- Current `mir-gen.c` checks `optimize_level >= 2` for the full pass set. - Some docs/comments still describe a separate `-O3`, but the current source - does not gate additional passes on `>= 3`. For cfree, model `-O2` as MIR's - current full source-level pipeline. - -### 2.2 MIR Level Semantics - -MIR documentation says: - -- Level `0`: only register allocator and machine code generator work. -- Level `1`: adds code selection. It should produce more compact/faster code - than level 0 at nearly the same generation speed. -- Level `2`: adds common subexpression elimination and conditional constant - propagation. MIR docs report level 1 generation speed around 50 percent - faster than level 2. -- Current source-level `>=2`: also includes block cloning, SSA variable - renaming, address transformation, DSE, LICM, pressure relief, coalescing, - and live-range splitting. - -For cfree: - -- `-O1` should optimize the backend mechanics without running SSA passes. -- `-O2` should be the full quality mode. -- Do not create a user-visible `-O3` until benchmarks identify a genuinely - useful extra tier. - -### 2.3 MIR Performance Bar - -MIR's README reports `c2m -eg` at about `0.91` geomean performance relative -to GCC `-O2` on its 15 small C benchmarks, with QBE around `0.65` on the same -table. Treat those numbers as directional rather than a literal cfree promise: -MIR has a JIT-oriented IR and mature per-target rewrite logic. Still, they set -the bar for this project: - -- `-O1`: compile-time-focused. It should be much closer to `-O0` compile time - than to `-O2`, while removing obvious backend artifacts through combine and - DCE. -- `-O2`: performance-focused. The first external target is at least QBE-class - generated-code performance on the MIR `c-benchmarks` set, then iteratively - close on MIR's published result. -- Both levels must preserve the CG corpus exit-code contract before benchmark - tuning starts. - ---- - -## 3. MIR Pipeline, Source Order - -This is the effective order in current `mir-gen.c` for full function -generation. - -### 3.1 Common Setup - -MIR duplicates the function instruction list, allocates a function CFG object, -and runs `build_func_cfg`. - -`build_func_cfg`: - -- Creates synthetic entry and exit blocks. -- Splits basic blocks at labels, branches, returns, property branches, and - fallthrough boundaries. -- Converts register operands to internal variable operands. -- Marks address-producing instructions and addressable regs. -- Adds edges for direct branches, switches, fallthrough, and possible indirect - jump targets. -- Marks calls on blocks so memory availability and liveness can treat them as - barriers. -- Removes unreachable blocks when `optimize_level > 0`. - -cfree's `opt_build_cfg` should follow this shape: construct explicit -entry/exit blocks, keep call/memory barrier metadata on blocks, and make -unreachable cleanup part of `-O1+`, not a late optional cleanup. - -### 3.2 Full `-O2` Pre-Lowering Passes - -MIR full mode then runs: - -1. `clone_bbs` before SSA when there are no address instructions. - MIR clones cold blocks after return back into hot predecessors when this - exposes optimization in the hot path. It uses a bounded growth factor and - skips back edges. - -2. `build_ssa`. - MIR uses a Braun-style construction: it creates optimized maximal SSA, - minimizes redundant phis, builds def-use edges, and optionally renames - variables. cfree should copy the useful property, not the representation: - every use should have cheap access to its defining inst, and every def - should have cheap access to its uses. - -3. `addr_xform` when address instructions exist. - MIR tries to eliminate `ADDR` pseudos that only feed memory operands. If an - address pseudo must remain addressable, MIR converts uses to memory loads - and stores, rebuilds SSA, then clones blocks. For cfree, this maps to - folding GEP/address-of chains into target memory operands where the target - can encode them, while keeping address-taken frame slots in memory. - -4. `gvn`. - MIR computes memory availability, dominators, and value numbers in postorder. - This pass includes constant propagation, branch folding, redundant expression - elimination, redundant load elimination, and store/load reuse. It uses alias - and nonalias tags on memory operands. Calls clear or conservatively restrict - memory availability. - - cfree should land this in two explicit slices. The first slice is scalar GVN - only: pure SSA value numbering and replacement for constants, arithmetic, - compares, conversions, and other side-effect-free register expressions. It - should not eliminate or forward `IR_LOAD`, interpret `IR_STORE`, or make - memory-dependent branch decisions. - - Before the second slice touches memory, document and test the memory - numbering shape. That design must define memory version tokens by alias - class/root, how `MemAccess.alias`, address operands, address spaces, globals, - TLS, params, and unknown memory participate in keys, and the invalidation - rules for stores, calls, atomics, fences, volatile operations, and inline asm. - Redundant-load elimination and store/load reuse start only after that shape - is explicit and covered by pass-local dump tests. - -5. `copy_prop`. - MIR propagates copies, folds multiply/divide by powers of two, and removes - redundant extension chains. cfree should keep this as a separate pass after - GVN because it relies on SSA edges and target legality. - -6. `dse`. - MIR computes memory liveness by memory location, handles alloca escape - through calls, and removes stores whose memory location is not live. This - depends on GVN's memory numbering and alias classification. - -7. `ssa_dce`. - MIR deletes SSA instructions with unused outputs, while preserving calls, - allocas, varargs, returns, frame/stack effects, overflow sequences with live - overflow branches, and other side-effecting ops. - -8. `build_loop_tree + licm`. - MIR builds natural loops and preheaders. LICM skips branches, phis, calls, - memory, varargs, allocas, and potentially trapping divisions/mods. It is - pressure-sensitive: cheap single instructions are not moved unless their - inputs are worth moving too; multiplies are considered expensive enough to - move. - -9. `pressure_relief`. - MIR moves single-use immediate or constant-like moves closer to their use - when doing so reduces pressure and does not move work into a loop. +The optimizer IR is defined in `src/opt/ir.h`. -10. `make_conventional_ssa`. - MIR lowers phis into edge/block moves, splitting critical edges when needed. +- `Func` owns the arena-backed IR for one function: blocks, frame slots, + pseudo-register metadata, value metadata, local declarations, emit order, and + pass scratch fields. +- `Block` owns a linear instruction list plus CFG predecessor/successor edges. + `emit_order` records layout order separately from block id order. +- `Inst` is one recorded operation. Most `IROp` values map directly to a single + CGTarget method. Source locations are stored on instructions. +- `PReg` is the mutable pseudo-register id used by recorded CG IR and by the + lowering pipeline before register SSA. +- `Val` is the SSA value id used after `opt_build_reg_ssa` and mem2reg SSA. + `VAL_NONE` and id zero are sentinels. +- Per-op auxiliary structs preserve structured data for calls, returns, phis, + switches, inline asm, atomics, aggregate memory operations, intrinsics, and + debug/scope operations. -11. `ssa_combine`. - MIR combines compare+branch pairs and folds address components into memory - operands while SSA edges are still available. - -12. `undo_ssa`. - MIR removes phi nodes and SSA edges. - -13. `jump_opt`. - MIR removes unreachable blocks, empty blocks, branches to the next - instruction, chains of labels, and branches to jumps. - -### 3.3 Lowering and Allocation - -After pre-lowering optimization, MIR runs the machine-dependent and allocation -pipeline: - -1. `target_machinize`. - This performs ABI lowering, call lowering, target two-operand forms, and - other machine-dependent normalization. - -2. Build a loop tree for `-O1+`. - MIR uses loop depth for frequency/pressure estimates in liveness, - coalescing, and allocation. - -3. `collect_moves`, move-only liveness, conflict matrix, and `coalesce` at - `-O2`. - MIR aggressively coalesces move-related regs, prioritizing moves by loop - frequency and checking conflicts in a bounded matrix. - -4. Full `live_info`. - MIR computes `live_in`/`live_out`, register frequencies, live lengths, and - block pressure. It understands phis before SSA destruction and call-used - hard registers after machinize. - -5. `reg_alloc`. - MIR builds live ranges and assigns pseudos by priority: tied hard regs first, - then higher frequency, then shorter live length. At levels below 2 it uses a - simplified allocator. At full level it can split live ranges and place - spills/restores on edges or block boundaries. - -6. `rewrite`. - MIR rewrites pseudo regs to hard regs or stack slots, inserts reloads and - stores, handles call-clobbered saves, and deletes noop moves. - -7. `combine`. - This is target-aware code selection. It substitutes safe single-use moves, - collapses extension chains, commutes operands to expose legal combinations, - removes internal labels, and validates each rewrite with target legality. +The representation deliberately keeps mode explicit. Before pseudo-register +SSA, `OPK_REG` operands carry `PReg` ids. After pseudo-register SSA, they carry +`Val` ids, and `Func.opt_reg_ssa` tells shared helpers which namespace is +active. -8. `dead_code_elimination`. - MIR performs post-RA DCE using live-out sets and preserves side effects. +### Memory Shape -9. Prolog/epilog and machine instruction generation. +Memory operations carry `MemAccess`, including type, size, alignment, volatile +or atomic properties, and alias root. Important alias roots include locals, +globals, TLS, parameters, and unknown memory. Memory-aware passes must preserve +observable memory: volatile, atomic, aggregate operations with side effects, +calls that may clobber, fences, and inline asm. -cfree should keep this exact split: combine after register allocation is not a -replacement for SSA `copy_prop`; it is a target legality/code-selection pass. +### Analysis State -### 3.4 Inlining +`OptAnalysis` holds per-pass order and dominance data: -MIR inlining is in `mir.c`, before generator optimization. It processes direct -call-like instructions after MIR simplification, not inside `mir-gen.c`. +- postorder/reverse-postorder +- reachability +- immediate dominators and dominator children +- dominance frontiers -Important MIR inliner behavior: +`Func.opt_valid_analyses` tracks coarse invalidation for CFG, def-use, +dominators, and loop info. Passes that mutate control flow, operands, or +instructions either rebuild the relevant analysis or invalidate it. -- It can inline direct normal calls and explicit inline calls when the callee - is available. -- It skips unresolved externals, self recursion, label-reference functions, - varargs, `jret`, and over-budget callees. -- Default size budgets are small: normal calls have a much smaller cap than - explicit inline calls. -- It limits caller growth relative to original caller size. -- It renames callee registers, duplicates labels, materializes argument moves, - rewrites returns, and handles simple top-level constant-size `alloca`. +The shared def-use representation records every use of every `Val`, including +ordinary operands, indirect bases, phi inputs, call arguments/results, +aggregate operands, inline asm operands, and intrinsic operands. -cfree should run inlining on retained `Func` IR at `finalize` for `-O2`, then -run cleanup on dirty callers. SCCs can be skipped for v1. The inliner must -refuse setjmp/longjmp-sensitive and vararg cases until their semantics are -explicitly tested. +## Shared Lowering Pipeline ---- +`opt_run_lowering_pipeline` is used by both `-O1` and `-O2`. -## 4. cfree Level Schedules +1. `opt_build_cfg` + Builds explicit CFG edges from branches, returns, switches, fallthroughs, + scopes, indirect branches, and synthetic blocks. It also removes unreachable + connected blocks after cleanup. -### 4.1 `-O1` Minimal Schedule +2. `opt_jump_cleanup` + Performs CFG cleanup before and after lowering. The early form repairs and + simplifies CFG shape; the late form also optimizes layout fallthroughs. -`-O1` is not "half of SSA." It is MIR's cheap backend optimization tier. +3. `opt_simplify_local` + Performs local algebraic/address simplifications that do not require SSA. -Required `-O1` schedule: +4. `opt_machinize` + Lowers generic IR toward target-legal machine IR. This includes target + operand constraints, hard-register constraints, clobber information, and + target-specific rewrite decisions needed before allocation. -``` -build_cfg -machinize -build_loop_tree -live_info -regalloc(allow_live_range_split = false) -combine -dce -opt_emit -``` +5. `opt_build_loop_tree` + Builds loop metadata from dominators for use by later analysis and metrics. -Allowed `-O1` details: +6. `opt_live_blocks` + Computes block liveness over pseudo-registers or SSA values, depending on + the active mode. -- Remove unreachable blocks during CFG construction. -- Use loop depth only for frequency/pressure costing. -- Run target-aware combine after register allocation. -- Delete noop moves and dead post-RA definitions. -- Use a priority-based allocator, but without coalescing and live-range - splitting in the first production version. +7. `opt_dead_def_elim_with_live` + Removes dead register definitions using live-out information while + preserving side-effecting instructions. -Forbidden at `-O1`: +8. `opt_regalloc` + Allocates pseudo-registers to hard registers or spill slots. `-O1` uses the + simple allocation mode. `-O2` enables coalescing and live-range splitting. -- `build_ssa` -- `gvn` -- `copy_prop` -- `dse` -- `ssa_dce` -- `licm` -- `pressure_relief` -- `coalesce` -- `opt_inline` +9. `opt_combine` + Runs post-allocation peephole combines over lowered IR. -This keeps `-O1` useful and debuggable: if `-O1` fails, the bug is in -recording, CFG, machinize, liveness, allocation, combine, DCE, or emission. +10. `opt_dce` + Deletes now-dead non-side-effecting instructions after allocation and + combine. -### 4.2 `-O2` Full Schedule +11. `opt_emit` + Replays the final IR into the wrapped backend target. -`-O2` uses MIR's full current source pipeline: +The shared path is the bisection floor. Bugs in CFG construction, target +machinization, liveness, allocation, rewrite, combine, or emission should +usually reproduce at `-O1`. -``` -build_cfg -build_reg_ssa -if no address-transform candidates: - block_cloning -build_ssa -if address-transform candidates: - addr_xform - undo_ssa - block_cloning - build_ssa -gvn -copy_prop -dse -ssa_dce -build_loop_tree -licm -pressure_relief -make_conventional_ssa -ssa_combine -undo_ssa -jump_opt -machinize -build_loop_tree -collect_moves -live_info(move vars only) -coalesce -live_info(all vars) -regalloc(allow_live_range_split = true) -combine -dce -opt_emit -``` +## `-O1` Pipeline -Then, once inlining exists: +`-O1` runs only the shared lowering pipeline. It intentionally skips: -``` -finalize: - opt_inline(FuncSet, max_iters) - for each dirty caller: - opt_cleanup - lower every dirty/not-yet-emitted Func through the full schedule -``` +- interprocedural retention +- inlining +- SSA construction +- GVN +- copy propagation beyond local cleanup +- DSE +- LICM +- pressure relief +- live-range splitting -`opt_cleanup` should re-run the passes that inlining exposes value for: -`build_cfg`, `build_ssa`, `gvn`, `copy_prop`, `dse`, `ssa_dce`, `licm` when -loops exist, `pressure_relief`, `make_conventional_ssa`, `ssa_combine`, -`undo_ssa`, and `jump_opt`. +`-O1` should compile much closer to `-O0` than to `-O2`, while removing obvious +backend artifacts and exercising the optimized allocation/emission path. -Current implementation subset: +## `-O2` Pipeline -``` -build_cfg -jump_cleanup(CFG) -build_cfg -verify(o2-pre-ssa-cfg) -build_reg_ssa -verify(o2-reg-ssa) -block_cloning -verify(o2-block-clone-cfg) -build_ssa -verify(o2-ssa) -ssa_dce -copy_cleanup -verify(o2-pre-addr-cleanup) -addr_xform -verify(o2-addr-xform-ssa) -ssa_dce -verify(o2-addr-xform-dce) -copy_cleanup -verify(o2-addr-xform-copy-cleanup) -simplify -verify(o2-simplify-ssa) -ssa_dce -copy_cleanup -verify(o2-simplify-cleanup) -gvn -verify(o2-gvn-ssa) -copy_prop -verify(o2-copy-prop-ssa) -ssa_dce -verify(o2-copy-dce) -dse -verify(o2-dse-ssa) -ssa_dce -verify(o2-dse-dce) -build_loop_tree -licm -verify(o2-licm-ssa) -pressure_relief -verify(o2-pressure-relief-ssa) -make_conventional_ssa -verify(o2-conventional-ssa) -ssa_combine -verify(o2-ssa-combine) -undo_ssa -copy_cleanup -verify(o2-undo-copy-cleanup) -opt_jump_opt -verify(o2-jump-opt) -shared O1 lowering pipeline (machinize ... jump_cleanup ... opt_emit) -``` +`-O2` retains every function until target finalization. This enables whole-TU +inlining before any function is emitted. -Every CFG or value-use mutation above is followed by the relevant verifier -checkpoint in `opt_run_o2_pipeline`. - -### 4.3 Identity Simplification - -Identity simplification is implemented in three homes, depending on how much -context the rewrite needs. - -1. CG-time stack/value simplification, before IR recording. - This is the earliest and best place for identities where the operation can - simply avoid materializing a result. It benefits `-O0`, `-O1`, and `-O2`, - and it prevents avoidable temporaries from ever entering opt IR. Keep this - limited to rewrites that are local to the two values currently on the CG - stack, do not require CFG/use-def information, and are valid for every - target backend. - - Current CG folds immediate integer binops, collapses delayed integer - arithmetic chains, collapses integer identities through - `api_try_collapse_binop_identity`, and `api_cg_convert` elides same-type - conversions. The identity set now includes safe unflagged integer cases such - as `x * 1`, `1 * x`, `x / 1`, `x % 1 -> 0`, `x << 0`, `x >> 0`, zero - annihilators for multiply/and, and all-ones annihilators for or. - - Do not put target-dependent legality here. Do not fold floating-point - identities here: signed zero, NaNs, exceptions, contraction, and rounding - flags make algebraic FP identities nontrivial. Also keep the existing - `!flags` rule for integer folding unless a flag's exact semantics are - explicitly modeled. - -2. A tiny non-SSA `opt_simplify_local` in the shared lowering pipeline. - This runs after the initial `build_cfg`/`jump_cleanup`/`build_cfg` cleanup - and before `machinize`, so it is available to `-O1` and to the final lowering - leg of `-O2` after SSA has been undone: - - ``` - build_cfg - jump_cleanup(CFG) - build_cfg - simplify_local - machinize - ... - ``` - - This pass stays local and cheap. It rewrites one instruction at a time: - neutral/annihilator integer binops, `x - x -> 0`, `x ^ x -> 0`, `x & x` and - `x | x` to copies, same-value integer compares, and exact no-op converts. - It does not chase definitions, compute dominance, rewrite arbitrary uses, or - become a substitute for SSA value passes. Its job is to reduce obvious work - before liveness/RA without changing the meaning of O1 as the cheap backend - tier. - - O1 should not grow a general value optimizer. If an O1 simplification needs - def-use, dominance, copy propagation, memory reasoning, or loop information, - it belongs in O2 instead. Post-RA `combine` remains the target-aware cleanup - point for no-op moves, immediate folding into legal machine operands, and - backend artifacts exposed only after register assignment. - -3. An SSA `opt_simplify` in the O2 pre-lowering value pipeline. - This is the primary home for scalar and address identities that require - use-def information. It runs after mem2reg/address cleanup has exposed SSA - values and before GVN builds value-number keys: - - ``` - build_ssa - ssa_dce - copy_cleanup - addr_xform - ssa_dce - copy_cleanup - simplify - ssa_dce - copy_cleanup - gvn - ``` - - This pass rewrites the defining instruction to `IR_COPY` or `IR_LOAD_IMM` - and relies on `opt_ssa_dce`/`opt_copy_cleanup` to remove leftovers. It - shares the local rules and adds SSA-only rules for double bitwise-not, - add-zero address/value chains fed by SSA constants, and no-op bitcast chains - that reduce to a copy. GVN still owns constant folding, leader selection, - redundant expression/load elimination, and branch folding; `opt_simplify` - canonicalizes cheap identities so GVN has fewer equivalent shapes to learn. - -The practical rule is: if CG can avoid emitting IR without looking beyond the -top CG values, do it in CG. If the rewrite is a single recorded instruction and -does not need SSA, put it in shared `simplify_local` so O1 benefits. If the -rewrite needs def-use, dominance, memory/address canonicalization, or cleanup -after inlining, put it in O2 `opt_simplify`. - -### 4.4 Transformations We Do Not Take - -`doc/DESIGN.md` is still binding: no UB-exploiting transforms. Do not assume -signed overflow, shift-by-width, division by zero, or null dereference are -unreachable. MIR is careful around potentially trapping division/modulo in -LICM; cfree should be at least as conservative. - -MIR property instructions and lazy basic-block versioning are out of scope for -the first optimized backend. They are a separate JIT specialization feature, -not required for normal `cfree` object/executable codegen. - ---- - -## 5. Current cfree State - -The optimizer is no longer just a stub: - -- `src/opt/opt.c` implements the `CGTarget` wrapper. -- `src/opt/ir.c` and `src/opt/ir.h` implement the recorded IR container. -- `src/opt/pass_cfg.c` implements CFG construction. -- `src/opt/pass_ssa.c` implements pseudo-register SSA construction, mem2reg - SSA construction, conventional SSA edge-copy lowering, SSA undo, and stable - SSA dumps for pass tests. -- `src/opt/pass_coalesce.c` implements O2 move collection, move-related - conflict analysis, and conservative coalescing. -- `src/opt/pass_analysis.c` owns shared CFG order, dominators, dominance - frontiers, dominator children, def-use rebuilding, and verifier guardrails. -- `src/opt/pass_simplify.c` implements CG-adjacent local identity cleanup for - the shared lowering path and SSA-aware identity canonicalization for O2 before - GVN. -- `cfree cc` forwards `-O0`/`-O1`/`-O2` into `CfreeCodeOptions`, so native - frontends exercise the optimizer through the normal compile driver. -- `test/toy/run.sh` supports `CFREE_TOY_OPT_LEVELS`; default is `0 1 2`, so - `make test-toy` exercises direct, O1, and O2 paths by default. - -Current behavior: - -- Level `1` records IR, builds CFG, machinizes, builds loop frequencies, - computes PReg liveness and live ranges, runs the current priority - allocator/rewrite path over mutable PRegs, combines, runs post-RA DCE, - cleans jumps/layout, and emits through the wrapped target. This path keeps - whole-PReg allocation and does not run coalescing or splitting. -- Level `2` has its own scheduler entry point. It now converts mutable - pseudo-registers to SSA values, builds real mem2reg SSA, verifies phi/use-def - invariants, runs the first small O2 transforms, lowers phis through - conventional SSA, undoes SSA, and then routes through the shared lowering - path with O2 allocation policy enabled. -- `opt_regalloc(..., allow_live_range_split)` now uses the O2 policy bit to - enable move coalescing and singleton live-range splitting. Coalescing only - considers valid same-class, same-type `IR_COPY` pairs, weights moves by block - frequency, builds a bounded conflict matrix for move-related PRegs, and - rejects incompatible tied-register or forbidden-register constraints. -- O2 allocation assigns coalesced groups as a unit. Every member receives the - same hard register or spill slot, so coalesced copies rewrite to hard-reg - self copies and are removed by post-RA combine/DCE. -- O2 live-range splitting is intentionally conservative in this slice: only - singleton allocation groups with multiple live ranges can split. Split PRegs - get a canonical spill home plus per-range segment records. Hard segments - insert boundary reload/store traffic through the existing rewrite buffers; - coalesced groups, call-crossing values, and O1 allocations remain whole. -- `opt_build_ssa` is no longer just a shape check: promoted loads/stores are - rewritten to SSA values, phis have real `Val` defs and predecessor inputs, - and def-use chains are rebuilt for downstream SSA passes. -- Instruction ids are stable across block-array rewrites, so analysis and - dumps no longer depend on raw `Inst*` addresses or positional references. -- Loop tree construction uses shared `OptAnalysis` dominators instead of a - pass-local dominator implementation. -- `Func.opt_valid_analyses` now tracks CFG, def-use, dominator, and loop-info - validity. Mutating passes invalidate the affected bits, rebuild passes mark - their products valid, and `opt_verify` checks cached def-use before rebuilding - it so stale use-site assumptions are caught. -- The first SSA-backed value passes are in the O2 path: `opt_ssa_dce` removes - unused pure SSA defs/phis, and `opt_copy_cleanup` rewrites safe SSA copy users - through def-use chains while avoiding multi-def conventional-SSA edge copies. -- The next pre-GVN O2 slice is also in the O2 path: `opt_block_cloning` clones - tiny non-loop join blocks under strict growth limits when cloned defs remain - local and external uses are dominated; `opt_addr_xform` folds simple local - `addr_of` pseudos into non-observable load/store memory operands. Volatile, - atomic, global, TLS, non-zero indirect-offset, and otherwise non-local address - cases remain materialized. Both passes rebuild/invalidate analysis state and - are bracketed by `opt_ssa_dce`, `opt_copy_cleanup`, and verifier checkpoints. -- The first post-address value passes are in the O2 path: `opt_simplify` - canonicalizes cheap scalar/address identities before GVN; scalar GVN rewrites - dominated duplicate pure expressions and folds safe integer - constants/branches; the memory-aware GVN slice rewrites exact redundant - loads and store/load reuse through the documented alias/version model; and - `opt_copy_prop` propagates SSA copy chains plus collapses redundant integer - extension chains. -- The first memory and loop passes are in the O2 path: `opt_dse` removes exact - overwritten or unread non-escaping local stores, `opt_licm` hoists safe loop - invariants while leaving trapping and memory operations in place, and - `opt_pressure_relief` sinks single-use immediate/constant defs with a - per-block linear rewrite plan. - -Temporary defensive/pessimizing checks to lift: - -- [ ] O2 scalar locals/params that would be register-backed at O1 are still - initially recorded as frame slots for mem2reg. Lift this when source-local - storage identity can be represented directly through pseudo-register SSA. -- [ ] O2 functions containing `IR_ASM_BLOCK` route through the non-SSA lowering - path. Lift this when inline-asm tied operands, inout operands, - early-clobbers, fixed-register constraints, and clobber liveness are modeled - across pseudo-register SSA construction and destruction. -- [ ] O2 functions containing `IR_LOAD_LABEL_ADDR` or `IR_INDIRECT_BRANCH` - route through the non-SSA lowering path. Lift this when label-address and - computed-goto targets have stable block/MCLabel semantics across O2 CFG - cleanup, block cloning, edge splitting, SSA destruction, and jump cleanup. -- [ ] O2 switch table promotion is disabled for gapped dense ranges - (`span != ncases`). Lift this when jump-table default slots remain correctly - bound after O2 layout and jump cleanup, including interior default entries. -- [ ] Jump forwarding refuses to forward through single-jump blocks that already - own an `MCLabel`. Lift or narrow this once data relocations and label tables - are retargeted when CFG cleanup forwards or removes labeled trampoline - blocks. - -Full O2 jump optimization is now in the O2 path after SSA destruction. It -extends the shared O1 jump cleanup with switch/default target forwarding, empty -fallthrough-chain forwarding, repeated branch-to-branch forwarding, and -same-target conditional branch collapse while preserving the final layout -cleanup for branch-to-physical-next deletion. - -The next implementation work should move to Phase E inlining and cleanup. - ---- - -## 6. Implementation Phases - -Each phase should end with targeted green tests. Prefer red-green tests for -the exact pass being introduced, then expand through the CG corpus. - -### Phase A - Production `-O1` Lowering - -Status: implemented as the public level-1 path. Keep extending tests and -target-specific legality checks here, but do not make O1 depend on SSA/value -passes. The final O1 namespace is mutable `PReg`: `ir_ensure_preg` grows only -the pseudo-register metadata table, and O1 liveness, allocation, rewrite, and -metrics are PReg-indexed. - -Goal: make `-O1` stop replaying and emit through the optimized backend path -without SSA value passes. - -Implement: - -- `opt_machinize` -- `opt_live_info` -- `opt_regalloc(..., false)` -- `opt_combine` -- `opt_dce` -- `opt_emit` - -Keep the allocator simple but not naive: - -- Build live ranges from block liveness. -- Sort allocation candidates by tied hard-reg requirement, frequency, live - length, then stable id. -- Assign hard regs when possible, stack slots otherwise. -- Rewrite pseudos into hard regs/stack slots with reserved scratch regs for - reload/store addressing. -- Delete noop moves after rewrite. - -Exit criteria: - -- [x] `CFREE_TOY_OPT_LEVELS="0 1" make test-toy` passes for targeted cases. -- [x] Focused allocation cases cover call-clobber saves, stack spills, tied - hard regs from inline asm, and values live through branches. - -### Phase B - O2 Analysis Substrate - -Status: implemented for the first O2 value-pass slice. `OptPassCtx`, the -shared operand/reference walker, -`OptAnalysis` CFG order/dominator/frontier data, stable instruction ids, -production mem2reg SSA, def-use rebuilding, SSA dumps, and stronger -`opt_verify` guardrails are in place. Loop tree construction now uses shared -dominators. Passes now have explicit CFG, def-use, dominator, and loop-info -validity bits, and the first SSA value passes did not require additional shared -analysis fields. - -Goal: create the durable data structures and invalidation model that all O2 -passes will share before implementing transformations. - -Checklist: - -- [x] A small pass context for per-pass state, stage names, metrics, and - temporary arenas. -- [x] A shared operand/reference walker for every pass that needs uses/defs, - including `CGABIValue`, calls, returns, inline asm, intrinsics, and indirect - bases. -- [x] Shared analysis results for RPO/PO, dominators, dominance frontiers, and - dominator children. -- [x] Loop tree construction moved onto shared dominator analysis. -- [x] Stable instruction ids so long-lived analysis never depends on raw - `Inst*` addresses across block array rewrites. -- [x] Real mem2reg phi `Val` definitions and rewritten promoted uses. -- [x] Def-use chain rebuilding, including phi inputs and indirect bases. -- [x] SSA dump coverage for phis and rewritten uses. -- [x] `opt_verify(Func*, stage)` gates for CFG consistency, instruction id - consistency, phi arity, def-use consistency, and SSA value class checks. -- [x] Formal analysis invalidation flags for passes that preserve, rebuild, or - invalidate CFG/value/use-def state. -- [x] Additional shared analysis fields demanded by first value passes. - -Exit criteria: - -- [x] `-O1` stays green. -- [x] `-O2` can run the substrate and verifier, then lower through the current - O1 backend path with no observable behavior change. -- [x] SSA dump tests prove phis and rewritten uses on branch/loop locals. -- [x] `make test-opt` -- [x] Focused `CFREE_TOY_OPT_LEVELS="0 1 2" make test-toy` branch/loop subset. -- [x] Full `CFREE_TOY_OPT_LEVELS="0 1 2" make test-toy` - -### Phase C - Full Allocation Infrastructure - -Status: implemented as the first O2 allocation-quality slice. O1 keeps the -existing whole-PReg priority allocator. O2 enables conservative move -coalescing, coalesced-group allocation, and singleton live-range splitting -without changing public `libcfree` APIs or adding CFG edge splitting. - -Goal: bring `-O2` allocation quality up to MIR's model before value passes -start changing code aggressively. - -Implement: - -- [x] Move collection. -- [x] Bounded conflict matrix for move-related regs using existing live-range - overlap data. -- [x] Conservative coalescing for same-class, same-type copies with compatible - fixed-register and forbidden-register constraints. -- [x] Coalesced-group allocation before singleton allocation. -- [x] O2-only split allocation representation with per-PReg segment lists and - canonical spill homes. -- [x] Conservative singleton live-range splitting when whole hard allocation - fails. -- [x] Boundary reload/store insertion for hard split segments through rewrite. -- [x] Critical-edge CFG mutation for spill placement. Current placement stays - at unambiguous block/range boundaries. - -Exit criteria: - -- [x] `-O1` remains green. -- [x] `-O2` can use the same lowering path with coalescing/splitting enabled, - while preserving the current small SSA cleanup passes. -- [x] Focused `make test-opt` coverage for coalescing, conflict refusal, - incompatible constraints, O1 preservation, and singleton splitting. -- [x] Toy pressure fixture: - `CFREE_TEST_FILTER=129_o2_pressure_coalesce CFREE_TOY_OPT_LEVELS="0 1 2" - make test-toy` -- [x] Full `CFREE_TOY_OPT_LEVELS="0 1 2" make test-toy` - -### Phase D - SSA-Backed Value and Memory Passes - -Goal: enable the MIR full pre-lowering schedule for `-O2`, starting with small -SSA-backed passes that prove the use-def substrate before heavier value and -memory optimizations. - -Implement in order: - -1. [x] `opt_ssa_dce` -2. [x] `opt_copy_cleanup` -3. [x] `opt_block_cloning` -4. [x] `opt_addr_xform` -5. [x] scalar-only `opt_gvn` -6. [x] `opt_copy_prop` -7. [x] first memory-aware `opt_gvn` / redundant load elimination tests -8. [x] CG identity expansion, shared `opt_simplify_local`, and O2 - `opt_simplify` as described in Section 4.3 -9. [x] extended memory tracker for joins/path state -10. [x] `opt_dse` -11. [x] `opt_licm` -12. [x] `opt_pressure_relief` -13. [x] `opt_make_conventional_ssa` -14. [x] `opt_ssa_combine` -15. [x] `opt_undo_ssa` -16. [x] full O2 `opt_jump_opt` beyond the shared O1 jump cleanup - -Do not batch these into one landing. Each pass needs a pass-local corpus case -that fails red without the pass or its bug fix. - -The path-aware memory GVN slice is now landed. Keep later memory rewrites tied -to the documented alias and memory numbering model: do not expand `IR_LOAD` -rewrites or reason across `IR_STORE`, calls, atomics, fences, volatile -operations, TLS, globals, or unknown memory without an explicit invalidation -rule and pass-local tests. - -Completed pre-GVN slice exit criteria: - -- [x] `make test-opt` -- [x] Focused toy O2 branch/join/address-memory subset: - `CFREE_TEST_FILTER=128_o2_branch_join_addr_mem CFREE_TOY_OPT_LEVELS="0 1 2" - make test-toy` -- [x] Full `CFREE_TOY_OPT_LEVELS="0 1 2" make test-toy` -- [x] Disassembly sanity check shows narrow churn: existing - `10_pointer_addr_deref` removes separate address materialization in O2, while - the new branch/join/address fixture has only minor register-choice churn. -- [x] Pass-local dump tests prove the implemented SSA DCE, copy cleanup, block - cloning, and address-folding rewrites fire. - -Memory numbering and alias invalidation model for memory GVN: - -- Memory value numbering is keyed by `(address-space, alias root, byte range, - memory version, load type, size, alignment, flags)`. The address operand is - still part of the key unless the alias root is a singleton local/global/string - object with a proven constant offset. This keeps two unknown pointers in the - same alias class from being treated as the same location. -- A memory version token is maintained per alias root plus one conservative - `unknown` token. `ALIAS_LOCAL`, `ALIAS_GLOBAL`, `ALIAS_PARAM`, `ALIAS_HEAP`, - and `ALIAS_STRING` each use their root id as the fine-grained version key; - `ALIAS_UNKNOWN`, missing alias data, unsupported address forms, and escaped - allocas use the unknown token. Address space participates before alias-root - comparison, so different address spaces never share a memory value number. -- A store to a precise root increments that root's version. A store through - unknown memory increments the unknown token and invalidates every non-invariant - root that may alias unknown memory. A store to volatile memory is never - forwarded from or to, and also acts as an observable memory barrier for other - volatile operations. -- Calls increment the unknown token and invalidate all roots that can escape: - globals, params, heap, TLS/global address roots, and any local whose address - has flowed to a call argument, return value, unknown/global store, aggregate - memory op, inline asm, intrinsic, or unsupported address use. Non-escaping - local roots keep their root version across ordinary calls. Later call attrs - can narrow this further. -- Atomics, fences, inline asm with memory effects, `IR_AGG_COPY`, - `IR_AGG_SET`, bitfield stores, varargs memory operations, and runtime memory - intrinsics are full memory barriers until each has a precise alias rule. - Atomic loads are not redundant-load candidates unless a later model proves - the exact ordering legality. -- TLS addresses are treated like globals for alias roots but are call-clobbered - unless the access is proven local-exec and the call cannot observe the thread - object. The first slice should conservatively invalidate them at calls. -- Redundant-load elimination may replace `IR_LOAD` only when the prior value's - def dominates the use, the memory key including version is identical, neither - access is volatile/atomic, the load type and byte range match, and no - invalidating barrier lies between the two through the version model. Store/load - reuse uses the same key and additionally requires the stored operand's value - to dominate the load. -- At block joins, memory availability is intersected across reachable - predecessors. An entry survives only when every reachable predecessor has the - same memory key available with the same `Val`; otherwise the joined load is - preserved. This slice does not invent memory phis. - -Current memory GVN slice: - -- [x] Pass-local tests cover repeated exact local loads, store/load reuse for - the same precise local, path-aware branch joins, loop-header backedge - conservatism, unknown-store barriers, call preservation for non-escaped - locals, escaped-local call clobbers, and volatile/atomic exclusions. -- [x] The pass maintains per-root plus unknown versions, treats calls and - conservative memory ops as barriers, and can see through simple `addr_of` plus - zero-index/address-minus-zero materialization when both memory operations - refer to the same precise root. -- [x] AArch64 disassembly sanity check: a direct address-taken store followed by - a load of the same address reuses the stored value, removing the post-store - `ldur`. -- [x] The tracker uses memory state snapshots at block boundaries. Join blocks - reuse memory only when all reachable predecessors agree on the same available - key/value pair; otherwise no memory phi is invented and the load stays. -- [x] Root escape information is precise enough for calls to preserve - non-escaping address-taken locals, while globals/params/heap/TLS, escaped - locals, and unknown roots remain conservatively clobbered. - -Current DSE slice: - -- [x] `opt_dse` runs in the O2 SSA schedule after GVN, copy propagation, and - SSA DCE, before conventional-SSA lowering. -- [x] Backward memory liveness removes exact precise-root stores that are - overwritten before any overlapping read, and removes unread stores to - non-escaped locals at function exit. -- [x] Calls, volatile/atomic accesses, aggregate memory ops, bitfield memory - ops, varargs memory ops, fences, inline asm, and intrinsics remain - conservative barriers. Escaped locals and externally observable roots stay - live across calls/exits. -- [x] Pass-local tests cover same-block and all-path overwritten local stores, - unread non-escaped locals, escaped-local call preservation, and volatile - preservation. - -Current LICM and pressure relief slices: - -- [x] `opt_licm` runs after DSE/DCE and loop-tree construction, hoisting safe - loop-invariant scalar work to preheaders. -- [x] LICM pass-local tests cover safe invariant hoisting and preservation of - memory loads and potentially trapping division. -- [x] `opt_pressure_relief` runs after LICM, planning all legal same-block - single-use immediate/constant sinks from one def-use rebuild and applying - affected block rewrites in linear passes. -- [x] Pressure-relief pass-local tests cover sinking, loop-boundary refusal, - multi-use preservation, memory barriers, and many same-block moves. - -Scalar/address identity simplification now runs before memory GVN. Red-green -coverage includes integer neutral/annihilator identities, same-register -integer identities, same-value integer compares, exact no-op conversions, -double bitwise-not, add-zero address chains, and a toy end-to-end identity -chain at `-O0`, `-O1`, and `-O2`. These rewrites stay typed and -legality-aware: they do not erase operations with trapping, flagged overflow, -FP rounding/NaN/sign-zero semantics, volatile/atomic memory effects, or target -addressing constraints that have not been modeled. - -Remaining Phase D exit criteria: - -- [x] Scalar GVN pass-local dump tests prove redundant pure expressions are - replaced and constant branches fold. -- [x] Memory numbering and alias invalidation are documented before any memory - GVN or redundant-load rewrite lands. -- [x] `opt_copy_prop` pass-local tests prove SSA copy propagation and redundant - extension-chain cleanup, with a focused toy O2 fixture. -- [x] AArch64 O2 disassembly sanity checks show `130_o2_copy_prop_ext` - collapsing the widening/copy chain to a single byte load/zero-extend path, - and memory GVN removing redundant post-store stack loads in covered cases. - `128_o2_branch_join_addr_mem` now covers both all-predecessors-agree and - predecessor-disagreement branch stores. -- [x] Identity simplification has focused pass-local tests, a toy end-to-end - fixture, and an AArch64 disassembly sanity check showing identity arithmetic - removed from the generated code. -- [x] LICM and pressure relief have focused pass-local tests and are in the - default O2 schedule. -- [x] `opt_ssa_combine` has focused red-green tests before it enters the - default O2 schedule. -- [x] Full O2 jump optimization has focused red-green tests before it enters - the default O2 schedule, covering switch target forwarding, empty - fallthrough-chain forwarding, repeated branch forwarding, and same-target - conditional collapse. -- [x] Post-inline cleanup quality: after Phase E inlines a simple Toy wrapper, - the existing Phase D value passes should collapse the exposed constant/copy - chain and remove frame traffic. Covered example: - - ``` - fn add1(x: i32): i32 { - return x + (1 as i32); - } - - pub fn main(): i32 { - let y: i32 = add1(41 as i32); - if y == (42 as i32) { - return 0 as i32; - } - return 1 as i32; - } - ``` - - `-O2` removes the wrapper call leftovers and AArch64 now lowers the focused - quality fixture's `main` to a constant return with no call, prologue, spill, - or reload traffic. - -### Phase E - Inlining and Cleanup - -Goal: add MIR-style small direct-call inlining for `-O2`. - -Design decision: inline before machine lowering, on retained raw `Func` IR, at -`CGTarget.finalize`. Do not inline the post-SSA-destruction or post-machinize -form. The current implementation emits at `func_end`; Phase E first has to -make the `opt.h` contract real for `-O2`: `func_end` records and stores the -function, while `finalize` performs interprocedural work, then runs the normal -per-function O2 pipeline and emission in original function order. - -Status: implemented for the Phase E v1 shape. Retained `-O2` finalize -scheduling is in place, `opt_cleanup` is the reusable O2 pre-lowering cleanup -schedule, and `pass_inline.c` inlines small direct calls to retained, acyclic -callees with void/scalar register returns. It supports straight-line wrappers -and small branchy callees with multiple scalar returns. It intentionally refuses -aggregate/sret/byval returns, tail calls, varargs, alloca, inline asm, -computed goto, and other hard cases until those rewrite rules have focused -tests. - -This matches MIR's useful property: inline while calls, parameters, frame -slots, returns, labels, and source locations are still target-independent. -Cloning target-lowered calls would couple the inliner to ABI move plans, -hard-register constraints, and late spill/rewrite state. Cloning SSA would -also require phi and def-use surgery across two functions. Raw IR cloning keeps -Phase E mechanical and lets the existing O2 passes clean up the exposed -constants, copies, dead stores, and dead blocks. - -Implementation shape: - -- [x] Add real `FuncSet` ownership to `OptImpl`: an arena-backed ordered - `Func**` plus retained-definition lookup by `ObjSymId`. No global state. - Record order is emission order at finalize. -- [x] Split scheduling in `opt.c`: - - `func_end`: `opt_frame_home_addr_taken_locals`, classify eligibility, append - to `FuncSet`, and clear the current recorder state. For `-O1`, keeping the - current immediate emit path is acceptable; `-O2` must retain functions. - - `finalize`: run `opt_inline(FuncSet*, max_iters)`, then lower/emit every - retained function. Functions that still require the defensive non-SSA O2 - route use the shared O1 lowering path at finalize. - - Refactor the current O2 scheduler into "O2 cleanup/pre-lowering" and - "lowering/emission" pieces so an inlined dirty function can reuse the - existing verifier-bracketed pass order. -- [x] Build the call graph from direct `IR_CALL` sites whose callee is - `OPK_GLOBAL` with addend zero and whose symbol has a retained `Func`. - Indirect calls and unresolved/external symbols are ordinary non-inlineable - calls. -- [x] Compute SCCs on the retained graph. Inline only calls to callees in already - processed acyclic components. Skip self recursion and mutually recursive SCCs - for v1 instead of trying partial recursive budgets. -- [x] Rebuild the call graph at each inline iteration. Default `max_iters = 1`; - keep a small internal cap even if a later driver knob requests more. - -Eligibility and budgets: - -- [x] Candidate callee must be retained, same target/context, non-variadic, not - naked/interrupt/ifunc once those attributes reach `CGFuncDesc`, and not marked - `noinline` once C declaration flags are plumbed through CG. -- [x] Candidate call site must be a normal direct call. Skip `CG_CALL_TAIL` and any - future musttail representation. -- [x] V1 skips callees containing `IR_ALLOCA`, `IR_VA_*`, `IR_LOAD_LABEL_ADDR`, - `IR_INDIRECT_BRANCH`, `IR_ASM_BLOCK`, `INTRIN_SETJMP`, `INTRIN_LONGJMP`, - coroutine switch, or any op whose cloning semantics are not explicit. -- [x] V1 supports void and scalar register returns first. Aggregate/sret/byval - returns can be enabled later after return materialization has focused tests. -- [x] Normal-call budget should be deliberately small: start around 12 to 20 - non-debug, non-label instructions, with a caller growth cap around 25 percent - and an absolute per-function growth cap. Once `inline` / `always_inline` / - `noinline` are carried into CG, use a larger explicit-inline budget and let - `always_inline` exceed the normal budget but still obey semantic refusal - rules. -- [x] Keep the heuristic target-independent. Calls, memory barriers, atomics, - and fences remain semantic refusals in Phase E v1; multiple returns add cost, - and simple scalar instructions use the normal small-instruction budget. - -Inline rewrite: - -- [x] Split the caller block at the call site into `pre`, cloned callee body, and - `cont`. The original call instruction is removed. -- [x] Allocate fresh caller-local ids for every cloned block, frame slot, scope, - pseudo-register, and SSA value id present in the callee. Preserve callee - instruction ids only as source material; cloned instructions get new stable - ids from the caller. -- [x] Clone `emit_order` for cloned blocks in callee layout order, inserted between - `pre` and `cont`. Update CFG successors/predecessors through `opt_build_cfg` - after the rewrite rather than hand-maintaining all derived metadata. -- [x] Materialize arguments at the clone entry. For register-like callee parameter - storage, emit `IR_COPY` from the call argument value to the remapped callee - parameter pseudo. For frame-backed parameters, emit stores to the remapped - parameter frame slots using the recorded `IRParam` storage and `MemAccess` - shape. Constants become `IR_LOAD_IMM` or direct stores as appropriate. -- [x] Rewrite each cloned `IR_RET`: - - void return: replace with `IR_BR cont`. - - scalar register return: copy the returned value to the call site's result - operand, then branch to `cont`. - - multiple returns: all returns branch to the same continuation; if later SSA - needs a phi, the normal O2 mem2reg/reg-SSA construction will create it from - the inserted copies/stores. -- [x] Keep ordinary continuation semantics for noreturn callees in Phase E v1. - Do not infer noreturn only from body shape until continuation omission has - focused tests. -- [x] Preserve `Inst.loc` from cloned callee instructions. Argument materialization - uses the call-site location. This is enough for line-accurate emission now; - `DW_TAG_inlined_subroutine` can remain a later DWARF phase. - -Cleanup after inlining: - -- [x] Mark modified callers dirty and invalidate CFG, def-use, dominator, and loop - analyses. -- [x] For each retained `-O2` function, run the existing O2 pre-lowering schedule - after all inline iterations. This is `opt_cleanup` in Phase E terms: - `build_cfg`, jump cleanup, reg-SSA, mem2reg SSA, simplify, GVN, copy - propagation, DSE, SSA DCE, LICM, pressure relief, conventional SSA, - SSA combine, undo SSA, and jump optimization, with the same verifier - checkpoints as the current O2 path. -- [x] Then run the shared lowering pipeline once and emit. No function should be - emitted before all possible callers have had a chance to inline it. - -Testing plan: - -- [x] Unit/pass tests in `test/opt` for direct wrapper inline, two-return scalar - inline, refusal for recursion/SCCs, refusal for varargs/alloca/setjmp/asm, - and caller growth cap. -- [x] End-to-end toy/C cases where `static int add1(int x) { return x + 1; }` - disappears at `-O2` but remains a call at `-O1`. -- [x] Keep the Toy `add1(41) == 42` example from the Phase D checklist as a - disassembly sanity case: Phase E is responsible for removing the call; Phase - D cleanup is responsible for removing the exposed constant/copy/branch - leftovers. -- [x] A chained-call case proves bottom-up order: `main -> f -> g` inlines `g` - into `f` before deciding whether `f` fits into `main`. -- [x] A negative recursive case proves codegen is unchanged and correct. -- [x] A focused disassembly sanity check on AArch64 or x64 verifies wrapper-call - removal after inline+cleanup without broad corpus churn. - -Exit criteria: - -- [x] `make test-opt` -- [x] Focused wrapper/chained-call toy fixtures: - `CFREE_TEST_FILTER=<inline-case> CFREE_TOY_OPT_LEVELS="0 1 2" make test-toy` -- [x] Full `CFREE_TOY_OPT_LEVELS="0 1 2" make test-toy` -- [x] Recursive and mutually recursive cases are unchanged and correct. - -### Phase F - Benchmark Closure - -Goal: tune by measurement, not by adding passes speculatively. - -Benchmark set: - -- MIR `c-benchmarks`: `array`, `binary-trees`, `funnkuch-reduce`, - `hash`, `hash2`, `heapsort`, `lists`, `mandelbrot`, `matrix`, - `method-call`, `nbody`, `sieve`, `spectral-norm`, `strcat`, and any - benchmark that requires only supported libc/runtime features. -- cfree-specific stress cases for ABI, TLS, atomics, and inline asm. - -Primary harness: +### Inlining -``` -make bench-opt -``` +`opt_inline` operates on `FuncSet`. -`scripts/opt_bench.sh` compares MIR `c2m`, `clang`, `gcc-15`, `cfree cc`, -and `cfree run` at `-O0`, `-O1`, and `-O2` on MIR's `c-benchmarks` sources -from `~/tmp/mir/c-benchmarks`. It writes per-case logs, `results.csv`, and a -geomean summary under `build/bench/opt/`. +- The call graph is built from direct `IR_CALL` sites whose callee is an + `OPK_GLOBAL` with addend zero and a retained `Func`. +- Strongly connected components are used to avoid recursive inlining. Self and + mutually recursive SCCs are refused for v1. +- The current iteration cap is small. The default is one inline iteration. +- Candidate callees must be retained, target-compatible, non-variadic, not + recursive, and within a conservative growth budget. +- V1 refuses callees containing alloca, varargs, setjmp/longjmp, inline asm, + label-address loads, indirect branches, coroutine switching, or unknown + cloning semantics. +- Void and scalar register returns are supported. Aggregate/sret/byval returns + are left for future work. +- The caller block is split into pre-call, cloned body, and continuation blocks. + Arguments are materialized in the clone entry, returns branch to the + continuation, and modified callers are marked dirty. -The hosted `cfree cc` lane passes `--sysroot` and `-lc`; on Darwin the -sysroot defaults to `xcrun --show-sdk-path`, and it can be overridden with -`CFREE_OPT_BENCH_SYSROOT`. The `cfree run` lane is an in-process JIT run for -direct comparison with MIR; it also passes `--sysroot` and `-lc`, expands the -hosted libc headers/defines in the driver, and relies on host-symbol fallback -for libc calls. +After inlining, dirty functions run the normal `-O2` cleanup schedule. -The benchmark intentionally records unsupported cases instead of stopping at -the first failure. A cfree row that cannot compile a hosted benchmark is a -`COMPILE_FAIL` data point, not a harness failure; this lets optimizer work -track coverage and performance in the same run. +### SSA Cleanup Schedule -Useful focused runs: +`opt_cleanup` is the `-O2` pre-lowering schedule. -``` -CFREE_OPT_BENCHES="sieve spectral-norm" make bench-opt -CFREE_OPT_BENCH_LEVELS="1 2" CFREE_OPT_BENCH_RUN_REPEATS=5 make bench-opt -GCC=/opt/homebrew/bin/gcc-15 MIR_DIR=~/tmp/mir make bench-opt -``` +1. `opt_build_cfg`, `opt_jump_cleanup`, `opt_build_cfg` + Canonicalize control flow before SSA. -Measure: - -- Compile wall time for `-O0`, `-O1`, `-O2`. -- Executable run time against `gcc-15 -O2`, `clang -O2`, and MIR `c2m -O2` - when available. -- MIR-specific split: `compile_ms` is C-to-binary-MIR time, `codegen_ms` is - the JIT link/generation slice reported by `c2m -v`, and `runtime_ms` is the - generated function execution slice. Native compiler rows use compile+link - wall time for `compile_ms` and executable wall time for `runtime_ms`. -- `cfree-run` uses `cfree run --bench-time`: `compile_ms` is compile+JIT time, - and `runtime_ms` is the in-process entry-call execution slice. -- Code size for hot text sections. -- Pass counters: removed GVN expressions, folded branches, removed stores, - coalesced moves, spills/restores, split ranges, post-RA deleted moves. - -Initial representative run, 2026-05-22: - -- Scope: `array`, `binary-trees`, `hash`, `hash2`, `matrix`, `nbody`, - `sieve`, and `spectral-norm`; levels `0 1 2`; one compile repeat and one - run repeat. Output was written to `build/bench/opt/results.csv` and - `build/bench/opt/summary.md`. -- Coverage: 120 data rows; 91 `OK`, 18 `COMPILE_FAIL`, 9 `RUN_FAIL`, and - 2 `OUTPUT_FAIL`. -- `gcc-15` and `clang` completed all rows. cfree and MIR are blocked on some - hosted/math cases: `binary-trees`, `nbody`, and `spectral-norm` hit Darwin - `math.h`/builtin compatibility issues. cfree also has an `-O2` wrong-code - failure on `matrix`. -- Runtime geomean versus `gcc-15 -O2` on completed rows: - `cfree-run` improved from `0.372x` at `-O0` to `0.445x` at `-O1` and - `0.480x` at `-O2`; MIR improved from `0.554x` to `0.609x` to `0.796x`. - `clang -O2` measured `1.076x`. -- Compile/JIT geomean versus `gcc-15 -O2`: `cfree-run` was about `11.5x` to - `12.0x` faster, while MIR was about `2.7x` faster. On cases where both - completed, cfree-run compile/JIT time was roughly 4-5x faster than MIR. -- Direct cfree-run versus MIR runtime on common successful cases: - `0.67x` at `-O0`, `0.73x` at `-O1`, and `0.57x` at `-O2`. This confirms - the current split: cfree's compile/JIT path is very fast, but generated-code - quality still trails MIR, especially at `-O2`. - -Immediate benchmark blockers: - -- [x] Fix the `matrix -O2` wrong-code regression before trusting O2 timing. -- [x] Add or model the hosted math builtins needed by Darwin `math.h`, - including `__builtin_fabsf`, `__builtin_inf*`, `__builtin_huge_val*`, - `_Float16` declarations, and the float/double predefined limit macros used - by the system header. -- Re-run the full MIR benchmark set after those blockers, then increase repeat - counts for stable numbers. - -Follow-up representative run after cfree blocker fixes, 2026-05-22: - -- Same 8-benchmark scope, levels `0 1 2`, one compile repeat, and one run - repeat. -- Coverage: 120 data rows; 111 `OK`, 9 `COMPILE_FAIL`, 0 `RUN_FAIL`, and - 0 `OUTPUT_FAIL`. -- All `cfree` and `cfree-run` rows completed successfully, including - `matrix -O2`, `binary-trees`, `nbody`, and `spectral-norm`. The remaining - `COMPILE_FAIL` rows are all `mir-c2m` on the hosted/math benchmarks. -- Updated runtime geomean versus `gcc-15 -O2` on the representative scope: - `cfree-run` measured `0.363x` at `-O0`, `0.519x` at `-O1`, and `0.481x` - at `-O2`; MIR measured `0.550x`, `0.574x`, and `0.786x` on its five - completed cases. `clang -O2` measured `1.072x`. - -Focused rerun of the formerly broken cfree rows, 2026-05-22: - -- Scope: `matrix`, `binary-trees`, `nbody`, and `spectral-norm`; levels - `0 1 2`; one compile repeat and one run repeat. Output was written to - `build/bench/opt/rerun-broken-cfree/results.csv`. -- All `cfree` and `cfree-run` rows were `OK`. -- Geomean speed ratios use each benchmark's `gcc-15 -O2` row as `1.0x` for - both compile time and run time: - -| tool | opt | compile speed | runtime speed | -| --- | ---: | ---: | ---: | -| `cfree` | `O0` | `7.050x` | `0.319x` | -| `cfree` | `O1` | `7.062x` | `0.534x` | -| `cfree` | `O2` | `6.831x` | `0.483x` | -| `cfree-run` | `O0` | `13.709x` | `0.323x` | -| `cfree-run` | `O1` | `14.286x` | `0.550x` | -| `cfree-run` | `O2` | `11.791x` | `0.512x` | - -- On this focused scope, `O1` currently has the best cfree runtime geomean; - `O2` remains correct but is slower than `O1` on the measured mix. - -Target: - -- `-O1` should be the fast optimized tier and materially faster to compile - than `-O2`. -- `-O2` should first reach QBE-class performance on the benchmark set, then - close toward MIR's published `c2m -eg` geomean. - ---- - -## 7. Module Layout - -Keep the MIR pass boundaries as separate cfree modules: +2. `opt_build_reg_ssa` + Converts mutable pseudo-registers to SSA `Val` ids. -``` -src/opt/ - opt.c wrapper, vtable bindings, schedule dispatch - ir.c Func/Block/Inst plumbing - ir_print.c stable dumps for pass tests - pass_cfg.c CFG, unreachable cleanup - pass_o2.c early O2 transforms: block cloning, address transform, - scalar GVN, temporary DSE/cleanup stubs - pass_simplify.c local and SSA-aware scalar/address identity simplification - pass_clone.c block cloning (split out when pass size warrants it) - pass_ssa.c SSA build, conventional SSA, undo SSA - pass_addr.c address transformation (split out when pass size warrants it) - pass_gvn.c GVN, constprop, redundant-load elimination - pass_copy.c copy propagation, extension cleanup - pass_dse.c dead store elimination - pass_dce.c SSA DCE and post-RA DCE - pass_loop.c loop tree and LICM - pass_pressure.c pressure relief - pass_jump.c jump optimization - pass_machinize.c target ABI and machine-form lowering - pass_live.c liveness, pressure, live ranges - pass_lower.c current assignment/rewrite implementation - pass_coalesce.c move collection and coalescing - pass_regalloc.c assignment, rewrite, splitting (split out when warranted) - pass_combine.c target-aware code selection - pass_emit.c drive wrapped CGTarget - pass_inline.c inlining and cleanup -``` +3. `opt_block_cloning` + Clones small cold/return blocks into hot predecessors when bounded growth + exposes better local optimization. -No pass should reach into the wrapped target's private implementation. Target -specifics belong behind `Target`, `TargetABI`, or explicit helper hooks. +4. `opt_build_ssa` + Promotes eligible frame-backed locals/params to mem2reg SSA, inserts phis, + and rewrites loads/stores to values. ---- +5. `opt_ssa_dce` and `opt_copy_cleanup` + Remove unused SSA definitions and redundant copies before heavier passes. -## 8. Pass Invariants +6. `opt_addr_xform` + Folds address-producing pseudos into direct memory operands where legal and + preserves address-taken storage in memory. -- No VLAs and no global optimizer state. MIR uses generator context; cfree - should hang all pass state off `Compiler`, `Func`, `FuncSet`, or explicit - pass contexts. -- `PReg` names mutable virtual-register storage before reg-SSA. `Val` names - SSA values after pseudo-register SSA construction. Lowering may map PRegs or - post-SSA Vals to physical registers or stack slots, but that mapping belongs - to allocation state. -- Frame slots remain frame slots until a pass proves promotion is legal. -- Calls are memory barriers unless alias information proves otherwise. -- Inline asm is side-effecting and may pin hard regs; the allocator and - inliner must treat it conservatively. -- `set_loc` is sticky on recorded insts. Every emitted instruction must forward - the active location to the wrapped target before machine emission. -- Every pass either preserves CFG metadata or invalidates/rebuilds it before - the next consumer. +7. `opt_simplify` + Applies SSA-aware identity simplification, constant simplification, and + local algebraic cleanup. ---- +8. `opt_gvn` + Performs scalar value numbering, constant propagation, branch folding, + redundant expression elimination, and memory-aware redundant load/store-load + reuse where alias/version rules permit it. -## 9. Test Strategy +9. `opt_copy_prop` + Propagates SSA copies and removes redundant extension/copy chains. -Targeted commands: +10. `opt_dse` + Removes stores proven overwritten or dead along all paths, while preserving + observable memory and call/escape boundaries. -``` -CFREE_TOY_OPT_LEVELS="0 1 2" make test-toy -CFREE_TEST_FILTER=<case> CFREE_TOY_OPT_LEVELS="0 1 2" make test-toy -make test-opt -``` +11. `opt_build_loop_tree`, `opt_licm` + Rebuild loop information and hoist safe invariant computations. -Use pass dumps for red-green tests: +12. `opt_pressure_relief` + Sinks legal same-block computations to reduce peak pressure before + allocation. -- CFG: block/successor/predecessor shape. -- SSA: phi placement and def-use chains. -- Scalar GVN: redundant pure expression replaced and constant branch folded, - with def-use rebuilt and stale-analysis verifier coverage. -- Memory GVN: redundant load/store-load reuse only after alias and memory - numbering are specified; tests must include volatile/atomic/call/unknown - memory barriers. -- DSE: dead store removed while escaping stores remain. -- LICM: safe invariant moved; trapping division and memory ops remain. -- RA: expected spill/reload/coalescing counters. -- Combine: target-legal fused instruction shape. +13. `opt_make_conventional_ssa` + Lowers SSA phis into edge copies. These copies are marked + `IRF_NO_COALESCE`; coalescing them can collapse current and next + loop-carried values and miscompile loops. + +14. `opt_ssa_combine` + Combines patterns while conventional SSA edge copies are explicit. -Before broad runs, redirect output to a file and inspect the failure slice: +15. `opt_undo_ssa`, `opt_copy_cleanup` + Rewrites SSA values back into allocation-ready pseudo-register form. + +16. `opt_jump_opt` + Runs full O2 jump optimization before shared lowering. + +Verifier checkpoints bracket the important transitions: CFG, reg SSA, block +cloning, mem2reg SSA, address transform, simplify, GVN, copy prop, DSE, LICM, +pressure relief, conventional SSA, SSA combine, undo SSA, and jump opt. -``` -CFREE_TEST_FILTER=<case> CFREE_TOY_OPT_LEVELS="0 1 2" make test-toy >build/opt-test.log 2>&1 -tail -200 build/opt-test.log -``` +After cleanup, `-O2` enters the shared lowering pipeline with live-range +splitting enabled. + +## Register Allocation + +Allocation uses target-provided register classes and hard-register sets. + +- Pseudo-register metadata records type, class, fixed/tied hard-register + constraints, and forbidden hard-register masks. +- Liveness is represented as block live-in/live-out sets and live ranges over + instruction points. +- Coalescing collects move-related candidates, builds a bounded conflict matrix + for those candidates, and merges only same-class, same-type registers with + compatible hard-register constraints and no live-range conflict. +- Copies marked `IRF_NO_COALESCE` are not coalescing candidates. +- `-O2` allocation may split live ranges and insert boundary reload/store + instructions. Critical edges are split when needed for placement. +- Spill slots are represented as frame slots and rewritten as ordinary loads + and stores before emission. + +## Verification, Dumps, and Metrics + +`opt_verify(Func*, stage)` checks CFG reciprocity, reachable block shape, +emit-order validity, instruction ids, operand namespaces, phi consistency, and +def-use freshness. Passes should keep verifier failures local by using stage +names that describe the last completed transformation. + +The optimizer also emits metrics through the compiler metrics interface. The +driver's `cfree run --time` mode is the easiest way to inspect scoped timings +such as frontend, pass totals, link, JIT, and execution. + +IR dumps and pass-local tests should prefer stable ids and pass-specific +fixtures. End-to-end parse/toy tests should be focused on user-visible behavior +and disassembly sanity checks, not broad churn. + +## Current Limits -The full toy corpus remains the end-to-end equivalence oracle for frontend -coverage: levels `1` and `2` must produce the same observable result as level -`0`. The API and opt unit suites remain the narrower pass/CG contract oracle. -DWARF equivalence can start weaker, but `set_loc` forwarding must not regress -line-row emission. +- `_Float16` is parsed for Darwin header compatibility but is not a real + half-precision ABI implementation. +- Inlining v1 does not handle aggregate returns, sret/byval, recursive SCCs, + varargs, alloca, setjmp/longjmp, inline asm, or indirect-control constructs. +- `cfree-run --bench-time` reports compile+JIT as one slice. Finer-grained + timing is available through `--time`, but benchmark CSVs do not yet split + cfree frontend, optimizer, link, and JIT into separate columns. +- Performance goals, measurements, and tuning priorities live in + `doc/OPT_PERF.md`. diff --git a/doc/OPT1.md b/doc/OPT1.md @@ -1,330 +0,0 @@ -# OPT1 - Current O1 Backend Path - -`-O1` is cfree's fast optimized backend tier. It records frontend CG -operations as IR, runs a small MIR-shaped lowering/allocation pipeline, rewrites -virtual values to hard registers or spill slots, performs narrow post-RA -cleanup, and emits through the target backend. - -This document has two jobs: - -1. document the implemented O1 pipeline; and -2. record the completed O1 code-shape cleanup and remaining performance - watchpoints. - -Compile-time measurements and scaling notes live in `doc/PERF.md`. Broader -optimizer direction lives in `doc/OPT.md`. - -## Goals - -O1 should remain: - -- correct across supported targets and ABI lowering paths; -- materially better than O0 generated code; -- close enough to O0 compile time for JIT use; -- free of dense pseudo-register conflict matrices and whole-function - point/candidate products in hot paths; -- simple enough that O2 can add coalescing, splitting, SSA, and value - optimization without destabilizing the O1 baseline. - -O1 is not intended to perform full SSA optimization, global value numbering, -LICM, general inlining, or global block layout. Those belong in O2 unless a -small local cleanup is required to keep O1 code shape reasonable. - -## Pipeline - -The normal O1 function-end path is: - -```text -build_cfg -jump_cleanup.cfg -build_cfg -machinize -build_loop_tree -live_blocks.pre_dde -dead_def_elim -regalloc: - live_blocks - live_ranges_build - assign_ranges - rewrite -combine -dce -jump_cleanup.cfg -build_cfg -jump_cleanup.layout -emit -``` - -Pass boundaries: - -- `machinize` applies target-specific ABI, call, and register constraints before - liveness and allocation. -- `live_blocks.pre_dde` computes block liveness for pre-RA dead-def - elimination. -- `dead_def_elim` removes dead virtual definitions using block liveness and - operand-reference events. -- `regalloc` rebuilds liveness after DDE, constructs compressed live ranges, - assigns hard registers or spill slots, and rewrites operands. -- `combine` is narrow post-RA cleanup: noop physical copy deletion, safe - single-use folds, redundant spill load/store cleanup, and similar local - rewrites. -- `dce` is conservative after rewrite and removes only trivial dead IR shapes. -- `jump_cleanup.cfg` forwards branches to branch-only blocks and performs local - compare-branch inversion when that removes an unconditional jump block. -- `jump_cleanup.layout` runs immediately before emit and deletes unconditional - branches whose target is the physical fallthrough block. - -## Liveness And Ranges - -O1 liveness is pass-local. Persistent per-instruction live-after storage and -dense value conflict matrices are not part of the normal path. - -`opt_live_blocks` computes block `live_use`, `live_def`, `live_in`, and -`live_out` sets using `OptBitset`. `OptBitset` tracks active words, trims -trailing zero words, and bounds copy/union work by active length where possible. - -`opt_live_ranges_build` turns block liveness plus backward instruction walks -into compressed live ranges: - -- per-instruction references are collected as use/def value lists; -- open and touched values are tracked with generation marks; -- raw instruction points are compressed to live-range boundary points; -- live-across-call frequency is tracked from a rolling live-value list; -- per-value summaries record live length, use/def frequency, block frequency, - call-crossing frequency, and spill cost. - -The expected O1 invariant is `opt.conflict_bytes=0`. - -## Allocation - -The normal O1 allocator is range based. It does not build or consume a -pseudo-register interference matrix. - -Candidates are ordered by: - -1. tied hard-register requirement; -2. spill/frequency cost; -3. shorter live length; -4. stable value id. - -Each value is assigned either: - -- a target-provided hard register in the value's register class; or -- a compatible `FS_SPILL` frame slot. - -Non-tied values known to be live across calls prefer callee-saved registers, but -may use caller-saved registers when no better non-conflicting location is -available. Correctness does not depend on avoiding caller-saved registers -globally: rewrite preserves hard-assigned values only at calls whose -target-provided clobber mask actually destroys that register. - -Occupancy is stored as sorted per-location interval sets: - -- one interval vector per physical hard-register location; -- one interval vector per allocated stack spill slot. - -Conflict checks test whether any candidate live range overlaps the location's -intervals. Marking inserts and merges intervals. The old point-index bitmap row -OR/mark path is not part of normal O1. - -Metrics still report `opt.alloc.used_loc_words`, `opt.alloc.hard_loc_words`, -`opt.alloc.stack_loc_words`, point visits, and mark counters for trend -compatibility, but `opt.alloc.hard_word_ors` and -`opt.alloc.stack_word_ors` should remain zero on the interval path. - -## Rewrite - -Rewrite consumes the assignment map and walks each block backward with rolling -liveness: - -- hard-assigned pseudos become physical registers; -- spilled uses receive reloads through target scratch registers; -- spilled defs receive stores; -- call save/restore insertion checks hard-assigned values known to be live - across some call, then intersects each value's assigned register with that - call's target-provided clobber mask; -- inline-asm register constraints are applied only for functions containing - `IR_ASM_BLOCK`. - -Rewrite does not materialize per-instruction full live-after sets. - -## Target Contract - -O1 relies on each target backend to provide: - -- physical-register metadata per register class, including allocable, - caller-saved, callee-saved, argument, return, reserved, and preference/cost - flags; -- legacy allocable hard-register pools per register class for direct-CG - compatibility and as a fallback while older backends migrate; -- scratch-register pools disjoint from allocable pools; -- caller-saved register classification and per-call clobber masks; -- return-register masks and callee-save masks; -- call-plan construction that describes ABI register destinations, outgoing - stack arguments, return registers, call clobbers, return masks, outgoing - stack size, and variadic metadata before liveness and allocation; -- parameter storage binding through `CGTarget.param`, which may return either a - frame slot or register-backed local storage for simple direct ABI params; -- optional hard-register reservation before backend `func_end`; -- target legality for local folds performed by `opt_combine`. - -The current target pools are: - -- AArch64: integer `x19-x28`, FP `v8-v23`; -- x64: integer `R13/R14/R15/R10`, FP `XMM6-XMM15`; -- RV64: integer `s4-s11`, FP `fs4-fs11`. - -Backends own final prologue/epilogue emission and callee-saved register -preservation. O1 calls `reserve_hard_regs` with hard registers still visible in -replay after cleanup so backend save/restore decisions match emitted IR. - -Targets may provide a known-frame entry path. When `func_begin_known_frame` and -`call_stack_size` are both available, O1 computes frame slots, outgoing call -area size, and dynamic-allocation presence before body replay and passes that -to the backend. AArch64, x64, and RV64 use this path to emit the exact prologue -up front instead of reserving target NOP patch slots. - -## Current Generated-Code Shape - -Fresh May 2026 probes were run with `build/cfree` rebuilt at host `-O2`, on an -arm64 machine, focusing on native AArch64 codegen. The same simple C probes -were also compiled for x64/RV64 by the probe script, but AArch64 is the primary -quality target here. - -Current progress: - -- Immediate scalar arithmetic folds through simple local shadows. For example, - `int x = 40; x += 2; return x;` lowers to immediate return materialization on - AArch64. -- Simple loop locals stay in registers. -- Branch-consuming compares lower to direct compare branches. -- O1 branch cleanup removes branch-to-branch targets and unconditional branches - to physical fallthrough when it can do so locally. -- Address-taken locals still go through memory, as expected for current O1. -- The old prologue NOP patch path is gone; known-frame entry is used where the - target supports it. -- Empty leaf functions omit avoidable frame setup and return directly. -- Scalar return producers can be retargeted into ABI return registers, removing - target-level return self-copies. -- Unused scalar parameters consume ABI locations without materializing moves, - and used stack-passed scalar parameters load directly from the incoming stack - area. - -Representative AArch64 `const_local` output: - -```asm -mov w0, #0x2a -ret -``` - -Representative AArch64 `scalar_add` output: - -```asm -add w0, w0, w1 -ret -``` - -A ninth-argument AArch64 scalar probe now loads directly from the incoming -stack area in a frameless leaf: - -```asm -ldur w0, [sp] -add w0, w0, #1 -ret -``` - -## Completed O1 Codegen Work - -The O1 cleanup list is now implemented for the current backend path: - -- Leaf functions can omit the frame on AArch64, x64, and RV64 when the known - frame is empty, the function has no calls or dynamic stack allocation, and the - backend has no callee-saved state to preserve. Returns in such functions emit - the target return instruction directly instead of branching to an adjacent - epilogue. -- Post-RA cleanup removes physical self-copies, collapses short copy chains, - retargets safe single-use producers, and retargets adjacent scalar return - producers into ABI return registers when target metadata allows it. -- Target physical-register metadata exposes caller-saved, callee-saved, - argument, return, reserved, and scratch policy. O1 prefers caller-saved - registers for non-call-crossing values, uses callee-saved registers for - call-crossing values when profitable, and ties non-call-crossing scalar - parameters to their incoming ABI registers when legal. -- Non-address-taken scalar parameters remain register-backed. Unused - register-backed parameters consume their ABI locations without materializing - pointless moves, and stack-passed scalar parameters are loaded directly from - the incoming stack area. Frameless functions use `sp`/`rsp` for those loads; - framed functions use the target frame pointer as before. -- Planned call replay is the normal O1 call path for supported target plans. - Replay uses a local parallel-move resolver for register moves, stack argument - stores, return extraction, cycles, and indirect callees that overlap argument - destinations. -- Branch cleanup remains intentionally local: branch-target forwarding, - compare-branch inversion for local jump-block removal, unreachable pruning via - CFG, and physical fallthrough branch deletion. - -Representative AArch64 output after these slices: - -```asm -f: - mov w0, #0x2a - ret - -g: - add w0, w0, w1 - ret - -h: // ninth integer argument - ldur w0, [sp] - add w0, w0, #1 - ret -``` - -Further improvements such as global block ordering, general coalescing, SSA -value optimization, and deeper stack-heavy compile-time tuning belong to O2 or -`doc/PERF.md`, not the O1 cleanup baseline. - -## Compile-Time Watchlist - -The current O1 compile-time shape is good for normal C ladders: many-function -and direct-call generated inputs are approximately linear through 1024 -functions with `opt.conflict_bytes=0`. - -The remaining counter-level warning is stack-heavy pressure. In the current -host-`-O2` native run: - -```text -shape step opt.o1.total regalloc used_words -spill_argc 512->1024 2.07x 2.23x 3.60x -params 256->512 1.73x 1.80x 3.33x -``` - -Wall time is only mildly super-linear, but the `used_words` growth should be -investigated before much larger stack-heavy inputs are treated as normal O1 -workloads. See `doc/PERF.md` for the full tables. - -## Tests And Metrics - -Focused O1 coverage lives in `test/opt/` and the phase0 guardrails. It covers -branch liveness, call-clobber preservation, spill pressure, inline asm -tied/fixed registers, pre-RA DDE, post-rewrite cleanup, and cross-target mock -validation. - -Useful targeted runs: - -```sh -make test-opt -make test-cg-api -make test-toy -make test-aa64-inline -make test-smoke-x64 -``` - -`cfree run --time -O1` emits counters for: - -- function/block/inst/value counts; -- block liveness active words and dataflow visits; -- live range counts, compressed points, live-word touches, and value scans; -- allocator hard/stack occupancy, interval probes/marks, spills; -- rewrite reloads/stores/inserted instructions/live traffic; -- link/JIT and compile frontend scopes. diff --git a/doc/OPT_PERF.md b/doc/OPT_PERF.md @@ -0,0 +1,200 @@ +# OPT Performance Tracking + +This document tracks optimizer performance, benchmark coverage, and the current +gap to MIR and GCC. The optimizer design and pass order live in `doc/OPT.md`. + +## How To Run + +Primary harness: + +```sh +make bench-opt +``` + +Useful focused runs: + +```sh +CFREE_OPT_BENCHES="sieve spectral-norm" make bench-opt +CFREE_OPT_BENCH_LEVELS="1 2" CFREE_OPT_BENCH_RUN_REPEATS=5 make bench-opt +GCC=/opt/homebrew/bin/gcc-15 MIR_DIR=~/tmp/mir make bench-opt +``` + +The harness compares: + +- `gcc-15` +- `clang` +- `cfree cc` +- `cfree run` +- MIR `c2m` + +It writes `results.csv`, per-case logs, binaries, and `summary.md` under +`build/bench/opt/` or `CFREE_OPT_BENCH_OUT`. + +## Reporting Model + +The summary reports compile time in two complementary ways. + +1. Unified geomean ratios + + Compile-speed ratios use `compile_ms + codegen_ms` where both are + available. For MIR, this combines C-to-binary-MIR time and JIT link/generate + time so the headline number is comparable to `cfree-run` as a whole. + +2. Split compile timings + + For MIR, `compile_ms` is C-to-binary-MIR time and `codegen_ms` is the + `c2m -v` link/JIT generation slice. This table explains whether a + compile-time difference comes from C parsing/lowering or from the optimized + generation pipeline. + +For `cfree-run`, `--bench-time` currently reports compile+JIT as one slice. +Use `cfree run --time` for finer-grained local investigations. The benchmark +CSV does not yet split cfree frontend, optimizer, link, and JIT into separate +columns. + +MIR's benchmark binary is an optimized host executable. On the current Darwin +machine, `/Users/ryan/tmp/mir/c2m` is built by MIR's makefile with +`-O3 -DNDEBUG`. + +## Current Representative Run + +Date: 2026-05-22. + +Scope: + +- `array` +- `binary-trees` +- `hash` +- `hash2` +- `matrix` +- `nbody` +- `sieve` +- `spectral-norm` + +Levels: `0 1 2`. + +Repeats: one compile repeat and one run repeat. + +Output: + +- `build/bench/opt/results.csv` +- `build/bench/opt/summary.md` + +Coverage: + +| status | rows | +| --- | ---: | +| OK | 111 | +| COMPILE_FAIL | 9 | +| RUN_FAIL | 0 | +| OUTPUT_FAIL | 0 | + +All `cfree` and `cfree-run` rows completed successfully, including the +previously broken `matrix -O2`, `binary-trees`, `nbody`, and `spectral-norm` +rows. The remaining `COMPILE_FAIL` rows are all MIR `c2m` on the Darwin +hosted/math benchmarks. + +## Headline Geomeans + +Base: each benchmark's `gcc-15 -O2` row is `1.0x`. + +Representative 8-case scope: + +| tool | opt | cases | compile speed | runtime speed | +| --- | ---: | ---: | ---: | ---: | +| `cfree` | `O0` | 8 | `5.649x` | `0.346x` | +| `cfree` | `O1` | 8 | `5.792x` | `0.502x` | +| `cfree` | `O2` | 8 | `5.730x` | `0.468x` | +| `cfree-run` | `O0` | 8 | `11.229x` | `0.363x` | +| `cfree-run` | `O1` | 8 | `11.647x` | `0.519x` | +| `cfree-run` | `O2` | 8 | `10.474x` | `0.481x` | +| `mir-c2m` | `O0` | 5 | `2.476x` | `0.550x` | +| `mir-c2m` | `O1` | 5 | `2.515x` | `0.574x` | +| `mir-c2m` | `O2` | 5 | `2.437x` | `0.786x` | + +Interpretation: + +- cfree wins the whole compile/JIT headline mostly because its C frontend and + hosted-header path are much faster, and because MIR fails the Darwin + hosted/math rows. +- On MIR-common cases where both systems have IR, MIR's optimized generation + pipeline is faster and produces better runtime code. +- cfree `O1` currently has the best cfree runtime geomean on this mix. + `O2` is correct but slower than `O1` on the current benchmark set. + +## Formerly Broken cfree Rows + +Focused rerun: + +- Scope: `matrix`, `binary-trees`, `nbody`, `spectral-norm` +- Levels: `0 1 2` +- Output: `build/bench/opt/rerun-broken-cfree/results.csv` + +All `cfree` and `cfree-run` rows were `OK`. + +Base: each benchmark's `gcc-15 -O2` row is `1.0x`. + +| tool | opt | compile speed | runtime speed | +| --- | ---: | ---: | ---: | +| `cfree` | `O0` | `7.050x` | `0.319x` | +| `cfree` | `O1` | `7.062x` | `0.534x` | +| `cfree` | `O2` | `6.831x` | `0.483x` | +| `cfree-run` | `O0` | `13.709x` | `0.323x` | +| `cfree-run` | `O1` | `14.286x` | `0.550x` | +| `cfree-run` | `O2` | `11.791x` | `0.512x` | + +## Optimizer/Link/JIT Split + +Focused timing pass: + +- Scope: `array`, `hash`, `hash2`, `matrix`, `sieve` +- Output: `build/bench/opt/opt-link-split-common/summary.csv` +- cfree measurement: sum of `opt.o*.total`, `link.resolve.total`, and JIT + scoped timings from `cfree run --time` +- MIR measurement: `MIR link finish` from `c2m -v -O<n> file.bmir -eg` + +| opt | cfree opt+link+JIT ms | MIR link/JIT ms | result | +| ---: | ---: | ---: | --- | +| `O0` | `0.275` | `0.758` | cfree `2.76x` faster | +| `O1` | `4.552` | `0.718` | MIR `6.34x` faster | +| `O2` | `4.728` | `1.100` | MIR `4.30x` faster | + +This is the key split for tuning: cfree's final link/JIT mechanics are very +fast, but MIR's optimized generation pipeline is substantially faster at +`O1` and `O2`. + +## Goals + +Compile-time goals: + +- Preserve cfree's frontend and whole-pipeline compile/JIT advantage. +- Add benchmark CSV columns for cfree frontend, optimizer, link, and JIT slices + instead of relying on ad hoc `--time` logs. +- Reduce optimized-pass cost. The first target is closing the MIR link/JIT gap + on the common-case split measurement. +- Keep `-O1` materially faster to compile than `-O2`. + +Runtime-quality goals: + +- Make `O2` reliably faster than `O1` on the representative benchmark mix. +- Close toward MIR `O2` on MIR-common cases. +- First external target: reach at least QBE-class performance on the MIR + `c-benchmarks` set, then close toward MIR's published `c2m -eg` geomean. + +Coverage goals: + +- Keep all cfree rows passing on the representative scope. +- Expand the benchmark set back toward all MIR `c-benchmarks` that require only + supported hosted libc/runtime features. +- Track MIR failures separately from cfree failures so cfree coverage is not + obscured by external tool limitations. + +## Current Tuning Priorities + +1. Investigate why `O2` trails `O1` on the current cfree runtime geomean. +2. Profile expensive O2 passes on `hash`, `hash2`, and `matrix`, where the + opt/link split shows cfree optimized generation much slower than MIR. +3. Improve generated-code quality before adding broad new passes. Prefer + targeted fixes backed by benchmark deltas and pass-local tests. +4. Add finer cfree timing columns to `scripts/opt_bench.sh` once the driver can + expose parse, optimize, link, and JIT slices in parseable form. diff --git a/doc/PERF.md b/doc/PERF.md @@ -1,285 +0,0 @@ -# Cfree O1 Performance Notes - -This file records the current `cfree run --time -O1` shape and the remaining -compile-time scaling risks. It is intentionally focused on the current O1 path; -older allocator phase history is preserved in git and in `doc/MIR_RA_REPORT.md`. - -## Measurement Setup - -Fresh snapshot: May 17, 2026. -Follow-up stack-assignment probe: May 21, 2026. - -- Host: `arm64` macOS. -- Compiler under test: `build/cfree`, rebuilt as a Mach-O arm64 executable with - the compiler itself built by clang at host `-O2`. -- Timing path: native `build/cfree run --time -O1`, which exercises the native - AArch64 JIT/codegen path on this machine. -- Samples: three runs per point; tables use p50 milliseconds. -- Raw local outputs from the run: - - `build/o1-probes/scaling.tsv` - - `build/o1-probes/spill_argc.tsv` - - `build/o1-current/params-hostO2.tsv` - - `build/o1-current/toy-hostO2.tsv` - - `build/o1-investigate/spill_argc_final.tsv` - - `build/o1-investigate/params_argc_final.tsv` - -The build log should show `clang -O2 ...` for driver, lib, C frontend, and Toy -frontend objects before these timings are compared to future runs. - -## Metric Shape - -Top-level `cfree run --time` scopes: - -- `run.total`: full driver path after option/input loading through entry return. -- `run.compile_and_jit`: compile all inputs and build the JIT image. -- `compile.tu`: one translation unit to an object builder. -- `compile.frontend`: registered frontend body. -- `compile.c.*`: C frontend setup, PP option application, parse/codegen, - cleanup. -- `opt.o1.total`: per-function O1 pipeline, nested under frontend codegen. -- `opt.jump_cleanup`: cheap O1 CFG/layout branch cleanup. -- `link_jit.all`: linker resolve plus JIT image materialization. -- `run.jit_lookup` and `run.entry_call`: symbol lookup and execution. - -Important O1 counters: - -- `opt.funcs`, `opt.blocks`, `opt.insts`, `opt.pregs` -- `opt.live.block_bytes`, `opt.live.active_words`, - `opt.live.bitset_words_touched` -- `opt.ranges`, `opt.range_points`, `opt.range_raw_points`, - `opt.range.point_visits`, `opt.range.preg_scans`, - `opt.range.live_words_touched` -- `opt.alloc.used_loc_words`, `opt.alloc.hard_loc_words`, - `opt.alloc.stack_loc_words`, `opt.alloc.stack_slots` -- `opt.alloc.hard_point_visits`, `opt.alloc.stack_point_visits`, - `opt.alloc.hard_mark_points`, `opt.alloc.stack_mark_points` -- `opt.alloc.spills` -- `opt.rewrite.reloads`, `opt.rewrite.stores`, - `opt.rewrite.inserted_insts`, `opt.rewrite.live_words_touched` -- `opt.conflict_bytes` - -`opt.conflict_bytes` is expected to stay zero on the normal O1 path. The old -dense pseudo-register conflict matrix is no longer part of O1 allocation. - -## Normal Function-Count Ladders - -Two synthetic C ladders are used for broad scaling: - -- `straight`: N loop-heavy static functions, with `main` directly calling every - function. This creates N small functions plus one larger straight-line caller. -- `table`: the same N functions, but `main` calls through a function-pointer - table. This keeps the caller compact and isolates "many functions" from "one - larger function". - -O1 scopes are summed across all functions. - -```text -straight -N compile.tu opt.o1.total live_pre live_reg regalloc insts vals ranges spills used_words conflict -64 17.442 15.776 2.495 3.941 8.740 1856 833 1789 0 3198 0 -128 31.210 28.372 4.453 7.191 15.631 3712 1665 3581 0 6398 0 -256 60.026 54.522 8.544 13.764 30.201 7424 3329 7165 0 12798 0 -512 119.735 108.974 16.987 27.253 61.085 14848 6657 14333 0 25598 0 -1024 234.857 210.831 32.958 52.953 115.918 29696 13313 28669 0 51198 0 - -table -N compile.tu opt.o1.total live_pre live_reg regalloc insts vals ranges spills used_words conflict -64 14.246 12.916 2.046 3.269 7.113 1620 713 1555 0 2720 0 -128 28.560 26.036 4.128 6.591 14.368 3220 1417 3091 0 5408 0 -256 57.608 52.647 8.371 13.331 28.973 6420 2825 6163 0 10784 0 -512 114.689 104.540 16.557 26.401 57.560 12820 5641 12307 0 21536 0 -1024 232.619 209.061 33.476 52.789 115.426 25620 11273 24595 0 43040 0 -``` - -512 to 1024 ratios: - -```text -shape compile.tu opt.o1.total live_pre live_reg regalloc used_words -straight 1.96x 1.93x 1.94x 1.94x 1.90x 2.00x -table 2.03x 2.00x 2.02x 2.00x 2.01x 2.00x -``` - -This is the desired current shape. Many functions and the normal direct-call -large-caller case are essentially linear through 1024 generated functions. - -## Spill-Pressure Ladder - -The old `spill` row in `run_scaling.sh` initializes many locals from constants; -current O1 folds most of that away, so it is useful as a frontend/source-size -sanity check but not as a pressure test. The useful pressure ladder is -`spill_argc`, which initializes each local from `argc` so values remain live. - -```text -spill_argc -N compile.tu opt.o1.total live_pre live_reg regalloc insts vals ranges spills reloads inserted used_words conflict -128 1.388 0.769 0.045 0.169 0.474 770 641 767 108 108 216 3076 0 -256 2.468 1.433 0.058 0.291 0.929 1538 1281 1535 236 236 472 9222 0 -512 4.658 2.750 0.072 0.526 1.825 3074 2561 3071 492 492 984 30730 0 -1024 9.894 5.681 0.110 1.058 4.073 6146 5121 6143 1004 1004 2008 110610 0 -``` - -512 to 1024 ratios: - -```text -compile.tu 2.12x -opt.o1.total 2.07x -live_pre 1.53x -live_reg 2.01x -regalloc 2.23x -spills 2.04x -used_words 3.60x -``` - -This was the main remaining mild super-linear signal in the May 17 snapshot. -The May 21 follow-up found the direct cause: stack spill assignment used -first-fit probing across existing spill slots. When many spilled ranges all -overlapped, every new spill checked almost every previous stack slot before -allocating a new one, so `opt.alloc.stack_point_visits` grew quadratically. - -The fix keeps hard-register assignment priority ordered, then assigns spilled -values to stack slots in lifetime order using an active min-heap keyed by slot -end position plus a free-slot list. Stack occupancy is still marked into the -per-slot interval vectors for replay/debug metrics, but assignment no longer -probes existing stack slots in the all-live case. - -May 21 follow-up, same `spill_argc` shape, single runs: - -```text -spill_argc final -N compile.tu opt.o1.total live_pre regalloc used_words hard_words stack_words stack_slots hard_visits stack_visits hard_marks stack_marks spills points -128 3.037 0.853 0.114 0.396 1686 616 1070 107 11732 0 405 107 107 514 -256 3.989 1.214 0.082 0.634 3222 872 2350 235 23508 0 789 235 235 1026 -512 6.776 1.887 0.075 1.050 6294 1384 4910 491 47060 0 1557 491 491 2050 -1024 13.605 3.561 0.102 2.041 12438 2408 10030 1003 94164 0 3093 1003 1003 4098 -``` - -512 to 1024 ratios: - -```text -compile.tu 2.01x -opt.o1.total 1.89x -regalloc 1.94x -spills 2.04x -used_words 1.98x -stack_marks 2.04x -stack_visits 0 -``` - -`used_words` now reports the allocator's actual interval-vector storage rather -than the size of the old dense point-by-location bitmap model. It is expected -to track interval/vector capacity and can move in allocation-size steps, but it -no longer reflects a point-times-slot working set. - -## Parameter-Heavy Ladder - -This C ladder uses one function with N integer parameters and a caller that -passes N constants. It stresses ABI lowering, stack arguments, and callee-side -parameter use. - -```text -params -N compile.tu opt.o1.total live_pre live_reg regalloc insts vals ranges spills used_words conflict -8 0.816 0.577 0.091 0.143 0.305 37 26 32 0 72 0 -16 0.850 0.595 0.090 0.150 0.320 69 50 64 0 136 0 -32 0.932 0.644 0.093 0.162 0.357 133 98 128 12 392 0 -64 1.053 0.698 0.093 0.175 0.396 261 194 256 44 776 0 -128 1.367 0.866 0.096 0.211 0.510 517 386 512 108 2056 0 -256 1.979 1.209 0.099 0.281 0.740 1029 770 1024 236 6152 0 -512 3.484 2.088 0.110 0.434 1.334 2053 1538 2048 492 20488 0 -``` - -256 to 512 ratios: - -```text -compile.tu 1.76x -opt.o1.total 1.73x -live_pre 1.11x -live_reg 1.54x -regalloc 1.80x -spills 2.08x -used_words 3.33x -``` - -The timing buckets are sub-linear to mildly linear here because the fixed -frontend/O1 setup costs are still visible. In the May 17 snapshot, `used_words` -again grew faster than spills and ranges when many parameters forced stack -traffic. - -May 21 follow-up used `argc`-dependent caller arguments to keep the call-side -values live and prevent constant folding from collapsing the probe: - -```text -params_argc final -N compile.tu opt.o1.total live_pre regalloc used_words hard_words stack_words stack_slots hard_visits stack_visits hard_marks stack_marks spills points -128 2.296 0.417 0.051 0.205 1436 376 1060 106 2969 0 25 106 106 133 -256 3.717 0.635 0.053 0.292 2716 376 2340 234 5913 0 25 234 234 261 -512 6.343 0.926 0.052 0.365 5276 376 4900 490 11801 0 25 490 490 517 -1024 13.602 2.428 0.080 0.834 10396 376 10020 1002 23577 0 25 1002 1002 1029 -``` - -512 to 1024 ratios: - -```text -compile.tu 2.14x -opt.o1.total 2.62x -regalloc 2.28x -spills 2.04x -used_words 1.97x -stack_marks 2.04x -stack_visits 0 -``` - -The parameter probe still has noisy wall-time at these small absolute O1 -durations, but the allocator-side counters now have the desired linear shape. - -## Toy Samples - -The Toy frontend uses the same CG/O1/backend machinery, so these are useful -small API-shape probes. O1 scopes are summed across functions. - -```text -case compile.tu opt.o1.total live_pre live_reg regalloc insts vals ranges spills used_words conflict -06_while_sum 0.436 0.319 0.049 0.075 0.163 15 7 13 0 26 0 -09_function_params 0.645 0.569 0.091 0.142 0.302 10 8 8 0 22 0 -111_many_function_params 0.682 0.591 0.092 0.147 0.316 38 22 36 0 78 0 -123_spec_demo 4.709 4.229 0.659 1.016 2.270 788 359 590 0 1168 0 -``` - -The Toy samples do not show a compile-time scaling problem at this size. The -larger spec demo is dominated by O1/regalloc, but counters are small and there -is no conflict-matrix traffic. - -## Current Interpretation - -The current O1 compile-time story is good enough to shift attention toward -generated-code quality: - -- Normal many-function and large-caller C inputs are approximately linear. -- The old dense liveness/conflict path is gone from normal O1 - (`opt.conflict_bytes=0`). -- Stack spill assignment no longer does first-fit interval probing across every - existing spill slot; `opt.alloc.stack_point_visits` stays zero on the - stack-heavy May 21 ladders. -- Link/JIT is not the bottleneck in these runs; the visible time is frontend - parse/codegen plus O1. -- Spill-heavy and parameter-heavy allocator counters now scale by ranges, - stack slots, and interval marks rather than by point/candidate products. - -## Investigation Priorities - -1. Keep the stack-heavy May 21 shape as a regression guard. - `stack_point_visits` should remain zero on the normal O1 stack assignment - path, while `stack_marks`, `stack_slots`, spills, and actual interval - storage should remain approximately linear. - -2. Keep `opt.conflict_bytes=0` as a regression guard. - Reintroducing dense pseudo conflict storage would be a clear O1 regression. - -3. Keep normal O1 branch cleanup block-linear. - `jump_cleanup` should remain a local pass over blocks/terminators. Global - layout work belongs in O2. - -4. Split frontend residuals only after O1 code shape work. - The non-O1 compile residual is visible, but current scaling pressure is not - coming from object finalization or link/JIT. More parser/PP/CG counters would - be useful, but O1 generated-code cleanup has higher near-term value. diff --git a/scripts/opt_bench.sh b/scripts/opt_bench.sh @@ -379,7 +379,7 @@ lines = [] lines.append("# OPT Benchmark Summary") lines.append("") lines.append(f"Base for speed ratios: `{base_tool} -O2`.") -lines.append("For MIR, `compile_ms` is C-to-binary-MIR time and `codegen_ms` is the JIT link/generation slice reported by `c2m -v`; compile ratios use their sum. `cfree-run` uses `--bench-time`: `compile_ms` is compile+JIT time, and `runtime_ms` is the in-process entry-call execution slice.") +lines.append("Unified compile-speed ratios use `compile_ms + codegen_ms` where both are available. For MIR, `compile_ms` is C-to-binary-MIR time and `codegen_ms` is the JIT link/generation slice reported by `c2m -v`; this keeps the headline number comparable to `cfree-run` as a whole while the split table below shows where time is spent. `cfree-run` uses `--bench-time`: `compile_ms` is compile+JIT time, and `runtime_ms` is the in-process entry-call execution slice.") lines.append("") lines.append("## Status") lines.append("") @@ -415,6 +415,27 @@ for key in sorted(groups): f"{geo(run_ratios)} | {avg(comp_totals)} | {avg(run_times)} |" ) lines.append("") +lines.append("## Split Compile Timings") +lines.append("") +lines.append("| tool | opt | ok cases | avg compile/frontend ms | avg codegen/JIT ms | avg unified compile ms |") +lines.append("| --- | ---: | ---: | ---: | ---: | ---: |") +for key in sorted(groups): + vals = groups[key] + compile_times = [] + codegen_times = [] + comp_totals = [] + for r in vals: + cm = fnum(r["compile_ms"]) + cg = fnum(r["codegen_ms"]) + total = None if cm is None else cm + (cg or 0.0) + compile_times.append(cm) + codegen_times.append(cg) + comp_totals.append(total) + lines.append( + f"| {key[0]} | {key[1]} | {len(vals)} | {avg(compile_times)} | " + f"{avg(codegen_times)} | {avg(comp_totals)} |" + ) +lines.append("") lines.append(f"Raw CSV: `{csv_path}`") with open(out_path, "w") as f: f.write("\n".join(lines) + "\n")