OP1.md doc refresh + perf refresh - kit

commit f639d9174f8a364814dc16e21a1bd1ffa3634902
parent 0bbc3338aeb7ede4ee32d37859e5309d1e3605d4
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Sun, 17 May 2026 16:11:06 -0700

OP1.md doc refresh + perf refresh

Diffstat:
M doc/OPT1.md  | 566 +++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------------
M doc/PERF.md  | 994 +++++++++++++------------------------------------------------------------------

2 files changed, 554 insertions(+), 1006 deletions(-)
diff --git a/doc/OPT1.md b/doc/OPT1.md
@@ -1,29 +1,33 @@
 # OPT1 - Current O1 Backend Path
 
-`-O1` is cfree's fast optimized backend tier. It records frontend CG operations
-as IR, runs a small MIR-shaped lowering/allocation pipeline, rewrites virtual
-values to hard registers or spill slots, performs narrow post-RA cleanup, and
-then emits through the normal target backend.
+`-O1` is cfree's fast optimized backend tier. It records frontend CG
+operations as IR, runs a small MIR-shaped lowering/allocation pipeline, rewrites
+virtual values to hard registers or spill slots, performs narrow post-RA
+cleanup, and emits through the target backend.
 
-This document describes the implemented O1 path and tracks remaining O1 code
-quality and compile-time work. Broader optimizer direction lives in
-`doc/OPT.md`; allocator performance history and measurements live in
-`doc/PERF.md` and `doc/MIR_RA_REPORT.md`.
+This document has two jobs:
+
+1. document the implemented O1 pipeline; and
+2. track the remaining work needed to clean up O1 generated code.
+
+Compile-time measurements and scaling notes live in `doc/PERF.md`. Broader
+optimizer direction lives in `doc/OPT.md`.
 
 ## Goals
 
-O1 is the bisection floor for optimized codegen. It should remain:
+O1 should remain:
 
 - correct across supported targets and ABI lowering paths;
 - materially better than O0 generated code;
 - close enough to O0 compile time for JIT use;
-- free of unbounded dense pseudo-register conflict matrices;
+- free of dense pseudo-register conflict matrices and whole-function
+  point/candidate products in hot paths;
 - simple enough that O2 can add coalescing, splitting, SSA, and value
   optimization without destabilizing the O1 baseline.
 
 O1 is not intended to perform full SSA optimization, global value numbering,
-LICM, general inlining, or full branch layout. Those belong in O2 unless a
-local cleanup is required to keep O1 code shape reasonable.
+LICM, general inlining, or global block layout. Those belong in O2 unless a
+small local cleanup is required to keep O1 code shape reasonable.
 
 ## Pipeline
 
@@ -50,28 +54,29 @@ jump_cleanup.layout
 emit
 ```
 
-Important pass boundaries:
+Pass boundaries:
 
-- `machinize` applies target-specific ABI/call/register constraints before
+- `machinize` applies target-specific ABI, call, and register constraints before
   liveness and allocation.
-- `dead_def_elim` runs before RA using block liveness and operand-reference
-  events.
+- `live_blocks.pre_dde` computes block liveness for pre-RA dead-def
+  elimination.
+- `dead_def_elim` removes dead virtual definitions using block liveness and
+  operand-reference events.
 - `regalloc` rebuilds liveness after DDE, constructs compressed live ranges,
-  assigns locations, and rewrites operands.
-- `combine` is deliberately narrow post-RA cleanup: noop physical copy
-  deletion, safe single-use folds, redundant spill load/store cleanup, and
-  similar local rewrites.
+  assigns hard registers or spill slots, and rewrites operands.
+- `combine` is narrow post-RA cleanup: noop physical copy deletion, safe
+  single-use folds, redundant spill load/store cleanup, and similar local
+  rewrites.
 - `dce` is conservative after rewrite and removes only trivial dead IR shapes.
-- `jump_cleanup` is the cheap O1 branch-shape pass. The CFG stage forwards
-  branches to branch-only blocks and inverts compare-branches when that removes
-  an unconditional jump block without doing full block layout. The layout stage
-  runs immediately before emit and deletes unconditional branches to the
-  physical fallthrough block.
+- `jump_cleanup.cfg` forwards branches to branch-only blocks and performs local
+  compare-branch inversion when that removes an unconditional jump block.
+- `jump_cleanup.layout` runs immediately before emit and deletes unconditional
+  branches whose target is the physical fallthrough block.
 
 ## Liveness And Ranges
 
-O1 liveness is pass-local. Persistent `Block.live_after` and dense value
-conflict storage are not part of the normal path.
+O1 liveness is pass-local. Persistent per-instruction live-after storage and
+dense value conflict matrices are not part of the normal path.
 
 `opt_live_blocks` computes block `live_use`, `live_def`, `live_in`, and
 `live_out` sets using `OptBitset`. `OptBitset` tracks active words, trims
@@ -87,6 +92,8 @@ into compressed live ranges:
 - per-value summaries record live length, use/def frequency, block frequency,
   call-crossing frequency, and spill cost.
 
+The expected O1 invariant is `opt.conflict_bytes=0`.
+
 ## Allocation
 
 The normal O1 allocator is range based. It does not build or consume a
@@ -104,13 +111,11 @@ Each value is assigned either:
 - a target-provided hard register in the value's register class; or
 - a compatible `FS_SPILL` frame slot.
 
-Non-tied values known to be live across calls strongly prefer callee-saved
-registers, but may use caller-saved registers when no better non-conflicting
-location is available. Correctness does not depend on avoiding caller-saved
-registers globally: rewrite preserves hard-assigned values only at calls whose
-target-provided clobber mask actually destroys that register. Tied
-hard-register values are checked for availability, clobber constraints, and
-live-range conflicts.
+Non-tied values known to be live across calls prefer callee-saved registers, but
+may use caller-saved registers when no better non-conflicting location is
+available. Correctness does not depend on avoiding caller-saved registers
+globally: rewrite preserves hard-assigned values only at calls whose
+target-provided clobber mask actually destroys that register.
 
 Occupancy is stored as sorted per-location interval sets:
 
@@ -118,12 +123,13 @@ Occupancy is stored as sorted per-location interval sets:
 - one interval vector per allocated stack spill slot.
 
 Conflict checks test whether any candidate live range overlaps the location's
-intervals. Marking inserts and merges intervals. The old point-index bitmap
-row OR/mark path is not part of normal O1.
+intervals. Marking inserts and merges intervals. The old point-index bitmap row
+OR/mark path is not part of normal O1.
 
-Metrics still report `opt.alloc.hard_loc_words`, `opt.alloc.stack_loc_words`,
-and point/mark counters for trend compatibility, but `opt.alloc.hard_word_ors`
-and `opt.alloc.stack_word_ors` should remain zero on the interval path.
+Metrics still report `opt.alloc.used_loc_words`, `opt.alloc.hard_loc_words`,
+`opt.alloc.stack_loc_words`, point visits, and mark counters for trend
+compatibility, but `opt.alloc.hard_word_ors` and
+`opt.alloc.stack_word_ors` should remain zero on the interval path.
 
 ## Rewrite
 
@@ -157,7 +163,7 @@ O1 relies on each target backend to provide:
   stack arguments, return registers, call clobbers, return masks, outgoing
   stack size, and variadic metadata before liveness and allocation;
 - parameter storage binding through `CGTarget.param`, which may return either a
-  frame slot or a register-backed local storage for simple direct ABI params;
+  frame slot or register-backed local storage for simple direct ABI params;
 - optional hard-register reservation before backend `func_end`;
 - target legality for local folds performed by `opt_combine`.
 
@@ -167,19 +173,359 @@ The current target pools are:
 - x64: integer `R13/R14/R15/R10`, FP `XMM6-XMM15`;
 - RV64: integer `s4-s11`, FP `fs4-fs11`.
 
-Backends still own final prologue/epilogue emission and callee-saved register
-preservation. O1 calls `reserve_hard_regs` with the hard registers still visible
-in replay after cleanup so backend save/restore decisions match the emitted IR.
-Direct CG still uses the legacy `call` hook. O1 records call plans for liveness,
-allocation, and preservation, but backend call emission still performs the final
-argument and return moves through the existing sequential call emitters.
+Backends own final prologue/epilogue emission and callee-saved register
+preservation. O1 calls `reserve_hard_regs` with hard registers still visible in
+replay after cleanup so backend save/restore decisions match emitted IR.
+
+Targets may provide a known-frame entry path. When `func_begin_known_frame` and
+`call_stack_size` are both available, O1 computes frame slots, outgoing call
+area size, and dynamic-allocation presence before body replay and passes that
+to the backend. AArch64, x64, and RV64 use this path to emit the exact prologue
+up front instead of reserving target NOP patch slots.
+
+## Current Generated-Code Shape
+
+Fresh May 2026 probes were run with `build/cfree` rebuilt at host `-O2`, on an
+arm64 machine, focusing on native AArch64 codegen. The same simple C probes
+were also compiled for x64/RV64 by the probe script, but AArch64 is the primary
+quality target here.
+
+Current progress:
+
+- Immediate scalar arithmetic folds through simple local shadows. For example,
+  `int x = 40; x += 2; return x;` lowers to immediate return materialization on
+  AArch64.
+- Simple loop locals stay in registers.
+- Branch-consuming compares lower to direct compare branches.
+- O1 branch cleanup removes branch-to-branch targets and unconditional branches
+  to physical fallthrough when it can do so locally.
+- Address-taken locals still go through memory, as expected for current O1.
+- The old prologue NOP patch path is gone; known-frame entry is used where the
+  target supports it.
+
+Representative AArch64 `const_local` output:
+
+```asm
+sub sp, sp, #0x10
+stp x29, x30, [sp]
+mov x29, sp
+mov w0, #0x2a
+b   .Lepilogue
+ldp x29, x30, [sp]
+add sp, sp, #0x10
+ret
+```
+
+Representative AArch64 `scalar_add` output still contains redundant copies:
+
+```asm
+mov w2, w0
+mov w0, w1
+add w2, w2, w0
+mov w0, w2
+mov w0, w0
+b   .Lepilogue
+```
+
+A parameter-heavy 32-argument AArch64 probe shows the main remaining shape
+problem: stack arguments are copied into frame slots, many callee-saved
+registers are used and therefore saved/restored, and returns still branch to an
+immediately following epilogue block.
+
+## Remaining O1 Codegen Work
+
+These are code-shape issues, not known correctness issues.
+
+1. Leaf-frame and epilogue cleanup.
+   Leaf functions such as `const_local` still get a frame and save/restore
+   `x29/x30`. O1 should let targets omit frames when the function has no calls,
+   no dynamic stack, no required frame-address use, and no spills/locals that
+   require a frame. Return-to-immediate-epilogue branches should also be removed
+   or emitted as direct fallthrough where legal.
+
+2. Redundant physical-copy cleanup.
+   Copies like `mov w0, w0`, return-register self copies, and copy chains around
+   simple arithmetic are still visible. Extend post-RA combine/DCE to remove
+   more physical self-copies and to coalesce simple return/argument copies
+   without requiring a full O2 coalescer.
+
+3. AArch64 caller/callee register choice for leaf and small functions.
+   Current AArch64 O1 allocable integer registers are callee-saved
+   `x19-x28`. This is conservative and correct, but a leaf function that uses
+   many of them pays save/restore traffic. A targeted O1 improvement is to make
+   caller-saved temporaries available when doing so does not create call
+   preservation work, especially in leaf functions.
+
+4. Stack-parameter handling.
+   Parameter-heavy functions copy stack arguments into local frame slots before
+   use. For non-address-taken scalar parameters, O1 should prefer using the
+   incoming stack location directly or loading it once into an assigned hard
+   register when profitable, instead of manufacturing extra frame traffic.
+
+5. Planned/parallel call argument setup.
+   O1 records call plans before allocation, but final backend call emission is
+   still largely through existing sequential call emitters. Moving O1 call
+   emission closer to planned/parallel argument setup should reduce copies,
+   stack shuffling, and conservative register choices around calls.
+
+6. Keep branch cleanup local.
+   O1 should keep only cheap CFG/layout cleanup. Global block ordering and
+   heavier branch layout belong in O2.
+
+## Implementation Plan
+
+The remaining O1 work should land as small vertical slices. Each slice should
+start with a focused `test/opt` unit or small smoke case that exposes the bad
+shape, then make the narrowest backend/opt change needed for that shape. Avoid
+SSA, global block layout, dense conflict data, and whole-function coalescing in
+these steps.
+
+### 1. Leaf Frames And Return Fallthrough
+
+Goal: make trivial leaf functions emit without an avoidable frame, callee-save
+traffic, or a branch to the immediately following epilogue.
+
+Implementation:
+
+- Add an O1 function summary in `src/opt/opt.c` before `opt_emit`: has calls,
+  has `IR_ALLOCA`, has frame-address-required storage, has non-empty frame
+  slots after known-frame collection, has outgoing stack args, and has used
+  callee-saved registers after rewrite.
+- Extend `CGKnownFrameDesc` or add a sibling target hook bit so
+  `func_begin_known_frame` can receive "frame may be omitted" without each
+  backend rediscovering O1 state. The backend remains authoritative because
+  CFI, platform ABI rules, and frame-pointer policy are target-owned.
+- Teach AArch64 first, then x64/RV64, to omit the frame only when the known
+  frame is empty, the function is leaf, no dynamic stack exists, no outgoing
+  call area exists, and no callee-saved register needs preservation.
+- Keep `reserve_hard_regs` authoritative. If post-RA code still mentions a
+  callee-saved register, the target must preserve it even for a leaf.
+- Remove return-to-next-epilogue branches in the layout stage. Either let
+  `opt_jump_cleanup(...LAYOUT)` delete an `IR_BR` to the next emitted epilogue
+  block, or add a target-local direct-return fast path only when the backend's
+  epilogue is empty. Prefer the IR/layout cleanup first because it is target
+  neutral.
+
+Tests:
+
+- Add `test/opt` mock coverage for the summary: empty leaf, leaf with spill,
+  leaf with alloca, non-leaf, and leaf using a callee-saved hard reg.
+- Add AArch64 smoke/disassembly cases for `return 42;` and one-reg arithmetic:
+  no `stp x29, x30`, no stack adjustment, no branch to an adjacent epilogue.
+- Run `make test-opt test-smoke-x64 test-aa64-inline`; add RV64/x64 targeted
+  smoke once their frame omission hooks are enabled.
+
+### 2. Post-RA Physical Copy Cleanup
+
+Goal: remove obvious copy noise such as `mov w0, w0`, return-register self
+copies, and short copy chains around a single arithmetic result without adding
+an O2 coalescer.
+
+Implementation:
+
+- Extend `src/opt/pass_lower.c:opt_combine` with a post-RA copy-propagation
+  sweep over hard registers. Reuse the existing `HardBlockLive` and local
+  single-use machinery; keep it block-local except for already-computed
+  live-out safety checks.
+- Add ABI-value helpers that can rewrite a return or planned-call operand from
+  a copied hard register to its original source when the source has not been
+  clobbered and the copied destination is not live out.
+- Handle these cases first:
+  - physical self-copy deletion for all copy-like ops already modeled as
+    `IR_COPY`;
+  - `copy retreg, src; ret retreg` to `ret src` when the target return emitter
+    accepts that operand shape;
+  - `producer tmp; copy retreg, tmp; ret retreg` by retargeting the producer to
+    `retreg` when `retarget_producer_legal` already says that is safe;
+  - adjacent `copy a, b; copy c, a` to `copy c, b` when `a` has no other use.
+- Keep target legality explicit. If an instruction requires same-dst operands
+  or cannot write a return register directly, leave it for O2.
+
+Tests:
+
+- Extend `test/opt/opt_test.c` with post-rewrite blocks for each pattern,
+  including a negative case where the source is clobbered by a call.
+- Add a small AArch64 scalar-add probe that checks the final disassembly has no
+  `mov w0, w0` and fewer register-to-register moves.
+- Run `make test-opt test-cg-api test-toy`.
+
+### 3. Caller-Saved Registers In Leaf And Small Functions
+
+Goal: avoid save/restore traffic caused by preferring AArch64 callee-saved
+registers in leaf and low-pressure functions.
+
+Implementation:
+
+- Keep the backend physical register metadata in `src/arch/*/opt_coord.c` as
+  the source of truth. Do not hard-code register classes in the allocator.
+- Split O1 register ordering into a per-function policy in
+  `src/opt/pass_lower.c`: leaf/no-call values prefer caller-saved temp regs;
+  values live across calls prefer callee-saved regs only when that avoids
+  per-call save/restore; tied ABI regs keep their fixed priority.
+- For AArch64, make a caller-saved temp subset visible in the O1 allocation
+  order for leaf functions. Avoid `x16/x17` because they are backend scratch
+  registers, and continue to respect `opt_reserved_regs`, arg/ret hazards, and
+  inline-asm fixed/clobber masks.
+- For non-leaf functions, keep current correctness first: caller-saved
+  assignment is allowed only if rewrite will preserve the value across the
+  specific call clobber mask.
+- Add metrics for hard-reg class choice if useful for regression tracking:
+  caller-saved assignments, callee-saved assignments, and call-preserve inserts.
+
+Tests:
+
+- Extend existing allocator preference tests with AArch64-like metadata:
+  leaf temporaries choose caller-saved temps, call-crossing values avoid them
+  when a callee-saved reg is available, inline asm fixed regs still win.
+- Add AArch64 smoke for a leaf function with many scalar temps: no
+  callee-save pair stores solely due to O1 temporaries.
+- Run `make test-opt test-aa64-inline test-smoke-x64`.
+
+### 4. Stack-Parameter Handling
+
+Goal: stop copying non-address-taken scalar stack parameters into new local
+frame slots before their first real use.
+
+Implementation:
+
+- Add an IR representation for incoming stack parameter storage that is
+  distinct from ordinary callee frame slots. The target should be able to
+  describe it as an immutable incoming argument address, e.g. base register plus
+  ABI stack offset, without allocating `FS_PARAM`.
+- Extend `CGLocalStorage` or add O1-only parameter metadata so
+  `w_param` can record direct incoming-stack storage for scalar,
+  non-address-taken, non-variadic, direct ABI parameters.
+- Teach AArch64 first because it is the current quality target. `aa_param`
+  should be able to bind a register-backed param by loading from the incoming
+  stack location, and bind a memory-backed non-address-taken param without
+  copying into a local frame slot. Preserve the existing copy behavior for
+  address-taken params, aggregate/byval params, variadic save-area needs, and
+  debug locations that require a stable local home.
+- Update alias metadata so loads from incoming stack params are not mistaken
+  for writable local frame slots. Treat them as parameter memory that may alias
+  conservatively with unknown memory unless a target/language fact says more.
+- After AArch64 is green, port the same storage description to x64 and RV64.
+
+Tests:
+
+- Add O1 unit coverage for `w_param`: scalar stack param with no address taken
+  does not allocate `FS_PARAM`; address-taken and aggregate params still do.
+- Add AArch64 smoke for a 9+ integer-argument function: stack args are loaded
+  directly or once into assigned regs, with no extra callee local slot traffic.
+- Run `make test-opt test-cg-api test-smoke-x64`; add RV64 smoke after the
+  RV64 port.
+
+### 5. Planned/Parallel Call Argument Setup
+
+Goal: make O1 replay use target call plans as the normal call-emission route so
+argument setup is parallelized and copy-minimized.
+
+Implementation:
+
+- Promote `IRCallAux.plan_valid/use_plan_replay` from a partial path to the
+  default O1 path after `opt_machinize`. Keep the old descriptor replay as a
+  fallback for targets without `plan_call` or `emit_call_plan`.
+- Add a small parallel-move resolver in `src/opt/opt.c` replay that lowers the
+  planned register moves, stack stores, return moves, and indirect callee
+  preservation. It must handle cycles using target scratch registers or a
+  temporary spill slot; no VLA and no global resolver state.
+- Keep call-stack sizing tied to `call_stack_size` and the known-frame path so
+  outgoing areas are still allocated once per function.
+- Use target `call_clobber_mask` and plan return masks to avoid preserving
+  registers that the concrete call cannot clobber.
+- Start with integer direct calls, then FP args/returns, then mixed stack args,
+  then indirect calls. Leave varargs and aggregate edge cases on descriptor
+  replay until their planned lowering has explicit tests.
+
+Tests:
+
+- Extend mock `plan_call` tests to assert O1 emits planned calls by default and
+  preserves an indirect callee only when it conflicts with argument moves.
+- Add cycle tests for argument permutation, including `a -> b, b -> a`.
+- Add cross-target call smoke for direct, indirect, stack-heavy, and mixed
+  int/FP calls.
+- Run `make test-opt test-cg-api test-toy test-smoke-x64`.
+
+### 6. Local Branch Cleanup Only
+
+Goal: improve obvious return/branch shape while preserving O1's cheap,
+block-local cleanup policy.
+
+Implementation:
+
+- Keep `src/opt/pass_jump.c` restricted to branch-target forwarding,
+  compare-branch inversion, unreachable pruning already owned by CFG, and
+  physical fallthrough branch deletion.
+- Add only local return/epilogue cleanup needed by step 1. Do not introduce
+  trace layout, global block ordering, or profile-guided branch placement in
+  O1.
+- Document any tempting global cleanup in `doc/OPT.md` as O2 work instead of
+  expanding O1.
+
+Tests:
+
+- Add branch-cleanup tests for adjacent epilogue fallthrough and negative cases
+  where the epilogue is not adjacent or has target-required body.
+- Run `make test-opt`.
+
+### 7. Stack-Heavy Compile-Time Watch
+
+Goal: keep stack-heavy `params` and `spill_argc` shapes close to linear before
+larger inputs become normal benchmarks.
+
+Implementation:
+
+- Instrument the current `used_words` growth source in
+  `src/opt/pass_live.c` and `src/opt/pass_lower.c`: distinguish hard-register
+  occupancy words, stack-slot occupancy words, live-range point count, and
+  parameter/spill slot count.
+- Investigate whether stack slots are being modeled with more interval words
+  than needed, especially after stack-parameter improvements remove some
+  `FS_PARAM` traffic.
+- Keep the invariant that `opt.conflict_bytes=0` on normal O1. Any fix that
+  reintroduces dense pseudo conflicts is a regression, even if it improves one
+  microbenchmark.
+
+Tests:
+
+- Add or update a phase0 guardrail that records `opt.alloc.used_loc_words` for
+  representative stack-heavy cases without asserting fragile exact values.
+- Refresh `doc/PERF.md` after the stack-parameter and caller-saved-register
+  slices land.
+
+### Suggested Order
+
+1. Physical-copy cleanup. It is isolated, improves current AArch64 samples, and
+   gives better post-RA liveness coverage for later steps.
+2. Leaf return/fallthrough cleanup. This removes visible branch noise and
+   creates the summary data needed for frame omission.
+3. AArch64 caller-saved leaf policy. This directly reduces save/restore traffic
+   once leaf frames are eligible for omission.
+4. Stack-parameter handling. This is higher risk because it changes parameter
+   storage contracts, so land it per target.
+5. Planned call replay. This has the largest call/ABI surface and should build
+   on the copy and register-choice cleanup.
+6. Compile-time stack-heavy investigation and `doc/PERF.md` refresh after the
+   generated-code changes have reduced avoidable stack traffic.
+
+## Compile-Time Watchlist
+
+The current O1 compile-time shape is good for normal C ladders: many-function
+and direct-call generated inputs are approximately linear through 1024
+functions with `opt.conflict_bytes=0`.
+
+The remaining counter-level warning is stack-heavy pressure. In the current
+host-`-O2` native run:
 
-Targets may also provide a known-frame entry path for O1. When
-`func_begin_known_frame` and `call_stack_size` are both available, O1 computes
-all frame slots, outgoing call area size, and dynamic-allocation presence before
-body replay and passes that to the backend. AArch64, x64, and RV64 use this
-path to emit the exact prologue up front instead of reserving target NOP patch
-slots.
+```text
+shape        step       opt.o1.total  regalloc  used_words
+spill_argc   512->1024      2.07x       2.23x      3.60x
+params       256->512       1.73x       1.80x      3.33x
+```
+
+Wall time is only mildly super-linear, but the `used_words` growth should be
+investigated before much larger stack-heavy inputs are treated as normal O1
+workloads. See `doc/PERF.md` for the full tables.
 
 ## Tests And Metrics
 
@@ -206,119 +552,3 @@ make test-smoke-x64
 - allocator hard/stack occupancy, interval probes/marks, spills;
 - rewrite reloads/stores/inserted instructions/live traffic;
 - link/JIT and compile frontend scopes.
-
-## Current State
-
-Fresh May 2026 scaling probes used the same direct-call and function-table
-families tracked in `doc/PERF.md`, with three samples per point and p50 times
-below. Both normal-path ladders are now essentially linear through 1024
-generated functions. The old dense conflict matrix is absent from the normal
-path (`opt.conflict_bytes=0`), and the live-range and allocator buckets double
-with the input.
-
-```text
-straight direct-call ladder
-N     run.total  opt.o1.total  live_reg  regalloc
-64      18.155       14.826      3.899     8.056
-128     35.244       29.371      7.727    15.970
-256     70.254       58.873     15.765    32.168
-512    144.874      120.613     30.894    64.842
-1024   283.198      233.653     61.392   126.779
-
-function-table ladder
-N     run.total  opt.o1.total  live_reg  regalloc
-64      17.536       14.571      3.822     7.915
-128     34.270       28.855      7.594    15.685
-256     68.305       57.831     15.183    31.413
-512    135.433      114.115     30.025    62.018
-1024   277.842      230.563     60.167   124.895
-```
-
-A nonconstant wide-local spill probe also scales linearly in spill/reload and
-rewrite counts, though regalloc bends slightly above 2x under heavy spill
-pressure:
-
-```text
-N     opt.o1  live_reg  regalloc  spills  reloads  inserted
-128    1.541    0.398     0.918     120     120       240
-256    3.230    0.778     2.048     248     248       496
-512    6.991    1.474     4.924     504     504      1008
-```
-
-Current code-shape probes compiled `identity_param`, `scalar_add`,
-`while_sum`, `simple_branch`, `direct_call`, `const_local`, and
-`local_addr_taken` across x64, AArch64, and RV64 with `-O1`, then disassembled
-with `llvm-objdump -dr`.
-
-Observed progress:
-
-- O1 is materially smaller than O0 across all three supported native backends
-  in the older corpus probes.
-- The old prologue NOP sleds are gone on O1 for AArch64, x64, and RV64.
-- Simple loop locals are kept in registers in the loop probes instead of being
-  reloaded from stack slots on each iteration.
-- Branch-consuming compares lower to direct compare branches. For example,
-  AArch64 while-loop conditions use `cmp w19, #0xa` plus `b.ge ...` rather
-  than a `cmp; cset; cmp #0; b.cond` bridge.
-- Local branch cleanup now removes O1-only branch artifacts: branch-to-branch
-  targets are forwarded, safe compare-branches are inverted to remove
-  fallthrough jump blocks, and final unconditional branches to the physical
-  fallthrough block are deleted before emit.
-- Very local constant simplification handles the direct O1 probes covered by
-  `doc/CONSTFOLD.md`: immediate arithmetic, compare literals, delayed
-  arithmetic chains, and straight-line scalar-local shadows. The
-  `int x = 40; x += 2; return x;` probe now reduces to immediate return
-  materialization on AArch64:
-
-```asm
-mov w0, #0x2a
-b   ...
-```
-
-- Post-rewrite DCE and copy cleanup now model hard-register call arguments,
-  ABI argument parts, return registers, and caller-saved call clobbers. The
-  local combine pass also retargets safe single-use arithmetic producers to a
-  following physical-copy destination, including commutative operand swaps for
-  x64-style two-operand overlap cases.
-- Targets now expose descriptive physical-register metadata and per-call
-  clobber/return/callee-save masks to O1. `machinize` attaches a target call
-  plan to each `IR_CALL` before liveness and allocation, and rewrite plus
-  hard-register liveness use that call-specific clobber mask for preservation.
-  This closes the register-preservation correctness issue: hard-assigned values
-  live across calls are saved/restored only when the planned call actually
-  clobbers their assigned physical register.
-- Simple scalar parameters no longer need late mem2reg promotion. `CGTarget.param`
-  owns ABI entry binding and O1 can keep non-memory-required direct params as
-  virtual-register-backed locals, replaying the final hard register or spill slot
-  to the target backend. Frame-backed params remain the path for aggregates,
-  address-taken values, indirect/byval ABI cases, and other memory-required
-  shapes.
-
-Remaining O1 shape issues visible in the current dumps:
-
-- The register-preservation correctness issue is closed, but O1 still
-  saves/restores more callee-saved registers than ideal in some small functions
-  under register pressure or values live across calls. The old unconditional
-  scratch-register saves have been removed, and preservation is now
-  call-specific. Remaining excess traffic is code-shape work: conservative
-  register exposure, missing call-argument parallel copies in emitted code, and
-  limited coalescing/argument-copy cleanup.
-- Direct-call tiny functions are still heavy at O1. The x64 `callee(x) + 2`
-  probe emitted 167 bytes and 47 instructions across two small functions,
-  mostly frame setup, callee-save traffic, copies, and branch-to-epilogue
-  artifacts. O1 does not need general inlining, but call-side frame/save
-  discipline remains expensive.
-
-## Todo
-
-MIR's O1 path suggests these high-value local cleanups that still fit cfree's
-fast tier:
-
-1. Continue reducing callee-save traffic as a code-shape issue.
-   The preservation correctness piece is done: O1 reserves/preserves only
-   replay-visible hard registers after final cleanup and uses per-call clobber
-   masks for live-across-call saves/restores. Remaining work is mostly
-   coalescing/argument-copy quality, pressure-sensitive choices, switching O1
-   call emission to planned/parallel argument setup, and then safely broadening
-   ABI argument/return registers in the allocable pools. The remaining
-   structural work is tracked in `doc/OPT_REGS_CALL_PLAN.md`.
diff --git a/doc/PERF.md b/doc/PERF.md
@@ -1,906 +1,224 @@
-# Cfree Run/JIT Performance Notes
+# Cfree O1 Performance Notes
 
-This file records early measurements for `cfree run --time -O1` and the
-current hypotheses for reducing time from submit to first execution.
+This file records the current `cfree run --time -O1` shape and the remaining
+compile-time scaling risks. It is intentionally focused on the current O1 path;
+older allocator phase history is preserved in git and in `doc/MIR_RA_REPORT.md`.
 
-The metrics are emitted through `CfreeEnv.metrics`; `cfree run --time` installs
-a hosted stderr logger in the driver. Core/libcfree remain callback-only and do
-not write files or depend on hosted I/O.
+## Measurement Setup
 
-## Current Metric Shape
+Fresh snapshot: May 17, 2026.
+
+- Host: `arm64` macOS.
+- Compiler under test: `build/cfree`, rebuilt as a Mach-O arm64 executable with
+  the compiler itself built by clang at host `-O2`.
+- Timing path: native `build/cfree run --time -O1`, which exercises the native
+  AArch64 JIT/codegen path on this machine.
+- Samples: three runs per point; tables use p50 milliseconds.
+- Raw local outputs from the run:
+  - `build/o1-probes/scaling.tsv`
+  - `build/o1-probes/spill_argc.tsv`
+  - `build/o1-current/params-hostO2.tsv`
+  - `build/o1-current/toy-hostO2.tsv`
+
+The build log should show `clang -O2 ...` for driver, lib, C frontend, and Toy
+frontend objects before these timings are compared to future runs.
+
+## Metric Shape
 
 Top-level `cfree run --time` scopes:
 
 - `run.total`: full driver path after option/input loading through entry return.
 - `run.compile_and_jit`: compile all inputs and build the JIT image.
-- `compile.tu`: one translation unit to an `ObjBuilder`.
+- `compile.tu`: one translation unit to an object builder.
 - `compile.frontend`: registered frontend body.
-- `compile.c.*`: C frontend setup, PP option application, parse/codegen, cleanup.
-- `opt.o1.total`: per-function O1 pipeline, nested under C parse/codegen.
-- `opt.jump_cleanup`: cheap O1 CFG/layout branch cleanup after post-rewrite
-  cleanup, nested under `opt.o1.total`.
+- `compile.c.*`: C frontend setup, PP option application, parse/codegen,
+  cleanup.
+- `opt.o1.total`: per-function O1 pipeline, nested under frontend codegen.
+- `opt.jump_cleanup`: cheap O1 CFG/layout branch cleanup.
 - `link_jit.all`: linker resolve plus JIT image materialization.
-- `link.resolve.*`: archive ingestion, symbol resolution, GC, layout, reloc emit.
-- `jit.*`: reservation, segment copy, relocation patching, protection, icache flush,
-  constructors.
 - `run.jit_lookup` and `run.entry_call`: symbol lookup and execution.
 
-Important counters:
+Important O1 counters:
 
-- `compile.input_bytes`
-- `compile.obj_sections`, `compile.obj_relocs`
 - `opt.funcs`, `opt.blocks`, `opt.insts`, `opt.vals`
-- `opt.live_words`, `opt.live.blocks`, `opt.live.active_words`
-- `opt.live.block_bytes`, `opt.live.set_bit_scans`
-- `opt.live.bitset_words_touched`, `opt.live.dataflow_iterations`,
-  `opt.live.dataflow_block_visits`
-- `opt.ranges`, `opt.range_points`, `opt.range_raw_points`
-- `opt.range.point_visits`, `opt.range.value_scans`,
+- `opt.live.block_bytes`, `opt.live.active_words`,
+  `opt.live.bitset_words_touched`
+- `opt.ranges`, `opt.range_points`, `opt.range_raw_points`,
+  `opt.range.point_visits`, `opt.range.value_scans`,
   `opt.range.live_words_touched`
-- `opt.alloc.used_loc_words`, `opt.alloc.spills`
-- `opt.alloc.hard_loc_words`, `opt.alloc.stack_loc_words`,
-  `opt.alloc.stack_slots`
+- `opt.alloc.used_loc_words`, `opt.alloc.hard_loc_words`,
+  `opt.alloc.stack_loc_words`, `opt.alloc.stack_slots`
 - `opt.alloc.hard_point_visits`, `opt.alloc.stack_point_visits`,
-  `opt.alloc.hard_word_ors`, `opt.alloc.stack_word_ors`,
   `opt.alloc.hard_mark_points`, `opt.alloc.stack_mark_points`
-- `opt.dde.live_words_touched`
+- `opt.alloc.spills`
 - `opt.rewrite.reloads`, `opt.rewrite.stores`,
   `opt.rewrite.inserted_insts`, `opt.rewrite.live_words_touched`
 - `opt.conflict_bytes`
-- `link.inputs`, `link.sections`, `link.segments`, `link.syms`, `link.relocs`
-- `jit.master_size`, `jit.nsegments`, `jit.input_section_bytes`,
-  `jit.segment_bytes`
-
-## Small Programs
-
-Measured with 20 samples each on stdin, `build/cfree run --time -O1 -`.
-Times are `min / p50 / p95 / mean / max` in milliseconds.
-
-`int main(){return 0;}`
-
-```text
-run.total             0.290 / 0.311 / 0.345 / 0.316 / 0.351
-run.compile_and_jit   0.274 / 0.293 / 0.313 / 0.296 / 0.329
-compile.tu            0.218 / 0.235 / 0.246 / 0.235 / 0.257
-compile.c.parse_codegen
-                      0.082 / 0.091 / 0.097 / 0.091 / 0.107
-opt.o1.total          0.048 / 0.056 / 0.061 / 0.055 / 0.065
-opt.live_blocks.pre_dde
-                      0.013 / 0.017 / 0.023 / 0.017 / 0.024
-opt.live_ranges.regalloc
-                      0.003 / 0.004 / 0.004 / 0.004 / 0.005
-opt.regalloc          0.009 / 0.010 / 0.012 / 0.010 / 0.015
-link_jit.all          0.047 / 0.050 / 0.056 / 0.051 / 0.061
-link.resolve.total    0.023 / 0.025 / 0.028 / 0.025 / 0.029
-```
-
-One small loop:
-
-```c
-int main(){int s=0; for(int i=0;i<10;i++) s+=i; return s==45?0:1;}
-```
-
-```text
-run.total             0.369 / 0.400 / 0.446 / 0.401 / 0.448
-run.compile_and_jit   0.351 / 0.381 / 0.414 / 0.381 / 0.427
-compile.tu            0.293 / 0.314 / 0.349 / 0.315 / 0.353
-compile.c.parse_codegen
-                      0.151 / 0.165 / 0.196 / 0.167 / 0.212
-opt.o1.total          0.090 / 0.097 / 0.107 / 0.098 / 0.122
-opt.live_blocks.pre_dde
-                      0.024 / 0.026 / 0.029 / 0.027 / 0.032
-opt.live_ranges.regalloc
-                      0.012 / 0.013 / 0.015 / 0.013 / 0.015
-opt.regalloc          0.025 / 0.027 / 0.030 / 0.028 / 0.042
-link_jit.all          0.048 / 0.052 / 0.066 / 0.055 / 0.088
-link.resolve.total    0.023 / 0.026 / 0.034 / 0.026 / 0.035
-```
-
-For tiny programs, compile dominates, with O1 and JIT/link image setup as the
-main visible sub-buckets. Link/JIT is already sub-millisecond.
-
-## Scaling Programs
-
-Two synthetic ladders were used:
-
-- `straight_main`: N loop-heavy static functions, and `main` directly calls
-  every function. This creates one large straight-line function in addition to
-  N small loop functions.
-- `table_main`: the same N loop-heavy functions, but `main` iterates through a
-  function-pointer table. This keeps `main` compact and isolates the cost of
-  many functions from the cost of one large function.
-
-Samples: 7 or 5 runs through 256 functions, 3 runs at 512 and 1024 functions.
-The table below uses p50 timings. O1 scopes are summed across functions.
-
-### Straight-Line Main
-
-```text
-funcs  insts  vals   conflict_bytes  run.total  compile.tu  opt.o1.total  live_pre  live_reg  regalloc  link_jit
-1      74     50     800               0.845       0.641        0.381       0.087     0.084     0.112    0.135
-4      278    188    3008              2.027       1.797        1.279       0.326     0.324     0.421    0.173
-16     1094   740    11840             5.644       5.460        4.088       1.216     1.203     1.528    0.119
-64     4358   2948   56576            22.089      21.817       16.926       5.406     5.365     6.683    0.149
-128    8710   5892   131520           48.249      47.906       38.203      12.782    12.696    15.655    0.170
-256    17414  11780  336704          111.786     111.161       92.228      33.042    32.883    40.054    0.224
-512    34822  23556  968256          289.752     288.507      249.098      95.864    96.069   115.167    0.309
-1024   69638  47108  3116096         877.053     874.755      793.174     325.557   321.390   379.543    0.513
-```
-
-The 512 to 1024 step is the clearest warning:
-
-```text
-compile.tu             288.507 -> 874.755 ms  3.03x
-opt.o1.total           249.098 -> 793.174 ms  3.18x
-opt.live_info.pre_dde   95.864 -> 325.557 ms  3.40x
-opt.live_info.regalloc  96.069 -> 321.390 ms  3.35x
-opt.regalloc           115.167 -> 379.543 ms  3.30x
-```
-
-The associated `opt.conflict_bytes` grows from `968256` to `3116096`, about
-3.2x for a 2x input. That points directly at dense liveness/conflict data as
-the superlinear path.
-
-### Function-Table Main
-
-```text
-funcs  insts  vals   conflict_bytes  run.total  compile.tu  opt.o1.total  live_pre  live_reg  regalloc  link_jit
-1      94     60     960               0.895       0.692        0.404       0.095     0.092     0.125    0.141
-4      286    189    3024              1.902       1.692        1.178       0.314     0.309     0.401    0.148
-16     1054   705    11280             5.506       5.315        3.940       1.146     1.126     1.433    0.120
-64     4126   2769   44304            19.621      19.353       14.632       4.411     4.394     5.538    0.152
-128    8222   5521   88336            38.127      37.795       28.668       8.718     8.699    10.934    0.169
-256    16414  11025  176400           75.585      75.103       57.207      17.368    17.411    21.901    0.215
-512    32798  22033  352528          151.642     150.752      114.097      34.786    34.678    43.643    0.289
-1024   65566  44049  704784          303.295     301.632      226.238      68.497    68.306    86.028    0.450
-```
-
-Here the 512 to 1024 step is essentially linear:
-
-```text
-compile.tu             150.752 -> 301.632 ms  2.00x
-opt.o1.total           114.097 -> 226.238 ms  1.98x
-opt.live_info.pre_dde   34.786 ->  68.497 ms  1.97x
-opt.live_info.regalloc  34.678 ->  68.306 ms  1.97x
-opt.regalloc            43.643 ->  86.028 ms  1.97x
-```
 
-This indicates that "many functions" is not the current scaling problem. The
-problem is a single large function whose liveness/conflict structures become
-large enough to bend past linear.
+`opt.conflict_bytes` is expected to stay zero on the normal O1 path. The old
+dense pseudo-register conflict matrix is no longer part of O1 allocation.
 
-### Phase-1 Pass-Local Liveness Rerun
+## Normal Function-Count Ladders
 
-After `doc/MIR_RA_REPORT.md` phase 1, pre-DDE uses pass-local block liveness
-instead of full `opt_live_info`. Regalloc still uses the old dense liveness and
-conflict path. The synthetic source generator used here is the same family as
-the tables above, but these are fresh generated inputs rather than byte-for-byte
-identical files, so compare trends and bucket shapes more than individual rows.
+Two synthetic C ladders are used for broad scaling:
 
-Samples: 5 runs per point. The table uses p50 milliseconds. O1 scopes and
-O1 counters are summed across functions. `pre_block_bytes` is
-`opt.live.block_bytes` from `opt.live_blocks.pre_dde`; `reg_conflict_bytes` is
-the remaining `opt.conflict_bytes` from `opt.live_info.regalloc`.
+- `straight`: N loop-heavy static functions, with `main` directly calling every
+  function. This creates N small functions plus one larger straight-line caller.
+- `table`: the same N functions, but `main` calls through a function-pointer
+  table. This keeps the caller compact and isolates "many functions" from "one
+  larger function".
 
-Straight-line main:
+O1 scopes are summed across all functions.
 
 ```text
-funcs  input_bytes  insts  vals   pre_block_bytes  reg_conflict_bytes  run.total  compile.tu  parse_codegen  opt.o1.total  live_blocks_pre  live_reg  regalloc  link_jit
-1      372          91     72     320              576                   0.737      0.627       0.465          0.297         0.029            0.128     0.171     0.062
-4      1327         322    270    608              2160                  1.538      1.421       1.277          0.837         0.062            0.442     0.578     0.058
-16     5236         1246   1062   1984             9056                  4.861      4.756       4.610          3.085         0.174            1.787     2.285     0.056
-64     21028        4942   4230   7264             42224                19.542     19.386      19.249         13.377         0.665            8.199    10.403     0.073
-128    42420        9870   8454   14304            100784               42.077     41.830      41.693         30.284         1.314           19.476    24.434     0.107
-256    85940        19726  16902  28384            267056               98.532     97.987      97.845         74.982         2.603           51.347    63.176     0.157
-512    172980       39438  33798  56544            796208              254.219    252.884     252.723        206.802         5.174          152.154   183.374     0.269
-1024   347352       78862  67590  112864           2640944             742.562    740.519     740.362        646.586        10.425          502.879   599.496     0.436
-```
+straight
+N     compile.tu  opt.o1.total  live_pre  live_reg  regalloc  insts  vals   ranges  spills  used_words  conflict
+64       17.442        15.776     2.495     3.941     8.740   1856    833    1789       0        3198         0
+128      31.210        28.372     4.453     7.191    15.631   3712   1665    3581       0        6398         0
+256      60.026        54.522     8.544    13.764    30.201   7424   3329    7165       0       12798         0
+512     119.735       108.974    16.987    27.253    61.085  14848   6657   14333       0       25598         0
+1024    234.857       210.831    32.958    52.953   115.918  29696  13313   28669       0       51198         0
 
-Function-table main:
-
-```text
-funcs  input_bytes  insts  vals   pre_block_bytes  reg_conflict_bytes  run.total  compile.tu  parse_codegen  opt.o1.total  live_blocks_pre  live_reg  regalloc  link_jit
-1      461          112    83     576              664                   0.694      0.602       0.465          0.285         0.023            0.129     0.173     0.058
-4      1392         328    269    864              2152                  1.513      1.386       1.251          0.805         0.053            0.434     0.565     0.058
-16     5200         1192   1013   2016             8104                  4.801      4.687       4.550          3.030         0.194            1.697     2.189     0.060
-64     20560        4648   3989   6624             31912                17.431     17.261      17.111         11.452         0.703            6.560     8.487     0.079
-128    41349        9256   7957   12768            63656                34.170     33.918      33.777         22.640         1.289           13.077    16.946     0.114
-256    83589        18472  15893  25056            127144               67.625     67.144      67.004         45.065         2.567           26.043    33.724     0.156
-512    168069       36904  31765  49632            254120              134.804    134.093     133.941         89.938         5.150           51.976    67.348     0.250
-1024   337298       73768  63509  98784            508072              272.397    271.171     271.026        179.323        10.224          103.779   134.391     0.402
+table
+N     compile.tu  opt.o1.total  live_pre  live_reg  regalloc  insts  vals   ranges  spills  used_words  conflict
+64       14.246        12.916     2.046     3.269     7.113   1620    713    1555       0        2720         0
+128      28.560        26.036     4.128     6.591    14.368   3220   1417    3091       0        5408         0
+256      57.608        52.647     8.371    13.331    28.973   6420   2825    6163       0       10784         0
+512     114.689       104.540    16.557    26.401    57.560  12820   5641   12307       0       21536         0
+1024    232.619       209.061    33.476    52.789   115.426  25620  11273   24595       0       43040         0
 ```
 
-The 512 to 1024 step now separates the fixed part from the remaining problem:
+512 to 1024 ratios:
 
 ```text
-straight_main:
-compile.tu              252.884 -> 740.519 ms  2.93x
-opt.o1.total            206.802 -> 646.586 ms  3.13x
-opt.live_blocks.pre_dde   5.174 ->  10.425 ms  2.01x
-opt.live_info.regalloc  152.154 -> 502.879 ms  3.31x
-opt.regalloc            183.374 -> 599.496 ms  3.27x
-pre_block_bytes          56544  -> 112864      2.00x
-reg_conflict_bytes      796208  -> 2640944     3.32x
-
-table_main:
-compile.tu              134.093 -> 271.171 ms  2.02x
-opt.o1.total             89.938 -> 179.323 ms  1.99x
-opt.live_blocks.pre_dde   5.150 ->  10.224 ms  1.99x
-opt.live_info.regalloc   51.976 -> 103.779 ms  2.00x
-opt.regalloc             67.348 -> 134.391 ms  2.00x
-pre_block_bytes          49632  -> 98784       1.99x
-reg_conflict_bytes      254120  -> 508072      2.00x
+shape     compile.tu  opt.o1.total  live_pre  live_reg  regalloc  used_words
+straight     1.96x        1.93x       1.94x     1.94x     1.90x      2.00x
+table        2.03x        2.00x       2.02x     2.00x     2.01x      2.00x
 ```
 
-The phase-1 result is the intended partial fix: pre-DDE no longer allocates a
-conflict matrix and its block-live storage scales linearly on the large
-straight-line case. The remaining superlinear behavior is now concentrated in
-the regalloc-side full liveness/conflict path.
-
-### Phase-3 Range-Allocator Rerun
-
-After `doc/MIR_RA_REPORT.md` phase 3, O1 assignment no longer uses the dense
-pseudo conflict matrix. Regalloc rebuilds pass-local block liveness and live
-ranges, assigns against `used_locs[point]`, and reports
-`opt.conflict_bytes=0` on the normal allocation path. Rewrite still uses
-per-instruction `live_after`, so this is not the final phase-4 shape.
-
-Samples: 5 runs per point. The table uses p50 milliseconds. O1 scopes and O1
-counters are summed across functions. The generator is the same direct-call vs
-function-table family as above, but these are fresh generated inputs and not
-byte-for-byte identical to the earlier tables.
-
-Straight-line main:
+This is the desired current shape. Many functions and the normal direct-call
+large-caller case are essentially linear through 1024 generated functions.
 
-```text
-funcs  input_bytes  insts  vals   pre_block_bytes  ranges  range_points  used_loc_words  conflict_bytes  run.total  compile.tu  parse_codegen  opt.o1.total  live_blocks_pre  live_ranges_reg  regalloc  link_jit
-1      139          55     32     576              28      41            82              0               0.551      0.444       0.296          0.184         0.048            0.025            0.051     0.057
-4      430          178    110    1632             100     143           286             0               0.887      0.784       0.642          0.418         0.114            0.069            0.132     0.055
-16     1624         670    422    6080             388     551           1189            0               2.433      2.306       2.147          1.414         0.391            0.254            0.480     0.069
-64     6520         2638   1670   23648            1540    2183          5674            0               8.195      8.042       7.894          5.236         1.435            0.992            1.894     0.084
-128    13188        5262   3334   47072            3076    4359          13894           0              15.634     15.399      15.250         10.068         2.657            1.936            3.893     0.102
-256    26884        10510  6662   93920            6148    8711          38014           0              31.114     30.795      30.648         20.296         5.097            3.876            8.415     0.139
-512    54276        21006  13318  187616           12292   17415         116974          0              66.957     66.408      66.265         44.891        10.516            8.204           20.599     0.219
-1024   109180       41998  26630  375008           24580   34823         397774          0             157.795    156.418     156.251        109.919        22.778           18.122           57.799     0.374
-```
+## Spill-Pressure Ladder
 
-Function-table main:
+The old `spill` row in `run_scaling.sh` initializes many locals from constants;
+current O1 folds most of that away, so it is useful as a frontend/source-size
+sanity check but not as a pressure test. The useful pressure ladder is
+`spill_argc`, which initializes each local from `argc` so values remain live.
 
 ```text
-funcs  input_bytes  insts  vals   pre_block_bytes  ranges  range_points  used_loc_words  conflict_bytes  run.total  compile.tu  parse_codegen  opt.o1.total  live_blocks_pre  live_ranges_reg  regalloc  link_jit
-1      212          76     43     832              39      56            112             0               0.594      0.486       0.335          0.194         0.051            0.031            0.060     0.061
-4      482          184    109    1888             99      143           286             0               0.944      0.833       0.685          0.437         0.118            0.072            0.140     0.061
-16     1587         616    373    6112             339     491           982             0               2.266      2.145       1.993          1.316         0.375            0.230            0.432     0.064
-64     6099         2344   1429   23008            1299    1883          3766            0               7.565      7.414       7.259          4.806         1.347            0.861            1.605     0.077
-128    12228        4648   2837   45536            2579    3739          7478            0              14.055     13.851      13.703          8.959         2.491            1.659            3.058     0.098
-256    24772        9256   5653   90592            5139    7451          14902           0              27.074     26.765      26.620         17.091         4.677            3.237            5.942     0.141
-512    49860        18472  11285  180704           10259   14875         29750           0              54.364     53.839      53.682         33.996         9.330            6.422           11.850     0.225
-1024   100133       36904  22549  360928           20499   29723         59446           0             114.081    112.879     112.706         68.967        18.959           12.909           23.842     0.367
+spill_argc
+N     compile.tu  opt.o1.total  live_pre  live_reg  regalloc  insts  vals  ranges  spills  reloads  inserted  used_words  conflict
+128       1.388         0.769     0.045     0.169     0.474    770   641     767     108      108       216        3076         0
+256       2.468         1.433     0.058     0.291     0.929   1538  1281    1535     236      236       472        9222         0
+512       4.658         2.750     0.072     0.526     1.825   3074  2561    3071     492      492       984       30730         0
+1024      9.894         5.681     0.110     1.058     4.073   6146  5121    6143    1004     1004      2008      110610         0
 ```
 
-The 512 to 1024 step after phase 3:
+512 to 1024 ratios:
 
 ```text
-straight_main:
-compile.tu               66.408 -> 156.418 ms  2.36x
-opt.o1.total             44.891 -> 109.919 ms  2.45x
-opt.live_blocks.pre_dde  10.516 ->  22.778 ms  2.17x
-opt.live_ranges.regalloc  8.204 ->  18.122 ms  2.21x
-opt.regalloc             20.599 ->  57.799 ms  2.81x
-pre_block_bytes          187616 -> 375008      2.00x
-used_loc_words           116974 -> 397774      3.40x
-conflict_bytes                0 -> 0
-
-table_main:
-compile.tu               53.839 -> 112.879 ms  2.10x
-opt.o1.total             33.996 ->  68.967 ms  2.03x
-opt.live_blocks.pre_dde   9.330 ->  18.959 ms  2.03x
-opt.live_ranges.regalloc  6.422 ->  12.909 ms  2.01x
-opt.regalloc             11.850 ->  23.842 ms  2.01x
-pre_block_bytes          180704 -> 360928      2.00x
-used_loc_words            29750 -> 59446       2.00x
-conflict_bytes                0 -> 0
+compile.tu       2.12x
+opt.o1.total     2.07x
+live_pre         1.53x
+live_reg         2.01x
+regalloc         2.23x
+spills           2.04x
+used_words       3.60x
 ```
 
-This is the intended phase-3 result: the old conflict matrix is gone from O1,
-and the large straight-line case is much faster than the phase-1 rerun. The
-remaining non-linear straight-line bucket now tracks `used_loc_words`, because
-the first range allocator uses dense occupied-location bitmaps whose stack-slot
-portion is sized by candidate count. Table-main stays close to linear because
-each function remains small.
+This is the main remaining mild super-linear signal. Wall time is only slightly
+above 2x, but `used_words` grows 3.60x. That points at occupancy/assignment
+working-set growth under stack pressure, even though the old conflict matrix is
+gone. The next investigation should look at why per-location interval occupancy
+and candidate checking expand faster than spills/ranges for this generated
+shape.
 
-## Compile Perf
+## Parameter-Heavy Ladder
 
-`compile.tu` is now split into generic and C-specific scopes:
+This C ladder uses one function with N integer parameters and a caller that
+passes N constants. It stresses ABI lowering, stack arguments, and callee-side
+parameter use.
 
 ```text
-compile.tu
-  compile.input_bytes
-  compile.frontend
-    compile.c.setup
-      compile.c.pool_new
-      compile.c.lex_open
-      compile.c.pp_new
-      compile.c.cg_new
-      compile.c.decl_new
-    compile.c.pp_options
-    compile.c.pp_push_input
-    compile.c.parse_codegen
-      opt.o1.total ...
-    compile.c.cleanup
-  compile.obj_finalize
-  compile.obj_sections
-  compile.obj_relocs
+params
+N     compile.tu  opt.o1.total  live_pre  live_reg  regalloc  insts  vals  ranges  spills  used_words  conflict
+8         0.816         0.577     0.091     0.143     0.305     37    26      32       0        72         0
+16        0.850         0.595     0.090     0.150     0.320     69    50      64       0       136         0
+32        0.932         0.644     0.093     0.162     0.357    133    98     128      12       392         0
+64        1.053         0.698     0.093     0.175     0.396    261   194     256      44       776         0
+128       1.367         0.866     0.096     0.211     0.510    517   386     512     108      2056         0
+256       1.979         1.209     0.099     0.281     0.740   1029   770    1024     236      6152         0
+512       3.484         2.088     0.110     0.434     1.334   2053  1538    2048     492     20488         0
 ```
 
-Updated phase-3 1024-function p50 examples with the detailed compile scopes:
+256 to 512 ratios:
 
 ```text
-metric                         straight_main  table_main
-run.total                      157.795 ms     114.081 ms
-compile.tu                     156.418 ms     112.879 ms
-compile.frontend               156.411 ms     112.868 ms
-compile.c.setup                  0.128 ms       0.130 ms
-compile.c.pp_options             0.003 ms       0.004 ms
-compile.c.parse_codegen        156.251 ms     112.706 ms
-compile.c.cleanup                0.020 ms       0.015 ms
-compile.obj_finalize             0.000 ms       0.000 ms
-opt.o1.total summed            109.919 ms      68.967 ms
+compile.tu       1.76x
+opt.o1.total     1.73x
+live_pre         1.11x
+live_reg         1.54x
+regalloc         1.80x
+spills           2.08x
+used_words       3.33x
 ```
 
-Subtracting summed O1 from `compile.tu` leaves about `44-46 ms` in both
-variants:
-
-```text
-straight_main: 156.418 - 109.919 = 46.499 ms
-table_main:    112.879 -  68.967 = 43.912 ms
-```
+The timing buckets are sub-linear to mildly linear here because the fixed
+frontend/O1 setup costs are still visible. The counter to watch is again
+`used_words`: it grows faster than spills and ranges when many parameters force
+stack traffic. This is not yet a wall-time bottleneck, but it is the clearest
+counter-level sign that parameter/stack occupancy needs attention before much
+larger parameter-heavy inputs are treated as normal workloads.
 
-That residual is the current frontend/PP/parser/CG API cost for roughly
-100 KiB of generated C in these phase-3 inputs. It appears linear with input
-size and number of small functions in these tests. The setup, PP option
-application, cleanup, and object finalization scopes are negligible. The opaque
-part that remains is inside `compile.c.parse_codegen`, which streams PP tokens,
-parses declarations and statements, performs type work, drives the CG API, and
-triggers per-function O1 when function bodies end.
+## Toy Samples
 
-### Compile Scope Scaling
-
-After adding the detailed `compile.tu` scopes, the scale ladders were rerun.
-The table below uses p50 milliseconds. `opt.o1.total` is summed across
-functions, and `non_o1_residual` is `compile.tu - opt.o1.total`.
-
-Straight-line main:
+The Toy frontend uses the same CG/O1/backend machinery, so these are useful
+small API-shape probes. O1 scopes are summed across functions.
 
 ```text
-funcs  input_bytes  compile.tu  parse_codegen  opt.o1.total  non_o1_residual
-1      139            0.444        0.296          0.184          0.260
-4      430            0.784        0.642          0.418          0.366
-16     1624           2.306        2.147          1.414          0.892
-64     6520           8.042        7.894          5.236          2.806
-128    13188         15.399       15.250         10.068          5.331
-256    26884         30.795       30.648         20.296         10.499
-512    54276         66.408       66.265         44.891         21.517
-1024   109180       156.418      156.251        109.919         46.499
+case                      compile.tu  opt.o1.total  live_pre  live_reg  regalloc  insts  vals  ranges  spills  used_words  conflict
+06_while_sum                  0.436         0.319     0.049     0.075     0.163     15     7      13       0        26         0
+09_function_params            0.645         0.569     0.091     0.142     0.302     10     8       8       0        22         0
+111_many_function_params      0.682         0.591     0.092     0.147     0.316     38    22      36       0        78         0
+123_spec_demo                 4.709         4.229     0.659     1.016     2.270    788   359     590       0      1168         0
 ```
 
-Function-table main:
+The Toy samples do not show a compile-time scaling problem at this size. The
+larger spec demo is dominated by O1/regalloc, but counters are small and there
+is no conflict-matrix traffic.
 
-```text
-funcs  input_bytes  compile.tu  parse_codegen  opt.o1.total  non_o1_residual
-1      212            0.486        0.335          0.194          0.292
-4      482            0.833        0.685          0.437          0.396
-16     1587           2.145        1.993          1.316          0.829
-64     6099           7.414        7.259          4.806          2.608
-128    12228         13.851       13.703          8.959          4.892
-256    24772         26.765       26.620         17.091          9.674
-512    49860         53.839       53.682         33.996         19.843
-1024   100133       112.879      112.706         68.967         43.912
-```
+## Current Interpretation
 
-The detailed compile scopes show:
-
-- `compile.frontend` is effectively all of `compile.tu`; object finalization is
-  below measurable resolution in these runs.
-- `compile.c.parse_codegen` is effectively all of `compile.frontend`.
-- `compile.c.setup` stays flat around `0.14 ms`; most of that is `pp_new`
-  registering predefined macros.
-- `compile.c.pp_options` stays around `0.02 ms` with no `-I`, `-D`, or `-U`
-  options.
-- `compile.c.cleanup` grows only to about `0.01-0.02 ms`.
-- `non_o1_residual` is near-linear with source size at larger sizes, around
-  `430-460 us/KiB` from 64 through 1024 functions in the phase-3 inputs.
-
-So the compile-side scaling story is two-layered: the broad C parse/codegen
-bucket scales linearly with input size after startup overhead, while the nested
-O1 liveness/regalloc work can dominate that bucket and become superlinear for a
-single large function.
-
-Next useful compile instrumentation:
-
-- Count tokens consumed by `pp_next`.
-- Count top-level declarations and function definitions in `parse_c`.
-- Split parse/codegen around function bodies, global declarations, and data
-  emission.
-- Add CG API counters for emitted ops, stack/value operations, local slots, and
-  object writes.
-- Add pool/arena allocation counters or high-water marks for frontend and CG
-  arenas.
-
-### Phase-4 Rewrite Rerun
-
-After `doc/MIR_RA_REPORT.md` phase 4, O1 rewrite no longer materializes
-`Block.live_after`. Inline-asm clobber constraints and call-clobber
-save/restore insertion now use rolling backward live sets. The old full
-per-instruction live-after storage is still available through `opt_live_info`
-for compatibility tests, but the normal O1 `opt_regalloc` path no longer
-requires it.
-
-Samples: 5 runs per point. The table uses p50 milliseconds. O1 scopes and O1
-counters are summed across functions. The generator is the same direct-call vs
-function-table family used for the phase-3 rerun, but these are fresh generated
-inputs and not byte-for-byte identical to the earlier tables.
-
-Straight-line main:
+The current O1 compile-time story is good enough to shift attention toward
+generated-code quality:
 
-```text
-funcs input_bytes insts vals  pre_block_bytes ranges range_points used_loc_words conflict_bytes run.total compile.tu parse_codegen opt.o1.total live_blocks_pre live_ranges_reg regalloc link_jit
-1     218         61    38    576             37     52           104            0              0.584     0.475      0.319         0.208        0.059           0.033           0.067    0.062
-4     731         226   140   2016            142    196          392            0              1.050     0.943      0.800         0.530        0.146           0.090           0.193    0.058
-16    2810        886   548   7872            562    772          1776           0              3.044     2.912      2.754         1.855        0.507           0.345           0.726    0.065
-64    11191       3526  2180  31392           2242   3076         8864           0              10.963    10.756     10.599        7.140        1.924           1.375           2.906    0.089
-128   22450       7046  4356  62688           4482   6148         23096          0              21.189    20.676     20.524        13.718       3.682           2.690           5.706    0.110
-256   45191       14086 8708  125280          8962   12292        67688          0              41.840    41.173     41.019        27.476       7.238           5.464           11.777   0.157
-512   90668       28166 17412 250464          17922  24580        221384         0              87.461    86.405     86.252        58.701       14.896          11.760          26.346   0.243
-1024  181691      56326 34820 500832          35842  49156        786824         0              190.569   188.638    188.479       129.835      32.046          26.404          61.906   0.425
-```
-
-Function-table main:
+- Normal many-function and large-caller C inputs are approximately linear.
+- The old dense liveness/conflict path is gone from normal O1
+  (`opt.conflict_bytes=0`).
+- Link/JIT is not the bottleneck in these runs; the visible time is frontend
+  parse/codegen plus O1.
+- Spill-heavy and parameter-heavy shapes still have counter-level
+  super-linear growth in `used_words`. Wall time is only mildly affected today,
+  but future work should verify interval occupancy and candidate checking do not
+  drift back into point/candidate-product behavior.
 
-```text
-funcs input_bytes insts vals  pre_block_bytes ranges range_points used_loc_words conflict_bytes run.total compile.tu parse_codegen opt.o1.total live_blocks_pre live_ranges_reg regalloc link_jit
-1     311         82    49    832             48     67           134            0              0.665     0.545      0.381         0.230        0.064           0.036           0.077    0.066
-4     788         226   133   2272            135    190          380            0              1.122     0.989      0.839         0.550        0.149           0.094           0.203    0.068
-16    2718        802   469   8032            483    682          1364           0              2.886     2.744      2.592         1.737        0.481           0.316           0.661    0.076
-64    10475       3106  1813  31072           1875   2650         5300           0              10.054    9.828      9.670         6.569        1.827           1.207           2.491    0.097
-128   20875       6178  3605  61792           3731   5274         10548          0              18.978    18.667     18.515        12.271       3.328           2.275           4.800    0.117
-256   41824       12322 7189  123232          7443   10522        21044          0              37.247    36.757     36.593        24.043       6.552           4.505           9.416    0.158
-512   83717       24610 14357 246112          14867  21018        42036          0              74.068    73.190     73.035        47.519       12.995          8.945           18.699   0.236
-1024  167549      49186 28693 491872          29715  42010        84020          0              153.281   151.622    151.471       96.303       26.390          17.895          37.547   0.382
-```
-
-To isolate the phase-4 change, the same generated inputs were also run against
-a temporary clean worktree at the phase-3 baseline (`HEAD` before the phase-4
-working-tree edits). The table below shows phase-3 p50 to phase-4 p50 on those
-identical inputs.
-
-```text
-straight_main:
-funcs run.total                opt.o1.total              live_ranges_reg        regalloc
-512   88.544 -> 87.461 (-1.2%) 60.910 -> 58.701 (-3.6%)  12.268 -> 11.760 (-4.1%) 29.589 -> 26.346 (-11.0%)
-1024  206.333 -> 190.569 (-7.6%) 147.862 -> 129.835 (-12.2%) 27.774 -> 26.404 (-4.9%) 82.750 -> 61.906 (-25.2%)
-
-table_main:
-funcs run.total                opt.o1.total             live_ranges_reg        regalloc
-512   69.043 -> 74.068 (+7.3%) 43.713 -> 47.519 (+8.7%) 9.240 -> 8.945 (-3.2%)  15.729 -> 18.699 (+18.9%)
-1024  143.333 -> 153.281 (+6.9%) 88.574 -> 96.303 (+8.7%) 18.549 -> 17.895 (-3.5%) 31.645 -> 37.547 (+18.7%)
-```
+## Investigation Priorities
 
-An extra 7-sample repeat at 512 and 1024 confirmed the shape: large
-straight-line `regalloc` improves by about 13-27%, while the table-shaped case
-regresses by about 16-17% in `regalloc`. The live-range analysis bucket itself
-improves modestly in both shapes. The table-main regression is therefore in
-rewrite mechanics, not range construction.
-
-The most likely cause is the first phase-4 implementation strategy: it rewrites
-each block while walking backward, builds per-instruction temporary rewrite
-lists, appends them in reverse, then reverses the block. That removes the large
-persistent `live_after` allocation and helps the large straight-line case, but
-it adds list-copy overhead that is visible when many compact functions dominate
-the workload. A forward rewrite with a precomputed rolling-live cursor or a
-block-local edit buffer should keep the no-`live_after` shape without the
-table-main tax.
-
-### Phase-7 Fresh Rerun
-
-After phases 5, 6, and the implemented phase-7 occupancy split, the normal O1
-path no longer contains the old dense pseudo-conflict/live-after state. Rewrite
-uses a block-local output buffer, `opt_live_info` and persistent block liveness
-fields are gone, and allocator occupancy is split into hard-register and
-demand-sized stack-slot tables.
-
-Samples: 5 runs per point. The table uses p50 milliseconds. The direct-call and
-function-table generators are the same family as earlier sections, but these
-are fresh generated inputs and not byte-for-byte identical to prior phase
-tables.
-
-Straight-line main:
-
-```text
-funcs input_bytes insts vals  block_bytes ranges range_points used_loc_words hard_words stack_words stack_slots conflict_bytes run.total compile.tu parse_codegen opt.o1.total live_pre live_ranges_reg regalloc link_jit
-1     331         84    57    576         53     70           140            140        0           0           0              0.625     0.517      0.365         0.230        0.061    0.033           0.073    0.058
-4     1051        297   213   1632        203    262          524            524        0           0           0              1.217     1.100      0.943         0.603        0.150    0.105           0.218    0.060
-16    3971        1149  837   6080        803    1030         2060           2060       0           0           0              3.562     3.445      3.301         2.148        0.539    0.375           0.803    0.061
-64    15749       4557  3333  23648       3203   4102         8204           8204       0           0           0             12.945    12.762     12.612         8.143        1.997    1.499           3.197    0.083
-128   31598       9101  6661  47072       6403   8198         16396          16396      0           0           0             24.576    24.306     24.158        15.474        3.730    2.955           6.252    0.118
-256   63654       18189 13317 93920       12803  16390        32780          32780      0           0           0             48.921    48.385     48.237        30.977        7.385    6.008          12.770    0.159
-512   127759      36365 26629 187616      25603  32774        65548          65548      0           0           0             98.418    97.658     97.503        63.030       14.960   12.268          26.552    0.249
-1024  256093      72717 53253 375008      51203  65542        131084         131084     0           0           0            206.580   204.832    204.682       131.904       30.624   25.894          57.609    0.421
-```
+1. Track `used_words` decomposition on stack-heavy inputs.
+   Add or inspect `hard_loc_words`, `stack_loc_words`, stack slot count, and
+   interval probe/mark counters for `spill_argc` and parameter-heavy ladders.
+   The immediate question is why `used_words` grows 3.3-3.6x while ranges and
+   spills only double.
 
-Function-table main:
+2. Keep `opt.conflict_bytes=0` as a regression guard.
+   Reintroducing dense pseudo conflict storage would be a clear O1 regression.
 
-```text
-funcs input_bytes insts vals  block_bytes ranges range_points used_loc_words hard_words stack_words stack_slots conflict_bytes run.total compile.tu parse_codegen opt.o1.total live_pre live_ranges_reg regalloc link_jit
-1     400         106   69    832         65     86           172            172        0           0           0              0.680     0.571      0.413         0.252        0.064    0.040           0.087    0.062
-4     1096        304   213   1888        203    263          526            526        0           0           0              1.244     1.132      0.988         0.625        0.162    0.107           0.224    0.062
-16    3914        1096  789   6112        755    971          1942           1942       0           0           0              3.473     3.346      3.188         2.061        0.516    0.358           0.765    0.066
-64    15260       4264  3093  23008       2963   3803         7606           7606       0           0           0             12.162    11.990     11.840         7.621        1.864    1.382           2.980    0.086
-128   30505       8488  6165  45536       5907   7579         15158          15158      0           0           0             23.249    22.999     22.848        14.553        3.484    2.701           5.777    0.110
-256   61281       16936 12309 90592       11795  15131        30262          30262      0           0           0             45.760    45.318     45.169        28.765        6.871    5.386          11.538    0.150
-512   122826      33832 24597 180704      23571  30235        60470          60470      0           0           0             91.469    90.751     90.601        57.470       13.717   10.718          23.064    0.240
-1024  246016      67624 49173 360928      47123  60443        120886         120886     0           0           0            190.086   187.863    187.699       116.461       27.974   21.462          46.345    0.417
-```
-
-The 512 to 1024 step is now close to linear in both normal shapes:
-
-```text
-straight_main:
-compile.tu               97.658 -> 204.832 ms  2.10x
-opt.o1.total             63.030 -> 131.904 ms  2.09x
-opt.live_blocks.pre_dde  14.960 ->  30.624 ms  2.05x
-opt.live_ranges.regalloc 12.268 ->  25.894 ms  2.11x
-opt.regalloc             26.552 ->  57.609 ms  2.17x
-used_loc_words            65548 -> 131084      2.00x
-hard_loc_words            65548 -> 131084      2.00x
-stack_loc_words               0 -> 0
-
-table_main:
-compile.tu               90.751 -> 187.863 ms  2.07x
-opt.o1.total             57.470 -> 116.461 ms  2.03x
-opt.live_blocks.pre_dde  13.717 ->  27.974 ms  2.04x
-opt.live_ranges.regalloc 10.718 ->  21.462 ms  2.00x
-opt.regalloc             23.064 ->  46.345 ms  2.01x
-used_loc_words            60470 -> 120886      2.00x
-hard_loc_words            60470 -> 120886      2.00x
-stack_loc_words               0 -> 0
-```
-
-This confirms that phases 5 and 7 removed the prior table-main rewrite tax and
-the direct-call `used_locs` superlinear growth for this generator. These inputs
-do not force spills, so the stack side of the occupancy split stays at zero.
-
-To exercise stack occupancy, a separate spill-pressure ladder used one large
-`main` that repeatedly forms 24 live arguments and calls a local `sink24`.
-Some larger cases return non-zero guest status because the generated integer
-result is intentionally ignored by the benchmark harness; compile/JIT metrics
-are still emitted before process exit.
-
-```text
-groups input_bytes insts vals  block_bytes ranges range_points used_loc_words hard_words stack_words stack_slots spills stores compile.tu opt.o1.total live_pre live_ranges_reg regalloc link_jit
-1      1103        204   176   288         172    200          551            400        151         16          16     16     0.815      0.384        0.097    0.075           0.150    0.055
-4      3377        651   548   864         544    647          1892           1294       598         16          64     64     2.062      1.127        0.242    0.271           0.584    0.060
-16     12905       2439  2036  3168        2032   2435         7256           4870       2386        16          256    256    9.368      6.436        1.072    1.452           4.212    0.071
-64     52745       9591  7988  12096       7984   9587         28712          19174      9538        16          1024   1024   76.878    64.917       7.948   12.695          49.800    0.110
-128    107881      19127 15924 24000       15920  19123        57320          38246      19074       16          2048   2048   262.764   234.319      25.701   43.068         186.860    0.168
-256    223337      38199 31796 47808       31792  38195        114536         76390      38146       16          4096   4096   964.562   893.449      89.543  156.086         732.317    0.255
-512    454249      76343 63540 95424       63536  76339        228968         152678     76290       16          8192   8192   3737.556  3536.356    332.393  595.509        2902.790    0.445
-```
-
-The spill ladder tells a different story:
-
-```text
-256 -> 512:
-compile.tu              964.562 -> 3737.556 ms  3.87x
-opt.o1.total            893.449 -> 3536.356 ms  3.96x
-opt.live_blocks.pre_dde  89.543 ->  332.393 ms  3.71x
-opt.live_ranges.regalloc 156.086 ->  595.509 ms  3.82x
-opt.regalloc            732.317 -> 2902.790 ms  3.96x
-used_loc_words          114536  -> 228968       2.00x
-hard_loc_words           76390  -> 152678       2.00x
-stack_loc_words          38146  ->  76290       2.00x
-stack_slots                 16  ->     16       1.00x
-spills                    4096  ->   8192       2.00x
-```
-
-The stack occupancy split is doing its job: stack words scale with allocated
-stack slots and points, not candidate count, and the slot count stays flat at
-16 while spills double. The remaining superlinear behavior is now elsewhere:
-wide single functions still drive full-width value bitset work in block
-liveness, live-range construction, DDE, and the rewrite/allocation walk. In
-other words, phase 7 fixed the specific `points * candidates` occupancy growth,
-but not the broader `instructions * live_words` shape for values that are live
-across very wide regions.
-
-### Phase-8 O1 Wide-Function Rerun
-
-Phase 8 attacks three remaining O1 costs exposed by the spill-pressure ladder:
-
-- allocator candidate ordering now uses `qsort` instead of insertion sort;
-- live-range construction uses per-instruction use/def value lists with
-  generation marks instead of clearing fixed-width temporary bitsets;
-- asm register-constraint application first checks whether the function has any
-  `IR_ASM_BLOCK`, avoiding a full backward live walk for normal C functions.
-
-Normal direct-call and table-shaped ladders remain close to linear at the
-512-to-1024 step:
-
-```text
-straight_main:
-compile.tu               97.938 -> 202.409 ms  2.07x
-opt.o1.total             61.803 -> 125.589 ms  2.03x
-opt.live_blocks.pre_dde  15.131 ->  30.724 ms  2.03x
-opt.live_ranges.regalloc 10.268 ->  20.757 ms  2.02x
-opt.regalloc             23.644 ->  47.980 ms  2.03x
-
-table_main:
-compile.tu               90.607 -> 188.297 ms  2.08x
-opt.o1.total             56.265 -> 115.105 ms  2.05x
-opt.live_blocks.pre_dde  13.915 ->  28.431 ms  2.04x
-opt.live_ranges.regalloc  9.156 ->  18.499 ms  2.02x
-opt.regalloc             21.062 ->  42.722 ms  2.03x
-```
-
-The spill-pressure ladder improves substantially in absolute time:
-
-```text
-groups compile.tu opt.o1.total live_pre live_ranges_reg regalloc hard_words stack_words stack_slots spills
-128       91.797       62.914    9.021           9.213   32.036      38246       19074          16   2048
-256      275.634      200.878   24.166          24.887  104.204      76390       38146          16   4096
-512      958.535      754.326   74.544          76.522  374.253     152678       76290          16   8192
-```
-
-Compared with the phase-7 512-group spill row, this cuts:
-
-```text
-compile.tu              3737.556 ->  958.535 ms  3.90x faster
-opt.o1.total            3536.356 ->  754.326 ms  4.69x faster
-opt.live_blocks.pre_dde  332.393 ->   74.544 ms  4.46x faster
-opt.live_ranges.regalloc 595.509 ->   76.522 ms  7.78x faster
-opt.regalloc            2902.790 ->  374.253 ms  7.76x faster
-```
-
-The 256-to-512 spill step is still superlinear:
-
-```text
-compile.tu              275.634 -> 958.535 ms  3.48x
-opt.o1.total            200.878 -> 754.326 ms  3.76x
-opt.live_blocks.pre_dde  24.166 ->  74.544 ms  3.08x
-opt.live_ranges.regalloc 24.887 ->  76.522 ms  3.07x
-opt.regalloc            104.204 -> 374.253 ms  3.59x
-hard_loc_words           76390  -> 152678      2.00x
-stack_loc_words          38146  ->  76290      2.00x
-stack_slots                 16  ->     16      1.00x
-spills                    4096  ->   8192      2.00x
-```
-
-The remaining wide-function cost is no longer the old conflict matrix,
-candidate sort, stack occupancy, or no-asm constraint scan. It is concentrated
-in repeated dense/high-watermark live-set work and range-point occupancy walks
-inside one very wide function.
-
-Next investigation should start in:
-
-- `src/opt/pass_live.c`: `opt_live_blocks`, `live_recompute_metrics`, and the
-  final per-value range summary in `opt_live_ranges_build`. These still have
-  high-watermark or all-value loops that show up on wide single functions.
-- `src/opt/pass_lower.c`: `alloc_collect_occupied_hard`,
-  `alloc_collect_occupied_stack`, `alloc_mark_hard_loc`, and
-  `alloc_mark_stack_loc`. These still walk every point in every candidate range;
-  interval/event state per location should replace repeated point scans.
-- `src/opt/pass_lower.c`: `rewrite_func` and `opt_dead_def_elim_with_live`.
-  Both still maintain rolling live bitsets with function-width scratch storage;
-  after range/occupancy work, convert these to set-bit or operand-event updates.
-
-### Current O1 MIR-Shape Rerun
-
-After the follow-up O1 cleanup, the normal path is closer to MIR in four
-important ways:
-
-- pre-DDE no longer builds a duplicate live-range set;
-- DDE and rewrite use operand-reference events instead of materializing
-  per-instruction full-width use/def bitsets;
-- live-range construction tracks open and touched values with generation marks
-  instead of rescanning every value at block boundaries;
-- O1 assignment avoids caller-saved registers for values known to be live
-  across calls when the value is not tied to a hard register.
-
-The table below is from a focused 3-sample p50 rerun over freshly generated
-stress inputs. It is not byte-for-byte comparable to the earlier phase tables,
-but it shows the current bucket shape and the counters that should drive the
-next work.
-
-```text
-case          compile.tu opt.o1.total live_pre dde   live_ranges_reg regalloc range.value_scans alloc.hard_point_visits rewrite.live_words dde.live_words spills
-straight 512     140.387      116.194   17.700 3.159           30.384   63.670             25616                   28689              81672          31398      0
-straight 1024    286.085      234.626   35.394 6.410           62.155  129.088             51216                   57361             261624          62774      0
-table 512        135.180      113.238   17.650 2.934           29.298   61.829             19506                   21058              25156          25152      0
-table 1024       273.757      226.367   35.209 6.007           58.357  123.448             38962                   42050              50244          50240      0
-spill 128         44.702       19.570    0.712 1.018            7.844   12.942             25700                   64252              77762          26306   2048
-spill 256        109.002       43.998    1.327 2.021           20.569   31.053             51300                  128380             257818          52506   4096
-spill 512        300.108      109.833    2.538 3.964           61.995   83.985            102500                  256636             925130         104906   8192
-```
-
-The normal straight-line and table-shaped ladders are now essentially linear.
-For 512 to 1024 functions, `opt.o1.total`, `opt.live_ranges.regalloc`,
-`opt.regalloc`, `opt.range.value_scans`, `opt.alloc.hard_point_visits`, and
-`opt.dde.live_words_touched` are all about `2.0x`. The one outlier is
-`opt.rewrite.live_words_touched` on straight-line input, which is `3.20x`,
-although its absolute time is no longer the dominant bucket.
-
-The spill-pressure ladder is much better in absolute time, but still bends:
-
-```text
-spill 128 -> 256:
-opt.o1.total             19.570 ->  43.998 ms  2.25x
-opt.live_ranges.regalloc  7.844 ->  20.569 ms  2.62x
-opt.regalloc             12.942 ->  31.053 ms  2.40x
-range.value_scans         25700 ->   51300     2.00x
-alloc.hard_point_visits   64252 ->  128380     2.00x
-rewrite.live_words        77762 ->  257818     3.32x
-
-spill 256 -> 512:
-opt.o1.total             43.998 -> 109.833 ms  2.50x
-opt.live_ranges.regalloc 20.569 ->  61.995 ms  3.01x
-opt.regalloc             31.053 ->  83.985 ms  2.70x
-range.value_scans         51300 ->  102500     2.00x
-alloc.hard_point_visits  128380 ->  256636     2.00x
-rewrite.live_words       257818 ->  925130     3.59x
-```
-
-This narrows the next O1 work. DDE is now roughly linear, range value scans are
-linear, and allocation point visits are linear but high. The remaining
-MIR-shape gap is concentrated in `opt.live_ranges.regalloc`, `opt.regalloc`,
-and rewrite live-word traffic on spill-heavy wide functions.
-
-### Final O1 Cleanup Rerun
-
-The follow-up cleanup removes the remaining high-watermark scans that were
-visible in the spill-pressure ladder:
-
-- `OptBitset` copy/metric paths now use active length and clear only stale
-  active words.
-- live-range construction tracks live-across-call values with a rolling value
-  list instead of scanning the full active bitmap at every call.
-- rewrite call save/restore insertion checks only hard-assigned caller-saved
-  values that are known to be live across some call.
-
-Focused single-sample spill-pressure sanity run:
-
-```text
-groups compile.tu opt.o1.total live_pre live_ranges_reg regalloc range.live_words rewrite.live_words hard_point_visits spills
-128       44.171       17.937    0.679           5.154   11.623              256             25602             64129   2048
-256      104.255       38.741    1.290          10.271   25.900              512             51202            128257   4096
-512      280.679       90.815    2.648          22.043   64.286             1024            102402            256513   8192
-```
-
-The important counter shape is now linear where the prior O1 work was still
-superlinear: `range.live_words`, `rewrite.live_words`, point visits, and spills
-all double as the input doubles. The wall-time buckets are close enough to the
-linear target for O1 that remaining occupancy-loop work is no longer blocking
-the O2 coalescing/splitting work.
-
-### O1 Occupancy-Interval Cleanup
-
-The final O1 cleanup replaces the allocator's point-index occupied-location
-bitmaps with sorted per-location interval sets. Assignment still consumes the
-same compressed live ranges and produces the same `OptLoc` map for rewrite, but
-hard-register and stack-slot conflict checks no longer OR per-point bitmap
-rows. The old `hard_word_ors` and `stack_word_ors` counters therefore stay at
-zero on this path.
-
-Focused single-sample spill-pressure sanity run on a fresh generated input:
-
-```text
-groups opt.o1.total live_ranges_reg regalloc hard_point_visits stack_point_visits hard_mark_points stack_mark_points hard_word_ors stack_word_ors spills rewrite.live_words
-128          19.782           4.971   13.355              40195              17392            10754              2048             0              0   2048              25604
-256          43.253          10.285   29.960              80387              34800            21506              4096             0              0   4096              51204
-512          98.830          21.880   72.118             160771              69616            43010              8192             0              0   8192             102404
-```
-
-The important shape is still linear at the 128 -> 256 -> 512 steps: spills,
-rewrite live traffic, stack interval probes, hard interval probes, and interval
-marks all roughly double as the generated input doubles. This removes the last
-O1 allocator cleanup item identified by the MIR-shape report; further allocator
-work should now be justified by generated-code quality, O2 coalescing, or live
-range splitting rather than by O1 compile-time scaling.
-
-### Vstack Constfold Probe
-
-After `doc/CONSTFOLD.md` phase 5, the focused O1 probes were re-run with
-`build/cfree cc -O1 -target ... -c` and disassembled with `llvm-objdump -dr`.
-Instruction counts below include function prologue/epilogue traffic; the useful
-shape is that the direct literal, compound-assign, and literal compare/branch
-probes all reduce to immediate return materialization, while the address-taken
-probe still performs the load/add/store/reload sequence.
-
-```text
-probe                   arch   text  insn   mov arithmetic
-compound_assign         x64      45    11     6     1
-compound_assign         aa64     32     8     2     2
-compound_assign         rv64     72    18     1     3
-literal_return          x64      45    11     6     1
-literal_return          aa64     32     8     2     2
-literal_return          rv64     72    18     1     3
-compare_branch_literal  x64      45    11     6     1
-compare_branch_literal  aa64     32     8     2     2
-compare_branch_literal  rv64     72    18     1     3
-local_addr_taken        x64     107    25    18     1
-local_addr_taken        aa64     88    22     3     5
-local_addr_taken        rv64    128    32     1     4
-```
-
-Host JIT timings were measured with seven `build/cfree run --time -O1 -e
-test_main` samples per probe; the table uses p50 milliseconds for time buckets
-and stable first-sample counters for rewrite activity.
-
-```text
-probe                    opt.o1  live_ranges  regalloc  spills  reloads  inserted
-compound_assign           0.285        0.071     0.153       0        0         0
-literal_return            0.289        0.072     0.155       0        0         0
-compare_branch_literal    0.351        0.088     0.188       0        0         0
-local_addr_taken          0.351        0.088     0.187       0        0         0
-```
+3. Keep normal O1 branch cleanup block-linear.
+   `jump_cleanup` should remain a local pass over blocks/terminators. Global
+   layout work belongs in O2.
 
-### O1 Local Jump Cleanup
-
-The O1 branch-shape cleanup is intentionally local and bounded. It has two
-stages:
-
-- `jump_cleanup.cfg`: forwards branch-to-branch targets and inverts a
-  compare-branch only when that removes the immediately following
-  unconditional-jump block.
-- `jump_cleanup.layout`: runs immediately before emit and deletes an
-  unconditional branch whose target is the next physical block in `emit_order`.
-
-The scaling property is block/edge-linear for the O1 shapes this pass owns.
-The pass builds a per-block emit-order index once per run, memoizes resolved
-branch-only chains, and never walks values, liveness bitsets, ranges, or
-allocator occupancy. Its scratch storage is proportional to block count, and
-its work is proportional to blocks plus terminator edges plus branch-only chain
-members. It should therefore remain invisible next to `live_blocks`,
-`live_ranges_reg`, and `regalloc` in large O1 scaling probes. If future branch
-layout work needs global block ordering, keep that out of O1 and put it in O2.
-
-## Performance Priorities
-
-1. Keep O1 on interval occupancy.
-   The normal O1 allocator now uses per-location interval sets instead of
-   point-index bitmap rows. Do not reintroduce point-by-point OR/mark loops for
-   O1; if O2 needs richer placement, build it as interval/event state with
-   explicit metrics.
-
-2. Keep the split `used_locs` design.
-   `hard_loc_words` and `stack_loc_words` now scale linearly in both normal and
-   spill-pressure ladders. Stack occupancy is demand-sized by allocated stack
-   slots, so it is no longer the explanation for the wide-spill superlinear
-   timings.
-
-3. Decide whether O1 can avoid the second block-liveness pass.
-   O1 still runs block liveness before DDE and again for regalloc because DDE
-   changes the instruction stream. Keep this until the other buckets are
-   smaller, then either reuse/revalidate liveness after DDE or make DDE produce
-   enough local update information to avoid a full rebuild.
-
-4. Keep jump cleanup local and block-linear.
-   The O1 branch cleanup exists to remove obvious code-shape artifacts, not to
-   perform global layout. Keep it on cached emit-order indexes, memoized
-   branch-only chains, and block/terminator scans. Do not add value, liveness,
-   or range traffic to this pass.
-
-5. Avoid whole-function contiguous growth where hot passes scan repeatedly.
-   Large `Val`/block/instruction arrays and dense bit matrices are likely to
-   hurt cache locality. Segmented arrays may help by reducing large copies and
-   stabilizing addresses, but pass algorithms need to avoid turning segment
-   traversal into extra indirection in inner loops.
-
-6. Keep link/JIT lower priority for now.
-   Even at 1024 functions, `link_jit.all` is around `0.5 ms`; copy, reloc, and
-   icache flush are small and roughly linear. They are not the bottleneck for
-   submit-to-execution latency in these tests.
-
-7. Split `compile.c.parse_codegen` further before changing parser/CG data
-   structures.
-   The non-O1 compile residual is meaningful at large source sizes, but it is
-   currently broad. More counters are needed before choosing parser, PP, type,
-   or CG API changes.
+4. Split frontend residuals only after O1 code shape work.
+   The non-O1 compile residual is visible, but current scaling pressure is not
+   coming from object finalization or link/JIT. More parser/PP/CG counters would
+   be useful, but O1 generated-code cleanup has higher near-term value.

	kit kit
	git clone https://git.ryansepassi.com/git/kit.git
	Log \| Files \| Refs \| README

M	doc/OPT1.md	\|	566	+++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------------
M	doc/PERF.md	\|	994	+++++++++++++------------------------------------------------------------------