kit

kit
git clone https://git.ryansepassi.com/git/kit.git
Log | Files | Refs | README

commit f639d9174f8a364814dc16e21a1bd1ffa3634902
parent 0bbc3338aeb7ede4ee32d37859e5309d1e3605d4
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Sun, 17 May 2026 16:11:06 -0700

OP1.md doc refresh + perf refresh

Diffstat:
Mdoc/OPT1.md | 566+++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------------
Mdoc/PERF.md | 994+++++++++++++------------------------------------------------------------------
2 files changed, 554 insertions(+), 1006 deletions(-)

diff --git a/doc/OPT1.md b/doc/OPT1.md @@ -1,29 +1,33 @@ # OPT1 - Current O1 Backend Path -`-O1` is cfree's fast optimized backend tier. It records frontend CG operations -as IR, runs a small MIR-shaped lowering/allocation pipeline, rewrites virtual -values to hard registers or spill slots, performs narrow post-RA cleanup, and -then emits through the normal target backend. +`-O1` is cfree's fast optimized backend tier. It records frontend CG +operations as IR, runs a small MIR-shaped lowering/allocation pipeline, rewrites +virtual values to hard registers or spill slots, performs narrow post-RA +cleanup, and emits through the target backend. -This document describes the implemented O1 path and tracks remaining O1 code -quality and compile-time work. Broader optimizer direction lives in -`doc/OPT.md`; allocator performance history and measurements live in -`doc/PERF.md` and `doc/MIR_RA_REPORT.md`. +This document has two jobs: + +1. document the implemented O1 pipeline; and +2. track the remaining work needed to clean up O1 generated code. + +Compile-time measurements and scaling notes live in `doc/PERF.md`. Broader +optimizer direction lives in `doc/OPT.md`. ## Goals -O1 is the bisection floor for optimized codegen. It should remain: +O1 should remain: - correct across supported targets and ABI lowering paths; - materially better than O0 generated code; - close enough to O0 compile time for JIT use; -- free of unbounded dense pseudo-register conflict matrices; +- free of dense pseudo-register conflict matrices and whole-function + point/candidate products in hot paths; - simple enough that O2 can add coalescing, splitting, SSA, and value optimization without destabilizing the O1 baseline. O1 is not intended to perform full SSA optimization, global value numbering, -LICM, general inlining, or full branch layout. Those belong in O2 unless a -local cleanup is required to keep O1 code shape reasonable. +LICM, general inlining, or global block layout. Those belong in O2 unless a +small local cleanup is required to keep O1 code shape reasonable. ## Pipeline @@ -50,28 +54,29 @@ jump_cleanup.layout emit ``` -Important pass boundaries: +Pass boundaries: -- `machinize` applies target-specific ABI/call/register constraints before +- `machinize` applies target-specific ABI, call, and register constraints before liveness and allocation. -- `dead_def_elim` runs before RA using block liveness and operand-reference - events. +- `live_blocks.pre_dde` computes block liveness for pre-RA dead-def + elimination. +- `dead_def_elim` removes dead virtual definitions using block liveness and + operand-reference events. - `regalloc` rebuilds liveness after DDE, constructs compressed live ranges, - assigns locations, and rewrites operands. -- `combine` is deliberately narrow post-RA cleanup: noop physical copy - deletion, safe single-use folds, redundant spill load/store cleanup, and - similar local rewrites. + assigns hard registers or spill slots, and rewrites operands. +- `combine` is narrow post-RA cleanup: noop physical copy deletion, safe + single-use folds, redundant spill load/store cleanup, and similar local + rewrites. - `dce` is conservative after rewrite and removes only trivial dead IR shapes. -- `jump_cleanup` is the cheap O1 branch-shape pass. The CFG stage forwards - branches to branch-only blocks and inverts compare-branches when that removes - an unconditional jump block without doing full block layout. The layout stage - runs immediately before emit and deletes unconditional branches to the - physical fallthrough block. +- `jump_cleanup.cfg` forwards branches to branch-only blocks and performs local + compare-branch inversion when that removes an unconditional jump block. +- `jump_cleanup.layout` runs immediately before emit and deletes unconditional + branches whose target is the physical fallthrough block. ## Liveness And Ranges -O1 liveness is pass-local. Persistent `Block.live_after` and dense value -conflict storage are not part of the normal path. +O1 liveness is pass-local. Persistent per-instruction live-after storage and +dense value conflict matrices are not part of the normal path. `opt_live_blocks` computes block `live_use`, `live_def`, `live_in`, and `live_out` sets using `OptBitset`. `OptBitset` tracks active words, trims @@ -87,6 +92,8 @@ into compressed live ranges: - per-value summaries record live length, use/def frequency, block frequency, call-crossing frequency, and spill cost. +The expected O1 invariant is `opt.conflict_bytes=0`. + ## Allocation The normal O1 allocator is range based. It does not build or consume a @@ -104,13 +111,11 @@ Each value is assigned either: - a target-provided hard register in the value's register class; or - a compatible `FS_SPILL` frame slot. -Non-tied values known to be live across calls strongly prefer callee-saved -registers, but may use caller-saved registers when no better non-conflicting -location is available. Correctness does not depend on avoiding caller-saved -registers globally: rewrite preserves hard-assigned values only at calls whose -target-provided clobber mask actually destroys that register. Tied -hard-register values are checked for availability, clobber constraints, and -live-range conflicts. +Non-tied values known to be live across calls prefer callee-saved registers, but +may use caller-saved registers when no better non-conflicting location is +available. Correctness does not depend on avoiding caller-saved registers +globally: rewrite preserves hard-assigned values only at calls whose +target-provided clobber mask actually destroys that register. Occupancy is stored as sorted per-location interval sets: @@ -118,12 +123,13 @@ Occupancy is stored as sorted per-location interval sets: - one interval vector per allocated stack spill slot. Conflict checks test whether any candidate live range overlaps the location's -intervals. Marking inserts and merges intervals. The old point-index bitmap -row OR/mark path is not part of normal O1. +intervals. Marking inserts and merges intervals. The old point-index bitmap row +OR/mark path is not part of normal O1. -Metrics still report `opt.alloc.hard_loc_words`, `opt.alloc.stack_loc_words`, -and point/mark counters for trend compatibility, but `opt.alloc.hard_word_ors` -and `opt.alloc.stack_word_ors` should remain zero on the interval path. +Metrics still report `opt.alloc.used_loc_words`, `opt.alloc.hard_loc_words`, +`opt.alloc.stack_loc_words`, point visits, and mark counters for trend +compatibility, but `opt.alloc.hard_word_ors` and +`opt.alloc.stack_word_ors` should remain zero on the interval path. ## Rewrite @@ -157,7 +163,7 @@ O1 relies on each target backend to provide: stack arguments, return registers, call clobbers, return masks, outgoing stack size, and variadic metadata before liveness and allocation; - parameter storage binding through `CGTarget.param`, which may return either a - frame slot or a register-backed local storage for simple direct ABI params; + frame slot or register-backed local storage for simple direct ABI params; - optional hard-register reservation before backend `func_end`; - target legality for local folds performed by `opt_combine`. @@ -167,19 +173,359 @@ The current target pools are: - x64: integer `R13/R14/R15/R10`, FP `XMM6-XMM15`; - RV64: integer `s4-s11`, FP `fs4-fs11`. -Backends still own final prologue/epilogue emission and callee-saved register -preservation. O1 calls `reserve_hard_regs` with the hard registers still visible -in replay after cleanup so backend save/restore decisions match the emitted IR. -Direct CG still uses the legacy `call` hook. O1 records call plans for liveness, -allocation, and preservation, but backend call emission still performs the final -argument and return moves through the existing sequential call emitters. +Backends own final prologue/epilogue emission and callee-saved register +preservation. O1 calls `reserve_hard_regs` with hard registers still visible in +replay after cleanup so backend save/restore decisions match emitted IR. + +Targets may provide a known-frame entry path. When `func_begin_known_frame` and +`call_stack_size` are both available, O1 computes frame slots, outgoing call +area size, and dynamic-allocation presence before body replay and passes that +to the backend. AArch64, x64, and RV64 use this path to emit the exact prologue +up front instead of reserving target NOP patch slots. + +## Current Generated-Code Shape + +Fresh May 2026 probes were run with `build/cfree` rebuilt at host `-O2`, on an +arm64 machine, focusing on native AArch64 codegen. The same simple C probes +were also compiled for x64/RV64 by the probe script, but AArch64 is the primary +quality target here. + +Current progress: + +- Immediate scalar arithmetic folds through simple local shadows. For example, + `int x = 40; x += 2; return x;` lowers to immediate return materialization on + AArch64. +- Simple loop locals stay in registers. +- Branch-consuming compares lower to direct compare branches. +- O1 branch cleanup removes branch-to-branch targets and unconditional branches + to physical fallthrough when it can do so locally. +- Address-taken locals still go through memory, as expected for current O1. +- The old prologue NOP patch path is gone; known-frame entry is used where the + target supports it. + +Representative AArch64 `const_local` output: + +```asm +sub sp, sp, #0x10 +stp x29, x30, [sp] +mov x29, sp +mov w0, #0x2a +b .Lepilogue +ldp x29, x30, [sp] +add sp, sp, #0x10 +ret +``` + +Representative AArch64 `scalar_add` output still contains redundant copies: + +```asm +mov w2, w0 +mov w0, w1 +add w2, w2, w0 +mov w0, w2 +mov w0, w0 +b .Lepilogue +``` + +A parameter-heavy 32-argument AArch64 probe shows the main remaining shape +problem: stack arguments are copied into frame slots, many callee-saved +registers are used and therefore saved/restored, and returns still branch to an +immediately following epilogue block. + +## Remaining O1 Codegen Work + +These are code-shape issues, not known correctness issues. + +1. Leaf-frame and epilogue cleanup. + Leaf functions such as `const_local` still get a frame and save/restore + `x29/x30`. O1 should let targets omit frames when the function has no calls, + no dynamic stack, no required frame-address use, and no spills/locals that + require a frame. Return-to-immediate-epilogue branches should also be removed + or emitted as direct fallthrough where legal. + +2. Redundant physical-copy cleanup. + Copies like `mov w0, w0`, return-register self copies, and copy chains around + simple arithmetic are still visible. Extend post-RA combine/DCE to remove + more physical self-copies and to coalesce simple return/argument copies + without requiring a full O2 coalescer. + +3. AArch64 caller/callee register choice for leaf and small functions. + Current AArch64 O1 allocable integer registers are callee-saved + `x19-x28`. This is conservative and correct, but a leaf function that uses + many of them pays save/restore traffic. A targeted O1 improvement is to make + caller-saved temporaries available when doing so does not create call + preservation work, especially in leaf functions. + +4. Stack-parameter handling. + Parameter-heavy functions copy stack arguments into local frame slots before + use. For non-address-taken scalar parameters, O1 should prefer using the + incoming stack location directly or loading it once into an assigned hard + register when profitable, instead of manufacturing extra frame traffic. + +5. Planned/parallel call argument setup. + O1 records call plans before allocation, but final backend call emission is + still largely through existing sequential call emitters. Moving O1 call + emission closer to planned/parallel argument setup should reduce copies, + stack shuffling, and conservative register choices around calls. + +6. Keep branch cleanup local. + O1 should keep only cheap CFG/layout cleanup. Global block ordering and + heavier branch layout belong in O2. + +## Implementation Plan + +The remaining O1 work should land as small vertical slices. Each slice should +start with a focused `test/opt` unit or small smoke case that exposes the bad +shape, then make the narrowest backend/opt change needed for that shape. Avoid +SSA, global block layout, dense conflict data, and whole-function coalescing in +these steps. + +### 1. Leaf Frames And Return Fallthrough + +Goal: make trivial leaf functions emit without an avoidable frame, callee-save +traffic, or a branch to the immediately following epilogue. + +Implementation: + +- Add an O1 function summary in `src/opt/opt.c` before `opt_emit`: has calls, + has `IR_ALLOCA`, has frame-address-required storage, has non-empty frame + slots after known-frame collection, has outgoing stack args, and has used + callee-saved registers after rewrite. +- Extend `CGKnownFrameDesc` or add a sibling target hook bit so + `func_begin_known_frame` can receive "frame may be omitted" without each + backend rediscovering O1 state. The backend remains authoritative because + CFI, platform ABI rules, and frame-pointer policy are target-owned. +- Teach AArch64 first, then x64/RV64, to omit the frame only when the known + frame is empty, the function is leaf, no dynamic stack exists, no outgoing + call area exists, and no callee-saved register needs preservation. +- Keep `reserve_hard_regs` authoritative. If post-RA code still mentions a + callee-saved register, the target must preserve it even for a leaf. +- Remove return-to-next-epilogue branches in the layout stage. Either let + `opt_jump_cleanup(...LAYOUT)` delete an `IR_BR` to the next emitted epilogue + block, or add a target-local direct-return fast path only when the backend's + epilogue is empty. Prefer the IR/layout cleanup first because it is target + neutral. + +Tests: + +- Add `test/opt` mock coverage for the summary: empty leaf, leaf with spill, + leaf with alloca, non-leaf, and leaf using a callee-saved hard reg. +- Add AArch64 smoke/disassembly cases for `return 42;` and one-reg arithmetic: + no `stp x29, x30`, no stack adjustment, no branch to an adjacent epilogue. +- Run `make test-opt test-smoke-x64 test-aa64-inline`; add RV64/x64 targeted + smoke once their frame omission hooks are enabled. + +### 2. Post-RA Physical Copy Cleanup + +Goal: remove obvious copy noise such as `mov w0, w0`, return-register self +copies, and short copy chains around a single arithmetic result without adding +an O2 coalescer. + +Implementation: + +- Extend `src/opt/pass_lower.c:opt_combine` with a post-RA copy-propagation + sweep over hard registers. Reuse the existing `HardBlockLive` and local + single-use machinery; keep it block-local except for already-computed + live-out safety checks. +- Add ABI-value helpers that can rewrite a return or planned-call operand from + a copied hard register to its original source when the source has not been + clobbered and the copied destination is not live out. +- Handle these cases first: + - physical self-copy deletion for all copy-like ops already modeled as + `IR_COPY`; + - `copy retreg, src; ret retreg` to `ret src` when the target return emitter + accepts that operand shape; + - `producer tmp; copy retreg, tmp; ret retreg` by retargeting the producer to + `retreg` when `retarget_producer_legal` already says that is safe; + - adjacent `copy a, b; copy c, a` to `copy c, b` when `a` has no other use. +- Keep target legality explicit. If an instruction requires same-dst operands + or cannot write a return register directly, leave it for O2. + +Tests: + +- Extend `test/opt/opt_test.c` with post-rewrite blocks for each pattern, + including a negative case where the source is clobbered by a call. +- Add a small AArch64 scalar-add probe that checks the final disassembly has no + `mov w0, w0` and fewer register-to-register moves. +- Run `make test-opt test-cg-api test-toy`. + +### 3. Caller-Saved Registers In Leaf And Small Functions + +Goal: avoid save/restore traffic caused by preferring AArch64 callee-saved +registers in leaf and low-pressure functions. + +Implementation: + +- Keep the backend physical register metadata in `src/arch/*/opt_coord.c` as + the source of truth. Do not hard-code register classes in the allocator. +- Split O1 register ordering into a per-function policy in + `src/opt/pass_lower.c`: leaf/no-call values prefer caller-saved temp regs; + values live across calls prefer callee-saved regs only when that avoids + per-call save/restore; tied ABI regs keep their fixed priority. +- For AArch64, make a caller-saved temp subset visible in the O1 allocation + order for leaf functions. Avoid `x16/x17` because they are backend scratch + registers, and continue to respect `opt_reserved_regs`, arg/ret hazards, and + inline-asm fixed/clobber masks. +- For non-leaf functions, keep current correctness first: caller-saved + assignment is allowed only if rewrite will preserve the value across the + specific call clobber mask. +- Add metrics for hard-reg class choice if useful for regression tracking: + caller-saved assignments, callee-saved assignments, and call-preserve inserts. + +Tests: + +- Extend existing allocator preference tests with AArch64-like metadata: + leaf temporaries choose caller-saved temps, call-crossing values avoid them + when a callee-saved reg is available, inline asm fixed regs still win. +- Add AArch64 smoke for a leaf function with many scalar temps: no + callee-save pair stores solely due to O1 temporaries. +- Run `make test-opt test-aa64-inline test-smoke-x64`. + +### 4. Stack-Parameter Handling + +Goal: stop copying non-address-taken scalar stack parameters into new local +frame slots before their first real use. + +Implementation: + +- Add an IR representation for incoming stack parameter storage that is + distinct from ordinary callee frame slots. The target should be able to + describe it as an immutable incoming argument address, e.g. base register plus + ABI stack offset, without allocating `FS_PARAM`. +- Extend `CGLocalStorage` or add O1-only parameter metadata so + `w_param` can record direct incoming-stack storage for scalar, + non-address-taken, non-variadic, direct ABI parameters. +- Teach AArch64 first because it is the current quality target. `aa_param` + should be able to bind a register-backed param by loading from the incoming + stack location, and bind a memory-backed non-address-taken param without + copying into a local frame slot. Preserve the existing copy behavior for + address-taken params, aggregate/byval params, variadic save-area needs, and + debug locations that require a stable local home. +- Update alias metadata so loads from incoming stack params are not mistaken + for writable local frame slots. Treat them as parameter memory that may alias + conservatively with unknown memory unless a target/language fact says more. +- After AArch64 is green, port the same storage description to x64 and RV64. + +Tests: + +- Add O1 unit coverage for `w_param`: scalar stack param with no address taken + does not allocate `FS_PARAM`; address-taken and aggregate params still do. +- Add AArch64 smoke for a 9+ integer-argument function: stack args are loaded + directly or once into assigned regs, with no extra callee local slot traffic. +- Run `make test-opt test-cg-api test-smoke-x64`; add RV64 smoke after the + RV64 port. + +### 5. Planned/Parallel Call Argument Setup + +Goal: make O1 replay use target call plans as the normal call-emission route so +argument setup is parallelized and copy-minimized. + +Implementation: + +- Promote `IRCallAux.plan_valid/use_plan_replay` from a partial path to the + default O1 path after `opt_machinize`. Keep the old descriptor replay as a + fallback for targets without `plan_call` or `emit_call_plan`. +- Add a small parallel-move resolver in `src/opt/opt.c` replay that lowers the + planned register moves, stack stores, return moves, and indirect callee + preservation. It must handle cycles using target scratch registers or a + temporary spill slot; no VLA and no global resolver state. +- Keep call-stack sizing tied to `call_stack_size` and the known-frame path so + outgoing areas are still allocated once per function. +- Use target `call_clobber_mask` and plan return masks to avoid preserving + registers that the concrete call cannot clobber. +- Start with integer direct calls, then FP args/returns, then mixed stack args, + then indirect calls. Leave varargs and aggregate edge cases on descriptor + replay until their planned lowering has explicit tests. + +Tests: + +- Extend mock `plan_call` tests to assert O1 emits planned calls by default and + preserves an indirect callee only when it conflicts with argument moves. +- Add cycle tests for argument permutation, including `a -> b, b -> a`. +- Add cross-target call smoke for direct, indirect, stack-heavy, and mixed + int/FP calls. +- Run `make test-opt test-cg-api test-toy test-smoke-x64`. + +### 6. Local Branch Cleanup Only + +Goal: improve obvious return/branch shape while preserving O1's cheap, +block-local cleanup policy. + +Implementation: + +- Keep `src/opt/pass_jump.c` restricted to branch-target forwarding, + compare-branch inversion, unreachable pruning already owned by CFG, and + physical fallthrough branch deletion. +- Add only local return/epilogue cleanup needed by step 1. Do not introduce + trace layout, global block ordering, or profile-guided branch placement in + O1. +- Document any tempting global cleanup in `doc/OPT.md` as O2 work instead of + expanding O1. + +Tests: + +- Add branch-cleanup tests for adjacent epilogue fallthrough and negative cases + where the epilogue is not adjacent or has target-required body. +- Run `make test-opt`. + +### 7. Stack-Heavy Compile-Time Watch + +Goal: keep stack-heavy `params` and `spill_argc` shapes close to linear before +larger inputs become normal benchmarks. + +Implementation: + +- Instrument the current `used_words` growth source in + `src/opt/pass_live.c` and `src/opt/pass_lower.c`: distinguish hard-register + occupancy words, stack-slot occupancy words, live-range point count, and + parameter/spill slot count. +- Investigate whether stack slots are being modeled with more interval words + than needed, especially after stack-parameter improvements remove some + `FS_PARAM` traffic. +- Keep the invariant that `opt.conflict_bytes=0` on normal O1. Any fix that + reintroduces dense pseudo conflicts is a regression, even if it improves one + microbenchmark. + +Tests: + +- Add or update a phase0 guardrail that records `opt.alloc.used_loc_words` for + representative stack-heavy cases without asserting fragile exact values. +- Refresh `doc/PERF.md` after the stack-parameter and caller-saved-register + slices land. + +### Suggested Order + +1. Physical-copy cleanup. It is isolated, improves current AArch64 samples, and + gives better post-RA liveness coverage for later steps. +2. Leaf return/fallthrough cleanup. This removes visible branch noise and + creates the summary data needed for frame omission. +3. AArch64 caller-saved leaf policy. This directly reduces save/restore traffic + once leaf frames are eligible for omission. +4. Stack-parameter handling. This is higher risk because it changes parameter + storage contracts, so land it per target. +5. Planned call replay. This has the largest call/ABI surface and should build + on the copy and register-choice cleanup. +6. Compile-time stack-heavy investigation and `doc/PERF.md` refresh after the + generated-code changes have reduced avoidable stack traffic. + +## Compile-Time Watchlist + +The current O1 compile-time shape is good for normal C ladders: many-function +and direct-call generated inputs are approximately linear through 1024 +functions with `opt.conflict_bytes=0`. + +The remaining counter-level warning is stack-heavy pressure. In the current +host-`-O2` native run: -Targets may also provide a known-frame entry path for O1. When -`func_begin_known_frame` and `call_stack_size` are both available, O1 computes -all frame slots, outgoing call area size, and dynamic-allocation presence before -body replay and passes that to the backend. AArch64, x64, and RV64 use this -path to emit the exact prologue up front instead of reserving target NOP patch -slots. +```text +shape step opt.o1.total regalloc used_words +spill_argc 512->1024 2.07x 2.23x 3.60x +params 256->512 1.73x 1.80x 3.33x +``` + +Wall time is only mildly super-linear, but the `used_words` growth should be +investigated before much larger stack-heavy inputs are treated as normal O1 +workloads. See `doc/PERF.md` for the full tables. ## Tests And Metrics @@ -206,119 +552,3 @@ make test-smoke-x64 - allocator hard/stack occupancy, interval probes/marks, spills; - rewrite reloads/stores/inserted instructions/live traffic; - link/JIT and compile frontend scopes. - -## Current State - -Fresh May 2026 scaling probes used the same direct-call and function-table -families tracked in `doc/PERF.md`, with three samples per point and p50 times -below. Both normal-path ladders are now essentially linear through 1024 -generated functions. The old dense conflict matrix is absent from the normal -path (`opt.conflict_bytes=0`), and the live-range and allocator buckets double -with the input. - -```text -straight direct-call ladder -N run.total opt.o1.total live_reg regalloc -64 18.155 14.826 3.899 8.056 -128 35.244 29.371 7.727 15.970 -256 70.254 58.873 15.765 32.168 -512 144.874 120.613 30.894 64.842 -1024 283.198 233.653 61.392 126.779 - -function-table ladder -N run.total opt.o1.total live_reg regalloc -64 17.536 14.571 3.822 7.915 -128 34.270 28.855 7.594 15.685 -256 68.305 57.831 15.183 31.413 -512 135.433 114.115 30.025 62.018 -1024 277.842 230.563 60.167 124.895 -``` - -A nonconstant wide-local spill probe also scales linearly in spill/reload and -rewrite counts, though regalloc bends slightly above 2x under heavy spill -pressure: - -```text -N opt.o1 live_reg regalloc spills reloads inserted -128 1.541 0.398 0.918 120 120 240 -256 3.230 0.778 2.048 248 248 496 -512 6.991 1.474 4.924 504 504 1008 -``` - -Current code-shape probes compiled `identity_param`, `scalar_add`, -`while_sum`, `simple_branch`, `direct_call`, `const_local`, and -`local_addr_taken` across x64, AArch64, and RV64 with `-O1`, then disassembled -with `llvm-objdump -dr`. - -Observed progress: - -- O1 is materially smaller than O0 across all three supported native backends - in the older corpus probes. -- The old prologue NOP sleds are gone on O1 for AArch64, x64, and RV64. -- Simple loop locals are kept in registers in the loop probes instead of being - reloaded from stack slots on each iteration. -- Branch-consuming compares lower to direct compare branches. For example, - AArch64 while-loop conditions use `cmp w19, #0xa` plus `b.ge ...` rather - than a `cmp; cset; cmp #0; b.cond` bridge. -- Local branch cleanup now removes O1-only branch artifacts: branch-to-branch - targets are forwarded, safe compare-branches are inverted to remove - fallthrough jump blocks, and final unconditional branches to the physical - fallthrough block are deleted before emit. -- Very local constant simplification handles the direct O1 probes covered by - `doc/CONSTFOLD.md`: immediate arithmetic, compare literals, delayed - arithmetic chains, and straight-line scalar-local shadows. The - `int x = 40; x += 2; return x;` probe now reduces to immediate return - materialization on AArch64: - -```asm -mov w0, #0x2a -b ... -``` - -- Post-rewrite DCE and copy cleanup now model hard-register call arguments, - ABI argument parts, return registers, and caller-saved call clobbers. The - local combine pass also retargets safe single-use arithmetic producers to a - following physical-copy destination, including commutative operand swaps for - x64-style two-operand overlap cases. -- Targets now expose descriptive physical-register metadata and per-call - clobber/return/callee-save masks to O1. `machinize` attaches a target call - plan to each `IR_CALL` before liveness and allocation, and rewrite plus - hard-register liveness use that call-specific clobber mask for preservation. - This closes the register-preservation correctness issue: hard-assigned values - live across calls are saved/restored only when the planned call actually - clobbers their assigned physical register. -- Simple scalar parameters no longer need late mem2reg promotion. `CGTarget.param` - owns ABI entry binding and O1 can keep non-memory-required direct params as - virtual-register-backed locals, replaying the final hard register or spill slot - to the target backend. Frame-backed params remain the path for aggregates, - address-taken values, indirect/byval ABI cases, and other memory-required - shapes. - -Remaining O1 shape issues visible in the current dumps: - -- The register-preservation correctness issue is closed, but O1 still - saves/restores more callee-saved registers than ideal in some small functions - under register pressure or values live across calls. The old unconditional - scratch-register saves have been removed, and preservation is now - call-specific. Remaining excess traffic is code-shape work: conservative - register exposure, missing call-argument parallel copies in emitted code, and - limited coalescing/argument-copy cleanup. -- Direct-call tiny functions are still heavy at O1. The x64 `callee(x) + 2` - probe emitted 167 bytes and 47 instructions across two small functions, - mostly frame setup, callee-save traffic, copies, and branch-to-epilogue - artifacts. O1 does not need general inlining, but call-side frame/save - discipline remains expensive. - -## Todo - -MIR's O1 path suggests these high-value local cleanups that still fit cfree's -fast tier: - -1. Continue reducing callee-save traffic as a code-shape issue. - The preservation correctness piece is done: O1 reserves/preserves only - replay-visible hard registers after final cleanup and uses per-call clobber - masks for live-across-call saves/restores. Remaining work is mostly - coalescing/argument-copy quality, pressure-sensitive choices, switching O1 - call emission to planned/parallel argument setup, and then safely broadening - ABI argument/return registers in the allocable pools. The remaining - structural work is tracked in `doc/OPT_REGS_CALL_PLAN.md`. diff --git a/doc/PERF.md b/doc/PERF.md @@ -1,906 +1,224 @@ -# Cfree Run/JIT Performance Notes +# Cfree O1 Performance Notes -This file records early measurements for `cfree run --time -O1` and the -current hypotheses for reducing time from submit to first execution. +This file records the current `cfree run --time -O1` shape and the remaining +compile-time scaling risks. It is intentionally focused on the current O1 path; +older allocator phase history is preserved in git and in `doc/MIR_RA_REPORT.md`. -The metrics are emitted through `CfreeEnv.metrics`; `cfree run --time` installs -a hosted stderr logger in the driver. Core/libcfree remain callback-only and do -not write files or depend on hosted I/O. +## Measurement Setup -## Current Metric Shape +Fresh snapshot: May 17, 2026. + +- Host: `arm64` macOS. +- Compiler under test: `build/cfree`, rebuilt as a Mach-O arm64 executable with + the compiler itself built by clang at host `-O2`. +- Timing path: native `build/cfree run --time -O1`, which exercises the native + AArch64 JIT/codegen path on this machine. +- Samples: three runs per point; tables use p50 milliseconds. +- Raw local outputs from the run: + - `build/o1-probes/scaling.tsv` + - `build/o1-probes/spill_argc.tsv` + - `build/o1-current/params-hostO2.tsv` + - `build/o1-current/toy-hostO2.tsv` + +The build log should show `clang -O2 ...` for driver, lib, C frontend, and Toy +frontend objects before these timings are compared to future runs. + +## Metric Shape Top-level `cfree run --time` scopes: - `run.total`: full driver path after option/input loading through entry return. - `run.compile_and_jit`: compile all inputs and build the JIT image. -- `compile.tu`: one translation unit to an `ObjBuilder`. +- `compile.tu`: one translation unit to an object builder. - `compile.frontend`: registered frontend body. -- `compile.c.*`: C frontend setup, PP option application, parse/codegen, cleanup. -- `opt.o1.total`: per-function O1 pipeline, nested under C parse/codegen. -- `opt.jump_cleanup`: cheap O1 CFG/layout branch cleanup after post-rewrite - cleanup, nested under `opt.o1.total`. +- `compile.c.*`: C frontend setup, PP option application, parse/codegen, + cleanup. +- `opt.o1.total`: per-function O1 pipeline, nested under frontend codegen. +- `opt.jump_cleanup`: cheap O1 CFG/layout branch cleanup. - `link_jit.all`: linker resolve plus JIT image materialization. -- `link.resolve.*`: archive ingestion, symbol resolution, GC, layout, reloc emit. -- `jit.*`: reservation, segment copy, relocation patching, protection, icache flush, - constructors. - `run.jit_lookup` and `run.entry_call`: symbol lookup and execution. -Important counters: +Important O1 counters: -- `compile.input_bytes` -- `compile.obj_sections`, `compile.obj_relocs` - `opt.funcs`, `opt.blocks`, `opt.insts`, `opt.vals` -- `opt.live_words`, `opt.live.blocks`, `opt.live.active_words` -- `opt.live.block_bytes`, `opt.live.set_bit_scans` -- `opt.live.bitset_words_touched`, `opt.live.dataflow_iterations`, - `opt.live.dataflow_block_visits` -- `opt.ranges`, `opt.range_points`, `opt.range_raw_points` -- `opt.range.point_visits`, `opt.range.value_scans`, +- `opt.live.block_bytes`, `opt.live.active_words`, + `opt.live.bitset_words_touched` +- `opt.ranges`, `opt.range_points`, `opt.range_raw_points`, + `opt.range.point_visits`, `opt.range.value_scans`, `opt.range.live_words_touched` -- `opt.alloc.used_loc_words`, `opt.alloc.spills` -- `opt.alloc.hard_loc_words`, `opt.alloc.stack_loc_words`, - `opt.alloc.stack_slots` +- `opt.alloc.used_loc_words`, `opt.alloc.hard_loc_words`, + `opt.alloc.stack_loc_words`, `opt.alloc.stack_slots` - `opt.alloc.hard_point_visits`, `opt.alloc.stack_point_visits`, - `opt.alloc.hard_word_ors`, `opt.alloc.stack_word_ors`, `opt.alloc.hard_mark_points`, `opt.alloc.stack_mark_points` -- `opt.dde.live_words_touched` +- `opt.alloc.spills` - `opt.rewrite.reloads`, `opt.rewrite.stores`, `opt.rewrite.inserted_insts`, `opt.rewrite.live_words_touched` - `opt.conflict_bytes` -- `link.inputs`, `link.sections`, `link.segments`, `link.syms`, `link.relocs` -- `jit.master_size`, `jit.nsegments`, `jit.input_section_bytes`, - `jit.segment_bytes` - -## Small Programs - -Measured with 20 samples each on stdin, `build/cfree run --time -O1 -`. -Times are `min / p50 / p95 / mean / max` in milliseconds. - -`int main(){return 0;}` - -```text -run.total 0.290 / 0.311 / 0.345 / 0.316 / 0.351 -run.compile_and_jit 0.274 / 0.293 / 0.313 / 0.296 / 0.329 -compile.tu 0.218 / 0.235 / 0.246 / 0.235 / 0.257 -compile.c.parse_codegen - 0.082 / 0.091 / 0.097 / 0.091 / 0.107 -opt.o1.total 0.048 / 0.056 / 0.061 / 0.055 / 0.065 -opt.live_blocks.pre_dde - 0.013 / 0.017 / 0.023 / 0.017 / 0.024 -opt.live_ranges.regalloc - 0.003 / 0.004 / 0.004 / 0.004 / 0.005 -opt.regalloc 0.009 / 0.010 / 0.012 / 0.010 / 0.015 -link_jit.all 0.047 / 0.050 / 0.056 / 0.051 / 0.061 -link.resolve.total 0.023 / 0.025 / 0.028 / 0.025 / 0.029 -``` - -One small loop: - -```c -int main(){int s=0; for(int i=0;i<10;i++) s+=i; return s==45?0:1;} -``` - -```text -run.total 0.369 / 0.400 / 0.446 / 0.401 / 0.448 -run.compile_and_jit 0.351 / 0.381 / 0.414 / 0.381 / 0.427 -compile.tu 0.293 / 0.314 / 0.349 / 0.315 / 0.353 -compile.c.parse_codegen - 0.151 / 0.165 / 0.196 / 0.167 / 0.212 -opt.o1.total 0.090 / 0.097 / 0.107 / 0.098 / 0.122 -opt.live_blocks.pre_dde - 0.024 / 0.026 / 0.029 / 0.027 / 0.032 -opt.live_ranges.regalloc - 0.012 / 0.013 / 0.015 / 0.013 / 0.015 -opt.regalloc 0.025 / 0.027 / 0.030 / 0.028 / 0.042 -link_jit.all 0.048 / 0.052 / 0.066 / 0.055 / 0.088 -link.resolve.total 0.023 / 0.026 / 0.034 / 0.026 / 0.035 -``` - -For tiny programs, compile dominates, with O1 and JIT/link image setup as the -main visible sub-buckets. Link/JIT is already sub-millisecond. - -## Scaling Programs - -Two synthetic ladders were used: - -- `straight_main`: N loop-heavy static functions, and `main` directly calls - every function. This creates one large straight-line function in addition to - N small loop functions. -- `table_main`: the same N loop-heavy functions, but `main` iterates through a - function-pointer table. This keeps `main` compact and isolates the cost of - many functions from the cost of one large function. - -Samples: 7 or 5 runs through 256 functions, 3 runs at 512 and 1024 functions. -The table below uses p50 timings. O1 scopes are summed across functions. - -### Straight-Line Main - -```text -funcs insts vals conflict_bytes run.total compile.tu opt.o1.total live_pre live_reg regalloc link_jit -1 74 50 800 0.845 0.641 0.381 0.087 0.084 0.112 0.135 -4 278 188 3008 2.027 1.797 1.279 0.326 0.324 0.421 0.173 -16 1094 740 11840 5.644 5.460 4.088 1.216 1.203 1.528 0.119 -64 4358 2948 56576 22.089 21.817 16.926 5.406 5.365 6.683 0.149 -128 8710 5892 131520 48.249 47.906 38.203 12.782 12.696 15.655 0.170 -256 17414 11780 336704 111.786 111.161 92.228 33.042 32.883 40.054 0.224 -512 34822 23556 968256 289.752 288.507 249.098 95.864 96.069 115.167 0.309 -1024 69638 47108 3116096 877.053 874.755 793.174 325.557 321.390 379.543 0.513 -``` - -The 512 to 1024 step is the clearest warning: - -```text -compile.tu 288.507 -> 874.755 ms 3.03x -opt.o1.total 249.098 -> 793.174 ms 3.18x -opt.live_info.pre_dde 95.864 -> 325.557 ms 3.40x -opt.live_info.regalloc 96.069 -> 321.390 ms 3.35x -opt.regalloc 115.167 -> 379.543 ms 3.30x -``` - -The associated `opt.conflict_bytes` grows from `968256` to `3116096`, about -3.2x for a 2x input. That points directly at dense liveness/conflict data as -the superlinear path. - -### Function-Table Main - -```text -funcs insts vals conflict_bytes run.total compile.tu opt.o1.total live_pre live_reg regalloc link_jit -1 94 60 960 0.895 0.692 0.404 0.095 0.092 0.125 0.141 -4 286 189 3024 1.902 1.692 1.178 0.314 0.309 0.401 0.148 -16 1054 705 11280 5.506 5.315 3.940 1.146 1.126 1.433 0.120 -64 4126 2769 44304 19.621 19.353 14.632 4.411 4.394 5.538 0.152 -128 8222 5521 88336 38.127 37.795 28.668 8.718 8.699 10.934 0.169 -256 16414 11025 176400 75.585 75.103 57.207 17.368 17.411 21.901 0.215 -512 32798 22033 352528 151.642 150.752 114.097 34.786 34.678 43.643 0.289 -1024 65566 44049 704784 303.295 301.632 226.238 68.497 68.306 86.028 0.450 -``` - -Here the 512 to 1024 step is essentially linear: - -```text -compile.tu 150.752 -> 301.632 ms 2.00x -opt.o1.total 114.097 -> 226.238 ms 1.98x -opt.live_info.pre_dde 34.786 -> 68.497 ms 1.97x -opt.live_info.regalloc 34.678 -> 68.306 ms 1.97x -opt.regalloc 43.643 -> 86.028 ms 1.97x -``` -This indicates that "many functions" is not the current scaling problem. The -problem is a single large function whose liveness/conflict structures become -large enough to bend past linear. +`opt.conflict_bytes` is expected to stay zero on the normal O1 path. The old +dense pseudo-register conflict matrix is no longer part of O1 allocation. -### Phase-1 Pass-Local Liveness Rerun +## Normal Function-Count Ladders -After `doc/MIR_RA_REPORT.md` phase 1, pre-DDE uses pass-local block liveness -instead of full `opt_live_info`. Regalloc still uses the old dense liveness and -conflict path. The synthetic source generator used here is the same family as -the tables above, but these are fresh generated inputs rather than byte-for-byte -identical files, so compare trends and bucket shapes more than individual rows. +Two synthetic C ladders are used for broad scaling: -Samples: 5 runs per point. The table uses p50 milliseconds. O1 scopes and -O1 counters are summed across functions. `pre_block_bytes` is -`opt.live.block_bytes` from `opt.live_blocks.pre_dde`; `reg_conflict_bytes` is -the remaining `opt.conflict_bytes` from `opt.live_info.regalloc`. +- `straight`: N loop-heavy static functions, with `main` directly calling every + function. This creates N small functions plus one larger straight-line caller. +- `table`: the same N functions, but `main` calls through a function-pointer + table. This keeps the caller compact and isolates "many functions" from "one + larger function". -Straight-line main: +O1 scopes are summed across all functions. ```text -funcs input_bytes insts vals pre_block_bytes reg_conflict_bytes run.total compile.tu parse_codegen opt.o1.total live_blocks_pre live_reg regalloc link_jit -1 372 91 72 320 576 0.737 0.627 0.465 0.297 0.029 0.128 0.171 0.062 -4 1327 322 270 608 2160 1.538 1.421 1.277 0.837 0.062 0.442 0.578 0.058 -16 5236 1246 1062 1984 9056 4.861 4.756 4.610 3.085 0.174 1.787 2.285 0.056 -64 21028 4942 4230 7264 42224 19.542 19.386 19.249 13.377 0.665 8.199 10.403 0.073 -128 42420 9870 8454 14304 100784 42.077 41.830 41.693 30.284 1.314 19.476 24.434 0.107 -256 85940 19726 16902 28384 267056 98.532 97.987 97.845 74.982 2.603 51.347 63.176 0.157 -512 172980 39438 33798 56544 796208 254.219 252.884 252.723 206.802 5.174 152.154 183.374 0.269 -1024 347352 78862 67590 112864 2640944 742.562 740.519 740.362 646.586 10.425 502.879 599.496 0.436 -``` +straight +N compile.tu opt.o1.total live_pre live_reg regalloc insts vals ranges spills used_words conflict +64 17.442 15.776 2.495 3.941 8.740 1856 833 1789 0 3198 0 +128 31.210 28.372 4.453 7.191 15.631 3712 1665 3581 0 6398 0 +256 60.026 54.522 8.544 13.764 30.201 7424 3329 7165 0 12798 0 +512 119.735 108.974 16.987 27.253 61.085 14848 6657 14333 0 25598 0 +1024 234.857 210.831 32.958 52.953 115.918 29696 13313 28669 0 51198 0 -Function-table main: - -```text -funcs input_bytes insts vals pre_block_bytes reg_conflict_bytes run.total compile.tu parse_codegen opt.o1.total live_blocks_pre live_reg regalloc link_jit -1 461 112 83 576 664 0.694 0.602 0.465 0.285 0.023 0.129 0.173 0.058 -4 1392 328 269 864 2152 1.513 1.386 1.251 0.805 0.053 0.434 0.565 0.058 -16 5200 1192 1013 2016 8104 4.801 4.687 4.550 3.030 0.194 1.697 2.189 0.060 -64 20560 4648 3989 6624 31912 17.431 17.261 17.111 11.452 0.703 6.560 8.487 0.079 -128 41349 9256 7957 12768 63656 34.170 33.918 33.777 22.640 1.289 13.077 16.946 0.114 -256 83589 18472 15893 25056 127144 67.625 67.144 67.004 45.065 2.567 26.043 33.724 0.156 -512 168069 36904 31765 49632 254120 134.804 134.093 133.941 89.938 5.150 51.976 67.348 0.250 -1024 337298 73768 63509 98784 508072 272.397 271.171 271.026 179.323 10.224 103.779 134.391 0.402 +table +N compile.tu opt.o1.total live_pre live_reg regalloc insts vals ranges spills used_words conflict +64 14.246 12.916 2.046 3.269 7.113 1620 713 1555 0 2720 0 +128 28.560 26.036 4.128 6.591 14.368 3220 1417 3091 0 5408 0 +256 57.608 52.647 8.371 13.331 28.973 6420 2825 6163 0 10784 0 +512 114.689 104.540 16.557 26.401 57.560 12820 5641 12307 0 21536 0 +1024 232.619 209.061 33.476 52.789 115.426 25620 11273 24595 0 43040 0 ``` -The 512 to 1024 step now separates the fixed part from the remaining problem: +512 to 1024 ratios: ```text -straight_main: -compile.tu 252.884 -> 740.519 ms 2.93x -opt.o1.total 206.802 -> 646.586 ms 3.13x -opt.live_blocks.pre_dde 5.174 -> 10.425 ms 2.01x -opt.live_info.regalloc 152.154 -> 502.879 ms 3.31x -opt.regalloc 183.374 -> 599.496 ms 3.27x -pre_block_bytes 56544 -> 112864 2.00x -reg_conflict_bytes 796208 -> 2640944 3.32x - -table_main: -compile.tu 134.093 -> 271.171 ms 2.02x -opt.o1.total 89.938 -> 179.323 ms 1.99x -opt.live_blocks.pre_dde 5.150 -> 10.224 ms 1.99x -opt.live_info.regalloc 51.976 -> 103.779 ms 2.00x -opt.regalloc 67.348 -> 134.391 ms 2.00x -pre_block_bytes 49632 -> 98784 1.99x -reg_conflict_bytes 254120 -> 508072 2.00x +shape compile.tu opt.o1.total live_pre live_reg regalloc used_words +straight 1.96x 1.93x 1.94x 1.94x 1.90x 2.00x +table 2.03x 2.00x 2.02x 2.00x 2.01x 2.00x ``` -The phase-1 result is the intended partial fix: pre-DDE no longer allocates a -conflict matrix and its block-live storage scales linearly on the large -straight-line case. The remaining superlinear behavior is now concentrated in -the regalloc-side full liveness/conflict path. - -### Phase-3 Range-Allocator Rerun - -After `doc/MIR_RA_REPORT.md` phase 3, O1 assignment no longer uses the dense -pseudo conflict matrix. Regalloc rebuilds pass-local block liveness and live -ranges, assigns against `used_locs[point]`, and reports -`opt.conflict_bytes=0` on the normal allocation path. Rewrite still uses -per-instruction `live_after`, so this is not the final phase-4 shape. - -Samples: 5 runs per point. The table uses p50 milliseconds. O1 scopes and O1 -counters are summed across functions. The generator is the same direct-call vs -function-table family as above, but these are fresh generated inputs and not -byte-for-byte identical to the earlier tables. - -Straight-line main: +This is the desired current shape. Many functions and the normal direct-call +large-caller case are essentially linear through 1024 generated functions. -```text -funcs input_bytes insts vals pre_block_bytes ranges range_points used_loc_words conflict_bytes run.total compile.tu parse_codegen opt.o1.total live_blocks_pre live_ranges_reg regalloc link_jit -1 139 55 32 576 28 41 82 0 0.551 0.444 0.296 0.184 0.048 0.025 0.051 0.057 -4 430 178 110 1632 100 143 286 0 0.887 0.784 0.642 0.418 0.114 0.069 0.132 0.055 -16 1624 670 422 6080 388 551 1189 0 2.433 2.306 2.147 1.414 0.391 0.254 0.480 0.069 -64 6520 2638 1670 23648 1540 2183 5674 0 8.195 8.042 7.894 5.236 1.435 0.992 1.894 0.084 -128 13188 5262 3334 47072 3076 4359 13894 0 15.634 15.399 15.250 10.068 2.657 1.936 3.893 0.102 -256 26884 10510 6662 93920 6148 8711 38014 0 31.114 30.795 30.648 20.296 5.097 3.876 8.415 0.139 -512 54276 21006 13318 187616 12292 17415 116974 0 66.957 66.408 66.265 44.891 10.516 8.204 20.599 0.219 -1024 109180 41998 26630 375008 24580 34823 397774 0 157.795 156.418 156.251 109.919 22.778 18.122 57.799 0.374 -``` +## Spill-Pressure Ladder -Function-table main: +The old `spill` row in `run_scaling.sh` initializes many locals from constants; +current O1 folds most of that away, so it is useful as a frontend/source-size +sanity check but not as a pressure test. The useful pressure ladder is +`spill_argc`, which initializes each local from `argc` so values remain live. ```text -funcs input_bytes insts vals pre_block_bytes ranges range_points used_loc_words conflict_bytes run.total compile.tu parse_codegen opt.o1.total live_blocks_pre live_ranges_reg regalloc link_jit -1 212 76 43 832 39 56 112 0 0.594 0.486 0.335 0.194 0.051 0.031 0.060 0.061 -4 482 184 109 1888 99 143 286 0 0.944 0.833 0.685 0.437 0.118 0.072 0.140 0.061 -16 1587 616 373 6112 339 491 982 0 2.266 2.145 1.993 1.316 0.375 0.230 0.432 0.064 -64 6099 2344 1429 23008 1299 1883 3766 0 7.565 7.414 7.259 4.806 1.347 0.861 1.605 0.077 -128 12228 4648 2837 45536 2579 3739 7478 0 14.055 13.851 13.703 8.959 2.491 1.659 3.058 0.098 -256 24772 9256 5653 90592 5139 7451 14902 0 27.074 26.765 26.620 17.091 4.677 3.237 5.942 0.141 -512 49860 18472 11285 180704 10259 14875 29750 0 54.364 53.839 53.682 33.996 9.330 6.422 11.850 0.225 -1024 100133 36904 22549 360928 20499 29723 59446 0 114.081 112.879 112.706 68.967 18.959 12.909 23.842 0.367 +spill_argc +N compile.tu opt.o1.total live_pre live_reg regalloc insts vals ranges spills reloads inserted used_words conflict +128 1.388 0.769 0.045 0.169 0.474 770 641 767 108 108 216 3076 0 +256 2.468 1.433 0.058 0.291 0.929 1538 1281 1535 236 236 472 9222 0 +512 4.658 2.750 0.072 0.526 1.825 3074 2561 3071 492 492 984 30730 0 +1024 9.894 5.681 0.110 1.058 4.073 6146 5121 6143 1004 1004 2008 110610 0 ``` -The 512 to 1024 step after phase 3: +512 to 1024 ratios: ```text -straight_main: -compile.tu 66.408 -> 156.418 ms 2.36x -opt.o1.total 44.891 -> 109.919 ms 2.45x -opt.live_blocks.pre_dde 10.516 -> 22.778 ms 2.17x -opt.live_ranges.regalloc 8.204 -> 18.122 ms 2.21x -opt.regalloc 20.599 -> 57.799 ms 2.81x -pre_block_bytes 187616 -> 375008 2.00x -used_loc_words 116974 -> 397774 3.40x -conflict_bytes 0 -> 0 - -table_main: -compile.tu 53.839 -> 112.879 ms 2.10x -opt.o1.total 33.996 -> 68.967 ms 2.03x -opt.live_blocks.pre_dde 9.330 -> 18.959 ms 2.03x -opt.live_ranges.regalloc 6.422 -> 12.909 ms 2.01x -opt.regalloc 11.850 -> 23.842 ms 2.01x -pre_block_bytes 180704 -> 360928 2.00x -used_loc_words 29750 -> 59446 2.00x -conflict_bytes 0 -> 0 +compile.tu 2.12x +opt.o1.total 2.07x +live_pre 1.53x +live_reg 2.01x +regalloc 2.23x +spills 2.04x +used_words 3.60x ``` -This is the intended phase-3 result: the old conflict matrix is gone from O1, -and the large straight-line case is much faster than the phase-1 rerun. The -remaining non-linear straight-line bucket now tracks `used_loc_words`, because -the first range allocator uses dense occupied-location bitmaps whose stack-slot -portion is sized by candidate count. Table-main stays close to linear because -each function remains small. +This is the main remaining mild super-linear signal. Wall time is only slightly +above 2x, but `used_words` grows 3.60x. That points at occupancy/assignment +working-set growth under stack pressure, even though the old conflict matrix is +gone. The next investigation should look at why per-location interval occupancy +and candidate checking expand faster than spills/ranges for this generated +shape. -## Compile Perf +## Parameter-Heavy Ladder -`compile.tu` is now split into generic and C-specific scopes: +This C ladder uses one function with N integer parameters and a caller that +passes N constants. It stresses ABI lowering, stack arguments, and callee-side +parameter use. ```text -compile.tu - compile.input_bytes - compile.frontend - compile.c.setup - compile.c.pool_new - compile.c.lex_open - compile.c.pp_new - compile.c.cg_new - compile.c.decl_new - compile.c.pp_options - compile.c.pp_push_input - compile.c.parse_codegen - opt.o1.total ... - compile.c.cleanup - compile.obj_finalize - compile.obj_sections - compile.obj_relocs +params +N compile.tu opt.o1.total live_pre live_reg regalloc insts vals ranges spills used_words conflict +8 0.816 0.577 0.091 0.143 0.305 37 26 32 0 72 0 +16 0.850 0.595 0.090 0.150 0.320 69 50 64 0 136 0 +32 0.932 0.644 0.093 0.162 0.357 133 98 128 12 392 0 +64 1.053 0.698 0.093 0.175 0.396 261 194 256 44 776 0 +128 1.367 0.866 0.096 0.211 0.510 517 386 512 108 2056 0 +256 1.979 1.209 0.099 0.281 0.740 1029 770 1024 236 6152 0 +512 3.484 2.088 0.110 0.434 1.334 2053 1538 2048 492 20488 0 ``` -Updated phase-3 1024-function p50 examples with the detailed compile scopes: +256 to 512 ratios: ```text -metric straight_main table_main -run.total 157.795 ms 114.081 ms -compile.tu 156.418 ms 112.879 ms -compile.frontend 156.411 ms 112.868 ms -compile.c.setup 0.128 ms 0.130 ms -compile.c.pp_options 0.003 ms 0.004 ms -compile.c.parse_codegen 156.251 ms 112.706 ms -compile.c.cleanup 0.020 ms 0.015 ms -compile.obj_finalize 0.000 ms 0.000 ms -opt.o1.total summed 109.919 ms 68.967 ms +compile.tu 1.76x +opt.o1.total 1.73x +live_pre 1.11x +live_reg 1.54x +regalloc 1.80x +spills 2.08x +used_words 3.33x ``` -Subtracting summed O1 from `compile.tu` leaves about `44-46 ms` in both -variants: - -```text -straight_main: 156.418 - 109.919 = 46.499 ms -table_main: 112.879 - 68.967 = 43.912 ms -``` +The timing buckets are sub-linear to mildly linear here because the fixed +frontend/O1 setup costs are still visible. The counter to watch is again +`used_words`: it grows faster than spills and ranges when many parameters force +stack traffic. This is not yet a wall-time bottleneck, but it is the clearest +counter-level sign that parameter/stack occupancy needs attention before much +larger parameter-heavy inputs are treated as normal workloads. -That residual is the current frontend/PP/parser/CG API cost for roughly -100 KiB of generated C in these phase-3 inputs. It appears linear with input -size and number of small functions in these tests. The setup, PP option -application, cleanup, and object finalization scopes are negligible. The opaque -part that remains is inside `compile.c.parse_codegen`, which streams PP tokens, -parses declarations and statements, performs type work, drives the CG API, and -triggers per-function O1 when function bodies end. +## Toy Samples -### Compile Scope Scaling - -After adding the detailed `compile.tu` scopes, the scale ladders were rerun. -The table below uses p50 milliseconds. `opt.o1.total` is summed across -functions, and `non_o1_residual` is `compile.tu - opt.o1.total`. - -Straight-line main: +The Toy frontend uses the same CG/O1/backend machinery, so these are useful +small API-shape probes. O1 scopes are summed across functions. ```text -funcs input_bytes compile.tu parse_codegen opt.o1.total non_o1_residual -1 139 0.444 0.296 0.184 0.260 -4 430 0.784 0.642 0.418 0.366 -16 1624 2.306 2.147 1.414 0.892 -64 6520 8.042 7.894 5.236 2.806 -128 13188 15.399 15.250 10.068 5.331 -256 26884 30.795 30.648 20.296 10.499 -512 54276 66.408 66.265 44.891 21.517 -1024 109180 156.418 156.251 109.919 46.499 +case compile.tu opt.o1.total live_pre live_reg regalloc insts vals ranges spills used_words conflict +06_while_sum 0.436 0.319 0.049 0.075 0.163 15 7 13 0 26 0 +09_function_params 0.645 0.569 0.091 0.142 0.302 10 8 8 0 22 0 +111_many_function_params 0.682 0.591 0.092 0.147 0.316 38 22 36 0 78 0 +123_spec_demo 4.709 4.229 0.659 1.016 2.270 788 359 590 0 1168 0 ``` -Function-table main: +The Toy samples do not show a compile-time scaling problem at this size. The +larger spec demo is dominated by O1/regalloc, but counters are small and there +is no conflict-matrix traffic. -```text -funcs input_bytes compile.tu parse_codegen opt.o1.total non_o1_residual -1 212 0.486 0.335 0.194 0.292 -4 482 0.833 0.685 0.437 0.396 -16 1587 2.145 1.993 1.316 0.829 -64 6099 7.414 7.259 4.806 2.608 -128 12228 13.851 13.703 8.959 4.892 -256 24772 26.765 26.620 17.091 9.674 -512 49860 53.839 53.682 33.996 19.843 -1024 100133 112.879 112.706 68.967 43.912 -``` +## Current Interpretation -The detailed compile scopes show: - -- `compile.frontend` is effectively all of `compile.tu`; object finalization is - below measurable resolution in these runs. -- `compile.c.parse_codegen` is effectively all of `compile.frontend`. -- `compile.c.setup` stays flat around `0.14 ms`; most of that is `pp_new` - registering predefined macros. -- `compile.c.pp_options` stays around `0.02 ms` with no `-I`, `-D`, or `-U` - options. -- `compile.c.cleanup` grows only to about `0.01-0.02 ms`. -- `non_o1_residual` is near-linear with source size at larger sizes, around - `430-460 us/KiB` from 64 through 1024 functions in the phase-3 inputs. - -So the compile-side scaling story is two-layered: the broad C parse/codegen -bucket scales linearly with input size after startup overhead, while the nested -O1 liveness/regalloc work can dominate that bucket and become superlinear for a -single large function. - -Next useful compile instrumentation: - -- Count tokens consumed by `pp_next`. -- Count top-level declarations and function definitions in `parse_c`. -- Split parse/codegen around function bodies, global declarations, and data - emission. -- Add CG API counters for emitted ops, stack/value operations, local slots, and - object writes. -- Add pool/arena allocation counters or high-water marks for frontend and CG - arenas. - -### Phase-4 Rewrite Rerun - -After `doc/MIR_RA_REPORT.md` phase 4, O1 rewrite no longer materializes -`Block.live_after`. Inline-asm clobber constraints and call-clobber -save/restore insertion now use rolling backward live sets. The old full -per-instruction live-after storage is still available through `opt_live_info` -for compatibility tests, but the normal O1 `opt_regalloc` path no longer -requires it. - -Samples: 5 runs per point. The table uses p50 milliseconds. O1 scopes and O1 -counters are summed across functions. The generator is the same direct-call vs -function-table family used for the phase-3 rerun, but these are fresh generated -inputs and not byte-for-byte identical to the earlier tables. - -Straight-line main: +The current O1 compile-time story is good enough to shift attention toward +generated-code quality: -```text -funcs input_bytes insts vals pre_block_bytes ranges range_points used_loc_words conflict_bytes run.total compile.tu parse_codegen opt.o1.total live_blocks_pre live_ranges_reg regalloc link_jit -1 218 61 38 576 37 52 104 0 0.584 0.475 0.319 0.208 0.059 0.033 0.067 0.062 -4 731 226 140 2016 142 196 392 0 1.050 0.943 0.800 0.530 0.146 0.090 0.193 0.058 -16 2810 886 548 7872 562 772 1776 0 3.044 2.912 2.754 1.855 0.507 0.345 0.726 0.065 -64 11191 3526 2180 31392 2242 3076 8864 0 10.963 10.756 10.599 7.140 1.924 1.375 2.906 0.089 -128 22450 7046 4356 62688 4482 6148 23096 0 21.189 20.676 20.524 13.718 3.682 2.690 5.706 0.110 -256 45191 14086 8708 125280 8962 12292 67688 0 41.840 41.173 41.019 27.476 7.238 5.464 11.777 0.157 -512 90668 28166 17412 250464 17922 24580 221384 0 87.461 86.405 86.252 58.701 14.896 11.760 26.346 0.243 -1024 181691 56326 34820 500832 35842 49156 786824 0 190.569 188.638 188.479 129.835 32.046 26.404 61.906 0.425 -``` - -Function-table main: +- Normal many-function and large-caller C inputs are approximately linear. +- The old dense liveness/conflict path is gone from normal O1 + (`opt.conflict_bytes=0`). +- Link/JIT is not the bottleneck in these runs; the visible time is frontend + parse/codegen plus O1. +- Spill-heavy and parameter-heavy shapes still have counter-level + super-linear growth in `used_words`. Wall time is only mildly affected today, + but future work should verify interval occupancy and candidate checking do not + drift back into point/candidate-product behavior. -```text -funcs input_bytes insts vals pre_block_bytes ranges range_points used_loc_words conflict_bytes run.total compile.tu parse_codegen opt.o1.total live_blocks_pre live_ranges_reg regalloc link_jit -1 311 82 49 832 48 67 134 0 0.665 0.545 0.381 0.230 0.064 0.036 0.077 0.066 -4 788 226 133 2272 135 190 380 0 1.122 0.989 0.839 0.550 0.149 0.094 0.203 0.068 -16 2718 802 469 8032 483 682 1364 0 2.886 2.744 2.592 1.737 0.481 0.316 0.661 0.076 -64 10475 3106 1813 31072 1875 2650 5300 0 10.054 9.828 9.670 6.569 1.827 1.207 2.491 0.097 -128 20875 6178 3605 61792 3731 5274 10548 0 18.978 18.667 18.515 12.271 3.328 2.275 4.800 0.117 -256 41824 12322 7189 123232 7443 10522 21044 0 37.247 36.757 36.593 24.043 6.552 4.505 9.416 0.158 -512 83717 24610 14357 246112 14867 21018 42036 0 74.068 73.190 73.035 47.519 12.995 8.945 18.699 0.236 -1024 167549 49186 28693 491872 29715 42010 84020 0 153.281 151.622 151.471 96.303 26.390 17.895 37.547 0.382 -``` - -To isolate the phase-4 change, the same generated inputs were also run against -a temporary clean worktree at the phase-3 baseline (`HEAD` before the phase-4 -working-tree edits). The table below shows phase-3 p50 to phase-4 p50 on those -identical inputs. - -```text -straight_main: -funcs run.total opt.o1.total live_ranges_reg regalloc -512 88.544 -> 87.461 (-1.2%) 60.910 -> 58.701 (-3.6%) 12.268 -> 11.760 (-4.1%) 29.589 -> 26.346 (-11.0%) -1024 206.333 -> 190.569 (-7.6%) 147.862 -> 129.835 (-12.2%) 27.774 -> 26.404 (-4.9%) 82.750 -> 61.906 (-25.2%) - -table_main: -funcs run.total opt.o1.total live_ranges_reg regalloc -512 69.043 -> 74.068 (+7.3%) 43.713 -> 47.519 (+8.7%) 9.240 -> 8.945 (-3.2%) 15.729 -> 18.699 (+18.9%) -1024 143.333 -> 153.281 (+6.9%) 88.574 -> 96.303 (+8.7%) 18.549 -> 17.895 (-3.5%) 31.645 -> 37.547 (+18.7%) -``` +## Investigation Priorities -An extra 7-sample repeat at 512 and 1024 confirmed the shape: large -straight-line `regalloc` improves by about 13-27%, while the table-shaped case -regresses by about 16-17% in `regalloc`. The live-range analysis bucket itself -improves modestly in both shapes. The table-main regression is therefore in -rewrite mechanics, not range construction. - -The most likely cause is the first phase-4 implementation strategy: it rewrites -each block while walking backward, builds per-instruction temporary rewrite -lists, appends them in reverse, then reverses the block. That removes the large -persistent `live_after` allocation and helps the large straight-line case, but -it adds list-copy overhead that is visible when many compact functions dominate -the workload. A forward rewrite with a precomputed rolling-live cursor or a -block-local edit buffer should keep the no-`live_after` shape without the -table-main tax. - -### Phase-7 Fresh Rerun - -After phases 5, 6, and the implemented phase-7 occupancy split, the normal O1 -path no longer contains the old dense pseudo-conflict/live-after state. Rewrite -uses a block-local output buffer, `opt_live_info` and persistent block liveness -fields are gone, and allocator occupancy is split into hard-register and -demand-sized stack-slot tables. - -Samples: 5 runs per point. The table uses p50 milliseconds. The direct-call and -function-table generators are the same family as earlier sections, but these -are fresh generated inputs and not byte-for-byte identical to prior phase -tables. - -Straight-line main: - -```text -funcs input_bytes insts vals block_bytes ranges range_points used_loc_words hard_words stack_words stack_slots conflict_bytes run.total compile.tu parse_codegen opt.o1.total live_pre live_ranges_reg regalloc link_jit -1 331 84 57 576 53 70 140 140 0 0 0 0.625 0.517 0.365 0.230 0.061 0.033 0.073 0.058 -4 1051 297 213 1632 203 262 524 524 0 0 0 1.217 1.100 0.943 0.603 0.150 0.105 0.218 0.060 -16 3971 1149 837 6080 803 1030 2060 2060 0 0 0 3.562 3.445 3.301 2.148 0.539 0.375 0.803 0.061 -64 15749 4557 3333 23648 3203 4102 8204 8204 0 0 0 12.945 12.762 12.612 8.143 1.997 1.499 3.197 0.083 -128 31598 9101 6661 47072 6403 8198 16396 16396 0 0 0 24.576 24.306 24.158 15.474 3.730 2.955 6.252 0.118 -256 63654 18189 13317 93920 12803 16390 32780 32780 0 0 0 48.921 48.385 48.237 30.977 7.385 6.008 12.770 0.159 -512 127759 36365 26629 187616 25603 32774 65548 65548 0 0 0 98.418 97.658 97.503 63.030 14.960 12.268 26.552 0.249 -1024 256093 72717 53253 375008 51203 65542 131084 131084 0 0 0 206.580 204.832 204.682 131.904 30.624 25.894 57.609 0.421 -``` +1. Track `used_words` decomposition on stack-heavy inputs. + Add or inspect `hard_loc_words`, `stack_loc_words`, stack slot count, and + interval probe/mark counters for `spill_argc` and parameter-heavy ladders. + The immediate question is why `used_words` grows 3.3-3.6x while ranges and + spills only double. -Function-table main: +2. Keep `opt.conflict_bytes=0` as a regression guard. + Reintroducing dense pseudo conflict storage would be a clear O1 regression. -```text -funcs input_bytes insts vals block_bytes ranges range_points used_loc_words hard_words stack_words stack_slots conflict_bytes run.total compile.tu parse_codegen opt.o1.total live_pre live_ranges_reg regalloc link_jit -1 400 106 69 832 65 86 172 172 0 0 0 0.680 0.571 0.413 0.252 0.064 0.040 0.087 0.062 -4 1096 304 213 1888 203 263 526 526 0 0 0 1.244 1.132 0.988 0.625 0.162 0.107 0.224 0.062 -16 3914 1096 789 6112 755 971 1942 1942 0 0 0 3.473 3.346 3.188 2.061 0.516 0.358 0.765 0.066 -64 15260 4264 3093 23008 2963 3803 7606 7606 0 0 0 12.162 11.990 11.840 7.621 1.864 1.382 2.980 0.086 -128 30505 8488 6165 45536 5907 7579 15158 15158 0 0 0 23.249 22.999 22.848 14.553 3.484 2.701 5.777 0.110 -256 61281 16936 12309 90592 11795 15131 30262 30262 0 0 0 45.760 45.318 45.169 28.765 6.871 5.386 11.538 0.150 -512 122826 33832 24597 180704 23571 30235 60470 60470 0 0 0 91.469 90.751 90.601 57.470 13.717 10.718 23.064 0.240 -1024 246016 67624 49173 360928 47123 60443 120886 120886 0 0 0 190.086 187.863 187.699 116.461 27.974 21.462 46.345 0.417 -``` - -The 512 to 1024 step is now close to linear in both normal shapes: - -```text -straight_main: -compile.tu 97.658 -> 204.832 ms 2.10x -opt.o1.total 63.030 -> 131.904 ms 2.09x -opt.live_blocks.pre_dde 14.960 -> 30.624 ms 2.05x -opt.live_ranges.regalloc 12.268 -> 25.894 ms 2.11x -opt.regalloc 26.552 -> 57.609 ms 2.17x -used_loc_words 65548 -> 131084 2.00x -hard_loc_words 65548 -> 131084 2.00x -stack_loc_words 0 -> 0 - -table_main: -compile.tu 90.751 -> 187.863 ms 2.07x -opt.o1.total 57.470 -> 116.461 ms 2.03x -opt.live_blocks.pre_dde 13.717 -> 27.974 ms 2.04x -opt.live_ranges.regalloc 10.718 -> 21.462 ms 2.00x -opt.regalloc 23.064 -> 46.345 ms 2.01x -used_loc_words 60470 -> 120886 2.00x -hard_loc_words 60470 -> 120886 2.00x -stack_loc_words 0 -> 0 -``` - -This confirms that phases 5 and 7 removed the prior table-main rewrite tax and -the direct-call `used_locs` superlinear growth for this generator. These inputs -do not force spills, so the stack side of the occupancy split stays at zero. - -To exercise stack occupancy, a separate spill-pressure ladder used one large -`main` that repeatedly forms 24 live arguments and calls a local `sink24`. -Some larger cases return non-zero guest status because the generated integer -result is intentionally ignored by the benchmark harness; compile/JIT metrics -are still emitted before process exit. - -```text -groups input_bytes insts vals block_bytes ranges range_points used_loc_words hard_words stack_words stack_slots spills stores compile.tu opt.o1.total live_pre live_ranges_reg regalloc link_jit -1 1103 204 176 288 172 200 551 400 151 16 16 16 0.815 0.384 0.097 0.075 0.150 0.055 -4 3377 651 548 864 544 647 1892 1294 598 16 64 64 2.062 1.127 0.242 0.271 0.584 0.060 -16 12905 2439 2036 3168 2032 2435 7256 4870 2386 16 256 256 9.368 6.436 1.072 1.452 4.212 0.071 -64 52745 9591 7988 12096 7984 9587 28712 19174 9538 16 1024 1024 76.878 64.917 7.948 12.695 49.800 0.110 -128 107881 19127 15924 24000 15920 19123 57320 38246 19074 16 2048 2048 262.764 234.319 25.701 43.068 186.860 0.168 -256 223337 38199 31796 47808 31792 38195 114536 76390 38146 16 4096 4096 964.562 893.449 89.543 156.086 732.317 0.255 -512 454249 76343 63540 95424 63536 76339 228968 152678 76290 16 8192 8192 3737.556 3536.356 332.393 595.509 2902.790 0.445 -``` - -The spill ladder tells a different story: - -```text -256 -> 512: -compile.tu 964.562 -> 3737.556 ms 3.87x -opt.o1.total 893.449 -> 3536.356 ms 3.96x -opt.live_blocks.pre_dde 89.543 -> 332.393 ms 3.71x -opt.live_ranges.regalloc 156.086 -> 595.509 ms 3.82x -opt.regalloc 732.317 -> 2902.790 ms 3.96x -used_loc_words 114536 -> 228968 2.00x -hard_loc_words 76390 -> 152678 2.00x -stack_loc_words 38146 -> 76290 2.00x -stack_slots 16 -> 16 1.00x -spills 4096 -> 8192 2.00x -``` - -The stack occupancy split is doing its job: stack words scale with allocated -stack slots and points, not candidate count, and the slot count stays flat at -16 while spills double. The remaining superlinear behavior is now elsewhere: -wide single functions still drive full-width value bitset work in block -liveness, live-range construction, DDE, and the rewrite/allocation walk. In -other words, phase 7 fixed the specific `points * candidates` occupancy growth, -but not the broader `instructions * live_words` shape for values that are live -across very wide regions. - -### Phase-8 O1 Wide-Function Rerun - -Phase 8 attacks three remaining O1 costs exposed by the spill-pressure ladder: - -- allocator candidate ordering now uses `qsort` instead of insertion sort; -- live-range construction uses per-instruction use/def value lists with - generation marks instead of clearing fixed-width temporary bitsets; -- asm register-constraint application first checks whether the function has any - `IR_ASM_BLOCK`, avoiding a full backward live walk for normal C functions. - -Normal direct-call and table-shaped ladders remain close to linear at the -512-to-1024 step: - -```text -straight_main: -compile.tu 97.938 -> 202.409 ms 2.07x -opt.o1.total 61.803 -> 125.589 ms 2.03x -opt.live_blocks.pre_dde 15.131 -> 30.724 ms 2.03x -opt.live_ranges.regalloc 10.268 -> 20.757 ms 2.02x -opt.regalloc 23.644 -> 47.980 ms 2.03x - -table_main: -compile.tu 90.607 -> 188.297 ms 2.08x -opt.o1.total 56.265 -> 115.105 ms 2.05x -opt.live_blocks.pre_dde 13.915 -> 28.431 ms 2.04x -opt.live_ranges.regalloc 9.156 -> 18.499 ms 2.02x -opt.regalloc 21.062 -> 42.722 ms 2.03x -``` - -The spill-pressure ladder improves substantially in absolute time: - -```text -groups compile.tu opt.o1.total live_pre live_ranges_reg regalloc hard_words stack_words stack_slots spills -128 91.797 62.914 9.021 9.213 32.036 38246 19074 16 2048 -256 275.634 200.878 24.166 24.887 104.204 76390 38146 16 4096 -512 958.535 754.326 74.544 76.522 374.253 152678 76290 16 8192 -``` - -Compared with the phase-7 512-group spill row, this cuts: - -```text -compile.tu 3737.556 -> 958.535 ms 3.90x faster -opt.o1.total 3536.356 -> 754.326 ms 4.69x faster -opt.live_blocks.pre_dde 332.393 -> 74.544 ms 4.46x faster -opt.live_ranges.regalloc 595.509 -> 76.522 ms 7.78x faster -opt.regalloc 2902.790 -> 374.253 ms 7.76x faster -``` - -The 256-to-512 spill step is still superlinear: - -```text -compile.tu 275.634 -> 958.535 ms 3.48x -opt.o1.total 200.878 -> 754.326 ms 3.76x -opt.live_blocks.pre_dde 24.166 -> 74.544 ms 3.08x -opt.live_ranges.regalloc 24.887 -> 76.522 ms 3.07x -opt.regalloc 104.204 -> 374.253 ms 3.59x -hard_loc_words 76390 -> 152678 2.00x -stack_loc_words 38146 -> 76290 2.00x -stack_slots 16 -> 16 1.00x -spills 4096 -> 8192 2.00x -``` - -The remaining wide-function cost is no longer the old conflict matrix, -candidate sort, stack occupancy, or no-asm constraint scan. It is concentrated -in repeated dense/high-watermark live-set work and range-point occupancy walks -inside one very wide function. - -Next investigation should start in: - -- `src/opt/pass_live.c`: `opt_live_blocks`, `live_recompute_metrics`, and the - final per-value range summary in `opt_live_ranges_build`. These still have - high-watermark or all-value loops that show up on wide single functions. -- `src/opt/pass_lower.c`: `alloc_collect_occupied_hard`, - `alloc_collect_occupied_stack`, `alloc_mark_hard_loc`, and - `alloc_mark_stack_loc`. These still walk every point in every candidate range; - interval/event state per location should replace repeated point scans. -- `src/opt/pass_lower.c`: `rewrite_func` and `opt_dead_def_elim_with_live`. - Both still maintain rolling live bitsets with function-width scratch storage; - after range/occupancy work, convert these to set-bit or operand-event updates. - -### Current O1 MIR-Shape Rerun - -After the follow-up O1 cleanup, the normal path is closer to MIR in four -important ways: - -- pre-DDE no longer builds a duplicate live-range set; -- DDE and rewrite use operand-reference events instead of materializing - per-instruction full-width use/def bitsets; -- live-range construction tracks open and touched values with generation marks - instead of rescanning every value at block boundaries; -- O1 assignment avoids caller-saved registers for values known to be live - across calls when the value is not tied to a hard register. - -The table below is from a focused 3-sample p50 rerun over freshly generated -stress inputs. It is not byte-for-byte comparable to the earlier phase tables, -but it shows the current bucket shape and the counters that should drive the -next work. - -```text -case compile.tu opt.o1.total live_pre dde live_ranges_reg regalloc range.value_scans alloc.hard_point_visits rewrite.live_words dde.live_words spills -straight 512 140.387 116.194 17.700 3.159 30.384 63.670 25616 28689 81672 31398 0 -straight 1024 286.085 234.626 35.394 6.410 62.155 129.088 51216 57361 261624 62774 0 -table 512 135.180 113.238 17.650 2.934 29.298 61.829 19506 21058 25156 25152 0 -table 1024 273.757 226.367 35.209 6.007 58.357 123.448 38962 42050 50244 50240 0 -spill 128 44.702 19.570 0.712 1.018 7.844 12.942 25700 64252 77762 26306 2048 -spill 256 109.002 43.998 1.327 2.021 20.569 31.053 51300 128380 257818 52506 4096 -spill 512 300.108 109.833 2.538 3.964 61.995 83.985 102500 256636 925130 104906 8192 -``` - -The normal straight-line and table-shaped ladders are now essentially linear. -For 512 to 1024 functions, `opt.o1.total`, `opt.live_ranges.regalloc`, -`opt.regalloc`, `opt.range.value_scans`, `opt.alloc.hard_point_visits`, and -`opt.dde.live_words_touched` are all about `2.0x`. The one outlier is -`opt.rewrite.live_words_touched` on straight-line input, which is `3.20x`, -although its absolute time is no longer the dominant bucket. - -The spill-pressure ladder is much better in absolute time, but still bends: - -```text -spill 128 -> 256: -opt.o1.total 19.570 -> 43.998 ms 2.25x -opt.live_ranges.regalloc 7.844 -> 20.569 ms 2.62x -opt.regalloc 12.942 -> 31.053 ms 2.40x -range.value_scans 25700 -> 51300 2.00x -alloc.hard_point_visits 64252 -> 128380 2.00x -rewrite.live_words 77762 -> 257818 3.32x - -spill 256 -> 512: -opt.o1.total 43.998 -> 109.833 ms 2.50x -opt.live_ranges.regalloc 20.569 -> 61.995 ms 3.01x -opt.regalloc 31.053 -> 83.985 ms 2.70x -range.value_scans 51300 -> 102500 2.00x -alloc.hard_point_visits 128380 -> 256636 2.00x -rewrite.live_words 257818 -> 925130 3.59x -``` - -This narrows the next O1 work. DDE is now roughly linear, range value scans are -linear, and allocation point visits are linear but high. The remaining -MIR-shape gap is concentrated in `opt.live_ranges.regalloc`, `opt.regalloc`, -and rewrite live-word traffic on spill-heavy wide functions. - -### Final O1 Cleanup Rerun - -The follow-up cleanup removes the remaining high-watermark scans that were -visible in the spill-pressure ladder: - -- `OptBitset` copy/metric paths now use active length and clear only stale - active words. -- live-range construction tracks live-across-call values with a rolling value - list instead of scanning the full active bitmap at every call. -- rewrite call save/restore insertion checks only hard-assigned caller-saved - values that are known to be live across some call. - -Focused single-sample spill-pressure sanity run: - -```text -groups compile.tu opt.o1.total live_pre live_ranges_reg regalloc range.live_words rewrite.live_words hard_point_visits spills -128 44.171 17.937 0.679 5.154 11.623 256 25602 64129 2048 -256 104.255 38.741 1.290 10.271 25.900 512 51202 128257 4096 -512 280.679 90.815 2.648 22.043 64.286 1024 102402 256513 8192 -``` - -The important counter shape is now linear where the prior O1 work was still -superlinear: `range.live_words`, `rewrite.live_words`, point visits, and spills -all double as the input doubles. The wall-time buckets are close enough to the -linear target for O1 that remaining occupancy-loop work is no longer blocking -the O2 coalescing/splitting work. - -### O1 Occupancy-Interval Cleanup - -The final O1 cleanup replaces the allocator's point-index occupied-location -bitmaps with sorted per-location interval sets. Assignment still consumes the -same compressed live ranges and produces the same `OptLoc` map for rewrite, but -hard-register and stack-slot conflict checks no longer OR per-point bitmap -rows. The old `hard_word_ors` and `stack_word_ors` counters therefore stay at -zero on this path. - -Focused single-sample spill-pressure sanity run on a fresh generated input: - -```text -groups opt.o1.total live_ranges_reg regalloc hard_point_visits stack_point_visits hard_mark_points stack_mark_points hard_word_ors stack_word_ors spills rewrite.live_words -128 19.782 4.971 13.355 40195 17392 10754 2048 0 0 2048 25604 -256 43.253 10.285 29.960 80387 34800 21506 4096 0 0 4096 51204 -512 98.830 21.880 72.118 160771 69616 43010 8192 0 0 8192 102404 -``` - -The important shape is still linear at the 128 -> 256 -> 512 steps: spills, -rewrite live traffic, stack interval probes, hard interval probes, and interval -marks all roughly double as the generated input doubles. This removes the last -O1 allocator cleanup item identified by the MIR-shape report; further allocator -work should now be justified by generated-code quality, O2 coalescing, or live -range splitting rather than by O1 compile-time scaling. - -### Vstack Constfold Probe - -After `doc/CONSTFOLD.md` phase 5, the focused O1 probes were re-run with -`build/cfree cc -O1 -target ... -c` and disassembled with `llvm-objdump -dr`. -Instruction counts below include function prologue/epilogue traffic; the useful -shape is that the direct literal, compound-assign, and literal compare/branch -probes all reduce to immediate return materialization, while the address-taken -probe still performs the load/add/store/reload sequence. - -```text -probe arch text insn mov arithmetic -compound_assign x64 45 11 6 1 -compound_assign aa64 32 8 2 2 -compound_assign rv64 72 18 1 3 -literal_return x64 45 11 6 1 -literal_return aa64 32 8 2 2 -literal_return rv64 72 18 1 3 -compare_branch_literal x64 45 11 6 1 -compare_branch_literal aa64 32 8 2 2 -compare_branch_literal rv64 72 18 1 3 -local_addr_taken x64 107 25 18 1 -local_addr_taken aa64 88 22 3 5 -local_addr_taken rv64 128 32 1 4 -``` - -Host JIT timings were measured with seven `build/cfree run --time -O1 -e -test_main` samples per probe; the table uses p50 milliseconds for time buckets -and stable first-sample counters for rewrite activity. - -```text -probe opt.o1 live_ranges regalloc spills reloads inserted -compound_assign 0.285 0.071 0.153 0 0 0 -literal_return 0.289 0.072 0.155 0 0 0 -compare_branch_literal 0.351 0.088 0.188 0 0 0 -local_addr_taken 0.351 0.088 0.187 0 0 0 -``` +3. Keep normal O1 branch cleanup block-linear. + `jump_cleanup` should remain a local pass over blocks/terminators. Global + layout work belongs in O2. -### O1 Local Jump Cleanup - -The O1 branch-shape cleanup is intentionally local and bounded. It has two -stages: - -- `jump_cleanup.cfg`: forwards branch-to-branch targets and inverts a - compare-branch only when that removes the immediately following - unconditional-jump block. -- `jump_cleanup.layout`: runs immediately before emit and deletes an - unconditional branch whose target is the next physical block in `emit_order`. - -The scaling property is block/edge-linear for the O1 shapes this pass owns. -The pass builds a per-block emit-order index once per run, memoizes resolved -branch-only chains, and never walks values, liveness bitsets, ranges, or -allocator occupancy. Its scratch storage is proportional to block count, and -its work is proportional to blocks plus terminator edges plus branch-only chain -members. It should therefore remain invisible next to `live_blocks`, -`live_ranges_reg`, and `regalloc` in large O1 scaling probes. If future branch -layout work needs global block ordering, keep that out of O1 and put it in O2. - -## Performance Priorities - -1. Keep O1 on interval occupancy. - The normal O1 allocator now uses per-location interval sets instead of - point-index bitmap rows. Do not reintroduce point-by-point OR/mark loops for - O1; if O2 needs richer placement, build it as interval/event state with - explicit metrics. - -2. Keep the split `used_locs` design. - `hard_loc_words` and `stack_loc_words` now scale linearly in both normal and - spill-pressure ladders. Stack occupancy is demand-sized by allocated stack - slots, so it is no longer the explanation for the wide-spill superlinear - timings. - -3. Decide whether O1 can avoid the second block-liveness pass. - O1 still runs block liveness before DDE and again for regalloc because DDE - changes the instruction stream. Keep this until the other buckets are - smaller, then either reuse/revalidate liveness after DDE or make DDE produce - enough local update information to avoid a full rebuild. - -4. Keep jump cleanup local and block-linear. - The O1 branch cleanup exists to remove obvious code-shape artifacts, not to - perform global layout. Keep it on cached emit-order indexes, memoized - branch-only chains, and block/terminator scans. Do not add value, liveness, - or range traffic to this pass. - -5. Avoid whole-function contiguous growth where hot passes scan repeatedly. - Large `Val`/block/instruction arrays and dense bit matrices are likely to - hurt cache locality. Segmented arrays may help by reducing large copies and - stabilizing addresses, but pass algorithms need to avoid turning segment - traversal into extra indirection in inner loops. - -6. Keep link/JIT lower priority for now. - Even at 1024 functions, `link_jit.all` is around `0.5 ms`; copy, reloc, and - icache flush are small and roughly linear. They are not the bottleneck for - submit-to-execution latency in these tests. - -7. Split `compile.c.parse_codegen` further before changing parser/CG data - structures. - The non-O1 compile residual is meaningful at large source sizes, but it is - currently broad. More counters are needed before choosing parser, PP, type, - or CG API changes. +4. Split frontend residuals only after O1 code shape work. + The non-O1 compile residual is visible, but current scaling pressure is not + coming from object finalization or link/JIT. More parser/PP/CG counters would + be useful, but O1 generated-code cleanup has higher near-term value.