kit

git clone https://git.ryansepassi.com/git/kit.git

commit bff67bc795e315aa52e3a298aee4131f6ccd37f2
parent c89c0ddd48dfe37accf41decefb7ec977713edab
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Wed, 27 May 2026 08:46:00 -0700

doc: update O1 checklist — asm + direct-replay deletion + callee-saves + param-home done

Mark inline-asm routing, direct-replay/gate deletion, callee-saved allocability,
and the parameter-home round-trip as complete. Split the allocable-set item into
the finished callee-saved work and a remaining (unplanned) task to make arg
registers x0..x7 / v0..v7 allocable, recording the finding that this is an
indivisible change needing a general parallel-move sequencer across all four ABI
emission paths (call args, tail-call args, function entry, multi-result return).
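
The parallel-move sequencer named above can be illustrated with a small sketch. This is not code from kit: `Move`, `sequence_parallel_moves`, `apply_moves`, `MAX_MOVES`, and `SCRATCH_REG` are hypothetical names, registers are modeled as plain integers, and the scratch register is assumed to be reserved (never a source or destination of any input move). The sketch assumes all destinations are distinct, emits any move whose destination is no longer read by a pending move, and breaks a cycle by parking one value in the scratch register.

```c
#include <assert.h>

#define MAX_MOVES   16
#define SCRATCH_REG 16   /* hypothetical reserved scratch register */

typedef struct { int dst, src; } Move;

/* Turn a set of simultaneous moves (distinct destinations) into an ordered
 * sequence that has the same effect when executed one move at a time.
 * `out` must have room for 2 * n entries. Returns the number emitted. */
static int sequence_parallel_moves(const Move *in, int n, Move *out) {
    Move pending[MAX_MOVES];
    int np = 0, nout = 0;
    assert(n <= MAX_MOVES);

    /* Self-moves are no-ops; drop them up front. */
    for (int i = 0; i < n; i++)
        if (in[i].dst != in[i].src) pending[np++] = in[i];

    while (np > 0) {
        int progressed = 0;
        for (int i = 0; i < np && !progressed; i++) {
            int blocked = 0;
            /* A move is blocked while another pending move still reads
             * the register it would overwrite. */
            for (int j = 0; j < np; j++)
                if (j != i && pending[j].src == pending[i].dst) blocked = 1;
            if (!blocked) {
                out[nout++] = pending[i];
                pending[i] = pending[--np];
                progressed = 1;
            }
        }
        if (progressed) continue;

        /* Only cycles remain: park one destination's current value in the
         * scratch register and redirect its reader to the scratch. */
        int victim = pending[0].dst;
        out[nout++] = (Move){ SCRATCH_REG, victim };
        for (int j = 0; j < np; j++)
            if (pending[j].src == victim) pending[j].src = SCRATCH_REG;
    }
    return nout;
}

/* Execute a move sequence on a simulated register file (for testing). */
static void apply_moves(const Move *seq, int n, int *regs) {
    for (int i = 0; i < n; i++) regs[seq[i].dst] = regs[seq[i].src];
}
```

For a two-register swap this produces three moves, one of them through the scratch register. A single routine of this shape could in principle serve all four paths listed above, which is the uniformity the message argues for; permutation and swap inputs like the ones in `apply_moves`-based checks are the kind of stress case it mentions.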

Diffstat:
M doc/OPT_O1_PASSES.md | 93 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------
1 file changed, 71 insertions(+), 22 deletions(-)

diff --git a/doc/OPT_O1_PASSES.md b/doc/OPT_O1_PASSES.md
@@ -496,9 +496,10 @@ x64/rv64 `NativeTarget` ports are out of scope.
 ## Checklist
 
 Integration test (drives this work):
-`CFREE_NO_DIRECT_REPLAY=1 CFREE_OPT_LEVELS=1 CFREE_TEST_PATHS=R ./test/toy/run.sh`
-(`CFREE_NO_DIRECT_REPLAY` is a temporary gate in `opt.c` that forces every
-function through the optimizer; delete it once the bypass is removed.)
+`CFREE_OPT_LEVELS=1 CFREE_TEST_PATHS=R ./test/toy/run.sh`
+(The `CFREE_NO_DIRECT_REPLAY` gate and the direct-replay path are now deleted —
+every function goes through the optimizer unconditionally, so no gate is
+needed.)
 
 Completeness — route all ops through the optimizer:
 - [x] **Varargs** — `va_start_/va_arg_/va_end_/va_copy_` hooks on `NativeTarget`;
@@ -507,14 +508,12 @@ Completeness — route all ops through the optimizer:
       operands as address-taken.
 - [x] **Latent `IR_ADDR_OF` spill-writeback bug** in `pass_native_emit.c`
       (ADDR_OF result computed into scratch was never stored back).
-- [ ] **Inline asm** (`IR_ASM_BLOCK`) — add a `NativeTarget` `asm_block` that
-      *binds the optimizer's pre-allocated operand registers* to the template
-      (machinize already fills `out_fixed_regs`/`in_fixed_regs`/`clobber_mask`
-      in `pass_machinize.c`; lower/analysis already apply them). Must NOT
-      self-allocate (the direct path does, which is unsafe when values are live
-      in regs across the asm). Refactor aa64 asm clobber-mask / callee-save /
-      restore helpers off `NativeDirectTarget` (same wrapper pattern as va).
-      Toy cases: 102,104,105,108,110,19,20.
+- [x] **Inline asm** (`IR_ASM_BLOCK`) — DONE. `NativeTarget` `asm_block` binds
+      the optimizer's pre-allocated operand registers to the template (no
+      self-allocation; inputs are already live in their regs and outputs are
+      consumed via use/def data flow). aa64 asm clobber-mask / callee-save /
+      restore helpers refactored off `NativeDirectTarget` onto `AANativeTarget`.
+      Toy cases 102,104,105,108,110,19,20 all pass.
 - [x] **Aggregates / sret / byval** — DONE. Aggregate locals forced to frame;
       per-part ABI typing in plan_call/plan_ret; aggregate results via
       copy_bytes; aggregate-typed IR_COPY/IR_LOAD/IR_STORE via copy_bytes; sret
@@ -526,19 +525,46 @@ Completeness — route all ops through the optimizer:
       completeness once a producer exists.
 
 Direct-path deletion:
-- [ ] Delete `opt_func_needs_direct_replay`, `opt_replay_cg_ir_direct`, the
-      `OptReplay` machinery, and the `replay_*` helpers in `opt.c`.
-- [ ] Remove the `CFREE_NO_DIRECT_REPLAY` env gate.
+- [x] Delete `opt_func_needs_direct_replay`, `opt_replay_cg_ir_direct`, the
+      `OptReplay` machinery, and the `replay_*` helpers in `opt.c`. DONE — every
+      function now goes through the optimizer (opt.c 680 → 269 lines).
+- [x] Remove the `CFREE_NO_DIRECT_REPLAY` env gate. DONE.
 
 Performance (priority 3, after completeness + correctness):
-- [ ] **Expand aa64 allocable set** — only 6 int allocable regs today; add
-      callee-saved x19..x28 (and callee-saved FP) with backend-tracked prologue
-      save/restore (`patch_apply` already rewrites the prologue after the body).
-      Likely the bulk of the current runtime regression.
-- [ ] **Local classification (Regression B)** — verify non-escaping
-      address-taken locals get promoted to registers; close any gap between
-      `opt_addr_xform_pregs` + `opt_promote_scalar_locals` and what the old path
-      achieved.
+- [x] **Expand aa64 allocable set — callee-saved** — DONE. x19..x28 and d8..d15
+      are allocable (caller-saved stay first so they're preferred; callee-saved
+      chosen only under pressure). `pass_native_emit` scans the lowered MIR for
+      the callee-saved regs the allocator used and passes them to a new
+      `reserve_callee_saves` `NativeTarget` hook before frame-slot mapping; the
+      aa64 backend reserves save slots first (small FP-relative offsets), saves
+      them in the back-patched prologue and restores them in the epilogue and
+      the tail-call patch. FP class gained a `callee_saved_mask` (d8..d15).
+- [ ] **Expand aa64 allocable set — arg registers** — make x0..x7 / v0..v7
+      allocable so values (params, call args, temporaries) can live in them.
+      No plan yet. Key finding from the attempt (reverted): there is no safe
+      "entry-only" subset — making arg regs allocable simultaneously exposes a
+      parallel-move / shuffle hazard in *four* ABI emission paths (call-argument
+      setup, tail-call argument setup, function entry / param materialization,
+      and multi-result return). A "forbid arg-reg sources at calls/returns +
+      let each param keep only its own incoming arg reg" subset got the bypass
+      probe from 14→11 toy R-O1 failures but newly broke 3 previously-passing
+      tests, i.e. the subset is unsound. Correct approach: one general
+      parallel-move-with-memory sequencer (cycle-break via reserved scratch;
+      x0..x7/v0..v7 currently excluded so sources and destinations are disjoint)
+      applied uniformly to all four paths, gated by permutation/swap stress
+      tests. The across-call correctness safety net already exists
+      (`rewrite_call_save_one` saves caller-saved live-across-call values keyed
+      on the clobber mask, which includes arg regs).
+- [~] **Local classification (Regression B)** — param round-trip CLOSED:
+      `bind_param` now receives the allocator-chosen destination `NativeLoc`, so
+      a register-allocated scalar param moves straight from its incoming arg
+      register into its hard register (no frame store+reload); the
+      `IR_PARAM_DECL` marker emits nothing. Remaining gap: address-taken locals
+      whose address is stored into a pointer variable then dereferenced are not
+      promoted (needs ADDR_OF copy-propagation through PRegs); genuinely aliased
+      locals correctly stay in memory. `opt_addr_xform_pregs` +
+      `opt_promote_scalar_locals` already handle direct `ADDR_OF(local)` →
+      load/store base folding.
 - [ ] **Unit tests** — new targeted tests for the `CgIrFunc`→`NativeTarget`
       path (local promotion, addr-fold, regalloc, lowered bypass ops); re-enable
       a `test-opt` make target (old `test/opt/opt_test.c` is disabled
@@ -569,6 +595,29 @@ Performance (priority 3, after completeness + correctness):
   pass. Bypass-off R-path failures: 11 → 9 (7 asm + 2 tail-sret). Default
   R-path: 408/408.
 
+- Inline asm + direct-path deletion (commit "opt: route inline asm through
+  optimizer; delete direct-replay path"): `asm_block` `NativeTarget` hook binds
+  pre-allocated operand registers; all 7 asm cases pass bypass-off. With asm
+  done the bypass-off R-path reached 0 failures (tail-sret had already been
+  fixed), so the entire direct-replay path + `CFREE_NO_DIRECT_REPLAY` gate were
+  deleted (opt.c 680 → 269 lines). Every function now goes through the
+  optimizer. Full toy suite (R/L/C/W × O0/O1/O2): 1333 pass, 0 fail, 8 skip.
+
+- Callee-saved registers allocable (commit "aa64: make callee-saved registers
+  allocable at O1"): x19..x28 / d8..d15 added to the allocable set with
+  prologue/epilogue + tail-call save/restore via the new `reserve_callee_saves`
+  hook. Verified O0==O1 on register-pressure int and fp programs.
+
+- Parameter-home elimination (commit "opt: route incoming params straight into
+  their allocated register"): `bind_param` takes a destination `NativeLoc`;
+  params no longer round-trip through a frame home. Deleted
+  `allocate_param_home` / `local_home_for_preg` / `param_home_by_preg`.
+
+- Arg-register allocability attempted and reverted (see the unchecked perf item
+  above for the finding): the "entry-only" subset is unsound; the change is
+  indivisible and needs a general parallel-move sequencer across all four ABI
+  paths. Tree left at the param-home-elimination commit, suite green.
+
 Debugging aids: `CFREE_NO_DIRECT_REPLAY=1 cfree cc -O1 -c <case>.toy` +
 `cfree objdump -d`; `CFREE_DUMP=1` / `CFREE_DUMPCG=1` dump optimizer/CG IR
 (they `compiler_panic` on the first recorded function — temporarily swap the