commit bff67bc795e315aa52e3a298aee4131f6ccd37f2
parent c89c0ddd48dfe37accf41decefb7ec977713edab
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Wed, 27 May 2026 08:46:00 -0700
doc: update O1 checklist — asm + direct-replay deletion + callee-saves + param-home done
Mark inline-asm routing, direct-replay/gate deletion, callee-saved allocability,
and the parameter-home round-trip as complete. Split the allocable-set item into
the finished callee-saved work and a remaining (unplanned) task to make arg
registers x0..x7 / v0..v7 allocable, recording the finding that this is an
indivisible change needing a general parallel-move sequencer across all four ABI
emission paths (call args, tail-call args, function entry, multi-result return).
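
For reference, a minimal sketch of the kind of parallel-move sequencer the
finding calls for. Everything here is hypothetical and for illustration only:
the `Move` type, `pm_sequence`, `pm_apply`, and the plain-int register
numbering are not cfree APIs. It covers only the register-to-register case
with one reserved scratch register, and assumes all destinations are distinct
and the scratch never appears as a source or destination.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration types; not part of cfree. */
typedef struct { int dst; int src; } Move;

enum { PM_MAX_MOVES = 16 };

/* Turn the parallel assignment dst[i] <- src[i] into an ordered sequence.
 * Assumes distinct destinations and a scratch register that is neither a
 * source nor a destination. `out` must have room for 2*n moves.
 * Returns the number of moves written to `out`. */
size_t pm_sequence(const Move *in, size_t n, int scratch, Move *out)
{
    Move pending[PM_MAX_MOVES];
    size_t np = 0, no = 0;

    assert(n <= PM_MAX_MOVES);
    for (size_t i = 0; i < n; i++) {
        assert(in[i].dst != scratch && in[i].src != scratch);
        if (in[i].dst != in[i].src)      /* drop no-op self moves */
            pending[np++] = in[i];
    }

    while (np > 0) {
        size_t pick = np;

        /* A move is safe to emit once no other pending move reads its dst. */
        for (size_t i = 0; i < np && pick == np; i++) {
            int blocked = 0;
            for (size_t j = 0; j < np; j++) {
                if (j != i && pending[j].src == pending[i].dst) {
                    blocked = 1;
                    break;
                }
            }
            if (!blocked)
                pick = i;
        }

        if (pick == np) {
            /* Only cycles remain. Save one destination's old value in the
             * scratch and redirect its reader there, which unblocks
             * pending[0] and lets the rest of the cycle drain. */
            int d = pending[0].dst;
            out[no++] = (Move){ scratch, d };
            for (size_t j = 0; j < np; j++)
                if (pending[j].src == d)
                    pending[j].src = scratch;
            pick = 0;
        }

        out[no++] = pending[pick];
        pending[pick] = pending[--np];   /* unordered removal */
    }
    return no;
}

/* Apply a move sequence to a simulated register file, in order.
 * Useful for permutation/swap stress tests of the sequencer. */
void pm_apply(int *regs, const Move *seq, size_t n)
{
    for (size_t i = 0; i < n; i++)
        regs[seq[i].dst] = regs[seq[i].src];
}
```

A two-register swap costs three moves here (save, move, restore), and the
scratch is always consumed before the next cycle is broken, so a single
scratch suffices. A real implementation would also have to handle memory
operands, register classes, and each of the four ABI paths.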
Diffstat:
M doc/OPT_O1_PASSES.md | 93 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------
1 file changed, 71 insertions(+), 22 deletions(-)
diff --git a/doc/OPT_O1_PASSES.md b/doc/OPT_O1_PASSES.md
@@ -496,9 +496,10 @@ x64/rv64 `NativeTarget` ports are out of scope.
## Checklist
Integration test (drives this work):
-`CFREE_NO_DIRECT_REPLAY=1 CFREE_OPT_LEVELS=1 CFREE_TEST_PATHS=R ./test/toy/run.sh`
-(`CFREE_NO_DIRECT_REPLAY` is a temporary gate in `opt.c` that forces every
-function through the optimizer; delete it once the bypass is removed.)
+`CFREE_OPT_LEVELS=1 CFREE_TEST_PATHS=R ./test/toy/run.sh`
+(The `CFREE_NO_DIRECT_REPLAY` gate and the direct-replay path are now deleted —
+every function goes through the optimizer unconditionally, so no gate is
+needed.)
Completeness — route all ops through the optimizer:
- [x] **Varargs** — `va_start_/va_arg_/va_end_/va_copy_` hooks on `NativeTarget`;
@@ -507,14 +508,12 @@ Completeness — route all ops through the optimizer:
operands as address-taken.
- [x] **Latent `IR_ADDR_OF` spill-writeback bug** in `pass_native_emit.c`
(ADDR_OF result computed into scratch was never stored back).
-- [ ] **Inline asm** (`IR_ASM_BLOCK`) — add a `NativeTarget` `asm_block` that
- *binds the optimizer's pre-allocated operand registers* to the template
- (machinize already fills `out_fixed_regs`/`in_fixed_regs`/`clobber_mask`
- in `pass_machinize.c`; lower/analysis already apply them). Must NOT
- self-allocate (the direct path does, which is unsafe when values are live
- in regs across the asm). Refactor aa64 asm clobber-mask / callee-save /
- restore helpers off `NativeDirectTarget` (same wrapper pattern as va).
- Toy cases: 102,104,105,108,110,19,20.
+- [x] **Inline asm** (`IR_ASM_BLOCK`) — DONE. `NativeTarget` `asm_block` binds
+ the optimizer's pre-allocated operand registers to the template (no
+ self-allocation; inputs are already live in their regs and outputs are
+ consumed via use/def data flow). aa64 asm clobber-mask / callee-save /
+ restore helpers refactored off `NativeDirectTarget` onto `AANativeTarget`.
+ Toy cases 102,104,105,108,110,19,20 all pass.
- [x] **Aggregates / sret / byval** — DONE. Aggregate locals forced to frame;
per-part ABI typing in plan_call/plan_ret; aggregate results via
copy_bytes; aggregate-typed IR_COPY/IR_LOAD/IR_STORE via copy_bytes; sret
@@ -526,19 +525,46 @@ Completeness — route all ops through the optimizer:
completeness once a producer exists.
Direct-path deletion:
-- [ ] Delete `opt_func_needs_direct_replay`, `opt_replay_cg_ir_direct`, the
- `OptReplay` machinery, and the `replay_*` helpers in `opt.c`.
-- [ ] Remove the `CFREE_NO_DIRECT_REPLAY` env gate.
+- [x] Delete `opt_func_needs_direct_replay`, `opt_replay_cg_ir_direct`, the
+ `OptReplay` machinery, and the `replay_*` helpers in `opt.c`. DONE — every
+ function now goes through the optimizer (opt.c 680 → 269 lines).
+- [x] Remove the `CFREE_NO_DIRECT_REPLAY` env gate. DONE.
Performance (priority 3, after completeness + correctness):
-- [ ] **Expand aa64 allocable set** — only 6 int allocable regs today; add
- callee-saved x19..x28 (and callee-saved FP) with backend-tracked prologue
- save/restore (`patch_apply` already rewrites the prologue after the body).
- Likely the bulk of the current runtime regression.
-- [ ] **Local classification (Regression B)** — verify non-escaping
- address-taken locals get promoted to registers; close any gap between
- `opt_addr_xform_pregs` + `opt_promote_scalar_locals` and what the old path
- achieved.
+- [x] **Expand aa64 allocable set — callee-saved** — DONE. x19..x28 and d8..d15
+ are allocable (caller-saved stay first so they're preferred; callee-saved
+ chosen only under pressure). `pass_native_emit` scans the lowered MIR for
+ the callee-saved regs the allocator used and passes them to a new
+ `reserve_callee_saves` `NativeTarget` hook before frame-slot mapping; the
+ aa64 backend reserves save slots first (small FP-relative offsets), saves
+ them in the back-patched prologue and restores them in the epilogue and
+ the tail-call patch. FP class gained a `callee_saved_mask` (d8..d15).
+- [ ] **Expand aa64 allocable set — arg registers** — make x0..x7 / v0..v7
+ allocable so values (params, call args, temporaries) can live in them.
+ No plan yet. Key finding from the attempt (reverted): there is no safe
+ "entry-only" subset — making arg regs allocable simultaneously exposes a
+ parallel-move / shuffle hazard in *four* ABI emission paths (call-argument
+ setup, tail-call argument setup, function entry / param materialization,
+ and multi-result return). A subset that forbade arg-reg sources at
+ calls/returns and let each param keep only its own incoming arg reg cut
+ the bypass-probe toy R-O1 failures from 14 to 11 but broke 3 previously
+ passing tests, so the subset is unsound. Correct approach: one general
+ parallel-move-with-memory sequencer (cycle-break via reserved scratch;
+ x0..x7/v0..v7 currently excluded so sources and destinations are disjoint)
+ applied uniformly to all four paths, gated by permutation/swap stress
+ tests. The across-call correctness safety net already exists
+ (`rewrite_call_save_one` saves caller-saved live-across-call values keyed
+ on the clobber mask, which includes arg regs).
+- [~] **Local classification (Regression B)** — param round-trip CLOSED:
+ `bind_param` now receives the allocator-chosen destination `NativeLoc`, so
+ a register-allocated scalar param moves straight from its incoming arg
+ register into its hard register (no frame store+reload); the
+ `IR_PARAM_DECL` marker emits nothing. Remaining gap: address-taken locals
+ whose address is stored into a pointer variable then dereferenced are not
+ promoted (needs ADDR_OF copy-propagation through PRegs); genuinely aliased
+ locals correctly stay in memory. `opt_addr_xform_pregs` +
+ `opt_promote_scalar_locals` already handle direct `ADDR_OF(local)` →
+ load/store base folding.
- [ ] **Unit tests** — new targeted tests for the `CgIrFunc`→`NativeTarget`
path (local promotion, addr-fold, regalloc, lowered bypass ops);
re-enable a `test-opt` make target (old `test/opt/opt_test.c` is disabled
@@ -569,6 +595,29 @@ Performance (priority 3, after completeness + correctness):
pass. Bypass-off R-path failures: 11 → 9 (7 asm + 2 tail-sret). Default
R-path: 408/408.
+- Inline asm + direct-path deletion (commit "opt: route inline asm through
+ optimizer; delete direct-replay path"): `asm_block` `NativeTarget` hook binds
+ pre-allocated operand registers; all 7 asm cases pass bypass-off. With asm
+ done the bypass-off R-path reached 0 failures (tail-sret had already been
+ fixed), so the entire direct-replay path + `CFREE_NO_DIRECT_REPLAY` gate were
+ deleted (opt.c 680 → 269 lines). Every function now goes through the
+ optimizer. Full toy suite (R/L/C/W × O0/O1/O2): 1333 pass, 0 fail, 8 skip.
+
+- Callee-saved registers allocable (commit "aa64: make callee-saved registers
+ allocable at O1"): x19..x28 / d8..d15 added to the allocable set with
+ prologue/epilogue + tail-call save/restore via the new `reserve_callee_saves`
+ hook. Verified O0==O1 on register-pressure int and fp programs.
+
+- Parameter-home elimination (commit "opt: route incoming params straight into
+ their allocated register"): `bind_param` takes a destination `NativeLoc`;
+ params no longer round-trip through a frame home. Deleted
+ `allocate_param_home` / `local_home_for_preg` / `param_home_by_preg`.
+
+- Arg-register allocability attempted and reverted (see the unchecked perf item
+ above for the finding): the "entry-only" subset is unsound; the change is
+ indivisible and needs a general parallel-move sequencer across all four ABI
+ paths. Tree left at the param-home-elimination commit, suite green.
+
Debugging aids: `CFREE_NO_DIRECT_REPLAY=1 cfree cc -O1 -c <case>.toy` +
`cfree objdump -d`; `CFREE_DUMP=1` / `CFREE_DUMPCG=1` dump optimizer/CG IR
(they `compiler_panic` on the first recorded function — temporarily swap the