commit 19bfa484a2cc61e639afe8823cd50dac1c46d04c
parent 7ea57a999f6e38192ca781416198d64c778cc276
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Wed, 27 May 2026 05:50:43 -0700
doc: update O1 recovery progress (aggregates mostly landed; tail-sret + asm remain)
Diffstat:
1 file changed, 34 insertions(+), 4 deletions(-)
diff --git a/doc/OPT_O1_PASSES.md b/doc/OPT_O1_PASSES.md
@@ -515,10 +515,16 @@ Completeness — route all ops through the optimizer:
in regs across the asm). Refactor aa64 asm clobber-mask / callee-save /
restore helpers off `NativeDirectTarget` (same wrapper pattern as va).
Toy cases: 102,104,105,108,110,19,20.
-- [ ] **Aggregates / sret / byval** — ABI lowering gaps in the optimizer path:
- 124 (slices, wrong value), 130 (record sret, wrong codegen), 36 ("scalar
- too large" panic), 37 (tail sret). Covers aggregate params/results,
- sret returns, and aggregate/by-value call arguments.
+- [~] **Aggregates / sret / byval** — mostly landed. DONE: aggregate locals
+ forced to frame; per-part ABI typing in plan_call/plan_ret; aggregate
+ results via copy_bytes; aggregate-typed IR_COPY/IR_LOAD/IR_STORE via
+ copy_bytes. 130 (record sret) and 124 (slices) now pass. REMAINING:
+ **tail call + sret** (36 musttail, 37 tail). Bug: in the tail+sret arg
+ shuffle, arg0 is staged in x8, but x8 is then overwritten with the
+ forwarded sret pointer before the staged value is moved to x0, so x0
+ receives the sret pointer instead of arg0 (see the `aa_plan_call` tail/sret
+ path and the tail-call argument staging). Fix: order the sret-x8 setup after
+ the argument moves, or stage args through temps that don't alias x8.
- [ ] **BREAK_TO / CONTINUE_TO + SCOPE cond** — currently unused by frontends
(toy/c lower break/continue to `BR`+labels), but unwired in emit. Either
lower them to CFG edges in cg_ir_lower or wire emit, for true
@@ -549,3 +555,27 @@ Performance (priority 3, after completeness + correctness):
- Varargs landed end-to-end on the optimizer path; `IR_ADDR_OF` writeback fixed.
Bypass-disabled R-path failures: 14 → 11. Default R-path (O0+O1): 408/408.
+ (commit "opt: route varargs through optimizer path; fix ADDR_OF spill writeback")
+- Aggregate ABI, partial (commit "opt: aggregate ABI lowering ... (partial)"):
+ - Force aggregate / >8-byte locals to frame in `cg_ir_lower` `lower_locals`
+ (a 16-byte struct result local was being allocated to a single PReg).
+ - Type each ABI part by its own width in `aa_plan_call`/`aa_plan_ret` direct
+ paths via `aa_part_scalar_type` (was using the aggregate type → truncating
+ `mov w0,w9` for i64 fields).
+ - `emit_call`/`emit_ret`: aggregate/oversized results use `copy_bytes` / hand
+ `plan_ret` the value's memory location directly (no scalar temp copy).
+ - Result: no more "scalar too large" panics; sret no longer truncates. Values
+ are still wrong (130→0, 124→40, 36/37→160); bypass-off failures remain at 11.
+
+- Aggregate COPY/LOAD/STORE via `copy_bytes` (commit "opt: handle
+ aggregate-typed COPY/LOAD/STORE via byte copy"): the root cause of the
+ truncated/mis-offset aggregate moves was that `IR_COPY`/`IR_LOAD`/`IR_STORE`
+ on aggregate-typed operands were emitted as scalar moves. 130 and 124 now
+ pass. Bypass-off R-path failures: 11 → 9 (7 asm + 2 tail-sret). Default
+ R-path: 408/408.
+
+Debugging aids: `CFREE_NO_DIRECT_REPLAY=1 cfree cc -O1 -c <case>.toy`, then
+`cfree objdump -d`; `CFREE_DUMP=1` / `CFREE_DUMPCG=1` dump optimizer/CG IR
+(they `compiler_panic` on the first recorded function; temporarily swap the
+panic in `opt_dbg_dump_cg`/`opt_dbg_dump` for `cfree_debug_printf` to dump all
+functions). Note that the CG-IR dumper does not print INDIRECT `ofs`.
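
The tail call + sret ordering bug noted in the Aggregates / sret / byval item can be modelled outside the compiler. The following is a minimal C sketch, not cfree code: the `Regs` struct and the `shuffle_buggy` / `shuffle_fixed` / `self_check` helpers are hypothetical names invented for illustration, and it assumes x8 doubles as the staging temp for arg0, as the note describes.

```c
#include <assert.h>
#include <stdint.h>

/* Toy register file: only x0 and x8 matter for this illustration. */
typedef struct {
    uint64_t x0; /* first integer argument register */
    uint64_t x8; /* indirect-result (sret) pointer register on AArch64 */
} Regs;

/* Buggy order: arg0 is staged in x8, then x8 is overwritten with the
 * forwarded sret pointer before the staged value reaches x0. */
static Regs shuffle_buggy(uint64_t arg0, uint64_t sret_ptr) {
    Regs r = {0, 0};
    r.x8 = arg0;     /* stage arg0 in x8                   */
    r.x8 = sret_ptr; /* sret setup clobbers the staged arg */
    r.x0 = r.x8;     /* x0 now holds the sret pointer      */
    return r;
}

/* One of the two fixes named in the note: finish the argument moves
 * first, then set up x8 for sret. */
static Regs shuffle_fixed(uint64_t arg0, uint64_t sret_ptr) {
    Regs r = {0, 0};
    r.x8 = arg0;     /* stage arg0 in x8              */
    r.x0 = r.x8;     /* move the staged arg into x0   */
    r.x8 = sret_ptr; /* only now set the sret pointer */
    return r;
}

/* Small self-check contrasting the two orders. */
static void self_check(void) {
    Regs bad = shuffle_buggy(42, 0x1000);
    Regs good = shuffle_fixed(42, 0x1000);
    assert(bad.x0 == 0x1000); /* wrong: x0 got the sret pointer */
    assert(good.x0 == 42);    /* right: x0 got arg0             */
    assert(good.x8 == 0x1000);
}
```

The sketch covers only the reordering option. The alternative from the note, staging arguments through temps that do not alias x8, would leave the sret setup where it is and change the first assignment instead.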