doc: update O1 recovery progress (aggregates mostly landed; tail-sret + asm remain) - kit

commit 19bfa484a2cc61e639afe8823cd50dac1c46d04c
parent 7ea57a999f6e38192ca781416198d64c778cc276
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Wed, 27 May 2026 05:50:43 -0700

doc: update O1 recovery progress (aggregates mostly landed; tail-sret + asm remain)

Diffstat:
M doc/OPT_O1_PASSES.md  | 38 ++++++++++++++++++++++++++++++++++----

1 file changed, 34 insertions(+), 4 deletions(-)
diff --git a/doc/OPT_O1_PASSES.md b/doc/OPT_O1_PASSES.md
@@ -515,10 +515,16 @@ Completeness — route all ops through the optimizer:
       in regs across the asm). Refactor aa64 asm clobber-mask / callee-save /
       restore helpers off `NativeDirectTarget` (same wrapper pattern as va).
       Toy cases: 102,104,105,108,110,19,20.
-- [ ] **Aggregates / sret / byval** — ABI lowering gaps in the optimizer path:
-      124 (slices, wrong value), 130 (record sret, wrong codegen), 36 ("scalar
-      too large" panic), 37 (tail sret). Covers aggregate params/results,
-      sret returns, and aggregate/by-value call arguments.
+- [~] **Aggregates / sret / byval** — mostly landed. DONE: aggregate locals
+      forced to frame; per-part ABI typing in plan_call/plan_ret; aggregate
+      results via copy_bytes; aggregate-typed IR_COPY/IR_LOAD/IR_STORE via
+      copy_bytes. 130 (record sret) and 124 (slices) now pass. REMAINING:
+      **tail call + sret** (36 musttail, 37 tail). Bug: in the tail+sret arg
+      shuffle, the first argument is loaded into x8, then x8 is overwritten with
+      the forwarded sret pointer before its value is moved to x0, so x0 gets the
+      sret pointer instead of arg0 (see `aa_plan_call` tail/sret path + the
+      tail-call argument staging). Order the sret-x8 setup after the argument
+      moves, or stage args through temps that don't alias x8.
 - [ ] **BREAK_TO / CONTINUE_TO + SCOPE cond** — currently unused by frontends
       (toy/c lower break/continue to `BR`+labels), but unwired in emit. Either
       lower them to CFG edges in cg_ir_lower or wire emit, for true
@@ -549,3 +555,27 @@ Performance (priority 3, after completeness + correctness):
 
 - Varargs landed end-to-end on the optimizer path; `IR_ADDR_OF` writeback fixed.
   Bypass-disabled R-path failures: 14 → 11. Default R-path (O0+O1): 408/408.
+  (commit "opt: route varargs through optimizer path; fix ADDR_OF spill writeback")
+- Aggregate ABI, partial (commit "opt: aggregate ABI lowering ... (partial)"):
+  - Force aggregate / >8-byte locals to frame in `cg_ir_lower` `lower_locals`
+    (a 16-byte struct result local was being allocated to a single PReg).
+  - Type each ABI part by its own width in `aa_plan_call`/`aa_plan_ret` direct
+    paths via `aa_part_scalar_type` (was using the aggregate type → truncating
+    `mov w0,w9` for i64 fields).
+  - `emit_call`/`emit_ret`: aggregate/oversized results use `copy_bytes` / hand
+    `plan_ret` the value's memory location directly (no scalar temp copy).
+  - Result: no more "scalar too large" panics, and sret no longer truncates.
+    Values are still wrong (130→0, 124→40, 36/37→160); still 11 bypass-off failures.
+
+- Aggregate COPY/LOAD/STORE via `copy_bytes` (commit "opt: handle
+  aggregate-typed COPY/LOAD/STORE via byte copy"): the root cause of the
+  truncated/mis-offset aggregate moves was that `IR_COPY`/`IR_LOAD`/`IR_STORE`
+  on aggregate-typed operands were emitted as scalar moves. 130 and 124 now
+  pass. Bypass-off R-path failures: 11 → 9 (7 asm + 2 tail-sret). Default
+  R-path: 408/408.
+
+Debugging aids: `CFREE_NO_DIRECT_REPLAY=1 cfree cc -O1 -c <case>.toy`, then
+`cfree objdump -d`; `CFREE_DUMP=1` / `CFREE_DUMPCG=1` dump optimizer/CG IR
+(they `compiler_panic` on the first recorded function; temporarily swap the
+panic in `opt_dbg_dump_cg`/`opt_dbg_dump` for `cfree_debug_printf` to dump all
+functions). Note that the CG-IR dumper does not print INDIRECT `ofs`.
