opt/aa64: plan the O1 call frame up front, drop all back-patching - kit

commit 7eca5a4d48b4d1fc043f5cd81bd010fbd1785b1f
parent 90f2ba1a39b3ad8ae0ba9119ce548d2f5d85cd19
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Thu, 28 May 2026 13:49:09 -0700

opt/aa64: plan the O1 call frame up front, drop all back-patching

The O1 optimizer now computes the complete call frame before emission
(plan_frame: callee-saved set, every static slot, outgoing-arg area via
the new pure NativeTarget.call_stack_bytes, has_alloca, atomic-RMW scratch
spill) and drives func_begin_known_frame. The aa64 backend emits a final
prologue, allocas, and tail epilogues with no back-patching; the single-pass
NativeDirectTarget keeps its reserve-and-patch strategy. An a->known_frame
flag is the sole discriminator, and a frame_final guard panics if any
frame slot is requested after the prologue (which caught atomic RMW's
mid-body spill).
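
A minimal sketch of the plan-then-freeze idea described above, not the backend's actual code. `SketchFrame`, `sketch_frame_slot`, and `sketch_begin_known_frame` are hypothetical names, and the slot-offset arithmetic is an assumption made only for illustration. The real `aa_frame_slot` panics once `frame_final` is set; the sketch returns -1 instead so the guard can be observed without aborting.

```c
#include <stdint.h>

/* Hypothetical stand-in for the backend's per-function frame state. */
typedef struct SketchFrame {
  uint32_t cum_off;    /* bytes of frame slots reserved so far */
  uint8_t frame_final; /* set once the prologue has been emitted */
} SketchFrame;

/* Reserve one slot. Returns the running frame size after the slot, or -1 if
 * the frame is already final (the real backend panics at this point). */
static int64_t sketch_frame_slot(SketchFrame* f, uint32_t size,
                                 uint32_t align) {
  if (f->frame_final) return -1;
  if (size == 0) size = 8u;   /* same defaults as the diff's aa_frame_slot */
  if (align == 0) align = 1u;
  f->cum_off = (f->cum_off + align - 1u) / align * align; /* round up */
  f->cum_off += size;
  return (int64_t)f->cum_off;
}

/* Plan every slot up front, then freeze the frame. Returns the final size
 * that a prologue emitted at this point would encode. */
static uint32_t sketch_begin_known_frame(SketchFrame* f, const uint32_t* sizes,
                                         uint32_t n) {
  f->cum_off = 0;
  f->frame_final = 0;
  for (uint32_t i = 0; i < n; ++i) sketch_frame_slot(f, sizes[i], 8u);
  /* The prologue would be emitted here, with cum_off as its immediate. */
  f->frame_final = 1;
  return f->cum_off;
}
```

Any slot request after `sketch_begin_known_frame` returns is rejected, so the size the prologue encoded cannot drift during body emission.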

Emit no longer invents body-time frame slots: aggregate/oversized call
results land directly in their frame home (dropping a redundant temp +
store/reload/copy), and a store whose source aliases its address base
collapses the address into a scratch register instead of spilling.

A side effect of deciding slim-prologue eligibility before emitting: the
leading `b PC+4; nop` filler (old fat-then-patch-to-slim artifact) is gone
at every function entry; prologue/epilogue are otherwise byte-identical.
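
The eligibility decision can be sketched as a pure function of the planned frame. This mirrors the conditions visible in `aa_func_begin_known_frame` in the diff, but `SketchPlan`, `SketchPrologueForm`, and `sketch_choose_prologue` are hypothetical names, and `saved_pair_off` stands in for whatever `aa_sp_off_saved_pair` computes.

```c
#include <stdint.h>

typedef enum {
  SKETCH_FAT,             /* general form, uses scratch registers */
  SKETCH_SLIM,            /* Tier A: nothing but the frame record */
  SKETCH_SLIM_SMALL_FRAME /* saved-pair offset fits stp's immediate */
} SketchPrologueForm;

/* Hypothetical summary of the planned frame, known before emission. */
typedef struct SketchPlan {
  uint32_t ncallee_saves;
  uint32_t slot_bytes;     /* static frame slots */
  uint32_t out_stack;      /* outgoing stack-argument bytes */
  uint32_t saved_pair_off; /* sp-relative offset of the saved x29/x30 pair */
  uint8_t has_alloca;
} SketchPlan;

/* Choose the prologue form once, before any instruction is emitted. */
static SketchPrologueForm sketch_choose_prologue(const SketchPlan* p) {
  /* A dynamic alloca moves sp in the body, so both slim forms are out. */
  if (p->has_alloca) return SKETCH_FAT;
  if (p->ncallee_saves == 0 && p->slot_bytes == 0 && p->out_stack == 0)
    return SKETCH_SLIM;
  /* 504 is the largest offset stp's signed 7-bit scaled immediate reaches. */
  if (p->saved_pair_off <= 504u) return SKETCH_SLIM_SMALL_FRAME;
  return SKETCH_FAT;
}
```

Because the form is chosen before the first word is written, there is no shorter encoding to patch in later and therefore no leftover words to fill.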

aa_emit_prologue / emit_minimal_prologue retired from the O1 path. x64/rv64
untouched.

Diffstat:
M doc/OPT_O1_PERF_TODO.md  | 44 ++++++++++++++++++++------------------------
M src/arch/aa64/native.c  | 258 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------
M src/arch/native_target.h  | 23 +++++++++++++++++++++++
M src/opt/pass_native_emit.c  | 201 +++++++++++++++++++++++++++++++++++++++++++++++++++----------------------------

4 files changed, 361 insertions(+), 165 deletions(-)
diff --git a/doc/OPT_O1_PERF_TODO.md b/doc/OPT_O1_PERF_TODO.md
@@ -50,25 +50,16 @@ overhead** — every byte of which is multiplied by 7.6M.
 Open items, in priority order (most recent disasm in
 `/tmp/mc/binary-trees.cfree.o`):
 
-1. **Useless leading `b PC+4` at every function entry.** All four functions
-   still start with this:
-   ```
-   sub  sp, sp, #0x20
-   stp  x29, x30, [sp, #0x10]
-   add  x29, sp, #0x10
-   stp  x20, x19, [x29, #-0x10]
-   b    PC+4              <-- branches to the very next instruction
-   mov  x19, x0
-   ```
-   Root cause: commit `9bd61e8` ("emit param_decls into a dedicated prologue
-   block") added an empty-on-emit entry block ahead of the function body.
-   `opt_jump_cleanup`'s helper `empty_fallthrough_block` in
-   `src/opt/pass_jump.c` then explicitly bails out when `block == f->entry`,
-   so the empty entry block is never merged into its single successor.
-   Lifting that guard (with whatever safety condition `9bd61e8` was protecting
-   against — likely "first body block is a loop header") would let
-   jump-cleanup absorb the prologue block in the common case. **+1 insn ×
-   7.6M calls.**
+1. **Useless leading `b PC+4` at every function entry. [DONE]** Fixed by the
+   known-frame prologue rework (frame planning in `opt_emit_native` →
+   `func_begin_known_frame`). The branch was *not* the empty entry block — it
+   was filler left by the old two-phase prologue: `aa_emit_prologue` sized a
+   *fat* prologue (slim eligibility wasn't decided until `func_end`), then the
+   `func_end` patch rebuilt it as the shorter slim form and back-filled the
+   leftover words with `b PC+N; nop`. The known-frame path decides slim
+   eligibility *before* emitting, so it emits the slim/slim_small_frame prologue
+   directly — no filler, no patch. Functions now fall straight from the prologue
+   into the body. **-1 to -2 insns at entry, every function.**
 
 2. **Prologue compaction: 4-insn → 2-insn pre-indexed.** Half-done. cfree
    today emits:
@@ -157,13 +148,18 @@ worth comparing the splice inner loop directly.
 These help several benches. Both are partial; the binary-trees items above
 are the most concrete tests for whether each is complete.
 
-1. **Drop the leading `b PC+4` at function entry.** See binary-trees item 1.
-   Affects every cfree-compiled function, not just binary-trees.
+1. **Drop the leading `b PC+4` at function entry. [DONE]** See binary-trees
+   item 1. Resolved by the known-frame prologue (the optimizer plans the whole
+   frame up front, so the prologue is emitted final in its slim form rather than
+   fat-then-patched). Affected every cfree-compiled function.
 
 2. **Compact FP-frame prologue/epilogue.** See binary-trees item 2. The
-   2-insn pre-indexed form is wired in for the no-callee-save case; needs
-   to be extended to small frames with 1–2 callee-saves. Biggest absolute
-   payoff on call-heavy benches.
+   2-insn pre-indexed form is wired in for the no-callee-save case (Tier A);
+   extending it to small frames with 1–2 callee-saves still needs the frame
+   record moved to the bottom of the frame (fp-at-bottom layout) so a single
+   `stp x29,x30,[sp,#-N]!` covers both the sp decrement and the save — a
+   separable layout change, not unlocked by the frame-planning rework alone.
+   Biggest absolute payoff still open on call-heavy benches.
 
 3. **Hard-register copy coalescing for `IR_LOAD_IMM` sources.** See
    binary-trees item 3. The hint-propagation path covers `ldr` → call-arg
diff --git a/src/arch/aa64/native.c b/src/arch/aa64/native.c
@@ -210,6 +210,23 @@ typedef struct AANativeTarget {
    * when both would apply). Applies to almost every function with a small
    * frame, including those with callee-saves and locals. */
   u8 slim_small_frame;
+  /* Set by aa_func_begin_known_frame (optimizer path: the full frame is known
+   * up front, so the prologue, allocas, and tail epilogues are emitted final
+   * with no back-patching). Cleared by aa_func_begin (NativeDirectTarget
+   * single-pass path: worst-case prologue region reserved and patched, alloca /
+   * tail sites recorded and patched at func_end). This flag is the single
+   * discriminator between the two strategies throughout this file. */
+  u8 known_frame;
+  /* Set when the function body contains a dynamic alloca. On the known-frame
+   * path it comes from NativeKnownFrameDesc.has_alloca (needed before the body
+   * to settle slim-epilogue eligibility); on the single-pass path it tracks
+   * nalloca_patches. Disqualifies the slim small-frame epilogue. */
+  u8 has_alloca;
+  /* Set on the known-frame path once the frame is fixed and the prologue
+   * emitted. Any frame_slot request after this point would grow the frame the
+   * prologue already encoded — a silent miscompile — so aa_frame_slot panics.
+   * The optimizer is expected to plan every slot before the body. */
+  u8 frame_final;
 } AANativeTarget;
 
 static AANativeTarget* aa_of(NativeTarget* t) { return (AANativeTarget*)t; }
@@ -906,10 +923,11 @@ static void aa_emit_q_frame(AANativeTarget* a, int load, u32 qreg,
   aa_emit32(mc, aa_ldst_q_uimm(load, qreg, AA_TMP1, 0));
 }
 
-static void aa_emit_variadic_reg_saves(AANativeTarget* a) {
+/* Reserve the variadic register-save-area frame slots (gp then fp). Split from
+ * the store emission so the known-frame path can fix the full frame — including
+ * these slots — before the prologue, then emit the stores after it. */
+static void aa_reserve_variadic_reg_saves(AANativeTarget* a) {
   NativeFrameSlotDesc sd;
-  NativeAddr addr;
-  MemAccess mem;
   CfreeCgTypeId i64 = builtin_id(CFREE_CG_BUILTIN_I64);
   ABIVaListInfo vai = abi_va_list_layout(a->base.c->abi);
   if (vai.kind != ABI_VA_LIST_AAPCS64) return;
@@ -922,6 +940,16 @@ static void aa_emit_variadic_reg_saves(AANativeTarget* a) {
   sd.size = vai.fp_reg_count * vai.fp_slot_size;
   sd.align = 16;
   a->va_vr_slot = a->base.frame_slot(&a->base, &sd);
+}
+
+/* Emit the stores into the variadic register-save area. Slots must already be
+ * reserved (aa_reserve_variadic_reg_saves). */
+static void aa_emit_variadic_reg_save_stores(AANativeTarget* a) {
+  NativeAddr addr;
+  MemAccess mem;
+  CfreeCgTypeId i64 = builtin_id(CFREE_CG_BUILTIN_I64);
+  ABIVaListInfo vai = abi_va_list_layout(a->base.c->abi);
+  if (vai.kind != ABI_VA_LIST_AAPCS64) return;
   memset(&mem, 0, sizeof mem);
   mem.type = i64;
   mem.size = 8;
@@ -941,7 +969,10 @@ static void aa_emit_variadic_reg_saves(AANativeTarget* a) {
 
 static void aa_emit_entry_saves(AANativeTarget* a);
 
-static void aa_func_begin(NativeTarget* t, const CGFuncDesc* fd) {
+/* Per-function state reset + function-symbol / cfi / prologue-anchor setup
+ * shared by both entry points (aa_func_begin for the single-pass path,
+ * aa_func_begin_known_frame for the optimizer path). Emits no prologue. */
+static void aa_func_begin_common(NativeTarget* t, const CGFuncDesc* fd) {
   AANativeTarget* a = aa_of(t);
   MCEmitter* mc = t->mc;
   a->func = fd;
@@ -968,6 +999,9 @@ static void aa_func_begin(NativeTarget* t, const CGFuncDesc* fd) {
   a->ncallee_saves = 0;
   a->slim_prologue = 0;
   a->slim_small_frame = 0;
+  a->known_frame = 0;
+  a->has_alloca = 0;
+  a->frame_final = 0;
   mc->set_section(mc, fd->text_section_id);
   mc->emit_align(mc, 4, 0);
   a->func_start = mc->pos(mc);
@@ -976,49 +1010,72 @@ static void aa_func_begin(NativeTarget* t, const CGFuncDesc* fd) {
   a->prologue_pos = mc->pos(mc);
   a->minimal_prologue_words = 0;
   a->epilogue_label = mc->label_new(mc);
-  /* Optimizer path: emit nothing here. The exact-size prologue and the
-   * sret/variadic entry saves are emitted later by aa_emit_prologue, once the
-   * callee-save set and frame slots are known. The single-pass path reserves a
-   * worst-case region (patched in func_end) and emits the entry saves now. */
-  if (t->emit_minimal_prologue) return;
+}
+
+/* Single-pass (NativeDirectTarget) entry point: the frame is not known up
+ * front, so reserve a worst-case prologue region (patched in aa_func_end once
+ * max_outgoing / callee-saves are final) and emit the entry saves now. */
+static void aa_func_begin(NativeTarget* t, const CGFuncDesc* fd) {
+  AANativeTarget* a = aa_of(t);
+  MCEmitter* mc = t->mc;
+  aa_func_begin_common(t, fd);
   for (u32 i = 0; i < AA_PROLOGUE_WORDS; ++i) aa_emit32(mc, 0xd503201fu);
   aa_emit_entry_saves(a);
 }
 
-/* Emit the sret-pointer save (x8 → slot) and, for variadic functions, the
- * argument register-save area. Run immediately after the prologue frame setup
- * on both the single-pass path (from func_begin) and the optimizer path (from
- * aa_emit_prologue). */
-static void aa_emit_entry_saves(AANativeTarget* a) {
+/* Reserve the entry-save frame slots: the sret-pointer home (x8) and, for
+ * variadic functions, the argument register-save area. Reserving is split from
+ * emitting so the known-frame path can fix the full frame before the prologue;
+ * the single-pass path runs both back to back via aa_emit_entry_saves. */
+static void aa_reserve_entry_saves(AANativeTarget* a) {
   NativeTarget* t = &a->base;
   const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, a->func->fn_type);
   if (abi && abi->has_sret) {
     NativeFrameSlotDesc sd;
-    NativeAddr addr;
-    NativeLoc src;
-    MemAccess mem;
     memset(&sd, 0, sizeof sd);
     sd.type = builtin_id(CFREE_CG_BUILTIN_I64);
     sd.size = 8;
     sd.align = 8;
     sd.kind = NATIVE_FRAME_SLOT_SAVE;
     a->sret_ptr_slot = t->frame_slot(t, &sd);
+  }
+  if (abi && abi->variadic) aa_reserve_variadic_reg_saves(a);
+}
+
+/* Emit the entry-save stores (x8 → sret slot, then the variadic reg-save area).
+ * Slots must already be reserved (aa_reserve_entry_saves). */
+static void aa_emit_entry_save_stores(AANativeTarget* a) {
+  NativeTarget* t = &a->base;
+  const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, a->func->fn_type);
+  if (abi && abi->has_sret) {
+    NativeAddr addr;
+    NativeLoc src;
+    MemAccess mem;
+    CfreeCgTypeId i64 = builtin_id(CFREE_CG_BUILTIN_I64);
     memset(&addr, 0, sizeof addr);
     addr.base_kind = NATIVE_ADDR_BASE_FRAME;
     addr.base.frame = a->sret_ptr_slot;
-    addr.base_type = sd.type;
+    addr.base_type = i64;
     memset(&src, 0, sizeof src);
     src.kind = NATIVE_LOC_REG;
     src.cls = NATIVE_REG_INT;
-    src.type = sd.type;
+    src.type = i64;
     src.v.reg = 8u;
     memset(&mem, 0, sizeof mem);
-    mem.type = sd.type;
+    mem.type = i64;
     mem.size = 8;
     mem.align = 8;
     aa_emit_mem(a, 0, src, addr, mem);
   }
-  if (abi && abi->variadic) aa_emit_variadic_reg_saves(a);
+  if (abi && abi->variadic) aa_emit_variadic_reg_save_stores(a);
+}
+
+/* Reserve + emit the entry saves back to back. Single-pass (NativeDirectTarget)
+ * path, where the prologue region is a reserved worst-case block and slot
+ * offsets need not be final before it. */
+static void aa_emit_entry_saves(AANativeTarget* a) {
+  aa_reserve_entry_saves(a);
+  aa_emit_entry_save_stores(a);
 }
 
 static void aa_note_frame_state(NativeTarget* t,
@@ -1198,7 +1255,7 @@ static void aa_words_callee_saves(AANativeTarget* a, int save, u32* words,
 /* Build the prologue instruction words for `L` into `words` (capacity `cap`),
  * returning the count. Shared by the NativeDirectTarget patch path (reserves
  * a fixed worst-case region, then patches it here) and the optimizer path
- * (emits an exact-size region up front; see aa_emit_prologue).
+ * (aa_func_begin_known_frame emits exactly these words up front).
  *
  * All three variants establish the same post-prologue state defined by L:
  *   sp = caller's sp - L->frame_size
@@ -1263,23 +1320,6 @@ static void aa_patch_prologue(AANativeTarget* a, const AAFrameLayout* L,
     aa_patch32(a->base.obj, sec, a->prologue_pos + i * 4u, words[i]);
 }
 
-/* Optimizer path: emit an exact-size prologue in place (no reserved NOP
- * region). The callee-save set and the static frame slots are final by now, so
- * the prologue's instruction count is fixed; only the frame-size immediates
- * (sub sp / save-area address / fp = sp+saved_pair) still depend on body-
- * emitted temporaries and are patched in func_end. We size the region with
- * a frame that fits add/sub's imm12 (the real frame must too, or func_end's
- * rebuild — capped at this length — panics). */
-static void aa_emit_prologue(NativeTarget* t) {
-  AANativeTarget* a = aa_of(t);
-  u32 words[AA_PROLOGUE_WORDS];
-  AAFrameLayout L = aa_build_layout(a->cum_off, a->max_outgoing);
-  u32 n = aa_build_prologue_words(a, &L, words, AA_PROLOGUE_WORDS);
-  for (u32 i = 0; i < n; ++i) aa_emit32(t->mc, words[i]);
-  a->minimal_prologue_words = n;
-  aa_emit_entry_saves(a);
-}
-
 static void aa_emit_restore_frame(AANativeTarget* a, const AAFrameLayout* L) {
   MCEmitter* mc = a->base.mc;
   u32 words[AA_PROLOGUE_WORDS];
@@ -1349,35 +1389,27 @@ static void aa_func_end(NativeTarget* t) {
   AANativeTarget* a = aa_of(t);
   MCEmitter* mc = t->mc;
   AAFrameLayout L = aa_build_layout(a->cum_off, a->max_outgoing);
-  /* Optimizer path emitted an exact-size prologue (minimal_prologue_words);
-   * the single-pass path reserved a fixed worst-case region. Either way the
-   * frame-size immediates are only final now, so patch the region in place. */
+  /* known_frame (optimizer): prologue, allocas, and tail epilogues were emitted
+   * final and slim eligibility was settled in aa_func_begin_known_frame — there
+   * is nothing to patch. Single-pass (NDT): a worst-case prologue region was
+   * reserved and the deferred patches recorded; resolve them now that the frame
+   * is final. The NDT path always uses the fat prologue/epilogue (slim_* left 0
+   * by aa_func_begin_common, since its reserved region is much larger). */
   u32 prologue_region =
-      t->emit_minimal_prologue ? a->minimal_prologue_words : AA_PROLOGUE_WORDS;
-  /* Slim Tier A eligibility (set before emitting the epilogue / patching the
-   * prologue so the *_restore_frame / *_build_prologue_words helpers pick the
-   * slim form). Conditions: no callee-saves needed, no alloca, no body
-   * slots (locals/spills/sret/variadic — all counted in slot_bytes), no
-   * outgoing stack args, optimizer path only (the NDT reserves a much
-   * larger prologue region). */
-  a->slim_prologue =
-      t->emit_minimal_prologue && a->ncallee_saves == 0 && a->nalloca == 0 &&
-      L.slot_bytes == 0 && L.out_stack == 0;
-  /* Universal small-frame fast path: skip the x17/x10 scratch when the
-   * saved-pair offset fits stp's signed 7-bit scaled immediate. Mutually
-   * exclusive with the Tier A slim form (Tier A is strictly tighter).
-   * Disqualify alloca: alloca dynamically moves sp during the body, and the
-   * fat epilogue (sp = fp + 16 via x10) is what restores sp from fp; the
-   * slim_small_frame epilogue's `add sp, sp, #N` only undoes the static
-   * frame, leaving sp pointing into the alloca area. */
-  a->slim_small_frame = !a->slim_prologue && a->nalloca == 0 &&
-                        aa_sp_off_saved_pair(&L) <= 504u;
+      a->known_frame ? a->minimal_prologue_words : AA_PROLOGUE_WORDS;
   mc->label_place(mc, a->epilogue_label);
   aa_emit_callee_restores(a);
   aa_emit_restore_frame(a, &L);
   aa_emit32(mc, aa64_ret(AA_LR));
-  aa_patch_prologue(a, &L, prologue_region);
-  aa_apply_patches(a, &L);
+  if (a->known_frame) {
+    /* The frame-planning pre-pass plus final prologue/alloca/tail emission must
+     * leave nothing deferred; a stray patch would mean a body-time frame change
+     * the final prologue never saw. */
+    if (a->npatches != 0) aa_panic(a, "known-frame path left deferred patches");
+  } else {
+    aa_patch_prologue(a, &L, prologue_region);
+    aa_apply_patches(a, &L);
+  }
   if (mc->cfi_set_next_pc_offset && mc->cfi_def_cfa && mc->cfi_offset) {
     mc->cfi_set_next_pc_offset(mc, prologue_region * 4u);
     /* CFA = caller's sp = fp + AA_FRAME_SAVE_SIZE. saved fp/lr at fp/fp+8
@@ -1409,6 +1441,8 @@ static NativeFrameSlot aa_frame_slot(NativeTarget* t,
   AANativeSlot* s;
   u32 size = d->size ? d->size : 8u;
   u32 align = d->align ? d->align : 1u;
+  if (a->frame_final)
+    aa_panic(a, "frame slot requested after known-frame prologue");
   if (a->nslots == a->slots_cap) {
     u32 cap = a->slots_cap ? a->slots_cap * 2u : 16u;
     AANativeSlot* nb = arena_zarray(t->c->tu, AANativeSlot, cap);
@@ -1425,19 +1459,62 @@ static NativeFrameSlot aa_frame_slot(NativeTarget* t,
   return a->nslots;
 }
 
+/* Optimizer entry point: the full frame is supplied up front, so the prologue,
+ * entry saves, slim-form eligibility, allocas, and tail epilogues are all final
+ * the moment they are emitted — no back-patching (aa_func_end skips the patch
+ * passes when a->known_frame). Slot creation order matches the single-pass path
+ * (callee-saves first for stur range, then the static slots, then sret/variadic
+ * entry saves) so offsets are identical to what the patch path would produce. */
 static void aa_func_begin_known_frame(NativeTarget* t, const CGFuncDesc* fd,
                                       const NativeKnownFrameDesc* frame,
                                       NativeFrameSlot* out_slots) {
-  aa_func_begin(t, fd);
+  AANativeTarget* a = aa_of(t);
+  AAFrameLayout L;
+  u32 words[AA_PROLOGUE_WORDS];
+  u32 n;
+  aa_func_begin_common(t, fd);
+  a->known_frame = 1;
   if (frame) {
-    AANativeTarget* a = aa_of(t);
-    if (frame->max_outgoing > a->max_outgoing)
-      a->max_outgoing = frame->max_outgoing;
+    a->has_alloca = frame->has_alloca;
+    if (frame->callee_saved_used && frame->ncallee_classes)
+      aa_reserve_callee_saves(t, frame->callee_saved_used,
+                              frame->ncallee_classes);
     for (u32 i = 0; i < frame->nslots; ++i) {
       NativeFrameSlot slot = aa_frame_slot(t, &frame->slots[i]);
       if (out_slots) out_slots[i] = slot;
     }
+    aa_reserve_entry_saves(a);
+    /* Reserve the atomic-RMW scratch spill last (matching its lazy position in
+     * the single-pass path), so aa_saved_tmp_spill reuses it instead of growing
+     * the frame mid-body. */
+    if (frame->needs_scratch_spill) {
+      NativeFrameSlotDesc sd;
+      memset(&sd, 0, sizeof sd);
+      sd.type = builtin_id(CFREE_CG_BUILTIN_I64);
+      sd.size = 8;
+      sd.align = 8;
+      sd.kind = NATIVE_FRAME_SLOT_SPILL;
+      a->saved_tmp_slot = a->base.frame_slot(&a->base, &sd);
+    }
+    if (frame->max_outgoing > a->max_outgoing)
+      a->max_outgoing = frame->max_outgoing;
   }
+  /* Frame is final: slot_bytes (cum_off) and out_stack (max_outgoing) are both
+   * known, so the prologue immediates and slim-form choice are settled here. */
+  L = aa_build_layout(a->cum_off, a->max_outgoing);
+  /* Slim Tier A: no callee-saves, no alloca, no body slots, no outgoing stack
+   * args. slim_small_frame: skip the x17/x10 scratch when the saved-pair offset
+   * fits stp's signed 7-bit scaled immediate. (See aa_func_end for the
+   * single-pass path, which never takes the slim form.) */
+  a->slim_prologue = a->ncallee_saves == 0 && !a->has_alloca &&
+                     L.slot_bytes == 0 && L.out_stack == 0;
+  a->slim_small_frame = !a->slim_prologue && !a->has_alloca &&
+                        aa_sp_off_saved_pair(&L) <= 504u;
+  n = aa_build_prologue_words(a, &L, words, AA_PROLOGUE_WORDS);
+  for (u32 i = 0; i < n; ++i) aa_emit32(t->mc, words[i]);
+  a->minimal_prologue_words = n;
+  a->frame_final = 1;
+  aa_emit_entry_save_stores(a);
 }
 
 static void aa_spill(NativeTarget* t, NativeLoc src, NativeFrameSlot slot,
@@ -2028,14 +2105,22 @@ static void aa_alloca(NativeTarget* t, NativeLoc dst, NativeLoc size,
   aa_emit32(t->mc, aa64_add_imm(1, AA_TMP1, AA_SP, 0, 0));
   aa_emit32(t->mc, aa64_sub(1, AA_TMP1, AA_TMP1, AA_TMP0));
   aa_emit32(t->mc, aa64_add_imm(1, AA_SP, AA_TMP1, 0, 0));
-  {
+  /* The alloca result is sp + outgoing-area bytes. On the known-frame path
+   * max_outgoing is already final, so emit the final `add dst, sp, #N` here; on
+   * the single-pass path it is not known yet, so record a patch. */
+  if (a->known_frame) {
+    u32 imm12, sh;
+    if (!aa64_addsub_imm_fits(a->max_outgoing, &imm12, &sh))
+      aa_panic(a, "outgoing area too large for alloca result");
+    aa_emit32(t->mc, aa64_add_imm(1, loc_reg(dst), AA_SP, imm12, sh));
+  } else {
     AAPatch* p = aa_patch_alloc(a);
     p->kind = AA_PATCH_ALLOCA;
     p->pos = t->mc->pos(t->mc);
     p->u.dst_reg = loc_reg(dst);
     a->nalloca++;
+    aa_emit32(t->mc, aa64_add_imm(1, loc_reg(dst), AA_SP, 0, 0));
   }
-  aa_emit32(t->mc, aa64_add_imm(1, loc_reg(dst), AA_SP, 0, 0));
 }
 
 static MemAccess aa_mem_for_type(NativeTarget* t, CfreeCgTypeId type,
@@ -2283,6 +2368,14 @@ static u32 aa_signature_stack_bytes(NativeTarget* t, CfreeCgTypeId fn_type,
   return aa_call_stack_size(t, &d);
 }
 
+/* Pure NativeTarget.call_stack_bytes: outgoing stack bytes for a full call
+ * descriptor (handles variadic stack args, unlike signature_stack_bytes which
+ * sees only the fixed params). aa_call_stack_size reads only fn_type and each
+ * args[i].type, so the frame-planning pre-pass can call this before emitting. */
+static u32 aa_call_stack_bytes(NativeTarget* t, const NativeCallDesc* desc) {
+  return aa_call_stack_size(t, desc);
+}
+
 /* One register-passed call argument: write `src` (or its address) into the
  * argument register `dst`. Collected during planning and emitted as a batch so
  * the backend can order them as a parallel copy (see aa_emit_reg_arg_moves). */
@@ -2535,6 +2628,31 @@ static void aa_ret(NativeTarget* t);
 
 static void aa_emit_tail_site(NativeTarget* t, NativeLoc callee) {
   AANativeTarget* a = aa_of(t);
+  if (a->known_frame) {
+    /* Frame is final: emit the tail epilogue (callee restores + frame restore +
+     * branch) directly, exactly the words aa_apply_patches would patch in but
+     * without the reserved NOP padding. */
+    AAFrameLayout L = aa_build_layout(a->cum_off, a->max_outgoing);
+    u32 words[AA_TAIL_WORDS];
+    u32 n = 0;
+    aa_words_callee_restores(a, words, AA_TAIL_WORDS, &n);
+    aa_words_restore_frame(a, words, AA_TAIL_WORDS, &n, &L);
+    if (n >= AA_TAIL_WORDS) aa_panic(a, "tail epilogue too large");
+    for (u32 i = 0; i < n; ++i) aa_emit32(t->mc, words[i]);
+    if (callee.kind == NATIVE_LOC_REG) {
+      aa_emit32(t->mc, aa64_br(loc_reg(callee)));
+    } else if (callee.kind == NATIVE_LOC_GLOBAL) {
+      u32 pos = t->mc->pos(t->mc);
+      aa_emit32(t->mc, aa64_b(0));
+      t->mc->emit_reloc_at(t->mc, t->mc->section_id, pos, R_AARCH64_JUMP26,
+                           callee.v.global.sym, callee.v.global.addend, 0, 0);
+    } else {
+      aa_panic(a, "unsupported tail target");
+    }
+    return;
+  }
+  /* Single-pass: reserve a worst-case region and record a patch; the callee
+   * restores and frame restore depend on the not-yet-final frame layout. */
   AAPatch* p = aa_patch_alloc(a);
   p->kind = AA_PATCH_TAIL;
   p->pos = t->mc->pos(t->mc);
@@ -3299,8 +3417,8 @@ NativeTarget* aa64_native_target_new(Compiler* c, ObjBuilder* obj,
   t->func_begin_known_frame = aa_func_begin_known_frame;
   t->note_frame_state = aa_note_frame_state;
   t->reserve_callee_saves = aa_reserve_callee_saves;
-  t->emit_prologue = aa_emit_prologue;
   t->signature_stack_bytes = aa_signature_stack_bytes;
+  t->call_stack_bytes = aa_call_stack_bytes;
   t->has_store_zero_reg = 1;
   t->store_zero_reg = 31u; /* wzr/xzr in the Rt position of a store */
   t->func_end = aa_func_end;
diff --git a/src/arch/native_target.h b/src/arch/native_target.h
@@ -48,6 +48,22 @@ typedef struct NativeKnownFrameDesc {
   u32 nslots;
   u32 max_outgoing;
   u32 align;
+  /* Callee-saved hard registers the allocator assigned, one bitmask per
+   * NativeAllocClass (indexed by class id). The backend reserves a save slot
+   * and emits the prologue save / epilogue restore for each — equivalent to a
+   * reserve_callee_saves() call, but folded into the known-frame setup so the
+   * full frame is fixed before the prologue is emitted. NULL / 0 means none. */
+  const u32* callee_saved_used;
+  u32 ncallee_classes;
+  /* Whether the function body contains a dynamic alloca. The backend needs this
+   * up front (before the body) to decide prologue/epilogue form, since with a
+   * known frame the slim-epilogue eligibility is settled at func_begin. */
+  u8 has_alloca;
+  /* Whether the body has an operation that needs a backend-internal scratch
+   * spill slot — on aa64, an atomic read-modify-write, whose retry loop spills
+   * one scratch register. The backend reserves the slot up front so the body
+   * never grows the frame after the prologue. */
+  u8 needs_scratch_spill;
 } NativeKnownFrameDesc;
 
 typedef enum NativeAllocClass {
@@ -298,6 +314,13 @@ struct NativeTarget {
    * out-pointer may be NULL. May itself be NULL. */
   u32 (*signature_stack_bytes)(NativeTarget*, CfreeCgTypeId fn_type,
                                int* variadic, u32* nparams);
+  /* Pure query: the outgoing stack-argument bytes a call with this descriptor
+   * uses, rounded to the ABI's outgoing-area alignment. Reads only fn_type,
+   * flags, nargs, and each args[i].type — never argument *locations* — so the
+   * optimizer can call it in a frame-planning pre-pass, before any argument
+   * marshalling is emitted, to size the outgoing area. Must equal the
+   * stack_arg_size plan_call computes for the same descriptor. May be NULL. */
+  u32 (*call_stack_bytes)(NativeTarget*, const NativeCallDesc*);
   /* Integer hardware zero register, if the ISA has one (aa64 wzr/xzr, rv64
    * x0). When `has_store_zero_reg` is set, the emit path stores a constant 0
    * straight from `store_zero_reg` instead of materializing 0 into a scratch
diff --git a/src/opt/pass_native_emit.c b/src/opt/pass_native_emit.c
@@ -24,7 +24,6 @@ typedef struct NativeEmitCtx {
   NativeFrameSlot* slot_map;
   MCLabel* labels;
   u8* label_placed;
-  u32 max_outgoing;
   ObjSecId local_static_sec;
   ObjSymId local_static_sym;
   u32 local_static_base;
@@ -628,19 +627,19 @@ static void emit_call(NativeEmitCtx* e, Inst* in) {
     args[i] = abi_storage_loc(e, &aux->desc.args[i], in->loc);
   if (aux->desc.ret.storage.kind) {
     CfreeCgTypeId rty = aux->desc.ret.type;
-    int scalar = !cg_type_is_aggregate(e->c, rty) &&
-                 type_size_or(e->c, rty, 8u) <= 8u;
     results = arena_zarray(e->f->arena, NativeLoc, 1);
     final_result = abi_storage_loc(e, &aux->desc.ret, in->loc);
-    /* Scalar result: hand plan_call the value's real destination (the MIR
-     * result reg, or its spill slot) directly, so it emits one move out of the
-     * ABI result register. Routing every scalar result through a fresh temp
-     * slot — store x0 then immediately reload — was a pure round trip on every
-     * call; emit_ret already avoids the analogous trip on returns. The temp
-     * slot is kept only for aggregate / oversized results, which plan_call /
-     * the callee write in parts and must land in memory. */
-    if (scalar && (final_result.kind == NATIVE_LOC_REG ||
-                   final_result.kind == NATIVE_LOC_FRAME)) {
+    /* Hand plan_call the value's real destination directly whenever it is a
+     * register or a frame slot: a scalar result is a single move out of the ABI
+     * result register, and an aggregate / oversized result — which plan_call or
+     * the callee writes in parts and so must land in memory — lands straight in
+     * its frame home. Routing either through a fresh temp slot (store then
+     * reload / copy_bytes) was a pure round trip on every call. The temp slot is
+     * a fallback for the rare result whose storage is neither a register nor a
+     * frame slot (e.g. written into a global); lowering hoists aggregates to a
+     * frame home (opt_lower_to_mir), so this branch is scalar-only in practice. */
+    if (final_result.kind == NATIVE_LOC_REG ||
+        final_result.kind == NATIVE_LOC_FRAME) {
       results[0] = final_result;
     } else {
       result_slot = temp_slot(e, rty, in->loc, NATIVE_FRAME_SLOT_SPILL);
@@ -657,8 +656,6 @@ static void emit_call(NativeEmitCtx* e, Inst* in) {
   d.tail_policy = aux->desc.tail_policy;
   d.inline_policy = aux->desc.inline_policy;
   e->target->plan_call(e->target, &d, &plan);
-  if (plan.stack_arg_size > e->max_outgoing)
-    e->max_outgoing = plan.stack_arg_size;
   for (u32 i = 0; i < plan.nargs; ++i)
     write_loc(e, plan.args[i].dst, plan.args[i].src, plan.args[i].mem, in->loc);
   if (plan.callee.kind != NATIVE_LOC_REG &&
@@ -805,19 +802,15 @@ static void emit_inst(NativeEmitCtx* e, u32 block, u32 order_index, Inst* in,
           class_for_type(e, in->opnds[1].type) == NATIVE_REG_INT)
         src = loc_reg(in->opnds[1].type, NATIVE_REG_INT,
                       e->target->store_zero_reg);
+      /* Source register aliases the address base/index (e.g. `*p = (T)p`).
+       * Collapse the address into a scratch register: collapse_addr_to_reg
+       * selects a scratch distinct from both base and index — hence distinct
+       * from `src` — so the store reads `src` and writes through the fresh
+       * scratch with no alias. This stays entirely in registers; the frame is
+       * fully planned before emission, so emit never allocates a slot here. */
       if (src.kind == NATIVE_LOC_REG && (src.v.reg == addr_base_reg(&addr) ||
-                                         src.v.reg == addr_index_reg(&addr))) {
-        NativeFrameSlot slot =
-            temp_slot(e, in->opnds[1].type, in->loc, NATIVE_FRAME_SLOT_SPILL);
-        NativeLoc frame = loc_frame(in->opnds[1].type,
-                                    class_for_type(e, in->opnds[1].type), slot);
-        write_loc(e, frame, src, mem_for_type(e->c, in->opnds[1].type),
-                  in->loc);
+                                         src.v.reg == addr_index_reg(&addr)))
         collapse_addr_to_reg(e, &addr, in->loc);
-        src = materialize(e, frame, class_for_type(e, in->opnds[1].type),
-                          in->opnds[1].type, addr_base_reg(&addr), REG_NONE,
-                          in->loc);
-      }
       if (src.kind != NATIVE_LOC_REG) {
         if (!scratch_available(e, class_for_type(e, in->opnds[1].type),
                                addr_base_reg(&addr), addr_index_reg(&addr)))
@@ -1286,24 +1279,6 @@ static void emit_block(NativeEmitCtx* e, u32 block, u32 order_index,
   }
 }
 
-static void map_frame_slots(NativeEmitCtx* e) {
-  e->slot_map =
-      arena_zarray(e->f->arena, NativeFrameSlot, e->f->nframe_slots + 1u);
-  for (u32 i = 0; i < e->f->nframe_slots; ++i) {
-    IRFrameSlot* s = &e->f->frame_slots[i];
-    NativeFrameSlotDesc d;
-    memset(&d, 0, sizeof d);
-    d.type = s->type;
-    d.name = s->name;
-    d.loc = s->loc;
-    d.size = s->size;
-    d.align = s->align;
-    d.kind = s->kind;
-    d.flags = s->flags;
-    e->slot_map[s->id] = e->target->frame_slot(e->target, &d);
-  }
-}
-
 #define EMIT_MAX_REG_CLASSES 4u
 
 static void collect_used_reg(Func* f, Inst* in, OptOperand* op, int is_def,
@@ -1317,37 +1292,128 @@ static void collect_used_reg(Func* f, Inst* in, OptOperand* op, int is_def,
     used[op->cls] |= 1u << op->v.reg;
 }
 
-/* After register allocation the MIR names hard registers directly, so we can
- * scan it for the callee-saved registers the allocator assigned and ask the
- * target to save/restore them. Must run after func_begin and before frame-slot
- * mapping so the target can place the save slots first. */
-static void reserve_callee_saves(NativeEmitCtx* e) {
+/* After register allocation the MIR names hard registers directly, so we scan
+ * it for the callee-saved registers the allocator assigned. Fills `used[cls]`
+ * (one bitmask per alloc class, masked to each class's callee-saved set) and
+ * returns the class count. The masks feed NativeKnownFrameDesc so the backend
+ * reserves the save slots as part of the up-front frame. */
+static u32 compute_callee_saved_used(NativeEmitCtx* e, u32* used, u32 cap) {
   NativeTarget* t = e->target;
   const NativeRegInfo* ri = t->regs;
-  u32 used[EMIT_MAX_REG_CLASSES];
   u32 nclasses;
-  if (!t->reserve_callee_saves || !ri) return;
-  memset(used, 0, sizeof used);
+  for (u32 i = 0; i < cap; ++i) used[i] = 0;
+  if (!ri) return 0;
   for (u32 b = 0; b < e->f->nblocks; ++b) {
     Block* bl = &e->f->blocks[b];
     for (u32 i = 0; i < bl->ninsts; ++i)
       opt_walk_inst_operands(e->f, &bl->insts[i], collect_used_reg, used);
   }
-  nclasses = ri->nclasses < EMIT_MAX_REG_CLASSES ? ri->nclasses
-                                                 : EMIT_MAX_REG_CLASSES;
+  nclasses = ri->nclasses < cap ? ri->nclasses : cap;
   for (u32 i = 0; i < ri->nclasses; ++i) {
     const NativeAllocClassInfo* ci = &ri->classes[i];
-    if (ci->cls < EMIT_MAX_REG_CLASSES)
-      used[ci->cls] &= ci->callee_saved_mask;
+    if (ci->cls < cap) used[ci->cls] &= ci->callee_saved_mask;
+  }
+  return nclasses;
+}
+
+/* Plan the complete call frame before any code is emitted, then hand it to the
+ * backend via func_begin_known_frame so the prologue is emitted final. The
+ * optimizer knows everything the frame needs after register allocation and MIR
+ * lowering: the callee-saved set (scanned from the MIR), every static frame
+ * slot (f->frame_slots), and the outgoing-arg area (the max over all calls of
+ * the pure call_stack_bytes query). The body therefore allocates no slots, so
+ * the frame is final up front and nothing is back-patched. Populates
+ * e->slot_map from the backend-assigned slot handles for the body to use. */
+static void plan_frame(NativeEmitCtx* e, const CGFuncDesc* fd) {
+  NativeTarget* t = e->target;
+  NativeKnownFrameDesc frame;
+  NativeFrameSlotDesc* slots = NULL;
+  NativeFrameSlot* out_slots = NULL;
+  u32 used[EMIT_MAX_REG_CLASSES];
+  u32 nclasses;
+  u32 max_args = 0, max_outgoing = 0;
+  u8 has_alloca = 0;
+  u8 needs_scratch_spill = 0;
+  memset(&frame, 0, sizeof frame);
+  nclasses = t->reserve_callee_saves
+                 ? compute_callee_saved_used(e, used, EMIT_MAX_REG_CLASSES)
+                 : 0u;
+  /* Outgoing-arg area = max stack-arg bytes over all calls; also note alloca. */
+  for (u32 b = 0; b < e->f->nblocks; ++b) {
+    Block* bl = &e->f->blocks[b];
+    for (u32 i = 0; i < bl->ninsts; ++i) {
+      Inst* in = &bl->insts[i];
+      if ((IROp)in->op == IR_ALLOCA) {
+        has_alloca = 1;
+      } else if ((IROp)in->op == IR_ATOMIC_RMW) {
+        needs_scratch_spill = 1;
+      } else if ((IROp)in->op == IR_CALL) {
+        IRCallAux* aux = (IRCallAux*)in->extra.aux;
+        if (aux && aux->desc.nargs > max_args) max_args = aux->desc.nargs;
+      }
+    }
+  }
+  if (t->call_stack_bytes) {
+    NativeLoc* args =
+        max_args ? arena_zarray(e->f->arena, NativeLoc, max_args) : NULL;
+    for (u32 b = 0; b < e->f->nblocks; ++b) {
+      Block* bl = &e->f->blocks[b];
+      for (u32 i = 0; i < bl->ninsts; ++i) {
+        Inst* in = &bl->insts[i];
+        IRCallAux* aux;
+        NativeCallDesc d;
+        u32 sb;
+        if ((IROp)in->op != IR_CALL) continue;
+        aux = (IRCallAux*)in->extra.aux;
+        if (!aux) continue;
+        memset(&d, 0, sizeof d);
+        d.fn_type = aux->desc.fn_type;
+        d.flags = aux->desc.flags;
+        d.nargs = aux->desc.nargs;
+        for (u32 k = 0; k < aux->desc.nargs; ++k) {
+          memset(&args[k], 0, sizeof args[k]);
+          args[k].type = aux->desc.args[k].type;
+        }
+        d.args = args;
+        sb = t->call_stack_bytes(t, &d);
+        if (sb > max_outgoing) max_outgoing = sb;
+      }
+    }
+  }
+  e->slot_map =
+      arena_zarray(e->f->arena, NativeFrameSlot, e->f->nframe_slots + 1u);
+  if (e->f->nframe_slots) {
+    slots = arena_zarray(e->f->arena, NativeFrameSlotDesc, e->f->nframe_slots);
+    out_slots = arena_zarray(e->f->arena, NativeFrameSlot, e->f->nframe_slots);
+    for (u32 i = 0; i < e->f->nframe_slots; ++i) {
+      IRFrameSlot* s = &e->f->frame_slots[i];
+      NativeFrameSlotDesc* d = &slots[i];
+      memset(d, 0, sizeof *d);
+      d->type = s->type;
+      d->name = s->name;
+      d->loc = s->loc;
+      d->size = s->size;
+      d->align = s->align;
+      d->kind = s->kind;
+      d->flags = s->flags;
+    }
   }
-  t->reserve_callee_saves(t, used, nclasses);
+  frame.slots = slots;
+  frame.nslots = e->f->nframe_slots;
+  frame.max_outgoing = max_outgoing;
+  frame.callee_saved_used = nclasses ? used : NULL;
+  frame.ncallee_classes = nclasses;
+  frame.has_alloca = has_alloca;
+  frame.needs_scratch_spill = needs_scratch_spill;
+  t->func_begin_known_frame(t, fd, &frame, out_slots);
+  for (u32 i = 0; i < e->f->nframe_slots; ++i)
+    e->slot_map[e->f->frame_slots[i].id] = out_slots[i];
 }
 
 void opt_emit_native(Compiler* c, Func* f, NativeTarget* target) {
   NativeEmitCtx e;
   Func view;
   CGFuncDesc fd;
-  NativeFramePatchState state;
   if (!f || !target) return;
   memset(&e, 0, sizeof e);
   if (f->mir) {
@@ -1375,16 +1441,13 @@ void opt_emit_native(Compiler* c, Func* f, NativeTarget* target) {
   metrics_scope_end(c, "opt.native_emit.setup");
 
   metrics_scope_begin(c, "opt.native_emit.func_begin");
-  /* The optimizer path knows the callee-save set and frame slots before the
-   * body, so the backend can emit an exact-size prologue here rather than
-   * reserving a worst-case NOP region patched at func_end. Signal this before
-   * func_begin (so it skips the reserved region) and emit the prologue once
-   * reserve_callee_saves + map_frame_slots have run. */
-  target->emit_minimal_prologue = target->emit_prologue != NULL;
-  target->func_begin(target, &fd);
-  reserve_callee_saves(&e);
-  map_frame_slots(&e);
-  if (target->emit_minimal_prologue) target->emit_prologue(target);
+  /* The optimizer has the whole frame after regalloc + MIR lowering, so it
+   * plans it up front (plan_frame) and drives func_begin_known_frame: the
+   * backend emits a final prologue with no reserved NOP region and no
+   * back-patching. The body allocates no frame slots, so the frame stays final;
+   * allocas and tail epilogues are emitted final too. (Contrast the
+   * single-pass NativeDirectTarget path, which reserves and patches.) */
+  plan_frame(&e, &fd);
   bind_params(&e);
   metrics_scope_end(c, "opt.native_emit.func_begin");
 
@@ -1393,10 +1456,6 @@ void opt_emit_native(Compiler* c, Func* f, NativeTarget* target) {
     emit_block(&e, e.f->emit_order[i], i, &fd);
   metrics_scope_end(c, "opt.native_emit.body");
 
-  memset(&state, 0, sizeof state);
-  state.max_outgoing = e.max_outgoing;
-  if (target->note_frame_state) target->note_frame_state(target, &state);
-  if (target->patch_apply) target->patch_apply(target);
   metrics_scope_begin(c, "opt.native_emit.func_end");
   target->func_end(target);
   metrics_scope_end(c, "opt.native_emit.func_end");
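
Note on the plan_frame hunk above: the up-front planning reduces to two small computations. The outgoing-arg area is the maximum stack-argument size over all calls, and each class's used-register mask is intersected with that class's callee-saved mask. The following is a minimal standalone sketch of that idea, not the kit implementation. `SketchCall`, `SketchFrame`, and `sketch_plan_frame` are hypothetical names, and the per-call byte counts stand in for whatever a `call_stack_bytes`-style query would return.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one call site: only the number of stack bytes
 * its outgoing arguments need (assumed to come from a pure, emission-free
 * query such as call_stack_bytes). */
typedef struct {
  uint32_t stack_arg_bytes;
} SketchCall;

/* Hypothetical stand-in for the planned frame: just the two fields this
 * sketch computes. */
typedef struct {
  uint32_t max_outgoing;      /* bytes reserved for outgoing stack args */
  uint32_t callee_saved_used; /* callee-saved registers the body touches */
} SketchFrame;

/* Plan the frame before emitting anything:
 *  - max_outgoing is the max over all calls, so one area fits every call;
 *  - callee_saved_used keeps only the used registers that are callee-saved,
 *    so the prologue saves exactly those and nothing else. */
static SketchFrame sketch_plan_frame(const SketchCall* calls, size_t ncalls,
                                     uint32_t used_regs,
                                     uint32_t callee_saved_mask) {
  SketchFrame fr = {0u, 0u};
  assert(calls != NULL || ncalls == 0);
  for (size_t i = 0; i < ncalls; ++i) {
    if (calls[i].stack_arg_bytes > fr.max_outgoing)
      fr.max_outgoing = calls[i].stack_arg_bytes;
  }
  fr.callee_saved_used = used_regs & callee_saved_mask;
  return fr;
}
```

For example, with calls needing 0, 32, and 16 stack bytes, the sketch reserves 32. With an illustrative callee-saved mask of 0x1ff80000 (bits 19 to 28) and used registers 0x00180003, only bits 19 and 20 survive the intersection. Because both values are known before the first instruction is emitted, a prologue built from them never needs patching.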
