commit 7eca5a4d48b4d1fc043f5cd81bd010fbd1785b1f
parent 90f2ba1a39b3ad8ae0ba9119ce548d2f5d85cd19
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Thu, 28 May 2026 13:49:09 -0700
opt/aa64: plan the O1 call frame up front, drop all back-patching
The O1 optimizer now computes the complete call frame before emission
(plan_frame: callee-saved set, every static slot, outgoing-arg area via
the new pure NativeTarget.call_stack_bytes, has_alloca, atomic-RMW scratch
spill) and drives func_begin_known_frame. The aa64 backend emits a final
prologue, allocas, and tail epilogues with no back-patching; the single-pass
NativeDirectTarget keeps its reserve-and-patch strategy. An a->known_frame
flag is the sole discriminator, and a frame_final guard panics if any
frame slot is requested after the prologue (which caught atomic RMW's
mid-body spill).
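A toy model of the plan-then-lock discipline described above, as a minimal sketch only. Every name here (ToyFrame, toy_frame_slot, toy_begin_known_frame) is hypothetical and not part of cfree. The sketch returns a sentinel where the real guard panics, and it assumes the frame size is simply slots plus outgoing area rounded to 16 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy frame, not the cfree AANativeTarget. */
typedef struct ToyFrame {
  uint32_t cum_off;      /* bytes of static slots reserved so far */
  uint32_t max_outgoing; /* outgoing-argument area, in bytes */
  uint8_t frame_final;   /* set once the prologue would have been emitted */
} ToyFrame;

typedef struct ToySlotDesc {
  uint32_t size;
  uint32_t align; /* must be a power of two */
} ToySlotDesc;

#define TOY_SLOT_REJECTED UINT32_MAX

/* Round v up to a multiple of a (a is a nonzero power of two). */
static uint32_t toy_align_up(uint32_t v, uint32_t a) {
  assert(a != 0 && (a & (a - 1u)) == 0);
  return (v + a - 1u) & ~(a - 1u);
}

/* Reserve one slot and return its end offset. Once the frame is final the
 * request is rejected and the frame is left unchanged; the real backend
 * panics here instead of returning a sentinel. */
static uint32_t toy_frame_slot(ToyFrame* f, const ToySlotDesc* d) {
  if (f->frame_final) return TOY_SLOT_REJECTED;
  f->cum_off = toy_align_up(f->cum_off, d->align) + d->size;
  return f->cum_off;
}

/* Plan-then-emit: reserve every slot, then lock the frame. Returns the final
 * frame size (assumed here: slots + outgoing area, rounded up to 16). */
static uint32_t toy_begin_known_frame(ToyFrame* f, const ToySlotDesc* slots,
                                      size_t nslots, uint32_t max_outgoing) {
  f->cum_off = 0;
  f->frame_final = 0;
  f->max_outgoing = max_outgoing;
  for (size_t i = 0; i < nslots; ++i) toy_frame_slot(f, &slots[i]);
  f->frame_final = 1; /* a prologue emitted now encodes the final size */
  return toy_align_up(f->cum_off + f->max_outgoing, 16u);
}
```

Any slot requested after toy_begin_known_frame returns is rejected rather than silently growing a frame the prologue has already encoded, which is the class of bug the guard is meant to surface.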
Emit no longer invents body-time frame slots: aggregate/oversized call
results land directly in their frame home (dropping a redundant temp +
store/reload/copy), and a store whose source aliases its address base
collapses the address into a scratch register instead of spilling.
A side effect of deciding slim-prologue eligibility before emitting: the
leading `b PC+4; nop` filler (an artifact of the old fat-then-patch-to-slim
scheme) is gone at every function entry; prologue/epilogue are otherwise
byte-identical. aa_emit_prologue / emit_minimal_prologue are retired from the
O1 path. x64/rv64 are untouched.
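The up-front form selection can be sketched as a pure function of the planned frame. This is an illustrative sketch: ToyFramePlan and toy_choose_prologue are hypothetical names, and saved_pair_off stands in for whatever aa_sp_off_saved_pair computes. The 504 bound is the largest offset a 64-bit `stp` can encode (signed 7-bit immediate scaled by 8, so 63 * 8).

```c
#include <stdint.h>

typedef enum ToyPrologueForm {
  TOY_FAT,              /* general form, goes through scratch registers */
  TOY_SLIM,             /* nothing but the fp/lr pair to save */
  TOY_SLIM_SMALL_FRAME  /* small static frame, saved pair reachable by stp */
} ToyPrologueForm;

/* Hypothetical summary of a planned frame; field names mirror the conditions
 * in the diff but the struct itself is not a cfree type. */
typedef struct ToyFramePlan {
  uint32_t ncallee_saves;  /* callee-saved registers to spill */
  uint32_t slot_bytes;     /* static slots: locals, spills, entry saves */
  uint32_t out_stack;      /* outgoing stack-argument bytes */
  uint32_t saved_pair_off; /* assumed sp-relative offset of the fp/lr pair */
  uint8_t has_alloca;      /* body moves sp dynamically */
} ToyFramePlan;

/* Pick the prologue form before any instruction is emitted. Alloca rules out
 * both slim forms because their epilogues only undo the static frame. */
static ToyPrologueForm toy_choose_prologue(const ToyFramePlan* p) {
  int slim = p->ncallee_saves == 0 && !p->has_alloca &&
             p->slot_bytes == 0 && p->out_stack == 0;
  if (slim) return TOY_SLIM;
  if (!p->has_alloca && p->saved_pair_off <= 504u) return TOY_SLIM_SMALL_FRAME;
  return TOY_FAT;
}
```

Because the choice depends only on planned quantities, the emitter never has to reserve room for the longest form and later pad a shorter one.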
Diffstat:
4 files changed, 361 insertions(+), 165 deletions(-)
diff --git a/doc/OPT_O1_PERF_TODO.md b/doc/OPT_O1_PERF_TODO.md
@@ -50,25 +50,16 @@ overhead** — every byte of which is multiplied by 7.6M.
Open items, in priority order (most recent disasm in
`/tmp/mc/binary-trees.cfree.o`):
-1. **Useless leading `b PC+4` at every function entry.** All four functions
- still start with this:
- ```
- sub sp, sp, #0x20
- stp x29, x30, [sp, #0x10]
- add x29, sp, #0x10
- stp x20, x19, [x29, #-0x10]
- b PC+4 <-- branches to the very next instruction
- mov x19, x0
- ```
- Root cause: commit `9bd61e8` ("emit param_decls into a dedicated prologue
- block") added an empty-on-emit entry block ahead of the function body.
- `opt_jump_cleanup`'s helper `empty_fallthrough_block` in
- `src/opt/pass_jump.c` then explicitly bails out when `block == f->entry`,
- so the empty entry block is never merged into its single successor.
- Lifting that guard (with whatever safety condition `9bd61e8` was protecting
- against — likely "first body block is a loop header") would let
- jump-cleanup absorb the prologue block in the common case. **+1 insn ×
- 7.6M calls.**
+1. **Useless leading `b PC+4` at every function entry. [DONE]** Fixed by the
+ known-frame prologue rework (frame planning in `opt_emit_native` →
+ `func_begin_known_frame`). The branch was *not* the empty entry block — it
+ was filler left by the old two-phase prologue: `aa_emit_prologue` sized a
+ *fat* prologue (slim eligibility wasn't decided until `func_end`), then the
+ `func_end` patch rebuilt it as the shorter slim form and back-filled the
+ leftover words with `b PC+N; nop`. The known-frame path decides slim
+ eligibility *before* emitting, so it emits the slim/slim_small_frame prologue
+ directly — no filler, no patch. Functions now fall straight from the prologue
+ into the body. **-1 to -2 insns at entry, every function.**
2. **Prologue compaction: 4-insn → 2-insn pre-indexed.** Half-done. cfree
today emits:
@@ -157,13 +148,18 @@ worth comparing the splice inner loop directly.
These help several benches. Both are partial; the binary-trees items above
are the most concrete tests for whether each is complete.
-1. **Drop the leading `b PC+4` at function entry.** See binary-trees item 1.
- Affects every cfree-compiled function, not just binary-trees.
+1. **Drop the leading `b PC+4` at function entry. [DONE]** See binary-trees
+ item 1. Resolved by the known-frame prologue (the optimizer plans the whole
+ frame up front, so the prologue is emitted final in its slim form rather than
+ fat-then-patched). Affected every cfree-compiled function.
2. **Compact FP-frame prologue/epilogue.** See binary-trees item 2. The
- 2-insn pre-indexed form is wired in for the no-callee-save case; needs
- to be extended to small frames with 1–2 callee-saves. Biggest absolute
- payoff on call-heavy benches.
+ 2-insn pre-indexed form is wired in for the no-callee-save case (Tier A);
+ extending it to small frames with 1–2 callee-saves still needs the frame
+ record moved to the bottom of the frame (fp-at-bottom layout) so a single
+ `stp x29,x30,[sp,#-N]!` covers both the sp decrement and the save — a
+ separable layout change, not unlocked by the frame-planning rework alone.
+ Biggest absolute payoff still open on call-heavy benches.
3. **Hard-register copy coalescing for `IR_LOAD_IMM` sources.** See
binary-trees item 3. The hint-propagation path covers `ldr` → call-arg
diff --git a/src/arch/aa64/native.c b/src/arch/aa64/native.c
@@ -210,6 +210,23 @@ typedef struct AANativeTarget {
* when both would apply). Applies to almost every function with a small
* frame, including those with callee-saves and locals. */
u8 slim_small_frame;
+ /* Set by aa_func_begin_known_frame (optimizer path: the full frame is known
+ * up front, so the prologue, allocas, and tail epilogues are emitted final
+ * with no back-patching). Cleared by aa_func_begin (NativeDirectTarget
+ * single-pass path: worst-case prologue region reserved and patched, alloca /
+ * tail sites recorded and patched at func_end). This flag is the single
+ * discriminator between the two strategies throughout this file. */
+ u8 known_frame;
+ /* Set when the function body contains a dynamic alloca. On the known-frame
+ * path it comes from NativeKnownFrameDesc.has_alloca (needed before the body
+ * to settle slim-epilogue eligibility); on the single-pass path it tracks
+ * nalloca_patches. Disqualifies the slim small-frame epilogue. */
+ u8 has_alloca;
+ /* Set on the known-frame path once the frame is fixed and the prologue
+ * emitted. Any frame_slot request after this point would grow the frame the
+ * prologue already encoded — a silent miscompile — so aa_frame_slot panics.
+ * The optimizer is expected to plan every slot before the body. */
+ u8 frame_final;
} AANativeTarget;
static AANativeTarget* aa_of(NativeTarget* t) { return (AANativeTarget*)t; }
@@ -906,10 +923,11 @@ static void aa_emit_q_frame(AANativeTarget* a, int load, u32 qreg,
aa_emit32(mc, aa_ldst_q_uimm(load, qreg, AA_TMP1, 0));
}
-static void aa_emit_variadic_reg_saves(AANativeTarget* a) {
+/* Reserve the variadic register-save-area frame slots (gp then fp). Split from
+ * the store emission so the known-frame path can fix the full frame — including
+ * these slots — before the prologue, then emit the stores after it. */
+static void aa_reserve_variadic_reg_saves(AANativeTarget* a) {
NativeFrameSlotDesc sd;
- NativeAddr addr;
- MemAccess mem;
CfreeCgTypeId i64 = builtin_id(CFREE_CG_BUILTIN_I64);
ABIVaListInfo vai = abi_va_list_layout(a->base.c->abi);
if (vai.kind != ABI_VA_LIST_AAPCS64) return;
@@ -922,6 +940,16 @@ static void aa_emit_variadic_reg_saves(AANativeTarget* a) {
sd.size = vai.fp_reg_count * vai.fp_slot_size;
sd.align = 16;
a->va_vr_slot = a->base.frame_slot(&a->base, &sd);
+}
+
+/* Emit the stores into the variadic register-save area. Slots must already be
+ * reserved (aa_reserve_variadic_reg_saves). */
+static void aa_emit_variadic_reg_save_stores(AANativeTarget* a) {
+ NativeAddr addr;
+ MemAccess mem;
+ CfreeCgTypeId i64 = builtin_id(CFREE_CG_BUILTIN_I64);
+ ABIVaListInfo vai = abi_va_list_layout(a->base.c->abi);
+ if (vai.kind != ABI_VA_LIST_AAPCS64) return;
memset(&mem, 0, sizeof mem);
mem.type = i64;
mem.size = 8;
@@ -941,7 +969,10 @@ static void aa_emit_variadic_reg_saves(AANativeTarget* a) {
static void aa_emit_entry_saves(AANativeTarget* a);
-static void aa_func_begin(NativeTarget* t, const CGFuncDesc* fd) {
+/* Per-function state reset + function-symbol / cfi / prologue-anchor setup
+ * shared by both entry points (aa_func_begin for the single-pass path,
+ * aa_func_begin_known_frame for the optimizer path). Emits no prologue. */
+static void aa_func_begin_common(NativeTarget* t, const CGFuncDesc* fd) {
AANativeTarget* a = aa_of(t);
MCEmitter* mc = t->mc;
a->func = fd;
@@ -968,6 +999,9 @@ static void aa_func_begin(NativeTarget* t, const CGFuncDesc* fd) {
a->ncallee_saves = 0;
a->slim_prologue = 0;
a->slim_small_frame = 0;
+ a->known_frame = 0;
+ a->has_alloca = 0;
+ a->frame_final = 0;
mc->set_section(mc, fd->text_section_id);
mc->emit_align(mc, 4, 0);
a->func_start = mc->pos(mc);
@@ -976,49 +1010,72 @@ static void aa_func_begin(NativeTarget* t, const CGFuncDesc* fd) {
a->prologue_pos = mc->pos(mc);
a->minimal_prologue_words = 0;
a->epilogue_label = mc->label_new(mc);
- /* Optimizer path: emit nothing here. The exact-size prologue and the
- * sret/variadic entry saves are emitted later by aa_emit_prologue, once the
- * callee-save set and frame slots are known. The single-pass path reserves a
- * worst-case region (patched in func_end) and emits the entry saves now. */
- if (t->emit_minimal_prologue) return;
+}
+
+/* Single-pass (NativeDirectTarget) entry point: the frame is not known up
+ * front, so reserve a worst-case prologue region (patched in aa_func_end once
+ * max_outgoing / callee-saves are final) and emit the entry saves now. */
+static void aa_func_begin(NativeTarget* t, const CGFuncDesc* fd) {
+ AANativeTarget* a = aa_of(t);
+ MCEmitter* mc = t->mc;
+ aa_func_begin_common(t, fd);
for (u32 i = 0; i < AA_PROLOGUE_WORDS; ++i) aa_emit32(mc, 0xd503201fu);
aa_emit_entry_saves(a);
}
-/* Emit the sret-pointer save (x8 → slot) and, for variadic functions, the
- * argument register-save area. Run immediately after the prologue frame setup
- * on both the single-pass path (from func_begin) and the optimizer path (from
- * aa_emit_prologue). */
-static void aa_emit_entry_saves(AANativeTarget* a) {
+/* Reserve the entry-save frame slots: the sret-pointer home (x8) and, for
+ * variadic functions, the argument register-save area. Reserving is split from
+ * emitting so the known-frame path can fix the full frame before the prologue;
+ * the single-pass path runs both back to back via aa_emit_entry_saves. */
+static void aa_reserve_entry_saves(AANativeTarget* a) {
NativeTarget* t = &a->base;
const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, a->func->fn_type);
if (abi && abi->has_sret) {
NativeFrameSlotDesc sd;
- NativeAddr addr;
- NativeLoc src;
- MemAccess mem;
memset(&sd, 0, sizeof sd);
sd.type = builtin_id(CFREE_CG_BUILTIN_I64);
sd.size = 8;
sd.align = 8;
sd.kind = NATIVE_FRAME_SLOT_SAVE;
a->sret_ptr_slot = t->frame_slot(t, &sd);
+ }
+ if (abi && abi->variadic) aa_reserve_variadic_reg_saves(a);
+}
+
+/* Emit the entry-save stores (x8 → sret slot, then the variadic reg-save area).
+ * Slots must already be reserved (aa_reserve_entry_saves). */
+static void aa_emit_entry_save_stores(AANativeTarget* a) {
+ NativeTarget* t = &a->base;
+ const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, a->func->fn_type);
+ if (abi && abi->has_sret) {
+ NativeAddr addr;
+ NativeLoc src;
+ MemAccess mem;
+ CfreeCgTypeId i64 = builtin_id(CFREE_CG_BUILTIN_I64);
memset(&addr, 0, sizeof addr);
addr.base_kind = NATIVE_ADDR_BASE_FRAME;
addr.base.frame = a->sret_ptr_slot;
- addr.base_type = sd.type;
+ addr.base_type = i64;
memset(&src, 0, sizeof src);
src.kind = NATIVE_LOC_REG;
src.cls = NATIVE_REG_INT;
- src.type = sd.type;
+ src.type = i64;
src.v.reg = 8u;
memset(&mem, 0, sizeof mem);
- mem.type = sd.type;
+ mem.type = i64;
mem.size = 8;
mem.align = 8;
aa_emit_mem(a, 0, src, addr, mem);
}
- if (abi && abi->variadic) aa_emit_variadic_reg_saves(a);
+ if (abi && abi->variadic) aa_emit_variadic_reg_save_stores(a);
+}
+
+/* Reserve + emit the entry saves back to back. Single-pass (NativeDirectTarget)
+ * path, where the prologue region is a reserved worst-case block and slot
+ * offsets need not be final before it. */
+static void aa_emit_entry_saves(AANativeTarget* a) {
+ aa_reserve_entry_saves(a);
+ aa_emit_entry_save_stores(a);
}
static void aa_note_frame_state(NativeTarget* t,
@@ -1198,7 +1255,7 @@ static void aa_words_callee_saves(AANativeTarget* a, int save, u32* words,
/* Build the prologue instruction words for `L` into `words` (capacity `cap`),
* returning the count. Shared by the NativeDirectTarget patch path (reserves
* a fixed worst-case region, then patches it here) and the optimizer path
- * (emits an exact-size region up front; see aa_emit_prologue).
+ * (aa_func_begin_known_frame emits exactly these words up front).
*
* All three variants establish the same post-prologue state defined by L:
* sp = caller's sp - L->frame_size
@@ -1263,23 +1320,6 @@ static void aa_patch_prologue(AANativeTarget* a, const AAFrameLayout* L,
aa_patch32(a->base.obj, sec, a->prologue_pos + i * 4u, words[i]);
}
-/* Optimizer path: emit an exact-size prologue in place (no reserved NOP
- * region). The callee-save set and the static frame slots are final by now, so
- * the prologue's instruction count is fixed; only the frame-size immediates
- * (sub sp / save-area address / fp = sp+saved_pair) still depend on body-
- * emitted temporaries and are patched in func_end. We size the region with
- * a frame that fits add/sub's imm12 (the real frame must too, or func_end's
- * rebuild — capped at this length — panics). */
-static void aa_emit_prologue(NativeTarget* t) {
- AANativeTarget* a = aa_of(t);
- u32 words[AA_PROLOGUE_WORDS];
- AAFrameLayout L = aa_build_layout(a->cum_off, a->max_outgoing);
- u32 n = aa_build_prologue_words(a, &L, words, AA_PROLOGUE_WORDS);
- for (u32 i = 0; i < n; ++i) aa_emit32(t->mc, words[i]);
- a->minimal_prologue_words = n;
- aa_emit_entry_saves(a);
-}
-
static void aa_emit_restore_frame(AANativeTarget* a, const AAFrameLayout* L) {
MCEmitter* mc = a->base.mc;
u32 words[AA_PROLOGUE_WORDS];
@@ -1349,35 +1389,27 @@ static void aa_func_end(NativeTarget* t) {
AANativeTarget* a = aa_of(t);
MCEmitter* mc = t->mc;
AAFrameLayout L = aa_build_layout(a->cum_off, a->max_outgoing);
- /* Optimizer path emitted an exact-size prologue (minimal_prologue_words);
- * the single-pass path reserved a fixed worst-case region. Either way the
- * frame-size immediates are only final now, so patch the region in place. */
+ /* known_frame (optimizer): prologue, allocas, and tail epilogues were emitted
+ * final and slim eligibility was settled in aa_func_begin_known_frame — there
+ * is nothing to patch. Single-pass (NDT): a worst-case prologue region was
+ * reserved and the deferred patches recorded; resolve them now that the frame
+ * is final. The NDT path always uses the fat prologue/epilogue (slim_* left 0
+ * by aa_func_begin_common, since its reserved region is much larger). */
u32 prologue_region =
- t->emit_minimal_prologue ? a->minimal_prologue_words : AA_PROLOGUE_WORDS;
- /* Slim Tier A eligibility (set before emitting the epilogue / patching the
- * prologue so the *_restore_frame / *_build_prologue_words helpers pick the
- * slim form). Conditions: no callee-saves needed, no alloca, no body
- * slots (locals/spills/sret/variadic — all counted in slot_bytes), no
- * outgoing stack args, optimizer path only (the NDT reserves a much
- * larger prologue region). */
- a->slim_prologue =
- t->emit_minimal_prologue && a->ncallee_saves == 0 && a->nalloca == 0 &&
- L.slot_bytes == 0 && L.out_stack == 0;
- /* Universal small-frame fast path: skip the x17/x10 scratch when the
- * saved-pair offset fits stp's signed 7-bit scaled immediate. Mutually
- * exclusive with the Tier A slim form (Tier A is strictly tighter).
- * Disqualify alloca: alloca dynamically moves sp during the body, and the
- * fat epilogue (sp = fp + 16 via x10) is what restores sp from fp; the
- * slim_small_frame epilogue's `add sp, sp, #N` only undoes the static
- * frame, leaving sp pointing into the alloca area. */
- a->slim_small_frame = !a->slim_prologue && a->nalloca == 0 &&
- aa_sp_off_saved_pair(&L) <= 504u;
+ a->known_frame ? a->minimal_prologue_words : AA_PROLOGUE_WORDS;
mc->label_place(mc, a->epilogue_label);
aa_emit_callee_restores(a);
aa_emit_restore_frame(a, &L);
aa_emit32(mc, aa64_ret(AA_LR));
- aa_patch_prologue(a, &L, prologue_region);
- aa_apply_patches(a, &L);
+ if (a->known_frame) {
+ /* The frame-planning pre-pass plus final prologue/alloca/tail emission must
+ * leave nothing deferred; a stray patch would mean a body-time frame change
+ * the final prologue never saw. */
+ if (a->npatches != 0) aa_panic(a, "known-frame path left deferred patches");
+ } else {
+ aa_patch_prologue(a, &L, prologue_region);
+ aa_apply_patches(a, &L);
+ }
if (mc->cfi_set_next_pc_offset && mc->cfi_def_cfa && mc->cfi_offset) {
mc->cfi_set_next_pc_offset(mc, prologue_region * 4u);
/* CFA = caller's sp = fp + AA_FRAME_SAVE_SIZE. saved fp/lr at fp/fp+8
@@ -1409,6 +1441,8 @@ static NativeFrameSlot aa_frame_slot(NativeTarget* t,
AANativeSlot* s;
u32 size = d->size ? d->size : 8u;
u32 align = d->align ? d->align : 1u;
+ if (a->frame_final)
+ aa_panic(a, "frame slot requested after known-frame prologue");
if (a->nslots == a->slots_cap) {
u32 cap = a->slots_cap ? a->slots_cap * 2u : 16u;
AANativeSlot* nb = arena_zarray(t->c->tu, AANativeSlot, cap);
@@ -1425,19 +1459,62 @@ static NativeFrameSlot aa_frame_slot(NativeTarget* t,
return a->nslots;
}
+/* Optimizer entry point: the full frame is supplied up front, so the prologue,
+ * entry saves, slim-form eligibility, allocas, and tail epilogues are all final
+ * the moment they are emitted — no back-patching (aa_func_end skips the patch
+ * passes when a->known_frame). Slot creation order matches the single-pass path
+ * (callee-saves first for stur range, then the static slots, then sret/variadic
+ * entry saves) so offsets are identical to what the patch path would produce. */
static void aa_func_begin_known_frame(NativeTarget* t, const CGFuncDesc* fd,
const NativeKnownFrameDesc* frame,
NativeFrameSlot* out_slots) {
- aa_func_begin(t, fd);
+ AANativeTarget* a = aa_of(t);
+ AAFrameLayout L;
+ u32 words[AA_PROLOGUE_WORDS];
+ u32 n;
+ aa_func_begin_common(t, fd);
+ a->known_frame = 1;
if (frame) {
- AANativeTarget* a = aa_of(t);
- if (frame->max_outgoing > a->max_outgoing)
- a->max_outgoing = frame->max_outgoing;
+ a->has_alloca = frame->has_alloca;
+ if (frame->callee_saved_used && frame->ncallee_classes)
+ aa_reserve_callee_saves(t, frame->callee_saved_used,
+ frame->ncallee_classes);
for (u32 i = 0; i < frame->nslots; ++i) {
NativeFrameSlot slot = aa_frame_slot(t, &frame->slots[i]);
if (out_slots) out_slots[i] = slot;
}
+ aa_reserve_entry_saves(a);
+ /* Reserve the atomic-RMW scratch spill last (matching its lazy position in
+ * the single-pass path), so aa_saved_tmp_spill reuses it instead of growing
+ * the frame mid-body. */
+ if (frame->needs_scratch_spill) {
+ NativeFrameSlotDesc sd;
+ memset(&sd, 0, sizeof sd);
+ sd.type = builtin_id(CFREE_CG_BUILTIN_I64);
+ sd.size = 8;
+ sd.align = 8;
+ sd.kind = NATIVE_FRAME_SLOT_SPILL;
+ a->saved_tmp_slot = a->base.frame_slot(&a->base, &sd);
+ }
+ if (frame->max_outgoing > a->max_outgoing)
+ a->max_outgoing = frame->max_outgoing;
}
+ /* Frame is final: slot_bytes (cum_off) and out_stack (max_outgoing) are both
+ * known, so the prologue immediates and slim-form choice are settled here. */
+ L = aa_build_layout(a->cum_off, a->max_outgoing);
+ /* Slim Tier A: no callee-saves, no alloca, no body slots, no outgoing stack
+ * args. slim_small_frame: skip the x17/x10 scratch when the saved-pair offset
+ * fits stp's signed 7-bit scaled immediate. (See aa_func_end for the
+ * single-pass path, which never takes the slim form.) */
+ a->slim_prologue = a->ncallee_saves == 0 && !a->has_alloca &&
+ L.slot_bytes == 0 && L.out_stack == 0;
+ a->slim_small_frame = !a->slim_prologue && !a->has_alloca &&
+ aa_sp_off_saved_pair(&L) <= 504u;
+ n = aa_build_prologue_words(a, &L, words, AA_PROLOGUE_WORDS);
+ for (u32 i = 0; i < n; ++i) aa_emit32(t->mc, words[i]);
+ a->minimal_prologue_words = n;
+ a->frame_final = 1;
+ aa_emit_entry_save_stores(a);
}
static void aa_spill(NativeTarget* t, NativeLoc src, NativeFrameSlot slot,
@@ -2028,14 +2105,22 @@ static void aa_alloca(NativeTarget* t, NativeLoc dst, NativeLoc size,
aa_emit32(t->mc, aa64_add_imm(1, AA_TMP1, AA_SP, 0, 0));
aa_emit32(t->mc, aa64_sub(1, AA_TMP1, AA_TMP1, AA_TMP0));
aa_emit32(t->mc, aa64_add_imm(1, AA_SP, AA_TMP1, 0, 0));
- {
+ /* The alloca result is sp + outgoing-area bytes. On the known-frame path
+ * max_outgoing is already final, so emit the final `add dst, sp, #N` here; on
+ * the single-pass path it is not known yet, so record a patch. */
+ if (a->known_frame) {
+ u32 imm12, sh;
+ if (!aa64_addsub_imm_fits(a->max_outgoing, &imm12, &sh))
+ aa_panic(a, "outgoing area too large for alloca result");
+ aa_emit32(t->mc, aa64_add_imm(1, loc_reg(dst), AA_SP, imm12, sh));
+ } else {
AAPatch* p = aa_patch_alloc(a);
p->kind = AA_PATCH_ALLOCA;
p->pos = t->mc->pos(t->mc);
p->u.dst_reg = loc_reg(dst);
a->nalloca++;
+ aa_emit32(t->mc, aa64_add_imm(1, loc_reg(dst), AA_SP, 0, 0));
}
- aa_emit32(t->mc, aa64_add_imm(1, loc_reg(dst), AA_SP, 0, 0));
}
static MemAccess aa_mem_for_type(NativeTarget* t, CfreeCgTypeId type,
@@ -2283,6 +2368,14 @@ static u32 aa_signature_stack_bytes(NativeTarget* t, CfreeCgTypeId fn_type,
return aa_call_stack_size(t, &d);
}
+/* Pure NativeTarget.call_stack_bytes: outgoing stack bytes for a full call
+ * descriptor (handles variadic stack args, unlike signature_stack_bytes which
+ * sees only the fixed params). aa_call_stack_size reads only fn_type and each
+ * args[i].type, so the frame-planning pre-pass can call this before emitting. */
+static u32 aa_call_stack_bytes(NativeTarget* t, const NativeCallDesc* desc) {
+ return aa_call_stack_size(t, desc);
+}
+
/* One register-passed call argument: write `src` (or its address) into the
* argument register `dst`. Collected during planning and emitted as a batch so
* the backend can order them as a parallel copy (see aa_emit_reg_arg_moves). */
@@ -2535,6 +2628,31 @@ static void aa_ret(NativeTarget* t);
static void aa_emit_tail_site(NativeTarget* t, NativeLoc callee) {
AANativeTarget* a = aa_of(t);
+ if (a->known_frame) {
+ /* Frame is final: emit the tail epilogue (callee restores + frame restore +
+ * branch) directly, exactly the words aa_apply_patches would patch in but
+ * without the reserved NOP padding. */
+ AAFrameLayout L = aa_build_layout(a->cum_off, a->max_outgoing);
+ u32 words[AA_TAIL_WORDS];
+ u32 n = 0;
+ aa_words_callee_restores(a, words, AA_TAIL_WORDS, &n);
+ aa_words_restore_frame(a, words, AA_TAIL_WORDS, &n, &L);
+ if (n >= AA_TAIL_WORDS) aa_panic(a, "tail epilogue too large");
+ for (u32 i = 0; i < n; ++i) aa_emit32(t->mc, words[i]);
+ if (callee.kind == NATIVE_LOC_REG) {
+ aa_emit32(t->mc, aa64_br(loc_reg(callee)));
+ } else if (callee.kind == NATIVE_LOC_GLOBAL) {
+ u32 pos = t->mc->pos(t->mc);
+ aa_emit32(t->mc, aa64_b(0));
+ t->mc->emit_reloc_at(t->mc, t->mc->section_id, pos, R_AARCH64_JUMP26,
+ callee.v.global.sym, callee.v.global.addend, 0, 0);
+ } else {
+ aa_panic(a, "unsupported tail target");
+ }
+ return;
+ }
+ /* Single-pass: reserve a worst-case region and record a patch; the callee
+ * restores and frame restore depend on the not-yet-final frame layout. */
AAPatch* p = aa_patch_alloc(a);
p->kind = AA_PATCH_TAIL;
p->pos = t->mc->pos(t->mc);
@@ -3299,8 +3417,8 @@ NativeTarget* aa64_native_target_new(Compiler* c, ObjBuilder* obj,
t->func_begin_known_frame = aa_func_begin_known_frame;
t->note_frame_state = aa_note_frame_state;
t->reserve_callee_saves = aa_reserve_callee_saves;
- t->emit_prologue = aa_emit_prologue;
t->signature_stack_bytes = aa_signature_stack_bytes;
+ t->call_stack_bytes = aa_call_stack_bytes;
t->has_store_zero_reg = 1;
t->store_zero_reg = 31u; /* wzr/xzr in the Rt position of a store */
t->func_end = aa_func_end;
diff --git a/src/arch/native_target.h b/src/arch/native_target.h
@@ -48,6 +48,22 @@ typedef struct NativeKnownFrameDesc {
u32 nslots;
u32 max_outgoing;
u32 align;
+ /* Callee-saved hard registers the allocator assigned, one bitmask per
+ * NativeAllocClass (indexed by class id). The backend reserves a save slot
+ * and emits the prologue save / epilogue restore for each — equivalent to a
+ * reserve_callee_saves() call, but folded into the known-frame setup so the
+ * full frame is fixed before the prologue is emitted. NULL / 0 means none. */
+ const u32* callee_saved_used;
+ u32 ncallee_classes;
+ /* Whether the function body contains a dynamic alloca. The backend needs this
+ * up front (before the body) to decide prologue/epilogue form, since with a
+ * known frame the slim-epilogue eligibility is settled at func_begin. */
+ u8 has_alloca;
+ /* Whether the body has an operation that needs a backend-internal scratch
+ * spill slot — on aa64, an atomic read-modify-write, whose retry loop spills
+ * one scratch register. The backend reserves the slot up front so the body
+ * never grows the frame after the prologue. */
+ u8 needs_scratch_spill;
} NativeKnownFrameDesc;
typedef enum NativeAllocClass {
@@ -298,6 +314,13 @@ struct NativeTarget {
* out-pointer may be NULL. May itself be NULL. */
u32 (*signature_stack_bytes)(NativeTarget*, CfreeCgTypeId fn_type,
int* variadic, u32* nparams);
+ /* Pure query: the outgoing stack-argument bytes a call with this descriptor
+ * uses, rounded to the ABI's outgoing-area alignment. Reads only fn_type,
+ * flags, nargs, and each args[i].type — never argument *locations* — so the
+ * optimizer can call it in a frame-planning pre-pass, before any argument
+ * marshalling is emitted, to size the outgoing area. Must equal the
+ * stack_arg_size plan_call computes for the same descriptor. May be NULL. */
+ u32 (*call_stack_bytes)(NativeTarget*, const NativeCallDesc*);
/* Integer hardware zero register, if the ISA has one (aa64 wzr/xzr, rv64
* x0). When `has_store_zero_reg` is set, the emit path stores a constant 0
* straight from `store_zero_reg` instead of materializing 0 into a scratch
diff --git a/src/opt/pass_native_emit.c b/src/opt/pass_native_emit.c
@@ -24,7 +24,6 @@ typedef struct NativeEmitCtx {
NativeFrameSlot* slot_map;
MCLabel* labels;
u8* label_placed;
- u32 max_outgoing;
ObjSecId local_static_sec;
ObjSymId local_static_sym;
u32 local_static_base;
@@ -628,19 +627,19 @@ static void emit_call(NativeEmitCtx* e, Inst* in) {
args[i] = abi_storage_loc(e, &aux->desc.args[i], in->loc);
if (aux->desc.ret.storage.kind) {
CfreeCgTypeId rty = aux->desc.ret.type;
- int scalar = !cg_type_is_aggregate(e->c, rty) &&
- type_size_or(e->c, rty, 8u) <= 8u;
results = arena_zarray(e->f->arena, NativeLoc, 1);
final_result = abi_storage_loc(e, &aux->desc.ret, in->loc);
- /* Scalar result: hand plan_call the value's real destination (the MIR
- * result reg, or its spill slot) directly, so it emits one move out of the
- * ABI result register. Routing every scalar result through a fresh temp
- * slot — store x0 then immediately reload — was a pure round trip on every
- * call; emit_ret already avoids the analogous trip on returns. The temp
- * slot is kept only for aggregate / oversized results, which plan_call /
- * the callee write in parts and must land in memory. */
- if (scalar && (final_result.kind == NATIVE_LOC_REG ||
- final_result.kind == NATIVE_LOC_FRAME)) {
+ /* Hand plan_call the value's real destination directly whenever it is a
+ * register or a frame slot: a scalar result is a single move out of the ABI
+ * result register, and an aggregate / oversized result — which plan_call or
+ * the callee writes in parts and so must land in memory — lands straight in
+ * its frame home. Routing either through a fresh temp slot (store then
+ * reload / copy_bytes) was a pure round trip on every call. The temp slot is
+ * a fallback for the rare result whose storage is neither a register nor a
+ * frame slot (e.g. written into a global); lowering hoists aggregates to a
+ * frame home (opt_lower_to_mir), so this branch is scalar-only in practice. */
+ if (final_result.kind == NATIVE_LOC_REG ||
+ final_result.kind == NATIVE_LOC_FRAME) {
results[0] = final_result;
} else {
result_slot = temp_slot(e, rty, in->loc, NATIVE_FRAME_SLOT_SPILL);
@@ -657,8 +656,6 @@ static void emit_call(NativeEmitCtx* e, Inst* in) {
d.tail_policy = aux->desc.tail_policy;
d.inline_policy = aux->desc.inline_policy;
e->target->plan_call(e->target, &d, &plan);
- if (plan.stack_arg_size > e->max_outgoing)
- e->max_outgoing = plan.stack_arg_size;
for (u32 i = 0; i < plan.nargs; ++i)
write_loc(e, plan.args[i].dst, plan.args[i].src, plan.args[i].mem, in->loc);
if (plan.callee.kind != NATIVE_LOC_REG &&
@@ -805,19 +802,15 @@ static void emit_inst(NativeEmitCtx* e, u32 block, u32 order_index, Inst* in,
class_for_type(e, in->opnds[1].type) == NATIVE_REG_INT)
src = loc_reg(in->opnds[1].type, NATIVE_REG_INT,
e->target->store_zero_reg);
+ /* Source register aliases the address base/index (e.g. `*p = (T)p`).
+ * Collapse the address into a scratch register: collapse_addr_to_reg
+ * selects a scratch distinct from both base and index — hence distinct
+ * from `src` — so the store reads `src` and writes through the fresh
+ * scratch with no alias. This stays entirely in registers; the frame is
+ * fully planned before emission, so emit never allocates a slot here. */
if (src.kind == NATIVE_LOC_REG && (src.v.reg == addr_base_reg(&addr) ||
- src.v.reg == addr_index_reg(&addr))) {
- NativeFrameSlot slot =
- temp_slot(e, in->opnds[1].type, in->loc, NATIVE_FRAME_SLOT_SPILL);
- NativeLoc frame = loc_frame(in->opnds[1].type,
- class_for_type(e, in->opnds[1].type), slot);
- write_loc(e, frame, src, mem_for_type(e->c, in->opnds[1].type),
- in->loc);
+ src.v.reg == addr_index_reg(&addr)))
collapse_addr_to_reg(e, &addr, in->loc);
- src = materialize(e, frame, class_for_type(e, in->opnds[1].type),
- in->opnds[1].type, addr_base_reg(&addr), REG_NONE,
- in->loc);
- }
if (src.kind != NATIVE_LOC_REG) {
if (!scratch_available(e, class_for_type(e, in->opnds[1].type),
addr_base_reg(&addr), addr_index_reg(&addr)))
@@ -1286,24 +1279,6 @@ static void emit_block(NativeEmitCtx* e, u32 block, u32 order_index,
}
}
-static void map_frame_slots(NativeEmitCtx* e) {
- e->slot_map =
- arena_zarray(e->f->arena, NativeFrameSlot, e->f->nframe_slots + 1u);
- for (u32 i = 0; i < e->f->nframe_slots; ++i) {
- IRFrameSlot* s = &e->f->frame_slots[i];
- NativeFrameSlotDesc d;
- memset(&d, 0, sizeof d);
- d.type = s->type;
- d.name = s->name;
- d.loc = s->loc;
- d.size = s->size;
- d.align = s->align;
- d.kind = s->kind;
- d.flags = s->flags;
- e->slot_map[s->id] = e->target->frame_slot(e->target, &d);
- }
-}
-
#define EMIT_MAX_REG_CLASSES 4u
static void collect_used_reg(Func* f, Inst* in, OptOperand* op, int is_def,
@@ -1317,37 +1292,128 @@ static void collect_used_reg(Func* f, Inst* in, OptOperand* op, int is_def,
used[op->cls] |= 1u << op->v.reg;
}
-/* After register allocation the MIR names hard registers directly, so we can
- * scan it for the callee-saved registers the allocator assigned and ask the
- * target to save/restore them. Must run after func_begin and before frame-slot
- * mapping so the target can place the save slots first. */
-static void reserve_callee_saves(NativeEmitCtx* e) {
+/* After register allocation the MIR names hard registers directly, so we scan
+ * it for the callee-saved registers the allocator assigned. Fills `used[cls]`
+ * (one bitmask per alloc class, masked to each class's callee-saved set) and
+ * returns the class count. The masks feed NativeKnownFrameDesc so the backend
+ * reserves the save slots as part of the up-front frame. */
+static u32 compute_callee_saved_used(NativeEmitCtx* e, u32* used, u32 cap) {
NativeTarget* t = e->target;
const NativeRegInfo* ri = t->regs;
- u32 used[EMIT_MAX_REG_CLASSES];
u32 nclasses;
- if (!t->reserve_callee_saves || !ri) return;
- memset(used, 0, sizeof used);
+ for (u32 i = 0; i < cap; ++i) used[i] = 0;
+ if (!ri) return 0;
for (u32 b = 0; b < e->f->nblocks; ++b) {
Block* bl = &e->f->blocks[b];
for (u32 i = 0; i < bl->ninsts; ++i)
opt_walk_inst_operands(e->f, &bl->insts[i], collect_used_reg, used);
}
- nclasses = ri->nclasses < EMIT_MAX_REG_CLASSES ? ri->nclasses
- : EMIT_MAX_REG_CLASSES;
+ nclasses = ri->nclasses < cap ? ri->nclasses : cap;
for (u32 i = 0; i < ri->nclasses; ++i) {
const NativeAllocClassInfo* ci = &ri->classes[i];
- if (ci->cls < EMIT_MAX_REG_CLASSES)
- used[ci->cls] &= ci->callee_saved_mask;
+ if (ci->cls < cap) used[ci->cls] &= ci->callee_saved_mask;
+ }
+ return nclasses;
+}
+
+/* Plan the complete call frame before any code is emitted, then hand it to the
+ * backend via func_begin_known_frame so the prologue is emitted final. After
+ * regalloc and MIR lowering the optimizer knows everything the frame needs:
+ * the callee-saved set (scanned from the MIR), every static frame slot
+ * (f->frame_slots), the outgoing-arg area (the max over all calls of the pure
+ * call_stack_bytes query), and whether the body has an alloca or an atomic RMW
+ * (scratch spill). The body therefore allocates no slots and nothing is
+ * back-patched. Populates e->slot_map from the backend-assigned handles. */
+static void plan_frame(NativeEmitCtx* e, const CGFuncDesc* fd) {
+ NativeTarget* t = e->target;
+ NativeKnownFrameDesc frame;
+ NativeFrameSlotDesc* slots = NULL;
+ NativeFrameSlot* out_slots = NULL;
+ u32 used[EMIT_MAX_REG_CLASSES];
+ u32 nclasses;
+ u32 max_args = 0, max_outgoing = 0;
+ u8 has_alloca = 0;
+ u8 needs_scratch_spill = 0;
+ memset(&frame, 0, sizeof frame);
+ nclasses = t->reserve_callee_saves
+ ? compute_callee_saved_used(e, used, EMIT_MAX_REG_CLASSES)
+ : 0u;
+ /* Outgoing-arg area = max stack bytes over calls; note alloca, atomic RMW. */
+ for (u32 b = 0; b < e->f->nblocks; ++b) {
+ Block* bl = &e->f->blocks[b];
+ for (u32 i = 0; i < bl->ninsts; ++i) {
+ Inst* in = &bl->insts[i];
+ if ((IROp)in->op == IR_ALLOCA) {
+ has_alloca = 1;
+ } else if ((IROp)in->op == IR_ATOMIC_RMW) {
+ needs_scratch_spill = 1;
+ } else if ((IROp)in->op == IR_CALL) {
+ IRCallAux* aux = (IRCallAux*)in->extra.aux;
+ if (aux && aux->desc.nargs > max_args) max_args = aux->desc.nargs;
+ }
+ }
+ }
+ if (t->call_stack_bytes) {
+ NativeLoc* args =
+ max_args ? arena_zarray(e->f->arena, NativeLoc, max_args) : NULL;
+ for (u32 b = 0; b < e->f->nblocks; ++b) {
+ Block* bl = &e->f->blocks[b];
+ for (u32 i = 0; i < bl->ninsts; ++i) {
+ Inst* in = &bl->insts[i];
+ IRCallAux* aux;
+ NativeCallDesc d;
+ u32 sb;
+ if ((IROp)in->op != IR_CALL) continue;
+ aux = (IRCallAux*)in->extra.aux;
+ if (!aux) continue;
+ memset(&d, 0, sizeof d);
+ d.fn_type = aux->desc.fn_type;
+ d.flags = aux->desc.flags;
+ d.nargs = aux->desc.nargs;
+ for (u32 k = 0; k < aux->desc.nargs; ++k) {
+ memset(&args[k], 0, sizeof args[k]);
+ args[k].type = aux->desc.args[k].type;
+ }
+ d.args = args;
+ sb = t->call_stack_bytes(t, &d);
+ if (sb > max_outgoing) max_outgoing = sb;
+ }
+ }
+ }
+ e->slot_map =
+ arena_zarray(e->f->arena, NativeFrameSlot, e->f->nframe_slots + 1u);
+ if (e->f->nframe_slots) {
+ slots = arena_zarray(e->f->arena, NativeFrameSlotDesc, e->f->nframe_slots);
+ out_slots = arena_zarray(e->f->arena, NativeFrameSlot, e->f->nframe_slots);
+ for (u32 i = 0; i < e->f->nframe_slots; ++i) {
+ IRFrameSlot* s = &e->f->frame_slots[i];
+ NativeFrameSlotDesc* d = &slots[i];
+ memset(d, 0, sizeof *d);
+ d->type = s->type;
+ d->name = s->name;
+ d->loc = s->loc;
+ d->size = s->size;
+ d->align = s->align;
+ d->kind = s->kind;
+ d->flags = s->flags;
+ }
}
- t->reserve_callee_saves(t, used, nclasses);
+ frame.slots = slots;
+ frame.nslots = e->f->nframe_slots;
+ frame.max_outgoing = max_outgoing;
+ frame.callee_saved_used = nclasses ? used : NULL;
+ frame.ncallee_classes = nclasses;
+ frame.has_alloca = has_alloca;
+ frame.needs_scratch_spill = needs_scratch_spill;
+ t->func_begin_known_frame(t, fd, &frame, out_slots);
+ for (u32 i = 0; i < e->f->nframe_slots; ++i)
+ e->slot_map[e->f->frame_slots[i].id] = out_slots[i];
}
void opt_emit_native(Compiler* c, Func* f, NativeTarget* target) {
NativeEmitCtx e;
Func view;
CGFuncDesc fd;
- NativeFramePatchState state;
if (!f || !target) return;
memset(&e, 0, sizeof e);
if (f->mir) {
@@ -1375,16 +1441,13 @@ void opt_emit_native(Compiler* c, Func* f, NativeTarget* target) {
metrics_scope_end(c, "opt.native_emit.setup");
metrics_scope_begin(c, "opt.native_emit.func_begin");
- /* The optimizer path knows the callee-save set and frame slots before the
- * body, so the backend can emit an exact-size prologue here rather than
- * reserving a worst-case NOP region patched at func_end. Signal this before
- * func_begin (so it skips the reserved region) and emit the prologue once
- * reserve_callee_saves + map_frame_slots have run. */
- target->emit_minimal_prologue = target->emit_prologue != NULL;
- target->func_begin(target, &fd);
- reserve_callee_saves(&e);
- map_frame_slots(&e);
- if (target->emit_minimal_prologue) target->emit_prologue(target);
+ /* The optimizer has the whole frame after regalloc + MIR lowering, so it
+ * plans it up front (plan_frame) and drives func_begin_known_frame: the
+ * backend emits a final prologue with no reserved NOP region and no
+ * back-patching. The body allocates no frame slots, so the frame stays final;
+ * allocas and tail epilogues are emitted final too. (Contrast the
+ * single-pass NativeDirectTarget path, which reserves and patches.) */
+ plan_frame(&e, &fd);
bind_params(&e);
metrics_scope_end(c, "opt.native_emit.func_begin");
@@ -1393,10 +1456,6 @@ void opt_emit_native(Compiler* c, Func* f, NativeTarget* target) {
emit_block(&e, e.f->emit_order[i], i, &fd);
metrics_scope_end(c, "opt.native_emit.body");
- memset(&state, 0, sizeof state);
- state.max_outgoing = e.max_outgoing;
- if (target->note_frame_state) target->note_frame_state(target, &state);
- if (target->patch_apply) target->patch_apply(target);
metrics_scope_begin(c, "opt.native_emit.func_end");
target->func_end(target);
metrics_scope_end(c, "opt.native_emit.func_end");