kit

git clone https://git.ryansepassi.com/git/kit.git

commit 7eca5a4d48b4d1fc043f5cd81bd010fbd1785b1f
parent 90f2ba1a39b3ad8ae0ba9119ce548d2f5d85cd19
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Thu, 28 May 2026 13:49:09 -0700

opt/aa64: plan the O1 call frame up front, drop all back-patching

The O1 optimizer now computes the complete call frame before emission
(plan_frame: callee-saved set, every static slot, outgoing-arg area via
the new pure NativeTarget.call_stack_bytes, has_alloca, atomic-RMW scratch
spill) and drives func_begin_known_frame. The aa64 backend emits a final
prologue, allocas, and tail epilogues with no back-patching; the single-pass
NativeDirectTarget keeps its reserve-and-patch strategy. An a->known_frame
flag is the sole discriminator, and a frame_final guard panics if any
frame slot is requested after the prologue (which caught atomic RMW's
mid-body spill).
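
The plan-then-emit split described above can be sketched roughly as follows. This is a minimal illustration and not the kit code: `Frame`, `frame_plan_slot`, `frame_note_call`, and `frame_finalize` are hypothetical names, and the 16-byte rounding is an assumption based on AAPCS64 stack alignment. It returns an error where the real backend panics.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified frame planner (all names are illustrative). */
typedef struct Frame {
  uint32_t slot_bytes;   /* bytes of static slots planned so far */
  uint32_t max_outgoing; /* largest outgoing stack-arg area over all calls */
  int frame_final;       /* set once the prologue has been emitted */
} Frame;

/* Round v up to a multiple of a (a must be a non-zero power of two). */
static uint32_t align_up(uint32_t v, uint32_t a) {
  return (v + a - 1u) & ~(a - 1u);
}

/* Reserve a slot and return its offset, or -1 if the frame is already
 * final (the backend in the commit panics at this point instead). */
static int64_t frame_plan_slot(Frame* f, uint32_t size, uint32_t align) {
  uint32_t off;
  if (f->frame_final) return -1;
  f->slot_bytes = align_up(f->slot_bytes, align);
  off = f->slot_bytes;
  f->slot_bytes += size;
  return (int64_t)off;
}

/* Record one call's outgoing stack bytes; the area is the max over calls. */
static void frame_note_call(Frame* f, uint32_t stack_bytes) {
  uint32_t rounded = align_up(stack_bytes, 16u);
  assert(!f->frame_final);
  if (rounded > f->max_outgoing) f->max_outgoing = rounded;
}

/* Fix the frame. After this the prologue can be emitted with final
 * immediates and nothing needs patching. Returns the total frame size. */
static uint32_t frame_finalize(Frame* f) {
  f->frame_final = 1;
  return align_up(f->slot_bytes, 16u) + f->max_outgoing;
}
```

The point of the guard is that a late slot request fails loudly instead of silently growing a frame whose size the prologue has already encoded.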

Emit no longer invents body-time frame slots: aggregate/oversized call
results land directly in their frame home (dropping a redundant temp +
store/reload/copy), and a store whose source aliases its address base
collapses the address into a scratch register instead of spilling.
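
The reason the aliasing store needs no spill can be shown with a small sketch. This is an assumption-laden illustration, not `collapse_addr_to_reg` itself: the scratch pool, the `NO_REG` sentinel, and `pick_scratch` are hypothetical. If the scratch is chosen to differ from both the base and the index, it also differs from a source that aliases either of them.

```c
#include <assert.h>

#define NO_REG 0xffu /* hypothetical "no register" sentinel */

/* Hypothetical scratch pool; the real backend has its own temporaries. */
static const unsigned kScratch[3] = {16u, 17u, 9u};

/* Pick a scratch register distinct from both address registers. A source
 * that aliases base or index is therefore distinct from the result too. */
static unsigned pick_scratch(unsigned base, unsigned index) {
  for (unsigned i = 0; i < 3u; ++i)
    if (kScratch[i] != base && kScratch[i] != index) return kScratch[i];
  /* Three candidates and at most two exclusions: never reached. */
  assert(!"no scratch register available");
  return NO_REG;
}
```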

As a side effect of deciding slim-prologue eligibility before emitting, the
leading `b PC+4; nop` filler (old fat-then-patch-to-slim artifact) is gone
at every function entry; prologue/epilogue are otherwise byte-identical.
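
The eligibility decision that now happens before emission can be sketched as below. The conditions mirror those in `aa_func_begin_known_frame` in this diff; the `PlannedFrame` struct, the `PrologueForm` enum, and `choose_prologue` are hypothetical names introduced only for illustration. The 504 limit is the largest offset that fits `stp`'s signed 7-bit immediate scaled by 8.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
  PROLOGUE_FAT,
  PROLOGUE_SLIM_SMALL_FRAME,
  PROLOGUE_SLIM
} PrologueForm;

/* Hypothetical summary of a fully planned frame. */
typedef struct PlannedFrame {
  uint32_t ncallee_saves;
  uint32_t slot_bytes;     /* static slots: locals, spills, sret, variadic */
  uint32_t out_stack;      /* outgoing stack-argument bytes */
  uint32_t saved_pair_off; /* sp-relative offset of the x29/x30 pair */
  int has_alloca;
} PlannedFrame;

/* Decide the prologue form from the planned frame alone, so the chosen
 * form can be emitted directly with no filler and no later patch. */
static PrologueForm choose_prologue(const PlannedFrame* f) {
  assert(f != 0);
  if (f->has_alloca) return PROLOGUE_FAT; /* sp moves during the body */
  if (f->ncallee_saves == 0 && f->slot_bytes == 0 && f->out_stack == 0)
    return PROLOGUE_SLIM;
  if (f->saved_pair_off <= 504u) return PROLOGUE_SLIM_SMALL_FRAME;
  return PROLOGUE_FAT;
}
```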

aa_emit_prologue / emit_minimal_prologue retired from the O1 path. x64/rv64
untouched.

Diffstat:
M doc/OPT_O1_PERF_TODO.md    |  44 ++++++++++++++++++++------------------------
M src/arch/aa64/native.c     | 258 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------
M src/arch/native_target.h   |  23 +++++++++++++++++++++++
M src/opt/pass_native_emit.c | 201 +++++++++++++++++++++++++++++++++++++++++++++++++++----------------------------
4 files changed, 361 insertions(+), 165 deletions(-)

diff --git a/doc/OPT_O1_PERF_TODO.md b/doc/OPT_O1_PERF_TODO.md
@@ -50,25 +50,16 @@ overhead** — every byte of which is multiplied by 7.6M.
 Open items, in priority order (most recent disasm in
 `/tmp/mc/binary-trees.cfree.o`):
 
-1. **Useless leading `b PC+4` at every function entry.** All four functions
-   still start with this:
-   ```
-     sub  sp, sp, #0x20
-     stp  x29, x30, [sp, #0x10]
-     add  x29, sp, #0x10
-     stp  x20, x19, [x29, #-0x10]
-     b    PC+4               <-- branches to the very next instruction
-     mov  x19, x0
-   ```
-   Root cause: commit `9bd61e8` ("emit param_decls into a dedicated prologue
-   block") added an empty-on-emit entry block ahead of the function body.
-   `opt_jump_cleanup`'s helper `empty_fallthrough_block` in
-   `src/opt/pass_jump.c` then explicitly bails out when `block == f->entry`,
-   so the empty entry block is never merged into its single successor.
-   Lifting that guard (with whatever safety condition `9bd61e8` was protecting
-   against — likely "first body block is a loop header") would let
-   jump-cleanup absorb the prologue block in the common case. **+1 insn ×
-   7.6M calls.**
+1. **Useless leading `b PC+4` at every function entry. [DONE]** Fixed by the
+   known-frame prologue rework (frame planning in `opt_emit_native` →
+   `func_begin_known_frame`). The branch was *not* the empty entry block — it
+   was filler left by the old two-phase prologue: `aa_emit_prologue` sized a
+   *fat* prologue (slim eligibility wasn't decided until `func_end`), then the
+   `func_end` patch rebuilt it as the shorter slim form and back-filled the
+   leftover words with `b PC+N; nop`. The known-frame path decides slim
+   eligibility *before* emitting, so it emits the slim/slim_small_frame prologue
+   directly — no filler, no patch. Functions now fall straight from the prologue
+   into the body. **-1 to -2 insns at entry, every function.**
 
 2. **Prologue compaction: 4-insn → 2-insn pre-indexed.** Half-done. cfree
    today emits:
@@ -157,13 +148,18 @@ worth comparing the splice inner loop directly.
 These help several benches. Both are partial; the binary-trees items above are
 the most concrete tests for whether each is complete.
 
-1. **Drop the leading `b PC+4` at function entry.** See binary-trees item 1.
-   Affects every cfree-compiled function, not just binary-trees.
+1. **Drop the leading `b PC+4` at function entry. [DONE]** See binary-trees
+   item 1. Resolved by the known-frame prologue (the optimizer plans the whole
+   frame up front, so the prologue is emitted final in its slim form rather than
+   fat-then-patched). Affected every cfree-compiled function.
 
 2. **Compact FP-frame prologue/epilogue.** See binary-trees item 2. The
-   2-insn pre-indexed form is wired in for the no-callee-save case; needs
-   to be extended to small frames with 1–2 callee-saves. Biggest absolute
-   payoff on call-heavy benches.
+   2-insn pre-indexed form is wired in for the no-callee-save case (Tier A);
+   extending it to small frames with 1–2 callee-saves still needs the frame
+   record moved to the bottom of the frame (fp-at-bottom layout) so a single
+   `stp x29,x30,[sp,#-N]!` covers both the sp decrement and the save — a
+   separable layout change, not unlocked by the frame-planning rework alone.
+   Biggest absolute payoff still open on call-heavy benches.
 
 3. **Hard-register copy coalescing for `IR_LOAD_IMM` sources.** See
    binary-trees item 3. The hint-propagation path covers `ldr` → call-arg
diff --git a/src/arch/aa64/native.c b/src/arch/aa64/native.c
@@ -210,6 +210,23 @@ typedef struct AANativeTarget {
    * when both would apply). Applies to almost every function with a small
    * frame, including those with callee-saves and locals. */
   u8 slim_small_frame;
+  /* Set by aa_func_begin_known_frame (optimizer path: the full frame is known
+   * up front, so the prologue, allocas, and tail epilogues are emitted final
+   * with no back-patching). Cleared by aa_func_begin (NativeDirectTarget
+   * single-pass path: worst-case prologue region reserved and patched, alloca /
+   * tail sites recorded and patched at func_end). This flag is the single
+   * discriminator between the two strategies throughout this file. */
+  u8 known_frame;
+  /* Set when the function body contains a dynamic alloca. On the known-frame
+   * path it comes from NativeKnownFrameDesc.has_alloca (needed before the body
+   * to settle slim-epilogue eligibility); on the single-pass path it tracks
+   * nalloca_patches. Disqualifies the slim small-frame epilogue. */
+  u8 has_alloca;
+  /* Set on the known-frame path once the frame is fixed and the prologue
+   * emitted. Any frame_slot request after this point would grow the frame the
+   * prologue already encoded — a silent miscompile — so aa_frame_slot panics.
+   * The optimizer is expected to plan every slot before the body. */
+  u8 frame_final;
 } AANativeTarget;
 
 static AANativeTarget* aa_of(NativeTarget* t) { return (AANativeTarget*)t; }
@@ -906,10 +923,11 @@ static void aa_emit_q_frame(AANativeTarget* a, int load, u32 qreg,
   aa_emit32(mc, aa_ldst_q_uimm(load, qreg, AA_TMP1, 0));
 }
 
-static void aa_emit_variadic_reg_saves(AANativeTarget* a) {
+/* Reserve the variadic register-save-area frame slots (gp then fp). Split from
+ * the store emission so the known-frame path can fix the full frame — including
+ * these slots — before the prologue, then emit the stores after it. */
+static void aa_reserve_variadic_reg_saves(AANativeTarget* a) {
   NativeFrameSlotDesc sd;
-  NativeAddr addr;
-  MemAccess mem;
   CfreeCgTypeId i64 = builtin_id(CFREE_CG_BUILTIN_I64);
   ABIVaListInfo vai = abi_va_list_layout(a->base.c->abi);
   if (vai.kind != ABI_VA_LIST_AAPCS64) return;
@@ -922,6 +940,16 @@ static void aa_emit_variadic_reg_saves(AANativeTarget* a) {
   sd.size = vai.fp_reg_count * vai.fp_slot_size;
   sd.align = 16;
   a->va_vr_slot = a->base.frame_slot(&a->base, &sd);
+}
+
+/* Emit the stores into the variadic register-save area. Slots must already be
+ * reserved (aa_reserve_variadic_reg_saves). */
+static void aa_emit_variadic_reg_save_stores(AANativeTarget* a) {
+  NativeAddr addr;
+  MemAccess mem;
+  CfreeCgTypeId i64 = builtin_id(CFREE_CG_BUILTIN_I64);
+  ABIVaListInfo vai = abi_va_list_layout(a->base.c->abi);
+  if (vai.kind != ABI_VA_LIST_AAPCS64) return;
   memset(&mem, 0, sizeof mem);
   mem.type = i64;
   mem.size = 8;
@@ -941,7 +969,10 @@ static void aa_emit_variadic_reg_saves(AANativeTarget* a) {
 
 static void aa_emit_entry_saves(AANativeTarget* a);
 
-static void aa_func_begin(NativeTarget* t, const CGFuncDesc* fd) {
+/* Per-function state reset + function-symbol / cfi / prologue-anchor setup
+ * shared by both entry points (aa_func_begin for the single-pass path,
+ * aa_func_begin_known_frame for the optimizer path). Emits no prologue. */
+static void aa_func_begin_common(NativeTarget* t, const CGFuncDesc* fd) {
   AANativeTarget* a = aa_of(t);
   MCEmitter* mc = t->mc;
   a->func = fd;
@@ -968,6 +999,9 @@ static void aa_func_begin(NativeTarget* t, const CGFuncDesc* fd) {
   a->ncallee_saves = 0;
   a->slim_prologue = 0;
   a->slim_small_frame = 0;
+  a->known_frame = 0;
+  a->has_alloca = 0;
+  a->frame_final = 0;
   mc->set_section(mc, fd->text_section_id);
   mc->emit_align(mc, 4, 0);
   a->func_start = mc->pos(mc);
@@ -976,49 +1010,72 @@ static void aa_func_begin(NativeTarget* t, const CGFuncDesc* fd) {
   a->prologue_pos = mc->pos(mc);
   a->minimal_prologue_words = 0;
   a->epilogue_label = mc->label_new(mc);
-  /* Optimizer path: emit nothing here. The exact-size prologue and the
-   * sret/variadic entry saves are emitted later by aa_emit_prologue, once the
-   * callee-save set and frame slots are known. The single-pass path reserves a
-   * worst-case region (patched in func_end) and emits the entry saves now. */
-  if (t->emit_minimal_prologue) return;
+}
+
+/* Single-pass (NativeDirectTarget) entry point: the frame is not known up
+ * front, so reserve a worst-case prologue region (patched in aa_func_end once
+ * max_outgoing / callee-saves are final) and emit the entry saves now. */
+static void aa_func_begin(NativeTarget* t, const CGFuncDesc* fd) {
+  AANativeTarget* a = aa_of(t);
+  MCEmitter* mc = t->mc;
+  aa_func_begin_common(t, fd);
   for (u32 i = 0; i < AA_PROLOGUE_WORDS; ++i) aa_emit32(mc, 0xd503201fu);
   aa_emit_entry_saves(a);
 }
 
-/* Emit the sret-pointer save (x8 → slot) and, for variadic functions, the
- * argument register-save area. Run immediately after the prologue frame setup
- * on both the single-pass path (from func_begin) and the optimizer path (from
- * aa_emit_prologue). */
-static void aa_emit_entry_saves(AANativeTarget* a) {
+/* Reserve the entry-save frame slots: the sret-pointer home (x8) and, for
+ * variadic functions, the argument register-save area. Reserving is split from
+ * emitting so the known-frame path can fix the full frame before the prologue;
+ * the single-pass path runs both back to back via aa_emit_entry_saves. */
+static void aa_reserve_entry_saves(AANativeTarget* a) {
   NativeTarget* t = &a->base;
   const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, a->func->fn_type);
   if (abi && abi->has_sret) {
     NativeFrameSlotDesc sd;
-    NativeAddr addr;
-    NativeLoc src;
-    MemAccess mem;
     memset(&sd, 0, sizeof sd);
     sd.type = builtin_id(CFREE_CG_BUILTIN_I64);
     sd.size = 8;
     sd.align = 8;
     sd.kind = NATIVE_FRAME_SLOT_SAVE;
     a->sret_ptr_slot = t->frame_slot(t, &sd);
+  }
+  if (abi && abi->variadic) aa_reserve_variadic_reg_saves(a);
+}
+
+/* Emit the entry-save stores (x8 → sret slot, then the variadic reg-save area).
+ * Slots must already be reserved (aa_reserve_entry_saves). */
+static void aa_emit_entry_save_stores(AANativeTarget* a) {
+  NativeTarget* t = &a->base;
+  const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, a->func->fn_type);
+  if (abi && abi->has_sret) {
+    NativeAddr addr;
+    NativeLoc src;
+    MemAccess mem;
+    CfreeCgTypeId i64 = builtin_id(CFREE_CG_BUILTIN_I64);
     memset(&addr, 0, sizeof addr);
     addr.base_kind = NATIVE_ADDR_BASE_FRAME;
     addr.base.frame = a->sret_ptr_slot;
-    addr.base_type = sd.type;
+    addr.base_type = i64;
     memset(&src, 0, sizeof src);
     src.kind = NATIVE_LOC_REG;
     src.cls = NATIVE_REG_INT;
-    src.type = sd.type;
+    src.type = i64;
     src.v.reg = 8u;
     memset(&mem, 0, sizeof mem);
-    mem.type = sd.type;
+    mem.type = i64;
     mem.size = 8;
     mem.align = 8;
     aa_emit_mem(a, 0, src, addr, mem);
   }
-  if (abi && abi->variadic) aa_emit_variadic_reg_saves(a);
+  if (abi && abi->variadic) aa_emit_variadic_reg_save_stores(a);
+}
+
+/* Reserve + emit the entry saves back to back. Single-pass (NativeDirectTarget)
+ * path, where the prologue region is a reserved worst-case block and slot
+ * offsets need not be final before it. */
+static void aa_emit_entry_saves(AANativeTarget* a) {
+  aa_reserve_entry_saves(a);
+  aa_emit_entry_save_stores(a);
 }
 
 static void aa_note_frame_state(NativeTarget* t,
@@ -1198,7 +1255,7 @@ static void aa_words_callee_saves(AANativeTarget* a, int save, u32* words,
 /* Build the prologue instruction words for `L` into `words` (capacity `cap`),
  * returning the count. Shared by the NativeDirectTarget patch path (reserves
  * a fixed worst-case region, then patches it here) and the optimizer path
- * (emits an exact-size region up front; see aa_emit_prologue).
+ * (aa_func_begin_known_frame emits exactly these words up front).
  *
  * All three variants establish the same post-prologue state defined by L:
  *   sp = caller's sp - L->frame_size
@@ -1263,23 +1320,6 @@ static void aa_patch_prologue(AANativeTarget* a, const AAFrameLayout* L,
     aa_patch32(a->base.obj, sec, a->prologue_pos + i * 4u, words[i]);
 }
 
-/* Optimizer path: emit an exact-size prologue in place (no reserved NOP
- * region). The callee-save set and the static frame slots are final by now, so
- * the prologue's instruction count is fixed; only the frame-size immediates
- * (sub sp / save-area address / fp = sp+saved_pair) still depend on body-
- * emitted temporaries and are patched in func_end. We size the region with
- * a frame that fits add/sub's imm12 (the real frame must too, or func_end's
- * rebuild — capped at this length — panics). */
-static void aa_emit_prologue(NativeTarget* t) {
-  AANativeTarget* a = aa_of(t);
-  u32 words[AA_PROLOGUE_WORDS];
-  AAFrameLayout L = aa_build_layout(a->cum_off, a->max_outgoing);
-  u32 n = aa_build_prologue_words(a, &L, words, AA_PROLOGUE_WORDS);
-  for (u32 i = 0; i < n; ++i) aa_emit32(t->mc, words[i]);
-  a->minimal_prologue_words = n;
-  aa_emit_entry_saves(a);
-}
-
 static void aa_emit_restore_frame(AANativeTarget* a, const AAFrameLayout* L) {
   MCEmitter* mc = a->base.mc;
   u32 words[AA_PROLOGUE_WORDS];
@@ -1349,35 +1389,27 @@ static void aa_func_end(NativeTarget* t) {
   AANativeTarget* a = aa_of(t);
   MCEmitter* mc = t->mc;
   AAFrameLayout L = aa_build_layout(a->cum_off, a->max_outgoing);
-  /* Optimizer path emitted an exact-size prologue (minimal_prologue_words);
-   * the single-pass path reserved a fixed worst-case region. Either way the
-   * frame-size immediates are only final now, so patch the region in place. */
+  /* known_frame (optimizer): prologue, allocas, and tail epilogues were emitted
+   * final and slim eligibility was settled in aa_func_begin_known_frame — there
+   * is nothing to patch. Single-pass (NDT): a worst-case prologue region was
+   * reserved and the deferred patches recorded; resolve them now that the frame
+   * is final. The NDT path always uses the fat prologue/epilogue (slim_* left 0
+   * by aa_func_begin_common, since its reserved region is much larger). */
   u32 prologue_region =
-      t->emit_minimal_prologue ? a->minimal_prologue_words : AA_PROLOGUE_WORDS;
-  /* Slim Tier A eligibility (set before emitting the epilogue / patching the
-   * prologue so the *_restore_frame / *_build_prologue_words helpers pick the
-   * slim form). Conditions: no callee-saves needed, no alloca, no body
-   * slots (locals/spills/sret/variadic — all counted in slot_bytes), no
-   * outgoing stack args, optimizer path only (the NDT reserves a much
-   * larger prologue region). */
-  a->slim_prologue =
-      t->emit_minimal_prologue && a->ncallee_saves == 0 && a->nalloca == 0 &&
-      L.slot_bytes == 0 && L.out_stack == 0;
-  /* Universal small-frame fast path: skip the x17/x10 scratch when the
-   * saved-pair offset fits stp's signed 7-bit scaled immediate. Mutually
-   * exclusive with the Tier A slim form (Tier A is strictly tighter).
-   * Disqualify alloca: alloca dynamically moves sp during the body, and the
-   * fat epilogue (sp = fp + 16 via x10) is what restores sp from fp; the
-   * slim_small_frame epilogue's `add sp, sp, #N` only undoes the static
-   * frame, leaving sp pointing into the alloca area. */
-  a->slim_small_frame = !a->slim_prologue && a->nalloca == 0 &&
-                        aa_sp_off_saved_pair(&L) <= 504u;
+      a->known_frame ? a->minimal_prologue_words : AA_PROLOGUE_WORDS;
   mc->label_place(mc, a->epilogue_label);
   aa_emit_callee_restores(a);
   aa_emit_restore_frame(a, &L);
   aa_emit32(mc, aa64_ret(AA_LR));
-  aa_patch_prologue(a, &L, prologue_region);
-  aa_apply_patches(a, &L);
+  if (a->known_frame) {
+    /* The frame-planning pre-pass plus final prologue/alloca/tail emission must
+     * leave nothing deferred; a stray patch would mean a body-time frame change
+     * the final prologue never saw. */
+    if (a->npatches != 0) aa_panic(a, "known-frame path left deferred patches");
+  } else {
+    aa_patch_prologue(a, &L, prologue_region);
+    aa_apply_patches(a, &L);
+  }
   if (mc->cfi_set_next_pc_offset && mc->cfi_def_cfa && mc->cfi_offset) {
     mc->cfi_set_next_pc_offset(mc, prologue_region * 4u);
     /* CFA = caller's sp = fp + AA_FRAME_SAVE_SIZE. saved fp/lr at fp/fp+8
@@ -1409,6 +1441,8 @@ static NativeFrameSlot aa_frame_slot(NativeTarget* t,
   AANativeSlot* s;
   u32 size = d->size ? d->size : 8u;
   u32 align = d->align ? d->align : 1u;
+  if (a->frame_final)
+    aa_panic(a, "frame slot requested after known-frame prologue");
   if (a->nslots == a->slots_cap) {
     u32 cap = a->slots_cap ? a->slots_cap * 2u : 16u;
     AANativeSlot* nb = arena_zarray(t->c->tu, AANativeSlot, cap);
@@ -1425,19 +1459,62 @@ static NativeFrameSlot aa_frame_slot(NativeTarget* t,
   return a->nslots;
 }
 
+/* Optimizer entry point: the full frame is supplied up front, so the prologue,
+ * entry saves, slim-form eligibility, allocas, and tail epilogues are all final
+ * the moment they are emitted — no back-patching (aa_func_end skips the patch
+ * passes when a->known_frame). Slot creation order matches the single-pass path
+ * (callee-saves first for stur range, then the static slots, then sret/variadic
+ * entry saves) so offsets are identical to what the patch path would produce. */
 static void aa_func_begin_known_frame(NativeTarget* t, const CGFuncDesc* fd,
                                       const NativeKnownFrameDesc* frame,
                                       NativeFrameSlot* out_slots) {
-  aa_func_begin(t, fd);
+  AANativeTarget* a = aa_of(t);
+  AAFrameLayout L;
+  u32 words[AA_PROLOGUE_WORDS];
+  u32 n;
+  aa_func_begin_common(t, fd);
+  a->known_frame = 1;
   if (frame) {
-    AANativeTarget* a = aa_of(t);
-    if (frame->max_outgoing > a->max_outgoing)
-      a->max_outgoing = frame->max_outgoing;
+    a->has_alloca = frame->has_alloca;
+    if (frame->callee_saved_used && frame->ncallee_classes)
+      aa_reserve_callee_saves(t, frame->callee_saved_used,
+                              frame->ncallee_classes);
     for (u32 i = 0; i < frame->nslots; ++i) {
       NativeFrameSlot slot = aa_frame_slot(t, &frame->slots[i]);
       if (out_slots) out_slots[i] = slot;
     }
+    aa_reserve_entry_saves(a);
+    /* Reserve the atomic-RMW scratch spill last (matching its lazy position in
+     * the single-pass path), so aa_saved_tmp_spill reuses it instead of growing
+     * the frame mid-body. */
+    if (frame->needs_scratch_spill) {
+      NativeFrameSlotDesc sd;
+      memset(&sd, 0, sizeof sd);
+      sd.type = builtin_id(CFREE_CG_BUILTIN_I64);
+      sd.size = 8;
+      sd.align = 8;
+      sd.kind = NATIVE_FRAME_SLOT_SPILL;
+      a->saved_tmp_slot = a->base.frame_slot(&a->base, &sd);
+    }
+    if (frame->max_outgoing > a->max_outgoing)
+      a->max_outgoing = frame->max_outgoing;
   }
+  /* Frame is final: slot_bytes (cum_off) and out_stack (max_outgoing) are both
+   * known, so the prologue immediates and slim-form choice are settled here. */
+  L = aa_build_layout(a->cum_off, a->max_outgoing);
+  /* Slim Tier A: no callee-saves, no alloca, no body slots, no outgoing stack
+   * args. slim_small_frame: skip the x17/x10 scratch when the saved-pair offset
+   * fits stp's signed 7-bit scaled immediate. (See aa_func_end for the
+   * single-pass path, which never takes the slim form.) */
+  a->slim_prologue = a->ncallee_saves == 0 && !a->has_alloca &&
+                     L.slot_bytes == 0 && L.out_stack == 0;
+  a->slim_small_frame = !a->slim_prologue && !a->has_alloca &&
+                        aa_sp_off_saved_pair(&L) <= 504u;
+  n = aa_build_prologue_words(a, &L, words, AA_PROLOGUE_WORDS);
+  for (u32 i = 0; i < n; ++i) aa_emit32(t->mc, words[i]);
+  a->minimal_prologue_words = n;
+  a->frame_final = 1;
+  aa_emit_entry_save_stores(a);
 }
 
 static void aa_spill(NativeTarget* t, NativeLoc src, NativeFrameSlot slot,
@@ -2028,14 +2105,22 @@ static void aa_alloca(NativeTarget* t, NativeLoc dst, NativeLoc size,
   aa_emit32(t->mc, aa64_add_imm(1, AA_TMP1, AA_SP, 0, 0));
   aa_emit32(t->mc, aa64_sub(1, AA_TMP1, AA_TMP1, AA_TMP0));
   aa_emit32(t->mc, aa64_add_imm(1, AA_SP, AA_TMP1, 0, 0));
-  {
+  /* The alloca result is sp + outgoing-area bytes. On the known-frame path
+   * max_outgoing is already final, so emit the final `add dst, sp, #N` here; on
+   * the single-pass path it is not known yet, so record a patch. */
+  if (a->known_frame) {
+    u32 imm12, sh;
+    if (!aa64_addsub_imm_fits(a->max_outgoing, &imm12, &sh))
+      aa_panic(a, "outgoing area too large for alloca result");
+    aa_emit32(t->mc, aa64_add_imm(1, loc_reg(dst), AA_SP, imm12, sh));
+  } else {
     AAPatch* p = aa_patch_alloc(a);
     p->kind = AA_PATCH_ALLOCA;
     p->pos = t->mc->pos(t->mc);
     p->u.dst_reg = loc_reg(dst);
     a->nalloca++;
+    aa_emit32(t->mc, aa64_add_imm(1, loc_reg(dst), AA_SP, 0, 0));
   }
-  aa_emit32(t->mc, aa64_add_imm(1, loc_reg(dst), AA_SP, 0, 0));
 }
 
 static MemAccess aa_mem_for_type(NativeTarget* t, CfreeCgTypeId type,
@@ -2283,6 +2368,14 @@ static u32 aa_signature_stack_bytes(NativeTarget* t, CfreeCgTypeId fn_type,
   return aa_call_stack_size(t, &d);
 }
 
+/* Pure NativeTarget.call_stack_bytes: outgoing stack bytes for a full call
+ * descriptor (handles variadic stack args, unlike signature_stack_bytes which
+ * sees only the fixed params). aa_call_stack_size reads only fn_type and each
+ * args[i].type, so the frame-planning pre-pass can call this before emitting. */
+static u32 aa_call_stack_bytes(NativeTarget* t, const NativeCallDesc* desc) {
+  return aa_call_stack_size(t, desc);
+}
+
 /* One register-passed call argument: write `src` (or its address) into the
  * argument register `dst`. Collected during planning and emitted as a batch so
  * the backend can order them as a parallel copy (see aa_emit_reg_arg_moves). */
@@ -2535,6 +2628,31 @@ static void aa_ret(NativeTarget* t);
 
 static void aa_emit_tail_site(NativeTarget* t, NativeLoc callee) {
   AANativeTarget* a = aa_of(t);
+  if (a->known_frame) {
+    /* Frame is final: emit the tail epilogue (callee restores + frame restore +
+     * branch) directly, exactly the words aa_apply_patches would patch in but
+     * without the reserved NOP padding. */
+    AAFrameLayout L = aa_build_layout(a->cum_off, a->max_outgoing);
+    u32 words[AA_TAIL_WORDS];
+    u32 n = 0;
+    aa_words_callee_restores(a, words, AA_TAIL_WORDS, &n);
+    aa_words_restore_frame(a, words, AA_TAIL_WORDS, &n, &L);
+    if (n >= AA_TAIL_WORDS) aa_panic(a, "tail epilogue too large");
+    for (u32 i = 0; i < n; ++i) aa_emit32(t->mc, words[i]);
+    if (callee.kind == NATIVE_LOC_REG) {
+      aa_emit32(t->mc, aa64_br(loc_reg(callee)));
+    } else if (callee.kind == NATIVE_LOC_GLOBAL) {
+      u32 pos = t->mc->pos(t->mc);
+      aa_emit32(t->mc, aa64_b(0));
+      t->mc->emit_reloc_at(t->mc, t->mc->section_id, pos, R_AARCH64_JUMP26,
+                           callee.v.global.sym, callee.v.global.addend, 0, 0);
+    } else {
+      aa_panic(a, "unsupported tail target");
+    }
+    return;
+  }
+  /* Single-pass: reserve a worst-case region and record a patch; the callee
+   * restores and frame restore depend on the not-yet-final frame layout. */
   AAPatch* p = aa_patch_alloc(a);
   p->kind = AA_PATCH_TAIL;
   p->pos = t->mc->pos(t->mc);
@@ -3299,8 +3417,8 @@ NativeTarget* aa64_native_target_new(Compiler* c, ObjBuilder* obj,
   t->func_begin_known_frame = aa_func_begin_known_frame;
   t->note_frame_state = aa_note_frame_state;
   t->reserve_callee_saves = aa_reserve_callee_saves;
-  t->emit_prologue = aa_emit_prologue;
   t->signature_stack_bytes = aa_signature_stack_bytes;
+  t->call_stack_bytes = aa_call_stack_bytes;
   t->has_store_zero_reg = 1;
   t->store_zero_reg = 31u; /* wzr/xzr in the Rt position of a store */
   t->func_end = aa_func_end;
diff --git a/src/arch/native_target.h b/src/arch/native_target.h
@@ -48,6 +48,22 @@ typedef struct NativeKnownFrameDesc {
   u32 nslots;
   u32 max_outgoing;
   u32 align;
+  /* Callee-saved hard registers the allocator assigned, one bitmask per
+   * NativeAllocClass (indexed by class id). The backend reserves a save slot
+   * and emits the prologue save / epilogue restore for each — equivalent to a
+   * reserve_callee_saves() call, but folded into the known-frame setup so the
+   * full frame is fixed before the prologue is emitted. NULL / 0 means none. */
+  const u32* callee_saved_used;
+  u32 ncallee_classes;
+  /* Whether the function body contains a dynamic alloca. The backend needs this
+   * up front (before the body) to decide prologue/epilogue form, since with a
+   * known frame the slim-epilogue eligibility is settled at func_begin. */
+  u8 has_alloca;
+  /* Whether the body has an operation that needs a backend-internal scratch
+   * spill slot — on aa64, an atomic read-modify-write, whose retry loop spills
+   * one scratch register. The backend reserves the slot up front so the body
+   * never grows the frame after the prologue. */
+  u8 needs_scratch_spill;
 } NativeKnownFrameDesc;
 
 typedef enum NativeAllocClass {
@@ -298,6 +314,13 @@ struct NativeTarget {
    * out-pointer may be NULL. May itself be NULL. */
   u32 (*signature_stack_bytes)(NativeTarget*, CfreeCgTypeId fn_type,
                                int* variadic, u32* nparams);
+  /* Pure query: the outgoing stack-argument bytes a call with this descriptor
+   * uses, rounded to the ABI's outgoing-area alignment. Reads only fn_type,
+   * flags, nargs, and each args[i].type — never argument *locations* — so the
+   * optimizer can call it in a frame-planning pre-pass, before any argument
+   * marshalling is emitted, to size the outgoing area. Must equal the
+   * stack_arg_size plan_call computes for the same descriptor. May be NULL. */
+  u32 (*call_stack_bytes)(NativeTarget*, const NativeCallDesc*);
   /* Integer hardware zero register, if the ISA has one (aa64 wzr/xzr, rv64
    * x0). When `has_store_zero_reg` is set, the emit path stores a constant 0
    * straight from `store_zero_reg` instead of materializing 0 into a scratch
diff --git a/src/opt/pass_native_emit.c b/src/opt/pass_native_emit.c
@@ -24,7 +24,6 @@ typedef struct NativeEmitCtx {
   NativeFrameSlot* slot_map;
   MCLabel* labels;
   u8* label_placed;
-  u32 max_outgoing;
   ObjSecId local_static_sec;
   ObjSymId local_static_sym;
   u32 local_static_base;
@@ -628,19 +627,19 @@ static void emit_call(NativeEmitCtx* e, Inst* in) {
     args[i] = abi_storage_loc(e, &aux->desc.args[i], in->loc);
   if (aux->desc.ret.storage.kind) {
     CfreeCgTypeId rty = aux->desc.ret.type;
-    int scalar = !cg_type_is_aggregate(e->c, rty) &&
-                 type_size_or(e->c, rty, 8u) <= 8u;
     results = arena_zarray(e->f->arena, NativeLoc, 1);
     final_result = abi_storage_loc(e, &aux->desc.ret, in->loc);
-    /* Scalar result: hand plan_call the value's real destination (the MIR
-     * result reg, or its spill slot) directly, so it emits one move out of the
-     * ABI result register. Routing every scalar result through a fresh temp
-     * slot — store x0 then immediately reload — was a pure round trip on every
-     * call; emit_ret already avoids the analogous trip on returns. The temp
-     * slot is kept only for aggregate / oversized results, which plan_call /
-     * the callee write in parts and must land in memory. */
-    if (scalar && (final_result.kind == NATIVE_LOC_REG ||
-                   final_result.kind == NATIVE_LOC_FRAME)) {
+    /* Hand plan_call the value's real destination directly whenever it is a
+     * register or a frame slot: a scalar result is a single move out of the ABI
+     * result register, and an aggregate / oversized result — which plan_call or
+     * the callee writes in parts and so must land in memory — lands straight in
+     * its frame home. Routing either through a fresh temp slot (store then
+     * reload / copy_bytes) was a pure round trip on every call. The temp slot is
+     * a fallback for the rare result whose storage is neither a register nor a
+     * frame slot (e.g. written into a global); lowering hoists aggregates to a
+     * frame home (opt_lower_to_mir), so this branch is scalar-only in practice. */
+    if (final_result.kind == NATIVE_LOC_REG ||
+        final_result.kind == NATIVE_LOC_FRAME) {
       results[0] = final_result;
     } else {
       result_slot = temp_slot(e, rty, in->loc, NATIVE_FRAME_SLOT_SPILL);
@@ -657,8 +656,6 @@ static void emit_call(NativeEmitCtx* e, Inst* in) {
   d.tail_policy = aux->desc.tail_policy;
   d.inline_policy = aux->desc.inline_policy;
   e->target->plan_call(e->target, &d, &plan);
-  if (plan.stack_arg_size > e->max_outgoing)
-    e->max_outgoing = plan.stack_arg_size;
   for (u32 i = 0; i < plan.nargs; ++i)
     write_loc(e, plan.args[i].dst, plan.args[i].src, plan.args[i].mem, in->loc);
   if (plan.callee.kind != NATIVE_LOC_REG &&
@@ -805,19 +802,15 @@ static void emit_inst(NativeEmitCtx* e, u32 block, u32 order_index, Inst* in,
             class_for_type(e, in->opnds[1].type) == NATIVE_REG_INT)
           src = loc_reg(in->opnds[1].type, NATIVE_REG_INT,
                         e->target->store_zero_reg);
+        /* Source register aliases the address base/index (e.g. `*p = (T)p`).
+         * Collapse the address into a scratch register: collapse_addr_to_reg
+         * selects a scratch distinct from both base and index — hence distinct
+         * from `src` — so the store reads `src` and writes through the fresh
+         * scratch with no alias. This stays entirely in registers; the frame is
+         * fully planned before emission, so emit never allocates a slot here. */
         if (src.kind == NATIVE_LOC_REG && (src.v.reg == addr_base_reg(&addr) ||
-                                           src.v.reg == addr_index_reg(&addr))) {
-          NativeFrameSlot slot =
-              temp_slot(e, in->opnds[1].type, in->loc, NATIVE_FRAME_SLOT_SPILL);
-          NativeLoc frame = loc_frame(in->opnds[1].type,
-                                      class_for_type(e, in->opnds[1].type), slot);
-          write_loc(e, frame, src, mem_for_type(e->c, in->opnds[1].type),
-                    in->loc);
+                                           src.v.reg == addr_index_reg(&addr)))
           collapse_addr_to_reg(e, &addr, in->loc);
-          src = materialize(e, frame, class_for_type(e, in->opnds[1].type),
-                            in->opnds[1].type, addr_base_reg(&addr), REG_NONE,
-                            in->loc);
-        }
         if (src.kind != NATIVE_LOC_REG) {
           if (!scratch_available(e, class_for_type(e, in->opnds[1].type),
                                  addr_base_reg(&addr), addr_index_reg(&addr)))
@@ -1286,24 +1279,6 @@ static void emit_block(NativeEmitCtx* e, u32 block, u32 order_index,
   }
 }
 
-static void map_frame_slots(NativeEmitCtx* e) {
-  e->slot_map =
-      arena_zarray(e->f->arena, NativeFrameSlot, e->f->nframe_slots + 1u);
-  for (u32 i = 0; i < e->f->nframe_slots; ++i) {
-    IRFrameSlot* s = &e->f->frame_slots[i];
-    NativeFrameSlotDesc d;
-    memset(&d, 0, sizeof d);
-    d.type = s->type;
-    d.name = s->name;
-    d.loc = s->loc;
-    d.size = s->size;
-    d.align = s->align;
-    d.kind = s->kind;
-    d.flags = s->flags;
-    e->slot_map[s->id] = e->target->frame_slot(e->target, &d);
-  }
-}
-
 #define EMIT_MAX_REG_CLASSES 4u
 
 static void collect_used_reg(Func* f, Inst* in, OptOperand* op, int is_def,
@@ -1317,37 +1292,128 @@ static void collect_used_reg(Func* f, Inst* in, OptOperand* op, int is_def,
   used[op->cls] |= 1u << op->v.reg;
 }
 
-/* After register allocation the MIR names hard registers directly, so we can
- * scan it for the callee-saved registers the allocator assigned and ask the
- * target to save/restore them. Must run after func_begin and before frame-slot
- * mapping so the target can place the save slots first. */
-static void reserve_callee_saves(NativeEmitCtx* e) {
+/* After register allocation the MIR names hard registers directly, so we scan
+ * it for the callee-saved registers the allocator assigned. Fills `used[cls]`
+ * (one bitmask per alloc class, masked to each class's callee-saved set) and
+ * returns the class count. The masks feed NativeKnownFrameDesc so the backend
+ * reserves the save slots as part of the up-front frame. */
+static u32 compute_callee_saved_used(NativeEmitCtx* e, u32* used, u32 cap) {
   NativeTarget* t = e->target;
   const NativeRegInfo* ri = t->regs;
-  u32 used[EMIT_MAX_REG_CLASSES];
   u32 nclasses;
-  if (!t->reserve_callee_saves || !ri) return;
-  memset(used, 0, sizeof used);
+  for (u32 i = 0; i < cap; ++i) used[i] = 0;
+  if (!ri) return 0;
   for (u32 b = 0; b < e->f->nblocks; ++b) {
     Block* bl = &e->f->blocks[b];
     for (u32 i = 0; i < bl->ninsts; ++i)
       opt_walk_inst_operands(e->f, &bl->insts[i], collect_used_reg, used);
   }
-  nclasses = ri->nclasses < EMIT_MAX_REG_CLASSES ? ri->nclasses
-                                                 : EMIT_MAX_REG_CLASSES;
+  nclasses = ri->nclasses < cap ? ri->nclasses : cap;
   for (u32 i = 0; i < ri->nclasses; ++i) {
     const NativeAllocClassInfo* ci = &ri->classes[i];
-    if (ci->cls < EMIT_MAX_REG_CLASSES)
-      used[ci->cls] &= ci->callee_saved_mask;
+    if (ci->cls < cap) used[ci->cls] &= ci->callee_saved_mask;
+  }
+  return nclasses;
+}
+
+/* Plan the complete call frame before any code is emitted, then hand it to the
+ * backend via func_begin_known_frame so the prologue is emitted final. The
+ * optimizer knows everything the frame needs after register allocation and MIR
+ * lowering: the callee-saved set (scanned from the MIR), every static frame
+ * slot (f->frame_slots), and the outgoing-arg area (the max over all calls of
+ * the pure call_stack_bytes query). The body therefore allocates no slots, so
+ * the frame is final up front and nothing is back-patched. Populates
+ * e->slot_map from the backend-assigned slot handles for the body to use. */
+static void plan_frame(NativeEmitCtx* e, const CGFuncDesc* fd) {
+  NativeTarget* t = e->target;
+  NativeKnownFrameDesc frame;
+  NativeFrameSlotDesc* slots = NULL;
+  NativeFrameSlot* out_slots = NULL;
+  u32 used[EMIT_MAX_REG_CLASSES];
+  u32 nclasses;
+  u32 max_args = 0, max_outgoing = 0;
+  u8 has_alloca = 0;
+  u8 needs_scratch_spill = 0;
+  memset(&frame, 0, sizeof frame);
+  nclasses = t->reserve_callee_saves
+                 ? compute_callee_saved_used(e, used, EMIT_MAX_REG_CLASSES)
+                 : 0u;
+  /* Outgoing-arg area = max stack-arg bytes over all calls; also note alloca. */
+  for (u32 b = 0; b < e->f->nblocks; ++b) {
+    Block* bl = &e->f->blocks[b];
+    for (u32 i = 0; i < bl->ninsts; ++i) {
+      Inst* in = &bl->insts[i];
+      if ((IROp)in->op == IR_ALLOCA) {
+        has_alloca = 1;
+      } else if ((IROp)in->op == IR_ATOMIC_RMW) {
+        needs_scratch_spill = 1;
+      } else if ((IROp)in->op == IR_CALL) {
+        IRCallAux* aux = (IRCallAux*)in->extra.aux;
+        if (aux && aux->desc.nargs > max_args) max_args = aux->desc.nargs;
+      }
+    }
+  }
+  if (t->call_stack_bytes) {
+    NativeLoc* args =
+        max_args ? arena_zarray(e->f->arena, NativeLoc, max_args) : NULL;
+    for (u32 b = 0; b < e->f->nblocks; ++b) {
+      Block* bl = &e->f->blocks[b];
+      for (u32 i = 0; i < bl->ninsts; ++i) {
+        Inst* in = &bl->insts[i];
+        IRCallAux* aux;
+        NativeCallDesc d;
+        u32 sb;
+        if ((IROp)in->op != IR_CALL) continue;
+        aux = (IRCallAux*)in->extra.aux;
+        if (!aux) continue;
+        memset(&d, 0, sizeof d);
+        d.fn_type = aux->desc.fn_type;
+        d.flags = aux->desc.flags;
+        d.nargs = aux->desc.nargs;
+        for (u32 k = 0; k < aux->desc.nargs; ++k) {
+          memset(&args[k], 0, sizeof args[k]);
+          args[k].type = aux->desc.args[k].type;
+        }
+        d.args = args;
+        sb = t->call_stack_bytes(t, &d);
+        if (sb > max_outgoing) max_outgoing = sb;
+      }
+    }
+  }
+  e->slot_map =
+      arena_zarray(e->f->arena, NativeFrameSlot, e->f->nframe_slots + 1u);
+  if (e->f->nframe_slots) {
+    slots = arena_zarray(e->f->arena, NativeFrameSlotDesc, e->f->nframe_slots);
+    out_slots = arena_zarray(e->f->arena, NativeFrameSlot, e->f->nframe_slots);
+    for (u32 i = 0; i < e->f->nframe_slots; ++i) {
+      IRFrameSlot* s = &e->f->frame_slots[i];
+      NativeFrameSlotDesc* d = &slots[i];
+      memset(d, 0, sizeof *d);
+      d->type = s->type;
+      d->name = s->name;
+      d->loc = s->loc;
+      d->size = s->size;
+      d->align = s->align;
+      d->kind = s->kind;
+      d->flags = s->flags;
+    }
   }
-  t->reserve_callee_saves(t, used, nclasses);
+  frame.slots = slots;
+  frame.nslots = e->f->nframe_slots;
+  frame.max_outgoing = max_outgoing;
+  frame.callee_saved_used = nclasses ?
used : NULL; + frame.ncallee_classes = nclasses; + frame.has_alloca = has_alloca; + frame.needs_scratch_spill = needs_scratch_spill; + t->func_begin_known_frame(t, fd, &frame, out_slots); + for (u32 i = 0; i < e->f->nframe_slots; ++i) + e->slot_map[e->f->frame_slots[i].id] = out_slots[i]; } void opt_emit_native(Compiler* c, Func* f, NativeTarget* target) { NativeEmitCtx e; Func view; CGFuncDesc fd; - NativeFramePatchState state; if (!f || !target) return; memset(&e, 0, sizeof e); if (f->mir) { @@ -1375,16 +1441,13 @@ void opt_emit_native(Compiler* c, Func* f, NativeTarget* target) { metrics_scope_end(c, "opt.native_emit.setup"); metrics_scope_begin(c, "opt.native_emit.func_begin"); - /* The optimizer path knows the callee-save set and frame slots before the - * body, so the backend can emit an exact-size prologue here rather than - * reserving a worst-case NOP region patched at func_end. Signal this before - * func_begin (so it skips the reserved region) and emit the prologue once - * reserve_callee_saves + map_frame_slots have run. */ - target->emit_minimal_prologue = target->emit_prologue != NULL; - target->func_begin(target, &fd); - reserve_callee_saves(&e); - map_frame_slots(&e); - if (target->emit_minimal_prologue) target->emit_prologue(target); + /* The optimizer has the whole frame after regalloc + MIR lowering, so it + * plans it up front (plan_frame) and drives func_begin_known_frame: the + * backend emits a final prologue with no reserved NOP region and no + * back-patching. The body allocates no frame slots, so the frame stays final; + * allocas and tail epilogues are emitted final too. (Contrast the + * single-pass NativeDirectTarget path, which reserves and patches.) 
*/ + plan_frame(&e, &fd); bind_params(&e); metrics_scope_end(c, "opt.native_emit.func_begin"); @@ -1393,10 +1456,6 @@ void opt_emit_native(Compiler* c, Func* f, NativeTarget* target) { emit_block(&e, e.f->emit_order[i], i, &fd); metrics_scope_end(c, "opt.native_emit.body"); - memset(&state, 0, sizeof state); - state.max_outgoing = e.max_outgoing; - if (target->note_frame_state) target->note_frame_state(target, &state); - if (target->patch_apply) target->patch_apply(target); metrics_scope_begin(c, "opt.native_emit.func_end"); target->func_end(target); metrics_scope_end(c, "opt.native_emit.func_end");
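Note on the plan-then-emit shape used by plan_frame above: a minimal, self-contained sketch of the idea, not the kit implementation. Every name here (SketchCall, SketchFrame, sketch_call_stack_bytes, sketch_plan_frame, sketch_request_slot) is hypothetical. The calling-convention numbers (8 register arguments, 8-byte stack words, 16-byte rounding) are assumptions chosen for illustration and are not taken from NativeTarget.call_stack_bytes. The sketch sizes the outgoing-arg area as the maximum of a pure per-call query, marks the frame final, and rejects any later slot request. The real frame_final guard panics instead of returning an error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; these are not the kit types. */
typedef struct {
  uint32_t nargs; /* number of word-sized arguments passed by the call */
} SketchCall;

typedef struct {
  uint32_t max_outgoing; /* bytes reserved for stack-passed arguments */
  uint32_t nslots;       /* static frame slots known before emission */
  int final;             /* nonzero once the frame has been planned */
} SketchFrame;

/* Assumed convention: 8 register args, 8-byte stack words, 16-byte rounding. */
enum { SKETCH_REG_ARGS = 8, SKETCH_WORD = 8, SKETCH_ALIGN = 16 };

/* Pure query: stack bytes one call needs. It reads only its argument and
 * touches no frame state, so it is safe to ask before any code is emitted. */
static uint32_t sketch_call_stack_bytes(const SketchCall* c) {
  uint32_t on_stack =
      c->nargs > SKETCH_REG_ARGS ? c->nargs - SKETCH_REG_ARGS : 0u;
  uint32_t bytes = on_stack * SKETCH_WORD;
  return (bytes + SKETCH_ALIGN - 1u) & ~(uint32_t)(SKETCH_ALIGN - 1u);
}

/* Plan the whole frame up front: the outgoing-arg area is the max over all
 * calls, and the frame is final as soon as planning returns. */
static SketchFrame sketch_plan_frame(const SketchCall* calls, size_t ncalls,
                                     uint32_t nslots) {
  SketchFrame f = {0u, nslots, 0};
  assert(calls != NULL || ncalls == 0);
  for (size_t i = 0; i < ncalls; ++i) {
    uint32_t sb = sketch_call_stack_bytes(&calls[i]);
    if (sb > f.max_outgoing) f.max_outgoing = sb;
  }
  f.final = 1;
  return f;
}

/* Body-time slot request. Returns 0 on success and -1 once the frame is
 * final; the frame is left unchanged in the failing case. */
static int sketch_request_slot(SketchFrame* f) {
  if (f->final) return -1;
  f->nslots++;
  return 0;
}
```

Because the per-call query is pure, the planner can run it over every call before the prologue exists, which is what removes the need to record a running maximum during the body and patch the prologue afterwards.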