commit a39ba2690ba6a6f5179187de2956cfa4df145657
parent f1e91a7c2257e38b8244f99c4066cabea0336507
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Mon, 1 Jun 2026 13:24:30 -0700
arch: known-frame prologue cost-model tiers for x64 and rv64
Bring the rv64 and x64 known-frame (-O1) prologue paths to the per-call
cost-model parity aa64 already has (doc/plan/ARCH.md §2). Each backend now
selects the cheapest valid frame shape in func_begin_known_frame.
- x64: slim_frame drops `sub rsp` for an empty frame (keeps push rbp; mov
rbp,rsp); redzone_leaf keeps a SysV leaf's small frame (<=128B) in the
128-byte red zone with no reservation. Both leave the `leave` epilogue, the
CFI, and the rbp-relative offsets untouched. No fold tier: push rbp already
folds the sp move.
- rv64: a register-only leaf with no callee-saves/slots/outgoing/sret/variadic
emits no prologue and a bare ret, eliding the whole frame setup/teardown
(~8 insns/leaf). The fp_at_bottom fold is
intentionally not ported (zero instruction win without pre/post-indexed
stores).
- Shared: NativeKnownFrameDesc gains is_leaf and has_asm, derived in plan_frame.
has_asm disqualifies the frame-eliding tiers, since inline asm can clobber the
return-address register (rv64 ra) or the red zone (x64) opaquely.
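The selection rules in the list above can be sketched as standalone C. This is
an illustration only, not the code in this commit: `FrameFacts`, `X64Tier`,
`x64_pick_tier` and `rv64_is_frameless_leaf` are hypothetical names that
flatten the fields the real predicates read from NativeKnownFrameDesc and the
per-arch target structs. The conditions themselves mirror the ones added to
x64_func_begin_known_frame and rv_func_begin_known_frame.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, flattened view of the inputs the tier predicates read. */
typedef struct FrameFacts {
  bool is_leaf;             /* body contains no call of any kind */
  bool has_asm;             /* body contains an inline-asm block */
  bool has_alloca;
  bool has_sret;            /* consulted by the rv64 tier only */
  bool is_variadic;         /* consulted by the rv64 tier only */
  uint32_t ncallee_saves;
  uint32_t cum_off;         /* 0 is treated here as "no body slots" */
  uint32_t max_outgoing;    /* 0 means no outgoing stack args */
  uint32_t sig_stack_bytes; /* stack-passed parameter bytes (rv64) */
  uint32_t shadow_space;    /* 0 on SysV, nonzero on Win64 */
  uint32_t frame_size;      /* final frame size in bytes */
} FrameFacts;

typedef enum { X64_FAT, X64_SLIM, X64_REDZONE } X64Tier;

/* x64: both cheap tiers keep push rbp; mov rbp,rsp and drop only `sub rsp`. */
X64Tier x64_pick_tier(const FrameFacts* f) {
  /* Empty frame: nothing to reserve, so this is valid for non-leaves too. */
  bool slim = f->ncallee_saves == 0 && !f->has_alloca && f->cum_off == 0 &&
              f->max_outgoing == 0;
  /* Small SysV leaf frame: locals live in the 128-byte red zone. Inline asm
   * is excluded because it may clobber the red zone opaquely. */
  bool redzone = !slim && f->shadow_space == 0 && f->is_leaf && !f->has_asm &&
                 !f->has_alloca && f->max_outgoing == 0 &&
                 f->frame_size <= 128u;
  assert(!(slim && redzone)); /* the two tiers are mutually exclusive */
  if (slim) return X64_SLIM;
  if (redzone) return X64_REDZONE;
  return X64_FAT;
}

/* rv64: a frameless leaf emits no prologue and a bare ret. Inline asm is
 * excluded because it may clobber ra, which is never saved in this shape. */
bool rv64_is_frameless_leaf(const FrameFacts* f) {
  return f->is_leaf && !f->has_asm && f->ncallee_saves == 0 &&
         !f->has_alloca && f->cum_off == 0 && f->max_outgoing == 0 &&
         !f->has_sret && !f->is_variadic && f->sig_stack_bytes == 0;
}
```

Under these assumptions, an empty-frame function that makes a register-only
call still takes the x64 slim tier but not the rv64 frameless tier, and a leaf
with 16 bytes of locals takes the x64 red-zone tier unless it contains inline
asm. Those are the same cases the leaf_call.c, small_locals.c and asm_locals.c
fixtures in test/opt/prologue_tier.sh exercise.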
test/opt/prologue_tier.sh pins the per-arch shapes (incl. the asm/non-leaf
guards); runtime-verified across the toy and parse corpora at -O1.
Test infra: pin the cross-arch exec container images per arch by content
digest (test/lib/test_images.sh) instead of the shared, mutable alpine:latest
tag, so pulling one arch can't clobber another's rootfs. The run path stays
--pull=never; `make test-images` (test/lib/pull_test_images.sh) is the one
network step, and exec_target now SKIPs cleanly when an arch is unprovisioned.
Diffstat:
9 files changed, 469 insertions(+), 71 deletions(-)
diff --git a/doc/plan/ARCH.md b/doc/plan/ARCH.md
@@ -64,11 +64,10 @@ targets (see the interpreter and toy `musttail` tracks).
The fixed per-call overhead -- prologue + epilogue + arg setup, independent of the
body -- is the dominant cost on call-heavy code. aa64 picks one of four frame
-shapes per function to minimize it; x64 and rv64 currently emit a single
-RBP-anchored / s0-anchored shape and do **not** have the fold tiers. The design
-rationale lives in [../ARCH.md](../ARCH.md); the aa64 measurements and the
-remaining body-level warts are tracked alongside [../OPT.md](../OPT.md) and
-[OPTIMIZER.md](OPTIMIZER.md).
+shapes per function to minimize it. x64 and rv64 now select a cheaper known-frame
+shape too (see **Done** below). The design rationale lives in
+[../ARCH.md](../ARCH.md); the aa64 measurements and the remaining body-level warts
+are tracked alongside [../OPT.md](../OPT.md) and [OPTIMIZER.md](OPTIMIZER.md).
aa64 tiers (baseline, for reference):
@@ -83,24 +82,48 @@ The known-frame asymmetry (bottom-record only on the -O1 path) is intentional:
the frame-size-dependent offsets require the frame to be final before the body,
which only the optimizer's frame planner guarantees.
-Planned:
-
-- **rv64 frame fold.** Port the `fp_at_bottom` idea: for known-frame functions
- with no outgoing stack args and a small frame, place the saved s0/ra pair at the
- bottom (s0 = sp) so the sp adjustment folds into the save/restore and
- callee-saves stack above the record at positive offsets. RISC-V has no
- pre/post-indexed store, so the fold is the address-arithmetic saving, not an
- addressing-mode one -- quantify the win before committing. Add a leaf/no-frame
- (`slim_prologue`-equivalent) tier for leaf functions with no callee-saves.
-- **x64 frame fold / leaf omission.** Add the equivalent tiers for SysV and
- Win64. x64 already emits the exact known-frame prologue (no placeholder/patch),
- so this is shape selection in `func_begin_known_frame`, not a re-architecture.
- SysV leaf functions can also exploit the 128-byte red zone to skip the
- `sub rsp` entirely -- design and gate this carefully against alloca and any
- outgoing-arg use.
+Leaf-ness is surfaced to the backends through `NativeKnownFrameDesc.is_leaf`
+(set in `plan_frame`, `pass_native_emit.c`, as "no `IR_CALL` of any kind --
+regular or sibling/tail"). A leaf never clobbers the return-address register or
+the stack below sp, which is what unlocks the no-frame / red-zone shapes below.
+
+Done:
+
+- **x64 slim + red-zone tiers** (`x64_func_begin_known_frame`). Two known-frame
+ shapes, both keeping the `push rbp; mov rbp,rsp` record (so the `leave`
+ epilogue, the `CFA = rbp+16` CFI, and every rbp-relative offset are unchanged)
+ and only dropping the `sub rsp` reservation:
+ - `slim_frame` -- empty frame (no callee-saves, no body slots, no outgoing
+ args, no alloca). Safe for non-leaves too: `push rbp` keeps rsp 16-aligned
+ for any register-only call, and nothing lives below rsp. SysV + Win64.
+ - `redzone_leaf` -- SysV leaf with a small frame (`is_leaf`, no alloca, no
+ outgoing args, `frame_size <= 128`). Locals/callee-saves stay at their
+ rbp-relative offsets, which now land in the 128-byte red zone. Leaf-only,
+ since any call would clobber the red zone; Win64 (no red zone) is excluded
+ by the `shadow_space == 0` gate.
+
+ No x64 *fold* tier: `push rbp` already folds the sp-move into the store, so
+ there is no aa64-`fp_at_bottom`-style win to capture.
+- **rv64 leaf tier** (`rv_func_begin_known_frame`, `slim_prologue`). A leaf with
+ no callee-saves, no body slots, no outgoing args, no sret/variadic and
+ register-only params (`signature_stack_bytes == 0`) never reads s0 nor clobbers
+ ra (both are reserved, never allocable), so it emits **no prologue** and a bare
+ `ret` -- the whole frame setup/teardown is elided (~8 insns/leaf). CFI is
+ `def_cfa(sp, 0)`, matching the CIE default (ra stays live in its register).
+- **rv64 frame fold: intentionally not ported.** Porting aa64's `fp_at_bottom`
+ to rv64 was measured at a **zero** instruction win: RISC-V has no
+ pre/post-indexed store, so moving the saved s0/ra pair to the bottom still
+ needs a separate `addi sp,sp,-N` plus the `sd`/`addi s0,sp` -- the same four
+ instructions as the top-record shape. The fold only relocates data, it removes
+ no instruction, so it was skipped rather than add a fold-aware offset-helper
+ layer for no benefit, following the "quantify the win before committing"
+ guidance that previously stood here. The rv64 leaf tier above is the real rv64 win.
+
+Still open:
+
- **Cost-model alignment.** `signature_stack_bytes` / `call_stack_bytes` are the
shared hooks the optimizer uses to size the outgoing area and gate tail-call
- realizability; they exist on all three. As the fold tiers and tail-call paths
+ realizability; they exist on all three. As the tail-call paths (section 1)
land, verify the optimizer's per-call cost estimates reflect the cheaper
shapes so frame/spill decisions stay consistent across arches.
diff --git a/src/arch/native_target.h b/src/arch/native_target.h
@@ -64,6 +64,21 @@ typedef struct NativeKnownFrameDesc {
* one scratch register. The backend reserves the slot up front so the body
* never grows the frame after the prologue. */
u8 needs_scratch_spill;
+ /* Whether the function is a leaf — its body contains no call of any kind
+ * (regular or sibling/tail). A leaf does not clobber the return-address
+ * register or the stack below sp through a call, so backends can omit the
+ * saved-frame record entirely (rv64 leaf tier) or skip the stack reservation
+ * and keep locals in the red zone (x64 SysV red-zone tier) — but ONLY when
+ * `has_asm` is also clear (see below). Conservatively false whenever any
+ * IR_CALL is present. */
+ u8 is_leaf;
+ /* Whether the body contains an inline-asm block. Inline asm can clobber the
+ * return-address register (rv64 ra) or write into the red zone / make a call
+ * (x64) without the optimizer modelling it, so the frame-eliding leaf/red-zone
+ * tiers must NOT fire when this is set — even for an otherwise-leaf function.
+ * The single-pass and fat known-frame shapes always save the return address
+ * and reserve their stack, so they are unaffected. */
+ u8 has_asm;
} NativeKnownFrameDesc;
typedef enum NativeAllocClass {
diff --git a/src/arch/rv64/native.c b/src/arch/rv64/native.c
@@ -205,6 +205,16 @@ typedef struct RvNativeTarget {
u32 fp_pair_off;
u32 minimal_prologue_words; /* known-frame path: exact prologue length, else 0 */
+ /* Known-frame (-O1) leaf no-frame tier (aa64's slim_prologue equivalent),
+ * settled in rv_func_begin_known_frame; always 0 on the single-pass path. A
+ * leaf with no callee-saves, no body slots, no outgoing args, no sret/variadic
+ * and register-only params never reads s0 nor clobbers ra, so it emits NO
+ * prologue and a bare `ret` — the whole frame setup/teardown is elided. RISC-V
+ * has no pre/post-indexed store, so aa64's fp_at_bottom fold would save zero
+ * instructions on a kept frame and is intentionally not ported (see
+ * doc/plan/ARCH.md §2); this leaf tier is the rv64 win. */
+ u8 slim_prologue;
+
u32 incoming_stack_size; /* fixed-param stack bytes (tail-call check) */
u32 next_param_int;
u32 next_param_fp;
@@ -1240,6 +1250,7 @@ static void rv_func_begin_common(NativeTarget* t, const CGFuncDesc* fd) {
a->npatches = 0;
a->nalloca = 0;
a->minimal_prologue_words = 0;
+ a->slim_prologue = 0;
mc->set_section(mc, fd->text_section_id);
mc->emit_align(mc, 4, 0);
@@ -1422,22 +1433,27 @@ static void rv_func_end(NativeTarget* t) {
/* epilogue */
mc->label_place(mc, a->epilogue_label);
- for (i = (i32)n_int - 1; i >= 0; --i)
- rv_load_s0(mc, 0, int_regs[i], rv_save_off(a, (u32)i));
- for (i = (i32)n_fp - 1; i >= 0; --i)
- rv_load_s0(mc, 1, fp_regs[i], rv_save_off(a, n_int + (u32)i));
- if (a->frame.has_alloca)
- rv_emit_addr_adjust(mc, RV_SP, RV_S0, -(i32)fp_pair_off);
- rv64_emit32(mc, rv_ld(RV_RA, RV_S0, 8));
- rv64_emit32(mc, rv_ld(RV_S0, RV_S0, 0));
- /* sp += frame_size */
- if (fits_i12((i32)frame_size)) {
- rv64_emit32(mc, rv_addi(RV_SP, RV_SP, (i32)frame_size));
+ if (a->slim_prologue) {
+ /* Frameless leaf: no callee-saves, no s0/ra to reload, sp untouched. */
+ rv64_emit32(mc, rv_jalr(RV_ZERO, RV_RA, 0));
} else {
- rv_emit_load_imm(mc, 1, RV_TMP0, (i64)frame_size);
- rv64_emit32(mc, rv_add(RV_SP, RV_SP, RV_TMP0));
+ for (i = (i32)n_int - 1; i >= 0; --i)
+ rv_load_s0(mc, 0, int_regs[i], rv_save_off(a, (u32)i));
+ for (i = (i32)n_fp - 1; i >= 0; --i)
+ rv_load_s0(mc, 1, fp_regs[i], rv_save_off(a, n_int + (u32)i));
+ if (a->frame.has_alloca)
+ rv_emit_addr_adjust(mc, RV_SP, RV_S0, -(i32)fp_pair_off);
+ rv64_emit32(mc, rv_ld(RV_RA, RV_S0, 8));
+ rv64_emit32(mc, rv_ld(RV_S0, RV_S0, 0));
+ /* sp += frame_size */
+ if (fits_i12((i32)frame_size)) {
+ rv64_emit32(mc, rv_addi(RV_SP, RV_SP, (i32)frame_size));
+ } else {
+ rv_emit_load_imm(mc, 1, RV_TMP0, (i64)frame_size);
+ rv64_emit32(mc, rv_add(RV_SP, RV_SP, RV_TMP0));
+ }
+ rv64_emit32(mc, rv_jalr(RV_ZERO, RV_RA, 0));
}
- rv64_emit32(mc, rv_jalr(RV_ZERO, RV_RA, 0));
/* patch prologue */
if (!a->frame.known_frame) {
@@ -1462,19 +1478,27 @@ static void rv_func_end(NativeTarget* t) {
/* CFI: CFA = s0 + (frame_size - fp_pair_off) */
if (mc->cfi_set_next_pc_offset && mc->cfi_def_cfa && mc->cfi_offset) {
- i32 cfa = (i32)frame_size - (i32)fp_pair_off;
- u32 post = a->prologue_pos + (a->frame.known_frame
- ? a->minimal_prologue_words * 4u
- : RV_PROLOGUE_WORDS * 4u);
- u32 k;
- mc->cfi_set_next_pc_offset(mc, post - a->func_start);
- mc->cfi_def_cfa(mc, RV_S0, cfa);
- mc->cfi_offset(mc, RV_S0, -cfa);
- mc->cfi_offset(mc, RV_RA, -cfa + 8);
- for (k = 0; k < n_int; ++k)
- mc->cfi_offset(mc, int_regs[k], rv_save_off(a, k) - cfa);
- for (k = 0; k < n_fp; ++k)
- mc->cfi_offset(mc, 32u + fp_regs[k], rv_save_off(a, n_int + k) - cfa);
+ if (a->slim_prologue) {
+ /* Frameless leaf: CFA = sp (unchanged from entry) and the return address
+ * stays live in ra (the CIE default), so no saved-register rules. The
+ * state holds from the first instruction (offset 0). */
+ mc->cfi_set_next_pc_offset(mc, 0);
+ mc->cfi_def_cfa(mc, RV_SP, 0);
+ } else {
+ i32 cfa = (i32)frame_size - (i32)fp_pair_off;
+ u32 post = a->prologue_pos + (a->frame.known_frame
+ ? a->minimal_prologue_words * 4u
+ : RV_PROLOGUE_WORDS * 4u);
+ u32 k;
+ mc->cfi_set_next_pc_offset(mc, post - a->func_start);
+ mc->cfi_def_cfa(mc, RV_S0, cfa);
+ mc->cfi_offset(mc, RV_S0, -cfa);
+ mc->cfi_offset(mc, RV_RA, -cfa + 8);
+ for (k = 0; k < n_int; ++k)
+ mc->cfi_offset(mc, int_regs[k], rv_save_off(a, k) - cfa);
+ for (k = 0; k < n_fp; ++k)
+ mc->cfi_offset(mc, 32u + fp_regs[k], rv_save_off(a, n_int + k) - cfa);
+ }
}
end = mc->pos(mc);
@@ -1497,6 +1521,9 @@ static void rv_reserve_callee_saves(NativeTarget* t, const u32* used,
native_frame_set_callee_saves(&rv_of(t)->frame, used, nclasses, NULL, 0, 0);
}
+static u32 rv_signature_stack_bytes(NativeTarget* t, CfreeCgTypeId fn_type,
+ int* variadic, u32* nparams);
+
/* Optimizer entry point: the full frame is supplied up front, so the prologue
* is emitted final the moment it is built — no NOP region, no func_end patch
* (rv_func_end skips patching when known_frame). rv_build_prologue emits the
@@ -1531,9 +1558,27 @@ static void rv_func_begin_known_frame(NativeTarget* t, const CGFuncDesc* fd,
fp_pair_off = rv_fp_pair_off(a, frame_size);
a->frame_size_final = frame_size;
a->fp_pair_off = fp_pair_off;
+ a->prologue_pos = mc->pos(mc);
+ /* Leaf no-frame tier (aa64 slim_prologue equivalent): a leaf with no
+ * callee-saves, no body slots, no outgoing args, no sret/variadic and
+ * register-only params never reads s0 (no frame slots / stack args) nor
+ * clobbers ra (no calls). Emit no prologue at all; rv_func_end emits a bare
+ * `ret`. cum_off==0 already implies no sret slot and no param spills, but the
+ * extra guards keep the intent explicit. Inline asm is excluded: it can clobber
+ * ra opaquely, and without the saved record the bare `ret` would return through
+ * the destroyed link register. */
+ a->slim_prologue =
+ frame && frame->is_leaf && !frame->has_asm &&
+ a->frame.ncallee_saves == 0 && !a->frame.has_alloca &&
+ a->frame.cum_off == 0 && a->frame.max_outgoing == 0 && !a->has_sret &&
+ !a->is_variadic && rv_signature_stack_bytes(t, fd->fn_type, NULL, NULL) == 0;
+ if (a->slim_prologue) {
+ a->minimal_prologue_words = 0;
+ native_frame_set_final(&a->frame);
+ return;
+ }
n_int = rv_collect_int_saves(a, int_regs);
n_fp = rv_collect_fp_saves(a, fp_regs);
- a->prologue_pos = mc->pos(mc);
nwords = rv_build_prologue(a, words, RV_KNOWN_PROLOGUE_WORDS, frame_size,
fp_pair_off, int_regs, n_int, fp_regs, n_fp);
for (i = 0; i < nwords; ++i) rv64_emit32(mc, words[i]);
diff --git a/src/arch/x64/native.c b/src/arch/x64/native.c
@@ -101,6 +101,24 @@ typedef struct X64NativeTarget {
u32 prologue_nbytes;
MCLabel epilogue_label;
+ /* Known-frame (-O1) prologue cost-model tiers, settled in
+ * x64_func_begin_known_frame; both 0 on the single-pass path (which can't know
+ * the frame up front). Either one suppresses the `sub rsp` reservation; the
+ * rbp frame record (push rbp; mov rbp,rsp) and every rbp-relative offset stay
+ * unchanged, so the epilogue (`leave`), CFI (CFA = rbp+16), and debug locs are
+ * identical to the fat shape.
+ * slim_frame - empty frame (no callee-saves/locals/outgoing/alloca): the
+ * `sub rsp` reserved nothing, so it is simply dropped. Safe
+ * for non-leaves (push rbp keeps rsp 16-aligned for calls,
+ * and nothing lives below rsp). SysV + Win64.
+ * redzone_leaf - SysV leaf with a small frame (<= 128B, no alloca, no
+ * outgoing args): locals/callee-saves stay at their
+ * rbp-relative offsets, which now land in the 128-byte red
+ * zone instead of a reserved region. Leaf-only — a call would
+ * clobber the red zone. */
+ u8 slim_frame;
+ u8 redzone_leaf;
+
/* Optimizer (-O1) entry binds: register-destination param binds are deferred
* here and resolved as a parallel copy in x64_bind_params_end, since the
* allocator may rotate params across the incoming arg registers — a
@@ -1431,10 +1449,16 @@ static ObjSymId x64_chkstk_sym(NativeTarget* t) {
}
/* Build the prologue byte sequence into buf. Returns bytes written and, when
- * the chkstk path fires, the disp32 offset of the call site. */
+ * the chkstk path fires, the disp32 offset of the call site. When `skip_sub` is
+ * set (the known-frame slim / red-zone tiers), the `sub rsp` reservation is
+ * omitted entirely: the frame record is established but no stack is reserved,
+ * either because the frame is empty (slim) or because the locals/saves live in
+ * the SysV red zone (redzone_leaf). Callers must only set it when the frame
+ * needs no reserved region (no alloca, no outgoing args, and — for the red
+ * zone — a leaf frame <= 128 bytes). */
static u32 x64_build_prologue(X64NativeTarget* a, u8* buf, u32 cap,
u32 frame_size, const Reg* cs_int, u32 n_int,
- const Reg* cs_fp, u32 n_fp,
+ const Reg* cs_fp, u32 n_fp, int skip_sub,
u32* chkstk_disp_pos_out) {
u32 wi = 0;
u32 xmm_base = x64_xmm_base(a, n_fp);
@@ -1446,8 +1470,11 @@ static u32 x64_build_prologue(X64NativeTarget* a, u8* buf, u32 cap,
buf[wi++] = X64_REX_BASE | X64_REX_W;
buf[wi++] = X64_OPC_MOV_RM_R;
buf[wi++] = modrm(3u, X64_RSP, X64_RBP);
- /* sub rsp, frame_size (or chkstk on Win64 large frame). */
- if (a->abi->shadow_space && frame_size > X64_WIN64_CHKSTK_THRESHOLD) {
+ /* sub rsp, frame_size (or chkstk on Win64 large frame); skipped by the slim /
+ * red-zone tiers, which reserve no stack. */
+ if (skip_sub) {
+ /* no reservation */
+ } else if (a->abi->shadow_space && frame_size > X64_WIN64_CHKSTK_THRESHOLD) {
if (wi + 13u > cap) x64_panic(a, "prologue placeholder overflow");
buf[wi++] = (u8)(X64_OPC_MOV_RI | (X64_RAX & 7u)); /* mov eax, imm32 */
wr_u32_le(buf + wi, frame_size);
@@ -1527,6 +1554,8 @@ static void x64_func_begin_common(NativeTarget* t, const CGFuncDesc* fd) {
a->npatches = 0;
a->nalloca = 0;
a->nbind_moves = 0;
+ a->slim_frame = 0;
+ a->redzone_leaf = 0;
a->prologue_nbytes =
a->abi->shadow_space ? X64_PROLOGUE_BYTES_WIN64 : X64_PROLOGUE_BYTES;
@@ -1637,9 +1666,25 @@ static void x64_func_begin_known_frame(NativeTarget* t, const CGFuncDesc* fd,
n_fp = x64_collect_fp_saves(a, cs_fp);
frame_size = x64_compute_frame_size(a, n_int, n_fp);
a->frame_size_final = frame_size;
+ /* Cost-model tier selection (mirrors aa64's aa_func_begin_known_frame): with
+ * the frame final before the body, choose the cheapest valid prologue shape.
+ * Both tiers keep the rbp record and only drop the `sub rsp`, so the
+ * epilogue/CFI/offset helpers are untouched. x64 needs no `fp_at_bottom`-style
+ * fold: `push rbp` already folds the sp-move into the store. */
+ a->slim_frame = a->frame.ncallee_saves == 0 && !a->frame.has_alloca &&
+ a->frame.cum_off == 0 && a->frame.max_outgoing == 0;
+ /* redzone keeps locals below rsp in the red zone; exclude inline asm, which
+ * may issue a `call` (clobbering the red zone) the optimizer can't see. slim
+ * needs no such guard: it has no locals there and the return address lives on
+ * the stack at [rbp+8], not in a clobberable register. */
+ a->redzone_leaf = !a->slim_frame && a->abi->shadow_space == 0 &&
+ frame && frame->is_leaf && !frame->has_asm &&
+ !a->frame.has_alloca && a->frame.max_outgoing == 0 &&
+ frame_size <= 128u;
a->prologue_pos = mc->pos(mc);
nbytes = x64_build_prologue(a, buf, sizeof buf, frame_size, cs_int, n_int,
- cs_fp, n_fp, &chkstk_disp_pos);
+ cs_fp, n_fp, a->slim_frame || a->redzone_leaf,
+ &chkstk_disp_pos);
mc->emit_bytes(mc, buf, nbytes);
if (chkstk_disp_pos != (u32)-1) {
ObjSymId chk = x64_chkstk_sym(t);
@@ -1685,8 +1730,10 @@ static void x64_func_end(NativeTarget* t) {
u32 nbytes;
u32 k;
for (k = 0; k < a->prologue_nbytes; ++k) buf[k] = X64_NOP1;
+ /* Single-pass path never selects a slim/red-zone tier (it cannot know the
+ * frame up front), so it always emits the full reservation. */
nbytes = x64_build_prologue(a, buf, a->prologue_nbytes, frame_size, cs_int,
- n_int, cs_fp, n_fp, &chkstk_disp_pos);
+ n_int, cs_fp, n_fp, 0, &chkstk_disp_pos);
(void)nbytes;
obj_patch(obj, sec, a->prologue_pos, buf, a->prologue_nbytes);
if (chkstk_disp_pos != (u32)-1) {
diff --git a/src/opt/pass_native_emit.c b/src/opt/pass_native_emit.c
@@ -1354,6 +1354,8 @@ static void plan_frame(NativeEmitCtx* e, const CGFuncDesc* fd) {
u32 max_args = 0, max_outgoing = 0;
u8 has_alloca = 0;
u8 needs_scratch_spill = 0;
+ u8 has_call = 0;
+ u8 has_asm = 0;
memset(&frame, 0, sizeof frame);
nclasses = t->reserve_callee_saves
? compute_callee_saved_used(e, used, EMIT_MAX_REG_CLASSES)
@@ -1369,7 +1371,14 @@ static void plan_frame(NativeEmitCtx* e, const CGFuncDesc* fd) {
needs_scratch_spill = 1;
} else if ((IROp)in->op == IR_CALL) {
IRCallAux* aux = (IRCallAux*)in->extra.aux;
+ /* Any call (regular or sibling/tail) means the function is not a leaf:
+ * it clobbers the return-address register and the stack below sp. */
+ has_call = 1;
if (aux && aux->desc.nargs > max_args) max_args = aux->desc.nargs;
+ } else if ((IROp)in->op == IR_ASM_BLOCK) {
+ /* Inline asm may clobber the return-address register or the red zone
+ * opaquely; disqualifies the frame-eliding tiers (see has_asm). */
+ has_asm = 1;
}
}
}
@@ -1425,6 +1434,8 @@ static void plan_frame(NativeEmitCtx* e, const CGFuncDesc* fd) {
frame.ncallee_classes = nclasses;
frame.has_alloca = has_alloca;
frame.needs_scratch_spill = needs_scratch_spill;
+ frame.is_leaf = !has_call;
+ frame.has_asm = has_asm;
t->func_begin_known_frame(t, fd, &frame, out_slots);
for (u32 i = 0; i < e->f->nframe_slots; ++i)
e->slot_map[e->f->frame_slots[i].id] = out_slots[i];
diff --git a/test/lib/exec_target.sh b/test/lib/exec_target.sh
@@ -40,9 +40,14 @@
# will be queued. The same path is bind-mounted at the same path
# inside the container.
# - Optional: RUN_AARCH64_IMAGE / RUN_X64_IMAGE / RUN_RV64_IMAGE
-# override the container image (default alpine:latest — musl
-# libc, matching the prior inline implementation and consistent
-# with test/smoke/rv64.sh).
+# override the container image. The defaults are pinned, per-arch,
+# content-addressed alpine digests (see test/lib/test_images.sh);
+# provision them once with `make test-images`. Every `podman run`
+# below uses --pull=never, so the run path never touches the network.
+
+# Pinned per-arch image references (sets RUN_<ARCH>_IMAGE defaults and provides
+# cfree_test_image_for_arch). Sourced relative to this file's location.
+. "$(dirname "${BASH_SOURCE[0]}")/test_images.sh"
# Internal queue arrays. Each entry's tag is recorded alongside the
# rest so flush can split into per-target batched runs.
@@ -87,19 +92,33 @@ _exec_target_platform() {
esac
}
-# Default image is alpine:latest (musl libc). Chosen for rv64 because:
-# - musl is the C runtime the rv64 lane is brought up against
-# (matches test/smoke/rv64.sh default).
-# - alpine ships riscv64 images in the official manifest, so podman
-# can pull and exec under qemu-user without bespoke registries.
-# Override per-arch with RUN_<ARCH>_IMAGE when a glibc base is needed.
+# Per-arch pinned image (content-addressed alpine digest, musl libc). The pins
+# live in test/lib/test_images.sh and are provisioned by `make test-images`;
+# RUN_<ARCH>_IMAGE overrides them (e.g. for a glibc base). Distinct digests per
+# arch mean local storage can never confuse one arch's rootfs for another's.
_exec_target_image() {
- case "$(_exec_target_arch "$1")" in
- aarch64) echo "${RUN_AARCH64_IMAGE:-alpine:latest}" ;;
- x64) echo "${RUN_X64_IMAGE:-alpine:latest}" ;;
- rv64) echo "${RUN_RV64_IMAGE:-alpine:latest}" ;;
- *) echo "alpine:latest" ;;
- esac
+ local img; img="$(cfree_test_image_for_arch "$(_exec_target_arch "$1")")"
+ [ -n "$img" ] && printf '%s' "$img" || printf 'alpine:latest'
+}
+
+# Memoized: is this arch's pinned image present in local storage? The harnesses
+# run with --pull=never, so a missing image means the container runner is
+# unavailable until `make test-images` provisions it. Cached per arch so
+# exec_target_supported stays a constant cost across hundreds of cases.
+_exec_target_image_present() {
+ local arch var cached
+ arch="$(_exec_target_arch "$1")"
+ var="_EXEC_TARGET_IMG_${arch}"
+ cached="${!var:-}"
+ if [ -z "$cached" ]; then
+ if podman image exists "$(_exec_target_image "$1")" 2>/dev/null; then
+ cached=yes
+ else
+ cached=no
+ fi
+ printf -v "$var" '%s' "$cached"
+ fi
+ [ "$cached" = yes ]
}
# True when the host can exec this target without container/qemu help.
@@ -184,7 +203,10 @@ exec_target_supported() {
fi
_exec_target_native "$tag" && return 0
[ -n "$(_exec_target_qemu "$tag")" ] && return 0
- [ "${have_podman:-0}" -eq 1 ] && return 0
+ # podman is only a usable runner once the arch's pinned image is provisioned
+ # locally (run path is --pull=never); otherwise report unsupported so callers
+ # SKIP cleanly instead of failing on a missing image.
+ [ "${have_podman:-0}" -eq 1 ] && _exec_target_image_present "$tag" && return 0
return 1
}
diff --git a/test/lib/pull_test_images.sh b/test/lib/pull_test_images.sh
@@ -0,0 +1,52 @@
+#!/usr/bin/env bash
+# Provision the pinned per-arch alpine rootfs images that exec_target.sh runs
+# cfree-emitted binaries inside.
+#
+# THIS is the only test step that touches the network. The test harnesses
+# themselves always run with `podman run --pull=never`, so the images must be
+# present locally first. Run once (and again only after the pin in
+# test/lib/test_images.sh changes):
+#
+# make test-images # or: bash test/lib/pull_test_images.sh
+#
+# Each arch is pulled by its own content digest, so the three images coexist in
+# local storage and can never be confused for one another. Set FORCE=1 to
+# re-pull even when an image is already present.
+set -euo pipefail
+
+ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
+# shellcheck source=test/lib/test_images.sh
+. "$ROOT/test/lib/test_images.sh"
+
+if ! command -v podman >/dev/null 2>&1; then
+ echo "pull-test-images: podman not found; nothing to provision" >&2
+ exit 1
+fi
+
+rc=0
+for arch in $CFREE_TEST_ARCHES; do
+ ref="$(cfree_test_image_for_arch "$arch")"
+ if [ -z "$ref" ]; then
+ echo "pull-test-images: no image pinned for arch '$arch'" >&2
+ rc=1
+ continue
+ fi
+ if [ -z "${FORCE:-}" ] && podman image exists "$ref"; then
+ echo "pull-test-images: $arch already present ($ref)"
+ continue
+ fi
+ # Digest references are single-arch and content-addressed, so no --platform
+ # is needed and pulling one arch cannot affect another's local image.
+ echo "pull-test-images: pulling $arch <- $ref"
+ if id="$(podman pull -q "$ref")"; then
+ echo "pull-test-images: $arch ready ($id)"
+ else
+ echo "pull-test-images: FAILED to pull $arch ($ref)" >&2
+ rc=1
+ fi
+done
+
+if [ "$rc" -eq 0 ]; then
+ echo "pull-test-images: all arches provisioned"
+fi
+exit "$rc"
diff --git a/test/lib/test_images.sh b/test/lib/test_images.sh
@@ -0,0 +1,41 @@
+# test/lib/test_images.sh — pinned per-arch container images for hermetic
+# cross-arch test execution. Sourced by test/lib/exec_target.sh (the run path)
+# and test/lib/pull_test_images.sh (the provisioning path).
+#
+# The compiler test harnesses run cfree-emitted ELF binaries inside a container
+# under qemu-user. To keep ALL network off the main test path, every `podman
+# run` uses --pull=never against these LOCAL images, which must be provisioned
+# ahead of time with `make test-images` (the one network-touching step).
+#
+# Each arch is pinned to its OWN content-addressed image digest under
+# docker.io/library/alpine. Because the references are content digests rather
+# than the shared, mutable `alpine:latest` tag, pulling one arch can NEVER
+# clobber another arch's rootfs in local storage — that wrong-arch-manifest
+# trap (a `--platform riscv64` pull retagging the local `alpine:latest` to the
+# riscv64 image, so a later `--platform amd64 --pull=never` run executes against
+# the riscv64 rootfs) is what this scheme exists to prevent.
+#
+# Pin: alpine 3.23.4, manifest list
+# docker.io/library/alpine@sha256:5b10f432ef3da1b8d4c7eb6c487f2f5a8f096bc91145e68878dd4a5019afde11
+# To bump the pin: `podman manifest inspect <new alpine ref>` and replace the
+# three per-arch image digests below (the `application/vnd.*.manifest.v1+json`
+# entry for each architecture).
+
+# Per-arch pins. `:=` so an explicit RUN_<ARCH>_IMAGE override from the caller
+# (e.g. test/asm/hostas_cross.sh, which needs a glibc/edge variant) still wins.
+: "${RUN_AARCH64_IMAGE:=docker.io/library/alpine@sha256:378c4c5418f7493bd500ad21ffb43818d0689daaad43e3261859fb417d1481a0}"
+: "${RUN_X64_IMAGE:=docker.io/library/alpine@sha256:4d889c14e7d5a73929ab00be2ef8ff22437e7cbc545931e52554a7b00e123d8b}"
+: "${RUN_RV64_IMAGE:=docker.io/library/alpine@sha256:667d07bf2f6239f094f64b5682c8ffbe24c9f3139b1fb854f85caf931a3d7439}"
+
+# arch token ("aarch64"/"x64"/"rv64") -> pinned image reference.
+cfree_test_image_for_arch() {
+ case "$1" in
+ aarch64) printf '%s' "$RUN_AARCH64_IMAGE" ;;
+ x64) printf '%s' "$RUN_X64_IMAGE" ;;
+ rv64) printf '%s' "$RUN_RV64_IMAGE" ;;
+ *) printf '' ;;
+ esac
+}
+
+# The arches provisioned/runnable through a container.
+CFREE_TEST_ARCHES="aarch64 x64 rv64"
diff --git a/test/opt/prologue_tier.sh b/test/opt/prologue_tier.sh
@@ -0,0 +1,142 @@
+#!/usr/bin/env bash
+# Structural checks for the -O1 known-frame prologue cost-model tiers.
+#
+# aa64 is the reference (it already selects a minimal frame shape per function);
+# this pins that behaviour and asserts the ported equivalents on x64 and rv64:
+#
+# x64 slim empty frame (no callee-saves/locals/outgoing/alloca) keeps
+# `push rbp; mov rbp,rsp` but drops the `sub rsp` reservation.
+# x64 red-zone SysV leaf with a small frame keeps its locals/saves at
+# rbp-relative offsets in the 128-byte red zone, no `sub rsp`.
+# rv64 leaf a true leaf with no frame needs (no callee-saves, no slots,
+# no outgoing, register-only params) emits NO prologue and a
+# bare `ret` -- it never saves ra / sets up s0.
+#
+# A non-leaf (calls something) must NOT take the leaf tiers, since the call
+# clobbers ra (rv64) / the red zone (x64): those guards are asserted too.
+set -euo pipefail
+
+ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
+CFREE="${CFREE:-$ROOT/build/cfree}"
+WORK="$ROOT/build/test/opt/prologue_tier"
+rm -rf "$WORK"
+mkdir -p "$WORK"
+
+fail() {
+ printf 'prologue-tier check FAILED: %s\n' "$1" >&2
+ if [ -n "${2:-}" ] && [ -f "$2" ]; then
+ sed 's/^/ | /' "$2" >&2
+ fi
+ exit 1
+}
+
+slice_func() {
+ local src="$1" func="$2" out="$3"
+ awk -v name="$func" '
+ $0 ~ "^[0-9a-f]+ <" name ">:" { in_fn = 1; print; next }
+ /^[0-9a-f]+ </ { in_fn = 0 }
+ in_fn { print }
+ ' "$src" > "$out"
+}
+
+compile_case() {
+ local triple="$1" name="$2" src="$3"
+ "$CFREE" cc -target "$triple" -O1 -c "$src" \
+ -o "$WORK/$name.o" > "$WORK/$name.cc.out" 2> "$WORK/$name.cc.err"
+ "$CFREE" objdump -d "$WORK/$name.o" \
+ > "$WORK/$name.dis" 2> "$WORK/$name.objdump.err"
+}
+
+# ---- fixtures ----
+cat > "$WORK/leaf.c" <<'EOF'
+int leaf(int x) { return x + 1; }
+EOF
+
+cat > "$WORK/leaf_call.c" <<'EOF'
+extern int g(int);
+int leaf_call(int x) { return g(x) + 1; }
+EOF
+
+# 16 bytes of locals, leaf, no calls -> fits the SysV red zone on x64.
+cat > "$WORK/small_locals.c" <<'EOF'
+int small_locals(int x) {
+ volatile int a[4];
+ a[0] = x; a[1] = x + 1; a[2] = x + 2; a[3] = x + 3;
+ return a[0] + a[1] + a[2] + a[3];
+}
+EOF
+
+# Inline asm can clobber the return-address register (rv64 ra) or the red zone
+# (x64 SysV), so a function containing ANY asm block must NOT take the
+# frame-eliding leaf/red-zone tiers, even though it makes no call.
+cat > "$WORK/asm_fn.c" <<'EOF'
+int asm_fn(int x) { __asm__ volatile("" ::: "memory"); return x + 1; }
+EOF
+cat > "$WORK/asm_locals.c" <<'EOF'
+int asm_locals(int x) {
+ volatile int a[4];
+ __asm__ volatile("" ::: "memory");
+ a[0] = x; a[1] = x + 1; a[2] = x + 2; a[3] = x + 3;
+ return a[0] + a[1] + a[2] + a[3];
+}
+EOF
+
+# ===================== aa64 reference (characterization) =====================
+compile_case aarch64-linux-gnu aa64_leaf "$WORK/leaf.c"
+slice_func "$WORK/aa64_leaf.dis" leaf "$WORK/aa64_leaf.fn"
+grep -Eq 'stp[[:space:]]+x29, x30, \[sp, #-16\]!' "$WORK/aa64_leaf.fn" ||
+ fail 'aa64 leaf is not using the slim_prologue frame record' "$WORK/aa64_leaf.fn"
+
+# ===================== x64 slim (no sub rsp on empty frame) ==================
+compile_case x86_64-linux-gnu x64_leaf "$WORK/leaf.c"
+slice_func "$WORK/x64_leaf.dis" leaf "$WORK/x64_leaf.fn"
+grep -Eq 'push[[:space:]]+%rbp' "$WORK/x64_leaf.fn" ||
+ fail 'x64 leaf dropped the rbp frame record (slim should keep push rbp)' "$WORK/x64_leaf.fn"
+grep -Eq 'sub[ql]?[[:space:]]+\$[0-9]+, %rsp' "$WORK/x64_leaf.fn" &&
+ fail 'x64 leaf still reserves stack (slim tier should skip sub rsp)' "$WORK/x64_leaf.fn"
+
+# A register-only call still leaves max_outgoing==0 and an empty frame -> slim.
+compile_case x86_64-linux-gnu x64_leaf_call "$WORK/leaf_call.c"
+slice_func "$WORK/x64_leaf_call.dis" leaf_call "$WORK/x64_leaf_call.fn"
+grep -Eq 'sub[ql]?[[:space:]]+\$[0-9]+, %rsp' "$WORK/x64_leaf_call.fn" &&
+ fail 'x64 register-only call still reserves stack (slim should skip sub rsp)' "$WORK/x64_leaf_call.fn"
+
+# ===================== x64 red-zone leaf (locals, no sub rsp) ================
+compile_case x86_64-linux-gnu x64_redzone "$WORK/small_locals.c"
+slice_func "$WORK/x64_redzone.dis" small_locals "$WORK/x64_redzone.fn"
+grep -Eq 'sub[ql]?[[:space:]]+\$[0-9]+, %rsp' "$WORK/x64_redzone.fn" &&
+ fail 'x64 red-zone leaf still reserves stack (should use the red zone, no sub rsp)' "$WORK/x64_redzone.fn"
+grep -Eq '\(%rbp\)' "$WORK/x64_redzone.fn" ||
+ fail 'x64 red-zone leaf does not address locals rbp-relative' "$WORK/x64_redzone.fn"
+
+# ===================== rv64 leaf (no frame at all) ==========================
+compile_case riscv64-linux-gnu rv64_leaf "$WORK/leaf.c"
+slice_func "$WORK/rv64_leaf.dis" leaf "$WORK/rv64_leaf.fn"
+grep -Eq '\bsd[[:space:]]+ra,' "$WORK/rv64_leaf.fn" &&
+ fail 'rv64 leaf saved ra (leaf tier should emit no frame)' "$WORK/rv64_leaf.fn"
+grep -Eq 'addi[[:space:]]+sp, sp, -' "$WORK/rv64_leaf.fn" &&
+ fail 'rv64 leaf adjusted sp (leaf tier should keep sp untouched)' "$WORK/rv64_leaf.fn"
+grep -Eq '\bret\b' "$WORK/rv64_leaf.fn" ||
+ fail 'rv64 leaf is missing its ret' "$WORK/rv64_leaf.fn"
+
+# Guard: a non-leaf (calls g) must KEEP the frame so ra survives the call.
+compile_case riscv64-linux-gnu rv64_leaf_call "$WORK/leaf_call.c"
+slice_func "$WORK/rv64_leaf_call.dis" leaf_call "$WORK/rv64_leaf_call.fn"
+grep -Eq '\bsd[[:space:]]+ra,' "$WORK/rv64_leaf_call.fn" ||
+ fail 'rv64 non-leaf dropped the ra save (leaf tier over-fired across a call)' "$WORK/rv64_leaf_call.fn"
+
+# Guard: inline asm may clobber ra, so a leaf with asm must KEEP the frame.
+compile_case riscv64-linux-gnu rv64_asm "$WORK/asm_fn.c"
+slice_func "$WORK/rv64_asm.dis" asm_fn "$WORK/rv64_asm.fn"
+grep -Eq '\bsd[[:space:]]+ra,' "$WORK/rv64_asm.fn" ||
+ fail 'rv64 leaf with inline asm dropped the ra save (slim tier must not fire with asm)' "$WORK/rv64_asm.fn"
+
+# ===================== x64 red-zone must not fire with inline asm ===========
+# Inline asm may clobber the red zone (e.g. a `call`), so a red-zone leaf with
+# asm must keep its reserved stack.
+compile_case x86_64-linux-gnu x64_asm_locals "$WORK/asm_locals.c"
+slice_func "$WORK/x64_asm_locals.dis" asm_locals "$WORK/x64_asm_locals.fn"
+grep -Eq 'sub[ql]?[[:space:]]+\$[0-9]+, %rsp' "$WORK/x64_asm_locals.fn" ||
+ fail 'x64 red-zone leaf with inline asm skipped sub rsp (must reserve, asm may clobber the red zone)' "$WORK/x64_asm_locals.fn"
+
+printf 'prologue-tier: ok\n'