kit

git clone https://git.ryansepassi.com/git/kit.git

commit a39ba2690ba6a6f5179187de2956cfa4df145657
parent f1e91a7c2257e38b8244f99c4066cabea0336507
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Mon,  1 Jun 2026 13:24:30 -0700

arch: known-frame prologue cost-model tiers for x64 and rv64

Bring the rv64 and x64 known-frame (-O1) prologue paths to parity with the
per-call cost model aa64 already has (doc/plan/ARCH.md §2). Each backend now
selects the cheapest valid frame shape in func_begin_known_frame.

- x64: slim_frame drops `sub rsp` for an empty frame (keeps push rbp; mov
  rbp,rsp); redzone_leaf keeps a SysV leaf's small frame (<=128B) in the
  128-byte red zone with no reservation. Both leave the leave/CFI/rbp-relative
  offsets untouched. No fold tier — push rbp already folds the sp move.
- rv64: a register-only leaf with no callee-saves/slots/outgoing/sret/variadic
  emits no prologue and a bare ret, eliding the whole frame setup/teardown
  (~8 insns saved per leaf). The fp_at_bottom fold is intentionally not ported
  (measured at zero instruction win: RISC-V has no pre/post-indexed stores).
- Shared: NativeKnownFrameDesc gains is_leaf and has_asm, derived in plan_frame.
  has_asm disqualifies the frame-eliding tiers, since inline asm can clobber the
  return-address register (rv64 ra) or the red zone (x64) opaquely.
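
The x64 tier choice described above can be sketched as a small pure function. This is a minimal illustration, not the backend's code: `FrameFacts`, `X64Shape` and `x64_pick_shape` are hypothetical names, and `local_bytes` stands in for the `cum_off` check in the diff. Only the gating conditions are taken from the commit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the frame facts the x64 backend
 * consults. Field names echo the commit, but this is not the project's
 * real struct. */
typedef struct {
    uint32_t ncallee_saves; /* callee-saved registers to spill */
    uint32_t local_bytes;   /* body slot bytes (assumed analogue of cum_off) */
    uint32_t max_outgoing;  /* outgoing stack-argument bytes */
    uint32_t frame_size;    /* total bytes the fat shape would reserve */
    uint32_t shadow_space;  /* 0 on SysV, nonzero on Win64 */
    bool has_alloca;
    bool is_leaf;           /* no call of any kind in the body */
    bool has_asm;           /* body contains an inline-asm block */
} FrameFacts;

typedef enum { SHAPE_FAT, SHAPE_SLIM, SHAPE_REDZONE_LEAF } X64Shape;

/* Pick the cheapest valid shape. Both cheap shapes keep push rbp; mov rbp,rsp
 * and differ from the fat shape only by omitting the sub rsp reservation. */
X64Shape x64_pick_shape(const FrameFacts *f) {
    /* slim: nothing to reserve, so it is valid for non-leaves and with asm. */
    bool slim = f->ncallee_saves == 0 && !f->has_alloca &&
                f->local_bytes == 0 && f->max_outgoing == 0;
    if (slim) return SHAPE_SLIM;

    /* red zone: SysV only, leaf only, no asm, frame fits in 128 bytes. */
    bool redzone = f->shadow_space == 0 && f->is_leaf && !f->has_asm &&
                   !f->has_alloca && f->max_outgoing == 0 &&
                   f->frame_size <= 128u;
    return redzone ? SHAPE_REDZONE_LEAF : SHAPE_FAT;
}
```

Note the asymmetry the sketch preserves: an empty frame takes the slim shape even when the function calls out or contains asm, while a function with small locals falls back to the fat shape as soon as `has_asm` is set or the ABI has shadow space.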

test/opt/prologue_tier.sh pins the per-arch shapes (incl. the asm/non-leaf
guards); runtime-verified across the toy and parse corpora at -O1.
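
The rv64 guards that the test pins (a call or an inline-asm block must keep the frame) amount to a single conjunction. The sketch below is an assumption-laden illustration: `RvLeafFacts` and `rv64_can_elide_frame` are hypothetical names, and the fields only mirror the conditions listed in the commit message, not the real `RvNativeTarget` layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of the facts the rv64 leaf tier depends on. */
typedef struct {
    bool is_leaf;      /* no call of any kind, so ra is never clobbered by one */
    bool has_asm;      /* inline asm may clobber ra opaquely */
    bool has_alloca;
    bool has_sret;
    bool is_variadic;
    uint32_t ncallee_saves;
    uint32_t slot_bytes;            /* body slots (assumed analogue of cum_off) */
    uint32_t max_outgoing;          /* outgoing stack-argument bytes */
    uint32_t signature_stack_bytes; /* 0 means register-only params */
} RvLeafFacts;

/* True when the function may emit no prologue and a bare ret: it never
 * needs s0 (no slots or stack args) and never loses ra (no calls, no asm). */
bool rv64_can_elide_frame(const RvLeafFacts *f) {
    return f->is_leaf && !f->has_asm && !f->has_alloca &&
           !f->has_sret && !f->is_variadic &&
           f->ncallee_saves == 0 && f->slot_bytes == 0 &&
           f->max_outgoing == 0 && f->signature_stack_bytes == 0;
}
```

Every condition is a veto: clearing `is_leaf` or setting `has_asm` alone is enough to force the framed shape, which matches the non-leaf and asm cases the test script asserts.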

Test infra: pin the cross-arch exec container images per arch by content
digest (test/lib/test_images.sh) instead of the shared, mutable alpine:latest
tag, so pulling one arch can't clobber another's rootfs. The run path stays
--pull=never; `make test-images` (test/lib/pull_test_images.sh) is the one
network step, and exec_target now SKIPs cleanly when an arch is unprovisioned.

Diffstat:
M doc/plan/ARCH.md | 65 ++++++++++++++++++++++++++++++++++++++++++++---------------------
M src/arch/native_target.h | 15 +++++++++++++++
M src/arch/rv64/native.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------
M src/arch/x64/native.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
M src/opt/pass_native_emit.c | 11 +++++++++++
M test/lib/exec_target.sh | 54 ++++++++++++++++++++++++++++++++++++++----------------
A test/lib/pull_test_images.sh | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
A test/lib/test_images.sh | 41 +++++++++++++++++++++++++++++++++++++++++
A test/opt/prologue_tier.sh | 142 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
9 files changed, 469 insertions(+), 71 deletions(-)

diff --git a/doc/plan/ARCH.md b/doc/plan/ARCH.md @@ -64,11 +64,10 @@ targets (see the interpreter and toy `musttail` tracks). The fixed per-call overhead -- prologue + epilogue + arg setup, independent of the body -- is the dominant cost on call-heavy code. aa64 picks one of four frame -shapes per function to minimize it; x64 and rv64 currently emit a single -RBP-anchored / s0-anchored shape and do **not** have the fold tiers. The design -rationale lives in [../ARCH.md](../ARCH.md); the aa64 measurements and the -remaining body-level warts are tracked alongside [../OPT.md](../OPT.md) and -[OPTIMIZER.md](OPTIMIZER.md). +shapes per function to minimize it. x64 and rv64 now select a cheaper known-frame +shape too (see **Done** below); the design rationale lives in +[../ARCH.md](../ARCH.md); the aa64 measurements and the remaining body-level warts +are tracked alongside [../OPT.md](../OPT.md) and [OPTIMIZER.md](OPTIMIZER.md). aa64 tiers (baseline, for reference): @@ -83,24 +82,48 @@ The known-frame asymmetry (bottom-record only on the -O1 path) is intentional: the frame-size-dependent offsets require the frame to be final before the body, which only the optimizer's frame planner guarantees. -Planned: - -- **rv64 frame fold.** Port the `fp_at_bottom` idea: for known-frame functions - with no outgoing stack args and a small frame, place the saved s0/ra pair at the - bottom (s0 = sp) so the sp adjustment folds into the save/restore and - callee-saves stack above the record at positive offsets. RISC-V has no - pre/post-indexed store, so the fold is the address-arithmetic saving, not an - addressing-mode one -- quantify the win before committing. Add a leaf/no-frame - (`slim_prologue`-equivalent) tier for leaf functions with no callee-saves. -- **x64 frame fold / leaf omission.** Add the equivalent tiers for SysV and - Win64. 
x64 already emits the exact known-frame prologue (no placeholder/patch), - so this is shape selection in `func_begin_known_frame`, not a re-architecture. - SysV leaf functions can also exploit the 128-byte red zone to skip the - `sub rsp` entirely -- design and gate this carefully against alloca and any - outgoing-arg use. +Leaf-ness is surfaced to the backends through `NativeKnownFrameDesc.is_leaf` +(set in `plan_frame`, `pass_native_emit.c`, as "no `IR_CALL` of any kind -- +regular or sibling/tail"). A leaf never clobbers the return-address register or +the stack below sp, which is what unlocks the no-frame / red-zone shapes below. + +Done: + +- **x64 slim + red-zone tiers** (`x64_func_begin_known_frame`). Two known-frame + shapes, both keeping the `push rbp; mov rbp,rsp` record (so the `leave` + epilogue, the `CFA = rbp+16` CFI, and every rbp-relative offset are unchanged) + and only dropping the `sub rsp` reservation: + - `slim_frame` -- empty frame (no callee-saves, no body slots, no outgoing + args, no alloca). Safe for non-leaves too: `push rbp` keeps rsp 16-aligned + for any register-only call, and nothing lives below rsp. SysV + Win64. + - `redzone_leaf` -- SysV leaf with a small frame (`is_leaf`, no alloca, no + outgoing args, `frame_size <= 128`). Locals/callee-saves stay at their + rbp-relative offsets, which now land in the 128-byte red zone. Leaf-only, + since any call would clobber the red zone; Win64 (no red zone) is excluded + by the `shadow_space == 0` gate. + + No x64 *fold* tier: `push rbp` already folds the sp-move into the store, so + there is no aa64-`fp_at_bottom`-style win to capture. +- **rv64 leaf tier** (`rv_func_begin_known_frame`, `slim_prologue`). 
A leaf with + no callee-saves, no body slots, no outgoing args, no sret/variadic and + register-only params (`signature_stack_bytes == 0`) never reads s0 nor clobbers + ra (both are reserved, never allocable), so it emits **no prologue** and a bare + `ret` -- the whole frame setup/teardown is elided (~8 insns/leaf). CFI is + `def_cfa(sp, 0)`, matching the CIE default (ra stays live in its register). +- **rv64 frame fold: intentionally not ported.** Porting aa64's `fp_at_bottom` + to rv64 was measured at a **zero** instruction win: RISC-V has no + pre/post-indexed store, so moving the saved s0/ra pair to the bottom still + needs a separate `addi sp,sp,-N` plus the `sd`/`addi s0,sp` -- the same four + instructions as the top-record shape. The fold only relocates data, it removes + no instruction, so it was skipped rather than add a fold-aware offset-helper + layer for no benefit. (Per the "quantify the win before committing" guidance + that previously stood here.) The rv64 leaf tier above is the real rv64 win. + +Still open: + - **Cost-model alignment.** `signature_stack_bytes` / `call_stack_bytes` are the shared hooks the optimizer uses to size the outgoing area and gate tail-call - realizability; they exist on all three. As the fold tiers and tail-call paths + realizability; they exist on all three. As the tail-call paths (section 1) land, verify the optimizer's per-call cost estimates reflect the cheaper shapes so frame/spill decisions stay consistent across arches. diff --git a/src/arch/native_target.h b/src/arch/native_target.h @@ -64,6 +64,21 @@ typedef struct NativeKnownFrameDesc { * one scratch register. The backend reserves the slot up front so the body * never grows the frame after the prologue. */ u8 needs_scratch_spill; + /* Whether the function is a leaf — its body contains no call of any kind + * (regular or sibling/tail). 
A leaf does not clobber the return-address + * register or the stack below sp through a call, so backends can omit the + * saved-frame record entirely (rv64 leaf tier) or skip the stack reservation + * and keep locals in the red zone (x64 SysV red-zone tier) — but ONLY when + * `has_asm` is also clear (see below). Conservatively false whenever any + * IR_CALL is present. */ + u8 is_leaf; + /* Whether the body contains an inline-asm block. Inline asm can clobber the + * return-address register (rv64 ra) or write into the red zone / make a call + * (x64) without the optimizer modelling it, so the frame-eliding leaf/red-zone + * tiers must NOT fire when this is set — even for an otherwise-leaf function. + * The single-pass and fat known-frame shapes always save the return address + * and reserve their stack, so they are unaffected. */ + u8 has_asm; } NativeKnownFrameDesc; typedef enum NativeAllocClass { diff --git a/src/arch/rv64/native.c b/src/arch/rv64/native.c @@ -205,6 +205,16 @@ typedef struct RvNativeTarget { u32 fp_pair_off; u32 minimal_prologue_words; /* known-frame path: exact prologue length, else 0 */ + /* Known-frame (-O1) leaf no-frame tier (aa64's slim_prologue equivalent), + * settled in rv_func_begin_known_frame; always 0 on the single-pass path. A + * leaf with no callee-saves, no body slots, no outgoing args, no sret/variadic + * and register-only params never reads s0 nor clobbers ra, so it emits NO + * prologue and a bare `ret` — the whole frame setup/teardown is elided. RISC-V + * has no pre/post-indexed store, so aa64's fp_at_bottom fold would save zero + * instructions on a kept frame and is intentionally not ported (see + * doc/plan/ARCH.md §2); this leaf tier is the rv64 win. 
*/ + u8 slim_prologue; + u32 incoming_stack_size; /* fixed-param stack bytes (tail-call check) */ u32 next_param_int; u32 next_param_fp; @@ -1240,6 +1250,7 @@ static void rv_func_begin_common(NativeTarget* t, const CGFuncDesc* fd) { a->npatches = 0; a->nalloca = 0; a->minimal_prologue_words = 0; + a->slim_prologue = 0; mc->set_section(mc, fd->text_section_id); mc->emit_align(mc, 4, 0); @@ -1422,22 +1433,27 @@ static void rv_func_end(NativeTarget* t) { /* epilogue */ mc->label_place(mc, a->epilogue_label); - for (i = (i32)n_int - 1; i >= 0; --i) - rv_load_s0(mc, 0, int_regs[i], rv_save_off(a, (u32)i)); - for (i = (i32)n_fp - 1; i >= 0; --i) - rv_load_s0(mc, 1, fp_regs[i], rv_save_off(a, n_int + (u32)i)); - if (a->frame.has_alloca) - rv_emit_addr_adjust(mc, RV_SP, RV_S0, -(i32)fp_pair_off); - rv64_emit32(mc, rv_ld(RV_RA, RV_S0, 8)); - rv64_emit32(mc, rv_ld(RV_S0, RV_S0, 0)); - /* sp += frame_size */ - if (fits_i12((i32)frame_size)) { - rv64_emit32(mc, rv_addi(RV_SP, RV_SP, (i32)frame_size)); + if (a->slim_prologue) { + /* Frameless leaf: no callee-saves, no s0/ra to reload, sp untouched. 
*/ + rv64_emit32(mc, rv_jalr(RV_ZERO, RV_RA, 0)); } else { - rv_emit_load_imm(mc, 1, RV_TMP0, (i64)frame_size); - rv64_emit32(mc, rv_add(RV_SP, RV_SP, RV_TMP0)); + for (i = (i32)n_int - 1; i >= 0; --i) + rv_load_s0(mc, 0, int_regs[i], rv_save_off(a, (u32)i)); + for (i = (i32)n_fp - 1; i >= 0; --i) + rv_load_s0(mc, 1, fp_regs[i], rv_save_off(a, n_int + (u32)i)); + if (a->frame.has_alloca) + rv_emit_addr_adjust(mc, RV_SP, RV_S0, -(i32)fp_pair_off); + rv64_emit32(mc, rv_ld(RV_RA, RV_S0, 8)); + rv64_emit32(mc, rv_ld(RV_S0, RV_S0, 0)); + /* sp += frame_size */ + if (fits_i12((i32)frame_size)) { + rv64_emit32(mc, rv_addi(RV_SP, RV_SP, (i32)frame_size)); + } else { + rv_emit_load_imm(mc, 1, RV_TMP0, (i64)frame_size); + rv64_emit32(mc, rv_add(RV_SP, RV_SP, RV_TMP0)); + } + rv64_emit32(mc, rv_jalr(RV_ZERO, RV_RA, 0)); } - rv64_emit32(mc, rv_jalr(RV_ZERO, RV_RA, 0)); /* patch prologue */ if (!a->frame.known_frame) { @@ -1462,19 +1478,27 @@ static void rv_func_end(NativeTarget* t) { /* CFI: CFA = s0 + (frame_size - fp_pair_off) */ if (mc->cfi_set_next_pc_offset && mc->cfi_def_cfa && mc->cfi_offset) { - i32 cfa = (i32)frame_size - (i32)fp_pair_off; - u32 post = a->prologue_pos + (a->frame.known_frame - ? a->minimal_prologue_words * 4u - : RV_PROLOGUE_WORDS * 4u); - u32 k; - mc->cfi_set_next_pc_offset(mc, post - a->func_start); - mc->cfi_def_cfa(mc, RV_S0, cfa); - mc->cfi_offset(mc, RV_S0, -cfa); - mc->cfi_offset(mc, RV_RA, -cfa + 8); - for (k = 0; k < n_int; ++k) - mc->cfi_offset(mc, int_regs[k], rv_save_off(a, k) - cfa); - for (k = 0; k < n_fp; ++k) - mc->cfi_offset(mc, 32u + fp_regs[k], rv_save_off(a, n_int + k) - cfa); + if (a->slim_prologue) { + /* Frameless leaf: CFA = sp (unchanged from entry) and the return address + * stays live in ra (the CIE default), so no saved-register rules. The + * state holds from the first instruction (offset 0). 
*/ + mc->cfi_set_next_pc_offset(mc, 0); + mc->cfi_def_cfa(mc, RV_SP, 0); + } else { + i32 cfa = (i32)frame_size - (i32)fp_pair_off; + u32 post = a->prologue_pos + (a->frame.known_frame + ? a->minimal_prologue_words * 4u + : RV_PROLOGUE_WORDS * 4u); + u32 k; + mc->cfi_set_next_pc_offset(mc, post - a->func_start); + mc->cfi_def_cfa(mc, RV_S0, cfa); + mc->cfi_offset(mc, RV_S0, -cfa); + mc->cfi_offset(mc, RV_RA, -cfa + 8); + for (k = 0; k < n_int; ++k) + mc->cfi_offset(mc, int_regs[k], rv_save_off(a, k) - cfa); + for (k = 0; k < n_fp; ++k) + mc->cfi_offset(mc, 32u + fp_regs[k], rv_save_off(a, n_int + k) - cfa); + } } end = mc->pos(mc); @@ -1497,6 +1521,9 @@ static void rv_reserve_callee_saves(NativeTarget* t, const u32* used, native_frame_set_callee_saves(&rv_of(t)->frame, used, nclasses, NULL, 0, 0); } +static u32 rv_signature_stack_bytes(NativeTarget* t, CfreeCgTypeId fn_type, + int* variadic, u32* nparams); + /* Optimizer entry point: the full frame is supplied up front, so the prologue * is emitted final the moment it is built — no NOP region, no func_end patch * (rv_func_end skips patching when known_frame). rv_build_prologue emits the @@ -1531,9 +1558,27 @@ static void rv_func_begin_known_frame(NativeTarget* t, const CGFuncDesc* fd, fp_pair_off = rv_fp_pair_off(a, frame_size); a->frame_size_final = frame_size; a->fp_pair_off = fp_pair_off; + a->prologue_pos = mc->pos(mc); + /* Leaf no-frame tier (aa64 slim_prologue equivalent): a leaf with no + * callee-saves, no body slots, no outgoing args, no sret/variadic and + * register-only params never reads s0 (no frame slots / stack args) nor + * clobbers ra (no calls). Emit no prologue at all; rv_func_end emits a bare + * `ret`. cum_off==0 already implies no sret slot and no param spills, but the + * extra guards keep the intent explicit. Inline asm is excluded: it can clobber + * ra opaquely, and without the saved record the bare `ret` would return through + * the destroyed link register. 
*/ + a->slim_prologue = + frame && frame->is_leaf && !frame->has_asm && + a->frame.ncallee_saves == 0 && !a->frame.has_alloca && + a->frame.cum_off == 0 && a->frame.max_outgoing == 0 && !a->has_sret && + !a->is_variadic && rv_signature_stack_bytes(t, fd->fn_type, NULL, NULL) == 0; + if (a->slim_prologue) { + a->minimal_prologue_words = 0; + native_frame_set_final(&a->frame); + return; + } n_int = rv_collect_int_saves(a, int_regs); n_fp = rv_collect_fp_saves(a, fp_regs); - a->prologue_pos = mc->pos(mc); nwords = rv_build_prologue(a, words, RV_KNOWN_PROLOGUE_WORDS, frame_size, fp_pair_off, int_regs, n_int, fp_regs, n_fp); for (i = 0; i < nwords; ++i) rv64_emit32(mc, words[i]); diff --git a/src/arch/x64/native.c b/src/arch/x64/native.c @@ -101,6 +101,24 @@ typedef struct X64NativeTarget { u32 prologue_nbytes; MCLabel epilogue_label; + /* Known-frame (-O1) prologue cost-model tiers, settled in + * x64_func_begin_known_frame; both 0 on the single-pass path (which can't know + * the frame up front). Either one suppresses the `sub rsp` reservation; the + * rbp frame record (push rbp; mov rbp,rsp) and every rbp-relative offset stay + * unchanged, so the epilogue (`leave`), CFI (CFA = rbp+16), and debug locs are + * identical to the fat shape. + * slim_frame - empty frame (no callee-saves/locals/outgoing/alloca): the + * `sub rsp` reserved nothing, so it is simply dropped. Safe + * for non-leaves (push rbp keeps rsp 16-aligned for calls, + * and nothing lives below rsp). SysV + Win64. + * redzone_leaf - SysV leaf with a small frame (<= 128B, no alloca, no + * outgoing args): locals/callee-saves stay at their + * rbp-relative offsets, which now land in the 128-byte red + * zone instead of a reserved region. Leaf-only — a call would + * clobber the red zone. 
*/ + u8 slim_frame; + u8 redzone_leaf; + /* Optimizer (-O1) entry binds: register-destination param binds are deferred * here and resolved as a parallel copy in x64_bind_params_end, since the * allocator may rotate params across the incoming arg registers — a @@ -1431,10 +1449,16 @@ static ObjSymId x64_chkstk_sym(NativeTarget* t) { } /* Build the prologue byte sequence into buf. Returns bytes written and, when - * the chkstk path fires, the disp32 offset of the call site. */ + * the chkstk path fires, the disp32 offset of the call site. When `skip_sub` is + * set (the known-frame slim / red-zone tiers), the `sub rsp` reservation is + * omitted entirely: the frame record is established but no stack is reserved, + * either because the frame is empty (slim) or because the locals/saves live in + * the SysV red zone (redzone_leaf). Callers must only set it when the frame + * needs no reserved region (no alloca, no outgoing args, and — for the red + * zone — a leaf frame <= 128 bytes). */ static u32 x64_build_prologue(X64NativeTarget* a, u8* buf, u32 cap, u32 frame_size, const Reg* cs_int, u32 n_int, - const Reg* cs_fp, u32 n_fp, + const Reg* cs_fp, u32 n_fp, int skip_sub, u32* chkstk_disp_pos_out) { u32 wi = 0; u32 xmm_base = x64_xmm_base(a, n_fp); @@ -1446,8 +1470,11 @@ static u32 x64_build_prologue(X64NativeTarget* a, u8* buf, u32 cap, buf[wi++] = X64_REX_BASE | X64_REX_W; buf[wi++] = X64_OPC_MOV_RM_R; buf[wi++] = modrm(3u, X64_RSP, X64_RBP); - /* sub rsp, frame_size (or chkstk on Win64 large frame). */ - if (a->abi->shadow_space && frame_size > X64_WIN64_CHKSTK_THRESHOLD) { + /* sub rsp, frame_size (or chkstk on Win64 large frame); skipped by the slim / + * red-zone tiers, which reserve no stack. 
*/ + if (skip_sub) { + /* no reservation */ + } else if (a->abi->shadow_space && frame_size > X64_WIN64_CHKSTK_THRESHOLD) { if (wi + 13u > cap) x64_panic(a, "prologue placeholder overflow"); buf[wi++] = (u8)(X64_OPC_MOV_RI | (X64_RAX & 7u)); /* mov eax, imm32 */ wr_u32_le(buf + wi, frame_size); @@ -1527,6 +1554,8 @@ static void x64_func_begin_common(NativeTarget* t, const CGFuncDesc* fd) { a->npatches = 0; a->nalloca = 0; a->nbind_moves = 0; + a->slim_frame = 0; + a->redzone_leaf = 0; a->prologue_nbytes = a->abi->shadow_space ? X64_PROLOGUE_BYTES_WIN64 : X64_PROLOGUE_BYTES; @@ -1637,9 +1666,25 @@ static void x64_func_begin_known_frame(NativeTarget* t, const CGFuncDesc* fd, n_fp = x64_collect_fp_saves(a, cs_fp); frame_size = x64_compute_frame_size(a, n_int, n_fp); a->frame_size_final = frame_size; + /* Cost-model tier selection (mirrors aa64's aa_func_begin_known_frame): with + * the frame final before the body, choose the cheapest valid prologue shape. + * Both tiers keep the rbp record and only drop the `sub rsp`, so the + * epilogue/CFI/offset helpers are untouched. x64 needs no `fp_at_bottom`-style + * fold: `push rbp` already folds the sp-move into the store. */ + a->slim_frame = a->frame.ncallee_saves == 0 && !a->frame.has_alloca && + a->frame.cum_off == 0 && a->frame.max_outgoing == 0; + /* redzone keeps locals below rsp in the red zone; exclude inline asm, which + * may issue a `call` (clobbering the red zone) the optimizer can't see. slim + * needs no such guard: it has no locals there and the return address lives on + * the stack at [rbp+8], not in a clobberable register. 
*/ + a->redzone_leaf = !a->slim_frame && a->abi->shadow_space == 0 && + frame && frame->is_leaf && !frame->has_asm && + !a->frame.has_alloca && a->frame.max_outgoing == 0 && + frame_size <= 128u; a->prologue_pos = mc->pos(mc); nbytes = x64_build_prologue(a, buf, sizeof buf, frame_size, cs_int, n_int, - cs_fp, n_fp, &chkstk_disp_pos); + cs_fp, n_fp, a->slim_frame || a->redzone_leaf, + &chkstk_disp_pos); mc->emit_bytes(mc, buf, nbytes); if (chkstk_disp_pos != (u32)-1) { ObjSymId chk = x64_chkstk_sym(t); @@ -1685,8 +1730,10 @@ static void x64_func_end(NativeTarget* t) { u32 nbytes; u32 k; for (k = 0; k < a->prologue_nbytes; ++k) buf[k] = X64_NOP1; + /* Single-pass path never selects a slim/red-zone tier (it cannot know the + * frame up front), so it always emits the full reservation. */ nbytes = x64_build_prologue(a, buf, a->prologue_nbytes, frame_size, cs_int, - n_int, cs_fp, n_fp, &chkstk_disp_pos); + n_int, cs_fp, n_fp, 0, &chkstk_disp_pos); (void)nbytes; obj_patch(obj, sec, a->prologue_pos, buf, a->prologue_nbytes); if (chkstk_disp_pos != (u32)-1) { diff --git a/src/opt/pass_native_emit.c b/src/opt/pass_native_emit.c @@ -1354,6 +1354,8 @@ static void plan_frame(NativeEmitCtx* e, const CGFuncDesc* fd) { u32 max_args = 0, max_outgoing = 0; u8 has_alloca = 0; u8 needs_scratch_spill = 0; + u8 has_call = 0; + u8 has_asm = 0; memset(&frame, 0, sizeof frame); nclasses = t->reserve_callee_saves ? compute_callee_saved_used(e, used, EMIT_MAX_REG_CLASSES) @@ -1369,7 +1371,14 @@ static void plan_frame(NativeEmitCtx* e, const CGFuncDesc* fd) { needs_scratch_spill = 1; } else if ((IROp)in->op == IR_CALL) { IRCallAux* aux = (IRCallAux*)in->extra.aux; + /* Any call (regular or sibling/tail) means the function is not a leaf: + * it clobbers the return-address register and the stack below sp. 
*/ + has_call = 1; if (aux && aux->desc.nargs > max_args) max_args = aux->desc.nargs; + } else if ((IROp)in->op == IR_ASM_BLOCK) { + /* Inline asm may clobber the return-address register or the red zone + * opaquely; disqualifies the frame-eliding tiers (see has_asm). */ + has_asm = 1; } } } @@ -1425,6 +1434,8 @@ static void plan_frame(NativeEmitCtx* e, const CGFuncDesc* fd) { frame.ncallee_classes = nclasses; frame.has_alloca = has_alloca; frame.needs_scratch_spill = needs_scratch_spill; + frame.is_leaf = !has_call; + frame.has_asm = has_asm; t->func_begin_known_frame(t, fd, &frame, out_slots); for (u32 i = 0; i < e->f->nframe_slots; ++i) e->slot_map[e->f->frame_slots[i].id] = out_slots[i]; diff --git a/test/lib/exec_target.sh b/test/lib/exec_target.sh @@ -40,9 +40,14 @@ # will be queued. The same path is bind-mounted at the same path # inside the container. # - Optional: RUN_AARCH64_IMAGE / RUN_X64_IMAGE / RUN_RV64_IMAGE -# override the container image (default alpine:latest — musl -# libc, matching the prior inline implementation and consistent -# with test/smoke/rv64.sh). +# override the container image. The defaults are pinned, per-arch, +# content-addressed alpine digests (see test/lib/test_images.sh); +# provision them once with `make test-images`. Every `podman run` +# below uses --pull=never, so the run path never touches the network. + +# Pinned per-arch image references (sets RUN_<ARCH>_IMAGE defaults and provides +# cfree_test_image_for_arch). Sourced relative to this file's location. +. "$(dirname "${BASH_SOURCE[0]}")/test_images.sh" # Internal queue arrays. Each entry's tag is recorded alongside the # rest so flush can split into per-target batched runs. @@ -87,19 +92,33 @@ _exec_target_platform() { esac } -# Default image is alpine:latest (musl libc). Chosen for rv64 because: -# - musl is the C runtime the rv64 lane is brought up against -# (matches test/smoke/rv64.sh default). 
-# - alpine ships riscv64 images in the official manifest, so podman -# can pull and exec under qemu-user without bespoke registries. -# Override per-arch with RUN_<ARCH>_IMAGE when a glibc base is needed. +# Per-arch pinned image (content-addressed alpine digest, musl libc). The pins +# live in test/lib/test_images.sh and are provisioned by `make test-images`; +# RUN_<ARCH>_IMAGE overrides them (e.g. for a glibc base). Distinct digests per +# arch mean local storage can never confuse one arch's rootfs for another's. _exec_target_image() { - case "$(_exec_target_arch "$1")" in - aarch64) echo "${RUN_AARCH64_IMAGE:-alpine:latest}" ;; - x64) echo "${RUN_X64_IMAGE:-alpine:latest}" ;; - rv64) echo "${RUN_RV64_IMAGE:-alpine:latest}" ;; - *) echo "alpine:latest" ;; - esac + local img; img="$(cfree_test_image_for_arch "$(_exec_target_arch "$1")")" + [ -n "$img" ] && printf '%s' "$img" || printf 'alpine:latest' +} + +# Memoized: is this arch's pinned image present in local storage? The harnesses +# run with --pull=never, so a missing image means the container runner is +# unavailable until `make test-images` provisions it. Cached per arch so +# exec_target_supported stays a constant cost across hundreds of cases. +_exec_target_image_present() { + local arch var cached + arch="$(_exec_target_arch "$1")" + var="_EXEC_TARGET_IMG_${arch}" + cached="${!var:-}" + if [ -z "$cached" ]; then + if podman image exists "$(_exec_target_image "$1")" 2>/dev/null; then + cached=yes + else + cached=no + fi + printf -v "$var" '%s' "$cached" + fi + [ "$cached" = yes ] } # True when the host can exec this target without container/qemu help. 
@@ -184,7 +203,10 @@ exec_target_supported() { fi _exec_target_native "$tag" && return 0 [ -n "$(_exec_target_qemu "$tag")" ] && return 0 - [ "${have_podman:-0}" -eq 1 ] && return 0 + # podman is only a usable runner once the arch's pinned image is provisioned + # locally (run path is --pull=never); otherwise report unsupported so callers + # SKIP cleanly instead of failing on a missing image. + [ "${have_podman:-0}" -eq 1 ] && _exec_target_image_present "$tag" && return 0 return 1 } diff --git a/test/lib/pull_test_images.sh b/test/lib/pull_test_images.sh @@ -0,0 +1,52 @@ +#!/usr/bin/env bash +# Provision the pinned per-arch alpine rootfs images that exec_target.sh runs +# cfree-emitted binaries inside. +# +# THIS is the only test step that touches the network. The test harnesses +# themselves always run with `podman run --pull=never`, so the images must be +# present locally first. Run once (and again only after the pin in +# test/lib/test_images.sh changes): +# +# make test-images # or: bash test/lib/pull_test_images.sh +# +# Each arch is pulled by its own content digest, so the three images coexist in +# local storage and can never be confused for one another. Set FORCE=1 to +# re-pull even when an image is already present. +set -euo pipefail + +ROOT="$(cd "$(dirname "$0")/../.." && pwd)" +# shellcheck source=test/lib/test_images.sh +. "$ROOT/test/lib/test_images.sh" + +if ! 
command -v podman >/dev/null 2>&1; then + echo "pull-test-images: podman not found; nothing to provision" >&2 + exit 1 +fi + +rc=0 +for arch in $CFREE_TEST_ARCHES; do + ref="$(cfree_test_image_for_arch "$arch")" + if [ -z "$ref" ]; then + echo "pull-test-images: no image pinned for arch '$arch'" >&2 + rc=1 + continue + fi + if [ -z "${FORCE:-}" ] && podman image exists "$ref"; then + echo "pull-test-images: $arch already present ($ref)" + continue + fi + # Digest references are single-arch and content-addressed, so no --platform + # is needed and pulling one arch cannot affect another's local image. + echo "pull-test-images: pulling $arch <- $ref" + if id="$(podman pull -q "$ref")"; then + echo "pull-test-images: $arch ready ($id)" + else + echo "pull-test-images: FAILED to pull $arch ($ref)" >&2 + rc=1 + fi +done + +if [ "$rc" -eq 0 ]; then + echo "pull-test-images: all arches provisioned" +fi +exit "$rc" diff --git a/test/lib/test_images.sh b/test/lib/test_images.sh @@ -0,0 +1,41 @@ +# test/lib/test_images.sh — pinned per-arch container images for hermetic +# cross-arch test execution. Sourced by test/lib/exec_target.sh (the run path) +# and test/lib/pull_test_images.sh (the provisioning path). +# +# The compiler test harnesses run cfree-emitted ELF binaries inside a container +# under qemu-user. To keep ALL network off the main test path, every `podman +# run` uses --pull=never against these LOCAL images, which must be provisioned +# ahead of time with `make test-images` (the one network-touching step). +# +# Each arch is pinned to its OWN content-addressed image digest under +# docker.io/library/alpine. 
Because the references are content digests rather +# than the shared, mutable `alpine:latest` tag, pulling one arch can NEVER +# clobber another arch's rootfs in local storage — that wrong-arch-manifest +# trap (a `--platform riscv64` pull retagging the local `alpine:latest` to the +# riscv64 image, so a later `--platform amd64 --pull=never` run executes against +# the riscv64 rootfs) is what this scheme exists to prevent. +# +# Pin: alpine 3.23.4, manifest list +# docker.io/library/alpine@sha256:5b10f432ef3da1b8d4c7eb6c487f2f5a8f096bc91145e68878dd4a5019afde11 +# To bump the pin: `podman manifest inspect <new alpine ref>` and replace the +# three per-arch image digests below (the `application/vnd.*.manifest.v1+json` +# entry for each architecture). + +# Per-arch pins. `:=` so an explicit RUN_<ARCH>_IMAGE override from the caller +# (e.g. test/asm/hostas_cross.sh, which needs a glibc/edge variant) still wins. +: "${RUN_AARCH64_IMAGE:=docker.io/library/alpine@sha256:378c4c5418f7493bd500ad21ffb43818d0689daaad43e3261859fb417d1481a0}" +: "${RUN_X64_IMAGE:=docker.io/library/alpine@sha256:4d889c14e7d5a73929ab00be2ef8ff22437e7cbc545931e52554a7b00e123d8b}" +: "${RUN_RV64_IMAGE:=docker.io/library/alpine@sha256:667d07bf2f6239f094f64b5682c8ffbe24c9f3139b1fb854f85caf931a3d7439}" + +# arch token ("aarch64"/"x64"/"rv64") -> pinned image reference. +cfree_test_image_for_arch() { + case "$1" in + aarch64) printf '%s' "$RUN_AARCH64_IMAGE" ;; + x64) printf '%s' "$RUN_X64_IMAGE" ;; + rv64) printf '%s' "$RUN_RV64_IMAGE" ;; + *) printf '' ;; + esac +} + +# The arches provisioned/runnable through a container. +CFREE_TEST_ARCHES="aarch64 x64 rv64" diff --git a/test/opt/prologue_tier.sh b/test/opt/prologue_tier.sh @@ -0,0 +1,142 @@ +#!/usr/bin/env bash +# Structural checks for the -O1 known-frame prologue cost-model tiers. 
+# +# aa64 is the reference (it already selects a minimal frame shape per function); +# this pins that behaviour and asserts the ported equivalents on x64 and rv64: +# +# x64 slim leaf-ish frame (no callee-saves/locals/outgoing) keeps +# `push rbp; mov rbp,rsp` but drops the `sub rsp` reservation. +# x64 red-zone SysV leaf with a small frame keeps its locals/saves at +# rbp-relative offsets in the 128-byte red zone, no `sub rsp`. +# rv64 leaf a true leaf with no frame needs (no callee-saves, no slots, +# no outgoing, register-only params) emits NO prologue and a +# bare `ret` -- it never saves ra / sets up s0. +# +# A non-leaf (calls something) must NOT take the leaf tiers, since the call +# clobbers ra (rv64) / the red zone (x64): those guards are asserted too. +set -euo pipefail + +ROOT="$(cd "$(dirname "$0")/../.." && pwd)" +CFREE="${CFREE:-$ROOT/build/cfree}" +WORK="$ROOT/build/test/opt/prologue_tier" +rm -rf "$WORK" +mkdir -p "$WORK" + +fail() { + printf 'prologue-tier check FAILED: %s\n' "$1" >&2 + if [ -n "${2:-}" ] && [ -f "$2" ]; then + sed 's/^/ | /' "$2" >&2 + fi + exit 1 +} + +slice_func() { + local src="$1" func="$2" out="$3" + awk -v name="$func" ' + $0 ~ "^[0-9a-f]+ <" name ">:" { in_fn = 1; print; next } + /^[0-9a-f]+ </ { in_fn = 0 } + in_fn { print } + ' "$src" > "$out" +} + +compile_case() { + local triple="$1" name="$2" src="$3" + "$CFREE" cc -target "$triple" -O1 -c "$src" \ + -o "$WORK/$name.o" > "$WORK/$name.cc.out" 2> "$WORK/$name.cc.err" + "$CFREE" objdump -d "$WORK/$name.o" \ + > "$WORK/$name.dis" 2> "$WORK/$name.objdump.err" +} + +# ---- fixtures ---- +cat > "$WORK/leaf.c" <<'EOF' +int leaf(int x) { return x + 1; } +EOF + +cat > "$WORK/leaf_call.c" <<'EOF' +extern int g(int); +int leaf_call(int x) { return g(x) + 1; } +EOF + +# 16 bytes of locals, leaf, no calls -> fits the SysV red zone on x64. 
+cat > "$WORK/small_locals.c" <<'EOF' +int small_locals(int x) { + volatile int a[4]; + a[0] = x; a[1] = x + 1; a[2] = x + 2; a[3] = x + 3; + return a[0] + a[1] + a[2] + a[3]; +} +EOF + +# Inline asm can clobber the return-address register (rv64 ra) or the red zone +# (x64 SysV), so a function containing ANY asm block must NOT take the +# frame-eliding leaf/red-zone tiers, even though it makes no call. +cat > "$WORK/asm_fn.c" <<'EOF' +int asm_fn(int x) { __asm__ volatile("" ::: "memory"); return x + 1; } +EOF +cat > "$WORK/asm_locals.c" <<'EOF' +int asm_locals(int x) { + volatile int a[4]; + __asm__ volatile("" ::: "memory"); + a[0] = x; a[1] = x + 1; a[2] = x + 2; a[3] = x + 3; + return a[0] + a[1] + a[2] + a[3]; +} +EOF + +# ===================== aa64 reference (characterization) ===================== +compile_case aarch64-linux-gnu aa64_leaf "$WORK/leaf.c" +slice_func "$WORK/aa64_leaf.dis" leaf "$WORK/aa64_leaf.fn" +grep -Eq 'stp[[:space:]]+x29, x30, \[sp, #-16\]!' "$WORK/aa64_leaf.fn" || + fail 'aa64 leaf is not using the slim_prologue frame record' "$WORK/aa64_leaf.fn" + +# ===================== x64 slim (no sub rsp on empty frame) ================== +compile_case x86_64-linux-gnu x64_leaf "$WORK/leaf.c" +slice_func "$WORK/x64_leaf.dis" leaf "$WORK/x64_leaf.fn" +grep -Eq 'push[[:space:]]+%rbp' "$WORK/x64_leaf.fn" || + fail 'x64 leaf dropped the rbp frame record (slim should keep push rbp)' "$WORK/x64_leaf.fn" +grep -Eq 'sub[ql]?[[:space:]]+\$[0-9]+, %rsp' "$WORK/x64_leaf.fn" && + fail 'x64 leaf still reserves stack (slim tier should skip sub rsp)' "$WORK/x64_leaf.fn" + +# A register-only call still leaves max_outgoing==0 and an empty frame -> slim. 
+compile_case x86_64-linux-gnu x64_leaf_call "$WORK/leaf_call.c"
+slice_func "$WORK/x64_leaf_call.dis" leaf_call "$WORK/x64_leaf_call.fn"
+grep -Eq 'sub[ql]?[[:space:]]+\$[0-9]+, %rsp' "$WORK/x64_leaf_call.fn" &&
+  fail 'x64 register-only call still reserves stack (slim should skip sub rsp)' "$WORK/x64_leaf_call.fn"
+
+# ===================== x64 red-zone leaf (locals, no sub rsp) ================
+compile_case x86_64-linux-gnu x64_redzone "$WORK/small_locals.c"
+slice_func "$WORK/x64_redzone.dis" small_locals "$WORK/x64_redzone.fn"
+grep -Eq 'sub[ql]?[[:space:]]+\$[0-9]+, %rsp' "$WORK/x64_redzone.fn" &&
+  fail 'x64 red-zone leaf still reserves stack (should use the red zone, no sub rsp)' "$WORK/x64_redzone.fn"
+grep -Eq '\(%rbp\)' "$WORK/x64_redzone.fn" ||
+  fail 'x64 red-zone leaf does not address locals rbp-relative' "$WORK/x64_redzone.fn"
+
+# ===================== rv64 leaf (no frame at all) ==========================
+compile_case riscv64-linux-gnu rv64_leaf "$WORK/leaf.c"
+slice_func "$WORK/rv64_leaf.dis" leaf "$WORK/rv64_leaf.fn"
+grep -Eq '\bsd[[:space:]]+ra,' "$WORK/rv64_leaf.fn" &&
+  fail 'rv64 leaf saved ra (leaf tier should emit no frame)' "$WORK/rv64_leaf.fn"
+grep -Eq 'addi[[:space:]]+sp, sp, -' "$WORK/rv64_leaf.fn" &&
+  fail 'rv64 leaf adjusted sp (leaf tier should keep sp untouched)' "$WORK/rv64_leaf.fn"
+grep -Eq '\bret\b' "$WORK/rv64_leaf.fn" ||
+  fail 'rv64 leaf is missing its ret' "$WORK/rv64_leaf.fn"
+
+# Guard: a non-leaf (calls g) must KEEP the frame so ra survives the call.
+compile_case riscv64-linux-gnu rv64_leaf_call "$WORK/leaf_call.c"
+slice_func "$WORK/rv64_leaf_call.dis" leaf_call "$WORK/rv64_leaf_call.fn"
+grep -Eq '\bsd[[:space:]]+ra,' "$WORK/rv64_leaf_call.fn" ||
+  fail 'rv64 non-leaf dropped the ra save (leaf tier over-fired across a call)' "$WORK/rv64_leaf_call.fn"
+
+# Guard: inline asm may clobber ra, so a leaf with asm must KEEP the frame.
+compile_case riscv64-linux-gnu rv64_asm "$WORK/asm_fn.c"
+slice_func "$WORK/rv64_asm.dis" asm_fn "$WORK/rv64_asm.fn"
+grep -Eq '\bsd[[:space:]]+ra,' "$WORK/rv64_asm.fn" ||
+  fail 'rv64 leaf with inline asm dropped the ra save (slim tier must not fire with asm)' "$WORK/rv64_asm.fn"
+
+# ===================== x64 red-zone must not fire with inline asm ===========
+# Inline asm may clobber the red zone (e.g. a `call`), so a red-zone leaf with
+# asm must keep its reserved stack.
+compile_case x86_64-linux-gnu x64_asm_locals "$WORK/asm_locals.c"
+slice_func "$WORK/x64_asm_locals.dis" asm_locals "$WORK/x64_asm_locals.fn"
+grep -Eq 'sub[ql]?[[:space:]]+\$[0-9]+, %rsp' "$WORK/x64_asm_locals.fn" ||
+  fail 'x64 red-zone leaf with inline asm skipped sub rsp (must reserve, asm may clobber the red zone)' "$WORK/x64_asm_locals.fn"
+
+printf 'prologue-tier: ok\n'