kit

kit
git clone https://git.ryansepassi.com/git/kit.git
Log | Files | Refs | README

commit 9e004ebf1826ac067f5b6ab8c2790eb16ad52c6c
parent ed078ac80320bb5d7b8e4aa8c4e780d0aec8fdb8
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Thu, 28 May 2026 18:48:40 -0700

opt/aa64: fold O1 prologue via fp-at-bottom frame layout

Add the `fp_at_bottom` frame tier for known-frame (-O1) functions with
callee-saves/locals and no outgoing stack args: move the frame record to
the bottom of the frame (fp = sp) so the sp adjustment folds into a
pre-indexed `stp x29,x30,[sp,#-N]!` entry and post-indexed
`ldp x29,x30,[sp],#N` exit. Callee-saves stack above the record at
positive offsets (scaled `str`/`stp`, since the unscaled stur can't reach
offsets past 255). This is -2 fixed insns/call vs the top-record
slim_small_frame, matching gcc -O0's prologue shape.

The layout is frame-size-dependent (incoming-arg/slot offsets, CFA), so
it's only taken on the known-frame path where the frame is final before
the body. out_stack>0 frames keep the top-record slim_small_frame (the
record can't share the bottom with outgoing args). All layout decisions
stay centralized in the aa_fp_off_* / aa_cfa_off helpers — no use site
branches on the layout.

DWARF CFA becomes fp + frame_size (vs fp + 16 for top-record); verified
via .eh_frame. On binary-trees the four hot functions drop below gcc -O0
(12/18/21/16 vs 16/21/24/18 insns); cc runtime is unchanged (workload is
malloc-bound) but the JIT path reaches parity with gcc -O0. See
doc/PERCALL.md and doc/OPT_O1_PERF_TODO.md.

Regression test 140_fp_callee_save_bottom_frame exercises an f64
callee-save (v8-v15) at a >255 offset (the scaled d-register path).

Diffstat:
Mdoc/OPT_O1_PERF_TODO.md | 98++++++++++++++++++++++++++++++++++++++++---------------------------------------
Mdoc/PERCALL.md | 113+++++++++++++++++++++++++++++++++++++++++--------------------------------------
Msrc/arch/aa64/native.c | 216+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------
Atest/toy/cases/140_fp_callee_save_bottom_frame.expected | 1+
Atest/toy/cases/140_fp_callee_save_bottom_frame.toy | 38++++++++++++++++++++++++++++++++++++++
5 files changed, 309 insertions(+), 157 deletions(-)

diff --git a/doc/OPT_O1_PERF_TODO.md b/doc/OPT_O1_PERF_TODO.md @@ -27,7 +27,7 @@ the cached baseline in `scripts/opt_bench_baseline.csv`; regenerate with | bench | cfree -O1 | gcc -O0 | vs gcc-O0 | mir -O1 | vs mir-O1 | behind | | --- | ---: | ---: | ---: | ---: | ---: | --- | -| binary-trees | 2973 | 2647 | **0.89×** (slower) | n/a¹ | — | gcc | +| binary-trees | 2969² | 2647 | **0.89×** (slower) | n/a¹ | — | gcc (cc path) | | lists | 4843 | 8868 | 1.83× ✓ | 4997 | 1.03× | mir | | hash2 | 4988 | 7481 | 1.50× ✓ | 3863 | **0.77×** | mir | | sieve | 5148 | 5077 | 0.99× (~tied) | 4028 | **0.78×** | gcc (~tied), mir | @@ -36,17 +36,32 @@ the cached baseline in `scripts/opt_bench_baseline.csv`; regenerate with ¹ mir-c2m fails to compile `binary-trees`, so only the gcc comparison applies. +² Refreshed 2026-05-28 after the `fp_at_bottom` prologue fold (item 2 below). +The fold landed (−2 fixed insns/call, codegen now byte-for-byte the same +prologue/epilogue shape as gcc -O0) but the **cc-linked runtime is unchanged** +(2973 → 2969): binary-trees is malloc/free-bound, so per-call ALU savings are +noise. The **JIT path is at parity** — `cfree-run -O1` = 2617 ms (~tied with gcc +2647), vs the cc binary's 2969. That ~350 ms cc-only deficit (identical codegen) +points at PLT/GOT indirection for `malloc`/`free` — ~15M dynamic-symbol calls +routed through stubs — as the real remaining cost on the cc path, not codegen. + ## Per-benchmark notes -### binary-trees — slower than unoptimized gcc (highest priority) -The only case where cfree `-O1` is *slower than gcc -O0* (0.89×, up from 0.84× -after the known-frame prologue landed — item 1 below). Workload is -recursive tree build/walk: four tiny functions (`NewTreeNode`, `ItemCheck`, -`BottomUpTree`, `DeleteTree`) called ~7.6M times at depth=19, plus a -`malloc`/`free` per node. The **body** of each function is fine — cfree -O1 -keeps the recurring pointer in a callee-save (x19) where gcc -O0 spills and -reloads it three times per call. The gap is entirely in **fixed per-call -overhead** — every byte of which is multiplied by 7.6M. +### binary-trees — cc path slower than unoptimized gcc; JIT at parity +cfree `-O1` (cc) is 0.89× gcc -O0 — but this is **no longer a codegen gap**. +After the `fp_at_bottom` prologue fold (item 2, DONE) the four hot functions +(`NewTreeNode` 12, `ItemCheck` 18, `BottomUpTree` 21, `DeleteTree` 16 insns) are +each *smaller* than gcc -O0 (16/21/24/18) and share gcc's exact +prologue/epilogue shape, yet cc runtime didn't move (2973 → 2969). The workload +is malloc/free-bound: four tiny functions called ~7.6M times at depth=19 with a +`malloc`/`free` per node, so the ~15M instructions removed are dwarfed by +allocator cost. Evidence it's *not* codegen: the JIT path (`cfree-run -O1`, +identical codegen, no PLT) is ~tied with gcc (2617 vs 2647). The remaining cc +deficit is the dynamic-call path to `malloc`/`free` through PLT/GOT stubs — the +next thing to chase here, if anything (e.g. `-fno-plt`-style direct calls, or +the linker resolving intra-image calls). The **body** quality was never the +problem: cfree keeps the recurring pointer in a callee-save (x19) where gcc -O0 +spills and reloads it three times per call. Open items, in priority order (most recent disasm in `/tmp/mc/binary-trees.cfree.o`): @@ -62,22 +77,17 @@ Open items, in priority order (most recent disasm in directly — no filler, no patch. Functions now fall straight from the prologue into the body. **-1 to -2 insns at entry, every function.** -2. **Prologue compaction: 4-insn → 2-insn pre-indexed.** Half-done. cfree - today emits: - ``` - sub sp, sp, #N - stp x29, x30, [sp, #M] - add x29, sp, #M - stp x20, x19, [x29, #-K] ; or `stur x19, ...` when only x19 live - ``` - `bcca0bd` added `slim_prologue` in `src/arch/aa64/native.c`, which uses - the pre-indexed `stp x29, x30, [sp, #-N]!` form — but only when - `ncallee_saves == 0`. The common path (one or more callee-saves) still - goes through `slim_small_frame`, which doesn't pre-index. Extend - `slim_small_frame` to use `aa64_stp64_pre` for the FP-save when the frame - size is known in advance, then emit the callee-save spills as separate - `stur`s afterward. Saves 1 insn at entry + 1 at every exit. **~2 insns × - 7.6M calls.** +2. **Prologue compaction: 4-insn → 2-insn pre-indexed. [DONE]** Implemented as + the `fp_at_bottom` tier in `src/arch/aa64/native.c`: a known-frame (-O1) small + frame with callee-saves and no outgoing stack args moves the frame record to + the bottom (fp = sp), folding the sp adjustment into the pre-indexed + `stp x29,x30,[sp,#-N]!` entry and post-indexed `ldp x29,x30,[sp],#N` exit; + callee-saves stack above the record at positive offsets (scaled `str`/`stp`). + Entry 4→3, exit 3→2 insns (−2/call), matching gcc -O0's shape. CFA becomes + `fp + frame_size`. See `doc/PERCALL.md`. Removed ~15M instructions at + depth=19 — but binary-trees runtime is malloc-bound, so the cc-path runtime + was unchanged (see the section intro). The win is real for code size and for + call-heavy, *non*-allocation-bound workloads. 3. **Zero materialized through a temp in `BottomUpTree` leaf path.** `NewTreeNode(NULL, NULL)` still emits: @@ -95,22 +105,15 @@ Open items, in priority order (most recent disasm in to also fire when the source op is `IR_LOAD_IMM`. **+2 insns × 524k leaf calls.** -4. **Trailing `b A; A: b B` pair in `DeleteTree`'s if/else merge.** - ``` - c9c: mov x0, x19 - ca0: bl _free - ca4: b 0xcac <-- jumps directly to the epilogue label - ca8: b 0xc9c <-- unreachable; left over from else side - cac: ldur x19, [x29, #-0x8] - ``` - Classic jump-thread target. `cleanup_layout_fallthrough_branches` in - `pass_jump.c` doesn't pick up this shape (two consecutive `b`s where - the second is unreachable). +1 insn per `DeleteTree` invocation - (7.6M calls). +4. **Trailing `b A; A: b B` pair in `DeleteTree`'s if/else merge. [GONE]** No + longer present as of the 2026-05-28 disasm: `DeleteTree`'s `b.eq` now targets + the epilogue directly, one clean branch, 16 insns total (vs gcc's 18). Either + a `pass_jump.c` change or the layout shift resolved it; re-add to this list + only if it reappears. -Together these would save 4–6 insns per call, ~30–50M instructions removed -at depth=19. Body quality is already on par with gcc-O0; this is all -fixed per-call overhead. +These items are all resolved or no-impact for binary-trees: the four functions +are now smaller than gcc -O0 (12/18/21/16 vs 16/21/24/18). The remaining +cc-path runtime gap is allocator/PLT cost, not codegen — see the section intro. ### mandelbrot — 0.92× vs mir (close to the bar) Inner loop is FP-heavy (`Tr*Tr + Ti*Ti < 4.0` Mandelbrot escape test + @@ -154,13 +157,12 @@ are the most concrete tests for whether each is complete. frame up front, so the prologue is emitted final in its slim form rather than fat-then-patched). Affected every cfree-compiled function. -2. **Compact FP-frame prologue/epilogue.** See binary-trees item 2. The - 2-insn pre-indexed form is wired in for the no-callee-save case (Tier A); - extending it to small frames with 1–2 callee-saves still needs the frame - record moved to the bottom of the frame (fp-at-bottom layout) so a single - `stp x29,x30,[sp,#-N]!` covers both the sp decrement and the save — a - separable layout change, not unlocked by the frame-planning rework alone. - Biggest absolute payoff still open on call-heavy benches. +2. **Compact FP-frame prologue/epilogue. [DONE]** See binary-trees item 2. + The fp-at-bottom layout (`fp_at_bottom` tier) now folds the sp decrement and + fp/lr save into one pre-indexed `stp x29,x30,[sp,#-N]!` (and post-indexed + `ldp` on exit) for known-frame small frames with callee-saves and no outgoing + stack args. −2 fixed insns/call. Helps code-heavy benches; runtime-neutral on + allocation-bound ones like binary-trees. 3. **Hard-register copy coalescing for `IR_LOAD_IMM` sources.** See binary-trees item 3. The hint-propagation path covers `ldr` → call-arg diff --git a/doc/PERCALL.md b/doc/PERCALL.md @@ -2,10 +2,11 @@ The fixed cost cfree `-O1` pays around *every* function call — prologue, epilogue, and call-site argument setup — independent of the function body. On -call-heavy workloads (binary-trees: ~7.6M calls at depth 19) this dominates, -and it is the whole reason binary-trees still trails gcc -O0 (0.89×). The -function *bodies* are already at or ahead of gcc -O0 (cfree keeps recurring -values in callee-saves where gcc -O0 spills and reloads); see +call-heavy workloads (binary-trees: ~7.6M calls at depth 19) this dominates. The +prologue/epilogue fold described here (`fp_at_bottom`) is now implemented, taking +the common per-call overhead from 7 to the optimal 5 fixed insns. The function +*bodies* are already at or ahead of gcc -O0 (cfree keeps recurring values in +callee-saves where gcc -O0 spills and reloads); see [OPT_O1_PERF_TODO.md](OPT_O1_PERF_TODO.md) for the benchmark standings. Numbers below are from `/tmp/mc/binary-trees.cfree.o` vs `binary-trees.gcc.o` @@ -13,57 +14,57 @@ Numbers below are from `/tmp/mc/binary-trees.cfree.o` vs `binary-trees.gcc.o` ## Prologue / epilogue tiers -cfree picks one of three frame shapes per function. "Fixed insns" excludes the +cfree picks one of four frame shapes per function. "Fixed insns" excludes the final `ret` (always required) and counts entry + exit. -| tier | when | fixed insns | pre-indexed? | +| tier | when | fixed insns | folded? | | --- | --- | ---: | --- | | Tier A (`slim_prologue`) | no callee-saves, no alloca, no body slots, no outgoing stack | 3 (2 entry + 1 exit) | yes — optimal | -| `slim_small_frame` | ≥1 callee-save (or body slots), saved-pair offset ≤ 504 | **7 (4 entry + 3 exit)** | **no** | +| `fp_at_bottom` | ≥1 callee-save (or body slots), **no outgoing stack args**, frame_size ≤ 504 | **5 (3 entry + 2 exit)** | **yes — optimal** | +| `slim_small_frame` | as above but with outgoing stack args (out_stack > 0), saved-pair offset ≤ 504 | 7 (4 entry + 3 exit) | no | | fat | large frame, alloca, big saved-pair offset | 7+ | no | -Tier A (leaf-ish, no callee-saves) is already optimal. The gap is entirely in -`slim_small_frame`, which is the common case — any function that keeps a value -live across a call lands here. +Tier A (leaf-ish, no callee-saves) and `fp_at_bottom` are both optimal. +`fp_at_bottom` is the common case — any function that keeps a value live across a +call (but doesn't itself pass >8 register-class args to a callee) lands here. The +remaining gap is only `slim_small_frame` (out_stack > 0, comparatively rare). -### Current `slim_small_frame` (the four binary-trees functions) +### `fp_at_bottom` (the four binary-trees functions) — implemented -Entry (4 insns) — e.g. `NewTreeNode`, two callee-saves: -``` -sub sp, sp, #32 -stp x29, x30, [sp, #16] ; frame record mid-frame -add x29, sp, #16 ; fp points above the callee-save area -stp x20, x19, [x29, #-16] ; callee-saves below fp -``` -Exit (3 insns + ret): -``` -ldp x20, x19, [x29, #-16] -ldp x29, x30, [sp, #16] -add sp, sp, #32 -ret -``` +The frame record moves to the bottom of the frame (`fp = sp`), so the sp +adjustment folds into the pre/post-indexed `stp`/`ldp` and callee-saves stack +*above* the record at positive offsets. This matches what gcc -O0 emits. -### Optimal `slim_small_frame` (what gcc -O0 already emits) - -Entry (3 insns): +Entry (3 insns) — e.g. `NewTreeNode`, two callee-saves: ``` -stp x29, x30, [sp, #-32]! ; one insn: decrement sp AND save fp/lr +stp x29, x30, [sp, #-32]! ; one insn: decrement sp AND save fp/lr at the bottom mov x29, sp ; frame record at the bottom of the frame -stp x19, x20, [sp, #16] ; callee-saves above the record (str for one) +stp x20, x19, [x29, #16] ; callee-saves above the record (str for an odd one) ``` Exit (2 insns + ret): ``` -ldp x19, x20, [sp, #16] +ldp x20, x19, [x29, #16] ldp x29, x30, [sp], #32 ; one insn: restore fp/lr AND release sp ret ``` -**−2 fixed insns per call** (−1 entry, −1 exit). The blocker is the layout: -cfree puts the frame record above the callee-saves (`add x29, sp, #16`), so the -sp-decrement and the fp/lr save cannot fold into a single pre-indexed `stp`. -Moving the frame record to the bottom of the frame (fp-at-bottom, `mov x29, sp`) -unlocks the pre-indexed entry `stp …, [sp, #-N]!` and post-indexed exit -`ldp …, [sp], #N`. +**−2 fixed insns per call** (−1 entry, −1 exit) vs the old top-record layout. +Implemented in `src/arch/aa64/native.c`: the `fp_at_bottom` flag (set in +`aa_func_begin_known_frame`), the layout-aware `aa_fp_off_*` / `aa_cfa_off` +helpers, the prologue/epilogue branches in `aa_build_prologue_words` / +`aa_words_restore_frame`, and the positive-offset callee-save stores in +`aa_words_callee_saves`. Only the known-frame (-O1) path takes it — the +frame-size-dependent offsets require the frame to be final before the body. +The DWARF CFA becomes `fp + frame_size` (vs `fp + 16` for the top-record forms). + +`out_stack > 0` functions can't move the record to the absolute bottom (outgoing +args live there), so they keep the top-record `slim_small_frame` layout below. + +#### Old top-record `slim_small_frame` (still used for out_stack > 0) + +Entry (4 insns) / exit (3 insns + ret): `sub sp` + `stp [sp,#N-16]` + +`add x29,sp,#N-16` + callee-saves; reverse on exit. fp points mid-frame above the +callee-saves, which is what blocks the fold. (An aside: omitting the frame pointer entirely — packing a callee-save with lr, e.g. `stp x19, x30, [sp, #-N]!`, no `x29` at all — would save one more insn, but @@ -99,29 +100,31 @@ Two small constant adders beyond the prologue, both call-shaped: ## Per-function tally (fixed + warts vs gcc -O0) -| function | cfree insns | gcc -O0 insns | cfree overhead vs optimal | -| --- | ---: | ---: | --- | -| NewTreeNode | 14 | 16 | +2 prologue | -| ItemCheck | 20 | 21 | +2 prologue | -| BottomUpTree | 23 | 24 | +2 prologue (zero-temp fixed) | -| DeleteTree | 20 | 18 | +2 prologue, +1 branch | +cfree insns are the `fp_at_bottom` numbers (incl. `ret`); the "was" column is the +old top-record layout, each −2 from the prologue fold. + +| function | cfree insns | was | gcc -O0 insns | remaining overhead | +| --- | ---: | ---: | ---: | --- | +| NewTreeNode | 12 | 14 | 16 | — (ahead of gcc) | +| ItemCheck | 18 | 20 | 21 | — (ahead of gcc) | +| BottomUpTree | 21 | 23 | 24 | — (ahead of gcc) | +| DeleteTree | 16 | 18 | 18 | +1 branch (item 4) | -cfree already beats gcc -O0 on the malloc-heavy `NewTreeNode` (args stay in -registers vs gcc's 8 stack shuffles), ties `ItemCheck`, and — with the -zero-temp fix — now edges ahead on `BottomUpTree`. `DeleteTree` remains the only -loss, on the fixed prologue overhead plus the item-4 branch chain. +With the prologue fold all four are now at or ahead of gcc -O0. `DeleteTree`'s +only remaining wart is the item-4 redundant branch chain. -## Optimal target +## Optimal target — reached for the common case -Per-call fixed overhead for the common (`slim_small_frame`) case: +Per-call fixed overhead: -- **current:** 7 insns (4 entry + 3 exit) + ret -- **optimal:** 5 insns (3 entry + 2 exit) + ret → **−2/call** +- **`fp_at_bottom` (common case):** 5 insns (3 entry + 2 exit) + ret — optimal, + matching gcc -O0. Done. +- **`slim_small_frame` (out_stack > 0 only):** 7 insns (4 entry + 3 exit) + ret — + unchanged; the record can't move to the bottom when outgoing args occupy it. -Across binary-trees (~7.6M calls): item 2 alone removes ~15M instructions; -adding items 3 and 4 brings the total to ~30M. That is expected to flip -`BottomUpTree` and `DeleteTree` ahead of gcc -O0 and move the benchmark past the -1.10× bar. +Across binary-trees (~7.6M calls) the prologue fold removes ~15M instructions; +combined with the zero-temp fix (item 3) and the pending item-4 branch fix the +total reaches ~30M, flipping `BottomUpTree`/`DeleteTree` ahead of gcc -O0. ## Reproducing diff --git a/src/arch/aa64/native.c b/src/arch/aa64/native.c @@ -64,27 +64,47 @@ enum { /* ============================================================================ * AAPCS64 frame layout * - * fp anchors at the caller's saved-pair address; sp anchors at the bottom of - * the outgoing-arg area. Every fp- or sp-relative offset in this file is - * computed via one of the helpers below — no site should do bare arithmetic - * on AA_FP / AA_SP for addressing the frame. + * Two layouts. Every fp- or sp-relative offset in this file is computed via one + * of the aa_fp_off_ / aa_sp_off_ helpers below — no site does bare arithmetic on + * AA_FP / AA_SP, and no site outside those helpers branches on the layout. + * + * TOP-RECORD (default — single-pass -O0, fat frames, and out_stack>0 small + * frames). fp anchors at the caller's saved-pair address near the top; sp at the + * bottom of the outgoing-arg area. Offsets are frame-size-independent. * * high addr caller's stack frame * +------------------------------+ - * | incoming stack args | aa_fp_off_in_arg(i) + * | incoming stack args | aa_fp_off_in_arg(a,i) = 16+i * +------------------------------+ - * fp --> | saved x29 (prev fp) | aa_fp_off_saved_fp() - * | saved x30 (prev lr) | aa_fp_off_saved_lr() + * fp --> | saved x29 (prev fp) | aa_fp_off_saved_fp() = 0 + * | saved x30 (prev lr) | aa_fp_off_saved_lr() = 8 * +------------------------------+ - * | frame slots | aa_fp_off_slot(s->off) + * | frame slots | aa_fp_off_slot(a,off) = -off * | (callee-saves + locals | * | + spills + sret/variadic) | * +------------------------------+ * | outgoing args | aa_sp_off_out_arg(i) * sp --> +------------------------------+ - * low addr + * low addr CFA = fp + 16 + * + * BOTTOM-RECORD (fp_at_bottom — known-frame -O1 small frames with + * callee-saves/locals and out_stack==0). The record moves to the bottom so the + * sp adjustment folds into a pre/post-indexed stp/ldp (−2 insns/call). fp = sp; + * slots/callee-saves stack ABOVE the record at positive offsets. Offsets depend + * on frame_size (hence known-frame only, where the frame is final before body). + * + * high addr caller's stack frame + * +------------------------------+ + * | incoming stack args | aa_fp_off_in_arg(a,i) = N+i + * +------------------------------+ <- caller's sp = CFA = fp + N + * | frame slots (+ align pad) | aa_fp_off_slot(a,off)=N-off + * | (callee-saves + locals …) | (in [16, N), above record) + * +------------------------------+ + * fp = sp --> | saved x29 (prev fp) | aa_fp_off_saved_fp() = 0 + * | saved x30 (prev lr) | aa_fp_off_saved_lr() = 8 + * low addr +------------------------------+ (N = frame_size; out_stack==0) * - * frame_size = align16(AA_FRAME_SAVE_SIZE + slot_bytes + out_stack). + * frame_size (N) = align16(AA_FRAME_SAVE_SIZE + slot_bytes + out_stack). * Tail calls write outgoing args into the caller's incoming-args window — * physically the same address, expressed via aa_fp_off_tail_out_arg. * ========================================================================== */ @@ -107,19 +127,13 @@ static inline AAFrameLayout aa_build_layout(u32 slot_bytes, u32 out_stack) { return L; } -/* FP-relative byte offsets. */ +/* FP-relative byte offsets. The saved-pair is at [fp]/[fp+8] in both the + * top-record and bottom-record (fp_at_bottom) layouts, so these two are + * layout-independent. The frame-size-dependent helpers — aa_fp_off_in_arg, + * aa_fp_off_slot, aa_fp_off_tail_out_arg — branch on a->fp_at_bottom and are + * defined after AANativeTarget (see aa_fp_off_* below aa_of). */ static inline i32 aa_fp_off_saved_fp(void) { return 0; } static inline i32 aa_fp_off_saved_lr(void) { return 8; } -static inline i32 aa_fp_off_in_arg(u32 byte_off) { - return (i32)(AA_FRAME_SAVE_SIZE + byte_off); -} -static inline i32 aa_fp_off_slot(u32 slot_off) { return -(i32)slot_off; } -/* Outgoing stack args on a tail call land in the caller's incoming-arg - * window — same physical address the tail-callee will read via - * aa_fp_off_in_arg. Same helper, distinct name for site-side intent. */ -static inline i32 aa_fp_off_tail_out_arg(u32 byte_off) { - return aa_fp_off_in_arg(byte_off); -} /* SP-relative byte offsets. */ static inline i32 aa_sp_off_out_arg(u32 byte_off) { return (i32)byte_off; } @@ -174,6 +188,11 @@ typedef struct AANativeTarget { u32 slots_cap; u32 cum_off; u32 max_outgoing; + /* Final frame size, set once in aa_func_begin_known_frame when fp_at_bottom is + * decided. Read by the fp-relative offset helpers in the bottom-record layout + * (where slot/incoming-arg offsets depend on frame_size); meaningless and + * unread on the single-pass path, which never sets fp_at_bottom. */ + u32 frame_size_final; u32 incoming_stack_size; u32 next_param_int; u32 next_param_fp; @@ -207,9 +226,21 @@ typedef struct AANativeTarget { * materialization in the prologue (stp x29,x30,[sp,#N-16] instead) and * the matching `add x10, fp, #0` in the epilogue (ldp x29,x30,[sp,#N-16] * + add sp,sp,#N). Mutually exclusive with `slim_prologue` (Tier A wins - * when both would apply). Applies to almost every function with a small - * frame, including those with callee-saves and locals. */ + * when both would apply) and `fp_at_bottom` (which wins for out_stack==0). + * Now only reached for small frames with outgoing stack args (out_stack>0), + * where the record cannot move to the bottom. Keeps the top-record layout. */ u8 slim_small_frame; + /* Set by aa_func_begin_known_frame for a small frame with callee-saves/locals + * and no outgoing stack args: the frame record moves to the bottom of the + * frame (fp = sp, `mov x29,sp`) so the sp adjustment folds into a pre-indexed + * `stp x29,x30,[sp,#-N]!` entry and post-indexed `ldp x29,x30,[sp],#N` exit + * (−2 insns/call vs slim_small_frame). Slots and callee-saves stack ABOVE the + * record at positive fp offsets; incoming args sit at fp+frame_size; CFA = + * fp+frame_size. The frame-size-dependent offsets are the reason this is only + * available on the known-frame path (frame final before the body). Mutually + * exclusive with slim_prologue (Tier A) and slim_small_frame; gated on + * out_stack==0 && !has_alloca && frame_size <= 504. */ + u8 fp_at_bottom; /* Set by aa_func_begin_known_frame (optimizer path: the full frame is known * up front, so the prologue, allocas, and tail epilogues are emitted final * with no back-patching). Cleared by aa_func_begin (NativeDirectTarget @@ -231,6 +262,36 @@ typedef struct AANativeTarget { static AANativeTarget* aa_of(NativeTarget* t) { return (AANativeTarget*)t; } +/* Layout-aware FP-relative offsets. Every frame use site goes through these; + * the fp_at_bottom test lives here and nowhere else. + * + * top-record (default): record near the top, fp anchored at the saved pair. + * incoming args at fp+16+b, slots below fp at -off. CFA = fp+16. + * bottom-record (fp_at_bottom): record at the bottom, fp = sp. + * incoming args at fp+frame_size+b, slots above the record at + * frame_size-off (in [16, frame_size), never overlapping the 16-byte + * record since frame_size = align16(16+cum_off) >= 16+cum_off). + * CFA = fp+frame_size. */ +static inline i32 aa_fp_off_in_arg(const AANativeTarget* a, u32 byte_off) { + u32 base = a->fp_at_bottom ? a->frame_size_final : AA_FRAME_SAVE_SIZE; + return (i32)(base + byte_off); +} +static inline i32 aa_fp_off_slot(const AANativeTarget* a, u32 slot_off) { + return a->fp_at_bottom ? (i32)a->frame_size_final - (i32)slot_off + : -(i32)slot_off; +} +/* Outgoing stack args on a tail call land in the caller's incoming-arg window — + * the same physical address the tail-callee will read via aa_fp_off_in_arg. + * Same helper, distinct name for site-side intent. */ +static inline i32 aa_fp_off_tail_out_arg(const AANativeTarget* a, u32 byte_off) { + return aa_fp_off_in_arg(a, byte_off); +} +/* CFA = caller's sp, expressed as an fp-relative offset (fp+16 top-record, + * fp+frame_size bottom-record). Named so the CFI emit site stays layout-blind. */ +static inline i32 aa_cfa_off(const AANativeTarget* a) { + return a->fp_at_bottom ? (i32)a->frame_size_final : (i32)AA_FRAME_SAVE_SIZE; +} + static void aa_panic(AANativeTarget* a, const char* msg) { compiler_panic(a->base.c, a->loc, "aarch64 native target: %s", msg); } @@ -643,7 +704,7 @@ static void aa_addr_base(AANativeTarget* a, NativeAddr addr, u32* base_out, case NATIVE_ADDR_BASE_FRAME: { AANativeSlot* s = aa_slot(a, addr.base.frame); *base_out = AA_FP; - *off_out = aa_fp_off_slot(s->off) + addr.offset; + *off_out = aa_fp_off_slot(a, s->off) + addr.offset; return; } case NATIVE_ADDR_BASE_GLOBAL: { @@ -917,7 +978,7 @@ static u32 aa_ldst_q_simm9(int load, u32 rt, u32 rn, i32 byte_off) { static void aa_emit_q_frame(AANativeTarget* a, int load, u32 qreg, NativeFrameSlot slot, u32 offset) { AANativeSlot* s = aa_slot(a, slot); - i32 off = aa_fp_off_slot(s->off) + (i32)offset; + i32 off = aa_fp_off_slot(a, s->off) + (i32)offset; MCEmitter* mc = a->base.mc; if (off >= 0 && ((u32)off & 15u) == 0 && ((u32)off >> 4) <= 0xfffu) { aa_emit32(mc, aa_ldst_q_uimm(load, qreg, AA_FP, (u32)off)); @@ -1007,6 +1068,8 @@ static void aa_func_begin_common(NativeTarget* t, const CGFuncDesc* fd) { a->ncallee_saves = 0; a->slim_prologue = 0; a->slim_small_frame = 0; + a->fp_at_bottom = 0; + a->frame_size_final = 0; a->known_frame = 0; a->has_alloca = 0; a->frame_final = 0; @@ -1205,6 +1268,16 @@ static void aa_words_restore_frame(AANativeTarget* a, u32* words, u32 cap, words[(*n)++] = aa64_ldp64_post(AA_FP, AA_LR, AA_SP, 2); return; } + if (a->fp_at_bottom) { + /* Bottom-record fold: `ldp x29,x30,[sp],#N` reloads the pair from the + * bottom AND releases the whole frame in one insn. Callee-saves were + * already restored by aa_emit_callee_restores. -1 insn vs slim_small_frame + * (which needs a separate `add sp`). N <= 504 holds the post-index imm. */ + if (*n + 1u > cap) aa_panic(a, "instruction patch too small"); + words[(*n)++] = + aa64_ldp64_post(AA_FP, AA_LR, AA_SP, (i32)(L->frame_size / 8u)); + return; + } if (a->slim_small_frame) { /* `ldp x29,x30,[sp,#saved_pair] ; add sp,sp,#frame_size` — load through * sp avoids the fat path's `add x10, fp, #0` scratch, and the subsequent @@ -1237,14 +1310,16 @@ static void aa_words_callee_saves(AANativeTarget* a, int save, u32* words, u32 cap, u32* n) { for (u32 i = 0; i < a->ncallee_saves;) { const AACalleeSave* cs = &a->callee_saves[i]; - i32 off = aa_fp_off_slot(aa_slot(a, cs->slot)->off); - if (off < -256 || off > 255) - aa_panic(a, "callee-save offset out of prologue range"); + i32 off = aa_fp_off_slot(a, aa_slot(a, cs->slot)->off); if (i + 1u < a->ncallee_saves && cs->cls == (u8)NATIVE_REG_INT && a->callee_saves[i + 1u].cls == (u8)NATIVE_REG_INT) { const AACalleeSave* cs2 = &a->callee_saves[i + 1u]; - i32 off2 = aa_fp_off_slot(aa_slot(a, cs2->slot)->off); - /* off2 = off - 8 (lower address; reserve allocates downward). */ + i32 off2 = aa_fp_off_slot(a, aa_slot(a, cs2->slot)->off); + /* cs2 is reserved after cs (larger slot.off), so it is the lower address + * in both layouts (off2 = off - 8): stp's Rt = cs2, Rt2 = cs, base off2. + * stp/ldp's signed-7-bit scaled immediate reaches ±504. */ + if (off2 < -512 || off2 > 504) + aa_panic(a, "callee-save pair offset out of prologue range"); if (*n >= cap) aa_panic(a, "prologue too large"); words[(*n)++] = save ? aa64_stp64_soff(cs2->reg, cs->reg, AA_FP, off2 / 8) @@ -1253,8 +1328,20 @@ static void aa_words_callee_saves(AANativeTarget* a, int save, u32* words, } else { u32 v = cs->cls == (u8)NATIVE_REG_FP ? 1u : 0u; if (*n >= cap) aa_panic(a, "prologue too large"); - words[(*n)++] = save ? aa_stur_v(3, v, cs->reg, AA_FP, off) - : aa_ldur_v(3, v, cs->reg, AA_FP, off); + if (a->fp_at_bottom) { + /* Positive, 8-aligned offset above the record (up to frame_size-8 ≤ + * 496): the unscaled stur (±256) can't reach it, so use the scaled + * unsigned-imm str/ldr. */ + if (off < 0 || (u32)off > 0x7ff8u) + aa_panic(a, "callee-save offset out of prologue range"); + words[(*n)++] = save ? aa_str_uimm_v(3, v, cs->reg, AA_FP, (u32)off) + : aa_ldr_uimm_v(3, v, cs->reg, AA_FP, (u32)off); + } else { + if (off < -256 || off > 255) + aa_panic(a, "callee-save offset out of prologue range"); + words[(*n)++] = save ? aa_stur_v(3, v, cs->reg, AA_FP, off) + : aa_ldur_v(3, v, cs->reg, AA_FP, off); + } i += 1u; } } @@ -1265,11 +1352,10 @@ static void aa_words_callee_saves(AANativeTarget* a, int save, u32* words, * a fixed worst-case region, then patches it here) and the optimizer path * (aa_func_begin_known_frame emits exactly these words up front). * - * All three variants establish the same post-prologue state defined by L: - * sp = caller's sp - L->frame_size - * fp = sp + aa_sp_off_saved_pair(L) (saved-pair address) - * saved x29/x30 at [fp], [fp+8] - * callee-saves at [fp - s->off] for each. */ + * All variants establish a post-prologue state defined by L: saved x29/x30 at + * [fp]/[fp+8], callee-saves at aa_fp_off_slot of each. The top-record variants + * leave fp = sp + aa_sp_off_saved_pair(L) (saved-pair near the top); the + * bottom-record variant leaves fp = sp (saved-pair at the bottom). */ static u32 aa_build_prologue_words(AANativeTarget* a, const AAFrameLayout* L, u32* words, u32 cap) { u32 n = 0; @@ -1283,6 +1369,17 @@ static u32 aa_build_prologue_words(AANativeTarget* a, const AAFrameLayout* L, words[n++] = aa64_add_imm(1, AA_FP, AA_SP, 0, 0); return n; } + if (a->fp_at_bottom) { + /* Bottom-record fold: `stp x29,x30,[sp,#-N]!` decrements sp by the whole + * frame AND saves the pair at the new bottom in one insn; `mov x29,sp` + * (add #0) anchors fp there. Callee-saves then stack above the record at + * positive offsets. -2 insns/call vs the top-record slim_small_frame. */ + if (n + 2u > cap) aa_panic(a, "prologue too large"); + words[n++] = aa64_stp64_pre(AA_FP, AA_LR, AA_SP, -(i32)(L->frame_size / 8u)); + words[n++] = aa64_add_imm(1, AA_FP, AA_SP, 0, 0); + aa_words_callee_saves(a, 1, words, cap, &n); + return n; + } aa_words_sub_sp_frame(a, words, cap, &n, L->frame_size); if (a->slim_small_frame) { /* `stp x29, x30, [sp, #saved_pair_off]` — skip the `add x17, sp, #...` @@ -1419,12 +1516,14 @@ static void aa_func_end(NativeTarget* t) { aa_apply_patches(a, &L); } if (mc->cfi_set_next_pc_offset && mc->cfi_def_cfa && mc->cfi_offset) { + i32 cfa = aa_cfa_off(a); mc->cfi_set_next_pc_offset(mc, prologue_region * 4u); - /* CFA = caller's sp = fp + AA_FRAME_SAVE_SIZE. saved fp/lr at fp/fp+8 - * (= CFA-16, CFA-8). Unified across all three prologue layouts. */ - mc->cfi_def_cfa(mc, AA_FP, AA_FRAME_SAVE_SIZE); - mc->cfi_offset(mc, AA_FP, aa_fp_off_saved_fp() - (i32)AA_FRAME_SAVE_SIZE); - mc->cfi_offset(mc, AA_LR, aa_fp_off_saved_lr() - (i32)AA_FRAME_SAVE_SIZE); + /* CFA = caller's sp, an fp-relative offset that depends on the layout: + * fp+16 (top-record) or fp+frame_size (bottom-record). saved fp/lr live at + * [fp]/[fp+8] in both, hence at CFA-cfa / CFA-cfa+8. */ + mc->cfi_def_cfa(mc, AA_FP, cfa); + mc->cfi_offset(mc, AA_FP, aa_fp_off_saved_fp() - cfa); + mc->cfi_offset(mc, AA_LR, aa_fp_off_saved_lr() - cfa); } obj_symbol_define(t->obj, a->func->sym, a->func->text_section_id, a->func_start, mc->pos(mc) - a->func_start); @@ -1508,16 +1607,25 @@ static void aa_func_begin_known_frame(NativeTarget* t, const CGFuncDesc* fd, a->max_outgoing = frame->max_outgoing; } /* Frame is final: slot_bytes (cum_off) and out_stack (max_outgoing) are both - * known, so the prologue immediates and slim-form choice are settled here. */ + * known, so the prologue immediates and slim-form choice are settled here. + * frame_size_final must be set before aa_build_prologue_words / entry saves, + * since the bottom-record offset helpers read it. */ L = aa_build_layout(a->cum_off, a->max_outgoing); + a->frame_size_final = L.frame_size; /* Slim Tier A: no callee-saves, no alloca, no body slots, no outgoing stack - * args. slim_small_frame: skip the x17/x10 scratch when the saved-pair offset - * fits stp's signed 7-bit scaled immediate. (See aa_func_end for the - * single-pass path, which never takes the slim form.) */ + * args — the whole frame is the 16-byte record. fp_at_bottom: a small frame + * with callee-saves/locals and no outgoing stack args; the record moves to + * the bottom (fp = sp) so sp adjustment folds into the pre/post-indexed + * stp/ldp (frame_size <= 504 keeps the post-index ldp imm in range). Otherwise + * slim_small_frame keeps the top-record layout but skips the x17/x10 scratch + * (out_stack>0 small frames land here). (See aa_func_end for the single-pass + * path, which never takes any slim form.) */ a->slim_prologue = a->ncallee_saves == 0 && !a->has_alloca && L.slot_bytes == 0 && L.out_stack == 0; - a->slim_small_frame = !a->slim_prologue && !a->has_alloca && - aa_sp_off_saved_pair(&L) <= 504u; + a->fp_at_bottom = !a->slim_prologue && !a->has_alloca && L.out_stack == 0 && + L.frame_size <= 504u; + a->slim_small_frame = !a->slim_prologue && !a->fp_at_bottom && + !a->has_alloca && aa_sp_off_saved_pair(&L) <= 504u; n = aa_build_prologue_words(a, &L, words, AA_PROLOGUE_WORDS); for (u32 i = 0; i < n; ++i) aa_emit32(t->mc, words[i]); a->minimal_prologue_words = n; @@ -1637,7 +1745,7 @@ static void aa_load_addr(NativeTarget* t, NativeLoc dst, NativeAddr addr) { switch ((NativeAddrBaseKind)addr.base_kind) { case NATIVE_ADDR_BASE_FRAME: { AANativeSlot* s = aa_slot(a, addr.base.frame); - aa_emit_add_imm(a, rd, AA_FP, aa_fp_off_slot(s->off) + addr.offset); + aa_emit_add_imm(a, rd, AA_FP, aa_fp_off_slot(a, s->off) + addr.offset); aa_apply_index(a, rd, &addr); return; } @@ -2251,7 +2359,7 @@ static void aa_store_outgoing_part(NativeTarget* t, int tail_call, * (= [fp + 16 + off], same address the tail-callee will read via * aa_fp_off_in_arg). Non-tail calls write to the sp-anchored outgoing * area at the bottom of the caller's frame. */ - addr.offset = tail_call ? aa_fp_off_tail_out_arg(stack_off) + addr.offset = tail_call ? aa_fp_off_tail_out_arg(aa_of(t), stack_off) : aa_sp_off_out_arg(stack_off); aa_emit_mem(aa_of(t), 0, src, addr, mem); } @@ -3505,7 +3613,7 @@ static void aa_bind_native_param(NativeTarget* t, const CGParamDesc* p, memset(&saddr, 0, sizeof saddr); saddr.base_kind = NATIVE_ADDR_BASE_REG; saddr.base.reg = AA_FP; - saddr.offset = aa_fp_off_in_arg(a->next_param_stack); + saddr.offset = aa_fp_off_in_arg(a, a->next_param_stack); aa_emit_mem(a, 1, src, saddr, aa_mem_for_type(t, p->type, 8)); a->next_param_stack += 8u; } @@ -3548,7 +3656,7 @@ static void aa_bind_native_param(NativeTarget* t, const CGParamDesc* p, saddr.base_kind = NATIVE_ADDR_BASE_REG; saddr.base.reg = AA_FP; saddr.base_type = p->type; - saddr.offset = aa_fp_off_in_arg(a->next_param_stack); + saddr.offset = aa_fp_off_in_arg(a, a->next_param_stack); aa_emit_mem(a, 1, src, saddr, aa_mem_for_type(t, p->type, part->size)); a->next_param_stack += aa_part_stack_size(part); } @@ -3712,7 +3820,7 @@ static void aa_va_start_core(AANativeTarget* a, NativeAddr ap) { /* `va_list = &<first vararg>`. Variadic stack args follow the fixed * incoming params in the same caller window, so the offset is the * current next_param_stack cursor. */ - aa_emit_add_imm(a, AA_TMP0, AA_FP, aa_fp_off_in_arg(a->next_param_stack)); + aa_emit_add_imm(a, AA_TMP0, AA_FP, aa_fp_off_in_arg(a, a->next_param_stack)); aa_emit_mem(a, 0, ptr, ap, aa_mem_for_type(t, ptr.type, 8)); return; } @@ -3731,7 +3839,7 @@ static void aa_va_start_core(AANativeTarget* a, NativeAddr ap) { /* __stack points at the incoming stack args, which sit above the saved * fp/lr pair — the same address bind_param uses (aa_fp_off_in_arg), not the * raw next_param_stack cursor. */ - aa_emit_add_imm(a, AA_TMP0, AA_FP, aa_fp_off_in_arg(a->next_param_stack)); + aa_emit_add_imm(a, AA_TMP0, AA_FP, aa_fp_off_in_arg(a, a->next_param_stack)); aa_emit_mem(a, 0, ptr, aa_reg_addr(ptr.type, base, (i32)vai.stack_offset), ptr_mem); aa_emit_add_imm(a, AA_TMP0, AA_FP, diff --git a/test/toy/cases/140_fp_callee_save_bottom_frame.expected b/test/toy/cases/140_fp_callee_save_bottom_frame.expected @@ -0,0 +1 @@ +964 diff --git a/test/toy/cases/140_fp_callee_save_bottom_frame.toy b/test/toy/cases/140_fp_callee_save_bottom_frame.toy @@ -0,0 +1,38 @@ +// Regression: aarch64 -O1 bottom-record (fp_at_bottom) frame layout must save +// and restore FP callee-saves (v8-v15) correctly. FP saves are always +// single-register, so they take the scaled `str d, [fp, #off]` / `ldr` path — +// the unscaled stur (+-256) the top-record layout uses cannot reach the large +// positive offsets the bottom-record layout produces. +// +// `acc` (an f64) is live across the call to `bump`, forcing an FP callee-save. +// The 40-element local array bloats the frame past 256 bytes, so the FP +// callee-save (reserved first => largest offset = frame_size-8) lands well past +// 255 and exercises the scaled d-register store/load. The body sums everything +// so nothing is dead-code-eliminated; the result is checked against .expected. + +fn bump(x: i64): i64 { return x + 100; } + +fn big_fp(seed: i64): i64 { + var buf: [40]i64; + var i: i64 = 0; + while i < 40 { + buf[i] = seed + i; + i = i + 1; + } + var acc: f64 = (seed as f64) + 0.5; // f64 live across the call below + let y: i64 = bump(seed); // call: acc must survive in d8..d15 + var s: i64 = 0; + i = 0; + while i < 40 { + s = s + buf[i]; + i = i + 1; + } + // seed=2: acc=2.5 -> 2; y=102; sum(2..41)=2*40+780=860; total=2+102+860=964. + return (acc as i64) + y + s; +} + +fn __user_main(): i64 { + return big_fp(2); +} + +fn main(): i32 { return __user_main() as i32; }