commit 9e004ebf1826ac067f5b6ab8c2790eb16ad52c6c
parent ed078ac80320bb5d7b8e4aa8c4e780d0aec8fdb8
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Thu, 28 May 2026 18:48:40 -0700
opt/aa64: fold O1 prologue via fp-at-bottom frame layout
Add the `fp_at_bottom` frame tier for known-frame (-O1) functions with
callee-saves/locals and no outgoing stack args: move the frame record to
the bottom of the frame (fp = sp) so the sp adjustment folds into a
pre-indexed `stp x29,x30,[sp,#-N]!` entry and post-indexed
`ldp x29,x30,[sp],#N` exit. Callee-saves stack above the record at
positive offsets (scaled `str`/`stp`, since the unscaled stur can't reach
offsets past 255). This is -2 fixed insns/call vs the top-record
slim_small_frame, matching gcc -O0's prologue shape.
The layout is frame-size-dependent (incoming-arg/slot offsets, CFA), so
it's only taken on the known-frame path where the frame is final before
the body. out_stack>0 frames keep the top-record slim_small_frame (the
record can't share the bottom with outgoing args). All layout decisions
stay centralized in the aa_fp_off_* / aa_cfa_off helpers — no use site
branches on the layout.
DWARF CFA becomes fp + frame_size (vs fp + 16 for top-record); verified
via .eh_frame. On binary-trees the four hot functions drop below gcc -O0
(12/18/21/16 vs 16/21/24/18 insns); cc runtime is unchanged (workload is
malloc-bound) but the JIT path reaches parity with gcc -O0. See
doc/PERCALL.md and doc/OPT_O1_PERF_TODO.md.
Regression test 140_fp_callee_save_bottom_frame exercises an f64
callee-save (v8-v15) at a >255 offset (the scaled d-register path).
Diffstat:
5 files changed, 309 insertions(+), 157 deletions(-)
diff --git a/doc/OPT_O1_PERF_TODO.md b/doc/OPT_O1_PERF_TODO.md
@@ -27,7 +27,7 @@ the cached baseline in `scripts/opt_bench_baseline.csv`; regenerate with
| bench | cfree -O1 | gcc -O0 | vs gcc-O0 | mir -O1 | vs mir-O1 | behind |
| --- | ---: | ---: | ---: | ---: | ---: | --- |
-| binary-trees | 2973 | 2647 | **0.89×** (slower) | n/a¹ | — | gcc |
+| binary-trees | 2969² | 2647 | **0.89×** (slower) | n/a¹ | — | gcc (cc path) |
| lists | 4843 | 8868 | 1.83× ✓ | 4997 | 1.03× | mir |
| hash2 | 4988 | 7481 | 1.50× ✓ | 3863 | **0.77×** | mir |
| sieve | 5148 | 5077 | 0.99× (~tied) | 4028 | **0.78×** | gcc (~tied), mir |
@@ -36,17 +36,32 @@ the cached baseline in `scripts/opt_bench_baseline.csv`; regenerate with
¹ mir-c2m fails to compile `binary-trees`, so only the gcc comparison applies.
+² Refreshed 2026-05-28 after the `fp_at_bottom` prologue fold (item 2 below).
+The fold landed (−2 fixed insns/call, codegen now byte-for-byte the same
+prologue/epilogue shape as gcc -O0) but the **cc-linked runtime is unchanged**
+(2973 → 2969): binary-trees is malloc/free-bound, so per-call ALU savings are
+noise. The **JIT path is at parity** — `cfree-run -O1` = 2617 ms (~tied with gcc
+2647), vs the cc binary's 2969. That ~350 ms cc-only deficit (identical codegen)
+points at PLT/GOT indirection for `malloc`/`free` — ~15M dynamic-symbol calls
+routed through stubs — as the real remaining cost on the cc path, not codegen.
+
## Per-benchmark notes
-### binary-trees — slower than unoptimized gcc (highest priority)
-The only case where cfree `-O1` is *slower than gcc -O0* (0.89×, up from 0.84×
-after the known-frame prologue landed — item 1 below). Workload is
-recursive tree build/walk: four tiny functions (`NewTreeNode`, `ItemCheck`,
-`BottomUpTree`, `DeleteTree`) called ~7.6M times at depth=19, plus a
-`malloc`/`free` per node. The **body** of each function is fine — cfree -O1
-keeps the recurring pointer in a callee-save (x19) where gcc -O0 spills and
-reloads it three times per call. The gap is entirely in **fixed per-call
-overhead** — every byte of which is multiplied by 7.6M.
+### binary-trees — cc path slower than unoptimized gcc; JIT at parity
+cfree `-O1` (cc) is 0.89× gcc -O0 — but this is **no longer a codegen gap**.
+After the `fp_at_bottom` prologue fold (item 2, DONE) the four hot functions
+(`NewTreeNode` 12, `ItemCheck` 18, `BottomUpTree` 21, `DeleteTree` 16 insns) are
+each *smaller* than gcc -O0 (16/21/24/18) and share gcc's exact
+prologue/epilogue shape, yet cc runtime didn't move (2973 → 2969). The workload
+is malloc/free-bound: four tiny functions called ~7.6M times at depth=19 with a
+`malloc`/`free` per node, so the ~15M instructions removed are dwarfed by
+allocator cost. Evidence it's *not* codegen: the JIT path (`cfree-run -O1`,
+identical codegen, no PLT) is ~tied with gcc (2617 vs 2647). The remaining cc
+deficit is the dynamic-call path to `malloc`/`free` through PLT/GOT stubs — the
+next thing to chase here, if anything (e.g. `-fno-plt`-style direct calls, or
+the linker resolving intra-image calls). The **body** quality was never the
+problem: cfree keeps the recurring pointer in a callee-save (x19) where gcc -O0
+spills and reloads it three times per call.
Open items, in priority order (most recent disasm in
`/tmp/mc/binary-trees.cfree.o`):
@@ -62,22 +77,17 @@ Open items, in priority order (most recent disasm in
directly — no filler, no patch. Functions now fall straight from the prologue
into the body. **-1 to -2 insns at entry, every function.**
-2. **Prologue compaction: 4-insn → 2-insn pre-indexed.** Half-done. cfree
- today emits:
- ```
- sub sp, sp, #N
- stp x29, x30, [sp, #M]
- add x29, sp, #M
- stp x20, x19, [x29, #-K] ; or `stur x19, ...` when only x19 live
- ```
- `bcca0bd` added `slim_prologue` in `src/arch/aa64/native.c`, which uses
- the pre-indexed `stp x29, x30, [sp, #-N]!` form — but only when
- `ncallee_saves == 0`. The common path (one or more callee-saves) still
- goes through `slim_small_frame`, which doesn't pre-index. Extend
- `slim_small_frame` to use `aa64_stp64_pre` for the FP-save when the frame
- size is known in advance, then emit the callee-save spills as separate
- `stur`s afterward. Saves 1 insn at entry + 1 at every exit. **~2 insns ×
- 7.6M calls.**
+2. **Prologue compaction: 4-insn → 2-insn pre-indexed. [DONE]** Implemented as
+ the `fp_at_bottom` tier in `src/arch/aa64/native.c`: a known-frame (-O1) small
+ frame with callee-saves and no outgoing stack args moves the frame record to
+ the bottom (fp = sp), folding the sp adjustment into the pre-indexed
+ `stp x29,x30,[sp,#-N]!` entry and post-indexed `ldp x29,x30,[sp],#N` exit;
+ callee-saves stack above the record at positive offsets (scaled `str`/`stp`).
+ Entry 4→3, exit 3→2 insns (−2/call), matching gcc -O0's shape. CFA becomes
+ `fp + frame_size`. See `doc/PERCALL.md`. Removed ~15M instructions at
+ depth=19 — but binary-trees runtime is malloc-bound, so the cc-path runtime
+ was unchanged (see the section intro). The win is real for code size and for
+ call-heavy, *non*-allocation-bound workloads.
3. **Zero materialized through a temp in `BottomUpTree` leaf path.**
`NewTreeNode(NULL, NULL)` still emits:
@@ -95,22 +105,15 @@ Open items, in priority order (most recent disasm in
to also fire when the source op is `IR_LOAD_IMM`. **+2 insns × 524k leaf
calls.**
-4. **Trailing `b A; A: b B` pair in `DeleteTree`'s if/else merge.**
- ```
- c9c: mov x0, x19
- ca0: bl _free
- ca4: b 0xcac <-- jumps directly to the epilogue label
- ca8: b 0xc9c <-- unreachable; left over from else side
- cac: ldur x19, [x29, #-0x8]
- ```
- Classic jump-thread target. `cleanup_layout_fallthrough_branches` in
- `pass_jump.c` doesn't pick up this shape (two consecutive `b`s where
- the second is unreachable). +1 insn per `DeleteTree` invocation
- (7.6M calls).
+4. **Trailing `b A; A: b B` pair in `DeleteTree`'s if/else merge. [GONE]** No
+ longer present as of the 2026-05-28 disasm: `DeleteTree`'s `b.eq` now targets
+ the epilogue directly, one clean branch, 16 insns total (vs gcc's 18). Either
+ a `pass_jump.c` change or the layout shift resolved it; re-add to this list
+ only if it reappears.
-Together these would save 4–6 insns per call, ~30–50M instructions removed
-at depth=19. Body quality is already on par with gcc-O0; this is all
-fixed per-call overhead.
+These items are all resolved or no-impact for binary-trees: the four functions
+are now smaller than gcc -O0 (12/18/21/16 vs 16/21/24/18). The remaining
+cc-path runtime gap is allocator/PLT cost, not codegen — see the section intro.
### mandelbrot — 0.92× vs mir (close to the bar)
Inner loop is FP-heavy (`Tr*Tr + Ti*Ti < 4.0` Mandelbrot escape test +
@@ -154,13 +157,12 @@ are the most concrete tests for whether each is complete.
frame up front, so the prologue is emitted final in its slim form rather than
fat-then-patched). Affected every cfree-compiled function.
-2. **Compact FP-frame prologue/epilogue.** See binary-trees item 2. The
- 2-insn pre-indexed form is wired in for the no-callee-save case (Tier A);
- extending it to small frames with 1–2 callee-saves still needs the frame
- record moved to the bottom of the frame (fp-at-bottom layout) so a single
- `stp x29,x30,[sp,#-N]!` covers both the sp decrement and the save — a
- separable layout change, not unlocked by the frame-planning rework alone.
- Biggest absolute payoff still open on call-heavy benches.
+2. **Compact FP-frame prologue/epilogue. [DONE]** See binary-trees item 2.
+ The fp-at-bottom layout (`fp_at_bottom` tier) now folds the sp decrement and
+ fp/lr save into one pre-indexed `stp x29,x30,[sp,#-N]!` (and post-indexed
+ `ldp` on exit) for known-frame small frames with callee-saves and no outgoing
+ stack args. −2 fixed insns/call. Helps code-heavy benches; runtime-neutral on
+ allocation-bound ones like binary-trees.
3. **Hard-register copy coalescing for `IR_LOAD_IMM` sources.** See
binary-trees item 3. The hint-propagation path covers `ldr` → call-arg
diff --git a/doc/PERCALL.md b/doc/PERCALL.md
@@ -2,10 +2,11 @@
The fixed cost cfree `-O1` pays around *every* function call — prologue,
epilogue, and call-site argument setup — independent of the function body. On
-call-heavy workloads (binary-trees: ~7.6M calls at depth 19) this dominates,
-and it is the whole reason binary-trees still trails gcc -O0 (0.89×). The
-function *bodies* are already at or ahead of gcc -O0 (cfree keeps recurring
-values in callee-saves where gcc -O0 spills and reloads); see
+call-heavy workloads (binary-trees: ~7.6M calls at depth 19) this dominates. The
+prologue/epilogue fold described here (`fp_at_bottom`) is now implemented, taking
+the common per-call overhead from 7 to the optimal 5 fixed insns. The function
+*bodies* are already at or ahead of gcc -O0 (cfree keeps recurring values in
+callee-saves where gcc -O0 spills and reloads); see
[OPT_O1_PERF_TODO.md](OPT_O1_PERF_TODO.md) for the benchmark standings.
Numbers below are from `/tmp/mc/binary-trees.cfree.o` vs `binary-trees.gcc.o`
@@ -13,57 +14,57 @@ Numbers below are from `/tmp/mc/binary-trees.cfree.o` vs `binary-trees.gcc.o`
## Prologue / epilogue tiers
-cfree picks one of three frame shapes per function. "Fixed insns" excludes the
+cfree picks one of four frame shapes per function. "Fixed insns" excludes the
final `ret` (always required) and counts entry + exit.
-| tier | when | fixed insns | pre-indexed? |
+| tier | when | fixed insns | folded? |
| --- | --- | ---: | --- |
| Tier A (`slim_prologue`) | no callee-saves, no alloca, no body slots, no outgoing stack | 3 (2 entry + 1 exit) | yes — optimal |
-| `slim_small_frame` | ≥1 callee-save (or body slots), saved-pair offset ≤ 504 | **7 (4 entry + 3 exit)** | **no** |
+| `fp_at_bottom` | ≥1 callee-save (or body slots), **no outgoing stack args**, frame_size ≤ 504 | **5 (3 entry + 2 exit)** | **yes — optimal** |
+| `slim_small_frame` | as above but with outgoing stack args (out_stack > 0), saved-pair offset ≤ 504 | 7 (4 entry + 3 exit) | no |
| fat | large frame, alloca, big saved-pair offset | 7+ | no |
-Tier A (leaf-ish, no callee-saves) is already optimal. The gap is entirely in
-`slim_small_frame`, which is the common case — any function that keeps a value
-live across a call lands here.
+Tier A (leaf-ish, no callee-saves) and `fp_at_bottom` are both optimal.
+`fp_at_bottom` is the common case — any function that keeps a value live across a
+call (but doesn't itself pass >8 register-class args to a callee) lands here. The
+remaining gap is only `slim_small_frame` (out_stack > 0, comparatively rare).
-### Current `slim_small_frame` (the four binary-trees functions)
+### `fp_at_bottom` (the four binary-trees functions) — implemented
-Entry (4 insns) — e.g. `NewTreeNode`, two callee-saves:
-```
-sub sp, sp, #32
-stp x29, x30, [sp, #16] ; frame record mid-frame
-add x29, sp, #16 ; fp points above the callee-save area
-stp x20, x19, [x29, #-16] ; callee-saves below fp
-```
-Exit (3 insns + ret):
-```
-ldp x20, x19, [x29, #-16]
-ldp x29, x30, [sp, #16]
-add sp, sp, #32
-ret
-```
+The frame record moves to the bottom of the frame (`fp = sp`), so the sp
+adjustment folds into the pre/post-indexed `stp`/`ldp` and callee-saves stack
+*above* the record at positive offsets. This matches what gcc -O0 emits.
-### Optimal `slim_small_frame` (what gcc -O0 already emits)
-
-Entry (3 insns):
+Entry (3 insns) — e.g. `NewTreeNode`, two callee-saves:
```
-stp x29, x30, [sp, #-32]! ; one insn: decrement sp AND save fp/lr
+stp x29, x30, [sp, #-32]! ; one insn: decrement sp AND save fp/lr at the bottom
mov x29, sp ; frame record at the bottom of the frame
-stp x19, x20, [sp, #16] ; callee-saves above the record (str for one)
+stp x20, x19, [x29, #16] ; callee-saves above the record (str for an odd one)
```
Exit (2 insns + ret):
```
-ldp x19, x20, [sp, #16]
+ldp x20, x19, [x29, #16]
ldp x29, x30, [sp], #32 ; one insn: restore fp/lr AND release sp
ret
```
-**−2 fixed insns per call** (−1 entry, −1 exit). The blocker is the layout:
-cfree puts the frame record above the callee-saves (`add x29, sp, #16`), so the
-sp-decrement and the fp/lr save cannot fold into a single pre-indexed `stp`.
-Moving the frame record to the bottom of the frame (fp-at-bottom, `mov x29, sp`)
-unlocks the pre-indexed entry `stp …, [sp, #-N]!` and post-indexed exit
-`ldp …, [sp], #N`.
+**−2 fixed insns per call** (−1 entry, −1 exit) vs the old top-record layout.
+Implemented in `src/arch/aa64/native.c`: the `fp_at_bottom` flag (set in
+`aa_func_begin_known_frame`), the layout-aware `aa_fp_off_*` / `aa_cfa_off`
+helpers, the prologue/epilogue branches in `aa_build_prologue_words` /
+`aa_words_restore_frame`, and the positive-offset callee-save stores in
+`aa_words_callee_saves`. Only the known-frame (-O1) path takes it — the
+frame-size-dependent offsets require the frame to be final before the body.
+The DWARF CFA becomes `fp + frame_size` (vs `fp + 16` for the top-record forms).
+
+`out_stack > 0` functions can't move the record to the absolute bottom (outgoing
+args live there), so they keep the top-record `slim_small_frame` layout below.
+
+#### Old top-record `slim_small_frame` (still used for out_stack > 0)
+
+Entry (4 insns) / exit (3 insns + ret): `sub sp` + `stp [sp,#N-16]` +
+`add x29,sp,#N-16` + callee-saves; reverse on exit. fp points mid-frame above the
+callee-saves, which is what blocks the fold.
(An aside: omitting the frame pointer entirely — packing a callee-save with lr,
e.g. `stp x19, x30, [sp, #-N]!`, no `x29` at all — would save one more insn, but
@@ -99,29 +100,31 @@ Two small constant adders beyond the prologue, both call-shaped:
## Per-function tally (fixed + warts vs gcc -O0)
-| function | cfree insns | gcc -O0 insns | cfree overhead vs optimal |
-| --- | ---: | ---: | --- |
-| NewTreeNode | 14 | 16 | +2 prologue |
-| ItemCheck | 20 | 21 | +2 prologue |
-| BottomUpTree | 23 | 24 | +2 prologue (zero-temp fixed) |
-| DeleteTree | 20 | 18 | +2 prologue, +1 branch |
+cfree insns are the `fp_at_bottom` numbers (incl. `ret`); the "was" column is the
+old top-record layout, each −2 from the prologue fold.
+
+| function | cfree insns | was | gcc -O0 insns | remaining overhead |
+| --- | ---: | ---: | ---: | --- |
+| NewTreeNode | 12 | 14 | 16 | — (ahead of gcc) |
+| ItemCheck | 18 | 20 | 21 | — (ahead of gcc) |
+| BottomUpTree | 21 | 23 | 24 | — (ahead of gcc) |
+| DeleteTree | 16 | 18 | 18 | +1 branch (item 4) |
-cfree already beats gcc -O0 on the malloc-heavy `NewTreeNode` (args stay in
-registers vs gcc's 8 stack shuffles), ties `ItemCheck`, and — with the
-zero-temp fix — now edges ahead on `BottomUpTree`. `DeleteTree` remains the only
-loss, on the fixed prologue overhead plus the item-4 branch chain.
+With the prologue fold all four are now at or ahead of gcc -O0. `DeleteTree`'s
+only remaining wart is the item-4 redundant branch chain.
-## Optimal target
+## Optimal target — reached for the common case
-Per-call fixed overhead for the common (`slim_small_frame`) case:
+Per-call fixed overhead:
-- **current:** 7 insns (4 entry + 3 exit) + ret
-- **optimal:** 5 insns (3 entry + 2 exit) + ret → **−2/call**
+- **`fp_at_bottom` (common case):** 5 insns (3 entry + 2 exit) + ret — optimal,
+ matching gcc -O0. Done.
+- **`slim_small_frame` (out_stack > 0 only):** 7 insns (4 entry + 3 exit) + ret —
+ unchanged; the record can't move to the bottom when outgoing args occupy it.
-Across binary-trees (~7.6M calls): item 2 alone removes ~15M instructions;
-adding items 3 and 4 brings the total to ~30M. That is expected to flip
-`BottomUpTree` and `DeleteTree` ahead of gcc -O0 and move the benchmark past the
-1.10× bar.
+Across binary-trees (~7.6M calls) the prologue fold removes ~15M instructions;
+combined with the zero-temp fix (item 3) and the pending item-4 branch fix the
+total reaches ~30M, flipping `BottomUpTree`/`DeleteTree` ahead of gcc -O0.
## Reproducing
diff --git a/src/arch/aa64/native.c b/src/arch/aa64/native.c
@@ -64,27 +64,47 @@ enum {
/* ============================================================================
* AAPCS64 frame layout
*
- * fp anchors at the caller's saved-pair address; sp anchors at the bottom of
- * the outgoing-arg area. Every fp- or sp-relative offset in this file is
- * computed via one of the helpers below — no site should do bare arithmetic
- * on AA_FP / AA_SP for addressing the frame.
+ * Two layouts. Every fp- or sp-relative offset in this file is computed via one
+ * of the aa_fp_off_ / aa_sp_off_ helpers below — no site does bare arithmetic on
+ * AA_FP / AA_SP, and no site outside those helpers branches on the layout.
+ *
+ * TOP-RECORD (default — single-pass -O0, fat frames, and out_stack>0 small
+ * frames). fp anchors at the caller's saved-pair address near the top; sp at the
+ * bottom of the outgoing-arg area. Offsets are frame-size-independent.
*
* high addr caller's stack frame
* +------------------------------+
- * | incoming stack args | aa_fp_off_in_arg(i)
+ * | incoming stack args | aa_fp_off_in_arg(a,i) = 16+i
* +------------------------------+
- * fp --> | saved x29 (prev fp) | aa_fp_off_saved_fp()
- * | saved x30 (prev lr) | aa_fp_off_saved_lr()
+ * fp --> | saved x29 (prev fp) | aa_fp_off_saved_fp() = 0
+ * | saved x30 (prev lr) | aa_fp_off_saved_lr() = 8
* +------------------------------+
- * | frame slots | aa_fp_off_slot(s->off)
+ * | frame slots | aa_fp_off_slot(a,off) = -off
* | (callee-saves + locals |
* | + spills + sret/variadic) |
* +------------------------------+
* | outgoing args | aa_sp_off_out_arg(i)
* sp --> +------------------------------+
- * low addr
+ * low addr CFA = fp + 16
+ *
+ * BOTTOM-RECORD (fp_at_bottom — known-frame -O1 small frames with
+ * callee-saves/locals and out_stack==0). The record moves to the bottom so the
+ * sp adjustment folds into a pre/post-indexed stp/ldp (−2 insns/call). fp = sp;
+ * slots/callee-saves stack ABOVE the record at positive offsets. Offsets depend
+ * on frame_size (hence known-frame only, where the frame is final before body).
+ *
+ * high addr caller's stack frame
+ * +------------------------------+
+ * | incoming stack args | aa_fp_off_in_arg(a,i) = N+i
+ * +------------------------------+ <- caller's sp = CFA = fp + N
+ * | frame slots (+ align pad) | aa_fp_off_slot(a,off)=N-off
+ * | (callee-saves + locals …) | (in [16, N), above record)
+ * +------------------------------+
+ * fp = sp --> | saved x29 (prev fp) | aa_fp_off_saved_fp() = 0
+ * | saved x30 (prev lr) | aa_fp_off_saved_lr() = 8
+ * low addr +------------------------------+ (N = frame_size; out_stack==0)
*
- * frame_size = align16(AA_FRAME_SAVE_SIZE + slot_bytes + out_stack).
+ * frame_size (N) = align16(AA_FRAME_SAVE_SIZE + slot_bytes + out_stack).
* Tail calls write outgoing args into the caller's incoming-args window —
* physically the same address, expressed via aa_fp_off_tail_out_arg.
* ========================================================================== */
@@ -107,19 +127,13 @@ static inline AAFrameLayout aa_build_layout(u32 slot_bytes, u32 out_stack) {
return L;
}
-/* FP-relative byte offsets. */
+/* FP-relative byte offsets. The saved-pair is at [fp]/[fp+8] in both the
+ * top-record and bottom-record (fp_at_bottom) layouts, so these two are
+ * layout-independent. The frame-size-dependent helpers — aa_fp_off_in_arg,
+ * aa_fp_off_slot, aa_fp_off_tail_out_arg — branch on a->fp_at_bottom and are
+ * defined after AANativeTarget (see aa_fp_off_* below aa_of). */
static inline i32 aa_fp_off_saved_fp(void) { return 0; }
static inline i32 aa_fp_off_saved_lr(void) { return 8; }
-static inline i32 aa_fp_off_in_arg(u32 byte_off) {
- return (i32)(AA_FRAME_SAVE_SIZE + byte_off);
-}
-static inline i32 aa_fp_off_slot(u32 slot_off) { return -(i32)slot_off; }
-/* Outgoing stack args on a tail call land in the caller's incoming-arg
- * window — same physical address the tail-callee will read via
- * aa_fp_off_in_arg. Same helper, distinct name for site-side intent. */
-static inline i32 aa_fp_off_tail_out_arg(u32 byte_off) {
- return aa_fp_off_in_arg(byte_off);
-}
/* SP-relative byte offsets. */
static inline i32 aa_sp_off_out_arg(u32 byte_off) { return (i32)byte_off; }
@@ -174,6 +188,11 @@ typedef struct AANativeTarget {
u32 slots_cap;
u32 cum_off;
u32 max_outgoing;
+ /* Final frame size, set once in aa_func_begin_known_frame when fp_at_bottom is
+ * decided. Read by the fp-relative offset helpers in the bottom-record layout
+ * (where slot/incoming-arg offsets depend on frame_size); meaningless and
+ * unread on the single-pass path, which never sets fp_at_bottom. */
+ u32 frame_size_final;
u32 incoming_stack_size;
u32 next_param_int;
u32 next_param_fp;
@@ -207,9 +226,21 @@ typedef struct AANativeTarget {
* materialization in the prologue (stp x29,x30,[sp,#N-16] instead) and
* the matching `add x10, fp, #0` in the epilogue (ldp x29,x30,[sp,#N-16]
* + add sp,sp,#N). Mutually exclusive with `slim_prologue` (Tier A wins
- * when both would apply). Applies to almost every function with a small
- * frame, including those with callee-saves and locals. */
+ * when both would apply) and `fp_at_bottom` (which wins for out_stack==0).
+ * Now only reached for small frames with outgoing stack args (out_stack>0),
+ * where the record cannot move to the bottom. Keeps the top-record layout. */
u8 slim_small_frame;
+ /* Set by aa_func_begin_known_frame for a small frame with callee-saves/locals
+ * and no outgoing stack args: the frame record moves to the bottom of the
+ * frame (fp = sp, `mov x29,sp`) so the sp adjustment folds into a pre-indexed
+ * `stp x29,x30,[sp,#-N]!` entry and post-indexed `ldp x29,x30,[sp],#N` exit
+ * (−2 insns/call vs slim_small_frame). Slots and callee-saves stack ABOVE the
+ * record at positive fp offsets; incoming args sit at fp+frame_size; CFA =
+ * fp+frame_size. The frame-size-dependent offsets are the reason this is only
+ * available on the known-frame path (frame final before the body). Mutually
+ * exclusive with slim_prologue (Tier A) and slim_small_frame; gated on
+ * out_stack==0 && !has_alloca && frame_size <= 504. */
+ u8 fp_at_bottom;
/* Set by aa_func_begin_known_frame (optimizer path: the full frame is known
* up front, so the prologue, allocas, and tail epilogues are emitted final
* with no back-patching). Cleared by aa_func_begin (NativeDirectTarget
@@ -231,6 +262,36 @@ typedef struct AANativeTarget {
static AANativeTarget* aa_of(NativeTarget* t) { return (AANativeTarget*)t; }
+/* Layout-aware FP-relative offsets. Every frame use site goes through these;
+ * the fp_at_bottom test lives here and nowhere else.
+ *
+ * top-record (default): record near the top, fp anchored at the saved pair.
+ * incoming args at fp+16+b, slots below fp at -off. CFA = fp+16.
+ * bottom-record (fp_at_bottom): record at the bottom, fp = sp.
+ * incoming args at fp+frame_size+b, slots above the record at
+ * frame_size-off (in [16, frame_size), never overlapping the 16-byte
+ * record since frame_size = align16(16+cum_off) >= 16+cum_off).
+ * CFA = fp+frame_size. */
+static inline i32 aa_fp_off_in_arg(const AANativeTarget* a, u32 byte_off) {
+ u32 base = a->fp_at_bottom ? a->frame_size_final : AA_FRAME_SAVE_SIZE;
+ return (i32)(base + byte_off);
+}
+static inline i32 aa_fp_off_slot(const AANativeTarget* a, u32 slot_off) {
+ return a->fp_at_bottom ? (i32)a->frame_size_final - (i32)slot_off
+ : -(i32)slot_off;
+}
+/* Outgoing stack args on a tail call land in the caller's incoming-arg window —
+ * the same physical address the tail-callee will read via aa_fp_off_in_arg.
+ * Same helper, distinct name for site-side intent. */
+static inline i32 aa_fp_off_tail_out_arg(const AANativeTarget* a, u32 byte_off) {
+ return aa_fp_off_in_arg(a, byte_off);
+}
+/* CFA = caller's sp, expressed as an fp-relative offset (fp+16 top-record,
+ * fp+frame_size bottom-record). Named so the CFI emit site stays layout-blind. */
+static inline i32 aa_cfa_off(const AANativeTarget* a) {
+ return a->fp_at_bottom ? (i32)a->frame_size_final : (i32)AA_FRAME_SAVE_SIZE;
+}
+
static void aa_panic(AANativeTarget* a, const char* msg) {
compiler_panic(a->base.c, a->loc, "aarch64 native target: %s", msg);
}
@@ -643,7 +704,7 @@ static void aa_addr_base(AANativeTarget* a, NativeAddr addr, u32* base_out,
case NATIVE_ADDR_BASE_FRAME: {
AANativeSlot* s = aa_slot(a, addr.base.frame);
*base_out = AA_FP;
- *off_out = aa_fp_off_slot(s->off) + addr.offset;
+ *off_out = aa_fp_off_slot(a, s->off) + addr.offset;
return;
}
case NATIVE_ADDR_BASE_GLOBAL: {
@@ -917,7 +978,7 @@ static u32 aa_ldst_q_simm9(int load, u32 rt, u32 rn, i32 byte_off) {
static void aa_emit_q_frame(AANativeTarget* a, int load, u32 qreg,
NativeFrameSlot slot, u32 offset) {
AANativeSlot* s = aa_slot(a, slot);
- i32 off = aa_fp_off_slot(s->off) + (i32)offset;
+ i32 off = aa_fp_off_slot(a, s->off) + (i32)offset;
MCEmitter* mc = a->base.mc;
if (off >= 0 && ((u32)off & 15u) == 0 && ((u32)off >> 4) <= 0xfffu) {
aa_emit32(mc, aa_ldst_q_uimm(load, qreg, AA_FP, (u32)off));
@@ -1007,6 +1068,8 @@ static void aa_func_begin_common(NativeTarget* t, const CGFuncDesc* fd) {
a->ncallee_saves = 0;
a->slim_prologue = 0;
a->slim_small_frame = 0;
+ a->fp_at_bottom = 0;
+ a->frame_size_final = 0;
a->known_frame = 0;
a->has_alloca = 0;
a->frame_final = 0;
@@ -1205,6 +1268,16 @@ static void aa_words_restore_frame(AANativeTarget* a, u32* words, u32 cap,
words[(*n)++] = aa64_ldp64_post(AA_FP, AA_LR, AA_SP, 2);
return;
}
+ if (a->fp_at_bottom) {
+ /* Bottom-record fold: `ldp x29,x30,[sp],#N` reloads the pair from the
+ * bottom AND releases the whole frame in one insn. Callee-saves were
+ * already restored by aa_emit_callee_restores. -1 insn vs slim_small_frame
+ * (which needs a separate `add sp`). N <= 504 holds the post-index imm. */
+ if (*n + 1u > cap) aa_panic(a, "instruction patch too small");
+ words[(*n)++] =
+ aa64_ldp64_post(AA_FP, AA_LR, AA_SP, (i32)(L->frame_size / 8u));
+ return;
+ }
if (a->slim_small_frame) {
/* `ldp x29,x30,[sp,#saved_pair] ; add sp,sp,#frame_size` — load through
* sp avoids the fat path's `add x10, fp, #0` scratch, and the subsequent
@@ -1237,14 +1310,16 @@ static void aa_words_callee_saves(AANativeTarget* a, int save, u32* words,
u32 cap, u32* n) {
for (u32 i = 0; i < a->ncallee_saves;) {
const AACalleeSave* cs = &a->callee_saves[i];
- i32 off = aa_fp_off_slot(aa_slot(a, cs->slot)->off);
- if (off < -256 || off > 255)
- aa_panic(a, "callee-save offset out of prologue range");
+ i32 off = aa_fp_off_slot(a, aa_slot(a, cs->slot)->off);
if (i + 1u < a->ncallee_saves && cs->cls == (u8)NATIVE_REG_INT &&
a->callee_saves[i + 1u].cls == (u8)NATIVE_REG_INT) {
const AACalleeSave* cs2 = &a->callee_saves[i + 1u];
- i32 off2 = aa_fp_off_slot(aa_slot(a, cs2->slot)->off);
- /* off2 = off - 8 (lower address; reserve allocates downward). */
+ i32 off2 = aa_fp_off_slot(a, aa_slot(a, cs2->slot)->off);
+ /* cs2 is reserved after cs (larger slot.off), so it is the lower address
+ * in both layouts (off2 = off - 8): stp's Rt = cs2, Rt2 = cs, base off2.
+ * stp/ldp's signed-7-bit scaled immediate reaches ±504. */
+ if (off2 < -512 || off2 > 504)
+ aa_panic(a, "callee-save pair offset out of prologue range");
if (*n >= cap) aa_panic(a, "prologue too large");
words[(*n)++] = save
? aa64_stp64_soff(cs2->reg, cs->reg, AA_FP, off2 / 8)
@@ -1253,8 +1328,20 @@ static void aa_words_callee_saves(AANativeTarget* a, int save, u32* words,
} else {
u32 v = cs->cls == (u8)NATIVE_REG_FP ? 1u : 0u;
if (*n >= cap) aa_panic(a, "prologue too large");
- words[(*n)++] = save ? aa_stur_v(3, v, cs->reg, AA_FP, off)
- : aa_ldur_v(3, v, cs->reg, AA_FP, off);
+ if (a->fp_at_bottom) {
+ /* Positive, 8-aligned offset above the record (up to frame_size-8 ≤
+ * 496): the unscaled stur (±256) can't reach it, so use the scaled
+ * unsigned-imm str/ldr. */
+ if (off < 0 || (u32)off > 0x7ff8u)
+ aa_panic(a, "callee-save offset out of prologue range");
+ words[(*n)++] = save ? aa_str_uimm_v(3, v, cs->reg, AA_FP, (u32)off)
+ : aa_ldr_uimm_v(3, v, cs->reg, AA_FP, (u32)off);
+ } else {
+ if (off < -256 || off > 255)
+ aa_panic(a, "callee-save offset out of prologue range");
+ words[(*n)++] = save ? aa_stur_v(3, v, cs->reg, AA_FP, off)
+ : aa_ldur_v(3, v, cs->reg, AA_FP, off);
+ }
i += 1u;
}
}
@@ -1265,11 +1352,10 @@ static void aa_words_callee_saves(AANativeTarget* a, int save, u32* words,
* a fixed worst-case region, then patches it here) and the optimizer path
* (aa_func_begin_known_frame emits exactly these words up front).
*
- * All three variants establish the same post-prologue state defined by L:
- * sp = caller's sp - L->frame_size
- * fp = sp + aa_sp_off_saved_pair(L) (saved-pair address)
- * saved x29/x30 at [fp], [fp+8]
- * callee-saves at [fp - s->off] for each. */
+ * All variants establish a post-prologue state defined by L: saved x29/x30 at
+ * [fp]/[fp+8], callee-saves at aa_fp_off_slot of each. The top-record variants
+ * leave fp = sp + aa_sp_off_saved_pair(L) (saved-pair near the top); the
+ * bottom-record variant leaves fp = sp (saved-pair at the bottom). */
static u32 aa_build_prologue_words(AANativeTarget* a, const AAFrameLayout* L,
u32* words, u32 cap) {
u32 n = 0;
@@ -1283,6 +1369,17 @@ static u32 aa_build_prologue_words(AANativeTarget* a, const AAFrameLayout* L,
words[n++] = aa64_add_imm(1, AA_FP, AA_SP, 0, 0);
return n;
}
+ if (a->fp_at_bottom) {
+ /* Bottom-record fold: `stp x29,x30,[sp,#-N]!` decrements sp by the whole
+ * frame AND saves the pair at the new bottom in one insn; `mov x29,sp`
+ * (add #0) anchors fp there. Callee-saves then stack above the record at
+ * positive offsets. -2 insns/call vs the top-record slim_small_frame. */
+ if (n + 2u > cap) aa_panic(a, "prologue too large");
+ words[n++] = aa64_stp64_pre(AA_FP, AA_LR, AA_SP, -(i32)(L->frame_size / 8u));
+ words[n++] = aa64_add_imm(1, AA_FP, AA_SP, 0, 0);
+ aa_words_callee_saves(a, 1, words, cap, &n);
+ return n;
+ }
aa_words_sub_sp_frame(a, words, cap, &n, L->frame_size);
if (a->slim_small_frame) {
/* `stp x29, x30, [sp, #saved_pair_off]` — skip the `add x17, sp, #...`
@@ -1419,12 +1516,14 @@ static void aa_func_end(NativeTarget* t) {
aa_apply_patches(a, &L);
}
if (mc->cfi_set_next_pc_offset && mc->cfi_def_cfa && mc->cfi_offset) {
+ i32 cfa = aa_cfa_off(a);
mc->cfi_set_next_pc_offset(mc, prologue_region * 4u);
- /* CFA = caller's sp = fp + AA_FRAME_SAVE_SIZE. saved fp/lr at fp/fp+8
- * (= CFA-16, CFA-8). Unified across all three prologue layouts. */
- mc->cfi_def_cfa(mc, AA_FP, AA_FRAME_SAVE_SIZE);
- mc->cfi_offset(mc, AA_FP, aa_fp_off_saved_fp() - (i32)AA_FRAME_SAVE_SIZE);
- mc->cfi_offset(mc, AA_LR, aa_fp_off_saved_lr() - (i32)AA_FRAME_SAVE_SIZE);
+ /* CFA = caller's sp, an fp-relative offset that depends on the layout:
+ * fp+16 (top-record) or fp+frame_size (bottom-record). saved fp/lr live at
+ * [fp]/[fp+8] in both, hence at CFA-cfa / CFA-cfa+8. */
+ mc->cfi_def_cfa(mc, AA_FP, cfa);
+ mc->cfi_offset(mc, AA_FP, aa_fp_off_saved_fp() - cfa);
+ mc->cfi_offset(mc, AA_LR, aa_fp_off_saved_lr() - cfa);
}
obj_symbol_define(t->obj, a->func->sym, a->func->text_section_id,
a->func_start, mc->pos(mc) - a->func_start);
@@ -1508,16 +1607,25 @@ static void aa_func_begin_known_frame(NativeTarget* t, const CGFuncDesc* fd,
a->max_outgoing = frame->max_outgoing;
}
/* Frame is final: slot_bytes (cum_off) and out_stack (max_outgoing) are both
- * known, so the prologue immediates and slim-form choice are settled here. */
+ * known, so the prologue immediates and slim-form choice are settled here.
+ * frame_size_final must be set before aa_build_prologue_words / entry saves,
+ * since the bottom-record offset helpers read it. */
L = aa_build_layout(a->cum_off, a->max_outgoing);
+ a->frame_size_final = L.frame_size;
/* Slim Tier A: no callee-saves, no alloca, no body slots, no outgoing stack
- * args. slim_small_frame: skip the x17/x10 scratch when the saved-pair offset
- * fits stp's signed 7-bit scaled immediate. (See aa_func_end for the
- * single-pass path, which never takes the slim form.) */
+ * args — the whole frame is the 16-byte record. fp_at_bottom: a small frame
+ * with callee-saves/locals and no outgoing stack args; the record moves to
+ * the bottom (fp = sp) so sp adjustment folds into the pre/post-indexed
+ * stp/ldp (frame_size <= 504 keeps the post-index ldp imm in range). Otherwise
+ * slim_small_frame keeps the top-record layout but skips the x17/x10 scratch
+ * (out_stack>0 small frames land here). (See aa_func_end for the single-pass
+ * path, which never takes any slim form.) */
a->slim_prologue = a->ncallee_saves == 0 && !a->has_alloca &&
L.slot_bytes == 0 && L.out_stack == 0;
- a->slim_small_frame = !a->slim_prologue && !a->has_alloca &&
- aa_sp_off_saved_pair(&L) <= 504u;
+ a->fp_at_bottom = !a->slim_prologue && !a->has_alloca && L.out_stack == 0 &&
+ L.frame_size <= 504u;
+ a->slim_small_frame = !a->slim_prologue && !a->fp_at_bottom &&
+ !a->has_alloca && aa_sp_off_saved_pair(&L) <= 504u;
n = aa_build_prologue_words(a, &L, words, AA_PROLOGUE_WORDS);
for (u32 i = 0; i < n; ++i) aa_emit32(t->mc, words[i]);
a->minimal_prologue_words = n;
@@ -1637,7 +1745,7 @@ static void aa_load_addr(NativeTarget* t, NativeLoc dst, NativeAddr addr) {
switch ((NativeAddrBaseKind)addr.base_kind) {
case NATIVE_ADDR_BASE_FRAME: {
AANativeSlot* s = aa_slot(a, addr.base.frame);
- aa_emit_add_imm(a, rd, AA_FP, aa_fp_off_slot(s->off) + addr.offset);
+ aa_emit_add_imm(a, rd, AA_FP, aa_fp_off_slot(a, s->off) + addr.offset);
aa_apply_index(a, rd, &addr);
return;
}
@@ -2251,7 +2359,7 @@ static void aa_store_outgoing_part(NativeTarget* t, int tail_call,
* (= [fp + 16 + off], same address the tail-callee will read via
* aa_fp_off_in_arg). Non-tail calls write to the sp-anchored outgoing
* area at the bottom of the caller's frame. */
- addr.offset = tail_call ? aa_fp_off_tail_out_arg(stack_off)
+ addr.offset = tail_call ? aa_fp_off_tail_out_arg(aa_of(t), stack_off)
: aa_sp_off_out_arg(stack_off);
aa_emit_mem(aa_of(t), 0, src, addr, mem);
}
@@ -3505,7 +3613,7 @@ static void aa_bind_native_param(NativeTarget* t, const CGParamDesc* p,
memset(&saddr, 0, sizeof saddr);
saddr.base_kind = NATIVE_ADDR_BASE_REG;
saddr.base.reg = AA_FP;
- saddr.offset = aa_fp_off_in_arg(a->next_param_stack);
+ saddr.offset = aa_fp_off_in_arg(a, a->next_param_stack);
aa_emit_mem(a, 1, src, saddr, aa_mem_for_type(t, p->type, 8));
a->next_param_stack += 8u;
}
@@ -3548,7 +3656,7 @@ static void aa_bind_native_param(NativeTarget* t, const CGParamDesc* p,
saddr.base_kind = NATIVE_ADDR_BASE_REG;
saddr.base.reg = AA_FP;
saddr.base_type = p->type;
- saddr.offset = aa_fp_off_in_arg(a->next_param_stack);
+ saddr.offset = aa_fp_off_in_arg(a, a->next_param_stack);
aa_emit_mem(a, 1, src, saddr, aa_mem_for_type(t, p->type, part->size));
a->next_param_stack += aa_part_stack_size(part);
}
@@ -3712,7 +3820,7 @@ static void aa_va_start_core(AANativeTarget* a, NativeAddr ap) {
/* `va_list = &<first vararg>`. Variadic stack args follow the fixed
* incoming params in the same caller window, so the offset is the
* current next_param_stack cursor. */
- aa_emit_add_imm(a, AA_TMP0, AA_FP, aa_fp_off_in_arg(a->next_param_stack));
+ aa_emit_add_imm(a, AA_TMP0, AA_FP, aa_fp_off_in_arg(a, a->next_param_stack));
aa_emit_mem(a, 0, ptr, ap, aa_mem_for_type(t, ptr.type, 8));
return;
}
@@ -3731,7 +3839,7 @@ static void aa_va_start_core(AANativeTarget* a, NativeAddr ap) {
/* __stack points at the incoming stack args, which sit above the saved
* fp/lr pair — the same address bind_param uses (aa_fp_off_in_arg), not the
* raw next_param_stack cursor. */
- aa_emit_add_imm(a, AA_TMP0, AA_FP, aa_fp_off_in_arg(a->next_param_stack));
+ aa_emit_add_imm(a, AA_TMP0, AA_FP, aa_fp_off_in_arg(a, a->next_param_stack));
aa_emit_mem(a, 0, ptr, aa_reg_addr(ptr.type, base, (i32)vai.stack_offset),
ptr_mem);
aa_emit_add_imm(a, AA_TMP0, AA_FP,
diff --git a/test/toy/cases/140_fp_callee_save_bottom_frame.expected b/test/toy/cases/140_fp_callee_save_bottom_frame.expected
@@ -0,0 +1 @@
+964
diff --git a/test/toy/cases/140_fp_callee_save_bottom_frame.toy b/test/toy/cases/140_fp_callee_save_bottom_frame.toy
@@ -0,0 +1,38 @@
+// Regression: aarch64 -O1 bottom-record (fp_at_bottom) frame layout must save
+// and restore FP callee-saves (v8-v15) correctly. FP saves are always
+// single-register, so they take the scaled `str d, [fp, #off]` / `ldr` path —
+// the unscaled stur (+-256) the top-record layout uses cannot reach the large
+// positive offsets the bottom-record layout produces.
+//
+// `acc` (an f64) is live across the call to `bump`, forcing an FP callee-save.
+// The 40-element local array bloats the frame past 256 bytes, so the FP
+// callee-save (reserved first => largest offset = frame_size-8) lands well past
+// 255 and exercises the scaled d-register store/load. The body sums everything
+// so nothing is dead-code-eliminated; the result is checked against .expected.
+
+fn bump(x: i64): i64 { return x + 100; }
+
+fn big_fp(seed: i64): i64 {
+ var buf: [40]i64;
+ var i: i64 = 0;
+ while i < 40 {
+ buf[i] = seed + i;
+ i = i + 1;
+ }
+ var acc: f64 = (seed as f64) + 0.5; // f64 live across the call below
+ let y: i64 = bump(seed); // call: acc must survive in d8..d15
+ var s: i64 = 0;
+ i = 0;
+ while i < 40 {
+ s = s + buf[i];
+ i = i + 1;
+ }
+ // seed=2: acc=2.5 -> 2; y=102; sum(2..41)=2*40+780=860; total=2+102+860=964.
+ return (acc as i64) + y + s;
+}
+
+fn __user_main(): i64 {
+ return big_fp(2);
+}
+
+fn main(): i32 { return __user_main() as i32; }