opt/aa64: fold O1 prologue via fp-at-bottom frame layout - kit

commit 9e004ebf1826ac067f5b6ab8c2790eb16ad52c6c
parent ed078ac80320bb5d7b8e4aa8c4e780d0aec8fdb8
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Thu, 28 May 2026 18:48:40 -0700

opt/aa64: fold O1 prologue via fp-at-bottom frame layout

Add the `fp_at_bottom` frame tier for known-frame (-O1) functions with
callee-saves/locals and no outgoing stack args: move the frame record to
the bottom of the frame (fp = sp) so the sp adjustment folds into a
pre-indexed `stp x29,x30,[sp,#-N]!` entry and post-indexed
`ldp x29,x30,[sp],#N` exit. Callee-saves stack above the record at
positive offsets (scaled `str`/`stp`, since the unscaled stur can't reach
offsets past 255). This is -2 fixed insns/call vs the top-record
slim_small_frame, matching gcc -O0's prologue shape.

The layout is frame-size-dependent (incoming-arg/slot offsets, CFA), so
it's only taken on the known-frame path where the frame is final before
the body. out_stack>0 frames keep the top-record slim_small_frame (the
record can't share the bottom with outgoing args). All layout decisions
stay centralized in the aa_fp_off_* / aa_cfa_off helpers — no use site
branches on the layout.

DWARF CFA becomes fp + frame_size (vs fp + 16 for top-record); verified
via .eh_frame. On binary-trees the four hot functions drop below gcc -O0
(12/18/21/16 vs 16/21/24/18 insns); cc runtime is unchanged (workload is
malloc-bound) but the JIT path reaches parity with gcc -O0. See
doc/PERCALL.md and doc/OPT_O1_PERF_TODO.md.

Regression test 140_fp_callee_save_bottom_frame exercises an f64
callee-save (v8-v15) at a >255 offset (the scaled d-register path).

Diffstat:
M doc/OPT_O1_PERF_TODO.md  | 98 ++++++++++++++++++++++++++++++++++++++++---------------------------------------
M doc/PERCALL.md  | 113 +++++++++++++++++++++++++++++++++++++++++--------------------------------------
M src/arch/aa64/native.c  | 216 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------
A test/toy/cases/140_fp_callee_save_bottom_frame.expected  | 1 +
A test/toy/cases/140_fp_callee_save_bottom_frame.toy  | 38 ++++++++++++++++++++++++++++++++++++++

5 files changed, 309 insertions(+), 157 deletions(-)
diff --git a/doc/OPT_O1_PERF_TODO.md b/doc/OPT_O1_PERF_TODO.md
@@ -27,7 +27,7 @@ the cached baseline in `scripts/opt_bench_baseline.csv`; regenerate with
 
 | bench | cfree -O1 | gcc -O0 | vs gcc-O0 | mir -O1 | vs mir-O1 | behind |
 | --- | ---: | ---: | ---: | ---: | ---: | --- |
-| binary-trees | 2973 | 2647 | **0.89×** (slower) | n/a¹ | — | gcc |
+| binary-trees | 2969² | 2647 | **0.89×** (slower) | n/a¹ | — | gcc (cc path) |
 | lists | 4843 | 8868 | 1.83× ✓ | 4997 | 1.03× | mir |
 | hash2 | 4988 | 7481 | 1.50× ✓ | 3863 | **0.77×** | mir |
 | sieve | 5148 | 5077 | 0.99× (~tied) | 4028 | **0.78×** | gcc (~tied), mir |
@@ -36,17 +36,32 @@ the cached baseline in `scripts/opt_bench_baseline.csv`; regenerate with
 
 ¹ mir-c2m fails to compile `binary-trees`, so only the gcc comparison applies.
 
+² Refreshed 2026-05-28 after the `fp_at_bottom` prologue fold (item 2 below).
+The fold landed (−2 fixed insns/call, codegen now byte-for-byte the same
+prologue/epilogue shape as gcc -O0) but the **cc-linked runtime is unchanged**
+(2973 → 2969): binary-trees is malloc/free-bound, so per-call ALU savings are
+noise. The **JIT path is at parity** — `cfree-run -O1` = 2617 ms (~tied with gcc
+2647), vs the cc binary's 2969. That ~350 ms cc-only deficit (identical codegen)
+points at PLT/GOT indirection for `malloc`/`free` — ~15M dynamic-symbol calls
+routed through stubs — as the real remaining cost on the cc path, not codegen.
+
 ## Per-benchmark notes
 
-### binary-trees — slower than unoptimized gcc (highest priority)
-The only case where cfree `-O1` is *slower than gcc -O0* (0.89×, up from 0.84×
-after the known-frame prologue landed — item 1 below). Workload is
-recursive tree build/walk: four tiny functions (`NewTreeNode`, `ItemCheck`,
-`BottomUpTree`, `DeleteTree`) called ~7.6M times at depth=19, plus a
-`malloc`/`free` per node. The **body** of each function is fine — cfree -O1
-keeps the recurring pointer in a callee-save (x19) where gcc -O0 spills and
-reloads it three times per call. The gap is entirely in **fixed per-call
-overhead** — every byte of which is multiplied by 7.6M.
+### binary-trees — cc path slower than unoptimized gcc; JIT at parity
+cfree `-O1` (cc) is 0.89× gcc -O0 — but this is **no longer a codegen gap**.
+After the `fp_at_bottom` prologue fold (item 2, DONE) the four hot functions
+(`NewTreeNode` 12, `ItemCheck` 18, `BottomUpTree` 21, `DeleteTree` 16 insns) are
+each *smaller* than gcc -O0 (16/21/24/18) and share gcc's exact
+prologue/epilogue shape, yet cc runtime didn't move (2973 → 2969). The workload
+is malloc/free-bound: four tiny functions called ~7.6M times at depth=19 with a
+`malloc`/`free` per node, so the ~15M instructions removed are dwarfed by
+allocator cost. Evidence it's *not* codegen: the JIT path (`cfree-run -O1`,
+identical codegen, no PLT) is ~tied with gcc (2617 vs 2647). The remaining cc
+deficit is the dynamic-call path to `malloc`/`free` through PLT/GOT stubs — the
+next thing to chase here, if anything (e.g. `-fno-plt`-style direct calls, or
+the linker resolving intra-image calls). The **body** quality was never the
+problem: cfree keeps the recurring pointer in a callee-save (x19) where gcc -O0
+spills and reloads it three times per call.
 
 Open items, in priority order (most recent disasm in
 `/tmp/mc/binary-trees.cfree.o`):
@@ -62,22 +77,17 @@ Open items, in priority order (most recent disasm in
    directly — no filler, no patch. Functions now fall straight from the prologue
    into the body. **-1 to -2 insns at entry, every function.**
 
-2. **Prologue compaction: 4-insn → 2-insn pre-indexed.** Half-done. cfree
-   today emits:
-   ```
-   sub  sp, sp, #N
-   stp  x29, x30, [sp, #M]
-   add  x29, sp, #M
-   stp  x20, x19, [x29, #-K]      ; or `stur x19, ...` when only x19 live
-   ```
-   `bcca0bd` added `slim_prologue` in `src/arch/aa64/native.c`, which uses
-   the pre-indexed `stp x29, x30, [sp, #-N]!` form — but only when
-   `ncallee_saves == 0`. The common path (one or more callee-saves) still
-   goes through `slim_small_frame`, which doesn't pre-index. Extend
-   `slim_small_frame` to use `aa64_stp64_pre` for the FP-save when the frame
-   size is known in advance, then emit the callee-save spills as separate
-   `stur`s afterward. Saves 1 insn at entry + 1 at every exit. **~2 insns ×
-   7.6M calls.**
+2. **Prologue compaction: 4-insn → 2-insn pre-indexed. [DONE]** Implemented as
+   the `fp_at_bottom` tier in `src/arch/aa64/native.c`: a known-frame (-O1) small
+   frame with callee-saves and no outgoing stack args moves the frame record to
+   the bottom (fp = sp), folding the sp adjustment into the pre-indexed
+   `stp x29,x30,[sp,#-N]!` entry and post-indexed `ldp x29,x30,[sp],#N` exit;
+   callee-saves stack above the record at positive offsets (scaled `str`/`stp`).
+   Entry 4→3, exit 3→2 insns (−2/call), matching gcc -O0's shape. CFA becomes
+   `fp + frame_size`. See `doc/PERCALL.md`. Removed ~15M instructions at
+   depth=19 — but binary-trees runtime is malloc-bound, so the cc-path runtime
+   was unchanged (see the section intro). The win is real for code size and for
+   call-heavy, *non*-allocation-bound workloads.
 
 3. **Zero materialized through a temp in `BottomUpTree` leaf path.**
    `NewTreeNode(NULL, NULL)` still emits:
@@ -95,22 +105,15 @@ Open items, in priority order (most recent disasm in
    to also fire when the source op is `IR_LOAD_IMM`. **+2 insns × 524k leaf
    calls.**
 
-4. **Trailing `b A; A: b B` pair in `DeleteTree`'s if/else merge.**
-   ```
-   c9c:  mov  x0, x19
-   ca0:  bl   _free
-   ca4:  b    0xcac              <-- jumps directly to the epilogue label
-   ca8:  b    0xc9c              <-- unreachable; left over from else side
-   cac:  ldur x19, [x29, #-0x8]
-   ```
-   Classic jump-thread target. `cleanup_layout_fallthrough_branches` in
-   `pass_jump.c` doesn't pick up this shape (two consecutive `b`s where
-   the second is unreachable). +1 insn per `DeleteTree` invocation
-   (7.6M calls).
+4. **Trailing `b A; A: b B` pair in `DeleteTree`'s if/else merge. [GONE]** No
+   longer present as of the 2026-05-28 disasm: `DeleteTree`'s `b.eq` now targets
+   the epilogue directly, one clean branch, 16 insns total (vs gcc's 18). Either
+   a `pass_jump.c` change or the layout shift resolved it; re-add to this list
+   only if it reappears.
 
-Together these would save 4–6 insns per call, ~30–50M instructions removed
-at depth=19. Body quality is already on par with gcc-O0; this is all
-fixed per-call overhead.
+These items are all resolved or no-impact for binary-trees: the four functions
+are now smaller than gcc -O0 (12/18/21/16 vs 16/21/24/18). The remaining
+cc-path runtime gap is allocator/PLT cost, not codegen — see the section intro.
 
 ### mandelbrot — 0.92× vs mir (close to the bar)
 Inner loop is FP-heavy (`Tr*Tr + Ti*Ti < 4.0` Mandelbrot escape test +
@@ -154,13 +157,12 @@ are the most concrete tests for whether each is complete.
    frame up front, so the prologue is emitted final in its slim form rather than
    fat-then-patched). Affected every cfree-compiled function.
 
-2. **Compact FP-frame prologue/epilogue.** See binary-trees item 2. The
-   2-insn pre-indexed form is wired in for the no-callee-save case (Tier A);
-   extending it to small frames with 1–2 callee-saves still needs the frame
-   record moved to the bottom of the frame (fp-at-bottom layout) so a single
-   `stp x29,x30,[sp,#-N]!` covers both the sp decrement and the save — a
-   separable layout change, not unlocked by the frame-planning rework alone.
-   Biggest absolute payoff still open on call-heavy benches.
+2. **Compact FP-frame prologue/epilogue. [DONE]** See binary-trees item 2.
+   The fp-at-bottom layout (`fp_at_bottom` tier) now folds the sp decrement and
+   fp/lr save into one pre-indexed `stp x29,x30,[sp,#-N]!` (and post-indexed
+   `ldp` on exit) for known-frame small frames with callee-saves and no outgoing
+   stack args. −2 fixed insns/call. Helps code-heavy benches; runtime-neutral on
+   allocation-bound ones like binary-trees.
 
 3. **Hard-register copy coalescing for `IR_LOAD_IMM` sources.** See
    binary-trees item 3. The hint-propagation path covers `ldr` → call-arg
diff --git a/doc/PERCALL.md b/doc/PERCALL.md
@@ -2,10 +2,11 @@
 
 The fixed cost cfree `-O1` pays around *every* function call — prologue,
 epilogue, and call-site argument setup — independent of the function body. On
-call-heavy workloads (binary-trees: ~7.6M calls at depth 19) this dominates,
-and it is the whole reason binary-trees still trails gcc -O0 (0.89×). The
-function *bodies* are already at or ahead of gcc -O0 (cfree keeps recurring
-values in callee-saves where gcc -O0 spills and reloads); see
+call-heavy workloads (binary-trees: ~7.6M calls at depth 19) this dominates. The
+prologue/epilogue fold described here (`fp_at_bottom`) is now implemented, taking
+the common per-call overhead from 7 to the optimal 5 fixed insns. The function
+*bodies* are already at or ahead of gcc -O0 (cfree keeps recurring values in
+callee-saves where gcc -O0 spills and reloads); see
 [OPT_O1_PERF_TODO.md](OPT_O1_PERF_TODO.md) for the benchmark standings.
 
 Numbers below are from `/tmp/mc/binary-trees.cfree.o` vs `binary-trees.gcc.o`
@@ -13,57 +14,57 @@ Numbers below are from `/tmp/mc/binary-trees.cfree.o` vs `binary-trees.gcc.o`
 
 ## Prologue / epilogue tiers
 
-cfree picks one of three frame shapes per function. "Fixed insns" excludes the
+cfree picks one of four frame shapes per function. "Fixed insns" excludes the
 final `ret` (always required) and counts entry + exit.
 
-| tier | when | fixed insns | pre-indexed? |
+| tier | when | fixed insns | folded? |
 | --- | --- | ---: | --- |
 | Tier A (`slim_prologue`) | no callee-saves, no alloca, no body slots, no outgoing stack | 3 (2 entry + 1 exit) | yes — optimal |
-| `slim_small_frame` | ≥1 callee-save (or body slots), saved-pair offset ≤ 504 | **7 (4 entry + 3 exit)** | **no** |
+| `fp_at_bottom` | ≥1 callee-save (or body slots), **no outgoing stack args**, frame_size ≤ 504 | **5 (3 entry + 2 exit)** | **yes — optimal** |
+| `slim_small_frame` | as above but with outgoing stack args (out_stack > 0), saved-pair offset ≤ 504 | 7 (4 entry + 3 exit) | no |
 | fat | large frame, alloca, big saved-pair offset | 7+ | no |
 
-Tier A (leaf-ish, no callee-saves) is already optimal. The gap is entirely in
-`slim_small_frame`, which is the common case — any function that keeps a value
-live across a call lands here.
+Tier A (leaf-ish, no callee-saves) and `fp_at_bottom` are both optimal.
+`fp_at_bottom` is the common case — any function that keeps a value live across a
+call (but doesn't itself pass >8 register-class args to a callee) lands here. The
+remaining gap is only `slim_small_frame` (out_stack > 0, comparatively rare).
 
-### Current `slim_small_frame` (the four binary-trees functions)
+### `fp_at_bottom` (the four binary-trees functions) — implemented
 
-Entry (4 insns) — e.g. `NewTreeNode`, two callee-saves:
-```
-sub  sp, sp, #32
-stp  x29, x30, [sp, #16]      ; frame record mid-frame
-add  x29, sp, #16             ; fp points above the callee-save area
-stp  x20, x19, [x29, #-16]    ; callee-saves below fp
-```
-Exit (3 insns + ret):
-```
-ldp  x20, x19, [x29, #-16]
-ldp  x29, x30, [sp, #16]
-add  sp, sp, #32
-ret
-```
+The frame record moves to the bottom of the frame (`fp = sp`), so the sp
+adjustment folds into the pre/post-indexed `stp`/`ldp` and callee-saves stack
+*above* the record at positive offsets. This matches what gcc -O0 emits.
 
-### Optimal `slim_small_frame` (what gcc -O0 already emits)
-
-Entry (3 insns):
+Entry (3 insns) — e.g. `NewTreeNode`, two callee-saves:
 ```
-stp  x29, x30, [sp, #-32]!    ; one insn: decrement sp AND save fp/lr
+stp  x29, x30, [sp, #-32]!    ; one insn: decrement sp AND save fp/lr at the bottom
 mov  x29, sp                  ; frame record at the bottom of the frame
-stp  x19, x20, [sp, #16]      ; callee-saves above the record (str for one)
+stp  x20, x19, [x29, #16]     ; callee-saves above the record (str for an odd one)
 ```
 Exit (2 insns + ret):
 ```
-ldp  x19, x20, [sp, #16]
+ldp  x20, x19, [x29, #16]
 ldp  x29, x30, [sp], #32      ; one insn: restore fp/lr AND release sp
 ret
 ```
 
-**−2 fixed insns per call** (−1 entry, −1 exit). The blocker is the layout:
-cfree puts the frame record above the callee-saves (`add x29, sp, #16`), so the
-sp-decrement and the fp/lr save cannot fold into a single pre-indexed `stp`.
-Moving the frame record to the bottom of the frame (fp-at-bottom, `mov x29, sp`)
-unlocks the pre-indexed entry `stp …, [sp, #-N]!` and post-indexed exit
-`ldp …, [sp], #N`.
+**−2 fixed insns per call** (−1 entry, −1 exit) vs the old top-record layout.
+Implemented in `src/arch/aa64/native.c`: the `fp_at_bottom` flag (set in
+`aa_func_begin_known_frame`), the layout-aware `aa_fp_off_*` / `aa_cfa_off`
+helpers, the prologue/epilogue branches in `aa_build_prologue_words` /
+`aa_words_restore_frame`, and the positive-offset callee-save stores in
+`aa_words_callee_saves`. Only the known-frame (-O1) path takes it — the
+frame-size-dependent offsets require the frame to be final before the body.
+The DWARF CFA becomes `fp + frame_size` (vs `fp + 16` for the top-record forms).
+
+`out_stack > 0` functions can't move the record to the absolute bottom (outgoing
+args live there), so they keep the top-record `slim_small_frame` layout below.
+
+#### Old top-record `slim_small_frame` (still used for out_stack > 0)
+
+Entry (4 insns) / exit (3 insns + ret): `sub sp` + `stp [sp,#N-16]` +
+`add x29,sp,#N-16` + callee-saves; reverse on exit. fp points mid-frame above the
+callee-saves, which is what blocks the fold.
 
 (An aside: omitting the frame pointer entirely — packing a callee-save with lr,
 e.g. `stp x19, x30, [sp, #-N]!`, no `x29` at all — would save one more insn, but
@@ -99,29 +100,31 @@ Two small constant adders beyond the prologue, both call-shaped:
 
 ## Per-function tally (fixed + warts vs gcc -O0)
 
-| function | cfree insns | gcc -O0 insns | cfree overhead vs optimal |
-| --- | ---: | ---: | --- |
-| NewTreeNode | 14 | 16 | +2 prologue |
-| ItemCheck | 20 | 21 | +2 prologue |
-| BottomUpTree | 23 | 24 | +2 prologue (zero-temp fixed) |
-| DeleteTree | 20 | 18 | +2 prologue, +1 branch |
+cfree insns are the `fp_at_bottom` numbers (incl. `ret`); the "was" column is the
+old top-record layout, each −2 from the prologue fold.
+
+| function | cfree insns | was | gcc -O0 insns | remaining overhead |
+| --- | ---: | ---: | ---: | --- |
+| NewTreeNode | 12 | 14 | 16 | — (ahead of gcc) |
+| ItemCheck | 18 | 20 | 21 | — (ahead of gcc) |
+| BottomUpTree | 21 | 23 | 24 | — (ahead of gcc) |
+| DeleteTree | 16 | 18 | 18 | +1 branch (item 4) |
 
-cfree already beats gcc -O0 on the malloc-heavy `NewTreeNode` (args stay in
-registers vs gcc's 8 stack shuffles), ties `ItemCheck`, and — with the
-zero-temp fix — now edges ahead on `BottomUpTree`. `DeleteTree` remains the only
-loss, on the fixed prologue overhead plus the item-4 branch chain.
+With the prologue fold all four are now at or ahead of gcc -O0. `DeleteTree`'s
+only remaining wart is the item-4 redundant branch chain.
 
-## Optimal target
+## Optimal target — reached for the common case
 
-Per-call fixed overhead for the common (`slim_small_frame`) case:
+Per-call fixed overhead:
 
-- **current:** 7 insns (4 entry + 3 exit) + ret
-- **optimal:** 5 insns (3 entry + 2 exit) + ret  → **−2/call**
+- **`fp_at_bottom` (common case):** 5 insns (3 entry + 2 exit) + ret — optimal,
+  matching gcc -O0. Done.
+- **`slim_small_frame` (out_stack > 0 only):** 7 insns (4 entry + 3 exit) + ret —
+  unchanged; the record can't move to the bottom when outgoing args occupy it.
 
-Across binary-trees (~7.6M calls): item 2 alone removes ~15M instructions;
-adding items 3 and 4 brings the total to ~30M. That is expected to flip
-`BottomUpTree` and `DeleteTree` ahead of gcc -O0 and move the benchmark past the
-1.10× bar.
+Across binary-trees (~7.6M calls) the prologue fold removes ~15M instructions;
+combined with the zero-temp fix (item 3) and the pending item-4 branch fix the
+total reaches ~30M, flipping `BottomUpTree`/`DeleteTree` ahead of gcc -O0.
 
 ## Reproducing
 
diff --git a/src/arch/aa64/native.c b/src/arch/aa64/native.c
@@ -64,27 +64,47 @@ enum {
 /* ============================================================================
  * AAPCS64 frame layout
  *
- * fp anchors at the caller's saved-pair address; sp anchors at the bottom of
- * the outgoing-arg area. Every fp- or sp-relative offset in this file is
- * computed via one of the helpers below — no site should do bare arithmetic
- * on AA_FP / AA_SP for addressing the frame.
+ * Two layouts. Every fp- or sp-relative offset in this file is computed via one
+ * of the aa_fp_off_ / aa_sp_off_ helpers below — no site does bare arithmetic on
+ * AA_FP / AA_SP, and no site outside those helpers branches on the layout.
+ *
+ * TOP-RECORD (default — single-pass -O0, fat frames, and out_stack>0 small
+ * frames). fp anchors at the caller's saved-pair address near the top; sp at the
+ * bottom of the outgoing-arg area. Offsets are frame-size-independent.
  *
  *   high addr   caller's stack frame
  *               +------------------------------+
- *               | incoming stack args          |  aa_fp_off_in_arg(i)
+ *               | incoming stack args          |  aa_fp_off_in_arg(a,i) = 16+i
  *               +------------------------------+
- *      fp  -->  | saved x29 (prev fp)          |  aa_fp_off_saved_fp()
- *               | saved x30 (prev lr)          |  aa_fp_off_saved_lr()
+ *      fp  -->  | saved x29 (prev fp)          |  aa_fp_off_saved_fp() = 0
+ *               | saved x30 (prev lr)          |  aa_fp_off_saved_lr() = 8
  *               +------------------------------+
- *               | frame slots                  |  aa_fp_off_slot(s->off)
+ *               | frame slots                  |  aa_fp_off_slot(a,off) = -off
  *               |   (callee-saves + locals     |
  *               |    + spills + sret/variadic) |
  *               +------------------------------+
  *               | outgoing args                |  aa_sp_off_out_arg(i)
  *      sp  -->  +------------------------------+
- *   low addr
+ *   low addr                                       CFA = fp + 16
+ *
+ * BOTTOM-RECORD (fp_at_bottom — known-frame -O1 small frames with
+ * callee-saves/locals and out_stack==0). The record moves to the bottom so the
+ * sp adjustment folds into a pre/post-indexed stp/ldp (−2 insns/call). fp = sp;
+ * slots/callee-saves stack ABOVE the record at positive offsets. Offsets depend
+ * on frame_size (hence known-frame only, where the frame is final before body).
+ *
+ *   high addr   caller's stack frame
+ *               +------------------------------+
+ *               | incoming stack args          |  aa_fp_off_in_arg(a,i) = N+i
+ *               +------------------------------+   <- caller's sp = CFA = fp + N
+ *               | frame slots (+ align pad)    |  aa_fp_off_slot(a,off)=N-off
+ *               |   (callee-saves + locals …)  |     (in [16, N), above record)
+ *               +------------------------------+
+ *  fp = sp -->  | saved x29 (prev fp)          |  aa_fp_off_saved_fp() = 0
+ *               | saved x30 (prev lr)          |  aa_fp_off_saved_lr() = 8
+ *   low addr    +------------------------------+   (N = frame_size; out_stack==0)
  *
- * frame_size = align16(AA_FRAME_SAVE_SIZE + slot_bytes + out_stack).
+ * frame_size (N) = align16(AA_FRAME_SAVE_SIZE + slot_bytes + out_stack).
  * Tail calls write outgoing args into the caller's incoming-args window —
  * physically the same address, expressed via aa_fp_off_tail_out_arg.
  * ========================================================================== */
@@ -107,19 +127,13 @@ static inline AAFrameLayout aa_build_layout(u32 slot_bytes, u32 out_stack) {
   return L;
 }
 
-/* FP-relative byte offsets. */
+/* FP-relative byte offsets. The saved-pair is at [fp]/[fp+8] in both the
+ * top-record and bottom-record (fp_at_bottom) layouts, so these two are
+ * layout-independent. The frame-size-dependent helpers — aa_fp_off_in_arg,
+ * aa_fp_off_slot, aa_fp_off_tail_out_arg — branch on a->fp_at_bottom and are
+ * defined after AANativeTarget (see aa_fp_off_* below aa_of). */
 static inline i32 aa_fp_off_saved_fp(void) { return 0; }
 static inline i32 aa_fp_off_saved_lr(void) { return 8; }
-static inline i32 aa_fp_off_in_arg(u32 byte_off) {
-  return (i32)(AA_FRAME_SAVE_SIZE + byte_off);
-}
-static inline i32 aa_fp_off_slot(u32 slot_off) { return -(i32)slot_off; }
-/* Outgoing stack args on a tail call land in the caller's incoming-arg
- * window — same physical address the tail-callee will read via
- * aa_fp_off_in_arg. Same helper, distinct name for site-side intent. */
-static inline i32 aa_fp_off_tail_out_arg(u32 byte_off) {
-  return aa_fp_off_in_arg(byte_off);
-}
 
 /* SP-relative byte offsets. */
 static inline i32 aa_sp_off_out_arg(u32 byte_off) { return (i32)byte_off; }
@@ -174,6 +188,11 @@ typedef struct AANativeTarget {
   u32 slots_cap;
   u32 cum_off;
   u32 max_outgoing;
+  /* Final frame size, set once in aa_func_begin_known_frame when fp_at_bottom is
+   * decided. Read by the fp-relative offset helpers in the bottom-record layout
+   * (where slot/incoming-arg offsets depend on frame_size); meaningless and
+   * unread on the single-pass path, which never sets fp_at_bottom. */
+  u32 frame_size_final;
   u32 incoming_stack_size;
   u32 next_param_int;
   u32 next_param_fp;
@@ -207,9 +226,21 @@ typedef struct AANativeTarget {
    * materialization in the prologue (stp x29,x30,[sp,#N-16] instead) and
    * the matching `add x10, fp, #0` in the epilogue (ldp x29,x30,[sp,#N-16]
    * + add sp,sp,#N). Mutually exclusive with `slim_prologue` (Tier A wins
-   * when both would apply). Applies to almost every function with a small
-   * frame, including those with callee-saves and locals. */
+   * when both would apply) and `fp_at_bottom` (which wins for out_stack==0).
+   * Now only reached for small frames with outgoing stack args (out_stack>0),
+   * where the record cannot move to the bottom. Keeps the top-record layout. */
   u8 slim_small_frame;
+  /* Set by aa_func_begin_known_frame for a small frame with callee-saves/locals
+   * and no outgoing stack args: the frame record moves to the bottom of the
+   * frame (fp = sp, `mov x29,sp`) so the sp adjustment folds into a pre-indexed
+   * `stp x29,x30,[sp,#-N]!` entry and post-indexed `ldp x29,x30,[sp],#N` exit
+   * (−2 insns/call vs slim_small_frame). Slots and callee-saves stack ABOVE the
+   * record at positive fp offsets; incoming args sit at fp+frame_size; CFA =
+   * fp+frame_size. The frame-size-dependent offsets are the reason this is only
+   * available on the known-frame path (frame final before the body). Mutually
+   * exclusive with slim_prologue (Tier A) and slim_small_frame; gated on
+   * out_stack==0 && !has_alloca && frame_size <= 504. */
+  u8 fp_at_bottom;
   /* Set by aa_func_begin_known_frame (optimizer path: the full frame is known
    * up front, so the prologue, allocas, and tail epilogues are emitted final
    * with no back-patching). Cleared by aa_func_begin (NativeDirectTarget
@@ -231,6 +262,36 @@ typedef struct AANativeTarget {
 
 static AANativeTarget* aa_of(NativeTarget* t) { return (AANativeTarget*)t; }
 
+/* Layout-aware FP-relative offsets. Every frame use site goes through these;
+ * the fp_at_bottom test lives here and nowhere else.
+ *
+ *   top-record (default):  record near the top, fp anchored at the saved pair.
+ *     incoming args at fp+16+b, slots below fp at -off.  CFA = fp+16.
+ *   bottom-record (fp_at_bottom): record at the bottom, fp = sp.
+ *     incoming args at fp+frame_size+b, slots above the record at
+ *     frame_size-off (in [16, frame_size), never overlapping the 16-byte
+ *     record since frame_size = align16(16+cum_off) >= 16+cum_off).
+ *     CFA = fp+frame_size. */
+static inline i32 aa_fp_off_in_arg(const AANativeTarget* a, u32 byte_off) {
+  u32 base = a->fp_at_bottom ? a->frame_size_final : AA_FRAME_SAVE_SIZE;
+  return (i32)(base + byte_off);
+}
+static inline i32 aa_fp_off_slot(const AANativeTarget* a, u32 slot_off) {
+  return a->fp_at_bottom ? (i32)a->frame_size_final - (i32)slot_off
+                         : -(i32)slot_off;
+}
+/* Outgoing stack args on a tail call land in the caller's incoming-arg window —
+ * the same physical address the tail-callee will read via aa_fp_off_in_arg.
+ * Same helper, distinct name for site-side intent. */
+static inline i32 aa_fp_off_tail_out_arg(const AANativeTarget* a, u32 byte_off) {
+  return aa_fp_off_in_arg(a, byte_off);
+}
+/* CFA = caller's sp, expressed as an fp-relative offset (fp+16 top-record,
+ * fp+frame_size bottom-record). Named so the CFI emit site stays layout-blind. */
+static inline i32 aa_cfa_off(const AANativeTarget* a) {
+  return a->fp_at_bottom ? (i32)a->frame_size_final : (i32)AA_FRAME_SAVE_SIZE;
+}
+
 static void aa_panic(AANativeTarget* a, const char* msg) {
   compiler_panic(a->base.c, a->loc, "aarch64 native target: %s", msg);
 }
@@ -643,7 +704,7 @@ static void aa_addr_base(AANativeTarget* a, NativeAddr addr, u32* base_out,
     case NATIVE_ADDR_BASE_FRAME: {
       AANativeSlot* s = aa_slot(a, addr.base.frame);
       *base_out = AA_FP;
-      *off_out = aa_fp_off_slot(s->off) + addr.offset;
+      *off_out = aa_fp_off_slot(a, s->off) + addr.offset;
       return;
     }
     case NATIVE_ADDR_BASE_GLOBAL: {
@@ -917,7 +978,7 @@ static u32 aa_ldst_q_simm9(int load, u32 rt, u32 rn, i32 byte_off) {
 static void aa_emit_q_frame(AANativeTarget* a, int load, u32 qreg,
                             NativeFrameSlot slot, u32 offset) {
   AANativeSlot* s = aa_slot(a, slot);
-  i32 off = aa_fp_off_slot(s->off) + (i32)offset;
+  i32 off = aa_fp_off_slot(a, s->off) + (i32)offset;
   MCEmitter* mc = a->base.mc;
   if (off >= 0 && ((u32)off & 15u) == 0 && ((u32)off >> 4) <= 0xfffu) {
     aa_emit32(mc, aa_ldst_q_uimm(load, qreg, AA_FP, (u32)off));
@@ -1007,6 +1068,8 @@ static void aa_func_begin_common(NativeTarget* t, const CGFuncDesc* fd) {
   a->ncallee_saves = 0;
   a->slim_prologue = 0;
   a->slim_small_frame = 0;
+  a->fp_at_bottom = 0;
+  a->frame_size_final = 0;
   a->known_frame = 0;
   a->has_alloca = 0;
   a->frame_final = 0;
@@ -1205,6 +1268,16 @@ static void aa_words_restore_frame(AANativeTarget* a, u32* words, u32 cap,
     words[(*n)++] = aa64_ldp64_post(AA_FP, AA_LR, AA_SP, 2);
     return;
   }
+  if (a->fp_at_bottom) {
+    /* Bottom-record fold: `ldp x29,x30,[sp],#N` reloads the pair from the
+     * bottom AND releases the whole frame in one insn. Callee-saves were
+     * already restored by aa_emit_callee_restores. -1 insn vs slim_small_frame
+     * (which needs a separate `add sp`). N <= 504 holds the post-index imm. */
+    if (*n + 1u > cap) aa_panic(a, "instruction patch too small");
+    words[(*n)++] =
+        aa64_ldp64_post(AA_FP, AA_LR, AA_SP, (i32)(L->frame_size / 8u));
+    return;
+  }
   if (a->slim_small_frame) {
     /* `ldp x29,x30,[sp,#saved_pair] ; add sp,sp,#frame_size` — load through
      * sp avoids the fat path's `add x10, fp, #0` scratch, and the subsequent
@@ -1237,14 +1310,16 @@ static void aa_words_callee_saves(AANativeTarget* a, int save, u32* words,
                                   u32 cap, u32* n) {
   for (u32 i = 0; i < a->ncallee_saves;) {
     const AACalleeSave* cs = &a->callee_saves[i];
-    i32 off = aa_fp_off_slot(aa_slot(a, cs->slot)->off);
-    if (off < -256 || off > 255)
-      aa_panic(a, "callee-save offset out of prologue range");
+    i32 off = aa_fp_off_slot(a, aa_slot(a, cs->slot)->off);
     if (i + 1u < a->ncallee_saves && cs->cls == (u8)NATIVE_REG_INT &&
         a->callee_saves[i + 1u].cls == (u8)NATIVE_REG_INT) {
       const AACalleeSave* cs2 = &a->callee_saves[i + 1u];
-      i32 off2 = aa_fp_off_slot(aa_slot(a, cs2->slot)->off);
-      /* off2 = off - 8 (lower address; reserve allocates downward). */
+      i32 off2 = aa_fp_off_slot(a, aa_slot(a, cs2->slot)->off);
+      /* cs2 is reserved after cs (larger slot.off), so it is the lower address
+       * in both layouts (off2 = off - 8): stp's Rt = cs2, Rt2 = cs, base off2.
+       * stp/ldp's signed-7-bit scaled immediate reaches ±504. */
+      if (off2 < -512 || off2 > 504)
+        aa_panic(a, "callee-save pair offset out of prologue range");
       if (*n >= cap) aa_panic(a, "prologue too large");
       words[(*n)++] = save
                           ? aa64_stp64_soff(cs2->reg, cs->reg, AA_FP, off2 / 8)
@@ -1253,8 +1328,20 @@ static void aa_words_callee_saves(AANativeTarget* a, int save, u32* words,
     } else {
       u32 v = cs->cls == (u8)NATIVE_REG_FP ? 1u : 0u;
       if (*n >= cap) aa_panic(a, "prologue too large");
-      words[(*n)++] = save ? aa_stur_v(3, v, cs->reg, AA_FP, off)
-                           : aa_ldur_v(3, v, cs->reg, AA_FP, off);
+      if (a->fp_at_bottom) {
+        /* Positive, 8-aligned offset above the record (up to frame_size-8 ≤
+         * 496): the unscaled stur (±256) can't reach it, so use the scaled
+         * unsigned-imm str/ldr. */
+        if (off < 0 || (u32)off > 0x7ff8u)
+          aa_panic(a, "callee-save offset out of prologue range");
+        words[(*n)++] = save ? aa_str_uimm_v(3, v, cs->reg, AA_FP, (u32)off)
+                             : aa_ldr_uimm_v(3, v, cs->reg, AA_FP, (u32)off);
+      } else {
+        if (off < -256 || off > 255)
+          aa_panic(a, "callee-save offset out of prologue range");
+        words[(*n)++] = save ? aa_stur_v(3, v, cs->reg, AA_FP, off)
+                             : aa_ldur_v(3, v, cs->reg, AA_FP, off);
+      }
       i += 1u;
     }
   }
@@ -1265,11 +1352,10 @@ static void aa_words_callee_saves(AANativeTarget* a, int save, u32* words,
  * a fixed worst-case region, then patches it here) and the optimizer path
  * (aa_func_begin_known_frame emits exactly these words up front).
  *
- * All three variants establish the same post-prologue state defined by L:
- *   sp = caller's sp - L->frame_size
- *   fp = sp + aa_sp_off_saved_pair(L)  (saved-pair address)
- *   saved x29/x30 at [fp], [fp+8]
- *   callee-saves at [fp - s->off] for each. */
+ * All variants establish a post-prologue state defined by L: saved x29/x30 at
+ * [fp]/[fp+8], callee-saves at aa_fp_off_slot of each. The top-record variants
+ * leave fp = sp + aa_sp_off_saved_pair(L) (saved-pair near the top); the
+ * bottom-record variant leaves fp = sp (saved-pair at the bottom). */
 static u32 aa_build_prologue_words(AANativeTarget* a, const AAFrameLayout* L,
                                    u32* words, u32 cap) {
   u32 n = 0;
@@ -1283,6 +1369,17 @@ static u32 aa_build_prologue_words(AANativeTarget* a, const AAFrameLayout* L,
     words[n++] = aa64_add_imm(1, AA_FP, AA_SP, 0, 0);
     return n;
   }
+  if (a->fp_at_bottom) {
+    /* Bottom-record fold: `stp x29,x30,[sp,#-N]!` decrements sp by the whole
+     * frame AND saves the pair at the new bottom in one insn; `mov x29,sp`
+     * (add #0) anchors fp there. Callee-saves then stack above the record at
+     * positive offsets. -2 insns/call vs the top-record slim_small_frame. */
+    if (n + 2u > cap) aa_panic(a, "prologue too large");
+    words[n++] = aa64_stp64_pre(AA_FP, AA_LR, AA_SP, -(i32)(L->frame_size / 8u));
+    words[n++] = aa64_add_imm(1, AA_FP, AA_SP, 0, 0);
+    aa_words_callee_saves(a, 1, words, cap, &n);
+    return n;
+  }
   aa_words_sub_sp_frame(a, words, cap, &n, L->frame_size);
   if (a->slim_small_frame) {
     /* `stp x29, x30, [sp, #saved_pair_off]` — skip the `add x17, sp, #...`
@@ -1419,12 +1516,14 @@ static void aa_func_end(NativeTarget* t) {
     aa_apply_patches(a, &L);
   }
   if (mc->cfi_set_next_pc_offset && mc->cfi_def_cfa && mc->cfi_offset) {
+    i32 cfa = aa_cfa_off(a);
     mc->cfi_set_next_pc_offset(mc, prologue_region * 4u);
-    /* CFA = caller's sp = fp + AA_FRAME_SAVE_SIZE. saved fp/lr at fp/fp+8
-     * (= CFA-16, CFA-8). Unified across all three prologue layouts. */
-    mc->cfi_def_cfa(mc, AA_FP, AA_FRAME_SAVE_SIZE);
-    mc->cfi_offset(mc, AA_FP, aa_fp_off_saved_fp() - (i32)AA_FRAME_SAVE_SIZE);
-    mc->cfi_offset(mc, AA_LR, aa_fp_off_saved_lr() - (i32)AA_FRAME_SAVE_SIZE);
+    /* CFA = caller's sp, an fp-relative offset that depends on the layout:
+     * fp+16 (top-record) or fp+frame_size (bottom-record). saved fp/lr live at
+     * [fp]/[fp+8] in both, hence at CFA-cfa / CFA-cfa+8. */
+    mc->cfi_def_cfa(mc, AA_FP, cfa);
+    mc->cfi_offset(mc, AA_FP, aa_fp_off_saved_fp() - cfa);
+    mc->cfi_offset(mc, AA_LR, aa_fp_off_saved_lr() - cfa);
   }
   obj_symbol_define(t->obj, a->func->sym, a->func->text_section_id,
                     a->func_start, mc->pos(mc) - a->func_start);
@@ -1508,16 +1607,25 @@ static void aa_func_begin_known_frame(NativeTarget* t, const CGFuncDesc* fd,
       a->max_outgoing = frame->max_outgoing;
   }
   /* Frame is final: slot_bytes (cum_off) and out_stack (max_outgoing) are both
-   * known, so the prologue immediates and slim-form choice are settled here. */
+   * known, so the prologue immediates and slim-form choice are settled here.
+   * frame_size_final must be set before aa_build_prologue_words / entry saves,
+   * since the bottom-record offset helpers read it. */
   L = aa_build_layout(a->cum_off, a->max_outgoing);
+  a->frame_size_final = L.frame_size;
   /* Slim Tier A: no callee-saves, no alloca, no body slots, no outgoing stack
-   * args. slim_small_frame: skip the x17/x10 scratch when the saved-pair offset
-   * fits stp's signed 7-bit scaled immediate. (See aa_func_end for the
-   * single-pass path, which never takes the slim form.) */
+   * args — the whole frame is the 16-byte record. fp_at_bottom: a small frame
+   * with callee-saves/locals and no outgoing stack args; the record moves to
+   * the bottom (fp = sp) so sp adjustment folds into the pre/post-indexed
+   * stp/ldp (frame_size <= 504 keeps the post-index ldp imm in range). Otherwise
+   * slim_small_frame keeps the top-record layout but skips the x17/x10 scratch
+   * (out_stack>0 small frames land here). (See aa_func_end for the single-pass
+   * path, which never takes any slim form.) */
   a->slim_prologue = a->ncallee_saves == 0 && !a->has_alloca &&
                      L.slot_bytes == 0 && L.out_stack == 0;
-  a->slim_small_frame = !a->slim_prologue && !a->has_alloca &&
-                        aa_sp_off_saved_pair(&L) <= 504u;
+  a->fp_at_bottom = !a->slim_prologue && !a->has_alloca && L.out_stack == 0 &&
+                    L.frame_size <= 504u;
+  a->slim_small_frame = !a->slim_prologue && !a->fp_at_bottom &&
+                        !a->has_alloca && aa_sp_off_saved_pair(&L) <= 504u;
   n = aa_build_prologue_words(a, &L, words, AA_PROLOGUE_WORDS);
   for (u32 i = 0; i < n; ++i) aa_emit32(t->mc, words[i]);
   a->minimal_prologue_words = n;
@@ -1637,7 +1745,7 @@ static void aa_load_addr(NativeTarget* t, NativeLoc dst, NativeAddr addr) {
   switch ((NativeAddrBaseKind)addr.base_kind) {
     case NATIVE_ADDR_BASE_FRAME: {
       AANativeSlot* s = aa_slot(a, addr.base.frame);
-      aa_emit_add_imm(a, rd, AA_FP, aa_fp_off_slot(s->off) + addr.offset);
+      aa_emit_add_imm(a, rd, AA_FP, aa_fp_off_slot(a, s->off) + addr.offset);
       aa_apply_index(a, rd, &addr);
       return;
     }
@@ -2251,7 +2359,7 @@ static void aa_store_outgoing_part(NativeTarget* t, int tail_call,
    * (= [fp + 16 + off], same address the tail-callee will read via
    * aa_fp_off_in_arg). Non-tail calls write to the sp-anchored outgoing
    * area at the bottom of the caller's frame. */
-  addr.offset = tail_call ? aa_fp_off_tail_out_arg(stack_off)
+  addr.offset = tail_call ? aa_fp_off_tail_out_arg(aa_of(t), stack_off)
                           : aa_sp_off_out_arg(stack_off);
   aa_emit_mem(aa_of(t), 0, src, addr, mem);
 }
@@ -3505,7 +3613,7 @@ static void aa_bind_native_param(NativeTarget* t, const CGParamDesc* p,
       memset(&saddr, 0, sizeof saddr);
       saddr.base_kind = NATIVE_ADDR_BASE_REG;
       saddr.base.reg = AA_FP;
-      saddr.offset = aa_fp_off_in_arg(a->next_param_stack);
+      saddr.offset = aa_fp_off_in_arg(a, a->next_param_stack);
       aa_emit_mem(a, 1, src, saddr, aa_mem_for_type(t, p->type, 8));
       a->next_param_stack += 8u;
     }
@@ -3548,7 +3656,7 @@ static void aa_bind_native_param(NativeTarget* t, const CGParamDesc* p,
       saddr.base_kind = NATIVE_ADDR_BASE_REG;
       saddr.base.reg = AA_FP;
       saddr.base_type = p->type;
-      saddr.offset = aa_fp_off_in_arg(a->next_param_stack);
+      saddr.offset = aa_fp_off_in_arg(a, a->next_param_stack);
       aa_emit_mem(a, 1, src, saddr, aa_mem_for_type(t, p->type, part->size));
       a->next_param_stack += aa_part_stack_size(part);
     }
@@ -3712,7 +3820,7 @@ static void aa_va_start_core(AANativeTarget* a, NativeAddr ap) {
     /* `va_list = &<first vararg>`. Variadic stack args follow the fixed
      * incoming params in the same caller window, so the offset is the
      * current next_param_stack cursor. */
-    aa_emit_add_imm(a, AA_TMP0, AA_FP, aa_fp_off_in_arg(a->next_param_stack));
+    aa_emit_add_imm(a, AA_TMP0, AA_FP, aa_fp_off_in_arg(a, a->next_param_stack));
     aa_emit_mem(a, 0, ptr, ap, aa_mem_for_type(t, ptr.type, 8));
     return;
   }
@@ -3731,7 +3839,7 @@ static void aa_va_start_core(AANativeTarget* a, NativeAddr ap) {
     /* __stack points at the incoming stack args, which sit above the saved
      * fp/lr pair — the same address bind_param uses (aa_fp_off_in_arg), not the
      * raw next_param_stack cursor. */
-    aa_emit_add_imm(a, AA_TMP0, AA_FP, aa_fp_off_in_arg(a->next_param_stack));
+    aa_emit_add_imm(a, AA_TMP0, AA_FP, aa_fp_off_in_arg(a, a->next_param_stack));
     aa_emit_mem(a, 0, ptr, aa_reg_addr(ptr.type, base, (i32)vai.stack_offset),
                 ptr_mem);
     aa_emit_add_imm(a, AA_TMP0, AA_FP,
diff --git a/test/toy/cases/140_fp_callee_save_bottom_frame.expected b/test/toy/cases/140_fp_callee_save_bottom_frame.expected
@@ -0,0 +1 @@
+964
diff --git a/test/toy/cases/140_fp_callee_save_bottom_frame.toy b/test/toy/cases/140_fp_callee_save_bottom_frame.toy
@@ -0,0 +1,38 @@
+// Regression: aarch64 -O1 bottom-record (fp_at_bottom) frame layout must save
+// and restore FP callee-saves (v8-v15) correctly. FP saves are always
+// single-register, so they take the scaled `str d, [fp, #off]` / `ldr` path —
+// the unscaled stur (+-256) the top-record layout uses cannot reach the large
+// positive offsets the bottom-record layout produces.
+//
+// `acc` (an f64) is live across the call to `bump`, forcing an FP callee-save.
+// The 40-element local array bloats the frame past 256 bytes, so the FP
+// callee-save (reserved first => largest offset = frame_size-8) lands well past
+// 255 and exercises the scaled d-register store/load. The body sums everything
+// so nothing is dead-code-eliminated; the result is checked against .expected.
+
+fn bump(x: i64): i64 { return x + 100; }
+
+fn big_fp(seed: i64): i64 {
+  var buf: [40]i64;
+  var i: i64 = 0;
+  while i < 40 {
+    buf[i] = seed + i;
+    i = i + 1;
+  }
+  var acc: f64 = (seed as f64) + 0.5;   // f64 live across the call below
+  let y: i64 = bump(seed);              // call: acc must survive in d8..d15
+  var s: i64 = 0;
+  i = 0;
+  while i < 40 {
+    s = s + buf[i];
+    i = i + 1;
+  }
+  // seed=2: acc=2.5 -> 2; y=102; sum(2..41)=2*40+780=860; total=2+102+860=964.
+  return (acc as i64) + y + s;
+}
+
+fn __user_main(): i64 {
+  return big_fp(2);
+}
+
+fn main(): i32 { return __user_main() as i32; }

	kit kit
	git clone https://git.ryansepassi.com/git/kit.git
	Log \| Files \| Refs \| README

M	doc/OPT_O1_PERF_TODO.md	\|	98	++++++++++++++++++++++++++++++++++++++++---------------------------------------
M	doc/PERCALL.md	\|	113	+++++++++++++++++++++++++++++++++++++++++--------------------------------------
M	src/arch/aa64/native.c	\|	216	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------
A	test/toy/cases/140_fp_callee_save_bottom_frame.expected	\|	1	+
A	test/toy/cases/140_fp_callee_save_bottom_frame.toy	\|	38	++++++++++++++++++++++++++++++++++++++