doc: add PERCALL.md detailing -O1 per-call overhead and the optimal - kit

commit a3da4df2317690b3bff662fffd9fd1f3455e580f
parent 76182e3a5a1b436ec5df54f8405da50f3ee1e16f
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Thu, 28 May 2026 17:23:20 -0700

doc: add PERCALL.md detailing -O1 per-call overhead and the optimal

Quantifies the fixed prologue/epilogue + call-shaped body cost from the
binary-trees disassembly: slim_small_frame is 7 fixed insns/call (4 entry
+ 3 exit) where the pre-indexed optimal is 5, plus the zero-through-temp
and b-A;A:b-B warts. Cross-links the OPT_O1_PERF_TODO.md items.

Diffstat:
A doc/PERCALL.md  | 127 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

1 file changed, 127 insertions(+), 0 deletions(-)
diff --git a/doc/PERCALL.md b/doc/PERCALL.md
@@ -0,0 +1,127 @@
+# cfree -O1 per-call overhead (aarch64)
+
+The fixed cost cfree `-O1` pays around *every* function call — prologue,
+epilogue, and call-site argument setup — independent of the function body. On
+call-heavy workloads (binary-trees: ~7.6M calls at depth 19) this dominates,
+and it is the whole reason binary-trees still trails gcc -O0 (0.89×). The
+function *bodies* are already at or ahead of gcc -O0 (cfree keeps recurring
+values in callee-saves where gcc -O0 spills and reloads); see
+[OPT_O1_PERF_TODO.md](OPT_O1_PERF_TODO.md) for the benchmark standings.
+
+Numbers below are from `/tmp/mc/binary-trees.cfree.o` vs `binary-trees.gcc.o`
+(gcc-15 -O0), native arm64/Apple.
+
+## Prologue / epilogue tiers
+
+cfree picks one of three frame shapes per function. "Fixed insns" excludes the
+final `ret` (always required) and counts entry + exit.
+
+| tier | when | fixed insns | pre-indexed? |
+| --- | --- | ---: | --- |
+| Tier A (`slim_prologue`) | no callee-saves, no alloca, no body slots, no outgoing stack | 3 (2 entry + 1 exit) | yes — optimal |
+| `slim_small_frame` | ≥1 callee-save (or body slots), saved-pair offset ≤ 504 | **7 (4 entry + 3 exit)** | **no** |
+| fat | large frame, alloca, big saved-pair offset | 7+ | no |
+
+Tier A (leaf-ish, no callee-saves) is already optimal. The gap is entirely in
+`slim_small_frame`, which is the common case — any function that keeps a value
+live across a call lands here.
+
+### Current `slim_small_frame` (the four binary-trees functions)
+
+Entry (4 insns) — e.g. `NewTreeNode`, two callee-saves:
+```
+sub  sp, sp, #32
+stp  x29, x30, [sp, #16]      ; frame record mid-frame
+add  x29, sp, #16             ; fp points above the callee-save area
+stp  x20, x19, [x29, #-16]    ; callee-saves below fp
+```
+Exit (3 insns + ret):
+```
+ldp  x20, x19, [x29, #-16]
+ldp  x29, x30, [sp, #16]
+add  sp, sp, #32
+ret
+```
+
+### Optimal `slim_small_frame` (what gcc -O0 already emits)
+
+Entry (3 insns):
+```
+stp  x29, x30, [sp, #-32]!    ; one insn: decrement sp AND save fp/lr
+mov  x29, sp                  ; frame record at the bottom of the frame
+stp  x19, x20, [sp, #16]      ; callee-saves above the record (str for one)
+```
+Exit (2 insns + ret):
+```
+ldp  x19, x20, [sp, #16]
+ldp  x29, x30, [sp], #32      ; one insn: restore fp/lr AND release sp
+ret
+```
+
+**−2 fixed insns per call** (−1 entry, −1 exit). The blocker is the layout:
+cfree puts the frame record above the callee-saves (`add x29, sp, #16`), so the
+sp-decrement and the fp/lr save cannot fold into a single pre-indexed `stp`.
+Moving the frame record to the bottom of the frame (fp-at-bottom, `mov x29, sp`)
+unlocks the pre-indexed entry `stp …, [sp, #-N]!` and post-indexed exit
+`ldp …, [sp], #N`. This is [OPT_O1_PERF_TODO.md](OPT_O1_PERF_TODO.md) item 2,
+the largest open per-call win.
+
+(An aside: omitting the frame pointer entirely — packing a callee-save with lr,
+e.g. `stp x19, x30, [sp, #-N]!`, no `x29` at all — would save one more insn, but
+neither cfree nor gcc -O0 does this; it costs unwind/debug fidelity and is not
+planned.)
+
+## Body-level per-call warts
+
+Two small constant adders beyond the prologue, both call-shaped:
+
+1. **Zero through a temp** (item 3). `BottomUpTree`'s leaf `NewTreeNode(NULL,
+   NULL)` materializes each null arg via a scratch:
+   ```
+   movz x8, 0x0 ; mov x0, x8 ; movz x8, 0x0 ; mov x1, x8     ; 4 insns
+   ```
+   Optimal is `mov x0, #0 ; mov x1, #0` (2 insns). **−2** on that path.
+   `IR_LOAD_IMM` sources don't participate in the call-arg register-hint
+   propagation in `pass_lower.c`.
+
+2. **Redundant branch chain** (item 4). `DeleteTree`'s if/else merge emits
+   `b A; A: b B` — a conditional branch to a label that just unconditionally
+   branches on to the real target. **−1** per `DeleteTree` call.
+   `cleanup_layout_fallthrough_branches` in `pass_jump.c` doesn't thread this
+   shape.
+
+## Per-function tally (fixed + warts vs gcc -O0)
+
+| function | cfree insns | gcc -O0 insns | cfree overhead vs optimal |
+| --- | ---: | ---: | --- |
+| NewTreeNode | 14 | 16 | +2 prologue |
+| ItemCheck | 20 | 21 | +2 prologue |
+| BottomUpTree | 25 | 24 | +2 prologue, +2 zero-temp |
+| DeleteTree | 20 | 18 | +2 prologue, +1 branch |
+
+cfree already beats gcc -O0 on the malloc-heavy `NewTreeNode` (args stay in
+registers vs gcc's 8 stack shuffles) and ties `ItemCheck`. It loses
+`BottomUpTree`/`DeleteTree` purely on the fixed overhead above.
+
+## Optimal target
+
+Per-call fixed overhead for the common (`slim_small_frame`) case:
+
+- **current:** 7 insns (4 entry + 3 exit) + ret
+- **optimal:** 5 insns (3 entry + 2 exit) + ret  → **−2/call**
+
+Across binary-trees (~7.6M calls): item 2 alone removes ~15M instructions;
+adding items 3 and 4 brings the total to ~30M. That is expected to flip
+`BottomUpTree` and `DeleteTree` ahead of gcc -O0 and move the benchmark past the
+1.10× bar.
+
+## Reproducing
+
+```sh
+SDK="$(xcrun --show-sdk-path)"
+SRC=~/tmp/mir/c-benchmarks/binary-trees.c
+build/release/cfree cc -O1 -std=c99 --sysroot "$SDK" -c "$SRC" -o /tmp/bt.cfree.o
+gcc-15 -O0 -c "$SRC" -o /tmp/bt.gcc.o
+build/release/cfree objdump -d /tmp/bt.cfree.o
+llvm-objdump -d /tmp/bt.gcc.o
+```

	kit kit
	git clone https://git.ryansepassi.com/git/kit.git
	Log \| Files \| Refs \| README