kit

kit
git clone https://git.ryansepassi.com/git/kit.git
Log | Files | Refs | README

commit a3da4df2317690b3bff662fffd9fd1f3455e580f
parent 76182e3a5a1b436ec5df54f8405da50f3ee1e16f
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Thu, 28 May 2026 17:23:20 -0700

doc: add PERCALL.md detailing -O1 per-call overhead and the optimal

Quantifies the fixed prologue/epilogue + call-shaped body cost from the
binary-trees disassembly: slim_small_frame is 7 fixed insns/call (4 entry
+ 3 exit) where the pre-indexed optimal is 5, plus the zero-through-temp
and b-A;A:b-B warts. Cross-links the OPT_O1_PERF_TODO.md items.

Diffstat:
Adoc/PERCALL.md | 127+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 127 insertions(+), 0 deletions(-)

diff --git a/doc/PERCALL.md b/doc/PERCALL.md @@ -0,0 +1,127 @@ +# cfree -O1 per-call overhead (aarch64) + +The fixed cost cfree `-O1` pays around *every* function call — prologue, +epilogue, and call-site argument setup — independent of the function body. On +call-heavy workloads (binary-trees: ~7.6M calls at depth 19) this dominates, +and it is the whole reason binary-trees still trails gcc -O0 (0.89×). The +function *bodies* are already at or ahead of gcc -O0 (cfree keeps recurring +values in callee-saves where gcc -O0 spills and reloads); see +[OPT_O1_PERF_TODO.md](OPT_O1_PERF_TODO.md) for the benchmark standings. + +Numbers below are from `/tmp/mc/binary-trees.cfree.o` vs `binary-trees.gcc.o` +(gcc-15 -O0), native arm64/Apple. + +## Prologue / epilogue tiers + +cfree picks one of three frame shapes per function. "Fixed insns" excludes the +final `ret` (always required) and counts entry + exit. + +| tier | when | fixed insns | pre-indexed? | +| --- | --- | ---: | --- | +| Tier A (`slim_prologue`) | no callee-saves, no alloca, no body slots, no outgoing stack | 3 (2 entry + 1 exit) | yes — optimal | +| `slim_small_frame` | ≥1 callee-save (or body slots), saved-pair offset ≤ 504 | **7 (4 entry + 3 exit)** | **no** | +| fat | large frame, alloca, big saved-pair offset | 7+ | no | + +Tier A (leaf-ish, no callee-saves) is already optimal. The gap is entirely in +`slim_small_frame`, which is the common case — any function that keeps a value +live across a call lands here. + +### Current `slim_small_frame` (the four binary-trees functions) + +Entry (4 insns) — e.g. `NewTreeNode`, two callee-saves: +``` +sub sp, sp, #32 +stp x29, x30, [sp, #16] ; frame record mid-frame +add x29, sp, #16 ; fp points above the callee-save area +stp x20, x19, [x29, #-16] ; callee-saves below fp +``` +Exit (3 insns + ret): +``` +ldp x20, x19, [x29, #-16] +ldp x29, x30, [sp, #16] +add sp, sp, #32 +ret +``` + +### Optimal `slim_small_frame` (what gcc -O0 already emits) + +Entry (3 insns): +``` +stp x29, x30, [sp, #-32]! ; one insn: decrement sp AND save fp/lr +mov x29, sp ; frame record at the bottom of the frame +stp x19, x20, [sp, #16] ; callee-saves above the record (str for one) +``` +Exit (2 insns + ret): +``` +ldp x19, x20, [sp, #16] +ldp x29, x30, [sp], #32 ; one insn: restore fp/lr AND release sp +ret +``` + +**−2 fixed insns per call** (−1 entry, −1 exit). The blocker is the layout: +cfree puts the frame record above the callee-saves (`add x29, sp, #16`), so the +sp-decrement and the fp/lr save cannot fold into a single pre-indexed `stp`. +Moving the frame record to the bottom of the frame (fp-at-bottom, `mov x29, sp`) +unlocks the pre-indexed entry `stp …, [sp, #-N]!` and post-indexed exit +`ldp …, [sp], #N`. This is [OPT_O1_PERF_TODO.md](OPT_O1_PERF_TODO.md) item 2, +the largest open per-call win. + +(An aside: omitting the frame pointer entirely — packing a callee-save with lr, +e.g. `stp x19, x30, [sp, #-N]!`, no `x29` at all — would save one more insn, but +neither cfree nor gcc -O0 does this; it costs unwind/debug fidelity and is not +planned.) + +## Body-level per-call warts + +Two small constant adders beyond the prologue, both call-shaped: + +1. **Zero through a temp** (item 3). `BottomUpTree`'s leaf `NewTreeNode(NULL, + NULL)` materializes each null arg via a scratch: + ``` + movz x8, 0x0 ; mov x0, x8 ; movz x8, 0x0 ; mov x1, x8 ; 4 insns + ``` + Optimal is `mov x0, #0 ; mov x1, #0` (2 insns). **−2** on that path. + `IR_LOAD_IMM` sources don't participate in the call-arg register-hint + propagation in `pass_lower.c`. + +2. **Redundant branch chain** (item 4). `DeleteTree`'s if/else merge emits + `b A; A: b B` — a conditional branch to a label that just unconditionally + branches on to the real target. **−1** per `DeleteTree` call. + `cleanup_layout_fallthrough_branches` in `pass_jump.c` doesn't thread this + shape. + +## Per-function tally (fixed + warts vs gcc -O0) + +| function | cfree insns | gcc -O0 insns | cfree overhead vs optimal | +| --- | ---: | ---: | --- | +| NewTreeNode | 14 | 16 | +2 prologue | +| ItemCheck | 20 | 21 | +2 prologue | +| BottomUpTree | 25 | 24 | +2 prologue, +2 zero-temp | +| DeleteTree | 20 | 18 | +2 prologue, +1 branch | + +cfree already beats gcc -O0 on the malloc-heavy `NewTreeNode` (args stay in +registers vs gcc's 8 stack shuffles) and ties `ItemCheck`. It loses +`BottomUpTree`/`DeleteTree` purely on the fixed overhead above. + +## Optimal target + +Per-call fixed overhead for the common (`slim_small_frame`) case: + +- **current:** 7 insns (4 entry + 3 exit) + ret +- **optimal:** 5 insns (3 entry + 2 exit) + ret → **−2/call** + +Across binary-trees (~7.6M calls): item 2 alone removes ~15M instructions; +adding items 3 and 4 brings the total to ~30M. That is expected to flip +`BottomUpTree` and `DeleteTree` ahead of gcc -O0 and move the benchmark past the +1.10× bar. + +## Reproducing + +```sh +SDK="$(xcrun --show-sdk-path)" +SRC=~/tmp/mir/c-benchmarks/binary-trees.c +build/release/cfree cc -O1 -std=c99 --sysroot "$SDK" -c "$SRC" -o /tmp/bt.cfree.o +gcc-15 -O0 -c "$SRC" -o /tmp/bt.gcc.o +build/release/cfree objdump -d /tmp/bt.cfree.o +llvm-objdump -d /tmp/bt.gcc.o +```