kit

kit
git clone https://git.ryansepassi.com/git/kit.git
Log | Files | Refs | README

commit 53cac58ca3c58f8646a21e1b81c1a0c78eb438a7
parent 665e83ecd46ad0bd3d827a8ba8d3edc99c6f0a6f
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Thu, 28 May 2026 19:02:32 -0700

doc: refresh OPT_O1_PERF_TODO standings (2026-05-28 sweep)

Single consistent 3-run sweep across all six tracked benches on
aarch64/Apple. Notable movement: lists 4843->4015 ms now clears the 1.10x
bar against both gcc -O0 (2.20x) and mir -O1 (1.24x); sieve improves to
0.85x vs mir; binary-trees, hash2, mandelbrot, strcat essentially flat.
Geomean cfree -O1 (cc) is 1.44x faster than gcc -O0.

Diffstat:
Mdoc/OPT_O1_PERF_TODO.md | 76++++++++++++++++++++++++++++++++++++++++++----------------------------------
1 file changed, 42 insertions(+), 34 deletions(-)

diff --git a/doc/OPT_O1_PERF_TODO.md b/doc/OPT_O1_PERF_TODO.md @@ -19,31 +19,35 @@ reference exists). ## Current standings -Numbers below are from a 3-run sweep on aarch64/Apple (`COMPILE_REPEATS=3`, -`RUN_REPEATS=3`, best-of), runtime in ms; speedup = reference_time / -cfree_time (>1 means cfree is faster). gcc-O0 and mir-O1 columns come from -the cached baseline in `scripts/opt_bench_baseline.csv`; regenerate with +Numbers below are a single 3-run sweep on aarch64/Apple (`COMPILE_REPEATS=3`, +`RUN_REPEATS=3`, best-of), **refreshed 2026-05-28** after the `fp_at_bottom` +prologue fold; runtime in ms; speedup = reference_time / cfree_time (>1 means +cfree is faster). gcc-O0 and mir-O1 columns come from the cached baseline in +`scripts/opt_bench_baseline.csv`; regenerate with `CFREE_OPT_BENCH_MODE=baseline scripts/opt_bench.sh`. | bench | cfree -O1 | gcc -O0 | vs gcc-O0 | mir -O1 | vs mir-O1 | behind | | --- | ---: | ---: | ---: | ---: | ---: | --- | -| binary-trees | 2969² | 2647 | **0.89×** (slower) | n/a¹ | — | gcc (cc path) | -| lists | 4843 | 8868 | 1.83× ✓ | 4997 | 1.03× | mir | -| hash2 | 4988 | 7481 | 1.50× ✓ | 3863 | **0.77×** | mir | -| sieve | 5148 | 5077 | 0.99× (~tied) | 4028 | **0.78×** | gcc (~tied), mir | -| mandelbrot | 3658 | 10274 | 2.81× ✓ | 3346 | 0.91× | mir | -| strcat | 5899 | 5965 | 1.01× (~tied) | 5775 | 0.98× | both (~tied) | +| binary-trees | 2939² | 2647 | **0.90×** (slower) | n/a¹ | — | gcc (cc path) | +| lists | 4015 | 8842 | 2.20× ✓ | 4988 | 1.24× ✓ | — (clears bar) | +| hash2 | 4915 | 7399 | 1.51× ✓ | 3857 | **0.78×** | mir | +| sieve | 4880 | 5032 | 1.03× (~tied) | 4170 | **0.85×** | gcc (~tied), mir | +| mandelbrot | 3655 | 10319 | 2.82× ✓ | 3332 | 0.91× | mir | +| strcat | 5906 | 5971 | 1.01× (~tied) | 5772 | 0.98× | both (~tied) | + +Geomean over the six (cfree -O1 cc): **1.44× faster than gcc -O0**, **0.86× vs +mir -O1** (mir-c2m geomean excludes binary-trees, which it can't compile). ¹ mir-c2m fails to compile `binary-trees`, so only the gcc comparison applies. -² Refreshed 2026-05-28 after the `fp_at_bottom` prologue fold (item 2 below). -The fold landed (−2 fixed insns/call, codegen now byte-for-byte the same -prologue/epilogue shape as gcc -O0) but the **cc-linked runtime is unchanged** -(2973 → 2969): binary-trees is malloc/free-bound, so per-call ALU savings are -noise. The **JIT path is at parity** — `cfree-run -O1` = 2617 ms (~tied with gcc -2647), vs the cc binary's 2969. That ~350 ms cc-only deficit (identical codegen) -points at PLT/GOT indirection for `malloc`/`free` — ~15M dynamic-symbol calls -routed through stubs — as the real remaining cost on the cc path, not codegen. +² The `fp_at_bottom` fold landed (−2 fixed insns/call; codegen now byte-for-byte +the same prologue/epilogue shape as gcc -O0, and the four hot functions are each +smaller than gcc -O0) but the **cc-linked runtime barely moved** (2973 → 2939): +binary-trees is malloc/free-bound, so per-call ALU savings are noise. The **JIT +path is at parity** — `cfree-run -O1` = 2642 ms (~tied with gcc 2647), vs the cc +binary's 2939. That cc-only deficit (identical codegen) points at PLT/GOT +indirection for `malloc`/`free` — ~15M dynamic-symbol calls routed through +stubs — as the real remaining cost on the cc path, not codegen. ## Per-benchmark notes @@ -115,7 +119,7 @@ These items are all resolved or no-impact for binary-trees: the four functions are now smaller than gcc -O0 (12/18/21/16 vs 16/21/24/18). The remaining cc-path runtime gap is allocator/PLT cost, not codegen — see the section intro. -### mandelbrot — 0.92× vs mir (close to the bar) +### mandelbrot — 0.91× vs mir (close to the bar) Inner loop is FP-heavy (`Tr*Tr + Ti*Ti < 4.0` Mandelbrot escape test + 4 fmuls + 2 fadds per iter). Hasn't been deeply investigated since the recent codegen batch. Worth disassembling the hot loop and comparing @@ -127,25 +131,26 @@ and constant-pool material. Small gap; not worth standalone investigation yet. Should naturally absorb any remaining cross-cutting wins. -### hash2 — 0.77× vs mir (still the worst against mir) +### hash2 — 0.78× vs mir (still the worst against mir) The previously-noted hoisting and strength-reduction wins landed (and -moved hash2 from 0.72× to 0.77×), but mir is still 1.29× faster. Remaining +moved hash2 from 0.72× to 0.78×), but mir is still ~1.27× faster. Remaining gap is in the parts of the loop the prior items didn't touch — most likely the modulo `val % ht->size` (mir probably emits a Barrett/reciprocal multiply for the small-divisor case where we still emit `udiv`) and the `strcmp` probe shape. Worth a fresh disassembly read of `ht_hashcode` and the probe loop in `ht_find_new` against mir's output. -### sieve — 0.78× vs mir, ~tied with gcc -Loop-invariant `movz` and IV copies are gone; remaining gap is structural. -mir is ~1.28× faster on the same loop shape (`flags[k] = 0` strided -store + `flags[i] = 1` init). Candidate gaps: address-mode folding into -the store (using `[x19, x8]` vs `add` + bare addr), and whether mir is -auto-vectorizing the init loop. +### sieve — 0.85× vs mir, ~tied with gcc +Improved (5148 → 4880 ms; 0.78× → 0.85× vs mir). Loop-invariant `movz` and IV +copies are gone; remaining gap is structural. mir is ~1.18× faster on the same +loop shape (`flags[k] = 0` strided store + `flags[i] = 1` init). Candidate gaps: +address-mode folding into the store (using `[x19, x8]` vs `add` + bare addr), +and whether mir is auto-vectorizing the init loop. -### lists — 1.03× vs mir (close to the bar) -Down from 0.95×. Doubly-linked list traversal + splice. Within ~5% of mir; -worth comparing the splice inner loop directly. +### lists — CLEARED (1.24× vs mir, 2.20× vs gcc) +Was 1.03× vs mir; now 4015 ms (down from 4843), clearing the 1.10× bar against +both references. Doubly-linked list traversal + splice. No longer a tracked +target — kept here for the record; drop on the next cleanup if it holds. ## Cross-cutting fixes (open) @@ -200,7 +205,10 @@ disassembly of the hot function. ## Notes / caveats -- Numbers above mix sweeps from different revisions; re-run with the same - `COMPILE_REPEATS=3 RUN_REPEATS=3` after a change to confirm movement. -- `cfree-run` (JIT) shares this codegen, so its runtimes track `cfree cc -O1`; - fixing these helps both paths. +- The cfree -O1 column is a single 2026-05-28 sweep (consistent revision); the + gcc/mir baseline columns are the cached `opt_bench_baseline.csv`. Re-run with + the same `COMPILE_REPEATS=3 RUN_REPEATS=3` after a change to confirm movement. +- `cfree-run` (JIT) shares this codegen, so its runtimes track `cfree cc -O1` — + except where the cc binary pays link-time overhead the JIT doesn't (e.g. + binary-trees: JIT is ~10% faster than the cc binary on the same code, the + PLT/GOT `malloc`/`free` indirection noted above).