doc: refresh OPT_O1_PERF_TODO standings (2026-05-28 sweep) - kit

commit 53cac58ca3c58f8646a21e1b81c1a0c78eb438a7
parent 665e83ecd46ad0bd3d827a8ba8d3edc99c6f0a6f
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Thu, 28 May 2026 19:02:32 -0700

doc: refresh OPT_O1_PERF_TODO standings (2026-05-28 sweep)

Single consistent 3-run sweep across all six tracked benches on
aarch64/Apple. Notable movement: lists 4843->4015 ms now clears the 1.10x
bar against both gcc -O0 (2.20x) and mir -O1 (1.24x); sieve improves to
0.85x vs mir; binary-trees, hash2, mandelbrot, strcat essentially flat.
Geomean cfree -O1 (cc) is 1.44x faster than gcc -O0.

Diffstat:
M doc/OPT_O1_PERF_TODO.md  | 76 ++++++++++++++++++++++++++++++++++++++++++----------------------------------

1 file changed, 42 insertions(+), 34 deletions(-)
diff --git a/doc/OPT_O1_PERF_TODO.md b/doc/OPT_O1_PERF_TODO.md
@@ -19,31 +19,35 @@ reference exists).
 
 ## Current standings
 
-Numbers below are from a 3-run sweep on aarch64/Apple (`COMPILE_REPEATS=3`,
-`RUN_REPEATS=3`, best-of), runtime in ms; speedup = reference_time /
-cfree_time (>1 means cfree is faster). gcc-O0 and mir-O1 columns come from
-the cached baseline in `scripts/opt_bench_baseline.csv`; regenerate with
+Numbers below are a single 3-run sweep on aarch64/Apple (`COMPILE_REPEATS=3`,
+`RUN_REPEATS=3`, best-of), **refreshed 2026-05-28** after the `fp_at_bottom`
+prologue fold; runtime in ms; speedup = reference_time / cfree_time (>1 means
+cfree is faster). gcc-O0 and mir-O1 columns come from the cached baseline in
+`scripts/opt_bench_baseline.csv`; regenerate with
 `CFREE_OPT_BENCH_MODE=baseline scripts/opt_bench.sh`.
 
 | bench | cfree -O1 | gcc -O0 | vs gcc-O0 | mir -O1 | vs mir-O1 | behind |
 | --- | ---: | ---: | ---: | ---: | ---: | --- |
-| binary-trees | 2969² | 2647 | **0.89×** (slower) | n/a¹ | — | gcc (cc path) |
-| lists | 4843 | 8868 | 1.83× ✓ | 4997 | 1.03× | mir |
-| hash2 | 4988 | 7481 | 1.50× ✓ | 3863 | **0.77×** | mir |
-| sieve | 5148 | 5077 | 0.99× (~tied) | 4028 | **0.78×** | gcc (~tied), mir |
-| mandelbrot | 3658 | 10274 | 2.81× ✓ | 3346 | 0.91× | mir |
-| strcat | 5899 | 5965 | 1.01× (~tied) | 5775 | 0.98× | both (~tied) |
+| binary-trees | 2939² | 2647 | **0.90×** (slower) | n/a¹ | — | gcc (cc path) |
+| lists | 4015 | 8842 | 2.20× ✓ | 4988 | 1.24× ✓ | — (clears bar) |
+| hash2 | 4915 | 7399 | 1.51× ✓ | 3857 | **0.78×** | mir |
+| sieve | 4880 | 5032 | 1.03× (~tied) | 4170 | **0.85×** | gcc (~tied), mir |
+| mandelbrot | 3655 | 10319 | 2.82× ✓ | 3332 | 0.91× | mir |
+| strcat | 5906 | 5971 | 1.01× (~tied) | 5772 | 0.98× | both (~tied) |
+
+Geomean over the six (cfree -O1 cc): **1.44× faster than gcc -O0**, **0.86× vs
+mir -O1** (mir-c2m geomean excludes binary-trees, which it can't compile).
 
 ¹ mir-c2m fails to compile `binary-trees`, so only the gcc comparison applies.
 
-² Refreshed 2026-05-28 after the `fp_at_bottom` prologue fold (item 2 below).
-The fold landed (−2 fixed insns/call, codegen now byte-for-byte the same
-prologue/epilogue shape as gcc -O0) but the **cc-linked runtime is unchanged**
-(2973 → 2969): binary-trees is malloc/free-bound, so per-call ALU savings are
-noise. The **JIT path is at parity** — `cfree-run -O1` = 2617 ms (~tied with gcc
-2647), vs the cc binary's 2969. That ~350 ms cc-only deficit (identical codegen)
-points at PLT/GOT indirection for `malloc`/`free` — ~15M dynamic-symbol calls
-routed through stubs — as the real remaining cost on the cc path, not codegen.
+² The `fp_at_bottom` fold landed (−2 fixed insns/call; codegen now byte-for-byte
+the same prologue/epilogue shape as gcc -O0, and the four hot functions are each
+smaller than gcc -O0) but the **cc-linked runtime barely moved** (2973 → 2939):
+binary-trees is malloc/free-bound, so per-call ALU savings are noise. The **JIT
+path is at parity** — `cfree-run -O1` = 2642 ms (~tied with gcc 2647), vs the cc
+binary's 2939. That cc-only deficit (identical codegen) points at PLT/GOT
+indirection for `malloc`/`free` — ~15M dynamic-symbol calls routed through
+stubs — as the real remaining cost on the cc path, not codegen.
 
 ## Per-benchmark notes
 
@@ -115,7 +119,7 @@ These items are all resolved or no-impact for binary-trees: the four functions
 are now smaller than gcc -O0 (12/18/21/16 vs 16/21/24/18). The remaining
 cc-path runtime gap is allocator/PLT cost, not codegen — see the section intro.
 
-### mandelbrot — 0.92× vs mir (close to the bar)
+### mandelbrot — 0.91× vs mir (close to the bar)
 Inner loop is FP-heavy (`Tr*Tr + Ti*Ti < 4.0` Mandelbrot escape test +
 4 fmuls + 2 fadds per iter). Hasn't been deeply investigated since the
 recent codegen batch. Worth disassembling the hot loop and comparing
@@ -127,25 +131,26 @@ and constant-pool material.
 Small gap; not worth standalone investigation yet. Should naturally
 absorb any remaining cross-cutting wins.
 
-### hash2 — 0.77× vs mir (still the worst against mir)
+### hash2 — 0.78× vs mir (still the worst against mir)
 The previously-noted hoisting and strength-reduction wins landed (and
-moved hash2 from 0.72× to 0.77×), but mir is still 1.29× faster. Remaining
+moved hash2 from 0.72× to 0.78×), but mir is still ~1.27× faster. Remaining
 gap is in the parts of the loop the prior items didn't touch — most
 likely the modulo `val % ht->size` (mir probably emits a Barrett/reciprocal
 multiply for the small-divisor case where we still emit `udiv`) and the
 `strcmp` probe shape. Worth a fresh disassembly read of `ht_hashcode` and
 the probe loop in `ht_find_new` against mir's output.
 
-### sieve — 0.78× vs mir, ~tied with gcc
-Loop-invariant `movz` and IV copies are gone; remaining gap is structural.
-mir is ~1.28× faster on the same loop shape (`flags[k] = 0` strided
-store + `flags[i] = 1` init). Candidate gaps: address-mode folding into
-the store (using `[x19, x8]` vs `add` + bare addr), and whether mir is
-auto-vectorizing the init loop.
+### sieve — 0.85× vs mir, ~tied with gcc
+Improved (5148 → 4880 ms; 0.78× → 0.85× vs mir). Loop-invariant `movz` and IV
+copies are gone; remaining gap is structural. mir is ~1.18× faster on the same
+loop shape (`flags[k] = 0` strided store + `flags[i] = 1` init). Candidate gaps:
+address-mode folding into the store (using `[x19, x8]` vs `add` + bare addr),
+and whether mir is auto-vectorizing the init loop.
 
-### lists — 1.03× vs mir (close to the bar)
-Down from 0.95×. Doubly-linked list traversal + splice. Within ~5% of mir;
-worth comparing the splice inner loop directly.
+### lists — CLEARED (1.24× vs mir, 2.20× vs gcc)
+Was 1.03× vs mir; now 4015 ms (down from 4843), clearing the 1.10× bar against
+both references. Doubly-linked list traversal + splice. No longer a tracked
+target — kept here for the record; drop on the next cleanup if it holds.
 
 ## Cross-cutting fixes (open)
 
@@ -200,7 +205,10 @@ disassembly of the hot function.
 
 ## Notes / caveats
 
-- Numbers above mix sweeps from different revisions; re-run with the same
-  `COMPILE_REPEATS=3 RUN_REPEATS=3` after a change to confirm movement.
-- `cfree-run` (JIT) shares this codegen, so its runtimes track `cfree cc -O1`;
-  fixing these helps both paths.
+- The cfree -O1 column is a single 2026-05-28 sweep (consistent revision); the
+  gcc/mir baseline columns are the cached `opt_bench_baseline.csv`. Re-run with
+  the same `COMPILE_REPEATS=3 RUN_REPEATS=3` after a change to confirm movement.
+- `cfree-run` (JIT) shares this codegen, so its runtimes track `cfree cc -O1` —
+  except where the cc binary pays link-time overhead the JIT doesn't (e.g.
+  binary-trees: JIT is ~10% faster than the cc binary on the same code, the
+  PLT/GOT `malloc`/`free` indirection noted above).

	kit kit
	git clone https://git.ryansepassi.com/git/kit.git
	Log \| Files \| Refs \| README