commit 53cac58ca3c58f8646a21e1b81c1a0c78eb438a7
parent 665e83ecd46ad0bd3d827a8ba8d3edc99c6f0a6f
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Thu, 28 May 2026 19:02:32 -0700
doc: refresh OPT_O1_PERF_TODO standings (2026-05-28 sweep)
Single consistent 3-run sweep across all six tracked benches on
aarch64/Apple. Notable movement: lists 4843->4015 ms now clears the 1.10x
bar against both gcc -O0 (2.20x) and mir -O1 (1.24x); sieve improves to
0.85x vs mir; binary-trees, hash2, mandelbrot, strcat essentially flat.
Geomean cfree -O1 (cc) is 1.44x faster than gcc -O0.
Diffstat:
1 file changed, 42 insertions(+), 34 deletions(-)
diff --git a/doc/OPT_O1_PERF_TODO.md b/doc/OPT_O1_PERF_TODO.md
@@ -19,31 +19,35 @@ reference exists).
## Current standings
-Numbers below are from a 3-run sweep on aarch64/Apple (`COMPILE_REPEATS=3`,
-`RUN_REPEATS=3`, best-of), runtime in ms; speedup = reference_time /
-cfree_time (>1 means cfree is faster). gcc-O0 and mir-O1 columns come from
-the cached baseline in `scripts/opt_bench_baseline.csv`; regenerate with
+Numbers below are a single 3-run sweep on aarch64/Apple (`COMPILE_REPEATS=3`,
+`RUN_REPEATS=3`, best-of), **refreshed 2026-05-28** after the `fp_at_bottom`
+prologue fold; runtime in ms; speedup = reference_time / cfree_time (>1 means
+cfree is faster). gcc-O0 and mir-O1 columns come from the cached baseline in
+`scripts/opt_bench_baseline.csv`; regenerate with
`CFREE_OPT_BENCH_MODE=baseline scripts/opt_bench.sh`.
| bench | cfree -O1 | gcc -O0 | vs gcc-O0 | mir -O1 | vs mir-O1 | behind |
| --- | ---: | ---: | ---: | ---: | ---: | --- |
-| binary-trees | 2969² | 2647 | **0.89×** (slower) | n/a¹ | — | gcc (cc path) |
-| lists | 4843 | 8868 | 1.83× ✓ | 4997 | 1.03× | mir |
-| hash2 | 4988 | 7481 | 1.50× ✓ | 3863 | **0.77×** | mir |
-| sieve | 5148 | 5077 | 0.99× (~tied) | 4028 | **0.78×** | gcc (~tied), mir |
-| mandelbrot | 3658 | 10274 | 2.81× ✓ | 3346 | 0.91× | mir |
-| strcat | 5899 | 5965 | 1.01× (~tied) | 5775 | 0.98× | both (~tied) |
+| binary-trees | 2939² | 2647 | **0.90×** (slower) | n/a¹ | — | gcc (cc path) |
+| lists | 4015 | 8842 | 2.20× ✓ | 4988 | 1.24× ✓ | — (clears bar) |
+| hash2 | 4915 | 7399 | 1.51× ✓ | 3857 | **0.78×** | mir |
+| sieve | 4880 | 5032 | 1.03× (~tied) | 4170 | **0.85×** | gcc (~tied), mir |
+| mandelbrot | 3655 | 10319 | 2.82× ✓ | 3332 | 0.91× | mir |
+| strcat | 5906 | 5971 | 1.01× (~tied) | 5772 | 0.98× | both (~tied) |
+
+Geomean over the six (cfree -O1 cc): **1.44× faster than gcc -O0**, **0.86× vs
+mir -O1** (mir-c2m geomean excludes binary-trees, which it can't compile).
¹ mir-c2m fails to compile `binary-trees`, so only the gcc comparison applies.
-² Refreshed 2026-05-28 after the `fp_at_bottom` prologue fold (item 2 below).
-The fold landed (−2 fixed insns/call, codegen now byte-for-byte the same
-prologue/epilogue shape as gcc -O0) but the **cc-linked runtime is unchanged**
-(2973 → 2969): binary-trees is malloc/free-bound, so per-call ALU savings are
-noise. The **JIT path is at parity** — `cfree-run -O1` = 2617 ms (~tied with gcc
-2647), vs the cc binary's 2969. That ~350 ms cc-only deficit (identical codegen)
-points at PLT/GOT indirection for `malloc`/`free` — ~15M dynamic-symbol calls
-routed through stubs — as the real remaining cost on the cc path, not codegen.
+² The `fp_at_bottom` fold landed (−2 fixed insns/call; codegen now byte-for-byte
+the same prologue/epilogue shape as gcc -O0, and the four hot functions are each
+smaller than gcc -O0) but the **cc-linked runtime barely moved** (2973 → 2939):
+binary-trees is malloc/free-bound, so per-call ALU savings are noise. The **JIT
+path is at parity** — `cfree-run -O1` = 2642 ms (~tied with gcc 2647), vs the cc
+binary's 2939. That cc-only deficit (identical codegen) points at PLT/GOT
+indirection for `malloc`/`free` — ~15M dynamic-symbol calls routed through
+stubs — as the real remaining cost on the cc path, not codegen.
## Per-benchmark notes
@@ -115,7 +119,7 @@ These items are all resolved or no-impact for binary-trees: the four functions
are now smaller than gcc -O0 (12/18/21/16 vs 16/21/24/18). The remaining
cc-path runtime gap is allocator/PLT cost, not codegen — see the section intro.
-### mandelbrot — 0.92× vs mir (close to the bar)
+### mandelbrot — 0.91× vs mir (close to the bar)
Inner loop is FP-heavy (`Tr*Tr + Ti*Ti < 4.0` Mandelbrot escape test +
4 fmuls + 2 fadds per iter). Hasn't been deeply investigated since the
recent codegen batch. Worth disassembling the hot loop and comparing
@@ -127,25 +131,26 @@ and constant-pool material.
Small gap; not worth standalone investigation yet. Should naturally
absorb any remaining cross-cutting wins.
-### hash2 — 0.77× vs mir (still the worst against mir)
+### hash2 — 0.78× vs mir (still the worst against mir)
The previously-noted hoisting and strength-reduction wins landed (and
-moved hash2 from 0.72× to 0.77×), but mir is still 1.29× faster. Remaining
+moved hash2 from 0.72× to 0.78×), but mir is still ~1.27× faster. Remaining
gap is in the parts of the loop the prior items didn't touch — most
likely the modulo `val % ht->size` (mir probably emits a Barrett/reciprocal
multiply for the small-divisor case where we still emit `udiv`) and the
`strcmp` probe shape. Worth a fresh disassembly read of `ht_hashcode` and
the probe loop in `ht_find_new` against mir's output.
-### sieve — 0.78× vs mir, ~tied with gcc
-Loop-invariant `movz` and IV copies are gone; remaining gap is structural.
-mir is ~1.28× faster on the same loop shape (`flags[k] = 0` strided
-store + `flags[i] = 1` init). Candidate gaps: address-mode folding into
-the store (using `[x19, x8]` vs `add` + bare addr), and whether mir is
-auto-vectorizing the init loop.
+### sieve — 0.85× vs mir, ~tied with gcc
+Improved (5148 → 4880 ms; 0.78× → 0.85× vs mir). Loop-invariant `movz` and IV
+copies are gone; remaining gap is structural. mir is ~1.18× faster on the same
+loop shape (`flags[k] = 0` strided store + `flags[i] = 1` init). Candidate gaps:
+address-mode folding into the store (using `[x19, x8]` vs `add` + bare addr),
+and whether mir is auto-vectorizing the init loop.
-### lists — 1.03× vs mir (close to the bar)
-Down from 0.95×. Doubly-linked list traversal + splice. Within ~5% of mir;
-worth comparing the splice inner loop directly.
+### lists — CLEARED (1.24× vs mir, 2.20× vs gcc)
+Was 1.03× vs mir; now 4015 ms (down from 4843), clearing the 1.10× bar against
+both references. Doubly-linked list traversal + splice. No longer a tracked
+target — kept here for the record; drop on the next cleanup if it holds.
## Cross-cutting fixes (open)
@@ -200,7 +205,10 @@ disassembly of the hot function.
## Notes / caveats
-- Numbers above mix sweeps from different revisions; re-run with the same
- `COMPILE_REPEATS=3 RUN_REPEATS=3` after a change to confirm movement.
-- `cfree-run` (JIT) shares this codegen, so its runtimes track `cfree cc -O1`;
- fixing these helps both paths.
+- The cfree -O1 column is a single 2026-05-28 sweep (consistent revision); the
+ gcc/mir baseline columns are the cached `opt_bench_baseline.csv`. Re-run with
+ the same `COMPILE_REPEATS=3 RUN_REPEATS=3` after a change to confirm movement.
+- `cfree-run` (JIT) shares this codegen, so its runtimes track `cfree cc -O1` —
+ except where the cc binary pays link-time overhead the JIT doesn't (e.g.
+ binary-trees: JIT is ~10% faster than the cc binary on the same code, the
+ PLT/GOT `malloc`/`free` indirection noted above).