bench: cache fixed-compiler baseline; default opt_bench to cfree-only - kit

commit a3d48fb69c42dd07b58715354a1a6bda6673bbf4
parent c04e8d39488b1599fc07d0ca7633029866ef5fce
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Thu, 28 May 2026 11:46:28 -0700

bench: cache fixed-compiler baseline; default opt_bench to cfree-only

The gcc/clang/MIR codegen does not change as cfree evolves, so measure them
once into a checked-in cache and reuse it for comparisons:

- opt_bench.sh gains CFREE_OPT_BENCH_MODE (cfree|baseline). Default "cfree"
  mode runs only cfree/cfree-run and no longer requires c2m or the reference
  compilers; "baseline" mode sweeps gcc/clang/MIR over the full set at O0+O1
  and writes scripts/opt_bench_baseline.csv. Adds SKIP_CFREE; gates the c2m
  build and cfree-binary check on what actually runs.
- opt_bench_compare.py auto-merges the cached baseline (fresh rows win;
  --baseline-csv / --no-baseline to override) and scopes per-bench output to
  the benches the run covered. write_summary merges it too.
- scripts/opt_bench_baseline.csv: full suite x O0/O1 x gcc-15/clang/mir-c2m.
- OPT_O1_PERF_TODO.md: Reproducing section updated to the cfree-only flow.
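
The "fresh rows win" merge in the second bullet can be sketched as below. This is a minimal illustration, assuming rows are keyed on (tool, opt, bench) the way the write_summary change in this commit keys them. `merge_baseline` and the sample rows are hypothetical names and placeholder data for the example, not code taken from opt_bench_compare.py.

```python
def merge_baseline(fresh_rows, baseline_rows):
    """Return the fresh rows plus any baseline rows the run did not cover."""
    merged = list(fresh_rows)
    seen = {(r["tool"], r["opt"], r["bench"]) for r in merged}
    for r in baseline_rows:
        key = (r["tool"], r["opt"], r["bench"])
        if key not in seen:  # a fresh measurement always wins over the cache
            merged.append(r)
            seen.add(key)
    return merged


# Illustrative rows only: the runtimes are placeholders, not measurements.
fresh = [
    {"tool": "cfree", "opt": "1", "bench": "sieve", "runtime_ms": "10.0"},
    {"tool": "gcc-15", "opt": "0", "bench": "sieve", "runtime_ms": "20.0"},
]
baseline = [
    {"tool": "gcc-15", "opt": "0", "bench": "sieve", "runtime_ms": "99.0"},
    {"tool": "mir-c2m", "opt": "1", "bench": "sieve", "runtime_ms": "30.0"},
]

merged = merge_baseline(fresh, baseline)
for row in merged:
    print(row["tool"], row["opt"], row["bench"], row["runtime_ms"])
```

The cached gcc-15 row is dropped because the run produced its own gcc-15 row for the same (tool, opt, bench) key; the mir-c2m row is carried over from the cache.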

Diffstat:
M doc/OPT_O1_PERF_TODO.md  | 313 +++++++++++++++++++++++++++++++++++++++----------------------------------------
M scripts/opt_bench.sh  | 93 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------------
A scripts/opt_bench_baseline.csv  | 85 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M scripts/opt_bench_compare.py  | 44 +++++++++++++++++++++++++++++++++++++++++++-

4 files changed, 359 insertions(+), 176 deletions(-)
diff --git a/doc/OPT_O1_PERF_TODO.md b/doc/OPT_O1_PERF_TODO.md
@@ -20,161 +20,157 @@ reference exists).
 ## Current standings
 
 Numbers below are from a 3-run sweep on aarch64/Apple (`COMPILE_REPEATS=3`,
-`RUN_REPEATS=3`, best-of), runtime in ms; speedup = reference_time / cfree_time
-(>1 means cfree is faster).
+`RUN_REPEATS=3`, best-of), runtime in ms; speedup = reference_time /
+cfree_time (>1 means cfree is faster). gcc-O0 and mir-O1 columns come from
+the cached baseline in `scripts/opt_bench_baseline.csv`; regenerate with
+`CFREE_OPT_BENCH_MODE=baseline scripts/opt_bench.sh`.
 
 | bench | cfree -O1 | gcc -O0 | vs gcc-O0 | mir -O1 | vs mir-O1 | behind |
 | --- | ---: | ---: | ---: | ---: | ---: | --- |
-| binary-trees | 3106 | 2634 | **0.85×** (slower) | n/a¹ | — | gcc |
-| lists | 5243 | 8817 | 1.68× ✓ | 4978 | 0.95× | mir |
-| hash2 | 5372 | 7365 | 1.37× ✓ | 3841 | **0.72×** | mir |
-| sieve | 4991 | 5023 | 1.01× | 4006 | **0.80×** | gcc (~tied), mir |
+| binary-trees | 3146 | 2639 | **0.84×** (slower) | n/a¹ | — | gcc |
+| lists | 4843 | 8868 | 1.83× ✓ | 4997 | 1.03× | mir |
+| hash2 | 4988 | 7481 | 1.50× ✓ | 3863 | **0.77×** | mir |
+| sieve | 5148 | 5077 | 0.99× (~tied) | 4028 | **0.78×** | gcc (~tied), mir |
+| mandelbrot | 3658 | 10274 | 2.81× ✓ | 3346 | 0.91× | mir |
+| strcat | 5899 | 5965 | 1.01× (~tied) | 5775 | 0.98× | both (~tied) |
 
-¹ mir-c2m fails to compile `binary-trees` in our setup, so only the gcc
-comparison applies there.
-
-Movement since the previous refresh (single-run numbers):
-- `lists` went from 0.58× to 0.95× vs mir and 1.02× to 1.68× vs gcc — the
-  recent O1 codegen wins + tiny-function inliner closed most of the gap. Still
-  not 10% past mir, but no longer the worst case.
-- The other three are roughly unchanged; the bar still hasn't been cleared
-  against mir on hash2/sieve, and binary-trees is still slower than gcc-O0.
+¹ mir-c2m fails to compile `binary-trees`, so only the gcc comparison applies.
 
 ## Per-benchmark notes
 
-### binary-trees — slower than unoptimized gcc (still highest priority)
-The only case where cfree `-O1` is *slower than gcc -O0* (0.85×). Workload is
-recursive tree build/walk: tiny leaf-ish functions called millions of times
-plus `malloc`/`free` on each node. Most wall-clock is in `malloc`; the
-compiler-visible slack lives in the per-call overhead. Inspecting `ItemCheck`,
-`BottomUpTree`, `DeleteTree`, `NewTreeNode` shows ~4–6 redundant moves per
-recursive call site:
-
-```
-ItemCheck (inner recursive call):
-  ldr  x8, [x19]
-  mov  x11, x8       <-- redundant copy
-  mov  x0, x11       <-- could be `mov x0, x8` straight from the ldr
-  bl   ItemCheck
-  mov  x8, x0        <-- materialize the return into the SSA reg
-  movz x9, 0x1       <-- the `1 +` constant, not hoisted (only 1 use, fine)
-  add  x20, x9, x8
-```
-
-Drivers for the gap, in order:
-1. **Copy coalescing leaves intermediate `mov` chains** at call boundaries:
-   `mov xN, xM; mov x0, xN` instead of folding through. Every recursive call
-   in `ItemCheck`/`BottomUpTree`/`DeleteTree` has at least one such redundant
-   pair. With ~7.6M recursive calls (depth=19) this is tens of millions of
-   wasted ops.
-2. **Standard 9-insn prologue/epilogue** (`sub sp; add x17; stp x29,x30;
-   add x29` + mirror on exit) on tiny non-leaf functions like `NewTreeNode`.
-   Leaf detection + skipping FP save/restore where the callee makes no calls
-   would help (`ItemCheck` recurses so it's not a leaf; `NewTreeNode` *does*
-   call malloc, also not a leaf — but neither needs the FP-frame setup we
-   currently emit).
-3. **`NewTreeNode` spills x19/x20** as callee-saves only to ferry the two
-   incoming args across one `malloc` call — pure overhead vs keeping the
-   live values in caller-save scratch + reload from a small spill slot.
-
-This is where the biggest absolute wall-clock win is, even though the per-op
-codegen looks roughly equivalent to gcc -O0.
-
-### hash2 — 0.72× vs mir
-Clears the gcc bar (1.37×) but is the worst against mir. The hot loop is
-`ht_hashcode`:
-
-```
-0xc88:  movz x9, 0x5         <-- loop-invariant constant, NOT hoisted
-0xc8c:  mul  x14, x9, x11    <-- could be `add x14, x11, x11, lsl #2` (5*v)
-0xc90:  ldrb w11, [x8]
-0xc94:  sxtb x13, w11
-0xc98:  add  x11, x13, x14
-0xc9c:  add  x8, x8, #1
-0xca0:  ldrb w13, [x8]
-0xca4:  cmp  w13, #0
-0xca8:  b.ne 0xc88
-```
-
-Two clear wins:
-1. **Loop-invariant immediate operand not hoisted.** The IR carries the `5`
-   as an inline `imm:` operand on the `binop mul`; `opt_machinize_native`
-   leaves it there, the emitter materializes it with `movz x9, 0x5` inside
-   the loop on every iteration. `opt_hoist_loop_consts` only hoists explicit
-   `IR_LOAD_IMM` defs (see `src/opt/pass_addr_fold.c:835`), so this never
-   becomes a hoist candidate.
-
-   Fix sketch: between `opt_machinize_native` and `opt_hoist_loop_consts`,
-   add a pass that lowers non-zero immediate operands on machine-irrelevant
-   positions (mul/add/sub second operand outside imm-fold range, store value
-   operand) to `IR_LOAD_IMM` + reg-use. Then the existing hoist pass picks
-   them up. Same fix unblocks sieve (see below).
-2. **`mul x14, x9, x11` for a constant 5×** — strength-reduce small-integer
-   multiplications to `add x, y, y, lsl #N` / `sub` sequences for power-of-two
-   neighbours. Independent of the hoist fix.
-
-### sieve — 0.80× vs mir, ~tied with gcc
-Behind both. Two hot inner loops; both leave clear codegen on the table.
-
-Init loop (`flags[i] = 1`):
-```
-0x2ac:  mov  x13, x8         <-- coalesce miss (copy of the IV)
-0x2b0:  movz w9, 0x1         <-- loop-invariant, NOT hoisted (same as hash2)
-0x2b4:  strb w9, [x19, x13]
-0x2b8:  add  x8, x8, #1
-0x2bc:  cmp  x8, #2, lsl #12
-0x2c0:  b.le 0x2ac
-```
-
-Mark-multiples loop (`flags[k] = 0`):
-```
-0x2ec:  mov  x14, x13        <-- coalesce miss; could just use x13 for the index
-0x2f0:  strb wzr, [x19, x14]  (strb wzr is already good — zero-source fast path
-0x2f4:  add  x13, x13, x8     fires here)
-0x2f8:  cmp  x13, #2, lsl #12
-0x2fc:  b.le 0x2ec
-```
-
-So:
-1. **Same imm-operand hoist gap** as hash2 — the inline `imm:1` on `store`
-   becomes a fresh `movz w9, 0x1` per iteration. The existing zero-source
-   fast path (`pass_native_emit.c:790`) handles the strb-zero case in the
-   second loop, which is why only the init loop has the `movz`.
-2. **Copy not coalesced for the IV/index** in both loops. The `mov x13, x8`
-   and `mov x14, x13` are pure register-allocator copies that the post-RA
-   passes don't clean up. Worth tracing what `opt_lower_to_mir` + the post-RA
-   jump-cleanup leaves; these moves likely come from how the IR copy ops in
-   block 4 of `sieve_min`'s IR survive RA (`copy def=v9 opnds=[v14,v8]` +
-   `copy def=v10 opnds=[v13,v12]` — two copies feeding one store).
-
-### lists — 0.95× vs mir (no longer bleeding)
-Last refresh was 0.58×; current 0.95×. Now within shot of the 1.10× bar.
-Same imm-hoist + copy-coalesce wins should close the remaining ~15%.
-
-## Cross-cutting fixes (would help multiple benches at once)
-
-These are the two themes that recur across the four benches above:
-
-1. **Hoist loop-invariant *immediate operands*, not just `IR_LOAD_IMM`.**
-   `opt_hoist_loop_consts` in `src/opt/pass_addr_fold.c` only sees defs of
-   `IR_LOAD_IMM`. Constants that arrive at emit as inline `imm:` operands on
-   `binop`/`store`/etc. are re-materialized by the emitter inside the loop
-   every iteration (sieve-init, hash2-hashcode). A pre-hoist lowering pass
-   that walks loop-body insts, takes their non-foldable imm operands, and
-   converts them to `IR_LOAD_IMM + reg use` would let the existing hoist
-   pass do the rest.
-2. **Copy coalescing leaves redundant `mov xN, xM` artifacts.** Seen in
-   sieve (both loops, IV/index copy), binary-trees (every recursive call
-   site has at least one redundant `mov` pair), and likely in lists. Most
-   of these are SSA-level `copy` ops that survive into machine code because
-   post-RA copy elimination is doing less than it could. Worth a focused
-   look at what `opt_mir_combine` / `opt_mir_dce` see for these inputs.
-
-Targeted secondary wins:
-- Strength-reduce small-integer multiplications by constants to `add`/`sub`
-  + shifted-register forms (hash2 `5 *`).
-- Slimmer prologue/epilogue for functions that don't need FP-frame setup
-  (binary-trees).
+### binary-trees — slower than unoptimized gcc (highest priority)
+The only case where cfree `-O1` is *slower than gcc -O0* (0.84×). Workload is
+recursive tree build/walk: four tiny functions (`NewTreeNode`, `ItemCheck`,
+`BottomUpTree`, `DeleteTree`) called ~7.6M times at depth=19, plus a
+`malloc`/`free` per node. The **body** of each function is fine — cfree -O1
+keeps the recurring pointer in a callee-save (x19) where gcc -O0 spills and
+reloads it three times per call. The gap is entirely in **fixed per-call
+overhead**: every instruction of it is multiplied by 7.6M.
+
+Open items, in priority order (most recent disasm in
+`/tmp/mc/binary-trees.cfree.o`):
+
+1. **Useless leading `b PC+4` at every function entry.** All four functions
+   still start with this:
+   ```
+   sub  sp, sp, #0x20
+   stp  x29, x30, [sp, #0x10]
+   add  x29, sp, #0x10
+   stp  x20, x19, [x29, #-0x10]
+   b    PC+4              <-- branches to the very next instruction
+   mov  x19, x0
+   ```
+   Root cause: commit `9bd61e8` ("emit param_decls into a dedicated prologue
+   block") added an empty-on-emit entry block ahead of the function body.
+   `opt_jump_cleanup`'s helper `empty_fallthrough_block` in
+   `src/opt/pass_jump.c` then explicitly bails out when `block == f->entry`,
+   so the empty entry block is never merged into its single successor.
+   Lifting that guard (with whatever safety condition `9bd61e8` was protecting
+   against — likely "first body block is a loop header") would let
+   jump-cleanup absorb the prologue block in the common case. **+1 insn ×
+   7.6M calls.**
+
+2. **Prologue compaction: 4-insn → 2-insn pre-indexed.** Half-done. cfree
+   today emits:
+   ```
+   sub  sp, sp, #N
+   stp  x29, x30, [sp, #M]
+   add  x29, sp, #M
+   stp  x20, x19, [x29, #-K]      ; or `stur x19, ...` when only x19 live
+   ```
+   `bcca0bd` added `slim_prologue` in `src/arch/aa64/native.c`, which uses
+   the pre-indexed `stp x29, x30, [sp, #-N]!` form — but only when
+   `ncallee_saves == 0`. The common path (one or more callee-saves) still
+   goes through `slim_small_frame`, which doesn't pre-index. Extend
+   `slim_small_frame` to use `aa64_stp64_pre` for the FP-save when the frame
+   size is known in advance, then emit the callee-save spills as separate
+   `stur`s afterward. Saves 1 insn at entry + 1 at every exit. **~2 insns ×
+   7.6M calls.**
+
+3. **Zero materialized through a temp in `BottomUpTree` leaf path.**
+   `NewTreeNode(NULL, NULL)` still emits:
+   ```
+   c44:  mov  x8, #0x0
+   c48:  mov  x0, x8
+   c4c:  mov  x8, #0x0
+   c50:  mov  x1, x8
+   c54:  bl   _NewTreeNode
+   ```
+   Should be `mov x0, #0; mov x1, #0`. The sibling fix in `9ac2416` got the
+   `ldr` → call-arg case, but `IR_LOAD_IMM` sources don't seem to participate
+   in the ABI aliasing-hint propagation in `pass_lower.c`. Likely a small
+   extension to `set_preg_pref_for_call_args` / `propagate_hint_through_copies`
+   to also fire when the source op is `IR_LOAD_IMM`. **+2 insns × 524k leaf
+   calls.**
+
+4. **Trailing `b A; A: b B` pair in `DeleteTree`'s if/else merge.**
+   ```
+   c9c:  mov  x0, x19
+   ca0:  bl   _free
+   ca4:  b    0xcac              <-- jumps directly to the epilogue label
+   ca8:  b    0xc9c              <-- unreachable; left over from else side
+   cac:  ldur x19, [x29, #-0x8]
+   ```
+   Classic jump-thread target. `cleanup_layout_fallthrough_branches` in
+   `pass_jump.c` doesn't pick up this shape (two consecutive `b`s where
+   the second is unreachable). +1 insn per `DeleteTree` invocation
+   (7.6M calls).
+
+Together these would save 4–6 insns per call, ~30–50M instructions removed
+at depth=19. Body quality is already on par with gcc-O0; this is all
+fixed per-call overhead.
+
+### mandelbrot — 0.91× vs mir (close to the bar)
+Inner loop is FP-heavy (`Tr*Tr + Ti*Ti < 4.0` Mandelbrot escape test +
+4 fmuls + 2 fadds per iter). Hasn't been deeply investigated since the
+recent codegen batch. Worth disassembling the hot loop and comparing
+against mir to see what specifically is still on the table — likely some
+combination of FP register allocation, vectorization (which we don't do),
+and constant-pool material.
+
+### strcat — 0.98× vs mir, ~tied with gcc
+Small gap; not worth standalone investigation yet. Should naturally
+absorb any remaining cross-cutting wins.
+
+### hash2 — 0.77× vs mir (still the worst against mir)
+The previously-noted hoisting and strength-reduction wins landed (and
+moved hash2 from 0.72× to 0.77×), but mir is still 1.29× faster. Remaining
+gap is in the parts of the loop the prior items didn't touch — most
+likely the modulo `val % ht->size` (mir probably emits a Barrett/reciprocal
+multiply for the small-divisor case where we still emit `udiv`) and the
+`strcmp` probe shape. Worth a fresh disassembly read of `ht_hashcode` and
+the probe loop in `ht_find_new` against mir's output.
+
+### sieve — 0.78× vs mir, ~tied with gcc
+Loop-invariant `movz` and IV copies are gone; remaining gap is structural.
+mir is ~1.28× faster on the same loop shape (`flags[k] = 0` strided
+store + `flags[i] = 1` init). Candidate gaps: address-mode folding into
+the store (using `[x19, x8]` vs `add` + bare addr), and whether mir is
+auto-vectorizing the init loop.
+
+### lists — 1.03× vs mir (close to the bar)
+Up from 0.95×. Doubly-linked list traversal + splice. Within ~5% of mir;
+worth comparing the splice inner loop directly.
+
+## Cross-cutting fixes (open)
+
+These help several benches. Both are partial; the binary-trees items above
+are the most concrete tests for whether each is complete.
+
+1. **Drop the leading `b PC+4` at function entry.** See binary-trees item 1.
+   Affects every cfree-compiled function, not just binary-trees.
+
+2. **Compact FP-frame prologue/epilogue.** See binary-trees item 2. The
+   2-insn pre-indexed form is wired in for the no-callee-save case; needs
+   to be extended to small frames with 1–2 callee-saves. Biggest absolute
+   payoff on call-heavy benches.
+
+3. **Hard-register copy coalescing for `IR_LOAD_IMM` sources.** See
+   binary-trees item 3. The hint-propagation path covers `ldr` → call-arg
+   but skips immediates.
+
+4. **Jump-thread the `b A; A: b B` shape.** See binary-trees item 4.
+   General `pass_jump.c` cleanup, not bench-specific.
 
 ## Reproducing
 
@@ -182,15 +178,22 @@ Targeted secondary wins:
 # Build the optimized compiler first (clean release):
 rm -rf build/release && make RELEASE=1 bin
 
-# Run just these four with 3 repeats (best-of), O0+O1, gcc + cfree + mir:
+# Run the still-open or stale-number benches with 3 repeats (best-of). The
+# default mode measures only cfree (fast iteration) and the trailing compare
+# step automatically pulls gcc/mir numbers from the cached
+# scripts/opt_bench_baseline.csv — no need to re-run the fixed compilers:
 CFREE="$PWD/build/release/cfree" \
-CFREE_OPT_BENCHES="binary-trees lists hash2 sieve" \
+CFREE_OPT_BENCHES="binary-trees lists hash2 sieve mandelbrot strcat" \
 CFREE_OPT_BENCH_LEVELS="0 1" \
 CFREE_OPT_BENCH_COMPILE_REPEATS=3 CFREE_OPT_BENCH_RUN_REPEATS=3 \
-CFREE_OPT_BENCH_SKIP_GCC=0 CFREE_OPT_BENCH_SKIP_CLANG=1 CFREE_OPT_BENCH_SKIP_MIR=0 \
 bash scripts/opt_bench.sh
 ```
 
+The comparison against the cached baseline prints at the end of the run; re-run
+it standalone any time with `python3 scripts/opt_bench_compare.py`. Only
+regenerate the baseline cache (the `CFREE_OPT_BENCH_MODE=baseline` command
+above) when the host, the reference compilers, or the benchmark sources change.
+
 Per-iteration codegen for a single function is easiest to inspect via the
 optimizer's staged IR dump (`CFREE_DUMP=pre-emit cfree cc -O1 -c bench.c ...`,
 which panics after printing the pre-emit MIR) plus `objdump`/`lldb`
@@ -198,11 +201,7 @@ disassembly of the hot function.
 
 ## Notes / caveats
 
-- Numbers above are best-of-3; re-run with the same `COMPILE_REPEATS=3
-  RUN_REPEATS=3` after a change to confirm movement. Anything near 1.0× (e.g.
-  `sieve` vs gcc) is within noise and should be confirmed with a separate
-  re-run.
-- Not on this list but worth revisiting once the four above improve: `hash`,
-  `funnkuch-reduce`, `strcat` (previously marginal at single-run).
+- Numbers above mix sweeps from different revisions; re-run with the same
+  `COMPILE_REPEATS=3 RUN_REPEATS=3` after a change to confirm movement.
 - `cfree-run` (JIT) shares this codegen, so its runtimes track `cfree cc -O1`;
   fixing these helps both paths.
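
As a worked check of the `speedup = reference_time / cfree_time` definition used in the standings table of the doc above, the sketch below recomputes both ratio columns from the table's runtimes. The `speedup` helper and the `standings` dict are names introduced for this example only; the runtimes are copied from the table.

```python
def speedup(reference_ms, cfree_ms):
    """Ratio above 1 means cfree is faster than the reference compiler."""
    return reference_ms / cfree_ms


# bench: (cfree -O1, gcc -O0, mir -O1) runtimes in ms, from the standings table.
# mir is None where mir-c2m fails to compile the benchmark.
standings = {
    "binary-trees": (3146, 2639, None),
    "lists": (4843, 8868, 4997),
    "hash2": (4988, 7481, 3863),
    "sieve": (5148, 5077, 4028),
    "mandelbrot": (3658, 10274, 3346),
    "strcat": (5899, 5965, 5775),
}

ratios = {}
for bench, (cfree, gcc, mir) in standings.items():
    vs_gcc = f"{speedup(gcc, cfree):.2f}"
    vs_mir = f"{speedup(mir, cfree):.2f}" if mir is not None else "n/a"
    ratios[bench] = (vs_gcc, vs_mir)
    print(f"{bench:13s} vs gcc-O0 {vs_gcc}x   vs mir-O1 {vs_mir}x")
```

Rounded to two decimals, this reproduces the ratio columns of the table, which makes it a quick way to confirm that the per-benchmark headings agree with the table after a refresh.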
diff --git a/scripts/opt_bench.sh b/scripts/opt_bench.sh
@@ -10,25 +10,59 @@ GCC="${GCC:-gcc-15}"
 OUT_DIR="${CFREE_OPT_BENCH_OUT:-$ROOT/build/bench/opt}"
 CFREE_SYSROOT="${CFREE_OPT_BENCH_SYSROOT:-}"
 
-# Full benchmark set (override with CFREE_OPT_BENCHES to use it):
-#   array binary-trees except funnkuch-reduce hash hash2 heapsort lists matrix
+# Full benchmark set used for baseline caching (override with CFREE_OPT_BENCHES):
+#   array binary-trees funnkuch-reduce hash hash2 heapsort lists matrix
 #   method-call mandelbrot nbody sieve spectral-norm strcat
-# `except` in particular runs for many seconds at -O0, which is why the default
-# below is a small, quick subset at O0+O1 (skip the heavy O2 sweep).
+# `except` is excluded from the full set: it's setjmp/longjmp-bound and runs
+# for ~2.5 minutes per O0 sample, which inflates wall-clock without telling us
+# anything about codegen quality. Pass it explicitly via CFREE_OPT_BENCHES if
+# wanted.
+FULL_BENCHES="array binary-trees funnkuch-reduce hash hash2 heapsort lists matrix method-call mandelbrot nbody sieve spectral-norm strcat"
+
+# Cached baseline timings for the fixed compilers (gcc/clang/MIR). Their codegen
+# does not change as we iterate on cfree, so we measure them once into this file
+# (checked into scripts/) and reuse it for comparisons. Regenerate with
+# `CFREE_OPT_BENCH_MODE=baseline scripts/opt_bench.sh`.
+BASELINE_CSV="${CFREE_OPT_BENCH_BASELINE_CSV:-$ROOT/scripts/opt_bench_baseline.csv}"
+
+# Mode selects which tools run and where results go:
+#   cfree    (default) - only cfree/cfree-run; writes build/bench/opt/results.csv
+#   baseline           - only gcc/clang/MIR over the full set; writes BASELINE_CSV
+MODE="${CFREE_OPT_BENCH_MODE:-cfree}"
+
 DEFAULT_LEVELS="0 1"
-DEFAULT_BENCHES="array hash hash2 matrix sieve"
 DEFAULT_COMPILE_REPEATS="3"
 DEFAULT_RUN_REPEATS="3"
+
+case "$MODE" in
+  baseline)
+    # Measure the fixed compilers across the full set; cfree is skipped.
+    DEFAULT_BENCHES="$FULL_BENCHES"
+    DEF_SKIP_GCC=0; DEF_SKIP_CLANG=0; DEF_SKIP_MIR=0; DEF_SKIP_CFREE=1
+    DEFAULT_CSV="$BASELINE_CSV"
+    ;;
+  cfree)
+    # Default working mode: only cfree, compared against the cached baseline.
+    DEFAULT_BENCHES="array hash hash2 matrix sieve"
+    DEF_SKIP_GCC=1; DEF_SKIP_CLANG=1; DEF_SKIP_MIR=1; DEF_SKIP_CFREE=0
+    DEFAULT_CSV="$OUT_DIR/results.csv"
+    ;;
+  *)
+    printf 'opt-bench: unknown CFREE_OPT_BENCH_MODE: %s (want cfree|baseline)\n' "$MODE" >&2
+    exit 2
+    ;;
+esac
+
 LEVELS="${CFREE_OPT_BENCH_LEVELS:-$DEFAULT_LEVELS}"
 BENCHES="${CFREE_OPT_BENCHES:-$DEFAULT_BENCHES}"
 COMPILE_REPEATS="${CFREE_OPT_BENCH_COMPILE_REPEATS:-$DEFAULT_COMPILE_REPEATS}"
 RUN_REPEATS="${CFREE_OPT_BENCH_RUN_REPEATS:-$DEFAULT_RUN_REPEATS}"
-# Per-tool skip flags. By default keep gcc (the baseline) and skip clang.
-# Override individually, e.g. CFREE_OPT_BENCH_SKIP_MIR=1 or
-# CFREE_OPT_BENCH_SKIP_CLANG=0.
-SKIP_GCC="${CFREE_OPT_BENCH_SKIP_GCC:-0}"
-SKIP_CLANG="${CFREE_OPT_BENCH_SKIP_CLANG:-1}"
-SKIP_MIR="${CFREE_OPT_BENCH_SKIP_MIR:-0}"
+# Per-tool skip flags. Defaults come from the mode above; override individually,
+# e.g. CFREE_OPT_BENCH_SKIP_MIR=1 or CFREE_OPT_BENCH_SKIP_CFREE=0.
+SKIP_GCC="${CFREE_OPT_BENCH_SKIP_GCC:-$DEF_SKIP_GCC}"
+SKIP_CLANG="${CFREE_OPT_BENCH_SKIP_CLANG:-$DEF_SKIP_CLANG}"
+SKIP_MIR="${CFREE_OPT_BENCH_SKIP_MIR:-$DEF_SKIP_MIR}"
+SKIP_CFREE="${CFREE_OPT_BENCH_SKIP_CFREE:-$DEF_SKIP_CFREE}"
 MIR_MAKE="${MIR_MAKE:-}"
 
 case "$(uname -s 2>/dev/null || true)" in
@@ -56,7 +90,7 @@ CFLAGS_EXTRA="${CFREE_OPT_BENCH_CFLAGS:-$DEFAULT_CFLAGS_EXTRA}"
 CFREE_FLAGS_EXTRA="${CFREE_OPT_BENCH_CFREE_FLAGS:-}"
 CFREE_RUN_FLAGS_EXTRA="${CFREE_OPT_BENCH_CFREE_RUN_FLAGS:-}"
 
-CSV="$OUT_DIR/results.csv"
+CSV="${CFREE_OPT_BENCH_CSV:-$DEFAULT_CSV}"
 SUMMARY="$OUT_DIR/summary.md"
 LOG_DIR="$OUT_DIR/logs"
 BIN_DIR="$OUT_DIR/bin"
@@ -357,16 +391,28 @@ bench_cfree_run() {
 }
 
 write_summary() {
-  python3 - "$CSV" "$SUMMARY" "$(tool_label "$GCC")" <<'PY'
+  python3 - "$CSV" "$SUMMARY" "$(tool_label "$GCC")" "$BASELINE_CSV" <<'PY'
 import csv
 import math
+import os
 import sys
 from collections import defaultdict
 
-csv_path, out_path, base_tool = sys.argv[1:4]
+csv_path, out_path, base_tool, baseline_path = sys.argv[1:5]
 with open(csv_path, newline="") as f:
     rows = list(csv.DictReader(f))
 
+# Merge cached baseline timings (gcc/clang/MIR) so the summary tables include
+# the fixed compilers even though the default run only measures cfree.
+seen = {(r["tool"], r["opt"], r["bench"]) for r in rows}
+if baseline_path and os.path.exists(baseline_path) and os.path.abspath(baseline_path) != os.path.abspath(csv_path):
+    with open(baseline_path, newline="") as f:
+        for r in csv.DictReader(f):
+            key = (r["tool"], r["opt"], r["bench"])
+            if key not in seen:
+                rows.append(r)
+                seen.add(key)
+
 def fnum(v):
     if v in ("", "NA", None):
         return None
@@ -481,13 +527,16 @@ if [ ! -d "$BENCH_DIR" ]; then
   printf 'opt-bench: benchmark directory not found: %s\n' "$BENCH_DIR" >&2
   exit 2
 fi
-if [ ! -x "$CFREE" ]; then
+if [ "$SKIP_CFREE" != "1" ] && [ ! -x "$CFREE" ]; then
   printf 'opt-bench: cfree binary not found: %s\n' "$CFREE" >&2
   printf 'opt-bench: run `make bin` or set CFREE=/path/to/cfree\n' >&2
   exit 2
 fi
-ensure_mir || exit 2
+# c2m is only required when MIR is part of the run.
+[ "$SKIP_MIR" != "1" ] && { ensure_mir || exit 2; }
 
+printf 'opt-bench: mode: %s\n' "$MODE"
+printf 'opt-bench: csv: %s\n' "$CSV"
 printf 'opt-bench: output: %s\n' "$OUT_DIR"
 printf 'opt-bench: benches: %s\n' "$BENCHES"
 printf 'opt-bench: levels: %s\n' "$LEVELS"
@@ -505,12 +554,20 @@ for bench in $BENCHES; do
   for opt in $LEVELS; do
     [ "$SKIP_GCC" != "1" ] && bench_native_tool "$bench" "$(tool_label "$GCC")" "$GCC" "$opt" "$src" "$expect" "$arg_line"
     [ "$SKIP_CLANG" != "1" ] && bench_native_tool "$bench" "$(tool_label "$CLANG")" "$CLANG" "$opt" "$src" "$expect" "$arg_line"
-    bench_native_tool "$bench" "cfree" "$CFREE cc" "$opt" "$src" "$expect" "$arg_line"
-    bench_cfree_run "$bench" "$opt" "$src" "$expect" "$arg_line"
+    if [ "$SKIP_CFREE" != "1" ]; then
+      bench_native_tool "$bench" "cfree" "$CFREE cc" "$opt" "$src" "$expect" "$arg_line"
+      bench_cfree_run "$bench" "$opt" "$src" "$expect" "$arg_line"
+    fi
     [ "$SKIP_MIR" != "1" ] && bench_mir "$bench" "$opt" "$src" "$expect" "$arg_line"
   done
 done
 
+if [ "$MODE" = "baseline" ]; then
+  printf 'opt-bench: wrote baseline cache %s\n' "$CSV"
+  printf 'opt-bench: commit this file so cfree runs can compare against it\n'
+  exit 0
+fi
+
 write_summary
 printf 'opt-bench: wrote %s\n' "$CSV"
 printf 'opt-bench: wrote %s\n' "$SUMMARY"
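
The mode handling in this script amounts to "the mode picks the defaults, and any explicitly set environment variable overrides them". A rough Python model of that precedence, meant only as a reading aid for the shell above, is sketched below. `MODE_DEFAULTS` and `resolve_skips` are hypothetical names, and only the four skip flags are modeled.

```python
# Skip-flag defaults per mode, mirroring the DEF_SKIP_* assignments in the script.
MODE_DEFAULTS = {
    "baseline": {"GCC": "0", "CLANG": "0", "MIR": "0", "CFREE": "1"},
    "cfree": {"GCC": "1", "CLANG": "1", "MIR": "1", "CFREE": "0"},
}


def resolve_skips(env):
    """Return (mode, skip flags) for a dict standing in for the environment."""
    mode = env.get("CFREE_OPT_BENCH_MODE") or "cfree"
    if mode not in MODE_DEFAULTS:
        raise ValueError(
            f"unknown CFREE_OPT_BENCH_MODE: {mode} (want cfree|baseline)"
        )
    skips = {}
    for tool, default in MODE_DEFAULTS[mode].items():
        # Like ${VAR:-default}: an unset or empty variable falls back to the default.
        skips[tool] = env.get(f"CFREE_OPT_BENCH_SKIP_{tool}") or default
    return mode, skips


default_mode, default_skips = resolve_skips({})
override_mode, override_skips = resolve_skips(
    {"CFREE_OPT_BENCH_MODE": "baseline", "CFREE_OPT_BENCH_SKIP_CLANG": "1"}
)
print(default_mode, default_skips)
print(override_mode, override_skips)
```

With an empty environment the model selects the cfree-only defaults. Setting the mode to baseline while also setting `CFREE_OPT_BENCH_SKIP_CLANG=1` keeps gcc and MIR enabled but drops clang, which matches the per-tool override described in the script's comments.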
diff --git a/scripts/opt_bench_baseline.csv b/scripts/opt_bench_baseline.csv
@@ -0,0 +1,85 @@
+bench,tool,opt,status,compile_ms,codegen_ms,runtime_ms,exit_code,log
+"array","gcc-15","0","OK","151.617","NA","6192.521","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.array"
+"array","clang","0","OK","147.005","NA","5308.213","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.array"
+"array","mir-c2m","0","OK","121.159","0.264","4986.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.array"
+"array","gcc-15","1","OK","156.974","NA","2146.515","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.array"
+"array","clang","1","OK","143.504","NA","2896.896","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.array"
+"array","mir-c2m","1","OK","118.147","0.318","4792.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.array"
+"binary-trees","gcc-15","0","OK","148.974","NA","2647.209","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.binary-trees"
+"binary-trees","clang","0","OK","139.066","NA","2754.167","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.binary-trees"
+"binary-trees","mir-c2m","0","COMPILE_FAIL","123.128","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.binary-trees.compile.err"
+"binary-trees","gcc-15","1","OK","153.376","NA","2607.473","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.binary-trees"
+"binary-trees","clang","1","OK","142.402","NA","2400.911","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.binary-trees"
+"binary-trees","mir-c2m","1","COMPILE_FAIL","126.457","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.binary-trees.compile.err"
+"funnkuch-reduce","gcc-15","0","OK","148.897","NA","2557.454","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.funnkuch-reduce"
+"funnkuch-reduce","clang","0","OK","137.344","NA","2880.681","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.funnkuch-reduce"
+"funnkuch-reduce","mir-c2m","0","OK","122.131","0.499","2856.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.funnkuch-reduce"
+"funnkuch-reduce","gcc-15","1","OK","154.436","NA","2081.479","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.funnkuch-reduce"
+"funnkuch-reduce","clang","1","OK","149.392","NA","2311.176","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.funnkuch-reduce"
+"funnkuch-reduce","mir-c2m","1","OK","123.103","0.620","2764.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.funnkuch-reduce"
+"hash","gcc-15","0","OK","159.280","NA","4608.200","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.hash"
+"hash","clang","0","OK","142.514","NA","4875.379","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.hash"
+"hash","mir-c2m","0","OK","159.190","1.380","4172.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.hash"
+"hash","gcc-15","1","OK","180.370","NA","4133.414","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.hash"
+"hash","clang","1","OK","165.194","NA","4131.105","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.hash"
+"hash","mir-c2m","1","OK","152.222","1.747","4167.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.hash"
+"hash2","gcc-15","0","OK","161.541","NA","7398.824","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.hash2"
+"hash2","clang","0","OK","144.715","NA","8831.498","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.hash2"
+"hash2","mir-c2m","0","OK","152.456","1.430","3970.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.hash2"
+"hash2","gcc-15","1","OK","180.383","NA","4360.070","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.hash2"
+"hash2","clang","1","OK","165.528","NA","3850.965","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.hash2"
+"hash2","mir-c2m","1","OK","151.444","1.825","3857.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.hash2"
+"heapsort","gcc-15","0","OK","149.543","NA","7605.803","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.heapsort"
+"heapsort","clang","0","OK","139.750","NA","6318.618","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.heapsort"
+"heapsort","mir-c2m","0","COMPILE_FAIL","120.517","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.heapsort.compile.err"
+"heapsort","gcc-15","1","OK","151.685","NA","5536.478","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.heapsort"
+"heapsort","clang","1","OK","147.930","NA","4315.873","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.heapsort"
+"heapsort","mir-c2m","1","COMPILE_FAIL","120.113","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.heapsort.compile.err"
+"lists","gcc-15","0","OK","160.975","NA","8841.648","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.lists"
+"lists","clang","0","OK","142.189","NA","7696.170","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.lists"
+"lists","mir-c2m","0","OK","146.936","1.208","5512.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.lists"
+"lists","gcc-15","1","OK","175.373","NA","3438.077","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.lists"
+"lists","clang","1","OK","155.047","NA","2938.820","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.lists"
+"lists","mir-c2m","1","OK","146.978","1.523","4988.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.lists"
+"matrix","gcc-15","0","OK","151.555","NA","20657.715","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.matrix"
+"matrix","clang","0","OK","139.105","NA","13679.456","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.matrix"
+"matrix","mir-c2m","0","OK","140.975","0.707","11065.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.matrix"
+"matrix","gcc-15","1","OK","162.019","NA","3073.402","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.matrix"
+"matrix","clang","1","OK","149.687","NA","3125.393","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.matrix"
+"matrix","mir-c2m","1","OK","139.374","0.893","9466.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.matrix"
+"method-call","gcc-15","0","COMPILE_FAIL","74.268","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.method-call.compile.1.err"
+"method-call","clang","0","COMPILE_FAIL","88.040","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.method-call.compile.1.err"
+"method-call","mir-c2m","0","OK","116.234","0.490","4190.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.method-call"
+"method-call","gcc-15","1","COMPILE_FAIL","178.970","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.method-call.compile.1.err"
+"method-call","clang","1","COMPILE_FAIL","103.005","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.method-call.compile.1.err"
+"method-call","mir-c2m","1","OK","120.788","0.591","4151.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.method-call"
+"mandelbrot","gcc-15","0","OK","149.164","NA","10318.641","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.mandelbrot"
+"mandelbrot","clang","0","OK","139.282","NA","10586.734","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.mandelbrot"
+"mandelbrot","mir-c2m","0","OK","135.511","0.328","4501.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.mandelbrot"
+"mandelbrot","gcc-15","1","OK","148.877","NA","3151.572","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.mandelbrot"
+"mandelbrot","clang","1","OK","140.193","NA","2910.442","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.mandelbrot"
+"mandelbrot","mir-c2m","1","OK","124.927","0.400","3332.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.mandelbrot"
+"nbody","gcc-15","0","OK","157.218","NA","9017.622","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.nbody"
+"nbody","clang","0","OK","142.119","NA","10701.933","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.nbody"
+"nbody","mir-c2m","0","COMPILE_FAIL","139.124","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.nbody.compile.err"
+"nbody","gcc-15","1","OK","165.533","NA","2852.425","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.nbody"
+"nbody","clang","1","OK","154.581","NA","2711.537","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.nbody"
+"nbody","mir-c2m","1","COMPILE_FAIL","125.704","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.nbody.compile.err"
+"sieve","gcc-15","0","OK","146.078","NA","5032.428","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.sieve"
+"sieve","clang","0","OK","137.391","NA","13729.750","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.sieve"
+"sieve","mir-c2m","0","OK","130.733","0.226","4843.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.sieve"
+"sieve","gcc-15","1","OK","150.246","NA","2787.239","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.sieve"
+"sieve","clang","1","OK","144.944","NA","2512.013","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.sieve"
+"sieve","mir-c2m","1","OK","120.643","0.265","4170.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.sieve"
+"spectral-norm","gcc-15","0","OK","153.761","NA","14877.450","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.spectral-norm"
+"spectral-norm","clang","0","OK","142.870","NA","14844.299","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.spectral-norm"
+"spectral-norm","mir-c2m","0","COMPILE_FAIL","121.234","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.spectral-norm.compile.err"
+"spectral-norm","gcc-15","1","OK","159.815","NA","4049.676","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.spectral-norm"
+"spectral-norm","clang","1","OK","151.041","NA","4075.697","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.spectral-norm"
+"spectral-norm","mir-c2m","1","COMPILE_FAIL","125.809","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.spectral-norm.compile.err"
+"strcat","gcc-15","0","OK","151.030","NA","5970.900","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.strcat"
+"strcat","clang","0","OK","140.669","NA","5943.676","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.strcat"
+"strcat","mir-c2m","0","OK","148.012","0.484","5773.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.strcat"
+"strcat","gcc-15","1","OK","156.343","NA","4860.543","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.strcat"
+"strcat","clang","1","OK","143.566","NA","4804.322","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.strcat"
+"strcat","mir-c2m","1","OK","149.920","0.585","5772.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.strcat"
diff --git a/scripts/opt_bench_compare.py b/scripts/opt_bench_compare.py
@@ -1,15 +1,27 @@
 #!/usr/bin/env python3
 """Compare cfree tools vs a baseline (default: gcc-15 -O0).
 
+The fixed compilers (gcc/clang/MIR) don't change as cfree evolves, so their
+timings are measured once into a cached CSV (scripts/opt_bench_baseline.csv via
+`CFREE_OPT_BENCH_MODE=baseline scripts/opt_bench.sh`). This script auto-merges
+that cache with the fresh cfree-only results so comparisons still work even when
+the results CSV contains only cfree rows.
+
 Usage:
     python3 scripts/opt_bench_compare.py [results.csv]
     python3 scripts/opt_bench_compare.py results.csv --base-tool gcc-15 --base-opt 0
+    python3 scripts/opt_bench_compare.py results.csv --baseline-csv path.csv
+    python3 scripts/opt_bench_compare.py results.csv --no-baseline
 """
 import csv
 import math
 import os
 import sys
 
+DEFAULT_BASELINE_CSV = os.path.join(
+    os.path.dirname(os.path.abspath(__file__)), "opt_bench_baseline.csv"
+)
+
 
 def fnum(v):
     try:
@@ -51,6 +63,7 @@ def main():
     csv_path = None
     base_tool_arg = None
     base_opt = "0"
+    baseline_csv = DEFAULT_BASELINE_CSV
     i = 0
     while i < len(args):
         if args[i] == "--base-tool" and i + 1 < len(args):
@@ -59,6 +72,12 @@ def main():
         elif args[i] == "--base-opt" and i + 1 < len(args):
             base_opt = args[i + 1]
             i += 2
+        elif args[i] == "--baseline-csv" and i + 1 < len(args):
+            baseline_csv = args[i + 1]
+            i += 2
+        elif args[i] == "--no-baseline":
+            baseline_csv = None
+            i += 1
         else:
             csv_path = args[i]
             i += 1
@@ -73,6 +92,26 @@ def main():
     with open(csv_path, newline="") as f:
         ok = [r for r in csv.DictReader(f) if r["status"] == "OK"]
 
+    # Benches measured by this run (typically cfree-only); used to scope output.
+    result_benches = {r["bench"] for r in ok}
+
+    # Merge cached baseline timings (gcc/clang/MIR) unless disabled. Rows already
+    # present in the fresh CSV win, so an explicit baseline run still overrides.
+    if (
+        baseline_csv
+        and os.path.exists(baseline_csv)
+        and os.path.abspath(baseline_csv) != os.path.abspath(csv_path)
+    ):
+        seen = {(r["tool"], r["opt"], r["bench"]) for r in ok}
+        with open(baseline_csv, newline="") as f:
+            for r in csv.DictReader(f):
+                if r["status"] != "OK":
+                    continue
+                key = (r["tool"], r["opt"], r["bench"])
+                if key not in seen:
+                    ok.append(r)
+                    seen.add(key)
+
     if not ok:
         sys.exit("compare: no OK rows in CSV")
 
@@ -96,7 +135,10 @@ def main():
         sys.exit(f"compare: no rows for tool={base_tool} opt={base_opt}")
 
     idx = {(r["tool"], r["opt"], r["bench"]): r for r in ok}
-    all_benches = sorted(baseline)
+    # Only report benches this run covered (and that the baseline has). When the
+    # CSV is itself a baseline dump (no separate result benches), show them all.
+    shown = (result_benches & set(baseline)) if result_benches else set(baseline)
+    all_benches = sorted(shown) or sorted(baseline)
     all_opts = sorted(
         {r["opt"] for r in ok},
         key=lambda x: (int(x) if x.isdigit() else 99, x),
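
The merge and scoping rules in the hunks above can be tried in isolation. The following is a minimal sketch, not the script itself: `merge_baseline` and `shown_benches` are hypothetical helper names, the `run_ms` column name is an assumption (the CSV header is not shown in this diff), and the two "fresh" timings are made up for illustration. The baseline timings are copied from the `sieve` rows above.

```python
def merge_baseline(fresh_rows, baseline_rows):
    """Merge cached baseline rows into fresh rows; fresh rows win on key clashes."""
    merged = [r for r in fresh_rows if r["status"] == "OK"]
    seen = {(r["tool"], r["opt"], r["bench"]) for r in merged}
    for r in baseline_rows:
        if r["status"] != "OK":
            continue  # failed baseline builds never enter the comparison
        key = (r["tool"], r["opt"], r["bench"])
        if key not in seen:
            merged.append(r)
            seen.add(key)
    return merged


def shown_benches(result_benches, baseline_benches):
    """Report only benches the run covered, falling back to every baseline bench."""
    if result_benches:
        shown = result_benches & baseline_benches
    else:
        shown = set(baseline_benches)
    return sorted(shown) or sorted(baseline_benches)


# "run_ms" is an assumed column name; the fresh timings below are invented.
fresh = [
    {"bench": "sieve", "tool": "cfree", "opt": "1", "status": "OK", "run_ms": "3000.0"},
    {"bench": "sieve", "tool": "gcc-15", "opt": "0", "status": "OK", "run_ms": "5000.0"},
]
baseline = [
    {"bench": "sieve", "tool": "gcc-15", "opt": "0", "status": "OK", "run_ms": "5032.428"},
    {"bench": "sieve", "tool": "clang", "opt": "0", "status": "OK", "run_ms": "13729.750"},
    {"bench": "nbody", "tool": "mir-c2m", "opt": "0", "status": "COMPILE_FAIL", "run_ms": "NA"},
]

merged = merge_baseline(fresh, baseline)
for row in merged:
    print(row["bench"], row["tool"], row["opt"], row["run_ms"])
print(shown_benches({"sieve"}, {"sieve", "nbody"}))
```

The fresh `gcc-15` row shadows the cached one, the cached `clang` row is appended, and the failed `mir-c2m` row is dropped. Because the run covered only `sieve`, the per-bench output is scoped to that bench even though the baseline also lists `nbody`.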
