commit a3d48fb69c42dd07b58715354a1a6bda6673bbf4
parent c04e8d39488b1599fc07d0ca7633029866ef5fce
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Thu, 28 May 2026 11:46:28 -0700
bench: cache fixed-compiler baseline; default opt_bench to cfree-only
gcc/clang/MIR codegen does not change as cfree evolves, so measure those
compilers once into a checked-in cache and reuse it for comparisons:
- opt_bench.sh gains CFREE_OPT_BENCH_MODE (cfree|baseline). Default "cfree"
mode runs only cfree/cfree-run and no longer requires c2m or the reference
compilers; "baseline" mode sweeps gcc/clang/MIR over the full set at O0+O1
and writes scripts/opt_bench_baseline.csv. Adds SKIP_CFREE; gates the c2m
build and cfree-binary check on what actually runs.
- opt_bench_compare.py auto-merges the cached baseline (fresh rows win;
--baseline-csv / --no-baseline to override) and scopes per-bench output to
the benches the run covered. write_summary merges it too.
- scripts/opt_bench_baseline.csv: full suite x O0/O1 x gcc-15/clang/mir-c2m.
- OPT_O1_PERF_TODO.md: Reproducing section updated to the cfree-only flow.
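
Illustrative sketch of the "fresh rows win" merge and the speedup definition
used in the doc (reference_time / cfree_time). It mirrors the key that
write_summary uses in this commit, (tool, opt, bench); the body of
opt_bench_compare.py is not shown here, so this is an assumption about its
behavior, not its actual code. The helper names (merge_baseline, speedup) and
the sample timings are hypothetical.

```python
import csv
import io


def merge_baseline(fresh_rows, baseline_rows):
    """Return fresh rows plus baseline rows whose key is not already present."""
    merged = list(fresh_rows)
    seen = {(r["tool"], r["opt"], r["bench"]) for r in merged}
    for r in baseline_rows:
        key = (r["tool"], r["opt"], r["bench"])
        if key not in seen:  # a fresh measurement always wins over the cache
            merged.append(r)
            seen.add(key)
    return merged


def speedup(rows, bench, ref_tool, ref_opt, cfree_opt="1"):
    """reference_time / cfree_time; >1 means cfree is faster, None if missing."""
    times = {(r["tool"], r["opt"], r["bench"]): r["runtime_ms"] for r in rows}
    ref = times.get((ref_tool, ref_opt, bench))
    mine = times.get(("cfree", cfree_opt, bench))
    if ref in (None, "", "NA") or mine in (None, "", "NA"):
        return None
    return float(ref) / float(mine)


# Hypothetical sample data, not taken from any real run.
FRESH = """bench,tool,opt,runtime_ms
sieve,cfree,1,5000.0
sieve,gcc-15,0,4000.0
"""
BASELINE = """bench,tool,opt,runtime_ms
sieve,gcc-15,0,5000.0
sieve,mir-c2m,1,4500.0
"""

fresh = list(csv.DictReader(io.StringIO(FRESH)))
baseline = list(csv.DictReader(io.StringIO(BASELINE)))
rows = merge_baseline(fresh, baseline)

print(len(rows))
print(speedup(rows, "sieve", "gcc-15", "0"))
print(speedup(rows, "sieve", "mir-c2m", "1"))
```

In this sketch the cached gcc-15 row is dropped because the fresh run already
measured that (tool, opt, bench) key, while the mir-c2m row is pulled in from
the cache.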
Diffstat:
4 files changed, 359 insertions(+), 176 deletions(-)
diff --git a/doc/OPT_O1_PERF_TODO.md b/doc/OPT_O1_PERF_TODO.md
@@ -20,161 +20,157 @@ reference exists).
## Current standings
Numbers below are from a 3-run sweep on aarch64/Apple (`COMPILE_REPEATS=3`,
-`RUN_REPEATS=3`, best-of), runtime in ms; speedup = reference_time / cfree_time
-(>1 means cfree is faster).
+`RUN_REPEATS=3`, best-of), runtime in ms; speedup = reference_time /
+cfree_time (>1 means cfree is faster). gcc-O0 and mir-O1 columns come from
+the cached baseline in `scripts/opt_bench_baseline.csv`; regenerate with
+`CFREE_OPT_BENCH_MODE=baseline scripts/opt_bench.sh`.
| bench | cfree -O1 | gcc -O0 | vs gcc-O0 | mir -O1 | vs mir-O1 | behind |
| --- | ---: | ---: | ---: | ---: | ---: | --- |
-| binary-trees | 3106 | 2634 | **0.85×** (slower) | n/a¹ | — | gcc |
-| lists | 5243 | 8817 | 1.68× ✓ | 4978 | 0.95× | mir |
-| hash2 | 5372 | 7365 | 1.37× ✓ | 3841 | **0.72×** | mir |
-| sieve | 4991 | 5023 | 1.01× | 4006 | **0.80×** | gcc (~tied), mir |
+| binary-trees | 3146 | 2639 | **0.84×** (slower) | n/a¹ | — | gcc |
+| lists | 4843 | 8868 | 1.83× ✓ | 4997 | 1.03× | mir |
+| hash2 | 4988 | 7481 | 1.50× ✓ | 3863 | **0.77×** | mir |
+| sieve | 5148 | 5077 | 0.99× (~tied) | 4028 | **0.78×** | gcc (~tied), mir |
+| mandelbrot | 3658 | 10274 | 2.81× ✓ | 3346 | 0.91× | mir |
+| strcat | 5899 | 5965 | 1.01× (~tied) | 5775 | 0.98× | both (~tied) |
-¹ mir-c2m fails to compile `binary-trees` in our setup, so only the gcc
-comparison applies there.
-
-Movement since the previous refresh (single-run numbers):
-- `lists` went from 0.58× to 0.95× vs mir and 1.02× to 1.68× vs gcc — the
- recent O1 codegen wins + tiny-function inliner closed most of the gap. Still
- not 10% past mir, but no longer the worst case.
-- The other three are roughly unchanged; the bar still hasn't been cleared
- against mir on hash2/sieve, and binary-trees is still slower than gcc-O0.
+¹ mir-c2m fails to compile `binary-trees`, so only the gcc comparison applies.
## Per-benchmark notes
-### binary-trees — slower than unoptimized gcc (still highest priority)
-The only case where cfree `-O1` is *slower than gcc -O0* (0.85×). Workload is
-recursive tree build/walk: tiny leaf-ish functions called millions of times
-plus `malloc`/`free` on each node. Most wall-clock is in `malloc`; the
-compiler-visible slack lives in the per-call overhead. Inspecting `ItemCheck`,
-`BottomUpTree`, `DeleteTree`, `NewTreeNode` shows ~4–6 redundant moves per
-recursive call site:
-
-```
-ItemCheck (inner recursive call):
- ldr x8, [x19]
- mov x11, x8 <-- redundant copy
- mov x0, x11 <-- could be `mov x0, x8` straight from the ldr
- bl ItemCheck
- mov x8, x0 <-- materialize the return into the SSA reg
- movz x9, 0x1 <-- the `1 +` constant, not hoisted (only 1 use, fine)
- add x20, x9, x8
-```
-
-Drivers for the gap, in order:
-1. **Copy coalescing leaves intermediate `mov` chains** at call boundaries:
- `mov xN, xM; mov x0, xN` instead of folding through. Every recursive call
- in `ItemCheck`/`BottomUpTree`/`DeleteTree` has at least one such redundant
- pair. With ~7.6M recursive calls (depth=19) this is tens of millions of
- wasted ops.
-2. **Standard 9-insn prologue/epilogue** (`sub sp; add x17; stp x29,x30;
- add x29` + mirror on exit) on tiny non-leaf functions like `NewTreeNode`.
- Leaf detection + skipping FP save/restore where the callee makes no calls
- would help (`ItemCheck` recurses so it's not a leaf; `NewTreeNode` *does*
- call malloc, also not a leaf — but neither needs the FP-frame setup we
- currently emit).
-3. **`NewTreeNode` spills x19/x20** as callee-saves only to ferry the two
- incoming args across one `malloc` call — pure overhead vs keeping the
- live values in caller-save scratch + reload from a small spill slot.
-
-This is where the biggest absolute wall-clock win is, even though the per-op
-codegen looks roughly equivalent to gcc -O0.
-
-### hash2 — 0.72× vs mir
-Clears the gcc bar (1.37×) but is the worst against mir. The hot loop is
-`ht_hashcode`:
-
-```
-0xc88: movz x9, 0x5 <-- loop-invariant constant, NOT hoisted
-0xc8c: mul x14, x9, x11 <-- could be `add x14, x11, x11, lsl #2` (5*v)
-0xc90: ldrb w11, [x8]
-0xc94: sxtb x13, w11
-0xc98: add x11, x13, x14
-0xc9c: add x8, x8, #1
-0xca0: ldrb w13, [x8]
-0xca4: cmp w13, #0
-0xca8: b.ne 0xc88
-```
-
-Two clear wins:
-1. **Loop-invariant immediate operand not hoisted.** The IR carries the `5`
- as an inline `imm:` operand on the `binop mul`; `opt_machinize_native`
- leaves it there, the emitter materializes it with `movz x9, 0x5` inside
- the loop on every iteration. `opt_hoist_loop_consts` only hoists explicit
- `IR_LOAD_IMM` defs (see `src/opt/pass_addr_fold.c:835`), so this never
- becomes a hoist candidate.
-
- Fix sketch: between `opt_machinize_native` and `opt_hoist_loop_consts`,
- add a pass that lowers non-zero immediate operands on machine-irrelevant
- positions (mul/add/sub second operand outside imm-fold range, store value
- operand) to `IR_LOAD_IMM` + reg-use. Then the existing hoist pass picks
- them up. Same fix unblocks sieve (see below).
-2. **`mul x14, x9, x11` for a constant 5×** — strength-reduce small-integer
- multiplications to `add x, y, y, lsl #N` / `sub` sequences for power-of-two
- neighbours. Independent of the hoist fix.
-
-### sieve — 0.80× vs mir, ~tied with gcc
-Behind both. Two hot inner loops; both leave clear codegen on the table.
-
-Init loop (`flags[i] = 1`):
-```
-0x2ac: mov x13, x8 <-- coalesce miss (copy of the IV)
-0x2b0: movz w9, 0x1 <-- loop-invariant, NOT hoisted (same as hash2)
-0x2b4: strb w9, [x19, x13]
-0x2b8: add x8, x8, #1
-0x2bc: cmp x8, #2, lsl #12
-0x2c0: b.le 0x2ac
-```
-
-Mark-multiples loop (`flags[k] = 0`):
-```
-0x2ec: mov x14, x13 <-- coalesce miss; could just use x13 for the index
-0x2f0: strb wzr, [x19, x14] (strb wzr is already good — zero-source fast path
-0x2f4: add x13, x13, x8 fires here)
-0x2f8: cmp x13, #2, lsl #12
-0x2fc: b.le 0x2ec
-```
-
-So:
-1. **Same imm-operand hoist gap** as hash2 — the inline `imm:1` on `store`
- becomes a fresh `movz w9, 0x1` per iteration. The existing zero-source
- fast path (`pass_native_emit.c:790`) handles the strb-zero case in the
- second loop, which is why only the init loop has the `movz`.
-2. **Copy not coalesced for the IV/index** in both loops. The `mov x13, x8`
- and `mov x14, x13` are pure register-allocator copies that the post-RA
- passes don't clean up. Worth tracing what `opt_lower_to_mir` + the post-RA
- jump-cleanup leaves; these moves likely come from how the IR copy ops in
- block 4 of `sieve_min`'s IR survive RA (`copy def=v9 opnds=[v14,v8]` +
- `copy def=v10 opnds=[v13,v12]` — two copies feeding one store).
-
-### lists — 0.95× vs mir (no longer bleeding)
-Last refresh was 0.58×; current 0.95×. Now within shot of the 1.10× bar.
-Same imm-hoist + copy-coalesce wins should close the remaining ~15%.
-
-## Cross-cutting fixes (would help multiple benches at once)
-
-These are the two themes that recur across the four benches above:
-
-1. **Hoist loop-invariant *immediate operands*, not just `IR_LOAD_IMM`.**
- `opt_hoist_loop_consts` in `src/opt/pass_addr_fold.c` only sees defs of
- `IR_LOAD_IMM`. Constants that arrive at emit as inline `imm:` operands on
- `binop`/`store`/etc. are re-materialized by the emitter inside the loop
- every iteration (sieve-init, hash2-hashcode). A pre-hoist lowering pass
- that walks loop-body insts, takes their non-foldable imm operands, and
- converts them to `IR_LOAD_IMM + reg use` would let the existing hoist
- pass do the rest.
-2. **Copy coalescing leaves redundant `mov xN, xM` artifacts.** Seen in
- sieve (both loops, IV/index copy), binary-trees (every recursive call
- site has at least one redundant `mov` pair), and likely in lists. Most
- of these are SSA-level `copy` ops that survive into machine code because
- post-RA copy elimination is doing less than it could. Worth a focused
- look at what `opt_mir_combine` / `opt_mir_dce` see for these inputs.
-
-Targeted secondary wins:
-- Strength-reduce small-integer multiplications by constants to `add`/`sub`
- + shifted-register forms (hash2 `5 *`).
-- Slimmer prologue/epilogue for functions that don't need FP-frame setup
- (binary-trees).
+### binary-trees — slower than unoptimized gcc (highest priority)
+The only case where cfree `-O1` is *slower than gcc -O0* (0.84×). Workload is
+recursive tree build/walk: four tiny functions (`NewTreeNode`, `ItemCheck`,
+`BottomUpTree`, `DeleteTree`) called ~7.6M times at depth=19, plus a
+`malloc`/`free` per node. The **body** of each function is fine — cfree -O1
+keeps the recurring pointer in a callee-save (x19) where gcc -O0 spills and
+reloads it three times per call. The gap is entirely in **fixed per-call
+overhead**, every instruction of which is multiplied by 7.6M.
+
+Open items, in priority order (most recent disasm in
+`/tmp/mc/binary-trees.cfree.o`):
+
+1. **Useless leading `b PC+4` at every function entry.** All four functions
+ still start with this:
+ ```
+ sub sp, sp, #0x20
+ stp x29, x30, [sp, #0x10]
+ add x29, sp, #0x10
+ stp x20, x19, [x29, #-0x10]
+ b PC+4 <-- branches to the very next instruction
+ mov x19, x0
+ ```
+ Root cause: commit `9bd61e8` ("emit param_decls into a dedicated prologue
+ block") added an empty-on-emit entry block ahead of the function body.
+ `opt_jump_cleanup`'s helper `empty_fallthrough_block` in
+ `src/opt/pass_jump.c` then explicitly bails out when `block == f->entry`,
+ so the empty entry block is never merged into its single successor.
+ Lifting that guard (with whatever safety condition `9bd61e8` was protecting
+ against — likely "first body block is a loop header") would let
+ jump-cleanup absorb the prologue block in the common case. **+1 insn ×
+ 7.6M calls.**
+
+2. **Prologue compaction: 4-insn → 2-insn pre-indexed.** Half-done. cfree
+ today emits:
+ ```
+ sub sp, sp, #N
+ stp x29, x30, [sp, #M]
+ add x29, sp, #M
+ stp x20, x19, [x29, #-K] ; or `stur x19, ...` when only x19 live
+ ```
+ `bcca0bd` added `slim_prologue` in `src/arch/aa64/native.c`, which uses
+ the pre-indexed `stp x29, x30, [sp, #-N]!` form — but only when
+ `ncallee_saves == 0`. The common path (one or more callee-saves) still
+ goes through `slim_small_frame`, which doesn't pre-index. Extend
+ `slim_small_frame` to use `aa64_stp64_pre` for the FP-save when the frame
+ size is known in advance, then emit the callee-save spills as separate
+ `stur`s afterward. Saves 1 insn at entry + 1 at every exit. **~2 insns ×
+ 7.6M calls.**
+
+3. **Zero materialized through a temp in `BottomUpTree` leaf path.**
+ `NewTreeNode(NULL, NULL)` still emits:
+ ```
+ c44: mov x8, #0x0
+ c48: mov x0, x8
+ c4c: mov x8, #0x0
+ c50: mov x1, x8
+ c54: bl _NewTreeNode
+ ```
+ Should be `mov x0, #0; mov x1, #0`. The sibling fix in `9ac2416` got the
+ `ldr` → call-arg case, but `IR_LOAD_IMM` sources don't seem to participate
+ in the ABI aliasing-hint propagation in `pass_lower.c`. Likely a small
+ extension to `set_preg_pref_for_call_args` / `propagate_hint_through_copies`
+ to also fire when the source op is `IR_LOAD_IMM`. **+2 insns × 524k leaf
+ calls.**
+
+4. **Trailing `b A; A: b B` pair in `DeleteTree`'s if/else merge.**
+ ```
+ c9c: mov x0, x19
+ ca0: bl _free
+ ca4: b 0xcac <-- jumps directly to the epilogue label
+ ca8: b 0xc9c <-- unreachable; left over from else side
+ cac: ldur x19, [x29, #-0x8]
+ ```
+ Classic jump-thread target. `cleanup_layout_fallthrough_branches` in
+ `pass_jump.c` doesn't pick up this shape (two consecutive `b`s where
+ the second is unreachable). +1 insn per `DeleteTree` invocation
+ (7.6M calls).
+
+Together these would save 4–6 insns per call, roughly 30–46M instructions
+(4–6 × 7.6M) at depth=19. Body quality is already on par with gcc-O0; this
+is all fixed per-call overhead.
+
+### mandelbrot — 0.91× vs mir (close to the bar)
+Inner loop is FP-heavy (`Tr*Tr + Ti*Ti < 4.0` Mandelbrot escape test +
+4 fmuls + 2 fadds per iter). Hasn't been deeply investigated since the
+recent codegen batch. Worth disassembling the hot loop and comparing
+against mir to see what specifically is still on the table — likely some
+combination of FP register allocation, vectorization (which we don't do),
+and constant-pool material.
+
+### strcat — 0.98× vs mir, ~tied with gcc
+Small gap; not worth standalone investigation yet. Should naturally
+absorb any remaining cross-cutting wins.
+
+### hash2 — 0.77× vs mir (still the worst against mir)
+The previously-noted hoisting and strength-reduction wins landed (and
+moved hash2 from 0.72× to 0.77×), but mir is still 1.29× faster. Remaining
+gap is in the parts of the loop the prior items didn't touch — most
+likely the modulo `val % ht->size` (mir probably emits a Barrett/reciprocal
+multiply for the small-divisor case where we still emit `udiv`) and the
+`strcmp` probe shape. Worth a fresh disassembly read of `ht_hashcode` and
+the probe loop in `ht_find_new` against mir's output.
+
+### sieve — 0.78× vs mir, ~tied with gcc
+Loop-invariant `movz` and IV copies are gone; remaining gap is structural.
+mir is ~1.28× faster on the same loop shape (`flags[k] = 0` strided
+store + `flags[i] = 1` init). Candidate gaps: address-mode folding into
+the store (using `[x19, x8]` vs `add` + bare addr), and whether mir is
+auto-vectorizing the init loop.
+
+### lists — 1.03× vs mir (close to the bar)
+Up from 0.95×. Doubly-linked list traversal + splice. Within ~5% of mir;
+worth comparing the splice inner loop directly.
+
+## Cross-cutting fixes (open)
+
+These help several benches. All four are open or partial; the binary-trees
+items above are the most concrete tests for whether each is complete.
+
+1. **Drop the leading `b PC+4` at function entry.** See binary-trees item 1.
+ Affects every cfree-compiled function, not just binary-trees.
+
+2. **Compact FP-frame prologue/epilogue.** See binary-trees item 2. The
+ 2-insn pre-indexed form is wired in for the no-callee-save case; needs
+ to be extended to small frames with 1–2 callee-saves. Biggest absolute
+ payoff on call-heavy benches.
+
+3. **Hard-register copy coalescing for `IR_LOAD_IMM` sources.** See
+ binary-trees item 3. The hint-propagation path covers `ldr` → call-arg
+ but skips immediates.
+
+4. **Jump-thread the `b A; A: b B` shape.** See binary-trees item 4.
+ General `pass_jump.c` cleanup, not bench-specific.
## Reproducing
@@ -182,15 +178,22 @@ Targeted secondary wins:
# Build the optimized compiler first (clean release):
rm -rf build/release && make RELEASE=1 bin
-# Run just these four with 3 repeats (best-of), O0+O1, gcc + cfree + mir:
+# Run the still-open or stale-number benches with 3 repeats (best-of). The
+# default mode measures only cfree (fast iteration) and the trailing compare
+# step automatically pulls gcc/mir numbers from the cached
+# scripts/opt_bench_baseline.csv — no need to re-run the fixed compilers:
CFREE="$PWD/build/release/cfree" \
-CFREE_OPT_BENCHES="binary-trees lists hash2 sieve" \
+CFREE_OPT_BENCHES="binary-trees lists hash2 sieve mandelbrot strcat" \
CFREE_OPT_BENCH_LEVELS="0 1" \
CFREE_OPT_BENCH_COMPILE_REPEATS=3 CFREE_OPT_BENCH_RUN_REPEATS=3 \
-CFREE_OPT_BENCH_SKIP_GCC=0 CFREE_OPT_BENCH_SKIP_CLANG=1 CFREE_OPT_BENCH_SKIP_MIR=0 \
bash scripts/opt_bench.sh
```
+The comparison against the cached baseline prints at the end of the run; re-run
+it standalone any time with `python3 scripts/opt_bench_compare.py`. Only
+regenerate the baseline cache (the `CFREE_OPT_BENCH_MODE=baseline` command
+above) when the host, the reference compilers, or the benchmark sources change.
+
Per-iteration codegen for a single function is easiest to inspect via the
optimizer's staged IR dump (`CFREE_DUMP=pre-emit cfree cc -O1 -c bench.c ...`,
which panics after printing the pre-emit MIR) plus `objdump`/`lldb`
@@ -198,11 +201,7 @@ disassembly of the hot function.
## Notes / caveats
-- Numbers above are best-of-3; re-run with the same `COMPILE_REPEATS=3
- RUN_REPEATS=3` after a change to confirm movement. Anything near 1.0× (e.g.
- `sieve` vs gcc) is within noise and should be confirmed with a separate
- re-run.
-- Not on this list but worth revisiting once the four above improve: `hash`,
- `funnkuch-reduce`, `strcat` (previously marginal at single-run).
+- Numbers above mix sweeps from different revisions; re-run with the same
+ `COMPILE_REPEATS=3 RUN_REPEATS=3` after a change to confirm movement.
- `cfree-run` (JIT) shares this codegen, so its runtimes track `cfree cc -O1`;
fixing these helps both paths.
diff --git a/scripts/opt_bench.sh b/scripts/opt_bench.sh
@@ -10,25 +10,59 @@ GCC="${GCC:-gcc-15}"
OUT_DIR="${CFREE_OPT_BENCH_OUT:-$ROOT/build/bench/opt}"
CFREE_SYSROOT="${CFREE_OPT_BENCH_SYSROOT:-}"
-# Full benchmark set (override with CFREE_OPT_BENCHES to use it):
-# array binary-trees except funnkuch-reduce hash hash2 heapsort lists matrix
+# Full benchmark set used for baseline caching (override with CFREE_OPT_BENCHES):
+# array binary-trees funnkuch-reduce hash hash2 heapsort lists matrix
# method-call mandelbrot nbody sieve spectral-norm strcat
-# `except` in particular runs for many seconds at -O0, which is why the default
-# below is a small, quick subset at O0+O1 (skip the heavy O2 sweep).
+# `except` is excluded from the full set: it's setjmp/longjmp-bound and runs
+# for ~2.5 minutes per O0 sample, which inflates wall-clock without telling us
+# anything about codegen quality. Pass it explicitly via CFREE_OPT_BENCHES if
+# wanted.
+FULL_BENCHES="array binary-trees funnkuch-reduce hash hash2 heapsort lists matrix method-call mandelbrot nbody sieve spectral-norm strcat"
+
+# Cached baseline timings for the fixed compilers (gcc/clang/MIR). Their codegen
+# does not change as we iterate on cfree, so we measure them once into this file
+# (checked into scripts/) and reuse it for comparisons. Regenerate with
+# `CFREE_OPT_BENCH_MODE=baseline scripts/opt_bench.sh`.
+BASELINE_CSV="${CFREE_OPT_BENCH_BASELINE_CSV:-$ROOT/scripts/opt_bench_baseline.csv}"
+
+# Mode selects which tools run and where results go:
+# cfree (default) - only cfree/cfree-run; writes build/bench/opt/results.csv
+# baseline - only gcc/clang/MIR over the full set; writes BASELINE_CSV
+MODE="${CFREE_OPT_BENCH_MODE:-cfree}"
+
DEFAULT_LEVELS="0 1"
-DEFAULT_BENCHES="array hash hash2 matrix sieve"
DEFAULT_COMPILE_REPEATS="3"
DEFAULT_RUN_REPEATS="3"
+
+case "$MODE" in
+ baseline)
+ # Measure the fixed compilers across the full set; cfree is skipped.
+ DEFAULT_BENCHES="$FULL_BENCHES"
+ DEF_SKIP_GCC=0; DEF_SKIP_CLANG=0; DEF_SKIP_MIR=0; DEF_SKIP_CFREE=1
+ DEFAULT_CSV="$BASELINE_CSV"
+ ;;
+ cfree)
+ # Default working mode: only cfree, compared against the cached baseline.
+ DEFAULT_BENCHES="array hash hash2 matrix sieve"
+ DEF_SKIP_GCC=1; DEF_SKIP_CLANG=1; DEF_SKIP_MIR=1; DEF_SKIP_CFREE=0
+ DEFAULT_CSV="$OUT_DIR/results.csv"
+ ;;
+ *)
+ printf 'opt-bench: unknown CFREE_OPT_BENCH_MODE: %s (want cfree|baseline)\n' "$MODE" >&2
+ exit 2
+ ;;
+esac
+
LEVELS="${CFREE_OPT_BENCH_LEVELS:-$DEFAULT_LEVELS}"
BENCHES="${CFREE_OPT_BENCHES:-$DEFAULT_BENCHES}"
COMPILE_REPEATS="${CFREE_OPT_BENCH_COMPILE_REPEATS:-$DEFAULT_COMPILE_REPEATS}"
RUN_REPEATS="${CFREE_OPT_BENCH_RUN_REPEATS:-$DEFAULT_RUN_REPEATS}"
-# Per-tool skip flags. By default keep gcc (the baseline) and skip clang.
-# Override individually, e.g. CFREE_OPT_BENCH_SKIP_MIR=1 or
-# CFREE_OPT_BENCH_SKIP_CLANG=0.
-SKIP_GCC="${CFREE_OPT_BENCH_SKIP_GCC:-0}"
-SKIP_CLANG="${CFREE_OPT_BENCH_SKIP_CLANG:-1}"
-SKIP_MIR="${CFREE_OPT_BENCH_SKIP_MIR:-0}"
+# Per-tool skip flags. Defaults come from the mode above; override individually,
+# e.g. CFREE_OPT_BENCH_SKIP_MIR=1 or CFREE_OPT_BENCH_SKIP_CFREE=0.
+SKIP_GCC="${CFREE_OPT_BENCH_SKIP_GCC:-$DEF_SKIP_GCC}"
+SKIP_CLANG="${CFREE_OPT_BENCH_SKIP_CLANG:-$DEF_SKIP_CLANG}"
+SKIP_MIR="${CFREE_OPT_BENCH_SKIP_MIR:-$DEF_SKIP_MIR}"
+SKIP_CFREE="${CFREE_OPT_BENCH_SKIP_CFREE:-$DEF_SKIP_CFREE}"
MIR_MAKE="${MIR_MAKE:-}"
case "$(uname -s 2>/dev/null || true)" in
@@ -56,7 +90,7 @@ CFLAGS_EXTRA="${CFREE_OPT_BENCH_CFLAGS:-$DEFAULT_CFLAGS_EXTRA}"
CFREE_FLAGS_EXTRA="${CFREE_OPT_BENCH_CFREE_FLAGS:-}"
CFREE_RUN_FLAGS_EXTRA="${CFREE_OPT_BENCH_CFREE_RUN_FLAGS:-}"
-CSV="$OUT_DIR/results.csv"
+CSV="${CFREE_OPT_BENCH_CSV:-$DEFAULT_CSV}"
SUMMARY="$OUT_DIR/summary.md"
LOG_DIR="$OUT_DIR/logs"
BIN_DIR="$OUT_DIR/bin"
@@ -357,16 +391,28 @@ bench_cfree_run() {
}
write_summary() {
- python3 - "$CSV" "$SUMMARY" "$(tool_label "$GCC")" <<'PY'
+ python3 - "$CSV" "$SUMMARY" "$(tool_label "$GCC")" "$BASELINE_CSV" <<'PY'
import csv
import math
+import os
import sys
from collections import defaultdict
-csv_path, out_path, base_tool = sys.argv[1:4]
+csv_path, out_path, base_tool, baseline_path = sys.argv[1:5]
with open(csv_path, newline="") as f:
rows = list(csv.DictReader(f))
+# Merge cached baseline timings (gcc/clang/MIR) so the summary tables include
+# the fixed compilers even though the default run only measures cfree.
+seen = {(r["tool"], r["opt"], r["bench"]) for r in rows}
+if baseline_path and os.path.exists(baseline_path) and os.path.abspath(baseline_path) != os.path.abspath(csv_path):
+ with open(baseline_path, newline="") as f:
+ for r in csv.DictReader(f):
+ key = (r["tool"], r["opt"], r["bench"])
+ if key not in seen:
+ rows.append(r)
+ seen.add(key)
+
def fnum(v):
if v in ("", "NA", None):
return None
@@ -481,13 +527,16 @@ if [ ! -d "$BENCH_DIR" ]; then
printf 'opt-bench: benchmark directory not found: %s\n' "$BENCH_DIR" >&2
exit 2
fi
-if [ ! -x "$CFREE" ]; then
+if [ "$SKIP_CFREE" != "1" ] && [ ! -x "$CFREE" ]; then
printf 'opt-bench: cfree binary not found: %s\n' "$CFREE" >&2
printf 'opt-bench: run `make bin` or set CFREE=/path/to/cfree\n' >&2
exit 2
fi
-ensure_mir || exit 2
+# c2m is only required when MIR is part of the run.
+[ "$SKIP_MIR" != "1" ] && { ensure_mir || exit 2; }
+printf 'opt-bench: mode: %s\n' "$MODE"
+printf 'opt-bench: csv: %s\n' "$CSV"
printf 'opt-bench: output: %s\n' "$OUT_DIR"
printf 'opt-bench: benches: %s\n' "$BENCHES"
printf 'opt-bench: levels: %s\n' "$LEVELS"
@@ -505,12 +554,20 @@ for bench in $BENCHES; do
for opt in $LEVELS; do
[ "$SKIP_GCC" != "1" ] && bench_native_tool "$bench" "$(tool_label "$GCC")" "$GCC" "$opt" "$src" "$expect" "$arg_line"
[ "$SKIP_CLANG" != "1" ] && bench_native_tool "$bench" "$(tool_label "$CLANG")" "$CLANG" "$opt" "$src" "$expect" "$arg_line"
- bench_native_tool "$bench" "cfree" "$CFREE cc" "$opt" "$src" "$expect" "$arg_line"
- bench_cfree_run "$bench" "$opt" "$src" "$expect" "$arg_line"
+ if [ "$SKIP_CFREE" != "1" ]; then
+ bench_native_tool "$bench" "cfree" "$CFREE cc" "$opt" "$src" "$expect" "$arg_line"
+ bench_cfree_run "$bench" "$opt" "$src" "$expect" "$arg_line"
+ fi
[ "$SKIP_MIR" != "1" ] && bench_mir "$bench" "$opt" "$src" "$expect" "$arg_line"
done
done
+if [ "$MODE" = "baseline" ]; then
+ printf 'opt-bench: wrote baseline cache %s\n' "$CSV"
+ printf 'opt-bench: commit this file so cfree runs can compare against it\n'
+ exit 0
+fi
+
write_summary
printf 'opt-bench: wrote %s\n' "$CSV"
printf 'opt-bench: wrote %s\n' "$SUMMARY"
diff --git a/scripts/opt_bench_baseline.csv b/scripts/opt_bench_baseline.csv
@@ -0,0 +1,85 @@
+bench,tool,opt,status,compile_ms,codegen_ms,runtime_ms,exit_code,log
+"array","gcc-15","0","OK","151.617","NA","6192.521","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.array"
+"array","clang","0","OK","147.005","NA","5308.213","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.array"
+"array","mir-c2m","0","OK","121.159","0.264","4986.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.array"
+"array","gcc-15","1","OK","156.974","NA","2146.515","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.array"
+"array","clang","1","OK","143.504","NA","2896.896","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.array"
+"array","mir-c2m","1","OK","118.147","0.318","4792.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.array"
+"binary-trees","gcc-15","0","OK","148.974","NA","2647.209","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.binary-trees"
+"binary-trees","clang","0","OK","139.066","NA","2754.167","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.binary-trees"
+"binary-trees","mir-c2m","0","COMPILE_FAIL","123.128","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.binary-trees.compile.err"
+"binary-trees","gcc-15","1","OK","153.376","NA","2607.473","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.binary-trees"
+"binary-trees","clang","1","OK","142.402","NA","2400.911","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.binary-trees"
+"binary-trees","mir-c2m","1","COMPILE_FAIL","126.457","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.binary-trees.compile.err"
+"funnkuch-reduce","gcc-15","0","OK","148.897","NA","2557.454","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.funnkuch-reduce"
+"funnkuch-reduce","clang","0","OK","137.344","NA","2880.681","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.funnkuch-reduce"
+"funnkuch-reduce","mir-c2m","0","OK","122.131","0.499","2856.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.funnkuch-reduce"
+"funnkuch-reduce","gcc-15","1","OK","154.436","NA","2081.479","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.funnkuch-reduce"
+"funnkuch-reduce","clang","1","OK","149.392","NA","2311.176","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.funnkuch-reduce"
+"funnkuch-reduce","mir-c2m","1","OK","123.103","0.620","2764.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.funnkuch-reduce"
+"hash","gcc-15","0","OK","159.280","NA","4608.200","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.hash"
+"hash","clang","0","OK","142.514","NA","4875.379","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.hash"
+"hash","mir-c2m","0","OK","159.190","1.380","4172.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.hash"
+"hash","gcc-15","1","OK","180.370","NA","4133.414","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.hash"
+"hash","clang","1","OK","165.194","NA","4131.105","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.hash"
+"hash","mir-c2m","1","OK","152.222","1.747","4167.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.hash"
+"hash2","gcc-15","0","OK","161.541","NA","7398.824","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.hash2"
+"hash2","clang","0","OK","144.715","NA","8831.498","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.hash2"
+"hash2","mir-c2m","0","OK","152.456","1.430","3970.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.hash2"
+"hash2","gcc-15","1","OK","180.383","NA","4360.070","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.hash2"
+"hash2","clang","1","OK","165.528","NA","3850.965","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.hash2"
+"hash2","mir-c2m","1","OK","151.444","1.825","3857.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.hash2"
+"heapsort","gcc-15","0","OK","149.543","NA","7605.803","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.heapsort"
+"heapsort","clang","0","OK","139.750","NA","6318.618","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.heapsort"
+"heapsort","mir-c2m","0","COMPILE_FAIL","120.517","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.heapsort.compile.err"
+"heapsort","gcc-15","1","OK","151.685","NA","5536.478","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.heapsort"
+"heapsort","clang","1","OK","147.930","NA","4315.873","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.heapsort"
+"heapsort","mir-c2m","1","COMPILE_FAIL","120.113","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.heapsort.compile.err"
+"lists","gcc-15","0","OK","160.975","NA","8841.648","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.lists"
+"lists","clang","0","OK","142.189","NA","7696.170","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.lists"
+"lists","mir-c2m","0","OK","146.936","1.208","5512.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.lists"
+"lists","gcc-15","1","OK","175.373","NA","3438.077","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.lists"
+"lists","clang","1","OK","155.047","NA","2938.820","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.lists"
+"lists","mir-c2m","1","OK","146.978","1.523","4988.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.lists"
+"matrix","gcc-15","0","OK","151.555","NA","20657.715","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.matrix"
+"matrix","clang","0","OK","139.105","NA","13679.456","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.matrix"
+"matrix","mir-c2m","0","OK","140.975","0.707","11065.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.matrix"
+"matrix","gcc-15","1","OK","162.019","NA","3073.402","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.matrix"
+"matrix","clang","1","OK","149.687","NA","3125.393","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.matrix"
+"matrix","mir-c2m","1","OK","139.374","0.893","9466.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.matrix"
+"method-call","gcc-15","0","COMPILE_FAIL","74.268","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.method-call.compile.1.err"
+"method-call","clang","0","COMPILE_FAIL","88.040","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.method-call.compile.1.err"
+"method-call","mir-c2m","0","OK","116.234","0.490","4190.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.method-call"
+"method-call","gcc-15","1","COMPILE_FAIL","178.970","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.method-call.compile.1.err"
+"method-call","clang","1","COMPILE_FAIL","103.005","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.method-call.compile.1.err"
+"method-call","mir-c2m","1","OK","120.788","0.591","4151.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.method-call"
+"mandelbrot","gcc-15","0","OK","149.164","NA","10318.641","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.mandelbrot"
+"mandelbrot","clang","0","OK","139.282","NA","10586.734","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.mandelbrot"
+"mandelbrot","mir-c2m","0","OK","135.511","0.328","4501.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.mandelbrot"
+"mandelbrot","gcc-15","1","OK","148.877","NA","3151.572","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.mandelbrot"
+"mandelbrot","clang","1","OK","140.193","NA","2910.442","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.mandelbrot"
+"mandelbrot","mir-c2m","1","OK","124.927","0.400","3332.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.mandelbrot"
+"nbody","gcc-15","0","OK","157.218","NA","9017.622","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.nbody"
+"nbody","clang","0","OK","142.119","NA","10701.933","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.nbody"
+"nbody","mir-c2m","0","COMPILE_FAIL","139.124","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.nbody.compile.err"
+"nbody","gcc-15","1","OK","165.533","NA","2852.425","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.nbody"
+"nbody","clang","1","OK","154.581","NA","2711.537","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.nbody"
+"nbody","mir-c2m","1","COMPILE_FAIL","125.704","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.nbody.compile.err"
+"sieve","gcc-15","0","OK","146.078","NA","5032.428","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.sieve"
+"sieve","clang","0","OK","137.391","NA","13729.750","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.sieve"
+"sieve","mir-c2m","0","OK","130.733","0.226","4843.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.sieve"
+"sieve","gcc-15","1","OK","150.246","NA","2787.239","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.sieve"
+"sieve","clang","1","OK","144.944","NA","2512.013","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.sieve"
+"sieve","mir-c2m","1","OK","120.643","0.265","4170.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.sieve"
+"spectral-norm","gcc-15","0","OK","153.761","NA","14877.450","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.spectral-norm"
+"spectral-norm","clang","0","OK","142.870","NA","14844.299","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.spectral-norm"
+"spectral-norm","mir-c2m","0","COMPILE_FAIL","121.234","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.spectral-norm.compile.err"
+"spectral-norm","gcc-15","1","OK","159.815","NA","4049.676","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.spectral-norm"
+"spectral-norm","clang","1","OK","151.041","NA","4075.697","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.spectral-norm"
+"spectral-norm","mir-c2m","1","COMPILE_FAIL","125.809","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.spectral-norm.compile.err"
+"strcat","gcc-15","0","OK","151.030","NA","5970.900","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.strcat"
+"strcat","clang","0","OK","140.669","NA","5943.676","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.strcat"
+"strcat","mir-c2m","0","OK","148.012","0.484","5773.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.strcat"
+"strcat","gcc-15","1","OK","156.343","NA","4860.543","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.strcat"
+"strcat","clang","1","OK","143.566","NA","4804.322","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.strcat"
+"strcat","mir-c2m","1","OK","149.920","0.585","5772.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.strcat"
diff --git a/scripts/opt_bench_compare.py b/scripts/opt_bench_compare.py
@@ -1,15 +1,27 @@
#!/usr/bin/env python3
"""Compare cfree tools vs a baseline (default: gcc-15 -O0).
+The fixed compilers (gcc/clang/MIR) don't change as cfree evolves, so their
+timings are measured once into a cached CSV (scripts/opt_bench_baseline.csv via
+`CFREE_OPT_BENCH_MODE=baseline scripts/opt_bench.sh`). This script auto-merges
+that cache with the fresh cfree-only results so comparisons still work even when
+the results CSV contains only cfree rows.
+
Usage:
python3 scripts/opt_bench_compare.py [results.csv]
python3 scripts/opt_bench_compare.py results.csv --base-tool gcc-15 --base-opt 0
+ python3 scripts/opt_bench_compare.py results.csv --baseline-csv path.csv
+ python3 scripts/opt_bench_compare.py results.csv --no-baseline
"""
import csv
import math
import os
import sys
+DEFAULT_BASELINE_CSV = os.path.join(
+    os.path.dirname(os.path.abspath(__file__)), "opt_bench_baseline.csv"
+)
+
 def fnum(v):
     try:
@@ -51,6 +63,7 @@ def main():
     csv_path = None
     base_tool_arg = None
     base_opt = "0"
+    baseline_csv = DEFAULT_BASELINE_CSV
     i = 0
     while i < len(args):
         if args[i] == "--base-tool" and i + 1 < len(args):
@@ -59,6 +72,12 @@ def main():
         elif args[i] == "--base-opt" and i + 1 < len(args):
             base_opt = args[i + 1]
             i += 2
+        elif args[i] == "--baseline-csv" and i + 1 < len(args):
+            baseline_csv = args[i + 1]
+            i += 2
+        elif args[i] == "--no-baseline":
+            baseline_csv = None
+            i += 1
         else:
             csv_path = args[i]
             i += 1
@@ -73,6 +92,26 @@ def main():
     with open(csv_path, newline="") as f:
         ok = [r for r in csv.DictReader(f) if r["status"] == "OK"]
+    # Benches measured by this run (typically cfree-only); used to scope output.
+    result_benches = {r["bench"] for r in ok}
+
+    # Merge cached baseline timings (gcc/clang/MIR) unless disabled. Rows already
+    # present in the fresh CSV win, so an explicit baseline run still overrides.
+    if (
+        baseline_csv
+        and os.path.exists(baseline_csv)
+        and os.path.abspath(baseline_csv) != os.path.abspath(csv_path)
+    ):
+        seen = {(r["tool"], r["opt"], r["bench"]) for r in ok}
+        with open(baseline_csv, newline="") as f:
+            for r in csv.DictReader(f):
+                if r["status"] != "OK":
+                    continue
+                key = (r["tool"], r["opt"], r["bench"])
+                if key not in seen:
+                    ok.append(r)
+                    seen.add(key)
+
     if not ok:
         sys.exit("compare: no OK rows in CSV")
@@ -96,7 +135,10 @@ def main():
         sys.exit(f"compare: no rows for tool={base_tool} opt={base_opt}")
     idx = {(r["tool"], r["opt"], r["bench"]): r for r in ok}
-    all_benches = sorted(baseline)
+    # Only report benches this run covered (and that the baseline has). When the
+    # CSV is itself a baseline dump (no separate result benches), show them all.
+    shown = (result_benches & set(baseline)) if result_benches else set(baseline)
+    all_benches = sorted(shown) or sorted(baseline)
     all_opts = sorted(
         {r["opt"] for r in ok},
         key=lambda x: (int(x) if x.isdigit() else 99, x),