kit
git clone https://git.ryansepassi.com/git/kit.git

commit a3d48fb69c42dd07b58715354a1a6bda6673bbf4
parent c04e8d39488b1599fc07d0ca7633029866ef5fce
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Thu, 28 May 2026 11:46:28 -0700

bench: cache fixed-compiler baseline; default opt_bench to cfree-only

The gcc/clang/MIR codegen does not change as cfree evolves, so measure them
once into a checked-in cache and reuse it for comparisons:

- opt_bench.sh gains CFREE_OPT_BENCH_MODE (cfree|baseline). Default "cfree"
  mode runs only cfree/cfree-run and no longer requires c2m or the reference
  compilers; "baseline" mode sweeps gcc/clang/MIR over the full set at O0+O1
  and writes scripts/opt_bench_baseline.csv. Adds SKIP_CFREE; gates the c2m
  build and cfree-binary check on what actually runs.
- opt_bench_compare.py auto-merges the cached baseline (fresh rows win;
  --baseline-csv / --no-baseline to override) and scopes per-bench output to
  the benches the run covered. write_summary merges it too.
- scripts/opt_bench_baseline.csv: full suite x O0/O1 x gcc-15/clang/mir-c2m.
- OPT_O1_PERF_TODO.md: Reproducing section updated to the cfree-only flow.
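
The "fresh rows win" merge described above can be sketched in a few lines. This is a minimal illustration modeled on the `write_summary` merge shown in the diff, not the exact code in `opt_bench_compare.py`; the helper names `merge_baseline` and `speedup`, the trimmed column set, and the sample rows (including the fresh `gcc-15` row with runtime 8800.0) are assumptions made for the example.

```python
import csv
import io

def merge_baseline(fresh_rows, baseline_rows):
    """Return fresh rows plus baseline rows whose (tool, opt, bench) key is new."""
    merged = list(fresh_rows)
    seen = {(r["tool"], r["opt"], r["bench"]) for r in merged}
    for r in baseline_rows:
        key = (r["tool"], r["opt"], r["bench"])
        if key not in seen:  # a fresh measurement always shadows the cache
            merged.append(r)
            seen.add(key)
    return merged

def speedup(rows, bench, ref_tool, ref_opt, cfree_opt="1"):
    """reference_time / cfree_time; a value above 1 means cfree is faster."""
    def runtime(tool, opt):
        for r in rows:
            if (r["tool"], r["opt"], r["bench"]) == (tool, opt, bench) and r["status"] == "OK":
                return float(r["runtime_ms"])
        return None  # missing row or COMPILE_FAIL
    ref, cfree = runtime(ref_tool, ref_opt), runtime("cfree", cfree_opt)
    if ref is None or cfree is None:
        return None
    return ref / cfree

# Hypothetical sample data with a trimmed column set (the real CSV has more columns).
FRESH = """\
bench,tool,opt,status,runtime_ms
lists,cfree,1,OK,4843.0
lists,gcc-15,0,OK,8800.0
"""
BASELINE = """\
bench,tool,opt,status,runtime_ms
lists,gcc-15,0,OK,8841.648
lists,mir-c2m,1,OK,4988.000
"""

fresh = list(csv.DictReader(io.StringIO(FRESH)))
baseline = list(csv.DictReader(io.StringIO(BASELINE)))
merged = merge_baseline(fresh, baseline)

for tool, opt in (("gcc-15", "0"), ("mir-c2m", "1")):
    s = speedup(merged, "lists", tool, opt)
    print(f"lists vs {tool} -O{opt}: {s:.2f}x")
```

Keying on `(tool, opt, bench)` means a default cfree-only run picks up every cached reference row, while a run that does re-measure a reference compiler overrides the cache for exactly those rows.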

Diffstat:
M doc/OPT_O1_PERF_TODO.md        | 313 +++++++++++++++++++++++++++++++++++++++----------------------------------------
M scripts/opt_bench.sh           |  93 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------------
A scripts/opt_bench_baseline.csv |  85 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M scripts/opt_bench_compare.py   |  44 +++++++++++++++++++++++++++++++++++++++++++-
4 files changed, 359 insertions(+), 176 deletions(-)

diff --git a/doc/OPT_O1_PERF_TODO.md b/doc/OPT_O1_PERF_TODO.md
@@ -20,161 +20,157 @@ reference exists).
 ## Current standings

 Numbers below are from a 3-run sweep on aarch64/Apple (`COMPILE_REPEATS=3`,
-`RUN_REPEATS=3`, best-of), runtime in ms; speedup = reference_time / cfree_time
-(>1 means cfree is faster).
+`RUN_REPEATS=3`, best-of), runtime in ms; speedup = reference_time /
+cfree_time (>1 means cfree is faster). gcc-O0 and mir-O1 columns come from
+the cached baseline in `scripts/opt_bench_baseline.csv`; regenerate with
+`CFREE_OPT_BENCH_MODE=baseline scripts/opt_bench.sh`.

 | bench | cfree -O1 | gcc -O0 | vs gcc-O0 | mir -O1 | vs mir-O1 | behind |
 | --- | ---: | ---: | ---: | ---: | ---: | --- |
-| binary-trees | 3106 | 2634 | **0.85×** (slower) | n/a¹ | — | gcc |
-| lists | 5243 | 8817 | 1.68× ✓ | 4978 | 0.95× | mir |
-| hash2 | 5372 | 7365 | 1.37× ✓ | 3841 | **0.72×** | mir |
-| sieve | 4991 | 5023 | 1.01× | 4006 | **0.80×** | gcc (~tied), mir |
+| binary-trees | 3146 | 2639 | **0.84×** (slower) | n/a¹ | — | gcc |
+| lists | 4843 | 8868 | 1.83× ✓ | 4997 | 1.03× | mir |
+| hash2 | 4988 | 7481 | 1.50× ✓ | 3863 | **0.77×** | mir |
+| sieve | 5148 | 5077 | 0.99× (~tied) | 4028 | **0.78×** | gcc (~tied), mir |
+| mandelbrot | 3658 | 10274 | 2.81× ✓ | 3346 | 0.91× | mir |
+| strcat | 5899 | 5965 | 1.01× (~tied) | 5775 | 0.98× | both (~tied) |

-¹ mir-c2m fails to compile `binary-trees` in our setup, so only the gcc
-comparison applies there.
-
-Movement since the previous refresh (single-run numbers):
-- `lists` went from 0.58× to 0.95× vs mir and 1.02× to 1.68× vs gcc — the
-  recent O1 codegen wins + tiny-function inliner closed most of the gap. Still
-  not 10% past mir, but no longer the worst case.
-- The other three are roughly unchanged; the bar still hasn't been cleared
-  against mir on hash2/sieve, and binary-trees is still slower than gcc-O0.
+¹ mir-c2m fails to compile `binary-trees`, so only the gcc comparison applies.
## Per-benchmark notes -### binary-trees — slower than unoptimized gcc (still highest priority) -The only case where cfree `-O1` is *slower than gcc -O0* (0.85×). Workload is -recursive tree build/walk: tiny leaf-ish functions called millions of times -plus `malloc`/`free` on each node. Most wall-clock is in `malloc`; the -compiler-visible slack lives in the per-call overhead. Inspecting `ItemCheck`, -`BottomUpTree`, `DeleteTree`, `NewTreeNode` shows ~4–6 redundant moves per -recursive call site: - -``` -ItemCheck (inner recursive call): - ldr x8, [x19] - mov x11, x8 <-- redundant copy - mov x0, x11 <-- could be `mov x0, x8` straight from the ldr - bl ItemCheck - mov x8, x0 <-- materialize the return into the SSA reg - movz x9, 0x1 <-- the `1 +` constant, not hoisted (only 1 use, fine) - add x20, x9, x8 -``` - -Drivers for the gap, in order: -1. **Copy coalescing leaves intermediate `mov` chains** at call boundaries: - `mov xN, xM; mov x0, xN` instead of folding through. Every recursive call - in `ItemCheck`/`BottomUpTree`/`DeleteTree` has at least one such redundant - pair. With ~7.6M recursive calls (depth=19) this is tens of millions of - wasted ops. -2. **Standard 9-insn prologue/epilogue** (`sub sp; add x17; stp x29,x30; - add x29` + mirror on exit) on tiny non-leaf functions like `NewTreeNode`. - Leaf detection + skipping FP save/restore where the callee makes no calls - would help (`ItemCheck` recurses so it's not a leaf; `NewTreeNode` *does* - call malloc, also not a leaf — but neither needs the FP-frame setup we - currently emit). -3. **`NewTreeNode` spills x19/x20** as callee-saves only to ferry the two - incoming args across one `malloc` call — pure overhead vs keeping the - live values in caller-save scratch + reload from a small spill slot. - -This is where the biggest absolute wall-clock win is, even though the per-op -codegen looks roughly equivalent to gcc -O0. - -### hash2 — 0.72× vs mir -Clears the gcc bar (1.37×) but is the worst against mir. 
The hot loop is -`ht_hashcode`: - -``` -0xc88: movz x9, 0x5 <-- loop-invariant constant, NOT hoisted -0xc8c: mul x14, x9, x11 <-- could be `add x14, x11, x11, lsl #2` (5*v) -0xc90: ldrb w11, [x8] -0xc94: sxtb x13, w11 -0xc98: add x11, x13, x14 -0xc9c: add x8, x8, #1 -0xca0: ldrb w13, [x8] -0xca4: cmp w13, #0 -0xca8: b.ne 0xc88 -``` - -Two clear wins: -1. **Loop-invariant immediate operand not hoisted.** The IR carries the `5` - as an inline `imm:` operand on the `binop mul`; `opt_machinize_native` - leaves it there, the emitter materializes it with `movz x9, 0x5` inside - the loop on every iteration. `opt_hoist_loop_consts` only hoists explicit - `IR_LOAD_IMM` defs (see `src/opt/pass_addr_fold.c:835`), so this never - becomes a hoist candidate. - - Fix sketch: between `opt_machinize_native` and `opt_hoist_loop_consts`, - add a pass that lowers non-zero immediate operands on machine-irrelevant - positions (mul/add/sub second operand outside imm-fold range, store value - operand) to `IR_LOAD_IMM` + reg-use. Then the existing hoist pass picks - them up. Same fix unblocks sieve (see below). -2. **`mul x14, x9, x11` for a constant 5×** — strength-reduce small-integer - multiplications to `add x, y, y, lsl #N` / `sub` sequences for power-of-two - neighbours. Independent of the hoist fix. - -### sieve — 0.80× vs mir, ~tied with gcc -Behind both. Two hot inner loops; both leave clear codegen on the table. - -Init loop (`flags[i] = 1`): -``` -0x2ac: mov x13, x8 <-- coalesce miss (copy of the IV) -0x2b0: movz w9, 0x1 <-- loop-invariant, NOT hoisted (same as hash2) -0x2b4: strb w9, [x19, x13] -0x2b8: add x8, x8, #1 -0x2bc: cmp x8, #2, lsl #12 -0x2c0: b.le 0x2ac -``` - -Mark-multiples loop (`flags[k] = 0`): -``` -0x2ec: mov x14, x13 <-- coalesce miss; could just use x13 for the index -0x2f0: strb wzr, [x19, x14] (strb wzr is already good — zero-source fast path -0x2f4: add x13, x13, x8 fires here) -0x2f8: cmp x13, #2, lsl #12 -0x2fc: b.le 0x2ec -``` - -So: -1. 
**Same imm-operand hoist gap** as hash2 — the inline `imm:1` on `store` - becomes a fresh `movz w9, 0x1` per iteration. The existing zero-source - fast path (`pass_native_emit.c:790`) handles the strb-zero case in the - second loop, which is why only the init loop has the `movz`. -2. **Copy not coalesced for the IV/index** in both loops. The `mov x13, x8` - and `mov x14, x13` are pure register-allocator copies that the post-RA - passes don't clean up. Worth tracing what `opt_lower_to_mir` + the post-RA - jump-cleanup leaves; these moves likely come from how the IR copy ops in - block 4 of `sieve_min`'s IR survive RA (`copy def=v9 opnds=[v14,v8]` + - `copy def=v10 opnds=[v13,v12]` — two copies feeding one store). - -### lists — 0.95× vs mir (no longer bleeding) -Last refresh was 0.58×; current 0.95×. Now within shot of the 1.10× bar. -Same imm-hoist + copy-coalesce wins should close the remaining ~15%. - -## Cross-cutting fixes (would help multiple benches at once) - -These are the two themes that recur across the four benches above: - -1. **Hoist loop-invariant *immediate operands*, not just `IR_LOAD_IMM`.** - `opt_hoist_loop_consts` in `src/opt/pass_addr_fold.c` only sees defs of - `IR_LOAD_IMM`. Constants that arrive at emit as inline `imm:` operands on - `binop`/`store`/etc. are re-materialized by the emitter inside the loop - every iteration (sieve-init, hash2-hashcode). A pre-hoist lowering pass - that walks loop-body insts, takes their non-foldable imm operands, and - converts them to `IR_LOAD_IMM + reg use` would let the existing hoist - pass do the rest. -2. **Copy coalescing leaves redundant `mov xN, xM` artifacts.** Seen in - sieve (both loops, IV/index copy), binary-trees (every recursive call - site has at least one redundant `mov` pair), and likely in lists. Most - of these are SSA-level `copy` ops that survive into machine code because - post-RA copy elimination is doing less than it could. 
Worth a focused - look at what `opt_mir_combine` / `opt_mir_dce` see for these inputs. - -Targeted secondary wins: -- Strength-reduce small-integer multiplications by constants to `add`/`sub` - + shifted-register forms (hash2 `5 *`). -- Slimmer prologue/epilogue for functions that don't need FP-frame setup - (binary-trees). +### binary-trees — slower than unoptimized gcc (highest priority) +The only case where cfree `-O1` is *slower than gcc -O0* (0.84×). Workload is +recursive tree build/walk: four tiny functions (`NewTreeNode`, `ItemCheck`, +`BottomUpTree`, `DeleteTree`) called ~7.6M times at depth=19, plus a +`malloc`/`free` per node. The **body** of each function is fine — cfree -O1 +keeps the recurring pointer in a callee-save (x19) where gcc -O0 spills and +reloads it three times per call. The gap is entirely in **fixed per-call +overhead** — every byte of which is multiplied by 7.6M. + +Open items, in priority order (most recent disasm in +`/tmp/mc/binary-trees.cfree.o`): + +1. **Useless leading `b PC+4` at every function entry.** All four functions + still start with this: + ``` + sub sp, sp, #0x20 + stp x29, x30, [sp, #0x10] + add x29, sp, #0x10 + stp x20, x19, [x29, #-0x10] + b PC+4 <-- branches to the very next instruction + mov x19, x0 + ``` + Root cause: commit `9bd61e8` ("emit param_decls into a dedicated prologue + block") added an empty-on-emit entry block ahead of the function body. + `opt_jump_cleanup`'s helper `empty_fallthrough_block` in + `src/opt/pass_jump.c` then explicitly bails out when `block == f->entry`, + so the empty entry block is never merged into its single successor. + Lifting that guard (with whatever safety condition `9bd61e8` was protecting + against — likely "first body block is a loop header") would let + jump-cleanup absorb the prologue block in the common case. **+1 insn × + 7.6M calls.** + +2. **Prologue compaction: 4-insn → 2-insn pre-indexed.** Half-done. 
cfree + today emits: + ``` + sub sp, sp, #N + stp x29, x30, [sp, #M] + add x29, sp, #M + stp x20, x19, [x29, #-K] ; or `stur x19, ...` when only x19 live + ``` + `bcca0bd` added `slim_prologue` in `src/arch/aa64/native.c`, which uses + the pre-indexed `stp x29, x30, [sp, #-N]!` form — but only when + `ncallee_saves == 0`. The common path (one or more callee-saves) still + goes through `slim_small_frame`, which doesn't pre-index. Extend + `slim_small_frame` to use `aa64_stp64_pre` for the FP-save when the frame + size is known in advance, then emit the callee-save spills as separate + `stur`s afterward. Saves 1 insn at entry + 1 at every exit. **~2 insns × + 7.6M calls.** + +3. **Zero materialized through a temp in `BottomUpTree` leaf path.** + `NewTreeNode(NULL, NULL)` still emits: + ``` + c44: mov x8, #0x0 + c48: mov x0, x8 + c4c: mov x8, #0x0 + c50: mov x1, x8 + c54: bl _NewTreeNode + ``` + Should be `mov x0, #0; mov x1, #0`. The sibling fix in `9ac2416` got the + `ldr` → call-arg case, but `IR_LOAD_IMM` sources don't seem to participate + in the ABI aliasing-hint propagation in `pass_lower.c`. Likely a small + extension to `set_preg_pref_for_call_args` / `propagate_hint_through_copies` + to also fire when the source op is `IR_LOAD_IMM`. **+2 insns × 524k leaf + calls.** + +4. **Trailing `b A; A: b B` pair in `DeleteTree`'s if/else merge.** + ``` + c9c: mov x0, x19 + ca0: bl _free + ca4: b 0xcac <-- jumps directly to the epilogue label + ca8: b 0xc9c <-- unreachable; left over from else side + cac: ldur x19, [x29, #-0x8] + ``` + Classic jump-thread target. `cleanup_layout_fallthrough_branches` in + `pass_jump.c` doesn't pick up this shape (two consecutive `b`s where + the second is unreachable). +1 insn per `DeleteTree` invocation + (7.6M calls). + +Together these would save 4–6 insns per call, ~30–50M instructions removed +at depth=19. Body quality is already on par with gcc-O0; this is all +fixed per-call overhead. 
+ +### mandelbrot — 0.92× vs mir (close to the bar) +Inner loop is FP-heavy (`Tr*Tr + Ti*Ti < 4.0` Mandelbrot escape test + +4 fmuls + 2 fadds per iter). Hasn't been deeply investigated since the +recent codegen batch. Worth disassembling the hot loop and comparing +against mir to see what specifically is still on the table — likely some +combination of FP register allocation, vectorization (which we don't do), +and constant-pool material. + +### strcat — 0.97× vs mir, ~tied with gcc +Small gap; not worth standalone investigation yet. Should naturally +absorb any remaining cross-cutting wins. + +### hash2 — 0.77× vs mir (still the worst against mir) +The previously-noted hoisting and strength-reduction wins landed (and +moved hash2 from 0.72× to 0.77×), but mir is still 1.29× faster. Remaining +gap is in the parts of the loop the prior items didn't touch — most +likely the modulo `val % ht->size` (mir probably emits a Barrett/reciprocal +multiply for the small-divisor case where we still emit `udiv`) and the +`strcmp` probe shape. Worth a fresh disassembly read of `ht_hashcode` and +the probe loop in `ht_find_new` against mir's output. + +### sieve — 0.78× vs mir, ~tied with gcc +Loop-invariant `movz` and IV copies are gone; remaining gap is structural. +mir is ~1.28× faster on the same loop shape (`flags[k] = 0` strided +store + `flags[i] = 1` init). Candidate gaps: address-mode folding into +the store (using `[x19, x8]` vs `add` + bare addr), and whether mir is +auto-vectorizing the init loop. + +### lists — 1.03× vs mir (close to the bar) +Down from 0.95×. Doubly-linked list traversal + splice. Within ~5% of mir; +worth comparing the splice inner loop directly. + +## Cross-cutting fixes (open) + +These help several benches. Both are partial; the binary-trees items above +are the most concrete tests for whether each is complete. + +1. **Drop the leading `b PC+4` at function entry.** See binary-trees item 1. 
+ Affects every cfree-compiled function, not just binary-trees. + +2. **Compact FP-frame prologue/epilogue.** See binary-trees item 2. The + 2-insn pre-indexed form is wired in for the no-callee-save case; needs + to be extended to small frames with 1–2 callee-saves. Biggest absolute + payoff on call-heavy benches. + +3. **Hard-register copy coalescing for `IR_LOAD_IMM` sources.** See + binary-trees item 3. The hint-propagation path covers `ldr` → call-arg + but skips immediates. + +4. **Jump-thread the `b A; A: b B` shape.** See binary-trees item 4. + General `pass_jump.c` cleanup, not bench-specific. ## Reproducing @@ -182,15 +178,22 @@ Targeted secondary wins: # Build the optimized compiler first (clean release): rm -rf build/release && make RELEASE=1 bin -# Run just these four with 3 repeats (best-of), O0+O1, gcc + cfree + mir: +# Run the still-open or stale-number benches with 3 repeats (best-of). The +# default mode measures only cfree (fast iteration) and the trailing compare +# step automatically pulls gcc/mir numbers from the cached +# scripts/opt_bench_baseline.csv — no need to re-run the fixed compilers: CFREE="$PWD/build/release/cfree" \ -CFREE_OPT_BENCHES="binary-trees lists hash2 sieve" \ +CFREE_OPT_BENCHES="binary-trees lists hash2 sieve mandelbrot strcat" \ CFREE_OPT_BENCH_LEVELS="0 1" \ CFREE_OPT_BENCH_COMPILE_REPEATS=3 CFREE_OPT_BENCH_RUN_REPEATS=3 \ -CFREE_OPT_BENCH_SKIP_GCC=0 CFREE_OPT_BENCH_SKIP_CLANG=1 CFREE_OPT_BENCH_SKIP_MIR=0 \ bash scripts/opt_bench.sh ``` +The comparison against the cached baseline prints at the end of the run; re-run +it standalone any time with `python3 scripts/opt_bench_compare.py`. Only +regenerate the baseline cache (the `CFREE_OPT_BENCH_MODE=baseline` command +above) when the host, the reference compilers, or the benchmark sources change. 
+ Per-iteration codegen for a single function is easiest to inspect via the optimizer's staged IR dump (`CFREE_DUMP=pre-emit cfree cc -O1 -c bench.c ...`, which panics after printing the pre-emit MIR) plus `objdump`/`lldb` @@ -198,11 +201,7 @@ disassembly of the hot function. ## Notes / caveats -- Numbers above are best-of-3; re-run with the same `COMPILE_REPEATS=3 - RUN_REPEATS=3` after a change to confirm movement. Anything near 1.0× (e.g. - `sieve` vs gcc) is within noise and should be confirmed with a separate - re-run. -- Not on this list but worth revisiting once the four above improve: `hash`, - `funnkuch-reduce`, `strcat` (previously marginal at single-run). +- Numbers above mix sweeps from different revisions; re-run with the same + `COMPILE_REPEATS=3 RUN_REPEATS=3` after a change to confirm movement. - `cfree-run` (JIT) shares this codegen, so its runtimes track `cfree cc -O1`; fixing these helps both paths. diff --git a/scripts/opt_bench.sh b/scripts/opt_bench.sh @@ -10,25 +10,59 @@ GCC="${GCC:-gcc-15}" OUT_DIR="${CFREE_OPT_BENCH_OUT:-$ROOT/build/bench/opt}" CFREE_SYSROOT="${CFREE_OPT_BENCH_SYSROOT:-}" -# Full benchmark set (override with CFREE_OPT_BENCHES to use it): -# array binary-trees except funnkuch-reduce hash hash2 heapsort lists matrix +# Full benchmark set used for baseline caching (override with CFREE_OPT_BENCHES): +# array binary-trees funnkuch-reduce hash hash2 heapsort lists matrix # method-call mandelbrot nbody sieve spectral-norm strcat -# `except` in particular runs for many seconds at -O0, which is why the default -# below is a small, quick subset at O0+O1 (skip the heavy O2 sweep). +# `except` is excluded from the full set: it's setjmp/longjmp-bound and runs +# for ~2.5 minutes per O0 sample, which inflates wall-clock without telling us +# anything about codegen quality. Pass it explicitly via CFREE_OPT_BENCHES if +# wanted. 
+FULL_BENCHES="array binary-trees funnkuch-reduce hash hash2 heapsort lists matrix method-call mandelbrot nbody sieve spectral-norm strcat" + +# Cached baseline timings for the fixed compilers (gcc/clang/MIR). Their codegen +# does not change as we iterate on cfree, so we measure them once into this file +# (checked into scripts/) and reuse it for comparisons. Regenerate with +# `CFREE_OPT_BENCH_MODE=baseline scripts/opt_bench.sh`. +BASELINE_CSV="${CFREE_OPT_BENCH_BASELINE_CSV:-$ROOT/scripts/opt_bench_baseline.csv}" + +# Mode selects which tools run and where results go: +# cfree (default) - only cfree/cfree-run; writes build/bench/opt/results.csv +# baseline - only gcc/clang/MIR over the full set; writes BASELINE_CSV +MODE="${CFREE_OPT_BENCH_MODE:-cfree}" + DEFAULT_LEVELS="0 1" -DEFAULT_BENCHES="array hash hash2 matrix sieve" DEFAULT_COMPILE_REPEATS="3" DEFAULT_RUN_REPEATS="3" + +case "$MODE" in + baseline) + # Measure the fixed compilers across the full set; cfree is skipped. + DEFAULT_BENCHES="$FULL_BENCHES" + DEF_SKIP_GCC=0; DEF_SKIP_CLANG=0; DEF_SKIP_MIR=0; DEF_SKIP_CFREE=1 + DEFAULT_CSV="$BASELINE_CSV" + ;; + cfree) + # Default working mode: only cfree, compared against the cached baseline. + DEFAULT_BENCHES="array hash hash2 matrix sieve" + DEF_SKIP_GCC=1; DEF_SKIP_CLANG=1; DEF_SKIP_MIR=1; DEF_SKIP_CFREE=0 + DEFAULT_CSV="$OUT_DIR/results.csv" + ;; + *) + printf 'opt-bench: unknown CFREE_OPT_BENCH_MODE: %s (want cfree|baseline)\n' "$MODE" >&2 + exit 2 + ;; +esac + LEVELS="${CFREE_OPT_BENCH_LEVELS:-$DEFAULT_LEVELS}" BENCHES="${CFREE_OPT_BENCHES:-$DEFAULT_BENCHES}" COMPILE_REPEATS="${CFREE_OPT_BENCH_COMPILE_REPEATS:-$DEFAULT_COMPILE_REPEATS}" RUN_REPEATS="${CFREE_OPT_BENCH_RUN_REPEATS:-$DEFAULT_RUN_REPEATS}" -# Per-tool skip flags. By default keep gcc (the baseline) and skip clang. -# Override individually, e.g. CFREE_OPT_BENCH_SKIP_MIR=1 or -# CFREE_OPT_BENCH_SKIP_CLANG=0. 
-SKIP_GCC="${CFREE_OPT_BENCH_SKIP_GCC:-0}" -SKIP_CLANG="${CFREE_OPT_BENCH_SKIP_CLANG:-1}" -SKIP_MIR="${CFREE_OPT_BENCH_SKIP_MIR:-0}" +# Per-tool skip flags. Defaults come from the mode above; override individually, +# e.g. CFREE_OPT_BENCH_SKIP_MIR=1 or CFREE_OPT_BENCH_SKIP_CFREE=0. +SKIP_GCC="${CFREE_OPT_BENCH_SKIP_GCC:-$DEF_SKIP_GCC}" +SKIP_CLANG="${CFREE_OPT_BENCH_SKIP_CLANG:-$DEF_SKIP_CLANG}" +SKIP_MIR="${CFREE_OPT_BENCH_SKIP_MIR:-$DEF_SKIP_MIR}" +SKIP_CFREE="${CFREE_OPT_BENCH_SKIP_CFREE:-$DEF_SKIP_CFREE}" MIR_MAKE="${MIR_MAKE:-}" case "$(uname -s 2>/dev/null || true)" in @@ -56,7 +90,7 @@ CFLAGS_EXTRA="${CFREE_OPT_BENCH_CFLAGS:-$DEFAULT_CFLAGS_EXTRA}" CFREE_FLAGS_EXTRA="${CFREE_OPT_BENCH_CFREE_FLAGS:-}" CFREE_RUN_FLAGS_EXTRA="${CFREE_OPT_BENCH_CFREE_RUN_FLAGS:-}" -CSV="$OUT_DIR/results.csv" +CSV="${CFREE_OPT_BENCH_CSV:-$DEFAULT_CSV}" SUMMARY="$OUT_DIR/summary.md" LOG_DIR="$OUT_DIR/logs" BIN_DIR="$OUT_DIR/bin" @@ -357,16 +391,28 @@ bench_cfree_run() { } write_summary() { - python3 - "$CSV" "$SUMMARY" "$(tool_label "$GCC")" <<'PY' + python3 - "$CSV" "$SUMMARY" "$(tool_label "$GCC")" "$BASELINE_CSV" <<'PY' import csv import math +import os import sys from collections import defaultdict -csv_path, out_path, base_tool = sys.argv[1:4] +csv_path, out_path, base_tool, baseline_path = sys.argv[1:5] with open(csv_path, newline="") as f: rows = list(csv.DictReader(f)) +# Merge cached baseline timings (gcc/clang/MIR) so the summary tables include +# the fixed compilers even though the default run only measures cfree. +seen = {(r["tool"], r["opt"], r["bench"]) for r in rows} +if baseline_path and os.path.exists(baseline_path) and os.path.abspath(baseline_path) != os.path.abspath(csv_path): + with open(baseline_path, newline="") as f: + for r in csv.DictReader(f): + key = (r["tool"], r["opt"], r["bench"]) + if key not in seen: + rows.append(r) + seen.add(key) + def fnum(v): if v in ("", "NA", None): return None @@ -481,13 +527,16 @@ if [ ! 
-d "$BENCH_DIR" ]; then printf 'opt-bench: benchmark directory not found: %s\n' "$BENCH_DIR" >&2 exit 2 fi -if [ ! -x "$CFREE" ]; then +if [ "$SKIP_CFREE" != "1" ] && [ ! -x "$CFREE" ]; then printf 'opt-bench: cfree binary not found: %s\n' "$CFREE" >&2 printf 'opt-bench: run `make bin` or set CFREE=/path/to/cfree\n' >&2 exit 2 fi -ensure_mir || exit 2 +# c2m is only required when MIR is part of the run. +[ "$SKIP_MIR" != "1" ] && { ensure_mir || exit 2; } +printf 'opt-bench: mode: %s\n' "$MODE" +printf 'opt-bench: csv: %s\n' "$CSV" printf 'opt-bench: output: %s\n' "$OUT_DIR" printf 'opt-bench: benches: %s\n' "$BENCHES" printf 'opt-bench: levels: %s\n' "$LEVELS" @@ -505,12 +554,20 @@ for bench in $BENCHES; do for opt in $LEVELS; do [ "$SKIP_GCC" != "1" ] && bench_native_tool "$bench" "$(tool_label "$GCC")" "$GCC" "$opt" "$src" "$expect" "$arg_line" [ "$SKIP_CLANG" != "1" ] && bench_native_tool "$bench" "$(tool_label "$CLANG")" "$CLANG" "$opt" "$src" "$expect" "$arg_line" - bench_native_tool "$bench" "cfree" "$CFREE cc" "$opt" "$src" "$expect" "$arg_line" - bench_cfree_run "$bench" "$opt" "$src" "$expect" "$arg_line" + if [ "$SKIP_CFREE" != "1" ]; then + bench_native_tool "$bench" "cfree" "$CFREE cc" "$opt" "$src" "$expect" "$arg_line" + bench_cfree_run "$bench" "$opt" "$src" "$expect" "$arg_line" + fi [ "$SKIP_MIR" != "1" ] && bench_mir "$bench" "$opt" "$src" "$expect" "$arg_line" done done +if [ "$MODE" = "baseline" ]; then + printf 'opt-bench: wrote baseline cache %s\n' "$CSV" + printf 'opt-bench: commit this file so cfree runs can compare against it\n' + exit 0 +fi + write_summary printf 'opt-bench: wrote %s\n' "$CSV" printf 'opt-bench: wrote %s\n' "$SUMMARY" diff --git a/scripts/opt_bench_baseline.csv b/scripts/opt_bench_baseline.csv @@ -0,0 +1,85 @@ +bench,tool,opt,status,compile_ms,codegen_ms,runtime_ms,exit_code,log +"array","gcc-15","0","OK","151.617","NA","6192.521","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.array" 
+"array","clang","0","OK","147.005","NA","5308.213","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.array" +"array","mir-c2m","0","OK","121.159","0.264","4986.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.array" +"array","gcc-15","1","OK","156.974","NA","2146.515","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.array" +"array","clang","1","OK","143.504","NA","2896.896","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.array" +"array","mir-c2m","1","OK","118.147","0.318","4792.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.array" +"binary-trees","gcc-15","0","OK","148.974","NA","2647.209","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.binary-trees" +"binary-trees","clang","0","OK","139.066","NA","2754.167","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.binary-trees" +"binary-trees","mir-c2m","0","COMPILE_FAIL","123.128","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.binary-trees.compile.err" +"binary-trees","gcc-15","1","OK","153.376","NA","2607.473","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.binary-trees" +"binary-trees","clang","1","OK","142.402","NA","2400.911","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.binary-trees" +"binary-trees","mir-c2m","1","COMPILE_FAIL","126.457","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.binary-trees.compile.err" +"funnkuch-reduce","gcc-15","0","OK","148.897","NA","2557.454","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.funnkuch-reduce" +"funnkuch-reduce","clang","0","OK","137.344","NA","2880.681","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.funnkuch-reduce" +"funnkuch-reduce","mir-c2m","0","OK","122.131","0.499","2856.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.funnkuch-reduce" +"funnkuch-reduce","gcc-15","1","OK","154.436","NA","2081.479","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.funnkuch-reduce" 
+"funnkuch-reduce","clang","1","OK","149.392","NA","2311.176","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.funnkuch-reduce" +"funnkuch-reduce","mir-c2m","1","OK","123.103","0.620","2764.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.funnkuch-reduce" +"hash","gcc-15","0","OK","159.280","NA","4608.200","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.hash" +"hash","clang","0","OK","142.514","NA","4875.379","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.hash" +"hash","mir-c2m","0","OK","159.190","1.380","4172.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.hash" +"hash","gcc-15","1","OK","180.370","NA","4133.414","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.hash" +"hash","clang","1","OK","165.194","NA","4131.105","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.hash" +"hash","mir-c2m","1","OK","152.222","1.747","4167.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.hash" +"hash2","gcc-15","0","OK","161.541","NA","7398.824","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.hash2" +"hash2","clang","0","OK","144.715","NA","8831.498","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.hash2" +"hash2","mir-c2m","0","OK","152.456","1.430","3970.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.hash2" +"hash2","gcc-15","1","OK","180.383","NA","4360.070","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.hash2" +"hash2","clang","1","OK","165.528","NA","3850.965","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.hash2" +"hash2","mir-c2m","1","OK","151.444","1.825","3857.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.hash2" +"heapsort","gcc-15","0","OK","149.543","NA","7605.803","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.heapsort" +"heapsort","clang","0","OK","139.750","NA","6318.618","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.heapsort" 
+"heapsort","mir-c2m","0","COMPILE_FAIL","120.517","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.heapsort.compile.err" +"heapsort","gcc-15","1","OK","151.685","NA","5536.478","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.heapsort" +"heapsort","clang","1","OK","147.930","NA","4315.873","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.heapsort" +"heapsort","mir-c2m","1","COMPILE_FAIL","120.113","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.heapsort.compile.err" +"lists","gcc-15","0","OK","160.975","NA","8841.648","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.lists" +"lists","clang","0","OK","142.189","NA","7696.170","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.lists" +"lists","mir-c2m","0","OK","146.936","1.208","5512.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.lists" +"lists","gcc-15","1","OK","175.373","NA","3438.077","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.lists" +"lists","clang","1","OK","155.047","NA","2938.820","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.lists" +"lists","mir-c2m","1","OK","146.978","1.523","4988.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.lists" +"matrix","gcc-15","0","OK","151.555","NA","20657.715","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.matrix" +"matrix","clang","0","OK","139.105","NA","13679.456","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.matrix" +"matrix","mir-c2m","0","OK","140.975","0.707","11065.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.matrix" +"matrix","gcc-15","1","OK","162.019","NA","3073.402","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.matrix" +"matrix","clang","1","OK","149.687","NA","3125.393","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.matrix" +"matrix","mir-c2m","1","OK","139.374","0.893","9466.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.matrix" 
+"method-call","gcc-15","0","COMPILE_FAIL","74.268","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.method-call.compile.1.err" +"method-call","clang","0","COMPILE_FAIL","88.040","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.method-call.compile.1.err" +"method-call","mir-c2m","0","OK","116.234","0.490","4190.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.method-call" +"method-call","gcc-15","1","COMPILE_FAIL","178.970","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.method-call.compile.1.err" +"method-call","clang","1","COMPILE_FAIL","103.005","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.method-call.compile.1.err" +"method-call","mir-c2m","1","OK","120.788","0.591","4151.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.method-call" +"mandelbrot","gcc-15","0","OK","149.164","NA","10318.641","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.mandelbrot" +"mandelbrot","clang","0","OK","139.282","NA","10586.734","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.mandelbrot" +"mandelbrot","mir-c2m","0","OK","135.511","0.328","4501.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.mandelbrot" +"mandelbrot","gcc-15","1","OK","148.877","NA","3151.572","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.mandelbrot" +"mandelbrot","clang","1","OK","140.193","NA","2910.442","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.mandelbrot" +"mandelbrot","mir-c2m","1","OK","124.927","0.400","3332.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.mandelbrot" +"nbody","gcc-15","0","OK","157.218","NA","9017.622","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.nbody" +"nbody","clang","0","OK","142.119","NA","10701.933","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.nbody" 
+"nbody","mir-c2m","0","COMPILE_FAIL","139.124","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.nbody.compile.err" +"nbody","gcc-15","1","OK","165.533","NA","2852.425","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.nbody" +"nbody","clang","1","OK","154.581","NA","2711.537","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.nbody" +"nbody","mir-c2m","1","COMPILE_FAIL","125.704","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.nbody.compile.err" +"sieve","gcc-15","0","OK","146.078","NA","5032.428","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.sieve" +"sieve","clang","0","OK","137.391","NA","13729.750","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.sieve" +"sieve","mir-c2m","0","OK","130.733","0.226","4843.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.sieve" +"sieve","gcc-15","1","OK","150.246","NA","2787.239","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.sieve" +"sieve","clang","1","OK","144.944","NA","2512.013","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.sieve" +"sieve","mir-c2m","1","OK","120.643","0.265","4170.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.sieve" +"spectral-norm","gcc-15","0","OK","153.761","NA","14877.450","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.spectral-norm" +"spectral-norm","clang","0","OK","142.870","NA","14844.299","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.spectral-norm" +"spectral-norm","mir-c2m","0","COMPILE_FAIL","121.234","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.spectral-norm.compile.err" +"spectral-norm","gcc-15","1","OK","159.815","NA","4049.676","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.spectral-norm" +"spectral-norm","clang","1","OK","151.041","NA","4075.697","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.spectral-norm" 
+"spectral-norm","mir-c2m","1","COMPILE_FAIL","125.809","NA","NA","1","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.spectral-norm.compile.err" +"strcat","gcc-15","0","OK","151.030","NA","5970.900","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O0.strcat" +"strcat","clang","0","OK","140.669","NA","5943.676","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O0.strcat" +"strcat","mir-c2m","0","OK","148.012","0.484","5773.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O0.strcat" +"strcat","gcc-15","1","OK","156.343","NA","4860.543","0","/Users/ryan/code/cfree/build/bench/opt/logs/gcc-15.O1.strcat" +"strcat","clang","1","OK","143.566","NA","4804.322","0","/Users/ryan/code/cfree/build/bench/opt/logs/clang.O1.strcat" +"strcat","mir-c2m","1","OK","149.920","0.585","5772.000","0","/Users/ryan/code/cfree/build/bench/opt/logs/mir-c2m.O1.strcat" diff --git a/scripts/opt_bench_compare.py b/scripts/opt_bench_compare.py @@ -1,15 +1,27 @@ #!/usr/bin/env python3 """Compare cfree tools vs a baseline (default: gcc-15 -O0). +The fixed compilers (gcc/clang/MIR) don't change as cfree evolves, so their +timings are measured once into a cached CSV (scripts/opt_bench_baseline.csv via +`CFREE_OPT_BENCH_MODE=baseline scripts/opt_bench.sh`). This script auto-merges +that cache with the fresh cfree-only results so comparisons still work even when +the results CSV contains only cfree rows. 
+ Usage: python3 scripts/opt_bench_compare.py [results.csv] python3 scripts/opt_bench_compare.py results.csv --base-tool gcc-15 --base-opt 0 + python3 scripts/opt_bench_compare.py results.csv --baseline-csv path.csv + python3 scripts/opt_bench_compare.py results.csv --no-baseline """ import csv import math import os import sys +DEFAULT_BASELINE_CSV = os.path.join( + os.path.dirname(os.path.abspath(__file__)), "opt_bench_baseline.csv" +) + def fnum(v): try: @@ -51,6 +63,7 @@ def main(): csv_path = None base_tool_arg = None base_opt = "0" + baseline_csv = DEFAULT_BASELINE_CSV i = 0 while i < len(args): if args[i] == "--base-tool" and i + 1 < len(args): @@ -59,6 +72,12 @@ def main(): elif args[i] == "--base-opt" and i + 1 < len(args): base_opt = args[i + 1] i += 2 + elif args[i] == "--baseline-csv" and i + 1 < len(args): + baseline_csv = args[i + 1] + i += 2 + elif args[i] == "--no-baseline": + baseline_csv = None + i += 1 else: csv_path = args[i] i += 1 @@ -73,6 +92,26 @@ def main(): with open(csv_path, newline="") as f: ok = [r for r in csv.DictReader(f) if r["status"] == "OK"] + # Benches measured by this run (typically cfree-only); used to scope output. + result_benches = {r["bench"] for r in ok} + + # Merge cached baseline timings (gcc/clang/MIR) unless disabled. Rows already + # present in the fresh CSV win, so an explicit baseline run still overrides. 
+ if ( + baseline_csv + and os.path.exists(baseline_csv) + and os.path.abspath(baseline_csv) != os.path.abspath(csv_path) + ): + seen = {(r["tool"], r["opt"], r["bench"]) for r in ok} + with open(baseline_csv, newline="") as f: + for r in csv.DictReader(f): + if r["status"] != "OK": + continue + key = (r["tool"], r["opt"], r["bench"]) + if key not in seen: + ok.append(r) + seen.add(key) + if not ok: sys.exit("compare: no OK rows in CSV") @@ -96,7 +135,10 @@ def main(): sys.exit(f"compare: no rows for tool={base_tool} opt={base_opt}") idx = {(r["tool"], r["opt"], r["bench"]): r for r in ok} - all_benches = sorted(baseline) + # Only report benches this run covered (and that the baseline has). When the + # CSV is itself a baseline dump (no separate result benches), show them all. + shown = (result_benches & set(baseline)) if result_benches else set(baseline) + all_benches = sorted(shown) or sorted(baseline) all_opts = sorted( {r["opt"] for r in ok}, key=lambda x: (int(x) if x.isdigit() else 99, x),