kit

kit
git clone https://git.ryansepassi.com/git/kit.git
Log | Files | Refs | README

Optimizer Roadmap (planned work)

This is the forward-looking plan for kit's optimizer: the work still ahead to turn on the O2 SSA mid-end, broaden inlining, close the remaining O0/O1 generated-code quality gaps, and finish the machine register-constraint model. The current design — IR layering, the recording/optimizing boundary, the pass catalog, allocation, and the MIR physical boundary — is documented in ../OPT.md, which this roadmap treats as the starting baseline. Pass order, SSA representation, and the verification model described there are built and shipping at O1; the items below are about extending and enabling them, not re-describing them. Performance is tracked against gcc/clang and the MIR c2m JIT on the scripts/opt_bench.sh benchmark set; the long-term external target is QBE-class quality on that set, then closing toward MIR c2m -eg.

Related plans: INTERPRETER.md, CODEGEN.md, ARCH.md.

Baseline (where the remaining work starts)

A few facts about the current code path frame everything below:

1. Complete and turn on the O2 SSA mid-end

Goal. Make -O2 a real, selectable optimization level whose generated code is reliably faster than -O1 on the representative benchmark mix, and route the shared backend tail through it with coalescing and live-range splitting enabled.

Rationale. O1 already leads MIR O1 on the MIR-comparable benchmark scope and beats gcc/clang -O0 on both compile time and runtime. The remaining runtime quality gap is against the production compilers at -O1/-O2, and the SSA value/memory passes (GVN, DSE, LICM, copy-prop) are where that gap closes. The mid-end is written; the blocker is that enabling it as-is regressed runtime versus O1 on the bench mix and broke matrix -O2, so it stays isolated until it is a net win.

The path to flipping the switch:

2. Live-range splitting and coalescing (the O2 allocator layer)

Goal. Restore the live-range splitting layer on top of the point-indexed allocator core, and keep move-related coalescing as the O2-only allocation quality step.

Rationale. Splitting was deliberately deferred during the allocator rewrite to the MIR-shaped point-bitmap core (opt_regalloc_locations takes an allow_live_range_split flag that is currently a no-op at the splitting site). Until it returns, O2 may regress on benchmarks where splitting matters (array, hash, matrix), and hash2-style cases that trail their siblings as the table grows — where register pressure is the suspect — have no mitigation beyond coalescing.

3. O1 generated-code quality

Goal. Reach at least 1.10x faster than both gcc -O0 and MIR c2m -O1 on every tracked benchmark, without giving up O1's compile-speed lead.

Rationale. O1 is the developer-workflow optimization level: its compile cost is the same order of magnitude as a debug build but it produces materially faster code. A handful of benchmarks (hash2, sieve, mandelbrot, strcat) still trail MIR O1, and the gaps are specific and individually addressable.

4. O0 generated-code quality

Goal. Shrink the O0 runtime and code-size gap versus clang -O0 (currently ~1.3x slower on a small aarch64 sample) without sacrificing debuggability of source-level variables.

Rationale. O0 is the no-optimizer path (the frontend CgTarget wires straight to NativeDirectTarget), so these are instruction-selection and local-lowering issues, not optimizer-pass issues. They are safe to fix because they do not change which source variables get stable stack homes.

5. Machine register-constraint model: remaining work

Goal. Finish the follow-on cleanup now that the generic fixed-register/ clobber mechanism is in place, and keep the model the single entry point for target ISA register rules into the allocator.

Rationale. The core mechanism (tie + forbid + per-instruction clobber side table, with the x64 hook covering div/mod/shift/mul/CAS/RMW/va_arg) is shipped and is what makes the x64 -O1 div/mod/atomic/overflow/varargs cases correct. What remains is removing the now-redundant defensive code the mechanism made unnecessary, and keeping the contract clean as new constrained ops appear.

6. Inlining beyond tiny streaming callees

Goal. Grow inlining past the streaming tiny-callee policy toward a real cost-model-driven inliner, reusing the whole-program machinery that already exists in pass_inline.c.

Rationale. The tiny O1 inliner captures the biggest easy win (one- and two-instruction leaf helpers) but deliberately refuses anything with control flow, calls, or aggregate/sret/byval ABI shapes, and cannot see forward-defined or cross-TU callees. opt_inline (the whole-program inliner with growth gates) is in tree but unreached; richer inlining is mostly a policy-and-wiring problem, not new transform code.

7. Open questions and non-goals

Cross-cutting questions to resolve as the work above lands:

A shared register-pressure model is the common precursor under several items (LICM, pressure relief, O1 invariant hoisting): once it exists those passes can make pressure-aware rather than structural decisions. Auto-vectorization is explicitly not planned near-term — noted only because several FP benchmarks (mandelbrot, sieve init loops) lose to vectorizing peers; any future work there is a large standalone project.

8. Measurement and acceptance

Goal. Keep every change benchmark-gated and keep the benchmark harness able to attribute time to the right pipeline stage.

The performance contract for all of the above is preservation of what already wins: arena-backed function-local allocation, dense PReg-indexed arrays for hot per-value data, point-indexed allocator occupancy bitsets, operand-driven combiner dispatch, and block-layout fallthrough cleanup. New work must not add abstraction layers in hot loops, per-operand heap allocation in MIR, or whole-function rescans where a dense indexed array suffices.