Optimizer Roadmap (planned work)

This is the forward-looking plan for kit's optimizer: the work still ahead to turn on the O2 SSA mid-end, broaden inlining, close the remaining O0/O1 generated-code quality gaps, and finish the machine register-constraint model. The current design — IR layering, the recording/optimizing boundary, the pass catalog, allocation, and the MIR physical boundary — is documented in ../OPT.md, which this roadmap treats as the starting baseline. Pass order, SSA representation, and the verification model described there are built and shipping at O1; the items below are about extending and enabling them, not re-describing them. Performance is tracked against gcc/clang and the MIR c2m JIT on the scripts/opt_bench.sh benchmark set; the long-term external target is QBE-class quality on that set, then closing toward MIR c2m -eg.

Related plans: INTERPRETER.md, CODEGEN.md, ARCH.md.

Baseline (where the remaining work starts)

A few facts about the current code path frame everything below:

O1 is the only live optimized path. opt_cgtarget_new normalizes every opt_level >= 1 request to 1 (src/opt/opt.c), so no compilation selects O2 today. The O1 pipeline runs entirely in the PReg namespace with the non-splitting allocator (opt_regalloc_locations(..., allow_live_range_split = 0)).
The O2 SSA mid-end (opt_cleanup) is fully implemented but unreached. The schedule — register SSA, mem2reg, GVN, copy-prop, DSE, LICM, pressure relief, conventional SSA, undo-SSA — is defined and maintained against targeted opt tests; it just has no caller in shipped compilation. Turning it on is a correctness-and-quality project, not a from-scratch build.
Tiny-function inlining at O1 is done (baseline). opt_try_tiny_inline runs in the streaming O1 path; it inlines straightline DEFAULT/HINT callees under an 8-op cost cap, always inlines always_inline, and refuses noinline/recursive/control-rich callees. The whole-program inliner (opt_inline) also exists but is not wired into a live path.
The generic machine register-constraint mechanism is done (baseline). The tie/forbid/clobber primitives, the per-instruction clobber side table (Func.inst_clobbers), machinize_inst_clobbers, and apply_machine_reg_clobbers are in tree, fed by the x64 machine-clobber hook for div/mod/shift/mul/CAS/RMW/va_arg; aarch64/riscv64 leave the hook inert. Remaining work is follow-on cleanup, not the core model.

1. Complete and turn on the O2 SSA mid-end

Goal. Make -O2 a real, selectable optimization level whose generated code is reliably faster than -O1 on the representative benchmark mix, and route the shared backend tail through it with coalescing and live-range splitting enabled.

Rationale. O1 already leads MIR O1 on the MIR-comparable benchmark scope and beats gcc/clang -O0 on both compile time and runtime. The remaining runtime quality gap is against the production compilers at -O1/-O2, and the SSA value/memory passes (GVN, DSE, LICM, copy-prop) are where that gap closes. The mid-end is written; the blocker is that enabling it as-is regressed runtime versus O1 on the bench mix and broke matrix -O2, so it stays isolated until it is a net win.

The path to flipping the switch:

Replace wholesale def-use rebuilds with incremental SSA edges. The biggest O2 compile-time cost is re-running opt_rebuild_def_use after most mutating passes (dozens of call sites). Move to MIR-style per-operand doubly-linked ssa_edge lists maintained incrementally, so walk/relink/remove/redirect are O(1) and the def-use bit becomes a verification flag, not a rebuild trigger. Land the structures first with SSA-era passes quarantined behind per-pass build gates, then re-enable each pass one at a time with its MIR-parity work and quality assertions, refreshing its benchmark row as it lands.
Fix the LICM register-pressure regression. opt_licm has no pressure cost model and iterates loops in raster order; it hoists invariants into preheaders that lengthen live ranges and force spills the O1 allocator avoided — the most likely cause of O2 trailing O1 today. Add a backward pressure filter that skips hoists raising pressure unless the op is expensive (multiply/divide), reuse the already-built loop tree, and walk loops inner-to-outer.
Add address-mode synthesis to the SSA combine. opt_ssa_combine does not fold base + index*scale + disp chains into memory operands the way the post-RA MIR combine does; this is the largest missing codegen pattern for memory-heavy benchmarks (hash, matrix). Decompose address chains in SSA, validated by a backend memory-legality check.
Make GVN converge in one pass. GVN branch folding does not re-enqueue dependent blocks within a pass, so new constants only propagate on the next opt_cleanup iteration. Add intra-pass worklist re-enqueue (the per-edge scratch bit from the incremental-edge work is the natural worklist marker).
Demand-driven phi insertion. opt_build_reg_ssa/opt_build_ssa insert phis at every iterated-dominance-frontier site without live-in filtering. Move to demand-driven materialization (phis only where reaching defs differ) plus a fixpoint minimization pass.
Broaden the SSA passes that are currently all-or-nothing. opt_addr_xform bails on a single non-foldable use; teach it to fold the foldable uses and rewrite the rest. opt_pressure_relief only sinks immediate/const candidates; extend it to any single-cross-block-use move. Compute DSE memory liveness once before the pass instead of an inline fixpoint per invocation.

2. Live-range splitting and coalescing (the O2 allocator layer)

Goal. Restore the live-range splitting layer on top of the point-indexed allocator core, and keep move-related coalescing as the O2-only allocation quality step.

Rationale. Splitting was deliberately deferred during the allocator rewrite to the MIR-shaped point-bitmap core (opt_regalloc_locations takes an allow_live_range_split flag that is currently a no-op at the splitting site). Until it returns, O2 may regress on benchmarks where splitting matters (array, hash, matrix), and hash2-style cases that trail their siblings as the table grows — where register pressure is the suspect — have no mitigation beyond coalescing.

Port MIR's get_hard_reg_with_split / lr_gap_tab / split() on top of the existing point-bitmap occupancy structure, keeping the dense OptLocation/segment representation already designed for split results.
Drive splitting from a spiller-vs-spillee frequency profit comparison (search gaps in already-allocated ranges), rather than the boundary-driven, singleton-only, non-call-crossing shape sketched earlier.
Keep coalescing gated to O2 (matching MIR), preserving the unit-overlap conflict counting that prevents merging a multiply-defined local with a tied-register param.

3. O1 generated-code quality

Goal. Reach at least 1.10x faster than both gcc -O0 and MIR c2m -O1 on every tracked benchmark, without giving up O1's compile-speed lead.

Rationale. O1 is the developer-workflow optimization level: its compile cost is the same order of magnitude as a debug build but it produces materially faster code. A handful of benchmarks (hash2, sieve, mandelbrot, strcat) still trail MIR O1, and the gaps are specific and individually addressable.

Extend loop-invariant hoisting at O1 beyond constants. The block-local hoist at O1 (opt_hoist_loop_consts) only consolidates duplicate IR_LOAD_IMM defs that recur inside a loop into one canonical preg — enough for the sieve flag-init constant mov, but it does not move loop-invariant computations (address arithmetic, pure binops on loop-invariant operands) out of the loop. A conservative, pressure-aware single-BB invariant hoist would help sieve and similar loops without paying for the full SSA mid-end LICM.
Hard-register copy coalescing for IR_LOAD_IMM sources. The call-argument hint-propagation path covers ldr-to-call-arg but skips immediates, so f(NULL, NULL)-style calls still materialize each zero through a temp. Extend set_preg_pref_for_call_args / hint propagation to fire when the source op is IR_LOAD_IMM.
Strength-reduce small-divisor modulo. hash2's val % size still emits udiv; MIR emits a reciprocal/Barrett multiply for the small-constant-divisor case. This is the largest remaining hash2 gap against MIR.
FP instruction selection. mandelbrot's inner escape loop is FP-bound; inspect it against MIR for FP register allocation and constant-pool material (kit does not vectorize, which is a separate, larger question).
General jump-threading. Collapse b A; A: b B chains in pass_jump.c as a general cleanup (recurs intermittently across benches and layout shifts).
Resolve the binary-trees cc-path deficit. The JIT path with identical codegen is at parity, so the loss is PLT/GOT indirection for malloc/free; a -fno-plt-style direct-call or intra-image link resolution is a link-path question more than a codegen one — see LINKER.md.

4. O0 generated-code quality

Goal. Shrink the O0 runtime and code-size gap versus clang -O0 (currently ~1.3x slower on a small aarch64 sample) without sacrificing debuggability of source-level variables.

Rationale. O0 is the no-optimizer path (the frontend CgTarget wires straight to NativeDirectTarget), so these are instruction-selection and local-lowering issues, not optimizer-pass issues. They are safe to fix because they do not change which source variables get stable stack homes.

Reuse the known-frame/no-padding prologue at O0. O0 functions still emit the old fat-prologue reservation shape: a real prologue, then an unconditional branch over ~19 nops. O1's known-frame path already eliminated this; apply the same frame-known-up-front prologue to O0 when the frame is known. This costs one extra branch per call entry and bloats hot functions (hurting I-cache).
Stop stack-homing compiler-only temporaries. O0 materializes many intermediate values to the stack and immediately reloads them (NewTreeNode stores the constant 16 to a slot before malloc; clang uses mov x0, #0x10). Keep stable stack homes for source variables but keep simple expression intermediates in registers across a single expression.
Local FP selection wins. Use encodable FP immediates directly, select fmadd for multiply-add expression trees, and avoid spilling FP intermediates that have no source-level storage. Visible on mandelbrot at O0.
Do not eagerly emit unused header inline/helper functions. O0 emits SDK header helpers (___inline_isfinite, ctype helpers, __OSSwapInt*, etc.) that clang does not, inflating object text and the linked image and obscuring measurements. Emit static-inline/header helpers only when referenced.

5. Machine register-constraint model: remaining work

Goal. Finish the follow-on cleanup now that the generic fixed-register/ clobber mechanism is in place, and keep the model the single entry point for target ISA register rules into the allocator.

Rationale. The core mechanism (tie + forbid + per-instruction clobber side table, with the x64 hook covering div/mod/shift/mul/CAS/RMW/va_arg) is shipped and is what makes the x64 -O1 div/mod/atomic/overflow/varargs cases correct. What remains is removing the now-redundant defensive code the mechanism made unnecessary, and keeping the contract clean as new constrained ops appear.

Remove defensive backend moves the allocator now guarantees. With the dividend tied to rax and nothing live in rdx/rcx, the mov rax, dividend in x64_binop, the address-staging in x64_atomic_cas/_rmw, and the shift mov rcx, count are register-to-itself no-ops. Simplify them away.
Confirm InstId density and stability through pass_lower so the side-table lookup stays O(1); fall back to a small open-addressed map keyed by instruction only if ids ever become sparse.
Add focused regressions for the live-across-the-constraint shape (e.g. a function that divides while keeping an unrelated value live across the divide) so the live-across-forbid loop stays covered as new ops are added.

6. Inlining beyond tiny streaming callees

Goal. Grow inlining past the streaming tiny-callee policy toward a real cost-model-driven inliner, reusing the whole-program machinery that already exists in pass_inline.c.

Rationale. The tiny O1 inliner captures the biggest easy win (one- and two-instruction leaf helpers) but deliberately refuses anything with control flow, calls, or aggregate/sret/byval ABI shapes, and cannot see forward-defined or cross-TU callees. opt_inline (the whole-program inliner with growth gates) is in tree but unreached; richer inlining is mostly a policy-and-wiring problem, not new transform code.

Inline callees with control flow beyond the straightline whitelist, and callees with multi-part/aggregate ABI args/returns (today rejected by inline_rewrite_supported).
Forward-defined-callee inlining. The streaming registry only sees callees recorded earlier in the TU; a finalization-time pass (matching the ARM_64 reachability regime) could inline forward-defined callees.
A whole-program cost model. Wire opt_inline's growth-bounded gates into a live path (most naturally at O2) so larger-than-tiny callees inline when the cost model says it pays.

7. Open questions and non-goals

Cross-cutting questions to resolve as the work above lands:

Cheapest O2-positive subset. What minimal set of SSA passes already makes O2 > O1, so O2 can be enabled incrementally rather than waiting for the whole schedule?
Allocation path. Should O2 keep O1's point-indexed bitmap core plus splitting/coalescing, or does its quality justify a distinct path — and does coalescing-before-liveness (delete moves first, then compute liveness on the reduced program, as MIR does) buy enough compile time to reorder the pipeline?
DCE overlap. Can the pre-allocation dead-def elimination and the post-RA DCE be unified (at O1, and narrowed once SSA DCE runs at O2)? They overlap.
Constraint timing. When O2 turns on, do any SSA-era passes need to learn about fixed-register constraints before allocation, or does applying all constraints at allocation time (as today) remain sufficient?
Inlining scope. Is there an O1-affordable inlining subset (forward-defined tiny callees) worth enabling on the streaming path, or is the whole-program cost model O2-only? Cross-TU inlining is likely a separate, later effort.
Temporary-spilling boundary. How much of the O0 "compiler-only temporaries" fix belongs in the C frontend's expression lowering versus NativeDirectTarget's register cache?

A shared register-pressure model is the common precursor under several items (LICM, pressure relief, O1 invariant hoisting): once it exists those passes can make pressure-aware rather than structural decisions. Auto-vectorization is explicitly not planned near-term — noted only because several FP benchmarks (mandelbrot, sieve init loops) lose to vectorizing peers; any future work there is a large standalone project.

8. Measurement and acceptance

Goal. Keep every change benchmark-gated and keep the benchmark harness able to attribute time to the right pipeline stage.

Add benchmark CSV columns that split kit frontend, optimizer, link, and JIT time, instead of relying on ad hoc --time logs; the frontend currently dominates absolute time on these benches and masks backend movement.
Keep -O1 materially faster to compile than -O2, and treat any O2 enablement as gated on "no material regression in the backend-only O1 geomean and no bootstrap regression."
Expand the benchmark set back toward all MIR c-benchmarks that need only supported hosted libc/runtime features, and track MIR's own failures separately so kit coverage is not obscured by external-tool limits.

The performance contract for all of the above is preservation of what already wins: arena-backed function-local allocation, dense PReg-indexed arrays for hot per-value data, point-indexed allocator occupancy bitsets, operand-driven combiner dispatch, and block-layout fallthrough cleanup. New work must not add abstraction layers in hot loops, per-operand heap allocation in MIR, or whole-function rescans where a dense indexed array suffices.

	kit kit
	git clone https://git.ryansepassi.com/git/kit.git
	Log \| Files \| Refs \| README