Optimizer Roadmap (planned work)
This is the forward-looking plan for kit's optimizer: the work still ahead to
turn on the O2 SSA mid-end, broaden inlining, close the remaining O0/O1
generated-code quality gaps, and finish the machine register-constraint model.
The current design — IR layering, the recording/optimizing boundary, the pass
catalog, allocation, and the MIR physical boundary — is documented in
../OPT.md, which this roadmap treats as the starting baseline. Pass
order, SSA representation, and the verification model described there are built
and shipping at O1; the items below are about extending and enabling them, not
re-describing them. Performance is tracked against gcc/clang and the MIR c2m
JIT on the scripts/opt_bench.sh benchmark set; the long-term external target
is QBE-class quality on that set, then closing toward MIR c2m -eg.
Related plans: INTERPRETER.md, CODEGEN.md, ARCH.md.
Baseline (where the remaining work starts)
A few facts about the current code path frame everything below:
- O1 is the only live optimized path.
opt_cgtarget_newnormalizes everyopt_level >= 1request to1(src/opt/opt.c), so no compilation selects O2 today. The O1 pipeline runs entirely in the PReg namespace with the non-splitting allocator (opt_regalloc_locations(..., allow_live_range_split = 0)). - The O2 SSA mid-end (
opt_cleanup) is fully implemented but unreached. The schedule — register SSA, mem2reg, GVN, copy-prop, DSE, LICM, pressure relief, conventional SSA, undo-SSA — is defined and maintained against targeted opt tests; it just has no caller in shipped compilation. Turning it on is a correctness-and-quality project, not a from-scratch build. - Tiny-function inlining at O1 is done (baseline).
opt_try_tiny_inlineruns in the streaming O1 path; it inlines straightlineDEFAULT/HINTcallees under an 8-op cost cap, always inlinesalways_inline, and refusesnoinline/recursive/control-rich callees. The whole-program inliner (opt_inline) also exists but is not wired into a live path. - The generic machine register-constraint mechanism is done (baseline).
The tie/forbid/clobber primitives, the per-instruction clobber side table
(
Func.inst_clobbers),machinize_inst_clobbers, andapply_machine_reg_clobbersare in tree, fed by the x64 machine-clobber hook for div/mod/shift/mul/CAS/RMW/va_arg; aarch64/riscv64 leave the hook inert. Remaining work is follow-on cleanup, not the core model.
1. Complete and turn on the O2 SSA mid-end
Goal. Make -O2 a real, selectable optimization level whose generated code
is reliably faster than -O1 on the representative benchmark mix, and route the
shared backend tail through it with coalescing and live-range splitting enabled.
Rationale. O1 already leads MIR O1 on the MIR-comparable benchmark scope
and beats gcc/clang -O0 on both compile time and runtime. The remaining
runtime quality gap is against the production compilers at -O1/-O2, and the
SSA value/memory passes (GVN, DSE, LICM, copy-prop) are where that gap closes.
The mid-end is written; the blocker is that enabling it as-is regressed runtime
versus O1 on the bench mix and broke matrix -O2, so it stays isolated until it
is a net win.
The path to flipping the switch:
- Replace wholesale def-use rebuilds with incremental SSA edges. The biggest
O2 compile-time cost is re-running
opt_rebuild_def_useafter most mutating passes (dozens of call sites). Move to MIR-style per-operand doubly-linkedssa_edgelists maintained incrementally, so walk/relink/remove/redirect are O(1) and the def-use bit becomes a verification flag, not a rebuild trigger. Land the structures first with SSA-era passes quarantined behind per-pass build gates, then re-enable each pass one at a time with its MIR-parity work and quality assertions, refreshing its benchmark row as it lands. - Fix the LICM register-pressure regression.
opt_licmhas no pressure cost model and iterates loops in raster order; it hoists invariants into preheaders that lengthen live ranges and force spills the O1 allocator avoided — the most likely cause of O2 trailing O1 today. Add a backward pressure filter that skips hoists raising pressure unless the op is expensive (multiply/divide), reuse the already-built loop tree, and walk loops inner-to-outer. - Add address-mode synthesis to the SSA combine.
opt_ssa_combinedoes not foldbase + index*scale + dispchains into memory operands the way the post-RA MIR combine does; this is the largest missing codegen pattern for memory-heavy benchmarks (hash,matrix). Decompose address chains in SSA, validated by a backend memory-legality check. - Make GVN converge in one pass. GVN branch folding does not re-enqueue
dependent blocks within a pass, so new constants only propagate on the next
opt_cleanupiteration. Add intra-pass worklist re-enqueue (the per-edge scratch bit from the incremental-edge work is the natural worklist marker). - Demand-driven phi insertion.
opt_build_reg_ssa/opt_build_ssainsert phis at every iterated-dominance-frontier site without live-in filtering. Move to demand-driven materialization (phis only where reaching defs differ) plus a fixpoint minimization pass. - Broaden the SSA passes that are currently all-or-nothing.
opt_addr_xformbails on a single non-foldable use; teach it to fold the foldable uses and rewrite the rest.opt_pressure_reliefonly sinks immediate/const candidates; extend it to any single-cross-block-use move. Compute DSE memory liveness once before the pass instead of an inline fixpoint per invocation.
2. Live-range splitting and coalescing (the O2 allocator layer)
Goal. Restore the live-range splitting layer on top of the point-indexed allocator core, and keep move-related coalescing as the O2-only allocation quality step.
Rationale. Splitting was deliberately deferred during the allocator rewrite
to the MIR-shaped point-bitmap core (opt_regalloc_locations takes an
allow_live_range_split flag that is currently a no-op at the splitting site).
Until it returns, O2 may regress on benchmarks where splitting matters
(array, hash, matrix), and hash2-style cases that trail their siblings
as the table grows — where register pressure is the suspect — have no mitigation
beyond coalescing.
- Port MIR's
get_hard_reg_with_split/lr_gap_tab/split()on top of the existing point-bitmap occupancy structure, keeping the denseOptLocation/segment representation already designed for split results. - Drive splitting from a spiller-vs-spillee frequency profit comparison (search gaps in already-allocated ranges), rather than the boundary-driven, singleton-only, non-call-crossing shape sketched earlier.
- Keep coalescing gated to O2 (matching MIR), preserving the unit-overlap conflict counting that prevents merging a multiply-defined local with a tied-register param.
3. O1 generated-code quality
Goal. Reach at least 1.10x faster than both gcc -O0 and MIR c2m -O1 on
every tracked benchmark, without giving up O1's compile-speed lead.
Rationale. O1 is the developer-workflow optimization level: its compile cost
is the same order of magnitude as a debug build but it produces materially
faster code. A handful of benchmarks (hash2, sieve, mandelbrot, strcat)
still trail MIR O1, and the gaps are specific and individually addressable.
- Extend loop-invariant hoisting at O1 beyond constants. The block-local
hoist at O1 (
opt_hoist_loop_consts) only consolidates duplicateIR_LOAD_IMMdefs that recur inside a loop into one canonical preg — enough for the sieve flag-init constantmov, but it does not move loop-invariant computations (address arithmetic, pure binops on loop-invariant operands) out of the loop. A conservative, pressure-aware single-BB invariant hoist would helpsieveand similar loops without paying for the full SSA mid-end LICM. - Hard-register copy coalescing for
IR_LOAD_IMMsources. The call-argument hint-propagation path coversldr-to-call-arg but skips immediates, sof(NULL, NULL)-style calls still materialize each zero through a temp. Extendset_preg_pref_for_call_args/ hint propagation to fire when the source op isIR_LOAD_IMM. - Strength-reduce small-divisor modulo.
hash2'sval % sizestill emitsudiv; MIR emits a reciprocal/Barrett multiply for the small-constant-divisor case. This is the largest remaininghash2gap against MIR. - FP instruction selection.
mandelbrot's inner escape loop is FP-bound; inspect it against MIR for FP register allocation and constant-pool material (kit does not vectorize, which is a separate, larger question). - General jump-threading. Collapse
b A; A: b Bchains inpass_jump.cas a general cleanup (recurs intermittently across benches and layout shifts). - Resolve the
binary-treescc-path deficit. The JIT path with identical codegen is at parity, so the loss is PLT/GOT indirection formalloc/free; a-fno-plt-style direct-call or intra-image link resolution is a link-path question more than a codegen one — see LINKER.md.
4. O0 generated-code quality
Goal. Shrink the O0 runtime and code-size gap versus clang -O0 (currently
~1.3x slower on a small aarch64 sample) without sacrificing debuggability of
source-level variables.
Rationale. O0 is the no-optimizer path (the frontend CgTarget wires
straight to NativeDirectTarget), so these are instruction-selection and
local-lowering issues, not optimizer-pass issues. They are safe to fix because
they do not change which source variables get stable stack homes.
- Reuse the known-frame/no-padding prologue at O0. O0 functions still emit the old fat-prologue reservation shape: a real prologue, then an unconditional branch over ~19 nops. O1's known-frame path already eliminated this; apply the same frame-known-up-front prologue to O0 when the frame is known. This costs one extra branch per call entry and bloats hot functions (hurting I-cache).
- Stop stack-homing compiler-only temporaries. O0 materializes many
intermediate values to the stack and immediately reloads them
(
NewTreeNodestores the constant16to a slot beforemalloc; clang usesmov x0, #0x10). Keep stable stack homes for source variables but keep simple expression intermediates in registers across a single expression. - Local FP selection wins. Use encodable FP immediates directly, select
fmaddfor multiply-add expression trees, and avoid spilling FP intermediates that have no source-level storage. Visible onmandelbrotat O0. - Do not eagerly emit unused header inline/helper functions. O0 emits SDK
header helpers (
___inline_isfinite, ctype helpers,__OSSwapInt*, etc.) that clang does not, inflating object text and the linked image and obscuring measurements. Emit static-inline/header helpers only when referenced.
5. Machine register-constraint model: remaining work
Goal. Finish the follow-on cleanup now that the generic fixed-register/ clobber mechanism is in place, and keep the model the single entry point for target ISA register rules into the allocator.
Rationale. The core mechanism (tie + forbid + per-instruction clobber side
table, with the x64 hook covering div/mod/shift/mul/CAS/RMW/va_arg) is shipped
and is what makes the x64 -O1 div/mod/atomic/overflow/varargs cases correct.
What remains is removing the now-redundant defensive code the mechanism made
unnecessary, and keeping the contract clean as new constrained ops appear.
- Remove defensive backend moves the allocator now guarantees. With the
dividend tied to
raxand nothing live inrdx/rcx, themov rax, dividendinx64_binop, the address-staging inx64_atomic_cas/_rmw, and the shiftmov rcx, countare register-to-itself no-ops. Simplify them away. - Confirm
InstIddensity and stability throughpass_lowerso the side-table lookup stays O(1); fall back to a small open-addressed map keyed by instruction only if ids ever become sparse. - Add focused regressions for the live-across-the-constraint shape (e.g. a function that divides while keeping an unrelated value live across the divide) so the live-across-forbid loop stays covered as new ops are added.
6. Inlining beyond tiny streaming callees
Goal. Grow inlining past the streaming tiny-callee policy toward a real
cost-model-driven inliner, reusing the whole-program machinery that already
exists in pass_inline.c.
Rationale. The tiny O1 inliner captures the biggest easy win (one- and
two-instruction leaf helpers) but deliberately refuses anything with control
flow, calls, or aggregate/sret/byval ABI shapes, and cannot see forward-defined
or cross-TU callees. opt_inline (the whole-program inliner with growth gates)
is in tree but unreached; richer inlining is mostly a policy-and-wiring problem,
not new transform code.
- Inline callees with control flow beyond the straightline whitelist, and
callees with multi-part/aggregate ABI args/returns (today rejected by
inline_rewrite_supported). - Forward-defined-callee inlining. The streaming registry only sees callees recorded earlier in the TU; a finalization-time pass (matching the ARM_64 reachability regime) could inline forward-defined callees.
- A whole-program cost model. Wire
opt_inline's growth-bounded gates into a live path (most naturally at O2) so larger-than-tiny callees inline when the cost model says it pays.
7. Open questions and non-goals
Cross-cutting questions to resolve as the work above lands:
- Cheapest O2-positive subset. What minimal set of SSA passes already makes O2 > O1, so O2 can be enabled incrementally rather than waiting for the whole schedule?
- Allocation path. Should O2 keep O1's point-indexed bitmap core plus splitting/coalescing, or does its quality justify a distinct path — and does coalescing-before-liveness (delete moves first, then compute liveness on the reduced program, as MIR does) buy enough compile time to reorder the pipeline?
- DCE overlap. Can the pre-allocation dead-def elimination and the post-RA DCE be unified (at O1, and narrowed once SSA DCE runs at O2)? They overlap.
- Constraint timing. When O2 turns on, do any SSA-era passes need to learn about fixed-register constraints before allocation, or does applying all constraints at allocation time (as today) remain sufficient?
- Inlining scope. Is there an O1-affordable inlining subset (forward-defined tiny callees) worth enabling on the streaming path, or is the whole-program cost model O2-only? Cross-TU inlining is likely a separate, later effort.
- Temporary-spilling boundary. How much of the O0 "compiler-only
temporaries" fix belongs in the C frontend's expression lowering versus
NativeDirectTarget's register cache?
A shared register-pressure model is the common precursor under several items
(LICM, pressure relief, O1 invariant hoisting): once it exists those passes can
make pressure-aware rather than structural decisions. Auto-vectorization is
explicitly not planned near-term — noted only because several FP benchmarks
(mandelbrot, sieve init loops) lose to vectorizing peers; any future work
there is a large standalone project.
8. Measurement and acceptance
Goal. Keep every change benchmark-gated and keep the benchmark harness able to attribute time to the right pipeline stage.
- Add benchmark CSV columns that split kit frontend, optimizer, link, and JIT
time, instead of relying on ad hoc
--timelogs; the frontend currently dominates absolute time on these benches and masks backend movement. - Keep
-O1materially faster to compile than-O2, and treat any O2 enablement as gated on "no material regression in the backend-only O1 geomean and no bootstrap regression." - Expand the benchmark set back toward all MIR
c-benchmarksthat need only supported hosted libc/runtime features, and track MIR's own failures separately so kit coverage is not obscured by external-tool limits.
The performance contract for all of the above is preservation of what already wins: arena-backed function-local allocation, dense PReg-indexed arrays for hot per-value data, point-indexed allocator occupancy bitsets, operand-driven combiner dispatch, and block-layout fallthrough cleanup. New work must not add abstraction layers in hot loops, per-operand heap allocation in MIR, or whole-function rescans where a dense indexed array suffices.