kit

kit
git clone https://git.ryansepassi.com/git/kit.git
Log | Files | Refs | README

Optimizer (OPT)

This document is the design reference for kit's optimizer: the module that sits between the recording code-generation API and the per-architecture native backends. The optimizer owns a private, mutable IR; it lowers each recorded function into that IR, runs analyses and transforms over it, performs register allocation, builds a physical post-allocation IR (MIR), and finally replays the MIR into a NativeTarget backend. It is also the source of the function shape the bytecode interpreter consumes. The focus here is layering, ownership, representation invariants, and the reasoning behind the boundaries — not API signatures, which live in the headers. Cross-references: DESIGN.md, CODEGEN.md, IR.md, ARCH.md, and INTERPRETER.md.

1. Where the optimizer sits

kit's codegen has two surfaces. The semantic surface is the recording code-generation API (cg/ir.h, the CgTarget interface): frontends call it to describe a function. The physical surface is the per-architecture NativeTarget (arch/native_target.h): it encodes machine code. The optimizer is the bridge.

frontend  --CgTarget calls-->  CgIrRecorder  --records-->  CgIrFunc tape
                                                                 |
                                                opt_func_from_cg_ir (cg_ir_lower.c)
                                                                 v
                                            optimizer IR: Func / Block / Inst / Val / PReg
                                                                 |
                                                  passes (analysis + transform)
                                                                 |
                                                regalloc -> MFunc (physical MIR)
                                                                 v
                                       opt_emit_native --NativeTarget calls--> machine code

At -O0 the optimizer is not installed at all: the driver wires the frontend's CgTarget straight to the backend's NativeDirectTarget, which emits in a single pass with a small register cache (see CODEGEN.md). At opt_level >= 1, opt_cgtarget_new (src/opt/opt.c) installs a CgIrRecorder (cg/ir_recorder.c) as the sink. The recorder captures each function as a CgIrFunc tape, and on completion fires the optimizer's per-function callback.

OptImpl (in src/opt/opt.c) is the wrapper state: the wrapped real target, the resolved NativeTarget*, an optional dump writer, and a per-translation- unit registry of recorded CgIrFuncs (with a parallel lazily-lowered-Func cache) used for streaming tiny-callee inline lookup.

When each function is processed: streaming vs. finalization

Two scheduling regimes exist, chosen by target architecture:

Both regimes converge on the same lowering path and backend tail; they differ only in when a function is lowered and whether dead local functions are dropped before lowering or left for the linker (Section 3.1).

The recording/optimizing boundary

The split between recording (cg/ir) and optimizing (opt/ir) is deliberate and is the central design decision of this module:

Lowering also performs the first storage-classification decision. In lower_locals, each semantic CGLocal becomes either register storage (CG_LOCAL_STORAGE_REG, a fresh PReg, operands of kind OPK_REG) or frame storage (CG_LOCAL_STORAGE_FRAME, a FrameSlot, operands of kind OPK_LOCAL). A local is forced to a frame home when it is address-taken / memory-required (local_needs_home), an aggregate, or larger than a machine word. Everything else starts in a pseudo-register. Address-taken locals begin in frame storage; later HIR address-folding and promotion passes recover register storage for those whose address does not actually escape (Section 4). va_list operands are lowered as opaque pointer values, never address-taken, so that all va-layout knowledge stays behind the NativeTarget va hooks.

2. The optimizer IR and its operand model

The optimizer IR lives in src/opt/ir.h / src/opt/ir.c. Its shape:

Virtual vs physical operands; the mode-on-Func invariant

Operand (kind OPK_REG/OPK_IMM/OPK_LOCAL/OPK_GLOBAL/OPK_INDIRECT) is intentionally not a bare value id. Register operands change meaning across the pipeline, but the field never changes — the mode is a flag on Func, never encoded in the numeric id:

IR_PARAM_DECL is a def-only marker carrying no operands — the param's storage lives in the IRParam table, not in a synthetic self-operand. These invariants (virtual-only HIR operands, single-namespace-at-a-time, def-only param decls) are what the debug verifier (opt_verify, Section 7) checks at phase boundaries, so that a stale physical operand or a wrong-namespace use fails at the nearest checkpoint rather than in the backend encoder.

FrameSlot is the frame-storage currency: locals forced to memory, spill slots, ABI parameter slots, alloca regions, and outgoing-argument areas.

Token aliasing: optimizer-local names onto NativeTarget types

src/opt/ir.h deliberately reuses the physical backend's data types as the optimizer's own, via a layer of preprocessor #defines. After including arch/native_target.h it remaps a set of optimizer-local tokens onto the Native* types:

The reason is that the optimizer's frame-slot, register-class, and physical- register vocabulary is the backend's; sharing the structs avoids a translation layer at the emit boundary, where NativeFrame* is exactly what opt_emit_native hands the NativeTarget. The cost is a namespace hazard: a .c file that needs the real semantic cg/ir.h Operand/CG*Desc types (for example because it reads the recorded tape, or it talks to the live NativeTarget in Native* terms) must first #undef the aliased tokens. cg_ir_lower.c, opt.c, and pass_native_emit.c each do exactly this at the top of the file — they straddle the boundary and must escape the optimizer-local remapping to name the other side's types. Files that live entirely inside the optimizer IR (the analysis and transform passes) keep the aliases and never #undef.

3. One lowering path, three consumers

There is a single lowering path through the optimizer IR. The opt level and the consumer choose how far down it the function travels.

                       opt_func_from_cg_ir
                              |
            +-----------------+------------------+
            |                 |                  |
       O1 native          O2 mid-end        interpreter tap
   opt_run_o1_native    opt_cleanup +      opt_run_o1_interp
                        shared lowering    (stops before machinize)
            |                 |                  |
       machinize          SSA build,             |
       regalloc           value/mem passes,      |
       MIR + emit         conventional SSA,      |
                          undo-SSA, then         |
                          shared lowering        |
            v                 v                  v
       NativeTarget       NativeTarget      interp bytecode

O1 native (opt_run_o1_native)

This is the live optimized path for compiled output. opt_run_o1_native (src/opt/opt.c) is the per-function driver; how a function reaches it depends on the scheduling regime of Section 1. On x64/rv64 the per-function callback lowers the recorded function and calls opt_run_o1_native directly as each function is recorded. On ARM_64 the callback only registers the function; lowering and the call to opt_run_o1_native happen at finalization, once the reachability sweep has selected the function. Either way the function travels the same pipeline, entirely in the PReg namespace (opt_reg_ssa == 0) — no SSA construction, no value numbering. In source order:

build_cfg -> jump_cleanup(CFG) -> build_cfg -> simplify_local
try_tiny_inline   (+ cfg/jump_cleanup/cfg if anything inlined)
verify "lowering-cfg"
machinize_native        ABI/call/ret/param constraints + machine clobbers
verify "lowering-machinize"
addr_xform_pregs        fold ADDR_OF(local) into OPK_LOCAL loads/stores
promote_scalar_locals   non-escaped scalar frame slot -> PReg
addr_of_global_cse      hoist duplicate ADDR_OF(global) to entry
build_loop_tree
lower_loop_imm_operands / hoist_loop_consts   loop-invariant imm materialization
live_blocks             per-block PReg liveness (backward dataflow)
dead_def_elim_with_live pre-RA dead-definition elimination
regalloc_locations      PReg -> hard reg / spill slot (no live-range splitting)
verify "post-regalloc"
lower_to_mir            build physical MFunc; insert spill/reload
mir_verify "lower-mir"
mir_combine             post-RA peephole / addressing-mode synthesis
mir_dce                 post-RA dead-code elimination
mir_jump_cleanup(CFG) -> mir_build_cfg -> mir_jump_cleanup(LAYOUT)
emit_native             replay MIR into the NativeTarget

Once a function enters this pipeline it runs every stage — there is no per-op bypass within the pipeline itself. Varargs, inline asm, aggregates/sret/byval are all handled here. Most stages are bracketed by an opt_verify / opt_mir_verify checkpoint with a stage tag, and KIT_DUMP=<tag> dumps the IR at the matching stage (entry before any pass, pre-emit just before emit).

The reachability decision lives outside this pipeline and is per-architecture (Section 1). At module finalization (opt_on_finalize) file-scope asm blocks captured during recording are replayed on every target. On ARM_64, finalization additionally runs the reachability sweep that selects which functions are lowered at all, so dead local functions/data are never lowered or emitted; the survivors then each run the full pipeline above. On x64/rv64 every recorded function was already lowered and emitted during streaming, so dead-static elimination is left to the linker rather than performed here.

O2 mid-end (opt_cleanup + shared lowering)

The O2 mid-end is the SSA-based optimization schedule defined in opt_cleanup (src/opt/pass_o2.c). It is the intended mid-end architecture and is fully implemented, but it is not on the shipped code path: opt_cgtarget_new normalizes every requested opt_level to 1 (the line o->level = 1 in src/opt/opt.c), so no compilation ever selects O2 and every opt_level >= 1 request runs the O1 native path.

The rationale for this normalization is isolation. Keeping the O2 schedule defined and its passes maintained means the SSA representation and its incremental def-use can stabilize against targeted optimizer tests independently, without an SSA-construction or value-numbering bug affecting shipped output. The schedule is documented here because it is the designed mid-end shape that the O1 path is a deliberately reduced subset of; the section describes the intended architecture, not a live code path. The schedule is:

build_cfg / jump_cleanup(CFG) / build_cfg     canonicalize control flow
build_reg_ssa                                 PReg -> Val (register SSA)
block_cloning                                 bounded clone of small blocks
build_ssa                                     mem2reg: promote frame locals, insert phis
ssa_dce / copy_cleanup
addr_xform                                    fold address pseudos into mem operands
simplify                                      SSA-aware identity/algebraic cleanup
gvn                                           value numbering, constprop, branch fold,
                                              redundant-load reuse
copy_prop                                     copy + redundant-extension elimination
dse                                           dead store elimination
build_loop_tree / licm                        hoist loop invariants
pressure_relief                               sink same-block computes
make_conventional_ssa                         phis -> edge copies (IRF_NO_COALESCE)
ssa_combine
undo_ssa / copy_cleanup                        Val -> PReg, allocation-ready
jump_opt

By design an O2 function then re-enters the same backend tail as O1 (machinize through emit), with the allocator's live-range splitting and move-related coalescing enabled — the variants that the O1 path leaves off. The SSA value/memory passes (opt_gvn, opt_dse, opt_licm, opt_pressure_relief, opt_ssa_combine) live in src/opt/pass_o2.c; SSA construction and phi destruction in src/opt/pass_ssa.c.

Interpreter tap (opt_run_o1_interp)

The interpreter consumes the optimizer IR directly rather than machine code. The tap runs the maximal target-independent subset of the O1 pipeline and stops before machinization: build CFG, jump cleanup, simplify_local, the PReg-level address folds and scalar-local promotion, addr_of_global_cse, loop tree, and liveness-driven dead-def elimination. It deliberately stops before opt_machinize_native, register allocation, MIR lowering, and native emit. The result is a Func still in the PReg namespace (opt_reg_ssa == 0, no IR_PHI phis) that src/interp/lower.c lowers into threaded bytecode. The tap runs the folds even though in the native pipeline they sit after machinize, because they depend only on the PReg/frame-slot view, not on physical-register pools — so they are safe and they shrink the interpreter's work. See INTERPRETER.md.

4. Pass catalog by role

The passes are grouped here by responsibility. Each is one Func-in-place transform or analysis; the file paths orient the reader.

SSA mid-end (O2)

Shared analyses

Backend tail

5. Machine register-constraint model

Some target instructions pin operands or results to specific physical registers and clobber others as a side effect of their encoding — hardware constraints, not allocator choices (x86-64 idiv/div pinning the dividend to rax and clobbering rdx, variable shifts requiring the count in cl, one-operand mul, cmpxchg, and the va_arg offset scratch). aarch64 and riscv64 have no such instructions — their div/shift/mul are ordinary three-operand forms — so on those targets the constraint hooks are inert.

The optimizer models all fixed-register requirements through two allocator primitives, and the allocator (pass_lower.c) speaks only in physical register numbers here:

Three sources feed these, all unified at allocation time:

  1. Calls — the call plan's clobber_mask (caller-saved by default, or the call-specific mask) drives the live-across-forbid loop; argument/result registers come from the plan.
  2. Inline asmpass_machinize.c resolves named-register constraint strings and clobber lists into masks and fixed-register indices on the IRAsmAux; pass_lower.c's apply_asm_register_constraints ties the fixed operands and runs the live-across-forbid loop.
  3. Generic machine instructions — a binop or convert has no aux to hang constraints on, so machinization queries the target's machine-clobber hook per instruction and stores the result in a per-function side table keyed by InstId (Func.inst_clobbers, built in machinize_inst_clobbers). At allocation, apply_machine_reg_clobbers looks up the instruction's clobber mask and applies the same live-across-forbid loop. A NULL hook (aa64/rv64) means no entries and zero behavior change.

This is the single place where target ISA register rules enter the allocator, and all three sources reuse one mechanism — tie + forbid — rather than patching assignments after the fact. A value that merely dies at the instruction needs no constraint (the backend stages it into/out of the fixed register itself); only values that survive past the instruction are forbidden.

6. Allocation, MIR, and the physical boundary

Allocation does not rewrite HIR. opt_regalloc_locations consumes block liveness and the compressed live ranges and writes one canonical location per PReg into Func.preg_locs (OptLoc: hard register or spill slot). HIR operands stay virtual after allocation — the verifier checks this.

opt_lower_to_mir then builds the physical IR Func.mir (an MFunc): each virtual OPK_REG is translated through its OptLoc into a physical register or a frame access, spilled values get a reload before each use and a store after each def, and call plans are lowered into physical argument/return moves. From this point the IR is physical and non-SSA (registers may be multiply defined). All PReg-to-physical knowledge lives in this one step; after it, the HIR is untouched and the MIR is fully physical. The downstream MIR passes (combine, DCE, jump/layout cleanup) run over the MIR view and rely on physical-register liveness for their safety checks, then opt_emit_native replays the MIR.

The reason allocation results are a separate table rather than rewritten operands is the same mode-clarity principle from Section 2: a post-allocation pass can never accidentally treat a physical register as a PReg, replay can never see a stale virtual operand, and the MIR verifier can assert "no PRegs or Vals here" at a single boundary.

7. Verification and observability

The optimizer is checkpoint-verified in debug builds. opt_verify(Func*, stage) checks CFG reciprocity, reachable-block shape, emit-order validity, instruction ids, operand namespaces (no physical registers in HIR; correct PReg-vs-Val namespace for the current mode), phi consistency, and def-use freshness; opt_mir_verify checks the physical boundary (no virtual operands, valid frame slots, fully physical call plans). Each pass tags its checkpoint with the name of the transformation just completed, so a failure localizes to the nearest boundary. Func.opt_valid_analyses tracks coarse invalidation; passes that mutate control flow, operands, or instructions rebuild or invalidate the relevant analysis.

Observability hooks: KIT_DUMP=<tag> dumps the optimizer IR at a named stage, KIT_DUMPCG=1 dumps the recorded semantic tape before lowering, KIT_DUMP_INTERP dumps the interpreter-tap Func, and the optimizer emits scoped timing/count metrics (visible through kit run --time) for the frontend, each pass scope, allocation, and emit.