Optimizer (OPT)
This document is the design reference for kit's optimizer: the module that
sits between the recording code-generation API and the per-architecture native
backends. The optimizer owns a private, mutable IR; it lowers each recorded
function into that IR, runs analyses and transforms over it, performs register
allocation, builds a physical post-allocation IR (MIR), and finally replays the
MIR into a NativeTarget backend. It is also the source of the function shape
the bytecode interpreter consumes. The focus here is layering, ownership,
representation invariants, and the reasoning behind the boundaries — not API
signatures, which live in the headers. Cross-references: DESIGN.md,
CODEGEN.md, IR.md, ARCH.md, and
INTERPRETER.md.
1. Where the optimizer sits
kit's codegen has two surfaces. The semantic surface is the recording
code-generation API (cg/ir.h, the CgTarget interface): frontends call it to
describe a function. The physical surface is the per-architecture
NativeTarget (arch/native_target.h): it encodes machine code. The optimizer
is the bridge.
    frontend --CgTarget calls--> CgIrRecorder --records--> CgIrFunc tape
                                     |
                     opt_func_from_cg_ir (cg_ir_lower.c)
                                     v
               optimizer IR: Func / Block / Inst / Val / PReg
                                     |
                        passes (analysis + transform)
                                     |
                      regalloc -> MFunc (physical MIR)
                                     v
          opt_emit_native --NativeTarget calls--> machine code
At -O0 the optimizer is not installed at all: the driver wires the frontend's
CgTarget straight to the backend's NativeDirectTarget, which emits in a
single pass with a small register cache (see CODEGEN.md). At
opt_level >= 1, opt_cgtarget_new (src/opt/opt.c) installs a CgIrRecorder
(cg/ir_recorder.c) as the sink. The recorder captures each function as a
CgIrFunc tape, and on completion fires the optimizer's per-function callback.
OptImpl (in src/opt/opt.c) is the wrapper state: the wrapped real target,
the resolved NativeTarget*, an optional dump writer, and a per-translation-
unit registry of recorded CgIrFuncs (with a parallel lazily-lowered-Func
cache) used for streaming tiny-callee inline lookup.
When each function is processed: streaming vs. finalization
Two scheduling regimes exist, chosen by target architecture:
- Per-function streaming (x64, rv64). As the recorder completes each function it fires the optimizer's per-function callback, which lowers and fully processes that one function immediately. Functions flow through the pipeline in recording order, one at a time, before the module is finalized.
- Finalization-time, reachability-driven (ARM_64). The per-function callback registers the recorded CgIrFunc but does no lowering. All processing is deferred to module finalization, where a reachability sweep over the call/data-reloc graph computes the set of functions actually referenced from a root, prunes the rest, and only then lowers and processes the survivors.
Both regimes converge on the same lowering path and backend tail; they differ only in when a function is lowered and whether dead local functions are dropped before lowering or left for the linker (Section 3.1).
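The reachability sweep of the finalization regime can be pictured as an ordinary worklist traversal. The sketch below is illustrative only: DemoModule, demo_reachable, and the dense reference matrix are hypothetical names and shapes, not kit's data structures, and roots are simply flagged by hand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified module: functions are indices 0..n-1 and
 * refs[i][j] is true when function i references function j through a
 * call or a data relocation. None of these names come from kit. */
enum { DEMO_MAX_FUNCS = 8 };

typedef struct {
    size_t n;
    bool is_root[DEMO_MAX_FUNCS];               /* e.g. externally visible */
    bool refs[DEMO_MAX_FUNCS][DEMO_MAX_FUNCS];  /* reference graph edges */
} DemoModule;

/* Worklist sweep: mark every function reachable from a root and return
 * the number of survivors. Unmarked functions would be pruned before
 * they are ever lowered. */
static size_t demo_reachable(const DemoModule *m, bool live[DEMO_MAX_FUNCS])
{
    size_t work[DEMO_MAX_FUNCS];
    size_t top = 0, count = 0;

    assert(m->n <= DEMO_MAX_FUNCS);
    for (size_t i = 0; i < DEMO_MAX_FUNCS; i++)
        live[i] = false;
    for (size_t i = 0; i < m->n; i++) {
        if (m->is_root[i]) {
            live[i] = true;
            work[top++] = i;
        }
    }
    /* Each function is pushed at most once, so the stack cannot overflow. */
    while (top > 0) {
        size_t f = work[--top];
        count++;
        for (size_t j = 0; j < m->n; j++) {
            if (m->refs[f][j] && !live[j]) {
                live[j] = true;
                work[top++] = j;
            }
        }
    }
    return count;
}
```

Under the streaming regime no such sweep runs, which is why unreferenced local functions are left for the linker there.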
The recording/optimizing boundary
The split between recording (cg/ir) and optimizing (opt/ir) is
deliberate and is the central design decision of this module:
- The recorded CgIrFunc is a faithful, immutable transcript of the frontend's semantic intent. It speaks in CGLocal/Label/CGCallDesc terms and knows nothing about CFGs, dominators, or physical registers. Frontends and ABI lowering own that layer; the optimizer never mutates it. Keeping it immutable is what makes streaming tiny-inline re-lowering cheap and repeatable, and what lets the same recorded tape feed both the native pipeline and the interpreter.
- opt_func_from_cg_ir (src/opt/cg_ir_lower.c) translates one CgIrFunc into the optimizer's own mutable Func: a real CFG of Blocks, each holding a linear list of Insts, plus frame slots, a pseudo-register table, a value table, and the params/locals tables. From here on the optimizer works only on Func; the recorded tape is a read-only source.
Lowering also performs the first storage-classification decision. In
lower_locals, each semantic CGLocal becomes either register storage
(CG_LOCAL_STORAGE_REG, a fresh PReg, operands of kind OPK_REG) or frame
storage (CG_LOCAL_STORAGE_FRAME, a FrameSlot, operands of kind OPK_LOCAL).
A local is forced to a frame home when it is address-taken / memory-required
(local_needs_home), an aggregate, or larger than a machine word. Everything
else starts in a pseudo-register. Address-taken locals begin in frame storage;
later HIR address-folding and promotion passes recover register storage for
those whose address does not actually escape (Section 4). va_list operands
are lowered as opaque pointer values, never address-taken, so that all
va-layout knowledge stays behind the NativeTarget va hooks.
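The storage-classification rule reduces to a small predicate. The following is a minimal sketch under stated assumptions: DemoLocal, DemoStorage, and demo_classify_local are hypothetical stand-ins, and the real lower_locals may consult more state than these three fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the two storage classes described above. */
typedef enum { DEMO_STORAGE_REG, DEMO_STORAGE_FRAME } DemoStorage;

/* Hypothetical summary of one semantic local. */
typedef struct {
    size_t size;          /* size in bytes */
    bool   is_aggregate;  /* struct, union, or array */
    bool   needs_home;    /* address-taken or otherwise memory-required */
} DemoLocal;

/* A local gets a frame home when it needs an address, is an aggregate,
 * or does not fit in one machine word; everything else starts in a
 * pseudo-register. */
static DemoStorage demo_classify_local(const DemoLocal *l, size_t word_size)
{
    if (l->needs_home || l->is_aggregate || l->size > word_size)
        return DEMO_STORAGE_FRAME;
    return DEMO_STORAGE_REG;
}
```

The decision is deliberately conservative: a frame-homed scalar can still be recovered into a pseudo-register later, whereas a wrongly register-homed local could not be given an address after the fact.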
2. The optimizer IR and its operand model
The optimizer IR lives in src/opt/ir.h / src/opt/ir.c. Its shape:
- Func owns one function: its CFG (blocks, entry, emit_order), frame slots, params, locals, the pseudo-register table, the SSA value table, scope bookkeeping, allocation results, and per-pass scratch.
- Block is a basic block: a growable Inst[], explicit preds/succ edges, and a pre-allocated MCLabel for blocks born from cg_label_new.
- Inst is one recorded operation. The IROp enum mirrors the CgTarget surface essentially 1:1 (each recorded CgTarget call becomes exactly one Inst), plus a few SSA-only ops (IR_PHI, IR_CONST_I, IR_CONST_BYTES). Rich operations (calls, returns, switches, inline asm, atomics, aggregate memory ops, intrinsics, scopes, phis) carry a structured aux record so the full semantic descriptor round-trips to emission.
Virtual vs physical operands; the mode-on-Func invariant
Operand (kind OPK_REG/OPK_IMM/OPK_LOCAL/OPK_GLOBAL/OPK_INDIRECT) is
intentionally not a bare value id. Register operands change meaning across the
pipeline, but the field never changes — the mode is a flag on Func, never
encoded in the numeric id:
- During lowering and the whole O1 path, OPK_REG carries a PReg: a mutable pseudo-register id, the persistent storage location of a value.
- After opt_build_reg_ssa (O2 only), OPK_REG carries a Val: an SSA single-definition value id. Func.opt_reg_ssa records which namespace is live; shared helpers (opt_reg_count, opt_reg_type, opt_reg_cls in opt_internal.h) consult it rather than guessing from context.
- Physical registers never appear in OPK_REG HIR operands. Allocation results go to a separate location table, and physical operands appear only in the MIR (Section 6).
IR_PARAM_DECL is a def-only marker carrying no operands — the param's storage
lives in the IRParam table, not in a synthetic self-operand. These invariants
(virtual-only HIR operands, single-namespace-at-a-time, def-only param decls)
are what the debug verifier (opt_verify, Section 7) checks at phase
boundaries, so that a stale physical operand or a wrong-namespace use fails at
the nearest checkpoint rather than in the backend encoder.
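A minimal sketch of the mode-on-Func idea, assuming a much-reduced function record: DemoFunc, demo_reg_count, and demo_reg_operand_ok are hypothetical, and the real helpers in opt_internal.h carry type and class information as well.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reduction of Func to the fields that matter here. */
typedef struct {
    int      reg_ssa;   /* 0: register ids are PRegs, 1: they are Vals */
    unsigned npregs;    /* size of the pseudo-register table */
    unsigned nvals;     /* size of the SSA value table */
} DemoFunc;

/* Shared helper: the live namespace is decided by the flag on the
 * function, never by inspecting the numeric id. */
static unsigned demo_reg_count(const DemoFunc *f)
{
    return f->reg_ssa ? f->nvals : f->npregs;
}

/* Verifier-style check: a register operand is valid only if its id
 * falls inside the namespace that is currently live. */
static bool demo_reg_operand_ok(const DemoFunc *f, unsigned id)
{
    return id < demo_reg_count(f);
}
```

The same id can therefore be valid in one mode and rejected in the other, which is the kind of wrong-namespace use a phase-boundary check is meant to surface early.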
FrameSlot is the frame-storage currency: locals forced to memory, spill slots,
ABI parameter slots, alloca regions, and outgoing-argument areas.
Token aliasing: optimizer-local names onto NativeTarget types
src/opt/ir.h deliberately reuses the physical backend's data types as the
optimizer's own, via a layer of preprocessor #defines. After including
arch/native_target.h it remaps a set of optimizer-local tokens onto the
Native* types:
- FrameSlot → NativeFrameSlot, FrameSlotKind/FS_* → the NativeFrameSlot* enum, RegClass/RC_* → NativeAllocClass, CGPhysRegInfo → NativePhysRegInfo, the known-frame descriptor, and the CG_REG_* register role flags.
- It also re-#defines the now-removed semantic CG spellings (Operand, CGCallDesc, CGFuncDesc, CGParamDesc, CGScopeDesc, CGLocalStorage, FrameSlotDesc) onto the optimizer's own Opt* structs, so optimizer code can keep using the short historical names.
The reason is that the optimizer's frame-slot, register-class, and physical-
register vocabulary is the backend's; sharing the structs avoids a translation
layer at the emit boundary, where NativeFrame* is exactly what
opt_emit_native hands the NativeTarget. The cost is a namespace hazard: a
.c file that needs the real semantic cg/ir.h Operand/CG*Desc types (for
example because it reads the recorded tape, or it talks to the live NativeTarget
in Native* terms) must first #undef the aliased tokens. cg_ir_lower.c,
opt.c, and pass_native_emit.c each do exactly this at the top of the file —
they straddle the boundary and must escape the optimizer-local remapping to name
the other side's types. Files that live entirely inside the optimizer IR (the
analysis and transform passes) keep the aliases and never #undef.
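The namespace hazard and its #undef escape can be reproduced in a few lines. This sketch compresses what would be several headers into one file; the struct layouts and the DemoOptOperand, demo_opt_reg, and demo_semantic_imm names are hypothetical, and only the short name Operand is taken from the text above.

```c
#include <assert.h>

/* Stand-in for a semantic-side type (hypothetical layout). */
typedef struct { int kind; long imm; } Operand;

/* Stand-in for the optimizer's own operand struct (hypothetical layout). */
typedef struct { int kind; unsigned reg; } DemoOptOperand;

/* Alias layer, as an optimizer header might install it. */
#define Operand DemoOptOperand

/* Optimizer-internal code: "Operand" here means DemoOptOperand. */
static unsigned demo_opt_reg(const Operand *o)
{
    return o->reg;
}

/* A boundary file escapes the remapping before naming the other side. */
#undef Operand

/* From here on "Operand" is the semantic typedef again. */
static long demo_semantic_imm(const Operand *o)
{
    return o->imm;
}
```

Without the #undef, the second function would be compiled against the optimizer struct and the access to imm would not compile, which is the failure mode a boundary file has to avoid.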
3. One lowering path, three consumers
There is a single lowering path through the optimizer IR. The opt level and the consumer choose how far down it the function travels.
opt_func_from_cg_ir
|
+-----------------+------------------+
| | |
O1 native O2 mid-end interpreter tap
opt_run_o1_native opt_cleanup + opt_run_o1_interp
shared lowering (stops before machinize)
| | |
machinize SSA build, |
regalloc value/mem passes, |
MIR + emit conventional SSA, |
undo-SSA, then |
shared lowering |
v v v
NativeTarget NativeTarget interp bytecode
O1 native (opt_run_o1_native)
This is the live optimized path for compiled output. opt_run_o1_native
(src/opt/opt.c) is the per-function driver; how a function reaches it depends
on the scheduling regime of Section 1. On x64/rv64 the per-function callback
lowers the recorded function and calls opt_run_o1_native directly as each
function is recorded. On ARM_64 the callback only registers the function;
lowering and the call to opt_run_o1_native happen at finalization, once the
reachability sweep has selected the function. Either way the function travels the
same pipeline, entirely in the PReg namespace (opt_reg_ssa == 0) — no SSA
construction, no value numbering. In source order:
build_cfg -> jump_cleanup(CFG) -> build_cfg -> simplify_local
try_tiny_inline (+ cfg/jump_cleanup/cfg if anything inlined)
verify "lowering-cfg"
machinize_native ABI/call/ret/param constraints + machine clobbers
verify "lowering-machinize"
addr_xform_pregs fold ADDR_OF(local) into OPK_LOCAL loads/stores
promote_scalar_locals non-escaped scalar frame slot -> PReg
addr_of_global_cse hoist duplicate ADDR_OF(global) to entry
build_loop_tree
lower_loop_imm_operands / hoist_loop_consts loop-invariant imm materialization
live_blocks per-block PReg liveness (backward dataflow)
dead_def_elim_with_live pre-RA dead-definition elimination
regalloc_locations PReg -> hard reg / spill slot (no live-range splitting)
verify "post-regalloc"
lower_to_mir build physical MFunc; insert spill/reload
mir_verify "lower-mir"
mir_combine post-RA peephole / addressing-mode synthesis
mir_dce post-RA dead-code elimination
mir_jump_cleanup(CFG) -> mir_build_cfg -> mir_jump_cleanup(LAYOUT)
emit_native replay MIR into the NativeTarget
Once a function enters this pipeline it runs every stage — there is no per-op
bypass within the pipeline itself. Varargs, inline asm, aggregates/sret/byval are
all handled here. Most stages are bracketed by an opt_verify /
opt_mir_verify checkpoint with a stage tag, and KIT_DUMP=<tag> dumps the
IR at the matching stage (entry before any pass, pre-emit just before emit).
The reachability decision lives outside this pipeline and is per-architecture
(Section 1). At module finalization (opt_on_finalize) file-scope asm blocks
captured during recording are replayed on every target. On ARM_64, finalization
additionally runs the reachability sweep that selects which functions are lowered
at all, so dead local functions/data are never lowered or emitted; the survivors
then each run the full pipeline above. On x64/rv64 every recorded function was
already lowered and emitted during streaming, so dead-static elimination is left
to the linker rather than performed here.
O2 mid-end (opt_cleanup + shared lowering)
The O2 mid-end is the SSA-based optimization schedule defined in opt_cleanup
(src/opt/pass_o2.c). It is the intended mid-end architecture and is fully
implemented, but it is not on the shipped code path: opt_cgtarget_new normalizes
every requested opt_level to 1 (the line o->level = 1 in src/opt/opt.c),
so no compilation ever selects O2 and every opt_level >= 1 request runs the O1
native path.
The rationale for this normalization is isolation. Keeping the O2 schedule defined and its passes maintained means the SSA representation and its incremental def-use can stabilize against targeted optimizer tests independently, without an SSA-construction or value-numbering bug affecting shipped output. The schedule is documented here because it is the designed mid-end shape that the O1 path is a deliberately reduced subset of; the section describes the intended architecture, not a live code path. The schedule is:
build_cfg / jump_cleanup(CFG) / build_cfg canonicalize control flow
build_reg_ssa PReg -> Val (register SSA)
block_cloning bounded clone of small blocks
build_ssa mem2reg: promote frame locals, insert phis
ssa_dce / copy_cleanup
addr_xform fold address pseudos into mem operands
simplify SSA-aware identity/algebraic cleanup
gvn value numbering, constprop, branch fold,
redundant-load reuse
copy_prop copy + redundant-extension elimination
dse dead store elimination
build_loop_tree / licm hoist loop invariants
pressure_relief sink same-block computes
make_conventional_ssa phis -> edge copies (IRF_NO_COALESCE)
ssa_combine
undo_ssa / copy_cleanup Val -> PReg, allocation-ready
jump_opt
By design an O2 function then re-enters the same backend tail as O1 (machinize
through emit), with the allocator's live-range splitting and move-related
coalescing enabled — the variants that the O1 path leaves off. The SSA
value/memory passes (opt_gvn, opt_dse, opt_licm,
opt_pressure_relief, opt_ssa_combine) live in src/opt/pass_o2.c; SSA
construction and phi destruction in src/opt/pass_ssa.c.
Interpreter tap (opt_run_o1_interp)
The interpreter consumes the optimizer IR directly rather than machine code. The
tap runs the maximal target-independent subset of the O1 pipeline and stops
before machinization: build CFG, jump cleanup, simplify_local, the PReg-level
address folds and scalar-local promotion, addr_of_global_cse, loop tree, and
liveness-driven dead-def elimination. It deliberately stops before
opt_machinize_native, register allocation, MIR lowering, and native emit. The
result is a Func still in the PReg namespace (opt_reg_ssa == 0, no
IR_PHI phis) that src/interp/lower.c lowers into threaded bytecode. The tap
runs the folds even though in the native pipeline they sit after machinize,
because they depend only on the PReg/frame-slot view, not on physical-register
pools — so they are safe and they shrink the interpreter's work. See
INTERPRETER.md.
4. Pass catalog by role
The passes are grouped here by responsibility. Each is one Func-in-place
transform or analysis; the file paths orient the reader.
SSA mid-end (O2)
- Register SSA + mem2reg (src/opt/pass_ssa.c): opt_build_reg_ssa renames multiply-assigned PRegs into SSA Vals; opt_build_ssa promotes eligible frame-backed locals/params to SSA via dominance-frontier phi insertion and rewrites their loads/stores to values. opt_make_conventional_ssa lowers phis to edge copies (marked IRF_NO_COALESCE, because coalescing a phi edge copy can collapse a loop-carried value with its successor and miscompile the loop) and opt_undo_ssa returns to the PReg namespace for allocation.
- GVN + DSE orchestration (src/opt/pass_o2.c): opt_gvn does scalar value numbering, constant propagation, branch folding, and memory-aware redundant load / store-to-load reuse gated by alias-root and version rules; opt_dse removes stores proven dead or overwritten while preserving observable memory (volatile, atomic, calls that may clobber, escapes). opt_licm and opt_pressure_relief round out the loop/pressure work, also here.
- Peephole combine + addressing-mode synthesis (src/opt/pass_combine.c): opt_combine is a per-block forward-pass-with-fixpoint that propagates copies, folds address-producing computations into a load/store's OPK_INDIRECT base/index/scale/offset where the backend accepts the shape, sinks defs toward their sole use, and folds extension chains. It is used in two roles: directly in the O2 SSA combine (opt_ssa_combine wraps it) and as the post-RA MIR combine (Section 6). When run over physical MIR it gates each rewrite on a live-range safety check (Section 5).
- Simplify (src/opt/pass_simplify.c): opt_simplify_local is the no-SSA-required local algebraic/addressing canonicalizer used on every path; opt_simplify is the SSA-aware identity/constant cleanup used in O2.
- DCE (src/opt/pass_dce.c): opt_ssa_dce removes unused SSA defs; opt_mir_dce removes post-RA dead physical defs; both preserve side effects, including the subtle case of a value-producing op whose destination is an OPK_LOCAL (a write to an escaped frame-homed local is a memory side effect even when the op is otherwise pure).
- Copy cleanup / copy prop (src/opt/pass_copy.c): redundant-copy removal and copy propagation, including redundant extension/convert-chain elimination.
- Inlining (src/opt/pass_inline.c): opt_try_tiny_inline is the streaming O1 entry. On the pre-machinize PReg form it resolves each direct IR_CALL to a recorded callee via a lookup callback (OptImpl owns the registry and the lazily re-lowered callee cache), gates on a tiny straightline-cost cap and a whitelist that excludes calls/control-rich constructs, refuses self/recursive callees, and splices the cloned body in. The whole-program inliner machinery (inline_call_site and its gates) also lives here.
- Address folding (src/opt/pass_addr_fold.c): the always-on O1 HIR folds: opt_addr_xform_pregs (fold ADDR_OF(local) into direct OPK_LOCAL load/store operands and clear FSF_ADDR_TAKEN when all such defs retire), opt_promote_scalar_locals (promote a non-escaped scalar frame slot to a PReg, turning its stores/loads into copies), opt_addr_of_global_cse (hoist one ADDR_OF(global) compute to the entry block and reuse it), and the loop-invariant constant materialization (opt_hoist_loop_consts / opt_lower_loop_imm_operands). opt_addr_xform is the SSA-namespace counterpart used in O2.
Shared analyses
- CFG (src/opt/pass_cfg.c): opt_build_cfg derives preds/succ from each block's terminator (branches, conditional/fused branches, returns, switches, indirect branches, scope break/continue edges) and validates reciprocity; opt_mir_build_cfg recomputes them over the physical MIR.
- Order + dominators + verify (src/opt/pass_analysis.c): postorder / reverse-postorder, reachability, immediate dominators, dominator children, dominance frontiers (OptAnalysis), the coarse analysis-validity bits (OPT_ANALYSIS_CFG/DEF_USE/DOM/LOOP), and the debug verifier opt_verify.
- Liveness (src/opt/pass_live.c): opt_live_blocks solves per-block PReg liveness by backward dataflow into elastic 64-bit-word bitsets (OptBitset, grown on demand, trailing-zero-trimmed); opt_live_ranges_build produces the compressed point-indexed live ranges and per-PReg frequency/spill-cost metrics the allocator consumes.
- Hard-register liveness (src/opt/pass_hard_live.c): physical-register live-in/out over the post-RA MIR, plus the per-call clobber mask (opt_call_clobber_mask_for). This is what makes post-RA combine/DCE safe: a value in a callee-saved register survives a call, while caller-saved registers are killed by it.
- Loop detection (src/opt/pass_loop.c): opt_build_loop_tree computes loop nesting depth from dominators; depth feeds the allocator's spill-cost weighting and LICM.
Backend tail
- Type-size lowering (src/opt/pass_lower.c): the type/size machinery and the allocator that the PReg form needs before MIR (also hosts the allocation and constraint application described below).
- Machinize (src/opt/pass_machinize.c): opt_machinize_native is ABI lowering against the NativeTarget. It annotates calls/returns/params with calling-convention constraints (argument/result registers, the call clobber and return masks, callee-save markers), collects the target's register classes (allocable set, reserved scratch set, caller/callee-saved masks) and checks allocable and scratch sets do not overlap, resolves inline-asm named-register constraint strings into masks, and records per-instruction fixed-register clobbers (Section 5).
- MIR view (src/opt/pass_mir.c): the post-allocation physical IR. Rather than duplicate the CFG passes, pass_mir.c builds a transient Func view whose block arrays point at Func.mir, runs the shared opt_combine, opt_dce, opt_build_cfg, and opt_jump_cleanup over that view, and commits it back. The opt_mir_* wrappers are thin shims over this view; the shared passes are written once and reused for both HIR and MIR.
- Coalescing / allocation (src/opt/pass_coalesce.c, src/opt/pass_lower.c): opt_regalloc_locations is a point-bitmap linear-scan allocator producing the canonical Func.preg_locs location table (hard reg or spill slot per PReg) without mutating HIR operands. The non-splitting form is the O1 path; when live-range splitting is enabled (the O2 quality path) it invokes move-related coalescing (opt_coalesce_ranges), which builds a bounded conflict matrix and merges only same-class, same-type values with compatible constraints and no range conflict, and never an IRF_NO_COALESCE copy.
- Jump / layout cleanup (src/opt/pass_jump.c): opt_jump_cleanup in CFG mode drops unreachable blocks and collapses unconditional-jump chains; in LAYOUT mode it reorders blocks for fallthrough, rotates simple single-latch loops, and inverts mis-aligned conditional branches so the per-iteration back-jump disappears.
- Native emit (src/opt/pass_native_emit.c): opt_emit_native replays the physical MIR into a NativeTarget, using NativeLoc (register / frame / imm / address) as the operand currency. It reserves exactly the callee-saved registers the allocator used, pre-maps frame slots, drives the backend's minimal-prologue hook when available, routes scalar call results straight to their destination, uses a hardware zero register for stored zeros where the backend advertises one, and legalizes addresses the backend rejects into a reserved scratch register. See ARCH.md for the backend contract.
5. Machine register-constraint model
Some target instructions pin operands or results to specific physical registers
and clobber others as a side effect of their encoding — hardware constraints,
not allocator choices (x86-64 idiv/div pinning the dividend to rax and
clobbering rdx, variable shifts requiring the count in cl, one-operand
mul, cmpxchg, and the va_arg offset scratch). aarch64 and riscv64 have
no such instructions — their div/shift/mul are ordinary three-operand forms —
so on those targets the constraint hooks are inert.
The optimizer models all fixed-register requirements through two allocator
primitives, and the allocator (pass_lower.c) speaks only in physical register
numbers here:
- Tied hard register (OptPRegInfo.tied_hard_reg): pin a value to a specific physical register. Set for inline-asm operands with a {reg} constraint and for fixed-input/fixed-output machine operands.
- Forbidden / clobbered hard registers (OptPRegInfo.forbidden_hard_regs / clobbered_hard_regs): for each register an instruction clobbers, every value live across the instruction (live-after, not a use or def of it) is forbidden from that register. The clobbered subset is recorded separately so the soft return-register placement hint cannot later clear a forbid that came from a real hardware clobber.
Three sources feed these, all unified at allocation time:
- Calls: the call plan's clobber_mask (caller-saved by default, or the call-specific mask) drives the live-across-forbid loop; argument/result registers come from the plan.
- Inline asm: pass_machinize.c resolves named-register constraint strings and clobber lists into masks and fixed-register indices on the IRAsmAux; pass_lower.c's apply_asm_register_constraints ties the fixed operands and runs the live-across-forbid loop.
- Generic machine instructions: a binop or convert has no aux to hang constraints on, so machinization queries the target's machine-clobber hook per instruction and stores the result in a per-function side table keyed by InstId (Func.inst_clobbers, built in machinize_inst_clobbers). At allocation, apply_machine_reg_clobbers looks up the instruction's clobber mask and applies the same live-across-forbid loop. A NULL hook (aa64/rv64) means no entries and zero behavior change.
This is the single place where target ISA register rules enter the allocator, and all three sources reuse one mechanism — tie + forbid — rather than patching assignments after the fact. A value that merely dies at the instruction needs no constraint (the backend stages it into/out of the fixed register itself); only values that survive past the instruction are forbidden.
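The live-across-forbid loop shared by all three sources might look roughly like this. It is a sketch under assumptions: DemoPRegInfo and demo_forbid_live_across are hypothetical, liveness is a single 64-bit mask of PRegs, and hard registers are a 32-bit mask; the real code works on the allocator's own range and bitset structures.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-PReg constraint record, one bit per hard register. */
typedef struct {
    uint32_t forbidden_hard_regs;
    uint32_t clobbered_hard_regs;
} DemoPRegInfo;

/* For one instruction:
 *   live_after     PRegs live immediately after the instruction
 *   inst_uses_defs PRegs the instruction itself reads or writes
 *   clobber_mask   hard registers the instruction destroys
 * Only values that survive past the instruction without being one of
 * its operands are kept out of the clobbered registers. */
static void demo_forbid_live_across(DemoPRegInfo *info, unsigned npregs,
                                    uint64_t live_after,
                                    uint64_t inst_uses_defs,
                                    uint32_t clobber_mask)
{
    uint64_t across = live_after & ~inst_uses_defs;
    for (unsigned p = 0; p < npregs && p < 64; p++) {
        if (across & (UINT64_C(1) << p)) {
            info[p].forbidden_hard_regs |= clobber_mask;
            /* Remembered separately so a later soft hint cannot undo it. */
            info[p].clobbered_hard_regs |= clobber_mask;
        }
    }
}
```

A call, an inline-asm statement, and a fixed-register machine instruction would differ only in where clobber_mask comes from.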
6. Allocation, MIR, and the physical boundary
Allocation does not rewrite HIR. opt_regalloc_locations consumes block
liveness and the compressed live ranges and writes one canonical location per
PReg into Func.preg_locs (OptLoc: hard register or spill slot). HIR operands
stay virtual after allocation — the verifier checks this.
opt_lower_to_mir then builds the physical IR Func.mir (an MFunc): each
virtual OPK_REG is translated through its OptLoc into a physical register or
a frame access, spilled values get a reload before each use and a store after
each def, and call plans are lowered into physical argument/return moves. From
this point the IR is physical and non-SSA (registers may be multiply defined).
All PReg-to-physical knowledge lives in this one step; after it, the HIR is
untouched and the MIR is fully physical. The downstream MIR passes (combine,
DCE, jump/layout cleanup) run over the MIR view and rely on physical-register
liveness for their safety checks, then opt_emit_native replays the MIR.
The reason allocation results are a separate table rather than rewritten operands is the same mode-clarity principle from Section 2: a post-allocation pass can never accidentally treat a physical register as a PReg, replay can never see a stale virtual operand, and the MIR verifier can assert "no PRegs or Vals here" at a single boundary.
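To make the location-table translation concrete, here is a minimal sketch of lowering one virtual add through a per-PReg location table. Everything in it is hypothetical (DemoLoc, DemoMInst, DemoMFunc, demo_lower_add, the two scratch registers passed in by the caller); it only illustrates the reload-before-use and store-after-def pattern, not the actual opt_lower_to_mir.

```c
#include <assert.h>

/* Hypothetical location of one PReg after allocation. */
typedef enum { DEMO_LOC_REG, DEMO_LOC_SPILL } DemoLocKind;
typedef struct { DemoLocKind kind; int index; } DemoLoc; /* hard reg or slot */

/* Hypothetical physical instructions.
 *   RELOAD: dst = hard reg, a = spill slot
 *   ADD:    dst, a, b are hard regs
 *   SPILL:  dst = spill slot, a = hard reg */
typedef enum { DEMO_M_RELOAD, DEMO_M_ADD, DEMO_M_SPILL } DemoMOp;
typedef struct { DemoMOp op; int dst, a, b; } DemoMInst;
typedef struct { DemoMInst insts[16]; int n; } DemoMFunc;

static void demo_emit(DemoMFunc *m, DemoMOp op, int dst, int a, int b)
{
    assert(m->n < 16);
    m->insts[m->n++] = (DemoMInst){ op, dst, a, b };
}

/* Lower the virtual "dst = a + b": the virtual operands are only read
 * through locs[], and only physical registers reach the output. */
static void demo_lower_add(DemoMFunc *m, const DemoLoc *locs,
                           int dst, int a, int b,
                           int scratch0, int scratch1)
{
    int ra = locs[a].index, rb = locs[b].index, rd = locs[dst].index;

    if (locs[a].kind == DEMO_LOC_SPILL) {
        ra = scratch0;
        demo_emit(m, DEMO_M_RELOAD, ra, locs[a].index, -1);
    }
    if (locs[b].kind == DEMO_LOC_SPILL) {
        rb = scratch1;
        demo_emit(m, DEMO_M_RELOAD, rb, locs[b].index, -1);
    }
    if (locs[dst].kind == DEMO_LOC_SPILL)
        rd = scratch0;  /* safe: any reload into scratch0 is consumed by the add */

    demo_emit(m, DEMO_M_ADD, rd, ra, rb);

    if (locs[dst].kind == DEMO_LOC_SPILL)
        demo_emit(m, DEMO_M_SPILL, locs[dst].index, rd, -1);
}
```

Because the input operands are never rewritten, the virtual form stays available unchanged for dumps and verification after the physical form has been built.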
7. Verification and observability
The optimizer is checkpoint-verified in debug builds. opt_verify(Func*, stage)
checks CFG reciprocity, reachable-block shape, emit-order validity, instruction
ids, operand namespaces (no physical registers in HIR; correct PReg-vs-Val
namespace for the current mode), phi consistency, and def-use freshness;
opt_mir_verify checks the physical boundary (no virtual operands, valid frame
slots, fully physical call plans). Each pass tags its checkpoint with the name
of the transformation just completed, so a failure localizes to the nearest
boundary. Func.opt_valid_analyses tracks coarse invalidation; passes that
mutate control flow, operands, or instructions rebuild or invalidate the
relevant analysis.
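One of the listed checks, CFG reciprocity, is simple enough to sketch. DemoCfgBlock and demo_cfg_reciprocal are hypothetical and use fixed-size edge arrays; the real verifier reports the failing stage tag rather than returning a bare boolean.

```c
#include <assert.h>
#include <stdbool.h>

enum { DEMO_MAX_EDGES = 4 };

/* Hypothetical block with explicit edge lists (block indices). */
typedef struct {
    int preds[DEMO_MAX_EDGES]; int npreds;
    int succs[DEMO_MAX_EDGES]; int nsuccs;
} DemoCfgBlock;

static bool demo_contains(const int *xs, int n, int v)
{
    for (int i = 0; i < n; i++)
        if (xs[i] == v)
            return true;
    return false;
}

/* Reciprocity: b lists s as a successor exactly when s lists b as a
 * predecessor, and every edge names a block that exists. */
static bool demo_cfg_reciprocal(const DemoCfgBlock *blocks, int nblocks)
{
    for (int b = 0; b < nblocks; b++) {
        for (int i = 0; i < blocks[b].nsuccs; i++) {
            int s = blocks[b].succs[i];
            if (s < 0 || s >= nblocks)
                return false;
            if (!demo_contains(blocks[s].preds, blocks[s].npreds, b))
                return false;
        }
        for (int i = 0; i < blocks[b].npreds; i++) {
            int p = blocks[b].preds[i];
            if (p < 0 || p >= nblocks)
                return false;
            if (!demo_contains(blocks[p].succs, blocks[p].nsuccs, b))
                return false;
        }
    }
    return true;
}
```

A pass that rewires a branch but forgets to update the target's predecessor list would fail this check at the next checkpoint instead of corrupting a later analysis.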
Observability hooks: KIT_DUMP=<tag> dumps the optimizer IR at a named stage,
KIT_DUMPCG=1 dumps the recorded semantic tape before lowering,
KIT_DUMP_INTERP dumps the interpreter-tap Func, and the optimizer emits
scoped timing/count metrics (visible through kit run --time) for the
frontend, each pass scope, allocation, and emit.