Interpreter
kit's IR interpreter (kit run --no-jit, and the emulator's interpret mode)
executes the compiler's own semantic IR directly, with no native code emission
and no executable memory. It is the third way to run kit IR alongside the
native backend (CODEGEN.md) and the JIT (JIT.md): a
threaded-bytecode engine that lowers a post-optimization function to fixed-width
records and runs them over an explicit, suspendable call stack. The design point
is fidelity to codegen semantics — it interprets the pre-machinize IR, the same
view the native backend lowers from, so an interpreted program and a compiled
program agree by construction. It lives in src/interp/.
Why interpret the semantic IR
The interpreter taps the optimizer pipeline (see OPT.md) at
opt_run_o1_interp (src/opt/opt.c): it runs the target-independent O1 subset
and stops before machinization and register allocation. At this tap point the
function is still a virtual-register machine:
    CG IR --opt_func_from_cg_ir--> Func --[ti passes]--> Func (PReg view)
                                                           |
                                           STOP before machinize/regalloc
                                                           |
                                           interp_lower --> InterpFunc bytecode
Concretely opt_run_o1_interp runs CFG build, jump cleanup, local
simplification, the address/escape transforms (opt_addr_xform_pregs,
opt_promote_scalar_locals, opt_addr_of_global_cse), the loop tree, and
live-block dead-def elimination. It deliberately skips opt_machinize_native,
loop-immediate lowering / const hoisting, register allocation, and all MIR
passes. Those passes either bind to a physical register file the interpreter does
not have, or are pessimizations for a machine that takes immediates directly.
Consequences that shape the rest of the engine:
- Unbounded virtual registers. OPK_REG operands carry PReg ids (1..f->npregs); the register file is a flat slab indexed by PReg id. There is no spilling, no phys-reg pool.
- No SSA / no PHIs. The interp tap leaves opt_reg_ssa == 0; the lowerer asserts no IR_PHI survives.
- Explicit CFG with materialized edges. Control flow is driven entirely by block successors and terminators; every reachable block has explicit succ edges, so the loader can resolve every branch to a bytecode pc.
- Semantic, un-ABI-lowered calls. IR_CALL still carries the high-level CGCallDesc (callee operand, args/ret as ABI values, the ABIFuncInfo). The interpreter applies value semantics for internal calls and only consults the ABI descriptor when it must cross into host code.
Because this is exactly the IR the native backend consumes, the interpreter and the compiled program share the same semantics — the differential test of "run the same IR both ways and compare" is the central correctness lever.
Layering and files
    src/opt/opt.c                  opt_run_o1_interp + interp sink hook (one-way dep)
    src/interp/lower.c             Func -> InterpFunc bytecode (loader / threader)
    src/interp/engine.c            suspendable dispatch loop + handlers + intrinsics + FFI marshal
    src/interp/ffi.c               host-ABI external-call cast-thunk family
    src/interp/interp_program.c    program lifecycle, sym/name tables, memory + symbol resolution
    src/interp/interp.h            internal types (InterpInsn, InterpFunc, frame/stack)
    include/kit/interp.h           public API (program, host vtable, explicit-stack calls)
The dependency is one-way: the optimizer calls interp_capture_func (declared
locally in opt.c, not by including the interp header) when a compiler has an
interp sink attached. Each compiled function is then lowered into the program in
addition to whatever native emission is happening.
Bytecode: InterpFunc
lower.c turns a Func into an InterpFunc: a flat array of fixed-width
InterpInsn records plus side tables. The record is cache-friendly and caches
the hot fields a handler needs — destination PReg, resolved branch pcs, an inline
immediate, operand widths, fp flags, a tail-call flag, and a direct-threading
handler slot — while retaining a pointer to the source Inst so handlers can
read full operand detail (operand kinds, types, MemAccess, aux structs, the
call descriptor) generically rather than re-encoding every field.
Lowering is two passes over f->emit_order:
- Pass A places each reachable block at its starting pc, counting records (one per emitting Inst; NOP / PHI / PARAM_DECL / SCOPE_* and the constant markers emit nothing), operands, and switches. It bump-allocates frame-slot byte offsets honoring alignment (FS_ALLOCA slots are dynamic, sized at run time), and fixes the static frame size.
- Pass B emits each record, mapping IROp -> InterpOp (with aggregate-specialized variants such as IOP_COPY_AGG / IOP_LOAD_AGG), caching widths and sub-op tags (BinOp / CmpOp / ConvKind / AtomicOp), and resolving every branch/switch/indirect/label target from block id to bytecode pc via the block_pc[] table built in Pass A.
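The two-pass shape can be sketched with a toy lowerer. Everything here (ToyBlock, ToyInsn, toy_lower) is hypothetical and far simpler than kit's Func and InterpInsn; it only illustrates why placing every block first lets the second pass resolve forward branches.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy IR: each block holds some straight-line instructions and
 * ends in a jump to another block id (or -1 for "return"). */
typedef struct { int ninsts; int jump_to_block; } ToyBlock;

/* Hypothetical bytecode record: is_jump marks a terminator, target_pc is the
 * resolved bytecode index (or -1 for return). */
typedef struct { int is_jump; int target_pc; } ToyInsn;

enum { TOY_MAX_BLOCKS = 16 };

/* Lowers nblocks blocks into out (capacity cap). Returns the record count,
 * or -1 if the output buffer is too small. */
static int toy_lower(const ToyBlock *blocks, int nblocks, ToyInsn *out, int cap)
{
    int block_pc[TOY_MAX_BLOCKS];
    int pc = 0;

    assert(nblocks <= TOY_MAX_BLOCKS);

    /* Pass A: place every block, counting its records (body + terminator). */
    for (int b = 0; b < nblocks; b++) {
        block_pc[b] = pc;
        pc += blocks[b].ninsts + 1;
    }
    if (pc > cap)
        return -1;

    /* Pass B: emit records. Forward targets resolve because Pass A already
     * placed all blocks. */
    pc = 0;
    for (int b = 0; b < nblocks; b++) {
        for (int i = 0; i < blocks[b].ninsts; i++)
            out[pc++] = (ToyInsn){ 0, -1 };
        int t = blocks[b].jump_to_block;
        assert(t < nblocks);
        out[pc++] = (ToyInsn){ 1, t < 0 ? -1 : block_pc[t] };
    }
    return pc;
}
```

In this sketch block 0 can jump forward to block 2 because block_pc[2] is already known when block 0's terminator is emitted.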
The opcode set is one family per IROp, specialized only where the width or the
scalar/aggregate distinction changes the handler. There is no width-per-opcode
explosion; arithmetic carries its width and fp-ness in record fields and a sub-op
tag, and the handler masks/sign-extends accordingly.
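As an illustration of "one handler, width in a field", the sketch below computes in 64 bits and then masks or sign-extends to the recorded width. The helper names are hypothetical and are not taken from engine.c.

```c
#include <assert.h>
#include <stdint.h>

/* Keep the low `width` bits (width in 1..64). */
static uint64_t mask_to_width(uint64_t v, unsigned width)
{
    return width >= 64 ? v : v & ((UINT64_C(1) << width) - 1);
}

/* Interpret the low `width` bits as a two's-complement signed value. */
static int64_t sign_extend(uint64_t v, unsigned width)
{
    if (width >= 64)
        return (int64_t)v;
    uint64_t sign = UINT64_C(1) << (width - 1);
    v = mask_to_width(v, width);
    return (int64_t)((v ^ sign) - sign);
}

/* One add handler serves every width: add in 64 bits, then mask, which
 * gives wraparound modulo 2^width. */
static uint64_t toy_add(uint64_t a, uint64_t b, unsigned width)
{
    return mask_to_width(a + b, width);
}
```

Under these assumptions an 8-bit add of 0xFF and 1 yields 0, and sign-extending 0xFF at width 8 yields -1, without a separate opcode per width.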
Unsupported ops are not silently dropped. An op the interpreter cannot run
(notably IR_ASM_BLOCK) lowers to IOP_TRAP and flags the whole function
!ok with a reject reason. The engine reports a clean
interp: <reason> not supported diagnostic; it never miscompiles or falls back
to native code. This "diagnose, don't miscompile" rule is the contract for the
no-JIT path.
Static data and jump tables
Function-scope static blobs — ordinary static locals, dense-switch jump tables,
and computed-goto label arrays — are materialized at lower time
(lower_static_blobs) into an interp-private, program-lifetime buffer, and the
blob's symbol is bound to that buffer. WRITE markers contribute literal bytes;
LABEL_ADDR markers contribute the target block's bytecode pc, not a native
code address. This is essential: the interpreter addresses code by InterpInsn
index, so a jump table that the program later walks with IR_LOAD +
IR_INDIRECT_BRANCH must hold interp pcs. The stream marker ops themselves lower
to IOP_NOP; they are fully consumed by the materialization pass. This is what
lets the dense -O1 switch lowering and labels-as-values work under the
interpreter.
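A minimal sketch of why a materialized jump table must hold bytecode pcs follows. The marker struct, the fixed 8-byte entry size, and the function names are assumptions made for illustration; they are not the layout lower_static_blobs uses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical marker kinds for a static blob stream. */
enum { TOY_WRITE, TOY_LABEL_ADDR };

/* value is a literal for TOY_WRITE, or a target block id for TOY_LABEL_ADDR. */
typedef struct { int kind; uint64_t value; } ToyMarker;

/* Materialize markers into 8-byte entries. A LABEL_ADDR entry stores the
 * block's bytecode pc, never a native code address. Returns bytes written. */
static size_t toy_materialize(const ToyMarker *m, size_t n,
                              const uint32_t *block_pc, unsigned char *blob)
{
    size_t off = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t word = m[i].kind == TOY_LABEL_ADDR ? block_pc[m[i].value]
                                                    : m[i].value;
        memcpy(blob + off, &word, sizeof word);
        off += sizeof word;
    }
    return off;
}

/* Load + indirect branch: read entry `index` and use it as the next pc. */
static uint32_t toy_indirect_target(const unsigned char *blob, size_t index)
{
    uint64_t word;
    memcpy(&word, blob + index * sizeof word, sizeof word);
    return (uint32_t)word;
}
```

Because the table entries are pcs, the program's own load and indirect-branch sequence lands on a valid record index with no translation step.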
Engine: explicit-stack dispatch
The engine (interp_run_stack in engine.c) runs the top frame of an
explicit InterpStack. Execution state lives in data structures, never on the
host C stack: an IR-level call pushes an InterpFrame, a return pops one. The
host C stack stays O(1) regardless of IR call depth — deep IR recursion grows the
stack's frame array, not the host stack.
    InterpStack (a swappable execution context / fiber)
      frames[]       explicit call stack; interp_run_stack runs frames[top]
      regs_arena     bump region: each frame's PReg file (npregs u64s)
      mem_arena      bump region: each frame's addressable bytes (locals, allocas, varargs)
      scalar_ret     return shuttle between frames
      status / trap_reason
The two arenas are fixed, non-relocating reservations. An OP_ADDR_OF
materializes a local's address as an absolute host pointer into mem_arena, and
that pointer can escape into a register or into another local; reallocating
(moving) the arena would dangle it. Frames follow strict stack discipline (CALL
bumps the top, RET rewinds it), so a generous fixed reservation suffices, and
overflow traps cleanly as a stack overflow rather than corrupting memory. Frames
themselves reference the arenas by offset, not pointer, so the frame array
can be realloc'd on growth without invalidating anything.
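The arena discipline can be sketched as a fixed buffer with bump, rewind, and an overflow check. The 256-byte capacity, the 8-byte alignment, and all names below are illustrative assumptions, not kit's actual reservation sizes or types.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed-capacity bump arena; the backing store never moves,
 * so absolute pointers into it stay valid. */
typedef struct {
    unsigned char bytes[256];
    size_t top;                 /* next free offset */
} ToyArena;

/* Hypothetical frame: refers to the arena by offset, not by pointer. */
typedef struct {
    size_t saved_top;           /* arena top before this frame was pushed */
    size_t mem_off;             /* start of this frame's bytes */
} ToyFrame;

/* CALL: bump the arena for a new frame. Returns 0 on success, or -1 to
 * signal a stack-overflow trap without touching memory. */
static int toy_push(ToyArena *a, ToyFrame *f, size_t frame_bytes)
{
    size_t start = (a->top + 7) & ~(size_t)7;   /* 8-byte alignment */
    if (start > sizeof a->bytes || frame_bytes > sizeof a->bytes - start)
        return -1;
    f->saved_top = a->top;
    f->mem_off = start;
    a->top = start + frame_bytes;
    return 0;
}

/* RET: rewind the arena to where it was before the frame was pushed. */
static void toy_pop(ToyArena *a, const ToyFrame *f)
{
    a->top = f->saved_top;
}
```

A pointer taken into the first frame's bytes remains valid across later pushes and pops, which is the property a relocating arena would break.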
Dispatch: direct-threaded with a switch fallback
Where the host compiler supports labels-as-values (GCC, clang, and kit itself),
the engine is direct-threaded: on first entry to a function each opcode's
&&handler is copied into its records, and every handler tail-dispatches with
goto *in->handler, giving the branch predictor a distinct indirect branch per
opcode site. The identical handler bodies compile as a portable switch for any
other compiler, sharing one source through the OP()/NEXT()/GO() macros. The
choice is governed by KIT_INTERP_THREADED (default on, in
include/kit/config.h) AND the compiler's capability; it can be forced with
-DKIT_INTERP_THREADED=0|1. Keeping a switch fallback is what lets any
compiler lacking labels-as-values run the same engine through one portable code
path, with no behavioral difference from the threaded build.
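A toy version of the shared-source dispatch might look like the following. TOY_THREADED, the OP()/NEXT() definitions, and the three-opcode machine are assumptions for illustration (kit's real macros and KIT_INTERP_THREADED handling may differ), and the threaded branch relies on the GCC/clang labels-as-values extension. For simplicity the sketch re-threads on every call rather than once on first entry, and it does no stack bounds checking.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy opcode set and record. */
enum { T_PUSH, T_ADD, T_HALT, T_NOPS };
typedef struct { int op; int64_t imm; void *handler; } ToyInsn;

#ifndef TOY_THREADED
# if defined(__GNUC__)              /* GCC and clang both define __GNUC__ */
#  define TOY_THREADED 1
# else
#  define TOY_THREADED 0
# endif
#endif

static int64_t toy_run(ToyInsn *code, int n)
{
    int64_t stack[16];
    int sp = 0;
    ToyInsn *in = code;

#if TOY_THREADED
    static void *const table[T_NOPS] = { &&op_T_PUSH, &&op_T_ADD, &&op_T_HALT };
    for (int i = 0; i < n; i++)     /* thread: copy label address per record */
        code[i].handler = table[code[i].op];
# define OP(name) op_##name:
# define NEXT()   do { in++; goto *in->handler; } while (0)
    goto *in->handler;
#else
    (void)n;
# define OP(name) case name:
# define NEXT()   do { in++; goto dispatch; } while (0)
dispatch:
    switch (in->op) {
#endif

    /* Handler bodies are written once and compile under either scheme. */
    OP(T_PUSH) stack[sp++] = in->imm; NEXT();
    OP(T_ADD)  sp--; stack[sp - 1] += stack[sp]; NEXT();
    OP(T_HALT) return stack[sp - 1];

#if !TOY_THREADED
    }
    return 0;                       /* not reached for well-formed code */
#endif
}
#undef OP
#undef NEXT
```

Building with -DTOY_THREADED=0 forces the switch path, so the same program can be checked under both dispatch schemes.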
Handler shape and key behaviors
- Arithmetic / compare / convert read operand values, apply the operation by width and fp flag, and write the result. Width masking and sign-extension are explicit. The engine is the reference implementation of the IR's portable edge-case semantics (IR.md "Well-definedness"): integer add/sub/mul wrap modulo the width; shift counts reduce modulo the width; integer divide/rem trap on a zero divisor and wrap INT_MIN / -1 (no UB); float→int conversion saturates (NaN -> 0, out-of-range -> clamped, matching Wasm trunc_sat), avoiding the UB of casting an out-of-range double while staying identical to a plain cast for in-range inputs; the floating relationals are ordered (NaN -> false) while ne is unordered (NaN -> true). These rules are locked to the spec by the parameterized conformance cases in test/interp/interp_smoke_test.c (spec_*), which run each edge with runtime arguments so the optimizer cannot fold the operation away. The engine stores every scalar in a u64, so it carries scalar widths up to 64 bits exactly; 128-bit scalars are memory/aggregate-lowered (or expanded to 64-bit-half / libcall sequences) before reaching a register handler.
- Loads / stores / addressing never raw-dereference. Every memory access goes through interp_translate (below), which is what makes the two memory models swap cleanly. A destination operand may itself be memory, because the optimizer leaves address-taken locals un-promoted, so write_dst handles register and memory destinations.
- Branches retarget ip to a resolved pc and re-dispatch. Because branch handlers skip the straight-line memory-fault recheck, they test the fault latch before consuming a possibly-garbage selector. IOP_SWITCH reads its pre-resolved case/default pcs from a side table; IOP_INDIRECT_BR and IOP_LOAD_LABEL_ADDR traffic in bytecode pcs (see static data above).
- Faults vs. unsupported. A runtime fault (bad memory, divide-by-zero, __builtin_trap, unreachable) sets TRAP; an op/signature the engine can't handle sets ERROR. Both record a borrowed reason string retrievable via kit_interp_stack_trap_reason. Memory faults use a latch rechecked on straight-line ops and at branch selectors so a faulting access stops the loop rather than propagating zero.
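The edge-case rules for arithmetic and conversion can be written down in portable C. This is a sketch at 32-bit width with hypothetical function names; it mirrors the rules as stated in this section rather than the engine's actual handlers.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* Signed divide: a zero divisor traps, INT32_MIN / -1 wraps to INT32_MIN.
 * Returns 0 and stores the quotient, or -1 to signal the trap. */
static int toy_sdiv32(int32_t a, int32_t b, int32_t *out)
{
    if (b == 0)
        return -1;
    if (a == INT32_MIN && b == -1) {
        *out = INT32_MIN;           /* avoid the C-level overflow UB */
        return 0;
    }
    *out = a / b;
    return 0;
}

/* Shift count reduced modulo the width (32 here). */
static uint32_t toy_shl32(uint32_t v, uint32_t count)
{
    return v << (count & 31u);
}

/* Saturating double -> int32: NaN -> 0, out-of-range clamps, and in-range
 * inputs behave exactly like a plain cast (truncation toward zero). */
static int32_t toy_f64_to_i32_sat(double d)
{
    if (isnan(d))
        return 0;
    if (d <= (double)INT32_MIN)
        return INT32_MIN;
    if (d >= (double)INT32_MAX)
        return INT32_MAX;
    return (int32_t)d;
}

/* Ordered relational: any NaN operand gives false. */
static int toy_flt(double a, double b) { return a < b; }

/* Unordered not-equal: any NaN operand gives true. */
static int toy_fne(double a, double b) { return a != b; }
```

C's own `<` and `!=` on doubles already have the ordered and unordered behavior described above, so those two handlers need no special casing.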
Calls and O(1) tail calls
IOP_CALL resolves its callee from the call descriptor. A GLOBAL callee that
names a TU-internal function (or a function-pointer in a register whose host
address reverse-maps to one) is interpreted, not run as native — even through
a function pointer — so the no-JIT contract holds: the interpreter never executes
JITed code. Only a genuinely external target reaches the FFI path.
- Internal call: push a frame (bump regs + mem), bind arguments by value into the callee's parameter homes (register params into the register file, aggregate / large params copied into frame slots), lay out any anonymous (variadic) args into a contiguous buffer in the callee frame, record where the caller wants the result (a register, or an sret pointer into the caller's slot), then re-dispatch on the new top frame. No host recursion.
- Return: shuttle the scalar (or copy the aggregate into the sret pointer), rewind the arenas to the frame's bases, pop, and deliver the result into the caller's recorded destination. An empty stack means DONE.
- Tail calls (terminator IR_CALL/CG_CALL_TAIL, or the last emitting inst of a successor-less block) are true O(1): the freshly-built callee frame is relocated down onto the dead caller's register/memory region and the arenas are rewound, so a tail loop runs in constant interp- and host-stack space. This is safe precisely because the callee has not executed yet: no absolute pointers into its own frame exist, and the argument-binding step has already copied every argument value out of the caller. External tail calls similarly forward the call's result as this frame's result without growing the stack.
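The difference between an ordinary internal call and a tail call can be sketched with a toy explicit-stack loop. The frame layout, the 64-frame limit, and the single hard-coded function are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

enum { TOY_MAX_FRAMES = 64 };

typedef struct { uint64_t n, acc; } ToyFrame;
typedef struct {
    ToyFrame frames[TOY_MAX_FRAMES];
    int top;        /* index of the running frame, -1 when empty */
    int max_depth;  /* high-water mark, tracked for illustration */
} ToyStack;

/* Interprets sum(n, acc) = n == 0 ? acc : sum(n - 1, acc + n).
 * Returns 0 on DONE (result stored), -1 on a stack-overflow trap. */
static int toy_sum(ToyStack *s, uint64_t n, int use_tail, uint64_t *result)
{
    s->top = 0;
    s->max_depth = 1;
    s->frames[0] = (ToyFrame){ n, 0 };

    for (;;) {                       /* one host loop, no host recursion */
        ToyFrame *f = &s->frames[s->top];
        if (f->n == 0) {
            /* RET: every caller here just forwards the callee's value, so
             * the scalar shuttles down and the stack empties: DONE. */
            *result = f->acc;
            s->top = -1;
            return 0;
        }
        ToyFrame callee = { f->n - 1, f->acc + f->n };  /* args by value */
        if (use_tail) {
            *f = callee;             /* reuse the dead caller's slot */
        } else {
            if (s->top + 1 == TOY_MAX_FRAMES)
                return -1;           /* frame array exhausted: clean trap */
            s->frames[++s->top] = callee;
            if (s->top + 1 > s->max_depth)
                s->max_depth = s->top + 1;
        }
    }
}
```

With use_tail set, a depth of 100000 runs in a single frame; without it, the same call chain exhausts the toy frame array and traps, while the host C stack stays flat in both cases.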
Variadics, bitfields, atomics, intrinsics, TLS
The interpreter owns both ends of variadics, so its interpreter-private
va_list is self-consistent regardless of the target ABI's real layout: the
anonymous args are packed into the callee frame at aligned slots, va_start
seeds a cursor over that buffer, va_arg reads the typed slot and advances,
va_copy duplicates the cursor.
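A cursor-style va_list of this kind can be sketched as follows. The uniform 8-byte slot size, the 64-byte buffer, and every name are assumptions; the real slot alignment rules are in engine.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical packed buffer of anonymous args, one 8-byte slot each. */
typedef struct { unsigned char buf[64]; size_t len; } ToyVarargs;

/* Hypothetical interpreter-private va_list: a cursor over that buffer. */
typedef struct { const unsigned char *cur, *end; } ToyVaList;

/* Caller side: copy one argument (up to 8 bytes) into the next slot. */
static int toy_pack(ToyVarargs *v, const void *val, size_t size)
{
    if (size > 8 || v->len + 8 > sizeof v->buf)
        return -1;
    memset(v->buf + v->len, 0, 8);
    memcpy(v->buf + v->len, val, size);
    v->len += 8;
    return 0;
}

/* va_start: seed the cursor at the first slot. */
static void toy_va_start(ToyVaList *ap, const ToyVarargs *v)
{
    ap->cur = v->buf;
    ap->end = v->buf + v->len;
}

/* va_arg: read the typed slot and advance. Returns -1 past the end. */
static int toy_va_arg(ToyVaList *ap, void *out, size_t size)
{
    if (size > 8 || (size_t)(ap->end - ap->cur) < 8)
        return -1;
    memcpy(out, ap->cur, size);
    ap->cur += 8;
    return 0;
}

/* va_copy: duplicate the cursor; both views share the same buffer. */
static void toy_va_copy(ToyVaList *dst, const ToyVaList *src)
{
    *dst = *src;
}
```

Since the same code writes and reads the slots, the layout only has to agree with itself, not with any target's real va_list.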
Bitfields are interpreted by shift/mask extract (sign-extending signed fields) and read-modify-write insert over the storage unit, using the field's layout descriptor. Atomics run on the single-threaded engine: the operation is serialized and the memory order is treated as sequentially consistent; a fence is a no-op. Intrinsics cover mem{cpy,move,set}, popcount/ctz/clz/bswap, checked-overflow builtins (exact-width detection), expect/assume/prefetch, and trap/unreachable.
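The bitfield extract and read-modify-write insert can be sketched over a 32-bit storage unit. The descriptor struct is hypothetical and assumes 1 <= width <= 32 and lsb + width <= 32; kit's real layout descriptor is not shown in this document.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout descriptor: position and size inside a 32-bit unit. */
typedef struct { unsigned lsb; unsigned width; int is_signed; } ToyBitfield;

static uint32_t field_mask(unsigned width)
{
    return width >= 32 ? UINT32_MAX : (UINT32_C(1) << width) - 1;
}

/* Extract: shift down, mask, and sign-extend if the field is signed. */
static int64_t toy_bf_extract(uint32_t unit, ToyBitfield bf)
{
    uint32_t v = (unit >> bf.lsb) & field_mask(bf.width);
    if (bf.is_signed && ((v >> (bf.width - 1)) & 1u))
        return (int64_t)v - ((int64_t)1 << bf.width);
    return (int64_t)v;
}

/* Insert: clear the field's bits in the unit, then OR in the new value.
 * Bits outside the field are preserved. */
static uint32_t toy_bf_insert(uint32_t unit, ToyBitfield bf, uint32_t value)
{
    uint32_t m = field_mask(bf.width) << bf.lsb;
    return (unit & ~m) | ((value << bf.lsb) & m);
}
```

For a signed 3-bit field at bit 4, inserting the bit pattern 101 and extracting it again gives -3, while the neighboring bits of the unit are untouched.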
Thread-local storage routes through the host's resolve_tls hook rather than
treating the symbol as storage, because a thread-local symbol does not denote its
storage on every target (a Mach-O symbol resolves to a TLV descriptor). The hook
returns the calling thread's address of the variable. When no hook is bound it
falls back to plain-global resolution, correct only where the symbol is the
storage; anything it cannot resolve safely returns NULL and is diagnosed.
Pluggable memory and symbols (host vs. emu)
The engine never assumes how an abstract address maps to a host pointer. A
KitInterpHost vtable (resolved through interp_translate /
interp_resolve_sym in interp_program.c) provides translate, resolve_sym,
and resolve_tls; any may be NULL. Two configurations share the engine:
- Host-identity (kit run --no-jit): abstract addresses are real host pointers. interp_translate returns the address unchanged; locals/allocas live in mem_arena and their addresses are real host pointers; globals/externs resolve through the bound resolver.
- Emu/guest: addresses are guest VAs translated through an EmuAddrSpace, bounds- and permission-checked (see EMU.md).
Global symbols a function references are noted at lower time (their names are captured from the obj while it is alive); their host addresses are resolved lazily and cached on first use, after the JIT image has been linked.
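The two memory configurations can be sketched with a tiny vtable. ToyHost and its translate signature are hypothetical (the real KitInterpHost signatures live in include/kit/interp.h and may differ), and treating a NULL translate hook as identity is an assumption based on the host-identity description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host vtable with only the translate hook. */
typedef struct {
    void *(*translate)(void *ctx, uint64_t addr, size_t len);
    void *ctx;
} ToyHost;

/* Every memory access funnels through here; handlers never dereference
 * an abstract address directly. */
static void *toy_translate(const ToyHost *h, uint64_t addr, size_t len)
{
    if (!h || !h->translate)
        return (void *)(uintptr_t)addr;     /* host-identity */
    return h->translate(h->ctx, addr, len);
}

/* Hypothetical guest window: guest VAs [guest_lo, guest_lo + size). */
typedef struct { unsigned char *base; uint64_t guest_lo, size; } ToyGuestMem;

/* Guest hook: bounds-check, then map the guest VA into host memory.
 * Returns NULL for an out-of-range access so the caller can fault. */
static void *toy_guest_translate(void *ctx, uint64_t addr, size_t len)
{
    ToyGuestMem *g = ctx;
    if (addr < g->guest_lo)
        return NULL;
    uint64_t off = addr - g->guest_lo;
    if (off > g->size || len > g->size - off)
        return NULL;
    return g->base + off;
}
```

Swapping memory models is then a matter of binding a different hook; the handler code that calls toy_translate stays the same.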
Integration
kit run --no-jit (driver/cmd/run.c) forces at least -O1, attaches an
InterpProgram so each function is captured to bytecode while the normal
object/JIT-link still runs (the link lays out data globals and resolves externs /
function pointers), then executes the entry only through the interpreter —
there is no JIT execution fallback; a non-interpretable entry is a hard error.
Globals/externs resolve by walking the JIT image's symbol table (tolerating a
target's leading-underscore C mangling), then host dlsym; thread-locals
additionally route through kit_jit_tlv_resolve, which unwraps kit's own
Mach-O TLV descriptor (verifying it before any indirect call so a foreign/dyld
descriptor never becomes a wild call). Wasm entries get their instance and linear
memory set up and run __kit_wasm_init plus the entry through the interpreter.
Emu interpret mode (KIT_EMU_MODE_INTERP, src/emu/emu.c) runs each lifted
guest block through the interpreter instead of JITing it, also forcing -O1. The
key simplification, verified against the rv64 lifter, is that the interp frame
stays host-identity: the lifter lowers guest loads/stores to FFI calls into
bounds-checked __emu_* host helpers, so there is no guest-VA translate hook and
no guest-stack frame carving — only resolve_sym is bound. A long-lived stack is
reset and reseeded per block (the additive kit_interp_stack_reset /
kit_interp_call_args_on API). Because capture is append-only,
kit_interp_lookup returns the newest same-named function, which gives
interpret mode the same fresh-code semantics the JIT gets when self-modifying code
invalidates a translation. A block the interpreter cannot run hard-fails with the
reason; there is no silent JIT fallback.
Public API surface
The public header (include/kit/interp.h) exposes the program lifecycle, the
host vtable, and an explicit-stack API designed as a swap-ready substrate for
fibers / virtual threads: create a stack, seed it with an entry frame and
arguments, and run/resume it to a DONE/TRAP/ERROR/BLOCKED status.
Swapping execution contexts is just resuming a different stack. kit_interp_call
is the convenience wrapper that allocates a stack, seeds it, resumes to
completion, and frees it. The external-call path through host code is the one
region that necessarily uses the host C stack for the call's duration, and is
therefore non-suspendable. Exact signatures live in the header.
FFI: external calls (ffi.c)
External (host) calls are marshalled by a hand-rolled cast-thunk family — the
classic libffi-lite trick. The engine classifies the call's ABIFuncInfo into
integer-register and fp-register slots (sret pointer first, byval aggregates
passed by pointer, register-split aggregates chunked), then interp_ffi_invoke
calls the host function pointer through a prototype cast that matches the
classified shape. This is correct on the supported ABIs (SysV x64, AAPCS64, RV64
LP64D) because integer and fp arguments come from independent register sequences,
so a maximal T(u64 x8, fp x8) prototype places the first N integers and M fp
values in the right registers regardless of interleaving; unused trailing slots
are ignored. Two fp shapes exist because a 4-byte single and an 8-byte double
occupy the fp register differently; a signature mixing the two is rejected.
Returns mirror this: a value comes back in one or two registers, dispatched
through scalar or struct-returning thunks whose field types steer the return
registers, and the caller scatters each part into the aggregate destination.
Signatures outside this family are diagnosed, not guessed: too many register
args, stack-routed variadics (Apple ARM64 vararg_on_stack), 3+-register struct
returns, 32-bit-fp struct-return fields, and aggregate / oversized scalars in a
variadic-tail position (which have no per-call ABI classification). The thunk
casts deliberately mismatch the real prototype, which trips clang's
-fsanitize=function, so the dispatcher opts out of that one check (clang only;
the kit self-host build never enables it).
What it does not do, by design
Inline asm (IR_ASM_BLOCK) is rejected: it needs machinize's constraint
resolution, which the interp tap skips, and has no portable interpretation. The
FFI signatures listed above are diagnosed rather than marshalled. These are clean
rejections with a reason, never miscompilations.
See also: IR.md, OPT.md, JIT.md, EMU.md, CODEGEN.md.