Interpreter

kit's IR interpreter (kit run --no-jit, and the emulator's interpret mode) executes the compiler's own semantic IR directly, with no native code emission and no executable memory. It is the third way to run kit IR alongside the native backend (CODEGEN.md) and the JIT (JIT.md): a threaded-bytecode engine that lowers a post-optimization function to fixed-width records and runs them over an explicit, suspendable call stack. The design point is fidelity to codegen semantics — it interprets the pre-machinize IR, the same view the native backend lowers from, so an interpreted program and a compiled program agree by construction. It lives in src/interp/.

Why interpret the semantic IR

The interpreter taps the optimizer pipeline (see OPT.md) at opt_run_o1_interp (src/opt/opt.c): it runs the target-independent O1 subset and stops before machinization and register allocation. At this tap point the function is still a virtual-register machine:

  CG IR  --opt_func_from_cg_ir-->  Func  --[ti passes]-->  Func (PReg view)
                                                              |
                                                  STOP before machinize/regalloc
                                                              |
                                              interp_lower  -->  InterpFunc bytecode

Concretely opt_run_o1_interp runs CFG build, jump cleanup, local simplification, the address/escape transforms (opt_addr_xform_pregs, opt_promote_scalar_locals, opt_addr_of_global_cse), the loop tree, and live-block dead-def elimination. It deliberately skips opt_machinize_native, loop-immediate lowering / const hoisting, register allocation, and all MIR passes. Those passes either bind to a physical register file the interpreter does not have, or are pessimizations for a machine that takes immediates directly.

Consequences that shape the rest of the engine:

Unbounded virtual registers. OPK_REG operands carry PReg ids (1..f->npregs); the register file is a flat slab indexed by PReg id. There is no spilling, no phys-reg pool.
No SSA / no PHIs. The interp tap leaves opt_reg_ssa == 0; the lowerer asserts no IR_PHI survives.
Explicit CFG with materialized edges. Control flow is driven entirely by block successors and terminators; every reachable block has explicit succ edges, so the loader can resolve every branch to a bytecode pc.
Semantic, un-ABI-lowered calls. IR_CALL still carries the high-level CGCallDesc (callee operand, args/ret as ABI values, the ABIFuncInfo). The interpreter applies value semantics for internal calls and only consults the ABI descriptor when it must cross into host code.

Because this is exactly the IR the native backend consumes, the interpreter and the compiled program share the same semantics — the differential test of "run the same IR both ways and compare" is the central correctness lever.

Layering and files

  src/opt/opt.c                opt_run_o1_interp + interp sink hook (one-way dep)
  src/interp/lower.c           Func -> InterpFunc bytecode (loader / threader)
  src/interp/engine.c          suspendable dispatch loop + handlers + intrinsics + FFI marshal
  src/interp/ffi.c             host-ABI external-call cast-thunk family
  src/interp/interp_program.c  program lifecycle, sym/name tables, memory + symbol resolution
  src/interp/interp.h          internal types (InterpInsn, InterpFunc, frame/stack)
  include/kit/interp.h         public API (program, host vtable, explicit-stack calls)

The dependency is one-way: the optimizer calls interp_capture_func (declared locally in opt.c, not by including the interp header) when a compiler has an interp sink attached. Each compiled function is then lowered into the program in addition to whatever native emission is happening.

Bytecode: `InterpFunc`

lower.c turns a Func into an InterpFunc: a flat array of fixed-width InterpInsn records plus side tables. The fixed-width record is cache-friendly and keeps the hot fields a handler needs inline (destination PReg, resolved branch pcs, an inline immediate, operand widths, fp flags, a tail-call flag, and a direct-threading handler slot). It also retains a pointer to the source Inst, so handlers can read full operand detail (operand kinds, types, MemAccess, aux structs, the call descriptor) generically rather than re-encoding every field.

Lowering is two passes over f->emit_order:

Pass A places each reachable block at its starting pc, counting records (one per emitting Inst; NOP/PHI/PARAM_DECL/SCOPE_* and the constant markers emit nothing), operands, and switches. It bump-allocates frame-slot byte offsets honoring alignment (FS_ALLOCA slots are dynamic, sized at run time), and fixes the static frame size.
Pass B emits each record, mapping IROp -> InterpOp (with aggregate-specialized variants such as IOP_COPY_AGG/IOP_LOAD_AGG), caching widths and sub-op tags (BinOp/CmpOp/ConvKind/AtomicOp), and resolving every branch/switch/indirect/label target from block id to bytecode pc via the block_pc[] table built in Pass A.
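
The two passes can be sketched on a toy IR. This is an illustration under stated assumptions, not kit's lowerer: MiniBlock, MiniRec, mini_place and mini_emit are hypothetical names, every block is assumed to end in a single unconditional branch, and frame-slot layout and switch tables are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature IR: each block has some straight-line
 * instructions and ends in one branch to another block id.
 * These are NOT kit's Func / InterpInsn types. */
typedef struct { int ninsts; int branch_to; int reachable; } MiniBlock;
typedef struct { int is_branch; int target_pc; } MiniRec;

/* Pass A: place each reachable block at its starting pc and count
 * records. Unreachable blocks get pc -1 and emit nothing. */
static int mini_place(const MiniBlock *blocks, int nblocks, int *block_pc) {
    int pc = 0;
    for (int b = 0; b < nblocks; b++) {
        if (!blocks[b].reachable) { block_pc[b] = -1; continue; }
        block_pc[b] = pc;
        pc += blocks[b].ninsts + 1;      /* body records + terminator */
    }
    return pc;                           /* total record count */
}

/* Pass B: emit records, resolving each branch target from block id
 * to bytecode pc through the table built in Pass A. */
static void mini_emit(const MiniBlock *blocks, int nblocks,
                      const int *block_pc, MiniRec *out) {
    int pc = 0;
    for (int b = 0; b < nblocks; b++) {
        if (!blocks[b].reachable) continue;
        assert(pc == block_pc[b]);       /* both passes agree on layout */
        for (int i = 0; i < blocks[b].ninsts; i++)
            out[pc++] = (MiniRec){ 0, -1 };
        out[pc++] = (MiniRec){ 1, block_pc[blocks[b].branch_to] };
    }
}
```

Splitting placement from emission is what allows forward branches to resolve: by the time Pass B reaches a branch, every block's pc is already known.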

The opcode set is one family per IROp, specialized only where the width or the scalar/aggregate distinction changes the handler. There is no width-per-opcode explosion; arithmetic carries its width and fp-ness in record fields and a sub-op tag, and the handler masks/sign-extends accordingly.

Unsupported ops are not silently dropped. An op the interpreter cannot run (notably IR_ASM_BLOCK) lowers to IOP_TRAP and flags the whole function !ok with a reject reason. The engine reports a clean interp: <reason> not supported diagnostic; it never miscompiles or falls back to native code. This "diagnose, don't miscompile" rule is the contract for the no-JIT path.

Static data and jump tables

Function-scope static blobs — ordinary static locals, dense-switch jump tables, and computed-goto label arrays — are materialized at lower time (lower_static_blobs) into an interp-private, program-lifetime buffer, and the blob's symbol is bound to that buffer. WRITE markers contribute literal bytes; LABEL_ADDR markers contribute the target block's bytecode pc, not a native code address. This is essential: the interpreter addresses code by InterpInsn index, so a jump table that the program later walks with IR_LOAD + IR_INDIRECT_BRANCH must hold interp pcs. The stream marker ops themselves lower to IOP_NOP; they are fully consumed by the materialization pass. This is what lets the dense -O1 switch lowering and labels-as-values work under the interpreter.
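
A minimal sketch of the materialization idea, assuming 8-byte table entries and a block_pc[] table like the one Pass A builds. Marker, materialize_blob and table_target_pc are hypothetical names chosen for illustration; the real marker encoding is not shown in this document.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical marker stream for one static blob. */
typedef enum { MARK_WRITE, MARK_LABEL_ADDR } MarkKind;
typedef struct {
    MarkKind kind;
    size_t   offset;   /* byte offset inside the blob */
    uint64_t value;    /* literal for WRITE, block id for LABEL_ADDR */
} Marker;

/* Fill the blob: WRITE contributes literal bytes, LABEL_ADDR
 * contributes the target block's bytecode pc (not a code address). */
static void materialize_blob(uint8_t *blob, size_t blob_len,
                             const Marker *m, size_t n,
                             const uint32_t *block_pc) {
    for (size_t i = 0; i < n; i++) {
        uint64_t v = (m[i].kind == MARK_LABEL_ADDR)
                   ? (uint64_t)block_pc[m[i].value]
                   : m[i].value;
        assert(m[i].offset + sizeof v <= blob_len);
        memcpy(blob + m[i].offset, &v, sizeof v);
    }
}

/* What a later load + indirect branch over the table amounts to:
 * read an entry and use it directly as the next pc. */
static uint32_t table_target_pc(const uint8_t *blob, size_t index) {
    uint64_t v;
    memcpy(&v, blob + index * sizeof v, sizeof v);
    return (uint32_t)v;
}
```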

Engine: explicit-stack dispatch

The engine (interp_run_stack in engine.c) runs the top frame of an explicit InterpStack. Execution state lives in data structures, never on the host C stack: an IR-level call pushes an InterpFrame, a return pops one. The host C stack stays O(1) regardless of IR call depth — deep IR recursion grows the stack's frame array, not the host stack.

  InterpStack (a swappable execution context / fiber)
    frames[]        explicit call stack; interp_run_stack runs frames[top]
    regs_arena      bump region: each frame's PReg file (npregs u64s)
    mem_arena       bump region: each frame's addressable bytes (locals, allocas, varargs)
    scalar_ret      return shuttle between frames
    status / trap_reason

The two arenas are fixed, non-relocating reservations. An OP_ADDR_OF materializes a local's address as an absolute host pointer into mem_arena, and that pointer can escape into a register or into another local; reallocating (moving) the arena would dangle it. Frames follow strict stack discipline (CALL bumps the top, RET rewinds it), so a generous fixed reservation suffices, and overflow traps cleanly as a stack overflow rather than corrupting memory. Frames themselves reference the arenas by offset, not pointer, so the frame array can be realloc'd on growth without invalidating anything.
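
The arena discipline can be sketched as follows. This is a simplified model, not the engine's code: Arena, Frame, Stack and the stack_* functions are hypothetical, malloc stands in for whatever reservation mechanism the engine uses, and slot alignment is ignored.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint8_t *base; size_t used, cap; } Arena;  /* never moved */
typedef struct { size_t regs_off, mem_off; } Frame;         /* offsets only */
typedef struct {
    Arena  regs, mem;
    Frame *frames; size_t nframes, frames_cap;
    int    overflowed;
} Stack;

static int stack_init(Stack *s, size_t regs_cap, size_t mem_cap) {
    *s = (Stack){0};
    s->regs = (Arena){ malloc(regs_cap), 0, regs_cap };
    s->mem  = (Arena){ malloc(mem_cap),  0, mem_cap };
    return s->regs.base != NULL && s->mem.base != NULL;
}

static void stack_free(Stack *s) {
    free(s->regs.base); free(s->mem.base); free(s->frames);
}

/* CALL: bump both arenas. On overflow, flag it and fail cleanly. */
static int stack_push(Stack *s, size_t npregs, size_t mem_bytes) {
    size_t regs_bytes = npregs * sizeof(uint64_t);
    if (regs_bytes > s->regs.cap - s->regs.used ||
        mem_bytes  > s->mem.cap  - s->mem.used) {
        s->overflowed = 1;
        return 0;
    }
    if (s->nframes == s->frames_cap) {       /* frame array may move */
        size_t cap = s->frames_cap ? s->frames_cap * 2 : 4;
        Frame *f = realloc(s->frames, cap * sizeof *f);
        if (!f) return 0;
        s->frames = f; s->frames_cap = cap;
    }
    s->frames[s->nframes++] = (Frame){ s->regs.used, s->mem.used };
    s->regs.used += regs_bytes;
    s->mem.used  += mem_bytes;
    return 1;
}

/* RET: rewind both arenas to the popped frame's bases. */
static void stack_pop(Stack *s) {
    assert(s->nframes > 0);
    Frame f = s->frames[--s->nframes];
    s->regs.used = f.regs_off;
    s->mem.used  = f.mem_off;
}

/* Absolute address of a frame's memory; stable across frame growth. */
static uint8_t *frame_mem(const Stack *s, size_t i) {
    return s->mem.base + s->frames[i].mem_off;
}
```

Because frames store offsets, growing frames[] with realloc leaves every pointer previously handed out by frame_mem valid.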

Dispatch: direct-threaded with a switch fallback

Where the host compiler supports labels-as-values (GCC, clang, and kit itself), the engine is direct-threaded: on first entry to a function each opcode's &&handler is copied into its records, and every handler tail-dispatches with goto *in->handler, so each handler ends in its own indirect branch that the branch predictor can track separately instead of one shared dispatch branch. The identical handler bodies compile as a portable switch for any other compiler, sharing one source through the OP()/NEXT()/GO() macros. Threading is used only when KIT_INTERP_THREADED is on (the default, in include/kit/config.h) and the compiler supports labels-as-values; the setting can be forced with -DKIT_INTERP_THREADED=0|1. Because the switch fallback runs the same handler bodies, it behaves identically to the threaded build.
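
The shared-source trick can be sketched on a three-opcode stack machine. Everything here is hypothetical (MINI_THREADED, MiniInsn, mini_run, and the simplified OP/NEXT macros); it only illustrates how one set of handler bodies can compile either as threaded code or as a switch, and it omits operand-stack bounds checks.

```c
#include <assert.h>
#include <stdint.h>

#if !defined(MINI_THREADED)
#  if defined(__GNUC__)          /* GCC and clang both define __GNUC__ */
#    define MINI_THREADED 1
#  else
#    define MINI_THREADED 0
#  endif
#endif

typedef enum { M_PUSH, M_ADD, M_HALT, M_NOPS } MiniOp;
typedef struct { MiniOp op; int64_t imm; void *handler; } MiniInsn;

static int64_t mini_run(MiniInsn *code, int n) {
    int64_t stack[16];
    int sp = 0;
    MiniInsn *in = code;
    assert(n > 0 && n <= 16 && code[n - 1].op == M_HALT);
#if MINI_THREADED
    /* Copy each opcode's handler address into its record once. */
    static void *const labels[M_NOPS] = { &&h_M_PUSH, &&h_M_ADD, &&h_M_HALT };
    for (int i = 0; i < n; i++) code[i].handler = labels[code[i].op];
#   define OP(name) h_##name:
#   define NEXT()   do { in++; goto *in->handler; } while (0)
    goto *in->handler;
#else
#   define OP(name) case name:
#   define NEXT()   do { in++; goto dispatch; } while (0)
dispatch:
    switch (in->op) {
#endif
    /* Handler bodies: identical source in both builds. */
    OP(M_PUSH) stack[sp++] = in->imm;             NEXT();
    OP(M_ADD)  sp--; stack[sp - 1] += stack[sp];  NEXT();
    OP(M_HALT) return stack[sp - 1];
#if !MINI_THREADED
    default: break;
    }
#endif
    return 0;   /* not reached for well-formed programs */
}
```

Compiling with -DMINI_THREADED=0 selects the switch path on any compiler, which makes it easy to check that both builds return the same result.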

Handler shape and key behaviors

Arithmetic / compare / convert read operand values, apply the operation by width and fp flag, and write the result. Width masking and sign-extension are explicit. The engine is the reference implementation of the IR's portable edge-case semantics (IR.md "Well-definedness"): integer add/sub/mul wrap modulo the width; shift counts reduce modulo the width; integer divide/rem trap on a zero divisor and wrap INT_MIN / -1 (no UB); float→int conversion saturates (NaN -> 0, out-of-range -> clamped, matching Wasm trunc_sat), avoiding the UB of casting an out-of-range double while staying identical to a plain cast for in-range inputs; the floating relationals are ordered (NaN -> false) while ne is unordered (NaN -> true). These rules are locked to the spec by the parameterized conformance cases in test/interp/interp_smoke_test.c (spec_*), which run each edge with runtime arguments so the optimizer cannot fold the operation away. The engine stores every scalar in a u64, so it carries scalar widths up to 64 bits exactly; 128-bit scalars are memory/aggregate-lowered (or expanded to 64-bit-half / libcall sequences) before reaching a register handler.
Loads / stores / addressing never raw-dereference. Every memory access goes through interp_translate (below), which is what makes the two memory models swap cleanly. A destination operand may itself be memory — the optimizer leaves address-taken locals un-promoted — so write_dst handles register and memory destinations.
Branches retarget ip to a resolved pc and re-dispatch. Because branch handlers skip the straight-line memory-fault recheck, they test the fault latch before consuming a possibly-garbage selector. IOP_SWITCH reads its pre-resolved case/default pcs from a side table; IOP_INDIRECT_BR and IOP_LOAD_LABEL_ADDR traffic in bytecode pcs (see static data above).
Faults vs. unsupported. A runtime fault (bad memory, divide-by-zero, __builtin_trap, unreachable) sets TRAP; an op/signature the engine can't handle sets ERROR. Both record a borrowed reason string retrievable via kit_interp_stack_trap_reason. Memory faults use a latch rechecked on straight-line ops and at branch selectors so a faulting access stops the loop rather than propagating zero.
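
The edge-case rules listed above can be written out as small helpers. This is a sketch of the stated semantics, not the engine's handlers: all function names are hypothetical, values are held in a uint64_t as the text describes, and only the 32-bit saturating conversion is shown.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

static uint64_t mask_w(unsigned width) {
    assert(width >= 1 && width <= 64);
    return width == 64 ? ~UINT64_C(0) : (UINT64_C(1) << width) - 1;
}

/* Sign-extend the low `width` bits to 64 bits. */
static int64_t sext_w(uint64_t v, unsigned width) {
    uint64_t sign = UINT64_C(1) << (width - 1);
    v &= mask_w(width);
    return (int64_t)((v ^ sign) - sign);
}

/* add wraps modulo the width. */
static uint64_t add_w(uint64_t a, uint64_t b, unsigned width) {
    return (a + b) & mask_w(width);
}

/* Shift counts reduce modulo the width. */
static uint64_t shl_w(uint64_t a, uint64_t count, unsigned width) {
    return (a << (count % width)) & mask_w(width);
}

/* Signed divide: zero divisor traps, INT_MIN / -1 wraps (no UB). */
static uint64_t sdiv_w(uint64_t a, uint64_t b, unsigned width, int *trap) {
    int64_t x = sext_w(a, width), y = sext_w(b, width);
    if (y == 0) { *trap = 1; return 0; }
    if (y == -1) return (UINT64_C(0) - (uint64_t)x) & mask_w(width);
    return (uint64_t)(x / y) & mask_w(width);
}

/* Saturating float-to-int: NaN -> 0, out-of-range -> clamped. */
static int32_t f64_to_i32_sat(double d) {
    if (isnan(d)) return 0;
    if (d <= (double)INT32_MIN) return INT32_MIN;
    if (d >= (double)INT32_MAX) return INT32_MAX;
    return (int32_t)d;               /* in range: a plain cast */
}

/* Ordered relational (NaN -> false) vs. unordered ne (NaN -> true). */
static int flt_lt(double a, double b) { return a < b; }
static int flt_ne(double a, double b) { return a != b; }
```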

Calls and O(1) tail calls

IOP_CALL resolves its callee from the call descriptor. A GLOBAL callee that names a TU-internal function (or a function-pointer in a register whose host address reverse-maps to one) is interpreted, not run as native — even through a function pointer — so the no-JIT contract holds: the interpreter never executes JITed code. Only a genuinely external target reaches the FFI path.

Internal call: push a frame (bump regs + mem), bind arguments by value into the callee's parameter homes (register params into the register file, aggregate / large params copied into frame slots), lay out any anonymous (variadic) args into a contiguous buffer in the callee frame, record where the caller wants the result (a register, or an sret pointer into the caller's slot), then re-dispatch on the new top frame. No host recursion.
Return: shuttle the scalar (or copy the aggregate into the sret pointer), rewind the arenas to the frame's bases, pop, and deliver the result into the caller's recorded destination. An empty stack means DONE.
Tail calls (terminator IR_CALL / CG_CALL_TAIL, or the last emitting inst of a successor-less block) are true O(1): the freshly-built callee frame is relocated down onto the dead caller's register/memory region and the arenas are rewound, so a tail loop runs in constant interp- and host-stack space. This is safe precisely because the callee has not executed yet — no absolute pointers into its own frame exist, and the argument-binding step has already copied every argument value out of the caller. External tail calls similarly forward the call's result as this frame's result without growing the stack.
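
The tail-call relocation can be sketched with a register-only model. This is an illustration under simplifying assumptions: TStack, push_frame, tail_call and run_sum are hypothetical, frames carry no addressable memory, and the "program" is a hard-coded accumulating loop that tail-calls itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { REGS_CAP = 64, FRAMES_CAP = 8 };
typedef struct { size_t regs_off, nregs; } TFrame;
typedef struct {
    uint64_t regs[REGS_CAP];   size_t regs_used;
    TFrame   frames[FRAMES_CAP]; size_t nframes;
} TStack;

static uint64_t *frame_regs(TStack *s, size_t i) {
    return s->regs + s->frames[i].regs_off;
}

/* Ordinary call: bump a fresh register file above the caller's. */
static void push_frame(TStack *s, size_t nregs) {
    assert(s->nframes < FRAMES_CAP && nregs <= REGS_CAP - s->regs_used);
    s->frames[s->nframes++] = (TFrame){ s->regs_used, nregs };
    s->regs_used += nregs;
}

/* Tail call: build the callee frame, bind arguments by value, then
 * slide it down onto the dead caller and rewind the arena. */
static void tail_call(TStack *s, size_t nregs,
                      const size_t *arg_regs, size_t argc) {
    assert(s->nframes > 0 && argc <= nregs);
    size_t caller = s->nframes - 1;
    push_frame(s, nregs);
    uint64_t *src = frame_regs(s, caller);
    uint64_t *dst = frame_regs(s, caller + 1);
    memset(dst, 0, nregs * sizeof *dst);
    for (size_t i = 0; i < argc; i++) dst[i] = src[arg_regs[i]];
    memmove(src, dst, nregs * sizeof *dst);  /* callee has not run yet */
    s->frames[caller].nregs = nregs;
    s->nframes = caller + 1;
    s->regs_used = s->frames[caller].regs_off + nregs;
}

/* sum(n, acc): n == 0 ? acc : tail sum(n - 1, acc + n).
 * max_depth records the frame count seen at each dispatch. */
static uint64_t run_sum(TStack *s, uint64_t n, size_t *max_depth) {
    push_frame(s, 4);
    uint64_t *r = frame_regs(s, s->nframes - 1);
    r[0] = n; r[1] = 0;
    *max_depth = 0;
    for (;;) {
        size_t top = s->nframes - 1;
        r = frame_regs(s, top);
        if (s->nframes > *max_depth) *max_depth = s->nframes;
        if (r[0] == 0) {                     /* return: pop and rewind */
            uint64_t ret = r[1];
            s->regs_used = s->frames[top].regs_off;
            s->nframes = top;
            return ret;
        }
        r[2] = r[0] - 1;
        r[3] = r[1] + r[0];
        tail_call(s, 4, (const size_t[]){ 2, 3 }, 2);
    }
}
```

The relocation is only valid because it happens before the callee executes: at that point nothing can hold a pointer into the callee's frame, and the arguments have already been copied out of the caller.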

Variadics, bitfields, atomics, intrinsics, TLS

The interpreter owns both ends of variadics, so its interpreter-private va_list is self-consistent regardless of the target ABI's real layout: the anonymous args are packed into the callee frame at aligned slots, va_start seeds a cursor over that buffer, va_arg reads the typed slot and advances, va_copy duplicates the cursor.
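
A cursor-style va_list of this kind might look like the sketch below. The names (MiniVaList, va_pack, mini_va_*) and the explicit size/alignment parameters are assumptions made for illustration; the engine's actual private layout is not specified here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical interpreter-private va_list: a cursor over a buffer. */
typedef struct { const uint8_t *buf; size_t pos, len; } MiniVaList;

/* Round x up to a power-of-two alignment a. */
static size_t align_up(size_t x, size_t a) {
    return (x + a - 1) & ~(a - 1);
}

/* Caller side: pack one anonymous argument at an aligned slot and
 * return the new end offset of the buffer. */
static size_t va_pack(uint8_t *buf, size_t end,
                      const void *val, size_t size, size_t align) {
    size_t at = align_up(end, align);
    memcpy(buf + at, val, size);
    return at + size;
}

/* va_start: seed a cursor over the packed buffer. */
static void mini_va_start(MiniVaList *ap, const uint8_t *buf, size_t len) {
    *ap = (MiniVaList){ buf, 0, len };
}

/* va_arg: read the typed slot and advance the cursor. */
static void mini_va_arg(MiniVaList *ap, void *out, size_t size, size_t align) {
    size_t at = align_up(ap->pos, align);
    assert(at + size <= ap->len);
    memcpy(out, ap->buf + at, size);
    ap->pos = at + size;
}

/* va_copy: duplicate the cursor; both then advance independently. */
static void mini_va_copy(MiniVaList *dst, const MiniVaList *src) {
    *dst = *src;
}
```

Since the same code packs and unpacks, the slot rules only need to agree with themselves, not with any target's real va_list layout.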

Bitfields are interpreted by shift/mask extract (sign-extending signed fields) and read-modify-write insert over the storage unit, using the field's layout descriptor. Atomics run on the single-threaded engine: the operation is serialized and the memory order is treated as sequentially consistent; a fence is a no-op. Intrinsics cover mem{cpy,move,set}, popcount/ctz/clz/bswap, checked-overflow builtins (exact-width detection), expect/assume/prefetch, and trap/unreachable.
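
The bitfield extract and insert steps can be sketched as below, assuming the storage unit has already been loaded into a 64-bit value. BitField is a hypothetical stand-in for the field's layout descriptor.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout descriptor for one bitfield. */
typedef struct { unsigned bit_off, bit_width; int is_signed; } BitField;

static uint64_t bf_mask(unsigned w) {
    return w == 64 ? ~UINT64_C(0) : (UINT64_C(1) << w) - 1;
}

/* Extract: shift down, mask, and sign-extend signed fields. */
static int64_t bf_extract(uint64_t unit, BitField f) {
    assert(f.bit_width >= 1 && f.bit_off + f.bit_width <= 64);
    uint64_t v = (unit >> f.bit_off) & bf_mask(f.bit_width);
    if (f.is_signed) {
        uint64_t sign = UINT64_C(1) << (f.bit_width - 1);
        v = (v ^ sign) - sign;
    }
    return (int64_t)v;
}

/* Insert: read-modify-write that leaves neighbouring bits intact. */
static uint64_t bf_insert(uint64_t unit, BitField f, uint64_t value) {
    assert(f.bit_width >= 1 && f.bit_off + f.bit_width <= 64);
    uint64_t m = bf_mask(f.bit_width) << f.bit_off;
    return (unit & ~m) | ((value << f.bit_off) & m);
}
```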

Thread-local storage routes through the host's resolve_tls hook rather than treating the symbol as storage, because a thread-local symbol does not denote its storage on every target (a Mach-O symbol resolves to a TLV descriptor). The hook returns the calling thread's address of the variable. When no hook is bound it falls back to plain-global resolution, correct only where the symbol is the storage; anything it cannot resolve safely returns NULL and is diagnosed.

Pluggable memory and symbols (host vs. emu)

The engine never assumes how an abstract address maps to a host pointer. A KitInterpHost vtable (resolved through interp_translate / interp_resolve_sym in interp_program.c) provides translate, resolve_sym, and resolve_tls; any may be NULL. Two configurations share the engine:

Host-identity (kit run --no-jit): abstract addresses are real host pointers. interp_translate returns the address unchanged; locals/allocas live in mem_arena and their addresses are real host pointers; globals/externs resolve through the bound resolver.
Emu/guest: addresses are guest VAs translated through an EmuAddrSpace, bounds- and permission-checked (see EMU.md).
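
The two configurations can be sketched with a one-hook vtable. The shapes below are assumptions: MiniHost, mini_translate, GuestSpace and guest_translate are hypothetical, and the hook signature is invented for illustration; the real KitInterpHost signatures live in include/kit/interp.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical vtable with only the translate hook. */
typedef struct {
    void *(*translate)(void *ctx, uint64_t addr, size_t size, int is_write);
    void *ctx;
} MiniHost;

/* A NULL hook means host-identity: the address is the pointer. */
static void *mini_translate(const MiniHost *h, uint64_t addr,
                            size_t size, int is_write) {
    if (!h || !h->translate) return (void *)(uintptr_t)addr;
    return h->translate(h->ctx, addr, size, is_write);
}

/* Hypothetical guest address space: one mapped region. */
typedef struct { uint8_t *mem; uint64_t base; size_t len; int writable; } GuestSpace;

/* Guest hook: bounds- and permission-checked; NULL signals a fault. */
static void *guest_translate(void *ctx, uint64_t addr, size_t size, int is_write) {
    GuestSpace *g = ctx;
    assert(g != NULL);
    if (addr < g->base || size > g->len) return NULL;
    if (addr - g->base > g->len - size) return NULL;
    if (is_write && !g->writable) return NULL;
    return g->mem + (addr - g->base);
}
```

Routing every access through one function is what lets the same handlers serve both models: only the bound vtable changes.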

Global symbols a function references are noted at lower time (their names are captured from the obj while it is alive), and their host addresses are resolved lazily and cached on first use, after the JIT image has been linked.

Integration

kit run --no-jit (driver/cmd/run.c) forces at least -O1, attaches an InterpProgram so each function is captured to bytecode while the normal object/JIT-link still runs (the link lays out data globals and resolves externs / function pointers), then executes the entry only through the interpreter — there is no JIT execution fallback; a non-interpretable entry is a hard error. Globals/externs resolve by walking the JIT image's symbol table (tolerating a target's leading-underscore C mangling), then host dlsym; thread-locals additionally route through kit_jit_tlv_resolve, which unwraps kit's own Mach-O TLV descriptor (verifying it before any indirect call so a foreign/dyld descriptor never becomes a wild call). Wasm entries get their instance and linear memory set up and run __kit_wasm_init plus the entry through the interpreter.

Emu interpret mode (KIT_EMU_MODE_INTERP, src/emu/emu.c) runs each lifted guest block through the interpreter instead of JITing it, also forcing -O1. The key simplification, verified against the rv64 lifter, is that the interp frame stays host-identity: the lifter lowers guest loads/stores to FFI calls into bounds-checked __emu_* host helpers, so there is no guest-VA translate hook and no guest-stack frame carving — only resolve_sym is bound. A long-lived stack is reset and reseeded per block (the additive kit_interp_stack_reset / kit_interp_call_args_on API). Because capture is append-only, kit_interp_lookup returns the newest same-named function, which gives interpret mode the same fresh-code semantics the JIT gets when self-modifying code invalidates a translation. A block the interpreter cannot run hard-fails with the reason; there is no silent JIT fallback.
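
The newest-wins lookup over an append-only capture list can be sketched in a few lines. MiniProgram, mini_capture and mini_lookup are hypothetical; a linear scan from the end stands in for whatever table kit_interp_lookup actually uses, and the fixed capacity and version field exist only for the example.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical capture list; `version` only marks which copy is which. */
typedef struct { const char *name; int version; } CapturedFunc;
typedef struct { CapturedFunc funcs[16]; size_t n; } MiniProgram;

/* Capture is append-only: older entries are never overwritten. */
static void mini_capture(MiniProgram *p, const char *name, int version) {
    assert(p->n < 16);
    p->funcs[p->n++] = (CapturedFunc){ name, version };
}

/* Scan from the end so the newest same-named function wins. */
static const CapturedFunc *mini_lookup(const MiniProgram *p, const char *name) {
    for (size_t i = p->n; i-- > 0; )
        if (strcmp(p->funcs[i].name, name) == 0) return &p->funcs[i];
    return NULL;
}
```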

Public API surface

The public header (include/kit/interp.h) exposes the program lifecycle, the host vtable, and an explicit-stack API designed as a swap-ready substrate for fibers / virtual threads: create a stack, seed it with an entry frame and arguments, and run/resume it to a DONE/TRAP/ERROR/BLOCKED status. Swapping execution contexts is just resuming a different stack. kit_interp_call is the convenience wrapper that allocates a stack, seeds it, resumes to completion, and frees it. The external-call path through host code is the one region that necessarily uses the host C stack for the call's duration, and is therefore non-suspendable. Exact signatures live in the header.

FFI: external calls (`ffi.c`)

External (host) calls are marshalled by a hand-rolled cast-thunk family — the classic libffi-lite trick. The engine classifies the call's ABIFuncInfo into integer-register and fp-register slots (sret pointer first, byval aggregates passed by pointer, register-split aggregates chunked), then interp_ffi_invoke calls the host function pointer through a prototype cast that matches the classified shape. This is correct on the supported ABIs (SysV x64, AAPCS64, RV64 LP64D) because integer and fp arguments come from independent register sequences, so a maximal T(u64 x8, fp x8) prototype places the first N integers and M fp values in the right registers regardless of interleaving; unused trailing slots are ignored. Two fp shapes exist because a 4-byte single and an 8-byte double occupy the fp register differently; a signature mixing the two is rejected. Returns mirror this: a value comes back in one or two registers, dispatched through scalar or struct-returning thunks whose field types steer the return registers, and the caller scatters each part into the aggregate destination.
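
The classification step that precedes the thunk call can be sketched as follows. This models scalars only: ArgVal, Classified and classify are hypothetical names, sret and aggregate handling are omitted, and the final call through the cast prototype is left out because it depends on the host ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { MAX_INT_SLOTS = 8, MAX_FP_SLOTS = 8 };
typedef enum { ARG_INT, ARG_F32, ARG_F64 } ArgKind;
typedef struct { ArgKind kind; uint64_t bits; } ArgVal;  /* raw value image */

typedef struct {
    uint64_t ints[MAX_INT_SLOTS]; int nints;
    uint64_t fps[MAX_FP_SLOTS];   int nfps;
    int fp_is_f32;                /* which of the two fp shapes to use */
    const char *reject;           /* non-NULL when the call is diagnosed */
} Classified;

/* Sort arguments into independent integer and fp slot sequences.
 * Source interleaving does not matter: each class keeps its own order. */
static int classify(const ArgVal *args, int n, Classified *out) {
    assert(args != NULL && out != NULL);
    int saw_f32 = 0, saw_f64 = 0;
    *out = (Classified){0};
    for (int i = 0; i < n; i++) {
        if (args[i].kind == ARG_INT) {
            if (out->nints == MAX_INT_SLOTS) {
                out->reject = "too many integer register args";
                return 0;
            }
            out->ints[out->nints++] = args[i].bits;
            continue;
        }
        if (args[i].kind == ARG_F32) saw_f32 = 1; else saw_f64 = 1;
        if (saw_f32 && saw_f64) {
            out->reject = "mixed single/double fp args";
            return 0;
        }
        if (out->nfps == MAX_FP_SLOTS) {
            out->reject = "too many fp register args";
            return 0;
        }
        out->fps[out->nfps++] = args[i].bits;
    }
    out->fp_is_f32 = saw_f32;
    return 1;
}
```

A rejected signature carries a reason string, mirroring the rule that unsupported shapes are diagnosed rather than guessed.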

Signatures outside this family are diagnosed, not guessed: too many register args, stack-routed variadics (Apple ARM64 vararg_on_stack), 3+-register struct returns, 32-bit-fp struct-return fields, and aggregate / oversized scalars in a variadic-tail position (which have no per-call ABI classification). The thunk casts deliberately mismatch the real prototype, which trips clang's -fsanitize=function, so the dispatcher opts out of that one check (clang only; the kit self-host build never enables it).

What it does not do, by design

Inline asm (IR_ASM_BLOCK) is rejected: it needs machinize's constraint resolution, which the interp tap skips, and has no portable interpretation. The FFI signatures listed above are diagnosed rather than marshalled. These are clean rejections with a reason, never miscompilations.

See also: IR.md, OPT.md, JIT.md, EMU.md, CODEGEN.md.
