Emulator

kit emu is a user-mode emulator for guest ELF executables. It loads a guest program image into a host-managed address space, then runs it by JIT-translating one guest basic block at a time into host machine code through the same CG -> MC -> link pipeline the native JIT uses (see JIT.md), caching each translation keyed by guest PC and dispatching between cached blocks until the guest exits. There is no interpreter loop over guest opcodes and no separate guest codegen path: a guest ISA is treated as just another frontend that emits CG.

The emulator is feature-gated (KIT_EMU_ENABLED). When disabled, the arch and object-format emu vtables compile to empty stubs (src/arch/emu_stubs.c, src/obj/emu_stubs.c) and the public kit_emu_* calls return KIT_UNSUPPORTED.

Why this shape

The guiding decision is that the emulator owns process orchestration and nothing else. It must not embed ELF parsing, ISA decode/lift, or Linux ABI semantics inline. Each of those is the domain of an existing registry (object format, arch, OS), reached only through vtables. This keeps libkit policy-free — the library describes requests (a syscall, an unresolved import, a needed shared object) and an embedder or the driver decides what they mean — and it lets the bulk of the backend (opt, register allocation, MC emission, linking, JIT execmem) be reused unchanged: a lifted guest block is an ordinary CG function.

A second decision is that execution starts from a binary image, not source. Loading maps a guest process image; it never builds an ObjBuilder. An ObjBuilder appears only after the lifter emits CG for a translated block. The type split is deliberate:

object readers / KitObjFile: inspect binary formats (read-only)
EmuLoadedImage / EmuProcess: the live guest process state
ObjBuilder -> LinkImage / KitJit: host code generated per block

Footprint: three directories by design

The emulator deliberately spans three source trees, each behind the boundary it owns:

src/emu/             process orchestration, lifecycle, dispatch, address
                     space, code cache, runtime helpers, dynamic loader,
                     CPUState, TLS, fault routing
src/os/              guest-OS personality registry + per-OS impls
                     (Linux is the only one today)
src/obj/elf/emu_load.c   the guest ELF image loader (ObjFormatImpl.emu)

In addition, per-ISA decode/lift lives under src/arch/<arch>/ (only rv64 ships a real ArchEmuOps). The boundary, not the file count, is the invariant: format code maps files, arch code decodes/lifts instructions, OS code models the user ABI, and src/emu coordinates execution.

Top-level data flow

guest bytes
  -> kit_detect_fmt + ObjFormatImpl.emu->detect_executable   (target)
  -> ObjFormatImpl.emu->load_executable -> EmuLoadedImage       (image)
  -> KitOsImpl.emu_init_process / _thread                     (stack, ABI)
  -> ArchEmuOps.cpu_new (+ attach addr space, set PC/SP, set tp)
       |
       v   dispatch loop (kit_emu_step):
  read guest PC
    -> code cache hit?  -- yes --> call cached host block
                        -- no  --> translate_block:
                              decode_block (one BB)
                              -> lift_block -> CG function
                              -> opt -> ObjBuilder
                              -> link session (JIT output) -> KitJit
                              -> cache (guest_pc -> host entry)
    call host block -> returns next guest PC
    inspect CPUState trap: EXIT stops the loop, FAULT panics

Public surface is kit_emu_run (load + run to completion) and the finer-grained kit_emu_new / kit_emu_step / kit_emu_lookup / kit_emu_free in include/kit/emu.h. The driver entry is driver/cmd/emu.c, which turns a path into bytes, marshals argv/envp, wires a KitJitHost (execmem + TLS), and reports the guest exit code.

KitEmu lifecycle

src/emu/emu.c owns KitEmu and the translate/dispatch loop. At construction (kit_emu_new) it:

Resolves config (emu_resolve_config): detect the binary format, look up ObjFormatImpl and require an emu vtable; determine the guest KitTarget (caller-supplied or detect_executable, which accepts a main ET_EXEC image); look up the ArchImpl and require its decode + emu hooks; look up the KitOsImpl. Any missing piece is KIT_UNSUPPORTED.
Wires bindings: public KitEmuExternalBindings (syscall / resolve_import / resolve_object) are adapted into the internal EmuExternalBindings shape via small thunks. When no syscall binding is supplied, the OS's emu_default_syscall is used directly, so the driver gets working Linux semantics out of the box.
Initializes OS process/thread private state, then calls the object format's load_executable, then emu_init_process (stack, auxv, brk, dynamic loading) and per-thread emu_init_thread (TLS, thread pointer).
Allocates CPUState via ArchEmuOps.cpu_new, seeds PC/SP, and attaches the address space.

Error and fault discipline

The emulator distinguishes two failure axes and routes each through a single boundary. Host-side build failures (out of memory, an unsupported guest, a lift or link error) use the compiler's panic/longjmp mechanism: kit_emu_new and kit_emu_step each wrap their body in a compiler_panic_save / setjmp frame, so any compiler_panic inside unwinds to the boundary, runs the registered cleanups (tearing down a partially built emu or a half-translated block), restores the prior panic frame, and returns a status. A code-cache hit short-circuits this boundary entirely, so the hot dispatch path pays no setjmp cost.
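The host-side boundary described above can be sketched in plain C. This is a minimal illustration, assuming a single-threaded caller; every name in it (demo_panic, demo_step, DemoPanicFrame) is hypothetical and stands in for kit's compiler_panic machinery, whose real signatures are not shown in this document.

```c
#include <setjmp.h>
#include <stddef.h>

/* All names here are hypothetical stand-ins, not kit's real panic API. */
typedef enum { DEMO_OK = 0, DEMO_NOMEM = 1, DEMO_UNSUPPORTED = 2 } DemoStatus;

typedef struct DemoPanicFrame {
    jmp_buf env;
    /* volatile: written between setjmp and longjmp, read after the unwind */
    volatile DemoStatus status;
    struct DemoPanicFrame *prev;   /* frame that was active when we entered */
} DemoPanicFrame;

static DemoPanicFrame *demo_top_frame = NULL;
static int demo_cleanups_run = 0;

/* Unwind to the innermost boundary; never returns to the caller. */
static void demo_panic(DemoStatus status) {
    demo_top_frame->status = status;
    longjmp(demo_top_frame->env, 1);
}

/* Stand-in for a cold translation that can fail part-way through. */
static void demo_translate(int should_fail) {
    if (should_fail) demo_panic(DEMO_UNSUPPORTED);
}

/* The boundary: push a frame, run the body, pop the frame on both paths. */
static DemoStatus demo_step(int should_fail) {
    DemoPanicFrame frame;
    frame.status = DEMO_OK;
    frame.prev = demo_top_frame;
    demo_top_frame = &frame;

    if (setjmp(frame.env) != 0) {
        demo_cleanups_run++;          /* tear down the half-built block here */
        demo_top_frame = frame.prev;  /* restore the prior panic frame */
        return frame.status;          /* the panic becomes a status code */
    }
    demo_translate(should_fail);
    demo_top_frame = frame.prev;
    return DEMO_OK;
}
```

In this sketch a successful call and a failing call both leave demo_top_frame as they found it, which is the property that lets boundaries nest.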

Guest-side faults are data, not control flow: a block records an EmuTrap* reason in CPUState and returns normally. The dispatcher reads the trap after the call — EMU_TRAP_EXIT stops the loop with an exit code, and an EMU_TRAP_FAULT that no OS personality converted into a signal frame is escalated into a host panic at the boundary. A guest decode failure surfaces as a translate miss and the same panic. The invariant is that no guest condition ever longjmps out of guest code; only the host build/escalation paths use the unwind boundary.

The whole emulator is allocated off the borrowed Compiler's heap and hangs off KitEmu; there is no global state. The KitJitHost (execmem allocator + TLS support) is borrowed and must outlive the emu — without one, runs surface KIT_UNSUPPORTED, since cold blocks need executable memory.

KitEmu carries two execution strategies (see below): the default JIT path stores host code entries in the cache; the optional INTERP path stores KitInterpFunc* and runs blocks through the IR interpreter.

Translation and dispatch

kit_emu_step(e, nblocks) runs up to nblocks guest basic blocks. For each iteration it reads the guest PC, looks it up (kit_emu_lookup), calls the resulting host block, sets the next PC from the block's return value, and inspects the CPUState trap reason:

EMU_TRAP_EXIT -> mark done, capture exit code, stop
EMU_TRAP_FAULT -> panic (a fault not converted into a signal frame)
otherwise continue to the next block
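As a rough illustration of this loop, the sketch below runs toy "blocks" that return the next PC and record traps in a state struct. The Toy* names, the two hard-coded blocks, and the -1 error return are assumptions made for the example; they are not kit's types or its status codes.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { TOY_TRAP_NONE, TOY_TRAP_EXIT, TOY_TRAP_FAULT } ToyTrap;

typedef struct {
    uint64_t pc;
    ToyTrap trap;
    int exit_code;
    uint64_t counter;   /* stands in for guest register state */
} ToyCpu;

/* A translated block: takes the thread state, returns the next guest PC. */
typedef uint64_t (*ToyBlockFn)(ToyCpu *);

/* Block at toy PC 0: count up, loop to itself until 3, then go to PC 1. */
static uint64_t toy_block_loop(ToyCpu *cpu) {
    cpu->counter++;
    return cpu->counter < 3 ? 0 : 1;
}

/* Block at toy PC 1: the exit is recorded as data, not via the return value. */
static uint64_t toy_block_exit(ToyCpu *cpu) {
    cpu->trap = TOY_TRAP_EXIT;
    cpu->exit_code = (int)cpu->counter;
    return cpu->pc;
}

/* Stand-in for the code cache plus translator. */
static ToyBlockFn toy_lookup(uint64_t pc) {
    switch (pc) {
    case 0: return toy_block_loop;
    case 1: return toy_block_exit;
    default: return NULL;   /* translate miss */
    }
}

/* Run up to nblocks blocks; returns blocks run, or -1 on fault or miss. */
static int toy_step(ToyCpu *cpu, int nblocks) {
    int ran = 0;
    while (ran < nblocks && cpu->trap == TOY_TRAP_NONE) {
        ToyBlockFn fn = toy_lookup(cpu->pc);
        if (fn == NULL) return -1;
        cpu->pc = fn(cpu);                 /* next PC comes from the block */
        ran++;
        if (cpu->trap == TOY_TRAP_FAULT) return -1;
    }
    return ran;
}
```

Calling toy_step with a small nblocks and then again with a larger one shows the resumable shape: state lives in ToyCpu, so the loop can stop and continue at any block edge.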

kit_emu_lookup consults the cache and, on a miss, takes the cold path. A cache hit short-circuits even the panic boundary. On a miss, translate_block runs:

decode_block decodes a single basic block: it walks instructions until it hits a terminator, the per-block cap (EMU_MAX_INSTS_PER_BLOCK), a decode failure, or a mapped-range boundary, reading through a bounds-checked EMU_MEM_EXEC host pointer.
A fresh ObjBuilder and CG function are created. The block is one host function with the block signature (emu_block_fn_type; see Guest CPUState) and a guest-PC-derived symbol name: emu_block_<16-hex-pc>, fixed width, carrying the full 64-bit guest PC in hex. The encoding is a bijection on guest PCs: two translations of the same PC map to the same symbol, and distinct PCs map to distinct symbols within the linker's global pool.
ArchEmuOps.lift_block emits the body. The block returns the next guest PC; traps/exits are recorded in CPUState, not the return value, so the dispatcher observes them after the call.
CG is finalized, the object built, and a one-shot link session (output kind JIT) links it with emu_runtime_extern_resolver wired in. The block entry is looked up by symbol in the resulting KitJit.
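The fixed-width symbol naming from the second step can be sketched as a pair of functions, one in each direction, which makes the one-to-one property concrete. The function names are hypothetical, and lowercase hex digits are an assumption; the text above only fixes the prefix and the 16-digit width.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define DEMO_BLOCK_PREFIX "emu_block_"
/* prefix + 16 hex digits + terminating NUL */
enum { DEMO_BLOCK_NAME_LEN = sizeof(DEMO_BLOCK_PREFIX) - 1 + 16 + 1 };

/* Guest PC -> symbol name, zero-padded so every name has the same length. */
static void demo_block_symbol(uint64_t guest_pc, char out[DEMO_BLOCK_NAME_LEN]) {
    snprintf(out, DEMO_BLOCK_NAME_LEN, DEMO_BLOCK_PREFIX "%016" PRIx64, guest_pc);
}

/* Symbol name -> guest PC. Returns 0 on success, -1 if not a block symbol. */
static int demo_block_pc(const char *name, uint64_t *guest_pc) {
    size_t plen = sizeof(DEMO_BLOCK_PREFIX) - 1;
    if (strncmp(name, DEMO_BLOCK_PREFIX, plen) != 0) return -1;
    if (strlen(name) != plen + 16) return -1;

    uint64_t value = 0;
    for (size_t i = 0; i < 16; i++) {
        char c = name[plen + i];
        unsigned digit;
        if (c >= '0' && c <= '9') digit = (unsigned)(c - '0');
        else if (c >= 'a' && c <= 'f') digit = (unsigned)(c - 'a') + 10u;
        else return -1;
        value = (value << 4) | digit;
    }
    *guest_pc = value;
    return 0;
}
```

Because the width is fixed, no two PCs can share a name and no name is a prefix of another, so a plain string comparison in the symbol table is enough.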

Each cold block is published as its own standalone one-block JIT image. The image is retained in the emu's jits vector (so its executable memory stays mapped for the emu's lifetime) and the block entry is inserted into the code cache. There is no cross-image relocation between blocks; control flows from block to block only through the dispatcher and the next-PC return value.

The code cache and invalidation

src/emu/runtime.c holds the cache: an open-addressed, linear-probe hash from guest PC to host entry, grown by doubling, never evicted, created lazily on first lookup (and requiring a wired JIT host).
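A minimal sketch of such a table follows: open addressing, linear probing, growth by doubling, no eviction, and lazy allocation on first insert. The DemoCache names, the multiplicative hash, the initial capacity of 8, and the one-half load factor are all assumptions chosen for the example, not details taken from runtime.c.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint64_t pc; void *entry; } DemoSlot;  /* entry NULL = empty */
typedef struct { DemoSlot *slots; size_t cap; size_t len; } DemoCache;

/* cap is always a power of two, so masking replaces the modulo. */
static size_t demo_hash(uint64_t pc, size_t cap) {
    return (size_t)((pc * 0x9E3779B97F4A7C15ull) >> 32) & (cap - 1);
}

static void *demo_cache_get(const DemoCache *c, uint64_t pc) {
    if (c->cap == 0) return NULL;               /* never allocated yet */
    for (size_t i = demo_hash(pc, c->cap);; i = (i + 1) & (c->cap - 1)) {
        if (c->slots[i].entry == NULL) return NULL;   /* hit a hole: miss */
        if (c->slots[i].pc == pc) return c->slots[i].entry;
    }
}

static int demo_cache_put(DemoCache *c, uint64_t pc, void *entry);

/* Double the table (or create it) and reinsert every live slot. */
static int demo_cache_grow(DemoCache *c) {
    size_t new_cap = c->cap ? c->cap * 2 : 8;
    DemoSlot *old = c->slots;
    size_t old_cap = c->cap;
    DemoSlot *fresh = calloc(new_cap, sizeof *fresh);
    if (fresh == NULL) return -1;
    c->slots = fresh;
    c->cap = new_cap;
    c->len = 0;
    for (size_t i = 0; i < old_cap; i++)
        if (old[i].entry != NULL) demo_cache_put(c, old[i].pc, old[i].entry);
    free(old);
    return 0;
}

/* Insert or overwrite. Returns 0 on success, -1 on allocation failure. */
static int demo_cache_put(DemoCache *c, uint64_t pc, void *entry) {
    assert(entry != NULL);   /* NULL is reserved as the empty marker */
    /* Keep the load factor at or below 1/2 so probe loops always end. */
    if ((c->len + 1) * 2 > c->cap && demo_cache_grow(c) != 0) return -1;
    size_t i = demo_hash(pc, c->cap);
    while (c->slots[i].entry != NULL && c->slots[i].pc != pc)
        i = (i + 1) & (c->cap - 1);
    if (c->slots[i].entry == NULL) c->len++;
    c->slots[i].pc = pc;
    c->slots[i].entry = entry;
    return 0;
}
```

Since nothing is ever deleted from a live table, the sketch needs no tombstones; that simplification only holds for a never-evicted cache like the one described here.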

Self-modifying / dynamically-patched guest code is handled through a generation counter on the address space. Writes to translated pages, dynamic relocations, and explicit invalidation all bump the generation and clear the per-page "translated" bit. kit_emu_lookup compares the cache's recorded generation against the address space's current generation; on mismatch it drops the entire cache and re-translates on demand. This is coarse but correct: stale host code is never executed.
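The generation check can be illustrated with a deliberately tiny model. The sketch below assumes 4 KiB pages, a fixed 8-page address space, and a direct-mapped 16-slot cache; all of those, and every Demo* name, are simplifications invented for the example. Only the shape (a write to a translated page bumps the generation, and a lookup that sees a stale generation drops everything) follows the text above.

```c
#include <stdint.h>
#include <string.h>

enum { DEMO_PAGE_SHIFT = 12, DEMO_PAGES = 8, DEMO_SLOTS = 16 };

typedef struct {
    uint64_t generation;
    unsigned char translated[DEMO_PAGES];   /* per-page "translated" bit */
} DemoAddrSpace;

typedef struct {
    uint64_t seen_generation;   /* generation the cached entries belong to */
    uint64_t pcs[DEMO_SLOTS];
    int valid[DEMO_SLOTS];
    unsigned translations;      /* counts cold translations, for inspection */
} DemoCodeCache;

static size_t demo_page(uint64_t addr) {
    return (size_t)(addr >> DEMO_PAGE_SHIFT) % DEMO_PAGES;
}

/* A guest store: only writes that touch translated code bump the generation. */
static void demo_guest_write(DemoAddrSpace *as, uint64_t addr) {
    size_t page = demo_page(addr);
    if (as->translated[page]) {
        as->translated[page] = 0;
        as->generation++;
    }
}

/* Returns 1 on a cache hit, 0 when the block had to be (re)translated. */
static int demo_lookup(DemoCodeCache *cc, DemoAddrSpace *as, uint64_t pc) {
    if (cc->seen_generation != as->generation) {
        memset(cc->valid, 0, sizeof cc->valid);   /* drop the entire cache */
        cc->seen_generation = as->generation;
    }
    size_t slot = (size_t)(pc >> 2) % DEMO_SLOTS;
    if (cc->valid[slot] && cc->pcs[slot] == pc) return 1;

    /* Cold path: "translate", publish, and mark the source page. */
    cc->pcs[slot] = pc;
    cc->valid[slot] = 1;
    cc->translations++;
    as->translated[demo_page(pc)] = 1;
    return 0;
}
```

The cost of the coarse scheme is visible here: one write into any translated page discards every cached block, not just the blocks on that page.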

Block chaining (design intent)

The shipping strategy is deliberately the simple one: every cold block is its own one-block JIT image, and all inter-block control flows through the dispatcher's next-PC return. This keeps invalidation trivial (drop the cache entries; the images stay mapped harmlessly) and reuses the JIT path unchanged, at the cost of a dispatcher round-trip per block edge.

A faster strategy is anticipated but not wired into the live loop: runtime.c defines EmuCodeRegion (an up-front PROT_NONE reservation with write/runtime dual-aliasing and a monotonic RX high-water mark) and a __emu_dispatch cross-block helper symbol. The intended shape is to bump-allocate translated blocks into one growing RX image and patch direct jumps between them in place (block chaining), falling back to __emu_dispatch for edges whose target is not yet translated. That would require incremental relocation into a shared image; the present design keeps blocks relocation-isolated so that the generation-counter invalidation can remain whole-cache.

Address-space mapping (image.c)

EmuAddrSpace (src/emu/image.c) is the only module that translates guest virtual addresses to host pointers. It is a sparse VM model: an ordered array of EmuMap regions with unmapped holes between them. There is no flat guest-base + offset; the host owns the storage for each map separately (heap-allocated bytes), so guest VAs need not be host-mapped at the same address.

Each map records [start,end), permissions (EMU_MEM_READ/WRITE/EXEC), a kind (anonymous, file-backed, or guard), and per-page dirty and translated bitmaps. Responsibilities:

map / unmap / protect with page-granular splitting and merging (unmap and protect carve a region out of the middle of a map by re-appending the unaffected head/tail pieces)
a checked pointer accessor (emu_addr_space_ptr) that validates the range lies in one map and has the needed permissions, records a structured EmuMemFault (unmapped vs protection) on failure, and marks written pages dirty — flipping any previously-translated page back to untranslated and bumping the generation
gap search for placing new maps, brk growth/shrink, copy-in for loader/stack setup, and explicit invalidation
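The checked accessor is the piece most easily shown in code. The sketch below is an assumption-laden miniature: DemoMap, DemoSpace, and demo_ptr are hypothetical names, maps live in a plain array, and a range that runs off the end of its map is reported as unmapped, which is one reasonable reading of "lies in one map" rather than a statement about emu_addr_space_ptr itself. Dirty tracking is omitted.

```c
#include <stddef.h>
#include <stdint.h>

enum { DEMO_READ = 1, DEMO_WRITE = 2, DEMO_EXEC = 4 };

typedef enum {
    DEMO_FAULT_NONE, DEMO_FAULT_UNMAPPED, DEMO_FAULT_PROT
} DemoFaultKind;

/* One region: guest range [start,end) backed by separately owned host bytes. */
typedef struct {
    uint64_t start, end;
    unsigned perms;
    uint8_t *bytes;
} DemoMap;

typedef struct {
    DemoMap *maps;
    size_t nmaps;
    DemoFaultKind fault;    /* structured record of the last failure */
    uint64_t fault_addr;
} DemoSpace;

/* Translate guest [addr, addr+len) to a host pointer, or record a fault. */
static uint8_t *demo_ptr(DemoSpace *as, uint64_t addr, uint64_t len,
                         unsigned need) {
    for (size_t i = 0; i < as->nmaps; i++) {
        DemoMap *m = &as->maps[i];
        if (addr < m->start || addr >= m->end) continue;
        if (len > m->end - addr) break;   /* range leaves the map: a hole */
        if ((m->perms & need) != need) {
            as->fault = DEMO_FAULT_PROT;
            as->fault_addr = addr;
            return NULL;
        }
        as->fault = DEMO_FAULT_NONE;
        return m->bytes + (addr - m->start);   /* no flat guest base */
    }
    as->fault = DEMO_FAULT_UNMAPPED;
    as->fault_addr = addr;
    return NULL;
}
```

Note that the host pointer is computed per map from that map's own backing storage, so two adjacent guest regions need not be adjacent in host memory.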

emu_cpu_attach_addr_space lets the CPUState borrow the address space so runtime helpers can translate without threading the process pointer.

Runtime, helpers, and the extern resolver (runtime.c)

The runtime is in-process: there is no separate runtime object file to link. Lifted blocks call helper functions by referencing undefined extern symbols (EMU_SYM_*, e.g. __emu_load64, __emu_store32, __emu_syscall, __emu_cpu_state). At link time the linker calls emu_runtime_extern_resolver, which maps each name to the host address of the matching C function (or, for __emu_cpu_state, to the running emu's CPUState pointer). Unrecognized shared names fall through to the arch's resolve_runtime_helper hook, so a backend can register its own arch-private helpers (RV64 registers register-file and jalr helpers). Anything still unresolved becomes the linker's ordinary undefined-symbol diagnostic.
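The resolution order (shared names first, then the arch hook, then "unresolved") can be sketched as below. The three __emu_* strings come from the text above; everything else is hypothetical, including the DemoSym result type, the placeholder helper bodies, and the made-up arch-private name __demo_rv64_jalr. The real resolver's signature is not shown in this document.

```c
#include <stddef.h>
#include <string.h>

typedef void (*DemoFn)(void);

/* A resolved symbol is either a function address or a data address. */
typedef struct { int found; DemoFn fn; void *data; } DemoSym;

typedef DemoSym (*DemoArchHook)(const char *name);

typedef struct {
    void *cpu_state;          /* what __emu_cpu_state resolves to */
    DemoArchHook arch_hook;   /* fallback for arch-private helpers, may be NULL */
} DemoResolverCtx;

/* Placeholder bodies; real helpers would take and return guest values. */
static void demo_helper_load64(void) {}
static void demo_helper_syscall(void) {}
static void demo_rv64_jalr(void) {}

/* Arch-side fallback: knows only its own private helper names. */
static DemoSym demo_rv64_hook(const char *name) {
    DemoSym sym = { 0, NULL, NULL };
    if (strcmp(name, "__demo_rv64_jalr") == 0) {
        sym.found = 1;
        sym.fn = demo_rv64_jalr;
    }
    return sym;
}

static DemoSym demo_resolve(const DemoResolverCtx *ctx, const char *name) {
    static const struct { const char *name; DemoFn fn; } shared[] = {
        { "__emu_load64", demo_helper_load64 },
        { "__emu_syscall", demo_helper_syscall },
    };
    DemoSym sym = { 0, NULL, NULL };

    /* The one data symbol: per-emu, so it cannot live in a static table. */
    if (strcmp(name, "__emu_cpu_state") == 0) {
        sym.found = 1;
        sym.data = ctx->cpu_state;
        return sym;
    }
    for (size_t i = 0; i < sizeof shared / sizeof shared[0]; i++) {
        if (strcmp(name, shared[i].name) == 0) {
            sym.found = 1;
            sym.fn = shared[i].fn;
            return sym;
        }
    }
    if (ctx->arch_hook != NULL) return ctx->arch_hook(name);
    return sym;   /* not found: the linker reports an undefined symbol */
}
```

Keeping functions and data in separate fields avoids converting a function pointer to void*, which ISO C does not guarantee to be meaningful.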

Memory helpers come in two flavors. The plain emu_mem_loadN set bounds-checks against the CPUState's window and, on a miss, writes EMU_TRAP_FAULT and returns zero; the dispatcher then stops. The checked variants (emu_mem_loadN_checked and all stores) take the faulting PC and the fall-through next PC. On a fault they route through emu_fault_deliver, so an OS personality can convert the fault into a guest signal frame and hand back a resume PC. This is how a guest SIGSEGV handler runs instead of the process dying.

The syscall trampoline (emu_syscall / emu_syscall_next) is purely a marshaller: it asks the OS vtable to decode guest registers into an EmuSyscallRequest, calls the wired bindings.syscall to service it, and asks the OS to encode the result back into the guest return register. The runtime never issues a host syscall itself.
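The decode / service / encode split can be sketched as follows. The register assignment used here is the standard RISC-V Linux convention (number in a7, arguments in a0 to a5, result in a0); the Demo* types, the binding signature, and syscall number 1000 are invented for the example and do not reflect EmuSyscallRequest's real layout.

```c
#include <stdint.h>

typedef struct { uint64_t x[32]; } DemoRegs;   /* guest integer registers */

typedef struct {
    uint64_t number;
    uint64_t args[6];
} DemoSyscallRequest;

/* The embedder-supplied policy: decides what a request means. */
typedef int64_t (*DemoSyscallBinding)(void *user, const DemoSyscallRequest *req);

/* The OS personality's half: pure register <-> request translation. */
typedef struct {
    void (*decode)(const DemoRegs *regs, DemoSyscallRequest *req);
    void (*encode)(DemoRegs *regs, int64_t result);
} DemoOsOps;

/* RISC-V Linux: a7 is x17, a0..a5 are x10..x15. */
static void demo_decode(const DemoRegs *regs, DemoSyscallRequest *req) {
    req->number = regs->x[17];
    for (int i = 0; i < 6; i++) req->args[i] = regs->x[10 + i];
}

/* The result (or a negative errno) goes back in a0. */
static void demo_encode(DemoRegs *regs, int64_t result) {
    regs->x[10] = (uint64_t)result;
}

/* Example binding: services one made-up call, rejects everything else. */
static int64_t demo_binding(void *user, const DemoSyscallRequest *req) {
    (void)user;
    if (req->number == 1000)   /* hypothetical "add" syscall */
        return (int64_t)(req->args[0] + req->args[1]);
    return -38;                /* -ENOSYS on Linux */
}

/* The trampoline itself: a marshaller that never touches the host OS. */
static void demo_syscall(const DemoOsOps *os, DemoSyscallBinding bind,
                         void *user, DemoRegs *regs) {
    DemoSyscallRequest req;
    os->decode(regs, &req);
    os->encode(regs, bind(user, &req));
}
```

Swapping demo_binding for a different function changes what the guest observes without touching the trampoline or the OS ops, which is the policy-free property the design aims for.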

Guest CPUState (cpu.c)

EmuCPUState (src/emu/cpu.c) is the per-thread guest register/trap record. The core keeps it deliberately thin: PC, trap reason, exit code, a borrowed address-space pointer, and an opaque arch-private blob (arch_state) sized and owned by the backend (emu_cpu_new_with_arch_state). The core never interprets the register file; the arch's helpers (get_gpr/set_gpr, get_sp/set_sp, get_tp/set_tp, syscall arg/result accessors) do. Trap state (EMU_TRAP_EXIT / EMU_TRAP_FAULT) is how a block tells the dispatcher to stop.

cpu.c also defines the CG types the lifter needs: EmuThread* (modeled as void*) and the block signature, which uses i64 as the return type in the CG type system (emu_block_fn_type). The direct-call typedef in emu.c (u64 (*)(EmuThread*)) is the unsigned spelling of the same thing; the dispatch loop treats the return as a 64-bit machine word (the next guest PC) and never depends on its sign. Lifted blocks take the thread pointer as their one parameter and reach guest registers and memory only through helper calls — they hold no inline state.
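The split between a thin core record and an arch-owned blob can be sketched like this. The field list follows the prose above, but the DemoCpu layout, the constructor's signature, and the rv64 accessors are assumptions; only two RISC-V facts are relied on, namely that x0 reads as zero and that sp is x2.

```c
#include <stdint.h>
#include <stdlib.h>

/* Core record: nothing in here knows what a guest register is. */
typedef struct {
    uint64_t pc;
    int trap;
    int exit_code;
    void *addr_space;   /* borrowed, never freed by the CPU record */
    void *arch_state;   /* opaque blob, sized by the backend, owned here */
} DemoCpu;

/* Core constructor: the backend only says how many bytes it needs. */
static DemoCpu *demo_cpu_new_with_arch_state(size_t arch_size) {
    DemoCpu *cpu = calloc(1, sizeof *cpu);
    if (cpu == NULL) return NULL;
    cpu->arch_state = calloc(1, arch_size);
    if (cpu->arch_state == NULL) {
        free(cpu);
        return NULL;
    }
    return cpu;
}

static void demo_cpu_free(DemoCpu *cpu) {
    if (cpu == NULL) return;
    free(cpu->arch_state);
    free(cpu);
}

/* Backend side: the only code that interprets the blob. */
typedef struct { uint64_t x[32]; } DemoRv64State;

static DemoCpu *demo_rv64_cpu_new(void) {
    return demo_cpu_new_with_arch_state(sizeof(DemoRv64State));
}

static uint64_t demo_rv64_get_gpr(const DemoCpu *cpu, unsigned reg) {
    const DemoRv64State *s = cpu->arch_state;
    reg &= 31u;
    return reg == 0 ? 0 : s->x[reg];   /* x0 is hard-wired to zero */
}

static void demo_rv64_set_gpr(DemoCpu *cpu, unsigned reg, uint64_t value) {
    DemoRv64State *s = cpu->arch_state;
    reg &= 31u;
    if (reg != 0) s->x[reg] = value;   /* writes to x0 are discarded */
}

static uint64_t demo_rv64_get_sp(const DemoCpu *cpu) {
    return demo_rv64_get_gpr(cpu, 2);  /* sp is x2 in the RISC-V ABI */
}
```

A second backend would supply its own state struct and accessors while reusing the same constructor, which is why the core can stay unaware of register-file layout.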

Dynamic loader and relocation (dl.c)

src/emu/dl.c owns the runtime-only dynamic-linking work, sitting above the object-format emu vtable which supplies all format-specific parsing. After the main object is mapped, emu_dl_load_dependencies_and_relocate:

Loads needed objects: for each object, iterates DT_NEEDED entries; an entry not already in the link map is fetched via the resolve_object binding (so the embedder controls the search) and mapped with the format's map_object (an ET_DYN gets a load bias assigned into a VM gap).
Rebuilds TLS modules across the new link map (see TLS).
Applies relocations for every object's main and PLT tables. Each relocation is classified by the format/arch into a neutral class (relative, symbolic, or import-slot) plus a RelocKind; the symbol value S is resolved through the link map first, then via the resolve_import binding. The final bytes are patched in mapped guest memory through a checked writable pointer, and the patched range is invalidated so any cached translation of that page is dropped.

Symbol value resolution and relocation byte patching are shared with the linker: emu_apply_reloc_bytes is a thin wrapper over the neutral link_reloc_apply (src/obj/reloc_apply.c), so PC-relative, absolute, GOT-slot, etc. encodings are computed identically whether the linker is laying out a new image or the emulator is patching an input image at runtime. See LINK.md and OBJ.md.
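The three neutral classes and the two-stage symbol lookup can be sketched together. This is a simplified model, assuming every relocation writes a 64-bit little-endian word; the value formulas (bias + addend for relative, S + addend for symbolic, S for an import slot) follow the usual ELF conventions, while the Demo* types, the lookup tables, and their addresses are invented and say nothing about link_reloc_apply's real interface.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum {
    DEMO_RELOC_RELATIVE, DEMO_RELOC_SYMBOLIC, DEMO_RELOC_IMPORT_SLOT
} DemoRelocClass;

typedef struct {
    DemoRelocClass cls;
    uint64_t offset;    /* where to patch, relative to the mapped image */
    int64_t addend;
    const char *sym;    /* NULL for relative relocations */
} DemoReloc;

/* A lookup returns 1 and fills *value when it knows the name. */
typedef int (*DemoLookup)(const char *name, uint64_t *value);

/* Stand-in for the link map: symbols defined by loaded guest objects. */
static int demo_link_map(const char *name, uint64_t *value) {
    if (strcmp(name, "guest_fn") == 0) { *value = 0x12000; return 1; }
    return 0;
}

/* Stand-in for the embedder's resolve_import policy. */
static int demo_import(const char *name, uint64_t *value) {
    if (strcmp(name, "host_puts") == 0) { *value = 0x70000010; return 1; }
    return 0;
}

static uint64_t demo_read64(const uint8_t *p) {
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--) v = (v << 8) | p[i];
    return v;
}

/* Returns 0 on success, -1 for an unresolved symbol or out-of-range patch. */
static int demo_apply(uint8_t *image, size_t image_size, uint64_t bias,
                      const DemoReloc *r, DemoLookup link_map,
                      DemoLookup import) {
    uint64_t value;
    if (r->cls == DEMO_RELOC_RELATIVE) {
        value = bias + (uint64_t)r->addend;
    } else {
        uint64_t s = 0;
        /* Link map first; the import binding is only the fallback. */
        if (!link_map(r->sym, &s) && !(import != NULL && import(r->sym, &s)))
            return -1;
        value = r->cls == DEMO_RELOC_SYMBOLIC ? s + (uint64_t)r->addend : s;
    }
    if (r->offset > image_size || image_size - r->offset < 8) return -1;
    for (int i = 0; i < 8; i++)   /* little-endian store */
        image[r->offset + i] = (uint8_t)(value >> (8 * i));
    return 0;
}
```

A real loader would follow each successful patch with an invalidation of the touched range, as the list above describes; that step is left out of the sketch.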

Import binding / thunks. An import resolving to a guest address binds directly. One resolving to a native host function gets a guest import thunk: emu_dl_init_process reserves a small executable guest VA range; each host-backed import is assigned a slot, the arch emits a thunk there (emit_import_thunk), and an EmuImportBinding records the host function plus a typed signature. When the guest calls into the thunk range, the arch's call helper detects it (emu_dl_resolve_import_thunk, a base+size range check that returns the matching EmuImportBinding). emu_call_host_import then marshals the call: it reads the guest argument registers per the arch's syscall/call ABI into a small u64 array, casts the recorded host pointer to a function type chosen by the binding's EmuValue signature, invokes it, and writes the result back into the guest return register. Arguments are passed as raw 64-bit words; the signature is the contract that fixes arity and which words are live. Marshalling is intentionally narrow — host imports are limited to a small fixed argument count (integer/pointer words today), and a call that exceeds it is KIT_UNSUPPORTED rather than a guessed ABI. A binding with no declared signature defaults to a single-u64 shape. The resolve_import binding is policy-only — it decides whether and to what a symbol resolves; the loader owns GOT/thunk writes and the marshalling contract.
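A compact sketch of the two halves, the range check and the arity-driven call, is given below. It is illustrative only: DemoThunkRange, the 16-byte slot size, the requirement that a call land exactly on a slot start, and the limit of two arguments are assumptions made to keep the example short, and the -1 return stands in for KIT_UNSUPPORTED.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*DemoHostFn)(void);   /* generic function pointer */

typedef struct {
    DemoHostFn host_fn;
    unsigned nargs;     /* the "signature": how many u64 words are live */
} DemoImportBinding;

typedef struct {
    uint64_t base;      /* guest VA where the thunk range starts */
    uint64_t slot_size;
    size_t nslots;
    DemoImportBinding *slots;
} DemoThunkRange;

/* Example host functions an embedder might expose to the guest. */
static uint64_t demo_host_twice(uint64_t a) { return 2 * a; }
static uint64_t demo_host_add(uint64_t a, uint64_t b) { return a + b; }

/* Base+size range check: is this guest PC the entry of a thunk slot? */
static const DemoImportBinding *demo_resolve_thunk(const DemoThunkRange *r,
                                                   uint64_t pc) {
    if (pc < r->base) return NULL;
    uint64_t off = pc - r->base;
    if (off >= r->slot_size * r->nslots) return NULL;
    if (off % r->slot_size != 0) return NULL;   /* assumed: entry only */
    return &r->slots[off / r->slot_size];
}

/* Cast to the function type the signature names, then call it.
 * Returns 0 on success, -1 when the arity is outside the supported set. */
static int demo_call_host_import(const DemoImportBinding *b,
                                 const uint64_t args[], uint64_t *ret) {
    switch (b->nargs) {
    case 1:
        *ret = ((uint64_t (*)(uint64_t))b->host_fn)(args[0]);
        return 0;
    case 2:
        *ret = ((uint64_t (*)(uint64_t, uint64_t))b->host_fn)(args[0], args[1]);
        return 0;
    default:
        return -1;      /* refuse rather than guess an ABI */
    }
}
```

The cast is only sound because the recorded arity matches the host function's true type; that is the sense in which the signature is the contract.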

Guest TLS (tls.c)

src/emu/tls.c models thread-local storage as a process-owned module list plus per-thread blocks. emu_tls_rebuild_modules assigns a module ID to each loaded object that has a PT_TLS segment and accumulates the static TLS size/alignment. Per-thread allocation lives in the OS layer (linux_init_thread): for each module it maps an anonymous block, copies the module's .tdata image in (.tbss stays zero), records an EmuTlsBlock, and for the initial module sets the guest thread pointer via the arch's set_tp. The TLS image bytes are read and written through EmuAddrSpace like any other guest memory, never through linker buffers.
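Two pieces of this are easy to show in isolation: sizing the static TLS area across modules, and initializing one per-thread block from a module's image. The sketch uses host malloc'd memory instead of an anonymous guest map, packs modules in ascending order with simple align-up, and uses hypothetical Demo* names; the real layout is ABI-specific and is not described in this document.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    const uint8_t *tdata;   /* initialized image (.tdata), may be NULL */
    size_t filesz;          /* bytes of initialized data */
    size_t memsz;           /* total size; the tail beyond filesz is .tbss */
    size_t align;           /* power of two */
} DemoTlsModule;

static size_t demo_align_up(size_t value, size_t align) {
    return (value + align - 1) & ~(align - 1);
}

/* Accumulate the static TLS size and the strictest alignment seen. */
static size_t demo_tls_static_size(const DemoTlsModule *mods, size_t n,
                                   size_t *max_align) {
    size_t total = 0, align = 1;
    for (size_t i = 0; i < n; i++) {
        total = demo_align_up(total, mods[i].align) + mods[i].memsz;
        if (mods[i].align > align) align = mods[i].align;
    }
    *max_align = align;
    return demo_align_up(total, align);
}

/* Build one thread's block for one module: copy .tdata, zero .tbss. */
static uint8_t *demo_tls_block_new(const DemoTlsModule *m) {
    if (m->filesz > m->memsz) return NULL;        /* malformed segment */
    uint8_t *block = calloc(1, m->memsz ? m->memsz : 1);
    if (block == NULL) return NULL;
    if (m->filesz != 0) memcpy(block, m->tdata, m->filesz);
    return block;                                 /* tail is already zero */
}
```

Each thread gets its own copy from the same read-only module image, so a store to a thread-local in one thread is never visible through another thread's block.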

Guest-OS personality (src/os)

The guest OS is a pluggable registry. src/os/registry.c maps a KitOSKind to a KitOsImpl vtable; Linux (src/os/linux/linux.c) is the only implementation. The OS owns everything about the user-mode process convention that is neither arch nor object-format specific:

process/thread private state (the Linux impl stores the mmap hint, the per-signal action table, the per-thread signal mask and TLS blocks)
initial stack layout: argv/envp copy-in, the AT_RANDOM block, the aux vector, 16-byte alignment, and the argc/argv/envp/auxv table on the guest stack; it also sizes stack, guard page, and brk reserve and kicks off dynamic loading
syscall ABI and default table: decode the number and six args from guest registers, service them, and encode the result back, plus a per-syscall next_pc hook (so rt_sigreturn resumes at the restored PC rather than past the trap). The bundled table covers process exit, the core file/io and memory (brk, anonymous mmap/munmap/mprotect through the VM API with Linux errno semantics), clock, identity, and signal calls — enough to run a static Linux program.
signal delivery: rt_sigaction records handler/flags/restorer, rt_sigprocmask maintains the blocked mask; faults arrive via emu_deliver_fault, which builds a guest signal frame on the stack (saving the interrupted context through arch hooks), redirects to the handler, and rt_sigreturn restores the interrupted state. Blocked or unhandled signals fall back to a process fault.
map-region placement (emu_find_map_region / emu_note_map_region): where anonymous maps, DL thunks, and TLS blocks land in the VA space.

The OS layer translates between guest ABI state and emulator-level requests; it does not perform host I/O or resolve host symbols. The C calling-convention ABI is separate, derived from (arch, obj-format) by src/abi/registry.c; the OS vtable does not pick ABI vtables.

Guest ELF loader (src/obj/elf/emu_load.c)

The ELF loader implements ObjFormatImpl.emu (ObjFormatEmuOps, elf_emu_ops), handling ELF64 little-endian executable loading via program headers. It parses the header directly rather than going through elf_read.c — the reader builds an ObjBuilder for the linker, which is the wrong output here. It maps ET_EXEC for the main object and ET_DYN for dependencies (assigned a load bias into a VM gap), extracting the dynamic, TLS, and interpreter metadata that the rest of the pipeline needs. The interface backs the dynamic loader's needed-entry iteration, dynamic symbol lookup (by name and index), relocation iteration, and relocation classification (ELF type numbers -> neutral classes + RelocKind). The program-header types and dynamic tables are the loading contract; the per-object format-private blob is EmuElfDynInfo.

Architecture vtable

A guest ISA plugs in via ArchEmuOps (src/arch/arch.h) on top of the shared ArchDecodeOps decoder (also used by objdump and dbg, see ARCH.md). The emu ops provide CPUState construction, the CG types for the block function, the lifter (lift_block), register/SP/TP accessors, the syscall-register ABI, signal-context save/restore, and the import-thunk emit/size. rv64 is the implemented backend; its lifter emits one CG function per block that returns the next guest PC and calls the shared __emu_* helpers plus its own register-file helpers, with control-transfer and ecall instructions acting as block terminators.

Optional INTERP execution

The same lifted-block path can run through the kit IR interpreter instead of host code (KIT_EMU_MODE_INTERP, requires KIT_INTERP_ENABLED; forces at least -O1 so the optimizer's PReg-path IR is available for capture). The mode is chosen once at construction, not per block.

The two strategies differ only in what the code cache stores. In JIT mode the payload is a host code entry, invoked directly through the EmuBlockFn typedef. In INTERP mode the block is still linked as a JIT image — that link is what resolves and validates the helper externs and proves the block is well-formed — but the cache payload is the captured KitInterpFunc*, and dispatch runs it on a per-emu interpreter program and stack, seeding the thread pointer as the single argument and shuttling back the next-PC return. Either way guest registers and memory are reached only through the same __emu_* helpers, so the interpreter holds no guest state and the lifter is identical across modes. A block the interpreter cannot capture or run is a hard failure with a reason, never a silent fallback to the JIT payload. See INTERPRETER.md.
