Emulator
kit emu is a user-mode emulator for guest ELF executables. It loads a
guest program image into a host-managed address space, then runs it by
JIT-translating one guest basic block at a time into host machine code
through the same CG -> MC -> link pipeline the native JIT uses (see
JIT.md), caching each translation keyed by guest PC and
dispatching between cached blocks until the guest exits. There is no
interpreter loop over guest opcodes and no separate guest codegen path:
a guest ISA is treated as just another frontend that emits CG.
The emulator is feature-gated (KIT_EMU_ENABLED). When disabled, the
arch and object-format emu vtables compile to empty stubs
(src/arch/emu_stubs.c, src/obj/emu_stubs.c) and the public
kit_emu_* calls return KIT_UNSUPPORTED.
Why this shape
The guiding decision is that the emulator owns process orchestration and nothing else. It must not embed ELF parsing, ISA decode/lift, or Linux ABI semantics inline. Each of those is the domain of an existing registry (object format, arch, OS), reached only through vtables. This keeps libkit policy-free — the library describes requests (a syscall, an unresolved import, a needed shared object) and an embedder or the driver decides what they mean — and it lets the bulk of the backend (opt, register allocation, MC emission, linking, JIT execmem) be reused unchanged: a lifted guest block is an ordinary CG function.
A second decision is that execution starts from a binary image, not
source. Loading maps a guest process image; it never builds an
ObjBuilder. An ObjBuilder appears only after the lifter emits CG for
a translated block. The type split is deliberate:
- object readers / KitObjFile: inspect binary formats (read-only)
- EmuLoadedImage / EmuProcess: the live guest process state
- ObjBuilder -> LinkImage / KitJit: host code generated per block
Footprint: three directories by design
The emulator deliberately spans three source trees, each behind the boundary it owns:
src/emu/                 process orchestration, lifecycle, dispatch, address
                         space, code cache, runtime helpers, dynamic loader,
                         CPUState, TLS, fault routing
src/os/                  guest-OS personality registry + per-OS impls
                         (Linux is the only one today)
src/obj/elf/emu_load.c   the guest ELF image loader (ObjFormatImpl.emu)
Plus per-ISA decode/lift under src/arch/<arch>/ (only rv64 ships a
real ArchEmuOps). The boundary, not the file count, is the invariant:
format code maps files, arch code decodes/lifts instructions, OS code
models the user ABI, and src/emu coordinates execution.
Top-level data flow
guest bytes
  -> kit_detect_fmt + ObjFormatImpl.emu->detect_executable   (target)
  -> ObjFormatImpl.emu->load_executable -> EmuLoadedImage    (image)
  -> KitOsImpl.emu_init_process / _thread                    (stack, ABI)
  -> ArchEmuOps.cpu_new (+ attach addr space, set PC/SP, set tp)
       |
       v   dispatch loop (kit_emu_step):
  read guest PC
    -> code cache hit? -- yes --> call cached host block
                       -- no  --> translate_block:
                                    decode_block (one BB)
                                    -> lift_block -> CG function
                                    -> opt -> ObjBuilder
                                    -> link session (JIT output) -> KitJit
                                    -> cache (guest_pc -> host entry)
  call host block -> returns next guest PC
  inspect CPUState trap: EXIT stops the loop, FAULT panics
Public surface is kit_emu_run (load + run to completion) and the
finer-grained kit_emu_new / kit_emu_step / kit_emu_lookup /
kit_emu_free in include/kit/emu.h. The driver entry is
driver/cmd/emu.c, which turns a path into bytes, marshals argv/envp,
wires a KitJitHost (execmem + TLS), and reports the guest exit code.
KitEmu lifecycle
src/emu/emu.c owns KitEmu and the translate/dispatch loop. At
construction (kit_emu_new) it:
- Resolves config (emu_resolve_config): detect the binary format, look up
  ObjFormatImpl and require an emu vtable; determine the guest KitTarget
  (caller-supplied or detect_executable, which accepts a main ET_EXEC
  image); look up the ArchImpl and require its decode + emu hooks; look up
  the KitOsImpl. Any missing piece is KIT_UNSUPPORTED.
- Wires bindings: public KitEmuExternalBindings (syscall / resolve_import /
  resolve_object) are adapted into the internal EmuExternalBindings shape
  via small thunks. When no syscall binding is supplied, the OS's
  emu_default_syscall is used directly, so the driver gets working Linux
  semantics out of the box.
- Initializes OS process/thread private state, then calls the object
  format's load_executable, then emu_init_process (stack, auxv, brk,
  dynamic loading) and per-thread emu_init_thread (TLS, thread pointer).
- Allocates CPUState via ArchEmuOps.cpu_new, seeds PC/SP, and attaches the
  address space.
Error and fault discipline
The emulator distinguishes two failure axes and routes each through a
single boundary. Host-side build failures (out of memory, an
unsupported guest, a lift or link error) use the compiler's
panic/longjmp mechanism: kit_emu_new and kit_emu_step each wrap
their body in a compiler_panic_save / setjmp frame, so any
compiler_panic inside unwinds to the boundary, runs the registered
cleanups (tearing down a partially built emu or a half-translated block),
restores the prior panic frame, and returns a status. A code-cache hit
short-circuits this boundary entirely, so the hot dispatch path pays no
setjmp cost.
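The host-side boundary described above can be sketched with plain setjmp/longjmp. This is a minimal illustration under stated assumptions, not the compiler's actual panic machinery: DemoCompiler, DemoPanicFrame, demo_panic, and demo_boundary are hypothetical names, and the single cleanup slot stands in for whatever cleanup registry the real code keeps.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real kit types. */
typedef enum { DEMO_OK = 0, DEMO_ERR = 1 } DemoStatus;

typedef struct DemoPanicFrame {
    jmp_buf env;
    struct DemoPanicFrame *prev;  /* frame to restore when leaving */
    void (*cleanup)(void *);      /* registered cleanup, may be NULL */
    void *cleanup_arg;
} DemoPanicFrame;

/* The panic chain hangs off a context object, so there is no global state. */
typedef struct { DemoPanicFrame *top; } DemoCompiler;

/* Unwind to the innermost boundary. Never returns. */
static void demo_panic(DemoCompiler *c) {
    assert(c->top != NULL);
    longjmp(c->top->env, 1);
}

/* Cleanup for the sketch: marks the half-built resource as released. */
static void demo_release(void *arg) { *(int *)arg = 0; }

/* Boundary function: the body runs under a saved frame. On panic the
 * cleanup runs, the prior frame is restored, and a status is returned. */
static DemoStatus demo_boundary(DemoCompiler *c, int should_fail, int *resource) {
    DemoPanicFrame frame;
    frame.prev = c->top;
    frame.cleanup = demo_release;
    frame.cleanup_arg = resource;

    if (setjmp(frame.env) != 0) {
        /* Reached via demo_panic. */
        if (frame.cleanup) frame.cleanup(frame.cleanup_arg);
        c->top = frame.prev;
        return DEMO_ERR;
    }
    c->top = &frame;
    *resource = 1;                 /* partially built state */
    if (should_fail) demo_panic(c);
    c->top = frame.prev;           /* normal exit: pop the frame */
    return DEMO_OK;
}
```

Nothing in the frame is modified between setjmp and longjmp, which keeps the locals well-defined after the unwind. A cache-hit path would simply never call demo_boundary.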
Guest-side faults are data, not control flow: a block records an
EmuTrap* reason in CPUState and returns normally. The dispatcher reads
the trap after the call — EMU_TRAP_EXIT stops the loop with an exit
code, and an EMU_TRAP_FAULT that no OS personality converted into a
signal frame is escalated into a host panic at the boundary. A guest
decode failure surfaces as a translate miss and the same panic. The
invariant is that no guest condition ever longjmps out of guest code;
only the host build/escalation paths use the unwind boundary.
The whole emulator is allocated off the borrowed Compiler's heap and
hangs off KitEmu; there is no global state. The KitJitHost
(execmem allocator + TLS support) is borrowed and must outlive the emu —
without one, runs surface KIT_UNSUPPORTED, since cold blocks need
executable memory.
KitEmu carries two execution strategies (see below): the default JIT
path stores host code entries in the cache; the optional INTERP path
stores KitInterpFunc* and runs blocks through the IR interpreter.
Translation and dispatch
kit_emu_step(e, nblocks) runs up to nblocks guest basic blocks. For
each iteration it reads the guest PC, looks it up (kit_emu_lookup),
calls the resulting host block, sets the next PC from the block's return
value, and inspects the CPUState trap reason:
- EMU_TRAP_EXIT -> mark done, capture exit code, stop
- EMU_TRAP_FAULT -> panic (a fault not converted into a signal frame)
- otherwise continue to the next block
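The loop shape described above can be sketched with two toy blocks standing in for translated host code. Everything here is hypothetical (DemoCpu, demo_lookup, demo_step, and the block functions are not kit names); the sketch only shows that a block returns the next guest PC while exit and fault conditions travel through the trap field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real CPUState and block type live in kit. */
typedef enum { DEMO_TRAP_NONE, DEMO_TRAP_EXIT, DEMO_TRAP_FAULT } DemoTrap;

typedef struct {
    uint64_t pc;
    DemoTrap trap;
    int exit_code;
    uint64_t counter;  /* toy guest register */
} DemoCpu;

typedef uint64_t (*DemoBlockFn)(DemoCpu *);

/* Toy block at 0x1000: loops until the counter reaches 3. */
static uint64_t demo_block_loop(DemoCpu *cpu) {
    cpu->counter++;
    return cpu->counter < 3 ? 0x1000u : 0x2000u;
}

/* Toy block at 0x2000: records an exit trap instead of returning a code. */
static uint64_t demo_block_exit(DemoCpu *cpu) {
    cpu->trap = DEMO_TRAP_EXIT;
    cpu->exit_code = 7;
    return cpu->pc;
}

/* Stand-in for the code cache lookup: guest PC to host entry. */
static DemoBlockFn demo_lookup(uint64_t pc) {
    switch (pc) {
    case 0x1000: return demo_block_loop;
    case 0x2000: return demo_block_exit;
    default:     return NULL;
    }
}

/* Runs up to nblocks blocks. Returns the number executed, or -1 on a
 * lookup miss or an unconverted fault. *done is set when the guest exits. */
static int demo_step(DemoCpu *cpu, int nblocks, int *done) {
    int ran = 0;
    *done = 0;
    while (ran < nblocks) {
        DemoBlockFn fn = demo_lookup(cpu->pc);
        if (fn == NULL) return -1;
        cpu->pc = fn(cpu);          /* block returns the next guest PC */
        ran++;
        if (cpu->trap == DEMO_TRAP_EXIT) { *done = 1; break; }
        if (cpu->trap == DEMO_TRAP_FAULT) return -1;
    }
    return ran;
}
```

In this sketch a fault is reported as -1; in the design described here the equivalent point escalates to a host panic.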
kit_emu_lookup is the cold-miss path. A cache hit short-circuits even
the panic boundary. On a miss translate_block runs:
- decode_block decodes a single basic block: it walks instructions until it
  hits a terminator, the per-block cap (EMU_MAX_INSTS_PER_BLOCK), a decode
  failure, or a mapped-range boundary, reading through a bounds-checked
  EMU_MEM_EXEC host pointer.
- A fresh ObjBuilder and CG function are created. The block is one host
  function with the block signature (emu_block_fn_type, see Guest CPUState)
  and a guest-PC-derived symbol name (emu_block_<16-hex-pc>: fixed width,
  the full 64-bit guest PC in hex). The encoding is one-to-one on guest PCs:
  two translations of the same PC map to the same symbol, and distinct PCs
  map to distinct symbols within the linker's global pool.
- ArchEmuOps.lift_block emits the body. The block returns the next guest PC;
  traps/exits are recorded in CPUState, not the return value, so the
  dispatcher observes them after the call.
- CG is finalized, the object built, and a one-shot link session (output
  kind JIT) links it with emu_runtime_extern_resolver wired in. The block
  entry is looked up by symbol in the resulting KitJit.
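The fixed-width symbol naming can be illustrated in a few lines. This is a sketch only: demo_block_symbol is a hypothetical helper, and lowercase hex digits are an assumption, since the text specifies only sixteen hex digits.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* "emu_block_" (10 chars) plus 16 hex digits. */
enum { DEMO_BLOCK_SYM_LEN = 26 };

/* Formats the symbol for a guest PC. Zero padding to a fixed width means
 * every 64-bit PC gets exactly one name and no two PCs share a name. */
static void demo_block_symbol(uint64_t guest_pc, char out[DEMO_BLOCK_SYM_LEN + 1]) {
    int n = snprintf(out, DEMO_BLOCK_SYM_LEN + 1,
                     "emu_block_%016" PRIx64, guest_pc);
    assert(n == DEMO_BLOCK_SYM_LEN);
    (void)n;
}
```

For a guest PC of 0x10078 this yields emu_block_0000000000010078.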
Each cold block is published as its own standalone one-block JIT image.
The image is retained in the emu's jits vector (so its executable memory
stays mapped for the emu's lifetime) and the block entry is inserted into
the code cache. There is no cross-image relocation between blocks; control
flows from block to block only through the dispatcher and the next-PC
return value.
The code cache and invalidation
src/emu/runtime.c holds the cache: an open-addressed, linear-probe hash
from guest PC to host entry, grown by doubling, never evicted, created
lazily on first lookup (and requiring a wired JIT host).
Self-modifying / dynamically-patched guest code is handled through a
generation counter on the address space. Writes to translated pages,
dynamic relocations, and explicit invalidation all bump the generation
and clear the per-page "translated" bit. kit_emu_lookup compares the
cache's recorded generation against the address space's current
generation; on mismatch it drops the entire cache and re-translates on
demand. This is coarse but correct: stale host code is never executed.
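A minimal sketch of such a cache follows: open addressing with linear probing, growth by doubling, no eviction, and a whole-cache drop when the recorded generation differs from the address space's. The names (DemoCache, demo_cache_insert, demo_cache_lookup), the hash constant, the initial capacity of 8, and the 1/2 load-factor threshold are all assumptions for illustration, not what runtime.c does.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint64_t pc; void *entry; int used; } DemoSlot;

typedef struct {
    DemoSlot *slots;
    size_t cap;           /* power of two; 0 until the first insert */
    size_t count;
    uint64_t generation;  /* address-space generation the entries belong to */
} DemoCache;

static size_t demo_hash(uint64_t pc, size_t cap) {
    return (size_t)((pc * 0x9E3779B97F4A7C15ull) >> 32) & (cap - 1);
}

static int demo_cache_insert(DemoCache *c, uint64_t pc, void *entry);

/* Doubles the table (or creates it) and reinserts the live entries. */
static int demo_cache_grow(DemoCache *c) {
    size_t old_cap = c->cap;
    size_t new_cap = old_cap ? old_cap * 2 : 8;
    DemoSlot *old = c->slots;
    DemoSlot *fresh = calloc(new_cap, sizeof *fresh);
    if (fresh == NULL) return -1;
    c->slots = fresh;
    c->cap = new_cap;
    c->count = 0;
    for (size_t i = 0; i < old_cap; i++)
        if (old[i].used) demo_cache_insert(c, old[i].pc, old[i].entry);
    free(old);
    return 0;
}

/* Inserts or overwrites. Load factor stays at or below 1/2, so a probe
 * always finds a free slot. */
static int demo_cache_insert(DemoCache *c, uint64_t pc, void *entry) {
    if ((c->count + 1) * 2 > c->cap && demo_cache_grow(c) != 0) return -1;
    size_t i = demo_hash(pc, c->cap);
    while (c->slots[i].used && c->slots[i].pc != pc)
        i = (i + 1) & (c->cap - 1);
    if (!c->slots[i].used) c->count++;
    c->slots[i] = (DemoSlot){ pc, entry, 1 };
    return 0;
}

/* Returns the cached entry, or NULL on a miss. A generation mismatch
 * drops every entry first, so stale host code is never returned. */
static void *demo_cache_lookup(DemoCache *c, uint64_t pc, uint64_t as_generation) {
    if (c->generation != as_generation) {
        free(c->slots);
        c->slots = NULL;
        c->cap = 0;
        c->count = 0;
        c->generation = as_generation;
    }
    if (c->cap == 0) return NULL;
    size_t i = demo_hash(pc, c->cap);
    while (c->slots[i].used) {
        if (c->slots[i].pc == pc) return c->slots[i].entry;
        i = (i + 1) & (c->cap - 1);
    }
    return NULL;
}
```

The absence of deletion is what lets plain linear probing work without tombstones; the only way entries leave is the wholesale flush.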
Block chaining (design intent)
The shipping strategy is deliberately the simple one — every cold block is its own one-block JIT image, and all inter-block control flows through the dispatcher's next-PC return. This keeps invalidation trivial (drop a cache entry, the image stays mapped harmlessly) and reuses the JIT path unchanged, at the cost of a dispatcher round-trip per block edge.
A faster strategy is anticipated but not wired into the live loop:
runtime.c defines EmuCodeRegion (an up-front PROT_NONE reservation
with write/runtime dual-aliasing and a monotonic RX high-water mark) and a
__emu_dispatch cross-block helper symbol. The intended shape is to
bump-allocate translated blocks into one growing RX image and patch
direct jumps between them in place (block chaining), falling back to
__emu_dispatch for edges whose target is not yet translated. That would
require incremental relocation into a shared image; the present design
keeps blocks relocation-isolated so that the generation-counter
invalidation can remain whole-cache.
Address-space mapping (image.c)
EmuAddrSpace (src/emu/image.c) is the only module that translates
guest virtual addresses to host pointers. It is a sparse VM model: an
ordered array of EmuMap regions with unmapped holes between them. There
is no flat guest-base + offset; the host owns the storage for each map
separately (heap-allocated bytes), so guest VAs need not be host-mapped
at the same address.
Each map records [start,end), permissions (EMU_MEM_READ/WRITE/EXEC),
a kind (anonymous, file-backed, or guard), and per-page dirty and
translated bitmaps. Responsibilities:
- map / unmap / protect with page-granular splitting and merging (unmap and
  protect carve a region out of the middle of a map by re-appending the
  unaffected head/tail pieces)
- a checked pointer accessor (emu_addr_space_ptr) that validates the range
  lies in one map and has the needed permissions, records a structured
  EmuMemFault (unmapped vs protection) on failure, and marks written pages
  dirty, flipping any previously-translated page back to untranslated and
  bumping the generation
- gap search for placing new maps, brk growth/shrink, copy-in for
  loader/stack setup, and explicit invalidation
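The checked accessor in the list above might look roughly like the following. This is a sketch under assumptions: the Demo* types, a 4 KiB page size, one byte per page for the dirty/translated bitmaps, and recording the starting address as the fault address are all illustrative choices, not the layout image.c uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { DEMO_PAGE = 4096 };  /* assumed page size */
enum { DEMO_MEM_READ = 1, DEMO_MEM_WRITE = 2, DEMO_MEM_EXEC = 4 };
typedef enum { DEMO_FAULT_NONE, DEMO_FAULT_UNMAPPED, DEMO_FAULT_PROT } DemoFaultKind;

typedef struct {
    uint64_t start, end;   /* guest range [start, end) */
    unsigned perms;
    uint8_t *host;         /* separately owned backing bytes */
    uint8_t *dirty;        /* one byte per page, for simplicity */
    uint8_t *translated;   /* one byte per page, for simplicity */
} DemoMap;

typedef struct {
    DemoMap *maps;         /* ordered, non-overlapping, with holes */
    size_t nmaps;
    uint64_t generation;
    DemoFaultKind fault;   /* last recorded fault */
    uint64_t fault_addr;
} DemoAddrSpace;

static void *demo_fault(DemoAddrSpace *as, DemoFaultKind kind, uint64_t va) {
    as->fault = kind;
    as->fault_addr = va;
    return NULL;
}

/* Translates [va, va+len) to a host pointer, or records a fault. */
static void *demo_as_ptr(DemoAddrSpace *as, uint64_t va, uint64_t len, unsigned need) {
    for (size_t i = 0; i < as->nmaps; i++) {
        DemoMap *m = &as->maps[i];
        if (va < m->start || va >= m->end) continue;
        if (len > m->end - va) break;  /* range must lie in one map */
        if ((m->perms & need) != need)
            return demo_fault(as, DEMO_FAULT_PROT, va);
        if ((need & DEMO_MEM_WRITE) && len > 0) {
            uint64_t first = (va - m->start) / DEMO_PAGE;
            uint64_t last = (va + len - 1 - m->start) / DEMO_PAGE;
            int stale = 0;
            for (uint64_t p = first; p <= last; p++) {
                m->dirty[p] = 1;
                if (m->translated[p]) { m->translated[p] = 0; stale = 1; }
            }
            if (stale) as->generation++;  /* cached translations are now stale */
        }
        return m->host + (va - m->start);
    }
    return demo_fault(as, DEMO_FAULT_UNMAPPED, va);
}
```

A read never touches the bitmaps, so only writes that land on a page holding translated code pay the invalidation cost.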
emu_cpu_attach_addr_space lets the CPUState borrow the address space so
runtime helpers can translate without threading the process pointer.
Runtime, helpers, and the extern resolver (runtime.c)
The runtime is in-process: there is no separate runtime object file to
link. Lifted blocks call helper functions by referencing undefined extern
symbols (EMU_SYM_*, e.g. __emu_load64, __emu_store32,
__emu_syscall, __emu_cpu_state). At link time the linker calls
emu_runtime_extern_resolver, which maps each name to the host address of
the matching C function (or, for __emu_cpu_state, to the running emu's
CPUState pointer). Unrecognized shared names fall through to the arch's
resolve_runtime_helper hook, so a backend can register its own
arch-private helpers (the RV64 backend registers its register-file and
jalr helpers this way).
Anything still unresolved becomes the linker's ordinary
undefined-symbol diagnostic.
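The resolution order described above (shared helper names first, then the arch hook, then unresolved) can be sketched as a simple lookup. The Demo* types, the callback shape, and the __demo_rv64_jalr name are hypothetical; only the __emu_* names come from the text, and returning 0 stands in for leaving the symbol to the linker's undefined-symbol diagnostic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint64_t pc; } DemoCpuState;

/* Arch fallback hook: returns 0 when the name is not arch-private. */
typedef uintptr_t (*DemoArchHook)(const char *name);

typedef struct {
    DemoCpuState *cpu;         /* the running emu's CPUState */
    DemoArchHook arch_resolve; /* may be NULL */
} DemoResolverCtx;

/* Placeholder helper bodies; only their addresses matter here. */
static uint64_t demo_helper_load64(uint64_t va) { return va; }
static void demo_helper_syscall(void) {}
static void demo_rv64_jalr_helper(void) {}

/* Maps an undefined extern name to a host address, or 0 if unknown. */
static uintptr_t demo_resolve_extern(const DemoResolverCtx *ctx, const char *name) {
    /* Data symbol: resolves to the CPUState pointer, not a function. */
    if (strcmp(name, "__emu_cpu_state") == 0) return (uintptr_t)ctx->cpu;
    /* Shared helpers resolve to in-process C functions. */
    if (strcmp(name, "__emu_load64") == 0) return (uintptr_t)demo_helper_load64;
    if (strcmp(name, "__emu_syscall") == 0) return (uintptr_t)demo_helper_syscall;
    /* Anything else is offered to the arch backend. */
    if (ctx->arch_resolve != NULL) return ctx->arch_resolve(name);
    return 0;
}

/* Example arch hook with one made-up arch-private helper name. */
static uintptr_t demo_rv64_hook(const char *name) {
    if (strcmp(name, "__demo_rv64_jalr") == 0)
        return (uintptr_t)demo_rv64_jalr_helper;
    return 0;
}
```

Because the helpers are ordinary functions in the host process, no runtime object file has to be built or linked alongside each block.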
Memory helpers come in two flavors. The plain emu_mem_loadN set
bounds-checks against the CPUState's window and, on miss, writes
EMU_TRAP_FAULT and returns zero — the dispatcher then stops. The
checked variants (emu_mem_loadN_checked, all stores) take the faulting
PC and the fall-through next PC and, on a fault, route through
emu_fault_deliver so an OS personality can convert the fault into a
guest signal frame and hand back a resume PC — this is how a guest SIGSEGV
handler runs instead of the process dying.
The syscall trampoline (emu_syscall / emu_syscall_next) is purely a
marshaller: it asks the OS vtable to decode guest registers into an
EmuSyscallRequest, calls the wired bindings.syscall to service it, and
asks the OS to encode the result back into the guest return register.
The runtime never issues a host syscall itself.
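The decode / service / encode split can be sketched as below. The struct and function names are hypothetical. The register assignment follows the RV64 Linux convention (number in a7, arguments in a0 to a5, result in a0), and the toy binding uses syscall number 64 (write on riscv64) and -38 (-ENOSYS) purely as an example of the Linux errno style.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t x[32]; } DemoCpu;
typedef struct { uint64_t nr; uint64_t args[6]; } DemoSyscallRequest;

/* OS personality hooks: translate between guest registers and requests. */
typedef struct {
    void (*decode)(const DemoCpu *cpu, DemoSyscallRequest *req);
    void (*encode)(DemoCpu *cpu, int64_t result);
} DemoOsOps;

/* Embedder-supplied policy: decides what a request means. */
typedef int64_t (*DemoSyscallBinding)(void *user, const DemoSyscallRequest *req);

static void demo_rv64_linux_decode(const DemoCpu *cpu, DemoSyscallRequest *req) {
    req->nr = cpu->x[17];                       /* a7 */
    for (int i = 0; i < 6; i++)
        req->args[i] = cpu->x[10 + i];          /* a0..a5 */
}

static void demo_rv64_linux_encode(DemoCpu *cpu, int64_t result) {
    cpu->x[10] = (uint64_t)result;              /* a0; negative means -errno */
}

/* The trampoline itself: pure marshalling, no host syscall issued here. */
static void demo_syscall(DemoCpu *cpu, const DemoOsOps *os,
                         DemoSyscallBinding service, void *user) {
    DemoSyscallRequest req;
    os->decode(cpu, &req);
    int64_t result = service(user, &req);
    os->encode(cpu, result);
}

/* Toy binding: pretends write() consumed every byte, rejects the rest. */
static int64_t demo_service(void *user, const DemoSyscallRequest *req) {
    int *calls = user;
    (*calls)++;
    if (req->nr == 64) return (int64_t)req->args[2];
    return -38;  /* -ENOSYS */
}
```

Swapping demo_service for a different binding changes what the guest observes without touching the trampoline or the OS hooks.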
Guest CPUState (cpu.c)
EmuCPUState (src/emu/cpu.c) is the per-thread guest register/trap
record. The core keeps it deliberately thin: PC, trap reason, exit code, a
borrowed address-space pointer, and an opaque arch-private blob
(arch_state) sized and owned by the backend
(emu_cpu_new_with_arch_state). The core never interprets the register
file; the arch's helpers (get_gpr/set_gpr, get_sp/set_sp,
get_tp/set_tp, syscall arg/result accessors) do. Trap state
(EMU_TRAP_EXIT / EMU_TRAP_FAULT) is how a block tells the dispatcher
to stop.
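The thin-core / opaque-blob split might be sketched as follows. The field set mirrors the list above, but the struct layout, the two-allocation scheme, and the Demo* names are assumptions for illustration. The only ISA fact relied on is that RISC-V register x0 always reads as zero.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Core record: the core code never looks inside arch_state. */
typedef struct {
    uint64_t pc;
    int trap;
    int exit_code;
    void *addr_space;   /* borrowed, not owned */
    void *arch_state;   /* opaque blob sized by the backend */
} DemoCpuState;

/* Allocates the core record plus a zeroed arch blob of the given size. */
static DemoCpuState *demo_cpu_new_with_arch_state(size_t arch_size) {
    DemoCpuState *cpu = calloc(1, sizeof *cpu);
    if (cpu == NULL) return NULL;
    cpu->arch_state = calloc(1, arch_size ? arch_size : 1);
    if (cpu->arch_state == NULL) { free(cpu); return NULL; }
    return cpu;
}

static void demo_cpu_free(DemoCpuState *cpu) {
    if (cpu != NULL) { free(cpu->arch_state); free(cpu); }
}

/* Backend side: only the arch knows the blob is a register file. */
typedef struct { uint64_t x[32]; } DemoRv64Regs;

static uint64_t demo_rv64_get_gpr(const DemoCpuState *cpu, unsigned i) {
    const DemoRv64Regs *r = cpu->arch_state;
    assert(i < 32);
    return i == 0 ? 0 : r->x[i];   /* x0 is hardwired to zero */
}

static void demo_rv64_set_gpr(DemoCpuState *cpu, unsigned i, uint64_t v) {
    DemoRv64Regs *r = cpu->arch_state;
    assert(i < 32);
    if (i != 0) r->x[i] = v;       /* writes to x0 are discarded */
}
```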
cpu.c also defines the CG types the lifter needs: EmuThread* (modeled
as void*) and the block signature, which uses i64 as the return type in
the CG type system (emu_block_fn_type). The direct-call typedef in
emu.c (u64 (*)(EmuThread*)) is the unsigned spelling of the same thing;
the dispatch loop treats the return as a 64-bit machine word (the next
guest PC) and never depends on its sign. Lifted blocks take the thread
pointer as their one parameter and reach guest registers and memory only
through helper calls — they hold no inline state.
Dynamic loader and relocation (dl.c)
src/emu/dl.c owns the runtime-only dynamic-linking work, sitting above
the object-format emu vtable which supplies all format-specific parsing.
After the main object is mapped, emu_dl_load_dependencies_and_relocate:
- Loads needed objects: for each object, iterates DT_NEEDED entries; an
  entry not already in the link map is fetched via the resolve_object
  binding (so the embedder controls the search) and mapped with the
  format's map_object (an ET_DYN gets a load bias assigned into a VM gap).
- Rebuilds TLS modules across the new link map (see TLS).
- Applies relocations for every object's main and PLT tables. Each
  relocation is classified by the format/arch into a neutral class
  (relative, symbolic, or import-slot) plus a RelocKind; the symbol value S
  is resolved through the link map first, then via the resolve_import
  binding. The final bytes are patched in mapped guest memory through a
  checked writable pointer, and the patched range is invalidated so any
  cached translation of that page is dropped.
Symbol value resolution and relocation byte patching are shared with the
linker: emu_apply_reloc_bytes is a thin wrapper over the neutral
link_reloc_apply (src/obj/reloc_apply.c), so PC-relative, absolute,
GOT-slot, etc. encodings are computed identically whether the linker is
laying out a new image or the emulator is patching an input image at
runtime. See LINK.md and OBJ.md.
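To make the three neutral classes concrete, here is a sketch of the value computation for 64-bit absolute slots, using the usual ELF formulas (B + A for relative, S + A for symbolic, S for an import slot). The enum and function names are hypothetical, and this does not reproduce link_reloc_apply, which also covers PC-relative and other encodings.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    DEMO_RELOC_RELATIVE,     /* value = B + A */
    DEMO_RELOC_SYMBOLIC,     /* value = S + A */
    DEMO_RELOC_IMPORT_SLOT   /* value = S     */
} DemoRelocClass;

/* B is the object's load bias, S the resolved symbol value, A the addend. */
static uint64_t demo_reloc_value(DemoRelocClass cls, uint64_t load_bias,
                                 uint64_t sym_value, int64_t addend) {
    switch (cls) {
    case DEMO_RELOC_RELATIVE:    return load_bias + (uint64_t)addend;
    case DEMO_RELOC_SYMBOLIC:    return sym_value + (uint64_t)addend;
    case DEMO_RELOC_IMPORT_SLOT: return sym_value;
    }
    return 0;
}

/* Patches a 64-bit little-endian slot byte by byte, so the result does
 * not depend on host endianness or alignment. */
static void demo_store_le64(uint8_t *slot, uint64_t value) {
    for (int i = 0; i < 8; i++)
        slot[i] = (uint8_t)(value >> (8 * i));
}
```

In the flow described above, the slot pointer would come from the checked writable accessor, and the patched range would be invalidated afterwards.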
Import binding / thunks. An import resolving to a guest address binds
directly. One resolving to a native host function gets a guest import
thunk: emu_dl_init_process reserves a small executable guest VA range;
each host-backed import is assigned a slot, the arch emits a thunk there
(emit_import_thunk), and an EmuImportBinding records the host function
plus a typed signature. When the guest calls into the thunk range, the
arch's call helper detects it (emu_dl_resolve_import_thunk, a
base+size range check that returns the matching EmuImportBinding).
emu_call_host_import then marshals the call: it reads the guest argument
registers per the arch's syscall/call ABI into a small u64 array, casts
the recorded host pointer to a function type chosen by the binding's
EmuValue signature, invokes it, and writes the result back into the
guest return register. Arguments are passed as raw 64-bit words; the
signature is the contract that fixes arity and which words are live.
Marshalling is intentionally narrow — host imports are limited to a small
fixed argument count (integer/pointer words today), and a call that
exceeds it is KIT_UNSUPPORTED rather than a guessed ABI. A binding
with no declared signature defaults to a single-u64 shape. The
resolve_import binding is policy-only — it decides whether and to
what a symbol resolves; the loader owns GOT/thunk writes and the
marshalling contract.
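The narrow marshalling contract can be sketched as a switch over arity. The three-argument limit, the Demo* names, and the int status return are assumptions; the real limit and the EmuValue signature encoding are not specified here. The key point shown is that the stored pointer is cast back to the exact function type before the call, and that an unsupported arity is rejected rather than guessed.

```c
#include <assert.h>
#include <stdint.h>

enum { DEMO_MAX_IMPORT_ARGS = 3 };  /* assumed limit for the sketch */

typedef void (*DemoAnyFn)(void);    /* generic storage type */
typedef uint64_t (*DemoFn0)(void);
typedef uint64_t (*DemoFn1)(uint64_t);
typedef uint64_t (*DemoFn2)(uint64_t, uint64_t);
typedef uint64_t (*DemoFn3)(uint64_t, uint64_t, uint64_t);

typedef struct {
    DemoAnyFn host_fn;  /* host function, stored type-erased */
    int nargs;          /* the signature fixes how many words are live */
} DemoImportBinding;

/* Calls the host function with raw 64-bit words taken from the guest
 * argument registers. Returns 0 on success, -1 for an unsupported arity. */
static int demo_call_host_import(const DemoImportBinding *b,
                                 const uint64_t *args, uint64_t *ret) {
    switch (b->nargs) {
    case 0: *ret = ((DemoFn0)b->host_fn)(); return 0;
    case 1: *ret = ((DemoFn1)b->host_fn)(args[0]); return 0;
    case 2: *ret = ((DemoFn2)b->host_fn)(args[0], args[1]); return 0;
    case 3: *ret = ((DemoFn3)b->host_fn)(args[0], args[1], args[2]); return 0;
    default: return -1;  /* refuse rather than guess an ABI */
    }
}

/* Example host function whose real type matches DemoFn2. */
static uint64_t demo_host_add(uint64_t a, uint64_t b) { return a + b; }
```

The declared arity must match the host function's real type; calling through a mismatched function type is undefined behavior in C, which is why the signature is treated as a contract.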
Guest TLS (tls.c)
src/emu/tls.c models thread-local storage as a process-owned module list
plus per-thread blocks. emu_tls_rebuild_modules assigns a module ID to
each loaded object that has a PT_TLS segment and accumulates the static
TLS size/alignment. Per-thread allocation lives in the OS layer
(linux_init_thread): for each module it maps an anonymous block, copies
the module's .tdata image in (.tbss stays zero), records an
EmuTlsBlock, and for the initial module sets the guest thread pointer
via the arch's set_tp. The TLS image bytes are read and written through
EmuAddrSpace like any other guest memory, never through linker buffers.
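The two halves of that description (module bookkeeping, then per-thread block initialization) can be sketched as follows. The Demo* names are hypothetical, numbering modules from 1 and packing offsets in load order are assumptions, and the block is a plain host buffer here, whereas the text has it mapped and written through EmuAddrSpace.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    const uint8_t *tdata;  /* initialized image (.tdata), may be NULL */
    size_t filesz;         /* bytes of initialized data */
    size_t memsz;          /* total size including zero-filled .tbss */
    size_t align;          /* power of two */
    size_t module_id;      /* assigned by the rebuild pass */
    size_t offset;         /* assigned by the rebuild pass */
} DemoTlsModule;

static size_t demo_align_up(size_t v, size_t a) { return (v + a - 1) & ~(a - 1); }

/* Assigns module IDs and accumulates static TLS size and alignment.
 * Returns the total static size; *max_align receives the alignment. */
static size_t demo_tls_rebuild_modules(DemoTlsModule *mods, size_t n, size_t *max_align) {
    size_t total = 0;
    *max_align = 1;
    for (size_t i = 0; i < n; i++) {
        assert(mods[i].align != 0 && (mods[i].align & (mods[i].align - 1)) == 0);
        assert(mods[i].filesz <= mods[i].memsz);
        mods[i].module_id = i + 1;
        total = demo_align_up(total, mods[i].align);
        mods[i].offset = total;
        total += mods[i].memsz;
        if (mods[i].align > *max_align) *max_align = mods[i].align;
    }
    return demo_align_up(total, *max_align);
}

/* Initializes one per-thread block: copy .tdata, leave .tbss zeroed. */
static void demo_tls_init_block(uint8_t *block, const DemoTlsModule *m) {
    if (m->filesz != 0) memcpy(block, m->tdata, m->filesz);
    memset(block + m->filesz, 0, m->memsz - m->filesz);
}
```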
Guest-OS personality (src/os)
The guest OS is a pluggable registry. src/os/registry.c maps a
KitOSKind to a KitOsImpl vtable; Linux (src/os/linux/linux.c) is
the only implementation. The OS owns everything about the user-mode
process convention that is neither arch nor object-format specific:
- process/thread private state (the Linux impl stores the mmap hint, the
  per-signal action table, the per-thread signal mask and TLS blocks)
- initial stack layout: argv/envp copy-in, the AT_RANDOM block, the aux
  vector, 16-byte alignment, and the argc/argv/envp/auxv table on the guest
  stack; it also sizes stack, guard page, and brk reserve and kicks off
  dynamic loading
- syscall ABI and default table: decode the number and six args from
  guest registers, service them, and encode the result back, plus a
  per-syscall next_pc hook (so rt_sigreturn resumes at the restored PC
  rather than past the trap). The bundled table covers process exit, the
  core file/io and memory calls (brk, anonymous mmap/munmap/mprotect
  through the VM API with Linux errno semantics), clock, identity, and
  signal calls, which is enough to run a static Linux program.
- signal delivery: rt_sigaction records handler/flags/restorer, and
  rt_sigprocmask maintains the blocked mask; faults arrive via
  emu_deliver_fault, which builds a guest signal frame on the stack (saving
  the interrupted context through arch hooks) and redirects to the handler;
  rt_sigreturn later restores the interrupted state. Blocked or unhandled
  signals fall back to a process fault.
- map-region placement (emu_find_map_region / emu_note_map_region): where
  anonymous maps, DL thunks, and TLS blocks land in the VA space.
The OS layer translates between guest ABI state and emulator-level
requests; it does not perform host I/O or resolve host symbols. The C
calling-convention ABI is separate, derived from (arch, obj-format) by
src/abi/registry.c; the OS vtable does not pick ABI vtables.
Guest ELF loader (src/obj/elf/emu_load.c)
The ELF loader implements ObjFormatImpl.emu (ObjFormatEmuOps,
elf_emu_ops), handling ELF64 little-endian executable loading via program
headers. It parses the header directly rather than going through
elf_read.c — the reader builds an ObjBuilder for the linker, which is
the wrong output here. It maps ET_EXEC for the main object and ET_DYN
for dependencies (assigned a load bias into a VM gap), extracting the
dynamic, TLS, and interpreter metadata that the rest of the pipeline needs.
The interface backs the dynamic loader's needed-entry iteration, dynamic
symbol lookup (by name and index), relocation iteration, and relocation
classification (ELF type numbers -> neutral classes + RelocKind). The
program-header types and dynamic tables are the loading contract; the
per-object format-private blob is EmuElfDynInfo.
Architecture vtable
A guest ISA plugs in via ArchEmuOps (src/arch/arch.h) on top of the
shared ArchDecodeOps decoder (also used by objdump and dbg, see
ARCH.md). The emu ops provide CPUState construction, the CG
types for the block function, the lifter (lift_block), register/SP/TP
accessors, the syscall-register ABI, signal-context save/restore, and the
import-thunk emit/size. rv64 is the implemented backend; its lifter
emits one CG function per block that returns the next guest PC and calls
the shared __emu_* helpers plus its own register-file helpers, with
control-transfer and ecall instructions acting as block terminators.
Optional INTERP execution
The same lifted-block path can run through the kit IR interpreter
instead of host code (KIT_EMU_MODE_INTERP, requires
KIT_INTERP_ENABLED; forces at least -O1 so the optimizer's PReg-path
IR is available for capture). The mode is chosen once at construction, not
per block.
The two strategies differ only in what the code cache stores. In JIT mode
the payload is a host code entry, invoked directly through the
EmuBlockFn typedef. In INTERP mode the block is still linked as a JIT
image — that link is what resolves and validates the helper externs and
proves the block is well-formed — but the cache payload is the captured
KitInterpFunc*, and dispatch runs it on a per-emu interpreter program
and stack, seeding the thread pointer as the single argument and shuttling
back the next-PC return. Either way guest registers and memory are reached
only through the same __emu_* helpers, so the interpreter holds no guest
state and the lifter is identical across modes. A block the interpreter
cannot capture or run is a hard failure with a reason, never a silent
fallback to the JIT payload. See INTERPRETER.md.