kit

git clone https://git.ryansepassi.com/git/kit.git

Codegen Spine

This is the codegen spine that sits between kit's frontends and machine code. Frontends never speak machine: they drive the public KitCg stack-machine API (kit_cg_*), which lowers to a single internal semantic sink — CgTarget. CgTarget has several realizations selected by opt-level and output kind: a direct machine-code emitter for -O0, an IR recorder that feeds the optimizer and interpreter, and source-like targets (the C backend, wasm). All the native -O0 backends share one CgTarget implementation parameterized by per-arch hooks. This doc covers that layering, the public-API-to-CgTarget lowering, and the shared native infrastructure. See IR.md for the recorded IR, OPT.md for optimization and machine lowering, ARCH.md for per-arch emission, CBACKEND.md and WASM.md for the source-like targets, and INTERPRETER.md for the interpreter.

The two boundaries

Codegen has exactly two stable interfaces, stacked:

  frontend (C / toy / wasm-lang / preprocessor)
      |  public push/pop API
      v
  KitCg value stack            (src/cg/*.c, op families)
      |  semantic CgTarget vtable (src/cg/cgtarget.h)
      v
  CgTarget realization
      |-- NativeDirectTarget   -O0 direct emit  -> NativeOps + NativeTarget
      |-- CgIrRecorder         -O1/-O2/interp   -> recorded IR -> optimizer -> NativeTarget
      |-- C-source target / wasm                (semantic, source-like output)

CgTarget (the semantic interface) speaks in target-data-layout terms a frontend can produce: typed semantic locals, immediates/globals/indirect addresses, labels and structured scopes, typed loads/stores, arithmetic, calls/returns as local-valued descriptors, aggregates, bitfields, atomics, varargs, intrinsics, and inline asm. It deliberately exposes no machine state: no hard registers, no spill slots, no call plans, no CFG/SSA, no prologue patching. That keeps it recordable verbatim as semantic IR and lets one frontend path serve every backend.

NativeTarget (src/arch/native_target.h, the physical interface) is the other boundary — the post-register-allocation machine-emission contract. Every hook receives caller-selected, target-legal physical operands and emits one native operation; it is explicitly not a register allocator. Both the -O0 direct path and the optimizer's machine-emit path bottom out here.

The split exists because the two contracts are genuinely different levels. The semantic interface is "what the program means"; the physical interface is "emit this selected instruction with these registers." Folding them, as an earlier design did, forced the shared vtable to expose both IR concepts and backend register state at once. Keeping them apart lets each native arch implement only physical emission, while the semantic-to-physical bridge for -O0 is written once and shared.

The public KitCg API and value stack

include/kit/cg.h is a stack machine. A frontend pushes typed values, names types, and issues operations that pop operands and push results — kit_cg_func_begin, kit_cg_load/kit_cg_store, kit_cg_fp_binop, kit_cg_call, kit_cg_branch_true, kit_cg_block_begin, and so on. This API is the insulation layer: it is source-stable across all the internal changes below it. See FRONTENDS.md.

Every stack entry is exactly one of two kinds, and each op declares the kinds it consumes and produces: a PLACE (an addressable, typed location: a local's storage, a global, or a computed [base + index*scale + offset]) or a VALUE (a scalar rvalue: integer, float, pointer, or 128-bit scalar). Addressing is built explicitly:

  push_local                  -> PLACE
  addr    PLACE               -> VALUE(ptr)
  deref   VALUE(ptr)          -> PLACE
  field i PLACE(record)       -> PLACE
  elem    VALUE(ptr) + index  -> PLACE

Each op panics on a kind mismatch; CG never infers the kind or inserts an implicit dereference. load/store take a PLACE and no effective-address rider; the place ops fold the constant offset (deref/field) and scale (elem) into one OPK_INDIRECT, so the backend still gets a single addressing-mode memop. Aggregates are always a PLACE.
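The kind discipline can be illustrated with a minimal sketch. Everything here is hypothetical (the Sk* types, the sk_* functions, and the panicked flag that stands in for a real panic); it is not kit's implementation, only the shape of "each op checks the kind it consumes and folds constant offsets into the place".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature of the two entry kinds; not kit's real types. */
typedef enum { SK_PLACE, SK_VALUE } SkKind;

typedef struct {
    SkKind kind;
    int    base;    /* id of the local or pointer the entry is based on */
    long   offset;  /* constant byte offset folded in so far */
} SkEntry;

#define SK_STACK_MAX 16

typedef struct {
    SkEntry items[SK_STACK_MAX];
    size_t  depth;
    bool    panicked;  /* recorded instead of aborting, so the sketch is testable */
} SkStack;

static void sk_push(SkStack *s, SkEntry e) {
    if (s->depth == SK_STACK_MAX) { s->panicked = true; return; }
    s->items[s->depth++] = e;
}

/* Pop the top entry, "panicking" unless it has exactly the expected kind. */
static bool sk_pop_kind(SkStack *s, SkKind want, SkEntry *out) {
    if (s->panicked || s->depth == 0 || s->items[s->depth - 1].kind != want) {
        s->panicked = true;
        return false;
    }
    *out = s->items[--s->depth];
    return true;
}

/* push_local -> PLACE */
void sk_push_local(SkStack *s, int local_id) {
    sk_push(s, (SkEntry){ SK_PLACE, local_id, 0 });
}

/* addr: PLACE -> VALUE(ptr) */
void sk_addr(SkStack *s) {
    SkEntry e;
    if (!sk_pop_kind(s, SK_PLACE, &e)) return;
    e.kind = SK_VALUE;
    sk_push(s, e);
}

/* deref: VALUE(ptr) -> PLACE */
void sk_deref(SkStack *s) {
    SkEntry e;
    if (!sk_pop_kind(s, SK_VALUE, &e)) return;
    e.kind = SK_PLACE;
    sk_push(s, e);
}

/* field: PLACE(record) -> PLACE, folding the field's constant offset. */
void sk_field(SkStack *s, long field_offset) {
    SkEntry e;
    if (!sk_pop_kind(s, SK_PLACE, &e)) return;
    e.offset += field_offset;
    sk_push(s, e);
}
```

In this sketch two nested field accesses leave a single PLACE whose offset is the sum of both, and applying sk_field to a VALUE trips the kind check rather than silently dereferencing.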

The implementation lives in src/cg/, split by op family rather than one monolith: value.c (stack discipline, lvalue/rvalue conversion, operand materialization), fold.c (the -O0 semantic peephole — constant folding, the delayed compare/arith forms, and const-local store-to-load forwarding; contract in fold.h), memory.c (loads/stores/addressing/aggregates), arith.c, control.c (labels, branches, scopes, switch, computed goto), call.c, atomic.c, asm.c, type.c, local.c, data.c, wide.c (128-bit scalars), with shared state and helpers in internal.h and lifecycle in session.c. (Files are named per family; there is no literal api_*.c prefix.)

The value stack's job is purely semantic lowering. Each entry (ApiSValue) is one of an operand (immediate / constant / semantic local / lvalue address), a delayed compare, or a delayed arith. The delayed forms are held un-emitted so a following branch can fuse a compare instead of materializing a 0/1, or so a small immediate can flow straight into a binop. These delayed forms and the constant folding around them are the -O0 peephole; it lives in fold.c (contract in fold.h), kept isolated from the stack discipline.

Both delayed forms are live: the delayed compare fuses into a following branch, and the delayed arith (admitted by api_can_delay_int_arith for an unflagged foldable integer op) flows an unmaterialized binop/unop into a following op so an immediate chain folds or an identity collapses, materializing only when a consumer needs a value. (The delayed arith was gated off while the load/store addressing rider existed; the strict place/value rework removed the rider and re-enabled it.)

The stack does not own registers, frame slots, spill policy, or caller-saved preservation; those moved down into the target realizations. When an operation needs a value emitted, the stack calls the corresponding g->target->op(...) semantic hook with local-only operands.
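A minimal sketch of the delayed-compare idea, under stated assumptions: the SkValue/SkSink types and the three "emitted op" tags are invented for illustration and do not mirror ApiSValue or the real CgTarget hooks. The point is only that a compare emits nothing when created, and the consumer decides whether it becomes a fused compare-and-branch or a materialized 0/1.

```c
#include <assert.h>

/* Hypothetical entry forms: a plain operand or a compare held un-emitted. */
typedef enum { SK_OPERAND, SK_DELAYED_CMP } SkForm;

typedef struct {
    SkForm form;
    int    a, b;  /* SK_OPERAND: local id in a; SK_DELAYED_CMP: compared locals */
} SkValue;

/* Tags for what the sketch "emits"; stand-ins for real target hooks. */
typedef enum { SK_EMIT_SETCC, SK_EMIT_CMP_BRANCH, SK_EMIT_BRANCH_NONZERO } SkEmitted;

typedef struct {
    SkEmitted ops[8];
    int       count;
    int       next_temp;  /* next fresh temporary local id */
} SkSink;

static void sk_emit(SkSink *s, SkEmitted op) {
    assert(s->count < 8);
    s->ops[s->count++] = op;
}

/* Creating a compare emits nothing; it is only described. */
SkValue sk_compare(int lhs, int rhs) {
    return (SkValue){ SK_DELAYED_CMP, lhs, rhs };
}

/* A consumer that needs a real value forces the 0/1 into a temporary. */
SkValue sk_materialize(SkSink *s, SkValue v) {
    if (v.form == SK_OPERAND) return v;
    sk_emit(s, SK_EMIT_SETCC);
    return (SkValue){ SK_OPERAND, s->next_temp++, 0 };
}

/* A branch fuses a delayed compare; a plain operand is tested against zero. */
void sk_branch_true(SkSink *s, SkValue v) {
    if (v.form == SK_DELAYED_CMP) sk_emit(s, SK_EMIT_CMP_BRANCH);
    else                          sk_emit(s, SK_EMIT_BRANCH_NONZERO);
}
```

Branching directly on the result of sk_compare costs one fused op in this sketch, while materializing first costs a set-condition plus a test-and-branch.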

Switch is a good example of the semantic/structured division. CgTarget carries an optional switch_ hook and a supports_label_table query. Native arches leave switch_ NULL and the shared cg_lower_switch_default lowers the structured descriptor into cmp_branch/jump/indirect_branch + a rodata label table. Source-like targets override switch_ to emit a native construct (the C target a real switch, a wasm target br_table). Same semantic input, different realization.
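The optional-hook pattern can be sketched as follows. The struct layout, the counters, and the rule "use a label table when supported, otherwise a compare chain" are assumptions made for this example; the real cg_lower_switch_default may choose between strategies differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cut-down target: one optional hook plus one capability query. */
typedef struct SkTarget SkTarget;
struct SkTarget {
    void (*switch_)(SkTarget *t, int ncases);  /* optional; NULL = shared default */
    bool supports_label_table;
    int  native_switches;  /* counters standing in for emitted constructs */
    int  label_tables;
    int  cmp_branches;
};

/* Shared default lowering used when the target leaves switch_ NULL. */
static void sk_lower_switch_default(SkTarget *t, int ncases) {
    if (t->supports_label_table) {
        t->label_tables++;          /* table in rodata + one indirect branch */
    } else {
        t->cmp_branches += ncases;  /* one compare-and-branch per case */
    }
}

/* Dispatch: the same structured input, realized per target. */
void sk_switch(SkTarget *t, int ncases) {
    if (t->switch_ != NULL) t->switch_(t, ncases);
    else                    sk_lower_switch_default(t, ncases);
}

/* A source-like target overrides the hook and emits its own construct. */
void sk_c_target_switch(SkTarget *t, int ncases) {
    (void)ncases;
    t->native_switches++;
}
```

A target that sets switch_ never reaches the default lowering, so the shared code needs no knowledge of source-like output.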

CgTarget realizations

session.c's kit_cg_begin picks the realization. It asks the arch registry (cg_backend_for_session, src/arch/registry.c) for a CGBackend whose make builds the base CgTarget for this target arch and output kind, then conditionally wraps it:

The arch make always builds the NativeDirectTarget as the leaf; the optimizer wrapper, when present, sits above it and reaches the leaf's NativeTarget through native_direct_target_native. So the semantic surface for native arches is implemented exactly once, in the shared NativeDirectTarget; there is no separate per-arch semantic CgTarget.

The IR recorder

CgIrRecorder is a thin CgTarget that turns each semantic call into exactly one IR instruction in a per-function CgIrFunc, preserving operands, sticky source locations, tail-call policy, and global references. It is purely a sink: finalize triggers the optimizer's cross-function passes and per-function lowering. From the recorded clean IR, the optimizer derives its own CFG/SSA/MIR/allocated-MIR views and finally calls opt_emit_native to drive the arch NativeTarget. The same recorded IR feeds the interpreter via a sibling lowering path (opt_run_o1_interp). See IR.md, OPT.md, and INTERPRETER.md.
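As a rough illustration of "one semantic call, one IR instruction", here is a sketch of a recording target. The three-hook vtable, the SkIrInst layout, and the integer line number used as a source location are all assumptions for the example and are much smaller than the real CgTarget and CgIrFunc.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical IR instruction with a source line attached. */
typedef enum { SK_IR_ADD, SK_IR_RET } SkIrOp;
typedef struct { SkIrOp op; int dst, a, b; int line; } SkIrInst;

/* Hypothetical three-hook semantic interface. */
typedef struct SkTarget SkTarget;
struct SkTarget {
    void (*set_loc)(SkTarget *t, int line);
    void (*add)(SkTarget *t, int dst, int a, int b);
    void (*ret)(SkTarget *t, int value);
};

typedef struct {
    SkTarget base;       /* first member, so SkTarget* converts back */
    SkIrInst insts[32];
    size_t   count;
    int      cur_line;   /* sticky: applies until the next set_loc */
} SkRecorder;

/* Append exactly one instruction, stamping the sticky location. */
static void sk_rec_append(SkRecorder *r, SkIrInst inst) {
    assert(r->count < 32);
    inst.line = r->cur_line;
    r->insts[r->count++] = inst;
}

static void sk_rec_set_loc(SkTarget *t, int line) {
    ((SkRecorder *)t)->cur_line = line;
}

static void sk_rec_add(SkTarget *t, int dst, int a, int b) {
    sk_rec_append((SkRecorder *)t, (SkIrInst){ SK_IR_ADD, dst, a, b, 0 });
}

static void sk_rec_ret(SkTarget *t, int value) {
    sk_rec_append((SkRecorder *)t, (SkIrInst){ SK_IR_RET, -1, value, -1, 0 });
}

void sk_recorder_init(SkRecorder *r) {
    r->base = (SkTarget){ sk_rec_set_loc, sk_rec_add, sk_rec_ret };
    r->count = 0;
    r->cur_line = 0;
}
```

The caller drives the recorder through the same function pointers it would use for a direct emitter; nothing is emitted, and the instruction array is the only product.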

Shared native -O0: NativeDirectTarget

src/cg/native_direct_target.c is the single direct -O0 semantic CgTarget for all native arches (aa64/rv64/x64). It accepts semantic locals and ops and emits machine code in one pass, with no IR recording, liveness, CFG, or SSA. That single-pass property is the whole point of the -O0 path: it is the fast, low-overhead route for unoptimized builds, JIT, and bootstrapping.

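It is parameterized two ways: by the arch's NativeTarget, which performs the physical emission, and by a small NativeOps struct, which supplies the arch-sensitive direct-mode glue.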

NativeOps is intentionally tiny — it is not a second copy of CgTarget. Pure pass-throughs (register info, func_begin/func_end, frame-slot allocation, class_for_type, addr_legal) are called on the NativeTarget directly; NativeOps exists only for the arch-sensitive direct-mode glue.

What NativeDirectTarget owns (and what the value stack therefore does not): semantic-local allocation and metadata, assigning locals frame homes, direct scratch-register selection, a local register cache, dirty-flush/invalidation, materializing operands into physical values, storing results back, conservative flushes around calls/barriers, and outgoing-call-area tracking for frame finalization.

The semantic-vs-physical split, concretely

NativeOps is the semantic-side adapter — it answers questions phrased in semantic terms (CGParamDesc, Operand, CGCallDesc) and is used only on the -O0 direct path. NativeTarget is the physical-side emitter — it speaks NativeLoc (registers / frame slots / immediates / addresses) and is used by both -O0 (after NativeDirectTarget has chosen scratch regs and materialized operands) and the optimizer (after regalloc). The optimizer never touches NativeOps: anything it needs is in MIR or on NativeTarget. This is why the two structs exist side by side — they sit on opposite sides of the register-selection line.

The local register cache

The correct baseline gives every scalar local a frame home and round-trips it through memory per use — no liveness needed. On top of that baseline, NativeDirectTarget runs a write-back, basic-block-scoped local register cache that removes most of those round trips while staying single-pass. The load-bearing invariants are:

A monotonic use-tick drives approximate-LRU eviction; cached locals are tracked in an intrusive insertion-order list so flush-all is O(cached), not O(locals). The cache is never worse than the frame-only baseline — each cached local is stored at most once per boundary instead of once per definition. It is a local cache, not an allocator: the semantic API promises nothing about where a local lives, so the target is free to choose.
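The eviction and write-back behavior can be sketched in a few lines. This is a toy under stated assumptions: two registers, hypothetical Sk* names, counters in place of emitted loads and stores, and a linear scan over the register array instead of the intrusive list the real cache uses. Local ids are assumed to be non-negative.

```c
#include <assert.h>
#include <stdbool.h>

#define SK_NREGS    2
#define SK_NO_LOCAL (-1)

typedef struct {
    int      local;     /* cached local id, or SK_NO_LOCAL */
    bool     dirty;     /* register is newer than the frame home */
    unsigned last_use;  /* tick of the most recent access */
} SkCacheReg;

typedef struct {
    SkCacheReg regs[SK_NREGS];
    unsigned   tick;    /* monotonic use counter */
    int        stores;  /* write-backs to frame homes */
    int        loads;   /* reloads from frame homes */
} SkRegCache;

void sk_cache_init(SkRegCache *c) {
    for (int i = 0; i < SK_NREGS; i++)
        c->regs[i] = (SkCacheReg){ SK_NO_LOCAL, false, 0 };
    c->tick = 0;
    c->stores = 0;
    c->loads = 0;
}

/* Write a register back to its frame home if dirty, then free it. */
static void sk_evict(SkRegCache *c, SkCacheReg *r) {
    if (r->local != SK_NO_LOCAL && r->dirty) c->stores++;
    r->local = SK_NO_LOCAL;
    r->dirty = false;
}

/* Return the register caching `local`, claiming one on a miss. */
static SkCacheReg *sk_lookup(SkRegCache *c, int local, bool *hit) {
    SkCacheReg *victim = &c->regs[0];
    assert(local >= 0);
    for (int i = 0; i < SK_NREGS; i++) {
        SkCacheReg *r = &c->regs[i];
        if (r->local == local) {
            r->last_use = ++c->tick;
            *hit = true;
            return r;
        }
        /* Prefer a free register, otherwise the least recently used one. */
        if (victim->local != SK_NO_LOCAL &&
            (r->local == SK_NO_LOCAL || r->last_use < victim->last_use))
            victim = r;
    }
    sk_evict(c, victim);
    victim->local = local;
    victim->last_use = ++c->tick;
    *hit = false;
    return victim;
}

/* Read a local: a miss costs one reload from the frame home. */
void sk_use(SkRegCache *c, int local) {
    bool hit;
    sk_lookup(c, local, &hit);
    if (!hit) c->loads++;
}

/* Define a local: the value lives only in the register until write-back. */
void sk_def(SkRegCache *c, int local) {
    bool hit;
    sk_lookup(c, local, &hit)->dirty = true;
}

/* Boundary (block end, call, barrier): write back everything and forget it. */
void sk_flush_all(SkRegCache *c) {
    for (int i = 0; i < SK_NREGS; i++) sk_evict(c, &c->regs[i]);
}
```

Defining the same local three times and then flushing produces a single store in this sketch, which is the "at most once per boundary instead of once per definition" property in miniature.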

Single-pass -O0 cannot pre-plan the full frame, so the native prologue is emitted into a reserved region at func_begin and the frame-size/outgoing-area immediates are patched at func_end once final sizes are known. The optimizer path, which knows the whole frame up front, instead uses func_begin_known_frame + emit_prologue for an exact-size prologue. Both mechanisms are on the same NativeTarget.
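The reserve-then-patch mechanism looks roughly like this. The instruction encoding (SK_OP_SUB_SP with the immediate in the low bits), the 16-byte rounding, and all sk_* names are made up for the sketch; real prologues are arch-specific and reserve more than one word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_CODE_MAX  64
#define SK_OP_SUB_SP 0xA0000000u  /* made-up opcode; low 28 bits are the immediate */

typedef struct {
    uint32_t code[SK_CODE_MAX];
    size_t   len;
    size_t   prologue_at;  /* index of the reserved prologue word */
    uint32_t frame_bytes;  /* grows while the body is being emitted */
} SkFunc;

void sk_emit_word(SkFunc *f, uint32_t word) {
    assert(f->len < SK_CODE_MAX);
    f->code[f->len++] = word;
}

/* Reserve the prologue with a zero immediate; the size is not known yet. */
void sk_func_begin(SkFunc *f) {
    f->len = 0;
    f->frame_bytes = 0;
    f->prologue_at = f->len;
    sk_emit_word(f, SK_OP_SUB_SP);
}

/* Body emission may grow the frame at any time; returns the running total. */
uint32_t sk_alloc_slot(SkFunc *f, uint32_t size) {
    f->frame_bytes += size;
    return f->frame_bytes;
}

/* Final size is known: patch the reserved word in place. */
void sk_func_end(SkFunc *f) {
    uint32_t aligned = (f->frame_bytes + 15u) & ~15u;  /* assumed 16-byte alignment */
    assert(aligned < 0x10000000u);
    f->code[f->prologue_at] = SK_OP_SUB_SP | aligned;
}
```

A path that knows the frame size up front would skip the placeholder and write the final immediate directly, which is the distinction the paragraph above draws.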

Shared native infrastructure

Two more pieces are shared by every native backend (both -O0 and the optimizer's machine emit), so the per-arch code carries only ISA/ABI specifics.

NativeFrame — arch-neutral frame bookkeeping

src/cg/native_frame.{c,h} holds the parts of stack-frame layout that are identical across aa64/rv64/x64: the table of frame slots (locals, spills, sret/variadic homes, and aa64 callee-save homes) accumulated below the frame anchor, the cumulative-offset arithmetic, the running max-outgoing-arg size, the frame-final gate that forbids growing the frame after the prologue is emitted, and deriving the used-callee-save set from the optimizer's per-class register masks. It also answers the one frame-relevant ABI query — native_frame_va_save_bytes derives the vararg register-save-area size from the target ABI's va_list layout, so the per-arch magic numbers all flow from one ABI-driven source.
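A sketch of the arch-neutral bookkeeping, assuming a hypothetical SkFrame with power-of-two alignments and a 16-byte total alignment (neither is taken from native_frame.h). It covers three of the pieces named above: cumulative offsets below the anchor, the running max outgoing-argument size, and the frame-final gate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SK_MAX_SLOTS 32

typedef struct {
    uint32_t size;
    uint32_t align;   /* must be a power of two */
    uint32_t offset;  /* cumulative bytes below the anchor to the slot's base */
} SkSlot;

typedef struct {
    SkSlot   slots[SK_MAX_SLOTS];
    size_t   nslots;
    uint32_t cum_bytes;     /* total bytes accumulated below the anchor */
    uint32_t max_outgoing;  /* largest outgoing-argument area seen so far */
    bool     final;         /* set once the prologue has been emitted */
} SkFrame;

static uint32_t sk_round_up(uint32_t x, uint32_t align) {
    return (x + align - 1u) & ~(align - 1u);
}

/* Add a slot; returns its index, or -1 if the frame can no longer grow. */
int sk_frame_add_slot(SkFrame *f, uint32_t size, uint32_t align) {
    if (f->final || f->nslots == SK_MAX_SLOTS) return -1;
    f->cum_bytes = sk_round_up(f->cum_bytes + size, align);
    f->slots[f->nslots] = (SkSlot){ size, align, f->cum_bytes };
    return (int)f->nslots++;
}

/* Each call site reports its outgoing area; only the maximum matters. */
void sk_frame_note_outgoing(SkFrame *f, uint32_t bytes) {
    if (bytes > f->max_outgoing) f->max_outgoing = bytes;
}

/* Close the frame and return its total size. */
uint32_t sk_frame_finalize(SkFrame *f) {
    f->final = true;
    return sk_round_up(f->cum_bytes + f->max_outgoing, 16u);
}
```

Turning a slot's cumulative offset into an fp-, s0-, or rbp-relative displacement is deliberately left out, since that transform is the per-arch part.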

What stays per-arch is everything ISA/ABI-specific: the transform from a slot's cumulative offset to an anchor-relative displacement (fp/s0/rbp-relative, plus aa64's top- vs bottom-record choice), prologue/epilogue encoding, callee-save placement, slim-prologue variants, deferred-patch application, and variadic register-save stores.

native_argmove — parallel-copy register shuffle

src/cg/native_argmove.{c,h} is the shared scheduler for realizing a set of register dst <- src moves as a parallel copy. Marshalling call arguments — and, on the optimizer path, binding incoming parameters — is a parallel copy: every source register must be read before any move overwrites it. The allocator usually hands over a conflict-free order, but not always (variadics, and tail-call / entry permutations it is free to rotate). When a true cycle remains (e.g. rdi<-rdx, rsi<-rdi, rdx<-rsi), the scheduler breaks it by stashing one member into a scratch register and redirecting that value's readers.

The scheduling — topological emission, cycle detection, the scratch break — is identical across the three arches; only the leaf operations differ (how one move is emitted, which register is scratch), supplied through a small ops struct. All three native backends plus the entry param-bind path share this one scheduler.
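The scheduling can be sketched as a small worklist loop. This is an illustrative version, not the contents of native_argmove.c: the SkMove/SkMoveOps types, the 16-move limit, and the simulated register file are assumptions, and the sketch requires that destinations are distinct and that the scratch register appears in no move.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { int dst, src; } SkMove;

/* Leaf operations supplied per arch: how to emit one move, which reg is scratch. */
typedef struct {
    int   scratch;
    void (*emit_move)(void *ctx, int dst, int src);
    void *ctx;
} SkMoveOps;

/* True if some other pending move still needs to read move `self`'s dst. */
static bool sk_dst_is_read(const SkMove *m, const bool *done, size_t n, size_t self) {
    for (size_t j = 0; j < n; j++)
        if (j != self && !done[j] && m[j].src == m[self].dst) return true;
    return false;
}

/* Realize all moves as one parallel copy. `m` is modified (sources may be
 * redirected to the scratch register). */
void sk_parallel_copy(SkMove *m, size_t n, const SkMoveOps *ops) {
    bool   done[16] = { false };
    size_t pending = 0;
    assert(n <= 16);
    for (size_t i = 0; i < n; i++) {
        if (m[i].dst == m[i].src) done[i] = true;  /* self-move: nothing to do */
        else pending++;
    }
    while (pending > 0) {
        bool progressed = false;
        /* Emit every move whose destination nobody still needs to read. */
        for (size_t i = 0; i < n; i++) {
            if (done[i] || sk_dst_is_read(m, done, n, i)) continue;
            ops->emit_move(ops->ctx, m[i].dst, m[i].src);
            done[i] = true;
            pending--;
            progressed = true;
        }
        if (progressed) continue;
        /* Only cycles remain: stash one destination's old value in scratch
         * and point its readers at the scratch register instead. */
        size_t v = 0;
        while (done[v]) v++;
        ops->emit_move(ops->ctx, ops->scratch, m[v].dst);
        for (size_t j = 0; j < n; j++)
            if (!done[j] && m[j].src == m[v].dst) m[j].src = ops->scratch;
    }
}

/* Simulated register file used as the leaf op in this sketch. */
typedef struct { long regs[8]; int emitted; } SkSim;

void sk_sim_move(void *ctx, int dst, int src) {
    SkSim *s = ctx;
    s->regs[dst] = s->regs[src];
    s->emitted++;
}
```

Running the three-register cycle from the text through this sketch (numbering rdi, rsi, rdx as 0, 1, 2) takes four moves: one stash into scratch and three real moves, with every register ending up holding its intended source's original value.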

Why this shape