Codegen Spine

This is the codegen spine that sits between kit's frontends and machine code. Frontends never speak machine: they drive the public KitCg stack-machine API (kit_cg_*), which lowers to a single internal semantic sink — CgTarget. CgTarget has several realizations selected by opt-level and output kind: a direct machine-code emitter for -O0, an IR recorder that feeds the optimizer and interpreter, and source-like targets (the C backend, wasm). All the native -O0 backends share one CgTarget implementation parameterized by per-arch hooks. This doc covers that layering, the public-API-to-CgTarget lowering, and the shared native infrastructure. See IR.md for the recorded IR, OPT.md for optimization and machine lowering, ARCH.md for per-arch emission, CBACKEND.md and WASM.md for the source-like targets, and INTERPRETER.md for the interpreter.

The two boundaries

Codegen has exactly two stable interfaces, stacked:

  frontend (C / toy / wasm-lang / preprocessor)
      |  public push/pop API
      v
  KitCg value stack            (src/cg/*.c, op families)
      |  semantic CgTarget vtable (src/cg/cgtarget.h)
      v
  CgTarget realization
      |-- NativeDirectTarget   -O0 direct emit  -> NativeOps + NativeTarget
      |-- CgIrRecorder         -O1/-O2/interp   -> recorded IR -> optimizer -> NativeTarget
      |-- C-source target / wasm                (semantic, source-like output)

CgTarget (the semantic interface) speaks in target-data-layout terms a frontend can produce: typed semantic locals, immediates/globals/indirect addresses, labels and structured scopes, typed loads/stores, arithmetic, calls/returns as local-valued descriptors, aggregates, bitfields, atomics, varargs, intrinsics, and inline asm. It deliberately exposes no machine state: no hard registers, no spill slots, no call plans, no CFG/SSA, no prologue patching. That keeps it recordable verbatim as semantic IR and lets one frontend path serve every backend.

NativeTarget (src/arch/native_target.h, the physical interface) is the other boundary — the post-register-allocation machine-emission contract. Every hook receives caller-selected, target-legal physical operands and emits one native operation; it is explicitly not a register allocator. Both the -O0 direct path and the optimizer's machine-emit path bottom out here.

The split exists because the two contracts are genuinely different levels. The semantic interface is "what the program means"; the physical interface is "emit this selected instruction with these registers." Folding them, as an earlier design did, forced the shared vtable to expose both IR concepts and backend register state at once. Keeping them apart lets each native arch implement only physical emission, while the semantic-to-physical bridge for -O0 is written once and shared.

The public KitCg API and value stack

include/kit/cg.h is a stack machine. A frontend pushes typed values, names types, and issues operations that pop operands and push results — kit_cg_func_begin, kit_cg_load/kit_cg_store, kit_cg_fp_binop, kit_cg_call, kit_cg_branch_true, kit_cg_block_begin, and so on. This API is the insulation layer: it is source-stable across all the internal changes below it. See FRONTENDS.md.

Every stack entry is exactly one of two kinds: a PLACE (an addressable, typed location: a local's storage, a global, or a computed [base + index*scale + offset]) or a VALUE (a scalar rvalue: integer, float, pointer, or 128-bit scalar). Each op declares the kinds it consumes and produces. Addressing is built explicitly (push_local → PLACE, addr PLACE → VALUE(ptr), deref VALUE(ptr) → PLACE, field i PLACE(record) → PLACE, elem VALUE(ptr) + index → PLACE), and an op panics on a kind mismatch; CG never infers the kind or inserts an implicit dereference. load/store take a PLACE and no effective-address rider; the place ops fold the constant offset (deref/field) and scale (elem) into one OPK_INDIRECT, so the backend still gets a single addressing-mode memop. Aggregates are always a PLACE.
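As a rough illustration of the kind discipline, the toy model below tracks only the kind of each stack entry. Every name in it (ToyStack, toy_addr, and so on) is hypothetical and does not reflect the real kit_cg_* signatures. The real ops panic on a mismatch; the sketch counts mismatches instead so the behavior can be observed.

```c
#include <stddef.h>

/* Toy model of the PLACE/VALUE kind discipline. Every name here is
 * hypothetical; this is not the real kit_cg_* API. */
typedef enum { TOY_PLACE, TOY_VALUE } ToyKind;

typedef struct {
    ToyKind kinds[16];
    size_t  depth;
    int     mismatches;  /* the real ops panic; this sketch counts instead */
} ToyStack;

void toy_push(ToyStack *s, ToyKind k) {
    if (s->depth < 16) s->kinds[s->depth++] = k;
    else s->mismatches++;  /* overflow is treated as an error too */
}

/* Pop one entry and check it against the kind the op declares. */
void toy_pop_expect(ToyStack *s, ToyKind want) {
    if (s->depth == 0 || s->kinds[--s->depth] != want) s->mismatches++;
}

void toy_push_local(ToyStack *s) { toy_push(s, TOY_PLACE); }  /* -> PLACE */
void toy_addr(ToyStack *s)  { toy_pop_expect(s, TOY_PLACE); toy_push(s, TOY_VALUE); }
void toy_deref(ToyStack *s) { toy_pop_expect(s, TOY_VALUE); toy_push(s, TOY_PLACE); }
void toy_load(ToyStack *s)  { toy_pop_expect(s, TOY_PLACE); toy_push(s, TOY_VALUE); }

/* push_local, addr, deref, load: well-kinded, leaves a single VALUE. */
int toy_kinds_ok(void) {
    ToyStack s = {0};
    toy_push_local(&s); toy_addr(&s); toy_deref(&s); toy_load(&s);
    return s.mismatches == 0 && s.depth == 1 && s.kinds[0] == TOY_VALUE;
}

/* deref applied directly to a PLACE: no implicit load is inserted,
 * so this is reported as a mismatch. Returns the mismatch count. */
int toy_kinds_bad(void) {
    ToyStack s = {0};
    toy_push_local(&s); toy_deref(&s);
    return s.mismatches;
}
```

The point of the second helper is that the frontend must spell out the load itself; the stack never guesses that a PLACE was meant to be read first.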

The implementation lives in src/cg/, split by op family rather than one monolith: value.c (stack discipline, lvalue/rvalue conversion, operand materialization), fold.c (the -O0 semantic peephole — constant folding, the delayed compare/arith forms, and const-local store-to-load forwarding; contract in fold.h), memory.c (loads/stores/addressing/aggregates), arith.c, control.c (labels, branches, scopes, switch, computed goto), call.c, atomic.c, asm.c, type.c, local.c, data.c, wide.c (128-bit scalars), with shared state and helpers in internal.h and lifecycle in session.c. (Files are named per family; there is no literal api_*.c prefix.)

The value stack's job is purely semantic lowering. Each entry (ApiSValue) is one of an operand (immediate / constant / semantic local / lvalue address), a delayed compare, or a delayed arith — forms held un-emitted so a following branch can fuse a compare instead of materializing a 0/1, or so a small immediate can flow straight into a binop. These delayed forms and the constant folding around them are the -O0 peephole; it lives in fold.c (contract in fold.h), kept isolated from the stack discipline. Both delayed forms are live: the delayed compare fuses into a following branch, and the delayed arith (admitted by api_can_delay_int_arith for an unflagged foldable integer op) flows an unmaterialized binop/unop into a following op so an immediate chain folds or an identity collapses, materializing only when a consumer needs a value. (The delayed arith was gated off while the load/store addressing rider existed; the strict place/value rework removed the rider and re-enabled it.) The stack does not own registers, frame slots, spill policy, or caller-saved preservation; those moved down into the target realizations. When an operation needs a value emitted, the stack calls the corresponding g->target->op(...) semantic hook with local-only operands.
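The delayed-compare idea can be sketched in a few lines. This is a toy model under stated assumptions, not the code in fold.c: the entry type, the emit log, and every toy_* function are hypothetical, and "emitting" is reduced to bumping counters.

```c
/* Toy model of a delayed compare. All names are hypothetical. */
typedef enum { TOY_SV_OPERAND, TOY_SV_DELAYED_CMP } ToySvKind;

typedef struct {
    ToySvKind kind;
    int lhs, rhs;   /* toy local ids; an operand uses only lhs */
} ToySValue;

typedef struct {
    int cmp_branch;  /* fused compare-and-branch ops emitted */
    int set_bool;    /* compares materialized as a 0/1 value */
    int branch_nz;   /* branches on an already-materialized value */
    int next_temp;   /* next toy temporary id */
} ToyEmitLog;

/* Pushing a compare emits nothing: the entry only remembers its operands. */
ToySValue toy_cmp(int lhs, int rhs) {
    ToySValue v = { TOY_SV_DELAYED_CMP, lhs, rhs };
    return v;
}

/* A consumer that needs a real value forces the 0/1 to be emitted. */
ToySValue toy_materialize(ToyEmitLog *log, ToySValue v) {
    if (v.kind == TOY_SV_DELAYED_CMP) {
        log->set_bool++;
        v.kind = TOY_SV_OPERAND;
        v.lhs = log->next_temp++;
        v.rhs = 0;
    }
    return v;
}

/* A branch consumes a delayed compare directly, as one fused op. */
void toy_branch_true(ToyEmitLog *log, ToySValue v) {
    if (v.kind == TOY_SV_DELAYED_CMP) log->cmp_branch++;
    else log->branch_nz++;
}

/* Total ops when the branch immediately follows the compare. */
int toy_ops_fused(void) {
    ToyEmitLog log = {0};
    toy_branch_true(&log, toy_cmp(1, 2));
    return log.cmp_branch + log.set_bool + log.branch_nz;
}

/* Total ops when something forces the compare into a value first. */
int toy_ops_materialized(void) {
    ToyEmitLog log = {0};
    toy_branch_true(&log, toy_materialize(&log, toy_cmp(1, 2)));
    return log.cmp_branch + log.set_bool + log.branch_nz;
}
```

In this model the fused path costs one op and the materialized path costs two, which is the saving the delayed form is meant to capture.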

Switch is a good example of the semantic/structured division. CgTarget carries an optional switch_ hook and a supports_label_table query. Native arches leave switch_ NULL and the shared cg_lower_switch_default lowers the structured descriptor into cmp_branch/jump/indirect_branch + a rodata label table. Source-like targets override switch_ to emit a native construct (the C target a real switch, a wasm target br_table). Same semantic input, different realization.
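The optional-hook pattern described here might look roughly like the sketch below. The struct layout, the enum, and the toy_* functions are hypothetical; supports_label_table is modeled as a plain flag rather than a query, and falling back to a compare chain when no label table is available is an assumption of this sketch, not something the text above specifies.

```c
#include <stddef.h>

/* Toy model of an optional switch_ hook with a shared default.
 * Names and layout are hypothetical. */
typedef enum {
    TOY_LOWER_CMP_CHAIN,    /* cmp_branch / jump sequence */
    TOY_LOWER_LABEL_TABLE,  /* indirect_branch through a rodata table */
    TOY_LOWER_C_SWITCH      /* source-like: a real C switch */
} ToyLowering;

typedef struct ToyTarget ToyTarget;
struct ToyTarget {
    ToyLowering (*switch_)(const ToyTarget *t);  /* NULL: use the default */
    int supports_label_table;                    /* modeled as a flag */
};

/* Shared default used by every target that leaves switch_ NULL. */
ToyLowering toy_lower_switch_default(const ToyTarget *t) {
    return t->supports_label_table ? TOY_LOWER_LABEL_TABLE
                                   : TOY_LOWER_CMP_CHAIN;
}

/* Dispatch: prefer the target's own hook, else the shared lowering. */
ToyLowering toy_switch(const ToyTarget *t) {
    return t->switch_ ? t->switch_(t) : toy_lower_switch_default(t);
}

/* A source-like target overrides the hook with its native construct. */
ToyLowering toy_c_switch(const ToyTarget *t) {
    (void)t;
    return TOY_LOWER_C_SWITCH;
}
```

A native target would be initialized as `{ NULL, 1 }` and a C-source target as `{ toy_c_switch, 0 }`; both receive the same call and differ only in what comes out.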

CgTarget realizations

session.c's kit_cg_begin picks the realization. It asks the arch registry (cg_backend_for_session, src/arch/registry.c) for a CGBackend whose make builds the base CgTarget for this target arch and output kind, then conditionally wraps it:

-O0, native arch: the backend's make returns a NativeDirectTarget (see below). No IR is recorded; semantic ops emit machine code immediately.
-O1/-O2 or interpreter: session.c wraps the base target with opt_cgtarget_new (src/opt/opt.c), which returns a CgIrRecorder (src/cg/ir_recorder.c). Recording does not emit; at kit_cg_finish the optimizer replays optimized IR. The recorder still holds the unwrapped native target so the optimizer can drive NativeTarget directly after lowering.
C-source / wasm: the registry returns a source-like CgTarget that implements the semantic vtable and writes C text or a wasm module. These are semantic backends, not NativeTarget implementers.

The arch make always builds the NativeDirectTarget as the leaf; the optimizer wrapper, when present, sits above it and reaches the leaf's NativeTarget through native_direct_target_native. So a native arch implements its semantic surface exactly once — there is no separate per-arch semantic CgTarget.

The IR recorder

CgIrRecorder is a thin CgTarget that turns each semantic call into exactly one IR instruction in a per-function CgIrFunc, preserving operands, sticky source locations, tail-call policy, and global references. It is purely a sink: finalize triggers the optimizer's cross-function passes and per-function lowering. From the recorded clean IR, the optimizer derives its own CFG/SSA/MIR/allocated-MIR views and finally calls opt_emit_native to drive the arch NativeTarget. The same recorded IR feeds the interpreter via a sibling lowering path (opt_run_o1_interp). See IR.md, OPT.md, and INTERPRETER.md.
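A minimal sketch of the "one call, one instruction" recording shape follows, with a sticky source location. The ToyIr* types and toy_rec_* functions are hypothetical and far simpler than the real CgIrFunc; operands are bare ints and only the line number is tracked as location.

```c
/* Toy recorder: each semantic call appends exactly one instruction.
 * All names are hypothetical. */
typedef enum { TOY_IR_LOAD, TOY_IR_ADD, TOY_IR_STORE } ToyIrOp;

typedef struct {
    ToyIrOp op;
    int dst, a, b;
    int line;        /* source line in effect when the op was recorded */
} ToyIrInsn;

typedef struct {
    ToyIrInsn insns[32];
    int count;
    int cur_line;    /* sticky: applies to every op until changed */
} ToyIrFunc;

void toy_rec_set_loc(ToyIrFunc *f, int line) { f->cur_line = line; }

/* Records one instruction and returns its index, or -1 when full.
 * Nothing is emitted here; the function is purely a sink. */
int toy_rec_op(ToyIrFunc *f, ToyIrOp op, int dst, int a, int b) {
    if (f->count >= 32) return -1;
    ToyIrInsn *in = &f->insns[f->count];
    in->op = op;
    in->dst = dst;
    in->a = a;
    in->b = b;
    in->line = f->cur_line;
    return f->count++;
}
```

Because the location is sticky, a frontend sets it once per statement and every op recorded afterwards inherits it.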

Shared native -O0: NativeDirectTarget

src/cg/native_direct_target.c is the single direct -O0 semantic CgTarget for all native arches (aa64/rv64/x64). It accepts semantic locals and ops and emits machine code in one pass, with no IR recording, liveness, CFG, or SSA. That single-pass property is the whole point of the -O0 path: it is the fast, low-overhead route for unoptimized builds, JIT, and bootstrapping.

It is parameterized two ways:

a per-arch NativeTarget for physical emission (injected at construction);
a small per-arch NativeOps adapter (native_direct_target.h) for the few direct-mode questions the generic target can't answer: parameter binding, operand/address legality, call planning + emission, return + tail-call realizability, varargs, inline asm, and the conservative memory barrier.

NativeOps is intentionally tiny — it is not a second copy of CgTarget. Pure pass-throughs (register info, func_begin/func_end, frame-slot allocation, class_for_type, addr_legal) are called on the NativeTarget directly; NativeOps exists only for the arch-sensitive direct-mode glue.

What NativeDirectTarget owns (and what the value stack therefore does not): semantic-local allocation and metadata, assigning locals frame homes, direct scratch-register selection, a local register cache, dirty-flush/invalidation, materializing operands into physical values, storing results back, conservative flushes around calls/barriers, and outgoing-call-area tracking for frame finalization.

The semantic-vs-physical split, concretely

NativeOps is the semantic-side adapter — it answers questions phrased in semantic terms (CGParamDesc, Operand, CGCallDesc) and is used only on the -O0 direct path. NativeTarget is the physical-side emitter — it speaks NativeLoc (registers / frame slots / immediates / addresses) and is used by both -O0 (after NativeDirectTarget has chosen scratch regs and materialized operands) and the optimizer (after regalloc). The optimizer never touches NativeOps: anything it needs is in MIR or on NativeTarget. This is why the two structs exist side by side — they sit on opposite sides of the register-selection line.

The local register cache

The correct baseline gives every scalar local a frame home and round-trips it through memory per use — no liveness needed. On top of that baseline, NativeDirectTarget runs a write-back, basic-block-scoped local register cache that removes most of those round trips while staying single-pass. The load-bearing invariants are:

What is cached: only scalar locals that fit a register and are neither address-taken nor memory-required. Aggregates and escaped locals stay frame-only.
Where: only caller-saved allocable registers. This sidesteps prologue bookkeeping (the direct path reports no clobbered callee-saves) and means the blanket flush before any call already covers ABI clobbering.
Block scope: with no CFG/liveness, a cached value cannot cross a control edge or join. The cache is spilled and emptied at every branch, label, and return; func_begin starts empty.
Escape-based aliasing: because address-taken locals are never cached, a pointer access can only alias an escaped local, which is never in a register — so loads/stores need no value-cache flush for aliasing. Addressing is made cache-aware: when a base/index local is live in a register, the address points at that register instead of reloading the frame home. Direct frame-home accesses to a cached local (e.g. by-value field extraction) flush just that one local.
Calls and barriers still flush the whole cache (caller-saved regs die, and a memory clobber may observe everything). Address-taking a cached local flushes just that local and marks it uncacheable thereafter.

A monotonic use-tick drives approximate-LRU eviction; cached locals are tracked in an intrusive insertion-order list so flush-all is O(cached), not O(locals). The cache is never worse than the frame-only baseline — each cached local is stored at most once per boundary instead of once per definition. It is a local cache, not an allocator: the semantic API promises nothing about where a local lives, so the target is free to choose.
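The invariants above can be exercised with a small model. This is a sketch under simplifying assumptions, not the code in native_direct_target.c: all names are hypothetical, there are only two "registers", values are plain ints, and flush-all walks the register array instead of an intrusive list.

```c
#include <assert.h>

enum { TOY_NREGS = 2, TOY_NLOCALS = 8 };

typedef struct {
    int      local;   /* cached local id, or -1 when the register is free */
    int      value;
    int      dirty;   /* register is newer than the frame home */
    unsigned tick;    /* last-use tick for approximate LRU */
} ToyReg;

typedef struct {
    ToyReg   regs[TOY_NREGS];
    int      frame[TOY_NLOCALS];      /* frame homes */
    int      cacheable[TOY_NLOCALS];  /* 0 once address-taken */
    unsigned tick;
    int      frame_stores;            /* stores to frame homes so far */
} ToyCache;

void toy_cache_init(ToyCache *c) {
    for (int i = 0; i < TOY_NREGS; i++) {
        c->regs[i].local = -1; c->regs[i].value = 0;
        c->regs[i].dirty = 0;  c->regs[i].tick = 0;
    }
    for (int i = 0; i < TOY_NLOCALS; i++) { c->frame[i] = 0; c->cacheable[i] = 1; }
    c->tick = 0;
    c->frame_stores = 0;
}

/* Write back if dirty, then free the register. */
void toy_spill(ToyCache *c, ToyReg *r) {
    if (r->local >= 0 && r->dirty) { c->frame[r->local] = r->value; c->frame_stores++; }
    r->local = -1;
    r->dirty = 0;
}

/* Return the register caching `local`, claiming a free or LRU one if needed. */
ToyReg *toy_reg_for(ToyCache *c, int local, int load) {
    ToyReg *victim = &c->regs[0];
    for (int i = 0; i < TOY_NREGS; i++)
        if (c->regs[i].local == local) { c->regs[i].tick = ++c->tick; return &c->regs[i]; }
    for (int i = 0; i < TOY_NREGS; i++) {
        if (c->regs[i].local < 0) { victim = &c->regs[i]; break; }
        if (c->regs[i].tick < victim->tick) victim = &c->regs[i];
    }
    toy_spill(c, victim);
    victim->local = local;
    victim->value = load ? c->frame[local] : 0;
    victim->tick = ++c->tick;
    return victim;
}

void toy_write(ToyCache *c, int local, int v) {
    assert(local >= 0 && local < TOY_NLOCALS);
    if (!c->cacheable[local]) { c->frame[local] = v; c->frame_stores++; return; }
    ToyReg *r = toy_reg_for(c, local, 0);
    r->value = v;
    r->dirty = 1;
}

int toy_read(ToyCache *c, int local) {
    assert(local >= 0 && local < TOY_NLOCALS);
    if (!c->cacheable[local]) return c->frame[local];
    return toy_reg_for(c, local, 1)->value;
}

/* Branch, label, return, call, barrier: spill and empty everything. */
void toy_flush_all(ToyCache *c) {
    for (int i = 0; i < TOY_NREGS; i++) toy_spill(c, &c->regs[i]);
}

/* Address-taking: flush just this local and never cache it again. */
void toy_take_address(ToyCache *c, int local) {
    for (int i = 0; i < TOY_NREGS; i++)
        if (c->regs[i].local == local) toy_spill(c, &c->regs[i]);
    c->cacheable[local] = 0;
}
```

In this model, three consecutive writes to one local followed by a flush cost a single frame store, where the frame-only baseline would pay three.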

Single-pass -O0 cannot pre-plan the full frame, so the native prologue is emitted into a reserved region at func_begin and the frame-size/outgoing-area immediates are patched at func_end once final sizes are known. The optimizer path, which knows the whole frame up front, instead uses func_begin_known_frame + emit_prologue for an exact-size prologue. Both mechanisms are on the same NativeTarget.
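The reserve-then-patch mechanism can be illustrated with a byte buffer. The encoding here is made up (a one-byte opcode followed by a little-endian 32-bit immediate) and does not correspond to aa64, rv64, or x64; ToyCode and the toy_* functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy code buffer with one deferred patch. All names are hypothetical. */
typedef struct {
    uint8_t buf[64];
    size_t  len;
    size_t  patch_at;  /* offset of the reserved frame-size immediate */
} ToyCode;

void toy_emit_u8(ToyCode *c, uint8_t b) {
    assert(c->len < sizeof c->buf);
    c->buf[c->len++] = b;
}

/* Emit the prologue with a placeholder; the final size is not known yet. */
void toy_func_begin(ToyCode *c) {
    toy_emit_u8(c, 0xA0);  /* made-up "adjust stack by imm32" opcode */
    c->patch_at = c->len;
    for (int i = 0; i < 4; i++) toy_emit_u8(c, 0);
}

/* Once the body is emitted, overwrite the placeholder in place. */
void toy_func_end(ToyCode *c, uint32_t frame_size) {
    for (int i = 0; i < 4; i++)
        c->buf[c->patch_at + i] = (uint8_t)(frame_size >> (8 * i));
}

/* Read the immediate back, for inspection. */
uint32_t toy_frame_imm(const ToyCode *c) {
    uint32_t v = 0;
    for (int i = 0; i < 4; i++)
        v |= (uint32_t)c->buf[c->patch_at + i] << (8 * i);
    return v;
}
```

The body bytes emitted between toy_func_begin and toy_func_end never move, which is what lets the direct path stay single-pass.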

Shared native infrastructure

Two more pieces are shared by every native backend (both -O0 and the optimizer's machine emit), so the per-arch code carries only ISA/ABI specifics.

NativeFrame — arch-neutral frame bookkeeping

src/cg/native_frame.{c,h} holds the parts of stack-frame layout that are identical across aa64/rv64/x64: the table of frame slots (locals, spills, sret/variadic homes, and aa64 callee-save homes) accumulated below the frame anchor, the cumulative-offset arithmetic, the running max-outgoing-arg size, the frame-final gate that forbids growing the frame after the prologue is emitted, and deriving the used-callee-save set from the optimizer's per-class register masks. It also answers the one frame-relevant ABI query — native_frame_va_save_bytes derives the vararg register-save-area size from the target ABI's va_list layout, so the per-arch magic numbers all flow from one ABI-driven source.

What stays per-arch is everything ISA/ABI-specific: the transform from a slot's cumulative offset to an anchor-relative displacement (fp/s0/rbp-relative, plus aa64's top- vs bottom-record choice), prologue/epilogue encoding, callee-save placement, slim-prologue variants, deferred-patch application, and variadic register-save stores.
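The arch-neutral part of this bookkeeping might be sketched as follows. This is an assumption-laden toy, not native_frame.c: ToyFrame and the toy_* functions are hypothetical, slots are not individually recorded, and the per-arch transform is shown only in its simplest form (a frame pointer sitting exactly at the anchor).

```c
#include <assert.h>
#include <stdint.h>

/* Toy frame bookkeeping. All names are hypothetical. */
typedef struct {
    uint32_t size;          /* bytes accumulated below the anchor */
    uint32_t max_outgoing;  /* running max outgoing-argument area */
    int      is_final;      /* set once the prologue has been emitted */
} ToyFrame;

/* Allocate a slot below the anchor. Returns the slot's cumulative offset
 * (its distance below the anchor), or -1 if the frame is already final.
 * `align` must be a power of two. */
int32_t toy_frame_alloc(ToyFrame *f, uint32_t size, uint32_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    if (f->is_final) return -1;
    uint32_t end = f->size + size;
    end = (end + align - 1) & ~(align - 1);  /* round up so the slot is aligned */
    f->size = end;
    return (int32_t)end;
}

void toy_frame_note_outgoing(ToyFrame *f, uint32_t bytes) {
    if (bytes > f->max_outgoing) f->max_outgoing = bytes;
}

/* Frame-final gate: after this, the frame may not grow. */
void toy_frame_finalize(ToyFrame *f) { f->is_final = 1; }

/* Simplest possible per-arch transform: anchor-relative displacement
 * when the frame pointer is the anchor itself. */
int32_t toy_fp_disp(int32_t cumulative_offset) { return -cumulative_offset; }
```

Keeping toy_fp_disp separate from the allocator mirrors the split described above: the shared code produces cumulative offsets, and each arch decides how those map to displacements.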

native_argmove — parallel-copy register shuffle

src/cg/native_argmove.{c,h} is the shared scheduler for realizing a set of register dst <- src moves as a parallel copy. Marshalling call arguments — and, on the optimizer path, binding incoming parameters — is a parallel copy: every source register must be read before any move overwrites it. The allocator usually hands over a conflict-free order, but not always (variadics, and tail-call / entry permutations it is free to rotate). When a true cycle remains (e.g. rdi<-rdx, rsi<-rdi, rdx<-rsi), the scheduler breaks it by stashing one member into a scratch register and redirecting that value's readers.

The scheduling — topological emission, cycle detection, the scratch break — is identical across the three arches; only the leaf operations differ (how one move is emitted, which register is scratch), supplied through a small ops struct. All three native backends plus the entry param-bind path share this one scheduler.
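The scheduling described above (emit any move whose destination is no longer read, and break a remaining cycle through a scratch register) can be sketched compactly. This is an illustrative version, not the code in native_argmove.c: registers are small ints, the ops struct is omitted, and ToyMove, toy_parallel_copy, and toy_run_moves are hypothetical names.

```c
/* Toy parallel-copy scheduler. All names are hypothetical. */
enum { TOY_MAX_MOVES = 16 };

typedef struct { int dst, src; } ToyMove;

/* Schedules n moves with distinct dsts (none equal to `scratch`) so that
 * running `out` sequentially has the effect of the parallel copy.
 * `out` needs room for 2*n entries. Returns the number of moves written,
 * or -1 if n is out of range for this sketch. */
int toy_parallel_copy(const ToyMove *in, int n, int scratch, ToyMove *out) {
    ToyMove pend[TOY_MAX_MOVES];
    int np = 0, nout = 0;
    if (n < 0 || n > TOY_MAX_MOVES) return -1;
    for (int i = 0; i < n; i++)
        if (in[i].dst != in[i].src) pend[np++] = in[i];  /* drop no-op moves */

    while (np > 0) {
        int progressed = 0;
        for (int i = 0; i < np; ) {
            int blocked = 0;  /* is pend[i].dst still read by another move? */
            for (int j = 0; j < np; j++)
                if (j != i && pend[j].src == pend[i].dst) { blocked = 1; break; }
            if (blocked) { i++; continue; }
            out[nout++] = pend[i];   /* safe: nobody needs the old dst value */
            pend[i] = pend[--np];    /* swap-remove; order does not matter */
            progressed = 1;
        }
        if (!progressed) {
            /* Every remaining dst is still read: a true cycle. Stash one
             * member in scratch and redirect that value's readers. */
            int victim = pend[0].dst;
            out[nout++] = (ToyMove){ scratch, victim };
            for (int j = 0; j < np; j++)
                if (pend[j].src == victim) pend[j].src = scratch;
        }
    }
    return nout;
}

/* Execute a move sequence on a register file, one move at a time. */
void toy_run_moves(int *regs, const ToyMove *moves, int n) {
    for (int i = 0; i < n; i++) regs[moves[i].dst] = regs[moves[i].src];
}
```

Numbering rdi, rsi, rdx as 0, 1, 2 and using register 3 as scratch, the three-cycle from the text (rdi<-rdx, rsi<-rdi, rdx<-rsi) comes out as four sequential moves: one stash plus the three original moves, the last of which reads from scratch.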

Why this shape

One semantic interface, recorded verbatim, means the optimizer and the -O0 emitter consume the same program description — the frontend writes it once.
One shared NativeDirectTarget means a new native arch implements physical emission (NativeTarget) plus a tiny NativeOps adapter, and gets a correct single-pass -O0 backend for free, sharing frame and arg-move logic.
The strict semantic/physical line keeps register allocation, spilling, and ABI placement out of the frontend-facing surface entirely, so they can change (or be skipped, at -O0) without touching anything above NativeTarget.
Direct -O0 emission with a local register cache buys fast unoptimized builds and JIT without a recording/optimization round trip, while remaining obviously correct via its block-local and escape-based invariants.
