Codegen Spine
This document describes the codegen spine, the layer that sits between kit's frontends and machine code.
Frontends never emit machine code themselves: they drive the public KitCg stack-machine API
(kit_cg_*), which lowers to a single internal semantic sink, CgTarget.
CgTarget has several realizations selected by opt-level and output kind: a
direct machine-code emitter for -O0, an IR recorder that feeds the optimizer
and interpreter, and source-like targets (the C backend, wasm). All the native
-O0 backends share one CgTarget implementation parameterized by per-arch
hooks. This doc covers that layering, the public-API-to-CgTarget lowering, and
the shared native infrastructure. See IR.md for the recorded IR,
OPT.md for optimization and machine lowering, ARCH.md for
per-arch emission, CBACKEND.md and WASM.md for the
source-like targets, and INTERPRETER.md for the interpreter.
The two boundaries
Codegen has exactly two stable interfaces, stacked:
    frontend (C / toy / wasm-lang / preprocessor)
        |   public push/pop API
        v
    KitCg value stack            (src/cg/*.c, op families)
        |   semantic CgTarget vtable (src/cg/cgtarget.h)
        v
    CgTarget realization
        |-- NativeDirectTarget       -O0 direct emit   -> NativeOps + NativeTarget
        |-- CgIrRecorder             -O1/-O2/interp    -> recorded IR -> optimizer -> NativeTarget
        |-- C-source target / wasm   (semantic, source-like output)
CgTarget (the semantic interface) speaks in target-data-layout terms a
frontend can produce: typed semantic locals, immediates/globals/indirect
addresses, labels and structured scopes, typed loads/stores, arithmetic,
calls/returns as local-valued descriptors, aggregates, bitfields, atomics,
varargs, intrinsics, and inline asm. It deliberately exposes no machine
state: no hard registers, no spill slots, no call plans, no CFG/SSA, no
prologue patching. That keeps it recordable verbatim as semantic IR and lets one
frontend path serve every backend.
NativeTarget (src/arch/native_target.h, the physical interface) is the
other boundary — the post-register-allocation machine-emission contract. Every
hook receives caller-selected, target-legal physical operands and emits one
native operation; it is explicitly not a register allocator. Both the -O0
direct path and the optimizer's machine-emit path bottom out here.
The split exists because the two contracts are genuinely different levels. The
semantic interface is "what the program means"; the physical interface is "emit
this selected instruction with these registers." Folding them, as an earlier
design did, forced the shared vtable to expose both IR concepts and backend
register state at once. Keeping them apart lets each native arch implement only
physical emission, while the semantic-to-physical bridge for -O0 is written
once and shared.
The public KitCg API and value stack
include/kit/cg.h is a stack machine. A frontend pushes typed values, names
types, and issues operations that pop operands and push results —
kit_cg_func_begin, kit_cg_load/kit_cg_store, kit_cg_fp_binop,
kit_cg_call, kit_cg_branch_true, kit_cg_block_begin, and so on. This
API is the insulation layer: it is source-stable across all the internal changes
below it. See FRONTENDS.md.
Every stack entry is exactly one of two kinds, and each op declares the
kinds it consumes and produces: a PLACE (an addressable, typed location — a
local's storage, a global, or a computed [base + index*scale + offset]) or a
VALUE (a scalar rvalue: integer, float, pointer, or 128-bit scalar).
Addressing is built explicitly — push_local→PLACE, addr PLACE→VALUE(ptr),
deref VALUE(ptr)→PLACE, field i PLACE(record)→PLACE, elem VALUE(ptr)+index
→PLACE — and the op panics on a kind mismatch; CG never infers the kind or
inserts an implicit dereference. load/store take a PLACE and no
effective-address rider; the place ops fold the constant offset (deref/field)
and scale (elem) into one OPK_INDIRECT, so the backend still gets a single
addressing-mode memop. Aggregates are always a PLACE.
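The PLACE/VALUE discipline above can be illustrated with a minimal sketch. Every name here (Kind, Entry, Stack, push_local, op_addr, op_deref, op_field) is hypothetical and chosen for illustration; the real KitCg entries carry types and far more state. The sketch only shows two properties the text describes: each op asserts the kind it consumes instead of inferring it, and field offsets fold into the place rather than emitting code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; not the real KitCg data structures. */
typedef enum { K_PLACE, K_VALUE } Kind;

typedef struct {
    Kind kind;
    int  base_local;  /* local the entry is rooted at */
    long offset;      /* constant byte offset folded so far */
} Entry;

typedef struct {
    Entry  slot[16];
    size_t depth;
} Stack;

static void push(Stack *s, Entry e) {
    assert(s->depth < 16);
    s->slot[s->depth++] = e;
}

/* Pop an entry and insist on its kind: a mismatch is a hard error,
   never an implicit dereference or conversion. */
static Entry pop_kind(Stack *s, Kind want) {
    assert(s->depth > 0);
    Entry e = s->slot[--s->depth];
    assert(e.kind == want);
    return e;
}

/* push_local -> PLACE */
static void push_local(Stack *s, int local) {
    push(s, (Entry){ K_PLACE, local, 0 });
}

/* addr: PLACE -> VALUE(ptr) */
static void op_addr(Stack *s) {
    Entry e = pop_kind(s, K_PLACE);
    e.kind = K_VALUE;
    push(s, e);
}

/* deref: VALUE(ptr) -> PLACE */
static void op_deref(Stack *s) {
    Entry e = pop_kind(s, K_VALUE);
    e.kind = K_PLACE;
    push(s, e);
}

/* field: PLACE(record) -> PLACE, folding the constant offset */
static void op_field(Stack *s, long field_offset) {
    Entry e = pop_kind(s, K_PLACE);
    e.offset += field_offset;
    push(s, e);
}
```

In this sketch, two nested field accesses leave a single PLACE whose offset is the sum of both, which is the shape a backend needs to emit one addressing-mode memory operation.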
The implementation lives in src/cg/, split by op family rather than one
monolith: value.c (stack discipline, lvalue/rvalue conversion, operand
materialization), fold.c (the -O0 semantic peephole — constant folding, the
delayed compare/arith forms, and const-local store-to-load forwarding; contract
in fold.h), memory.c (loads/stores/addressing/aggregates), arith.c,
control.c (labels, branches, scopes, switch, computed goto), call.c,
atomic.c, asm.c, type.c, local.c, data.c, wide.c (128-bit scalars),
with shared state and helpers in internal.h and lifecycle in session.c.
(Files are named per family; there is no literal api_*.c prefix.)
The value stack's job is purely semantic lowering. Each entry (ApiSValue) is
one of an operand (immediate / constant / semantic local / lvalue address), a
delayed compare, or a delayed arith — forms held un-emitted so a following
branch can fuse a compare instead of materializing a 0/1, or so a small
immediate can flow straight into a binop. These delayed forms and the
constant folding around them are the -O0 peephole; it lives in fold.c
(contract in fold.h), kept isolated from the stack discipline. Both delayed
forms are live: the delayed compare fuses into a following branch, and the
delayed arith (admitted by api_can_delay_int_arith for an unflagged foldable
integer op) flows an unmaterialized binop/unop into a following op so an
immediate chain folds or an identity collapses, materializing only when a
consumer needs a value. (The delayed arith was gated off while the load/store
addressing rider existed; the strict place/value rework removed the rider and
re-enabled it.) The stack does
not own registers, frame slots, spill policy, or caller-saved preservation;
those moved down into the target realizations. When an operation needs a value
emitted, the stack calls the corresponding g->target->op(...) semantic hook
with local-only operands.
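As a rough illustration of the delayed-compare form, the sketch below holds a compare un-emitted until a consumer appears. The names (SValue, Emitter, compare, materialize, branch_true) and the logged op strings are assumptions made for this example, not the contents of fold.c. A following branch fuses the compare into one op; any other consumer forces a 0/1 to be materialized.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model of a stack entry: a plain operand or a delayed compare. */
typedef enum { SV_OPERAND, SV_DELAYED_CMP } SvKind;
typedef enum { CMP_EQ, CMP_LT } CmpOp;

typedef struct {
    SvKind kind;
    int    a, b;   /* operand ids */
    CmpOp  op;
} SValue;

/* Stand-in for the target: records the name of each emitted op. */
typedef struct {
    const char *log[8];
    int         n;
} Emitter;

static void emit(Emitter *e, const char *what) {
    assert(e->n < 8);
    e->log[e->n++] = what;
}

/* Pushing a compare emits nothing; it only records the pending form. */
static SValue compare(int a, int b, CmpOp op) {
    return (SValue){ SV_DELAYED_CMP, a, b, op };
}

/* A consumer that needs a real value forces the 0/1 to be produced. */
static SValue materialize(Emitter *e, SValue v) {
    if (v.kind == SV_DELAYED_CMP) {
        emit(e, "cmp_set");
        return (SValue){ SV_OPERAND, 99, 0, CMP_EQ };  /* 99: made-up temp id */
    }
    return v;
}

/* A branch consumes the delayed form directly as one fused op. */
static void branch_true(Emitter *e, SValue cond) {
    if (cond.kind == SV_DELAYED_CMP)
        emit(e, "cmp_branch");
    else
        emit(e, "branch_nonzero");
}
```

The design point is that the decision is made by the consumer, not the producer: the compare itself never has to guess whether its result will be branched on or stored.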
Switch is a good example of the semantic/structured division. CgTarget carries
an optional switch_ hook and a supports_label_table query. Native arches
leave switch_ NULL and the shared cg_lower_switch_default lowers the
structured descriptor into cmp_branch/jump/indirect_branch + a rodata label
table. Source-like targets override switch_ to emit a native construct (the C
target a real switch, a wasm target br_table). Same semantic input, different
realization.
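The optional-hook pattern described above can be sketched as follows. The Target struct, its counters, and the function names are hypothetical stand-ins, and the fallback here is simplified to one compare-and-branch per case, whereas the text says the real default may also use a rodata label table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down target vtable with one optional hook. */
typedef struct Target Target;
struct Target {
    void (*switch_)(Target *t, int ncases);  /* NULL: use the shared default */
    int cmp_branches;     /* counters standing in for emitted code */
    int native_switches;
};

/* Simplified shared fallback: one compare-and-branch per case. */
static void lower_switch_default(Target *t, int ncases) {
    t->cmp_branches += ncases;
}

/* The dispatch point: same structured input, target-chosen realization. */
static void emit_switch(Target *t, int ncases) {
    if (t->switch_)
        t->switch_(t, ncases);
    else
        lower_switch_default(t, ncases);
}

/* A source-like target overrides the hook and emits its own construct. */
static void c_target_switch(Target *t, int ncases) {
    (void)ncases;
    t->native_switches += 1;
}
```

A target that leaves the hook NULL gets the shared lowering without writing any switch code of its own.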
CgTarget realizations
session.c's kit_cg_begin picks the realization. It asks the arch
registry (cg_backend_for_session, src/arch/registry.c) for a CGBackend
whose make builds the base CgTarget for this target arch and output kind,
then conditionally wraps it:
- -O0, native arch: the backend's make returns a NativeDirectTarget (see below). No IR is recorded; semantic ops emit machine code immediately.
- -O1/-O2 or interpreter: session.c wraps the base target with opt_cgtarget_new (src/opt/opt.c), which returns a CgIrRecorder (src/cg/ir_recorder.c). Recording does not emit; at kit_cg_finish the optimizer replays optimized IR. The recorder still holds the unwrapped native target so the optimizer can drive NativeTarget directly after lowering.
- C-source / wasm: the registry returns a source-like CgTarget that implements the semantic vtable and writes C text or a wasm module. These are semantic backends, not NativeTarget implementers.
A native arch's make always builds the NativeDirectTarget as the leaf; the optimizer
wrapper, when present, sits above it and reaches the leaf's NativeTarget
through native_direct_target_native. So a native arch implements its semantic
surface exactly once: there is no separate per-arch semantic CgTarget.
The IR recorder
CgIrRecorder is a thin CgTarget that turns each semantic call into exactly
one IR instruction in a per-function CgIrFunc, preserving operands, sticky
source locations, tail-call policy, and global references. While recording it is
purely a sink and emits nothing; its finalize hook then triggers the optimizer's
cross-function passes and per-function lowering. From the recorded clean IR, the optimizer derives its own
CFG/SSA/MIR/allocated-MIR views and finally calls opt_emit_native to drive the
arch NativeTarget. The same recorded IR feeds the interpreter via a sibling
lowering path (opt_run_o1_interp). See IR.md, OPT.md, and
INTERPRETER.md.
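The record-then-replay shape can be sketched with two sinks behind one interface. All names below (Sink, Recorder, DirectEmitter, Inst, replay) are hypothetical, and the sketch replays the recorded instructions unchanged, whereas the real pipeline optimizes and lowers them before anything reaches NativeTarget.

```c
#include <assert.h>

/* Hypothetical one-hook semantic interface and instruction record. */
typedef enum { OP_LOAD, OP_ADD, OP_STORE } Op;
typedef struct { Op op; int dst, a, b; } Inst;
typedef struct { Inst insts[32]; int n; } IrFunc;

typedef struct Sink Sink;
struct Sink { void (*op)(Sink *self, Inst i); };

/* Recorder: one semantic call becomes exactly one stored instruction. */
typedef struct { Sink base; IrFunc func; } Recorder;

/* Direct emitter: stands in for a target that emits immediately. */
typedef struct { Sink base; int emitted; } DirectEmitter;

static void recorder_op(Sink *self, Inst i) {
    Recorder *r = (Recorder *)self;   /* valid: base is the first member */
    assert(r->func.n < 32);
    r->func.insts[r->func.n++] = i;
}

static void emitter_op(Sink *self, Inst i) {
    (void)i;
    ((DirectEmitter *)self)->emitted++;
}

/* At finish time the recorded stream is driven into a downstream sink. */
static void replay(const IrFunc *f, Sink *out) {
    for (int k = 0; k < f->n; k++)
        out->op(out, f->insts[k]);
}
```

Because both sinks accept the same calls, the caller above them does not change when the realization switches from direct emission to recording.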
Shared native -O0: NativeDirectTarget
src/cg/native_direct_target.c is the single direct -O0 semantic CgTarget
for all native arches (aa64/rv64/x64). It accepts semantic locals and ops and
emits machine code in one pass, with no IR recording, liveness, CFG, or SSA.
That single-pass property is the whole point of the -O0 path: it is the fast,
low-overhead route for unoptimized builds, JIT, and bootstrapping.
It is parameterized two ways:
- a per-arch NativeTarget for physical emission (injected at construction);
- a small per-arch NativeOps adapter (native_direct_target.h) for the few direct-mode questions the generic target can't answer: parameter binding, operand/address legality, call planning + emission, return + tail-call realizability, varargs, inline asm, and the conservative memory barrier.
NativeOps is intentionally tiny — it is not a second copy of CgTarget.
Pure pass-throughs (register info, func_begin/func_end, frame-slot
allocation, class_for_type, addr_legal) are called on the NativeTarget
directly; NativeOps exists only for the arch-sensitive direct-mode glue.
What NativeDirectTarget owns (and what the value stack therefore does not):
semantic-local allocation and metadata, assigning locals frame homes, direct
scratch-register selection, a local register cache, dirty-flush/invalidation,
materializing operands into physical values, storing results back, conservative
flushes around calls/barriers, and outgoing-call-area tracking for frame
finalization.
The semantic-vs-physical split, concretely
NativeOps is the semantic-side adapter — it answers questions phrased in
semantic terms (CGParamDesc, Operand, CGCallDesc) and is used only on the
-O0 direct path. NativeTarget is the physical-side emitter — it speaks
NativeLoc (registers / frame slots / immediates / addresses) and is used by
both -O0 (after NativeDirectTarget has chosen scratch regs and materialized
operands) and the optimizer (after regalloc). The optimizer never touches
NativeOps: anything it needs is in MIR or on NativeTarget. This is why the
two structs exist side by side — they sit on opposite sides of the
register-selection line.
The local register cache
The correct baseline gives every scalar local a frame home and round-trips it
through memory per use — no liveness needed. On top of that baseline,
NativeDirectTarget runs a write-back, basic-block-scoped local register
cache that removes most of those round trips while staying single-pass. The
load-bearing invariants are:
- What is cached: only scalar locals that fit a register and are neither address-taken nor memory-required. Aggregates and escaped locals stay frame-only.
- Where: only caller-saved allocable registers. This sidesteps prologue bookkeeping (the direct path reports no clobbered callee-saves) and means the blanket flush before any call already covers ABI clobbering.
- Block scope: with no CFG/liveness, a cached value cannot cross a control edge or join. The cache is spilled and emptied at every branch, label, and return; func_begin starts empty.
- Escape-based aliasing: because address-taken locals are never cached, a pointer access can only alias an escaped local, which is never in a register, so loads/stores need no value-cache flush for aliasing. Addressing is made cache-aware: when a base/index local is live in a register, the address points at that register instead of reloading the frame home. Direct frame-home accesses to a cached local (e.g. by-value field extraction) flush just that one local.
- Calls and barriers still flush the whole cache (caller-saved regs die, and a memory clobber may observe everything). Address-taking a cached local flushes just that local and marks it uncacheable thereafter.
A monotonic use-tick drives approximate-LRU eviction; cached locals are tracked in an intrusive insertion-order list so flush-all is O(cached), not O(locals). The cache is never worse than the frame-only baseline — each cached local is stored at most once per boundary instead of once per definition. It is a local cache, not an allocator: the semantic API promises nothing about where a local lives, so the target is free to choose.
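A minimal model of such a write-back cache, under loud assumptions: registers and frame homes are simulated as arrays, there are only two cacheable registers, every local is treated as cacheable, and flush-all scans the registers instead of walking an intrusive list. The names (Cache, cache_reg, local_read, local_write, flush_all) are invented for this sketch. It shows the two properties the text relies on: repeated definitions cost at most one frame store per boundary, and eviction picks the least recently used register.

```c
#include <assert.h>

enum { NREGS = 2, NLOCALS = 8, NO_LOCAL = -1 };

typedef struct {
    int      local[NREGS];    /* local held by each register, or NO_LOCAL */
    int      dirty[NREGS];    /* register is newer than the frame home */
    unsigned tick[NREGS];     /* last-use tick, for approximate LRU */
    unsigned now;
    long     reg[NREGS];      /* simulated register contents */
    long     frame[NLOCALS];  /* simulated frame homes */
    int      stores;          /* frame stores "emitted" so far */
} Cache;

static void cache_init(Cache *c) {
    *c = (Cache){0};
    for (int r = 0; r < NREGS; r++) c->local[r] = NO_LOCAL;
}

/* Write a dirty register back to its frame home and free the register. */
static void spill(Cache *c, int r) {
    if (c->local[r] != NO_LOCAL && c->dirty[r]) {
        c->frame[c->local[r]] = c->reg[r];
        c->stores++;
    }
    c->local[r] = NO_LOCAL;
    c->dirty[r] = 0;
}

/* Return the register caching `local`, allocating (and evicting) on a miss. */
static int cache_reg(Cache *c, int local, int load) {
    int victim = 0;
    for (int r = 0; r < NREGS; r++)
        if (c->local[r] == local) { c->tick[r] = ++c->now; return r; }
    for (int r = 0; r < NREGS; r++) {
        if (c->local[r] == NO_LOCAL) { victim = r; break; }  /* free register */
        if (c->tick[r] < c->tick[victim]) victim = r;        /* else the oldest */
    }
    spill(c, victim);                  /* no store if the register was free */
    c->local[victim] = local;
    c->tick[victim]  = ++c->now;
    if (load) c->reg[victim] = c->frame[local];
    return victim;
}

static void local_write(Cache *c, int local, long v) {
    int r = cache_reg(c, local, 0);
    c->reg[r]   = v;
    c->dirty[r] = 1;
}

static long local_read(Cache *c, int local) {
    return c->reg[cache_reg(c, local, 1)];
}

/* Called at every branch, label, return, call, or barrier in this model. */
static void flush_all(Cache *c) {
    for (int r = 0; r < NREGS; r++) spill(c, r);
}
```

Writing the same local three times and then flushing produces one frame store in this model, against three in a frame-only baseline.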
Single-pass -O0 cannot pre-plan the full frame, so the native prologue is
emitted into a reserved region at func_begin and the frame-size/outgoing-area
immediates are patched at func_end once final sizes are known. The optimizer
path, which knows the whole frame up front, instead uses
func_begin_known_frame + emit_prologue for an exact-size prologue. Both
mechanisms are on the same NativeTarget.
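The reserve-then-patch approach can be sketched with a byte buffer. The opcodes (0xA0, 0x90, 0xC3 used as made-up "adjust stack", "body", and "return" bytes), the little-endian 32-bit immediate, and the names Code, func_begin, and func_end as written here are assumptions for illustration; they do not describe any real arch's prologue encoding or the actual NativeTarget hooks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t buf[64];
    size_t  len;
    size_t  frame_imm_at;  /* where the placeholder immediate lives */
} Code;

static void emit_u8(Code *c, uint8_t b) {
    assert(c->len < sizeof c->buf);
    c->buf[c->len++] = b;
}

static void emit_u32(Code *c, uint32_t v) {
    for (int i = 0; i < 4; i++)
        emit_u8(c, (uint8_t)(v >> (8 * i)));  /* little-endian */
}

/* Emit the prologue with a zero placeholder; remember where to patch. */
static void func_begin(Code *c) {
    emit_u8(c, 0xA0);           /* made-up "adjust stack by imm32" opcode */
    c->frame_imm_at = c->len;
    emit_u32(c, 0);
}

/* The final frame size is only known here, so patch it in place. */
static void func_end(Code *c, uint32_t frame_size) {
    for (int i = 0; i < 4; i++)
        c->buf[c->frame_imm_at + i] = (uint8_t)(frame_size >> (8 * i));
    emit_u8(c, 0xC3);           /* made-up "return" opcode */
}
```

The body is emitted between the two calls without knowing the frame size, which is what keeps the direct path single-pass.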
Shared native infrastructure
Two more pieces are shared by every native backend (both -O0 and the
optimizer's machine emit), so the per-arch code carries only ISA/ABI specifics.
NativeFrame — arch-neutral frame bookkeeping
src/cg/native_frame.{c,h} holds the parts of stack-frame layout that are
identical across aa64/rv64/x64: the table of frame slots (locals, spills,
sret/variadic homes, and aa64 callee-save homes) accumulated below the frame
anchor, the cumulative-offset arithmetic, the running max-outgoing-arg size, the
frame-final gate that forbids growing the frame after the prologue is emitted,
and deriving the used-callee-save set from the optimizer's per-class register
masks. It also answers the one frame-relevant ABI query —
native_frame_va_save_bytes derives the vararg register-save-area size from the
target ABI's va_list layout, so the per-arch magic numbers all flow from one
ABI-driven source.
What stays per-arch is everything ISA/ABI-specific: the transform from a slot's cumulative offset to an anchor-relative displacement (fp/s0/rbp-relative, plus aa64's top- vs bottom-record choice), prologue/epilogue encoding, callee-save placement, slim-prologue variants, deferred-patch application, and variadic register-save stores.
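The arch-neutral part of that bookkeeping can be sketched in a few lines. This is a simplified model with hypothetical names (Frame, frame_alloc, frame_note_call, frame_finalize): it tracks only a running cumulative size, the maximum outgoing-argument area, and the frame-final gate, and it leaves the offset-to-displacement transform to the caller, as the per-arch code would.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t cum;           /* bytes allocated below the frame anchor so far */
    size_t max_outgoing;  /* largest outgoing-argument area of any call */
    int    final;         /* set once the frame size has been committed */
} Frame;

/* Round n up to a multiple of a; a must be a power of two. */
static size_t align_up(size_t n, size_t a) {
    return (n + a - 1) & ~(a - 1);
}

/* Allocate a slot and return its cumulative offset below the anchor,
   or -1 if the frame is already final and may not grow. */
static long frame_alloc(Frame *f, size_t size, size_t align) {
    if (f->final) return -1;
    f->cum = align_up(f->cum + size, align);
    return (long)f->cum;
}

/* Each call site reports how much outgoing-argument space it needs. */
static void frame_note_call(Frame *f, size_t outgoing) {
    if (outgoing > f->max_outgoing) f->max_outgoing = outgoing;
}

/* Close the frame and return its total size, rounded to stack alignment. */
static size_t frame_finalize(Frame *f, size_t stack_align) {
    f->final = 1;
    return align_up(f->cum + f->max_outgoing, stack_align);
}
```

An arch would turn the returned cumulative offset into its own anchor-relative displacement; nothing in the sketch depends on which register anchors the frame.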
native_argmove — parallel-copy register shuffle
src/cg/native_argmove.{c,h} is the shared scheduler for realizing a set of
register dst <- src moves as a parallel copy. Marshalling call arguments —
and, on the optimizer path, binding incoming parameters — is a parallel copy:
every source register must be read before any move overwrites it. The allocator
usually hands over a conflict-free order, but not always (variadics, and
tail-call / entry permutations it is free to rotate). When a true cycle remains
(e.g. rdi<-rdx, rsi<-rdi, rdx<-rsi), the scheduler breaks it by stashing one
member into a scratch register and redirecting that value's readers.
The scheduling — topological emission, cycle detection, the scratch break — is identical across the three arches; only the leaf operations differ (how one move is emitted, which register is scratch), supplied through a small ops struct. All three native backends plus the entry param-bind path share this one scheduler.
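The scheduling idea can be sketched as a standard parallel-copy sequentialization. This is not the code in native_argmove.c: the Move type, schedule_moves, and the integer register ids are assumptions, destinations are assumed distinct, and the scratch register is assumed never to appear as a source or destination in the input. A move is ready when no pending move still reads its destination; when nothing is ready, only cycles remain, so one source is stashed into scratch and its reader is redirected.

```c
#include <assert.h>

enum { MAXMOVES = 8 };

typedef struct { int dst, src; } Move;

/* Order n parallel moves (distinct dsts, n <= MAXMOVES) into `out` as a
   sequence that is safe to execute one at a time. Rewrites mv[].src when
   breaking a cycle. `out` needs room for n + n/2 moves. Returns the count. */
static int schedule_moves(Move *mv, int n, int scratch, Move *out) {
    int done[MAXMOVES] = {0};
    int emitted = 0, pending = 0;

    assert(n <= MAXMOVES);
    for (int i = 0; i < n; i++) {
        if (mv[i].dst == mv[i].src) done[i] = 1;  /* self-move: nothing to do */
        else pending++;
    }

    while (pending > 0) {
        int progressed = 0;
        for (int i = 0; i < n; i++) {
            if (done[i]) continue;
            int blocked = 0;
            for (int j = 0; j < n; j++)   /* does a pending move still read dst? */
                if (!done[j] && j != i && mv[j].src == mv[i].dst) blocked = 1;
            if (blocked) continue;
            out[emitted++] = mv[i];
            done[i] = 1;
            pending--;
            progressed = 1;
        }
        if (!progressed) {
            /* Only cycles remain: stash one source, redirect its readers. */
            int k = 0;
            while (done[k]) k++;
            int stashed = mv[k].src;
            out[emitted++] = (Move){ scratch, stashed };
            for (int j = 0; j < n; j++)
                if (!done[j] && mv[j].src == stashed) mv[j].src = scratch;
        }
    }
    return emitted;
}
```

For the three-register rotation in the example above, this sketch yields four sequential moves: one stash into scratch followed by the three original moves in an order where every source is read before it is overwritten.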
Why this shape
- One semantic interface, recorded verbatim, means the optimizer and the -O0 emitter consume the same program description; the frontend writes it once.
- One shared NativeDirectTarget means a new native arch implements physical emission (NativeTarget) plus a tiny NativeOps adapter, and gets a correct single-pass -O0 backend for free, sharing frame and arg-move logic.
- The strict semantic/physical line keeps register allocation, spilling, and ABI placement out of the frontend-facing surface entirely, so they can change (or be skipped, at -O0) without touching anything above NativeTarget.
- Direct -O0 emission with a local register cache buys fast unoptimized builds and JIT without a recording/optimization round trip, while remaining obviously correct via its block-local and escape-based invariants.