Architecture Backends
This document describes kit's arch backend abstraction: how a target
architecture plugs into the compiler, what its responsibilities are, and how
the three native backends (aa64, x64, rv64) are structured to maximize sharing
while keeping the ISA-specific seams thin. It also covers the ABI / calling
convention layer in src/abi, which is the single authority for storage layout
and call classification. The semantic codegen surface a backend sits behind is
in CODEGEN.md; the IR the optimizer feeds it is in IR.md;
the SSA/regalloc machinery driving the optimizing path is in OPT.md;
the standalone assembler that shares the ISA tables is in ASM.md. ABI
content is canonical here.
1. Two layers of "backend": CGBackend and ArchImpl
A target enters the compiler through two related abstractions that are wired by
struct-prefix subtyping (src/cg/cgtarget.h, src/arch/arch.h).
    CGBackend                      ArchImpl
    ---------                      --------
    const char* name;       +--->  CGBackend backend;       (first field)
    CgTarget* (*make)(...); |      KitArchKind kind;
                            |      CgTarget* (*cgtarget_new)(...);
                            |      ArchAsm* (*asm_new)(...);
                            |      ArchDisasm* (*disasm_new)(...);
                            |      int (*apply_label_fixup)(...);
                            |      const LinkArchDesc* link;
                            |      const ArchDecodeOps* decode;  (emu/objdump)
                            |      const ArchEmuOps* emu;
                            |      const ArchDwarfOps* dwarf;
                            |      const ArchDbgOps* dbg;
                            |      const ArchAsmOps* asm_ops;
                            |      register-file accessors;
                            |      CFI / .eh_frame CIE constants;
                            |      predefined target macros;
A CGBackend answers exactly one question: "build me a CgTarget for this Compiler + ObjBuilder + KitCodeOptions." It is the unit the session pipeline cares about; it knows nothing about machine code, registers, or object formats. cg_backend_c_target (the C source emitter, see CBACKEND.md) and cg_backend_check (the no-emit frontend checker) are standalone CGBackends with no ArchImpl: they are CGBackend and nothing more.

An ArchImpl is a CGBackend plus the machine-code metadata a native target needs. Because CGBackend backend is its first field, (const CGBackend*)&arch_impl_x is a valid upcast, so every machine-code arch is a CGBackend by composition. The extra fields are everything that is genuinely arch-specific and not about producing a CgTarget: the assembler and disassembler constructors, the label-fixup encoder, the linker/emu/DWARF/debugger op tables, the register file, the predefined macros (__aarch64__, __x86_64__, ...), and the DWARF CFI defaults that seed the .eh_frame CIE.
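The prefix rule can be illustrated with a minimal sketch. The struct bodies below are trimmed-down, hypothetical stand-ins (the real fields are the ones in the diagram above, and the real make signature is not shown in this document); only the first-field layout is the point.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed stand-ins for the real declarations. */
typedef struct CgTarget CgTarget;            /* opaque in this sketch */

typedef struct CGBackend {
    const char *name;
    CgTarget *(*make)(void);                 /* assumed signature, for illustration */
} CGBackend;

typedef struct ArchImpl {
    CGBackend backend;                       /* must stay the first field */
    int kind;                                /* stands in for KitArchKind */
} ArchImpl;

/* The cast below is only sound if the base sits at offset 0. */
static_assert(offsetof(ArchImpl, backend) == 0,
              "CGBackend must be the first field of ArchImpl");

static const ArchImpl demo_arch = { { "demo", NULL }, 1 };

/* ArchImpl -> CGBackend: valid because of the prefix layout. */
static const CGBackend *arch_as_backend(const ArchImpl *a)
{
    return (const CGBackend *)a;
}
```

A compile-time offset check of this kind is one way to keep the cast honest if someone later reorders the fields.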
aa64, x64, and rv64 expose full ArchImpls (arch_impl_aa64,
arch_impl_x64, arch_impl_rv64). wasm also exposes an ArchImpl
(arch_impl_wasm), but it is a thin one: its machine-code seams
(asm_new, apply_label_fixup, link, register accessors, CFI) are all NULL,
since wasm32 has no native machine encoding, no stack-frame ABI, and no
assembly form in this toolchain — it produces a WasmModule attached to the
ObjBuilder and only provides a disassembler that renders WAT for objdump (see
WASM.md). So the precise rule is: native machine-code arches have an
ArchImpl whose machine seams are populated; c_target/check have only a
CGBackend; wasm has an ArchImpl shell with the machine seams nulled out.
This split is deliberate. The pipeline picks a CGBackend per emit; metadata
consumers (DWARF producer, debugger, disassembler, register-name lookups) reach
for an ArchImpl and get back NULL when the target has no machine-code identity.
Neither layer leaks into the other.
2. The arch registry
src/arch/registry.c is the sole place that gates the arch vtable roster on
KIT_ARCH_*_ENABLED — it is the canonical config-gate site for the backend
axis, mirroring src/api/lang_registry.c for frontends. (The flags are also
read by the parallel object-format and ABI registries, which gate their own
rosters on the arch × format cross-product, and by src/core/config_assert.c
for build-time validity asserts — see OBJ.md and §7.) Everything
downstream of the registry operates on its outputs and never re-checks the build
flags.
The registry holds a single static arch_impls[] array (each entry gated by its
KIT_ARCH_*_ENABLED flag) and exposes two lookups:
- arch_lookup(KitArchKind) / arch_for_compiler(Compiler*) walk arch_impls[] and return the ArchImpl* whose kind matches. This is the path for machine-arch metadata. c_target is intentionally absent from this roster: it has no ArchImpl, so a metadata query for it correctly returns NULL.
- cg_backend_for_session(Compiler*, KitCodeOptions*) picks the CGBackend* for an emit. It short-circuits to cg_backend_check when check_only is set and to cg_backend_c_target when emit_c_source is set; otherwise it returns &arch_for_compiler(c)->backend. This is the one place the "is this an ArchImpl or a standalone CGBackend?" decision is made, and the two short-circuit cases never consult arch_impls[] (the source-emit and check backends are not in it).
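The two lookups can be sketched roughly as follows. Everything here is a simplified assumption: the demo_ names, the DEMO_*_ENABLED macros, and the reduced signatures (an arch kind instead of a Compiler*) are hypothetical and only mirror the shape described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real types. */
typedef enum { DEMO_AA64, DEMO_X64, DEMO_RV64 } DemoArchKind;
typedef struct { const char *name; } DemoBackend;
typedef struct { DemoBackend backend; DemoArchKind kind; } DemoArchImpl;
typedef struct { int check_only; int emit_c_source; } DemoCodeOptions;

/* Assumed build configuration: rv64 is compiled out. */
#define DEMO_AA64_ENABLED 1
#define DEMO_X64_ENABLED  1
#define DEMO_RV64_ENABLED 0

#if DEMO_AA64_ENABLED
static const DemoArchImpl impl_aa64 = { { "aa64" }, DEMO_AA64 };
#endif
#if DEMO_X64_ENABLED
static const DemoArchImpl impl_x64 = { { "x64" }, DEMO_X64 };
#endif
#if DEMO_RV64_ENABLED
static const DemoArchImpl impl_rv64 = { { "rv64" }, DEMO_RV64 };
#endif

/* The only place the build flags gate the roster. */
static const DemoArchImpl *const demo_arch_impls[] = {
#if DEMO_AA64_ENABLED
    &impl_aa64,
#endif
#if DEMO_X64_ENABLED
    &impl_x64,
#endif
#if DEMO_RV64_ENABLED
    &impl_rv64,
#endif
};

/* Standalone backends: deliberately not in the roster. */
static const DemoBackend backend_check    = { "check" };
static const DemoBackend backend_c_target = { "c_target" };

static const DemoArchImpl *demo_arch_lookup(DemoArchKind kind)
{
    size_t n = sizeof demo_arch_impls / sizeof demo_arch_impls[0];
    for (size_t i = 0; i < n; i++)
        if (demo_arch_impls[i]->kind == kind)
            return demo_arch_impls[i];
    return NULL;                             /* disabled or no ArchImpl */
}

static const DemoBackend *demo_backend_for_session(DemoArchKind kind,
                                                   const DemoCodeOptions *o)
{
    assert(o != NULL);
    if (o->check_only)    return &backend_check;     /* short-circuit */
    if (o->emit_c_source) return &backend_c_target;  /* short-circuit */
    const DemoArchImpl *a = demo_arch_lookup(kind);
    return a ? &a->backend : NULL;
}
```

Note that the short-circuit branches return before the roster is touched, so a check-only session works even for an arch that is compiled out.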
The registry also owns the thin dispatchers arch_reloc_operand,
arch_is_local_branch, and arch_reloc_call_pair, which forward to the target's
ArchAsmOps (used by cc -S symbolization, see §4 and ASM.md), plus
arch_disasm_* / arch_decode_* / formatter helpers. All are NULL-safe: a
target lacking the relevant op table gets the documented "no transformation"
answer rather than a crash.
3. The NativeTarget contract
src/arch/native_target.h defines NativeTarget, the physical machine-emission
contract that all three native backends implement. It is the layer where the
generic codegen drivers stop speaking in semantic terms (CGLocal ids,
high-level types) and start speaking in physical terms: hard registers, frame
slots, legal immediates, and concrete addressing modes. A NativeTarget never
allocates registers and never decides storage layout — callers hand it
caller-selected, target-legal physical operands; the target only encodes.
It is driven from two directions:
    -O0 path:   CG semantic ops ──► NativeDirectTarget ──┐
                (src/cg/native_*)                        ├──► NativeTarget ──► MCEmitter ──► ObjBuilder
    -O1+ path:  CG ──► record IR ──► opt passes ─────────┘    (~35 hooks)
                (SSA, machinize, regalloc, pass_native_emit)
- At -O0, the shared NativeDirectTarget (src/cg/native_direct_target.c) is the CgTarget. It owns semantic local homes, a small register cache, and conservative flushes, and lowers each semantic op directly into NativeTarget hook calls. The arch supplies a NativeTarget plus a small semantic adapter, NativeOps (bind_param, plan_call, emit_call/ret, va_*, asm_block, barriers, legality predicates), which covers the parts that need a foot in the semantic world. Everything else (frame slots, class_for_type, addr_legal) the direct target calls straight through to NativeTarget.
- At -O1+, the optimizer records IR, runs SSA/CFG passes, machinizes, and allocates registers (see OPT.md), then src/opt/pass_native_emit.c replays the allocated program against the same NativeTarget hooks. By this point every value already has a physical home, so the emit pass hands the target hard registers and frame slots and the target just encodes.
That a single ~35-hook contract serves both paths is what keeps the two code generators byte-compatible per arch. The hook families:
- Frame & prologue. func_begin (single-pass reserve-and-patch, used by the direct path) and func_begin_known_frame (the optimizer path, where regalloc has finished so the full frame, meaning slots, callee-saves, alloca, and scratch spills, is known before the prologue). frame_slot, bind_param/bind_params_end, reserve_callee_saves, the optional emit_prologue (exact-size in-place prologue), note_frame_state, and frame_slot_debug_loc for the DWARF coordinate of a slot.
- Control flow. label_new/label_place, jump, cmp_branch, indirect_branch (with a valid-target set for jump tables), load_label_addr (for &&label).
- Data movement. move, load_imm, load_const, load_addr, load, store, tls_addr_of, copy_bytes/set_bytes (aggregate memcpy/memset), bitfield_load/bitfield_store, spill/reload.
- Arithmetic. binop, unop, cmp, convert, alloca_.
- Calls & returns. A two-phase split: plan_call turns a NativeCallDesc into a NativeCallPlan (arg moves, return slots, clobber/return masks, outgoing stack size) that the optimizer can inspect during frame planning, and emit_call realizes it. Symmetrically plan_ret/ret. call_stack_bytes and signature_stack_bytes are pure pre-pass queries used to size the outgoing area and to decide tail-call (sibling) realizability.
- Atomics & fences. atomic_load/store/rmw/cas, fence.
- Variadics. va_start_/va_arg_/va_end_/va_copy_. All va_list layout knowledge (pointer ABI vs register-save-area ABI, field offsets) lives behind these and is answered by querying the target ABI; the optimizer makes no layout assumptions.
- Intrinsics & asm. intrinsic, asm_block (inline asm with constraints), file_scope_asm, plus trap, set_loc, deferred patch_add/patch_apply, finalize, destroy.
A handful of small capability flags/queries let the generic drivers specialize
without arch branches: imm_legal/addr_legal (immediate and addressing-mode
legality), has_store_zero_reg/store_zero_reg (aa64 xzr, rv64 x0 — store
a constant 0 without materializing it), and the optional machine_op_clobbers,
which reports the fixed registers an encoding clobbers as a side effect (x86
idiv writes rax/rdx, a variable shift uses cl, atomics use rax/rcx/rdx) so the
allocator keeps values out of them; aa64/rv64 leave it NULL because their
encodings have no such fixed clobbers.
aa64 is the reference backend. src/arch/aa64/native.c is the most complete
and most heavily commented implementation; the x64 and rv64 ports are written
against it. Shared scaffolding extracted across all three lives in
src/cg/native_frame.c (slot-offset arithmetic, the frame-final gate, the
used-callee-save derivation, ABI-driven va-save sizing) and
src/cg/native_argmove.c (the parallel-copy register shuffle for call-arg and
param marshalling). What stays per-arch is everything ISA-specific: the
slot-offset coordinate transform (fp/s0/rbp-relative), prologue/epilogue
encoding, the slim-prologue variants, and instruction selection.
4. The ISA single-source-of-truth table
Each native arch has an isa.h + isa.c pair that is the one place its
instruction bit-layout lives. isa.h holds inline pack/unpack encoders
(e.g. aa64_movz, aa64_logsr_pack/_unpack) and a descriptor table
(aa64_insn_table[]: {mnemonic, match, mask, format, flags}). isa.c holds
the table data plus the operand print/parse dispatch keyed on format.
The key property is that three different consumers share the same tables:
                    src/arch/aa64/isa.{h,c}  ◄── single source of truth
                   │           │           │
           encoder │    disasm │           │ standalone assembler
         (native.c emit)       │           │ (asm.c)
                       (disasm.c decode)
- the encoder (codegen in native.c) calls the inline pack helpers to emit instruction words;
- the disassembler (disasm.c) does one mask-and-compare against the table to identify a word, then dispatches on format to the same unpack helpers to extract operands;
- the standalone assembler (asm.c, the kit as tool and inline-asm() handling, see ASM.md) parses mnemonics against the table and encodes through the same inline helpers.
(For aa64 the same header is also pulled in by link.c and dbg.c.) The
invariant: when an opcode value or a field position changes, you update one site
and the encoder, decoder, and assembler stay consistent. The table is ordered
first-match-wins, with alias rows (tighter masks, e.g. mov ≡ orr Rd,zr,Rm,
cmp ≡ subs zr,...) placed before the canonical rows so the disassembler
renders the alias spelling while the assembler accepts both. x64/isa.{h,c} and
rv64/isa.{h,c} follow the identical pattern; x64 additionally factors its
byte-level REX/ModR/M/SIB primitives and prologue/epilogue into emit.c.
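As a hedged illustration of first-match-wins ordering, the sketch below uses a two-row table for the 64-bit ORR (shifted register) encoding. The row struct, table, and function names are hypothetical and much smaller than the real aa64_insn_table[] (no format or flags columns); the match/mask values follow the architectural ORR encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical row shape: a reduced version of {mnemonic, match, mask, ...}. */
typedef struct {
    const char *mnemonic;
    uint32_t    match;    /* required bit values */
    uint32_t    mask;     /* which bits are compared */
} DemoInsnRow;

/* The alias row has the tighter mask and therefore comes first. */
static const DemoInsnRow demo_table[] = {
    { "mov", 0xAA0003E0u, 0xFFE0FFE0u },  /* orr Xd, xzr, Xm with no shift */
    { "orr", 0xAA000000u, 0xFF200000u },  /* canonical 64-bit shifted-register form */
};

/* Encoder side: pack ORR Xd, Xn, Xm (shift amount 0). */
static uint32_t demo_orr_pack(unsigned rd, unsigned rn, unsigned rm)
{
    assert(rd < 32 && rn < 32 && rm < 32);
    return 0xAA000000u | ((uint32_t)rm << 16) | ((uint32_t)rn << 5) | rd;
}

/* Decoder side: the first row whose masked bits match wins. */
static const char *demo_identify(uint32_t word)
{
    size_t n = sizeof demo_table / sizeof demo_table[0];
    for (size_t i = 0; i < n; i++)
        if ((word & demo_table[i].mask) == demo_table[i].match)
            return demo_table[i].mnemonic;
    return NULL;                           /* unknown word */
}
```

Swapping the two rows would make every mov render as orr, which is why row order is part of the table's contract.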
The ArchAsmOps table (reloc_operand, is_local_branch, reloc_call_pair)
is the textual complement to this: it tells the cc -S symbolizer how a
relocated operand is spelled for the target object format (aarch64 ELF
:lo12:sym, Mach-O sym@PAGEOFF, x86-64 sym(%rip)/@PLT, RISC-V
%pcrel_hi/%pcrel_lo with anchor pairing) so that re-assembling kit's -S
output reproduces byte-identical objects. It is the inverse of the assembler's
reloc-modifier parser.
5. MCEmitter — one generic emitter for all native arches
src/arch/mc.c is a single generic machine-code/object emitter (MCEmitter,
declared in src/arch/mc.h) used by every native arch. It sits between the
backend (or the assembler) and the ObjBuilder, and it owns only the
bytes-and-bookkeeping concerns that are genuinely arch-independent:
- the current section and byte position;
- the machine-label table: 1-based MCLabel ids, each carrying either a placement (sec, offset) or a list of pending forward-reference fixups that are applied at label_place; plus lazily-minted per-label SB_LOCAL symbols (.Lcfblk.N) so code-location references (&&label, jump-table entries) relocate against a real symbol and survive a re-encoding assembler;
- relocation forwarding (emit_reloc, emit_reloc_at, emit_label_ref, emit_label_data_reloc);
- the per-function context (mc_begin_function/mc_end_function) that the deferred data-section label relocs read;
- CFI buffering: cfi_startproc/cfi_def_cfa/cfi_offset/... records are buffered per-function and flushed into a .eh_frame section by mc_emit_eh_frame at TU finalize. CFI directives are byte-position-bound, so they live on the one object that already tracks (section, offset). cfi_set_next_pc_offset provides a sticky prologue-PC override so backends that patch the prologue in func_end (after the live PC has moved past it) can pin every frame-state rule to the post-prologue PC.
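The pending-fixup mechanism in the label table can be sketched as below. This is a generic illustration, not kit's MCLabel: the DemoLabel layout, the fixed-size pending array, and the plain 32-bit displacement field are assumptions made to keep the example small.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { DEMO_MAX_FIXUPS = 8 };

/* Hypothetical label: either placed, or holding a list of unresolved refs. */
typedef struct {
    int      placed;
    uint32_t offset;                      /* valid once placed */
    uint32_t pending[DEMO_MAX_FIXUPS];    /* byte offsets of forward refs */
    int      npending;
} DemoLabel;

/* Store a 32-bit displacement measured from the end of the 4-byte field
 * (host byte order; a real emitter would fix the byte order per target). */
static void demo_patch(uint8_t *buf, uint32_t site, uint32_t target)
{
    int32_t disp = (int32_t)(target - (site + 4));
    memcpy(buf + site, &disp, sizeof disp);
}

static int32_t demo_read32(const uint8_t *buf, uint32_t site)
{
    int32_t v;
    memcpy(&v, buf + site, sizeof v);
    return v;
}

/* Reference a label: patch now if placed, otherwise queue a fixup. */
static void demo_label_ref(uint8_t *buf, DemoLabel *l, uint32_t site)
{
    if (l->placed) {
        demo_patch(buf, site, l->offset);
        return;
    }
    assert(l->npending < DEMO_MAX_FIXUPS);
    l->pending[l->npending++] = site;
}

/* Place a label: record the offset and drain the pending list. */
static void demo_label_place(uint8_t *buf, DemoLabel *l, uint32_t offset)
{
    assert(!l->placed);
    l->placed = 1;
    l->offset = offset;
    for (int i = 0; i < l->npending; i++)
        demo_patch(buf, l->pending[i], offset);
    l->npending = 0;
}
```

Backward references resolve immediately in demo_label_ref; only forward references ever sit in the pending list.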
Encoding itself is not MCEmitter's job; it writes whatever bytes it is handed. Arch-specific behavior enters through exactly two thin seams:
- ArchImpl.apply_label_fixup: given a resolved label displacement, encode it into the already-emitted bytes (aa64 patches the 26-bit imm26 of B/BL, the 19-bit CONDBR field, and the split ADR immlo/immhi, and falls back to a literal-pool LDR for out-of-range &&label; x64 writes a 4-byte rel32). mc.c builds an ArchLabelFixup descriptor and calls through arch_for_compiler.
- The ArchImpl.cfi_* constants: the per-psABI CIE defaults (cfi_return_addr_reg, code/data alignment factors, initial CFA reg/offset) that mc_emit_eh_frame reads to encode the CIE.
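To make the fixup seam concrete, here is a minimal sketch of three of the field encodings named above. The function names are hypothetical and the real hook receives an ArchLabelFixup descriptor rather than raw arguments; the bit positions follow the architectural encodings (imm26 in bits 25:0 of B/BL, imm19 in bits 23:5 of B.cond, little-endian rel32 on x86-64).

```c
#include <assert.h>
#include <stdint.h>

/* aa64 B/BL: signed word displacement in bits 25:0. */
static uint32_t demo_aa64_patch_imm26(uint32_t insn, int64_t disp_bytes)
{
    assert((disp_bytes & 3) == 0);                /* targets are 4-byte aligned */
    int64_t words = disp_bytes / 4;
    assert(words >= -(1 << 25) && words < (1 << 25));
    return (insn & 0xFC000000u) | ((uint32_t)words & 0x03FFFFFFu);
}

/* aa64 B.cond: signed word displacement in bits 23:5. */
static uint32_t demo_aa64_patch_imm19(uint32_t insn, int64_t disp_bytes)
{
    assert((disp_bytes & 3) == 0);
    int64_t words = disp_bytes / 4;
    assert(words >= -(1 << 18) && words < (1 << 18));
    return (insn & ~(0x7FFFFu << 5)) | (((uint32_t)words & 0x7FFFFu) << 5);
}

/* x64: 4-byte little-endian rel32; the caller computes disp relative to
 * the end of the instruction. */
static void demo_x64_patch_rel32(uint8_t *field, int32_t disp)
{
    uint32_t u = (uint32_t)disp;
    for (int i = 0; i < 4; i++)
        field[i] = (uint8_t)(u >> (8 * i));
}
```

The range asserts mark where a real backend needs a fallback path, such as the literal-pool LDR mentioned above for out-of-range &&label.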
This is the highest-leverage decision in the backend layer: the entire
.eh_frame producer, label resolution, relocation plumbing, and section/byte
management are written once and reused, with only those two pinpoint hooks per
arch. mc.h is split out from arch.h precisely so the many emission-only
consumers (per-arch emit/ops TUs, the assembler, the Debug producer) do not
transitively pull in the decode/disasm/emu/dbg surfaces.
6. Register files
Each native backend declares its register file as static NativeRegInfo data in
its native.c (e.g. aa_reg_info, wired into the NativeTarget at
construction; the DWARF-index ↔ assembler-name tables that the ArchImpl
exposes for objdump/asm live separately in regs.c). A NativeRegInfo is a set
of NativeAllocClassInfo (one per NativeAllocClass: INT, FP, VEC), each
carrying:
- an ordered allocable list: registers the allocator may assign, ordered by preference (aa64 lists caller-saved first so the allocator prefers them and avoids prologue saves);
- a scratch list: registers reserved for the backend's own temporaries (address materialization, atomic retry loops) and never handed to the allocator;
- a NativePhysRegInfo row per physical register (class, ABI arg/ret index, caller/callee-saved flags, spill/copy costs);
- precomputed caller/callee/arg/ret/reserved bitmasks.
This one declaration feeds both code paths:
- the -O0 direct path resolves reg_info and the three class_info[] pointers once at NativeDirectTarget construction, so its register cache (allocate / evict / scratch-acquire) is an O(1) lookup. It uses the allocable order as a simple "next free register" pool with conservative flushes.
- the -O1 allocator (src/opt) consumes the same allocable lists, costs, and masks as its interference-graph inputs, and reports the callee-saves it actually used back through func_begin_known_frame/reserve_callee_saves so the backend can reserve save slots and emit the matching prologue/epilogue.
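How one declaration can serve both consumers is sketched below. The struct is a hypothetical reduction of NativeAllocClassInfo (just an ordered list and one mask), and the register numbers are an illustrative aa64-style choice: caller-saved x9..x11 first, then callee-saved x19 and x20.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced class descriptor. */
typedef struct {
    const uint8_t *allocable;          /* preference order */
    int            nallocable;
    uint32_t       callee_saved_mask;  /* bit r set => register r is callee-saved */
} DemoClassInfo;

static const uint8_t demo_int_order[] = { 9, 10, 11, 19, 20 };
static const DemoClassInfo demo_int_class = {
    demo_int_order, 5, (1u << 19) | (1u << 20)
};

/* -O0 style use: first register in preference order not in `busy`,
 * or -1 when the pool is exhausted and something must be evicted. */
static int demo_pick_reg(const DemoClassInfo *ci, uint32_t busy)
{
    assert(ci != NULL);
    for (int i = 0; i < ci->nallocable; i++)
        if (!(busy & (1u << ci->allocable[i])))
            return ci->allocable[i];
    return -1;
}

/* -O1 style use: the callee-saved registers the allocator actually used,
 * i.e. the set the prologue has to save. */
static uint32_t demo_used_callee_saves(const DemoClassInfo *ci, uint32_t used)
{
    assert(ci != NULL);
    return used & ci->callee_saved_mask;
}
```

With caller-saved registers listed first, a function that needs three or fewer integer registers in this sketch never touches a callee-saved one, so its save set stays empty.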
Because incoming arg registers are marked non-allocable, register-destination
param binds can never alias a live incoming arg, which is what lets bind_param
ordering be unconstrained and lets bind_params_end resolve a param permutation
as a single parallel copy.
7. The ABI / calling-convention layer
src/abi is the single authority for target-dependent storage layout and
call classification. Frontends lower source types to KitCgTypeId before
entering it; from there the answers are language-agnostic. The public surface is
TargetABI (src/abi/abi.h), reachable as c->abi and consulted by both the
semantic codegen (src/cg/local.c for local sizing, cg/* for layout) and the
optimizer (src/opt/cg_ir_lower.c resolves abi_cg_func_info to drive
param-bind and call lowering). It is the canonical owner of: scalar sizes/aligns,
struct/union record layout (including bitfield storage units), function argument
classification (ABIFuncInfo: per-arg DIRECT/INDIRECT/EXPAND/IGNORE,
sret, byval, sign/zero-ext, vararg routing), and va_list shape.
The layer is split into a shared core and per-ABI vtables:
- abi.c holds everything C-standard-driven and identical across ABIs: the scalar profile (LP64 sizes), record layout computation, and the memoizing caches for record layouts and function info. This is shared by all targets.
- abi_internal.h defines ABIVtable, which holds the parts that genuinely vary: compute_func_info (the argument/return classifier) and the va_list type/layout facts.
- registry.c selects the per-(arch, object-format) vtable. Like the arch and obj registries, it gates entries on the combined KIT_ARCH_* + KIT_OBJ_* flags and maps (KitArchKind, KitObjFmt) to an ABIVtable via abi_vtable_lookup. abi_init does this lookup once at compiler init.
ABIs are a derived axis, not a user-facing knob: every valid ABI is a 1:1 function of an (arch, OS-family) pair, where OS family follows from the object format (ELF → SysV/AAPCS-style, Mach-O → Apple, COFF → Windows). The registry therefore enumerates the cross-product cells that both sides enable:
| Arch | ELF (SysV-ish) | Mach-O (Apple) | COFF (Windows) |
|---|---|---|---|
| aa64 | aapcs64 | apple_arm64 | aapcs64_windows |
| x64 | sysv_x64 | apple_x64 | win64_x64 |
| rv64 | rv64 | - | - |
| wasm | - (wasm32, via the wasm object format) | - | - |
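The cross-product lookup can be sketched as a flat cell table. The enum and struct names are hypothetical, the build-flag gating is omitted, and each vtable is reduced to a name; the cells mirror the native rows of the table above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for KitArchKind, KitObjFmt, and ABIVtable. */
typedef enum { DEMO_AA64, DEMO_X64, DEMO_RV64 } DemoArch;
typedef enum { DEMO_ELF, DEMO_MACHO, DEMO_COFF } DemoObjFmt;
typedef struct { const char *name; } DemoABIVtable;

static const DemoABIVtable vt_aapcs64         = { "aapcs64" };
static const DemoABIVtable vt_apple_arm64     = { "apple_arm64" };
static const DemoABIVtable vt_aapcs64_windows = { "aapcs64_windows" };
static const DemoABIVtable vt_sysv_x64        = { "sysv_x64" };
static const DemoABIVtable vt_apple_x64       = { "apple_x64" };
static const DemoABIVtable vt_win64_x64       = { "win64_x64" };
static const DemoABIVtable vt_rv64            = { "rv64" };

/* One row per valid (arch, object-format) cell; invalid cells have no row. */
static const struct {
    DemoArch             arch;
    DemoObjFmt           fmt;
    const DemoABIVtable *vt;
} demo_cells[] = {
    { DEMO_AA64, DEMO_ELF,   &vt_aapcs64 },
    { DEMO_AA64, DEMO_MACHO, &vt_apple_arm64 },
    { DEMO_AA64, DEMO_COFF,  &vt_aapcs64_windows },
    { DEMO_X64,  DEMO_ELF,   &vt_sysv_x64 },
    { DEMO_X64,  DEMO_MACHO, &vt_apple_x64 },
    { DEMO_X64,  DEMO_COFF,  &vt_win64_x64 },
    { DEMO_RV64, DEMO_ELF,   &vt_rv64 },
};

/* Returns NULL for a cell that does not exist (e.g. rv64 + COFF). */
static const DemoABIVtable *demo_abi_lookup(DemoArch arch, DemoObjFmt fmt)
{
    size_t n = sizeof demo_cells / sizeof demo_cells[0];
    for (size_t i = 0; i < n; i++)
        if (demo_cells[i].arch == arch && demo_cells[i].fmt == fmt)
            return demo_cells[i].vt;
    return NULL;
}
```

Because the ABI is derived from the pair, callers in this sketch never name an ABI directly; they only supply the arch and the object format.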
Each per-ABI TU (abi_aapcs64.c, abi_sysv_x64.c, abi_apple_arm64.c,
abi_apple_x64.c, abi_win64_x64.c, abi_aapcs64_windows.c, abi_rv64.c)
implements its compute_func_info and va_list facts; the Apple/Windows
variants encode their divergences (e.g. Apple ARM64 routes the variadic tail
exclusively to the stack, recorded as vararg_on_stack in ABIFuncInfo). The
classification is the only authority — the NativeTarget plan/bind hooks and
the optimizer both consume ABIFuncInfo; they never re-derive argument
placement. Frame-relevant ABI facts (the vararg register-save-area size) are
funneled through src/cg/native_frame.c so the per-arch magic numbers all trace
back to one va_list-layout query.
8. Per-call cost model (aa64 -O1)
The fixed per-call overhead a backend pays — prologue, epilogue, and call-site
setup, independent of the function body — dominates call-heavy workloads, so the
aa64 known-frame path is structured to minimize it. The backend chooses one of a
small set of frame shapes per function (decided in aa_func_begin_known_frame,
encoded in native.c):
| Frame shape | When | Fixed insns (entry+exit, excl. ret) |
|---|---|---|
| slim prologue | leaf-ish: no callee-saves, no alloca, no body slots, no outgoing stack | 3 (optimal) |
| fp_at_bottom | ≥1 callee-save/body slot, no outgoing stack args, frame ≤ 504 | 5 (optimal) |
| slim_small_frame | as above but with outgoing stack args | 7 |
| fat | large frame / alloca / big saved-pair offset | 7+ |
The key structural idea is fp_at_bottom: when there are no outgoing stack args,
the frame record moves to the bottom of the frame (fp = sp), so the sp
adjustment folds into a pre/post-indexed stp x29,x30,[sp,#-N]! / ldp x29,x30,[sp],#N, and callee-saves stack above the record at positive offsets.
This is the common case for any function that keeps a value live across a call
without itself passing >8 register-class args, and it reaches the same 5-insn
fixed cost as gcc -O0; the DWARF CFA becomes fp + frame_size. Functions with
outgoing stack args can't move the record to the bottom (the args live there),
so they keep the top-record slim_small_frame layout. This availability
asymmetry — bottom-record only on the known-frame path — exists because the
frame-size-dependent offsets require the frame to be final before the body, which
is only true under the optimizer's func_begin_known_frame.
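The shape selection in the table can be approximated by a small decision function. This is a sketch under stated assumptions, not the logic of aa_func_begin_known_frame: the DemoFrameFacts fields are hypothetical, the "big saved-pair offset" condition is left out, the order of the tests is a guess, and a function with outgoing stack args but no callee-saves or body slots (a case the table does not spell out) is sent to slim_small_frame here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    DEMO_FRAME_SLIM,
    DEMO_FRAME_FP_AT_BOTTOM,
    DEMO_FRAME_SLIM_SMALL,
    DEMO_FRAME_FAT
} DemoFrameShape;

/* Hypothetical summary of what regalloc knows before the prologue. */
typedef struct {
    int      ncallee_saves;
    int      nbody_slots;
    int      has_alloca;
    uint32_t outgoing_stack_bytes;
    uint32_t frame_size;
} DemoFrameFacts;

static DemoFrameShape demo_pick_frame(const DemoFrameFacts *f)
{
    assert(f != NULL);
    if (f->has_alloca || f->frame_size > 504)
        return DEMO_FRAME_FAT;               /* too big for the short forms */
    if (f->outgoing_stack_bytes != 0)
        return DEMO_FRAME_SLIM_SMALL;        /* args occupy the frame bottom */
    if (f->ncallee_saves == 0 && f->nbody_slots == 0)
        return DEMO_FRAME_SLIM;              /* leaf-ish */
    return DEMO_FRAME_FP_AT_BOTTOM;          /* record at sp, saves above it */
}

/* Fixed entry+exit instruction count from the table (lower bound for fat). */
static int demo_fixed_insns(DemoFrameShape s)
{
    switch (s) {
    case DEMO_FRAME_SLIM:         return 3;
    case DEMO_FRAME_FP_AT_BOTTOM: return 5;
    case DEMO_FRAME_SLIM_SMALL:   return 7;
    default:                      return 7;
    }
}
```

All inputs to the decision are known only once the frame is final, which is consistent with the bottom-record shape being limited to the known-frame path.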
Remaining and planned per-arch work (deferred niche encodings, audit follow-ups) is tracked in plan/ARCH.md.