Architecture Backends

This document describes kit's arch backend abstraction: how a target architecture plugs into the compiler, what its responsibilities are, and how the three native backends (aa64, x64, rv64) are structured to maximize sharing while keeping the ISA-specific seams thin. It also covers the ABI / calling convention layer in src/abi, which is the single authority for storage layout and call classification. The semantic codegen surface a backend sits behind is in CODEGEN.md; the IR the optimizer feeds it is in IR.md; the SSA/regalloc machinery driving the optimizing path is in OPT.md; the standalone assembler that shares the ISA tables is in ASM.md. ABI content is canonical here.

1. Two layers of "backend": CGBackend and ArchImpl

A target enters the compiler through two related abstractions that are wired by struct-prefix subtyping (src/cg/cgtarget.h, src/arch/arch.h).

  CGBackend                      ArchImpl
  ---------                      --------
  const char* name;        +---> CGBackend backend;   (first field)
  CgTarget* (*make)(...);  |     KitArchKind kind;
                           |     CgTarget* (*cgtarget_new)(...);
                           |     ArchAsm*  (*asm_new)(...);
                           |     ArchDisasm* (*disasm_new)(...);
                           |     int (*apply_label_fixup)(...);
                           |     const LinkArchDesc*  link;
                           |     const ArchDecodeOps* decode;  (emu/objdump)
                           |     const ArchEmuOps*    emu;
                           |     const ArchDwarfOps*  dwarf;
                           |     const ArchDbgOps*    dbg;
                           |     const ArchAsmOps*    asm_ops;
                           |     register-file accessors;
                           |     CFI / .eh_frame CIE constants;
                           |     predefined target macros;

A CGBackend answers exactly one question: "build me a CgTarget for this Compiler + ObjBuilder + KitCodeOptions." It is the unit the session pipeline cares about — it knows nothing about machine code, registers, or object formats. cg_backend_c_target (the C source emitter, see CBACKEND.md) and cg_backend_check (the no-emit frontend checker) are standalone CGBackends with no ArchImpl — they are CGBackend and nothing more.
An ArchImpl is a CGBackend plus the machine-code metadata a native target needs. Because CGBackend backend is its first field, (const CGBackend*)&arch_impl_x is a valid upcast: every machine-code arch is a CGBackend by composition. The extra fields are everything that is genuinely arch-specific and not about producing a CgTarget: the assembler and disassembler constructors, the label-fixup encoder, the linker/emu/DWARF/debugger op tables, the register file, the predefined macros (__aarch64__, __x86_64__, ...), and the DWARF CFI defaults that seed the .eh_frame CIE.
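
The first-field relationship can be illustrated with a minimal sketch. The structs below are hypothetical, trimmed-down stand-ins (the real definitions are in src/cg/cgtarget.h and src/arch/arch.h and carry many more fields); demo_arch and arch_as_backend are names invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-ins for the real structs. */
typedef struct CGBackend {
    const char *name;
} CGBackend;

typedef struct ArchImpl {
    CGBackend backend;        /* must stay the first field */
    int kind;                 /* stand-in for KitArchKind */
    int cfi_return_addr_reg;  /* one example of arch-only metadata */
} ArchImpl;

/* The prefix cast is only sound while the base sits at offset 0. */
static_assert(offsetof(ArchImpl, backend) == 0,
              "CGBackend must be the first field of ArchImpl");

static const ArchImpl demo_arch = { { "demo64" }, 1, 30 };

/* View an ArchImpl through its CGBackend prefix. */
static const CGBackend *arch_as_backend(const ArchImpl *a) {
    assert(a != NULL);
    return (const CGBackend *)a;  /* same address as &a->backend */
}
```

A compile-time offset check of this kind is one way to keep the cast honest if fields are ever reordered; whether kit carries such a check is not stated here.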

aa64, x64, and rv64 expose full ArchImpls (arch_impl_aa64, arch_impl_x64, arch_impl_rv64). wasm also exposes an ArchImpl (arch_impl_wasm), but it is a thin one: its machine-code seams (asm_new, apply_label_fixup, link, register accessors, CFI) are all NULL, since wasm32 has no native machine encoding, no stack-frame ABI, and no assembly form in this toolchain — it produces a WasmModule attached to the ObjBuilder and only provides a disassembler that renders WAT for objdump (see WASM.md). So the precise rule is: native machine-code arches have an ArchImpl whose machine seams are populated; c_target/check have only a CGBackend; wasm has an ArchImpl shell with the machine seams nulled out.

This split is deliberate. The pipeline picks a CGBackend per emit; metadata consumers (DWARF producer, debugger, disassembler, register-name lookups) reach for an ArchImpl and get back NULL when the target has no machine-code identity. Neither layer leaks into the other.

2. The arch registry

src/arch/registry.c is the sole place that gates the arch vtable roster on KIT_ARCH_*_ENABLED — it is the canonical config-gate site for the backend axis, mirroring src/api/lang_registry.c for frontends. (The flags are also read by the parallel object-format and ABI registries, which gate their own rosters on the arch × format cross-product, and by src/core/config_assert.c for build-time validity asserts — see OBJ.md and §7.) Everything downstream of the registry operates on its outputs and never re-checks the build flags.

The registry holds a single static arch_impls[] array (each entry gated by its KIT_ARCH_*_ENABLED flag) and exposes two lookups:

arch_lookup(KitArchKind) / arch_for_compiler(Compiler*) walk arch_impls[] and return the ArchImpl* whose kind matches — the path for machine-arch metadata. c_target is intentionally absent from this roster: it has no ArchImpl, so a metadata query for it correctly returns NULL.
cg_backend_for_session(Compiler*, KitCodeOptions*) picks the CGBackend* for an emit. It short-circuits to cg_backend_check when check_only is set and to cg_backend_c_target when emit_c_source is set; otherwise it returns &arch_for_compiler(c)->backend. This is the one place the "is this an ArchImpl or a standalone CGBackend?" decision is made, and it does not consult arch_impls[] (the source-emit and check backends are not in it).
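
As a rough sketch of the two lookups, assuming simplified types: the DEMO_ARCH_*_ENABLED macros stand in for the KIT_ARCH_*_ENABLED flags, and ArchKind, Compiler, CodeOptions, and backend_for_session are hypothetical simplifications of the real declarations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real types live in src/arch and src/cg. */
typedef enum { ARCH_AA64, ARCH_X64, ARCH_RV64 } ArchKind;
typedef struct { const char *name; } CGBackend;
typedef struct { CGBackend backend; ArchKind kind; } ArchImpl;
typedef struct { ArchKind target; } Compiler;
typedef struct { int check_only; int emit_c_source; } CodeOptions;

/* Stand-ins for the KIT_ARCH_*_ENABLED build flags. */
#define DEMO_ARCH_AA64_ENABLED 1
#define DEMO_ARCH_X64_ENABLED  1
#define DEMO_ARCH_RV64_ENABLED 0  /* pretend rv64 is compiled out */

#if DEMO_ARCH_AA64_ENABLED
static const ArchImpl impl_aa64 = { { "aa64" }, ARCH_AA64 };
#endif
#if DEMO_ARCH_X64_ENABLED
static const ArchImpl impl_x64 = { { "x64" }, ARCH_X64 };
#endif
#if DEMO_ARCH_RV64_ENABLED
static const ArchImpl impl_rv64 = { { "rv64" }, ARCH_RV64 };
#endif

/* The only place the build flags are consulted. */
static const ArchImpl *const arch_impls[] = {
#if DEMO_ARCH_AA64_ENABLED
    &impl_aa64,
#endif
#if DEMO_ARCH_X64_ENABLED
    &impl_x64,
#endif
#if DEMO_ARCH_RV64_ENABLED
    &impl_rv64,
#endif
};

/* Standalone backends: deliberately not in arch_impls[]. */
static const CGBackend backend_check    = { "check" };
static const CGBackend backend_c_target = { "c_target" };

/* Metadata path: NULL when the arch is absent from the roster. */
static const ArchImpl *arch_lookup(ArchKind kind) {
    size_t n = sizeof arch_impls / sizeof arch_impls[0];
    for (size_t i = 0; i < n; i++)
        if (arch_impls[i]->kind == kind) return arch_impls[i];
    return NULL;
}

/* Emit path: short-circuit to the standalone backends first. */
static const CGBackend *backend_for_session(const Compiler *c,
                                            const CodeOptions *o) {
    assert(c != NULL && o != NULL);
    if (o->check_only)    return &backend_check;
    if (o->emit_c_source) return &backend_c_target;
    const ArchImpl *a = arch_lookup(c->target);
    return a ? &a->backend : NULL;
}
```

In this sketch a compiled-out arch simply has no row, so callers see NULL and never look at the flags themselves.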

The registry also owns the thin dispatchers arch_reloc_operand, arch_is_local_branch, and arch_reloc_call_pair, which forward to the target's ArchAsmOps (used by cc -S symbolization, see §4 and ASM.md), plus arch_disasm_* / arch_decode_* / formatter helpers. All are NULL-safe: a target lacking the relevant op table gets the documented "no transformation" answer rather than a crash.

3. The NativeTarget contract

src/arch/native_target.h defines NativeTarget, the physical machine-emission contract that all three native backends implement. It is the layer where the generic codegen drivers stop speaking in semantic terms (CGLocal ids, high-level types) and start speaking in physical terms: hard registers, frame slots, legal immediates, and concrete addressing modes. A NativeTarget never allocates registers and never decides storage layout — callers hand it caller-selected, target-legal physical operands; the target only encodes.

It is driven from two directions:

   -O0 path:   CG semantic ops ──► NativeDirectTarget ──┐
                                   (src/cg/native_*)     ├──► NativeTarget ──► MCEmitter ──► ObjBuilder
   -O1+ path:  CG ──► record IR ──► opt passes ──────────┘        (~35 hooks)
                      (SSA, machinize, regalloc, pass_native_emit)

At -O0, the shared NativeDirectTarget (src/cg/native_direct_target.c) is the CgTarget. It owns semantic local homes, a small register cache, and conservative flushes, and lowers each semantic op directly into NativeTarget hook calls. The arch supplies a NativeTarget plus a small semantic adapter, NativeOps (bind_param, plan_call, emit_call/ret, va_*, asm_block, barriers, legality predicates) — the parts that need a foot in the semantic world. Everything else (frame slots, class_for_type, addr_legal) the direct target calls straight through to NativeTarget.
At -O1+, the optimizer records IR, runs SSA/CFG passes, machinizes, and allocates registers (see OPT.md), then src/opt/pass_native_emit.c replays the allocated program against the same NativeTarget hooks. By this point every value already has a physical home, so the emit pass hands the target hard registers and frame slots and the target just encodes.

That a single ~35-hook contract serves both paths is what keeps the two code generators byte-compatible per arch. The hook families:

Frame & prologue. func_begin (single-pass reserve-and-patch, used by the direct path) and func_begin_known_frame (the optimizer path, where regalloc has finished so the full frame — slots, callee-saves, alloca, scratch spills — is known before the prologue). frame_slot, bind_param/bind_params_end, reserve_callee_saves, the optional emit_prologue (exact-size in-place prologue), note_frame_state, and frame_slot_debug_loc for the DWARF coordinate of a slot.
Control flow. label_new/label_place, jump, cmp_branch, indirect_branch (with a valid-target set for jump tables), load_label_addr (for &&label).
Data movement. move, load_imm, load_const, load_addr, load, store, tls_addr_of, copy_bytes/set_bytes (aggregate memcpy/memset), bitfield_load/bitfield_store, spill/reload.
Arithmetic. binop, unop, cmp, convert, alloca_.
Calls & returns. A two-phase split: plan_call turns a NativeCallDesc into a NativeCallPlan (arg moves, return slots, clobber/return masks, outgoing stack size) that the optimizer can inspect during frame planning, and emit_call realizes it. Symmetrically plan_ret/ret. call_stack_bytes and signature_stack_bytes are pure pre-pass queries used to size the outgoing area and to decide tail-call (sibling) realizability.
Atomics & fences. atomic_load/store/rmw/cas, fence.
Variadics. va_start_/va_arg_/va_end_/va_copy_. All va_list layout knowledge (pointer ABI vs register-save-area ABI, field offsets) lives behind these and is answered by querying the target ABI; the optimizer makes no layout assumptions.
Intrinsics & asm. intrinsic, asm_block (inline asm with constraints), file_scope_asm, plus trap, set_loc, deferred patch_add/patch_apply, finalize, destroy.
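
The two-phase call split in the hook list can be sketched as follows. This is a toy model, not kit's classifier: CallDesc, CallPlan, and EmitStats are hypothetical stand-ins for NativeCallDesc and NativeCallPlan, and the sketch assumes integer-only arguments, 8 argument registers, 8-byte stack slots, and a 16-byte-aligned outgoing area.

```c
#include <assert.h>
#include <stdint.h>

enum { MAX_ARGS = 16, NUM_ARG_REGS = 8 };  /* assumed limits */

typedef struct { int nargs; } CallDesc;    /* stand-in for NativeCallDesc */

typedef struct {                           /* stand-in for NativeCallPlan */
    int      nargs;
    int      arg_reg[MAX_ARGS];    /* register index, or -1 when on the stack */
    uint32_t stack_off[MAX_ARGS];  /* meaningful only when arg_reg[i] == -1 */
    uint32_t stack_bytes;          /* outgoing area, rounded up to 16 */
} CallPlan;

typedef struct { int reg_moves; int stack_stores; } EmitStats;

/* Phase 1: pure planning, emits nothing. */
static CallPlan plan_call(const CallDesc *d) {
    CallPlan p = {0};
    uint32_t off = 0;
    assert(d->nargs >= 0 && d->nargs <= MAX_ARGS);
    p.nargs = d->nargs;
    for (int i = 0; i < d->nargs; i++) {
        if (i < NUM_ARG_REGS) {
            p.arg_reg[i] = i;
        } else {
            p.arg_reg[i] = -1;
            p.stack_off[i] = off;
            off += 8;
        }
    }
    p.stack_bytes = (off + 15u) & ~15u;
    return p;
}

/* Frame planning can size the outgoing area from the plans alone. */
static uint32_t frame_outgoing_bytes(const CallPlan *plans, int n) {
    uint32_t max = 0;
    for (int i = 0; i < n; i++)
        if (plans[i].stack_bytes > max) max = plans[i].stack_bytes;
    return max;
}

/* Phase 2: realize the plan; this sketch only counts the operations. */
static EmitStats emit_call(const CallPlan *p) {
    EmitStats s = { 0, 0 };
    for (int i = 0; i < p->nargs; i++) {
        if (p->arg_reg[i] >= 0) s.reg_moves++;
        else s.stack_stores++;
    }
    return s;
}
```

Keeping the plan as plain data is what lets a frame planner inspect every call site before a single instruction is emitted.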

A handful of small capability flags/queries let the generic drivers specialize without arch branches: imm_legal/addr_legal (immediate and addressing-mode legality), has_store_zero_reg/store_zero_reg (aa64 xzr, rv64 x0 — store a constant 0 without materializing it), and the optional machine_op_clobbers, which reports the fixed registers an encoding clobbers as a side effect (x86 idiv writes rax/rdx, a variable shift uses cl, atomics use rax/rcx/rdx) so the allocator keeps values out of them; aa64/rv64 leave it NULL because their encodings have no such fixed clobbers.
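
To illustrate how a capability flag lets a generic driver specialize without an arch branch, here is a small sketch. The Target struct, its field layout, and store_const are hypothetical (the real NativeTarget may expose has_store_zero_reg as a query rather than a field); the hooks only count operations instead of encoding anything.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Target Target;
struct Target {
    int has_store_zero_reg;  /* capability flag */
    int store_zero_reg;      /* valid only when the flag is set */
    int scratch_reg;
    void (*load_imm)(Target *t, int rd, long value);
    void (*store)(Target *t, int rs, int slot);
    int ops_emitted;         /* bookkeeping for the demo */
    int last_store_src;
};

static void demo_load_imm(Target *t, int rd, long value) {
    (void)rd; (void)value;
    t->ops_emitted++;
}

static void demo_store(Target *t, int rs, int slot) {
    (void)slot;
    t->ops_emitted++;
    t->last_store_src = rs;
}

/* Generic driver code: no arch branch, only a capability check. */
static void store_const(Target *t, long value, int slot) {
    assert(t != NULL && t->load_imm != NULL && t->store != NULL);
    if (value == 0 && t->has_store_zero_reg) {
        t->store(t, t->store_zero_reg, slot);  /* one instruction */
        return;
    }
    t->load_imm(t, t->scratch_reg, value);     /* materialize first */
    t->store(t, t->scratch_reg, slot);
}

static Target make_target(int has_zero_reg) {
    Target t = {0};
    t.has_store_zero_reg = has_zero_reg;
    t.store_zero_reg = 31;  /* illustrative register number */
    t.scratch_reg = 16;     /* illustrative register number */
    t.load_imm = demo_load_imm;
    t.store = demo_store;
    return t;
}
```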

aa64 is the reference backend. src/arch/aa64/native.c is the most complete and most heavily commented implementation; the x64 and rv64 ports are written against it. Shared scaffolding extracted across all three lives in src/cg/native_frame.c (slot-offset arithmetic, the frame-final gate, the used-callee-save derivation, ABI-driven va-save sizing) and src/cg/native_argmove.c (the parallel-copy register shuffle for call-arg and param marshalling). What stays per-arch is everything ISA-specific: the slot-offset coordinate transform (fp/s0/rbp-relative), prologue/epilogue encoding, the slim-prologue variants, and instruction selection.

4. The ISA single-source-of-truth table

Each native arch has an isa.h + isa.c pair that is the one place its instruction bit-layout lives. isa.h holds inline pack/unpack encoders (e.g. aa64_movz, aa64_logsr_pack/_unpack) and a descriptor table (aa64_insn_table[]: {mnemonic, match, mask, format, flags}). isa.c holds the table data plus the operand print/parse dispatch keyed on format.

The key property is that three different consumers share the same tables:

        src/arch/aa64/isa.{h,c}   ◄── single source of truth
           │            │     │
   encoder │   disasm   │     │  standalone assembler
   (native.c emit)      │     │  (asm.c)
                  (disasm.c decode)

the encoder (codegen in native.c) calls the inline pack helpers to emit instruction words;
the disassembler (disasm.c) does one mask-and-compare against the table to identify a word, then dispatches on format to the same unpack helpers to extract operands;
the standalone assembler (asm.c, the kit as tool and inline-asm() handling, see ASM.md) parses mnemonics against the table and encodes through the same inline helpers.

(For aa64 the same header is also pulled in by link.c and dbg.c.) The invariant: when an opcode value or a field position changes, you update one site and the encoder, decoder, and assembler stay consistent. The table is ordered first-match-wins, with alias rows (tighter masks, e.g. mov ≡ orr Rd,zr,Rm, cmp ≡ subs zr,...) placed before the canonical rows so the disassembler renders the alias spelling while the assembler accepts both. x64/isa.{h,c} and rv64/isa.{h,c} follow the identical pattern; x64 additionally factors its byte-level REX/ModR/M/SIB primitives and prologue/epilogue into emit.c.
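
A minimal sketch of the first-match-wins idea, using the A64 ORR (shifted register, 64-bit) encoding and its MOV alias. The InsnRow layout is a reduced, hypothetical version of the descriptor row (no format or flags), and enc_orr_x and decode_mnemonic are illustrative names rather than kit's helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const char *mnemonic;
    uint32_t    match;
    uint32_t    mask;
} InsnRow;

static const InsnRow insn_table[] = {
    /* Alias row first, with the tighter mask:
       ORR Xd, XZR, Xm with no shift is spelled MOV Xd, Xm. */
    { "mov", 0xAA0003E0u, 0xFFE0FFE0u },
    /* Canonical row: ORR Xd, Xn, Xm {, shift #imm6}. */
    { "orr", 0xAA000000u, 0xFF200000u },
};

/* Encoder side: pack the register fields into the base opcode. */
static uint32_t enc_orr_x(unsigned rd, unsigned rn, unsigned rm) {
    assert(rd < 32 && rn < 32 && rm < 32);
    return 0xAA000000u | (rm << 16) | (rn << 5) | rd;
}

/* Decoder side: one mask-and-compare per row, first match wins. */
static const char *decode_mnemonic(uint32_t word) {
    size_t n = sizeof insn_table / sizeof insn_table[0];
    for (size_t i = 0; i < n; i++)
        if ((word & insn_table[i].mask) == insn_table[i].match)
            return insn_table[i].mnemonic;
    return NULL;  /* not in this two-row table */
}
```

Swapping the two rows would make every MOV decode as ORR, which is why the alias rows have to precede the canonical ones.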

The ArchAsmOps table (reloc_operand, is_local_branch, reloc_call_pair) is the textual complement to this: it tells the cc -S symbolizer how a relocated operand is spelled for the target object format (aarch64 ELF :lo12:sym, Mach-O sym@PAGEOFF, x86-64 sym(%rip)/@PLT, RISC-V %pcrel_hi/%pcrel_lo with anchor pairing) so that re-assembling kit's -S output reproduces byte-identical objects. It is the inverse of the assembler's reloc-modifier parser.

5. MCEmitter — one generic emitter for all native arches

src/arch/mc.c is a single generic machine-code/object emitter (MCEmitter, declared in src/arch/mc.h) used by every native arch. It sits between the backend (or the assembler) and the ObjBuilder, and it owns only the bytes-and-bookkeeping concerns that are genuinely arch-independent:

the current section and byte position;
the machine-label table: 1-based MCLabel ids, each carrying either a placement (sec, offset) or a list of pending forward-reference fixups that are applied at label_place; plus lazily-minted per-label SB_LOCAL symbols (.Lcfblk.N) so code-location references (&&label, jump-table entries) relocate against a real symbol and survive a re-encoding assembler;
relocation forwarding (emit_reloc, emit_reloc_at, emit_label_ref, emit_label_data_reloc);
the per-function context (mc_begin_function/mc_end_function) that the deferred data-section label relocs read;
CFI buffering: cfi_startproc/cfi_def_cfa/cfi_offset/... records are buffered per-function and flushed into a .eh_frame section by mc_emit_eh_frame at TU finalize. CFI directives are byte-position-bound, so they live on the one object that already tracks (section, offset). cfi_set_next_pc_offset provides a sticky prologue-PC override so backends that patch the prologue in func_end (after the live PC has moved past it) can pin every frame-state rule to the post-prologue PC.

Encoding itself is not MCEmitter's job — it writes whatever bytes it is handed. Arch-specific behavior enters through exactly two thin seams:

ArchImpl.apply_label_fixup — given a resolved label displacement, encode it into the already-emitted bytes (aa64 splits the 26-bit imm26 of B/BL, the 19-bit CONDBR, the ADR immlo/immhi, and falls back to a literal-pool LDR for out-of-range &&label; x64 writes a 4-byte rel32). mc.c builds an ArchLabelFixup descriptor and calls through arch_for_compiler.
The ArchImpl.cfi_* constants — the per-psABI CIE defaults (cfi_return_addr_reg, code/data alignment factors, initial CFA reg/offset) that mc_emit_eh_frame reads to encode the CIE.
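
The division of labor between the generic label table and the per-arch fixup hook might look roughly like this. Emitter and Label are hypothetical, much-reduced stand-ins for MCEmitter and MCLabel; the hook shown patches only the A64 B imm26 field, and the sketch stores words in host byte order for simplicity.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { MAX_FIXUPS = 8 };

typedef struct {
    int      placed;
    uint32_t offset;                /* valid once placed */
    uint32_t fixup_at[MAX_FIXUPS];  /* branch sites waiting for the label */
    int      nfixups;
} Label;

typedef struct {
    uint8_t  bytes[64];
    uint32_t pos;
    /* The arch seam: encode a resolved displacement into emitted bytes. */
    void (*apply_label_fixup)(uint8_t *site, int64_t disp);
} Emitter;

/* aa64-style hook: rewrite the imm26 field of a B instruction. */
static void fixup_b_imm26(uint8_t *site, int64_t disp) {
    uint32_t w;
    assert(disp % 4 == 0);
    memcpy(&w, site, 4);
    w = (w & 0xFC000000u) | ((uint32_t)(disp / 4) & 0x03FFFFFFu);
    memcpy(site, &w, 4);
}

static void emit_u32(Emitter *e, uint32_t w) {
    assert(e->pos + 4 <= sizeof e->bytes);
    memcpy(e->bytes + e->pos, &w, 4);
    e->pos += 4;
}

static uint32_t read_u32(const Emitter *e, uint32_t at) {
    uint32_t w;
    memcpy(&w, e->bytes + at, 4);
    return w;
}

/* Generic: emit a placeholder branch, patch now or queue a fixup. */
static void emit_branch(Emitter *e, Label *l) {
    uint32_t site = e->pos;
    emit_u32(e, 0x14000000u);  /* B with imm26 = 0 */
    if (l->placed) {
        e->apply_label_fixup(e->bytes + site,
                             (int64_t)l->offset - (int64_t)site);
    } else {
        assert(l->nfixups < MAX_FIXUPS);
        l->fixup_at[l->nfixups++] = site;
    }
}

/* Generic: placing a label drains its pending forward references. */
static void label_place(Emitter *e, Label *l) {
    l->placed = 1;
    l->offset = e->pos;
    for (int i = 0; i < l->nfixups; i++) {
        uint32_t site = l->fixup_at[i];
        e->apply_label_fixup(e->bytes + site,
                             (int64_t)l->offset - (int64_t)site);
    }
    l->nfixups = 0;
}
```

Everything except fixup_b_imm26 is arch-neutral here, which mirrors the claim that encoding enters only through the hook.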

This is the single most leverage-dense decision in the backend layer: the entire .eh_frame producer, label resolution, relocation plumbing, and section/byte management is written once and reused, with only those two pinpoint hooks per arch. mc.h is split out from arch.h precisely so the many emission-only consumers (per-arch emit/ops TUs, the assembler, the Debug producer) do not transitively pull in the decode/disasm/emu/dbg surfaces.

6. Register files

Each native backend declares its register file as static NativeRegInfo data in its native.c (e.g. aa_reg_info, wired into the NativeTarget at construction; the DWARF-index ↔ assembler-name tables that the ArchImpl exposes for objdump/asm live separately in regs.c). A NativeRegInfo is a set of NativeAllocClassInfo (one per NativeAllocClass: INT, FP, VEC), each carrying:

an ordered allocable list — registers the allocator may assign, ordered by preference (aa64 lists caller-saved first so the allocator prefers them and avoids prologue saves);
a scratch list — registers reserved for the backend's own temporaries (address materialization, atomic retry loops) and never handed to the allocator;
a NativePhysRegInfo row per physical register (class, ABI arg/ret index, caller/callee-saved flags, spill/copy costs);
precomputed caller/callee/arg/ret/reserved bitmasks.

This one declaration feeds both code paths:

the -O0 direct path resolves reg_info and the three class_info[] pointers once at NativeDirectTarget construction, so its register cache (allocate / evict / scratch-acquire) is an O(1) lookup. It uses the allocable order as a simple "next free register" pool with conservative flushes.
the -O1+ allocator (src/opt) consumes the same allocable lists, costs, and masks as its interference-graph inputs, and reports the callee-saves it actually used back through func_begin_known_frame / reserve_callee_saves so the backend can reserve save slots and emit the matching prologue/epilogue.

Because incoming arg registers are marked non-allocable, register-destination param binds can never alias a live incoming arg, which is what lets bind_param ordering be unconstrained and lets bind_params_end resolve a param permutation as a single parallel copy.
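
A compact sketch of how one register-file declaration can serve both consumers. RegClassInfo is a hypothetical reduction of NativeAllocClassInfo, the register subset is illustrative (on AArch64, x9..x15 are caller-saved temporaries and x19..x28 are callee-saved), and pick_reg and used_callee_saves are invented helper names.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    const uint8_t *allocable;          /* preference order */
    int            nallocable;
    uint32_t       callee_saved_mask;  /* bit r set: register r is callee-saved */
} RegClassInfo;

/* Caller-saved registers first, so they are preferred over
   registers that would cost a prologue save. */
static const uint8_t int_allocable[] = { 9, 10, 11, 19, 20 };

static const RegClassInfo int_class = {
    int_allocable,
    (int)(sizeof int_allocable / sizeof int_allocable[0]),
    (1u << 19) | (1u << 20),
};

/* Direct-path style pool: first register in preference order whose
   bit is clear in `busy`, or -1 when the class is exhausted. */
static int pick_reg(const RegClassInfo *c, uint32_t busy) {
    for (int i = 0; i < c->nallocable; i++) {
        assert(c->allocable[i] < 32);
        if (!(busy & (1u << c->allocable[i]))) return c->allocable[i];
    }
    return -1;
}

/* Allocator-side report: which used registers need save slots. */
static uint32_t used_callee_saves(const RegClassInfo *c, uint32_t used) {
    return used & c->callee_saved_mask;
}
```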

7. The ABI / calling-convention layer

src/abi is the single authority for target-dependent storage layout and call classification. Frontends lower source types to KitCgTypeId before entering it; from there the answers are language-agnostic. The public surface is TargetABI (src/abi/abi.h), reachable as c->abi and consulted by both the semantic codegen (src/cg/local.c for local sizing, cg/* for layout) and the optimizer (src/opt/cg_ir_lower.c resolves abi_cg_func_info to drive param-bind and call lowering). It is the canonical owner of: scalar sizes/aligns, struct/union record layout (including bitfield storage units), function argument classification (ABIFuncInfo: per-arg DIRECT/INDIRECT/EXPAND/IGNORE, sret, byval, sign/zero-ext, vararg routing), and va_list shape.

The layer is split into a shared core and per-ABI vtables:

abi.c holds everything C-standard-driven and identical across ABIs: the scalar profile (LP64 sizes), record layout computation, and the memoizing caches for record layouts and function info. This is shared by all targets.
abi_internal.h defines ABIVtable — the parts that genuinely vary: compute_func_info (the argument/return classifier) and the va_list type/layout facts.
registry.c selects the per-(arch, object-format) vtable. Like the arch and obj registries, it gates entries on the combined KIT_ARCH_* + KIT_OBJ_* flags and maps (KitArchKind, KitObjFmt) to an ABIVtable via abi_vtable_lookup. abi_init does this lookup once at compiler init.

ABIs are a derived axis, not a user-facing knob: every valid ABI is a 1:1 function of an (arch, OS-family) pair, where OS family follows from the object format (ELF → SysV/AAPCS-style, Mach-O → Apple, COFF → Windows). The registry therefore enumerates the cross-product cells that both sides enable:

  Arch    ELF (SysV-ish)    Mach-O (Apple)    COFF (Windows)
  ----    --------------    --------------    ---------------
  aa64    aapcs64           apple_arm64       aapcs64_windows
  x64     sysv_x64          apple_x64         win64_x64
  rv64    rv64              -                 -
  wasm    - (wasm32, via the wasm object format)
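
The native cells of the cross-product can be sketched as a lookup table. The enum and struct names are hypothetical, demo_abi_lookup is an illustrative stand-in for the real abi_vtable_lookup (whose exact signature is not shown here), the build-flag gating is omitted, and the wasm row is left out because it goes through its own object format.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ARCH_AA64, ARCH_X64, ARCH_RV64 } ArchKind;
typedef enum { OBJ_ELF, OBJ_MACHO, OBJ_COFF } ObjFmt;

/* Reduced stand-in for ABIVtable: only the name is kept. */
typedef struct { const char *name; } ABIVtable;

static const ABIVtable abi_aapcs64         = { "aapcs64" };
static const ABIVtable abi_apple_arm64     = { "apple_arm64" };
static const ABIVtable abi_aapcs64_windows = { "aapcs64_windows" };
static const ABIVtable abi_sysv_x64        = { "sysv_x64" };
static const ABIVtable abi_apple_x64       = { "apple_x64" };
static const ABIVtable abi_win64_x64       = { "win64_x64" };
static const ABIVtable abi_rv64            = { "rv64" };

static const struct {
    ArchKind         arch;
    ObjFmt           fmt;
    const ABIVtable *vt;
} abi_cells[] = {
    { ARCH_AA64, OBJ_ELF,   &abi_aapcs64 },
    { ARCH_AA64, OBJ_MACHO, &abi_apple_arm64 },
    { ARCH_AA64, OBJ_COFF,  &abi_aapcs64_windows },
    { ARCH_X64,  OBJ_ELF,   &abi_sysv_x64 },
    { ARCH_X64,  OBJ_MACHO, &abi_apple_x64 },
    { ARCH_X64,  OBJ_COFF,  &abi_win64_x64 },
    { ARCH_RV64, OBJ_ELF,   &abi_rv64 },
};

static_assert(sizeof abi_cells / sizeof abi_cells[0] == 7,
              "one row per populated native cell");

/* Returns NULL for a cell that no ABI covers (e.g. rv64 + COFF). */
static const ABIVtable *demo_abi_lookup(ArchKind arch, ObjFmt fmt) {
    size_t n = sizeof abi_cells / sizeof abi_cells[0];
    for (size_t i = 0; i < n; i++)
        if (abi_cells[i].arch == arch && abi_cells[i].fmt == fmt)
            return abi_cells[i].vt;
    return NULL;
}
```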

Each per-ABI TU (abi_aapcs64.c, abi_sysv_x64.c, abi_apple_arm64.c, abi_apple_x64.c, abi_win64_x64.c, abi_aapcs64_windows.c, abi_rv64.c) implements its compute_func_info and va_list facts; the Apple/Windows variants encode their divergences (e.g. Apple ARM64 routes the variadic tail exclusively to the stack, recorded as vararg_on_stack in ABIFuncInfo). The classification is the only authority — the NativeTarget plan/bind hooks and the optimizer both consume ABIFuncInfo; they never re-derive argument placement. Frame-relevant ABI facts (the vararg register-save-area size) are funneled through src/cg/native_frame.c so the per-arch magic numbers all trace back to one va_list-layout query.

8. Per-call cost model (aa64 -O1)

The fixed per-call overhead a backend pays — prologue, epilogue, and call-site setup, independent of the function body — dominates call-heavy workloads, so the aa64 known-frame path is structured to minimize it. The backend chooses one of a small set of frame shapes per function (decided in aa_func_begin_known_frame, encoded in native.c):

  Frame shape         When                                                                      Fixed insns (entry+exit, excl. ret)
  -----------         ----                                                                      -----------------------------------
  slim prologue       leaf-ish: no callee-saves, no alloca, no body slots, no outgoing stack    3 (optimal)
  fp_at_bottom        ≥1 callee-save/body slot, no outgoing stack args, frame ≤ 504             5 (optimal)
  slim_small_frame    as above but with outgoing stack args                                     7
  fat                 large frame / alloca / big saved-pair offset                              7+

The key structural idea is fp_at_bottom: when there are no outgoing stack args, the frame record moves to the bottom of the frame (fp = sp), so the sp adjustment folds into a pre/post-indexed stp x29,x30,[sp,#-N]! / ldp x29,x30,[sp],#N, and callee-saves stack above the record at positive offsets. This is the common case for any function that keeps a value live across a call without itself passing >8 register-class args, and it reaches the same 5-insn fixed cost as gcc -O0; the DWARF CFA becomes fp + frame_size. Functions with outgoing stack args can't move the record to the bottom (the args live there), so they keep the top-record slim_small_frame layout. This availability asymmetry — bottom-record only on the known-frame path — exists because the frame-size-dependent offsets require the frame to be final before the body, which is only true under the optimizer's func_begin_known_frame.
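
The frame-shape table can be read as a small decision function. The sketch below is an interpretation of the table only, not the logic in aa_func_begin_known_frame: FrameFacts and choose_frame_shape are hypothetical names, the 504 limit is assumed to apply to slim_small_frame as well, and the "big saved-pair offset" condition for the fat shape is not modeled.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    FRAME_SLIM,
    FRAME_FP_AT_BOTTOM,
    FRAME_SLIM_SMALL,
    FRAME_FAT
} FrameShape;

typedef struct {
    int      callee_saves;          /* number of callee-saved regs used */
    int      body_slots;            /* spill/local slots in the body */
    int      has_alloca;
    unsigned outgoing_stack_bytes;  /* stack-passed call arguments */
    unsigned frame_bytes;           /* final frame size */
} FrameFacts;

static FrameShape choose_frame_shape(const FrameFacts *f) {
    assert(f != NULL);
    if (f->has_alloca || f->frame_bytes > 504)
        return FRAME_FAT;
    if (!f->callee_saves && !f->body_slots && !f->outgoing_stack_bytes)
        return FRAME_SLIM;
    if (!f->outgoing_stack_bytes)
        return FRAME_FP_AT_BOTTOM;  /* record at the bottom, fp = sp */
    return FRAME_SLIM_SMALL;        /* args occupy the bottom */
}

/* Fixed entry+exit instruction count from the table, excluding ret;
   the value for FRAME_FAT is a lower bound. */
static int fixed_insns(FrameShape s) {
    switch (s) {
    case FRAME_SLIM:         return 3;
    case FRAME_FP_AT_BOTTOM: return 5;
    case FRAME_SLIM_SMALL:   return 7;
    case FRAME_FAT:          return 7;
    }
    return -1;  /* unreachable for valid enum values */
}
```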

Remaining and planned per-arch work (deferred niche encodings, audit follow-ups) is tracked in plan/ARCH.md.
