kit

git clone https://git.ryansepassi.com/git/kit.git

Architecture Backends

This document describes kit's arch backend abstraction: how a target architecture plugs into the compiler, what its responsibilities are, and how the three native backends (aa64, x64, rv64) are structured to maximize sharing while keeping the ISA-specific seams thin. It also covers the ABI / calling convention layer in src/abi, which is the single authority for storage layout and call classification. The semantic codegen surface a backend sits behind is in CODEGEN.md; the IR the optimizer feeds it is in IR.md; the SSA/regalloc machinery driving the optimizing path is in OPT.md; the standalone assembler that shares the ISA tables is in ASM.md. ABI content is canonical here.

1. Two layers of "backend": CGBackend and ArchImpl

A target enters the compiler through two related abstractions that are wired by struct-prefix subtyping (src/cg/cgtarget.h, src/arch/arch.h).

  CGBackend                      ArchImpl
  ---------                      --------
  const char* name;        +---> CGBackend backend;   (first field)
  CgTarget* (*make)(...);  |     KitArchKind kind;
                           |     CgTarget* (*cgtarget_new)(...);
                           |     ArchAsm*  (*asm_new)(...);
                           |     ArchDisasm* (*disasm_new)(...);
                           |     int (*apply_label_fixup)(...);
                           |     const LinkArchDesc*  link;
                           |     const ArchDecodeOps* decode;  (emu/objdump)
                           |     const ArchEmuOps*    emu;
                           |     const ArchDwarfOps*  dwarf;
                           |     const ArchDbgOps*    dbg;
                           |     const ArchAsmOps*    asm_ops;
                           |     register-file accessors;
                           |     CFI / .eh_frame CIE constants;
                           |     predefined target macros;

aa64, x64, and rv64 expose full ArchImpls (arch_impl_aa64, arch_impl_x64, arch_impl_rv64). wasm also exposes an ArchImpl (arch_impl_wasm), but it is a thin one: its machine-code seams (asm_new, apply_label_fixup, link, register accessors, CFI) are all NULL, since wasm32 has no native machine encoding, no stack-frame ABI, and no assembly form in this toolchain — it produces a WasmModule attached to the ObjBuilder and only provides a disassembler that renders WAT for objdump (see WASM.md). So the precise rule is: native machine-code arches have an ArchImpl whose machine seams are populated; c_target/check have only a CGBackend; wasm has an ArchImpl shell with the machine seams nulled out.

This split is deliberate. The pipeline picks a CGBackend per emit; metadata consumers (DWARF producer, debugger, disassembler, register-name lookups) reach for an ArchImpl and get back NULL when the target has no machine-code identity. Neither layer leaks into the other.
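The struct-prefix relationship can be sketched in a few lines of C. This is a minimal illustration, not kit's actual declarations: the Demo* types, the trimmed field lists, and demo_arch_for_backend are hypothetical stand-ins. It relies only on the C guarantee that a pointer to a struct and a pointer to its first member convert to each other.

```c
#include <stddef.h>

/* Hypothetical, trimmed-down stand-ins for CGBackend / ArchImpl. */
typedef struct DemoCGBackend {
    const char *name;
} DemoCGBackend;

typedef struct DemoArchImpl {
    DemoCGBackend backend;      /* first field: the "prefix" */
    int kind;
    int cfi_return_addr_reg;    /* example of machine-only metadata */
} DemoArchImpl;

/* A native arch: full ArchImpl (x30 is the aa64 return-address register). */
static const DemoArchImpl demo_aa64 = { { "aa64" }, 0, 30 };

/* A backend with no machine-code identity: only a CGBackend. */
static const DemoCGBackend demo_c_target = { "c_target" };

/* Upcast: every ArchImpl can be handed to code that wants a CGBackend. */
static const DemoCGBackend *demo_as_backend(const DemoArchImpl *a) {
    return &a->backend;
}

/* Metadata lookup: returns NULL when the backend is not the prefix of a
 * known ArchImpl. The identity check here is an assumption made for the
 * sketch; the real lookup goes through the arch registry. */
static const DemoArchImpl *demo_arch_for_backend(const DemoCGBackend *b) {
    if (b == &demo_aa64.backend)
        return (const DemoArchImpl *)b;   /* valid: backend is field 0 */
    return NULL;
}
```

In this sketch the pipeline would only ever see DemoCGBackend pointers, while a DWARF or disassembler consumer would call demo_arch_for_backend and handle the NULL case.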

2. The arch registry

src/arch/registry.c is the sole place that gates the arch vtable roster on KIT_ARCH_*_ENABLED — it is the canonical config-gate site for the backend axis, mirroring src/api/lang_registry.c for frontends. (The flags are also read by the parallel object-format and ABI registries, which gate their own rosters on the arch × format cross-product, and by src/core/config_assert.c for build-time validity asserts — see OBJ.md and §7.) Everything downstream of the registry operates on its outputs and never re-checks the build flags.

The registry holds a single static arch_impls[] array (each entry gated by its KIT_ARCH_*_ENABLED flag) and exposes two lookups:

The registry also owns the thin dispatchers arch_reloc_operand, arch_is_local_branch, and arch_reloc_call_pair, which forward to the target's ArchAsmOps (used by cc -S symbolization, see §4 and ASM.md), plus arch_disasm_* / arch_decode_* / formatter helpers. All are NULL-safe: a target lacking the relevant op table gets the documented "no transformation" answer rather than a crash.
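A rough sketch of the gating and NULL-safe dispatch pattern follows. The Demo* types, the two lookups (by kind and by name), and the default flag values are assumptions made for illustration; only the KIT_ARCH_*_ENABLED flag names and the idea of a single gated array come from the text above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed defaults for the sketch: rv64 is pretended to be compiled out. */
#ifndef KIT_ARCH_AA64_ENABLED
#define KIT_ARCH_AA64_ENABLED 1
#endif
#ifndef KIT_ARCH_X64_ENABLED
#define KIT_ARCH_X64_ENABLED 1
#endif
#ifndef KIT_ARCH_RV64_ENABLED
#define KIT_ARCH_RV64_ENABLED 0
#endif

typedef enum { DEMO_ARCH_AA64, DEMO_ARCH_X64, DEMO_ARCH_RV64 } DemoArchKind;

typedef struct DemoAsmOps {
    int (*is_local_branch)(uint32_t insn);
} DemoAsmOps;

typedef struct DemoArchImpl {
    DemoArchKind kind;
    const char *name;
    const DemoAsmOps *asm_ops;   /* may be NULL */
} DemoArchImpl;

#if KIT_ARCH_AA64_ENABLED
/* Unconditional B has opcode bits [31:26] == 000101. */
static int demo_aa64_is_local_branch(uint32_t insn) {
    return (insn & 0xFC000000u) == 0x14000000u;
}
static const DemoAsmOps demo_aa64_asm_ops = { demo_aa64_is_local_branch };
#endif

/* The only place the build flags are consulted. */
static const DemoArchImpl demo_arch_impls[] = {
#if KIT_ARCH_AA64_ENABLED
    { DEMO_ARCH_AA64, "aa64", &demo_aa64_asm_ops },
#endif
#if KIT_ARCH_X64_ENABLED
    { DEMO_ARCH_X64, "x64", NULL },   /* no op table in this sketch */
#endif
#if KIT_ARCH_RV64_ENABLED
    { DEMO_ARCH_RV64, "rv64", NULL },
#endif
};
#define DEMO_ARCH_COUNT (sizeof demo_arch_impls / sizeof demo_arch_impls[0])

static const DemoArchImpl *demo_arch_by_kind(DemoArchKind kind) {
    for (size_t i = 0; i < DEMO_ARCH_COUNT; i++)
        if (demo_arch_impls[i].kind == kind) return &demo_arch_impls[i];
    return NULL;   /* arch not compiled in */
}

static const DemoArchImpl *demo_arch_by_name(const char *name) {
    for (size_t i = 0; i < DEMO_ARCH_COUNT; i++)
        if (strcmp(demo_arch_impls[i].name, name) == 0) return &demo_arch_impls[i];
    return NULL;
}

/* NULL-safe dispatcher: a missing arch or op table yields "not a branch". */
static int demo_arch_is_local_branch(const DemoArchImpl *a, uint32_t insn) {
    if (!a || !a->asm_ops || !a->asm_ops->is_local_branch) return 0;
    return a->asm_ops->is_local_branch(insn);
}
```

Downstream code in this sketch never looks at the flags again: it asks the registry and treats NULL as "not built" or "no transformation".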

3. The NativeTarget contract

src/arch/native_target.h defines NativeTarget, the physical machine-emission contract that all three native backends implement. It is the layer where the generic codegen drivers stop speaking in semantic terms (CGLocal ids, high-level types) and start speaking in physical terms: hard registers, frame slots, legal immediates, and concrete addressing modes. A NativeTarget never allocates registers and never decides storage layout — callers hand it caller-selected, target-legal physical operands; the target only encodes.

It is driven from two directions:

   -O0 path:   CG semantic ops ──► NativeDirectTarget ──┐
                                   (src/cg/native_*)     ├──► NativeTarget ──► MCEmitter ──► ObjBuilder
   -O1+ path:  CG ──► record IR ──► opt passes ──────────┘        (~35 hooks)
                      (SSA, machinize, regalloc, pass_native_emit)

That a single ~35-hook contract serves both paths is what keeps the two code generators byte-compatible per arch. The hook families:

A handful of small capability flags/queries let the generic drivers specialize without arch branches: imm_legal/addr_legal (immediate and addressing-mode legality), has_store_zero_reg/store_zero_reg (aa64 xzr, rv64 x0 — store a constant 0 without materializing it), and the optional machine_op_clobbers, which reports the fixed registers an encoding clobbers as a side effect (x86 idiv writes rax/rdx, a variable shift uses cl, atomics use rax/rcx/rdx) so the allocator keeps values out of them; aa64/rv64 leave it NULL because their encodings have no such fixed clobbers.
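How a generic driver might consume these capability queries can be sketched as follows. The struct layout, the Demo* names, the opcode enum, and the register bit numbering are all hypothetical; the point is only that the driver branches on data and optional hooks, never on the arch identity.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { DEMO_OP_ADD, DEMO_OP_SDIV, DEMO_OP_SHL_VAR };

/* Hypothetical slice of a NativeTarget: just the capability surface. */
typedef struct DemoNativeTarget {
    bool (*imm_legal)(int64_t imm);
    bool has_store_zero_reg;
    int store_zero_reg;                       /* valid only if the flag is set */
    uint32_t (*machine_op_clobbers)(int op);  /* optional; NULL means none */
} DemoNativeTarget;

/* rv64 I-type immediates are 12-bit signed. */
static bool demo_rv64_imm_legal(int64_t imm) { return imm >= -2048 && imm <= 2047; }
/* x64 ALU immediates are at most 32-bit sign-extended. */
static bool demo_x64_imm_legal(int64_t imm) { return imm >= INT32_MIN && imm <= INT32_MAX; }

/* Bit positions follow the x86-64 encoding order: rax=0, rcx=1, rdx=2. */
static uint32_t demo_x64_clobbers(int op) {
    switch (op) {
    case DEMO_OP_SDIV:    return (1u << 0) | (1u << 2);  /* rax, rdx */
    case DEMO_OP_SHL_VAR: return (1u << 1);              /* cl lives in rcx */
    default:              return 0;
    }
}

static const DemoNativeTarget demo_rv64 = { demo_rv64_imm_legal, true, 0, NULL };
static const DemoNativeTarget demo_x64  = { demo_x64_imm_legal, false, -1, demo_x64_clobbers };

/* Generic side: registers the allocator must keep values out of. */
static uint32_t demo_clobbers(const DemoNativeTarget *t, int op) {
    return t->machine_op_clobbers ? t->machine_op_clobbers(op) : 0;
}

/* Generic side: register to store from for a constant, or -1 if the
 * constant has to be materialized into a register first. */
static int demo_zero_store_src(const DemoNativeTarget *t, int64_t value) {
    return (value == 0 && t->has_store_zero_reg) ? t->store_zero_reg : -1;
}
```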

aa64 is the reference backend. src/arch/aa64/native.c is the most complete and most heavily commented implementation; the x64 and rv64 ports are written against it. Shared scaffolding extracted across all three lives in src/cg/native_frame.c (slot-offset arithmetic, the frame-final gate, the used-callee-save derivation, ABI-driven va-save sizing) and src/cg/native_argmove.c (the parallel-copy register shuffle for call-arg and param marshalling). What stays per-arch is everything ISA-specific: the slot-offset coordinate transform (fp/s0/rbp-relative), prologue/epilogue encoding, the slim-prologue variants, and instruction selection.

4. The ISA single-source-of-truth table

Each native arch has an isa.h + isa.c pair that is the one place its instruction bit-layout lives. isa.h holds inline pack/unpack encoders (e.g. aa64_movz, aa64_logsr_pack/_unpack) and a descriptor table (aa64_insn_table[]: {mnemonic, match, mask, format, flags}). isa.c holds the table data plus the operand print/parse dispatch keyed on format.

The key property is that three different consumers share the same tables:

        src/arch/aa64/isa.{h,c}   ◄── single source of truth
           │            │     │
   encoder │   disasm   │     │  standalone assembler
   (native.c emit)      │     │  (asm.c)
                  (disasm.c decode)

(For aa64 the same header is also pulled in by link.c and dbg.c.) The invariant: when an opcode value or a field position changes, you update one site and the encoder, decoder, and assembler stay consistent. The table is ordered first-match-wins, with alias rows (tighter masks, e.g. mov for orr Rd,zr,Rm and cmp for subs zr,...) placed before the canonical rows so the disassembler renders the alias spelling while the assembler accepts both. x64/isa.{h,c} and rv64/isa.{h,c} follow the identical pattern; x64 additionally factors its byte-level REX/ModR/M/SIB primitives and prologue/epilogue into emit.c.
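The first-match-wins ordering can be illustrated with a two-row table. This is a sketch with hypothetical names (DemoInsnDesc, demo_decode, demo_orr_pack) and only the mnemonic/match/mask columns; the bit patterns are the standard 64-bit ORR (shifted register) encoding, with the mov alias pinning Rn to the zero register and the shift fields to zero.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct DemoInsnDesc {
    const char *mnemonic;
    uint32_t match;
    uint32_t mask;
} DemoInsnDesc;

/* Alias row first: its mask is tighter, so it wins when it applies. */
static const DemoInsnDesc demo_insn_table[] = {
    { "mov", 0xAA0003E0u, 0xFFE0FFE0u },  /* orr Xd, xzr, Xm, no shift */
    { "orr", 0xAA000000u, 0xFF200000u },  /* orr Xd, Xn, Xm{, shift #imm} */
};

/* Decoder side: linear scan, first matching row wins. */
static const DemoInsnDesc *demo_decode(uint32_t word) {
    size_t n = sizeof demo_insn_table / sizeof demo_insn_table[0];
    for (size_t i = 0; i < n; i++)
        if ((word & demo_insn_table[i].mask) == demo_insn_table[i].match)
            return &demo_insn_table[i];
    return NULL;
}

/* Encoder side: packs the register fields into the canonical row. */
static uint32_t demo_orr_pack(unsigned rd, unsigned rn, unsigned rm) {
    return 0xAA000000u | ((rm & 31u) << 16) | ((rn & 31u) << 5) | (rd & 31u);
}
```

With this ordering, demo_orr_pack(0, 31, 1) decodes as mov while demo_orr_pack(0, 1, 2) falls through to orr; swapping the two rows would make the alias unreachable.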

The ArchAsmOps table (reloc_operand, is_local_branch, reloc_call_pair) is the textual complement to this: it tells the cc -S symbolizer how a relocated operand is spelled for the target object format (aarch64 ELF :lo12:sym, Mach-O sym@PAGEOFF, x86-64 sym(%rip)/@PLT, RISC-V %pcrel_hi/%pcrel_lo with anchor pairing) so that re-assembling kit's -S output reproduces byte-identical objects. It is the inverse of the assembler's reloc-modifier parser.
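As a small illustration of what "spelling a relocated operand" means, the sketch below formats three of the spellings named above. The enum, the function name, and the signature are hypothetical and do not reflect the real reloc_operand hook; only the output spellings are taken from the text.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical selector: one value per (arch, object format, reloc) case. */
typedef enum {
    DEMO_SPELL_AA64_ELF_LO12,
    DEMO_SPELL_AA64_MACHO_PAGEOFF,
    DEMO_SPELL_X64_RIP
} DemoSpelling;

/* Writes the operand text for `sym` into out; returns 0 on success and
 * -1 if the selector is unknown or the buffer is too small. */
static int demo_spell_reloc(DemoSpelling s, const char *sym, char *out, size_t cap) {
    int n;
    switch (s) {
    case DEMO_SPELL_AA64_ELF_LO12:      n = snprintf(out, cap, ":lo12:%s", sym);   break;
    case DEMO_SPELL_AA64_MACHO_PAGEOFF: n = snprintf(out, cap, "%s@PAGEOFF", sym); break;
    case DEMO_SPELL_X64_RIP:            n = snprintf(out, cap, "%s(%%rip)", sym);  break;
    default: return -1;
    }
    return (n >= 0 && (size_t)n < cap) ? 0 : -1;
}
```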

5. MCEmitter — one generic emitter for all native arches

src/arch/mc.c is a single generic machine-code/object emitter (MCEmitter, declared in src/arch/mc.h) used by every native arch. It sits between the backend (or the assembler) and the ObjBuilder, and it owns only the bytes-and-bookkeeping concerns that are genuinely arch-independent:

Encoding itself is not MCEmitter's job — it writes whatever bytes it is handed. Arch-specific behavior enters through exactly two thin seams:

  1. ArchImpl.apply_label_fixup — given a resolved label displacement, encode it into the already-emitted bytes (aa64 splits the 26-bit imm26 of B/BL, the 19-bit CONDBR, the ADR immlo/immhi, and falls back to a literal-pool LDR for out-of-range &&label; x64 writes a 4-byte rel32). mc.c builds an ArchLabelFixup descriptor and calls through arch_for_compiler.

  2. The ArchImpl.cfi_* constants — the per-psABI CIE defaults (cfi_return_addr_reg, code/data alignment factors, initial CFA reg/offset) that mc_emit_eh_frame reads to encode the CIE.
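To make the first seam concrete, here is a minimal sketch of two label fixups: patching the imm26 field of an aa64 B/BL and writing an x64 rel32. The function names and the "return -1 when out of range" convention are assumptions; the real hook takes an ArchLabelFixup descriptor whose fields are not shown in this document.

```c
#include <stdint.h>

static uint32_t demo_rd32le(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static void demo_wr32le(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* aa64 B/BL: imm26 sits in bits [25:0] and counts 4-byte words, relative
 * to the branch instruction itself. Range is +/-128 MiB. */
static int demo_aa64_fix_branch26(uint8_t *insn_bytes, int64_t disp) {
    if (disp & 3) return -1;                            /* must be word-aligned */
    if (disp < -(INT64_C(1) << 27) || disp >= (INT64_C(1) << 27)) return -1;
    uint32_t insn = demo_rd32le(insn_bytes);
    uint32_t imm26 = (uint32_t)((uint64_t)disp >> 2) & 0x03FFFFFFu;
    demo_wr32le(insn_bytes, (insn & 0xFC000000u) | imm26);
    return 0;
}

/* x64 rel32: a plain little-endian signed 32-bit field. The caller is
 * assumed to have computed disp relative to the end of the instruction. */
static int demo_x64_fix_rel32(uint8_t *field, int64_t disp) {
    if (disp < INT32_MIN || disp > INT32_MAX) return -1;
    demo_wr32le(field, (uint32_t)(int32_t)disp);
    return 0;
}
```

The generic emitter in this sketch would only need to know where the bytes are and what the displacement is; the bit-splitting stays behind the per-arch function.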

This is the single most leverage-dense decision in the backend layer: the entire .eh_frame producer, label resolution, relocation plumbing, and section/byte management are written once and reused, with only those two pinpoint hooks per arch. mc.h is split out from arch.h precisely so the many emission-only consumers (per-arch emit/ops TUs, the assembler, the Debug producer) do not transitively pull in the decode/disasm/emu/dbg surfaces.

6. Register files

Each native backend declares its register file as static NativeRegInfo data in its native.c (e.g. aa_reg_info, wired into the NativeTarget at construction; the DWARF-index ↔ assembler-name tables that the ArchImpl exposes for objdump/asm live separately in regs.c). A NativeRegInfo is a set of NativeAllocClassInfo (one per NativeAllocClass: INT, FP, VEC), each carrying:

This one declaration feeds both code paths:

Because incoming arg registers are marked non-allocable, register-destination param binds can never alias a live incoming arg, which is what lets bind_param ordering be unconstrained and lets bind_params_end resolve a param permutation as a single parallel copy.
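Resolving a set of register moves "as a single parallel copy" means ordering them so no source is overwritten before it is read, and breaking cycles through a temporary. The sketch below shows one textbook way to do that; it is not the code in src/cg/native_argmove.c, and the DemoMove type, the fixed capacity, and the caller-supplied scratch register are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_MOVES 16

typedef struct DemoMove { int dst, src; } DemoMove;

/* Sequentializes the parallel copy in[0..n) into out and returns the number
 * of moves written. Assumes all dst registers are distinct, `scratch` does
 * not appear in any move, and out has room for 2*n entries. */
static size_t demo_parallel_copy(const DemoMove *in, size_t n, int scratch, DemoMove *out) {
    DemoMove work[DEMO_MAX_MOVES];
    bool pending[DEMO_MAX_MOVES];
    size_t emitted = 0, left = 0;

    if (n > DEMO_MAX_MOVES) return 0;
    for (size_t i = 0; i < n; i++) {
        work[i] = in[i];
        pending[i] = in[i].dst != in[i].src;   /* self-moves are no-ops */
        if (pending[i]) left++;
    }

    while (left > 0) {
        bool progress = false;
        for (size_t i = 0; i < n; i++) {
            if (!pending[i]) continue;
            /* A move is safe once no other pending move still reads its dst. */
            bool dst_still_read = false;
            for (size_t j = 0; j < n; j++)
                if (j != i && pending[j] && work[j].src == work[i].dst) {
                    dst_still_read = true;
                    break;
                }
            if (dst_still_read) continue;
            out[emitted++] = work[i];
            pending[i] = false;
            left--;
            progress = true;
        }
        if (!progress) {
            /* Only cycles remain: park one source in scratch and redirect
             * its readers, which unblocks the move that overwrites it. */
            size_t i = 0;
            while (!pending[i]) i++;
            int saved = work[i].src;
            out[emitted].dst = scratch;
            out[emitted].src = saved;
            emitted++;
            for (size_t j = 0; j < n; j++)
                if (pending[j] && work[j].src == saved) work[j].src = scratch;
        }
    }
    return emitted;
}
```

For a two-register swap {r0 <- r1, r1 <- r0} with scratch r9, this produces r9 <- r1, r1 <- r0, r0 <- r9; a chain without cycles needs no scratch move at all.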

7. The ABI / calling-convention layer

src/abi is the single authority for target-dependent storage layout and call classification. Frontends lower source types to KitCgTypeId before entering it; from there the answers are language-agnostic. The public surface is TargetABI (src/abi/abi.h), reachable as c->abi and consulted by both the semantic codegen (src/cg/local.c for local sizing, cg/* for layout) and the optimizer (src/opt/cg_ir_lower.c resolves abi_cg_func_info to drive param-bind and call lowering). It is the canonical owner of: scalar sizes/aligns, struct/union record layout (including bitfield storage units), function argument classification (ABIFuncInfo: per-arg DIRECT/INDIRECT/EXPAND/IGNORE, sret, byval, sign/zero-ext, vararg routing), and va_list shape.

The layer is split into a shared core and per-ABI vtables:

ABIs are a derived axis, not a user-facing knob: every valid ABI is a 1:1 function of an (arch, OS-family) pair, where OS family follows from the object format (ELF → SysV/AAPCS-style, Mach-O → Apple, COFF → Windows). The registry therefore enumerates the cross-product cells that both sides enable:

  Arch   ELF (SysV-ish)   Mach-O (Apple)   COFF (Windows)
  ----   --------------   --------------   ---------------
  aa64   aapcs64          apple_arm64      aapcs64_windows
  x64    sysv_x64         apple_x64        win64_x64
  rv64   rv64
  wasm   -  (wasm32, via the wasm object format)
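The "ABI is a function of (arch, object format)" rule can be sketched as a plain two-dimensional lookup. The enum names and demo_abi_for are hypothetical, the table holds ABI names rather than real vtables, placing rv64 in the ELF column is an assumption read off the table above, and wasm is left out because it goes through its own object format.

```c
#include <stddef.h>

typedef enum { ABI_DEMO_AA64, ABI_DEMO_X64, ABI_DEMO_RV64, ABI_DEMO_ARCH_COUNT } AbiDemoArch;
typedef enum { ABI_DEMO_ELF, ABI_DEMO_MACHO, ABI_DEMO_COFF, ABI_DEMO_FMT_COUNT } AbiDemoFmt;

/* One cell per (arch, format) pair; NULL marks a pair with no ABI. */
static const char *const demo_abi_cells[ABI_DEMO_ARCH_COUNT][ABI_DEMO_FMT_COUNT] = {
    [ABI_DEMO_AA64] = { "aapcs64",  "apple_arm64", "aapcs64_windows" },
    [ABI_DEMO_X64]  = { "sysv_x64", "apple_x64",   "win64_x64" },
    [ABI_DEMO_RV64] = { "rv64",     NULL,          NULL },
};

/* Returns the ABI name for the pair, or NULL if the cell is empty or the
 * indices are out of range. A real registry would also skip cells whose
 * arch or format is disabled at build time. */
static const char *demo_abi_for(AbiDemoArch arch, AbiDemoFmt fmt) {
    if ((unsigned)arch >= ABI_DEMO_ARCH_COUNT || (unsigned)fmt >= ABI_DEMO_FMT_COUNT)
        return NULL;
    return demo_abi_cells[arch][fmt];
}
```

Because the ABI is derived rather than chosen, a caller in this sketch never passes an ABI name in; it passes the arch and format it already has.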

Each per-ABI TU (abi_aapcs64.c, abi_sysv_x64.c, abi_apple_arm64.c, abi_apple_x64.c, abi_win64_x64.c, abi_aapcs64_windows.c, abi_rv64.c) implements its compute_func_info and va_list facts; the Apple/Windows variants encode their divergences (e.g. Apple ARM64 routes the variadic tail exclusively to the stack, recorded as vararg_on_stack in ABIFuncInfo). The classification is the only authority — the NativeTarget plan/bind hooks and the optimizer both consume ABIFuncInfo; they never re-derive argument placement. Frame-relevant ABI facts (the vararg register-save-area size) are funneled through src/cg/native_frame.c so the per-arch magic numbers all trace back to one va_list-layout query.

8. Per-call cost model (aa64 -O1)

The fixed per-call overhead a backend pays — prologue, epilogue, and call-site setup, independent of the function body — dominates call-heavy workloads, so the aa64 known-frame path is structured to minimize it. The backend chooses one of a small set of frame shapes per function (decided in aa_func_begin_known_frame, encoded in native.c):

  Frame shape        When                                                                      Fixed insns (entry+exit, excl. ret)
  -----------        ----                                                                      -----------------------------------
  slim prologue      leaf-ish: no callee-saves, no alloca, no body slots, no outgoing stack    3 (optimal)
  fp_at_bottom       ≥1 callee-save/body slot, no outgoing stack args, frame ≤ 504             5 (optimal)
  slim_small_frame   as above but with outgoing stack args                                     7
  fat                large frame / alloca / big saved-pair offset                              7+

The key structural idea is fp_at_bottom: when there are no outgoing stack args, the frame record moves to the bottom of the frame (fp = sp), so the sp adjustment folds into a pre/post-indexed stp x29,x30,[sp,#-N]! / ldp x29,x30,[sp],#N, and callee-saves stack above the record at positive offsets. This is the common case for any function that keeps a value live across a call without itself passing >8 register-class args, and it reaches the same 5-insn fixed cost as gcc -O0; the DWARF CFA becomes fp + frame_size. Functions with outgoing stack args can't move the record to the bottom (the args live there), so they keep the top-record slim_small_frame layout. This availability asymmetry — bottom-record only on the known-frame path — exists because the frame-size-dependent offsets require the frame to be final before the body, which is only true under the optimizer's func_begin_known_frame.
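The frame-shape choice can be read as a small decision function over facts that are known once the frame is final. The sketch below is a simplified reading of the table, not the logic in aa_func_begin_known_frame: the DemoFrameFacts fields are hypothetical, the "big saved-pair offset" condition is not modeled, and the 504 limit is taken from the table (it matches the largest positive offset a 64-bit stp/ldp pair can encode).

```c
#include <stdbool.h>

typedef enum {
    DEMO_FRAME_SLIM,
    DEMO_FRAME_FP_AT_BOTTOM,
    DEMO_FRAME_SLIM_SMALL,
    DEMO_FRAME_FAT
} DemoFrameShape;

/* Hypothetical summary of a function whose frame is already final. */
typedef struct DemoFrameFacts {
    int callee_saves;           /* callee-saved registers actually used */
    int body_slots;             /* spill/local slots */
    bool has_alloca;
    int outgoing_stack_bytes;   /* stack-passed call arguments */
    int frame_size;             /* total frame size in bytes */
} DemoFrameFacts;

static DemoFrameShape demo_choose_frame(const DemoFrameFacts *f) {
    if (f->has_alloca || f->frame_size > 504)
        return DEMO_FRAME_FAT;
    if (f->outgoing_stack_bytes > 0)
        return DEMO_FRAME_SLIM_SMALL;     /* args occupy the frame bottom */
    if (f->callee_saves == 0 && f->body_slots == 0)
        return DEMO_FRAME_SLIM;
    return DEMO_FRAME_FP_AT_BOTTOM;       /* record at sp, saves above it */
}

/* Fixed entry+exit instruction count from the table; for the fat shape
 * this is only the lower bound ("7+"). */
static int demo_fixed_insns(DemoFrameShape s) {
    switch (s) {
    case DEMO_FRAME_SLIM:         return 3;
    case DEMO_FRAME_FP_AT_BOTTOM: return 5;
    default:                      return 7;
    }
}
```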


Remaining and planned per-arch work (deferred niche encodings, audit follow-ups) is tracked in plan/ARCH.md.