kit

git clone https://git.ryansepassi.com/git/kit.git

C Source Backend

kit's no-deps posture rules out linking against LLVM or GCC for an industrial-strength optimizer. The C-source backend gives kit users that optimizer anyway: it emits C source in the GCC/clang dialect (cc -S=c-style, selected via --emit=c) and hands the result to whatever gcc/clang exists on the build host. The host C compiler then performs ABI lowering, instruction selection, register allocation, and aggressive optimization (scalar replacement of aggregates, vectorization, etc.). kit's job here is to produce legal and complete C, not human-readable C, and the output is tied to the triple kit was invoked for (see "Target-locked, not portable" below). See CODEGEN.md for the CG model this backend consumes and IR.md for the semantic IR it walks.

A CGBackend, not an ArchImpl

There is no ArchImpl for the C target. An ArchImpl describes a machine: registers, encodings, an MCEmitter that writes object bytes. The C backend produces no machine code and no object bytes; the eventual machine code runs on the host triple after the host cc compiles the emitted source. Instead, it is a CGBackend (cg_backend_c_target in src/arch/c_target/target.c): the small "give me a CgTarget for this Compiler + ObjBuilder + emit options" unit the registry hands out. The registry selects it in cg_backend_for_session (src/arch/registry.c) whenever CodeOptions.emit_c_source is set; output is written to CodeOptions.c_source_writer instead of to an object file.

Two-stage pipeline: record CG into IR, then emit C from IR

The backend does not translate CG calls to C text directly. It splits into two stages with the semantic IR (see IR.md) as the seam:

  frontend (lang/c, lang/toy)
        | kit_cg_* calls
        v
  CgIrRecorder  (src/cg/ir_recorder.c)   <- a CgTarget that records
        |                                    semantic CG into CgIrModule
        |  [opt passes run here if opt_level>0]
        v
  c_emit_ir_module  (src/arch/c_target/ir_emit.c)
        |  switch over CgIrOp -> c_emit_* calls
        v
  c_emit_*  (src/arch/c_target/c_emit.c)
        |  string buffers (cbuf)
        v
  KitWriter -> .c text

c_target_backend_make constructs the C emitter (CTarget, via c_emit_target_new) and then wraps it in a CgIrRecorder whose finalize callback (c_ir_finalize) replays the recorded CgIrModule through c_emit_ir_module and then flushes the C source. The recorder is what the session and frontend actually drive; the CTarget is private behind the recorder's user pointer.

This design has two consequences worth stating. First, the C target never sees a live CG call stream — it walks a finished CgIrModule, so its emission code is a straightforward op-dispatch (ir_emit_inst) with no value-stack bookkeeping of its own. Second, because the recorder is itself a normal CgTarget, the IR optimizer can sit between record and emit exactly as it does for the machine backends: at opt_level > 0 the session wraps the recorder in opt_cgtarget_new, so the C emitted is the optimized IR. Deferring the heavy optimization to the host cc is still the intent, but the kit-side IR passes are not bypassed.
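The record-then-replay split can be illustrated with a toy model. This is only a sketch: ToyOp, ToyModule, toy_record and toy_emit are hypothetical names, and the three-op set is invented for illustration rather than taken from kit's CgIrOp. The first stage only appends ops; the second walks the finished list and dispatches on op kind, with no value stack of its own.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical toy op set; kit's real CgIrOp enum is not shown in this document. */
typedef enum { TOY_CONST, TOY_ADD, TOY_RET } ToyOpKind;

typedef struct {
    ToyOpKind kind;
    int dst, a, b; /* local ids; for TOY_CONST, `a` holds the literal */
} ToyOp;

typedef struct {
    ToyOp ops[16];
    size_t count;
} ToyModule;

/* Stage 1: the "recorder" only appends; it produces no text. */
static int toy_record(ToyModule *m, ToyOp op) {
    if (m->count >= sizeof m->ops / sizeof m->ops[0]) return -1;
    m->ops[m->count++] = op;
    return 0;
}

/* Stage 2: walk the finished module and dispatch on op kind.
 * Returns the number of bytes written (excluding the NUL). */
static size_t toy_emit(const ToyModule *m, char *out, size_t cap) {
    size_t len = 0;
    if (cap == 0) return 0;
    out[0] = '\0';
    for (size_t i = 0; i < m->count; i++) {
        const ToyOp *op = &m->ops[i];
        int n = 0;
        switch (op->kind) {
        case TOY_CONST:
            n = snprintf(out + len, cap - len, "v%d = %d;\n", op->dst, op->a);
            break;
        case TOY_ADD:
            n = snprintf(out + len, cap - len, "v%d = v%d + v%d;\n",
                         op->dst, op->a, op->b);
            break;
        case TOY_RET:
            n = snprintf(out + len, cap - len, "return v%d;\n", op->a);
            break;
        }
        if (n < 0 || (size_t)n >= cap - len) {
            out[len] = '\0'; /* drop the truncated statement */
            break;
        }
        len += (size_t)n;
    }
    return len;
}
```

Because the emitter only ever sees a complete ToyModule, any pass that rewrites the op list can run between the two stages without the emitter knowing, which mirrors the role the IR optimizer plays above.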

ir_emit.c carries one piece of glue beyond pure dispatch: CG scope handles are recorder-relative, so CIrEmitter keeps a scope_map that translates each recorded CGScope to the handle the emitter minted at scope_begin.

Target-locked, not portable

The emitted C is target-locked: it must be compiled for the same triple that kit --target= selected. Compiled for a different triple, it may silently misbehave. The cause is fundamental to CG: semantic lvalue chains are flattened to (base, byte_offset) before any backend sees them. kit_cg_field(g, n) arrives as an indirect access with ofs=12; the field identity is gone, and that 12 came from the kit-selected target's record layout. If a downstream compiler assumes a different layout, the access is wrong. This is the same trade LLVM IR makes (datalayout-locked), and it does not limit the stated goal, since the user already fixed the triple at kit invocation.
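A small sketch of why a flattened offset is layout-dependent. The record struct pair and the helpers align_up, offset_of_b and load_i32_at are hypothetical and not part of kit. The offset of b depends on the size of long, which differs between, for example, ILP32 and LP64 targets, so an offset computed for one triple reads the wrong bytes on another.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical source-level record, not taken from kit. */
struct pair { long a; int32_t b; };

/* Round n up to a multiple of align (align must be nonzero). */
static size_t align_up(size_t n, size_t align) {
    return (n + align - 1) / align * align;
}

/* Offset of `b` if `long` had the given size, assuming int32_t is 4-aligned:
 * 4 when long is 4 bytes, 8 when long is 8 bytes. */
static size_t offset_of_b(size_t long_size) {
    return align_up(long_size, 4);
}

/* What a (base, byte_offset) access looks like once the field name is gone. */
static int32_t load_i32_at(const void *base, size_t ofs) {
    return *(const int32_t *)((const char *)base + ofs);
}
```

Calling load_i32_at with the offset computed for the compiling host's layout returns b; calling it with the other offset reads part of a or runs past the record, and nothing in the C text flags it.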

Semantic temporaries become C locals

CG mints fresh, unbounded local ids (CGLocal); each one becomes a single declared C automatic variable named vN (c_local_name). Declaration is lazy: the first time a local is referenced, c_ensure_local appends one typed declaration to the per-function decls buffer. Locals are zero-initialized (= 0, or = {0} for aggregates) and marked __attribute__((unused)) to silence host-cc diagnostics on control flow the host can't reason through; the host's dead-store elimination removes the initializer when a real assignment dominates it.
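Lazy declaration might look roughly like the following. ToyFn and toy_ensure_local are hypothetical stand-ins for the per-function state and c_ensure_local, and the exact placement of the attribute in kit's output is an assumption; only the declare-once, zero-initialize behavior is taken from the text above.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

enum { TOY_MAX_LOCALS = 64 };

/* Hypothetical per-function state: which ids are declared, plus the decls text. */
typedef struct {
    bool declared[TOY_MAX_LOCALS];
    char decls[1024];
    size_t len;
} ToyFn;

/* Returns 1 if a declaration was appended, 0 if the local was already
 * declared, -1 on a bad id or buffer overflow. */
static int toy_ensure_local(ToyFn *f, unsigned id, const char *ctype,
                            bool aggregate) {
    if (id >= TOY_MAX_LOCALS) return -1;
    if (f->declared[id]) return 0;
    size_t room = sizeof f->decls - f->len;
    int n = snprintf(f->decls + f->len, room,
                     "  %s v%u __attribute__((unused)) = %s;\n",
                     ctype, id, aggregate ? "{0}" : "0");
    if (n < 0 || (size_t)n >= room) {
        f->decls[f->len] = '\0'; /* discard the partial line */
        return -1;
    }
    f->len += (size_t)n;
    f->declared[id] = true;
    return 1;
}
```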

Each local has exactly one declared C type, recorded in local_type and checked for consistency on re-use. Where CG arithmetic crosses pointer/integer or differing-width boundaries, the emitter bridges through uintptr_t casts so host-cc warnings (-Wint-conversion and friends) stay quiet while the bit semantics are exact. Signedness-sensitive operations (unsigned divide/remainder, logical shift right, unsigned compares) get an explicit width-sized signed/unsigned cast on their operands.
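The following hand-written functions show the kind of casts this implies when locals are declared with signed C types. The function names are hypothetical, and the exact expressions kit emits may differ; the point is that the operand casts, not the declared local types, select the unsigned operation.

```c
#include <stdint.h>

/* Unsigned divide on locals declared int32_t: cast operands, then cast back. */
static int32_t udiv_i32(int32_t v0, int32_t v1) {
    return (int32_t)((uint32_t)v0 / (uint32_t)v1);
}

/* Logical shift right: shifting the unsigned view fills with zeros.
 * The shift count is assumed to be in [0, 31]. */
static int32_t lshr_i32(int32_t v0, int32_t v1) {
    return (int32_t)((uint32_t)v0 >> (uint32_t)v1);
}

/* Unsigned less-than on signed-typed locals. */
static int ult_i32(int32_t v0, int32_t v1) {
    return (uint32_t)v0 < (uint32_t)v1;
}

/* Pointer plus integer bridged through uintptr_t, so no pointer/integer
 * conversion warning fires. The round trip is implementation-defined but
 * behaves as plain address arithmetic on mainstream hosts. */
static void *ptr_add(void *p, int64_t delta) {
    return (void *)((uintptr_t)p + (uintptr_t)delta);
}
```

For example, udiv_i32(-2, 2) divides 0xFFFFFFFE by 2 and yields 2147483647, whereas a plain signed -2 / 2 would yield -1.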

Types: scalars map, aggregates are opaque bytes

c_typename lowers a CG type id to a C type:

The key invariant: composite types are opaque storage. A record or array of size N and alignment A becomes

typedef struct { _Alignas(A) uint8_t raw[N]; } __ty_<id>;

regardless of its fields. Field and element access is never expressed through C field syntax; CG already speaks in (base, byte_offset), so every access is an indirect dereference (*(T*)((char*)addr + ofs)). Emitting types as raw bytes sidesteps all C aggregate-semantics ambiguity (bitfield layout rules, array decay, packed/aligned attribute interactions) and keeps types orthogonal to access patterns. Modern hosts see through the offset arithmetic for SROA anyway. Function types instead become a function-pointer typedef R (*__ty_<id>)(...) for indirect calls. Multi-result returns synthesize a guarded __kit_tuple<N>_... struct.
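As an illustration, here is a hand-written approximation for a hypothetical record of size 16 and alignment 8. The type id 42 and the helper names are made up. Whether kit relies on particular host flags for aliasing is not stated in this document, so the sketch leaves that question aside.

```c
#include <stdalign.h>
#include <stdint.h>

/* Opaque storage for a hypothetical 16-byte, 8-aligned record. */
typedef struct { _Alignas(8) uint8_t raw[16]; } __ty_42;

/* Every access is base + byte offset, then a typed dereference. */
static void store_i32(void *addr, long ofs, int32_t v) {
    *(int32_t *)((char *)addr + ofs) = v;
}

static int32_t load_i32(const void *addr, long ofs) {
    return *(const int32_t *)((const char *)addr + ofs);
}
```

The typedef carries only size and alignment; two records with entirely different fields but the same N and A would get structurally identical typedefs under different ids.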

Typedefs are emitted into a TU-wide typedefs buffer, keyed by unaliased type id with a per-id state machine (unseen / inflight / emitted) so each type is declared once, dependencies first, and recursive types degrade to forward-only rather than looping.
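A minimal sketch of a three-state, dependencies-first walk. ToyTypes and its helpers are hypothetical, and the "forward N" line stands in for whatever forward-only form kit actually emits for a type that is reached again while still in flight.

```c
#include <stdio.h>
#include <string.h>

typedef enum { TY_UNSEEN, TY_INFLIGHT, TY_EMITTED } TyState;

enum { TOY_NTYPES = 4, TOY_MAXDEPS = 2 };

typedef struct {
    int deps[TOY_NTYPES][TOY_MAXDEPS]; /* -1 terminates each list */
    TyState state[TOY_NTYPES];
    char out[256];
    size_t len;
} ToyTypes;

static void toy_types_init(ToyTypes *t) {
    memset(t, 0, sizeof *t); /* all states start as TY_UNSEEN */
    for (int i = 0; i < TOY_NTYPES; i++)
        for (int j = 0; j < TOY_MAXDEPS; j++)
            t->deps[i][j] = -1;
}

static void toy_append(ToyTypes *t, const char *kind, int id) {
    size_t room = sizeof t->out - t->len;
    int n = snprintf(t->out + t->len, room, "%s %d\n", kind, id);
    if (n > 0 && (size_t)n < room) t->len += (size_t)n;
    else t->out[t->len] = '\0';
}

static void toy_emit_type(ToyTypes *t, int id) {
    if (t->state[id] == TY_EMITTED) return;  /* declared once */
    if (t->state[id] == TY_INFLIGHT) {       /* cycle: do not recurse */
        toy_append(t, "forward", id);
        return;
    }
    t->state[id] = TY_INFLIGHT;
    for (int i = 0; i < TOY_MAXDEPS && t->deps[id][i] >= 0; i++)
        toy_emit_type(t, t->deps[id][i]);    /* dependencies first */
    toy_append(t, "typedef", id);
    t->state[id] = TY_EMITTED;
}
```

With type 0 depending on 1, and 1 depending on 2 and back on 0, emitting type 0 produces type 2, a forward mention of 0, then 1, then 0, and terminates despite the cycle.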

TU structure and the deferred prologue

The emitter accumulates several string buffers and flushes them in a fixed order at c_emit_finalize:

  prologue        #include <stdint.h>, <stdalign.h>  (+ stdarg/setjmp if used)
  typedefs        __ty_* opaque-storage and function-pointer typedefs
  forwards        one `RetT name(params);` per function seen
  data_defs       data symbol definitions and extern declarations
  function bodies signatures + spliced-in decls + body statements

Header choice beyond the two unconditional includes is deferred to finalize so the include lines stay deterministic regardless of when a feature was first referenced. The data walk (c_emit_data) populates two buffers: the data_defs buffer it owns, and — as a side effect, since data initializers can take the address of functions — the function forward-declaration buffer. So the walk runs first, then forwards is flushed, then data_defs. Forwards precede data definitions because a data initializer may reference a function by name.
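A hand-written approximation of a tiny translation unit in this order may help. The symbol names, type ids and the simplified data definition are illustrative assumptions, not actual kit output. The data definition takes the address of a function, which only compiles because the forward declaration precedes it.

```c
/* prologue */
#include <stdint.h>
#include <stdalign.h>

/* typedefs */
typedef struct { _Alignas(4) uint8_t raw[8]; } __ty_3;
typedef int32_t (*__ty_5)(int32_t);

/* forwards */
int32_t twice(int32_t v0);

/* data_defs: the initializer references a function by name */
__ty_5 handler = twice;
__ty_3 scratch = {{0}};

/* function bodies */
int32_t twice(int32_t v0) {
    int32_t v1 __attribute__((unused)) = 0;
    v1 = v0 + v0;
    return v1;
}
```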

Per function, declarations and body text are buffered separately: CG needs all locals declared at the top of the function, but surfaces them interleaved with body emission. func_end records the byte offset just past the opening brace (fn_body_start) and splices the accumulated decls in there. A last_was_terminator flag drops dead statements after an unconditional return/goto so the output is not littered with unreachable C.
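The splice itself is a plain buffer insertion at a remembered offset. The sketch below uses a fixed-capacity char array and a hypothetical splice_at helper; kit's cbuf API is not shown in this document and may look different.

```c
#include <string.h>

/* Insert `ins` into the NUL-terminated string in `buf` (capacity `cap`)
 * at byte offset `at`. Returns 0 on success, -1 if it does not fit. */
static int splice_at(char *buf, size_t cap, size_t at, const char *ins) {
    size_t len = strlen(buf);
    size_t n = strlen(ins);
    if (at > len || len + n + 1 > cap) return -1;
    memmove(buf + at + n, buf + at, len - at + 1); /* shift tail and NUL */
    memcpy(buf + at, ins, n);
    return 0;
}
```

Recording the offset just past the opening brace and splicing there at function end puts every declaration ahead of the first body statement, even though the declarations were discovered while the body was being written.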

Control flow

CG's structured scopes map to C control flow where possible. SCOPE_LOOP becomes for (;;) { ... }; within such a structured scope, jumps to the scope's break/continue labels are emitted as C break;/continue; rather than goto. Everything else lowers to labels and goto, which the host cc re-structures. Switches, computed/indirect branches (GCC &&label / goto *p), and address-of-label all have direct emitters.
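The shape of the resulting C might resemble the following hand-written function. The function and the label name L1 are hypothetical. Edges that target the loop's own break/continue points use the C keywords, and the one edge that leaves the structure goes through a label.

```c
#include <stdint.h>

/* Sum the odd integers in [0, n); return -1 as soon as the sum exceeds cap. */
static int32_t sum_odds(int32_t n, int32_t cap) {
    int32_t v0 = 0; /* induction variable */
    int32_t v1 = 0; /* running sum */
    for (;;) {                    /* SCOPE_LOOP */
        if (!(v0 < n)) break;     /* jump to the scope's break label */
        if ((v0 & 1) == 0) {
            v0 = v0 + 1;
            continue;             /* jump to the scope's continue label */
        }
        v1 = v1 + v0;
        v0 = v0 + 1;
        if (v1 > cap) goto L1;    /* unstructured edge: label + goto */
    }
    return v1;
L1:
    return -1;
}
```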

Tail calls

CG owns the tail-call policy (see CODEGEN.md): before flagging a call as a sibling call it asks the target whether the call is realizable, and only sets CG_CALL_TAIL when the target agrees. The C backend answers through c_emit_tail_call_unrealizable_reason_for, wired into the recorder config as tail_call_unrealizable_reason. A realizable tail call is emitted as __attribute__((musttail)) return <call>;, which clang lowers to a guaranteed sibling call; the host compiler does the actual stack-reuse.

The reason hook declines the cases clang's musttail cannot honor, returning a human-readable string instead of NULL: a variadic caller, a variadic callee, or a caller/callee parameter-count mismatch. For those CG leaves the call unflagged and the backend emits an ordinary call. This keeps the C output within the subset clang's musttail accepts rather than asserting a sibling call the host would reject.
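A reason hook of this shape could be sketched as below. ToySig and toy_tail_call_unrealizable_reason are hypothetical; the real hook's signature and its message strings are not given in this document. Only the three declined cases come from the text above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of a function signature. */
typedef struct {
    size_t nparams;
    bool variadic;
} ToySig;

/* Returns NULL when a musttail call is realizable, otherwise a
 * human-readable reason why it is not. */
static const char *toy_tail_call_unrealizable_reason(const ToySig *caller,
                                                     const ToySig *callee) {
    if (caller->variadic) return "caller is variadic";
    if (callee->variadic) return "callee is variadic";
    if (caller->nparams != callee->nparams)
        return "caller/callee parameter count mismatch";
    return NULL;
}
```

Returning a string rather than a boolean lets the caller of the hook report why a requested tail call fell back to an ordinary call.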

Mapping kit semantics onto GCC/clang C

GCC/clang-extension C covers everything CG can express, so each feature maps to a builtin or extension rather than a runtime shim:

Data symbols and cross-symbol relocations

Data emission walks the ObjBuilder's symbols at finalize. A defined data symbol is emitted as a typed file-scope object carrying its initializer bytes; undefined data becomes an extern uint8_t name[]; declaration. Linkage, visibility, weakness, const (for rodata), static (local binding) and _Thread_local (TLS) are reproduced via attributes and qualifiers so the host linker reconstructs the same symbol table.

Cross-symbol references (relocations into a symbol's bytes) are the interesting case. Rather than a runtime constructor, the symbol's storage struct is split so each relocated slot is a real typed field: raw uint8_t chunk_K[] runs are interleaved with relocation-width fields (void* for 8-byte slots, uint32_t for R_ABS32). The initializer assigns standard C address-of expressions ((void*)((char*)&target + addend)) to those fields, so the host C compiler and linker resolve the references natively. The relocation slots are sorted by offset, and when any are present the struct is __attribute__((packed)) so the field layout matches the original byte image exactly.
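A hand-written approximation of such a split storage struct, for a hypothetical symbol with 4 raw bytes, one pointer-sized relocation with addend 8, and 2 trailing raw bytes. The names blob_storage, blob, reloc_0 and target are assumptions, and blob_read_slot exists only so the result can be inspected.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical relocation target. */
int32_t target[4] = {10, 20, 30, 40};

/* Packed so the relocated slot sits at byte offset 4, as in the byte image. */
struct __attribute__((packed)) blob_storage {
    uint8_t chunk_0[4];
    void   *reloc_0;
    uint8_t chunk_1[2];
};

/* The address expression is a link-time constant; no constructor runs. */
struct blob_storage blob = {
    {0xde, 0xad, 0xbe, 0xef},
    (void *)((char *)&target + 8),
    {0x01, 0x02},
};

/* Read the (possibly unaligned) slot back through memcpy. */
static void *blob_read_slot(void) {
    void *p;
    memcpy(&p, (const char *)&blob + offsetof(struct blob_storage, reloc_0),
           sizeof p);
    return p;
}
```

Without the packed attribute the compiler would pad chunk_0 out to pointer alignment, and the slot would no longer sit at the offset the original relocation named.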

TLS delegates entirely to _Thread_local; the host compiler builds its own descriptor. On Mach-O, where TLS is split into a descriptor symbol plus a synthesized init symbol (see OBJ.md), the emitter pulls the initial bytes from the init symbol via the descriptor's R_ABS64 and emits a single _Thread_local, skipping the object-level descriptor machinery.

Function-local static data uses CG's narrow source-backend hook: those symbols are emitted inside the owning function and skipped by the TU-wide data walk.

Source locations and debug info

set_loc emits #line N "path" directives (deduplicated against the last one emitted) into the function body. When the user passes -g to the downstream host gcc/clang, the resulting object carries debug info mapped back to the original kit input. kit's own DWARF producer (see DWARF.md) is unused in this mode — there is no Debug and no MCEmitter on this path.
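The deduplication can be sketched as a comparison against the last emitted location. ToyLoc and toy_set_loc are hypothetical, the path buffer size is arbitrary, and escaping of quotes or backslashes in the path is ignored here.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical emitter state: last location seen plus the output text. */
typedef struct {
    unsigned last_line;
    char last_path[64];
    char out[256];
    size_t len;
} ToyLoc;

/* Returns 1 if a #line directive was appended, 0 if it duplicated the
 * previous one, -1 if the path or output does not fit. */
static int toy_set_loc(ToyLoc *s, unsigned line, const char *path) {
    if (line == s->last_line && strcmp(path, s->last_path) == 0) return 0;
    if (strlen(path) >= sizeof s->last_path) return -1;
    size_t room = sizeof s->out - s->len;
    int n = snprintf(s->out + s->len, room, "#line %u \"%s\"\n", line, path);
    if (n < 0 || (size_t)n >= room) {
        s->out[s->len] = '\0'; /* discard the partial directive */
        return -1;
    }
    s->len += (size_t)n;
    s->last_line = line;
    strcpy(s->last_path, path);
    return 1;
}
```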

Testing

The backend is exercised end-to-end: emit C with kit cc --emit=c, compile the result with the host cc -Werror, run it, and assert that its behavior matches the machine-code path. The test/toy and test/parse corpora drive this via a dedicated emit mode. See TESTING.md.