kit

git clone https://git.ryansepassi.com/git/kit.git

Frontends

kit turns several source languages into machine code through one narrow contract. A frontend is the front half of the compiler: it reads the bytes of a single translation unit and drives the public code-generation API (see CODEGEN.md) to populate a relocatable object. Everything behind codegen (IR, optimization, register allocation, object emission) is language-agnostic. This document covers the frontend model, the C frontend pipeline, and the smaller toy and wasm frontends. For testing, see TESTING.md.

kit ships four frontends: C (the real one), asm (lives inside the codegen substrate), toy (a CG-API exercise vehicle), and wasm (WAT/wasm lowering).

The frontend contract

Every frontend implements one vtable, KitFrontendVTable (include/kit/compile.h). The contract is deliberately tiny:

  new_frontend(KitCompiler*)          -> opaque KitFrontendState*
  compile(state, opts, input, out)    -> KitStatus, populates an ObjBuilder
  free_frontend(state)
  extensions / nextensions            -> file extensions this lang claims
  commit / abort                      -> optional, for durable REPL state

The state object is opaque to libkit: the frontend allocates it from the compiler's heap in new_frontend, threads its own context through it, and frees it in free_frontend. compile is handed a KitSourceInput (name, bytes, language, and an input-kind that distinguishes a whole translation unit from a REPL top-level / expression / block) and an output KitObjBuilder to fill. The frontend opens a KitCg, emits into it, and the object builder captures the result. This keeps each language a thin client of the public CG and object APIs (INTERFACES.md) with no privileged access to internals.
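The shape of this contract can be sketched in a few lines of C. This is a minimal illustration, not the real KitFrontendVTable: every Demo* type, the field layout, and the "echo" frontend are hypothetical stand-ins, and the real declarations in include/kit/compile.h may differ (for example, the real new_frontend takes a KitCompiler* and compile takes an options argument).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins; the real types live in include/kit/compile.h. */
typedef enum { DEMO_OK = 0, DEMO_ERR = 1 } DemoStatus;
typedef struct { const char *name; const char *bytes; size_t len; } DemoInput;
typedef struct { size_t bytes_emitted; } DemoObjBuilder;
typedef struct DemoState DemoState;          /* opaque to the driver */

typedef struct {
    DemoState *(*new_frontend)(void);
    DemoStatus (*compile)(DemoState *, const DemoInput *, DemoObjBuilder *);
    void (*free_frontend)(DemoState *);
    const char *const *extensions;
    size_t nextensions;
    void (*commit)(DemoState *);             /* optional: may be NULL */
    void (*abort_)(DemoState *);             /* optional: may be NULL */
} DemoFrontendVTable;

/* A trivial "echo" frontend: counts compiles and "emits" the input size. */
struct DemoState { int compiles; };

static DemoState *echo_new(void) { return calloc(1, sizeof(DemoState)); }
static void echo_free(DemoState *s) { free(s); }
static DemoStatus echo_compile(DemoState *s, const DemoInput *in,
                               DemoObjBuilder *out) {
    if (in->len == 0) return DEMO_ERR;       /* reject empty input */
    s->compiles++;
    out->bytes_emitted += in->len;
    return DEMO_OK;
}

static const char *const echo_exts[] = { "echo" };
static const DemoFrontendVTable echo_vtable = {
    echo_new, echo_compile, echo_free, echo_exts, 1, NULL, NULL
};

/* Driver side: touches the frontend only through the vtable. */
static DemoStatus demo_run(const DemoFrontendVTable *vt, const char *src,
                           DemoObjBuilder *out) {
    DemoInput in = { "input.echo", src, strlen(src) };
    DemoState *st = vt->new_frontend();
    assert(st != NULL);
    DemoStatus rc = vt->compile(st, &in, out);
    if (rc == DEMO_OK && vt->commit) vt->commit(st);
    if (rc != DEMO_OK && vt->abort_) vt->abort_(st);
    vt->free_frontend(st);
    return rc;
}
```

Calling demo_run(&echo_vtable, "abc", &ob) goes through the same three steps the text describes: allocate state, compile into the builder, free state. The driver never looks inside DemoState.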

Two design choices fall out of this boundary:

The language registry

src/api/lang_registry.c is the only place in libkit that consults the KIT_LANG_*_ENABLED build flags. During compiler construction lang_registry_init walks the enabled set and calls kit_register_frontend to wire each compiled-in vtable into c->frontends[KitLanguage]. After that, the public compile and pipeline paths dispatch purely by KitLanguage index — no host bootstrap, no #ifdef outside the registry. Third parties can still call kit_register_frontend to install or override any slot at runtime; the registry just provides the default wiring.

kit_language_for_path (src/api/compile.c) maps a file path to a language by extracting the trailing extension and walking every registered frontend's extensions list (case-insensitively, so .S and .s both hit asm's "s"). No frontend is privileged. C is just another language: it claims c and h, asm claims s, toy claims toy, and wasm claims wat and wasm. A path whose extension no frontend claims — or that has no extension — resolves to KIT_LANG_UNKNOWN rather than silently defaulting to C, so an unrecognized input is rejected instead of misparsed.

  path  --kit_language_for_path-->  KitLanguage
  KitLanguage  --c->frontends[]-->  KitFrontendVTable
  vtable.compile(...)  --KitCg-->   KitObjBuilder
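As an illustration of the path-to-language step, here is a minimal sketch of a case-insensitive extension lookup over per-language extension lists. The extension strings come from the text above; the DemoLang enum, the registry table, and demo_language_for_path are hypothetical names, and the real kit_language_for_path may handle edge cases (such as dotfiles) differently.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    LANG_UNKNOWN = 0, LANG_C, LANG_ASM, LANG_TOY, LANG_WASM, LANG_COUNT
} DemoLang;

typedef struct { const char *const *exts; size_t n; } DemoExtList;

static const char *const c_exts[]    = { "c", "h" };
static const char *const asm_exts[]  = { "s" };
static const char *const toy_exts[]  = { "toy" };
static const char *const wasm_exts[] = { "wat", "wasm" };

/* One slot per language; LANG_UNKNOWN keeps an empty list. */
static const DemoExtList registry[LANG_COUNT] = {
    [LANG_C]    = { c_exts, 2 },
    [LANG_ASM]  = { asm_exts, 1 },
    [LANG_TOY]  = { toy_exts, 1 },
    [LANG_WASM] = { wasm_exts, 2 },
};

/* Case-insensitive equality on whole strings. */
static int ext_eq_nocase(const char *a, const char *b) {
    for (; *a && *b; a++, b++)
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
    return *a == '\0' && *b == '\0';
}

static DemoLang demo_language_for_path(const char *path) {
    assert(path != NULL);
    const char *slash = strrchr(path, '/');
    const char *base = slash ? slash + 1 : path;   /* ignore dots in dirs */
    const char *dot = strrchr(base, '.');
    if (!dot || dot[1] == '\0') return LANG_UNKNOWN;
    for (int lang = LANG_UNKNOWN + 1; lang < LANG_COUNT; lang++)
        for (size_t i = 0; i < registry[lang].n; i++)
            if (ext_eq_nocase(dot + 1, registry[lang].exts[i]))
                return (DemoLang)lang;
    return LANG_UNKNOWN;                           /* never default to C */
}
```

The final return is the design point: an unclaimed or missing extension yields LANG_UNKNOWN, so the caller can reject the input instead of guessing.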

The C frontend pipeline

The C frontend (lang/c/c.c, vtable kit_c_frontend_vtable) is a classic C11 compiler arranged as a streaming pipeline. There is no AST: tokens flow through the preprocessor into a single-pass recursive-descent parser that drives codegen directly as it goes. The stages:

  bytes
   |  lexer            lang/cpp/lex   (C11 6.4 token streaming)
   v
  tokens
   |  preprocessor     lang/cpp/pp    (translation phase 4)
   v
  preprocessed tokens
   |  parser           lang/c/parse   (single-pass recursive descent)
   v   + type / decl / sem / abi semantic layers (lang/c)
  KitCg ops  ->  KitObjBuilder

The lexer and preprocessor live under lang/cpp/ rather than lang/c/ because they are shared C-preprocessor infrastructure: the same code backs the C frontend, the standalone cpp tool, and cc -E. The C-specific stages — parser, type system, declarations, semantics, and ABI lowering — live under lang/c/.

c_frontend_compile is the conductor. It builds the per-compile objects (token pool, lexer over the input bytes, preprocessor, decl table, KitCg), applies preprocessor options (-I, -D, -U) recovered from the KitCCompileOptions planted in language_options, pushes the input lexer onto the preprocessor's include stack, then calls parse_c. The whole run is bracketed with metrics scopes so the build can profile each stage.

Lexer — lang/cpp/lex

lex.c streams tokens out of a borrowed source buffer per the C11 lexical grammar (6.4): identifiers, pp-numbers (classified into integer/float), string and character constants with their L/u/u8/U encoding prefixes, longest-match punctuators including digraphs, and the #/## tokens that the preprocessor needs for directives and pasting. It handles the earlier translation phases inline: line splicing (phase 2) and comments as whitespace, with physical newlines surfaced as TOK_NEWLINE so the preprocessor can implement directive-line semantics, and a small directive-context latch so a header-name after #include lexes correctly (6.4.7).
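Line splicing (phase 2) is simple enough to show in isolation. The sketch below is a simplification under stated assumptions: it splices a whole buffer into a second buffer, whereas lex.c is described as doing this inline while streaming, and it only recognizes a bare "\n" line ending. The name demo_splice_lines is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Copy src to dst, deleting every backslash that is immediately followed
   by a newline, together with that newline. dst must hold at least len
   bytes. Returns the number of bytes written. */
static size_t demo_splice_lines(const char *src, size_t len, char *dst) {
    assert(src != NULL && dst != NULL);
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        if (src[i] == '\\' && i + 1 < len && src[i + 1] == '\n') {
            i++;                 /* skip the newline as well */
            continue;
        }
        dst[out++] = src[i];
    }
    return out;
}
```

A backslash at the very end of the buffer, with no newline after it, is left in place by this sketch.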

A deliberate layering decision: keyword bucketing is deferred to the parser. The lexer emits every word as TOK_IDENT; it does not know what a keyword is. This is correct because keyword-ness is a translation-phase-7 concern — a name that is a macro must expand before it can be a keyword, and the preprocessor traffics in identifiers, not keywords. parse_c interns the C11 keyword spellings into symbols once at startup (kw_names[] in parse.c) and recognizes keywords by symbol identity as it consumes tokens.
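Recognizing keywords by symbol identity can be sketched as follows. This is an assumption-laden illustration: the linear-scan interner, the trick of interning keywords first so that they occupy the lowest ids, and all demo_* names are hypothetical, and kit's actual symbol table and kw_names[] handling may work differently.

```c
#include <assert.h>
#include <string.h>

#define MAX_SYMS 64

typedef struct { const char *names[MAX_SYMS]; int count; } DemoInterner;

/* Return a stable id for a spelling, adding it on first sight. */
static int demo_intern(DemoInterner *t, const char *s) {
    for (int i = 0; i < t->count; i++)
        if (strcmp(t->names[i], s) == 0) return i;
    assert(t->count < MAX_SYMS);
    t->names[t->count] = s;      /* borrows the caller's string */
    return t->count++;
}

static const char *const demo_kw_names[] = {
    "if", "else", "while", "return", "int"
};
enum { DEMO_NKW = sizeof demo_kw_names / sizeof demo_kw_names[0] };

/* Intern keywords first so ids 0..DEMO_NKW-1 are exactly the keywords. */
static void demo_parser_init(DemoInterner *t) {
    t->count = 0;
    for (int i = 0; i < DEMO_NKW; i++) {
        int id = demo_intern(t, demo_kw_names[i]);
        assert(id == i);
        (void)id;
    }
}

/* The lexer hands over plain identifiers; only the parser asks this. */
static int demo_is_keyword(int sym) { return sym >= 0 && sym < DEMO_NKW; }
```

Because the check is an integer comparison on an interned id, the lexer and preprocessor never need a keyword table of their own.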

Preprocessor — lang/cpp/pp

pp.c (+ pp_directive.c, pp_expand.c) implements translation phase 4 and exposes a pull interface: pp_next yields one fully preprocessed token at a time, with directives consumed and macros expanded. It backs both the C frontend (the parser pulls tokens from it) and the standalone preprocessor (pp_emit_text reconstructs source text from the token stream for cpp / cc -E).

Internally the preprocessor runs a token-source stack. Each source is either a Lexer (the main file or an #included file) or a pre-built token buffer (a macro expansion in progress). Includes push a new lexer; macro invocations push an expansion buffer; EOF pops. This is what makes the include stack and macro rescanning fall out of one mechanism.
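A minimal sketch of the token-source stack, assuming tokens are plain integers and both source kinds are cursors over an array (a real lexer source would scan bytes on demand). All Demo* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { SRC_LEXER, SRC_BUFFER } DemoSrcKind;

/* Both kinds are modeled as a cursor over an int array in this sketch. */
typedef struct {
    DemoSrcKind kind;
    const int *toks;
    size_t len, pos;
} DemoSource;

#define MAX_DEPTH 16
typedef struct { DemoSource stack[MAX_DEPTH]; int depth; } DemoPP;

/* Used for both #include (SRC_LEXER) and macro expansion (SRC_BUFFER). */
static void demo_push(DemoPP *pp, DemoSrcKind kind,
                      const int *toks, size_t len) {
    assert(pp->depth < MAX_DEPTH);
    DemoSource s = { kind, toks, len, 0 };
    pp->stack[pp->depth++] = s;
}

/* Pull the next token from the innermost source; exhausted sources pop.
   Returns 1 and stores the token, or 0 once every source is drained. */
static int demo_next(DemoPP *pp, int *out) {
    while (pp->depth > 0) {
        DemoSource *top = &pp->stack[pp->depth - 1];
        if (top->pos < top->len) {
            *out = top->toks[top->pos++];
            return 1;
        }
        pp->depth--;             /* EOF on this source: resume the parent */
    }
    return 0;
}
```

Pushing an expansion buffer in the middle of a file makes the buffer's tokens come out next, after which reading resumes in the file. That is the behavior both includes and macro rescanning rely on.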

Macro expansion uses the Prosser hideset algorithm (the standard's "nested replacement" / blue-paint rule). Every token in an expansion buffer carries a hideset: the set of macro names that must not expand it again during rescan. Function-like expansions compute the result hideset as the invocation hideset unioned with the just-expanded macro name, which is exactly what stops infinite recursion through self-referential and mutually-referential macros. Hidesets are interned into a small table (pp_expand.c) and kept sorted for canonical identity, so identical hidesets share one id.
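The sorted-and-interned representation can be sketched as below. This is an illustration only: macro names are modeled as integers, the fixed-size tables and every demo_hs_* name are hypothetical, and the sketch covers just the "add the expanded macro's name" step described above.

```c
#include <assert.h>
#include <string.h>

#define HS_MAX 8        /* max names per hideset in this sketch */
#define HS_TABLE 32     /* max distinct hidesets */

typedef struct { int names[HS_MAX]; int n; } DemoHideset; /* sorted, unique */
typedef struct { DemoHideset sets[HS_TABLE]; int count; } DemoHsTable;

/* Id 0 is always the empty hideset. */
static void demo_hs_init(DemoHsTable *t) {
    memset(t, 0, sizeof *t);
    t->count = 1;
}

/* Return the id of an identical stored set, or store a new one. */
static int demo_hs_intern(DemoHsTable *t, const DemoHideset *hs) {
    for (int i = 0; i < t->count; i++)
        if (t->sets[i].n == hs->n &&
            memcmp(t->sets[i].names, hs->names,
                   (size_t)hs->n * sizeof(int)) == 0)
            return i;
    assert(t->count < HS_TABLE);
    t->sets[t->count] = *hs;
    return t->count++;
}

static int demo_hs_contains(const DemoHsTable *t, int id, int macro) {
    const DemoHideset *hs = &t->sets[id];
    for (int i = 0; i < hs->n; i++)
        if (hs->names[i] == macro) return 1;
    return 0;
}

/* Result hideset = existing hideset plus the macro just expanded. */
static int demo_hs_add(DemoHsTable *t, int id, int macro) {
    if (demo_hs_contains(t, id, macro)) return id;
    DemoHideset out = t->sets[id];
    assert(out.n < HS_MAX);
    int i = out.n++;
    while (i > 0 && out.names[i - 1] > macro) {   /* sorted insert */
        out.names[i] = out.names[i - 1];
        i--;
    }
    out.names[i] = macro;
    return demo_hs_intern(t, &out);
}
```

Keeping each set sorted means {3, 7} built as 7-then-3 and as 3-then-7 compare equal byte for byte, so both paths intern to the same id. A token whose hideset contains a macro's name is simply not expanded by that macro on rescan.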

pp_directive.c owns the rest of phase 4: the #if nesting stack and the preprocessor constant-expression evaluator, #include search and file open (through the compiler's file-IO service), #line, #pragma, #error, #embed, and #define/#undef. Include edges are recorded (pp_add_include_edge -> the source registry, include/kit/source.h) so dependency reporting (kit_dep_iter_*) and diagnostics can attribute tokens to the right file.

Parser — lang/c/parse

parse_c (parse.c) is a single-pass recursive-descent parser. It pulls tokens from the preprocessor with one or two tokens of lookahead and emits CG ops as it recognizes constructs — there is no separate AST or semantic-analysis pass over a tree. The parser is split by syntactic area: declarations and the TU driver in parse.c, expressions in parse_expr.c, types/declarators in parse_type.c, initializers in parse_init.c, and statements in parse_stmt.c.

Because lowering happens inline, the parser maintains a typed value stack that shadows the CG operand stack. Each entry records the C-language type of a value, value flags (lvalue / modifiable-lvalue / bit-field / register / null-pointer-constant), and an lvalue auxiliary record. The fields are cg_type_stack, cg_value_flags, and cg_lv_aux in struct Parser (parse_priv.h). This shadow stack is the parser's entire notion of "the expression evaluated so far": the CG side holds the runtime values, the parser side holds the static types and lvalue bookkeeping needed to drive the next op correctly (signedness for division/compare/shift, pointer-element size for arithmetic, bit-field metadata for loads/stores, and so on).
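To make the shadow-stack idea concrete, here is a minimal sketch in which the parser side picks signed or unsigned division from the static types it tracks. Everything here is a hypothetical stand-in: the two-type system, the flag bit, the recorded op list standing in for the code generator, and the demo_* names do not reflect the real struct Parser fields or CG ops.

```c
#include <assert.h>

typedef enum { TY_INT, TY_UINT } DemoType;
enum { VF_LVALUE = 1 << 0 };
typedef enum { OP_SDIV, OP_UDIV } DemoOp;

#define DEPTH 16
typedef struct {
    DemoType types[DEPTH];   /* static C type of each operand */
    unsigned flags[DEPTH];   /* value flags for each operand */
    int sp;
    DemoOp emitted[DEPTH];   /* stand-in for ops sent to the code generator */
    int nemitted;
} DemoParser;

static void demo_push_value(DemoParser *p, DemoType ty, unsigned flags) {
    assert(p->sp < DEPTH);
    p->types[p->sp] = ty;
    p->flags[p->sp] = flags;
    p->sp++;
}

/* Simplified usual arithmetic conversions: unsigned wins over signed. */
static DemoType demo_common_type(DemoType a, DemoType b) {
    return (a == TY_UINT || b == TY_UINT) ? TY_UINT : TY_INT;
}

/* Lower `lhs / rhs`: pop two shadow entries, choose the op from their
   static types, and push the rvalue result. */
static void demo_lower_div(DemoParser *p) {
    assert(p->sp >= 2 && p->nemitted < DEPTH);
    DemoType rhs = p->types[--p->sp];
    DemoType lhs = p->types[--p->sp];
    DemoType res = demo_common_type(lhs, rhs);
    p->emitted[p->nemitted++] = (res == TY_UINT) ? OP_UDIV : OP_SDIV;
    demo_push_value(p, res, 0);   /* a quotient is never an lvalue */
}
```

The code generator in this sketch only ever sees OP_SDIV or OP_UDIV; the knowledge of which one to emit lives entirely on the parser's side of the stack.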

The cg_adapter.h coupling point

lang/c/parse/cg_adapter.c and its header cg_adapter.h are the real parser <-> codegen seam, and the most load-bearing design decision in the C frontend. The header declares a family of pcg_* helpers that look like thin wrappers over the public kit_cg_* API but do two extra jobs on every call:

The implication: the C frontend's lowering strategy is encoded in this adapter, not in the public CG contract. It is where C-specific decisions (lvalue-as-effective-address, lazy load/store/addr, the typed shadow stack) are concentrated. A change to how C lowers indirection lands here, not in the backends. (pcg_emit_enabled also lets the parser run constant-folding / sizeof contexts with codegen suppressed while still tracking types.)

Semantic layers — type / decl / sem / abi

The parser leans on four C-specific semantic layers under lang/c: type, decl, sem, and abi.

The toy frontend

The toy frontend (lang/toy) is a small, statically-typed, single-pass-friendly language whose purpose is to exercise the public code-generation API broadly and readably — it is the coverage vehicle for make test-toy, not a language anyone ships. It is C-like where that keeps tests legible and prefix-oriented (@builtin, @[...] attributes, dot-constants) where C syntax would make parsing or lowering ambiguous. Toy is fully self-contained: it has its own lexer (lexer.c), parser (parser.c + expr.c/decls.c/data.c/stmt logic), type system (types.c), and builtin/intrinsic/inline-asm handling (builtins.c, asm.c) — it does not reuse the C lexer or preprocessor.

Toy deliberately surfaces low-level CG features that have no clean C spelling: explicit linkage and symbol/ABI attributes, address-space pointers, tail calls, computed goto with target sets, relocatable data expressions (@pcrel, @symdiff, @labeladdr), atomics with explicit ordering/access, the full conversion-builtin set with rounding modes, target-capability queries, and typed inline assembly. The broad executable demo test/toy/cases/123_spec_demo.toy is a normal corpus case (not a doc-only sample) so the implementation must keep accepting and running the syntax it demonstrates.

Toy is also the one frontend with durable cross-compile state, because kit dbg runs it as a REPL. Its frontend object splits into a durable ToyModule (append-only declaration tables — functions, globals, named types — carrying metadata and compiler-durable type ids, but no per-object symbol handles) and a per-compile ToyParser that borrows the module and holds the per-object symbol environment plus the transaction state. The transaction is a journaled-in-place model: each compile records watermarks into the durable tables, mutates them in place during the parse, and on abort truncates back to the watermark and replays a small undo journal for the rare in-place mutation of an already-committed entry (completing a forward-declared record). commit is then a cheap disarm. This is what lets a failed or panicking REPL snippet leave the persistent declaration tables exactly as they were. The hooks are wired through the vtable's commit/abort; the session and dbg driver gate the commit on the full compile -> link -> publish chain. See DBG.md for the debugger REPL.
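The journaled-in-place model can be sketched with one append-only table, a watermark, and a small undo journal. This is a simplified illustration under assumptions: a single table instead of several, fixed capacities, and hypothetical Demo* names throughout. The real ToyModule and ToyParser split is richer than this.

```c
#include <assert.h>

#define MAX_DECLS 32
#define MAX_UNDO 8

typedef struct { int id; int complete; } DemoDecl;
typedef struct { int index; DemoDecl old; } DemoUndo;

typedef struct {
    DemoDecl decls[MAX_DECLS]; int ndecls; /* durable, append-only table */
    int watermark;                         /* ndecls when the txn began */
    DemoUndo undo[MAX_UNDO]; int nundo;    /* edits to committed entries */
} DemoModule;

static void demo_begin(DemoModule *m) {
    m->watermark = m->ndecls;
    m->nundo = 0;
}

static int demo_append(DemoModule *m, int id, int complete) {
    assert(m->ndecls < MAX_DECLS);
    m->decls[m->ndecls] = (DemoDecl){ id, complete };
    return m->ndecls++;
}

/* Complete a forward declaration in place. Entries that predate this
   transaction get their old value journaled first. */
static void demo_complete(DemoModule *m, int index) {
    assert(index >= 0 && index < m->ndecls);
    if (index < m->watermark) {
        assert(m->nundo < MAX_UNDO);
        m->undo[m->nundo++] = (DemoUndo){ index, m->decls[index] };
    }
    m->decls[index].complete = 1;
}

/* Commit is a cheap disarm: nothing to copy, just forget the journal. */
static void demo_commit(DemoModule *m) {
    m->watermark = m->ndecls;
    m->nundo = 0;
}

static void demo_abort(DemoModule *m) {
    while (m->nundo > 0) {                 /* replay newest-first */
        DemoUndo u = m->undo[--m->nundo];
        m->decls[u.index] = u.old;
    }
    m->ndecls = m->watermark;              /* drop appended entries */
}
```

Entries appended during the transaction need no journal record, because truncating to the watermark discards them. Only the rare edit to an already-committed entry costs an undo slot.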

The wasm frontend

The wasm frontend (lang/wasm, vtable kit_wasm_frontend_vtable) accepts either WebAssembly text (.wat) or binary (.wasm). wasm_parse_any sniffs the input and routes it to the WAT parser or the binary decoder; the frontend then validates the module, lowers it into the CG API (cg.c), and synthesizes the host-import ABI shims (host_imports.c). Like C and asm, it carries no durable cross-compile state (commit/abort are NULL). The structure and the WAT/binary/lowering details live in WASM.md.
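One plausible way to sniff text versus binary is to test for the 4-byte magic that every WebAssembly binary module starts with (0x00 'a' 's' 'm'). The sketch below shows that check; whether wasm_parse_any uses exactly this test is an assumption, and demo_wasm_sniff and DemoWasmKind are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { DEMO_WASM_TEXT, DEMO_WASM_BINARY } DemoWasmKind;

/* Binary modules begin with the magic bytes "\0asm"; this sketch treats
   anything else, including inputs shorter than 4 bytes, as WAT text. */
static DemoWasmKind demo_wasm_sniff(const unsigned char *bytes, size_t len) {
    static const unsigned char magic[4] = { 0x00, 0x61, 0x73, 0x6D };
    assert(bytes != NULL || len == 0);
    if (len >= sizeof magic && memcmp(bytes, magic, sizeof magic) == 0)
        return DEMO_WASM_BINARY;
    return DEMO_WASM_TEXT;
}
```

Sniffing on content rather than on the file extension means a .wasm file containing text, or a .wat file containing binary, is still routed to a decoder that can read it.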

The wasm module model it parses into (WasmModule and the WAT/binary codec) lives in src/wasm/wasm.h, a Tier-3 internal subsystem also used by the wasm object format (src/obj/wasm) and codegen backend (src/arch/wasm). It is the one place a frontend depends on src/ rather than include/kit/, and the reason is structural: the module IR is large and unstable, so it is shared internally rather than mirrored into the public API (whose wasm.h is the narrow host-import binder). Everything else the frontend touches (codegen, arena, heap) is public. See INTERFACES.md for the boundary rationale.