Frontends
kit turns several source languages into machine code through one narrow contract. A frontend is the front half of the compiler: it reads bytes of a single translation unit and drives the public code-generation API (see CODEGEN.md) to populate a relocatable object. Everything behind codegen — IR, optimization, register allocation, object emission — is language-agnostic. This document covers the frontend model, the C frontend pipeline, and the smaller toy and wasm frontends. For testing, see TESTING.md.
kit ships four frontends: C (the real one), asm (lives inside the codegen substrate), toy (a CG-API exercise vehicle), and wasm (WAT/wasm lowering).
The frontend contract
Every frontend implements one vtable, KitFrontendVTable
(include/kit/compile.h). The contract is deliberately tiny:
new_frontend(KitCompiler*) -> opaque KitFrontendState*
compile(state, opts, input, out) -> KitStatus, populates an ObjBuilder
free_frontend(state)
extensions / nextensions -> file extensions this lang claims
commit / abort -> optional, for durable REPL state
The state object is opaque to libkit: the frontend allocates it from the
compiler's heap in new_frontend, threads its own context through it, and
frees it in free_frontend. compile is handed a KitSourceInput (name,
bytes, language, and an input-kind that distinguishes a whole translation unit
from a REPL top-level / expression / block) and an output KitObjBuilder
to fill. The frontend opens a KitCg, emits into it, and the object builder
captures the result. This keeps each language a thin client of the public CG
and object APIs (INTERFACES.md) with no privileged access to
internals.
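To make the shape of the contract concrete, here is a minimal sketch of a do-nothing frontend and the one-shot lifecycle around it. Every type and function below (DemoFrontendVTable, demo_run_once, and so on) is a hypothetical stand-in, not the real declaration in include/kit/compile.h, and the sketch allocates state with calloc where a real frontend would use the compiler's heap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the real kit types; every name here is an assumption. */
typedef enum { DEMO_OK = 0, DEMO_ERR = 1 } DemoStatus;
typedef struct { const char *name; const char *bytes; size_t len; } DemoSourceInput;
typedef struct { size_t bytes_emitted; } DemoObjBuilder;
typedef struct DemoFrontendState { void *compiler; unsigned compiles; } DemoFrontendState;

typedef struct {
    DemoFrontendState *(*new_frontend)(void *compiler);
    DemoStatus (*compile)(DemoFrontendState *st, const DemoSourceInput *in,
                          DemoObjBuilder *out);
    void (*free_frontend)(DemoFrontendState *st);
    const char *const *extensions;
    size_t nextensions;
    void (*commit_hook)(DemoFrontendState *st); /* optional: NULL when one-shot */
    void (*abort_hook)(DemoFrontendState *st);  /* optional: NULL when one-shot */
} DemoFrontendVTable;

static DemoFrontendState *demo_new(void *compiler) {
    /* Assumption: calloc stands in for the compiler's heap. */
    DemoFrontendState *st = calloc(1, sizeof *st);
    if (st) st->compiler = compiler;
    return st;
}

static DemoStatus demo_compile(DemoFrontendState *st, const DemoSourceInput *in,
                               DemoObjBuilder *out) {
    assert(st && in && out);
    st->compiles++;
    out->bytes_emitted += in->len; /* pretend each input byte yields one object byte */
    return DEMO_OK;
}

static void demo_free(DemoFrontendState *st) { free(st); }

static const char *const demo_exts[] = { "demo" };
static const DemoFrontendVTable demo_vtable = {
    demo_new, demo_compile, demo_free, demo_exts, 1, NULL, NULL
};

/* The lifecycle a host would run for a single one-shot compile. */
static DemoStatus demo_run_once(const DemoFrontendVTable *vt,
                                const DemoSourceInput *in, DemoObjBuilder *out) {
    DemoFrontendState *st = vt->new_frontend(NULL);
    if (!st) return DEMO_ERR;
    DemoStatus rc = vt->compile(st, in, out);
    vt->free_frontend(st);
    return rc;
}
```

The state pointer never leaves the three callbacks, which is what keeps it opaque to the host in this sketch.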
Two design choices fall out of this boundary:
Errors unwind, they don't return codes mid-stream. Frontends report fatal problems by panicking (compiler_panic / kit_frontend_fatal), which longjmps to a setjmp landing pad that runs registered cleanups. kit_frontend_run (src/api/frontend.c) is the public shim that establishes that boundary for standalone frontend helpers (preprocess, token-dump) without exposing libkit's panic machinery. Ordinary diagnostics are non-fatal; only invariant breaks unwind.
Durable cross-compile state is transactional. Most frontends are one-shot: one compile call, one object, no state carried forward (commit/abort are NULL for C, asm, and wasm). The toy frontend is the exception: its REPL accumulates declarations across snippets, so the vtable carries optional commit/abort hooks. A compile stages new declarations; the compile session then commits them on full success or aborts (also fired automatically on a panic via the cleanup stack) to restore the pre-compile state. The session API (kit_compile_session_compile = stage + auto-commit; kit_compile_session_stage + explicit commit/abort) lets a REPL gate the commit on the whole compile -> link -> publish chain, not just compile, since publish can reject a clean compile (duplicate global). Both hooks must be idempotent.
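The unwind-plus-abort interaction can be sketched with plain setjmp/longjmp. This is a hypothetical miniature, not libkit's panic machinery: DemoBoundary, demo_defer, demo_panic, and demo_run are invented names, and the sketch runs cleanups only on the panic path.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical miniature of a panic boundary; all names are assumptions. */
enum { DEMO_MAX_CLEANUPS = 8 };
typedef void (*DemoCleanupFn)(void *arg);

typedef struct {
    jmp_buf landing;
    struct { DemoCleanupFn fn; void *arg; } cleanups[DEMO_MAX_CLEANUPS];
    size_t ncleanups;
} DemoBoundary;

/* Register a cleanup to run if the guarded body panics. */
static void demo_defer(DemoBoundary *b, DemoCleanupFn fn, void *arg) {
    assert(b->ncleanups < DEMO_MAX_CLEANUPS);
    b->cleanups[b->ncleanups].fn = fn;
    b->cleanups[b->ncleanups].arg = arg;
    b->ncleanups++;
}

/* Fatal error: jump to the landing pad. Never returns. */
static void demo_panic(DemoBoundary *b) { longjmp(b->landing, 1); }

/* Runs body under the boundary. Returns 0 on success, 1 if body panicked.
   On a panic, cleanups run in reverse registration order. */
static int demo_run(DemoBoundary *b, void (*body)(DemoBoundary *, void *), void *arg) {
    b->ncleanups = 0;
    if (setjmp(b->landing) != 0) {
        while (b->ncleanups > 0) {
            b->ncleanups--;
            b->cleanups[b->ncleanups].fn(b->cleanups[b->ncleanups].arg);
        }
        return 1;
    }
    body(b, arg);
    return 0;
}

/* A toy transaction whose abort is registered as a cleanup. */
typedef struct { int staged; int aborted; } DemoTxn;

static void demo_abort_txn(void *arg) {
    DemoTxn *t = arg;
    t->staged = 0;   /* restore the pre-compile state */
    t->aborted = 1;
}

static void demo_failing_compile(DemoBoundary *b, void *arg) {
    DemoTxn *t = arg;
    demo_defer(b, demo_abort_txn, t);
    t->staged = 3;   /* stage three declarations */
    demo_panic(b);   /* invariant break mid-compile */
}

static void demo_clean_compile(DemoBoundary *b, void *arg) {
    DemoTxn *t = arg;
    demo_defer(b, demo_abort_txn, t);
    t->staged = 2;   /* no panic: staged declarations survive */
}
```

Because the abort is just another registered cleanup, a panicking compile rolls the transaction back without the frontend having to check return codes at each step.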
The language registry
src/api/lang_registry.c is the only place in libkit that consults the
KIT_LANG_*_ENABLED build flags. During compiler construction
lang_registry_init walks the enabled set and calls
kit_register_frontend to wire each compiled-in vtable into
c->frontends[KitLanguage]. After that, the public compile and pipeline
paths dispatch purely by KitLanguage index — no host bootstrap, no
#ifdef outside the registry. Third parties can still call
kit_register_frontend to install or override any slot at runtime; the
registry just provides the default wiring.
kit_language_for_path (src/api/compile.c) maps a file path to a language
by extracting the trailing extension and walking every registered frontend's
extensions list (case-insensitively, so .S and .s both hit asm's "s").
No frontend is privileged. C is just another language: it claims c and h,
asm claims s, toy claims toy, and wasm claims wat and wasm. A path
whose extension no frontend claims — or that has no extension — resolves to
KIT_LANG_UNKNOWN rather than silently defaulting to C, so an unrecognized
input is rejected instead of misparsed.
path --kit_language_for_path--> KitLanguage
KitLanguage --c->frontends[]--> KitFrontendVTable
vtable.compile(...) --KitCg--> KitObjBuilder
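The extension lookup described above might look roughly like the following. This is a sketch under stated assumptions: DemoLanguage, DemoFrontend, and demo_language_for_path are hypothetical, the table is hard-coded rather than built by registration, and only '/' is treated as a path separator.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the registry; names and layout are assumptions. */
typedef enum {
    DEMO_LANG_UNKNOWN, DEMO_LANG_C, DEMO_LANG_ASM, DEMO_LANG_TOY, DEMO_LANG_WASM
} DemoLanguage;

typedef struct { DemoLanguage lang; const char *const *exts; size_t nexts; } DemoFrontend;

static const char *const demo_c_exts[]    = { "c", "h" };
static const char *const demo_asm_exts[]  = { "s" };
static const char *const demo_toy_exts[]  = { "toy" };
static const char *const demo_wasm_exts[] = { "wat", "wasm" };

static const DemoFrontend demo_frontends[] = {
    { DEMO_LANG_C,    demo_c_exts,    2 },
    { DEMO_LANG_ASM,  demo_asm_exts,  1 },
    { DEMO_LANG_TOY,  demo_toy_exts,  1 },
    { DEMO_LANG_WASM, demo_wasm_exts, 2 },
};

/* Case-insensitive equality of two NUL-terminated extensions. */
static int demo_ext_eq(const char *a, const char *b) {
    for (; *a && *b; a++, b++)
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b)) return 0;
    return *a == '\0' && *b == '\0';
}

/* Maps a path to a language by its trailing extension. No extension, or an
   extension nobody claims, yields DEMO_LANG_UNKNOWN rather than a default. */
static DemoLanguage demo_language_for_path(const char *path) {
    assert(path != NULL);
    const char *slash = strrchr(path, '/');
    const char *base = slash ? slash + 1 : path;
    const char *dot = strrchr(base, '.');
    if (!dot || dot[1] == '\0') return DEMO_LANG_UNKNOWN;
    for (size_t i = 0; i < sizeof demo_frontends / sizeof demo_frontends[0]; i++)
        for (size_t j = 0; j < demo_frontends[i].nexts; j++)
            if (demo_ext_eq(dot + 1, demo_frontends[i].exts[j]))
                return demo_frontends[i].lang;
    return DEMO_LANG_UNKNOWN;
}
```

Searching for the dot only within the final path component keeps a directory name such as dir.d from being mistaken for an extension.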
The C frontend pipeline
The C frontend (lang/c/c.c, vtable kit_c_frontend_vtable) is a classic
C11 compiler arranged as a streaming pipeline. There is no AST: tokens flow
through the preprocessor into a single-pass recursive-descent parser that
drives codegen directly as it goes. The stages:
bytes
| lexer lang/cpp/lex (C11 6.4 token streaming)
v
tokens
| preprocessor lang/cpp/pp (translation phase 4)
v
preprocessed tokens
| parser lang/c/parse (single-pass recursive descent)
v + type / decl / sem / abi semantic layers (lang/c)
KitCg ops -> KitObjBuilder
The lexer and preprocessor live under lang/cpp/ rather than lang/c/
because they are shared C-preprocessor infrastructure: the same code backs the
C frontend, the standalone cpp tool, and cc -E. The C-specific stages —
parser, type system, declarations, semantics, and ABI lowering — live under
lang/c/.
c_frontend_compile is the conductor. It builds the per-compile objects
(token pool, lexer over the input bytes, preprocessor, decl table, KitCg),
applies preprocessor options (-I, -D, -U) recovered from the
KitCCompileOptions planted in language_options, pushes the input lexer
onto the preprocessor's include stack, then calls parse_c. The whole run is
bracketed with metrics scopes so the build can profile each stage.
Lexer — lang/cpp/lex
lex.c streams tokens out of a borrowed source buffer per the C11 lexical
grammar (6.4): identifiers, pp-numbers (classified into integer/float),
string and character constants with their L/u/u8/U encoding prefixes,
longest-match punctuators including digraphs, and the #/## tokens that the
preprocessor needs for directives and pasting. It handles the earlier
translation phases inline: line splicing (phase 2) and comments as whitespace,
with physical newlines surfaced as TOK_NEWLINE so the preprocessor can
implement directive-line semantics, and a small directive-context latch so a
header-name after #include lexes correctly (6.4.7).
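For readers unfamiliar with phase 2, line splicing simply deletes each backslash that is immediately followed by a newline, together with that newline. The sketch below does this as a separate batch pass for clarity; lex.c is described as handling it inline while streaming, so demo_splice_lines is a hypothetical illustration of the rule, not of the implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical batch version of C11 translation phase 2.
   Copies src to dst with every backslash-newline pair removed and returns
   the new length. dst must have room for len bytes. */
static size_t demo_splice_lines(const char *src, size_t len, char *dst) {
    assert(src != NULL && dst != NULL);
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        if (src[i] == '\\' && i + 1 < len && src[i + 1] == '\n') {
            i++;        /* skip the newline as well as the backslash */
            continue;
        }
        dst[out++] = src[i];
    }
    return out;
}
```

A streaming lexer would apply the same test whenever it peeks at the next character, so that a token can continue across a spliced line.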
A deliberate layering decision: keyword bucketing is deferred to the
parser. The lexer emits every word as TOK_IDENT; it does not know what a
keyword is. This is correct because keyword-ness is a translation-phase-7
concern — a name that is a macro must expand before it can be a keyword, and
the preprocessor traffics in identifiers, not keywords. parse_c interns the
C11 keyword spellings into symbols once at startup (kw_names[] in
parse.c) and recognizes keywords by symbol identity as it consumes tokens.
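Recognizing keywords by symbol identity can be sketched as follows. The intern table, the shortened keyword list, and the trick of interning keywords first so that their ids form a contiguous range are all assumptions made for illustration; the real kw_names[] and symbol table may be organized differently.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical linear-probe intern table; names and layout are assumptions. */
enum { DEMO_MAX_SYMS = 64 };
typedef struct { const char *names[DEMO_MAX_SYMS]; int count; } DemoSymtab;

/* Returns a stable id for a spelling; equal spellings get equal ids.
   The string must outlive the table (the sketch stores the pointer). */
static int demo_intern(DemoSymtab *t, const char *s) {
    for (int i = 0; i < t->count; i++)
        if (strcmp(t->names[i], s) == 0) return i;
    assert(t->count < DEMO_MAX_SYMS);
    t->names[t->count] = s;
    return t->count++;
}

/* A shortened keyword list, for illustration only. */
static const char *const demo_kw_names[] = { "if", "else", "while", "return", "int" };
enum { DEMO_NKW = sizeof demo_kw_names / sizeof demo_kw_names[0] };

/* Intern keywords first so that ids [0, DEMO_NKW) are exactly the keywords. */
static void demo_parser_init(DemoSymtab *t) {
    t->count = 0;
    for (int i = 0; i < DEMO_NKW; i++) {
        int id = demo_intern(t, demo_kw_names[i]);
        assert(id == i);
        (void)id;
    }
}

/* The lexer hands over a plain identifier symbol; the parser decides
   keyword-ness from the symbol id alone, with no string comparison. */
static int demo_is_keyword(int sym) { return sym >= 0 && sym < DEMO_NKW; }
```

With this arrangement the lexer and preprocessor only ever see identifiers, and the keyword test in the parser is a single integer range check.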
Preprocessor — lang/cpp/pp
pp.c (+ pp_directive.c, pp_expand.c) implements translation phase 4 and
exposes a pull interface: pp_next yields one fully preprocessed token at a
time, with directives consumed and macros expanded. It backs both the C
frontend (the parser pulls tokens from it) and the standalone preprocessor
(pp_emit_text reconstructs source text from the token stream for cpp /
cc -E).
Internally the preprocessor runs a token-source stack. Each source is
either a Lexer (the main file or an #included file) or a pre-built token
buffer (a macro expansion in progress). Includes push a new lexer; macro
invocations push an expansion buffer; EOF pops. This is what makes the
include stack and macro rescanning fall out of one mechanism.
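A stripped-down model of the token-source stack, assuming tokens are plain ints and every source is a pre-built array (the real stack mixes live lexers and expansion buffers). DemoSourceStack, demo_push_source, and demo_next_token are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: tokens are ints and each source is a token array. */
enum { DEMO_MAX_SOURCES = 8, DEMO_EOF = -1 };

typedef struct { const int *toks; size_t n; size_t pos; } DemoSource;
typedef struct { DemoSource stack[DEMO_MAX_SOURCES]; size_t depth; } DemoSourceStack;

/* An #include or a macro invocation pushes a new source on top. */
static void demo_push_source(DemoSourceStack *s, const int *toks, size_t n) {
    assert(s->depth < DEMO_MAX_SOURCES);
    s->stack[s->depth].toks = toks;
    s->stack[s->depth].n = n;
    s->stack[s->depth].pos = 0;
    s->depth++;
}

/* Pulls the next token from the topmost source. An exhausted source is
   popped and reading resumes in the one below it. */
static int demo_next_token(DemoSourceStack *s) {
    while (s->depth > 0) {
        DemoSource *top = &s->stack[s->depth - 1];
        if (top->pos < top->n) return top->toks[top->pos++];
        s->depth--;
    }
    return DEMO_EOF;
}
```

Pushing a second source part-way through the first interleaves the two streams exactly the way an include or an expansion interrupts its parent and then returns to it.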
Macro expansion uses the Prosser hideset algorithm (the standard's
"nested replacement" / blue-paint rule). Every token in an expansion buffer
carries a hideset: the set of macro names it must not be re-expanded by during
rescan. Function-like expansions compute the result hideset as the invocation
hideset unioned with the just-expanded macro name, which is exactly what stops
infinite recursion through self-referential and mutually-referential macros.
Hidesets are interned into a small table (pp_expand.c) and kept sorted for
canonical identity, so identical hidesets share one id.
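The interning idea can be illustrated with a small table of sorted sets. This is a sketch, assuming macro names are already integer ids and that sets are tiny; DemoHidesetTable and the demo_hs_* functions are hypothetical and make no claim about how pp_expand.c actually stores hidesets.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical interned-hideset table; sizes and names are assumptions. */
enum { DEMO_MAX_SETS = 32, DEMO_MAX_MEMBERS = 8 };

typedef struct { int members[DEMO_MAX_MEMBERS]; int n; } DemoHideset; /* sorted, unique */
typedef struct { DemoHideset sets[DEMO_MAX_SETS]; int count; } DemoHidesetTable;

/* Id 0 is always the empty hideset. */
static void demo_hs_init(DemoHidesetTable *t) {
    memset(t, 0, sizeof *t);
    t->count = 1;
}

static int demo_hs_contains(const DemoHidesetTable *t, int id, int macro) {
    assert(id >= 0 && id < t->count);
    for (int i = 0; i < t->sets[id].n; i++)
        if (t->sets[id].members[i] == macro) return 1;
    return 0;
}

/* Returns the id of an equal existing set, or stores hs as a new one. */
static int demo_hs_intern(DemoHidesetTable *t, const DemoHideset *hs) {
    for (int i = 0; i < t->count; i++)
        if (t->sets[i].n == hs->n &&
            memcmp(t->sets[i].members, hs->members, (size_t)hs->n * sizeof(int)) == 0)
            return i;
    assert(t->count < DEMO_MAX_SETS);
    t->sets[t->count] = *hs;
    return t->count++;
}

/* Returns the id of (set id) united with {macro}. Keeping members sorted
   makes the representation canonical, so equal sets compare bytewise. */
static int demo_hs_add(DemoHidesetTable *t, int id, int macro) {
    if (demo_hs_contains(t, id, macro)) return id;
    const DemoHideset *src = &t->sets[id];
    assert(src->n < DEMO_MAX_MEMBERS);
    DemoHideset out = { {0}, 0 };
    int placed = 0;
    for (int i = 0; i < src->n; i++) {
        if (!placed && macro < src->members[i]) {
            out.members[out.n++] = macro;
            placed = 1;
        }
        out.members[out.n++] = src->members[i];
    }
    if (!placed) out.members[out.n++] = macro;
    return demo_hs_intern(t, &out);
}
```

During rescan, a token whose hideset contains the macro it names is left unexpanded, which is the check demo_hs_contains models.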
pp_directive.c owns the rest of phase 4: the #if nesting stack and the
preprocessor constant-expression evaluator, #include search and file open
(through the compiler's file-IO service), #line, #pragma, #error,
#embed, and #define/#undef. Include edges are recorded
(pp_add_include_edge -> the source registry, include/kit/source.h) so
dependency reporting (kit_dep_iter_*) and diagnostics can attribute tokens
to the right file.
Parser — lang/c/parse
parse_c (parse.c) is a single-pass recursive-descent parser. It pulls
tokens from the preprocessor with one or two tokens of lookahead and emits CG
ops as it recognizes constructs — there is no separate AST or
semantic-analysis pass over a tree. The parser is split by syntactic area:
declarations and the TU driver in parse.c, expressions in parse_expr.c,
types/declarators in parse_type.c, initializers in parse_init.c, and
statements in parse_stmt.c.
Because lowering happens inline, the parser maintains a typed value stack
that shadows the CG operand stack. Each entry records the C-language type of a
value, value flags (lvalue / modifiable-lvalue / bit-field / register /
null-pointer-constant), and an lvalue auxiliary record. The fields are
cg_type_stack, cg_value_flags, and cg_lv_aux in struct Parser
(parse_priv.h). This shadow stack is the parser's entire notion of "the
expression evaluated so far": the CG side holds the runtime values, the parser
side holds the static types and lvalue bookkeeping needed to drive the next op
correctly (signedness for division/compare/shift, pointer-element size for
arithmetic, bit-field metadata for loads/stores, and so on).
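As an illustration of why the static types must travel with the values, the sketch below keeps a value stack and a type stack side by side and picks signed or unsigned division from the shadow types. DemoParser and the demo_pcg_* functions are hypothetical: the "CG stack" here is just an array of longs evaluated on the spot, and the conversion rule is reduced to "unsigned wins".

```c
#include <assert.h>

/* Hypothetical two-type model; real C types and conversions are far richer. */
typedef enum { DEMO_T_INT, DEMO_T_UINT } DemoCType;
enum { DEMO_STACK_MAX = 16 };

typedef struct {
    long cg_values[DEMO_STACK_MAX];  /* stand-in for the CG operand stack */
    int cg_depth;
    DemoCType types[DEMO_STACK_MAX]; /* parser-side shadow: static types */
    int type_depth;
} DemoParser;

/* Pushes a constant on the value stack and its type on the shadow stack. */
static void demo_pcg_push_int(DemoParser *p, long v, DemoCType t) {
    assert(p->cg_depth == p->type_depth && p->cg_depth < DEMO_STACK_MAX);
    p->cg_values[p->cg_depth++] = v;
    p->types[p->type_depth++] = t;
}

/* Pops two operands from both stacks, chooses the division flavor from the
   shadow types, and pushes the quotient with its result type. */
static void demo_pcg_div(DemoParser *p) {
    assert(p->cg_depth == p->type_depth && p->cg_depth >= 2);
    long rhs = p->cg_values[--p->cg_depth];
    long lhs = p->cg_values[--p->cg_depth];
    DemoCType rt = p->types[--p->type_depth];
    DemoCType lt = p->types[--p->type_depth];
    assert(rhs != 0);
    DemoCType res = (lt == DEMO_T_UINT || rt == DEMO_T_UINT) ? DEMO_T_UINT : DEMO_T_INT;
    long q = (res == DEMO_T_UINT) ? (long)((unsigned)lhs / (unsigned)rhs)
                                  : lhs / rhs;
    demo_pcg_push_int(p, q, res);
}
```

The same bit pattern divides to different results depending on the shadow type, which is information the value stack alone does not carry.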
The cg_adapter.h coupling point
lang/c/parse/cg_adapter.c and its header cg_adapter.h are the real
parser <-> codegen seam, and the most load-bearing design decision in the C
frontend. The header declares a family of pcg_* helpers that look like thin
wrappers over the public kit_cg_* API but do two extra jobs on every call:
They keep the parser's typed value stack in lockstep with the CG stack. pcg_dup/pcg_swap/pcg_drop mirror the structural op onto the type stack; pcg_push_int/pcg_load/pcg_binop/pcg_convert push, retag, or pop type-stack entries alongside emitting the CG op. The parser never calls kit_cg_* for stack-affecting ops directly; it goes through pcg_* so the two stacks can never drift.
They fold C lvalue chains into a single CG memop. There is no CG-level field/index/addr_offset op. Instead a chain like a[i].g accumulates a byte offset, an index scale, and bit-field metadata onto the TOS lvalue's PcgLvAux (pcg_lv_member, pcg_lv_subscript, pcg_decay_array), and the next pcg_load/pcg_store/pcg_addr consumes that aux and emits one effective-address memop. The aux also records whether the base is a frame local or an already-computed pointer rvalue, so address-taking knows whether to emit a kit_cg_addr or treat the base as a pointer.
The implication: the C frontend's lowering strategy is encoded in this
adapter, not in the public CG contract. It is where C-specific decisions
(lvalue-as-effective-address, lazy load/store/addr, the typed shadow stack)
are concentrated. A change to how C lowers indirection lands here, not in the
backends. (pcg_emit_enabled also lets the parser run constant-folding /
sizeof contexts with codegen suppressed while still tracking types.)
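The lvalue-folding idea can be sketched as an accumulator that is only turned into an address when a load finally consumes it. DemoLvAux and the demo_lv_* functions below are hypothetical simplifications of PcgLvAux and its helpers: bit-fields and the local-versus-pointer distinction are omitted, only one dynamic index per chain is supported, and the "memop" is modeled as a counter plus a computed address.

```c
#include <assert.h>

/* Hypothetical, reduced lvalue aux; field names are assumptions. */
typedef struct {
    long byte_offset;  /* sum of constant member offsets */
    long index_scale;  /* element size for the pending subscript */
    int has_index;     /* whether a dynamic index is pending */
} DemoLvAux;

typedef struct { int memops_emitted; } DemoCg;

/* .field: fold the member's constant offset into the pending lvalue. */
static void demo_lv_member(DemoLvAux *lv, long field_offset) {
    lv->byte_offset += field_offset;
}

/* [i]: remember the element size; the index itself stays a runtime value. */
static void demo_lv_subscript(DemoLvAux *lv, long elem_size) {
    assert(!lv->has_index); /* sketch limitation: one dynamic index per chain */
    lv->has_index = 1;
    lv->index_scale = elem_size;
}

/* Consumes the aux: one memop for the whole chain. Returns the effective
   address base + index * scale + offset and resets the aux. */
static long demo_load_address(DemoCg *cg, DemoLvAux *lv, long base, long index) {
    long ea = base + lv->byte_offset + (lv->has_index ? index * lv->index_scale : 0);
    cg->memops_emitted++;
    *lv = (DemoLvAux){ 0, 0, 0 };
    return ea;
}
```

For struct S { int f; int g; } a[4] with 4-byte int, the chain a[i].g becomes a subscript with scale 8 followed by a member with offset 4, and nothing is emitted until the load.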
Semantic layers — type / decl / sem / abi
The parser leans on four C-specific semantic layers under lang/c:
type (lang/c/type) owns the C type representation and lowers each C type to a KitCgTypeId (type_cg_id_in_pool), interning records, enums, pointers, arrays, and function signatures into the CG type universe. C types are the currency of the parser's value stack.
decl (lang/c/decl) is the DeclTable, deliberately layered above the object builder: the object layer stores object-format facts, while the decl table owns the C-level rules the object layer does not know about: linkage, storage duration, tentative definitions, function-local statics, and initializer/redeclaration bookkeeping. decl_attrs.c handles GNU/C23 attribute parsing.
sem (lang/c/sem) is a small library of pure C semantic checks (assignment compatibility, compound-assignment, and redeclaration/composite type rules) that the parser calls at the relevant points and that produce a diagnostic message on failure rather than emitting anything.
abi (lang/c/abi) computes C scalar/aggregate layout facts and feeds the target calling-convention lowering for function entry, parameters, and returns. See ARCH.md for the per-target ABI implementations the C ABI layer drives.
The toy frontend
The toy frontend (lang/toy) is a small, statically-typed,
single-pass-friendly language whose purpose is to exercise the public
code-generation API broadly and readably — it is the coverage vehicle for
make test-toy, not a language anyone ships. It is C-like where that keeps
tests legible and prefix-oriented (@builtin, @[...] attributes,
dot-constants) where C syntax would make parsing or lowering ambiguous. Toy is
fully self-contained: it has its own lexer (lexer.c), parser
(parser.c + expr.c/decls.c/data.c/stmt logic), type system
(types.c), and builtin/intrinsic/inline-asm handling (builtins.c,
asm.c) — it does not reuse the C lexer or preprocessor.
Toy deliberately surfaces low-level CG features that have no clean C spelling:
explicit linkage and symbol/ABI attributes, address-space pointers, tail calls,
computed goto with target sets, relocatable data expressions (@pcrel,
@symdiff, @labeladdr), atomics with explicit ordering/access, the full
conversion-builtin set with rounding modes, target-capability queries, and
typed inline assembly. The broad executable demo
test/toy/cases/123_spec_demo.toy is a normal corpus case (not a doc-only
sample) so the implementation must keep accepting and running the syntax it
demonstrates.
Toy is also the one frontend with durable cross-compile state, because
kit dbg runs it as a REPL. Its frontend object splits into a durable
ToyModule (append-only declaration tables — functions, globals, named types
— carrying metadata and compiler-durable type ids, but no per-object symbol
handles) and a per-compile ToyParser that borrows the module and holds the
per-object symbol environment plus the transaction state. The transaction is a
journaled-in-place model: each compile records watermarks into the durable
tables, mutates them in place during the parse, and on abort truncates back
to the watermark and replays a small undo journal for the rare in-place
mutation of an already-committed entry (completing a forward-declared record).
commit is then a cheap disarm. This is what lets a failed or panicking REPL
snippet leave the persistent declaration tables exactly as they were. The
hooks are wired through the vtable's commit/abort; the session and dbg
driver gate the commit on the full compile -> link -> publish chain. See
DBG.md for the debugger REPL.
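The journaled-in-place model can be sketched with one append-only table, a watermark, and a small undo journal. DemoModule and its functions are hypothetical and far simpler than ToyModule (a single table, a single kind of in-place mutation), but they show why commit is cheap and why abort is idempotent.

```c
#include <assert.h>

/* Hypothetical single-table module; names and sizes are assumptions. */
enum { DEMO_MAX_DECLS = 16, DEMO_MAX_UNDO = 8 };

typedef struct { const char *name; int complete; } DemoDecl;
typedef struct { int index; DemoDecl saved; } DemoUndo;

typedef struct {
    DemoDecl decls[DEMO_MAX_DECLS]; /* durable, append-only */
    int ndecls;
    int watermark;                  /* ndecls when the transaction began */
    DemoUndo undo[DEMO_MAX_UNDO];   /* old values of committed entries */
    int nundo;
    int armed;                      /* a transaction is open */
} DemoModule;

static void demo_begin(DemoModule *m) {
    m->watermark = m->ndecls;
    m->nundo = 0;
    m->armed = 1;
}

/* Appends a declaration in place and returns its index. */
static int demo_append(DemoModule *m, const char *name, int complete) {
    assert(m->ndecls < DEMO_MAX_DECLS);
    m->decls[m->ndecls].name = name;
    m->decls[m->ndecls].complete = complete;
    return m->ndecls++;
}

/* Completes a forward declaration. If the entry predates this transaction,
   its old value is journaled first so abort can restore it. */
static void demo_complete(DemoModule *m, int index) {
    assert(index >= 0 && index < m->ndecls);
    if (m->armed && index < m->watermark) {
        assert(m->nundo < DEMO_MAX_UNDO);
        m->undo[m->nundo].index = index;
        m->undo[m->nundo].saved = m->decls[index];
        m->nundo++;
    }
    m->decls[index].complete = 1;
}

/* Commit is a cheap disarm: the table already holds the new state. */
static void demo_commit(DemoModule *m) {
    m->armed = 0;
    m->nundo = 0;
}

/* Abort replays the journal newest-first, then truncates to the watermark.
   Calling it again, or after a commit, does nothing. */
static void demo_abort(DemoModule *m) {
    if (!m->armed) return;
    while (m->nundo > 0) {
        m->nundo--;
        m->decls[m->undo[m->nundo].index] = m->undo[m->nundo].saved;
    }
    m->ndecls = m->watermark;
    m->armed = 0;
}
```

Entries appended during the transaction need no journal record because truncation discards them; only mutations below the watermark are recorded.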
The wasm frontend
The wasm frontend (lang/wasm, vtable kit_wasm_frontend_vtable) accepts
either WebAssembly text (.wat) or binary (.wasm) — wasm_parse_any sniffs
the input and routes to the WAT parser or the binary decoder, validates the
module, then lowers it into the CG API (cg.c) and synthesizes the host-import
ABI shims (host_imports.c). Like C and asm it carries no durable
cross-compile state (commit/abort are NULL). The structure and the
WAT/binary/lowering details live in WASM.md.
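One plausible way to sniff the input is to test for the four-byte magic that opens every binary module ("\0asm") and treat anything else as text. The function below is a hypothetical sketch of that check and is not taken from wasm_parse_any, which may apply additional or different rules.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { DEMO_WASM_BINARY, DEMO_WASM_TEXT } DemoWasmKind;

/* Hypothetical sniffer: a binary module starts with the bytes 00 61 73 6D.
   Anything else, including empty input, is handed to the text parser. */
static DemoWasmKind demo_wasm_sniff(const unsigned char *bytes, size_t len) {
    static const unsigned char magic[4] = { 0x00, 'a', 's', 'm' };
    assert(bytes != NULL || len == 0);
    if (len >= sizeof magic && memcmp(bytes, magic, sizeof magic) == 0)
        return DEMO_WASM_BINARY;
    return DEMO_WASM_TEXT;
}
```

Since the magic begins with a NUL byte, which is not expected in WAT source, the two formats cannot be confused by this test.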
The wasm module model it parses into — WasmModule and the WAT/binary codec —
is src/wasm/wasm.h, a Tier-3 internal subsystem also used by the wasm object
format (src/obj/wasm) and codegen backend (src/arch/wasm). It is the one
place a frontend depends on src/ rather than include/kit/, and the reason
is structural: the module IR is large and unstable, so it is shared internally
rather than mirrored into the public API (whose wasm.h is the narrow
host-import binder). Everything else the frontend touches — codegen, arena, heap
— is public. See INTERFACES.md for the boundary rationale.