kit

kit
git clone https://git.ryansepassi.com/git/kit.git
Log | Files | Refs | README

RSN — Ryan's Scripting Notation (planned work)

A small, statically-typed scripting language that lets kit replace its remaining external build/test dependencies — make, POSIX shell, and Python — with code it owns and ships, and that runs on kit's own compiler pipeline. RSN is the language; the build system that grows on top of it is a later layer (sketched at the end). This doc is the living design.

The architecture, in one line

RSN is another frontend on kit's pipeline (alongside C, Wasm, toy): RSN source → public CG API → IR → execution. To make a garbage-collected language fit, kit's IR gains a first-class managed-reference type, so RSN runs on the existing IR interpreter (arena-allocated first, then precisely GC'd) and can compile to fast native code through the existing backends later. Static typing is what makes this pay off: typed operations lower to real IR ops (not opaque runtime calls), so the optimizer/backends optimize RSN, and the managed-ref type makes every GC root statically known.

Because this turns the IR into a GC-aware IR, it is staged, and Stage 1 is itself split to keep the highest-blast-radius change (touching the shared CG type system) out of the critical path until the language has proven itself:

Why

Three external tools carry three jobs today: make (1.5k lines: the dependency DAG, host/config branching, KIT_*_ENABLED gating, the cross-arch matrix), shell (12k LOC under test/: the corpus harness engine, process orchestration, golden-diffing), and Python (parsing tool output and wrangling text/JSON/CSV). RSN's goal is one owned, typed language that does all three, so kit's only external dependency is a seed C toolchain. 3 → ~0.

Locked decisions

Architecture & runtime

  1. Typed RSN as a CG-API frontend (lang/rsn/, with a type checker); no bespoke bytecode — the IR / interp InterpInsn is RSN's bytecode.
  2. GC-aware IR: a managed-reference type + a small GC op dialect in the CG type system + IR, firewalled so the C path pays nothing.
  3. Precise, non-moving mark-sweep GC, precise by construction.
  4. Cooperative single-threaded fibers from day one, on the interp's explicit swappable stack and its reserved KIT_INTERP_BLOCKED yield state. Cores are saturated by spawned child processes, not threads.
  5. Staged delivery: interp first — arena-allocated (1a), then precise GC (1b); native managed codegen later (Stage 2).

Language

  1. Syntax: brace / C-ish (Go/Wren family), strictly LL(1), significant newline termination.
  2. Bindings: let (immutable) + var (mutable). Object contents stay mutable regardless — this governs only rebinding the name.
  3. Algebraic data types: struct (named product) + enum (sum / variants) + (T, U, …) tuple (anonymous product), all destructured by the one pattern grammar (match/if let/let/for).
  4. Nullability: refs are non-null; absence is T? (Option<T>), unwrapped with ?? (coalesce) and if let. Distinct from Result (absence vs error).
  5. Errors: Result<T, E> + Zig-style try (propagate) / catch (recover), not exceptions. Result/Option are sum types; match also destructures them.
  6. any + match: any is the dynamic escape hatch (JSON, parsed text); match narrows it (string s, int n, …). match is the single multi-way construct and general destructurer — literals/strings, ADTs, tuples, any type-narrows, and or-patterns (a | b -> …); no fallthrough, no switch.
  7. Generics — Stage 1 ships only the builtin containers (list<T>, map<K, V>) as hand-written specializations (packed/unboxed list<int> / list<float>, a shared boxed instantiation otherwise). The general instantiation strategy (.NET-style hybrid vs. plain monomorphization, plus constraints for user-defined generic fns) is deferred until the feature exists — not locked ahead of need.
  8. Expression if/match (value-producing); blocks appear only in control-flow slots, so a bare {} stays a map literal and ? is reserved for optionals (no ternary). match is the single multi-way construct — there is no separate switch; literal/string arms plus or-patterns (a | b -> …) cover value-dispatch, arms never fall through.
  9. Modules: file = module; pub exports (private by default); import "x.rsn" as x, qualified access.
  10. Patterns: Lua-style string patterns, not a full regex engine.
  11. Methods via UFCS: a.f(b)f(a, b). a.name is a field load when name is a field of a's type, otherwise a.name(...) resolves name as a free function with a as its first argument. One rule covers objs.push, src.replace, xs.map — no method-declaration form, no receiver concept. Tie-break: a field shadows a free function of the same name.
  12. String interpolation: "...${expr}..." (a lexer feature; the interpolated expr is a full expression, \$ escapes a literal $). Argv is still built with lists (["kit", "cc", src]), never by interpolating into one command string — that keeps shell-quoting bugs out by construction.
  13. First-class tuples + one pattern grammar: (T, U, …) (arity ≥ 2) is a storable, nestable anonymous product; () is unit and (e) is grouping. Destructure-only (no .0/.1). A single pattern concept serves let/var/for/match/if let (tuples always written (a, b)); map yields (K, V) so for (k, v) in m is just a tuple pattern. (a, b) = (b, a) is parallel assignment (RHS fully evaluated first).

Superseded by this design: the earlier NaN-boxed dynamic value model and dynamic typing — typing removes the need for a universal tagged value.

What we reuse

The IR extension: managed references + GC

The heart of the design, specified with its firewall discipline.

The type. A new CG type kind — a managed ref: an opaque, GC-owned pointer to a heap object of a given shape. Distinct from a raw C pointer. RSN's string/list/map/fn/fiber/chan/record/enum/tuple/any values are managed refs; int/float/bool are unmanaged scalars.

The ops (a small dialect): managed_alloc(shape) -> ref, ref_field_load/ store, ref_elem_load/store, ref_eq, ref_null. Safepoints are implicit at allocations, calls, and loop back-edges (no explicit op in the interp; once the 1b collector is on it can collect at any allocation because all roots are in registers).

Object shapes for precise tracing. Every managed object's header points at a shape descriptor listing which fields are managed refs (or a trace function for containers). Sum types use a per-variant shape: the collector reads the active tag, then traces that variant's ref-map. Records use a single field-map.

Root finding in the interp. Each InterpFunc carries a static ref-bitmap marking which PRegs hold managed refs. Roots = every live ref-PReg across all live fibers' frames + globals + the string-intern table + a small C-roots scratch for native functions mid-construction. Precise; no stack maps at the interp level.

Two cost-containing invariants (locked):

Firewall discipline. The managed dialect is an optional IR feature: a function with no managed refs needs no GC support and pays nothing. The C frontend never emits managed refs, so C → IR → {native, Wasm, C} is untouched and zero-cost. A backend without managed support asserts out cleanly on a managed function. The managed-IR capability is its own flag (KIT_MANAGED_IR), distinct from the KIT_RSN_ENABLED frontend that consumes it. The standing test: a C-only build compiles and links with the managed dialect compiled out.

Stage 2 (native). Precise native GC needs safepoints + stack maps: the optimizer's liveness + register allocator record which managed refs are live at each safepoint/call, per arch. This is the real cross-cutting work and the reason native is staged separately. The two invariants above keep it to root identification rather than the full moving-GC codegen problem.

Type system

Static, with local inference so scripts stay terse.

Value model

Typed, so values are their natural machine representations — no universal tagged word:

RSN type Representation
int / float / bool i64 / f64 / i8 (unmanaged scalars)
string, list, map, fn, fiber, chan, record, enum managed ref
(T, U, …) tuple managed ref (anonymous positional record); all-scalar packing deferred
Option<T> / T? managed ref (a sum); Option<scalar> boxes
any managed ref to a {tag, payload} box

int is a true 64-bit integer (the reason typing was attractive — no NaN-box, no bigint spill, no pointer masking). Mixed int/float arithmetic requires an explicit float(x). Strings are immutable, so they are always safe to share across fibers; interning is selective, not universal — short strings and any string used as a map key are interned (cheap equality, CAS keys), while large strings (file bodies, captured subprocess output) are held by reference without hashing them into the intern table. Option<scalar> boxing is later niche-optimizable; builtin list<int> stays packed/unboxed.

Errors and results

Values, typed, with Zig's try ergonomics — no exceptions, no unwind tables.

fn link(objs: list<string>) -> Result<string, Error> {
  let a = try run(["kit", "ar", "rcs", "lib.a"] + objs)   # propagate on failure
  return Ok(a.out)
}

Surface syntax

Brace-delimited, keyword-led, semicolon-free, mandatory braces on block bodies. Type annotations only in type positions. Comments are # to end-of-line (line or trailing); there are no block comments. Strings interpolate with "${expr}". Method calls are UFCS (xs.map(f)map(xs, f)).

struct Target { name: string, inputs: list<string>, optimize: bool }

enum Json {
  Null
  Bool(bool)
  Num(float)
  Str(string)
  Arr(list<Json>)
  Obj(map<string, Json>)
}

fn render(j: Json) -> string {
  match j {                          # match is an expression
    Null    -> "null"
    Str(s)  -> quote(s)
    Num(n)  -> str(n)
    Arr(xs) -> "[" + join(map(xs, render), ",") + "]"
    _       -> "..."
  }
}

fn build(t: Builder) {
  let objs: list<string> = []        # annotation: empty list element type
  for src in glob("src/**/*.c") {
    let obj = src.replace(".c", ".o")
    t.target(obj, [src], fn() { try run(["kit", "cc", "-c", src, "-o", obj]) })
    objs.push(obj)                   # mutating contents (let binding is fine)
  }
  let mode = if t.release { "opt" } else { "dbg" }   # value-if; no ternary
  t.target("libkit.a", objs, fn() { try run(["kit", "ar", "rcs", "libkit.a"] + objs) })
}

Grammar (LL(1))

LL(1) by construction; type syntax confined to type positions. The lexer suppresses the statement-terminating newline after an open bracket, a trailing binary operator, or when the next non-blank token is a leading . (so method/field chains may wrap across lines); the EBNF omits the newline token. Comments (# to end-of-line; no block form) and string interpolation ("${expr}", \$ escapes a literal $) are lexer-level; an interpolated ${…} lexes as a parenthesized sub-expression.

program    = { item } ;
item       = importDecl | [ "pub" ] decl | stmt ;
importDecl = "import" STRING "as" IDENT ;
decl       = fnDecl | structDecl | enumDecl ;

fnDecl     = "fn" IDENT [ generics ] "(" [ params ] ")" [ "->" type ] block ;
structDecl = "struct" IDENT [ generics ] "{" { IDENT ":" type [ "," ] } "}" ;
enumDecl   = "enum" IDENT [ generics ] "{" { variant [ "," ] } "}" ;
variant    = IDENT [ "(" type { "," type } ")" ] ;
generics   = "<" IDENT { "," IDENT } ">" ;
params     = IDENT ":" type { "," IDENT ":" type } [ "," ] ;

stmt       = letDecl | varDecl | whileStmt | forStmt | returnStmt
           | "break" | "continue" | exprStmt | block ;
letDecl    = "let" pattern [ ":" type ] "=" expr ;   (* pattern irrefutable *)
varDecl    = "var" pattern [ ":" type ] "=" expr ;   (* pattern irrefutable *)
whileStmt  = "while" cond block ;
forStmt    = "for" pattern "in" cond block ;          (* pattern irrefutable *)
returnStmt = "return" [ expr ] ;
exprStmt   = expr [ assignOp expr ] ;   (* tuple-of-lvalues LHS = parallel assign *)
assignOp   = "=" | "+=" | "-=" | "*=" | "/=" ;
block      = "{" { stmt } "}" ;     (* value = trailing expr-stmt, else unit *)

type       = baseType { "?" } ;
baseType   = IDENT [ "<" type { "," type } ">" ]
           | "fn" "(" [ type { "," type } ] ")" "->" type
           | "(" [ type "," type { "," type } ] ")" ;
             (* left-factored: "()" = unit; "(" T "," U … ")" = tuple (arity >= 2).
                No "(" T ")" form — types have no paren-grouping, so a lone "(int)"
                is a syntax error; unit-vs-tuple is decided after "(" on ")" vs a type. *)

expr       = coalesce [ "catch" [ "|" IDENT "|" ] expr ] ;
coalesce   = orExpr  { "??" orExpr } ;
orExpr     = andExpr { "||" andExpr } ;
andExpr    = eqExpr  { "&&" eqExpr } ;
eqExpr     = relExpr { ("==" | "!=") relExpr } ;
relExpr    = addExpr { ("<" | "<=" | ">" | ">=") addExpr } ;
addExpr    = mulExpr { ("+" | "-") mulExpr } ;
mulExpr    = unary   { ("*" | "/" | "%") unary } ;
unary      = "try" unary | ("!" | "-") unary | postfix ;
postfix    = primary { callOp | indexOp | fieldOp } ;
callOp     = "(" [ args ] ")" ;
indexOp    = "[" expr "]" ;
fieldOp    = "." IDENT ;
args       = expr { "," expr } [ "," ] ;

primary    = INT | FLOAT | STRING | "true" | "false"
           | ifExpr | matchExpr | spawnExpr
           | IDENT [ structLit ]          (* structLit suppressed in cond; see below *)
           | parenExpr
           | listLit | mapLit | fnLit ;
parenExpr  = "(" expr [ "," expr { "," expr } ] ")" ;
             (* left-factored: no comma => grouping (value is the inner expr);
                >= 1 comma => tuple (arity >= 2). The grouping-vs-tuple choice is
                made on the token after the first expr: ")" vs ",". No 1-tuples,
                so "(a,)" is rejected. *)
spawnExpr  = "spawn" block ;              (* yields a fiber handle *)
ifExpr     = "if" ( "let" pattern "=" cond | cond ) block
             [ "else" ( ifExpr | block ) ] ;
cond       = expr ;   (* a bare `IDENT { … }` struct literal is NOT taken here; a
                         struct literal in condition position must be
                         parenthesized: `if (Foo{…}).ok { … }` *)
matchExpr  = "match" expr "{" { matchArm } "}" ;
matchArm   = pattern "->" ( block | expr ) [ "," ] ;
pattern    = patternAtom { "|" patternAtom } ;   (* or-pattern; all alternatives
                                                    must bind the same names+types.
                                                    `|` here is pattern position,
                                                    distinct from catch's `|e|`. *)
patternAtom = "_" | literal
           | IDENT IDENT                       (* type-narrow: type binder *)
           | IDENT [ "(" pattern { "," pattern } ")" ]    (* Ctor / binding *)
           | "(" pattern "," pattern { "," pattern } ")" ;  (* tuple, arity >= 2 *)
structLit  = "{" [ field { "," field } [ "," ] ] "}" ;
field      = IDENT ":" expr ;
listLit    = "[" [ expr { "," expr } [ "," ] ] "]" ;
mapLit     = "{" [ entry { "," entry } [ "," ] ] "}" ;
entry      = ( STRING | "[" expr "]" ) ":" expr ;   (* string or computed key;
                         bare `IDENT:` is a struct field, never a map key, so
                         `{name: x}` is unambiguous and `map<int,V>` literals are
                         written `{[k]: v}` *)
fnLit      = "fn" "(" [ params ] ")" [ "->" type ] block ;
literal    = INT | FLOAT | STRING | "true" | "false" ;

LL(1) tactics that constrain the language:

Semantics defaults

Decided, revisitable — these were settled as defaults rather than formal forks:

Execution: RSN on the interp

Memory and GC

Embedding API and capability gating

typedef struct KitRsnHost {
  const KitFileIO* file_io;                 /* build-def: read/glob/stat live, writes gated */
  int      (*spawn)(void* user, const KitRsnSpawn*, KitRsnProc** out);
  int      (*proc_reap)(void* user, KitRsnProc*, int* exit_code); /* scheduler */
  int64_t  (*clock_ns)(void* user);           /* gated */
  int      (*mkdir_p)(void* user, const char* path);
  void* user;
} KitRsnHost;

KIT_API KitRsn*   kit_rsn_new(KitContext*, const KitRsnHost*);
KIT_API KitStatus kit_rsn_eval(KitRsn*, KitSlice src, const char* name);
KIT_API void        kit_rsn_register(KitRsn*, const char* name,
                                         KitRsnNativeFn, void* ud);
KIT_API void        kit_rsn_free(KitRsn*);

The build layer (later)

A library in RSN plus a thin engine: target(name, inputs, outputs, recipe); staleness by content hash via the CAS (not mtimes); parallel execution via parallel/fibers bounded to ncpu(); kit build loads build.rsn in build-definition mode for the graph, then executes it in script mode. Out of scope for the milestones below — it is what the language is for.

Implementation phases

Stage 1a — typed RSN, interpreted, arena-allocated (no collector yet):

  1. lang/rsn/ lexer + LL(1) parser → AST (full grammar above). Parse-corpus tests under test/rsn/.
  2. Type checker: local inference, ADTs (struct/enum/tuple), builtin-container generics, UFCS resolution, the uniform pattern grammar (irrefutability check for let/var/for, exhaustiveness for match), Result/Option/try, modules.
  3. RSN → CG-API lowering (typed ops → IR ops; any/containers/iterators via rt/lib/). Managed values live in a run-scoped arena, freed at run end — the interp carries a per-function ref-bitmap as interp-local metadata, but no new CG type kind is added yet, so the C compiler's type system is untouched.
  4. Fibers: complete the BLOCKED/yield path + scheduler over InterpStacks + host vtable (spawn/run/read/clock); spawn/parallel/chan.
  5. Stdlib: strings, list/map, iterators/range, Lua-style patterns, JSON (→ any), run/exec.
  6. CLI kit rsn + per-operation capability gating.

Goal of 1a: prove the language on real build.rsn / test scripts before the invasive IR surgery. Short-lived runs make arena-free-at-exit genuinely adequate.

Stage 1b — precise GC (added once runs are long-lived enough to need it):

  1. CG/IR managed-ref type + GC op dialect (src/cg), gated KIT_MANAGED_IR; object headers + (per-variant) shape descriptors. Firewall test: C-only build green with the dialect compiled out.
  2. Lowering emits managed ops for refs; the interp's root scan + mark-sweep collector + finalizers replace the arena. (The 1a ref-bitmap becomes the precise root map.)

Stage 2 — native managed codegen (optional, demand-driven):

  1. Safepoints + GC stack maps in the optimizer (liveness/regalloc).
  2. Native emit for managed functions across aa64/x64/rv64.
  3. JIT/AOT RSN: kit rsn --jit, and kit build emitting native build tools.
  4. Wasm/C-backend managed support (furthest out).

Then: the build layer.

Resolved decisions

All of: the architecture (typed CG-API frontend; GC-aware IR; precise non-moving mark-sweep; cooperative fibers; staged interp → 1a arena → 1b GC → native); let/var; non-null + T?; Result/try/catch; ADTs (struct/enum) with match; any narrowing via lowercase type-patterns in match; UFCS methods; string interpolation; # line comments; expression if/match with the {}-block rule and the struct-literal-in-condition restriction; qualified module imports; Lua-style patterns; separate int/float with explicit conversion; integer overflow traps; selective string interning; per-operation capability gating; first-class tuples with one uniform pattern grammar (Rust-style) + parallel assignment; match as the single multi-way construct (no switch) with or-patterns and no fallthrough; Stage-1 generics = builtin containers only; the semantics defaults above; NaN-box value model dropped.

Open decisions

None block Stage 1a. Deferred until their phase: selective imports (from … import { … } sugar), range syntax (a..b sugar vs the range() builtin) — which would also unlock range patterns in match (1..10 -> …), record-style enum payloads (Variant { … }), and user-defined generic functions (instantiation strategy + constraints — see Type system). The earlier tuple gap is now resolved (first-class (T, U), see Type system / Value model / Grammar); the remaining tuple sub-question deferred to its phase is packed all-scalar tuple representation (e.g. unboxed (int, int)), niche-optimized alongside packed list<int>. Match guards (pattern if cond -> …) are deliberately deferred — match stays structural + or-pattern dispatch, and conditional logic lives in if/else if; guards can be added later non-breakingly (matchArm = pattern [ "if" expr ] "->" …).

Naming

Language: RSN — Ryan's Scripting Notation. Source extension .rsn. Frontend lang/rsn/, CLI kit rsn, frontend flag KIT_RSN_ENABLED; the GC-aware IR it depends on is gated separately (KIT_MANAGED_IR). When RSN ships, the durable design moves up to doc/RSN.md and this roadmap shrinks to what remains open.