kit

kit
git clone https://git.ryansepassi.com/git/kit.git
Log | Files | Refs | README

Content-addressed build coordinator

Status: design, not yet built. This is a forward-looking roadmap for a new subsystem layered on the content store in ../DISTRIBUTE.md (<kit/cas.h>). It specifies the configuration model, the on-disk storage state machine, the coordinator's in-memory state and caching algorithm, the command protocol recipes use to talk back, and how builds share work over a network (remote CAS + signed traces).

Naming, to avoid confusion: this is not ../BUILD.md, which documents how kit itself is built (the Makefile, KIT_*_ENABLED component gating, and the 3-stage bootstrap). This doc designs a user-facing, Bazel/Nix-style incremental build engine that compiles arbitrary projects by running recipes and caching their outputs by content. The two are unrelated beyond sharing the word "build."

What it is

A build coordinator is a long-lived process that turns a build request (target T under configuration C) into a materialized output tree on disk, doing the least work necessary. It does this by running recipes — opaque executables, typically shell scripts — that produce output directories, and by caching every recipe result keyed by the exact set of inputs that produced it. A second request whose inputs are unchanged returns the cached output without re-running anything; a request whose direct dependencies' outputs are unchanged (even if some deep source changed) can still skip its own recipe.

The whole design rests on two properties:

It reuses the existing CAS wholesale: source-file bytes, output directories, config snapshots, and trace bodies are all CAS blobs and trees. The coordinator adds two things the content store does not have — a mutable per-target trace index, and the resolution algorithm that drives it.

Concepts and vocabulary

Term Meaning
Target T A named unit of work, e.g. //app:server. A build definition maps each target name to one recipe + local argv.
Recipe The executable the coordinator runs to produce a target's output. Identified by recipe-id = BLAKE2b(recipe file bytes).
Configuration A key→value map of tunables (target triple, opt level, feature flags). Two scopes — propagated and local — see below. Identified by config-id = BLAKE2b(canonical map).
Output tree The directory a recipe produces, captured as a CAS tree (tree-id). The unit a build request returns.
Base inputs The leaves of the dependency graph, dynamically requested by recipes: config values, source files, and globs.
Target dep A dependency of one target on the output tree of another. The graph's interior edges, created by need.
Shallow trace A record of one build: its direct base inputs + its direct target deps (by output tree-id) → its output tree-id.
Deep trace A record of one build: the transitive closure of source/glob leaves + the root config-id → its output tree-id.
Tree cache On-disk materialized output directories, keyed by tree-id, ready to hand back as a filesystem path.

The three base-input kinds are exactly the dynamically-requested leaves: a recipe asks the coordinator for a config value, the bytes of a source file, or the expansion of a glob, and each request is logged as a dependency so a later rebuild knows precisely what to re-check.

Configuration model

Configuration has two scopes, distinguished by visibility:

Serialized for replay, not just hashed. Both the effective config map and the argv vector are persisted by value as canonical, content-addressed CAS blobs (config-id, argv-id) — never reduced to an opaque one-way hash. A trace references them by id, and they are part of its reachable closure (GC-rooted, and bundled with shared traces). This is what lets the shallow path replay every need request exactly — reconstruct the dep's target name, its full effective config with overlays applied, and its argv — even when driving from an imported shallow trace with no access to the original build definition or workspace. Source files and globs are the opposite case: kept by hash only, because they are verified (did this still hash the same?), never replayed, and their bytes already live in the CAS.

The resolution identity of a build is therefore (target-name, config-id), where config-id is the effective propagated config. Local argv is determined by the target name (via the build definition), so it does not widen the key. Two requests (T, C1) and (T, C2) — the same target under different propagated config — are genuinely distinct builds; both may be in flight at once.

recipe-id covers only the recipe file bytes. Two other things that can change a target's behavior — which recipe it maps to, and its argv — live in the build definition, which the coordinator reads to resolve T. That read is itself a tracked source dependency of T, so editing a target's stanza (repointing its recipe, or changing argv) invalidates its traces through the ordinary source-leaf mechanism.

Why two trace kinds

The deep trace answers "is this exact configuration unchanged and has any source moved?" in one pass over a flat leaf set — no graph walk. When it matches, the output is determined and we skip straight to materialization. It is the inner dev-loop fast path: edit code, rebuild under the same config.

The shallow trace handles everything the deep path defers — a changed source that might not actually move a dep's output, and any configuration change. It rebuilds the direct deps (cheaply, via their own traces, threading the new config down through need), compares their resulting tree-ids against what it recorded, and if they match, the recipe is still skipped. This is the payoff of recording deps by output identity rather than by input identity: a subtree that ignores a changed config key, or absorbs a comment-only edit, produces the same output tree and short-circuits its parent's recipe.

On-disk storage

The build store sits beside (or contains) a CAS. Everything content-addressed is immutable and self-verifying; exactly one class of object is mutable.

<store>/
  cas/                         # the shared content store, per DISTRIBUTE.md
    blob/<pp>/<blob-id>        #   raw file bytes (sources, config maps, argv vectors, trace bodies)
    tree/<pp>/<tree-id>        #   canonical directory manifests (output trees)
    ...                        #   (chunk/, index/ as in DISTRIBUTE.md)
  build/
    trace/<pp>/<trace-id>      # IMMUTABLE canonical trace bodies (deep + shallow)
    target/<pp>/<target-key>   # MUTABLE per-target trace set (the only mutable state)
    cache/<pp>/<tree-id>/      # tree cache: materialized output directories
    tmp/                       # staging for atomic writes + recipe sandboxes

<pp> is the first two lowercase hex chars of the relevant id (the DISTRIBUTE.md convention). trace-id = BLAKE2b of the canonical trace body. target-key = BLAKE2b("kit build target v1" ‖ target-name) — note it depends on the target name only, never on config or recipe-id, so a target's index entry is a stable handle while its inputs churn. Config maps and trace bodies are ordinary CAS blobs; only target/ records and the cache/ materializations are build-specific.

Trace bodies (immutable, content-addressed)

Both kinds are strict, byte-stable, INI-style text in the family of kit-tree 1 / kit-package 3: a version line, ordered scalar fields, then sorted sections. Unknown keys/sections, duplicate rows, and non-canonical ordering are parse errors (a parse error is treated as absent, never as a match — fail safe). All ids are lowercase hex. Config values, source bytes, and glob expansions are stored by hash, never inline, so a body's size is independent of its inputs' sizes.

Shallow trace — direct inputs and direct deps, for fine-grained rechecking:

kit-build-shallow 1
target //app:server
recipe <recipe-id>
output <output-tree-id>
config <config-id>             ; this target's effective propagated config (full map blob)
argv <argv-id>                 ; this target's local config (serialized vector blob)
[config]                       ; CONSUMED propagated key NAMES, sorted (values live in the map)
opt
[source]                       ; sorted by path
src/main.c <blob-id>
build.kit <blob-id>            ; the build definition stanza is a tracked source
[glob]                         ; sorted by pattern
src/*.c <glob-result-hash>
[dep]                          ; sorted by (dep-name, dep-config-id) — each row replays one `need`
//lib:core <dep-config-id> <dep-output-tree-id>

Deep trace — root config-id plus the transitive source/glob closure, no [dep] and no per-key [config] section:

kit-build-deep 1
target //app:server
recipe <recipe-id>
output <output-tree-id>
root-config <config-id>        ; this target's effective propagated config (full map blob)
argv <argv-id>                 ; local config (serialized vector blob)
[source]                       ; union over the whole closure, sorted, deduped
src/main.c <blob-id>
lib/core.c <blob-id>           ; pulled in transitively from //lib:core
[glob]
src/*.c <glob-result-hash>

Why the deep trace needs only root-config + source/glob leaves. If the request's config-id equals root-config and every source/glob leaf still matches the live workspace, then every recipe in the closure has byte-identical inputs — so every need overlay it computes is identical, every effective config downstream is identical, and by determinism the output is identical. Any input that could perturb a downstream overlay (a source a recipe branches on, a config value) is itself either a recorded source/glob leaf or folded into root-config, so nothing escapes the check. Using the whole root-config (rather than just consumed keys) is a deliberate, conservative simplification: an unconsumed config change busts the deep fast path, but the shallow path recovers — it re-runs only the recipes whose output actually changed, so unrelated subtrees re-verify by hash without re-running.

Target record (mutable, the only mutable object)

build/target/<pp>/<target-key> lists this target's candidate traces, newest-first, capped:

kit-build-record 1
target //app:server
deep <trace-id>
deep <trace-id>
shallow <trace-id>
shallow <trace-id>

Each row points at a trace body in build/trace/. On every successful resolution the coordinator prepends the fresh trace, de-duplicates, and truncates each kind to a small cap (KIT_BUILD_RECORD_CAP, e.g. 8) — a bounded MRU window. Older traces age out and become GC-eligible. Multiple candidates exist because the same target may have been built under several distinct input/config combinations recently (e.g. two config values flipped back and forth).

Storage state machine (atomicity and crash safety)

The store must never hand back a wrong output after a crash. The rules:

  1. Content objects (cas/blob, cas/tree, build/trace, config maps, and a materialized build/cache/<tree-id>/) are written into build/tmp/, fsync'd, then atomically renamed to their final content-keyed path. A half-written object only ever exists under tmp/ and is orphaned, never under its content key. Re-deriving the same content re-creates the identical path — writes are idempotent, so a racing or retried producer is harmless.

  2. The target record is the only ordering-sensitive state, updated by read-modify-write into tmp/ then atomic rename. A reader always sees a complete prior or complete next version, never a torn one. A crash loses at most the most recent record update; losing a record entry is safe — the next build re-derives and rewrites it. A corrupt record (failed kit-build-record 1 parse) is treated as empty.

  3. Recipe sandboxes live in build/tmp/run-<n>/. On success the output subdir is ingested into the CAS (kit_cas_add_tree_from_dir) and the sandbox is removed; on failure or crash it is orphaned and swept later.

Invariant. Anything reachable under a content key (blob/, tree/, trace/) is complete and matches its key. A target record may dangle — point at a trace that GC removed — and readers tolerate that by treating the missing candidate as absent. The store degrades toward "rebuild," never toward "wrong answer."

Concurrency. Two coordinators sharing a store are safe for content writes (idempotent atomic renames). Target-record updates are last-writer-wins; both writers wrote valid records, so the only loss is MRU ordering, costing at most a rebuild. An optional per-target-key advisory lock around the read-modify-write removes even that.

Build coordinator state and caching

In-memory state

All state hangs off one KitBuildCoordinator context (no globals — the project rule). For the life of the process it memoizes every base-input probe and every target resolution, so a diamond in the graph is built once and a source file is hashed once:

Field Contents Notes
source_hashes path → blob-id (+ stat). Per-process memo.
glob_results pattern → (sorted paths, glob-result-hash). Per-process memo.
config_maps config-id → map, and overlay results. Immutable; content-addressed.
targets (target-name, config-id) → future of {output-tree-id, deep-leafset, path}. Memo and in-flight dedup.
cas Open KitCas* handle over cas/. Plus optional remote (below).
store Paths + target-record reader/writer over build/.
jobs Semaphore bounding concurrent recipe processes. The global parallelism limit.

"Cached over the life of the coordinator process" means exactly these memo tables: the coordinator assumes the workspace does not change under it mid-build.

The targets table is keyed by the full (target-name, config-id) and holds a future, not a flag — so concurrent needs of the same (T, config-id) await one resolution rather than racing (see Parallelism). A resolved build yields the output tree-id (to satisfy a need), a materialized path, and its deep leafset (the transitive source/glob closure, to fold into a parent's deep trace).

Cycle detection

The targets memo alone cannot catch a cycle — a self-dependent target would simply await its own unresolved future and deadlock. So each resolution carries an explicit build chain: the ordered list of (target-name, config-id) frames from the root request down to here. A need whose (dep, config-id) already appears on the chain is a dependency cycle; the chain is the error message (//a → //b → //a). The chain is per-path and distinct from the cross-path memo: the same (T, config-id) may appear in many chains (a shared dep) but never twice in one.

Resolution algorithm

A request is resolve(T, cfg, chain) where cfg is the effective propagated config (with cfg.id its config-id). The config-dependence is recorded in and verified against the traces; the deep path checks whole-config-id equality, the shallow path checks consumed keys.

resolve(T, cfg, chain):
  key = (T, cfg.id)
  if key in chain:  error cycle(chain + key)
  if key in targets:  return await targets[key]        # cross-path memo / in-flight dedup
  targets[key] = new future ; chain' = chain + key

  # ---- Phase 1: deep fast path (same config, did sources move?) -------
  for D in deep_traces(T), newest-first:
     if D.root_config == cfg.id
        and D.argv     == argv_id(T)
        and refresh(D.source_glob_leaves) all match live:   # memoized rehash / reglob
        return done(materialize(D.output, T, cfg), D.leafset)

  # ---- Phase 2: shallow path (config and/or sources moved) ------------
  for S in shallow_traces(T), newest-first:
     if S.recipe != recipe_id(T):           continue
     if S.argv   != argv_id(T):             continue       # local config
     M = config_by_id(S.config)                            # the built map (serialized blob)
     if any consumed key in S.config-keys differs between cfg and M:  continue
     if any direct source/glob leaf of S changed:                     continue
     ok = true ; child_leafsets = []
     for (dep, dep_cfg_id, recorded_tree) in S.deps:       # may run in parallel
        r = resolve(dep, config_by_id(dep_cfg_id), chain')  # replay the need: map recovered by id
        if r.output != recorded_tree: ok = false; break
        child_leafsets.push(r.leafset)
     if ok:
        leafset = union(direct source/glob leaves of S, child_leafsets)
        write_deep_trace(T, cfg.id, leafset, S.output)      # refresh the deep trace
        return done(materialize(S.output, T, cfg), leafset)

  # ---- Phase 3: run the recipe ----------------------------------------
  return run_recipe(T, cfg, chain')

with the materialization ladder shared by every cache hit:

materialize(tree-id, T, cfg):
  if tree-id in build/cache/  : return that path             # already on disk
  if tree-id in cas/          : restore into build/cache/, return path
  if remote configured        : fetch tree-id (+ blobs) into cas/, verify, restore
  else                        : return run_recipe(T, cfg)    # bytes gone: rebuild

This is the brief's flow, sharpened by config:

The non-obvious win is Phase 2 succeeding after an input changed. A changed source busts the deep trace (Phase 1 gives up), but if that source only feeds a dep whose recipe maps it to an unchanged output (tree-id identical), the dep compare passes and T's recipe is skipped. The deep trace is the cheap "nothing moved" check; the shallow trace is the "did the churn actually reach me?" check.

Running a recipe

run_recipe(T, cfg, chain):
  acquire jobs                                              # global parallelism limit
  sandbox = build/tmp/run-<n>/ ; out = sandbox/out/
  spawn recipe (argv = argv(T), recovered by argv_id) with env:
     { KIT_BUILD_SOCK, KIT_BUILD_OUT=out, KIT_BUILD_TARGET=T, workspace root }
  service the recipe's protocol requests, logging each as a dep (next section);
     config-get reads cfg; need recurses resolve(...) under cfg ⊕ overrides
  release jobs
  on nonzero exit: propagate failure, write NO trace
  on success:
     output = kit_cas_add_tree_from_dir(out)
     put serialized cfg map (cfg.id) and argv vector (argv_id(T)) into the CAS
     write_shallow_trace(T, recipe_id, argv_id, cfg.id, consumed key names,
                         direct leaves, dep edges with their config-ids, output)
     write_deep_trace   (T, cfg.id, argv_id, transitive source/glob closure, output)
     prepend both to T's target record (dedup, truncate to cap)
     install out in build/cache/<output>/
     return done({output, path}, leafset)

Writing both trace kinds on every real build is what lets a later request take whichever path fits the change it sees.

Parallelism

The coordinator owns all parallelism: it is the single process that launches and manages every recipe, so the overall job limit is configured in it (the jobs semaphore). Independent needs resolve concurrently up to that limit; the targets futures guarantee a shared (T, config-id) is built once even when many parents request it at the same instant. Scheduling is at recipe granularity — a recipe may parallelize internally, but the coordinator counts it as one job. Cycle detection composes with concurrency through the per-path build chain described above; the global memo handles dedup, the chain handles cycles.

Recipe protocol

Recipes communicate back over a transport the driver defines and the library abstracts behind a host vtable — the same "host supplies all side effects" principle the rest of kit follows. The library defines the command set and the length-prefixed request/response framing; the host supplies connect/read/write.

typedef struct KitBuildTransport {
  /* Accept one recipe connection (the recipe dialed the endpoint named in
   * $KIT_BUILD_SOCK); read/write length-prefixed frames; close. */
  int (*accept)(void* user, KitBuildConn** out);
  int (*read_frame)(void* user, KitBuildConn*, uint8_t* buf, size_t cap, size_t* n);
  int (*write_frame)(void* user, KitBuildConn*, const uint8_t* buf, size_t n);
  void (*close)(void* user, KitBuildConn*);
  void* user;
} KitBuildTransport;

Defaults and portability: a unix-domain socket (Linux, macOS, FreeBSD, and Windows 10+, which all support AF_UNIX) named in $KIT_BUILD_SOCK, or an anonymous pipe pair on inherited fds where a socket is undesirable; Windows may instead use a named pipe. An in-process transport (a direct callback table) serves recipes that are kit library calls rather than subprocesses. The command set is identical across all of them.

Every request both returns a value and logs a dependency. The logged dependency is what lands in the shallow trace and (after expansion) the deep trace. The contract that makes caching correct: an input the recipe reads but does not request is invisible to the cache — so every input must flow through a command.

Command Returns Dependency logged
config-get <key> the propagated value (or unset) config dep: key recorded as consumed; its value is the one in the effective config-id map (absent = unset)
source <path> blob-id + a path to read source dep: (path, blob-id)
glob <pattern> sorted list of matching paths glob dep: (pattern, glob-result-hash)
need <target> [k=v…] each dep's output tree-id + a path target dep edge: (dep, dep-config-id, output-tree-id)

Remote CAS and shared traces

Two independent network capabilities, split along the DISTRIBUTE.md line — content is trustless (hash-verified); claims are trusted (signed).

Remote object fetch (trustless, via a fetch recipe)

The coordinator performs no network I/O itself. When materialize (or a blob restore) needs an object that is absent locally and a remote is configured, it invokes a user-provided fetch recipe — an executable or command template, e.g. curl -fsS -o "$KIT_FETCH_OUT" "https://cache.example/{kind}/{pp}/{id}", rendered with {kind}blob/tree, {pp}, {id} exactly like DISTRIBUTE.md's external-fetch templates. The fetched bytes land in build/tmp/ and are verified against the requested content id before being installed into the local CAS; a corrupt or malicious mirror fails the hash check and is discarded. So the remote and the fetch recipe are untrusted — the existing self-verifying CAS makes that safe. This adds one rung to the materialization ladder (tree cache → local CAS → remote fetch → run recipe) and lets a clean checkout download an output instead of rebuilding it.

Shared traces (trusted, as signed packages)

A trace is a claim: "these inputs ⇒ this output tree." Unlike content, it is not self-verifying — confirming it means re-running the (assumed-deterministic) recipe, which is exactly the work we are trying to avoid. To safely import someone else's trace we must trust the claimant. Traces are therefore shared as signed trace bundles, reusing the DISTRIBUTE.md package + minisign + trust machinery wholesale:

The result is a precise security split: a local build can now deep/shallow-hit on a remote builder's trace — obtaining the output without running the recipe — while trusting only the signed claim; the output bytes themselves remain trustless (verified by tree-id/blob-id). Trusting a trace signer is exactly like trusting a package signer in DISTRIBUTE.md: it is trust in their build outputs, gated by the trusted-keys allowlist, and auditable by re-running under verify mode.

Determinism and hermeticity

Cache correctness is exactly the assumption output = f(declared inputs), deterministically.

Worked example

//app:server depends on //lib:core; the recipe globs src/*.c, reads src/main.c, and consumes config opt. The request is under config C0 (so config-id = c0).

  1. Cold build. No record. Phase 3 runs both recipes. //lib:core builds → tree L0. //app:server logs config opt, glob src/*.c, source src/main.c, need //lib:core (under c0(//lib:core, c0)); yields tree A0. Both trace kinds written for both targets. build/cache/A0/ materialized, path returned.
  2. No-op rebuild (same C0). Phase 1: //app:server's deep trace has root-config c0 (matches) and refreshes opt via config-id, src/*.c, src/main.c, and the folded lib/core.c — all memoized, all match. A0 is in the tree cache. Returns instantly, nothing re-run, no graph walk.
  3. Comment-only edit to lib/core.c. lib/core.c's blob-id moved → the deep trace fails Phase 1 (a folded source leaf changed). Phase 2: direct leaves (opt, src/*.c, src/main.c) unchanged, so probe the one dep — resolve(//lib:core, c0). Its recipe re-runs (its source moved) but emits a byte-identical L0 (comment stripped). L0 == L0//app:server's recipe is skipped, A0 restored, a refreshed deep trace written so step 2's fast path returns next time.
  4. Flip opt (now C1, config-id = c1). Phase 1 is skipped outright: //app:server's deep trace has root-config c0 ≠ c1. Phase 2 under c1: opt is a consumed key whose value differs between c1 and the trace's built map → no shallow trace holds → Phase 3 re-runs //app:server. Its need //lib:core carries no opt override, so the dep resolves as (//lib:core, c1); but //lib:core never consumes opt, so its shallow trace under c1 matches by consumed-keys (empty) and direct leaves, and L0 is reused without re-running. Only the one recipe that actually depends on opt re-ran.

Decided / deferred / open

Decided.

Deferred.

Open — compact deep traces. The flat deep leafset re-lists every transitive source in every ancestor: O(Σ closure sizes) bytes and O(closure) work to refresh. A structural representation fixes both and is the natural form of the "efficient set union" the closure already needs: