Content-addressed build coordinator
Status: design, not yet built. This is a forward-looking roadmap for a new subsystem layered on the content store in ../DISTRIBUTE.md (
<kit/cas.h>). It specifies the configuration model, the on-disk storage state machine, the coordinator's in-memory state and caching algorithm, the command protocol recipes use to talk back, and how builds share work over a network (remote CAS + signed traces).Naming, to avoid confusion: this is not ../BUILD.md, which documents how kit itself is built (the Makefile,
KIT_*_ENABLEDcomponent gating, and the 3-stage bootstrap). This doc designs a user-facing, Bazel/Nix-style incremental build engine that compiles arbitrary projects by running recipes and caching their outputs by content. The two are unrelated beyond sharing the word "build."
What it is
A build coordinator is a long-lived process that turns a build request
(target T under configuration C) into a materialized output tree on disk,
doing the least work necessary. It does this by running recipes — opaque
executables, typically shell scripts — that produce output directories, and by
caching every recipe result keyed by the exact set of inputs that produced it. A
second request whose inputs are unchanged returns the cached output without
re-running anything; a request whose direct dependencies' outputs are unchanged
(even if some deep source changed) can still skip its own recipe.
The whole design rests on two properties:
- Content identity. Every input and every output has a stable id: the
BLAKE2b-256 of its bytes (
kit_blob_info,kit_cas_*). Sameness is hash equality, never timestamps. - Determinism. A recipe is assumed to be a pure function of its declared inputs: identical declared inputs ⇒ byte-identical output tree. Caching is only as correct as this assumption (see Determinism).
It reuses the existing CAS wholesale: source-file bytes, output directories, config snapshots, and trace bodies are all CAS blobs and trees. The coordinator adds two things the content store does not have — a mutable per-target trace index, and the resolution algorithm that drives it.
Concepts and vocabulary
| Term | Meaning |
|---|---|
Target T |
A named unit of work, e.g. //app:server. A build definition maps each target name to one recipe + local argv. |
| Recipe | The executable the coordinator runs to produce a target's output. Identified by recipe-id = BLAKE2b(recipe file bytes). |
| Configuration | A key→value map of tunables (target triple, opt level, feature flags). Two scopes — propagated and local — see below. Identified by config-id = BLAKE2b(canonical map). |
| Output tree | The directory a recipe produces, captured as a CAS tree (tree-id). The unit a build request returns. |
| Base inputs | The leaves of the dependency graph, dynamically requested by recipes: config values, source files, and globs. |
| Target dep | A dependency of one target on the output tree of another. The graph's interior edges, created by need. |
| Shallow trace | A record of one build: its direct base inputs + its direct target deps (by output tree-id) → its output tree-id. |
| Deep trace | A record of one build: the transitive closure of source/glob leaves + the root config-id → its output tree-id. |
| Tree cache | On-disk materialized output directories, keyed by tree-id, ready to hand back as a filesystem path. |
The three base-input kinds are exactly the dynamically-requested leaves: a recipe asks the coordinator for a config value, the bytes of a source file, or the expansion of a glob, and each request is logged as a dependency so a later rebuild knows precisely what to re-check.
Configuration model
Configuration has two scopes, distinguished by visibility:
Propagated configuration flows down the whole subtree. The top-level request supplies it; a recipe reads a value with
config-get <key>(logged as a config dep); and aneedmay overlay it for the sub-build it triggers. A target's effective propagated config = its inherited config with overlays applied along theneedpath from the root. It is canonicalized to byte-stable text and content-addressed:config-id = BLAKE2b(canonical map), stored as a CAS blob so any recordedconfig-idresolves back to the actual map.Local configuration is the target's argv, fixed by the build definition. It is visible only to that one target's recipe (delivered as the process's actual argv) and is not propagated to deps. Treating argv as config keeps the model uniform: everything a recipe sees is configuration, some inherited, some local. Local config is recorded per trace but is not part of
config-id. It too is serialized to a canonical CAS blob,argv-id.
Serialized for replay, not just hashed. Both the effective config map and the
argv vector are persisted by value as canonical, content-addressed CAS blobs
(config-id, argv-id) — never reduced to an opaque one-way hash. A trace
references them by id, and they are part of its reachable closure (GC-rooted, and
bundled with shared traces). This is
what lets the shallow path replay every need request exactly — reconstruct
the dep's target name, its full effective config with overlays applied, and its
argv — even when driving from an imported shallow trace with no access to the
original build definition or workspace. Source files and globs are the opposite
case: kept by hash only, because they are verified (did this still hash the
same?), never replayed, and their bytes already live in the CAS.
The resolution identity of a build is therefore (target-name, config-id),
where config-id is the effective propagated config. Local argv is determined
by the target name (via the build definition), so it does not widen the key. Two
requests (T, C1) and (T, C2) — the same target under different propagated
config — are genuinely distinct builds; both may be in flight at once.
recipe-id covers only the recipe file bytes. Two other things that can change
a target's behavior — which recipe it maps to, and its argv — live in the
build definition, which the coordinator reads to resolve T. That read is
itself a tracked source dependency of T, so editing a target's stanza
(repointing its recipe, or changing argv) invalidates its traces through the
ordinary source-leaf mechanism.
Why two trace kinds
The deep trace answers "is this exact configuration unchanged and has any source moved?" in one pass over a flat leaf set — no graph walk. When it matches, the output is determined and we skip straight to materialization. It is the inner dev-loop fast path: edit code, rebuild under the same config.
The shallow trace handles everything the deep path defers — a changed source that
might not actually move a dep's output, and any configuration change. It
rebuilds the direct deps (cheaply, via their own traces, threading the new
config down through need), compares their resulting tree-ids against what it
recorded, and if they match, the recipe is still skipped. This is the payoff of
recording deps by output identity rather than by input identity: a subtree that
ignores a changed config key, or absorbs a comment-only edit, produces the same
output tree and short-circuits its parent's recipe.
On-disk storage
The build store sits beside (or contains) a CAS. Everything content-addressed is immutable and self-verifying; exactly one class of object is mutable.
<store>/
cas/ # the shared content store, per DISTRIBUTE.md
blob/<pp>/<blob-id> # raw file bytes (sources, config maps, argv vectors, trace bodies)
tree/<pp>/<tree-id> # canonical directory manifests (output trees)
... # (chunk/, index/ as in DISTRIBUTE.md)
build/
trace/<pp>/<trace-id> # IMMUTABLE canonical trace bodies (deep + shallow)
target/<pp>/<target-key> # MUTABLE per-target trace set (the only mutable state)
cache/<pp>/<tree-id>/ # tree cache: materialized output directories
tmp/ # staging for atomic writes + recipe sandboxes
<pp> is the first two lowercase hex chars of the relevant id (the DISTRIBUTE.md
convention). trace-id = BLAKE2b of the canonical trace body.
target-key = BLAKE2b("kit build target v1" ‖ target-name) — note it depends
on the target name only, never on config or recipe-id, so a target's index
entry is a stable handle while its inputs churn. Config maps and trace bodies are
ordinary CAS blobs; only target/ records and the cache/ materializations are
build-specific.
Trace bodies (immutable, content-addressed)
Both kinds are strict, byte-stable, INI-style text in the family of
kit-tree 1 / kit-package 3: a version line, ordered scalar fields, then
sorted sections. Unknown keys/sections, duplicate rows, and non-canonical
ordering are parse errors (a parse error is treated as absent, never as a
match — fail safe). All ids are lowercase hex. Config values, source bytes, and
glob expansions are stored by hash, never inline, so a body's size is
independent of its inputs' sizes.
Shallow trace — direct inputs and direct deps, for fine-grained rechecking:
kit-build-shallow 1
target //app:server
recipe <recipe-id>
output <output-tree-id>
config <config-id> ; this target's effective propagated config (full map blob)
argv <argv-id> ; this target's local config (serialized vector blob)
[config] ; CONSUMED propagated key NAMES, sorted (values live in the map)
opt
[source] ; sorted by path
src/main.c <blob-id>
build.kit <blob-id> ; the build definition stanza is a tracked source
[glob] ; sorted by pattern
src/*.c <glob-result-hash>
[dep] ; sorted by (dep-name, dep-config-id) — each row replays one `need`
//lib:core <dep-config-id> <dep-output-tree-id>
Deep trace — root config-id plus the transitive source/glob closure, no
[dep] and no per-key [config] section:
kit-build-deep 1
target //app:server
recipe <recipe-id>
output <output-tree-id>
root-config <config-id> ; this target's effective propagated config (full map blob)
argv <argv-id> ; local config (serialized vector blob)
[source] ; union over the whole closure, sorted, deduped
src/main.c <blob-id>
lib/core.c <blob-id> ; pulled in transitively from //lib:core
[glob]
src/*.c <glob-result-hash>
config <config-id>andargv <argv-id>reference the serialized effective config map and argv vector — canonical CAS blobs, kept by value and bundled with shared traces — so the target's invocation and everyneedit issues replay exactly, not merely check (see Configuration model).[config]lists the names of consumed propagated keys (consumed whether set or unset); the values live in theconfig-idmap this trace references. A key absent from that map was consumed while unset, so later adding it busts the trace — no sentinel value needed. Matching compares, for each consumed key, the request's value against the built map's value (absent == unset).<blob-id>= the source file's CAS blob id (kit_blob_info); source and glob leaves are kept by hash because they are verified, never replayed.<glob-result-hash>=BLAKE2bof the canonical(path, blob-id)listing the pattern matched (sorted by path). It changes if any matched file is added, removed, or edited — so one glob row covers the existence and content of its whole match set, and files read through a glob need no separate[source]rows.[dep]rows carry the dep's config-id (and outputtree-id); together with the dep's own serialized config/argv this re-resolves each dep under the exact propagated config it used, overlays included.
Why the deep trace needs only root-config + source/glob leaves. If the
request's config-id equals root-config and every source/glob leaf still
matches the live workspace, then every recipe in the closure has byte-identical
inputs — so every need overlay it computes is identical, every effective config
downstream is identical, and by determinism the output is identical. Any input
that could perturb a downstream overlay (a source a recipe branches on, a config
value) is itself either a recorded source/glob leaf or folded into root-config,
so nothing escapes the check. Using the whole root-config (rather than just
consumed keys) is a deliberate, conservative simplification: an unconsumed
config change busts the deep fast path, but the shallow path
recovers — it re-runs only the recipes whose output actually changed, so unrelated
subtrees re-verify by hash without re-running.
Target record (mutable, the only mutable object)
build/target/<pp>/<target-key> lists this target's candidate traces,
newest-first, capped:
kit-build-record 1
target //app:server
deep <trace-id>
deep <trace-id>
shallow <trace-id>
shallow <trace-id>
Each row points at a trace body in build/trace/. On every successful resolution
the coordinator prepends the fresh trace, de-duplicates, and truncates each kind
to a small cap (KIT_BUILD_RECORD_CAP, e.g. 8) — a bounded MRU window. Older
traces age out and become GC-eligible. Multiple candidates exist because the same
target may have been built under several distinct input/config combinations
recently (e.g. two config values flipped back and forth).
Storage state machine (atomicity and crash safety)
The store must never hand back a wrong output after a crash. The rules:
Content objects (
cas/blob,cas/tree,build/trace, config maps, and a materializedbuild/cache/<tree-id>/) are written intobuild/tmp/, fsync'd, then atomically renamed to their final content-keyed path. A half-written object only ever exists undertmp/and is orphaned, never under its content key. Re-deriving the same content re-creates the identical path — writes are idempotent, so a racing or retried producer is harmless.The target record is the only ordering-sensitive state, updated by read-modify-write into
tmp/then atomic rename. A reader always sees a complete prior or complete next version, never a torn one. A crash loses at most the most recent record update; losing a record entry is safe — the next build re-derives and rewrites it. A corrupt record (failedkit-build-record 1parse) is treated as empty.Recipe sandboxes live in
build/tmp/run-<n>/. On success the output subdir is ingested into the CAS (kit_cas_add_tree_from_dir) and the sandbox is removed; on failure or crash it is orphaned and swept later.
Invariant. Anything reachable under a content key (blob/, tree/,
trace/) is complete and matches its key. A target record may dangle — point at
a trace that GC removed — and readers tolerate that by treating the missing
candidate as absent. The store degrades toward "rebuild," never toward "wrong
answer."
Concurrency. Two coordinators sharing a store are safe for content writes (idempotent atomic renames). Target-record updates are last-writer-wins; both writers wrote valid records, so the only loss is MRU ordering, costing at most a rebuild. An optional per-target-key advisory lock around the read-modify-write removes even that.
Build coordinator state and caching
In-memory state
All state hangs off one KitBuildCoordinator context (no globals — the project
rule). For the life of the process it memoizes every base-input probe and every
target resolution, so a diamond in the graph is built once and a source file is
hashed once:
| Field | Contents | Notes |
|---|---|---|
source_hashes |
path → blob-id (+ stat). |
Per-process memo. |
glob_results |
pattern → (sorted paths, glob-result-hash). |
Per-process memo. |
config_maps |
config-id → map, and overlay results. |
Immutable; content-addressed. |
targets |
(target-name, config-id) → future of {output-tree-id, deep-leafset, path}. |
Memo and in-flight dedup. |
cas |
Open KitCas* handle over cas/. |
Plus optional remote (below). |
store |
Paths + target-record reader/writer over build/. |
— |
jobs |
Semaphore bounding concurrent recipe processes. | The global parallelism limit. |
"Cached over the life of the coordinator process" means exactly these memo tables: the coordinator assumes the workspace does not change under it mid-build.
The targets table is keyed by the full (target-name, config-id) and holds a
future, not a flag — so concurrent needs of the same (T, config-id) await
one resolution rather than racing (see Parallelism). A resolved
build yields the output tree-id (to satisfy a need), a materialized
path, and its deep leafset (the transitive source/glob closure, to fold
into a parent's deep trace).
Cycle detection
The targets memo alone cannot catch a cycle — a self-dependent target would
simply await its own unresolved future and deadlock. So each resolution carries an
explicit build chain: the ordered list of (target-name, config-id) frames
from the root request down to here. A need whose (dep, config-id) already
appears on the chain is a dependency cycle; the chain is the error message
(//a → //b → //a). The chain is per-path and distinct from the cross-path memo:
the same (T, config-id) may appear in many chains (a shared dep) but never twice
in one.
Resolution algorithm
A request is resolve(T, cfg, chain) where cfg is the effective propagated
config (with cfg.id its config-id). The config-dependence is recorded in and
verified against the traces; the deep path checks whole-config-id equality, the
shallow path checks consumed keys.
resolve(T, cfg, chain):
key = (T, cfg.id)
if key in chain: error cycle(chain + key)
if key in targets: return await targets[key] # cross-path memo / in-flight dedup
targets[key] = new future ; chain' = chain + key
# ---- Phase 1: deep fast path (same config, did sources move?) -------
for D in deep_traces(T), newest-first:
if D.root_config == cfg.id
and D.argv == argv_id(T)
and refresh(D.source_glob_leaves) all match live: # memoized rehash / reglob
return done(materialize(D.output, T, cfg), D.leafset)
# ---- Phase 2: shallow path (config and/or sources moved) ------------
for S in shallow_traces(T), newest-first:
if S.recipe != recipe_id(T): continue
if S.argv != argv_id(T): continue # local config
M = config_by_id(S.config) # the built map (serialized blob)
if any consumed key in S.config-keys differs between cfg and M: continue
if any direct source/glob leaf of S changed: continue
ok = true ; child_leafsets = []
for (dep, dep_cfg_id, recorded_tree) in S.deps: # may run in parallel
r = resolve(dep, config_by_id(dep_cfg_id), chain') # replay the need: map recovered by id
if r.output != recorded_tree: ok = false; break
child_leafsets.push(r.leafset)
if ok:
leafset = union(direct source/glob leaves of S, child_leafsets)
write_deep_trace(T, cfg.id, leafset, S.output) # refresh the deep trace
return done(materialize(S.output, T, cfg), leafset)
# ---- Phase 3: run the recipe ----------------------------------------
return run_recipe(T, cfg, chain')
with the materialization ladder shared by every cache hit:
materialize(tree-id, T, cfg):
if tree-id in build/cache/ : return that path # already on disk
if tree-id in cas/ : restore into build/cache/, return path
if remote configured : fetch tree-id (+ blobs) into cas/, verify, restore
else : return run_recipe(T, cfg) # bytes gone: rebuild
This is the brief's flow, sharpened by config:
- Phase 1 is the deep fast path: identical config-id and an all-clear source/glob
refresh ⇒ reuse.
refreshhits the per-process memo, so after the first target touches a file or glob the rest of the closure is free. - Phase 2 is the shallow path, taken when Phase 1 finds nothing (config changed,
or a source moved). It guards on the target's own consumed config and direct
leaves, then re-resolves each recorded dep under the dep's recorded config-id
and compares outputs. Re-using the recorded
dep_cfg_idis correct: if the parent's consumed config were different, the consumed-key guard above would have failed first and the parent would re-run, recomputing itsneedoverlays. - A Phase-2 hit also writes a fresh deep trace from the now-known closure, so the next request gets the Phase-1 fast path back.
- Phase 3 runs the recipe only when no trace holds, or when a hit's bytes were evicted and no remote can supply them.
The non-obvious win is Phase 2 succeeding after an input changed. A changed
source busts the deep trace (Phase 1 gives up), but if that source only feeds a dep
whose recipe maps it to an unchanged output (tree-id identical), the dep compare
passes and T's recipe is skipped. The deep trace is the cheap "nothing moved"
check; the shallow trace is the "did the churn actually reach me?" check.
Running a recipe
run_recipe(T, cfg, chain):
acquire jobs # global parallelism limit
sandbox = build/tmp/run-<n>/ ; out = sandbox/out/
spawn recipe (argv = argv(T), recovered by argv_id) with env:
{ KIT_BUILD_SOCK, KIT_BUILD_OUT=out, KIT_BUILD_TARGET=T, workspace root }
service the recipe's protocol requests, logging each as a dep (next section);
config-get reads cfg; need recurses resolve(...) under cfg ⊕ overrides
release jobs
on nonzero exit: propagate failure, write NO trace
on success:
output = kit_cas_add_tree_from_dir(out)
put serialized cfg map (cfg.id) and argv vector (argv_id(T)) into the CAS
write_shallow_trace(T, recipe_id, argv_id, cfg.id, consumed key names,
direct leaves, dep edges with their config-ids, output)
write_deep_trace (T, cfg.id, argv_id, transitive source/glob closure, output)
prepend both to T's target record (dedup, truncate to cap)
install out in build/cache/<output>/
return done({output, path}, leafset)
Writing both trace kinds on every real build is what lets a later request take whichever path fits the change it sees.
Parallelism
The coordinator owns all parallelism: it is the single process that launches and
manages every recipe, so the overall job limit is configured in it (the jobs
semaphore). Independent needs resolve concurrently up to that limit; the
targets futures guarantee a shared (T, config-id) is built once even when many
parents request it at the same instant. Scheduling is at recipe granularity — a
recipe may parallelize internally, but the coordinator counts it as one job.
Cycle detection composes with concurrency through the per-path build chain
described above; the global memo handles dedup, the chain handles cycles.
Recipe protocol
Recipes communicate back over a transport the driver defines and the library abstracts behind a host vtable — the same "host supplies all side effects" principle the rest of kit follows. The library defines the command set and the length-prefixed request/response framing; the host supplies connect/read/write.
typedef struct KitBuildTransport {
/* Accept one recipe connection (the recipe dialed the endpoint named in
* $KIT_BUILD_SOCK); read/write length-prefixed frames; close. */
int (*accept)(void* user, KitBuildConn** out);
int (*read_frame)(void* user, KitBuildConn*, uint8_t* buf, size_t cap, size_t* n);
int (*write_frame)(void* user, KitBuildConn*, const uint8_t* buf, size_t n);
void (*close)(void* user, KitBuildConn*);
void* user;
} KitBuildTransport;
Defaults and portability: a unix-domain socket (Linux, macOS, FreeBSD, and
Windows 10+, which all support AF_UNIX) named in $KIT_BUILD_SOCK, or an
anonymous pipe pair on inherited fds where a socket is undesirable; Windows
may instead use a named pipe. An in-process transport (a direct callback
table) serves recipes that are kit library calls rather than subprocesses. The
command set is identical across all of them.
Every request both returns a value and logs a dependency. The logged dependency is what lands in the shallow trace and (after expansion) the deep trace. The contract that makes caching correct: an input the recipe reads but does not request is invisible to the cache — so every input must flow through a command.
| Command | Returns | Dependency logged |
|---|---|---|
config-get <key> |
the propagated value (or unset) | config dep: key recorded as consumed; its value is the one in the effective config-id map (absent = unset) |
source <path> |
blob-id + a path to read |
source dep: (path, blob-id) |
glob <pattern> |
sorted list of matching paths | glob dep: (pattern, glob-result-hash) |
need <target> [k=v…] |
each dep's output tree-id + a path |
target dep edge: (dep, dep-config-id, output-tree-id) |
config-getreads the target's effective propagated config and records the value's hash, so changing a consumed key invalidates while changing an unconsumed key does not. If a changed consumed key makes the recipe branch and request a new key next time, the old trace already fails to match on the changed key, so the new key-set is discovered on the rebuild — self-correcting, never needing to predict the input set ahead of time. Local config (argv) is not read here; it arrives as the process's argv.sourcehands back a path inside the workspace (or, in a hermetic variant, a path staged into the sandbox) plus the blob-id the read is pinned to.globrecords the whole match set's content viaglob-result-hash; the recipe then reads the returned paths without further declaration.needis the dynamic-dependency primitive. The optionalk=vpairs overlay propagated config for that sub-build (local argv never propagates). The coordinator resolves(dep, cfg ⊕ overrides)recursively, returns its outputtree-idand a readable path, and records the edge with the dep'sconfig-idso the shallow path can re-resolve identically.needis also where cycles are caught.- Outputs need no command: the recipe writes under
$KIT_BUILD_OUTand the coordinator snapshots that directory on exit.
Remote CAS and shared traces
Two independent network capabilities, split along the DISTRIBUTE.md line — content is trustless (hash-verified); claims are trusted (signed).
Remote object fetch (trustless, via a fetch recipe)
The coordinator performs no network I/O itself. When materialize (or a blob
restore) needs an object that is absent locally and a remote is configured, it
invokes a user-provided fetch recipe — an executable or command template, e.g.
curl -fsS -o "$KIT_FETCH_OUT" "https://cache.example/{kind}/{pp}/{id}", rendered
with {kind} ∈ blob/tree, {pp}, {id} exactly like DISTRIBUTE.md's
external-fetch templates. The fetched bytes land in build/tmp/ and are
verified against the requested content id before being installed into the
local CAS; a corrupt or malicious mirror fails the hash check and is discarded. So
the remote and the fetch recipe are untrusted — the existing self-verifying CAS
makes that safe. This adds one rung to the materialization ladder (tree cache →
local CAS → remote fetch → run recipe) and lets a clean checkout download an
output instead of rebuilding it.
Shared traces (trusted, as signed packages)
A trace is a claim: "these inputs ⇒ this output tree." Unlike content, it is not self-verifying — confirming it means re-running the (assumed-deterministic) recipe, which is exactly the work we are trying to avoid. To safely import someone else's trace we must trust the claimant. Traces are therefore shared as signed trace bundles, reusing the DISTRIBUTE.md package + minisign + trust machinery wholesale:
- A bundle is a signed manifest (
kit-build-traces 1, signed exactly like akit-packagemanifest) listing(target-name, kind, trace-id, output-tree-id)claims, carried in a.kpkg(portable tar.gz or native kpkg) alongside the trace bodies and the serialized config-map and argv blobs they reference (required — without them an imported shallow trace cannot replay itsneeds), plus, optionally, the referenced output trees/blobs. - Trust is the DISTRIBUTE.md model unchanged: verify the minisign signature
against the trusted-keys file (
-p KEY, the anchor file, or--tofu); the signed trusted comment binds the signature to the manifest hash. - Import: verify signature → anchor key → install the trace bodies into
build/trace/and prepend them to the relevant target records. Output bytes come from the bundle or from the remote CAS — either way hash-verified on use.
The result is a precise security split: a local build can now deep/shallow-hit
on a remote builder's trace — obtaining the output without running the recipe —
while trusting only the signed claim; the output bytes themselves remain
trustless (verified by tree-id/blob-id). Trusting a trace signer is exactly
like trusting a package signer in DISTRIBUTE.md: it is trust in their build
outputs, gated by the trusted-keys allowlist, and auditable by re-running under
verify mode.
Determinism and hermeticity
Cache correctness is exactly the assumption output = f(declared inputs), deterministically.
- Declared-inputs-are-complete. Reading an undeclared file is a hermeticity violation. The protocol makes declaration the only intended way to get an input, but enforcement is deferred (see below): for now it is a contract recipes must honor.
- Determinism. Timestamps, RNG, and unpinned network fetches break the model — a second build's captured tree differs from the recorded output, and the cache would serve a stale-but-believed-current result. Mitigations: declare such inputs (a "now" config value, a pinned URL+hash), or mark a target no-cache so it always runs Phase 3.
- Verify mode. A diagnostic mode re-runs a recipe whose trace says "unchanged"
and compares the fresh
tree-idto the recorded one; a mismatch flags a nondeterministic or under-declared recipe — the recommended audit before trusting (or signing and sharing) traces.
Worked example
//app:server depends on //lib:core; the recipe globs src/*.c, reads
src/main.c, and consumes config opt. The request is under config C0 (so
config-id = c0).
- Cold build. No record. Phase 3 runs both recipes.
//lib:corebuilds →tree L0.//app:serverlogsconfig opt,glob src/*.c,source src/main.c,need //lib:core(underc0→(//lib:core, c0)); yieldstree A0. Both trace kinds written for both targets.build/cache/A0/materialized, path returned. - No-op rebuild (same
C0). Phase 1://app:server's deep trace hasroot-config c0(matches) and refreshesoptvia config-id,src/*.c,src/main.c, and the foldedlib/core.c— all memoized, all match.A0is in the tree cache. Returns instantly, nothing re-run, no graph walk. - Comment-only edit to
lib/core.c.lib/core.c's blob-id moved → the deep trace fails Phase 1 (a folded source leaf changed). Phase 2: direct leaves (opt,src/*.c,src/main.c) unchanged, so probe the one dep —resolve(//lib:core, c0). Its recipe re-runs (its source moved) but emits a byte-identicalL0(comment stripped).L0 == L0⇒//app:server's recipe is skipped,A0restored, a refreshed deep trace written so step 2's fast path returns next time. - Flip
opt(nowC1,config-id = c1). Phase 1 is skipped outright://app:server's deep trace hasroot-config c0 ≠ c1. Phase 2 underc1:optis a consumed key whose value differs betweenc1and the trace's built map → no shallow trace holds → Phase 3 re-runs//app:server. Itsneed //lib:corecarries nooptoverride, so the dep resolves as(//lib:core, c1); but//lib:corenever consumesopt, so its shallow trace underc1matches by consumed-keys (empty) and direct leaves, andL0is reused without re-running. Only the one recipe that actually depends onoptre-ran.
Decided / deferred / open
Decided.
- Transport is abstracted behind a host vtable the driver defines; defaults are a unix socket or anonymous pipe, with a Windows named-pipe fallback and an in-process variant. Supported hosts: Linux, macOS, Windows, FreeBSD.
- Parallelism lives in the coordinator (the
jobssemaphore); futures dedup shared targets; the per-path chain detects cycles. - Remote CAS delegates fetching to a user-provided fetch recipe and verifies by content id (trustless).
- Trace sharing ships traces as signed
.kpkgbundles over the DISTRIBUTE.md trust model (trusted), because traces are claims, not self-verifying content. - Public surface is a public header
<kit/build.h>plus akit builddriver command, gated byKIT_BUILD_ENABLED(thekit/cas.h+kit casprecedent). A thin CLI parses flags and supplies the transport/host vtables; the coordinator and trace model live in the library.
Deferred.
- Hermeticity enforcement. No sandbox-deny or FS-trace at first; the protocol
defines the contract, enforcement comes later (deny undeclared reads, or
strace/FUSE detection that warns). - GC. Deferred. When built: mark-and-sweep rooted at live target records →
reachable trace bodies → referenced trees/blobs/config-maps/argv vectors; sweep
unreachable
trace/,cas/, andcache/, with LRU size bounds on the (re-derivable) tree cache and CAS.
Open — compact deep traces. The flat deep leafset re-lists every transitive source in every ancestor: O(Σ closure sizes) bytes and O(closure) work to refresh. A structural representation fixes both and is the natural form of the "efficient set union" the closure already needs:
- Store each target's deep leafset as a content-addressed object
deep-set = { direct source/glob leaves, [child-deep-set-id…] }, withdeep-set-id = BLAKE2b(canonical(direct ‖ child-ids)). The fully-flattened set is the recursive union, but it is never materialized — an ancestor stores only its own direct slice plus 32-byte pointers to its children's deep-sets. A leaf shared by many ancestors is stored once (a Merkle DAG, structurally sharing like the CAS itself). This is the union operation: building a parent's deep-set is "concatenate sorted direct leaves with child-set pointers," not "merge N flat arrays." - Faster checking falls out of the same structure. Refresh memoizes
deep-set-id → valid?for the whole build: a subtree checked once is reused by every parent that points at it, and an unchanged subtree is recognized by pointer/hash equality at its root without descending to leaves. Change detection becomes O(changed frontier) instead of O(closure): recompute current deep-set-ids bottom-up (memoized, so O(distinct subtrees) = O(nodes)), and where a recomputed id equals the recorded one, the whole subtree is provably unchanged and skipped. - The cost is that refresh walks the DAG rather than streaming one flat list, and the child deep-set objects must be present (another store dependency, GC-rooted like traces). The flat form in this doc is the semantic definition; the structural form is the recommended physical representation once flat traces bloat — adopt it together with the in-memory union it mirrors.