kit

kit
git clone https://git.ryansepassi.com/git/kit.git
Log | Files | Refs | README

commit 97146e78460c001d48c9dea8ed0f8144b8280537
parent e618f928841783b78b52452854d70240ab4dbd3f
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Tue,  2 Jun 2026 13:50:32 -0700

plan: build system plan v1

Diffstat:
Adoc/plan/BUILD.md | 441+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Mdoc/plan/README.md | 1+
2 files changed, 442 insertions(+), 0 deletions(-)

diff --git a/doc/plan/BUILD.md b/doc/plan/BUILD.md @@ -0,0 +1,441 @@ +# Content-addressed build coordinator + +> **Status: design, not yet built.** This is a forward-looking roadmap for a new +> subsystem layered on the content store in [../DISTRIBUTE.md](../DISTRIBUTE.md) +> (`<cfree/cas.h>`). It specifies the on-disk storage state machine, the +> coordinator's in-memory state and caching algorithm, and the command protocol +> recipes use to talk back to the coordinator. +> +> **Naming, to avoid confusion:** this is *not* [../BUILD.md](../BUILD.md), which +> documents how cfree itself is built (the Makefile, `CFREE_*_ENABLED` component +> gating, and the 3-stage bootstrap). This doc designs a *user-facing*, +> Bazel/Nix-style incremental build engine that compiles arbitrary projects by +> running recipes and caching their outputs by content. The two are unrelated +> beyond sharing the word "build." + +## What it is + +A **build coordinator** is a long-lived process that turns a *build request* +(build target `T` under configuration `C`) into a materialized **output tree** on +disk, doing the least work necessary. It does this by running **recipes** — +opaque executables, typically shell scripts — that produce output directories, +and by caching every recipe result keyed by the exact set of inputs that +produced it. A second request whose inputs are unchanged returns the cached +output without re-running anything; a request whose *direct dependencies'* +outputs are unchanged (even if some deep source changed) can still skip its own +recipe. + +The whole design rests on two properties: + +- **Content identity.** Every input and every output has a stable id: the + BLAKE2b-256 of its bytes (`cfree_blob_info`, `cfree_cas_*`). Sameness is hash + equality, never timestamps. +- **Determinism.** A recipe is assumed to be a pure function of its *declared* + inputs: identical declared inputs ⇒ byte-identical output tree. Caching is only + as correct as this assumption (see [Determinism and hermeticity](#determinism-and-hermeticity)). + +It reuses the existing CAS wholesale: source-file bytes, output directories, and +trace bodies are all CAS blobs and trees. The coordinator adds two things the +content store does not have — a *mutable* per-target trace index, and the +resolution algorithm that drives it. + +## Concepts and vocabulary + +| Term | Meaning | +|------|---------| +| **Configuration** `C` | An immutable key→value map snapshotted for the life of one build. Supplies tunables (target triple, opt level, feature flags) to recipes. | +| **Target** `T` | A named unit of work, e.g. `//app:server`. The build definition maps each target name to exactly one recipe + argv. | +| **Recipe** | The executable the coordinator runs to produce a target's output. Identified by `recipe-id` = hash of (recipe bytes ‖ argv). | +| **Output tree** | The directory a recipe produces, captured as a CAS tree (`tree-id`). The unit a build request returns. | +| **Base inputs** | The *leaves* of the dependency graph, dynamically requested by recipes: **config values**, **source files**, and **globs**. | +| **Target dep** | A dependency of one target on the *output tree* of another target. The graph's interior edges. | +| **Shallow trace** | A record of one build: its direct base inputs + its direct target deps (by `tree-id`) → its output `tree-id`. | +| **Deep trace** | A record of one build: the *transitive* closure of base inputs (target deps recursively expanded to their leaves) → its output `tree-id`. | +| **Tree cache** | On-disk *materialized* output directories, keyed by `tree-id`, ready to hand back as a filesystem path. | + +The three base-input kinds are exactly the dynamically-requested leaves: a recipe +asks the coordinator for a config value, the bytes of a source file, or the +expansion of a glob, and each request is logged as a dependency so a later +rebuild knows precisely what to re-check. A glob's "value" is the sorted set of +matched paths *and* their content — see [glob](#glob-pattern) below. + +### Why two trace kinds + +The deep trace answers *"has anything in my entire world changed?"* in one pass +over a flat input set — no graph walk. When it matches, the output is determined +and we skip straight to materialization. + +The shallow trace answers the harder case the deep trace gives up on: *some deep +source changed, but did it actually change my dependencies' outputs?* A +formatting-only edit to a header, a comment change, a touched-but-identical file +— these perturb base inputs without perturbing the dep output that consumes +them. The shallow path rebuilds the *direct* deps (cheaply, via their own +traces), compares their resulting `tree-id`s against what the trace recorded, and +if they match, the recipe can still be skipped. This is the payoff of recording +deps by output identity rather than by input identity. + +## On-disk storage + +The build store sits beside (or contains) a CAS. Everything content-addressed is +immutable and self-verifying; exactly one class of object is mutable. + +``` +<store>/ + cas/ # the shared content store, per DISTRIBUTE.md + blob/<pp>/<blob-id> # raw file bytes + tree/<pp>/<tree-id> # canonical directory manifests + ... # (chunk/, index/ as in DISTRIBUTE.md) + build/ + trace/<pp>/<trace-id> # IMMUTABLE canonical trace bodies (deep + shallow) + target/<pp>/<target-key> # MUTABLE per-target trace set (the only mutable state) + cache/<pp>/<tree-id>/ # tree cache: materialized output directories + tmp/ # staging for atomic writes + recipe sandboxes +``` + +`<pp>` is the first two lowercase hex chars of the relevant id (the DISTRIBUTE.md +convention). `trace-id` = BLAKE2b-256 of the canonical trace body. +`target-key` = `BLAKE2b("cfree build target v1" ‖ target-name)` — note it depends +on the target *name only*, never on config or recipe-id, so a target's index +entry is stable while its inputs churn. + +### Trace bodies (immutable, content-addressed) + +Both kinds are strict, byte-stable, INI-style text in the family of +`cfree-tree 1` / `cfree-package 3`: a version line, ordered scalar fields, then +sorted sections. Unknown keys/sections, duplicate rows, and non-canonical +ordering are parse errors (a parse error is treated as *absent*, never as a +match — fail safe). All ids are lowercase hex. Config values, source bytes, and +glob expansions are stored *by hash*, never inline, so a body's size is +independent of its inputs' sizes. + +**Shallow trace** — direct inputs and direct deps: + +``` +cfree-build-shallow 1 +target //app:server +recipe <recipe-id> +output <output-tree-id> +[config] ; sorted by key +opt <value-hash> +triple <value-hash> +[source] ; sorted by path +src/main.c <blob-id> +[glob] ; sorted by pattern +src/*.c <glob-result-hash> +[dep] ; sorted by dep target-name +//lib:core <dep-output-tree-id> +//lib:util <dep-output-tree-id> +``` + +**Deep trace** — transitive base-input closure, *no* `[dep]` section (deps are +dissolved into their leaves): + +``` +cfree-build-deep 1 +target //app:server +recipe <recipe-id> +output <output-tree-id> +[config] ; union over the whole closure, sorted, deduped +opt <value-hash> +triple <value-hash> +[source] +src/main.c <blob-id> +lib/core.c <blob-id> ; pulled in transitively from //lib:core +[glob] +src/*.c <glob-result-hash> +``` + +- `<value-hash>` = `BLAKE2b(config-value-bytes)`. The literal value is recoverable + from the config snapshot; the trace stores only the hash so it is uniform and + size-stable. A *requested-but-unset* key is recorded with a reserved + sentinel hash so that later *adding* the key invalidates the trace. +- `<blob-id>` = the source file's CAS blob id (`cfree_blob_info`). +- `<glob-result-hash>` = `BLAKE2b` of the canonical listing of `(path, blob-id)` + pairs the pattern matched (sorted by path). It changes if any matched file is + added, removed, or edited — so a single glob row covers the existence *and* + content of its whole match set, and files read through a glob need no separate + `[source]` rows. +- `recipe-id` is recorded as an ordinary input and verified on lookup, so editing + a recipe (or its argv, or which recipe a target maps to) invalidates traces the + same way a source edit does. + +### Target record (mutable, the only mutable object) + +`build/target/<pp>/<target-key>` lists this target's candidate traces, +newest-first, capped: + +``` +cfree-build-record 1 +target //app:server +deep <trace-id> +deep <trace-id> +shallow <trace-id> +shallow <trace-id> +``` + +Each row points at a trace body in `build/trace/`. On every successful resolution +the coordinator prepends the fresh trace, de-duplicates, and truncates each kind +to a small cap (`CFREE_BUILD_RECORD_CAP`, e.g. 8) — a bounded MRU window. A trace +older than the window ages out and becomes GC-eligible. Multiple candidates exist +because the same target may have been built under several distinct input +combinations recently (e.g. two config values flipped back and forth). + +### Storage state machine (atomicity and crash safety) + +The store must never hand back a wrong output after a crash. The rules: + +1. **Content objects** (`cas/blob`, `cas/tree`, `build/trace`, and a materialized + `build/cache/<tree-id>/`) are written into `build/tmp/`, fsync'd, then + **atomically renamed** to their final content-keyed path. A half-written object + only ever exists under `tmp/` and is orphaned, never under its content key. + Re-deriving the same content re-creates the identical path — writes are + idempotent, so a racing or retried producer is harmless. + +2. **The target record is the only ordering-sensitive state**, and it is updated + by **read-modify-write into `tmp/` then atomic rename**. A reader therefore + always sees a complete prior version or a complete next version, never a torn + one. A crash loses *at most* the most recent record update; losing a record + entry is *safe* — the next build re-derives and rewrites it. A corrupt record + (failed `cfree-build-record 1` parse) is treated as empty. + +3. **Recipe sandboxes** live in `build/tmp/run-<n>/`. On success the output subdir + is ingested into the CAS (`cfree_cas_add_tree_from_dir`) and the sandbox is + removed; on failure or crash it is orphaned and swept later. + +**Invariant.** Anything reachable under a content key (`blob/`, `tree/`, +`trace/`) is complete and matches its key. A target record may *dangle* — point +at a trace that GC removed — and readers tolerate that by treating the missing +candidate as absent. The store degrades toward "rebuild," never toward "wrong +answer." + +**Concurrency.** Two coordinators sharing a store are safe for content writes +(idempotent atomic renames). Target-record updates are last-writer-wins; both +writers wrote *valid* records, so the only loss is MRU ordering, which costs at +most a rebuild. An optional per-target-key advisory lock around the +read-modify-write removes even that. + +**GC** (out of scope to implement now, in scope to design for): mark-and-sweep +rooted at the live target records → reachable trace bodies → referenced +trees/blobs; sweep unreachable `trace/`, `cas/`, and `cache/`. The tree cache and +CAS are independently size-boundable by LRU since both are re-derivable. + +## Build coordinator state and caching + +### In-memory state + +All state hangs off one `BuildCoordinator` context (no globals — the project +rule). For the life of the process it memoizes every base-input probe and every +target resolution, so a diamond in the graph is built once and a source file is +hashed once: + +| Field | Contents | Invalidated | +|-------|----------|-------------| +| `config` | The immutable configuration snapshot (key→value). | Never (per process). | +| `source_hashes` | `path → blob-id` (with stat for sanity). | Never (per process). | +| `glob_results` | `pattern → (sorted paths, glob-result-hash)`. | Never (per process). | +| `targets` | `target-name → resolution state`: `in-progress` \| `done(output-tree-id, deep-input-set, materialized-path)`. | Never (per process). | +| `cas` | Open `CfreeCas*` handle over `cas/`. | — | +| `store` | Paths + the target-record reader/writer over `build/`. | — | + +"Cached over the life of the coordinator process" (the request brief) means +exactly these memo tables: the coordinator assumes the workspace does not change +*under it* mid-build. Each table also doubles as cycle/dedup control — a +`target-name` found `in-progress` when requested again is a dependency cycle and +is an error. + +A resolved target yields two things its parent needs: the **output `tree-id`** +(to satisfy a `need`) and the **deep-input-set** (its deep trace's leaf closure, +to fold into the parent's own deep trace). + +### Resolution algorithm + +A request for `(T, C)` runs three phases. `C` is fixed for the process; the +config-dependence of `T` is *recorded in and verified against* its traces, so the +coordinator keys nothing by config and a change to an *unconsumed* config key +never busts `T`. + +``` +resolve(T): + if T in targets: # memoized / cycle check + return targets[T] or error("cycle") + mark T in-progress + + # ---- Phase 1: deep-trace fast path ---------------------------------- + for each deep trace D of T (newest first): + if refresh(D.base_inputs) all match current state: # rehash sources, + return materialize(D.output) # re-run globs, + # recompare config, + # recheck recipe-id + # ---- Phase 2: shallow-trace path ------------------------------------ + for each shallow trace S of T (newest first): + if any of S.direct_base_inputs changed: continue # this trace can't hold + ok = true + for each (dep, recorded_tree_id) in S.deps: + cur = resolve(dep) # recurse + if cur.output_tree_id != recorded_tree_id: ok = false; break + if ok: + write_deep_trace(T, union of S.direct_base_inputs + and each dep's deep-input-set) # refresh the deep trace + return materialize(S.output) + # ---- Phase 3: run the recipe ---------------------------------------- + return run_recipe(T) +``` + +with the materialization ladder shared by every cache hit: + +``` +materialize(tree-id): + if tree-id in build/cache/ : return that path # already on disk + if tree-id in cas/ : restore into build/cache/, return path + else : return run_recipe(T) # CAS evicted: rebuild +``` + +This is precisely the brief's flow. Reading it back against the request: + +- *"Check if a deep trace is available; if so, refresh all inputs."* — Phase 1. + `refresh` consults the memo tables, so after the first target touches a file or + glob the rest of the closure is free. +- *"If inputs have not changed, check tree cache → CAS restore → else run."* — the + `materialize` ladder. "Run" is the last resort only when the bytes were + GC'd/evicted; we already know *what* the output is, we just no longer *have* it. +- *"If inputs changed, or no deep trace, check shallow trace; if none, rerun."* — + fall-through from Phase 1 into Phase 2, and Phase 2's fall-through into Phase 3. +- *"If shallow trace available, rebuild target deps; if dep output hashes match + the trace and other inputs match, attempt CAS restore; if any differ, rerun."* + — the Phase 2 loop. "Other inputs match" is the `direct_base_inputs` guard; + "dep output hashes match" is the per-dep `tree-id` compare. + +The non-obvious move is **Phase 2 succeeding after a base input changed.** A +changed source busts the *deep* trace (Phase 1 gives up), but if that source only +feeds a dep whose recipe maps it to an unchanged output (`tree-id` identical), +the dep compare in Phase 2 still passes and `T`'s own recipe is skipped. The +deep trace is the cheap "nothing moved" check; the shallow trace is the "did the +churn actually reach me?" check. On a Phase-2 hit we also write a *fresh* deep +trace from the now-known closure, so the next request gets the Phase-1 fast path +back. + +### Running a recipe + +``` +run_recipe(T): + sandbox = build/tmp/run-<n>/ ; out = sandbox/out/ + spawn recipe with env: { CFREE_BUILD_SOCK, CFREE_BUILD_OUT=out, + CFREE_BUILD_TARGET=T, workspace root } + service the recipe's protocol requests, logging each as a dep (next section) + on nonzero exit: propagate failure, write NO trace + on success: + output-tree-id = cfree_cas_add_tree_from_dir(out) + write_shallow_trace(T, direct inputs logged this run, dep tree-ids, + output-tree-id) + write_deep_trace (T, transitive base-input closure, output-tree-id) + prepend both to T's target record (dedup, truncate to cap) + install out in build/cache/<output-tree-id>/ + return { output-tree-id, deep-input-set, path } +``` + +Writing *both* trace kinds on every real build is what lets a later request take +whichever path fits the change it sees. + +## Recipe protocol + +Recipes communicate back to the coordinator over a transport left unspecified +here — a pipe on a well-known fd, a unix socket named in `$CFREE_BUILD_SOCK`, or +an in-process callback table for recipes that are cfree library calls rather than +subprocesses. The command *set* is fixed; the wire is not. + +Every request both **returns a value** and **logs a dependency**. The logged +dependency is what lands in the shallow trace and (after expansion) the deep +trace. The contract that makes caching correct: **an input the recipe reads but +does not request is invisible to the cache** — so every input must flow through a +command. + +| Command | Returns | Dependency logged | +|---------|---------|-------------------| +| `config-get <key>` | the config value (or *unset*) | config dep: `(key, value-hash)`, sentinel hash if unset | +| `source <path>` | `blob-id` + an absolute path to read | source dep: `(path, blob-id)` | +| `glob <pattern>` | sorted list of matching paths | glob dep: `(pattern, glob-result-hash)` | +| `need <target>...` | each dep's output `tree-id` + a materialized path | target dep edge: `(dep, output-tree-id)` | +| `log <text>` | — | none (diagnostic passthrough) | + +- **`config-get`** records the value's hash, so changing a *consumed* key + invalidates while changing an unconsumed key does not. If a changed consumed + key causes the recipe to branch and request a *new* key next time, the old trace + already fails to match on the changed key, so the new key-set is discovered on + the rebuild — the scheme is self-correcting and never needs to predict the + input set ahead of time. +- **`source`** hands back a path inside the workspace (or, in a hermetic variant, + a path staged into the sandbox) plus the blob-id the recipe's read is pinned to. +- **`glob`** records the whole match set's content via `glob-result-hash`; the + recipe then reads the returned paths without further declaration. +- **`need`** is the dynamic-dependency primitive. The coordinator `resolve()`s + each target recursively (Phases 1–3), returns its output `tree-id` and a + readable materialized path, and records the edge by *output identity*. This is + what the shallow path re-checks. `need` is also where cycles are caught. +- **Outputs** need no command: the recipe writes under `$CFREE_BUILD_OUT` and the + coordinator snapshots that directory on exit. (An optional `declare-output + <path>` could assert an expected output for earlier error reporting.) + +## Determinism and hermeticity + +Cache correctness is exactly the assumption *output = f(declared inputs)*, +deterministically. The design's guard rails: + +- **Declared-inputs-are-complete.** Reading an undeclared file is a hermeticity + violation. It cannot be *detected* without sandboxing/FS tracing, but the + protocol makes declaration the only way to get an input, and the brief is to + enforce it later (a sandbox that denies undeclared reads, or an `strace`/FUSE + layer that records them). Until then it is a contract recipes must honor. +- **Determinism.** Timestamps, RNG, network fetches, and `$RANDOM`-style + nondeterminism break the model: the second build's captured tree differs from + the trace's recorded output, and the cache would serve a stale-but-believed- + current result. Mitigations: declare such inputs (a "now" config value, a + pinned URL+hash), or mark a target **no-cache** so it always runs Phase 3. +- **Verify mode.** A diagnostic mode re-runs a recipe whose trace says "unchanged" + and compares the fresh `tree-id` to the recorded one; a mismatch flags a + nondeterministic or under-declared recipe. This is the recommended way to audit + recipes before trusting their traces. + +## Worked example + +`//app:server` depends on `//lib:core`; the recipe globs `src/*.c`, reads +`src/main.c`, and consumes config `opt`. + +1. **Cold build.** No record. Phase 3 runs both recipes. `//lib:core` builds, + yields `tree L0`. `//app:server` runs, logs `config opt`, `glob src/*.c`, + `source src/main.c`, `need //lib:core → L0`; yields `tree A0`. Shallow + deep + traces written for both. `build/cache/A0/` materialized, path returned. +2. **No-op rebuild.** Phase 1 deep trace of `//app:server` refreshes `opt`, + `src/*.c`, `src/main.c`, and the *transitively folded* `lib/core.c` — all + memoized, all match. `A0` is in the tree cache. Returns instantly, nothing + re-run, no graph walk. +3. **Comment-only edit to `lib/core.c`.** `lib/core.c`'s blob-id changed → the + `//app:server` deep trace fails Phase 1 (its folded `[source]` moved). Phase 2: + direct base inputs (`opt`, `src/*.c`, `src/main.c`) are unchanged, so probe the + one dep — `resolve(//lib:core)`. That recurse's *own* recipe re-runs (its + source moved) but emits a byte-identical `tree L0` (comment stripped). The dep + compare `L0 == L0` passes → `//app:server`'s recipe is **skipped**, `A0` + restored, and a refreshed deep trace written so step 2's fast path returns. +4. **Flip `opt`.** `//app:server`'s deep and shallow traces both fail on the + `config opt` row → Phase 3 re-runs its recipe. `//lib:core` (which never + consumed `opt`) is untouched by Phase 1 and returns `L0` from cache. + +## Open questions / future work + +- **Transport.** Pipe vs unix socket vs in-process table; framing and the exact + request/response encoding. Deliberately unspecified above. +- **Hermeticity enforcement.** Sandbox that denies undeclared reads; FS-trace + fallback to *detect* (not prevent) them and warn. +- **Remote CAS.** The `materialize` ladder's "restore from CAS" naturally extends + to a remote/shared cache (the DISTRIBUTE.md external-object fetch hints are the + precedent) so a clean checkout can download outputs instead of rebuilding. +- **Compact deep traces.** A flat transitive leaf set can be large for deep + graphs. An alternative stores a merkle of per-dep deep-input hashes (composing + deep traces by reference) at the cost of a shallow walk on refresh — a + size/latency trade to revisit if flat traces bloat. +- **GC policy.** Concrete roots, schedule, and LRU bounds for `trace/`, the CAS, + and the tree cache. +- **Parallelism.** Independent `need`s can resolve concurrently; the memo table + must then carry futures (`in-progress` becomes "await") rather than a flag. +- **Public surface.** Whether this lands as a libcfree subsystem behind a public + header (the `cfree/cas.h` precedent) with a thin `cfree build` CLI, gated by a + `CFREE_BUILD_ENABLED` flag. diff --git a/doc/plan/README.md b/doc/plan/README.md @@ -16,3 +16,4 @@ shrinks to whatever remains open. | [ARCH.md](ARCH.md) | Remaining native-backend completeness for x64/rv64 relative to the aa64 reference, and per-call cost follow-ups. | [../ARCH.md](../ARCH.md) | | [BOOTSTRAP.md](BOOTSTRAP.md) | The 3-stage self-build reproducibility goal and the open `-O1` issues blocking it. | [../BUILD.md](../BUILD.md) | | [IMAGE_INSPECT.md](IMAGE_INSPECT.md) | Extending object inspection to executables and shared libraries. | [../OBJ.md](../OBJ.md) | +| [BUILD.md](BUILD.md) | A new content-addressed build coordinator (Bazel/Nix-style incremental builds layered on the CAS) — storage state machine, caching algorithm, recipe protocol. Distinct from `../BUILD.md` (cfree's own Makefile build). | — (new subsystem) |