boot2

Playing with the bootstrap
git clone https://git.ryansepassi.com/git/boot2.git

commit 0696a381fa35277134e0c1dd22511fc2de886e96
parent 7a1408b717a8edf57455a7733b9113603afe3bbf
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Tue, 21 Apr 2026 08:53:42 -0700

docs: drop C1 and kaem-minimal; collapse to three contributions

Remove C1 as a separate compiler layer — seed tools now compile
through the same Lisp-hosted C compiler that builds tcc-boot
(PLAN.md). Vendor live-bootstrap's mescc-tools-extra, simple-patch,
and M2libc portable layer instead of authoring C here; no dispatcher
binary. Expand Lisp's syscall surface to 8 (+clone/execve/waitid,
open→openat) so the Lisp program itself drives the tcc-boot build,
eliminating kaem-minimal as a separate artifact.

Post-M1 contributions now: P1 pseudo-ISA, Lisp interpreter, C
compiler in Lisp.

Diffstat:
D docs/C1.md   | 486 -------------------------------------------------------------------------------
M docs/LISP.md |   8 ++++++--
M docs/PLAN.md |  69 ++++++++++++++++++++++++++++++++++++++++++++++++---------------------
M docs/SEED.md | 356 +++++++++++++++++++++++++++++++++++++++++--------------------------------------
4 files changed, 239 insertions(+), 680 deletions(-)

diff --git a/docs/C1.md b/docs/C1.md @@ -1,486 +0,0 @@ -# Bootstrap C-like Language - -A minimal C-like language tuned for trivial one-pass compilation. Strict LL(1) -grammar, recursive-descent parseable with one token of lookahead, no semantic -feedback into the parser. Integer-only, two's complement, word-sized. - -## Goals and non-goals - -Goals: - -- **LL(1) grammar.** Hand-written recursive descent with one-token lookahead. - No symbol table needed for parsing. -- **One-pass compilation.** Source order determines visibility; forward - declarations are used for anything defined later. -- **One integer width.** `int` (machine word) and `byte` (8 bits, memory only). - No promotions, no rank, no integer zoo. -- **No type-directed parsing.** The tokenizer and parser never ask "is this - identifier a type?" -- **Trivial code generation.** Tree walker emitting stack-machine-style output - is sufficient. No register allocation required. -- **Explicit over implicit.** Ambiguous precedence combinations require - parentheses. - -Non-goals: ergonomics, expressiveness, optimization, source compatibility with -C. - -## Lexical structure - -- **Identifiers:** `[A-Za-z_][A-Za-z0-9_]*`. -- **Integers:** decimal `123`, hex `0x7F`, character `'a'` with escapes - `\n \t \r \0 \\ \' \"`. -- **Strings:** `"..."` with the same escapes. Type is `[]byte` — a slice - with `.ptr` into static read-only storage and `.len` equal to the number - of source bytes (no null terminator in the length, though a trailing - `'\0'` byte is present for interop with C-style APIs). -- **Comments:** `// ...` to end of line. No block comments. -- **Keywords:** `var const fn type struct return if else while break continue - switch default pub extern as sizeof null int byte`. -- **Operators:** listed in the expression grammar below. 
- -## Top-level structure - -``` -program = { toplevel } EOF -toplevel = [ 'pub' ] ( fn_def | var_decl | type_decl | fn_decl | const_decl ) - | 'extern' ( var_decl | fn_decl ) -fn_def = 'fn' IDENT '(' params ')' type block -fn_decl = 'fn' IDENT '(' params ')' type ';' -var_decl = 'var' IDENT type [ '=' const_expr ] ';' -type_decl = 'type' IDENT [ '=' type ] ';' // no '=' means forward decl -const_decl = 'const' IDENT '=' const_expr ';' -params = [ param { ',' param } ] -param = IDENT type -``` - -### Visibility - -All top-level names are **file-local by default**. Prefix with `pub` to export -from the translation unit. `extern` declares a name defined in another unit -and is always non-`pub` (it is a local reference to an external symbol). - -``` -var counter int; // file-local -pub var version int = 1; // exported -extern fn write(fd int, buf []byte) int; // defined elsewhere -``` - -### Forward declarations - -Because the compiler is one-pass, every name must be declared before use. -Functions defined later in the same file need a forward `fn` declaration -(signature with `;` instead of a body). Same for globals, if needed. - -``` -fn main() int; // forward decl; body appears later - -fn helper() int { - return main(); // legal because of the forward decl -} - -fn main() int { - return 0; -} -``` - -Forward declarations and definitions must agree on signature exactly. - -### Constants - -``` -const MAX_TOKENS = 1024; -pub const AST_ADD = 1; -pub const AST_SUB = AST_ADD + 1; -``` - -`const` binds a name to a compile-time integer. The right-hand side must be -a constant integer expression — integer/character literals, earlier `const` -names, `sizeof(T)`, and the arithmetic/bitwise/shift/compare operators -applied to those. Constants have type `int` and can be used anywhere an -`int` is expected, including array sizes, `switch` case labels, and `var` -or `const` initializers. - -Constants occupy no storage and cannot be addressed with `&`. 
A `const` is -local to the file unless marked `pub`. - -## Types - -Prefix constructors; read left-to-right. - -``` -type = 'int' | 'byte' - | '*' type - | '[' INT ']' type // array - | '[' ']' type // slice - | 'struct' '{' { IDENT type ';' } '}' - | 'fn' '(' typelist ')' type - | IDENT // named type -typelist = [ type { ',' type } ] -``` - -Examples: - -| Type | Meaning | -|------------------------------|-------------------------------------------| -| `int` | signed machine word | -| `byte` | 8-bit value, memory only | -| `*int` | pointer to int | -| `[10]int` | array of 10 ints | -| `[]int` | slice of int (pointer + length) | -| `*[10]*int` | pointer to array of 10 pointers to int | -| `fn(int, int) int` | function pointer | -| `struct { x int; y int; }` | anonymous struct type | - -### Named types - -``` -type Point = struct { x int; y int; }; -type NodePtr = *Node; -type Node = struct { next *Node; value int; }; -``` - -Self-reference through a pointer works in one pass because the pointer's -pointee type is just a name at parse time. Mutual recursion across two `type` -declarations requires a forward `type Name;` declaration (empty body) followed -by the definition. - -### Slices - -A slice `[]T` is a two-word value: a pointer and a length. It is laid out -as if declared: - -``` -struct { ptr *T; len int; } -``` - -with guaranteed field order and the members accessible as `.ptr` and `.len`. -Unlike other aggregate types, **slices pass and return by value** — they are -always exactly two words. 
- -Construct a slice from an array or another slice with slice syntax: - -``` -var buf [256]byte; -var s1 []byte = buf[..]; // whole array -var s2 []byte = buf[0..16]; // first 16 bytes -var s3 []byte = s1[4..]; // from index 4 to end -var s4 []byte = s1[..10]; // first 10 -``` - -Or assemble one by assigning the fields directly: - -``` -var s []int; -s.ptr = &arr[0]; -s.len = 10; -``` - -Indexing `s[i]` is sugar for `*(s.ptr + i)` with element-size scaling and is -an lvalue. Taking `&s[i]` yields `*T`. There is no bounds checking. - -Slicing is allowed on arrays and on slices only. `expr[i..j]` requires -`i <= j` and yields a slice of length `j - i`. Omitted bounds default to `0` -(low) and the base's length (high). Slicing a bare `*T` is not supported — -set `.ptr` and `.len` manually instead. - -Slices **cannot be compared by value.** `s1 == s2` is an error. Compare -`s1.ptr == s2.ptr` and `s1.len == s2.len` if you mean identity, or write -a byte-by-byte equality helper if you mean content. - -### No implicit conversions - -Every cross-type conversion goes through `as`: - -- `byte` ↔ `int`: explicit `as`. `byte as int` zero-extends; `int as byte` - truncates to the low 8 bits. -- `*T` ↔ `*U`: explicit `as`. -- `int` ↔ `*T`: explicit `as`. -- `null` is assignable to any `*T` without `as` (the only exception). - -### No decay - -Arrays and structs do **not** decay or copy implicitly. To pass one to a -function, take its address with `&`. `&arr` yields `*T` pointing at the -first element (not `*[N]T`). `&s` on a struct yields `*S`. - -``` -var buf [256]byte; -write(1, &buf, 256); // pass pointer to first byte - -var p Point; -init_point(&p, 3, 4); // pass pointer to struct -``` - -Arrays cannot be assigned, returned, or passed by value. Structs cannot be -assigned, returned, or passed by value. If you want to copy, call a helper. 
- -## Statements - -``` -block = '{' { statement } '}' -statement = var_decl - | 'if' expr block [ 'else' ( if_tail | block ) ] - | 'while' expr block - | switch_stmt - | 'return' [ expr ] ';' - | 'break' ';' - | 'continue' ';' - | block - | expr_stmt -if_tail = 'if' expr block [ 'else' ( if_tail | block ) ] -switch_stmt = 'switch' expr '{' { case_arm } [ default_arm ] '}' -case_arm = const_expr { ',' const_expr } block -default_arm = 'default' block -expr_stmt = expr ( '=' expr ';' | ';' ) -``` - -- **No parentheses on conditions.** The expression ends at the opening `{` of - the block because `{` is never a valid continuation of an expression. -- Braces are **mandatory** on every `if`, `else`, `while`. Dangling-else is - impossible. -- **Assignment is a statement, not an expression.** No chained assignment, - no assignment inside conditions. `=` vs `==` confusion at the statement - level is caught by grammar. -- No `for`, no ternary, no comma operator, no compound assignment - (`+=` etc.), no `++`/`--`. -- Local variables are **uninitialized** unless `= expr` is given. Local - initializers may be any expression, not just constants. -- Scalar globals (`int`, `byte`, `*T`, `fn(...)...`): initializer must be a - constant expression; zero-initialized if omitted. -- Aggregate globals (arrays, structs, slices): **zero-initialized only** — - there is no non-zero initializer syntax. Populate at program start if - needed. - -### Switch - -``` -switch tok.kind { - TK_PLUS, TK_MINUS { return parse_add(tok); } - TK_STAR { return parse_mul(tok); } - TK_LPAREN { return parse_group(tok); } - default { return parse_error(tok); } -} -``` - -- The scrutinee is any integer expression (`int` or `byte`). -- Case labels must be compile-time integer constants. Multiple labels per - arm are comma-separated. All labels across a `switch` must be distinct. -- Each arm is a mandatory `{}` block. **No fallthrough.** -- `default` is optional. 
With no default and no matching case, the `switch` - has no effect. -- `break` and `continue` inside an arm refer to the enclosing `while`, not - the `switch`. `return` returns from the function as usual. - -## Expressions - -Eight precedence levels. Where a level is marked **non-chainable**, the -operator may appear at most once; mixing with the surrounding levels requires -explicit parentheses. This is how we kill the classic C precedence traps -(`a & mask == 0` meaning `a & (mask == 0)`, etc.) with zero runtime cost and -a trivial parser — each non-chainable level is a one-shot `[ OP operand ]` -rather than a loop. - -``` -expr = logor -logor = logand { '||' logand } // chainable with itself -logand = compare { '&&' compare } // chainable with itself -compare = bitwise [ CMPOP bitwise ] // non-chainable -bitwise = shift [ BITOP shift ] // non-chainable -shift = addsub [ SHIFT addsub ] // non-chainable -addsub = muldiv { ('+'|'-') muldiv } // left-assoc chainable -muldiv = unary { MULOP unary } // left-assoc chainable -unary = ('-' | '!' | '~' | '*' | '&') unary - | postfix -postfix = primary { '(' args ')' | '[' expr ']' - | '[' [ expr ] '..' [ expr ] ']' - | '.' 
IDENT | 'as' type } -primary = INT | CHAR | STRING | 'null' | IDENT - | '(' expr ')' - | 'sizeof' '(' type ')' -args = [ expr { ',' expr } ] - -MULOP = '*' | '/' | '%' | '/u' | '%u' -SHIFT = '<<' | '>>' | '>>u' -BITOP = '&' | '|' | '^' -CMPOP = '==' | '!=' | '<' | '<=' | '>' | '>=' - | '<u' | '<=u' | '>u' | '>=u' -``` - -Concretely, these require parentheses: - -``` -a & b | c // error: mixing & and | -a << b + c // error: mixing shift and add -a == b == c // error: chained comparison -a < b < c // error: chained comparison -a && b || c // error: mixing && and || -a & mask == 0 // error: mixing bitwise and compare -``` - -Write instead: - -``` -(a & b) | c -a << (b + c) -(a == b) && (b == c) -(a < b) && (b < c) -(a && b) || c -(a & mask) == 0 -``` - -Chaining is allowed within the arithmetic levels (`a + b - c + d`, -`a * b / c`) and within `&&` and `||` individually (`a && b && c`, -`x || y || z`). Mixing `&&` and `||` still requires parentheses. - -### Signed vs unsigned - -Operators are **signed by default**. Unsigned variants are distinct tokens: - -| Signed | Unsigned | Meaning | -|--------|----------|--------------------------| -| `/` | `/u` | division | -| `%` | `%u` | remainder | -| `>>` | `>>u` | right shift (arith/log) | -| `<` | `<u` | less than | -| `<=` | `<=u` | less or equal | -| `>` | `>u` | greater than | -| `>=` | `>=u` | greater or equal | - -`+`, `-`, `*`, `==`, `!=`, `<<`, `&`, `|`, `^`, `~` have identical behavior in -two's complement and so have no signed/unsigned split. - -Signed overflow **wraps** (defined, not UB). Shift by >= word width is -undefined. - -### Booleans - -No boolean type. Zero is false, non-zero is true. Comparisons and `!` yield -0 or 1. `&&` and `||` short-circuit via branches. - -### Lvalues - -Exactly: `IDENT` (naming a variable), `*expr`, `lv.field`, `lv[expr]`. - -Field access **auto-dereferences pointers**: if `lv` has type `*S`, then -`lv.field` means `(*lv).field`. 
This chains as needed: `p.a.b` where `p: *A` -and `A.a: *B` means `(*(*p).a).b`. There is no separate `->` operator. - -`lv[expr]` requires the base to be a pointer, an array, or a slice. -Element-size scaling is applied — see Pointer arithmetic. - -### `sizeof` - -`sizeof(T)` takes a type, never an expression. Result is `int`, compile-time -constant. - -### Casts - -`expr as T` is postfix. `(T)expr` is **not** a cast — parentheses are only -for grouping. This removes the only genuine LL(1) hazard in C. - -### Pointer arithmetic - -When one operand of `+` or `-` is a pointer `*T`: - -- `*T + int` and `int + *T` yield `*T`, advancing by `sizeof(T)` bytes per - unit. -- `*T - int` yields `*T`, retreating by `sizeof(T)` bytes per unit. -- `*T - *T` (same pointee type) yields `int`, the signed element-count - difference. -- Pointer + pointer is not allowed. - -Indexing desugars to this arithmetic: - -- `p[i]` with `p: *T` is `*(p + i)`. -- `arr[i]` with `arr: [N]T` is `*(&arr + i)` (since `&arr` has type `*T`). -- `s[i]` with `s: []T` is `*(s.ptr + i)`. - -### Function values - -A bare function name is its own function pointer — no `&` required. If -`my_func` has type `fn(int) int`, then `my_func` is directly usable wherever -an `fn(int) int` value is expected, and `my_func(42)` calls it. Writing -`&my_func` is legal and yields the same pointer. - -## Calling convention and ABI notes - -- Arguments passed by value, evaluated left to right (pushed right to left - in typical stack-based targets). -- Returns are word-sized: `int`, any `*T`, or `byte` (returned as `int` - with zero-extension). -- **Structs never cross function boundaries by value.** Pass `*S`, return - `*S`, or use an out-parameter. -- **Slices (`[]T`) do cross by value** — always two words (pointer + length), - in a register pair or adjacent stack slots. This is the sole exception to - the one-word return / no-aggregate-by-value rule. -- No varargs. To print multiple values, call multiple helpers. 
- -## Preprocessor - -Only `#include "path"` is supported. Inclusion is textual but -**idempotent per resolved path** — a file already included in the current -compilation is silently skipped on subsequent `#include`s. No include -guards needed, no macros, no conditional compilation, no `#define`. Named -integer constants go in `const`; type aliases go in `type`. - -## Example - -``` -#include "io.lang" - -pub type Node = struct { - next *Node; - value int; -}; - -fn list_len(head *Node) int; // forward decl - -pub fn list_sum(head *Node) int { - var total int = 0; - var p *Node = head; - while p != null { - total = total + p.value; - p = p.next; - } - return total; -} - -fn list_len(head *Node) int { - var n int = 0; - var p *Node = head; - while p != null { - n = n + 1; - p = p.next; - } - return n; -} - -pub fn main() int { - var nodes [3]Node; - nodes[0].value = 10; nodes[0].next = &nodes[1]; - nodes[1].value = 20; nodes[1].next = &nodes[2]; - nodes[2].value = 30; nodes[2].next = null; - - var sum int = list_sum(&nodes[0]); - if (sum > 0) && (sum <u 1000) { - put_int(sum); - } - return 0; -} -``` - -## Dropped from C - -For reference — these are intentionally absent: - -`float`, `double`, `long double`, `complex`, `short`, `long`, `long long`, -`unsigned` as a type (use signed types with unsigned operators), `enum` -(use `const`), `union`, bitfields, C's `const` as a type qualifier (the -keyword is reused for named integer constants), `volatile`, `restrict`, -`static` (replaced by file-local default + `pub`), `typedef` (replaced by -`type`), K&R function syntax, variadic functions, designated initializers, -compound literals, `for`, `do`/`while`, ternary `?:`, comma operator, -compound assignment, `++`/`--`, block comments, pre-processor macros, -implicit conversions, array and function decay, struct/array pass-by-value, -C's `switch` (the keyword is reused with no `case`, no fallthrough, and -mandatory-block arms). 
diff --git a/docs/LISP.md b/docs/LISP.md @@ -51,8 +51,12 @@ Load-bearing; the rest of the document assumes them. 9. **Tail calls via P1 `TAIL`.** `eval` dispatches tail-position calls through `TAIL`; non-tail through `CALL`. Scheme-level tail-call correctness falls out for free. -10. **Five syscalls: `read`, `write`, `open`, `close`, `exit`.** Matches - PLAN.md. No signals, no `lseek`, no `stat`. +10. **Eight syscalls: `read`, `write`, `openat`, `close`, `exit`, + `clone`, `execve`, `waitid`.** Matches PLAN.md. The last three + let the Lisp program spawn M1/hex2 and act as the tcc-boot + build driver; `openat(AT_FDCWD, …)` replaces bare `open` + because aarch64/riscv64 lack it in the asm-generic table. No + signals, no `lseek`, no `stat`. 11. **Pair GC marks live in a separate bitmap**, not in the pair words. ~1.25 MB BSS for a 20 MB heap; keeps pairs at 16 bytes and keeps fixnums at 61 bits. diff --git a/docs/PLAN.md b/docs/PLAN.md @@ -118,20 +118,18 @@ uses these heavily. ## Backend -Two options, to be decided after the P1 spike: - -1. **Emit text M1 assembly** for x86_64, single-arch. Simplest codegen; - tcc-boot only runs on amd64. Matches the original plan. -2. **Emit P1** from the C compiler. The C compiler is written once in - portable Lisp and also *emits* portable asm, so tcc-boot lands on all - three arches for free (modulo tcc-boot's own arch support). Codegen gets - slightly harder — P1 is deliberately dumb, so C idioms like `x += y` - expand to multi-op P1 sequences — but we pay the ~2× code-size tax - already budgeted in `P1.md` rather than writing three backends. - -Option 2 is the natural endpoint of the P1 investment. Defer the decision -until we have measured P1 codegen quality on a non-trivial program (P1.md -stage 5). +**Settled: emit P1.** The C compiler is written once in portable Lisp and +emits portable asm, so both the pre-tcc-boot seed userland (`SEED.md`) and +tcc-boot itself land on all three arches without a second backend. 
Codegen +is slightly harder than direct amd64 — P1 is deliberately dumb, so C +idioms like `x += y` expand to multi-op P1 sequences — but we pay the +~2× code-size tax already budgeted in `P1.md` rather than writing three +backends. + +This forecloses the alternative of emitting amd64 M1 directly (simpler +codegen, single-arch only). That option would have satisfied a +tcc-boot-only goal, but `SEED.md` requires tri-arch seed binaries, so a +portable backend is load-bearing. ## Estimated budget @@ -140,7 +138,7 @@ stage 5). | Lisp interpreter in P1 (reader, eval, GC, primitives, I/O, pmatch) | 4,000–6,000 P1 | | C lexer + recursive-descent parser + CPP (in Lisp) | 2,000–3,000 | | Type checker + IR (slimmed compile.scm + info.scm) | 2,000–3,000 | -| Codegen + asm emit (M1-amd64 or P1, see Backend) | 800–1,500 | +| Codegen + P1 emit (see Backend) | 800–1,500 | | **Total auditable (this plan)** | **~9,000–13,000 LOC** | vs. **~54,000 LOC** current = **~4–6× shrink**, and the result is @@ -175,9 +173,38 @@ with any future seed-stage program. region at link time. No `brk`/`mmap` at runtime, no arena-sizing flag. Keeps the P1 program to a minimal syscall surface and makes the interpreter image self-describing. -- **Syscalls: five.** `read`, `write`, `open`, `close`, `exit`. Each - becomes one P1 `SYSCALL` op backed by a per-arch number table in the - P1 defs file. `read-file` loops `read` into a growable string until - EOF (no `stat`/`lseek`); `display`/`write`/`error` go through `write` - on fd 1/2; `error` finishes with `exit`. No signals, time, fork/exec, - or networking. +- **Syscalls: eight.** `read`, `write`, `openat`, `close`, `exit`, + `clone`, `execve`, `waitid`. Each becomes one P1 `SYSCALL` op + backed by a per-arch number table in the P1 defs file. + `read-file` loops `read` into a growable string until EOF (no + `stat`/`lseek`); `display`/`write`/`error` go through `write` on + fd 1/2; `error` finishes with `exit`. 
`openat(AT_FDCWD, …)` + replaces `open` because aarch64/riscv64 lack bare `open` in the + asm-generic table. `clone(SIGCHLD)` + `execve` + `waitid` give + the Lisp enough to drive the tcc-boot build directly — see + "Build driver" below. No signals, time, or networking. + +## Build driver + +Once Lisp can spawn, the Lisp program itself is the build driver. +There is no separate shell. A top-level Lisp source file reads the +pinned list of tcc-boot translation units, iterates over them, and +for each one: + +1. Reads the `.c` source into a Lisp string. +2. Calls the Lisp-hosted C compiler (in-process) to produce P1 text. +3. Writes the P1 text to a temp file. +4. Spawns M1 (from stage0-posix, via `clone`+`execve`) to assemble + P1 → `.hex2`; waits via `waitid`, aborts on non-zero. +5. Spawns hex2 to emit the final `.o` / ELF; waits, aborts on + non-zero. + +The seed-tool builds (each mescc-tools-extra source → one ELF) run +the same loop. Spawn-and-wait is a ~20 LOC Lisp primitive; the full +driver, including the hard-coded tcc-boot file list, is ~100–200 +LOC of Lisp counted against this plan. + +Concentrating orchestration in the Lisp program (rather than a +separate P1/M1 shell) collapses the post-M1 contribution list to +exactly three artifacts: P1, the Lisp interpreter, and the C +compiler. diff --git a/docs/SEED.md b/docs/SEED.md @@ -4,14 +4,14 @@ Bridge the window between *Lisp exists* and *tcc-boot exists* without touching M2-Planet, Mes, or MesCC. Inside that window, all code is -either a Lisp program running on the Lisp interpreter or subcommands -of a single monolithic C1 binary (`seed`) compiled through the -Lisp-hosted C1 compiler → P1 → M1 → hex2 pipeline. +either a Lisp program running on the Lisp interpreter or one of a +small set of standalone C binaries compiled through the Lisp-hosted +C compiler → P1 → M1 → hex2 pipeline. This document covers only that window. 
Phases before it (`seed0 → -hex0/hex1/hex2 → M1`, P1 defs, Lisp interpreter) are documented in -`P1.md` and `PLAN.md`. tcc-boot itself and everything downstream are -standard C and out of scope. +hex0/hex1/hex2 → M1`, P1 defs, Lisp interpreter, and the Lisp-hosted +C compiler) are documented in `P1.md` and `PLAN.md`. tcc-boot itself +and everything downstream are standard C and out of scope. ## Position in the chain @@ -19,142 +19,136 @@ standard C and out of scope. stage0-posix: seed0 → hex0 → hex1 → hex2 → M1 (no C, no Lisp) P1 layer: P1 defs files load into M1 (P1.md) Lisp: P1 text (Lisp interp source) → M1 → hex2 (PLAN.md) -C1 compiler: Lisp program, loaded into the Lisp image (this doc) -──────── seed window begins here ──────── -seed binary: C1 source → Lisp+C1cc → P1 text → M1 → hex2 (this doc) C compiler: Lisp program, loaded into the Lisp image (PLAN.md) +──────── seed window begins here ──────── +seed tools: C source → Lisp+Ccc → P1 text → M1 → hex2 (this doc) ──────── seed window ends when tcc-boot is built ──────── tcc-boot: C source → Lisp+Ccc → P1 text → M1 → hex2 (PLAN.md) ``` -Two Lisp programs (C1 compiler, C compiler) and one statically-linked -C1 binary. No M2-Planet artifact and no Mes Scheme module anywhere. +One Lisp-hosted C compiler (shared with tcc-boot) and a handful of +statically-linked C binaries. No M2-Planet artifact and no Mes +Scheme module anywhere. ## Settled decisions These are load-bearing; rest of the document assumes them. -1. **C1 targets P1.** One C1 source per subcommand, tri-arch binary - via the existing M1+hex2 path. Accepts P1's ~2× code-size tax. -2. **C1 compiler lives in Lisp.** Same host as the C compiler; shares - the Lisp runtime. ~1.5–2.5k LOC Lisp, counted against `PLAN.md`. -3. **Monolithic `seed` binary.** One executable with subcommand - dispatch on `argv[1]` (e.g. `seed kaem script.kaem`, `seed cat - file`, `seed cp a b`). One audit unit, one copy of the runtime, - no loader. 
Bug blast radius is the whole seed userland — mitigated - by keeping each subcommand self-contained and tested in isolation. -4. **Uncompressed tcc-boot mirror.** Host the upstream tcc-boot source +1. **Seed programs compile through the same Lisp-hosted C compiler + as tcc-boot.** No separate seed-stage compiler. Authors write in + the C subset fixed in `PLAN.md`; backend emits P1, so seed lands + tri-arch via the existing M1+hex2 path. Accepts P1's ~2× + code-size tax. +2. **Vendor upstream C where it exists.** `cat`, `cp`, `mkdir`, + `rm`, `sha256sum`, `untar` are taken from live-bootstrap's + `mescc-tools-extra`; `patch-apply` from `simple-patch-1.0`. + The libc these sources depend on (`<stdio.h>`, `<string.h>`, + `<stdlib.h>`, etc.) is vendored M2libc's portable layer — + `bootstrappable.c`, `string.c`, `stdio.c`, `stdlib.c`, and the + small `ctype`/`fcntl` files (~1,500 LOC). Per-arch syscall + stubs backing M2libc's declarations are replaced with our + P1-based stubs (see "How seed tools reach syscalls" below). All + of the above was written against M2-Planet's C subset, which is + a subset of ours. Local adaptations ship as unified diffs in + the repo. **No C is written fresh here** — each vendored + source already has its own `main`. +3. **The Lisp program is the build driver — no separate shell.** + Per `PLAN.md`, the Lisp's syscall surface includes `clone`, + `execve`, `waitid`, so a top-level Lisp file drives the whole + tcc-boot build: iterate over translation units, call the + Lisp-hosted C compiler in-process, spawn M1/hex2 to finish + each artifact, check exit status. No `kaem`, no `sh`, no flat + script — just Lisp code. +4. **One binary per tool.** Each vendored source compiles to a + standalone ELF — `cat`, `cp`, `mkdir`, `rm`, `sha256sum`, + `untar`, `patch-apply`. Installed into a single directory + (say, `/seed/`) and invoked by absolute path from the Lisp + driver. No dispatcher, no argv[0] multiplexing, no fresh `main` + to write. 
Each tool is its own audit unit. +5. **Uncompressed tcc-boot mirror.** Host the upstream tcc-boot source as an uncompressed `.tar` with sha256 pinned. No gzip support anywhere in the seed stage. Deletes ~1000–1500 LOC of deflate from the audit. -5. **Explicit patches via `seed patch-apply`.** Upstream source stays +6. **Explicit patches via `patch-apply`.** Upstream source stays verbatim. Our changes live as unified-diff files in this repo, - applied by a ~200 LOC C1 subcommand. "Upstream vs ours" stays - legible. -6. **fork + execve for process spawn.** Simplest kernel contract, - stable syscall numbers on all three arches. Plus `wait4` to - reap children. No clone, vfork, or posix_spawn. + applied by the `simple-patch`-derived binary. "Upstream vs + ours" stays legible. 7. **Target self-build is primary; cross-build is a cache.** The canonical build is a fresh target machine bootstrapping from stage0-posix hex seed. Cross-built per-arch tarballs are supported as a reproducibility cache — identical bytes expected, verified against a target self-build, not trusted by assumption. -## The `seed` binary - -One ELF per arch, invoked as `seed <subcommand> [args...]`. Internal -dispatch table maps `argv[1]` to a function; unknown subcommands error -out. Startup shim parses `argc/argv`, calls the dispatch function, -propagates its return code to `exit`. 
- -### Subcommands - -| Subcommand | Purpose | C1 LOC | -|---------------|--------------------------------------------------|----------| -| `kaem` | shell driving the tcc-boot build | 700–900 | -| `untar` | POSIX ustar extract (no gzip, no creation) | 500–700 | -| `patch-apply` | apply a unified diff in-place | ~200 | -| `sha256sum` | verify source tarball hashes | 500–700 | -| `cp` | copy one file | ~150 | -| `mkdir` | single-level directory create | ~80 | -| `rm` | remove one file (no `-r`, no `-f`) | ~120 | -| `mv` | rename within one filesystem | ~150 | -| `cat` | concatenate files to stdout | ~80 | -| `test` | file and string predicates for kaem | ~280 | -| `echo` | write args to stdout | ~50 | -| dispatch + argv plumbing | top-level `main`, subcommand table | ~100 | -| C1 runtime + mini libc | startup, syscalls, memcpy/memset/str* | ~400 | -| **Total** | | **~3310–3910** | - -Dispatch is flat: there is no nesting, no aliases, no argv[0]-based -dispatch. Kaem scripts write out `seed <sub>` in full. One installed -file on disk, no symlinks, no `link` syscall needed. - -### Kaem feature set - -Line-oriented minimal shell: - -- One command per line. No `;`, no `&&`, no `||`. -- Command = word (built-in or path) + whitespace-separated args. - Quoting: `"..."` is one arg, with `\n \t \\ \"` escapes. No - single-quote form. -- Variable substitution: `${NAME}` from environment only. -- Built-ins: `cd` (via `chdir` syscall), `set NAME=VALUE` (env), - `exit`. -- Redirection: `> file` (truncate) and `< file` (stdin). No append, - no pipes. -- Failure: non-zero exit from any command aborts the script. -- Comments: `#` to end of line. -- Expansion excluded: globbing, command substitution, arithmetic, - here-docs, background jobs. - -That suffices to express "unpack → verify → compile each file → link -→ install." Orchestration lives in kaem text, not C1. +## The seed tools + +One ELF per tool per arch. 
Each tool is invoked by absolute path +from the Lisp build driver (e.g. `/seed/sha256sum foo.tar`). Each +binary links against the same vendored M2libc portable layer and +the same P1 syscall stubs. + +### Inventory + +| Tool / layer | Purpose | Source / LOC | +|--------------------|---------------------------------------------|-------------------------| +| `untar` | POSIX ustar extract (no gzip, no creation) | mescc-tools-extra/untar.c (460) | +| `patch-apply` | apply a unified diff in-place | simple-patch-1.0 (~200) | +| `sha256sum` | verify source tarball hashes | mescc-tools-extra/sha256sum.c (586) | +| `cp` | copy one file | mescc-tools-extra/cp.c (332) | +| `mkdir` | single-level directory create | mescc-tools-extra/mkdir.c (117) | +| `rm` | remove one file (no `-r`, no `-f`) | mescc-tools-extra/rm.c (54) | +| `cat` | concatenate files to stdout | mescc-tools-extra/catm.c (69) | +| libc (portable) | stdio, string, stdlib, ctype, fcntl | vendored M2libc (~1,500) | +| syscall stubs | per-arch bridge below M2libc | ~120 lines of P1, not C | +| **Total C** | | **~3,300, fully vendored** | + +Deliberately excluded: `test`, `echo`, `mv`. The Lisp driver does +any conditional or rename logic it needs in Lisp, and emits +progress messages via its own `write` calls — no externalised +shell utilities needed for those concerns. + +The driver is Lisp code, not a shell script; see `PLAN.md`'s +"Build driver" section for the control flow. ## Syscall surface -Combined with PLAN.md's compiler surface, the seed window requires -**12 syscalls** total. Each gets one row in every `p1_<arch>.M1` -defs file. +The seed tools collectively need **7 syscalls** (process spawn +lives in the Lisp driver, not in the tools). 

 | Syscall | Used by |
 |------------|-------------------------------------------|
-| `read` | all file-reading subcommands, Lisp I/O |
+| `read` | all file-reading tools |
 | `write` | stdout/stderr, all file-writing |
-| `open` | file open (`O_RDONLY` / `O_WRONLY|O_CREAT|O_TRUNC` with mode) |
+| `openat` | file open (`AT_FDCWD` + `O_RDONLY` / `O_WRONLY|O_CREAT|O_TRUNC` with mode) |
 | `close` | all file ops |
 | `exit` | program termination |
-| `fork` | kaem child spawn |
-| `execve` | kaem child spawn |
-| `wait4` | kaem reaping children |
-| `mkdir` | `seed mkdir`, `untar` (directory entries) |
-| `unlink` | `seed rm` |
-| `rename` | `seed mv` |
-| `access` | `seed test` (file predicates) |
-| `chdir` | kaem `cd` builtin |
-
-Bumps `PLAN.md`'s "five syscalls" contract to 13 (includes `chdir`);
-PLAN.md should be cross-referenced to this list, not restated
-independently. Deliberately excluded: `stat/fstat` (use `access`
-instead), `chmod` (rely on `open` mode bits for initial perms),
+| `mkdir` | `mkdir` tool, `untar` (directory entries) |
+| `unlink` | `rm` tool |
+
+PLAN.md's Lisp surface is 8 syscalls (`read`, `write`, `openat`,
+`close`, `exit`, `clone`, `execve`, `waitid`). The seed tools add
+`mkdir` and `unlink` on top of that, for a window total of **10
+distinct syscalls**. Each gets one row in every `p1_<arch>.M1`
+defs file. Deliberately excluded: `stat/fstat`, `access`,
+`rename`, `chmod` (rely on `openat` mode bits for initial perms),
 `lseek` (all reads are sequential), `getdents`/`readdir` (no
 directory traversal needed), `dup`/`pipe`/signals/time/net.

-### How C1 reaches syscalls
+### How seed tools reach syscalls

-C1 has no inline asm and no intrinsics. Each syscall is exposed as an
-ordinary `extern fn` declaration, backed by a hand-written P1 stub in
-`runtime.p1`. The stubs are ~3 P1 ops each (load number, `SYSCALL`,
-`RET`), totalling ~40 lines of P1 for the whole surface.
+The Lisp-hosted C compiler has no inline asm and no intrinsics. Each
+syscall is exposed as an ordinary `extern` function declaration,
+backed by a hand-written P1 stub in `runtime.p1`. The stubs are ~3 P1
+ops each (load number, `SYSCALL`, `RET`), totalling ~40 lines of P1
+for the whole surface.

 ```
-:sys_write ; C1 args arrive in P1 r1-r6 per call ABI
+:sys_write ; C args arrive in P1 r1-r6 per call ABI
 SYSCALL write ; expands per-arch via p1_<arch>.M1 defs
 RET
 ```

 ```
-extern fn sys_write(fd int, buf ptr byte, n int) int;
+extern int sys_write(int fd, char *buf, int n);
 ```

 Prerequisite: P1 picks its argument registers (`r1–r6`) to coincide
@@ -169,55 +163,65 @@ the result register. Wrappers return the raw integer; callers test

 ## Build ordering inside the seed window

-Once the Lisp interpreter binary exists and the C1 compiler Lisp
-source is loaded:
-
-1. Compile the `seed` monolith: one C1 source file (or small set
-   `#include`d into one translation unit, since C1's preprocessor
-   supports `#include` only) → P1 text → M1 → hex2 → `seed` ELF.
-   Per-arch, repeat for each target.
-2. Install `seed` on the target (copy to a known path). No other
-   setup required.
-
-The tcc-boot build then runs as kaem scripts:
-
-1. `seed sha256sum upstream.tar` against pinned hash.
-2. `seed untar upstream.tar`.
-3. For each patch file: `seed patch-apply patches/foo.diff`.
-4. Loop over tcc-boot `.c` files, invoking Lisp-as-C-compiler to
-   emit P1 text, then M1+hex2 to produce per-object files or a
-   single linked binary. (tcc-boot's build is simple enough to
-   treat as one compilation unit; the loop is unrolled in kaem.)
-5. Install tcc-boot binary.
-
+Once the Lisp interpreter binary exists and the C compiler Lisp
+source is loaded (both per `PLAN.md`):
+
+1. Compile each seed tool independently: its vendored source plus
+   the vendored M2libc layer plus the per-arch P1 syscall stubs →
+   P1 text → M1 → hex2 → one ELF per tool. Per-arch, repeat for
+   each target.
+2. Install the tools into a single directory on the target (e.g.
+   `/seed/`). No other setup required.
+
+The tcc-boot build runs as a Lisp program invoked on the Lisp
+interpreter. The driver:
+
+1. Spawns `/seed/sha256sum upstream.tar` and checks against pinned
+   hash.
+2. Spawns `/seed/untar upstream.tar`.
+3. For each patch file: spawns `/seed/patch-apply patches/foo.diff`.
+4. Iterates over tcc-boot `.c` files. For each one, calls the
+   Lisp-hosted C compiler in-process to emit P1 text, then spawns
+   M1 and hex2 to produce the object or final linked binary.
+5. Installs the tcc-boot binary.
+
+See `PLAN.md` "Build driver" for the spawn-and-wait primitive.
 Seed window is closed.

 ## Target self-build vs cross-build

 **Target self-build (primary).** A fresh machine of arch `A` starts
 from the stage0-posix hex seed, runs the hex0→hex1→hex2→M1 chain,
-loads `p1_A.M1`, assembles the Lisp interpreter, loads the C1
-compiler into Lisp, compiles `seed`, runs the tcc-boot build. Whole
-process is a kaem script (bootstrapped from a hand-assembled first
-kaem, same way hex2 and M1 are) driving the toolchain.
+loads `p1_A.M1`, assembles the Lisp interpreter, loads the C
+compiler into Lisp, runs the Lisp build-driver program, which
+compiles each seed tool, then compiles and links tcc-boot.
+stage0-posix's own `kaem` runs the early hex0→M1 chain; above M1,
+the Lisp program takes over.

 **Cross-build cache (secondary).** On an already-bootstrapped
-machine, produce `seed` binaries for all three arches and ship them
-as tarballs. Users who opt into this skip the target self-build and
-land directly at "seed installed." Trust claim: **none by
-assumption** — the cache is only trusted after a target self-build
-of at least one arch has verified byte-identical output. Cross-build
-is an optimization, not a trust input.
+machine, produce the seed tool binaries for all three arches and
+ship them as tarballs. Users who opt into this skip the target
+self-build and land directly at "seed tools installed." Trust
+claim: **none by assumption** — the cache is only trusted after a
+target self-build of at least one arch has verified byte-identical
+output. Cross-build is an optimization, not a trust input.

 ## Provenance

-Three kinds of artifact flow in:
+Artifacts flowing in:

 - **stage0-posix hex seed + P1 defs**: part of this repo, audited
   with the rest of it.
-- **Lisp interpreter source (in P1)**: part of this repo.
-- **C1 sources for `seed` + the C1 compiler + C compiler (in Lisp)**:
-  part of this repo.
+- **Lisp interpreter source (in P1) and C compiler (in Lisp)**:
+  part of this repo, covered by `PLAN.md`.
+- **Vendored seed C sources**: pinned snapshots of
+  live-bootstrap's `mescc-tools-extra` (catm, cp, mkdir, rm,
+  sha256sum, untar), `simple-patch-1.0`, and M2libc's portable
+  layer (the libc the mescc-tools sources depend on — stdio,
+  string, stdlib, ctype, fcntl, bootstrappable). All shipped
+  verbatim as `.tar` files with sha256 pinned. Local adaptations
+  ride as unified diffs in the repo, applied by `patch-apply` at
+  build time so "upstream vs ours" stays legible.
 - **Upstream tcc-boot source**: mirrored as uncompressed `.tar` at a
   pinned URL + sha256. The mirror file is one of this repo's
   auditable inputs; it can be re-derived from upstream by untaring
@@ -225,10 +229,14 @@ Three kinds of artifact flow in:
   published `.tar.gz` by re-gzipping and comparing hashes on a
   machine that has `gzip` (done once, out of band).

-`seed sha256sum` is the single piece of C1 whose correctness has a
-direct trust consequence downstream; unit-test it against known
-vectors (empty string, "abc", "abcdbcde..."-length tests) before
-declaring the seed build complete.
+No C is authored fresh in this repo for the seed window; the only
+things written here are unified-diff patches against the vendored
+tree and the per-arch P1 syscall stubs.
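As a rough illustration of how thin the C side of those stubs can stay, the sketch below pairs the `sys_write` declaration shown earlier with a retry loop for short writes. It is a hosted stand-in, not code from this repo: `sys_write` is backed here by libc's `write` so the example runs on an ordinary system, `write_all` is a hypothetical helper name, and treating any result `<= 0` as failure is an assumption about the raw return convention.

```c
#include <unistd.h>

/* Stand-in for the hand-written P1 stub: in the seed build this symbol
 * would come from runtime.p1, not from libc. (Assumption for the demo.) */
int sys_write(int fd, char *buf, int n) {
    return (int)write(fd, buf, (size_t)n);
}

/* Hypothetical helper: keep calling sys_write until all n bytes are out.
 * Returns 0 on success, -1 if the raw result is zero or negative. */
int write_all(int fd, char *buf, int n) {
    while (n > 0) {
        int r = sys_write(fd, buf, n);
        if (r <= 0)
            return -1;
        buf += r;   /* skip the bytes already written */
        n -= r;
    }
    return 0;
}
```

Only the body of `sys_write` would change between this hosted demo and a stub-backed build; the caller's loop depends on nothing but the integer result.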
+
+`sha256sum` is the single seed tool whose correctness has a direct
+trust consequence downstream; unit-test it against known vectors
+(empty string, "abc", "abcdbcde..."-length tests) before declaring
+the seed build complete.

 ## Interaction with tcc-boot

@@ -237,53 +245,59 @@ coreutils`. Mapping:

 | tcc-boot expects | Seed provides |
 |------------------|--------------------------------------------------|
-| `cc` / `gcc` | kaem loop invoking Lisp-as-C-compiler per `.c` |
-| `make` | flat kaem script (tcc-boot is simple enough) |
-| `sh` | `seed kaem` |
-| `cat`/`cp`/etc. | `seed <sub>` |
+| `cc` / `gcc` | Lisp-hosted C compiler, invoked in-process per `.c` |
+| `make` | Lisp driver program (tcc-boot is simple enough) |
+| `sh` | not provided — the Lisp driver spawns tools directly |
+| `cat`/`cp`/etc. | individual seed-tool binaries at absolute paths |
 | `ld` | tcc-boot's built-in linker (for its own output) |
 | `ar` | not needed; tcc-boot builds one static binary |

-A thin shim script under `scripts/` maps tcc-boot's literal command
-names (`cc`, `make`, `install`) to the `seed <sub>` / Lisp-invocation
-forms. That shim is kaem text, not C1.
+Any translation from tcc-boot's literal build-command names
+(`cc`, `make`, `install`) to seed tools lives in Lisp, not in a
+separate shim script.
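With no shell between the driver and the tools, each build step reduces to "spawn an absolute path, wait, check the exit status". The real driver does this in Lisp over `clone`/`execve`/`waitid`; the C sketch below only illustrates that control flow on a hosted POSIX system, with `fork` standing in for a raw `clone`. `run_tool` is a hypothetical name, and returning 127 when `execve` fails is a shell-like convention assumed here, not something the source specifies.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* expose waitid/siginfo_t under strict -std= modes */
#endif
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: run the tool at an absolute path and wait for it.
 * Returns the child's exit status, or -1 if it could not be spawned or
 * did not exit normally (e.g. it was killed by a signal). */
int run_tool(const char *path, char *const argv[]) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execve(path, argv, environ);  /* absolute path, no PATH search */
        _exit(127);                   /* only reached if execve failed */
    }
    siginfo_t info;
    if (waitid(P_PID, (id_t)pid, &info, WEXITED) != 0)
        return -1;
    return info.si_code == CLD_EXITED ? info.si_status : -1;
}
```

A driver built on a primitive like this could stop at the first non-zero result, which keeps the "verify, unpack, patch, compile" sequence strictly linear.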

 ## Budget rollup

 Fresh auditable LOC introduced by this document, on top of PLAN.md:

-| Layer | LOC |
-|-----------------------------------------------|-----------------|
-| C1 compiler (Lisp, counted in PLAN.md) | (1,500–2,500) |
-| `seed` monolith (all subcommands + runtime) | 3,300–3,900 |
-| kaem scripts (orchestration, driver) | a few hundred |
-| **Seed window addition** | **~3,300–3,900**|
+| Layer | LOC |
+|--------------------------------------------------------|---------|
+| seed tools — vendored mescc-tools-extra + simple-patch | ~1,800 |
+| seed tools — vendored M2libc portable layer | ~1,500 |
+| syscall stubs (P1, not C) | ~120 |
+| Lisp build-driver program | counted in PLAN.md |
+| **Seed window addition** | **~3,300 C (all vendored) + ~120 P1** |

 Combined PLAN.md + SEED.md audit surface: **~13–17k LOC**, tri-arch,
-M2-Planet-free and Mes-free.
+M2-Planet-free and Mes-free. No fresh C is authored for the seed
+window; the entire ~3,300 LOC is audited upstream code written
+against M2-Planet's C subset. The build driver is Lisp code
+counted against PLAN.md (~100–200 LOC).

 ## Handoff notes for the engineer

 Approximate build order for implementation:

-1. **C1 compiler in Lisp** (blocks everything below). Write against
-   a small corpus of C1 test programs. Validate by compiling a
-   20–50 LOC C1 program, running the output, confirming behavior.
-2. **C1 runtime + syscall wrappers + mini libc.** Smallest
-   subcommand (`echo` or `cat`) is the bring-up test.
-3. **`seed` dispatch skeleton** plus `echo`, `cat`, `cp`, `mkdir`,
-   `rm`, `mv`. Small, independent, easy to unit-test.
-4. **`sha256sum`** with unit tests before anything depends on its
-   correctness.
-5. **`test`** (file predicates needed by kaem).
+1. **C compiler in Lisp** (blocks everything below). Per `PLAN.md`;
+   validate on a small corpus before touching seed.
+2. **Vendor M2libc's portable layer** and write the per-arch P1
+   syscall stubs that back its declarations. Bring-up test: link
+   `catm.c` (69 LOC) against this libc and run it.
+3. **Vendor mescc-tools-extra + simple-patch.** Pin sha256s.
+   Confirm each source compiles unmodified through the Lisp-hosted
+   C compiler; if anything trips, capture the delta as a unified
+   diff rather than editing the vendored tree in place.
+4. **Build the small tools** individually (`cat`, `cp`, `mkdir`,
+   `rm`) — each is its own ELF.
+5. **`sha256sum`** with unit tests (empty / "abc" / long vectors)
+   before anything depends on its correctness.
 6. **`untar`** (ustar extract only).
 7. **`patch-apply`** (unified-diff in-place).
-8. **`kaem`** (depends on `fork`, `execve`, `wait4`, `chdir`,
-   redirect).
-9. **End-to-end bring-up**: kaem script running `sha256sum` →
-   `untar` → `patch-apply` → Lisp-C-compile loop → linked
-   tcc-boot. First full trip through the seed window.
+8. **End-to-end bring-up**: Lisp build-driver running
+   `sha256sum` → `untar` → `patch-apply` → in-process C-compile
+   loop (spawning M1/hex2 per `.c`) → linked tcc-boot. First full
+   trip through the seed window.

-Each step compiles standalone C1 and assembles through the existing
+Each step compiles standalone C and assembles through the existing
 P1 → M1 → hex2 path; no new tooling infrastructure is needed between
 steps.
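
The known-vector check called for in the `sha256sum` step can be sketched as a small known-answer test. The `sha256_hex` routine below is a compact stand-in written for this example so the block runs on its own; it is not the vendored `mescc-tools-extra/sha256sum.c`, and in a real bring-up the digests would come from running the built tool. The two expected digests are the published SHA-256 values for the empty message and for `abc`; `check_vectors`, `VECTORS`, and `struct vector` are hypothetical names.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static const uint32_t K[64] = {
    0x428a2f98,0x71374491,0xb5c0fbcf,0xe9b5dba5,0x3956c25b,0x59f111f1,0x923f82a4,0xab1c5ed5,
    0xd807aa98,0x12835b01,0x243185be,0x550c7dc3,0x72be5d74,0x80deb1fe,0x9bdc06a7,0xc19bf174,
    0xe49b69c1,0xefbe4786,0x0fc19dc6,0x240ca1cc,0x2de92c6f,0x4a7484aa,0x5cb0a9dc,0x76f988da,
    0x983e5152,0xa831c66d,0xb00327c8,0xbf597fc7,0xc6e00bf3,0xd5a79147,0x06ca6351,0x14292967,
    0x27b70a85,0x2e1b2138,0x4d2c6dfc,0x53380d13,0x650a7354,0x766a0abb,0x81c2c92e,0x92722c85,
    0xa2bfe8a1,0xa81a664b,0xc24b8b70,0xc76c51a3,0xd192e819,0xd6990624,0xf40e3585,0x106aa070,
    0x19a4c116,0x1e376c08,0x2748774c,0x34b0bcb5,0x391c0cb3,0x4ed8aa4a,0x5b9cca4f,0x682e6ff3,
    0x748f82ee,0x78a5636f,0x84c87814,0x8cc70208,0x90befffa,0xa4506ceb,0xbef9a3f7,0xc67178f2
};

static uint32_t rotr(uint32_t x, int n) { return (x >> n) | (x << (32 - n)); }

/* Fold one 64-byte block into the running state h. */
static void sha256_block(uint32_t h[8], const uint8_t *p) {
    uint32_t w[64], s[8];
    for (int i = 0; i < 16; i++)
        w[i] = (uint32_t)p[4*i] << 24 | (uint32_t)p[4*i+1] << 16 |
               (uint32_t)p[4*i+2] << 8 | (uint32_t)p[4*i+3];
    for (int i = 16; i < 64; i++) {
        uint32_t s0 = rotr(w[i-15], 7) ^ rotr(w[i-15], 18) ^ (w[i-15] >> 3);
        uint32_t s1 = rotr(w[i-2], 17) ^ rotr(w[i-2], 19) ^ (w[i-2] >> 10);
        w[i] = w[i-16] + s0 + w[i-7] + s1;
    }
    memcpy(s, h, sizeof s);
    for (int i = 0; i < 64; i++) {
        uint32_t S1 = rotr(s[4], 6) ^ rotr(s[4], 11) ^ rotr(s[4], 25);
        uint32_t ch = (s[4] & s[5]) ^ (~s[4] & s[6]);
        uint32_t t1 = s[7] + S1 + ch + K[i] + w[i];
        uint32_t S0 = rotr(s[0], 2) ^ rotr(s[0], 13) ^ rotr(s[0], 22);
        uint32_t maj = (s[0] & s[1]) ^ (s[0] & s[2]) ^ (s[1] & s[2]);
        s[7] = s[6]; s[6] = s[5]; s[5] = s[4]; s[4] = s[3] + t1;
        s[3] = s[2]; s[2] = s[1]; s[1] = s[0]; s[0] = t1 + S0 + maj;
    }
    for (int i = 0; i < 8; i++) h[i] += s[i];
}

/* Hash msg[0..len) and write 64 lowercase hex digits plus NUL into out. */
void sha256_hex(const uint8_t *msg, size_t len, char out[65]) {
    uint32_t h[8] = {0x6a09e667,0xbb67ae85,0x3c6ef372,0xa54ff53a,
                     0x510e527f,0x9b05688c,0x1f83d9ab,0x5be0cd19};
    uint8_t tail[128] = {0};
    size_t full = len / 64 * 64, rem = len - full;
    for (size_t i = 0; i < full; i += 64) sha256_block(h, msg + i);
    memcpy(tail, msg + full, rem);
    tail[rem] = 0x80;                       /* mandatory padding bit */
    size_t tlen = rem < 56 ? 64 : 128;      /* room for the 8-byte length? */
    uint64_t bits = (uint64_t)len * 8;
    for (int i = 0; i < 8; i++) tail[tlen - 1 - i] = (uint8_t)(bits >> (8 * i));
    sha256_block(h, tail);
    if (tlen == 128) sha256_block(h, tail + 64);
    for (int i = 0; i < 8; i++) sprintf(out + 8 * i, "%08x", (unsigned)h[i]);
}

struct vector { const char *msg; const char *hex; };

static const struct vector VECTORS[] = {
    {"",    "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"},
    {"abc", "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"},
};

/* Returns the number of vectors whose digest does not match. */
int check_vectors(void) {
    int bad = 0;
    char got[65];
    for (size_t i = 0; i < sizeof VECTORS / sizeof VECTORS[0]; i++) {
        sha256_hex((const uint8_t *)VECTORS[i].msg, strlen(VECTORS[i].msg), got);
        if (strcmp(got, VECTORS[i].hex) != 0) {
            printf("FAIL: \"%s\"\n", VECTORS[i].msg);
            bad++;
        }
    }
    return bad;
}
```

Both vectors fit in a single block, so a longer message that forces a second block should be added to the table before treating the tool as verified.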