commit a86b719a54e31357eadbcd172dc3bd1776912ed4
parent 9a3e9f8a0fa616639d12f375d6b7a46cfada0546
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Sat, 25 Apr 2026 21:50:07 -0700
docs: add C compiler spec — CC.md, CC-INTERNALS.md, CC-CONTRACTS.md
Three-doc set defining the scheme1-hosted C compiler that will replace
MesCC at the live-bootstrap tcc-mes stage:
- CC.md: accepted C subset (matches MesCC + tcc-mes defines)
- CC-INTERNALS.md: six-module decomposition (util/data/lex/pp/cg/parse)
- CC-CONTRACTS.md: frozen alphabets, test formats, frame ABI, mangling,
conversion-responsibility split, phase-1 milestone
Engineers can work the modules in parallel against these contracts.
Diffstat:
| A | docs/CC-CONTRACTS.md | | | 533 | +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ |
| A | docs/CC-INTERNALS.md | | | 726 | +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ |
| A | docs/CC.md | | | 446 | +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ |
3 files changed, 1705 insertions(+), 0 deletions(-)
diff --git a/docs/CC-CONTRACTS.md b/docs/CC-CONTRACTS.md
@@ -0,0 +1,533 @@
+# lispcc contracts
+
+Frozen interfaces between modules. Engineers must not diverge from
+these without proposing a change. This document is the source of truth
+for the symbol alphabets, test formats, ABI, and phase-1 milestone
+referenced from [CC-INTERNALS.md](CC-INTERNALS.md).
+
+## 1. Symbol alphabets
+
+Every record's `kind`-style fields use these exact symbols. Adding a
+symbol = updating this section first.
+
+### 1.1 `tok-kind`
+
+```
+IDENT KW INT STR CHAR PUNCT HASH NL EOF
+```
+
+Uppercase to distinguish from value-level symbols. `IDENT` carries an
+unrecognized identifier; `KW` is one of the symbols in §1.3.
+
+### 1.2 `PUNCT` value symbols
+
+The lexer produces `tok-value` symbols for punctuators per the
+following table. **Names are mandatory** — no engineer may use the raw
+`'+`, `'*`, etc. as symbols, because several C punctuator characters
+(`%`, `|`, `,`, `;`, `(`, `)`, `{`, `}`, `[`, `]`, `.`, `#`) cannot
+form valid scheme1 symbols. We use named symbols for *all* punctuators
+to keep the scheme uniform.
+
+| C | Symbol | C | Symbol |
+|---------|-------------|----------|------------|
+| `[` | `lbrack` | `==` | `eq2` |
+| `]` | `rbrack` | `!=` | `ne` |
+| `(` | `lparen` | `<` | `lt` |
+| `)` | `rparen` | `>` | `gt` |
+| `{` | `lbrace` | `<=` | `le` |
+| `}` | `rbrace` | `>=` | `ge` |
+| `.` | `dot` | `<<` | `shl` |
+| `->` | `arrow` | `>>` | `shr` |
+| `,` | `comma` | `&&` | `land` |
+| `;` | `semi` | `\|\|` | `lor` |
+| `:` | `colon` | `&` | `amp` |
+| `?` | `qmark` | `\|` | `bar` |
+| `...` | `ellipsis` | `^` | `caret` |
+| `++` | `inc` | `~` | `tilde` |
+| `--` | `dec` | `!` | `bang` |
+| `+` | `plus` | `=` | `assign` |
+| `-` | `minus` | `+=` | `plus-eq` |
+| `*` | `star` | `-=` | `minus-eq` |
+| `/` | `slash` | `*=` | `star-eq` |
+| `%` | `pct` | `/=` | `slash-eq` |
+| `#` | `hash` | `%=` | `pct-eq` |
+| `##` | `paste` | `<<=` | `shl-eq` |
+| | | `>>=` | `shr-eq` |
+| | | `&=` | `amp-eq` |
+| | | `^=` | `caret-eq` |
+| | | `\|=` | `bar-eq` |
+
+Digraphs (`<:` `:>` `<%` `%>` `%:` `%:%:`) lex as their standard
+equivalents and yield the same symbols as `[` `]` `{` `}` `#` `##`.
+
+### 1.3 `KW` value symbols
+
+```
+;; storage
+auto register static extern typedef
+;; qualifiers (parsed and discarded)
+const volatile restrict inline
+;; type specifiers
+void char short int long signed unsigned _Bool
+;; rejected type specifiers (lexed as KW so we get clean diagnostics)
+float double
+;; aggregates
+struct union enum
+;; statements
+if else while do for switch case default break continue return goto
+;; operators
+sizeof
+;; reserved-and-rejected (lexed as KW so we error crisply)
+_Generic _Atomic _Thread_local _Alignof _Alignas _Static_assert
+_Complex _Imaginary
+```
+
+Anything matching the C identifier grammar that is **not** in this
+list lexes as `IDENT`.
+
+### 1.4 `ctype-kind`
+
+```
+void i8 u8 i16 u16 i32 u32 i64 u64 bool
+ptr arr fn struct union enum
+```
+
+Char is `i8` (`signed char`) or `u8` (`unsigned char`); `char` itself
+is `i8` (we treat plain `char` as signed, matching MesCC and most
+compilers). Long and long long collapse to `i64`/`u64` on P1-64.
+
+### 1.5 `opnd-kind`
+
+```
+imm frame global reg
+```
+
+`reg` is transient — used for the result of a call before it spills
+to a frame slot. The vstack itself never holds `reg` opnds; cg
+materializes through `reg` only inside a single emission step.
+
+### 1.6 `macro-kind`
+
+```
+obj fn fn-vararg
+```
+
+### 1.7 `sym-kind`
+
+```
+var fn typedef enum-const param label
+```
+
+### 1.8 `sym-storage`
+
+```
+auto static extern register
+```
+
+`#f` for `typedef`, `enum-const`, and `label` symbols (storage
+class doesn't apply).
+
+### 1.9 `loop-ctx-kind`
+
+```
+while do for switch
+```
+
+### 1.10 `reg` opnd register names
+
+```
+a0 a1 a2 a3
+```
+
+Only argument registers. Saved registers (`s0`..`s3`) and temporaries
+(`t0`..`t2`) are cg-private; never exposed as opnd payload.
+
+### 1.11 `cg-binop` and `cg-unop` operator symbols
+
+```
+;; cg-binop:
+add sub mul div rem
+and or xor shl shr
+eq ne lt le gt ge
+
+;; cg-unop:
+neg ;; arithmetic negate
+bnot ;; bitwise complement (~)
+lnot ;; logical not (!)
+```
+
+These are abstract operations independent of source-level PUNCT
+symbols; the parser maps PUNCT → cg-op (e.g., `'plus` → `'add`,
+`'eq2` → `'eq`).
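+
+A minimal sketch of that mapping for the binary operators, written in
+portable Scheme and not checked against the scheme1 prelude. The names
+`punct->binop` and `binop-of` are hypothetical, `assq` is assumed to be
+available (util's `alist-ref/eq` would serve equally), and every pairing
+other than `plus` → `add` and `eq2` → `eq` is inferred from the names:
+
+```scheme
+;; Binary-context PUNCT symbols (§1.2) -> cg-binop symbols (§1.11).
+;; Unary minus/tilde/bang would map to neg/bnot/lnot in a separate table.
+(define punct->binop
+  '((plus . add) (minus . sub) (star . mul) (slash . div) (pct . rem)
+    (amp . and) (bar . or) (caret . xor) (shl . shl) (shr . shr)
+    (eq2 . eq) (ne . ne) (lt . lt) (le . le) (gt . gt) (ge . ge)))
+
+(define (binop-of punct)              ; -> cg-binop symbol, or #f
+  (let ((hit (assq punct punct->binop)))
+    (and hit (cdr hit))))
+
+(binop-of 'plus)   ; => add
+(binop-of 'eq2)    ; => eq
+(binop-of 'semi)   ; => #f (not a binary operator)
+```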
+
+## 2. Test serialization formats
+
+All test goldens use Scheme-readable forms so they `diff` cleanly and
+can be machine-parsed if useful.
+
+### 2.1 Token line format
+
+One token per line, as a Scheme list:
+
+```
+(KIND VALUE FILE LINE COL)
+```
+
+- **KIND**: bare symbol from §1.1.
+- **VALUE** rendering depends on KIND:
+ - `IDENT`, `STR`: bytevector literal `"..."` with `\n \t \r \\ \"`
+ escapes; non-ASCII bytes as `\xNN`.
+ - `INT`, `CHAR`: decimal integer.
+ - `KW`, `PUNCT`: bare symbol from §1.2 / §1.3.
+ - `HASH`, `NL`, `EOF`: `#f`.
+- **FILE**: bytevector literal.
+- **LINE**, **COL**: decimal integers (1-based).
+
+The `tok-hide` field is **not** serialized — it is implementation
+detail of the preprocessor.
+
+Example for `int main() { return 0; }` in `t.c`:
+
+```
+(KW int "t.c" 1 1)
+(IDENT "main" "t.c" 1 5)
+(PUNCT lparen "t.c" 1 9)
+(PUNCT rparen "t.c" 1 10)
+(PUNCT lbrace "t.c" 1 12)
+(KW return "t.c" 1 14)
+(INT 0 "t.c" 1 21)
+(PUNCT semi "t.c" 1 22)
+(PUNCT rbrace "t.c" 1 24)
+(EOF #f "t.c" 1 25)
+```
+
+Trailing whitespace and `;`-comments in the golden file are ignored.
+
+### 2.2 cg-trace line format
+
+The cg-trace mock writes one Scheme list per cg call:
+
+```
+(<call-name> <arg1> <arg2> ...)
+```
+
+`<call-name>` strips the `cg-` prefix (`cg-push-imm` → `push-imm`,
+`cg-fn-begin` → `fn-begin`).
+
+Argument renderers, applied per-call:
+
+- **ctype** → a stable symbolic form:
+ - primitives: `void`, `i8`, `u8`, `i16`, `u16`, `i32`, `u32`,
+ `i64`, `u64`, `bool` (the `kind` symbol verbatim).
+ - pointer: `(ptr <T>)`.
+ - array: `(arr <T> <N>)` where N is the length or `*` for
+ incomplete.
+ - function: `(fn <ret> (<param>...) <variadic?>)`.
+ - aggregates: `(struct <tag>)`, `(union <tag>)`, `(enum <tag>)`.
+- **sym** → `(<name-bv> <kind-symbol>)`. Storage and slot are not
+ surfaced — they are implementation detail.
+- **bv** → bytevector literal, as in §2.1.
+- **fixnum** → decimal integer.
+- **bool** → `#t` / `#f`.
+- **op symbol** (binop/unop) → bare symbol.
+
+cg calls that take *thunks* (`cg-if`, `cg-ifelse`, `cg-loop`) emit a
+matching open/close pair in the trace, with the body's calls in
+between:
+
+```
+(if-begin)
+ ...body trace...
+(if-end)
+
+(ifelse-begin)
+ ...then-trace...
+(ifelse-mid)
+ ...else-trace...
+(ifelse-end)
+
+(loop-begin <tag>)
+ ...body trace...
+(loop-end <tag>)
+```
+
+This is the canonical surface — `cg-if` *internally* uses a thunk,
+but the trace exposes begin/mid/end markers so tests can read top-down.
+
+### 2.3 Diagnostic format
+
+Already canonical from CC-INTERNALS:
+
+```
+<file>:<line>:<col>: error: <msg>: <irritants...>
+```
+
+Tests for failure paths verify:
+- exit status is 1
+- stderr contains the expected `<file>:<line>:<col>: error:` prefix
+
+The `<msg>` body is **not** matched character-for-character (so we
+can refine wording without breaking tests); only the prefix and a
+keyword-substring of the engineer's choice.
+
+## 3. Frame layout / parameter ABI
+
+### 3.1 cg-fn-begin contract
+
+```scheme
+(cg-fn-begin cg name params return-type) -> param-syms
+;; name: bv (un-mangled C identifier)
+;; params: list of (name-bv . ctype)
+;; return-type: ctype
+;; param-syms: alist (name-bv . sym), each sym already bound to a frame slot
+```
+
+Inside `cg-fn-begin`, cg:
+
+1. Allocates one frame slot per parameter via `cg-alloc-slot`. Slot
+ width = `ctype-size` rounded up to 8 (`align-up`); align = 8.
+ (Yes, every param costs at least 8 bytes. P1-64 frame is
+ word-stride; we don't pack.)
+2. Begins emitting into `cg-fn-buf`. Does **not** yet emit the
+ prologue — that's deferred to `cg-fn-end` once `frame-hi` is final.
+3. Emits the param-spill code into a "prologue prefix" buffer
+ (private to cg): for params 0..3, `ST aN, [sp + slotN]`; for
+ params 4+, `LDARG t0, K` then `ST t0, [sp + slotK]`.
+4. Returns the param-sym alist. Parser binds these into the function-
+ body scope.
+
+### 3.2 cg-fn-end contract
+
+```scheme
+(cg-fn-end cg)
+```
+
+cg:
+
+1. Reads final `frame-hi` (highest byte allocated).
+2. Emits the per-function preamble (an M1pp `%struct` is **not**
+ used — slots are numeric byte offsets, baked into the body text
+ already buffered in `fn-buf`).
+3. Wraps the prologue-prefix + fn-buf inside a libp1pp `%fn` macro:
+
+ ```
+ %fn(<mangled-name>, <frame-hi-aligned-up-to-16>, {
+ <prologue-prefix bytes>
+ <fn-buf bytes>
+ ::ret
+ LD a0, [sp + <return-slot>]
+ })
+ ```
+4. Flushes the result into `cg-text`, clears `fn-buf` and the
+ prologue-prefix buffer, resets `vstack`, `frame-hi`, and the
+ function-local label counter.
+
+The frame size is rounded up to 16 to satisfy the P1 stack-align
+contract.
+
+### 3.3 Outgoing-arg staging
+
+When `cg-call` is asked to emit a call with arity > 4, it stages
+args 4..(N-1) into the *low-addressed* prefix of the current frame
+at `[sp + 0*8]`, `[sp + 1*8]`, etc., per LIBP1PP.md §Frame locals.
+cg tracks the maximum staging count seen across the function and
+reserves that prefix at fn-end before any other slots — i.e.,
+`cg-alloc-slot`'s first allocation comes *after* the staging area.
+
+The accounting is internal to cg. Parse never sees staging slots.
+
+### 3.4 cg-alloc-slot contract
+
+```scheme
+(cg-alloc-slot cg bytes align) -> offset
+```
+
+- `bytes` = total size needed (e.g., 4 for `int`, 40 for `int[10]`,
+ `sizeof(struct foo)` for a struct).
+- `align` = required alignment (1, 2, 4, or 8).
+- Returns numeric byte offset relative to `sp` post-`%enter`.
+
+cg first aligns `frame-hi` up to `align`, returns that as the
+offset, then bumps `frame-hi` by `bytes`. Slots are not reused
+across scopes (we're optimizing for compiler simplicity, not frame
+size). Local arrays and structs request their full size in one call.
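+
+The bump-allocation rule can be sketched as follows. This is an
+illustration only: a closure stands in for the `frame-hi` field of the
+`cg` record, the outgoing-arg staging prefix of §3.3 is ignored, and
+`make-slot-allocator` is a hypothetical name. `align-up` follows the
+signature listed for util.scm.
+
+```scheme
+(define (align-up n k)                ; round n up to a multiple of k (k > 0)
+  (* k (quotient (+ n (- k 1)) k)))
+
+(define (make-slot-allocator)
+  (let ((frame-hi 0))                 ; bytes used so far
+    (lambda (bytes align)
+      (let ((offset (align-up frame-hi align)))
+        (set! frame-hi (+ offset bytes))
+        offset))))
+
+(define alloc (make-slot-allocator))
+(alloc 4 4)    ; int      => 0
+(alloc 1 1)    ; char     => 4
+(alloc 8 8)    ; pointer  => 8   (frame-hi 5 aligned up to 8)
+(alloc 40 4)   ; int[10]  => 16
+```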
+
+## 4. Conversion responsibility
+
+The parser drives type semantics; cg is type-aware only enough to
+choose signed-vs-unsigned variants and to scale pointer arithmetic.
+
+### 4.1 Parser's responsibilities
+
+The parser **must** call cg in this order around each operation:
+
+| Source | Required parser actions before the operation |
+|--------|----------------------------------------------|
+| `e1 + e2` (and other arith binops) | (a) parse e1 → if lval, `cg-load`; (b) `cg-promote` if rank < int; (c) parse e2 same way; (d) `cg-arith-conv` to bring both to common type; (e) `cg-binop add` |
+| `*p` | parse p → if lval, `cg-load`; then `cg-push-deref` |
+| `&x` | parse x → must be lval; then `cg-take-addr` |
+| `(T)e` | parse e → if lval, `cg-load` (unless casting to a pointer); then `cg-cast T` |
+| `f(a, b, ...)` | parse f → if lval and `f` not a function-typed identifier, `cg-load`; parse each arg → `cg-load` if lval, then `cg-cast` to param type (or default-promote for variadic args); then `cg-call` |
+| `lhs = rhs` | parse lhs → must be lval (no load); parse rhs → `cg-load` if lval; `cg-cast` to lhs type; `cg-assign` |
+| `lhs += rhs` | parse lhs (lval) → duplicate via `cg-take-addr` then `cg-push-deref`; parse rhs; `cg-arith-conv`; `cg-binop add`; `cg-cast` to lhs type; `cg-assign` |
+| `return e` | parse e → `cg-load` if lval; `cg-cast` to fn return type; `cg-return` |
+| `if (e) ...` | parse e → `cg-load` if lval; `cg-cast bool` if not already int-shaped; `cg-if` |
+
+The parser is responsible for the standard:
+
+- **Integer promotion**: any operand of type rank below `int` is
+ promoted to `int` (or to `unsigned int` if `int` cannot hold all
+ its values) before use in arithmetic, before assignment to a wider
+ lhs in mixed contexts, and before being passed as a variadic argument.
+- **Usual arithmetic conversions**: applied to both operands of a
+ binary arithmetic operator after promotion. The result type is
+ the common type.
+- **Pointer-int interaction**: detected by parser; `cg-binop add`
+ on (ptr, int) handles scaling internally (see §4.2).
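+
+A sketch of promotion plus the usual arithmetic conversions over the
+integer `ctype-kind` symbols of §1.4, in portable Scheme. It assumes
+rank tracks size (reasonable here because `long` and `long long`
+collapse to `i64`/`u64`) and handles integer kinds only. The helper
+names `kind-size`, `kind-unsigned?`, `promote`, and `common-kind` are
+hypothetical.
+
+```scheme
+(define (kind-size k)
+  (case k ((i8 u8 bool) 1) ((i16 u16) 2) ((i32 u32) 4) ((i64 u64) 8)))
+
+(define (kind-unsigned? k) (memq k '(u8 u16 u32 u64 bool)))
+
+;; Everything narrower than int fits in a 32-bit int, so it becomes i32.
+(define (promote k) (if (< (kind-size k) 4) 'i32 k))
+
+;; Common type: the wider size wins; at equal size, unsigned wins.
+(define (common-kind a b)
+  (let* ((pa (promote a))
+         (pb (promote b))
+         (wide (max (kind-size pa) (kind-size pb)))
+         (uns (or (and (= (kind-size pa) wide) (kind-unsigned? pa))
+                  (and (= (kind-size pb) wide) (kind-unsigned? pb)))))
+    (if (= wide 8)
+        (if uns 'u64 'i64)
+        (if uns 'u32 'i32))))
+
+(common-kind 'i8 'u16)    ; => i32
+(common-kind 'i32 'u32)   ; => u32
+(common-kind 'i64 'u32)   ; => i64
+(common-kind 'i32 'u64)   ; => u64
+```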
+
+### 4.2 cg's responsibilities
+
+cg trusts the operand types it is handed.
+
+- **`cg-load`**: pop lval, emit one load (of the right width based
+ on `ctype-size`), push rval of the same type.
+- **`cg-cast to-type`**: pop, emit sign-extend / zero-extend /
+ truncate as needed based on source vs. target sizes and signedness.
+ For `to-type = bool`: emit `(BNEZ -> 1, fallthrough -> 0)` shape.
+ Pointer ↔ integer casts are bit-for-bit on P1-64 (no emission).
+- **`cg-binop add` (and `sub`)**: if exactly one operand is a `ptr`,
+ scale the int operand by `ctype-size` of the pointee before adding.
+ If both ptr → only `sub` is valid (yields `i64` byte difference,
+ divided by element size); other binops on (ptr, ptr) abort via
+ `die`.
+- **`cg-binop` for divisions and comparisons**: dispatch to signed
+ (`DIV`/`BLT`) or unsigned (`DIV`+sign-flip / `BLTU`) variant based
+ on the operand kinds (`i*` → signed, `u*` → unsigned). After
+ `cg-arith-conv`, both operands have the same kind, so dispatch is
+ unambiguous.
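+
+The pointer-scaling arithmetic, shown on plain integers instead of
+opnds. The real cg emits P1 instructions; this sketch shows only the
+values those instructions compute, and both helper names are
+hypothetical.
+
+```scheme
+(define (scaled-offset index elem-size)     ; p + index, as a byte offset
+  (* index elem-size))
+
+(define (ptr-diff-elems a b elem-size)      ; p - q, as an element count
+  (quotient (- a b) elem-size))
+
+(scaled-offset 3 4)        ; int *p;      p + 3  => 12 bytes
+(ptr-diff-elems 48 16 8)   ; long *p, *q; p - q  => 4 elements
+```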
+
+cg never:
+- auto-loads an lvalue
+- auto-promotes
+- auto-converts arguments
+- looks at fn-ctx return type (parser passes the cast)
+
+This split keeps cg under ~600 LOC by pushing all "C language"
+knowledge into parse.
+
+## 5. Symbol-to-label mangling
+
+Three label namespaces in the emitted P1pp:
+
+### 5.1 User globals (functions, variables)
+
+```
+C identifier "foo" → P1pp label :cc__foo
+```
+
+Verbatim concatenation. C identifiers can't contain `:` or other
+P1pp-special characters, so no escaping is needed. The `cc__` prefix
+guarantees no collision with libp1pp internals (`libp1pp__*`),
+backend stubs (`_start`, `p1_main`, `sys_*`), or our own runtime
+support.
+
+`static` storage at file scope changes nothing about the label —
+since we have one TU, internal-linkage and external-linkage symbols
+share the same namespace. `static` only suppresses any future
+"export to other TUs" emission, which we don't do anyway.
+
+### 5.2 String pool
+
+```
+n-th distinct string literal → :cc__str_<n>
+```
+
+`<n>` is a fresh decimal counter starting at 0, advanced only on
+non-deduplicating insert. Identical string literals share a label
+(idempotent intern).
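+
+A functional sketch of the intern rule. For brevity it uses Scheme
+strings where the real pool holds bytevectors, omits the leading `:`
+label sigil, and derives `<n>` from the pool length, which is equivalent
+to a counter that advances only on a fresh insert. `intern-string` is a
+hypothetical name.
+
+```scheme
+;; pool: alist (content . label). Returns (label . pool').
+(define (intern-string content pool)
+  (let ((hit (assoc content pool)))
+    (if hit
+        (cons (cdr hit) pool)               ; duplicate: reuse label, pool unchanged
+        (let ((label (string-append "cc__str_"
+                                    (number->string (length pool)))))
+          (cons label (cons (cons content label) pool))))))
+
+(define r1 (intern-string "hi" '()))        ; (car r1) => "cc__str_0"
+(define r2 (intern-string "yo" (cdr r1)))   ; (car r2) => "cc__str_1"
+(define r3 (intern-string "hi" (cdr r2)))   ; (car r3) => "cc__str_0"
+```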
+
+### 5.3 Function-internal labels
+
+Inside `%fn(...)`, libp1pp's `%scope` mechanism rewrites short
+labels (`::ret`, `::lbl_42`) to `<fnname>__ret`, `<fnname>__lbl_42`
+at M1pp time. cg uses short labels exclusively inside fn-buf:
+
+- `::ret` — single function exit
+- `::lbl_<n>` — anonymous control-flow targets (for switch cases,
+ short-circuit eval, etc.)
+- `::user_<name>` — user-written `goto` labels (C `myloop:` →
+ `::user_myloop`). The `user_` prefix prevents collisions with our
+ `lbl_` and `ret` labels.
+
+Loop tags (for libp1pp's `%loop_tag`, `%break(tag)`, `%continue(tag)`)
+are not labels — they're macro-name fragments. cg generates them as
+`L<n>` (no `cc__` prefix; tag namespace is per-function and `%fn`
+already scopes them).
+
+### 5.4 Entry stub
+
+cg emits a small entry stub at `cg-finish` time:
+
+```
+%fn(p1_main, %p1_main_f.SIZE, {
+ ; argc = a0, argv = a1 already
+ %la_br(&cc__main)
+ %call
+ %eret
+})
+```
+
+So `int main(int argc, char **argv)` is reached from the P1
+program-entry contract.
+
+If the user's `main` has no parameters, the stub still passes
+argc/argv — main just ignores them, which is harmless. If the user
+defines `main` with a different return type, that's a CC.md
+violation; cg can either die or emit and let the cast happen at the
+return site. Recommend: parser checks at `cg-fn-end` time when
+fn-name == `main`.
+
+## 6. Phase-1 milestone
+
+```c
+int main(int argc, char **argv) {
+ return argc;
+}
+```
+
+This is the integration target every engineer aims at. It goes
+through:
+
+- **lex**: `int`, `main`, `(`, `int`, `argc`, `,`, `char`, `*`, `*`,
+ `argv`, `)`, `{`, `return`, `argc`, `;`, `}` — all PUNCT and KW
+ symbols touched are core; covers two of each kind.
+- **pp**: zero directives, but full token-list traversal.
+- **parse**: function definition, two-parameter list including
+ `char **`, compound stmt, return stmt, identifier expression,
+ lval→rval load.
+- **cg**: `cg-fn-begin` with two params, parameter spilling
+ (both register-passed, in `a0` and `a1`), `cg-push-sym`,
+ `cg-load`, `cg-return`, `cg-fn-end`, `%fn` wrapping, and the
+ `p1_main` entry stub.
+- **e2e**: link with arch backend + libp1pp; run as native ELF;
+ exit code matches argc.
+
+Acceptance test: `tests/cc-e2e/00-return-argc.c` exists, the make
+target builds it, and:
+
+```
+$ ./tests/cc-e2e/build/00-return-argc ; echo $? → 1
+$ ./tests/cc-e2e/build/00-return-argc a b ; echo $? → 3
+```
+
+When this test passes on aarch64, amd64, and riscv64, phase 1 is
+complete.
+
+## Change protocol
+
+Anyone proposing a contract change:
+
+1. PR amends this doc first, with rationale.
+2. Affected modules + tests are listed.
+3. Changes land in one PR (doc + all affected code) so no engineer
+ pulls a half-migrated tree.
diff --git a/docs/CC-INTERNALS.md b/docs/CC-INTERNALS.md
@@ -0,0 +1,726 @@
+# lispcc internals
+
+Companion to [CC.md](CC.md). CC.md says what we accept; this doc says
+how the implementation is organized so engineers can split work and
+test independently.
+
+The compiler is one scheme1 program assembled from seven files (the
+six modules plus `main.scm`) at build time:
+
+```
+build: catm cc/util.scm cc/data.scm cc/lex.scm cc/pp.scm cc/cg.scm cc/parse.scm cc/main.scm > cc/cc.scm
+run: scheme1 < cc/cc.scm -- input.flat.c output.P1pp
+```
+
+(Driver shell-script details — argv plumbing, scheme1 prelude prepend
+— belong in scripts/, not in this doc.)
+
+## Module DAG
+
+```
+util ────────┐
+ │
+data ────────┼─► lex ──► pp ──► parse ─► main
+ │ │
+ └──────────────────────► cg ──► main
+```
+
+- **util.scm** — leaf helpers; depends only on the scheme1 prelude.
+- **data.scm** — record-type definitions used across modules.
+- **lex.scm** — bytestream → token list. Pure function.
+- **pp.scm** — token list → expanded token list. Pure function.
+- **cg.scm** — codegen state and emission API. Mutates a `cg` record.
+- **parse.scm** — token list + cg → P1pp output. Mutates `pstate` and
+ drives `cg`.
+- **main.scm** — argv handling, file I/O, ties phases together.
+
+Cycles are forbidden. parse.scm calls cg.scm but never the reverse.
+
+## Conventions
+
+- **Naming**: every public function and accessor is prefixed by its
+ module or record. `lex-tokenize`, `pp-expand`, `cg-push-imm`,
+ `parse-translation-unit`, `tok-kind`, `ctype-size`. Internal helpers
+ use a leading `%` (e.g. `%pp-expand-line`).
+- **Record constructors**: `%record-name` is the all-fields raw
+ constructor; named factory functions like `make-tok` wrap it when
+ defaults are useful.
+- **Mutators**: `field-set!` form, e.g. `cg-out-set!`, matching
+ scheme1/prelude.scm.
+- **Bytevectors as strings**: every "string" in this codebase is a
+ bytevector. We never use `symbol->string` for runtime data — symbols
+ are reserved for the small fixed alphabets (token kinds, ctype
+ kinds, opnd kinds).
+- **Errors**: `(die loc msg . irritants)` writes a diagnostic to fd 2
+ and `sys-exit`s with status 1. No exceptions, no recovery. Every
+ module uses the same `die`.
+- **No global state**: every long-lived datum lives in a record passed
+ explicitly. The only top-level definitions are functions, constants,
+ and the keyword/punctuator alists.
+
+## Errors
+
+```scheme
+(die loc msg . irritants) ; abort with formatted diagnostic
+(loc-of-tok tok) -> loc ; pull file/line/col out of a tok
+(loc file line col) -> loc ; construct manually (used by lexer)
+```
+
+Diagnostic format on stderr:
+
+```
+<file>:<line>:<col>: error: <msg>: <irritants display'd one-by-one>
+```
+
+`irritants` are written via `display` (no quoting). Pairs and integers
+print naturally; bytevectors print as their byte content.
+
+`die` is the only failure path. There is no warning level — anything
+worth saying aborts.
+
+## util.scm
+
+Bytevector and list helpers that scheme1/prelude.scm doesn't already
+provide.
+
+```scheme
+;; bytevector
+(bv= a b) -> bool ; alias for bytevector=?, terser
+(bv-prefix? p s) -> bool ; does s start with p?
+(bv-find bv byte from) -> idx-or-#f ; first occurrence at >= from
+(bv-slice bv start end) -> bv ; fresh copy
+(bv-of-string str) -> bv ; literal helper for inline ASCII
+(bv-of-byte b) -> bv ; 1-byte bv
+(bv-cat lst-of-bv) -> bv ; concat list, single allocation
+(bv->fixnum bv radix) -> (ok . val) ; (ok . #f) on parse fail
+(fixnum->bv n radix) -> bv
+
+;; lists / alists
+(alist-ref key al) -> val-or-#f
+(alist-ref/eq key al) -> val-or-#f ; eq? compare (for symbol keys)
+(alist-set key val al) -> al' ; cons new pair on the front
+(alist-update key f al) -> al' ; functional update
+(any p xs) -> bool
+(every p xs) -> bool
+(count p xs) -> fixnum
+
+;; ints
+(min3 a b c) -> fixnum
+(align-up n k) -> fixnum ; round n up to multiple of k
+
+;; output buffers (reversed list of bv chunks; flush builds in one pass)
+(make-buf) -> buf ; record: { chunks }
+(buf-push! buf bv) ; cons bv onto chunks
+(buf-flush buf) -> bv ; reverse + bv-cat
+
+;; diagnostics + I/O
+(die loc msg . irritants) -> never returns
+(slurp-fd fd) -> bv ; read until EOF
+(write-bv-fd fd bv) ; full write or die
+
+;; fresh names
+(make-namer prefix) -> proc ; (proc) -> bv "prefix0", "prefix1", ...
+```
+
+`make-buf` and `buf-push!` are the universal output primitives: both
+cg's three output streams and pp's expansion staging buffer use them.
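+
+One way the three buffer operations could look, as a sketch in R7RS
+Scheme. A one-element list stands in for the `buf` record, and R7RS
+`bytevector-append` stands in for `bv-cat`; scheme1's actual record and
+bytevector primitives may differ.
+
+```scheme
+(define (make-buf) (list '()))              ; car = reversed chunk list
+
+(define (buf-push! buf bv)                  ; O(1) per chunk
+  (set-car! buf (cons bv (car buf))))
+
+(define (buf-flush buf)                     ; reverse once, concatenate once
+  (apply bytevector-append (reverse (car buf))))
+
+(define b (make-buf))
+(buf-push! b (bytevector 104 105))   ; "hi"
+(buf-push! b (bytevector 10))        ; "\n"
+(buf-flush b)                        ; => #u8(104 105 10)
+```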
+
+## data.scm
+
+All record types used by more than one module.
+
+### loc
+
+```scheme
+(define-record-type loc
+ (%loc file line col)
+ loc?
+ (file loc-file) ; bv
+ (line loc-line) ; fixnum
+ (col loc-col)) ; fixnum
+```
+
+### tok
+
+```scheme
+(define-record-type tok
+ (%tok kind value loc hide)
+ tok?
+ (kind tok-kind) ; symbol; see kinds table
+ (value tok-value) ; varies by kind
+ (loc tok-loc) ; loc
+ (hide tok-hide)) ; list of bv (macro names already expanded)
+```
+
+| `kind` | `value` shape |
+|---------|--------------------------------------------------|
+| `IDENT` | bv (identifier name) |
+| `KW` | symbol (one of `if while ... typedef`) |
+| `INT` | fixnum (post-suffix integer value) |
+| `STR` | bv (raw bytes, no NUL terminator) |
+| `CHAR` | fixnum (value 0..255) |
+| `PUNCT` | symbol per CC-CONTRACTS §1.2 (`plus`, `eq2`, …) |
+| `HASH` | `#f` (preprocessor only; line-leading `#`) |
+| `NL` | `#f` (significant only inside the preprocessor) |
+| `EOF` | `#f` |
+
+`make-tok` is the canonical constructor; pass `hide = '()` for
+freshly-lexed tokens.
+
+### macro
+
+```scheme
+(define-record-type macro
+ (%macro kind params body)
+ macro?
+ (kind macro-kind) ; 'obj | 'fn | 'fn-vararg
+ (params macro-params) ; list of bv
+ (body macro-body)) ; list of tok
+```
+
+### ctype
+
+```scheme
+(define-record-type ctype
+ (%ctype kind size align ext)
+ ctype?
+ (kind ctype-kind) ; 'void 'i8 'i16 'i32 'i64 'u8 'u16 'u32 'u64
+ ; 'bool 'ptr 'arr 'fn 'struct 'union 'enum
+ (size ctype-size) ; fixnum bytes; -1 = incomplete
+ (align ctype-align) ; fixnum bytes
+ (ext ctype-ext)) ; payload, kind-specific
+```
+
+ext payload by kind:
+
+| `kind` | `ext` shape |
+|-------------------|----------------------------------------------------------------------|
+| `void` / int / `bool` | `#f` |
+| `ptr` | pointee ctype |
+| `arr` | `(elem-ctype . length-or--1)` |
+| `fn` | `(ret-ctype params variadic?)` — `params` is list of ctype |
+| `struct`/`union` | `(tag-bv complete? fields)` — `fields` = `((name-bv ctype offset) ...)` |
+| `enum` | `(tag-bv ((const-name-bv . value) ...))` |
+
+Primitive ctypes are interned at startup as top-level bindings:
+`%t-void`, `%t-i8`, `%t-i32`, `%t-i64`, `%t-u8`, `%t-u32`, `%t-u64`,
+`%t-bool`, `%t-char-ptr`. Equality of primitive ctypes is `eq?`.
+Derived types are *not* deduped.
+
+### sym
+
+```scheme
+(define-record-type sym
+ (%sym name kind storage type slot)
+ sym?
+ (name sym-name) ; bv
+ (kind sym-kind) ; 'var 'fn 'typedef 'enum-const 'param 'label
+ (storage sym-storage) ; 'auto 'static 'extern 'register | #f for typedef/enum-const/label
+ (type sym-type) ; ctype
+ (slot sym-slot)) ; locals/params: fixnum byte offset
+ ; globals/fn: bv emitted-label
+ ; enum-const: fixnum value
+ ; typedef: #f
+ ; label: bv P1pp local label
+```
+
+### opnd
+
+```scheme
+(define-record-type opnd
+ (%opnd kind type ext lval?)
+ opnd?
+ (kind opnd-kind) ; 'imm 'frame 'global 'reg | sub-cases below
+ (type opnd-type) ; ctype
+ (ext opnd-ext) ; per kind
+ (lval? opnd-lval?)) ; #t = the slot/place holds an *address*
+ ; #f = the slot/place holds a *value*
+```
+
+ext by kind:
+
+| `kind` | `ext` |
+|----------|------------------------------------------------|
+| `imm` | fixnum literal value (lval? always `#f`) |
+| `frame` | fixnum byte offset in current function frame |
+| `global` | bv label name (the symbol's emitted P1pp label)|
+| `reg` | symbol `'a0`..`'a3` |
+
+`reg` is transient — used for the result of a call before it's
+spilled. The vstack itself holds only `imm`, `frame`, and `global`
+opnds.
+
+### pstate
+
+```scheme
+(define-record-type pstate
+ (%pstate toks scope tags loops fn-ctx typedefs cg)
+ pstate?
+ (toks ps-toks ps-toks-set!) ; remaining tok list (head = lookahead)
+ (scope ps-scope ps-scope-set!) ; list of alists: (bv . sym)
+ (tags ps-tags ps-tags-set!) ; list of alists: (bv . ctype) for struct/union/enum
+ (loops ps-loops ps-loops-set!) ; list of loop-ctx (break/continue stack)
+ (fn-ctx ps-fn-ctx ps-fn-ctx-set!) ; current function record, or #f at file scope
+ (typedefs ps-typedefs ps-typedefs-set!) ; flat alist (bv . #t) — fast typedef-name lookup
+ (cg ps-cg)) ; codegen state (not mutated through pstate)
+```
+
+`typedefs` is a separate flat alist (not derived from `scope` at
+lookup time) because the lexer-vs-parser distinction in C requires
+fast "is this identifier a typedef-name *now*" answers during
+declaration parsing. Updated in lockstep with `scope`.
+
+### loop-ctx
+
+```scheme
+(define-record-type loop-ctx
+ (%loop-ctx kind tag has-continue?)
+ loop-ctx?
+ (kind loop-ctx-kind) ; 'while 'do 'for 'switch
+ (tag loop-ctx-tag) ; bv tag for libp1pp %_tag macros
+ (has-continue? loop-ctx-has-continue?)) ; #f for switch
+```
+
+### fn-ctx
+
+```scheme
+(define-record-type fn-ctx
+ (%fn-ctx name return-type params variadic? labels)
+ fn-ctx?
+ (name fn-ctx-name) ; bv
+ (return-type fn-ctx-return-type) ; ctype
+ (params fn-ctx-params) ; list of sym
+ (variadic? fn-ctx-variadic?)
+ (labels fn-ctx-labels ; alist (user-bv . emitted-bv)
+ fn-ctx-labels-set!)) ; mutated as `goto`/labels resolve
+```
+
+### cg
+
+```scheme
+(define-record-type cg
+ (%cg text data bss vstack frame-hi label-ctr str-pool globals fn-buf)
+ cg?
+ (text cg-text) ; buf — final text section (all functions)
+ (data cg-data) ; buf — initialized globals + string pool
+ (bss cg-bss) ; buf — zero-init globals
+ (vstack cg-vstack cg-vstack-set!) ; list of opnd
+ (frame-hi cg-frame-hi cg-frame-hi-set!) ; fixnum: frame bytes used so far
+ (label-ctr cg-label-ctr cg-label-ctr-set!)
+ (str-pool cg-str-pool cg-str-pool-set!) ; alist (bv-content . bv-label)
+ (globals cg-globals cg-globals-set!) ; alist (bv-name . sym) — emitted globals
+ (fn-buf cg-fn-buf cg-fn-buf-set!)) ; buf — body of current function
+```
+
+Functions are emitted into `fn-buf`, then on `cg-fn-end` flushed
+through libp1pp's `%fn(...)` wrapper into `text`. This lets us know
+the final `frame-hi` before writing the prologue.
+
+## lex.scm
+
+Pure: bytestream + filename → token list. No file I/O, no macro
+awareness.
+
+```scheme
+(lex-tokenize src file) -> (list-of tok)
+;; src: bv (the C bytestream)
+;; file: bv (filename, for tok-loc)
+;; result: list ending in a single 'EOF tok; never #f
+;; aborts via die on unrecognized byte sequences
+```
+
+Internal contract:
+
+- Comments (`/* ... */`, `// ...`) are removed and produce no tokens.
+- Trigraphs and `\<newline>` line splicing are applied before
+ tokenization.
+- `NL` tokens are emitted at the end of every source line, even
+ blank ones — the preprocessor needs them for directive termination.
+- Adjacent string literals are **not** concatenated here; that's
+ pp's job (after macro expansion).
+- Keyword recognition: the lexer carries a fixed alist of
+ (keyword-bv . keyword-symbol) and emits `KW` directly for matches.
+ IDENT vs. KW is decided at lex time and never revised.
+- Punctuator longest-match table: `<<=` before `<<` before `<`,
+ etc.
+
+Helpers exposed for unit tests:
+
+```scheme
+(lex-read-number src pos) -> (tok . pos')
+(lex-read-string src pos file) -> (tok . pos')
+(lex-read-ident src pos) -> (tok . pos') ; produces IDENT or KW
+```
+
+### Test plan
+
+`tests/cc-lex/` mirrors `tests/scheme1/`. Each test is one `.c`
+fragment plus an `.expected-toks` file containing the expected
+serialized form, one tok per line in the CC-CONTRACTS §2.1 format
+(e.g. `(KW int "t.c" 1 1)`). Driver script:
+
+```
+scheme1 cc/lex-test.scm < input.c | diff - expected-toks
+```
+
+## pp.scm
+
+Pure: token list + initial macro alist → expanded token list.
+
+```scheme
+(pp-expand toks initial-defines) -> (list-of tok)
+;; toks: lex-tokenize output
+;; initial-defines: alist (bv . macro) — from -D flags
+;; result: token list with HASH and NL stripped, KW/IDENT/INT/STR/CHAR/PUNCT/EOF only
+;; aborts via die on a bad directive, an undefined identifier in #if, etc.
+```
+
+Internal structure:
+
+- A driver loop classifies each line (looking only at the leading
+ HASH or non-HASH state) and dispatches to a directive handler or
+ the macro-expansion engine.
+- The conditional stack: list of `(active? . has-taken?)` pairs.
+ `#if`/`#ifdef`/`#ifndef` push; `#elif`/`#else` flip; `#endif` pops.
+- Macro expansion uses C11 6.10.3.4 hide-set discipline. Each
+ emitted token's `hide` field is the union of hide-sets of its
+ constituent body tokens plus the macro's own name.
+- `defined NAME` is a special form — recognized and resolved
+ *before* macro expansion of an `#if` line.
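+
+One possible reading of the push/flip/pop rules for the conditional
+stack, as a sketch in portable Scheme. The `cond-*` names are
+hypothetical, `test` is the already-evaluated directive condition, and
+the handling of groups nested inside a dead region is an assumption
+the text does not spell out.
+
+```scheme
+;; stack: list of (active? . has-taken?) pairs, innermost first.
+(define (cond-active? stack)
+  (or (null? stack) (car (car stack))))
+
+(define (cond-push stack test)              ; #if / #ifdef / #ifndef
+  (if (cond-active? stack)
+      (cons (cons test test) stack)
+      (cons (cons #f #t) stack)))           ; dead region: no branch may activate
+
+(define (cond-elif stack test)              ; #elif
+  (let ((taken (cdr (car stack))))
+    (cons (cons (and (not taken) test) (or taken test))
+          (cdr stack))))
+
+(define (cond-else stack)                   ; #else
+  (let ((taken (cdr (car stack))))
+    (cons (cons (not taken) #t) (cdr stack))))
+
+(define (cond-pop stack) (cdr stack))       ; #endif
+
+(define s1 (cond-push '() #f))   ; #if 0    => ((#f . #f))
+(define s2 (cond-elif s1 #t))    ; #elif 1  => ((#t . #t))
+(define s3 (cond-else s2))       ; #else    => ((#f . #t))
+(cond-pop s3)                    ; #endif   => ()
+```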
+
+Directive handlers (each takes the tokens up to the next NL):
+
+```scheme
+(%pp-do-define line state)
+(%pp-do-undef line state)
+(%pp-do-if line state)
+(%pp-do-ifdef line state)
+(%pp-do-ifndef line state)
+(%pp-do-elif line state)
+(%pp-do-else line state)
+(%pp-do-endif line state)
+(%pp-do-error line state)
+(%pp-do-line line state)
+(%pp-do-pragma line state)
+(%pp-do-include line state) ;; always dies (per CC.md §Toolchain envelope)
+```
+
+`state` is a private record `pp-state` with fields
+`{macros cond-stack out file-base-line}`. Internal — not in data.scm.
+
+Constant-expression evaluator for `#if`:
+
+```scheme
+(pp-eval-cexpr toks macros) -> fixnum
+;; toks: tokens of the expression after macro expansion
+;; aborts on unrecognized identifier (treated-as-zero is wrong for our errors policy)
+```
+
+### Test plan
+
+`tests/cc-pp/` per-file tests, same shape as cc-lex. Inputs are
+already-tokenized fixtures (so pp can be exercised without the lexer)
+or `.c` files (full pipeline). Both flavors live side-by-side.
+
+## cg.scm
+
+Mutable codegen state and a value-stack-style emission API. The parser
+calls these and never touches output buffers directly.
+
+### Lifecycle
+
+```scheme
+(cg-init) -> cg ; fresh state
+(cg-finish cg) -> bv ; flush all buffers; result is final P1pp text
+```
+
+Every function lives between `cg-fn-begin` and `cg-fn-end`. Globals
+live outside.
+
+```scheme
+(cg-fn-begin cg name params return-type)
+;; name: bv
+;; params: list of sym (their slots are pre-assigned by parser)
+;; return-type: ctype
+(cg-fn-end cg) ; emits epilogue, flushes fn-buf into cg-text under %fn(...)
+```
+
+### Vstack: push / pop / inspect
+
+```scheme
+(cg-push cg opnd) ; push opnd onto vstack
+(cg-pop cg) -> opnd ; remove and return top
+(cg-top cg) -> opnd ; non-destructive
+(cg-depth cg) -> fixnum
+```
+
+### Materialize
+
+```scheme
+(cg-push-imm cg ctype value) -> opnd ; rval
+(cg-push-string cg bv-content) -> opnd ; rval, char* — interns into str-pool
+(cg-push-sym cg sym) -> opnd ; lval (var) or rval (fn name)
+(cg-push-deref cg) -> opnd ; pop ptr-rval, push lval (no emission yet)
+```
+
+### Address & deref operators
+
+```scheme
+(cg-take-addr cg) -> opnd ; pop lval, push its address as rval
+(cg-load cg) -> opnd ; pop lval, push rval (loaded through address)
+```
+
+### Type conversions
+
+```scheme
+(cg-cast cg to-type) -> opnd ; pop, push opnd cast to to-type; emits sign-extension etc. as needed
+(cg-promote cg) -> opnd ; integer promotion (rank ≤ int → int)
+(cg-arith-conv cg) ; usual arithmetic conversions on top two opnds
+```
+
+### Operators
+
+```scheme
+(cg-binop cg op) -> opnd ; pop b, pop a, push (a op b)
+(cg-unop cg op) -> opnd ; pop a, push (op a)
+(cg-assign cg) -> opnd ; pop rval, pop lval, store, push rval (assignment yields the value)
+```
+
+`op` for binop is a symbol from:
+`'add 'sub 'mul 'div 'rem 'and 'or 'xor 'shl 'shr 'eq 'ne 'lt 'le 'gt 'ge`.
+For unop: `'neg 'bnot 'lnot`.
+
+Signed vs. unsigned dispatch is handled inside cg by inspecting the
+operand types after `cg-arith-conv`.
+
+### Calls
+
+```scheme
+(cg-call cg arity has-result?) -> opnd-or-#f
+;; pops (arity + 1) opnds: function, then args left-to-right at top
+;; (i.e. callable was pushed first, then args, so arg-N is on top)
+;; emits arg-passing per P1 ABI, CALL or CALLR, captures return into a fresh frame slot
+;; pushes result opnd; returns it (or #f if has-result? is #f)
+```
+
+### Structured control flow
+
+These take thunks so the parser can recursively emit the body:
+
+```scheme
+(cg-if cg then-thunk) ; pop cond; emit %if_nez { (then-thunk) }
+(cg-ifelse cg then-thunk else-thunk) ; pop cond; emit %ifelse_nez { ... }{ ... }
+(cg-loop cg head-thunk body-thunk) -> tag ; head-thunk emits the cond test; tag returned to parser
+(cg-loop-end cg tag) ; closes a %loop_tag
+(cg-break cg tag)
+(cg-continue cg tag)
+```
+
+For `for` and `do-while` the parser composes the building blocks
+above; cg doesn't expose a dedicated for/do helper. (Three helpers
+beat seven.)
+
+### switch helpers
+
+```scheme
+(cg-switch-begin cg) -> swctx ; spill controlling expression to a slot
+(cg-switch-case cg swctx const-int) ; emit %if_eq jump-to-case-label
+(cg-switch-default cg swctx) ; emit B to default-label
+(cg-switch-end cg swctx) ; close out
+```
+
+### Globals and data
+
+```scheme
+(cg-emit-global cg sym init-bv-or-#f) ; init-bv: bytes for .data, or #f for .bss
+(cg-emit-extern cg sym) ; declare without defining
+(cg-intern-string cg bv-content) -> bv-label ; idempotent; used internally by cg-push-string
+```
+
+### Frame allocation (used internally and by parse for locals)
+
+```scheme
+(cg-alloc-slot cg bytes align) -> offset ; bumps frame-hi; returns aligned offset
+```
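+
+A minimal sketch of the bump-and-align arithmetic, assuming the frame
+grows upward from offset 0 and `align` is a positive integer. The
+names `%align-up` and `%alloc-slot` are hypothetical, and the real
+`cg-alloc-slot` mutates `cg` rather than returning the new high-water
+mark.
+
+```scheme
+;; Round n up to the next multiple of align.
+(define (%align-up n align)
+  (* align (quotient (+ n (- align 1)) align)))
+
+;; Returns (offset . new-frame-hi) for a slot of the given size.
+(define (%alloc-slot frame-hi bytes align)
+  (let ((offset (%align-up frame-hi align)))
+    (cons offset (+ offset bytes))))
+```
+
+Under these assumptions a `char` slot followed by a `long` slot lands
+at offsets 0 and 8: `(%alloc-slot 0 1 1)` gives `(0 . 1)`, and
+`(%alloc-slot 1 8 8)` gives `(8 . 16)`.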
+
+### Interaction-test mock
+
+`tests/cc-parse/` uses a swap-in `cg-trace.scm` that replaces cg.scm.
+It provides every public entry point above but each call appends a
+record to a global trace list:
+
+```scheme
+(cg-trace-get) -> (list-of (op . args))
+```
+
+Parser tests run a fragment, snapshot the trace, diff against
+`expected-trace`. This is the contract that lets parse and cg evolve
+independently — as long as parse emits the same sequence of cg calls,
+cg internals can change freely.
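+
+A cut-down sketch of how such a mock could record calls, covering
+only two entry points. `%trace` and `%record!` are hypothetical
+helper names, and unlike the real entry points these stubs return no
+opnd.
+
+```scheme
+(define %trace '())                     ; newest call first
+
+(define (%record! op . args)
+  (set! %trace (cons (cons op args) %trace)))
+
+;; Same signatures as cg.scm, but they only log the call.
+(define (cg-push-imm cg ctype value) (%record! 'cg-push-imm ctype value))
+(define (cg-binop cg op) (%record! 'cg-binop op))
+
+(define (cg-trace-get) (reverse %trace)) ; oldest call first
+
+;; What a parser might issue for `42 + 7` (cg argument unused here):
+(cg-push-imm #f 'i32 42)
+(cg-push-imm #f 'i32 7)
+(cg-binop #f 'add)
+;; (cg-trace-get) => ((cg-push-imm i32 42) (cg-push-imm i32 7) (cg-binop add))
+```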
+
+## parse.scm
+
+Mutates a `pstate`. Drives `cg`. Single entry point:
+
+```scheme
+(parse-translation-unit ps) ; consumes ps-toks until EOF, emits via ps-cg
+;; ps must have ps-toks set; everything else starts empty.
+```
+
+### Top-down structure
+
+Internal helpers, hierarchically:
+
+```
+parse-translation-unit
+ ├─ parse-decl-or-fn ; returns 'fn or 'decl
+ │ ├─ parse-decl-spec ; storage + qualifiers + base type
+ │ ├─ parse-declarator ; spiral grammar; returns (name ctype)
+ │ ├─ parse-init-list ; for variable initializers (incl. designated)
+ │ └─ parse-fn-body ; only if declarator is fn-typed AND next tok is `{`
+ ├─ parse-stmt
+ │ ├─ parse-compound-stmt
+ │ ├─ parse-if-stmt
+ │ ├─ parse-while-stmt
+ │ ├─ parse-do-stmt
+ │ ├─ parse-for-stmt
+ │ ├─ parse-switch-stmt
+ │ ├─ parse-return-stmt
+ │ ├─ parse-goto-stmt
+ │ └─ parse-expr-stmt
+ └─ parse-expr ; Pratt; takes min-bp
+ ├─ parse-primary
+ ├─ parse-postfix
+ ├─ parse-unary
+ ├─ parse-cast-or-unary ; `(T)e` vs. `(e)`
+ └─ parse-binary-rhs
+```
+
+### Token-stream API (private to parse.scm but conventional)
+
+```scheme
+(peek ps) -> tok
+(peek2 ps) -> tok ; second token ahead (the one after peek)
+(advance ps) -> tok ; consume and return
+(expect-kw ps sym) -> tok ; KW match or die
+(expect-punct ps sym) -> tok ; PUNCT match or die
+(at-kw? ps sym) -> bool
+(at-punct? ps sym) -> bool
+```
+
+`peek2` is needed exactly twice: distinguishing
+`(typename) cast-expr` from `(expr)` parenthesized expression, and
+distinguishing labelled-statement `IDENT :` from expression-statement
+`IDENT ...`.
+
+### Scope helpers
+
+```scheme
+(scope-enter! ps)
+(scope-leave! ps)
+(scope-bind! ps name sym) ; aborts on duplicate at innermost frame
+(scope-lookup ps name) -> sym-or-#f ; walks frames
+
+(tag-bind! ps name ctype)
+(tag-lookup ps name) -> ctype-or-#f
+
+(typedef-add! ps name) ; updates ps-typedefs
+(typedef? ps name) -> bool
+```
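+
+One possible shape for the frame walk, written as pure functions
+over a list of alists. The `frames-*` names are hypothetical; names
+are compared with `assq`, which assumes they are symbols (the real
+code would need a bytevector comparison if names are `bv`s), and the
+duplicate case returns `#f` where `scope-bind!` would abort.
+
+```scheme
+;; frames: list of alists, innermost scope first
+(define (frames-enter frames) (cons '() frames))
+(define (frames-leave frames) (cdr frames))
+
+;; Bind in the innermost frame; #f signals a duplicate there.
+(define (frames-bind frames name sym)
+  (if (assq name (car frames))
+      #f
+      (cons (cons (cons name sym) (car frames)) (cdr frames))))
+
+;; Walk outward; return the bound value or #f.
+(define (frames-lookup frames name)
+  (if (null? frames)
+      #f
+      (let ((hit (assq name (car frames))))
+        (if hit
+            (cdr hit)
+            (frames-lookup (cdr frames) name)))))
+```
+
+Shadowing falls out of the walk order: an inner binding of `x` is
+found before an outer one, and disappears again after `frames-leave`.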
+
+### Pratt expression parser
+
+The binding-power table is a top-level alist:
+
+```scheme
+(define %binop-bp
+ ;; (punct-symbol . (lhs-bp . rhs-bp))
+ ;; left-assoc: rhs-bp = lhs-bp + 1
+ ;; right-assoc: rhs-bp = lhs-bp - 1
+ '((|*| 110 . 111) (|/| 110 . 111) (|%| 110 . 111)
+ ...))
+```
+
+Driver:
+
+```scheme
+(parse-expr-bp ps min-bp) ; Pratt climber
+(parse-expr ps) ; equivalent to (parse-expr-bp ps 0)
+```
+
+`parse-expr` leaves the result on `ps-cg`'s vstack and returns the
+opnd. Statements that don't consume the value follow with `cg-pop`.
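+
+The climbing loop itself is small. The sketch below is a toy under
+stated assumptions: tokens are a plain list of fixnums and operator
+symbols, it builds a prefix tree instead of driving `cg`, and `%bp`
+and `climb` are hypothetical names. The `+`/`-` powers are made up
+for the example; only the `*`/`/` row mirrors the table above.
+
+```scheme
+;; (operator . (lhs-bp . rhs-bp)), all left-associative
+(define %bp '((+ 100 . 101) (- 100 . 101) (* 110 . 111) (/ 110 . 111)))
+
+;; Returns (tree . remaining-tokens).
+(define (climb toks min-bp)
+  (let loop ((lhs (car toks)) (rest (cdr toks)))
+    (let ((entry (and (pair? rest) (assq (car rest) %bp))))
+      (if (and entry (>= (cadr entry) min-bp))
+          ;; operator binds tightly enough: parse its right operand
+          ;; with rhs-bp as the new floor, then keep climbing
+          (let ((rhs (climb (cdr rest) (cddr entry))))
+            (loop (list (car entry) lhs (car rhs)) (cdr rhs)))
+          (cons lhs rest)))))
+
+;; (car (climb '(1 + 2 * 3) 0)) => (+ 1 (* 2 3))
+;; (car (climb '(8 - 3 - 2) 0)) => (- (- 8 3) 2)
+```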
+
+### Test plan
+
+`tests/cc-parse/` uses the cg-trace mock. Each test:
+
+```
+input.c -- C fragment
+expected-trace -- one cg call per line, e.g.
+ (cg-push-imm i32 42)
+ (cg-push-imm i32 7)
+ (cg-binop add)
+ ...
+```
+
+The driver builds a token list (via the real lex+pp) and runs
+`parse-translation-unit` against `cg-trace`. Any diff fails the test.
+
+## main.scm
+
+Driver. Roughly 80 lines.
+
+```scheme
+(define (cc-main argv)
+ (let* ((args (parse-cli argv)) ; record: { input-path output-path defines }
+ (src (slurp-fd (open-or-die (cli-input args))))
+ (toks (lex-tokenize src (cli-input args)))
+ (defines (cli-defines->alist (cli-defines args)))
+ (expanded (pp-expand toks defines))
+ (cg (cg-init))
+ (ps (make-pstate expanded cg)))
+ (parse-translation-unit ps)
+ (write-bv-fd (open-or-die-out (cli-output args))
+ (cg-finish cg))
+ 0))
+
+(cc-main (argv))
+```
+
+CLI: `cc input.flat.c -o output.P1pp [-D NAME[=val]] ...`. Strict —
+unrecognized flags die.
+
+## Test infrastructure
+
+Five test trees, all using the same harness pattern as
+`tests/scheme1/`:
+
+- `tests/cc-lex/` — feeds `.c` through `lex-tokenize`, diffs token
+ serialization.
+- `tests/cc-pp/` — feeds tokens (or `.c`) through `pp-expand`, diffs
+ token serialization.
+- `tests/cc-parse/` — feeds `.c` through lex+pp+parse with the cg-trace
+ mock, diffs the trace.
+- `tests/cc-cg/` — directly calls cg APIs (handwritten Scheme test
+ programs), diffs the resulting P1pp bytes.
+- `tests/cc-e2e/` — tiny `.c` programs compiled all the way through
+ the toolchain to native executables, run, exit-code checked.
+
+`tests/cc-parse/` and `tests/cc-cg/` are the seam that lets parse and
+cg evolve independently. Anyone changing parse can keep going as long
+as the trace tests stay green. Anyone changing cg can keep going as
+long as the cg tests stay green and the trace contract is honored.
+
+## Out-of-scope here
+
+Deferred to follow-up docs once we start coding:
+
+- Exact P1pp text emitted for each cg primitive (precise opcodes,
+ spill discipline, libp1pp macro choices) — lives next to cg.scm
+ as a side document.
+- The exact `<stdarg.h>` / `<stddef.h>` we ship — the headers we
+ bundle so the flattener has something to inline.
+- Driver script for the pre-flatten pass (host shell tool, not part
+ of the Scheme compiler).
+- Performance tuning: alist → tree, vstack list → array of frame
+ slots, etc. None of this affects the interfaces above.
diff --git a/docs/CC.md b/docs/CC.md
@@ -0,0 +1,446 @@
+# Minimal C subset (lispcc)
+
+Working doc. Baseline is C99; everything here is a delta against it. The
+target is **just enough C** to compile
+
+ `tcc-0.9.26-1147-gee75a10c/tcc.c`
+
+with the same defines used at MesCC's `tcc-mes` stage in
+[live-bootstrap](../../live-bootstrap/steps/tcc-0.9.26/pass1.kaem):
+
+```
+-D BOOTSTRAP=1
+-D HAVE_LONG_LONG=1
+-D ONE_SOURCE=1
+-D TCC_TARGET_X86_64=1
+-D inline=
+-D CONFIG_TCCDIR="..." ...etc
+```
+
+Notably **not** defined: `HAVE_FLOAT`, `HAVE_BITFIELD`, `HAVE_SETJMP`.
+Those gate off entire code paths in tcc.c (floats, bitfield struct
+support, setjmp-based error recovery), and we don't have to compile any
+of it.
+
+The accepted surface is shaped by two intersecting constraints:
+
+1. **Lower bound** — what tcc.c (under those defines) actually uses.
+2. **Upper bound** — what MesCC accepts, since MesCC already builds
+ tcc-mes and we're its replacement. Anything MesCC strips silently
+ (`const`, `inline`, `__attribute__`) we also strip silently.
+
+Things outside both bounds are cut. Things admitted are load-bearing.
+
+## Scope
+
+- **Single translation unit.** Input is one bytestream. The
+ preprocessor does no file I/O: `#include` is resolved by an external
+ pre-flattening pass (system headers + tcc.c's `#include "libtcc.c"`
+ / `"tcctools.c"` are spliced upstream of our compiler). See
+ [§Toolchain envelope](#toolchain-envelope).
+- **P1-64 only.** Sizes assume LP64. Porting to P1-32 is out of scope.
+- **No optimization.** Output P1pp is a stack-machine lowering with
+ every operand spilled to a frame slot. Codegen quality is a v2
+ problem.
+
+## Toolchain envelope
+
+```
+tcc.c + system headers
+ │
+ │ pre-flatten: resolve #include recursively, splice into one file
+ │ (separate tool: scheme1 or shell; not part of cc.scm)
+ ▼
+tcc.flat.c single bytestream, no #include
+ │
+ │ scheme1 cc.scm
+ ▼
+tcc.P1pp our compiler's output
+ │
+ │ catm with arch backend + libp1pp.P1pp
+ │ m1pp
+ ▼
+tcc.M1
+ │ M0
+ ▼
+tcc.hex2
+ │ hex2
+ ▼
+tcc-mes native ELF, replaces MesCC's tcc-mes
+```
+
+The pre-flatten pass is *not* a C preprocessor — it only resolves
+`#include`. All other directives (`#define`, `#if`, …) are handled by
+the in-Scheme preprocessor in the next pass (`scheme1 cc.scm`).
+
+## Translation phases
+
+The C standard names eight phases. We collapse them to three:
+
+1. **Lex** — bytestream → token list. Trigraphs and line-splicing
+ (backslash-newline) are handled here, alongside numbers / strings /
+ identifiers / punctuators. Comments removed. Newlines preserved as
+ `NL` tokens (the preprocessor needs them to delimit directives).
+2. **Preprocess** — token list → expanded token list. Directives
+ consumed, macros expanded, `NL` tokens stripped on exit.
+3. **Parse + emit** — token list → P1pp text. xcc-style direct emit;
+ no AST.
+
+## Lexical syntax
+
+Subset of C99 lexical grammar.
+
+- **Identifiers**: `[a-zA-Z_][a-zA-Z_0-9]*`. Universal character names
+ (`\uXXXX`) **not** supported.
+- **Integers**: decimal, octal (`0…`), hex (`0x…`); suffixes
+ `u`, `U`, `l`, `L`, `ll`, `LL`, `ul`, `ull`, etc. (case-insensitive).
+ All values fit in `unsigned long long` (64 bits).
+- **Floats**: **not** present. The lexer rejects floating-point
+ literals. (HAVE_FLOAT is off.)
+- **Characters**: `'c'` and standard escapes `\n \t \r \\ \' \" \0
+ \xNN \NNN`. `'\xNN'` is a `char`-typed value, not multi-character.
+ Multi-character constants (`'AB'`) are **not** supported.
+- **Strings**: `"…"` with same escapes. Adjacent string literals
+ concatenate (`"a" "b"` ≡ `"ab"`). Wide strings (`L"…"`), UTF-8
+ strings (`u8"…"`), UTF-16/32 (`u"…"`, `U"…"`) **not** supported.
+- **Punctuators**: full C99 set, including digraphs `<: :> <% %> %:`.
+ Trigraphs are handled in lex. `##` and `#` are preprocessor-only.
+- **Comments**: `// …` to end of line; `/* … */` block (no nesting).
+- **Line splicing**: `\` immediately before newline removes both,
+ per the standard.
+- **Whitespace**: space, tab, vertical tab, form feed, newline.
+
+## Preprocessor
+
+Directive set:
+
+- `#define NAME …` — object-like
+- `#define NAME(p1, p2, …) …` — function-like
+- `#define NAME(p1, …, ...) …` — variadic, with `__VA_ARGS__` in body
+- `#undef NAME`
+- `#if expr`, `#ifdef NAME`, `#ifndef NAME`
+- `#elif expr`, `#else`, `#endif`
+- `#error msg…` — flush and exit nonzero
+- `#line NN ["file"]` — accepted; only `__LINE__` / `__FILE__` honor it
+- `#pragma …` — accepted and ignored (whole line consumed)
+- `#include …` — **rejected**. Pre-flattening handles this upstream.
+ We refuse rather than silently ignore so an unflattened input fails
+ loudly.
+
+Operators inside the body of a function-like macro:
+
+- `#param` — stringize. Result is a string literal of `param`'s
+ pre-expansion tokens.
+- `a##b` — token paste. Performed before rescanning for further
+ expansion.
+
+Built-in macros:
+
+- `__FILE__` — current source file (a string literal)
+- `__LINE__` — current line number (a decimal integer)
+- `__STDC__` — `1`
+- `__LISPCC__` — `1` (our analogue of MesCC's `__MESC__`)
+
+Expression evaluator (used by `#if`/`#elif`):
+
+- All integer operators including `defined NAME` / `defined(NAME)`.
+- Identifiers that aren't macros evaluate to `0`. (Standard.)
+- Result is a 64-bit signed integer.
+
+Macro expansion uses C11 6.10.3.4 hide-set discipline. Each token
+carries the set of macro names already expanded into it; an identifier
+inside its own hide-set is not re-expanded. This is the standard
+defense against `#define A B\n#define B A`.
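+
+A toy model of that rule, restricted to object-like macros. It is a
+sketch rather than the pp.scm algorithm: a token is modelled as
+`(name . hide-set)` with symbol names, and `expand-toks` is a
+hypothetical helper.
+
+```scheme
+;; macros: alist of (name . body), body being a list of names
+(define (expand-toks toks macros)
+  (if (null? toks)
+      '()
+      (let* ((tok (car toks))
+             (name (car tok))
+             (hide (cdr tok))
+             (m (assq name macros)))
+        (if (and m (not (memq name hide)))
+            ;; expand: body tokens inherit the hide-set plus this name,
+            ;; then get rescanned together with the rest of the input
+            (let ((hide2 (cons name hide)))
+              (expand-toks
+               (append (map (lambda (b) (cons b hide2)) (cdr m))
+                       (cdr toks))
+               macros))
+            ;; not a macro, or hidden: emit unchanged
+            (cons tok (expand-toks (cdr toks) macros))))))
+
+;; With A -> B and B -> A the expansion of A stops after two steps:
+;; (map car (expand-toks '((A)) '((A B) (B A)))) => (A)
+```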
+
+## Types
+
+### Primitives (P1-64)
+
+| Type | Size (bytes) | Align | Notes |
+|-----------------------|--------------|-------|------------------------------|
+| `void` | — | — | only as ptr-target / fn-ret |
+| `char` | 1 | 1 | signed by default |
+| `signed char` | 1 | 1 | |
+| `unsigned char` | 1 | 1 | |
+| `short` | 2 | 2 | |
+| `unsigned short` | 2 | 2 | |
+| `int` | 4 | 4 | |
+| `unsigned int` | 4 | 4 | |
+| `long` | 8 | 8 | LP64 |
+| `unsigned long` | 8 | 8 | |
+| `long long` | 8 | 8 | same as `long` in LP64 |
+| `unsigned long long` | 8 | 8 | |
+| pointer | 8 | 8 | tag-free; raw native address |
+| `_Bool` | 1 | 1 | values: `0`, `1` |
+
+`size_t` is `unsigned long`; `ptrdiff_t` is `long`; `intptr_t` /
+`uintptr_t` are `long` / `unsigned long`. These typedefs come from the
+flattened headers; the language doesn't bake them in.
+
+**Not present**: `float`, `double`, `long double`, `_Complex`,
+`_Imaginary`, `__int128`. `<float.h>` macros and `<math.h>` are
+unavailable to the input.
+
+### Derived types
+
+- **Pointer**: `T *`, multi-level. `void *` is a generic pointer that
+ freely converts to and from any other object pointer.
+- **Array**: `T[N]` with `N` a constant expression evaluating to a
+ positive integer. `T[]` is allowed in function parameter position
+ (decays to `T*`) and as a flexible-array tail field. **VLAs**
+ (`T[expr]` with non-constant `expr`) are **not** supported.
+- **Function**: `T(P1, P2, ..., Pn)` and `T(P1, ..., ...)` (variadic).
+ Pointers to functions, arrays of pointers to functions, and
+ functions returning pointers to functions all parse via the
+ spiral-declarator grammar. Old-style (K&R) function definitions are
+ **not** supported.
+- **Struct / union**: declared with `struct tag { ... }` or
+ `union tag { ... }`. Tag and member namespaces are separate from
+ identifiers. Forward declarations (`struct tag;`) supported.
+ Anonymous structs/unions inside other structs are **not** supported.
+ **Bitfields** (`int x : 3`) are **not** supported (HAVE_BITFIELD off
+ in our target). Flexible array member as last field allowed:
+ `struct s { int n; T data[]; }`.
+- **Enum**: `enum tag { A, B = 7, C }`. Underlying type is `int`.
+ Constants are usable in constant expressions.
+- **Typedef**: `typedef T name;` — name becomes a type-name token in
+ later declarations. Must be visible at parse time of any use
+ (lexer/parser cooperation: typedef names are tracked in the
+ current scope).
+
+### Qualifiers
+
+- `const`, `volatile`, `restrict` — **parsed and discarded**.
+ We don't enforce const-correctness, don't suppress optimization
+ on volatile (no optimizer to suppress), and don't honor restrict.
+ Same as MesCC.
+- `_Atomic`, `_Thread_local` — **rejected** (lex error if they appear;
+ tcc.c doesn't use them, so this won't fire).
+
+## Declarations and storage
+
+### Declarators
+
+Full C99 spiral-declarator grammar:
+
+```
+int *p // pointer to int
+int *p[10] // array of 10 pointers to int
+int (*p)[10] // pointer to array of 10 ints
+int (*f)(int, int) // pointer to function (int,int) returning int
+int *f(int) // function (int) returning pointer to int
+char *(*tab[5])(int) // array of 5 pointers to function (int) returning char*
+```
+
+### Storage classes
+
+- `extern` — declares without defining. References resolve at link
+ time. Honored.
+- `static` at file scope — gives internal linkage; prevents the symbol
+ from being emitted as a P1pp `:public_label`. Honored.
+- `static` at block scope — single shared instance, zero-initialized
+ by default. Honored.
+- `auto` — accepted, no effect (the default for block scope).
+- `register` — accepted, no effect.
+- `typedef` — handled specially (see Types).
+
+### Function definitions
+
+```
+[storage] [type-quals] return-type name(params) { body }
+```
+
+Parameter list forms:
+
+- `void` (zero parameters)
+- `T1 p1, T2 p2, ...`
+- `T1 p1, T2 p2, ..., ...` (variadic, `va_list` discipline below)
+
+K&R-style (`int f(a, b) int a, b; { … }`) is **not** supported.
+
+### Variable initializers
+
+- Scalars: `T x = expr;` — `expr` must be a constant for static-storage
+ variables; arbitrary for auto-storage.
+- Arrays: `T a[N] = { e0, e1, ... };` and `T a[] = { ... };` (size
+ inferred). String-literal initializer for `char[]` allowed.
+- Structs: `S s = { e0, e1, ... };` (positional). Designated
+ initializers (`{ .field = ... }`) **supported** at struct top level
+ only — required by tcc.c.
+- Nested initializers brace-flatten the obvious way.
+
+### Inline / attributes
+
+- `inline` — already removed by `-D inline=` in the bootstrap. Our
+ preprocessor would also strip the keyword if it appeared. No
+ effect on codegen either way.
+- `__attribute__((...))` — parsed and discarded everywhere it
+ appears in declarations.
+
+## Statements
+
+All standard C statements:
+
+- expression statement, including the empty `;`
+- compound statement `{ ... }`, with declarations interleaved with
+ statements (C99-style, not K&R block prologue)
+- `if (e) S` / `if (e) S else S`
+- `while (e) S`, `do S while (e);`
+- `for (init; cond; step) S` — `init` may be a declaration (C99)
+- `switch (e) { case K: ... default: ... }` — `case K` requires `K`
+ constant-integer; fall-through is the default; no implicit break
+- `break;`, `continue;`
+- `goto label;`, `label:` — function-scope labels
+- `return;`, `return e;`
+- declaration as statement (C99)
+
+Cut:
+
+- statement expressions `({ ... })` (GCC ext) — tcc.c doesn't use them
+- `__label__` (GCC) — N/A
+- compound literals `(T){ ... }` — tcc.c doesn't use them
+- `_Generic` selection — tcc.c doesn't use it
+- inline asm `__asm__(...)` — N/A; tcc.c gates this on conditions
+ that aren't active at the tcc-mes stage
+
+## Expressions
+
+All standard C operators with standard precedence and associativity:
+
+| Tier (high → low) | Operators |
+|-------------------|-----------|
+| postfix | `a[i]`, `f(a,...)`, `s.m`, `p->m`, `e++`, `e--` |
+| unary | `++e`, `--e`, `&e`, `*e`, `+e`, `-e`, `~e`, `!e`, `sizeof`, `(T)e` |
+| multiplicative | `*`, `/`, `%` |
+| additive | `+`, `-` |
+| shift | `<<`, `>>` |
+| relational | `<`, `<=`, `>`, `>=` |
+| equality | `==`, `!=` |
+| bitwise | `&`, `^`, `\|` (in that order) |
+| logical | `&&`, `\|\|` |
+| conditional | `?:` |
+| assignment | `=`, `+=`, `-=`, `*=`, `/=`, `%=`, `<<=`, `>>=`, `&=`, `^=`, `\|=` |
+| comma | `,` |
+
+Notes:
+
+- `sizeof(T)` and `sizeof e` both supported. `sizeof e` does **not**
+ evaluate `e` (standard).
+- Integer promotion (rank ≤ `int` → `int`) and usual arithmetic
+ conversions performed automatically. Pointer arithmetic scales by
+ pointee size.
+- Implicit conversions for assignment, return, and function arguments
+ (incl. promotion of variadic args to `int` / `unsigned int` /
+ pointer / `long` / `unsigned long`).
+- String literals have type `char *` (not `const char[N]`) for our
+ purposes — we strip const, and tcc.c writes through string literals
+ in a few places.
+- `_Alignof` — **not** supported. tcc.c uses no alignment intrinsics.
+
+### Variadic argument access
+
+```
+#include <stdarg.h> // pre-flattened in
+void f(int n, ...) {
+ va_list ap; va_start(ap, n);
+ int x = va_arg(ap, int);
+ va_end(ap);
+}
+```
+
+`va_list`, `va_start`, `va_arg`, `va_end` are macros from the
+flattened header. They expand to direct frame-slot reads keyed off the
+`...` slot offset our codegen exposes. Implementation detail: our
+`stdarg.h` substitute is one of the headers shipped with the
+compiler.
+
+## Standard library expectations
+
+Our compiler doesn't bundle libc. The bootstrap script links the
+output against the same `libc+tcc` archive MesCC uses, which provides:
+
+- `<stdio.h>`: `FILE`, `fopen`, `fclose`, `fread`, `fwrite`, `fprintf`,
+ `fputs`, `fgetc`, `getc`, `printf`, `sprintf`, `vsnprintf`, …
+- `<stdlib.h>`: `malloc`, `free`, `realloc`, `exit`, `atoi`, `strtol`,
+ `qsort`, …
+- `<string.h>`: `strlen`, `strcpy`, `strncpy`, `strcmp`, `strncmp`,
+ `strcat`, `strchr`, `strrchr`, `strstr`, `memset`, `memcpy`,
+ `memmove`, `memcmp`, …
+- `<ctype.h>`, `<errno.h>`, `<unistd.h>`, `<fcntl.h>`, `<sys/stat.h>`,
+ …
+- `<stdarg.h>`, `<stddef.h>`, `<limits.h>` — supplied by us.
+
+Anything from `<setjmp.h>` is **not** required at the tcc-mes stage
+(`HAVE_SETJMP` off). `<math.h>` is not required (`HAVE_FLOAT` off).
+
+Built-in functions our compiler *recognizes* (vs. linking against):
+
+- `__builtin_va_start`, `__builtin_va_arg`, `__builtin_va_end` —
+ expanded inline by the codegen. The `<stdarg.h>` we ship aliases
+ the standard names to these.
+- `alloca` — left as a library call. tcc.c mentions it only when
+ defining `__builtin_alloca` for the programs it compiles, not for
+ its own use.
+
+## Cut from C99 / C11
+
+Kept explicit so additions are deliberate.
+
+| Feature | Status | Rationale |
+|-----------------------------------------------|----------|------------------------------------------------|
+| Floats / doubles / `_Complex` | rejected | HAVE_FLOAT off |
+| `long double` | rejected | no FP |
+| Bitfields | rejected | HAVE_BITFIELD off |
+| `setjmp` / `longjmp` | not lib | HAVE_SETJMP off |
+| VLAs | rejected | tcc.c doesn't use; complicates frame layout |
+| Compound literals `(T){...}` | rejected | tcc.c doesn't use |
+| Statement expressions `({...})` (GCC) | rejected | tcc.c doesn't use |
+| `_Generic` | rejected | not used |
+| `_Atomic`, `_Thread_local` | rejected | not used |
+| `_Alignof`, `_Alignas` | rejected | not used |
+| `_Static_assert` | rejected | not used |
+| Wide / UTF strings (`L"…"`, `u8"…"`) | rejected | not used |
+| Anonymous struct/union members | rejected | not used |
+| Multi-character constants (`'AB'`) | rejected | not used |
+| Universal character names (`\uXXXX`) | rejected | identifier set is ASCII only |
+| K&R-style function definitions | rejected | tcc.c uses ANSI |
+| Nested function definitions (GCC) | rejected | not used |
+| Inline assembly (`__asm__`) | rejected | not used at this stage |
+| `__label__` (GCC) | rejected | not used |
+| `#include` | rejected | external pre-flatten step |
+| `const`, `volatile`, `restrict` | parsed, discarded | match MesCC |
+| `inline` | parsed, discarded | -D inline= in bootstrap |
+| `__attribute__((...))` | parsed, discarded | match MesCC |
+| `register`, `auto` storage classes | parsed, no effect | |
+
+## Undefined behavior policy
+
+Following [LISP.md](LISP.md)'s "Primitive failure" stance: out-of-bounds
+array access, signed integer overflow, dereferencing a null or
+uninitialized pointer, integer division by zero, and modifying a string
+literal are **undefined**. The compiler emits no runtime checks; the
+generated P1pp will crash, loop, or produce nonsense, and that's
+acceptable.
+
+The compiler itself aims to be **deterministic**: the same input bytes
+produce identical output bytes. Errors detected at compile time
+(syntax errors, type errors, unresolved identifiers) abort with a
+diagnostic on stderr and a nonzero exit code. No partial output is
+written.
+
+## Validation milestones
+
+Status legend: `[x]` done · `[~]` in progress · `[ ]` not started.
+
+1. [ ] Self-tests: a tests/cc/ tree mirroring tests/scheme1/ — one
+ tiny `.c` file per language feature, exit-status-driven.
+2. [ ] Compile a hand-written single-file C "hello world" through to
+ ELF.
+3. [ ] Compile the mes libc unified-libc.c (the same file MesCC builds
+ into libc.a).
+4. [ ] Compile tcc.c (under the tcc-mes defines) → tcc-lispcc; verify
+ `tcc-lispcc -version` runs.
+5. [ ] Use tcc-lispcc to build tcc-boot0; verify checksum matches the
+ live-bootstrap reference.
+
+Hitting (5) is the bootstrap milestone — at that point lispcc has
+fully replaced MesCC in the chain.