boot2

Playing with the bootstrap
git clone https://git.ryansepassi.com/git/boot2.git

commit a86b719a54e31357eadbcd172dc3bd1776912ed4
parent 9a3e9f8a0fa616639d12f375d6b7a46cfada0546
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Sat, 25 Apr 2026 21:50:07 -0700

docs: add C compiler spec — CC.md, CC-INTERNALS.md, CC-CONTRACTS.md

Three-doc set defining the scheme1-hosted C compiler that will replace
MesCC at the live-bootstrap tcc-mes stage:

- CC.md:           accepted C subset (matches MesCC + tcc-mes defines)
- CC-INTERNALS.md: six-module decomposition (util/data/lex/pp/cg/parse)
- CC-CONTRACTS.md: frozen alphabets, test formats, frame ABI, mangling,
                   conversion-responsibility split, phase-1 milestone

Engineers can work the modules in parallel against these contracts.

Diffstat:
A docs/CC-CONTRACTS.md | 533 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A docs/CC-INTERNALS.md | 726 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A docs/CC.md           | 446 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 1705 insertions(+), 0 deletions(-)

diff --git a/docs/CC-CONTRACTS.md b/docs/CC-CONTRACTS.md
@@ -0,0 +1,533 @@
+# lispcc contracts
+
+Frozen interfaces between modules. Engineers must not diverge from
+these without proposing a change. This document is the source of truth
+for the symbol alphabets, test formats, ABI, and phase-1 milestone
+referenced from [CC-INTERNALS.md](CC-INTERNALS.md).
+
+## 1. Symbol alphabets
+
+Every record's `kind`-style fields use these exact symbols. Adding a
+symbol = updating this section first.
+
+### 1.1 `tok-kind`
+
+```
+IDENT KW INT STR CHAR PUNCT HASH NL EOF
+```
+
+Uppercase to distinguish from value-level symbols. `IDENT` carries an
+unrecognized identifier; `KW` is one of the symbols in §1.3.
+
+### 1.2 `PUNCT` value symbols
+
+The lexer produces `tok-value` symbols for punctuators per the
+following table. **Names are mandatory** — no engineer may use the raw
+`'+`, `'*`, etc. as symbols, because several C punctuator characters
+(`%`, `|`, `,`, `;`, `(`, `)`, `{`, `}`, `[`, `]`, `.`, `#`) cannot
+form valid scheme1 symbols. We use named symbols for *all* punctuators
+to keep the scheme uniform.
+
+| C       | Symbol      | C        | Symbol     |
+|---------|-------------|----------|------------|
+| `[`     | `lbrack`    | `==`     | `eq2`      |
+| `]`     | `rbrack`    | `!=`     | `ne`       |
+| `(`     | `lparen`    | `<`      | `lt`       |
+| `)`     | `rparen`    | `>`      | `gt`       |
+| `{`     | `lbrace`    | `<=`     | `le`       |
+| `}`     | `rbrace`    | `>=`     | `ge`       |
+| `.`     | `dot`       | `<<`     | `shl`      |
+| `->`    | `arrow`     | `>>`     | `shr`      |
+| `,`     | `comma`     | `&&`     | `land`     |
+| `;`     | `semi`      | `\|\|`   | `lor`      |
+| `:`     | `colon`     | `&`      | `amp`      |
+| `?`     | `qmark`     | `\|`     | `bar`      |
+| `...`   | `ellipsis`  | `^`      | `caret`    |
+| `++`    | `inc`       | `~`      | `tilde`    |
+| `--`    | `dec`       | `!`      | `bang`     |
+| `+`     | `plus`      | `=`      | `assign`   |
+| `-`     | `minus`     | `+=`     | `plus-eq`  |
+| `*`     | `star`      | `-=`     | `minus-eq` |
+| `/`     | `slash`     | `*=`     | `star-eq`  |
+| `%`     | `pct`       | `/=`     | `slash-eq` |
+| `#`     | `hash`      | `%=`     | `pct-eq`   |
+| `##`    | `paste`     | `<<=`    | `shl-eq`   |
+|         |             | `>>=`    | `shr-eq`   |
+|         |             | `&=`     | `amp-eq`   |
+|         |             | `^=`     | `caret-eq` |
+|         |             | `\|=`    | `bar-eq`   |
+
+Digraphs (`<:` `:>` `<%` `%>` `%:` `%:%:`) lex as their standard
+equivalents: same symbol on the right-hand side of the table.
+
+### 1.3 `KW` value symbols
+
+```
+;; storage
+auto register static extern typedef
+;; qualifiers (parsed and discarded)
+const volatile restrict inline
+;; type specifiers
+void char short int long signed unsigned _Bool
+;; rejected type specifiers (lexed as KW so we get clean diagnostics)
+float double
+;; aggregates
+struct union enum
+;; statements
+if else while do for switch case default break continue return goto
+;; operators
+sizeof
+;; reserved-and-rejected (lexed as KW so we error crisply)
+_Generic _Atomic _Thread_local _Alignof _Alignas _Static_assert
+_Complex _Imaginary
+```
+
+Anything matching the C identifier grammar that is **not** in this
+list lexes as `IDENT`.
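The longest-match rule behind the §1.2 table can be sketched as follows. This is a minimal illustration, not the lexer from the commit: it assumes ordinary R7RS strings rather than scheme1 bytevectors, covers only a few rows of the table, and the names `punct-table`, `prefix-at?`, and `match-punct` are hypothetical.

```scheme
;; Sketch only: a few rows of the §1.2 table. Any spelling that is a
;; prefix of another must come after it, so the longest one wins.
(define punct-table
  '(("<<=" . shl-eq) ("<<" . shl) ("<=" . le) ("<" . lt)
    ("==" . eq2) ("=" . assign)))

;; #t when string `s` contains `p` starting at index `i`.
(define (prefix-at? p s i)
  (let ((n (string-length p)))
    (and (<= (+ i n) (string-length s))
         (string=? p (substring s i (+ i n))))))

;; Return (symbol . next-index) for the first matching row at `i`,
;; or #f when no punctuator in the table starts there.
(define (match-punct s i)
  (let loop ((rows punct-table))
    (cond ((null? rows) #f)
          ((prefix-at? (caar rows) s i)
           (cons (cdar rows) (+ i (string-length (caar rows)))))
          (else (loop (cdr rows))))))
```

With this table, `(match-punct "a<<=b" 1)` evaluates to `(shl-eq . 4)` and `(match-punct "a<b" 1)` to `(lt . 2)`. The result is always a named symbol, never a raw punctuator character.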
+
+### 1.4 `ctype-kind`
+
+```
+void i8 u8 i16 u16 i32 u32 i64 u64 bool
+ptr arr fn struct union enum
+```
+
+Char is `i8` (`signed char`) or `u8` (`unsigned char`); `char` itself
+is `i8` (we treat plain `char` as signed, matching MesCC and most
+compilers). Long and long long collapse to `i64`/`u64` on P1-64.
+
+### 1.5 `opnd-kind`
+
+```
+imm frame global reg
+```
+
+`reg` is transient — used for the result of a call before it spills
+to a frame slot. The vstack itself never holds `reg` opnds; cg
+materializes through `reg` only inside a single emission step.
+
+### 1.6 `macro-kind`
+
+```
+obj fn fn-vararg
+```
+
+### 1.7 `sym-kind`
+
+```
+var fn typedef enum-const param label
+```
+
+### 1.8 `sym-storage`
+
+```
+auto static extern register
+```
+
+`#f` for `typedef`, `enum-const`, and `label` symbols (storage
+class doesn't apply).
+
+### 1.9 `loop-ctx-kind`
+
+```
+while do for switch
+```
+
+### 1.10 `reg` opnd register names
+
+```
+a0 a1 a2 a3
+```
+
+Only argument registers. Saved registers (`s0`..`s3`) and temporaries
+(`t0`..`t2`) are cg-private; never exposed as opnd payload.
+
+### 1.11 `cg-binop` and `cg-unop` operator symbols
+
+```
+;; cg-binop:
+add sub mul div rem
+and or xor shl shr
+eq ne lt le gt ge
+
+;; cg-unop:
+neg    ;; arithmetic negate
+bnot   ;; bitwise complement (~)
+lnot   ;; logical not (!)
+```
+
+These are abstract operations independent of source-level PUNCT
+symbols; the parser maps PUNCT → cg-op (e.g., `'plus` → `'add`,
+`'eq2` → `'eq`).
+
+## 2. Test serialization formats
+
+All test goldens use Scheme-readable forms so they `diff` cleanly and
+can be machine-parsed if useful.
+
+### 2.1 Token line format
+
+One token per line, as a Scheme list:
+
+```
+(KIND VALUE FILE LINE COL)
+```
+
+- **KIND**: bare symbol from §1.1.
+- **VALUE** rendering depends on KIND:
+  - `IDENT`, `STR`: bytevector literal `"..."` with `\n \t \r \\ \"`
+    escapes; non-ASCII bytes as `\xNN`.
+  - `INT`, `CHAR`: decimal integer.
+  - `KW`, `PUNCT`: bare symbol from §1.2 / §1.3.
+  - `HASH`, `NL`, `EOF`: `#f`.
+- **FILE**: bytevector literal.
+- **LINE**, **COL**: decimal integers (1-based).
+
+The `tok-hide` field is **not** serialized — it is implementation
+detail of the preprocessor.
+
+Example for `int main() { return 0; }` in `t.c`:
+
+```
+(KW int "t.c" 1 1)
+(IDENT "main" "t.c" 1 5)
+(PUNCT lparen "t.c" 1 9)
+(PUNCT rparen "t.c" 1 10)
+(PUNCT lbrace "t.c" 1 12)
+(KW return "t.c" 1 14)
+(INT 0 "t.c" 1 21)
+(PUNCT semi "t.c" 1 22)
+(PUNCT rbrace "t.c" 1 24)
+(EOF #f "t.c" 1 25)
+```
+
+Trailing whitespace and `;`-comments in the golden file are ignored.
+
+### 2.2 cg-trace line format
+
+The cg-trace mock writes one Scheme list per cg call:
+
+```
+(<call-name> <arg1> <arg2> ...)
+```
+
+`<call-name>` strips the `cg-` prefix (`cg-push-imm` → `push-imm`,
+`cg-fn-begin` → `fn-begin`).
+
+Argument renderers, applied per-call:
+
+- **ctype** → a stable symbolic form:
+  - primitives: `void`, `i8`, `u8`, `i16`, `u16`, `i32`, `u32`,
+    `i64`, `u64`, `bool` (the `kind` symbol verbatim).
+  - pointer: `(ptr <T>)`.
+  - array: `(arr <T> <N>)` where N is the length or `*` for
+    incomplete.
+  - function: `(fn <ret> (<param>...) <variadic?>)`.
+  - aggregates: `(struct <tag>)`, `(union <tag>)`, `(enum <tag>)`.
+- **sym** → `(<name-bv> <kind-symbol>)`. Storage and slot are not
+  surfaced — they are implementation detail.
+- **bv** → bytevector literal, as in §2.1.
+- **fixnum** → decimal integer.
+- **bool** → `#t` / `#f`.
+- **op symbol** (binop/unop) → bare symbol.
+
+cg calls that take *thunks* (`cg-if`, `cg-ifelse`, `cg-loop`) emit a
+matching open/close pair in the trace, with the body's calls in
+between:
+
+```
+(if-begin)
+  ...body trace...
+(if-end)
+
+(ifelse-begin)
+  ...then-trace...
+(ifelse-mid)
+  ...else-trace...
+(ifelse-end)
+
+(loop-begin <tag>)
+  ...body trace...
+(loop-end <tag>)
+```
+
+This is the canonical surface — `cg-if` *internally* uses a thunk,
+but the trace exposes begin/mid/end markers so tests can read top-down.
+
+### 2.3 Diagnostic format
+
+Already canonical from CC-INTERNALS:
+
+```
+<file>:<line>:<col>: error: <msg>: <irritants...>
+```
+
+Tests for failure paths verify:
+- exit status is 1
+- stderr contains the expected `<file>:<line>:<col>: error:` prefix
+
+The `<msg>` body is **not** matched character-for-character (so we
+can refine wording without breaking tests); only the prefix and a
+keyword-substring of the engineer's choice.
+
+## 3. Frame layout / parameter ABI
+
+### 3.1 cg-fn-begin contract
+
+```scheme
+(cg-fn-begin cg name params return-type) -> param-syms
+;; name:        bv (un-mangled C identifier)
+;; params:      list of (name-bv . ctype)
+;; return-type: ctype
+;; param-syms:  alist (name-bv . sym), each sym already bound to a frame slot
+```
+
+Inside `cg-fn-begin`, cg:
+
+1. Allocates one frame slot per parameter via `cg-alloc-slot`. Slot
+   width = `ctype-size` rounded up to 8 (`align-up`); align = 8.
+   (Yes, every param costs at least 8 bytes. P1-64 frame is
+   word-stride; we don't pack.)
+2. Begins emitting into `cg-fn-buf`. Does **not** yet emit the
+   prologue — that's deferred to `cg-fn-end` once `frame-hi` is final.
+3. Emits the param-spill code into a "prologue prefix" buffer
+   (private to cg): for params 0..3, `ST aN, [sp + slotN]`; for
+   params 4+, `LDARG t0, K` then `ST t0, [sp + slotK]`.
+4. Returns the param-sym alist. Parser binds these into the function-
+   body scope.
+
+### 3.2 cg-fn-end contract
+
+```scheme
+(cg-fn-end cg)
+```
+
+cg:
+
+1. Reads final `frame-hi` (highest byte allocated).
+2. Emits the per-function preamble (an M1pp `%struct` is **not**
+   used — slots are numeric byte offsets, baked into the body text
+   already buffered in `fn-buf`).
+3. Wraps the prologue-prefix + fn-buf inside a libp1pp `%fn` macro:
+
+   ```
+   %fn(<mangled-name>, <frame-hi-aligned-up-to-16>, {
+     <prologue-prefix bytes>
+     <fn-buf bytes>
+     ::ret
+     LD a0, [sp + <return-slot>]
+   })
+   ```
+4. Flushes the result into `cg-text`, clears `fn-buf` and the
+   prologue-prefix buffer, resets `vstack`, `frame-hi`, and the
+   function-local label counter.
+
+The frame size is rounded up to 16 to satisfy the P1 stack-align
+contract.
+
+### 3.3 Outgoing-arg staging
+
+When `cg-call` is asked to emit a call with arity > 4, it stages
+args 4..(N-1) into the *low-addressed* prefix of the current frame
+at `[sp + 0*8]`, `[sp + 1*8]`, etc., per LIBP1PP.md §Frame locals.
+cg tracks the maximum staging count seen across the function and
+reserves that prefix at fn-end before any other slots — i.e.,
+`cg-alloc-slot`'s first allocation comes *after* the staging area.
+
+The accounting is internal to cg. Parse never sees staging slots.
+
+### 3.4 cg-alloc-slot contract
+
+```scheme
+(cg-alloc-slot cg bytes align) -> offset
+```
+
+- `bytes` = total size needed (e.g., 4 for `int`, 40 for `int[10]`,
+  `sizeof(struct foo)` for a struct).
+- `align` = required alignment (1, 2, 4, or 8).
+- Returns numeric byte offset relative to `sp` post-`%enter`.
+
+cg first aligns `frame-hi` up to `align`, returns that as the
+offset, then bumps `frame-hi` by `bytes`. Slots are not reused
+across scopes (we're optimizing for compiler simplicity, not frame
+size). Local arrays and structs request their full size in one call.
+
+## 4. Conversion responsibility
+
+The parser drives type semantics; cg is type-aware only enough to
+choose signed-vs-unsigned variants and to scale pointer arithmetic.
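The two type-aware decisions left to cg can be illustrated in a few lines. This is a hedged sketch, not code from cg.scm: the kind symbols follow §1.4 and the `BLT`/`BLTU` mnemonics are the ones named in this document, but `signed-kind?`, `unsigned-kind?`, `lt-branch`, and `scaled-offset` are hypothetical helper names.

```scheme
;; Sketch only; helper names are hypothetical, kind symbols follow §1.4.
(define (signed-kind? k)   (and (memq k '(i8 i16 i32 i64)) #t))
(define (unsigned-kind? k) (and (memq k '(u8 u16 u32 u64)) #t))

;; Pick the less-than branch mnemonic for two operands that already
;; share one integer kind (i.e. after the arithmetic conversions).
(define (lt-branch kind)
  (cond ((signed-kind? kind)   'BLT)
        ((unsigned-kind? kind) 'BLTU)
        (else (error "lt-branch: not an integer kind" kind))))

;; Byte offset added to a pointer for `p + index`, where
;; pointee-size is the size in bytes of the pointed-to type.
(define (scaled-offset index pointee-size)
  (* index pointee-size))
```

For example, `(lt-branch 'u32)` yields `BLTU`, and for an `int *p` (4-byte pointee) `p + 3` advances by `(scaled-offset 3 4)`, which is 12 bytes.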
+
+### 4.1 Parser's responsibilities
+
+The parser **must** call cg in this order around each operation:
+
+| Source | Required parser actions before the operation |
+|--------|----------------------------------------------|
+| `e1 + e2` (and other arith binops) | (a) parse e1 → if lval, `cg-load`; (b) `cg-promote` if rank < int; (c) parse e2 same way; (d) `cg-arith-conv` to bring both to common type; (e) `cg-binop add` |
+| `*p` | parse p → if lval, `cg-load`; then `cg-push-deref` |
+| `&x` | parse x → must be lval; then `cg-take-addr` |
+| `(T)e` | parse e → if lval, `cg-load` (unless casting to a pointer); then `cg-cast T` |
+| `f(a, b, ...)` | parse f → if lval and `f` not a function-typed identifier, `cg-load`; parse each arg → `cg-load` if lval, then `cg-cast` to param type (or default-promote for variadic args); then `cg-call` |
+| `lhs = rhs` | parse lhs → must be lval (no load); parse rhs → `cg-load` if lval; `cg-cast` to lhs type; `cg-assign` |
+| `lhs += rhs` | parse lhs (lval) → duplicate via `cg-take-addr` then `cg-push-deref`; parse rhs; `cg-arith-conv`; `cg-binop add`; `cg-cast` to lhs type; `cg-assign` |
+| `return e` | parse e → `cg-load` if lval; `cg-cast` to fn return type; `cg-return` |
+| `if (e) ...` | parse e → `cg-load` if lval; `cg-cast bool` if not already int-shaped; `cg-if` |
+
+The parser is responsible for the standard:
+
+- **Integer promotion**: any operand of type rank below `int` is
+  promoted to `int` (or `unsigned int` if it can't fit) before use
+  in arithmetic, before assignment to a wider lhs in mixed contexts,
+  and before being passed as a variadic argument.
+- **Usual arithmetic conversions**: applied to both operands of a
+  binary arithmetic operator after promotion. The result type is
+  the common type.
+- **Pointer-int interaction**: detected by parser; `cg-binop add`
+  on (ptr, int) handles scaling internally (see §4.2).
+
+### 4.2 cg's responsibilities
+
+cg trusts the operand types it is handed.
+
+- **`cg-load`**: pop lval, emit one load (of the right width based
+  on `ctype-size`), push rval of the same type.
+- **`cg-cast to-type`**: pop, emit sign-extend / zero-extend /
+  truncate as needed based on source vs. target sizes and signedness.
+  For `to-type = bool`: emit `(BNEZ -> 1, fallthrough -> 0)` shape.
+  Pointer ↔ integer casts are bit-for-bit on P1-64 (no emission).
+- **`cg-binop add` (and `sub`)**: if exactly one operand is a `ptr`,
+  scale the int operand by `ctype-size` of the pointee before adding.
+  If both ptr → only `sub` is valid (yields `i64` byte difference,
+  divided by element size); other binops on (ptr, ptr) abort via
+  `die`.
+- **`cg-binop` for divisions and comparisons**: dispatch to signed
+  (`DIV`/`BLT`) or unsigned (`DIV`+sign-flip / `BLTU`) variant based
+  on the operand kinds (`i*` → signed, `u*` → unsigned). After
+  `cg-arith-conv`, both operands have the same kind, so dispatch is
+  unambiguous.
+
+cg never:
+- auto-loads an lvalue
+- auto-promotes
+- auto-converts arguments
+- looks at fn-ctx return type (parser passes the cast)
+
+This split keeps cg under ~600 LOC by pushing all "C language"
+knowledge into parse.
+
+## 5. Symbol-to-label mangling
+
+Three label namespaces in the emitted P1pp:
+
+### 5.1 User globals (functions, variables)
+
+```
+C identifier "foo"  →  P1pp label  :cc__foo
+```
+
+Verbatim concatenation. C identifiers can't contain `:` or other
+P1pp-special characters, so no escaping is needed. The `cc__` prefix
+guarantees no collision with libp1pp internals (`libp1pp__*`),
+backend stubs (`_start`, `p1_main`, `sys_*`), or our own runtime
+support.
+
+`static` storage at file scope changes nothing about the label —
+since we have one TU, internal-linkage and external-linkage symbols
+share the same namespace. `static` only suppresses any future
+"export to other TUs" emission, which we don't do anyway.
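The §5.1 rule is plain concatenation, which a short sketch makes concrete. The code below assumes R7RS strings instead of scheme1 bytevectors, and `mangle-global` and `label-def` are hypothetical names, not functions defined by this commit.

```scheme
;; Sketch only: R7RS strings stand in for scheme1 bytevectors.

;; C identifier -> mangled name, by verbatim concatenation.
(define (mangle-global name)
  (string-append "cc__" name))

;; The label as written in §5.1, with its leading colon.
(define (label-def name)
  (string-append ":" (mangle-global name)))
```

Here `(label-def "foo")` produces `":cc__foo"`. A file-scope `static` identifier goes through the same function, since storage class does not affect the label.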
+
+### 5.2 String pool
+
+```
+n-th distinct string literal  →  :cc__str_<n>
+```
+
+`<n>` is a fresh decimal counter starting at 0, advanced only on
+non-deduplicating insert. Identical string literals share a label
+(idempotent intern).
+
+### 5.3 Function-internal labels
+
+Inside `%fn(...)`, libp1pp's `%scope` mechanism prefixes short
+labels (`::ret`, `::lbl_42`) to `<fnname>__ret`, `<fnname>__lbl_42`
+at M1pp time. cg uses short labels exclusively inside fn-buf:
+
+- `::ret` — single function exit
+- `::lbl_<n>` — anonymous control-flow targets (for switch cases,
+  short-circuit eval, etc.)
+- `::user_<name>` — user-written `goto` labels (C `myloop:` →
+  `::user_myloop`). The `user_` prefix prevents collisions with our
+  `lbl_` and `ret` labels.
+
+Loop tags (for libp1pp's `%loop_tag`, `%break(tag)`, `%continue(tag)`)
+are not labels — they're macro-name fragments. cg generates them as
+`L<n>` (no `cc__` prefix; tag namespace is per-function and `%fn`
+already scopes them).
+
+### 5.4 Entry stub
+
+cg emits a small entry stub at `cg-finish` time:
+
+```
+%fn(p1_main, %p1_main_f.SIZE, {
+    ; argc = a0, argv = a1 already
+    %la_br(&cc__main)
+    %call
+    %eret
+})
+```
+
+So `int main(int argc, char **argv)` is reached from the P1
+program-entry contract.
+
+If the user's `main` has no parameters, the stub still passes
+argc/argv — main just ignores them, which is harmless. If the user
+defines `main` with a different return type, that's a CC.md
+violation; cg can either die or emit and let the cast happen at the
+return site. Recommend: parser checks at `cg-fn-end` time when
+fn-name == `main`.
+
+## 6. Phase-1 milestone
+
+```c
+int main(int argc, char **argv) {
+    return argc;
+}
+```
+
+This is the integration target every engineer aims at. It goes
+through:
+
+- **lex**: `int`, `main`, `(`, `int`, `argc`, `,`, `char`, `*`, `*`,
+  `argv`, `)`, `{`, `return`, `argc`, `;`, `}` — all PUNCT and KW
+  symbols touched are core; covers two of each kind.
+- **pp**: zero directives, but full token-list traversal.
+- **parse**: function definition, two-parameter list including
+  `char **`, compound stmt, return stmt, identifier expression,
+  lval→rval load.
+- **cg**: `cg-fn-begin` with two params, parameter spilling
+  (one register-passed, one register-passed), `cg-push-sym`,
+  `cg-load`, `cg-return`, `cg-fn-end`, `%fn` wrapping, and the
+  `p1_main` entry stub.
+- **e2e**: link with arch backend + libp1pp; run as native ELF;
+  exit code matches argc.
+
+Acceptance test: `tests/cc-e2e/00-return-argc.c` exists, the make
+target builds it, and:
+
+```
+$ ./tests/cc-e2e/build/00-return-argc       ; echo $?   → 1
+$ ./tests/cc-e2e/build/00-return-argc a b   ; echo $?   → 3
+```
+
+When this test passes on aarch64, amd64, and riscv64, phase 1 is
+complete.
+
+## Change protocol
+
+Anyone proposing a contract change:
+
+1. PR amends this doc first, with rationale.
+2. Affected modules + tests are listed.
+3. Changes land in one PR (doc + all affected code) so no engineer
+   pulls a half-migrated tree.
diff --git a/docs/CC-INTERNALS.md b/docs/CC-INTERNALS.md
@@ -0,0 +1,726 @@
+# lispcc internals
+
+Companion to [CC.md](CC.md). CC.md says what we accept; this doc says
+how the implementation is organized so engineers can split work and
+test independently.
+
+The compiler is one scheme1 program assembled from six files at build
+time:
+
+```
+build:  catm cc/util.scm cc/data.scm cc/lex.scm cc/pp.scm cc/cg.scm cc/parse.scm cc/main.scm > cc/cc.scm
+run:    scheme1 < cc/cc.scm -- input.flat.c output.P1pp
+```
+
+(Driver shell-script details — argv plumbing, scheme1 prelude prepend
+— belong in scripts/, not in this doc.)
+
+## Module DAG
+
+```
+util ────────┐
+             │
+data ────────┼─► lex ──► pp ──► parse ─► main
+             │                    │
+             └──────────────────────► cg ──► main
+```
+
+- **util.scm** — leaf helpers; depends only on the scheme1 prelude.
+- **data.scm** — record-type definitions used across modules.
+- **lex.scm** — bytestream → token list. Pure function.
+- **pp.scm** — token list → expanded token list. Pure function.
+- **cg.scm** — codegen state and emission API. Mutates a `cg` record.
+- **parse.scm** — token list + cg → P1pp output. Mutates `pstate` and
+  drives `cg`.
+- **main.scm** — argv handling, file I/O, ties phases together.
+
+Cycles are forbidden. parse.scm calls cg.scm but never the reverse.
+
+## Conventions
+
+- **Naming**: every public function and accessor is prefixed by its
+  module or record. `lex-tokenize`, `pp-expand`, `cg-push-imm`,
+  `parse-translation-unit`, `tok-kind`, `ctype-size`. Internal helpers
+  use a leading `%` (e.g. `%pp-expand-line`).
+- **Record constructors**: `%record-name` is the all-fields raw
+  constructor; named factory functions like `make-tok` wrap it when
+  defaults are useful.
+- **Mutators**: `field-set!` form, e.g. `cg-out-set!`, matching
+  scheme1/prelude.scm.
+- **Bytevectors as strings**: every "string" in this codebase is a
+  bytevector. We never use `symbol->string` for runtime data — symbols
+  are reserved for the small fixed alphabets (token kinds, ctype
+  kinds, opnd kinds).
+- **Errors**: `(die loc msg . irritants)` writes a diagnostic to fd 2
+  and `sys-exit`s with status 1. No exceptions, no recovery. Every
+  module uses the same `die`.
+- **No global state**: every long-lived datum lives in a record passed
+  explicitly. The only top-level definitions are functions, constants,
+  and the keyword/punctuator alists.
+
+## Errors
+
+```scheme
+(die loc msg . irritants)        ; abort with formatted diagnostic
+(loc-of-tok tok) -> loc          ; pull file/line/col out of a tok
+(loc file line col) -> loc       ; construct manually (used by lexer)
+```
+
+Diagnostic format on stderr:
+
+```
+<file>:<line>:<col>: error: <msg>: <irritants display'd one-by-one>
+```
+
+`irritants` are written via `display` (no quoting). Pairs and integers
+print naturally; bytevectors print as their byte content.
+
+`die` is the only failure path. There is no warning level — anything
+worth saying aborts.
+
+## util.scm
+
+Bytevector and list helpers that scheme1/prelude.scm doesn't already
+provide.
+
+```scheme
+;; bytevector
+(bv= a b) -> bool                      ; alias for bytevector=?, terser
+(bv-prefix? p s) -> bool               ; does s start with p?
+(bv-find bv byte from) -> idx-or-#f    ; first occurrence at >= from
+(bv-slice bv start end) -> bv          ; fresh copy
+(bv-of-string str) -> bv               ; literal helper for inline ASCII
+(bv-of-byte b) -> bv                   ; 1-byte bv
+(bv-cat lst-of-bv) -> bv               ; concat list, single allocation
+(bv->fixnum bv radix) -> (ok . val)    ; (ok . #f) on parse fail
+(fixnum->bv n radix) -> bv
+
+;; lists / alists
+(alist-ref key al) -> val-or-#f
+(alist-ref/eq key al) -> val-or-#f     ; eq? compare (for symbol keys)
+(alist-set key val al) -> al'          ; cons new pair on the front
+(alist-update key f al) -> al'         ; functional update
+(any p xs) -> bool
+(every p xs) -> bool
+(count p xs) -> fixnum
+
+;; ints
+(min3 a b c) -> fixnum
+(align-up n k) -> fixnum               ; round n up to multiple of k
+
+;; output buffers (reversed list of bv chunks; flush builds in one pass)
+(make-buf) -> buf                      ; record: { chunks }
+(buf-push! buf bv)                     ; cons bv onto chunks
+(buf-flush buf) -> bv                  ; reverse + bv-cat
+
+;; diagnostics + I/O
+(die loc msg . irritants) -> never returns
+(slurp-fd fd) -> bv                    ; read until EOF
+(write-bv-fd fd bv)                    ; full write or die
+
+;; fresh names
+(make-namer prefix) -> proc            ; (proc) -> bv "prefix0", "prefix1", ...
+```
+
+`make-buf` and `buf-push!` are the universal output primitive — both
+cg's three output streams and pp's expansion staging buffer use them.
+
+## data.scm
+
+All record types used by more than one module.
+
+### loc
+
+```scheme
+(define-record-type loc
+  (%loc file line col)
+  loc?
+  (file loc-file)   ; bv
+  (line loc-line)   ; fixnum
+  (col  loc-col))   ; fixnum
+```
+
+### tok
+
+```scheme
+(define-record-type tok
+  (%tok kind value loc hide)
+  tok?
+  (kind  tok-kind)    ; symbol; see kinds table
+  (value tok-value)   ; varies by kind
+  (loc   tok-loc)     ; loc
+  (hide  tok-hide))   ; list of bv (macro names already expanded)
+```
+
+| `kind`  | `value` shape                                    |
+|---------|--------------------------------------------------|
+| `IDENT` | bv (identifier name)                             |
+| `KW`    | symbol (one of `if while ... typedef`)           |
+| `INT`   | fixnum (post-suffix integer value)               |
+| `STR`   | bv (raw bytes, no NUL terminator)                |
+| `CHAR`  | fixnum (value 0..255)                            |
+| `PUNCT` | symbol (`'+ '== '-> '... '## '#` …)              |
+| `HASH`  | `#f` (preprocessor only; line-leading `#`)       |
+| `NL`    | `#f` (significant only inside the preprocessor)  |
+| `EOF`   | `#f`                                             |
+
+`make-tok` is the canonical constructor; pass `hide = '()` for
+freshly-lexed tokens.
+
+### macro
+
+```scheme
+(define-record-type macro
+  (%macro kind params body)
+  macro?
+  (kind   macro-kind)     ; 'obj | 'fn | 'fn-vararg
+  (params macro-params)   ; list of bv
+  (body   macro-body))    ; list of tok
+```
+
+### ctype
+
+```scheme
+(define-record-type ctype
+  (%ctype kind size align ext)
+  ctype?
+  (kind  ctype-kind)    ; 'void 'i8 'i16 'i32 'i64 'u8 'u16 'u32 'u64
+                        ; 'bool 'ptr 'arr 'fn 'struct 'union 'enum
+  (size  ctype-size)    ; fixnum bytes; -1 = incomplete
+  (align ctype-align)   ; fixnum bytes
+  (ext   ctype-ext))    ; payload, kind-specific
+```
+
+ext payload by kind:
+
+| `kind`                | `ext` shape                                                             |
+|-----------------------|-------------------------------------------------------------------------|
+| `void` / int / `bool` | `#f`                                                                    |
+| `ptr`                 | pointee ctype                                                           |
+| `arr`                 | `(elem-ctype . length-or--1)`                                           |
+| `fn`                  | `(ret-ctype params variadic?)` — `params` is list of ctype              |
+| `struct`/`union`      | `(tag-bv complete? fields)` — `fields` = `((name-bv ctype offset) ...)` |
+| `enum`                | `(tag-bv ((const-name-bv . value) ...))`                                |
+
+Primitive ctypes are interned at startup as top-level bindings:
+`%t-void`, `%t-i8`, `%t-i32`, `%t-i64`, `%t-u8`, `%t-u32`, `%t-u64`,
+`%t-bool`, `%t-char-ptr`. Equality of primitive ctypes is `eq?`.
+Derived types are *not* deduped.
+
+### sym
+
+```scheme
+(define-record-type sym
+  (%sym name kind storage type slot)
+  sym?
+  (name    sym-name)      ; bv
+  (kind    sym-kind)      ; 'var 'fn 'typedef 'enum-const 'param 'label
+  (storage sym-storage)   ; 'auto 'static 'extern 'register | #f for typedef/enum
+  (type    sym-type)      ; ctype
+  (slot    sym-slot))     ; locals/params: fixnum byte offset
+                          ; globals/fn:    bv emitted-label
+                          ; enum-const:    fixnum value
+                          ; typedef:       #f
+                          ; label:         bv P1pp local label
+```
+
+### opnd
+
+```scheme
+(define-record-type opnd
+  (%opnd kind type ext lval?)
+  opnd?
+  (kind  opnd-kind)     ; 'imm 'frame 'global 'reg | sub-cases below
+  (type  opnd-type)     ; ctype
+  (ext   opnd-ext)      ; per kind
+  (lval? opnd-lval?))   ; #t = the slot/place holds an *address*
+                        ; #f = the slot/place holds a *value*
+```
+
+ext by kind:
+
+| `kind`   | `ext`                                           |
+|----------|-------------------------------------------------|
+| `imm`    | fixnum literal value (lval? always `#f`)        |
+| `frame`  | fixnum byte offset in current function frame    |
+| `global` | bv label name (the symbol's emitted P1pp label) |
+| `reg`    | symbol `'a0`..`'a3`                             |
+
+`reg` is transient — used for the result of a call before it's
+spilled. The vstack itself holds only `imm`, `frame`, and `global`
+opnds.
+
+### pstate
+
+```scheme
+(define-record-type pstate
+  (%pstate toks scope tags loops fn-ctx typedefs cg)
+  pstate?
+  (toks     ps-toks     ps-toks-set!)       ; remaining tok list (head = lookahead)
+  (scope    ps-scope    ps-scope-set!)      ; list of alists: (bv . sym)
+  (tags     ps-tags     ps-tags-set!)       ; list of alists: (bv . ctype) for struct/union/enum
+  (loops    ps-loops    ps-loops-set!)      ; list of loop-ctx (break/continue stack)
+  (fn-ctx   ps-fn-ctx   ps-fn-ctx-set!)     ; current function record, or #f at file scope
+  (typedefs ps-typedefs ps-typedefs-set!)   ; flat alist (bv . #t) — fast typedef-name lookup
+  (cg       ps-cg))                         ; codegen state (not mutated through pstate)
+```
+
+`typedefs` is a separate flat alist (not derived from `scope` at
+lookup time) because the lexer-vs-parser distinction in C requires
+fast "is this identifier a typedef-name *now*" answers during
+declaration parsing. Updated in lockstep with `scope`.
+
+### loop-ctx
+
+```scheme
+(define-record-type loop-ctx
+  (%loop-ctx kind tag has-continue?)
+  loop-ctx?
+  (kind          loop-ctx-kind)            ; 'while 'do 'for 'switch
+  (tag           loop-ctx-tag)             ; bv tag for libp1pp %_tag macros
+  (has-continue? loop-ctx-has-continue?))  ; #f for switch
+```
+
+### fn-ctx
+
+```scheme
+(define-record-type fn-ctx
+  (%fn-ctx name return-type params variadic? labels)
+  fn-ctx?
+  (name        fn-ctx-name)          ; bv
+  (return-type fn-ctx-return-type)   ; ctype
+  (params      fn-ctx-params)        ; list of sym
+  (variadic?   fn-ctx-variadic?)
+  (labels      fn-ctx-labels         ; alist (user-bv . emitted-bv)
+               fn-ctx-labels-set!))  ; mutated as `goto`/labels resolve
+```
+
+### cg
+
+```scheme
+(define-record-type cg
+  (%cg text data bss vstack frame-hi label-ctr str-pool globals fn-buf)
+  cg?
+  (text      cg-text)                       ; buf — final text section (all functions)
+  (data      cg-data)                       ; buf — initialized globals + string pool
+  (bss       cg-bss)                        ; buf — zero-init globals
+  (vstack    cg-vstack    cg-vstack-set!)   ; list of opnd
+  (frame-hi  cg-frame-hi  cg-frame-hi-set!) ; fixnum: frame bytes used so far
+  (label-ctr cg-label-ctr cg-label-ctr-set!)
+  (str-pool  cg-str-pool  cg-str-pool-set!) ; alist (bv-content . bv-label)
+  (globals   cg-globals   cg-globals-set!)  ; alist (bv-name . sym) — emitted globals
+  (fn-buf    cg-fn-buf    cg-fn-buf-set!))  ; buf — body of current function
+```
+
+Functions are emitted into `fn-buf`, then on `cg-fn-end` flushed
+through libp1pp's `%fn(...)` wrapper into `text`. This lets us know
+the final `frame-hi` before writing the prologue.
+
+## lex.scm
+
+Pure: bytestream + filename → token list. No file I/O, no macro
+awareness.
+
+```scheme
+(lex-tokenize src file) -> (list-of tok)
+;; src:    bv (the C bytestream)
+;; file:   bv (filename, for tok-loc)
+;; result: list ending in a single 'EOF tok; never #f
+;; aborts via die on unrecognized byte sequences
+```
+
+Internal contract:
+
+- Comments (`/* ... */`, `// ...`) are removed but produce no tokens.
+- Trigraphs and `\<newline>` line splicing are applied before
+  tokenization.
+- `NL` tokens are emitted at the end of every source line, even
+  blank ones — the preprocessor needs them for directive termination.
+- Adjacent string literals are **not** concatenated here; that's
+  pp's job (after macro expansion).
+- Keyword recognition: the lexer carries a fixed alist of
+  (keyword-bv . keyword-symbol) and emits `KW` directly for matches.
+  IDENT vs. KW is decided at lex time and never revised.
+- Punctuator longest-match table: `'<<=` before `'<<` before `'<`,
+  etc.
+
+Helpers exposed for unit tests:
+
+```scheme
+(lex-read-number src pos) -> (tok . pos')
+(lex-read-string src pos file) -> (tok . pos')
+(lex-read-ident src pos) -> (tok . pos')   ; produces IDENT or KW
+```
+
+### Test plan
+
+`tests/cc-lex/` mirrors `tests/scheme1/`. Each test is one `.c`
+fragment plus an `.expected-toks` file containing the expected
+serialized form (one tok per line, e.g. `KW int 1 1`). Driver script:
+
+```
+scheme1 cc/lex-test.scm < input.c | diff - expected-toks
+```
+
+## pp.scm
+
+Pure: token list + initial macro alist → expanded token list.
+
+```scheme
+(pp-expand toks initial-defines) -> (list-of tok)
+;; toks:            lex-tokenize output
+;; initial-defines: alist (bv . macro) — from -D flags
+;; result:          token list with HASH and NL stripped, KW/IDENT/INT/STR/CHAR/PUNCT/EOF only
+;; aborts via die on bad directive, undefined #if identifier-as-macro misuse, etc.
+```
+
+Internal structure:
+
+- A driver loop classifies each line (looking only at the leading
+  HASH or non-HASH state) and dispatches to a directive handler or
+  the macro-expansion engine.
+- The conditional stack: list of `(active? . has-taken?)` pairs.
+  `#if`/`#ifdef` push; `#elif`/`#else` flip; `#endif` pops.
+- Macro expansion uses C11 6.10.3.4 hide-set discipline. Each
+  emitted token's `hide` field is the union of hide-sets of its
+  constituent body tokens plus the macro's own name.
+- `defined NAME` is a special form — recognized and resolved
+  *before* macro expansion of an `#if` line.
+
+Directive handlers (each takes the tokens up to the next NL):
+
+```scheme
+(%pp-do-define  line state)
+(%pp-do-undef   line state)
+(%pp-do-if      line state)
+(%pp-do-ifdef   line state)
+(%pp-do-ifndef  line state)
+(%pp-do-elif    line state)
+(%pp-do-else    line state)
+(%pp-do-endif   line state)
+(%pp-do-error   line state)
+(%pp-do-line    line state)
+(%pp-do-pragma  line state)
+(%pp-do-include line state)   ;; always dies (per CC.md §Toolchain envelope)
+```
+
+`state` is a private record `pp-state` with fields
+`{macros cond-stack out file-base-line}`. Internal — not in data.scm.
+
+Constant-expression evaluator for `#if`:
+
+```scheme
+(pp-eval-cexpr toks macros) -> fixnum
+;; toks: tokens of the expression after macro expansion
+;; aborts on unrecognized identifier (treated-as-zero is wrong for our errors policy)
+```
+
+### Test plan
+
+`tests/cc-pp/` per-file tests, same shape as cc-lex. Inputs are
+already-tokenized fixtures (so pp can be exercised without the lexer)
+or `.c` files (full pipeline). Both flavors live side-by-side.
+
+## cg.scm
+
+Mutable codegen state and a value-stack-style emission API. The parser
+calls these and never touches output buffers directly.
+
+### Lifecycle
+
+```scheme
+(cg-init) -> cg          ; fresh state
+(cg-finish cg) -> bv     ; flush all buffers; result is final P1pp text
+```
+
+Every function lives between `cg-fn-begin` and `cg-fn-end`. Globals
+live outside.
+ +```scheme +(cg-fn-begin cg name params return-type) +;; name: bv +;; params: list of sym (their slots are pre-assigned by parser) +;; return-type: ctype +(cg-fn-end cg) ; emits epilogue, flushes fn-buf into cg-text under %fn(...) +``` + +### Vstack: push / pop / inspect + +```scheme +(cg-push cg opnd) ; push opnd onto vstack +(cg-pop cg) -> opnd ; remove and return top +(cg-top cg) -> opnd ; non-destructive +(cg-depth cg) -> fixnum +``` + +### Materialize + +```scheme +(cg-push-imm cg ctype value) -> opnd ; rval +(cg-push-string cg bv-content) -> opnd ; rval, char* — interns into str-pool +(cg-push-sym cg sym) -> opnd ; lval (var) or rval (fn name) +(cg-push-deref cg) -> opnd ; pop ptr-rval, push lval (no emission yet) +``` + +### Address & deref operators + +```scheme +(cg-take-addr cg) -> opnd ; pop lval, push its address as rval +(cg-load cg) -> opnd ; pop lval, push rval (loaded through address) +``` + +### Type conversions + +```scheme +(cg-cast cg to-type) -> opnd ; pop, push opnd cast to to-type; emits sign-extension etc. as needed +(cg-promote cg) -> opnd ; integer promotion (rank ≤ int → int) +(cg-arith-conv cg) ; usual arithmetic conversions on top two opnds +``` + +### Operators + +```scheme +(cg-binop cg op) -> opnd ; pop b, pop a, push (a op b) +(cg-unop cg op) -> opnd ; pop a, push (op a) +(cg-assign cg) -> opnd ; pop rval, pop lval, store, push rval (assignment yields the value) +``` + +`op` for binop is a symbol from: +`'add 'sub 'mul 'div 'rem 'and 'or 'xor 'shl 'shr 'eq 'ne 'lt 'le 'gt 'ge`. +For unop: `'neg 'bnot 'lnot`. + +Signed vs. unsigned dispatch is handled inside cg by inspecting the +operand types after `cg-arith-conv`. + +### Calls + +```scheme +(cg-call cg arity has-result?) -> opnd-or-#f +;; pops (arity + 1) opnds: function, then args left-to-right at top +;; (i.e. 
callable was pushed first, then args, so arg-N is on top) +;; emits arg-passing per P1 ABI, CALL or CALLR, captures return into a fresh frame slot +;; pushes result opnd; returns it (or #f if has-result? is #f) +``` + +### Structured control flow + +These take thunks so the parser can recursively emit the body: + +```scheme +(cg-if cg then-thunk) ; pop cond; emit %if_nez { (then-thunk) } +(cg-ifelse cg then-thunk else-thunk) ; pop cond; emit %ifelse_nez { ... }{ ... } +(cg-loop cg head-thunk body-thunk) -> tag ; head-thunk emits the cond test; tag returned to parser +(cg-loop-end cg tag) ; closes a %loop_tag +(cg-break cg tag) +(cg-continue cg tag) +``` + +For `for` and `do-while` the parser composes the building blocks +above; cg doesn't expose a dedicated for/do helper. (Three helpers +beat seven.) + +### switch helpers + +```scheme +(cg-switch-begin cg) -> swctx ; spill controlling expression to a slot +(cg-switch-case cg swctx const-int) ; emit %if_eq jump-to-case-label +(cg-switch-default cg swctx) ; emit B to default-label +(cg-switch-end cg swctx) ; close out +``` + +### Globals and data + +```scheme +(cg-emit-global cg sym init-bv-or-#f) ; init-bv: bytes for .data, or #f for .bss +(cg-emit-extern cg sym) ; declare without defining +(cg-intern-string cg bv-content) -> bv-label ; idempotent; used internally by cg-push-string +``` + +### Frame allocation (used internally and by parse for locals) + +```scheme +(cg-alloc-slot cg bytes align) -> offset ; bumps frame-hi; returns aligned offset +``` + +### Interaction-test mock + +`tests/cc-parse/` uses a swap-in `cg-trace.scm` that replaces cg.scm. +It provides every public entry point above but each call appends a +record to a global trace list: + +```scheme +(cg-trace-get) -> (list-of (op . args)) +``` + +Parser tests run a fragment, snapshot the trace, diff against +`expected-trace`. 
This is the contract that lets parse and cg evolve +independently — as long as parse emits the same sequence of cg calls, +cg internals can change freely. + +## parse.scm + +Mutates a `pstate`. Drives `cg`. Single entry point: + +```scheme +(parse-translation-unit ps) ; consumes ps-toks until EOF, emits via ps-cg +;; ps must have ps-toks set; everything else starts empty. +``` + +### Top-down structure + +Internal helpers, hierarchically: + +``` +parse-translation-unit + ├─ parse-decl-or-fn ; returns 'fn or 'decl + │ ├─ parse-decl-spec ; storage + qualifiers + base type + │ ├─ parse-declarator ; spiral grammar; returns (name ctype) + │ ├─ parse-init-list ; for variable initializers (incl. designated) + │ └─ parse-fn-body ; only if declarator is fn-typed AND next tok is `{` + ├─ parse-stmt + │ ├─ parse-compound-stmt + │ ├─ parse-if-stmt + │ ├─ parse-while-stmt + │ ├─ parse-do-stmt + │ ├─ parse-for-stmt + │ ├─ parse-switch-stmt + │ ├─ parse-return-stmt + │ ├─ parse-goto-stmt + │ └─ parse-expr-stmt + └─ parse-expr ; Pratt; takes min-bp + ├─ parse-primary + ├─ parse-postfix + ├─ parse-unary + ├─ parse-cast-or-unary ; `(T)e` vs. `(e)` + └─ parse-binary-rhs +``` + +### Token-stream API (private to parse.scm but conventional) + +```scheme +(peek ps) -> tok ; next token, not consumed +(peek2 ps) -> tok ; token after the next one, not consumed +(advance ps) -> tok ; consume and return +(expect-kw ps sym) -> tok ; KW match or die +(expect-punct ps sym) -> tok ; PUNCT match or die +(at-kw? ps sym) -> bool +(at-punct? ps sym) -> bool +``` + +`peek2` is needed in exactly two places: distinguishing +`(typename) cast-expr` from `(expr)` parenthesized expression, and +distinguishing labelled-statement `IDENT :` from expression-statement +`IDENT ...`. + +### Scope helpers + +```scheme +(scope-enter! ps) +(scope-leave! ps) +(scope-bind! ps name sym) ; aborts on duplicate at innermost frame +(scope-lookup ps name) -> sym-or-#f ; walks frames + +(tag-bind! ps name ctype) +(tag-lookup ps name) -> ctype-or-#f + +(typedef-add! 
ps name) ; updates ps-typedefs +(typedef? ps name) -> bool +``` + +### Pratt expression parser + +The binding-power table is a top-level alist: + +```scheme +(define %binop-bp + ;; (punct-symbol . (lhs-bp . rhs-bp)) + ;; left-assoc: rhs-bp = lhs-bp + 1 + ;; right-assoc: rhs-bp = lhs-bp - 1 + '((|*| 110 . 111) (|/| 110 . 111) (|%| 110 . 111) + ...)) +``` + +Driver: + +```scheme +(parse-expr-bp ps min-bp) ; Pratt climber +(parse-expr ps) ; equivalent to (parse-expr-bp ps 0) +``` + +`parse-expr` leaves the result on `ps-cg`'s vstack and returns the +opnd. Statements that don't consume the value follow with `cg-pop`. + +### Test plan + +`tests/cc-parse/` uses the cg-trace mock. Each test: + +``` +input.c -- C fragment +expected-trace -- one cg call per line, e.g. + (cg-push-imm i32 42) + (cg-push-imm i32 7) + (cg-binop add) + ... +``` + +The driver builds a token list (via the real lex+pp) and runs +`parse-translation-unit` against `cg-trace`. Any diff fails the test. + +## main.scm + +Driver. Roughly 80 lines. + +```scheme +(define (cc-main argv) + (let* ((args (parse-cli argv)) ; record: { input-path output-path defines } + (src (slurp-fd (open-or-die (cli-input args)))) + (toks (lex-tokenize src (cli-input args))) + (defines (cli-defines->alist (cli-defines args))) + (expanded (pp-expand toks defines)) + (cg (cg-init)) + (ps (make-pstate expanded cg))) + (parse-translation-unit ps) + (write-bv-fd (open-or-die-out (cli-output args)) + (cg-finish cg)) + 0)) + +(cc-main (argv)) +``` + +CLI: `cc input.flat.c -o output.P1pp [-D NAME[=val]] ...`. Strict — +unrecognized flags die. + +## Test infrastructure + +Five test trees, all using the same harness pattern as +`tests/scheme1/`: + +- `tests/cc-lex/` — feeds `.c` through `lex-tokenize`, diffs token + serialization. +- `tests/cc-pp/` — feeds tokens (or `.c`) through `pp-expand`, diffs + token serialization. +- `tests/cc-parse/` — feeds `.c` through lex+pp+parse with the cg-trace + mock, diffs the trace. 
+- `tests/cc-cg/` — directly calls cg APIs (handwritten Scheme test + programs), diffs the resulting P1pp bytes. +- `tests/cc-e2e/` — tiny `.c` programs compiled all the way through + the toolchain to native executables, run, exit-code checked. + +`tests/cc-parse/` and `tests/cc-cg/` are the seam that lets parse and +cg evolve independently. Anyone changing parse can keep iterating as long as +the trace tests stay green. Anyone changing cg can keep iterating as long as +the cg tests stay green and the trace contract is honored. + +## Out-of-scope here + +Deferred to follow-up docs once we start coding: + +- Exact P1pp text emitted for each cg primitive (precise opcodes, + spill discipline, libp1pp macro choices) — lives next to cg.scm + as a side document. +- The exact `<stdarg.h>` / `<stddef.h>` we ship — the headers we + bundle so the flattener has something to inline. +- Driver script for the pre-flatten pass (host shell tool, not part + of the Scheme compiler). +- Performance tuning: alist → tree, vstack list → array of frame + slots, etc. None of this affects the interfaces above. diff --git a/docs/CC.md b/docs/CC.md @@ -0,0 +1,446 @@ +# Minimal C subset (lispcc) + +Working doc. Baseline is C99; everything here is a delta against it. The +target is **just enough C** to compile + + `tcc-0.9.26-1147-gee75a10c/tcc.c` + +with the same defines used at MesCC's `tcc-mes` stage in +[live-bootstrap](../../live-bootstrap/steps/tcc-0.9.26/pass1.kaem): + +``` +-D BOOTSTRAP=1 +-D HAVE_LONG_LONG=1 +-D ONE_SOURCE=1 +-D TCC_TARGET_X86_64=1 +-D inline= +-D CONFIG_TCCDIR="..." ...etc +``` + +Notably **not** defined: `HAVE_FLOAT`, `HAVE_BITFIELD`, `HAVE_SETJMP`. +Those gate off entire code paths in tcc.c (floats, bitfield struct +support, setjmp-based error recovery), and we don't have to compile any +of it. + +The accepted surface is shaped by two intersecting constraints: + +1. **Lower bound** — what tcc.c (under those defines) actually uses. +2. 
**Upper bound** — what MesCC accepts, since MesCC already builds + tcc-mes and we're its replacement. Anything MesCC strips silently + (`const`, `inline`, `__attribute__`) we also strip silently. + +Things outside both bounds are cut. Things admitted are load-bearing. + +## Scope + +- **Single translation unit.** Input is one bytestream. The + preprocessor does no file I/O — `#include` is an external + pre-flattening pass (system headers + tcc.c's `#include "libtcc.c"` + / `"tcctools.c"` are spliced upstream of our compiler). See + [§Toolchain envelope](#toolchain-envelope). +- **P1-64 only.** Sizes assume LP64. Porting to P1-32 is out of scope. +- **No optimization.** Output P1pp is a stack-machine lowering with + every operand spilled to a frame slot. Codegen quality is a v2 + problem. + +## Toolchain envelope + +``` +tcc.c + system headers + │ + │ pre-flatten: resolve #include recursively, splice into one file + │ (separate tool: scheme1 or shell; not part of cc.scm) + ▼ +tcc.flat.c single bytestream, no #include + │ + │ scheme1 cc.scm + ▼ +tcc.P1pp our compiler's output + │ + │ catm with arch backend + libp1pp.P1pp + │ m1pp + ▼ +tcc.M1 + │ M0 + ▼ +tcc.hex2 + │ hex2 + ▼ +tcc-mes native ELF, replaces MesCC's tcc-mes +``` + +The pre-flatten pass is *not* a C preprocessor — it only resolves +`#include`. All other directives (`#define`, `#if`, …) are handled by +the in-Scheme preprocessor in pass 2. + +## Translation phases + +The C standard names eight phases. We collapse them to three: + +1. **Lex** — bytestream → token list. Trigraphs and line-splicing + (backslash-newline) are handled here, alongside numbers / strings / + identifiers / punctuators. Comments removed. Newlines preserved as + `NL` tokens (the preprocessor needs them to delimit directives). +2. **Preprocess** — token list → expanded token list. Directives + consumed, macros expanded, `NL` tokens stripped on exit. +3. **Parse + emit** — token list → P1pp text. xcc-style direct emit; + no AST. 
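To make the three-phase split concrete, here is a small C fragment in which each phase has visible work to do. It is an illustrative sketch, not a test from the repository; the names `ADD` and `phase_demo` are made up. The identifier `ADD` is split across a backslash-newline, so it only becomes a single token because splicing happens before tokenization; the preprocessor then expands the macro, and the parser sees plain arithmetic.

```c
/* Phase 2 (preprocess): an ordinary function-like macro. */
#define ADD(a, b) ((a) + (b))

/* Phase 1 (lex): the backslash-newline below is deleted before
   tokenization, so "AD" and "D" join into the single identifier ADD.
   The block comment inside the call is dropped without producing a token. */
int phase_demo(void)
{
    return AD\
D(40, /* removed by the lexer */ 2);
}
```

After phases 1 and 2 the parser is handed the tokens of `return ((40) + (2));`, which is all phase 3 needs to emit code.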
+ +## Lexical syntax + +Subset of C99 lexical grammar. + +- **Identifiers**: `[a-zA-Z_][a-zA-Z_0-9]*`. Universal character names + (`\uXXXX`) **not** supported. +- **Integers**: decimal, octal (`0…`), hex (`0x…`); suffixes + `u`, `U`, `l`, `L`, `ll`, `LL`, `ul`, `ull`, etc. (case-insensitive). + All values fit in `unsigned long long` (64 bits). +- **Floats**: **not** present. The lexer rejects floating-point + literals. (HAVE_FLOAT is off.) +- **Characters**: `'c'` and standard escapes `\n \t \r \\ \' \" \0 + \xNN \NNN`. `'\xNN'` is a `char`-typed value, not multi-character. + Multi-character constants (`'AB'`) are **not** supported. +- **Strings**: `"…"` with same escapes. Adjacent string literals + concatenate (`"a" "b"` ≡ `"ab"`). Wide strings (`L"…"`), UTF-8 + strings (`u8"…"`), UTF-16/32 (`u"…"`, `U"…"`) **not** supported. +- **Punctuators**: full C99 set, including digraphs `<: :> <% %> %:`. + Trigraphs are handled in lex. `##` and `#` are preprocessor-only. +- **Comments**: `// …` to end of line; `/* … */` block (no nesting). +- **Line splicing**: `\` immediately before newline removes both, + per the standard. +- **Whitespace**: space, tab, vertical tab, form feed, newline. + +## Preprocessor + +Directive set: + +- `#define NAME …` — object-like +- `#define NAME(p1, p2, …) …` — function-like +- `#define NAME(p1, …, …) …` — variadic, with `__VA_ARGS__` in body +- `#undef NAME` +- `#if expr`, `#ifdef NAME`, `#ifndef NAME` +- `#elif expr`, `#else`, `#endif` +- `#error msg…` — flush and exit nonzero +- `#line NN ["file"]` — accepted; only `__LINE__` / `__FILE__` honor it +- `#pragma …` — accepted and ignored (whole line consumed) +- `#include …` — **rejected**. Pre-flattening handles this upstream. + We refuse rather than silently ignore so an unflattened input fails + loudly. + +Operators inside the body of a function-like macro: + +- `#param` — stringize. Result is a string literal of `param`'s + pre-expansion tokens. +- `a##b` — token paste. 
Performed before rescanning for further + expansion. + +Built-in macros: + +- `__FILE__` — current source file (a string literal) +- `__LINE__` — current line number (a decimal integer) +- `__STDC__` — `1` +- `__LISPCC__` — `1` (our analogue of MesCC's `__MESC__`) + +Expression evaluator (used by `#if`/`#elif`): + +- All integer operators including `defined NAME` / `defined(NAME)`. +- Identifiers that aren't macros evaluate to `0`. (Standard.) +- Result is a 64-bit signed integer. + +Macro expansion uses C11 6.10.3.4 hide-set discipline. Each token +carries the set of macro names already expanded into it; an identifier +inside its own hide-set is not re-expanded. This is the standard +defense against `#define A B\n#define B A`. + +## Types + +### Primitives (P1-64) + +| Type | Size (bytes) | Align | Notes | +|-----------------------|--------------|-------|------------------------------| +| `void` | — | — | only as ptr-target / fn-ret | +| `char` | 1 | 1 | signed by default | +| `signed char` | 1 | 1 | | +| `unsigned char` | 1 | 1 | | +| `short` | 2 | 2 | | +| `unsigned short` | 2 | 2 | | +| `int` | 4 | 4 | | +| `unsigned int` | 4 | 4 | | +| `long` | 8 | 8 | LP64 | +| `unsigned long` | 8 | 8 | | +| `long long` | 8 | 8 | same as `long` in LP64 | +| `unsigned long long` | 8 | 8 | | +| pointer | 8 | 8 | tag-free; raw native address | +| `_Bool` | 1 | 1 | values: `0`, `1` | + +`size_t` is `unsigned long`; `ptrdiff_t` is `long`; `intptr_t` / +`uintptr_t` are `long` / `unsigned long`. These typedefs come from the +flattened headers; the language doesn't bake them in. + +**Not present**: `float`, `double`, `long double`, `_Complex`, +`_Imaginary`, `__int128`. `float.h` macros and `<math.h>` are +unavailable to the input. + +### Derived types + +- **Pointer**: `T *`, multi-level. `void *` is a generic pointer that + freely converts to and from any other object pointer. +- **Array**: `T[N]` with `N` a constant expression evaluating to a + positive integer. 
`T[]` is allowed in function parameter position + (decays to `T*`) and as a flexible-array tail field. **VLAs** + (`T[expr]` with non-constant `expr`) are **not** supported. +- **Function**: `T(P1, P2, ..., Pn)` and `T(P1, ..., ...)` (variadic). + Pointers to functions, arrays of pointers to functions, and + functions returning pointers to functions all parse via the + spiral-declarator grammar. Old-style (K&R) function definitions are + **not** supported. +- **Struct / union**: declared with `struct tag { ... }` or + `union tag { ... }`. Tag and member namespaces are separate from + identifiers. Forward declarations (`struct tag;`) supported. + Anonymous structs/unions inside other structs are **not** supported. + **Bitfields** (`int x : 3`) are **not** supported (HAVE_BITFIELD off + in our target). Flexible array member as last field allowed: + `struct s { int n; T data[]; }`. +- **Enum**: `enum tag { A, B = 7, C }`. Underlying type is `int`. + Constants are usable in constant expressions. +- **Typedef**: `typedef T name;` — name becomes a type-name token in + later declarations. Must be visible at parse time of any use + (lexer/parser cooperation: typedef names are tracked in the + current scope). + +### Qualifiers + +- `const`, `volatile`, `restrict` — **parsed and discarded**. + We don't enforce const-correctness, don't suppress optimization + on volatile (no optimizer to suppress), and don't honor restrict. + Same as MesCC. +- `_Atomic`, `_Thread_local` — **rejected** (lex error if they appear; + tcc.c doesn't use them, so this won't fire). 
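The derived types and qualifiers described above can be exercised together in a few lines. The following is a hedged sketch: every identifier (`color`, `vec_t`, `binop_fn`, `add_ints`, `sum_first`, `blue_value`) is hypothetical, and the `#include` stands in for what would reach lispcc already flattened.

```c
#include <stdlib.h>  /* for lispcc this text would arrive pre-flattened */

enum color { RED, GREEN = 7, BLUE };   /* BLUE is 8; underlying type is int */

typedef struct vec {
    int n;
    int data[];                        /* flexible array member, last field */
} vec_t;

typedef int (*binop_fn)(int, int);     /* typedef for a pointer to function */

static int add_ints(int a, int b) { return a + b; }

/* `const` is parsed and discarded by lispcc. A conforming host compiler
   still enforces it, so this code never writes to `count`. */
int sum_first(const int count)
{
    vec_t *v = malloc(sizeof(vec_t) + (size_t)count * sizeof(int));
    binop_fn f = add_ints;
    int i, total = 0;

    if (v == NULL)
        return -1;
    v->n = count;
    for (i = 0; i < count; i++)
        v->data[i] = i + 1;            /* 1, 2, ..., count */
    for (i = 0; i < v->n; i++)
        total = f(total, v->data[i]);  /* call through the function pointer */
    free(v);
    return total;
}

int blue_value(void) { return BLUE; }
```

None of this uses bitfields, anonymous members, VLAs, or floating point, so it stays inside the accepted subset as described.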
+ +## Declarations and storage + +### Declarators + +Full C99 spiral-declarator grammar: + +``` +int *p // pointer to int +int *p[10] // array of 10 pointers to int +int (*p)[10] // pointer to array of 10 ints +int (*f)(int, int) // pointer to function (int,int) returning int +int *f(int) // function (int) returning pointer to int +char *(*tab[5])(int) // array of 5 pointers to function (int) returning char* +``` + +### Storage classes + +- `extern` — declares without defining. References resolve at link + time. Honored. +- `static` at file scope — gives internal linkage; prevents the symbol + from being emitted as a P1pp `:public_label`. Honored. +- `static` at block scope — single shared instance, zero-initialized + by default. Honored. +- `auto` — accepted, no effect (the default for block scope). +- `register` — accepted, no effect. +- `typedef` — handled specially (see Types). + +### Function definitions + +``` +[storage] [type-quals] return-type name(params) { body } +``` + +Parameter list forms: + +- `void` (zero parameters) +- `T1 p1, T2 p2, ...` +- `T1 p1, T2 p2, ..., ...` (variadic, `va_list` discipline below) + +K&R-style (`int f(a, b) int a, b; { … }`) is **not** supported. + +### Variable initializers + +- Scalars: `T x = expr;` — `expr` must be a constant for static-storage + variables; arbitrary for auto-storage. +- Arrays: `T a[N] = { e0, e1, ... };` and `T a[] = { ... };` (size + inferred). String-literal initializer for `char[]` allowed. +- Structs: `S s = { e0, e1, ... };` (positional). Designated + initializers (`{ .field = ... }`) **supported** at struct top level + only — required by tcc.c. +- Nested initializers brace-flatten the obvious way. + +### Inline / attributes + +- `inline` — already removed by `-D inline=` in the bootstrap. Our + preprocessor would also strip the keyword if it appeared. No + effect on codegen either way. +- `__attribute__((...))` — parsed and discarded everywhere it + appears in declarations. 
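As a rough illustration of the declaration forms listed in this section, the sketch below combines a spiral declarator, a top-level designated initializer, and a block-scope `static`. All names (`op_add`, `op_mul`, `op_table`, `apply_op`, `next_id`) are invented for the example and do not come from tcc.c.

```c
static int op_add(int a, int b) { return a + b; }
static int op_mul(int a, int b) { return a * b; }

struct op {
    const char *name;
    int (*fn)(int, int);     /* pointer to function (int,int) returning int */
};

/* Array of 2 pointers to function (int,int) returning int.
   Static storage, so the initializers must be constants; function
   addresses qualify. */
static int (*op_table[2])(int, int) = { op_add, op_mul };

int apply_op(int which, int a, int b)
{
    /* Designated initializer at struct top level; auto storage, so the
       initializer expressions may be arbitrary. */
    struct op o = { .name = "op", .fn = op_table[which] };
    return o.fn(a, b);
}

/* Block-scope static: a single shared instance, zero-initialized. */
int next_id(void)
{
    static int id;
    return ++id;
}
```

Calling `next_id` repeatedly returns increasing values because `id` lives in static storage rather than in the frame.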
+ +## Statements + +All standard C statements: + +- expression statement, including the empty `;` +- compound statement `{ ... }`, with declarations interleaved with + statements (C99-style, not K&R block prologue) +- `if (e) S` / `if (e) S else S` +- `while (e) S`, `do S while (e);` +- `for (init; cond; step) S` — `init` may be a declaration (C99) +- `switch (e) { case K: ... default: ... }` — `case K` requires `K` + to be an integer constant expression; cases fall through unless a + `break` intervenes +- `break;`, `continue;` +- `goto label;`, `label:` — function-scope labels +- `return;`, `return e;` +- declaration as statement (C99) + +Cut: + +- statement expressions `({ ... })` (GCC ext) — tcc.c doesn't use them +- `__label__` (GCC) — N/A +- compound literals `(T){ ... }` — tcc.c doesn't use them +- `_Generic` selection — tcc.c doesn't use it +- inline asm `__asm__(...)` — N/A; tcc.c gates this on conditions + that aren't active at the tcc-mes stage + +## Expressions + +All standard C operators with standard precedence and associativity: + +| Tier (high → low) | Operators | +|-------------------|-----------| +| postfix | `a[i]`, `f(a,...)`, `s.m`, `p->m`, `e++`, `e--` | +| unary | `++e`, `--e`, `&e`, `*e`, `+e`, `-e`, `~e`, `!e`, `sizeof`, `(T)e` | +| multiplicative | `*`, `/`, `%` | +| additive | `+`, `-` | +| shift | `<<`, `>>` | +| relational | `<`, `<=`, `>`, `>=` | +| equality | `==`, `!=` | +| bitwise | `&`, `^`, `|` (three separate tiers, highest first) | +| logical | `&&`, `||` (two separate tiers, `&&` first) | +| conditional | `?:` | +| assignment | `=`, `+=`, `-=`, `*=`, `/=`, `%=`, `<<=`, `>>=`, `&=`, `^=`, `|=` | +| comma | `,` | + +Notes: + +- `sizeof(T)` and `sizeof e` are both supported. `sizeof e` does **not** + evaluate `e` (standard). +- Integer promotion (rank ≤ `int` → `int`) and usual arithmetic + conversions are performed automatically. Pointer arithmetic scales by + pointee size. +- Implicit conversions for assignment, return, and function arguments + (incl. 
promotion of variadic args to `int` / `unsigned int` / + pointer / `long` / `unsigned long`). +- String literals have type `char *` (not `const char[N]`) for our + purposes — we strip const, and tcc.c writes through string literals + in a few places. +- `_Alignof` — **not** supported. tcc.c uses no alignment intrinsics. + +### Variadic argument access + +``` +#include <stdarg.h> // pre-flattened in +void f(int n, ...) { + va_list ap; va_start(ap, n); + int x = va_arg(ap, int); + va_end(ap); +} +``` + +`va_list`, `va_start`, `va_arg`, `va_end` are macros from the +flattened header. They expand to direct frame-slot reads keyed off the +`...` slot offset our codegen exposes. Implementation detail: our +`stdarg.h` substitute is one of the headers shipped with the +compiler. + +## Standard library expectations + +Our compiler doesn't bundle libc. The bootstrap script links the +output against the same `libc+tcc` archive MesCC uses, which provides: + +- `<stdio.h>`: `FILE`, `fopen`, `fclose`, `fread`, `fwrite`, `fprintf`, + `fputs`, `fgetc`, `getc`, `printf`, `sprintf`, `vsnprintf`, … +- `<stdlib.h>`: `malloc`, `free`, `realloc`, `exit`, `atoi`, `strtol`, + `qsort`, … +- `<string.h>`: `strlen`, `strcpy`, `strncpy`, `strcmp`, `strncmp`, + `strcat`, `strchr`, `strrchr`, `strstr`, `memset`, `memcpy`, + `memmove`, `memcmp`, … +- `<ctype.h>`, `<errno.h>`, `<unistd.h>`, `<fcntl.h>`, `<sys/stat.h>`, + … +- `<stdarg.h>`, `<stddef.h>`, `<limits.h>` — supplied by us. + +Nothing from `<setjmp.h>` is required at the tcc-mes stage +(`HAVE_SETJMP` off). `<math.h>` is not required either (`HAVE_FLOAT` off). + +Built-in functions our compiler *recognizes* (vs. linking against): + +- `__builtin_va_start`, `__builtin_va_arg`, `__builtin_va_end` — + expanded inline by the codegen. The `<stdarg.h>` we ship aliases + the standard names to these. +- `alloca` — left as a library call. tcc.c references it only through + the `__builtin_alloca` definition it provides to compiled programs, + not for its own use. 
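A slightly fuller variadic function than the fragment above shows how the default argument promotions interact with `va_arg`. This is a sketch using the standard `<stdarg.h>` interface; `sum_ints` and `sum_demo` are hypothetical names, and how lispcc's own header maps these macros onto frame slots is not shown here.

```c
#include <stdarg.h>

/* Sum `n` int arguments passed after the fixed parameter. */
int sum_ints(int n, ...)
{
    va_list ap;
    int total = 0;
    int i;

    va_start(ap, n);
    for (i = 0; i < n; i++)
        total += va_arg(ap, int);  /* char and short arguments arrive as int */
    va_end(ap);
    return total;
}

int sum_demo(void)
{
    char c = 1;
    short s = 2;
    /* c and s are promoted to int at the call site, so reading them
       back with va_arg(ap, int) is correct. */
    return sum_ints(3, c, s, 39);
}
```

Reading a promoted argument back with a narrower type such as `va_arg(ap, char)` would be an error in standard C, which is why the callee always asks for `int` here.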
+ +## Cut from C99 / C11 + +Kept explicit so additions are deliberate. + +| Feature | Status | Rationale | +|-----------------------------------------------|----------|------------------------------------------------| +| Floats / doubles / `_Complex` | rejected | HAVE_FLOAT off | +| `long double` | rejected | no FP | +| Bitfields | rejected | HAVE_BITFIELD off | +| `setjmp` / `longjmp` | not lib | HAVE_SETJMP off | +| VLAs | rejected | tcc.c doesn't use; complicates frame layout | +| Compound literals `(T){...}` | rejected | tcc.c doesn't use | +| Statement expressions `({...})` (GCC) | rejected | tcc.c doesn't use | +| `_Generic` | rejected | not used | +| `_Atomic`, `_Thread_local` | rejected | not used | +| `_Alignof`, `_Alignas` | rejected | not used | +| `_Static_assert` | rejected | not used | +| Wide / UTF strings (`L"…"`, `u8"…"`) | rejected | not used | +| Anonymous struct/union members | rejected | not used | +| Multi-character constants (`'AB'`) | rejected | not used | +| Universal character names (`\uXXXX`) | rejected | identifier set is ASCII only | +| K&R-style function definitions | rejected | tcc.c uses ANSI | +| Nested function definitions (GCC) | rejected | not used | +| Inline assembly (`__asm__`) | rejected | not used at this stage | +| `__label__` (GCC) | rejected | not used | +| `#include` | rejected | external pre-flatten step | +| `const`, `volatile`, `restrict` | parsed, discarded | match MesCC | +| `inline` | parsed, discarded | -D inline= in bootstrap | +| `__attribute__((...))` | parsed, discarded | match MesCC | +| `register`, `auto` storage classes | parsed, no effect | | + +## Undefined behavior policy + +Following [LISP.md](LISP.md)'s "Primitive failure" stance: out-of-bounds +array access, signed integer overflow, dereferencing a null or +uninitialized pointer, integer division by zero, and modifying a string +literal are **undefined**. 
The compiler emits no runtime checks; the +generated P1pp will crash, loop, or produce nonsense, and that's +acceptable. + +The compiler itself aims to be **deterministic**: the same input bytes +produce identical output bytes. Errors detected at compile time +(syntax errors, type errors, unresolved identifiers) abort with a +diagnostic on stderr and a nonzero exit code. No partial output is +written. + +## Validation milestones + +Status legend: `[x]` done · `[~]` in progress · `[ ]` not started. + +1. [ ] Self-tests: a tests/cc/ tree mirroring tests/scheme1/ — one + tiny `.c` file per language feature, exit-status-driven. +2. [ ] Compile a hand-written single-file C "hello world" through to + ELF. +3. [ ] Compile the mes libc unified-libc.c (the same file MesCC builds + into libc.a). +4. [ ] Compile tcc.c (under the tcc-mes defines) → tcc-lispcc; verify + `tcc-lispcc -version` runs. +5. [ ] Use tcc-lispcc to build tcc-boot0; verify checksum matches the + live-bootstrap reference. + +Hitting (5) is the bootstrap milestone — at that point lispcc has +fully replaced MesCC in the chain.
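For milestone 1, a per-feature self-test could look roughly like the sketch below, which checks `switch` fall-through. It is only an assumption about the shape of such a file: `classify` and `switch_selftest` are invented names, and in an actual exit-status-driven test the file's `main` would presumably just return the value of `switch_selftest()`.

```c
/* Fall-through from case 1 into case 2 is intentional. */
static int classify(int x)
{
    int r = 0;
    switch (x) {
    case 1:
        r += 1;      /* no break: continues into case 2 */
    case 2:
        r += 10;
        break;
    default:
        r = -1;
    }
    return r;
}

/* Returns 0 on success, otherwise the number of the failed check,
   so the process exit status identifies what went wrong. */
int switch_selftest(void)
{
    if (classify(1) != 11) return 1;
    if (classify(2) != 10) return 2;
    if (classify(3) != -1) return 3;
    return 0;
}
```

Keeping each file to one feature means a nonzero exit status points directly at the construct the compiler mishandled.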