boot2

Playing with the boostrap
git clone https://git.ryansepassi.com/git/boot2.git
Log | Files | Refs | README

commit 83b6da62fabbdda3a189d49eba611679a4781546
parent 9d8abb8d6a2cd646c041021c86a3337af964f64c
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Mon, 27 Apr 2026 14:03:58 -0700

docs rm old

Diffstat:
Ddocs/CC-CONTRACTS.md | 578------------------------------------------------------------------------------
Ddocs/CC-INTERNALS.md | 764-------------------------------------------------------------------------------
Ddocs/CC-PUNCHLIST.md | 464-------------------------------------------------------------------------------
Ddocs/LISP-C.md | 1105-------------------------------------------------------------------------------
Ddocs/LISP-PMATCH.md | 259-------------------------------------------------------------------------------
Ddocs/SCHEME1-R7RS-TODO.md | 118-------------------------------------------------------------------------------
6 files changed, 0 insertions(+), 3288 deletions(-)

diff --git a/docs/CC-CONTRACTS.md b/docs/CC-CONTRACTS.md @@ -1,578 +0,0 @@ -# lispcc contracts - -Frozen interfaces between modules. Engineers must not diverge from -these without proposing a change. This document is the source of truth -for the symbol alphabets, test formats, ABI, and phase-1 milestone -referenced from [CC-INTERNALS.md](CC-INTERNALS.md). - -## 1. Symbol alphabets - -Every record's `kind`-style fields use these exact symbols. Adding a -symbol = updating this section first. - -### 1.1 `tok-kind` - -``` -IDENT KW INT STR CHAR PUNCT HASH NL EOF -``` - -Uppercase to distinguish from value-level symbols. `IDENT` carries an -unrecognized identifier; `KW` is one of the symbols in §1.3. - -### 1.2 `PUNCT` value symbols - -The lexer produces `tok-value` symbols for punctuators per the -following table. **Names are mandatory** — no engineer may use the raw -`'+`, `'*`, etc. as symbols, because several C punctuator characters -(`%`, `|`, `,`, `;`, `(`, `)`, `{`, `}`, `[`, `]`, `.`, `#`) cannot -form valid scheme1 symbols. We use named symbols for *all* punctuators -to keep the scheme uniform. - -| C | Symbol | C | Symbol | -|---------|-------------|----------|------------| -| `[` | `lbrack` | `==` | `eq2` | -| `]` | `rbrack` | `!=` | `ne` | -| `(` | `lparen` | `<` | `lt` | -| `)` | `rparen` | `>` | `gt` | -| `{` | `lbrace` | `<=` | `le` | -| `}` | `rbrace` | `>=` | `ge` | -| `.` | `dot` | `<<` | `shl` | -| `->` | `arrow` | `>>` | `shr` | -| `,` | `comma` | `&&` | `land` | -| `;` | `semi` | `\|\|` | `lor` | -| `:` | `colon` | `&` | `amp` | -| `?` | `qmark` | `\|` | `bar` | -| `...` | `ellipsis` | `^` | `caret` | -| `++` | `inc` | `~` | `tilde` | -| `--` | `dec` | `!` | `bang` | -| `+` | `plus` | `=` | `assign` | -| `-` | `minus` | `+=` | `plus-eq` | -| `*` | `star` | `-=` | `minus-eq` | -| `/` | `slash` | `*=` | `star-eq` | -| `%` | `pct` | `/=` | `slash-eq` | -| `#` | `hash` | `%=` | `pct-eq` | -| `##` | `paste` | `<<=` | `shl-eq` | -| | | `>>=` | `shr-eq` | -| | | `&=` | `amp-eq` | -| | | `^=` | `caret-eq` | -| | | `\|=` | `bar-eq` | - -Digraphs (`<:` `:>` `<%` `%>` `%:` `%:%:`) lex as their standard -equivalents: same symbol on the right-hand side of the table. - -### 1.3 `KW` value symbols - -``` -;; storage -auto register static extern typedef -;; qualifiers (parsed and discarded) -const volatile restrict inline -;; type specifiers -void char short int long signed unsigned _Bool -;; rejected type specifiers (lexed as KW so we get clean diagnostics) -float double -;; aggregates -struct union enum -;; statements -if else while do for switch case default break continue return goto -;; operators -sizeof -;; reserved-and-rejected (lexed as KW so we error crisply) -_Generic _Atomic _Thread_local _Alignof _Alignas _Static_assert -_Complex _Imaginary -``` - -Anything matching the C identifier grammar that is **not** in this -list lexes as `IDENT`. - -### 1.4 `ctype-kind` - -``` -void i8 u8 i16 u16 i32 u32 i64 u64 bool -ptr arr fn struct union enum -``` - -Char is `i8` (`signed char`) or `u8` (`unsigned char`); `char` itself -is `i8` (we treat plain `char` as signed, matching MesCC and most -compilers). Long and long long collapse to `i64`/`u64` on P1-64. - -### 1.5 `opnd-kind` - -``` -imm frame global reg -``` - -`reg` is transient — used for the result of a call before it spills -to a frame slot. The vstack itself never holds `reg` opnds; cg -materializes through `reg` only inside a single emission step. - -### 1.6 `macro-kind` - -``` -obj fn fn-vararg -``` - -### 1.7 `sym-kind` - -``` -var fn typedef enum-const param label -``` - -### 1.8 `sym-storage` - -``` -auto static extern register -``` - -`#f` for `typedef`, `enum-const`, and `label` symbols (storage -class doesn't apply). - -### 1.9 `loop-ctx-kind` - -``` -while do for switch -``` - -### 1.10 `reg` opnd register names - -``` -a0 a1 a2 a3 -``` - -Only argument registers. Saved registers (`s0`..`s3`) and temporaries -(`t0`..`t2`) are cg-private; never exposed as opnd payload. - -### 1.11 `cg-binop` and `cg-unop` operator symbols - -``` -;; cg-binop: -add sub mul div rem -and or xor shl shr -eq ne lt le gt ge - -;; cg-unop: -neg ;; arithmetic negate -bnot ;; bitwise complement (~) -lnot ;; logical not (!) -``` - -These are abstract operations independent of source-level PUNCT -symbols; the parser maps PUNCT → cg-op (e.g., `'plus` → `'add`, -`'eq2` → `'eq`). - -## 2. Test serialization formats - -Two test shapes coexist: - -- **Pure-transformation suites** (`cc-lex`, `cc-pp`) byte-diff a - Scheme-readable serialization (§2.1). -- **Codegen / language suites** (`cc-cg`, `cc`) - compile-and-run the emitted program and assert runtime behavior - (§2.2). No P1pp-text goldens. - -### 2.1 Token line format - -One token per line, as a Scheme list: - -``` -(KIND VALUE FILE LINE COL) -``` - -- **KIND**: bare symbol from §1.1. -- **VALUE** rendering depends on KIND: - - `IDENT`, `STR`: bytevector literal `"..."` with `\n \t \r \\ \"` - escapes; non-ASCII bytes as `\xNN`. - - `INT`, `CHAR`: decimal integer. - - `KW`, `PUNCT`: bare symbol from §1.2 / §1.3. - - `HASH`, `NL`, `EOF`: `#f`. -- **FILE**: bytevector literal. -- **LINE**, **COL**: decimal integers (1-based). - -The `tok-hide` field is **not** serialized — it is implementation -detail of the preprocessor. - -Example for `int main() { return 0; }` in `t.c`: - -``` -(KW int "t.c" 1 1) -(IDENT "main" "t.c" 1 5) -(PUNCT lparen "t.c" 1 9) -(PUNCT rparen "t.c" 1 10) -(PUNCT lbrace "t.c" 1 12) -(KW return "t.c" 1 14) -(INT 0 "t.c" 1 21) -(PUNCT semi "t.c" 1 22) -(PUNCT rbrace "t.c" 1 24) -(EOF #f "t.c" 1 25) -``` - -Trailing whitespace and `;`-comments in the golden file are ignored. - -### 2.2 Runtime fixture format - -Every cc-cg / cc fixture compiles to a runnable program -whose runtime behavior is the assertion. Two sibling files describe -the expectation: - -``` -<name>.expected-exit # decimal integer; default 0 if absent -<name>.expected # exact stdout match; default empty if absent -``` - -Stdout and stderr are merged in the runner. The harness exits non-zero -if either expectation fails. - -Fixture-source conventions: - -- **cc-cg** (`<name>.scm`): drives the cg API directly, ending in - `(write-bv-fd 1 (cg-finish cg))`. The fixture must define every - symbol it references — `cg-call`s into externs are out of scope for - this suite (no libc linkage). -- **cc** (`<name>.c`): a complete C translation unit including a - `main` that returns the asserted value. The full compiler driver - (`cc/cc.scm`: lex+pp+parse+cg) runs against it. Holds both - feature-targeted drills and full-envelope scenarios; once multi-TU - / libc linkage land, those fixtures live here too. - -Negative tests (compiler is supposed to die) set `expected-exit` to a -non-zero value and may rely on the diagnostic-prefix check in §2.3 -rather than asserting exact stdout. - -### 2.3 Diagnostic format - -Already canonical from CC-INTERNALS: - -``` -<file>:<line>:<col>: error: <msg>: <irritants...> -``` - -Tests for failure paths verify: -- exit status is 1 -- stderr contains the expected `<file>:<line>:<col>: error:` prefix - -The `<msg>` body is **not** matched character-for-character (so we -can refine wording without breaking tests); only the prefix and a -keyword-substring of the engineer's choice. - -## 3. Frame layout / parameter ABI - -### 3.1 cg-fn-begin contract - -```scheme -(cg-fn-begin cg name params return-type) -> param-syms -;; name: bv (un-mangled C identifier) -;; params: list of (name-bv . ctype) -;; return-type: ctype -;; param-syms: alist (name-bv . sym), each sym already bound to a frame slot -``` - -Inside `cg-fn-begin`, cg: - -1. Allocates one frame slot per parameter via `cg-alloc-slot`. Slot - width = `ctype-size` rounded up to 8 (`align-up`); align = 8. - (Yes, every param costs at least 8 bytes. P1-64 frame is - word-stride; we don't pack.) -2. Begins emitting into `cg-fn-buf`. Does **not** yet emit the - prologue — that's deferred to `cg-fn-end` once `frame-hi` is final. -3. Emits the param-spill code into a "prologue prefix" buffer - (private to cg): for params 0..3, `ST aN, [sp + slotN]`; for - params 4+, `LDARG t0, K` then `ST t0, [sp + slotK]`. -4. Returns the param-sym alist. Parser binds these into the function- - body scope. - -### 3.2 cg-fn-end contract - -```scheme -(cg-fn-end cg) -``` - -cg: - -1. Reads final `frame-hi` (highest byte allocated). -2. Emits the per-function preamble (an M1pp `%struct` is **not** - used — slots are numeric byte offsets, baked into the body text - already buffered in `fn-buf`). -3. Wraps the prologue-prefix + fn-buf inside a libp1pp `%fn` macro: - - ``` - %fn(<mangled-name>, <frame-hi-aligned-up-to-16>, { - <prologue-prefix bytes> - <fn-buf bytes> - ::ret - LD a0, [sp + <return-slot>] - ; LD a1, [sp + <return-slot> + 8] when ret-type size > 8 - }) - ``` -4. Flushes the result into `cg-text`, clears `fn-buf` and the - prologue-prefix buffer, resets `vstack`, `frame-hi`, and the - function-local label counter. - -The frame size is rounded up to 16 to satisfy the P1 stack-align -contract. - -The return slot itself is sized to `max(8, ctype-size(ret-type))` -rounded up to 8, 8-byte aligned. Slot width is what dispatches the -load epilogue across the three result conventions defined in -[P1.md §Arguments and return values](P1.md#arguments-and-return-values): - -| Width | Convention | Epilogue | -|---|---|---| -| ≤ 8B | one-word direct | `LD a0, [sp + slot]` | -| 9–16B | two-word direct | `LD a0, [sp + slot]; LD a1, [sp + slot + 8]` | -| > 16B | indirect-result | `LD a0, [sp + sret-slot]` (reload caller's buffer ptr) | - -For the indirect-result convention (struct/union return > 16B), the -caller materializes a fresh receive slot's address in `a0` before the -call; explicit args 0..2 shift to `a1..a3` and args 3+ stage to -incoming stack-arg slots 0+. The callee's prologue spills the -incoming `a0` to a private sret-slot; `cg-return` copies the source -struct's bytes through that saved pointer; the epilogue reloads -`a0` from the sret-slot so the convention's "a0 unchanged on -return" holds even if the body clobbered `a0`. - -The cg lowers all three conventions; the parser surface (return -statement, struct-typed call result) is identical regardless of -which convention applies. - -### 3.3 Loop tag protocol - -```scheme -(cg-loop cg head-thunk body-thunk) -> tag -``` - -`cg-loop` allocates a fresh per-function tag (`L0`, `L1`, …), emits -the libp1pp `%loop_tag(<tag>, { … })` wrapper, runs `head-thunk` -inside the loop head (it is expected to leave the condition opnd on -the vstack), pops the condition and emits an `%if_eqz(t0, %break(tag))`, -then invokes `body-thunk` **with the tag as its single argument**: - -```scheme -(cg-loop cg - (lambda () (cg-push-imm cg %t-i32 1)) - (lambda (tag) - (cg-continue cg tag) - (cg-break cg tag))) -``` - -The parser uses the same tag for any `cg-break` / `cg-continue` calls -made during body emission. `cg-loop` returns the tag to its caller as -well, so post-loop teardown code may reference it; `cg-loop-end` is a -no-op kept for symmetry. - -Switch dispatch follows the same pattern: `cg-switch-begin` returns a -`swctx` whose `swctx-end-tag` accessor exposes cg's break-target tag -to the parser. - -### 3.4 Outgoing-arg staging - -When `cg-call` is asked to emit a call with arity > 4, it stages -args 4..(N-1) into the *low-addressed* prefix of the current frame -at `[sp + 0*8]`, `[sp + 1*8]`, etc., per LIBP1PP.md §Frame locals. -cg tracks the maximum staging count seen across the function and -reserves that prefix at fn-end before any other slots — i.e., -`cg-alloc-slot`'s first allocation comes *after* the staging area. - -The accounting is internal to cg. Parse never sees staging slots. - -### 3.5 cg-alloc-slot contract - -```scheme -(cg-alloc-slot cg bytes align) -> offset -``` - -- `bytes` = total size needed (e.g., 4 for `int`, 40 for `int[10]`, - `sizeof(struct foo)` for a struct). -- `align` = required alignment (1, 2, 4, or 8). -- Returns numeric byte offset relative to `sp` post-`%enter`. - -cg first aligns `frame-hi` up to `align`, returns that as the -offset, then bumps `frame-hi` by `bytes`. Slots are not reused -across scopes (we're optimizing for compiler simplicity, not frame -size). Local arrays and structs request their full size in one call. - -## 4. Conversion responsibility - -The parser drives type semantics; cg is type-aware only enough to -choose signed-vs-unsigned variants and to scale pointer arithmetic. - -### 4.1 Parser's responsibilities - -The parser **must** call cg in this order around each operation: - -| Source | Required parser actions before the operation | -|--------|----------------------------------------------| -| `e1 + e2` (and other arith binops) | (a) parse e1 → if lval, `cg-load`; (b) `cg-promote` if rank < int; (c) parse e2 same way; (d) `cg-arith-conv` to bring both to common type; (e) `cg-binop add` | -| `*p` | parse p → if lval, `cg-load`; then `cg-push-deref` | -| `&x` | parse x → must be lval; then `cg-take-addr` | -| `(T)e` | parse e → if lval, `cg-load` (unless casting to a pointer); then `cg-cast T` | -| `f(a, b, ...)` | parse f → if lval and `f` not a function-typed identifier, `cg-load`; parse each arg → `cg-load` if lval, then `cg-cast` to param type (or default-promote for variadic args); then `cg-call` | -| `lhs = rhs` | parse lhs → must be lval (no load); parse rhs → `cg-load` if lval; `cg-assign` (cg internally casts rhs to lhs type — parse cannot peek beneath vstack top) | -| `lhs += rhs` (and other compound assigns) | parse lhs (lval) → `cg-dup` to preserve the lval across the read; `cg-load` (consumes one copy); parse rhs → `cg-load` if lval; `cg-arith-conv`; `cg-binop <op>`; `cg-assign` (cg casts internally) | -| `++lhs` / `--lhs` | parse lhs (lval) → `cg-dup`; `cg-load`; `cg-push-imm 1`; `cg-binop add`/`sub`; `cg-assign` | -| `lhs++` / `lhs--` | parse lhs (lval) → `cg-postinc` / `cg-postdec` (atomic primitive: dups+loads to capture old rval, then dup+load+`+1`+assign for the store, finally pushes the saved old rval) | -| `return e` | parse e → `cg-load` if lval; `cg-cast` to fn return type; `cg-return` | -| `if (e) ...` | parse e → `cg-load` if lval; `cg-cast bool` if not already int-shaped; `cg-if` | -| `c ? a : b` | parse c → `cg-load` if lval; `cg-ifelse-merge` with each thunk parsing one arm and ending with `rval!`; result type is the first arm's type | -| `a && b` | parse a → `cg-load` if lval; `cg-ifelse-merge` with then-arm = `parse b; rval!; cg-cast bool; cg-cast i32` and else-arm = `cg-push-imm i32 0` | -| `a \|\| b` | mirror of `&&`: then-arm = `cg-push-imm i32 1`; else-arm = parse b + bool/i32 cast | -| `a, b` | parse a (its rval is on top) → `cg-pop`; parse b → `cg-load` if lval; the comma's value is b | -| `sizeof e` | parse e (don't suppress emission); peek `(opnd-type (cg-top …))`'s `ctype-size`; `cg-pop`; `cg-push-imm u64 size` | - -The parser is responsible for the standard: - -- **Integer promotion**: any operand of type rank below `int` is - promoted to `int` (or `unsigned int` if it can't fit) before use - in arithmetic, before assignment to a wider lhs in mixed contexts, - and before being passed as a variadic argument. -- **Usual arithmetic conversions**: applied to both operands of a - binary arithmetic operator after promotion. The result type is - the common type. -- **Pointer-int interaction**: detected by parser; `cg-binop add` - on (ptr, int) handles scaling internally (see §4.2). - -### 4.2 cg's responsibilities - -cg trusts the operand types it is handed. - -- **`cg-load`**: pop lval, emit one load (of the right width based - on `ctype-size`), push rval of the same type. -- **`cg-cast to-type`**: pop, emit sign-extend / zero-extend / - truncate as needed based on source vs. target sizes and signedness. - For `to-type = bool`: emit `(BNEZ -> 1, fallthrough -> 0)` shape. - Pointer ↔ integer casts are bit-for-bit on P1-64 (no emission). -- **`cg-binop add` (and `sub`)**: if exactly one operand is a `ptr`, - scale the int operand by `ctype-size` of the pointee before adding. - If both ptr → only `sub` is valid (yields `i64` byte difference, - divided by element size); other binops on (ptr, ptr) abort via - `die`. -- **`cg-binop` for divisions and comparisons**: dispatch to signed - (`DIV`/`BLT`) or unsigned (`DIV`+sign-flip / `BLTU`) variant based - on the operand kinds (`i*` → signed, `u*` → unsigned). After - `cg-arith-conv`, both operands have the same kind, so dispatch is - unambiguous. - -cg never: -- auto-loads an lvalue -- auto-promotes -- auto-converts arguments -- looks at fn-ctx return type (parser passes the cast) - -This split keeps cg under ~600 LOC by pushing all "C language" -knowledge into parse. - -## 5. Symbol-to-label mangling - -Three label namespaces in the emitted P1pp: - -### 5.1 User globals (functions, variables) - -``` -C identifier "foo" → P1pp label :cc__foo -``` - -Verbatim concatenation. C identifiers can't contain `:` or other -P1pp-special characters, so no escaping is needed. The `cc__` prefix -guarantees no collision with libp1pp internals (`libp1pp__*`), -backend stubs (`_start`, `p1_main`, `sys_*`), or our own runtime -support. - -`static` storage at file scope changes nothing about the label — -since we have one TU, internal-linkage and external-linkage symbols -share the same namespace. `static` only suppresses any future -"export to other TUs" emission, which we don't do anyway. - -### 5.2 String pool - -``` -n-th distinct string literal → :cc__str_<n> -``` - -`<n>` is a fresh decimal counter starting at 0, advanced only on -non-deduplicating insert. Identical string literals share a label -(idempotent intern). - -### 5.3 Function-internal labels - -Inside `%fn(...)`, libp1pp's `%scope` mechanism prefixes short -labels (`::ret`, `::lbl_42`) to `<fnname>__ret`, `<fnname>__lbl_42` -at M1pp time. cg uses short labels exclusively inside fn-buf: - -- `::ret` — single function exit -- `::lbl_<n>` — anonymous control-flow targets (for switch cases, - short-circuit eval, etc.) -- `::user_<name>` — user-written `goto` labels (C `myloop:` → - `::user_myloop`). The `user_` prefix prevents collisions with our - `lbl_` and `ret` labels. - -Loop tags (for libp1pp's `%loop_tag`, `%break(tag)`, `%continue(tag)`) -are not labels — they're macro-name fragments. cg generates them as -`L<n>` (no `cc__` prefix; tag namespace is per-function and `%fn` -already scopes them). - -### 5.4 Entry stub - -cg emits a small entry stub at `cg-finish` time: - -``` -%fn(p1_main, %p1_main_f.SIZE, { - ; argc = a0, argv = a1 already - %la_br(&cc__main) - %call - %eret -}) -``` - -So `int main(int argc, char **argv)` is reached from the P1 -program-entry contract. - -If the user's `main` has no parameters, the stub still passes -argc/argv — main just ignores them, which is harmless. If the user -defines `main` with a different return type, that's a CC.md -violation; cg can either die or emit and let the cast happen at the -return site. Recommend: parser checks at `cg-fn-end` time when -fn-name == `main`. - -## 6. Phase-1 milestone - -```c -int main(int argc, char **argv) { - return argc; -} -``` - -This is the integration target every engineer aims at. It goes -through: - -- **lex**: `int`, `main`, `(`, `int`, `argc`, `,`, `char`, `*`, `*`, - `argv`, `)`, `{`, `return`, `argc`, `;`, `}` — all PUNCT and KW - symbols touched are core; covers two of each kind. -- **pp**: zero directives, but full token-list traversal. -- **parse**: function definition, two-parameter list including - `char **`, compound stmt, return stmt, identifier expression, - lval→rval load. -- **cg**: `cg-fn-begin` with two params, parameter spilling - (one register-passed, one register-passed), `cg-push-sym`, - `cg-load`, `cg-return`, `cg-fn-end`, `%fn` wrapping, and the - `p1_main` entry stub. -- **e2e**: link with arch backend + libp1pp; run as native ELF; - exit code matches argc. - -Acceptance test: `tests/cc/00-return-argc.c` exists, the make -target builds it, and: - -``` -$ ./tests/cc/build/00-return-argc ; echo $? → 1 -$ ./tests/cc/build/00-return-argc a b ; echo $? → 3 -``` - -When this test passes on aarch64, amd64, and riscv64, phase 1 is -complete. - -## Change protocol - -Anyone proposing a contract change: - -1. PR amends this doc first, with rationale. -2. Affected modules + tests are listed. -3. Changes land in one PR (doc + all affected code) so no engineer - pulls a half-migrated tree. diff --git a/docs/CC-INTERNALS.md b/docs/CC-INTERNALS.md @@ -1,764 +0,0 @@ -# lispcc internals - -Companion to [CC.md](CC.md). CC.md says what we accept; this doc says -how the implementation is organized so engineers can split work and -test independently. - -The compiler lives in one file, `cc/cc.scm`, that bundles every -module section in order; `cc/main.scm` is the one-liner entry point -that fires `cc-main`: - -``` -build: catm build/$ARCH/cc/cc.scm scheme1/prelude.scm cc/cc.scm cc/main.scm -run: scheme1 build/$ARCH/cc/cc.scm input.flat.c output.P1pp -``` - -Sections inside `cc.scm` are delimited by their original -`;; cc/<name>.scm —` headers (util, data, lex, pp, cg, parse, main), -so the layering below is still navigable by search. - -## Module DAG - -``` -util ────────┐ - │ -data ────────┼─► lex ──► pp ──► parse ─► main - │ │ - └──────────────────────► cg ──► main -``` - -- **util** — leaf helpers; depends only on the scheme1 prelude. -- **data** — record-type definitions used across modules. -- **lex** — bytestream → token list. Pure function. -- **pp** — token list → expanded token list. Pure function. -- **cg** — codegen state and emission API. Mutates a `cg` record. -- **parse** — token list + cg → P1pp output. Mutates `pstate` and - drives `cg`. -- **main** — argv handling, file I/O, ties phases together. - -Cycles are forbidden. The parse section calls cg but never the reverse. - -## Feature workflow - -When adding any new codegen-touching feature, **always** in this order: - -1. **cc-cg fixture first.** Write a `tests/cc-cg/<n>-name.scm` that drives - the cg API directly to emit a self-contained program whose runtime - exit code (and/or stdout) reflects the feature working. Add the - `.expected-exit` (default `0`) and optional `.expected` stdout file. - At this point the test fails — that's the spec. -2. **Implement the cg primitive(s).** Add or fix the API surface in - the cg section of `cc/cc.scm`. Iterate until `make test SUITE=cc-cg` - passes. -3. **cc fixture next.** Write a `tests/cc/<n>-name.c` that - exercises the same feature from the C source side, with a `main` - that exits with a known value. Add `.expected-exit` / `.expected`. - Test fails — that's the spec for the parser. -4. **Implement the parse changes.** Wire C syntax through the parse - section of `cc/cc.scm` - using the cg primitives from step 2. Iterate until - `make test SUITE=cc` passes. - -This sequencing keeps the parser↔cg seam honest: cg is validated by -runtime behavior on direct API calls before parse is even touched, so -parse never papers over a cg bug. Steps 1+3 are the contract; steps 2+4 -are the implementation. - -## Conventions - -- **Naming**: every public function and accessor is prefixed by its - module or record. `lex-tokenize`, `pp-expand`, `cg-push-imm`, - `parse-translation-unit`, `tok-kind`, `ctype-size`. Internal helpers - use a leading `%` (e.g. `%pp-expand-line`). -- **Record constructors**: `%record-name` is the all-fields raw - constructor; named factory functions like `make-tok` wrap it when - defaults are useful. -- **Mutators**: `field-set!` form, e.g. `cg-out-set!`, matching - scheme1/prelude.scm. -- **Bytevectors as strings**: every "string" in this codebase is a - bytevector. We never use `symbol->string` for runtime data — symbols - are reserved for the small fixed alphabets (token kinds, ctype - kinds, opnd kinds). -- **Errors**: `(die loc msg . irritants)` writes a diagnostic to fd 2 - and `sys-exit`s with status 1. No exceptions, no recovery. Every - module uses the same `die`. -- **No global state**: every long-lived datum lives in a record passed - explicitly. The only top-level definitions are functions, constants, - and the keyword/punctuator alists. - -## Errors - -```scheme -(die loc msg . irritants) ; abort with formatted diagnostic -(loc-of-tok tok) -> loc ; pull file/line/col out of a tok -(loc file line col) -> loc ; construct manually (used by lexer) -``` - -Diagnostic format on stderr: - -``` -<file>:<line>:<col>: error: <msg>: <irritants display'd one-by-one> -``` - -`irritants` are written via `display` (no quoting). Pairs and integers -print naturally; bytevectors print as their byte content. - -`die` is the only failure path. There is no warning level — anything -worth saying aborts. - -## util.scm - -Bytevector and list helpers that scheme1/prelude.scm doesn't already -provide. - -```scheme -;; bytevector -(bv= a b) -> bool ; alias for bytevector=?, terser -(bv-prefix? p s) -> bool ; does s start with p? -(bv-find bv byte from) -> idx-or-#f ; first occurrence at >= from -(bv-slice bv start end) -> bv ; fresh copy -(bv-of-string str) -> bv ; literal helper for inline ASCII -(bv-of-byte b) -> bv ; 1-byte bv -(bv-cat lst-of-bv) -> bv ; concat list, single allocation -(bv->fixnum bv radix) -> (ok . val) ; (ok . #f) on parse fail -(fixnum->bv n radix) -> bv - -;; lists / alists -(alist-ref key al) -> val-or-#f -(alist-ref/eq key al) -> val-or-#f ; eq? compare (for symbol keys) -(alist-set key val al) -> al' ; cons new pair on the front -(alist-update key f al) -> al' ; functional update -(any p xs) -> bool -(every p xs) -> bool -(count p xs) -> fixnum - -;; ints -(min3 a b c) -> fixnum -(align-up n k) -> fixnum ; round n up to multiple of k - -;; output buffers (reversed list of bv chunks; flush builds in one pass) -(make-buf) -> buf ; record: { chunks } -(buf-push! buf bv) ; cons bv onto chunks -(buf-flush buf) -> bv ; reverse + bv-cat - -;; diagnostics + I/O -(die loc msg . irritants) -> never returns -(slurp-fd fd) -> bv ; read until EOF -(write-bv-fd fd bv) ; full write or die - -;; fresh names -(make-namer prefix) -> proc ; (proc) -> bv "prefix0", "prefix1", ... -``` - -`make-buf` and `buf-push!` are the universal output primitive — both -cg's three output streams and pp's expansion staging buffer use them. - -## data.scm - -All record types used by more than one module. - -### loc - -```scheme -(define-record-type loc - (%loc file line col) - loc? - (file loc-file) ; bv - (line loc-line) ; fixnum - (col loc-col)) ; fixnum -``` - -### tok - -```scheme -(define-record-type tok - (%tok kind value loc hide) - tok? - (kind tok-kind) ; symbol; see kinds table - (value tok-value) ; varies by kind - (loc tok-loc) ; loc - (hide tok-hide)) ; list of bv (macro names already expanded) -``` - -| `kind` | `value` shape | -|---------|--------------------------------------------------| -| `IDENT` | bv (identifier name) | -| `KW` | symbol (one of `if while ... typedef`) | -| `INT` | fixnum (post-suffix integer value) | -| `STR` | bv (raw bytes, no NUL terminator) | -| `CHAR` | fixnum (value 0..255) | -| `PUNCT` | symbol (`'+ '== '-> '... '## '#` …) | -| `HASH` | `#f` (preprocessor only; line-leading `#`) | -| `NL` | `#f` (significant only inside the preprocessor) | -| `EOF` | `#f` | - -`make-tok` is the canonical constructor; pass `hide = '()` for -freshly-lexed tokens. - -### macro - -```scheme -(define-record-type macro - (%macro kind params body) - macro? - (kind macro-kind) ; 'obj | 'fn | 'fn-vararg - (params macro-params) ; list of bv - (body macro-body)) ; list of tok -``` - -### ctype - -```scheme -(define-record-type ctype - (%ctype kind size align ext) - ctype? - (kind ctype-kind) ; 'void 'i8 'i16 'i32 'i64 'u8 'u16 'u32 'u64 - ; 'bool 'ptr 'arr 'fn 'struct 'union 'enum - (size ctype-size) ; fixnum bytes; -1 = incomplete - (align ctype-align) ; fixnum bytes - (ext ctype-ext)) ; payload, kind-specific -``` - -ext payload by kind: - -| `kind` | `ext` shape | -|-------------------|----------------------------------------------------------------------| -| `void` / int / `bool` | `#f` | -| `ptr` | pointee ctype | -| `arr` | `(elem-ctype . length-or--1)` | -| `fn` | `(ret-ctype params variadic?)` — `params` is list of ctype | -| `struct`/`union` | `(tag-bv complete? fields)` — `fields` = `((name-bv ctype offset) ...)` | -| `enum` | `(tag-bv ((const-name-bv . value) ...))` | - -Primitive ctypes are interned at startup as top-level bindings: -`%t-void`, `%t-i8`, `%t-i32`, `%t-i64`, `%t-u8`, `%t-u32`, `%t-u64`, -`%t-bool`, `%t-char-ptr`. Equality of primitive ctypes is `eq?`. -Derived types are *not* deduped. - -### sym - -```scheme -(define-record-type sym - (%sym name kind storage type slot) - sym? - (name sym-name) ; bv - (kind sym-kind) ; 'var 'fn 'typedef 'enum-const 'param 'label - (storage sym-storage) ; 'auto 'static 'extern 'register | #f for typedef/enum - (type sym-type) ; ctype - (slot sym-slot)) ; locals/params: fixnum byte offset - ; globals/fn: bv emitted-label - ; enum-const: fixnum value - ; typedef: #f - ; label: bv P1pp local label -``` - -### opnd - -```scheme -(define-record-type opnd - (%opnd kind type ext lval?) - opnd? - (kind opnd-kind) ; 'imm 'frame 'global 'reg | sub-cases below - (type opnd-type) ; ctype - (ext opnd-ext) ; per kind - (lval? opnd-lval?)) ; #t = the slot/place holds an *address* - ; #f = the slot/place holds a *value* -``` - -ext by kind: - -| `kind` | `ext` | -|----------|------------------------------------------------| -| `imm` | fixnum literal value (lval? always `#f`) | -| `frame` | fixnum byte offset in current function frame | -| `global` | bv label name (the symbol's emitted P1pp label)| -| `reg` | symbol `'a0`..`'a3` | - -`reg` is transient — used for the result of a call before it's -spilled. The vstack itself holds only `imm`, `frame`, and `global` -opnds. - -### pstate - -```scheme -(define-record-type pstate - (%pstate toks scope tags loops fn-ctx typedefs cg) - pstate? - (toks ps-toks ps-toks-set!) ; remaining tok list (head = lookahead) - (scope ps-scope ps-scope-set!) ; list of alists: (bv . sym) - (tags ps-tags ps-tags-set!) ; list of alists: (bv . ctype) for struct/union/enum - (loops ps-loops ps-loops-set!) ; list of loop-ctx (break/continue stack) - (fn-ctx ps-fn-ctx ps-fn-ctx-set!) ; current function record, or #f at file scope - (typedefs ps-typedefs ps-typedefs-set!) ; flat alist (bv . #t) — fast typedef-name lookup - (cg ps-cg)) ; codegen state (not mutated through pstate) -``` - -`typedefs` is a separate flat alist (not derived from `scope` at -lookup time) because the lexer-vs-parser distinction in C requires -fast "is this identifier a typedef-name *now*" answers during -declaration parsing. Updated in lockstep with `scope`. - -### loop-ctx - -```scheme -(define-record-type loop-ctx - (%loop-ctx kind tag has-continue?) - loop-ctx? - (kind loop-ctx-kind) ; 'while 'do 'for 'switch - (tag loop-ctx-tag) ; bv tag for libp1pp %_tag macros - (has-continue? loop-ctx-has-continue?)) ; #f for switch -``` - -### fn-ctx - -```scheme -(define-record-type fn-ctx - (%fn-ctx name return-type params variadic? labels) - fn-ctx? - (name fn-ctx-name) ; bv - (return-type fn-ctx-return-type) ; ctype - (params fn-ctx-params) ; list of sym - (variadic? fn-ctx-variadic?) - (labels fn-ctx-labels ; alist (user-bv . emitted-bv) - fn-ctx-labels-set!)) ; mutated as `goto`/labels resolve -``` - -### cg - -```scheme -(define-record-type cg - (%cg text data bss vstack frame-hi label-ctr str-pool globals fn-buf) - cg? - (text cg-text) ; buf — final text section (all functions) - (data cg-data) ; buf — initialized globals + string pool - (bss cg-bss) ; buf — zero-init globals - (vstack cg-vstack cg-vstack-set!) ; list of opnd - (frame-hi cg-frame-hi cg-frame-hi-set!) ; fixnum: frame bytes used so far - (label-ctr cg-label-ctr cg-label-ctr-set!) - (str-pool cg-str-pool cg-str-pool-set!) ; alist (bv-content . bv-label) - (globals cg-globals cg-globals-set!) ; alist (bv-name . sym) — emitted globals - (fn-buf cg-fn-buf cg-fn-buf-set!)) ; buf — body of current function -``` - -Functions are emitted into `fn-buf`, then on `cg-fn-end` flushed -through libp1pp's `%fn(...)` wrapper into `text`. This lets us know -the final `frame-hi` before writing the prologue. - -## lex.scm - -Pure: bytestream + filename → token list. No file I/O, no macro -awareness. - -```scheme -(lex-tokenize src file) -> (list-of tok) -;; src: bv (the C bytestream) -;; file: bv (filename, for tok-loc) -;; result: list ending in a single 'EOF tok; never #f -;; aborts via die on unrecognized byte sequences -``` - -Internal contract: - -- Comments (`/* ... */`, `// ...`) are removed but produce no tokens. -- Trigraphs and `\<newline>` line splicing are applied before - tokenization. -- `NL` tokens are emitted at the end of every source line, even - blank ones — the preprocessor needs them for directive termination. -- Adjacent string literals are **not** concatenated here; that's - pp's job (after macro expansion). -- Keyword recognition: the lexer carries a fixed alist of - (keyword-bv . keyword-symbol) and emits `KW` directly for matches. - IDENT vs. KW is decided at lex time and never revised. -- Punctuator longest-match table: `'<<=` before `'<<` before `'<`, - etc. - -Helpers exposed for unit tests: - -```scheme -(lex-read-number src pos) -> (tok . pos') -(lex-read-string src pos file) -> (tok . pos') -(lex-read-ident src pos) -> (tok . pos') ; produces IDENT or KW -``` - -### Test plan - -`tests/cc-lex/` mirrors `tests/scheme1/`. Each test is one `.c` -fragment plus an `.expected-toks` file containing the expected -serialized form (one tok per line, e.g. `KW int 1 1`). Driver script: - -``` -scheme1 cc/lex-test.scm < input.c | diff - expected-toks -``` - -## pp.scm - -Pure: token list + initial macro alist → expanded token list. - -```scheme -(pp-expand toks initial-defines) -> (list-of tok) -;; toks: lex-tokenize output -;; initial-defines: alist (bv . macro) — from -D flags -;; result: token list with HASH and NL stripped, KW/IDENT/INT/STR/CHAR/PUNCT/EOF only -;; aborts via die on bad directive, undefined #if identifier-as-macro misuse, etc. -``` - -Internal structure: - -- A driver loop classifies each line (looking only at the leading - HASH or non-HASH state) and dispatches to a directive handler or - the macro-expansion engine. -- The conditional stack: list of `(active? . has-taken?)` pairs. - `#if`/`#ifdef` push; `#elif`/`#else` flip; `#endif` pops. -- Macro expansion uses C11 6.10.3.4 hide-set discipline. Each - emitted token's `hide` field is the union of hide-sets of its - constituent body tokens plus the macro's own name. -- `defined NAME` is a special form — recognized and resolved - *before* macro expansion of an `#if` line. - -Directive handlers (each takes the tokens up to the next NL): - -```scheme -(%pp-do-define line state) -(%pp-do-undef line state) -(%pp-do-if line state) -(%pp-do-ifdef line state) -(%pp-do-ifndef line state) -(%pp-do-elif line state) -(%pp-do-else line state) -(%pp-do-endif line state) -(%pp-do-error line state) -(%pp-do-line line state) -(%pp-do-pragma line state) -(%pp-do-include line state) ;; always dies (per CC.md §Toolchain envelope) -``` - -`state` is a private record `pp-state` with fields -`{macros cond-stack out file-base-line}`. Internal — not in data.scm. - -Constant-expression evaluator for `#if`: - -```scheme -(pp-eval-cexpr toks macros) -> fixnum -;; toks: tokens of the expression after macro expansion -;; aborts on unrecognized identifier (treated-as-zero is wrong for our errors policy) -``` - -### Test plan - -`tests/cc-pp/` per-file tests, same shape as cc-lex. Inputs are -already-tokenized fixtures (so pp can be exercised without the lexer) -or `.c` files (full pipeline). Both flavors live side-by-side. - -## cg.scm - -Mutable codegen state and a value-stack-style emission API. The parser -calls these and never touches output buffers directly. - -### Lifecycle - -```scheme -(cg-init) -> cg ; fresh state -(cg-finish cg) -> bv ; flush all buffers; result is final P1pp text -``` - -Every function lives between `cg-fn-begin` and `cg-fn-end`. Globals -live outside. - -```scheme -(cg-fn-begin cg name params return-type) -;; name: bv -;; params: list of sym (their slots are pre-assigned by parser) -;; return-type: ctype -(cg-fn-end cg) ; emits epilogue, flushes fn-buf into cg-text under %fn(...) -``` - -### Vstack: push / pop / inspect - -```scheme -(cg-push cg opnd) ; push opnd onto vstack -(cg-pop cg) -> opnd ; remove and return top -(cg-top cg) -> opnd ; non-destructive -(cg-depth cg) -> fixnum -``` - -### Materialize - -```scheme -(cg-push-imm cg ctype value) -> opnd ; rval -(cg-push-string cg bv-content) -> opnd ; rval, char* — interns into str-pool -(cg-push-sym cg sym) -> opnd ; lval (var) or rval (fn name) -(cg-push-deref cg) -> opnd ; pop ptr-rval, push lval (no emission yet) -``` - -### Address & deref operators - -```scheme -(cg-take-addr cg) -> opnd ; pop lval, push its address as rval -(cg-load cg) -> opnd ; pop lval, push rval (loaded through address) -``` - -### Type conversions - -```scheme -(cg-cast cg to-type) -> opnd ; pop, push opnd cast to to-type; emits sign-extension etc. as needed -(cg-promote cg) -> opnd ; integer promotion (rank ≤ int → int) -(cg-arith-conv cg) ; usual arithmetic conversions on top two opnds -``` - -### Operators - -```scheme -(cg-binop cg op) -> opnd ; pop b, pop a, push (a op b) -(cg-unop cg op) -> opnd ; pop a, push (op a) -(cg-assign cg) -> opnd ; pop rval, pop lval, cast rval to lval-type, store, push assigned value (cg-assign owns the cast since parse cannot peek beneath vstack top) -``` - -`op` for binop is a symbol from: -`'add 'sub 'mul 'div 'rem 'and 'or 'xor 'shl 'shr 'eq 'ne 'lt 'le 'gt 'ge`. -For unop: `'neg 'bnot 'lnot`. - -Signed vs. unsigned dispatch is handled inside cg by inspecting the -operand types after `cg-arith-conv`. - -### Calls - -```scheme -(cg-call cg arity has-result?) -> opnd-or-#f -;; pops (arity + 1) opnds: function, then args left-to-right at top -;; (i.e. callable was pushed first, then args, so arg-N is on top) -;; emits arg-passing per P1 ABI, CALL or CALLR, captures return into a fresh frame slot -;; pushes result opnd; returns it (or #f if has-result? is #f) -``` - -### Structured control flow - -These take thunks so the parser can recursively emit the body: - -```scheme -(cg-if cg then-thunk) ; pop cond; emit %if_nez { (then-thunk) } -(cg-ifelse cg then-thunk else-thunk) ; pop cond; emit %ifelse_nez { ... }{ ... } -(cg-loop cg head-thunk body-thunk) -> tag ; head-thunk emits the cond; body-thunk is invoked as (body-thunk tag) so parse can use the same tag for cg-break / cg-continue inside the body; tag is also returned to the caller -(cg-loop-end cg tag) ; no-op; reserved for future per-loop teardown -(cg-break cg tag) -(cg-continue cg tag) -``` - -For `for` and `do-while` the parser composes the building blocks -above; cg doesn't expose a dedicated for/do helper. (Three helpers -beat seven.) - -### switch helpers - -```scheme -(cg-switch-begin cg) -> swctx ; spill controlling expression to a slot -(cg-switch-case cg swctx const-int) ; emit %if_eq jump-to-case-label -(cg-switch-default cg swctx) ; emit B to default-label -(cg-switch-end cg swctx) ; close out -``` - -### Globals and data - -```scheme -(cg-emit-global cg sym init) ; init = #f (zero-init in .bss) - ; | (piece ...) in .data -;; piece := <bytevector> — raw bytes -;; | (label-ref . <label-bv>) — 8-byte slot holding &label -(cg-emit-extern cg sym) ; declare without defining -(cg-intern-string cg bv-content) -> bv-label ; idempotent; used internally by cg-push-string -``` - -### Frame allocation (used internally and by parse for locals) - -```scheme -(cg-alloc-slot cg bytes align) -> offset ; bumps frame-hi; returns aligned offset -``` - -### Test plan - -`tests/cc-cg/` is a runtime-validating suite. Each fixture is a `.scm` -program that drives the cg API directly, calls `cg-finish`, and writes -the resulting P1pp text to stdout. The harness assembles that P1pp -through the existing P1pp toolchain, runs the resulting ELF, and -diffs the program's exit code against `<name>.expected-exit` (default -`0`) and stdout against `<name>.expected` (default empty). - -Fixtures must therefore emit a complete, runnable program — including -any callees they invoke via `cg-call`. Direct-cg tests that depend on -external symbols (`extern abs`, `extern foo`) belong in cc once -real `libc` linkage exists; they don't fit cc-cg. - -The contract this suite locks in is **semantic**, not syntactic: cg -emission can be refactored freely as long as the program still -computes the asserted result. - -## parse.scm - -Mutates a `pstate`. Drives `cg`. Single entry point: - -```scheme -(parse-translation-unit ps) ; consumes ps-toks until EOF, emits via ps-cg -;; ps must have ps-toks set; everything else starts empty. -``` - -### Top-down structure - -Internal helpers, hierarchically: - -``` -parse-translation-unit - ├─ parse-decl-or-fn ; returns 'fn or 'decl - │ ├─ parse-decl-spec ; storage + qualifiers + base type - │ ├─ parse-declarator ; spiral grammar; returns (name ctype) - │ ├─ parse-init-list ; for variable initializers (incl. designated) - │ └─ parse-fn-body ; only if declarator is fn-typed AND next tok is `{` - ├─ parse-stmt - │ ├─ parse-compound-stmt - │ ├─ parse-if-stmt - │ ├─ parse-while-stmt - │ ├─ parse-do-stmt - │ ├─ parse-for-stmt - │ ├─ parse-switch-stmt - │ ├─ parse-return-stmt - │ ├─ parse-goto-stmt - │ └─ parse-expr-stmt - └─ parse-expr ; Pratt; takes min-bp - ├─ parse-primary - ├─ parse-postfix - ├─ parse-unary - ├─ parse-cast-or-unary ; `(T)e` vs. `(e)` - └─ parse-binary-rhs -``` - -### Token-stream API (private to parse.scm but conventional) - -```scheme -(peek ps) -> tok -(peek2 ps) -> tok ; one-token lookahead helper -(advance ps) -> tok ; consume and return -(expect-kw ps sym) -> tok ; KW match or die -(expect-punct ps sym) -> tok ; PUNCT match or die -(at-kw? ps sym) -> bool -(at-punct? ps sym) -> bool -``` - -`peek2` is needed exactly twice: distinguishing -`(typename) cast-expr` from `(expr)` parenthesized expression, and -distinguishing labelled-statement `IDENT :` from expression-statement -`IDENT ...`. - -### Scope helpers - -```scheme -(scope-enter! ps) -(scope-leave! ps) -(scope-bind! ps name sym) ; aborts on duplicate at innermost frame -(scope-lookup ps name) -> sym-or-#f ; walks frames - -(tag-bind! ps name ctype) -(tag-lookup ps name) -> ctype-or-#f - -(typedef-add! ps name) ; updates ps-typedefs -(typedef? ps name) -> bool -``` - -### Pratt expression parser - -The binding-power table is a top-level alist: - -```scheme -(define %binop-bp - ;; (punct-symbol . (lhs-bp . rhs-bp)) - ;; left-assoc: rhs-bp = lhs-bp + 1 - ;; right-assoc: rhs-bp = lhs-bp - 1 - '((|*| 110 . 111) (|/| 110 . 111) (|%| 110 . 111) - ...)) -``` - -Driver: - -```scheme -(parse-expr-bp ps min-bp) ; Pratt climber -(parse-expr ps) ; equivalent to (parse-expr-bp ps 0) -``` - -`parse-expr` leaves the result on `ps-cg`'s vstack and returns the -opnd. Statements that don't consume the value follow with `cg-pop`. - -### Test plan - -`tests/cc/` is a runtime-validating suite. Each fixture is a -`.c` source file containing a complete, runnable program — `main` -must exit with a known value. The harness runs the full -lex+pp+parse+real-cg pipeline, assembles the emitted P1pp through the -P1pp toolchain, runs the resulting ELF, and diffs exit code against -`<name>.expected-exit` (default `0`) and stdout against -`<name>.expected` (default empty). - -The contract this suite locks in is that the *C source* compiles to a -program with the asserted runtime behavior. Per the §Feature workflow, -parse fixtures land *after* the corresponding cg fixture is green. - -## main.scm - -Driver. Roughly 80 lines. - -```scheme -(define (cc-main argv) - (let* ((args (parse-cli argv)) ; record: { input-path output-path defines } - (src (slurp-fd (open-or-die (cli-input args)))) - (toks (lex-tokenize src (cli-input args))) - (defines (cli-defines->alist (cli-defines args))) - (expanded (pp-expand toks defines)) - (cg (cg-init)) - (ps (make-pstate expanded cg))) - (parse-translation-unit ps) - (write-bv-fd (open-or-die-out (cli-output args)) - (cg-finish cg)) - 0)) - -(cc-main (argv)) -``` - -CLI: `cc input.flat.c -o output.P1pp [-D NAME[=val]] ...`. Strict — -unrecognized flags die. - -## Test infrastructure - -Five test trees. The upstream layers (lex, pp) byte-diff their pure -transformations; the downstream layers (cg, cc) all -**compile-and-run** the emitted P1pp and assert runtime behavior — exit -code by default, optional stdout match. This split keeps the byte-diff -brittleness out of every layer where there's a real "is this code -correct" question to ask. - -- `tests/cc-util/` — direct unit tests of util helpers (Scheme-level). -- `tests/cc-lex/` — feeds `.c` through `lex-tokenize`, diffs token - serialization (per CC-CONTRACTS §2.1). -- `tests/cc-pp/` — feeds tokens (or `.c`) through `pp-expand`, diffs - token serialization. -- `tests/cc-cg/` — `.scm` driver calls cg APIs directly to emit a - complete program; harness builds + runs the P1pp; asserts exit code - / stdout. (See cg.scm §Test plan.) -- `tests/cc/` — `.c` fixture goes through the full compiler driver - (`cc/cc.scm`: lex+pp+parse+cg); harness builds + runs; asserts exit - code / stdout. (See parse.scm §Test plan.) Holds both feature drills - (parser / cg coverage) and full-envelope scenarios. - -`tests/cc-cg/` is the codegen contract. `tests/cc/` is the C- -language contract. Per §Feature workflow, every feature lands a cg -fixture first (which must fail before cg is touched), and a parse -fixture second (which must fail before parse is touched). - -## Out-of-scope here - -Deferred to follow-up docs once we start coding: - -- Exact P1pp text emitted for each cg primitive (precise opcodes, - spill discipline, libp1pp macro choices) — lives next to cg.scm - as a side document. -- The exact `<stdarg.h>` / `<stddef.h>` we ship — the headers we - bundle so the flattener has something to inline. -- Driver script for the pre-flatten pass (host shell tool, not part - of the Scheme compiler). -- Performance tuning: alist → tree, vstack list → array of frame - slots, etc. None of this affects the interfaces above. diff --git a/docs/CC-PUNCHLIST.md b/docs/CC-PUNCHLIST.md @@ -1,464 +0,0 @@ -# CC codegen punch list - -C99-subset codegen capabilities, ordered for red→green TDD per -[CC-INTERNALS.md §Feature workflow](CC-INTERNALS.md#feature-workflow). -The accepted language surface is defined in [CC.md](CC.md); this doc -is the implementation checklist against that surface. - -## Conventions - -- Every item has up to two runtime-validated fixtures: - - **cg**: `tests/cc-cg/<n>-name.scm` — drives the cg API directly. - - **cc**: `tests/cc/<n>-name.c` — exercises the same shape via real - C through the full compiler driver. Use the same bucket for - full-envelope scenarios (driver, multi-fn, libc) once those land. -- Acceptance: `make test SUITE=cc-cg` (then cc) green on all three - arches. The runner asserts `.expected-exit` (default `0`) and - `.expected` stdout (default empty). -- Land cg work + cg fixture in one PR; cc fixture + parse work in - the next. Don't block on parse to start cg. -- Pick the next free `<n>` per suite. -- Status legend: `[ ]` red · `[~]` partial · `[x]` green. - -## Already green - -cc-cg 00–14 + cc 00–14 cover: empty fn, `return` -with const/param, two-param fn, i64 binops, locals + assign, `if` / -`if-else`, `while` with `break` / `continue`, direct calls (0..5 -args, with stack staging), string literal interning, file-scope -zero-init globals, `&x` on a param, typedef plumbing through to a -return. - -## Punch list - -### A. Width-correct integer codegen - -The 64-bit-everything load/store path is the largest correctness gap -upstream of nearly everything else. Land this first. - -- [x] **`char` (8-bit) load/store via lval** - - cg: `cc-cg/15-char-roundtrip.scm` — two adjacent 1-byte slots; - stores must not bleed across slots → exit 1 on equality check. - - cc: `cc/15-char-arith.c` — same shape from C. - - Done: added `%cg-emit-{ld,st}{,-slot}-typed` helpers that dispatch - on `ctype-size = 1` to `%lb`/`%sb`; `%cg-load-opnd-into`, - `cg-load`, `cg-assign` thread the lval ctype. `i8` loads also - sign-extend via `shli`/`sari` 56. 16/32-bit fall through to the - 8-byte path until §A.2/§A.3 land. - -- [x] **`short` (16-bit) load/store via lval** - - cg: `cc-cg/16-short-roundtrip.scm` - - cc: `cc/16-short-arith.c` - - Done: byte-decomposed dispatch in the typed helpers — store low - byte then `%shri` 8 + store high byte; load two bytes + `%shli` 8 - + `%or`. `i16` sign-extends via `shli`/`sari` 48. Helpers may - clobber `t1`; callers never pass `reg=t1`. - -- [x] **`int` (32-bit) load/store via lval** - - cg: `cc-cg/17-int-roundtrip.scm` - - cc: `cc/17-int-arith.c` - - Done: byte-decomposed dispatch generalised to N bytes via - `%cg-emit-{ld,st}N-bytes`; size-4 routes through 4× `%lb`/`%sb` - with shift+OR / shri-shift. `i32` sign-extends via `shli`/`sari` - 32. The address-staging in `%cg-load-opnd-into` and `cg-load`'s - indirect path now uses `t2` so multi-byte gathers don't alias - dest with base. - -- [x] **Signed narrowing keeps sign on re-widen** - - cg: `cc-cg/18-sext-narrow.scm` — `(int)(char)-3 == -3` → exit 1. - - cc: `cc/18-sext-narrow.c` - - Done: `cg-cast`'s narrowing branch now `shli`/`sari`'s for signed - narrow targets (i8/i16/i32) instead of masking, so the slot holds - the canonical sign-extended 64-bit form. The widening cast back - (relabel-only) preserves it. - -- [x] **Unsigned narrowing zero-extends** - - cg: `cc-cg/19-zext-narrow.scm` — `(unsigned)(unsigned char)-3 == 253`. - - cc: `cc/19-zext-narrow.c` - - Done: pre-existing mask-on-unsigned-narrow path was already - correct; fixture locks the contrast with §A.4 in (same source, - same chain shape, divergent result via target signedness). - -- [x] **Integer promotion preserves sign across operations** - - cg: `cc-cg/20-promote-sign.scm` — `signed char x=-1; ((int)x)+2 == 1` - via 64-bit comparison so a non-canonical 0x101 result fails. - - cc: `cc/20-promote-sign.c` - - Done: load-side sign-extension from §A.1 already canonicalises the - slot, so `cg-promote`'s relabel-only path is correct. Fixture - locks the invariant in. - -### B. Lvalue mechanics - -Picked **(b) `cg-dup`** — duplicate the top vstack entry, used for -compound assign and pre-inc/dec to keep the lhs lval available across -its own load. Post-inc/dec use a dedicated `cg-postinc` / `cg-postdec` -primitive to capture the old rval before the store. See -[CC-CONTRACTS §4.1](CC-CONTRACTS.md#41-parsers-responsibilities). - -- [x] **Pre-`++` / pre-`--`** - - cg: `cc-cg/21-preinc.scm` — `int x = 5; ++x; return x;` → exit 6. - - cc: `cc/21-preinc.c` - - Done: parser dups lhs lval, loads, +1, assigns; pops result. - -- [x] **Post-`++` / post-`--` returns old value** - - cg: `cc-cg/22-postinc.scm` — `int x=5; int y=x++; return x*10+y;` - → exit 65. - - cc: `cc/22-postinc.c` - - Done: `cg-postinc` / `cg-postdec` primitives composed of two - dup+load passes — one to capture the old rval (which lives in a - never-reused spill slot), one to compute the +1/-1 store. - -- [x] **Compound assignment on simple lval (`+= -= *= /= %= <<= >>= &= ^= |=`)** - - cg: `cc-cg/23-cmpd-simple.scm` — `int x=7; x+=3; return x;` → exit 10. - - cc: `cc/23-cmpd-simple.c` — one of every op family. - - Done: parser uses `cg-dup` + `cg-load` + rhs + arith-conv + binop + - assign; the `cg-take-addr`/`cg-push-deref` indirection is gone. - -- [x] **Compound assignment through pointer** - - cg: `cc-cg/24-cmpd-ptr.scm` — `int x=7; int *p=&x; *p+=3; return x;` - - cc: `cc/24-cmpd-ptr.c` - - Done: same parser sequence; `cg-push-deref`'s indirect-slot lval - composes correctly with `cg-dup` + `cg-assign`. - -- [x] **`*p++` walking an array** - - cg: `cc-cg/25-deref-postinc.scm` — walks a 3-element span via *p++. - - cc: `cc/25-deref-postinc.c` - - Done: composes B.2 (post-inc on a ptr lval; pointer scaling falls - out of `cg-binop add`'s ptr branch) with `*p` deref. Also fixed - `cg-arith-conv` to skip the relabel when one operand is ptr/arr - so `cg-binop` still sees a ptr/int pair (previously it saw both - sides relabelled to ptr and skipped the scaling). - -### C. `sizeof` - -- [x] **`sizeof e` returns the type's actual size** - - cc: `cc/26-sizeof-expr.c` — `int x; return sizeof x;` → 4. - - Done: parser peeks `(opnd-type (cg-top …))`, takes its - `ctype-size`, pops, pushes `imm u64 size`. Both forms - (`sizeof e` and `sizeof(e)`) updated. - -- [x] **`sizeof` over struct, array, pointer, char** - - cc: `cc/27-sizeof-types.c` — sum of `char`, `short`, - `int`, `long`, `int*`, `int[5]`, `struct S{int a; int b;}` → 51. - - Done: the type form already returned `ctype-size ty`; this - fixture just locks the answer in. - -### D. Aggregates - -- [x] **Struct member load** - - cg: `cc-cg/36-struct-load.scm` — two-int struct, fields at 0 and 4. - - cc: `cc/36-struct-load.c` - - Done: added `cg-push-field cg fname` (cg.scm). Pops struct/union - lval, looks up `fname` in `ctype-ext`'s `(tag complete? fields)`. - Three input cases: direct frame lval shifts the slot offset; - indirect frame lval loads addr+fo into a new indirect slot; - global lval `la`'s the label, adds `fo`, stashes via indirect - slot. Parser `dot` arm replaced. - -- [x] **Struct member store** - - cg: `cc-cg/37-struct-store.scm` — three u8 fields, distinct - multipliers in the readback to isolate per-field width. - - cc: `cc/37-struct-store.c` - - Done: cg-push-field from §D.1 plus the width-aware store path - from §A.1. No new primitive. - -- [x] **Pointer-to-struct (`p->x`)** - - cg: `cc-cg/38-arrow.scm` - - cc: `cc/38-arrow.c` - - Done: arrow arm in `parse-postfix-rest` calls rval! (loads ptr), - cg-push-deref (struct lval through ptr), then cg-push-field. - Indirect-frame branch of cg-push-field (added in §D.1) handles - the deref-result struct lval correctly. - -- [x] **Nested struct access (`s.inner.x`, `s->inner.x`)** - - cc: `cc/39-struct-nested.c` - - Done: cg-push-field pushes a new lval whose ctype is the field's - type; if that's a struct, a subsequent `.x` chains naturally. - The fixture exercises both s.inner.x (direct frame) and - p->inner.x (indirect frame, via the §D.1 indirect path). - -- [x] **Array element access at non-zero index** - - cg: `cc-cg/40-array-index.scm` - - cc: `cc/40-array-index.c` - - Done: cg-load on an arr-typed lval delegates to cg-decay-array, - pushing a ptr-rval to the first element. Existing `cg-binop add` - pointer-arithmetic path scales by the pointee size, so `a + i` - yields `&a[i]`, and cg-push-deref turns that into the element - lval. cg-take-addr on an arr lval was also adjusted to yield - T* (not (T[N])*) so `&a[0]` stays consistent. - -- [x] **Multi-dim arrays** - - cc: `cc/41-array-2d.c` - - Done: fixed `parse-decl-suf-cont` to apply suffixes - right-to-left (innermost first) so `int a[2][3]` produces - `arr (arr int 3) 2`, not `arr (arr int 2) 3`. Same fix in the - fn-suffix arm so `T (...)(...)` chains compose correctly. Decay - + ptr arithmetic from §D.5 then handles the rest. - -- [x] **Struct passed by pointer to a function** - - cc: `cc/42-struct-fn-arg.c` — passes `&s` to `sum2`, - callee returns `p->x + p->y`. - - Done: composes §D.1/§D.3 (cg-push-field, arrow access) and the - pre-existing param/call/return wiring. No new primitive. - - *Pass-by-value of structs is outside CC.md's accepted set; tcc.c - doesn't use it.* - -### E. Initializers - -`parse-init-list` (`parse.scm` lines 398–413) currently balances -braces and returns `#f`, dropping all initializer data. `cg-emit-global` -accepts an init bv but is never given one. - -- [x] **Scalar global with constant initializer** - - cg: `cc-cg/49-init-scalar-global.scm` - - cc: `cc/49-init-scalar-global.c` - - Done: cg-emit-global now consumes a list of pieces (bytevectors - or `(label-ref . label-bv)` pairs); parser's parse-init-global - builds N-byte LE bv via %int->le-bv from a const expression. - -- [x] **Scalar global with address initializer (`int *p = &x;`)** - - cg: `cc-cg/50-init-addr.scm` - - cc: `cc/50-init-addr.c` - - Done: %const-init-piece recognises `&IDENT` and bare-IDENT for - fn / static / extern symbols, emitting `(label-ref . cc__name)`. - -- [x] **Array global from element list** - - cg: `cc-cg/51-init-array-list.scm` - - cc: `cc/51-init-array-list.c` - - Done: %parse-init-array-list walks brace lists; element types - drive bv width; array-name → label-ref decay so `int *p = a;` - works as init too. - -- [x] **Array global from string literal** - - cg: `cc-cg/52-init-array-str.scm` - - cc: `cc/52-init-array-str.c` - - Done: parse-init-global recognises STR for char[] target; - inferred-length arrays (`T a[]`) get patched with the literal - length. - -- [x] **Struct global, positional init** - - cg: `cc-cg/53-init-struct-pos.scm` - - cc: `cc/53-init-struct-pos.c` - - Done: %parse-init-struct-list walks fields positionally; trailing - fields zero-padded. - -- [x] **Struct global, designated init (`.field = …`)** - - cg: `cc-cg/54-init-struct-desig.scm` - - cc: `cc/54-init-struct-desig.c` - - Done: same %parse-init-struct-list handles `.name = …` form. - Also: cg-arith-conv now leaves pointer-typed operands alone so - `*(p + N)` scales correctly when one side is ptr. - -- [x] **Local array initializer** - - cc: `cc/55-init-local-array.c` - - Done: parse-init-local-aggregate emits per-element store ops at - slot+(i*esize); zero-pads trailing slots when declared length - exceeds initializer count. - -- [x] **Local struct initializer** - - cc: `cc/56-init-local-struct.c` - - Done: same per-field store sequence; designated form supported. - -### F. Control flow extensions - -- [x] **`do { } while (e);`** - - cg: `cc-cg/63-do-while.scm` - - cc: `cc/63-do-while.c` - - Done: composes existing `cg-loop` + `cg-if` + `cg-break`; - fixture-only. - -- [x] **`for (init; cond; step)` with declaration in `init`** - - cc: `cc/64-for-decl.c` - - Done: existing `parse-for-stmt` exercised end-to-end. - -- [x] **`switch / case / default` with fall-through** - - cg: `cc-cg/64-switch.scm` — three cases falling through to default. - - cc: `cc/65-switch.c` - - Done: validated the existing `swctx` machinery in cg. - -- [x] **`goto` / labelled statement (forward and backward)** - - cg: `cc-cg/65-goto.scm` - - cc: `cc/66-goto.c` - - Done: replaced the `cg-break` hack in `parse-goto-stmt`. Added - `cg-emit-label cg name-bv` (drops `::user_<name>`) and - `cg-goto cg name-bv` (emits `%b(&::user_<name>)`). - `parse-labelled-stmt` now calls `cg-emit-label` before the inner stmt. - - Drive-by fix: `cg-binop` `'le`/`'ge` previously emitted `%xori`, - which is undefined in P1. Replaced with `%li(t1,1) %xor(t0,t0,t1)`. - -### G. Variadics - -- [x] **Variadic call: per-arg default-promote** - - cg: `cc-cg/66-vararg-call.scm` - - cc: `cc/67-vararg-call.c` - - Done: parser inspects fn type at `parse-call-args`; for arg index - ≥ named-arg count, emits `cg-promote` per CC.md §Implicit - conversions. Fixed-arg index emits `cg-cast` to declared param - type (also covers §K.5). - -- [x] **Variadic receive: `__builtin_va_start/arg/end`** - - cg: `cc-cg/69-vararg-recv.scm` — sums N int-typed variadic args. - - cc: `cc/76-vararg-recv.c` - - Done: added `cg-va-start cg`, `cg-va-arg cg ctype`, - `cg-va-end cg` (each pops ap-lval from vstack); - `cg-fn-begin/v` reserves a 16-slot incoming-arg window: indices - 0..3 are saved from a-registers, indices 4..15 from `LDARG` slots - 0..11. va_arg walks the window linearly from the slot at index = - named-arg count. Parser recognizes `__builtin_va_start/arg/end` at - parse-primary; `parse-fn-body` threads the fn ctype's variadic? - flag. Bundled `cc/headers/stdarg.h` aliases `va_list`/`va_start`/ - `va_arg`/`va_end` to the builtins. Lock-in fixture - `cc/79-vararg-deep.c` exercises 1 named + 6 variadic args. - Limit: 15 variadic args after the named ones; bump - `VARARG_WINDOW` (currently 16) in `cg-fn-begin/v` to extend. - -### H. Conditionals as values - -Added `cg-ifelse-merge`: caller pre-allocates the result slot, each -thunk pushes one rval that is then loaded and stored into the slot, -and the slot's frame rval is left on the vstack. The merged result -type is taken from the first thunk's pushed type — parser is -responsible for arranging compatible types in the two branches. - -- [x] **Ternary `?:` leaves exactly one rval** - - cg: `cc-cg/28-ternary.scm` — `c ? 7 : 9` with c=1 → exit 7. - - cc: `cc/28-ternary.c` - - Done: parser swaps `cg-ifelse` for `cg-ifelse-merge` in the - `qmark` arm of `parse-binary-rhs`; both branches push their - parsed rval directly. - -- [x] **`&&` short-circuit leaves exactly one i32 rval** - - cg: `cc-cg/29-land.scm` - - cc: `cc/29-land.c` - - Done: parser injects `cg-cast %t-bool` then `cg-cast %t-i32` on - the rhs side so the merged result is i32 ∈ {0,1}; the else-arm - pushes `%t-i32 0`. - -- [x] **`||` short-circuit leaves exactly one i32 rval** - - cg: `cc-cg/30-lor.scm` - - cc: `cc/30-lor.c` - - Done: mirrors §H.2 with the bool-cast on the else-arm and a - constant `%t-i32 1` in the then-arm. - -### I. Storage classes - -- [x] **Block-scope `static` lives in bss/data, not on the stack** - - cg: `cc-cg/57-block-static.scm` - - cc: `cc/57-block-static.c` - - Done: handle-decl gates on `sto = 'static'` *before* the - file-vs-block branch and routes block-scope statics to - cg-emit-global with a `cc__<fn>__<var>` mangled label. - -### J. Driver / envelope - -- [x] **Entry stub forwards `argc` / `argv` to `main`** - - cc: `cc/00-return-argc` (already green; locked in). - - Done: P1's program-entry contract delivers `a0=argc`, `a1=argv` - at `p1_main` (P1.md §Program Entry). `%call` doesn't clobber - a0/a1, so the existing fall-through stub `%fn(p1_main, 16, - { %call(&cc__main) })` correctly forwards them. Documented in - `cg-finish`. - -- [x] **`int main()` falling off the end returns 0** - - cc: `cc/68-main-noret.c` — `int main(){}` → exit 0. - - Done: `cg-fn-begin` now zero-inits the ret slot in the prologue - when the return type isn't void, so falling through to `::ret` - reads back a defined 0 instead of relying on kernel zero-fill. - -- [x] **Multi-function translation unit with forward references** - - cc: `cc/69-multi-fn.c` - - Done: forward declaration `int helper(int x);` binds an extern - fn sym up-front so `parse-primary` finds it before the - definition appears. - -### K. Expressions and conversions - -- [x] **Comma operator (`a, b` as expression)** - - cc: `cc/31-comma.c` — `int a; int b; (a=1, b=2); return a + b*10;` - → exit 21. - - Done: added `(comma . (1 . 2))` to `%binop-bp` (left-assoc, below - `assign`'s 4/3 so `parse-call-args ps 4` still won't slurp it as a - call separator); handler `cg-pop`s the lhs and evaluates the rhs. - -- [x] **Function-pointer call** - - cg: `cc-cg/67-fnptr-call.scm` — push a fn-typed sym, spill to a - frame slot, reload, call. - - cc: `cc/71-fnptr-call.c` — `int (*fp)(int) = f; return fp(7);` - → exit 21. - - Done: exercises `cg-call`'s `%callr(t0)` branch; verified - return-type extraction walks `ptr → fn → ret` correctly. - -- [x] **Enum constant in expressions** - - cc: `cc/72-enum-const.c` — `enum E { A=1, B=10, C }; - return A+B+C;` → exit 22. - - Done: locked in the existing `parse-primary` `'enum-const` branch. - -- [x] **`void *` ↔ `T *` implicit conversion (no cast required)** - - cc: `cc/73-voidptr-impl.c` — `void *p; int x=42; p=&x; - int *q=p; return *q;` → exit 42. - - Done: cg-cast's `to-kind = 'ptr` clause is relabel-only between - any pointer types; `cg-assign` drives the cast each direction. - -- [x] **Implicit narrowing of fixed-arg call arguments to declared - param type** - - cc: `cc/74-call-narrow.c` — `int f(unsigned char x){return x;} - int main(){ return f(258); }` → exit 2. - - Done: `parse-call-args` now emits `cg-cast` per fixed arg to the - declared param type (variadic args promoted via §G.1). - -- [x] **Pointer comparison is unsigned** - - cg: `cc-cg/68-ptr-cmp.scm` — verifies `%ifelse_ltu` dispatch. - - cc: `cc/75-ptr-cmp.c` — `int a[2]; return &a[1] > &a[0];` - → exit 1. - - Done: `cg-binop`'s `lt/le/gt/ge` dispatch already picks the - unsigned variant for ptr/arr operands. Fixtures lock it in. - -### L. Aggregates round 2 - -- [x] **Flexible array member as last struct field** - - cc: `cc/77-flex-array.c` — backs `struct s { int n; int data[]; };` - with an `int[]` buffer, casts to `struct s *`, indexes through - `p->data[i]`. tcc.c's `Sym` / `TokenSym` rely on this. - - Done: parser already accepted `T name[]` (no size) via the existing - `parse-decl-suf-cont` `[]` arm; `complete-agg!` already excluded - flex extent because `(max sz 0)` collapses `-1` to 0; `cg-push-field` - + the indirect-frame path + `cg-decay-array` already composed - correctly. The actual fix was a latent cast-conversion bug in - `parse-cast-or-unary` — it skipped `rval!` when the target type - was a pointer, so `(struct s *)buf` relabeled the array's lval - rather than decaying it to a ptr-rval. Now `rval!` runs for every - cast (matches C lvalue-conversion semantics: arrays decay, lvals - become rvals before the bit-cast). - -- [x] **`T[]` in parameter position decays to `T *`** - - cc: `cc/43-array-param-decay.c` — `int sum(int a[], int n) - { ... s+=a[i]; ... } sum(xs,4)` over `{1,2,3,4}` → exit 10. - - Done: `parse-fn-params` rewrites arr→ptr (and fn→ptr) before - slot allocation, so cg sees an 8-byte ptr slot and the callee's - `a[i]` decays the param's loaded ptr-rval and scales the index - correctly. - -- [x] **Array of function pointers initialized with named functions** - - cg: `cc-cg/58-fnptr-tab.scm` - - cc: `cc/58-fnptr-tab.c` - - Done: composes §E.3's array list init with §E.2's label-ref; - %const-init-piece's bare-IDENT branch already covers `fn` syms. - Parse fixture uses an explicit typedef for the array element - because `int (*tab[])()` declarators are owned elsewhere. - -## Phase milestones (CC.md §Validation) - -The CC.md milestones gate on contiguous blocks above. Each lights up -once its dependencies are green: - -- [ ] **Self-test sweep** (cc fixtures mirroring tests/scheme1) — depends on §A, - §B, §C, §F, §H. -- [ ] **Hand-written hello-world ELF** — depends on §G, §I, §J + a - string-formatting libc surface. -- [ ] **Compile mes libc `unified-libc.c`** — depends on §D, §E. -- [ ] **Compile tcc.c (tcc-mes defines)** — depends on everything above. -- [ ] **tcc-lispcc builds tcc-boot0**; checksum matches live-bootstrap. - -The last is the bootstrap milestone — at that point lispcc has fully -replaced MesCC in the chain. diff --git a/docs/LISP-C.md b/docs/LISP-C.md @@ -1,1105 +0,0 @@ -# Scheme in P1 - -## Goal - -A small tree-walking interpreter for the Scheme subset specified in -[LISP.md](LISP.md), written in P1 via the M1PP + M1 + M0 + hex2 -toolchain. Intent: simplicity and correctness first; performance is a -later concern. - -The self-hosted C compiler (see [PLAN.md](PLAN.md)) will run under -this interpreter, as will shell-style scripts built on -[shell.scm](../../shell.scm) and -[prelude.scm](../lisp/prelude.scm). - -## Toolchain - -``` -scheme.P1 (hand-written M1PP source using P1.M1pp) - | prepend arch backend + P1.M1pp - | m1pp - v -scheme.M1 (target-specific M1 ops) - | M0 - v -scheme.hex2 - | hex2 - v -scheme (native ELF) -``` - -C-shaped snippets in this document are descriptive of the target shape, -not the source. The source is M1PP-preprocessed P1 assembly. - -## Word size - -P1-64 is the initial target. Every heap word is 8 bytes; every object -is 8-byte aligned. P1-32 is out of scope for the first pass; porting -would require a parallel layout story because M1PP's `%struct` stride -is fixed at 8. - -## Settled decisions - -Load-bearing; the rest of the document assumes them. - -1. **Tree-walking eval/apply over raw s-expressions.** No AST - preprocessing, no bytecode, no compilation step. The reader's - output is the interpreter's input. -2. **Static buffers, no GC.** Fixed-size labeled byte reservations for - heap, symbol table, reader buffer, writer buffer. Heap exhaustion - aborts with an error. -3. **Bump allocation, 8-byte aligned.** Single `heap_next` pointer - (pinned in `s0`); every object is 8-byte aligned by construction. -4. **Tagged values, low-3-bit tag.** See §Value representation. -5. **Symbols are interned indices.** No heap object per symbol; the - value is an index into the symbol table, which stores the name - bytevector. -6. **Environment is a flat association list.** `((sym . val) ...)`, - extended by consing onto the front. Lookup walks the list; falls - through to the global slot in the symbol table. -7. **Closures capture the env pointer directly.** No free-variable - analysis, no flat closures, no boxes. A closure is four words: - `[header, params, body, env]`. -8. **No `set!` on captured variables.** Language-level rule. The - implementation does not enforce it; behavior is unspecified if - violated. Use a pair for shared mutable state. -9. **Inner `define` is sequential-only.** Visible to later forms in - the same body, not to earlier forms. Use `letrec` for mutual - recursion between local helpers. -10. **`pmatch` is a built-in special form.** Saves writing a macro - expander for one feature. -11. **Tail calls via P1 `TAIL` / `TAILR`.** Every tail position in - `eval` and `apply` compiles to `TAIL` (label) or `TAILR` (register) - rather than `CALL` + `RET`. Scheme-level tail-call correctness - falls out. -12. **Primitive failure is undefined behavior.** No runtime checks on - `(car '())`, `(quotient n 0)`, out-of-bounds bytevector access, - integer overflow, or mutation of literals. Follows the spec's - "Primitive failure" policy. - -## Memory layout - -Labeled byte reservations in the final image. Starting sizes: - -| Region | Size | Purpose | -|---------------|----------|--------------------------------------| -| `heap` | 64 MB | bump-allocated Scheme objects | -| `symtab` | 16384 × | open-addressing intern table | -| `readbuf` | 16 MB | slurped source file | -| `writebuf` | 64 KB | staged output for display/write | -| `argbuf` | 64 × 8 B | primitive dispatch scratch | - -All sized via M1PP constants; easy to retune. Overflow of any buffer → -`error(...)` → `sys_exit(1)`. - -### Symbol table entry - -```c -struct sym_entry { - val_t name_bv; // interned name, heap bytevector - val_t global_val; // top-level binding (or UNBOUND) - uint32_t name_hash; - uint32_t chain_next; -}; -``` - -`global_val` gives O(1) top-level lookup: every symbol's global -binding is reachable without hash probing once the symbol is resolved. - -M1PP layout (widened to word-per-field for 8-byte stride): - -``` -%struct SYMENT { name_bv global_val name_hash chain_next } -; .name_bv = 0, .global_val = 8, .name_hash = 16, .chain_next = 24, -; .SIZE = 32 -``` - -The `hash` and `next` fields are logically 32-bit but occupy a full -word each to stay on the `%struct` stride; simpler than packing. - -The symbol table and heap are the only global roots. - -## Value representation - -All values are 64-bit tagged words. Low 3 bits are the tag. - -| Tag | Kind | Notes | -|-------|----------------------------|--------------------------------------------| -| `000` | fixnum | 61-bit signed; `>> 3` to recover | -| `001` | pair pointer | 16 B aligned; `[p-1]` car, `[p+7]` cdr | -| `010` | symbol | `(idx << 3) | 2`; idx into `symtab` | -| `011` | headered heap object | `[p-3]` is the header word | -| `100` | immediate singleton | `#f`, `#t`, `()`, unspec, unbound | -| `101`–`111` | reserved | | - -### Tag and immediate constants - -Encoded as M1PP `%enum`s. Concrete values derive via expression: - -``` -%enum TAG { FIXNUM PAIR SYM HEAP IMM } ; raw indices 0..4 -%enum IMM { FALSE TRUE NIL UNSPEC UNBOUND } ; raw indices 0..4 - -; a tag 100 value in low bits is literally %TAG.IMM == 4 -%macro MKIMM(idx) -$((| (<< idx 3) %TAG.IMM)) -%endm - -; usage: %MKIMM(%IMM.FALSE), %MKIMM(%IMM.NIL), etc. -``` - -Truthiness: every value except `%MKIMM(%IMM.FALSE)` is truthy. Test is -a single compare. - -## Object types - -This is the complete runtime type set. - -| Type | Representation | -|------------------|----------------------------------------------------| -| fixnum | immediate (tag 000) | -| boolean | immediate (`FALSE`, `TRUE`) | -| empty list | immediate (`NIL`) | -| unspec / unbound | immediates | -| pair | `[car][cdr]`, 16 B, tag 001, no header | -| bytevector | `[hdr = BV | len<<8][bytes...]`, tag 011 | -| closure | `[hdr = CLOSURE][params][body][env]`, tag 011 | -| primitive | `[hdr = PRIM | entry_label]`, tag 011 | -| record-td | `[hdr = TD][name_sym][nfields]`, tag 011 | -| record | `[hdr = REC][td][field_0]...[field_{n-1}]`, tag 011| - -Headered object header word (little-endian bytes): - -``` -[ type:8 | aux:56 ] -``` - -where `aux` carries type-specific data (length for bytevectors, -entry-label for primitives, nfields for records, etc.). One load -tells you the type. - -### Layouts - -``` -%struct PAIR { car cdr } ; .SIZE = 16 -%struct CLOSURE { hdr params body env } ; .SIZE = 32 -%struct BV { hdr } ; bytes follow inline -%struct PRIM { hdr } ; header encodes entry -%struct TD { hdr name_sym nfields } ; .SIZE = 24 -; record is variable-width; td+nfields known from TD -%struct REC_HDR { hdr td } ; field_i at offset 16 + 8*i -%enum HDR { BV CLOSURE PRIM TD REC } ; type-byte values -``` - -Strings and bytevectors are the same type. `string?` ≡ `bytevector?`. -No character type: `#\a` in source is sugar for fixnum 97. - -### Primitive encoding - -An M1 label is 4 bytes. A primitive fits in a single 8-byte header: - -``` -bytes [0..4): &entry_label ; 4-byte hex2 label reference -bytes [4..6): arity ; @(arity) little-endian 2 bytes -byte [6] : type = %HDR.PRIM ; !(%HDR.PRIM) -byte [7] : 0 ; !(0) -``` - -This makes `apply_prim` a header-load plus `callr`: no dispatch table, -no switch. §Apply sketches the sequence. - -## Tag idioms - -A handful of short sequences do all of the runtime's bit-twiddling. -They are M1PP macros so the rest of the code reads at the Scheme -level. These are illustrative; exact op selection is finalized during -implementation. - -``` -%macro tagof(rd, rs) ; rd = rs & 7 -%andi(rd, rs, 7) -%endm - -%macro make_fixnum(rd, rs) ; rd = rs << 3 -%shli(rd, rs, 3) -%endm - -%macro untag_fixnum(rd, rs) ; rd = rs >> 3, arithmetic -%sari(rd, rs, 3) -%endm - -%macro car(rd, rs) ; pair tag is 001; raw = rs - 1 -%ld(rd, rs, -1) ; [raw + 0] = car -%endm - -%macro cdr(rd, rs) -%ld(rd, rs, 7) ; [raw + 8] = cdr -%endm - -%macro set_car(rs, rv) -%st(rv, rs, -1) -%endm - -%macro set_cdr(rs, rv) -%st(rv, rs, 7) -%endm - -%macro hdr_load(rd, rs) ; heap tag is 011; raw = rs - 3 -%ld(rd, rs, -3) -%endm - -%macro hdr_type(rd, rs) ; type byte is low byte of header -%lb(rd, rs, -3) -%endm - -%macro hdr_field(rd, rs, idx) ; word field idx, 0 counts the header word -%ld(rd, rs, (- (* idx 8) 3)) -%endm -``` - -`cons` is a bump + two stores + tag: - -``` -%macro cons(rd, rcar, rcdr, rtmp) -; assumes s0 = heap_next, s1 = heap_end -%addi(rtmp, s0, 16) -%la_br(heap_oom) -%bltu(s1, rtmp) ; if heap_end < next+16, OOM -%st(rcar, s0, 0) -%st(rcdr, s0, 8) -%addi(rd, s0, 1) ; tag 001 -%mov(s0, rtmp) -%endm -``` - -`alloc_hdr(rd, type_byte, bytes, rtmp)` is the corresponding idiom for -headered objects; it bumps, writes the header word, and returns the -tagged pointer. The header word is composed on the fly from the type -byte and any aux field. - -## Dispatch tables - -P1 has no indexed branch op, but it has `BR rs` / `CALLR rs` / `TAILR -rs` and one-word loads. That's enough to build a jump table for any -switch wider than ~3 cases, which we use for the special-form -cascade in `eval`. - -Each entry is 8 bytes: a 4-byte M1 label reference plus 4 bytes of -pad. P1-64 has no 4-byte load, only full-word `LD`, so padding to -a word is the cheap path. - -``` -%macro table_entry(lref) -lref %(0) ; label ref + 4 zero bytes = 8 bytes -%endm -``` - -Usage (per special form, in index order): - -``` -special_dispatch_table: -%table_entry(&eval_quote) -%table_entry(&eval_if) -%table_entry(&eval_lambda) -%table_entry(&eval_define) -%table_entry(&eval_set) -%table_entry(&eval_begin) -... -``` - -Dispatch is load + indirect branch: - -``` -; t0 = sym_idx, already range-checked (< N_SPECIAL) -%la(t1, special_dispatch_table) -%shli(t2, t0, 3) -%add(t1, t1, t2) -%ld(t1, t1, 0) -%tailr(t1) -``` - -Five ops and one memory load, regardless of how many special forms. - -**Reservation convention.** Special-form symbols are interned first at -startup, so their sym indices are `0..N_SPECIAL-1`. The dispatcher -checks `idx < N_SPECIAL` before using the table; out-of-range means -"ordinary application". - -## Allocation - -C sketch (target shape): - -```c -static val_t alloc(size_t bytes) { - char *p = heap_next; - heap_next += (bytes + 7) & ~7; - if (heap_next > heap_end) error("heap exhausted"); - return (val_t) p; -} -``` - -P1 translation: `s0` and `s1` are pinned to `heap_next` and `heap_end` -for the lifetime of the program. Allocation is `addi` + `bltu` + -`mov`. In practice almost every allocation site calls one of `%cons`, -`%alloc_hdr`, or an inline bytevector emitter, not a generic `alloc`. - -## Register assignment - -Four callee-saved registers carry the hot state: - -| Reg | Role | -|------|----------------------------------------------------| -| `s0` | `heap_next` | -| `s1` | `heap_end` | -| `s2` | current env pointer | -| `s3` | scratch held across one inner loop at a time | - -`eval(expr, env)` is `a0 = expr`, `a1 = env`, returns in `a0`. -Handlers copy `a1` into `s2` at entry. `a0..a3` and `t0..t2` are free -across calls. Leaf code inside a handler may use `s3` without saving, -provided it restores before the handler returns. - -Reader, writer, and primitive bodies follow the same discipline. Any -primitive that needs deep recursion (e.g. `equal?`) saves `s2` in its -frame even if it does not touch env, since its callees might. - -## Closures - -``` -[ hdr = CLOSURE ][ params ][ body ][ env ] -``` - -- `params`: parameter list s-expression: `(a b c)`, `(a b . rest)`, or - a bare symbol `rest` for a fully variadic lambda. -- `body`: the list of body forms, evaluated sequentially; last in - tail position. -- `env`: the environment chain captured at `lambda` evaluation. - -Four words total. No free-var analysis at closure creation — the whole -env chain is retained and walked at lookup time. - -## Environment - -``` -env ::= NIL - | ( (sym . val) . env ) -``` - -- **Lookup.** Walk the list, pointer-compare the car of each pair with - the target symbol, return the cdr on match. If the walk ends at - `NIL`, fall through to `symtab[idx].global_val`; if that is - `UNBOUND`, error out with `"unbound variable"`. -- **Extension** (scope entry). Cons new `(sym . val)` pairs onto the - front. -- **`set!`.** Walk the env chain; on match, `set-cdr!` the pair. If - the chain is exhausted, write to `symtab[idx].global_val`. -- **Top-level `define`.** Write to `symtab[idx].global_val`. -- **Inner `define`.** Handled inline by `eval_body` (see below). - Each inner define prepends `(name . val)` to the local env as it - is encountered. - -The global env is not a chain — it's the symbol table. Closures -captured before a top-level `define` nonetheless see the new binding, -because the fallback is to `symtab[idx].global_val` at lookup time, -not at capture time. This is how mutual recursion between top-level -functions works without any pre-pass. - -## Eval - -C sketch (target shape): - -```c -val_t eval(val_t expr, val_t env) { - switch (tag(expr)) { - case TAG_FIXNUM: return expr; - case TAG_IMM: return expr; - case TAG_SYM: return env_lookup(expr, env); - case TAG_HEAP: return expr; // bv, closure, record — self-eval - case TAG_PAIR: break; // form; dispatch below - } - - val_t head = car(expr); - val_t rest = cdr(expr); - - if (head == SYM_QUOTE) return car(rest); - if (head == SYM_IF) { - val_t t = eval(car(rest), env); - val_t next = (t != FALSE) ? cadr(rest) : caddr(rest); - MUSTTAIL return eval(next, env); - } - if (head == SYM_LAMBDA) return make_closure(car(rest), cdr(rest), env); - if (head == SYM_DEFINE) return eval_define(rest, env); - if (head == SYM_SET) return eval_set(rest, env); - if (head == SYM_BEGIN) MUSTTAIL return eval_begin(rest, env); - if (head == SYM_LET) MUSTTAIL return eval_let(rest, env); - if (head == SYM_LETSTAR) MUSTTAIL return eval_letstar(rest, env); - if (head == SYM_LETREC) MUSTTAIL return eval_letrec(rest, env); - if (head == SYM_COND) MUSTTAIL return eval_cond(rest, env); - if (head == SYM_AND) MUSTTAIL return eval_and(rest, env); - if (head == SYM_OR) MUSTTAIL return eval_or(rest, env); - if (head == SYM_PMATCH) MUSTTAIL return eval_pmatch(rest, env); - if (head == SYM_DEFINE_RT) return eval_define_record_type(rest, env); - - // application - val_t fn = eval(head, env); - val_t args = eval_list(rest, env); - MUSTTAIL return apply(fn, args); -} -``` - -The special-form symbols (`SYM_QUOTE`, `SYM_IF`, etc.) are interned -once at startup and stored in labeled word slots. Dispatch is a -cascade of integer compares. - -P1 translation. The outer tag switch is a short cascade of `beq`s on -the tag value (fixnum / imm / sym / heap return immediately; pair -falls through to form dispatch). Form dispatch uses the jump table -from §Dispatch tables: - -``` -; eval_pair path, a0 = expr, a1 = env already normalized -%car(t1, a0) ; t1 = head -%cdr(a1, a0) ; a1 = rest, set up for handlers - -; fall through to ordinary application if head is not a symbol -%andi(t2, t1, 7) -%la_br(eval_apply) -%addi(t3, t2, (- %TAG.SYM)) -%bnez(t3) ; tag != SYM -> apply - -; extract idx and bounds-check -%sari(t0, t1, 3) ; t0 = sym_idx -%la(t2, N_SPECIAL_const) -%ld(t2, t2, 0) -%la_br(eval_apply) -%bltu(t0, t2) ; invert: take table path iff idx < N -... - -; table dispatch -%la(t1, special_dispatch_table) -%shli(t2, t0, 3) -%add(t1, t1, t2) -%ld(t1, t1, 0) -%tailr(t1) -``` - -(The bounds-check polarity and the branch for "take table path" are -sketched loosely; the final shape is whatever combination of -`BLT`/`BLTU`/`BEQ` produces the fewest ops.) - -Each handler is its own P1 function. Tail positions inside handlers -end with `%la_br(target); %tail()`. An ordinary application is -`eval(head)`, `eval_list(rest)`, then `%la_br(apply); %tail()`. - -## Apply - -C sketch: - -```c -val_t apply(val_t fn, val_t args) { - switch (header_type(fn)) { - case HDR_PRIM: - return call_prim(fn, args); - case HDR_CLOSURE: { - val_t env = bind_params(clo_params(fn), args, clo_env(fn)); - MUSTTAIL return eval_body(clo_body(fn), env); - } - default: - error("not a procedure", fn); - } -} -``` - -`bind_params` walks params and args in lockstep. Variadic `.`-tail: -when params becomes a non-pair symbol, that symbol is bound to the -remaining args list. When params becomes `NIL` with args remaining -(or vice versa), error out with an arity message. - -P1 translation. `hdr_type` reads one byte; dispatch is two compares: - -``` -; apply(fn=a0, args=a1) -%hdr_type(t0, a0) -%la(t1, %HDR.PRIM) -%la_br(apply_prim) -%beq(t0, t1) -%la(t1, %HDR.CLOSURE) -%la_br(apply_closure) -%beq(t0, t1) -%la_br(err_not_procedure) -%b() -``` - -`apply_prim` extracts the entry label from the primitive's header -(low 4 bytes) and `callr`s it with the arg list in `a0`: - -``` -apply_prim: -%hdr_load(t0, a0) ; t0 = header word -%andi(t0, t0, 0xFFFFFFFF) ; low 32 bits = &entry -%mov(a0, a1) ; arg list in a0 for the primitive -%callr(t0) -%ret() -``` - -`apply_closure` runs `bind_params`, extends env, then `%la_br`s -`eval_body` and `%tail()`s. - -## Eval_body - -```c -val_t eval_body(val_t body, val_t env) { - while (is_pair(cdr(body))) { - val_t form = car(body); - if (is_pair(form) && car(form) == SYM_DEFINE) { - val_t name = define_name(form); - val_t val = eval(define_rhs(form), env); - env = cons(cons(name, val), env); - } else { - eval(form, env); - } - body = cdr(body); - } - val_t last = car(body); - if (is_pair(last) && car(last) == SYM_DEFINE) { - val_t name = define_name(last); - val_t val = eval(define_rhs(last), env); - env = cons(cons(name, val), env); - return UNSPEC; - } - MUSTTAIL return eval(last, env); -} -``` - -Inner `define` is sequential-only: visible to forms that textually -follow it in the same body, not to earlier forms. For mutual -recursion between local helpers, use `letrec` or lift to top level. -The `SYM_DEFINE` handler in `eval` itself handles only top-level -`define`; inner defines are absorbed here and never reach that path. - -## Special forms - -Implemented directly, each ~30-60 P1 ops: - -- `quote` — return the datum. -- `if` — eval test; tail-eval the chosen branch. -- `lambda` — build closure (one `%alloc_hdr`). -- `define` — top-level: write `symtab[idx].global_val`. Inner: - handled by `eval_body`. -- `set!` — walk env; first match mutates. Miss: write global slot. -- `begin` — eval each form in sequence; tail-eval the last. -- `let` — eval all inits in the current env; extend env with - bindings; tail-eval body via `eval_body`. -- `let*` — extend env one binding at a time, each init evaluated in - the progressively extended env. -- `letrec` — pre-bind each name to `UNSPEC` in a new env; eval each - init in that env; `set!` each binding; tail-eval body. -- `let` (named) — desugar inline to `letrec` + call, or implement - directly. -- `cond` — loop clauses; first truthy test wins; tail-eval its body; - `else` always true. -- `and` / `or` — short-circuit fold; return the deciding value, not a - coerced boolean. -- `pmatch` — `eval_pmatch` clause loop + `pmatch_match` recursive - walker. See §pmatch below. -- `define-record-type` — see below. - -## pmatch - -Surface spec lives in [LISP-PMATCH.md](LISP-PMATCH.md). This section -covers integration into the scheme1 spine that's already in -`scheme1/scheme1.P1pp`. - -### Why built-in (not a library) - -Pattern bindings extend the interpreter's env via `cons`. The runtime -exposes no `(eval expr env)` primitive and no macro system, so a -user-level pmatch could not introduce bindings visible to its -clause body. Building it in is the smallest path; the alternative is -adding either macros or first-class envs, both bigger commitments than -one special form. - -### Reader integration — no refactor - -pmatch operates on **already-parsed s-expressions**. By the time eval -reaches `(pmatch e (pat body...) ...)`, every pattern is a heap-consed -datum produced by the existing `parse_one` / `parse_list` recursion. -The matcher walks those datums; it never re-invokes the reader. So -nothing in the reader needs restructuring. - -The one additive change is a new branch in `parse_one`'s leading-byte -dispatch: `,` (ASCII 44) → consume the byte, recurse `parse_one` for -the inner datum, build `(unquote <datum>)`. This is a near-copy of the -existing `'` branch (`::quote` in parse_one) — same shape, different -head symbol: - -``` -::comma ;; mirrors ::quote -%la(t2, &readbuf_pos) -%ld(t0, t2, 0) -%addi(t0, t0, 1) -%st(t0, t2, 0) -%call(&parse_one) -%li(a1, %imm_val(%IMM.NIL)) -%call(&cons) -%la(t0, &sym_unquote) -%ld(t0, t0, 0) -%mov(a1, a0) -%mov(a0, t0) -%tail(&cons) -``` - -`,@` is **not** a separate token: `@` is already in `is_ident_byte`'s -set, so `,@x` parses as `(unquote @x)` and the matcher then treats -`@x` as a (legal but odd) binder identifier. An explicit -`msg_bad_unquote_splice` is optional polish, not load-bearing. - -There is no `quasiquote` evaluator, so `,x` written outside a pmatch -pattern evaluates as `(unquote x)` — an ordinary application of an -unbound `unquote` — and dies through the existing `unbound variable` -path. No new failure mode. - -### New cached symbol slots - -Four labeled slots, interned in `intern_special_forms` alongside the -existing ones (`sym_quote`, `sym_if`, …): - -| Slot | Purpose | -|--------------------|----------------------------------------------------------| -| `sym_pmatch` | eval's pair-branch dispatch | -| `sym_unquote` | pattern-side: detect `(unquote x)` — i.e., a `,x` binder | -| `sym_guard` | clause shape: `((guard g…) …)` vs. plain body | -| `sym_underscore` | wildcard test (the literal symbol `_` after `unquote`) | - -`sym_else` is already cached for `cond`; pmatch reuses it for the -`else` clause. Caching `sym_underscore` (vs. byte-comparing the name) -costs one slot and turns the wildcard test into a single `beq`. - -### Eval-side dispatch - -Two new lines in eval's pair branch — same shape as every other -special form already wired up there: - -``` -%dispatch_form(&sym_pmatch, &::do_pmatch) ;; in the cascade -... -::do_pmatch ;; among the tail handlers -%tail_to_handler(&eval_pmatch) -``` - -The cascade is order-independent (each row is a pointer-compare on a -distinct cached symbol), so position in the list doesn't matter. - -### eval_pmatch(rest=a0, env=a1) → value (a0) - -`rest` is `(subject-expr . clauses)`. Subject is evaluated **once**, -then clauses walk top-to-bottom restarting from the *outer* env each -time. - -``` -Frame: 32 bytes - +0 subject ; eval(car(rest), env) once, then read-only - +8 env_outer ; restart point per clause; passed to pmatch_match - +16 clauses ; cdr(rest); advances per non-match - +24 spill ; clause body / extended env across guard eval -``` - -Per-clause shape (same control flow as `eval_cond`'s clause loop, just -with a richer match step): - -``` -loop: - if clauses == NIL: die msg_pmatch_no_match - clause = car(clauses) - pat = car(clause) - tail = cdr(clause) ; (body...) or ((guard ...) body...) - - if pat eq? sym_else: - a0 = tail; a1 = env_outer; tail-call eval_body - ;; we don't validate that an `else` clause has no (guard ...) head; - ;; per LISP-PMATCH.md grammar that's malformed source, but any - ;; (guard ...) here will just be evaluated as a body form, which - ;; will die through the unbound `guard` path. - - call pmatch_match(pat, subject, env_outer) ; -> (env_ext=a0, ok=a1) - if ok == 0: advance clauses; loop - - ; guard clause? — tail begins with (guard g...) - if pair? tail and pair? car(tail) and car(car(tail)) eq? sym_guard: - guards = cdr(car(tail)) - body = cdr(tail) - for each g in guards: ; AND-fold, short-circuit - if eval(g, env_ext) == FALSE: advance clauses; loop - else: - body = tail - - a0 = body; a1 = env_ext; tail-call eval_body -``` - -`eval_body` already preserves tail position for the body's last form, -so the matched clause's last expression is in tail position of the -surrounding `pmatch` — same as `cond`, `let`, `begin`. - -### pmatch_match(pat=a0, subj=a1, env=a2) → (env'=a0, ok=a1) - -Pure leaf-ish recursive helper. Dispatches on the **pattern**'s tag. -Output convention mirrors `parse_dec`: `a0` carries the extended env -on success, `a1` is the 1/0 ok flag; on failure `a0` is undefined and -the caller must not use it. - -| Pattern shape | Match condition | -|-----------------------|----------------------------------------------------------------| -| `IMM.NIL` | `subj == NIL` | -| `TAG.FIXNUM` | raw word compare `subj == pat` (covers integers and chars) | -| `TAG.IMM` (`#t`/`#f`) | raw word compare | -| `TAG.SYM` | raw word compare (symbols are interned, so this is `eq?`) | -| `TAG.HEAP` (bv) | byte-for-byte equality via `bv_equal_raw` — see below | -| `TAG.PAIR` | binder / wildcard if `car(pat) eq? sym_unquote`; else structural recurse | - -#### Frame - -``` -Frame: 24 bytes - +0 pat ; preserved across cons (binder) and across the car recursion - +8 subj ; same — needed in cons in binder case, and as cdr(subj) after recurse - +16 env ; saved across the cons calls in the binder case -``` - -All three are stored on entry; nothing gets unsaved on the way out -because the function returns through `%eret` after computing a0/a1. - -#### Binder / wildcard - -``` -phead = car(pat) -if phead eq? sym_unquote: - ;; validate shape: pat must be (unquote <sym>), exactly two elements - if tag(cdr(pat)) != PAIR: die msg_bad_unquote_pattern - if tag(cdr(cdr(pat))) != IMM.NIL: die msg_bad_unquote_pattern - pident = car(cdr(pat)) - if tag(pident) != SYM: die msg_bad_unquote_pattern - - if pident eq? sym_underscore: - return (env, ok=1) ;; ,_ — no binding - return (cons(cons(pident, subj), env), ok=1) -``` - -Validation is per-call and cheap — three tag compares before any cons. -A defined runtime error here (rather than UB) is the one carve-out -from the "primitive failure is UB" policy: pattern shape is a syntax -error in the user's source, not a hot-path concern. - -#### Structural pair - -``` -;; phead is anything else: a literal symbol, a nested pattern, NIL, ... -if tag(subj) != PAIR: return (_, ok=0) -(env1, ok) = pmatch_match(car(pat), car(subj), env) -if !ok: return (_, ok=0) -return pmatch_match(cdr(pat), cdr(subj), env1) ;; tail position -``` - -Two LISP-PMATCH.md guarantees fall out of this shape: - -- **Improper-tail patterns** — the cdr-recursion accepts any pattern as - ptail. `,rest` there binds the list tail (proper or improper); `()` - there demands a proper list. -- **Nested patterns** — descending in lockstep handles arbitrary - depth; no special case in the matcher. - -#### Partial-match rollback — none needed - -`env` is passed by register, never mutated. A failed sub-match returns -`ok=0` without having shadowed the caller's env pointer. The next -clause restarts from `env_outer` stashed in the eval_pmatch frame, so -any sub-bindings built during the failed match are simply unreachable -and will be reclaimed when the heap is reset. - -#### Allocation cost - -- Successful binder: one cons for `(ident . subj)` plus one cons to - push it onto env — same as a `let` binding. -- Wildcard: zero allocation. -- Failed match: zero allocation past the failure point (the - recursion bails on the first mismatch). -- No pattern compilation, no decision-tree caching. Patterns are - re-walked per call. Revisit if profiling demands it; until then this - matches the rest of scheme1 (tree-walk eval, no preprocessing). - -#### Bytevector equality — factor out of `equal?` - -The existing `equal?` primitive contains the byte-for-byte loop -already; pmatch_match needs the same logic but on raw `val_t`s, not on -an arg list. Factor a leaf helper: - -``` -;; bv_equal_raw(a=a0, b=a1) -> a0 = 1 if equal else 0 -;; assumes both inputs are HEAP-tagged bytevectors -``` - -Reshape `equal?`'s bv branch to `mov`/`mov` its two unpacked operands -into a0/a1 and `tail` into `bv_equal_raw`. pmatch_match calls the same -helper directly. Both keep their existing public behavior; the -duplicated inline loop disappears. - -### Guards - -Evaluated in the **extended env**, short-circuit AND on `FALSE`. The -inner loop reuses the same compare against `%imm_val(%IMM.FALSE)` that -`eval_and` already uses; no new helper. A rejected guard advances the -clause cursor — it does **not** backtrack against the subject. - -### No-match die path - -`msg_pmatch_no_match` routed through `runtime_error`, same path as -every other die in the interpreter. LISP-PMATCH.md classes this as -UB; in practice we always crash with a clear message rather than -wander. - -### File placement - -Both routines live in the §Special forms section of -`scheme1/scheme1.P1pp`, slotting in alongside `eval_cond` / -`eval_let` / etc.: - -1. `eval_pmatch` — the special-form entry point, called via the - `::do_pmatch` tail handler in eval's pair branch. -2. `pmatch_match` — leaf-ish recursive helper, defined immediately - below `eval_pmatch` (its only caller). -3. `bv_equal_raw` — leaf helper. Lives near the existing `equal?` - primitive in §Primitives, since both call into it. - -`intern_form` rows for the four new symbol slots go into -`intern_special_forms`, in any position. `parse_one`'s `,` branch -goes alongside the existing `'` branch in the reader section. - -## define-record-type - -Desugars at eval time into a sequence of primitive operations. - -```scheme -(define-record-type point - (make-point x y) - point? - (x point-x) - (y point-y set-point-y!)) -``` - -becomes - -```scheme -(define <point-td> (%make-record-td 'point 2)) -(define make-point (lambda (x y) (%make-record <point-td> x y))) -(define point? (lambda (v) (%record-is-a? v <point-td>))) -(define point-x (lambda (p) (%record-ref p 0))) -(define point-y (lambda (p) (%record-ref p 1))) -(define set-point-y! (lambda (p v) (%record-set! p 1 v))) -``` - -The `%` primitives are internal, not part of the user-facing -primitive list. - -## Primitives - -55 user-facing primitives plus 6 internal record ops. Each is its own -P1 function, reached via `callr` on the label packed into the -primitive object's header. - -**Predicates / equality** (13): `eq?`, `equal?`, `not`, `boolean?`, -`pair?`, `null?`, `symbol?`, `integer?`, `string?` (≡ `bytevector?`), -`procedure?`, `record?`, `record-is-a?`, `record-type?`. - -**Pairs** (5): `cons`, `car`, `cdr`, `set-car!`, `set-cdr!`. - -**Integers** (24): `+`, `-`, `*`, `quotient`, `remainder`, `modulo`, -`=`, `<`, `<=`, `>`, `>=`, `zero?`, `positive?`, `negative?`, `abs`, -`min`, `max`, `bit-and`, `bit-or`, `bit-xor`, `bit-not`, -`arithmetic-shift`, `number->string`, `string->number`. - -**Bytevectors / strings / symbols** (10): `make-bytevector`, -`bytevector-length`, `bytevector-u8-ref`, `bytevector-u8-set!`, -`bytevector-copy`, `bytevector-copy!`, `bytevector-append`, -`bytevector=?`, `string->symbol`, `symbol->string`. - -**Control / program / error** (3): `apply`, `argv`, `error`. - -**Internal record ops** (6): `%make-record-td`, `%make-record`, -`%record-ref`, `%record-set!`, `%record-is-a?`, `%record-td`. - -### Calling convention - -`apply_prim` receives the arg list in `a0` and jumps via `callr` to -the primitive's entry label (extracted from the primitive header). -Each primitive's body: - -1. Walks the arg list, copying values into `argbuf[]` and counting - `argc`. -2. Checks arity (fixed-arity primitives; variadic ones do their own). -3. Computes result into `a0`. -4. `ret`s. - -There is no code-id switch and no dispatch table — the table of -primitives is the set of labels referenced from `register_primitives` -at startup, and the interpreter keeps zero pointers to it afterwards. - -## Prelude - -Written in Scheme, embedded as a bytevector literal in `scheme.P1`, -parsed and evaluated at startup like any other source. Contents: - -```scheme -(define (list . xs) xs) -(define (length xs) ...) -(define (reverse xs) ...) -(define (append . lists) ...) -(define (list-ref xs i) ...) -(define (map f . lists) ...) -(define (for-each f . lists) ...) -``` - -Runs before the user file, sharing the same symbol table and heap. -~80 Scheme LOC. - -## Reader - -Single-pass recursive descent over `readbuf` with an integer cursor. -No token array. Conses directly onto the heap. - -Handles the lexical syntax specified in LISP.md: - -- lists with `.` dotted tail -- `'x` → `(quote x)` -- `,x` → `(unquote x)` — datum sugar only, consumed by `pmatch`; no - `unquote` evaluator. No `quasiquote`, no `,@` / unquote-splicing. -- fixnums: decimal and `#x`-prefixed hex, both with optional sign -- `#t`, `#f` -- `"..."` strings with `\n \t \r \\ \"` escapes -- `#\a`..`#\~`, `#\newline`, `#\space`, `#\tab`, `#\return`, - `#\null`, `#\xNN` → read as fixnums -- symbols: identifier grammar, case-sensitive -- `;` line comments (no block comments, no `#;`) - -### Source locations - -As the reader conses each pair, it records `pair_ptr → (line . col)` -in a side table (fixed-size open-addressing hash). `error` consults -this when its first argument is a pair and prefixes the message with -`"at file:line:col: "`. Without a GC, entries for collected pairs -never arise; a full table simply overwrites old entries via linear -probing. - -## Writer - -Three entry points: - -- `display` — no quoting on strings; bare character output. -- `write` — readable form; strings quoted. -- `format fmt args...` — supports `~a` (display), `~s` (write), `~d` - (decimal fixnum), `~%` (newline). - -Output buffered in `writebuf`; flushed via `sys_write(1, ...)` on -buffer-full or on newline for `display`/`write`. `error` flushes -before writing to fd 2. - -## Error path - -`(error msg arg ...)`: - -1. Flush writer buffer. -2. Write `"error: "` + message + formatted args to fd 2 via - `sys_write`. -3. `sys_exit(1)`. - -No unwinding, no handlers. Primitive failure follows the same policy -as the spec: undefined behavior, likely crash, no runtime check. - -## Startup - -P1 defines a portable program-entry model (see P1.md §Program Entry): -the arch backend emits `_start`, captures argc/argv off the native -entry stack, and calls the portable label `p1_main` with `a0 = argc`, -`a1 = argv`. On return, the backend does `sys_exit(a0)`. - -The Scheme interpreter therefore contains no per-target startup code. -It just defines `p1_main`: - -``` -p1_main: - ; a0 = argc, a1 = argv - init_heap ; s0 = &heap, s1 = &heap_end - init_symtab - intern_special_forms ; fill labeled slots sym_quote, sym_if, ... - ; (order reserves sym_idx 0..N_SPECIAL-1) - register_primitives ; emit %HDR.PRIM words with &entry labels - parse_and_eval_prelude ; prelude is an embedded bytevector literal - ; slurp argv[1] into readbuf - ; loop: read_one_top_level; eval(form, NIL); advance cursor; until EOF - li a0, 0 - ret ; backend _start does sys_exit(0) -``` - -All state lives in labeled reservations; nothing is heap-allocated -before `p1_main`. - -## File layout - -``` -boot2/ - src/ - scheme.P1 # M1PP source, consumes P1.M1pp macro library - lisp/ - prelude.scm # embedded into scheme.P1 as a bytevector literal - tests/lisp/ - 00-arith.scm - 01-list.scm - 02-reader.scm - 03-printer.scm - 04-eval.scm - 05-tailcall.scm - 06-pmatch.scm - 07-records.scm - ... -``` - -Build: prepend the arch-specific backend M1M to `p1/P1.M1pp`, prepend -that bundle to `src/scheme.P1`, run `m1pp` → `M0` → `hex2` → ELF. -Tests are `.scm` files the interpreter runs; pass = exit 0 and -expected stdout. - -## Milestones - -Status legend: `[x]` done · `[~]` in progress · `[ ]` not started. - -1. [ ] **Bootstrapping shim.** Per-target `_start`, argv capture, - labeled reservations for heap / symtab / readbuf / writebuf, - `sys_write` + `sys_exit` error path. Exits with a placeholder - status. -2. [ ] **Tag macros.** `%tagof`, `%make_fixnum`, `%untag_fixnum`, - `%car`, `%cdr`, `%hdr_*`. Verify constants at M1PP time with - `%select`. -3. [ ] **Bump allocator and cons.** `%cons` macro; OOM branch. - Hand-build `(cons 42 nil)` and exit with the decoded car. -4. [ ] **Symbols + interning.** 16k-slot open-addressing table; - `intern(name, len)` returns a tagged symbol val_t. Two interns of - the same name compare equal as integers. -5. [ ] **Bytevectors.** Headered heap object; `make-bytevector`, - `bytevector-u8-ref`, etc. -6. [ ] **Reader.** End-to-end source → s-expression for the full - lexical syntax, with source-location side table. -7. [ ] **Writer.** `display`, `write`, `format`. -8. [ ] **`eval` + `apply` core.** Self-eval, symbol lookup, `if`, - `lambda`, top-level `define`, application. Every tail position - uses `%tail`. -9. [ ] **Remaining special forms.** `set!`, `begin`, - `let`/`let*`/`letrec`, named let, `cond`, `and`/`or`, inner - `define` in `eval_body`. -10. [ ] **Primitives.** All 55 user-facing + 6 internal. Batches: - arith/bitwise, list-core, bv/string, I/O+apply+error, record - ops. -11. [ ] **Prelude.** Embed as bytevector literal; parse and eval at - startup. -12. [ ] **`define-record-type`.** Desugaring + record primitives. -13. [ ] **`pmatch`.** Pattern matcher + clause driver. -14. [ ] **End-to-end.** Run the self-hosted compiler under it. - -Rough total: **~4–5k P1 LOC + ~80 Scheme LOC**. diff --git a/docs/LISP-PMATCH.md b/docs/LISP-PMATCH.md @@ -1,259 +0,0 @@ -# pmatch - -Spec for the `pmatch` special form. Companion to [LISP.md](LISP.md) §Special -forms; implementation sketch in [LISP-C.md](LISP-C.md) §pmatch. - -## Why it's in the language - -The Scheme subset has no macros, no `case-lambda`, no `define-syntax`. The -self-hosted compiler destructures s-expressions on every form it walks -(`(if t a b)`, `(lambda ps . body)`, `(quote x)`, …); doing this by hand -through `car`/`cdr` chains is tedious and error-prone. `pmatch` is the -single concession: one built-in pattern matcher so the compiler reads at -the form level. It is **not** a general macro system — admitting it -specifically rules out admitting `syntax-rules`. - -## Surface syntax - -``` -(pmatch <expr> <clause> ...) - -<clause> ::= ( <pattern> <body> ... ) - | ( <pattern> (guard <g-expr> ...) <body> ... ) - | ( else <body> ... ) -``` - -- `<expr>` is evaluated **once**; its value is the *subject*. -- `<pattern>` is a literal datum read by the standard reader; it is **not** - evaluated. -- `<body>` is one or more expressions; the last is in tail position of the - enclosing `pmatch`. -- `<g-expr>` are evaluated with the pattern's bindings in scope; see - *Guards* below. - -## Patterns - -| Pattern | Matches | -|----------------------|-------------------------------------------------------| -| `()` | the empty list (`null?` of subject) | -| `<integer>` | that integer (`equal?` — same as `=` for fixnums) | -| `<string>` | byte-for-byte equal bytevector (`equal?`) | -| `#\c` | the integer that `#\c` denotes (`equal?`) | -| `#t` / `#f` | that boolean (`eq?`) | -| `<symbol>` | **that exact symbol** (`eq?`) — *not* a binder | -| `,<ident>` | anything; binds `<ident>` to the subject | -| `,_` | anything; introduces no binding (wildcard) | -| `(p1 p2 ... pN)` | proper list of length exactly `N`; each `pi` matches | -| `(p1 ... pN . ptail)`| improper list; prefix matches `p1..pN`, tail matches `ptail` | - -Patterns nest: `(if ,t (op ,a ,b) ,c)` is well-formed. Sub-patterns are -matched recursively against the corresponding sub-objects. - -### Malformed `,`-patterns - -`(unquote)`, `(unquote x y ...)`, and `(unquote <non-symbol>)` (e.g. -`,42`, `,(foo)`) are all **runtime errors**, not undefined behavior. -This is the only carve-out from the *Primitive failure* policy in -LISP.md — pattern shape is a syntax error in the user's source, worth -catching with a clear message. Implementation cost is three tag -compares per binder match; see LISP-C.md §pmatch for the check. - -### Reader interaction - -`pmatch` reuses the standard reader. To make `,<ident>` writeable, the -reader recognizes `,X` as datum sugar for `(unquote X)` — the same shape -R7RS gives it, but **only** as a datum form. There is no `unquote` -evaluator and no `quasiquote`; the form exists purely so that pmatch -patterns can be parsed. Writing `,x` outside a pmatch pattern produces a -list `(unquote x)`, which `eval` will reject as an application of an -unbound variable. - -This is the only reader concession. `,@` (unquote-splicing) is **not** -recognized. - -### Symbols vs. binders - -This is the most common stumble. In a pattern: - -- `foo` matches the literal symbol `foo` (by `eq?`). It does **not** bind. -- `,foo` matches anything and binds `foo` to it. - -So `(if ,t ,a ,b)` matches the head symbol `if` literally and binds `t`, -`a`, `b` to the three subforms. `(,head ,t ,a ,b)` matches *any* 4-element -list and binds `head` as well. - -### Wildcards - -`,_` matches anything and introduces no binding. Multiple `,_` in one -pattern are fine. The identifier `_` is otherwise unreserved — `(define _ -0)` is legal Scheme; `_` outside a `,` is just a literal symbol pattern. - -### Improper-tail patterns - -`(p1 ... pN . ptail)` first matches the prefix exactly like a proper-list -pattern, then matches `ptail` against whatever cdr remains (which may be a -proper list, an improper tail, or `()`). - -`ptail` is a full pattern, not just a binder. Common cases: - -- `ptail = ,rest` — `rest` binds the remaining list (the most common - form; works for both proper and improper tails). -- `ptail = ()` — equivalent to writing the proper-list pattern - `(p1 ... pN)`. -- `ptail = (q1 ... qM . qtail)` — keep destructuring further into the - tail. - -Example: `(lambda ,ps . ,body)` matches any list whose head is the symbol -`lambda` and at least one more element; `ps` gets the second element, -`body` gets the cdr from there. - -## Semantics - -### Clause selection - -Clauses are tried top to bottom. The first clause whose pattern matches -the subject **and** whose guards (if any) all evaluate to a true value is -selected. Its body is evaluated in the extended environment; the value of -the last body expression is the value of the `pmatch`. - -`else` is a clause that always matches with no bindings; it must be the -last clause if present. - -### Guards - -``` -( <pattern> (guard <g-expr> ...) <body> ... ) -``` - -- Guards are evaluated *only if the pattern structurally matched*. -- Each `<g-expr>` is evaluated in left-to-right order with the pattern's - bindings visible. Evaluation short-circuits: the first `<g-expr>` that - yields `#f` aborts the guard. -- All `<g-expr>` truthy ⇒ the clause is selected. Any `<g-expr>` `#f` ⇒ - the clause is **rejected**, bindings are discarded, and matching - continues with the next clause. -- A guard rejection does **not** re-match the pattern against later - positions in a list — the subject is fixed. (Same shape as guards in - Erlang/ML: structural match, then filter.) - -Zero `<g-expr>` is permitted but pointless: `(p (guard) body ...)` is the -same as `(p body ...)`. - -### Bindings - -- Each `,<ident>` in the pattern introduces a fresh binding. Bindings are - visible inside *that clause's* guards and body, and nowhere else. -- Two `,<ident>` with the **same** identifier in one pattern is undefined - — the language does not promise to compare for equality (no - Prolog-style unification) nor to reject the pattern. Don't write it. -- Bindings shadow same-named names in the surrounding lexical scope, just - like `let`. - -### No-match - -If no clause's pattern matches and no `else` is present, behavior is -**undefined** (per the *Primitive failure* policy in LISP.md; expect a -crash, not a defined error). Write an explicit -`(else (error "pmatch: no match" subject))` whenever the matcher is not -provably exhaustive. - -### Evaluation order - -- `<expr>` evaluated once before any pattern is tried. -- Patterns themselves are not evaluated; they are walked structurally - against the subject. -- For matching nested patterns, walk order is unspecified — patterns are - pure datum tests against an already-evaluated subject, so order is not - observable. -- The selected body is in tail position of the `pmatch`. Tail calls in the - body are proper tail calls of the surrounding context. - -### Equality used for literals - -Integer / string / character / boolean patterns match by `equal?`. -Concretely: integers and characters compare via `=` on the underlying -fixnum (characters are u8 fixnums per LISP.md); strings compare -byte-for-byte; booleans compare by `eq?`. Symbol patterns match by `eq?` -(symbols are interned). - -## Examples - -Compiler form dispatch: - -```scheme -(pmatch e - ((quote ,x) (emit-literal x)) - ((if ,t ,a ,b) (emit-if t a b)) - ((lambda ,ps . ,body) (emit-lambda ps body)) - ((set! ,name ,v) (emit-set! name v)) - ((,f . ,args) (emit-call f args)) - (,x (guard (symbol? x)) (emit-var x)) - (,x (guard (integer? x))(emit-int x)) - (else (error "unknown form" e))) -``` - -Three things to notice: - -1. `quote`, `if`, `lambda`, `set!` are literal symbol patterns — they - match those head symbols only. -2. The `(,f . ,args)` clause must come **after** the special-form clauses - because it would otherwise swallow them. -3. The trailing `,x (guard ...)` clauses fire on atoms (subject didn't - match any list pattern); guards then split symbol vs. integer. - -Destructuring a `let` clause: - -```scheme -(pmatch clause - ((,name ,init) (cons name init)) - (else (error "bad let clause" clause))) -``` - -Walking a proper list with a tail binder: - -```scheme -(define (sum xs) - (pmatch xs - (() 0) - ((,h . ,t) (+ h (sum t))))) -``` - -## What pmatch is *not* - -Kept explicit so additions are deliberate. - -| Feature | Status | -|--------------------------------------|------------------------------| -| `,@<ident>` segment / splice patterns | not supported | -| `...` / Kleene-star repetition | not supported | -| `or` / disjunction patterns | not supported | -| `and` / conjunction patterns | not supported | -| Predicate patterns (`(? pred)`) | use `(guard (pred ,x))` | -| As-patterns (`(<pat> :as ,name)`) | not supported | -| Non-linear patterns (repeated names) | undefined behavior | -| User-extensible patterns | not supported | -| Compilation / decision-tree caching | not done; rewalk every call | - -If a use case demands one of these, the workaround is a `guard` plus -auxiliary `,<ident>` bindings, or splitting the match into nested -`pmatch`es. Anything beyond that is out of scope for v1. - -## Implementation contract (for LISP-C.md) - -This section pins what the interpreter must guarantee so callers can rely -on it; the actual P1 implementation lives in LISP-C.md. - -- `pmatch` is a built-in special form, not a primitive procedure. It does - not exist as a value. -- It receives unevaluated clauses and walks them itself. -- The matcher walks pattern and subject recursively. On match, it returns - an environment extended with the new bindings (or fails). The driver - tries clauses top-to-bottom, evaluates guards in the extended env, and - tail-evaluates the body of the first selected clause. -- No pattern compilation; no decision-tree memoization. Patterns are - re-walked per call. (See *Primitive failure* policy: re-add caching - later if profiling shows it's worth it.) -- Bindings are stored in the same flat alist environment used by `let` — - no separate match-env type. -- The body's last expression is reached via the same tail-call path as - `begin` / `let` body, so tail recursion through `pmatch` works. diff --git a/docs/SCHEME1-R7RS-TODO.md b/docs/SCHEME1-R7RS-TODO.md @@ -1,118 +0,0 @@ -# scheme1 R7RS TODO - -Lists the R7RS items we intend to add, the items we explicitly aren't adding, -and rough implementation notes. - -## Interpreter changes - -### Reader - -- [ ] **`#(...)` vector literals.** Required by full vector support. - -### Runtime types - -- [ ] **Vector type.** New heap header (`HDR.VEC`?), inline length + - slot array. Tag stays `HEAP`. -- [ ] **Multiple-values protocol.** Internal MV vector (or stack - convention) returned from `values`. Single-value contexts that - receive ≠1 values are an error. - -### Special forms / form fixes - -- [ ] **`let-values`, `let*-values`, `define-values`.** Depend on the - MV protocol. -- [x] **Flat (one-level) quasiquote.** `(quasiquote ...)`, - `(unquote x)`, `(unquote-splicing xs)` evaluated at the single - template depth. Nested quasiquote is unsupported. Works on - lists today; extend to vectors when those land. - -### New primitives - -These can't be expressed in Scheme; they need new entries in -`prim_table` (or new syscall wrappers). - -- [ ] **Vector core:** `vector?`, `make-vector`, `vector-length`, - `vector-ref`, `vector-set!`. -- [ ] **Multi-values:** `values`, `call-with-values`. - ---- - -## Prelude additions (`prelude.scm`) - -These are blocked on the new primitives above; once those land the -prelude wrappers are straightforward. - -### Vectors (§6.8) — over the new vector primitives - -- [ ] **Constructor:** `vector` (variadic). -- [ ] **List/string conversion:** `vector->list`, `list->vector`, - `vector->string`, `string->vector`. -- [ ] **Bulk:** `vector-copy`, `vector-copy!`, `vector-append`, - `vector-fill!`. -- [ ] **Higher-order:** `vector-map`, `vector-for-each`. - ---- - -## Permanent deviations (won't add) - -These are deliberate omissions — document them in user-facing notes -rather than tracking them as TODOs. - -Out of scope: -- §6.2 numerics — keep fixnum. -- §4.3 macros / `define-syntax` / `syntax-rules` / `let-syntax` / - `letrec-syntax`. -- §5.6 libraries / `define-library` / `import` / `export` / `cond-expand`. - ---- - -### Types - -- Distinct `char` type (chars are fixnum bytes). -- Distinct R7RS port type (we have a prelude `port` record over fds). -- `promise`, `parameter`, `continuation`, `error-object`, - `environment`. - -### Forms - -- Internal `define` — interpreter raises an explicit error; users - must use `letrec`. -- `delay`, `delay-force`, `force`, `make-promise`, `promise?`. -- `make-parameter`, `parameterize`. -- `guard` as an exception form (the `guard` keyword is reserved for - `pmatch` sub-clauses only). -- `dynamic-wind`. -- `include`, `include-ci`. (Source concatenation is shell-side.) -- Nested quasiquote. -- Block comments `#| ... |#` and datum comments `#;`. - -### Procedures - -- `utf8->string`, `string->utf8`. -- All exception procs: `raise`, `raise-continuable`, - `with-exception-handler`, `error-object?`, `error-object-message`, - `error-object-irritants`, `read-error?`, `file-error?`. (`error` - itself stays as scheme1's flush-and-exit.) -- `call-with-current-continuation`, `call/cc`. -- `eval`, `environment`, `scheme-report-environment`, - `null-environment`, `interaction-environment`. -- The entire R7RS port surface (§6.13): `call-with-port`, - `current-input-port`, `current-output-port`, `current-error-port`, - `input-port?` / `output-port?` / `textual-port?` / `binary-port?` / - `port?` / `input-port-open?` / `output-port-open?`, - `with-input-from-file`, `with-output-to-file`, `open-input-file` / - `-binary-` / `open-output-file` / `-binary-`, `close-port` / - `close-input-port` / `close-output-port`, string ports - (`open-input-string`, `open-output-string`, `get-output-string`, - `open-input-bytevector`, `open-output-bytevector`, - `get-output-bytevector`), reader/writer procs (`read`, `read-char`, - `peek-char`, `read-line`, `read-string`, `read-u8`, `peek-u8`, - `u8-ready?`, `char-ready?`, `read-bytevector`, `read-bytevector!`, - `write`, `write-shared`, `write-simple`, `write-char`, - `write-string`, `write-u8`, `write-bytevector`, - `flush-output-port`, `newline`). -- `load`. -- `current-second`, `current-jiffy`, `jiffies-per-second`. -- `features`, `emergency-exit`. -- `char-ci=?` / `char-ci<?` / `char-ci>?` / `char-ci<=?` / - `char-ci>=?` (derive from `char-foldcase` if needed).