boot2

Playing with the boostrap
git clone https://git.ryansepassi.com/git/boot2.git
Log | Files | Refs | README

commit 0e0688006486288d9b30757cc807f016a71b2e6f
parent ba9c6b8cadfd7b802403b6a2df129cad572cab58
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Tue, 28 Apr 2026 10:47:38 -0700

SCHEME1.md

Diffstat:
MREADME.md | 2+-
Ddocs/LISP.md | 224-------------------------------------------------------------------------------
Adocs/SCHEME1.md | 262+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 263 insertions(+), 225 deletions(-)

diff --git a/README.md b/README.md @@ -36,7 +36,7 @@ * P1: [docs/P1.md](docs/P1.md.html) * M1pp: [docs/M1PP.md](docs/M1PP.md.html) * P1PP: [docs/LIBP1PP.md](docs/LIBP1PP.md.html), [P1/P1.M1pp](P1/P1.M1pp.html), [P1/P1pp.P1pp](P1/P1pp.P1pp.html) -* Scheme: [docs/LISP.md](docs/LISP.md.html), [scheme1/scheme1.P1pp](scheme1/scheme1.P1pp.html) +* Scheme: [docs/SCHEME1.md](docs/SCHEME1.md.html), [scheme1/scheme1.P1pp](scheme1/scheme1.P1pp.html) * C: [docs/CC.md](docs/CC.md.html), [cc/cc.scm](cc/cc.scm.html) If you'd like to chat, email me at hi at ryansepassi.com diff --git a/docs/LISP.md b/docs/LISP.md @@ -1,224 +0,0 @@ -# Minimal Scheme subset (boot2) - -Working doc. Baseline is R7RS-small (2013); everything here is a delta -against it. Intent: smallest surface that supports writing a self-hosted -compiler and shell-style scripts comfortably. Things cut can always come -back; things admitted are load-bearing. - -## Lexical syntax - -- Identifiers: standard Scheme grammar, case-sensitive. Includes - `+ - * / ! ? < > = _ & : ~ ^` and `->` in names. **No** `|…|` - vertical-bar quoted identifiers. A lone `.` is **not** a symbol — - it's reserved for dotted-pair syntax. -- Booleans: `#t`, `#f`. -- Integers: decimal (`42`, `-7`) and hex (`#xff`, `#x-1a`). Word-size - only — 32-bit on 32-bit targets, 64-bit on 64-bit targets. **No** `#o` - (octal), **no** `#b` (binary), **no** floats, rationals, complex, or - bignums. Overflow is undefined (see "Primitive failure" below). -- Strings: `"…"` with escapes `\n \t \r \\ \"`. A string *is* a - bytevector; indexing is by u8. **String literals are immutable** — - `bytevector-u8-set!` on a literal is undefined. -- Characters: `#\a` through `#\~` for printable ASCII, plus named forms - `#\newline` (10), `#\space` (32), `#\tab` (9), `#\return` (13), - `#\null` (0), and `#\xNN` for any byte. A character literal *is* a - u8 integer — no distinct character type. `(= #\a 97)` is `#t`. -- Symbols: bare (`foo`, `list->bytes`). **Globally interned** — two - symbols that print the same are `eq?`. -- Pairs / lists: `(a b c)`, `(a . b)`, `'()` for the empty list. Literal - pairs are immutable. -- Quote: `'x` ≡ `(quote x)`. **No** quasiquote / unquote / unquote-splicing. -- Comments: `;` to end of line. **No** `#| … |#` block, **no** `#;datum`. - -## Types - -The runtime knows about exactly these: - -| Type | Notes | -|--------------|----------------------------------------------------| -| boolean | `#t`, `#f` | -| integer | word-size; 32-bit or 64-bit per target | -| symbol | globally interned; `eq?`-comparable | -| string / bv | same type; contiguous `u8[]`; literals immutable | -| pair | cons cell; literals immutable | -| empty list | `'()`, disjoint from pair | -| procedure | | -| record | via `define-record-type` | - -Characters are **not** a separate type — `#\a` is lexical sugar for the -u8 value 97. - -**Not present:** vectors, ports (as a language type), promises, -parameters (dynamic-scoped values, distinct from function args), -continuations, multiple values, exceptions. - -## Special forms - -- `(define name value)` and `(define (name arg ...) body ...)`, including - variadic tails: `(define (f . rest) …)` and `(define (f a b . rest) …)`. - **Top-level only.** Top-level `define`s may forward-reference each - other — mutual recursion at module scope is supported. Internal - `define` (inside a `lambda`/`let`/`begin` body) is **not** in the - language — use `letrec` instead. Cut to keep the body interpreter - one walk. -- `(lambda (arg ...) body ...)` with the same `.`-tail syntax. - A `lambda` evaluates to a closure: a first-class procedure value that - captures its enclosing lexical environment by reference. `set!` on a - captured variable is visible to every closure that captured it, and - closures that escape their creating frame outlive it (general case: - heap-allocated; non-escape optimization is an implementation detail). -- `(if test then else)` — `else` required. -- `(cond (test body ...) ... (else body ...))`. -- `(and e ...)` / `(or e ...)` — short-circuiting; return the deciding - value, not a coerced boolean. -- `(let ((x v) ...) body ...)`, `(let* …)`, `(letrec …)`. -- `(let name ((x v) ...) body ...)` — named let. -- `(begin e ...)`. -- `(set! name value)`. -- `(quote datum)` / `'datum`. -- `(define-record-type name (constructor field ...) predicate - (field accessor [mutator]) ...)` — R7RS form. Creates a disjoint - type with the given constructor, predicate, and field - accessors/mutators. Implementable as a single heap-allocated object - with a type-descriptor pointer; does not require exposing vectors. -- `(pmatch expr clause ...)` — pattern matcher. *Not* R7RS; added - because the self-hosted compiler destructures s-expressions - constantly and a macroless language makes hand-rolled `car`/`cdr` - chains painful. Clause forms: - - ``` - (pattern body ...) - (pattern (guard g-expr ...) body ...) - (else body ...) - ``` - - Patterns: - - | Pattern | Matches | - |----------------------|------------------------------------------| - | `()` | the empty list | - | `<literal>` | integers, strings, chars, booleans by `equal?` | - | `<symbol>` | that exact symbol (*not* a binder) | - | `,<ident>` | anything; binds to `<ident>` | - | `,_` | anything; no binding (wildcard) | - | `(p1 p2 ...)` | proper list of exactly that length | - | `(p1 ... . ptail)` | improper list; `ptail` binds the rest | - - Clauses are tried top-to-bottom; first matching pattern with - truthy guards wins. A guard that evaluates to `#f` is treated as - no-match and falls through to the next clause. **No match with no - `else`** is undefined behavior (follows the primitive-failure - policy); put an explicit `(else (error ...))` when you care. - - Example: - - ```scheme - (pmatch e - ((quote ,x) (emit-literal x)) - ((if ,t ,a ,b) (emit-if t a b)) - ((lambda ,ps . ,body) (emit-lambda ps body)) - (,x (guard (symbol? x)) (emit-var x)) - (else (error "unknown form" e))) - ``` - -## Core procedures - -Flat namespace — no library carving for v1. - -**Predicates / equality** -`eq?`, `equal?`, `not`, `boolean?`, `pair?`, `null?`, `symbol?`, -`integer?`, `string?` (≡ `bytevector?`), `procedure?`. - -**Pairs & lists** -`cons`, `car`, `cdr`, `set-car!`, `set-cdr!`, `list`, `length`, -`reverse`, `append`, `list-ref`, `map`, `for-each`. - -**Integers** (word-size; overflow undefined) -`+ - *`, `quotient`, `remainder`, `modulo`, `= < <= > >=`, `zero?`, -`positive?`, `negative?`, `abs`, `min`, `max`, -`bit-and`, `bit-or`, `bit-xor`, `bit-not`, `arithmetic-shift`, -`number->string` (optional radix; 10 and 16 required), -`string->number` (optional radix; returns `#f` on parse failure — this -is a pure op, not a syscall, so the normal `(ok . val)` convention -doesn't apply). - -Arities follow R7RS: `+ * bit-and bit-or bit-xor` accept 0+ args -(identities `0 1 -1 0 0`); `-` accepts 1+ (`(- x)` is unary negate); -`= < > <= >=` accept 2+ and chain pairwise. `quotient` / `remainder` / -`modulo` / `arithmetic-shift` are binary; `bit-not` is unary. - -Division rounding (R7RS semantics) — worth stating since target ISAs -differ at the instruction level: - - - `quotient` truncates toward zero: `(quotient -7 2) ⇒ -3` - - `remainder` has the sign of the *dividend*: `(remainder -7 2) ⇒ -1` - - `modulo` has the sign of the *divisor*: `(modulo -7 2) ⇒ 1` - - Identity: `(= n (+ (* (quotient n d) d) (remainder n d)))` - -**Strings / bytevectors** -`make-bytevector`, `bytevector-length`, `bytevector-u8-ref`, -`bytevector-u8-set!`, `bytevector-copy` (optional start/end → fresh bv), -`bytevector-copy!` (R7RS arg order: `dst dst-start src [src-start src-end]`), -`bytevector-append` (variadic; fresh bv), -`bytevector=?` (binary; byte-for-byte equality), -`string->symbol`, `symbol->string`. - -**Control** -`apply`. Tail calls are guaranteed proper on all targets — named let and -every looping idiom depend on it. - -**Program** -`(argv)` returns the process's command-line arguments as a list of -bytevectors. The first element is the program name as passed by the -kernel (argv[0]); user arguments start at the second element. - -**Errors** -`error` aborts with a message and optional irritants. No `raise` / -`guard` / handlers. - -**Primitive failure** is undefined behavior / crash. `(car '())`, -`(quotient 1 0)`, out-of-bounds `bytevector-u8-ref`, integer overflow, -mutating a literal — all implementation-defined, expect a crash. Callers -should not rely on any particular outcome. - -## Cut from R7RS-small - -Kept explicit so additions are deliberate. - -| Feature | Why cut | -|-------------------------------------------|----------------------------------------------| -| `syntax-rules`, macros, `define-syntax` | library is written without them for v1 | -| `call/cc`, `dynamic-wind` | big runtime cost, not needed here | -| `values`, `call-with-values` | replaced by `(ok . val)` pair convention | -| Numeric tower (rational/real/complex/float)| integers only | -| Bignums | word-size integers only (32- or 64-bit) | -| Character *type* (char? char->integer etc)| `#\a` syntax kept as u8 sugar; no char type | -| Vectors (`#(…)`, `make-vector`, etc.) | lists for sequences, records for structs | -| First-class ports | shell.scm defines its own | -| `quasiquote` / `` ` `` `,` `,@` | sugar; re-add with macros | -| `case`, `when`, `unless`, `do` | sugar over `cond`/`if` | -| `delay`, `force`, promises | — | -| `make-parameter`, `parameterize` | — | -| `guard`, `raise`, `with-exception-handler`| v1 uses `(ok . val)` | -| `define-library`, `import`, `export` | files are files for v1 | -| `include`, `cond-expand` | — | -| `case-lambda` | no arity overloading | -| Internal `define` (in lambda/let/begin) | use `letrec`; keeps the body interpreter one walk | -| `#o…` `#b…` `#;` `#\| \|#` | keeps the lexer small (`#\…` is kept) | -| `eqv?` | `eq?` + `equal?` are enough | - -## Dependencies of shell.scm on this subset - -Sanity check that the subset actually supports the library we've written: - -- variadic `.` in `lambda` / `define` → `spawn`, `run` -- named let → `refill!`-driven read loops, `bv-concat-reverse` -- `define-record-type` → `port` record -- `apply` → `run` calling `spawn` -- `bit-or`, `bit-and`, `arithmetic-shift` → openat flags, wait-status decode -- `make-bytevector` with fill byte → `NL-BV` -- `bytevector-copy` (3-arg) and `bytevector-copy!` → read/write paths -- `cons`, `car`, `cdr`, `null?`, `zero?`, `=`, `<`, `-`, `+` → everywhere - -Anything flagged here that we ultimately cut forces a rewrite of -shell.scm. diff --git a/docs/SCHEME1.md b/docs/SCHEME1.md @@ -0,0 +1,262 @@ +# scheme1 + +Minimal Scheme subset implemented by `scheme1/scheme1.P1pp`. A loose +subset of R7RS-small. The interpreter reads s-expressions from +`argv[1]`, evaluates them top-to-bottom in a single global env, and +exits. + +`scripts/boot-run-scheme1.sh` invokes `scheme1` with `prelude.scm` +catted in front of the user file. The prelude (`scheme1/prelude.scm`) +defines the R7RS surface that is expressible over the runtime +primitives — equivalence aliases, list/char/string helpers, and the +`shell.scm` process / file-I/O layer. + +## Lexical syntax + +- **Identifiers**: case-sensitive. Allowed bytes: ASCII letters, + digits, and `! $ % & * + - . / : < = > ? @ ^ _ ~`. A token whose + first byte is a digit (or sign-then-digit) is read as an integer; a + bare `+` or `-` is a symbol. A lone `.` between list elements is the + dotted-pair separator, not a symbol. +- **Booleans**: `#t`, `#f`. +- **Integers**: decimal (`42`, `-7`, `+3`) and hex (`#xff`, `#x-1a`). + Word-size — 32-bit on 32-bit targets, 64-bit on 64-bit targets. + No `#o`, no `#b`, no floats / rationals / bignums. +- **Strings**: `"…"`. Escapes: `\n \t \r \\ \"` and inline-hex `\xNN;` + (1+ hex digits, value 0..255, terminated by `;`). A string is a + bytevector; indexing is by u8. +- **Characters**: `#\a` through `#\~` for printable ASCII, plus + `#\space` (32), `#\newline` (10), `#\tab` (9), `#\return` (13), + `#\null` (0), and `#\xNN` for any byte. A character literal *is* a + fixnum byte — there is no distinct character type. `(= #\a 97)` is + `#t`. +- **Bytevector literal**: `#u8(b1 b2 ...)`. Each element must be a + fixnum 0..255 (range unchecked). +- **Symbols**: bare. Globally interned — two symbols that print the + same are `eq?`. +- **Pairs / lists**: `(a b c)`, `(a . b)`, `'()` for the empty list. +- **Quote sugar**: `'x` → `(quote x)`. **Comma sugar**: `,x` → + `(unquote x)`. The unquote form has no effect outside `pmatch` + patterns; evaluating it elsewhere fails as an unbound reference. +- **Comments**: `;` to end of line. No `#| … |#`, no `#;`. +- **No** vertical-bar identifiers, no quasiquote. + +## Types + +The runtime knows exactly: + +| Type | Notes | +|----------------|--------------------------------------------------------------| +| boolean | `#t`, `#f` | +| integer | word-size; 32- or 64-bit per target | +| symbol | globally interned; `eq?`-comparable | +| string / bv | same type (`HDR.BV`); contiguous u8 buffer | +| pair | cons cell | +| empty list | `'()`, disjoint from pair | +| procedure | closure or primitive | +| record | via `define-record-type` | +| eof-object | singleton; produced by `(eof-object)` and EOF reads | +| unspecified | singleton; result of `set!`, `define`, `(if #f x)`, etc. | + +Multiple-values packs flow through `values` / `call-with-values` / `let-values` +/ `let*-values`; they are not intended to be observed directly. + +## Special forms + +Top-level binding: + +- `(define name expr)` and `(define (name arg ...) body ...)`, + including variadic `.`-tails: `(define (f . rest) …)`, + `(define (f a b . rest) …)`. **Top-level only** — internal + `define` (inside `lambda` / `let` / `begin` body) is rejected at the + start of the body interpreter. Use `let` / nested closures instead. + +Procedures and binding: + +- `(lambda (arg ...) body ...)`, with the same `.`-tail syntax. + Captures its enclosing env by reference; `set!` on a captured + variable is visible to all closures over it. +- `(let ((x v) ...) body ...)`, `(let name ((x v) ...) body ...)` + (named let), `(let* …)`. +- `(let-values (((formals init) ...) body ...)`, + `(let*-values …)`. `formals` is bound via the same matching as + `lambda` parameters: list, dotted-tail, or bare symbol. +- `(set! name value)`. Walks the lexical env; on miss, rebinds the + global slot. + +Conditionals and sequencing: + +- `(if test then else)` — `else` required for two-arm form; + `(if test then)` returns the unspecified value when `test` is `#f`. +- `(cond (test body ...) ... (else body ...))`. A clause may also be + `(test => proc-expr)` — when truthy, calls `proc-expr` on the test + value. +- `(when test body ...)` — body runs when test is truthy, else + unspecified. +- `(case key (datum-list body ...) ... (else body ...))` — datums + matched by `eq?`. +- `(and e ...)` / `(or e ...)` — short-circuiting; return the + deciding value, not a coerced boolean. +- `(begin e ...)`. +- `(do ((var init step?) ...) (test result ...) body ...)` — + R7RS iteration construct; steps update in parallel. + +Quote, records, matching: + +- `(quote datum)` / `'datum`. +- `(define-record-type name (ctor f1 ...) pred (field accessor [mutator]) ...)` + Creates a disjoint type. Binds `ctor`, `pred`, and each `accessor` / + optional `mutator` at top level. Records are `equal?` iff TDs are + `eq?` and all fields are `equal?`. +- `(pmatch expr clause ...)` — pattern matcher. Clause forms: + + ``` + (pattern body ...) + (pattern (guard g-expr ...) body ...) + (else body ...) + ``` + + Patterns: + + | Pattern | Matches | + |----------------------|--------------------------------------------------| + | `()` | the empty list | + | `<literal>` | fixnum / bv / immediate by `equal?` | + | `<symbol>` | that exact symbol (*not* a binder) | + | `,<ident>` | anything; binds to `<ident>` | + | `,_` | anything; no binding (wildcard) | + | `(p1 p2 ...)` | proper list of exactly that length | + | `(p1 ... . ptail)` | improper list; `ptail` binds the rest | + | `($ pred (f p) ...)` | record whose predicate is `pred`; field `f` matches `p` | + + Clauses are tried top-to-bottom. A guard that evaluates to `#f` + falls through. No-match with no `else` aborts. + +## Primitives + +The runtime built-ins — registered at startup from `prim_table` in +`scheme1.P1pp`. The prelude builds the wider R7RS surface on top of +these. + +**Equality / predicates** +`eq?`, `equal?`, `not`, `null?`, `pair?`, `boolean?`, `integer?`, +`symbol?`, `string?` (≡ `bytevector?`), `procedure?`, `zero?`, +`eof-object?`. + +**Pairs** +`cons`, `car`, `cdr`, `set-car!`, `set-cdr!`, `length`, `list-ref`. + +**Integers** (word-size; overflow / divide-by-zero are UB) +`+ - *`, `quotient`, `remainder`, `=`, `<`, `>`, `bit-and`, `bit-or`, +`bit-xor`, `bit-not`, `arithmetic-shift`. Arities: `+ * bit-and +bit-or bit-xor` accept 0+ args (identities `0 1 -1 0 0`); `-` accepts +1+ (`(- x)` is unary negate); `= < >` accept 2+ and chain pairwise. +`quotient` / `remainder` / `arithmetic-shift` are binary; `bit-not` +is unary. `quotient` truncates toward zero; `remainder` has the sign +of the dividend. + +**Bytevectors / strings** +`make-bytevector`, `bytevector-length`, `bytevector-u8-ref`, +`bytevector-u8-set!`, `bytevector-copy` (3-arg `src start end` → +fresh bv), `bytevector-copy!` (`dst dst-start src src-start +src-end`), `bytevector-append` (variadic), `bytevector=?`, +`string-length` (strlen of the data buffer up to the first NUL). + +**Symbols / numbers as text** +`string->symbol`, `symbol->string`, `number->string` (decimal only; +optional radix arg accepted but ignored), `string->number` (decimal +only; returns `#f` on parse failure). + +**I/O and error** +`display`, `write`, `format`, `error`. `format` understands `~a` +(display), `~s` (write), `~d` (decimal fixnum), `~%` (newline), +`~~` (literal tilde); unknown directives pass through verbatim. +`error` writes `scheme1: error: <msg> <irritants…>` to stderr and +exits with status 1. + +**EOF** +`eof-object`, `eof-object?`. + +**Multiple values** +`values`, `call-with-values`. `(values x)` is identical to `x` in +single-value context; 0 or 2+ args produce an MV-pack consumable by +`call-with-values` / `let-values` / `let*-values`. + +**Apply** +`apply`. Tail calls are guaranteed proper. + +**Syscalls** (Linux). Each returns `(#t . val)` on success or +`(#f . errno)` on failure. +`sys-read fd buf offset count`, +`sys-write fd buf offset count`, +`sys-close fd`, +`sys-openat dirfd path-bv flags mode`, +`sys-clone` (fork-style, no args), +`sys-execve path-bv argv-list`, +`sys-waitid idtype id infop options`, +`sys-argv` (no args; returns the process's argv as a list of bvs), +`sys-exit code` (does not return). + +**Heap control** (used by the cc compiler for arena-style allocation) +`heap-usage`, `heap-mark`, `heap-rewind!`, `use-scratch-heap!`, +`use-main-heap!`, `reset-scratch-heap!`, `heap-in-main?`. +`heap-mark` / `heap-rewind!` discard everything allocated after the +mark on whichever heap is current; the scratch heap can be reset +wholesale. UNSAFE: the runtime does not track liveness, so any +surviving reference into a freed region becomes dangling. + +## Error semantics + +`error` is the only structured error path. Everything else — +`(car '())`, out-of-range `bytevector-u8-ref`, `(quotient 1 0)`, +mutating immutable state, integer overflow, unknown-form `pmatch` +fallthrough — is **primitive failure**: the runtime aborts with a +short message on stderr. Callers should not rely on any particular +outcome. + +There is no `raise` / `guard` / handlers, no `call/cc`, no +exceptions. Wrap-and-return through `(ok . val)` pairs (the syscall +convention) when failure needs to be observable. + +## Prelude surface + +`scheme1/prelude.scm` is bundled in front of every user program by +`scripts/boot-run-scheme1.sh`. It adds: + +- **R7RS aliases**: `eqv?` ≡ `eq?`, `number?` ≡ `integer?`, + `bytevector?` ≡ `string?`. +- **Arithmetic**: `<=`, `>=`, `negative?`, `positive?`, `abs`, `min`, + `max`, `modulo`. +- **Equivalence chains**: `boolean=?`, `symbol=?`. +- **List helpers**: `list`, `list?`, `reverse`, `append`, `make-list`, + `list-tail`, `list-set!`, `list-copy`, `memq` / `memv` / `member`, + `assq` / `assv` / `assoc`, `map`, `for-each`, `filter`, `fold`, plus + the full `c[ad]+r` family up to four levels. +- **Characters as fixnums**: `char?`, `char->integer`, + `integer->char` (identity), `char-upper-case?`, `char-lower-case?`, + `char-alphabetic?`, `char-numeric?`, `char-whitespace?`, + `digit-value`, `char-upcase`, `char-downcase`, `char-foldcase`, + `char=?`, `char<?`, `char>?`, `char<=?`, `char>=?`. +- **Strings as NUL-terminated bytevectors**: `make-string`, `string`, + `string-ref`, `string-set!`, `substring`, `string-append`, + `string-copy`, `string-copy!`, `string-fill!`, `string->list`, + `list->string`, `string-upcase`, `string-downcase`, + `string-foldcase`, `string-map`, `string-for-each`, + `string=?` / `string<?` / `string>?` / `string<=?` / `string>=?` + (and `-ci` variants). +- **Bytevectors**: `bytevector` constructor. +- **`shell.scm`** — process and file I/O layer: + - `argv`, `command-line`, `exit`, `spawn`, `run`, `wait`, + `decode-wait-status`. + - `port` record (via `define-record-type`) with a 4 KiB read buffer. + `stdin` / `stdout` / `stderr` are pre-built ports on fds 0 / 1 / 2. + - `open-input`, `open-output`, `open-append`, `close`, + `file-exists?`. + - Buffered reads: `read-bytes`, `read-line`, `read-all`. Each + returns either `(#t . value)` (where `value` may be the + eof-object) or `(#f . errno)` from the underlying syscall. + - Unbuffered writes: `write-bytes`, `write-string`, `write-line`. + Writes loop until the requested length is delivered or a syscall + error surfaces. +- **Constants**: `BUFSIZE`, `AT_FDCWD`, `O_RDONLY`, `O_WRONLY`, + `O_CREAT`, `O_TRUNC`, `O_APPEND`, `MODE_644`, `NL-BYTE`, `NL-BV`.