boot2

Playing with the boostrap
git clone https://git.ryansepassi.com/git/boot2.git
Log | Files | Refs | README

commit 900ed3f87ef94e1ba1382ae46241fdd6514ea361
parent ced3120781606b6d585f0df7557e18e3ca789d8f
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Sat, 25 Apr 2026 17:04:57 -0700

docs: add LISP-PMATCH spec and concrete pmatch impl sketch

LISP-PMATCH.md pins surface semantics that LISP.md left open: clause
selection order, guard AND-fold + short-circuit, binding scope per
clause, tail position, no-match policy, and runtime-error carve-out
for malformed (unquote ...) patterns.

LISP-C.md §pmatch is rewritten to integrate with the actual
scheme1.P1pp spine: ,X reader sugar as one new parse_one branch
(no reader refactor), four cached symbol slots, %dispatch_form +
::do_pmatch wiring, eval_pmatch frame, pmatch_match recursive
walker with explicit 24-byte frame, bv_equal_raw factored out of
equal?, and file placement.

Diffstat:
Mdocs/LISP-C.md | 286++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
Adocs/LISP-PMATCH.md | 259+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 527 insertions(+), 18 deletions(-)

diff --git a/docs/LISP-C.md b/docs/LISP-C.md @@ -600,30 +600,278 @@ Implemented directly, each ~30-60 P1 ops: `else` always true. - `and` / `or` — short-circuit fold; return the deciding value, not a coerced boolean. -- `pmatch` — see below. +- `pmatch` — `eval_pmatch` clause loop + `pmatch_match` recursive + walker. See §pmatch below. - `define-record-type` — see below. ## pmatch -The matcher walks pattern and subject recursively; on success it -returns an env extended with the bindings; on failure it returns a -sentinel. The driver tries clauses top-to-bottom, skipping clauses -whose guard evaluates to `#f`. No match and no `else` is undefined -behavior per spec. +Surface spec lives in [LISP-PMATCH.md](LISP-PMATCH.md). This section +covers integration into the scheme1 spine that's already in +`scheme1/scheme1.P1pp`. -Patterns: +### Why built-in (not a library) -| Pattern | Matches | -|----------------------|------------------------------------------------| -| `()` | the empty list | -| `<literal>` | integer / string / char / bool by `equal?` | -| `<symbol>` | that exact symbol (not a binder) | -| `,<ident>` | anything; binds to `<ident>` | -| `,_` | anything; no binding | -| `(p1 p2 ...)` | proper list of exactly that length | -| `(p1 ... . ptail)` | improper list; `ptail` binds the rest | +Pattern bindings extend the interpreter's env via `cons`. The runtime +exposes no `(eval expr env)` primitive and no macro system, so a +user-level pmatch could not introduce bindings visible to its +clause body. Building it in is the smallest path; the alternative is +adding either macros or first-class envs, both bigger commitments than +one special form. -No compilation or caching — rewalk the patterns each call. +### Reader integration — no refactor + +pmatch operates on **already-parsed s-expressions**. By the time eval +reaches `(pmatch e (pat body...) ...)`, every pattern is a heap-consed +datum produced by the existing `parse_one` / `parse_list` recursion. +The matcher walks those datums; it never re-invokes the reader. So +nothing in the reader needs restructuring. + +The one additive change is a new branch in `parse_one`'s leading-byte +dispatch: `,` (ASCII 44) → consume the byte, recurse `parse_one` for +the inner datum, build `(unquote <datum>)`. This is a near-copy of the +existing `'` branch (`::quote` in parse_one) — same shape, different +head symbol: + +``` +::comma ;; mirrors ::quote +%la(t2, &readbuf_pos) +%ld(t0, t2, 0) +%addi(t0, t0, 1) +%st(t0, t2, 0) +%call(&parse_one) +%li(a1, %imm_val(%IMM.NIL)) +%call(&cons) +%la(t0, &sym_unquote) +%ld(t0, t0, 0) +%mov(a1, a0) +%mov(a0, t0) +%tail(&cons) +``` + +`,@` is **not** a separate token: `@` is already in `is_ident_byte`'s +set, so `,@x` parses as `(unquote @x)` and the matcher then treats +`@x` as a (legal but odd) binder identifier. An explicit +`msg_bad_unquote_splice` is optional polish, not load-bearing. + +There is no `quasiquote` evaluator, so `,x` written outside a pmatch +pattern evaluates as `(unquote x)` — an ordinary application of an +unbound `unquote` — and dies through the existing `unbound variable` +path. No new failure mode. + +### New cached symbol slots + +Four labeled slots, interned in `intern_special_forms` alongside the +existing ones (`sym_quote`, `sym_if`, …): + +| Slot | Purpose | +|--------------------|----------------------------------------------------------| +| `sym_pmatch` | eval's pair-branch dispatch | +| `sym_unquote` | pattern-side: detect `(unquote x)` — i.e., a `,x` binder | +| `sym_guard` | clause shape: `((guard g…) …)` vs. plain body | +| `sym_underscore` | wildcard test (the literal symbol `_` after `unquote`) | + +`sym_else` is already cached for `cond`; pmatch reuses it for the +`else` clause. Caching `sym_underscore` (vs. byte-comparing the name) +costs one slot and turns the wildcard test into a single `beq`. + +### Eval-side dispatch + +Two new lines in eval's pair branch — same shape as every other +special form already wired up there: + +``` +%dispatch_form(&sym_pmatch, &::do_pmatch) ;; in the cascade +... +::do_pmatch ;; among the tail handlers +%tail_to_handler(&eval_pmatch) +``` + +The cascade is order-independent (each row is a pointer-compare on a +distinct cached symbol), so position in the list doesn't matter. + +### eval_pmatch(rest=a0, env=a1) → value (a0) + +`rest` is `(subject-expr . clauses)`. Subject is evaluated **once**, +then clauses walk top-to-bottom restarting from the *outer* env each +time. + +``` +Frame: 32 bytes + +0 subject ; eval(car(rest), env) once, then read-only + +8 env_outer ; restart point per clause; passed to pmatch_match + +16 clauses ; cdr(rest); advances per non-match + +24 spill ; clause body / extended env across guard eval +``` + +Per-clause shape (same control flow as `eval_cond`'s clause loop, just +with a richer match step): + +``` +loop: + if clauses == NIL: die msg_pmatch_no_match + clause = car(clauses) + pat = car(clause) + tail = cdr(clause) ; (body...) or ((guard ...) body...) + + if pat eq? sym_else: + a0 = tail; a1 = env_outer; tail-call eval_body + ;; we don't validate that an `else` clause has no (guard ...) head; + ;; per LISP-PMATCH.md grammar that's malformed source, but any + ;; (guard ...) here will just be evaluated as a body form, which + ;; will die through the unbound `guard` path. + + call pmatch_match(pat, subject, env_outer) ; -> (env_ext=a0, ok=a1) + if ok == 0: advance clauses; loop + + ; guard clause? — tail begins with (guard g...) + if pair? tail and pair? car(tail) and car(car(tail)) eq? sym_guard: + guards = cdr(car(tail)) + body = cdr(tail) + for each g in guards: ; AND-fold, short-circuit + if eval(g, env_ext) == FALSE: advance clauses; loop + else: + body = tail + + a0 = body; a1 = env_ext; tail-call eval_body +``` + +`eval_body` already preserves tail position for the body's last form, +so the matched clause's last expression is in tail position of the +surrounding `pmatch` — same as `cond`, `let`, `begin`. + +### pmatch_match(pat=a0, subj=a1, env=a2) → (env'=a0, ok=a1) + +Pure leaf-ish recursive helper. Dispatches on the **pattern**'s tag. +Output convention mirrors `parse_dec`: `a0` carries the extended env +on success, `a1` is the 1/0 ok flag; on failure `a0` is undefined and +the caller must not use it. + +| Pattern shape | Match condition | +|-----------------------|----------------------------------------------------------------| +| `IMM.NIL` | `subj == NIL` | +| `TAG.FIXNUM` | raw word compare `subj == pat` (covers integers and chars) | +| `TAG.IMM` (`#t`/`#f`) | raw word compare | +| `TAG.SYM` | raw word compare (symbols are interned, so this is `eq?`) | +| `TAG.HEAP` (bv) | byte-for-byte equality via `bv_equal_raw` — see below | +| `TAG.PAIR` | binder / wildcard if `car(pat) eq? sym_unquote`; else structural recurse | + +#### Frame + +``` +Frame: 24 bytes + +0 pat ; preserved across cons (binder) and across the car recursion + +8 subj ; same — needed in cons in binder case, and as cdr(subj) after recurse + +16 env ; saved across the cons calls in the binder case +``` + +All three are stored on entry; nothing gets unsaved on the way out +because the function returns through `%eret` after computing a0/a1. + +#### Binder / wildcard + +``` +phead = car(pat) +if phead eq? sym_unquote: + ;; validate shape: pat must be (unquote <sym>), exactly two elements + if tag(cdr(pat)) != PAIR: die msg_bad_unquote_pattern + if tag(cdr(cdr(pat))) != IMM.NIL: die msg_bad_unquote_pattern + pident = car(cdr(pat)) + if tag(pident) != SYM: die msg_bad_unquote_pattern + + if pident eq? sym_underscore: + return (env, ok=1) ;; ,_ — no binding + return (cons(cons(pident, subj), env), ok=1) +``` + +Validation is per-call and cheap — three tag compares before any cons. +A defined runtime error here (rather than UB) is the one carve-out +from the "primitive failure is UB" policy: pattern shape is a syntax +error in the user's source, not a hot-path concern. + +#### Structural pair + +``` +;; phead is anything else: a literal symbol, a nested pattern, NIL, ... +if tag(subj) != PAIR: return (_, ok=0) +(env1, ok) = pmatch_match(car(pat), car(subj), env) +if !ok: return (_, ok=0) +return pmatch_match(cdr(pat), cdr(subj), env1) ;; tail position +``` + +Two LISP-PMATCH.md guarantees fall out of this shape: + +- **Improper-tail patterns** — the cdr-recursion accepts any pattern as + ptail. `,rest` there binds the list tail (proper or improper); `()` + there demands a proper list. +- **Nested patterns** — descending in lockstep handles arbitrary + depth; no special case in the matcher. + +#### Partial-match rollback — none needed + +`env` is passed by register, never mutated. A failed sub-match returns +`ok=0` without having shadowed the caller's env pointer. The next +clause restarts from `env_outer` stashed in the eval_pmatch frame, so +any sub-bindings built during the failed match are simply unreachable +and will be reclaimed when the heap is reset. + +#### Allocation cost + +- Successful binder: one cons for `(ident . subj)` plus one cons to + push it onto env — same as a `let` binding. +- Wildcard: zero allocation. +- Failed match: zero allocation past the failure point (the + recursion bails on the first mismatch). +- No pattern compilation, no decision-tree caching. Patterns are + re-walked per call. Revisit if profiling demands it; until then this + matches the rest of scheme1 (tree-walk eval, no preprocessing). + +#### Bytevector equality — factor out of `equal?` + +The existing `equal?` primitive contains the byte-for-byte loop +already; pmatch_match needs the same logic but on raw `val_t`s, not on +an arg list. Factor a leaf helper: + +``` +;; bv_equal_raw(a=a0, b=a1) -> a0 = 1 if equal else 0 +;; assumes both inputs are HEAP-tagged bytevectors +``` + +Reshape `equal?`'s bv branch to `mov`/`mov` its two unpacked operands +into a0/a1 and `tail` into `bv_equal_raw`. pmatch_match calls the same +helper directly. Both keep their existing public behavior; the +duplicated inline loop disappears. + +### Guards + +Evaluated in the **extended env**, short-circuit AND on `FALSE`. The +inner loop reuses the same compare against `%imm_val(%IMM.FALSE)` that +`eval_and` already uses; no new helper. A rejected guard advances the +clause cursor — it does **not** backtrack against the subject. + +### No-match die path + +`msg_pmatch_no_match` routed through `runtime_error`, same path as +every other die in the interpreter. LISP-PMATCH.md classes this as +UB; in practice we always crash with a clear message rather than +wander. + +### File placement + +Both routines live in the §Special forms section of +`scheme1/scheme1.P1pp`, slotting in alongside `eval_cond` / +`eval_let` / etc.: + +1. `eval_pmatch` — the special-form entry point, called via the + `::do_pmatch` tail handler in eval's pair branch. +2. `pmatch_match` — leaf-ish recursive helper, defined immediately + below `eval_pmatch` (its only caller). +3. `bv_equal_raw` — leaf helper. Lives near the existing `equal?` + primitive in §Primitives, since both call into it. + +`intern_form` rows for the four new symbol slots go into +`intern_special_forms`, in any position. `parse_one`'s `,` branch +goes alongside the existing `'` branch in the reader section. ## define-record-type @@ -720,7 +968,9 @@ No token array. Conses directly onto the heap. Handles the lexical syntax specified in LISP.md: - lists with `.` dotted tail -- `'x` → `(quote x)` (no quasiquote, unquote, unquote-splicing) +- `'x` → `(quote x)` +- `,x` → `(unquote x)` — datum sugar only, consumed by `pmatch`; no + `unquote` evaluator. No `quasiquote`, no `,@` / unquote-splicing. - fixnums: decimal and `#x`-prefixed hex, both with optional sign - `#t`, `#f` - `"..."` strings with `\n \t \r \\ \"` escapes diff --git a/docs/LISP-PMATCH.md b/docs/LISP-PMATCH.md @@ -0,0 +1,259 @@ +# pmatch + +Spec for the `pmatch` special form. Companion to [LISP.md](LISP.md) §Special +forms; implementation sketch in [LISP-C.md](LISP-C.md) §pmatch. + +## Why it's in the language + +The Scheme subset has no macros, no `case-lambda`, no `define-syntax`. The +self-hosted compiler destructures s-expressions on every form it walks +(`(if t a b)`, `(lambda ps . body)`, `(quote x)`, …); doing this by hand +through `car`/`cdr` chains is tedious and error-prone. `pmatch` is the +single concession: one built-in pattern matcher so the compiler reads at +the form level. It is **not** a general macro system — admitting it +specifically rules out admitting `syntax-rules`. + +## Surface syntax + +``` +(pmatch <expr> <clause> ...) + +<clause> ::= ( <pattern> <body> ... ) + | ( <pattern> (guard <g-expr> ...) <body> ... ) + | ( else <body> ... ) +``` + +- `<expr>` is evaluated **once**; its value is the *subject*. +- `<pattern>` is a literal datum read by the standard reader; it is **not** + evaluated. +- `<body>` is one or more expressions; the last is in tail position of the + enclosing `pmatch`. +- `<g-expr>` are evaluated with the pattern's bindings in scope; see + *Guards* below. + +## Patterns + +| Pattern | Matches | +|----------------------|-------------------------------------------------------| +| `()` | the empty list (`null?` of subject) | +| `<integer>` | that integer (`equal?` — same as `=` for fixnums) | +| `<string>` | byte-for-byte equal bytevector (`equal?`) | +| `#\c` | the integer that `#\c` denotes (`equal?`) | +| `#t` / `#f` | that boolean (`eq?`) | +| `<symbol>` | **that exact symbol** (`eq?`) — *not* a binder | +| `,<ident>` | anything; binds `<ident>` to the subject | +| `,_` | anything; introduces no binding (wildcard) | +| `(p1 p2 ... pN)` | proper list of length exactly `N`; each `pi` matches | +| `(p1 ... pN . ptail)`| improper list; prefix matches `p1..pN`, tail matches `ptail` | + +Patterns nest: `(if ,t (op ,a ,b) ,c)` is well-formed. Sub-patterns are +matched recursively against the corresponding sub-objects. + +### Malformed `,`-patterns + +`(unquote)`, `(unquote x y ...)`, and `(unquote <non-symbol>)` (e.g. +`,42`, `,(foo)`) are all **runtime errors**, not undefined behavior. +This is the only carve-out from the *Primitive failure* policy in +LISP.md — pattern shape is a syntax error in the user's source, worth +catching with a clear message. Implementation cost is three tag +compares per binder match; see LISP-C.md §pmatch for the check. + +### Reader interaction + +`pmatch` reuses the standard reader. To make `,<ident>` writeable, the +reader recognizes `,X` as datum sugar for `(unquote X)` — the same shape +R7RS gives it, but **only** as a datum form. There is no `unquote` +evaluator and no `quasiquote`; the form exists purely so that pmatch +patterns can be parsed. Writing `,x` outside a pmatch pattern produces a +list `(unquote x)`, which `eval` will reject as an application of an +unbound variable. + +This is the only reader concession. `,@` (unquote-splicing) is **not** +recognized. + +### Symbols vs. binders + +This is the most common stumble. In a pattern: + +- `foo` matches the literal symbol `foo` (by `eq?`). It does **not** bind. +- `,foo` matches anything and binds `foo` to it. + +So `(if ,t ,a ,b)` matches the head symbol `if` literally and binds `t`, +`a`, `b` to the three subforms. `(,head ,t ,a ,b)` matches *any* 4-element +list and binds `head` as well. + +### Wildcards + +`,_` matches anything and introduces no binding. Multiple `,_` in one +pattern are fine. The identifier `_` is otherwise unreserved — `(define _ +0)` is legal Scheme; `_` outside a `,` is just a literal symbol pattern. + +### Improper-tail patterns + +`(p1 ... pN . ptail)` first matches the prefix exactly like a proper-list +pattern, then matches `ptail` against whatever cdr remains (which may be a +proper list, an improper tail, or `()`). + +`ptail` is a full pattern, not just a binder. Common cases: + +- `ptail = ,rest` — `rest` binds the remaining list (the most common + form; works for both proper and improper tails). +- `ptail = ()` — equivalent to writing the proper-list pattern + `(p1 ... pN)`. +- `ptail = (q1 ... qM . qtail)` — keep destructuring further into the + tail. + +Example: `(lambda ,ps . ,body)` matches any list whose head is the symbol +`lambda` and at least one more element; `ps` gets the second element, +`body` gets the cdr from there. + +## Semantics + +### Clause selection + +Clauses are tried top to bottom. The first clause whose pattern matches +the subject **and** whose guards (if any) all evaluate to a true value is +selected. Its body is evaluated in the extended environment; the value of +the last body expression is the value of the `pmatch`. + +`else` is a clause that always matches with no bindings; it must be the +last clause if present. + +### Guards + +``` +( <pattern> (guard <g-expr> ...) <body> ... ) +``` + +- Guards are evaluated *only if the pattern structurally matched*. +- Each `<g-expr>` is evaluated in left-to-right order with the pattern's + bindings visible. Evaluation short-circuits: the first `<g-expr>` that + yields `#f` aborts the guard. +- All `<g-expr>` truthy ⇒ the clause is selected. Any `<g-expr>` `#f` ⇒ + the clause is **rejected**, bindings are discarded, and matching + continues with the next clause. +- A guard rejection does **not** re-match the pattern against later + positions in a list — the subject is fixed. (Same shape as guards in + Erlang/ML: structural match, then filter.) + +Zero `<g-expr>` is permitted but pointless: `(p (guard) body ...)` is the +same as `(p body ...)`. + +### Bindings + +- Each `,<ident>` in the pattern introduces a fresh binding. Bindings are + visible inside *that clause's* guards and body, and nowhere else. +- Two `,<ident>` with the **same** identifier in one pattern is undefined + — the language does not promise to compare for equality (no + Prolog-style unification) nor to reject the pattern. Don't write it. +- Bindings shadow same-named names in the surrounding lexical scope, just + like `let`. + +### No-match + +If no clause's pattern matches and no `else` is present, behavior is +**undefined** (per the *Primitive failure* policy in LISP.md; expect a +crash, not a defined error). Write an explicit +`(else (error "pmatch: no match" subject))` whenever the matcher is not +provably exhaustive. + +### Evaluation order + +- `<expr>` evaluated once before any pattern is tried. +- Patterns themselves are not evaluated; they are walked structurally + against the subject. +- For matching nested patterns, walk order is unspecified — patterns are + pure datum tests against an already-evaluated subject, so order is not + observable. +- The selected body is in tail position of the `pmatch`. Tail calls in the + body are proper tail calls of the surrounding context. + +### Equality used for literals + +Integer / string / character / boolean patterns match by `equal?`. +Concretely: integers and characters compare via `=` on the underlying +fixnum (characters are u8 fixnums per LISP.md); strings compare +byte-for-byte; booleans compare by `eq?`. Symbol patterns match by `eq?` +(symbols are interned). + +## Examples + +Compiler form dispatch: + +```scheme +(pmatch e + ((quote ,x) (emit-literal x)) + ((if ,t ,a ,b) (emit-if t a b)) + ((lambda ,ps . ,body) (emit-lambda ps body)) + ((set! ,name ,v) (emit-set! name v)) + ((,f . ,args) (emit-call f args)) + (,x (guard (symbol? x)) (emit-var x)) + (,x (guard (integer? x))(emit-int x)) + (else (error "unknown form" e))) +``` + +Three things to notice: + +1. `quote`, `if`, `lambda`, `set!` are literal symbol patterns — they + match those head symbols only. +2. The `(,f . ,args)` clause must come **after** the special-form clauses + because it would otherwise swallow them. +3. The trailing `,x (guard ...)` clauses fire on atoms (subject didn't + match any list pattern); guards then split symbol vs. integer. + +Destructuring a `let` clause: + +```scheme +(pmatch clause + ((,name ,init) (cons name init)) + (else (error "bad let clause" clause))) +``` + +Walking a proper list with a tail binder: + +```scheme +(define (sum xs) + (pmatch xs + (() 0) + ((,h . ,t) (+ h (sum t))))) +``` + +## What pmatch is *not* + +Kept explicit so additions are deliberate. + +| Feature | Status | +|--------------------------------------|------------------------------| +| `,@<ident>` segment / splice patterns | not supported | +| `...` / Kleene-star repetition | not supported | +| `or` / disjunction patterns | not supported | +| `and` / conjunction patterns | not supported | +| Predicate patterns (`(? pred)`) | use `(guard (pred ,x))` | +| As-patterns (`(<pat> :as ,name)`) | not supported | +| Non-linear patterns (repeated names) | undefined behavior | +| User-extensible patterns | not supported | +| Compilation / decision-tree caching | not done; rewalk every call | + +If a use case demands one of these, the workaround is a `guard` plus +auxiliary `,<ident>` bindings, or splitting the match into nested +`pmatch`es. Anything beyond that is out of scope for v1. + +## Implementation contract (for LISP-C.md) + +This section pins what the interpreter must guarantee so callers can rely +on it; the actual P1 implementation lives in LISP-C.md. + +- `pmatch` is a built-in special form, not a primitive procedure. It does + not exist as a value. +- It receives unevaluated clauses and walks them itself. +- The matcher walks pattern and subject recursively. On match, it returns + an environment extended with the new bindings (or fails). The driver + tries clauses top-to-bottom, evaluates guards in the extended env, and + tail-evaluates the body of the first selected clause. +- No pattern compilation; no decision-tree memoization. Patterns are + re-walked per call. (See *Primitive failure* policy: re-add caching + later if profiling shows it's worth it.) +- Bindings are stored in the same flat alist environment used by `let` — + no separate match-env type. +- The body's last expression is reached via the same tail-call path as + `begin` / `let` body, so tail recursion through `pmatch` works.