commit 900ed3f87ef94e1ba1382ae46241fdd6514ea361
parent ced3120781606b6d585f0df7557e18e3ca789d8f
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Sat, 25 Apr 2026 17:04:57 -0700
docs: add LISP-PMATCH spec and concrete pmatch impl sketch
LISP-PMATCH.md pins surface semantics that LISP.md left open: clause
selection order, guard AND-fold + short-circuit, binding scope per
clause, tail position, no-match policy, and runtime-error carve-out
for malformed (unquote ...) patterns.
LISP-C.md §pmatch is rewritten to integrate with the actual
scheme1.P1pp spine: ,X reader sugar as one new parse_one branch
(no reader refactor), four cached symbol slots, %dispatch_form +
::do_pmatch wiring, eval_pmatch frame, pmatch_match recursive
walker with explicit 24-byte frame, bv_equal_raw factored out of
equal?, and file placement.
Diffstat:
| M | docs/LISP-C.md | | | 286 | ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----- |
| A | docs/LISP-PMATCH.md | | | 259 | +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ |
2 files changed, 527 insertions(+), 18 deletions(-)
diff --git a/docs/LISP-C.md b/docs/LISP-C.md
@@ -600,30 +600,278 @@ Implemented directly, each ~30-60 P1 ops:
`else` always true.
- `and` / `or` — short-circuit fold; return the deciding value, not a
coerced boolean.
-- `pmatch` — see below.
+- `pmatch` — `eval_pmatch` clause loop + `pmatch_match` recursive
+ walker. See §pmatch below.
- `define-record-type` — see below.
## pmatch
-The matcher walks pattern and subject recursively; on success it
-returns an env extended with the bindings; on failure it returns a
-sentinel. The driver tries clauses top-to-bottom, skipping clauses
-whose guard evaluates to `#f`. No match and no `else` is undefined
-behavior per spec.
+Surface spec lives in [LISP-PMATCH.md](LISP-PMATCH.md). This section
+covers integration into the scheme1 spine that's already in
+`scheme1/scheme1.P1pp`.
-Patterns:
+### Why built-in (not a library)
-| Pattern | Matches |
-|----------------------|------------------------------------------------|
-| `()` | the empty list |
-| `<literal>` | integer / string / char / bool by `equal?` |
-| `<symbol>` | that exact symbol (not a binder) |
-| `,<ident>` | anything; binds to `<ident>` |
-| `,_` | anything; no binding |
-| `(p1 p2 ...)` | proper list of exactly that length |
-| `(p1 ... . ptail)` | improper list; `ptail` binds the rest |
+Pattern bindings extend the interpreter's env via `cons`. The runtime
+exposes no `(eval expr env)` primitive and no macro system, so a
+user-level pmatch could not introduce bindings visible to its
+clause body. Building it in is the smallest path; the alternative is
+adding either macros or first-class envs, both bigger commitments than
+one special form.
-No compilation or caching — rewalk the patterns each call.
+### Reader integration — no refactor
+
+pmatch operates on **already-parsed s-expressions**. By the time eval
+reaches `(pmatch e (pat body...) ...)`, every pattern is a heap-consed
+datum produced by the existing `parse_one` / `parse_list` recursion.
+The matcher walks those datums; it never re-invokes the reader. So
+nothing in the reader needs restructuring.
+
+The one additive change is a new branch in `parse_one`'s leading-byte
+dispatch: `,` (ASCII 44) → consume the byte, recurse `parse_one` for
+the inner datum, build `(unquote <datum>)`. This is a near-copy of the
+existing `'` branch (`::quote` in parse_one) — same shape, different
+head symbol:
+
+```
+::comma ;; mirrors ::quote
+%la(t2, &readbuf_pos)
+%ld(t0, t2, 0)
+%addi(t0, t0, 1)
+%st(t0, t2, 0)
+%call(&parse_one)
+%li(a1, %imm_val(%IMM.NIL))
+%call(&cons)
+%la(t0, &sym_unquote)
+%ld(t0, t0, 0)
+%mov(a1, a0)
+%mov(a0, t0)
+%tail(&cons)
+```
+
+`,@` is **not** a separate token: `@` is already in `is_ident_byte`'s
+set, so `,@x` parses as `(unquote @x)` and the matcher then treats
+`@x` as a (legal but odd) binder identifier. An explicit
+`msg_bad_unquote_splice` is optional polish, not load-bearing.
+
+There is no `quasiquote` evaluator, so `,x` written outside a pmatch
+pattern evaluates as `(unquote x)` — an ordinary application of an
+unbound `unquote` — and dies through the existing `unbound variable`
+path. No new failure mode.
+
+### New cached symbol slots
+
+Four labeled slots, interned in `intern_special_forms` alongside the
+existing ones (`sym_quote`, `sym_if`, …):
+
+| Slot | Purpose |
+|--------------------|----------------------------------------------------------|
+| `sym_pmatch` | eval's pair-branch dispatch |
+| `sym_unquote` | pattern-side: detect `(unquote x)` — i.e., a `,x` binder |
+| `sym_guard` | clause shape: `((guard g…) …)` vs. plain body |
+| `sym_underscore` | wildcard test (the literal symbol `_` after `unquote`) |
+
+`sym_else` is already cached for `cond`; pmatch reuses it for the
+`else` clause. Caching `sym_underscore` (vs. byte-comparing the name)
+costs one slot and turns the wildcard test into a single `beq`.
+
+### Eval-side dispatch
+
+Two new lines in eval's pair branch — same shape as every other
+special form already wired up there:
+
+```
+%dispatch_form(&sym_pmatch, &::do_pmatch) ;; in the cascade
+...
+::do_pmatch ;; among the tail handlers
+%tail_to_handler(&eval_pmatch)
+```
+
+The cascade is order-independent (each row is a pointer-compare on a
+distinct cached symbol), so position in the list doesn't matter.
+
+### eval_pmatch(rest=a0, env=a1) → value (a0)
+
+`rest` is `(subject-expr . clauses)`. Subject is evaluated **once**,
+then clauses walk top-to-bottom restarting from the *outer* env each
+time.
+
+```
+Frame: 32 bytes
+ +0 subject ; eval(car(rest), env) once, then read-only
+ +8 env_outer ; restart point per clause; passed to pmatch_match
+ +16 clauses ; cdr(rest); advances per non-match
+ +24 spill ; clause body / extended env across guard eval
+```
+
+Per-clause shape (same control flow as `eval_cond`'s clause loop, just
+with a richer match step):
+
+```
+loop:
+ if clauses == NIL: die msg_pmatch_no_match
+ clause = car(clauses)
+ pat = car(clause)
+ tail = cdr(clause) ; (body...) or ((guard ...) body...)
+
+ if pat eq? sym_else:
+ a0 = tail; a1 = env_outer; tail-call eval_body
+ ;; we don't validate that an `else` clause has no (guard ...) head;
+ ;; per LISP-PMATCH.md grammar that's malformed source, but any
+ ;; (guard ...) here will just be evaluated as a body form, which
+ ;; will die through the unbound `guard` path.
+
+ call pmatch_match(pat, subject, env_outer) ; -> (env_ext=a0, ok=a1)
+ if ok == 0: advance clauses; loop
+
+ ; guard clause? — tail begins with (guard g...)
+ if pair? tail and pair? car(tail) and car(car(tail)) eq? sym_guard:
+ guards = cdr(car(tail))
+ body = cdr(tail)
+ for each g in guards: ; AND-fold, short-circuit
+ if eval(g, env_ext) == FALSE: advance clauses; loop
+ else:
+ body = tail
+
+ a0 = body; a1 = env_ext; tail-call eval_body
+```
+
+`eval_body` already preserves tail position for the body's last form,
+so the matched clause's last expression is in tail position of the
+surrounding `pmatch` — same as `cond`, `let`, `begin`.
+
+### pmatch_match(pat=a0, subj=a1, env=a2) → (env'=a0, ok=a1)
+
+Pure leaf-ish recursive helper. Dispatches on the **pattern**'s tag.
+Output convention mirrors `parse_dec`: `a0` carries the extended env
+on success, `a1` is the 1/0 ok flag; on failure `a0` is undefined and
+the caller must not use it.
+
+| Pattern shape | Match condition |
+|-----------------------|----------------------------------------------------------------|
+| `IMM.NIL` | `subj == NIL` |
+| `TAG.FIXNUM` | raw word compare `subj == pat` (covers integers and chars) |
+| `TAG.IMM` (`#t`/`#f`) | raw word compare |
+| `TAG.SYM` | raw word compare (symbols are interned, so this is `eq?`) |
+| `TAG.HEAP` (bv) | byte-for-byte equality via `bv_equal_raw` — see below |
+| `TAG.PAIR` | binder / wildcard if `car(pat) eq? sym_unquote`; else structural recurse |
+
+#### Frame
+
+```
+Frame: 24 bytes
+ +0 pat ; preserved across cons (binder) and across the car recursion
+ +8 subj ; same — needed in cons in binder case, and as cdr(subj) after recurse
+ +16 env ; saved across the cons calls in the binder case
+```
+
+All three are stored on entry; nothing gets unsaved on the way out
+because the function returns through `%eret` after computing a0/a1.
+
+#### Binder / wildcard
+
+```
+phead = car(pat)
+if phead eq? sym_unquote:
+ ;; validate shape: pat must be (unquote <sym>), exactly two elements
+ if tag(cdr(pat)) != PAIR: die msg_bad_unquote_pattern
+ if tag(cdr(cdr(pat))) != IMM.NIL: die msg_bad_unquote_pattern
+ pident = car(cdr(pat))
+ if tag(pident) != SYM: die msg_bad_unquote_pattern
+
+ if pident eq? sym_underscore:
+ return (env, ok=1) ;; ,_ — no binding
+ return (cons(cons(pident, subj), env), ok=1)
+```
+
+Validation is per-call and cheap — three tag compares before any cons.
+A defined runtime error here (rather than UB) is the one carve-out
+from the "primitive failure is UB" policy: pattern shape is a syntax
+error in the user's source, not a hot-path concern.
+
+#### Structural pair
+
+```
+;; phead is anything else: a literal symbol, a nested pattern, NIL, ...
+if tag(subj) != PAIR: return (_, ok=0)
+(env1, ok) = pmatch_match(car(pat), car(subj), env)
+if !ok: return (_, ok=0)
+return pmatch_match(cdr(pat), cdr(subj), env1) ;; tail position
+```
+
+Two LISP-PMATCH.md guarantees fall out of this shape:
+
+- **Improper-tail patterns** — the cdr-recursion accepts any pattern as
+ ptail. `,rest` there binds the list tail (proper or improper); `()`
+ there demands a proper list.
+- **Nested patterns** — descending in lockstep handles arbitrary
+ depth; no special case in the matcher.
+
+#### Partial-match rollback — none needed
+
+`env` is passed by register, never mutated. A failed sub-match returns
+`ok=0` without having shadowed the caller's env pointer. The next
+clause restarts from `env_outer` stashed in the eval_pmatch frame, so
+any sub-bindings built during the failed match are simply unreachable
+and will be reclaimed when the heap is reset.
+
+#### Allocation cost
+
+- Successful binder: one cons for `(ident . subj)` plus one cons to
+ push it onto env — same as a `let` binding.
+- Wildcard: zero allocation.
+- Failed match: zero allocation past the failure point (the
+ recursion bails on the first mismatch).
+- No pattern compilation, no decision-tree caching. Patterns are
+ re-walked per call. Revisit if profiling demands it; until then this
+ matches the rest of scheme1 (tree-walk eval, no preprocessing).
+
+#### Bytevector equality — factor out of `equal?`
+
+The existing `equal?` primitive contains the byte-for-byte loop
+already; pmatch_match needs the same logic but on raw `val_t`s, not on
+an arg list. Factor a leaf helper:
+
+```
+;; bv_equal_raw(a=a0, b=a1) -> a0 = 1 if equal else 0
+;; assumes both inputs are HEAP-tagged bytevectors
+```
+
+Reshape `equal?`'s bv branch to `mov`/`mov` its two unpacked operands
+into a0/a1 and `tail` into `bv_equal_raw`. pmatch_match calls the same
+helper directly. Both keep their existing public behavior; the
+duplicated inline loop disappears.
+
+### Guards
+
+Evaluated in the **extended env**, short-circuit AND on `FALSE`. The
+inner loop reuses the same compare against `%imm_val(%IMM.FALSE)` that
+`eval_and` already uses; no new helper. A rejected guard advances the
+clause cursor — it does **not** backtrack against the subject.
+
+### No-match die path
+
+`msg_pmatch_no_match` routed through `runtime_error`, same path as
+every other die in the interpreter. LISP-PMATCH.md classes this as
+UB; in practice we always crash with a clear message rather than
+wander.
+
+### File placement
+
+Both routines live in the §Special forms section of
+`scheme1/scheme1.P1pp`, slotting in alongside `eval_cond` /
+`eval_let` / etc.:
+
+1. `eval_pmatch` — the special-form entry point, called via the
+ `::do_pmatch` tail handler in eval's pair branch.
+2. `pmatch_match` — leaf-ish recursive helper, defined immediately
+ below `eval_pmatch` (its only caller).
+3. `bv_equal_raw` — leaf helper. Lives near the existing `equal?`
+ primitive in §Primitives, since both call into it.
+
+`intern_form` rows for the four new symbol slots go into
+`intern_special_forms`, in any position. `parse_one`'s `,` branch
+goes alongside the existing `'` branch in the reader section.
## define-record-type
@@ -720,7 +968,9 @@ No token array. Conses directly onto the heap.
Handles the lexical syntax specified in LISP.md:
- lists with `.` dotted tail
-- `'x` → `(quote x)` (no quasiquote, unquote, unquote-splicing)
+- `'x` → `(quote x)`
+- `,x` → `(unquote x)` — datum sugar only, consumed by `pmatch`; no
+ `unquote` evaluator. No `quasiquote`, no `,@` / unquote-splicing.
- fixnums: decimal and `#x`-prefixed hex, both with optional sign
- `#t`, `#f`
- `"..."` strings with `\n \t \r \\ \"` escapes
diff --git a/docs/LISP-PMATCH.md b/docs/LISP-PMATCH.md
@@ -0,0 +1,259 @@
+# pmatch
+
+Spec for the `pmatch` special form. Companion to [LISP.md](LISP.md) §Special
+forms; implementation sketch in [LISP-C.md](LISP-C.md) §pmatch.
+
+## Why it's in the language
+
+The Scheme subset has no macros, no `case-lambda`, no `define-syntax`. The
+self-hosted compiler destructures s-expressions on every form it walks
+(`(if t a b)`, `(lambda ps . body)`, `(quote x)`, …); doing this by hand
+through `car`/`cdr` chains is tedious and error-prone. `pmatch` is the
+single concession: one built-in pattern matcher so the compiler reads at
+the form level. It is **not** a general macro system — admitting it
+specifically rules out admitting `syntax-rules`.
+
+## Surface syntax
+
+```
+(pmatch <expr> <clause> ...)
+
+<clause> ::= ( <pattern> <body> ... )
+ | ( <pattern> (guard <g-expr> ...) <body> ... )
+ | ( else <body> ... )
+```
+
+- `<expr>` is evaluated **once**; its value is the *subject*.
+- `<pattern>` is a literal datum read by the standard reader; it is **not**
+ evaluated.
+- `<body>` is one or more expressions; the last is in tail position of the
+ enclosing `pmatch`.
+- `<g-expr>` are evaluated with the pattern's bindings in scope; see
+ *Guards* below.
+
+## Patterns
+
+| Pattern | Matches |
+|----------------------|-------------------------------------------------------|
+| `()` | the empty list (`null?` of subject) |
+| `<integer>` | that integer (`equal?` — same as `=` for fixnums) |
+| `<string>` | byte-for-byte equal bytevector (`equal?`) |
+| `#\c` | the integer that `#\c` denotes (`equal?`) |
+| `#t` / `#f` | that boolean (`eq?`) |
+| `<symbol>` | **that exact symbol** (`eq?`) — *not* a binder |
+| `,<ident>` | anything; binds `<ident>` to the subject |
+| `,_` | anything; introduces no binding (wildcard) |
+| `(p1 p2 ... pN)` | proper list of length exactly `N`; each `pi` matches |
+| `(p1 ... pN . ptail)`| improper list; prefix matches `p1..pN`, tail matches `ptail` |
+
+Patterns nest: `(if ,t (op ,a ,b) ,c)` is well-formed. Sub-patterns are
+matched recursively against the corresponding sub-objects.
+
+### Malformed `,`-patterns
+
+`(unquote)`, `(unquote x y ...)`, and `(unquote <non-symbol>)` (e.g.
+`,42`, `,(foo)`) are all **runtime errors**, not undefined behavior.
+This is the only carve-out from the *Primitive failure* policy in
+LISP.md — pattern shape is a syntax error in the user's source, worth
+catching with a clear message. Implementation cost is three tag
+compares per binder match; see LISP-C.md §pmatch for the check.
+
+### Reader interaction
+
+`pmatch` reuses the standard reader. To make `,<ident>` writeable, the
+reader recognizes `,X` as datum sugar for `(unquote X)` — the same shape
+R7RS gives it, but **only** as a datum form. There is no `unquote`
+evaluator and no `quasiquote`; the form exists purely so that pmatch
+patterns can be parsed. Writing `,x` outside a pmatch pattern produces a
+list `(unquote x)`, which `eval` will reject as an application of an
+unbound variable.
+
+This is the only reader concession. `,@` (unquote-splicing) is **not**
+recognized.
+
+### Symbols vs. binders
+
+This is the most common stumble. In a pattern:
+
+- `foo` matches the literal symbol `foo` (by `eq?`). It does **not** bind.
+- `,foo` matches anything and binds `foo` to it.
+
+So `(if ,t ,a ,b)` matches the head symbol `if` literally and binds `t`,
+`a`, `b` to the three subforms. `(,head ,t ,a ,b)` matches *any* 4-element
+list and binds `head` as well.
+
+### Wildcards
+
+`,_` matches anything and introduces no binding. Multiple `,_` in one
+pattern are fine. The identifier `_` is otherwise unreserved — `(define _
+0)` is legal Scheme; `_` outside a `,` is just a literal symbol pattern.
+
+### Improper-tail patterns
+
+`(p1 ... pN . ptail)` first matches the prefix exactly like a proper-list
+pattern, then matches `ptail` against whatever cdr remains (which may be a
+proper list, an improper tail, or `()`).
+
+`ptail` is a full pattern, not just a binder. Common cases:
+
+- `ptail = ,rest` — `rest` binds the remaining list (the most common
+ form; works for both proper and improper tails).
+- `ptail = ()` — equivalent to writing the proper-list pattern
+ `(p1 ... pN)`.
+- `ptail = (q1 ... qM . qtail)` — keep destructuring further into the
+ tail.
+
+Example: `(lambda ,ps . ,body)` matches any list whose head is the symbol
+`lambda` and at least one more element; `ps` gets the second element,
+`body` gets the cdr from there.
+
+## Semantics
+
+### Clause selection
+
+Clauses are tried top to bottom. The first clause whose pattern matches
+the subject **and** whose guards (if any) all evaluate to a true value is
+selected. Its body is evaluated in the extended environment; the value of
+the last body expression is the value of the `pmatch`.
+
+`else` is a clause that always matches with no bindings; it must be the
+last clause if present.
+
+### Guards
+
+```
+( <pattern> (guard <g-expr> ...) <body> ... )
+```
+
+- Guards are evaluated *only if the pattern structurally matched*.
+- Each `<g-expr>` is evaluated in left-to-right order with the pattern's
+ bindings visible. Evaluation short-circuits: the first `<g-expr>` that
+ yields `#f` aborts the guard.
+- All `<g-expr>` truthy ⇒ the clause is selected. Any `<g-expr>` `#f` ⇒
+ the clause is **rejected**, bindings are discarded, and matching
+ continues with the next clause.
+- A guard rejection does **not** re-match the pattern against later
+ positions in a list — the subject is fixed. (Same shape as guards in
+ Erlang/ML: structural match, then filter.)
+
+Zero `<g-expr>` is permitted but pointless: `(p (guard) body ...)` is the
+same as `(p body ...)`.
+
+### Bindings
+
+- Each `,<ident>` in the pattern introduces a fresh binding. Bindings are
+ visible inside *that clause's* guards and body, and nowhere else.
+- Two `,<ident>` with the **same** identifier in one pattern is undefined
+ — the language does not promise to compare for equality (no
+ Prolog-style unification) nor to reject the pattern. Don't write it.
+- Bindings shadow same-named names in the surrounding lexical scope, just
+ like `let`.
+
+### No-match
+
+If no clause's pattern matches and no `else` is present, behavior is
+**undefined** (per the *Primitive failure* policy in LISP.md; expect a
+crash, not a defined error). Write an explicit
+`(else (error "pmatch: no match" subject))` whenever the matcher is not
+provably exhaustive.
+
+### Evaluation order
+
+- `<expr>` evaluated once before any pattern is tried.
+- Patterns themselves are not evaluated; they are walked structurally
+ against the subject.
+- For matching nested patterns, walk order is unspecified — patterns are
+ pure datum tests against an already-evaluated subject, so order is not
+ observable.
+- The selected body is in tail position of the `pmatch`. Tail calls in the
+ body are proper tail calls of the surrounding context.
+
+### Equality used for literals
+
+Integer / string / character / boolean patterns match by `equal?`.
+Concretely: integers and characters compare via `=` on the underlying
+fixnum (characters are u8 fixnums per LISP.md); strings compare
+byte-for-byte; booleans compare by `eq?`. Symbol patterns match by `eq?`
+(symbols are interned).
+
+## Examples
+
+Compiler form dispatch:
+
+```scheme
+(pmatch e
+ ((quote ,x) (emit-literal x))
+ ((if ,t ,a ,b) (emit-if t a b))
+ ((lambda ,ps . ,body) (emit-lambda ps body))
+ ((set! ,name ,v) (emit-set! name v))
+ ((,f . ,args) (emit-call f args))
+ (,x (guard (symbol? x)) (emit-var x))
+ (,x (guard (integer? x))(emit-int x))
+ (else (error "unknown form" e)))
+```
+
+Three things to notice:
+
+1. `quote`, `if`, `lambda`, `set!` are literal symbol patterns — they
+ match those head symbols only.
+2. The `(,f . ,args)` clause must come **after** the special-form clauses
+ because it would otherwise swallow them.
+3. The trailing `,x (guard ...)` clauses fire on atoms (subject didn't
+ match any list pattern); guards then split symbol vs. integer.
+
+Destructuring a `let` clause:
+
+```scheme
+(pmatch clause
+ ((,name ,init) (cons name init))
+ (else (error "bad let clause" clause)))
+```
+
+Walking a proper list with a tail binder:
+
+```scheme
+(define (sum xs)
+ (pmatch xs
+ (() 0)
+ ((,h . ,t) (+ h (sum t)))))
+```
+
+## What pmatch is *not*
+
+Kept explicit so additions are deliberate.
+
+| Feature | Status |
+|--------------------------------------|------------------------------|
+| `,@<ident>` segment / splice patterns | not supported |
+| `...` / Kleene-star repetition | not supported |
+| `or` / disjunction patterns | not supported |
+| `and` / conjunction patterns | not supported |
+| Predicate patterns (`(? pred)`) | use `(guard (pred ,x))` |
+| As-patterns (`(<pat> :as ,name)`) | not supported |
+| Non-linear patterns (repeated names) | undefined behavior |
+| User-extensible patterns | not supported |
+| Compilation / decision-tree caching | not done; rewalk every call |
+
+If a use case demands one of these, the workaround is a `guard` plus
+auxiliary `,<ident>` bindings, or splitting the match into nested
+`pmatch`es. Anything beyond that is out of scope for v1.
+
+## Implementation contract (for LISP-C.md)
+
+This section pins what the interpreter must guarantee so callers can rely
+on it; the actual P1 implementation lives in LISP-C.md.
+
+- `pmatch` is a built-in special form, not a primitive procedure. It does
+ not exist as a value.
+- It receives unevaluated clauses and walks them itself.
+- The matcher walks pattern and subject recursively. On match, it returns
+ an environment extended with the new bindings (or fails). The driver
+ tries clauses top-to-bottom, evaluates guards in the extended env, and
+ tail-evaluates the body of the first selected clause.
+- No pattern compilation; no decision-tree memoization. Patterns are
+ re-walked per call. (See *Primitive failure* policy: re-add caching
+ later if profiling shows it's worth it.)
+- Bindings are stored in the same flat alist environment used by `let` —
+ no separate match-env type.
+- The body's last expression is reached via the same tail-call path as
+ `begin` / `let` body, so tail recursion through `pmatch` works.