rm old docs - boot2 - Playing with the bootstrap

commit 687ee15acc1661be78ff6cd8922ad95dd19a2100
parent 79aedea7287b895e3089ff72a8ff80482bda1083
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Thu, 23 Apr 2026 20:15:53 -0700

rm old docs

Diffstat:
D docs/M1M-IMPL.md  | 550 -------------------------------------------------------------------------------
D docs/M1M-P1-PORT.md  | 257 -------------------------------------------------------------------------------
D docs/M1PP-EXT.md  | 656 -------------------------------------------------------------------------------
M docs/P1.md  | 957 ++++++++++++++++++++++++++++++++++++++++---------------------------------------
D docs/P1v2.md  | 531 -------------------------------------------------------------------------------
D docs/PLAN.md  | 210 -------------------------------------------------------------------------------
D docs/SEED.md  | 303 -------------------------------------------------------------------------------

7 files changed, 483 insertions(+), 2981 deletions(-)
diff --git a/docs/M1M-IMPL.md b/docs/M1M-IMPL.md
@@ -1,550 +0,0 @@
-## M1M Implementation Sketch
-
-This note is the implementation-oriented companion to
-`docs/M1M-P1-PORT.md`. It describes a practical structure for the P1
-macro expander.
-
-### Working on the port
-
-**Files**
-
-- `m1pp/m1pp.c` — C oracle. Behaviorally authoritative.
-  Each phase below names the oracle entry points to lift from.
-- `m1pp/m1pp.M1` — port target, on P1v2 as of Phase 1. Runtime
-  shell, lexer, pass-through emit, and structural `%macro` skip
-  land in Phase 1. Phases 2–10 extend it to real macro storage,
-  expansion, paste, the expression evaluator, builtins, and
-  `%select`.
-- `m1pp/build.sh`, `m1pp/test.sh` — build / run / diff a P1v2 .M1
-  into a runnable aarch64 binary. See `docs/M1M-IMPL.md` Phase 0.
-- `tests/m1pp/` — per-phase fixtures. Two shapes, selected by
-  extension:
-  - `<name>.M1` + `<name>.expected` — standalone P1v2 program; built,
-    run with no args, stdout diffed (build-pipeline smoke).
-  - `<name>.M1pp` + `<name>.expected` — expander input; runner builds
-    `m1pp/m1pp.M1` once, runs it as `m1pp <name>.M1pp <out>`, diffs
-    `<out>` (build-dir temp) against `.expected` (parity test).
-  Filenames beginning with `_` are skipped (parked until later phases).
-- `build/p1v2/aarch64/p1_aarch64.M1` — P1v2 DEFINE table, generated
-  from `p1/aarch64.py` + `p1/p1_gen.py`. Regenerate after any
-  backend edit.
-- `docs/P1v2.md` — ISA spec. `docs/M1M-P1-PORT.md` — higher-level
-  port contract.
-
-**Commands**
-
-```sh
-# Build one .M1 source into a binary:
-sh m1pp/build.sh tests/m1pp/01-passthrough.M1 build/m1pp/01-passthrough
-
-# Run the whole suite (regenerates P1v2 defs if the generator changed):
-make test-m1pp
-
-# Run one fixture by name:
-sh m1pp/test.sh 01-passthrough
-
-# Run a built binary manually in the aarch64 container.
-# `localhost/distroless-busybox:latest` is a local tag built by
-# m1pp/build.sh on first run from Containerfile.busybox (distroless-static
-# + the busybox binary from another distroless layer, both digest-pinned).
-podman run --rm --pull=never --platform linux/arm64 \
-    -v "$PWD":/work -w /work \
-    localhost/distroless-busybox:latest \
-    ./build/m1pp/<name> <argv...>
-
-# Regenerate P1v2 DEFINE tables after touching p1/*.py:
-python3 p1/p1_gen.py --arch aarch64 build/p1v2
-
-# Build the C oracle + compare its output to the M1 build:
-cc m1pp/m1pp.c -o build/m1pp/m1pp-oracle
-./build/m1pp/m1pp-oracle <input.M1pp> /tmp/out-c
-./build/m1pp/m1pp <input.M1pp> /tmp/out-m1    # run via podman as above
-diff /tmp/out-c /tmp/out-m1
-
-# Discover undefined P1 tokens without running M0 (catches typos that
-# would otherwise SIGILL silently — build.sh runs this automatically):
-sh lint.sh build/p1v2/aarch64/p1_aarch64.M1 m1pp/m1pp.M1
-```
-
-**P1v2 quick reference for this port**
-
-- Registers: `a0..a3` args + caller-saved, `t0..t2` caller-saved,
-  `s0..s3` callee-saved. `sp` is stack pointer; no raw writes.
-- Frame: `enter SIZE` / `leave`; no implicit `s*` save. Leaf
-  functions may skip frames.
-- Call: `la_br &target` then `call` / `tail` / `b` / `beq` / …
-  (the branch op consumes `br` — load it immediately before).
-- Materialize: `li_aN <8 bytes>` for any one-word integer
-  (`%lo %hi` or `'XXXXXXXXXXXXXXXX'`); `la_aN &label` for label
-  addresses — **no padding needed**, the 32-bit literal-pool
-  prefix zero-extends.
-- Syscall ABI: number in `a0`; args in `a1, a2, a3, t0, s0, s1`;
-  result in `a0`.
-
-### Supported Features
-
-The target expander supports the features required by `p1/*.M1pp`:
-
-- `%macro NAME(a, b)` / `%endm`
-- `%NAME(x, y)` function-like expansion with recursive rescanning
-- `##` token paste
-- `!(expr)` / `@(expr)` / `%(expr)` / `$(expr)`
-- `%select(cond, then, else)`
-- Lisp-shaped integer expressions used by the builtins
-
-### Top-Down Shape
-
-The program should be structured as a small compiler pipeline:
-
-1. Runtime shell
-   Read `argv[1]` into `input_buf`, lex into `source_tokens`, process
-   tokens through the macro engine, write `output_buf` to `argv[2]`.
-   Done in Phase 1.
-
-2. Lexer
-   Keep the current C-compatible tokenizer:
-   `WORD`, `STRING`, `NEWLINE`, `LPAREN`, `RPAREN`, `COMMA`, `PASTE`.
-   All token text lives in `text_buf`; tokens store pointers into that
-   arena.
-
-3. Definition pass during processing
-   Processing is single-pass, not a separate pre-scan. At line start,
-   `%macro` defines a macro and produces no output. Macro definitions
-   become available only after their definition, matching
-   `src/m1macro.c`.
-
-4. Stream-driven expansion
-   The main processor reads from the top stream. Source input is stream 0.
-   Macro expansions and `%select` selections push temporary token
-   streams onto a stack. When a stream is exhausted, it pops and
-   restores the expansion pool mark.
-
-5. Macro call expansion
-   `%NAME(...)` resolves only if `NAME` is already defined and the next
-   token is `(`. Expansion produces temporary tokens in `expand_pool`,
-   applies plain parameter substitution and paste, then pushes the
-   result as a new stream for recursive rescanning.
-
-6. Builtins
-   `%(expr)` and `$(expr)` evaluate integer expressions and emit
-   one generated token directly.
-   `%select(cond, then, else)` evaluates `cond` first, then chooses
-   exactly one of `then` or `else`, copies only that chosen token range
-   into `expand_pool`, and pushes it as a stream. The unchosen branch is
-   not expanded, validated, or expression-evaluated.
-
-7. Errors
-   Coarse fatal paths are sufficient: malformed macro header, wrong arg
-   count, bad paste, bad expression, overflow, unterminated macro/call.
-   Exact C error strings are not required.
-
-### Core Data Structures
-
-Use fixed BSS arenas and simple power-of-two-ish records.
-
-Text spans and token records are kept separate:
-
-```text
-TextSpan:
-+0   start     u64
-+8   len       u64
-
-Token:
-+0   kind      u64
-+8   text      TextSpan
-```
-
-Macro record:
-
-```text
-name        TextSpan
-param_count u64
-params      TextSpan[16]
-body_start  Token*
-body_end    Token*
-```
-
-Stream record:
-
-```text
-toks_start   Token*
-toks_end     Token*, exclusive
-pos          Token*
-line_start   bool
-pool_mark    stack mark, -1 for source-owned streams
-```
-
-Expression frame:
-
-```text
-op_code  enum
-argc     u64
-args     i64[16]
-```
-
-Global arenas:
-
-```text
-input_buf
-output_buf
-text_buf
-source_tokens
-macro_body_tokens
-macros
-expand_pool
-streams
-arg_starts[16]
-arg_ends[16]
-expr_frames
-```
-
-Token range boundaries should be stored as token pointers rather than
-indices. That keeps stream and argument walking simple in P1: advance by
-one token record, compare pointers, no repeated `base + index * 24`.
-
-Source token spans point into `input_buf`. `text_buf` is reserved for
-synthesized token text such as `##` pastes and `!@%$` output.
-
-### Bottom-Up Helper Layers
-
-#### Layer 0: raw memory/text helpers
-
-```text
-append_text(src_ptr, len) -> text_ptr
-append_text_cstr(const_ptr, len) -> text_ptr
-copy_bytes(dst, src, len)
-```
-
-#### Layer 1: token helpers
-
-```text
-push_source_token(kind, text)
-push_macro_body_token(token_ptr)
-push_pool_token(token_ptr)
-copy_token(dst_ptr, src_ptr)
-tok_eq_const(tok, const_ptr, len) -> bool
-span_eq_token(span, tok) -> bool
-```
-
-#### Layer 2: stream helpers
-
-```text
-push_stream(toks_ptr, count, pool_mark)
-pop_stream()
-current_stream() -> stream_ptr
-stream_peek(stream) -> token_ptr
-stream_advance(stream)
-```
-
-#### Layer 3: macro table helpers
-
-```text
-find_macro(call_tok) -> macro_ptr or 0
-find_param(macro_ptr, body_tok) -> param_index+1 or 0
-define_macro(stream_ptr)
-```
-
-No `find_prefixed_param` or local-rewrite helper is needed for this
-feature set.
-
-#### Layer 4: argument parser
-
-```text
-parse_args(stream_ptr, lparen_tok_ptr)
-```
-
-Outputs:
-
-```text
-arg_starts[i] = first token ptr
-arg_ends[i]   = exclusive token ptr
-arg_count
-call_end_pos  = token ptr after closing RPAREN
-```
-
-It tracks nested parentheses with a depth counter. Commas split only at
-depth 1.
-
-#### Layer 5: macro body expander
-
-```text
-expand_macro_at(stream_ptr, call_tok, macro_ptr)
-```
-
-Algorithm:
-
-1. Parse call args.
-2. Validate arg count.
-3. Save `mark = pool_used`.
-4. Walk macro body tokens.
-5. If body token is a param, copy arg tokens into the pool.
-6. Otherwise copy the body token as-is.
-7. Run paste compaction over `[mark, pool_used)`.
-8. Push an expansion stream if non-empty; otherwise restore pool mark.
-
-#### Layer 6: paste pass
-
-```text
-paste_range(start_ptr, end_ptr) -> new_count
-```
-
-This is an in-place compactor over `expand_pool`.
-
-Rules:
-
-```text
-## cannot be first or last
-left/right operands cannot be NEWLINE or PASTE
-pasted result is TOK_WORD
-if a substituted parameter participates in ##, its argument must be exactly one token
-```
-
-#### Layer 7: expression evaluator
-
-Do not implement expression evaluation as recursive P1 calls. Use an
-explicit expression frame stack. That avoids fragile recursion and makes
-macro-in-expression expansion controllable.
-
-Expression evaluator API:
-
-```text
-eval_expr_range(start_tok_ptr, end_tok_ptr) -> r0 value
-```
-
-Internal state:
-
-```text
-expr_pos
-expr_end
-expr_frame_top
-expr_done
-expr_result
-```
-
-Loop model:
-
-1. Skip expression newlines.
-2. If token is `(`:
-   Read next token as operator.
-   Convert operator token to `op_code`.
-   Push an expression frame with `argc = 0`, `accum = 0`.
-   Advance past the operator.
-3. If token is `)`:
-   Finalize the top frame based on `op_code` and `argc`.
-   Pop the frame.
-   Feed the produced value into the parent frame, or finish if there is
-   no parent.
-4. If token is an atom:
-   If token is a macro call, expand it to the pool, then evaluate that
-   expansion as a nested expression range.
-   Otherwise parse the integer atom.
-   Feed the value into the parent frame, or finish if there is no
-   parent.
-
-Operators:
-
-```text
-+   variadic, argc >= 1
--   unary neg or binary/variadic subtract, argc >= 1
-*   variadic, argc >= 1
-/   binary, div-by-zero check
-%   binary, div-by-zero check
-<<  binary
->>  binary arithmetic shift
-&   variadic, argc >= 1
-|   variadic, argc >= 1
-^   variadic, argc >= 1
-~   unary
-=   binary
-==  binary alias
-!=  binary
-<   binary
-<=  binary
->   binary
->=  binary
-```
-
-Keeping the full current operator set is cheap and avoids pointless
-divergence from the C oracle.
-
-For macro-in-expression, the clean composition is:
-
-```text
-eval atom sees %NAME followed by LPAREN
-expand_macro_at into pool without pushing a stream
-temporarily evaluate [mark, mark + expanded_count)
-require exactly one expression result and no extra tokens
-restore pool mark
-advance outer expr_pos to call_end_pos
-```
-
-That gives the C behavior without mixing expression parsing with the
-main output stream.
-
-#### Layer 8: builtins
-
-```text
-expand_builtin_call(stream_ptr, builtin_tok)
-```
-
-`!@%$`
-
-```text
-parse args
-require one arg
-value = eval_expr_range(arg_start, arg_end)
-emit_hex_value(value, 1 2 4 or 8)
-advance stream pos to call_end_pos
-line_start = 0
-```
-
-`%select`:
-
-```text
-parse args
-require three args
-value = eval_expr_range(arg0_start, arg0_end)
-chosen = arg1 if value != 0 else arg2
-copy chosen tokens to expand_pool
-advance stream pos to call_end_pos
-push chosen stream
-line_start = 0
-```
-
-Only `cond` is evaluated eagerly. The selected branch is rescanned as a
-normal token stream; the unselected branch is ignored completely.
-
-#### Layer 9: main processor
-
-```text
-process_tokens:
-    push_stream(source_tokens, source_count, -1)
-
-    while stream_top > 0:
-        s = current_stream()
-        if s.pos == s.toks_end:
-            pop_stream()
-            continue
-
-        tok = *s.pos
-
-        if s.line_start && tok == "%macro":
-            define_macro(s)
-            continue
-
-        if tok.kind == NEWLINE:
-            emit_newline()
-            s.pos += 24     # one Token record
-            s.line_start = 1
-            continue
-
-        if tok is builtin call:
-            expand_builtin_call(s, tok)
-            continue
-
-        if tok is defined macro call:
-            expand_call(s, macro)
-            continue
-
-        emit_token(tok)
-        s.pos += 24
-        s.line_start = 0
-```
-
-### Implementation Slices
-
-The port is broken into phases. Each phase ends with a dedicated test
-under `tests/m1pp/` and a parity check (where applicable) against the C
-oracle in `m1pp/m1pp.c`. The target ISA is **P1v2** (registers
-`a0..a3`, `t0..t2`, `s0..s3`; `enter`/`leave`; `la_br`); the DEFINE
-table is `build/p1v2/aarch64/p1_aarch64.M1`. Aarch64 is the staging
-arch (matches the macOS host so podman runs natively).
-
-Each phase below lists the oracle entry points in `m1pp/m1pp.c` that
-the M1 port lifts for that slice. Line numbers are hints — track by
-symbol name.
-
-- [x] **Phase 0 — Build/run/diff infra under `m1pp/`.**
-  `m1pp/build.sh <source.M1> <out>` lints against the P1v2 DEFINE
-  table, prunes unused DEFINEs, runs M0 + hex2-0 with the aarch64
-  ELF header inside the distroless-busybox container, and deposits
-  a runnable binary. `m1pp/test.sh` walks fixtures in `tests/m1pp/`
-  and picks mode by extension: `.M1` fixtures are built and run stand-alone;
-  `.M1pp` fixtures are fed to a one-time build of `m1pp/m1pp.M1` as
-  input, and the produced output file is diffed. Wired into
-  `make test-m1pp`. Phase 0 fixture: `tests/m1pp/00-hello.M1` — a
-  P1v2 hello-world that proves the pipeline without depending on
-  `m1pp/m1pp.M1`'s current state.
-
-- [x] **Phase 1 — Port lexer + pass-through skeleton to P1v2.**
-  Rewrite `_start`, read/write, `lex_source`, `emit_token`,
-  `emit_newline`, `process_tokens`, and the structural %macro skip
-  in P1v2 conventions (`a*`/`t*`/`s*` registers, `enter SIZE` /
-  `leave`, `la_br &label`). Verify byte-for-byte parity against the
-  C oracle on a definition-only fixture (tokenizer pass-through).
-  Oracle entry points: `main`, `lex_source`, `emit_token`,
-  `emit_newline`, `process_tokens` (pass-through branches only),
-  plus `append_text_len`, `push_token`, `token_text_eq`,
-  `span_eq_token`.
-
-- [x] **Phase 2 — Macro definition storage.**
-  Replaced structural skipping with real storage: `define_macro`
-  parses the header (name, params with comma splits, trailing
-  newline) and copies body tokens into `macro_body_tokens[]` until
-  a line-start `%endm`. Records land in a 32-slot `macros[]` arena
-  (296 B/record). Macros are not yet called — defs-only input
-  matches the oracle. `find_macro` / `find_param` deferred to the
-  phases that exercise them (Phase 5).
-  Oracle: `define_macro`, `find_macro`, `find_param`.
-
-- [ ] **Phase 3 — Stream stack + expansion-pool lifetime.**
-  Stream stack push/pop for recursive rescanning; expansion-pool
-  mark/restore on stream pop. No semantic change until Phase 4
-  wires macro calls in, but isolates the lifecycle plumbing.
-  Oracle: `push_stream_span`, `current_stream`, `pop_stream`,
-  `copy_span_to_pool`, `push_pool_stream_from_mark`.
-
-- [ ] **Phase 4 — Argument parsing.**
-  Nested-paren depth tracking, comma split at depth 1, argument-
-  count validation, `call_end_pos` output.
-  Oracle: `parse_args`.
-
-- [ ] **Phase 5 — Plain parameter substitution.**
-  Walk macro body; substitute params via the expand pool; push
-  resulting slice as a stream. Enforces single-token-arg rule for
-  parameters adjacent to `##` (still no actual paste yet).
-  Oracle: `expand_macro_tokens` (parameter loop),
-  `copy_arg_tokens_to_pool`, `copy_paste_arg_to_pool`,
-  `expand_call`.
-
-- [ ] **Phase 6 — `##` token paste compaction.**
-  In-place compactor over the expand pool. Rejects misplaced or
-  malformed paste sites.
-  Oracle: `paste_pool_range`, `append_pasted_token`.
-
-- [ ] **Phase 7 — Integer atoms + S-expression evaluator.**
-  Integer-token parsing; explicit expression-frame stack; all
-  operators from the oracle; macro-in-expression composition (the
-  required path for `p1/P1-aarch64.M1pp`).
-  Oracle: `parse_int_token`, `expr_op_code`, `apply_expr_op`,
-  `eval_expr_atom`, `eval_expr_range`, `skip_expr_newlines`.
-
-- [ ] **Phase 8 — `!@%$(expr)` builtins.**
-  One-arg builtins on top of the evaluator; emit LE 1/2/4/8-byte
-  hex tokens.
-  Oracle: `expand_builtin_call` (the `!@%$` cases), `emit_hex_value`.
-
-- [ ] **Phase 9 — `%select(cond, then, else)`.**
-  Eager `cond` eval; copy chosen branch to expand pool, push as
-  stream; never evaluate the unchosen branch.
-  Oracle: `expand_builtin_call` (the `%select` case).
-
-- [ ] **Phase 10 — Full-parity + malformed-input smoke tests.**
-  Run `tests/m1pp/_full-parity.M1pp` against the M1 implementation
-  (unpark by dropping the `_` prefix);
-  add malformed fixtures (unterminated macro, wrong arg count, bad
-  paste, bad expression, bad builtin arity) requiring non-zero
-  exit. Then run combined `p1/P1-aarch64.M1pp + p1/P1.M1pp` through
-  the M1 expander and diff against the Python-generated
-  `build/p1v2/aarch64/p1_aarch64.M1`. Finally use the produced
-  frontend on a small P1 program through the normal toolchain.
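The Layer 4 argument parser in the removed `docs/M1M-IMPL.md` above tracks nested parentheses with a depth counter and splits on commas only at depth 1. The following is a minimal sketch of that rule in C, not the oracle's code. It assumes a simplified model where every character is one token. `ParsedArgs` and `parse_args_sketch` are hypothetical names, and treating `()` as zero arguments is an assumption the doc does not confirm.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ARGS 16  /* mirrors arg_starts[16] / arg_ends[16] in the doc */

/* Hypothetical result record; one char stands in for one token. */
typedef struct {
    const char *arg_starts[MAX_ARGS];  /* first token of each arg */
    const char *arg_ends[MAX_ARGS];    /* exclusive end of each arg */
    int arg_count;
    const char *call_end_pos;          /* just past the closing ')' */
} ParsedArgs;

/* Record one argument span; returns -1 when the fixed table is full. */
static int push_arg(ParsedArgs *out, const char *start, const char *end)
{
    if (out->arg_count == MAX_ARGS)
        return -1;
    out->arg_starts[out->arg_count] = start;
    out->arg_ends[out->arg_count] = end;
    out->arg_count++;
    return 0;
}

/* lparen points at '('. Returns 0 on success, -1 on overflow or an
 * unterminated call. */
static int parse_args_sketch(const char *lparen, ParsedArgs *out)
{
    const char *arg_start = lparen + 1;
    const char *p;
    int depth = 1;

    out->arg_count = 0;
    out->call_end_pos = NULL;
    for (p = lparen + 1; *p != '\0'; p++) {
        if (*p == '(') {
            depth++;
        } else if (*p == ')') {
            depth--;
            if (depth == 0) {
                /* Assumption: "()" means zero args, not one empty arg. */
                if (!(out->arg_count == 0 && arg_start == p) &&
                    push_arg(out, arg_start, p) != 0)
                    return -1;
                out->call_end_pos = p + 1;
                return 0;
            }
        } else if (*p == ',' && depth == 1) {
            /* Commas split arguments only at depth 1. */
            if (push_arg(out, arg_start, p) != 0)
                return -1;
            arg_start = p + 1;
        }
    }
    return -1;  /* ran out of input before the closing ')' */
}
```

Given the input `(a,(b,c),d)`, the sketch reports three arguments, and the middle one keeps its inner comma because that comma sits at depth 2.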
diff --git a/docs/M1M-P1-PORT.md b/docs/M1M-P1-PORT.md
@@ -1,257 +0,0 @@
-# m1macro to P1 Port Plan
-
-## Goal
-
-Replace `src/m1macro.c` with a real P1 implementation in `src/m1m.M1`.
-`src/m1m.M1` must be pure portable P1 source. The final `m1m` binary must
-expand M1M input without shelling out to awk, C, Python, libc, or any host
-macro processor.
-
-Contract:
-
-```
-m1m input.M1 output.M1
-```
-
-Behavior should match `src/m1macro.c` byte-for-byte for valid inputs in the
-current M1M feature set, except where an implementation limit is explicitly
-documented.
-
-Architecture-specific code is not allowed in `src/m1m.M1`. The only
-architecture-specific layer is the generated P1 `DEFINE` file that `catm`
-prepends before `src/m1m.M1` during assembly. If the port needs additional
-P1 op/register/immediate combinations, add them to the generator and
-regenerate the arch-specific define tables.
-
-## Scope
-
-Implement the current M1M feature set needed by `p1/*.M1M` to define
-instruction encodings:
-
-- `%macro NAME(a, b)` / `%endm`
-- `%NAME(x, y)` function-like expansion with recursive rescanning
-- `##` token paste
-- `!(expr)` / `@(expr)` / `%(expr)` / `$(expr)`
-- `%select(cond, then, else)`
-- Lisp-shaped integer expressions used by the builtins
-
-Not supported: per-expansion locals (`@local`, `:@local`, `&@local`),
-prefixed parameter substitution (`:param`/`&param`), duplicate macro
-diagnostics, and byte-identical malformed-input diagnostics. Avoid duplicate
-macro names; the feature set does not promise a particular
-duplicate-definition behavior.
-
-Preserve the C tokenizer model: whitespace is normalized, strings are single
-tokens, `#` and `;` comments are skipped, and output is emitted as tokens plus
-newlines rather than preserving original formatting.
-
-## Static Data Model
-
-Use fixed BSS arenas, mirroring the C implementation:
-
-- Input buffer: raw file contents plus NUL sentinel.
-- Output buffer: emitted text.
-- Text buffer: copied token text and generated text.
-- Source token array: token records for the original input.
-- Macro table: name, params, and body token records.
-- Expansion pool: temporary tokens produced by macro calls and `%select`.
-- Stream stack: active token streams for recursive rescanning.
-
-Token record layout should be compact and uniform:
-
-```
-kind      token kind
-text      source span in `input_buf` or synthesized span in `text_buf`
-```
-
-Macro records should store name/parameter text spans plus a body token
-range, not inline strings. Prefer record shapes that stay uniform across the
-codebase so the address math remains easy to audit in P1.
-
-## Implementation Milestones
-
-## Incremental TODO
-
-Use this checklist to finish the port in reviewable slices. Each checked item
-should build `m1m` and include at least one C-oracle comparison where
-applicable.
-
-- [x] Land the portable P1 runtime shell: argv validation, input open/read,
-  output open/write, fatal-error reporting, and no external expander path.
-- [x] Add the first fixed BSS arenas for input, output, text, source tokens,
-  and runtime counters.
-- [x] Add initial text/token/output helpers: append copied token text, push
-  source token records, compare token text with constants, emit tokens, and
-  emit newlines.
-- [x] Port the C tokenizer model for source input: whitespace skipping,
-  string tokens, `##`, comments, delimiters, word tokens, newline tokens, and
-  text-buffer copies.
-- [x] Add a first processor skeleton that normalizes pass-through output and
-  structurally skips line-start `%macro` ... `%endm` definitions.
-- [x] Extend generated P1 support for the current `m1m.M1` needs: broader
-  `ADDI` immediates, token/record memory offsets, and full RRR register
-  tuples. The Makefile still prunes unused DEFINEs before assembly.
-- [x] Verify the current slice: `make PROG=m1m ARCH=aarch64 build/aarch64/m1m`,
-  byte-identical C-oracle output for definition-only library inputs
-  `p1/aarch64.M1M` and `p1/P1.M1M` (these currently only prove structural
-  `%macro` skipping, not macro-call expansion), and byte-identical tokenizer
-  pass-through fixture output.
-- [x] Add `tests/m1m/full-parity.M1M` and its C-oracle expected output as the
-  real expansion parity target. This fixture intentionally uses macro calls,
-  recursive rescanning, paste, `!@%$(` and `%select`; it is expected
-  to fail under the partial P1 implementation until the remaining unchecked
-  expansion tasks land.
-- [ ] Replace structural `%macro` skipping with real macro table storage:
-  parse headers, parameters, body tokens, body limits, and line-start `%endm`
-  recognition.
-- [ ] Add stream stack push/pop for recursive rescanning and expansion-pool
-  lifetime management.
-- [ ] Port macro call argument parsing, including nested parentheses and
-  argument-count validation.
-- [ ] Port plain parameter substitution, including the single-token argument
-  requirement when a parameter participates in `##`.
-- [ ] Port `##` token paste, including bad operand and misplaced paste
-  failures.
-- [ ] Port integer atom parsing and S-expression evaluation for arithmetic,
-  comparisons, shifts, and bitwise operators.
-- [ ] Implement `!@%$(expr)` on top of expression
-  evaluation and token emission.
-- [ ] Implement `%select(cond, then, else)` on top of expression evaluation
-  and stream pushback.
-- [ ] Add malformed-input smoke tests: unterminated macro, wrong arg count,
-  bad paste, bad expression, and bad builtin arity. These only need non-zero
-  exit, not exact diagnostic text.
-- [ ] Use the P1 `m1m` binary to expand a representative M1M frontend and
-  assemble a small program through the normal stage0 toolchain.
-- [ ] Revisit static limits and error strings so every documented arena limit
-  has a clear fatal path.
-- [ ] Re-run all acceptance tests and update this plan with any explicitly
-  documented implementation limits.
-
-1. **Runtime shell**
-
-   Keep the existing P1 argv, open/read, write, and fatal-error paths. Remove
-   any external backend or `execve` shortcut.
-
-2. **Text and token primitives**
-
-   Add helpers for `append_text_len`, `push_token`, token equality,
-   span equality, output token emission, and output newline emission.
-   Keep error handling simple: set an error message pointer and branch to
-   `fatal`.
-
-3. **Lexer**
-
-   Port `lex_source` directly. It should fill `source_tokens` from
-   `m1m_input_buf`, copying all token text into `text_buf`.
-
-4. **Stream processor skeleton**
-
-   Implement push/pop stream and the main `process_tokens` loop. Initially
-   support pass-through tokens and `%macro` skipping, then expand toward full
-   behavior.
-
-5. **Macro definitions**
-
-   Port `define_macro`: parse header, params, body tokens, body limit checks,
-   and line-start `%endm` recognition.
-
-6. **Macro call expansion**
-
-   Port `parse_args`, plain parameter substitution, token paste, and
-   expansion-stream pushback.
-
-7. **Expression evaluator**
-
-   Port integer atom parsing and S-expression evaluation. Implement arithmetic,
-   comparisons, shifts, and bitwise ops over 64-bit signed values as far as P1
-   can represent them. Document any temporary 32-bit limitation if unavoidable,
-   but the target is C-compatible 64-bit behavior.
-
-8. **Builtins**
-
-   Implement `!@%$(` and `%select` on top of the expression evaluator
-   and stream pushback.
-
-9. **Cleanup and limits**
-
-   Replace generic “not implemented” errors with coarse but useful failures
-   for buffer overflow, malformed macro headers, arg-count mismatch, bad
-   expressions, and bad paste operands. Exact C diagnostic parity is not a
-   goal.
-
-## Portability Rule
-
-`src/m1m.M1` must use only P1 tokens plus labels/data. Do not hand-code
-aarch64, amd64, or riscv64 instructions in this file. Do not introduce
-per-arch branches, per-arch data layouts, or per-arch syscall sequences in the
-implementation.
-
-Allowed architecture-specific work:
-
-- Extend `src/p1_gen.py` when `m1m.M1` needs a P1 operation tuple that is not
-  currently generated.
-- Regenerate `build/<arch>/p1_<arch>.M1`.
-- Keep the existing build shape where the arch-specific define file is
-  prepended with `catm` before the portable P1 source.
-
-All algorithmic behavior, buffer layout, parsing, expansion, expression
-evaluation, and error handling belongs in portable P1.
-
-## P1 Support Needed
-
-The current build may stage `PROG=m1m` on aarch64 first, but the source must
-remain portable P1 from the start. Staging on one arch is a build milestone,
-not permission to add arch-specific source.
-
-Likely generator/table updates:
-
-- More `ADDI` immediates for record-size and arena-limit arithmetic.
-- More `LD/ST/LB/SB` offsets for token, macro, and stream record fields.
-- Additional RRR register triples used by parser loops and address math.
-- Possibly a small set of helpers/macros for 32-byte record addressing.
-
-Do not hide core behavior behind host tools. If a P1 operation is missing,
-extend the generated P1 definitions or rewrite the algorithm in available P1.
-
-## Acceptance Tests
-
-Use `src/m1macro.c` as the oracle during development.
-
-Minimum checks:
-
-1. Build `m1m`:
-
-   ```
-   make PROG=m1m ARCH=aarch64 build/aarch64/m1m
-   ```
-
-2. Compare representative inputs against the C implementation:
-
-   ```
-   src/m1macro.c oracle: p1/aarch64.M1M
-   src/m1macro.c oracle: p1/P1.M1M
-   custom fixture: paste, recursive rescanning, !@%$(, %select
-   malformed fixtures: bad paste, wrong arg count, bad expression
-   ```
-
-3. Require byte-identical output for valid fixtures.
-
-4. Require non-zero exit for invalid fixtures.
-
-5. Once stable, use `m1m` to expand the P1 M1M front-end and assemble a small
-   program through the normal stage0 toolchain.
-
-## Non-Goals
-
-- No dependency on awk, shell scripts, Python, libc, or the host C compiler at
-  runtime.
-- No new macro language features.
-- No formatting preservation beyond the current C expander behavior.
-- No recursive macro cycle detection unless added after parity.
-
-## Done Definition
-
-`src/m1m.M1` contains the expander core, the generated `m1m` binary runs in the
-target Alpine container, and all acceptance tests match `src/m1macro.c` without
-executing any external macro-expansion program.
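The builtins milestone in the removed port plan above layers `!@%$(expr)` on expression evaluation and token emission, and the companion implementation sketch says these emit little-endian 1/2/4/8-byte hex tokens. The following is a minimal sketch of that emit step in C. It rests on several assumptions: `emit_hex_value_sketch` is a hypothetical name, the output is assumed to be an uppercase hex run wrapped in single quotes (the `'XXXXXXXXXXXXXXXX'` shape from the P1v2 quick reference), and the caller is assumed to have already mapped the builtin to its byte width.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write `value` as a quoted hex token of `nbytes` bytes, low byte first.
 * Returns the number of characters written (excluding the NUL), or -1 on
 * a bad width or a too-small buffer.
 * Assumption: quoting and uppercase digits are illustrative, not taken
 * from the oracle. */
static int emit_hex_value_sketch(uint64_t value, int nbytes,
                                 char *out, size_t cap)
{
    static const char digits[] = "0123456789ABCDEF";
    size_t n = 0;
    int i;

    if (nbytes != 1 && nbytes != 2 && nbytes != 4 && nbytes != 8)
        return -1;
    /* two quotes + two hex digits per byte + terminating NUL */
    if (cap < (size_t)nbytes * 2 + 3)
        return -1;

    out[n++] = '\'';
    for (i = 0; i < nbytes; i++) {
        /* Little-endian: byte 0 is the least significant byte. */
        unsigned byte = (unsigned)((value >> (8 * i)) & 0xFF);
        out[n++] = digits[byte >> 4];
        out[n++] = digits[byte & 0xF];
    }
    out[n++] = '\'';
    out[n] = '\0';
    return (int)n;
}
```

Passing the value as `uint64_t` keeps the shifts well defined for negative results: the caller casts the evaluator's signed 64-bit value, and truncation to the requested width falls out of the per-byte mask.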
diff --git a/docs/M1PP-EXT.md b/docs/M1PP-EXT.md
@@ -1,656 +0,0 @@
-# M1PP extensions for the seed Scheme interpreter
-
-Three independent additions to `m1pp/m1pp.c`, ordered by sequencing.
-
-Motivation: when writing the seed Lisp interpreter portably across three
-arches, most pain in `lisp/lisp.M1` traces to two things — hand-named
-scratch labels that collide when a pattern is reused, and argument
-substitution that can't carry instruction bodies (commas break the
-parser). `strlen` is the smaller third item: it removes a class of
-hand-counted-length bugs in error messages and string literals.
-
-## 1. Local labels
-
-### Syntax
-
-Two prefixed word forms, recognized only when they appear in a macro
-body as body-native tokens:
-
-- `:@name` — label definition, scoped to the current expansion
-- `&@name` — address-of reference, scoped to the current expansion
-
-### Semantics
-
-Each `%NAME(...)` invocation allocates a fresh expansion id `NN` from a
-global monotonic counter. While copying body-native tokens into the pool,
-any TOK_WORD whose text starts with `:@` or `&@` (and has ≥1 char after
-the `@`) is rewritten to the corresponding non-`@` form with `__NN`
-suffixed: `:@end` → `:end__42`, `&@end` → `&end__42`.
-
-**Scoping.** Rename body-native tokens only. Argument-substituted tokens
-pass through unchanged — they were already renamed under the caller's
-`NN` if the caller was itself a macro body. This gives lexical label
-scoping: nested and stacked macros each see their own labels, collisions
-are impossible.
-
-**Interaction with `##`.** None. The rename happens before the paste
-pass; a body `:@end##_lbl` renames `:@end` first, then pastes. Edge
-cases here should error out (pasting onto a renamed label is almost
-certainly a bug); leave it unconstrained for v1 and revisit if it
-bites.
-
-### Tokenizer
-
-No changes. `:@foo` / `&@foo` already tokenize as single TOK_WORD under
-the current word-terminator set (`m1pp.c:310`). The existing `@(...)`
-builtin dispatch keys on token text being exactly `@` followed by
-LPAREN, so `@foo` words do not collide.
-
-### m1pp.c touchpoints
-
-- One new static `int next_expansion_id` (monotonic, never reset).
-- `expand_macro_tokens` (`m1pp.c:670`): allocate `NN = ++next_expansion_id`
-  before the body-walk. Inside the body-copy loop, when about to push a
-  body-native TOK_WORD whose text starts with `:@` or `&@`:
-  - build the renamed text directly by appending bytes into `text_buf`:
-    copy the original token bytes (sigil + tail), append `__`, then
-    append the decimal digits of `NN`
-  - push a TOK_WORD pointing at the new text span
-
-Avoid `snprintf`. The m1m port (`docs/M1M-P1-PORT.md`) will reimplement
-every new m1pp feature in P1 assembly; varargs format parsing is a
-non-trivial thing to port. Plain byte appends plus a hand-rolled
-integer → decimal emit (the `display_uint` reverse-fill pattern already
-in `lisp/lisp.M1:2983`) port cleanly.
-
-Concretely in C: reserve a small stack scratch (say 16 bytes), fill
-digits right-to-left via repeated `%10` / `/=10`, then `append_text_len`
-the sigil bytes, the tail bytes, `"__"`, and the digit run. A 32-bit
-counter fits in 10 decimal digits; collision across a file is a
-non-concern because the counter is file-global and monotonic.
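-
-As a worked trace of the reverse-fill (the value 307 and the 16-byte
-scratch size are illustrative only):
-
-```
-NN = 307, scratch[0..15], filled from the right
-  307 % 10 = 7  ->  scratch[15] = '7',  NN = 30
-   30 % 10 = 0  ->  scratch[14] = '0',  NN = 3
-    3 % 10 = 3  ->  scratch[13] = '3',  NN = 0, stop
-digit run = scratch[13..15] = "307" (3 bytes)
-```
-
-The largest 32-bit value, 4294967295, needs 10 of the 16 scratch bytes.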
-
-No struct changes. No lexer changes. No new global syntax.
-
-## 2. Braced block arguments
-
-### Syntax
-
-Curly braces `{` and `}` group tokens into a single macro argument,
-protecting commas inside the group from the comma-splits-args rule.
-
-```
-%if_eq(r1, r2, {
-    li(r0)
-    %5
-    st(r0, r3, 0)
-})
-```
-
-Without braces, `st(r0, r3, 0)` exposes two commas at paren depth 1 and
-the call parses as 5 args instead of 3.
-
-### Semantics
-
-- `{` and `}` are new TOK kinds, tokenized as single-char delimiters.
-- In `parse_args`, a `brace_depth` counter runs parallel to the paren
-  `depth`. Commas at `depth == 1` split args **only when
-  `brace_depth == 0`**. LBRACE increments, RBRACE decrements.
-- When copying an arg span into a macro body, if the span begins with
-  TOK_LBRACE and ends with matching TOK_RBRACE at the outermost level,
-  strip the outer pair. Otherwise copy verbatim — `%foo(plain)` stays
-  working.
-- Braces never reach output. Either filter them during substitution or
-  make `emit_token` treat both kinds as no-ops (belt-and-braces; I'd
-  do both).
-
-### Nesting
-
-`{ { ... } }` nests via `brace_depth`. Braces inside a `"..."` string
-stay inside the string token — the lexer already handles that.
-
-Braces and parens are independent. `{ ( }` is syntactically fine in the
-arg-splitter; paren balancing only cares about LPAREN/RPAREN.
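-
-A minimal sketch of the nesting rule, assuming hypothetical `%while_nez`
-and `%if_eq` macros with two and three parameters respectively (their
-definitions are not part of this note):
-
-```
-## assumed: %while_nez(reg, body) and %if_eq(a, b, body) are defined elsewhere
-%while_nez(r1, {
-    %if_eq(r1, r2, {
-        st(r0, r3, 0)
-    })
-})
-```
-
-Only the comma after the first `r1` is seen at `depth == 1` with
-`brace_depth == 0`, so the outer call should parse as two args. Every
-later comma sits inside at least one brace level and is copied through as
-part of the block.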
-
-### Tokenizer
-
-`lex_source` (`m1pp.c:232`): add LBRACE/RBRACE cases alongside the
-existing LPAREN/RPAREN cases (~10 lines). Add `{` and `}` to the
-word-terminator set at `m1pp.c:310`.
-
-### m1pp.c touchpoints
-
-- New TOK_LBRACE, TOK_RBRACE enum entries (`m1pp.c:77`).
-- `lex_source`: two new single-char token cases.
-- `parse_args` (`m1pp.c:543`): add `brace_depth` counter; gate the
-  comma-split on `brace_depth == 0`; LBRACE/RBRACE bump/drop it.
-- Arg copy (in `expand_macro_tokens`, via `copy_arg_tokens_to_pool` and
-  `copy_paste_arg_to_pool`): detect outer `{ ... }` wrapping and strip.
-  The `copy_paste_arg_to_pool` path (single-token arg for `##`) should
-  reject braced args — pasting onto a block is nonsense.
-- `emit_token`: no-op for both brace kinds (defensive; they shouldn't
-  reach here if substitution is clean).
-
-### What this does not give you
-
-A C-like block-statement form (`%if_eq(a,b) { … } %else { … } %endif`)
-needs `process_tokens` to recognize line-start block openers/closers —
-a separate, heavier change. Braced args get us
-`%if_eq_else(a, b, { then }, { else })` and `%while_nez(r, { body })`,
-which covers the patterns in lisp.M1 we care about. Defer the block-
-statement form until braced-arg shows real ergonomic pain.
-
-## 3. `strlen` expression op
-
-### Syntax
-
-A new unary op in the Lisp-shaped expression grammar:
-
-```
-(strlen "literal")
-```
-
-Composes with arithmetic like any other op:
-
-```
-%((+ (strlen "hello") 1))
-```
-
-### Semantics
-
-- Argument must be a single `TOK_STRING` atom (double-quoted form).
-- Value is the raw byte count between the quotes: `span.len - 2`.
-  Matches what M1's `"…"` emission writes before appending NUL.
-- Single-quoted `'…'` hex literals error out — strlen is meaningless
-  on raw hex.
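-
-A few worked values under these rules (a sketch; the trailing comments
-are annotations, not required syntax):
-
-```
-(strlen "hello")       ## span.len = 7, value = 5
-(strlen "a b")         ## span.len = 5, value = 3
-(strlen 'deadbeef')    ## error: single-quoted hex literal
-```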
-
-### No decimal emitter needed
-
-The 4-byte LE hex emitter `%(expr)` is sufficient. Two paths cover
-everything:
-
-1. Companion DEFINE:
-
-   ```
-   %macro defstr(label, text)
-   :label text
-   DEFINE label##_LEN %((strlen text))
-   %endm
-   ```
-
-   M1 substitutes `label_LEN` with its 4 hex bytes at each use site.
-
-2. Inline at an LI-immediate slot:
-
-   ```
-   li_r2 %((strlen "usage: …"))
-   ```
-
-   LI's inline literal slot takes 4 raw LE bytes; `05000000` and `%5`
-   are byte-equivalent there. lisp.M1 already relies on this (see
-   `DEFINE NIL 07000000` at `lisp/lisp.M1:30`, consumed as
-   `li_r0 NIL`).
-
-The 1/2/8-byte emitters (`!(e)`, `@(e)`, `$(e)`) cover non-4-byte widths
-if needed.
-
-### m1pp.c touchpoints
-
-- `EXPR_STRLEN` entry in the `ExprOp` enum (`m1pp.c:87`).
-- `expr_op_code` (`m1pp.c:751`): match the word `strlen`.
-- Eval path: `strlen` is a degenerate case — its "argument" is a
-  TOK_STRING, not a recursive expression. Easiest is a special-case
-  branch in `eval_expr_range` (`m1pp.c:976`) that handles `(strlen
-  "...")` directly rather than routing through `eval_expr_atom`.
-  Emit `span.len - 2` as the value.
-- Alternative: extend `eval_expr_atom` to accept TOK_STRING atoms with
-  value `len - 2`, and treat `strlen` as identity. Cleaner
-  composition but more surface area; defer unless needed.
-
-## 4. Paren-less 0-arg macro calls
-
-### Syntax
-
-A macro defined with zero parameters may be called without trailing
-`()`:
-
-```
-%macro FRAME_BASE()
-16
-%endm
-
-%((+ %FRAME_BASE 8))           ## paren-less
-%((+ %FRAME_BASE() 8))         ## still works
-```
-
-### Semantics
-
-- When `find_macro` matches a `%NAME` token and the macro's
-  `param_count == 0`, the expansion triggers whether or not an LPAREN
-  follows.
-- Applies in both contexts where a macro call is currently recognized:
-  top-level processing in `process_tokens`, and atom position in
-  `eval_expr_atom` so a 0-arg macro is a valid expression atom inside
-  `%(...)`.
-- Non-zero-param macros still require their existing `(arg, ...)`
-  syntax.
-- `%foo` where `foo` is not defined as a macro still passes through
-  unchanged — the match only fires when a matching 0-param macro
-  exists. Backward compatible.
-
-### Why it matters
-
-`%struct` field access (§5) and `%enum` constants (§6) both depend on
-it. Once `NAME.field` expands to an integer, writing `%NAME.field` reads
-as a named constant; `%NAME.field()` looks like a function call. The
-relaxation is also load-bearing for expression-level composition:
-`%((+ %frame_hdr.SIZE %frame_apply.callee))` needs both atoms to
-resolve as 0-arg calls inside the evaluator.
-
-### m1pp.c touchpoints
-
-- `process_tokens` (`m1pp.c:1225`): the LPAREN-next guard becomes
-  "LPAREN-next OR `param_count == 0`."
-- `eval_expr_atom` (`m1pp.c:944`): same relaxation on the same guard.
-- The zero-param paren-less path constructs an empty arg list and
-  calls `expand_macro_tokens` with `arg_count == 0` — no
-  `parse_args` change.
-- No lexer changes, no new token kinds, no new Macro fields.
-
-## 5. `%struct` directive
-
-### Syntax
-
-A top-level directive declaring a fixed-layout aggregate of 8-byte
-fields:
-
-```
-%struct closure { hdr params body env }
-```
-
-Fields are bare identifiers separated by whitespace and/or commas.
-The closing brace terminates the declaration.
-
-### Semantics
-
-Expands at declaration time to N+1 zero-parameter macros:
-
-- `NAME.field_k` → `k * 8` for each field at index k
-- `NAME.SIZE` → `N * 8`
-
-All fields are 8-byte words. Mixed widths are deferred until a real
-use case appears.
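-
-For the `closure` declaration above, the effect should be the same as
-writing the following macros by hand (a sketch; it assumes `%macro`
-accepts dotted names, which this note does not confirm):
-
-```
-## hand-written equivalent of: %struct closure { hdr params body env }
-%macro closure.hdr()
-0
-%endm
-%macro closure.params()
-8
-%endm
-%macro closure.body()
-16
-%endm
-%macro closure.env()
-24
-%endm
-%macro closure.SIZE()
-32
-%endm
-```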
-
-Callers consume these as paren-less 0-arg calls (per §4):
-
-```
-ld(r0, r1, %closure.body)
-enter(%frame_apply.SIZE)
-```
-
-### No `base=` parameter
-
-The struct primitive declares offsets from zero. Base offsets (e.g.
-for stack-frame locals sitting above the retaddr/caller-sp header)
-compose at the call site via an ordinary wrapper macro:
-
-```
-%struct frame_hdr { retaddr caller_sp }        ## SIZE = 16
-
-%macro frame(field)
-%((+ field %frame_hdr.SIZE))
-%endm
-
-%struct frame_apply { callee args body env }
-
-:apply
-    enter(%frame_apply.SIZE)
-    st(r1, sp, %frame(%frame_apply.callee))    ## 0 + 16 = 16
-    st(r2, sp, %frame(%frame_apply.args))      ## 8 + 16 = 24
-    …
-    leave()
-    ret
-```
-
-Heap structs access fields directly (`%closure.body`); stack frames
-route through the `%frame` wrapper. Same primitive, two conventions,
-no special casing inside `%struct`. If a function needs a different
-base (e.g. a permanent spill prefix), define `%frame_big(field)`
-alongside `%frame` — the struct declarations don't change.
-
-### Tokenizer
-
-- `.` is already a word char, so `NAME.field` tokenizes as one
-  TOK_WORD under the current word-terminator set (`m1pp.c:310`).
-- `{` / `}` reuse the TOK_LBRACE / TOK_RBRACE kinds introduced for §2.
-  `%struct` cannot land before §2 does.
-
-### m1pp.c touchpoints
-
-- New top-level directive branch in `process_tokens` (`m1pp.c:1192`)
-  alongside the existing `%macro` detection. At line-start, if the
-  first word is `%struct`:
-  - consume name, `{`, field-identifier list (WORD tokens,
-    comma-or-whitespace separated), `}`, trailing newline
-  - for each field k, generate an entry in `macros[]`:
-    - name = synthesized `"NAME.field_k"` in `text_buf`
-    - `param_count = 0`
-    - body = a single TOK_WORD whose text is the decimal rendering of
-      `k * 8` in `text_buf`
-  - emit a final `"NAME.SIZE"` entry pointing at `N * 8`
-- Integer → decimal rendering reuses the hand-rolled reverse-fill
-  pattern from §1 local labels — no `snprintf`.
-- No new expression-evaluator surface; consumption goes through the
-  existing `find_macro` + `eval_expr_atom` path once §4 lands.
-- No new Macro struct fields. A struct-generated macro is
-  indistinguishable from any other 0-param macro once declared.
-
-### What this does not give you
-
-- **Mixed-width fields.** All offsets are `k * 8`. The packed 8-bit
-  type + 8-bit gc-flags + 48-bit length header in lisp.M1 is easier
-  to handle with dedicated bit-op macros than struct syntax; defer.
-- **Bundled enter/leave per frame.** A `%frame NAME { … }` directive
-  that also emits ENTER/LEAVE around a body would bring back the
-  block-body problem and tightly couple locals to one function shape.
-  The call-site verbosity savings don't pay; use plain `%struct` plus
-  a wrapper macro.
-
-## 6. `%enum` directive
-
-### Syntax
-
-A top-level directive declaring an incrementing sequence of named
-integer constants:
-
-```
-%enum tag { fixnum pair vector string symbol proc singleton }
-%enum prim_id { add sub mul div mod eq lt gt ... }
-```
-
-### Semantics
-
-Expands at declaration time to N+1 zero-parameter macros:
-
-- `NAME.label_k` → `k` for each label at index k
-- `NAME.COUNT` → `N`
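-
-For the `tag` declaration above (N = 7 labels), the generated constants
-work out to:
-
-```
-%tag.fixnum       ## 0
-%tag.pair         ## 1
-%tag.vector       ## 2
-%tag.string       ## 3
-%tag.symbol       ## 4
-%tag.proc         ## 5
-%tag.singleton    ## 6
-%tag.COUNT        ## 7
-```
-
-Inserting a new label between `pair` and `vector` would move `vector`
-through `singleton` up by one and `COUNT` to 8, with no edits at use
-sites.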
-
-Callers consume these as paren-less 0-arg calls (per §4):
-
-```
-li_r2 %tag.pair                       ## loads 1
-%((= %prim_id.COUNT 45))              ## compile-time sanity check
-```
-
-### Relationship to `%struct`
-
-Implementation-wise, `%enum` is `%struct` with stride 1 instead of 8
-and a totalizer named `COUNT` instead of `SIZE`. The directive
-parser, brace consumption, field-list parsing, and macro-generation
-loop are all shared. Factor the §5 implementation around one helper
-parameterized by `(stride, totalizer_name)`:
-
-- `%struct` → `define_fielded(8, "SIZE")`
-- `%enum`   → `define_fielded(1, "COUNT")`
-
-No separate code path; adding `%enum` is a second line-start
-directive check in `process_tokens` plus one call to the shared
-helper.
-
-### Why it matters
-
-lisp.M1 maintains two hand-numbered integer enumerations whose
-numbering must stay in sync across disjoint sites:
-
-- Tag codes (`lisp/lisp.M1:35–47`) referenced throughout the
-  reader / eval / printer dispatchers.
-- Primitive code IDs — used by the registration table and the
-  dispatch cascade (`lisp/lisp.M1:3843–3983`). Inserting a new
-  primitive in the middle shifts every downstream id; silent drift,
-  no error until runtime.
-
-`%enum` eliminates both drift classes: names declared once,
-referenced by name everywhere, renumbering on insertion is automatic.
-
-### m1pp.c touchpoints
-
-Same as §5 with the two parameter differences above. No new Macro
-struct fields, no new token kinds, no new expression-evaluator
-surface.
-
-### What this does not give you
-
-- **Explicit values.** `%enum foo { a=5 b c }` is not supported in
-  v1. All values are consecutive from 0. C's explicit-value form
-  is useful when matching external ABIs; our enums are internal, so
-  defer until a real use case appears.
-- **Flag/bitmask enums.** Not specially supported. If you want bit
-  positions, declare the bit index via `%enum` and take
-  `(1 << %NAME.flag_k)` at use sites.
-
-## 7. `%str` stringification builtin
-
-### Syntax
-
-A new builtin alongside `!(e)`, `@(e)`, `%(e)`, `$(e)`, and `%select`:
-
-```
-%str(IDENT)
-```
-
-Takes a single WORD-token argument; produces a TOK_STRING literal
-whose contents are the argument's text wrapped in double quotes:
-
-```
-%macro quoteit(name)
-%str(name)
-%endm
-
-%quoteit(hello)          →  "hello"
-%quoteit(foo_bar)        →  "foo_bar"
-```
-
-### Semantics
-
-- Exactly one argument, kind TOK_WORD. Multi-token, pasted, or
-  already-string args error out.
-- Output is a freshly-allocated TOK_STRING span in `text_buf` built
-  as `"` + original_text + `"`. The span's `len` is
-  `original_len + 2`, so `strlen` on the result (per §3) returns
-  `original_len` — the char count between the quotes, matching
-  what M1's `"…"` emission writes before the NUL.
-- Produces a string literal, not a word. Complementary to `##`, not
-  a replacement — see below.
-
-### Relationship to `##` paste
-
-Both turn a parameter into something else, but they produce
-**different token kinds** and serve **different goals**:
-
-| operator | inputs          | output           | kind       |
-|----------|-----------------|------------------|------------|
-| `##`     | two WORD tokens | one WORD token   | TOK_WORD   |
-| `%str`   | one WORD token  | one STRING token | TOK_STRING |
-
-`##` joins word fragments to build identifiers / label names.
-`%str` wraps a word in quotes to produce a string literal. They
-can't substitute for each other:
-
-- `:str_quote` (a label definition) must be a word — `##` can
-  build it, `%str` can't.
-- `"quote"` (a string literal) must introduce quote characters —
-  `%str` is the only way to manufacture it from a bare identifier,
-  paste can't.
-
-M1 sees the difference too: `:str_quote "quote"` is a label-def
-word followed by a quoted-bytes directive (5 bytes + NUL). Paste
-manufactures the first, stringify the second, both from the same
-source identifier.
-
-### Why it matters
-
-Every special-form symbol in lisp.M1 (`lisp/lisp.M1:164–260`)
-follows the same triad, written longhand 15 times today:
-
-```
-:str_quote "quote"
-DEFINE str_quote_LEN 05000000
-:sym_quote %0 %0
-```
-
-With `##` and `%str` together, one declarative site per symbol:
-
-```
-%macro defsym(name)
-:str_##name %str(name)
-DEFINE str_##name##_LEN %((strlen %str(name)))
-:sym_##name %0 %0
-%endm
-
-%defsym(quote)
-%defsym(if)
-%defsym(begin)
-…
-```
-
-- `##name` builds the label identifiers (`str_quote`, `sym_quote`).
-- `%str(name)` builds the string literal (`"quote"`).
-- `(strlen %str(name))` computes the length for the DEFINE.
-- One source of truth per symbol — the identifier itself.
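-
-As a sketch of the intended result, assuming `%(...)` emits the length in
-the same eight-hex-digit form as the longhand `05000000` above,
-`%defsym(if)` should be equivalent to writing:
-
-```
-## assumed output form; the exact text emitted for the length is not
-## specified in this note
-:str_if "if"
-DEFINE str_if_LEN 02000000
-:sym_if %0 %0
-```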
-
-Without `%str`, callers would have to pass the string explicitly
-(`%defsym(quote, "quote")`). That works today with zero m1pp
-changes but invites drift between the identifier and its
-spelled-out string form — nothing at compile time flags a typo
-where the two disagree.
-
-### Why a builtin, not a `#x` sigil
-
-cpp uses `#x` inside macro bodies to stringify a parameter. That
-shape doesn't port cleanly to m1pp because `#` is already the
-line-comment starter (`m1pp.c:278`). Giving `#` dual duty would
-create parse ambiguity in `lex_source`.
-
-`%str(x)` reuses the existing builtin-dispatch plumbing — the same
-path that handles `! / @ / % / $ / %select` — and reads uniformly
-with the other text and numeric builtins.
-
-### Tokenizer
-
-No changes. Existing TOK_STRING machinery handles the output;
-`%str` is a word token recognized as a builtin in `process_tokens`.
-
-### m1pp.c touchpoints
-
-- `process_tokens` (`m1pp.c:1211`): extend the builtin-dispatch
-  guard to accept `%str` alongside `! @ % $ %select`.
-- `expand_builtin_call` (`m1pp.c:1092`): add a branch for `%str`.
-  Arg-count check: exactly 1. Arg-shape check: exactly one token,
-  kind TOK_WORD. Anything else errors.
-- Stringification body: compute `out_len = arg.text.len + 2`,
-  reserve that many bytes via `append_text_len`, write `"`,
-  the original bytes, `"`. Push a TOK_STRING pointing at the new
-  span.
-- No `snprintf` — plain byte copies, straightforward port.
-- No new token kinds, no new Macro fields.
-
-### What this does not give you
-
-- **Stringification of multi-token or non-WORD arguments.** `%str`
-  accepts only a single WORD token; `%str(foo bar)` and
-  `%str("already a string")` both error. Wider forms are cpp-ish; defer
-  until a real use case appears.
-- **Escape processing inside the stringified text.** The input is
-  a bare identifier — no quotes, backslashes, or whitespace to
-  escape. If `%str` is ever extended to take broader token spans,
-  escape handling becomes relevant then.
-
-## Per-feature implementation sequence
-
-Each feature above lands in the same three ordered steps. Do not skip
-or reorder: the tests exist to pin behavior before the port, and the
-port exists because the C expander is disposable.
-
-1. **Implement in `m1pp/m1pp.c`.** The C expander is the oracle. Land
-   the feature here first so there is something to diff against.
-2. **Add a test in `tests/m1pp/`.** New `NN-name.M1pp` +
-   `NN-name.expected` pair following the existing numbering (see
-   `tests/m1pp/` — current fixtures run 00 through 10), **or** extend
-   an existing fixture when the feature is a natural addition to one
-   (e.g. `strlen` goes into `04-expr-ops.M1pp` alongside the other
-   expression ops rather than getting its own file). For malformed-
-   input features, the expected artifact is a non-zero exit; document
-   that in the fixture.
-3. **Add to `m1pp/m1pp.M1`.** Port the feature to the pure-P1
-   implementation of m1pp so the seed bootstrap doesn't depend on the
-   host C expander. The test from step 2 runs against both `m1pp` (C)
-   and `m1m` (P1) and must produce byte-identical output; that parity
-   is what `docs/M1M-P1-PORT.md` calls "C-oracle comparison."
-
-Shipping a feature means all three steps are done. A half-landed
-feature (C only, or C + test but no port) blocks the next feature in
-the sequencing list below.
-
-## Cross-feature sequencing
-
-1. **Local labels.** Smallest patch, immediately useful — enables
-   straight-line macros like `%case_tag` and `%tag_dispatch` that
-   want one or two internal labels without hand-naming.
-2. **Braced args.** Unlocks structured `%if_eq_else` / `%while_nez`
-   that carry instruction bodies. Depends on (1) in practice — the
-   bodies reference labels defined in the surrounding macro.
-3. **`strlen`.** Independent of (1) and (2). Land when the first
-   `%defstr` call site shows up.
-4. **Paren-less 0-arg macro calls.** Independent small relaxation of
-   two guards (one in `process_tokens`, one in `eval_expr_atom`).
-   Useful on its own for constants-as-macros; load-bearing for (5).
-5. **`%struct`.** Depends on (2) for the brace token kinds and (4)
-   for paren-less access syntax. Land only after both.
-6. **`%enum`.** Same dependencies as (5). Share the
-   directive-handler implementation with `%struct` — land together
-   or back-to-back.
-7. **`%str`.** Independent of everything else. Pairs naturally with
-   (3) `strlen` in the `%defsym` pattern but has no build-order
-   dependency on it. Land when the first `%defsym`-style
-   declarative macro shows up.
-
-Each is a self-contained patch. No cross-dependencies beyond the
-sequencing above and the three-step rule per feature.
-
-## Per-feature acceptance fixtures
-
-- **Local labels:** two fixtures — a single macro using `:@end` and
-  calling itself twice in one function (must produce distinct labels),
-  and nested macros each using `:@done` (must not collide). Assemble
-  through M1 + hex2 clean on at least one arch.
-- **Braced args:** fixture exercising a body with commas
-  (`st(r0, r3, 0)`), a body with nested braces, and a malformed
-  fixture (unmatched `{`) that exits non-zero.
-- **`strlen`:** fixture with `DEFINE X_LEN %((strlen "hello"))`
-  followed by `li_r2 X_LEN` — binary must load the value 5 and
-  syscall-exit 5 on all three arches via the existing P1 differential
-  harness.
-- **Paren-less 0-arg calls:** fixture with a 0-param macro invoked
-  both with and without trailing `()`, in top-level position and as
-  an atom inside `%(...)` expressions; all forms must produce
-  byte-identical output against a control fixture that always uses
-  `()`.
-- **`%struct`:** fixture declaring a 4-field struct, accessing each
-  field via paren-less calls, and layering a `%frame` wrapper using
-  `%frame_hdr.SIZE` composition (per the doc example); build on all
-  three arches and exit with a sentinel computed from both the
-  struct-level `.SIZE` and the wrapped base offset, proving the
-  compose-and-add path resolves correctly.
-- **`%enum`:** fixture declaring an enum with 3+ labels, referencing
-  each via paren-less call, and asserting `%NAME.COUNT` equals the
-  label count via a `%(=)` expression that feeds an exit code;
-  build on all three arches. Share fixture scaffolding with the
-  `%struct` test where practical.
-- **`%str`:** two fixtures — (a) a macro using `%str(name)` in its
-  body, compared against a control that writes the literal
-  `"name"` string directly (byte-identical output); (b) combined
-  paste + stringify, `%macro defsym(n) :str_##n %str(n) %endm`
-  invoked with distinct identifiers, assembled through M1 + hex2,
-  each generated label must point at the correctly-spelled string
-  bytes. A third malformed fixture (`%str(a b)` or
-  `%str("already_string")`) must exit non-zero.
diff --git a/docs/P1.md b/docs/P1.md
@@ -1,522 +1,531 @@
-# P1: A Portable Pseudo-ISA for M1
-
-## Motivation
-
-The stage0/live-bootstrap chain uses M1 (the mescc-tools macro assembler) as
-the lowest human-writable layer above raw hex. M1 itself is architecture-
-agnostic — it only knows `DEFINE name hex_bytes` — but every real M1 program
-in stage0 (including the seed C compiler `cc_*.M1`) is hand-written per arch.
-To write, say, a seed Lisp interpreter portably across amd64, aarch64, and
-riscv64 without reaching for M2-Planet, we need a thin portable layer: a
-pseudo-ISA whose mnemonics expand, per arch, to native encodings.
-
-P1 is that layer. The goal is an unoptimized RISC-shaped instruction set,
-hand-writable in M1 source, that assembles to three host ISAs via per-arch
-`DEFINE` tables on top of existing `M1` + `hex2` unchanged.
-
-## Non-goals
-
-- **Not an optimizing backend.** P1 is deliberately dumb. An `ADD rD, rA, rB`
-  on amd64 expands to `mov rD, rA; add rD, rB` unconditionally — no peephole
-  recognition of the `rD == rA` case. Paying ~2× code size is fine for a seed.
-- **Not ABI-compatible with platform C.** P1 programs are sovereign: direct
-  Linux syscalls, no libc linkage. Interop thunks can be written later if
-  needed.
-- **Not 32-bit.** x86-32, armv7l, riscv32 are out of scope for v1. Adding them
-  later means a separate defs file and some narrowing in the register model.
-- **Not self-hosting.** P1 is a target for humans, not a compiler IR. If you
-  want a compiler, write it in subset-C and use M2-Planet.
-
-## Current status
-
-Three programs assemble unchanged across aarch64, amd64, and riscv64
-from the generator-produced `p1_<arch>.M1` defs:
-  * `hello.M1` — write/exit, prints "Hello, world!".
-  * `demo.M1` — exercises the full tranche 1–5 op set (arith/imm/LD/ST/
-    branches/CALL/RET/PROLOGUE/EPILOGUE/TAIL); exits with code 5.
-  * `lisp.M1` — seed Lisp through step 2 of `LISP.md`: bump heap,
-    `cons`/`car`/`cdr`, tagged-value encoding. Exits with code 42
-    (decoded fixnum from `car(cons(42, nil))`).
-
-All three run on stock stage0 `M0` + `hex2-0`, bootstrapped per-arch from
-`hex0-seed` — no C compiler, no M2-Planet, no Mes. Run with
-`make PROG=<hello|demo|lisp> run-all` from `lispcc/`.
-
-The DEFINE table is generator-driven (`p1_gen.py`); tranches 1–8 are
-enumerated there, plus the full PROLOGUE_Nk family (k=1..4). Branch
-offsets are realized by the LI_BR-indirect pattern
-(`LI_BR &target ; BXX_rA_rB`), sidestepping the missing
-branch-offset support in hex2. The branch-target scratch is a
-reserved native reg (`x17`/`r11`/`t5`), not a P1 GPR.
-
-### Spike deviations from the design
-
-- Wide immediates use a per-`LI` inline literal slot (one PC-relative
-  load insn plus a 4-byte data slot, skipped past) rather than a shared
-  pool. Keeps the spike pool-free at the cost of one skip-branch per
-  `LI`. A pool can be reintroduced later without changes to P1 source.
-- `LI` is 4-byte zero-extended today; 8-byte absolute is deferred until
-  a program needs it. All current references are to addresses under
-  4 GiB, so `&label` + a 4-byte zero pad suffices.
-- The per-tuple DEFINE table is generator-produced (see `p1_gen.py`)
-  from a shared op table across all three arches. The emitted set
-  covers tranches 1–8 plus the N-slot PROLOGUE/EPILOGUE/TAIL
-  variants. Adding a new tuple is a one-line append to `rows()` in
-  the generator; no hand-encoding.
-
-## Design decisions
-
-| Decision       | Choice                                        | Why                                        |
-|----------------|-----------------------------------------------|--------------------------------------------|
-| Word size      | 64-bit                                        | All three target arches are 64-bit native  |
-| Endianness     | Little-endian                                 | All three agree                            |
-| Registers      | 8 GPRs (`r0`–`r7`) + `sp`, `lr`-on-stack      | Fits x86-64's usable register budget       |
-| Narrow imm     | Signed 12-bit                                 | riscv I-type width; aarch64 ≤12 also OK    |
-| Wide imm       | Pool-loaded via PC-relative `LI`              | Avoids arch-specific immediate synthesis   |
-| Calling conv   | r0 = return, r1–r3 = args (caller-saved), r4–r7 callee-saved | P1-defined; not platform ABI               |
-| Return address | Always spilled to stack on entry              | Hides x86's missing `lr` uniformly         |
-| Syscall        | `SYSCALL` with num in r0, args r1–r6; clobbers r0 only | Per-arch wrapper emits native sequence     |
-| Spill slot     | `[sp + 8]` is callee-private scratch after `PROLOGUE` | Frame already 16 B for alignment; second cell was otherwise unused |
-
-## Register mapping
-
-`r0`–`r3` are caller-saved. `r4`–`r7` are callee-saved, general-purpose,
-and preserved across `CALL`/`SYSCALL`. `sp` is special-purpose — see
-`PROLOGUE` semantics.
+# P1 v2
 
-| P1   | amd64 | aarch64 | riscv64 |
-|------|-------|---------|---------|
-| `r0` | `rax` | `x0`    | `a0`    |
-| `r1` | `rdi` | `x1`    | `a1`    |
-| `r2` | `rsi` | `x2`    | `a2`    |
-| `r3` | `rdx` | `x3`    | `a3`    |
-| `r4` | `r13` | `x26`   | `s4`    |
-| `r5` | `r14` | `x27`   | `s5`    |
-| `r6` | `rbx` | `x19`   | `s1`    |
-| `r7` | `r12` | `x20`   | `s2`    |
-| `sp` | `rsp` | `sp`    | `sp`    |
-| `lr` | (mem) | `x30`   | `ra`    |
-
-`r4`–`r7` all map to native callee-saved regs on each arch, so the SysV
-kernel+libc "callee preserves these" rule does the work for us across
-syscalls without explicit save/restore in the `SYSCALL` expansion.
-
-x86-64 has no link register; `CALL`/`RET` macros push/pop the return address
-on the stack. On aarch64/riscv64, the prologue spills `lr` (`x30`/`ra`) to
-the stack too, so all three converge on "return address lives in
-`[sp + 0]` after prologue." This uniformity is worth the extra store on the
-register-rich arches.
-
-**Reserved scratch registers (not available to P1):** certain native
-regs are used internally by op expansions and are never exposed as P1
-registers. Every P1 op writes only what its name says it writes;
-reserved scratch is saved and restored within the expansion, so no hidden
-clobbers leak across op boundaries.
-
-- **Branch-target scratch (all arches).** `B`/`BEQ`/`BNE`/`BLT`/`CALL`/
-  `TAIL` jump through a dedicated native reg pre-loaded via `LI_BR`:
-  `x17` (ARM IP1) on aarch64, `r11` on amd64, `t5` on riscv64. The reg
-  is caller-saved natively and never carries a live P1 value past the
-  following branch. Treat it as existing only between the `LI_BR` that
-  loads a target and the branch that consumes it.
-- **aarch64** — `x21`–`x23` hold `r1`–`r3` across the `SYSCALL` arg
-  shuffle (`r4`/`r5` live in callee-saved `x26`/`x27` so the kernel
-  preserves them for us). `x16` (ARM IP0) is scratch for `REM`
-  (carries the `SDIV` quotient into the following `MSUB`). `x8` holds
-  the syscall number.
-- **amd64** — `rcx` and the branch-target `r11` are kernel-clobbered by
-  the `syscall` instruction itself. `PROLOGUE`/`EPILOGUE` use `rcx` to
-  carry the retaddr across the `sub rsp, N` (can't use `r11` here — it
-  is the branch-target reg, and `TAIL` = `EPILOGUE` + `jmp r11`).
-  `DIV`/`REM` use `rcx` (to save `rdx` = P1 `r3`) and `r11` (to save
-  `rax` = P1 `r0`) so that `idiv`'s implicit writes to rax/rdx stay
-  invisible; the `r11` save is fine because no branch op can interrupt
-  the DIV/REM expansion.
-- **riscv64** — `s3`,`s6`,`s7` hold `r1`–`r3` across the `SYSCALL` arg
-  shuffle (`r4`/`r5` live in callee-saved `s4`/`s5`, same trick as
-  aarch64). `a7` holds the syscall number.
-
-All of these are off-limits to hand-written P1 programs and are never
-mentioned in P1 source. If you see a register name not in the r0–r7 /
-sp / lr set, it belongs to an op's internal expansion.
-
-## Reading P1 source
-
-P1 has no PC-relative branch immediates (hex2 offers no label-arithmetic
-sigil — branch ranges can't be expressed in hex2 source). Every branch,
-conditional or not, compiles through the **LI_BR-indirect** pattern: the
-caller loads the target into the dedicated branch-target scratch reg
-with `LI_BR`, then the branch op jumps through it. A conditional like
-"jump to `fail` if `r1 != r2`" is three source lines:
+## Scope
+
+P1 v2 is a portable pseudo-ISA for standalone executables.
+
+P1 v2 has two width variants:
+
+- **P1v2-64** — one word is one 64-bit integer or pointer value
+- **P1v2-32** — one word is one 32-bit integer or pointer value
+
+Portable source may use any number of word arguments. The first four argument
+registers are explicit, and additional argument words are passed through a
+portable incoming stack-argument area.
+
+Portable source may directly return `0..1` word. Wider results use the
+portable indirect-result convention described below.
+
+## Toolchain envelope
+
+P1 v2 must be assemblable through the existing `M0` + `hex2` path, with
+`catm` as the only composition primitive between source or generated fragments.
+The spec therefore assumes only the following toolchain features:
+
+- `M0`-level `DEFINE name hex_bytes` substitution
+- raw byte emission
+- labels and label references supported by `hex2`
+- file concatenation via `catm`
+
+## Source notation
+
+This document describes instructions using ordinary assembly notation such as
+`ADD rd, ra, rb`, `LD rd, [ra + off]`, or `CALL`.
+
+Because of the toolchain constraints above, portable source does not encode
+most operands as textual instruction arguments. Instead, register choices,
+inline immediate values, and small fixed parameters are fused into opcode
+names, following the generated-table style used by `src/p1_gen.py`.
+
+So the notation in this document is descriptive rather than literal:
+
+- `ADD rd, ra, rb` means a family of fused register-specific opcodes
+- `ADDI rd, ra, imm` means a family of fused register-and-immediate-specific
+  opcodes
+- `ENTER size` means a family of fused byte-count-specific opcodes
+- `LDARG rd, idx` means a family of fused register-and-argument-slot-specific
+  opcodes
+- `BR rs`, `CALLR rs`, and `TAILR rs` mean register-specific control-flow
+  opcodes
+- `LEAVE`, `CALL`, `RET`, `TAIL`, `B`, and `SYSCALL` remain operand-free
+
+Labels still appear in source where the toolchain supports them directly, such
+as `LA rd, %label` and `LA_BR %label`.
+
+## Register Model
+
+### Exposed registers
+
+P1 v2 exposes the following source-level registers:
+
+- `a0`–`a3` — argument registers. Also caller-saved general registers.
+- `t0`–`t2` — caller-saved temporaries.
+- `s0`–`s3` — callee-saved general registers.
+- `sp` — stack pointer.
+
+### Hidden registers
+
+The backend may reserve additional native registers that are never visible in
+P1 source:
+
+- `br` — branch / call target mechanism, implemented as a dedicated hidden
+  native register on every target
+- backend-local scratch used entirely within one instruction expansion
+
+No hidden register may carry a live P1 value across an instruction boundary.
+
+## Calling Convention
+
+### Arguments and return values
+
+P1 v2 defines three result conventions: one-word direct, two-word direct, and
+indirect.
+
+In the one-word direct-result convention:
+
+- Explicit argument words 0-3 live in `a0-a3`.
+- Additional explicit argument words live in the incoming stack-argument area
+  and are read with `LDARG`.
+- On return, a one-word result lives in `a0`.
+
+In the two-word direct-result convention:
+
+- Explicit argument words 0-3 live in `a0-a3` on entry.
+- Additional explicit argument words still live in the incoming
+  stack-argument area.
+- On return, `a0` holds result word 0 and `a1` holds result word 1.
+
+In the indirect-result convention:
+
+- The caller passes a writable result buffer pointer in `a0`.
+- Explicit argument words 0-2 then live in `a1-a3`.
+- Additional explicit argument words still live in the incoming
+  stack-argument area.
+- On return, `a0` holds the same result buffer pointer value.
+
+In both direct-result conventions, incoming stack-argument slot `0` corresponds
+to explicit argument word `4`. In the indirect-result convention, incoming
+stack-argument slot `0` corresponds to explicit argument word `3`.
+
+The two-word direct-result convention covers common cases such as 64-bit
+integer results on 32-bit targets, two-word aggregates, and divmod-style
+returns. The indirect-result convention is the portable way to return any
+result wider than two words.
+
+### Register preservation
+
+Caller-saved:
+
+- `a0`–`a3`
+- `t0`–`t2`
+
+Callee-saved:
+
+- `s0`–`s3`
+- `sp`
+
+### Call semantics
+
+A call is valid from any function, including one that has not established a
+standard frame. Call / return correctness does not depend on establishing a
+frame first.
+
+If a function needs any incoming argument after making a call, it must save it
+before the call. This matters in particular for `a0`, which is overwritten by
+every convention's return value, and for `a1` when the callee uses the two-word
+direct-result convention.
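+
+As an illustration of the save-before-call rule, the sketch below keeps an
+incoming `a0` alive across a call by spilling it to frame-local storage. It
+uses this document's descriptive notation and assumes P1v2-64 (so `2*WORD` is
+16); the label `%helper`, the `ST rs, [ra + off]` operand order, and the `#`
+comments are illustrative assumptions rather than literal source syntax.
+
+```
+ENTER 8              # standard frame with 8 bytes of frame-local storage
+ST   a0, [sp + 16]   # spill incoming a0 to the first local word (sp + 2*WORD)
+LA_BR %helper        # load the call target into br
+CALL                 # on return, a0 holds helper's result
+MOV  t0, a0          # keep the result in a temporary
+LD   a0, [sp + 16]   # reload the original incoming a0
+ADD  a0, a0, t0      # combine the saved argument with the result
+LEAVE                # tear down the frame before returning
+RET
+```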
+
+A call that passes any stack argument words requires the caller to have an
+active standard frame with enough frame-local storage to stage those outgoing
+words.
+
+The return address is hidden machine state. Portable source must not assume
+that it lives in any exposed register.
+
+## Stack Convention
+
+### Call-boundary rule
+
+At every call boundary, the backend must satisfy the native C ABI stack
+alignment rule for the target architecture.
+
+Portable source must therefore treat raw function-entry `sp` as opaque. It may
+not assume that the low bits of `sp` have the same meaning on all targets
+before a frame is established.
+
+### Incoming stack-argument area
+
+P1 v2 defines an abstract incoming stack-argument area for explicit argument
+words that do not fit in registers.
+
+- Slot `0` is the first stack-passed explicit argument word.
+- Slots are word-indexed, not byte-indexed.
+- Portable source may access this area only through `LDARG`.
+
+`LDARG` is valid only when the current function has an active standard frame.
+Therefore, a function that needs any incoming stack argument must establish a
+standard frame before its first `LDARG`.
+
+Portable source must not assume any direct relationship between incoming
+argument slots and raw function-entry `sp`. In particular, source must not try
+to reconstruct stack arguments by manually indexing from `sp`; backend entry
+layouts differ across targets.
+
+For a call with `m` stack-passed explicit argument words, the caller stages
+those words in the first `m` words of its frame-local storage immediately
+before the call:
 
 ```
-P1_LI_BR
-&fail
-P1_BNE_R1_R2
+[sp + 2*WORD + 0*WORD] = outgoing arg word 0
+[sp + 2*WORD + 1*WORD] = outgoing arg word 1
+...
 ```
 
-`LI_BR` writes a reserved native reg (`x17`/`r11`/`t5` — see Register
-mapping), not a P1 GPR. The branch op that follows consumes it and
-jumps. `CALL` and `TAIL` follow the same shape
-(`LI_BR &callee ; P1_CALL`).
+At callee entry, those staged words become incoming argument slots `0..m-1`.
+The backend is responsible for mapping between the caller's frame layout and
+the callee's abstract incoming argument slots.
 
-The branch-target reg is owned by the branch machinery: never carry a
-live value across a branch in it. Since it isn't a P1 reg, this is
-automatic — there's no P1-level way to read or write it outside
-`LI_BR`.
+Portable code that needs both ordinary locals and stack-passed outgoing
+arguments must reserve enough total frame-local storage and keep the
+low-addressed prefix available for outgoing argument staging across the call.
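+
+A sketch of a call with five explicit argument words, in this document's
+descriptive notation and assuming P1v2-64 (so `2*WORD` is 16). The label
+`%callee`, the constant values, the `ST rs, [ra + off]` operand order, and the
+`#` comments are illustrative assumptions.
+
+```
+# caller
+ENTER 8              # one word of frame-local storage for outgoing arg word 0
+LI   a0, 1           # explicit argument words 0-3 go in a0-a3
+LI   a1, 2
+LI   a2, 3
+LI   a3, 4
+LI   t0, 5
+ST   t0, [sp + 16]   # [sp + 2*WORD + 0*WORD] = outgoing arg word 0
+LA_BR %callee
+CALL
+LEAVE
+RET
+
+# callee
+ENTER 0              # LDARG requires an active standard frame
+LDARG t0, 0          # slot 0 = explicit argument word 4 (the value 5)
+ADD  a0, a0, t0      # a0 = 1 + 5, returned as the one-word direct result
+LEAVE
+RET
+```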
 
-## Instruction set (~30 ops)
+### Standard frame layout
+
+Functions that need local stack storage use a standard frame layout. After
+frame establishment:
 
 ```
-# 3-operand arithmetic (reg forms)
-ADD  rD, rA, rB       SUB  rD, rA, rB
-AND  rD, rA, rB       OR   rD, rA, rB       XOR  rD, rA, rB
-SHL  rD, rA, rB       SHR  rD, rA, rB       SAR  rD, rA, rB
-MUL  rD, rA, rB       DIV  rD, rA, rB       REM  rD, rA, rB
-
-# Immediate forms (signed 12-bit)
-ADDI rD, rA, !imm     ANDI rD, rA, !imm     ORI  rD, rA, !imm
-SHLI rD, rA, !imm     SHRI rD, rA, !imm     SARI rD, rA, !imm
-
-# Moves
-MOV  rD, rA                           # reg-to-reg (rA may be sp)
-LI   rD, %label                       # load 64-bit literal from pool
-LA   rD, %label                       # load PC-relative address
-
-# Memory (offset is signed 12-bit)
-LD   rD, rA, !off     ST   rS, rA, !off    # 64-bit
-LB   rD, rA, !off     SB   rS, rA, !off    #  8-bit zero-extended / truncated
-
-# Control flow
-B    %label                           # unconditional branch
-BEQ  rA, rB, %label   BNE  rA, rB, %label
-BLT  rA, rB, %label                          # signed less-than
-CALL %label           RET
-PROLOGUE              EPILOGUE               # frame setup / teardown (see Semantics)
-TAIL %label                           # tail call: epilogue + B %label
-
-# System
-SYSCALL                               # num in r0, args r1-r6, ret in r0
+[sp + 0*WORD] = saved return address
+[sp + 1*WORD] = saved caller stack pointer
+[sp + 2*WORD ... sp + 2*WORD + local_bytes - 1] = frame-local storage
+...
 ```
 
-### Semantics
-
-- All arithmetic is on 64-bit values. `SHL`/`SHR`/`SAR` take shift amount in
-  the low 6 bits of `rB` (or the `!imm` for immediate forms).
-- `DIV` is signed, truncated toward zero. `REM` matches `DIV`.
-- `LB` zero-extends the loaded value into the 64-bit destination.
-  (A signed-extending variant `LBS` can be added later if needed. 32-bit
-  `LW`/`SW` are deliberately omitted — emulate with `LD`+`ANDI`/shift and
-  `ST` through a 64-bit scratch when needed.)
-- Unsigned comparisons (`BLTU`/`BGEU`) are not in the ISA: seed programs
-  with tagged-cell pointers only need signed comparisons. Synthesize from
-  `BLT` via operand-bias if unsigned compare is ever required.
-- `BGE rA, rB, %L` is not in the ISA: synthesize as
-  `BLT rA, rB, %skip; B %L; :skip` (the LI_BR-indirect branch pattern
-  makes the skip cheap). `BLT rB, rA, %L` handles the strict-greater
-  case.
-- Branch offsets are PC-relative. In the v0.1 spike they are realized by
-  loading the target address via `LI_BR` into the reserved branch-target
-  reg and jumping through it; range is therefore unbounded within the
-  4 GiB address space. Native-encoded branches (with tighter range
-  limits) are an optional future optimization.
-- `MOV rD, rA` copies `rA` into `rD`. The source may be `sp` (read the
-  current stack pointer into a GPR — used e.g. for stack-balance assertions
-  around a call tree). The reverse (`MOV sp, rA`) is not provided; `sp`
-  is only mutated by `PROLOGUE`/`EPILOGUE`.
-- `CALL %label` transfers control to `%label` with a return address
-  established such that a subsequent `RET` returns to the instruction
-  after the `CALL`. The storage location of that return address is
-  implementation-defined (stack on amd64, link register on
-  aarch64/riscv64) and **must be treated as volatile across any inner
-  `CALL`**.
-
-  Concrete rule: **a function that itself executes a `CALL` must wrap
-  its body in a matching `PROLOGUE`/`EPILOGUE` pair.** `PROLOGUE` is
-  what spills the incoming return address into the frame; `EPILOGUE`
-  restores it so `RET` can find it.
-
-  Leaf functions (no `PROLOGUE`) are permitted and may be called
-  normally: `CALL leaf` sets up the return address, the leaf's `RET`
-  uses it, control returns to the caller. The restriction is only on
-  what a leaf may itself do:
-
-  - **RET** — returns to whoever established the current return
-    address. Usually the direct `CALL`er; in the tail-branch case
-    below, whoever `CALL`ed the outermost caller in the chain.
-  - **Tail-branch** (`li_br &target ; B`) to another function — the
-    target's own `PROLOGUE`/`EPILOGUE` preserves the current return
-    address across the target's body, so the target's `RET` returns
-    directly to the leaf's caller, skipping the leaf in the return
-    chain.
-  - **`CALL`** — forbidden. The inner `CALL` clobbers the return
-    address slot (on arches where it's a register, not a stack
-    push), so the leaf's subsequent `RET` branches to itself.
-
-  The failure mode of a leaf `CALL` is platform-asymmetric: amd64's
-  native `CALL` pushes onto the stack so a prologue-less `CALL ; RET`
-  happens to work; aarch64 and riscv64 write the return address to a
-  link register and hang silently. Don't write code that relies on
-  the amd64-happens-to-work behavior.
-
-  `RET` pops / branches through the return address.
-- `PROLOGUE` / `EPILOGUE` set up and tear down a frame with **k
-  callee-private scratch slots**. `PROLOGUE` is shorthand for
-  `PROLOGUE_N1` (one slot); `PROLOGUE_Nk` for k = 2, 3, 4 reserves that
-  many slots. After `PROLOGUE_Nk`:
-
-  ```
-  [sp +  0]          = caller's return address
-  [sp +  8]          = slot 1 (callee-private scratch)
-  [sp + 16]          = slot 2         (k >= 2)
-  [sp + 24]          = slot 3         (k >= 3)
-  [sp + 32]          = slot 4         (k >= 4)
-  ```
-
-  Each slot is private to the current frame: a nested `PROLOGUE`
-  allocates its own slots, so the parent's spills survive unchanged.
-  Frame size is `round_up_16(8 + 8*k)`, so k=1→16, k=2→32 (with 8
-  bytes of padding past slot 2), k=3→32, k=4→48. `EPILOGUE_Nk` /
-  `TAIL_Nk` must match the `PROLOGUE_Nk` of the enclosing function.
-
-  Why multiple slots: constructors like `cons(car, cdr)` keep several
-  live values across an inner `alloc()` call. One scratch cell isn't
-  enough, and parking overflow in BSS would break the step-9 mark-sweep
-  GC (which walks the stack for roots). Per-frame slots keep every live
-  value on the walkable stack.
-
-  Per-arch mechanics differ — aarch64/riscv64 `PROLOGUE` subtracts the
-  frame size from `sp` and stores `lr`/`ra` at `[sp + 0]`; amd64 pops
-  the retaddr native `call` already pushed into a non-P1 scratch
-  (`rcx`), subtracts the frame size, then re-pushes it so the final
-  layout matches. (`rcx` rather than `r11`, because `r11` is the
-  branch-target reg and `TAIL` would otherwise clobber its own
-  destination mid-epilogue.) Access slots via `MOV rX, sp` followed by
-  `LD rY, rX, <off>` / `ST rY, rX, <off>`; `sp` itself isn't a valid
-  base for `LD`/`ST`.
-- `TAIL %label` is a tail call — it performs the current function's
-  standard epilogue (restore `lr` from `[sp+0]`, pop the frame) and then
-  branches unconditionally to `%label`, reusing the caller's return
-  address instead of pushing a new frame. The current function must be
-  using the standard prologue. Interpreter `eval` loops rely on `TAIL`
-  to recurse on sub-expressions without growing the stack.
-- `SYSCALL` is a single opcode in P1 source. Each arch's defs file expands it
-  to the native syscall sequence, including the register shuffle from P1's
-  `r0`=num, `r1`–`r6`=args convention into the platform's native convention
-  if different.
-
-## Encoding strategy
-
-For each `(op, register-tuple)` combination, emit one `DEFINE` per arch. A
-generator script produces the full defs file; no hand-encoding per entry.
-
-Example — `ADD r0, r1, r2`:
+Frame-local storage is byte-addressed. Portable code may use it for ordinary
+locals, spilled callee-saved registers, and the caller-staged outgoing
+stack-argument words described above.
 
-```
-# p1_riscv64.M1
-DEFINE P1_ADD_R0_R1_R2  33056000     # add a0, a1, a2 (little-endian)
+Total frame size is:
 
-# p1_aarch64.M1
-DEFINE P1_ADD_R0_R1_R2  2000028B     # add x0, x1, x2
+`round_up(STACK_ALIGN, 2*WORD_SIZE + local_bytes)`
 
-# p1_amd64.M1  (2-op destructive — expands to mov + add)
-DEFINE P1_ADD_R0_R1_R2  4889F84801F0 # mov rax, rdi ; add rax, rsi
-```
+Where:
 
-### Combinatorial footprint
+- `WORD_SIZE = 8` in P1v2-64
+- `WORD_SIZE = 4` in P1v2-32
+- `STACK_ALIGN` is target-defined and must satisfy the native call ABI
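+
+For example, assuming a P1v2-64 target whose `STACK_ALIGN` is 16 (an
+illustrative value; the real one is target-defined), the formula works out as:
+
+```
+local_bytes = 0    round_up(16, 2*8 + 0)  = 16
+local_bytes = 8    round_up(16, 2*8 + 8)  = 32   (8 bytes of padding)
+local_bytes = 24   round_up(16, 2*8 + 24) = 48   (8 bytes of padding)
+```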
 
-Per-arch defs count (immediates handled by sigil, not enumerated):
+Leaf functions that need no frame-local storage may omit the frame entirely.
 
-- 11 reg-reg-reg arith × 8 `rD` × 8 `rA` × 8 `rB` = 704. Pruned to ~600 by
-  removing trivially-equivalent tuples.
-- 6 immediate arith × 8² = 384. Each entry uses an immediate sigil (`!imm`),
-  so the immediate value itself is not enumerated.
-- 3 move ops × 8 or 8² (plus +8 for the `MOV rD, sp` variant) = ~88.
-- 4 memory ops × 8² = 256. Offsets use `!imm` sigil.
-- 3 conditional branches × 8² = 192.
-- Singletons (`B`, `CALL`, `RET`, `PROLOGUE`, `EPILOGUE`, `TAIL`, `SYSCALL`) = 7.
+### Frame invariants
 
-Total ≈ 1210 defines per arch. Template-generated.
+- A function that allocates a frame must restore `sp` before returning.
+- Callee-saved registers modified by the function must be restored before
+  returning.
+- The standard frame layout is the only frame shape recognized by P1 v2.
 
-## Syscall conventions
+## Op Set Summary
 
-Linux syscall mechanics differ across arches. The `SYSCALL` macro hides this.
+| Category | Operations |
+|----------|------------|
+| Materialization | `LI rd, imm`, `LA rd, %label`, `LA_BR %label` |
+| Moves | `MOV rd, rs`, `MOV rd, sp` |
+| Arithmetic | `ADD`, `SUB`, `AND`, `OR`, `XOR`, `SHL`, `SHR`, `SAR`, `MUL`, `DIV`, `REM` |
+| Immediate arithmetic | `ADDI`, `ANDI`, `ORI`, `SHLI`, `SHRI`, `SARI` |
+| Memory | `LD`, `ST`, `LB`, `SB` |
+| ABI access | `LDARG` |
+| Branching | `B`, `BR`, `BEQ`, `BNE`, `BLT`, `BLTU`, `BEQZ`, `BNEZ`, `BLTZ` |
+| Calls / returns | `CALL`, `CALLR`, `RET`, `TAIL`, `TAILR` |
+| Frame management | `ENTER`, `LEAVE` |
+| System | `SYSCALL` |
 
-| Arch     | Insn      | Num reg | Arg regs (plat ABI)          |
-|----------|-----------|---------|------------------------------|
-| amd64    | `syscall` | `rax`   | `rdi, rsi, rdx, r10, r8, r9` |
-| aarch64  | `svc #0`  | `x8`    | `x0 – x5`                    |
-| riscv64  | `ecall`   | `a7`    | `a0 – a5`                    |
+## Immediates
 
-**Observable semantics:** `SYSCALL` takes the number in `r0` and args in
-`r1`–`r6`, traps, and returns the kernel's result in `r0`. **Only `r0` is
-clobbered.** `r1`–`r7` are preserved across `SYSCALL` on every arch. This
-matches the kernel's own register discipline and lets callers thread live
-values through syscalls without per-arch save/restore dances.
+Immediate operands appear only in instructions that explicitly admit them.
+Portable source has three immediate classes:
 
-The per-arch expansions:
+- **Inline integer immediate** — a signed 12-bit assembly-time constant in the
+  range `-2048..2047`
+- **Materialized word value** — a full one-word assembly-time constant loaded
+  with `LI`
+- **Materialized address** — the address of a label loaded with `LA`
 
-- **amd64** — P1 args already occupy the native arg regs except for args
-  4/5/6. Three shuffle moves cover those: `mov r10, r13` (arg4 = P1 `r4`),
-  `mov r8, r14` (arg5 = P1 `r5`), `mov r9, rbx` (arg6 = P1 `r6`); then
-  `syscall`. The kernel preserves everything except `rax`, `rcx`, `r11`,
-  and `rax` = P1 `r0` is the only visible clobber.
-- **aarch64** — native arg regs are `x0`–`x5` but P1 puts args in
-  `x1`–`x3`,`x26`,`x27`,`x19` (the three caller-saved arg regs one slot
-  higher, plus three callee-saved for `r4`–`r6`). The expansion saves
-  P1 `r1`–`r3` into `x21`–`x23`, shuffles them and `r4`/`r5`/`r6` down
-  into `x0`–`x5`, moves the number into `x8`, `svc #0`s, then restores
-  `r1`–`r3` from `x21`–`x23`. No save/restore of `r4`/`r5` is needed
-  because they live in callee-saved natives that the kernel preserves.
-- **riscv64** — same shape as aarch64, with `s3`/`s6`/`s7` as the `r1`–
-  `r3` save slots, `s4`/`s5` already holding `r4`/`r5`, and `a7` as the
-  number register.
+P1 v2 also uses two structured assembly-time operands:
 
-The extra moves on aarch64/riscv64 are a few nanoseconds per syscall.
-Trading them for uniform "clobbers `r0` only" semantics is worth it:
-callers don't need to memorize a per-arch clobber set.
+- **Frame-local byte count** — a non-negative byte count used by `ENTER`
+- **Argument-slot index** — a non-negative word-slot index used by `LDARG`
 
-### Syscall numbers
+`LI rd, imm` loads the one-word integer value `imm`.
 
-Linux uses two syscall tables relevant here:
+`LA rd, %label` loads the address of `%label` as a one-word pointer value.
 
-- **amd64**: amd64-specific table (`write = 1`, `exit = 60`, …).
-- **aarch64 and riscv64**: generic table (`write = 64`, `exit = 93`, …).
+The backend may realize `LI` and `LA` using native immediates, literal pools,
+multi-instruction sequences, or other backend-private mechanisms.
 
-P1 programs use symbolic constants (`SYS_WRITE`, `SYS_EXIT`) defined per-arch:
+Backends may assume labels fit in 32 bits when realizing `LA` and `LA_BR`.
+This reflects the stage0 image layout (`hex2-0` base `0x00600000`, programs
+well under 4 GB), not a portable-ISA-level guarantee. Backends that target
+images loaded above the 4 GB boundary must adjust their `LA` / `LA_BR`
+lowering. `LI` makes no such assumption — it materializes any one-word value.
 
-```
-# p1_amd64.M1
-DEFINE SYS_WRITE 01000000
-DEFINE SYS_EXIT  3C000000
+## Control Flow
 
-# p1_aarch64.M1 and p1_riscv64.M1
-DEFINE SYS_WRITE 40000000
-DEFINE SYS_EXIT  5D000000
-```
+### Call / Return / Tail Call
 
-(The encodings shown are placeholder little-endian 32-bit immediates; real
-values are inlined as operands to `LI` or `ADDI`.)
+Control-flow targets are materialized with `LA_BR %label`, which loads
+`%label` into the hidden branch-target mechanism `br`. The immediately
+following control-flow op consumes that target.
 
-## Program layout
+`CALL` transfers control to the target most recently loaded by `LA_BR` and
+establishes a return continuation such that a subsequent `RET` returns to the
+instruction after the `CALL`. `CALL` is valid whether or not the caller has
+established a standard frame, except that any call using stack-passed argument
+words requires an active standard frame to hold the staged outgoing words.
 
-Each P1 object file is structured as:
+`CALLR rs` is the register-indirect form of `CALL`. It transfers control to
+the code pointer value held in `rs` and establishes the same return
+continuation semantics as `CALL`.
 
-```
-<ELF header, per arch>
-<code section>
-  <function prologues, bodies, epilogues>
-<constant pool>
-  pool_label_1: &0xDEADBEEFCAFEBABE
-  pool_label_2: &0x00000000004004C0
-  ...
-<data section>
-  <static bytes>
-```
+`RET` returns through the current return continuation. `RET` is valid whether
+or not the current function has established a standard frame, provided any
+frame established by the function has already been torn down.
 
-`LI rD, %pool_label_N` issues a PC-relative load; the pool must be reachable
-within the relocation's range (≤±1 MiB for aarch64 `LDR` literal, ≤±2 GiB for
-riscv `AUIPC`+`LD`, unlimited for x86 `mov rD, [rip + rel32]` within 2 GiB).
+`TAIL` is a tail call to the target most recently loaded by `LA_BR`. It is
+valid only when the current function has an active standard frame. `TAIL`
+performs the standard epilogue for the current frame and then transfers control
+to the loaded target without creating a new return continuation. The callee
+therefore returns directly to the current function's caller.
 
-For programs under a few MiB, a single pool per file is fine. For larger
-programs, emit a pool per function.
+`TAILR rs` is the register-indirect form of `TAIL`. It is valid only when the
+current function has an active standard frame.
 
-## Data alignment
+Because stack-passed outgoing argument words are staged in the caller's own
+frame-local storage, `TAIL` and `TAILR` are portable only when the tail-called
+callee requires no stack-passed argument words. Portable compilers must lower
+other tail-call cases to an ordinary `CALL` / `RET` sequence.
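+
+A minimal tail-call sketch in descriptive notation. The label `%target` is
+hypothetical, the callee is assumed to take all of its arguments in registers,
+and the `#` comments are not literal source syntax.
+
+```
+ENTER 0              # TAIL is valid only with an active standard frame
+ADDI a0, a0, 1       # prepare the register argument for the callee
+LA_BR %target        # load the tail-call target into br
+TAIL                 # standard epilogue, then jump; no LEAVE or RET follows
+```
+
+Here `%target` returns directly to this function's caller.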
 
-**Labels have no inherent alignment.** A label's runtime address is
-`ELF_base + (cumulative bytes emitted before the label)`. Neither M1 nor
-hex2 offers an `.align` directive or any other alignment control — the
-existing hex2 sigils (`: ! @ $ ~ % &` and the `>` base override) cover
-labels and references, not padding. And because the cumulative byte count
-between the ELF header and any label varies per arch (different SYSCALL
-expansions, different branch encodings, different PROLOGUE sizes), the
-same label lands at a different low-3-bits offset on each target.
+Portable source must treat the return continuation as hidden machine state. It
+must not assume that the return address lives in any exposed register or stack
+location except as defined by the standard frame layout after frame
+establishment.
 
-Concretely: `heap_start` in a program that builds identically for all
-three arches can land at `0x...560` (aligned) on aarch64, `0x...2CB`
-(misaligned) on amd64, and `0x...604` (misaligned) on riscv64. If the
-program then tags pair pointers by ORing bits into the low 3, the tag
-collides with pointer bits on the misaligned arches and every pair is
-corrupt.
+### Prologue / Epilogue
 
-Programs that care about alignment therefore align **at boot, in code**:
+P1 v2 defines the following frame-establishment and frame-teardown operations:
+
+- `ENTER size`
+- `LEAVE`
+
+`ENTER size` establishes the standard frame layout with `size` bytes of
+frame-local storage:
 
 ```
-P1_LI_R4
-&heap_next
-P1_LD_R0_R4_0
-P1_ORI_R0_R0_7         ## x |= 7
-P1_ADDI_R0_R0_1        ## x += 1      → x rounded up to next 8-aligned
-P1_ST_R0_R4_0
+[sp + 0*WORD] = saved return address
+[sp + 1*WORD] = saved caller stack pointer
+[sp + 2*WORD ... sp + 2*WORD + size - 1] = frame-local storage
 ```
 
-The `(x | mask) + 1` idiom rounds any pointer up to `mask + 1`. Use
-`mask = 7` for 8-byte alignment (tagged pointers with a 3-bit tag),
-`mask = 15` for 16-byte alignment (cache lines, `malloc`-style).
-
-**Allocator contract.** Any allocator that returns cells eligible to be
-tagged (pair, closure, vector, …) MUST return pointers aligned to at
-least the tag width. The low tag bits are architecturally unowned by
-the allocator — they belong to the caller to stamp a tag into.
-
-**Caller contract.** Callers of bump-style allocators must pass sizes
-that are multiples of the alignment. For the step-2 bump allocator
-that's 8-byte multiples; the caller rounds up. A mature allocator
-(step 9 onward) rounds internally, but the current one trusts the
-caller.
-
-## Staged implementation plan
-
-1. **Spike across all three arches.** *Done.* `lispcc/hello.M1` and
-   `lispcc/demo.M1` run on aarch64, amd64, and riscv64 via existing
-   `M1` + `hex2_linker` (amd64, aarch64) / `hex2_word` (riscv64). Ops
-   demonstrated: `LI`, `SYSCALL`, `MOV`, `ADD`, `SUB`. The aarch64
-   `hex2_word` extensions in the work list above were *not* needed —
-   the inline-data `LI` trick sidesteps them. Order was reversed from
-   the original plan: aarch64 first (where the trick was designed),
-   then amd64 and riscv64.
-2. **Broaden the demonstrated op set.** *Done.* `demo.M1` exercises
-   control flow (`B`, `BEQ`, `BNE`, `BLT`, `CALL`, `RET`, `TAIL`),
-   loads/stores (`LD`/`ST`/`LB`/`SB`), and the full
-   arithmetic/logical/shift/mul-div set across tranches 1–5. All
-   reachable with stock hex2; no extensions required.
-3. **Generator for the ~30-op × register matrix.** *Done.*
-   `p1_gen.py` is the single source of truth for all three
-   `p1_<arch>.M1` defs files. Each row is an `(op, reg-tuple, imm)`
-   triple; per-arch encoders lower rows to native bytes. Includes the
-   N-slot `PROLOGUE_Nk` / `EPILOGUE_Nk` / `TAIL_Nk` variants (k=1..4).
-   Regenerate with `make gen`; CI-check freshness with `make check-gen`.
-4. **Cross-arch differential harness.** Assemble each P1 source three
-   ways and diff runtime behavior. Currently eyeballed via
-   `make run-all`.
-5. **Write something real.** *In progress.* `lisp.M1` is the seed Lisp
-   interpreter target (cons, car, cdr, eq, atom, cond, lambda, quote)
-   running identically on all three arches. Step 2 (cons/car/cdr +
-   tagged values) landed; the remaining staged steps live in
-   `LISP.md`.
-
-## Open questions
-
-- **Can we reuse hand-written `SYSCALL`/syscall-number conventions already in
-  stage0's arch ports?** Probably yes — adopt the conventions already in
-  `M2libc/<arch>/` to minimize surprise.
-- **Signed-extending loads.** Skipped for v1 — add `LBS`, `LWS` if the Lisp
-  interpreter needs them.
-- **Atomic / multi-core.** Not in scope. Seed interpreters are single-
-  threaded.
-- **Debug info.** `blood-elf` generates M1-format debug tables; we'd need to
-  decide whether P1 flows through it unchanged. Likely yes since P1 is just
-  another M1 source.
-- **x86-32 / armv7l / riscv32 support.** Requires narrowing the register
-  model and splitting word size. Defer.
+The total allocation size is:
 
-## Scope
+`round_up(STACK_ALIGN, 2*WORD_SIZE + size)`
+
+The named frame-local bytes are the usable local storage. Any additional bytes
+introduced by alignment rounding are padding, not extra local bytes.
+
+`LEAVE` tears down the current standard frame and restores the hidden return
+continuation so that a subsequent `RET` returns correctly.
+
+Because every standard frame stores the saved caller stack pointer at
+`[sp + 1*WORD]`, `LEAVE` does not need to know the frame-local byte count used
+by the corresponding `ENTER`.
+
+A function may omit `ENTER` / `LEAVE` entirely if it is a leaf and needs no
+standard frame.
+
+`ENTER` and `LEAVE` do not implicitly save or restore the callee-saved
+registers `s0`–`s3`. A function that modifies any of them must preserve it
+explicitly, typically by storing it in frame-local storage within its
+standard frame.
+
+### Branching
+
+P1 v2 branch targets are carried through the hidden branch-target mechanism
+`br`. Portable source may load `br` only through:
+
+- `LA_BR %label` — materialize the address of `%label` as the next branch, call,
+  or tail-call target
+
+No branch, call, or tail opcode takes a label operand directly. Portable source
+must treat `br` as owned by the control-flow machinery. No live value may be
+carried in `br`. Each `LA_BR` must be consumed by the immediately following
+branch, call, or tail op, and portable source must not rely on `br` surviving
+across any other instruction.
+
+The portable branch families are:
+
+- `B` — unconditional branch to the target in `br`
+- `BR rs` — unconditional branch to the code pointer in `rs`
+- `BEQ`, `BNE`, `BLT`, `BLTU` — conditional branch to the target in `br`
+- `BEQZ`, `BNEZ`, `BLTZ` — conditional branch to the target in `br` using zero
+  as the second operand
+
+`BLT` and `BLTZ` perform signed comparisons on one-word values. `BLTU`
+performs an unsigned comparison on one-word values; there is no unsigned
+zero-operand variant because `x < 0` is always false under unsigned
+interpretation.
+
+If a branch condition is true, control transfers to the target currently held in
+`br`. If the condition is false, execution falls through to the next
+instruction.
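+
+A sketch of a countdown loop in descriptive notation. The `:loop` label
+definition follows the `hex2` label style, and the operand spelling of
+`BNEZ rs` and the `#` comments are assumptions, since this document does not
+give literal branch syntax.
+
+```
+LI   t0, 10          # loop counter
+:loop
+ADDI t0, t0, -1      # decrement
+LA_BR %loop          # load the branch target immediately before the branch
+BNEZ t0              # taken while t0 != 0; falls through when t0 reaches 0
+```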
+
+## Data Ops
+
+### Arithmetic
+
+P1 v2 defines the following arithmetic and bitwise operations on one-word
+values:
+
+- register-register: `ADD`, `SUB`, `AND`, `OR`, `XOR`, `SHL`, `SHR`, `SAR`,
+  `MUL`, `DIV`, `REM`
+- immediate: `ADDI`, `ANDI`, `ORI`, `SHLI`, `SHRI`, `SARI`
+
+For `ADD`, `SUB`, `MUL`, `AND`, `OR`, and `XOR`, computation is modulo the
+active word size.
+
+`SHL` shifts left and discards high bits. `SHR` is a logical right shift and
+zero-fills. `SAR` is an arithmetic right shift and sign-fills.
+
+For register-count shifts, only the low `5` bits of the shift count are
+observed in `P1v2-32`, and only the low `6` bits are observed in `P1v2-64`.
+
+Immediate-form shifts use inline immediates in the range `0..31` in `P1v2-32`
+and `0..63` in `P1v2-64`.
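+
+For instance, a register shift count of 33 behaves differently in the two
+variants (descriptive notation; the register choices and `#` comments are
+illustrative):
+
+```
+LI  t0, 1
+LI  t1, 33
+SHL t2, t0, t1       # P1v2-64: count = 33 & 63 = 33, so t2 = 1 << 33
+                     # P1v2-32: count = 33 & 31 = 1,  so t2 = 2
+```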
+
+`DIV` is signed division on one-word two's-complement values and truncates
+toward zero. `REM` is the corresponding signed remainder.
+
+Division by zero is outside the portable contract. The overflow case
+`MIN_INT / -1` is also outside the portable contract, as is the corresponding
+remainder case.
 
-- **Defs files**: ~1500 entries × 3 arches, generator-driven.
-- **Testing**: shared harness that assembles each P1 source three ways
-  and diffs runtime behavior.
+### Moves
+
+P1 v2 defines the following move and materialization operations:
+
+- `MOV` — register-to-register copy
+- `LI` — load one-word integer constant
+- `LA` — load label address
+
+`MOV` may copy from any exposed general register to any exposed general
+register.
+
+Portable source may also read the current stack pointer through `MOV rd, sp`.
+
+Portable source may not write `sp` through `MOV`. Stack-pointer updates are only
+performed by `ENTER`, `LEAVE`, and backend-private call/return machinery.
+
+`LI` materializes an integer bit-pattern. `LA` materializes the address of a
+label. `LA_BR` is a separate control-flow-target materialization form and is not
+part of the general move family.
+
+### Memory
+
+P1 v2 defines the following memory-access operations:
+
+- `LD`, `ST` — one-word load and store
+- `LB`, `SB` — byte load and store
+- `LDARG` — one-word load from the incoming stack-argument area
+
+`LD` and `ST` access one full word: 4 bytes in `P1v2-32` and 8 bytes in
+`P1v2-64`.
+
+`LB` loads one byte and zero-extends it to a full word. `SB` stores the low
+8 bits of the source value.
+
+Memory offsets use signed 12-bit inline immediates.
+
+The base address for a memory access may be any exposed general register or
+`sp`.
+
+`LDARG rd, idx` loads incoming stack-argument slot `idx`, where slot `0` is the
+first stack-passed explicit argument word. `idx` is word-indexed, not
+byte-indexed. `LDARG` is an ABI access, not a general memory operation; it does
+not expose or imply any raw `sp`-relative layout at function entry.
+
+`LDARG` is valid only when the current function has an active standard frame.
+
+Portable source must not assume that labels are aligned beyond what is
+explicitly established by the program itself. Portable code should use
+naturally aligned addresses for `LD` and `ST`. Unaligned word accesses are
+outside the portable contract. Byte accesses have no additional alignment
+requirement.
+
+## System
+
+`SYSCALL` is part of the portable ISA surface.
+
+At the portable level, the syscall convention is:
+
+- `a0` = syscall number on entry, return value on exit
+- `a1`, `a2`, `a3`, `t0`, `s0`, `s1` = syscall arguments 0 through 5
+
+At the portable level, `SYSCALL` clobbers only `a0`. All other exposed
+registers are preserved across the syscall.
+
+The mapping from symbolic syscall names to numeric syscall identifiers is
+target-defined. The set of syscalls available to a given program is likewise
+specified outside the core P1 v2 ISA, for example by a target profile or
+runtime interface document.
+
+## Target notes
+
+- `a0` is argument 0, the one-word direct return-value register, the low word
+  of the two-word direct return pair, and the indirect-result buffer pointer.
+- On aarch64, riscv64, arm32, and rv32, that matches the native integer/pointer
+  ABI directly.
+- On amd64, the backend must translate between portable `a0` and native
+  return register `rax` at call and return boundaries. For the two-word direct
+  return, the backend must also translate `a1` against native `rdx`.
+- On amd64, `LDARG` must account for the return address pushed by the native
+  `call` instruction. On aarch64, riscv64, arm32, and rv32, it maps more
+  directly to the entry `sp` plus the backend's standard frame/header policy.
+- `br` is implemented as a dedicated hidden native register on every target.
+- On arm32, `t1` and `t2` map to natively callee-saved registers; the backend
+  is responsible for preserving them across function boundaries in accordance
+  with the native ABI, even though P1 treats them as caller-saved.
+- Frame-pointer use is backend policy, not part of the P1 v2 architectural
+  register set.
+
+### Native register mapping
+
+#### 64-bit targets
+
+| P1   | amd64 | aarch64 | riscv64 |
+|------|-------|---------|---------|
+| `a0` | `rdi` | `x0`    | `a0`    |
+| `a1` | `rsi` | `x1`    | `a1`    |
+| `a2` | `rdx` | `x2`    | `a2`    |
+| `a3` | `rcx` | `x3`    | `a3`    |
+| `t0` | `r10` | `x9`    | `t0`    |
+| `t1` | `r11` | `x10`   | `t1`    |
+| `t2` | `r8`  | `x11`   | `t2`    |
+| `s0` | `rbx` | `x19`   | `s1`    |
+| `s1` | `r12` | `x20`   | `s2`    |
+| `s2` | `r13` | `x21`   | `s3`    |
+| `s3` | `r14` | `x22`   | `s4`    |
+| `sp` | `rsp` | `sp`    | `sp`    |
 
-The output is a single portable ISA above which any seed-stage program
-(Lisp, Forth, a smaller C compiler) can be written once and run on three
-hosts. Below M2-Planet in the chain, above raw M1. Leans entirely on
-existing `M1` + `hex2` — no toolchain modifications.
+#### 32-bit targets
+
+| P1   | arm32 | rv32  |
+|------|-------|-------|
+| `a0` | `r0`  | `a0`  |
+| `a1` | `r1`  | `a1`  |
+| `a2` | `r2`  | `a2`  |
+| `a3` | `r3`  | `a3`  |
+| `t0` | `r12` | `t0`  |
+| `t1` | `r6`  | `t1`  |
+| `t2` | `r7`  | `t2`  |
+| `s0` | `r4`  | `s1`  |
+| `s1` | `r5`  | `s2`  |
+| `s2` | `r8`  | `s3`  |
+| `s3` | `r9`  | `s4`  |
+| `sp` | `sp`  | `sp`  |
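
The memory rules added to `docs/P1.md` above (signed 12-bit offsets, zero-extending `LB`, `SB` storing the low 8 bits, and word-sized `LD`/`ST`) can be illustrated with a small Python model. This is only a sketch under stated assumptions, not part of the commit: the function names, the flat `bytearray` memory, and the little-endian byte order are all hypothetical choices, and raising an error on an unaligned word access is one possible way to model behavior that the spec merely places outside the portable contract.

```python
# Hypothetical model of the P1 v2 memory ops; not taken from the repository.
WORD_BYTES = {"P1v2-32": 4, "P1v2-64": 8}


def _address(mem, base, off, size):
    """Compute base + off after checking the signed 12-bit offset range."""
    if not -2048 <= off <= 2047:
        raise ValueError("offset does not fit a signed 12-bit immediate")
    addr = base + off
    if addr < 0 or addr + size > len(mem):
        raise IndexError("access outside the modeled memory")
    return addr


def lb(mem, base, off):
    """LB: load one byte, zero-extended to a full word."""
    return mem[_address(mem, base, off, 1)]


def sb(mem, base, off, value):
    """SB: store only the low 8 bits of the source value."""
    mem[_address(mem, base, off, 1)] = value & 0xFF


def ld(mem, base, off, variant):
    """LD: load one full word (assumed little-endian in this model)."""
    size = WORD_BYTES[variant]
    addr = _address(mem, base, off, size)
    if addr % size:
        raise ValueError("unaligned word access is outside the portable contract")
    return int.from_bytes(mem[addr:addr + size], "little")


def st(mem, base, off, value, variant):
    """ST: store one full word, truncating the value to the word size."""
    size = WORD_BYTES[variant]
    addr = _address(mem, base, off, size)
    if addr % size:
        raise ValueError("unaligned word access is outside the portable contract")
    mem[addr:addr + size] = (value & ((1 << (8 * size)) - 1)).to_bytes(size, "little")
```

In this model, `sb(mem, 4, -1, 0x1FF)` followed by `lb(mem, 0, 3)` yields `0xFF`, which shows both the negative offset and the truncation to 8 bits.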
diff --git a/docs/P1v2.md b/docs/P1v2.md
@@ -1,531 +0,0 @@
-# P1 v2
-
-## Scope
-
-P1 v2 is a portable pseudo-ISA for standalone executables.
-
-P1 v2 has two width variants:
-
-- **P1v2-64** — one word is one 64-bit integer or pointer value
-- **P1v2-32** — one word is one 32-bit integer or pointer value
-
-Portable source may use any number of word arguments. The first four argument
-registers are explicit, and additional argument words are passed through a
-portable incoming stack-argument area.
-
-Portable source may directly return `0..2` words. Wider results use the
-portable indirect-result convention described below.
-
-## Toolchain envelope
-
-P1 v2 must be assemblable through the existing `M0` + `hex2` path, with
-`catm` as the only composition primitive between source or generated fragments.
-The spec therefore assumes only the following toolchain features:
-
-- `M0`-level `DEFINE name hex_bytes` substitution
-- raw byte emission
-- labels and label references supported by `hex2`
-- file concatenation via `catm`
-
-## Source notation
-
-This document describes instructions using ordinary assembly notation such as
-`ADD rd, ra, rb`, `LD rd, [ra + off]`, or `CALL`.
-
-Because of the toolchain constraints above, portable source does not encode
-most operands as textual instruction arguments. Instead, register choices,
-inline immediate values, and small fixed parameters are fused into opcode
-names, following the generated-table style used by `src/p1_gen.py`.
-
-So the notation in this document is descriptive rather than literal:
-
-- `ADD rd, ra, rb` means a family of fused register-specific opcodes
-- `ADDI rd, ra, imm` means a family of fused register-and-immediate-specific
-  opcodes
-- `ENTER size` means a family of fused byte-count-specific opcodes
-- `LDARG rd, idx` means a family of fused register-and-argument-slot-specific
-  opcodes
-- `BR rs`, `CALLR rs`, and `TAILR rs` mean register-specific control-flow
-  opcodes
-- `LEAVE`, `CALL`, `RET`, `TAIL`, `B`, and `SYSCALL` remain operand-free
-
-Labels still appear in source where the toolchain supports them directly, such
-as `LA rd, %label` and `LA_BR %label`.
-
-## Register Model
-
-### Exposed registers
-
-P1 v2 exposes the following source-level registers:
-
-- `a0`–`a3` — argument registers. Also caller-saved general registers.
-- `t0`–`t2` — caller-saved temporaries.
-- `s0`–`s3` — callee-saved general registers.
-- `sp` — stack pointer.
-
-### Hidden registers
-
-The backend may reserve additional native registers that are never visible in
-P1 source:
-
-- `br` — branch / call target mechanism, implemented as a dedicated hidden
-  native register on every target
-- backend-local scratch used entirely within one instruction expansion
-
-No hidden register may carry a live P1 value across an instruction boundary.
-
-## Calling Convention
-
-### Arguments and return values
-
-P1 v2 defines three result conventions: one-word direct, two-word direct, and
-indirect.
-
-In the one-word direct-result convention:
-
-- Explicit argument words 0-3 live in `a0-a3`.
-- Additional explicit argument words live in the incoming stack-argument area
-  and are read with `LDARG`.
-- On return, a one-word result lives in `a0`.
-
-In the two-word direct-result convention:
-
-- Explicit argument words 0-3 live in `a0-a3` on entry.
-- Additional explicit argument words still live in the incoming
-  stack-argument area.
-- On return, `a0` holds result word 0 and `a1` holds result word 1.
-
-In the indirect-result convention:
-
-- The caller passes a writable result buffer pointer in `a0`.
-- Explicit argument words 0-2 then live in `a1-a3`.
-- Additional explicit argument words still live in the incoming
-  stack-argument area.
-- On return, `a0` holds the same result buffer pointer value.
-
-In both direct-result conventions, incoming stack-argument slot `0` corresponds
-to explicit argument word `4`. In the indirect-result convention, incoming
-stack-argument slot `0` corresponds to explicit argument word `3`.
-
-The two-word direct-result convention covers common cases such as 64-bit
-integer results on 32-bit targets, two-word aggregates, and divmod-style
-returns. The indirect-result convention is the portable way to return any
-result wider than two words.
-
-### Register preservation
-
-Caller-saved:
-
-- `a0`–`a3`
-- `t0`–`t2`
-
-Callee-saved:
-
-- `s0`–`s3`
-- `sp`
-
-### Call semantics
-
-A call is valid from any function, including a leaf. Call / return correctness
-does not depend on establishing a frame first.
-
-If a function needs any incoming argument after making a call, it must save it
-before the call. This matters in particular for `a0`, which is overwritten by
-every convention's return value, and for `a1` when the callee uses the two-word
-direct-result convention.
-
-A call that passes any stack argument words requires the caller to have an
-active standard frame with enough frame-local storage to stage those outgoing
-words.
-
-The return address is hidden machine state. Portable source must not assume
-that it lives in any exposed register.
-
-## Stack Convention
-
-### Call-boundary rule
-
-At every call boundary, the backend must satisfy the native C ABI stack
-alignment rule for the target architecture.
-
-Portable source must therefore treat raw function-entry `sp` as opaque. It may
-not assume that the low bits of `sp` have the same meaning on all targets
-before a frame is established.
-
-### Incoming stack-argument area
-
-P1 v2 defines an abstract incoming stack-argument area for explicit argument
-words that do not fit in registers.
-
-- Slot `0` is the first stack-passed explicit argument word.
-- Slots are word-indexed, not byte-indexed.
-- Portable source may access this area only through `LDARG`.
-
-`LDARG` is valid only when the current function has an active standard frame.
-Therefore, a function that needs any incoming stack argument must establish a
-standard frame before its first `LDARG`.
-
-Portable source must not assume any direct relationship between incoming
-argument slots and raw function-entry `sp`. In particular, source must not try
-to reconstruct stack arguments by manually indexing from `sp`; backend entry
-layouts differ across targets.
-
-For a call with `m` stack-passed explicit argument words, the caller stages
-those words in the first `m` words of its frame-local storage immediately
-before the call:
-
-```
-[sp + 2*WORD + 0*WORD] = outgoing arg word 0
-[sp + 2*WORD + 1*WORD] = outgoing arg word 1
-...
-```
-
-At callee entry, those staged words become incoming argument slots `0..m-1`.
-The backend is responsible for mapping between the caller's frame layout and
-the callee's abstract incoming argument slots.
-
-Portable code that needs both ordinary locals and stack-passed outgoing
-arguments must reserve enough total frame-local storage and keep the
-low-addressed prefix available for outgoing argument staging across the call.
-
-### Standard frame layout
-
-Functions that need local stack storage use a standard frame layout. After
-frame establishment:
-
-```
-[sp + 0*WORD] = saved return address
-[sp + 1*WORD] = saved caller stack pointer
-[sp + 2*WORD ... sp + 2*WORD + local_bytes - 1] = frame-local storage
-...
-```
-
-Frame-local storage is byte-addressed. Portable code may use it for ordinary
-locals, spilled callee-saved registers, and the caller-staged outgoing
-stack-argument words described above.
-
-Total frame size is:
-
-`round_up(STACK_ALIGN, 2*WORD_SIZE + local_bytes)`
-
-Where:
-
-- `WORD_SIZE = 8` in P1v2-64
-- `WORD_SIZE = 4` in P1v2-32
-- `STACK_ALIGN` is target-defined and must satisfy the native call ABI
-
-Leaf functions that need no frame-local storage may omit the frame entirely.
-
-### Frame invariants
-
-- A function that allocates a frame must restore `sp` before returning.
-- Callee-saved registers modified by the function must be restored before
-  returning.
-- The standard frame layout is the only frame shape recognized by P1 v2.
-
-## Op Set Summary
-
-| Category | Operations |
-|----------|------------|
-| Materialization | `LI rd, imm`, `LA rd, %label`, `LA_BR %label` |
-| Moves | `MOV rd, rs`, `MOV rd, sp` |
-| Arithmetic | `ADD`, `SUB`, `AND`, `OR`, `XOR`, `SHL`, `SHR`, `SAR`, `MUL`, `DIV`, `REM` |
-| Immediate arithmetic | `ADDI`, `ANDI`, `ORI`, `SHLI`, `SHRI`, `SARI` |
-| Memory | `LD`, `ST`, `LB`, `SB` |
-| ABI access | `LDARG` |
-| Branching | `B`, `BR`, `BEQ`, `BNE`, `BLT`, `BLTU`, `BEQZ`, `BNEZ`, `BLTZ` |
-| Calls / returns | `CALL`, `CALLR`, `RET`, `TAIL`, `TAILR` |
-| Frame management | `ENTER`, `LEAVE` |
-| System | `SYSCALL` |
-
-## Immediates
-
-Immediate operands appear only in instructions that explicitly admit them.
-Portable source has three immediate classes:
-
-- **Inline integer immediate** — a signed 12-bit assembly-time constant in the
-  range `-2048..2047`
-- **Materialized word value** — a full one-word assembly-time constant loaded
-  with `LI`
-- **Materialized address** — the address of a label loaded with `LA`
-
-P1 v2 also uses two structured assembly-time operands:
-
-- **Frame-local byte count** — a non-negative byte count used by `ENTER`
-- **Argument-slot index** — a non-negative word-slot index used by `LDARG`
-
-`LI rd, imm` loads the one-word integer value `imm`.
-
-`LA rd, %label` loads the address of `%label` as a one-word pointer value.
-
-The backend may realize `LI` and `LA` using native immediates, literal pools,
-multi-instruction sequences, or other backend-private mechanisms.
-
-Backends may assume labels fit in 32 bits when realizing `LA` and `LA_BR`.
-This reflects the stage0 image layout (`hex2-0` base `0x00600000`, programs
-well under 4 GB), not a portable-ISA-level guarantee. Backends that target
-images loaded above the 4 GB boundary must adjust their `LA` / `LA_BR`
-lowering. `LI` makes no such assumption — it materializes any one-word value.
-
-## Control Flow
-
-### Call / Return / Tail Call
-
-Control-flow targets are materialized with `LA_BR %label`, which loads
-`%label` into the hidden branch-target mechanism `br`. The immediately
-following control-flow op consumes that target.
-
-`CALL` transfers control to the target most recently loaded by `LA_BR` and
-establishes a return continuation such that a subsequent `RET` returns to the
-instruction after the `CALL`. `CALL` is valid whether or not the caller has
-established a standard frame, except that any call using stack-passed argument
-words requires an active standard frame to hold the staged outgoing words.
-
-`CALLR rs` is the register-indirect form of `CALL`. It transfers control to
-the code pointer value held in `rs` and establishes the same return
-continuation semantics as `CALL`.
-
-`RET` returns through the current return continuation. `RET` is valid whether
-or not the current function has established a standard frame, provided any
-frame established by the function has already been torn down.
-
-`TAIL` is a tail call to the target most recently loaded by `LA_BR`. It is
-valid only when the current function has an active standard frame. `TAIL`
-performs the standard epilogue for the current frame and then transfers control
-to the loaded target without creating a new return continuation. The callee
-therefore returns directly to the current function's caller.
-
-`TAILR rs` is the register-indirect form of `TAIL`. It is valid only when the
-current function has an active standard frame.
-
-Because stack-passed outgoing argument words are staged in the caller's own
-frame-local storage, `TAIL` and `TAILR` are portable only when the tail-called
-callee requires no stack-passed argument words. Portable compilers must lower
-other tail-call cases to an ordinary `CALL` / `RET` sequence.
-
-Portable source must treat the return continuation as hidden machine state. It
-must not assume that the return address lives in any exposed register or stack
-location except as defined by the standard frame layout after frame
-establishment.
-
-### Prologue / Epilogue
-
-P1 v2 defines the following frame-establishment and frame-teardown operations:
-
-- `ENTER size`
-- `LEAVE`
-
-`ENTER size` establishes the standard frame layout with `size` bytes of
-frame-local storage:
-
-```
-[sp + 0*WORD] = saved return address
-[sp + 1*WORD] = saved caller stack pointer
-[sp + 2*WORD ... sp + 2*WORD + size - 1] = frame-local storage
-```
-
-The total allocation size is:
-
-`round_up(STACK_ALIGN, 2*WORD_SIZE + size)`
-
-The named frame-local bytes are the usable local storage. Any additional bytes
-introduced by alignment rounding are padding, not extra local bytes.
-
-`LEAVE` tears down the current standard frame and restores the hidden return
-continuation so that a subsequent `RET` returns correctly.
-
-Because every standard frame stores the saved caller stack pointer at
-`[sp + 1*WORD]`, `LEAVE` does not need to know the frame-local byte count used
-by the corresponding `ENTER`.
-
-A function may omit `ENTER` / `LEAVE` entirely if it is a leaf and needs no
-standard frame.
-
-`ENTER` and `LEAVE` do not implicitly save or restore any callee-saved register
-(`s0`–`s3`). A function that modifies one of them must preserve it explicitly,
-typically by storing it in frame-local storage within its standard frame.
-
-### Branching
-
-P1 v2 branch targets are carried through the hidden branch-target mechanism
-`br`. Portable source may load `br` only through:
-
-- `LA_BR %label` — materialize the address of `%label` as the next branch, call,
-  or tail-call target
-
-No branch, call, or tail opcode takes a label operand directly. Portable source
-must treat `br` as owned by the control-flow machinery. No live value may be
-carried in `br`. Each `LA_BR` must be consumed by the immediately following
-branch, call, or tail op, and portable source must not rely on `br` surviving
-across any other instruction.
-
-The portable branch families are:
-
-- `B` — unconditional branch to the target in `br`
-- `BR rs` — unconditional branch to the code pointer in `rs`
-- `BEQ`, `BNE`, `BLT`, `BLTU` — conditional branch to the target in `br`
-- `BEQZ`, `BNEZ`, `BLTZ` — conditional branch to the target in `br` using zero
-  as the second operand
-
-`BLT` and `BLTZ` perform signed comparisons on one-word values. `BLTU`
-performs an unsigned comparison on one-word values; there is no unsigned
-zero-operand variant because `x < 0` is always false under unsigned
-interpretation.
-
-If a branch condition is true, control transfers to the target currently held in
-`br`. If the condition is false, execution falls through to the next
-instruction.
-
-## Data Ops
-
-### Arithmetic
-
-P1 v2 defines the following arithmetic and bitwise operations on one-word
-values:
-
-- register-register: `ADD`, `SUB`, `AND`, `OR`, `XOR`, `SHL`, `SHR`, `SAR`,
-  `MUL`, `DIV`, `REM`
-- immediate: `ADDI`, `ANDI`, `ORI`, `SHLI`, `SHRI`, `SARI`
-
-For `ADD`, `SUB`, `MUL`, `AND`, `OR`, and `XOR`, computation is modulo the
-active word size.
-
-`SHL` shifts left and discards high bits. `SHR` is a logical right shift and
-zero-fills. `SAR` is an arithmetic right shift and sign-fills.
-
-For register-count shifts, only the low `5` bits of the shift count are
-observed in `P1v2-32`, and only the low `6` bits are observed in `P1v2-64`.
-
-Immediate-form shifts use inline immediates in the range `0..31` in `P1v2-32`
-and `0..63` in `P1v2-64`.
-
-`DIV` is signed division on one-word two's-complement values and truncates
-toward zero. `REM` is the corresponding signed remainder.
-
-Division by zero is outside the portable contract. The overflow case
-`MIN_INT / -1` is also outside the portable contract, as is the corresponding
-remainder case.
-
-### Moves
-
-P1 v2 defines the following move and materialization operations:
-
-- `MOV` — register-to-register copy
-- `LI` — load one-word integer constant
-- `LA` — load label address
-
-`MOV` may copy from any exposed general register to any exposed general
-register.
-
-Portable source may also read the current stack pointer through `MOV rd, sp`.
-
-Portable source may not write `sp` through `MOV`. Stack-pointer updates are only
-performed by `ENTER`, `LEAVE`, and backend-private call/return machinery.
-
-`LI` materializes an integer bit-pattern. `LA` materializes the address of a
-label. `LA_BR` is a separate control-flow-target materialization form and is not
-part of the general move family.
-
-### Memory
-
-P1 v2 defines the following memory-access operations:
-
-- `LD`, `ST` — one-word load and store
-- `LB`, `SB` — byte load and store
-- `LDARG` — one-word load from the incoming stack-argument area
-
-`LD` and `ST` access one full word: 4 bytes in `P1v2-32` and 8 bytes in
-`P1v2-64`.
-
-`LB` loads one byte and zero-extends it to a full word. `SB` stores the low
-8 bits of the source value.
-
-Memory offsets use signed 12-bit inline immediates.
-
-The base address for a memory access may be any exposed general register or
-`sp`.
-
-`LDARG rd, idx` loads incoming stack-argument slot `idx`, where slot `0` is the
-first stack-passed explicit argument word. `idx` is word-indexed, not
-byte-indexed. `LDARG` is an ABI access, not a general memory operation; it does
-not expose or imply any raw `sp`-relative layout at function entry.
-
-`LDARG` is valid only when the current function has an active standard frame.
-
-Portable source must not assume that labels are aligned beyond what is
-explicitly established by the program itself. Portable code should use
-naturally aligned addresses for `LD` and `ST`. Unaligned word accesses are
-outside the portable contract. Byte accesses have no additional alignment
-requirement.
-
-## System
-
-`SYSCALL` is part of the portable ISA surface.
-
-At the portable level, the syscall convention is:
-
-- `a0` = syscall number on entry, return value on exit
-- `a1`, `a2`, `a3`, `t0`, `s0`, `s1` = syscall arguments 0 through 5
-
-At the portable level, `SYSCALL` clobbers only `a0`. All other exposed
-registers are preserved across the syscall.
-
-The mapping from symbolic syscall names to numeric syscall identifiers is
-target-defined. The set of syscalls available to a given program is likewise
-specified outside the core P1 v2 ISA, for example by a target profile or
-runtime interface document.
-
-## Target notes
-
-- `a0` is argument 0, the one-word direct return-value register, the low word
-  of the two-word direct return pair, and the indirect-result buffer pointer.
-- On aarch64, riscv64, arm32, and rv32, that matches the native integer/pointer
-  ABI directly.
-- On amd64, the backend must translate between portable `a0` and native
-  return register `rax` at call and return boundaries. For the two-word direct
-  return, the backend must also translate `a1` against native `rdx`.
-- On amd64, `LDARG` must account for the return address pushed by the native
-  `call` instruction. On aarch64, riscv64, arm32, and rv32, it maps more
-  directly to the entry `sp` plus the backend's standard frame/header policy.
-- `br` is implemented as a dedicated hidden native register on every target.
-- On arm32, `t1` and `t2` map to natively callee-saved registers; the backend
-  is responsible for preserving them across function boundaries in accordance
-  with the native ABI, even though P1 treats them as caller-saved.
-- Frame-pointer use is backend policy, not part of the P1 v2 architectural
-  register set.
-
-### Native register mapping
-
-#### 64-bit targets
-
-| P1   | amd64 | aarch64 | riscv64 |
-|------|-------|---------|---------|
-| `a0` | `rdi` | `x0`    | `a0`    |
-| `a1` | `rsi` | `x1`    | `a1`    |
-| `a2` | `rdx` | `x2`    | `a2`    |
-| `a3` | `rcx` | `x3`    | `a3`    |
-| `t0` | `r10` | `x9`    | `t0`    |
-| `t1` | `r11` | `x10`   | `t1`    |
-| `t2` | `r8`  | `x11`   | `t2`    |
-| `s0` | `rbx` | `x19`   | `s1`    |
-| `s1` | `r12` | `x20`   | `s2`    |
-| `s2` | `r13` | `x21`   | `s3`    |
-| `s3` | `r14` | `x22`   | `s4`    |
-| `sp` | `rsp` | `sp`    | `sp`    |
-
-#### 32-bit targets
-
-| P1   | arm32 | rv32  |
-|------|-------|-------|
-| `a0` | `r0`  | `a0`  |
-| `a1` | `r1`  | `a1`  |
-| `a2` | `r2`  | `a2`  |
-| `a3` | `r3`  | `a3`  |
-| `t0` | `r12` | `t0`  |
-| `t1` | `r6`  | `t1`  |
-| `t2` | `r7`  | `t2`  |
-| `s0` | `r4`  | `s1`  |
-| `s1` | `r5`  | `s2`  |
-| `s2` | `r8`  | `s3`  |
-| `s3` | `r9`  | `s4`  |
-| `sp` | `sp`  | `sp`  |
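
The frame-size rule in the removed `docs/P1v2.md` above, `round_up(STACK_ALIGN, 2*WORD_SIZE + size)`, together with the `[sp + 2*WORD + i*WORD]` staging offsets, can be worked through in a few lines of Python. This is a minimal sketch: the helper names are hypothetical, and the `stack_align` values used in the examples are illustrative only, since the spec leaves `STACK_ALIGN` target-defined.

```python
# Hypothetical helpers for the standard frame layout; names are illustrative.
WORD_SIZE = {"P1v2-32": 4, "P1v2-64": 8}


def round_up(align, n):
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align


def frame_size(variant, local_bytes, stack_align):
    """Total allocation: two header words plus locals, rounded to the alignment."""
    return round_up(stack_align, 2 * WORD_SIZE[variant] + local_bytes)


def local_offset(variant, byte_index):
    """sp-relative offset of a byte in frame-local storage (after the two header words)."""
    return 2 * WORD_SIZE[variant] + byte_index


def outgoing_arg_offset(variant, slot):
    """sp-relative offset at which the caller stages outgoing stack-argument word `slot`."""
    word = WORD_SIZE[variant]
    return 2 * word + slot * word
```

For example, with 24 local bytes on `P1v2-64` and an assumed 16-byte alignment, the header takes 16 bytes, the sum is 40, and the allocation rounds up to 48; the extra 8 bytes are padding, not usable locals.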
diff --git a/docs/PLAN.md b/docs/PLAN.md
@@ -1,210 +0,0 @@
-# Alternative bootstrap path: Lisp-in-P1 → C compiler in Lisp → tcc-boot
-
-## Goal
-
-Shrink the auditable LOC between M1 assembly and tcc-boot by replacing the
-current `M2-Planet → mes → MesCC → nyacc` stack with a small Lisp written
-once in the P1 portable pseudo-ISA (see [P1.md](P1.md)) and a C compiler written
-in that Lisp. P1 consists of ~30 RISC-shaped ops whose per-arch `DEFINE` tables
-expand to amd64 / aarch64 / riscv64 encodings, so one Lisp source serves all
-three hosts.
-
-## Current chain (validated counts)
-
-| Layer | Lang | Lines |
-|---|---|---|
-| `cc_amd64.M1` (subset-C compiler in M1 asm) | M1 | 5,413 (~3,152 actual instructions) |
-| M2-Planet (`*.c`, compiles mes) | C | 8,140 |
-| Mes interpreter (`src/*.c`) | C | 7,033 |
-| Mes headers (`include/mes/*.h`) | C | 6,145 |
-| MesCC + mes Scheme (`module/`) | Scheme | 8,271 |
-| Bundled mes runtime (SRFI/ice-9/rnrs shims) | Scheme | 9,191 |
-| nyacc (LALR engine + C99 parser/grammar/cpp) | Scheme | ~10,000 (essentials of 12,868) |
-| **Total auditable** | mixed | **~54,000** |
-
-## Proposed chain
-
-```
-M1 asm  →  P1 pseudo-ISA  →  Lisp interpreter (in P1)  →  C compiler (in Lisp)  →  tcc-boot
-```
-
-Two languages plus one portable asm layer, one new interpreter, one new
-compiler. No M2-Planet, no Mes core, no MesCC, no nyacc. The interpreter is
-authored once in P1 and assembled three ways; porting to a fourth arch means
-a new P1 defs file, not a rewrite.
-
-## Why P1 as the host
-
-- **Single source of truth.** A Lisp in raw M1 asm would need three
-  hand-written variants (one per target arch). In P1, there is one source;
-  the per-arch cost is already paid inside the P1 defs files.
-- **Cost lives in P1, not here.** P1's one-time tax (~1500 defines × 3 arches
-  generator-driven, plus ~240 LOC of `hex2_word` + `M1-macro` aarch64 work)
-  is accounted in `P1.md`. This plan inherits that layer rather than
-  duplicating it.
-- **Dependency ordering.** PLAN cannot start the Lisp interpreter until P1
-  stages 1–4 in `P1.md` are complete (spike on all three arches plus the
-  full ~30-op matrix). P1 stage 5 ("seed Lisp interpreter in ~500 lines of
-  P1") is effectively this plan's kickoff.
-
-## Lisp — feature floor
-
-Justification: empirical audit of MesCC's actual Scheme usage. MesCC barely
-exercises Scheme.
-
-**Required:**
-- Special forms: `define`, `lambda`, `if`, `cond`, `let`, `let*`, `letrec`,
-  `quote`, `quasiquote`/`unquote`, `set!`, `begin`
-- Data: pairs, fixnums, vectors, immutable ASCII strings, symbols
-- Primitives (~40): `cons/car/cdr`, list ops (`map/filter/fold/append/reverse/member/assoc`),
-  arithmetic (`+ - * / %`), bitwise (`and or xor << >>`), string ops
-  (`string-append/string-ref/substring/string-length`), type predicates,
-  `display`/`write`, basic `format` (`~a ~s ~d ~%` only), `apply`, `error`
-- Mark-sweep GC over tagged cells
-- Built-in `pmatch` macro (otherwise hand-expanding 57 call sites in the
-  C compiler costs ~1k extra LOC)
-- A records-via-vectors layer (replaces SRFI-9 `define-immutable-record-type`)
-- File I/O: `read-file path → string` and `write-file path string`. No port
-  type at all. The C lexer indexes into the source string with an integer
-  cursor (gives `read-char`/`peek-char` semantics for free); CPP keeps
-  `#include` context as a stack of (string, cursor) pairs. Codegen
-  accumulates output as a reversed list of chunks and concatenates once
-  via a one-pass variadic `string-append` (or a `string-concat list →
-  string` primitive). Output for tcc-boot is single-digit MB — well within
-  the existing mes 20MB arena budget.
-
-**Deliberately omitted:**
-- `call/cc`, `dynamic-wind`, `parameterize`, exception system
-- Mutable strings, Unicode
-- Bignums, rationals, floats
-- `syntax-rules` / `define-syntax` (only `pmatch` macro is needed)
-- First-class modules (single-file load in dependency order)
-- `do` loops, `delay`/`force`
-
-Tail calls: convenient for AST recursion; not strictly required if stack is
-generous (≥1MB).
-
-## C subset to support
-
-Start from MesCC's already-reduced subset; consider further reductions if
-they justify patching tcc-boot.
-
-**Must support (used by tcc-boot):**
-- Types: `char/short/int/long/long long`, signed/unsigned, pointers, arrays,
-  structs, unions, enums, **bitfields**, typedefs, `void`
-- Storage: `static`, `extern`, `register`; `const`/`volatile` parsed and
-  ignored
-- Operators: full arithmetic/bitwise/relational/logical, compound
-  assignment, ternary, `sizeof` (types and expressions), casts, comma
-- Statements: all loops, switch/case, goto/labels, `&&`/`||` short-circuit
-- Function declarations: ANSI only
-
-**Not supported (and not needed):**
-- `float`/`double` (errors at parse time)
-- `inline` (parsed and stripped, like MesCC)
-- Variadic functions (tcc-boot already works around this)
-- K&R declarations
-- C99 mid-block declarations
-- Statement expressions, nested functions, compound literals,
-  designated initializers
-
-**Candidate further reductions (require tcc-boot patches):**
-- Drop bitfields (significant tcc-boot rework — probably not worth it)
-- Drop compound assignment (modest tcc-boot patches)
-
-**Preprocessor:** target full `#define`/`#include`/`#if`/`#ifdef`/`#elif`/
-`#else`/`#endif` with function-like macros and stringification. tcc's source
-uses these heavily.
-
-## Backend
-
-**Settled: emit P1.** The C compiler is written once in portable Lisp and
-emits portable asm, so both the pre-tcc-boot seed userland (`SEED.md`) and
-tcc-boot itself land on all three arches without a second backend. Codegen
-is slightly harder than direct amd64 — P1 is deliberately dumb, so C
-idioms like `x += y` expand to multi-op P1 sequences — but we pay the
-~2× code-size tax already budgeted in `P1.md` rather than writing three
-backends.
-
-This forecloses the alternative of emitting amd64 M1 directly (simpler
-codegen, single-arch only). That option would have satisfied a
-tcc-boot-only goal, but `SEED.md` requires tri-arch seed binaries, so a
-portable backend is load-bearing.
-
-## Estimated budget
-
-| Component | Lines |
-|---|---|
-| Lisp interpreter in P1 (reader, eval, GC, primitives, I/O, pmatch) | 4,000–6,000 P1 |
-| C lexer + recursive-descent parser + CPP (in Lisp) | 2,000–3,000 |
-| Type checker + IR (slimmed compile.scm + info.scm) | 2,000–3,000 |
-| Codegen + P1 emit (see Backend) | 800–1,500 |
-| **Total auditable (this plan)** | **~9,000–13,000 LOC** |
-
-vs. **~54,000 LOC** current = **~4–6× shrink**, and the result is
-tri-arch instead of amd64-only. P1's own infrastructure (defs files,
-`hex2_word` extensions, generator) is audited once in `P1.md` and shared
-with any future seed-stage program.
-
-## Resolutions
-
-- **Narrow loads: zero-extend only, 8-bit only.** P1 keeps `LB`
-  zero-extending; no `LBS` added, and 32-bit `LW`/`SW` are out of the
-  ISA entirely (emulate through 64-bit `LD`/`ST` + mask/shift if ever
-  needed). Fixnums live in full 64-bit tagged cells, so the interpreter
-  never needs a sign-extended or 32-bit load — byte/ASCII access is
-  unsigned, and arithmetic happens on 64-bit values already.
-- **Static code size: accept the 2× tax.** P1's destructive-expansion
-  rule on amd64 roughly doubles instruction count vs. hand-tuned amd64.
-  Matches P1's "deliberately dumb" contract (see `P1.md`). Interpreter
-  binary expected in low single-digit MB — irrelevant for a seed.
-- **Tail calls: codify `TAIL` in P1.** A new `TAIL %label` macro (see
-  `P1.md`, Control flow) expands to `LD lr, sp, 0; ADDI sp, sp, +16;
-  B %label` or the per-arch equivalent. The interpreter's `eval` is
-  written in the natural recursive style with tail-position calls
-  compiled through `TAIL`, so the P1 stack does not grow per Scheme
-  frame. As a side effect, Scheme-level tail calls fall out R5RS-proper
-  for the interpreter's subset without extra mechanism.
-- **Pool placement: per-function on all arches.** Each function emits its
-  constant pool at its epilogue, inside the aarch64 `LDR`-literal ±1 MiB
-  range. Labels are file-local; duplicated constants across functions
-  are accepted. Simple rule, no range-check logic in codegen.
-- **GC arena: static BSS.** The ~20 MB heap is reserved as a single BSS
-  region at link time. No `brk`/`mmap` at runtime, no arena-sizing flag.
-  Keeps the P1 program to a minimal syscall surface and makes the
-  interpreter image self-describing.
-- **Syscalls: eight.** `read`, `write`, `openat`, `close`, `exit`,
-  `clone`, `execve`, `waitid`. Each becomes one P1 `SYSCALL` op
-  backed by a per-arch number table in the P1 defs file.
-  `read-file` loops `read` into a growable string until EOF (no
-  `stat`/`lseek`); `display`/`write`/`error` go through `write` on
-  fd 1/2; `error` finishes with `exit`. `openat(AT_FDCWD, …)`
-  replaces `open` because aarch64/riscv64 lack bare `open` in the
-  asm-generic table. `clone(SIGCHLD)` + `execve` + `waitid` give
-  the Lisp enough to drive the tcc-boot build directly — see
-  "Build driver" below. No signals, time, or networking.
-
-## Build driver
-
-Once Lisp can spawn, the Lisp program itself is the build driver.
-There is no separate shell. A top-level Lisp source file reads the
-pinned list of tcc-boot translation units, iterates over them, and
-for each one:
-
-1. Reads the `.c` source into a Lisp string.
-2. Calls the Lisp-hosted C compiler (in-process) to produce P1 text.
-3. Writes the P1 text to a temp file.
-4. Spawns M1 (from stage0-posix, via `clone`+`execve`) to assemble
-   P1 → `.hex2`; waits via `waitid`, aborts on non-zero.
-5. Spawns hex2 to emit the final `.o` / ELF; waits, aborts on
-   non-zero.
-
-The seed-tool builds (each mescc-tools-extra source → one ELF) run
-the same loop. Spawn-and-wait is a ~20 LOC Lisp primitive; the full
-driver, including the hard-coded tcc-boot file list, is ~100–200
-LOC of Lisp counted against this plan.
-
-Concentrating orchestration in the Lisp program (rather than a
-separate P1/M1 shell) collapses the post-M1 contribution list to
-exactly three artifacts: P1, the Lisp interpreter, and the C
-compiler.
diff --git a/docs/SEED.md b/docs/SEED.md
@@ -1,303 +0,0 @@
-# Seed userland: the pre-tcc-boot tools
-
-## Goal
-
-Bridge the window between *Lisp exists* and *tcc-boot exists* without
-touching M2-Planet, Mes, or MesCC. Inside that window, all code is
-either a Lisp program running on the Lisp interpreter or one of a
-small set of standalone C binaries compiled through the Lisp-hosted
-C compiler → P1 → M1 → hex2 pipeline.
-
-This document covers only that window. Phases before it (`seed0 →
-hex0/hex1/hex2 → M1`, P1 defs, Lisp interpreter, and the Lisp-hosted
-C compiler) are documented in `P1.md` and `PLAN.md`. tcc-boot itself
-and everything downstream are standard C and out of scope.
-
-## Position in the chain
-
-```
-stage0-posix:   seed0 → hex0 → hex1 → hex2 → M1              (no C, no Lisp)
-P1 layer:       P1 defs files load into M1                   (P1.md)
-Lisp:           P1 text (Lisp interp source) → M1 → hex2     (PLAN.md)
-C compiler:     Lisp program, loaded into the Lisp image     (PLAN.md)
-──────── seed window begins here ────────
-seed tools:     C source → Lisp+Ccc → P1 text → M1 → hex2    (this doc)
-──────── seed window ends when tcc-boot is built ────────
-tcc-boot:       C source → Lisp+Ccc → P1 text → M1 → hex2    (PLAN.md)
-```
-
-One Lisp-hosted C compiler (shared with tcc-boot) and a handful of
-statically-linked C binaries. No M2-Planet artifact and no Mes
-Scheme module anywhere.
-
-## Settled decisions
-
-These are load-bearing; the rest of the document assumes them.
-
-1. **Seed programs compile through the same Lisp-hosted C compiler
-   as tcc-boot.** No separate seed-stage compiler. Authors write in
-   the C subset fixed in `PLAN.md`; backend emits P1, so seed lands
-   tri-arch via the existing M1+hex2 path. Accepts P1's ~2×
-   code-size tax.
-2. **Vendor upstream C where it exists.** `cat`, `cp`, `mkdir`,
-   `rm`, `sha256sum`, `untar` are taken from live-bootstrap's
-   `mescc-tools-extra`; `patch-apply` from `simple-patch-1.0`.
-   The libc these sources depend on (`<stdio.h>`, `<string.h>`,
-   `<stdlib.h>`, etc.) is vendored M2libc's portable layer —
-   `bootstrappable.c`, `string.c`, `stdio.c`, `stdlib.c`, and the
-   small `ctype`/`fcntl` files (~1,500 LOC). Per-arch syscall
-   stubs backing M2libc's declarations are replaced with our
-   P1-based stubs (see "How seed tools reach syscalls" below). All
-   of the above was written against M2-Planet's C subset, which is
-   a subset of ours. Local adaptations ship as unified diffs in
-   the repo. **No C is written fresh here** — each vendored
-   source already has its own `main`.
-3. **The Lisp program is the build driver — no separate shell.**
-   Per `PLAN.md`, the Lisp's syscall surface includes `clone`,
-   `execve`, `waitid`, so a top-level Lisp file drives the whole
-   tcc-boot build: iterate over translation units, call the
-   Lisp-hosted C compiler in-process, spawn M1/hex2 to finish
-   each artifact, check exit status. No `kaem`, no `sh`, no flat
-   script — just Lisp code.
-4. **One binary per tool.** Each vendored source compiles to a
-   standalone ELF — `cat`, `cp`, `mkdir`, `rm`, `sha256sum`,
-   `untar`, `patch-apply`. Installed into a single directory
-   (say, `/seed/`) and invoked by absolute path from the Lisp
-   driver. No dispatcher, no argv[0] multiplexing, no fresh `main`
-   to write. Each tool is its own audit unit.
-5. **Uncompressed tcc-boot mirror.** Host the upstream tcc-boot source
-   as an uncompressed `.tar` with sha256 pinned. No gzip support
-   anywhere in the seed stage. Deletes ~1000–1500 LOC of deflate from
-   the audit.
-6. **Explicit patches via `patch-apply`.** Upstream source stays
-   verbatim. Our changes live as unified-diff files in this repo,
-   applied by the `simple-patch`-derived binary. "Upstream vs
-   ours" stays legible.
-7. **Target self-build is primary; cross-build is a cache.** The
-   canonical build is a fresh target machine bootstrapping from
-   stage0-posix hex seed. Cross-built per-arch tarballs are supported
-   as a reproducibility cache — identical bytes expected, verified
-   against a target self-build, not trusted by assumption.
-
-## The seed tools
-
-One ELF per tool per arch. Each tool is invoked by absolute path
-from the Lisp build driver (e.g. `/seed/sha256sum foo.tar`). Each
-binary links against the same vendored M2libc portable layer and
-the same P1 syscall stubs.
-
-### Inventory
-
-| Tool / layer       | Purpose                                     | Source / LOC            |
-|--------------------|---------------------------------------------|-------------------------|
-| `untar`            | POSIX ustar extract (no gzip, no creation)  | mescc-tools-extra/untar.c (460) |
-| `patch-apply`      | apply a unified diff in-place               | simple-patch-1.0 (~200) |
-| `sha256sum`        | verify source tarball hashes                | mescc-tools-extra/sha256sum.c (586) |
-| `cp`               | copy one file                               | mescc-tools-extra/cp.c (332) |
-| `mkdir`            | single-level directory create               | mescc-tools-extra/mkdir.c (117) |
-| `rm`               | remove one file (no `-r`, no `-f`)          | mescc-tools-extra/rm.c (54) |
-| `cat`              | concatenate files to stdout                 | mescc-tools-extra/catm.c (69) |
-| libc (portable)    | stdio, string, stdlib, ctype, fcntl         | vendored M2libc (~1,500) |
-| syscall stubs      | per-arch bridge below M2libc                | ~120 lines of P1, not C |
-| **Total C**        |                                             | **~3,300, fully vendored** |
-
-Deliberately excluded: `test`, `echo`, `mv`. The Lisp driver does
-any conditional or rename logic it needs in Lisp, and emits
-progress messages via its own `write` calls — no externalised
-shell utilities needed for those concerns.
-
-The driver is Lisp code, not a shell script; see `PLAN.md`'s
-"Build driver" section for the control flow.
-
-## Syscall surface
-
-The seed tools collectively need **7 syscalls** (process spawn
-lives in the Lisp driver, not in the tools).
-
-| Syscall    | Used by                                   |
-|------------|-------------------------------------------|
-| `read`     | all file-reading tools                    |
-| `write`    | stdout/stderr, all file-writing           |
-| `openat`   | file open (`AT_FDCWD` + `O_RDONLY` / `O_WRONLY\|O_CREAT\|O_TRUNC` with mode) |
-| `close`    | all file ops                              |
-| `exit`     | program termination                       |
-| `mkdir`    | `mkdir` tool, `untar` (directory entries) |
-| `unlink`   | `rm` tool                                 |
-
-PLAN.md's Lisp surface is 8 syscalls (`read`, `write`, `openat`,
-`close`, `exit`, `clone`, `execve`, `waitid`). The seed tools add
-`mkdir` and `unlink` on top of that, for a window total of **10
-distinct syscalls**. Each gets one row in every `p1_<arch>.M1`
-defs file. As with `open`, aarch64/riscv64 have no bare `mkdir` or
-`unlink` in the asm-generic table, so those two rows would need to
-resolve to `mkdirat`/`unlinkat` with `AT_FDCWD` there. Deliberately
-excluded: `stat`/`fstat`, `access`, `rename`, `chmod` (rely on
-`openat` mode bits for initial perms), `lseek` (all reads are
-sequential), `getdents`/`readdir` (no directory traversal needed),
-`dup`/`pipe`/signals/time/net.
-
-### How seed tools reach syscalls
-
-The Lisp-hosted C compiler has no inline asm and no intrinsics. Each
-syscall is exposed as an ordinary `extern` function declaration,
-backed by a hand-written P1 stub in `runtime.p1`. The stubs are ~3 P1
-ops each (load number, `SYSCALL`, `RET`), totalling ~40 lines of P1
-for the whole surface.
-
-```
-:sys_write                 ; C args arrive in P1 r1-r6 per call ABI
-    SYSCALL write          ; expands per-arch via p1_<arch>.M1 defs
-    RET
-```
-
-```
-extern int sys_write(int fd, char *buf, int n);
-```
-
-Prerequisite: P1 picks its argument registers (`r1–r6`) to coincide
-with the native syscall arg registers on each arch (`rdi/rsi/…`,
-`x0–x5`, `a0–a5`), so stubs need no register shuffling beyond what
-`SYSCALL` already does. Confirm this in `P1.md` during implementation.
-
-Return convention: Linux returns `-errno` (values in `-1..-4095`) in
-the result register. Wrappers return the raw integer; callers treat
-an unsigned result above `-4096` (`r >u 0xfffffffffffff000` on a 64-bit
-register) as failure and abort with a message. No `errno` global, no
-per-tool error recovery.
-
-## Build ordering inside the seed window
-
-Once the Lisp interpreter binary exists and the C compiler Lisp
-source is loaded (both per `PLAN.md`):
-
-1. Compile each seed tool independently: its vendored source plus
-   the vendored M2libc layer plus the per-arch P1 syscall stubs →
-   P1 text → M1 → hex2 → one ELF per tool. Per-arch, repeat for
-   each target.
-2. Install the tools into a single directory on the target (e.g.
-   `/seed/`). No other setup required.
-
-The tcc-boot build runs as a Lisp program invoked on the Lisp
-interpreter. The driver:
-
-1. Spawns `/seed/sha256sum upstream.tar` and checks against pinned
-   hash.
-2. Spawns `/seed/untar upstream.tar`.
-3. For each patch file: spawns `/seed/patch-apply patches/foo.diff`.
-4. Iterates over tcc-boot `.c` files. For each one, calls the
-   Lisp-hosted C compiler in-process to emit P1 text, then spawns
-   M1 and hex2 to produce the object or final linked binary.
-5. Installs the tcc-boot binary.
-
-See `PLAN.md` "Build driver" for the spawn-and-wait primitive.
-Once tcc-boot is installed, the seed window is closed.
-
-## Target self-build vs cross-build
-
-**Target self-build (primary).** A fresh machine of arch `A` starts
-from the stage0-posix hex seed, runs the hex0→hex1→hex2→M1 chain,
-loads `p1_A.M1`, assembles the Lisp interpreter, loads the C
-compiler into Lisp, runs the Lisp build-driver program, which
-compiles each seed tool, then compiles and links tcc-boot.
-stage0-posix's own `kaem` runs the early hex0→M1 chain; above M1,
-the Lisp program takes over.
-
-**Cross-build cache (secondary).** On an already-bootstrapped
-machine, produce the seed tool binaries for all three arches and
-ship them as tarballs. Users who opt into this skip the target
-self-build and land directly at "seed tools installed." Trust
-claim: **none by assumption** — the cache is only trusted after a
-target self-build of at least one arch has verified byte-identical
-output. Cross-build is an optimization, not a trust input.
-
-## Provenance
-
-Artifacts flowing in:
-
-- **stage0-posix hex seed + P1 defs**: part of this repo, audited
-  with the rest of it.
-- **Lisp interpreter source (in P1) and C compiler (in Lisp)**:
-  part of this repo, covered by `PLAN.md`.
-- **Vendored seed C sources**: pinned snapshots of
-  live-bootstrap's `mescc-tools-extra` (catm, cp, mkdir, rm,
-  sha256sum, untar), `simple-patch-1.0`, and M2libc's portable
-  layer (the libc the mescc-tools sources depend on — stdio,
-  string, stdlib, ctype, fcntl, bootstrappable). All shipped
-  verbatim as `.tar` files with sha256 pinned. Local adaptations
-  ride as unified diffs in the repo, applied by `patch-apply` at
-  build time so "upstream vs ours" stays legible.
-- **Upstream tcc-boot source**: mirrored as uncompressed `.tar` at
-  a pinned URL + sha256. The mirror file is one of this repo's
-  auditable inputs; it can be re-derived from upstream by untarring
-  and re-tarring in a canonical form, or checked against upstream's
-  published `.tar.gz` by re-gzipping and comparing hashes on a
-  machine that has `gzip` (done once, out of band).
-
-No C is authored fresh in this repo for the seed window; the only
-things written here are unified-diff patches against the vendored
-tree and the per-arch P1 syscall stubs.
-
-`sha256sum` is the single seed tool whose correctness has a direct
-trust consequence downstream; unit-test it against known vectors
-(empty string, "abc", and the longer "abcdbcde..." vector) before
-declaring the seed build complete.
-
-## Interaction with tcc-boot
-
-tcc-boot expects a build environment roughly like `cc + make + sh +
-coreutils`. Mapping:
-
-| tcc-boot expects | Seed provides                                    |
-|------------------|--------------------------------------------------|
-| `cc` / `gcc`     | Lisp-hosted C compiler, invoked in-process per `.c` |
-| `make`           | Lisp driver program (tcc-boot is simple enough)  |
-| `sh`             | not provided — the Lisp driver spawns tools directly |
-| `cat`/`cp`/etc.  | individual seed-tool binaries at absolute paths  |
-| `ld`             | tcc-boot's built-in linker (for its own output)  |
-| `ar`             | not needed; tcc-boot builds one static binary    |
-
-Any translation from tcc-boot's literal build-command names
-(`cc`, `make`, `install`) to seed tools lives in Lisp, not in a
-separate shim script.
-
-## Budget rollup
-
-Fresh auditable LOC introduced by this document, on top of PLAN.md:
-
-| Layer                                                  | LOC     |
-|--------------------------------------------------------|---------|
-| seed tools — vendored mescc-tools-extra + simple-patch | ~1,800 |
-| seed tools — vendored M2libc portable layer            | ~1,500 |
-| syscall stubs (P1, not C)                              | ~120   |
-| Lisp build-driver program                              | counted in PLAN.md |
-| **Seed window addition**                               | **~3,300 C (all vendored) + ~120 P1** |
-
-Combined PLAN.md + SEED.md audit surface: **~13–17k LOC**, tri-arch,
-M2-Planet-free and Mes-free. No fresh C is authored for the seed
-window; the entire ~3,300 LOC is audited upstream code written
-against M2-Planet's C subset. The build driver is Lisp code
-counted against PLAN.md (~100–200 LOC).
-
-## Handoff notes for the engineer
-
-Approximate build order for implementation:
-
-1. **C compiler in Lisp** (blocks everything below). Per `PLAN.md`;
-   validate on a small corpus before touching seed.
-2. **Vendor M2libc's portable layer** and write the per-arch P1
-   syscall stubs that back its declarations. Bring-up test: link
-   `catm.c` (69 LOC) against this libc and run it.
-3. **Vendor mescc-tools-extra + simple-patch.** Pin sha256s.
-   Confirm each source compiles unmodified through the Lisp-hosted
-   C compiler; if anything trips, capture the delta as a unified
-   diff rather than editing the vendored tree in place.
-4. **Build the small tools** individually (`cat`, `cp`, `mkdir`,
-   `rm`) — each is its own ELF.
-5. **`sha256sum`** with unit tests (empty / "abc" / long vectors)
-   before anything depends on its correctness.
-6. **`untar`** (ustar extract only).
-7. **`patch-apply`** (unified-diff in-place).
-8. **End-to-end bring-up**: Lisp build-driver running
-   `sha256sum` → `untar` → `patch-apply` → in-process C-compile
-   loop (spawning M1/hex2 per `.c`) → linked tcc-boot. First full
-   trip through the seed window.
-
-Each step compiles standalone C and assembles through the existing
-P1 → M1 → hex2 path; no new tooling infrastructure is needed
-between steps.
