commit 687ee15acc1661be78ff6cd8922ad95dd19a2100
parent 79aedea7287b895e3089ff72a8ff80482bda1083
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Thu, 23 Apr 2026 20:15:53 -0700
rm old docs
Diffstat:
 D docs/M1M-IMPL.md     | 550 -------------------------------------------------------------------------------
 D docs/M1M-P1-PORT.md  | 257 -------------------------------------------------------------------------------
 D docs/M1PP-EXT.md     | 656 -------------------------------------------------------------------------------
 M docs/P1.md           | 957 ++++++++++++++++++++++++++++++++++++++++--------------------------------------
 D docs/P1v2.md         | 531 -------------------------------------------------------------------------------
 D docs/PLAN.md         | 210 -------------------------------------------------------------------------------
 D docs/SEED.md         | 303 -------------------------------------------------------------------------------
7 files changed, 483 insertions(+), 2981 deletions(-)
diff --git a/docs/M1M-IMPL.md b/docs/M1M-IMPL.md
@@ -1,550 +0,0 @@
-## M1M Implementation Sketch
-
-This note is the implementation-oriented companion to
-`docs/M1M-P1-PORT.md`. It describes a practical structure for the P1
-macro expander.
-
-### Working on the port
-
-**Files**
-
-- `m1pp/m1pp.c` — C oracle. Behaviorally authoritative.
- Each phase below names the oracle entry points to lift from.
-- `m1pp/m1pp.M1` — port target, on P1v2 as of Phase 1. Runtime
- shell, lexer, pass-through emit, and structural `%macro` skip
- land in Phase 1. Phases 2–10 extend it to real macro storage,
- expansion, paste, the expression evaluator, builtins, and
- `%select`.
-- `m1pp/build.sh`, `m1pp/test.sh` — build / run / diff a P1v2 .M1
- into a runnable aarch64 binary. See `docs/M1M-IMPL.md` Phase 0.
-- `tests/m1pp/` — per-phase fixtures. Two shapes, selected by
- extension:
- - `<name>.M1` + `<name>.expected` — standalone P1v2 program; built,
- run with no args, stdout diffed (build-pipeline smoke).
- - `<name>.M1pp` + `<name>.expected` — expander input; runner builds
- `m1pp/m1pp.M1` once, runs it as `m1pp <name>.M1pp <out>`, diffs
- `<out>` (build-dir temp) against `.expected` (parity test).
- Filenames beginning with `_` are skipped (parked until later phases).
-- `build/p1v2/aarch64/p1_aarch64.M1` — P1v2 DEFINE table, generated
- from `p1/aarch64.py` + `p1/p1_gen.py`. Regenerate after any
- backend edit.
-- `docs/P1v2.md` — ISA spec. `docs/M1M-P1-PORT.md` — higher-level
- port contract.
-
-**Commands**
-
-```sh
-# Build one .M1 source into a binary:
-sh m1pp/build.sh tests/m1pp/01-passthrough.M1 build/m1pp/01-passthrough
-
-# Run the whole suite (regenerates P1v2 defs if the generator changed):
-make test-m1pp
-
-# Run one fixture by name:
-sh m1pp/test.sh 01-passthrough
-
-# Run a built binary manually in the aarch64 container.
-# `localhost/distroless-busybox:latest` is a local tag built by
-# m1pp/build.sh on first run from Containerfile.busybox (distroless-static
-# + the busybox binary from another distroless layer, both digest-pinned).
-podman run --rm --pull=never --platform linux/arm64 \
- -v "$PWD":/work -w /work \
- localhost/distroless-busybox:latest \
- ./build/m1pp/<name> <argv...>
-
-# Regenerate P1v2 DEFINE tables after touching p1/*.py:
-python3 p1/p1_gen.py --arch aarch64 build/p1v2
-
-# Build the C oracle + compare its output to the M1 build:
-cc m1pp/m1pp.c -o build/m1pp/m1pp-oracle
-./build/m1pp/m1pp-oracle <input.M1pp> /tmp/out-c
-./build/m1pp/m1pp <input.M1pp> /tmp/out-m1 # run via podman as above
-diff /tmp/out-c /tmp/out-m1
-
-# Discover undefined P1 tokens without running M0 (catches typos that
-# would otherwise SIGILL silently — build.sh runs this automatically):
-sh lint.sh build/p1v2/aarch64/p1_aarch64.M1 m1pp/m1pp.M1
-```
-
-**P1v2 quick reference for this port**
-
-- Registers: `a0..a3` args + caller-saved, `t0..t2` caller-saved,
- `s0..s3` callee-saved. `sp` is stack pointer; no raw writes.
-- Frame: `enter SIZE` / `leave`; no implicit `s*` save. Leaf
- functions may skip frames.
-- Call: `la_br &target` then `call` / `tail` / `b` / `beq` / …
- (the branch op consumes `br` — load it immediately before).
-- Materialize: `li_aN <8 bytes>` for any one-word integer
- (`%lo %hi` or `'XXXXXXXXXXXXXXXX'`); `la_aN &label` for label
- addresses — **no padding needed**, the 32-bit literal-pool
- prefix zero-extends.
-- Syscall ABI: number in `a0`; args in `a1, a2, a3, t0, s0, s1`;
- result in `a0`.
-
-### Supported Features
-
-The target expander supports the features required by `p1/*.M1pp`:
-
-- `%macro NAME(a, b)` / `%endm`
-- `%NAME(x, y)` function-like expansion with recursive rescanning
-- `##` token paste
-- `!(expr)` / `@(expr)` / `%(expr)` / `$(expr)`
-- `%select(cond, then, else)`
-- Lisp-shaped integer expressions used by the builtins
-
-### Top-Down Shape
-
-The program should be structured as a small compiler pipeline:
-
-1. Runtime shell
- Read `argv[1]` into `input_buf`, lex into `source_tokens`, process
- tokens through the macro engine, write `output_buf` to `argv[2]`.
- Done in Phase 1.
-
-2. Lexer
- Keep the current C-compatible tokenizer:
- `WORD`, `STRING`, `NEWLINE`, `LPAREN`, `RPAREN`, `COMMA`, `PASTE`.
- All token text lives in `text_buf`; tokens store pointers into that
- arena.
-
-3. Definition pass during processing
- Processing is single-pass, not a separate pre-scan. At line start,
- `%macro` defines a macro and produces no output. Macro definitions
- become available only after their definition, matching
- `src/m1macro.c`.
-
-4. Stream-driven expansion
- The main processor reads from the top stream. Source input is stream 0.
- Macro expansions and `%select` selections push temporary token
- streams onto a stack. When a stream is exhausted, it pops and
- restores the expansion pool mark.
-
-5. Macro call expansion
- `%NAME(...)` resolves only if `NAME` is already defined and the next
- token is `(`. Expansion produces temporary tokens in `expand_pool`,
- applies plain parameter substitution and paste, then pushes the
- result as a new stream for recursive rescanning.
-
-6. Builtins
- `%(expr)` and `$(expr)` evaluate integer expressions and emit
- one generated token directly.
- `%select(cond, then, else)` evaluates `cond` first, then chooses
- exactly one of `then` or `else`, copies only that chosen token range
- into `expand_pool`, and pushes it as a stream. The unchosen branch is
- not expanded, validated, or expression-evaluated.
-
-7. Errors
- Coarse fatal paths are sufficient: malformed macro header, wrong arg
- count, bad paste, bad expression, overflow, unterminated macro/call.
- Exact C error strings are not required.
-
-### Core Data Structures
-
-Use fixed BSS arenas and simple power-of-two-ish records.
-
-Text spans and token records are kept separate:
-
-```text
-TextSpan:
-+0 start u64
-+8 len u64
-
-Token:
-+0 kind u64
-+8 text TextSpan
-```
-
-Macro record:
-
-```text
-name TextSpan
-param_count u64
-params TextSpan[16]
-body_start Token*
-body_end Token*
-```
-
-Stream record:
-
-```text
-toks_start Token*
-toks_end Token*, exclusive
-pos Token*
-line_start bool
-pool_mark stack mark, -1 for source-owned streams
-```
-
-Expression frame:
-
-```text
-op_code enum
-argc u64
-args i64[16]
-```
-
-Global arenas:
-
-```text
-input_buf
-output_buf
-text_buf
-source_tokens
-macro_body_tokens
-macros
-expand_pool
-streams
-arg_starts[16]
-arg_ends[16]
-expr_frames
-```
-
-Token range boundaries should be stored as token pointers rather than
-indices. That keeps stream and argument walking simple in P1: advance by
-one token record, compare pointers, no repeated `base + (index << 5)`.
-
-Source token spans point into `input_buf`. `text_buf` is reserved for
-synthesized token text such as `##` pastes and `!@%$` output.
-
-### Bottom-Up Helper Layers
-
-#### Layer 0: raw memory/text helpers
-
-```text
-append_text(src_ptr, len) -> text_ptr
-append_text_cstr(const_ptr, len) -> text_ptr
-copy_bytes(dst, src, len)
-```
-
-#### Layer 1: token helpers
-
-```text
-push_source_token(kind, text)
-push_macro_body_token(token_ptr)
-push_pool_token(token_ptr)
-copy_token(dst_ptr, src_ptr)
-tok_eq_const(tok, const_ptr, len) -> bool
-span_eq_token(span, tok) -> bool
-```
-
-#### Layer 2: stream helpers
-
-```text
-push_stream(toks_ptr, count, pool_mark)
-pop_stream()
-current_stream() -> stream_ptr
-stream_peek(stream) -> token_ptr
-stream_advance(stream)
-```
-
-#### Layer 3: macro table helpers
-
-```text
-find_macro(call_tok) -> macro_ptr or 0
-find_param(macro_ptr, body_tok) -> param_index+1 or 0
-define_macro(stream_ptr)
-```
-
-No `find_prefixed_param` or local-rewrite helper is needed for this
-feature set.
-
-#### Layer 4: argument parser
-
-```text
-parse_args(stream_ptr, lparen_tok_ptr)
-```
-
-Outputs:
-
-```text
-arg_starts[i] = first token ptr
-arg_ends[i] = exclusive token ptr
-arg_count
-call_end_pos = token ptr after closing RPAREN
-```
-
-It tracks nested parentheses with a depth counter. Commas split only at
-depth 1.
-
-#### Layer 5: macro body expander
-
-```text
-expand_macro_at(stream_ptr, call_tok, macro_ptr)
-```
-
-Algorithm:
-
-1. Parse call args.
-2. Validate arg count.
-3. Save `mark = pool_used`.
-4. Walk macro body tokens.
-5. If body token is a param, copy arg tokens into the pool.
-6. Otherwise copy the body token as-is.
-7. Run paste compaction over `[mark, pool_used)`.
-8. Push an expansion stream if non-empty; otherwise restore pool mark.
-
-#### Layer 6: paste pass
-
-```text
-paste_range(start_ptr, end_ptr) -> new_count
-```
-
-This is an in-place compactor over `expand_pool`.
-
-Rules:
-
-```text
-## cannot be first or last
-left/right operands cannot be NEWLINE or PASTE
-pasted result is TOK_WORD
-if a substituted parameter participates in ##, its argument must be exactly one token
-```
-
-#### Layer 7: expression evaluator
-
-Do not implement expression evaluation as recursive P1 calls. Use an
-explicit expression frame stack. That avoids fragile recursion and makes
-macro-in-expression expansion controllable.
-
-Expression evaluator API:
-
-```text
-eval_expr_range(start_tok_ptr, end_tok_ptr) -> r0 value
-```
-
-Internal state:
-
-```text
-expr_pos
-expr_end
-expr_frame_top
-expr_done
-expr_result
-```
-
-Loop model:
-
-1. Skip expression newlines.
-2. If token is `(`:
- Read next token as operator.
- Convert operator token to `op_code`.
- Push an expression frame with `argc = 0`, `accum = 0`.
- Advance past the operator.
-3. If token is `)`:
- Finalize the top frame based on `op_code` and `argc`.
- Pop the frame.
- Feed the produced value into the parent frame, or finish if there is
- no parent.
-4. If token is an atom:
- If token is a macro call, expand it to the pool, then evaluate that
- expansion as a nested expression range.
- Otherwise parse the integer atom.
- Feed the value into the parent frame, or finish if there is no
- parent.
-
-Operators:
-
-```text
-+ variadic, argc >= 1
-- unary neg or binary/variadic subtract, argc >= 1
-* variadic, argc >= 1
-/ binary, div-by-zero check
-% binary, div-by-zero check
-<< binary
->> binary arithmetic shift
-& variadic, argc >= 1
-| variadic, argc >= 1
-^ variadic, argc >= 1
-~ unary
-= binary
-== binary alias
-!= binary
-< binary
-<= binary
-> binary
->= binary
-```
-
-Keeping the full current operator set is cheap and avoids pointless
-divergence from the C oracle.
-
-For macro-in-expression, the clean composition is:
-
-```text
-eval atom sees %NAME followed by LPAREN
-expand_macro_at into pool without pushing a stream
-temporarily evaluate [mark, mark + expanded_count)
-require exactly one expression result and no extra tokens
-restore pool mark
-advance outer expr_pos to call_end_pos
-```
-
-That gives the C behavior without mixing expression parsing with the
-main output stream.
-
-#### Layer 8: builtins
-
-```text
-expand_builtin_call(stream_ptr, builtin_tok)
-```
-
-`!@%$`
-
-```text
-parse args
-require one arg
-value = eval_expr_range(arg_start, arg_end)
-emit_hex_value(value, 1 2 4 or 8)
-advance stream pos to call_end_pos
-line_start = 0
-```
-
-`%select`:
-
-```text
-parse args
-require three args
-value = eval_expr_range(arg0_start, arg0_end)
-chosen = arg1 if value != 0 else arg2
-copy chosen tokens to expand_pool
-advance stream pos to call_end_pos
-push chosen stream
-line_start = 0
-```
-
-Only `cond` is evaluated eagerly. The selected branch is rescanned as a
-normal token stream; the unselected branch is ignored completely.
-
-#### Layer 9: main processor
-
-```text
-process_tokens:
- push_stream(source_tokens, source_count, -1)
-
- while stream_top > 0:
- s = current_stream()
- if s.pos == s.end:
- pop_stream()
- continue
-
- tok = *s.pos
-
- if s.line_start && tok == "%macro":
- define_macro(s)
- continue
-
- if tok.kind == NEWLINE:
- emit_newline()
- s.pos += 24 # one Token record
- s.line_start = 1
- continue
-
- if tok is builtin call:
- expand_builtin_call(s, tok)
- continue
-
- if tok is defined macro call:
- expand_call(s, macro)
- continue
-
- emit_token(tok)
- s.pos += 24
- s.line_start = 0
-```
-
-### Implementation Slices
-
-The port is broken into phases. Each phase ends with a dedicated test
-under `tests/m1pp/` and a parity check (where applicable) against the C
-oracle in `m1pp/m1pp.c`. The target ISA is **P1v2** (registers
-`a0..a3`, `t0..t2`, `s0..s3`; `enter`/`leave`; `la_br`); the DEFINE
-table is `build/p1v2/aarch64/p1_aarch64.M1`. Aarch64 is the staging
-arch (matches the macOS host so podman runs natively).
-
-Each phase below lists the oracle entry points in `m1pp/m1pp.c` that
-the M1 port lifts for that slice. Line numbers are hints — track by
-symbol name.
-
-- [x] **Phase 0 — Build/run/diff infra under `m1pp/`.**
- `m1pp/build.sh <source.M1> <out>` lints against the P1v2 DEFINE
- table, prunes unused DEFINEs, runs M0 + hex2-0 with the aarch64
- ELF header inside the distroless-busybox container, and deposits
- a runnable binary. `m1pp/test.sh` walks fixtures in `tests/m1pp/`
- and picks mode by extension: `.M1` fixtures are built and run stand-alone;
- `.M1pp` fixtures are fed to a one-time build of `m1pp/m1pp.M1` as
- input, and the produced output file is diffed. Wired into
- `make test-m1pp`. Phase 0 fixture: `tests/m1pp/00-hello.M1` — a
- P1v2 hello-world that proves the pipeline without depending on
- `m1pp/m1pp.M1`'s current state.
-
-- [x] **Phase 1 — Port lexer + pass-through skeleton to P1v2.**
- Rewrite `_start`, read/write, `lex_source`, `emit_token`,
- `emit_newline`, `process_tokens`, and the structural %macro skip
- in P1v2 conventions (`a*`/`t*`/`s*` registers, `enter SIZE` /
- `leave`, `la_br &label`). Verify byte-for-byte parity against the
- C oracle on a definition-only fixture (tokenizer pass-through).
- Oracle entry points: `main`, `lex_source`, `emit_token`,
- `emit_newline`, `process_tokens` (pass-through branches only),
- plus `append_text_len`, `push_token`, `token_text_eq`,
- `span_eq_token`.
-
-- [x] **Phase 2 — Macro definition storage.**
- Replaced structural skipping with real storage: `define_macro`
- parses the header (name, params with comma splits, trailing
- newline) and copies body tokens into `macro_body_tokens[]` until
- a line-start `%endm`. Records land in a 32-slot `macros[]` arena
- (296 B/record). Macros are not yet called — defs-only input
- matches the oracle. `find_macro` / `find_param` deferred to the
- phases that exercise them (Phase 5).
- Oracle: `define_macro`, `find_macro`, `find_param`.
-
-- [ ] **Phase 3 — Stream stack + expansion-pool lifetime.**
- Stream stack push/pop for recursive rescanning; expansion-pool
- mark/restore on stream pop. No semantic change until Phase 4
- wires macro calls in, but isolates the lifecycle plumbing.
- Oracle: `push_stream_span`, `current_stream`, `pop_stream`,
- `copy_span_to_pool`, `push_pool_stream_from_mark`.
-
-- [ ] **Phase 4 — Argument parsing.**
- Nested-paren depth tracking, comma split at depth 1, argument-
- count validation, `call_end_pos` output.
- Oracle: `parse_args`.
-
-- [ ] **Phase 5 — Plain parameter substitution.**
- Walk macro body; substitute params via the expand pool; push
- resulting slice as a stream. Enforces single-token-arg rule for
- parameters adjacent to `##` (still no actual paste yet).
- Oracle: `expand_macro_tokens` (parameter loop),
- `copy_arg_tokens_to_pool`, `copy_paste_arg_to_pool`,
- `expand_call`.
-
-- [ ] **Phase 6 — `##` token paste compaction.**
- In-place compactor over the expand pool. Rejects misplaced or
- malformed paste sites.
- Oracle: `paste_pool_range`, `append_pasted_token`.
-
-- [ ] **Phase 7 — Integer atoms + S-expression evaluator.**
- Integer-token parsing; explicit expression-frame stack; all
- operators from the oracle; macro-in-expression composition (the
- required path for `p1/P1-aarch64.M1pp`).
- Oracle: `parse_int_token`, `expr_op_code`, `apply_expr_op`,
- `eval_expr_atom`, `eval_expr_range`, `skip_expr_newlines`.
-
-- [ ] **Phase 8 — `!@%$(expr)` builtins.**
- One-arg builtins on top of the evaluator; emit LE 1/2/4/8-byte
- hex tokens.
- Oracle: `expand_builtin_call` (the `!@%$` cases), `emit_hex_value`.
-
-- [ ] **Phase 9 — `%select(cond, then, else)`.**
- Eager `cond` eval; copy chosen branch to expand pool, push as
- stream; never evaluate the unchosen branch.
- Oracle: `expand_builtin_call` (the `%select` case).
-
-- [ ] **Phase 10 — Full-parity + malformed-input smoke tests.**
- Run `tests/m1pp/_full-parity.M1pp` against the M1 implementation
- (unpark by dropping the `_` prefix);
- add malformed fixtures (unterminated macro, wrong arg count, bad
- paste, bad expression, bad builtin arity) requiring non-zero
- exit. Then run combined `p1/P1-aarch64.M1pp + p1/P1.M1pp` through
- the M1 expander and diff against the Python-generated
- `build/p1v2/aarch64/p1_aarch64.M1`. Finally use the produced
- frontend on a small P1 program through the normal toolchain.
diff --git a/docs/M1M-P1-PORT.md b/docs/M1M-P1-PORT.md
@@ -1,257 +0,0 @@
-# m1macro to P1 Port Plan
-
-## Goal
-
-Replace `src/m1macro.c` with a real P1 implementation in `src/m1m.M1`.
-`src/m1m.M1` must be pure portable P1 source. The final `m1m` binary must
-expand M1M input without shelling out to awk, C, Python, libc, or any host
-macro processor.
-
-Contract:
-
-```
-m1m input.M1 output.M1
-```
-
-Behavior should match `src/m1macro.c` byte-for-byte for valid inputs in the
-current M1M feature set, except where an implementation limit is explicitly
-documented.
-
-Architecture-specific code is not allowed in `src/m1m.M1`. The only
-architecture-specific layer is the generated P1 `DEFINE` file that `catm`
-prepends before `src/m1m.M1` during assembly. If the port needs additional
-P1 op/register/immediate combinations, add them to the generator and
-regenerate the arch-specific define tables.
-
-## Scope
-
-Implement the current M1M feature set needed by `p1/*.M1M` to define
-instruction encodings:
-
-- `%macro NAME(a, b)` / `%endm`
-- `%NAME(x, y)` function-like expansion with recursive rescanning
-- `##` token paste
-- `!(expr)` / `@(expr)` / `%(expr)` / `$(expr)`
-- `%select(cond, then, else)`
-- Lisp-shaped integer expressions used by the builtins
-
-Not supported: per-expansion locals (`@local`, `:@local`, `&@local`),
-prefixed parameter substitution (`:param`/`&param`), duplicate macro
-diagnostics, and byte-identical malformed-input diagnostics. Avoid duplicate
-macro names; the feature set does not promise a particular
-duplicate-definition behavior.
-
-Preserve the C tokenizer model: whitespace is normalized, strings are single
-tokens, `#` and `;` comments are skipped, and output is emitted as tokens plus
-newlines rather than preserving original formatting.
-
-## Static Data Model
-
-Use fixed BSS arenas, mirroring the C implementation:
-
-- Input buffer: raw file contents plus NUL sentinel.
-- Output buffer: emitted text.
-- Text buffer: copied token text and generated text.
-- Source token array: token records for the original input.
-- Macro table: name, params, and body token records.
-- Expansion pool: temporary tokens produced by macro calls and `%select`.
-- Stream stack: active token streams for recursive rescanning.
-
-Token record layout should be compact and uniform:
-
-```
-kind token kind
-text source span in `input_buf` or synthesized span in `text_buf`
-```
-
-Macro records should store name/parameter text spans plus a body token
-range, not inline strings. Prefer record shapes that stay uniform across the
-codebase so the address math remains easy to audit in P1.
-
-## Implementation Milestones
-
-## Incremental TODO
-
-Use this checklist to finish the port in reviewable slices. Each checked item
-should build `m1m` and include at least one C-oracle comparison where
-applicable.
-
-- [x] Land the portable P1 runtime shell: argv validation, input open/read,
- output open/write, fatal-error reporting, and no external expander path.
-- [x] Add the first fixed BSS arenas for input, output, text, source tokens,
- and runtime counters.
-- [x] Add initial text/token/output helpers: append copied token text, push
- source token records, compare token text with constants, emit tokens, and
- emit newlines.
-- [x] Port the C tokenizer model for source input: whitespace skipping,
- string tokens, `##`, comments, delimiters, word tokens, newline tokens, and
- text-buffer copies.
-- [x] Add a first processor skeleton that normalizes pass-through output and
- structurally skips line-start `%macro` ... `%endm` definitions.
-- [x] Extend generated P1 support for the current `m1m.M1` needs: broader
- `ADDI` immediates, token/record memory offsets, and full RRR register
- tuples. The Makefile still prunes unused DEFINEs before assembly.
-- [x] Verify the current slice: `make PROG=m1m ARCH=aarch64 build/aarch64/m1m`,
- byte-identical C-oracle output for definition-only library inputs
- `p1/aarch64.M1M` and `p1/P1.M1M` (these currently only prove structural
- `%macro` skipping, not macro-call expansion), and byte-identical tokenizer
- pass-through fixture output.
-- [x] Add `tests/m1m/full-parity.M1M` and its C-oracle expected output as the
- real expansion parity target. This fixture intentionally uses macro calls,
- recursive rescanning, paste, `!@%$(` and `%select`; it is expected
- to fail under the partial P1 implementation until the remaining unchecked
- expansion tasks land.
-- [ ] Replace structural `%macro` skipping with real macro table storage:
- parse headers, parameters, body tokens, body limits, and line-start `%endm`
- recognition.
-- [ ] Add stream stack push/pop for recursive rescanning and expansion-pool
- lifetime management.
-- [ ] Port macro call argument parsing, including nested parentheses and
- argument-count validation.
-- [ ] Port plain parameter substitution, including the single-token argument
- requirement when a parameter participates in `##`.
-- [ ] Port `##` token paste, including bad operand and misplaced paste
- failures.
-- [ ] Port integer atom parsing and S-expression evaluation for arithmetic,
- comparisons, shifts, and bitwise operators.
-- [ ] Implement `!@%$(expr)` on top of expression
- evaluation and token emission.
-- [ ] Implement `%select(cond, then, else)` on top of expression evaluation
- and stream pushback.
-- [ ] Add malformed-input smoke tests: unterminated macro, wrong arg count,
- bad paste, bad expression, and bad builtin arity. These only need non-zero
- exit, not exact diagnostic text.
-- [ ] Use the P1 `m1m` binary to expand a representative M1M frontend and
- assemble a small program through the normal stage0 toolchain.
-- [ ] Revisit static limits and error strings so every documented arena limit
- has a clear fatal path.
-- [ ] Re-run all acceptance tests and update this plan with any explicitly
- documented implementation limits.
-
-1. **Runtime shell**
-
- Keep the existing P1 argv, open/read, write, and fatal-error paths. Remove
- any external backend or `execve` shortcut.
-
-2. **Text and token primitives**
-
- Add helpers for `append_text_len`, `push_token`, token equality,
- span equality, output token emission, and output newline emission.
- Keep error handling simple: set an error message pointer and branch to
- `fatal`.
-
-3. **Lexer**
-
- Port `lex_source` directly. It should fill `source_tokens` from
- `m1m_input_buf`, copying all token text into `text_buf`.
-
-4. **Stream processor skeleton**
-
- Implement push/pop stream and the main `process_tokens` loop. Initially
- support pass-through tokens and `%macro` skipping, then expand toward full
- behavior.
-
-5. **Macro definitions**
-
- Port `define_macro`: parse header, params, body tokens, body limit checks,
- and line-start `%endm` recognition.
-
-6. **Macro call expansion**
-
- Port `parse_args`, plain parameter substitution, token paste, and
- expansion-stream pushback.
-
-7. **Expression evaluator**
-
- Port integer atom parsing and S-expression evaluation. Implement arithmetic,
- comparisons, shifts, and bitwise ops over 64-bit signed values as far as P1
- can represent them. Document any temporary 32-bit limitation if unavoidable,
- but the target is C-compatible 64-bit behavior.
-
-8. **Builtins**
-
- Implement `!@%$(` and `%select` on top of the expression evaluator
- and stream pushback.
-
-9. **Cleanup and limits**
-
- Replace generic “not implemented” errors with coarse but useful failures
- for buffer overflow, malformed macro headers, arg-count mismatch, bad
- expressions, and bad paste operands. Exact C diagnostic parity is not a
- goal.
-
-## Portability Rule
-
-`src/m1m.M1` must use only P1 tokens plus labels/data. Do not hand-code
-aarch64, amd64, or riscv64 instructions in this file. Do not introduce
-per-arch branches, per-arch data layouts, or per-arch syscall sequences in the
-implementation.
-
-Allowed architecture-specific work:
-
-- Extend `src/p1_gen.py` when `m1m.M1` needs a P1 operation tuple that is not
- currently generated.
-- Regenerate `build/<arch>/p1_<arch>.M1`.
-- Keep the existing build shape where the arch-specific define file is
- prepended with `catm` before the portable P1 source.
-
-All algorithmic behavior, buffer layout, parsing, expansion, expression
-evaluation, and error handling belongs in portable P1.
-
-## P1 Support Needed
-
-The current build may stage `PROG=m1m` on aarch64 first, but the source must
-remain portable P1 from the start. Staging on one arch is a build milestone,
-not permission to add arch-specific source.
-
-Likely generator/table updates:
-
-- More `ADDI` immediates for record-size and arena-limit arithmetic.
-- More `LD/ST/LB/SB` offsets for token, macro, and stream record fields.
-- Additional RRR register triples used by parser loops and address math.
-- Possibly a small set of helpers/macros for 32-byte record addressing.
-
-Do not hide core behavior behind host tools. If a P1 operation is missing,
-extend the generated P1 definitions or rewrite the algorithm in available P1.
-
-## Acceptance Tests
-
-Use `src/m1macro.c` as the oracle during development.
-
-Minimum checks:
-
-1. Build `m1m`:
-
- ```
- make PROG=m1m ARCH=aarch64 build/aarch64/m1m
- ```
-
-2. Compare representative inputs against the C implementation:
-
- ```
- src/m1macro.c oracle: p1/aarch64.M1M
- src/m1macro.c oracle: p1/P1.M1M
- custom fixture: paste, recursive rescanning, !@%$(, %select
- malformed fixtures: bad paste, wrong arg count, bad expression
- ```
-
-3. Require byte-identical output for valid fixtures.
-
-4. Require non-zero exit for invalid fixtures.
-
-5. Once stable, use `m1m` to expand the P1 M1M front-end and assemble a small
- program through the normal stage0 toolchain.
-
-## Non-Goals
-
-- No dependency on awk, shell scripts, Python, libc, or the host C compiler at
- runtime.
-- No new macro language features.
-- No formatting preservation beyond the current C expander behavior.
-- No recursive macro cycle detection unless added after parity.
-
-## Done Definition
-
-`src/m1m.M1` contains the expander core, the generated `m1m` binary runs in the
-target Alpine container, and all acceptance tests match `src/m1macro.c` without
-executing any external macro-expansion program.
diff --git a/docs/M1PP-EXT.md b/docs/M1PP-EXT.md
@@ -1,656 +0,0 @@
-# M1PP extensions for the seed Scheme interpreter
-
-Three independent additions to `m1pp/m1pp.c`, ordered by sequencing.
-
-Motivation: when writing the seed Lisp interpreter portably across three
-arches, most pain in `lisp/lisp.M1` traces to two things — hand-named
-scratch labels that collide when a pattern is reused, and argument
-substitution that can't carry instruction bodies (commas break the
-parser). `strlen` is the smaller third item: it removes a class of
-hand-counted-length bugs in error messages and string literals.
-
-## 1. Local labels
-
-### Syntax
-
-Two prefixed word forms, recognized only when they appear in a macro
-body as body-native tokens:
-
-- `:@name` — label definition, scoped to the current expansion
-- `&@name` — address-of reference, scoped to the current expansion
-
-### Semantics
-
-Each `%NAME(...)` invocation allocates a fresh expansion id `NN` from a
-global monotonic counter. While copying body-native tokens into the pool,
-any TOK_WORD whose text starts with `:@` or `&@` (and has ≥1 char after
-the `@`) is rewritten to the corresponding non-`@` form with `__NN`
-suffixed: `:@end` → `:end__42`, `&@end` → `&end__42`.
-
-**Scoping.** Rename body-native tokens only. Argument-substituted tokens
-pass through unchanged — they were already renamed under the caller's
-`NN` if the caller was itself a macro body. This gives lexical label
-scoping: nested and stacked macros each see their own labels, collisions
-are impossible.
-
-**Interaction with `##`.** None. The rename happens before the paste
-pass; a body `:@end##_lbl` renames `:@end` first, then pastes. Edge
-cases here should error out (pasting onto a renamed label is almost
-certainly a bug); leave it unconstrained for v1 and revisit if it
-bites.
-
-### Tokenizer
-
-No changes. `:@foo` / `&@foo` already tokenize as single TOK_WORD under
-the current word-terminator set (`m1pp.c:310`). The existing `@(...)`
-builtin dispatch keys on token text being exactly `@` followed by
-LPAREN, so `@foo` words do not collide.
-
-### m1pp.c touchpoints
-
-- One new static `int next_expansion_id` (monotonic, never reset).
-- `expand_macro_tokens` (`m1pp.c:670`): allocate `NN = ++next_expansion_id`
- before the body-walk. Inside the body-copy loop, when about to push a
- body-native TOK_WORD whose text starts with `:@` or `&@`:
- - build the renamed text directly by appending bytes into `text_buf`:
- copy the original token bytes (sigil + tail), append `__`, then
- append the decimal digits of `NN`
- - push a TOK_WORD pointing at the new text span
-
-Avoid `snprintf`. The m1m port (`docs/M1M-P1-PORT.md`) will reimplement
-every new m1pp feature in P1 assembly; varargs format parsing is a
-non-trivial thing to port. Plain byte appends plus a hand-rolled
-integer → decimal emit (the `display_uint` reverse-fill pattern already
-in `lisp/lisp.M1:2983`) port cleanly.
-
-Concretely in C: reserve a small stack scratch (say 16 bytes), fill
-digits right-to-left via repeated `%10` / `/=10`, then `append_text_len`
-the sigil bytes, the tail bytes, `"__"`, and the digit run. A 32-bit
-counter fits in 10 decimal digits; collision across a file is a
-non-concern because the counter is file-global and monotonic.
-
-No struct changes. No lexer changes. No new global syntax.
-
-## 2. Braced block arguments
-
-### Syntax
-
-Curly braces `{` and `}` group tokens into a single macro argument,
-protecting commas inside the group from the comma-splits-args rule.
-
-```
-%if_eq(r1, r2, {
- li(r0)
- %5
- st(r0, r3, 0)
-})
-```
-
-Without braces, `st(r0, r3, 0)` exposes two commas at paren depth 1 and
-the call parses as 5 args instead of 3.
-
-### Semantics
-
-- `{` and `}` are new TOK kinds, tokenized as single-char delimiters.
-- In `parse_args`, a `brace_depth` counter runs parallel to the paren
- `depth`. Commas at `depth == 1` split args **only when
- `brace_depth == 0`**. LBRACE increments, RBRACE decrements.
-- When copying an arg span into a macro body, if the span begins with
- TOK_LBRACE and ends with matching TOK_RBRACE at the outermost level,
- strip the outer pair. Otherwise copy verbatim — `%foo(plain)` stays
- working.
-- Braces never reach output. Either filter them during substitution or
- make `emit_token` treat both kinds as no-ops (belt-and-braces; I'd
- do both).
-
-### Nesting
-
-`{ { ... } }` nests via `brace_depth`. Braces inside a `"..."` string
-stay inside the string token — the lexer already handles that.
-
-Braces and parens are independent. `{ ( }` is syntactically fine in the
-arg-splitter; paren balancing only cares about LPAREN/RPAREN.
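
As a rough sketch of the splitting rule, the function below counts arguments with a `brace_depth` counter running next to the paren `depth`. It is a character-level stand-in, which is an assumption: the oracle walks tokens, where a string literal is already a single token. The name `count_call_args` is hypothetical.

```c
#include <stddef.h>

/* s must point at the '(' that opens the call.
 * Returns the argument count, or -1 if parens or braces are unbalanced. */
static int count_call_args(const char *s)
{
    int depth = 0;        /* paren depth */
    int brace_depth = 0;  /* brace depth, tracked independently */
    int commas = 0;       /* splitting commas seen so far */
    int seen = 0;         /* any argument content at all? */

    for (; *s != '\0'; s++) {
        switch (*s) {
        case '(':
            if (depth++ > 0)
                seen = 1;             /* nested paren is argument content */
            break;
        case ')':
            if (--depth == 0)         /* closing paren of the call itself */
                return brace_depth == 0 ? commas + seen : -1;
            break;
        case '{':
            brace_depth++;
            seen = 1;
            break;
        case '}':
            if (--brace_depth < 0)
                return -1;
            break;
        case ',':
            /* A comma splits only at paren depth 1 outside all braces. */
            if (depth == 1 && brace_depth == 0)
                commas++;
            seen = 1;
            break;
        case ' ': case '\t': case '\n':
            break;
        default:
            seen = 1;
            break;
        }
    }
    return -1;            /* ran off the end without closing the call */
}
```

For `(a, {b, c}, d)` the sketch reports three arguments, because the comma between `b` and `c` sits at `brace_depth == 1`.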
-
-### Tokenizer
-
-`lex_source` (`m1pp.c:232`): add LBRACE/RBRACE cases alongside the
-existing LPAREN/RPAREN cases (~10 lines). Add `{` and `}` to the
-word-terminator set at `m1pp.c:310`.
-
-### m1pp.c touchpoints
-
-- New TOK_LBRACE, TOK_RBRACE enum entries (`m1pp.c:77`).
-- `lex_source`: two new single-char token cases.
-- `parse_args` (`m1pp.c:543`): add `brace_depth` counter; gate the
- comma-split on `brace_depth == 0`; LBRACE/RBRACE bump/drop it.
-- Arg copy (in `expand_macro_tokens`, via `copy_arg_tokens_to_pool` and
- `copy_paste_arg_to_pool`): detect outer `{ ... }` wrapping and strip.
- The `copy_paste_arg_to_pool` path (single-token arg for `##`) should
- reject braced args — pasting onto a block is nonsense.
-- `emit_token`: no-op for both brace kinds (defensive; they shouldn't
- reach here if substitution is clean).
-
-### What this does not give you
-
-A C-like block-statement form (`%if_eq(a,b) { … } %else { … } %endif`)
-needs `process_tokens` to recognize line-start block openers/closers —
-a separate, heavier change. Braced args get us
-`%if_eq_else(a, b, { then }, { else })` and `%while_nez(r, { body })`,
-which covers the patterns in lisp.M1 we care about. Defer the block-
-statement form until braced-arg shows real ergonomic pain.
-
-## 3. `strlen` expression op
-
-### Syntax
-
-A new unary op in the Lisp-shaped expression grammar:
-
-```
-(strlen "literal")
-```
-
-Composes with arithmetic like any other op:
-
-```
-%((+ (strlen "hello") 1))
-```
-
-### Semantics
-
-- Argument must be a single `TOK_STRING` atom (double-quoted form).
-- Value is the raw byte count between the quotes: `span.len - 2`.
- Matches what M1's `"…"` emission writes before appending NUL.
-- Single-quoted `'…'` hex literals error out — strlen is meaningless
- on raw hex.
-
-### No decimal emitter needed
-
-The 4-byte LE hex emitter `%(expr)` is sufficient. Two paths cover
-everything:
-
-1. Companion DEFINE:
-
- ```
- %macro defstr(label, text)
- :label text
- DEFINE label##_LEN %((strlen text))
- %endm
- ```
-
- M1 substitutes `label_LEN` with its 4 hex bytes at each use site.
-
-2. Inline at an LI-immediate slot:
-
- ```
- li_r2 %((strlen "usage: …"))
- ```
-
- LI's inline literal slot takes 4 raw LE bytes; `05000000` and `%5`
- are byte-equivalent there. lisp.M1 already relies on this (see
- `DEFINE NIL 07000000` at `lisp/lisp.M1:30`, consumed as
- `li_r0 NIL`).
-
-The 1/2/8-byte emitters (`!(e)`, `@(e)`, `$(e)`) cover non-4-byte widths
-if needed.
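
The two pieces involved, the `span.len - 2` value and the 4-byte little-endian hex rendering, can be sketched as below. Both function names are hypothetical, and the uppercase hex alphabet is an assumption; the examples in this note only show digits.

```c
#include <stddef.h>
#include <stdint.h>

/* Value of (strlen "...") for a token span that includes both quotes.
 * Returns -1 for anything that is not a double-quoted literal. */
static long strlen_value(const char *span, size_t len)
{
    if (len < 2 || span[0] != '"' || span[len - 1] != '"')
        return -1;
    return (long)(len - 2);
}

/* Renders value as 8 hex characters, least significant byte first,
 * followed by a terminating NUL. */
static void emit_le32_hex(uint32_t value, char out[9])
{
    static const char hex[] = "0123456789ABCDEF";
    int i;

    for (i = 0; i < 4; i++) {
        unsigned byte = (unsigned)((value >> (8 * i)) & 0xFFu);
        out[2 * i]     = hex[byte >> 4];
        out[2 * i + 1] = hex[byte & 0xFu];
    }
    out[8] = '\0';
}
```

Feeding `"hello"` (seven bytes with quotes) through `strlen_value` gives 5, and `emit_le32_hex(5, out)` leaves `05000000` in `out`, the same bytes the `DEFINE` examples spell by hand.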
-
-### m1pp.c touchpoints
-
-- `EXPR_STRLEN` entry in the `ExprOp` enum (`m1pp.c:87`).
-- `expr_op_code` (`m1pp.c:751`): match the word `strlen`.
-- Eval path: `strlen` is a degenerate case — its "argument" is a
- TOK_STRING, not a recursive expression. Easiest is a special-case
- branch in `eval_expr_range` (`m1pp.c:976`) that handles `(strlen
- "...")` directly rather than routing through `eval_expr_atom`.
- Emit `span.len - 2` as the value.
-- Alternative: extend `eval_expr_atom` to accept TOK_STRING atoms with
- value `len - 2`, and treat `strlen` as identity. Cleaner
- composition but more surface area; defer unless needed.
-
-## 4. Paren-less 0-arg macro calls
-
-### Syntax
-
-A macro defined with zero parameters may be called without trailing
-`()`:
-
-```
-%macro FRAME_BASE()
-16
-%endm
-
-%((+ %FRAME_BASE 8)) ## paren-less
-%((+ %FRAME_BASE() 8)) ## still works
-```
-
-### Semantics
-
-- When `find_macro` matches a `%NAME` token and the macro's
- `param_count == 0`, the expansion triggers whether or not an LPAREN
- follows.
-- Applies in both contexts where a macro call is currently recognized:
- top-level processing in `process_tokens`, and atom position in
- `eval_expr_atom` so a 0-arg macro is a valid expression atom inside
- `%(...)`.
-- Non-zero-param macros still require their existing `(arg, ...)`
- syntax.
-- `%foo` where `foo` is not defined as a macro still passes through
- unchanged — the match only fires when a matching 0-param macro
- exists. Backward compatible.
-
-### Why it matters
-
-The features that need it are `%struct` field access (§5) and `%enum`
-labels (§6). Once `NAME.field` expands to an integer, writing
-`%NAME.field` reads as a named constant; `%NAME.field()` looks like a
-function call. The relaxation is also load-bearing for expression-level
-composition:
-`%((+ %frame_hdr.SIZE %frame_apply.callee))` needs both atoms to
-resolve as 0-arg calls inside the evaluator.
-
-### m1pp.c touchpoints
-
-- `process_tokens` (`m1pp.c:1225`): the LPAREN-next guard becomes
- "LPAREN-next OR `param_count == 0`."
-- `eval_expr_atom` (`m1pp.c:944`): same relaxation on the same guard.
-- The zero-param paren-less path constructs an empty arg list and
- calls `expand_macro_tokens` with `arg_count == 0` — no
- `parse_args` change.
-- No lexer changes, no new token kinds, no new Macro fields.
-
-## 5. `%struct` directive
-
-### Syntax
-
-A top-level directive declaring a fixed-layout aggregate of 8-byte
-fields:
-
-```
-%struct closure { hdr params body env }
-```
-
-Fields are bare identifiers separated by whitespace and/or commas.
-The closing brace terminates the declaration.
-
-### Semantics
-
-Expands at declaration time to N+1 zero-parameter macros:
-
-- `NAME.field_k` → `k * 8` for each field at index k
-- `NAME.SIZE` → `N * 8`
-
-All fields are 8-byte words. Mixed widths are deferred until a real
-use case appears.
-
-Callers consume these as paren-less 0-arg calls (per §4):
-
-```
-ld(r0, r1, %closure.body)
-enter(%frame_apply.SIZE)
-```
-
-### No `base=` parameter
-
-The struct primitive declares offsets from zero. Base offsets (e.g.
-for stack-frame locals sitting above the retaddr/caller-sp header)
-compose at the call site via an ordinary wrapper macro:
-
-```
-%struct frame_hdr { retaddr caller_sp } ## SIZE = 16
-
-%macro frame(field)
-%((+ field %frame_hdr.SIZE))
-%endm
-
-%struct frame_apply { callee args body env }
-
-:apply
- enter(%frame_apply.SIZE)
- st(r1, sp, %frame(%frame_apply.callee)) ## 0 + 16 = 16
- st(r2, sp, %frame(%frame_apply.args)) ## 8 + 16 = 24
- …
- leave()
- ret
-```
-
-Heap structs access fields directly (`%closure.body`); stack frames
-route through the `%frame` wrapper. Same primitive, two conventions,
-no special casing inside `%struct`. If a function needs a different
-base (e.g. a permanent spill prefix), define `%frame_big(field)`
-alongside `%frame` — the struct declarations don't change.
-
-### Tokenizer
-
-- `.` is already a word char, so `NAME.field` tokenizes as one
- TOK_WORD under the current word-terminator set (`m1pp.c:310`).
-- `{` / `}` reuse the TOK_LBRACE / TOK_RBRACE kinds introduced for §2.
- `%struct` cannot land before §2 does.
-
-### m1pp.c touchpoints
-
-- New top-level directive branch in `process_tokens` (`m1pp.c:1192`)
- alongside the existing `%macro` detection. At line-start, if the
- first word is `%struct`:
- - consume name, `{`, field-identifier list (WORD tokens,
- comma-or-whitespace separated), `}`, trailing newline
- - for each field k, generate an entry in `macros[]`:
- - name = synthesized `"NAME.field_k"` in `text_buf`
- - `param_count = 0`
- - body = a single TOK_WORD whose text is the decimal rendering of
- `k * 8` in `text_buf`
- - emit a final `"NAME.SIZE"` entry pointing at `N * 8`
-- Integer → decimal rendering reuses the hand-rolled reverse-fill
- pattern from §1 local labels — no `snprintf`.
-- No new expression-evaluator surface; consumption goes through the
- existing `find_macro` + `eval_expr_atom` path once §4 lands.
-- No new Macro struct fields. A struct-generated macro is
- indistinguishable from any other 0-param macro once declared.
-
-### What this does not give you
-
-- **Mixed-width fields.** All offsets are `k * 8`. The packed 8-bit
- type + 8-bit gc-flags + 48-bit length header in lisp.M1 is easier
- to handle with dedicated bit-op macros than struct syntax; defer.
-- **Bundled enter/leave per frame.** A `%frame NAME { … }` directive
- that also emits ENTER/LEAVE around a body would bring back the
- block-body problem and tightly couple locals to one function shape.
- The call-site verbosity savings don't pay; use plain `%struct` plus
- a wrapper macro.
-
-## 6. `%enum` directive
-
-### Syntax
-
-A top-level directive declaring an incrementing sequence of named
-integer constants:
-
-```
-%enum tag { fixnum pair vector string symbol proc singleton }
-%enum prim_id { add sub mul div mod eq lt gt ... }
-```
-
-### Semantics
-
-Expands at declaration time to N+1 zero-parameter macros:
-
-- `NAME.label_k` → `k` for each label at index k
-- `NAME.COUNT` → `N`
-
-Callers consume these as paren-less 0-arg calls (per §4):
-
-```
-li_r2 %tag.pair ## loads 1
-%((= %prim_id.COUNT 45)) ## compile-time sanity check
-```
-
-### Relationship to `%struct`
-
-Implementation-wise, `%enum` is `%struct` with stride 1 instead of 8
-and a totalizer named `COUNT` instead of `SIZE`. The directive
-parser, brace consumption, field-list parsing, and macro-generation
-loop are all shared. Factor the §5 implementation around one helper
-parameterized by `(stride, totalizer_name)`:
-
-- `%struct` → `define_fielded(8, "SIZE")`
-- `%enum` → `define_fielded(1, "COUNT")`
-
-No separate code path; adding `%enum` is a second line-start
-directive check in `process_tokens` plus one call to the shared
-helper.
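
One possible shape for the shared helper is sketched below. Everything except the `(stride, totalizer_name)` parameterization is an assumption made for illustration: `struct const_macro` stands in for a 0-param `macros[]` entry, it stores the value as an integer where the oracle would store decimal text, and `FIELD_NAME_CAP` is an arbitrary limit.

```c
#include <stddef.h>
#include <string.h>

#define FIELD_NAME_CAP 64   /* illustrative limit, not taken from m1pp.c */

struct const_macro {
    char name[FIELD_NAME_CAP];
    unsigned value;
};

/* Joins base + "." + field into dst. Returns 0, or -1 if it does not fit. */
static int join_name(char *dst, const char *base, const char *field)
{
    size_t b = strlen(base), f = strlen(field);

    if (b + 1 + f + 1 > FIELD_NAME_CAP)
        return -1;
    memcpy(dst, base, b);
    dst[b] = '.';
    memcpy(dst + b + 1, field, f + 1);   /* also copies the trailing NUL */
    return 0;
}

/* Fills out[0..n], which is n + 1 entries: one per field plus the totalizer.
 * Returns the entry count, or -1 if a synthesized name is too long. */
static int define_fielded(struct const_macro *out, const char *base,
                          const char *const *fields, size_t n,
                          unsigned stride, const char *totalizer)
{
    size_t k;

    for (k = 0; k < n; k++) {
        if (join_name(out[k].name, base, fields[k]) != 0)
            return -1;
        out[k].value = (unsigned)k * stride;    /* NAME.field_k -> k * stride */
    }
    if (join_name(out[n].name, base, totalizer) != 0)
        return -1;
    out[n].value = (unsigned)n * stride;        /* NAME.SIZE or NAME.COUNT */
    return (int)(n + 1);
}
```

Called with stride 8 and `"SIZE"` on the `closure` fields, this yields `closure.body` at 16 and `closure.SIZE` at 32. Called with stride 1 and `"COUNT"`, the same loop produces the enum numbering.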
-
-### Why it matters
-
-lisp.M1 maintains two hand-numbered integer enumerations whose
-numbering must stay in sync across disjoint sites:
-
-- Tag codes (`lisp/lisp.M1:35–47`) referenced throughout the
- reader / eval / printer dispatchers.
-- Primitive code IDs — used by the registration table and the
- dispatch cascade (`lisp/lisp.M1:3843–3983`). Inserting a new
- primitive in the middle shifts every downstream id; silent drift,
- no error until runtime.
-
-`%enum` eliminates both drift classes: names declared once,
-referenced by name everywhere, renumbering on insertion is automatic.
-
-### m1pp.c touchpoints
-
-Same as §5 with the two parameter differences above. No new Macro
-struct fields, no new token kinds, no new expression-evaluator
-surface.
-
-### What this does not give you
-
-- **Explicit values.** `%enum foo { a=5 b c }` is not supported in
- v1. All values are consecutive from 0. C's explicit-value form
- is useful when matching external ABIs; our enums are internal, so
- defer until a real use case appears.
-- **Flag/bitmask enums.** Not specially supported. If you want bit
- positions, declare the bit index via `%enum` and take
- `(1 << %NAME.flag_k)` at use sites.
-
-## 7. `%str` stringification builtin
-
-### Syntax
-
-A new builtin alongside `!(e)`, `@(e)`, `%(e)`, `$(e)`, and `%select`:
-
-```
-%str(IDENT)
-```
-
-Takes a single WORD-token argument; produces a TOK_STRING literal
-whose contents are the argument's text wrapped in double quotes:
-
-```
-%macro quoteit(name)
-%str(name)
-%endm
-
-%quoteit(hello) → "hello"
-%quoteit(foo_bar) → "foo_bar"
-```
-
-### Semantics
-
-- Exactly one argument, kind TOK_WORD. Multi-token, pasted, or
- already-string args error out.
-- Output is a freshly-allocated TOK_STRING span in `text_buf` built
- as `"` + original_text + `"`. The span's `len` is
- `original_len + 2`, so `strlen` on the result (per §3) returns
- `original_len` — the char count between the quotes, matching
- what M1's `"…"` emission writes before the NUL.
-- Produces a string literal, not a word. Complementary to `##`, not
- a replacement — see below.
-
-### Relationship to `##` paste
-
-Both turn a parameter into something else, but they produce
-**different token kinds** and serve **different goals**:
-
-| operator | inputs | output | kind |
-|----------|-----------------|------------------|------------|
-| `##` | two WORD tokens | one WORD token | TOK_WORD |
-| `%str` | one WORD token | one STRING token | TOK_STRING |
-
-`##` joins word fragments to build identifiers / label names.
-`%str` wraps a word in quotes to produce a string literal. They
-can't substitute for each other:
-
-- `:str_quote` (a label definition) must be a word — `##` can
- build it, `%str` can't.
-- `"quote"` (a string literal) must introduce quote characters —
- `%str` is the only way to manufacture it from a bare identifier,
- paste can't.
-
-M1 sees the difference too: `:str_quote "quote"` is a label-def
-word followed by a quoted-bytes directive (5 bytes + NUL). Paste
-manufactures the first, stringify the second, both from the same
-source identifier.
-
-### Why it matters
-
-Every special-form symbol in lisp.M1 (`lisp/lisp.M1:164–260`)
-follows the same triad, written longhand 15 times today:
-
-```
-:str_quote "quote"
-DEFINE str_quote_LEN 05000000
-:sym_quote %0 %0
-```
-
-With `##` and `%str` together, one declarative site per symbol:
-
-```
-%macro defsym(name)
-:str_##name %str(name)
-DEFINE str_##name##_LEN %((strlen %str(name)))
-:sym_##name %0 %0
-%endm
-
-%defsym(quote)
-%defsym(if)
-%defsym(begin)
-…
-```
-
-- `##name` builds the label identifiers (`str_quote`, `sym_quote`).
-- `%str(name)` builds the string literal (`"quote"`).
-- `(strlen %str(name))` computes the length for the DEFINE.
-- One source of truth per symbol — the identifier itself.
-
-Without `%str`, callers would have to pass the string explicitly
-(`%defsym(quote, "quote")`). That works today with zero m1pp
-changes but invites drift between the identifier and its
-spelled-out string form — nothing at compile time flags a typo
-where the two disagree.
-
-### Why a builtin, not a `#x` sigil
-
-cpp uses `#x` inside macro bodies to stringify a parameter. That
-shape doesn't port cleanly to m1pp because `#` is already the
-line-comment starter (`m1pp.c:278`). Giving `#` dual duty would
-create parse ambiguity in `lex_source`.
-
-`%str(x)` reuses the existing builtin-dispatch plumbing — the same
-path that handles `! / @ / % / $ / %select` — and reads uniformly
-with the other text and numeric builtins.
-
-### Tokenizer
-
-No changes. Existing TOK_STRING machinery handles the output;
-`%str` is a word token recognized as a builtin in `process_tokens`.
-
-### m1pp.c touchpoints
-
-- `process_tokens` (`m1pp.c:1211`): extend the builtin-dispatch
- guard to accept `%str` alongside `! @ % $ %select`.
-- `expand_builtin_call` (`m1pp.c:1092`): add a branch for `%str`.
- Arg-count check: exactly 1. Arg-shape check: exactly one token,
- kind TOK_WORD. Anything else errors.
-- Stringification body: compute `out_len = arg.text.len + 2`,
- reserve that many bytes via `append_text_len`, write `"`,
- the original bytes, `"`. Push a TOK_STRING pointing at the new
- span.
-- No `snprintf` — plain byte copies, straightforward port.
-- No new token kinds, no new Macro fields.
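
The stringification body is small enough to sketch directly. `stringify_word` is a hypothetical name, the output buffer replaces the real `text_buf` reservation, and the character scan is only a stand-in for the "exactly one TOK_WORD" check that the oracle would make on tokens.

```c
#include <stddef.h>
#include <string.h>

/* Writes '"' + word + '"' into out.
 * Returns the new span length (len + 2), or 0 on rejection. */
static size_t stringify_word(char *out, size_t cap,
                             const char *word, size_t len)
{
    size_t i;

    if (len == 0 || len + 2 > cap)
        return 0;

    /* Reject input that cannot be a single bare word. */
    for (i = 0; i < len; i++) {
        char c = word[i];
        if (c == '"' || c == ' ' || c == '\t' || c == '\n')
            return 0;
    }

    out[0] = '"';
    memcpy(out + 1, word, len);   /* plain byte copy, no formatting */
    out[len + 1] = '"';
    return len + 2;
}
```

For the word `quote` the result is the seven-byte span `"quote"`, so a later `strlen` on it sees 7 - 2 = 5.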
-
-### What this does not give you
-
-- **Stringification of non-parameter tokens.** Only single-token
- WORD args. `%str(foo bar)` or `%str("already a string")` both
- error. Wider forms are cpp-ish; defer until a real use case
- appears.
-- **Escape processing inside the stringified text.** The input is
- a bare identifier — no quotes, backslashes, or whitespace to
- escape. If `%str` is ever extended to take broader token spans,
- escape handling becomes relevant then.
-
-## Per-feature implementation sequence
-
-Each feature lands in the same three ordered steps. Do
-not skip or reorder — the tests exist to pin behavior before the
-port, and the port exists because the C expander is disposable.
-
-1. **Implement in `m1pp/m1pp.c`.** The C expander is the oracle. Land
- the feature here first so there is something to diff against.
-2. **Add a test in `tests/m1pp/`.** New `NN-name.M1pp` +
- `NN-name.expected` pair following the existing numbering (see
- `tests/m1pp/` — current fixtures run 00 through 10), **or** extend
- an existing fixture when the feature is a natural addition to one
- (e.g. `strlen` goes into `04-expr-ops.M1pp` alongside the other
- expression ops rather than getting its own file). For malformed-
- input features, the expected artifact is a non-zero exit; document
- that in the fixture.
-3. **Add to `m1pp/m1pp.M1`.** Port the feature to the pure-P1
- implementation of m1pp so the seed bootstrap doesn't depend on the
- host C expander. The test from step 2 runs against both `m1pp` (C)
- and `m1m` (P1) and must produce byte-identical output; that parity
- is what `docs/M1M-P1-PORT.md` calls "C-oracle comparison."
-
-Shipping a feature means all three steps are done. A half-landed
-feature (C only, or C + test but no port) blocks the next feature in
-the sequencing list below.
-
-## Cross-feature sequencing
-
-1. **Local labels.** Smallest patch, immediately useful — enables
- straight-line macros like `%case_tag` and `%tag_dispatch` that
- want one or two internal labels without hand-naming.
-2. **Braced args.** Unlocks structured `%if_eq_else` / `%while_nez`
- that carry instruction bodies. Depends on (1) in practice — the
- bodies reference labels defined in the surrounding macro.
-3. **`strlen`.** Independent of (1) and (2). Land when the first
- `%defstr` call site shows up.
-4. **Paren-less 0-arg macro calls.** Independent small relaxation of
- two guards (one in `process_tokens`, one in `eval_expr_atom`).
- Useful on its own for constants-as-macros; load-bearing for (5).
-5. **`%struct`.** Depends on (2) for the brace token kinds and (4)
- for paren-less access syntax. Land only after both.
-6. **`%enum`.** Same dependencies as (5). Share the
- directive-handler implementation with `%struct` — land together
- or back-to-back.
-7. **`%str`.** Independent of everything else. Pairs naturally with
- (3) `strlen` in the `%defsym` pattern but has no build-order
- dependency on it. Land when the first `%defsym`-style
- declarative macro shows up.
-
-Each is a self-contained patch. No cross-dependencies beyond the
-sequencing above and the three-step rule per feature.
-
-## Per-feature acceptance fixtures
-
-- **Local labels:** two fixtures: a single macro using `:@end`,
-  invoked twice in one function (must produce distinct labels),
- and nested macros each using `:@done` (must not collide). Assemble
- through M1 + hex2 clean on at least one arch.
-- **Braced args:** fixture exercising a body with commas
- (`st(r0, r3, 0)`), a body with nested braces, and a malformed
- fixture (unmatched `{`) that exits non-zero.
-- **`strlen`:** fixture with `DEFINE X_LEN %((strlen "hello"))`
- followed by `li_r2 X_LEN` — binary must load the value 5 and
- syscall-exit 5 on all three arches via the existing P1 differential
- harness.
-- **Paren-less 0-arg calls:** fixture with a 0-param macro invoked
- both with and without trailing `()`, in top-level position and as
- an atom inside `%(...)` expressions; all forms must produce
- byte-identical output against a control fixture that always uses
- `()`.
-- **`%struct`:** fixture declaring a 4-field struct, accessing each
- field via paren-less calls, and layering a `%frame` wrapper using
- `%frame_hdr.SIZE` composition (per the doc example); build on all
- three arches and exit with a sentinel computed from both the
- struct-level `.SIZE` and the wrapped base offset, proving the
- compose-and-add path resolves correctly.
-- **`%enum`:** fixture declaring an enum with 3+ labels, referencing
- each via paren-less call, and asserting `%NAME.COUNT` equals the
- label count via a `%(=)` expression that feeds an exit code;
- build on all three arches. Share fixture scaffolding with the
- `%struct` test where practical.
-- **`%str`:** three fixtures: (a) a macro using `%str(name)` in its
- body, compared against a control that writes the literal
- `"name"` string directly (byte-identical output); (b) combined
- paste + stringify, `%macro defsym(n) :str_##n %str(n) %endm`
- invoked with distinct identifiers, assembled through M1 + hex2,
- each generated label must point at the correctly-spelled string
- bytes. A third malformed fixture (`%str(a b)` or
- `%str("already_string")`) must exit non-zero.
diff --git a/docs/P1.md b/docs/P1.md
@@ -1,522 +1,531 @@
-# P1: A Portable Pseudo-ISA for M1
-
-## Motivation
-
-The stage0/live-bootstrap chain uses M1 (the mescc-tools macro assembler) as
-the lowest human-writable layer above raw hex. M1 itself is architecture-
-agnostic — it only knows `DEFINE name hex_bytes` — but every real M1 program
-in stage0 (including the seed C compiler `cc_*.M1`) is hand-written per arch.
-To write, say, a seed Lisp interpreter portably across amd64, aarch64, and
-riscv64 without reaching for M2-Planet, we need a thin portable layer: a
-pseudo-ISA whose mnemonics expand, per arch, to native encodings.
-
-P1 is that layer. The goal is an unoptimized RISC-shaped instruction set,
-hand-writable in M1 source, that assembles to three host ISAs via per-arch
-`DEFINE` tables on top of existing `M1` + `hex2` unchanged.
-
-## Non-goals
-
-- **Not an optimizing backend.** P1 is deliberately dumb. An `ADD rD, rA, rB`
- on amd64 expands to `mov rD, rA; add rD, rB` unconditionally — no peephole
- recognition of the `rD == rA` case. Paying ~2× code size is fine for a seed.
-- **Not ABI-compatible with platform C.** P1 programs are sovereign: direct
- Linux syscalls, no libc linkage. Interop thunks can be written later if
- needed.
-- **Not 32-bit.** x86-32, armv7l, riscv32 are out of scope for v1. Adding them
- later means a separate defs file and some narrowing in the register model.
-- **Not self-hosting.** P1 is a target for humans, not a compiler IR. If you
- want a compiler, write it in subset-C and use M2-Planet.
-
-## Current status
-
-Three programs assemble unchanged across aarch64, amd64, and riscv64
-from the generator-produced `p1_<arch>.M1` defs:
- * `hello.M1` — write/exit, prints "Hello, world!".
- * `demo.M1` — exercises the full tranche 1–5 op set (arith/imm/LD/ST/
- branches/CALL/RET/PROLOGUE/EPILOGUE/TAIL); exits with code 5.
- * `lisp.M1` — seed Lisp through step 2 of `LISP.md`: bump heap,
- `cons`/`car`/`cdr`, tagged-value encoding. Exits with code 42
- (decoded fixnum from `car(cons(42, nil))`).
-
-All runs on stock stage0 `M0` + `hex2-0`, bootstrapped per-arch from
-`hex0-seed` — no C compiler, no M2-Planet, no Mes. Run with
-`make PROG=<hello|demo|lisp> run-all` from `lispcc/`.
-
-The DEFINE table is generator-driven (`p1_gen.py`); tranches 1–8 are
-enumerated there, plus the full PROLOGUE_Nk family (k=1..4). Branch
-offsets are realized by the LI_BR-indirect pattern
-(`LI_BR &target ; BXX_rA_rB`), sidestepping the missing
-branch-offset support in hex2. The branch-target scratch is a
-reserved native reg (`x17`/`r11`/`t5`), not a P1 GPR.
-
-### Spike deviations from the design
-
-- Wide immediates use a per-`LI` inline literal slot (one PC-relative
- load insn plus a 4-byte data slot, skipped past) rather than a shared
- pool. Keeps the spike pool-free at the cost of one skip-branch per
- `LI`. A pool can be reintroduced later without changes to P1 source.
-- `LI` is 4-byte zero-extended today; 8-byte absolute is deferred until
- a program needs it. All current references are to addresses under
- 4 GiB, so `&label` + a 4-byte zero pad suffices.
-- The per-tuple DEFINE table is generator-produced (see `p1_gen.py`)
- from a shared op table across all three arches. The emitted set
- covers tranches 1–8 plus the N-slot PROLOGUE/EPILOGUE/TAIL
- variants. Adding a new tuple is a one-line append to `rows()` in
- the generator; no hand-encoding.
-
-## Design decisions
-
-| Decision | Choice | Why |
-|----------------|-----------------------------------------------|--------------------------------------------|
-| Word size | 64-bit | All three target arches are 64-bit native |
-| Endianness | Little-endian | All three agree |
-| Registers | 8 GPRs (`r0`–`r7`) + `sp`, `lr`-on-stack | Fits x86-64's usable register budget |
-| Narrow imm | Signed 12-bit | riscv I-type width; aarch64 ≤12 also OK |
-| Wide imm | Pool-loaded via PC-relative `LI` | Avoids arch-specific immediate synthesis |
-| Calling conv | r0 = return, r1–r3 = args (caller-saved), r4–r7 callee-saved | P1-defined; not platform ABI |
-| Return address | Always spilled to stack on entry | Hides x86's missing `lr` uniformly |
-| Syscall | `SYSCALL` with num in r0, args r1–r6; clobbers r0 only | Per-arch wrapper emits native sequence |
-| Spill slot | `[sp + 8]` is callee-private scratch after `PROLOGUE` | Frame already 16 B for alignment; second cell was otherwise unused |
-
-## Register mapping
-
-`r0`–`r3` are caller-saved. `r4`–`r7` are callee-saved, general-purpose,
-and preserved across `CALL`/`SYSCALL`. `sp` is special-purpose — see
-`PROLOGUE` semantics.
+# P1 v2
-| P1 | amd64 | aarch64 | riscv64 |
-|------|-------|---------|---------|
-| `r0` | `rax` | `x0` | `a0` |
-| `r1` | `rdi` | `x1` | `a1` |
-| `r2` | `rsi` | `x2` | `a2` |
-| `r3` | `rdx` | `x3` | `a3` |
-| `r4` | `r13` | `x26` | `s4` |
-| `r5` | `r14` | `x27` | `s5` |
-| `r6` | `rbx` | `x19` | `s1` |
-| `r7` | `r12` | `x20` | `s2` |
-| `sp` | `rsp` | `sp` | `sp` |
-| `lr` | (mem) | `x30` | `ra` |
-
-`r4`–`r7` all map to native callee-saved regs on each arch, so the SysV
-kernel+libc "callee preserves these" rule does the work for us across
-syscalls without explicit save/restore in the `SYSCALL` expansion.
-
-x86-64 has no link register; `CALL`/`RET` macros push/pop the return address
-on the stack. On aarch64/riscv64, the prologue spills `lr` (`x30`/`ra`) to
-the stack too, so all three converge on "return address lives in
-`[sp + 0]` after prologue." This uniformity is worth the extra store on the
-register-rich arches.
-
-**Reserved scratch registers (not available to P1):** certain native
-regs are used internally by op expansions and are never exposed as P1
-registers. Every P1 op writes only what its name says it writes —
-reserved scratch is save/restored within the expansion so no hidden
-clobbers leak across op boundaries.
-
-- **Branch-target scratch (all arches).** `B`/`BEQ`/`BNE`/`BLT`/`CALL`/
- `TAIL` jump through a dedicated native reg pre-loaded via `LI_BR`:
- `x17` (ARM IP1) on aarch64, `r11` on amd64, `t5` on riscv64. The reg
- is caller-saved natively and never carries a live P1 value past the
- following branch. Treat it as existing only between the `LI_BR` that
- loads a target and the branch that consumes it.
-- **aarch64** — `x21`–`x23` hold `r1`–`r3` across the `SYSCALL` arg
- shuffle (`r4`/`r5` live in callee-saved `x26`/`x27` so the kernel
- preserves them for us). `x16` (ARM IP0) is scratch for `REM`
- (carries the `SDIV` quotient into the following `MSUB`). `x8` holds
- the syscall number.
-- **amd64** — `rcx` and the branch-target `r11` are kernel-clobbered by
- the `syscall` instruction itself. `PROLOGUE`/`EPILOGUE` use `rcx` to
- carry the retaddr across the `sub rsp, N` (can't use `r11` here — it
- is the branch-target reg, and `TAIL` = `EPILOGUE` + `jmp r11`).
- `DIV`/`REM` use `rcx` (to save `rdx` = P1 `r3`) and `r11` (to save
- `rax` = P1 `r0`) so that `idiv`'s implicit writes to rax/rdx stay
- invisible; the `r11` save is fine because no branch op can interrupt
- the DIV/REM expansion.
-- **riscv64** — `s3`,`s6`,`s7` hold `r1`–`r3` across the `SYSCALL` arg
- shuffle (`r4`/`r5` live in callee-saved `s4`/`s5`, same trick as
- aarch64). `a7` holds the syscall number.
-
-All of these are off-limits to hand-written P1 programs and are never
-mentioned in P1 source. If you see a register name not in the r0–r7 /
-sp / lr set, it belongs to an op's internal expansion.
-
-## Reading P1 source
-
-P1 has no PC-relative branch immediates (hex2 offers no label-arithmetic
-sigil — branch ranges can't be expressed in hex2 source). Every branch,
-conditional or not, compiles through the **LI_BR-indirect** pattern: the
-caller loads the target into the dedicated branch-target scratch reg
-with `LI_BR`, then the branch op jumps through it. A conditional like
-"jump to `fail` if `r1 != r2`" is three source lines:
+## Scope
+
+P1 v2 is a portable pseudo-ISA for standalone executables.
+
+P1 v2 has two width variants:
+
+- **P1v2-64** — one word is one 64-bit integer or pointer value
+- **P1v2-32** — one word is one 32-bit integer or pointer value
+
+Portable source may use any number of word arguments. The first four argument
+registers are explicit, and additional argument words are passed through a
+portable incoming stack-argument area.
+
+Portable source may directly return zero, one, or two words. Wider results
+use the portable indirect-result convention described below.
+
+## Toolchain envelope
+
+P1 v2 must be assemblable through the existing `M0` + `hex2` path, with
+`catm` as the only composition primitive between source or generated fragments.
+The spec therefore assumes only the following toolchain features:
+
+- `M0`-level `DEFINE name hex_bytes` substitution
+- raw byte emission
+- labels and label references supported by `hex2`
+- file concatenation via `catm`
+
+## Source notation
+
+This document describes instructions using ordinary assembly notation such as
+`ADD rd, ra, rb`, `LD rd, [ra + off]`, or `CALL`.
+
+Because of the toolchain constraints above, portable source does not encode
+most operands as textual instruction arguments. Instead, register choices,
+inline immediate values, and small fixed parameters are fused into opcode
+names, following the generated-table style used by `src/p1_gen.py`.
+
+So the notation in this document is descriptive rather than literal:
+
+- `ADD rd, ra, rb` means a family of fused register-specific opcodes
+- `ADDI rd, ra, imm` means a family of fused register-and-immediate-specific
+ opcodes
+- `ENTER size` means a family of fused byte-count-specific opcodes
+- `LDARG rd, idx` means a family of fused register-and-argument-slot-specific
+ opcodes
+- `BR rs`, `CALLR rs`, and `TAILR rs` mean register-specific control-flow
+ opcodes
+- `LEAVE`, `CALL`, `RET`, `TAIL`, `B`, and `SYSCALL` remain operand-free
+
+Labels still appear in source where the toolchain supports them directly, such
+as `LA rd, %label` and `LA_BR %label`.
+
+## Register Model
+
+### Exposed registers
+
+P1 v2 exposes the following source-level registers:
+
+- `a0`–`a3` — argument registers. Also caller-saved general registers.
+- `t0`–`t2` — caller-saved temporaries.
+- `s0`–`s3` — callee-saved general registers.
+- `sp` — stack pointer.
+
+### Hidden registers
+
+The backend may reserve additional native registers that are never visible in
+P1 source:
+
+- `br` — branch / call target mechanism, implemented as a dedicated hidden
+ native register on every target
+- backend-local scratch used entirely within one instruction expansion
+
+No hidden register may carry a live P1 value across an instruction boundary.
+
+## Calling Convention
+
+### Arguments and return values
+
+P1 v2 defines three result conventions: one-word direct, two-word direct, and
+indirect.
+
+In the one-word direct-result convention:
+
+- Explicit argument words 0-3 live in `a0-a3`.
+- Additional explicit argument words live in the incoming stack-argument area
+ and are read with `LDARG`.
+- On return, a one-word result lives in `a0`.
+
+In the two-word direct-result convention:
+
+- Explicit argument words 0-3 live in `a0-a3` on entry.
+- Additional explicit argument words still live in the incoming
+ stack-argument area.
+- On return, `a0` holds result word 0 and `a1` holds result word 1.
+
+In the indirect-result convention:
+
+- The caller passes a writable result buffer pointer in `a0`.
+- Explicit argument words 0–2 then live in `a1`–`a3`.
+- Additional explicit argument words still live in the incoming
+ stack-argument area.
+- On return, `a0` holds the same result buffer pointer value.
+
+In both direct-result conventions, incoming stack-argument slot `0` corresponds
+to explicit argument word `4`. In the indirect-result convention, incoming
+stack-argument slot `0` corresponds to explicit argument word `3`.
+
+The two-word direct-result convention covers common cases such as 64-bit
+integer results on 32-bit targets, two-word aggregates, and divmod-style
+returns. The indirect-result convention is the portable way to return any
+result wider than two words.
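+
+As a rough illustration of the slot arithmetic above, the following Python
+sketch maps an explicit argument word index to its location under the direct
+and indirect result conventions. The name `arg_location` and the tuple encoding
+are hypothetical, chosen only for this example; they are not part of P1 v2 or
+of `src/p1_gen.py`.
+
```python
ARG_REGS = ("a0", "a1", "a2", "a3")

def arg_location(word_index, indirect_result=False):
    """Return ("reg", name) or ("stack", slot) for an explicit argument word.

    Hypothetical helper: it models only the placement rules stated in the text.
    """
    if word_index < 0:
        raise ValueError("argument word index must be non-negative")
    # The indirect-result convention spends a0 on the result buffer pointer,
    # so explicit argument words start at a1.
    first_reg = 1 if indirect_result else 0
    reg_words = len(ARG_REGS) - first_reg
    if word_index < reg_words:
        return ("reg", ARG_REGS[first_reg + word_index])
    # Remaining words land in the incoming stack-argument area, slot 0 first.
    return ("stack", word_index - reg_words)
```
+
+Under this model, explicit argument word `4` maps to slot `0` in both
+direct-result conventions, and explicit argument word `3` maps to slot `0` in
+the indirect-result convention.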
+
+### Register preservation
+
+Caller-saved:
+
+- `a0`–`a3`
+- `t0`–`t2`
+
+Callee-saved:
+
+- `s0`–`s3`
+- `sp`
+
+### Call semantics
+
+A call is valid from any function, whether or not it has established a frame;
+call / return correctness does not depend on establishing one first.
+
+If a function needs any incoming argument after making a call, it must save it
+before the call. This matters in particular for `a0`, which is overwritten by
+every convention's return value, and for `a1` when the callee uses the two-word
+direct-result convention.
+
+A call that passes any stack argument words requires the caller to have an
+active standard frame with enough frame-local storage to stage those outgoing
+words.
+
+The return address is hidden machine state. Portable source must not assume
+that it lives in any exposed register.
+
+## Stack Convention
+
+### Call-boundary rule
+
+At every call boundary, the backend must satisfy the native C ABI stack
+alignment rule for the target architecture.
+
+Portable source must therefore treat raw function-entry `sp` as opaque. It may
+not assume that the low bits of `sp` have the same meaning on all targets
+before a frame is established.
+
+### Incoming stack-argument area
+
+P1 v2 defines an abstract incoming stack-argument area for explicit argument
+words that do not fit in registers.
+
+- Slot `0` is the first stack-passed explicit argument word.
+- Slots are word-indexed, not byte-indexed.
+- Portable source may access this area only through `LDARG`.
+
+`LDARG` is valid only when the current function has an active standard frame.
+Therefore, a function that needs any incoming stack argument must establish a
+standard frame before its first `LDARG`.
+
+Portable source must not assume any direct relationship between incoming
+argument slots and raw function-entry `sp`. In particular, source must not try
+to reconstruct stack arguments by manually indexing from `sp`; backend entry
+layouts differ across targets.
+
+For a call with `m` stack-passed explicit argument words, the caller stages
+those words in the first `m` words of its frame-local storage immediately
+before the call:
```
-P1_LI_BR
-&fail
-P1_BNE_R1_R2
+[sp + 2*WORD + 0*WORD] = outgoing arg word 0
+[sp + 2*WORD + 1*WORD] = outgoing arg word 1
+...
```
-`LI_BR` writes a reserved native reg (`x17`/`r11`/`t5` — see Register
-mapping), not a P1 GPR. The branch op that follows consumes it and
-jumps. `CALL` and `TAIL` follow the same shape
-(`LI_BR &callee ; P1_CALL`).
+At callee entry, those staged words become incoming argument slots `0..m-1`.
+The backend is responsible for mapping between the caller's frame layout and
+the callee's abstract incoming argument slots.
-The branch-target reg is owned by the branch machinery: never carry a
-live value across a branch in it. Since it isn't a P1 reg, this is
-automatic — there's no P1-level way to read or write it outside
-`LI_BR`.
+Portable code that needs both ordinary locals and stack-passed outgoing
+arguments must reserve enough total frame-local storage and keep the
+low-addressed prefix available for outgoing argument staging across the call.
-## Instruction set (~30 ops)
+### Standard frame layout
+
+Functions that need local stack storage use a standard frame layout. After
+frame establishment:
```
-# 3-operand arithmetic (reg forms)
-ADD rD, rA, rB SUB rD, rA, rB
-AND rD, rA, rB OR rD, rA, rB XOR rD, rA, rB
-SHL rD, rA, rB SHR rD, rA, rB SAR rD, rA, rB
-MUL rD, rA, rB DIV rD, rA, rB REM rD, rA, rB
-
-# Immediate forms (signed 12-bit)
-ADDI rD, rA, !imm ANDI rD, rA, !imm ORI rD, rA, !imm
-SHLI rD, rA, !imm SHRI rD, rA, !imm SARI rD, rA, !imm
-
-# Moves
-MOV rD, rA # reg-to-reg (rA may be sp)
-LI rD, %label # load 64-bit literal from pool
-LA rD, %label # load PC-relative address
-
-# Memory (offset is signed 12-bit)
-LD rD, rA, !off ST rS, rA, !off # 64-bit
-LB rD, rA, !off SB rS, rA, !off # 8-bit zero-extended / truncated
-
-# Control flow
-B %label # unconditional branch
-BEQ rA, rB, %label BNE rA, rB, %label
-BLT rA, rB, %label # signed less-than
-CALL %label RET
-PROLOGUE EPILOGUE # frame setup / teardown (see Semantics)
-TAIL %label # tail call: epilogue + B %label
-
-# System
-SYSCALL # num in r0, args r1-r6, ret in r0
+[sp + 0*WORD] = saved return address
+[sp + 1*WORD] = saved caller stack pointer
+[sp + 2*WORD ... sp + 2*WORD + local_bytes - 1] = frame-local storage
+...
```
-### Semantics
-
-- All arithmetic is on 64-bit values. `SHL`/`SHR`/`SAR` take shift amount in
- the low 6 bits of `rB` (or the `!imm` for immediate forms).
-- `DIV` is signed, truncated toward zero. `REM` matches `DIV`.
-- `LB` zero-extends the loaded value into the 64-bit destination.
- (A signed-extending variant `LBS` can be added later if needed. 32-bit
- `LW`/`SW` are deliberately omitted — emulate with `LD`+`ANDI`/shift and
- `ST` through a 64-bit scratch when needed.)
-- Unsigned comparisons (`BLTU`/`BGEU`) are not in the ISA: seed programs
- with tagged-cell pointers only need signed comparisons. Synthesize from
- `BLT` via operand-bias if unsigned compare is ever required.
-- `BGE rA, rB, %L` is not in the ISA: synthesize as
- `BLT rA, rB, %skip; B %L; :skip` (the LI_BR-indirect branch pattern
- makes the skip cheap). `BLT rB, rA, %L` handles the strict-greater
- case.
-- Branch offsets are PC-relative. In the v0.1 spike they are realized by
- loading the target address via `LI_BR` into the reserved branch-target
- reg and jumping through it; range is therefore unbounded within the
- 4 GiB address space. Native-encoded branches (with tighter range
- limits) are an optional future optimization.
-- `MOV rD, rA` copies `rA` into `rD`. The source may be `sp` (read the
- current stack pointer into a GPR — used e.g. for stack-balance assertions
- around a call tree). The reverse (`MOV sp, rA`) is not provided; `sp`
- is only mutated by `PROLOGUE`/`EPILOGUE`.
-- `CALL %label` transfers control to `%label` with a return address
- established such that a subsequent `RET` returns to the instruction
- after the `CALL`. The storage location of that return address is
- implementation-defined (stack on amd64, link register on
- aarch64/riscv64) and **must be treated as volatile across any inner
- `CALL`**.
-
- Concrete rule: **a function that itself executes a `CALL` must wrap
- its body in a matching `PROLOGUE`/`EPILOGUE` pair.** `PROLOGUE` is
- what spills the incoming return address into the frame; `EPILOGUE`
- restores it so `RET` can find it.
-
- Leaf functions (no `PROLOGUE`) are permitted and may be called
- normally: `CALL leaf` sets up the return address, the leaf's `RET`
- uses it, control returns to the caller. The restriction is only on
- what a leaf may itself do:
-
- - **RET** — returns to whoever established the current return
- address. Usually the direct `CALL`er; in the tail-branch case
- below, whoever `CALL`ed the outermost caller in the chain.
- - **Tail-branch** (`li_br &target ; B`) to another function — the
- target's own `PROLOGUE`/`EPILOGUE` preserves the current return
- address across the target's body, so the target's `RET` returns
- directly to the leaf's caller, skipping the leaf in the return
- chain.
- - **`CALL`** — forbidden. The inner `CALL` clobbers the return
- address slot (on arches where it's a register, not a stack
- push), so the leaf's subsequent `RET` branches to itself.
-
- The failure mode of a leaf `CALL` is platform-asymmetric: amd64's
- native `CALL` pushes onto the stack so a prologue-less `CALL ; RET`
- happens to work; aarch64 and riscv64 write the return address to a
- link register and hang silently. Don't write code that relies on
- the amd64-happens-to-work behavior.
-
- `RET` pops / branches through the return address.
-- `PROLOGUE` / `EPILOGUE` set up and tear down a frame with **k
- callee-private scratch slots**. `PROLOGUE` is shorthand for
- `PROLOGUE_N1` (one slot); `PROLOGUE_Nk` for k = 2, 3, 4 reserves that
- many slots. After `PROLOGUE_Nk`:
-
- ```
- [sp + 0] = caller's return address
- [sp + 8] = slot 1 (callee-private scratch)
- [sp + 16] = slot 2 (k >= 2)
- [sp + 24] = slot 3 (k >= 3)
- [sp + 32] = slot 4 (k >= 4)
- ```
-
- Each slot is private to the current frame: a nested `PROLOGUE`
- allocates its own slots, so the parent's spills survive unchanged.
- Frame size is `round_up_16(8 + 8*k)`, so k=1→16, k=2→32 (with 8
- bytes of padding past slot 2), k=3→32, k=4→48. `EPILOGUE_Nk` /
- `TAIL_Nk` must match the `PROLOGUE_Nk` of the enclosing function.
-
- Why multiple slots: constructors like `cons(car, cdr)` keep several
- live values across an inner `alloc()` call. One scratch cell isn't
- enough, and parking overflow in BSS would break the step-9 mark-sweep
- GC (which walks the stack for roots). Per-frame slots keep every live
- value on the walkable stack.
-
- Per-arch mechanics differ — aarch64/riscv64 `PROLOGUE` subtracts the
- frame size from `sp` and stores `lr`/`ra` at `[sp + 0]`; amd64 pops
- the retaddr native `call` already pushed into a non-P1 scratch
- (`rcx`), subtracts the frame size, then re-pushes it so the final
- layout matches. (`rcx` rather than `r11`, because `r11` is the
- branch-target reg and `TAIL` would otherwise clobber its own
- destination mid-epilogue.) Access slots via `MOV rX, sp` followed by
- `LD rY, rX, <off>` / `ST rY, rX, <off>`; `sp` itself isn't a valid
- base for `LD`/`ST`.
-- `TAIL %label` is a tail call — it performs the current function's
- standard epilogue (restore `lr` from `[sp+0]`, pop the frame) and then
- branches unconditionally to `%label`, reusing the caller's return
- address instead of pushing a new frame. The current function must be
- using the standard prologue. Interpreter `eval` loops rely on `TAIL`
- to recurse on sub-expressions without growing the stack.
-- `SYSCALL` is a single opcode in P1 source. Each arch's defs file expands it
- to the native syscall sequence, including the register shuffle from P1's
- `r0`=num, `r1`–`r6`=args convention into the platform's native convention
- if different.
-
-## Encoding strategy
-
-For each `(op, register-tuple)` combination, emit one `DEFINE` per arch. A
-generator script produces the full defs file; no hand-encoding per entry.
-
-Example — `ADD r0, r1, r2`:
+Frame-local storage is byte-addressed. Portable code may use it for ordinary
+locals, spilled callee-saved registers, and the caller-staged outgoing
+stack-argument words described above.
-```
-# p1_riscv64.M1
-DEFINE P1_ADD_R0_R1_R2 33056000 # add a0, a1, a2 (little-endian)
+Total frame size is:
-# p1_aarch64.M1
-DEFINE P1_ADD_R0_R1_R2 2000028B # add x0, x1, x2
+`round_up(STACK_ALIGN, 2*WORD_SIZE + local_bytes)`
-# p1_amd64.M1 (2-op destructive — expands to mov + add)
-DEFINE P1_ADD_R0_R1_R2 4889F84801F0 # mov rax, rdi ; add rax, rsi
-```
+Where (`WORD` in the layout listings abbreviates `WORD_SIZE`):
-### Combinatorial footprint
+- `WORD_SIZE = 8` in P1v2-64
+- `WORD_SIZE = 4` in P1v2-32
+- `STACK_ALIGN` is target-defined and must satisfy the native call ABI
-Per-arch defs count (immediates handled by sigil, not enumerated):
+Leaf functions that need no frame-local storage may omit the frame entirely.
-- 11 reg-reg-reg arith × 8 `rD` × 8 `rA` × 8 `rB` = 704. Pruned to ~600 by
- removing trivially-equivalent tuples.
-- 6 immediate arith × 8² = 384. Each entry uses an immediate sigil (`!imm`),
- so the immediate value itself is not enumerated.
-- 3 move ops × 8 or 8² (plus +8 for the `MOV rD, sp` variant) = ~88.
-- 4 memory ops × 8² = 256. Offsets use `!imm` sigil.
-- 3 conditional branches × 8² = 192.
-- Singletons (`B`, `CALL`, `RET`, `PROLOGUE`, `EPILOGUE`, `TAIL`, `SYSCALL`) = 7.
+### Frame invariants
-Total ≈ 1210 defines per arch. Template-generated.
+- A function that allocates a frame must restore `sp` before returning.
+- Callee-saved registers modified by the function must be restored before
+ returning.
+- The standard frame layout is the only frame shape recognized by P1 v2.
-## Syscall conventions
+## Op Set Summary
-Linux syscall mechanics differ across arches. The `SYSCALL` macro hides this.
+| Category | Operations |
+|----------|------------|
+| Materialization | `LI rd, imm`, `LA rd, %label`, `LA_BR %label` |
+| Moves | `MOV rd, rs`, `MOV rd, sp` |
+| Arithmetic | `ADD`, `SUB`, `AND`, `OR`, `XOR`, `SHL`, `SHR`, `SAR`, `MUL`, `DIV`, `REM` |
+| Immediate arithmetic | `ADDI`, `ANDI`, `ORI`, `SHLI`, `SHRI`, `SARI` |
+| Memory | `LD`, `ST`, `LB`, `SB` |
+| ABI access | `LDARG` |
+| Branching | `B`, `BR`, `BEQ`, `BNE`, `BLT`, `BLTU`, `BEQZ`, `BNEZ`, `BLTZ` |
+| Calls / returns | `CALL`, `CALLR`, `RET`, `TAIL`, `TAILR` |
+| Frame management | `ENTER`, `LEAVE` |
+| System | `SYSCALL` |
-| Arch | Insn | Num reg | Arg regs (plat ABI) |
-|----------|-----------|---------|------------------------------|
-| amd64 | `syscall` | `rax` | `rdi, rsi, rdx, r10, r8, r9` |
-| aarch64 | `svc #0` | `x8` | `x0 – x5` |
-| riscv64 | `ecall` | `a7` | `a0 – a5` |
+## Immediates
-**Observable semantics:** `SYSCALL` takes the number in `r0` and args in
-`r1`–`r6`, traps, and returns the kernel's result in `r0`. **Only `r0` is
-clobbered.** `r1`–`r7` are preserved across `SYSCALL` on every arch. This
-matches the kernel's own register discipline and lets callers thread live
-values through syscalls without per-arch save/restore dances.
+Immediate operands appear only in instructions that explicitly admit them.
+Portable source has three immediate classes:
-The per-arch expansions:
+- **Inline integer immediate** — a signed 12-bit assembly-time constant in the
+ range `-2048..2047`
+- **Materialized word value** — a full one-word assembly-time constant loaded
+ with `LI`
+- **Materialized address** — the address of a label loaded with `LA`
-- **amd64** — P1 args already occupy the native arg regs except for args
- 4/5/6. Three shuffle moves cover those: `mov r10, r13` (arg4 = P1 `r4`),
- `mov r8, r14` (arg5 = P1 `r5`), `mov r9, rbx` (arg6 = P1 `r6`); then
- `syscall`. The kernel preserves everything except `rax`, `rcx`, `r11`,
- and `rax` = P1 `r0` is the only visible clobber.
-- **aarch64** — native arg regs are `x0`–`x5` but P1 puts args in
- `x1`–`x3`,`x26`,`x27`,`x19` (the three caller-saved arg regs one slot
- higher, plus three callee-saved for `r4`–`r6`). The expansion saves
- P1 `r1`–`r3` into `x21`–`x23`, shuffles them and `r4`/`r5`/`r6` down
- into `x0`–`x5`, moves the number into `x8`, `svc #0`s, then restores
- `r1`–`r3` from `x21`–`x23`. No save/restore of `r4`/`r5` is needed
- because they live in callee-saved natives that the kernel preserves.
-- **riscv64** — same shape as aarch64, with `s3`/`s6`/`s7` as the `r1`–
- `r3` save slots, `s4`/`s5` already holding `r4`/`r5`, and `a7` as the
- number register.
+P1 v2 also uses two structured assembly-time operands:
-The extra moves on aarch64/riscv64 are a few nanoseconds per syscall.
-Trading them for uniform "clobbers `r0` only" semantics is worth it:
-callers don't need to memorize a per-arch clobber set.
+- **Frame-local byte count** — a non-negative byte count used by `ENTER`
+- **Argument-slot index** — a non-negative word-slot index used by `LDARG`
-### Syscall numbers
+`LI rd, imm` loads the one-word integer value `imm`.
-Linux uses two syscall tables relevant here:
+`LA rd, %label` loads the address of `%label` as a one-word pointer value.
-- **amd64**: amd64-specific table (`write = 1`, `exit = 60`, …).
-- **aarch64 and riscv64**: generic table (`write = 64`, `exit = 93`, …).
+The backend may realize `LI` and `LA` using native immediates, literal pools,
+multi-instruction sequences, or other backend-private mechanisms.
-P1 programs use symbolic constants (`SYS_WRITE`, `SYS_EXIT`) defined per-arch:
+Backends may assume labels fit in 32 bits when realizing `LA` and `LA_BR`.
+This reflects the stage0 image layout (`hex2-0` base `0x00600000`, programs
+well under 4 GiB), not a portable-ISA-level guarantee. Backends that target
+images loaded above the 4 GiB boundary must adjust their `LA` / `LA_BR`
+lowering. `LI` makes no such assumption — it materializes any one-word value.
-```
-# p1_amd64.M1
-DEFINE SYS_WRITE 01000000
-DEFINE SYS_EXIT 3C000000
+## Control Flow
-# p1_aarch64.M1 and p1_riscv64.M1
-DEFINE SYS_WRITE 40000000
-DEFINE SYS_EXIT 5D000000
-```
+### Call / Return / Tail Call
-(The encodings shown are placeholder little-endian 32-bit immediates; real
-values are inlined as operands to `LI` or `ADDI`.)
+Control-flow targets are materialized with `LA_BR %label`, which loads
+`%label` into the hidden branch-target mechanism `br`. The immediately
+following control-flow op consumes that target.
-## Program layout
+`CALL` transfers control to the target most recently loaded by `LA_BR` and
+establishes a return continuation such that a subsequent `RET` returns to the
+instruction after the `CALL`. `CALL` is valid whether or not the caller has
+established a standard frame, except that any call using stack-passed argument
+words requires an active standard frame to hold the staged outgoing words.
-Each P1 object file is structured as:
+`CALLR rs` is the register-indirect form of `CALL`. It transfers control to
+the code pointer value held in `rs` and establishes the same return
+continuation semantics as `CALL`.
-```
-<ELF header, per arch>
-<code section>
- <function prologues, bodies, epilogues>
-<constant pool>
- pool_label_1: &0xDEADBEEFCAFEBABE
- pool_label_2: &0x00000000004004C0
- ...
-<data section>
- <static bytes>
-```
+`RET` returns through the current return continuation. `RET` is valid whether
+or not the current function has established a standard frame, provided any
+frame established by the function has already been torn down.
-`LI rD, %pool_label_N` issues a PC-relative load; the pool must be reachable
-within the relocation's range (≤±1 MiB for aarch64 `LDR` literal, ≤±2 GiB for
-riscv `AUIPC`+`LD`, unlimited for x86 `mov rD, [rip + rel32]` within 2 GiB).
+`TAIL` is a tail call to the target most recently loaded by `LA_BR`. It is
+valid only when the current function has an active standard frame. `TAIL`
+performs the standard epilogue for the current frame and then transfers control
+to the loaded target without creating a new return continuation. The callee
+therefore returns directly to the current function's caller.
-For programs under a few MiB, a single pool per file is fine. For larger
-programs, emit a pool per function.
+`TAILR rs` is the register-indirect form of `TAIL`. It is valid only when the
+current function has an active standard frame.
-## Data alignment
+Because stack-passed outgoing argument words are staged in the caller's own
+frame-local storage, `TAIL` and `TAILR` are portable only when the tail-called
+callee requires no stack-passed argument words. Portable compilers must lower
+other tail-call cases to an ordinary `CALL` / `RET` sequence.
-**Labels have no inherent alignment.** A label's runtime address is
-`ELF_base + (cumulative bytes emitted before the label)`. Neither M1 nor
-hex2 offers an `.align` directive or any other alignment control — the
-existing hex2 sigils (`: ! @ $ ~ % &` and the `>` base override) cover
-labels and references, not padding. And because the cumulative byte count
-between the ELF header and any label varies per arch (different SYSCALL
-expansions, different branch encodings, different PROLOGUE sizes), the
-same label lands at a different low-3-bits offset on each target.
+Portable source must treat the return continuation as hidden machine state. It
+must not assume that the return address lives in any exposed register or stack
+location except as defined by the standard frame layout after frame
+establishment.
-Concretely: `heap_start` in a program that builds identically for all
-three arches can land at `0x...560` (aligned) on aarch64, `0x...2CB`
-(misaligned) on amd64, and `0x...604` (misaligned) on riscv64. If the
-program then tags pair pointers by ORing bits into the low 3, the tag
-collides with pointer bits on the misaligned arches and every pair is
-corrupt.
+### Prologue / Epilogue
-Programs that care about alignment therefore align **at boot, in code**:
+P1 v2 defines the following frame-establishment and frame-teardown operations:
+
+- `ENTER size`
+- `LEAVE`
+
+`ENTER size` establishes the standard frame layout with `size` bytes of
+frame-local storage:
```
-P1_LI_R4
-&heap_next
-P1_LD_R0_R4_0
-P1_ORI_R0_R0_7 ## x |= 7
-P1_ADDI_R0_R0_1 ## x += 1 → x rounded up to next 8-aligned
-P1_ST_R0_R4_0
+[sp + 0*WORD] = saved return address
+[sp + 1*WORD] = saved caller stack pointer
+[sp + 2*WORD ... sp + 2*WORD + size - 1] = frame-local storage
```
-The `(x | mask) + 1` idiom rounds any pointer up to `mask + 1`. Use
-`mask = 7` for 8-byte alignment (tagged pointers with a 3-bit tag),
-`mask = 15` for 16-byte alignment (cache lines, `malloc`-style).
-
-**Allocator contract.** Any allocator that returns cells eligible to be
-tagged (pair, closure, vector, …) MUST return pointers aligned to at
-least the tag width. The low tag bits are architecturally unowned by
-the allocator — they belong to the caller to stamp a tag into.
-
-**Caller contract.** Callers of bump-style allocators must pass sizes
-that are multiples of the alignment. For the step-2 bump allocator
-that's 8-byte multiples; the caller rounds up. A mature allocator
-(step 9 onward) rounds internally, but the current one trusts the
-caller.
-
-## Staged implementation plan
-
-1. **Spike across all three arches.** *Done.* `lispcc/hello.M1` and
- `lispcc/demo.M1` run on aarch64, amd64, and riscv64 via existing
- `M1` + `hex2_linker` (amd64, aarch64) / `hex2_word` (riscv64). Ops
- demonstrated: `LI`, `SYSCALL`, `MOV`, `ADD`, `SUB`. The aarch64
- `hex2_word` extensions in the work list above were *not* needed —
- the inline-data `LI` trick sidesteps them. Order was reversed from
- the original plan: aarch64 first (where the trick was designed),
- then amd64 and riscv64.
-2. **Broaden the demonstrated op set.** *Done.* `demo.M1` exercises
- control flow (`B`, `BEQ`, `BNE`, `BLT`, `CALL`, `RET`, `TAIL`),
- loads/stores (`LD`/`ST`/`LB`/`SB`), and the full
- arithmetic/logical/shift/mul-div set across tranches 1–5. All
- reachable with stock hex2; no extensions required.
-3. **Generator for the ~30-op × register matrix.** *Done.*
- `p1_gen.py` is the single source of truth for all three
- `p1_<arch>.M1` defs files. Each row is an `(op, reg-tuple, imm)`
- triple; per-arch encoders lower rows to native bytes. Includes the
- N-slot `PROLOGUE_Nk` / `EPILOGUE_Nk` / `TAIL_Nk` variants (k=1..4).
- Regenerate with `make gen`; CI-check freshness with `make check-gen`.
-4. **Cross-arch differential harness.** Assemble each P1 source three
- ways and diff runtime behavior. Currently eyeballed via
- `make run-all`.
-5. **Write something real.** *In progress.* `lisp.M1` is the seed Lisp
- interpreter target (cons, car, cdr, eq, atom, cond, lambda, quote)
- running identically on all three arches. Step 2 (cons/car/cdr +
- tagged values) landed; the remaining staged steps live in
- `LISP.md`.
-
-## Open questions
-
-- **Can we reuse hand-written `SYSCALL`/syscall-number conventions already in
- stage0's arch ports?** Probably yes — adopt the conventions already in
- `M2libc/<arch>/` to minimize surprise.
-- **Signed-extending loads.** Skipped for v1 — add `LBS`, `LWS` if the Lisp
- interpreter needs them.
-- **Atomic / multi-core.** Not in scope. Seed interpreters are single-
- threaded.
-- **Debug info.** `blood-elf` generates M1-format debug tables; we'd need to
- decide whether P1 flows through it unchanged. Likely yes since P1 is just
- another M1 source.
-- **x86-32 / armv7l / riscv32 support.** Requires narrowing the register
- model and splitting word size. Defer.
+The total allocation size is:
-## Scope
+`round_up(STACK_ALIGN, 2*WORD_SIZE + size)`
+
+The named frame-local bytes are the usable local storage. Any additional bytes
+introduced by alignment rounding are padding, not extra local bytes.
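+
+The allocation formula can be sketched in Python as follows. This is a minimal
+model, not backend code: `frame_allocation` is a hypothetical name, and the
+default `stack_align=16` is only an example value, since `STACK_ALIGN` is
+target-defined.
+
```python
def round_up(align, value):
    """Round value up to the next multiple of align (align > 0)."""
    return (value + align - 1) // align * align

def frame_allocation(size, word_size=8, stack_align=16):
    """Return (total_bytes, padding_bytes) for ENTER with `size` local bytes.

    stack_align=16 is an example value only; STACK_ALIGN is target-defined.
    """
    if size < 0:
        raise ValueError("frame-local byte count must be non-negative")
    header = 2 * word_size              # saved return address + saved caller sp
    total = round_up(stack_align, header + size)
    return total, total - (header + size)
```
+
+With these example parameters, `frame_allocation(8)` gives a 32-byte allocation
+of which 8 bytes are padding.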
+
+`LEAVE` tears down the current standard frame and restores the hidden return
+continuation so that a subsequent `RET` returns correctly.
+
+Because every standard frame stores the saved caller stack pointer at
+`[sp + 1*WORD]`, `LEAVE` does not need to know the frame-local byte count used
+by the corresponding `ENTER`.
+
+A function may omit `ENTER` / `LEAVE` entirely if it is a leaf and needs no
+standard frame.
+
+`ENTER` and `LEAVE` do not implicitly save or restore the callee-saved
+registers `s0`–`s3`. A function that modifies any of them must preserve it
+explicitly, typically by storing it in frame-local storage within its standard
+frame.
+
+### Branching
+
+P1 v2 branch targets are carried through the hidden branch-target mechanism
+`br`. Portable source may load `br` only through:
+
+- `LA_BR %label` — materialize the address of `%label` as the next branch, call,
+ or tail-call target
+
+No branch, call, or tail opcode takes a label operand directly. Portable source
+must treat `br` as owned by the control-flow machinery. No live value may be
+carried in `br`. Each `LA_BR` must be consumed by the immediately following
+branch, call, or tail op, and portable source must not rely on `br` surviving
+across any other instruction.
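+
+One way to picture the pairing rule is as a lint pass over a flat list of
+mnemonics. The sketch below is an assumption-laden illustration, not an
+existing tool: `check_la_br_pairing` is a hypothetical name, and flagging a
+`br`-consuming op that lacks an immediately preceding `LA_BR` is a conservative
+reading of the rule above.
+
```python
# Ops that consume the hidden branch target loaded by LA_BR.
# BR, CALLR, and TAILR take their target from a register and are not listed.
BR_CONSUMERS = {"B", "BEQ", "BNE", "BLT", "BLTU", "BEQZ", "BNEZ", "BLTZ",
                "CALL", "TAIL"}

def check_la_br_pairing(mnemonics):
    """Return a list of (index, message) violations of the LA_BR pairing rule."""
    problems = []
    for i, op in enumerate(mnemonics):
        prev = mnemonics[i - 1] if i > 0 else None
        nxt = mnemonics[i + 1] if i + 1 < len(mnemonics) else None
        if op == "LA_BR" and nxt not in BR_CONSUMERS:
            problems.append((i, "LA_BR not consumed by the next op"))
        if op in BR_CONSUMERS and prev != "LA_BR":
            problems.append((i, op + " without an immediately preceding LA_BR"))
    return problems
```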
+
+The portable branch families are:
+
+- `B` — unconditional branch to the target in `br`
+- `BR rs` — unconditional branch to the code pointer in `rs`
+- `BEQ`, `BNE`, `BLT`, `BLTU` — conditional branch to the target in `br`
+- `BEQZ`, `BNEZ`, `BLTZ` — conditional branch to the target in `br` using zero
+ as the second operand
+
+`BLT` and `BLTZ` perform signed comparisons on one-word values. `BLTU`
+performs an unsigned comparison on one-word values; there is no unsigned
+zero-operand variant because `x < 0` is always false under unsigned
+interpretation.
+
+If a branch condition is true, control transfers to the target currently held in
+`br`. If the condition is false, execution falls through to the next
+instruction.
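+
+The difference between the signed and unsigned comparisons can be modelled on
+raw word bit patterns. The helper names below are hypothetical and serve only
+to illustrate the semantics described above.
+
```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def blt_taken(ra, rb, bits=64):
    """Signed less-than, as BLT compares."""
    return to_signed(ra, bits) < to_signed(rb, bits)

def bltu_taken(ra, rb, bits=64):
    """Unsigned less-than, as BLTU compares."""
    mask = (1 << bits) - 1
    return (ra & mask) < (rb & mask)
```
+
+An all-ones word is `-1` under `BLT` and the largest value under `BLTU`, so it
+compares below zero only in the signed case.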
+
+## Data Ops
+
+### Arithmetic
+
+P1 v2 defines the following arithmetic and bitwise operations on one-word
+values:
+
+- register-register: `ADD`, `SUB`, `AND`, `OR`, `XOR`, `SHL`, `SHR`, `SAR`,
+ `MUL`, `DIV`, `REM`
+- immediate: `ADDI`, `ANDI`, `ORI`, `SHLI`, `SHRI`, `SARI`
+
+For `ADD`, `SUB`, `MUL`, `AND`, `OR`, and `XOR`, computation is modulo the
+active word size.
+
+`SHL` shifts left and discards high bits. `SHR` is a logical right shift and
+zero-fills. `SAR` is an arithmetic right shift and sign-fills.
+
+For register-count shifts, only the low `5` bits of the shift count are
+observed in `P1v2-32`, and only the low `6` bits are observed in `P1v2-64`.
+
+Immediate-form shifts use inline immediates in the range `0..31` in `P1v2-32`
+and `0..63` in `P1v2-64`.
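+
+A small Python model of the three shifts, assuming values are held as unsigned
+bit patterns and `bits` is 32 or 64. The function names are illustrative only.
+
```python
def _mask(bits):
    return (1 << bits) - 1

def shl(value, count, bits=64):
    count &= bits - 1                       # only the low 5 or 6 count bits are observed
    return (value << count) & _mask(bits)   # high bits are discarded

def shr(value, count, bits=64):
    count &= bits - 1
    return (value & _mask(bits)) >> count   # logical: zero-fills

def sar(value, count, bits=64):
    count &= bits - 1
    value &= _mask(bits)
    if value >> (bits - 1):                 # sign bit set: reinterpret as negative
        value -= 1 << bits
    return (value >> count) & _mask(bits)   # arithmetic: sign-fills
```
+
+Note that a register count of `65` in `P1v2-64` behaves like a count of `1`,
+because only the low 6 bits are observed.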
+
+`DIV` is signed division on one-word two's-complement values and truncates
+toward zero. `REM` is the corresponding signed remainder.
+
+Division by zero is outside the portable contract. The overflow case
+`MIN_INT / -1` is also outside the portable contract, as is the corresponding
+remainder case.
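+
+Truncating division differs from Python's floor-dividing `//` when the operand
+signs differ, so a model needs a small adjustment. The sketch below assumes
+operands are already signed Python integers in the one-word range; `div_rem` is
+a hypothetical name, and raising `ValueError` is just one way to mark the cases
+that fall outside the portable contract.
+
```python
def div_rem(a, b, bits=64):
    """Signed DIV/REM on values in the signed one-word range.

    Raises ValueError for the cases the text places outside the portable contract.
    """
    min_int = -(1 << (bits - 1))
    if b == 0 or (a == min_int and b == -1):
        raise ValueError("outside the portable contract")
    q = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        q = -q                      # truncate toward zero, unlike Python's //
    return q, a - q * b             # remainder takes the sign of the dividend
```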
-- **Defs files**: ~1500 entries × 3 arches, generator-driven.
-- **Testing**: shared harness that assembles each P1 source three ways
- and diffs runtime behavior.
+### Moves
+
+P1 v2 defines the following move and materialization operations:
+
+- `MOV` — register-to-register copy
+- `LI` — load one-word integer constant
+- `LA` — load label address
+
+`MOV` may copy from any exposed general register to any exposed general
+register.
+
+Portable source may also read the current stack pointer through `MOV rd, sp`.
+
+Portable source may not write `sp` through `MOV`. Stack-pointer updates are only
+performed by `ENTER`, `LEAVE`, and backend-private call/return machinery.
+
+`LI` materializes an integer bit-pattern. `LA` materializes the address of a
+label. `LA_BR` is a separate control-flow-target materialization form and is not
+part of the general move family.
+
+### Memory
+
+P1 v2 defines the following memory-access operations:
+
+- `LD`, `ST` — one-word load and store
+- `LB`, `SB` — byte load and store
+- `LDARG` — one-word load from the incoming stack-argument area
+
+`LD` and `ST` access one full word: 4 bytes in `P1v2-32` and 8 bytes in
+`P1v2-64`.
+
+`LB` loads one byte and zero-extends it to a full word. `SB` stores the low
+8 bits of the source value.
+
+Memory offsets use signed 12-bit inline immediates.
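+
+The byte-access and offset rules can be sketched against a flat `bytearray`.
+This is an illustrative model under the assumption that memory is a simple
+byte array indexed by address; the helper names are hypothetical.
+
```python
def fits_inline_imm(value):
    """True if value is a signed 12-bit inline immediate (-2048..2047)."""
    return -2048 <= value <= 2047

def lb(memory, base, offset):
    """Byte load: the result is zero-extended to a full word."""
    assert fits_inline_imm(offset)
    return memory[base + offset]        # bytearray items are already 0..255

def sb(memory, base, offset, value):
    """Byte store: only the low 8 bits of the source value are written."""
    assert fits_inline_imm(offset)
    memory[base + offset] = value & 0xFF
```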
+
+The base address for a memory access may be any exposed general register or
+`sp`.
+
+`LDARG rd, idx` loads incoming stack-argument slot `idx`, where slot `0` is the
+first stack-passed explicit argument word. `idx` is word-indexed, not
+byte-indexed. `LDARG` is an ABI access, not a general memory operation; it does
+not expose or imply any raw `sp`-relative layout at function entry.
+
+`LDARG` is valid only when the current function has an active standard frame.
+
+Portable source must not assume that labels are aligned beyond what is
+explicitly established by the program itself. Portable code should use
+naturally aligned addresses for `LD` and `ST`. Unaligned word accesses are
+outside the portable contract. Byte accesses have no additional alignment
+requirement.
+
+## System
+
+`SYSCALL` is part of the portable ISA surface.
+
+At the portable level, the syscall convention is:
+
+- `a0` = syscall number on entry, return value on exit
+- `a1`, `a2`, `a3`, `t0`, `s0`, `s1` = syscall arguments 0 through 5
+
+At the portable level, `SYSCALL` clobbers only `a0`. All other exposed
+registers are preserved across the syscall.
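+
+For illustration, the register assignment before a `SYSCALL` can be written as
+a small table-driven helper. `syscall_register_setup` is a hypothetical name,
+and syscall numbers passed to it are target-defined, as noted below.
+
```python
SYSCALL_ARG_REGS = ("a1", "a2", "a3", "t0", "s0", "s1")

def syscall_register_setup(number, *args):
    """Return {register: value} to establish before SYSCALL."""
    if len(args) > len(SYSCALL_ARG_REGS):
        raise ValueError("at most six syscall arguments")
    regs = {"a0": number}                   # a0 also receives the return value
    regs.update(zip(SYSCALL_ARG_REGS, args))
    return regs
```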
+
+The mapping from symbolic syscall names to numeric syscall identifiers is
+target-defined. The set of syscalls available to a given program is likewise
+specified outside the core P1 v2 ISA, for example by a target profile or
+runtime interface document.
+
+## Target notes
+
+- `a0` is argument 0, the one-word direct return-value register, the low word
+ of the two-word direct return pair, and the indirect-result buffer pointer.
+- On aarch64, riscv64, arm32, and rv32, that matches the native integer/pointer
+ ABI directly.
+- On amd64, the backend must translate between portable `a0` and native
+ return register `rax` at call and return boundaries. For the two-word direct
+ return, the backend must also translate `a1` against native `rdx`.
+- On amd64, `LDARG` must account for the return address pushed by the native
+ `call` instruction. On aarch64, riscv64, arm32, and rv32, it maps more
+ directly to the entry `sp` plus the backend's standard frame/header policy.
+- `br` is implemented as a dedicated hidden native register on every target.
+- On arm32, `t1` and `t2` map to natively callee-saved registers; the backend
+ is responsible for preserving them across function boundaries in accordance
+ with the native ABI, even though P1 treats them as caller-saved.
+- Frame-pointer use is backend policy, not part of the P1 v2 architectural
+ register set.
+
+### Native register mapping
+
+#### 64-bit targets
+
+| P1 | amd64 | aarch64 | riscv64 |
+|------|-------|---------|---------|
+| `a0` | `rdi` | `x0` | `a0` |
+| `a1` | `rsi` | `x1` | `a1` |
+| `a2` | `rdx` | `x2` | `a2` |
+| `a3` | `rcx` | `x3` | `a3` |
+| `t0` | `r10` | `x9` | `t0` |
+| `t1` | `r11` | `x10` | `t1` |
+| `t2` | `r8` | `x11` | `t2` |
+| `s0` | `rbx` | `x19` | `s1` |
+| `s1` | `r12` | `x20` | `s2` |
+| `s2` | `r13` | `x21` | `s3` |
+| `s3` | `r14` | `x22` | `s4` |
+| `sp` | `rsp` | `sp` | `sp` |
-The output is a single portable ISA above which any seed-stage program
-(Lisp, Forth, a smaller C compiler) can be written once and run on three
-hosts. Below M2-Planet in the chain, above raw M1. Leans entirely on
-existing `M1` + `hex2` — no toolchain modifications.
+#### 32-bit targets
+
+| P1 | arm32 | rv32 |
+|------|-------|-------|
+| `a0` | `r0` | `a0` |
+| `a1` | `r1` | `a1` |
+| `a2` | `r2` | `a2` |
+| `a3` | `r3` | `a3` |
+| `t0` | `r12` | `t0` |
+| `t1` | `r6` | `t1` |
+| `t2` | `r7` | `t2` |
+| `s0` | `r4` | `s1` |
+| `s1` | `r5` | `s2` |
+| `s2` | `r8` | `s3` |
+| `s3` | `r9` | `s4` |
+| `sp` | `sp` | `sp` |
diff --git a/docs/P1v2.md b/docs/P1v2.md
@@ -1,531 +0,0 @@
-# P1 v2
-
-## Scope
-
-P1 v2 is a portable pseudo-ISA for standalone executables.
-
-P1 v2 has two width variants:
-
-- **P1v2-64** — one word is one 64-bit integer or pointer value
-- **P1v2-32** — one word is one 32-bit integer or pointer value
-
-Portable source may use any number of word arguments. The first four argument
-registers are explicit, and additional argument words are passed through a
-portable incoming stack-argument area.
-
-Portable source may directly return `0..2` words. Wider results use the
-portable indirect-result convention described below.
-
-## Toolchain envelope
-
-P1 v2 must be assemblable through the existing `M0` + `hex2` path, with
-`catm` as the only composition primitive between source or generated fragments.
-The spec therefore assumes only the following toolchain features:
-
-- `M0`-level `DEFINE name hex_bytes` substitution
-- raw byte emission
-- labels and label references supported by `hex2`
-- file concatenation via `catm`
-
-## Source notation
-
-This document describes instructions using ordinary assembly notation such as
-`ADD rd, ra, rb`, `LD rd, [ra + off]`, or `CALL`.
-
-Because of the toolchain constraints above, portable source does not encode
-most operands as textual instruction arguments. Instead, register choices,
-inline immediate values, and small fixed parameters are fused into opcode
-names, following the generated-table style used by `src/p1_gen.py`.
-
-So the notation in this document is descriptive rather than literal:
-
-- `ADD rd, ra, rb` means a family of fused register-specific opcodes
-- `ADDI rd, ra, imm` means a family of fused register-and-immediate-specific
- opcodes
-- `ENTER size` means a family of fused byte-count-specific opcodes
-- `LDARG rd, idx` means a family of fused register-and-argument-slot-specific
- opcodes
-- `BR rs`, `CALLR rs`, and `TAILR rs` mean register-specific control-flow
- opcodes
-- `LEAVE`, `CALL`, `RET`, `TAIL`, `B`, and `SYSCALL` remain operand-free
-
-Labels still appear in source where the toolchain supports them directly, such
-as `LA rd, %label` and `LA_BR %label`.
-
-## Register Model
-
-### Exposed registers
-
-P1 v2 exposes the following source-level registers:
-
-- `a0`–`a3` — argument registers. Also caller-saved general registers.
-- `t0`–`t2` — caller-saved temporaries.
-- `s0`–`s3` — callee-saved general registers.
-- `sp` — stack pointer.
-
-### Hidden registers
-
-The backend may reserve additional native registers that are never visible in
-P1 source:
-
-- `br` — branch / call target mechanism, implemented as a dedicated hidden
- native register on every target
-- backend-local scratch used entirely within one instruction expansion
-
-No hidden register may carry a live P1 value across an instruction boundary.
-
-## Calling Convention
-
-### Arguments and return values
-
-P1 v2 defines three result conventions: one-word direct, two-word direct, and
-indirect.
-
-In the one-word direct-result convention:
-
-- Explicit argument words 0-3 live in `a0-a3`.
-- Additional explicit argument words live in the incoming stack-argument area
- and are read with `LDARG`.
-- On return, a one-word result lives in `a0`.
-
-In the two-word direct-result convention:
-
-- Explicit argument words 0-3 live in `a0-a3` on entry.
-- Additional explicit argument words still live in the incoming
- stack-argument area.
-- On return, `a0` holds result word 0 and `a1` holds result word 1.
-
-In the indirect-result convention:
-
-- The caller passes a writable result buffer pointer in `a0`.
-- Explicit argument words 0-2 then live in `a1-a3`.
-- Additional explicit argument words still live in the incoming
- stack-argument area.
-- On return, `a0` holds the same result buffer pointer value.
-
-In both direct-result conventions, incoming stack-argument slot `0` corresponds
-to explicit argument word `4`. In the indirect-result convention, incoming
-stack-argument slot `0` corresponds to explicit argument word `3`.
-
-The two-word direct-result convention covers common cases such as 64-bit
-integer results on 32-bit targets, two-word aggregates, and divmod-style
-returns. The indirect-result convention is the portable way to return any
-result wider than two words.
-
-### Register preservation
-
-Caller-saved:
-
-- `a0`–`a3`
-- `t0`–`t2`
-
-Callee-saved:
-
-- `s0`–`s3`
-- `sp`
-
-### Call semantics
-
-A call is valid from any function, including a leaf. Call / return correctness
-does not depend on establishing a frame first.
-
-If a function needs any incoming argument after making a call, it must save it
-before the call. This matters in particular for `a0`, which is overwritten by
-every convention's return value, and for `a1` when the callee uses the two-word
-direct-result convention.
-
-A call that passes any stack argument words requires the caller to have an
-active standard frame with enough frame-local storage to stage those outgoing
-words.
-
-The return address is hidden machine state. Portable source must not assume
-that it lives in any exposed register.
-
-## Stack Convention
-
-### Call-boundary rule
-
-At every call boundary, the backend must satisfy the native C ABI stack
-alignment rule for the target architecture.
-
-Portable source must therefore treat raw function-entry `sp` as opaque. It may
-not assume that the low bits of `sp` have the same meaning on all targets
-before a frame is established.
-
-### Incoming stack-argument area
-
-P1 v2 defines an abstract incoming stack-argument area for explicit argument
-words that do not fit in registers.
-
-- Slot `0` is the first stack-passed explicit argument word.
-- Slots are word-indexed, not byte-indexed.
-- Portable source may access this area only through `LDARG`.
-
-`LDARG` is valid only when the current function has an active standard frame.
-Therefore, a function that needs any incoming stack argument must establish a
-standard frame before its first `LDARG`.
-
-Portable source must not assume any direct relationship between incoming
-argument slots and raw function-entry `sp`. In particular, source must not try
-to reconstruct stack arguments by manually indexing from `sp`; backend entry
-layouts differ across targets.
-
-For a call with `m` stack-passed explicit argument words, the caller stages
-those words in the first `m` words of its frame-local storage immediately
-before the call:
-
-```
-[sp + 2*WORD + 0*WORD] = outgoing arg word 0
-[sp + 2*WORD + 1*WORD] = outgoing arg word 1
-...
-```
-
-At callee entry, those staged words become incoming argument slots `0..m-1`.
-The backend is responsible for mapping between the caller's frame layout and
-the callee's abstract incoming argument slots.
-
-Portable code that needs both ordinary locals and stack-passed outgoing
-arguments must reserve enough total frame-local storage and keep the low-
-addressed prefix available for outgoing argument staging across the call.
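-
-The staging rule can be sketched as follows for a five-argument call in
-P1v2-64 (`WORD = 8`), using this document's descriptive notation rather than
-literal fused opcodes. The label `%f`, the `ST rs, [ra + off]` operand order,
-and the `;` comment syntax are assumptions made for illustration only.
-
-```
-; caller: passes explicit argument words 0-4, expects a one-word result
-ENTER 8                ; one frame-local word, used as outgoing slot 0
-LI    a0, 1            ; explicit argument word 0
-LI    a1, 2            ; explicit argument word 1
-LI    a2, 3            ; explicit argument word 2
-LI    a3, 4            ; explicit argument word 3
-LI    t0, 5
-ST    t0, [sp + 16]    ; [sp + 2*WORD + 0*WORD] = outgoing arg word 0
-LA_BR %f
-CALL                   ; one-word result comes back in a0
-LEAVE
-RET
-
-; callee f: reads its fifth argument through the abstract slot index
-ENTER 0                ; LDARG requires an active standard frame
-LDARG t0, 0            ; incoming slot 0 = explicit argument word 4
-ADD   a0, a0, t0
-LEAVE
-RET
-```
-
-The callee never indexes from `sp` itself; only the caller-side staging
-offset is written out, and the backend maps it onto the callee's slot `0`.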
-
-### Standard frame layout
-
-Functions that need local stack storage use a standard frame layout. After
-frame establishment:
-
-```
-[sp + 0*WORD] = saved return address
-[sp + 1*WORD] = saved caller stack pointer
-[sp + 2*WORD ... sp + 2*WORD + local_bytes - 1] = frame-local storage
-...
-```
-
-Frame-local storage is byte-addressed. Portable code may use it for ordinary
-locals, spilled callee-saved registers, and the caller-staged outgoing
-stack-argument words described above.
-
-Total frame size is:
-
-`round_up(STACK_ALIGN, 2*WORD_SIZE + local_bytes)`
-
-Where:
-
-- `WORD_SIZE = 8` in P1v2-64
-- `WORD_SIZE = 4` in P1v2-32
-- `STACK_ALIGN` is target-defined and must satisfy the native call ABI
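-
-As a worked illustration of the formula, the values below assume
-`STACK_ALIGN = 16` for P1v2-64 and `STACK_ALIGN = 8` for P1v2-32. Both
-alignment values are assumptions for the example; the real value is whatever
-the target's native call ABI requires.
-
-```
-P1v2-64, STACK_ALIGN = 16 (assumed)
-  local_bytes =  0:  2*8 +  0 = 16  ->  round_up(16, 16) = 16  (no padding)
-  local_bytes =  8:  2*8 +  8 = 24  ->  round_up(16, 24) = 32  (8 bytes padding)
-  local_bytes = 24:  2*8 + 24 = 40  ->  round_up(16, 40) = 48  (8 bytes padding)
-
-P1v2-32, STACK_ALIGN = 8 (assumed)
-  local_bytes = 12:  2*4 + 12 = 20  ->  round_up(8, 20)  = 24  (4 bytes padding)
-```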
-
-Leaf functions that need no frame-local storage may omit the frame entirely.
-
-### Frame invariants
-
-- A function that allocates a frame must restore `sp` before returning.
-- Callee-saved registers modified by the function must be restored before
- returning.
-- The standard frame layout is the only frame shape recognized by P1 v2.
-
-## Op Set Summary
-
-| Category | Operations |
-|----------|------------|
-| Materialization | `LI rd, imm`, `LA rd, %label`, `LA_BR %label` |
-| Moves | `MOV rd, rs`, `MOV rd, sp` |
-| Arithmetic | `ADD`, `SUB`, `AND`, `OR`, `XOR`, `SHL`, `SHR`, `SAR`, `MUL`, `DIV`, `REM` |
-| Immediate arithmetic | `ADDI`, `ANDI`, `ORI`, `SHLI`, `SHRI`, `SARI` |
-| Memory | `LD`, `ST`, `LB`, `SB` |
-| ABI access | `LDARG` |
-| Branching | `B`, `BR`, `BEQ`, `BNE`, `BLT`, `BLTU`, `BEQZ`, `BNEZ`, `BLTZ` |
-| Calls / returns | `CALL`, `CALLR`, `RET`, `TAIL`, `TAILR` |
-| Frame management | `ENTER`, `LEAVE` |
-| System | `SYSCALL` |
-
-## Immediates
-
-Immediate operands appear only in instructions that explicitly admit them.
-Portable source has three immediate classes:
-
-- **Inline integer immediate** — a signed 12-bit assembly-time constant in the
- range `-2048..2047`
-- **Materialized word value** — a full one-word assembly-time constant loaded
- with `LI`
-- **Materialized address** — the address of a label loaded with `LA`
-
-P1 v2 also uses two structured assembly-time operands:
-
-- **Frame-local byte count** — a non-negative byte count used by `ENTER`
-- **Argument-slot index** — a non-negative word-slot index used by `LDARG`
-
-`LI rd, imm` loads the one-word integer value `imm`.
-
-`LA rd, %label` loads the address of `%label` as a one-word pointer value.
-
-The backend may realize `LI` and `LA` using native immediates, literal pools,
-multi-instruction sequences, or other backend-private mechanisms.
-
-Backends may assume labels fit in 32 bits when realizing `LA` and `LA_BR`.
-This reflects the stage0 image layout (`hex2-0` base `0x00600000`, programs
-well under 4 GB), not a portable-ISA-level guarantee. Backends that target
-images loaded above the 4 GB boundary must adjust their `LA` / `LA_BR`
-lowering. `LI` makes no such assumption — it materializes any one-word value.
-
-## Control Flow
-
-### Call / Return / Tail Call
-
-Control-flow targets are materialized with `LA_BR %label`, which loads
-`%label` into the hidden branch-target mechanism `br`. The immediately
-following control-flow op consumes that target.
-
-`CALL` transfers control to the target most recently loaded by `LA_BR` and
-establishes a return continuation such that a subsequent `RET` returns to the
-instruction after the `CALL`. `CALL` is valid whether or not the caller has
-established a standard frame, except that any call using stack-passed argument
-words requires an active standard frame to hold the staged outgoing words.
-
-`CALLR rs` is the register-indirect form of `CALL`. It transfers control to
-the code pointer value held in `rs` and establishes the same return
-continuation semantics as `CALL`.
-
-`RET` returns through the current return continuation. `RET` is valid whether
-or not the current function has established a standard frame, provided any
-frame established by the function has already been torn down.
-
-`TAIL` is a tail call to the target most recently loaded by `LA_BR`. It is
-valid only when the current function has an active standard frame. `TAIL`
-performs the standard epilogue for the current frame and then transfers control
-to the loaded target without creating a new return continuation. The callee
-therefore returns directly to the current function's caller.
-
-`TAILR rs` is the register-indirect form of `TAIL`. It is valid only when the
-current function has an active standard frame.
-
-Because stack-passed outgoing argument words are staged in the caller's own
-frame-local storage, `TAIL` and `TAILR` are portable only when the tail-called
-callee requires no stack-passed argument words. Portable compilers must lower
-other tail-call cases to an ordinary `CALL` / `RET` sequence.
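-
-A minimal sketch of a portable tail call, in this document's descriptive
-notation: a wrapper that adjusts its one register argument and hands off to
-another function. The label `%f` is hypothetical, and the `;` comment syntax
-is assumed. A real function would normally have a frame for its own reasons;
-`ENTER 0` appears here only because `TAIL` requires an active standard frame.
-
-```
-; g(x) behaves like f(x + 1)
-ENTER 0            ; establish the standard frame that TAIL will tear down
-ADDI  a0, a0, 1    ; the argument stays in a0, so no stack-passed words are needed
-LA_BR %f
-TAIL               ; standard epilogue, then transfer to f; f returns to g's caller
-```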
-
-Portable source must treat the return continuation as hidden machine state. It
-must not assume that the return address lives in any exposed register or stack
-location except as defined by the standard frame layout after frame
-establishment.
-
-### Prologue / Epilogue
-
-P1 v2 defines the following frame-establishment and frame-teardown operations:
-
-- `ENTER size`
-- `LEAVE`
-
-`ENTER size` establishes the standard frame layout with `size` bytes of
-frame-local storage:
-
-```
-[sp + 0*WORD] = saved return address
-[sp + 1*WORD] = saved caller stack pointer
-[sp + 2*WORD ... sp + 2*WORD + size - 1] = frame-local storage
-```
-
-The total allocation size is:
-
-`round_up(STACK_ALIGN, 2*WORD_SIZE + size)`
-
-The named frame-local bytes are the usable local storage. Any additional bytes
-introduced by alignment rounding are padding, not extra local bytes.
-
-`LEAVE` tears down the current standard frame and restores the hidden return
-continuation so that a subsequent `RET` returns correctly.
-
-Because every standard frame stores the saved caller stack pointer at
-`[sp + 1*WORD]`, `LEAVE` does not need to know the frame-local byte count used
-by the corresponding `ENTER`.
-
-A function may omit `ENTER` / `LEAVE` entirely if it is a leaf and needs no
-standard frame.
-
-`ENTER` and `LEAVE` do not implicitly save or restore any callee-saved
-register (`s0`–`s3`). A function that modifies one of them must preserve it
-explicitly, typically by storing it in frame-local storage within its
-standard frame.
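-
-The explicit-preservation pattern might look like this in P1v2-64
-(`WORD = 8`), written in descriptive notation. The label `%helper` and the
-`ST rs, [ra + off]` operand order are assumptions for illustration.
-
-```
-ENTER 8                ; one frame-local word for the spilled register
-ST    s0, [sp + 16]    ; save the caller's s0 at [sp + 2*WORD]
-MOV   s0, a0           ; keep the incoming argument live across the call
-LA_BR %helper
-CALL                   ; may overwrite a0-a3 and t0-t2, but not s0
-ADD   a0, a0, s0       ; combine helper's result with the saved argument
-LD    s0, [sp + 16]    ; restore the caller's s0 before teardown
-LEAVE
-RET
-```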
-
-### Branching
-
-P1 v2 branch targets are carried through the hidden branch-target mechanism
-`br`. Portable source may load `br` only through:
-
-- `LA_BR %label` — materialize the address of `%label` as the next branch, call,
- or tail-call target
-
-No branch, call, or tail opcode takes a label operand directly. Portable source
-must treat `br` as owned by the control-flow machinery. No live value may be
-carried in `br`. Each `LA_BR` must be consumed by the immediately following
-branch, call, or tail op, and portable source must not rely on `br` surviving
-across any other instruction.
-
-The portable branch families are:
-
-- `B` — unconditional branch to the target in `br`
-- `BR rs` — unconditional branch to the code pointer in `rs`
-- `BEQ`, `BNE`, `BLT`, `BLTU` — conditional branch to the target in `br`
-- `BEQZ`, `BNEZ`, `BLTZ` — conditional branch to the target in `br` using zero
- as the second operand
-
-`BLT` and `BLTZ` perform signed comparisons on one-word values. `BLTU`
-performs an unsigned comparison on one-word values; there is no unsigned
-zero-operand variant because `x < 0` is always false under unsigned
-interpretation.
-
-If a branch condition is true, control transfers to the target currently held in
-`br`. If the condition is false, execution falls through to the next
-instruction.
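-
-For instance, a counted loop pairs each conditional branch with its own
-`LA_BR`. This is a sketch in descriptive notation; the `:loop` label
-definition follows `hex2`-style syntax and is an assumption here, as is the
-`;` comment syntax.
-
-```
-LI    a0, 0          ; running sum
-LI    t0, 10         ; loop counter
-:loop
-ADD   a0, a0, t0     ; sum += counter
-ADDI  t0, t0, -1     ; counter -= 1
-LA_BR %loop          ; load br immediately before the branch that consumes it
-BNEZ  t0             ; taken while t0 != 0; falls through with a0 = 55
-```
-
-Placing `LA_BR` directly before `BNEZ` is what keeps the sequence portable:
-no other instruction executes while `br` holds the target.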
-
-## Data Ops
-
-### Arithmetic
-
-P1 v2 defines the following arithmetic and bitwise operations on one-word
-values:
-
-- register-register: `ADD`, `SUB`, `AND`, `OR`, `XOR`, `SHL`, `SHR`, `SAR`,
- `MUL`, `DIV`, `REM`
-- immediate: `ADDI`, `ANDI`, `ORI`, `SHLI`, `SHRI`, `SARI`
-
-For `ADD`, `SUB`, `MUL`, `AND`, `OR`, and `XOR`, computation is modulo the
-active word size.
-
-`SHL` shifts left and discards high bits. `SHR` is a logical right shift and
-zero-fills. `SAR` is an arithmetic right shift and sign-fills.
-
-For register-count shifts, only the low `5` bits of the shift count are
-observed in `P1v2-32`, and only the low `6` bits are observed in `P1v2-64`.
-
-Immediate-form shifts use inline immediates in the range `0..31` in `P1v2-32`
-and `0..63` in `P1v2-64`.
-
-`DIV` is signed division on one-word two's-complement values and truncates
-toward zero. `REM` is the corresponding signed remainder.
-
-Division by zero is outside the portable contract. The overflow case
-`MIN_INT / -1` is also outside the portable contract, as is the corresponding
-remainder case.
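-
-A small worked case shows the truncating behavior, in descriptive notation
-with an assumed `;` comment syntax:
-
-```
-LI  a0, -7
-LI  a1, 2
-DIV t0, a0, a1     ; t0 = -3: the quotient truncates toward zero, not down to -4
-REM t1, a0, a1     ; t1 = -1: chosen so that -7 = (-3)*2 + (-1)
-```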
-
-### Moves
-
-P1 v2 defines the following move and materialization operations:
-
-- `MOV` — register-to-register copy
-- `LI` — load one-word integer constant
-- `LA` — load label address
-
-`MOV` may copy from any exposed general register to any exposed general
-register.
-
-Portable source may also read the current stack pointer through `MOV rd, sp`.
-
-Portable source may not write `sp` through `MOV`. Stack-pointer updates are only
-performed by `ENTER`, `LEAVE`, and backend-private call/return machinery.
-
-`LI` materializes an integer bit-pattern. `LA` materializes the address of a
-label. `LA_BR` is a separate control-flow-target materialization form and is not
-part of the general move family.
-
-### Memory
-
-P1 v2 defines the following memory-access operations:
-
-- `LD`, `ST` — one-word load and store
-- `LB`, `SB` — byte load and store
-- `LDARG` — one-word load from the incoming stack-argument area
-
-`LD` and `ST` access one full word: 4 bytes in `P1v2-32` and 8 bytes in
-`P1v2-64`.
-
-`LB` loads one byte and zero-extends it to a full word. `SB` stores the low
-8 bits of the source value.
-
-Memory offsets use signed 12-bit inline immediates.
-
-The base address for a memory access may be any exposed general register or
-`sp`.
-
-`LDARG rd, idx` loads incoming stack-argument slot `idx`, where slot `0` is the
-first stack-passed explicit argument word. `idx` is word-indexed, not
-byte-indexed. `LDARG` is an ABI access, not a general memory operation; it does
-not expose or imply any raw `sp`-relative layout at function entry.
-
-`LDARG` is valid only when the current function has an active standard frame.
-
-Portable source must not assume that labels are aligned beyond what is
-explicitly established by the program itself. Portable code should use
-naturally aligned addresses for `LD` and `ST`. Unaligned word accesses are
-outside the portable contract. Byte accesses have no additional alignment
-requirement.
-
-## System
-
-`SYSCALL` is part of the portable ISA surface.
-
-At the portable level, the syscall convention is:
-
-- `a0` = syscall number on entry, return value on exit
-- `a1`, `a2`, `a3`, `t0`, `s0`, `s1` = syscall arguments 0 through 5
-
-At the portable level, `SYSCALL` clobbers only `a0`. All other exposed
-registers are preserved across the syscall.
-
-The mapping from symbolic syscall names to numeric syscall identifiers is
-target-defined. The set of syscalls available to a given program is likewise
-specified outside the core P1 v2 ISA, for example by a target profile or
-runtime interface document.
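-
-As a rough sketch of the convention above, a `write`-style syscall with three
-arguments could be set up as follows, in descriptive notation. `SYS_write`
-and `%msg` are hypothetical names used only for illustration; the actual
-syscall number and the set of available syscalls come from the target
-profile, not from this document.
-
-```
-LI a0, SYS_write    ; syscall number (hypothetical symbolic name)
-LI a1, 1            ; syscall argument 0: file descriptor
-LA a2, %msg         ; syscall argument 1: buffer address (hypothetical label)
-LI a3, 13           ; syscall argument 2: byte count
-SYSCALL             ; result replaces a0; every other exposed register is preserved
-```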
-
-## Target notes
-
-- `a0` is argument 0, the one-word direct return-value register, the low word
- of the two-word direct return pair, and the indirect-result buffer pointer.
-- On aarch64, riscv64, arm32, and rv32, that matches the native integer/pointer
- ABI directly.
-- On amd64, the backend must translate between portable `a0` and native
- return register `rax` at call and return boundaries. For the two-word direct
- return, the backend must also translate `a1` against native `rdx`.
-- On amd64, `LDARG` must account for the return address pushed by the native
- `call` instruction. On aarch64, riscv64, arm32, and rv32, it maps more
- directly to the entry `sp` plus the backend's standard frame/header policy.
-- `br` is implemented as a dedicated hidden native register on every target.
-- On arm32, `t1` and `t2` map to natively callee-saved registers; the backend
- is responsible for preserving them across function boundaries in accordance
- with the native ABI, even though P1 treats them as caller-saved.
-- Frame-pointer use is backend policy, not part of the P1 v2 architectural
- register set.
-
-### Native register mapping
-
-#### 64-bit targets
-
-| P1 | amd64 | aarch64 | riscv64 |
-|------|-------|---------|---------|
-| `a0` | `rdi` | `x0` | `a0` |
-| `a1` | `rsi` | `x1` | `a1` |
-| `a2` | `rdx` | `x2` | `a2` |
-| `a3` | `rcx` | `x3` | `a3` |
-| `t0` | `r10` | `x9` | `t0` |
-| `t1` | `r11` | `x10` | `t1` |
-| `t2` | `r8` | `x11` | `t2` |
-| `s0` | `rbx` | `x19` | `s1` |
-| `s1` | `r12` | `x20` | `s2` |
-| `s2` | `r13` | `x21` | `s3` |
-| `s3` | `r14` | `x22` | `s4` |
-| `sp` | `rsp` | `sp` | `sp` |
-
-#### 32-bit targets
-
-| P1 | arm32 | rv32 |
-|------|-------|-------|
-| `a0` | `r0` | `a0` |
-| `a1` | `r1` | `a1` |
-| `a2` | `r2` | `a2` |
-| `a3` | `r3` | `a3` |
-| `t0` | `r12` | `t0` |
-| `t1` | `r6` | `t1` |
-| `t2` | `r7` | `t2` |
-| `s0` | `r4` | `s1` |
-| `s1` | `r5` | `s2` |
-| `s2` | `r8` | `s3` |
-| `s3` | `r9` | `s4` |
-| `sp` | `sp` | `sp` |
diff --git a/docs/PLAN.md b/docs/PLAN.md
@@ -1,210 +0,0 @@
-# Alternative bootstrap path: Lisp-in-P1 → C compiler in Lisp → tcc-boot
-
-## Goal
-
-Shrink the auditable LOC between M1 assembly and tcc-boot by replacing the
-current `M2-Planet → mes → MesCC → nyacc` stack with a small Lisp written
-once in the P1 portable pseudo-ISA (see [P1.md](P1.md)) and a C compiler
-written in that Lisp. P1 consists of ~30 RISC-shaped ops whose per-arch
-`DEFINE` tables expand to amd64 / aarch64 / riscv64 encodings, so one Lisp
-source serves all three hosts.
-
-## Current chain (validated counts)
-
-| Layer | Lang | Lines |
-|---|---|---|
-| `cc_amd64.M1` (subset-C compiler in M1 asm) | M1 | 5,413 (~3,152 actual instructions) |
-| M2-Planet (`*.c`, compiles mes) | C | 8,140 |
-| Mes interpreter (`src/*.c`) | C | 7,033 |
-| Mes headers (`include/mes/*.h`) | C | 6,145 |
-| MesCC + mes Scheme (`module/`) | Scheme | 8,271 |
-| Bundled mes runtime (SRFI/ice-9/rnrs shims) | Scheme | 9,191 |
-| nyacc (LALR engine + C99 parser/grammar/cpp) | Scheme | ~10,000 (essentials of 12,868) |
-| **Total auditable** | mixed | **~54,000** |
-
-## Proposed chain
-
-```
-M1 asm → P1 pseudo-ISA → Lisp interpreter (in P1) → C compiler (in Lisp) → tcc-boot
-```
-
-Two languages plus one portable asm layer, one new interpreter, one new
-compiler. No M2-Planet, no Mes core, no MesCC, no nyacc. The interpreter is
-authored once in P1 and assembled three ways; porting to a fourth arch means
-a new P1 defs file, not a rewrite.
-
-## Why P1 as the host
-
-- **Single source of truth.** A Lisp in raw M1 asm would need three
- hand-written variants (one per target arch). In P1, there is one source;
- the per-arch cost is already paid inside the P1 defs files.
-- **Cost lives in P1, not here.** P1's one-time tax (~1500 defines × 3 arches
- generator-driven, plus ~240 LOC of `hex2_word` + `M1-macro` aarch64 work)
- is accounted in `P1.md`. This plan inherits that layer rather than
- duplicating it.
-- **Dependency ordering.** PLAN cannot start the Lisp interpreter until P1
- stages 1–4 in `P1.md` are complete (spike on all three arches plus the
- full ~30-op matrix). P1 stage 5 ("seed Lisp interpreter in ~500 lines of
- P1") is effectively this plan's kickoff.
-
-## Lisp — feature floor
-
-Justification: empirical audit of MesCC's actual Scheme usage. MesCC barely
-exercises Scheme.
-
-**Required:**
-- Special forms: `define`, `lambda`, `if`, `cond`, `let`, `let*`, `letrec`,
- `quote`, `quasiquote`/`unquote`, `set!`, `begin`
-- Data: pairs, fixnums, vectors, immutable ASCII strings, symbols
-- Primitives (~40): `cons/car/cdr`, list ops (`map/filter/fold/append/reverse/member/assoc`),
- arithmetic (`+ - * / %`), bitwise (`and or xor << >>`), string ops
- (`string-append/string-ref/substring/string-length`), type predicates,
- `display`/`write`, basic `format` (`~a ~s ~d ~%` only), `apply`, `error`
-- Mark-sweep GC over tagged cells
-- Built-in `pmatch` macro (otherwise hand-expanding 57 call sites in the
- C compiler costs ~1k extra LOC)
-- A records-via-vectors layer (replaces SRFI-9 `define-immutable-record-type`)
-- File I/O: `read-file path → string` and `write-file path string`. No port
- type at all. The C lexer indexes into the source string with an integer
- cursor (gives `read-char`/`peek-char` semantics for free); CPP keeps
- `#include` context as a stack of (string, cursor) pairs. Codegen
- accumulates output as a reversed list of chunks and concatenates once
- via a one-pass variadic `string-append` (or a `string-concat list →
- string` primitive). Output for tcc-boot is single-digit MB — well within
- the existing mes 20MB arena budget.
-
-**Deliberately omitted:**
-- `call/cc`, `dynamic-wind`, `parameterize`, exception system
-- Mutable strings, Unicode
-- Bignums, rationals, floats
-- `syntax-rules` / `define-syntax` (only `pmatch` macro is needed)
-- First-class modules (single-file load in dependency order)
-- `do` loops, `delay`/`force`
-
-Tail calls: convenient for AST recursion; not strictly required if stack is
-generous (≥1MB).
-
-## C subset to support
-
-Start from MesCC's already-reduced subset; consider further reductions if
-they justify patching tcc-boot.
-
-**Must support (used by tcc-boot):**
-- Types: `char/short/int/long/long long`, signed/unsigned, pointers, arrays,
- structs, unions, enums, **bitfields**, typedefs, `void`
-- Storage: `static`, `extern`, `register`; `const`/`volatile` parsed and
- ignored
-- Operators: full arithmetic/bitwise/relational/logical, compound
- assignment, ternary, `sizeof` (types and expressions), casts, comma
-- Statements: all loops, switch/case, goto/labels, `&&`/`||` short-circuit
-- Function declarations: ANSI only
-
-**Not supported (and not needed):**
-- `float`/`double` (errors at parse time)
-- `inline` (parsed and stripped, like MesCC)
-- Variadic functions (tcc-boot already works around this)
-- K&R declarations
-- C99 mid-block declarations
-- Statement expressions, nested functions, compound literals,
- designated initializers
-
-**Candidate further reductions (require tcc-boot patches):**
-- Drop bitfields (significant tcc-boot rework — probably not worth it)
-- Drop compound assignment (modest tcc-boot patches)
-
-**Preprocessor:** target full `#define`/`#include`/`#if`/`#ifdef`/`#elif`/
-`#else`/`#endif` with function-like macros and stringification. tcc's source
-uses these heavily.
-
-## Backend
-
-**Settled: emit P1.** The C compiler is written once in portable Lisp and
-emits portable asm, so both the pre-tcc-boot seed userland (`SEED.md`) and
-tcc-boot itself land on all three arches without a second backend. Codegen
-is slightly harder than direct amd64 — P1 is deliberately dumb, so C
-idioms like `x += y` expand to multi-op P1 sequences — but we pay the
-~2× code-size tax already budgeted in `P1.md` rather than writing three
-backends.
-
-This forecloses the alternative of emitting amd64 M1 directly (simpler
-codegen, single-arch only). That option would have satisfied a
-tcc-boot-only goal, but `SEED.md` requires tri-arch seed binaries, so a
-portable backend is load-bearing.
-
-## Estimated budget
-
-| Component | Lines |
-|---|---|
-| Lisp interpreter in P1 (reader, eval, GC, primitives, I/O, pmatch) | 4,000–6,000 P1 |
-| C lexer + recursive-descent parser + CPP (in Lisp) | 2,000–3,000 |
-| Type checker + IR (slimmed compile.scm + info.scm) | 2,000–3,000 |
-| Codegen + P1 emit (see Backend) | 800–1,500 |
-| **Total auditable (this plan)** | **~9,000–13,000 LOC** |
-
-vs. **~54,000 LOC** current = **~4–6× shrink**, and the result is
-tri-arch instead of amd64-only. P1's own infrastructure (defs files,
-`hex2_word` extensions, generator) is audited once in `P1.md` and shared
-with any future seed-stage program.
-
-## Resolutions
-
-- **Narrow loads: zero-extend only, 8-bit only.** P1 keeps `LB`
- zero-extending; no `LBS` added, and 32-bit `LW`/`SW` are out of the
- ISA entirely (emulate through 64-bit `LD`/`ST` + mask/shift if ever
- needed). Fixnums live in full 64-bit tagged cells, so the interpreter
- never needs a sign-extended or 32-bit load — byte/ASCII access is
- unsigned, and arithmetic happens on 64-bit values already.
-- **Static code size: accept the 2× tax.** P1's destructive-expansion
- rule on amd64 roughly doubles instruction count vs. hand-tuned amd64.
- Matches P1's "deliberately dumb" contract (see `P1.md`). Interpreter
- binary expected in low single-digit MB — irrelevant for a seed.
-- **Tail calls: codify `TAIL` in P1.** A new `TAIL %label` macro (see
- `P1.md`, Control flow) expands to `LD lr, sp, 0; ADDI sp, sp, +16;
- B %label` or the per-arch equivalent. The interpreter's `eval` is
- written in the natural recursive style with tail-position calls
- compiled through `TAIL`, so the P1 stack does not grow per Scheme
- frame. As a side effect, Scheme-level tail calls fall out R5RS-proper
- for the interpreter's subset without extra mechanism.
-- **Pool placement: per-function on all arches.** Each function emits its
- constant pool at its epilogue, inside the aarch64 `LDR`-literal ±1 MiB
- range. Labels are file-local; duplicated constants across functions
- are accepted. Simple rule, no range-check logic in codegen.
-- **GC arena: static BSS.** The ~20 MB heap is reserved as a single BSS
- region at link time. No `brk`/`mmap` at runtime, no arena-sizing flag.
- Keeps the P1 program to a minimal syscall surface and makes the
- interpreter image self-describing.
-- **Syscalls: eight.** `read`, `write`, `openat`, `close`, `exit`,
- `clone`, `execve`, `waitid`. Each becomes one P1 `SYSCALL` op
- backed by a per-arch number table in the P1 defs file.
- `read-file` loops `read` into a growable string until EOF (no
- `stat`/`lseek`); `display`/`write`/`error` go through `write` on
- fd 1/2; `error` finishes with `exit`. `openat(AT_FDCWD, …)`
- replaces `open` because aarch64/riscv64 lack bare `open` in the
- asm-generic table. `clone(SIGCHLD)` + `execve` + `waitid` give
- the Lisp enough to drive the tcc-boot build directly — see
- "Build driver" below. No signals, time, or networking.
-
-## Build driver
-
-Once Lisp can spawn, the Lisp program itself is the build driver.
-There is no separate shell. A top-level Lisp source file reads the
-pinned list of tcc-boot translation units, iterates over them, and
-for each one:
-
-1. Reads the `.c` source into a Lisp string.
-2. Calls the Lisp-hosted C compiler (in-process) to produce P1 text.
-3. Writes the P1 text to a temp file.
-4. Spawns M1 (from stage0-posix, via `clone`+`execve`) to assemble
- P1 → `.hex2`; waits via `waitid`, aborts on non-zero.
-5. Spawns hex2 to emit the final `.o` / ELF; waits, aborts on
- non-zero.
-
-The seed-tool builds (each mescc-tools-extra source → one ELF) run
-the same loop. Spawn-and-wait is a ~20 LOC Lisp primitive; the full
-driver, including the hard-coded tcc-boot file list, is ~100–200
-LOC of Lisp counted against this plan.
-
-Concentrating orchestration in the Lisp program (rather than a
-separate P1/M1 shell) collapses the post-M1 contribution list to
-exactly three artifacts: P1, the Lisp interpreter, and the C
-compiler.
diff --git a/docs/SEED.md b/docs/SEED.md
@@ -1,303 +0,0 @@
-# Seed userland: the pre-tcc-boot tools
-
-## Goal
-
-Bridge the window between *Lisp exists* and *tcc-boot exists* without
-touching M2-Planet, Mes, or MesCC. Inside that window, all code is
-either a Lisp program running on the Lisp interpreter or one of a
-small set of standalone C binaries compiled through the Lisp-hosted
-C compiler → P1 → M1 → hex2 pipeline.
-
-This document covers only that window. Phases before it (`seed0 →
-hex0/hex1/hex2 → M1`, P1 defs, Lisp interpreter, and the Lisp-hosted
-C compiler) are documented in `P1.md` and `PLAN.md`. tcc-boot itself
-and everything downstream are standard C and out of scope.
-
-## Position in the chain
-
-```
-stage0-posix: seed0 → hex0 → hex1 → hex2 → M1 (no C, no Lisp)
-P1 layer: P1 defs files load into M1 (P1.md)
-Lisp: P1 text (Lisp interp source) → M1 → hex2 (PLAN.md)
-C compiler: Lisp program, loaded into the Lisp image (PLAN.md)
-──────── seed window begins here ────────
-seed tools: C source → Lisp+Ccc → P1 text → M1 → hex2 (this doc)
-──────── seed window ends when tcc-boot is built ────────
-tcc-boot: C source → Lisp+Ccc → P1 text → M1 → hex2 (PLAN.md)
-```
-
-One Lisp-hosted C compiler (shared with tcc-boot) and a handful of
-statically-linked C binaries. No M2-Planet artifact and no Mes
-Scheme module anywhere.
-
-## Settled decisions
-
-These are load-bearing; the rest of the document assumes them.
-
-1. **Seed programs compile through the same Lisp-hosted C compiler
- as tcc-boot.** No separate seed-stage compiler. Authors write in
- the C subset fixed in `PLAN.md`; the backend emits P1, so the seed
- tools land tri-arch via the existing M1+hex2 path. This accepts
- P1's ~2× code-size tax.
-2. **Vendor upstream C where it exists.** `cat`, `cp`, `mkdir`,
- `rm`, `sha256sum`, `untar` are taken from live-bootstrap's
- `mescc-tools-extra`; `patch-apply` from `simple-patch-1.0`.
- The libc these sources depend on (`<stdio.h>`, `<string.h>`,
- `<stdlib.h>`, etc.) is vendored M2libc's portable layer —
- `bootstrappable.c`, `string.c`, `stdio.c`, `stdlib.c`, and the
- small `ctype`/`fcntl` files (~1,500 LOC). Per-arch syscall
- stubs backing M2libc's declarations are replaced with our
- P1-based stubs (see "How seed tools reach syscalls" below). All
- of the above was written against M2-Planet's C subset, which is
- a subset of ours. Local adaptations ship as unified diffs in
- the repo. **No C is written fresh here** — each vendored
- source already has its own `main`.
-3. **The Lisp program is the build driver — no separate shell.**
- Per `PLAN.md`, the Lisp's syscall surface includes `clone`,
- `execve`, `waitid`, so a top-level Lisp file drives the whole
- tcc-boot build: iterate over translation units, call the
- Lisp-hosted C compiler in-process, spawn M1/hex2 to finish
- each artifact, check exit status. No `kaem`, no `sh`, no flat
- script — just Lisp code.
-4. **One binary per tool.** Each vendored source compiles to a
- standalone ELF — `cat`, `cp`, `mkdir`, `rm`, `sha256sum`,
- `untar`, `patch-apply`. Installed into a single directory
- (say, `/seed/`) and invoked by absolute path from the Lisp
- driver. No dispatcher, no argv[0] multiplexing, no fresh `main`
- to write. Each tool is its own audit unit.
-5. **Uncompressed tcc-boot mirror.** Host the upstream tcc-boot source
- as an uncompressed `.tar` with sha256 pinned. No gzip support
- anywhere in the seed stage. Deletes ~1,000–1,500 LOC of deflate from
- the audit.
-6. **Explicit patches via `patch-apply`.** Upstream source stays
- verbatim. Our changes live as unified-diff files in this repo,
- applied by the `simple-patch`-derived binary. "Upstream vs
- ours" stays legible.
-7. **Target self-build is primary; cross-build is a cache.** The
- canonical build is a fresh target machine bootstrapping from
- stage0-posix hex seed. Cross-built per-arch tarballs are supported
- as a reproducibility cache — identical bytes expected, verified
- against a target self-build, not trusted by assumption.
-
-## The seed tools
-
-One ELF per tool per arch. Each tool is invoked by absolute path
-from the Lisp build driver (e.g. `/seed/sha256sum foo.tar`). Each
-binary links against the same vendored M2libc portable layer and
-the same P1 syscall stubs.
-
-### Inventory
-
-| Tool / layer | Purpose | Source / LOC |
-|--------------------|---------------------------------------------|-------------------------|
-| `untar` | POSIX ustar extract (no gzip, no creation) | mescc-tools-extra/untar.c (460) |
-| `patch-apply` | apply a unified diff in-place | simple-patch-1.0 (~200) |
-| `sha256sum` | verify source tarball hashes | mescc-tools-extra/sha256sum.c (586) |
-| `cp` | copy one file | mescc-tools-extra/cp.c (332) |
-| `mkdir` | single-level directory create | mescc-tools-extra/mkdir.c (117) |
-| `rm` | remove one file (no `-r`, no `-f`) | mescc-tools-extra/rm.c (54) |
-| `cat` | concatenate files to stdout | mescc-tools-extra/catm.c (69) |
-| libc (portable) | stdio, string, stdlib, ctype, fcntl | vendored M2libc (~1,500) |
-| syscall stubs | per-arch bridge below M2libc | ~120 lines of P1, not C |
-| **Total C** | | **~3,300, fully vendored** |
-
-Deliberately excluded: `test`, `echo`, `mv`. The Lisp driver does
-any conditional or rename logic it needs in Lisp, and emits
-progress messages via its own `write` calls — no externalised
-shell utilities needed for those concerns.
-
-The driver is Lisp code, not a shell script; see `PLAN.md`'s
-"Build driver" section for the control flow.
-
-## Syscall surface
-
-The seed tools collectively need **7 syscalls** (process spawn
-lives in the Lisp driver, not in the tools).
-
-| Syscall | Used by |
-|------------|-------------------------------------------|
-| `read` | all file-reading tools |
-| `write` | stdout/stderr, all file-writing |
-| `openat` | file open (`AT_FDCWD` + `O_RDONLY` / `O_WRONLY|O_CREAT|O_TRUNC` with mode) |
-| `close` | all file ops |
-| `exit` | program termination |
-| `mkdirat` | `mkdir` tool, `untar` (directory entries); `AT_FDCWD`, as with `openat` |
-| `unlinkat` | `rm` tool; `AT_FDCWD`, as with `openat` |
-
-PLAN.md's Lisp surface is 8 syscalls (`read`, `write`, `openat`,
-`close`, `exit`, `clone`, `execve`, `waitid`). The seed tools add
-`mkdirat` and `unlinkat` on top of that, for a window total of **10
-distinct syscalls**. Each gets one row in every `p1_<arch>.M1`
-defs file. Deliberately excluded: `stat/fstat`, `access`,
-`rename`, `chmod` (rely on `openat` mode bits for initial perms),
-`lseek` (all reads are sequential), `getdents`/`readdir` (no
-directory traversal needed), `dup`/`pipe`/signals/time/net.
-
-### How seed tools reach syscalls
-
-The Lisp-hosted C compiler has no inline asm and no intrinsics. Each
-syscall is exposed as an ordinary `extern` function declaration,
-backed by a hand-written P1 stub in `runtime.p1`. The stubs are ~3 P1
-ops each (load number, `SYSCALL`, `RET`), totalling ~40 lines of P1
-for the whole surface.
-
-```
-:sys_write ; C args arrive in P1 r1-r6 per call ABI
- SYSCALL write ; expands per-arch via p1_<arch>.M1 defs
- RET
-```
-
-```
-extern int sys_write(int fd, char *buf, int n);
-```
-
-Prerequisite: P1 picks its argument registers (`r1–r6`) to coincide
-with the native syscall arg registers on each arch (`rdi/rsi/…`,
-`x0–x5`, `a0–a5`), so stubs need no register shuffling beyond what
-`SYSCALL` already does. Confirm this in `P1.md` during implementation.
-
-Return convention: Linux returns `-errno` (values in `-1..-4095`) in
-the result register. Wrappers return the raw integer; callers test
-`r >u 0xfffff000` to detect failure and abort with a message. No
-`errno` global, no per-tool error recovery.
-
-## Build ordering inside the seed window
-
-Once the Lisp interpreter binary exists and the C compiler Lisp
-source is loaded (both per `PLAN.md`):
-
-1. Compile each seed tool independently: its vendored source plus
- the vendored M2libc layer plus the per-arch P1 syscall stubs →
- P1 text → M1 → hex2 → one ELF per tool. Per-arch, repeat for
- each target.
-2. Install the tools into a single directory on the target (e.g.
- `/seed/`). No other setup required.
-
-The tcc-boot build runs as a Lisp program invoked on the Lisp
-interpreter. The driver:
-
-1. Spawns `/seed/sha256sum upstream.tar` and checks against pinned
- hash.
-2. Spawns `/seed/untar upstream.tar`.
-3. For each patch file: spawns `/seed/patch-apply patches/foo.diff`.
-4. Iterates over tcc-boot `.c` files. For each one, calls the
- Lisp-hosted C compiler in-process to emit P1 text, then spawns
- M1 and hex2 to produce the object or final linked binary.
-5. Installs the tcc-boot binary.
-
-See `PLAN.md` "Build driver" for the spawn-and-wait primitive.
-At that point the seed window is closed.
-
-## Target self-build vs cross-build
-
-**Target self-build (primary).** A fresh machine of arch `A` starts
-from the stage0-posix hex seed, runs the hex0→hex1→hex2→M1 chain,
-loads `p1_A.M1`, assembles the Lisp interpreter, loads the C
-compiler into Lisp, runs the Lisp build-driver program, which
-compiles each seed tool, then compiles and links tcc-boot.
-stage0-posix's own `kaem` runs the early hex0→M1 chain; above M1,
-the Lisp program takes over.
-
-**Cross-build cache (secondary).** On an already-bootstrapped
-machine, produce the seed tool binaries for all three arches and
-ship them as tarballs. Users who opt into this skip the target
-self-build and land directly at "seed tools installed." Trust
-claim: **none by assumption** — the cache is only trusted after a
-target self-build of at least one arch has verified byte-identical
-output. Cross-build is an optimization, not a trust input.
-
-## Provenance
-
-Artifacts flowing in:
-
-- **stage0-posix hex seed + P1 defs**: part of this repo, audited
- with the rest of it.
-- **Lisp interpreter source (in P1) and C compiler (in Lisp)**:
- part of this repo, covered by `PLAN.md`.
-- **Vendored seed C sources**: pinned snapshots of
- live-bootstrap's `mescc-tools-extra` (catm, cp, mkdir, rm,
- sha256sum, untar), `simple-patch-1.0`, and M2libc's portable
- layer (the libc the mescc-tools sources depend on — stdio,
- string, stdlib, ctype, fcntl, bootstrappable). All shipped
- verbatim as `.tar` files with sha256 pinned. Local adaptations
- ride as unified diffs in the repo, applied by `patch-apply` at
- build time so "upstream vs ours" stays legible.
-- **Upstream tcc-boot source**: mirrored as uncompressed `.tar` at
- a pinned URL + sha256. The mirror file is one of this repo's
- auditable inputs; it can be re-derived from upstream by untarring
- and re-tarring in a canonical form, or checked against upstream's
- published `.tar.gz` by re-gzipping and comparing hashes on a
- machine that has `gzip` (done once, out of band).
-
-No C is authored fresh in this repo for the seed window; the only
-things written here are unified-diff patches against the vendored
-tree and the per-arch P1 syscall stubs.
-
-`sha256sum` is the single seed tool whose correctness has a direct
-trust consequence downstream; unit-test it against known vectors
-(empty string, "abc", and the longer "abcdbcde..." vectors) before
-declaring the seed build complete.
-
-## Interaction with tcc-boot
-
-tcc-boot expects a build environment roughly like `cc + make + sh +
-coreutils`. Mapping:
-
-| tcc-boot expects | Seed provides |
-|------------------|--------------------------------------------------|
-| `cc` / `gcc` | Lisp-hosted C compiler, invoked in-process per `.c` |
-| `make` | Lisp driver program (tcc-boot's build is simple enough not to need `make`) |
-| `sh` | not provided — the Lisp driver spawns tools directly |
-| `cat`/`cp`/etc. | individual seed-tool binaries at absolute paths |
-| `ld` | tcc-boot's built-in linker (for its own output) |
-| `ar` | not needed; tcc-boot builds one static binary |
-
-Any translation from tcc-boot's literal build-command names
-(`cc`, `make`, `install`) to seed tools lives in Lisp, not in a
-separate shim script.
-
-## Budget rollup
-
-Fresh auditable LOC introduced by this document, on top of PLAN.md:
-
-| Layer | LOC |
-|--------------------------------------------------------|---------|
-| seed tools — vendored mescc-tools-extra + simple-patch | ~1,800 |
-| seed tools — vendored M2libc portable layer | ~1,500 |
-| syscall stubs (P1, not C) | ~120 |
-| Lisp build-driver program | counted in PLAN.md |
-| **Seed window addition** | **~3,300 C (all vendored) + ~120 P1** |
-
-Combined PLAN.md + SEED.md audit surface: **~13–17k LOC**, tri-arch,
-M2-Planet-free and Mes-free. No fresh C is authored for the seed
-window; the entire ~3,300 LOC is audited upstream code written
-against M2-Planet's C subset. The build driver is Lisp code
-counted against PLAN.md (~100–200 LOC).
-
-## Handoff notes for the engineer
-
-Approximate build order for implementation:
-
-1. **C compiler in Lisp** (blocks everything below). Per `PLAN.md`;
- validate on a small corpus before touching seed.
-2. **Vendor M2libc's portable layer** and write the per-arch P1
- syscall stubs that back its declarations. Bring-up test: link
- `catm.c` (69 LOC) against this libc and run it.
-3. **Vendor mescc-tools-extra + simple-patch.** Pin sha256s.
- Confirm each source compiles unmodified through the Lisp-hosted
- C compiler; if anything trips, capture the delta as a unified
- diff rather than editing the vendored tree in place.
-4. **Build the small tools** individually (`cat`, `cp`, `mkdir`,
- `rm`) — each is its own ELF.
-5. **`sha256sum`** with unit tests (empty / "abc" / long vectors)
- before anything depends on its correctness.
-6. **`untar`** (ustar extract only).
-7. **`patch-apply`** (unified-diff in-place).
-8. **End-to-end bring-up**: Lisp build-driver running
- `sha256sum` → `untar` → `patch-apply` → in-process C-compile
- loop (spawning M1/hex2 per `.c`) → linked tcc-boot. First full
- trip through the seed window.
-
-Each step compiles standalone C and assembles through the existing
-P1 → M1 → hex2 path; no new tooling infrastructure is needed
-between steps.