boot2

Playing with the bootstrap
git clone https://git.ryansepassi.com/git/boot2.git

commit 687ee15acc1661be78ff6cd8922ad95dd19a2100
parent 79aedea7287b895e3089ff72a8ff80482bda1083
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Thu, 23 Apr 2026 20:15:53 -0700

rm old docs

Diffstat:
D docs/M1M-IMPL.md | 550 -------------------------------------------------------------------------------
D docs/M1M-P1-PORT.md | 257 -------------------------------------------------------------------------------
D docs/M1PP-EXT.md | 656 -------------------------------------------------------------------------------
M docs/P1.md | 957 ++++++++++++++++++++++++++++++++++++++++---------------------------------------
D docs/P1v2.md | 531 -------------------------------------------------------------------------------
D docs/PLAN.md | 210 -------------------------------------------------------------------------------
D docs/SEED.md | 303 -------------------------------------------------------------------------------
7 files changed, 483 insertions(+), 2981 deletions(-)

diff --git a/docs/M1M-IMPL.md b/docs/M1M-IMPL.md @@ -1,550 +0,0 @@ -## M1M Implementation Sketch - -This note is the implementation-oriented companion to -`docs/M1M-P1-PORT.md`. It describes a practical structure for the P1 -macro expander. - -### Working on the port - -**Files** - -- `m1pp/m1pp.c` — C oracle. Behaviorally authoritative. - Each phase below names the oracle entry points to lift from. -- `m1pp/m1pp.M1` — port target, on P1v2 as of Phase 1. Runtime - shell, lexer, pass-through emit, and structural `%macro` skip - land in Phase 1. Phases 2–10 extend it to real macro storage, - expansion, paste, the expression evaluator, builtins, and - `%select`. -- `m1pp/build.sh`, `m1pp/test.sh` — build / run / diff a P1v2 .M1 - into a runnable aarch64 binary. See `docs/M1M-IMPL.md` Phase 0. -- `tests/m1pp/` — per-phase fixtures. Two shapes, selected by - extension: - - `<name>.M1` + `<name>.expected` — standalone P1v2 program; built, - run with no args, stdout diffed (build-pipeline smoke). - - `<name>.M1pp` + `<name>.expected` — expander input; runner builds - `m1pp/m1pp.M1` once, runs it as `m1pp <name>.M1pp <out>`, diffs - `<out>` (build-dir temp) against `.expected` (parity test). - Filenames beginning with `_` are skipped (parked until later phases). -- `build/p1v2/aarch64/p1_aarch64.M1` — P1v2 DEFINE table, generated - from `p1/aarch64.py` + `p1/p1_gen.py`. Regenerate after any - backend edit. -- `docs/P1v2.md` — ISA spec. `docs/M1M-P1-PORT.md` — higher-level - port contract. - -**Commands** - -```sh -# Build one .M1 source into a binary: -sh m1pp/build.sh tests/m1pp/01-passthrough.M1 build/m1pp/01-passthrough - -# Run the whole suite (regenerates P1v2 defs if the generator changed): -make test-m1pp - -# Run one fixture by name: -sh m1pp/test.sh 01-passthrough - -# Run a built binary manually in the aarch64 container. 
-# `localhost/distroless-busybox:latest` is a local tag built by -# m1pp/build.sh on first run from Containerfile.busybox (distroless-static -# + the busybox binary from another distroless layer, both digest-pinned). -podman run --rm --pull=never --platform linux/arm64 \ - -v "$PWD":/work -w /work \ - localhost/distroless-busybox:latest \ - ./build/m1pp/<name> <argv...> - -# Regenerate P1v2 DEFINE tables after touching p1/*.py: -python3 p1/p1_gen.py --arch aarch64 build/p1v2 - -# Build the C oracle + compare its output to the M1 build: -cc m1pp/m1pp.c -o build/m1pp/m1pp-oracle -./build/m1pp/m1pp-oracle <input.M1pp> /tmp/out-c -./build/m1pp/m1pp <input.M1pp> /tmp/out-m1 # run via podman as above -diff /tmp/out-c /tmp/out-m1 - -# Discover undefined P1 tokens without running M0 (catches typos that -# would otherwise SIGILL silently — build.sh runs this automatically): -sh lint.sh build/p1v2/aarch64/p1_aarch64.M1 m1pp/m1pp.M1 -``` - -**P1v2 quick reference for this port** - -- Registers: `a0..a3` args + caller-saved, `t0..t2` caller-saved, - `s0..s3` callee-saved. `sp` is stack pointer; no raw writes. -- Frame: `enter SIZE` / `leave`; no implicit `s*` save. Leaf - functions may skip frames. -- Call: `la_br &target` then `call` / `tail` / `b` / `beq` / … - (the branch op consumes `br` — load it immediately before). -- Materialize: `li_aN <8 bytes>` for any one-word integer - (`%lo %hi` or `'XXXXXXXXXXXXXXXX'`); `la_aN &label` for label - addresses — **no padding needed**, the 32-bit literal-pool - prefix zero-extends. -- Syscall ABI: number in `a0`; args in `a1, a2, a3, t0, s0, s1`; - result in `a0`. 
- -### Supported Features - -The target expander supports the features required by `p1/*.M1pp`: - -- `%macro NAME(a, b)` / `%endm` -- `%NAME(x, y)` function-like expansion with recursive rescanning -- `##` token paste -- `!(expr)` / `@(expr)` / `%(expr)` / `$(expr)` -- `%select(cond, then, else)` -- Lisp-shaped integer expressions used by the builtins - -### Top-Down Shape - -The program should be structured as a small compiler pipeline: - -1. Runtime shell - Read `argv[1]` into `input_buf`, lex into `source_tokens`, process - tokens through the macro engine, write `output_buf` to `argv[2]`. - Done in Phase 1. - -2. Lexer - Keep the current C-compatible tokenizer: - `WORD`, `STRING`, `NEWLINE`, `LPAREN`, `RPAREN`, `COMMA`, `PASTE`. - All token text lives in `text_buf`; tokens store pointers into that - arena. - -3. Definition pass during processing - Processing is single-pass, not a separate pre-scan. At line start, - `%macro` defines a macro and produces no output. Macro definitions - become available only after their definition, matching - `src/m1macro.c`. - -4. Stream-driven expansion - The main processor reads from the top stream. Source input is stream 0. - Macro expansions and `%select` selections push temporary token - streams onto a stack. When a stream is exhausted, it pops and - restores the expansion pool mark. - -5. Macro call expansion - `%NAME(...)` resolves only if `NAME` is already defined and the next - token is `(`. Expansion produces temporary tokens in `expand_pool`, - applies plain parameter substitution and paste, then pushes the - result as a new stream for recursive rescanning. - -6. Builtins - `%(expr)` and `$(expr)` evaluate integer expressions and emit - one generated token directly. - `%select(cond, then, else)` evaluates `cond` first, then chooses - exactly one of `then` or `else`, copies only that chosen token range - into `expand_pool`, and pushes it as a stream. 
The unchosen branch is - not expanded, validated, or expression-evaluated. - -7. Errors - Coarse fatal paths are sufficient: malformed macro header, wrong arg - count, bad paste, bad expression, overflow, unterminated macro/call. - Exact C error strings are not required. - -### Core Data Structures - -Use fixed BSS arenas and simple power-of-two-ish records. - -Text spans and token records are kept separate: - -```text -TextSpan: -+0 start u64 -+8 len u64 - -Token: -+0 kind u64 -+8 text TextSpan -``` - -Macro record: - -```text -name TextSpan -param_count u64 -params TextSpan[16] -body_start Token* -body_end Token* -``` - -Stream record: - -```text -toks_start Token* -toks_end Token*, exclusive -pos Token* -line_start bool -pool_mark stack mark, -1 for source-owned streams -``` - -Expression frame: - -```text -op_code enum -argc u64 -args i64[16] -``` - -Global arenas: - -```text -input_buf -output_buf -text_buf -source_tokens -macro_body_tokens -macros -expand_pool -streams -arg_starts[16] -arg_ends[16] -expr_frames -``` - -Token range boundaries should be stored as token pointers rather than -indices. That keeps stream and argument walking simple in P1: advance by -one token record, compare pointers, no repeated `base + index << 5`. - -Source token spans point into `input_buf`. `text_buf` is reserved for -synthesized token text such as `##` pastes and `!@%$` output. 
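The stream record and its `pool_mark` field can be illustrated with a minimal Python sketch. It is not taken from `m1pp/m1pp.c` or the P1 port: the class name `Expander` and the methods `push_pool_stream` and `next_token` are hypothetical, tokens are plain strings, and list indices stand in for the token pointers the note recommends. It only shows the lifecycle described above, where an exhausted stream is popped and the expansion pool is rolled back to the mark saved when that stream was pushed.

```python
from dataclasses import dataclass


@dataclass
class Stream:
    toks: list      # backing array: the source tokens or the shared expand pool
    pos: int        # index of the next token to read
    end: int        # exclusive end index
    pool_mark: int  # pool length to restore on pop; -1 for source-owned streams


class Expander:
    """Hypothetical model of the stream stack plus expansion pool."""

    def __init__(self, source_tokens):
        self.expand_pool = []
        # Stream 0 is the source input and owns no pool space.
        self.streams = [Stream(source_tokens, 0, len(source_tokens), -1)]

    def push_pool_stream(self, new_tokens):
        # Copy the expansion into the pool and rescan it as a new stream.
        mark = len(self.expand_pool)
        self.expand_pool.extend(new_tokens)
        if len(self.expand_pool) == mark:
            return  # empty expansion: nothing to push, pool is unchanged
        self.streams.append(
            Stream(self.expand_pool, mark, len(self.expand_pool), mark))

    def pop_stream(self):
        s = self.streams.pop()
        if s.pool_mark != -1:
            # Release every pool token produced by this expansion.
            del self.expand_pool[s.pool_mark:]

    def next_token(self):
        # Read from the top stream, popping exhausted streams on the way.
        while self.streams:
            s = self.streams[-1]
            if s.pos == s.end:
                self.pop_stream()
                continue
            tok = s.toks[s.pos]
            s.pos += 1
            return tok
        return None
```

Because streams are popped in the reverse order they were pushed, truncating the pool to the popped stream's mark never discards tokens that an outer stream still needs.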
- -### Bottom-Up Helper Layers - -#### Layer 0: raw memory/text helpers - -```text -append_text(src_ptr, len) -> text_ptr -append_text_cstr(const_ptr, len) -> text_ptr -copy_bytes(dst, src, len) -``` - -#### Layer 1: token helpers - -```text -push_source_token(kind, text) -push_macro_body_token(token_ptr) -push_pool_token(token_ptr) -copy_token(dst_ptr, src_ptr) -tok_eq_const(tok, const_ptr, len) -> bool -span_eq_token(span, tok) -> bool -``` - -#### Layer 2: stream helpers - -```text -push_stream(toks_ptr, count, pool_mark) -pop_stream() -current_stream() -> stream_ptr -stream_peek(stream) -> token_ptr -stream_advance(stream) -``` - -#### Layer 3: macro table helpers - -```text -find_macro(call_tok) -> macro_ptr or 0 -find_param(macro_ptr, body_tok) -> param_index+1 or 0 -define_macro(stream_ptr) -``` - -No `find_prefixed_param` or local-rewrite helper is needed for this -feature set. - -#### Layer 4: argument parser - -```text -parse_args(stream_ptr, lparen_tok_ptr) -``` - -Outputs: - -```text -arg_starts[i] = first token ptr -arg_ends[i] = exclusive token ptr -arg_count -call_end_pos = token ptr after closing RPAREN -``` - -It tracks nested parentheses with a depth counter. Commas split only at -depth 1. - -#### Layer 5: macro body expander - -```text -expand_macro_at(stream_ptr, call_tok, macro_ptr) -``` - -Algorithm: - -1. Parse call args. -2. Validate arg count. -3. Save `mark = pool_used`. -4. Walk macro body tokens. -5. If body token is a param, copy arg tokens into the pool. -6. Otherwise copy the body token as-is. -7. Run paste compaction over `[mark, pool_used)`. -8. Push an expansion stream if non-empty; otherwise restore pool mark. - -#### Layer 6: paste pass - -```text -paste_range(start_ptr, end_ptr) -> new_count -``` - -This is an in-place compactor over `expand_pool`. 
- -Rules: - -```text -## cannot be first or last -left/right operands cannot be NEWLINE or PASTE -pasted result is TOK_WORD -if a substituted parameter participates in ##, its argument must be exactly one token -``` - -#### Layer 7: expression evaluator - -Do not implement expression evaluation as recursive P1 calls. Use an -explicit expression frame stack. That avoids fragile recursion and makes -macro-in-expression expansion controllable. - -Expression evaluator API: - -```text -eval_expr_range(start_tok_ptr, end_tok_ptr) -> r0 value -``` - -Internal state: - -```text -expr_pos -expr_end -expr_frame_top -expr_done -expr_result -``` - -Loop model: - -1. Skip expression newlines. -2. If token is `(`: - Read next token as operator. - Convert operator token to `op_code`. - Push an expression frame with `argc = 0`, `accum = 0`. - Advance past the operator. -3. If token is `)`: - Finalize the top frame based on `op_code` and `argc`. - Pop the frame. - Feed the produced value into the parent frame, or finish if there is - no parent. -4. If token is an atom: - If token is a macro call, expand it to the pool, then evaluate that - expansion as a nested expression range. - Otherwise parse the integer atom. - Feed the value into the parent frame, or finish if there is no - parent. - -Operators: - -```text -+ variadic, argc >= 1 -- unary neg or binary/variadic subtract, argc >= 1 -* variadic, argc >= 1 -/ binary, div-by-zero check -% binary, div-by-zero check -<< binary ->> binary arithmetic shift -& variadic, argc >= 1 -| variadic, argc >= 1 -^ variadic, argc >= 1 -~ unary -= binary -== binary alias -!= binary -< binary -<= binary -> binary ->= binary -``` - -Keeping the full current operator set is cheap and avoids pointless -divergence from the C oracle. 
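The loop model above can be sketched in Python as follows. This is an illustration under stated assumptions, not the oracle's code: tokens are plain strings, only a subset of the operator table is included, integer atoms are parsed with Python's `int(tok, 0)` (the oracle's `parse_int_token` may accept a different syntax), and macro calls inside expressions, division, and 64-bit wraparound are left out.

```python
from functools import reduce

# Hypothetical subset of the operator table: name -> (arity, fold function).
# Arity None means variadic with at least one argument.
OPS = {
    "+": (None, lambda a: sum(a)),
    "*": (None, lambda a: reduce(lambda x, y: x * y, a)),
    "-": (None, lambda a: -a[0] if len(a) == 1
          else reduce(lambda x, y: x - y, a)),
    "&": (None, lambda a: reduce(lambda x, y: x & y, a)),
    "|": (None, lambda a: reduce(lambda x, y: x | y, a)),
    "<<": (2, lambda a: a[0] << a[1]),
    "=": (2, lambda a: int(a[0] == a[1])),
    "<": (2, lambda a: int(a[0] < a[1])),
}


def eval_expr_range(tokens):
    """Evaluate one Lisp-shaped expression without recursive calls."""
    frames = []    # explicit frame stack; each frame is [op_name, args]
    result = None  # set once the outermost expression has finished
    pos = 0

    def feed(value):
        # Hand a value to the parent frame, or finish if there is none.
        nonlocal result
        if frames:
            frames[-1][1].append(value)
        else:
            result = value

    while pos < len(tokens):
        tok = tokens[pos]
        if tok == "\n":            # skip expression newlines
            pos += 1
            continue
        if result is not None:
            raise ValueError("extra tokens after expression")
        if tok == "(":
            if pos + 1 >= len(tokens) or tokens[pos + 1] not in OPS:
                raise ValueError("bad operator")
            frames.append([tokens[pos + 1], []])
            pos += 2               # advance past '(' and the operator
        elif tok == ")":
            if not frames:
                raise ValueError("unbalanced ')'")
            op, args = frames.pop()
            arity, fold = OPS[op]
            if not args or (arity is not None and len(args) != arity):
                raise ValueError("wrong argument count for " + op)
            feed(fold(args))
            pos += 1
        else:
            feed(int(tok, 0))      # integer atom (decimal or 0x-prefixed hex)
            pos += 1

    if frames or result is None:
        raise ValueError("unterminated or empty expression")
    return result
```

For instance, `eval_expr_range("( + ( << 1 4 ) 8 )".split())` pushes a frame for `+`, then one for `<<`, and each `)` pops a frame and feeds its value to the parent, which mirrors steps 2 to 4 of the loop model.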
- -For macro-in-expression, the clean composition is: - -```text -eval atom sees %NAME followed by LPAREN -expand_macro_at into pool without pushing a stream -temporarily evaluate [mark, mark + expanded_count) -require exactly one expression result and no extra tokens -restore pool mark -advance outer expr_pos to call_end_pos -``` - -That gives the C behavior without mixing expression parsing with the -main output stream. - -#### Layer 8: builtins - -```text -expand_builtin_call(stream_ptr, builtin_tok) -``` - -`!@%$` - -```text -parse args -require one arg -value = eval_expr_range(arg_start, arg_end) -emit_hex_value(value, 1 2 4 or 8) -advance stream pos to call_end_pos -line_start = 0 -``` - -`%select`: - -```text -parse args -require three args -value = eval_expr_range(arg0_start, arg0_end) -chosen = arg1 if value != 0 else arg2 -copy chosen tokens to expand_pool -advance stream pos to call_end_pos -push chosen stream -line_start = 0 -``` - -Only `cond` is evaluated eagerly. The selected branch is rescanned as a -normal token stream; the unselected branch is ignored completely. - -#### Layer 9: main processor - -```text -process_tokens: - push_stream(source_tokens, source_count, -1) - - while stream_top > 0: - s = current_stream() - if s.pos == s.end: - pop_stream() - continue - - tok = *s.pos - - if s.line_start && tok == "%macro": - define_macro(s) - continue - - if tok.kind == NEWLINE: - emit_newline() - s.pos += 24 # one Token record - s.line_start = 1 - continue - - if tok is builtin call: - expand_builtin_call(s, tok) - continue - - if tok is defined macro call: - expand_call(s, macro) - continue - - emit_token(tok) - s.pos += 24 - s.line_start = 0 -``` - -### Implementation Slices - -The port is broken into phases. Each phase ends with a dedicated test -under `tests/m1pp/` and a parity check (where applicable) against the C -oracle in `m1pp/m1pp.c`. 
The target ISA is **P1v2** (registers -`a0..a3`, `t0..t2`, `s0..s3`; `enter`/`leave`; `la_br`); the DEFINE -table is `build/p1v2/aarch64/p1_aarch64.M1`. Aarch64 is the staging -arch (matches the macOS host so podman runs natively). - -Each phase below lists the oracle entry points in `m1pp/m1pp.c` that -the M1 port lifts for that slice. Line numbers are hints — track by -symbol name. - -- [x] **Phase 0 — Build/run/diff infra under `m1pp/`.** - `m1pp/build.sh <source.M1> <out>` lints against the P1v2 DEFINE - table, prunes unused DEFINEs, runs M0 + hex2-0 with the aarch64 - ELF header inside the distroless-busybox container, and deposits - a runnable binary. `m1pp/test.sh` walks fixtures in `tests/m1pp/` - and picks mode by extension: `.M1` fixtures are built and run stand-alone; - `.M1pp` fixtures are fed to a one-time build of `m1pp/m1pp.M1` as - input, and the produced output file is diffed. Wired into - `make test-m1pp`. Phase 0 fixture: `tests/m1pp/00-hello.M1` — a - P1v2 hello-world that proves the pipeline without depending on - `m1pp/m1pp.M1`'s current state. - -- [x] **Phase 1 — Port lexer + pass-through skeleton to P1v2.** - Rewrite `_start`, read/write, `lex_source`, `emit_token`, - `emit_newline`, `process_tokens`, and the structural %macro skip - in P1v2 conventions (`a*`/`t*`/`s*` registers, `enter SIZE` / - `leave`, `la_br &label`). Verify byte-for-byte parity against the - C oracle on a definition-only fixture (tokenizer pass-through). - Oracle entry points: `main`, `lex_source`, `emit_token`, - `emit_newline`, `process_tokens` (pass-through branches only), - plus `append_text_len`, `push_token`, `token_text_eq`, - `span_eq_token`. - -- [x] **Phase 2 — Macro definition storage.** - Replaced structural skipping with real storage: `define_macro` - parses the header (name, params with comma splits, trailing - newline) and copies body tokens into `macro_body_tokens[]` until - a line-start `%endm`. 
Records land in a 32-slot `macros[]` arena - (296 B/record). Macros are not yet called — defs-only input - matches the oracle. `find_macro` / `find_param` deferred to the - phases that exercise them (Phase 5). - Oracle: `define_macro`, `find_macro`, `find_param`. - -- [ ] **Phase 3 — Stream stack + expansion-pool lifetime.** - Stream stack push/pop for recursive rescanning; expansion-pool - mark/restore on stream pop. No semantic change until Phase 4 - wires macro calls in, but isolates the lifecycle plumbing. - Oracle: `push_stream_span`, `current_stream`, `pop_stream`, - `copy_span_to_pool`, `push_pool_stream_from_mark`. - -- [ ] **Phase 4 — Argument parsing.** - Nested-paren depth tracking, comma split at depth 1, argument- - count validation, `call_end_pos` output. - Oracle: `parse_args`. - -- [ ] **Phase 5 — Plain parameter substitution.** - Walk macro body; substitute params via the expand pool; push - resulting slice as a stream. Enforces single-token-arg rule for - parameters adjacent to `##` (still no actual paste yet). - Oracle: `expand_macro_tokens` (parameter loop), - `copy_arg_tokens_to_pool`, `copy_paste_arg_to_pool`, - `expand_call`. - -- [ ] **Phase 6 — `##` token paste compaction.** - In-place compactor over the expand pool. Rejects misplaced or - malformed paste sites. - Oracle: `paste_pool_range`, `append_pasted_token`. - -- [ ] **Phase 7 — Integer atoms + S-expression evaluator.** - Integer-token parsing; explicit expression-frame stack; all - operators from the oracle; macro-in-expression composition (the - required path for `p1/P1-aarch64.M1pp`). - Oracle: `parse_int_token`, `expr_op_code`, `apply_expr_op`, - `eval_expr_atom`, `eval_expr_range`, `skip_expr_newlines`. - -- [ ] **Phase 8 — `!@%$(expr)` builtins.** - One-arg builtins on top of the evaluator; emit LE 1/2/4/8-byte - hex tokens. - Oracle: `expand_builtin_call` (the `!@%$` cases), `emit_hex_value`. 
- -- [ ] **Phase 9 — `%select(cond, then, else)`.** - Eager `cond` eval; copy chosen branch to expand pool, push as - stream; never evaluate the unchosen branch. - Oracle: `expand_builtin_call` (the `%select` case). - -- [ ] **Phase 10 — Full-parity + malformed-input smoke tests.** - Run `tests/m1pp/_full-parity.M1pp` against the M1 implementation - (unpark by dropping the `_` prefix); - add malformed fixtures (unterminated macro, wrong arg count, bad - paste, bad expression, bad builtin arity) requiring non-zero - exit. Then run combined `p1/P1-aarch64.M1pp + p1/P1.M1pp` through - the M1 expander and diff against the Python-generated - `build/p1v2/aarch64/p1_aarch64.M1`. Finally use the produced - frontend on a small P1 program through the normal toolchain. diff --git a/docs/M1M-P1-PORT.md b/docs/M1M-P1-PORT.md @@ -1,257 +0,0 @@ -# m1macro to P1 Port Plan - -## Goal - -Replace `src/m1macro.c` with a real P1 implementation in `src/m1m.M1`. -`src/m1m.M1` must be pure portable P1 source. The final `m1m` binary must -expand M1M input without shelling out to awk, C, Python, libc, or any host -macro processor. - -Contract: - -``` -m1m input.M1 output.M1 -``` - -Behavior should match `src/m1macro.c` byte-for-byte for valid inputs in the -current M1M feature set, except where an implementation limit is explicitly -documented. - -Architecture-specific code is not allowed in `src/m1m.M1`. The only -architecture-specific layer is the generated P1 `DEFINE` file that `catm` -prepends before `src/m1m.M1` during assembly. If the port needs additional -P1 op/register/immediate combinations, add them to the generator and -regenerate the arch-specific define tables. 
- -## Scope - -Implement the current M1M feature set needed by `p1/*.M1M` to define -instruction encodings: - -- `%macro NAME(a, b)` / `%endm` -- `%NAME(x, y)` function-like expansion with recursive rescanning -- `##` token paste -- `!(expr)` / `@(expr)` / `%(expr)` / `$(expr)` -- `%select(cond, then, else)` -- Lisp-shaped integer expressions used by the builtins - -Not supported: per-expansion locals (`@local`, `:@local`, `&@local`), -prefixed parameter substitution (`:param`/`&param`), duplicate macro -diagnostics, and byte-identical malformed-input diagnostics. Avoid duplicate -macro names; the feature set does not promise a particular -duplicate-definition behavior. - -Preserve the C tokenizer model: whitespace is normalized, strings are single -tokens, `#` and `;` comments are skipped, and output is emitted as tokens plus -newlines rather than preserving original formatting. - -## Static Data Model - -Use fixed BSS arenas, mirroring the C implementation: - -- Input buffer: raw file contents plus NUL sentinel. -- Output buffer: emitted text. -- Text buffer: copied token text and generated text. -- Source token array: token records for the original input. -- Macro table: name, params, and body token records. -- Expansion pool: temporary tokens produced by macro calls and `%select`. -- Stream stack: active token streams for recursive rescanning. - -Token record layout should be compact and uniform: - -``` -kind token kind -text source span in `input_buf` or synthesized span in `text_buf` -``` - -Macro records should store name/parameter text spans plus a body token -range, not inline strings. Prefer record shapes that stay uniform across the -codebase so the address math remains easy to audit in P1. - -## Implementation Milestones - -## Incremental TODO - -Use this checklist to finish the port in reviewable slices. Each checked item -should build `m1m` and include at least one C-oracle comparison where -applicable. 
- -- [x] Land the portable P1 runtime shell: argv validation, input open/read, - output open/write, fatal-error reporting, and no external expander path. -- [x] Add the first fixed BSS arenas for input, output, text, source tokens, - and runtime counters. -- [x] Add initial text/token/output helpers: append copied token text, push - source token records, compare token text with constants, emit tokens, and - emit newlines. -- [x] Port the C tokenizer model for source input: whitespace skipping, - string tokens, `##`, comments, delimiters, word tokens, newline tokens, and - text-buffer copies. -- [x] Add a first processor skeleton that normalizes pass-through output and - structurally skips line-start `%macro` ... `%endm` definitions. -- [x] Extend generated P1 support for the current `m1m.M1` needs: broader - `ADDI` immediates, token/record memory offsets, and full RRR register - tuples. The Makefile still prunes unused DEFINEs before assembly. -- [x] Verify the current slice: `make PROG=m1m ARCH=aarch64 build/aarch64/m1m`, - byte-identical C-oracle output for definition-only library inputs - `p1/aarch64.M1M` and `p1/P1.M1M` (these currently only prove structural - `%macro` skipping, not macro-call expansion), and byte-identical tokenizer - pass-through fixture output. -- [x] Add `tests/m1m/full-parity.M1M` and its C-oracle expected output as the - real expansion parity target. This fixture intentionally uses macro calls, - recursive rescanning, paste, `!@%$(` and `%select`; it is expected - to fail under the partial P1 implementation until the remaining unchecked - expansion tasks land. -- [ ] Replace structural `%macro` skipping with real macro table storage: - parse headers, parameters, body tokens, body limits, and line-start `%endm` - recognition. -- [ ] Add stream stack push/pop for recursive rescanning and expansion-pool - lifetime management. -- [ ] Port macro call argument parsing, including nested parentheses and - argument-count validation. 
-- [ ] Port plain parameter substitution, including the single-token argument - requirement when a parameter participates in `##`. -- [ ] Port `##` token paste, including bad operand and misplaced paste - failures. -- [ ] Port integer atom parsing and S-expression evaluation for arithmetic, - comparisons, shifts, and bitwise operators. -- [ ] Implement `!@%$(expr)` on top of expression - evaluation and token emission. -- [ ] Implement `%select(cond, then, else)` on top of expression evaluation - and stream pushback. -- [ ] Add malformed-input smoke tests: unterminated macro, wrong arg count, - bad paste, bad expression, and bad builtin arity. These only need non-zero - exit, not exact diagnostic text. -- [ ] Use the P1 `m1m` binary to expand a representative M1M frontend and - assemble a small program through the normal stage0 toolchain. -- [ ] Revisit static limits and error strings so every documented arena limit - has a clear fatal path. -- [ ] Re-run all acceptance tests and update this plan with any explicitly - documented implementation limits. - -1. **Runtime shell** - - Keep the existing P1 argv, open/read, write, and fatal-error paths. Remove - any external backend or `execve` shortcut. - -2. **Text and token primitives** - - Add helpers for `append_text_len`, `push_token`, token equality, - span equality, output token emission, and output newline emission. - Keep error handling simple: set an error message pointer and branch to - `fatal`. - -3. **Lexer** - - Port `lex_source` directly. It should fill `source_tokens` from - `m1m_input_buf`, copying all token text into `text_buf`. - -4. **Stream processor skeleton** - - Implement push/pop stream and the main `process_tokens` loop. Initially - support pass-through tokens and `%macro` skipping, then expand toward full - behavior. - -5. **Macro definitions** - - Port `define_macro`: parse header, params, body tokens, body limit checks, - and line-start `%endm` recognition. - -6. 
**Macro call expansion** - - Port `parse_args`, plain parameter substitution, token paste, and - expansion-stream pushback. - -7. **Expression evaluator** - - Port integer atom parsing and S-expression evaluation. Implement arithmetic, - comparisons, shifts, and bitwise ops over 64-bit signed values as far as P1 - can represent them. Document any temporary 32-bit limitation if unavoidable, - but the target is C-compatible 64-bit behavior. - -8. **Builtins** - - Implement `!@%$(` and `%select` on top of the expression evaluator - and stream pushback. - -9. **Cleanup and limits** - - Replace generic “not implemented” errors with coarse but useful failures - for buffer overflow, malformed macro headers, arg-count mismatch, bad - expressions, and bad paste operands. Exact C diagnostic parity is not a - goal. - -## Portability Rule - -`src/m1m.M1` must use only P1 tokens plus labels/data. Do not hand-code -aarch64, amd64, or riscv64 instructions in this file. Do not introduce -per-arch branches, per-arch data layouts, or per-arch syscall sequences in the -implementation. - -Allowed architecture-specific work: - -- Extend `src/p1_gen.py` when `m1m.M1` needs a P1 operation tuple that is not - currently generated. -- Regenerate `build/<arch>/p1_<arch>.M1`. -- Keep the existing build shape where the arch-specific define file is - prepended with `catm` before the portable P1 source. - -All algorithmic behavior, buffer layout, parsing, expansion, expression -evaluation, and error handling belongs in portable P1. - -## P1 Support Needed - -The current build may stage `PROG=m1m` on aarch64 first, but the source must -remain portable P1 from the start. Staging on one arch is a build milestone, -not permission to add arch-specific source. - -Likely generator/table updates: - -- More `ADDI` immediates for record-size and arena-limit arithmetic. -- More `LD/ST/LB/SB` offsets for token, macro, and stream record fields. 
-- Additional RRR register triples used by parser loops and address math. -- Possibly a small set of helpers/macros for 32-byte record addressing. - -Do not hide core behavior behind host tools. If a P1 operation is missing, -extend the generated P1 definitions or rewrite the algorithm in available P1. - -## Acceptance Tests - -Use `src/m1macro.c` as the oracle during development. - -Minimum checks: - -1. Build `m1m`: - - ``` - make PROG=m1m ARCH=aarch64 build/aarch64/m1m - ``` - -2. Compare representative inputs against the C implementation: - - ``` - src/m1macro.c oracle: p1/aarch64.M1M - src/m1macro.c oracle: p1/P1.M1M - custom fixture: paste, recursive rescanning, !@%%(, %select - malformed fixtures: bad paste, wrong arg count, bad expression - ``` - -3. Require byte-identical output for valid fixtures. - -4. Require non-zero exit for invalid fixtures. - -5. Once stable, use `m1m` to expand the P1 M1M front-end and assemble a small - program through the normal stage0 toolchain. - -## Non-Goals - -- No dependency on awk, shell scripts, Python, libc, or the host C compiler at - runtime. -- No new macro language features. -- No formatting preservation beyond the current C expander behavior. -- No recursive macro cycle detection unless added after parity. - -## Done Definition - -`src/m1m.M1` contains the expander core, the generated `m1m` binary runs in the -target Alpine container, and all acceptance tests match `src/m1macro.c` without -executing any external macro-expansion program. diff --git a/docs/M1PP-EXT.md b/docs/M1PP-EXT.md @@ -1,656 +0,0 @@ -# M1PP extensions for the seed Scheme interpreter - -Three independent additions to `m1pp/m1pp.c`, ordered by sequencing. - -Motivation: when writing the seed Lisp interpreter portably across three -arches, most pain in `lisp/lisp.M1` traces to two things — hand-named -scratch labels that collide when a pattern is reused, and argument -substitution that can't carry instruction bodies (commas break the -parser). 
`strlen` is the smaller third item: it removes a class of -hand-counted-length bugs in error messages and string literals. - -## 1. Local labels - -### Syntax - -Two prefixed word forms, recognized only when they appear in a macro -body as body-native tokens: - -- `:@name` — label definition, scoped to the current expansion -- `&@name` — address-of reference, scoped to the current expansion - -### Semantics - -Each `%NAME(...)` invocation allocates a fresh expansion id `NN` from a -global monotonic counter. While copying body-native tokens into the pool, -any TOK_WORD whose text starts with `:@` or `&@` (and has ≥1 char after -the `@`) is rewritten to the corresponding non-`@` form with `__NN` -suffixed: `:@end` → `:end__42`, `&@end` → `&end__42`. - -**Scoping.** Rename body-native tokens only. Argument-substituted tokens -pass through unchanged — they were already renamed under the caller's -`NN` if the caller was itself a macro body. This gives lexical label -scoping: nested and stacked macros each see their own labels, collisions -are impossible. - -**Interaction with `##`.** None. The rename happens before the paste -pass; a body `:@end##_lbl` renames `:@end` first, then pastes. Edge -cases here should error out (pasting onto a renamed label is almost -certainly a bug); leave it unconstrained for v1 and revisit if it -bites. - -### Tokenizer - -No changes. `:@foo` / `&@foo` already tokenize as single TOK_WORD under -the current word-terminator set (`m1pp.c:310`). The existing `@(...)` -builtin dispatch keys on token text being exactly `@` followed by -LPAREN, so `@foo` words do not collide. - -### m1pp.c touchpoints - -- One new static `int next_expansion_id` (monotonic, never reset). -- `expand_macro_tokens` (`m1pp.c:670`): allocate `NN = ++next_expansion_id` - before the body-walk. 
Inside the body-copy loop, when about to push a - body-native TOK_WORD whose text starts with `:@` or `&@`: - - build the renamed text directly by appending bytes into `text_buf`: - copy the original token bytes (sigil + tail), append `__`, then - append the decimal digits of `NN` - - push a TOK_WORD pointing at the new text span - -Avoid `snprintf`. The m1m port (`docs/M1M-P1-PORT.md`) will reimplement -every new m1pp feature in P1 assembly; varargs format parsing is a -non-trivial thing to port. Plain byte appends plus a hand-rolled -integer → decimal emit (the `display_uint` reverse-fill pattern already -in `lisp/lisp.M1:2983`) port cleanly. - -Concretely in C: reserve a small stack scratch (say 16 bytes), fill -digits right-to-left via repeated `%10` / `/=10`, then `append_text_len` -the sigil bytes, the tail bytes, `"__"`, and the digit run. A 32-bit -counter fits in 10 decimal digits; collision across a file is a -non-concern because the counter is file-global and monotonic. - -No struct changes. No lexer changes. No new global syntax. - -## 2. Braced block arguments - -### Syntax - -Curly braces `{` and `}` group tokens into a single macro argument, -protecting commas inside the group from the comma-splits-args rule. - -``` -%if_eq(r1, r2, { - li(r0) - %5 - st(r0, r3, 0) -}) -``` - -Without braces, `st(r0, r3, 0)` exposes two commas at paren depth 1 and -the call parses as 5 args instead of 3. - -### Semantics - -- `{` and `}` are new TOK kinds, tokenized as single-char delimiters. -- In `parse_args`, a `brace_depth` counter runs parallel to the paren - `depth`. Commas at `depth == 1` split args **only when - `brace_depth == 0`**. LBRACE increments, RBRACE decrements. -- When copying an arg span into a macro body, if the span begins with - TOK_LBRACE and ends with matching TOK_RBRACE at the outermost level, - strip the outer pair. Otherwise copy verbatim — `%foo(plain)` stays - working. -- Braces never reach output. 
Either filter them during substitution or - make `emit_token` treat both kinds as no-ops (belt-and-braces; I'd - do both). - -### Nesting - -`{ { ... } }` nests via `brace_depth`. Braces inside a `"..."` string -stay inside the string token — the lexer already handles that. - -Braces and parens are independent. `{ ( }` is syntactically fine in the -arg-splitter; paren balancing only cares about LPAREN/RPAREN. - -### Tokenizer - -`lex_source` (`m1pp.c:232`): add LBRACE/RBRACE cases alongside the -existing LPAREN/RPAREN cases (~10 lines). Add `{` and `}` to the -word-terminator set at `m1pp.c:310`. - -### m1pp.c touchpoints - -- New TOK_LBRACE, TOK_RBRACE enum entries (`m1pp.c:77`). -- `lex_source`: two new single-char token cases. -- `parse_args` (`m1pp.c:543`): add `brace_depth` counter; gate the - comma-split on `brace_depth == 0`; LBRACE/RBRACE bump/drop it. -- Arg copy (in `expand_macro_tokens`, via `copy_arg_tokens_to_pool` and - `copy_paste_arg_to_pool`): detect outer `{ ... }` wrapping and strip. - The `copy_paste_arg_to_pool` path (single-token arg for `##`) should - reject braced args — pasting onto a block is nonsense. -- `emit_token`: no-op for both brace kinds (defensive; they shouldn't - reach here if substitution is clean). - -### What this does not give you - -A C-like block-statement form (`%if_eq(a,b) { … } %else { … } %endif`) -needs `process_tokens` to recognize line-start block openers/closers — -a separate, heavier change. Braced args get us -`%if_eq_else(a, b, { then }, { else })` and `%while_nez(r, { body })`, -which covers the patterns in lisp.M1 we care about. Defer the block- -statement form until braced-arg shows real ergonomic pain. - -## 3. `strlen` expression op - -### Syntax - -A new unary op in the Lisp-shaped expression grammar: - -``` -(strlen "literal") -``` - -Composes with arithmetic like any other op: - -``` -%((+ (strlen "hello") 1)) -``` - -### Semantics - -- Argument must be a single `TOK_STRING` atom (double-quoted form). 
-- Value is the raw byte count between the quotes: `span.len - 2`. - Matches what M1's `"…"` emission writes before appending NUL. -- Single-quoted `'…'` hex literals error out — strlen is meaningless - on raw hex. - -### No decimal emitter needed - -The 4-byte LE hex emitter `%(expr)` is sufficient. Two paths cover -everything: - -1. Companion DEFINE: - - ``` - %macro defstr(label, text) - :label text - DEFINE label##_LEN %((strlen text)) - %endm - ``` - - M1 substitutes `label_LEN` with its 4 hex bytes at each use site. - -2. Inline at an LI-immediate slot: - - ``` - li_r2 %((strlen "usage: …")) - ``` - - LI's inline literal slot takes 4 raw LE bytes; `05000000` and `%5` - are byte-equivalent there. lisp.M1 already relies on this (see - `DEFINE NIL 07000000` at `lisp/lisp.M1:30`, consumed as - `li_r0 NIL`). - -The 1/2/8-byte emitters (`!(e)`, `@(e)`, `$(e)`) cover non-4-byte widths -if needed. - -### m1pp.c touchpoints - -- `EXPR_STRLEN` entry in the `ExprOp` enum (`m1pp.c:87`). -- `expr_op_code` (`m1pp.c:751`): match the word `strlen`. -- Eval path: `strlen` is a degenerate case — its "argument" is a - TOK_STRING, not a recursive expression. Easiest is a special-case - branch in `eval_expr_range` (`m1pp.c:976`) that handles `(strlen - "...")` directly rather than routing through `eval_expr_atom`. - Emit `span.len - 2` as the value. -- Alternative: extend `eval_expr_atom` to accept TOK_STRING atoms with - value `len - 2`, and treat `strlen` as identity. Cleaner - composition but more surface area; defer unless needed. - -## 4. Paren-less 0-arg macro calls - -### Syntax - -A macro defined with zero parameters may be called without trailing -`()`: - -``` -%macro FRAME_BASE() -16 -%endm - -%((+ %FRAME_BASE 8)) ## paren-less -%((+ %FRAME_BASE() 8)) ## still works -``` - -### Semantics - -- When `find_macro` matches a `%NAME` token and the macro's - `param_count == 0`, the expansion triggers whether or not an LPAREN - follows. 
-- Applies in both contexts where a macro call is currently recognized: - top-level processing in `process_tokens`, and atom position in - `eval_expr_atom` so a 0-arg macro is a valid expression atom inside - `%(...)`. -- Non-zero-param macros still require their existing `(arg, ...)` - syntax. -- `%foo` where `foo` is not defined as a macro still passes through - unchanged — the match only fires when a matching 0-param macro - exists. Backward compatible. - -### Why it matters - -The one feature that needs it is `%struct` field access (§5). Once -`NAME.field` expands to an integer, writing `%NAME.field` reads as a -named constant; `%NAME.field()` looks like a function call. The -relaxation is also load-bearing for expression-level composition: -`%((+ %frame_hdr.SIZE %frame_apply.callee))` needs both atoms to -resolve as 0-arg calls inside the evaluator. - -### m1pp.c touchpoints - -- `process_tokens` (`m1pp.c:1225`): the LPAREN-next guard becomes - "LPAREN-next OR `param_count == 0`." -- `eval_expr_atom` (`m1pp.c:944`): same relaxation on the same guard. -- The zero-param paren-less path constructs an empty arg list and - calls `expand_macro_tokens` with `arg_count == 0` — no - `parse_args` change. -- No lexer changes, no new token kinds, no new Macro fields. - -## 5. `%struct` directive - -### Syntax - -A top-level directive declaring a fixed-layout aggregate of 8-byte -fields: - -``` -%struct closure { hdr params body env } -``` - -Fields are bare identifiers separated by whitespace and/or commas. -The closing brace terminates the declaration. - -### Semantics - -Expands at declaration time to N+1 zero-parameter macros: - -- `NAME.field_k` → `k * 8` for each field at index k -- `NAME.SIZE` → `N * 8` - -All fields are 8-byte words. Mixed widths are deferred until a real -use case appears. 
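As a rough illustration of the `%struct` semantics above, the following Python sketch models the declaration-time expansion as a plain dictionary from macro name to integer value. It is not the `m1pp.c` implementation: the function names `parse_struct_decl` and `struct_offsets` are hypothetical, and the regex-based parsing only approximates the real tokenizer.

```python
import re

WORD_BYTES = 8  # the note fixes every field at 8 bytes


def parse_struct_decl(decl):
    """Split '%struct NAME { f0 f1 ... }' into (name, [fields]).

    Hypothetical helper: fields are separated by whitespace and/or commas,
    and the closing brace ends the declaration.
    """
    m = re.fullmatch(r"\s*%struct\s+([^\s{}]+)\s*\{([^{}]*)\}\s*", decl)
    if m is None:
        raise ValueError("malformed %struct declaration")
    fields = [f for f in re.split(r"[\s,]+", m.group(2)) if f]
    return m.group(1), fields


def struct_offsets(decl):
    """Model the N+1 zero-parameter macros a %struct declaration produces."""
    name, fields = parse_struct_decl(decl)
    table = {f"{name}.{field}": k * WORD_BYTES for k, field in enumerate(fields)}
    table[f"{name}.SIZE"] = len(fields) * WORD_BYTES
    return table


closure = struct_offsets("%struct closure { hdr params body env }")
```

Under this model, `closure["closure.body"]` plays the role of `%closure.body` at a call site, and `closure["closure.SIZE"]` the role of `%closure.SIZE`.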
- -Callers consume these as paren-less 0-arg calls (per §4): - -``` -ld(r0, r1, %closure.body) -enter(%frame_apply.SIZE) -``` - -### No `base=` parameter - -The struct primitive declares offsets from zero. Base offsets (e.g. -for stack-frame locals sitting above the retaddr/caller-sp header) -compose at the call site via an ordinary wrapper macro: - -``` -%struct frame_hdr { retaddr caller_sp } ## SIZE = 16 - -%macro frame(field) -%((+ field %frame_hdr.SIZE)) -%endm - -%struct frame_apply { callee args body env } - -:apply - enter(%frame_apply.SIZE) - st(r1, sp, %frame(%frame_apply.callee)) ## 0 + 16 = 16 - st(r2, sp, %frame(%frame_apply.args)) ## 8 + 16 = 24 - … - leave() - ret -``` - -Heap structs access fields directly (`%closure.body`); stack frames -route through the `%frame` wrapper. Same primitive, two conventions, -no special casing inside `%struct`. If a function needs a different -base (e.g. a permanent spill prefix), define `%frame_big(field)` -alongside `%frame` — the struct declarations don't change. - -### Tokenizer - -- `.` is already a word char, so `NAME.field` tokenizes as one - TOK_WORD under the current word-terminator set (`m1pp.c:310`). -- `{` / `}` reuse the TOK_LBRACE / TOK_RBRACE kinds introduced for §2. - `%struct` cannot land before §2 does. - -### m1pp.c touchpoints - -- New top-level directive branch in `process_tokens` (`m1pp.c:1192`) - alongside the existing `%macro` detection. At line-start, if the - first word is `%struct`: - - consume name, `{`, field-identifier list (WORD tokens, - comma-or-whitespace separated), `}`, trailing newline - - for each field k, generate an entry in `macros[]`: - - name = synthesized `"NAME.field_k"` in `text_buf` - - `param_count = 0` - - body = a single TOK_WORD whose text is the decimal rendering of - `k * 8` in `text_buf` - - emit a final `"NAME.SIZE"` entry pointing at `N * 8` -- Integer → decimal rendering reuses the hand-rolled reverse-fill - pattern from §1 local labels — no `snprintf`. 
-- No new expression-evaluator surface; consumption goes through the - existing `find_macro` + `eval_expr_atom` path once §4 lands. -- No new Macro struct fields. A struct-generated macro is - indistinguishable from any other 0-param macro once declared. - -### What this does not give you - -- **Mixed-width fields.** All offsets are `k * 8`. The packed 8-bit - type + 8-bit gc-flags + 48-bit length header in lisp.M1 is easier - to handle with dedicated bit-op macros than struct syntax; defer. -- **Bundled enter/leave per frame.** A `%frame NAME { … }` directive - that also emits ENTER/LEAVE around a body would bring back the - block-body problem and tightly couple locals to one function shape. - The call-site verbosity savings don't pay; use plain `%struct` plus - a wrapper macro. - -## 6. `%enum` directive - -### Syntax - -A top-level directive declaring an incrementing sequence of named -integer constants: - -``` -%enum tag { fixnum pair vector string symbol proc singleton } -%enum prim_id { add sub mul div mod eq lt gt ... } -``` - -### Semantics - -Expands at declaration time to N+1 zero-parameter macros: - -- `NAME.label_k` → `k` for each label at index k -- `NAME.COUNT` → `N` - -Callers consume these as paren-less 0-arg calls (per §4): - -``` -li_r2 %tag.pair ## loads 1 -%((= %prim_id.COUNT 45)) ## compile-time sanity check -``` - -### Relationship to `%struct` - -Implementation-wise, `%enum` is `%struct` with stride 1 instead of 8 -and a totalizer named `COUNT` instead of `SIZE`. The directive -parser, brace consumption, field-list parsing, and macro-generation -loop are all shared. Factor the §5 implementation around one helper -parameterized by `(stride, totalizer_name)`: - -- `%struct` → `define_fielded(8, "SIZE")` -- `%enum` → `define_fielded(1, "COUNT")` - -No separate code path; adding `%enum` is a second line-start -directive check in `process_tokens` plus one call to the shared -helper. 
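The shared-helper factoring described above can be sketched in Python as follows. This is a model under stated assumptions, not the C oracle: the note gives the helper only as `define_fielded(stride, totalizer_name)`, so passing the directive name and the already-parsed field list as explicit parameters is an assumption made here, as is representing the macro table as a dictionary.

```python
def define_fielded(name, fields, stride, totalizer_name):
    """Model the macros generated for one %struct or %enum declaration.

    Each field at index k becomes NAME.field -> k * stride, followed by one
    totalizer entry NAME.<totalizer_name> -> N * stride.
    """
    macros = {}
    for k, field in enumerate(fields):
        macros[f"{name}.{field}"] = k * stride
    macros[f"{name}.{totalizer_name}"] = len(fields) * stride
    return macros


def define_struct(name, fields):
    # %struct: stride 8, totalizer SIZE
    return define_fielded(name, fields, 8, "SIZE")


def define_enum(name, fields):
    # %enum: stride 1, totalizer COUNT
    return define_fielded(name, fields, 1, "COUNT")


tag = define_enum(
    "tag", ["fixnum", "pair", "vector", "string", "symbol", "proc", "singleton"]
)
```

With the `%enum tag` declaration from the syntax example, `tag["tag.pair"]` is 1 and `tag["tag.COUNT"]` is 7, while the same field list passed to `define_struct` would yield byte offsets in steps of 8.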
- -### Why it matters - -lisp.M1 maintains two hand-numbered integer enumerations whose -numbering must stay in sync across disjoint sites: - -- Tag codes (`lisp/lisp.M1:35–47`) referenced throughout the - reader / eval / printer dispatchers. -- Primitive code IDs — used by the registration table and the - dispatch cascade (`lisp/lisp.M1:3843–3983`). Inserting a new - primitive in the middle shifts every downstream id; silent drift, - no error until runtime. - -`%enum` eliminates both drift classes: names declared once, -referenced by name everywhere, renumbering on insertion is automatic. - -### m1pp.c touchpoints - -Same as §5 with the two parameter differences above. No new Macro -struct fields, no new token kinds, no new expression-evaluator -surface. - -### What this does not give you - -- **Explicit values.** `%enum foo { a=5 b c }` is not supported in - v1. All values are consecutive from 0. C's explicit-value form - is useful when matching external ABIs; our enums are internal, so - defer until a real use case appears. -- **Flag/bitmask enums.** Not specially supported. If you want bit - positions, declare the bit index via `%enum` and take - `(1 << %NAME.flag_k)` at use sites. - -## 7. `%str` stringification builtin - -### Syntax - -A new builtin alongside `!(e)`, `@(e)`, `%(e)`, `$(e)`, `strlen`, -and `%select`: - -``` -%str(IDENT) -``` - -Takes a single WORD-token argument; produces a TOK_STRING literal -whose contents are the argument's text wrapped in double quotes: - -``` -%macro quoteit(name) -%str(name) -%endm - -%quoteit(hello) → "hello" -%quoteit(foo_bar) → "foo_bar" -``` - -### Semantics - -- Exactly one argument, kind TOK_WORD. Multi-token, pasted, or - already-string args error out. -- Output is a freshly-allocated TOK_STRING span in `text_buf` built - as `"` + original_text + `"`. 
The span's `len` is - `original_len + 2`, so `strlen` on the result (per §3) returns - `original_len` — the char count between the quotes, matching - what M1's `"…"` emission writes before the NUL. -- Produces a string literal, not a word. Complementary to `##`, not - a replacement — see below. - -### Relationship to `##` paste - -Both turn a parameter into something else, but they produce -**different token kinds** and serve **different goals**: - -| operator | inputs | output | kind | -|----------|-----------------|------------------|------------| -| `##` | two WORD tokens | one WORD token | TOK_WORD | -| `%str` | one WORD token | one STRING token | TOK_STRING | - -`##` joins word fragments to build identifiers / label names. -`%str` wraps a word in quotes to produce a string literal. They -can't substitute for each other: - -- `:str_quote` (a label definition) must be a word — `##` can - build it, `%str` can't. -- `"quote"` (a string literal) must introduce quote characters — - `%str` is the only way to manufacture it from a bare identifier, - paste can't. - -M1 sees the difference too: `:str_quote "quote"` is a label-def -word followed by a quoted-bytes directive (5 bytes + NUL). Paste -manufactures the first, stringify the second, both from the same -source identifier. - -### Why it matters - -Every special-form symbol in lisp.M1 (`lisp/lisp.M1:164–260`) -follows the same triad, written longhand 15 times today: - -``` -:str_quote "quote" -DEFINE str_quote_LEN 05000000 -:sym_quote %0 %0 -``` - -With `##` and `%str` together, one declarative site per symbol: - -``` -%macro defsym(name) -:str_##name %str(name) -DEFINE str_##name##_LEN %((strlen %str(name))) -:sym_##name %0 %0 -%endm - -%defsym(quote) -%defsym(if) -%defsym(begin) -… -``` - -- `##name` builds the label identifiers (`str_quote`, `sym_quote`). -- `%str(name)` builds the string literal (`"quote"`). -- `(strlen %str(name))` computes the length for the DEFINE. 
-- One source of truth per symbol — the identifier itself. - -Without `%str`, callers would have to pass the string explicitly -(`%defsym(quote, "quote")`). That works today with zero m1pp -changes but invites drift between the identifier and its -spelled-out string form — nothing at compile time flags a typo -where the two disagree. - -### Why a builtin, not a `#x` sigil - -cpp uses `#x` inside macro bodies to stringify a parameter. That -shape doesn't port cleanly to m1pp because `#` is already the -line-comment starter (`m1pp.c:278`). Giving `#` dual duty would -create parse ambiguity in `lex_source`. - -`%str(x)` reuses the existing builtin-dispatch plumbing — the same -path that handles `! / @ / % / $ / %select` — and reads uniformly -with the other text and numeric builtins. - -### Tokenizer - -No changes. Existing TOK_STRING machinery handles the output; -`%str` is a word token recognized as a builtin in `process_tokens`. - -### m1pp.c touchpoints - -- `process_tokens` (`m1pp.c:1211`): extend the builtin-dispatch - guard to accept `%str` alongside `! @ % $ %select`. -- `expand_builtin_call` (`m1pp.c:1092`): add a branch for `%str`. - Arg-count check: exactly 1. Arg-shape check: exactly one token, - kind TOK_WORD. Anything else errors. -- Stringification body: compute `out_len = arg.text.len + 2`, - reserve that many bytes via `append_text_len`, write `"`, - the original bytes, `"`. Push a TOK_STRING pointing at the new - span. -- No `snprintf` — plain byte copies, straightforward port. -- No new token kinds, no new Macro fields. - -### What this does not give you - -- **Stringification of non-parameter tokens.** Only single-token - WORD args. `%str(foo bar)` or `%str("already a string")` both - error. Wider forms are cpp-ish; defer until a real use case - appears. -- **Escape processing inside the stringified text.** The input is - a bare identifier — no quotes, backslashes, or whitespace to - escape. 
If `%str` is ever extended to take broader token spans, - escape handling becomes relevant then. - -## Per-feature implementation sequence - -Each of the three features lands in the same three ordered steps. Do -not skip or reorder — the tests exist to pin behavior before the -port, and the port exists because the C expander is disposable. - -1. **Implement in `m1pp/m1pp.c`.** The C expander is the oracle. Land - the feature here first so there is something to diff against. -2. **Add a test in `tests/m1pp/`.** New `NN-name.M1pp` + - `NN-name.expected` pair following the existing numbering (see - `tests/m1pp/` — current fixtures run 00 through 10), **or** extend - an existing fixture when the feature is a natural addition to one - (e.g. `strlen` goes into `04-expr-ops.M1pp` alongside the other - expression ops rather than getting its own file). For malformed- - input features, the expected artifact is a non-zero exit; document - that in the fixture. -3. **Add to `m1pp/m1pp.M1`.** Port the feature to the pure-P1 - implementation of m1pp so the seed bootstrap doesn't depend on the - host C expander. The test from step 2 runs against both `m1pp` (C) - and `m1m` (P1) and must produce byte-identical output; that parity - is what `docs/M1M-P1-PORT.md` calls "C-oracle comparison." - -Shipping a feature means all three steps are done. A half-landed -feature (C only, or C + test but no port) blocks the next feature in -the sequencing list below. - -## Cross-feature sequencing - -1. **Local labels.** Smallest patch, immediately useful — enables - straight-line macros like `%case_tag` and `%tag_dispatch` that - want one or two internal labels without hand-naming. -2. **Braced args.** Unlocks structured `%if_eq_else` / `%while_nez` - that carry instruction bodies. Depends on (1) in practice — the - bodies reference labels defined in the surrounding macro. -3. **`strlen`.** Independent of the other two. Land when the first - `%defstr` call site shows up. -4. 
**Paren-less 0-arg macro calls.** Independent small relaxation of - two guards (one in `process_tokens`, one in `eval_expr_atom`). - Useful on its own for constants-as-macros; load-bearing for (5). -5. **`%struct`.** Depends on (2) for the brace token kinds and (4) - for paren-less access syntax. Land only after both. -6. **`%enum`.** Same dependencies as (5). Share the - directive-handler implementation with `%struct` — land together - or back-to-back. -7. **`%str`.** Independent of everything else. Pairs naturally with - (3) `strlen` in the `%defsym` pattern but has no build-order - dependency on it. Land when the first `%defsym`-style - declarative macro shows up. - -Each is a self-contained patch. No cross-dependencies beyond the -sequencing above and the three-step rule per feature. - -## Per-feature acceptance fixtures - -- **Local labels:** two fixtures — a single macro using `:@end` and - calling itself twice in one function (must produce distinct labels), - and nested macros each using `:@done` (must not collide). Assemble - through M1 + hex2 clean on at least one arch. -- **Braced args:** fixture exercising a body with commas - (`st(r0, r3, 0)`), a body with nested braces, and a malformed - fixture (unmatched `{`) that exits non-zero. -- **`strlen`:** fixture with `DEFINE X_LEN %((strlen "hello"))` - followed by `li_r2 X_LEN` — binary must load the value 5 and - syscall-exit 5 on all three arches via the existing P1 differential - harness. -- **Paren-less 0-arg calls:** fixture with a 0-param macro invoked - both with and without trailing `()`, in top-level position and as - an atom inside `%(...)` expressions; all forms must produce - byte-identical output against a control fixture that always uses - `()`. 
-- **`%struct`:** fixture declaring a 4-field struct, accessing each - field via paren-less calls, and layering a `%frame` wrapper using - `%frame_hdr.SIZE` composition (per the doc example); build on all - three arches and exit with a sentinel computed from both the - struct-level `.SIZE` and the wrapped base offset, proving the - compose-and-add path resolves correctly. -- **`%enum`:** fixture declaring an enum with 3+ labels, referencing - each via paren-less call, and asserting `%NAME.COUNT` equals the - label count via a `%(=)` expression that feeds an exit code; - build on all three arches. Share fixture scaffolding with the - `%struct` test where practical. -- **`%str`:** two fixtures — (a) a macro using `%str(name)` in its - body, compared against a control that writes the literal - `"name"` string directly (byte-identical output); (b) combined - paste + stringify, `%macro defsym(n) :str_##n %str(n) %endm` - invoked with distinct identifiers, assembled through M1 + hex2, - each generated label must point at the correctly-spelled string - bytes. A third malformed fixture (`%str(a b)` or - `%str("already_string")`) must exit non-zero. diff --git a/docs/P1.md b/docs/P1.md @@ -1,522 +1,531 @@ -# P1: A Portable Pseudo-ISA for M1 - -## Motivation - -The stage0/live-bootstrap chain uses M1 (the mescc-tools macro assembler) as -the lowest human-writable layer above raw hex. M1 itself is architecture- -agnostic — it only knows `DEFINE name hex_bytes` — but every real M1 program -in stage0 (including the seed C compiler `cc_*.M1`) is hand-written per arch. -To write, say, a seed Lisp interpreter portably across amd64, aarch64, and -riscv64 without reaching for M2-Planet, we need a thin portable layer: a -pseudo-ISA whose mnemonics expand, per arch, to native encodings. - -P1 is that layer. 
The goal is an unoptimized RISC-shaped instruction set, -hand-writable in M1 source, that assembles to three host ISAs via per-arch -`DEFINE` tables on top of existing `M1` + `hex2` unchanged. - -## Non-goals - -- **Not an optimizing backend.** P1 is deliberately dumb. An `ADD rD, rA, rB` - on amd64 expands to `mov rD, rA; add rD, rB` unconditionally — no peephole - recognition of the `rD == rA` case. Paying ~2× code size is fine for a seed. -- **Not ABI-compatible with platform C.** P1 programs are sovereign: direct - Linux syscalls, no libc linkage. Interop thunks can be written later if - needed. -- **Not 32-bit.** x86-32, armv7l, riscv32 are out of scope for v1. Adding them - later means a separate defs file and some narrowing in the register model. -- **Not self-hosting.** P1 is a target for humans, not a compiler IR. If you - want a compiler, write it in subset-C and use M2-Planet. - -## Current status - -Three programs assemble unchanged across aarch64, amd64, and riscv64 -from the generator-produced `p1_<arch>.M1` defs: - * `hello.M1` — write/exit, prints "Hello, world!". - * `demo.M1` — exercises the full tranche 1–5 op set (arith/imm/LD/ST/ - branches/CALL/RET/PROLOGUE/EPILOGUE/TAIL); exits with code 5. - * `lisp.M1` — seed Lisp through step 2 of `LISP.md`: bump heap, - `cons`/`car`/`cdr`, tagged-value encoding. Exits with code 42 - (decoded fixnum from `car(cons(42, nil))`). - -All runs on stock stage0 `M0` + `hex2-0`, bootstrapped per-arch from -`hex0-seed` — no C compiler, no M2-Planet, no Mes. Run with -`make PROG=<hello|demo|lisp> run-all` from `lispcc/`. - -The DEFINE table is generator-driven (`p1_gen.py`); tranches 1–8 are -enumerated there, plus the full PROLOGUE_Nk family (k=1..4). Branch -offsets are realized by the LI_BR-indirect pattern -(`LI_BR &target ; BXX_rA_rB`), sidestepping the missing -branch-offset support in hex2. The branch-target scratch is a -reserved native reg (`x17`/`r11`/`t5`), not a P1 GPR. 
- -### Spike deviations from the design - -- Wide immediates use a per-`LI` inline literal slot (one PC-relative - load insn plus a 4-byte data slot, skipped past) rather than a shared - pool. Keeps the spike pool-free at the cost of one skip-branch per - `LI`. A pool can be reintroduced later without changes to P1 source. -- `LI` is 4-byte zero-extended today; 8-byte absolute is deferred until - a program needs it. All current references are to addresses under - 4 GiB, so `&label` + a 4-byte zero pad suffices. -- The per-tuple DEFINE table is generator-produced (see `p1_gen.py`) - from a shared op table across all three arches. The emitted set - covers tranches 1–8 plus the N-slot PROLOGUE/EPILOGUE/TAIL - variants. Adding a new tuple is a one-line append to `rows()` in - the generator; no hand-encoding. - -## Design decisions - -| Decision | Choice | Why | -|----------------|-----------------------------------------------|--------------------------------------------| -| Word size | 64-bit | All three target arches are 64-bit native | -| Endianness | Little-endian | All three agree | -| Registers | 8 GPRs (`r0`–`r7`) + `sp`, `lr`-on-stack | Fits x86-64's usable register budget | -| Narrow imm | Signed 12-bit | riscv I-type width; aarch64 ≤12 also OK | -| Wide imm | Pool-loaded via PC-relative `LI` | Avoids arch-specific immediate synthesis | -| Calling conv | r0 = return, r1–r3 = args (caller-saved), r4–r7 callee-saved | P1-defined; not platform ABI | -| Return address | Always spilled to stack on entry | Hides x86's missing `lr` uniformly | -| Syscall | `SYSCALL` with num in r0, args r1–r6; clobbers r0 only | Per-arch wrapper emits native sequence | -| Spill slot | `[sp + 8]` is callee-private scratch after `PROLOGUE` | Frame already 16 B for alignment; second cell was otherwise unused | - -## Register mapping - -`r0`–`r3` are caller-saved. `r4`–`r7` are callee-saved, general-purpose, -and preserved across `CALL`/`SYSCALL`. 
`sp` is special-purpose — see -`PROLOGUE` semantics. +# P1 v2 -| P1 | amd64 | aarch64 | riscv64 | -|------|-------|---------|---------| -| `r0` | `rax` | `x0` | `a0` | -| `r1` | `rdi` | `x1` | `a1` | -| `r2` | `rsi` | `x2` | `a2` | -| `r3` | `rdx` | `x3` | `a3` | -| `r4` | `r13` | `x26` | `s4` | -| `r5` | `r14` | `x27` | `s5` | -| `r6` | `rbx` | `x19` | `s1` | -| `r7` | `r12` | `x20` | `s2` | -| `sp` | `rsp` | `sp` | `sp` | -| `lr` | (mem) | `x30` | `ra` | - -`r4`–`r7` all map to native callee-saved regs on each arch, so the SysV -kernel+libc "callee preserves these" rule does the work for us across -syscalls without explicit save/restore in the `SYSCALL` expansion. - -x86-64 has no link register; `CALL`/`RET` macros push/pop the return address -on the stack. On aarch64/riscv64, the prologue spills `lr` (`x30`/`ra`) to -the stack too, so all three converge on "return address lives in -`[sp + 0]` after prologue." This uniformity is worth the extra store on the -register-rich arches. - -**Reserved scratch registers (not available to P1):** certain native -regs are used internally by op expansions and are never exposed as P1 -registers. Every P1 op writes only what its name says it writes — -reserved scratch is save/restored within the expansion so no hidden -clobbers leak across op boundaries. - -- **Branch-target scratch (all arches).** `B`/`BEQ`/`BNE`/`BLT`/`CALL`/ - `TAIL` jump through a dedicated native reg pre-loaded via `LI_BR`: - `x17` (ARM IP1) on aarch64, `r11` on amd64, `t5` on riscv64. The reg - is caller-saved natively and never carries a live P1 value past the - following branch. Treat it as existing only between the `LI_BR` that - loads a target and the branch that consumes it. -- **aarch64** — `x21`–`x23` hold `r1`–`r3` across the `SYSCALL` arg - shuffle (`r4`/`r5` live in callee-saved `x26`/`x27` so the kernel - preserves them for us). `x16` (ARM IP0) is scratch for `REM` - (carries the `SDIV` quotient into the following `MSUB`). 
`x8` holds - the syscall number. -- **amd64** — `rcx` and the branch-target `r11` are kernel-clobbered by - the `syscall` instruction itself. `PROLOGUE`/`EPILOGUE` use `rcx` to - carry the retaddr across the `sub rsp, N` (can't use `r11` here — it - is the branch-target reg, and `TAIL` = `EPILOGUE` + `jmp r11`). - `DIV`/`REM` use `rcx` (to save `rdx` = P1 `r3`) and `r11` (to save - `rax` = P1 `r0`) so that `idiv`'s implicit writes to rax/rdx stay - invisible; the `r11` save is fine because no branch op can interrupt - the DIV/REM expansion. -- **riscv64** — `s3`,`s6`,`s7` hold `r1`–`r3` across the `SYSCALL` arg - shuffle (`r4`/`r5` live in callee-saved `s4`/`s5`, same trick as - aarch64). `a7` holds the syscall number. - -All of these are off-limits to hand-written P1 programs and are never -mentioned in P1 source. If you see a register name not in the r0–r7 / -sp / lr set, it belongs to an op's internal expansion. - -## Reading P1 source - -P1 has no PC-relative branch immediates (hex2 offers no label-arithmetic -sigil — branch ranges can't be expressed in hex2 source). Every branch, -conditional or not, compiles through the **LI_BR-indirect** pattern: the -caller loads the target into the dedicated branch-target scratch reg -with `LI_BR`, then the branch op jumps through it. A conditional like -"jump to `fail` if `r1 != r2`" is three source lines: +## Scope + +P1 v2 is a portable pseudo-ISA for standalone executables. + +P1 v2 has two width variants: + +- **P1v2-64** — one word is one 64-bit integer or pointer value +- **P1v2-32** — one word is one 32-bit integer or pointer value + +Portable source may use any number of word arguments. The first four argument +registers are explicit, and additional argument words are passed through a +portable incoming stack-argument area. + +Portable source may directly return `0..1` word. Wider results use the +portable indirect-result convention described below. 
+ +## Toolchain envelope + +P1 v2 must be assemblable through the existing `M0` + `hex2` path, with +`catm` as the only composition primitive between source or generated fragments. +The spec therefore assumes only the following toolchain features: + +- `M0`-level `DEFINE name hex_bytes` substitution +- raw byte emission +- labels and label references supported by `hex2` +- file concatenation via `catm` + +## Source notation + +This document describes instructions using ordinary assembly notation such as +`ADD rd, ra, rb`, `LD rd, [ra + off]`, or `CALL`. + +Because of the toolchain constraints above, portable source does not encode +most operands as textual instruction arguments. Instead, register choices, +inline immediate values, and small fixed parameters are fused into opcode +names, following the generated-table style used by `src/p1_gen.py`. + +So the notation in this document is descriptive rather than literal: + +- `ADD rd, ra, rb` means a family of fused register-specific opcodes +- `ADDI rd, ra, imm` means a family of fused register-and-immediate-specific + opcodes +- `ENTER size` means a family of fused byte-count-specific opcodes +- `LDARG rd, idx` means a family of fused register-and-argument-slot-specific + opcodes +- `BR rs`, `CALLR rs`, and `TAILR rs` mean register-specific control-flow + opcodes +- `LEAVE`, `CALL`, `RET`, `TAIL`, `B`, and `SYSCALL` remain operand-free + +Labels still appear in source where the toolchain supports them directly, such +as `LA rd, %label` and `LA_BR %label`. + +## Register Model + +### Exposed registers + +P1 v2 exposes the following source-level registers: + +- `a0`–`a3` — argument registers. Also caller-saved general registers. +- `t0`–`t2` — caller-saved temporaries. +- `s0`–`s3` — callee-saved general registers. +- `sp` — stack pointer. 
+ +### Hidden registers + +The backend may reserve additional native registers that are never visible in +P1 source: + +- `br` — branch / call target mechanism, implemented as a dedicated hidden + native register on every target +- backend-local scratch used entirely within one instruction expansion + +No hidden register may carry a live P1 value across an instruction boundary. + +## Calling Convention + +### Arguments and return values + +P1 v2 defines three result conventions: one-word direct, two-word direct, and +indirect. + +In the one-word direct-result convention: + +- Explicit argument words 0-3 live in `a0-a3`. +- Additional explicit argument words live in the incoming stack-argument area + and are read with `LDARG`. +- On return, a one-word result lives in `a0`. + +In the two-word direct-result convention: + +- Explicit argument words 0-3 live in `a0-a3` on entry. +- Additional explicit argument words still live in the incoming + stack-argument area. +- On return, `a0` holds result word 0 and `a1` holds result word 1. + +In the indirect-result convention: + +- The caller passes a writable result buffer pointer in `a0`. +- Explicit argument words 0-2 then live in `a1-a3`. +- Additional explicit argument words still live in the incoming + stack-argument area. +- On return, `a0` holds the same result buffer pointer value. + +In both direct-result conventions, incoming stack-argument slot `0` corresponds +to explicit argument word `4`. In the indirect-result convention, incoming +stack-argument slot `0` corresponds to explicit argument word `3`. + +The two-word direct-result convention covers common cases such as 64-bit +integer results on 32-bit targets, two-word aggregates, and divmod-style +returns. The indirect-result convention is the portable way to return any +result wider than two words. 
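To make the slot numbering in the three result conventions concrete, here is a small Python sketch that maps an explicit argument word index to its location at callee entry. The function name `arg_location` and the convention labels `"direct1"`, `"direct2"`, and `"indirect"` are hypothetical names chosen for illustration; the spec itself only describes the conventions in prose.

```python
def arg_location(index, convention):
    """Return ('reg', name) or ('stack', slot) for explicit argument word `index`.

    Assumed labels: 'direct1' and 'direct2' are the one-word and two-word
    direct-result conventions; 'indirect' is the indirect-result convention.
    """
    if index < 0:
        raise ValueError("argument index must be non-negative")
    if convention in ("direct1", "direct2"):
        first_reg, reg_count = 0, 4  # words 0-3 live in a0-a3
    elif convention == "indirect":
        first_reg, reg_count = 1, 3  # a0 holds the result buffer; words 0-2 in a1-a3
    else:
        raise ValueError(f"unknown convention: {convention}")
    if index < reg_count:
        return ("reg", f"a{first_reg + index}")
    # Remaining words are read with LDARG from word-indexed stack slots.
    return ("stack", index - reg_count)
```

For example, explicit argument word 4 is stack slot 0 under either direct convention, while under the indirect convention word 3 is already stack slot 0 because `a0` is taken by the result buffer pointer.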
+ +### Register preservation + +Caller-saved: + +- `a0`–`a3` +- `t0`–`t2` + +Callee-saved: + +- `s0`–`s3` +- `sp` + +### Call semantics + +A call is valid from any function, including a leaf. Call / return correctness +does not depend on establishing a frame first. + +If a function needs any incoming argument after making a call, it must save it +before the call. This matters in particular for `a0`, which is overwritten by +every convention's return value, and for `a1` when the callee uses the two-word +direct-result convention. + +A call that passes any stack argument words requires the caller to have an +active standard frame with enough frame-local storage to stage those outgoing +words. + +The return address is hidden machine state. Portable source must not assume +that it lives in any exposed register. + +## Stack Convention + +### Call-boundary rule + +At every call boundary, the backend must satisfy the native C ABI stack +alignment rule for the target architecture. + +Portable source must therefore treat raw function-entry `sp` as opaque. It may +not assume that the low bits of `sp` have the same meaning on all targets +before a frame is established. + +### Incoming stack-argument area + +P1 v2 defines an abstract incoming stack-argument area for explicit argument +words that do not fit in registers. + +- Slot `0` is the first stack-passed explicit argument word. +- Slots are word-indexed, not byte-indexed. +- Portable source may access this area only through `LDARG`. + +`LDARG` is valid only when the current function has an active standard frame. +Therefore, a function that needs any incoming stack argument must establish a +standard frame before its first `LDARG`. + +Portable source must not assume any direct relationship between incoming +argument slots and raw function-entry `sp`. In particular, source must not try +to reconstruct stack arguments by manually indexing from `sp`; backend entry +layouts differ across targets. 
+ +For a call with `m` stack-passed explicit argument words, the caller stages +those words in the first `m` words of its frame-local storage immediately +before the call: ``` -P1_LI_BR -&fail -P1_BNE_R1_R2 +[sp + 2*WORD + 0*WORD] = outgoing arg word 0 +[sp + 2*WORD + 1*WORD] = outgoing arg word 1 +... ``` -`LI_BR` writes a reserved native reg (`x17`/`r11`/`t5` — see Register -mapping), not a P1 GPR. The branch op that follows consumes it and -jumps. `CALL` and `TAIL` follow the same shape -(`LI_BR &callee ; P1_CALL`). +At callee entry, those staged words become incoming argument slots `0..m-1`. +The backend is responsible for mapping between the caller's frame layout and +the callee's abstract incoming argument slots. -The branch-target reg is owned by the branch machinery: never carry a -live value across a branch in it. Since it isn't a P1 reg, this is -automatic — there's no P1-level way to read or write it outside -`LI_BR`. +Portable code that needs both ordinary locals and stack-passed outgoing +arguments must reserve enough total frame-local storage and keep the low- +addressed prefix available for outgoing argument staging across the call. -## Instruction set (~30 ops) +### Standard frame layout + +Functions that need local stack storage use a standard frame layout. 
After +frame establishment: ``` -# 3-operand arithmetic (reg forms) -ADD rD, rA, rB SUB rD, rA, rB -AND rD, rA, rB OR rD, rA, rB XOR rD, rA, rB -SHL rD, rA, rB SHR rD, rA, rB SAR rD, rA, rB -MUL rD, rA, rB DIV rD, rA, rB REM rD, rA, rB - -# Immediate forms (signed 12-bit) -ADDI rD, rA, !imm ANDI rD, rA, !imm ORI rD, rA, !imm -SHLI rD, rA, !imm SHRI rD, rA, !imm SARI rD, rA, !imm - -# Moves -MOV rD, rA # reg-to-reg (rA may be sp) -LI rD, %label # load 64-bit literal from pool -LA rD, %label # load PC-relative address - -# Memory (offset is signed 12-bit) -LD rD, rA, !off ST rS, rA, !off # 64-bit -LB rD, rA, !off SB rS, rA, !off # 8-bit zero-extended / truncated - -# Control flow -B %label # unconditional branch -BEQ rA, rB, %label BNE rA, rB, %label -BLT rA, rB, %label # signed less-than -CALL %label RET -PROLOGUE EPILOGUE # frame setup / teardown (see Semantics) -TAIL %label # tail call: epilogue + B %label - -# System -SYSCALL # num in r0, args r1-r6, ret in r0 +[sp + 0*WORD] = saved return address +[sp + 1*WORD] = saved caller stack pointer +[sp + 2*WORD ... sp + 2*WORD + local_bytes - 1] = frame-local storage +... ``` -### Semantics - -- All arithmetic is on 64-bit values. `SHL`/`SHR`/`SAR` take shift amount in - the low 6 bits of `rB` (or the `!imm` for immediate forms). -- `DIV` is signed, truncated toward zero. `REM` matches `DIV`. -- `LB` zero-extends the loaded value into the 64-bit destination. - (A signed-extending variant `LBS` can be added later if needed. 32-bit - `LW`/`SW` are deliberately omitted — emulate with `LD`+`ANDI`/shift and - `ST` through a 64-bit scratch when needed.) -- Unsigned comparisons (`BLTU`/`BGEU`) are not in the ISA: seed programs - with tagged-cell pointers only need signed comparisons. Synthesize from - `BLT` via operand-bias if unsigned compare is ever required. -- `BGE rA, rB, %L` is not in the ISA: synthesize as - `BLT rA, rB, %skip; B %L; :skip` (the LI_BR-indirect branch pattern - makes the skip cheap). 
`BLT rB, rA, %L` handles the strict-greater - case. -- Branch offsets are PC-relative. In the v0.1 spike they are realized by - loading the target address via `LI_BR` into the reserved branch-target - reg and jumping through it; range is therefore unbounded within the - 4 GiB address space. Native-encoded branches (with tighter range - limits) are an optional future optimization. -- `MOV rD, rA` copies `rA` into `rD`. The source may be `sp` (read the - current stack pointer into a GPR — used e.g. for stack-balance assertions - around a call tree). The reverse (`MOV sp, rA`) is not provided; `sp` - is only mutated by `PROLOGUE`/`EPILOGUE`. -- `CALL %label` transfers control to `%label` with a return address - established such that a subsequent `RET` returns to the instruction - after the `CALL`. The storage location of that return address is - implementation-defined (stack on amd64, link register on - aarch64/riscv64) and **must be treated as volatile across any inner - `CALL`**. - - Concrete rule: **a function that itself executes a `CALL` must wrap - its body in a matching `PROLOGUE`/`EPILOGUE` pair.** `PROLOGUE` is - what spills the incoming return address into the frame; `EPILOGUE` - restores it so `RET` can find it. - - Leaf functions (no `PROLOGUE`) are permitted and may be called - normally: `CALL leaf` sets up the return address, the leaf's `RET` - uses it, control returns to the caller. The restriction is only on - what a leaf may itself do: - - - **RET** — returns to whoever established the current return - address. Usually the direct `CALL`er; in the tail-branch case - below, whoever `CALL`ed the outermost caller in the chain. - - **Tail-branch** (`li_br &target ; B`) to another function — the - target's own `PROLOGUE`/`EPILOGUE` preserves the current return - address across the target's body, so the target's `RET` returns - directly to the leaf's caller, skipping the leaf in the return - chain. - - **`CALL`** — forbidden. 
The inner `CALL` clobbers the return - address slot (on arches where it's a register, not a stack - push), so the leaf's subsequent `RET` branches to itself. - - The failure mode of a leaf `CALL` is platform-asymmetric: amd64's - native `CALL` pushes onto the stack so a prologue-less `CALL ; RET` - happens to work; aarch64 and riscv64 write the return address to a - link register and hang silently. Don't write code that relies on - the amd64-happens-to-work behavior. - - `RET` pops / branches through the return address. -- `PROLOGUE` / `EPILOGUE` set up and tear down a frame with **k - callee-private scratch slots**. `PROLOGUE` is shorthand for - `PROLOGUE_N1` (one slot); `PROLOGUE_Nk` for k = 2, 3, 4 reserves that - many slots. After `PROLOGUE_Nk`: - - ``` - [sp + 0] = caller's return address - [sp + 8] = slot 1 (callee-private scratch) - [sp + 16] = slot 2 (k >= 2) - [sp + 24] = slot 3 (k >= 3) - [sp + 32] = slot 4 (k >= 4) - ``` - - Each slot is private to the current frame: a nested `PROLOGUE` - allocates its own slots, so the parent's spills survive unchanged. - Frame size is `round_up_16(8 + 8*k)`, so k=1→16, k=2→32 (with 8 - bytes of padding past slot 2), k=3→32, k=4→48. `EPILOGUE_Nk` / - `TAIL_Nk` must match the `PROLOGUE_Nk` of the enclosing function. - - Why multiple slots: constructors like `cons(car, cdr)` keep several - live values across an inner `alloc()` call. One scratch cell isn't - enough, and parking overflow in BSS would break the step-9 mark-sweep - GC (which walks the stack for roots). Per-frame slots keep every live - value on the walkable stack. - - Per-arch mechanics differ — aarch64/riscv64 `PROLOGUE` subtracts the - frame size from `sp` and stores `lr`/`ra` at `[sp + 0]`; amd64 pops - the retaddr native `call` already pushed into a non-P1 scratch - (`rcx`), subtracts the frame size, then re-pushes it so the final - layout matches. 
(`rcx` rather than `r11`, because `r11` is the - branch-target reg and `TAIL` would otherwise clobber its own - destination mid-epilogue.) Access slots via `MOV rX, sp` followed by - `LD rY, rX, <off>` / `ST rY, rX, <off>`; `sp` itself isn't a valid - base for `LD`/`ST`. -- `TAIL %label` is a tail call — it performs the current function's - standard epilogue (restore `lr` from `[sp+0]`, pop the frame) and then - branches unconditionally to `%label`, reusing the caller's return - address instead of pushing a new frame. The current function must be - using the standard prologue. Interpreter `eval` loops rely on `TAIL` - to recurse on sub-expressions without growing the stack. -- `SYSCALL` is a single opcode in P1 source. Each arch's defs file expands it - to the native syscall sequence, including the register shuffle from P1's - `r0`=num, `r1`–`r6`=args convention into the platform's native convention - if different. - -## Encoding strategy - -For each `(op, register-tuple)` combination, emit one `DEFINE` per arch. A -generator script produces the full defs file; no hand-encoding per entry. - -Example — `ADD r0, r1, r2`: +Frame-local storage is byte-addressed. Portable code may use it for ordinary +locals, spilled callee-saved registers, and the caller-staged outgoing +stack-argument words described above. -``` -# p1_riscv64.M1 -DEFINE P1_ADD_R0_R1_R2 33056000 # add a0, a1, a2 (little-endian) +Total frame size is: -# p1_aarch64.M1 -DEFINE P1_ADD_R0_R1_R2 2000028B # add x0, x1, x2 +`round_up(STACK_ALIGN, 2*WORD_SIZE + local_bytes)` -# p1_amd64.M1 (2-op destructive — expands to mov + add) -DEFINE P1_ADD_R0_R1_R2 4889F84801F0 # mov rax, rdi ; add rax, rsi -``` +Where: -### Combinatorial footprint +- `WORD_SIZE = 8` in P1v2-64 +- `WORD_SIZE = 4` in P1v2-32 +- `STACK_ALIGN` is target-defined and must satisfy the native call ABI -Per-arch defs count (immediates handled by sigil, not enumerated): +Leaf functions that need no frame-local storage may omit the frame entirely. 
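As a rough sketch of the staging rule for outgoing stack arguments, the helpers below compute where outgoing argument word `i` sits relative to `sp` and whether `m` staged words fit in the frame-local storage. The function names are hypothetical and exist only for this illustration.

```python
# Illustrative helpers for outgoing stack-argument staging.
# Function names are hypothetical; the offsets follow the standard frame
# layout, where frame-local storage begins at sp + 2*WORD.

def outgoing_arg_offset(i, word_size):
    """Byte offset from sp of outgoing stack-argument word i."""
    if i < 0:
        raise ValueError("argument word index must be non-negative")
    return 2 * word_size + i * word_size


def can_stage(m, local_bytes, word_size):
    """True if m outgoing words fit in the low prefix of frame-local storage."""
    return m * word_size <= local_bytes
```

A function that also keeps ordinary locals would place them above the staged prefix, so the first `m` words stay free across the call.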
-- 11 reg-reg-reg arith × 8 `rD` × 8 `rA` × 8 `rB` = 704. Pruned to ~600 by - removing trivially-equivalent tuples. -- 6 immediate arith × 8² = 384. Each entry uses an immediate sigil (`!imm`), - so the immediate value itself is not enumerated. -- 3 move ops × 8 or 8² (plus +8 for the `MOV rD, sp` variant) = ~88. -- 4 memory ops × 8² = 256. Offsets use `!imm` sigil. -- 3 conditional branches × 8² = 192. -- Singletons (`B`, `CALL`, `RET`, `PROLOGUE`, `EPILOGUE`, `TAIL`, `SYSCALL`) = 7. +### Frame invariants -Total ≈ 1210 defines per arch. Template-generated. +- A function that allocates a frame must restore `sp` before returning. +- Callee-saved registers modified by the function must be restored before + returning. +- The standard frame layout is the only frame shape recognized by P1 v2. -## Syscall conventions +## Op Set Summary -Linux syscall mechanics differ across arches. The `SYSCALL` macro hides this. +| Category | Operations | +|----------|------------| +| Materialization | `LI rd, imm`, `LA rd, %label`, `LA_BR %label` | +| Moves | `MOV rd, rs`, `MOV rd, sp` | +| Arithmetic | `ADD`, `SUB`, `AND`, `OR`, `XOR`, `SHL`, `SHR`, `SAR`, `MUL`, `DIV`, `REM` | +| Immediate arithmetic | `ADDI`, `ANDI`, `ORI`, `SHLI`, `SHRI`, `SARI` | +| Memory | `LD`, `ST`, `LB`, `SB` | +| ABI access | `LDARG` | +| Branching | `B`, `BR`, `BEQ`, `BNE`, `BLT`, `BLTU`, `BEQZ`, `BNEZ`, `BLTZ` | +| Calls / returns | `CALL`, `CALLR`, `RET`, `TAIL`, `TAILR` | +| Frame management | `ENTER`, `LEAVE` | +| System | `SYSCALL` | -| Arch | Insn | Num reg | Arg regs (plat ABI) | -|----------|-----------|---------|------------------------------| -| amd64 | `syscall` | `rax` | `rdi, rsi, rdx, r10, r8, r9` | -| aarch64 | `svc #0` | `x8` | `x0 – x5` | -| riscv64 | `ecall` | `a7` | `a0 – a5` | +## Immediates -**Observable semantics:** `SYSCALL` takes the number in `r0` and args in -`r1`–`r6`, traps, and returns the kernel's result in `r0`. 
**Only `r0` is -clobbered.** `r1`–`r7` are preserved across `SYSCALL` on every arch. This -matches the kernel's own register discipline and lets callers thread live -values through syscalls without per-arch save/restore dances. +Immediate operands appear only in instructions that explicitly admit them. +Portable source has three immediate classes: -The per-arch expansions: +- **Inline integer immediate** — a signed 12-bit assembly-time constant in the + range `-2048..2047` +- **Materialized word value** — a full one-word assembly-time constant loaded + with `LI` +- **Materialized address** — the address of a label loaded with `LA` -- **amd64** — P1 args already occupy the native arg regs except for args - 4/5/6. Three shuffle moves cover those: `mov r10, r13` (arg4 = P1 `r4`), - `mov r8, r14` (arg5 = P1 `r5`), `mov r9, rbx` (arg6 = P1 `r6`); then - `syscall`. The kernel preserves everything except `rax`, `rcx`, `r11`, - and `rax` = P1 `r0` is the only visible clobber. -- **aarch64** — native arg regs are `x0`–`x5` but P1 puts args in - `x1`–`x3`,`x26`,`x27`,`x19` (the three caller-saved arg regs one slot - higher, plus three callee-saved for `r4`–`r6`). The expansion saves - P1 `r1`–`r3` into `x21`–`x23`, shuffles them and `r4`/`r5`/`r6` down - into `x0`–`x5`, moves the number into `x8`, `svc #0`s, then restores - `r1`–`r3` from `x21`–`x23`. No save/restore of `r4`/`r5` is needed - because they live in callee-saved natives that the kernel preserves. -- **riscv64** — same shape as aarch64, with `s3`/`s6`/`s7` as the `r1`– - `r3` save slots, `s4`/`s5` already holding `r4`/`r5`, and `a7` as the - number register. +P1 v2 also uses two structured assembly-time operands: -The extra moves on aarch64/riscv64 are a few nanoseconds per syscall. -Trading them for uniform "clobbers `r0` only" semantics is worth it: -callers don't need to memorize a per-arch clobber set. 
+- **Frame-local byte count** — a non-negative byte count used by `ENTER` +- **Argument-slot index** — a non-negative word-slot index used by `LDARG` -### Syscall numbers +`LI rd, imm` loads the one-word integer value `imm`. -Linux uses two syscall tables relevant here: +`LA rd, %label` loads the address of `%label` as a one-word pointer value. -- **amd64**: amd64-specific table (`write = 1`, `exit = 60`, …). -- **aarch64 and riscv64**: generic table (`write = 64`, `exit = 93`, …). +The backend may realize `LI` and `LA` using native immediates, literal pools, +multi-instruction sequences, or other backend-private mechanisms. -P1 programs use symbolic constants (`SYS_WRITE`, `SYS_EXIT`) defined per-arch: +Backends may assume labels fit in 32 bits when realizing `LA` and `LA_BR`. +This reflects the stage0 image layout (`hex2-0` base `0x00600000`, programs +well under 4 GB), not a portable-ISA-level guarantee. Backends that target +images loaded above the 4 GB boundary must adjust their `LA` / `LA_BR` +lowering. `LI` makes no such assumption — it materializes any one-word value. -``` -# p1_amd64.M1 -DEFINE SYS_WRITE 01000000 -DEFINE SYS_EXIT 3C000000 +## Control Flow -# p1_aarch64.M1 and p1_riscv64.M1 -DEFINE SYS_WRITE 40000000 -DEFINE SYS_EXIT 5D000000 -``` +### Call / Return / Tail Call -(The encodings shown are placeholder little-endian 32-bit immediates; real -values are inlined as operands to `LI` or `ADDI`.) +Control-flow targets are materialized with `LA_BR %label`, which loads +`%label` into the hidden branch-target mechanism `br`. The immediately +following control-flow op consumes that target. -## Program layout +`CALL` transfers control to the target most recently loaded by `LA_BR` and +establishes a return continuation such that a subsequent `RET` returns to the +instruction after the `CALL`. 
`CALL` is valid whether or not the caller has +established a standard frame, except that any call using stack-passed argument +words requires an active standard frame to hold the staged outgoing words. -Each P1 object file is structured as: +`CALLR rs` is the register-indirect form of `CALL`. It transfers control to +the code pointer value held in `rs` and establishes the same return +continuation semantics as `CALL`. -``` -<ELF header, per arch> -<code section> - <function prologues, bodies, epilogues> -<constant pool> - pool_label_1: &0xDEADBEEFCAFEBABE - pool_label_2: &0x00000000004004C0 - ... -<data section> - <static bytes> -``` +`RET` returns through the current return continuation. `RET` is valid whether +or not the current function has established a standard frame, provided any +frame established by the function has already been torn down. -`LI rD, %pool_label_N` issues a PC-relative load; the pool must be reachable -within the relocation's range (≤±1 MiB for aarch64 `LDR` literal, ≤±2 GiB for -riscv `AUIPC`+`LD`, unlimited for x86 `mov rD, [rip + rel32]` within 2 GiB). +`TAIL` is a tail call to the target most recently loaded by `LA_BR`. It is +valid only when the current function has an active standard frame. `TAIL` +performs the standard epilogue for the current frame and then transfers control +to the loaded target without creating a new return continuation. The callee +therefore returns directly to the current function's caller. -For programs under a few MiB, a single pool per file is fine. For larger -programs, emit a pool per function. +`TAILR rs` is the register-indirect form of `TAIL`. It is valid only when the +current function has an active standard frame. -## Data alignment +Because stack-passed outgoing argument words are staged in the caller's own +frame-local storage, `TAIL` and `TAILR` are portable only when the tail-called +callee requires no stack-passed argument words. 
Portable compilers must lower +other tail-call cases to an ordinary `CALL` / `RET` sequence. -**Labels have no inherent alignment.** A label's runtime address is -`ELF_base + (cumulative bytes emitted before the label)`. Neither M1 nor -hex2 offers an `.align` directive or any other alignment control — the -existing hex2 sigils (`: ! @ $ ~ % &` and the `>` base override) cover -labels and references, not padding. And because the cumulative byte count -between the ELF header and any label varies per arch (different SYSCALL -expansions, different branch encodings, different PROLOGUE sizes), the -same label lands at a different low-3-bits offset on each target. +Portable source must treat the return continuation as hidden machine state. It +must not assume that the return address lives in any exposed register or stack +location except as defined by the standard frame layout after frame +establishment. -Concretely: `heap_start` in a program that builds identically for all -three arches can land at `0x...560` (aligned) on aarch64, `0x...2CB` -(misaligned) on amd64, and `0x...604` (misaligned) on riscv64. If the -program then tags pair pointers by ORing bits into the low 3, the tag -collides with pointer bits on the misaligned arches and every pair is -corrupt. +### Prologue / Epilogue -Programs that care about alignment therefore align **at boot, in code**: +P1 v2 defines the following frame-establishment and frame-teardown operations: + +- `ENTER size` +- `LEAVE` + +`ENTER size` establishes the standard frame layout with `size` bytes of +frame-local storage: ``` -P1_LI_R4 -&heap_next -P1_LD_R0_R4_0 -P1_ORI_R0_R0_7 ## x |= 7 -P1_ADDI_R0_R0_1 ## x += 1 → x rounded up to next 8-aligned -P1_ST_R0_R4_0 +[sp + 0*WORD] = saved return address +[sp + 1*WORD] = saved caller stack pointer +[sp + 2*WORD ... sp + 2*WORD + size - 1] = frame-local storage ``` -The `(x | mask) + 1` idiom rounds any pointer up to `mask + 1`. 
Use -`mask = 7` for 8-byte alignment (tagged pointers with a 3-bit tag), -`mask = 15` for 16-byte alignment (cache lines, `malloc`-style). - -**Allocator contract.** Any allocator that returns cells eligible to be -tagged (pair, closure, vector, …) MUST return pointers aligned to at -least the tag width. The low tag bits are architecturally unowned by -the allocator — they belong to the caller to stamp a tag into. - -**Caller contract.** Callers of bump-style allocators must pass sizes -that are multiples of the alignment. For the step-2 bump allocator -that's 8-byte multiples; the caller rounds up. A mature allocator -(step 9 onward) rounds internally, but the current one trusts the -caller. - -## Staged implementation plan - -1. **Spike across all three arches.** *Done.* `lispcc/hello.M1` and - `lispcc/demo.M1` run on aarch64, amd64, and riscv64 via existing - `M1` + `hex2_linker` (amd64, aarch64) / `hex2_word` (riscv64). Ops - demonstrated: `LI`, `SYSCALL`, `MOV`, `ADD`, `SUB`. The aarch64 - `hex2_word` extensions in the work list above were *not* needed — - the inline-data `LI` trick sidesteps them. Order was reversed from - the original plan: aarch64 first (where the trick was designed), - then amd64 and riscv64. -2. **Broaden the demonstrated op set.** *Done.* `demo.M1` exercises - control flow (`B`, `BEQ`, `BNE`, `BLT`, `CALL`, `RET`, `TAIL`), - loads/stores (`LD`/`ST`/`LB`/`SB`), and the full - arithmetic/logical/shift/mul-div set across tranches 1–5. All - reachable with stock hex2; no extensions required. -3. **Generator for the ~30-op × register matrix.** *Done.* - `p1_gen.py` is the single source of truth for all three - `p1_<arch>.M1` defs files. Each row is an `(op, reg-tuple, imm)` - triple; per-arch encoders lower rows to native bytes. Includes the - N-slot `PROLOGUE_Nk` / `EPILOGUE_Nk` / `TAIL_Nk` variants (k=1..4). - Regenerate with `make gen`; CI-check freshness with `make check-gen`. -4. 
**Cross-arch differential harness.** Assemble each P1 source three - ways and diff runtime behavior. Currently eyeballed via - `make run-all`. -5. **Write something real.** *In progress.* `lisp.M1` is the seed Lisp - interpreter target (cons, car, cdr, eq, atom, cond, lambda, quote) - running identically on all three arches. Step 2 (cons/car/cdr + - tagged values) landed; the remaining staged steps live in - `LISP.md`. - -## Open questions - -- **Can we reuse hand-written `SYSCALL`/syscall-number conventions already in - stage0's arch ports?** Probably yes — adopt the conventions already in - `M2libc/<arch>/` to minimize surprise. -- **Signed-extending loads.** Skipped for v1 — add `LBS`, `LWS` if the Lisp - interpreter needs them. -- **Atomic / multi-core.** Not in scope. Seed interpreters are single- - threaded. -- **Debug info.** `blood-elf` generates M1-format debug tables; we'd need to - decide whether P1 flows through it unchanged. Likely yes since P1 is just - another M1 source. -- **x86-32 / armv7l / riscv32 support.** Requires narrowing the register - model and splitting word size. Defer. +The total allocation size is: -## Scope +`round_up(STACK_ALIGN, 2*WORD_SIZE + size)` + +The named frame-local bytes are the usable local storage. Any additional bytes +introduced by alignment rounding are padding, not extra local bytes. + +`LEAVE` tears down the current standard frame and restores the hidden return +continuation so that a subsequent `RET` returns correctly. + +Because every standard frame stores the saved caller stack pointer at +`[sp + 1*WORD]`, `LEAVE` does not need to know the frame-local byte count used +by the corresponding `ENTER`. + +A function may omit `ENTER` / `LEAVE` entirely if it is a leaf and needs no +standard frame. + +`ENTER` and `LEAVE` do not implicitly save or restore `s0` or `s1`. A +function that modifies `s0` or `s1` must preserve them explicitly, typically by +storing them in frame-local storage within its standard frame. 
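The allocation formula for `ENTER size` can be worked through with a short sketch. `STACK_ALIGN` is target-defined; the value 16 used in the note below is an assumption for illustration, and `round_up` / `enter_allocation` are hypothetical helper names.

```python
# Illustrative calculation of the ENTER allocation size.
# round_up and enter_allocation are hypothetical helper names.

def round_up(align, n):
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align


def enter_allocation(size, word_size, stack_align):
    """Return (total_bytes, padding_bytes) for ENTER with `size` local bytes."""
    # Two header words: saved return address and saved caller stack pointer.
    used = 2 * word_size + size
    total = round_up(stack_align, used)
    return total, total - used
```

With `word_size = 8` and an assumed `stack_align = 16`, `enter_allocation(24, 8, 16)` gives a 48-byte allocation, of which 8 bytes are alignment padding rather than usable local storage.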
+ +### Branching + +P1 v2 branch targets are carried through the hidden branch-target mechanism +`br`. Portable source may load `br` only through: + +- `LA_BR %label` — materialize the address of `%label` as the next branch, call, + or tail-call target + +No branch, call, or tail opcode takes a label operand directly. Portable source +must treat `br` as owned by the control-flow machinery. No live value may be +carried in `br`. Each `LA_BR` must be consumed by the immediately following +branch, call, or tail op, and portable source must not rely on `br` surviving +across any other instruction. + +The portable branch families are: + +- `B` — unconditional branch to the target in `br` +- `BR rs` — unconditional branch to the code pointer in `rs` +- `BEQ`, `BNE`, `BLT`, `BLTU` — conditional branch to the target in `br` +- `BEQZ`, `BNEZ`, `BLTZ` — conditional branch to the target in `br` using zero + as the second operand + +`BLT` and `BLTZ` perform signed comparisons on one-word values. `BLTU` +performs an unsigned comparison on one-word values; there is no unsigned +zero-operand variant because `x < 0` is always false under unsigned +interpretation. + +If a branch condition is true, control transfers to the target currently held in +`br`. If the condition is false, execution falls through to the next +instruction. + +## Data Ops + +### Arithmetic + +P1 v2 defines the following arithmetic and bitwise operations on one-word +values: + +- register-register: `ADD`, `SUB`, `AND`, `OR`, `XOR`, `SHL`, `SHR`, `SAR`, + `MUL`, `DIV`, `REM` +- immediate: `ADDI`, `ANDI`, `ORI`, `SHLI`, `SHRI`, `SARI` + +For `ADD`, `SUB`, `MUL`, `AND`, `OR`, and `XOR`, computation is modulo the +active word size. + +`SHL` shifts left and discards high bits. `SHR` is a logical right shift and +zero-fills. `SAR` is an arithmetic right shift and sign-fills. 
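The wrap-around and shift rules above can be modeled in a few lines. This is a reference sketch, not backend code: the helper names are hypothetical, `bits` is 32 or 64 depending on the width variant, and shift counts are assumed to already be in range.

```python
# Illustrative model of one-word arithmetic and shifts.
# Helper names are hypothetical; shift counts are assumed to be in
# the range 0..bits-1.

def mask(bits):
    return (1 << bits) - 1


def to_signed(x, bits):
    """Interpret the low `bits` bits of x as a two's-complement value."""
    x &= mask(bits)
    return x - (1 << bits) if x >> (bits - 1) else x


def add(a, b, bits):
    """ADD: computed modulo the active word size."""
    return (a + b) & mask(bits)


def shl(a, n, bits):
    """SHL: shift left, discarding high bits."""
    return (a << n) & mask(bits)


def shr(a, n, bits):
    """SHR: logical right shift, zero-filling."""
    return (a & mask(bits)) >> n


def sar(a, n, bits):
    """SAR: arithmetic right shift, sign-filling."""
    return (to_signed(a, bits) >> n) & mask(bits)
```

The difference between `SHR` and `SAR` only shows when the top bit of the operand is set: `shr` brings in zeros, while `sar` replicates the sign bit.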
+ +For register-count shifts, only the low `5` bits of the shift count are +observed in `P1v2-32`, and only the low `6` bits are observed in `P1v2-64`. + +Immediate-form shifts use inline immediates in the range `0..31` in `P1v2-32` +and `0..63` in `P1v2-64`. + +`DIV` is signed division on one-word two's-complement values and truncates +toward zero. `REM` is the corresponding signed remainder. + +Division by zero is outside the portable contract. The overflow case +`MIN_INT / -1` is also outside the portable contract, as is the corresponding +remainder case. -- **Defs files**: ~1500 entries × 3 arches, generator-driven. -- **Testing**: shared harness that assembles each P1 source three ways - and diffs runtime behavior. +### Moves + +P1 v2 defines the following move and materialization operations: + +- `MOV` — register-to-register copy +- `LI` — load one-word integer constant +- `LA` — load label address + +`MOV` may copy from any exposed general register to any exposed general +register. + +Portable source may also read the current stack pointer through `MOV rd, sp`. + +Portable source may not write `sp` through `MOV`. Stack-pointer updates are only +performed by `ENTER`, `LEAVE`, and backend-private call/return machinery. + +`LI` materializes an integer bit-pattern. `LA` materializes the address of a +label. `LA_BR` is a separate control-flow-target materialization form and is not +part of the general move family. + +### Memory + +P1 v2 defines the following memory-access operations: + +- `LD`, `ST` — one-word load and store +- `LB`, `SB` — byte load and store +- `LDARG` — one-word load from the incoming stack-argument area + +`LD` and `ST` access one full word: 4 bytes in `P1v2-32` and 8 bytes in +`P1v2-64`. + +`LB` loads one byte and zero-extends it to a full word. `SB` stores the low +8 bits of the source value. + +Memory offsets use signed 12-bit inline immediates. + +The base address for a memory access may be any exposed general register or +`sp`. 
+ +`LDARG rd, idx` loads incoming stack-argument slot `idx`, where slot `0` is the +first stack-passed explicit argument word. `idx` is word-indexed, not +byte-indexed. `LDARG` is an ABI access, not a general memory operation; it does +not expose or imply any raw `sp`-relative layout at function entry. + +`LDARG` is valid only when the current function has an active standard frame. + +Portable source must not assume that labels are aligned beyond what is +explicitly established by the program itself. Portable code should use +naturally aligned addresses for `LD` and `ST`. Unaligned word accesses are +outside the portable contract. Byte accesses have no additional alignment +requirement. + +## System + +`SYSCALL` is part of the portable ISA surface. + +At the portable level, the syscall convention is: + +- `a0` = syscall number on entry, return value on exit +- `a1`, `a2`, `a3`, `t0`, `s0`, `s1` = syscall arguments 0 through 5 + +At the portable level, `SYSCALL` clobbers only `a0`. All other exposed +registers are preserved across the syscall. + +The mapping from symbolic syscall names to numeric syscall identifiers is +target-defined. The set of syscalls available to a given program is likewise +specified outside the core P1 v2 ISA, for example by a target profile or +runtime interface document. + +## Target notes + +- `a0` is argument 0, the one-word direct return-value register, the low word + of the two-word direct return pair, and the indirect-result buffer pointer. +- On aarch64, riscv64, arm32, and rv32, that matches the native integer/pointer + ABI directly. +- On amd64, the backend must translate between portable `a0` and native + return register `rax` at call and return boundaries. For the two-word direct + return, the backend must also translate `a1` against native `rdx`. +- On amd64, `LDARG` must account for the return address pushed by the native + `call` instruction. 
On aarch64, riscv64, arm32, and rv32, it maps more + directly to the entry `sp` plus the backend's standard frame/header policy. +- `br` is implemented as a dedicated hidden native register on every target. +- On arm32, `t1` and `t2` map to natively callee-saved registers; the backend + is responsible for preserving them across function boundaries in accordance + with the native ABI, even though P1 treats them as caller-saved. +- Frame-pointer use is backend policy, not part of the P1 v2 architectural + register set. + +### Native register mapping + +#### 64-bit targets + +| P1 | amd64 | aarch64 | riscv64 | +|------|-------|---------|---------| +| `a0` | `rdi` | `x0` | `a0` | +| `a1` | `rsi` | `x1` | `a1` | +| `a2` | `rdx` | `x2` | `a2` | +| `a3` | `rcx` | `x3` | `a3` | +| `t0` | `r10` | `x9` | `t0` | +| `t1` | `r11` | `x10` | `t1` | +| `t2` | `r8` | `x11` | `t2` | +| `s0` | `rbx` | `x19` | `s1` | +| `s1` | `r12` | `x20` | `s2` | +| `s2` | `r13` | `x21` | `s3` | +| `s3` | `r14` | `x22` | `s4` | +| `sp` | `rsp` | `sp` | `sp` | -The output is a single portable ISA above which any seed-stage program -(Lisp, Forth, a smaller C compiler) can be written once and run on three -hosts. Below M2-Planet in the chain, above raw M1. Leans entirely on -existing `M1` + `hex2` — no toolchain modifications. +#### 32-bit targets + +| P1 | arm32 | rv32 | +|------|-------|-------| +| `a0` | `r0` | `a0` | +| `a1` | `r1` | `a1` | +| `a2` | `r2` | `a2` | +| `a3` | `r3` | `a3` | +| `t0` | `r12` | `t0` | +| `t1` | `r6` | `t1` | +| `t2` | `r7` | `t2` | +| `s0` | `r4` | `s1` | +| `s1` | `r5` | `s2` | +| `s2` | `r8` | `s3` | +| `s3` | `r9` | `s4` | +| `sp` | `sp` | `sp` | diff --git a/docs/P1v2.md b/docs/P1v2.md @@ -1,531 +0,0 @@ -# P1 v2 - -## Scope - -P1 v2 is a portable pseudo-ISA for standalone executables. 
- -P1 v2 has two width variants: - -- **P1v2-64** — one word is one 64-bit integer or pointer value -- **P1v2-32** — one word is one 32-bit integer or pointer value - -Portable source may use any number of word arguments. The first four argument -registers are explicit, and additional argument words are passed through a -portable incoming stack-argument area. - -Portable source may directly return `0..1` word. Wider results use the -portable indirect-result convention described below. - -## Toolchain envelope - -P1 v2 must be assemblable through the existing `M0` + `hex2` path, with -`catm` as the only composition primitive between source or generated fragments. -The spec therefore assumes only the following toolchain features: - -- `M0`-level `DEFINE name hex_bytes` substitution -- raw byte emission -- labels and label references supported by `hex2` -- file concatenation via `catm` - -## Source notation - -This document describes instructions using ordinary assembly notation such as -`ADD rd, ra, rb`, `LD rd, [ra + off]`, or `CALL`. - -Because of the toolchain constraints above, portable source does not encode -most operands as textual instruction arguments. Instead, register choices, -inline immediate values, and small fixed parameters are fused into opcode -names, following the generated-table style used by `src/p1_gen.py`. 
- -So the notation in this document is descriptive rather than literal: - -- `ADD rd, ra, rb` means a family of fused register-specific opcodes -- `ADDI rd, ra, imm` means a family of fused register-and-immediate-specific - opcodes -- `ENTER size` means a family of fused byte-count-specific opcodes -- `LDARG rd, idx` means a family of fused register-and-argument-slot-specific - opcodes -- `BR rs`, `CALLR rs`, and `TAILR rs` mean register-specific control-flow - opcodes -- `LEAVE`, `CALL`, `RET`, `TAIL`, `B`, and `SYSCALL` remain operand-free - -Labels still appear in source where the toolchain supports them directly, such -as `LA rd, %label` and `LA_BR %label`. - -## Register Model - -### Exposed registers - -P1 v2 exposes the following source-level registers: - -- `a0`–`a3` — argument registers. Also caller-saved general registers. -- `t0`–`t2` — caller-saved temporaries. -- `s0`–`s3` — callee-saved general registers. -- `sp` — stack pointer. - -### Hidden registers - -The backend may reserve additional native registers that are never visible in -P1 source: - -- `br` — branch / call target mechanism, implemented as a dedicated hidden - native register on every target -- backend-local scratch used entirely within one instruction expansion - -No hidden register may carry a live P1 value across an instruction boundary. - -## Calling Convention - -### Arguments and return values - -P1 v2 defines three result conventions: one-word direct, two-word direct, and -indirect. - -In the one-word direct-result convention: - -- Explicit argument words 0-3 live in `a0-a3`. -- Additional explicit argument words live in the incoming stack-argument area - and are read with `LDARG`. -- On return, a one-word result lives in `a0`. - -In the two-word direct-result convention: - -- Explicit argument words 0-3 live in `a0-a3` on entry. -- Additional explicit argument words still live in the incoming - stack-argument area. 
-- On return, `a0` holds result word 0 and `a1` holds result word 1. - -In the indirect-result convention: - -- The caller passes a writable result buffer pointer in `a0`. -- Explicit argument words 0-2 then live in `a1-a3`. -- Additional explicit argument words still live in the incoming - stack-argument area. -- On return, `a0` holds the same result buffer pointer value. - -In both direct-result conventions, incoming stack-argument slot `0` corresponds -to explicit argument word `4`. In the indirect-result convention, incoming -stack-argument slot `0` corresponds to explicit argument word `3`. - -The two-word direct-result convention covers common cases such as 64-bit -integer results on 32-bit targets, two-word aggregates, and divmod-style -returns. The indirect-result convention is the portable way to return any -result wider than two words. - -### Register preservation - -Caller-saved: - -- `a0`–`a3` -- `t0`–`t2` - -Callee-saved: - -- `s0`–`s3` -- `sp` - -### Call semantics - -A call is valid from any function, including a leaf. Call / return correctness -does not depend on establishing a frame first. - -If a function needs any incoming argument after making a call, it must save it -before the call. This matters in particular for `a0`, which is overwritten by -every convention's return value, and for `a1` when the callee uses the two-word -direct-result convention. - -A call that passes any stack argument words requires the caller to have an -active standard frame with enough frame-local storage to stage those outgoing -words. - -The return address is hidden machine state. Portable source must not assume -that it lives in any exposed register. - -## Stack Convention - -### Call-boundary rule - -At every call boundary, the backend must satisfy the native C ABI stack -alignment rule for the target architecture. - -Portable source must therefore treat raw function-entry `sp` as opaque. 
It may -not assume that the low bits of `sp` have the same meaning on all targets -before a frame is established. - -### Incoming stack-argument area - -P1 v2 defines an abstract incoming stack-argument area for explicit argument -words that do not fit in registers. - -- Slot `0` is the first stack-passed explicit argument word. -- Slots are word-indexed, not byte-indexed. -- Portable source may access this area only through `LDARG`. - -`LDARG` is valid only when the current function has an active standard frame. -Therefore, a function that needs any incoming stack argument must establish a -standard frame before its first `LDARG`. - -Portable source must not assume any direct relationship between incoming -argument slots and raw function-entry `sp`. In particular, source must not try -to reconstruct stack arguments by manually indexing from `sp`; backend entry -layouts differ across targets. - -For a call with `m` stack-passed explicit argument words, the caller stages -those words in the first `m` words of its frame-local storage immediately -before the call: - -``` -[sp + 2*WORD + 0*WORD] = outgoing arg word 0 -[sp + 2*WORD + 1*WORD] = outgoing arg word 1 -... -``` - -At callee entry, those staged words become incoming argument slots `0..m-1`. -The backend is responsible for mapping between the caller's frame layout and -the callee's abstract incoming argument slots. - -Portable code that needs both ordinary locals and stack-passed outgoing -arguments must reserve enough total frame-local storage and keep the low- -addressed prefix available for outgoing argument staging across the call. - -### Standard frame layout - -Functions that need local stack storage use a standard frame layout. After -frame establishment: - -``` -[sp + 0*WORD] = saved return address -[sp + 1*WORD] = saved caller stack pointer -[sp + 2*WORD ... sp + 2*WORD + local_bytes - 1] = frame-local storage -... -``` - -Frame-local storage is byte-addressed. 
Portable code may use it for ordinary -locals, spilled callee-saved registers, and the caller-staged outgoing -stack-argument words described above. - -Total frame size is: - -`round_up(STACK_ALIGN, 2*WORD_SIZE + local_bytes)` - -Where: - -- `WORD_SIZE = 8` in P1v2-64 -- `WORD_SIZE = 4` in P1v2-32 -- `STACK_ALIGN` is target-defined and must satisfy the native call ABI - -Leaf functions that need no frame-local storage may omit the frame entirely. - -### Frame invariants - -- A function that allocates a frame must restore `sp` before returning. -- Callee-saved registers modified by the function must be restored before - returning. -- The standard frame layout is the only frame shape recognized by P1 v2. - -## Op Set Summary - -| Category | Operations | -|----------|------------| -| Materialization | `LI rd, imm`, `LA rd, %label`, `LA_BR %label` | -| Moves | `MOV rd, rs`, `MOV rd, sp` | -| Arithmetic | `ADD`, `SUB`, `AND`, `OR`, `XOR`, `SHL`, `SHR`, `SAR`, `MUL`, `DIV`, `REM` | -| Immediate arithmetic | `ADDI`, `ANDI`, `ORI`, `SHLI`, `SHRI`, `SARI` | -| Memory | `LD`, `ST`, `LB`, `SB` | -| ABI access | `LDARG` | -| Branching | `B`, `BR`, `BEQ`, `BNE`, `BLT`, `BLTU`, `BEQZ`, `BNEZ`, `BLTZ` | -| Calls / returns | `CALL`, `CALLR`, `RET`, `TAIL`, `TAILR` | -| Frame management | `ENTER`, `LEAVE` | -| System | `SYSCALL` | - -## Immediates - -Immediate operands appear only in instructions that explicitly admit them. 
-Portable source has three immediate classes: - -- **Inline integer immediate** — a signed 12-bit assembly-time constant in the - range `-2048..2047` -- **Materialized word value** — a full one-word assembly-time constant loaded - with `LI` -- **Materialized address** — the address of a label loaded with `LA` - -P1 v2 also uses two structured assembly-time operands: - -- **Frame-local byte count** — a non-negative byte count used by `ENTER` -- **Argument-slot index** — a non-negative word-slot index used by `LDARG` - -`LI rd, imm` loads the one-word integer value `imm`. - -`LA rd, %label` loads the address of `%label` as a one-word pointer value. - -The backend may realize `LI` and `LA` using native immediates, literal pools, -multi-instruction sequences, or other backend-private mechanisms. - -Backends may assume labels fit in 32 bits when realizing `LA` and `LA_BR`. -This reflects the stage0 image layout (`hex2-0` base `0x00600000`, programs -well under 4 GB), not a portable-ISA-level guarantee. Backends that target -images loaded above the 4 GB boundary must adjust their `LA` / `LA_BR` -lowering. `LI` makes no such assumption — it materializes any one-word value. - -## Control Flow - -### Call / Return / Tail Call - -Control-flow targets are materialized with `LA_BR %label`, which loads -`%label` into the hidden branch-target mechanism `br`. The immediately -following control-flow op consumes that target. - -`CALL` transfers control to the target most recently loaded by `LA_BR` and -establishes a return continuation such that a subsequent `RET` returns to the -instruction after the `CALL`. `CALL` is valid whether or not the caller has -established a standard frame, except that any call using stack-passed argument -words requires an active standard frame to hold the staged outgoing words. - -`CALLR rs` is the register-indirect form of `CALL`. 
It transfers control to -the code pointer value held in `rs` and establishes the same return -continuation semantics as `CALL`. - -`RET` returns through the current return continuation. `RET` is valid whether -or not the current function has established a standard frame, provided any -frame established by the function has already been torn down. - -`TAIL` is a tail call to the target most recently loaded by `LA_BR`. It is -valid only when the current function has an active standard frame. `TAIL` -performs the standard epilogue for the current frame and then transfers control -to the loaded target without creating a new return continuation. The callee -therefore returns directly to the current function's caller. - -`TAILR rs` is the register-indirect form of `TAIL`. It is valid only when the -current function has an active standard frame. - -Because stack-passed outgoing argument words are staged in the caller's own -frame-local storage, `TAIL` and `TAILR` are portable only when the tail-called -callee requires no stack-passed argument words. Portable compilers must lower -other tail-call cases to an ordinary `CALL` / `RET` sequence. - -Portable source must treat the return continuation as hidden machine state. It -must not assume that the return address lives in any exposed register or stack -location except as defined by the standard frame layout after frame -establishment. - -### Prologue / Epilogue - -P1 v2 defines the following frame-establishment and frame-teardown operations: - -- `ENTER size` -- `LEAVE` - -`ENTER size` establishes the standard frame layout with `size` bytes of -frame-local storage: - -``` -[sp + 0*WORD] = saved return address -[sp + 1*WORD] = saved caller stack pointer -[sp + 2*WORD ... sp + 2*WORD + size - 1] = frame-local storage -``` - -The total allocation size is: - -`round_up(STACK_ALIGN, 2*WORD_SIZE + size)` - -The named frame-local bytes are the usable local storage. 
Any additional bytes -introduced by alignment rounding are padding, not extra local bytes. - -`LEAVE` tears down the current standard frame and restores the hidden return -continuation so that a subsequent `RET` returns correctly. - -Because every standard frame stores the saved caller stack pointer at -`[sp + 1*WORD]`, `LEAVE` does not need to know the frame-local byte count used -by the corresponding `ENTER`. - -A function may omit `ENTER` / `LEAVE` entirely if it is a leaf and needs no -standard frame. - -`ENTER` and `LEAVE` do not implicitly save or restore `s0` or `s1`. A -function that modifies `s0` or `s1` must preserve them explicitly, typically by -storing them in frame-local storage within its standard frame. - -### Branching - -P1 v2 branch targets are carried through the hidden branch-target mechanism -`br`. Portable source may load `br` only through: - -- `LA_BR %label` — materialize the address of `%label` as the next branch, call, - or tail-call target - -No branch, call, or tail opcode takes a label operand directly. Portable source -must treat `br` as owned by the control-flow machinery. No live value may be -carried in `br`. Each `LA_BR` must be consumed by the immediately following -branch, call, or tail op, and portable source must not rely on `br` surviving -across any other instruction. - -The portable branch families are: - -- `B` — unconditional branch to the target in `br` -- `BR rs` — unconditional branch to the code pointer in `rs` -- `BEQ`, `BNE`, `BLT`, `BLTU` — conditional branch to the target in `br` -- `BEQZ`, `BNEZ`, `BLTZ` — conditional branch to the target in `br` using zero - as the second operand - -`BLT` and `BLTZ` perform signed comparisons on one-word values. `BLTU` -performs an unsigned comparison on one-word values; there is no unsigned -zero-operand variant because `x < 0` is always false under unsigned -interpretation. - -If a branch condition is true, control transfers to the target currently held in -`br`. 
If the condition is false, execution falls through to the next -instruction. - -## Data Ops - -### Arithmetic - -P1 v2 defines the following arithmetic and bitwise operations on one-word -values: - -- register-register: `ADD`, `SUB`, `AND`, `OR`, `XOR`, `SHL`, `SHR`, `SAR`, - `MUL`, `DIV`, `REM` -- immediate: `ADDI`, `ANDI`, `ORI`, `SHLI`, `SHRI`, `SARI` - -For `ADD`, `SUB`, `MUL`, `AND`, `OR`, and `XOR`, computation is modulo the -active word size. - -`SHL` shifts left and discards high bits. `SHR` is a logical right shift and -zero-fills. `SAR` is an arithmetic right shift and sign-fills. - -For register-count shifts, only the low `5` bits of the shift count are -observed in `P1v2-32`, and only the low `6` bits are observed in `P1v2-64`. - -Immediate-form shifts use inline immediates in the range `0..31` in `P1v2-32` -and `0..63` in `P1v2-64`. - -`DIV` is signed division on one-word two's-complement values and truncates -toward zero. `REM` is the corresponding signed remainder. - -Division by zero is outside the portable contract. The overflow case -`MIN_INT / -1` is also outside the portable contract, as is the corresponding -remainder case. - -### Moves - -P1 v2 defines the following move and materialization operations: - -- `MOV` — register-to-register copy -- `LI` — load one-word integer constant -- `LA` — load label address - -`MOV` may copy from any exposed general register to any exposed general -register. - -Portable source may also read the current stack pointer through `MOV rd, sp`. - -Portable source may not write `sp` through `MOV`. Stack-pointer updates are only -performed by `ENTER`, `LEAVE`, and backend-private call/return machinery. - -`LI` materializes an integer bit-pattern. `LA` materializes the address of a -label. `LA_BR` is a separate control-flow-target materialization form and is not -part of the general move family. 
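
The arithmetic rules above (modulo-word results, masked register shift counts, truncating signed division) can be modeled in a few lines. This is a minimal sketch, not the backend's implementation: `make_ops` and every helper name in it are hypothetical, and values are represented as unsigned Python integers holding the word's bit pattern.

```python
# Hypothetical model of P1 v2 one-word arithmetic; all names are illustrative.

def make_ops(word_bits):
    """Return arithmetic helpers for a 32-bit or 64-bit word."""
    mask = (1 << word_bits) - 1
    count_mask = word_bits - 1  # low 5 bits in P1v2-32, low 6 bits in P1v2-64
    min_int = -(1 << (word_bits - 1))

    def to_signed(x):
        # Reinterpret a word's bit pattern as a two's-complement integer.
        x &= mask
        return x - (1 << word_bits) if x >> (word_bits - 1) else x

    def add(a, b):
        return (a + b) & mask  # computation is modulo the word size

    def shl(a, n):
        return (a << (n & count_mask)) & mask  # high bits are discarded

    def shr(a, n):
        return (a & mask) >> (n & count_mask)  # logical shift, zero-fill

    def sar(a, n):
        return (to_signed(a) >> (n & count_mask)) & mask  # sign-fill

    def _trunc_div(sa, sb):
        # These two cases are outside the portable contract; the model rejects them.
        if sb == 0 or (sa == min_int and sb == -1):
            raise ValueError("outside the portable contract")
        q = abs(sa) // abs(sb)  # truncate toward zero, unlike Python's floor //
        return -q if (sa < 0) != (sb < 0) else q

    def div(a, b):
        return _trunc_div(to_signed(a), to_signed(b)) & mask

    def rem(a, b):
        sa, sb = to_signed(a), to_signed(b)
        return (sa - _trunc_div(sa, sb) * sb) & mask

    return {"add": add, "shl": shl, "shr": shr, "sar": sar,
            "div": div, "rem": rem}
```

For example, `make_ops(32)["shl"](1, 33)` shifts by one position, because only the low 5 bits of the count are observed in the 32-bit variant.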
- -### Memory - -P1 v2 defines the following memory-access operations: - -- `LD`, `ST` — one-word load and store -- `LB`, `SB` — byte load and store -- `LDARG` — one-word load from the incoming stack-argument area - -`LD` and `ST` access one full word: 4 bytes in `P1v2-32` and 8 bytes in -`P1v2-64`. - -`LB` loads one byte and zero-extends it to a full word. `SB` stores the low -8 bits of the source value. - -Memory offsets use signed 12-bit inline immediates. - -The base address for a memory access may be any exposed general register or -`sp`. - -`LDARG rd, idx` loads incoming stack-argument slot `idx`, where slot `0` is the -first stack-passed explicit argument word. `idx` is word-indexed, not -byte-indexed. `LDARG` is an ABI access, not a general memory operation; it does -not expose or imply any raw `sp`-relative layout at function entry. - -`LDARG` is valid only when the current function has an active standard frame. - -Portable source must not assume that labels are aligned beyond what is -explicitly established by the program itself. Portable code should use -naturally aligned addresses for `LD` and `ST`. Unaligned word accesses are -outside the portable contract. Byte accesses have no additional alignment -requirement. - -## System - -`SYSCALL` is part of the portable ISA surface. - -At the portable level, the syscall convention is: - -- `a0` = syscall number on entry, return value on exit -- `a1`, `a2`, `a3`, `t0`, `s0`, `s1` = syscall arguments 0 through 5 - -At the portable level, `SYSCALL` clobbers only `a0`. All other exposed -registers are preserved across the syscall. - -The mapping from symbolic syscall names to numeric syscall identifiers is -target-defined. The set of syscalls available to a given program is likewise -specified outside the core P1 v2 ISA, for example by a target profile or -runtime interface document. 
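
The memory-access rules in the Memory section above can be sketched as a small model. It is illustrative only: the `Memory` class is hypothetical, little-endian byte order is an assumption (the spec text does not state one), and unaligned word accesses are rejected here simply because they fall outside the portable contract.

```python
# Hypothetical model of LD/ST/LB/SB; class and method names are illustrative.

class Memory:
    def __init__(self, size, word_size):
        self.data = bytearray(size)
        self.word_size = word_size  # 4 in P1v2-32, 8 in P1v2-64

    def _addr(self, base, off):
        # Memory offsets are signed 12-bit inline immediates.
        if not -2048 <= off <= 2047:
            raise ValueError("offset does not fit a signed 12-bit immediate")
        return base + off

    def _word_addr(self, base, off):
        addr = self._addr(base, off)
        if addr % self.word_size:
            raise ValueError("unaligned word access is outside the portable contract")
        return addr

    def ld(self, base, off):
        addr = self._word_addr(base, off)
        # Assumption: little-endian byte order.
        return int.from_bytes(self.data[addr:addr + self.word_size], "little")

    def st(self, base, off, value):
        addr = self._word_addr(base, off)
        mask = (1 << (8 * self.word_size)) - 1
        self.data[addr:addr + self.word_size] = (value & mask).to_bytes(
            self.word_size, "little")

    def lb(self, base, off):
        return self.data[self._addr(base, off)]  # zero-extended to a full word

    def sb(self, base, off, value):
        self.data[self._addr(base, off)] = value & 0xFF  # low 8 bits only
```

Byte accesses go through `_addr` only, which reflects the rule that they carry no additional alignment requirement.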
- -## Target notes - -- `a0` is argument 0, the one-word direct return-value register, the low word - of the two-word direct return pair, and the indirect-result buffer pointer. -- On aarch64, riscv64, arm32, and rv32, that matches the native integer/pointer - ABI directly. -- On amd64, the backend must translate between portable `a0` and native - return register `rax` at call and return boundaries. For the two-word direct - return, the backend must also translate `a1` against native `rdx`. -- On amd64, `LDARG` must account for the return address pushed by the native - `call` instruction. On aarch64, riscv64, arm32, and rv32, it maps more - directly to the entry `sp` plus the backend's standard frame/header policy. -- `br` is implemented as a dedicated hidden native register on every target. -- On arm32, `t1` and `t2` map to natively callee-saved registers; the backend - is responsible for preserving them across function boundaries in accordance - with the native ABI, even though P1 treats them as caller-saved. -- Frame-pointer use is backend policy, not part of the P1 v2 architectural - register set. 
- -### Native register mapping - -#### 64-bit targets - -| P1 | amd64 | aarch64 | riscv64 | -|------|-------|---------|---------| -| `a0` | `rdi` | `x0` | `a0` | -| `a1` | `rsi` | `x1` | `a1` | -| `a2` | `rdx` | `x2` | `a2` | -| `a3` | `rcx` | `x3` | `a3` | -| `t0` | `r10` | `x9` | `t0` | -| `t1` | `r11` | `x10` | `t1` | -| `t2` | `r8` | `x11` | `t2` | -| `s0` | `rbx` | `x19` | `s1` | -| `s1` | `r12` | `x20` | `s2` | -| `s2` | `r13` | `x21` | `s3` | -| `s3` | `r14` | `x22` | `s4` | -| `sp` | `rsp` | `sp` | `sp` | - -#### 32-bit targets - -| P1 | arm32 | rv32 | -|------|-------|-------| -| `a0` | `r0` | `a0` | -| `a1` | `r1` | `a1` | -| `a2` | `r2` | `a2` | -| `a3` | `r3` | `a3` | -| `t0` | `r12` | `t0` | -| `t1` | `r6` | `t1` | -| `t2` | `r7` | `t2` | -| `s0` | `r4` | `s1` | -| `s1` | `r5` | `s2` | -| `s2` | `r8` | `s3` | -| `s3` | `r9` | `s4` | -| `sp` | `sp` | `sp` | diff --git a/docs/PLAN.md b/docs/PLAN.md @@ -1,210 +0,0 @@ -# Alternative bootstrap path: Lisp-in-P1 → C compiler in Lisp → tcc-boot - -## Goal - -Shrink the auditable LOC between M1 assembly and tcc-boot by replacing the -current `M2-Planet → mes → MesCC → nyacc` stack with a small Lisp written -once in the P1 portable pseudo-ISA (see [P1.md](P1.md)) and a C compiler written -in that Lisp. P1 is the same layer described in `P1.md`: ~30 RISC-shaped ops -whose per-arch `DEFINE` tables expand to amd64 / aarch64 / riscv64 encodings, -so one Lisp source serves all three hosts. 
- -## Current chain (validated counts) - -| Layer | Lang | Lines | -|---|---|---| -| `cc_amd64.M1` (subset-C compiler in M1 asm) | M1 | 5,413 (~3,152 actual instructions) | -| M2-Planet (`*.c`, compiles mes) | C | 8,140 | -| Mes interpreter (`src/*.c`) | C | 7,033 | -| Mes headers (`include/mes/*.h`) | C | 6,145 | -| MesCC + mes Scheme (`module/`) | Scheme | 8,271 | -| Bundled mes runtime (SRFI/ice-9/rnrs shims) | Scheme | 9,191 | -| nyacc (LALR engine + C99 parser/grammar/cpp) | Scheme | ~10,000 (essentials of 12,868) | -| **Total auditable** | mixed | **~54,000** | - -## Proposed chain - -``` -M1 asm → P1 pseudo-ISA → Lisp interpreter (in P1) → C compiler (in Lisp) → tcc-boot -``` - -Two languages plus one portable asm layer, one new interpreter, one new -compiler. No M2-Planet, no Mes core, no MesCC, no nyacc. The interpreter is -authored once in P1 and assembled three ways; porting to a fourth arch means -a new P1 defs file, not a rewrite. - -## Why P1 as the host - -- **Single source of truth.** A Lisp in raw M1 asm would need three - hand-written variants (one per target arch). In P1, there is one source; - the per-arch cost is already paid inside the P1 defs files. -- **Cost lives in P1, not here.** P1's one-time tax (~1500 defines × 3 arches - generator-driven, plus ~240 LOC of `hex2_word` + `M1-macro` aarch64 work) - is accounted in `P1.md`. This plan inherits that layer rather than - duplicating it. -- **Dependency ordering.** PLAN cannot start the Lisp interpreter until P1 - stages 1–4 in `P1.md` are complete (spike on all three arches plus the - full ~30-op matrix). P1 stage 5 ("seed Lisp interpreter in ~500 lines of - P1") is effectively this plan's kickoff. - -## Lisp — feature floor - -Justification: empirical audit of MesCC's actual Scheme usage. MesCC barely -exercises Scheme. 
- -**Required:** -- Special forms: `define`, `lambda`, `if`, `cond`, `let`, `let*`, `letrec`, - `quote`, `quasiquote`/`unquote`, `set!`, `begin` -- Data: pairs, fixnums, vectors, immutable ASCII strings, symbols -- Primitives (~40): `cons/car/cdr`, list ops (`map/filter/fold/append/reverse/member/assoc`), - arithmetic (`+ - * / %`), bitwise (`and or xor << >>`), string ops - (`string-append/string-ref/substring/string-length`), type predicates, - `display`/`write`, basic `format` (`~a ~s ~d ~%` only), `apply`, `error` -- Mark-sweep GC over tagged cells -- Built-in `pmatch` macro (otherwise hand-expanding 57 call sites in the - C compiler costs ~1k extra LOC) -- A records-via-vectors layer (replaces SRFI-9 `define-immutable-record-type`) -- File I/O: `read-file path → string` and `write-file path string`. No port - type at all. The C lexer indexes into the source string with an integer - cursor (gives `read-char`/`peek-char` semantics for free); CPP keeps - `#include` context as a stack of (string, cursor) pairs. Codegen - accumulates output as a reversed list of chunks and concatenates once - via a one-pass variadic `string-append` (or a `string-concat list → - string` primitive). Output for tcc-boot is single-digit MB — well within - the existing mes 20MB arena budget. - -**Deliberately omitted:** -- `call/cc`, `dynamic-wind`, `parameterize`, exception system -- Mutable strings, Unicode -- Bignums, rationals, floats -- `syntax-rules` / `define-syntax` (only `pmatch` macro is needed) -- First-class modules (single-file load in dependency order) -- `do` loops, `delay`/`force` - -Tail calls: convenient for AST recursion; not strictly required if stack is -generous (≥1MB). - -## C subset to support - -Start from MesCC's already-reduced subset; consider further reductions if -they justify patching tcc-boot. 
- -**Must support (used by tcc-boot):** -- Types: `char/short/int/long/long long`, signed/unsigned, pointers, arrays, - structs, unions, enums, **bitfields**, typedefs, `void` -- Storage: `static`, `extern`, `register`; `const`/`volatile` parsed and - ignored -- Operators: full arithmetic/bitwise/relational/logical, compound - assignment, ternary, `sizeof` (types and expressions), casts, comma -- Statements: all loops, switch/case, goto/labels, `&&`/`||` short-circuit -- Function declarations: ANSI only - -**Not supported (and not needed):** -- `float`/`double` (errors at parse time) -- `inline` (parsed and stripped, like MesCC) -- Variadic functions (tcc-boot already works around this) -- K&R declarations -- C99 mid-block declarations -- Statement expressions, nested functions, compound literals, - designated initializers - -**Candidate further reductions (require tcc-boot patches):** -- Drop bitfields (significant tcc-boot rework — probably not worth it) -- Drop compound assignment (modest tcc-boot patches) - -**Preprocessor:** target full `#define`/`#include`/`#if`/`#ifdef`/`#elif`/ -`#else`/`#endif` with function-like macros and stringification. tcc's source -uses these heavily. - -## Backend - -**Settled: emit P1.** The C compiler is written once in portable Lisp and -emits portable asm, so both the pre-tcc-boot seed userland (`SEED.md`) and -tcc-boot itself land on all three arches without a second backend. Codegen -is slightly harder than direct amd64 — P1 is deliberately dumb, so C -idioms like `x += y` expand to multi-op P1 sequences — but we pay the -~2× code-size tax already budgeted in `P1.md` rather than writing three -backends. - -This forecloses the alternative of emitting amd64 M1 directly (simpler -codegen, single-arch only). That option would have satisfied a -tcc-boot-only goal, but `SEED.md` requires tri-arch seed binaries, so a -portable backend is load-bearing. 
- -## Estimated budget - -| Component | Lines | -|---|---| -| Lisp interpreter in P1 (reader, eval, GC, primitives, I/O, pmatch) | 4,000–6,000 P1 | -| C lexer + recursive-descent parser + CPP (in Lisp) | 2,000–3,000 | -| Type checker + IR (slimmed compile.scm + info.scm) | 2,000–3,000 | -| Codegen + P1 emit (see Backend) | 800–1,500 | -| **Total auditable (this plan)** | **~9,000–13,000 LOC** | - -vs. **~54,000 LOC** current = **~4–6× shrink**, and the result is -tri-arch instead of amd64-only. P1's own infrastructure (defs files, -`hex2_word` extensions, generator) is audited once in `P1.md` and shared -with any future seed-stage program. - -## Resolutions - -- **Narrow loads: zero-extend only, 8-bit only.** P1 keeps `LB` - zero-extending; no `LBS` added, and 32-bit `LW`/`SW` are out of the - ISA entirely (emulate through 64-bit `LD`/`ST` + mask/shift if ever - needed). Fixnums live in full 64-bit tagged cells, so the interpreter - never needs a sign-extended or 32-bit load — byte/ASCII access is - unsigned, and arithmetic happens on 64-bit values already. -- **Static code size: accept the 2× tax.** P1's destructive-expansion - rule on amd64 roughly doubles instruction count vs. hand-tuned amd64. - Matches P1's "deliberately dumb" contract (see `P1.md`). Interpreter - binary expected in low single-digit MB — irrelevant for a seed. -- **Tail calls: codify `TAIL` in P1.** A new `TAIL %label` macro (see - `P1.md`, Control flow) expands to `LD lr, sp, 0; ADDI sp, sp, +16; - B %label` or the per-arch equivalent. The interpreter's `eval` is - written in the natural recursive style with tail-position calls - compiled through `TAIL`, so the P1 stack does not grow per Scheme - frame. As a side effect, Scheme-level tail calls fall out R5RS-proper - for the interpreter's subset without extra mechanism. -- **Pool placement: per-function on all arches.** Each function emits its - constant pool at its epilogue, inside the aarch64 `LDR`-literal ±1 MiB - range. 
Labels are file-local; duplicated constants across functions - are accepted. Simple rule, no range-check logic in codegen. -- **GC arena: static BSS.** The ~20 MB heap is reserved as a single BSS - region at link time. No `brk`/`mmap` at runtime, no arena-sizing flag. - Keeps the P1 program to a minimal syscall surface and makes the - interpreter image self-describing. -- **Syscalls: eight.** `read`, `write`, `openat`, `close`, `exit`, - `clone`, `execve`, `waitid`. Each becomes one P1 `SYSCALL` op - backed by a per-arch number table in the P1 defs file. - `read-file` loops `read` into a growable string until EOF (no - `stat`/`lseek`); `display`/`write`/`error` go through `write` on - fd 1/2; `error` finishes with `exit`. `openat(AT_FDCWD, …)` - replaces `open` because aarch64/riscv64 lack bare `open` in the - asm-generic table. `clone(SIGCHLD)` + `execve` + `waitid` give - the Lisp enough to drive the tcc-boot build directly — see - "Build driver" below. No signals, time, or networking. - -## Build driver - -Once Lisp can spawn, the Lisp program itself is the build driver. -There is no separate shell. A top-level Lisp source file reads the -pinned list of tcc-boot translation units, iterates over them, and -for each one: - -1. Reads the `.c` source into a Lisp string. -2. Calls the Lisp-hosted C compiler (in-process) to produce P1 text. -3. Writes the P1 text to a temp file. -4. Spawns M1 (from stage0-posix, via `clone`+`execve`) to assemble - P1 → `.hex2`; waits via `waitid`, aborts on non-zero. -5. Spawns hex2 to emit the final `.o` / ELF; waits, aborts on - non-zero. - -The seed-tool builds (each mescc-tools-extra source → one ELF) run -the same loop. Spawn-and-wait is a ~20 LOC Lisp primitive; the full -driver, including the hard-coded tcc-boot file list, is ~100–200 -LOC of Lisp counted against this plan. 
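
The five driver steps above can be sketched as follows. The real driver is planned as Lisp; this Python version only illustrates the control flow. Everything here is hypothetical: `build_units`, the injected callbacks, the `.P1` temp-file suffix, and the `(tool, input, output)` shape passed to `spawn`, which stands in for `clone` + `execve` + `waitid` and deliberately avoids guessing the actual M1 or hex2 command lines.

```python
# Hypothetical sketch of the build-driver loop; not the planned Lisp driver.
import os


def build_units(units, read_file, compile_c, write_file, spawn, out_dir):
    """Build each translation unit; return the list of final output paths.

    read_file(path) -> str and write_file(path, text) mirror the Lisp
    file primitives; compile_c(source) -> str is the in-process C compiler;
    spawn(tool, input_path, output_path) -> int returns the exit status.
    """
    outputs = []
    for src_path in units:
        source = read_file(src_path)                      # step 1
        p1_text = compile_c(source)                       # step 2
        stem = os.path.splitext(os.path.basename(src_path))[0]
        p1_path = os.path.join(out_dir, stem + ".P1")     # assumed suffix
        hex2_path = os.path.join(out_dir, stem + ".hex2")
        out_path = os.path.join(out_dir, stem)
        write_file(p1_path, p1_text)                      # step 3
        # Steps 4 and 5: assemble with M1, then emit the final file with hex2.
        for tool, inp, out in (("M1", p1_path, hex2_path),
                               ("hex2", hex2_path, out_path)):
            status = spawn(tool, inp, out)
            if status != 0:                               # abort on non-zero
                raise RuntimeError(f"{tool} failed on {inp}: status {status}")
        outputs.append(out_path)
    return outputs
```

Injecting the callbacks keeps the loop testable without spawning anything, which is a convenience of this sketch rather than a property of the plan.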
- -Concentrating orchestration in the Lisp program (rather than a -separate P1/M1 shell) collapses the post-M1 contribution list to -exactly three artifacts: P1, the Lisp interpreter, and the C -compiler. diff --git a/docs/SEED.md b/docs/SEED.md @@ -1,303 +0,0 @@ -# Seed userland: the pre-tcc-boot tools - -## Goal - -Bridge the window between *Lisp exists* and *tcc-boot exists* without -touching M2-Planet, Mes, or MesCC. Inside that window, all code is -either a Lisp program running on the Lisp interpreter or one of a -small set of standalone C binaries compiled through the Lisp-hosted -C compiler → P1 → M1 → hex2 pipeline. - -This document covers only that window. Phases before it (`seed0 → -hex0/hex1/hex2 → M1`, P1 defs, Lisp interpreter, and the Lisp-hosted -C compiler) are documented in `P1.md` and `PLAN.md`. tcc-boot itself -and everything downstream are standard C and out of scope. - -## Position in the chain - -``` -stage0-posix: seed0 → hex0 → hex1 → hex2 → M1 (no C, no Lisp) -P1 layer: P1 defs files load into M1 (P1.md) -Lisp: P1 text (Lisp interp source) → M1 → hex2 (PLAN.md) -C compiler: Lisp program, loaded into the Lisp image (PLAN.md) -──────── seed window begins here ──────── -seed tools: C source → Lisp+Ccc → P1 text → M1 → hex2 (this doc) -──────── seed window ends when tcc-boot is built ──────── -tcc-boot: C source → Lisp+Ccc → P1 text → M1 → hex2 (PLAN.md) -``` - -One Lisp-hosted C compiler (shared with tcc-boot) and a handful of -statically-linked C binaries. No M2-Planet artifact and no Mes -Scheme module anywhere. - -## Settled decisions - -These are load-bearing; rest of the document assumes them. - -1. **Seed programs compile through the same Lisp-hosted C compiler - as tcc-boot.** No separate seed-stage compiler. Authors write in - the C subset fixed in `PLAN.md`; backend emits P1, so seed lands - tri-arch via the existing M1+hex2 path. Accepts P1's ~2× - code-size tax. -2. 
**Vendor upstream C where it exists.** `cat`, `cp`, `mkdir`, - `rm`, `sha256sum`, `untar` are taken from live-bootstrap's - `mescc-tools-extra`; `patch-apply` from `simple-patch-1.0`. - The libc these sources depend on (`<stdio.h>`, `<string.h>`, - `<stdlib.h>`, etc.) is vendored M2libc's portable layer — - `bootstrappable.c`, `string.c`, `stdio.c`, `stdlib.c`, and the - small `ctype`/`fcntl` files (~1,500 LOC). Per-arch syscall - stubs backing M2libc's declarations are replaced with our - P1-based stubs (see "How seed tools reach syscalls" below). All - of the above was written against M2-Planet's C subset, which is - a subset of ours. Local adaptations ship as unified diffs in - the repo. **No C is written fresh here** — each vendored - source already has its own `main`. -3. **The Lisp program is the build driver — no separate shell.** - Per `PLAN.md`, the Lisp's syscall surface includes `clone`, - `execve`, `waitid`, so a top-level Lisp file drives the whole - tcc-boot build: iterate over translation units, call the - Lisp-hosted C compiler in-process, spawn M1/hex2 to finish - each artifact, check exit status. No `kaem`, no `sh`, no flat - script — just Lisp code. -4. **One binary per tool.** Each vendored source compiles to a - standalone ELF — `cat`, `cp`, `mkdir`, `rm`, `sha256sum`, - `untar`, `patch-apply`. Installed into a single directory - (say, `/seed/`) and invoked by absolute path from the Lisp - driver. No dispatcher, no argv[0] multiplexing, no fresh `main` - to write. Each tool is its own audit unit. -5. **Uncompressed tcc-boot mirror.** Host the upstream tcc-boot source - as an uncompressed `.tar` with sha256 pinned. No gzip support - anywhere in the seed stage. Deletes ~1000–1500 LOC of deflate from - the audit. -6. **Explicit patches via `patch-apply`.** Upstream source stays - verbatim. Our changes live as unified-diff files in this repo, - applied by the `simple-patch`-derived binary. "Upstream vs - ours" stays legible. -7. 
**Target self-build is primary; cross-build is a cache.** The - canonical build is a fresh target machine bootstrapping from - stage0-posix hex seed. Cross-built per-arch tarballs are supported - as a reproducibility cache — identical bytes expected, verified - against a target self-build, not trusted by assumption. - -## The seed tools - -One ELF per tool per arch. Each tool is invoked by absolute path -from the Lisp build driver (e.g. `/seed/sha256sum foo.tar`). Each -binary links against the same vendored M2libc portable layer and -the same P1 syscall stubs. - -### Inventory - -| Tool / layer | Purpose | Source / LOC | -|--------------------|---------------------------------------------|-------------------------| -| `untar` | POSIX ustar extract (no gzip, no creation) | mescc-tools-extra/untar.c (460) | -| `patch-apply` | apply a unified diff in-place | simple-patch-1.0 (~200) | -| `sha256sum` | verify source tarball hashes | mescc-tools-extra/sha256sum.c (586) | -| `cp` | copy one file | mescc-tools-extra/cp.c (332) | -| `mkdir` | single-level directory create | mescc-tools-extra/mkdir.c (117) | -| `rm` | remove one file (no `-r`, no `-f`) | mescc-tools-extra/rm.c (54) | -| `cat` | concatenate files to stdout | mescc-tools-extra/catm.c (69) | -| libc (portable) | stdio, string, stdlib, ctype, fcntl | vendored M2libc (~1,500) | -| syscall stubs | per-arch bridge below M2libc | ~120 lines of P1, not C | -| **Total C** | | **~3,300, fully vendored** | - -Deliberately excluded: `test`, `echo`, `mv`. The Lisp driver does -any conditional or rename logic it needs in Lisp, and emits -progress messages via its own `write` calls — no externalised -shell utilities needed for those concerns. - -The driver is Lisp code, not a shell script; see `PLAN.md`'s -"Build driver" section for the control flow. - -## Syscall surface - -The seed tools collectively need **7 syscalls** (process spawn -lives in the Lisp driver, not in the tools). 
- -| Syscall | Used by | -|------------|-------------------------------------------| -| `read` | all file-reading tools | -| `write` | stdout/stderr, all file-writing | -| `openat` | file open (`AT_FDCWD` + `O_RDONLY` / `O_WRONLY|O_CREAT|O_TRUNC` with mode) | -| `close` | all file ops | -| `exit` | program termination | -| `mkdir` | `mkdir` tool, `untar` (directory entries) | -| `unlink` | `rm` tool | - -PLAN.md's Lisp surface is 8 syscalls (`read`, `write`, `openat`, -`close`, `exit`, `clone`, `execve`, `waitid`). The seed tools add -`mkdir` and `unlink` on top of that, for a window total of **10 -distinct syscalls**. Each gets one row in every `p1_<arch>.M1` -defs file. Deliberately excluded: `stat/fstat`, `access`, -`rename`, `chmod` (rely on `openat` mode bits for initial perms), -`lseek` (all reads are sequential), `getdents`/`readdir` (no -directory traversal needed), `dup`/`pipe`/signals/time/net. - -### How seed tools reach syscalls - -The Lisp-hosted C compiler has no inline asm and no intrinsics. Each -syscall is exposed as an ordinary `extern` function declaration, -backed by a hand-written P1 stub in `runtime.p1`. The stubs are ~3 P1 -ops each (load number, `SYSCALL`, `RET`), totalling ~40 lines of P1 -for the whole surface. - -``` -:sys_write ; C args arrive in P1 r1-r6 per call ABI - SYSCALL write ; expands per-arch via p1_<arch>.M1 defs - RET -``` - -``` -extern int sys_write(int fd, char *buf, int n); -``` - -Prerequisite: P1 picks its argument registers (`r1–r6`) to coincide -with the native syscall arg registers on each arch (`rdi/rsi/…`, -`x0–x5`, `a0–a5`), so stubs need no register shuffling beyond what -`SYSCALL` already does. Confirm this in `P1.md` during implementation. - -Return convention: Linux returns `-errno` (values in `-1..-4095`) in -the result register. Wrappers return the raw integer; callers test -`r >u 0xfffff000` to detect failure and abort with a message. No -`errno` global, no per-tool error recovery. 
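The return convention above can be sketched in C. Values in `-1..-4095`, reinterpreted as unsigned 32-bit integers, are exactly the values strictly above `0xfffff000`, so one unsigned comparison separates failures from valid results. The helper names `sys_failed`, `sys_errno`, and `check` are hypothetical, and the sketch assumes a 32-bit `int`, matching the `extern int` wrapper declaration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helpers (names not from the original docs).
 * Assumes int is 32 bits wide, as in the extern int wrappers. */

/* True when r is a raw Linux failure value, i.e. in -4095..-1. */
static int sys_failed(int r)
{
    return (unsigned)r > 0xfffff000u;
}

/* Recover the positive errno from a failed return; 0 on success. */
static int sys_errno(int r)
{
    return sys_failed(r) ? -r : 0;
}

/* Abort-on-failure wrapper: no errno global, no recovery. */
static int check(int r, const char *what)
{
    assert(what != NULL);
    if (sys_failed(r)) {
        fprintf(stderr, "%s failed: errno %d\n", what, sys_errno(r));
        exit(1);
    }
    return r;
}
```

A tool would wrap each raw call, for example `n = check(sys_write(1, buf, len), "write")`, and never inspect the error further.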
- -## Build ordering inside the seed window - -Once the Lisp interpreter binary exists and the C compiler Lisp -source is loaded (both per `PLAN.md`): - -1. Compile each seed tool independently: its vendored source plus - the vendored M2libc layer plus the per-arch P1 syscall stubs → - P1 text → M1 → hex2 → one ELF per tool. Per-arch, repeat for - each target. -2. Install the tools into a single directory on the target (e.g. - `/seed/`). No other setup required. - -The tcc-boot build runs as a Lisp program invoked on the Lisp -interpreter. The driver: - -1. Spawns `/seed/sha256sum upstream.tar` and checks against pinned - hash. -2. Spawns `/seed/untar upstream.tar`. -3. For each patch file: spawns `/seed/patch-apply patches/foo.diff`. -4. Iterates over tcc-boot `.c` files. For each one, calls the - Lisp-hosted C compiler in-process to emit P1 text, then spawns - M1 and hex2 to produce the object or final linked binary. -5. Installs the tcc-boot binary. - -See `PLAN.md` "Build driver" for the spawn-and-wait primitive. -Seed window is closed. - -## Target self-build vs cross-build - -**Target self-build (primary).** A fresh machine of arch `A` starts -from the stage0-posix hex seed, runs the hex0→hex1→hex2→M1 chain, -loads `p1_A.M1`, assembles the Lisp interpreter, loads the C -compiler into Lisp, runs the Lisp build-driver program, which -compiles each seed tool, then compiles and links tcc-boot. -stage0-posix's own `kaem` runs the early hex0→M1 chain; above M1, -the Lisp program takes over. - -**Cross-build cache (secondary).** On an already-bootstrapped -machine, produce the seed tool binaries for all three arches and -ship them as tarballs. Users who opt into this skip the target -self-build and land directly at "seed tools installed." Trust -claim: **none by assumption** — the cache is only trusted after a -target self-build of at least one arch has verified byte-identical -output. Cross-build is an optimization, not a trust input. 
- -## Provenance - -Artifacts flowing in: - -- **stage0-posix hex seed + P1 defs**: part of this repo, audited - with the rest of it. -- **Lisp interpreter source (in P1) and C compiler (in Lisp)**: - part of this repo, covered by `PLAN.md`. -- **Vendored seed C sources**: pinned snapshots of - live-bootstrap's `mescc-tools-extra` (catm, cp, mkdir, rm, - sha256sum, untar), `simple-patch-1.0`, and M2libc's portable - layer (the libc the mescc-tools sources depend on — stdio, - string, stdlib, ctype, fcntl, bootstrappable). All shipped - verbatim as `.tar` files with sha256 pinned. Local adaptations - ride as unified diffs in the repo, applied by `patch-apply` at - build time so "upstream vs ours" stays legible. -- **Upstream tcc-boot source**: mirrored as uncompressed `.tar` at - a pinned URL + sha256. The mirror file is one of this repo's - auditable inputs; it can be re-derived from upstream by untaring - and retaring in a canonical form, or checked against upstream's - published `.tar.gz` by re-gzipping and comparing hashes on a - machine that has `gzip` (done once, out of band). - -No C is authored fresh in this repo for the seed window; the only -things written here are unified-diff patches against the vendored -tree and the per-arch P1 syscall stubs. - -`sha256sum` is the single seed tool whose correctness has a direct -trust consequence downstream; unit-test it against known vectors -(empty string, "abc", "abcdbcde..."-length tests) before declaring -the seed build complete. - -## Interaction with tcc-boot - -tcc-boot expects a build environment roughly like `cc + make + sh + -coreutils`. Mapping: - -| tcc-boot expects | Seed provides | -|------------------|--------------------------------------------------| -| `cc` / `gcc` | Lisp-hosted C compiler, invoked in-process per `.c` | -| `make` | Lisp driver program (tcc-boot is simple enough) | -| `sh` | not provided — the Lisp driver spawns tools directly | -| `cat`/`cp`/etc. 
| individual seed-tool binaries at absolute paths | -| `ld` | tcc-boot's built-in linker (for its own output) | -| `ar` | not needed; tcc-boot builds one static binary | - -Any translation from tcc-boot's literal build-command names -(`cc`, `make`, `install`) to seed tools lives in Lisp, not in a -separate shim script. - -## Budget rollup - -Fresh auditable LOC introduced by this document, on top of PLAN.md: - -| Layer | LOC | -|--------------------------------------------------------|---------| -| seed tools — vendored mescc-tools-extra + simple-patch | ~1,800 | -| seed tools — vendored M2libc portable layer | ~1,500 | -| syscall stubs (P1, not C) | ~120 | -| Lisp build-driver program | counted in PLAN.md | -| **Seed window addition** | **~3,300 C (all vendored) + ~120 P1** | - -Combined PLAN.md + SEED.md audit surface: **~13–17k LOC**, tri-arch, -M2-Planet-free and Mes-free. No fresh C is authored for the seed -window; the entire ~3,300 LOC is audited upstream code written -against M2-Planet's C subset. The build driver is Lisp code -counted against PLAN.md (~100–200 LOC). - -## Handoff notes for the engineer - -Approximate build order for implementation: - -1. **C compiler in Lisp** (blocks everything below). Per `PLAN.md`; - validate on a small corpus before touching seed. -2. **Vendor M2libc's portable layer** and write the per-arch P1 - syscall stubs that back its declarations. Bring-up test: link - `catm.c` (69 LOC) against this libc and run it. -3. **Vendor mescc-tools-extra + simple-patch.** Pin sha256s. - Confirm each source compiles unmodified through the Lisp-hosted - C compiler; if anything trips, capture the delta as a unified - diff rather than editing the vendored tree in place. -4. **Build the small tools** individually (`cat`, `cp`, `mkdir`, - `rm`) — each is its own ELF. -5. **`sha256sum`** with unit tests (empty / "abc" / long vectors) - before anything depends on its correctness. -6. **`untar`** (ustar extract only). -7. 
**`patch-apply`** (unified-diff in-place). -8. **End-to-end bring-up**: Lisp build-driver running - `sha256sum` → `untar` → `patch-apply` → in-process C-compile - loop (spawning M1/hex2 per `.c`) → linked tcc-boot. First full - trip through the seed window. - -Each step compiles standalone C and assembles through the existing -P1 → M1 → hex2 path; no new tooling infrastructure is needed -between steps.