commit 505d30e7b343b771d0225501634bf5e9c1ca1b7f
parent 3e6ce5ada262294ca065705bdf65003ebf823fa1
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Mon, 20 Apr 2026 13:20:20 -0700
Move P1.md into the repo
Previously P1.md lived at ../P1.md, outside any git repo. Versioning it
alongside the code that implements it keeps the spec and the per-arch
defs files in sync, and means the repo is self-contained — a reader
cloning lispcc gets the spec too.
Updates every `../P1.md` reference to the in-repo path.
Diffstat:
7 files changed, 310 insertions(+), 6 deletions(-)
diff --git a/Makefile b/Makefile
@@ -2,7 +2,7 @@
#
# hello.M1 is written in P1 mnemonics and assembles unchanged for all
# three targets. The backing defs file (p1_<arch>.M1) is the only per-
-# arch source. See ../P1.md.
+# arch source. See P1.md.
#
# Two-image setup:
# - lispcc-builder (alpine + gcc, host-arch): builds M1 and hex2
diff --git a/P1.md b/P1.md
@@ -0,0 +1,304 @@
+# P1: A Portable Pseudo-ISA for M1
+
+## Motivation
+
+The stage0/live-bootstrap chain uses M1 (the mescc-tools macro assembler) as
+the lowest human-writable layer above raw hex. M1 itself is architecture-
+agnostic — it only knows `DEFINE name hex_bytes` — but every real M1 program
+in stage0 (including the seed C compiler `cc_*.M1`) is hand-written per arch.
+To write, say, a seed Lisp interpreter portably across amd64, aarch64, and
+riscv64 without reaching for M2-Planet, we need a thin portable layer: a
+pseudo-ISA whose mnemonics expand, per arch, to native encodings.
+
+P1 is that layer. The goal is an unoptimized RISC-shaped instruction set,
+hand-writable in M1 source, that assembles to three host ISAs via per-arch
+`DEFINE` tables on top of existing `M1` + `hex2` unchanged.
+
+## Non-goals
+
+- **Not an optimizing backend.** P1 is deliberately dumb. An `ADD rD, rA, rB`
+ on amd64 expands to `mov rD, rA; add rD, rB` unconditionally — no peephole
+ recognition of the `rD == rA` case. Paying ~2× code size is fine for a seed.
+- **Not ABI-compatible with platform C.** P1 programs are sovereign: direct
+ Linux syscalls, no libc linkage. Interop thunks can be written later if
+ needed.
+- **Not 32-bit.** x86-32, armv7l, riscv32 are out of scope for v1. Adding them
+ later means a separate defs file and some narrowing in the register model.
+- **Not self-hosting.** P1 is a target for humans, not a compiler IR. If you
+ want a compiler, write it in subset-C and use M2-Planet.
+
+## Current status (v0.1 spike)
+
+Built in `lispcc/`: `hello.M1` and a compute/print/exit `demo.M1`, both
+written in P1 mnemonics, assemble unchanged across aarch64, amd64, and
+riscv64 from per-arch `p1_<arch>.M1` defs. Ops exercised so far: `LI`,
+`SYSCALL`, `MOV`, `ADD`, `SUB` (the last three in specific
+register-tuple forms). Runs on stock `M1` + `hex2_linker` (amd64,
+aarch64) / `hex2_word` (riscv64). Run with `make PROG=demo run-all`
+from `lispcc/`.
+
+The rest of the ISA (branches, `CALL`/`RET`, `TAIL`, `LD`/`ST`,
+logical/shift/mul-div) is reachable with the same tooling by
+generalizing the inline-data `LI` pattern: load the branch target with
+`P1_LI_R7 &label` and jump through the register. Local conditional
+skips over a known number of instructions are hand-encoded as plain
+hex distances inside the DEFINE.
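+
+Sketched, a conditional branch decomposes as (illustrative expansion
+shape, not the spike's actual defs):
+
+```
+# BEQ rA, rB, %target: expansion shape inside the per-arch DEFINE
+P1_LI_R7 &target    # inline-data LI: address of %target into r7
+                    # native: compare rA, rB
+                    # native: branch-if-not-equal over the next insn
+                    #         (fixed, hand-encoded hex distance)
+                    # native: indirect jump through r7
+```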
+
+### Spike deviations from the design
+
+- Wide immediates use a per-`LI` inline literal slot (one PC-relative
+ load insn plus a 4-byte data slot, skipped past) rather than a shared
+ pool. Keeps the spike pool-free at the cost of one skip-branch per
+ `LI`. A pool can be reintroduced later without changes to P1 source.
+- `LI` is 4-byte zero-extended today; 8-byte absolute is deferred until
+ a program needs it. All current references are to addresses under
+ 4 GiB, so `&label` + a 4-byte zero pad suffices.
+- The per-tuple DEFINE table is hand-written for the handful of
+ MOV/ADD/SUB register tuples `demo.M1` uses. The ~1500-entry generator
+ is still future work.
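+
+On aarch64, for instance, the inline-literal slot could take this shape
+(an assumed layout matching the description above, not the spike's
+exact bytes; a 64-bit `ldr` over an 8-byte slot including the zero pad
+is the other possibility):
+
+```
+# LI r0, &label: one load, one skip-branch, one data slot
+ldr w0, .+8      # PC-relative 32-bit load, zero-extends into x0
+b   .+8          # skip over the data slot
+# .word &label   # 4-byte little-endian address slot
+```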
+
+## Design decisions
+
+| Decision | Choice | Why |
+|----------------|-----------------------------------------------|--------------------------------------------|
+| Word size | 64-bit | All three target arches are 64-bit native |
+| Endianness | Little-endian | All three agree |
+| Registers | 8 GPRs (`r0`–`r7`) + `sp`, `lr`-on-stack | Fits x86-64's usable register budget |
+| Narrow imm | Signed 12-bit | riscv I-type width; aarch64 ≤12 also OK |
+| Wide imm | Pool-loaded via PC-relative `LI` | Avoids arch-specific immediate synthesis |
+| Calling conv   | r0 = return, r1–r5 = args, r6–r7 callee-saved | P1-defined; not platform ABI               |
+| Return address | Always spilled to stack on entry | Hides x86's missing `lr` uniformly |
+| Syscall | `SYSCALL` with num in r0, args r1–r6 | Per-arch wrapper emits native sequence |
+
+## Register mapping
+
+`r6`/`r7` map to callee-saved native registers on every arch; all other
+GPRs map to caller-saved ones.
+
+| P1 | amd64 | aarch64 | riscv64 |
+|------|-------|---------|---------|
+| `r0` | `rax` | `x0` | `a0` |
+| `r1` | `rdi` | `x1` | `a1` |
+| `r2` | `rsi` | `x2` | `a2` |
+| `r3` | `rdx` | `x3` | `a3` |
+| `r4` | `r10` | `x4` | `a4` |
+| `r5` | `r8` | `x5` | `a5` |
+| `r6` | `rbx` | `x19` | `s1` |
+| `r7` | `r12` | `x20` | `s2` |
+| `sp` | `rsp` | `sp` | `sp` |
+| `lr` | (mem) | `x30` | `ra` |
+
+x86-64 has no link register; `CALL`/`RET` macros push/pop the return address
+on the stack. On aarch64/riscv64, the prologue spills `lr` (`x30`/`ra`) to
+the stack too, so all three converge on "return address lives in
+`[sp + 0]` after prologue." This uniformity is worth the extra store on the
+register-rich arches.
+
+## Instruction set (~30 ops)
+
+```
+# 3-operand arithmetic (reg forms)
+ADD rD, rA, rB SUB rD, rA, rB
+AND rD, rA, rB OR rD, rA, rB XOR rD, rA, rB
+SHL rD, rA, rB SHR rD, rA, rB SAR rD, rA, rB
+MUL rD, rA, rB DIV rD, rA, rB REM rD, rA, rB
+
+# Immediate forms (signed 12-bit)
+ADDI rD, rA, !imm ANDI rD, rA, !imm ORI rD, rA, !imm
+SHLI rD, rA, !imm SHRI rD, rA, !imm SARI rD, rA, !imm
+
+# Moves
+MOV rD, rA # reg-to-reg
+LI rD, %label # load 64-bit literal from pool
+LA rD, %label # load PC-relative address
+
+# Memory (offset is signed 12-bit)
+LD rD, rA, !off ST rS, rA, !off # 64-bit
+LW rD, rA, !off SW rS, rA, !off # 32-bit zero-extended / truncated
+LB rD, rA, !off SB rS, rA, !off # 8-bit zero-extended / truncated
+
+# Control flow
+B %label # unconditional branch
+BEQ rA, rB, %label BNE rA, rB, %label
+BLT rA, rB, %label BGE rA, rB, %label # signed
+BLTU rA, rB, %label BGEU rA, rB, %label # unsigned
+CALL %label RET
+TAIL %label # tail call: epilogue + B %label
+
+# System
+SYSCALL # num in r0, args r1-r6, ret in r0
+```
+
+### Semantics
+
+- All arithmetic is on 64-bit values. `SHL`/`SHR`/`SAR` take shift amount in
+ the low 6 bits of `rB` (or the `!imm` for immediate forms).
+- `DIV` is signed, truncated toward zero. `REM` matches `DIV`.
+- `LW`/`LB` zero-extend the loaded value into the 64-bit destination.
+ (Signed-extending variants — `LWS`, `LBS` — can be added later if needed.)
+- Branch offsets are PC-relative. In the v0.1 spike they are realized by
+ loading the target address via `LI` into `r7` and jumping through the
+ register; range is therefore unbounded within the 4 GiB address space.
+ Native-encoded branches (with tighter range limits) are an optional
+ future optimization.
+- `CALL` pushes a return address, jumps to target. `RET` pops and jumps.
+  Prologue convention: function entry does `ADDI sp, sp, -16; ST lr, sp, 0`
+  (or arch-specific equivalent), leaving the return address at `[sp + 0]`;
+  the epilogue reverses.
+- `TAIL %label` is a tail call — it performs the callee's standard
+ epilogue (restore `lr` from `[sp+0]`, pop the frame) and then branches
+ unconditionally to `%label`, reusing the caller's return address
+ instead of pushing a new frame. The current function must be using the
+ standard prologue. Interpreter `eval` loops rely on `TAIL` to recurse
+ on sub-expressions without growing the stack.
+- `SYSCALL` is a single opcode in P1 source. Each arch's defs file expands it
+ to the native syscall sequence, including the register shuffle from P1's
+ `r0`=num, `r1`–`r6`=args convention into the platform's native convention
+ if different.
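+
+A minimal function under these conventions might read (P1 source
+sketch; the label and the negative-immediate spelling are
+illustrative, and per-arch expansion details differ as described
+above):
+
+```
+double:                  # r0 = r1 + r1
+    ADDI sp, sp, !-16    # prologue: allocate frame
+    ST   lr, sp, !0      # return address now at [sp + 0]
+    ADD  r0, r1, r1
+    LD   lr, sp, !0      # epilogue: restore return address
+    ADDI sp, sp, !16
+    RET
+```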
+
+## Encoding strategy
+
+For each `(op, register-tuple)` combination, emit one `DEFINE` per arch. A
+generator script produces the full defs file; no hand-encoding per entry.
+
+Example — `ADD r0, r1, r2`:
+
+```
+# p1_riscv64.M1
+DEFINE P1_ADD_R0_R1_R2 3385C500 # add a0, a1, a2 (little-endian)
+
+# p1_aarch64.M1
+DEFINE P1_ADD_R0_R1_R2 2000028B # add x0, x1, x2
+
+# p1_amd64.M1 (2-op destructive — expands to mov + add)
+DEFINE P1_ADD_R0_R1_R2 4889F84801F0 # mov rax, rdi ; add rax, rsi
+```
+
+### Combinatorial footprint
+
+Per-arch defs count (immediates handled by sigil, not enumerated):
+
+- 11 reg-reg-reg arith × 8 `rD` × 8 `rA` × 8 `rB` = 5632. Pruned to ~600 by
+  removing trivially-equivalent tuples.
+- 6 immediate arith × 8² = 384. Each entry uses an immediate sigil (`!imm`),
+ so the immediate value itself is not enumerated.
+- 3 move ops × 8 or 8² = ~80.
+- 6 memory ops × 8² = 384. Offsets use `!imm` sigil.
+- 6 two-reg conditional branches × 8² = 384 (`B` itself takes no registers).
+- Singletons (`B`, `CALL`, `RET`, `TAIL`, `SYSCALL`) = 5.
+
+Total ≈ 1800 defines per arch. Template-generated.
+
+## Syscall conventions
+
+Linux syscall mechanics differ across arches. The `SYSCALL` macro hides this.
+
+| Arch | Insn | Num reg | Arg regs (plat ABI) |
+|----------|-----------|---------|------------------------------|
+| amd64 | `syscall` | `rax` | `rdi, rsi, rdx, r10, r8, r9` |
+| aarch64 | `svc #0` | `x8` | `x0 – x5` |
+| riscv64 | `ecall` | `a7` | `a0 – a5` |
+
+On all three arches `SYSCALL` must shuffle registers. On aarch64 and
+riscv64 the kernel takes the number in `x8`/`a7` and args in
+`x0`–`x5` / `a0`–`a5`, while P1 holds the number in `r0` (`x0`/`a0`)
+and args in `r1`–`r6`; the expansion moves the number into `x8`/`a7`,
+slides each arg down one register (`x0 ← x1` … `x5 ← x19`; likewise
+`a0 ← a1` … `a5 ← s1`), then issues `svc #0`/`ecall`.
+
+On amd64 the P1 mapping matches the syscall ABI for `rax` (number and
+return) and the first five args (`rdi, rsi, rdx, r10, r8`), so
+`SYSCALL` expands to a bare `syscall` for calls of up to five args;
+only a sixth arg needs a `mov r9, rbx` first.
+
+### Syscall numbers
+
+Linux uses two syscall tables relevant here:
+
+- **amd64**: amd64-specific table (`write = 1`, `exit = 60`, …).
+- **aarch64 and riscv64**: generic table (`write = 64`, `exit = 93`, …).
+
+P1 programs use symbolic constants (`SYS_WRITE`, `SYS_EXIT`) defined per-arch:
+
+```
+# p1_amd64.M1
+DEFINE SYS_WRITE 01000000
+DEFINE SYS_EXIT 3C000000
+
+# p1_aarch64.M1 and p1_riscv64.M1
+DEFINE SYS_WRITE 40000000
+DEFINE SYS_EXIT 5D000000
+```
+
+(The values are the syscall numbers encoded as little-endian 32-bit
+immediates, ready to be consumed as operands to `LI` or `ADDI`.)
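+
+A call site then reads (P1 source sketch; `FD_STDOUT`, `MSG_LEN`, and
+`%msg` are hypothetical defines/labels):
+
+```
+LI r0, SYS_WRITE   # syscall number, inlined by the per-arch DEFINE
+LI r1, FD_STDOUT   # hypothetical DEFINE 01000000 (fd 1)
+LA r2, %msg        # buffer address
+LI r3, MSG_LEN     # hypothetical DEFINE holding the byte count
+SYSCALL
+```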
+
+## Program layout
+
+Each P1 object file is structured as:
+
+```
+<ELF header, per arch>
+<code section>
+ <function prologues, bodies, epilogues>
+<constant pool>
+ pool_label_1: &0xDEADBEEFCAFEBABE
+ pool_label_2: &0x00000000004004C0
+ ...
+<data section>
+ <static bytes>
+```
+
+`LI rD, %pool_label_N` issues a PC-relative load; the pool must be reachable
+within the relocation's range (≤ ±1 MiB for an aarch64 `LDR` literal, ≤ ±2 GiB
+for riscv `AUIPC`+`LD`, ≤ ±2 GiB for x86-64 `mov rD, [rip + rel32]`).
+
+For programs under a few MiB, a single pool per file is fine. For larger
+programs, emit a pool per function.
+
+## Staged implementation plan
+
+1. **Spike across all three arches.** *Done.* `lispcc/hello.M1` and
+ `lispcc/demo.M1` run on aarch64, amd64, and riscv64 via existing
+ `M1` + `hex2_linker` (amd64, aarch64) / `hex2_word` (riscv64). Ops
+   demonstrated: `LI`, `SYSCALL`, `MOV`, `ADD`, `SUB`. The hex2
+   extensions once anticipated for aarch64 were *not* needed —
+   the inline-data `LI` trick sidesteps them. Order was reversed from
+ the original plan: aarch64 first (where the trick was designed),
+ then amd64 and riscv64.
+2. **Broaden the demonstrated op set.** Next. Control flow (`B`, `BEQ`,
+ `CALL`, `RET`, `TAIL`), loads/stores (`LD`/`ST`/`LW`/`LB`), and the
+ rest of arithmetic/logical/shift/mul-div. All reachable with stock
+ hex2; no extensions required. A loop-and-branch demo (e.g. print
+ digits 0–9, or sum 1..N) is the natural next program — it forces
+ conditional branching through the indirect-via-r7 pattern.
+3. **Generator for the ~30-op × register matrix.** Hand-maintenance of
+ the per-tuple DEFINEs becomes painful past ~20 entries. A small
+ template script produces `p1_<arch>.M1` from a shared op table.
+4. **Cross-arch differential harness.** Assemble each P1 source three
+ ways and diff runtime behavior. Currently eyeballed via
+ `make run-all`.
+5. **Write something real.** Seed Lisp interpreter (cons, car, cdr, eq,
+ atom, cond, lambda, quote) in ~500 lines of P1, running identically
+ on all three targets.
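+
+The step-2 loop demo might look like (P1 source sketch; `ZERO`, `ONE`,
+and `ELEVEN` are hypothetical small-constant defines):
+
+```
+# sum 1..10 into r0
+LI   r0, ZERO        # accumulator = 0
+LI   r1, ONE         # i = 1
+LI   r2, ELEVEN      # loop bound
+loop:
+ADD  r0, r0, r1      # acc += i
+ADDI r1, r1, !1      # i += 1
+BNE  r1, r2, %loop   # repeat until i == 11
+```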
+
+## Open questions
+
+- **Can we reuse hand-written `SYSCALL`/syscall-number conventions already in
+ stage0's arch ports?** Probably yes — adopt the conventions already in
+ `M2libc/<arch>/` to minimize surprise.
+- **Signed-extending loads.** Skipped for v1 — add `LBS`, `LWS` if the Lisp
+ interpreter needs them.
+- **Atomic / multi-core.** Not in scope. Seed interpreters are single-
+ threaded.
+- **Debug info.** `blood-elf` generates M1-format debug tables; we'd need to
+ decide whether P1 flows through it unchanged. Likely yes since P1 is just
+ another M1 source.
+- **x86-32 / armv7l / riscv32 support.** Requires narrowing the register
+ model and splitting word size. Defer.
+
+## Scope
+
+- **Defs files**: ~1500 entries × 3 arches, generator-driven.
+- **Testing**: shared harness that assembles each P1 source three ways
+ and diffs runtime behavior.
+
+The output is a single portable ISA above which any seed-stage program
+(Lisp, Forth, a smaller C compiler) can be written once and run on three
+hosts. Below M2-Planet in the chain, above raw M1. Leans entirely on
+existing `M1` + `hex2` — no toolchain modifications.
diff --git a/PLAN.md b/PLAN.md
@@ -4,7 +4,7 @@
Shrink the auditable LOC between M1 assembly and tcc-boot by replacing the
current `M2-Planet → mes → MesCC → nyacc` stack with a small Lisp written
-once in the P1 portable pseudo-ISA (see `../P1.md`) and a C compiler written
+once in the P1 portable pseudo-ISA (see [P1.md](P1.md)) and a C compiler written
in that Lisp. P1 is the same layer described in `P1.md`: ~30 RISC-shaped ops
whose per-arch `DEFINE` tables expand to amd64 / aarch64 / riscv64 encodings,
so one Lisp source serves all three hosts.
diff --git a/README.md b/README.md
@@ -7,7 +7,7 @@ Goal is a 4–6× shrink in auditable LOC. See [PLAN.md](PLAN.md).
## Status
-Stage 0: hello-world in the P1 portable pseudo-ISA (see `../P1.md`),
+Stage 0: hello-world in the P1 portable pseudo-ISA (see [P1.md](P1.md)),
assembled and run inside a pristine alpine container on all three target
arches (aarch64, amd64, riscv64). The same `hello.M1` source assembles
for every arch; only the backing `p1_<arch>.M1` defs file varies.
diff --git a/p1_aarch64.M1 b/p1_aarch64.M1
@@ -4,7 +4,7 @@
## plus a handful of MOV/ADD/SUB register tuples. The full 1500-entry
## table described in P1.md is generator-driven; what's here is the
## spike's hand-written sliver.
-## See ../P1.md for the full ISA and register mapping.
+## See P1.md for the full ISA and register mapping.
##
## Register mapping (P1 → aarch64):
## r0 → x0 , r1 → x1 , r2 → x2 , r3 → x3
diff --git a/p1_amd64.M1 b/p1_amd64.M1
@@ -1,7 +1,7 @@
## P1 pseudo-ISA — amd64 backing defs (v0.1 spike)
##
## Implements the subset needed by hello.M1 and demo.M1: LI, SYSCALL,
-## plus a handful of MOV/ADD/SUB register tuples. See ../P1.md.
+## plus a handful of MOV/ADD/SUB register tuples. See P1.md.
##
## Register mapping (P1 → amd64):
## r0 → rax , r1 → rdi , r2 → rsi , r3 → rdx
diff --git a/p1_riscv64.M1 b/p1_riscv64.M1
@@ -1,7 +1,7 @@
## P1 pseudo-ISA — riscv64 backing defs (v0.1 spike)
##
## Implements the subset needed by hello.M1 and demo.M1: LI, SYSCALL,
-## plus a handful of MOV/ADD/SUB register tuples. See ../P1.md.
+## plus a handful of MOV/ADD/SUB register tuples. See P1.md.
##
## Register mapping (P1 → RISC-V):
## r0 → a0 (x10) , r1 → a1 (x11) , r2 → a2 (x12) , r3 → a3 (x13)