commit 505d30e7b343b771d0225501634bf5e9c1ca1b7f
parent 3e6ce5ada262294ca065705bdf65003ebf823fa1
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Mon, 20 Apr 2026 13:20:20 -0700
Move P1.md into the repo
Previously P1.md lived at ../P1.md, outside any git repo. Versioning it
alongside the code that implements it keeps the spec and the per-arch
defs files in sync, and means the repo is self-contained — a reader
cloning lispcc gets the spec too.
Updates every `../P1.md` reference to the in-repo path.
Diffstat:
7 files changed, 310 insertions(+), 6 deletions(-)
diff --git a/Makefile b/Makefile
@@ -2,7 +2,7 @@
#
# hello.M1 is written in P1 mnemonics and assembles unchanged for all
# three targets. The backing defs file (p1_<arch>.M1) is the only per-
-# arch source. See ../P1.md.
+# arch source. See P1.md.
#
# Two-image setup:
# - lispcc-builder (alpine + gcc, host-arch): builds M1 and hex2
diff --git a/P1.md b/P1.md
@@ -0,0 +1,304 @@
+# P1: A Portable Pseudo-ISA for M1
+
+## Motivation
+
+The stage0/live-bootstrap chain uses M1 (the mescc-tools macro assembler) as
+the lowest human-writable layer above raw hex. M1 itself is architecture-
+agnostic — it only knows `DEFINE name hex_bytes` — but every real M1 program
+in stage0 (including the seed C compiler `cc_*.M1`) is hand-written per arch.
+To write, say, a seed Lisp interpreter portably across amd64, aarch64, and
+riscv64 without reaching for M2-Planet, we need a thin portable layer: a
+pseudo-ISA whose mnemonics expand, per arch, to native encodings.
+
+P1 is that layer. The goal is an unoptimized RISC-shaped instruction set,
+hand-writable in M1 source, that assembles to three host ISAs via per-arch
+`DEFINE` tables on top of existing `M1` + `hex2` unchanged.
+
+## Non-goals
+
+- **Not an optimizing backend.** P1 is deliberately dumb. An `ADD rD, rA, rB`
+ on amd64 expands to `mov rD, rA; add rD, rB` unconditionally — no peephole
+ recognition of the `rD == rA` case. Paying ~2× code size is fine for a seed.
+- **Not ABI-compatible with platform C.** P1 programs are sovereign: direct
+ Linux syscalls, no libc linkage. Interop thunks can be written later if
+ needed.
+- **Not 32-bit.** x86-32, armv7l, riscv32 are out of scope for v1. Adding them
+ later means a separate defs file and some narrowing in the register model.
+- **Not self-hosting.** P1 is a target for humans, not a compiler IR. If you
+ want a compiler, write it in subset-C and use M2-Planet.
+
+## Current status (v0.1 spike)
+
+Built in `lispcc/`: `hello.M1` and a compute/print/exit `demo.M1`, both
+written in P1 mnemonics, assemble unchanged across aarch64, amd64, and
+riscv64 from per-arch `p1_<arch>.M1` defs. Ops exercised so far: `LI`,
+`SYSCALL`, `MOV`, `ADD`, `SUB` (the last three in specific
+register-tuple forms). Runs on stock `M1` + `hex2_linker` (amd64,
+aarch64) / `hex2_word` (riscv64). Run with `make PROG=demo run-all`
+from `lispcc/`.
+
+The rest of the ISA (branches, `CALL`/`RET`, `TAIL`, `LD`/`ST`,
+logical/shift/mul-div) is reachable with the same tooling by
+generalizing the inline-data `LI` pattern: load the branch target with
+`P1_LI_R7 &label` and jump through the register. Local conditional
+skips over a known number of instructions are hand-encoded as plain
+hex distances inside the DEFINE.
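+
+Sketched, a conditional branch decomposes as (illustrative expansion
+shape, not the spike's actual defs):
+
+```
+# BEQ rA, rB, %target: expansion shape inside the per-arch DEFINE
+P1_LI_R7 &target    # inline-data LI: address of %target into r7
+                    # native: compare rA, rB
+                    # native: branch-if-not-equal over the next insn
+                    #         (fixed, hand-encoded hex distance)
+                    # native: indirect jump through r7
+```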
+
+### Spike deviations from the design
+
+- Wide immediates use a per-`LI` inline literal slot (one PC-relative
+ load insn plus a 4-byte data slot, skipped past) rather than a shared
+ pool. Keeps the spike pool-free at the cost of one skip-branch per
+ `LI`. A pool can be reintroduced later without changes to P1 source.
+- `LI` is 4-byte zero-extended today; 8-byte absolute is deferred until
+ a program needs it. All current references are to addresses under
+ 4 GiB, so `&label` + a 4-byte zero pad suffices.
+- The per-tuple DEFINE table is hand-written for the handful of
+ MOV/ADD/SUB register tuples `demo.M1` uses. The ~1500-entry generator
+ is still future work.
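+
+On aarch64, for instance, the inline-literal slot could take this shape
+(an assumed layout matching the description above, not the spike's
+exact bytes; a 64-bit `ldr` over an 8-byte slot including the zero pad
+is the other possibility):
+
+```
+# LI r0, &label: one load, one skip-branch, one data slot
+ldr w0, .+8      # PC-relative 32-bit load, zero-extends into x0
+b   .+8          # skip over the data slot
+# .word &label   # 4-byte little-endian address slot
+```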
+
+## Design decisions
+
+| Decision | Choice | Why |
+|----------------|-----------------------------------------------|--------------------------------------------|
+| Word size | 64-bit | All three target arches are 64-bit native |
+| Endianness | Little-endian | All three agree |
+| Registers | 8 GPRs (`r0`–`r7`) + `sp`, `lr`-on-stack | Fits x86-64's usable register budget |
+| Narrow imm | Signed 12-bit | riscv I-type width; aarch64 ≤12 also OK |
+| Wide imm | Pool-loaded via PC-relative `LI` | Avoids arch-specific immediate synthesis |
+| Calling conv   | r0 = return, r1–r5 = args, r6–r7 callee-saved | P1-defined; not platform ABI               |
+| Return address | Always spilled to stack on entry | Hides x86's missing `lr` uniformly |
+| Syscall | `SYSCALL` with num in r0, args r1–r6 | Per-arch wrapper emits native sequence |
+
+## Register mapping
+
+`r6`/`r7` map to callee-saved native registers on every arch; all other
+GPRs map to caller-saved ones.
+
+| P1 | amd64 | aarch64 | riscv64 |
+|------|-------|---------|---------|
+| `r0` | `rax` | `x0` | `a0` |
+| `r1` | `rdi` | `x1` | `a1` |
+| `r2` | `rsi` | `x2` | `a2` |
+| `r3` | `rdx` | `x3` | `a3` |
+| `r4` | `r10` | `x4` | `a4` |
+| `r5` | `r8` | `x5` | `a5` |
+| `r6` | `rbx` | `x19` | `s1` |
+| `r7` | `r12` | `x20` | `s2` |
+| `sp` | `rsp` | `sp` | `sp` |
+| `lr` | (mem) | `x30` | `ra` |
+
+x86-64 has no link register; `CALL`/`RET` macros push/pop the return address
+on the stack. On aarch64/riscv64, the prologue spills `lr` (`x30`/`ra`) to
+the stack too, so all three converge on "return address lives in
+`[sp + 0]` after prologue." This uniformity is worth the extra store on the
+register-rich arches.
+
+## Instruction set (~30 ops)
+
+```
+# 3-operand arithmetic (reg forms)
+ADD rD, rA, rB SUB rD, rA, rB
+AND rD, rA, rB OR rD, rA, rB XOR rD, rA, rB
+SHL rD, rA, rB SHR rD, rA, rB SAR rD, rA, rB
+MUL rD, rA, rB DIV rD, rA, rB REM rD, rA, rB
+
+# Immediate forms (signed 12-bit)
+ADDI rD, rA, !imm ANDI rD, rA, !imm ORI rD, rA, !imm
+SHLI rD, rA, !imm SHRI rD, rA, !imm SARI rD, rA, !imm
+
+# Moves
+MOV rD, rA # reg-to-reg
+LI rD, %label # load 64-bit literal from pool
+LA rD, %label # load PC-relative address
+
+# Memory (offset is signed 12-bit)
+LD rD, rA, !off ST rS, rA, !off # 64-bit
+LW rD, rA, !off SW rS, rA, !off # 32-bit zero-extended / truncated
+LB rD, rA, !off SB rS, rA, !off # 8-bit zero-extended / truncated
+
+# Control flow
+B %label # unconditional branch
+BEQ rA, rB, %label BNE rA, rB, %label
+BLT rA, rB, %label BGE rA, rB, %label # signed
+BLTU rA, rB, %label BGEU rA, rB, %label # unsigned
+CALL %label RET
+TAIL %label # tail call: epilogue + B %label
+
+# System
+SYSCALL # num in r0, args r1-r6, ret in r0
+```
+
+### Semantics
+
+- All arithmetic is on 64-bit values. `SHL`/`SHR`/`SAR` take shift amount in
+ the low 6 bits of `rB` (or the `!imm` for immediate forms).
+- `DIV` is signed, truncated toward zero. `REM` matches `DIV`.
+- `LW`/`LB` zero-extend the loaded value into the 64-bit destination.
+ (Signed-extending variants — `LWS`, `LBS` — can be added later if needed.)
+- Branch offsets are PC-relative. In the v0.1 spike they are realized by
+ loading the target address via `LI` into `r7` and jumping through the
+ register; range is therefore unbounded within the 4 GiB address space.
+ Native-encoded branches (with tighter range limits) are an optional
+ future optimization.
+- `CALL` pushes a return address, jumps to target. `RET` pops and jumps.
+  Prologue convention: function entry does `ADDI sp, sp, -16; ST lr, sp, 0`
+  (or arch-specific equivalent), leaving the return address at `[sp + 0]`;
+  the epilogue reverses.
+- `TAIL %label` is a tail call — it performs the callee's standard
+ epilogue (restore `lr` from `[sp+0]`, pop the frame) and then branches
+ unconditionally to `%label`, reusing the caller's return address
+ instead of pushing a new frame. The current function must be using the
+ standard prologue. Interpreter `eval` loops rely on `TAIL` to recurse
+ on sub-expressions without growing the stack.
+- `SYSCALL` is a single opcode in P1 source. Each arch's defs file expands it
+ to the native syscall sequence, including the register shuffle from P1's
+ `r0`=num, `r1`–`r6`=args convention into the platform's native convention
+ if different.
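+
+A minimal function under these conventions might read (P1 source
+sketch; the label and the negative-immediate spelling are
+illustrative, and per-arch expansion details differ as described
+above):
+
+```
+double:                  # r0 = r1 + r1
+    ADDI sp, sp, !-16    # prologue: allocate frame
+    ST   lr, sp, !0      # return address now at [sp + 0]
+    ADD  r0, r1, r1
+    LD   lr, sp, !0      # epilogue: restore return address
+    ADDI sp, sp, !16
+    RET
+```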
+
+## Encoding strategy
+
+For each `(op, register-tuple)` combination, emit one `DEFINE` per arch. A
+generator script produces the full defs file; no hand-encoding per entry.
+
+Example — `ADD r0, r1, r2`:
+
+```
+# p1_riscv64.M1
+DEFINE P1_ADD_R0_R1_R2 3385C500 # add a0, a1, a2 (little-endian)
+
+# p1_aarch64.M1
+DEFINE P1_ADD_R0_R1_R2 2000028B # add x0, x1, x2
+
+# p1_amd64.M1 (2-op destructive — expands to mov + add)
+DEFINE P1_ADD_R0_R1_R2 4889F84801F0 # mov rax, rdi ; add rax, rsi
+```
+
+### Combinatorial footprint
+
+Per-arch defs count (immediates handled by sigil, not enumerated):
+
+- 11 reg-reg-reg arith × 8 `rD` × 8 `rA` × 8 `rB` = 5632. Pruned to ~600 by
+  removing trivially-equivalent tuples.
+- 6 immediate arith × 8² = 384. Each entry uses an immediate sigil (`!imm`),
+ so the immediate value itself is not enumerated.
+- 3 move ops × 8 or 8² = ~80.
+- 6 memory ops × 8² = 384. Offsets use `!imm` sigil.
+- 6 two-reg conditional branches × 8² = 384 (`B` itself takes no registers).
+- Singletons (`B`, `CALL`, `RET`, `TAIL`, `SYSCALL`) = 5.
+
+Total ≈ 1800 defines per arch. Template-generated.
+
+## Syscall conventions
+
+Linux syscall mechanics differ across arches. The `SYSCALL` macro hides this.
+
+| Arch | Insn | Num reg | Arg regs (plat ABI) |
+|----------|-----------|---------|------------------------------|
+| amd64 | `syscall` | `rax` | `rdi, rsi, rdx, r10, r8, r9` |
+| aarch64 | `svc #0` | `x8` | `x0 – x5` |
+| riscv64 | `ecall` | `a7` | `a0 – a5` |
+
+On all three arches `SYSCALL` must shuffle registers. On aarch64 and
+riscv64 the kernel takes the number in `x8`/`a7` and args in
+`x0`–`x5` / `a0`–`a5`, while P1 holds the number in `r0` (`x0`/`a0`)
+and args in `r1`–`r6`; the expansion moves the number into `x8`/`a7`,
+slides each arg down one register (`x0 ← x1` … `x5 ← x19`; likewise
+`a0 ← a1` … `a5 ← s1`), then issues `svc #0`/`ecall`.
+
+On amd64 the P1 mapping matches the syscall ABI for `rax` (number and
+return) and the first five args (`rdi, rsi, rdx, r10, r8`), so
+`SYSCALL` expands to a bare `syscall` for calls of up to five args;
+only a sixth arg needs a `mov r9, rbx` first.
+
+### Syscall numbers
+
+Linux uses two syscall tables relevant here:
+
+- **amd64**: amd64-specific table (`write = 1`, `exit = 60`, …).
+- **aarch64 and riscv64**: generic table (`write = 64`, `exit = 93`, …).
+
+P1 programs use symbolic constants (`SYS_WRITE`, `SYS_EXIT`) defined per-arch:
+
+```
+# p1_amd64.M1
+DEFINE SYS_WRITE 01000000
+DEFINE SYS_EXIT 3C000000
+
+# p1_aarch64.M1 and p1_riscv64.M1
+DEFINE SYS_WRITE 40000000
+DEFINE SYS_EXIT 5D000000
+```
+
+(The values are the syscall numbers encoded as little-endian 32-bit
+immediates, ready to be consumed as operands to `LI` or `ADDI`.)
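+
+A call site then reads (P1 source sketch; `FD_STDOUT`, `MSG_LEN`, and
+`%msg` are hypothetical defines/labels):
+
+```
+LI r0, SYS_WRITE   # syscall number, inlined by the per-arch DEFINE
+LI r1, FD_STDOUT   # hypothetical DEFINE 01000000 (fd 1)
+LA r2, %msg        # buffer address
+LI r3, MSG_LEN     # hypothetical DEFINE holding the byte count
+SYSCALL
+```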
+
+## Program layout
+
+Each P1 object file is structured as:
+
+```
+<ELF header, per arch>
+<code section>
+ <function prologues, bodies, epilogues>
+<constant pool>
+ pool_label_1: &0xDEADBEEFCAFEBABE
+ pool_label_2: &0x00000000004004C0
+ ...
+<data section>
+ <static bytes>
+```
+
+`LI rD, %pool_label_N` issues a PC-relative load; the pool must be reachable
+within the relocation's range (≤ ±1 MiB for an aarch64 `LDR` literal, ≤ ±2 GiB
+for riscv `AUIPC`+`LD`, ≤ ±2 GiB for x86-64 `mov rD, [rip + rel32]`).
+
+For programs under a few MiB, a single pool per file is fine. For larger
+programs, emit a pool per function.
+
+## Staged implementation plan
+
+1. **Spike across all three arches.** *Done.* `lispcc/hello.M1` and
+ `lispcc/demo.M1` run on aarch64, amd64, and riscv64 via existing
+ `M1` + `hex2_linker` (amd64, aarch64) / `hex2_word` (riscv64). Ops
+   demonstrated: `LI`, `SYSCALL`, `MOV`, `ADD`, `SUB`. The hex2
+   extensions once anticipated for aarch64 were *not* needed —
+   the inline-data `LI` trick sidesteps them. Order was reversed from
+ the original plan: aarch64 first (where the trick was designed),
+ then amd64 and riscv64.
+2. **Broaden the demonstrated op set.** Next. Control flow (`B`, `BEQ`,
+ `CALL`, `RET`, `TAIL`), loads/stores (`LD`/`ST`/`LW`/`LB`), and the
+ rest of arithmetic/logical/shift/mul-div. All reachable with stock
+ hex2; no extensions required. A loop-and-branch demo (e.g. print
+ digits 0–9, or sum 1..N) is the natural next program — it forces
+ conditional branching through the indirect-via-r7 pattern.
+3. **Generator for the ~30-op × register matrix.** Hand-maintenance of
+ the per-tuple DEFINEs becomes painful past ~20 entries. A small
+ template script produces `p1_<arch>.M1` from a shared op table.
+4. **Cross-arch differential harness.** Assemble each P1 source three
+ ways and diff runtime behavior. Currently eyeballed via
+ `make run-all`.
+5. **Write something real.** Seed Lisp interpreter (cons, car, cdr, eq,
+ atom, cond, lambda, quote) in ~500 lines of P1, running identically
+ on all three targets.
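+
+The step-2 loop demo might look like (P1 source sketch; `ZERO`, `ONE`,
+and `ELEVEN` are hypothetical small-constant defines):
+
+```
+# sum 1..10 into r0
+LI   r0, ZERO        # accumulator = 0
+LI   r1, ONE         # i = 1
+LI   r2, ELEVEN      # loop bound
+loop:
+ADD  r0, r0, r1      # acc += i
+ADDI r1, r1, !1      # i += 1
+BNE  r1, r2, %loop   # repeat until i == 11
+```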
+
+## Open questions
+
+- **Can we reuse hand-written `SYSCALL`/syscall-number conventions already in
+ stage0's arch ports?** Probably yes — adopt the conventions already in
+ `M2libc/<arch>/` to minimize surprise.
+- **Signed-extending loads.** Skipped for v1 — add `LBS`, `LWS` if the Lisp
+ interpreter needs them.
+- **Atomic / multi-core.** Not in scope. Seed interpreters are single-
+ threaded.
+- **Debug info.** `blood-elf` generates M1-format debug tables; we'd need to
+ decide whether P1 flows through it unchanged. Likely yes since P1 is just
+ another M1 source.
+- **x86-32 / armv7l / riscv32 support.** Requires narrowing the register
+ model and splitting word size. Defer.
+
+## Scope
+
+- **Defs files**: ~1500 entries × 3 arches, generator-driven.
+- **Testing**: shared harness that assembles each P1 source three ways
+ and diffs runtime behavior.
+
+The output is a single portable ISA above which any seed-stage program
+(Lisp, Forth, a smaller C compiler) can be written once and run on three
+hosts. Below M2-Planet in the chain, above raw M1. Leans entirely on
+existing `M1` + `hex2` — no toolchain modifications.
diff --git a/PLAN.md b/PLAN.md
@@ -4,7 +4,7 @@
Shrink the auditable LOC between M1 assembly and tcc-boot by replacing the
current `M2-Planet → mes → MesCC → nyacc` stack with a small Lisp written
-once in the P1 portable pseudo-ISA (see `../P1.md`) and a C compiler written
+once in the P1 portable pseudo-ISA (see [P1.md](P1.md)) and a C compiler written
in that Lisp. P1 is the same layer described in `P1.md`: ~30 RISC-shaped ops
whose per-arch `DEFINE` tables expand to amd64 / aarch64 / riscv64 encodings,
so one Lisp source serves all three hosts.
diff --git a/README.md b/README.md
@@ -7,7 +7,7 @@ Goal is a 4–6× shrink in auditable LOC. See [PLAN.md](PLAN.md).
## Status
-Stage 0: hello-world in the P1 portable pseudo-ISA (see `../P1.md`),
+Stage 0: hello-world in the P1 portable pseudo-ISA (see [P1.md](P1.md)),
assembled and run inside a pristine alpine container on all three target
arches (aarch64, amd64, riscv64). The same `hello.M1` source assembles
for every arch; only the backing `p1_<arch>.M1` defs file varies.
diff --git a/p1_aarch64.M1 b/p1_aarch64.M1
@@ -4,7 +4,7 @@
## plus a handful of MOV/ADD/SUB register tuples. The full 1500-entry
## table described in P1.md is generator-driven; what's here is the
## spike's hand-written sliver.
-## See ../P1.md for the full ISA and register mapping.
+## See P1.md for the full ISA and register mapping.
##
## Register mapping (P1 → aarch64):
## r0 → x0 , r1 → x1 , r2 → x2 , r3 → x3
diff --git a/p1_amd64.M1 b/p1_amd64.M1
@@ -1,7 +1,7 @@
## P1 pseudo-ISA — amd64 backing defs (v0.1 spike)
##
## Implements the subset needed by hello.M1 and demo.M1: LI, SYSCALL,
-## plus a handful of MOV/ADD/SUB register tuples. See ../P1.md.
+## plus a handful of MOV/ADD/SUB register tuples. See P1.md.
##
## Register mapping (P1 → amd64):
## r0 → rax , r1 → rdi , r2 → rsi , r3 → rdx
diff --git a/p1_riscv64.M1 b/p1_riscv64.M1
@@ -1,7 +1,7 @@
## P1 pseudo-ISA — riscv64 backing defs (v0.1 spike)
##
## Implements the subset needed by hello.M1 and demo.M1: LI, SYSCALL,
-## plus a handful of MOV/ADD/SUB register tuples. See ../P1.md.
+## plus a handful of MOV/ADD/SUB register tuples. See P1.md.
##
## Register mapping (P1 → RISC-V):
## r0 → a0 (x10) , r1 → a1 (x11) , r2 → a2 (x12) , r3 → a3 (x13)