boot2

Playing with the bootstrap
git clone https://git.ryansepassi.com/git/boot2.git
Log | Files | Refs

commit 505d30e7b343b771d0225501634bf5e9c1ca1b7f
parent 3e6ce5ada262294ca065705bdf65003ebf823fa1
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Mon, 20 Apr 2026 13:20:20 -0700

Move P1.md into the repo

Previously P1.md lived at ../P1.md, outside any git repo. Versioning it
alongside the code that implements it keeps the spec and the per-arch
defs files in sync, and means the repo is self-contained — a reader
cloning lispcc gets the spec too.

Updates every `../P1.md` reference to the in-repo path.

Diffstat:
M Makefile      |   2 +-
A P1.md         | 304 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M PLAN.md       |   2 +-
M README.md     |   2 +-
M p1_aarch64.M1 |   2 +-
M p1_amd64.M1   |   2 +-
M p1_riscv64.M1 |   2 +-
7 files changed, 310 insertions(+), 6 deletions(-)

diff --git a/Makefile b/Makefile
@@ -2,7 +2,7 @@
 #
 # hello.M1 is written in P1 mnemonics and assembles unchanged for all
 # three targets. The backing defs file (p1_<arch>.M1) is the only per-
-# arch source. See ../P1.md.
+# arch source. See P1.md.
 #
 # Two-image setup:
 # - lispcc-builder (alpine + gcc, host-arch): builds M1 and hex2

diff --git a/P1.md b/P1.md
@@ -0,0 +1,304 @@
# P1: A Portable Pseudo-ISA for M1

## Motivation

The stage0/live-bootstrap chain uses M1 (the mescc-tools macro assembler) as
the lowest human-writable layer above raw hex. M1 itself is architecture-
agnostic — it only knows `DEFINE name hex_bytes` — but every real M1 program
in stage0 (including the seed C compiler `cc_*.M1`) is hand-written per arch.
To write, say, a seed Lisp interpreter portably across amd64, aarch64, and
riscv64 without reaching for M2-Planet, we need a thin portable layer: a
pseudo-ISA whose mnemonics expand, per arch, to native encodings.

P1 is that layer. The goal is an unoptimized RISC-shaped instruction set,
hand-writable in M1 source, that assembles to three host ISAs via per-arch
`DEFINE` tables on top of existing `M1` + `hex2` unchanged.

## Non-goals

- **Not an optimizing backend.** P1 is deliberately dumb. An `ADD rD, rA, rB`
  on amd64 expands to `mov rD, rA; add rD, rB` unconditionally — no peephole
  recognition of the `rD == rA` case. Paying ~2× code size is fine for a seed.
- **Not ABI-compatible with platform C.** P1 programs are sovereign: direct
  Linux syscalls, no libc linkage. Interop thunks can be written later if
  needed.
- **Not 32-bit.** x86-32, armv7l, riscv32 are out of scope for v1. Adding them
  later means a separate defs file and some narrowing in the register model.
- **Not self-hosting.** P1 is a target for humans, not a compiler IR. If you
  want a compiler, write it in subset-C and use M2-Planet.
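The unconditional `mov`+`add` expansion can be sketched in Python — the `REG` table and helper names here are illustrative, not from the repo's tooling; only the emitted bytes matter:

```python
# x86-64 ModRM register numbers for the P1-mapped GPRs (rax=0, rdi=7, ...).
REG = {"rax": 0, "rdi": 7, "rsi": 6, "rdx": 2, "r10": 10, "r8": 8, "rbx": 3, "r12": 12}

def rex_w(reg, rm):
    # REX.W prefix; R/B bits carry bit 3 of the reg/rm fields.
    return 0x48 | ((reg >> 3) << 2) | (rm >> 3)

def modrm(reg, rm):
    # mod=11 (register-direct), reg and rm low 3 bits.
    return 0xC0 | ((reg & 7) << 3) | (rm & 7)

def add3(rd, ra, rb):
    """P1 `ADD rD, rA, rB` on amd64: always mov rD, rA ; add rD, rB."""
    d, a, b = REG[rd], REG[ra], REG[rb]
    mov = bytes([rex_w(a, d), 0x89, modrm(a, d)])   # 89 /r : mov r/m64, r64
    add = bytes([rex_w(b, d), 0x01, modrm(b, d)])   # 01 /r : add r/m64, r64
    return (mov + add).hex().upper()

print(add3("rax", "rdi", "rsi"))  # 4889F84801F0
print(add3("rax", "rax", "rsi"))  # 4889C04801F0 — redundant mov kept, no peephole
```

The second call shows the deliberate dumbness: `rD == rA` still emits the `mov`.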
## Current status (v0.1 spike)

Built in `lispcc/`: `hello.M1` and a compute/print/exit `demo.M1`, both
written in P1 mnemonics, assemble unchanged across aarch64, amd64, and
riscv64 from per-arch `p1_<arch>.M1` defs. Ops exercised so far: `LI`,
`SYSCALL`, `MOV`, `ADD`, `SUB` (the last three in specific
register-tuple forms). Runs on stock `M1` + `hex2_linker` (amd64,
aarch64) / `hex2_word` (riscv64). Run with `make PROG=demo run-all`
from `lispcc/`.

The rest of the ISA (branches, `CALL`/`RET`, `TAIL`, `LD`/`ST`,
logical/shift/mul-div) is reachable with the same tooling via a
generalization of the inline-data `LI` pattern: load the branch target
with `P1_LI_R7 &label` and jump through the register. Local conditional
skips over a known number of instructions are hand-encoded as plain hex
distances inside the DEFINE.

### Spike deviations from the design

- Wide immediates use a per-`LI` inline literal slot (one PC-relative
  load insn plus a 4-byte data slot, skipped past) rather than a shared
  pool. Keeps the spike pool-free at the cost of one skip-branch per
  `LI`. A pool can be reintroduced later without changes to P1 source.
- `LI` is 4-byte zero-extended today; 8-byte absolute is deferred until
  a program needs it. All current references are to addresses under
  4 GiB, so `&label` + a 4-byte zero pad suffices.
- The per-tuple DEFINE table is hand-written for the handful of
  MOV/ADD/SUB register tuples `demo.M1` uses. The ~1500-entry generator
  is still future work.
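As a concrete illustration of the inline literal slot, here is one plausible aarch64 realization sketched in Python. The `LDR` (literal) and `B` encodings are standard A64; the layout (load, skip-branch, 4-byte slot) follows the description above, but the spike's actual DEFINE bytes may differ:

```python
import struct

def li_inline_aarch64(rt, imm32):
    """Inline-data LI sketch: ldr wN, .+8 ; b .+8 ; .word imm32.
    The literal load reads the 4-byte slot placed just past the branch,
    zero-extending into xN; the branch then skips over the data slot."""
    ldr = 0x18000000 | (2 << 5) | rt   # LDR (literal) Wt, .+8  (imm19 = 8/4 = 2)
    b = 0x14000000 | 2                 # B .+8 (imm26 = 2 instructions)
    return struct.pack("<III", ldr, b, imm32)

# P1 "LI r0, 1" (r0 -> x0/w0): 12 bytes, one skip-branch per LI as described
print(li_inline_aarch64(0, 1).hex())  # 400000180200001401000000
```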
## Design decisions

| Decision       | Choice                                        | Why                                       |
|----------------|-----------------------------------------------|-------------------------------------------|
| Word size      | 64-bit                                        | All three target arches are 64-bit native |
| Endianness     | Little-endian                                 | All three agree                           |
| Registers      | 8 GPRs (`r0`–`r7`) + `sp`, `lr`-on-stack      | Fits x86-64's usable register budget      |
| Narrow imm     | Signed 12-bit                                 | riscv I-type width; aarch64 ≤12 also OK   |
| Wide imm       | Pool-loaded via PC-relative `LI`              | Avoids arch-specific immediate synthesis  |
| Calling conv   | r0 = return, r1–r6 = args, r6–r7 callee-saved | P1-defined; not platform ABI              |
| Return address | Always spilled to stack on entry              | Hides x86's missing `lr` uniformly        |
| Syscall        | `SYSCALL` with num in r0, args r1–r6          | Per-arch wrapper emits native sequence    |

## Register mapping

`r6` and `r7` map to callee-saved native registers; all other mappings are
caller-saved.

| P1   | amd64 | aarch64 | riscv64 |
|------|-------|---------|---------|
| `r0` | `rax` | `x0`    | `a0`    |
| `r1` | `rdi` | `x1`    | `a1`    |
| `r2` | `rsi` | `x2`    | `a2`    |
| `r3` | `rdx` | `x3`    | `a3`    |
| `r4` | `r10` | `x4`    | `a4`    |
| `r5` | `r8`  | `x5`    | `a5`    |
| `r6` | `rbx` | `x19`   | `s1`    |
| `r7` | `r12` | `x20`   | `s2`    |
| `sp` | `rsp` | `sp`    | `sp`    |
| `lr` | (mem) | `x30`   | `ra`    |

x86-64 has no link register; `CALL`/`RET` macros push/pop the return address
on the stack. On aarch64/riscv64, the prologue spills `lr` (`x30`/`ra`) to
the stack too, so all three converge on "return address lives in
`[sp + 0]` after prologue." This uniformity is worth the extra store on the
register-rich arches.
## Instruction set (~30 ops)

```
# 3-operand arithmetic (reg forms)
ADD rD, rA, rB SUB rD, rA, rB
AND rD, rA, rB OR rD, rA, rB XOR rD, rA, rB
SHL rD, rA, rB SHR rD, rA, rB SAR rD, rA, rB
MUL rD, rA, rB DIV rD, rA, rB REM rD, rA, rB

# Immediate forms (signed 12-bit)
ADDI rD, rA, !imm ANDI rD, rA, !imm ORI rD, rA, !imm
SHLI rD, rA, !imm SHRI rD, rA, !imm SARI rD, rA, !imm

# Moves
MOV rD, rA # reg-to-reg
LI rD, %label # load 64-bit literal from pool
LA rD, %label # load PC-relative address

# Memory (offset is signed 12-bit)
LD rD, rA, !off ST rS, rA, !off # 64-bit
LW rD, rA, !off SW rS, rA, !off # 32-bit zero-extended / truncated
LB rD, rA, !off SB rS, rA, !off # 8-bit zero-extended / truncated

# Control flow
B %label # unconditional branch
BEQ rA, rB, %label BNE rA, rB, %label
BLT rA, rB, %label BGE rA, rB, %label # signed
BLTU rA, rB, %label BGEU rA, rB, %label # unsigned
CALL %label RET
TAIL %label # tail call: epilogue + B %label

# System
SYSCALL # num in r0, args r1-r6, ret in r0
```

### Semantics

- All arithmetic is on 64-bit values. `SHL`/`SHR`/`SAR` take shift amount in
  the low 6 bits of `rB` (or the `!imm` for immediate forms).
- `DIV` is signed, truncated toward zero. `REM` matches `DIV`.
- `LW`/`LB` zero-extend the loaded value into the 64-bit destination.
  (Signed-extending variants — `LWS`, `LBS` — can be added later if needed.)
- Branch offsets are PC-relative. In the v0.1 spike they are realized by
  loading the target address via `LI` into `r7` and jumping through the
  register; range is therefore unbounded within the 4 GiB address space.
  Native-encoded branches (with tighter range limits) are an optional
  future optimization.
- `CALL` pushes a return address, jumps to target. `RET` pops and jumps.
  Prologue convention: function entry does `ADDI sp, sp, -16; ST lr, sp, 0`
  (or arch-specific equivalent); epilogue reverses.
- `TAIL %label` is a tail call — it performs the callee's standard
  epilogue (restore `lr` from `[sp+0]`, pop the frame) and then branches
  unconditionally to `%label`, reusing the caller's return address
  instead of pushing a new frame. The current function must be using the
  standard prologue. Interpreter `eval` loops rely on `TAIL` to recurse
  on sub-expressions without growing the stack.
- `SYSCALL` is a single opcode in P1 source. Each arch's defs file expands it
  to the native syscall sequence, including the register shuffle from P1's
  `r0`=num, `r1`–`r6`=args convention into the platform's native convention
  if different.

## Encoding strategy

For each `(op, register-tuple)` combination, emit one `DEFINE` per arch. A
generator script produces the full defs file; no hand-encoding per entry.

Example — `ADD r0, r1, r2`:

```
# p1_riscv64.M1
DEFINE P1_ADD_R0_R1_R2 3385C500 # add a0, a1, a2 (little-endian)

# p1_aarch64.M1
DEFINE P1_ADD_R0_R1_R2 2000028B # add x0, x1, x2

# p1_amd64.M1 (2-op destructive — expands to mov + add)
DEFINE P1_ADD_R0_R1_R2 4889F84801F0 # mov rax, rdi ; add rax, rsi
```

### Combinatorial footprint

Per-arch defs count (immediates handled by sigil, not enumerated):

- 11 reg-reg-reg arith × 8 `rD` × 8 `rA` × 8 `rB` = 704. Pruned to ~600 by
  removing trivially-equivalent tuples.
- 6 immediate arith × 8² = 384. Each entry uses an immediate sigil (`!imm`),
  so the immediate value itself is not enumerated.
- 3 move ops × 8 or 8² = ~80.
- 6 memory ops × 8² = 384. Offsets use `!imm` sigil.
- 6 conditional branches × 8² = 384.
- Singletons (`B`, `CALL`, `RET`, `TAIL`, `SYSCALL`) = 5.

Total ≈ 1500 defines per arch. Template-generated.

## Syscall conventions

Linux syscall mechanics differ across arches. The `SYSCALL` macro hides this.
| Arch     | Insn      | Num reg | Arg regs (plat ABI)          |
|----------|-----------|---------|------------------------------|
| amd64    | `syscall` | `rax`   | `rdi, rsi, rdx, r10, r8, r9` |
| aarch64  | `svc #0`  | `x8`    | `x0`–`x5`                    |
| riscv64  | `ecall`   | `a7`    | `a0`–`a5`                    |

On aarch64 and riscv64 the P1 register mapping already places args in the
native arg regs; only the number register differs (`r0` → `x8`/`a7`). The
`SYSCALL` expansion emits a `mov` from `r0` to the arch's num register, then
`svc 0`/`ecall`.

On amd64 the P1 mapping lines up with the syscall ABI directly: `r0`/`rax`
is both the number register and the return register, so `SYSCALL` expands
to a single `syscall` instruction.

### Syscall numbers

Linux uses two syscall tables relevant here:

- **amd64**: amd64-specific table (`write = 1`, `exit = 60`, …).
- **aarch64 and riscv64**: generic table (`write = 64`, `exit = 93`, …).

P1 programs use symbolic constants (`SYS_WRITE`, `SYS_EXIT`) defined per-arch:

```
# p1_amd64.M1
DEFINE SYS_WRITE 01000000
DEFINE SYS_EXIT 3C000000

# p1_aarch64.M1 and p1_riscv64.M1
DEFINE SYS_WRITE 40000000
DEFINE SYS_EXIT 5D000000
```

(The encodings shown are placeholder little-endian 32-bit immediates; real
values are inlined as operands to `LI` or `ADDI`.)

## Program layout

Each P1 object file is structured as:

```
<ELF header, per arch>
<code section>
  <function prologues, bodies, epilogues>
<constant pool>
  pool_label_1: &0xDEADBEEFCAFEBABE
  pool_label_2: &0x00000000004004C0
  ...
<data section>
  <static bytes>
```

`LI rD, %pool_label_N` issues a PC-relative load; the pool must be reachable
within the relocation's range (≤ ±1 MiB for aarch64 `LDR` literal; ≤ ±2 GiB
for riscv `AUIPC`+`LD` and for x86 `mov rD, [rip + rel32]`).

For programs under a few MiB, a single pool per file is fine. For larger
programs, emit a pool per function.

## Staged implementation plan

1. **Spike across all three arches.** *Done.* `lispcc/hello.M1` and
   `lispcc/demo.M1` run on aarch64, amd64, and riscv64 via existing
   `M1` + `hex2_linker` (amd64, aarch64) / `hex2_word` (riscv64). Ops
   demonstrated: `LI`, `SYSCALL`, `MOV`, `ADD`, `SUB`. The anticipated
   aarch64 `hex2_word` extensions were *not* needed — the inline-data
   `LI` trick sidesteps them. Order was reversed from the original
   plan: aarch64 first (where the trick was designed), then amd64 and
   riscv64.
2. **Broaden the demonstrated op set.** Next. Control flow (`B`, `BEQ`,
   `CALL`, `RET`, `TAIL`), loads/stores (`LD`/`ST`/`LW`/`LB`), and the
   rest of arithmetic/logical/shift/mul-div. All reachable with stock
   hex2; no extensions required. A loop-and-branch demo (e.g. print
   digits 0–9, or sum 1..N) is the natural next program — it forces
   conditional branching through the indirect-via-r7 pattern.
3. **Generator for the ~30-op × register matrix.** Hand-maintenance of
   the per-tuple DEFINEs becomes painful past ~20 entries. A small
   template script produces `p1_<arch>.M1` from a shared op table.
4. **Cross-arch differential harness.** Assemble each P1 source three
   ways and diff runtime behavior. Currently eyeballed via
   `make run-all`.
5. **Write something real.** Seed Lisp interpreter (cons, car, cdr, eq,
   atom, cond, lambda, quote) in ~500 lines of P1, running identically
   on all three targets.

## Open questions

- **Can we reuse hand-written `SYSCALL`/syscall-number conventions already in
  stage0's arch ports?** Probably yes — adopt the conventions already in
  `M2libc/<arch>/` to minimize surprise.
- **Signed-extending loads.** Skipped for v1 — add `LBS`, `LWS` if the Lisp
  interpreter needs them.
- **Atomic / multi-core.** Not in scope. Seed interpreters are single-
  threaded.
- **Debug info.** `blood-elf` generates M1-format debug tables; we'd need to
  decide whether P1 flows through it unchanged. Likely yes since P1 is just
  another M1 source.
- **x86-32 / armv7l / riscv32 support.** Requires narrowing the register
  model and splitting word size. Defer.

## Scope

- **Defs files**: ~1500 entries × 3 arches, generator-driven.
- **Testing**: shared harness that assembles each P1 source three ways
  and diffs runtime behavior.

The output is a single portable ISA above which any seed-stage program
(Lisp, Forth, a smaller C compiler) can be written once and run on three
hosts. Below M2-Planet in the chain, above raw M1. Leans entirely on
existing `M1` + `hex2` — no toolchain modifications.

diff --git a/PLAN.md b/PLAN.md
@@ -4,7 +4,7 @@
 Shrink the auditable LOC between M1 assembly and tcc-boot by replacing the
 current `M2-Planet → mes → MesCC → nyacc` stack with a small Lisp written
-once in the P1 portable pseudo-ISA (see `../P1.md`) and a C compiler written
+once in the P1 portable pseudo-ISA (see [P1.md](P1.md)) and a C compiler written
 in that Lisp. P1 is the same layer described in `P1.md`: ~30 RISC-shaped
 ops whose per-arch `DEFINE` tables expand to amd64 / aarch64 / riscv64
 encodings, so one Lisp source serves all three hosts.

diff --git a/README.md b/README.md
@@ -7,7 +7,7 @@
 Goal is a 4–6× shrink in auditable LOC. See [PLAN.md](PLAN.md).
 
 ## Status
-Stage 0: hello-world in the P1 portable pseudo-ISA (see `../P1.md`),
+Stage 0: hello-world in the P1 portable pseudo-ISA (see [P1.md](P1.md)),
 assembled and run inside a pristine alpine container on all three
 target arches (aarch64, amd64, riscv64). The same `hello.M1` source
 assembles for every arch; only the backing `p1_<arch>.M1` defs file varies.

diff --git a/p1_aarch64.M1 b/p1_aarch64.M1
@@ -4,7 +4,7 @@
 ## plus a handful of MOV/ADD/SUB register tuples. The full 1500-entry
 ## table described in P1.md is generator-driven; what's here is the
 ## spike's hand-written sliver.
-## See ../P1.md for the full ISA and register mapping.
+## See P1.md for the full ISA and register mapping.
 ##
 ## Register mapping (P1 → aarch64):
 ## r0 → x0 , r1 → x1 , r2 → x2 , r3 → x3

diff --git a/p1_amd64.M1 b/p1_amd64.M1
@@ -1,7 +1,7 @@
 ## P1 pseudo-ISA — amd64 backing defs (v0.1 spike)
 ##
 ## Implements the subset needed by hello.M1 and demo.M1: LI, SYSCALL,
-## plus a handful of MOV/ADD/SUB register tuples. See ../P1.md.
+## plus a handful of MOV/ADD/SUB register tuples. See P1.md.
 ##
 ## Register mapping (P1 → amd64):
 ## r0 → rax , r1 → rdi , r2 → rsi , r3 → rdx

diff --git a/p1_riscv64.M1 b/p1_riscv64.M1
@@ -1,7 +1,7 @@
 ## P1 pseudo-ISA — riscv64 backing defs (v0.1 spike)
 ##
 ## Implements the subset needed by hello.M1 and demo.M1: LI, SYSCALL,
-## plus a handful of MOV/ADD/SUB register tuples. See ../P1.md.
+## plus a handful of MOV/ADD/SUB register tuples. See P1.md.
 ##
 ## Register mapping (P1 → RISC-V):
 ## r0 → a0 (x10) , r1 → a1 (x11) , r2 → a2 (x12) , r3 → a3 (x13)