post update - boot2 - Playing with the boostrap

commit 90d6018d786c42cf3a053508647144dceca2167e
parent add01712f59b79acb21ef42924bdb260357ef6b8
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Thu, 23 Apr 2026 11:52:00 -0700

post update

Diffstat:
M post.md  | 318 +++++++++++++++++++++++++++++++++++--------------------------------------------

1 file changed, 142 insertions(+), 176 deletions(-)
diff --git a/post.md b/post.md
@@ -15,18 +15,19 @@ of course, by another compiler!
 Uh oh. Infinite regress? Not quite. It bottoms out in something directly
 specified in machine code, a "binary seed" from which the rest sprouts.
 
-For about 30 years though, we had lost the seed. We were just going around
-using whatever compiler was on hand, figuring that there always would be one
-on hand, and trusting that it would do what it said on its tin. People noticed
-it was a bit odd that we had lost the chain, that we couldn't compile the
-compiler if we had lost the compiler, but not much was done about it.
+For decades, though, we had lost the seed. We were just going around using
+whatever compiler was on hand, figuring that there always would be one on hand,
+and trusting that it would do what it said on its tin. People noticed it was a
+bit odd that we had lost the chain, that we couldn't compile the compiler if
+we had lost the compiler, but not much was done about it.
 
 In 1983, Ken Thompson told a little horror story that spooked folks, but still,
 it didn't really change much. Like many good horror stories, this one was about
 a little invisible demon. Someone had put a virus into the compiler that
 replicated itself into all future compilers and would scrub itself out of
 existence if you ever tried compiling a program that went looking for it. All
-of our compilers were infected and none of us knew. Shivers.
+of our compilers were infected and none of us knew. Shivers. (Search for
+"Trusting Trust" if you want to read the full story).
 
 ## Bottle universe
 
@@ -89,31 +90,30 @@ alternative path. One that hopefully will knock down that KLOC count and maybe
 tidy up this little bottle universe that exists at the bottom of the software
 sea.
 
-There are 2 parts. First, get to portability sooner, to knock down the amount
-of arch-specific code that's needed. I do this by defining a pseudo-ISA called
-P1 that can be mapped to each arch via simple expansions. Second, define a
-richer macro preprocessor to make P1 programming more ergonomic, ergonomic
-enough to... Third, define a Scheme interpreter in P1. Then this Scheme becomes
-the base to define a shell and a C compiler.
+Three moves. First, write a richer macro preprocessor called m1pp (ht Bjarne)
+that makes low-level programming ergonomic enough to skip the subset-C step.
+Second, get to portability sooner, by defining a pseudo-ISA called P1 in m1pp
+that each arch maps to via small expansions. Third, use P1 to host a Scheme,
+and use that Scheme to host a C compiler.
 
 Here's what I'm aiming for, in Schemey pseudocode:
 
 ```
 ;; Bootstrap from seed, same as before
-(define hex2 (((hex0-seed hex0.hex0) hex1.hex0) hex2.hex1))
+(define hex0 (hex0-seed hex0.hex0))
+(define hex1 (hex0 hex1.hex0))
+(define hex2 (hex1 hex2.hex1))
 (define catm (hex2 catm.hex2))
 (define M0   (hex2 (catm ELF.hex2 M0.hex2)))
 
 ;; Compile+Link
 (defn exe (M1-src) (hex2 (catm ELF.hex2 (M0 M1-src))))
 
-;; We define a portable pseudo-ISA called P1, and use that to define a richer
-;; macro expander called m1pp (ht Bjarne)
+;; m1pp itself is written in M1 and built via the M1 path
 (define m1pp (exe (catm P1.M1 m1pp.M1)))
 
-;; From there, we redefine P1 in terms of m1pp, and use P1 sources
-;; P1A.M1pp is the arch-specific "backend" file
-;; P1pp.M1pp is P1 exposed as M1pp macro definitions that sources use
+;; Now P1 is defined as m1pp macros:
+;; P1A.M1pp is the arch-specific backend, P1pp.M1pp is the portable interface
 (defn p1exe (P1-src) (exe (m1pp (catm P1A.M1pp P1pp.M1pp P1-src))))
 
 ;; To Scheme!
@@ -122,8 +122,8 @@ Here's what I'm aiming for, in Schemey pseudocode:
 ;; And finally C
 (defn scc (C-src) (p1exe (scheme cc.scm C-src)))
 (define tcc0 (scc tcc.c))   ;; compiler: scheme cc.scm
-(define tcc1 (tcc0 tcc.c))  ;; compiler: scheme-compiled tcc 
-(define tcc  (tcc1 tcc.c))   ;; compiler: tcc-compiled tcc
+(define tcc1 (tcc0 tcc.c))  ;; compiler: scheme-compiled tcc
+(define tcc  (tcc1 tcc.c))  ;; compiler: tcc-compiled tcc
 ```
 
 So the new bits we need:
@@ -132,179 +132,145 @@ So the new bits we need:
 - Scheme: scheme.P1
 - C: cc.scm
 
-Today, I present just m1pp and P1 defined in it. Next will come a Scheme in P1,
-and finally enough of a C compiler in Scheme to compile the tcc source.
+Today, I present just m1pp and P1. Next will come a Scheme in P1, and finally
+enough of a C compiler in Scheme to compile the tcc source.
+
+## m1pp
+
+m1pp is a small macro expander layered on top of M0, with an aim of making it
+bearable enough to write assembly programs that you might just want to linger
+for a moment way down here near the bottom of the sea. It adds three things M0
+doesn't have: user-defined macros with named parameters, integer expressions
+that evaluate at expand time, and conditional expansion.
+
+Features:
+
+- **Macros**: `%macro NAME(a, b) ... %endm`, called with `%NAME(x, y)`. Macro
+  bodies can themselves contain macro calls, so expansion is recursive.
+- **Integer expressions**: emit little-endian hex from a Lisp expression —
+  `!(expr)` for 8 bits, `@(expr)` for 16, `%(expr)` for 32, `$(expr)` for 64.
+  The punctuation mirrors M0's decimal/hex immediate widths (`!@%`) and adds
+  `$` for the 64-bit case.
+- **Conditional expansion**: `%select(cond, then, else)` evaluates `cond` as
+  an integer and emits exactly one of the two branches. Non-zero is true.
+- **Token concat**: `a ## b` pastes two tokens into one identifier.
+
+Comments (`#` and `;`) and M0 `DEFINE`s pass through untouched. That's it — no
+`%ifdef`, no string manipulation, no floating point. Enough to encode
+instructions and define P1 mnemonics; nothing more.
+
+The expression syntax is, of course, Lisp-shaped:
+
+```
+atoms: decimal or 0x-prefixed integer literals
+calls: (+ a b) (- a b) (* a b) (/ a b) (% a b) (<< a b) (>> a b)
+       (= a b) (!= a b) (< a b) (<= a b) (> a b) (>= a b)
+       (& a b) (| a b) (^ a b) (~ a)
+```
+
+Here's a real macro from the aarch64 P1 backend — the "add register, immediate"
+instruction:
+
+```
+%macro aa64_add_imm(rd, ra, imm12)
+%((| 0x91000000 (<< (& imm12 0xFFF) 10) (<< %aa64_reg(ra) 5) %aa64_reg(rd)))
+%endm
+```
+
+A call like `%aa64_add_imm(a0, a0, 8)` expands to a single Lisp expression
+that ORs together the opcode, the 12-bit immediate shifted into place, the
+source register field, and the destination register field. The outer `%(...)`
+evaluates that expression and emits it as four little-endian hex bytes, which
+M0 and hex2 then hand to the linker. The nested `%aa64_reg(...)` calls in the
+body show that macros can call other macros during expansion — `aa64_reg` is a
+tiny macro per register name that produces the aarch64-native register number
+(e.g. `a0` → 0, `s0` → 19).
+
+That one pattern — Lisp expression in, hex bytes out — covers all per-arch
+instruction encoding.
 
 ## P1
 
-"P1" is a portable pseudo-ISA targetable via M1 DEFINEs.
+P1 is a portable pseudo-ISA, targetable at hex2-ready bytes via m1pp macros.
+Source looks a lot like assembly, but the mnemonics are m1pp calls and each
+arch picks its own encoding.
 
 Registers:
 
-- `a0`–`a3` — argument registers, and caller-saved general registers.
-- `t0` — caller-saved temporary.
-- `s0`–`s1` — callee-saved general registers.
+- `a0`–`a3` — argument and caller-saved general registers.
+- `t0`–`t2` — caller-saved temporaries.
+- `s0`–`s3` — callee-saved general registers.
 - `sp` — stack pointer.
 
+That's a modest budget — twelve visible registers. The goal is small backends
+and uniform source across a 32-bit RISC-V, an aarch64, and an x86-64. Anything beyond this that a backend needs for encoding (scratch regs,
+link register, zero register) is backend-private and exposed only through the
+portable ops. `LA_BR` is the one exception on the source side: it loads a
+label into a hidden *branch-target register* that the immediately following
+direct control-flow op (`B`, `CALL`, `TAIL`, or any conditional branch) then
+consumes. The register-indirect forms (`BR`, `CALLR`, `TAILR`) take their
+target in an ordinary register instead.
+
+A word is register-sized — 32 bits on 32-bit targets, 64 bits on 64-bit
+targets. `LD`/`ST` move words; `LB`/`SB` move bytes. No 16/32-bit slots in
+between; code that needs them can OR/shift.
+
 Ops:
-- Materialization: `LI` (load immediate), `LA` (load address), `LA_BR` (load branch register)
-- Moves: `MOV`
+
+- Materialization: `LI` (load immediate), `LA` (load label address), `LA_BR`
+  (load label into the hidden branch-target register).
+- Moves: `MOV`.
 - Arithmetic: `ADD`, `SUB`, `AND`, `OR`, `XOR`, `SHL`, `SHR`, `SAR`, `MUL`,
-  `DIV`, `REM`
-- Arithmetic with immediates: `ADDI`, `ANDI`, `ORI`, `SHLI`, `SHRI`, `SARI`
-- Memory: `LD`, `ST`, `LB`, `SB`
-- Branching: `B`, `BR`, `BEQ`, `BNE`, `BLT`, `BEQZ`, `BNEZ`, `BLTZ`
-- Calls / returns: `CALL`, `CALLR`, `RET`, `TAIL`, `TAILR`
-- Frame management: `ENTER`, `LEAVE`
-- ABI arg access: `LDARG`
-- System: `SYSCALL`
-
-It's a dumb little RISC, with a bit of sugar for frame management and tail
-calling: a handful of visible registers, a hidden branch/call target register
-behind `LA_BR`. It's meant to be implemented by a set of per-arch `DEFINEs`,
-aiming for 32-bit and 64-bit x86, ARM, and RISC-V.
+  `DIV`, `REM`.
+- Arithmetic with immediates: `ADDI`, `ANDI`, `ORI`, `SHLI`, `SHRI`, `SARI`.
+  Immediate widths are backend-dependent; the backend errors if the constant
+  doesn't fit.
+- Memory: `LD`, `ST`, `LB`, `SB`.
+- Branching: `B`, `BR`, `BEQ`, `BNE`, `BLT`, `BLTU`, `BEQZ`, `BNEZ`, `BLTZ`.
+  Signed and unsigned less-than; `>=`, `>`, `<=` are synthesized by swapping
+  operands or inverting branch sense.
+- Calls / returns: `CALL`, `CALLR`, `RET`, `TAIL`, `TAILR`.
+- Frame management: `ENTER`, `LEAVE`.
+- ABI arg access: `LDARG` — reads stack-passed incoming args without
+  hard-coding the frame layout.
+- System: `SYSCALL`.
+
+No floating point. P1 is a scalar-integer world; enough for a Scheme and a
+C compiler.
 
 Calling convention:
 
-- `a0`-`a3` hold the first four argument words.
-- Extra arguments live in an incoming stack-argument area and are read with
-  `LDARG`.
-- `a0` is also the one-word return register.
-- `a0`-`a3` and `t0` are caller-saved; `s0`-`s1` and `sp` are callee-saved.
-- `ENTER` builds the standard frame and `LEAVE` tears it down.
-- Stack-passed outgoing args are staged in frame-local space before `CALL`.
-- Wider-than-one-word returns use the usual hidden pointer trick: caller passes
-  a writable result buffer in `a0`, real args slide over by one, callee returns
-  that same pointer in `a0`.
-
-It ships as a pair of files you catm before your P1 source:
-- P1A.m1pp: architecture-specific backend implementing the portable interface
-- P1.m1pp: the portable interface that P1 sources program against
+- `a0`-`a3` hold the first four argument words; extras live in an incoming
+  stack-argument area, read with `LDARG`.
+- `a0` is the one-word return register. Two-word returns use `a0`/`a1`.
+- `a0`-`a3` and `t0`-`t2` are caller-saved; `s0`-`s3` and `sp` are
+  callee-saved.
+- `ENTER` builds the standard frame; `LEAVE` tears it down.
+- Stack-passed outgoing args are staged in a dedicated frame-local area
+  before `CALL`, so the callee finds them at a known offset from `sp`.
+- Wider-than-two-word returns use the usual hidden-pointer trick: caller
+  passes a writable result buffer in `a0`, real args slide over by one,
+  callee returns that same pointer in `a0`.
 
-## m1pp
+P1 ships as a pair of files you catm before any P1 source:
 
-m1pp is a macro expander:
-- Define macros: `%macro NAME(a, b) ... %endm`
-- Use macros: `%NAME(x, y)`. These can appear within macros as well, so it's a
-  recursive expansion.
-- Concat tokens: `a ## b`
-- Evaluate arithmetic integer expressions and emit little-endian hex:
-  `!(expr)` (8-bit), `@(expr)` (16-bit), `%(expr)` (32-bit), `$(expr)` (64-bit)
-- Conditional expansions: `%select(cond_expr, then, else)`
+- `P1A.M1pp` — architecture-specific backend implementing the portable
+  interface.
+- `P1pp.M1pp` — the portable interface that P1 programs program against.
 
-The expression syntax is, of course, Lisp-shaped:
+## What it cost
 
-```
-atoms: decimal or 0x-prefixed integer literals
-calls: (+ a b) (- a b) (* a b) (/ a b) (% a b) (<< a b) (>> a b)
-       (= a b) (!= a b) (< a b) (<= a b) (> a b) (>= a b)
-       (& a b) (| a b) (^ a b) (~ a)
-```
+Concrete sizes for what's landed today:
 
-It's enough for all per-arch instruction encoding and defining P1 mnemonics.
+- `m1pp.M1`: ~2.5 KLOC — the M1 source of the macro expander itself.
+- `P1.M1pp`: 224 LOC — the portable P1 interface.
+- `P1-aarch64.M1pp`: ~500 LOC — one arch backend.
 
-## Example P1 program
+The pitch of this rewrite — shrinking the pre-Scheme KLOC count — can't be
+fully cashed in until scheme.P1 and cc.scm land; that's where the 30KLOC of
+M2-Planet-and-Mes gets compared against. But the shape is already visible:
+once m1pp is up, adding a new architecture is a few hundred lines of macro
+definitions, not a new port of a subset-C compiler. The arch-specific surface
+is the thin file, not the thick one.
 
-```
-## copy.P1
-
-DEFINE BUF_SIZE %128
-DEFINE SAVE_S0 %16
-DEFINE SAVE_S1 %24
-DEFINE COPY_LOCALS %144
-
-## Entry convention:
-##   a0 = argc
-##   a1 = argv
-:main
-  %li(t0, 2)
-  %blt(a0, t0, usage)          ; if argc < 2, exit 2
-
-  %ld(t0, a1, WORD)            ; t0 = argv[1]
-
-  %li(a0, SYS_open)
-  %mov(a1, t0)                 ; path
-  %li(a2, O_RDONLY)            ; flags
-  %li(a3, 0)                   ; mode
-  %syscall()
-  %bltz(a0, fail)              ; open failed
-
-  %mov(s0, a0)                 ; s0 = fd
-  %mov(a0, s0)
-  %call(copy_to_stdout)        ; returns 0 or -1 in a0
-  %bltz(a0, close_fail)
-
-  %li(a0, SYS_close)
-  %mov(a1, s0)
-  %syscall()
-
-  %li(a0, SYS_exit)
-  %li(a1, 0)
-  %syscall()
-
-:close_fail
-  %li(a0, SYS_close)
-  %mov(a1, s0)
-  %syscall()
-
-:fail
-  %li(a0, SYS_exit)
-  %li(a1, 1)
-  %syscall()
-
-:usage
-  %li(a0, SYS_exit)
-  %li(a1, 2)
-  %syscall()
-
-
-:copy_to_stdout
-  %enter(COPY_LOCALS)
-  %st(sp, SAVE_S0, s0)
-  %st(sp, SAVE_S1, s1)
-
-  %mov(s0, a0)                 ; s0 = input fd
-
-:read_loop
-  %li(a0, SYS_read)
-  %mov(a1, s0)                 ; fd
-  %mov(a2, sp)
-  %addi(a2, a2, BUF_OFF)       ; buf = sp + BUF_OFF
-  %li(a3, BUF_SIZE)            ; count
-  %syscall()
-
-  %beqz(a0, done)              ; EOF
-  %bltz(a0, io_error)          ; read error
-
-  %mov(s1, a0)                 ; remaining byte count
-  %mov(t0, sp)
-  %addi(t0, t0, BUF_OFF)       ; t0 = current write ptr
-
-:write_loop
-  %li(a0, SYS_write)
-  %li(a1, 1)                   ; stdout
-  %mov(a2, t0)                 ; buf
-  %mov(a3, s1)                 ; count
-  %syscall()
-
-  %bltz(a0, io_error)          ; write error
-
-  %add(t0, t0, a0)             ; buf += nwritten
-  %sub(s1, s1, a0)             ; remaining -= nwritten
-  %bnez(s1, write_loop)
-  %b(read_loop)
-
-:done
-  %li(a0, 0)
-  %ld(s0, sp, SAVE_S0)
-  %ld(s1, sp, SAVE_S1)
-  %leave()
-  %ret()
-
-:io_error
-  %li(a0, -1)
-  %ld(s0, sp, SAVE_S0)
-  %ld(s1, sp, SAVE_S1)
-  %leave()
-  %ret()
-```
+Next post: a Scheme interpreter written in P1.

	boot2 Playing with the boostrap
	git clone https://git.ryansepassi.com/git/boot2.git
	Log \| Files \| Refs