commit 90d6018d786c42cf3a053508647144dceca2167e
parent add01712f59b79acb21ef42924bdb260357ef6b8
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Thu, 23 Apr 2026 11:52:00 -0700
post update
Diffstat:
| M | post.md | | | 318 | +++++++++++++++++++++++++++++++++++-------------------------------------------- |
1 file changed, 142 insertions(+), 176 deletions(-)
diff --git a/post.md b/post.md
@@ -15,18 +15,19 @@ of course, by another compiler!
Uh oh. Infinite regress? Not quite. It bottoms out in something directly
specified in machine code, a "binary seed" from which the rest sprouts.
-For about 30 years though, we had lost the seed. We were just going around
-using whatever compiler was on hand, figuring that there always would be one
-on hand, and trusting that it would do what it said on its tin. People noticed
-it was a bit odd that we had lost the chain, that we couldn't compile the
-compiler if we had lost the compiler, but not much was done about it.
+For decades, though, we had lost the seed. We were just going around using
+whatever compiler was on hand, figuring that there always would be one on hand,
+and trusting that it would do what it said on its tin. People noticed it was a
+bit odd that we had lost the chain, that we couldn't compile the compiler if
+we had lost the compiler, but not much was done about it.
In 1983, Ken Thompson told a little horror story that spooked folks, but still,
it didn't really change much. Like many good horror stories, this one was about
a little invisible demon. Someone had put a virus into the compiler that
replicated itself into all future compilers and would scrub itself out of
existence if you ever tried compiling a program that went looking for it. All
-of our compilers were infected and none of us knew. Shivers.
+of our compilers were infected and none of us knew. Shivers. (Search for
+"Trusting Trust" if you want to read the full story).
## Bottle universe
@@ -89,31 +90,30 @@ alternative path. One that hopefully will knock down that KLOC count and maybe
tidy up this little bottle universe that exists at the bottom of the software
sea.
-There are 2 parts. First, get to portability sooner, to knock down the amount
-of arch-specific code that's needed. I do this by defining a pseudo-ISA called
-P1 that can be mapped to each arch via simple expansions. Second, define a
-richer macro preprocessor to make P1 programming more ergonomic, ergonomic
-enough to... Third, define a Scheme interpreter in P1. Then this Scheme becomes
-the base to define a shell and a C compiler.
+Three moves. First, write a richer macro preprocessor called m1pp (ht Bjarne)
+that makes low-level programming ergonomic enough to skip the subset-C step.
+Second, get to portability sooner, by defining a pseudo-ISA called P1 in m1pp
+that each arch maps to via small expansions. Third, use P1 to host a Scheme,
+and use that Scheme to host a C compiler.
Here's what I'm aiming for, in Schemey pseudocode:
```
;; Bootstrap from seed, same as before
-(define hex2 (((hex0-seed hex0.hex0) hex1.hex0) hex2.hex1))
+(define hex0 (hex0-seed hex0.hex0))
+(define hex1 (hex0 hex1.hex0))
+(define hex2 (hex1 hex2.hex1))
(define catm (hex2 catm.hex2))
(define M0 (hex2 (catm ELF.hex2 M0.hex2)))
;; Compile+Link
(defn exe (M1-src) (hex2 (catm ELF.hex2 (M0 M1-src))))
-;; We define a portable pseudo-ISA called P1, and use that to define a richer
-;; macro expander called m1pp (ht Bjarne)
+;; m1pp itself is written in M1 and built via the M1 path
(define m1pp (exe (catm P1.M1 m1pp.M1)))
-;; From there, we redefine P1 in terms of m1pp, and use P1 sources
-;; P1A.M1pp is the arch-specific "backend" file
-;; P1pp.M1pp is P1 exposed as M1pp macro definitions that sources use
+;; Now P1 is defined as m1pp macros:
+;; P1A.M1pp is the arch-specific backend, P1pp.M1pp is the portable interface
(defn p1exe (P1-src) (exe (m1pp (catm P1A.M1pp P1pp.M1pp P1-src))))
;; To Scheme!
@@ -122,8 +122,8 @@ Here's what I'm aiming for, in Schemey pseudocode:
;; And finally C
(defn scc (C-src) (p1exe (scheme cc.scm C-src)))
(define tcc0 (scc tcc.c)) ;; compiler: scheme cc.scm
-(define tcc1 (tcc0 tcc.c)) ;; compiler: scheme-compiled tcc
-(define tcc (tcc1 tcc.c)) ;; compiler: tcc-compiled tcc
+(define tcc1 (tcc0 tcc.c)) ;; compiler: scheme-compiled tcc
+(define tcc (tcc1 tcc.c)) ;; compiler: tcc-compiled tcc
```
So the new bits we need:
@@ -132,179 +132,145 @@ So the new bits we need:
- Scheme: scheme.P1
- C: cc.scm
-Today, I present just m1pp and P1 defined in it. Next will come a Scheme in P1,
-and finally enough of a C compiler in Scheme to compile the tcc source.
+Today, I present just m1pp and P1. Next will come a Scheme in P1, and finally
+enough of a C compiler in Scheme to compile the tcc source.
+
+## m1pp
+
+m1pp is a small macro expander layered on top of M0, with an aim of making it
+bearable enough to write assembly programs that you might just want to linger
+for a moment way down here near the bottom of the sea. It adds three things M0
+doesn't have: user-defined macros with named parameters, integer expressions
+that evaluate at expand time, and conditional expansion.
+
+Features:
+
+- **Macros**: `%macro NAME(a, b) ... %endm`, called with `%NAME(x, y)`. Macro
+ bodies can themselves contain macro calls, so expansion is recursive.
+- **Integer expressions**: emit little-endian hex from a Lisp expression —
+ `!(expr)` for 8 bits, `@(expr)` for 16, `%(expr)` for 32, `$(expr)` for 64.
+ The punctuation mirrors M0's decimal/hex immediate widths (`!@%`) and adds
+ `$` for the 64-bit case.
+- **Conditional expansion**: `%select(cond, then, else)` evaluates `cond` as
+ an integer and emits exactly one of the two branches. Non-zero is true.
+- **Token concat**: `a ## b` pastes two tokens into one identifier.
+
+Comments (`#` and `;`) and M0 `DEFINE`s pass through untouched. That's it — no
+`%ifdef`, no string manipulation, no floating point. Enough to encode
+instructions and define P1 mnemonics; nothing more.
+
+The expression syntax is, of course, Lisp-shaped:
+
+```
+atoms: decimal or 0x-prefixed integer literals
+calls: (+ a b) (- a b) (* a b) (/ a b) (% a b) (<< a b) (>> a b)
+ (= a b) (!= a b) (< a b) (<= a b) (> a b) (>= a b)
+ (& a b) (| a b) (^ a b) (~ a)
+```
+
+Here's a real macro from the aarch64 P1 backend — the "add register, immediate"
+instruction:
+
+```
+%macro aa64_add_imm(rd, ra, imm12)
+%((| 0x91000000 (<< (& imm12 0xFFF) 10) (<< %aa64_reg(ra) 5) %aa64_reg(rd)))
+%endm
+```
+
+A call like `%aa64_add_imm(a0, a0, 8)` expands to a single Lisp expression
+that ORs together the opcode, the 12-bit immediate shifted into place, the
+source register field, and the destination register field. The outer `%(...)`
+evaluates that expression and emits it as four little-endian hex bytes, which
+M0 and hex2 then hand to the linker. The nested `%aa64_reg(...)` calls in the
+body show that macros can call other macros during expansion — `aa64_reg` is a
+tiny macro per register name that produces the aarch64-native register number
+(e.g. `a0` → 0, `s0` → 19).
+
+That one pattern — Lisp expression in, hex bytes out — covers all per-arch
+instruction encoding.
## P1
-"P1" is a portable pseudo-ISA targetable via M1 DEFINEs.
+P1 is a portable pseudo-ISA, targetable at hex2-ready bytes via m1pp macros.
+Source looks a lot like assembly, but the mnemonics are m1pp calls and each
+arch picks its own encoding.
Registers:
-- `a0`–`a3` — argument registers, and caller-saved general registers.
-- `t0` — caller-saved temporary.
-- `s0`–`s1` — callee-saved general registers.
+- `a0`–`a3` — argument and caller-saved general registers.
+- `t0`–`t2` — caller-saved temporaries.
+- `s0`–`s3` — callee-saved general registers.
- `sp` — stack pointer.
+That's a modest budget — twelve visible registers. The goal is small backends
+and uniform source across a 32-bit RISC-V, an aarch64, and an x86-64. Anything beyond this that a backend needs for encoding (scratch regs,
+link register, zero register) is backend-private and exposed only through the
+portable ops. `LA_BR` is the one exception on the source side: it loads a
+label into a hidden *branch-target register* that the immediately following
+direct control-flow op (`B`, `CALL`, `TAIL`, or any conditional branch) then
+consumes. The register-indirect forms (`BR`, `CALLR`, `TAILR`) take their
+target in an ordinary register instead.
+
+A word is register-sized — 32 bits on 32-bit targets, 64 bits on 64-bit
+targets. `LD`/`ST` move words; `LB`/`SB` move bytes. No 16/32-bit slots in
+between; code that needs them can OR/shift.
+
Ops:
-- Materialization: `LI` (load immediate), `LA` (load address), `LA_BR` (load branch register)
-- Moves: `MOV`
+
+- Materialization: `LI` (load immediate), `LA` (load label address), `LA_BR`
+ (load label into the hidden branch-target register).
+- Moves: `MOV`.
- Arithmetic: `ADD`, `SUB`, `AND`, `OR`, `XOR`, `SHL`, `SHR`, `SAR`, `MUL`,
- `DIV`, `REM`
-- Arithmetic with immediates: `ADDI`, `ANDI`, `ORI`, `SHLI`, `SHRI`, `SARI`
-- Memory: `LD`, `ST`, `LB`, `SB`
-- Branching: `B`, `BR`, `BEQ`, `BNE`, `BLT`, `BEQZ`, `BNEZ`, `BLTZ`
-- Calls / returns: `CALL`, `CALLR`, `RET`, `TAIL`, `TAILR`
-- Frame management: `ENTER`, `LEAVE`
-- ABI arg access: `LDARG`
-- System: `SYSCALL`
-
-It's a dumb little RISC, with a bit of sugar for frame management and tail
-calling: a handful of visible registers, a hidden branch/call target register
-behind `LA_BR`. It's meant to be implemented by a set of per-arch `DEFINEs`,
-aiming for 32-bit and 64-bit x86, ARM, and RISC-V.
+ `DIV`, `REM`.
+- Arithmetic with immediates: `ADDI`, `ANDI`, `ORI`, `SHLI`, `SHRI`, `SARI`.
+ Immediate widths are backend-dependent; the backend errors if the constant
+ doesn't fit.
+- Memory: `LD`, `ST`, `LB`, `SB`.
+- Branching: `B`, `BR`, `BEQ`, `BNE`, `BLT`, `BLTU`, `BEQZ`, `BNEZ`, `BLTZ`.
+ Signed and unsigned less-than; `>=`, `>`, `<=` are synthesized by swapping
+ operands or inverting branch sense.
+- Calls / returns: `CALL`, `CALLR`, `RET`, `TAIL`, `TAILR`.
+- Frame management: `ENTER`, `LEAVE`.
+- ABI arg access: `LDARG` — reads stack-passed incoming args without
+ hard-coding the frame layout.
+- System: `SYSCALL`.
+
+No floating point. P1 is a scalar-integer world; enough for a Scheme and a
+C compiler.
Calling convention:
-- `a0`-`a3` hold the first four argument words.
-- Extra arguments live in an incoming stack-argument area and are read with
- `LDARG`.
-- `a0` is also the one-word return register.
-- `a0`-`a3` and `t0` are caller-saved; `s0`-`s1` and `sp` are callee-saved.
-- `ENTER` builds the standard frame and `LEAVE` tears it down.
-- Stack-passed outgoing args are staged in frame-local space before `CALL`.
-- Wider-than-one-word returns use the usual hidden pointer trick: caller passes
- a writable result buffer in `a0`, real args slide over by one, callee returns
- that same pointer in `a0`.
-
-It ships as a pair of files you catm before your P1 source:
-- P1A.m1pp: architecture-specific backend implementing the portable interface
-- P1.m1pp: the portable interface that P1 sources program against
+- `a0`-`a3` hold the first four argument words; extras live in an incoming
+ stack-argument area, read with `LDARG`.
+- `a0` is the one-word return register. Two-word returns use `a0`/`a1`.
+- `a0`-`a3` and `t0`-`t2` are caller-saved; `s0`-`s3` and `sp` are
+ callee-saved.
+- `ENTER` builds the standard frame; `LEAVE` tears it down.
+- Stack-passed outgoing args are staged in a dedicated frame-local area
+ before `CALL`, so the callee finds them at a known offset from `sp`.
+- Wider-than-two-word returns use the usual hidden-pointer trick: caller
+ passes a writable result buffer in `a0`, real args slide over by one,
+ callee returns that same pointer in `a0`.
-## m1pp
+P1 ships as a pair of files you catm before any P1 source:
-m1pp is a macro expander:
-- Define macros: `%macro NAME(a, b) ... %endm`
-- Use macros: `%NAME(x, y)`. These can appear within macros as well, so it's a
- recursive expansion.
-- Concat tokens: `a ## b`
-- Evaluate arithmetic integer expressions and emit little-endian hex:
- `!(expr)` (8-bit), `@(expr)` (16-bit), `%(expr)` (32-bit), `$(expr)` (64-bit)
-- Conditional expansions: `%select(cond_expr, then, else)`
+- `P1A.M1pp` — architecture-specific backend implementing the portable
+ interface.
+- `P1pp.M1pp` — the portable interface that P1 programs program against.
-The expression syntax is, of course, Lisp-shaped:
+## What it cost
-```
-atoms: decimal or 0x-prefixed integer literals
-calls: (+ a b) (- a b) (* a b) (/ a b) (% a b) (<< a b) (>> a b)
- (= a b) (!= a b) (< a b) (<= a b) (> a b) (>= a b)
- (& a b) (| a b) (^ a b) (~ a)
-```
+Concrete sizes for what's landed today:
-It's enough for all per-arch instruction encoding and defining P1 mnemonics.
+- `m1pp.M1`: ~2.5 KLOC — the M1 source of the macro expander itself.
+- `P1.M1pp`: 224 LOC — the portable P1 interface.
+- `P1-aarch64.M1pp`: ~500 LOC — one arch backend.
-## Example P1 program
+The pitch of this rewrite — shrinking the pre-Scheme KLOC count — can't be
+fully cashed in until scheme.P1 and cc.scm land; that's where the 30KLOC of
+M2-Planet-and-Mes gets compared against. But the shape is already visible:
+once m1pp is up, adding a new architecture is a few hundred lines of macro
+definitions, not a new port of a subset-C compiler. The arch-specific surface
+is the thin file, not the thick one.
-```
-## copy.P1
-
-DEFINE BUF_SIZE %128
-DEFINE SAVE_S0 %16
-DEFINE SAVE_S1 %24
-DEFINE COPY_LOCALS %144
-
-## Entry convention:
-## a0 = argc
-## a1 = argv
-:main
- %li(t0, 2)
- %blt(a0, t0, usage) ; if argc < 2, exit 2
-
- %ld(t0, a1, WORD) ; t0 = argv[1]
-
- %li(a0, SYS_open)
- %mov(a1, t0) ; path
- %li(a2, O_RDONLY) ; flags
- %li(a3, 0) ; mode
- %syscall()
- %bltz(a0, fail) ; open failed
-
- %mov(s0, a0) ; s0 = fd
- %mov(a0, s0)
- %call(copy_to_stdout) ; returns 0 or -1 in a0
- %bltz(a0, close_fail)
-
- %li(a0, SYS_close)
- %mov(a1, s0)
- %syscall()
-
- %li(a0, SYS_exit)
- %li(a1, 0)
- %syscall()
-
-:close_fail
- %li(a0, SYS_close)
- %mov(a1, s0)
- %syscall()
-
-:fail
- %li(a0, SYS_exit)
- %li(a1, 1)
- %syscall()
-
-:usage
- %li(a0, SYS_exit)
- %li(a1, 2)
- %syscall()
-
-
-:copy_to_stdout
- %enter(COPY_LOCALS)
- %st(sp, SAVE_S0, s0)
- %st(sp, SAVE_S1, s1)
-
- %mov(s0, a0) ; s0 = input fd
-
-:read_loop
- %li(a0, SYS_read)
- %mov(a1, s0) ; fd
- %mov(a2, sp)
- %addi(a2, a2, BUF_OFF) ; buf = sp + BUF_OFF
- %li(a3, BUF_SIZE) ; count
- %syscall()
-
- %beqz(a0, done) ; EOF
- %bltz(a0, io_error) ; read error
-
- %mov(s1, a0) ; remaining byte count
- %mov(t0, sp)
- %addi(t0, t0, BUF_OFF) ; t0 = current write ptr
-
-:write_loop
- %li(a0, SYS_write)
- %li(a1, 1) ; stdout
- %mov(a2, t0) ; buf
- %mov(a3, s1) ; count
- %syscall()
-
- %bltz(a0, io_error) ; write error
-
- %add(t0, t0, a0) ; buf += nwritten
- %sub(s1, s1, a0) ; remaining -= nwritten
- %bnez(s1, write_loop)
- %b(read_loop)
-
-:done
- %li(a0, 0)
- %ld(s0, sp, SAVE_S0)
- %ld(s1, sp, SAVE_S1)
- %leave()
- %ret()
-
-:io_error
- %li(a0, -1)
- %ld(s0, sp, SAVE_S0)
- %ld(s1, sp, SAVE_S1)
- %leave()
- %ret()
-```
+Next post: a Scheme interpreter written in P1.