boot2

Playing with the boostrap
git clone https://git.ryansepassi.com/git/boot2.git
Log | Files | Refs

commit 90d6018d786c42cf3a053508647144dceca2167e
parent add01712f59b79acb21ef42924bdb260357ef6b8
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Thu, 23 Apr 2026 11:52:00 -0700

post update

Diffstat:
Mpost.md | 318+++++++++++++++++++++++++++++++++++--------------------------------------------
1 file changed, 142 insertions(+), 176 deletions(-)

diff --git a/post.md b/post.md @@ -15,18 +15,19 @@ of course, by another compiler! Uh oh. Infinite regress? Not quite. It bottoms out in something directly specified in machine code, a "binary seed" from which the rest sprouts. -For about 30 years though, we had lost the seed. We were just going around -using whatever compiler was on hand, figuring that there always would be one -on hand, and trusting that it would do what it said on its tin. People noticed -it was a bit odd that we had lost the chain, that we couldn't compile the -compiler if we had lost the compiler, but not much was done about it. +For decades, though, we had lost the seed. We were just going around using +whatever compiler was on hand, figuring that there always would be one on hand, +and trusting that it would do what it said on its tin. People noticed it was a +bit odd that we had lost the chain, that we couldn't compile the compiler if +we had lost the compiler, but not much was done about it. In 1983, Ken Thompson told a little horror story that spooked folks, but still, it didn't really change much. Like many good horror stories, this one was about a little invisible demon. Someone had put a virus into the compiler that replicated itself into all future compilers and would scrub itself out of existence if you ever tried compiling a program that went looking for it. All -of our compilers were infected and none of us knew. Shivers. +of our compilers were infected and none of us knew. Shivers. (Search for +"Trusting Trust" if you want to read the full story). ## Bottle universe @@ -89,31 +90,30 @@ alternative path. One that hopefully will knock down that KLOC count and maybe tidy up this little bottle universe that exists at the bottom of the software sea. -There are 2 parts. First, get to portability sooner, to knock down the amount -of arch-specific code that's needed. I do this by defining a pseudo-ISA called -P1 that can be mapped to each arch via simple expansions. Second, define a -richer macro preprocessor to make P1 programming more ergonomic, ergonomic -enough to... Third, define a Scheme interpreter in P1. Then this Scheme becomes -the base to define a shell and a C compiler. +Three moves. First, write a richer macro preprocessor called m1pp (ht Bjarne) +that makes low-level programming ergonomic enough to skip the subset-C step. +Second, get to portability sooner, by defining a pseudo-ISA called P1 in m1pp +that each arch maps to via small expansions. Third, use P1 to host a Scheme, +and use that Scheme to host a C compiler. Here's what I'm aiming for, in Schemey pseudocode: ``` ;; Bootstrap from seed, same as before -(define hex2 (((hex0-seed hex0.hex0) hex1.hex0) hex2.hex1)) +(define hex0 (hex0-seed hex0.hex0)) +(define hex1 (hex0 hex1.hex0)) +(define hex2 (hex1 hex2.hex1)) (define catm (hex2 catm.hex2)) (define M0 (hex2 (catm ELF.hex2 M0.hex2))) ;; Compile+Link (defn exe (M1-src) (hex2 (catm ELF.hex2 (M0 M1-src)))) -;; We define a portable pseudo-ISA called P1, and use that to define a richer -;; macro expander called m1pp (ht Bjarne) +;; m1pp itself is written in M1 and built via the M1 path (define m1pp (exe (catm P1.M1 m1pp.M1))) -;; From there, we redefine P1 in terms of m1pp, and use P1 sources -;; P1A.M1pp is the arch-specific "backend" file -;; P1pp.M1pp is P1 exposed as M1pp macro definitions that sources use +;; Now P1 is defined as m1pp macros: +;; P1A.M1pp is the arch-specific backend, P1pp.M1pp is the portable interface (defn p1exe (P1-src) (exe (m1pp (catm P1A.M1pp P1pp.M1pp P1-src)))) ;; To Scheme! @@ -122,8 +122,8 @@ Here's what I'm aiming for, in Schemey pseudocode: ;; And finally C (defn scc (C-src) (p1exe (scheme cc.scm C-src))) (define tcc0 (scc tcc.c)) ;; compiler: scheme cc.scm -(define tcc1 (tcc0 tcc.c)) ;; compiler: scheme-compiled tcc -(define tcc (tcc1 tcc.c)) ;; compiler: tcc-compiled tcc +(define tcc1 (tcc0 tcc.c)) ;; compiler: scheme-compiled tcc +(define tcc (tcc1 tcc.c)) ;; compiler: tcc-compiled tcc ``` So the new bits we need: @@ -132,179 +132,145 @@ So the new bits we need: - Scheme: scheme.P1 - C: cc.scm -Today, I present just m1pp and P1 defined in it. Next will come a Scheme in P1, -and finally enough of a C compiler in Scheme to compile the tcc source. +Today, I present just m1pp and P1. Next will come a Scheme in P1, and finally +enough of a C compiler in Scheme to compile the tcc source. + +## m1pp + +m1pp is a small macro expander layered on top of M0, with an aim of making it +bearable enough to write assembly programs that you might just want to linger +for a moment way down here near the bottom of the sea. It adds three things M0 +doesn't have: user-defined macros with named parameters, integer expressions +that evaluate at expand time, and conditional expansion. + +Features: + +- **Macros**: `%macro NAME(a, b) ... %endm`, called with `%NAME(x, y)`. Macro + bodies can themselves contain macro calls, so expansion is recursive. +- **Integer expressions**: emit little-endian hex from a Lisp expression — + `!(expr)` for 8 bits, `@(expr)` for 16, `%(expr)` for 32, `$(expr)` for 64. + The punctuation mirrors M0's decimal/hex immediate widths (`!@%`) and adds + `$` for the 64-bit case. +- **Conditional expansion**: `%select(cond, then, else)` evaluates `cond` as + an integer and emits exactly one of the two branches. Non-zero is true. +- **Token concat**: `a ## b` pastes two tokens into one identifier. + +Comments (`#` and `;`) and M0 `DEFINE`s pass through untouched. That's it — no +`%ifdef`, no string manipulation, no floating point. Enough to encode +instructions and define P1 mnemonics; nothing more. + +The expression syntax is, of course, Lisp-shaped: + +``` +atoms: decimal or 0x-prefixed integer literals +calls: (+ a b) (- a b) (* a b) (/ a b) (% a b) (<< a b) (>> a b) + (= a b) (!= a b) (< a b) (<= a b) (> a b) (>= a b) + (& a b) (| a b) (^ a b) (~ a) +``` + +Here's a real macro from the aarch64 P1 backend — the "add register, immediate" +instruction: + +``` +%macro aa64_add_imm(rd, ra, imm12) +%((| 0x91000000 (<< (& imm12 0xFFF) 10) (<< %aa64_reg(ra) 5) %aa64_reg(rd))) +%endm +``` + +A call like `%aa64_add_imm(a0, a0, 8)` expands to a single Lisp expression +that ORs together the opcode, the 12-bit immediate shifted into place, the +source register field, and the destination register field. The outer `%(...)` +evaluates that expression and emits it as four little-endian hex bytes, which +M0 and hex2 then hand to the linker. The nested `%aa64_reg(...)` calls in the +body show that macros can call other macros during expansion — `aa64_reg` is a +tiny macro per register name that produces the aarch64-native register number +(e.g. `a0` → 0, `s0` → 19). + +That one pattern — Lisp expression in, hex bytes out — covers all per-arch +instruction encoding. ## P1 -"P1" is a portable pseudo-ISA targetable via M1 DEFINEs. +P1 is a portable pseudo-ISA, targetable at hex2-ready bytes via m1pp macros. +Source looks a lot like assembly, but the mnemonics are m1pp calls and each +arch picks its own encoding. Registers: -- `a0`–`a3` — argument registers, and caller-saved general registers. -- `t0` — caller-saved temporary. -- `s0`–`s1` — callee-saved general registers. +- `a0`–`a3` — argument and caller-saved general registers. +- `t0`–`t2` — caller-saved temporaries. +- `s0`–`s3` — callee-saved general registers. - `sp` — stack pointer. +That's a modest budget — twelve visible registers. The goal is small backends +and uniform source across a 32-bit RISC-V, an aarch64, and an x86-64. Anything beyond this that a backend needs for encoding (scratch regs, +link register, zero register) is backend-private and exposed only through the +portable ops. `LA_BR` is the one exception on the source side: it loads a +label into a hidden *branch-target register* that the immediately following +direct control-flow op (`B`, `CALL`, `TAIL`, or any conditional branch) then +consumes. The register-indirect forms (`BR`, `CALLR`, `TAILR`) take their +target in an ordinary register instead. + +A word is register-sized — 32 bits on 32-bit targets, 64 bits on 64-bit +targets. `LD`/`ST` move words; `LB`/`SB` move bytes. No 16/32-bit slots in +between; code that needs them can OR/shift. + Ops: -- Materialization: `LI` (load immediate), `LA` (load address), `LA_BR` (load branch register) -- Moves: `MOV` + +- Materialization: `LI` (load immediate), `LA` (load label address), `LA_BR` + (load label into the hidden branch-target register). +- Moves: `MOV`. - Arithmetic: `ADD`, `SUB`, `AND`, `OR`, `XOR`, `SHL`, `SHR`, `SAR`, `MUL`, - `DIV`, `REM` -- Arithmetic with immediates: `ADDI`, `ANDI`, `ORI`, `SHLI`, `SHRI`, `SARI` -- Memory: `LD`, `ST`, `LB`, `SB` -- Branching: `B`, `BR`, `BEQ`, `BNE`, `BLT`, `BEQZ`, `BNEZ`, `BLTZ` -- Calls / returns: `CALL`, `CALLR`, `RET`, `TAIL`, `TAILR` -- Frame management: `ENTER`, `LEAVE` -- ABI arg access: `LDARG` -- System: `SYSCALL` - -It's a dumb little RISC, with a bit of sugar for frame management and tail -calling: a handful of visible registers, a hidden branch/call target register -behind `LA_BR`. It's meant to be implemented by a set of per-arch `DEFINEs`, -aiming for 32-bit and 64-bit x86, ARM, and RISC-V. + `DIV`, `REM`. +- Arithmetic with immediates: `ADDI`, `ANDI`, `ORI`, `SHLI`, `SHRI`, `SARI`. + Immediate widths are backend-dependent; the backend errors if the constant + doesn't fit. +- Memory: `LD`, `ST`, `LB`, `SB`. +- Branching: `B`, `BR`, `BEQ`, `BNE`, `BLT`, `BLTU`, `BEQZ`, `BNEZ`, `BLTZ`. + Signed and unsigned less-than; `>=`, `>`, `<=` are synthesized by swapping + operands or inverting branch sense. +- Calls / returns: `CALL`, `CALLR`, `RET`, `TAIL`, `TAILR`. +- Frame management: `ENTER`, `LEAVE`. +- ABI arg access: `LDARG` — reads stack-passed incoming args without + hard-coding the frame layout. +- System: `SYSCALL`. + +No floating point. P1 is a scalar-integer world; enough for a Scheme and a +C compiler. Calling convention: -- `a0`-`a3` hold the first four argument words. -- Extra arguments live in an incoming stack-argument area and are read with - `LDARG`. -- `a0` is also the one-word return register. -- `a0`-`a3` and `t0` are caller-saved; `s0`-`s1` and `sp` are callee-saved. -- `ENTER` builds the standard frame and `LEAVE` tears it down. -- Stack-passed outgoing args are staged in frame-local space before `CALL`. -- Wider-than-one-word returns use the usual hidden pointer trick: caller passes - a writable result buffer in `a0`, real args slide over by one, callee returns - that same pointer in `a0`. - -It ships as a pair of files you catm before your P1 source: -- P1A.m1pp: architecture-specific backend implementing the portable interface -- P1.m1pp: the portable interface that P1 sources program against +- `a0`-`a3` hold the first four argument words; extras live in an incoming + stack-argument area, read with `LDARG`. +- `a0` is the one-word return register. Two-word returns use `a0`/`a1`. +- `a0`-`a3` and `t0`-`t2` are caller-saved; `s0`-`s3` and `sp` are + callee-saved. +- `ENTER` builds the standard frame; `LEAVE` tears it down. +- Stack-passed outgoing args are staged in a dedicated frame-local area + before `CALL`, so the callee finds them at a known offset from `sp`. +- Wider-than-two-word returns use the usual hidden-pointer trick: caller + passes a writable result buffer in `a0`, real args slide over by one, + callee returns that same pointer in `a0`. -## m1pp +P1 ships as a pair of files you catm before any P1 source: -m1pp is a macro expander: -- Define macros: `%macro NAME(a, b) ... %endm` -- Use macros: `%NAME(x, y)`. These can appear within macros as well, so it's a - recursive expansion. -- Concat tokens: `a ## b` -- Evaluate arithmetic integer expressions and emit little-endian hex: - `!(expr)` (8-bit), `@(expr)` (16-bit), `%(expr)` (32-bit), `$(expr)` (64-bit) -- Conditional expansions: `%select(cond_expr, then, else)` +- `P1A.M1pp` — architecture-specific backend implementing the portable + interface. +- `P1pp.M1pp` — the portable interface that P1 programs program against. -The expression syntax is, of course, Lisp-shaped: +## What it cost -``` -atoms: decimal or 0x-prefixed integer literals -calls: (+ a b) (- a b) (* a b) (/ a b) (% a b) (<< a b) (>> a b) - (= a b) (!= a b) (< a b) (<= a b) (> a b) (>= a b) - (& a b) (| a b) (^ a b) (~ a) -``` +Concrete sizes for what's landed today: -It's enough for all per-arch instruction encoding and defining P1 mnemonics. +- `m1pp.M1`: ~2.5 KLOC — the M1 source of the macro expander itself. +- `P1.M1pp`: 224 LOC — the portable P1 interface. +- `P1-aarch64.M1pp`: ~500 LOC — one arch backend. -## Example P1 program +The pitch of this rewrite — shrinking the pre-Scheme KLOC count — can't be +fully cashed in until scheme.P1 and cc.scm land; that's where the 30KLOC of +M2-Planet-and-Mes gets compared against. But the shape is already visible: +once m1pp is up, adding a new architecture is a few hundred lines of macro +definitions, not a new port of a subset-C compiler. The arch-specific surface +is the thin file, not the thick one. -``` -## copy.P1 - -DEFINE BUF_SIZE %128 -DEFINE SAVE_S0 %16 -DEFINE SAVE_S1 %24 -DEFINE COPY_LOCALS %144 - -## Entry convention: -## a0 = argc -## a1 = argv -:main - %li(t0, 2) - %blt(a0, t0, usage) ; if argc < 2, exit 2 - - %ld(t0, a1, WORD) ; t0 = argv[1] - - %li(a0, SYS_open) - %mov(a1, t0) ; path - %li(a2, O_RDONLY) ; flags - %li(a3, 0) ; mode - %syscall() - %bltz(a0, fail) ; open failed - - %mov(s0, a0) ; s0 = fd - %mov(a0, s0) - %call(copy_to_stdout) ; returns 0 or -1 in a0 - %bltz(a0, close_fail) - - %li(a0, SYS_close) - %mov(a1, s0) - %syscall() - - %li(a0, SYS_exit) - %li(a1, 0) - %syscall() - -:close_fail - %li(a0, SYS_close) - %mov(a1, s0) - %syscall() - -:fail - %li(a0, SYS_exit) - %li(a1, 1) - %syscall() - -:usage - %li(a0, SYS_exit) - %li(a1, 2) - %syscall() - - -:copy_to_stdout - %enter(COPY_LOCALS) - %st(sp, SAVE_S0, s0) - %st(sp, SAVE_S1, s1) - - %mov(s0, a0) ; s0 = input fd - -:read_loop - %li(a0, SYS_read) - %mov(a1, s0) ; fd - %mov(a2, sp) - %addi(a2, a2, BUF_OFF) ; buf = sp + BUF_OFF - %li(a3, BUF_SIZE) ; count - %syscall() - - %beqz(a0, done) ; EOF - %bltz(a0, io_error) ; read error - - %mov(s1, a0) ; remaining byte count - %mov(t0, sp) - %addi(t0, t0, BUF_OFF) ; t0 = current write ptr - -:write_loop - %li(a0, SYS_write) - %li(a1, 1) ; stdout - %mov(a2, t0) ; buf - %mov(a3, s1) ; count - %syscall() - - %bltz(a0, io_error) ; write error - - %add(t0, t0, a0) ; buf += nwritten - %sub(s1, s1, a0) ; remaining -= nwritten - %bnez(s1, write_loop) - %b(read_loop) - -:done - %li(a0, 0) - %ld(s0, sp, SAVE_S0) - %ld(s1, sp, SAVE_S1) - %leave() - %ret() - -:io_error - %li(a0, -1) - %ld(s0, sp, SAVE_S0) - %ld(s1, sp, SAVE_S1) - %leave() - %ret() -``` +Next post: a Scheme interpreter written in P1.