boot2

Playing with the boostrap
git clone https://git.ryansepassi.com/git/boot2.git
Log | Files | Refs

commit 932b2491b2cd4b0d21da8e12eb8796f6deb74da5
parent 67f4606e7e4c31112760ac9855384a4315e3a7df
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Thu, 23 Apr 2026 11:15:21 -0700

post update

Diffstat:
Mpost.md | 245++++++++++++++++++++++++++++++++++++++++++++++---------------------------------
1 file changed, 143 insertions(+), 102 deletions(-)

diff --git a/post.md b/post.md @@ -1,31 +1,45 @@ -# Title +# Playing with the bootstrap -Your compiler compiles your code, but what compiled the compiler? And the one -before that? +## Where does it all come from? -Ken Thompson freaked everyone out in 1983 by describing a nasty -self-replicating attack on compilers themselves that reminded folks that it -might be a good idea to have a "full bootstrap" from as small a binary seed as -could be managed and then audited source from there forward to the toolchains -we know and (maybe-kinda) love today. +Computers execute programs. Programs are specified as machine code, binary +blobs loaded into memory that the processor interprets on the fly. -Jeremiah Orians and Jan Nieuwenhuizen delivered exactly that to the world in -2023, going from `hex0-seed`, a few hundred byte seed binary that is the "hex0" -compiler, through Fabrice Bellard's Tiny C Compiler (TCC), and then up through -GCC and the rest of the stack. Truly awesome work. +Humans write programs (well, some humans do, and now many non-humans as well). +But humans specify programs in programming languages, not machine code. It's +the job of a compiler to translate to machine code. -Between "hex0" and TCC, there's a minimal assembler and linker in M0 and hex2, -then a "subset C" compiler that builds out the Mes Scheme ecosystem, then -a Mes C compiler that compiles TCC. +The compiler is itself a program. Where did it come from? Well, it was compiled +of course, by another compiler! -From seed to M0 and hex2 is ~2KLOC. The "middle" stage is ~25KLOC to get to -Mes Scheme and a collection of build and packaging tools. And the "late" stage -of the Mes C compiler and TCC sources are another ~25KLOC. +Uh oh. Infinite regress? Not quite. It bottoms out in something directly +specified in machine code, a "binary seed" from which the rest sprouts. -I find M0 and hex2 to be very cute and everything after that to be rather -goopy. So, let's dive into the cuteness! +For about 30 years though, we had lost the seed. We were just going around +using whatever compiler was on hand, figuring that there always would be one +on hand, and trusting that it would do what it said on its tin. People noticed +it was a bit odd that we had lost the chain, that we couldn't compile the +compiler if we had lost the compiler, but not much was done about it. -Getting there is fairly straightforward: +In 1983, Ken Thompson told a little horror story that spooked folks, but still, +it didn't really change much. Like many good horror stories, this one was about +a little invisible demon. Someone had put a virus into the compiler that +replicated itself into all future compilers and would scrub itself out of +existence if you ever tried compiling a program that went looking for it. All +of our compilers were infected and none of us knew. Shivers. + +## Bottle universe + +Nobody did anything about it. Until Jeremiah Orians and Jan Nieuwenhuizen +decided to do something about it, and in 2023, they delivered "the full +bootstrap". A few hundred bytes of binary seed that defined a little language +called "hex0", and then source files from there on until they could compile +Fabrice Bellard's Tiny C Compiler (TCC). Once you have TCC, you compile an old +version of GCC that was still written in C (GCC moved to C++), then compile a +more recent GCC, then you use that to compile everything else, including Linux. +Truly amazing work. + +Jeremiah's early chain is quite beautiful: ``` hex0-seed = given @@ -38,60 +52,117 @@ M0 = hex2(catm(ELF.hex2, M0.hex2)) ``` **hex0**: the minimal seed hex compiler; it turns pairs of hexadecimal digits -into raw bytes. Supports # and ; line comments. +into raw bytes. Supports `#` and `;` line comments. **hex1**: a small extension of hex0 that adds the first notion of symbolic -control flow. Features everything in hex0, single-character labels, one size -of relative jump/offset encoding. +control flow. Adds single-character labels and 32-bit relative offsets to a +label. **hex2**: the final "hex language" and first real linker-like stage in the -chain. Features everything in hex1, labels of arbitrary length, relative -references of multiple sizes (!, @, %), absolute references ($, &), enough -symbol/address resolution to link later stages. +chain. Extends hex1 with labels of arbitrary length, 8/16/32-bit relative +references (!, @, %), and 16/32-bit absolute references ($, &). + +**catm**: a minimal file concatenation utility, takes N files and outputs 1. + +**M0**: a basic macro preprocessor built on top of hex2, used as a mini +assembler. Features DEFINE substitution of names to hex bytes, hex and ascii +strings, 8/16/32-bit decimal and hex immediates (!, @, %). + +From this base, they build a "subset C" compiler and a mini shell called "kaem" +in M0, then use that to build up to a little userspace called "M2-Planet" and +the Mes Scheme interpreter and ecosystem. They then define another C compiler +in Scheme (MesCC), and use that to compile TCC. And then you're off to the races. + +There's something elegant and simple and beautiful about this bottle universe +they constructed: an assembler and linker, a Scheme, a C. Timeless. What more +do you need really? + +## An alternative path + +Up through M0, it's about 2000 lines of code (2KLOC). But it gets goopy quick. +It's ~30KLOC to get from M0 to MesCC, and then ~20KLOC for TCC itself. And +the early bits defined in M1, like the subset C compiler, are all +arch-specific. + +For fun, learning, and tremendous profit, I thought I'd play around with an +alternative path. One that hopefully will knock down that KLOC count and maybe +tidy up this little bottle universe that exists at the bottom of the software +sea. + +There are 2 parts. First, get to portability sooner, to knock down the amount +of arch-specific code that's needed. I do this by defining a pseudo-ISA called +P1 that can be mapped to each arch via simple expansions. Second, define a +richer macro preprocessor to make P1 programming more ergonomic, ergonomic +enough to... Third, define a Scheme interpreter in P1. Then this Scheme becomes +the base to define a shell and a C compiler. -**catm**: a minimal file concatenation utility, take N files and output 1. +Here's what I'm aiming for, in Schemey pseudocode: -**M0**: the first assembler built on top of hex2. Features DEFINE substitution -of names to hex bytes, immediate values in various sizes/formats, symbolic -assembly over hex2. +``` +;; Bootstrap from seed, same as before +(define hex2 (((hex0-seed hex0.hex0) hex1.hex0) hex2.hex1)) +(define catm (hex2 catm.hex2)) +(define M0 (hex2 (catm ELF.hex2 M0.hex2))) + +;; Compile+Link +(defn exe (M1-src) (hex2 (catm ELF.hex2 (M0 M1-src)))) + +;; We define a portable pseudo-ISA called P1, and use that to define a richer +;; macro expander called m1pp (ht Bjarne) +(define m1pp (exe (catm P1.M1 m1pp.M1))) + +;; From there, we redefine P1 in terms of m1pp, and use P1 sources +;; P1A.M1pp is the arch-specific "backend" file +;; P1pp.M1pp is P1 exposed as M1pp macro definitions that sources use +(defn p1exe (P1-src) (exe (m1pp (catm P1A.M1pp P1pp.M1pp P1-src)))) + +;; To Scheme! +(define scheme (p1exe scheme.P1)) + +;; And finally C +(defn scc (C-src) (p1exe (scheme cc.scm C-src))) +(define tcc0 (scc tcc.c)) ;; compiler: scheme cc.scm +(define tcc1 (tcc0 tcc.c)) ;; compiler: scheme-compiled tcc +(define tcc (tcc1 tcc.c)) ;; compiler: tcc-compiled tcc +``` -M0 programs are architecture-specific because the DEFINEd set of mnemonics and -their byte-level encodings are architecture-specific. So that means that the -subset C compiler, defined in M0, has to be written once per architecture. -That's a drag. +So the new bits we need: +- m1pp: P1.M1, m1pp.M1 +- P1: P1A.M1pp, P1pp.M1pp +- Scheme: scheme.P1 +- C: cc.scm -I thought it'd be rather fun to play around at this low level and see if we -could build off M0 in a way that might get us to a functioning C compiler -through a path that cuts down on the amount of code in that "middle" layer. +Today, I present just m1pp and P1 defined in it. Next will come a Scheme in P1, +and finally enough of a C compiler in Scheme to compile the tcc source. -So I present to you "P1", a portable pseudo-ISA targetable via M1 DEFINEs, and -"M1M", a macro expander for M1, written in P1. +## P1 -P1 +"P1" is a portable pseudo-ISA targetable via M1 DEFINEs. Registers: -- `a0`–`a3` — argument registers. Also caller-saved general registers. +- `a0`–`a3` — argument registers, and caller-saved general registers. - `t0` — caller-saved temporary. - `s0`–`s1` — callee-saved general registers. - `sp` — stack pointer. Ops: -- Materialization: `LI`, `LA`, `LA_BR` +- Materialization: `LI` (load immediate), `LA` (load address), `LA_BR` (load branch register) - Moves: `MOV` - Arithmetic: `ADD`, `SUB`, `AND`, `OR`, `XOR`, `SHL`, `SHR`, `SAR`, `MUL`, `DIV`, `REM` -- Immediate arithmetic: `ADDI`, `ANDI`, `ORI`, `SHLI`, `SHRI`, `SARI` +- Arithmetic with immediates: `ADDI`, `ANDI`, `ORI`, `SHLI`, `SHRI`, `SARI` - Memory: `LD`, `ST`, `LB`, `SB` - Branching: `B`, `BR`, `BEQ`, `BNE`, `BLT`, `BEQZ`, `BNEZ`, `BLTZ` - Calls / returns: `CALL`, `CALLR`, `RET`, `TAIL`, `TAILR` - Frame management: `ENTER`, `LEAVE` -- ABI access: `LDARG` +- ABI arg access: `LDARG` - System: `SYSCALL` -It's a deliberately dumb little RISC: a handful of visible registers, a hidden -branch/call target register behind `LA_BR`, and a generator that spits out the -per-arch `DEFINE` tables for 32-bit and 64-bit x86, ARM, and RISC-V. +It's a dumb little RISC, with a bit of sugar for frame management and tail +calling: a handful of visible registers, a hidden branch/call target register +behind `LA_BR`. It's meant to be implemented by a set of per-arch `DEFINEs`, +aiming for 32-bit and 64-bit x86, ARM, and RISC-V. Calling convention: @@ -100,81 +171,51 @@ Calling convention: `LDARG`. - `a0` is also the one-word return register. - `a0`-`a3` and `t0` are caller-saved; `s0`-`s1` and `sp` are callee-saved. -- If you need a frame, `ENTER` builds the standard one and `LEAVE` tears it - down. +- `ENTER` builds the standard frame and `LEAVE` tears it down. - Stack-passed outgoing args are staged in frame-local space before `CALL`. - Wider-than-one-word returns use the usual hidden pointer trick: caller passes a writable result buffer in `a0`, real args slide over by one, callee returns that same pointer in `a0`. -That's enough to write a portable macro expander, a Lisp, or a tiny C -compiler once, then retarget by swapping out the generated M1 defs instead of -rewriting the whole program per architecture. That cuts down auditable code at -the bottom of the stack quite a bit. And it may give us a shorter path from -M0 to TCC (but no promises). Mostly it's just a cute base to work on. +It ships as a pair of files you catm before your P1 source: +- P1A.m1pp: architecture-specific backend implementing the portable interface +- P1.m1pp: the portable interface that P1 sources program against -``` -;; Bootstrap from seed -(define hex0 (hex0-seed hex0.hex0)) -(define hex1 (hex0 hex1.hex0)) -(define hex2 (hex1 hex2.hex1)) -(define catm (hex2 catm.hex2)) -(define M0 (hex2 (catm ELF.hex2 M0.hex2))) - -(defn link (hex2-src) (hex2 (catm ELF.hex2 hex2-src))) - -;; Bootstrap the macro expander once without macros. -(define m1m (link (M0 (catm P1.M1 m1m.M1)))) +## m1pp -;; Normal P1 build: expand macros, then assemble/link. -;; P1A.M1=arch-specific backend, P1M.M1=P1 macros -(defn p1compile (src) (M0 (m1m (catm P1A.M1 P1M.M1 src)))) -(defn p1exe (src) (link (p1compile src))) -``` - -Now with P1, with macro support, we have a half-decent portable assembler. -We use that to build a Scheme with file IO and process support. - -``` -;; Build the language that will host the rest of the climb. -(define scheme (p1exe scheme.P1)) -``` +m1pp is a macro expander: +- Define macros: `%macro NAME(a, b) ... %endm` +- Use macros: `%NAME(x, y)`. These can appear within macros as well, so it's a + recursive expansion. +- Concat tokens: `a ## b` +- Evaluate arithmetic integer expressions and emit little-endian hex: + `!(expr)` (8-bit), `@(expr)` (16-bit), `%(expr)` (32-bit), `$(expr)` (64-bit) +- Conditional expansions: `%select(cond_expr, then, else)` -And then Scheme is our shell and our C compiler. +The expression syntax is, of course, Lisp-shaped: ``` -(defn scc (src) (scheme cc.scm src)) -(defn ccexe (src) (p1exe (scc src))) +atoms: decimal or 0x-prefixed integer literals +calls: (+ a b) (- a b) (* a b) (/ a b) (% a b) (<< a b) (>> a b) + (= a b) (!= a b) (< a b) (<= a b) (> a b) (>= a b) + (& a b) (| a b) (^ a b) (~ a) ``` -So, just like in the current bootstrap, a unified source TCC would then compile -as: +It's enough for all per-arch instruction encoding and defining P1 mnemonics. -``` -(define tcc0 (ccexe tcc.c)) ;; compiler: scheme -(define tcc1 (tcc0 tcc.c)) ;; compiler: scheme-compiled tcc -(define tcc (tcc1 tcc.c)) ;; compiler: tcc-compiled tcc -``` - -So, what does P1 code with the macro expansion base look like? +## Example P1 program ``` -## copy.M1 -## -## And symbolic constants: -## SYS_open SYS_read SYS_write SYS_close SYS_exit O_RDONLY -## WORD -## SAVE_S0 SAVE_S1 BUF_OFF COPY_LOCALS BUF_SIZE -## -## Entry convention: -## a0 = argc -## a1 = argv +## copy.P1 DEFINE BUF_SIZE %128 DEFINE SAVE_S0 %16 DEFINE SAVE_S1 %24 DEFINE COPY_LOCALS %144 +## Entry convention: +## a0 = argc +## a1 = argv :main %li(t0, 2) %blt(a0, t0, usage) ; if argc < 2, exit 2