Playing with the bootstrap

Where does it all come from?

Computers execute programs. Programs are specified as machine code, binary blobs loaded into memory that the processor interprets on the fly.

Humans write programs (well, some humans do, and now many non-humans as well). But humans specify programs in programming languages, not machine code. It's the job of a compiler to translate to machine code.

The compiler is itself a program. Where did it come from? Well, it was compiled of course, by another compiler!

Uh oh. Infinite regress? Not quite. It bottoms out in something directly specified in machine code, a "binary seed" from which the rest sprouts.

For decades, though, we had lost the seed. We were just going around using whatever compiler was on hand, figuring that there always would be one on hand, and trusting that it would do what it said on its tin. People noticed it was a bit odd that we had lost the chain, that we couldn't compile the compiler if we lost the compiler, but not much was done about it.

In 1983, Ken Thompson told a little horror story that spooked folks, but still, it didn't really change much. Like many good horror stories, this one was about a little invisible demon. Someone had put a virus into the compiler that replicated itself into all future compilers but would scrub itself out of existence if you ever tried compiling a program that went looking for it. All of our compilers were infected and none of us knew. Shivers. (Search for "Trusting Trust" if you want to read the full story).

Bottle universe

Nobody did anything about it until someone did. Our heroes are Jeremiah Orians and Jan Nieuwenhuizen; in 2023, they delivered "THE FULL BOOTSTRAP" (tm). A few hundred bytes of binary seed that defined a little language called "hex0", and then source files from there on until they could compile Fabrice Bellard's Tiny C Compiler (TCC). Once you have TCC, you compile an old version of GCC that was still written in C (GCC has long since moved to C++, woe be unto us all), then compile a more recent GCC, then you use that to compile everything else, including Linux. Truly amazing work.

Jeremiah's early chain is quite beautiful:

hex0-seed = given

hex0 = hex0-seed(hex0.hex0)  # hex0 compiles itself from source
hex1 = hex0(hex1.hex0)
hex2 = hex1(hex2.hex1)
catm = hex2(catm.hex2)
M0 = hex2(catm(ELF.hex2, M0.hex2))

hex0: the minimal seed hex compiler; it turns pairs of hexadecimal digits into raw bytes. Supports # and ; line comments.

hex1: a small extension of hex0 that adds the first notion of symbolic control flow. Adds single-character labels and 32-bit relative offsets to a label.

hex2: the final "hex language" and first real linker-like stage in the chain. Extends hex1 with labels of arbitrary length, 8/16/32-bit relative references (!, @, %), and 16/32-bit absolute references ($, &).

catm: a minimal file concatenation utility, takes N files and outputs 1.

M0: a basic macro preprocessor built on top of hex2, used as a mini assembler. Features DEFINE substitution of names to hex bytes, hex and ascii strings, 8/16/32-bit decimal and hex immediates (!, @, %).

From this base, they build a "subset C" compiler and a mini shell called "kaem" in M0 (per architecture), then use that to build up to a little userspace called "M2-Planet" and the Mes Scheme interpreter and ecosystem. They then define another C compiler in Scheme (MesCC), and use that to compile TCC. And then you're off to the races.

There's something elegant and simple and beautiful about this bottle universe they constructed: an assembler and linker, a Scheme, a C. Timeless. What more do you need really?

An alternative path

Up through M0, it's about 2000 lines of code (2KLOC) per architecture. But it gets goopy quick. It's ~30KLOC to get from M0 to MesCC, and then ~20KLOC for TCC itself. And the early bits defined in M1, like the subset C compiler, are all arch-specific.

For fun, learning, and tremendous profit, I thought I'd play around with an alternative path. One that hopefully will knock down that KLOC count and maybe tidy up this little bottle universe that exists at the bottom of the software sea.

Two moves.

First, portability. M1 source is arch-specific: not just the bytes you emit depend on whether you're targeting aarch64 or amd64 or riscv, but the source code you write is dictated by arch-specific defines. Replace that with a portable pseudo-ISA called P1. The interface is entirely portable, implemented by arch-specific defines. Each target ships a standard backend — P1A-aarch64.M1, P1A-amd64.M1, and so on — that DEFINEs each P1 mnemonic as the right raw bytes for that arch. Catm the backend in at build time and the same P1 source assembles anywhere. Think of it as portable assembly, well before you get to the portable assembly that we call C, but targetable from a simple textual expander instead of a full-blown compiler.

Second, power. M1 is austere — DEFINE substitution of simple names and raw bytes — and writing anything of size in it is miserable. So layer a macro preprocessor on top and you get M1pp (ht Bjarne). Named-parameter macros, compile-time integer expressions, conditional expansion, local labels, a few structural shorthands. Rewrite the P1 files with M1pp macros to get the arch-specific P1A.M1pp and the arch-agnostic P1.M1pp; the latter I'll call P1pp, our richer portable target. Expressive enough to host a Lisp in a file you might actually want to read.

Here's what I'm aiming for, in Lispy pseudocode:

;; Bootstrap from seed, same as before
(define hex0 (hex0-seed hex0.hex0))
(define hex1 (hex0 hex1.hex0))
(define hex2 (hex1 hex2.hex1))
(define catm (hex2 catm.hex2))
(define M0   (hex2 (catm ELF.hex2 M0.hex2)))

;; Compile+Link for arch-specific M1 source
(defn exe (M1-src) (hex2 (catm ELF.hex2 (M0 M1-src))))

;; P1 — portable pseudo-ISA at the M1 level.
;; P1A.M1 is the arch-specific backend
;; m1pp is itself a P1 program.
(define m1pp (exe (catm P1A.M1 m1pp.P1)))

;; P1pp — P1 rewritten with m1pp macros. Assemble any P1pp source via m1pp.
;; P1A.M1pp is the arch-specific backend, now rewritten to use M1pp
;; P1.M1pp is the arch-agnostic interface
;; P1pp.P1pp is "libp1pp", P1pp's standard library, niceties and utilities for
;; programming in P1pp
(defn ppexe (src) (exe (m1pp (catm P1A.M1pp P1.M1pp P1pp.P1pp src))))

;; To Scheme!
(define scheme (ppexe scheme1.P1pp))

;; And finally C
(defn scc (C-src) (ppexe (scheme cc.scm C-src)))
(define tcc0 (scc tcc.c))   ;; compiler: scheme cc.scm
(define tcc1 (tcc0 tcc.c))  ;; compiler: scheme-compiled tcc
(define tcc  (tcc1 tcc.c))  ;; compiler: tcc-compiled tcc

So the new bits are:

P1: P1A.M1
m1pp: m1pp.P1
P1pp: P1A.M1pp, P1.M1pp, P1pp.P1pp
Scheme: scheme1.P1pp
C: cc.scm

Today, I present P1, M1pp, and P1pp — the portability move and the power move. Next will come a Scheme in P1pp, and finally enough of a C compiler in Scheme to compile the tcc source.

P1

P1 is a portable pseudo-ISA. Source looks a lot like assembly; each mnemonic is a macro, and each arch ships a backend that realizes those mnemonics as raw bytes for that target. Flip the backend file you catm in, and the same P1 source assembles for a different arch.

Registers:

a0–a3 — argument and caller-saved general registers.
t0–t2 — caller-saved temporaries.
s0–s3 — callee-saved general registers.
sp — stack pointer.

That's a modest budget — twelve visible registers. The goal is small backends and uniform source across architectures. Anything beyond this that a backend needs for encoding (scratch regs, link register, zero register) is backend-private and exposed only through the portable ops. LA_BR is the one exception on the source side: it loads a label into a hidden branch-target register that the immediately following direct control-flow op (B, CALL, TAIL, or any conditional branch) then consumes. The register-indirect forms (BR, CALLR, TAILR) take their target in an ordinary register instead.

A word is register-sized — 32 bits on 32-bit targets, 64 bits on 64-bit targets. LD/ST move words; LB/SB move bytes. No 16/32-bit slots in between; code that needs them can OR/shift.

Ops:

Materialization: LI (load immediate), LA (load label address), LA_BR (load label into the hidden branch-target register).
Moves: MOV.
Arithmetic: ADD, SUB, AND, OR, XOR, SHL, SHR, SAR, MUL, DIV, REM.
Arithmetic with immediates: ADDI, ANDI, ORI, SHLI, SHRI, SARI. Immediate widths are backend-dependent; the backend errors if the constant doesn't fit.
Memory: LD, ST, LB, SB.
Branching: B, BR, BEQ, BNE, BLT, BLTU, BEQZ, BNEZ, BLTZ. Signed and unsigned less-than; >=, >, <= are synthesized by swapping operands or inverting branch sense.
Calls / returns: CALL, CALLR, RET, ERET, TAIL, TAILR.
Frame management: ENTER.
ABI arg access: LDARG — reads stack-passed incoming args without hard-coding the frame layout.
System: SYSCALL.

Calling convention:

a0-a3 hold the first four argument words; extras live in an incoming stack-argument area, read with LDARG.
a0 is the one-word return register. Two-word returns use a0/a1.
a0-a3 and t0-t2 are caller-saved; s0-s3 and sp are callee-saved.
ENTER builds the standard frame; ERET tears it down and returns (TAIL/TAILR likewise combine teardown with a jump).
Stack-passed outgoing args are staged in a dedicated frame-local area before CALL, so the callee finds them at a known offset from sp.
Wider-than-two-word returns use the usual hidden-pointer trick: caller passes a writable result buffer in a0, real args slide over by one, callee returns that same pointer in a0.

Program entry: portable source writes :p1_main as an ordinary P1 function with argc in a0 and argv in a1, and returns an exit status in a0. The backend emits a small per-arch :_start stub that captures the native entry state, calls p1_main, and hands its return value to sys_exit. The portable side never sees the raw entry stack.

This shape is aimed at portability and simplicity of targeting from pure macro expansion, sacrificing some performance, which seems a reasonable tradeoff for the bootstrap.

P1 comes in two flavors, one per toolchain level. The M1-level flavor is a pair of files you catm before any P1 source:

P1A.M1 — arch-specific backend, one DEFINE per mnemonic.

The M1pp-level flavor — P1pp — is the same idea with richer macros, where the arch-specific encodings are specified in P1A.M1pp via macro instead of hardcoded like in P1A.M1:

P1A.M1pp — arch-specific backend.
P1.M1pp — the portable interface.

M1pp

P1-at-the-M1-level gets us portability, but M1 is austere — DEFINE substitution and raw bytes. Fine for hand-encoding a handful of instructions; painful for anything of size. m1pp is a small macro expander layered on top of M0 that makes low-level programming bearable enough to linger down here near the bottom of the sea. It adds three things M0 doesn't have: macros with named parameters, integer expressions that evaluate at expand time, and conditional expansion.

Features:

Macros: %macro NAME(a, b) ... %endm, called with %NAME(x, y). Macro bodies can themselves contain macro calls, so expansion is recursive.
Integer expressions: emit little-endian hex from a Lisp expression — !(expr) for 8 bits, @(expr) for 16, %(expr) for 32, $(expr) for 64. The punctuation mirrors M0's decimal/hex immediate widths (!@%) and adds $ for the 64-bit case.
Conditional expansion: %select(cond, then, else) evaluates cond as an integer and emits exactly one of the two branches. Non-zero is true.
Token concat: a ## b pastes two tokens into one identifier.
Struct / enum shorthands: %struct NAME { f1 f2 } synthesizes %NAME.f1 → 0, %NAME.f2 → 8, %NAME.SIZE → 16. %enum does the same with stride 1 and a COUNT terminator. Saves hand-counting offsets for records and small tag sets.
Local labels: inside a macro body, :@loop and &@loop pick up a fresh per-expansion suffix, so a macro can define its own labels without colliding with itself at a second call site.
Scopes: %scope NAME ... %endscope pushes a lexical scope. Inside it, a label written ::foo emits as :NAME__foo (stack joined by __ when nested), so several callers can reuse short names without collision. Resolution follows the caller's scope, not the macro's, so a generic %break or %continue inside a scoped loop lands on the right label.
Stringify: %str(foo) turns a word token into the string literal "foo".

Comments (# and ;) and M0 DEFINEs pass through untouched.

The expression syntax is, of course, Lisp-shaped:

atoms: decimal or 0x-prefixed integer literals
calls: (+ a b) (- a b) (* a b) (/ a b) (% a b) (<< a b) (>> a b)
       (= a b) (!= a b) (< a b) (<= a b) (> a b) (>= a b)
       (& a b) (| a b) (^ a b) (~ a)
       (strlen "astr")

Here's a real macro from the aarch64 P1pp backend — the "add register, immediate" instruction:

%macro aa64_add_imm(rd, ra, imm12)
%((| 0x91000000 (<< (& imm12 0xFFF) 10) (<< %aa64_reg(ra) 5) %aa64_reg(rd)))
%endm

A call like %aa64_add_imm(a0, a0, 8) expands to a single Lisp expression that ORs together the opcode, the 12-bit immediate shifted into place, the source register field, and the destination register field. The outer %(...) evaluates that expression and emits it as four little-endian hex bytes, which M0 and hex2 then hand to the linker. The nested %aa64_reg(...) calls in the body show that macros can call other macros during expansion — aa64_reg is a tiny macro per register name that produces the aarch64-native register number (e.g. a0 → 0, s0 → 19).

That one pattern — Lisp expression in, hex bytes out — covers all per-arch instruction encoding.

The scope mechanism is what lets libp1pp's standard library define a generic function macro %fn:

%fn(parse_number, 16, {
  ...
  LA_BR &::done
  BEQZ t0
  ...
  ::done
})

The 16 is the frame-local byte count passed through to %enter. The ::done labels mangle to parse_number__done, so every function gets its own short namespace with no hand-prefixing, and scoped loops inside the body (%loop_scoped, %while_scoped_nez, etc.) stack further — a %break anywhere inside the innermost loop finds its _end.

Three programs

The shortest useful P1 program is a bare return. p1_main gets argc in a0 on entry, a0 is also the one-word return register, and the backend stub hands the return value to sys_exit — so this returns argc as the exit status:

:p1_main
    %ret

:ELF_end

Hello world is a single sys_write from a leaf function:

:p1_main
    %li(a0) %sys_write
    %li(a1) %1 %0
    %la(a2) &msg
    %li(a3) %14 %0
    %syscall
    %li(a0) %0 %0
    %ret

:msg
"Hello, World!
"

:ELF_end

A few things to notice. %li(a0) and %la(a2) don't take the immediate as a macro argument — the backend emits a load-from-literal-pool prefix and the bytes that follow in source become the literal. Here %sys_write expands to the 8-byte Linux syscall number for write (the aarch64 backend's choice); &msg is a 4-byte label reference (addresses fit in 32 bits in the stage0 image layout). The two %N %0 pairs are M0 4-byte decimal immediates padded to 8 bytes.

A function call, with a helper that doubles its argument:

:double
    %shli(a0, a0, 1)
    %ret

:p1_main
    %enter(0)
    %call(&double)
    %eret

:ELF_end

double is a leaf and needs no frame. p1_main is not — it calls double, so it opens a standard frame with %enter(0) to preserve the hidden return-address state across the call, and closes it with %eret, which tears down the frame and returns in one step. Run with ./double a b c and the exit status is 8 (argc=4, doubled).

What it cost

Concrete sizes for what's landed today:

m1pp.P1: ~4KLOC — the P1 source of the macro expander itself.
P1.M1pp: ~150 LOC — the portable P1 interface.
P1-{aarch64,riscv64,amd64}.M1pp: ~{480,430,650} LOC — arch backends.
p1pp.P1pp: ~1KLOC — libp1pp, the standard library of control-flow macros and utilities.

(These are non-comment, non-blank LOC)

We can't really see the full payoff until we have scheme1.P1pp and cc.scm of course, but I think it's a meaningful improvement over per-arch subset C compilers, and it's a neat little macro language to boot. Adding support for a new arch at this level is just a few hundred lines of macro definitions, not a new compiler.

I'll be working on a tiny Scheme implementation in P1pp, and will likely edit P1 and M1pp as I do.

The dream is that we go from P1pp to Scheme to C via nice direct hops and no external dependencies until TCC, and specifically, to a tiny single-binary cross-compiling TCC-like C compiler. A micro Clang/LLVM, maybe 80% of the bang for 1% of the buck. Projects like TCC, QBE, and Tilde all make me hopeful that this is possible, and as ever, people like Jeremiah, Jan, and Fabrice are inspirations. Here's to the crazy ones.

	boot2 Playing with the boostrap
	git clone https://git.ryansepassi.com/git/boot2.git
	Log \| Files \| Refs \| README

boot2