boot2

Playing with the bootstrap
git clone https://git.ryansepassi.com/git/boot2.git

commit 2bf50d2a8f56b2106a737e260c4900abfb98d07a
parent 687ee15acc1661be78ff6cd8922ad95dd19a2100
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Thu, 23 Apr 2026 20:28:36 -0700

M1PP doc

Diffstat:
A docs/M1PP.md | 179 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 179 insertions(+), 0 deletions(-)

diff --git a/docs/M1PP.md b/docs/M1PP.md
@@ -0,0 +1,179 @@
+# M1PP
+
+## Scope
+
+M1PP is a tiny single-pass macro expander that runs ahead of `M0`. It takes
+M1 source with macro directives and emits plain M1 suitable for `M0`.
+
+The implementation lives in `m1pp/m1pp.c`. It is one pass, allocation-free
+(fixed static buffers), and stops at the first error.
+
+## Invocation
+
+    m1pp input.M1 output.M1
+
+Input is read whole into a fixed buffer (`MAX_INPUT = 262144` bytes); output
+is written whole from another (`MAX_OUTPUT = 524288` bytes).
+
+## Lexical structure
+
+The lexer produces a flat token array. Token kinds:
+
+- `WORD` — any run of non-special characters
+- `STRING` — `"..."` or `'...'` (quotes included in the token text)
+- `NEWLINE` — a single `\n`
+- `LPAREN`, `RPAREN`, `COMMA`, `LBRACE`, `RBRACE`
+- `PASTE` — the `##` marker
+
+Whitespace other than newlines is discarded. Line comments start with `#` or
+`;` and run to end-of-line. Output formatting is normalized to tokens
+separated by spaces and newlines; original spacing is not preserved.
+
+## Directives
+
+Directives are recognized only at the start of a line (after a newline or at
+the top of file).
+
+### `%macro` / `%endm`
+
+    %macro NAME(p1, p2, ...)
+    ... body tokens ...
+    %endm
+
+Defines a function-like macro. Zero-parameter macros are written `%macro
+NAME()`. Macros are define-before-use; there is no prescan. Recursive
+macros are not detected and will loop until a buffer limit fires.
+
+### `%struct`
+
+    %struct NAME { f1 f2 f3 ... }
+
+Synthesizes zero-parameter macros for fixed 8-byte-per-field layout:
+
+- `%NAME.f1` → `0`
+- `%NAME.f2` → `8`
+- `%NAME.f3` → `16`
+- `%NAME.SIZE` → `N * 8`
+
+Fields are separated by whitespace, commas, or newlines.
+
+### `%enum`
+
+    %enum NAME { l1 l2 l3 ... }
+
+Like `%struct` with stride 1 and a trailing `COUNT`:
+
+- `%NAME.l1` → `0`, `%NAME.l2` → `1`, ...
+- `%NAME.COUNT` → `N`
+
+## Macro calls
+
+    %NAME(arg, arg, ...)
+
+Arguments are comma-separated token spans, with parentheses and braces
+balanced inside an argument. A zero-parameter macro may be invoked either
+as `%NAME()` or as a bare `%NAME`.
+
+An argument wrapped in a single outer pair of `{ ... }` has the braces
+stripped on substitution. This lets a comma-containing or paren-containing
+token sequence be passed as a single argument: `%foo({a, b, c})` passes one
+argument whose tokens are `a , b , c`.
+
+Argument substitution happens inside the body. After substitution, `##`
+token-paste is applied: `left ## right` becomes a single `WORD` token whose
+text is the concatenation of the two operand tokens' text. Operands of
+`##` must be exactly one non-braced token; newlines and other `##` tokens
+are not valid neighbors.
+
+The expanded body is then rescanned by pushing it onto the stream stack, so
+macros can call other macros.
+
+### Local labels
+
+Inside a macro body, a token starting with `:@name` or `&@name` is a local
+label definition or reference. On expansion, `@` is replaced by `__N` where
+`N` is a monotonically increasing expansion id, so each call site gets a
+fresh label namespace:
+
+- `:@loop` → `:loop__7`
+- `&@loop` → `&loop__7`
+
+## Built-in calls
+
+These are recognized wherever a token matches, not only at line start.
+
+### Integer emission: `!` `@` `%` `$`
+
+    !(expr) → 1-byte little-endian hex, e.g. 'AB'
+    @(expr) → 2-byte little-endian hex
+    %(expr) → 4-byte little-endian hex
+    $(expr) → 8-byte little-endian hex
+
+The expression is evaluated to a signed 64-bit integer and emitted as an
+M0-safe single-quoted hex literal (`'AABBCCDD'`) rather than a bare number,
+so `M0` does not reinterpret it as decimal.
+
+### `%select(cond, then, else)`
+
+Evaluates `cond` as an expression. If nonzero, the `then` argument's tokens
+are pushed back for rescan; otherwise the `else` argument's tokens are. The
+branches are raw token spans, not expressions.
+
+### `%str(IDENT)`
+
+Stringifies a single `WORD` token into a double-quoted string literal:
+`%str(foo)` → `"foo"`. The argument must be exactly one word token.
+
+## Expression language
+
+Expressions are Lisp-shaped S-expressions. Atoms are integer literals
+(decimal, or any base accepted by `strtoull`/`strtoll`, including `0x...`)
+or zero-arg macro calls that evaluate to integer tokens.
+
+Calls:
+
+    (+ a b ...)   (- a b ...)   (* a b ...)   (~ a)
+    (/ a b)       (% a b)       (<< a b)      (>> a b)
+    (& a ...)     (| a ...)     (^ a ...)
+    (= a b)       (!= a b)
+    (< a b)       (<= a b)      (> a b)       (>= a b)
+    (strlen "literal")
+
+- `+ - * & | ^` are n-ary with at least one argument. Unary `-` negates.
+- `/ % << >> = != < <= > >=` are strictly binary.
+- `~` is unary.
+- `strlen` takes one `STRING` token and returns the raw byte count of the
+  contents between the quotes.
+
+Inside an expression, a `%NAME` that names a zero-parameter (or invokable)
+macro is expanded and its tokens are re-parsed as a sub-expression. This
+is how `%struct` and `%enum`-generated names compose into arithmetic.
+
+## Limits
+
+Fixed at compile time:
+
+| Resource              | Limit   |
+| --------------------- | ------- |
+| input bytes           | 262144  |
+| output bytes          | 524288  |
+| total token text      | 524288  |
+| source tokens         | 65536   |
+| macro body tokens     | 65536   |
+| expansion pool tokens | 65536   |
+| macros                | 256     |
+| parameters per macro  | 16      |
+| stream stack depth    | 64      |
+| expression frames     | 256     |
+
+Exceeding any limit aborts with an error message on `stderr`.
+
+## Errors
+
+On failure, `m1pp` prints `m1macro: <reason>` to `stderr` and exits 1.
+Reasons are terse: `bad macro header`, `unterminated macro`,
+`wrong arg count`, `bad paste`, `bad expression`, `bad builtin`,
+`text overflow`, `token overflow`, `expansion overflow`, `output overflow`,
+`stream overflow`, `unbalanced braces`, `too many args`, `too many macros`,
+`bad integer`, `bad directive`, `unterminated directive`,
+`unterminated macro call`.
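
The integer-emission built-ins described in the added doc (`!`, `@`, `%`, `$`) turn an already-evaluated integer into a single-quoted little-endian hex literal of 1, 2, 4, or 8 bytes. The Python sketch below illustrates that byte ordering only; it is not taken from `m1pp/m1pp.c`. The names `WIDTHS` and `emit_int` are hypothetical, uppercase hex digits are inferred from the doc's `'AB'` and `'AABBCCDD'` examples, and truncating out-of-range values to the target width is an assumption (the real tool may handle them differently).

```python
# Hypothetical illustration of the !/@/%/$ emission rule; not the m1pp source.

# Sigil -> output width in bytes, as listed in the doc.
WIDTHS = {"!": 1, "@": 2, "%": 4, "$": 8}


def emit_int(sigil: str, value: int) -> str:
    """Format `value` as a single-quoted little-endian hex literal."""
    width = WIDTHS[sigil]
    # Assumption: values are reduced to the low `width` bytes in
    # two's complement, so negative numbers wrap instead of failing.
    masked = value & ((1 << (8 * width)) - 1)
    # Least-significant byte first, two hex digits per byte.
    data = masked.to_bytes(width, "little")
    return "'" + data.hex().upper() + "'"


if __name__ == "__main__":
    print(emit_int("!", 0xAB))        # one byte
    print(emit_int("%", 0x12345678))  # four bytes, low byte first
    print(emit_int("$", 8))           # eight bytes
```

Under these assumptions, `emit_int("%", 0x12345678)` yields `'78563412'`: the quoted form carries the bytes in memory order, which is why the doc notes that `M0` will not reinterpret it as a decimal number.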