commit 2bf50d2a8f56b2106a737e260c4900abfb98d07a
parent 687ee15acc1661be78ff6cd8922ad95dd19a2100
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Thu, 23 Apr 2026 20:28:36 -0700
M1PP doc
Diffstat:
A docs/M1PP.md | 179 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 179 insertions(+), 0 deletions(-)
diff --git a/docs/M1PP.md b/docs/M1PP.md
@@ -0,0 +1,179 @@
+# M1PP
+
+## Scope
+
+M1PP is a tiny single-pass macro expander that runs ahead of `M0`. It takes
+M1 source with macro directives and emits plain M1 suitable for `M0`.
+
+The implementation lives in `m1pp/m1pp.c`. It is one pass, allocation-free
+(fixed static buffers), and stops at the first error.
+
+## Invocation
+
+ m1pp input.M1 output.M1
+
+Input is read whole into a fixed buffer (`MAX_INPUT = 262144` bytes); output
+is written whole from another (`MAX_OUTPUT = 524288` bytes).
+
+## Lexical structure
+
+The lexer produces a flat token array. Token kinds:
+
+- `WORD` — any run of non-special characters
+- `STRING` — `"..."` or `'...'` (quotes included in the token text)
+- `NEWLINE` — a single `\n`
+- `LPAREN`, `RPAREN`, `COMMA`, `LBRACE`, `RBRACE`
+- `PASTE` — the `##` marker
+
+Whitespace other than newlines is discarded. Line comments start with `#` or
+`;` and run to end-of-line. Output formatting is normalized to tokens
+separated by spaces and newlines; original spacing is not preserved.
+
+## Directives
+
+Directives are recognized only at the start of a line (after a newline or at
+the top of file).
+
+### `%macro` / `%endm`
+
+ %macro NAME(p1, p2, ...)
+ ... body tokens ...
+ %endm
+
+Defines a function-like macro. Zero-parameter macros are written `%macro
+NAME()`. Macros are define-before-use; there is no prescan. Recursive
+macros are not detected and will loop until a buffer limit fires.
+
+### `%struct`
+
+ %struct NAME { f1 f2 f3 ... }
+
+Synthesizes zero-parameter macros for a fixed 8-byte-per-field layout:
+
+- `%NAME.f1` → `0`
+- `%NAME.f2` → `8`
+- `%NAME.f3` → `16`
+- `%NAME.SIZE` → `N * 8`, where `N` is the number of fields
+
+Fields are separated by whitespace, commas, or newlines.
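The offset rule above can be modelled in a few lines. This is a minimal Python sketch of the documented behavior, not the code in `m1pp/m1pp.c`; the names `struct_macros` and `FIELD_BYTES` are hypothetical and exist only for illustration.

```python
import re

FIELD_BYTES = 8  # fixed stride per field, as documented for %struct


def struct_macros(name, field_text):
    """Map each synthesized zero-parameter macro name to its integer value."""
    # Fields may be separated by whitespace, commas, or newlines.
    fields = [f for f in re.split(r"[\s,]+", field_text) if f]
    macros = {f"%{name}.{field}": i * FIELD_BYTES
              for i, field in enumerate(fields)}
    macros[f"%{name}.SIZE"] = len(fields) * FIELD_BYTES
    return macros
```

For a three-field struct, `struct_macros("Point", "x, y\nz")` yields offsets 0, 8, and 16 plus `%Point.SIZE` = 24.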
+
+### `%enum`
+
+ %enum NAME { l1 l2 l3 ... }
+
+Like `%struct` with stride 1 and a trailing `COUNT`:
+
+- `%NAME.l1` → `0`, `%NAME.l2` → `1`, ...
+- `%NAME.COUNT` → `N`, where `N` is the number of labels
+
+## Macro calls
+
+ %NAME(arg, arg, ...)
+
+Arguments are comma-separated token spans, with parentheses and braces
+balanced inside an argument. A zero-parameter macro may be invoked either
+as `%NAME()` or as a bare `%NAME`.
+
+An argument wrapped in a single outer pair of `{ ... }` has the braces
+stripped on substitution. This lets a comma-containing or paren-containing
+token sequence be passed as a single argument: `%foo({a, b, c})` passes one
+argument whose tokens are `a , b , c`.
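The splitting and brace-stripping rules can be sketched as follows. This is a Python model under stated assumptions, not the actual implementation: tokens are plain strings, `split_args` and `strip_outer_braces` are hypothetical names, parentheses and braces share one nesting counter, and the error text is borrowed from the Errors section purely for illustration.

```python
def strip_outer_braces(arg):
    """Drop one outer `{ ... }` pair when it wraps the whole argument."""
    if len(arg) < 2 or arg[0] != "{" or arg[-1] != "}":
        return arg
    depth = 0
    for i, tok in enumerate(arg):
        if tok == "{":
            depth += 1
        elif tok == "}":
            depth -= 1
            if depth == 0 and i != len(arg) - 1:
                return arg  # the first brace closes early, e.g. `{a} {b}`
    return arg[1:-1]


def split_args(tokens):
    """Split the tokens between a call's outer parentheses into arguments."""
    if not tokens:
        return []  # `%NAME()` carries no arguments
    args, current, depth = [], [], 0
    for tok in tokens:
        if tok in ("(", "{"):
            depth += 1
        elif tok in (")", "}"):
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced braces")
        if tok == "," and depth == 0:
            args.append(current)  # a top-level comma ends the argument
            current = []
        else:
            current.append(tok)
    if depth != 0:
        raise ValueError("unbalanced braces")
    args.append(current)
    return [strip_outer_braces(a) for a in args]
```

With this model, the tokens of `{a, b, c}` come back as one argument `a , b , c`, while `x, (y, z)` splits into two arguments because the inner comma is nested.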
+
+Argument substitution happens inside the body. After substitution, `##`
+token-paste is applied: `left ## right` becomes a single `WORD` token whose
+text is the concatenation of the two operand tokens' text. Each operand of
+`##` must be exactly one non-braced token; a newline or another `##` is not
+a valid neighbor.
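A rough Python model of the paste step, assuming the body has already had its parameters substituted and is held as a list of token strings. `apply_paste` is a hypothetical name, and treating `a ## b ## c` as a left-to-right fold is an assumption the text above does not confirm.

```python
INVALID_OPERANDS = {"##", "\n", "{", "}"}


def apply_paste(tokens):
    """Join `left ## right` pairs into single word tokens."""
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok != "##":
            out.append(tok)
            i += 1
            continue
        # `##` needs exactly one ordinary token on each side.
        if not out or i + 1 >= len(tokens):
            raise ValueError("bad paste")
        left, right = out[-1], tokens[i + 1]
        if left in INVALID_OPERANDS or right in INVALID_OPERANDS:
            raise ValueError("bad paste")
        out[-1] = left + right  # replace the left operand with the joined word
        i += 2
    return out
```

Under this model `reg_ ## a0` becomes the single word `reg_a0`, and a `##` at the start or end of a line is rejected.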
+
+The expanded body is then rescanned by pushing it onto the stream stack, so
+macros can call other macros.
+
+### Local labels
+
+Inside a macro body, a token starting with `:@` or `&@` (as in `:@name` or
+`&@name`) is a local label definition or reference. On expansion, the `@` is
+dropped and `__N` is appended, where `N` is a monotonically increasing
+expansion id, so each call site gets a fresh label namespace:
+
+- `:@loop` → `:loop__7`
+- `&@loop` → `&loop__7`
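The renaming shown in the two examples can be expressed as a small function. This Python sketch follows the examples literally (the `@` disappears and `__N` becomes a suffix); `rename_local` is a hypothetical name, not one taken from `m1pp.c`.

```python
def rename_local(token, expansion_id):
    """Rewrite `:@name` / `&@name` to `:name__N` / `&name__N`."""
    if token.startswith((":@", "&@")):
        # Keep the sigil, drop the `@`, and append the expansion id.
        return f"{token[0]}{token[2:]}__{expansion_id}"
    return token  # every other token passes through unchanged
```

Two expansions of the same macro would call this with different ids, so `:@loop` yields distinct labels in each.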
+
+## Built-in calls
+
+These are recognized wherever a token matches, not only at line start.
+
+### Integer emission: `!` `@` `%` `$`
+
+ !(expr) → 1-byte little-endian hex, e.g. 'AB'
+ @(expr) → 2-byte little-endian hex
+ %(expr) → 4-byte little-endian hex
+ $(expr) → 8-byte little-endian hex
+
+The expression is evaluated to a signed 64-bit integer and emitted as an
+M0-safe single-quoted hex literal (`'AABBCCDD'`) rather than a bare number,
+so `M0` does not reinterpret it as decimal.
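To make the byte order concrete, here is a Python model of the four emitters. It assumes that out-of-range values are truncated to the low bytes in two's complement and that hex digits are uppercase, as in the examples above; neither detail is confirmed by this document, and `emit_int` and `WIDTHS` are hypothetical names.

```python
WIDTHS = {"!": 1, "@": 2, "%": 4, "$": 8}  # sigil -> width in bytes


def emit_int(sigil, value):
    """Format `value` as a single-quoted little-endian hex literal."""
    nbytes = WIDTHS[sigil]
    # Assumption: keep only the low `nbytes` bytes (two's complement).
    truncated = value & ((1 << (8 * nbytes)) - 1)
    return "'" + truncated.to_bytes(nbytes, "little").hex().upper() + "'"
```

For instance, `emit_int("%", 0x12345678)` produces `'78563412'`: the least significant byte comes first.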
+
+### `%select(cond, then, else)`
+
+Evaluates `cond` as an expression. If nonzero, the `then` argument's tokens
+are pushed back for rescan; otherwise the `else` argument's tokens are. The
+branches are raw token spans, not expressions.
+
+### `%str(IDENT)`
+
+Stringifies a single `WORD` token into a double-quoted string literal:
+`%str(foo)` → `"foo"`. The argument must be exactly one word token.
+
+## Expression language
+
+Expressions are Lisp-shaped S-expressions. Atoms are integer literals
+(decimal, or any base accepted by `strtoull`/`strtoll`, including `0x...`)
+or zero-arg macro calls that evaluate to integer tokens.
+
+Calls:
+
+ (+ a b ...) (- a b ...) (* a b ...) (~ a)
+ (/ a b) (% a b) (<< a b) (>> a b)
+ (& a ...) (| a ...) (^ a ...)
+ (= a b) (!= a b)
+ (< a b) (<= a b) (> a b) (>= a b)
+ (strlen "literal")
+
+- `+ - * & | ^` are n-ary with at least one argument. With a single argument, `-` negates.
+- `/ % << >> = != < <= > >=` are strictly binary.
+- `~` is unary.
+- `strlen` takes one `STRING` token and returns the raw byte count of the
+ contents between the quotes.
+
+Inside an expression, a `%NAME` that names a zero-parameter (or invokable)
+macro is expanded and its tokens are re-parsed as a sub-expression. This
+is how `%struct` and `%enum`-generated names compose into arithmetic.
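The arity rules and macro re-parsing described in this section can be modelled with a short recursive evaluator. This is a Python sketch under several assumptions: tokens are plain strings, `macros` maps a `%NAME` to its expansion tokens, division and remainder truncate toward zero as in C, comparisons yield 1 or 0, and 64-bit wraparound is not modelled. `int(tok, 0)` only approximates `strtoll` (it rejects C-style octal such as `010`). All function names are hypothetical.

```python
from functools import reduce
import operator


def c_div(a, b):
    """Integer division truncating toward zero (assumed C behavior)."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


def c_mod(a, b):
    return a - b * c_div(a, b)


NARY = {"+": operator.add, "*": operator.mul, "&": operator.and_,
        "|": operator.or_, "^": operator.xor}
BINARY = {"/": c_div, "%": c_mod, "<<": operator.lshift,
          ">>": operator.rshift, "=": operator.eq, "!=": operator.ne,
          "<": operator.lt, "<=": operator.le, ">": operator.gt,
          ">=": operator.ge}


def evaluate(tokens, macros=None):
    """Evaluate a token list such as ['(', '+', '1', '2', ')']."""
    pending = list(tokens)
    value = _eval(pending, macros or {})
    if pending:
        raise ValueError("bad expression")  # trailing tokens
    return value


def _eval(pending, macros):
    if not pending:
        raise ValueError("bad expression")
    tok = pending.pop(0)
    if tok in macros:  # zero-parameter macro: re-parse its expansion
        return evaluate(macros[tok], macros)
    if tok != "(":
        try:
            return int(tok, 0)  # decimal or 0x... literal
        except ValueError:
            raise ValueError("bad integer") from None
    if not pending:
        raise ValueError("bad expression")
    op = pending.pop(0)
    if op == "strlen":  # takes one STRING token, quotes included
        if len(pending) < 2 or pending[1] != ")":
            raise ValueError("bad expression")
        literal = pending.pop(0)
        pending.pop(0)
        return len(literal[1:-1].encode())
    args = []
    while pending and pending[0] != ")":
        args.append(_eval(pending, macros))
    if not pending:
        raise ValueError("bad expression")  # missing ")"
    pending.pop(0)
    return _apply(op, args)


def _apply(op, args):
    if op == "-" and len(args) == 1:
        return -args[0]
    if op == "-" and len(args) > 1:
        return reduce(operator.sub, args)
    if op == "~" and len(args) == 1:
        return ~args[0]
    if op in NARY and args:
        return reduce(NARY[op], args)
    if op in BINARY and len(args) == 2:
        return int(BINARY[op](*args))
    raise ValueError("bad expression")
```

Passing `{"%Point.SIZE": ["24"]}` as `macros` lets `(* %Point.SIZE 4)` evaluate to 96, which mirrors how `%struct`-generated names feed into arithmetic.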
+
+## Limits
+
+Fixed at compile time:
+
+| Resource | Limit |
+| --------------------- | ------- |
+| input bytes | 262144 |
+| output bytes | 524288 |
+| total token text | 524288 |
+| source tokens | 65536 |
+| macro body tokens | 65536 |
+| expansion pool tokens | 65536 |
+| macros | 256 |
+| parameters per macro | 16 |
+| stream stack depth | 64 |
+| expression frames | 256 |
+
+Exceeding any limit aborts with an error message on `stderr`.
+
+## Errors
+
+On failure, `m1pp` prints `m1macro: <reason>` to `stderr` and exits 1.
+Reasons are terse: `bad macro header`, `unterminated macro`,
+`wrong arg count`, `bad paste`, `bad expression`, `bad builtin`,
+`text overflow`, `token overflow`, `expansion overflow`, `output overflow`,
+`stream overflow`, `unbalanced braces`, `too many args`, `too many macros`,
+`bad integer`, `bad directive`, `unterminated directive`,
+`unterminated macro call`.