boot2

Playing with the bootstrap
git clone https://git.ryansepassi.com/git/boot2.git

M1PP

Scope

M1PP is a tiny single-pass macro expander that runs ahead of M0. It takes M1 source with macro directives and emits plain M1 suitable for M0.

The implementation lives in m1pp/m1pp.c. It is one pass, allocation-free (fixed static buffers), and stops at the first error.

Invocation

m1pp input.M1 output.M1

Input is read whole into a fixed buffer (MAX_INPUT = 262144 bytes); output is written whole from another (MAX_OUTPUT = 524288 bytes).

Lexical structure

The lexer produces a flat token array. Token kinds:

Whitespace other than newlines is discarded. Line comments start with # or ; and run to end-of-line. Output formatting is normalized to tokens separated by spaces and newlines; original spacing is not preserved.

Directives

Directives are recognized only at the start of a line (after a newline or at the top of the file).

%macro / %endm

%macro NAME(p1, p2, ...)
... body tokens ...
%endm

Defines a function-like macro. Zero-parameter macros are written %macro NAME(). Macros are define-before-use; there is no prescan. Recursive macros are not detected and will loop until a buffer limit fires.

%struct

%struct NAME { f1 f2 f3 ... }

Synthesizes zero-parameter macros for a fixed 8-byte-per-field layout, so successive fields sit at byte offsets 0, 8, 16, and so on.

Fields are separated by whitespace, commas, or newlines.
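The fixed-stride arithmetic can be sketched in C. This is only an illustration of the layout rule, not code from m1pp.c: the helper name layout_lookup is hypothetical, and the names of the macros that %struct actually synthesizes are not modeled here. Passing a stride of 1 gives the %enum-style numbering.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not taken from m1pp.c: returns the value a
 * fixed-stride layout assigns to `name`, or -1 if `name` is not listed.
 * stride is 8 for a %struct-style layout and 1 for a %enum-style one. */
static long layout_lookup(const char *const *fields, size_t count,
                          const char *name, long stride)
{
    assert(stride > 0);
    for (size_t i = 0; i < count; i++) {
        if (strcmp(fields[i], name) == 0)
            return (long)i * stride;   /* position times stride */
    }
    return -1;                         /* unknown field */
}
```

With the field list { next key value }, the third field comes out at offset 16 under the 8-byte stride and at 2 under stride 1.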

%enum

%enum NAME { l1 l2 l3 ... }

Like %struct, but with stride 1 and a trailing COUNT entry equal to the number of labels.

%scope / %endscope

%scope NAME
... body ...
%endscope

Pushes NAME onto a lexical scope stack active until the matching %endscope. Scopes nest. While the stack is non-empty, any ::name or &::name token emitted from within is rewritten with the current scope path (see Scoped labels). Every %scope must be closed before end-of-input. NAME is a single WORD token and may come from macro-argument substitution.

Macro calls

%NAME(arg, arg, ...)

Arguments are comma-separated token spans, with parentheses and braces balanced inside an argument. A zero-parameter macro may be invoked either as %NAME() or as a bare %NAME.

An argument wrapped in a single outer pair of { ... } has the braces stripped on substitution. This lets a comma-containing or paren-containing token sequence be passed as a single argument: %foo({a, b, c}) passes one argument whose tokens are a , b , c.
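The argument rules above can be sketched in C. This is a simplified character-level model under stated assumptions: m1pp itself works on tokens, the names span, split_args and strip_outer_braces are hypothetical, parentheses and braces share one depth counter (so mismatched pairs such as "(}" are not caught), and zero-argument calls are not special-cased.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct span { const char *start; size_t len; };

/* Drop leading and trailing spaces from a span. */
static struct span trim(struct span a)
{
    while (a.len > 0 && a.start[0] == ' ') { a.start++; a.len--; }
    while (a.len > 0 && a.start[a.len - 1] == ' ') a.len--;
    return a;
}

/* Split the text between a call's outer parentheses at top-level commas.
 * Returns the argument count, or -1 on unbalanced nesting or when more
 * than `max` arguments are present. */
static int split_args(const char *s, struct span *out, int max)
{
    int depth = 0, count = 0;
    const char *start = s;

    assert(out != NULL && max > 0);
    for (const char *p = s; ; p++) {
        char c = *p;
        if (c == '(' || c == '{') depth++;
        else if ((c == ')' || c == '}') && --depth < 0) return -1;
        if ((c == ',' && depth == 0) || c == '\0') {
            if (count == max) return -1;
            out[count++] = trim((struct span){ start, (size_t)(p - start) });
            if (c == '\0') break;
            start = p + 1;
        }
    }
    return depth == 0 ? count : -1;
}

/* If the whole span is wrapped in one matching { ... } pair, return the
 * inside; otherwise return the span unchanged. */
static struct span strip_outer_braces(struct span a)
{
    int depth = 0;

    if (a.len < 2 || a.start[0] != '{' || a.start[a.len - 1] != '}')
        return a;
    for (size_t i = 0; i < a.len; i++) {
        if (a.start[i] == '{') depth++;
        else if (a.start[i] == '}' && --depth == 0 && i + 1 < a.len)
            return a;   /* first '{' closed early, e.g. "{a}{b}" */
    }
    a.start++;
    a.len -= 2;
    return trim(a);
}
```

Splitting first and stripping afterwards mirrors the description: the braces protect the commas during splitting and disappear only when the argument is substituted.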

Argument substitution happens inside the body. After substitution, ## token-paste is applied: left ## right becomes a single WORD token whose text is the concatenation of the two operands' text. Each operand of ## must be exactly one non-braced token; a newline or another ## is not a valid neighbor.
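A minimal C sketch of the paste step, assuming token texts are NUL-terminated strings. The name paste_tokens and its return convention are hypothetical; the real m1pp.c may structure this differently and reports failures as bad paste.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: concatenate two token texts into `out`.
 * Returns 0 on success, or -1 if an operand is empty, is a newline,
 * is itself "##", or if the result does not fit in `cap` bytes. */
static int paste_tokens(const char *left, const char *right,
                        char *out, size_t cap)
{
    size_t ll = strlen(left), rl = strlen(right);

    assert(out != NULL);
    if (ll == 0 || rl == 0) return -1;
    if (strcmp(left, "##") == 0 || strcmp(right, "##") == 0) return -1;
    if (strcmp(left, "\n") == 0 || strcmp(right, "\n") == 0) return -1;
    if (ll + rl + 1 > cap) return -1;      /* fixed buffer, no allocation */
    memcpy(out, left, ll);
    memcpy(out + ll, right, rl + 1);       /* copies the trailing NUL too */
    return 0;
}
```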

The expanded body is then rescanned by pushing it onto the stream stack, so macros can call other macros.

Local labels

Inside a macro body, a token of the form :@name or &@name is a local label definition or reference. On expansion, @ is replaced by __N, where N is a monotonically increasing expansion id.

Each macro expansion gets a fresh N, so uses of :@loop at two different call sites (or in two different macros) never collide. Argument-substituted tokens keep their original text and are not rewritten, so a :@name literal passed as a macro argument passes through verbatim.

Scoped labels

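A WORD token whose text starts with :: is a scoped label definition; a token starting with &:: is a scoped reference. The :: prefix is rewritten at emit time against the current %scope stack, with the scope name and the label joined by a double underscore (for example, ::end inside %scope scan becomes the label scan__end).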

Because resolution is at emit time rather than macro-expansion time, a ::foo token written inside a macro body resolves against whatever scope is active at the point the token flows to the output — i.e. the caller's surroundings, not the macro's own expansion id. This makes generic control-flow macros possible:

%macro loop_scoped(name, body)
%scope name
::top
body
LA_BR &::top
B
::end
%endscope
%endm

%macro break()
LA_BR &::end
B
%endm

%loop_scoped(scan, {
  ...
  %if_eqz(a0, { %break() })
  ...
})

Inside the expansion, %loop_scoped has pushed the scope [scan], so when %break()'s &::end token is finally emitted the stack is [scan] and the output is &scan__end — exactly the label %loop_scoped defined at the bottom of its body. A nested %loop_scoped(inner, { ... }) makes [outer, inner] the active stack, so a %break() inside the inner block targets the innermost scope. To jump past an intervening scope, write the concatenated name explicitly (&outer__end).
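The emit-time rewrite described above can be sketched in C. Several details are assumptions rather than facts from m1pp.c: the function name rewrite_scoped is hypothetical, a definition is assumed to keep a single leading ':', nested scopes are assumed to be joined outermost-first with "__", and the empty-stack case is not modeled (the sketch simply rejects it).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: rewrite "::name" or "&::name" against a scope
 * stack (outermost first). Returns 0 on success, or -1 if the token is
 * not a scoped label, the stack is empty, or `out` is too small. */
static int rewrite_scoped(const char *tok, const char *const *scopes,
                          size_t depth, char *out, size_t cap)
{
    size_t used = 0, n;
    const char *name;

    assert(out != NULL && cap > 0);
    if (strncmp(tok, "&::", 3) == 0) {
        name = tok + 3;
        out[used++] = '&';            /* reference sigil is kept */
    } else if (strncmp(tok, "::", 2) == 0) {
        name = tok + 2;
        out[used++] = ':';            /* assumption: one ':' remains */
    } else {
        return -1;
    }
    if (*name == '\0' || depth == 0) return -1;

    for (size_t i = 0; i < depth; i++) {
        n = strlen(scopes[i]);
        if (used + n + 2 >= cap) return -1;
        memcpy(out + used, scopes[i], n);
        used += n;
        out[used++] = '_';            /* scope and label joined by "__" */
        out[used++] = '_';
    }
    n = strlen(name);
    if (used + n >= cap) return -1;
    memcpy(out + used, name, n + 1);  /* includes the trailing NUL */
    return 0;
}
```

With the stack [scan], the token &::end comes out as &scan__end, matching the %break() walk-through.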

Scoped labels and local (:@ / &@) labels are independent and compose. A common pattern: use :@ for the macro's private internal labels (the caller can never name them) and :: for labels that are the macro's public contract with its caller (::end, ::top, etc.).

Built-in calls

These are recognized wherever a token matches, not only at line start.

Integer emission: ! @ % $

!(expr)    →  1-byte  little-endian hex, e.g. 'AB'
@(expr)    →  2-byte  little-endian hex
%(expr)    →  4-byte  little-endian hex
$(expr)    →  8-byte  little-endian hex

The expression is evaluated to a signed 64-bit integer and emitted as an M0-safe single-quoted hex literal ('AABBCCDD') rather than a bare number, so M0 does not reinterpret it as decimal.
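As an illustration of the byte order, here is a minimal C sketch of the emission step. The name emit_le_hex is hypothetical, and two choices are assumptions rather than documented behavior: values wider than the requested size are silently truncated to their low bytes, and hex digits are uppercase as in the examples above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write the low `nbytes` bytes of `value`, least
 * significant byte first, as hex digits between single quotes.
 * `out` must have room for 2 * nbytes + 3 characters. */
static void emit_le_hex(int64_t value, int nbytes, char *out)
{
    static const char digits[] = "0123456789ABCDEF";
    uint64_t bits = (uint64_t)value;   /* two's-complement bit pattern */
    size_t pos = 0;

    assert(nbytes == 1 || nbytes == 2 || nbytes == 4 || nbytes == 8);
    out[pos++] = '\'';
    for (int i = 0; i < nbytes; i++) {
        unsigned byte = (unsigned)(bits >> (8 * i)) & 0xFFu;
        out[pos++] = digits[byte >> 4];
        out[pos++] = digits[byte & 0xFu];
    }
    out[pos++] = '\'';
    out[pos] = '\0';
}
```

Under this model, the 4-byte form of 0x12345678 is written as '78563412', and the 2-byte form of -1 as 'FFFF'.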

%select(cond, then, else)

Evaluates cond as an expression. If nonzero, the then argument's tokens are pushed back for rescan; otherwise the else argument's tokens are. The branches are raw token spans, not expressions.

%str(IDENT)

Stringifies a single WORD token into a double-quoted string literal: %str(foo) → "foo". The argument must be exactly one word token.

Expression language

Expressions are Lisp-shaped S-expressions. Atoms are integer literals (decimal, or any base accepted by strtoull/strtoll, including 0x...) or zero-arg macro calls that evaluate to integer tokens.

Calls:

(+ a b ...)   (- a b ...)   (* a b ...)   (~ a)
(/ a b)       (% a b)       (<< a b)      (>> a b)
(& a ...)     (| a ...)     (^ a ...)
(= a b)       (!= a b)
(< a b)       (<= a b)      (> a b)       (>= a b)
(strlen "literal")

Inside an expression, a %NAME that names a zero-parameter (or invokable) macro is expanded and its tokens are re-parsed as a sub-expression. This is how %struct and %enum-generated names compose into arithmetic.
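To make the shape of the expression language concrete, here is a small recursive evaluator in C for a subset of it. It is a sketch under assumptions, not m1pp's evaluator: it works on raw text rather than tokens, supports only +, -, *, << and =, requires at least two operands per call, wraps on overflow, and does not expand %NAME macros. The names fold, eval_expr and eval_string are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static void skip_ws(const char **p)
{
    while (isspace((unsigned char)**p)) (*p)++;
}

/* Combine the running value `a` with the next operand `b`. */
static int64_t fold(const char *op, size_t n, int64_t a, int64_t b, int *err)
{
    uint64_t ua = (uint64_t)a, ub = (uint64_t)b;  /* wraparound: assumption */
    if (n == 1 && op[0] == '+') return (int64_t)(ua + ub);
    if (n == 1 && op[0] == '-') return (int64_t)(ua - ub);
    if (n == 1 && op[0] == '*') return (int64_t)(ua * ub);
    if (n == 1 && op[0] == '=') return a == b;
    if (n == 2 && memcmp(op, "<<", 2) == 0) return (int64_t)(ua << (ub & 63));
    *err = 1;                                     /* unsupported operator */
    return 0;
}

/* Evaluate one atom or one parenthesized call starting at *p. */
static int64_t eval_expr(const char **p, int *err)
{
    skip_ws(p);
    if (**p != '(') {                       /* atom: integer literal */
        char *end;
        long long v = strtoll(*p, &end, 0); /* base 0 accepts 0x... */
        if (end == *p) *err = 1;
        *p = end;
        return (int64_t)v;
    }
    (*p)++;                                 /* consume '(' */
    skip_ws(p);
    const char *op = *p;
    while (**p && !isspace((unsigned char)**p) && **p != '(' && **p != ')')
        (*p)++;
    size_t oplen = (size_t)(*p - op);

    int64_t acc = 0;
    int count = 0;
    for (;;) {
        skip_ws(p);
        if (**p == ')') { (*p)++; break; }
        if (**p == '\0') { *err = 1; return 0; }  /* missing ')' */
        int64_t x = eval_expr(p, err);
        if (*err) return 0;
        acc = (count == 0) ? x : fold(op, oplen, acc, x, err);
        if (*err) return 0;
        count++;
    }
    if (count < 2) *err = 1;    /* unary forms are not modeled here */
    return acc;
}

/* Evaluate a whole string; returns 0 and stores the result, or -1. */
static int eval_string(const char *text, int64_t *out)
{
    int err = 0;
    const char *p = text;
    int64_t v = eval_expr(&p, &err);
    skip_ws(&p);
    if (err || *p != '\0') return -1;
    *out = v;
    return 0;
}
```

Nested calls fall out of the recursion: in (* (+ 2 3) 8) the inner call is evaluated as the first operand of the outer one.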

Limits

Fixed at compile time:

Resource                  Limit
input bytes               262144
output bytes              524288
total token text          524288
source tokens             65536
macro body tokens         65536
expansion pool tokens     65536
macros                    512
parameters per macro      16
stream stack depth        64
expression frames         256
scope stack depth         32

Exceeding any limit aborts with an error message on stderr.

Errors

On failure, m1pp prints m1macro: <reason> to stderr and exits 1. Reasons are terse: bad macro header, unterminated macro, wrong arg count, bad paste, bad expression, bad builtin, text overflow, token overflow, expansion overflow, output overflow, stream overflow, unbalanced braces, too many args, too many macros, bad integer, bad directive, unterminated directive, unterminated macro call, bad scope header, scope underflow, scope not closed, scope depth overflow, bad scope label.