lispcc contracts

Frozen interfaces between modules. Engineers must not diverge from these without proposing a change. This document is the source of truth for the symbol alphabets, test formats, ABI, and phase-1 milestone referenced from CC-INTERNALS.md.

1. Symbol alphabets

Every record's kind-style fields use these exact symbols. Adding a symbol = updating this section first.

1.1 `tok-kind`

IDENT KW INT STR CHAR PUNCT HASH NL EOF

Uppercase to distinguish from value-level symbols. IDENT carries an unrecognized identifier; KW is one of the symbols in §1.3.

1.2 `PUNCT` value symbols

The lexer produces tok-value symbols for punctuators per the following table. Names are mandatory — no engineer may use the raw '+, '*, etc. as symbols, because several C punctuator characters (%, |, ,, ;, (, ), {, }, [, ], ., #) cannot form valid scheme1 symbols. We use named symbols for all punctuators to keep the scheme uniform.

C	Symbol	C	Symbol
`[`	`lbrack`	`==`	`eq2`
`]`	`rbrack`	`!=`	`ne`
`(`	`lparen`	`<`	`lt`
`)`	`rparen`	`>`	`gt`
`{`	`lbrace`	`<=`	`le`
`}`	`rbrace`	`>=`	`ge`
`.`	`dot`	`<<`	`shl`
`->`	`arrow`	`>>`	`shr`
`,`	`comma`	`&&`	`land`
`;`	`semi`	`||`	`lor`
`:`	`colon`	`&`	`amp`
`?`	`qmark`	`|`	`bar`
`...`	`ellipsis`	`^`	`caret`
`++`	`inc`	`~`	`tilde`
`--`	`dec`	`!`	`bang`
`+`	`plus`	`=`	`assign`
`-`	`minus`	`+=`	`plus-eq`
`*`	`star`	`-=`	`minus-eq`
`/`	`slash`	`*=`	`star-eq`
`%`	`pct`	`/=`	`slash-eq`
`#`	`hash`	`%=`	`pct-eq`
`##`	`paste`	`<<=`	`shl-eq`
		`>>=`	`shr-eq`
		`&=`	`amp-eq`
		`^=`	`caret-eq`
		`|=`	`bar-eq`

Digraphs (<: :> <% %> %: %:%:) lex as their standard equivalents ([ ] { } # ##) and produce the same symbol the table assigns to that equivalent; for example, <: lexes as lbrack.
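The table and digraph rule can be modelled as a longest-match lookup. This is a minimal sketch in Python, not the lexer's actual scheme1 code: the names PUNCT, DIGRAPHS, and lex_punct are hypothetical, and the sketch ignores lexer context such as a `.` that begins a numeric literal.

```python
# Punctuator spellings and their symbols, copied from the table in §1.2.
PUNCT = {
    "[": "lbrack", "]": "rbrack", "(": "lparen", ")": "rparen",
    "{": "lbrace", "}": "rbrace", ".": "dot", "->": "arrow",
    ",": "comma", ";": "semi", ":": "colon", "?": "qmark",
    "...": "ellipsis", "++": "inc", "--": "dec", "+": "plus",
    "-": "minus", "*": "star", "/": "slash", "%": "pct",
    "#": "hash", "##": "paste", "==": "eq2", "!=": "ne",
    "<": "lt", ">": "gt", "<=": "le", ">=": "ge",
    "<<": "shl", ">>": "shr", "&&": "land", "||": "lor",
    "&": "amp", "|": "bar", "^": "caret", "~": "tilde",
    "!": "bang", "=": "assign", "+=": "plus-eq", "-=": "minus-eq",
    "*=": "star-eq", "/=": "slash-eq", "%=": "pct-eq",
    "<<=": "shl-eq", ">>=": "shr-eq", "&=": "amp-eq",
    "^=": "caret-eq", "|=": "bar-eq",
}

# Digraphs reuse the symbol of their standard equivalent.
DIGRAPHS = {"<:": "[", ":>": "]", "<%": "{", "%>": "}", "%:": "#", "%:%:": "##"}

SPELLINGS = dict(PUNCT)
SPELLINGS.update({digraph: PUNCT[std] for digraph, std in DIGRAPHS.items()})

# Try longer spellings first so ">>=" wins over ">>" and ">".
BY_LENGTH = sorted(SPELLINGS, key=len, reverse=True)


def lex_punct(src, pos):
    """Return (symbol, new_pos) for the punctuator at src[pos], or None."""
    for spelling in BY_LENGTH:
        if src.startswith(spelling, pos):
            return SPELLINGS[spelling], pos + len(spelling)
    return None
```

Sorting by spelling length is one simple way to get maximal munch; a hand-written lexer would more likely branch on the first character.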

1.3 `KW` value symbols

;; storage
auto register static extern typedef
;; qualifiers (parsed and discarded)
const volatile restrict inline
;; type specifiers
void char short int long signed unsigned _Bool
;; rejected type specifiers (lexed as KW so we get clean diagnostics)
float double
;; aggregates
struct union enum
;; statements
if else while do for switch case default break continue return goto
;; operators
sizeof
;; reserved-and-rejected (lexed as KW so we error crisply)
_Generic _Atomic _Thread_local _Alignof _Alignas _Static_assert
_Complex _Imaginary

Anything matching the C identifier grammar that is not in this list lexes as IDENT.

1.4 `ctype-kind`

void  i8 u8  i16 u16  i32 u32  i64 u64  bool
ptr  arr  fn  struct  union  enum

signed char is i8 and unsigned char is u8; plain char is also i8 (we treat plain char as signed, matching MesCC and most compilers). long and long long both collapse to i64/u64 on P1-64.

1.5 `opnd-kind`

imm  frame  global  reg

reg is transient — used for the result of a call before it spills to a frame slot. The vstack itself never holds reg opnds; cg materializes through reg only inside a single emission step.

1.6 `macro-kind`

obj  fn  fn-vararg

1.7 `sym-kind`

var  fn  typedef  enum-const  param  label

1.8 `sym-storage`

auto  static  extern  register

#f for typedef, enum-const, and label symbols (storage class doesn't apply).

1.9 `loop-ctx-kind`

while  do  for  switch

1.10 `reg` opnd register names

a0  a1  a2  a3

Only argument registers. Saved registers (s0..s3) and temporaries (t0..t2) are cg-private; never exposed as opnd payload.

1.11 `cg-binop` and `cg-unop` operator symbols

;; cg-binop:
add  sub  mul  div  rem
and  or  xor  shl  shr
eq  ne  lt  le  gt  ge

;; cg-unop:
neg   ;; arithmetic negate
bnot  ;; bitwise complement (~)
lnot  ;; logical not (!)

These are abstract operations independent of source-level PUNCT symbols; the parser maps PUNCT → cg-op (e.g., 'plus → 'add, 'eq2 → 'eq).
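A sketch of the PUNCT → cg-op mapping for binary operators, in Python. Only plus → add and eq2 → eq are stated above; every other pair, and the helper name binop_for, is an assumption chosen to match the obvious C meaning.

```python
# cg-binop alphabet from §1.11.
CG_BINOPS = {"add", "sub", "mul", "div", "rem", "and", "or", "xor",
             "shl", "shr", "eq", "ne", "lt", "le", "gt", "ge"}

# Only plus -> add and eq2 -> eq come from the text; the rest are assumed.
PUNCT_TO_BINOP = {
    "plus": "add", "minus": "sub", "star": "mul", "slash": "div", "pct": "rem",
    "amp": "and", "bar": "or", "caret": "xor", "shl": "shl", "shr": "shr",
    "eq2": "eq", "ne": "ne", "lt": "lt", "le": "le", "gt": "gt", "ge": "ge",
}


def binop_for(punct):
    """Map a binary PUNCT symbol, or its compound-assign form, to a cg-binop."""
    base = punct[:-3] if punct.endswith("-eq") else punct
    return PUNCT_TO_BINOP[base]
```

Stripping the `-eq` suffix lets compound assignments such as `shl-eq` reuse the same table.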

2. Test serialization formats

Two test shapes coexist:

Pure-transformation suites (cc-lex, cc-pp) byte-diff a Scheme-readable serialization (§2.1).
Codegen / language suites (cc-cg, cc) compile-and-run the emitted program and assert runtime behavior (§2.2). No P1pp-text goldens.

2.1 Token line format

One token per line, as a Scheme list:

(KIND VALUE FILE LINE COL)

KIND: bare symbol from §1.1.
VALUE rendering depends on KIND:
- IDENT, STR: bytevector literal "..." with \n \t \r \\ \" escapes; non-ASCII bytes as \xNN.
- INT, CHAR: decimal integer.
- KW, PUNCT: bare symbol from §1.2 / §1.3.
- HASH, NL, EOF: #f.
FILE: bytevector literal.
LINE, COL: decimal integers (1-based).

The tok-hide field is not serialized — it is implementation detail of the preprocessor.

Example for int main() { return 0; } in t.c:

(KW int "t.c" 1 1)
(IDENT "main" "t.c" 1 5)
(PUNCT lparen "t.c" 1 9)
(PUNCT rparen "t.c" 1 10)
(PUNCT lbrace "t.c" 1 12)
(KW return "t.c" 1 14)
(INT 0 "t.c" 1 21)
(PUNCT semi "t.c" 1 22)
(PUNCT rbrace "t.c" 1 24)
(EOF #f "t.c" 1 25)

Trailing whitespace and ;-comments in the golden file are ignored.
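The rendering rules above can be sketched as a small serializer. This is an illustrative Python model, not the suite's real code: render_bytes and render_token are hypothetical names, and two details are assumptions the text does not settle (control bytes other than the named escapes are written as \xNN, and hex digits are uppercase).

```python
def render_bytes(data):
    """Render bytes as a "..." literal using the escapes listed in §2.1."""
    escapes = {0x0A: "\\n", 0x09: "\\t", 0x0D: "\\r", 0x5C: "\\\\", 0x22: '\\"'}
    out = []
    for b in data:
        if b in escapes:
            out.append(escapes[b])
        elif 0x20 <= b < 0x7F:
            out.append(chr(b))
        else:
            # Assumption: uppercase hex, also used for unnamed control bytes.
            out.append("\\x%02X" % b)
    return '"' + "".join(out) + '"'


def render_token(kind, value, file, line, col):
    """Render one token as a (KIND VALUE FILE LINE COL) line."""
    if kind in ("IDENT", "STR"):
        rendered = render_bytes(value)
    elif kind in ("INT", "CHAR"):
        rendered = str(value)
    elif kind in ("KW", "PUNCT"):
        rendered = value  # bare symbol from §1.2 / §1.3
    elif kind in ("HASH", "NL", "EOF"):
        rendered = "#f"
    else:
        raise ValueError("unknown tok-kind: " + kind)
    return "(%s %s %s %d %d)" % (kind, rendered, render_bytes(file), line, col)
```

Calling render_token("KW", "int", b"t.c", 1, 1) reproduces the first line of the example above.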

2.2 Runtime fixture format

Every cc-cg / cc fixture compiles to a runnable program whose runtime behavior is the assertion. Two sibling files describe the expectation:

<name>.expected-exit    # decimal integer; default 0 if absent
<name>.expected         # exact stdout match; default empty if absent

Stdout and stderr are merged in the runner. The harness exits non-zero if either expectation fails.
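A minimal Python sketch of the pass/fail decision, assuming the harness has already captured the exit code and merged output. The function names are hypothetical, and None stands in for an absent sibling file.

```python
def parse_expected_exit(text):
    """Contents of <name>.expected-exit; None (file absent) means 0."""
    return 0 if text is None else int(text.strip())


def fixture_passes(actual_exit, actual_output, expected_exit_text, expected_text):
    """True when both the exit code and the exact output match."""
    want_exit = parse_expected_exit(expected_exit_text)
    want_output = b"" if expected_text is None else expected_text
    return actual_exit == want_exit and actual_output == want_output
```

The output comparison is byte-exact, so a fixture that prints a trailing newline needs that newline in its .expected file.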

Fixture-source conventions:

cc-cg (<name>.scm): drives the cg API directly, ending in (write-bv-fd 1 (cg-finish cg)). The fixture must define every symbol it references — cg-calls into externs are out of scope for this suite (no libc linkage).
cc (<name>.c): a complete C translation unit including a main that returns the asserted value. The full compiler driver (cc/cc.scm: lex+pp+parse+cg) runs against it. Holds both feature-targeted drills and full-envelope scenarios; once multi-TU / libc linkage land, those fixtures live here too.

Negative tests (compiler is supposed to die) set expected-exit to a non-zero value and may rely on the diagnostic-prefix check in §2.3 rather than asserting exact stdout.

2.3 Diagnostic format

Already canonical from CC-INTERNALS:

<file>:<line>:<col>: error: <msg>: <irritants...>

Tests for failure paths verify:

exit status is 1
stderr contains the expected <file>:<line>:<col>: error: prefix

The <msg> body is not matched character-for-character (so we can refine wording without breaking tests); tests match only the prefix and a keyword substring of the engineer's choice.
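The failure-path check could look like the following Python sketch. The regular expression, the function name diagnostic_ok, and the sample message used with it are illustrative assumptions; only the prefix shape and the exit status of 1 come from this section.

```python
import re

# <file>:<line>:<col>: error: at the start of a line.
PREFIX = re.compile(
    r"^(?P<file>[^:\n]+):(?P<line>\d+):(?P<col>\d+): error: ", re.MULTILINE
)


def diagnostic_ok(exit_status, stderr_text, file, line, col, keyword):
    """Check exit status 1, the location prefix, and a keyword in the message."""
    if exit_status != 1:
        return False
    for match in PREFIX.finditer(stderr_text):
        where = (match.group("file"), int(match.group("line")), int(match.group("col")))
        if where == (file, line, col):
            # Only the remainder of the matching line is searched.
            message = stderr_text[match.end():].split("\n", 1)[0]
            if keyword in message:
                return True
    return False
```

Matching a keyword rather than the whole message is what lets the wording change without touching the tests.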

3. Frame layout / parameter ABI

3.1 cg-fn-begin contract

(cg-fn-begin cg name params return-type) -> param-syms
;; name:        bv (un-mangled C identifier)
;; params:      list of (name-bv . ctype)
;; return-type: ctype
;; param-syms:  alist (name-bv . sym), each sym already bound to a frame slot

Inside cg-fn-begin, cg:

Allocates one frame slot per parameter via cg-alloc-slot. Slot width = ctype-size rounded up to 8 (align-up); align = 8. (Yes, every param costs at least 8 bytes. P1-64 frame is word-stride; we don't pack.)
Begins emitting into cg-fn-buf. Does not yet emit the prologue — that's deferred to cg-fn-end once frame-hi is final.
Emits the param-spill code into a "prologue prefix" buffer (private to cg): for params 0..3, ST aN, [sp + slotN]; for params 4+, LDARG t0, K then ST t0, [sp + slotK].
Returns the param-sym alist. Parser binds these into the function-body scope.
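The slot allocation and spill steps can be modelled as below. This Python sketch makes several assumptions: the frame starts at offset 0 (it ignores the staging prefix from §3.4), every parameter fits in one word, and the LDARG operand is the parameter index, following the text's use of K. The name spill_params is hypothetical.

```python
def align_up(n, align):
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align


def spill_params(param_sizes, frame_hi=0):
    """Return (slot offsets, prologue-prefix lines, new frame_hi)."""
    offsets, lines = [], []
    for i, size in enumerate(param_sizes):
        frame_hi = align_up(frame_hi, 8)   # align = 8
        offsets.append(frame_hi)
        frame_hi += align_up(size, 8)      # word-stride, never packed
        if i < 4:
            lines.append("ST a%d, [sp + %d]" % (i, offsets[i]))
        else:
            # Assumption: K in "LDARG t0, K" is the parameter index.
            lines.append("LDARG t0, %d" % i)
            lines.append("ST t0, [sp + %d]" % offsets[i])
    return offsets, lines, frame_hi
```

For a 4-byte int and an 8-byte pointer this yields slots at 0 and 8 and two ST instructions from a0 and a1.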

3.2 cg-fn-end contract

(cg-fn-end cg)

cg:

Reads final frame-hi (highest byte allocated).
Emits the per-function preamble (an M1pp %struct is not used — slots are numeric byte offsets, baked into the body text already buffered in fn-buf).

Wraps the prologue-prefix + fn-buf inside a libp1pp %fn macro:

%fn(<mangled-name>, <frame-hi-aligned-up-to-16>, {
  <prologue-prefix bytes>
  <fn-buf bytes>
  ::ret
  LD a0, [sp + <return-slot>]
  ; LD a1, [sp + <return-slot> + 8]   when ret-type size > 8
})

Flushes the result into cg-text, clears fn-buf and the prologue-prefix buffer, resets vstack, frame-hi, and the function-local label counter.

The frame size is rounded up to 16 to satisfy the P1 stack-align contract.

The return slot itself is sized to max(8, ctype-size(ret-type)) rounded up to 8, 8-byte aligned. Slot width is what dispatches the load epilogue across the three result conventions defined in P1.md §Arguments and return values:

Width	Convention	Epilogue
≤ 8B	one-word direct	`LD a0, [sp + slot]`
9–16B	two-word direct	`LD a0, [sp + slot]; LD a1, [sp + slot + 8]`
> 16B	indirect-result	(Stream A2) caller passes buffer in a0; callee writes through it; epilogue restores a0

The cg lowers all three conventions; the parser surface (return statement, struct-typed call result) is identical regardless of which convention applies.
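The width-based dispatch in the table can be sketched as follows. This is a Python model with hypothetical function names; the indirect-result case returns None because its instruction sequence is not spelled out here.

```python
def align_up(n, align):
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align


def return_slot_size(ret_size):
    """max(8, ctype-size(ret-type)) rounded up to 8."""
    return align_up(max(8, ret_size), 8)


def convention(ret_size):
    """Pick the result convention from the return slot width."""
    width = return_slot_size(ret_size)
    if width <= 8:
        return "one-word direct"
    if width <= 16:
        return "two-word direct"
    return "indirect-result"


def epilogue(ret_size, slot):
    """Load epilogue for the direct conventions; None for indirect-result."""
    kind = convention(ret_size)
    if kind == "one-word direct":
        return ["LD a0, [sp + %d]" % slot]
    if kind == "two-word direct":
        return ["LD a0, [sp + %d]" % slot, "LD a1, [sp + %d]" % (slot + 8)]
    return None
```

Because the slot width is already rounded to a multiple of 8, the only direct widths that occur are 8 and 16.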

3.3 Loop tag protocol

(cg-loop cg head-thunk body-thunk) -> tag

cg-loop allocates a fresh per-function tag (L0, L1, …), emits the libp1pp %loop_tag(<tag>, { … }) wrapper, runs head-thunk inside the loop head (it is expected to leave the condition opnd on the vstack), pops the condition and emits an %if_eqz(t0, %break(tag)), then invokes body-thunk with the tag as its single argument:

(cg-loop cg
         (lambda () (cg-push-imm cg %t-i32 1))
         (lambda (tag)
           (cg-continue cg tag)
           (cg-break cg tag)))

The parser uses the same tag for any cg-break / cg-continue calls made during body emission. cg-loop returns the tag to its caller as well, so post-loop teardown code may reference it; cg-loop-end is a no-op kept for symmetry.

Switch dispatch follows the same pattern: cg-switch-begin returns a swctx whose swctx-end-tag accessor exposes cg's break-target tag to the parser.

3.4 Outgoing-arg staging

When cg-call is asked to emit a call with arity > 4, it stages args 4..(N-1) into the low-addressed prefix of the current frame at [sp + 0*8], [sp + 1*8], etc., per LIBP1PP.md §Frame locals. cg tracks the maximum staging count seen across the function and reserves that prefix at fn-end before any other slots — i.e., cg-alloc-slot's first allocation comes after the staging area.

The accounting is internal to cg. Parse never sees staging slots.
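The staging arithmetic is small enough to show directly. A Python sketch with hypothetical function names, assuming one 8-byte word per staged argument as the [sp + k*8] addressing implies:

```python
def staging_offsets(arity):
    """sp-relative offsets for args 4..arity-1; empty when arity <= 4."""
    return [(i - 4) * 8 for i in range(4, arity)]


def staging_bytes(call_arities):
    """Bytes to reserve for the widest call seen in the function."""
    widest = max((max(0, n - 4) for n in call_arities), default=0)
    return widest * 8
```

A function whose widest call takes six arguments reserves 16 bytes, and its first ordinary slot starts after that prefix.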

3.5 cg-alloc-slot contract

(cg-alloc-slot cg bytes align) -> offset

bytes = total size needed (e.g., 4 for int, 40 for int[10], sizeof(struct foo) for a struct).
align = required alignment (1, 2, 4, or 8).
Returns numeric byte offset relative to sp post-%enter.

cg first aligns frame-hi up to align, returns that as the offset, then bumps frame-hi by bytes. Slots are not reused across scopes (we're optimizing for compiler simplicity, not frame size). Local arrays and structs request their full size in one call.
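The bump-allocation rule above, together with the 16-byte frame rounding from §3.2, can be modelled in a few lines of Python. The Frame class and its method names are hypothetical; the real allocator lives inside cg.

```python
def align_up(n, align):
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align


class Frame:
    """Bump allocator over sp-relative byte offsets; slots are never reused."""

    def __init__(self, start=0):
        self.frame_hi = start

    def alloc_slot(self, nbytes, align):
        if align not in (1, 2, 4, 8):
            raise ValueError("align must be 1, 2, 4, or 8")
        offset = align_up(self.frame_hi, align)  # align first
        self.frame_hi = offset + nbytes          # then bump by bytes
        return offset

    def frame_size(self):
        """Final frame size, rounded up to 16 for the stack-align contract."""
        return align_up(self.frame_hi, 16)
```

Allocating an int, a char, a pointer, and an int[10] in that order gives offsets 0, 4, 8, and 16, with a final frame size of 64.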

4. Conversion responsibility

The parser drives type semantics; cg is type-aware only enough to choose signed-vs-unsigned variants and to scale pointer arithmetic.

4.1 Parser's responsibilities

The parser must call cg in this order around each operation:

Source	Required parser actions before the operation
`e1 + e2` (and other arith binops)	(a) parse e1 → if lval, `cg-load`; (b) `cg-promote` if rank < int; (c) parse e2 same way; (d) `cg-arith-conv` to bring both to common type; (e) `cg-binop add`
`*p`	parse p → if lval, `cg-load`; then `cg-push-deref`
`&x`	parse x → must be lval; then `cg-take-addr`
`(T)e`	parse e → if lval, `cg-load` (unless casting to a pointer); then `cg-cast T`
`f(a, b, ...)`	parse f → if lval and `f` not a function-typed identifier, `cg-load`; parse each arg → `cg-load` if lval, then `cg-cast` to param type (or default-promote for variadic args); then `cg-call`
`lhs = rhs`	parse lhs → must be lval (no load); parse rhs → `cg-load` if lval; `cg-assign` (cg internally casts rhs to lhs type — parse cannot peek beneath vstack top)
`lhs += rhs` (and other compound assigns)	parse lhs (lval) → `cg-dup` to preserve the lval across the read; `cg-load` (consumes one copy); parse rhs → `cg-load` if lval; `cg-arith-conv`; `cg-binop <op>`; `cg-assign` (cg casts internally)
`++lhs` / `--lhs`	parse lhs (lval) → `cg-dup`; `cg-load`; `cg-push-imm 1`; `cg-binop add`/`sub`; `cg-assign`
`lhs++` / `lhs--`	parse lhs (lval) → `cg-postinc` / `cg-postdec` (atomic primitive: dups+loads to capture the old rval, then dup+load+`+1`+assign (`-1` for `cg-postdec`) for the store, and finally pushes the saved old rval)
`return e`	parse e → `cg-load` if lval; `cg-cast` to fn return type; `cg-return`
`if (e) ...`	parse e → `cg-load` if lval; `cg-cast bool` if not already int-shaped; `cg-if`
`c ? a : b`	parse c → `cg-load` if lval; `cg-ifelse-merge` with each thunk parsing one arm and ending with `rval!`; result type is the first arm's type
`a && b`	parse a → `cg-load` if lval; `cg-ifelse-merge` with then-arm = `parse b; rval!; cg-cast bool; cg-cast i32` and else-arm = `cg-push-imm i32 0`
`a || b`	mirror of `&&`: then-arm = `cg-push-imm i32 1`; else-arm = parse b + bool/i32 cast
`a, b`	parse a (its rval is on top) → `cg-pop`; parse b → `cg-load` if lval; the comma's value is b
`sizeof e`	parse e (don't suppress emission); peek `(opnd-type (cg-top …))`'s `ctype-size`; `cg-pop`; `cg-push-imm u64 size`

The parser is responsible for the standard:

Integer promotion: any operand of type rank below int is promoted to int (or unsigned int if it can't fit) before use in arithmetic, before assignment to a wider lhs in mixed contexts, and before being passed as a variadic argument.
Usual arithmetic conversions: applied to both operands of a binary arithmetic operator after promotion. The result type is the common type.
Pointer-int interaction: detected by parser; cg-binop add on (ptr, int) handles scaling internally (see §4.2).
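Integer promotion and the usual arithmetic conversions can be sketched over the integer kinds from §1.4. This Python model applies the ISO C rules under the assumption that int is i32; the RANK table and function names are hypothetical and do not describe how cg-promote or cg-arith-conv are implemented.

```python
# Conversion rank per integer ctype-kind (assumes int is i32).
RANK = {"bool": 0, "i8": 1, "u8": 1, "i16": 2, "u16": 2,
        "i32": 3, "u32": 3, "i64": 4, "u64": 4}


def promote(kind):
    """Integer promotion: anything ranked below int becomes i32."""
    return "i32" if RANK[kind] < RANK["i32"] else kind


def arith_conv(a, b):
    """Common type of two integer operands after promotion."""
    a, b = promote(a), promote(b)
    if a == b:
        return a
    if a[0] == b[0]:
        # Same signedness: the higher rank wins.
        return a if RANK[a] > RANK[b] else b
    unsigned, signed = (a, b) if a[0] == "u" else (b, a)
    if RANK[unsigned] >= RANK[signed]:
        return unsigned
    # The wider signed kind can represent every value of the narrower unsigned one.
    return signed
```

Every kind below i32 fits in i32, so the "or unsigned int" branch of promotion never triggers with this set of kinds.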

4.2 cg's responsibilities

cg trusts the operand types it is handed.

cg-load: pop lval, emit one load (of the right width based on ctype-size), push rval of the same type.
cg-cast to-type: pop, emit sign-extend / zero-extend / truncate as needed based on source vs. target sizes and signedness. For to-type = bool: emit (BNEZ -> 1, fallthrough -> 0) shape. Pointer ↔ integer casts are bit-for-bit on P1-64 (no emission).
cg-binop add (and sub): if exactly one operand is a ptr, scale the int operand by ctype-size of the pointee before adding. If both ptr → only sub is valid (yields i64 byte difference, divided by element size); other binops on (ptr, ptr) abort via die.
cg-binop for divisions and comparisons: dispatch to signed (DIV/BLT) or unsigned (DIV+sign-flip / BLTU) variant based on the operand kinds (i* → signed, u* → unsigned). After cg-arith-conv, both operands have the same kind, so dispatch is unambiguous.
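The pointer-scaling and signedness rules can be sketched as two decision functions. This Python model uses hypothetical names and return tags; operands are (kind, pointee-size) pairs, and it assumes the parser has already rejected invalid C such as int - ptr, since cg trusts what it is handed.

```python
def plan_add_sub(op, lhs, rhs):
    """Decide how cg-binop add/sub treats its operands.

    lhs and rhs are (kind, pointee_size); pointee_size is None for integers.
    """
    lhs_ptr, rhs_ptr = lhs[0] == "ptr", rhs[0] == "ptr"
    if lhs_ptr and rhs_ptr:
        if op != "sub":
            raise ValueError("only sub is valid on (ptr, ptr)")
        return ("divide-result", lhs[1])   # byte difference / element size
    if lhs_ptr:
        return ("scale-rhs", lhs[1])       # int operand * pointee size
    if rhs_ptr:
        return ("scale-lhs", rhs[1])
    return ("plain", None)


def less_than_branch(kind):
    """Pick the signed or unsigned branch from the operand kind."""
    if kind[0] == "i":
        return "BLT"
    if kind[0] == "u":
        return "BLTU"
    raise ValueError("not an integer kind: " + kind)
```

After cg-arith-conv both operands share a kind, so passing either one to less_than_branch gives the same answer.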

cg never:

auto-loads an lvalue
auto-promotes
auto-converts arguments
looks at fn-ctx return type (parser passes the cast)

This split keeps cg under ~600 LOC by pushing all "C language" knowledge into parse.

5. Symbol-to-label mangling

Three label namespaces in the emitted P1pp:

5.1 User globals (functions, variables)

C identifier "foo"  →  P1pp label  :cc__foo

Verbatim concatenation. C identifiers can't contain : or other P1pp-special characters, so no escaping is needed. The cc__ prefix guarantees no collision with libp1pp internals (libp1pp__*), backend stubs (_start, p1_main, sys_*), or our own runtime support.

static storage at file scope changes nothing about the label — since we have one TU, internal-linkage and external-linkage symbols share the same namespace. static only suppresses any future "export to other TUs" emission, which we don't do anyway.

5.2 String pool

n-th distinct string literal  →  :cc__str_<n>

<n> is a fresh decimal counter starting at 0, advanced only on non-deduplicating insert. Identical string literals share a label (idempotent intern).
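The two mangling rules so far fit in a few lines. A Python sketch with hypothetical names (mangle_global, StringPool); it shows the counter advancing only when a new literal is inserted.

```python
def mangle_global(name):
    """C identifier -> P1pp label, by verbatim concatenation."""
    return ":cc__" + name


class StringPool:
    """Interns string literals; identical literals share one label."""

    def __init__(self):
        self.labels = {}

    def intern(self, data):
        if data not in self.labels:
            # The counter is the number of distinct literals seen so far.
            self.labels[data] = ":cc__str_%d" % len(self.labels)
        return self.labels[data]
```

Interning the same bytes twice returns the same label and leaves the counter unchanged.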

5.3 Function-internal labels

Inside %fn(...), libp1pp's %scope mechanism prefixes short labels (::ret, ::lbl_42) to <fnname>__ret, <fnname>__lbl_42 at M1pp time. cg uses short labels exclusively inside fn-buf:

::ret — single function exit
::lbl_<n> — anonymous control-flow targets (for switch cases, short-circuit eval, etc.)
::user_<name> — user-written goto labels (C myloop: → ::user_myloop). The user_ prefix prevents collisions with our lbl_ and ret labels.

Loop tags (for libp1pp's %loop_tag, %break(tag), %continue(tag)) are not labels — they're macro-name fragments. cg generates them as L<n> (no cc__ prefix; tag namespace is per-function and %fn already scopes them).

5.4 Entry stub

cg emits a small entry stub at cg-finish time:

%fn(p1_main, %p1_main_f.SIZE, {
  ; argc = a0, argv = a1 already
  %la_br(&cc__main)
  %call
  %eret
})

So int main(int argc, char **argv) is reached from the P1 program-entry contract.

If the user's main has no parameters, the stub still passes argc/argv; main just ignores them, which is harmless. If the user defines main with a different return type, that's a CC.md violation; cg can either die or emit and let the cast happen at the return site. Recommendation: the parser checks this at cg-fn-end time when fn-name == main.

6. Phase-1 milestone

int main(int argc, char **argv) {
    return argc;
}

This is the integration target every engineer aims at. It goes through:

lex: int, main, (, int, argc, ,, char, *, *, argv, ), {, return, argc, ;, } — all PUNCT and KW symbols touched are core; covers two of each kind.
pp: zero directives, but full token-list traversal.
parse: function definition, two-parameter list including char **, compound stmt, return stmt, identifier expression, lval→rval load.
cg: cg-fn-begin with two params, parameter spilling (both register-passed, in a0 and a1), cg-push-sym, cg-load, cg-return, cg-fn-end, %fn wrapping, and the p1_main entry stub.
e2e: link with arch backend + libp1pp; run as native ELF; exit code matches argc.

Acceptance test: tests/cc/00-return-argc.c exists, the make target builds it, and:

$ ./tests/cc/build/00-return-argc            ; echo $?  → 1
$ ./tests/cc/build/00-return-argc a b        ; echo $?  → 3

When this test passes on aarch64, amd64, and riscv64, phase 1 is complete.

Change protocol

Anyone proposing a contract change:

PR amends this doc first, with rationale.
Affected modules + tests are listed.
Changes land in one PR (doc + all affected code) so no engineer pulls a half-migrated tree.
