lispcc contracts

Frozen interfaces between modules. Engineers must not diverge from these without proposing a change. This document is the source of truth for the symbol alphabets, test formats, ABI, and phase-1 milestone referenced from CC-INTERNALS.md.

1. Symbol alphabets

Every record's kind-style fields use these exact symbols. Adding a symbol = updating this section first.

1.1 tok-kind

IDENT KW INT STR CHAR PUNCT HASH NL EOF

Uppercase to distinguish from value-level symbols. IDENT carries an unrecognized identifier; KW is one of the symbols in §1.3.

1.2 PUNCT value symbols

The lexer produces tok-value symbols for punctuators per the following table. Names are mandatory — no engineer may use the raw '+, '*, etc. as symbols, because several C punctuator characters (%, |, ,, ;, (, ), {, }, [, ], ., #) cannot form valid scheme1 symbols. We use named symbols for all punctuators to keep the scheme uniform.

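| C | Symbol | C | Symbol |
|---|---|---|---|
| [ | lbrack | == | eq2 |
| ] | rbrack | != | ne |
| ( | lparen | < | lt |
| ) | rparen | > | gt |
| { | lbrace | <= | le |
| } | rbrace | >= | ge |
| . | dot | << | shl |
| -> | arrow | >> | shr |
| , | comma | && | land |
| ; | semi | \|\| | lor |
| : | colon | & | amp |
| ? | qmark | \| | bar |
| ... | ellipsis | ^ | caret |
| ++ | inc | ~ | tilde |
| -- | dec | ! | bang |
| + | plus | = | assign |
| - | minus | += | plus-eq |
| * | star | -= | minus-eq |
| / | slash | *= | star-eq |
| % | pct | /= | slash-eq |
| # | hash | %= | pct-eq |
| ## | paste | <<= | shl-eq |
|  |  | >>= | shr-eq |
|  |  | &= | amp-eq |
|  |  | ^= | caret-eq |
|  |  | \|= | bar-eq |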

Digraphs (<: :> <% %> %: %:%:) lex as their standard equivalents ([ ] { } # ##) and produce the same symbol from the table; for example, <: lexes as lbrack.
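
The table and the digraph rule can be modelled as a lookup plus longest-match scanning. This is a minimal Python sketch for illustration only: the real lexer is written in scheme1, and the names PUNCT, DIGRAPHS, SPELLINGS, and lex_punct are hypothetical. Longest-match ("maximal munch") is assumed here because it is the standard C lexing rule; this document does not spell it out.

```python
# Illustrative model of §1.2; not the scheme1 implementation.
PUNCT = {
    "[": "lbrack", "]": "rbrack", "(": "lparen", ")": "rparen",
    "{": "lbrace", "}": "rbrace", ".": "dot", "->": "arrow",
    ",": "comma", ";": "semi", ":": "colon", "?": "qmark",
    "...": "ellipsis", "++": "inc", "--": "dec", "+": "plus",
    "-": "minus", "*": "star", "/": "slash", "%": "pct",
    "#": "hash", "##": "paste",
    "==": "eq2", "!=": "ne", "<": "lt", ">": "gt", "<=": "le",
    ">=": "ge", "<<": "shl", ">>": "shr", "&&": "land", "||": "lor",
    "&": "amp", "|": "bar", "^": "caret", "~": "tilde", "!": "bang",
    "=": "assign", "+=": "plus-eq", "-=": "minus-eq", "*=": "star-eq",
    "/=": "slash-eq", "%=": "pct-eq", "<<=": "shl-eq", ">>=": "shr-eq",
    "&=": "amp-eq", "^=": "caret-eq", "|=": "bar-eq",
}

# Each digraph maps to its standard spelling, then to that spelling's symbol.
DIGRAPHS = {"<:": "[", ":>": "]", "<%": "{", "%>": "}", "%:": "#", "%:%:": "##"}

SPELLINGS = dict(PUNCT)
SPELLINGS.update({d: PUNCT[std] for d, std in DIGRAPHS.items()})
MAX_LEN = max(len(s) for s in SPELLINGS)


def lex_punct(src, pos):
    """Return (symbol, length) for the longest punctuator at src[pos:], else None."""
    for n in range(MAX_LEN, 0, -1):
        text = src[pos:pos + n]
        # Guard against short slices near the end of the input.
        if len(text) == n and text in SPELLINGS:
            return SPELLINGS[text], n
    return None
```

With this model, lex_punct("a<<=b", 1) yields the shl-eq symbol with length 3 rather than stopping at shl or lt.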

1.3 KW value symbols

;; storage
auto register static extern typedef
;; qualifiers (parsed and discarded)
const volatile restrict inline
;; type specifiers
void char short int long signed unsigned _Bool
;; rejected type specifiers (lexed as KW so we get clean diagnostics)
float double
;; aggregates
struct union enum
;; statements
if else while do for switch case default break continue return goto
;; operators
sizeof
;; reserved-and-rejected (lexed as KW so we error crisply)
_Generic _Atomic _Thread_local _Alignof _Alignas _Static_assert
_Complex _Imaginary

Anything matching the C identifier grammar that is not in this list lexes as IDENT.
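
The KW/IDENT split can be sketched as a set-membership test. This is an illustrative Python model, not the scheme1 lexer; KEYWORDS, REJECTED, and classify_word are hypothetical names, and the sets simply transcribe the list above.

```python
# Illustrative model of §1.3: every word in the list lexes as KW, all else as IDENT.
KEYWORDS = frozenset("""
    auto register static extern typedef
    const volatile restrict inline
    void char short int long signed unsigned _Bool
    float double
    struct union enum
    if else while do for switch case default break continue return goto
    sizeof
    _Generic _Atomic _Thread_local _Alignof _Alignas _Static_assert
    _Complex _Imaginary
""".split())

# Lexed as KW only so that later stages can reject them with a clean diagnostic.
REJECTED = frozenset("""
    float double
    _Generic _Atomic _Thread_local _Alignof _Alignas _Static_assert
    _Complex _Imaginary
""".split())


def classify_word(word):
    """Return the tok-kind for a word that already matches the C identifier grammar."""
    return "KW" if word in KEYWORDS else "IDENT"
```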

1.4 ctype-kind

void  i8 u8  i16 u16  i32 u32  i64 u64  bool
ptr  arr  fn  struct  union  enum

Char is i8 (signed char) or u8 (unsigned char); char itself is i8 (we treat plain char as signed, matching MesCC and most compilers). Long and long long collapse to i64/u64 on P1-64.
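
As a rough illustration of the collapse described above, the following Python sketch maps an integer base type and signedness to a ctype-kind. Only the char and long / long long rows come from this section; the 16-bit short and 32-bit int widths are assumptions, and WIDTH_BITS and int_ctype are hypothetical names.

```python
# Illustrative model of §1.4 integer kinds on P1-64.
# char and long / long long widths are stated above; short = 16 and int = 32
# are ASSUMED here and should be checked against CC.md.
WIDTH_BITS = {"char": 8, "short": 16, "int": 32, "long": 64, "long long": 64}


def int_ctype(base, unsigned=False):
    """Return the ctype-kind symbol name, e.g. 'i8' for plain (signed) char."""
    return ("u" if unsigned else "i") + str(WIDTH_BITS[base])
```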

1.5 opnd-kind

imm  frame  global  reg

reg is transient — used for the result of a call before it spills to a frame slot. The vstack itself never holds reg opnds; cg materializes through reg only inside a single emission step.

1.6 macro-kind

obj  fn  fn-vararg

1.7 sym-kind

var  fn  typedef  enum-const  param  label

1.8 sym-storage

auto  static  extern  register

sym-storage is #f for typedef, enum-const, and label symbols (storage class doesn't apply).

1.9 loop-ctx-kind

while  do  for  switch

1.10 reg opnd register names

a0  a1  a2  a3

Only argument registers. Saved registers (s0..s3) and temporaries (t0..t2) are cg-private; never exposed as opnd payload.

1.11 cg-binop and cg-unop operator symbols

;; cg-binop:
add  sub  mul  div  rem
and  or  xor  shl  shr
eq  ne  lt  le  gt  ge

;; cg-unop:
neg   ;; arithmetic negate
bnot  ;; bitwise complement (~)
lnot  ;; logical not (!)

These are abstract operations independent of source-level PUNCT symbols; the parser maps PUNCT → cg-op (e.g., 'plus → 'add, 'eq2 → 'eq).
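
The PUNCT → cg-op mapping might look like the following Python sketch. Only the plus → add and eq2 → eq pairs are given in this document; every other pair is inferred from the operator names and should be treated as an assumption. BINOP_OF_PUNCT, UNOP_OF_PUNCT, and cg_op are hypothetical names.

```python
# Illustrative model of the parser's PUNCT -> cg-op mapping (§1.11).
# Only plus->add and eq2->eq are stated in the contract; the rest is inferred.
BINOP_OF_PUNCT = {
    "plus": "add", "minus": "sub", "star": "mul", "slash": "div", "pct": "rem",
    "amp": "and", "bar": "or", "caret": "xor", "shl": "shl", "shr": "shr",
    "eq2": "eq", "ne": "ne", "lt": "lt", "le": "le", "gt": "gt", "ge": "ge",
}

UNOP_OF_PUNCT = {"minus": "neg", "tilde": "bnot", "bang": "lnot"}


def cg_op(punct, unary=False):
    """Return the cg operator for a PUNCT symbol, or None if there is none."""
    table = UNOP_OF_PUNCT if unary else BINOP_OF_PUNCT
    return table.get(punct)
```

Note that land and lor have no entry: per §4.1 they are lowered through cg-ifelse-merge rather than cg-binop.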

2. Test serialization formats

Two test shapes coexist: token-stream golden files (§2.1) and runtime fixtures (§2.2).

2.1 Token line format

One token per line, as a Scheme list:

(KIND VALUE FILE LINE COL)

The tok-hide field is not serialized; it is an implementation detail of the preprocessor.

Example for int main() { return 0; } in t.c:

(KW int "t.c" 1 1)
(IDENT "main" "t.c" 1 5)
(PUNCT lparen "t.c" 1 9)
(PUNCT rparen "t.c" 1 10)
(PUNCT lbrace "t.c" 1 12)
(KW return "t.c" 1 14)
(INT 0 "t.c" 1 21)
(PUNCT semi "t.c" 1 22)
(PUNCT rbrace "t.c" 1 24)
(EOF #f "t.c" 1 25)

Trailing whitespace and ;-comments in the golden file are ignored.
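
A minimal Python sketch of the line format and the golden-file normalization follows. It is a model of the format only, not the test harness. Assumptions: STR values are quoted like IDENT values, no escape sequences occur inside strings, and the names token_line and strip_golden are hypothetical.

```python
# Illustrative model of §2.1; assumes values contain no embedded double quotes.
def token_line(kind, value, file, line, col):
    """Render one token as (KIND VALUE FILE LINE COL)."""
    if value is None:
        text = "#f"                    # e.g. EOF carries no value
    elif kind in ("IDENT", "STR"):
        text = '"' + value + '"'       # string-valued payloads are quoted
    else:
        text = str(value)              # symbols and integers are written bare
    return '({} {} "{}" {} {})'.format(kind, text, file, line, col)


def strip_golden(line):
    """Drop a ;-comment that sits outside string literals, then trailing whitespace."""
    in_string = False
    for i, ch in enumerate(line):
        if ch == '"':
            in_string = not in_string
        elif ch == ";" and not in_string:
            line = line[:i]
            break
    return line.rstrip()
```

Comparison of actual output against a golden file would then be equality of the normalized lines.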

2.2 Runtime fixture format

Every cc-cg / cc fixture compiles to a runnable program whose runtime behavior is the assertion. Two sibling files describe the expectation:

<name>.expected-exit    # decimal integer; default 0 if absent
<name>.expected         # exact stdout match; default empty if absent

Stdout and stderr are merged in the runner. The harness exits non-zero if either expectation fails.
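
The expectation files and the merged-stream check could be modelled as below. This is a hedged Python sketch, not the project's runner: load_expectation and run_fixture are hypothetical names, and the sketch assumes the two files are addressed by appending the suffix to a common base path.

```python
# Illustrative model of §2.2; the real runner lives in the project's make targets.
import subprocess
from pathlib import Path


def load_expectation(base):
    """Return (exit_code, stdout_bytes), applying the documented defaults."""
    exit_file = Path(str(base) + ".expected-exit")
    out_file = Path(str(base) + ".expected")
    code = int(exit_file.read_text().strip()) if exit_file.exists() else 0
    out = out_file.read_bytes() if out_file.exists() else b""
    return code, out


def run_fixture(cmd, expected_exit, expected_out):
    """Run cmd with stderr merged into stdout; True only if both expectations hold."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    return proc.returncode == expected_exit and proc.stdout == expected_out
```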

Fixture-source conventions:

Negative tests (compiler is supposed to die) set expected-exit to a non-zero value and may rely on the diagnostic-prefix check in §2.3 rather than asserting exact stdout.

2.3 Diagnostic format

Already canonical from CC-INTERNALS:

<file>:<line>:<col>: error: <msg>: <irritants...>

Tests for failure paths verify:

The <msg> body is not matched character-for-character (so we can refine wording without breaking tests); only the prefix and a keyword substring of the engineer's choice are checked.
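
A prefix-plus-keyword check of this kind can be sketched in Python as follows. The regular expression and the name diagnostic_matches are illustrative assumptions; the sketch also assumes file names contain no colon.

```python
# Illustrative model of the §2.3 check: match the prefix, then a keyword substring.
import re

DIAG_PREFIX = re.compile(r"^[^:\n]+:\d+:\d+: error: ", re.MULTILINE)


def diagnostic_matches(output, keyword):
    """True if some line has the canonical prefix and mentions keyword after it."""
    for m in DIAG_PREFIX.finditer(output):
        end = output.find("\n", m.end())
        rest = output[m.end():] if end == -1 else output[m.end():end]
        if keyword in rest:
            return True
    return False
```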

3. Frame layout / parameter ABI

3.1 cg-fn-begin contract

(cg-fn-begin cg name params return-type) -> param-syms
;; name:        bv (un-mangled C identifier)
;; params:      list of (name-bv . ctype)
;; return-type: ctype
;; param-syms:  alist (name-bv . sym), each sym already bound to a frame slot

Inside cg-fn-begin, cg:

  1. Allocates one frame slot per parameter via cg-alloc-slot. Slot width = ctype-size rounded up to 8 (align-up); align = 8. (Yes, every param costs at least 8 bytes. P1-64 frame is word-stride; we don't pack.)
  2. Begins emitting into cg-fn-buf. Does not yet emit the prologue — that's deferred to cg-fn-end once frame-hi is final.
  3. Emits the param-spill code into a "prologue prefix" buffer (private to cg): for params 0..3, ST aN, [sp + slotN]; for params 4+, LDARG t0, K then ST t0, [sp + slotK].
  4. Returns the param-sym alist. Parser binds these into the function-body scope.
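
Steps 1 and 3 above can be modelled as a small planning function. This Python sketch is illustrative only: plan_params is a hypothetical name, slots are assumed to start at offset 0 (no staging area), and only word-sized spills are shown. The LDARG operand is emitted as the parameter index K, as written in step 3; its exact numbering is defined by P1, not by this sketch.

```python
# Illustrative model of cg-fn-begin's slot allocation and param spill (§3.1).
def align_up(n, a):
    return (n + a - 1) // a * a


def plan_params(param_sizes, first_free=0):
    """param_sizes: ctype sizes in bytes. Returns (slots, spill_lines, frame_hi)."""
    frame_hi = first_free
    slots, lines = [], []
    for k, size in enumerate(param_sizes):
        slot = align_up(frame_hi, 8)           # align = 8
        frame_hi = slot + align_up(size, 8)    # width = size rounded up to 8
        slots.append(slot)
        if k < 4:
            lines.append("ST a{}, [sp + {}]".format(k, slot))
        else:
            lines.append("LDARG t0, {}".format(k))
            lines.append("ST t0, [sp + {}]".format(slot))
    return slots, lines, frame_hi
```

Six parameters of mixed sizes therefore occupy six 8-byte slots, with the last two routed through t0.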

3.2 cg-fn-end contract

(cg-fn-end cg)

cg:

  1. Reads final frame-hi (highest byte allocated).

  2. Emits the per-function preamble (an M1pp %struct is not used — slots are numeric byte offsets, baked into the body text already buffered in fn-buf).

  3. Wraps the prologue-prefix + fn-buf inside a libp1pp %fn macro:

    %fn(<mangled-name>, <frame-hi-aligned-up-to-16>, {
      <prologue-prefix bytes>
      <fn-buf bytes>
      ::ret
      LD a0, [sp + <return-slot>]
      ; LD a1, [sp + <return-slot> + 8]   when ret-type size > 8
    })
    
  4. Flushes the result into cg-text, clears fn-buf and the prologue-prefix buffer, resets vstack, frame-hi, and the function-local label counter.

The frame size is rounded up to 16 to satisfy the P1 stack-align contract.

The return slot itself is sized to max(8, ctype-size(ret-type)) rounded up to 8, 8-byte aligned. Slot width is what dispatches the load epilogue across the three result conventions defined in P1.md §Arguments and return values:

| Width | Convention | Epilogue |
|---|---|---|
| ≤ 8B | one-word direct | LD a0, [sp + slot] |
| 9–16B | two-word direct | LD a0, [sp + slot]; LD a1, [sp + slot + 8] |
| > 16B | indirect-result (Stream A2) | caller passes buffer in a0; callee writes through it; epilogue restores a0 |

The cg lowers all three conventions; the parser surface (return statement, struct-typed call result) is identical regardless of which convention applies.
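
The arithmetic behind the frame rounding and the width-based dispatch can be sketched in Python as below. The function names are hypothetical, and the indirect-result epilogue is deliberately not modelled because its exact instruction sequence is not given here.

```python
# Illustrative model of the §3.2 sizing rules and epilogue dispatch.
def align_up(n, a):
    return (n + a - 1) // a * a


def frame_size(frame_hi):
    """Frame size passed to %fn: frame-hi rounded up to 16."""
    return align_up(frame_hi, 16)


def return_slot_width(ret_size):
    """max(8, ctype-size) rounded up to 8."""
    return align_up(max(8, ret_size), 8)


def return_convention(ret_size):
    width = return_slot_width(ret_size)
    if width <= 8:
        return "one-word"
    if width <= 16:
        return "two-word"
    return "indirect"


def epilogue_loads(ret_size, slot):
    """Load lines for the two direct conventions; None for indirect (not modelled)."""
    conv = return_convention(ret_size)
    if conv == "indirect":
        return None
    lines = ["LD a0, [sp + {}]".format(slot)]
    if conv == "two-word":
        lines.append("LD a1, [sp + {}]".format(slot + 8))
    return lines
```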

3.3 Loop tag protocol

(cg-loop cg head-thunk body-thunk) -> tag

cg-loop allocates a fresh per-function tag (L0, L1, …), emits the libp1pp %loop_tag(<tag>, { … }) wrapper, runs head-thunk inside the loop head (it is expected to leave the condition opnd on the vstack), pops the condition and emits an %if_eqz(t0, %break(tag)), then invokes body-thunk with the tag as its single argument:

(cg-loop cg
         (lambda () (cg-push-imm cg %t-i32 1))
         (lambda (tag)
           (cg-continue cg tag)
           (cg-break cg tag)))

The parser uses the same tag for any cg-break / cg-continue calls made during body emission. cg-loop returns the tag to its caller as well, so post-loop teardown code may reference it; cg-loop-end is a no-op kept for symmetry.

Switch dispatch follows the same pattern: cg-switch-begin returns a swctx whose swctx-end-tag accessor exposes cg's break-target tag to the parser.
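
To make the control flow of the tag protocol concrete, here is a schematic Python model of cg-loop. It only records text lines; the class Cg and its methods are hypothetical, the step that pops the condition into t0 is elided as a comment line, and the emitted macro text follows the shapes quoted above rather than the actual libp1pp output.

```python
# Schematic model of the §3.3 loop tag protocol; not the scheme1 cg.
class Cg:
    def __init__(self):
        self.out = []
        self.next_tag = 0        # reset per function by cg-fn-end in the real cg

    def emit(self, line):
        self.out.append(line)

    def loop(self, head_thunk, body_thunk):
        tag = "L{}".format(self.next_tag)
        self.next_tag += 1
        self.emit("%loop_tag({}, {{".format(tag))
        head_thunk()                                  # leaves condition on vstack
        self.emit("; pop condition into t0 (elided)")
        self.emit("%if_eqz(t0, %break({}))".format(tag))
        body_thunk(tag)                               # body sees the same tag
        self.emit("})")
        return tag                                    # also returned to the caller

    def brk(self, tag):
        self.emit("%break({})".format(tag))

    def cont(self, tag):
        self.emit("%continue({})".format(tag))
```

Because the tag is passed into body-thunk, a nested loop gets its own tag and an inner break cannot accidentally target the outer loop.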

3.4 Outgoing-arg staging

When cg-call is asked to emit a call with arity > 4, it stages args 4..(N-1) into the low-addressed prefix of the current frame at [sp + 0*8], [sp + 1*8], etc., per LIBP1PP.md §Frame locals. cg tracks the maximum staging count seen across the function and reserves that prefix at fn-end before any other slots — i.e., cg-alloc-slot's first allocation comes after the staging area.

The accounting is internal to cg. Parse never sees staging slots.
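
The staging arithmetic can be illustrated with a short Python model. StagingTracker and its methods are hypothetical names; the sketch covers only offset computation and the running maximum, not the emitted stores.

```python
# Illustrative model of §3.4: args 4..N-1 go to [sp + 0*8], [sp + 1*8], ...
class StagingTracker:
    def __init__(self):
        self.max_staged = 0

    def stage_call(self, arity):
        """Return [(arg_index, sp_offset)] for the args staged through the frame."""
        staged = max(0, arity - 4)
        self.max_staged = max(self.max_staged, staged)
        return [(i, (i - 4) * 8) for i in range(4, arity)]

    def reserved_bytes(self):
        """Size of the low-addressed prefix to reserve at fn-end."""
        return self.max_staged * 8
```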

3.5 cg-alloc-slot contract

(cg-alloc-slot cg bytes align) -> offset

cg first aligns frame-hi up to align, returns that as the offset, then bumps frame-hi by bytes. Slots are not reused across scopes (we're optimizing for compiler simplicity, not frame size). Local arrays and structs request their full size in one call.
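
The align-then-bump behavior is a plain bump allocator. A minimal Python sketch, with the hypothetical class name Frame and an assumed starting offset of 0:

```python
# Illustrative model of cg-alloc-slot (§3.5): align frame-hi, return it, bump it.
class Frame:
    def __init__(self, start=0):
        self.frame_hi = start

    def alloc_slot(self, nbytes, align):
        offset = (self.frame_hi + align - 1) // align * align
        self.frame_hi = offset + nbytes
        return offset
```

For example, a 4-byte slot followed by an 8-byte-aligned slot leaves bytes 4..7 as padding, and a local array requests its full size in one call.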

4. Conversion responsibility

The parser drives type semantics; cg is type-aware only enough to choose signed-vs-unsigned variants and to scale pointer arithmetic.

4.1 Parser's responsibilities

The parser must call cg in this order around each operation:

| Source | Required parser actions before the operation |
|---|---|
| e1 + e2 (and other arith binops) | (a) parse e1 → if lval, cg-load; (b) cg-promote if rank < int; (c) parse e2 same way; (d) cg-arith-conv to bring both to common type; (e) cg-binop add |
| *p | parse p → if lval, cg-load; then cg-push-deref |
| &x | parse x → must be lval; then cg-take-addr |
| (T)e | parse e → if lval, cg-load (unless casting to a pointer); then cg-cast T |
| f(a, b, ...) | parse f → if lval and f not a function-typed identifier, cg-load; parse each arg → cg-load if lval, then cg-cast to param type (or default-promote for variadic args); then cg-call |
| lhs = rhs | parse lhs → must be lval (no load); parse rhs → cg-load if lval; cg-assign (cg internally casts rhs to lhs type — parse cannot peek beneath vstack top) |
| lhs += rhs (and other compound assigns) | parse lhs (lval) → cg-dup to preserve the lval across the read; cg-load (consumes one copy); parse rhs → cg-load if lval; cg-arith-conv; cg-binop <op>; cg-assign (cg casts internally) |
| ++lhs / --lhs | parse lhs (lval) → cg-dup; cg-load; cg-push-imm 1; cg-binop add/sub; cg-assign |
| lhs++ / lhs-- | parse lhs (lval) → cg-postinc / cg-postdec (atomic primitive: dups+loads to capture old rval, then dup+load++1+assign for the store, finally pushes the saved old rval) |
| return e | parse e → cg-load if lval; cg-cast to fn return type; cg-return |
| if (e) ... | parse e → cg-load if lval; cg-cast bool if not already int-shaped; cg-if |
| c ? a : b | parse c → cg-load if lval; cg-ifelse-merge with each thunk parsing one arm and ending with rval!; result type is the first arm's type |
| a && b | parse a → cg-load if lval; cg-ifelse-merge with then-arm = parse b; rval!; cg-cast bool; cg-cast i32 and else-arm = cg-push-imm i32 0 |
| a \|\| b | mirror of &&: then-arm = cg-push-imm i32 1; else-arm = parse b + bool/i32 cast |
| a, b | parse a (its rval is on top) → cg-pop; parse b → cg-load if lval; the comma's value is b |
| sizeof e | parse e (don't suppress emission); peek (opnd-type (cg-top …))'s ctype-size; cg-pop; cg-push-imm u64 size |
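
The ordering rules for two of the rows (e1 + e2 and lhs += rhs) can be checked against a recording stand-in for cg. This Python sketch is a test double for illustration, not the parser: TraceCg, parse_expr, emit_add, and emit_compound_assign are hypothetical names, and sub-expression parsing is reduced to a logged "parse" event.

```python
# Illustrative trace of the §4.1 call order; records calls instead of emitting code.
class TraceCg:
    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Any cg.foo_bar(...) call is logged as ("cg-foo-bar", ...).
        def record(*args):
            self.calls.append(("cg-" + name.replace("_", "-"),) + args)
        return record


def parse_expr(cg, expr):
    """Stand-in for parsing a sub-expression: log it, then load if it is an lval."""
    cg.calls.append(("parse", expr["text"]))
    if expr["lval"]:
        cg.load()


def emit_add(cg, e1, e2):
    for e in (e1, e2):
        parse_expr(cg, e)        # (a)/(c): parse, cg-load if lval
        if e["below_int"]:
            cg.promote()         # (b): cg-promote when rank < int
    cg.arith_conv()              # (d)
    cg.binop("add")              # (e)


def emit_compound_assign(cg, lhs, rhs, op):
    cg.calls.append(("parse", lhs["text"]))   # lhs must be an lval; no load yet
    cg.dup()                     # keep one copy of the lval for the store
    cg.load()                    # consumes the other copy
    parse_expr(cg, rhs)
    cg.arith_conv()
    cg.binop(op)
    cg.assign()                  # cg casts to the lhs type internally
```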

The parser is responsible for the standard:

4.2 cg's responsibilities

cg trusts the operand types it is handed.

cg never:

This split keeps cg under ~600 LOC by pushing all "C language" knowledge into parse.

5. Symbol-to-label mangling

Three label namespaces in the emitted P1pp: user globals (§5.1), the string pool (§5.2), and function-internal labels (§5.3).

5.1 User globals (functions, variables)

C identifier "foo"  →  P1pp label  :cc__foo

Verbatim concatenation. C identifiers can't contain : or other P1pp-special characters, so no escaping is needed. The cc__ prefix guarantees no collision with libp1pp internals (libp1pp__*), backend stubs (_start, p1_main, sys_*), or our own runtime support.

static storage at file scope changes nothing about the label — since we have one TU, internal-linkage and external-linkage symbols share the same namespace. static only suppresses any future "export to other TUs" emission, which we don't do anyway.

5.2 String pool

n-th distinct string literal  →  :cc__str_<n>

<n> is a fresh decimal counter starting at 0, advanced only on non-deduplicating insert. Identical string literals share a label (idempotent intern).
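
The two mangling rules so far amount to a prefix and an interning counter. A minimal Python sketch, with the hypothetical names global_label and StringPool, and with string literals represented as bytes by assumption:

```python
# Illustrative model of §5.1 and §5.2 label generation.
def global_label(c_identifier):
    """Verbatim concatenation: foo -> cc__foo."""
    return "cc__" + c_identifier


class StringPool:
    def __init__(self):
        self.labels = {}

    def intern(self, literal):
        """Return the label for literal; the counter advances only on a new insert."""
        label = self.labels.get(literal)
        if label is None:
            label = "cc__str_{}".format(len(self.labels))
            self.labels[literal] = label
        return label
```

Interning the same literal twice returns the same label and leaves the counter untouched, so the next new literal still gets the next consecutive number.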

5.3 Function-internal labels

Inside %fn(...), libp1pp's %scope mechanism prefixes short labels (::ret, ::lbl_42) to <fnname>__ret, <fnname>__lbl_42 at M1pp time. cg uses short labels exclusively inside fn-buf:

Loop tags (for libp1pp's %loop_tag, %break(tag), %continue(tag)) are not labels — they're macro-name fragments. cg generates them as L<n> (no cc__ prefix; tag namespace is per-function and %fn already scopes them).
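
The scoping rule can be pictured with the following Python sketch. It models what %scope does to a short label at M1pp time; scoped_label is a hypothetical name, and using the mangled function name (for example cc__main) as the prefix is an assumption based on %fn receiving the mangled name in §3.2.

```python
# Illustrative model of §5.3: ::name inside %fn(<fnname>, ...) -> <fnname>__name.
def scoped_label(fn_label, short):
    if not short.startswith("::"):
        raise ValueError("not a short label: " + short)
    return fn_label + "__" + short[2:]
```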

5.4 Entry stub

cg emits a small entry stub at cg-finish time:

%fn(p1_main, %p1_main_f.SIZE, {
  ; argc = a0, argv = a1 already
  %la_br(&cc__main)
  %call
  %eret
})

So int main(int argc, char **argv) is reached from the P1 program-entry contract.

If the user's main has no parameters, the stub still passes argc/argv — main just ignores them, which is harmless. If the user defines main with a different return type, that's a CC.md violation; cg can either die or emit and let the cast happen at the return site. Recommend: parser checks at cg-fn-end time when fn-name == main.

6. Phase-1 milestone

int main(int argc, char **argv) {
    return argc;
}

This is the integration target every engineer aims at. It goes through:

Acceptance test: tests/cc/00-return-argc.c exists, the make target builds it, and:

$ ./tests/cc/build/00-return-argc            ; echo $?  → 1
$ ./tests/cc/build/00-return-argc a b        ; echo $?  → 3

When this test passes on aarch64, amd64, and riscv64, phase 1 is complete.

Change protocol

Anyone proposing a contract change:

  1. PR amends this doc first, with rationale.
  2. Affected modules + tests are listed.
  3. Changes land in one PR (doc + all affected code) so no engineer pulls a half-migrated tree.