lispcc contracts
Frozen interfaces between modules. Engineers must not diverge from these without proposing a change. This document is the source of truth for the symbol alphabets, test formats, ABI, and phase-1 milestone referenced from CC-INTERNALS.md.
1. Symbol alphabets
Every record's kind-style fields use these exact symbols. Adding a
symbol = updating this section first.
1.1 tok-kind
IDENT KW INT STR CHAR PUNCT HASH NL EOF
Uppercase to distinguish from value-level symbols. IDENT carries an
unrecognized identifier; KW is one of the symbols in §1.3.
1.2 PUNCT value symbols
The lexer produces tok-value symbols for punctuators per the
following table. Names are mandatory — no engineer may use the raw
'+, '*, etc. as symbols, because several C punctuator characters
(%, |, ,, ;, (, ), {, }, [, ], ., #) cannot
form valid scheme1 symbols. We use named symbols for all punctuators
to keep the scheme uniform.
| C | Symbol | C | Symbol |
|---|---|---|---|
| [ | lbrack | == | eq2 |
| ] | rbrack | != | ne |
| ( | lparen | < | lt |
| ) | rparen | > | gt |
| { | lbrace | <= | le |
| } | rbrace | >= | ge |
| . | dot | << | shl |
| -> | arrow | >> | shr |
| , | comma | && | land |
| ; | semi | \|\| | lor |
| : | colon | & | amp |
| ? | qmark | \| | bar |
| ... | ellipsis | ^ | caret |
| ++ | inc | ~ | tilde |
| -- | dec | ! | bang |
| + | plus | = | assign |
| - | minus | += | plus-eq |
| * | star | -= | minus-eq |
| / | slash | *= | star-eq |
| % | pct | /= | slash-eq |
| # | hash | %= | pct-eq |
| ## | paste | <<= | shl-eq |
| >>= | shr-eq | | |
| &= | amp-eq | | |
| ^= | caret-eq | | |
| \|= | bar-eq | | |
Digraphs (<: :> <% %> %: %:%:) lex as their standard
equivalents ([ ] { } # ##): each yields the same symbol as the
punctuator it stands for, e.g. <: lexes as lbrack.
1.3 KW value symbols
;; storage
auto register static extern typedef
;; qualifiers (parsed and discarded)
const volatile restrict inline
;; type specifiers
void char short int long signed unsigned _Bool
;; rejected type specifiers (lexed as KW so we get clean diagnostics)
float double
;; aggregates
struct union enum
;; statements
if else while do for switch case default break continue return goto
;; operators
sizeof
;; reserved-and-rejected (lexed as KW so we error crisply)
_Generic _Atomic _Thread_local _Alignof _Alignas _Static_assert
_Complex _Imaginary
Anything matching the C identifier grammar that is not in this
list lexes as IDENT.
1.4 ctype-kind
void i8 u8 i16 u16 i32 u32 i64 u64 bool
ptr arr fn struct union enum
Char is i8 (signed char) or u8 (unsigned char); char itself
is i8 (we treat plain char as signed, matching MesCC and most
compilers). Long and long long collapse to i64/u64 on P1-64.
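The collapse rules can be pictured as a small lookup. This Python sketch is illustrative only: `BASE_BITS` and `ctype_kind` are hypothetical names, and the widths for short and int are inferred from the kind names rather than stated in this section.

```python
# Assumed bit widths per base type. Only the char and long / long long rows
# are stated in the text; short = 16 and int = 32 are assumptions.
BASE_BITS = {"char": 8, "short": 16, "int": 32, "long": 64, "long long": 64}


def ctype_kind(base, unsigned=False):
    """Map (base type, signedness) to one of the scalar ctype-kind symbols."""
    if base == "_Bool":
        return "bool"
    if base == "void":
        return "void"
    return ("u" if unsigned else "i") + str(BASE_BITS[base])
```

Plain `char` maps to `i8` because the `unsigned` flag defaults to false, which mirrors the signed-char choice described above.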
1.5 opnd-kind
imm frame global reg
reg is transient — used for the result of a call before it spills
to a frame slot. The vstack itself never holds reg opnds; cg
materializes through reg only inside a single emission step.
1.6 macro-kind
obj fn fn-vararg
1.7 sym-kind
var fn typedef enum-const param label
1.8 sym-storage
auto static extern register
#f for typedef, enum-const, and label symbols (storage
class doesn't apply).
1.9 loop-ctx-kind
while do for switch
1.10 reg opnd register names
a0 a1 a2 a3
Only argument registers. Saved registers (s0..s3) and temporaries
(t0..t2) are cg-private; never exposed as opnd payload.
1.11 cg-binop and cg-unop operator symbols
;; cg-binop:
add sub mul div rem
and or xor shl shr
eq ne lt le gt ge
;; cg-unop:
neg ;; arithmetic negate
bnot ;; bitwise complement (~)
lnot ;; logical not (!)
These are abstract operations independent of source-level PUNCT
symbols; the parser maps PUNCT → cg-op (e.g., 'plus → 'add,
'eq2 → 'eq).
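A possible shape for the parser-side mapping, sketched in Python. Only `plus → add` and `eq2 → eq` are given in the text; every other pairing below is the natural guess and should be read as an assumption, as are the names `BINOP`, `UNOP`, and `compound_assign_op`.

```python
# Binary PUNCT symbol -> cg-binop symbol (assumed pairing, see lead-in).
BINOP = {
    "plus": "add", "minus": "sub", "star": "mul", "slash": "div", "pct": "rem",
    "amp": "and", "bar": "or", "caret": "xor", "shl": "shl", "shr": "shr",
    "eq2": "eq", "ne": "ne", "lt": "lt", "le": "le", "gt": "gt", "ge": "ge",
}

# Unary PUNCT symbol -> cg-unop symbol (assumed pairing).
UNOP = {"minus": "neg", "tilde": "bnot", "bang": "lnot"}


def compound_assign_op(punct):
    """Map a compound-assignment symbol such as 'plus-eq' to its cg-binop."""
    if not punct.endswith("-eq"):
        raise ValueError("not a compound assignment: " + punct)
    return BINOP[punct[: -len("-eq")]]
```

Keeping the two alphabets separate means a source-level rename (say, `eq2`) never touches cg.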
2. Test serialization formats
Two test shapes coexist:
- Pure-transformation suites (cc-lex, cc-pp) byte-diff a Scheme-readable serialization (§2.1).
- Codegen / language suites (cc-cg, cc) compile-and-run the emitted program and assert runtime behavior (§2.2). No P1pp-text goldens.
2.1 Token line format
One token per line, as a Scheme list:
(KIND VALUE FILE LINE COL)
- KIND: bare symbol from §1.1.
- VALUE rendering depends on KIND:
  - IDENT, STR: bytevector literal "..." with \n \t \r \\ \" escapes; non-ASCII bytes as \xNN.
  - INT, CHAR: decimal integer.
  - KW, PUNCT: bare symbol from §1.3 (KW) or §1.2 (PUNCT).
  - HASH, NL, EOF: #f.
- FILE: bytevector literal.
- LINE, COL: decimal integers (1-based).
The tok-hide field is not serialized — it is implementation
detail of the preprocessor.
Example for int main() { return 0; } in t.c:
(KW int "t.c" 1 1)
(IDENT "main" "t.c" 1 5)
(PUNCT lparen "t.c" 1 9)
(PUNCT rparen "t.c" 1 10)
(PUNCT lbrace "t.c" 1 12)
(KW return "t.c" 1 14)
(INT 0 "t.c" 1 21)
(PUNCT semi "t.c" 1 22)
(PUNCT rbrace "t.c" 1 24)
(EOF #f "t.c" 1 25)
Trailing whitespace and ;-comments in the golden file are ignored.
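To make the rendering rules concrete, here is a small Python sketch of a serializer for this line format. `render_bv` and `render_token` are hypothetical helper names; the uppercase hex digits in `\xNN` and the treatment of ASCII control bytes are assumptions, since the format only specifies non-ASCII bytes.

```python
def render_bv(data: bytes) -> str:
    """Render bytes as a bytevector literal with the escapes listed above."""
    named = {0x0A: "\\n", 0x09: "\\t", 0x0D: "\\r", 0x5C: "\\\\", 0x22: '\\"'}
    out = []
    for b in data:
        if b in named:
            out.append(named[b])
        elif 0x20 <= b < 0x7F:
            out.append(chr(b))
        else:
            # Non-ASCII bytes use \xNN; other control bytes are rendered the
            # same way here, which is an assumption.
            out.append("\\x%02X" % b)
    return '"' + "".join(out) + '"'


def render_token(kind, value, file, line, col):
    """Serialize one token as a (KIND VALUE FILE LINE COL) line."""
    if kind in ("IDENT", "STR"):
        v = render_bv(value)
    elif kind in ("INT", "CHAR"):
        v = str(value)
    elif kind in ("KW", "PUNCT"):
        v = value
    else:  # HASH, NL, EOF carry no value
        v = "#f"
    return "(%s %s %s %d %d)" % (kind, v, render_bv(file), line, col)
```

Calling `render_token("IDENT", b"main", b"t.c", 1, 5)` reproduces the second line of the example above.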
2.2 Runtime fixture format
Every cc-cg / cc fixture compiles to a runnable program whose runtime behavior is the assertion. Two sibling files describe the expectation:
<name>.expected-exit # decimal integer; default 0 if absent
<name>.expected # exact stdout match; default empty if absent
Stdout and stderr are merged in the runner. The harness exits non-zero if either expectation fails.
Fixture-source conventions:
- cc-cg (<name>.scm): drives the cg API directly, ending in (write-bv-fd 1 (cg-finish cg)). The fixture must define every symbol it references; cg-calls into externs are out of scope for this suite (no libc linkage).
- cc (<name>.c): a complete C translation unit including a main that returns the asserted value. The full compiler driver (cc/cc.scm: lex+pp+parse+cg) runs against it. Holds both feature-targeted drills and full-envelope scenarios; once multi-TU / libc linkage land, those fixtures live here too.
Negative tests (compiler is supposed to die) set expected-exit to a
non-zero value and may rely on the diagnostic-prefix check in §2.3
rather than asserting exact stdout.
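The positive-path check can be sketched in a few lines of Python. This is not the project's harness; `read_or_default` and `check_fixture` are hypothetical names, and the sketch always compares stdout exactly, so it does not model the relaxed matching that negative tests may use.

```python
import os
import subprocess


def read_or_default(path, default):
    """Return the file's bytes, or the default when the file is absent."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return f.read()


def check_fixture(argv, base):
    """Run argv and compare against <base>.expected-exit and <base>.expected."""
    want_exit = int(read_or_default(base + ".expected-exit", b"0").strip())
    want_out = read_or_default(base + ".expected", b"")
    # stderr is folded into stdout, as the runner description states.
    proc = subprocess.run(argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    return proc.returncode == want_exit and proc.stdout == want_out
```

A fixture with neither sibling file therefore passes only if the program exits 0 and prints nothing.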
2.3 Diagnostic format
Already canonical from CC-INTERNALS:
<file>:<line>:<col>: error: <msg>: <irritants...>
Tests for failure paths verify:
- exit status is 1
- stderr contains the expected <file>:<line>:<col>: error: prefix
The <msg> body is not matched character-for-character (so we
can refine wording without breaking tests); tests match only the
prefix and a keyword substring of the engineer's choice.
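A failure-path assertion along these lines could look like the following Python sketch. `DIAG_PREFIX` and `check_diagnostic` are hypothetical names, the regular expression assumes file names contain no colon, and any message text used with it is made up for illustration.

```python
import re

# <file>:<line>:<col>: error: anchored at the start of a line.
DIAG_PREFIX = re.compile(
    r"^(?P<file>[^:\n]+):(?P<line>\d+):(?P<col>\d+): error: ", re.M
)


def check_diagnostic(exit_status, stderr_text, file, line, col, keyword):
    """True when exit status is 1 and a matching diagnostic mentions keyword."""
    if exit_status != 1:
        return False
    for m in DIAG_PREFIX.finditer(stderr_text):
        where = (m.group("file"), int(m.group("line")), int(m.group("col")))
        # Only the rest of the same line is searched for the keyword.
        rest = stderr_text[m.end():].split("\n", 1)[0]
        if where == (file, line, col) and keyword in rest:
            return True
    return False
```

Because only the prefix and one substring are checked, rewording the message leaves the test green as long as the chosen keyword survives.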
3. Frame layout / parameter ABI
3.1 cg-fn-begin contract
(cg-fn-begin cg name params return-type) -> param-syms
;; name: bv (un-mangled C identifier)
;; params: list of (name-bv . ctype)
;; return-type: ctype
;; param-syms: alist (name-bv . sym), each sym already bound to a frame slot
Inside cg-fn-begin, cg:
- Allocates one frame slot per parameter via cg-alloc-slot. Slot width = ctype-size rounded up to 8 (align-up); align = 8. (Yes, every param costs at least 8 bytes. P1-64 frame is word-stride; we don't pack.)
- Begins emitting into cg-fn-buf. Does not yet emit the prologue; that's deferred to cg-fn-end once frame-hi is final.
- Emits the param-spill code into a "prologue prefix" buffer (private to cg): for params 0..3, ST aN, [sp + slotN]; for params 4+, LDARG t0, K then ST t0, [sp + slotK].
- Returns the param-sym alist. Parser binds these into the function-body scope.
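The slot and spill rules can be traced with a short Python sketch. `align_up` and `param_spills` are hypothetical names. The sketch assumes scalar parameters of at most 8 bytes, treats the LDARG operand K as the parameter index, and starts slots at a caller-chosen offset; none of these details is fixed by the text above.

```python
def align_up(n, a):
    """Round n up to the next multiple of a (a must be a power of two)."""
    return (n + a - 1) & ~(a - 1)


def param_spills(param_sizes, first_offset=0):
    """Return (slot offsets, spill lines) for parameters of the given sizes."""
    offsets, lines, hi = [], [], first_offset
    for n, size in enumerate(param_sizes):
        hi = align_up(hi, 8)              # align = 8 for every param slot
        offsets.append(hi)
        if n < 4:
            # Register-passed parameter: store aN straight to its slot.
            lines.append("ST a%d, [sp + %d]" % (n, hi))
        else:
            # Assumption: K in "LDARG t0, K" is the parameter index.
            lines.append("LDARG t0, %d" % n)
            lines.append("ST t0, [sp + %d]" % hi)
        hi += align_up(size, 8)           # slot width = size rounded up to 8
    return offsets, lines
```

For `int main(int argc, char **argv)` the sizes are 4 and 8, giving slots at 0 and 8 and two `ST` lines for a0 and a1.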
3.2 cg-fn-end contract
(cg-fn-end cg)
cg:
- Reads final frame-hi (highest byte allocated).
- Emits the per-function preamble (an M1pp %struct is not used; slots are numeric byte offsets, baked into the body text already buffered in fn-buf).
- Wraps the prologue-prefix + fn-buf inside a libp1pp %fn macro:

      %fn(<mangled-name>, <frame-hi-aligned-up-to-16>, {
        <prologue-prefix bytes>
        <fn-buf bytes>
        ::ret
        LD a0, [sp + <return-slot>]
        ; LD a1, [sp + <return-slot> + 8]   when ret-type size > 8
      })

- Flushes the result into cg-text, clears fn-buf and the prologue-prefix buffer, resets vstack, frame-hi, and the function-local label counter.
The frame size is rounded up to 16 to satisfy the P1 stack-align contract.
The return slot itself is sized to max(8, ctype-size(ret-type))
rounded up to 8, 8-byte aligned. Slot width is what dispatches the
load epilogue across the three result conventions defined in
P1.md §Arguments and return values:
| Width | Convention | Epilogue |
|---|---|---|
| ≤ 8B | one-word direct | LD a0, [sp + slot] |
| 9–16B | two-word direct | LD a0, [sp + slot]; LD a1, [sp + slot + 8] |
| > 16B | indirect-result | (Stream A2) caller passes buffer in a0; callee writes through it; epilogue restores a0 |
The cg lowers all three conventions; the parser surface (return statement, struct-typed call result) is identical regardless of which convention applies.
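The sizing arithmetic and the width-based dispatch can be written out as a short Python sketch. The function names are hypothetical, and the indirect-result branch returns no epilogue lines because its exact text is not given here.

```python
def align_up(n, a):
    """Round n up to the next multiple of a."""
    return (n + a - 1) // a * a


def return_slot_width(ret_size):
    """max(8, size) rounded up to 8, per the return-slot rule."""
    return align_up(max(8, ret_size), 8)


def epilogue(ret_size, slot):
    """Pick the result convention from the slot width; list the epilogue loads."""
    width = return_slot_width(ret_size)
    if width <= 8:
        return "one-word direct", ["LD a0, [sp + %d]" % slot]
    if width <= 16:
        return "two-word direct", [
            "LD a0, [sp + %d]" % slot,
            "LD a1, [sp + %d]" % (slot + 8),
        ]
    # Indirect results go through a caller-provided buffer; the epilogue
    # text for that case is not specified in this section, so none is modeled.
    return "indirect-result", []


def frame_size(frame_hi):
    """Frame size handed to %fn: frame-hi rounded up to 16."""
    return align_up(frame_hi, 16)
```

A 12-byte struct result with its slot at offset 24 gets the two loads at `[sp + 24]` and `[sp + 32]`, and a frame-hi of 40 becomes a 48-byte frame.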
3.3 Loop tag protocol
(cg-loop cg head-thunk body-thunk) -> tag
cg-loop allocates a fresh per-function tag (L0, L1, …), emits
the libp1pp %loop_tag(<tag>, { … }) wrapper, runs head-thunk
inside the loop head (it is expected to leave the condition opnd on
the vstack), pops the condition and emits an %if_eqz(t0, %break(tag)),
then invokes body-thunk with the tag as its single argument:
(cg-loop cg
  (lambda () (cg-push-imm cg %t-i32 1))
  (lambda (tag)
    (cg-continue cg tag)
    (cg-break cg tag)))
The parser uses the same tag for any cg-break / cg-continue calls
made during body emission. cg-loop returns the tag to its caller as
well, so post-loop teardown code may reference it; cg-loop-end is a
no-op kept for symmetry.
Switch dispatch follows the same pattern: cg-switch-begin returns a
swctx whose swctx-end-tag accessor exposes cg's break-target tag
to the parser.
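The call order of the protocol can be modeled with a toy recorder in Python. `MockCg` and `cg_loop` are hypothetical stand-ins, not the scheme1 API; the sketch records macro text only and skips how the condition is materialized into t0.

```python
class MockCg:
    """Toy recorder standing in for cg; only what the loop protocol needs."""

    def __init__(self):
        self.out, self.vstack, self.next_tag = [], [], 0

    def push_imm(self, ctype, value):
        self.vstack.append((ctype, value))

    def brk(self, tag):
        self.out.append("%break(" + tag + ")")

    def cont(self, tag):
        self.out.append("%continue(" + tag + ")")


def cg_loop(cg, head_thunk, body_thunk):
    tag = "L%d" % cg.next_tag           # fresh per-function tag
    cg.next_tag += 1
    cg.out.append("%loop_tag(" + tag + ", {")
    head_thunk()                         # leaves the condition on the vstack
    cg.vstack.pop()                      # pop the condition
    cg.out.append("%if_eqz(t0, %break(" + tag + "))")
    body_thunk(tag)                      # body receives the same tag
    cg.out.append("})")
    return tag


cg = MockCg()
tag = cg_loop(cg, lambda: cg.push_imm("i32", 1),
              lambda t: (cg.cont(t), cg.brk(t)))
```

After this runs, `tag` is `"L0"` and `cg.out` holds the wrapper, the exit test, the two body macros, and the closing line, in that order.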
3.4 Outgoing-arg staging
When cg-call is asked to emit a call with arity > 4, it stages
args 4..(N-1) into the low-addressed prefix of the current frame
at [sp + 0*8], [sp + 1*8], etc., per LIBP1PP.md §Frame locals.
cg tracks the maximum staging count seen across the function and
reserves that prefix at fn-end before any other slots — i.e.,
cg-alloc-slot's first allocation comes after the staging area.
The accounting is internal to cg. Parse never sees staging slots.
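The bookkeeping amounts to a running maximum, as this Python sketch shows. `Staging` and its methods are hypothetical names used only for illustration.

```python
class Staging:
    """Tracks the outgoing-arg staging prefix for one function."""

    def __init__(self):
        self.max_staged = 0

    def stage_call(self, arity):
        """Return {arg index: sp offset} for args 4..arity-1 of one call."""
        staged = max(0, arity - 4)
        self.max_staged = max(self.max_staged, staged)
        return {4 + k: k * 8 for k in range(staged)}

    def reserved_bytes(self):
        """Size of the low-addressed prefix reserved at fn-end."""
        return self.max_staged * 8
```

A function that makes one 6-argument call and one 5-argument call reserves 16 bytes: the larger call needs two staged words, and the smaller one reuses the first of them.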
3.5 cg-alloc-slot contract
(cg-alloc-slot cg bytes align) -> offset
- bytes = total size needed (e.g., 4 for int, 40 for int[10], sizeof(struct foo) for a struct).
- align = required alignment (1, 2, 4, or 8).
- Returns numeric byte offset relative to sp post-%enter.
cg first aligns frame-hi up to align, returns that as the
offset, then bumps frame-hi by bytes. Slots are not reused
across scopes (we're optimizing for compiler simplicity, not frame
size). Local arrays and structs request their full size in one call.
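The align-then-bump behavior is easy to see in a Python sketch. `Frame` is a hypothetical stand-in for the relevant cg state, and it starts frame-hi at 0, ignoring the staging prefix from §3.4.

```python
class Frame:
    """Bump allocator mirroring the cg-alloc-slot description."""

    def __init__(self):
        self.frame_hi = 0

    def alloc_slot(self, nbytes, align):
        if align not in (1, 2, 4, 8):
            raise ValueError("unsupported alignment: %r" % align)
        # Align frame-hi up, hand that out, then bump past the slot.
        offset = (self.frame_hi + align - 1) // align * align
        self.frame_hi = offset + nbytes
        return offset
```

Allocating an `int` (4, 4), a pointer (8, 8), a `char` (1, 1), and an `int[10]` (40, 4) in that order yields offsets 0, 8, 16, and 20, with frame-hi ending at 60. The padding between slots is never reclaimed, which matches the no-reuse policy.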
4. Conversion responsibility
The parser drives type semantics; cg is type-aware only enough to choose signed-vs-unsigned variants and to scale pointer arithmetic.
4.1 Parser's responsibilities
The parser must call cg in this order around each operation:
| Source | Required parser actions before the operation |
|---|---|
| e1 + e2 (and other arith binops) | (a) parse e1 → if lval, cg-load; (b) cg-promote if rank < int; (c) parse e2 same way; (d) cg-arith-conv to bring both to common type; (e) cg-binop add |
| *p | parse p → if lval, cg-load; then cg-push-deref |
| &x | parse x → must be lval; then cg-take-addr |
| (T)e | parse e → if lval, cg-load (unless casting to a pointer); then cg-cast T |
| f(a, b, ...) | parse f → if lval and f not a function-typed identifier, cg-load; parse each arg → cg-load if lval, then cg-cast to param type (or default-promote for variadic args); then cg-call |
| lhs = rhs | parse lhs → must be lval (no load); parse rhs → cg-load if lval; cg-assign (cg internally casts rhs to lhs type; parse cannot peek beneath vstack top) |
| lhs += rhs (and other compound assigns) | parse lhs (lval) → cg-dup to preserve the lval across the read; cg-load (consumes one copy); parse rhs → cg-load if lval; cg-arith-conv; cg-binop <op>; cg-assign (cg casts internally) |
| ++lhs / --lhs | parse lhs (lval) → cg-dup; cg-load; cg-push-imm 1; cg-binop add/sub; cg-assign |
| lhs++ / lhs-- | parse lhs (lval) → cg-postinc / cg-postdec (atomic primitive: dups+loads to capture old rval, then dup+load++1+assign for the store, finally pushes the saved old rval) |
| return e | parse e → cg-load if lval; cg-cast to fn return type; cg-return |
| if (e) ... | parse e → cg-load if lval; cg-cast bool if not already int-shaped; cg-if |
| c ? a : b | parse c → cg-load if lval; cg-ifelse-merge with each thunk parsing one arm and ending with rval!; result type is the first arm's type |
| a && b | parse a → cg-load if lval; cg-ifelse-merge with then-arm = parse b; rval!; cg-cast bool; cg-cast i32 and else-arm = cg-push-imm i32 0 |
| a \|\| b | mirror of &&: then-arm = cg-push-imm i32 1; else-arm = parse b + bool/i32 cast |
| a, b | parse a (its rval is on top) → cg-pop; parse b → cg-load if lval; the comma's value is b |
| sizeof e | parse e (don't suppress emission); peek (opnd-type (cg-top …))'s ctype-size; cg-pop; cg-push-imm u64 size |
The parser is responsible for the standard:
- Integer promotion: any operand of type rank below int is promoted to int (or unsigned int if it can't fit) before use in arithmetic, before assignment to a wider lhs in mixed contexts, and before being passed as a variadic argument.
- Usual arithmetic conversions: applied to both operands of a binary arithmetic operator after promotion. The result type is the common type.
- Pointer-int interaction: detected by parser; cg-binop add on (ptr, int) handles scaling internally (see §4.2).
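The two conversion rules can be expressed over the integer ctype-kinds in a few lines of Python. `BITS`, `promote`, and `arith_conv` are hypothetical names; the sketch follows the standard C rules for the kinds listed in §1.4 and is not taken from the parser source.

```python
BITS = {"bool": 1, "i8": 8, "u8": 8, "i16": 16, "u16": 16,
        "i32": 32, "u32": 32, "i64": 64, "u64": 64}


def promote(kind):
    """Integer promotion: anything narrower than int becomes i32."""
    # Every 8/16-bit value (and bool) fits in i32, so the
    # "unsigned int if it can't fit" branch never fires for these kinds.
    return "i32" if BITS[kind] < 32 else kind


def arith_conv(a, b):
    """Usual arithmetic conversions on two integer kinds; returns the common kind."""
    a, b = promote(a), promote(b)
    if a == b:
        return a
    wide, narrow = (a, b) if BITS[a] >= BITS[b] else (b, a)
    if BITS[wide] > BITS[narrow]:
        # The wider kind either is unsigned or can represent every value
        # of the narrower one, so it wins in both cases.
        return wide
    # Same width, different signedness: unsigned wins.
    return "u" + str(BITS[wide])
```

For instance, `arith_conv("i32", "u32")` gives `"u32"`, while `arith_conv("i64", "u32")` gives `"i64"` because a 64-bit signed value can hold every `u32`.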
4.2 cg's responsibilities
cg trusts the operand types it is handed.
- cg-load: pop lval, emit one load (of the right width based on ctype-size), push rval of the same type.
- cg-cast to-type: pop, emit sign-extend / zero-extend / truncate as needed based on source vs. target sizes and signedness. For to-type = bool: emit (BNEZ -> 1, fallthrough -> 0) shape. Pointer ↔ integer casts are bit-for-bit on P1-64 (no emission).
- cg-binop add (and sub): if exactly one operand is a ptr, scale the int operand by ctype-size of the pointee before adding. If both ptr → only sub is valid (yields i64 byte difference, divided by element size); other binops on (ptr, ptr) abort via die.
- cg-binop for divisions and comparisons: dispatch to signed (DIV/BLT) or unsigned (DIV+sign-flip / BLTU) variant based on the operand kinds (i* → signed, u* → unsigned). After cg-arith-conv, both operands have the same kind, so dispatch is unambiguous.
cg never:
- auto-loads an lvalue
- auto-promotes
- auto-converts arguments
- looks at fn-ctx return type (parser passes the cast)
This split keeps cg under ~600 LOC by pushing all "C language" knowledge into parse.
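The arithmetic cg is expected to produce for pointer operands, and the signedness dispatch, can be checked with a small Python model. The function names are hypothetical, addresses are modeled as 64-bit unsigned integers, and nothing here reflects how cg actually emits instructions.

```python
MASK64 = (1 << 64) - 1


def ptr_add(ptr, n, elem_size):
    """(ptr, int) add: scale the integer by the pointee size first."""
    return (ptr + n * elem_size) & MASK64


def ptr_diff(p, q, elem_size):
    """(ptr, ptr) sub: byte difference divided by the element size."""
    # Valid pointer differences are exact multiples of elem_size, so
    # floor division and C-style truncation agree here.
    return (p - q) // elem_size


def lt_variant(kind):
    """Pick the less-than branch from the common operand kind."""
    if kind.startswith("i"):
        return "BLT"
    if kind.startswith("u"):
        return "BLTU"
    raise ValueError("no signedness rule stated for " + kind)
```

So `p + 3` on an `int *` at `0x1000` lands at `0x100C`, and the difference of two `int *` values 16 bytes apart is 4.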
5. Symbol-to-label mangling
Three label namespaces in the emitted P1pp:
5.1 User globals (functions, variables)
C identifier "foo" → P1pp label :cc__foo
Verbatim concatenation. C identifiers can't contain : or other
P1pp-special characters, so no escaping is needed. The cc__ prefix
guarantees no collision with libp1pp internals (libp1pp__*),
backend stubs (_start, p1_main, sys_*), or our own runtime
support.
static storage at file scope changes nothing about the label —
since we have one TU, internal-linkage and external-linkage symbols
share the same namespace. static only suppresses any future
"export to other TUs" emission, which we don't do anyway.
5.2 String pool
n-th distinct string literal → :cc__str_<n>
<n> is a fresh decimal counter starting at 0, advanced only on
non-deduplicating insert. Identical string literals share a label
(idempotent intern).
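Both naming rules fit in a few lines of Python. `StringPool` and `mangle_global` are hypothetical names; the sketch only mirrors the labeling behavior described in §5.1 and §5.2.

```python
class StringPool:
    """Interns string literals; identical bytes share one label."""

    def __init__(self):
        self.labels = {}

    def intern(self, data: bytes) -> str:
        label = self.labels.get(data)
        if label is None:
            # The counter advances only when a new literal is inserted.
            label = ":cc__str_%d" % len(self.labels)
            self.labels[data] = label
        return label


def mangle_global(name: str) -> str:
    """C identifier -> P1pp label, by verbatim concatenation."""
    return ":cc__" + name
```

Interning `b"hi"`, then `b"yo"`, then `b"hi"` again returns `:cc__str_0`, `:cc__str_1`, and `:cc__str_0`.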
5.3 Function-internal labels
Inside %fn(...), libp1pp's %scope mechanism prefixes short
labels (::ret, ::lbl_42) to <fnname>__ret, <fnname>__lbl_42
at M1pp time. cg uses short labels exclusively inside fn-buf:
- ::ret: single function exit
- ::lbl_<n>: anonymous control-flow targets (for switch cases, short-circuit eval, etc.)
- ::user_<name>: user-written goto labels (C myloop: → ::user_myloop). The user_ prefix prevents collisions with our lbl_ and ret labels.
Loop tags (for libp1pp's %loop_tag, %break(tag), %continue(tag))
are not labels — they're macro-name fragments. cg generates them as
L<n> (no cc__ prefix; tag namespace is per-function and %fn
already scopes them).
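A Python sketch of the short-label scheme and of the expansion attributed to %scope follows. `short_label` and `scoped` are hypothetical names, and which function name %scope actually sees (mangled or not) is not stated here, so the `fn_name` argument is left to the caller.

```python
def short_label(kind, payload=None):
    """Build the short label cg writes inside fn-buf."""
    if kind == "ret":
        return "::ret"
    if kind == "anon":
        return "::lbl_%d" % payload
    if kind == "user":
        return "::user_" + payload
    raise ValueError(kind)


def scoped(fn_name, label):
    """Expansion described for %scope: <fnname>__<short name>."""
    if not label.startswith("::"):
        raise ValueError("not a short label: " + label)
    return fn_name + "__" + label[2:]
```

A user label `myloop:` becomes `::user_myloop`, which cannot clash with `::ret` or any `::lbl_<n>` because of the distinct prefixes.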
5.4 Entry stub
cg emits a small entry stub at cg-finish time:
%fn(p1_main, %p1_main_f.SIZE, {
  ; argc = a0, argv = a1 already
  %la_br(&cc__main)
  %call
  %eret
})
So int main(int argc, char **argv) is reached from the P1
program-entry contract.
If the user's main has no parameters, the stub still passes
argc/argv — main just ignores them, which is harmless. If the user
defines main with a different return type, that's a CC.md
violation; cg can either die or emit and let the cast happen at the
return site. Recommend: parser checks at cg-fn-end time when
fn-name == main.
6. Phase-1 milestone
int main(int argc, char **argv) {
    return argc;
}
This is the integration target every engineer aims at. It goes through:
- lex: int, main, (, int, argc, ,, char, *, *, argv, ), {, return, argc, ;, }. All PUNCT and KW symbols touched are core; covers two of each kind.
- pp: zero directives, but full token-list traversal.
- parse: function definition, two-parameter list including char **, compound stmt, return stmt, identifier expression, lval→rval load.
- cg: cg-fn-begin with two params, parameter spilling (both register-passed, in a0 and a1), cg-push-sym, cg-load, cg-return, cg-fn-end, %fn wrapping, and the p1_main entry stub.
- e2e: link with arch backend + libp1pp; run as native ELF; exit code matches argc.
Acceptance test: tests/cc/00-return-argc.c exists, the make
target builds it, and:
$ ./tests/cc/build/00-return-argc ; echo $? → 1
$ ./tests/cc/build/00-return-argc a b ; echo $? → 3
When this test passes on aarch64, amd64, and riscv64, phase 1 is complete.
Change protocol
Anyone proposing a contract change:
- PR amends this doc first, with rationale.
- Affected modules + tests are listed.
- Changes land in one PR (doc + all affected code) so no engineer pulls a half-migrated tree.