boot2

Playing with the boostrap
git clone https://git.ryansepassi.com/git/boot2.git
Log | Files | Refs | README

lispcc internals

Companion to CC.md. CC.md says what we accept; this doc says how the implementation is organized so engineers can split work and test independently.

The compiler lives in one file, cc/cc.scm, that bundles every module section in order; cc/main.scm is the one-liner entry point that fires cc-main:

build:  catm build/$ARCH/cc/cc.scm scheme1/prelude.scm cc/cc.scm cc/main.scm
run:    scheme1 build/$ARCH/cc/cc.scm input.flat.c output.P1pp

Sections inside cc.scm are delimited by their original ;; cc/<name>.scm — headers (util, data, lex, pp, cg, parse, main), so the layering below is still navigable by search.

Module DAG

util  ────────┐
              │
data  ────────┼─►  lex  ──►  pp  ──►  parse  ─►  main
              │                          │
              └──────────────────────►   cg  ──►  main

Cycles are forbidden. The parse section calls cg but never the reverse.

Feature workflow

When adding any new codegen-touching feature, always in this order:

  1. cc-cg fixture first. Write a tests/cc-cg/<n>-name.scm that drives the cg API directly to emit a self-contained program whose runtime exit code (and/or stdout) reflects the feature working. Add the .expected-exit (default 0) and optional .expected stdout file. At this point the test fails — that's the spec.
  2. Implement the cg primitive(s). Add or fix the API surface in the cg section of cc/cc.scm. Iterate until make test SUITE=cc-cg passes.
  3. cc fixture next. Write a tests/cc/<n>-name.c that exercises the same feature from the C source side, with a main that exits with a known value. Add .expected-exit / .expected. Test fails — that's the spec for the parser.
  4. Implement the parse changes. Wire C syntax through the parse section of cc/cc.scm using the cg primitives from step 2. Iterate until make test SUITE=cc passes.

This sequencing keeps the parser↔cg seam honest: cg is validated by runtime behavior on direct API calls before parse is even touched, so parse never papers over a cg bug. Steps 1+3 are the contract; steps 2+4 are the implementation.

Conventions

Errors

(die loc msg . irritants)        ; abort with formatted diagnostic
(loc-of-tok tok)         -> loc  ; pull file/line/col out of a tok
(loc file line col)      -> loc  ; construct manually (used by lexer)

Diagnostic format on stderr:

<file>:<line>:<col>: error: <msg>: <irritants display'd one-by-one>

irritants are written via display (no quoting). Pairs and integers print naturally; bytevectors print as their byte content.

die is the only failure path. There is no warning level — anything worth saying aborts.

util.scm

Bytevector and list helpers that scheme1/prelude.scm doesn't already provide.

;; bytevector
(bv= a b)              -> bool             ; alias for bytevector=?, terser
(bv-prefix? p s)       -> bool             ; does s start with p?
(bv-find bv byte from) -> idx-or-#f        ; first occurrence at >= from
(bv-slice bv start end) -> bv              ; fresh copy
(bv-of-string str)     -> bv               ; literal helper for inline ASCII
(bv-of-byte b)         -> bv               ; 1-byte bv
(bv-cat lst-of-bv)     -> bv               ; concat list, single allocation
(bv->fixnum bv radix)  -> (ok . val)       ; (ok . #f) on parse fail
(fixnum->bv n radix)   -> bv

;; lists / alists
(alist-ref key al)     -> val-or-#f
(alist-ref/eq key al)  -> val-or-#f        ; eq? compare (for symbol keys)
(alist-set key val al) -> al'              ; cons new pair on the front
(alist-update key f al) -> al'             ; functional update
(any p xs)             -> bool
(every p xs)           -> bool
(count p xs)           -> fixnum

;; ints
(min3 a b c)           -> fixnum
(align-up n k)         -> fixnum           ; round n up to multiple of k

;; output buffers (reversed list of bv chunks; flush builds in one pass)
(make-buf)             -> buf              ; record: { chunks }
(buf-push! buf bv)                          ; cons bv onto chunks
(buf-flush buf)        -> bv               ; reverse + bv-cat

;; diagnostics + I/O
(die loc msg . irritants) -> never returns
(slurp-fd fd)          -> bv                ; read until EOF
(write-bv-fd fd bv)                         ; full write or die

;; fresh names
(make-namer prefix)    -> proc              ; (proc) -> bv "prefix0", "prefix1", ...

make-buf and buf-push! are the universal output primitive — both cg's three output streams and pp's expansion staging buffer use them.

data.scm

All record types used by more than one module.

loc

(define-record-type loc
  (%loc file line col)
  loc?
  (file loc-file)        ; bv
  (line loc-line)        ; fixnum
  (col  loc-col))        ; fixnum

tok

(define-record-type tok
  (%tok kind value loc hide)
  tok?
  (kind  tok-kind)       ; symbol; see kinds table
  (value tok-value)      ; varies by kind
  (loc   tok-loc)        ; loc
  (hide  tok-hide))      ; list of bv (macro names already expanded)
kind value shape
IDENT bv (identifier name)
KW symbol (one of if while ... typedef)
INT fixnum (post-suffix integer value)
STR bv (raw bytes, no NUL terminator)
CHAR fixnum (value 0..255)
PUNCT symbol ('+ '== '-> '... '## '# …)
HASH #f (preprocessor only; line-leading #)
NL #f (significant only inside the preprocessor)
EOF #f

make-tok is the canonical constructor; pass hide = '() for freshly-lexed tokens.

macro

(define-record-type macro
  (%macro kind params body)
  macro?
  (kind   macro-kind)    ; 'obj | 'fn | 'fn-vararg
  (params macro-params)  ; list of bv
  (body   macro-body))   ; list of tok

ctype

(define-record-type ctype
  (%ctype kind size align ext)
  ctype?
  (kind  ctype-kind)     ; 'void 'i8 'i16 'i32 'i64 'u8 'u16 'u32 'u64
                         ; 'bool 'ptr 'arr 'fn 'struct 'union 'enum
  (size  ctype-size)     ; fixnum bytes; -1 = incomplete
  (align ctype-align)    ; fixnum bytes
  (ext   ctype-ext))     ; payload, kind-specific

ext payload by kind:

kind ext shape
void / int / bool #f
ptr pointee ctype
arr (elem-ctype . length-or--1)
fn (ret-ctype params variadic?)params is list of ctype
struct/union (tag-bv complete? fields)fields = ((name-bv ctype offset) ...)
enum (tag-bv ((const-name-bv . value) ...))

Primitive ctypes are interned at startup as top-level bindings: %t-void, %t-i8, %t-i32, %t-i64, %t-u8, %t-u32, %t-u64, %t-bool, %t-char-ptr. Equality of primitive ctypes is eq?. Derived types are not deduped.

sym

(define-record-type sym
  (%sym name kind storage type slot)
  sym?
  (name    sym-name)     ; bv
  (kind    sym-kind)     ; 'var 'fn 'typedef 'enum-const 'param 'label
  (storage sym-storage)  ; 'auto 'static 'extern 'register | #f for typedef/enum
  (type    sym-type)     ; ctype
  (slot    sym-slot))    ; locals/params: fixnum byte offset
                         ; globals/fn: bv emitted-label
                         ; enum-const: fixnum value
                         ; typedef: #f
                         ; label: bv P1pp local label

opnd

(define-record-type opnd
  (%opnd kind type ext lval?)
  opnd?
  (kind  opnd-kind)      ; 'imm 'frame 'global 'reg | sub-cases below
  (type  opnd-type)      ; ctype
  (ext   opnd-ext)       ; per kind
  (lval? opnd-lval?))    ; #t = the slot/place holds an *address*
                         ; #f = the slot/place holds a *value*

ext by kind:

kind ext
imm fixnum literal value (lval? always #f)
frame fixnum byte offset in current function frame
global bv label name (the symbol's emitted P1pp label)
reg symbol 'a0..'a3

reg is transient — used for the result of a call before it's spilled. The vstack itself holds only imm, frame, and global opnds.

pstate

(define-record-type pstate
  (%pstate toks scope tags loops fn-ctx typedefs cg)
  pstate?
  (toks    ps-toks    ps-toks-set!)         ; remaining tok list (head = lookahead)
  (scope   ps-scope   ps-scope-set!)        ; list of alists: (bv . sym)
  (tags    ps-tags    ps-tags-set!)         ; list of alists: (bv . ctype) for struct/union/enum
  (loops   ps-loops   ps-loops-set!)        ; list of loop-ctx (break/continue stack)
  (fn-ctx  ps-fn-ctx  ps-fn-ctx-set!)       ; current function record, or #f at file scope
  (typedefs ps-typedefs ps-typedefs-set!)   ; flat alist (bv . #t) — fast typedef-name lookup
  (cg      ps-cg))                          ; codegen state (not mutated through pstate)

typedefs is a separate flat alist (not derived from scope at lookup time) because the lexer-vs-parser distinction in C requires fast "is this identifier a typedef-name now" answers during declaration parsing. Updated in lockstep with scope.

loop-ctx

(define-record-type loop-ctx
  (%loop-ctx kind tag has-continue?)
  loop-ctx?
  (kind          loop-ctx-kind)             ; 'while 'do 'for 'switch
  (tag           loop-ctx-tag)              ; bv tag for libp1pp %_tag macros
  (has-continue? loop-ctx-has-continue?))   ; #f for switch

fn-ctx

(define-record-type fn-ctx
  (%fn-ctx name return-type params variadic? labels)
  fn-ctx?
  (name        fn-ctx-name)                 ; bv
  (return-type fn-ctx-return-type)          ; ctype
  (params      fn-ctx-params)               ; list of sym
  (variadic?   fn-ctx-variadic?)
  (labels      fn-ctx-labels                ; alist (user-bv . emitted-bv)
               fn-ctx-labels-set!))         ; mutated as `goto`/labels resolve

cg

(define-record-type cg
  (%cg text data bss vstack frame-hi label-ctr str-pool globals fn-buf)
  cg?
  (text      cg-text)                       ; buf — final text section (all functions)
  (data      cg-data)                       ; buf — initialized globals + string pool
  (bss       cg-bss)                        ; buf — zero-init globals
  (vstack    cg-vstack    cg-vstack-set!)   ; list of opnd
  (frame-hi  cg-frame-hi  cg-frame-hi-set!) ; fixnum: frame bytes used so far
  (label-ctr cg-label-ctr cg-label-ctr-set!)
  (str-pool  cg-str-pool  cg-str-pool-set!) ; alist (bv-content . bv-label)
  (globals   cg-globals   cg-globals-set!)  ; alist (bv-name . sym) — emitted globals
  (fn-buf    cg-fn-buf    cg-fn-buf-set!))  ; buf — body of current function

Functions are emitted into fn-buf, then on cg-fn-end flushed through libp1pp's %fn(...) wrapper into text. This lets us know the final frame-hi before writing the prologue.

lex.scm

Pure: bytestream + filename → token list. No file I/O, no macro awareness.

(lex-tokenize src file) -> (list-of tok)
;; src: bv (the C bytestream)
;; file: bv (filename, for tok-loc)
;; result: list ending in a single 'EOF tok; never #f
;; aborts via die on unrecognized byte sequences

Internal contract:

Helpers exposed for unit tests:

(lex-read-number src pos)       -> (tok . pos')
(lex-read-string src pos file)  -> (tok . pos')
(lex-read-ident  src pos)       -> (tok . pos')   ; produces IDENT or KW

Test plan

tests/cc-lex/ mirrors tests/scheme1/. Each test is one .c fragment plus an .expected-toks file containing the expected serialized form (one tok per line, e.g. KW int 1 1). Driver script:

scheme1 cc/lex-test.scm < input.c | diff - expected-toks

pp.scm

Pure: token list + initial macro alist → expanded token list.

(pp-expand toks initial-defines) -> (list-of tok)
;; toks: lex-tokenize output
;; initial-defines: alist (bv . macro) — from -D flags
;; result: token list with HASH and NL stripped, KW/IDENT/INT/STR/CHAR/PUNCT/EOF only
;; aborts via die on bad directive, undefined #if identifier-as-macro misuse, etc.

Internal structure:

Directive handlers (each takes the tokens up to the next NL):

(%pp-do-define line state)
(%pp-do-undef  line state)
(%pp-do-if     line state)
(%pp-do-ifdef  line state)
(%pp-do-ifndef line state)
(%pp-do-elif   line state)
(%pp-do-else   line state)
(%pp-do-endif  line state)
(%pp-do-error  line state)
(%pp-do-line   line state)
(%pp-do-pragma line state)
(%pp-do-include line state)   ;; always dies (per CC.md §Toolchain envelope)

state is a private record pp-state with fields {macros cond-stack out file-base-line}. Internal — not in data.scm.

Constant-expression evaluator for #if:

(pp-eval-cexpr toks macros) -> fixnum
;; toks: tokens of the expression after macro expansion
;; aborts on unrecognized identifier (treated-as-zero is wrong for our errors policy)

Test plan

tests/cc-pp/ per-file tests, same shape as cc-lex. Inputs are already-tokenized fixtures (so pp can be exercised without the lexer) or .c files (full pipeline). Both flavors live side-by-side.

cg.scm

Mutable codegen state and a value-stack-style emission API. The parser calls these and never touches output buffers directly.

Lifecycle

(cg-init)              -> cg                  ; fresh state
(cg-finish cg)         -> bv                   ; flush all buffers; result is final P1pp text

Every function lives between cg-fn-begin and cg-fn-end. Globals live outside.

(cg-fn-begin cg name params return-type)
;; name:        bv
;; params:      list of sym (their slots are pre-assigned by parser)
;; return-type: ctype
(cg-fn-end cg)         ; emits epilogue, flushes fn-buf into cg-text under %fn(...)

Vstack: push / pop / inspect

(cg-push   cg opnd)                  ; push opnd onto vstack
(cg-pop    cg)         -> opnd       ; remove and return top
(cg-top    cg)         -> opnd       ; non-destructive
(cg-depth  cg)         -> fixnum

Materialize

(cg-push-imm     cg ctype value)              -> opnd     ; rval
(cg-push-string  cg bv-content)               -> opnd     ; rval, char* — interns into str-pool
(cg-push-sym     cg sym)                      -> opnd     ; lval (var) or rval (fn name)
(cg-push-deref   cg)                          -> opnd     ; pop ptr-rval, push lval (no emission yet)

Address & deref operators

(cg-take-addr cg)     -> opnd        ; pop lval, push its address as rval
(cg-load      cg)     -> opnd        ; pop lval, push rval (loaded through address)

Type conversions

(cg-cast cg to-type)  -> opnd        ; pop, push opnd cast to to-type; emits sign-extension etc. as needed
(cg-promote cg)       -> opnd        ; integer promotion (rank ≤ int → int)
(cg-arith-conv cg)                   ; usual arithmetic conversions on top two opnds

Operators

(cg-binop cg op)      -> opnd        ; pop b, pop a, push (a op b)
(cg-unop  cg op)      -> opnd        ; pop a, push (op a)
(cg-assign cg)        -> opnd        ; pop rval, pop lval, cast rval to lval-type, store, push assigned value (cg-assign owns the cast since parse cannot peek beneath vstack top)

op for binop is a symbol from: 'add 'sub 'mul 'div 'rem 'and 'or 'xor 'shl 'shr 'eq 'ne 'lt 'le 'gt 'ge. For unop: 'neg 'bnot 'lnot.

Signed vs. unsigned dispatch is handled inside cg by inspecting the operand types after cg-arith-conv.

Calls

(cg-call cg arity has-result?)   -> opnd-or-#f
;; pops (arity + 1) opnds: function, then args left-to-right at top
;; (i.e. callable was pushed first, then args, so arg-N is on top)
;; emits arg-passing per P1 ABI, CALL or CALLR, captures return into a fresh frame slot
;; pushes result opnd; returns it (or #f if has-result? is #f)

Structured control flow

These take thunks so the parser can recursively emit the body:

(cg-if   cg then-thunk)                         ; pop cond; emit %if_nez { (then-thunk) }
(cg-ifelse cg then-thunk else-thunk)            ; pop cond; emit %ifelse_nez { ... }{ ... }
(cg-loop cg head-thunk body-thunk)   -> tag     ; head-thunk emits the cond; body-thunk is invoked as (body-thunk tag) so parse can use the same tag for cg-break / cg-continue inside the body; tag is also returned to the caller
(cg-loop-end cg tag)                            ; no-op; reserved for future per-loop teardown
(cg-break cg tag)
(cg-continue cg tag)

For for and do-while the parser composes the building blocks above; cg doesn't expose a dedicated for/do helper. (Three helpers beat seven.)

switch helpers

(cg-switch-begin cg)   -> swctx       ; spill controlling expression to a slot
(cg-switch-case  cg swctx const-int)  ; emit %if_eq jump-to-case-label
(cg-switch-default cg swctx)          ; emit B to default-label
(cg-switch-end   cg swctx)            ; close out

Globals and data

(cg-emit-global cg sym init)                 ; init = #f (zero-init in .bss)
                                             ;       | (piece ...) in .data
;; piece := <bytevector>            — raw bytes
;;        | (label-ref . <label-bv>) — 8-byte slot holding &label
(cg-emit-extern cg sym)                      ; declare without defining
(cg-intern-string cg bv-content) -> bv-label ; idempotent; used internally by cg-push-string

Frame allocation (used internally and by parse for locals)

(cg-alloc-slot cg bytes align) -> offset     ; bumps frame-hi; returns aligned offset

Test plan

tests/cc-cg/ is a runtime-validating suite. Each fixture is a .scm program that drives the cg API directly, calls cg-finish, and writes the resulting P1pp text to stdout. The harness assembles that P1pp through the existing P1pp toolchain, runs the resulting ELF, and diffs the program's exit code against <name>.expected-exit (default 0) and stdout against <name>.expected (default empty).

Fixtures must therefore emit a complete, runnable program — including any callees they invoke via cg-call. Direct-cg tests that depend on external symbols (extern abs, extern foo) belong in cc once real libc linkage exists; they don't fit cc-cg.

The contract this suite locks in is semantic, not syntactic: cg emission can be refactored freely as long as the program still computes the asserted result.

parse.scm

Mutates a pstate. Drives cg. Single entry point:

(parse-translation-unit ps)            ; consumes ps-toks until EOF, emits via ps-cg
;; ps must have ps-toks set; everything else starts empty.

Top-down structure

Internal helpers, hierarchically:

parse-translation-unit
 ├─ parse-decl-or-fn         ; returns 'fn or 'decl
 │   ├─ parse-decl-spec      ; storage + qualifiers + base type
 │   ├─ parse-declarator     ; spiral grammar; returns (name ctype)
 │   ├─ parse-init-list      ; for variable initializers (incl. designated)
 │   └─ parse-fn-body        ; only if declarator is fn-typed AND next tok is `{`
 ├─ parse-stmt
 │   ├─ parse-compound-stmt
 │   ├─ parse-if-stmt
 │   ├─ parse-while-stmt
 │   ├─ parse-do-stmt
 │   ├─ parse-for-stmt
 │   ├─ parse-switch-stmt
 │   ├─ parse-return-stmt
 │   ├─ parse-goto-stmt
 │   └─ parse-expr-stmt
 └─ parse-expr               ; Pratt; takes min-bp
     ├─ parse-primary
     ├─ parse-postfix
     ├─ parse-unary
     ├─ parse-cast-or-unary  ; `(T)e` vs. `(e)`
     └─ parse-binary-rhs

Token-stream API (private to parse.scm but conventional)

(peek ps)              -> tok
(peek2 ps)             -> tok                  ; one-token lookahead helper
(advance ps)           -> tok                  ; consume and return
(expect-kw ps sym)     -> tok                  ; KW match or die
(expect-punct ps sym)  -> tok                  ; PUNCT match or die
(at-kw? ps sym)        -> bool
(at-punct? ps sym)     -> bool

peek2 is needed exactly twice: distinguishing (typename) cast-expr from (expr) parenthesized expression, and distinguishing labelled-statement IDENT : from expression-statement IDENT ....

Scope helpers

(scope-enter! ps)
(scope-leave! ps)
(scope-bind!  ps name sym)              ; aborts on duplicate at innermost frame
(scope-lookup ps name)  -> sym-or-#f    ; walks frames

(tag-bind!    ps name ctype)
(tag-lookup   ps name)  -> ctype-or-#f

(typedef-add! ps name)                  ; updates ps-typedefs
(typedef?     ps name)  -> bool

Pratt expression parser

The binding-power table is a top-level alist:

(define %binop-bp
  ;; (punct-symbol . (lhs-bp . rhs-bp))
  ;; left-assoc:  rhs-bp = lhs-bp + 1
  ;; right-assoc: rhs-bp = lhs-bp - 1
  '((|*| 110 . 111) (|/| 110 . 111) (|%| 110 . 111)
    ...))

Driver:

(parse-expr-bp ps min-bp)              ; Pratt climber
(parse-expr ps)                        ; equivalent to (parse-expr-bp ps 0)

parse-expr leaves the result on ps-cg's vstack and returns the opnd. Statements that don't consume the value follow with cg-pop.

Test plan

tests/cc/ is a runtime-validating suite. Each fixture is a .c source file containing a complete, runnable program — main must exit with a known value. The harness runs the full lex+pp+parse+real-cg pipeline, assembles the emitted P1pp through the P1pp toolchain, runs the resulting ELF, and diffs exit code against <name>.expected-exit (default 0) and stdout against <name>.expected (default empty).

The contract this suite locks in is that the C source compiles to a program with the asserted runtime behavior. Per the §Feature workflow, parse fixtures land after the corresponding cg fixture is green.

main.scm

Driver. Roughly 80 lines.

(define (cc-main argv)
  (let* ((args     (parse-cli argv))      ; record: { input-path output-path defines }
         (src      (slurp-fd (open-or-die (cli-input args))))
         (toks     (lex-tokenize src (cli-input args)))
         (defines  (cli-defines->alist (cli-defines args)))
         (expanded (pp-expand toks defines))
         (cg       (cg-init))
         (ps       (make-pstate expanded cg)))
    (parse-translation-unit ps)
    (write-bv-fd (open-or-die-out (cli-output args))
                 (cg-finish cg))
    0))

(cc-main (argv))

CLI: cc input.flat.c -o output.P1pp [-D NAME[=val]] .... Strict — unrecognized flags die.

Test infrastructure

Five test trees. The upstream layers (lex, pp) byte-diff their pure transformations; the downstream layers (cg, cc) all compile-and-run the emitted P1pp and assert runtime behavior — exit code by default, optional stdout match. This split keeps the byte-diff brittleness out of every layer where there's a real "is this code correct" question to ask.

tests/cc-cg/ is the codegen contract. tests/cc/ is the C- language contract. Per §Feature workflow, every feature lands a cg fixture first (which must fail before cg is touched), and a parse fixture second (which must fail before parse is touched).

Out-of-scope here

Deferred to follow-up docs once we start coding: