boot2

Playing with the bootstrap
git clone https://git.ryansepassi.com/git/boot2.git

commit bba17b7e3b6b1e9a74966786524e89250fadff14
parent 311fbbd7bf644c421acabce8905880cd1b367362
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Fri,  1 May 2026 14:55:50 -0700

cc: add cc.scm.md code map

Diffstat:
A cc/cc.scm.md | 173 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 173 insertions(+), 0 deletions(-)

diff --git a/cc/cc.scm.md b/cc/cc.scm.md
@@ -0,0 +1,173 @@
+# cc.scm — Code Map
+
+## Overview
+
+`cc.scm` is a complete C compiler (6611 lines) written in Scheme (scheme1 dialect) that compiles C source to P1pp assembly. It implements a streaming pipeline: **lexer → preprocessor → parser → codegen**. Designed for minimal memory use with fixed pre-allocated buffers and a scratch/main heap discipline that resets per declaration. Targets the P1 64-bit RISC ISA via libp1pp macros.
+
+---
+
+## Structural Map
+
+### Major Subsystems
+
+| Subsystem | Lines | Role |
+|-----------|-------|------|
+| Utilities | 1–282 | Bytevector helpers, list/alist ops, output buffers, diagnostics, debug logging, name generation |
+| Data Structures | 283–583 | Record type definitions, interned primitive ctypes, ctype predicates |
+| Symbol Alphabets | ~430–~560 | Keyword and punctuator alists |
+| Lexer | 584–1676 | Tokenizes C source; trigraph/splice, comments, escape sequences |
+| Preprocessor | 1677–2503 | `#define`, `#if`, macro expansion with hide-sets; `pp-eval-cexpr` delegates to `parse-const-int` via `%pp-make-const-ps` |
+| Code Generator | 2504–4034 | P1pp assembly emission, vstack, frame allocation, all operators and control flow |
+| Parser | 4035–6505 | Recursive-descent + Pratt; declarations, statements, expressions; shared constant-expression evaluator |
+| Main Driver | 6506–6611 | CLI parsing, file I/O, pipeline initialization |
+
+---
+
+## Key Data Structures
+
+### Runtime state — the three structs that wire the pipeline together
+
+`world` is shared between parser and codegen. `pstate` owns the token stream and drives parsing. `cg` owns all assembly emission state.
+
+```
+world
+├── scope       (list of alist frames)  var/typedef/fn bindings
+├── tags        (list of alist frames)  struct/union/enum tag bindings
+├── str-pool    (alist)                 interned string literals → labels
+└── tentatives  (list)                  file-scope tentative definitions
+
+pstate
+├── iter    (tok-iter)   pp-iter (lexer + preprocessor)
+├── world   (shared)
+├── loops   (stack)      break/continue targets (loop-ctx records)
+├── fn-ctx  (fn-ctx|#f)  current function context
+└── cg      (cg|#f)      codegen state (#f in pp const-expr context)
+
+cg
+├── text/data/bss        (buf)           fixed-capacity output section buffers
+├── vstack               (list of opnd)  value stack for expression evaluation
+├── frame-hi             (fixnum)        next free frame byte offset
+├── label-ctr            (fixnum)        monotonic label counter
+├── world                (shared)
+├── fn-meta              (alist)         transient per-fn metadata (sret ptr, indirect slots, etc.)
+├── fn-buf/prologue-buf  (buf)           reused per-function; drained to text at fn-end
+├── max-outgoing         (fixnum)        maximum stack args staged in current fn
+├── in-fn?               (bool)          routes %cg-emit to fn-buf vs text
+├── lib?                 (bool)          skip entry stub + ELF_end
+└── str-prefix           (bv)            namespace prefix for anonymous strings
+```
+
+### Leaf data records — passed through the pipeline as values
+
+| Record | Fields | Purpose |
+|--------|--------|---------|
+| `loc` | file/line/col | Source location |
+| `tok` | kind/value/loc/hide | Token with hide-set for macro expansion |
+| `macro` | kind/params/body | Preprocessor macro definition |
+| `ctype` | kind/size/align/ext + mutators | C type representation |
+| `sym` | name/kind/storage/type/slot/defined? | Symbol table entry |
+| `opnd` | kind/type/ext/lval? | Operand on the vstack |
+| `loop-ctx` | kind/tag/has-continue? | Loop break/continue target info |
+| `fn-ctx` | name/return-type/params/variadic? | Current function metadata |
+
+---
+
+## Compilation Flow
+
+```
+Source file
+  → lex-iter (make-lex-iter)                — streaming tokenizer
+  → pp-iter (make-pp-iter)                  — macro expansion + directives
+  → parse-translation-unit (pstate/cg)      — recursive descent + Pratt
+      per-decl: call-with-scratch-cycle     — scratch heap reset per declaration
+      function bodies: cg-fn-begin → parse-fn-body → cg-fn-end
+  → cg-finish                               — tentatives → .bss, entry stub, combine sections
+  → write output file
+```
+
+**Per-function code path:**
+1. `cg-fn-begin` — emit param spills, sret setup, allocate prologue-buf
+2. `parse-fn-body` — emits P1pp directly into fn-buf via cg ops
+3. `cg-fn-end` — drain prologue-buf + fn-buf into text, emit ret block
+
+**`#if` constant-expression path:**
+`pp-eval-cexpr` → resolve `defined`, macro-expand, idents→0 → `%pp-make-const-ps` (minimal pstate, empty scope, no cg) → `parse-const-int` (shared with parser)
+
+---
+
+## Line Map
+
+| Lines | Description |
+|-------|-------------|
+| **1–109** | Bytevector primitives: `bv=`, `bv-prefix?`, `bv-slice`, `bv-cat`, `bv->fixnum`; list/alist utilities: `alist-ref`, `alist-update`, `any`, `every`, `count`; integer helpers: `min3`, `align-up` |
+| **110–122** | `%BUF-CAP-*` — buffer pre-allocation constants (TEXT 8MiB, DATA 2MiB, BSS 2MiB, FN 256KiB, PROLOGUE 16KiB) |
+| **123–210** | Output buffer system: `buf` record, `buf-push!`, `buf-flush`, `buf-reset!`, `buf-drain!` — fixed-capacity, no growth |
+| **211–282** | Diagnostics: `die` with loc formatting, `slurp-fd`, `write-bv-fd`; debug logging: `debug-log-on!/off!`, `trace-emit` flags; fresh name generator: `make-namer` |
+| **283–496** | Record type definitions: `loc`, `tok`, `macro`, `ctype`, `sym`, `opnd`, `loop-ctx`, `fn-ctx`, `world`, `pstate`, `cg`; interned primitive ctypes (`%t-void`, `%t-i8`…`%t-u64`, `%t-bool`, `%t-flt`, `%t-dbl`, `%t-ldbl`); ctype predicates: `%ctype-ptr?`, `%ctype-pointee`, `%ctype-unsigned?`, `%ctype-fp?`; ctype accessors |
+| **~430–~560** | `%keyword-alist` — storage/qualifiers/type specifiers/statements/operators/reserved; `%punct-alist` — punctuators longest-first, digraphs |
+| **584–640** | Lexer byte-class predicates: `%digit?`, `%hex?`, `%alpha?`, `%ident-start?`, `%ident-cont?`, `%hspace?`, `%newline?`; `%lex-scratch` buffer |
+| **641–770** | Logical byte access: `%lex-peek` with trigraph translation + line splice |
+| **771–920** | Comment stripping: `%skip-ws-and-comments`, `%skip-line-comment`, `%skip-block-comment` |
+| **921–1070** | Byte-run scanners: `%scan-while`, `%fill-while-bv`, `%accum-int-while`, `%accum-octal-bounded` |
+| **1071–1270** | Token readers: `lex-read-ident`, `%lex-read-number` (hex/octal/decimal), `%lex-read-string` (with escapes), `lex-read-char` |
+| **1271–1351** | `%lex-read-punct` with longest-match bucketing; `%punct-buckets` |
+| **1352–1676** | `lex-iter` streaming token source: `make-lex-iter`, `%lex-iter-pull` with heap-rewind discipline; `list-iter` wrapper; `lex-tokenize` test driver |
+| **1677–1820** | Preprocessor state (`pp-state`), token classification helpers (`%pp-eof?`, `%pp-nl?`, `%pp-hash?`, etc.) |
+| **1821–1920** | Built-in macros: `__FILE__`, `__LINE__`, `__STDC__`, `__LISPCC__`, `__DATE__`, `__TIME__`, `__STDC_VERSION__`, `__STDC_HOSTED__`, `__VA_ARGS__` |
+| **1921–2020** | Streaming pp-iter: `make-pp-iter`, `%pp-iter-pull` with out-buf stashing |
+| **2021–2120** | Upstream helpers: `%pp-pull-upstream`, `%pp-peek-upstream`, `%pp-unshift-upstream!`, `%pp-collect-line-stream`, `%pp-collect-args-stream` |
+| **2121–2270** | Directive dispatch: `%pp-dispatch-step`, `%pp-dispatch-directive` → `%pp-do-define`, `%pp-do-undef`, `%pp-do-if/ifdef/ifndef/elif/else/endif` with cond-stack |
+| **2271–2370** | Directives: `%pp-do-error`, `%pp-do-line`, `%pp-do-pragma`, `%pp-do-include` |
+| **2371–2430** | Macro expansion: `%pp-emit-expanded`, `%pp-apply-macro`, `%pp-prepare-body`, `%pp-collect-args`, `%pp-bind-args` (variadic), `%pp-substitute` (`#param` stringize, `##` paste) |
+| **2431–2503** | Paste operator: `%pp-paste-tokens`; string fusion: `%pp-maybe-fuse-str`; `#if` evaluator: `%pp-make-const-ps` (IO adapter wrapping token list as minimal pstate), `pp-eval-cexpr`, `%pp-resolve-defined`, `%pp-expand-line`, `%pp-idents-as-zero` |
+| **2504–2620** | CG emission primitives: `%cg-emit-buf`, `%cg-emit`, `%cg-emit-many`, `%cg-fresh-label`, `%n` (number→bv) |
+| **2621–2720** | CG metadata: `%cg-fn-set!/%cg-fn-get`; register/label helpers: `%cg-reg→bv`, `%cg-emit-li`, `%cg-emit-la` |
+| **2721–2870** | Load/store emission: `%cg-emit-ld/st`, `%cg-emit-ld-slot-typed` (sign-extended sub-word loads), `%cg-emit-sext`, `%cg-spill-reg` |
+| **2871–3020** | Operand loading: `%cg-load-opnd-into` (imm/frame/global); vstack ops: `cg-push/pop/top/depth/dup`, snapshot/rewind for sizeof |
+| **3021–3170** | Materialize: `cg-push-imm`, `cg-push-string` (with intern), `cg-push-sym` (fn/enum/var/param), `cg-push-deref` (indirect-slot tracking) |
+| **3171–3320** | Aggregate access: `cg-push-field` with `%cg-find-field` (anonymous-member-aware lookup, shared with parser's offsetof), `cg-decay-array`; address/deref: `%cg-emit-addr-of`, `cg-copy-struct`, `cg-take-addr`, `cg-load` |
+| **3321–3470** | Type conversions: `cg-cast` (bool/ptr/widening/narrowing with sign-extend), `cg-promote`, `cg-arith-conv` |
+| **3471–3620** | Operators: `cg-binop` (pointer arithmetic scaling, comparison), `cg-unop` (neg/bnot/lnot), `cg-assign` (type coercion), post-inc/dec |
+| **3621–3770** | Function calls: `cg-call` (sret >16B struct return, arg staging a0–a3 + stack, variadic) |
+| **3771–3870** | Return: `cg-return` (void/scalar/struct); conditional: `cg-if`, `cg-ifelse`, `cg-ifelse-merge` (ternary/&&/\|\|) |
+| **3871–3970** | Loop control flow: `cg-loop`, `cg-break`, `cg-continue`; switch: `cg-switch-begin`, `cg-switch-case`, `cg-switch-default`, `cg-switch-end` (dispatch table) |
+| **3971–4000** | Variadic: `cg-va-start`, `cg-va-arg`, `cg-va-end`; labels/goto: `cg-emit-label`, `cg-goto` |
+| **4001–4034** | Globals/data: `cg-emit-global`, `cg-emit-extern`, tentatives, `cg-intern-string`; frame: `cg-alloc-slot`; lifecycle: `cg-init`, `cg-fn-begin/v`, `cg-fn-end`, `cg-finish` |
+| **4035–4160** | Scope/tag ops: `scope-enter/leave`, `scope-bind/lookup`, `tag-bind/lookup`, `typedef?` |
+| **4161–4260** | Type compatibility: `ctype-compat?`, `%fn-ctype-compat?`, `%fn-params-compat?`; symbol merge: `sym-merge` (linkage inheritance) |
+| **4261–4360** | Type constructors: `%mk-ptr`, `%mk-arr`, `%mk-fn`; qualifier handling: `eat-cv-quals!`, `skip-gnu-attribute!`, `eat-gnu-attributes!` |
+| **4361–4411** | Declaration specifiers: `parse-decl-spec` (storage/type/signedness), `resolve-base` |
+| **4412–~4480** | Aggregate parsing: `parse-aggregate-spec` (struct/union forward + complete), `parse-struct-fields` (union offset=0), `complete-agg!` (size/align/fields), `parse-enum-spec` |
+| **4412–4488** | Const-expr value helpers: `%const-trunc`, `%const-arith-conv`, `%const-arith-conv-type`, `%const-promote`, `%const-bool?` |
+| **4489–4512** | Const-expr binary-level infrastructure: `%const-binl` (generic left-associative loop), `%const-arith-op`, `%const-div-op`, `%const-cmp-op` |
+| **4473–4814** | Constant expression evaluator: `parse-const-expr` → `parse-const-cond` (ternary) → binary levels via `%const-binl` (lor/land/bor/bxor/band/eq/rel/add/mul) → `parse-const-shift` (inline; lhs-type-only) → `parse-const-cast` → `parse-const-unary` (sizeof, &, prefix ops) → `parse-const-primary` (INT/CHAR/paren/enum-const); `%const-sizeof-expr` (cg snapshot/rewind; guards against pp context) |
+| **4815–~4970** | offsetof support: `%const-parse-addrof-postfix`, `%const-parse-addrof-primary` — recognizes `&((T*)0)->field` chains; reuses `%cg-find-field` |
+| **4814–5060** | `parse-const-int`; declarators: `parse-declarator`, `parse-decl-cont`, `parse-decl-suf-cont`, `parse-fn-params` |
+| **5061–5160** | Phase 3 promotion: `%promote-pending-completions`, `rewrite-pending-completions!`, `promote-roots!`, `promote-iter-buffers!` (main/scratch boundary) |
+| **5161–5310** | Translation unit: `parse-translation-unit` with `call-with-scratch-cycle` per decl; `parse-decl-or-fn` |
+| **5311–5510** | Declarations/definitions: `handle-decl` (typedef/fn/var/static/file-scope/block-scope with tentatives) |
+| **5511–5710** | Global initializers: `parse-init-global` (string/brace/scalar with inferred-length arrays), `%parse-init-array-list` with element promotion, `%parse-init-struct-list` with designated designators and padding |
+| **5711–5860** | Local initializers: `parse-init-local-aggregate` (string/brace), `%parse-init-local-array-list`, `%parse-init-local-struct-list` (zero-pass); compound literals as frame lvalues |
+| **5861–5960** | Function body: `parse-fn-body`, `%parse-fn-body-inner` (param binding, scope enter/leave) |
+| **5961–6090** | Statements: `parse-stmt` dispatch, `parse-cstmt`, `parse-if-stmt`, `parse-while-stmt`, `parse-do-stmt`, `parse-for-stmt` (deferred condition/step), `parse-switch-stmt`, `parse-case-stmt`, `parse-default-stmt`, `parse-return-stmt`, `parse-goto-stmt`, `parse-labelled-stmt`, `parse-expr-stmt`, `parse-local-decl` |
+| **6091–6110** | `%binop-bp` — Pratt binding power table (comma=1, assign=4, `\|\|`=10, `&&`=20, bitwise=30–50, relational=60, shift=70, add=80, mul=90) |
+| **6111–6310** | Expression parser: `parse-expr` (`expr-bp(0)`), `parse-expr-bp` (Pratt climbing), `parse-binary-rhs` (comma/assign/compound-assign/ternary/logical/bitwise) |
+| **6311–6460** | Unary/cast/postfix: `parse-unary` (prefix ops, sizeof), `parse-cast-or-unary` (paren disambiguation), `parse-compound-literal`, `parse-postfix` (`[]`/call/`.`/`->`/post-inc/post-dec) |
+| **6461–6505** | Call parsing: `call-fn-type`, `parse-call-args` (param casting, variadic promotion); builtins: `parse-builtin-va-start/va-arg/va-end`; primary: `parse-primary` (literals/idents/strings/parens/enum-consts); rvalue: `rval!`, `rval-not-fn!` |
+| **6506–6611** | Driver: `%cc-slurp`, `%cc-write`, CLI flag parsing (`--cc-debug`, `--cc-trace-emit`, `--lib=PFX`), `%cc-initial-defines` (CCSCM sentinel), `cc-main` (pipeline init + `parse-translation-unit` + `cg-finish` + write) |
+
+---
+
+## Notable Design Choices
+
+- **Streaming pipeline** — no materialized token list; each stage pulls one token at a time
+- **Fixed buffers** — pre-allocated per section (text/data/bss); no growth; tuned by `%BUF-CAP-*`
+- **Heap discipline** — scratch heap reset at declaration boundaries via `call-with-scratch-cycle`; live roots deep-copied to main heap before reset
+- **Vstack-based codegen** — expression evaluation pushes/pops `opnd` records; values optionally spilled to frame slots
+- **Macro hide-sets** — `tok` carries hide set to prevent recursive expansion (C11 §6.10.3.4)
+- **Shared constant-expression evaluator** — `parse-const-*` serves both the parser (typed, with sizeof/cast/offsetof) and the preprocessor `#if` evaluator (`%pp-make-const-ps` wraps a token list as a minimal pstate with empty scope and `ps-cg = #f`); `%const-binl` provides the generic left-associative binary level pattern
+- **Sign-extension discipline** — narrow types (i8/i16/i32) stored as canonical 64-bit forms via shli/sari; widening casts are relabel-only
+- **Sret (struct return)** — structs >16B use indirect result: caller passes pointer in `a0`
+- **Variadic ABI** — 16 contiguous 8-byte slots; args 0–3 from `a`-regs, 4+ from `LDARG`
+- **Tentative definitions** — collected in `world-tentatives`; emitted as `.bss` only if no full definition appears by TU end
+- **FP softening** — float/double types parsed and sized per SysV ABI but all FP ops emit integer bitpattern operations
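
The sign-extension discipline described in the code map (narrow values kept in canonical 64-bit form via `shli`/`sari`) can be sketched as a small register model. This is an illustration only, written in Python rather than the compiler's Scheme. It assumes `shli` is a logical shift left by an immediate and `sari` an arithmetic shift right by an immediate on a 64-bit register; the helper names `sext` and `to_signed` are hypothetical and do not come from `cc.scm`.

```python
MASK64 = (1 << 64) - 1  # model a 64-bit register as an unsigned Python int


def shli(x: int, n: int) -> int:
    """Assumed semantics: logical shift left by n, truncated to 64 bits."""
    return (x << n) & MASK64


def sari(x: int, n: int) -> int:
    """Assumed semantics: arithmetic shift right by n on a 64-bit register."""
    if x & (1 << 63):          # reinterpret the register as a signed value
        x -= 1 << 64
    return (x >> n) & MASK64   # Python's >> on negative ints is arithmetic


def sext(x: int, bits: int) -> int:
    """Canonicalize the low `bits` bits of x as a sign-extended 64-bit value.

    Hypothetical helper: shift the narrow value to the top of the register,
    then shift it back down arithmetically so the sign bit is replicated.
    """
    n = 64 - bits
    return sari(shli(x, n), n)


def to_signed(x: int) -> int:
    """Read a 64-bit register value as a signed Python int (for display)."""
    return x - (1 << 64) if x & (1 << 63) else x


if __name__ == "__main__":
    for raw, bits in [(0x7F, 8), (0x80, 8), (0x8000, 16), (0xFFFFFFFF, 32)]:
        print(f"i{bits} {raw:#x} -> {to_signed(sext(raw, bits))}")
```

Under these assumptions, a value that is already canonical as an i8 is also canonical as an i16, i32, or i64, because the upper bits already replicate the sign bit. That matches the map's note that widening casts are relabel-only: re-running `sext` at a wider width leaves the register unchanged.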