commit bba17b7e3b6b1e9a74966786524e89250fadff14
parent 311fbbd7bf644c421acabce8905880cd1b367362
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Fri, 1 May 2026 14:55:50 -0700
cc: add cc.scm.md code map
Diffstat:
| A | cc/cc.scm.md | | | 173 | +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ |
1 file changed, 173 insertions(+), 0 deletions(-)
diff --git a/cc/cc.scm.md b/cc/cc.scm.md
@@ -0,0 +1,173 @@
+# cc.scm — Code Map
+
+## Overview
+
+`cc.scm` is a complete C compiler (6611 lines) written in Scheme (scheme1 dialect) that compiles C source to P1pp assembly. It implements a streaming pipeline: **lexer → preprocessor → parser → codegen**. It is designed for minimal memory use, with fixed pre-allocated buffers and a scratch/main heap discipline in which the scratch heap is reset after each declaration. It targets the P1 64-bit RISC ISA via libp1pp macros.
+
+---
+
+## Structural Map
+
+### Major Subsystems
+
+| Subsystem | Lines | Role |
+|-----------|-------|------|
+| Utilities | 1–282 | Bytevector helpers, list/alist ops, output buffers, diagnostics, debug logging, name generation |
+| Data Structures | 283–583 | Record type definitions, interned primitive ctypes, ctype predicates |
+| Symbol Alphabets | ~430–~560 | Keyword and punctuator alists |
+| Lexer | 584–1676 | Tokenizes C source; trigraph/splice, comments, escape sequences |
+| Preprocessor | 1677–2503 | `#define`, `#if`, macro expansion with hide-sets; `pp-eval-cexpr` delegates to `parse-const-int` via `%pp-make-const-ps` |
+| Code Generator | 2504–4034 | P1pp assembly emission, vstack, frame allocation, all operators and control flow |
+| Parser | 4035–6505 | Recursive-descent + Pratt; declarations, statements, expressions; shared constant-expression evaluator |
+| Main Driver | 6506–6611 | CLI parsing, file I/O, pipeline initialization |
+
+---
+
+## Key Data Structures
+
+### Runtime state — the three structs that wire the pipeline together
+
+`world` is shared between parser and codegen. `pstate` owns the token stream and drives parsing. `cg` owns all assembly emission state.
+
+```
+world
+├── scope (list of alist frames) var/typedef/fn bindings
+├── tags (list of alist frames) struct/union/enum tag bindings
+├── str-pool (alist) interned string literals → labels
+└── tentatives (list) file-scope tentative definitions
+
+pstate
+├── iter (tok-iter) pp-iter (lexer + preprocessor)
+├── world (shared)
+├── loops (stack) break/continue targets (loop-ctx records)
+├── fn-ctx (fn-ctx|#f) current function context
+└── cg (cg|#f) codegen state (#f in pp const-expr context)
+
+cg
+├── text/data/bss (buf) fixed-capacity output section buffers
+├── vstack (list of opnd) value stack for expression evaluation
+├── frame-hi (fixnum) next free frame byte offset
+├── label-ctr (fixnum) monotonic label counter
+├── world (shared)
+├── fn-meta (alist) transient per-fn metadata (sret ptr, indirect slots, etc.)
+├── fn-buf/prologue-buf (buf) reused per-function; drained to text at fn-end
+├── max-outgoing (fixnum) maximum stack args staged in current fn
+├── in-fn? (bool) routes %cg-emit to fn-buf vs text
+├── lib? (bool) skip entry stub + ELF_end
+└── str-prefix (bv) namespace prefix for anonymous strings
+```
+
+### Leaf data records — passed through the pipeline as values
+
+| Record | Fields | Purpose |
+|--------|--------|---------|
+| `loc` | file/line/col | Source location |
+| `tok` | kind/value/loc/hide | Token with hide-set for macro expansion |
+| `macro` | kind/params/body | Preprocessor macro definition |
+| `ctype` | kind/size/align/ext + mutators | C type representation |
+| `sym` | name/kind/storage/type/slot/defined? | Symbol table entry |
+| `opnd` | kind/type/ext/lval? | Operand on the vstack |
+| `loop-ctx` | kind/tag/has-continue? | Loop break/continue target info |
+| `fn-ctx` | name/return-type/params/variadic? | Current function metadata |
+
+---
+
+## Compilation Flow
+
+```
+Source file
+ → lex-iter (make-lex-iter) — streaming tokenizer
+ → pp-iter (make-pp-iter) — macro expansion + directives
+ → parse-translation-unit (pstate/cg) — recursive descent + Pratt
+ per-decl: call-with-scratch-cycle — scratch heap reset per declaration
+ function bodies: cg-fn-begin → parse-fn-body → cg-fn-end
+ → cg-finish — tentatives → .bss, entry stub, combine sections
+ → write output file
+```
+
+**Per-function code path:**
+1. `cg-fn-begin` — emit param spills, sret setup, allocate prologue-buf
+2. `parse-fn-body` — emits P1pp directly into fn-buf via cg ops
+3. `cg-fn-end` — drain prologue-buf + fn-buf into text, emit ret block
+
+**`#if` constant-expression path:**
+`pp-eval-cexpr` → resolve `defined`, macro-expand, idents→0 → `%pp-make-const-ps` (minimal pstate, empty scope, no cg) → `parse-const-int` (shared with parser)
+
+---
+
+## Line Map
+
+| Lines | Description |
+|-------|-------------|
+| **1–109** | Bytevector primitives: `bv=`, `bv-prefix?`, `bv-slice`, `bv-cat`, `bv->fixnum`; list/alist utilities: `alist-ref`, `alist-update`, `any`, `every`, `count`; integer helpers: `min3`, `align-up` |
+| **110–122** | `%BUF-CAP-*` — buffer pre-allocation constants (TEXT 8MiB, DATA 2MiB, BSS 2MiB, FN 256KiB, PROLOGUE 16KiB) |
+| **123–210** | Output buffer system: `buf` record, `buf-push!`, `buf-flush`, `buf-reset!`, `buf-drain!` — fixed-capacity, no growth |
+| **211–282** | Diagnostics: `die` with loc formatting, `slurp-fd`, `write-bv-fd`; debug logging: `debug-log-on!/off!`, `trace-emit` flags; fresh name generator: `make-namer` |
+| **283–496** | Record type definitions: `loc`, `tok`, `macro`, `ctype`, `sym`, `opnd`, `loop-ctx`, `fn-ctx`, `world`, `pstate`, `cg`; interned primitive ctypes (`%t-void`, `%t-i8`…`%t-u64`, `%t-bool`, `%t-flt`, `%t-dbl`, `%t-ldbl`); ctype predicates: `%ctype-ptr?`, `%ctype-pointee`, `%ctype-unsigned?`, `%ctype-fp?`; ctype accessors |
+| **~430–~560** | `%keyword-alist` — storage/qualifiers/type specifiers/statements/operators/reserved; `%punct-alist` — punctuators longest-first, digraphs |
+| **584–640** | Lexer byte-class predicates: `%digit?`, `%hex?`, `%alpha?`, `%ident-start?`, `%ident-cont?`, `%hspace?`, `%newline?`; `%lex-scratch` buffer |
+| **641–770** | Logical byte access: `%lex-peek` with trigraph translation + line splice |
+| **771–920** | Comment stripping: `%skip-ws-and-comments`, `%skip-line-comment`, `%skip-block-comment` |
+| **921–1070** | Byte-run scanners: `%scan-while`, `%fill-while-bv`, `%accum-int-while`, `%accum-octal-bounded` |
+| **1071–1270** | Token readers: `lex-read-ident`, `%lex-read-number` (hex/octal/decimal), `%lex-read-string` (with escapes), `lex-read-char` |
+| **1271–1351** | `%lex-read-punct` with longest-match bucketing; `%punct-buckets` |
+| **1352–1676** | `lex-iter` streaming token source: `make-lex-iter`, `%lex-iter-pull` with heap-rewind discipline; `list-iter` wrapper; `lex-tokenize` test driver |
+| **1677–1820** | Preprocessor state (`pp-state`), token classification helpers (`%pp-eof?`, `%pp-nl?`, `%pp-hash?`, etc.) |
+| **1821–1920** | Built-in macros: `__FILE__`, `__LINE__`, `__STDC__`, `__LISPCC__`, `__DATE__`, `__TIME__`, `__STDC_VERSION__`, `__STDC_HOSTED__`, `__VA_ARGS__` |
+| **1921–2020** | Streaming pp-iter: `make-pp-iter`, `%pp-iter-pull` with out-buf stashing |
+| **2021–2120** | Upstream helpers: `%pp-pull-upstream`, `%pp-peek-upstream`, `%pp-unshift-upstream!`, `%pp-collect-line-stream`, `%pp-collect-args-stream` |
+| **2121–2270** | Directive dispatch: `%pp-dispatch-step`, `%pp-dispatch-directive` → `%pp-do-define`, `%pp-do-undef`, `%pp-do-if/ifdef/ifndef/elif/else/endif` with cond-stack |
+| **2271–2370** | Directives: `%pp-do-error`, `%pp-do-line`, `%pp-do-pragma`, `%pp-do-include` |
+| **2371–2430** | Macro expansion: `%pp-emit-expanded`, `%pp-apply-macro`, `%pp-prepare-body`, `%pp-collect-args`, `%pp-bind-args` (variadic), `%pp-substitute` (`#param` stringize, `##` paste) |
+| **2431–2503** | Paste operator: `%pp-paste-tokens`; string fusion: `%pp-maybe-fuse-str`; `#if` evaluator: `%pp-make-const-ps` (IO adapter wrapping token list as minimal pstate), `pp-eval-cexpr`, `%pp-resolve-defined`, `%pp-expand-line`, `%pp-idents-as-zero` |
+| **2504–2620** | CG emission primitives: `%cg-emit-buf`, `%cg-emit`, `%cg-emit-many`, `%cg-fresh-label`, `%n` (number→bv) |
+| **2621–2720** | CG metadata: `%cg-fn-set!/%cg-fn-get`; register/label helpers: `%cg-reg->bv`, `%cg-emit-li`, `%cg-emit-la` |
+| **2721–2870** | Load/store emission: `%cg-emit-ld/st`, `%cg-emit-ld-slot-typed` (sign-extended sub-word loads), `%cg-emit-sext`, `%cg-spill-reg` |
+| **2871–3020** | Operand loading: `%cg-load-opnd-into` (imm/frame/global); vstack ops: `cg-push/pop/top/depth/dup`, snapshot/rewind for sizeof |
+| **3021–3170** | Materialize: `cg-push-imm`, `cg-push-string` (with intern), `cg-push-sym` (fn/enum/var/param), `cg-push-deref` (indirect-slot tracking) |
+| **3171–3320** | Aggregate access: `cg-push-field` with `%cg-find-field` (anonymous-member-aware lookup, shared with parser's offsetof), `cg-decay-array`; address/deref: `%cg-emit-addr-of`, `cg-copy-struct`, `cg-take-addr`, `cg-load` |
+| **3321–3470** | Type conversions: `cg-cast` (bool/ptr/widening/narrowing with sign-extend), `cg-promote`, `cg-arith-conv` |
+| **3471–3620** | Operators: `cg-binop` (pointer arithmetic scaling, comparison), `cg-unop` (neg/bnot/lnot), `cg-assign` (type coercion), post-inc/dec |
+| **3621–3770** | Function calls: `cg-call` (sret >16B struct return, arg staging a0–a3 + stack, variadic) |
+| **3771–3870** | Return: `cg-return` (void/scalar/struct); conditional: `cg-if`, `cg-ifelse`, `cg-ifelse-merge` (ternary/&&/\|\|) |
+| **3871–3970** | Loop control flow: `cg-loop`, `cg-break`, `cg-continue`; switch: `cg-switch-begin`, `cg-switch-case`, `cg-switch-default`, `cg-switch-end` (dispatch table) |
+| **3971–4000** | Variadic: `cg-va-start`, `cg-va-arg`, `cg-va-end`; labels/goto: `cg-emit-label`, `cg-goto` |
+| **4001–4034** | Globals/data: `cg-emit-global`, `cg-emit-extern`, tentatives, `cg-intern-string`; frame: `cg-alloc-slot`; lifecycle: `cg-init`, `cg-fn-begin/v`, `cg-fn-end`, `cg-finish` |
+| **4035–4160** | Scope/tag ops: `scope-enter/leave`, `scope-bind/lookup`, `tag-bind/lookup`, `typedef?` |
+| **4161–4260** | Type compatibility: `ctype-compat?`, `%fn-ctype-compat?`, `%fn-params-compat?`; symbol merge: `sym-merge` (linkage inheritance) |
+| **4261–4360** | Type constructors: `%mk-ptr`, `%mk-arr`, `%mk-fn`; qualifier handling: `eat-cv-quals!`, `skip-gnu-attribute!`, `eat-gnu-attributes!` |
+| **4361–4411** | Declaration specifiers: `parse-decl-spec` (storage/type/signedness), `resolve-base` |
+| **4412–~4480** | Aggregate parsing: `parse-aggregate-spec` (struct/union forward + complete), `parse-struct-fields` (union offset=0), `complete-agg!` (size/align/fields), `parse-enum-spec` |
+| **4412–4488** | Const-expr value helpers: `%const-trunc`, `%const-arith-conv`, `%const-arith-conv-type`, `%const-promote`, `%const-bool?` |
+| **4489–4512** | Const-expr binary-level infrastructure: `%const-binl` (generic left-associative loop), `%const-arith-op`, `%const-div-op`, `%const-cmp-op` |
+| **4473–4814** | Constant expression evaluator: `parse-const-expr` → `parse-const-cond` (ternary) → binary levels via `%const-binl` (lor/land/bor/bxor/band/eq/rel/add/mul) → `parse-const-shift` (inline; lhs-type-only) → `parse-const-cast` → `parse-const-unary` (sizeof, &, prefix ops) → `parse-const-primary` (INT/CHAR/paren/enum-const); `%const-sizeof-expr` (cg snapshot/rewind; guards against pp context) |
+| **4815–~4970** | offsetof support: `%const-parse-addrof-postfix`, `%const-parse-addrof-primary` — recognizes `&((T*)0)->field` chains; reuses `%cg-find-field` |
+| **4814–5060** | `parse-const-int`; declarators: `parse-declarator`, `parse-decl-cont`, `parse-decl-suf-cont`, `parse-fn-params` |
+| **5061–5160** | Phase 3 promotion: `%promote-pending-completions`, `rewrite-pending-completions!`, `promote-roots!`, `promote-iter-buffers!` (main/scratch boundary) |
+| **5161–5310** | Translation unit: `parse-translation-unit` with `call-with-scratch-cycle` per decl; `parse-decl-or-fn` |
+| **5311–5510** | Declarations/definitions: `handle-decl` (typedef/fn/var/static/file-scope/block-scope with tentatives) |
+| **5511–5710** | Global initializers: `parse-init-global` (string/brace/scalar with inferred-length arrays), `%parse-init-array-list` with element promotion, `%parse-init-struct-list` with designators and padding |
+| **5711–5860** | Local initializers: `parse-init-local-aggregate` (string/brace), `%parse-init-local-array-list`, `%parse-init-local-struct-list` (zero-pass); compound literals as frame lvalues |
+| **5861–5960** | Function body: `parse-fn-body`, `%parse-fn-body-inner` (param binding, scope enter/leave) |
+| **5961–6090** | Statements: `parse-stmt` dispatch, `parse-cstmt`, `parse-if-stmt`, `parse-while-stmt`, `parse-do-stmt`, `parse-for-stmt` (deferred condition/step), `parse-switch-stmt`, `parse-case-stmt`, `parse-default-stmt`, `parse-return-stmt`, `parse-goto-stmt`, `parse-labelled-stmt`, `parse-expr-stmt`, `parse-local-decl` |
+| **6091–6110** | `%binop-bp` — Pratt binding power table (comma=1, assign=4, `\|\|`=10, `&&`=20, bitwise=30–50, relational=60, shift=70, add=80, mul=90) |
+| **6111–6310** | Expression parser: `parse-expr` (`expr-bp(0)`), `parse-expr-bp` (Pratt climbing), `parse-binary-rhs` (comma/assign/compound-assign/ternary/logical/bitwise) |
+| **6311–6460** | Unary/cast/postfix: `parse-unary` (prefix ops, sizeof), `parse-cast-or-unary` (paren disambiguation), `parse-compound-literal`, `parse-postfix` (`[]`/call/`.`/`->`/post-inc/post-dec) |
+| **6461–6505** | Call parsing: `call-fn-type`, `parse-call-args` (param casting, variadic promotion); builtins: `parse-builtin-va-start/va-arg/va-end`; primary: `parse-primary` (literals/idents/strings/parens/enum-consts); rvalue: `rval!`, `rval-not-fn!` |
+| **6506–6611** | Driver: `%cc-slurp`, `%cc-write`, CLI flag parsing (`--cc-debug`, `--cc-trace-emit`, `--lib=PFX`), `%cc-initial-defines` (CCSCM sentinel), `cc-main` (pipeline init + `parse-translation-unit` + `cg-finish` + write) |
+
+---
+
+## Notable Design Choices
+
+- **Streaming pipeline** — no materialized token list; each stage pulls one token at a time
+- **Fixed buffers** — pre-allocated per section (text/data/bss); no growth; tuned by `%BUF-CAP-*`
+- **Heap discipline** — scratch heap reset at declaration boundaries via `call-with-scratch-cycle`; live roots deep-copied to main heap before reset
+- **Vstack-based codegen** — expression evaluation pushes/pops `opnd` records; values optionally spilled to frame slots
+- **Macro hide-sets** — `tok` carries hide set to prevent recursive expansion (C11 §6.10.3.4)
+- **Shared constant-expression evaluator** — `parse-const-*` serves both the parser (typed, with sizeof/cast/offsetof) and the preprocessor `#if` evaluator (`%pp-make-const-ps` wraps a token list as a minimal pstate with empty scope and `ps-cg = #f`); `%const-binl` provides the generic left-associative binary level pattern
+- **Sign-extension discipline** — narrow types (i8/i16/i32) stored as canonical 64-bit forms via shli/sari; widening casts are relabel-only
+- **Sret (struct return)** — structs >16B use indirect result: caller passes pointer in `a0`
+- **Variadic ABI** — 16 contiguous 8-byte slots; args 0–3 from `a`-regs, 4+ from `LDARG`
+- **Tentative definitions** — collected in `world-tentatives`; emitted as `.bss` only if no full definition appears by TU end
+- **FP softening** — float/double types parsed and sized per SysV ABI but all FP ops emit integer bitpattern operations