commit 67e43e7cea97347519640114712c419b7a4f9261
parent 9ecdd23fb47bc80e80f39af4bcb55069c2f59760
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Mon, 27 Apr 2026 19:57:09 -0700
docs: TCC-TODO punchlist + CC-STREAM plan
TCC-TODO.md tracks the blockers between today's scheme1-hosted cc
and a successful compile of tcc.flat.c (companion to TCC.md / CC.md /
CC-PARALLEL-PLAN.md). Captures float/double parsing as resolved
(commit 14855ba) and the lex-side memory tightening as done (commit
62dabed); blocker 2 (compatible redecl) and the pp-side memory work
remain open.
CC-STREAM.md plans the conversion from a four-phase materializing
pipeline (lex -> pp -> parse -> cg with full tok-list intermediates
at every layer) to a true single-pass streaming compiler where each
layer pulls one token at a time from upstream. Argues against GC as
the wrong tool (changes typical-case rather than worst-case memory;
much bigger lift in scheme1). Lays out a token-iter abstraction and
a three-step migration: lex-iter, pp-iter, parser repoint. Predicted
worst-case memory drops from O(source length) to O(parser state).
Diffstat:
| A | docs/CC-STREAM.md | | | 254 | +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ |
| A | docs/TCC-TODO.md | | | 179 | +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ |
2 files changed, 433 insertions(+), 0 deletions(-)
diff --git a/docs/CC-STREAM.md b/docs/CC-STREAM.md
@@ -0,0 +1,254 @@
+# Streaming the cc
+
+Working plan. The cc today is a four-phase pipeline that materializes
+the full intermediate state at every layer:
+
+```
+src bv → [lex-tokenize] → tok list → [pp-expand] → tok list → [parse] → cg buf
+```
+
+That shape doesn't fit the data flow. Lex, pp, parse, and cg are each
+inherently sequential, with small bounded lookahead. The full token
+list is never *needed*; it's just where the current code parks
+intermediate state on the heap. For [tcc.flat.c](TCC.md) (608 KB) the
+lex output alone runs ~170 MB at ~280 B/byte even after the
+[per-token mark/rewind tightening](TCC-TODO.md#blocker-3). pp adds
+another similar slug on top. We've doubled the ELF p_memsz twice
+already; the cleaner answer is to stop materializing.
+
+This doc plans the conversion to a true single-pass streaming
+compiler: lex, pp, parse, and cg all run concurrently, each pulling
+one token at a time from the stage upstream. Steady-state live data
+is bounded by *lookahead* (≤2 tokens) plus parser state (typedef
+table, scope chain, macro table, cg output buffer), not by source
+length.
+
+## Why streaming, not GC
+
+A real GC would fix the symptom — the per-byte multiple is mostly
+dead intermediate state, and a GC would reclaim it eventually — but
+the underlying shape would still be wrong. Adding a GC to scheme1 is
+also a much bigger lift than reshaping the cc:
+
+- scheme1 is shared by every bootstrap stage; touching its allocator
+ affects M1pp, p1, scheme1's own self-host, and anything else built
+ on top. Streaming is local to cc.
+- A GC pass needs to trace through closures, records, env alists, and
+ the bytevector heap layout. It's a multi-week project against
+ scheme1's current single-page-of-allocator code.
+- Memory headroom for the cc has a hard ceiling (ELF p_memsz, which
+ must be a constant in the boot ELF). A GC narrows the *typical*
+ high-water mark but doesn't change the *worst case* the cap has to
+ cover. Streaming changes the worst case from O(source length) to
+ O(parser state).
+- We're aiming for predictability: the cc is the bottleneck stage
+ that needs to compile multi-MB inputs. A bounded steady-state model
+ is easier to reason about than a "GC will handle it" model.
+
+GC stays a sensible move for scheme1 *eventually* — every bootstrap
+stage would benefit. But for closing the tcc.flat.c gap it's the
+wrong tool.
+
+## Symmetry argument
+
+Each layer in the pipeline has small bounded lookahead:
+
+| layer | lookahead / buffering |
+|--------|------------------------------------------------------|
+| lex | ≤3 source bytes (trigraph), 1 token internally |
+| pp | 1 token peek; arg-collection during fn-macro expansion (bounded by call site) |
+| parse | 2 tokens (`peek2` is the deepest the grammar uses) |
+| cg | 0 (already streams to cg-text/cg-data buffers) |
+
+Live tokens across the pipeline are O(macro-arg-list +
+parse-lookahead) — call it dozens. Persistent state (parser tables
++ macro definitions + cg output) grows with declared symbols, not
+source length. For tcc.flat.c that's an estimated 5–20 MB.
+
+## Iter abstraction
+
+A single record + three operations per layer:
+
+```scheme
+(define-record-type tok-iter
+ (%tok-iter pull-fn state buf)
+ tok-iter?
+ (pull-fn tok-iter-pull-fn) ; thunk: state -> tok
+ (state tok-iter-state) ; per-layer
+ (buf tok-iter-buf tok-iter-buf-set!)) ; pushed-back toks (small list)
+
+(define (iter-next it) ...) ; consume + return next tok (EOF sentinel)
+(define (iter-peek it) ...) ; like next, but cache in buf
+(define (iter-unget! it t) ...) ; push back; next iter-next returns t
+```
+
+Each layer wraps the layer below: `lex-iter` over bytes, `pp-iter`
+over `lex-iter`, `pstate` over `pp-iter`. The iter discipline
+replaces the current "list of toks" everywhere a sequence is
+consumed.
+
+## Mapping the layers
+
+### lex-iter
+
+Drop the per-iteration `(reverse (cons … acc))` driver in
+[cc/cc.scm](../cc/cc.scm) `lex-tokenize`. State holds (pos, line,
+col, bol?). `lex-iter-next` runs roughly one iteration of the current
+loop body: skip ws+comments, dispatch, allocate one tok+loc, return.
+
+The per-token `heap-mark` / `heap-rewind!` discipline I just landed
+transposes one-for-one — same `%lex-scratch`, same scalar boxes, same
+post-rewind rebuild of tok+loc. The difference is who advances:
+instead of the loop driving itself to EOF, the *consumer* (pp-iter)
+pulls one tok per `iter-next` call.
+
+### pp-iter
+
+Most invasive. State holds (lex-iter, pending-toks, macros,
+cond-stack, current-loc). `pp-iter-next`:
+
+1. If `pending` non-empty, pop and return.
+2. Pull from `lex-iter`.
+3. If it's `HASH` at bol: collect the directive line into a small
+ buffer (bounded by one logical line, terminated by NL), dispatch,
+ recurse.
+4. If it's a defined ident not in its hide set: collect args
+ (function-like — bounded by the call site), expand into
+ `pending`, recurse.
+5. If it's `STR` and the next non-trivia tok is `STR`: fuse on the
+ fly (replaces today's post-pass `%pp-merge-adjacent-strs`).
+6. Else return.
+
+The hide set tracking already runs per-token in `pp-expand`; the
+streaming version threads it through `pending` instead of through a
+flat list.
+
+### parser
+
+Already uses `ps-peek` / `ps-advance` (see
+[CC.md §Parser](CC.md)). Repoint them at `pp-iter` instead of
+indexing into a list. Add per-stmt `heap-mark` / `heap-rewind!` in
+`parse-stmt` so that the parse-time interpreter overhead also gets
+reclaimed at statement boundaries. Survivors that must cross the
+rewind (typedef bvs, scope-bound symbols, macro bodies) need to be
+allocated below the mark — same discipline as the lex driver, just
+one layer up.
+
+### cg
+
+Already streams to cg-text/cg-data buffers. No change.
+
+## Subtle bits
+
+- **#if cexpr** evaluates a fully-collected directive line — keep
+ that as a *bounded* buffer (one preprocessor line, terminated by
+ NL). Don't try to stream cexpr eval.
+- **Function-macro arg collection** also bounded buffer (until
+ matching `)`). Worst case is multi-line, but still bounded by the
+ call-site span. The only real memory concern; if a macro call spans
+ 10 KB of args we hold all of it. tcc.c doesn't do that.
+- **Adjacent string concat** moves from a post-pass to a peek-and-
+ fuse in `pp-iter`. Easy.
+- **Builtin macros** (`__FILE__`, `__LINE__`, `__DATE__`, …) need the
+ current loc at expansion time, not at lex time. The pp-iter has it
+ because it tracks lex's current loc.
+- **EOF discipline.** Each iter must yield exactly one EOF tok and
+ then keep yielding EOF on subsequent calls. The parser already
+ expects EOF as a stable sentinel.
+- **Lookahead beyond 2.** The parser uses `peek` and `peek2`; nothing
+ deeper. If something needs 3-token lookahead later, the unget
+ buffer accommodates it without changing the iter shape.
+- **Per-stmt rewind survivors.** A statement may bind a typedef
+ whose name must outlive the rewind. The mark goes *after* the
+ typedef-add path, or the bv is allocated below the mark — same
+ trick the lex driver uses with `%lex-scratch`. Easier than it
+ sounds because typedef adds happen at the top of `parse-decl`,
+ which is well before any deep parse-time scratch.
+
+## Migration path
+
+Three roughly self-contained PRs. Each is independently testable —
+the existing test suites already drive only the public API
+(`lex-tokenize` for cc-lex, `pp-expand` for cc-pp, the cc front end
+for cc + cc-cg). Adapter wrappers preserve those entry points
+through the migration.
+
+### Step 1 — lex-iter
+
+Convert `lex-tokenize` to `make-lex-iter` + `lex-iter-next`. Leave
+`lex-tokenize` as a thin wrapper that drains the iter into a list,
+so cc-lex tests pass unchanged. Touchpoints:
+[cc/cc.scm](../cc/cc.scm) lex section, [cc/main.scm](../cc/main.scm).
+
+After this step: lex output is still consumed by today's
+`pp-expand` via the wrapper. Identity-preserving; no behavior change
+visible to tests.
+
+### Step 2 — pp-iter
+
+Convert `pp-expand` to `make-pp-iter` + `pp-iter-next`. Directives
+and macro expansion all move from "list-walking" style to
+"stream + pending-buffer" style. Same logic, different shape. Leave
+a `pp-expand` wrapper that drains the iter into a list for
+back-compat. Heaviest single PR.
+
+After this step: pp consumes lex via the iter; parser still drains
+pp into a list.
+
+### Step 3 — parser repoint
+
+Change `pstate` to hold a `pp-iter` instead of a tok-list cursor.
+`peek` / `peek2` / `advance` redirect to iter ops. Drop the wrappers
+from steps 1–2 once the cc-cg / cc tests pass.
+
+Add per-stmt `heap-mark` / `heap-rewind!` in `parse-stmt`. This is
+the layer where the *parser* per-form interpreter overhead gets
+reclaimed.
+
+After this step: full streaming. Memory bound shifts from O(source
+length) to O(parser state).
+
+## Expected impact
+
+For tcc.flat.c (608 KB):
+
+| state | predicted high water |
+|-------------------------------|---------------------:|
+| current (post-lex tightening) | ~170 MB during lex; pp blows the 256 MiB cap |
+| post-step-1 | unchanged (same materialization at pp/parser) |
+| post-step-2 | lex no longer materialized; pp output still listed for parser. ~80 MB? |
+| post-step-3 | full streaming. Bounded by parser state. ~10–30 MB. |
+
+Numbers are guesses until we measure, but the shape is clear: most
+of the per-byte transient cost we pay today disappears not because
+we got faster but because we never store that intermediate at all.
+
+## Open questions
+
+- **Hide-set sharing.** Today the same hide-set list can be shared
+ across many tokens because it's immutable. In streaming, the same
+ property holds — pp-iter just hands out shared references — but
+ we should confirm there's no in-place mutation lurking in the pp
+ code paths that gets converted.
+- **Diagnostics ergonomics.** When the parser dies on an unexpected
+ token, today the line/col come from `tok-loc`. Same in streaming;
+ no change.
+- **Test runner adapters.** cc-lex and cc-pp test runners build full
+ token lists for inspection. Keep their wrappers (`lex-tokenize`,
+ `pp-expand`) as drain-to-list adapters. Internally the test
+ fixtures don't change.
+- **`#include`.** Currently rejected ([CC.md §Toolchain envelope](CC.md)).
+ If we ever want to support it, the streaming model is friendlier:
+ the include-file's lex-iter is pushed onto a stack inside pp-iter,
+ and `pp-iter-next` drains it before resuming the outer file. Out
+ of scope for the tcc.flat.c push but worth noting.
+
+## Related
+
+- [TCC-TODO.md](TCC-TODO.md) — blocker 3 covers the same memory
+ problem from the per-byte angle. The lex-side tightening landed
+ there is what step 1 of this plan transposes onto an iter.
+- [CC.md](CC.md) — the language and toolchain spec the cc implements.
+- [CC-PARALLEL-PLAN.md](CC-PARALLEL-PLAN.md) — independent feature
+ punchlist; streaming is orthogonal but unblocks running on
+ tcc.flat.c so those features can be exercised end-to-end.
diff --git a/docs/TCC-TODO.md b/docs/TCC-TODO.md
@@ -0,0 +1,179 @@
+# tcc.flat.c via scheme cc — known blockers
+
+Working punch list for what stands between today's scheme1-hosted cc
+and a successful compile of `tcc.flat.c`. Companion to
+[TCC.md](TCC.md) (the surrounding three-stage pipeline), [CC.md](CC.md)
+(the C subset the cc accepts), and [CC-PARALLEL-PLAN.md](CC-PARALLEL-PLAN.md)
+(the per-feature `tests/cc/` punchlist). The blockers below are the
+ones that fire *before* we even reach the parallel-plan items.
+
+## Repro
+
+`tcc.flat.c` is the stage-1 artifact. Generate it with the host
+preprocessor:
+
+```
+sh scripts/stage1-flatten.sh --arch X86_64
+# -> build/cc-bootstrap/X86_64/tcc.flat.c (608 KB, 18 896 lines, 0 directives)
+```
+
+Run the catm'd scheme cc against it inside the per-arch container.
+The cc-debug flag prints heap usage between phases on stderr:
+
+```
+podman run --rm --pull=never --platform linux/arm64 \
+ --tmpfs /tmp:size=512M -e ARCH=aarch64 \
+ -v "$(pwd)":/work -w /work boot2-busybox:aarch64 \
+ build/aarch64/scheme1 build/aarch64/cc/cc.scm --cc-debug \
+ build/cc-bootstrap/X86_64/tcc.flat.c /tmp/tcc.flat.P1pp
+```
+
+Prerequisites: `make scheme1 cc ARCH=aarch64` (or any other arch) so
+`build/$ARCH/scheme1` and `build/$ARCH/cc/cc.scm` exist.
+
+For triage, the small-prefix probe is useful:
+
+```
+head -c 50000 build/cc-bootstrap/X86_64/tcc.flat.c \
+ > build/cc-bootstrap/X86_64/tcc.head.c
+# then re-run the podman invocation against tcc.head.c
+```
+
+## Blocker 1 — `float` / `double` hard-rejected in type specifiers ✓ done
+
+Resolved: `parse-decl-spec` now accepts `float`, `double`, and `long
+double` as type specifiers, mapping them to `%t-flt` / `%t-dbl` /
+`%t-ldbl` ctypes (sizes 4/8/8). `_Complex` / `_Imaginary` are eaten as
+silent qualifiers. The five other type-name-start predicates
+(`paren-is-group?`, `stmt-starts-decl?`, `token-is-decl?`,
+`%const-paren-is-cast?`, `%const-tok-is-decl?`, plus the
+`parse-cast-or-unary` guard) recognize the new kw set. The cg guards
+fp ld/st/cast via `%cg-fp-reject!` so any attempt to materialize an fp
+value dies with `fp not codegen'd` rather than silently emitting
+integer code. Coverage: `tests/cc/119-float-parse.c` exercises
+prototypes, struct layout, and `sizeof(float|double|long double)`.
+Confirmed end-to-end: a 5 KB prefix of `tcc.flat.c` now reaches
+blocker 2 (`dup decl: _exit`) instead of dying on `double`.
+
+## Blocker 2 — compatible redeclaration rejected
+
+Site: [cc/cc.scm](../cc/cc.scm) `scope-bind!`, line 3667.
+
+```scheme
+(if (alist-ref n top) (die #f "dup decl" n)
+ (ps-scope-set! ps (cons (alist-set n s top) r)))
+```
+
+C allows redundant identical extern declarations and prototypes that
+differ only in parameter names. The flattened TU has many — `_exit`
+appears in both `<stdlib.h>` and `<unistd.h>`; `strlen` is declared
+once with a named param and once without; etc.
+
+Fix: when the name is already bound at the same scope, allow the
+redeclaration if the new type is compatible with the existing one
+(extern decls of the same function; pointer/array decay equivalents).
+Only definition-after-definition or genuinely conflicting types should
+die. The same compatible-redecl rule needs to land in the typedef
+table and the tag table for forward-decl chains.
+
+Observed (with blocker 1 fixed, on a 5 KB prefix of tcc.flat.c):
+
+```
+error: dup decl: _exit
+```
+
+## Blocker 3 — lex+pp heap usage at ~3 KB per source byte (lex side fixed)
+
+Site: lex/pp pipeline in [cc/cc.scm](../cc/cc.scm) plus heap caps in
+[scheme1/scheme1.P1pp](../scheme1/scheme1.P1pp) and
+[vendor/seed/](../vendor/seed/)`<arch>/ELF.hex2`.
+
+### Lex side — done
+
+Investigation traced the bulk of the per-byte cost to scheme1's
+interpreter calling convention: every closure call allocates ~32 B
+per param via `bind_params`, every `let*` binding allocates two cons
+cells, every `eval_args` allocates one cons per arg, and tail-rec
+named-let in the inner scanners (`%scan-while`, `%fill-while-bv`,
+`%match-bytes`, `%accum-int-while`) re-applies a closure per scanned
+byte. None of this overhead is reclaimed across token boundaries —
+the lex helpers' own internal `heap-mark` / `heap-rewind!` discipline
+trims their per-helper scratch but the result tok + 4-list lives
+below any putative outer mark, so a driver-level rewind needs to
+rebuild survivors from pre-mark scalar boxes.
+
+Fix landed in `cc/cc.scm`: `lex-tokenize` now wraps each iteration in
+its own `heap-mark` / `heap-rewind!`. Scalar fields (kind, value for
+non-bv tokens, npos/nline/ncol) ferry across the rewind via outer-let
+bindings (set! mutates the binding cdr in place; no heap). Bv-typed
+values (IDENT, STRING) ferry through `%lex-scratch`, a single sticky
+bytevector allocated below all marks; post-rewind the driver copies
+back into a fresh bv sized exactly to the content. tok+loc+acc-cons
+are rebuilt post-rewind.
+
+Measured (`tests/cc/001-kitchen-sink.c`, 9.7 KB):
+
+```
+phase=slurp: heap 1.23 MB src-bytes 9782
+phase=lex: heap 3.40 MB (delta ~2.17 MB → ~222 B/byte; 13× better)
+phase=pp: heap 8.58 MB
+```
+
+A 50 KB prefix of `tcc.flat.c` (which used to exhaust the heap during
+lex) now lex+pp's cleanly through to blocker 2 (`dup decl: _exit`).
+Lex finishes at 15.6 MB, pp at 41.3 MB. Extrapolating 222 B/byte to
+the full 608 KB input gives ~135 MB during lex+pp — close enough to
+the 128 MB `p_memsz` cap that a single bump (and a bigger
+`HEAP_CAP_BYTES`) closes the gap.
+
+We also considered an interning pool (one bv per unique IDENT/STRING,
+shared across token instances). With scheme1's interpreter, walking a
+cons-list pool — even bucketed 16-way — costs ~50–150 B per step in
+`bind_params`/`eval_args`/named-let frame overhead. The walk costs
+out-paced the bv-allocation savings: a 50 KB prefix went 60+ MB with
+buckets vs ~17 MB un-interned. Until scheme1 grows a true vector
+primitive (so bucket access is O(1) without interpreter overhead),
+no-intern is the cleanest path. The `%lex-scratch` infrastructure is
+ready to layer interning back on top if a downstream eq?-fast-path
+turns out to need it.
+
+### Remaining
+
+- **pp side memory.** pp still copies the lex token list into a
+ buf-list and walks it; `%pp-take-line` reverses tokens. There's
+ probably similar interpreter-overhead concentrate to drain. Worth
+ the same investigation pass once we're past blocker 2.
+- **Bump arena caps.** Even with the lex fix the full TU is many MB
+ of tokens. `READBUF_CAP_BYTES`, `HEAP_CAP_BYTES`, and the ELF
+ `p_memsz` literal must all grow in lockstep before tcc.flat.c will
+ finish.
+
+## Suspected next-tier blockers (not yet observed)
+
+Past blockers 1–3 the next wave we expect:
+
+- **`_Bool`, bitfield-typed struct fields, `setjmp.h` typedefs** —
+ same "parse, don't codegen" softening as floats. tcc.c carries all
+ of these under `HAVE_BITFIELD` / `HAVE_SETJMP` gates that are off
+ but leave the declarations in the flattened text.
+- **Existing CC-PARALLEL-PLAN items** — struct-return ABI streams
+ A1/A2/A3, compound literals (B), const-expr evaluator (C), union
+ offsets + sizeof no-eval (D). tcc.c uses every one of these.
+- **Throughput / wall-clock** — even with the heap fix, lex+pp+parse
+ on 18 896 lines under scheme1 is going to be slow. Worth measuring
+ before optimizing.
+
+## Suggested order of attack
+
+1. Soften float/double, then bitfields, then setjmp types into
+ "parse, codegen rejects ops". Small, mechanical, unblocks the gate
+ to actual triage.
+2. Compatible-redecl in `scope-bind!`, the typedef table, and the tag
+ table.
+3. Lex/pp memory pass + heap/p_memsz bumps.
+4. Re-run on `tcc.flat.c`. At this point work re-converges with
+ [CC-PARALLEL-PLAN.md](CC-PARALLEL-PLAN.md).
+
+Hitting milestone 4 in [CC.md §Validation milestones](CC.md) — "Compile
+tcc.c (under the tcc-mes defines) → tcc-lispcc; verify
+`tcc-lispcc -version` runs" — is the goal the items above feed into.