boot2

Playing with the boostrap
git clone https://git.ryansepassi.com/git/boot2.git
Log | Files | Refs | README

commit 67e43e7cea97347519640114712c419b7a4f9261
parent 9ecdd23fb47bc80e80f39af4bcb55069c2f59760
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Mon, 27 Apr 2026 19:57:09 -0700

docs: TCC-TODO punchlist + CC-STREAM plan

TCC-TODO.md tracks the blockers between today's scheme1-hosted cc
and a successful compile of tcc.flat.c (companion to TCC.md / CC.md /
CC-PARALLEL-PLAN.md). Captures float/double parsing as resolved
(commit 14855ba) and the lex-side memory tightening as done (commit
62dabed); blocker 2 (compatible redecl) and the pp-side memory work
remain open.

CC-STREAM.md plans the conversion from a four-phase materializing
pipeline (lex -> pp -> parse -> cg with full tok-list intermediates
at every layer) to a true single-pass streaming compiler where each
layer pulls one token at a time from upstream. Argues against GC as
the wrong tool (changes typical-case rather than worst-case memory;
much bigger lift in scheme1). Lays out a token-iter abstraction and
a three-step migration: lex-iter, pp-iter, parser repoint. Predicted
worst-case memory drops from O(source length) to O(parser state).

Diffstat:
Adocs/CC-STREAM.md | 254+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adocs/TCC-TODO.md | 179+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 433 insertions(+), 0 deletions(-)

diff --git a/docs/CC-STREAM.md b/docs/CC-STREAM.md @@ -0,0 +1,254 @@ +# Streaming the cc + +Working plan. The cc today is a four-phase pipeline that materializes +the full intermediate state at every layer: + +``` +src bv → [lex-tokenize] → tok list → [pp-expand] → tok list → [parse] → cg buf +``` + +That shape doesn't fit the data flow. Lex, pp, parse, and cg are each +inherently sequential, with small bounded lookahead. The full token +list is never *needed*; it's just where the current code parks +intermediate state on the heap. For [tcc.flat.c](TCC.md) (608 KB) the +lex output alone runs ~170 MB at ~280 B/byte even after the +[per-token mark/rewind tightening](TCC-TODO.md#blocker-3). pp adds +another similar slug on top. We've doubled the ELF p_memsz twice +already; the cleaner answer is to stop materializing. + +This doc plans the conversion to a true single-pass streaming +compiler: lex, pp, parse, and cg all run concurrently, each pulling +one token at a time from the stage upstream. Steady-state live data +is bounded by *lookahead* (≤2 tokens) plus parser state (typedef +table, scope chain, macro table, cg output buffer), not by source +length. + +## Why streaming, not GC + +A real GC would fix the symptom — the per-byte multiple is mostly +dead intermediate state, and a GC would reclaim it eventually — but +the underlying shape would still be wrong. Adding a GC to scheme1 is +also a much bigger lift than reshaping the cc: + +- scheme1 is shared by every bootstrap stage; touching its allocator + affects M1pp, p1, scheme1's own self-host, and anything else built + on top. Streaming is local to cc. +- A GC pass needs to trace through closures, records, env alists, and + the bytevector heap layout. It's a multi-week project against + scheme1's current single-page-of-allocator code. +- Memory headroom for the cc has a hard ceiling (ELF p_memsz, which + must be a constant in the boot ELF). A GC narrows the *typical* + high-water mark but doesn't change the *worst case* the cap has to + cover. Streaming changes the worst case from O(source length) to + O(parser state). +- We're aiming for predictability: the cc is the bottleneck stage + that needs to compile multi-MB inputs. A bounded steady-state model + is easier to reason about than a "GC will handle it" model. + +GC stays a sensible move for scheme1 *eventually* — every bootstrap +stage would benefit. But for closing the tcc.flat.c gap it's the +wrong tool. + +## Symmetry argument + +Each layer in the pipeline has small bounded lookahead: + +| layer | lookahead / buffering | +|--------|------------------------------------------------------| +| lex | ≤3 source bytes (trigraph), 1 token internally | +| pp | 1 token peek; arg-collection during fn-macro expansion (bounded by call site) | +| parse | 2 tokens (`peek2` is the deepest the grammar uses) | +| cg | 0 (already streams to cg-text/cg-data buffers) | + +Live tokens across the pipeline are O(macro-arg-list + +parse-lookahead) — call it dozens. Persistent state (parser tables ++ macro definitions + cg output) grows with declared symbols, not +source length. For tcc.flat.c that's an estimated 5–20 MB. + +## Iter abstraction + +A single record + three operations per layer: + +```scheme +(define-record-type tok-iter + (%tok-iter pull-fn state buf) + tok-iter? + (pull-fn tok-iter-pull-fn) ; thunk: state -> tok + (state tok-iter-state) ; per-layer + (buf tok-iter-buf tok-iter-buf-set!)) ; pushed-back toks (small list) + +(define (iter-next it) ...) ; consume + return next tok (EOF sentinel) +(define (iter-peek it) ...) ; like next, but cache in buf +(define (iter-unget! it t) ...) ; push back; next iter-next returns t +``` + +Each layer wraps the layer below: `lex-iter` over bytes, `pp-iter` +over `lex-iter`, `pstate` over `pp-iter`. The iter discipline +replaces the current "list of toks" everywhere a sequence is +consumed. + +## Mapping the layers + +### lex-iter + +Drop the per-iteration `(reverse (cons … acc))` driver in +[cc/cc.scm](../cc/cc.scm) `lex-tokenize`. State holds (pos, line, +col, bol?). `lex-iter-next` runs roughly one iteration of the current +loop body: skip ws+comments, dispatch, allocate one tok+loc, return. + +The per-token `heap-mark` / `heap-rewind!` discipline I just landed +transposes one-for-one — same `%lex-scratch`, same scalar boxes, same +post-rewind rebuild of tok+loc. The difference is who advances: +instead of the loop driving itself to EOF, the *consumer* (pp-iter) +pulls one tok per `iter-next` call. + +### pp-iter + +Most invasive. State holds (lex-iter, pending-toks, macros, +cond-stack, current-loc). `pp-iter-next`: + +1. If `pending` non-empty, pop and return. +2. Pull from `lex-iter`. +3. If it's `HASH` at bol: collect the directive line into a small + buffer (bounded by one logical line, terminated by NL), dispatch, + recurse. +4. If it's a defined ident not in its hide set: collect args + (function-like — bounded by the call site), expand into + `pending`, recurse. +5. If it's `STR` and the next non-trivia tok is `STR`: fuse on the + fly (replaces today's post-pass `%pp-merge-adjacent-strs`). +6. Else return. + +The hide set tracking already runs per-token in `pp-expand`; the +streaming version threads it through `pending` instead of through a +flat list. + +### parser + +Already uses `ps-peek` / `ps-advance` (see +[CC.md §Parser](CC.md)). Repoint them at `pp-iter` instead of +indexing into a list. Add per-stmt `heap-mark` / `heap-rewind!` in +`parse-stmt` so that the parse-time interpreter overhead also gets +reclaimed at statement boundaries. Survivors that must cross the +rewind (typedef bvs, scope-bound symbols, macro bodies) need to be +allocated below the mark — same discipline as the lex driver, just +one layer up. + +### cg + +Already streams to cg-text/cg-data buffers. No change. + +## Subtle bits + +- **#if cexpr** evaluates a fully-collected directive line — keep + that as a *bounded* buffer (one preprocessor line, terminated by + NL). Don't try to stream cexpr eval. +- **Function-macro arg collection** also bounded buffer (until + matching `)`). Worst case is multi-line, but still bounded by the + call-site span. The only real memory concern; if a macro call spans + 10 KB of args we hold all of it. tcc.c doesn't do that. +- **Adjacent string concat** moves from a post-pass to a peek-and- + fuse in `pp-iter`. Easy. +- **Builtin macros** (`__FILE__`, `__LINE__`, `__DATE__`, …) need the + current loc at expansion time, not at lex time. The pp-iter has it + because it tracks lex's current loc. +- **EOF discipline.** Each iter must yield exactly one EOF tok and + then keep yielding EOF on subsequent calls. The parser already + expects EOF as a stable sentinel. +- **Lookahead beyond 2.** The parser uses `peek` and `peek2`; nothing + deeper. If something needs 3-token lookahead later, the unget + buffer accommodates it without changing the iter shape. +- **Per-stmt rewind survivors.** A statement may bind a typedef + whose name must outlive the rewind. The mark goes *after* the + typedef-add path, or the bv is allocated below the mark — same + trick the lex driver uses with `%lex-scratch`. Easier than it + sounds because typedef adds happen at the top of `parse-decl`, + which is well before any deep parse-time scratch. + +## Migration path + +Three roughly self-contained PRs. Each is independently testable — +the existing test suites already drive only the public API +(`lex-tokenize` for cc-lex, `pp-expand` for cc-pp, the cc front end +for cc + cc-cg). Adapter wrappers preserve those entry points +through the migration. + +### Step 1 — lex-iter + +Convert `lex-tokenize` to `make-lex-iter` + `lex-iter-next`. Leave +`lex-tokenize` as a thin wrapper that drains the iter into a list, +so cc-lex tests pass unchanged. Touchpoints: +[cc/cc.scm](../cc/cc.scm) lex section, [cc/main.scm](../cc/main.scm). + +After this step: lex output is still consumed by today's +`pp-expand` via the wrapper. Identity-preserving; no behavior change +visible to tests. + +### Step 2 — pp-iter + +Convert `pp-expand` to `make-pp-iter` + `pp-iter-next`. Directives +and macro expansion all move from "list-walking" style to +"stream + pending-buffer" style. Same logic, different shape. Leave +a `pp-expand` wrapper that drains the iter into a list for +back-compat. Heaviest single PR. + +After this step: pp consumes lex via the iter; parser still drains +pp into a list. + +### Step 3 — parser repoint + +Change `pstate` to hold a `pp-iter` instead of a tok-list cursor. +`peek` / `peek2` / `advance` redirect to iter ops. Drop the wrappers +from steps 1–2 once the cc-cg / cc tests pass. + +Add per-stmt `heap-mark` / `heap-rewind!` in `parse-stmt`. This is +the layer where the *parser* per-form interpreter overhead gets +reclaimed. + +After this step: full streaming. Memory bound shifts from O(source +length) to O(parser state). + +## Expected impact + +For tcc.flat.c (608 KB): + +| state | predicted high water | +|-------------------------------|---------------------:| +| current (post-lex tightening) | ~170 MB during lex; pp blows the 256 MiB cap | +| post-step-1 | unchanged (same materialization at pp/parser) | +| post-step-2 | lex no longer materialized; pp output still listed for parser. ~80 MB? | +| post-step-3 | full streaming. Bounded by parser state. ~10–30 MB. | + +Numbers are guesses until we measure, but the shape is clear: most +of the per-byte transient cost we pay today disappears not because +we got faster but because we never store that intermediate at all. + +## Open questions + +- **Hide-set sharing.** Today the same hide-set list can be shared + across many tokens because it's immutable. In streaming, the same + property holds — pp-iter just hands out shared references — but + we should confirm there's no in-place mutation lurking in the pp + code paths that gets converted. +- **Diagnostics ergonomics.** When the parser dies on an unexpected + token, today the line/col come from `tok-loc`. Same in streaming; + no change. +- **Test runner adapters.** cc-lex and cc-pp test runners build full + token lists for inspection. Keep their wrappers (`lex-tokenize`, + `pp-expand`) as drain-to-list adapters. Internally the test + fixtures don't change. +- **`#include`.** Currently rejected ([CC.md §Toolchain envelope](CC.md)). + If we ever want to support it, the streaming model is friendlier: + the include-file's lex-iter is pushed onto a stack inside pp-iter, + and `pp-iter-next` drains it before resuming the outer file. Out + of scope for the tcc.flat.c push but worth noting. + +## Related + +- [TCC-TODO.md](TCC-TODO.md) — blocker 3 covers the same memory + problem from the per-byte angle. The lex-side tightening landed + there is what step 1 of this plan transposes onto an iter. +- [CC.md](CC.md) — the language and toolchain spec the cc implements. +- [CC-PARALLEL-PLAN.md](CC-PARALLEL-PLAN.md) — independent feature + punchlist; streaming is orthogonal but unblocks running on + tcc.flat.c so those features can be exercised end-to-end. diff --git a/docs/TCC-TODO.md b/docs/TCC-TODO.md @@ -0,0 +1,179 @@ +# tcc.flat.c via scheme cc — known blockers + +Working punch list for what stands between today's scheme1-hosted cc +and a successful compile of `tcc.flat.c`. Companion to +[TCC.md](TCC.md) (the surrounding three-stage pipeline), [CC.md](CC.md) +(the C subset the cc accepts), and [CC-PARALLEL-PLAN.md](CC-PARALLEL-PLAN.md) +(the per-feature `tests/cc/` punchlist). The blockers below are the +ones that fire *before* we even reach the parallel-plan items. + +## Repro + +`tcc.flat.c` is the stage-1 artifact. Generate it with the host +preprocessor: + +``` +sh scripts/stage1-flatten.sh --arch X86_64 +# -> build/cc-bootstrap/X86_64/tcc.flat.c (608 KB, 18 896 lines, 0 directives) +``` + +Run the catm'd scheme cc against it inside the per-arch container. +The cc-debug flag prints heap usage between phases on stderr: + +``` +podman run --rm --pull=never --platform linux/arm64 \ + --tmpfs /tmp:size=512M -e ARCH=aarch64 \ + -v "$(pwd)":/work -w /work boot2-busybox:aarch64 \ + build/aarch64/scheme1 build/aarch64/cc/cc.scm --cc-debug \ + build/cc-bootstrap/X86_64/tcc.flat.c /tmp/tcc.flat.P1pp +``` + +Prerequisites: `make scheme1 cc ARCH=aarch64` (or any other arch) so +`build/$ARCH/scheme1` and `build/$ARCH/cc/cc.scm` exist. + +For triage, the small-prefix probe is useful: + +``` +head -c 50000 build/cc-bootstrap/X86_64/tcc.flat.c \ + > build/cc-bootstrap/X86_64/tcc.head.c +# then re-run the podman invocation against tcc.head.c +``` + +## Blocker 1 — `float` / `double` hard-rejected in type specifiers ✓ done + +Resolved: `parse-decl-spec` now accepts `float`, `double`, and `long +double` as type specifiers, mapping them to `%t-flt` / `%t-dbl` / +`%t-ldbl` ctypes (sizes 4/8/8). `_Complex` / `_Imaginary` are eaten as +silent qualifiers. The five other type-name-start predicates +(`paren-is-group?`, `stmt-starts-decl?`, `token-is-decl?`, +`%const-paren-is-cast?`, `%const-tok-is-decl?`, plus the +`parse-cast-or-unary` guard) recognize the new kw set. The cg guards +fp ld/st/cast via `%cg-fp-reject!` so any attempt to materialize an fp +value dies with `fp not codegen'd` rather than silently emitting +integer code. Coverage: `tests/cc/119-float-parse.c` exercises +prototypes, struct layout, and `sizeof(float|double|long double)`. +Confirmed end-to-end: a 5 KB prefix of `tcc.flat.c` now reaches +blocker 2 (`dup decl: _exit`) instead of dying on `double`. + +## Blocker 2 — compatible redeclaration rejected + +Site: [cc/cc.scm](../cc/cc.scm) `scope-bind!`, line 3667. + +```scheme +(if (alist-ref n top) (die #f "dup decl" n) + (ps-scope-set! ps (cons (alist-set n s top) r))) +``` + +C allows redundant identical extern declarations and prototypes that +differ only in parameter names. The flattened TU has many — `_exit` +appears in both `<stdlib.h>` and `<unistd.h>`; `strlen` is declared +once with a named param and once without; etc. + +Fix: when the name is already bound at the same scope, allow the +redeclaration if the new type is compatible with the existing one +(extern decls of the same function; pointer/array decay equivalents). +Only definition-after-definition or genuinely conflicting types should +die. The same compatible-redecl rule needs to land in the typedef +table and the tag table for forward-decl chains. + +Observed (with blocker 1 fixed, on a 5 KB prefix of tcc.flat.c): + +``` +error: dup decl: _exit +``` + +## Blocker 3 — lex+pp heap usage at ~3 KB per source byte (lex side fixed) + +Site: lex/pp pipeline in [cc/cc.scm](../cc/cc.scm) plus heap caps in +[scheme1/scheme1.P1pp](../scheme1/scheme1.P1pp) and +[vendor/seed/](../vendor/seed/)`<arch>/ELF.hex2`. + +### Lex side — done + +Investigation traced the bulk of the per-byte cost to scheme1's +interpreter calling convention: every closure call allocates ~32 B +per param via `bind_params`, every `let*` binding allocates two cons +cells, every `eval_args` allocates one cons per arg, and tail-rec +named-let in the inner scanners (`%scan-while`, `%fill-while-bv`, +`%match-bytes`, `%accum-int-while`) re-applies a closure per scanned +byte. None of this overhead is reclaimed across token boundaries — +the lex helpers' own internal `heap-mark` / `heap-rewind!` discipline +trims their per-helper scratch but the result tok + 4-list lives +below any putative outer mark, so a driver-level rewind needs to +rebuild survivors from pre-mark scalar boxes. + +Fix landed in `cc/cc.scm`: `lex-tokenize` now wraps each iteration in +its own `heap-mark` / `heap-rewind!`. Scalar fields (kind, value for +non-bv tokens, npos/nline/ncol) ferry across the rewind via outer-let +bindings (set! mutates the binding cdr in place; no heap). Bv-typed +values (IDENT, STRING) ferry through `%lex-scratch`, a single sticky +bytevector allocated below all marks; post-rewind the driver copies +back into a fresh bv sized exactly to the content. tok+loc+acc-cons +are rebuilt post-rewind. + +Measured (`tests/cc/001-kitchen-sink.c`, 9.7 KB): + +``` +phase=slurp: heap 1.23 MB src-bytes 9782 +phase=lex: heap 3.40 MB (delta ~2.17 MB → ~222 B/byte; 13× better) +phase=pp: heap 8.58 MB +``` + +A 50 KB prefix of `tcc.flat.c` (which used to exhaust the heap during +lex) now lex+pp's cleanly through to blocker 2 (`dup decl: _exit`). +Lex finishes at 15.6 MB, pp at 41.3 MB. Extrapolating 222 B/byte to +the full 608 KB input gives ~135 MB during lex+pp — close enough to +the 128 MB `p_memsz` cap that a single bump (and a bigger +`HEAP_CAP_BYTES`) closes the gap. + +We also considered an interning pool (one bv per unique IDENT/STRING, +shared across token instances). With scheme1's interpreter, walking a +cons-list pool — even bucketed 16-way — costs ~50–150 B per step in +`bind_params`/`eval_args`/named-let frame overhead. The walk costs +out-paced the bv-allocation savings: a 50 KB prefix went 60+ MB with +buckets vs ~17 MB un-interned. Until scheme1 grows a true vector +primitive (so bucket access is O(1) without interpreter overhead), +no-intern is the cleanest path. The `%lex-scratch` infrastructure is +ready to layer interning back on top if a downstream eq?-fast-path +turns out to need it. + +### Remaining + +- **pp side memory.** pp still copies the lex token list into a + buf-list and walks it; `%pp-take-line` reverses tokens. There's + probably similar interpreter-overhead concentrate to drain. Worth + the same investigation pass once we're past blocker 2. +- **Bump arena caps.** Even with the lex fix the full TU is many MB + of tokens. `READBUF_CAP_BYTES`, `HEAP_CAP_BYTES`, and the ELF + `p_memsz` literal must all grow in lockstep before tcc.flat.c will + finish. + +## Suspected next-tier blockers (not yet observed) + +Past blockers 1–3 the next wave we expect: + +- **`_Bool`, bitfield-typed struct fields, `setjmp.h` typedefs** — + same "parse, don't codegen" softening as floats. tcc.c carries all + of these under `HAVE_BITFIELD` / `HAVE_SETJMP` gates that are off + but leave the declarations in the flattened text. +- **Existing CC-PARALLEL-PLAN items** — struct-return ABI streams + A1/A2/A3, compound literals (B), const-expr evaluator (C), union + offsets + sizeof no-eval (D). tcc.c uses every one of these. +- **Throughput / wall-clock** — even with the heap fix, lex+pp+parse + on 18 896 lines under scheme1 is going to be slow. Worth measuring + before optimizing. + +## Suggested order of attack + +1. Soften float/double, then bitfields, then setjmp types into + "parse, codegen rejects ops". Small, mechanical, unblocks the gate + to actual triage. +2. Compatible-redecl in `scope-bind!`, the typedef table, and the tag + table. +3. Lex/pp memory pass + heap/p_memsz bumps. +4. Re-run on `tcc.flat.c`. At this point work re-converges with + [CC-PARALLEL-PLAN.md](CC-PARALLEL-PLAN.md). + +Hitting milestone 4 in [CC.md §Validation milestones](CC.md) — "Compile +tcc.c (under the tcc-mes defines) → tcc-lispcc; verify +`tcc-lispcc -version` runs" — is the goal the items above feed into.