commit 148a86184a0c6be28162f05f6a39ef921f3335f9
parent 3d0c80c1d6e1c5b0757cd6597ad2ca4f97a3af60
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Mon, 4 May 2026 10:29:58 -0700
boot3: tcc{0,1,2,3} naming, stage E fixed-point check, auto-flatten
Rename boot3.sh outputs to tcc0..tcc3 (cc.scm-built; tcc0-built;
tcc1-built; tcc2-built). Add stage E that builds tcc3 and asserts
tcc2 == tcc3 byte-for-byte — the actual self-host fixed point.
tcc0 -> tcc1 isn't a fixed point because cc.scm's emitted machine
code introduces subtle codegen-decision differences in tcc0's
behavior; once tcc is the compiler (tcc1 onward) the chain
converges.
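The convergence claim above can be checked by comparing adjacent stages. A minimal sketch, assuming the four outputs live in one directory as `tcc0`..`tcc3`; the helper name `first_fixed_point` is hypothetical and is not part of boot3.sh, which only asserts tcc2 == tcc3:

```sh
#!/bin/sh
# Hypothetical helper (not in boot3.sh): print the first adjacent pair of
# stages that is byte-identical, or fail if the chain never converges.
first_fixed_point() {
    dir=$1
    for n in 0 1 2; do
        next=$((n + 1))
        # cmp -s is silent and exits 0 only when both files are identical.
        if cmp -s "$dir/tcc$n" "$dir/tcc$next"; then
            echo "tcc$n == tcc$next"
            return 0
        fi
    done
    echo "no fixed point among tcc0..tcc3" >&2
    return 1
}
```

Called as `first_fixed_point build/riscv64/boot3`, this would be expected to report `tcc2 == tcc3` on riscv64, given the divergence described in this commit.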
Auto-invoke stage1-flatten.sh / libc-flatten.sh on the host when
their outputs are missing, logged with "(host)" matching the
existing cross-asm step.
TCC-TODO: replace the previous "tcc-tcc != tcc-tcc-tcc on riscv64"
(framed as a linker layout sensitivity) with the actual story —
cc.scm has a behavioral bug on riscv64 that surfaces as missed
immediate-folding peepholes in tcc0(tcc.flat.c). Open question for
later: which cc.scm predicate is mis-evaluating to make tcc0 fall
into the materialize-imm branch where tcc1 takes the fold-imm
branch.
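To make the materialize-imm versus fold-imm distinction concrete, here is an illustrative sketch of how one mis-evaluated range check could flip the decision. Everything in it is an assumption for illustration: the function names are hypothetical, the checks are not copied from riscv64-gen.c, and signed-vs-unsigned confusion is only one candidate cause. The one fixed fact is that RISC-V I-type instructions such as addi, addiw and andi take a signed 12-bit immediate.

```sh
#!/bin/sh
# Illustration only: these are NOT the predicates in riscv64-gen.c.
# A constant can be folded into an I-type instruction only when it lies
# in the signed 12-bit range [-2048, 2047].

# Signed view: the intended check.
fits_imm12_signed() {
    [ "$1" -ge -2048 ] && [ "$1" -le 2047 ]
}

# Unsigned view: the same 32-bit pattern reinterpreted as unsigned before
# the range check. Negative immediates become large positives and fail.
# Assumes the shell's arithmetic is at least 64 bits wide.
fits_imm12_unsigned() {
    u=$(( $1 & 0xFFFFFFFF ))
    [ "$u" -le 2047 ]
}

# $1 = predicate function name, $2 = immediate value.
choose_path() {
    if "$1" "$2"; then echo fold; else echo materialize; fi
}

for imm in 1 -1 2047 -2048 4096; do
    printf '%6s signed=%s unsigned=%s\n' "$imm" \
        "$(choose_path fits_imm12_signed "$imm")" \
        "$(choose_path fits_imm12_unsigned "$imm")"
done
```

With these toy predicates, -1 and -2048 fold under the signed check but fall to the materialize path under the unsigned one, while 1, 2047 and 4096 are classified the same by both. Whether the real divergence has this shape is still the open question.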
Diffstat:
| M | docs/TCC-TODO.md | | | 97 | ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------- |
| M | scripts/boot3.sh | | | 144 | ++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------------- |
2 files changed, 170 insertions(+), 71 deletions(-)
diff --git a/docs/TCC-TODO.md b/docs/TCC-TODO.md
@@ -210,32 +210,77 @@ For now: known limitation, document, move on. The scalar codegen
elsewhere on riscv64 is fine — only u32 narrowing of a wider source
trips it.
-### riscv64: tcc-tcc != tcc-tcc-tcc when boot3.sh assembles its own .S
-
-`scripts/boot3.sh riscv64` builds tcc-boot2 → tcc-tcc → tcc-tcc-tcc
-through tcc-boot2's own riscv64 assembler (no host cross-asm). All
-three binaries run, but the bootstrap fixed-point property breaks:
-`tcc-tcc` is 305656 bytes, `tcc-tcc-tcc` is 305464 bytes.
-
-The Makefile path (which uses alpine-gcc to pre-build start.o /
-sys_stubs.o) hits fixed-point on the same tcc-boot2: 305624 bytes for
-both stages. amd64 also hits fixed-point through boot3.sh's
-in-container assembler. So this is riscv64 + tcc-asm-built .o
-specific.
-
-Triage: tcc-boot2 and tcc-tcc emit byte-identical start.o /
-sys_stubs.o when given the .S inputs — the asm path is consistent.
-The divergence is in the *link* step: tcc-boot2's linker, when fed
-tcc-asm-built start.o (vs alpine-gcc-built start.o of the same
-semantic content but different ELF section/symbol layout), produces a
-tcc-tcc that is 32 bytes larger and miscompiles tcc.flat.c by 192
-bytes on the next stage. Likely a layout-sensitivity in tcc's riscv64
-linker around `.rela.text`/symtab ordering or alignment, but not yet
-isolated.
-
-Workarounds: keep using the Makefile / alpine-gcc path on riscv64 if
-fixed-point matters; the boot3.sh path is fine for "produce a working
-tcc" but not for the self-host idempotence check.
+### tcc0 → tcc1 is not a fixed point on riscv64 (cc.scm behavioral bug)
+
+`boot3.sh` produces four staged compilers:
+
+- `tcc0` = tcc-source compiled by cc.scm
+- `tcc1` = tcc-source compiled by tcc0
+- `tcc2` = tcc-source compiled by tcc1
+- `tcc3` = tcc-source compiled by tcc2
+
+The fixed-point check is **`tcc2 == tcc3`** (asserted at the end of
+`boot3.sh`, verified on aarch64, amd64, riscv64). On riscv64 the
+stronger `tcc1 == tcc2` does *not* hold: `tcc0(tcc.flat.c)` produces
+a 616100-byte `.o` while `tcc1(tcc.flat.c)` and `tcc2(tcc.flat.c)`
+produce a byte-identical 615892-byte `.o`, so tcc0's output is 208
+bytes larger (200 in `.text` + 8 ripple in symtab/reloc offsets).
+amd64 and aarch64 satisfy `tcc1 == tcc2`; only riscv64 diverges.
+
+This is a **bug to investigate**, not just a "fatter code"
+observation. cc.scm should be a *faithful* (semantics-preserving)
+compiler — slower or larger output is acceptable, but tcc0 and tcc1
+must produce byte-identical output when run on the same source.
+That they don't on riscv64 means cc.scm's translation of tcc.flat.c
+into tcc0 changed what tcc0 *does at runtime*, not just how it's
+encoded. We don't care about peephole optimizations being missed; we
+do care that tcc0 makes different codegen decisions than tcc1
+makes.
+
+#### What's known
+
+The visible symptom: tcc0 emits four RISC-V codegen patterns
+differently from tcc1:
+
+| Source pattern | tcc0 emits | tcc1 emits | Δ |
+|---|---|---|---|
+| `x = x - imm` (i32) | `addiw t,zero,imm; addw rd,rs,t` | `addiw rd,rs,imm` | +4 B |
+| `x = x & imm` | `addiw t,zero,imm; and rd,rs,t` | `andi rd,rs,imm` | +4 B |
+| zero-ext after `sext.w` | `sext.w r,r; slli r,r,0x20; srli r,r,0x20` | `sext.w r,r` | +8 B |
+| `x == 0xFFFFFFFF` (i32) | `addiw t,zero,-1; slli/srli; beq x,t,L` | `addi x,x,1; beqz x,L` | +8 B |
+
+These are decision points in `riscv64-gen.c` (immediate-folding,
+zero-ext elision). Same source code, same input C, but the running
+tcc0 takes the slow branch where the running tcc1 takes the fast
+one — even though both are compiled from the same `tcc.flat.c`.
+
+#### Hypothesis to test
+
+cc.scm likely miscompiles an integer comparison or bit-test inside
+the immediate-fits-in-instruction guard in `riscv64-gen.c`. Most of
+the missed patterns share the shape `if (small_int_fits) { fold } else
+{ materialize }`. If cc.scm gets the predicate wrong (e.g. signed vs.
+unsigned compare, or wrong branch on a particular bit pattern), tcc0
+falls into the materialize path on inputs where tcc1 takes the fold
+path.
+
+#### Repro / starting point
+
+```sh
+# In the riscv64 container with boot3 outputs present:
+$TCC0 -nostdlib -I $TCC_INC -include $SHIM -c -o /tmp/flat-tcc0.o tcc.flat.c
+$TCC1 -nostdlib -I $TCC_INC -include $SHIM -c -o /tmp/flat-tcc1.o tcc.flat.c
+# wc -c /tmp/flat-tcc0.o /tmp/flat-tcc1.o → 616100 vs 615892
+# objdump -d both, normalize addresses, diff to find divergent functions
+```
+
+The first divergent function in disassembly is `tal_free_impl` — a
+small refcount-decrement that hits the "x = x - 1" pattern. Good
+starting point because the function is short and the source path is
+narrow.
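
The normalize-and-diff step from the repro can be sketched as a small filter. This assumes a GNU-style `objdump` that can disassemble riscv64 objects and supports `--no-show-raw-insn`; the function name `normalize_disasm` and the `/tmp/*.dis` paths are hypothetical.

```sh
#!/bin/sh
# Strip position-dependent noise from `objdump -d` output read on stdin:
#   "   1c:<TAB>insn ..."        -> "insn ..."
#   "0000000000000040 <func>:"   -> "<func>:"
# The "file format" banner is dropped because it embeds the input filename.
normalize_disasm() {
    sed -E \
        -e '/file format/d' \
        -e 's/^[[:space:]]*[0-9a-f]+:[[:space:]]+//' \
        -e 's/^[0-9a-f]+ (<[^>]+>:)$/\1/'
}

# Intended use (commented out: needs the two .o files from the repro):
# objdump -d --no-show-raw-insn /tmp/flat-tcc0.o | normalize_disasm >/tmp/tcc0.dis
# objdump -d --no-show-raw-insn /tmp/flat-tcc1.o | normalize_disasm >/tmp/tcc1.dis
# diff -u /tmp/tcc0.dis /tmp/tcc1.dis | head -n 40
```

Branch and jump targets may still carry offsets that shift after the first size difference, so the first diff hunk is the most trustworthy one; later hunks can be partly noise.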
+
+Until this is fixed, tcc1 is the "shake-out" stage and tcc2 is the
+canonical compiler.
## Standalone `bootN.sh`: remaining host deps
diff --git a/scripts/boot3.sh b/scripts/boot3.sh
@@ -1,21 +1,37 @@
#!/bin/sh
-## boot3.sh — standalone three-stage tcc bootstrap.
+## boot3.sh — standalone four-stage tcc bootstrap.
##
-## README's `(define tcc (tcc1 tcc.c))`: produces tcc-boot2 (cc.scm
-## compiles tcc.flat.c), tcc-tcc (tcc-boot2 compiles tcc.flat.c), and
-## tcc-tcc-tcc (tcc-tcc compiles tcc.flat.c). Stages 2 and 3 are the
-## bootstrap fixed-point check.
+## README's `(define tcc (tcc1 tcc.c))`: cc.scm compiles tcc.flat.c
+## into tcc0; tcc0 compiles tcc.flat.c into tcc1; tcc1 does the same
+## to produce tcc2; tcc2 does the same to produce tcc3.
##
-## ─── Inputs (host-side preconditions, NOT produced by this script) ───
+## tcc0 = tcc-source compiled by cc.scm
+## tcc1 = tcc-source compiled by tcc0
+## tcc2 = tcc-source compiled by tcc1
+## tcc3 = tcc-source compiled by tcc2
+##
+## The bootstrap fixed-point check is `tcc2 == tcc3`: once tcc is
+## compiling itself with no help from cc.scm, the chain reaches a
+## byte-identical fixed point. tcc0 ≠ tcc1 in *behavior* (not just
+## in code size) because cc.scm's emitted machine code introduces
+## subtle codegen-decision differences. For example, on riscv64 the
+## cc.scm-built tcc0 misses several immediate-folding peepholes that
+## tcc1 applies, so tcc0(tcc.flat.c) emits ~200 more bytes of `.text`
+## than tcc1(tcc.flat.c) does. tcc1 behaves as faithful tcc (tcc0
+## translates tcc.flat.c with its semantics intact, just with fatter
+## code); tcc2 is the first binary whose machine code was emitted by
+## a faithfully behaving tcc.
+##
+## ─── Inputs (host-side, auto-built if missing) ────────────────────────
## build/tcc/$TCC_TARGET/tcc.flat.c
-## — flattened tcc TU (run scripts/stage1-flatten.sh
-## --arch $TCC_TARGET to produce)
## build/tcc/$TCC_TARGET/tcc-0.9.26-1147-gee75a10c/{include,lib}
-## — tcc-0.9.26 unpacked tree (side-product of
-## stage1-flatten.sh)
+## — flattened tcc TU + unpacked tree; built
+## via scripts/stage1-flatten.sh --arch
+## $TCC_TARGET (host cc -E, no container)
## build/$ARCH/vendor/mes-libc/libc.flat.c
-## — flattened mes-libc TU (run
-## scripts/libc-flatten.sh --arch $ARCH)
+## — flattened mes-libc TU; built via
+## scripts/libc-flatten.sh --arch $ARCH
+## (host cc -E, no container)
##
## ─── Inputs (sources, copied into staging) ────────────────────────────
## scheme1/prelude.scm cc/cc.scm cc/main.scm — catm'd to cc.scm bundle
@@ -24,6 +40,9 @@
## vendor/seed/$ARCH/ELF.hex2 — ELF header fragment
## tcc-libc/$ARCH/start.S — _start, calls __libc_init+main
## tcc-libc/$ARCH/sys_stubs.S — sys_* syscall wrappers
+## (Throughout this script: tcc0/tcc1/tcc2/tcc3 are the four stages
+## above; tcc0 is the cc.scm-built bootstrap, tcc2/tcc3 form the
+## self-host fixed-point check.)
## tcc-libc/va_list_shim.h — gcc/tcc va_list bridge
## tcc-cc/mem.c — memcpy/memmove/memset/memcmp
## build/tcc/$TCC_TARGET/tcc-0.9.26-1147-gee75a10c/include/** (whole tree)
@@ -45,12 +64,15 @@
## tcc 0.9.26 has no aarch64 assembler (no arm64-asm.c),
## so .S inputs are pre-compiled host-side. amd64 and
## riscv64 have CONFIG_TCC_ASM in their backends and feed
-## .S straight to tcc-boot2 in stages C+D — no host tool.
+## .S straight to tcc0 in stages C+D — no host tool.
##
## ─── Outputs ──────────────────────────────────────────────────────────
-## build/$ARCH/boot3/tcc-boot2 — cc.scm-built tcc (compile 1)
-## build/$ARCH/boot3/tcc-tcc — tcc-boot2-built tcc (compile 2)
-## build/$ARCH/boot3/tcc-tcc-tcc — tcc-tcc-built tcc (compile 3)
+## build/$ARCH/boot3/tcc0 — cc.scm-built tcc (compile 1)
+## build/$ARCH/boot3/tcc1 — tcc0-built tcc (compile 2)
+## build/$ARCH/boot3/tcc2 — tcc1-built tcc (compile 3)
+## build/$ARCH/boot3/tcc3 — tcc2-built tcc (compile 4)
+## tcc2 and tcc3 are byte-identical (asserted at the end of this
+## script) — that equality is the fixed-point check.
##
## Usage: scripts/boot3.sh <arch>
## <arch> ∈ {aarch64, amd64, riscv64}
@@ -108,10 +130,17 @@ fi
[ -x "$BOOT2/scheme1" ] || { echo "[boot3 $ARCH] missing $BOOT2/scheme1 (run scripts/boot2.sh $ARCH)" >&2; exit 1; }
# ── prerequisite: host-flattened sources + unpacked tcc tree ──────────
-[ -e "$TCC_FLAT" ] || { echo "[boot3 $ARCH] missing $TCC_FLAT (run scripts/stage1-flatten.sh --arch $TCC_TARGET)" >&2; exit 1; }
-[ -e "$LIBC_FLAT" ] || { echo "[boot3 $ARCH] missing $LIBC_FLAT (run scripts/libc-flatten.sh --arch $ARCH)" >&2; exit 1; }
-[ -d "$TCC_DIR/include" ] || { echo "[boot3 $ARCH] missing $TCC_DIR/include (run scripts/stage1-flatten.sh --arch $TCC_TARGET)" >&2; exit 1; }
-[ -e "$TCC_DIR/lib/$LIB_HELPER_SRC" ] || { echo "[boot3 $ARCH] missing $TCC_DIR/lib/$LIB_HELPER_SRC" >&2; exit 1; }
+# tcc.flat.c + the unpacked $TCC_DIR/{include,lib} tree are produced
+# together by stage1-flatten.sh; libc.flat.c by libc-flatten.sh. Both
+# run on the host (cc -E), no container — auto-invoke if missing.
+if [ ! -e "$TCC_FLAT" ] || [ ! -d "$TCC_DIR/include" ] || [ ! -e "$TCC_DIR/lib/$LIB_HELPER_SRC" ]; then
+ echo "[boot3 $ARCH] flatten tcc.flat.c (host)"
+ scripts/stage1-flatten.sh --arch "$TCC_TARGET"
+fi
+if [ ! -e "$LIBC_FLAT" ]; then
+ echo "[boot3 $ARCH] flatten libc.flat.c (host)"
+ scripts/libc-flatten.sh --arch "$ARCH"
+fi
# ── reset staging, copy inputs explicitly ─────────────────────────────
rm -rf "$STAGE"
@@ -149,7 +178,7 @@ cp "$TCC_DIR/lib/$LIB_HELPER_SRC" "$STAGE/in/$LIB_HELPER_SRC"
cp "$TCC_FLAT" "$STAGE/in/tcc.flat.c"
cp "$LIBC_FLAT" "$STAGE/in/libc.flat.c"
-# tcc include tree (small, < 200KB) — copied wholesale so tcc-boot2's
+# tcc include tree (small, < 200KB) — copied wholesale so tcc0's
# -I resolves stdarg.h etc. Recursive cp keeps directory layout.
cp -R "$TCC_DIR/include/." "$STAGE/in/tcc-include/"
@@ -169,13 +198,15 @@ else
ASM_BUILD_NEEDED=1
fi
-# ── run the full Stage A + B + C + D pipeline in one container ────────
+# ── run the full Stage A + B + C + D + E pipeline in one container ───
# Stage A: cc.scm bundle, libc.P1pp + tcc.flat.P1pp via scheme1 + cc.scm,
-# link tcc-boot2 ELF via M1pp + hex2pp.
-# Stage B: tcc-boot2 builds mem.o, libc.o, helper.o (va_list or lib-arm64).
-# Stage C: tcc-boot2 links tcc-tcc.
-# Stage D: tcc-tcc rebuilds helpers, links tcc-tcc-tcc.
-echo "[boot3 $ARCH] cc.scm bundle -> tcc-boot2 -> tcc-tcc -> tcc-tcc-tcc"
+# link tcc0 ELF via M1pp + hex2pp.
+# Stage B: tcc0 builds mem.o, libc.o, helper.o (va_list or lib-arm64).
+# Stage C: tcc0 compiles+links tcc1.
+# Stage D: tcc1 rebuilds helpers, compiles+links tcc2.
+# Stage E: tcc2 rebuilds helpers, compiles+links tcc3; the script
+# then asserts tcc2 == tcc3 (the fixed-point check).
+echo "[boot3 $ARCH] cc.scm -> tcc0 -> tcc1 -> tcc2 -> tcc3"
podman run --rm -i --pull=never --platform "$PLATFORM" \
--tmpfs /tmp:size=1024M \
-e LIB_HELPER_SRC="$LIB_HELPER_SRC" \
@@ -197,15 +228,15 @@ $IN/scheme1 /tmp/cc-bundled.scm --lib=libc__ $IN/libc.flat.c /tmp/libc.P1pp
# ── Stage A.3: scheme1 + cc.scm -> tcc.flat.P1pp ──────────────────────
$IN/scheme1 /tmp/cc-bundled.scm --lib=tcc__ $IN/tcc.flat.c /tmp/tcc.flat.P1pp
-# ── Stage A.4: M1pp + hex2pp pipeline -> tcc-boot2 ELF ────────────────
+# ── Stage A.4: M1pp + hex2pp pipeline -> tcc0 ELF ─────────────────────
$IN/catm /tmp/combined.M1pp \
$IN/backend.M1pp $IN/frontend.M1pp $IN/libp1pp.P1pp \
$IN/entry-libc.P1pp /tmp/libc.P1pp /tmp/tcc.flat.P1pp $IN/elf-end.P1pp
$IN/M1pp /tmp/combined.M1pp /tmp/expanded.hex2pp
$IN/catm /tmp/linked.hex2pp $IN/ELF.hex2 /tmp/expanded.hex2pp
-$IN/hex2pp -B 0x600000 /tmp/linked.hex2pp $OUT/tcc-boot2
+$IN/hex2pp -B 0x600000 /tmp/linked.hex2pp $OUT/tcc0
-# ── Stage B: tcc-boot2 builds helper objects ──────────────────────────
+# ── Stage B: tcc0 builds helper objects ───────────────────────────────
# build_asm produces start.o + sys_stubs.o into $workdir. amd64/riscv64
# assemble .S in-container via tcc's CONFIG_TCC_ASM (no -include flag —
# the asm parser doesn't accept C typedefs from va_list_shim.h's
@@ -230,33 +261,56 @@ build_helpers() {
"$cc" -nostdlib -I "$TCC_INC" $LIB_HELPER_DEFINES \
-c -o "$workdir/$LIB_HELPER_OBJ" "$IN/$LIB_HELPER_SRC"
}
-mkdir -p /tmp/stage2 /tmp/stage3
-build_asm $OUT/tcc-boot2 /tmp/stage2
-build_helpers $OUT/tcc-boot2 /tmp/stage2
+mkdir -p /tmp/stage1 /tmp/stage2 /tmp/stage3
+build_asm $OUT/tcc0 /tmp/stage1
+build_helpers $OUT/tcc0 /tmp/stage1
-# ── Stage C: tcc-boot2 -> tcc-tcc ─────────────────────────────────────
-$OUT/tcc-boot2 -nostdlib -I "$TCC_INC" -include $IN/va_list_shim.h \
+# ── Stage C: tcc0 -> tcc1 ─────────────────────────────────────────────
+$OUT/tcc0 -nostdlib -I "$TCC_INC" -include $IN/va_list_shim.h \
+ /tmp/stage1/start.o /tmp/stage1/sys_stubs.o \
+ /tmp/stage1/mem.o /tmp/stage1/libc.o \
+ /tmp/stage1/$LIB_HELPER_OBJ \
+ $IN/tcc.flat.c -o $OUT/tcc1
+chmod +x $OUT/tcc1
+
+# ── Stage D: tcc1 rebuilds helpers, links tcc2 ────────────────────────
+build_asm $OUT/tcc1 /tmp/stage2
+build_helpers $OUT/tcc1 /tmp/stage2
+$OUT/tcc1 -nostdlib -I "$TCC_INC" -include $IN/va_list_shim.h \
/tmp/stage2/start.o /tmp/stage2/sys_stubs.o \
/tmp/stage2/mem.o /tmp/stage2/libc.o \
/tmp/stage2/$LIB_HELPER_OBJ \
- $IN/tcc.flat.c -o $OUT/tcc-tcc
-chmod +x $OUT/tcc-tcc
+ $IN/tcc.flat.c -o $OUT/tcc2
+chmod +x $OUT/tcc2
-# ── Stage D: tcc-tcc rebuilds helpers, links tcc-tcc-tcc ──────────────
-build_asm $OUT/tcc-tcc /tmp/stage3
-build_helpers $OUT/tcc-tcc /tmp/stage3
-$OUT/tcc-tcc -nostdlib -I "$TCC_INC" -include $IN/va_list_shim.h \
+# ── Stage E: tcc2 rebuilds helpers, links tcc3 ────────────────────────
+# Self-host idempotence check: tcc2 compiling itself with its own
+# helpers must produce a byte-identical binary. This is the real
+# bootstrap fixed point — tcc0 → tcc1 isn't expected to converge
+# because cc.scm's emitted machine code introduces subtle codegen
+# differences in tcc0's behavior, but from tcc1 onward the chain
+# is tcc compiling tcc.
+build_asm $OUT/tcc2 /tmp/stage3
+build_helpers $OUT/tcc2 /tmp/stage3
+$OUT/tcc2 -nostdlib -I "$TCC_INC" -include $IN/va_list_shim.h \
/tmp/stage3/start.o /tmp/stage3/sys_stubs.o \
/tmp/stage3/mem.o /tmp/stage3/libc.o \
/tmp/stage3/$LIB_HELPER_OBJ \
- $IN/tcc.flat.c -o $OUT/tcc-tcc-tcc
-chmod +x $OUT/tcc-tcc-tcc
+ $IN/tcc.flat.c -o $OUT/tcc3
+chmod +x $OUT/tcc3
+
+if ! cmp -s $OUT/tcc2 $OUT/tcc3; then
+ s2=$(wc -c <$OUT/tcc2)
+ s3=$(wc -c <$OUT/tcc3)
+ echo "[boot3] FIXED-POINT FAIL: tcc2 ($s2) != tcc3 ($s3)" >&2
+ exit 1
+fi
CONTAINER
# ── copy outputs to final destination ─────────────────────────────────
-for f in tcc-boot2 tcc-tcc tcc-tcc-tcc; do
+for f in tcc0 tcc1 tcc2 tcc3; do
cp "$STAGE/out/$f" "$OUT/$f"
chmod 0700 "$OUT/$f"
done
-echo "[boot3 $ARCH] OK -> $OUT/{tcc-boot2, tcc-tcc, tcc-tcc-tcc}"
+echo "[boot3 $ARCH] OK -> $OUT/{tcc0, tcc1, tcc2, tcc3} (fixed point: tcc2 == tcc3)"