boot2

Playing with the bootstrap
git clone https://git.ryansepassi.com/git/boot2.git

tcc-boot2 Current TODO

Current tracker for the scheme1-hosted cc.scm path that builds tcc.flat.c into tcc-boot2.

Companion docs:

Current State

cc.scm compiles the flattened tcc translation unit, the P1pp output assembles and links via the M1pp + hex2++ chain, and the resulting tcc-boot2 is at full parity with the gcc-built control on the tcc-cc acceptance suite (see Latest Result below).

Useful smoke checks:

make tcc-boot2 ARCH=aarch64

build/aarch64/tcc-boot2/tcc-boot2 -v
build/aarch64/tcc-boot2/tcc-boot2 -E smoke.c
build/aarch64/tcc-boot2/tcc-boot2 -c smoke.c -o smoke.o

For native generated-program testing, use the ARM64-targeted build via the tcc-cc suite:

make test SUITE=tcc-cc

tcc-cc Suite

tcc-cc runs the plain tests/cc fixtures through tcc-tcc (second-stage tcc) instead of through cc.scm directly. The Makefile chain is cc.scm → tcc-boot2 → tcc-tcc; the runner then does:

build/aarch64/tcc-tcc/tcc-tcc \
    -nostdlib build/aarch64/tcc-cc/start.o build/aarch64/tcc-cc/mem.o \
    tests/cc/NAME.c -o build/aarch64/tests/tcc-cc/NAME

./build/aarch64/tests/tcc-cc/NAME

Routing fixtures through tcc-tcc (rather than tcc-boot2 directly) turns every fixture into a self-host check: a regression in cc.scm's emitted code surfaces first when tcc-boot2 builds tcc-tcc, then again when tcc-tcc runs the fixtures.

mem.o is the compiler-builtin mem* runtime — memcpy/memmove/memset that tcc emits direct calls to for struct copies and bulk init, plus memcmp for fixtures that reach it via bare extern int memcmp(...). The tcc-gcc sibling supplies the equivalent four symbols by compiling mes-libc's string/{memcpy,memmove,memset,memcmp}.c into its runtime archive.

The result is compared against the same .expected and .expected-exit files used by the regular cc suite. The suite is aarch64-only today because it needs generated binaries to run natively inside the aarch64 container.

Run a subset with NAMES:

NAMES='002-arith 007-call-with-args' make test SUITE=tcc-cc

Latest Result

make test SUITE=tcc-cc       tcc-tcc on tests/cc:    181 passed, 0 failed
scripts/run-gcc-libc-flat-tcc.sh  tcc-gcc baseline:   181 passed, 0 failed

Exact parity, suite fully green on both paths.

The path from earlier results to here:

Result (passed/failed) | Delta
148/30 | baseline before mem-runtime
163/15 | added tcc-cc/mem.c runtime; cleared the mem* undefined-symbol cluster
175/3 | cc.scm migration to M1pp + hex2++ pipeline (dotted local labels, .scope/.endscope, .align directives, bare-hex string emission) cleared the entire assert fail: 0@12051 cluster (14 fixtures) plus a hex2pp.P1 BSS-overlap fix that unblocked the tcc-boot2 link itself for inputs >1 MiB
176/2 | ternary-arms common-type fix in cg-ifelse-merge cleared 220-const-promote (was: arm 1's type leaked through as the result type, truncating wider arm 2 to 32-bit; tcc's gen_opic sign-extension idiom hit this)
178/1 | reframed mem* as compiler builtins supplied by the build process: renamed libp1pp's libp1pp__memcpy / _memcmp / _memset to plain memcpy / memcmp / memset and added memmove; dropped mes-libc's string/memcpy.c / memmove.c / memset.c / memcmp.c from unified-libc.c so the symbols are not duplicated; added memcmp to tcc-cc/mem.c and linked it into the gcc-built tcc-gcc binary; updated and renamed the regression fixture (129-extern-libp1pp → 129-extern-mem-builtins) to extern the plain names. Cleared the fixture on every path (cc, cc-libc, tcc-cc, tcc-gcc).
183/1 | added tcc-tcc (second-stage tcc) and routed tcc-cc / tcc-libc through it. cc.scm's cg-load was 8-byte-spilling struct lvalues: anything sizeof > 8 got truncated when used in expression context (e.g. as a ternary arm). Fixed cg-load to leave aggregates as lvalues and updated cg-ifelse-merge to memcpy aggregate arms into a struct-sized merge slot; without this, tcc-boot2 (cc.scm-built) self-corrupted whenever it had to compile type = bt1 == 6 ? type1 : type2;. Regression locked by tests/cc/336-struct-assign-ternary.
181/0 | fixed cc.scm's struct-by-value parameter ABI: both cg-call and cg-fn-begin/v now split 9..16-byte aggregates across two consecutive ABI slots. Locked by tests/cc/337-struct-by-value-arg.
181/0 | added simple-patches/tcc-0.9.26/lex-char-unsigned so tcc reads single-byte character constants through uint8_t, not int8_t; clears 200-lex-char-type on both tcc-cc (tcc-tcc-driven) and tcc-gcc (gcc-built control). C99 §6.4.4.4¶10 leaves char signedness implementation-defined and aarch64 AAPCS picks unsigned, so '\xFF' must be 255, not -1.
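
The two struct fixes in the table concern source shapes like the following. This is a hypothetical fixture-style sketch, not the contents of 336-struct-assign-ternary or 337-struct-by-value-arg; the names struct wide, sum_wide, pick and check_struct_paths are invented for illustration. It assumes a 64-bit target where struct wide occupies 16 bytes, which puts it in the 9..16-byte range that is passed in two slots.

```c
#include <assert.h>

/* Hypothetical aggregate: 8-byte member + 4-byte member, padded to 16 bytes
   on typical 64-bit ABIs, so it cannot live in a single 8-byte slot. */
struct wide { long long lo; int hi; };

/* Struct passed by value: exercises the split across two ABI slots. */
static long long sum_wide(struct wide w)
{
    return w.lo + w.hi;
}

/* Struct used as a ternary arm and then assigned: exercises the
   aggregate merge path rather than an 8-byte scalar spill. */
static struct wide pick(int cond, struct wide a, struct wide b)
{
    struct wide r;
    r = cond ? a : b;
    return r;
}

/* Every member must survive both paths intact. */
static void check_struct_paths(void)
{
    struct wide a = { 1000000000LL, 7 };
    struct wide b = { -5LL, 9 };
    assert(pick(1, a, b).hi == 7);
    assert(pick(0, a, b).lo == -5LL);
    assert(sum_wide(a) == 1000000007LL);
    assert(sum_wide(pick(0, a, b)) == 4LL);
}
```

A compiler that truncates the aggregate to 8 bytes would keep lo and lose hi, so the checks on hi are the ones that would trip.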

Host Baseline

The tests/cc fixtures are coherent under a host compiler. The temporary host harness compiled, ran, and compared every fixture with plain host cc:

build/aarch64/.work/tests/tcc-cc/run-host-cc.sh

Recorded baseline:

HOST_CC=cc
HOST_CFLAGS=-std=gnu11 -w
153 passed, 0 failed

The gcc-built flattened-tcc control runs in the Alpine gcc image:

make tcc-gcc TCC_TARGET=ARM64
podman run --rm --pull=never --platform linux/arm64 \
    -v "$PWD":/work -w /work boot2-alpine-gcc:aarch64 \
    sh scripts/run-gcc-libc-flat-tcc.sh

This is the canonical sanity reference for "tcc-built-from-our-source" fixture coverage; cc.scm-built tcc-boot2 is now at exact parity with it.

Patches

scripts/simple-patches/tcc-0.9.26/ carries fixes applied during stage1-flatten so any tcc rebuilt from this tree picks them up:

Fixture cleanups

Two small fixtures were rewritten to drop assumptions the regular cc suite shouldn't depend on:

Known limitations

riscv64: u32 narrowing leaves dirty upper bits

tests/cc/335-ternary-merge-arith-conv fails on riscv64 in both tcc-cc[stage2] and tcc-cc[stage3] (identical behavior — the fixed-point property holds, the bug is in tcc's RISC-V codegen, not in cc.scm or the P1 pipeline). aarch64 and amd64 are green.

The proximate trigger is in riscv64-gen.c::load():

func3 = size == 1 ? 0 : size == 2 ? 1 : size == 4 ? 2 : 3;
if (size < 4 && !is_float(sv->type.t) && (sv->type.t & VT_UNSIGNED))
    func3 |= 4;          // promotes lb→lbu, lh→lhu, but skips lw→lwu

The func3 |= 4 promotion to LWU is gated on size < 4, so a 4-byte unsigned load uses LW (sign-extending) instead of LWU (zero-extending). gen_cast to VT_INT|VT_UNSIGNED from a wider source emits no narrowing; it relies on the use-time load to truncate, but with LW the high u32 bits of the source leak through. (u32)x where x is u64 with bit 31 set then evaluates with bits 32–63 all set: x = 0xFFFFFFFF yields 0xFFFFFFFFFFFFFFFF rather than 0x00000000FFFFFFFF. This same bug is present in upstream tcc mob.
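
The effect of the gate can be modelled outside tcc. The sketch below is a model under stated assumptions, not tcc code: load_func3 mirrors the quoted func3 selection (omitting the is_float check) with a hypothetical widen_gate flag that switches between size < 4 and size <= 4, and load32 models what LW and LWU leave in a 64-bit register. The funct3 values are the standard RISC-V load encodings.

```c
#include <assert.h>
#include <stdint.h>

/* RISC-V load funct3 encodings: LB=0 LH=1 LW=2 LD=3 LBU=4 LHU=5 LWU=6.
   widen_gate is a hypothetical switch: 0 = stock gate (size < 4),
   1 = widened gate (size <= 4). The is_float check is left out. */
static int load_func3(int size, int is_unsigned, int widen_gate)
{
    int func3 = size == 1 ? 0 : size == 2 ? 1 : size == 4 ? 2 : 3;
    int promote = widen_gate ? size <= 4 : size < 4;
    if (promote && is_unsigned)
        func3 |= 4;
    return func3;
}

/* Model of a 4-byte load into a 64-bit register. */
static uint64_t load32(uint32_t mem, int func3)
{
    if (func3 == 2 && (mem & 0x80000000u))        /* LW sign-extends */
        return (uint64_t)mem | 0xFFFFFFFF00000000ULL;
    return (uint64_t)mem;                          /* LWU, or LW with bit 31 clear */
}

static void check_load_gate(void)
{
    assert(load_func3(1, 1, 0) == 4);   /* lbu */
    assert(load_func3(2, 1, 0) == 5);   /* lhu */
    assert(load_func3(4, 1, 0) == 2);   /* stock gate: still lw */
    assert(load_func3(4, 1, 1) == 6);   /* widened gate: lwu */
    assert(load_func3(8, 1, 1) == 3);   /* ld is unaffected */
    /* Unsigned 32-bit value with bit 31 set, loaded both ways. */
    assert(load32(0x80000000u, load_func3(4, 1, 0)) == 0xFFFFFFFF80000000ULL);
    assert(load32(0x80000000u, load_func3(4, 1, 1)) == 0x0000000080000000ULL);
}
```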

Why the one-line patch isn't enough. Widening the gate to size <= 4 (so 4-byte unsigned loads use LWU) regresses 017-int-arith and 128-cast-signedness. They were passing because two compensating bugs canceled out: stock tcc on riscv64 also sign-extends unsigned 32-bit immediate constants (LUI/ADDI with a bit-31-set value), so a comparison between an unsigned int variable (loaded with sign-extending LW) and an unsigned int constant (loaded with sign-extending LUI/ADDI) had matching dirty upper bits and BEQ saw them as equal. Fixing only the load breaks that join, because the compare path also lies — BEQ is a 64-bit instruction but C semantics require 32-bit width for unsigned int == unsigned int.
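
The cancellation can be reproduced with a small register model. This is an illustrative sketch, not tcc's codegen: sext32 and zext32 stand in for the two ways a 32-bit value can land in a 64-bit register, model_beq is the full-width equality that BEQ performs, and the two flags on compare_u32 are hypothetical knobs for "load fixed" and "immediate fixed".

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit value widened with sign extension (LW, LUI/ADDI behaviour). */
static uint64_t sext32(uint32_t v)
{
    return (v & 0x80000000u) ? ((uint64_t)v | 0xFFFFFFFF00000000ULL)
                             : (uint64_t)v;
}

/* 32-bit value widened with zero extension (LWU behaviour). */
static uint64_t zext32(uint32_t v)
{
    return (uint64_t)v;
}

/* BEQ compares the full 64-bit registers. */
static int model_beq(uint64_t a, uint64_t b)
{
    return a == b;
}

/* unsigned int variable == unsigned int constant, with each side
   materialized either the stock way (sign-extended) or zero-extended. */
static int compare_u32(uint32_t var, uint32_t imm,
                       int load_zero_extends, int imm_zero_extends)
{
    uint64_t ra = load_zero_extends ? zext32(var) : sext32(var);
    uint64_t rb = imm_zero_extends ? zext32(imm) : sext32(imm);
    return model_beq(ra, rb);
}

static void check_compensating_bugs(void)
{
    uint32_t v = 0x80000000u;               /* bit 31 set */
    assert(compare_u32(v, v, 0, 0) == 1);   /* stock: both dirty, still equal */
    assert(compare_u32(v, v, 1, 0) == 0);   /* load-only fix: equal values compare unequal */
    assert(compare_u32(v, v, 1, 1) == 1);   /* load and immediate both fixed */
}
```

The middle assertion is the regression: the C-level values are equal, but the registers differ in bits 32–63.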

Full fix shape. Three coupled pieces: (1) load: emit LWU for unsigned 4-byte loads; (2) immediate: clear bits 32–63 when materializing an unsigned 32-bit constant with bit 31 set; (3) compare: eagerly canonicalize 32-bit-typed values into zero-extended or sign-extended form (per VT_UNSIGNED) after every op that can leave the upper half dirty. Pieces 1 and 3 overlap: if values are canonicalized at every produce site, the load fix becomes one of many sites that need to do it. This is what gcc/clang's RISC-V backends do, and it's beyond the scope of the literal-block simple-patches mechanism: file upstream or write a real canonicalization pass.
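
Canonicalizing at the produce site can be sketched as a single helper applied after each 32-bit operation. This is a minimal model of the idea, not a proposed patch: canon32 and add32 are hypothetical names, and the is_unsigned flag stands in for testing VT_UNSIGNED on the value's type.

```c
#include <assert.h>
#include <stdint.h>

/* Bring the low 32 bits of a 64-bit register into canonical form:
   zero-extended for unsigned int, sign-extended for int. */
static uint64_t canon32(uint64_t reg, int is_unsigned)
{
    uint64_t low = reg & 0xFFFFFFFFULL;
    if (is_unsigned || !(low & 0x80000000ULL))
        return low;
    return low | 0xFFFFFFFF00000000ULL;
}

/* A 32-bit add carried out with a 64-bit ADD, then canonicalized
   immediately so no later consumer sees a dirty upper half. */
static uint64_t add32(uint64_t a, uint64_t b, int is_unsigned)
{
    return canon32(a + b, is_unsigned);
}

static void check_canon32(void)
{
    /* 0xFFFFFFFF + 1 wraps to 0 as unsigned int; the raw sum is 0x100000000. */
    assert(add32(0xFFFFFFFFULL, 1, 1) == 0);
    /* Same bit pattern, two canonical forms depending on signedness. */
    assert(add32(0x7FFFFFFFULL, 1, 0) == 0xFFFFFFFF80000000ULL);
    assert(add32(0x7FFFFFFFULL, 1, 1) == 0x0000000080000000ULL);
    /* Once canonical, a 64-bit equality gives the 32-bit answer. */
    assert(canon32(0xFFFFFFFF80000000ULL, 1) == canon32(0x80000000ULL, 1));
}
```

With every producer routed through such a helper, a full-width compare is safe because both operands are guaranteed to be in the same form.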

For now: known limitation, document, move on. The scalar codegen elsewhere on riscv64 is fine — only u32 narrowing of a wider source trips it.

tcc0 → tcc1 is not a fixed point on riscv64 (cc.scm behavioral bug)

boot3.sh + boot4.sh produce four staged compilers:

The fixed-point check is tcc2 == tcc3 (asserted at the end of boot4.sh, verified on aarch64, amd64, riscv64). On riscv64 the stronger tcc1 == tcc2 does not hold: tcc0(tcc.flat.c) produces a 616100-byte .o while tcc1(tcc.flat.c) and tcc2(tcc.flat.c) produce a byte-identical 615892-byte .o, so the tcc0 output is 208 bytes larger (200 in .text + 8 ripple in symtab/reloc offsets). amd64 and aarch64 satisfy tcc1 == tcc2; only riscv64 diverges.

This is a bug to investigate, not just a "fatter code" observation. cc.scm should be a faithful (semantics-preserving) compiler — slower or larger output is acceptable, but tcc0 and tcc1 must produce byte-identical output when run on the same source. That they don't on riscv64 means cc.scm's translation of tcc.flat.c into tcc0 changed what tcc0 does at runtime, not just how it's encoded. We don't care about peephole optimizations being missed; we do care that tcc0 makes different codegen decisions than tcc1 makes.

What's known

The visible symptom: tcc0 emits four RISC-V codegen patterns differently than tcc1 does:

Source pattern | tcc0 emits | tcc1 emits | Δ
x = x - imm (i32) | addiw t,zero,imm; addw rd,rs,t | addiw rd,rs,imm | +4 B
x = x & imm | addiw t,zero,imm; and rd,rs,t | andi rd,rs,imm | +4 B
zero-ext after sext.w | sext.w r,r; slli r,r,0x20; srli r,r,0x20 | sext.w r,r | +8 B
x == 0xFFFFFFFF (i32) | addiw t,zero,-1; slli/srli; beq x,t,L | addi x,x,1; beqz x,L | +8 B

These are decision points in riscv64-gen.c (immediate-folding, zero-ext elision). Same source code, same input C, but the running tcc0 takes the slow branch where the running tcc1 takes the fast one — even though both are compiled from the same tcc.flat.c.

Hypothesis to test

cc.scm likely miscompiles an integer comparison or bit-test inside the immediate-fits-in-instruction guard in riscv64-gen.c. Most of the missed patterns share the shape if (small_int_fits) { fold } else { materialize }. If cc.scm gets the predicate wrong (e.g. signed vs. unsigned compare, or wrong branch on a particular bit pattern), tcc0 falls into the materialize path on inputs where tcc1 takes the fold path.
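
One way such a predicate could go wrong is sketched below. Everything here is hypothetical: fits_imm12 is an assumed shape for an "immediate fits a signed 12-bit field" guard, not the actual guard in riscv64-gen.c, and fits_imm12_unsigned_cmp shows what the same guard computes if its signed comparisons are carried out as unsigned ones. The byte counts match the 4-byte fold and 8-byte materialize sequences for x = x - imm.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed guard shape: does v fit a signed 12-bit immediate? */
static int fits_imm12(int64_t v)
{
    return v >= -2048 && v <= 2047;
}

/* Hypothetical miscompile: the same two comparisons done as unsigned.
   -2048 becomes a huge unsigned bound, so the conjunction never holds. */
static int fits_imm12_unsigned_cmp(int64_t v)
{
    uint64_t u = (uint64_t)v;
    return u >= (uint64_t)(int64_t)-2048 && u <= (uint64_t)2047;
}

/* Bytes emitted for "x = x - imm": one 4-byte instruction when folded,
   two when the immediate is materialized into a temporary first. */
static int bytes_for_sub_imm(int fits)
{
    return fits ? 4 : 8;
}

static void check_guard(void)
{
    assert(fits_imm12(1) && fits_imm12(-1) && !fits_imm12(2048));
    assert(!fits_imm12_unsigned_cmp(1));    /* small positive: wrongly rejected */
    assert(!fits_imm12_unsigned_cmp(-1));   /* small negative: wrongly rejected */
    assert(bytes_for_sub_imm(fits_imm12(1)) == 4);
    assert(bytes_for_sub_imm(fits_imm12_unsigned_cmp(1)) == 8);
}
```

A guard that is wrong in this way would still generate correct code, only larger, which is consistent with the fixture suite staying green while the object size differs.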

Repro / starting point

# In the riscv64 container with boot3+boot4 outputs present:
$TCC0 -nostdlib -c -o /tmp/flat-tcc0.o tcc.flat.c
$TCC1 -nostdlib -c -o /tmp/flat-tcc1.o tcc.flat.c
# wc -c /tmp/flat-tcc0.o /tmp/flat-tcc1.o   →  616100 vs 615892
# objdump -d both, normalize addresses, diff to find divergent functions

The first divergent function in disassembly is tal_free_impl — a small refcount-decrement that hits the "x = x - 1" pattern. Good starting point because the function is short and the source path is narrow.

Until this is fixed, tcc1 is the "shake-out" stage and tcc2 is the canonical compiler.

Standalone bootN.sh: remaining host deps

scripts/{boot0,boot1,boot2}.sh are pure scratch + busybox — no host compiler, no alpine-gcc image, just podman + the pinned busybox:musl digest. boot3.sh is also pure scratch + busybox (it's just scheme1 + M1pp + hex2pp on .flat.c inputs flattened by host cc -E). boot4.sh previously had one host-tooling dep on aarch64 only: cross-asm of tcc-libc/aarch64/{start,sys_stubs}.S to .o via $HOST_CC -target aarch64-linux-gnu. tcc 0.9.26's aarch64 backend has no assembler (no arm64-asm.c) and no inline-asm support, so .S inputs historically needed pre-compilation host-side; the patched arm64-asm.c now removes that requirement (see docs/TCC-ARM64-ASM.md).

amd64 and riscv64 backends both ship CONFIG_TCC_ASM and assemble .S in-container via tcc-boot2 itself (stages C+D in boot4.sh). The riscv64 .S files are macroed behind #ifdef __TINYC__ because tcc's riscv64 asm parser uses 3-operand load/store syntax (ld rd, base, off, sd base, src, off — base first for stores) instead of GAS's ld rd, off(base) / sd src, off(base); the GAS path stays usable for the Makefile's alpine-gcc fallback. The boot2-alpine-gcc:riscv64 image is no longer used by boot3.sh / boot4.sh.

Replacing the aarch64 .S pair with .P1pp (or any in-container-buildable) equivalents drops the host-cc dep entirely. After that, every bootN.sh is podman + scratch + busybox only.

Out of scope for this TODO (already accepted as host-side): stage1-flatten.sh and libc-flatten.sh use the host cc -E preprocessor to produce tcc.flat.c and libc.flat.c. The unpacked tcc-0.9.26/lib/{lib-arm64.c, va_list.c} helpers compile cleanly under tcc-boot2 inside the container — no host cc on those, just source deps.

Next steps

The cc.scm path is at full parity with the gcc-built control: every fixture in tcc-cc and tcc-libc passes on both, modulo the riscv64 limitation noted above. Further bug-hunting work is open-ended: surface a misbehavior, write a tests/cc fixture that locks it, then fix it.