boot2

Playing with the bootstrap
git clone https://git.ryansepassi.com/git/boot2.git

commit e9e24687d1c5cced306a34171c12c5965af24126
parent b53d0180d693b167bcae99b7dbc5249b47b35be0
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Tue,  5 May 2026 19:12:51 -0700

docs: drop OS-TODO + TCC-TODO

OS-TODO had no open items left after the boot6 self-host commit.
TCC-TODO was mostly current-state reference (test-suite docs,
result table, simple-patches list); the only genuine open work
left in it was two riscv64 known-limitations.  Move those into
TCC.md as a "Known limitations (riscv64)" section so the bug
diagnosis still has a home, then delete both files.

Also clean up three stale cross-refs that pointed at TCC-TODO
sections that no longer existed (Tracepoint, libc strategy, Repro).

Diffstat:
M docs/LIBC.md | 8 +++-----
D docs/OS-TODO.md | 32 --------------------------------
D docs/TCC-TODO.md | 324 -------------------------------------------------------------------------------
M docs/TCC.md | 139 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
M scripts/boot-build-cc.sh | 1 -
5 files changed, 138 insertions(+), 366 deletions(-)

diff --git a/docs/LIBC.md b/docs/LIBC.md
@@ -22,7 +22,6 @@ inline-asm syscall wrappers with one hand-written file
 points, then build it three different ways: as P1pp linked into tcc-boot2
 (Phase A), as ELF object files via tcc-boot2 itself (Phase B1), and
 tcc's own `lib/libtcc1.c` via tcc-boot2 (Phase B2).**
-Rationale lives in [TCC-TODO.md §libc strategy](TCC-TODO.md#libc--see-libcmd).
 
 Anchors: mes source lives at `../mes/lib/`. P1pp syscall block is at
 [P1/P1pp.P1pp:986-1058](../P1/P1pp.P1pp). cc.scm's C linkage rule is
@@ -421,10 +420,9 @@ That's tracked in [TCC.md](TCC.md), not here.
 - If a mes file pulls in a header path we don't have, the right move is
   almost always to copy the matching `mes/include/` header verbatim —
   don't write a substitute.
-- cc.scm's debug flag (`--cc-debug`, see TCC-TODO.md "Repro") prints
-  per-phase heap usage. libc.flat.c is small (~52 KB after flatten)
-  so heap should be flat; if it isn't, that's a cc.scm bug, not a
-  libc bug.
+- cc.scm's `--cc-debug` flag prints per-phase heap usage on stderr.
+  libc.flat.c is small (~52 KB after flatten) so heap should be
+  flat; if it isn't, that's a cc.scm bug, not a libc bug.
 - The existing `vendor/seed/` layout is `<tool>/<arch>/...`. mes-libc
   is per-arch only via headers; the .c manifest is arch-agnostic.
   Layout `vendor/mes-libc/{ctype,string,...}/` flat, with
diff --git a/docs/OS-TODO.md b/docs/OS-TODO.md
@@ -1,32 +0,0 @@
-# Seed kernel — open items
-
-The [`OS.md`](OS.md) contract is fully met by [`seed-kernel/`](../seed-kernel/):
-boots via the arm64 Linux boot protocol, parses the DTB, unpacks an
-initramfs into an in-memory tmpfs, loads `/init` as a static aarch64
-ELF, dispatches the eight Tier-1 syscalls plus atomic `sys_spawn`
-(private syscall 1024, replaces POSIX `clone`+`execve`) and
-`sys_waitid`, with virtio-blk in/out transports for boot{0..5} use.
-[`scripts/tier1-gate.sh`](../seed-kernel/scripts/tier1-gate.sh) and
-[`scripts/tier2-gate.sh`](../seed-kernel/scripts/tier2-gate.sh) cover
-acceptance; `boot{0..5}.sh DRIVER=seed` is byte-identical to the
-podman path. HVF acceleration enabled.
-
-[`scripts/boot6.sh`](../scripts/boot6.sh) builds and links the seed
-kernel end-to-end with the patched
-[`build/aarch64/boot4/tcc3`](../scripts/boot4.sh) — no `ld -T
-kernel.lds`, no objcopy. The link line is just three flags:
-
-```
-tcc3 -nostdlib -static \
-  -Wl,-Ttext=0x40080000 \
-  -Wl,--oformat=binary \
-  -o Image kernel.S.o kernel.c.o mem.c.o
-```
-
-The output (`build/aarch64/boot6/Image`) is byte-format identical in
-shape to the gcc Makefile's `objcopy -O binary` flat Image:
-[`seed-kernel/build/Image`](../seed-kernel/Makefile).
-
-`DRIVER=seed scripts/boot.sh aarch64` runs the entire boot0→boot6
-pipeline (including a re-run of boot6 itself) on top of the
-tcc3-built kernel, closing the self-host loop at the OS layer.
diff --git a/docs/TCC-TODO.md b/docs/TCC-TODO.md
@@ -1,324 +0,0 @@
-# tcc-boot2 Current TODO
-
-Current tracker for the scheme1-hosted `cc.scm` path that builds
-`tcc.flat.c` into `tcc-boot2`.
-
-Companion docs:
-
-- [TCC.md](TCC.md) describes the surrounding tcc pipeline.
-- [CC.md](CC.md) describes the C subset and validation milestones.
-- [LIBC.md](LIBC.md) describes the libc side used to link `tcc-boot2`.
-
-## Current State
-
-`cc.scm` compiles the flattened tcc translation unit, the P1pp output
-assembles and links via the M1pp + hex2++ chain, and the resulting
-`tcc-boot2` is at full parity with the gcc-built control on the
-`tcc-cc` acceptance suite (see Latest Result below).
-
-Useful smoke checks:
-
-```sh
-make tcc-boot2 ARCH=aarch64
-
-build/aarch64/tcc-boot2/tcc-boot2 -v
-build/aarch64/tcc-boot2/tcc-boot2 -E smoke.c
-build/aarch64/tcc-boot2/tcc-boot2 -c smoke.c -o smoke.o
-```
-
-For native generated-program testing, use the ARM64-targeted build via
-the `tcc-cc` suite:
-
-```sh
-make test SUITE=tcc-cc
-```
-
-## `tcc-cc` Suite
-
-`tcc-cc` runs the plain `tests/cc` fixtures through `tcc-tcc`
-(second-stage tcc) instead of through `cc.scm` directly. The Makefile
-chain is `cc.scm` → `tcc-boot2` → `tcc-tcc`; the runner then does:
-
-```sh
-build/aarch64/tcc-tcc/tcc-tcc \
-  -nostdlib build/aarch64/tcc-cc/start.o build/aarch64/tcc-cc/mem.o \
-  tests/cc/NAME.c -o build/aarch64/tests/tcc-cc/NAME
-
-./build/aarch64/tests/tcc-cc/NAME
-```
-
-Routing fixtures through `tcc-tcc` (rather than `tcc-boot2` directly)
-turns every fixture into a self-host check: a regression in
-`cc.scm`'s emitted code surfaces first when `tcc-boot2` builds
-`tcc-tcc`, then again when `tcc-tcc` runs the fixtures.
-
-`mem.o` is the compiler-builtin mem* runtime — `memcpy/memmove/memset`
-that tcc emits direct calls to for struct copies and bulk init, plus
-`memcmp` for fixtures that reach it via bare `extern int memcmp(...)`.
-The tcc-gcc sibling supplies the equivalent four symbols by compiling
-mes-libc's `string/{memcpy,memmove,memset,memcmp}.c` into its runtime
-archive.
-
-The result is compared against the same `.expected` and
-`.expected-exit` files used by the regular `cc` suite. The suite is
-aarch64-only today because it needs generated binaries to run natively
-inside the aarch64 container.
-
-Run a subset with `NAMES`:
-
-```sh
-NAMES='002-arith 007-call-with-args' make test SUITE=tcc-cc
-```
-
-## Latest Result
-
-```text
-make test SUITE=tcc-cc tcc-tcc on tests/cc: 181 passed, 0 failed
-scripts/run-gcc-libc-flat-tcc.sh tcc-gcc baseline: 181 passed, 0 failed
-```
-
-Exact parity, suite fully green on both paths.
-
-The path from earlier results to here:
-
-| Result | Delta |
-|--------|-------|
-| 148/30 | baseline before mem-runtime |
-| 163/15 | added `tcc-cc/mem.c` runtime; cleared the `mem*` undefined-symbol cluster |
-| 175/3 | cc.scm migration to M1pp + hex2++ pipeline (dotted local labels, `.scope`/`.endscope`, `.align` directives, bare-hex string emission) cleared the entire `assert fail: 0@12051` cluster (14 fixtures) plus a hex2pp.P1 BSS-overlap fix that unblocked the tcc-boot2 link itself for inputs >1 MiB |
-| 176/2 | ternary-arms common-type fix in `cg-ifelse-merge` cleared `220-const-promote` (was: arm 1's type leaked through as the result type, truncating wider arm 2 to 32-bit; tcc's `gen_opic` sign-extension idiom hit this) |
-| 178/1 | reframed mem* as compiler builtins supplied by the build process: renamed libp1pp's `libp1pp__memcpy` / `_memcmp` / `_memset` to plain `memcpy` / `memcmp` / `memset` and added `memmove`; dropped mes-libc's `string/memcpy.c` / `memmove.c` / `memset.c` / `memcmp.c` from `unified-libc.c` so the symbols are not duplicated; added `memcmp` to `tcc-cc/mem.c` and linked it into the gcc-built tcc-gcc binary; updated and renamed the regression fixture (`129-extern-libp1pp` → `129-extern-mem-builtins`) to extern the plain names. Cleared the fixture on every path (cc, cc-libc, tcc-cc, tcc-gcc). |
-| 183/1 | added `tcc-tcc` (second-stage tcc) and routed `tcc-cc` / `tcc-libc` through it. cc.scm's `cg-load` was 8-byte-spilling struct lvalues — anything `sizeof > 8` got truncated when used in expression context (e.g. as a ternary arm). Fixed `cg-load` to leave aggregates as lvalues and updated `cg-ifelse-merge` to memcpy aggregate arms into a struct-sized merge slot; without this, tcc-boot2 (cc.scm-built) self-corrupted whenever it had to compile `type = bt1 == 6 ? type1 : type2;`. Regression locked by `tests/cc/336-struct-assign-ternary`. |
-| 181/0 | fixed cc.scm's struct-by-value parameter ABI — both `cg-call` and `cg-fn-begin/v` now split 9..16-byte aggregates across two consecutive ABI slots. Locked by `tests/cc/337-struct-by-value-arg`. |
-| 181/0 | added `simple-patches/tcc-0.9.26/lex-char-unsigned` so tcc reads single-byte character constants through `uint8_t`, not `int8_t`; clears `200-lex-char-type` on both `tcc-cc` (tcc-tcc-driven) and `tcc-gcc` (gcc-built control). C99 §6.4.4.4¶10 leaves `char` signedness implementation-defined and aarch64 AAPCS picks unsigned, so `'\xFF'` must be 255, not -1. |
-
-## Host Baseline
-
-The `tests/cc` fixtures are coherent under a host compiler. The
-temporary host harness compiled, ran, and compared every fixture with
-plain host `cc`:
-
-```sh
-build/aarch64/.work/tests/tcc-cc/run-host-cc.sh
-```
-
-Recorded baseline:
-
-```text
-HOST_CC=cc
-HOST_CFLAGS=-std=gnu11 -w
-153 passed, 0 failed
-```
-
-The gcc-built flattened-tcc control runs in the Alpine gcc image:
-
-```sh
-make tcc-gcc TCC_TARGET=ARM64
-podman run --rm --pull=never --platform linux/arm64 \
-  -v "$PWD":/work -w /work boot2-alpine-gcc:aarch64 \
-  sh scripts/run-gcc-libc-flat-tcc.sh
-```
-
-This is the canonical sanity reference for "tcc-built-from-our-source"
-fixture coverage; `cc.scm`-built tcc-boot2 is now at exact parity with
-it.
-
-## Patches
-
-`scripts/simple-patches/tcc-0.9.26/` carries fixes applied during
-`stage1-flatten` so any tcc rebuilt from this tree picks them up:
-
-- `aarch64-stdarg-array.{before,after}` — swaps the bundled
-  `va_list` for `__va_list_struct[1]` (matches glibc/musl/x86_64 ABI).
-- `arm64-va-{pointer-operand,arg-pointer}.{before,after}` — teaches
-  `gen_va_start`/`gen_va_arg` to skip `gaddrof()` when the operand is
-  already a pointer (the array-decayed/pointer-parameter case). Without
-  this, `va_list` forwarding into a non-variadic helper (the
-  `vfprintf` shape, e.g. `131-vararg-mixed`) hit `assert fail: 0` in
-  `arm64-gen.c`.
-- `const-divzero-shortcircuit-int.{before,after}` — gates `gen_opic`'s
-  "division by zero in constant" error on `!nocode_wanted` so the
-  unevaluated arm of `&&`/`||`/`?:` in constant expressions
-  (C11 §6.6¶3) does not abort.
-- `lex-char-unsigned.{before,after}` — reads single-byte character
-  constants through `uint8_t` instead of `int8_t` so `'\xFF'`
-  produces 255, not -1, matching aarch64 AAPCS's plain-`char`
-  signedness (C99 §6.4.4.4¶10 leaves it implementation-defined).
-
-## Fixture cleanups
-
-Two small fixtures were rewritten to drop assumptions the regular `cc`
-suite shouldn't depend on:
-
-- `tests/cc/125-anon-union.c` explicitly initializes its local struct
-  before probing anonymous-union aliasing. Tests must not depend on
-  implicit zeroing of automatic locals.
-- `tests/cc/132-tentative-bss-sizing.c` returns distinct numeric exit
-  codes instead of calling `sys_write`/`strlen`. Plain `tests/cc`
-  fixtures must not need stdio/libc helpers.
-
-## Known limitations
-
-### riscv64: u32 narrowing leaves dirty upper bits
-
-`tests/cc/335-ternary-merge-arith-conv` fails on riscv64 in both
-`tcc-cc[stage2]` and `tcc-cc[stage3]` (identical behavior — the
-fixed-point property holds, the bug is in tcc's RISC-V codegen, not
-in cc.scm or the P1 pipeline). aarch64 and amd64 are green.
-
-The proximate trigger is in `riscv64-gen.c::load()`:
-
-```c
-func3 = size == 1 ? 0 : size == 2 ? 1 : size == 4 ? 2 : 3;
-if (size < 4 && !is_float(sv->type.t) && (sv->type.t & VT_UNSIGNED))
-    func3 |= 4; // promotes lb→lbu, lh→lhu, but skips lw→lwu
-```
-
-The `func3 |= 4` promotion to LWU is gated on `size < 4`, so a 4-byte
-unsigned load uses LW (sign-extending) instead of LWU (zero-extending).
-`gen_cast` to `VT_INT|VT_UNSIGNED` from a wider source emits no
-narrowing — it relies on the use-time load to truncate, but with LW
-the high u32 bits of the source leak through. `(u32)x` where `x` is
-`u64` with bit 31 set then evaluates to `0xFFFFFFFFFFFFFFFF`. This
-same bug is present in upstream tcc mob.
-
-**Why the one-line patch isn't enough.** Widening the gate to
-`size <= 4` (so 4-byte unsigned loads use LWU) regresses
-`017-int-arith` and `128-cast-signedness`. They were passing because
-two compensating bugs canceled out: stock tcc on riscv64 also
-sign-extends unsigned 32-bit immediate constants (`LUI`/`ADDI` with a
-bit-31-set value), so a comparison between an `unsigned int`
-variable (loaded with sign-extending LW) and an `unsigned int`
-constant (loaded with sign-extending LUI/ADDI) had matching dirty
-upper bits and `BEQ` saw them as equal. Fixing only the load breaks
-that join, because the compare path also lies — `BEQ` is a 64-bit
-instruction but C semantics require 32-bit width for `unsigned int ==
-unsigned int`.
-
-**Full fix shape.** Three coupled pieces: (1) load — emit LWU for
-unsigned 4-byte loads; (2) immediate — clear bits 32–63 when
-materializing an unsigned 32-bit constant with bit 31 set; (3)
-compare — eagerly canonicalize 32-bit-typed values into zero-extended
-or sign-extended form (per `VT_UNSIGNED`) after every op that can
-leave the upper half dirty. Pieces 2 and 3 overlap: if values are
-canonicalized at every produce site, the load fix becomes one of many
-sites that need to do it. This is what gcc/clang's RISC-V backends
-do, and it's beyond the scope of the literal-block `simple-patches`
-mechanism — file upstream or write a real canonicalization pass.
-
-For now: known limitation, document, move on. The scalar codegen
-elsewhere on riscv64 is fine — only u32 narrowing of a wider source
-trips it.
-
-### tcc0 → tcc1 is not a fixed point on riscv64 (cc.scm behavioral bug)
-
-`boot3.sh` + `boot4.sh` produce four staged compilers:
-
-- `tcc0` = tcc-source compiled by cc.scm (boot3 output)
-- `tcc1` = tcc-source compiled by tcc0 (boot4)
-- `tcc2` = tcc-source compiled by tcc1 (boot4)
-- `tcc3` = tcc-source compiled by tcc2 (boot4)
-
-The fixed-point check is **`tcc2 == tcc3`** (asserted at the end of
-`boot4.sh`, verified on aarch64, amd64, riscv64). On riscv64 the
-weaker `tcc1 == tcc2` does *not* hold: `tcc0(tcc.flat.c)` produces
-a 616100-byte `.o` while `tcc1(tcc.flat.c)` and `tcc2(tcc.flat.c)`
-produce a byte-identical 615892-byte `.o` — 208 bytes larger from
-tcc0 (200 in `.text` + 8 ripple in symtab/reloc offsets). amd64 and
-aarch64 satisfy `tcc1 == tcc2`; only riscv64 diverges.
-
-This is a **bug to investigate**, not just a "fatter code"
-observation. cc.scm should be a *faithful* (semantics-preserving)
-compiler — slower or larger output is acceptable, but tcc0 and tcc1
-must produce byte-identical output when run on the same source.
-That they don't on riscv64 means cc.scm's translation of tcc.flat.c
-into tcc0 changed what tcc0 *does at runtime*, not just how it's
-encoded. We don't care about peephole optimizations being missed; we
-do care that tcc0 makes different codegen decisions than tcc1
-makes.
-
-#### What's known
-
-The visible symptom: tcc0 emits 4 RISCV codegen patterns differently
-than tcc1 does:
-
-| Source pattern | tcc0 emits | tcc1 emits | Δ |
-|---|---|---|---|
-| `x = x - imm` (i32) | `addiw t,zero,imm; addw rd,rs,t` | `addiw rd,rs,imm` | +4 B |
-| `x = x & imm` | `addiw t,zero,imm; and rd,rs,t` | `andi rd,rs,imm` | +4 B |
-| zero-ext after `sext.w` | `sext.w r,r; slli r,r,0x20; srli r,r,0x20` | `sext.w r,r` | +8 B |
-| `x == 0xFFFFFFFF` (i32) | `addiw t,zero,-1; slli/srli; beq x,t,L` | `addi x,x,1; beqz x,L` | +8 B |
-
-These are decision points in `riscv64-gen.c` (immediate-folding,
-zero-ext elision). Same source code, same input C, but the running
-tcc0 takes the slow branch where the running tcc1 takes the fast
-one — even though both are compiled from the same `tcc.flat.c`.
-
-#### Hypothesis to test
-
-cc.scm likely miscompiles an integer comparison or bit-test inside
-the immediate-fits-in-instruction guard in `riscv64-gen.c`. Most of
-the missed patterns share the shape `if (small_int_fits) { fold } else
-{ materialize }`. If cc.scm gets the predicate wrong (e.g. signed vs.
-unsigned compare, or wrong branch on a particular bit pattern), tcc0
-falls into the materialize path on inputs where tcc1 takes the fold
-path.
-
-#### Repro / starting point
-
-```sh
-# In the riscv64 container with boot3+boot4 outputs present:
-$TCC0 -nostdlib -c -o /tmp/flat-tcc0.o tcc.flat.c
-$TCC1 -nostdlib -c -o /tmp/flat-tcc1.o tcc.flat.c
-# wc -c /tmp/flat-tcc0.o /tmp/flat-tcc1.o → 616100 vs 615892
-# objdump -d both, normalize addresses, diff to find divergent functions
-```
-
-The first divergent function in disassembly is `tal_free_impl` — a
-small refcount-decrement that hits the "x = x - 1" pattern. Good
-starting point because the function is short and the source path is
-narrow.
-
-Until this is fixed, tcc1 is the "shake-out" stage and tcc2 is the
-canonical compiler.
-
-## Standalone `bootN.sh`: remaining host deps
-
-`scripts/{boot0,boot1,boot2}.sh` are pure scratch + busybox — no host
-compiler, no alpine-gcc image, just `podman` + the pinned `busybox:musl`
-digest. `boot3.sh` is also pure scratch + busybox (it's just
-scheme1 + M1pp + hex2pp on `.flat.c` inputs flattened by host `cc -E`).
-`boot4.sh` previously had one host-tooling dep on **aarch64 only**:
-cross-asm of `tcc-libc/aarch64/{start,sys_stubs}.S` to `.o` via
-`$HOST_CC -target aarch64-linux-gnu`. tcc 0.9.26's aarch64 backend has
-no assembler (no `arm64-asm.c`) and no inline-asm support, so .S inputs
-historically needed pre-compilation host-side; the patched arm64-asm.c
-now removes that requirement (see `docs/TCC-ARM64-ASM.md`).
-
-amd64 and riscv64 backends both ship `CONFIG_TCC_ASM` and assemble .S
-in-container via tcc-boot2 itself (stages C+D in `boot4.sh`). The
-riscv64 .S files are macroed behind `#ifdef __TINYC__` because tcc's
-riscv64 asm parser uses 3-operand load/store syntax (`ld rd, base, off`,
-`sd base, src, off` — base first for stores) instead of GAS's
-`ld rd, off(base)` / `sd src, off(base)`; the GAS path stays usable
-for the Makefile's alpine-gcc fallback. The `boot2-alpine-gcc:riscv64`
-image is no longer used by `boot3.sh` / `boot4.sh`.
-
-Replacing the aarch64 .S pair with `.P1pp` (or any in-container-buildable)
-equivalents drops the host-cc dep entirely. After that, every
-`bootN.sh` is `podman` + scratch + busybox only.
-
-Out of scope for this TODO (already accepted as host-side):
-`stage1-flatten.sh` and `libc-flatten.sh` use the host `cc -E`
-preprocessor to produce `tcc.flat.c` and `libc.flat.c`. The unpacked
-`tcc-0.9.26/lib/{lib-arm64.c, va_list.c}` helpers compile cleanly under
-tcc-boot2 inside the container — no host cc on those, just source
-deps.
-
-## Next steps
-
-The cc.scm path is at full parity with the gcc-built control on the
-test suites that pass: every fixture in `tcc-cc` and `tcc-libc`
-passes on both, modulo the riscv64 limitation noted above. Further
-bug-hunting work is open-ended — surface a misbehavior, write a
-`tests/cc` fixture that locks it, fix.
diff --git a/docs/TCC.md b/docs/TCC.md
@@ -270,10 +270,11 @@ The interface for the slot scheme CC fills:
 `make tcc-boot2 ARCH=aarch64` now runs that path end-to-end:
 `cc.scm + tcc.flat.c → tcc-boot2`, linking against a `cc.scm`-built
 `libc.flat.c` instead of mes libc. The `tcc-cc` acceptance suite
-(see [TCC-TODO.md](TCC-TODO.md)) shows full parity with the
-gcc-built control. Alpine + gcc + `tcc-host` (stage 2 of the original
-plan) is no longer in our boot2 path; the busybox + scheme1-cc chain
-covers everything from stage 1's `tcc.flat.c` to a runnable tcc.
+(`make test SUITE=tcc-cc`) shows full parity with the gcc-built
+control on aarch64 and amd64. Alpine + gcc + `tcc-host` (stage 2 of
+the original plan) is no longer in our boot2 path; the busybox +
+scheme1-cc chain covers everything from stage 1's `tcc.flat.c` to a
+runnable tcc.
 
 ## Reproducibility
 
@@ -390,3 +391,133 @@ divergence.
 Once tcc-boot0-mes runs, stage 3 is unblocked: the `tcc-boot1` /
 `tcc-boot2` rebuilds mirror what live-bootstrap's pass1.kaem already
 does, and the script is in place.
+
+## Known limitations (riscv64)
+
+aarch64 and amd64 are at full self-host parity (cc.scm path matches
+the gcc-built control on every fixture). riscv64 has two real open
+items, both rooted in tcc's riscv64 backend rather than in cc.scm
+or the P1 pipeline.
+
+### riscv64: u32 narrowing leaves dirty upper bits
+
+`tests/cc/335-ternary-merge-arith-conv` fails on riscv64 in both
+`tcc-cc[stage2]` and `tcc-cc[stage3]` (identical behavior — the
+fixed-point property holds, the bug is in tcc's RISC-V codegen, not
+in cc.scm or the P1 pipeline). aarch64 and amd64 are green.
+
+The proximate trigger is in `riscv64-gen.c::load()`:
+
+```c
+func3 = size == 1 ? 0 : size == 2 ? 1 : size == 4 ? 2 : 3;
+if (size < 4 && !is_float(sv->type.t) && (sv->type.t & VT_UNSIGNED))
+    func3 |= 4; // promotes lb→lbu, lh→lhu, but skips lw→lwu
+```
+
+The `func3 |= 4` promotion to LWU is gated on `size < 4`, so a 4-byte
+unsigned load uses LW (sign-extending) instead of LWU (zero-extending).
+`gen_cast` to `VT_INT|VT_UNSIGNED` from a wider source emits no
+narrowing — it relies on the use-time load to truncate, but with LW
+the high u32 bits of the source leak through. `(u32)x` where `x` is
+`u64` with bit 31 set then evaluates to `0xFFFFFFFFFFFFFFFF`. This
+same bug is present in upstream tcc mob.
+
+**Why the one-line patch isn't enough.** Widening the gate to
+`size <= 4` (so 4-byte unsigned loads use LWU) regresses
+`017-int-arith` and `128-cast-signedness`. They were passing because
+two compensating bugs canceled out: stock tcc on riscv64 also
+sign-extends unsigned 32-bit immediate constants (`LUI`/`ADDI` with a
+bit-31-set value), so a comparison between an `unsigned int`
+variable (loaded with sign-extending LW) and an `unsigned int`
+constant (loaded with sign-extending LUI/ADDI) had matching dirty
+upper bits and `BEQ` saw them as equal. Fixing only the load breaks
+that join, because the compare path also lies — `BEQ` is a 64-bit
+instruction but C semantics require 32-bit width for `unsigned int ==
+unsigned int`.
+
+**Full fix shape.** Three coupled pieces: (1) load — emit LWU for
+unsigned 4-byte loads; (2) immediate — clear bits 32–63 when
+materializing an unsigned 32-bit constant with bit 31 set; (3)
+compare — eagerly canonicalize 32-bit-typed values into zero-extended
+or sign-extended form (per `VT_UNSIGNED`) after every op that can
+leave the upper half dirty. Pieces 2 and 3 overlap: if values are
+canonicalized at every produce site, the load fix becomes one of many
+sites that need to do it. This is what gcc/clang's RISC-V backends
+do, and it's beyond the scope of the literal-block `simple-patches`
+mechanism — file upstream or write a real canonicalization pass.
+
+For now: known limitation, document, move on. The scalar codegen
+elsewhere on riscv64 is fine — only u32 narrowing of a wider source
+trips it.
+
+### riscv64: tcc0 → tcc1 is not a fixed point (cc.scm behavioral bug)
+
+`boot3.sh` + `boot4.sh` produce four staged compilers:
+
+- `tcc0` = tcc-source compiled by cc.scm (boot3 output)
+- `tcc1` = tcc-source compiled by tcc0 (boot4)
+- `tcc2` = tcc-source compiled by tcc1 (boot4)
+- `tcc3` = tcc-source compiled by tcc2 (boot4)
+
+The fixed-point check is **`tcc2 == tcc3`** (asserted at the end of
+`boot4.sh`, verified on aarch64, amd64, riscv64). On riscv64 the
+weaker `tcc1 == tcc2` does *not* hold: `tcc0(tcc.flat.c)` produces
+a 616100-byte `.o` while `tcc1(tcc.flat.c)` and `tcc2(tcc.flat.c)`
+produce a byte-identical 615892-byte `.o` — 208 bytes larger from
+tcc0 (200 in `.text` + 8 ripple in symtab/reloc offsets). amd64 and
+aarch64 satisfy `tcc1 == tcc2`; only riscv64 diverges.
+
+This is a **bug to investigate**, not just a "fatter code"
+observation. cc.scm should be a *faithful* (semantics-preserving)
+compiler — slower or larger output is acceptable, but tcc0 and tcc1
+must produce byte-identical output when run on the same source.
+That they don't on riscv64 means cc.scm's translation of tcc.flat.c
+into tcc0 changed what tcc0 *does at runtime*, not just how it's
+encoded. We don't care about peephole optimizations being missed; we
+do care that tcc0 makes different codegen decisions than tcc1
+makes.
+
+#### What's known
+
+The visible symptom: tcc0 emits 4 RISCV codegen patterns differently
+than tcc1 does:
+
+| Source pattern | tcc0 emits | tcc1 emits | Δ |
+|---|---|---|---|
+| `x = x - imm` (i32) | `addiw t,zero,imm; addw rd,rs,t` | `addiw rd,rs,imm` | +4 B |
+| `x = x & imm` | `addiw t,zero,imm; and rd,rs,t` | `andi rd,rs,imm` | +4 B |
+| zero-ext after `sext.w` | `sext.w r,r; slli r,r,0x20; srli r,r,0x20` | `sext.w r,r` | +8 B |
+| `x == 0xFFFFFFFF` (i32) | `addiw t,zero,-1; slli/srli; beq x,t,L` | `addi x,x,1; beqz x,L` | +8 B |
+
+These are decision points in `riscv64-gen.c` (immediate-folding,
+zero-ext elision). Same source code, same input C, but the running
+tcc0 takes the slow branch where the running tcc1 takes the fast
+one — even though both are compiled from the same `tcc.flat.c`.
+
+#### Hypothesis to test
+
+cc.scm likely miscompiles an integer comparison or bit-test inside
+the immediate-fits-in-instruction guard in `riscv64-gen.c`. Most of
+the missed patterns share the shape `if (small_int_fits) { fold } else
+{ materialize }`. If cc.scm gets the predicate wrong (e.g. signed vs.
+unsigned compare, or wrong branch on a particular bit pattern), tcc0
+falls into the materialize path on inputs where tcc1 takes the fold
+path.
+
+#### Repro / starting point
+
+```sh
+# In the riscv64 container with boot3+boot4 outputs present:
+$TCC0 -nostdlib -c -o /tmp/flat-tcc0.o tcc.flat.c
+$TCC1 -nostdlib -c -o /tmp/flat-tcc1.o tcc.flat.c
+# wc -c /tmp/flat-tcc0.o /tmp/flat-tcc1.o → 616100 vs 615892
+# objdump -d both, normalize addresses, diff to find divergent functions
+```
+
+The first divergent function in disassembly is `tal_free_impl` — a
+small refcount-decrement that hits the "x = x - 1" pattern. Good
+starting point because the function is short and the source path is
+narrow.
+
+Until this is fixed, tcc1 is the "shake-out" stage and tcc2 is the
+canonical compiler.
diff --git a/scripts/boot-build-cc.sh b/scripts/boot-build-cc.sh
@@ -14,7 +14,6 @@
 ## call at entry. Pair with libp1pp's %trace macro and
 ## libp1pp__trace runtime helper (in P1/P1pp.P1pp) to
 ## produce a stderr line per function entry at runtime.
-## See docs/TCC-TODO.md "Tracepoint" section.
 ## CC_LIB=PFX (optional) — compile in library mode (cc.scm
 ## --lib=PFX). Skips cc.scm's auto-emitted entry
 ## stub and trailing :ELF_end so the output catm's