commit e9e24687d1c5cced306a34171c12c5965af24126
parent b53d0180d693b167bcae99b7dbc5249b47b35be0
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Tue, 5 May 2026 19:12:51 -0700
docs: drop OS-TODO + TCC-TODO
OS-TODO had no open items left after the boot6 self-host commit.
TCC-TODO was mostly current-state reference (test-suite docs,
result table, simple-patches list); the only genuine open work
left in it was two riscv64 known-limitations. Move those into
TCC.md as a "Known limitations (riscv64)" section so the bug
diagnosis still has a home, then delete both files.
Also clean up three stale cross-refs that pointed at TCC-TODO
sections that no longer existed (Tracepoint, libc strategy, Repro).
Diffstat:
5 files changed, 138 insertions(+), 366 deletions(-)
diff --git a/docs/LIBC.md b/docs/LIBC.md
@@ -22,7 +22,6 @@ inline-asm syscall wrappers with one hand-written file
points, then build it three different ways: as P1pp linked into
tcc-boot2 (Phase A), as ELF object files via tcc-boot2 itself
(Phase B1), and tcc's own `lib/libtcc1.c` via tcc-boot2 (Phase B2).**
-Rationale lives in [TCC-TODO.md §libc strategy](TCC-TODO.md#libc--see-libcmd).
Anchors: mes source lives at `../mes/lib/`. P1pp syscall block is at
[P1/P1pp.P1pp:986-1058](../P1/P1pp.P1pp). cc.scm's C linkage rule is
@@ -421,10 +420,9 @@ That's tracked in [TCC.md](TCC.md), not here.
- If a mes file pulls in a header path we don't have, the right move
is almost always to copy the matching `mes/include/` header
verbatim — don't write a substitute.
-- cc.scm's debug flag (`--cc-debug`, see TCC-TODO.md "Repro") prints
- per-phase heap usage. libc.flat.c is small (~52 KB after flatten)
- so heap should be flat; if it isn't, that's a cc.scm bug, not a
- libc bug.
+- cc.scm's `--cc-debug` flag prints per-phase heap usage on stderr.
+ libc.flat.c is small (~52 KB after flatten) so heap should be
+ flat; if it isn't, that's a cc.scm bug, not a libc bug.
- The existing `vendor/seed/` layout is `<tool>/<arch>/...`. mes-libc
is per-arch only via headers; the .c manifest is arch-agnostic.
Layout `vendor/mes-libc/{ctype,string,...}/` flat, with
diff --git a/docs/OS-TODO.md b/docs/OS-TODO.md
@@ -1,32 +0,0 @@
-# Seed kernel — open items
-
-The [`OS.md`](OS.md) contract is fully met by [`seed-kernel/`](../seed-kernel/):
-boots via the arm64 Linux boot protocol, parses the DTB, unpacks an
-initramfs into an in-memory tmpfs, loads `/init` as a static aarch64
-ELF, dispatches the eight Tier-1 syscalls plus atomic `sys_spawn`
-(private syscall 1024, replaces POSIX `clone`+`execve`) and
-`sys_waitid`, with virtio-blk in/out transports for boot{0..5} use.
-[`scripts/tier1-gate.sh`](../seed-kernel/scripts/tier1-gate.sh) and
-[`scripts/tier2-gate.sh`](../seed-kernel/scripts/tier2-gate.sh) cover
-acceptance; `boot{0..5}.sh DRIVER=seed` is byte-identical to the
-podman path. HVF acceleration enabled.
-
-[`scripts/boot6.sh`](../scripts/boot6.sh) builds and links the seed
-kernel end-to-end with the patched
-[`build/aarch64/boot4/tcc3`](../scripts/boot4.sh) — no `ld -T
-kernel.lds`, no objcopy. The link line is just three flags:
-
-```
-tcc3 -nostdlib -static \
- -Wl,-Ttext=0x40080000 \
- -Wl,--oformat=binary \
- -o Image kernel.S.o kernel.c.o mem.c.o
-```
-
-The output (`build/aarch64/boot6/Image`) is byte-format identical in
-shape to the gcc Makefile's `objcopy -O binary` flat Image:
-[`seed-kernel/build/Image`](../seed-kernel/Makefile).
-
-`DRIVER=seed scripts/boot.sh aarch64` runs the entire boot0→boot6
-pipeline (including a re-run of boot6 itself) on top of the
-tcc3-built kernel, closing the self-host loop at the OS layer.
diff --git a/docs/TCC-TODO.md b/docs/TCC-TODO.md
@@ -1,324 +0,0 @@
-# tcc-boot2 Current TODO
-
-Current tracker for the scheme1-hosted `cc.scm` path that builds
-`tcc.flat.c` into `tcc-boot2`.
-
-Companion docs:
-
-- [TCC.md](TCC.md) describes the surrounding tcc pipeline.
-- [CC.md](CC.md) describes the C subset and validation milestones.
-- [LIBC.md](LIBC.md) describes the libc side used to link `tcc-boot2`.
-
-## Current State
-
-`cc.scm` compiles the flattened tcc translation unit, the P1pp output
-assembles and links via the M1pp + hex2++ chain, and the resulting
-`tcc-boot2` is at full parity with the gcc-built control on the
-`tcc-cc` acceptance suite (see Latest Result below).
-
-Useful smoke checks:
-
-```sh
-make tcc-boot2 ARCH=aarch64
-
-build/aarch64/tcc-boot2/tcc-boot2 -v
-build/aarch64/tcc-boot2/tcc-boot2 -E smoke.c
-build/aarch64/tcc-boot2/tcc-boot2 -c smoke.c -o smoke.o
-```
-
-For native generated-program testing, use the ARM64-targeted build via
-the `tcc-cc` suite:
-
-```sh
-make test SUITE=tcc-cc
-```
-
-## `tcc-cc` Suite
-
-`tcc-cc` runs the plain `tests/cc` fixtures through `tcc-tcc`
-(second-stage tcc) instead of through `cc.scm` directly. The Makefile
-chain is `cc.scm` → `tcc-boot2` → `tcc-tcc`; the runner then does:
-
-```sh
-build/aarch64/tcc-tcc/tcc-tcc \
- -nostdlib build/aarch64/tcc-cc/start.o build/aarch64/tcc-cc/mem.o \
- tests/cc/NAME.c -o build/aarch64/tests/tcc-cc/NAME
-
-./build/aarch64/tests/tcc-cc/NAME
-```
-
-Routing fixtures through `tcc-tcc` (rather than `tcc-boot2` directly)
-turns every fixture into a self-host check: a regression in
-`cc.scm`'s emitted code surfaces first when `tcc-boot2` builds
-`tcc-tcc`, then again when `tcc-tcc` runs the fixtures.
-
-`mem.o` is the compiler-builtin mem* runtime — `memcpy/memmove/memset`
-that tcc emits direct calls to for struct copies and bulk init, plus
-`memcmp` for fixtures that reach it via bare `extern int memcmp(...)`.
-The tcc-gcc sibling supplies the equivalent four symbols by compiling
-mes-libc's `string/{memcpy,memmove,memset,memcmp}.c` into its runtime
-archive.
-
-The result is compared against the same `.expected` and
-`.expected-exit` files used by the regular `cc` suite. The suite is
-aarch64-only today because it needs generated binaries to run natively
-inside the aarch64 container.
-
-Run a subset with `NAMES`:
-
-```sh
-NAMES='002-arith 007-call-with-args' make test SUITE=tcc-cc
-```
-
-## Latest Result
-
-```text
-make test SUITE=tcc-cc tcc-tcc on tests/cc: 181 passed, 0 failed
-scripts/run-gcc-libc-flat-tcc.sh tcc-gcc baseline: 181 passed, 0 failed
-```
-
-Exact parity, suite fully green on both paths.
-
-The path from earlier results to here:
-
-| Result | Delta |
-|--------|-------|
-| 148/30 | baseline before mem-runtime |
-| 163/15 | added `tcc-cc/mem.c` runtime; cleared the `mem*` undefined-symbol cluster |
-| 175/3 | cc.scm migration to M1pp + hex2++ pipeline (dotted local labels, `.scope`/`.endscope`, `.align` directives, bare-hex string emission) cleared the entire `assert fail: 0@12051` cluster (14 fixtures) plus a hex2pp.P1 BSS-overlap fix that unblocked the tcc-boot2 link itself for inputs >1 MiB |
-| 176/2 | ternary-arms common-type fix in `cg-ifelse-merge` cleared `220-const-promote` (was: arm 1's type leaked through as the result type, truncating wider arm 2 to 32-bit; tcc's `gen_opic` sign-extension idiom hit this) |
-| 178/1 | reframed mem* as compiler builtins supplied by the build process: renamed libp1pp's `libp1pp__memcpy` / `_memcmp` / `_memset` to plain `memcpy` / `memcmp` / `memset` and added `memmove`; dropped mes-libc's `string/memcpy.c` / `memmove.c` / `memset.c` / `memcmp.c` from `unified-libc.c` so the symbols are not duplicated; added `memcmp` to `tcc-cc/mem.c` and linked it into the gcc-built tcc-gcc binary; updated and renamed the regression fixture (`129-extern-libp1pp` → `129-extern-mem-builtins`) to extern the plain names. Cleared the fixture on every path (cc, cc-libc, tcc-cc, tcc-gcc). |
-| 183/1 | added `tcc-tcc` (second-stage tcc) and routed `tcc-cc` / `tcc-libc` through it. cc.scm's `cg-load` was 8-byte-spilling struct lvalues — anything `sizeof > 8` got truncated when used in expression context (e.g. as a ternary arm). Fixed `cg-load` to leave aggregates as lvalues and updated `cg-ifelse-merge` to memcpy aggregate arms into a struct-sized merge slot; without this, tcc-boot2 (cc.scm-built) self-corrupted whenever it had to compile `type = bt1 == 6 ? type1 : type2;`. Regression locked by `tests/cc/336-struct-assign-ternary`. |
-| 181/0 | fixed cc.scm's struct-by-value parameter ABI — both `cg-call` and `cg-fn-begin/v` now split 9..16-byte aggregates across two consecutive ABI slots. Locked by `tests/cc/337-struct-by-value-arg`. |
-| 181/0 | added `simple-patches/tcc-0.9.26/lex-char-unsigned` so tcc reads single-byte character constants through `uint8_t`, not `int8_t`; clears `200-lex-char-type` on both `tcc-cc` (tcc-tcc-driven) and `tcc-gcc` (gcc-built control). C99 §6.4.4.4¶10 leaves `char` signedness implementation-defined and aarch64 AAPCS picks unsigned, so `'\xFF'` must be 255, not -1. |
-
-## Host Baseline
-
-The `tests/cc` fixtures are coherent under a host compiler. The
-temporary host harness compiled, ran, and compared every fixture with
-plain host `cc`:
-
-```sh
-build/aarch64/.work/tests/tcc-cc/run-host-cc.sh
-```
-
-Recorded baseline:
-
-```text
-HOST_CC=cc
-HOST_CFLAGS=-std=gnu11 -w
-153 passed, 0 failed
-```
-
-The gcc-built flattened-tcc control runs in the Alpine gcc image:
-
-```sh
-make tcc-gcc TCC_TARGET=ARM64
-podman run --rm --pull=never --platform linux/arm64 \
- -v "$PWD":/work -w /work boot2-alpine-gcc:aarch64 \
- sh scripts/run-gcc-libc-flat-tcc.sh
-```
-
-This is the canonical sanity reference for "tcc-built-from-our-source"
-fixture coverage; `cc.scm`-built tcc-boot2 is now at exact parity with
-it.
-
-## Patches
-
-`scripts/simple-patches/tcc-0.9.26/` carries fixes applied during
-`stage1-flatten` so any tcc rebuilt from this tree picks them up:
-
-- `aarch64-stdarg-array.{before,after}` — swaps the bundled
- `va_list` for `__va_list_struct[1]` (matches glibc/musl/x86_64 ABI).
-- `arm64-va-{pointer-operand,arg-pointer}.{before,after}` — teaches
- `gen_va_start`/`gen_va_arg` to skip `gaddrof()` when the operand is
- already a pointer (the array-decayed/pointer-parameter case). Without
- this, `va_list` forwarding into a non-variadic helper (the
- `vfprintf` shape, e.g. `131-vararg-mixed`) hit `assert fail: 0` in
- `arm64-gen.c`.
-- `const-divzero-shortcircuit-int.{before,after}` — gates `gen_opic`'s
- "division by zero in constant" error on `!nocode_wanted` so the
- unevaluated arm of `&&`/`||`/`?:` in constant expressions
- (C11 §6.6¶3) does not abort.
-- `lex-char-unsigned.{before,after}` — reads single-byte character
- constants through `uint8_t` instead of `int8_t` so `'\xFF'`
- produces 255, not -1, matching aarch64 AAPCS's plain-`char`
- signedness (C99 §6.4.4.4¶10 leaves it implementation-defined).
-
-## Fixture cleanups
-
-Two small fixtures were rewritten to drop assumptions the regular `cc`
-suite shouldn't depend on:
-
-- `tests/cc/125-anon-union.c` explicitly initializes its local struct
- before probing anonymous-union aliasing. Tests must not depend on
- implicit zeroing of automatic locals.
-- `tests/cc/132-tentative-bss-sizing.c` returns distinct numeric exit
- codes instead of calling `sys_write`/`strlen`. Plain `tests/cc`
- fixtures must not need stdio/libc helpers.
-
-## Known limitations
-
-### riscv64: u32 narrowing leaves dirty upper bits
-
-`tests/cc/335-ternary-merge-arith-conv` fails on riscv64 in both
-`tcc-cc[stage2]` and `tcc-cc[stage3]` (identical behavior — the
-fixed-point property holds, the bug is in tcc's RISC-V codegen, not
-in cc.scm or the P1 pipeline). aarch64 and amd64 are green.
-
-The proximate trigger is in `riscv64-gen.c::load()`:
-
-```c
-func3 = size == 1 ? 0 : size == 2 ? 1 : size == 4 ? 2 : 3;
-if (size < 4 && !is_float(sv->type.t) && (sv->type.t & VT_UNSIGNED))
- func3 |= 4; // promotes lb→lbu, lh→lhu, but skips lw→lwu
-```
-
-The `func3 |= 4` promotion to LWU is gated on `size < 4`, so a 4-byte
-unsigned load uses LW (sign-extending) instead of LWU (zero-extending).
-`gen_cast` to `VT_INT|VT_UNSIGNED` from a wider source emits no
-narrowing — it relies on the use-time load to truncate, but with LW
-the high u32 bits of the source leak through. `(u32)x` where `x` is
-`u64` with bit 31 set then evaluates to `0xFFFFFFFFFFFFFFFF`. This
-same bug is present in upstream tcc mob.
-
-**Why the one-line patch isn't enough.** Widening the gate to
-`size <= 4` (so 4-byte unsigned loads use LWU) regresses
-`017-int-arith` and `128-cast-signedness`. They were passing because
-two compensating bugs canceled out: stock tcc on riscv64 also
-sign-extends unsigned 32-bit immediate constants (`LUI`/`ADDI` with a
-bit-31-set value), so a comparison between an `unsigned int`
-variable (loaded with sign-extending LW) and an `unsigned int`
-constant (loaded with sign-extending LUI/ADDI) had matching dirty
-upper bits and `BEQ` saw them as equal. Fixing only the load breaks
-that join, because the compare path also lies — `BEQ` is a 64-bit
-instruction but C semantics require 32-bit width for `unsigned int ==
-unsigned int`.
-
-**Full fix shape.** Three coupled pieces: (1) load — emit LWU for
-unsigned 4-byte loads; (2) immediate — clear bits 32–63 when
-materializing an unsigned 32-bit constant with bit 31 set; (3)
-compare — eagerly canonicalize 32-bit-typed values into zero-extended
-or sign-extended form (per `VT_UNSIGNED`) after every op that can
-leave the upper half dirty. Pieces 2 and 3 overlap: if values are
-canonicalized at every produce site, the load fix becomes one of many
-sites that need to do it. This is what gcc/clang's RISC-V backends
-do, and it's beyond the scope of the literal-block `simple-patches`
-mechanism — file upstream or write a real canonicalization pass.
-
-For now: known limitation, document, move on. The scalar codegen
-elsewhere on riscv64 is fine — only u32 narrowing of a wider source
-trips it.
-
-### tcc0 → tcc1 is not a fixed point on riscv64 (cc.scm behavioral bug)
-
-`boot3.sh` + `boot4.sh` produce four staged compilers:
-
-- `tcc0` = tcc-source compiled by cc.scm (boot3 output)
-- `tcc1` = tcc-source compiled by tcc0 (boot4)
-- `tcc2` = tcc-source compiled by tcc1 (boot4)
-- `tcc3` = tcc-source compiled by tcc2 (boot4)
-
-The fixed-point check is **`tcc2 == tcc3`** (asserted at the end of
-`boot4.sh`, verified on aarch64, amd64, riscv64). On riscv64 the
-weaker `tcc1 == tcc2` does *not* hold: `tcc0(tcc.flat.c)` produces
-a 616100-byte `.o` while `tcc1(tcc.flat.c)` and `tcc2(tcc.flat.c)`
-produce a byte-identical 615892-byte `.o` — 208 bytes larger from
-tcc0 (200 in `.text` + 8 ripple in symtab/reloc offsets). amd64 and
-aarch64 satisfy `tcc1 == tcc2`; only riscv64 diverges.
-
-This is a **bug to investigate**, not just a "fatter code"
-observation. cc.scm should be a *faithful* (semantics-preserving)
-compiler — slower or larger output is acceptable, but tcc0 and tcc1
-must produce byte-identical output when run on the same source.
-That they don't on riscv64 means cc.scm's translation of tcc.flat.c
-into tcc0 changed what tcc0 *does at runtime*, not just how it's
-encoded. We don't care about peephole optimizations being missed; we
-do care that tcc0 makes different codegen decisions than tcc1
-makes.
-
-#### What's known
-
-The visible symptom: tcc0 emits 4 RISCV codegen patterns differently
-than tcc1 does:
-
-| Source pattern | tcc0 emits | tcc1 emits | Δ |
-|---|---|---|---|
-| `x = x - imm` (i32) | `addiw t,zero,imm; addw rd,rs,t` | `addiw rd,rs,imm` | +4 B |
-| `x = x & imm` | `addiw t,zero,imm; and rd,rs,t` | `andi rd,rs,imm` | +4 B |
-| zero-ext after `sext.w` | `sext.w r,r; slli r,r,0x20; srli r,r,0x20` | `sext.w r,r` | +8 B |
-| `x == 0xFFFFFFFF` (i32) | `addiw t,zero,-1; slli/srli; beq x,t,L` | `addi x,x,1; beqz x,L` | +8 B |
-
-These are decision points in `riscv64-gen.c` (immediate-folding,
-zero-ext elision). Same source code, same input C, but the running
-tcc0 takes the slow branch where the running tcc1 takes the fast
-one — even though both are compiled from the same `tcc.flat.c`.
-
-#### Hypothesis to test
-
-cc.scm likely miscompiles an integer comparison or bit-test inside
-the immediate-fits-in-instruction guard in `riscv64-gen.c`. Most of
-the missed patterns share the shape `if (small_int_fits) { fold } else
-{ materialize }`. If cc.scm gets the predicate wrong (e.g. signed vs.
-unsigned compare, or wrong branch on a particular bit pattern), tcc0
-falls into the materialize path on inputs where tcc1 takes the fold
-path.
-
-#### Repro / starting point
-
-```sh
-# In the riscv64 container with boot3+boot4 outputs present:
-$TCC0 -nostdlib -c -o /tmp/flat-tcc0.o tcc.flat.c
-$TCC1 -nostdlib -c -o /tmp/flat-tcc1.o tcc.flat.c
-# wc -c /tmp/flat-tcc0.o /tmp/flat-tcc1.o → 616100 vs 615892
-# objdump -d both, normalize addresses, diff to find divergent functions
-```
-
-The first divergent function in disassembly is `tal_free_impl` — a
-small refcount-decrement that hits the "x = x - 1" pattern. Good
-starting point because the function is short and the source path is
-narrow.
-
-Until this is fixed, tcc1 is the "shake-out" stage and tcc2 is the
-canonical compiler.
-
-## Standalone `bootN.sh`: remaining host deps
-
-`scripts/{boot0,boot1,boot2}.sh` are pure scratch + busybox — no host
-compiler, no alpine-gcc image, just `podman` + the pinned `busybox:musl`
-digest. `boot3.sh` is also pure scratch + busybox (it's just
-scheme1 + M1pp + hex2pp on `.flat.c` inputs flattened by host `cc -E`).
-`boot4.sh` previously had one host-tooling dep on **aarch64 only**:
-cross-asm of `tcc-libc/aarch64/{start,sys_stubs}.S` to `.o` via
-`$HOST_CC -target aarch64-linux-gnu`. tcc 0.9.26's aarch64 backend has
-no assembler (no `arm64-asm.c`) and no inline-asm support, so .S inputs
-historically needed pre-compilation host-side; the patched arm64-asm.c
-now removes that requirement (see `docs/TCC-ARM64-ASM.md`).
-
-amd64 and riscv64 backends both ship `CONFIG_TCC_ASM` and assemble .S
-in-container via tcc-boot2 itself (stages C+D in `boot4.sh`). The
-riscv64 .S files are macroed behind `#ifdef __TINYC__` because tcc's
-riscv64 asm parser uses 3-operand load/store syntax (`ld rd, base, off`,
-`sd base, src, off` — base first for stores) instead of GAS's
-`ld rd, off(base)` / `sd src, off(base)`; the GAS path stays usable
-for the Makefile's alpine-gcc fallback. The `boot2-alpine-gcc:riscv64`
-image is no longer used by `boot3.sh` / `boot4.sh`.
-
-Replacing the aarch64 .S pair with `.P1pp` (or any in-container-buildable)
-equivalents drops the host-cc dep entirely. After that, every
-`bootN.sh` is `podman` + scratch + busybox only.
-
-Out of scope for this TODO (already accepted as host-side):
-`stage1-flatten.sh` and `libc-flatten.sh` use the host `cc -E`
-preprocessor to produce `tcc.flat.c` and `libc.flat.c`. The unpacked
-`tcc-0.9.26/lib/{lib-arm64.c, va_list.c}` helpers compile cleanly under
-tcc-boot2 inside the container — no host cc on those, just source
-deps.
-
-## Next steps
-
-The cc.scm path is at full parity with the gcc-built control on the
-test suites that pass: every fixture in `tcc-cc` and `tcc-libc`
-passes on both, modulo the riscv64 limitation noted above. Further
-bug-hunting work is open-ended — surface a misbehavior, write a
-`tests/cc` fixture that locks it, fix.
diff --git a/docs/TCC.md b/docs/TCC.md
@@ -270,10 +270,11 @@ The interface for the slot scheme CC fills:
`make tcc-boot2 ARCH=aarch64` now runs that path end-to-end:
`cc.scm + tcc.flat.c → tcc-boot2`, linking against a `cc.scm`-built
`libc.flat.c` instead of mes libc. The `tcc-cc` acceptance suite
-(see [TCC-TODO.md](TCC-TODO.md)) shows full parity with the
-gcc-built control. Alpine + gcc + `tcc-host` (stage 2 of the original
-plan) is no longer in our boot2 path; the busybox + scheme1-cc chain
-covers everything from stage 1's `tcc.flat.c` to a runnable tcc.
+(`make test SUITE=tcc-cc`) shows full parity with the gcc-built
+control on aarch64 and amd64. Alpine + gcc + `tcc-host` (stage 2 of
+the original plan) is no longer in our boot2 path; the busybox +
+scheme1-cc chain covers everything from stage 1's `tcc.flat.c` to a
+runnable tcc.
## Reproducibility
@@ -390,3 +391,133 @@ divergence.
Once tcc-boot0-mes runs, stage 3 is unblocked: the `tcc-boot1` /
`tcc-boot2` rebuilds mirror what live-bootstrap's pass1.kaem already
does, and the script is in place.
+
+## Known limitations (riscv64)
+
+aarch64 and amd64 are at full self-host parity (cc.scm path matches
+the gcc-built control on every fixture). riscv64 has two real open
+items: one rooted in tcc's riscv64 backend, and one that looks like
+a cc.scm miscompile which only shows up on riscv64.
+
+### riscv64: u32 narrowing leaves dirty upper bits
+
+`tests/cc/335-ternary-merge-arith-conv` fails on riscv64 in both
+`tcc-cc[stage2]` and `tcc-cc[stage3]` (identical behavior — the
+fixed-point property holds, the bug is in tcc's RISC-V codegen, not
+in cc.scm or the P1 pipeline). aarch64 and amd64 are green.
+
+The proximate trigger is in `riscv64-gen.c::load()`:
+
+```c
+func3 = size == 1 ? 0 : size == 2 ? 1 : size == 4 ? 2 : 3;
+if (size < 4 && !is_float(sv->type.t) && (sv->type.t & VT_UNSIGNED))
+ func3 |= 4; // promotes lb→lbu, lh→lhu, but skips lw→lwu
+```
+
+The `func3 |= 4` promotion to LWU is gated on `size < 4`, so a 4-byte
+unsigned load uses LW (sign-extending) instead of LWU (zero-extending).
+`gen_cast` to `VT_INT|VT_UNSIGNED` from a wider source emits no
+narrowing — it relies on the use-time load to truncate, but with LW
+the high u32 bits of the source leak through. `(u32)x` where `x` is
+`u64` with bit 31 set then evaluates to `0xFFFFFFFFFFFFFFFF`. This
+same bug is present in upstream tcc mob.
+
+**Why the one-line patch isn't enough.** Widening the gate to
+`size <= 4` (so 4-byte unsigned loads use LWU) regresses
+`017-int-arith` and `128-cast-signedness`. They were passing because
+two compensating bugs canceled out: stock tcc on riscv64 also
+sign-extends unsigned 32-bit immediate constants (`LUI`/`ADDI` with a
+bit-31-set value), so a comparison between an `unsigned int`
+variable (loaded with sign-extending LW) and an `unsigned int`
+constant (loaded with sign-extending LUI/ADDI) had matching dirty
+upper bits and `BEQ` saw them as equal. Fixing only the load breaks
+that join, because the compare path also lies — `BEQ` is a 64-bit
+instruction but C semantics require 32-bit width for `unsigned int ==
+unsigned int`.
+
+**Full fix shape.** Three coupled pieces: (1) load — emit LWU for
+unsigned 4-byte loads; (2) immediate — clear bits 32–63 when
+materializing an unsigned 32-bit constant with bit 31 set; (3)
+compare — eagerly canonicalize 32-bit-typed values into zero-extended
+or sign-extended form (per `VT_UNSIGNED`) after every op that can
+leave the upper half dirty. Pieces 2 and 3 overlap: if values are
+canonicalized at every produce site, the load fix becomes one of many
+sites that need to do it. This is what gcc/clang's RISC-V backends
+do, and it's beyond the scope of the literal-block `simple-patches`
+mechanism — file upstream or write a real canonicalization pass.
+
+For now: known limitation, document, move on. The scalar codegen
+elsewhere on riscv64 is fine — only u32 narrowing of a wider source
+trips it.
+
+### riscv64: tcc0 → tcc1 is not a fixed point (cc.scm behavioral bug)
+
+`boot3.sh` + `boot4.sh` produce four staged compilers:
+
+- `tcc0` = tcc-source compiled by cc.scm (boot3 output)
+- `tcc1` = tcc-source compiled by tcc0 (boot4)
+- `tcc2` = tcc-source compiled by tcc1 (boot4)
+- `tcc3` = tcc-source compiled by tcc2 (boot4)
+
+The fixed-point check is **`tcc2 == tcc3`** (asserted at the end of
+`boot4.sh`, verified on aarch64, amd64, riscv64). On riscv64 the
+stronger `tcc1 == tcc2` does *not* hold: `tcc0(tcc.flat.c)` produces
+a 616100-byte `.o` while `tcc1(tcc.flat.c)` and `tcc2(tcc.flat.c)`
+produce a byte-identical 615892-byte `.o`, so tcc0's output is 208
+bytes larger (200 in `.text` + 8 ripple in symtab/reloc offsets).
+amd64 and aarch64 satisfy `tcc1 == tcc2`; only riscv64 diverges.
+
+This is a **bug to investigate**, not just a "fatter code"
+observation. cc.scm should be a *faithful* (semantics-preserving)
+compiler — slower or larger output is acceptable, but tcc0 and tcc1
+must produce byte-identical output when run on the same source.
+That they don't on riscv64 means cc.scm's translation of tcc.flat.c
+into tcc0 changed what tcc0 *does at runtime*, not just how it's
+encoded. We don't care about peephole optimizations being missed; we
+do care that tcc0 makes different codegen decisions than tcc1
+makes.
+
+#### What's known
+
+The visible symptom: tcc0 emits four RISC-V codegen patterns
+differently than tcc1 does:
+
+| Source pattern | tcc0 emits | tcc1 emits | Δ |
+|---|---|---|---|
+| `x = x - imm` (i32) | `addiw t,zero,imm; addw rd,rs,t` | `addiw rd,rs,imm` | +4 B |
+| `x = x & imm` | `addiw t,zero,imm; and rd,rs,t` | `andi rd,rs,imm` | +4 B |
+| zero-ext after `sext.w` | `sext.w r,r; slli r,r,0x20; srli r,r,0x20` | `sext.w r,r` | +8 B |
+| `x == 0xFFFFFFFF` (i32) | `addiw t,zero,-1; slli/srli; beq x,t,L` | `addi x,x,1; beqz x,L` | +8 B |
+
+These are decision points in `riscv64-gen.c` (immediate-folding,
+zero-ext elision). Same compiler source, same input C, but the
+running tcc0 takes the slow branch where the running tcc1 takes the
+fast one, even though both are compiled from the same `tcc.flat.c`.
+
+#### Hypothesis to test
+
+cc.scm likely miscompiles an integer comparison or bit-test inside
+the immediate-fits-in-instruction guard in `riscv64-gen.c`. Most of
+the missed patterns share the shape `if (small_int_fits) { fold } else
+{ materialize }`. If cc.scm gets the predicate wrong (e.g. signed vs.
+unsigned compare, or wrong branch on a particular bit pattern), tcc0
+falls into the materialize path on inputs where tcc1 takes the fold
+path.
+
+#### Repro / starting point
+
+```sh
+# In the riscv64 container with boot3+boot4 outputs present:
+$TCC0 -nostdlib -c -o /tmp/flat-tcc0.o tcc.flat.c
+$TCC1 -nostdlib -c -o /tmp/flat-tcc1.o tcc.flat.c
+# wc -c /tmp/flat-tcc0.o /tmp/flat-tcc1.o → 616100 vs 615892
+# objdump -d both, normalize addresses, diff to find divergent functions
+```
+
+The first divergent function in disassembly is `tal_free_impl` — a
+small refcount-decrement that hits the "x = x - 1" pattern. Good
+starting point because the function is short and the source path is
+narrow.
+
+Until this is fixed, tcc1 is the "shake-out" stage and tcc2 is the
+canonical compiler.
diff --git a/scripts/boot-build-cc.sh b/scripts/boot-build-cc.sh
@@ -14,7 +14,6 @@
## call at entry. Pair with libp1pp's %trace macro and
## libp1pp__trace runtime helper (in P1/P1pp.P1pp) to
## produce a stderr line per function entry at runtime.
-## See docs/TCC-TODO.md "Tracepoint" section.
## CC_LIB=PFX (optional) — compile in library mode (cc.scm
## --lib=PFX). Skips cc.scm's auto-emitted entry
## stub and trailing :ELF_end so the output catm's