docs: tcc-cc bug investigation report + gdb harness - boot2

commit db08235c3ef5e143f15c843c191dd2d9e08f446b
parent e586fa17898c1d3b99304f4cf22af60b00041837
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Sat,  2 May 2026 15:50:45 -0700

docs: tcc-cc bug investigation report + gdb harness

docs/TCC-CC-INVESTIGATION.md documents the chase for the assert-fail-0
failures in the tcc-cc suite (14 of 15 failures hit the same site). Bug is
localized but not fixed: cc.scm-built tcc-boot2 corrupts ret.type in
cc__unary's function-call-return path, which makes is_float() return true on
a junk pointer-like value, allocates a float register for an int return, and
later trips a legitimate assert in arm64-gen.c:load() for the unsupported
int<->float register-pair case.

The report captures confirmed facts (verified by gdb on the running binary),
ruled-out hypotheses, ranked candidate root causes, and concrete next-step
suggestions so a fresh agent can pick this up without re-deriving the chain.

scripts/dbg-load-cbz.sh is the working gdb harness — runs in the
boot2-alpine-gcc:aarch64 container, breaks at key addresses in cc__unary /
cc__load / cc__vsetc / cc__is_float, and dumps register/stack state.
Addresses listed in the report match the binary at the report's commit.

Diffstat:
A docs/TCC-CC-INVESTIGATION.md  | 303 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A scripts/dbg-load-cbz.sh  | 36 ++++++++++++++++++++++++++++++++++++

2 files changed, 339 insertions(+), 0 deletions(-)
diff --git a/docs/TCC-CC-INVESTIGATION.md b/docs/TCC-CC-INVESTIGATION.md
@@ -0,0 +1,303 @@
+# tcc-cc bug investigation: 14 fixtures fail with `assert fail: 0@12051`
+
+## Status
+
+Bug **localized but not fixed**. Root cause is in `cc.scm`-built `tcc-boot2`'s
+runtime behavior: a corrupted `ret.type.t` in `cc__unary`'s function-call-return
+path leads to allocating a float register for an int return value, which then
+trips a legitimate `assert(0)` in `arm64-gen.c:load()` for the unsupported
+mixed int/float register class case.
+
+Of the 15 failing fixtures, 14 hit the same `assert fail: 0` at the same
+`tcc.flat.c:12051` site. The 15th (`220-const-promote`) is a separate
+"compile succeeds, exits wrong" issue not covered here.
+
+## TL;DR for the next agent
+
+You're looking for a **`cc.scm` miscompile of `cc__unary` (in `tcc.flat.c`)
+that corrupts the local `SValue ret` between `ret.type = s->type;`
+(line 7837) and `is_float(ret.type.t)` (line 7840)**.
+
+The corruption is layout-sensitive (any 4-byte instruction added in
+`cc__load` or anywhere in `cc__unary` makes the bug not fire on this
+fixture). The corrupted value reads back as `0x007B4C19` — a vstack
+region pointer-looking value, which has low 4 bits = 9 = `VT_DOUBLE`, so
+`is_float` returns true, `ret.r = TREG_F(0) = 20`, and `vsetc` pushes an
+SValue with `r=20` (a float register) for an int return. Later
+`gv(RC_INT) → load(0, vtop)` correctly asserts because there's no int↔float
+move for the `svr<0x30` register-pair case.
+
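The arithmetic behind that misclassification can be checked in isolation. The sketch below is not the `tcc.flat.c` source: `VT_BTYPE_MASK`, the `VT_FLOAT` and `VT_LDOUBLE` values, and the function name `is_float_sketch` are assumptions for illustration. Only the 4-bit mask and `VT_DOUBLE = 9` are taken from the report.

```c
/* Assumed constants; the report only states the 4-bit mask and VT_DOUBLE = 9. */
#define VT_BTYPE_MASK 0x0f
#define VT_FLOAT      8    /* assumption */
#define VT_DOUBLE     9    /* from the report */
#define VT_LDOUBLE    10   /* assumption */

/* Hypothetical stand-in for is_float(): classifies on the basic-type bits only,
 * so any junk value whose low 4 bits happen to be 8, 9 or 10 looks like a float. */
static int is_float_sketch(int t)
{
    int bt = t & VT_BTYPE_MASK;
    return bt == VT_FLOAT || bt == VT_DOUBLE || bt == VT_LDOUBLE;
}
```

Under these assumptions `is_float_sketch(0x007B4C19)` is true because `0x19 & 0x0f` is 9, which is how a pointer-like value ends up selecting the float return register.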
+## Reproduction
+
+```sh
+make test SUITE=tcc-cc NAMES=013-call          # fails: assert fail: 0@12051
+```
+
+(The Makefile already drives `TCC_TARGET=ARM64` for the `tcc-cc` suite.
+Don't run `make tcc-boot2 ARCH=aarch64` standalone without setting
+`TCC_TARGET` — until commit `3317ca3` the default `TCC_TARGET=X86_64` was
+silently producing an x86_64-targeted `tcc.flat.c`. That's fixed.)
+
+To run native gdb against the binary:
+```sh
+podman run --rm --pull=never --platform linux/arm64 \
+    -v "$PWD":/work -w /work boot2-alpine-gcc:aarch64 \
+    sh scripts/dbg-load-cbz.sh
+```
+
+`scripts/dbg-load-cbz.sh` is a working diagnostic harness with breakpoints
+already wired up. To add new breakpoints, edit the gdb commands in the
+heredoc inside the script (it is written to `/tmp/g.gdb` on each run).
+Note: the binary has no symbols, so work in raw addresses.
+
+## Confirmed facts (verified by gdb on the running binary)
+
+1. **The `assert(0)` itself is correct given its inputs.** At the failing
+   `load()` call: `r=0` (target = `TREG_R(0)` = X0, an int reg) and
+   `sv->r=0x14` (= `TREG_F(0)` = 20, a float reg). `arm64-gen.c:12042-51`
+   asserts because there's no defined cross-class register move for the
+   `svr<0x30` branch. So the bug is upstream.
+
+2. **`vtop->r=20` was set by a `vsetc` call** at `LR=0x6c777c`, which is
+   `cc__unary` line 7864:
+   ```c
+   for (r = ret.r + ret_nregs + !ret_nregs; r-- > ret.r;) {
+       vsetc(&ret.type, r, &ret.c);
+       vtop->r2 = ret.r2;
+   }
+   ```
+   For `ret.r=20` (= TREG_F(0)) and `ret_nregs=1`, the loop runs once with
+   `r=20`, so `vsetc` pushes an SValue with `r=20`.
+
+3. **`ret.r` was set to 20 by line 7841** because `is_float(ret.type.t)`
+   returned true:
+   ```c
+   if (is_float(ret.type.t)) {
+       ret.r = reg_fret(ret.type.t);   // TREG_F(0) = 20
+   } else {
+       ret.r = ((0));
+   }
+   ```
+
+4. **`is_float` was called with `t=0x007B4C19`**, not a valid type code.
+   Low 4 bits = 9 (= `VT_DOUBLE`), so `is_float` returns true. gdb output
+   confirmed: `is_float(t=0x7b4c19) -> TRUE  lr=0x6c6e30`.
+
+5. **`ret.type.t` (read at SP+704 in `cc__unary`'s frame) holds
+   `0x007B4C19`** — a vstack-region pointer-like value, NOT a type code.
+   It must have been corrupted between line 7837 (the assignment `ret.type
+   = s->type`) and line 7840 (the `is_float` read).
+
+6. **Adjacent stack bytes look like memcpy loop variables.** At the
+   moment of the failing `is_float`:
+   - `[SP+704..711] = 0x007B4C19`
+   - `[SP+712..719] = 0x007B4C1A`
+
+   These differ by exactly 1, which is the signature of `_memcpy`'s
+   byte-by-byte `dest`/`src` walking (each iteration: `dest++; src++`).
+   `0x7B4C0A` is `vtop` for this fixture; `0x7B4C1A = vtop+16` is exactly
+   `&vtop->r`. So one of these slots is holding a pointer that walked
+   into the middle of an SValue during a struct-copy memcpy.
+
+7. **`mes-libc/string/memcpy.c:_memcpy` is byte-by-byte** (compiled to a
+   loop with 4 LDRBs + ORs + shifts per byte for ld_w, etc). cc.scm
+   compiles its locals into 240 bytes of frame; max slot offset used is
+   216, so it doesn't overflow its own frame. The `memcpy` wrapper
+   around `_memcpy` is the only `_memcpy` caller.
+
+8. **The bug is `cc.scm`-specific.** The gcc-built control
+   (`scripts/run-gcc-libc-flat-tcc.sh`) compiles the same `tcc.flat.c` and
+   passes 177/178 of the same fixtures. So the C source is fine; it's
+   cc.scm's lowering that breaks.
+
+9. **The failure is layout-sensitive.** Inserting any 4-byte instruction (even
+   `%addi(t0, t0, 0)`) anywhere before the failing site in `cc__load`
+   makes the test pass. The CBZ at `0x73B6C0` (mod 64 = 0) is *not*
+   the cause — replacing all CBZ/CBNZ with CMP+B.cond+BR (which shifts
+   load() by ~144 bytes) didn't fix it. Layout sensitivity comes from
+   `tcc-boot2`'s runtime state changing as code positions shift, not from
+   any specific instruction's alignment.
+
+10. **`%li(rd, imm)` lowering was changed** from LDR-literal-pool to
+    MOVZ/MOVK chain (4 instructions, 16 bytes — same size as before).
+    This was investigated as a possible alignment fix; it isn't, but the
+    new lowering is kept as a defensible cleanup that eliminates literal
+    pool entries from the executable instruction stream.
+
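The pointer arithmetic in fact 6 can be reproduced with `offsetof`. This is a sketch under assumptions: `ctype_sketch` and `svalue_sketch` are hypothetical, cut-down stand-ins for `CType` and `SValue` (the real `SValue` has more fields), and the offsets hold only on an LP64 target with 8-byte pointers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical SValue-shaped layout; only the leading fields matter here. */
struct sym_sketch;                               /* opaque stand-in for Sym */
typedef struct { int t; struct sym_sketch *ref; } ctype_sketch;
typedef struct { ctype_sketch type; unsigned short r; unsigned short r2; } svalue_sketch;

/* Address of the r field for an SValue starting at `base` (sketch layout). */
static uintptr_t r_field_addr(uintptr_t base)
{
    return base + offsetof(svalue_sketch, r);
}

/* Byte offset of `addr` within the SValue that starts at `base`. */
static size_t offset_in_svalue(uintptr_t base, uintptr_t addr)
{
    return (size_t)(addr - base);
}
```

With `vtop = 0x7B4C0A`, `r_field_addr` gives `0x7B4C1A`, matching `&vtop->r` in the report, and `0x7B4C19` sits at offset 15, which in this sketch layout is the last byte of `type.ref`.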
+## Hypotheses, ranked by likelihood
+
+### H1 (most likely): cc.scm slot-allocator bug — `ret`'s slot overlaps memcpy state
+
+The fact that two slots in `cc__unary`'s frame (`SP+704` and `SP+712`)
+hold sequential pointer values one byte apart is the signature of
+`_memcpy`'s `dest`/`src` loop variables. cc.scm allocates locals as
+fixed slot offsets per function, but if its bookkeeping for `ret`'s
+slot collides with another local *or* if there's interference from a
+helper called via `gfunc_call`, then `ret.type.t` and `ret.type.ref` get
+clobbered.
+
+The shape of the corrupted value (`0x7B4C19` = vstack address midway
+through an SValue) strongly suggests the leak is from inside an
+SValue-copying memcpy. There are NO C statements between line 7837
+(set `ret.type`) and line 7840 (read `ret.type.t`), so the candidate
+sources are narrow:
+- how cc.scm compiles **line 7837 itself**: `ret.type = s->type`,
+  which is a 16-byte struct copy via memcpy;
+- or a related copy emitted by cc.scm for a temporary.
+
+**Investigation steps:**
+1. Generate `cc__unary`'s P1pp around the function-call-return path
+   (lines ~7800-7870). Look for the slot offsets used for `ret.type` and
+   `s->type` and any `%call(&memcpy)` between them.
+2. Compare cc.scm's computed slot offset for `ret.type.t` against
+   what's actually loaded at the failing address `0x6c6dd0` (loads from
+   `SP+704`). Do they match?
+3. Hypothesis-test by adding a deliberate stack-padding local in
+   `unary()` (e.g. `volatile int __pad[64];` near `ret`) and re-running.
+   If that fixes it, slot allocation is the issue.
+
+### H2: cc.scm miscompile of `s->type` member access
+
+Maybe `s->type` is being computed wrong — reading from the wrong
+offset within Sym, returning a pointer-like value. cc.scm's C grammar
+includes `member-of-pointer-deref`; if `s->type` (where type is a
+nested struct) is mis-translated to `&s->type` (the address) or to
+`s + offsetof(Sym, type) + N` for the wrong N, the source of the
+struct copy is wrong.
+
+**Investigation steps:**
+1. Look at `cc.scm`'s parsing/codegen for `->` accessing a struct
+   member that is itself a struct.
+2. Inspect the emitted P1pp for `ret.type = s->type` — does the
+   memcpy source address look correct?
+
+### H3: ABI mismatch in cc.scm's struct-copy lowering for `CType`
+
+`CType` is 16 bytes (`int t` + `Sym *ref`), including 4 bytes of padding
+after `t` for 8-byte alignment of `ref`. If cc.scm's struct-copy lowering
+walks bytes 0..15 of source but skips/duplicates the padding region
+differently between source and destination, `ret.type.ref`'s bytes can
+land in `ret.type.t`'s position.
+
+The previous SValue struct-copy fix (`cc/cg-assign-struct`) was
+specifically called out in `docs/TCC-TODO.md` — a similar fix may be
+needed for nested CType copy.
+
+**Investigation steps:**
+1. Read `cc.scm`'s `cg-assign-struct` and verify it handles CType-sized
+   (16 byte) copies. Check whether `cc-assign` for the case
+   "lhs is a struct member that is itself a struct" routes through
+   `cg-assign-struct` correctly.
+2. Try adding a regression test: a tiny C program that does `dst.type
+   = src.type;` where type is a `CType`-shaped struct, run through
+   cc.scm and verify field values are preserved.
+
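One possible shape for the regression test in step 2 is sketched below. The type names `CTypeLike`, `SValueLike` and `SymLike` and the function `check_nested_ctype_copy` are hypothetical, and whether cc.scm accepts every construct used here is an assumption. The sentinels are there to catch both a missed copy and a copy that writes outside `dst.type`.

```c
/* Hypothetical CType/SValue-shaped structs; they mirror the report, not tcc.flat.c. */
typedef struct SymLike { int dummy; } SymLike;
typedef struct { int t; SymLike *ref; } CTypeLike;
typedef struct { CTypeLike type; unsigned short r; unsigned short r2; } SValueLike;

/* Returns 0 when the nested struct copy preserved every field,
 * otherwise a nonzero code naming the first field that went wrong. */
int check_nested_ctype_copy(void)
{
    SymLike sym;
    SValueLike src;
    SValueLike dst;

    sym.dummy = 42;
    src.type.t = 3;
    src.type.ref = &sym;
    src.r = 0x30;
    src.r2 = 0x31;

    /* Sentinel values so stale or overwritten fields are detectable. */
    dst.type.t = -1;
    dst.type.ref = 0;
    dst.r = 0x1234;
    dst.r2 = 0x5678;

    dst.type = src.type;   /* same shape as `ret.type = s->type` */

    if (dst.type.t != 3) return 1;        /* t clobbered or not copied */
    if (dst.type.ref != &sym) return 2;   /* ref clobbered or not copied */
    if (dst.r != 0x1234) return 3;        /* copy overran into r */
    if (dst.r2 != 0x5678) return 4;       /* copy overran into r2 */
    return 0;
}
```

A driver would call `check_nested_ctype_copy()` from `main` and use the result as the exit status, so a fixture runner can tell which field was damaged.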
+### H4 (less likely): vstack pointer corruption
+
+Maybe `vtop` itself is moving incorrectly, and the SValue at the
+"failing vtop position" is actually unused garbage left behind from a
+prior operation. This would mean the "wrong sv->r=20" was set in some
+earlier vstack slot and we're just reading stale memory.
+
+We already confirmed via watchpoint that the `r=0x14` at
+`vtop[0]+16` (= `0x7B4C1A`) was *written* by the memcpy of
+`vtop[-1]` (a vswap), so the source of the bad value is `vtop[-1].r =
+20` immediately before the swap. Tracing upstream, the
+`vsetc(r=0x14)` at `lr=0x6c777c` wrote `r=20` to its target slot, and
+that slot becomes `vtop[-1]` after the next vpush. So the bad `r` is a
+downstream effect of H1-H3, not independent vstack corruption.
+
+### H5 (ruled out, listed for completeness)
+
+- LDR-literal alignment (8-byte literals at 4-byte aligned addresses):
+  ruled out — replaced `%li` with MOVZ/MOVK chain, bug persists.
+- CBZ at cache-line boundary (Apple Silicon erratum): ruled out — replaced
+  CBZ with CMP+B.cond+BR, bug persists.
+- Wrong opcode encoding for cond-branch: ruled out — count of CBZ vs CBNZ
+  in expanded.M1 matches the count of `%ifelse_nez` vs `%cmpset_eqz` in
+  source.
+- Hex2 mis-resolving labels: verified literals in binary point to correct
+  branch targets.
+
+## Suggested next steps in priority order
+
+1. **Read `cc.scm`'s `cg-assign-struct`** (search for `cg-assign-struct`
+   and the nested struct copy path). Verify handling for CType-sized
+   nested struct copies. This tests H3 directly, and the same struct
+   copy (line 7837) is where H1 places the leak.
+
+2. **Build a minimal C reproducer** that triggers the corruption without
+   needing the whole tcc compile. Something like:
+   ```c
+   typedef struct { int t; void *ref; } CType;
+   typedef struct { CType type; int r; } SValue;
+   int test(SValue *sv) {
+       SValue ret;
+       ret.type = sv->type;
+       return ret.type.t;
+   }
+   ```
+   Compile with cc.scm and verify the field copy works correctly. If it
+   doesn't reproduce in isolation, the trigger requires more state
+   (e.g. specific stack frame size or memcpy interaction).
+
+3. **Compare the emitted P1pp for the failing call site against a
+   working call site.** Find another place in `cc__unary` that does a
+   similar struct copy followed by an int read, and diff the two P1pp
+   sequences. The buggy one will have a structural anomaly.
+
+4. **If P1pp looks correct, drop down to gdb tracing.** Use
+   `scripts/dbg-load-cbz.sh` as a starting point. Set watchpoints on
+   stack slots in `cc__unary`'s frame to find what writes the
+   pointer-like value into `ret.type.t`'s slot. Don't add new code
+   inside `cc__unary` for diagnostic — it shifts the layout and the bug
+   disappears.
+
+## Useful addresses (will shift if anything in the load chain changes)
+
+In the **current broken binary** (no `%addi` workaround):
+- `cc__unary`              entry: `0x006BFC70`
+- `cc__unary` is_float call (line 7840): `0x006C6E2C`
+- `cc__unary` vsetc call (line 7864):    `0x006C7778`
+- `cc__load`               entry: `0x007395EC`
+- `cc__load` outer-2 if test CBZ:        `0x0073B6C0` (the actual assert site)
+- `cc__vsetc`              entry: `0x006795BC`
+- `cc__vpop`               entry: `0x0067A6B8`
+- `cc__save_reg`           entry: `0x0067DB94`
+- `cc__get_reg`            entry: `0x0067F130`
+- `cc__gv`                 entry: `0x00680918`
+- `cc__is_float`           entry: `0x00672C54`
+- `cc__vtop` (global ptr): `0x007B4A32`
+- `cc__pvtop` (global ptr): `0x007B4A2A`
+- `cc____vstack` (array):  `0x007B4A3A`
+- `_memcpy`                entry: `0x006068BC`
+
+To regenerate addresses after a build change, use the recipe in
+`scripts/dbg-load-cbz.sh` (anchor on `cc__load`'s entry signature
+`FF0324D1` = SUB SP, SP, #2304, then walk byte counts in expanded.M1).
+
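The anchor signature can be decoded by hand to confirm it is a 2304-byte frame allocation. The helper names below are hypothetical. The bit layout used is the A64 64-bit `SUB (immediate)` form with both register fields set to 31 (SP).

```c
#include <stdint.h>

/* Assemble a 32-bit A64 instruction word from its four little-endian bytes. */
static uint32_t a64_word_le(const uint8_t b[4])
{
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* If `insn` is a 64-bit `SUB SP, SP, #imm`, return the byte amount, else -1. */
static long sub_sp_imm(uint32_t insn)
{
    /* Keep the opcode bits [31:23] and the Rn/Rd fields [9:0]; Rn = Rd = 31. */
    if ((insn & 0xFF8003FFu) != 0xD10003FFu)
        return -1;
    uint32_t imm12 = (insn >> 10) & 0xFFFu;
    uint32_t sh = (insn >> 22) & 1u;     /* 1 means imm12 is shifted left by 12 */
    return (long)(sh ? imm12 << 12 : imm12);
}
```

Feeding the bytes `FF 03 24 D1` through `a64_word_le` gives `0xD12403FF`, and `sub_sp_imm` returns 2304 for it, which agrees with the `SUB SP, SP, #2304` reading above.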
+## Files of interest
+
+- `tcc.flat.c:7745-7870` — `unary()`'s symbol-lookup and function-call
+  paths. The C code itself sets up `ret` correctly (see confirmed fact 8).
+- `tcc.flat.c:5006-5022` — `vsetc` source (where the wrong r=20 ends up
+  written into vtop).
+- `tcc.flat.c:11999-12053` — `arm64-gen.c:load()` (where the assert
+  fires; not actually wrong).
+- `cc/cc.scm` — the cc.scm compiler. Search for `cg-assign-struct`,
+  member access through pointer, slot allocation logic.
+- `P1/P1-aarch64.M1pp:385-403` — `p1_li` (already changed to MOVZ/MOVK,
+  not relevant to the bug).
+- `scripts/dbg-load-cbz.sh` — gdb diagnostic harness with working
+  breakpoints.
+
+## What's already in the tree from this investigation
+
+Committed:
+- `3317ca3` — Makefile: ARCH controls TCC_TARGET (no longer silently
+  building x86_64-targeted tcc-boot2 when running with ARCH=aarch64).
+
+Uncommitted (working tree):
+- `P1/P1-aarch64.M1pp` — `p1_li` rewritten as MOVZ/MOVK chain. Same
+  16-byte size, no functional change beyond eliminating literal-pool
+  reads. **Defensible cleanup; does not fix the bug.**
+- `scripts/dbg-load-cbz.sh` — gdb diagnostic helper (untracked).
diff --git a/scripts/dbg-load-cbz.sh b/scripts/dbg-load-cbz.sh
@@ -0,0 +1,36 @@
+#!/bin/sh
+set -e
+apk add --quiet gdb >/dev/null 2>&1
+
+cd /work
+cat > /tmp/g.gdb << 'GDB'
+set pagination off
+set width 0
+set print address off
+
+# vstack[7] is at 0x7b4c0a (sv->r at +16 = 0x7b4c1a)
+# Goal: trace what writes r=20 (0x14) into that field.
+# Break at cc__vsetc entry: dump type->t (arg 0), r (arg 1), and the caller's LR.
+break *0x6795BC
+commands
+  printf "  vsetc(type.t=%d, r=0x%llx) lr=0x%llx\n", *(int *)$x0, $x1, $lr
+  continue
+end
+
+# cc__unary's is_float call (line 7840): dump the argument in X0 and the SValue at vtop.
+break *0x6c6e2c
+commands
+  printf "  is_float at 7840: X0=0x%llx  vtop=0x%llx\n", $x0, *(unsigned long long *)0x7b4a32
+  printf "    [vtop+0..7]=0x%llx  [vtop+8..15]=0x%llx  [vtop+16..23]=0x%llx\n", *(unsigned long long *)(*(unsigned long long *)0x7b4a32), *(unsigned long long *)(*(unsigned long long *)0x7b4a32 + 8), *(unsigned long long *)(*(unsigned long long *)0x7b4a32 + 16)
+  continue
+end
+
+# Also: load entry to know when we're in the failing call
+break *0x7395ec
+commands
+  printf ">> load(r=%d sv=0x%llx sv->r=0x%x)\n", $x0, $x1, *(unsigned short *)($x1 + 16)
+  continue
+end
+
+run -nostdlib build/aarch64/tcc-cc/start.o build/aarch64/tcc-cc/mem.o tests/cc/013-call.c -o /tmp/out_013
+GDB
+gdb -batch -x /tmp/g.gdb build/aarch64/tcc-boot2/tcc-boot2 2>&1
