commit bcb244c97bd9a962f8f7fb969f770d7e8d00a195
parent 060023ae36abcbc5aad2e62383201b9ef0f3834c
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Wed, 29 Apr 2026 20:49:37 -0700
Make trace macro preserve registers
Diffstat:
3 files changed, 103 insertions(+), 46 deletions(-)
diff --git a/P1/P1pp.P1pp b/P1/P1pp.P1pp
@@ -1441,9 +1441,9 @@
#
# %trace(tag_addr, tag_len) — emit a runtime stderr probe at the call
# site. Prints `[trace @0xHEX TAG]\n` to stderr, where 0xHEX is the
-# runtime address of the instruction immediately following the trace's
-# call sequence (the address of `:@here` in this site's expansion) and
-# TAG is the byte string at [tag_addr..tag_addr+tag_len).
+# runtime address of this trace site (the address of `:@here` in this
+# site's expansion) and TAG is the byte string at
+# [tag_addr..tag_addr+tag_len).
#
# `tag_addr` is a label reference token (e.g. `&cc__str_3`) — the
# caller is responsible for emitting the bytes at that label. cc.scm's
@@ -1457,15 +1457,40 @@
# guarantees that each function's first instruction *is* a trace call,
# so the printed address falls on a known function-entry boundary.
#
-# Clobbers: a0..a2, ra, t0..t2 (per %call ABI). Use only inside a %fn
-# body where the caller has already spilled live argument regs (or
-# doesn't need them past the trace point).
+# Preserves all exposed P1 registers (a0..a3, t0..t2, s0..s3) by
+# borrowing 112 bytes below sp: a 16-byte backend frame prefix plus 88
+# bytes of saved registers (104, rounded up to 16-byte alignment). Use
+# only inside an active %fn body, after %enter and before %eret.
%macro trace(tag_addr, tag_len)
:@here
+ %addi(sp, sp, -112)
+ %st(a0, sp, 0)
+ %st(a1, sp, 8)
+ %st(a2, sp, 16)
+ %st(a3, sp, 24)
+ %st(t0, sp, 32)
+ %st(t1, sp, 40)
+ %st(t2, sp, 48)
+ %st(s0, sp, 56)
+ %st(s1, sp, 64)
+ %st(s2, sp, 72)
+ %st(s3, sp, 80)
%la(a0, &@here)
%la(a1, tag_addr)
%li(a2, tag_len)
%call(&libp1pp__trace)
+ %ld(a0, sp, 0)
+ %ld(a1, sp, 8)
+ %ld(a2, sp, 16)
+ %ld(a3, sp, 24)
+ %ld(t0, sp, 32)
+ %ld(t1, sp, 40)
+ %ld(t2, sp, 48)
+ %ld(s0, sp, 56)
+ %ld(s1, sp, 64)
+ %ld(s2, sp, 72)
+ %ld(s3, sp, 80)
+ %addi(sp, sp, 112)
%endm
# libp1pp__trace(addr=a0, tag_addr=a1, tag_len=a2) — print
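Review note: the save-area arithmetic in the `%trace` comment can be sanity-checked with a short sketch. This is a hypothetical helper, not part of the commit. The register order and offsets are copied from the macro body; the 8-byte slot size and 16-byte stack alignment are assumptions inferred from the offsets and the 112-byte adjustment.

```python
# Hypothetical layout check for the %trace save area (not repository code).
SAVED_REGS = ["a0", "a1", "a2", "a3", "t0", "t1", "t2", "s0", "s1", "s2", "s3"]
SLOT_BYTES = 8      # assumed: one 64-bit slot per saved register
FRAME_PREFIX = 16   # backend frame prefix, as stated in the macro comment
STACK_ALIGN = 16    # assumed stack alignment


def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def trace_layout():
    """Return (per-register offsets, save bytes, total sp adjustment)."""
    offsets = {reg: index * SLOT_BYTES for index, reg in enumerate(SAVED_REGS)}
    save_bytes = len(SAVED_REGS) * SLOT_BYTES
    frame_bytes = align_up(FRAME_PREFIX + save_bytes, STACK_ALIGN)
    return offsets, save_bytes, frame_bytes


if __name__ == "__main__":
    offsets, save_bytes, frame_bytes = trace_layout()
    for reg, off in offsets.items():
        print(f"%st({reg}, sp, {off})")
    print(f"save area: {save_bytes} bytes, sp adjustment: {frame_bytes} bytes")
```

Under these assumptions the prefix plus save area is 104 bytes, and rounding up to the alignment gives the 112 used by `%addi(sp, sp, -112)`.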
diff --git a/docs/DEBUG.md b/docs/DEBUG.md
@@ -13,10 +13,8 @@ body. At runtime each entry prints one line:
[trace @601a34 main]
```
-The hex is the runtime address of the instruction immediately after
-the trace's call sequence (i.e. the first instruction of the body
-proper). The trailing word is the mangled function name, interned
-through cc's regular string pool.
+The hex is the runtime address of the trace site. The trailing word is
+the mangled function name, interned through cc's regular string pool.
Build + run:
@@ -34,12 +32,16 @@ make tcc-boot2 ARCH=aarch64 CC_TRACE_EMIT=1
./build/aarch64/tcc-boot2/tcc-boot2 -version 2>trace.log
```
-Cost: ~6 instructions + one call per traced function. Off by default;
-the `%trace` macro itself lives in [P1/P1pp.P1pp](../P1/P1pp.P1pp)
-(§Tracepoint) and can also be invoked manually — drop a
-`%trace(&label, len)` into any `combined.M1pp` snapshot under
-`build/$ARCH/.work/<src>/`, re-run the m1pp/M0/hex2 stages, and bisect
-by stderr position.
+Cost: 11 register saves and 11 restores plus one call per traced function.
+Off by default; the `%trace` macro itself lives in
+[P1/P1pp.P1pp](../P1/P1pp.P1pp) (§Tracepoint) and can also be invoked
+manually — drop a `%trace(&label, len)` into any `combined.M1pp`
+snapshot under `build/$ARCH/.work/<src>/`, re-run the m1pp/M0/hex2
+stages, and bisect by stderr position. `%trace` preserves the exposed
+P1 registers (`a0..a3`, `t0..t2`, `s0..s3`) by borrowing temporary
+stack space, so it is safe to add inside an active `%fn` body after
+the function prologue. The borrowed area includes the backend's
+standard frame prefix, so trace saves stay below the caller's frame.
To map an address back to its function, see the lookup tool below.
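Review note: for bisecting by stderr position, a small parser over the captured log can help. The sketch below is a hypothetical helper, not the repository's lookup tool. It assumes the line shape shown in DEBUG.md (`[trace @601a34 main]`) and also accepts an optional `0x` prefix, since the macro comment describes the address as `0xHEX`.

```python
import re

# Matches "[trace @601a34 main]" and "[trace @0x601a34 main]".
TRACE_LINE = re.compile(r"^\[trace @(?:0x)?([0-9a-fA-F]+) (.*)\]$")


def parse_trace_line(line):
    """Return (address, tag) for a trace line, or None for any other line."""
    match = TRACE_LINE.match(line.strip())
    if match is None:
        return None
    return int(match.group(1), 16), match.group(2)


def last_trace(lines):
    """Return the last (address, tag) seen, i.e. the final probe reached."""
    last = None
    for line in lines:
        parsed = parse_trace_line(line)
        if parsed is not None:
            last = parsed
    return last


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python last_trace.py < trace.log
    result = last_trace(sys.stdin)
    if result is not None:
        print(f"last trace: 0x{result[0]:x} {result[1]}")
```

Non-trace lines such as a shell's `Segmentation fault` message are skipped, so the result is the last function entry reached before the crash.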
diff --git a/docs/TCC-TODO.md b/docs/TCC-TODO.md
@@ -37,7 +37,7 @@ head -c 50000 build/tcc/X86_64/tcc.flat.c \
# then re-run the podman invocation against tcc.head.c
```
-## Status — parse + cg-finish complete on tcc.flat.c
+## Status — tcc-boot2 builds; runtime segfault remains
The full 608 KB TU now parses to EOF (line 18800) and cg-finish emits
~6.5 MB of P1pp. No semantic-coverage gap remains in this TU. Last
@@ -53,18 +53,51 @@ aarch64 cc-debug run:
[cc] phase=cg-finish: heap 90 674 020 out-bytes 6 489 215
```
-The remaining work is downstream of cc.scm:
+The emitted P1pp now assembles through m1pp → M0 → hex2 and links with
+the mes-libc subset via the `tcc-boot2` make target. The active blocker
+is runtime correctness: `build/aarch64/tcc-boot2/tcc-boot2 -version`
+still exits 139 (128 + SIGSEGV) with no stdout.
-1. **Assemble the emitted P1pp** through the existing
- `scripts/boot-build-p1pp.sh` pipeline (m1pp → M0 → hex2). The output
- is large by P1pp standards — about 2× the scheme1 binary's input —
- so this exercises m1pp/M0 throughput at a scale they haven't yet
- been used at. Expect to find table size or scratch caps that need
- bumping in those tools, or P1pp emission patterns cc.scm produces
- that the macro layer doesn't accept verbatim.
-2. **Run the resulting `tcc-boot2`** and verify `-version`. Beyond
- that, milestone 4 in [CC.md §Validation milestones](CC.md) — full
- self-host of tcc — is the end goal.
+Current traced aarch64 crash tail with `CC_TRACE_EMIT=1`:
+
+```
+[trace @663108 cc__next_nomacro]
+[trace @662d68 cc__next_nomacro_spc]
+[trace @658d20 cc__next_nomacro1]
+[trace @630580 cc__tok_alloc_new]
+[trace @62d228 cc__tal_realloc_impl]
+[trace @607bb4 memcpy]
+[trace @6078e8 _memcpy]
+Segmentation fault (core dumped)
+```
+
+Address lookup for the tail:
+
+```
+0x630580 cc__tok_alloc_new+0x30
+0x62d228 cc__tal_realloc_impl+0x30
+0x607bb4 memcpy+0x30
+0x6078e8 _memcpy+0x30
+```
+
+Source review puts the final `memcpy` after `tal_realloc_impl` returns
+in `tok_alloc_new`:
+
+```
+ts = tal_realloc_impl(&toksym_alloc, 0, sizeof(TokenSym) + len);
+...
+memcpy(ts->str, str, len);
+```
+
+So the next investigation should focus on the returned `TokenSym`
+pointer, the computed `TokenSym::str` offset, and the `len` / `str`
+arguments at that call site. The reduced
+`tests/cc-libc/18-tinyalloc-token.c` fixture currently passes, including
+with traced libc, so the failing condition likely depends on the full
+tcc struct layout or parser token stream rather than TinyAlloc alone.
+
+Milestone 4 in [CC.md §Validation milestones](CC.md) remains the end
+goal: compile tcc and verify `tcc-boot2 -version` runs.
Harness target: `make tcc-boot2 ARCH=amd64` (see Makefile +
`scripts/boot-build-cc.sh`) drives stage1-flatten on the host, runs
@@ -281,32 +314,29 @@ decl complete with parse heap at ~31 MB on the 1612-line cut.
See [DEBUG.md](DEBUG.md) — `CC_TRACE_EMIT=1` injects per-function-entry
stderr probes; `m1-symbols.py lookup` resolves the printed addresses
-back to functions.
+back to functions. `%trace` now saves/restores all exposed P1 registers
+(`a0..a3`, `t0..t2`, `s0..s3`) in stack space borrowed below the
+current stack pointer, so manual probes can be inserted into live
+`%fn` bodies without clobbering caller state.
## Expected next-tier blockers (downstream of cc.scm)
-The semantic parser has covered every construct in this TU. The next
-likely walls live in the assembly side and at runtime:
-
-- **m1pp / M0 / hex2 caps under a 6.5 MB P1pp**. These tools have only
- ever been driven against scheme1-scale inputs (tens to hundreds of
- KB of source, maybe a few MB after expansion). cc.scm's tcc.c output
- is ~6.5 MB pre-expansion. Expect symbol-table, line-buffer, or
- scratch-arena caps to need bumping.
-- **Patterns cc.scm emits that m1pp / M0 don't accept**. Until now the
- cc has only been validated against the small `tests/cc/*` programs.
- Larger programs may hit edge cases in label naming, literal sizing,
- or directive ordering that the existing tests didn't reach.
-- **Wall-clock**. Parsing to EOF takes ~30 s under scheme1 today;
- cg-finish adds another bump. Assembly is in addition. A first end-
- to-end run will set the baseline.
+The semantic parser has covered every construct in this TU, and the
+large P1pp output now makes it through m1pp / M0 / hex2. The next likely
+walls are runtime/codegen mismatches:
+
- **`tcc-boot2 -version` correctness**. Even when the toolchain
produces an ELF, the runtime still has to walk through tcc's setup
(string-table init, command-line parsing, output for `-version`)
without tripping on cg semantics that pass the small tests but
diverge from C in subtle ways.
-- **libc**. The 39 unresolved externals in [LIBC.txt](LIBC.txt) are
- unmet — there is no libc in the link today. See §libc strategy below.
+- **Struct layout / flexible-tail object correctness**. The current
+ crash path is `tok_alloc_new` copying into `TokenSym::str`, so offsets
+ around `TokenSym`, `TinyAlloc`, and related tcc structs are high-value
+ targets for small focused tests.
+- **libc behavior under full tcc load**. The mes-libc subset is now in
+ the link, but runtime helpers still need validation under tcc's actual
+ allocation/string/token workloads.
The end goal is milestone 4 in [CC.md §Validation milestones](CC.md)
— "Compile tcc.c (under the tcc-mes defines) → tcc-boot2; verify