Make trace macro preserve registers - boot2

commit bcb244c97bd9a962f8f7fb969f770d7e8d00a195
parent 060023ae36abcbc5aad2e62383201b9ef0f3834c
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Wed, 29 Apr 2026 20:49:37 -0700

Make trace macro preserve registers

Diffstat:
M P1/P1pp.P1pp  | 37 +++++++++++++++++++++++++++++++------
M docs/DEBUG.md  | 22 ++++++++++++----------
M docs/TCC-TODO.md  | 90 +++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------

3 files changed, 103 insertions(+), 46 deletions(-)
diff --git a/P1/P1pp.P1pp b/P1/P1pp.P1pp
@@ -1441,9 +1441,9 @@
 #
 # %trace(tag_addr, tag_len) — emit a runtime stderr probe at the call
 # site. Prints `[trace @0xHEX TAG]\n` to stderr, where 0xHEX is the
-# runtime address of the instruction immediately following the trace's
-# call sequence (the address of `:@here` in this site's expansion) and
-# TAG is the byte string at [tag_addr..tag_addr+tag_len).
+# runtime address of this trace site (the address of `:@here` in this
+# site's expansion) and TAG is the byte string at
+# [tag_addr..tag_addr+tag_len).
 #
 # `tag_addr` is a label reference token (e.g. `&cc__str_3`) — the
 # caller is responsible for emitting the bytes at that label. cc.scm's
@@ -1457,15 +1457,40 @@
 # guarantees that each function's first instruction *is* a trace call,
 # so the printed address falls on a known function-entry boundary.
 #
-# Clobbers: a0..a2, ra, t0..t2 (per %call ABI). Use only inside a %fn
-# body where the caller has already spilled live argument regs (or
-# doesn't need them past the trace point).
+# Preserves all exposed P1 registers (a0..a3, t0..t2, s0..s3) by
+# borrowing 112 bytes below the stack pointer: 16 for the backend frame
+# prefix plus 88 (11 x 8) for saved registers, 104 aligned up to 112.
+# Use only inside an active %fn body, after %enter and before %eret.
 %macro trace(tag_addr, tag_len)
     :@here
+    %addi(sp, sp, -112)
+    %st(a0, sp, 0)
+    %st(a1, sp, 8)
+    %st(a2, sp, 16)
+    %st(a3, sp, 24)
+    %st(t0, sp, 32)
+    %st(t1, sp, 40)
+    %st(t2, sp, 48)
+    %st(s0, sp, 56)
+    %st(s1, sp, 64)
+    %st(s2, sp, 72)
+    %st(s3, sp, 80)
     %la(a0, &@here)
     %la(a1, tag_addr)
     %li(a2, tag_len)
     %call(&libp1pp__trace)
+    %ld(a0, sp, 0)
+    %ld(a1, sp, 8)
+    %ld(a2, sp, 16)
+    %ld(a3, sp, 24)
+    %ld(t0, sp, 32)
+    %ld(t1, sp, 40)
+    %ld(t2, sp, 48)
+    %ld(s0, sp, 56)
+    %ld(s1, sp, 64)
+    %ld(s2, sp, 72)
+    %ld(s3, sp, 80)
+    %addi(sp, sp, 112)
 %endm
 
 # libp1pp__trace(addr=a0, tag_addr=a1, tag_len=a2) — print
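
Review note on the hunk above: the save-area arithmetic can be checked with a small script. This is a minimal sketch, not part of the repository. The register list and the 8-byte stride are read off the `%st`/`%ld` lines in the macro, the 16-byte prefix comes from the macro comment, and the 16-byte alignment is an assumption inferred from the word "aligned". The names `trace_layout` and `align_up` are hypothetical.

```python
# Hypothetical helper: recompute the %trace save-area layout from the
# register list used by the macro in P1/P1pp.P1pp.
SAVED_REGS = ["a0", "a1", "a2", "a3", "t0", "t1", "t2", "s0", "s1", "s2", "s3"]
WORD = 8           # bytes per saved register (offsets in the macro step by 8)
FRAME_PREFIX = 16  # backend frame prefix, per the macro comment
ALIGN = 16         # ASSUMPTION: stack is kept 16-byte aligned


def align_up(n, a):
    """Round n up to the next multiple of a."""
    return (n + a - 1) // a * a


def trace_layout():
    """Return (offsets, save_bytes, total_bytes) for the trace save area."""
    offsets = {reg: i * WORD for i, reg in enumerate(SAVED_REGS)}
    save_bytes = len(SAVED_REGS) * WORD
    total = align_up(FRAME_PREFIX + save_bytes, ALIGN)
    return offsets, save_bytes, total


if __name__ == "__main__":
    offsets, save_bytes, total = trace_layout()
    for reg, off in offsets.items():
        print(f"%st({reg}, sp, {off})")
    print(f"save area: {save_bytes} bytes, borrowed: {total} bytes")
```

Under these assumptions the computed offsets match the macro (a0 at 0 through s3 at 80), and the prefix plus save area comes to 104 bytes, which rounds up to the 112 bytes the macro subtracts from `sp`.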
diff --git a/docs/DEBUG.md b/docs/DEBUG.md
@@ -13,10 +13,8 @@ body. At runtime each entry prints one line:
 [trace @601a34 main]
 ```
 
-The hex is the runtime address of the instruction immediately after
-the trace's call sequence (i.e. the first instruction of the body
-proper). The trailing word is the mangled function name, interned
-through cc's regular string pool.
+The hex is the runtime address of the trace site. The trailing word is
+the mangled function name, interned through cc's regular string pool.
 
 Build + run:
 
@@ -34,12 +32,16 @@ make tcc-boot2 ARCH=aarch64 CC_TRACE_EMIT=1
 ./build/aarch64/tcc-boot2/tcc-boot2 -version 2>trace.log
 ```
 
-Cost: ~6 instructions + one call per traced function. Off by default;
-the `%trace` macro itself lives in [P1/P1pp.P1pp](../P1/P1pp.P1pp)
-(§Tracepoint) and can also be invoked manually — drop a
-`%trace(&label, len)` into any `combined.M1pp` snapshot under
-`build/$ARCH/.work/<src>/`, re-run the m1pp/M0/hex2 stages, and bisect
-by stderr position.
+Cost: register save/restore traffic plus one call per traced function.
+Off by default; the `%trace` macro itself lives in
+[P1/P1pp.P1pp](../P1/P1pp.P1pp) (§Tracepoint) and can also be invoked
+manually — drop a `%trace(&label, len)` into any `combined.M1pp`
+snapshot under `build/$ARCH/.work/<src>/`, re-run the m1pp/M0/hex2
+stages, and bisect by stderr position. `%trace` preserves the exposed
+P1 registers (`a0..a3`, `t0..t2`, `s0..s3`) by borrowing temporary
+stack space, so it is safe to add inside an active `%fn` body after
+the function prologue. The borrowed area includes the backend's
+standard frame prefix, so trace saves stay below the caller's frame.
 
 To map an address back to its function, see the lookup tool below.
 
diff --git a/docs/TCC-TODO.md b/docs/TCC-TODO.md
@@ -37,7 +37,7 @@ head -c 50000 build/tcc/X86_64/tcc.flat.c \
 # then re-run the podman invocation against tcc.head.c
 ```
 
-## Status — parse + cg-finish complete on tcc.flat.c
+## Status — tcc-boot2 builds; runtime segfault remains
 
 The full 608 KB TU now parses to EOF (line 18800) and cg-finish emits
 ~6.5 MB of P1pp. No semantic-coverage gap remains in this TU. Last
@@ -53,18 +53,51 @@ aarch64 cc-debug run:
 [cc] phase=cg-finish: heap 90 674 020  out-bytes 6 489 215
 ```
 
-The remaining work is downstream of cc.scm:
+The emitted P1pp now assembles through m1pp → M0 → hex2 and links with
+the mes-libc subset via the `tcc-boot2` make target. The active blocker
+is runtime correctness: `build/aarch64/tcc-boot2/tcc-boot2 -version`
+still exits 139 with no stdout.
 
-1. **Assemble the emitted P1pp** through the existing
-   `scripts/boot-build-p1pp.sh` pipeline (m1pp → M0 → hex2). The output
-   is large by P1pp standards — about 2× the scheme1 binary's input —
-   so this exercises m1pp/M0 throughput at a scale they haven't yet
-   been used at. Expect to find table size or scratch caps that need
-   bumping in those tools, or P1pp emission patterns cc.scm produces
-   that the macro layer doesn't accept verbatim.
-2. **Run the resulting `tcc-boot2`** and verify `-version`. Beyond
-   that, milestone 4 in [CC.md §Validation milestones](CC.md) — full
-   self-host of tcc — is the end goal.
+Current traced aarch64 crash tail with `CC_TRACE_EMIT=1`:
+
+```
+[trace @663108 cc__next_nomacro]
+[trace @662d68 cc__next_nomacro_spc]
+[trace @658d20 cc__next_nomacro1]
+[trace @630580 cc__tok_alloc_new]
+[trace @62d228 cc__tal_realloc_impl]
+[trace @607bb4 memcpy]
+[trace @6078e8 _memcpy]
+Segmentation fault (core dumped)
+```
+
+Address lookup for the tail:
+
+```
+0x630580  cc__tok_alloc_new+0x30
+0x62d228  cc__tal_realloc_impl+0x30
+0x607bb4  memcpy+0x30
+0x6078e8  _memcpy+0x30
+```
+
+Source review puts the final `memcpy` after `tal_realloc_impl` returns
+in `tok_alloc_new`:
+
+```
+ts = tal_realloc_impl(&toksym_alloc, 0, sizeof(TokenSym) + len);
+...
+memcpy(ts->str, str, len);
+```
+
+So the next investigation should focus on the returned `TokenSym`
+pointer, the computed `TokenSym::str` offset, and the `len` / `str`
+arguments at that call site. The reduced
+`tests/cc-libc/18-tinyalloc-token.c` fixture currently passes, including
+with traced libc, so the failing condition likely depends on the full
+tcc struct layout or parser token stream rather than TinyAlloc alone.
+
+Milestone 4 in [CC.md §Validation milestones](CC.md) remains the end
+goal: compile tcc and verify `tcc-boot2 -version` runs.
 
 Harness target: `make tcc-boot2 ARCH=amd64` (see Makefile +
 `scripts/boot-build-cc.sh`) drives stage1-flatten on the host, runs
@@ -281,32 +314,29 @@ decl complete with parse heap at ~31 MB on the 1612-line cut.
 
 See [DEBUG.md](DEBUG.md) — `CC_TRACE_EMIT=1` injects per-function-entry
 stderr probes; `m1-symbols.py lookup` resolves the printed addresses
-back to functions.
+back to functions. `%trace` now saves/restores all exposed P1 registers
+(`a0..a3`, `t0..t2`, `s0..s3`) by borrowing stack space below the
+current `%fn` frame, so manual probes can be inserted in live code
+without clobbering caller state.
 
 ## Expected next-tier blockers (downstream of cc.scm)
 
-The semantic parser has covered every construct in this TU. The next
-likely walls live in the assembly side and at runtime:
-
-- **m1pp / M0 / hex2 caps under a 6.5 MB P1pp**. These tools have only
-  ever been driven against scheme1-scale inputs (tens to hundreds of
-  KB of source, maybe a few MB after expansion). cc.scm's tcc.c output
-  is ~6.5 MB pre-expansion. Expect symbol-table, line-buffer, or
-  scratch-arena caps to need bumping.
-- **Patterns cc.scm emits that m1pp / M0 don't accept**. Until now the
-  cc has only been validated against the small `tests/cc/*` programs.
-  Larger programs may hit edge cases in label naming, literal sizing,
-  or directive ordering that the existing tests didn't reach.
-- **Wall-clock**. Parsing to EOF takes ~30 s under scheme1 today;
-  cg-finish adds another bump. Assembly is in addition. A first end-
-  to-end run will set the baseline.
+The semantic parser has covered every construct in this TU, and the
+large P1pp output now makes it through m1pp / M0 / hex2. The next likely
+walls are runtime/codegen mismatches:
+
 - **`tcc-boot2 -version` correctness**. Even when the toolchain
   produces an ELF, the runtime still has to walk through tcc's setup
   (string-table init, command-line parsing, output for `-version`)
   without tripping on cg semantics that pass the small tests but
   diverge from C in subtle ways.
-- **libc**. The 39 unresolved externals in [LIBC.txt](LIBC.txt) are
-  unmet — there is no libc in the link today. See §libc strategy below.
+- **Struct layout / flexible-tail object correctness**. The current
+  crash path is `tok_alloc_new` copying into `TokenSym::str`, so offsets
+  around `TokenSym`, `TinyAlloc`, and related tcc structs are high-value
+  targets for small focused tests.
+- **libc behavior under full tcc load**. The mes-libc subset is now in
+  the link, but runtime helpers still need validation under tcc's actual
+  allocation/string/token workloads.
 
 The end goal is milestone 4 in [CC.md §Validation milestones](CC.md)
 — "Compile tcc.c (under the tcc-mes defines) → tcc-boot2; verify
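
Review note: the crash tail quoted in docs/TCC-TODO.md can be pulled out of a `trace.log` with a few lines of Python. This is a minimal sketch, assuming the line shape `[trace @HEX TAG]` shown in DEBUG.md (an optional `0x` prefix is also accepted, since the macro comment writes `@0xHEX`). The function names `parse_trace` and `tail` are hypothetical, and the script does not replace the repository's `m1-symbols.py lookup` step.

```python
import re

# Matches "[trace @601a34 main]" and "[trace @0x601a34 main]".
TRACE_RE = re.compile(r"^\[trace @(?:0x)?([0-9a-fA-F]+) (.*)\]$")


def parse_trace(lines):
    """Return (address, tag) pairs for every trace line, in order.

    Non-trace lines (for example the shell's segfault message) are skipped.
    """
    entries = []
    for line in lines:
        m = TRACE_RE.match(line.strip())
        if m:
            entries.append((int(m.group(1), 16), m.group(2)))
    return entries


def tail(entries, n=4):
    """Return the last n trace entries, i.e. the calls nearest the crash."""
    return entries[-n:]


if __name__ == "__main__":
    # Sample taken from the crash tail recorded in docs/TCC-TODO.md.
    sample = [
        "[trace @630580 cc__tok_alloc_new]",
        "[trace @62d228 cc__tal_realloc_impl]",
        "[trace @607bb4 memcpy]",
        "[trace @6078e8 _memcpy]",
        "Segmentation fault (core dumped)",
    ]
    for addr, tag in tail(parse_trace(sample)):
        print(f"0x{addr:x}  {tag}")
```

The last entry printed is the most recently entered traced function, which is where a bisect by stderr position would start.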
