commit dc1fabbf4cea6f8570680c2365efc94d2cbf4a09
parent 0a2faa92c8f3b6a302e7f621f78a92ba4b4180c7
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Fri, 5 Jun 2026 06:40:34 -0700
doc/plan: add stack-trace builtins & runtime backtrace roadmap
Proposes a three-layer design: GCC-compatible __builtin_return_address /
__builtin_frame_address primitives (FP-chain lowering, enabled by kit always
keeping a frame pointer), a freestanding __kit_backtrace capture helper, and a
symbolizing __kit_print_backtrace print path that keeps the DWARF reader on the
hosted side of the freestanding boundary.
Diffstat:
2 files changed, 229 insertions(+), 0 deletions(-)
diff --git a/doc/plan/BACKTRACE.md b/doc/plan/BACKTRACE.md
@@ -0,0 +1,228 @@
+# Plan: stack-trace builtins & runtime backtrace
+
+## Status — 2026-06-05 — proposed; nothing built yet
+
+kit has no way for compiled code to inspect its own call stack. This roadmap
+adds that capability in three layers: GCC-compatible primitive **builtins**
+(`__builtin_return_address`, `__builtin_frame_address`), a freestanding runtime
+**capture** function (`__kit_backtrace`), and a **symbolizing print** path
+(`__kit_print_backtrace`) that turns return addresses into `func at file:line`.
+
+Matching design docs once shipped: [../FRONTENDS.md](../FRONTENDS.md) (the
+builtins), [../RUNTIME.md](../RUNTIME.md) (the rt helpers), [../DWARF.md](../DWARF.md)
+(symbolization).
+
+## Why
+
+- **Portability.** `__builtin_return_address` / `__builtin_frame_address` are a
+ de facto part of the GCC/Clang surface. Real C code (libc `backtrace`,
+ sanitizer shims, allocators, profilers, `unwind`-free panic handlers) uses
+ them; kit currently can't compile any of it.
+- **Diagnostics.** `__kit_assert_fail` (`rt/lib/assert/assert.c`) and the
+ emulator fault path (`src/emu/emu.c`, `compiler_panic`) currently die silently
+ with `__builtin_trap()`. A backtrace at the trap point is the single biggest
+ debuggability win for kit-compiled programs.
+- **It is cheap here, specifically.** kit maintains a frame pointer on **every**
+ backend and has **no `-fomit-frame-pointer`** (x29 on aarch64, rbp on x64,
+ s0/x8 on rv64; `AA_FP = 29` at `src/arch/aa64/native.c:61`). Every prologue
+ stores a `{saved_fp, saved_ra}` frame record. Frame-pointer-chain walking is
+ therefore *reliable*, with no unwind tables and no `.eh_frame` dependency.
+
+## What already exists (and what it can't do)
+
+- **`.eh_frame` CFI** is emitted by default for hosted targets
+ (`src/arch/mc.c:736`, `mc_emit_eh_frame`), and **off for freestanding**.
+- **A CFI unwinder**, `kit_dwarf_unwind_step` (`src/debug/dwarf_cfi.c:213`),
+ interprets FDE/CIE programs — but deliberately takes **no memory provider**, so
+ it *cannot self-unwind a live stack*. It is built for the dbg/JIT path where a
+ session reads target memory out-of-band (`driver/cmd/dbg.c:1010`, the `bt`
+ command). It is not a candidate for in-process capture.
+- **Symbolization** (`kit_dwarf_addr_to_line`, `kit_dwarf_func_at`,
+ `include/kit/dwarf.h:21`) is mature — it backs `addr2line`
+ (`driver/cmd/addr2line.c`) and `dbg bt`. But it lives in **`libkit.a`, not the
+ freestanding runtime `rt/`**. Pulling it into a freestanding image is a
+ non-goal (see L3).
+- **The runtime has zero unwind/backtrace code today.** `rt/lib/stack/` exists
+ but holds only the Windows `chkstk` helper — a natural home for the new
+ capture code.
+
+Design consequence: **capture via the FP chain; symbolize via the existing
+DWARF reader, kept on the hosted side of the boundary.** Do *not* reuse the CFI
+unwinder for self-capture.
+
+---
+
+## L1 — Primitive builtins (`__builtin_return_address`, `__builtin_frame_address`)
+
+GCC semantics: `__builtin_frame_address(n)` returns the frame address of the
+current function (n=0) or of its n-th caller; `__builtin_return_address(n)`
+returns the return address of that same function, i.e. the address in its own
+caller at which execution resumes. The level argument **must be an integer
+constant** (kit validates via the existing `eval_const_int()` path, as
+`__builtin_offsetof` already does at `parse_expr.c:1331`). Out-of-range / runaway
+walks are allowed to return 0 or an arbitrary value, matching GCC's documented
+contract: only level 0 is dependable, and nonzero levels are intended for
+debugging and can have unpredictable effects.
+
+### Lowering: two new CG intrinsics, FP-chain only
+
+Add to `KitCgIntrinsic` (`include/kit/cg.h:916`):
+
+```
+KIT_CG_INTRIN_FRAME_ADDRESS, /* pop level(u32 const); push void* */
+KIT_CG_INTRIN_RETURN_ADDRESS, /* pop level(u32 const); push void* */
+```
+
+Both lower through one shared FP-walk so level 0 and level N use the same path,
+and so level 0's return address comes from the **spilled** frame-record slot (not
+the live LR/RA, which may be clobbered mid-function):
+
+| arch | FP reg | `frame(0)` | walk one frame | return addr from frame F |
+|------|--------|-----------|----------------|--------------------------|
+| aarch64 | x29 | x29 | `fp = *(fp)` | `*(fp + 8)` (saved x30) |
+| x86-64 | rbp | rbp | `fp = *(fp)` | `*(fp + 8)` (pushed retaddr) |
+| rv64 | s0/x8 | s0 | `fp = *(fp - 16)` | `*(fp - 8)` (saved ra) |
+
+For a constant level the walk unrolls to `level` dependent loads for the frame
+address, plus one more load for the return address (`level` is typically 0–2),
+so no loop is emitted. wasm has no FP chain, so it must **diagnose
+unsupported**, exactly as the IRQ/cache intrinsics already do per-arch.
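
The lowering table above can be read as data plus one load per level. The following is a minimal sketch of that reading, not kit's lowering code: `FrameLayout`, `frame_address` and `return_address` are hypothetical names, offsets are expressed in machine words (8 bytes on the 64-bit targets in the table), the rv64 row is the still-unverified assumption from the open questions, and the "stack" is ordinary memory so the walk can be exercised on any host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Slot positions relative to a frame pointer, in machine words.
 * Hypothetical type; values mirror the table in this section. */
typedef struct {
    int prev_fp_slot; /* where the caller's fp was saved */
    int ret_slot;     /* where the return address was saved */
} FrameLayout;

static const FrameLayout LAYOUT_AA64 = { 0, 1 };   /* *(fp), *(fp + 8) */
static const FrameLayout LAYOUT_X64  = { 0, 1 };   /* *(fp), *(fp + 8) */
static const FrameLayout LAYOUT_RV64 = { -2, -1 }; /* *(fp - 16), *(fp - 8); unverified */

/* frame_address(level): follow the saved-fp slot `level` times.
 * With a constant level a compiler would unroll this into straight loads. */
static const uintptr_t *frame_address(const FrameLayout *l,
                                      const uintptr_t *fp, unsigned level)
{
    while (level-- > 0)
        fp = (const uintptr_t *)fp[l->prev_fp_slot];
    return fp;
}

/* return_address(level): one more load, taken from the spilled slot of that
 * frame rather than from a live link register. */
static uintptr_t return_address(const FrameLayout *l,
                                const uintptr_t *fp, unsigned level)
{
    return frame_address(l, fp, level)[l->ret_slot];
}
```

Because the layout is a parameter, the same two functions cover all three rows; only the rv64 entry would change if the real prologue turns out to differ.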
+
+### Files to touch (the standard "new value-producing intrinsic" path)
+
+- `include/kit/cg.h` — two enum entries + doc comments.
+- `src/cg/arith.c:1726` — two rows in the `KitCgIntrinsic → INTRIN_*` table.
+- `src/cg/cgtarget.h:148` — two `INTRIN_*` enum entries
+ (`INTRIN_FRAME_ADDRESS`, `INTRIN_RETURN_ADDRESS`).
+- `lang/c/parse/parse_priv.h:231` + `parse.c:1526` — intern
+ `__builtin_return_address` / `__builtin_frame_address` symbols.
+- `lang/c/parse/parse_expr.c` (in `try_parse_builtin_call`, ~1696–2018) — two
+ handlers: parse the constant level via `eval_const_int`, then emit the
+ intrinsic with result type `void*`. New `cg_adapter.c` helper
+ `pcg_frame_or_return_address(p, kind, level)`.
+- Per-arch O0 lowering: `src/arch/aa64/native.c` (~3572), `src/arch/x64/native.c`
+ (~3378), `src/arch/riscv/native.c` (~2992) — emit the FP-walk loads;
+ `src/arch/wasm/emit.c` (~1590) + `src/arch/c_target/c_emit.c` (~2603) — handle
+ or diagnose (C target can emit `__builtin_*` straight through to the host
+ compiler).
+- Capability hooks: `src/arch/{aa64,x64,riscv,wasm}/arch.c` (alongside the
+ existing `KIT_CG_INTRIN_TRAP` cases at e.g. `aa64/arch.c:197`).
+- **Optimizer (O1/O2):** unlike `INTRIN_TRAP`/`INTRIN_LONGJMP` these are *not*
+ control-flow terminators, so the special-casing scattered through `src/opt`
+ (`pass_cfg.c:43`, `cg_ir_lower.c:286`, `pass_ssa.c:818`, …) does **not** apply.
+ But they **read memory and depend on the live frame**, so they must be modeled
+ as memory-reading / non-pure: not hoistable out of the function, not CSE'd
+ across a call, level-0 not sunk past a frame-pointer change. Audit the O1 emit
+ path (`src/opt/pass_native_emit.c`, `cg_ir_lower.c`) to lower them like an
+ ordinary load with a frame dependency. **This is the main risk area** and wants
+ dedicated O1 tests.
+
+### Tests (L1)
+
+- `test/toy/` — a CG-API toy case exercising both intrinsics at levels 0/1/2.
+- `test/parse/cases/` — `builtin_NN_return_address.c` etc.; an error case for a
+ non-constant level.
+- Smoke (`test/smoke`, exit-code-42 convention): a known 2-deep call chain where
+ `f` returns `__builtin_return_address(0)` and `main` compares it against a
+ label/`&&`-style anchor, asserting it lands inside the caller's range. Run at
+ **O0 and O1** on x64 + aa64 + rv64.
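
A smoke case along these lines might look like the sketch below. It is deliberately weaker than the check described above: instead of asserting that the address lands inside the caller's range (function layout is not something portable C can observe), it only asserts that two distinct call sites yield two distinct, non-null return addresses. The names `ret_addr_of_callee` and `smoke_return_address` are hypothetical, and the code assumes a GCC-compatible compiler for the attribute and the empty `asm`.

```c
#include <assert.h>
#include <stddef.h>

/* noinline keeps a real call at each site; the empty asm stops the compiler
 * from treating the function as side-effect free and merging the two calls. */
__attribute__((noinline)) static void *ret_addr_of_callee(void)
{
    __asm__ volatile("" ::: "memory");
    return __builtin_return_address(0);
}

/* Hypothetical smoke body following the exit-code-42 convention;
 * a real test/smoke case would presumably be `main`. */
int smoke_return_address(void)
{
    void *volatile a = ret_addr_of_callee(); /* call site 1 */
    void *volatile b = ret_addr_of_callee(); /* call site 2 */

    if (a == NULL || b == NULL)
        return 1; /* builtin produced no address at all */
    if (a == b)
        return 2; /* distinct call sites must return to distinct addresses */
    return 42;
}
```

The `volatile` locals keep the calls from becoming tail calls or being folded away at higher optimization levels, which is the situation the O1 run is meant to cover.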
+
+---
+
+## L2 — Capture: `__kit_backtrace` (freestanding runtime fn)
+
+Surface decision (confirmed): **primitives are builtins; capture/print are
+runtime functions**, mirroring the GCC-builtin / glibc-`backtrace` split. New
+freestanding TU `rt/lib/stack/backtrace.c`, declared in a new public runtime
+header `rt/include/kit/backtrace.h`:
+
+```c
+/* Fill buf[0..max) with return addresses, innermost first, skipping the
+ * innermost `skip` frames (skip >= 1 hides __kit_backtrace itself).
+ * Returns the number of frames written. Freestanding: pure FP walk, no libc,
+ * no DWARF, works on every target that keeps a frame pointer (all of kit's). */
+int __kit_backtrace(void** buf, int max, int skip);
+```
+
+Implementation is the L1 walk expressed in portable C: seed from
+`__builtin_frame_address(0)`, then loop, reading the saved-RA slot and following
+the saved-FP slot per the table above (`fp = *(void**)fp` on aarch64 and
+x86-64). Stop on a NULL fp, on an fp that doesn't increase (the stack grows
+down, so this catches cycles and garbage), or on reaching `max`. The per-arch
+slot offsets are the *only* target knob; keep them in one small `static` table
+per the RUNTIME.md "no target-dispatch ifdef" rule (parameterize rather than
+`#ifdef`-cascade: select by a build-time constant the way the int/fp helpers
+do).
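
The loop and its three stop conditions can be sketched as follows. This is an illustration under stated assumptions, not the planned `rt/lib/stack/backtrace.c`: `backtrace_from` is a hypothetical name, it takes the starting frame pointer as an argument (the real function would seed from `__builtin_frame_address(0)`) so it can be run over fake frames, the slot constants are fixed to the aarch64/x86-64 rows of the L1 table, and `skip` simply drops that many innermost records.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The one per-target knob, in machine words relative to fp.
 * Assumed values: the aarch64 / x86-64 rows of the L1 table. */
enum { PREV_FP_SLOT = 0, RET_SLOT = 1 };

/* Hypothetical core of the capture walk. Writes at most `max` return
 * addresses to buf, innermost first, and returns how many were written. */
static int backtrace_from(const uintptr_t *fp, void **buf, int max, int skip)
{
    int n = 0;

    while (fp != NULL && n < max) {
        void *ra = (void *)fp[RET_SLOT];
        const uintptr_t *next = (const uintptr_t *)fp[PREV_FP_SLOT];

        if (skip > 0)
            skip--;        /* drop this frame, keep walking */
        else
            buf[n++] = ra;

        /* The stack grows down, so a caller's frame must sit at a strictly
         * higher address. Anything else is a cycle or garbage: stop. */
        if (next != NULL && (uintptr_t)next <= (uintptr_t)fp)
            break;
        fp = next;
    }
    return n;
}
```

Comparing the frame pointers as integers avoids relying on relational comparison between unrelated pointers, which C leaves undefined.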
+
+- `mk/rt.mk` — add `rt/lib/stack/backtrace.c` to every variant (it already
+ compiles `rt/lib/stack/` for the Windows chkstk helper).
+- **Hook the trap paths:** make `rt/lib/assert/assert.c::__kit_assert_fail` call
+ `__kit_print_backtrace()` (L3) before `__builtin_trap()`. Because the symbol is
+ `weak`, a freestanding user with no output sink can still override it.
+
+### Tests (L2)
+
+- `test/rt/cases/backtrace_capture.c` — build a known N-deep recursion, capture,
+ assert depth and monotonic frame addresses; `return 42` on success. Runs under
+ the existing `test/rt/run.sh` harness across variants.
+
+---
+
+## L3 — Symbolize & print: `__kit_print_backtrace`
+
+This is where the freestanding boundary bites: turning an address into
+`func at file:line` needs the DWARF reader, which is **libkit, not rt**. Three
+sub-options, ordered by how cleanly they respect that boundary. Recommend
+shipping **L3a now**, leaving L3b/L3c as documented extensions.
+
+- **L3a — raw print + out-of-process symbolization (recommended default).**
+ `__kit_print_backtrace()` lives in rt, walks via `__kit_backtrace`, and writes
+ raw lines (`#0 0x401136`, …) to a host-provided sink (a weak
+ `__kit_backtrace_write(const char*, size_t)` the host or `_start` wires to
+ `write(2)`; freestanding default is a no-op). Symbolization is then a separate
+ step through the **existing** `kit addr2line` tool (or a thin new `kit
+ symbolize` that batches). Zero new symbolization code, fully freestanding,
+ matches how minimal panic handlers work in the wild.
+
+- **L3b — in-process self-symbolization (hosted-only).** A trimmed line/func
+ reader (reusing `kit_dwarf_addr_to_line` + `kit_dwarf_func_at`) linked into a
+ **hosted-only** archive — e.g. `libkit_bt.a` or a `*-hosted` rt variant — that
+ opens the running image's own DWARF. Heavy (drags in the DWARF reader and an
+ image-self-map); strictly opt-in, never in the freestanding default. Only build
+ if a concrete consumer needs in-binary symbolized panics.
+
+- **L3c — tool-side auto-backtrace.** `kit run` / `kit emu` / `dbg` already own a
+ DWARF reader and the `dbg bt` rendering path (`driver/cmd/dbg.c:1010`). Hook
+ their fault/trap handlers (e.g. the `EMU_TRAP_FAULT` → `compiler_panic` site in
+ `src/emu/emu.c`) to print a symbolized backtrace automatically. This is the
+ highest-value, lowest-risk symbolized experience because it reuses everything
+ and never crosses into rt. Largely independent of L1/L2 (the tools can unwind
+ via their own session memory + `kit_dwarf_unwind_step`).
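
For the L3a raw-print path, the only formatting needed is an index and a hex address per line, which can be done without libc formatting routines. The sketch below is one way that might look. `bt_sink_fn`, `bt_format_line`, `bt_print_raw` and `stdout_sink` are hypothetical names, and the sink is passed as a function pointer purely to keep the example self-contained; the plan itself proposes a weak `__kit_backtrace_write` symbol instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for the plan's weak __kit_backtrace_write(const char*, size_t). */
typedef void (*bt_sink_fn)(const char *s, size_t n);

/* Format "#<index> 0x<hex>\n" into out (at least 48 bytes) and return the
 * length written. Digits are produced by hand; no NUL terminator is added. */
size_t bt_format_line(char *out, unsigned index, uintptr_t addr)
{
    static const char hex[] = "0123456789abcdef";
    char tmp[20]; /* enough for 10 decimal or 16 hex digits */
    size_t n = 0, t = 0;

    out[n++] = '#';
    do { tmp[t++] = (char)('0' + index % 10); index /= 10; } while (index);
    while (t) out[n++] = tmp[--t];

    out[n++] = ' ';
    out[n++] = '0';
    out[n++] = 'x';
    do { tmp[t++] = hex[addr & 0xf]; addr >>= 4; } while (addr);
    while (t) out[n++] = tmp[--t];

    out[n++] = '\n';
    return n;
}

/* Emit one raw line per captured frame through the sink. */
void bt_print_raw(void *const *frames, int count, bt_sink_fn sink)
{
    char line[48];
    for (int i = 0; i < count; i++) {
        size_t n = bt_format_line(line, (unsigned)i, (uintptr_t)frames[i]);
        sink(line, n);
    }
}

/* Hosted example sink; a freestanding default would be a no-op. */
void stdout_sink(const char *s, size_t n)
{
    fwrite(s, 1, n, stdout);
}
```

The raw addresses printed this way are what would later be handed to the existing `kit addr2line` tool for symbolization.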
+
+### Tests (L3)
+
+- L3a: smoke test piping captured addresses through `kit addr2line`, asserting
+ the expected function names appear.
+- L3c: a `kit emu` fault test asserting a symbolized frame line on stderr.
+
+---
+
+## Suggested sequencing
+
+1. **WS1 — L1 primitives, O0**, all three native arches + parse/toy tests. Ship
+ the GCC-compatible surface first; it's the foundation and independently useful.
+2. **WS2 — L1 at O1/O2**: the optimizer memory-effect modeling + O1 smoke tests.
+ (Highest-risk slice; isolate it.)
+3. **WS3 — L2 `__kit_backtrace`** in rt + capture test + assert-hook.
+4. **WS4 — L3a** raw print + `kit addr2line` round-trip; wire into assert/emu.
+5. **WS5 — L3c** tool-side auto-backtrace (optional, parallelizable with WS3/4).
+6. **L3b** deferred until a consumer needs in-binary symbolized panics.
+
+## Open questions
+
+- **wasm:** confirm "diagnose unsupported" is acceptable for L1 (no FP chain), or
+ whether the C/wasm targets should forward `__builtin_*` to the host toolchain.
+- **rv64 frame-record layout:** verify the saved-ra/prev-fp offsets against the
+ actual prologue emitted by `src/arch/riscv/native.c` (the table above assumes
+ `ra@fp-8`, `fp@fp-16`; confirm before coding the walk).
+- **Output sink for L3a:** weak `__kit_backtrace_write` vs. requiring the host to
+ pass a sink explicitly. Weak-symbol default keeps freestanding builds linking.
+- **Level-0 return address semantics under tail-call / leaf-frame omission:** kit
+ keeps a frame pointer everywhere, but confirm leaf functions still store the
+ frame record (if a leaf skips the `{fp,ra}` store, level-0 return address must
+ fall back to live LR/RA on aa64/rv64).
diff --git a/doc/plan/README.md b/doc/plan/README.md
@@ -19,4 +19,5 @@ shrinks to whatever remains open.
| [IMAGE_INSPECT.md](IMAGE_INSPECT.md) | Extending object inspection to executables and shared libraries. | [../OBJ.md](../OBJ.md) |
| [BUILD.md](BUILD.md) | A new content-addressed build coordinator (Bazel/Nix-style incremental builds layered on the CAS) — storage state machine, caching algorithm, recipe protocol. Distinct from `../BUILD.md` (kit's own Makefile build). | — (new subsystem) |
| [BUILD_COMMANDS.md](BUILD_COMMANDS.md) | The kit-native `build-exe`/`build-lib`/`build-obj` verbs that replace `compile`: polyglot, in-memory compile+link with `--group` flag scoping and full link-flag control. Distinct from `BUILD.md` (the CAS coordinator). | [../DRIVER.md](../DRIVER.md) |
+| [BACKTRACE.md](BACKTRACE.md) | Stack-trace support: GCC-compatible `__builtin_return_address`/`__builtin_frame_address` primitives, a freestanding `__kit_backtrace` capture helper, and symbolized backtrace printing. | [../FRONTENDS.md](../FRONTENDS.md), [../RUNTIME.md](../RUNTIME.md), [../DWARF.md](../DWARF.md) |
| [TODO.md](TODO.md) | Open deferred fixes and code smells only. Completed items are removed instead of checked off. Not a roadmap; a current backlog. | — |