commit f74017622a562616c2b1993be427593e9a227220
parent 837a27f3af091754af36f471a1da603449e3e06a
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Tue, 21 Apr 2026 22:23:15 -0700
lisp.M1 step 13: wire GC into allocator + mostly-precise stack scan
The step-13 spike landed mark/sweep/free-list machinery but left the
allocator overflow paths bailing straight to OOM. Wire them through
the standard "GC then retry once" pattern:
- pair_alloc_need_gc / obj_alloc_need_gc now CALL gc, re-check the
freelist (using the cached &head in obj_alloc's slot 2), then
fall through to retry_bump on miss; OOM only if both fail post-GC.
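In Python terms the retry discipline both overflow paths now share looks
like the sketch below. This is an illustrative model only — the real code
is hand-written M1 assembly, and the free-list / bump plumbing is stood in
for by plain Python callables:

```python
# Sketch of the "GC then retry once" pattern wired into
# pair_alloc_need_gc / obj_alloc_need_gc. Hypothetical Python model.

def alloc_after_overflow(freelist, bump_alloc, gc):
    """Overflow path: run gc, re-check the free list, then retry the
    bump pointer once. OOM only if both fail post-GC."""
    gc()                                # sweep repopulates the free lists
    if freelist:                        # re-check the freelist first
        return freelist.pop()
    slot = bump_alloc()                 # fall through to retry_bump
    if slot is not None:
        return slot
    raise MemoryError("alloc_oom")      # second overflow escalates to OOM
```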
Three problems surfaced once GC actually ran under cons-churn:
1. The aarch64 prologue's `mov xN, #k` was synthesized as
`add xN, xzr, #k`. ADD imm encodes register 31 as SP, not XZR,
so [sp+16] held sp+k instead of the slot count. Pre-GC nothing
read [sp+16], so the bug was latent. Added an `aa_movz` helper
and routed the k-load through it.
2. `:pair_heap_start` / `:obj_heap_start` land at whatever byte
offset the assembler reached (no .align directive in M1). The
pair sweep walks a fixed +16 stride from its start label, but
pair_alloc's bump aligns its first allocation up — so the two
walked different addresses and the bitmap origin was off too.
Fix: _start aligns both bumps once and snapshots the aligned
addresses into :pair_heap_base / :obj_heap_base, which the GC
paths use as their canonical origin.
3. Frame slots can hold non-tagged values: obj_alloc spills
integers / BSS pointers, intern spills raw byte pointers that
may not be 8-aligned. The GC walker treating those as tagged
would chase wild pointers. Adopted a mostly-precise scheme
(see docs/LISP-GC.md §"Why mostly-precise"):
- PROLOGUE_Nk now zeros every slot it reserves so uninitialized
reads see tag-0 (harmless).
- mark_value_pair / mark_value_object bounds-check the
candidate raw pointer against [heap_base, heap_next) before
dereference, in the BDW-collector style.
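The two halves of the scheme compose into a per-slot filter like the
following sketch (tag value 2 follows the `neg2` untagging in
mark_value_pair; the arena addresses in the example are made up):

```python
# Model of the mostly-precise slot filter: tag check first, then a
# bounds check against the live arena prefix before any dereference.
PAIR_TAG = 2  # low-3-bit pair tag, per the neg2 untag in mark_value_pair

def consider_slot(word, heap_base, heap_next, mark):
    """Mark `word` as a pair only if both filters pass."""
    if word & 7 != PAIR_TAG:
        return False                # zeroed or raw-valued slot: skip
    raw = word - PAIR_TAG           # candidate raw pair pointer
    if not (heap_base <= raw < heap_next):
        return False                # stale bits outside the arena: skip
    mark(raw)                       # safe to mark / dereference
    return True
```

A zeroed slot fails the tag test for free, which is why the prologue
zeroing and the walker-side bounds check count as one scheme.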
Also publish :prim_argc from marshal_argv so the GC walker bounds
the prim_argv scan to the live prefix; stale entries from earlier
calls used to pin dead objects across GCs.
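The publish/consume contract between marshal_argv and the walker,
modeled in Python (prim_argv as a fixed list; the real state is BSS):

```python
# Model of the prim_argv root bound. marshal_argv publishes the live
# count; the GC walker scans only that prefix, skipping stale tail
# entries left behind by earlier primitive calls.
state = {"prim_argv": [0] * 32, "prim_argc": 0}

def marshal_argv(args):
    for i, a in enumerate(args):        # stale tail is NOT cleared...
        state["prim_argv"][i] = a
    state["prim_argc"] = len(args)      # ...the published count bounds it

def gc_mark_prim_argv(mark):
    for i in range(state["prim_argc"]): # live prefix only
        mark(state["prim_argv"][i])
```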
Step-5 stress tests added: 17-gc-cons-churn (pair churn over many
GC cycles), 18-gc-deep-list (300-deep reachable list survives mid-
build GCs, sums to 45150), 19-gc-closure-churn (obj-arena exercise
via short-lived closures).
26 tests * 3 arches = 78 passes under `make test-lisp-all`.
Diffstat:
10 files changed, 340 insertions(+), 32 deletions(-)
diff --git a/Makefile b/Makefile
@@ -222,7 +222,10 @@ LISP_TESTS := \
tests/lisp/26-let.scm \
tests/lisp/27-cond.scm \
tests/lisp/28-quasi.scm \
- tests/lisp/29-innerdef.scm
+ tests/lisp/29-innerdef.scm \
+ tests/lisp/17-gc-cons-churn.scm \
+ tests/lisp/18-gc-deep-list.scm \
+ tests/lisp/19-gc-closure-churn.scm
# Build the prelude-prefixed twin of every fixture and feed that to the
# interpreter; .expected still lives next to the original .scm.
diff --git a/docs/LISP-GC.md b/docs/LISP-GC.md
@@ -31,6 +31,14 @@ Load-bearing; the rest of the document assumes them.
4. **Heap size 1 MB for this step.** Triggers GC easily under stress
tests. Grow to 20 MB in a follow-up once C-compiler workloads
demand it.
+5. **Mostly-precise stack scan with a bounded-conservative fallback.**
+ The roots in §Roots #1–#4 are *precise* — we know each entry is a
+ tagged Lisp value. The stack-frame walk (#5) is *bounded
+ conservative*: each frame slot is checked against
+ `[pair_heap_base, pair_heap_next)` / `[obj_heap_base,
+ obj_heap_next)` before being dereferenced, in the BDW-collector
+ style. See §"Why mostly-precise" for the reasoning and the
+ upgrade path.
These supersede the following LISP.md text, which must be updated when
this work lands: decision 11 (single bitmap over whole heap), §Heap
@@ -51,9 +59,19 @@ Single BSS region, split in two by fixed labels:
Bump pointers:
-- `pair_heap_next` — starts at `pair_heap_start`, bumps by 16.
-- `obj_heap_next` — starts at `obj_heap_start`, bumps by 8-rounded
- size.
+- `pair_heap_next` — bumps by 16. Initially `pair_heap_start`, but
+ `_start` rounds it up to a 16-byte boundary before any allocation
+ and snapshots that aligned value into `:pair_heap_base`. The GC
+ paths use `:pair_heap_base` as their canonical origin.
+- `obj_heap_next` — bumps by an 8-rounded size. Same alignment
+ treatment: `_start` rounds up to 8 and snapshots `:obj_heap_base`.
+
+The two `_base` slots exist because M1 has no `.align` directive
+and the labels `:pair_heap_start` / `:obj_heap_start` land at
+whatever byte offset the assembler reached. A fixed-stride pair
+sweep walking from the raw label would visit gap addresses; the
+runtime alignment + base snapshot keeps mark and sweep on the same
+grid.
Free lists:
@@ -81,7 +99,8 @@ symbols, vectors, closures, primitives) routes through `obj_alloc`.
### Pair mark bitmap
One bit per 16-byte slot in the pair arena, indexed by
-`(pair_ptr - pair_heap_start) / 16`. At 600 KB pair arena:
+`(pair_ptr - pair_heap_base) / 16` (note: `_base`, the runtime-
+aligned origin — see §Heap layout). At 600 KB pair arena:
600 KB / 16 = 37 500 bits ≈ 4.7 KB BSS. Declared as
`pair_mark_bitmap` with fixed size.
@@ -125,9 +144,12 @@ returning or branching.
### Chain termination
-`_start` writes `0` to BSS `stack_bottom_fp` before its first `CALL`.
-Any frame whose `saved_fp` is `0` is the bottom frame; the walker
-stops there.
+`_start` snapshots its own initial `sp` into `:stack_bottom_fp`
+before its first `CALL`. The first prologued callee then stores
+that same value as its `saved_fp`, so the walker compares each
+`saved_fp` it pops to `:stack_bottom_fp`'s value and stops on
+match. (The `_start` frame itself has no `PROLOGUE`, so there is
+nothing to walk past it.)
### Per-arch implementation in `p1_gen.py`
@@ -135,11 +157,21 @@ Three arch-specific emitters change:
- **amd64**: `call` already pushes retaddr natively. Prologue does
`pop rcx; mov rax, rsp; sub rsp, frame_size; mov [rsp], rcx;
- mov [rsp+8], rax; mov qword [rsp+16], k`. Epilogue reverses.
-- **aarch64**: `sub sp, sp, frame_size; str lr, [sp, 0]; mov x9, sp
- (pre-sub sp + frame_size, computed); str x9, [sp, 8];
- mov x9, #k; str x9, [sp, 16]`. Epilogue reverses.
+ mov [rsp+8], rax; mov qword [rsp+16], k`, then zeros each slot
+ in turn. Epilogue reverses.
+- **aarch64**: `sub sp, sp, frame_size; str lr, [sp, 0]; add x8, sp,
+ #frame_size; str x8, [sp, 8]; movz x8, #k; str x8, [sp, 16]`,
+ then `str xzr, [sp, 24+8*i]` per slot. Epilogue reverses.
+ *Important:* the k-load uses `MOVZ` rather than `add x8, xzr, #k`
+ because aarch64 ADD encodes register 31 as SP, not XZR — the
+ obvious `mov xN, #imm` synthesis silently computes `sp + imm`.
+ This was a latent bug pre-GC since nothing read `[sp+16]`.
- **riscv64**: analogous to aarch64, using `ra` and a scratch `t0`.
+ Slot zeroing uses `sd zero, off(sp)`.
+
+Slot zeroing is the prologue-side half of the mostly-precise scheme
+(see §"Why mostly-precise"); the bounds check in `mark_value_*` is
+the walker-side half.
The "caller's sp" is `sp + frame_size` on all three arches after the
sub has run, since prologue ran immediately after call.
@@ -168,17 +200,87 @@ Five sources, walked in this order on each GC invocation:
2. **`symbol_table[0..4095]`.** Intern table's backing array. Walk
each slot; mark occupied entries as tagged heap pointers.
3. **`prim_argv[0..prim_argc)`.** Scratch region live during a
- primitive call. The current `prim_argc` BSS slot bounds the walk.
-4. **Reader scratch** (`saved_src_*`, ~5 slots). Live during a
- `read` invocation.
+ primitive call. `marshal_argv` publishes the count into the
+ `:prim_argc` BSS slot, which the walker uses as the upper bound;
+ stale entries past `prim_argc` (from earlier primitive calls)
+ are skipped so they can't pin dead objects across GCs.
+4. **Reader scratch** (`saved_src_*`). Holds the integer cursor /
+ line / col plus a raw byte pointer into `:src_buf`. None of these
+ are tagged Lisp values, so no marking is needed; listed here only
+ to make the omission explicit.
5. **Stack (fp chain).** Start at the caller of `gc()`, walk
- `saved_fp` until `stack_bottom_fp` (value 0). At each frame,
- read `k` from `[fp+16]` and pass every word in `[fp+24..+24+8*k)`
- through `mark_object`.
+ `saved_fp` until the value snapshotted in `:stack_bottom_fp` at
+ `_start` (the kernel-provided initial sp). At each frame, read
+ `k` from `[fp+16]` and pass every word in `[fp+24..+24+8*k)`
+ through `mark_object`. Each word is bounds-checked against the
+ pair / obj arenas before dereference (see §"Why mostly-precise").
The `sym_*` special-form symbols are reachable via `symbol_table`
so they need no separate root entry.
+## Why mostly-precise
+
+A fully precise stack scan needs the walker to know, at every
+GC-fire point, exactly which slots in each frame hold tagged Lisp
+values. Two things break that invariant in this codebase:
+
+1. **Uninitialized slots.** A function reserves `k` slots in its
+ prologue, but its body may take an allocating call before having
+ written all of them. Slots that haven't been written yet contain
+ stale stack bytes from prior frames.
+2. **Slots holding non-tagged values.** `obj_alloc` legitimately
+ spills a rounded byte size and a free-list head pointer into its
+ slots. `intern` / `make_string` / `make_symbol` spill raw byte
+ pointers (into `:src_buf` or `:str_*` literal regions) that may
+ not even be 8-aligned. These have arbitrary low 3 bits and don't
+ look like tagged Lisp values.
+
+The standard answers in the literature:
+
+- **Precise stack maps.** Compiler emits per-callsite metadata
+ (HotSpot, V8, Go, OCaml, SBCL). Best correctness, requires real
+ codegen infrastructure.
+- **Strict slot discipline.** Every walked slot is always a tagged
+ value or sentinel; raw values live in registers or non-walked
+ storage. Used by classic SECD-style implementations.
+- **Conservative scanning.** Treat every machine word as potentially
+ a pointer; check if it falls in heap range. Boehm–Demers–Weiser,
+ Mono, MicroPython, JavaScriptCore (for the C++ stack).
+- **Mostly-precise / semi-conservative.** Precise for known-typed
+ roots, conservative for the C stack. CPython's cstack bridge,
+ SpiderMonkey for foreign frames.
+
+This implementation is mostly-precise (#4). The defenses live in
+two places:
+
+- **`PROLOGUE_Nk` zeros every slot it reserves** (`p1_gen.py`
+ prologue emitters). An uninitialized slot deterministically reads
+ as `0`, which has no heap tag and is filtered by the bounds check
+ for free.
+- **`mark_value_pair` / `mark_value_object` bounds-check the
+ candidate raw pointer** against `[heap_base, heap_next)` of its
+ arena before any dereference. A stale stack value whose low 3 bits
+ happen to match a heap tag is rejected unless its address actually
+ lies in the live arena prefix.
+
+The bounds check is the BDW idea in miniature, scoped to the stack
+walk only. False positives (a random stack int that does fall in
+the heap range) cause leaks, not crashes — they pin a dead object
+until something perturbs the stack enough to overwrite the slot.
+
+### Upgrade path
+
+If/when leak measurement shows the conservative fallback pinning
+real memory, the clean answer is **two slot counts per frame**:
+`k_tagged` for GC-walked slots, `k_raw` for scratch the GC
+ignores. `obj_alloc` declares 0+2, `cons` declares 2+0,
+`intern` declares 0+3. The frame layout grows a second count
+word, every `PROLOGUE_Nk` site is reclassified, frame-relative
+offsets shift for the raw region, and both the prologue's slot
+zeroing and the walker's bounds check can be dropped. That's the
+right shape
+for the long-lived compiler; for the seed that bootstraps one
+C compiler, the cost outweighs the benefit.
+
## Mark phase
```
diff --git a/src/lisp.M1 b/src/lisp.M1
@@ -85,6 +85,32 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
mov_r0,sp
st_r0,r1,0
+ ## Align both heap bumps to their natural strides and snapshot
+ ## the aligned starts in :pair_heap_base / :obj_heap_base. The
+ ## :pair_heap_start label may land at any byte offset depending
+ ## on what the assembler emitted before it; if a pair sweep then
+ ## walked from the raw label with a fixed +16 stride it would
+ ## visit gap addresses, never the actual allocations. Pinning
+ ## the bump and the bitmap origin to the same aligned address
+ ## keeps mark and sweep on the same grid.
+ li_r1 &pair_heap_next
+ ld_r0,r1,0
+ addi_r0,r0,15
+ shri_r0,r0,4
+ shli_r0,r0,4
+ st_r0,r1,0
+ li_r1 &pair_heap_base
+ st_r0,r1,0
+
+ li_r1 &obj_heap_next
+ ld_r0,r1,0
+ addi_r0,r0,7
+ shri_r0,r0,3
+ shli_r0,r0,3
+ st_r0,r1,0
+ li_r1 &obj_heap_base
+ st_r0,r1,0
+
## openat(AT_FDCWD=-100, argv[1], O_RDONLY=0, mode=0) → r0 = fd.
## Use openat rather than open because open is amd64-only
## (removed from the asm-generic table used by aarch64/riscv64).
@@ -945,9 +971,20 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
epilogue
ret
+## GC, then retry. Re-checks the free list first because the sweep
+## populates it with reclaimed pairs; only falls through to the bump
+## path if the list is still empty. A second overflow goes to OOM.
:pair_alloc_need_gc
- li_br &alloc_oom
- b
+ li_br &gc
+ call
+ li_r3 &free_list_pair
+ ld_r0,r3,0
+ li_br &pair_alloc_retry_bump
+ beqz_r0
+ ld_r1,r0,0
+ st_r1,r3,0
+ epilogue
+ ret
:pair_alloc_retry_bump
li_r3 &pair_heap_next
@@ -1005,9 +1042,24 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
epilogue_n2
ret
+## GC, then retry. Slot 2 still holds the size-class freelist head
+## (or 0 when no class fits this size). After the sweep, re-check
+## that list before falling through to the bump path. Second overflow
+## escalates to OOM.
:obj_alloc_need_gc
- li_br &alloc_oom
- b
+ li_br &gc
+ call
+ ld_r3,sp,32 ## &head (or 0 = no size class)
+ li_br &obj_alloc_retry_bump
+ beqz_r3
+ ld_r1,r3,0
+ li_br &obj_alloc_retry_bump
+ beqz_r1
+ ld_r2,r1,8
+ st_r2,r3,0
+ mov_r0,r1
+ epilogue_n2
+ ret
:obj_alloc_retry_bump
ld_r1,sp,24
@@ -1118,7 +1170,8 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
## ---- pair mark-bitmap helpers ---------------------------------------
:pair_mark_seen_or_set
- li_r2 &pair_heap_start
+ li_r2 &pair_heap_base
+ ld_r2,r2,0
sub_r2,r1,r2
shri_r2,r2,4 ## slot index
mov_r3,r2
@@ -1142,7 +1195,8 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
ret
:pair_mark_test
- li_r2 &pair_heap_start
+ li_r2 &pair_heap_base
+ ld_r2,r2,0
sub_r2,r1,r2
shri_r2,r2,4
mov_r3,r2
@@ -1162,7 +1216,8 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
ret
:pair_mark_clear
- li_r2 &pair_heap_start
+ li_r2 &pair_heap_base
+ ld_r2,r2,0
sub_r2,r1,r2
shri_r2,r2,4
mov_r3,r2
@@ -1218,8 +1273,23 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
epilogue_n4
ret
+## Tagged-pair bounds check: a frame slot may hold a raw integer whose
+## low 3 bits coincide with the pair tag (e.g. a small loop counter).
+## Reject anything outside [pair_heap_base, pair_heap_next) so the
+## bitmap index stays in range and dereferencing stays inside our arena.
:mark_value_pair
- addi_r1,r1,neg2
+ addi_r1,r1,neg2 ## r1 = raw pair ptr
+ li_r0 &pair_heap_base
+ ld_r0,r0,0
+ li_br &mark_value_done
+ blt_r1,r0
+ li_r0 &pair_heap_next
+ ld_r0,r0,0
+ li_br &mark_value_pair_inb
+ blt_r1,r0
+ li_br &mark_value_done
+ b
+:mark_value_pair_inb
st_r1,sp,24 ## slot 1 = raw pair ptr
li_br &pair_mark_seen_or_set
call
@@ -1238,9 +1308,23 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
li_br &mark_value_done
b
+## Same bounds-check rationale as mark_value_pair. Reject anything
+## outside [obj_heap_base, obj_heap_next) before touching the header
+## byte, so a stale raw pointer in a frame slot can't crash the marker.
:mark_value_object
andi_r0,r1,7
sub_r1,r1,r0 ## raw object ptr
+ li_r0 &obj_heap_base
+ ld_r0,r0,0
+ li_br &mark_value_done
+ blt_r1,r0
+ li_r0 &obj_heap_next
+ ld_r0,r0,0
+ li_br &mark_value_object_inb
+ blt_r1,r0
+ li_br &mark_value_done
+ b
+:mark_value_object_inb
st_r1,sp,24 ## slot 1 = raw object ptr
li_br &obj_mark_seen_or_set
call
@@ -1359,11 +1443,16 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
## ---- gc_mark_prim_argv() --------------------------------------------
+## Bound by :prim_argc, the slot count marshal_argv last published.
+## Walking past it would re-mark stale tagged values left over from
+## earlier primitive calls, falsely keeping the objects they pointed
+## to alive across an arbitrary number of GCs.
:gc_mark_prim_argv
prologue_n2
li_r1 &prim_argv
st_r1,sp,24 ## slot 1 = cursor
- li_r1 %32
+ li_r1 &prim_argc
+ ld_r1,r1,0
st_r1,sp,32 ## slot 2 = remaining slots
:gc_mark_prim_argv_loop
@@ -1487,7 +1576,8 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
## ---- gc_sweep_pair() -------------------------------------------------
:gc_sweep_pair
prologue
- li_r1 &pair_heap_start
+ li_r1 &pair_heap_base
+ ld_r1,r1,0
st_r1,sp,24
:gc_sweep_pair_loop
@@ -1530,7 +1620,8 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
## ---- gc_sweep_obj() --------------------------------------------------
:gc_sweep_obj
prologue_n4
- li_r1 &obj_heap_start
+ li_r1 &obj_heap_base
+ ld_r1,r1,0
st_r1,sp,24 ## slot 1 = current ptr
:gc_sweep_obj_loop
@@ -3728,6 +3819,11 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
b
:marshal_argv_done
+ ## Publish the count so the GC root walker only marks the live
+ ## prefix of prim_argv. Stale slots beyond the current call would
+ ## otherwise look like roots and pin dead objects across GCs.
+ li_r0 &prim_argc
+ st_r3,r0,0
mov_r0,r3 ## r0 = count
li_r2 &prim_argv ## r2 = buffer base
epilogue
@@ -6990,6 +7086,19 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
:stack_bottom_fp %0 %0
:gc_root_fp %0 %0
+## Aligned mirrors of :pair_heap_start / :obj_heap_start. _start fills
+## these once, so every GC code path can derive the same canonical
+## start address that the bump pointer actually used for its first
+## allocation. See the alignment block in _start for the rationale.
+:pair_heap_base %0 %0
+:obj_heap_base %0 %0
+
+## Live prefix length of :prim_argv, set by marshal_argv on every
+## primitive call. The GC root walker uses this as the upper bound
+## of the prim_argv scan. Initialised to 0 so a GC that fires before
+## the first primitive call walks zero slots.
+:prim_argc %0 %0
+
:free_list_pair %0 %0
:free_list_obj16 %0 %0
:free_list_obj24 %0 %0
diff --git a/src/p1_gen.py b/src/p1_gen.py
@@ -240,6 +240,15 @@ def aa_ldst_unscaled(base, rT, rN, off):
t, n = NAT_AA64[rT], NAT_AA64[rN]
return le32(base | (imm9 << 12) | (n << 5) | t)
+def aa_movz(rD, imm16):
+ """MOVZ rD, #imm16 — load a 16-bit immediate into the low half of
+ rD, zeroing the rest. Use this instead of `add rD, xzr, #imm`:
+ aarch64 ADD encodes register 31 as SP (not XZR), so the obvious
+ `mov xN, #imm` synthesis silently computes sp + imm."""
+ d = NAT_AA64[rD]
+ assert 0 <= imm16 < (1 << 16)
+ return le32(0xD2800000 | (imm16 << 5) | d)
+
## ---------- riscv64 primitive encoders ----------------------------------
@@ -461,9 +470,21 @@ class AA64(Encoder):
str_lr = aa_ldst_uimm12(0xF9000000, 'lr', 'sp', 0, 3)
save_fp = aa_add_imm('x8', 'sp', fb, sub=False)
str_fp = aa_ldst_uimm12(0xF9000000, 'x8', 'sp', 8, 3)
- mov_k = aa_add_imm('x8', 'xzr', k, sub=False)
+ # MUST use MOVZ here: ADD imm with Rn=31 reads SP, not XZR,
+ # so `add x8, xzr, #k` would store sp+k at [sp+16] instead
+ # of the slot count k. The pre-GC code never read [sp+16]
+ # so the bug stayed latent until the GC walker showed up.
+ mov_k = aa_movz('x8', k)
str_k = aa_ldst_uimm12(0xF9000000, 'x8', 'sp', 16, 3)
- return sub + str_lr + save_fp + str_fp + mov_k + str_k
+ # Zero each slot. The GC walker reads every slot through
+ # mark_value, which dereferences any value whose tag bits
+ # match a heap tag. Stale stack data left here from prior
+ # frames can otherwise look like a tagged pointer and chase
+ # the marker into garbage. See LISP-GC.md §"Roots".
+ zero_slots = ''
+ for i in range(k):
+ zero_slots += aa_ldst_uimm12(0xF9000000, 'xzr', 'sp', 24 + 8*i, 3)
+ return sub + str_lr + save_fp + str_fp + mov_k + str_k + zero_slots
def epilogue(self, k):
ldr_lr = aa_ldst_uimm12(0xF9400000, 'lr', 'sp', 0, 3)
@@ -600,6 +621,9 @@ class AMD64(Encoder):
# xor rax,rax ; add rax,k ; [rsp+16]=rax.
# rcx carries the retaddr, rax carries saved_fp then k. Neither
# is a live incoming P1 argument register at function entry.
+ # Each slot is then zeroed so a stale stack value can't masquerade
+ # as a tagged pointer when the GC walks this frame. See
+ # LISP-GC.md §"Roots".
fb = prologue_frame_bytes(k)
assert fb <= 127
seq = '59' # pop rcx
@@ -610,6 +634,9 @@ class AMD64(Encoder):
seq += amd_alu_rr('31', 'r0', 'r0') # xor rax, rax
seq += amd_alu_ri8(0, 'r0', k) # add rax, k
seq += amd_mem_rm('89', 'r0', 'sp', 16) # [rsp+16] = rax
+ seq += amd_alu_rr('31', 'r0', 'r0') # xor rax, rax
+ for i in range(k):
+ seq += amd_mem_rm('89', 'r0', 'sp', 24 + 8*i)
return seq
def epilogue(self, k):
@@ -698,7 +725,11 @@ class RV64(Encoder):
sd_fp = rv_s(0x00003023, 't0', 'sp', 8)
k_imm = rv_i(0x00000013, 't0', 'zero', k)
sd_k = rv_s(0x00003023, 't0', 'sp', 16)
- return sub + sd + fp + sd_fp + k_imm + sd_k
+ # Zero each slot to keep GC walks safe; see LISP-GC.md §"Roots".
+ zero_slots = ''
+ for i in range(k):
+ zero_slots += rv_s(0x00003023, 'zero', 'sp', 24 + 8*i)
+ return sub + sd + fp + sd_fp + k_imm + sd_k + zero_slots
def epilogue(self, k):
ld = rv_i(0x00003003, 'ra', 'sp', 0)
diff --git a/tests/lisp/17-gc-cons-churn.expected b/tests/lisp/17-gc-cons-churn.expected
@@ -0,0 +1 @@
+42
diff --git a/tests/lisp/17-gc-cons-churn.scm b/tests/lisp/17-gc-cons-churn.scm
@@ -0,0 +1,16 @@
+;; Cons churn stress: allocate many short-lived pairs, force multiple
+;; GC cycles. lisp.M1's bring-up heap is 32 KB (~2048 pairs). Each
+;; user-level cons drives several pair allocations once eval_args and
+;; the marshal path are counted, so 5000 iterations exhaust the arena
+;; multiple times over and exercise mark + sweep + freelist reuse.
+;;
+;; Uses the explicit `(define name (lambda ...))` form because the
+;; bring-up evaluator does not expand `(define (name args) body)`.
+
+(define churn
+ (lambda (n)
+ (if (= n 0)
+ 42
+ (begin (cons n n) (churn (- n 1))))))
+
+(churn 5000)
diff --git a/tests/lisp/18-gc-deep-list.expected b/tests/lisp/18-gc-deep-list.expected
@@ -0,0 +1 @@
+45150
diff --git a/tests/lisp/18-gc-deep-list.scm b/tests/lisp/18-gc-deep-list.scm
@@ -0,0 +1,23 @@
+;; Deep-list stress: build a list whose total allocations exceed the
+;; arena, then sum it. The mark phase must preserve every live pair
+;; on the partial list across the GC cycles that fire mid-build.
+;; A sweep that drops a live pair would corrupt the chain and either
+;; crash sum or yield a wrong total.
+;;
+;; build 300 → 300 user pairs alive at the end; ~8 pair allocs per
+;; iter * 300 = ~2400 raw allocations vs the 32 KB / 16 = 2048-pair
+;; arena, so GC must fire mid-build.
+;;
+;; Sum of 1..300 = 300 * 301 / 2 = 45150.
+
+(define build
+ (lambda (n acc)
+ (if (= n 0) acc
+ (build (- n 1) (cons n acc)))))
+
+(define sum
+ (lambda (lst acc)
+ (if (null? lst) acc
+ (sum (cdr lst) (+ acc (car lst))))))
+
+(sum (build 300 (quote ())) 0)
diff --git a/tests/lisp/19-gc-closure-churn.expected b/tests/lisp/19-gc-closure-churn.expected
@@ -0,0 +1 @@
+42
diff --git a/tests/lisp/19-gc-closure-churn.scm b/tests/lisp/19-gc-closure-churn.scm
@@ -0,0 +1,21 @@
+;; Closure churn: each call to make-counter evaluates a fresh
+;; (lambda () ...) which allocates a 32-byte closure object on the
+;; obj heap. The closure is consumed on the next line and becomes
+;; unreachable. Iterating exhausts the obj heap and forces GC to
+;; reclaim dead closures via the obj-arena sweep + free-list path.
+;;
+;; Each iteration drops 1 closure (32 bytes) plus a pile of
+;; transient pairs (eval_args, env-extend). 800 iterations easily
+;; cycle both arenas multiple times.
+
+(define make-counter
+ (lambda (start)
+ (lambda () start)))
+
+(define churn
+ (lambda (n)
+ (if (= n 0)
+ 42
+ (begin ((make-counter n)) (churn (- n 1))))))
+
+(churn 800)