commit f74017622a562616c2b1993be427593e9a227220
parent 837a27f3af091754af36f471a1da603449e3e06a
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Tue, 21 Apr 2026 22:23:15 -0700
lisp.M1 step 13: wire GC into allocator + mostly-precise stack scan
The step-13 spike landed mark/sweep/free-list machinery but left the
allocator overflow paths bailing straight to OOM. Wire them through
the standard "GC then retry once" pattern:
- pair_alloc_need_gc / obj_alloc_need_gc now CALL gc, re-check the
freelist (using the cached &head in obj_alloc's slot 2), then
fall through to retry_bump on miss; OOM only if both fail post-GC.
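In Python terms the retry discipline both overflow paths now share looks
like the sketch below. This is an illustrative model only — the real code
is hand-written M1 assembly, and the free-list / bump plumbing is stood in
for by plain Python callables:

```python
# Sketch of the "GC then retry once" pattern wired into
# pair_alloc_need_gc / obj_alloc_need_gc. Hypothetical Python model.

def alloc_after_overflow(freelist, bump_alloc, gc):
    """Overflow path: run gc, re-check the free list, then retry the
    bump pointer once. OOM only if both fail post-GC."""
    gc()                                # sweep repopulates the free lists
    if freelist:                        # re-check the freelist first
        return freelist.pop()
    slot = bump_alloc()                 # fall through to retry_bump
    if slot is not None:
        return slot
    raise MemoryError("alloc_oom")      # second overflow escalates to OOM
```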
Three problems surfaced once GC actually ran under cons-churn:
1. The aarch64 prologue's `mov xN, #k` was synthesized as
`add xN, xzr, #k`. ADD imm encodes register 31 as SP, not XZR,
so [sp+16] held sp+k instead of the slot count. Pre-GC nothing
read [sp+16], so the bug was latent. Added an `aa_movz` helper
and routed the k-load through it.
2. `:pair_heap_start` / `:obj_heap_start` land at whatever byte
offset the assembler reached (no .align directive in M1). The
pair sweep walks a fixed +16 stride from its start label, but
pair_alloc's bump aligns its first allocation up — so the two
walked different addresses and the bitmap origin was off too.
Fix: _start aligns both bumps once and snapshots the aligned
addresses into :pair_heap_base / :obj_heap_base, which the GC
paths use as their canonical origin.
3. Frame slots can hold non-tagged values: obj_alloc spills
integers / BSS pointers, intern spills raw byte pointers that
may not be 8-aligned. The GC walker treating those as tagged
would chase wild pointers. Adopted a mostly-precise scheme
(see docs/LISP-GC.md §"Why mostly-precise"):
- PROLOGUE_Nk now zeros every slot it reserves so uninitialized
reads see tag-0 (harmless).
- mark_value_pair / mark_value_object bounds-check the
candidate raw pointer against [heap_base, heap_next) before
dereference, in the BDW-collector style.
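The two halves of the scheme compose into a per-slot filter like the
following sketch (tag value 2 follows the `neg2` untagging in
mark_value_pair; the arena addresses in the example are made up):

```python
# Model of the mostly-precise slot filter: tag check first, then a
# bounds check against the live arena prefix before any dereference.
PAIR_TAG = 2  # low-3-bit pair tag, per the neg2 untag in mark_value_pair

def consider_slot(word, heap_base, heap_next, mark):
    """Mark `word` as a pair only if both filters pass."""
    if word & 7 != PAIR_TAG:
        return False                # zeroed or raw-valued slot: skip
    raw = word - PAIR_TAG           # candidate raw pair pointer
    if not (heap_base <= raw < heap_next):
        return False                # stale bits outside the arena: skip
    mark(raw)                       # safe to mark / dereference
    return True
```

A zeroed slot fails the tag test for free, which is why the prologue
zeroing and the walker-side bounds check count as one scheme.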
Also publish :prim_argc from marshal_argv so the GC walker bounds
the prim_argv scan to the live prefix; stale entries from earlier
calls used to pin dead objects across GCs.
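The publish/consume contract between marshal_argv and the walker,
modeled in Python (prim_argv as a fixed list; the real state is BSS):

```python
# Model of the prim_argv root bound. marshal_argv publishes the live
# count; the GC walker scans only that prefix, skipping stale tail
# entries left behind by earlier primitive calls.
state = {"prim_argv": [0] * 32, "prim_argc": 0}

def marshal_argv(args):
    for i, a in enumerate(args):        # stale tail is NOT cleared...
        state["prim_argv"][i] = a
    state["prim_argc"] = len(args)      # ...the published count bounds it

def gc_mark_prim_argv(mark):
    for i in range(state["prim_argc"]): # live prefix only
        mark(state["prim_argv"][i])
```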
Step-5 stress tests added: 17-gc-cons-churn (pair churn over many
GC cycles), 18-gc-deep-list (300-deep reachable list survives mid-
build GCs, sums to 45150), 19-gc-closure-churn (obj-arena exercise
via short-lived closures).
26 tests * 3 arches = 78 passes under `make test-lisp-all`.
Diffstat:
10 files changed, 340 insertions(+), 32 deletions(-)
diff --git a/Makefile b/Makefile
@@ -222,7 +222,10 @@ LISP_TESTS := \
tests/lisp/26-let.scm \
tests/lisp/27-cond.scm \
tests/lisp/28-quasi.scm \
- tests/lisp/29-innerdef.scm
+ tests/lisp/29-innerdef.scm \
+ tests/lisp/17-gc-cons-churn.scm \
+ tests/lisp/18-gc-deep-list.scm \
+ tests/lisp/19-gc-closure-churn.scm
# Build the prelude-prefixed twin of every fixture and feed that to the
# interpreter; .expected still lives next to the original .scm.
diff --git a/docs/LISP-GC.md b/docs/LISP-GC.md
@@ -31,6 +31,14 @@ Load-bearing; the rest of the document assumes them.
4. **Heap size 1 MB for this step.** Triggers GC easily under stress
tests. Grow to 20 MB in a follow-up once C-compiler workloads
demand it.
+5. **Mostly-precise stack scan with a bounded-conservative fallback.**
+ The roots in §Roots #1–#4 are *precise* — we know each entry is a
+ tagged Lisp value. The stack-frame walk (#5) is *bounded
+ conservative*: each frame slot is checked against
+ `[pair_heap_base, pair_heap_next)` / `[obj_heap_base,
+ obj_heap_next)` before being dereferenced, in the BDW-collector
+ style. See §"Why mostly-precise" for the reasoning and the
+ upgrade path.
These supersede the following LISP.md text, which must be updated when
this work lands: decision 11 (single bitmap over whole heap), §Heap
@@ -51,9 +59,19 @@ Single BSS region, split in two by fixed labels:
Bump pointers:
-- `pair_heap_next` — starts at `pair_heap_start`, bumps by 16.
-- `obj_heap_next` — starts at `obj_heap_start`, bumps by 8-rounded
- size.
+- `pair_heap_next` — bumps by 16. Initially `pair_heap_start`, but
+ `_start` rounds it up to a 16-byte boundary before any allocation
+ and snapshots that aligned value into `:pair_heap_base`. The GC
+ paths use `:pair_heap_base` as their canonical origin.
+- `obj_heap_next` — bumps by an 8-rounded size. Same alignment
+ treatment: `_start` rounds up to 8 and snapshots `:obj_heap_base`.
+
+The two `_base` slots exist because M1 has no `.align` directive
+and the labels `:pair_heap_start` / `:obj_heap_start` land at
+whatever byte offset the assembler reached. A fixed-stride pair
+sweep walking from the raw label would visit gap addresses; the
+runtime alignment + base snapshot keeps mark and sweep on the same
+grid.
Free lists:
@@ -81,7 +99,8 @@ symbols, vectors, closures, primitives) routes through `obj_alloc`.
### Pair mark bitmap
One bit per 16-byte slot in the pair arena, indexed by
-`(pair_ptr - pair_heap_start) / 16`. At 600 KB pair arena:
+`(pair_ptr - pair_heap_base) / 16` (note: `_base`, the runtime-
+aligned origin — see §Heap layout). At 600 KB pair arena:
600 KB / 16 = 37 500 bits ≈ 4.7 KB BSS. Declared as
`pair_mark_bitmap` with fixed size.
@@ -125,9 +144,12 @@ returning or branching.
### Chain termination
-`_start` writes `0` to BSS `stack_bottom_fp` before its first `CALL`.
-Any frame whose `saved_fp` is `0` is the bottom frame; the walker
-stops there.
+`_start` snapshots its own initial `sp` into `:stack_bottom_fp`
+before its first `CALL`. The first prologued callee then stores
+that same value as its `saved_fp`, so the walker compares each
+`saved_fp` it pops to `:stack_bottom_fp`'s value and stops on
+match. (The `_start` frame itself has no `PROLOGUE`, so there is
+nothing to walk past it.)
### Per-arch implementation in `p1_gen.py`
@@ -135,11 +157,21 @@ Three arch-specific emitters change:
- **amd64**: `call` already pushes retaddr natively. Prologue does
`pop rcx; mov rax, rsp; sub rsp, frame_size; mov [rsp], rcx;
- mov [rsp+8], rax; mov qword [rsp+16], k`. Epilogue reverses.
-- **aarch64**: `sub sp, sp, frame_size; str lr, [sp, 0]; mov x9, sp
- (pre-sub sp + frame_size, computed); str x9, [sp, 8];
- mov x9, #k; str x9, [sp, 16]`. Epilogue reverses.
+ mov [rsp+8], rax; mov qword [rsp+16], k`, then zeros each slot
+ in turn. Epilogue reverses.
+- **aarch64**: `sub sp, sp, frame_size; str lr, [sp, 0]; add x8, sp,
+ #frame_size; str x8, [sp, 8]; movz x8, #k; str x8, [sp, 16]`,
+ then `str xzr, [sp, 24+8*i]` per slot. Epilogue reverses.
+ *Important:* the k-load uses `MOVZ` rather than `add x8, xzr, #k`
+ because aarch64 ADD encodes register 31 as SP, not XZR — the
+ obvious `mov xN, #imm` synthesis silently computes `sp + imm`.
+ This was a latent bug pre-GC since nothing read `[sp+16]`.
- **riscv64**: analogous to aarch64, using `ra` and a scratch `t0`.
+ Slot zeroing uses `sd zero, off(sp)`.
+
+Slot zeroing is the prologue-side half of the mostly-precise scheme
+(see §"Why mostly-precise"); the bounds check in `mark_value_*` is
+the walker-side half.
The "caller's sp" is `sp + frame_size` on all three arches after the
sub has run, since prologue ran immediately after call.
@@ -168,17 +200,87 @@ Five sources, walked in this order on each GC invocation:
2. **`symbol_table[0..4095]`.** Intern table's backing array. Walk
each slot; mark occupied entries as tagged heap pointers.
3. **`prim_argv[0..prim_argc)`.** Scratch region live during a
- primitive call. The current `prim_argc` BSS slot bounds the walk.
-4. **Reader scratch** (`saved_src_*`, ~5 slots). Live during a
- `read` invocation.
+ primitive call. `marshal_argv` publishes the count into the
+ `:prim_argc` BSS slot, which the walker uses as the upper bound;
+ stale entries past `prim_argc` (from earlier primitive calls)
+ are skipped so they can't pin dead objects across GCs.
+4. **Reader scratch** (`saved_src_*`). Holds the integer cursor /
+ line / col plus a raw byte pointer into `:src_buf`. None of these
+ are tagged Lisp values, so no marking is needed; listed here only
+ to make the omission explicit.
5. **Stack (fp chain).** Start at the caller of `gc()`, walk
- `saved_fp` until `stack_bottom_fp` (value 0). At each frame,
- read `k` from `[fp+16]` and pass every word in `[fp+24..+24+8*k)`
- through `mark_object`.
+ `saved_fp` until the value snapshotted in `:stack_bottom_fp` at
+ `_start` (the kernel-provided initial sp). At each frame, read
+ `k` from `[fp+16]` and pass every word in `[fp+24..+24+8*k)`
+ through `mark_object`. Each word is bounds-checked against the
+ pair / obj arenas before dereference (see §"Why mostly-precise").
The `sym_*` special-form symbols are reachable via `symbol_table`
so they need no separate root entry.
+## Why mostly-precise
+
+A fully precise stack scan needs the walker to know, at every
+GC-fire point, exactly which slots in each frame hold tagged Lisp
+values. Two things break that invariant in this codebase:
+
+1. **Uninitialized slots.** A function reserves `k` slots in its
+ prologue, but its body may take an allocating call before having
+ written all of them. Slots that haven't been written yet contain
+ stale stack bytes from prior frames.
+2. **Slots holding non-tagged values.** `obj_alloc` legitimately
+ spills a rounded byte size and a free-list head pointer into its
+ slots. `intern` / `make_string` / `make_symbol` spill raw byte
+ pointers (into `:src_buf` or `:str_*` literal regions) that may
+ not even be 8-aligned. These have arbitrary low 3 bits and don't
+ look like tagged Lisp values.
+
+The standard answers in the literature:
+
+- **Precise stack maps.** Compiler emits per-callsite metadata
+ (HotSpot, V8, Go, OCaml, SBCL). Best correctness, requires real
+ codegen infrastructure.
+- **Strict slot discipline.** Every walked slot is always a tagged
+ value or sentinel; raw values live in registers or non-walked
+ storage. Used by classic SECD-style implementations.
+- **Conservative scanning.** Treat every machine word as potentially
+ a pointer; check if it falls in heap range. Boehm–Demers–Weiser,
+ Mono, MicroPython, JavaScriptCore (for the C++ stack).
+- **Mostly-precise / semi-conservative.** Precise for known-typed
+ roots, conservative for the C stack. CPython's cstack bridge,
+ SpiderMonkey for foreign frames.
+
+This implementation is mostly-precise (#4). The defenses live in
+two places:
+
+- **`PROLOGUE_Nk` zeros every slot it reserves** (`p1_gen.py`
+ prologue emitters). An uninitialized slot deterministically reads
+ as `0`, which has no heap tag and is filtered by the bounds check
+ for free.
+- **`mark_value_pair` / `mark_value_object` bounds-check the
+ candidate raw pointer** against `[heap_base, heap_next)` of its
+ arena before any dereference. A stale stack value whose low 3 bits
+ happen to match a heap tag is rejected unless its address actually
+ lies in the live arena prefix.
+
+The bounds check is the BDW idea in miniature, scoped to the stack
+walk only. False positives (a random stack int that does fall in
+the heap range) cause leaks, not crashes — they pin a dead object
+until something perturbs the stack enough to overwrite the slot.
+
+### Upgrade path
+
+If/when leak measurement shows the conservative fallback pinning
+real memory, the clean answer is **two slot counts per frame**:
+`k_tagged` for GC-walked slots, `k_raw` for scratch the GC
+ignores. `obj_alloc` declares 0+2, `cons` declares 2+0,
+`intern` declares 0+3. The frame layout grows a second count
+word, every `PROLOGUE_Nk` site is reclassified, frame-relative
+offsets shift for the raw region, and both the prologue's slot
+zeroing and the walker's bounds check can be dropped. That's the
+right shape
+for the long-lived compiler; for the seed that bootstraps one
+C compiler, the cost outweighs the benefit.
+
## Mark phase
```
diff --git a/src/lisp.M1 b/src/lisp.M1
@@ -85,6 +85,32 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
mov_r0,sp
st_r0,r1,0
+ ## Align both heap bumps to their natural strides and snapshot
+ ## the aligned starts in :pair_heap_base / :obj_heap_base. The
+ ## :pair_heap_start label may land at any byte offset depending
+ ## on what the assembler emitted before it; if a pair sweep then
+ ## walked from the raw label with a fixed +16 stride it would
+ ## visit gap addresses, never the actual allocations. Pinning
+ ## the bump and the bitmap origin to the same aligned address
+ ## keeps mark and sweep on the same grid.
+ li_r1 &pair_heap_next
+ ld_r0,r1,0
+ addi_r0,r0,15
+ shri_r0,r0,4
+ shli_r0,r0,4
+ st_r0,r1,0
+ li_r1 &pair_heap_base
+ st_r0,r1,0
+
+ li_r1 &obj_heap_next
+ ld_r0,r1,0
+ addi_r0,r0,7
+ shri_r0,r0,3
+ shli_r0,r0,3
+ st_r0,r1,0
+ li_r1 &obj_heap_base
+ st_r0,r1,0
+
## openat(AT_FDCWD=-100, argv[1], O_RDONLY=0, mode=0) → r0 = fd.
## Use openat rather than open because open is amd64-only
## (removed from the asm-generic table used by aarch64/riscv64).
@@ -945,9 +971,20 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
epilogue
ret
+## GC, then retry. Re-checks the free list first because the sweep
+## populates it with reclaimed pairs; only falls through to the bump
+## path if the list is still empty. A second overflow goes to OOM.
:pair_alloc_need_gc
- li_br &alloc_oom
- b
+ li_br &gc
+ call
+ li_r3 &free_list_pair
+ ld_r0,r3,0
+ li_br &pair_alloc_retry_bump
+ beqz_r0
+ ld_r1,r0,0
+ st_r1,r3,0
+ epilogue
+ ret
:pair_alloc_retry_bump
li_r3 &pair_heap_next
@@ -1005,9 +1042,24 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
epilogue_n2
ret
+## GC, then retry. Slot 2 still holds the size-class freelist head
+## (or 0 when no class fits this size). After the sweep, re-check
+## that list before falling through to the bump path. Second overflow
+## escalates to OOM.
:obj_alloc_need_gc
- li_br &alloc_oom
- b
+ li_br &gc
+ call
+ ld_r3,sp,32 ## &head (or 0 = no size class)
+ li_br &obj_alloc_retry_bump
+ beqz_r3
+ ld_r1,r3,0
+ li_br &obj_alloc_retry_bump
+ beqz_r1
+ ld_r2,r1,8
+ st_r2,r3,0
+ mov_r0,r1
+ epilogue_n2
+ ret
:obj_alloc_retry_bump
ld_r1,sp,24
@@ -1118,7 +1170,8 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
## ---- pair mark-bitmap helpers ---------------------------------------
:pair_mark_seen_or_set
- li_r2 &pair_heap_start
+ li_r2 &pair_heap_base
+ ld_r2,r2,0
sub_r2,r1,r2
shri_r2,r2,4 ## slot index
mov_r3,r2
@@ -1142,7 +1195,8 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
ret
:pair_mark_test
- li_r2 &pair_heap_start
+ li_r2 &pair_heap_base
+ ld_r2,r2,0
sub_r2,r1,r2
shri_r2,r2,4
mov_r3,r2
@@ -1162,7 +1216,8 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
ret
:pair_mark_clear
- li_r2 &pair_heap_start
+ li_r2 &pair_heap_base
+ ld_r2,r2,0
sub_r2,r1,r2
shri_r2,r2,4
mov_r3,r2
@@ -1218,8 +1273,23 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
epilogue_n4
ret
+## Tagged-pair bounds check: a frame slot may hold a raw integer whose
+## low 3 bits coincide with the pair tag (e.g. a small loop counter).
+## Reject anything outside [pair_heap_base, pair_heap_next) so the
+## bitmap index stays in range and dereferencing stays inside our arena.
:mark_value_pair
- addi_r1,r1,neg2
+ addi_r1,r1,neg2 ## r1 = raw pair ptr
+ li_r0 &pair_heap_base
+ ld_r0,r0,0
+ li_br &mark_value_done
+ blt_r1,r0
+ li_r0 &pair_heap_next
+ ld_r0,r0,0
+ li_br &mark_value_pair_inb
+ blt_r1,r0
+ li_br &mark_value_done
+ b
+:mark_value_pair_inb
st_r1,sp,24 ## slot 1 = raw pair ptr
li_br &pair_mark_seen_or_set
call
@@ -1238,9 +1308,23 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
li_br &mark_value_done
b
+## Same bounds-check rationale as mark_value_pair. Reject anything
+## outside [obj_heap_base, obj_heap_next) before touching the header
+## byte, so a stale raw pointer in a frame slot can't crash the marker.
:mark_value_object
andi_r0,r1,7
sub_r1,r1,r0 ## raw object ptr
+ li_r0 &obj_heap_base
+ ld_r0,r0,0
+ li_br &mark_value_done
+ blt_r1,r0
+ li_r0 &obj_heap_next
+ ld_r0,r0,0
+ li_br &mark_value_object_inb
+ blt_r1,r0
+ li_br &mark_value_done
+ b
+:mark_value_object_inb
st_r1,sp,24 ## slot 1 = raw object ptr
li_br &obj_mark_seen_or_set
call
@@ -1359,11 +1443,16 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
## ---- gc_mark_prim_argv() --------------------------------------------
+## Bound by :prim_argc, the slot count marshal_argv last published.
+## Walking past it would re-mark stale tagged values left over from
+## earlier primitive calls, falsely keeping the objects they pointed
+## to alive across an arbitrary number of GCs.
:gc_mark_prim_argv
prologue_n2
li_r1 &prim_argv
st_r1,sp,24 ## slot 1 = cursor
- li_r1 %32
+ li_r1 &prim_argc
+ ld_r1,r1,0
st_r1,sp,32 ## slot 2 = remaining slots
:gc_mark_prim_argv_loop
@@ -1487,7 +1576,8 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
## ---- gc_sweep_pair() -------------------------------------------------
:gc_sweep_pair
prologue
- li_r1 &pair_heap_start
+ li_r1 &pair_heap_base
+ ld_r1,r1,0
st_r1,sp,24
:gc_sweep_pair_loop
@@ -1530,7 +1620,8 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
## ---- gc_sweep_obj() --------------------------------------------------
:gc_sweep_obj
prologue_n4
- li_r1 &obj_heap_start
+ li_r1 &obj_heap_base
+ ld_r1,r1,0
st_r1,sp,24 ## slot 1 = current ptr
:gc_sweep_obj_loop
@@ -3728,6 +3819,11 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
b
:marshal_argv_done
+ ## Publish the count so the GC root walker only marks the live
+ ## prefix of prim_argv. Stale slots beyond the current call would
+ ## otherwise look like roots and pin dead objects across GCs.
+ li_r0 &prim_argc
+ st_r3,r0,0
mov_r0,r3 ## r0 = count
li_r2 &prim_argv ## r2 = buffer base
epilogue
@@ -6990,6 +7086,19 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
:stack_bottom_fp %0 %0
:gc_root_fp %0 %0
+## Aligned mirrors of :pair_heap_start / :obj_heap_start. _start fills
+## these once, so every GC code path can derive the same canonical
+## start address that the bump pointer actually used for its first
+## allocation. See the alignment block in _start for the rationale.
+:pair_heap_base %0 %0
+:obj_heap_base %0 %0
+
+## Live prefix length of :prim_argv, set by marshal_argv on every
+## primitive call. The GC root walker uses this as the upper bound
+## of the prim_argv scan. Initialised to 0 so a GC that fires before
+## the first primitive call walks zero slots.
+:prim_argc %0 %0
+
:free_list_pair %0 %0
:free_list_obj16 %0 %0
:free_list_obj24 %0 %0
diff --git a/src/p1_gen.py b/src/p1_gen.py
@@ -240,6 +240,15 @@ def aa_ldst_unscaled(base, rT, rN, off):
t, n = NAT_AA64[rT], NAT_AA64[rN]
return le32(base | (imm9 << 12) | (n << 5) | t)
+def aa_movz(rD, imm16):
+ """MOVZ rD, #imm16 — load a 16-bit immediate into the low half of
+ rD, zeroing the rest. Use this instead of `add rD, xzr, #imm`:
+ aarch64 ADD encodes register 31 as SP (not XZR), so the obvious
+ `mov xN, #imm` synthesis silently computes sp + imm."""
+ d = NAT_AA64[rD]
+ assert 0 <= imm16 < (1 << 16)
+ return le32(0xD2800000 | (imm16 << 5) | d)
+
## ---------- riscv64 primitive encoders ----------------------------------
@@ -461,9 +470,21 @@ class AA64(Encoder):
str_lr = aa_ldst_uimm12(0xF9000000, 'lr', 'sp', 0, 3)
save_fp = aa_add_imm('x8', 'sp', fb, sub=False)
str_fp = aa_ldst_uimm12(0xF9000000, 'x8', 'sp', 8, 3)
- mov_k = aa_add_imm('x8', 'xzr', k, sub=False)
+ # MUST use MOVZ here: ADD imm with Rn=31 reads SP, not XZR,
+ # so `add x8, xzr, #k` would store sp+k at [sp+16] instead
+ # of the slot count k. The pre-GC code never read [sp+16]
+ # so the bug stayed latent until the GC walker showed up.
+ mov_k = aa_movz('x8', k)
str_k = aa_ldst_uimm12(0xF9000000, 'x8', 'sp', 16, 3)
- return sub + str_lr + save_fp + str_fp + mov_k + str_k
+ # Zero each slot. The GC walker reads every slot through
+ # mark_value, which dereferences any value whose tag bits
+ # match a heap tag. Stale stack data left here from prior
+ # frames can otherwise look like a tagged pointer and chase
+ # the marker into garbage. See LISP-GC.md §"Roots".
+ zero_slots = ''
+ for i in range(k):
+ zero_slots += aa_ldst_uimm12(0xF9000000, 'xzr', 'sp', 24 + 8*i, 3)
+ return sub + str_lr + save_fp + str_fp + mov_k + str_k + zero_slots
def epilogue(self, k):
ldr_lr = aa_ldst_uimm12(0xF9400000, 'lr', 'sp', 0, 3)
@@ -600,6 +621,9 @@ class AMD64(Encoder):
# xor rax,rax ; add rax,k ; [rsp+16]=rax.
# rcx carries the retaddr, rax carries saved_fp then k. Neither
# is a live incoming P1 argument register at function entry.
+ # Each slot is then zeroed so a stale stack value can't masquerade
+ # as a tagged pointer when the GC walks this frame. See
+ # LISP-GC.md §"Roots".
fb = prologue_frame_bytes(k)
assert fb <= 127
seq = '59' # pop rcx
@@ -610,6 +634,9 @@ class AMD64(Encoder):
seq += amd_alu_rr('31', 'r0', 'r0') # xor rax, rax
seq += amd_alu_ri8(0, 'r0', k) # add rax, k
seq += amd_mem_rm('89', 'r0', 'sp', 16) # [rsp+16] = rax
+ seq += amd_alu_rr('31', 'r0', 'r0') # xor rax, rax
+ for i in range(k):
+ seq += amd_mem_rm('89', 'r0', 'sp', 24 + 8*i)
return seq
def epilogue(self, k):
@@ -698,7 +725,11 @@ class RV64(Encoder):
sd_fp = rv_s(0x00003023, 't0', 'sp', 8)
k_imm = rv_i(0x00000013, 't0', 'zero', k)
sd_k = rv_s(0x00003023, 't0', 'sp', 16)
- return sub + sd + fp + sd_fp + k_imm + sd_k
+ # Zero each slot to keep GC walks safe; see LISP-GC.md §"Roots".
+ zero_slots = ''
+ for i in range(k):
+ zero_slots += rv_s(0x00003023, 'zero', 'sp', 24 + 8*i)
+ return sub + sd + fp + sd_fp + k_imm + sd_k + zero_slots
def epilogue(self, k):
ld = rv_i(0x00003003, 'ra', 'sp', 0)
diff --git a/tests/lisp/17-gc-cons-churn.expected b/tests/lisp/17-gc-cons-churn.expected
@@ -0,0 +1 @@
+42
diff --git a/tests/lisp/17-gc-cons-churn.scm b/tests/lisp/17-gc-cons-churn.scm
@@ -0,0 +1,16 @@
+;; Cons churn stress: allocate many short-lived pairs, force multiple
+;; GC cycles. lisp.M1's bring-up heap is 32 KB (~2048 pairs). Each
+;; user-level cons drives several pair allocations once eval_args and
+;; the marshal path are counted, so 5000 iterations exhaust the arena
+;; multiple times over and exercise mark + sweep + freelist reuse.
+;;
+;; Uses the explicit `(define name (lambda ...))` form because the
+;; bring-up evaluator does not expand `(define (name args) body)`.
+
+(define churn
+ (lambda (n)
+ (if (= n 0)
+ 42
+ (begin (cons n n) (churn (- n 1))))))
+
+(churn 5000)
diff --git a/tests/lisp/18-gc-deep-list.expected b/tests/lisp/18-gc-deep-list.expected
@@ -0,0 +1 @@
+45150
diff --git a/tests/lisp/18-gc-deep-list.scm b/tests/lisp/18-gc-deep-list.scm
@@ -0,0 +1,23 @@
+;; Deep-list stress: build a list whose total allocations exceed the
+;; arena, then sum it. The mark phase must preserve every live pair
+;; on the partial list across the GC cycles that fire mid-build.
+;; A sweep that drops a live pair would corrupt the chain and either
+;; crash sum or yield a wrong total.
+;;
+;; build 300 → 300 user pairs alive at the end; ~8 pair allocs per
+;; iter * 300 = ~2400 raw allocations vs the 32 KB / 16 = 2048-pair
+;; arena, so GC must fire mid-build.
+;;
+;; Sum of 1..300 = 300 * 301 / 2 = 45150.
+
+(define build
+ (lambda (n acc)
+ (if (= n 0) acc
+ (build (- n 1) (cons n acc)))))
+
+(define sum
+ (lambda (lst acc)
+ (if (null? lst) acc
+ (sum (cdr lst) (+ acc (car lst))))))
+
+(sum (build 300 (quote ())) 0)
diff --git a/tests/lisp/19-gc-closure-churn.expected b/tests/lisp/19-gc-closure-churn.expected
@@ -0,0 +1 @@
+42
diff --git a/tests/lisp/19-gc-closure-churn.scm b/tests/lisp/19-gc-closure-churn.scm
@@ -0,0 +1,21 @@
+;; Closure churn: each call to make-counter evaluates a fresh
+;; (lambda () ...) which allocates a 32-byte closure object on the
+;; obj heap. The closure is consumed on the next line and becomes
+;; unreachable. Iterating exhausts the obj heap and forces GC to
+;; reclaim dead closures via the obj-arena sweep + free-list path.
+;;
+;; Each iteration drops 1 closure (32 bytes) plus a pile of
+;; transient pairs (eval_args, env-extend). 800 iterations easily
+;; cycle both arenas multiple times.
+
+(define make-counter
+ (lambda (start)
+ (lambda () start)))
+
+(define churn
+ (lambda (n)
+ (if (= n 0)
+ 42
+ (begin ((make-counter n)) (churn (- n 1))))))
+
+(churn 800)