boot2

Playing with the bootstrap
git clone https://git.ryansepassi.com/git/boot2.git

commit f74017622a562616c2b1993be427593e9a227220
parent 837a27f3af091754af36f471a1da603449e3e06a
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Tue, 21 Apr 2026 22:23:15 -0700

lisp.M1 step 13: wire GC into allocator + mostly-precise stack scan

The step-13 spike landed mark/sweep/free-list machinery but left the
allocator overflow paths bailing straight to OOM. Wire them through
the standard "GC then retry once" pattern:

- pair_alloc_need_gc / obj_alloc_need_gc now CALL gc, re-check the
  freelist (using the cached &head in obj_alloc's slot 2), then
  fall through to retry_bump on miss; OOM only if both fail post-GC.
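The "GC then retry once" shape can be sketched in Python. This is a toy model, not the lisp.M1 code: `Arena`, its `gc`, and the `live` set are illustrative stand-ins for the real mark/sweep machinery; only the control flow (free list, then bump, then one GC, then OOM) mirrors the pattern described above.

```python
class Arena:
    """Toy bump-plus-freelist arena with a single-retry GC policy."""

    def __init__(self, size, stride):
        self.limit = size
        self.stride = stride
        self.bump = 0            # next unallocated offset
        self.free_list = []      # populated by the sweep
        self.live = set()        # stand-in for "reachable from roots"

    def gc(self):
        # Stand-in collector: the sweep reclaims every allocated,
        # non-live slot onto the free list.
        self.free_list = [a for a in range(0, self.bump, self.stride)
                          if a not in self.live]

    def alloc(self):
        # Fast paths: free list first, then bump.
        if self.free_list:
            return self.free_list.pop()
        if self.bump + self.stride <= self.limit:
            addr = self.bump
            self.bump += self.stride
            return addr
        # Overflow path: GC once, re-check the free list, then retry
        # the bump; OOM only if both fail post-GC.
        self.gc()
        if self.free_list:
            return self.free_list.pop()
        if self.bump + self.stride <= self.limit:
            addr = self.bump
            self.bump += self.stride
            return addr
        raise MemoryError("OOM: freelist and bump both failed post-GC")
```

A 64-byte arena with 16-byte slots fills after four allocations; the fifth triggers the GC, which frees everything not in `live` and lets the retry succeed.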

Three problems surfaced once GC actually ran under cons-churn:

1. The aarch64 prologue's `mov xN, #k` was synthesized as
   `add xN, xzr, #k`. ADD imm encodes register 31 as SP, not XZR,
   so [sp+16] held sp+k instead of the slot count. Pre-GC nothing
   read [sp+16], so the bug was latent. Added an `aa_movz` helper
   and routed the k-load through it.
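The trap is easy to reproduce outside the repo. The sketch below re-derives both encodings from the ARMv8 A64 encoding tables (the helper names are fresh, not the repo's): in the ADD-immediate class, register number 31 in the Rn field selects SP, so the "mov via add" idiom assembles fine but reads the stack pointer at runtime, while MOVZ has no Rn field at all.

```python
import struct

def le32(word: int) -> bytes:
    """Pack a 32-bit instruction word little-endian."""
    return struct.pack('<I', word)

def movz(rd: int, imm16: int) -> bytes:
    """MOVZ Xd, #imm16 — imm16 in bits 20:5, Rd in bits 4:0.
    No Rn field, so there is no SP-vs-XZR ambiguity."""
    assert 0 <= imm16 < (1 << 16)
    return le32(0xD2800000 | (imm16 << 5) | rd)

def add_imm(rd: int, rn: int, imm12: int) -> bytes:
    """ADD Xd, Xn|SP, #imm12 — in this encoding class, Rn=31
    selects SP, not XZR."""
    assert 0 <= imm12 < (1 << 12)
    return le32(0x91000000 | (imm12 << 10) | (rn << 5) | rd)

# The "obvious" mov x8, #5 synthesis: assembles, but computes sp+5.
sp_trap = add_imm(8, 31, 5)
# The correct synthesis: actually loads 5 into x8.
safe = movz(8, 5)
```

Feeding either word to a disassembler makes the difference visible: the first decodes as `add x8, sp, #5`, the second as `mov x8, #5`.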

2. `:pair_heap_start` / `:obj_heap_start` land at whatever byte
   offset the assembler reached (no .align directive in M1). The
   pair sweep walks a fixed +16 stride from its start label, but
   pair_alloc's bump aligns its first allocation up — so the two
   walked different addresses and the bitmap origin was off too.
   Fix: _start aligns both bumps once and snapshots the aligned
   addresses into :pair_heap_base / :obj_heap_base, which the GC
   paths use as their canonical origin.
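The round-up itself is the usual add/shift-right/shift-left trick. A small sketch of that arithmetic (the label address below is made up; the real values depend on whatever the assembler emitted before the label):

```python
def align_up(addr: int, log2_stride: int) -> int:
    """Round addr up to the next 1 << log2_stride boundary, the same
    way the _start block does it: add stride-1, shift right, shift
    left to clear the low bits."""
    stride = 1 << log2_stride
    return ((addr + stride - 1) >> log2_stride) << log2_stride

# Hypothetical label that landed at an odd byte offset:
base = align_up(0x402A13, 4)           # 16-byte pair stride
# Every address the sweep derives from the snapshotted base stays on
# the same +16 grid the bump allocator used.
sweep_addrs = [base + 16 * i for i in range(4)]
```

Snapshotting `base` once and deriving both the sweep cursor and the bitmap index from it is what keeps mark and sweep on one grid.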

3. Frame slots can hold non-tagged values: obj_alloc spills
   integers / BSS pointers, intern spills raw byte pointers that
   may not be 8-aligned. The GC walker treating those as tagged
   would chase wild pointers. Adopted a mostly-precise scheme
   (see docs/LISP-GC.md §"Why mostly-precise"):
   - PROLOGUE_Nk now zeros every slot it reserves so uninitialized
     reads see tag-0 (harmless).
   - mark_value_pair / mark_value_object bounds-check the
     candidate raw pointer against [heap_base, heap_next) before
     dereference, in the BDW-collector style.
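The walker-side filter amounts to a tag check plus an arena range check before any dereference. A Python sketch of that logic — the tag value 2 for pairs matches the `addi r1,r1,neg2` untagging visible in the diff; the arena bounds and `mark` callback here are illustrative:

```python
PAIR_TAG = 2  # low-3-bit pair tag assumed for this sketch

def mark_candidate(word, heap_base, heap_next, mark):
    """Treat a frame slot as a pair reference only if its low bits
    carry the pair tag AND the untagged address falls inside the
    live arena prefix [heap_base, heap_next). Anything else is
    silently ignored — never dereferenced."""
    if word & 7 != PAIR_TAG:
        return False              # zeroed slots (tag 0) fall out here
    raw = word - PAIR_TAG         # strip the tag
    if not (heap_base <= raw < heap_next):
        return False              # stale raw value that mimics the tag
    mark(raw)
    return True
```

Note the failure mode this buys: a stale integer that passes both checks pins a dead pair (a leak), but can never send the marker outside the arena (a crash).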

Also publish :prim_argc from marshal_argv so the GC walker bounds
the prim_argv scan to the live prefix; stale entries from earlier
calls used to pin freed objects across GCs.
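The stale-root hazard in miniature, as a toy Python model (the buffer size, `marshal`, and the values are all made up; only the "walk the published prefix, not the whole buffer" rule is the point):

```python
# prim_argv never shrinks; old entries survive past their call.
prim_argv = [0] * 8

def marshal(args):
    """Copy args into the buffer and return the published count."""
    for i, a in enumerate(args):
        prim_argv[i] = a
    return len(args)

prim_argc = marshal(['x', 'y', 'z'])
prim_argc = marshal(['q'])            # 'y' and 'z' are now stale
live_roots = prim_argv[:prim_argc]    # bounded scan sees only 'q'
```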

Step-5 stress tests added: 17-gc-cons-churn (pair churn over many
GC cycles), 18-gc-deep-list (300-deep reachable list survives mid-
build GCs, sums to 45150), 19-gc-closure-churn (obj-arena exercise
via short-lived closures).

26 tests * 3 arches = 78 passes under `make test-lisp-all`.

Diffstat:
M Makefile                                |   5 ++++-
M docs/LISP-GC.md                         | 136 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------
M src/lisp.M1                             | 131 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
M src/p1_gen.py                           |  37 ++++++++++++++++++++++++++++++++++---
A tests/lisp/17-gc-cons-churn.expected    |   1 +
A tests/lisp/17-gc-cons-churn.scm         |  16 ++++++++++++++++
A tests/lisp/18-gc-deep-list.expected     |   1 +
A tests/lisp/18-gc-deep-list.scm          |  23 +++++++++++++++++++++++
A tests/lisp/19-gc-closure-churn.expected |   1 +
A tests/lisp/19-gc-closure-churn.scm      |  21 +++++++++++++++++++++
10 files changed, 340 insertions(+), 32 deletions(-)

diff --git a/Makefile b/Makefile
@@ -222,7 +222,10 @@ LISP_TESTS := \
 	tests/lisp/26-let.scm \
 	tests/lisp/27-cond.scm \
 	tests/lisp/28-quasi.scm \
-	tests/lisp/29-innerdef.scm
+	tests/lisp/29-innerdef.scm \
+	tests/lisp/17-gc-cons-churn.scm \
+	tests/lisp/18-gc-deep-list.scm \
+	tests/lisp/19-gc-closure-churn.scm
 
 # Build the prelude-prefixed twin of every fixture and feed that to the
 # interpreter; .expected still lives next to the original .scm.
diff --git a/docs/LISP-GC.md b/docs/LISP-GC.md
@@ -31,6 +31,14 @@ Load-bearing; the rest of the document assumes them.
 4. **Heap size 1 MB for this step.** Triggers GC easily under stress
    tests. Grow to 20 MB in a follow-up once C-compiler workloads
    demand it.
+5. **Mostly-precise stack scan with a bounded-conservative fallback.**
+   The roots in §Roots #1–#4 are *precise* — we know each entry is a
+   tagged Lisp value. The stack-frame walk (#5) is *bounded
+   conservative*: each frame slot is checked against
+   `[pair_heap_base, pair_heap_next)` / `[obj_heap_base,
+   obj_heap_next)` before being dereferenced, in the BDW-collector
+   style. See §"Why mostly-precise" for the reasoning and the
+   upgrade path.
 
 These supersede the following LISP.md text, which must be updated
 when this work lands: decision 11 (single bitmap over whole heap), §Heap
@@ -51,9 +59,19 @@ Single BSS region, split in two by fixed labels:
 
 Bump pointers:
 
-- `pair_heap_next` — starts at `pair_heap_start`, bumps by 16.
-- `obj_heap_next` — starts at `obj_heap_start`, bumps by 8-rounded
-  size.
+- `pair_heap_next` — bumps by 16. Initially `pair_heap_start`, but
+  `_start` rounds it up to a 16-byte boundary before any allocation
+  and snapshots that aligned value into `:pair_heap_base`. The GC
+  paths use `:pair_heap_base` as their canonical origin.
+- `obj_heap_next` — bumps by an 8-rounded size. Same alignment
+  treatment: `_start` rounds up to 8 and snapshots `:obj_heap_base`.
+
+The two `_base` slots exist because M1 has no `.align` directive
+and the labels `:pair_heap_start` / `:obj_heap_start` land at
+whatever byte offset the assembler reached. A fixed-stride pair
+sweep walking from the raw label would visit gap addresses; the
+runtime alignment + base snapshot keeps mark and sweep on the same
+grid.
 
 Free lists:
@@ -81,7 +99,8 @@ symbols, vectors, closures, primitives) routes through `obj_alloc`.
 
 ### Pair mark bitmap
 
 One bit per 16-byte slot in the pair arena, indexed by
-`(pair_ptr - pair_heap_start) / 16`. At 600 KB pair arena:
+`(pair_ptr - pair_heap_base) / 16` (note: `_base`, the runtime-
+aligned origin — see §Heap layout). At 600 KB pair arena:
 600 KB / 16 = 37 500 bits ≈ 4.7 KB BSS. Declared as
 `pair_mark_bitmap` with fixed size.
@@ -125,9 +144,12 @@ returning or branching.
 
 ### Chain termination
 
-`_start` writes `0` to BSS `stack_bottom_fp` before its first `CALL`.
-Any frame whose `saved_fp` is `0` is the bottom frame; the walker
-stops there.
+`_start` snapshots its own initial `sp` into `:stack_bottom_fp`
+before its first `CALL`. The first prologued callee then stores
+that same value as its `saved_fp`, so the walker compares each
+`saved_fp` it pops to `:stack_bottom_fp`'s value and stops on
+match. (The `_start` frame itself has no `PROLOGUE`, so there is
+nothing to walk past it.)
 
 ### Per-arch implementation in `p1_gen.py`
@@ -135,11 +157,21 @@ Three arch-specific emitters change:
 
 - **amd64**: `call` already pushes retaddr natively. Prologue does
   `pop rcx; mov rax, rsp; sub rsp, frame_size; mov [rsp], rcx;
-  mov [rsp+8], rax; mov qword [rsp+16], k`. Epilogue reverses.
-- **aarch64**: `sub sp, sp, frame_size; str lr, [sp, 0]; mov x9, sp
-  (pre-sub sp + frame_size, computed); str x9, [sp, 8];
-  mov x9, #k; str x9, [sp, 16]`. Epilogue reverses.
+  mov [rsp+8], rax; mov qword [rsp+16], k`, then zeros each slot
+  in turn. Epilogue reverses.
+- **aarch64**: `sub sp, sp, frame_size; str lr, [sp, 0]; add x8, sp,
+  #frame_size; str x8, [sp, 8]; movz x8, #k; str x8, [sp, 16]`,
+  then `str xzr, [sp, 24+8*i]` per slot. Epilogue reverses.
+  *Important:* the k-load uses `MOVZ` rather than `add x8, xzr, #k`
+  because aarch64 ADD encodes register 31 as SP, not XZR — the
+  obvious `mov xN, #imm` synthesis silently computes `sp + imm`.
+  This was a latent bug pre-GC since nothing read `[sp+16]`.
 - **riscv64**: analogous to aarch64, using `ra` and a scratch `t0`.
+  Slot zeroing uses `sd zero, off(sp)`.
+
+Slot zeroing is the prologue-side half of the mostly-precise scheme
+(see §"Why mostly-precise"); the bounds check in `mark_value_*` is
+the walker-side half.
 
 The "caller's sp" is `sp + frame_size` on all three arches after the
 sub has run, since prologue ran immediately after call.
@@ -168,17 +200,87 @@ Five sources, walked in this order on each GC invocation:
 
 2. **`symbol_table[0..4095]`.** Intern table's backing array. Walk
    each slot; mark occupied entries as tagged heap pointers.
 3. **`prim_argv[0..prim_argc)`.** Scratch region live during a
-   primitive call. The current `prim_argc` BSS slot bounds the walk.
-4. **Reader scratch** (`saved_src_*`, ~5 slots). Live during a
-   `read` invocation.
+   primitive call. `marshal_argv` publishes the count into the
+   `:prim_argc` BSS slot, which the walker uses as the upper bound;
+   stale entries past `prim_argc` (from earlier primitive calls)
+   are skipped so they can't pin freed objects across GCs.
+4. **Reader scratch** (`saved_src_*`). Holds the integer cursor /
+   line / col plus a raw byte pointer into `:src_buf`. None of these
+   are tagged Lisp values, so no marking is needed; listed here only
+   to make the omission explicit.
 5. **Stack (fp chain).** Start at the caller of `gc()`, walk
-   `saved_fp` until `stack_bottom_fp` (value 0). At each frame,
-   read `k` from `[fp+16]` and pass every word in `[fp+24..+24+8*k)`
-   through `mark_object`.
+   `saved_fp` until the value snapshotted in `:stack_bottom_fp` at
+   `_start` (the kernel-provided initial sp). At each frame, read
+   `k` from `[fp+16]` and pass every word in `[fp+24..+24+8*k)`
+   through `mark_object`. Each word is bounds-checked against the
+   pair / obj arenas before dereference (see §"Why mostly-precise").
 
 The `sym_*` special-form symbols are reachable via `symbol_table` so
 they need no separate root entry.
 
+## Why mostly-precise
+
+A fully precise stack scan needs the walker to know, at every
+GC-fire point, exactly which slots in each frame hold tagged Lisp
+values. Two things break that invariant in this codebase:
+
+1. **Uninitialized slots.** A function reserves `k` slots in its
+   prologue, but its body may take an allocating call before having
+   written all of them. Slots that haven't been written yet contain
+   stale stack bytes from prior frames.
+2. **Slots holding non-tagged values.** `obj_alloc` legitimately
+   spills a rounded byte size and a free-list head pointer into its
+   slots. `intern` / `make_string` / `make_symbol` spill raw byte
+   pointers (into `:src_buf` or `:str_*` literal regions) that may
+   not even be 8-aligned. These have arbitrary low 3 bits and don't
+   look like tagged Lisp values.
+
+The standard answers in the literature:
+
+- **Precise stack maps.** Compiler emits per-callsite metadata
+  (HotSpot, V8, Go, OCaml, SBCL). Best correctness, requires real
+  codegen infrastructure.
+- **Strict slot discipline.** Every walked slot is always a tagged
+  value or sentinel; raw values live in registers or non-walked
+  storage. Used by classic SECD-style implementations.
+- **Conservative scanning.** Treat every machine word as potentially
+  a pointer; check if it falls in heap range. Boehm–Demers–Weiser,
+  Mono, MicroPython, JavaScriptCore (for the C++ stack).
+- **Mostly-precise / semi-conservative.** Precise for known-typed
+  roots, conservative for the C stack. CPython's cstack bridge,
+  SpiderMonkey for foreign frames.
+
+This implementation is mostly-precise (#4). The defenses live in
+two places:
+
+- **`PROLOGUE_Nk` zeros every slot it reserves** (`p1_gen.py`
+  prologue emitters). An uninitialized slot deterministically reads
+  as `0`, which has no heap tag and is filtered by the bounds check
+  for free.
+- **`mark_value_pair` / `mark_value_object` bounds-check the
+  candidate raw pointer** against `[heap_base, heap_next)` of its
+  arena before any dereference. A stale stack value whose low 3 bits
+  happen to match a heap tag is rejected unless its address actually
+  lies in the live arena prefix.
+
+The bounds check is the BDW idea in miniature, scoped to the stack
+walk only. False positives (a random stack int that does fall in
+the heap range) cause leaks, not crashes — they pin a dead object
+until something perturbs the stack enough to overwrite the slot.
+
+### Upgrade path
+
+If/when leak measurement shows the conservative fallback pinning
+real memory, the clean answer is **two slot counts per frame**:
+`k_tagged` for GC-walked slots, `k_raw` for scratch the GC
+ignores. `obj_alloc` declares 0+2, `cons` declares 2+0,
+`intern` declares 0+3. The frame layout grows a second count
+word, every `PROLOGUE_Nk` site is reclassified, frame-relative
+offsets shift for the raw region, and the walker can drop both
+the slot zeroing and the bounds check. That's the right shape
+for the long-lived compiler; for the seed that bootstraps one
+C compiler, the cost outweighs the benefit.
+
 ## Mark phase
 
 ```
diff --git a/src/lisp.M1 b/src/lisp.M1
@@ -85,6 +85,32 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
 	mov_r0,sp
 	st_r0,r1,0
 
+	## Align both heap bumps to their natural strides and snapshot
+	## the aligned starts in :pair_heap_base / :obj_heap_base. The
+	## :pair_heap_start label may land at any byte offset depending
+	## on what the assembler emitted before it; if a pair sweep then
+	## walked from the raw label with a fixed +16 stride it would
+	## visit gap addresses, never the actual allocations. Pinning
+	## the bump and the bitmap origin to the same aligned address
+	## keeps mark and sweep on the same grid.
+	li_r1 &pair_heap_next
+	ld_r0,r1,0
+	addi_r0,r0,15
+	shri_r0,r0,4
+	shli_r0,r0,4
+	st_r0,r1,0
+	li_r1 &pair_heap_base
+	st_r0,r1,0
+
+	li_r1 &obj_heap_next
+	ld_r0,r1,0
+	addi_r0,r0,7
+	shri_r0,r0,3
+	shli_r0,r0,3
+	st_r0,r1,0
+	li_r1 &obj_heap_base
+	st_r0,r1,0
+
 	## openat(AT_FDCWD=-100, argv[1], O_RDONLY=0, mode=0) → r0 = fd.
 	## Use openat rather than open because open is amd64-only
 	## (removed from the asm-generic table used by aarch64/riscv64).
@@ -945,9 +971,20 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
 	epilogue
 	ret
 
+## GC, then retry. Re-checks the free list first because the sweep
+## populates it with reclaimed pairs; only falls through to the bump
+## path if the list is still empty. A second overflow goes to OOM.
 :pair_alloc_need_gc
-	li_br &alloc_oom
-	b
+	li_br &gc
+	call
+	li_r3 &free_list_pair
+	ld_r0,r3,0
+	li_br &pair_alloc_retry_bump
+	beqz_r0
+	ld_r1,r0,0
+	st_r1,r3,0
+	epilogue
+	ret
 
 :pair_alloc_retry_bump
 	li_r3 &pair_heap_next
@@ -1005,9 +1042,24 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
 	epilogue_n2
 	ret
 
+## GC, then retry. Slot 2 still holds the size-class freelist head
+## (or 0 when no class fits this size). After the sweep, re-check
+## that list before falling through to the bump path. Second overflow
+## escalates to OOM.
 :obj_alloc_need_gc
-	li_br &alloc_oom
-	b
+	li_br &gc
+	call
+	ld_r3,sp,32		## &head (or 0 = no size class)
+	li_br &obj_alloc_retry_bump
+	beqz_r3
+	ld_r1,r3,0
+	li_br &obj_alloc_retry_bump
+	beqz_r1
+	ld_r2,r1,8
+	st_r2,r3,0
+	mov_r0,r1
+	epilogue_n2
+	ret
 
 :obj_alloc_retry_bump
 	ld_r1,sp,24
@@ -1118,7 +1170,8 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
 ## ---- pair mark-bitmap helpers ---------------------------------------
 
 :pair_mark_seen_or_set
-	li_r2 &pair_heap_start
+	li_r2 &pair_heap_base
+	ld_r2,r2,0
 	sub_r2,r1,r2
 	shri_r2,r2,4		## slot index
 	mov_r3,r2
@@ -1142,7 +1195,8 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
 	ret
 
 :pair_mark_test
-	li_r2 &pair_heap_start
+	li_r2 &pair_heap_base
+	ld_r2,r2,0
 	sub_r2,r1,r2
 	shri_r2,r2,4
 	mov_r3,r2
@@ -1162,7 +1216,8 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
 	ret
 
 :pair_mark_clear
-	li_r2 &pair_heap_start
+	li_r2 &pair_heap_base
+	ld_r2,r2,0
 	sub_r2,r1,r2
 	shri_r2,r2,4
 	mov_r3,r2
@@ -1218,8 +1273,23 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
 	epilogue_n4
 	ret
 
+## Tagged-pair bounds check: a frame slot may hold a raw integer whose
+## low 3 bits coincide with the pair tag (e.g. a small loop counter).
+## Reject anything outside [pair_heap_start, pair_heap_next) so the
+## bitmap index stays in range and dereferencing stays inside our arena.
 :mark_value_pair
-	addi_r1,r1,neg2
+	addi_r1,r1,neg2		## r1 = raw pair ptr
+	li_r0 &pair_heap_base
+	ld_r0,r0,0
+	li_br &mark_value_done
+	blt_r1,r0
+	li_r0 &pair_heap_next
+	ld_r0,r0,0
+	li_br &mark_value_pair_inb
+	blt_r1,r0
+	li_br &mark_value_done
+	b
+:mark_value_pair_inb
 	st_r1,sp,24		## slot 1 = raw pair ptr
 	li_br &pair_mark_seen_or_set
 	call
@@ -1238,9 +1308,23 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
 	li_br &mark_value_done
 	b
 
+## Same bounds-check rationale as mark_value_pair. Reject anything
+## outside [obj_heap_start, obj_heap_next) before touching the header
+## byte, so a stale raw pointer in a frame slot can't crash the marker.
 :mark_value_object
 	andi_r0,r1,7
 	sub_r1,r1,r0		## raw object ptr
+	li_r0 &obj_heap_base
+	ld_r0,r0,0
+	li_br &mark_value_done
+	blt_r1,r0
+	li_r0 &obj_heap_next
+	ld_r0,r0,0
+	li_br &mark_value_object_inb
+	blt_r1,r0
+	li_br &mark_value_done
+	b
+:mark_value_object_inb
 	st_r1,sp,24		## slot 1 = raw object ptr
 	li_br &obj_mark_seen_or_set
 	call
@@ -1359,11 +1443,16 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
 
 ## ---- gc_mark_prim_argv() --------------------------------------------
 
+## Bound by :prim_argc, the slot count marshal_argv last published.
+## Walking past it would re-mark stale tagged values left over from
+## earlier primitive calls, falsely keeping the objects they pointed
+## to alive across an arbitrary number of GCs.
 :gc_mark_prim_argv
 	prologue_n2
 	li_r1 &prim_argv
 	st_r1,sp,24		## slot 1 = cursor
-	li_r1 %32
+	li_r1 &prim_argc
+	ld_r1,r1,0
 	st_r1,sp,32		## slot 2 = remaining slots
 
 :gc_mark_prim_argv_loop
@@ -1487,7 +1576,8 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
 ## ---- gc_sweep_pair() -------------------------------------------------
 :gc_sweep_pair
 	prologue
-	li_r1 &pair_heap_start
+	li_r1 &pair_heap_base
+	ld_r1,r1,0
 	st_r1,sp,24
 
 :gc_sweep_pair_loop
@@ -1530,7 +1620,8 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
 ## ---- gc_sweep_obj() --------------------------------------------------
 :gc_sweep_obj
 	prologue_n4
-	li_r1 &obj_heap_start
+	li_r1 &obj_heap_base
+	ld_r1,r1,0
 	st_r1,sp,24		## slot 1 = current ptr
 
 :gc_sweep_obj_loop
@@ -3728,6 +3819,11 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
 	b
 
 :marshal_argv_done
+	## Publish the count so the GC root walker only marks the live
+	## prefix of prim_argv. Stale slots beyond the current call would
+	## otherwise look like roots and pin freed objects across GCs.
+	li_r0 &prim_argc
+	st_r3,r0,0
 	mov_r0,r3		## r0 = count
 	li_r2 &prim_argv	## r2 = buffer base
 	epilogue
@@ -6990,6 +7086,19 @@ DEFINE ZERO32 '0000000000000000000000000000000000000000000000000000000000000000'
 :stack_bottom_fp %0 %0
 :gc_root_fp %0 %0
 
+## Aligned mirrors of :pair_heap_start / :obj_heap_start. _start fills
+## these once, so every GC code path can derive the same canonical
+## start address that the bump pointer actually used for its first
+## allocation. See the alignment block in _start for the rationale.
+:pair_heap_base %0 %0
+:obj_heap_base %0 %0
+
+## Live prefix length of :prim_argv, set by marshal_argv on every
+## primitive call. The GC root walker uses this as the upper bound
+## of the prim_argv scan. Initialised to 0 so a GC that fires before
+## the first primitive call walks zero slots.
+:prim_argc %0 %0
+
 :free_list_pair %0 %0
 :free_list_obj16 %0 %0
 :free_list_obj24 %0 %0
diff --git a/src/p1_gen.py b/src/p1_gen.py
@@ -240,6 +240,15 @@ def aa_ldst_unscaled(base, rT, rN, off):
     t, n = NAT_AA64[rT], NAT_AA64[rN]
     return le32(base | (imm9 << 12) | (n << 5) | t)
 
+def aa_movz(rD, imm16):
+    """MOVZ rD, #imm16 — load a 16-bit immediate into the low half of
+    rD, zeroing the rest. Use this instead of `add rD, xzr, #imm`:
+    aarch64 ADD encodes register 31 as SP (not XZR), so the obvious
+    `mov xN, #imm` synthesis silently computes sp + imm."""
+    d = NAT_AA64[rD]
+    assert 0 <= imm16 < (1 << 16)
+    return le32(0xD2800000 | (imm16 << 5) | d)
+
 ## ---------- riscv64 primitive encoders ----------------------------------
@@ -461,9 +470,21 @@ class AA64(Encoder):
         str_lr = aa_ldst_uimm12(0xF9000000, 'lr', 'sp', 0, 3)
         save_fp = aa_add_imm('x8', 'sp', fb, sub=False)
         str_fp = aa_ldst_uimm12(0xF9000000, 'x8', 'sp', 8, 3)
-        mov_k = aa_add_imm('x8', 'xzr', k, sub=False)
+        # MUST use MOVZ here: ADD imm with Rn=31 reads SP, not XZR,
+        # so `add x8, xzr, #k` would store sp+k at [sp+16] instead
+        # of the slot count k. The pre-GC code never read [sp+16]
+        # so the bug stayed latent until the GC walker showed up.
+        mov_k = aa_movz('x8', k)
         str_k = aa_ldst_uimm12(0xF9000000, 'x8', 'sp', 16, 3)
-        return sub + str_lr + save_fp + str_fp + mov_k + str_k
+        # Zero each slot. The GC walker reads every slot through
+        # mark_value, which dereferences any value whose tag bits
+        # match a heap tag. Stale stack data left here from prior
+        # frames can otherwise look like a tagged pointer and chase
+        # the marker into garbage. See LISP-GC.md §"Roots".
+        zero_slots = ''
+        for i in range(k):
+            zero_slots += aa_ldst_uimm12(0xF9000000, 'xzr', 'sp', 24 + 8*i, 3)
+        return sub + str_lr + save_fp + str_fp + mov_k + str_k + zero_slots
 
     def epilogue(self, k):
         ldr_lr = aa_ldst_uimm12(0xF9400000, 'lr', 'sp', 0, 3)
@@ -600,6 +621,9 @@ class AMD64(Encoder):
         # xor rax,rax ; add rax,k ; [rsp+16]=rax.
         # rcx carries the retaddr, rax carries saved_fp then k. Neither
         # is a live incoming P1 argument register at function entry.
+        # Each slot is then zeroed so a stale stack value can't masquerade
+        # as a tagged pointer when the GC walks this frame. See
+        # LISP-GC.md §"Roots".
         fb = prologue_frame_bytes(k)
         assert fb <= 127
         seq = '59'                              # pop rcx
@@ -610,6 +634,9 @@ class AMD64(Encoder):
         seq += amd_alu_rr('31', 'r0', 'r0')     # xor rax, rax
         seq += amd_alu_ri8(0, 'r0', k)          # add rax, k
         seq += amd_mem_rm('89', 'r0', 'sp', 16) # [rsp+16] = rax
+        seq += amd_alu_rr('31', 'r0', 'r0')     # xor rax, rax
+        for i in range(k):
+            seq += amd_mem_rm('89', 'r0', 'sp', 24 + 8*i)
         return seq
 
     def epilogue(self, k):
@@ -698,7 +725,11 @@ class RV64(Encoder):
         sd_fp = rv_s(0x00003023, 't0', 'sp', 8)
         k_imm = rv_i(0x00000013, 't0', 'zero', k)
         sd_k = rv_s(0x00003023, 't0', 'sp', 16)
-        return sub + sd + fp + sd_fp + k_imm + sd_k
+        # Zero each slot to keep GC walks safe; see LISP-GC.md §"Roots".
+        zero_slots = ''
+        for i in range(k):
+            zero_slots += rv_s(0x00003023, 'zero', 'sp', 24 + 8*i)
+        return sub + sd + fp + sd_fp + k_imm + sd_k + zero_slots
 
     def epilogue(self, k):
         ld = rv_i(0x00003003, 'ra', 'sp', 0)
diff --git a/tests/lisp/17-gc-cons-churn.expected b/tests/lisp/17-gc-cons-churn.expected
@@ -0,0 +1 @@
+42
diff --git a/tests/lisp/17-gc-cons-churn.scm b/tests/lisp/17-gc-cons-churn.scm
@@ -0,0 +1,16 @@
+;; Cons churn stress: allocate many short-lived pairs, force multiple
+;; GC cycles. lisp.M1's bring-up heap is 32 KB (~2048 pairs). Each
+;; user-level cons drives several pair allocations once eval_args and
+;; the marshal path are counted, so 5000 iterations exhaust the arena
+;; multiple times over and exercise mark + sweep + freelist reuse.
+;;
+;; Uses the explicit `(define name (lambda ...))` form because the
+;; bring-up evaluator does not expand `(define (name args) body)`.
+
+(define churn
+  (lambda (n)
+    (if (= n 0)
+        42
+        (begin (cons n n) (churn (- n 1))))))
+
+(churn 5000)
diff --git a/tests/lisp/18-gc-deep-list.expected b/tests/lisp/18-gc-deep-list.expected
@@ -0,0 +1 @@
+45150
diff --git a/tests/lisp/18-gc-deep-list.scm b/tests/lisp/18-gc-deep-list.scm
@@ -0,0 +1,23 @@
+;; Deep-list stress: build a list whose total allocations exceed the
+;; arena, then sum it. The mark phase must preserve every live pair
+;; on the partial list across the GC cycles that fire mid-build.
+;; A sweep that drops a live pair would corrupt the chain and either
+;; crash sum or yield a wrong total.
+;;
+;; build 300 → 300 user pairs alive at the end; ~8 pair allocs per
+;; iter * 300 = ~2400 raw allocations vs the 32 KB / 16 = 2048-pair
+;; arena, so GC must fire mid-build.
+;;
+;; Sum of 1..300 = 300 * 301 / 2 = 45150.
+
+(define build
+  (lambda (n acc)
+    (if (= n 0) acc
+        (build (- n 1) (cons n acc)))))
+
+(define sum
+  (lambda (lst acc)
+    (if (null? lst) acc
+        (sum (cdr lst) (+ acc (car lst))))))
+
+(sum (build 300 (quote ())) 0)
diff --git a/tests/lisp/19-gc-closure-churn.expected b/tests/lisp/19-gc-closure-churn.expected
@@ -0,0 +1 @@
+42
diff --git a/tests/lisp/19-gc-closure-churn.scm b/tests/lisp/19-gc-closure-churn.scm
@@ -0,0 +1,21 @@
+;; Closure churn: each call to make-counter evaluates a fresh
+;; (lambda () ...) which allocates a 32-byte closure object on the
+;; obj heap. The closure is consumed on the next line and becomes
+;; unreachable. Iterating exhausts the obj heap and forces GC to
+;; reclaim dead closures via the obj-arena sweep + free-list path.
+;;
+;; Each iteration drops 1 closure (32 bytes) plus a pile of
+;; transient pairs (eval_args, env-extend). 800 iterations easily
+;; cycle both arenas multiple times.
+
+(define make-counter
+  (lambda (start)
+    (lambda () start)))
+
+(define churn
+  (lambda (n)
+    (if (= n 0)
+        42
+        (begin ((make-counter n)) (churn (- n 1))))))
+
+(churn 800)