boot2

Playing with the bootstrap
git clone https://git.ryansepassi.com/git/boot2.git

commit 4f257aaba799025e6f84cccd9311991e67c65681
parent 78312f7592353f51c3b5957eca18ea098cbd4a66
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Mon, 20 Apr 2026 22:20:13 -0700

P1: extract branch-scratch; 4:4 caller/callee split; stack-walking GC roots

Branch target lives in a hidden native scratch reg (aarch64 x17, amd64
r11, riscv64 t5) instead of a P1 GPR. Source code uses LI_BR in place
of LI_R7, and branch/call ops drop the _R7 suffix — the scratch is no
longer a named P1 register, so it can't be accidentally referenced from
program code.
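
At a branch site the change is a pure rename; the before/after shape (a sketch using a hypothetical `&fail` label, following the P1.md pattern) is:

```
## before: branch target staged in P1 r7
P1_LI_R7
&fail
P1_BNE_R1_R2_R7

## after: target staged in the hidden native scratch (x17/r11/t5)
P1_LI_BR
&fail
P1_BNE_R1_R2
```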

The freed slot realigns the P1 register file as 4:4 caller/callee-saved:
r0-r3 caller-saved (native argregs on each arch), r4-r7 callee-saved.
r4/r5 remap to native callee-saved regs (aarch64 x26/x27, amd64 r13/r14,
riscv64 s4/s5) so SYSCALL no longer has to preserve them across the
kernel trap. Function bodies get four walkable cross-call roots without
touching the frame.
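
A sketch of what the remap buys (hypothetical snippet; op spellings assumed, in the demo.M1 style): a value parked in r4 now rides through a SYSCALL with no explicit spill, because the native reg backing it (x26/r13/s4) is callee-saved under the kernel ABI.

```
P1_LI_R4            ## r4 = live tagged value (hypothetical)
'51010000'          ## fixnum(42) = 42<<3|1 = 0x151
                    ## ... set up r0-r3 for the syscall ...
P1_SYSCALL          ## clobbers r0 only; r4 still holds 0x151 after
```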

amd64 PROLOGUE/EPILOGUE now use rcx (not r11) as the retaddr-carry
scratch. TAIL = EPILOGUE + jmp r11, so the old sequence popped its own
branch target one instruction before jumping through it; the bug
surfaced as demo exiting 13 on amd64 while aarch64/riscv64 passed. rcx was
already a non-P1 scratch used by shifts and DIV/REM, so no new
reservations.
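
The failure mode, sketched in pseudo-amd64 terms (not the literal M1 expansion text):

```
## TAIL = EPILOGUE + jmp r11
##
## old:  LI_BR &target    ; r11 <- target
##       EPILOGUE         ; retaddr carried through r11 -- clobbers target
##       jmp r11          ; jumps to the caller's retaddr, not &target
##
## new:  EPILOGUE carries the retaddr in rcx, so r11 still holds
##       &target when the jmp fires.
```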

lisp.M1 cons() switches its sp-temp from r4 to r3 (caller-saved, no
preservation burden) and alloc() now uses only r0-r3, shrinking its
callee-save pressure to zero.
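
In register-pressure terms (a hypothetical sketch, not the actual cons body): a caller-saved temp is free in straight-line code before the allocating call, while a callee-saved one would now oblige cons's callees to preserve it.

```
## cons(r1=car, r2=cdr), sketched:
##   r3 = sp/temp scratch            (caller-saved: nothing to restore)
##   r1/r2 spilled to frame slots    (GC-walkable roots across alloc)
##   LI_BR &alloc ; CALL             (alloc itself touches only r0-r3)
```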

LISP.md §Settled decisions #3 rewrites the GC root model from "32 fixed
BSS root_slots + live_mask, stack not scanned" to "per-frame root slots
walked on the P1 stack." This was the discipline N-slot PROLOGUE was
built for in 78312f7 — the doc just hadn't caught up. global_env and
symbol_table remain named BSS roots (long-lived singletons with no
stack frame to hang off of).
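
The new root model is easy to state as a toy (illustrative Python, not the M1 code — `walk_roots` and the frame layout are hypothetical; tag scheme per LISP.md: low 3 bits, fixnum = 1, pair = 2):

```python
# Toy model of the per-frame GC root walk. Frames model the region
# between sp and the frame-chain bottom; each frame is its PROLOGUE_Nk
# slot words. Non-pointer slots self-filter: their low-bit tag doesn't
# name a heap object.
PAIR_TAG = 2

def walk_roots(stack_frames, heap):
    """Return heap addresses found in frame slots."""
    roots = []
    for frame in stack_frames:
        for word in frame:
            if word & 7 == PAIR_TAG and (word & ~7) in heap:
                roots.append(word & ~7)
    return roots

heap = {0x1000, 0x1010}                  # 8-aligned pair cells in the arena
frames = [
    [0x1000 | PAIR_TAG, (42 << 3) | 1],  # a live pair and fixnum(42)
    [0x9999, 0x1010 | PAIR_TAG],         # junk word self-filters
]
print(walk_roots(frames, heap))          # -> [4096, 4112]
```

`global_env` and `symbol_table` would be marked separately, as named BSS roots.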

P1.md and comments across lisp.M1/demo.M1/p1_gen.py updated to match.
P1_TODO.md removed — all five issues are landed.

Diffstat:
M LISP.md      |  42 ++++++++++++++++++++++++------------------
M Makefile     |   3 +--
M P1.md        | 156 +++++++++++++++++++++++++++++++++++++++++++------------------------------------
D P1_TODO.md   | 251 -------------------------------------------------------------------------------
M demo.M1      |  56 ++++++++++++++++++++++++++++----------------------------
M lisp.M1      |  73 +++++++++++++++++++++++++++++++++++++++----------------------------------
M p1_aarch64.M1 | 99 ++++++++++++++++++++++++++++++++++++++++++++-----------------------------------
M p1_amd64.M1  | 115 +++++++++++++++++++++++++++++++++++++++++++------------------------------------
M p1_gen.py    | 198 ++++++++++++++++++++++++++++++++++++++++++-------------------------------------
M p1_riscv64.M1 | 99 ++++++++++++++++++++++++++++++++++++++++++++-----------------------------------
10 files changed, 456 insertions(+), 636 deletions(-)

diff --git a/LISP.md b/LISP.md
@@ -31,9 +31,11 @@ Load-bearing; the rest of the document assumes them.
    §Value representation.
 2. **Mark-sweep GC over a fixed BSS arena.** 20 MB, linked in at build
    time, no `brk`/`mmap`.
-3. **Fixed root-spill slots.** 32 named cells in BSS; Scheme-facing
-   functions spill live values there before any allocating call. The
-   P1 stack is not scanned.
+3. **Per-frame root slots on the P1 stack.** Scheme-facing functions
+   declare `PROLOGUE_Nk` frames with one slot per live tagged value
+   they need across an allocating call; the GC mark phase walks those
+   frames. `global_env` and `symbol_table` remain named BSS roots
+   (long-lived singletons with no stack frame to hang off of).
 4. **Source file via `argv[1]`.** The interpreter opens the path given
    on the command line, reads it whole into a string, parses it,
    evaluates top-to-bottom. No linked-in blob.
@@ -144,14 +146,15 @@ Three sources:
    transitively.
 2. **Symbol table.** Intern table's backing array plus all interned
    symbols and their name strings. Fixed pointer (`symbol_table`).
-3. **Root-spill slots.** 32 tagged words at `root_slots[0..32]` in BSS.
-   A 32-bit `live_mask` word marks which slots currently hold a live
-   value. Scheme-facing functions set and clear bits in the mask
-   across their dynamic extent.
+3. **Per-frame stack slots.** The mark phase walks the P1 stack from
+   the current `sp` up to the saved frame-chain bottom. Each frame
+   declares its slot count via `PROLOGUE_Nk` (see P1.md §"Semantics"),
+   and the GC treats every slot word as a tagged-value root —
+   non-pointer slots self-filter because their tags don't match any
+   heap object.
 
-The P1 stack is **not scanned.** Stack frames hold P1 integers, return
-addresses, and scratch — not Scheme-value-shaped words unless those
-words are also rooted in a spill slot.
+Return addresses and raw P1 integers on the stack sit outside the
+PROLOGUE_Nk slot range, so the walker doesn't see them.
 
 ### Algorithm
@@ -196,11 +199,13 @@ uncommon in compiler workloads.
 ### Invariants
 
 - Between any two GC-observable points, every live Scheme value must be
-  reachable from `global_env`, `symbol_table`, or a live root slot.
-  "GC-observable" = any call to `cons`, `make-vector`, `make-string`,
-  `make-closure`, `intern`, or another Scheme function.
-- `live_mask` bits are owned by the function that set them; callers
-  don't set a callee's bits.
+  reachable from `global_env`, `symbol_table`, or a PROLOGUE_Nk slot
+  in an active frame. "GC-observable" = any call to `cons`,
+  `make-vector`, `make-string`, `make-closure`, `intern`, or another
+  Scheme function.
+- Scheme-facing functions spill each live tagged value into a frame
+  slot before any allocating call. A value kept only in a P1 register
+  will be lost if GC fires mid-call.
 
 Tradeoffs taken:
 
@@ -348,8 +353,8 @@ startup the interpreter populates `global_env` with one binding per
 primitive, each wrapped as a primitive-type heap object.
 
 Primitives allocate through the same `alloc`/GC path as user code and
-must honor the root-spill discipline when they make more than one
-allocation.
+must honor the per-frame-slot root discipline when they make more than
+one allocation.
 
 Target set: ~40 primitives per PLAN.md. Roughly:
@@ -464,7 +469,8 @@ coverage before the next begins.
    via syscalls, `error` path, BSS layout, bump allocator. Exits with
    the file's first byte as proof-of-life. ~200 P1 LOC.
 2. **Tagged values.** Fixnum/pair/singleton encoding; `cons`/`car`/`cdr`/
-   tag predicates. Hand-build a pair; print its car. ~300 LOC.
+   tag predicates. Hand-build `cons(42, nil)` and exit with the decoded
+   car as status. ~300 LOC.
 3. **Strings + symbol interning.** Heap strings, 4096-slot intern
    table, symbol allocator. Two `(intern "foo")` calls return `eq?`
    pointers. ~400 LOC.
diff --git a/Makefile b/Makefile
@@ -84,8 +84,7 @@ $(TOOLS_DIR)/M0 $(TOOLS_DIR)/hex2-0 $(TOOLS_DIR)/catm $(TOOLS_DIR)/hex0 $(TOOLS_
 #
 # Lint catches P1_*/SYS_* tokens with no matching DEFINE — M1 otherwise
 # silently emits the literal token text and produces a SIGILL-on-run
-# binary (see P1_TODO.md issue 1). Runs on the host (plain POSIX sh);
-# no podman dependency.
+# binary. Runs on the host (plain POSIX sh); no podman dependency.
 #
 # M0 takes a single positional input (no -f flag), so we catm the two
 # sources together first. The intermediate .combined.M1 is kept in OUT_DIR
diff --git a/P1.md b/P1.md
@@ -44,9 +44,10 @@ All runs on stock stage0 `M0` + `hex2-0`, bootstrapped per-arch from
 The DEFINE table is generator-driven (`p1_gen.py`); tranches 1–8 are
 enumerated there, plus the full PROLOGUE_Nk family (k=1..4). Branch
-offsets are realized by the r7-indirect pattern
-(`LI_R7 &target ; BXX_rA_rB_R7`), sidestepping the missing
-branch-offset support in hex2.
+offsets are realized by the LI_BR-indirect pattern
+(`LI_BR &target ; BXX_rA_rB`), sidestepping the missing
+branch-offset support in hex2. The branch-target scratch is a
+reserved native reg (`x17`/`r11`/`t5`), not a P1 GPR.
 
 ### Spike deviations from the design
 
@@ -72,15 +73,16 @@ branch-offset support in hex2.
 | Registers | 8 GPRs (`r0`–`r7`) + `sp`, `lr`-on-stack | Fits x86-64's usable register budget |
 | Narrow imm | Signed 12-bit | riscv I-type width; aarch64 ≤12 also OK |
 | Wide imm | Pool-loaded via PC-relative `LI` | Avoids arch-specific immediate synthesis |
-| Calling conv | r0 = return, r1–r6 = args, r6 callee-saved, r7 per-op scratch | P1-defined; not platform ABI |
+| Calling conv | r0 = return, r1–r3 = args (caller-saved), r4–r7 callee-saved | P1-defined; not platform ABI |
 | Return address | Always spilled to stack on entry | Hides x86's missing `lr` uniformly |
 | Syscall | `SYSCALL` with num in r0, args r1–r6; clobbers r0 only | Per-arch wrapper emits native sequence |
 | Spill slot | `[sp + 8]` is callee-private scratch after `PROLOGUE` | Frame already 16 B for alignment; second cell was otherwise unused |
 
 ## Register mapping
 
-`r0`–`r5` are caller-saved. `r6` is callee-saved. `r7` is per-op scratch
-(see below). `sp` is special-purpose — see `PROLOGUE` semantics.
+`r0`–`r3` are caller-saved. `r4`–`r7` are callee-saved, general-purpose,
+and preserved across `CALL`/`SYSCALL`. `sp` is special-purpose — see
+`PROLOGUE` semantics.
 
 | P1 | amd64 | aarch64 | riscv64 |
 |------|-------|---------|---------|
@@ -88,43 +90,51 @@ branch-offset support in hex2.
 | `r1` | `rdi` | `x1` | `a1` |
 | `r2` | `rsi` | `x2` | `a2` |
 | `r3` | `rdx` | `x3` | `a3` |
-| `r4` | `r10` | `x4` | `a4` |
-| `r5` | `r8` | `x5` | `a5` |
+| `r4` | `r13` | `x26` | `s4` |
+| `r5` | `r14` | `x27` | `s5` |
 | `r6` | `rbx` | `x19` | `s1` |
 | `r7` | `r12` | `x20` | `s2` |
 | `sp` | `rsp` | `sp` | `sp` |
 | `lr` | (mem) | `x30` | `ra` |
 
+`r4`–`r7` all map to native callee-saved regs on each arch, so the SysV
+kernel+libc "callee preserves these" rule does the work for us across
+syscalls without explicit save/restore in the `SYSCALL` expansion.
+
 x86-64 has no link register; `CALL`/`RET` macros push/pop the return
 address on the stack.
 
 On aarch64/riscv64, the prologue spills `lr` (`x30`/`ra`) to the stack
 too, so all three converge on "return address lives in `[sp + 0]` after
 prologue." This uniformity is worth the extra store on the register-rich
 arches.
 
-**`r7` is not a general-purpose callee-saved register.** Every conditional
-and unconditional branch in P1 compiles through the r7-indirect pattern
-(`LI_R7 &target ; BLT/B`), which overwrites `r7` at every branch site.
-Treat `r7` as a per-instruction scratch owned by the branch / call
-machinery: its value is meaningful only between the `LI_R7` that loads a
-target and the branch that consumes it. Never carry a live value through a
-branch or call in `r7`.
-
 **Reserved scratch registers (not available to P1):** certain native regs
 are used internally by op expansions and are never exposed as P1
 registers. Every P1 op writes only what its name says it writes — reserved
 scratch is save/restored within the expansion so no hidden clobbers leak
 across op boundaries.
 
-- **aarch64** — `x21`–`x25` hold `r1`–`r5` across the `SYSCALL` arg
-  shuffle. `x16` (ARM IP0) is scratch for `REM` (carries the `SDIV`
-  quotient into the following `MSUB`). `x8` holds the syscall number.
-- **amd64** — `rcx` and `r11` are kernel-clobbered by the `syscall`
-  instruction itself; `PROLOGUE` additionally uses `r11` to temporarily
-  hold the return address across the `sub rsp, N`. `DIV`/`REM` use both
-  `rcx` (to save `rdx` = P1 `r3`) and `r11` (to save `rax` = P1 `r0`)
-  so that idiv's implicit writes to rax/rdx stay invisible.
-- **riscv64** — `s3`–`s7` hold `r1`–`r5` across the `SYSCALL` arg
-  shuffle; `a7` holds the syscall number.
+- **Branch-target scratch (all arches).** `B`/`BEQ`/`BNE`/`BLT`/`CALL`/
+  `TAIL` jump through a dedicated native reg pre-loaded via `LI_BR`:
+  `x17` (ARM IP1) on aarch64, `r11` on amd64, `t5` on riscv64. The reg
+  is caller-saved natively and never carries a live P1 value past the
+  following branch. Treat it as existing only between the `LI_BR` that
+  loads a target and the branch that consumes it.
+- **aarch64** — `x21`–`x23` hold `r1`–`r3` across the `SYSCALL` arg
+  shuffle (`r4`/`r5` live in callee-saved `x26`/`x27` so the kernel
+  preserves them for us). `x16` (ARM IP0) is scratch for `REM`
+  (carries the `SDIV` quotient into the following `MSUB`). `x8` holds
+  the syscall number.
+- **amd64** — `rcx` and the branch-target `r11` are kernel-clobbered by
+  the `syscall` instruction itself. `PROLOGUE`/`EPILOGUE` use `rcx` to
+  carry the retaddr across the `sub rsp, N` (can't use `r11` here — it
+  is the branch-target reg, and `TAIL` = `EPILOGUE` + `jmp r11`).
+  `DIV`/`REM` use `rcx` (to save `rdx` = P1 `r3`) and `r11` (to save
+  `rax` = P1 `r0`) so that `idiv`'s implicit writes to rax/rdx stay
+  invisible; the `r11` save is fine because no branch op can interrupt
+  the DIV/REM expansion.
+- **riscv64** — `s3`,`s6`,`s7` hold `r1`–`r3` across the `SYSCALL` arg
+  shuffle (`r4`/`r5` live in callee-saved `s4`/`s5`, same trick as
+  aarch64). `a7` holds the syscall number.
 
 All of these are off-limits to hand-written P1 programs and are never
 mentioned in P1 source. If you see a register name not in the r0–r7 /
@@ -134,28 +144,26 @@ sp / lr set, it belongs to an op's internal expansion.
 
 P1 has no PC-relative branch immediates (hex2 offers no label-arithmetic
 sigil — branch ranges can't be expressed in hex2 source). Every branch,
-conditional or not, compiles through the **r7-indirect** pattern: the
-caller loads the target into `r7` with a wide `LI`, then the branch op
-jumps through `r7`. A conditional like "jump to `fail` if `r1 != r2`" is
-three source lines:
+conditional or not, compiles through the **LI_BR-indirect** pattern: the
+caller loads the target into the dedicated branch-target scratch reg
+with `LI_BR`, then the branch op jumps through it. A conditional like
+"jump to `fail` if `r1 != r2`" is three source lines:
 
 ```
-P1_LI_R7
+P1_LI_BR
 &fail
-P1_BNE_R1_R2_R7
+P1_BNE_R1_R2
 ```
 
-The `_R7` suffix on every branch op is a reminder that `r7` is the
-branch-target register: the value loaded by the preceding `LI_R7` is
-consumed by the branch, and both lines always appear together. `CALL`
-follows the same shape (`LI_R7 &callee ; P1_CALL`).
+`LI_BR` writes a reserved native reg (`x17`/`r11`/`t5` — see Register
+mapping), not a P1 GPR. The branch op that follows consumes it and
+jumps. `CALL` and `TAIL` follow the same shape
+(`LI_BR &callee ; P1_CALL`).
 
-Because every branch site overwrites `r7`, **`r7` is branch-scratch, not
-a general-purpose callee-saved register** — see the "`r7` is not a
-general-purpose callee-saved register" paragraph above. If you need to
-keep a value live across a branch or call, use `r0`–`r5` (caller-saved,
-lives across un-branched straight-line code) or `r6` (callee-saved,
-lives across calls).
+The branch-target reg is owned by the branch machinery: never carry a
+live value across a branch in it. Since it isn't a P1 reg, this is
+automatic — there's no P1-level way to read or write it outside
+`LI_BR`.
 
 ## Instruction set (~30 ops)
 
@@ -204,13 +212,14 @@ SYSCALL # num in r0, args r1-r6, ret in r0
   with tagged-cell pointers only need signed comparisons. Synthesize
   from `BLT` via operand-bias if unsigned compare is ever required.
   `BGE rA, rB, %L` is not in the ISA: synthesize as
-  `BLT rA, rB, %skip; B %L; :skip` (the r7-indirect branch pattern makes
-  the skip cheap). `BLT rB, rA, %L` handles the strict-greater case.
+  `BLT rA, rB, %skip; B %L; :skip` (the LI_BR-indirect branch pattern
+  makes the skip cheap). `BLT rB, rA, %L` handles the strict-greater
+  case.
 - Branch offsets are PC-relative. In the v0.1 spike they are realized by
-  loading the target address via `LI` into `r7` and jumping through the
-  register; range is therefore unbounded within the 4 GiB address space.
-  Native-encoded branches (with tighter range limits) are an optional
-  future optimization.
+  loading the target address via `LI_BR` into the reserved branch-target
+  reg and jumping through it; range is therefore unbounded within the
+  4 GiB address space. Native-encoded branches (with tighter range
+  limits) are an optional future optimization.
 - `MOV rD, rA` copies `rA` into `rD`. The source may be `sp` (read the
   current stack pointer into a GPR — used e.g. for stack-balance assertions
  around a call tree). The reverse (`MOV sp, rA`) is not provided; `sp`
@@ -246,8 +255,10 @@ SYSCALL # num in r0, args r1-r6, ret in r0
   Per-arch mechanics differ — aarch64/riscv64 `PROLOGUE` subtracts the
   frame size from `sp` and stores `lr`/`ra` at `[sp + 0]`; amd64 pops
   the retaddr native `call` already pushed into a non-P1 scratch
-  (`r11`), subtracts the frame size, then re-pushes it so the final
-  layout matches. Access slots via `MOV rX, sp` followed by
+  (`rcx`), subtracts the frame size, then re-pushes it so the final
+  layout matches. (`rcx` rather than `r11`, because `r11` is the
+  branch-target reg and `TAIL` would otherwise clobber its own
+  destination mid-epilogue.) Access slots via `MOV rX, sp` followed by
   `LD rY, rX, <off>` / `ST rY, rX, <off>`; `sp` itself isn't a valid
   base for `LD`/`ST`.
 - `TAIL %label` is a tail call — it performs the current function's
@@ -312,19 +323,21 @@ values through syscalls without per-arch save/restore dances.
 
 The per-arch expansions:
 
-- **amd64** — P1 args already occupy the native arg regs except for arg6
-  (`r9` vs. P1 `r6`/`rbx`). Expansion is `mov r9, rbx ; syscall`. The
-  kernel preserves everything except `rax`, `rcx`, `r11`; none of `rcx`
-  or `r11` are P1 registers, so the only visible clobber is P1 `r0`.
+- **amd64** — P1 args already occupy the native arg regs except for args
+  4/5/6. Three shuffle moves cover those: `mov r10, r13` (arg4 = P1 `r4`),
+  `mov r8, r14` (arg5 = P1 `r5`), `mov r9, rbx` (arg6 = P1 `r6`); then
+  `syscall`. The kernel preserves everything except `rax`, `rcx`, `r11`,
+  and `rax` = P1 `r0` is the only visible clobber.
 - **aarch64** — native arg regs are `x0`–`x5` but P1 puts args in
-  `x1`–`x5`,`x19` (one register higher). The expansion saves P1 `r1`–
-  `r5` into `x21`–`x25` (reserved for SYSCALL, see Register mapping),
-  shuffles the saved values into `x0`–`x4` plus `x19` into `x5`, moves
-  the number into `x8`, `svc #0`s, then restores `r1`–`r5` from the
-  `x21`–`x25` saves. Net cost: 12 additional moves vs. a bare
-  single-instruction syscall.
-- **riscv64** — same shape as aarch64, with `s3`–`s7` as the save slots
-  and `a7` as the number register.
+  `x1`–`x3`,`x26`,`x27`,`x19` (the three caller-saved arg regs one slot
+  higher, plus three callee-saved for `r4`–`r6`). The expansion saves
+  P1 `r1`–`r3` into `x21`–`x23`, shuffles them and `r4`/`r5`/`r6` down
+  into `x0`–`x5`, moves the number into `x8`, `svc #0`s, then restores
+  `r1`–`r3` from `x21`–`x23`. No save/restore of `r4`/`r5` is needed
+  because they live in callee-saved natives that the kernel preserves.
+- **riscv64** — same shape as aarch64, with `s3`/`s6`/`s7` as the `r1`–
+  `r3` save slots, `s4`/`s5` already holding `r4`/`r5`, and `a7` as the
+  number register.
 
 The extra moves on aarch64/riscv64 are a few nanoseconds per syscall.
 Trading them for uniform "clobbers `r0` only" semantics is worth it:
@@ -429,12 +442,11 @@ caller.
   the inline-data `LI` trick sidesteps them. Order was reversed from
   the original plan: aarch64 first (where the trick was designed), then
   amd64 and riscv64.
-2. **Broaden the demonstrated op set.** Next. Control flow (`B`, `BEQ`,
-   `CALL`, `RET`, `TAIL`), loads/stores (`LD`/`ST`/`LW`/`LB`), and the
-   rest of arithmetic/logical/shift/mul-div. All reachable with stock
-   hex2; no extensions required. A loop-and-branch demo (e.g. print
-   digits 0–9, or sum 1..N) is the natural next program — it forces
-   conditional branching through the indirect-via-r7 pattern.
+2. **Broaden the demonstrated op set.** *Done.* `demo.M1` exercises
+   control flow (`B`, `BEQ`, `BNE`, `BLT`, `CALL`, `RET`, `TAIL`),
+   loads/stores (`LD`/`ST`/`LB`/`SB`), and the full
+   arithmetic/logical/shift/mul-div set across tranches 1–5. All
+   reachable with stock hex2; no extensions required.
 3. **Generator for the ~30-op × register matrix.** *Done.*
    `p1_gen.py` is the single source of truth for all three
   `p1_<arch>.M1` defs files. Each row is an `(op, reg-tuple, imm)`
@@ -444,9 +456,11 @@ caller.
 4. **Cross-arch differential harness.** Assemble each P1 source three
    ways and diff runtime behavior. Currently eyeballed via
    `make run-all`.
-5. **Write something real.** Seed Lisp interpreter (cons, car, cdr, eq,
-   atom, cond, lambda, quote) in ~500 lines of P1, running identically
-   on all three targets.
+5. **Write something real.** *In progress.* `lisp.M1` is the seed Lisp
+   interpreter target (cons, car, cdr, eq, atom, cond, lambda, quote)
+   running identically on all three arches. Step 2 (cons/car/cdr +
+   tagged values) landed; the remaining staged steps live in
+   `LISP.md`.
 
 ## Open questions
 
diff --git a/P1_TODO.md b/P1_TODO.md
@@ -1,251 +0,0 @@
-# P1 — follow-up work from `lisp.M1` step 2
-
-Issues surfaced while bringing up step 2 (tagged values: cons/car/cdr),
-with the decision recorded for each. This is a work list, not a design
-doc; `P1.md` is the source of truth for the ISA itself.
-
-Two of these (issue 1, issue 2) were caught the hard way during step 2:
-the first build passed on aarch64 and failed on amd64/riscv64 because
-`heap_start` happened to land at different offsets from the ELF base on
-each arch, and the second build SIGILL'd because a `P1_LD_R0_R4_0` macro
-was missing from the defs files but M1 silently passed the undefined
-token through as literal text. Both are preventable upstream.
-
-## 1. Undefined `P1_*` tokens pass silently through M1
-
-**Symptom.** In step 2, `lisp.M1` used `P1_LD_R0_R4_0` (offset 0) but the
-defs files only had `P1_LD_R0_R4_8` (offset 8). M1 emitted the literal
-string `P1_LD_R0_R4_0` into the output stream rather than expanding it.
-The resulting binary SIGILL'd on aarch64 and segfaulted on amd64. No
-diagnostic from M1, hex2, or the loader — it took `xxd` + grep to find.
-
-This will keep biting us as we add more `(op, register-tuple)` entries
-to the defs: the combinatorial surface (~1200 DEFINEs per arch) is
-exactly where a missing tuple is easy to miss and hard to notice.
-
-**Options considered.**
-
-1. **Makefile lint.** Before invoking M1, grep the combined source for
-   `P1_[A-Z0-9_]*` tokens and assert every one has a matching `DEFINE`
-   in the per-arch defs file. Fails the build with the offending token
-   name. ~10 lines of shell, runs in milliseconds.
-2. **Patch M1** to error on undefined tokens. Correct fix, but M1 is
-   upstream stage0 — we don't want to fork it.
-3. **Exhaustive DEFINE table.** Generate every `(op, rD, rA, rB)` tuple
-   so no lookup can miss. Waits on issue 3 (generator).
-
-**Decision — option 1: Makefile lint.** Add a `lint` target to
-`lispcc/Makefile` that runs before each arch's assemble step. Keeps the
-toolchain unchanged; catches the bug at the exact layer where it's
-introduced (M1 input); works today with no generator. Option 3 also
-helps but doesn't cover typos (`P1_LD_R0_R4_0` where the author meant
-`P1_LD_R0_R4_8` would still silently miss if both existed and the wrong
-one was spelled).
-
-## 2. Data alignment is implicit and arch-dependent
-
-**Symptom.** `heap_start` is a label into a zero-padded byte blob.
-Nothing in hex2 or M1 forces the label's file offset (and therefore its
-runtime address under the fixed ELF base) to be 8-byte aligned. Because
-the code size between the ELF header and `heap_start` differs per arch,
-the three targets landed on different alignments:
-
-- aarch64: `0x00600560` → low 3 bits = `000` (aligned)
-- amd64: `0x006002CB` → low 3 bits = `011` (misaligned)
-- riscv64: `0x00600604` → low 3 bits = `100` (misaligned)
-
-Pair pointers carry a 3-bit tag in the low bits, so the allocator MUST
-return pointers with `ptr & 7 == 0`. If `heap_start` itself is
-misaligned, every tagged pair is corrupt.
-
-**Options considered.**
-
-1. **Application-level: align at boot + make alignment an allocator
-   contract.** `_start` rounds `heap_next` up with `(x | 7) + 1` before
-   any allocation. `alloc()` is documented to require 8-aligned sizes
-   and to return 8-aligned pointers. Document both in `P1.md`. This is
-   what step 2 already does.
-2. **hex2-level alignment directive.** Add something like `.align 8` to
-   hex2 so labels can be padded at assembly time.
-
-Investigated hex2 capabilities (both the High Level Prototype and the
-actual bootstrap AMD64 `hex2_AMD64.M1` in stage0-posix). hex2 supports
-only the pointer sigils `: ! @ $ ~ % &` and the `>` base override — no
-alignment directive, no label arithmetic. Option 2 would require forking
-hex2, which is off the table for the same reason we don't fork M1.
-
-**Decision — option 1: doc + allocator contract.** Add a §"Data
-alignment" subsection to `P1.md`:
-
-- Labels have no inherent alignment. The offset of any label from the
-  ELF base is the cumulative code/data size before it, which varies per
-  arch.
-- Programs that care about alignment (tagged pointers, cache lines,
-  atomic fields — the first being immediate for us) must align at boot
-  with `(x | mask) + 1`.
-- Any allocator that returns tagged-pointer-eligible cells MUST return
-  pointers aligned to at least the tag width (8 bytes, low-3-bit tag).
-- Caller must pass tag-width-aligned sizes to bump allocators for now.
-  When the real allocator lands in step 9, it rounds internally.
-
-## 3. The ~1200-entry DEFINE generator is now a blocker
-
-**Symptom.** Step 2 added ~13 new `(op, register-tuple)` entries per
-arch by hand. Step 3 (strings + symbol interning) and step 4 (reader)
-will need many more — load/store variants across r0–r7 × base-register
-× common offsets, more arithmetic tuples, more branches. Hand-writing
-each tuple × 3 arches is already where most of step 2's wall-clock
-went.
-
-**Options considered.**
-
-1. **Build generator before step 3.** Design a shared op table
-   (`op, rD, rA, rB` plus a template for each arch's native encoding),
-   emit `p1_<arch>.M1` from it. Pause Lisp work until done.
-2. **Defer until step 4 or 5.** Hand-add entries as each step needs
-   them; accept the per-step friction.
-3. **Lazy defs.** Only enumerate tuples actually referenced by the
-   program, via a source-scan + per-arch hand-encoder. More complex
-   than full generation.
-
-**Decision — option 1: build before step 3.** Deferring pushes the
-problem into a worse spot (each new step adds friction, and the defs
-files diverge across arches in hard-to-review ways). Design the op
-table to be appendable — tranches 1–8 already map cleanly onto rows of
-`(op, rD, rA, rB)` plus per-arch encoding templates. The generator
-unblocks step 3 and is a prerequisite for the ~1200-entry target in
-`P1.md`.
-
-This adds one item to the staged plan between steps 2 and 3.
-
-## 4. Root spill across `CALL` needs more than one scratch slot
-
-**Symptom.** `cons(car, cdr)` must keep both args live across the
-`alloc(16)` call. The P1 ABI gives each frame exactly one callee-private
-scratch cell at `[sp+8]`, which fits one value. Step 2 worked around
-this by parking `cdr` in a BSS cell (`:cons_save_cdr`), a single-threaded
-hack that breaks under the step-9 GC (which needs every live tagged
-value to be on the stack so the mark phase can find it).
-
-The same pattern will recur everywhere: anything multi-arg that calls
-into the allocator (pair constructors, list builders, environment frame
-builders, closure constructors) needs per-function root-spill slots
-proportional to its live set, not a fixed one.
-
-**Options considered.**
-
-1. **Expand `PROLOGUE` frame + per-function spill-slot count.** Bump
-   standard frame from 16 B to N B, where each function declares how
-   many callee-private cells it needs at `[sp+8]`, `[sp+16]`,
-   `[sp+24]`, etc. `PROLOGUE`/`EPILOGUE` take the slot count as an
-   immediate. Update all three arch defs files.
-2. **Keep one scratch slot, use BSS for overflow.** What step 2 does.
-   Breaks under GC, doesn't scale past single-threaded.
-3. **Caller-managed save area.** Caller subtracts from `sp` itself,
-   spills, calls, restores. Works but breaks the "`sp` only mutated by
-   `PROLOGUE`/`EPILOGUE`" invariant from `P1.md`.
-
-**Decision — option 1: expand PROLOGUE frame + spill-slot count.**
-Aligns with GC discipline (all roots on stack, walkable), keeps `sp`
-mutation localized to PROLOGUE/EPILOGUE, and scales with live-set size.
-Specifics to pin down when implementing:
-
-- Encoding of the slot count. Probably a suffix on the op name
-  (`PROLOGUE_N2`, `PROLOGUE_N3`, …) so each variant is a distinct
-  `DEFINE` and the generator can enumerate them the same way it
-  enumerates register tuples. Cap at some N (4? 8?) and grow if a
-  function ever needs more.
-- Frame size must stay 16-aligned on aarch64 — round up.
-- The existing single-slot `PROLOGUE`/`EPILOGUE` stay as aliases for
-  the N=1 case.
-
-Replace `cons_save_cdr` in `lisp.M1` with a second frame slot once the
-N-slot prologue lands.
-
-## 5. r7-indirect branch pattern is verbose but fine
-
-**Symptom.** Every branch or call compiles through three source lines:
-
-```
-P1_LI_R7
-&target
-P1_BNE_R1_R2_R7
-```
-
-`lisp.M1` has 8 such sites and will grow roughly linearly with the
-interpreter's control flow. Readers unfamiliar with P1 don't immediately
-see this is "branch if r1 != r2 to `target`."
-
-A conditional in a high-level language:
-
-```
-if x == 0 goto fail;
-```
-
-lands in P1 as:
-
-```
-P1_LI_R7
-&fail
-P1_BEQ_R1_R0_R7 ## (after loading 0 into r0)
-```
-
-**Options considered.**
-
-1. **Document the pattern only.** Add a §"Reading P1 source" subsection
-   to `P1.md` that explains: every branch is
-   `LI_R7 &target ; BXX_rA_rB_R7`; r7 is branch-scratch owned by this
-   pattern; target load + branch are always together. Once seen, it
-   reads fluently.
-2. **Introduce `*_TO` pseudos.** E.g. `P1_BNE_R1_R2_TO &target` expands
-   to the 3-op sequence. Cosmetic — saves one source line per site —
-   but it ripples through the DEFINE table (each pseudo needs per-arch
-   encoding that emits a forward-jump-around-a-literal-pool or similar)
-   and hides the r7 clobber, which is load-bearing.
-3. **Change the ISA to have a 32-bit branch-target field.** A real
-   native-encoded conditional branch would save the literal and the
-   `LI_R7`. Breaks hex2's "no label arithmetic" contract (branch
-   offsets are label_A - label_B), and hex2 is not forkable.
-
-**Decision — option 1: document the pattern only.** `*_TO` pseudos
-would require the generator to emit paired (branch, literal-pool-entry)
-per call site, and hiding the r7 clobber invites bugs when someone
-assumes r7 survives a branch. The pattern is learnable; the workaround
-for hex2's missing branch-offset support is exactly what `P1.md` calls
-out ("unbounded range within 4 GiB at the cost of one r7 clobber per
-branch"). Documenting is sufficient.
-
-Add §"Reading P1 source" under the existing §"Register mapping" in
-`P1.md`: one paragraph on the pattern, one on the r7 clobber rule (some
-of this already lives in the "r7 is not a general-purpose callee-saved
-register" paragraph, which should be cross-referenced).
-
-## Ordering
-
-All five land before step 3. Reasons:
-
-- **Issue 4 blocks step 3 directly.** Strings and symbol interning are
-  allocator-calling constructors — same shape as `cons`, more live
-  values per frame. Deferring issue 4 means reproducing the
-  `cons_save_cdr` BSS hack across every step-3 constructor, then
-  ripping them all out for step 9 (GC). Cheaper to land N-slot
-  PROLOGUE once.
-- **Issue 4 must land with issue 3, not after.** If the generator
-  emits the defs table first and N-slot PROLOGUE lands later, the
-  generator runs twice and the defs files churn. Do them together.
-- **Issues 1, 2, 5 are cheap and compounding.** Makefile lint (1)
-  prevents the exact class of bug that ate an hour in step 2 — every
-  step 3 tuple we add is another chance to mistype. Alignment doc (2)
-  and r7-pattern doc (5) are one sitting each and make step 3 review
-  easier for anyone but the author.
-
-Suggested order within the pre-step-3 block:
-
-1. Issue 1 (Makefile lint) — ~10 lines of shell, unblocks confident
-   iteration on the rest.
-2. Issue 5 (`P1.md` §"Reading P1 source") and issue 2
-   (`P1.md` §"Data alignment") — doc-only, batch into one commit.
-3. Issue 3 (generator) + issue 4 (N-slot PROLOGUE) — together.
-   Design the op table so PROLOGUE/EPILOGUE variants are rows like any
-   other `(op, reg-tuple)`. Rip out `cons_save_cdr` from `lisp.M1`
-   as the validation case.
diff --git a/demo.M1 b/demo.M1
@@ -22,7 +22,7 @@
 ##   synthesize from LD + mask/shift if needed. Signed imm12 on
 ##   LD/ST is exercised via a NEG8-offset round-trip through
 ##   scratch_mid, mirroring the ADDI NEG1/NEG3 signed-imm test.
-## Tranche 4: r7-indirect branches. B, BEQ, BNE, BLT each tested
+## Tranche 4: LI_BR-indirect branches. B, BEQ, BNE, BLT each tested
 ##   in BOTH taken and fall-through directions so an inverted
 ##   condition encoding doesn't pass. BLT additionally tested with
 ##   a negative operand (-1 < 0 signed) to prove signed semantics.
@@ -31,10 +31,10 @@
 ##   stresses PROLOGUE lr-save; TAIL unwinds the frame. Stack
 ##   balance is additionally verified by snapshotting sp via
 ##   MOV_R6_SP before tranche 5 and comparing after — a TAIL that
-##   omits its epilogue leaks fn_parent_tail's 16-byte frame and
-##   the delta folds into the accumulator. On amd64 PROLOGUE/
-##   EPILOGUE are NOPs so the delta is always 0; the check bites
-##   on aarch64/riscv64 where prologue does real sp work.
+##   omits its epilogue (or whose epilogue clobbers the branch
+##   target) leaks fn_parent_tail's 16-byte frame and the delta
+##   folds into the accumulator. All three arches now run real
+##   sp-moving PROLOGUE/EPILOGUE, so the check bites everywhere.
 ##
 ## Run-and-verify:
 ##   make PROG=demo ARCH=<arch> run && echo "exit=$?"
@@ -160,7 +160,7 @@
 '01000000'        # r3 = 1
 
 ## B — unconditional. Correct: jump to b4_1_ok, skipping clobber.
-P1_LI_R7
+P1_LI_BR
 &b4_1_ok
 P1_B
 P1_LI_R1
@@ -170,9 +170,9 @@
 ## BEQ taken: r3=0 so r2==r3.
 P1_LI_R3
 '00000000'
-P1_LI_R7
+P1_LI_BR
 &b4_2_ok
-P1_BEQ_R2_R3_R7   # 0 == 0, taken
+P1_BEQ_R2_R3      # 0 == 0, taken
 P1_LI_R1
 '00000000'
 :b4_2_ok
@@ -181,10 +181,10 @@
 
 ## BEQ fall-through: 0 != 1, branch must NOT fire. If it
 ## (incorrectly) fires we jump to b4_3_bad and clobber r1.
-P1_LI_R7
+P1_LI_BR
 &b4_3_bad
-P1_BEQ_R2_R3_R7   # 0 == 1? no, fall through
-P1_LI_R7
+P1_BEQ_R2_R3      # 0 == 1? no, fall through
+P1_LI_BR
 &b4_3_ok
 P1_B
 :b4_3_bad
@@ -193,9 +193,9 @@
 :b4_3_ok
 
 ## BNE taken: 0 != 1.
-P1_LI_R7
+P1_LI_BR
 &b4_4_ok
-P1_BNE_R2_R3_R7   # 0 != 1, taken
+P1_BNE_R2_R3      # 0 != 1, taken
 P1_LI_R1
 '00000000'
 :b4_4_ok
@@ -203,10 +203,10 @@
 ## BNE fall-through: r3=0 so r2==r3; branch must NOT fire.
 P1_LI_R3
 '00000000'
-P1_LI_R7
+P1_LI_BR
 &b4_5_bad
-P1_BNE_R2_R3_R7   # 0 != 0? no, fall through
-P1_LI_R7
+P1_BNE_R2_R3      # 0 != 0? no, fall through
+P1_LI_BR
 &b4_5_ok
 P1_B
 :b4_5_bad
@@ -217,9 +217,9 @@
 '01000000'        # restore r3 = 1
 
 ## BLT taken: 0 < 1 (signed).
-P1_LI_R7
+P1_LI_BR
 &b4_6_ok
-P1_BLT_R2_R3_R7   # 0 < 1, taken
+P1_BLT_R2_R3      # 0 < 1, taken
 P1_LI_R1
 '00000000'
 :b4_6_ok
@@ -229,10 +229,10 @@
 '01000000'        # r2 = 1
 P1_LI_R3
 '00000000'        # r3 = 0
-P1_LI_R7
+P1_LI_BR
 &b4_7_bad
-P1_BLT_R2_R3_R7   # 1 < 0? no, fall through
-P1_LI_R7
+P1_BLT_R2_R3      # 1 < 0? no, fall through
+P1_LI_BR
 &b4_7_ok
 P1_B
 :b4_7_bad
@@ -248,9 +248,9 @@
 P1_LI_R4
 '00000000'
 P1_ADDI_R4_R4_NEG1  # r4 = -1 (sign-extended)
-P1_LI_R7
+P1_LI_BR
 &b4_8_ok
-P1_BLT_R4_R2_R7   # -1 < 0 (signed)? yes, taken
+P1_BLT_R4_R2      # -1 < 0 (signed)? yes, taken
 P1_LI_R1
 '00000000'
 :b4_8_ok
@@ -269,15 +269,15 @@
 ## into the accumulator below via SUB r1, r1, delta.
 P1_MOV_R6_SP      # r6 = sp snapshot (pre-tranche)
 
-P1_LI_R7
+P1_LI_BR
 &fn_identity
 P1_CALL           # nested-CALL test: returns r1 unchanged
 
-P1_LI_R7
+P1_LI_BR
 &fn_parent_tail
 P1_CALL           # TAIL test: fn_identity RETs to here
 
-P1_LI_R7
+P1_LI_BR
 &b5_end
 P1_B              # skip over the inlined function bodies
 
@@ -288,7 +288,7 @@
 :fn_identity
 P1_PROLOGUE
-P1_LI_R7
+P1_LI_BR
 &fn_inner
 P1_CALL
 P1_EPILOGUE
@@ -296,7 +296,7 @@
 :fn_parent_tail
 P1_PROLOGUE
-P1_LI_R7
+P1_LI_BR
 &fn_identity
 P1_TAIL
diff --git a/lisp.M1 b/lisp.M1
@@ -64,7 +64,7 @@
 ## cons(r1=fixnum(42), r2=nil=0x07).
 P1_LI_R2
 '07000000'        ## r2 = nil
-P1_LI_R7
+P1_LI_BR
 &cons
 P1_CALL           ## r0 = pair pointer|2
 P1_MOV_R6_R0      ## r6 = pair (callee-saved across calls)
@@ -74,24 +74,24 @@
 P1_ANDI_R1_R1_7
 P1_LI_R2
 '02000000'
-P1_LI_R7
+P1_LI_BR
 &not_pair_err
-P1_BNE_R1_R2_R7
+P1_BNE_R1_R2
 
 ## Check cdr(pair) == nil.
 P1_MOV_R1_R6
-P1_LI_R7
+P1_LI_BR
 &cdr
 P1_CALL           ## r0 = cdr
 P1_LI_R2
 '07000000'
-P1_LI_R7
+P1_LI_BR
 &not_nil_err
-P1_BNE_R0_R2_R7
+P1_BNE_R0_R2
 
 ## x = car(pair); expect tag == fixnum.
 P1_MOV_R1_R6
-P1_LI_R7
+P1_LI_BR
 &car
 P1_CALL           ## r0 = car
 P1_MOV_R6_R0      ## r6 = car (callee-saved)
@@ -100,9 +100,9 @@
 P1_ANDI_R1_R1_7
 P1_LI_R2
 '01000000'
-P1_LI_R7
+P1_LI_BR
 &not_fixnum_err
-P1_BNE_R1_R2_R7
+P1_BNE_R1_R2
 
 ## exit(car >> 3) — yields 42.
P1_MOV_R1_R6 @@ -120,27 +120,29 @@ ## (ptr | PAIR_TAG). Uses the N=2 prologue — car lives in frame slot 1 ## at [sp+8], cdr in slot 2 at [sp+16] — to shepherd both args across ## alloc(). Both roots stay on the walkable stack, which is the -## discipline the step-9 GC will depend on (P1_TODO.md issue 4). +## discipline the step-9 mark-sweep GC will depend on. :cons P1_PROLOGUE_N2 - ## Spill car to slot 1 [sp+8], cdr to slot 2 [sp+16]. - P1_MOV_R4_SP - P1_ST_R1_R4_8 - P1_ST_R2_R4_16 + ## Spill car to slot 1 [sp+8], cdr to slot 2 [sp+16]. r3 is caller- + ## saved so we can clobber it freely as an sp-temp for the ST/LD + ## displacements. + P1_MOV_R3_SP + P1_ST_R1_R3_8 + P1_ST_R2_R3_16 ## alloc(16). 16 is already 8-byte aligned. P1_LI_R1 '10000000' - P1_LI_R7 + P1_LI_BR &alloc P1_CALL ## r0 = raw (untagged) pair ptr ## [ptr+0] = car ; [ptr+8] = cdr. - P1_MOV_R4_SP - P1_LD_R1_R4_8 ## r1 = car + P1_MOV_R3_SP + P1_LD_R1_R3_8 ## r1 = car P1_ST_R1_R0_0 - P1_LD_R2_R4_16 ## r2 = cdr + P1_LD_R2_R3_16 ## r2 = cdr P1_ST_R2_R0_8 ## Tag as a pair and return. @@ -170,26 +172,29 @@ ## Bump allocator over [heap_start, heap_end). Caller must pass an ## 8-byte-aligned size (caller-rounds for now). On overflow branches to ## :alloc_oom, which funnels through `error`. +## +## Uses only r0-r3 (caller-saved) so it needn't preserve any callee- +## saved regs across its own body. Keeps the PROLOGUE/EPILOGUE just for +## the lr save — the N=0 frame has no slots. 
:alloc P1_PROLOGUE - P1_LI_R4 + P1_LI_R3 &heap_next - P1_LD_R3_R4_0 ## r3 = *heap_next (old bump ptr) + P1_LD_R0_R3_0 ## r0 = *heap_next (old bump ptr; also return value) - P1_ADD_R2_R3_R1 ## r2 = new_next = old + size + P1_ADD_R2_R0_R1 ## r2 = new_next = old + size - P1_LI_R5 + P1_LI_R1 &heap_end - P1_LD_R0_R5_0 ## r0 = *heap_end + P1_LD_R1_R1_0 ## r1 = *heap_end ## heap_end < new_next -> OOM - P1_LI_R7 + P1_LI_BR &alloc_oom - P1_BLT_R0_R2_R7 + P1_BLT_R1_R2 - P1_ST_R2_R4_0 ## *heap_next = new_next - P1_MOV_R0_R3 ## return old bump ptr + P1_ST_R2_R3_0 ## *heap_next = new_next P1_EPILOGUE P1_RET @@ -197,7 +202,7 @@ ## ---- error(msg_ptr, msg_len) ---------------------------------------- ## Writes "error: " + msg + "\n" to fd 2, then exit(1). Never returns. -## Callers reach this via LI_R7 &error ; P1_B (branch, not call) and +## Callers reach this via LI_BR &error ; P1_B (branch, not call) and ## arrange r1 = msg_ptr, r2 = msg_len beforehand. :error ## Spill msg_ptr/msg_len into callee-saved regs across the write @@ -247,14 +252,14 @@ ## ---- Error landing pads --------------------------------------------- ## Plain branch targets (no PROLOGUE, no return). Load (msg_ptr, msg_len) -## into r1/r2, then r7-indirect branch into `error`. Keep message-length -## imm32s in sync with the strings below. +## into r1/r2, then LI_BR-indirect branch into `error`. Keep message- +## length imm32s in sync with the strings below. 
:alloc_oom P1_LI_R1 &msg_oom P1_LI_R2 '0E000000' ## strlen("heap exhausted") = 14 - P1_LI_R7 + P1_LI_BR &error P1_B @@ -263,7 +268,7 @@ &msg_not_pair P1_LI_R2 '0D000000' ## strlen("expected pair") = 13 - P1_LI_R7 + P1_LI_BR &error P1_B @@ -272,7 +277,7 @@ &msg_not_nil P1_LI_R2 '0C000000' ## strlen("expected nil") = 12 - P1_LI_R7 + P1_LI_BR &error P1_B @@ -281,7 +286,7 @@ &msg_not_fixnum P1_LI_R2 '0F000000' ## strlen("expected fixnum") = 15 - P1_LI_R7 + P1_LI_BR &error P1_B diff --git a/p1_aarch64.M1 b/p1_aarch64.M1 @@ -10,13 +10,14 @@ DEFINE P1_LI_R0 4000001802000014 DEFINE P1_LI_R1 4100001802000014 DEFINE P1_LI_R2 4200001802000014 DEFINE P1_LI_R3 4300001802000014 -DEFINE P1_LI_R4 4400001802000014 -DEFINE P1_LI_R5 4500001802000014 +DEFINE P1_LI_R4 5A00001802000014 +DEFINE P1_LI_R5 5B00001802000014 DEFINE P1_LI_R6 5300001802000014 DEFINE P1_LI_R7 5400001802000014 +DEFINE P1_LI_BR 5100001802000014 ## ---- SYSCALL / SYSOPEN — uniform (clobbers r0 only) across arches -DEFINE P1_SYSCALL E80300AAF50301AAF60302AAF70303AAF80304AAF90305AAE00315AAE10316AAE20317AAE30318AAE40319AAE50313AA010000D4E10315AAE20316AAE30317AAE40318AAE50319AA +DEFINE P1_SYSCALL E80300AAF50301AAF60302AAF70303AAE00315AAE10316AAE20317AAE3031AAAE4031BAAE50313AA010000D4E10315AAE20316AAE30317AA DEFINE P1_SYSOPEN 600C8092080780D2010000D4 ## ---- Linux syscall numbers (per-arch table). LE-32 immediate operands for LI. 
@@ -27,31 +28,31 @@ DEFINE SYS_CLOSE 39000000 ## ---- Reg-reg-reg arithmetic (tranche 1) -------------------------- DEFINE P1_ADD_R1_R1_R2 2100028B -DEFINE P1_ADD_R1_R1_R4 2100048B +DEFINE P1_ADD_R1_R1_R4 21001A8B DEFINE P1_ADD_R2_R2_R6 4200138B DEFINE P1_ADD_R2_R3_R1 6200018B DEFINE P1_SUB_R1_R1_R2 210002CB DEFINE P1_SUB_R2_R2_R6 420013CB -DEFINE P1_AND_R1_R1_R5 2100058A +DEFINE P1_AND_R1_R1_R5 21001B8A DEFINE P1_OR_R1_R1_R2 210002AA DEFINE P1_XOR_R1_R1_R2 210002CA DEFINE P1_MUL_R1_R1_R2 217C029B DEFINE P1_DIV_R1_R1_R2 210CC29A -DEFINE P1_REM_R1_R1_R5 300CC59A0106059B +DEFINE P1_REM_R1_R1_R5 300CDB9A01061B9B DEFINE P1_SHL_R1_R1_R2 2120C29A DEFINE P1_SHR_R1_R1_R2 2124C29A -DEFINE P1_SAR_R4_R4_R2 8428C29A +DEFINE P1_SAR_R4_R4_R2 5A2BC29A ## ---- Immediate arithmetic (tranche 2) ---------------------------- DEFINE P1_ADDI_R1_R1_3 210C0091 DEFINE P1_ADDI_R1_R1_1 21040091 DEFINE P1_ADDI_R1_R1_NEG3 210C00D1 -DEFINE P1_ADDI_R4_R4_NEG1 840400D1 +DEFINE P1_ADDI_R4_R4_NEG1 5A0700D1 DEFINE P1_ADDI_R1_R1_NEG2 210800D1 DEFINE P1_ADDI_R0_R0_1 00040091 DEFINE P1_SHLI_R1_R1_1 21F87FD3 DEFINE P1_SHRI_R1_R1_1 21FC41D3 -DEFINE P1_SARI_R4_R4_1 84FC4193 +DEFINE P1_SARI_R4_R4_1 5AFF4193 DEFINE P1_ANDI_R1_R1_6 21047F92 DEFINE P1_ANDI_R1_R1_7 21084092 DEFINE P1_ORI_R1_R1_1 210040B2 @@ -59,38 +60,38 @@ DEFINE P1_ORI_R0_R0_2 00007FB2 DEFINE P1_ORI_R0_R0_7 000840B2 ## ---- LA + memory ops (tranche 3) --------------------------------- -DEFINE P1_LA_R4 4400001802000014 -DEFINE P1_ST_R1_R4_0 810000F9 -DEFINE P1_LD_R1_R4_0 810040F9 -DEFINE P1_ST_R1_R4_8 810400F9 -DEFINE P1_LD_R1_R4_8 810440F9 -DEFINE P1_SB_R1_R4_16 81400039 -DEFINE P1_LB_R1_R4_16 81404039 -DEFINE P1_ST_R1_R4_NEG8 81801FF8 -DEFINE P1_LD_R1_R4_NEG8 81805FF8 +DEFINE P1_LA_R4 5A00001802000014 +DEFINE P1_ST_R1_R4_0 410300F9 +DEFINE P1_LD_R1_R4_0 410340F9 +DEFINE P1_ST_R1_R4_8 410700F9 +DEFINE P1_LD_R1_R4_8 410740F9 +DEFINE P1_SB_R1_R4_16 41430039 +DEFINE P1_LB_R1_R4_16 41434039 +DEFINE P1_ST_R1_R4_NEG8 41831FF8 +DEFINE 
P1_LD_R1_R4_NEG8 41835FF8 -## ---- Branches (tranche 4, r7-indirect) --------------------------- -DEFINE P1_B 80021FD6 -DEFINE P1_BEQ_R2_R3_R7 5F0003EB4100005480021FD6 -DEFINE P1_BNE_R2_R3_R7 5F0003EB4000005480021FD6 -DEFINE P1_BLT_R2_R3_R7 5F0003EB4A00005480021FD6 -DEFINE P1_BLT_R4_R2_R7 9F0002EB4A00005480021FD6 +## ---- Branches (tranche 4, LI_BR-indirect) ------------------------ +DEFINE P1_B 20021FD6 +DEFINE P1_BEQ_R2_R3 5F0003EB4100005420021FD6 +DEFINE P1_BNE_R2_R3 5F0003EB4000005420021FD6 +DEFINE P1_BLT_R2_R3 5F0003EB4A00005420021FD6 +DEFINE P1_BLT_R4_R2 5F0302EB4A00005420021FD6 ## ---- Control: CALL/RET + single-slot and N-slot PROLOGUE/EPILOGUE/TAIL DEFINE P1_PROLOGUE FF4300D1FE0300F9 DEFINE P1_EPILOGUE FE0340F9FF430091 DEFINE P1_RET C0035FD6 -DEFINE P1_CALL 80023FD6 -DEFINE P1_TAIL FE0340F9FF43009180021FD6 +DEFINE P1_CALL 20023FD6 +DEFINE P1_TAIL FE0340F9FF43009120021FD6 DEFINE P1_PROLOGUE_N2 FF8300D1FE0300F9 DEFINE P1_EPILOGUE_N2 FE0340F9FF830091 -DEFINE P1_TAIL_N2 FE0340F9FF83009180021FD6 +DEFINE P1_TAIL_N2 FE0340F9FF83009120021FD6 DEFINE P1_PROLOGUE_N3 FF8300D1FE0300F9 DEFINE P1_EPILOGUE_N3 FE0340F9FF830091 -DEFINE P1_TAIL_N3 FE0340F9FF83009180021FD6 +DEFINE P1_TAIL_N3 FE0340F9FF83009120021FD6 DEFINE P1_PROLOGUE_N4 FFC300D1FE0300F9 DEFINE P1_EPILOGUE_N4 FE0340F9FFC30091 -DEFINE P1_TAIL_N4 FE0340F9FFC3009180021FD6 +DEFINE P1_TAIL_N4 FE0340F9FFC3009120021FD6 ## ---- Seed-Lisp step 1 extensions (tranche 6) --------------------- DEFINE P1_MOV_R1_R6 E10313AA @@ -102,28 +103,38 @@ DEFINE P1_MOV_R7_R2 F40302AA DEFINE P1_MOV_R2_R6 E20313AA DEFINE P1_MOV_R3_R7 E30314AA DEFINE P1_MOV_R2_R7 E20314AA -DEFINE P1_MOV_R4_R7 E40314AA +DEFINE P1_MOV_R4_R7 FA0314AA DEFINE P1_MOV_R2_SP E2030091 -DEFINE P1_MOV_R4_SP E4030091 +DEFINE P1_MOV_R3_SP E3030091 +DEFINE P1_MOV_R4_SP FA030091 DEFINE P1_MOV_R6_SP F3030091 DEFINE P1_MOV_R2_R0 E20300AA DEFINE P1_LD_R0_R6_0 600240F9 DEFINE P1_LD_R1_R6_16 610A40F9 -DEFINE P1_LD_R3_R4_0 830040F9 -DEFINE P1_LD_R0_R5_0 A00040F9 -DEFINE 
P1_LB_R1_R4_0 81004039 -DEFINE P1_ST_R2_R4_0 820000F9 -DEFINE P1_ST_R0_R4_8 800400F9 -DEFINE P1_LD_R0_R4_8 800440F9 +DEFINE P1_LD_R3_R4_0 430340F9 +DEFINE P1_LD_R0_R5_0 600340F9 +DEFINE P1_LB_R1_R4_0 41034039 +DEFINE P1_ST_R2_R4_0 420300F9 +DEFINE P1_ST_R0_R4_8 400700F9 +DEFINE P1_LD_R0_R4_8 400740F9 DEFINE P1_LB_R1_R0_0 01004039 DEFINE P1_LD_R0_R1_0 200040F9 DEFINE P1_LD_R0_R1_8 200440F9 DEFINE P1_ST_R1_R0_0 010000F9 -DEFINE P1_LD_R2_R4_0 820040F9 +DEFINE P1_LD_R2_R4_0 420340F9 DEFINE P1_ST_R2_R0_8 020400F9 -DEFINE P1_LD_R0_R4_0 800040F9 -DEFINE P1_ST_R2_R4_16 820800F9 -DEFINE P1_LD_R2_R4_16 820840F9 -DEFINE P1_BLT_R0_R2_R7 1F0002EB4A00005480021FD6 -DEFINE P1_BNE_R1_R2_R7 3F0002EB4000005480021FD6 -DEFINE P1_BNE_R0_R2_R7 1F0002EB4000005480021FD6 +DEFINE P1_LD_R0_R4_0 400340F9 +DEFINE P1_ST_R2_R4_16 420B00F9 +DEFINE P1_LD_R2_R4_16 420B40F9 +DEFINE P1_LD_R0_R3_0 600040F9 +DEFINE P1_ST_R2_R3_0 620000F9 +DEFINE P1_ST_R1_R3_8 610400F9 +DEFINE P1_LD_R1_R3_8 610440F9 +DEFINE P1_ST_R2_R3_16 620800F9 +DEFINE P1_LD_R2_R3_16 620840F9 +DEFINE P1_LD_R1_R1_0 210040F9 +DEFINE P1_ADD_R2_R0_R1 0200018B +DEFINE P1_BLT_R0_R2 1F0002EB4A00005420021FD6 +DEFINE P1_BLT_R1_R2 3F0002EB4A00005420021FD6 +DEFINE P1_BNE_R1_R2 3F0002EB4000005420021FD6 +DEFINE P1_BNE_R0_R2 1F0002EB4000005420021FD6 diff --git a/p1_amd64.M1 b/p1_amd64.M1 @@ -10,13 +10,14 @@ DEFINE P1_LI_R0 B8 DEFINE P1_LI_R1 BF DEFINE P1_LI_R2 BE DEFINE P1_LI_R3 BA -DEFINE P1_LI_R4 41BA -DEFINE P1_LI_R5 41B8 +DEFINE P1_LI_R4 41BD +DEFINE P1_LI_R5 41BE DEFINE P1_LI_R6 BB DEFINE P1_LI_R7 41BC +DEFINE P1_LI_BR 41BB ## ---- SYSCALL / SYSOPEN — uniform (clobbers r0 only) across arches -DEFINE P1_SYSCALL 4989D90F05 +DEFINE P1_SYSCALL 4D89EA4D89F04989D90F05 DEFINE P1_SYSOPEN B8020000000F05 ## ---- Linux syscall numbers (per-arch table). LE-32 immediate operands for LI. 
@@ -27,31 +28,31 @@ DEFINE SYS_CLOSE 03000000 ## ---- Reg-reg-reg arithmetic (tranche 1) -------------------------- DEFINE P1_ADD_R1_R1_R2 4889FF4801F7 -DEFINE P1_ADD_R1_R1_R4 4889FF4C01D7 +DEFINE P1_ADD_R1_R1_R4 4889FF4C01EF DEFINE P1_ADD_R2_R2_R6 4889F64801DE DEFINE P1_ADD_R2_R3_R1 4889D64801FE DEFINE P1_SUB_R1_R1_R2 4889FF4829F7 DEFINE P1_SUB_R2_R2_R6 4889F64829DE -DEFINE P1_AND_R1_R1_R5 4889FF4C21C7 +DEFINE P1_AND_R1_R1_R5 4889FF4C21F7 DEFINE P1_OR_R1_R1_R2 4889FF4809F7 DEFINE P1_XOR_R1_R1_R2 4889FF4831F7 DEFINE P1_MUL_R1_R1_R2 4889FF480FAFFE DEFINE P1_DIV_R1_R1_R2 4989C34889D14889F8489948F7FE4889C74889CA4C89D8 -DEFINE P1_REM_R1_R1_R5 4989C34889D14889F8489949F7F84889D74889CA4C89D8 +DEFINE P1_REM_R1_R1_R5 4989C34889D14889F8489949F7FE4889D74889CA4C89D8 DEFINE P1_SHL_R1_R1_R2 4889FF4889F148D3E7 DEFINE P1_SHR_R1_R1_R2 4889FF4889F148D3EF -DEFINE P1_SAR_R4_R4_R2 4D89D24889F149D3FA +DEFINE P1_SAR_R4_R4_R2 4D89ED4889F149D3FD ## ---- Immediate arithmetic (tranche 2) ---------------------------- DEFINE P1_ADDI_R1_R1_3 4889FF4883C703 DEFINE P1_ADDI_R1_R1_1 4889FF4883C701 DEFINE P1_ADDI_R1_R1_NEG3 4889FF4883C7FD -DEFINE P1_ADDI_R4_R4_NEG1 4D89D24983C2FF +DEFINE P1_ADDI_R4_R4_NEG1 4D89ED4983C5FF DEFINE P1_ADDI_R1_R1_NEG2 4889FF4883C7FE DEFINE P1_ADDI_R0_R0_1 4889C04883C001 DEFINE P1_SHLI_R1_R1_1 4889FF48C1E701 DEFINE P1_SHRI_R1_R1_1 4889FF48C1EF01 -DEFINE P1_SARI_R4_R4_1 4D89D249C1FA01 +DEFINE P1_SARI_R4_R4_1 4D89ED49C1FD01 DEFINE P1_ANDI_R1_R1_6 4889FF4883E706 DEFINE P1_ANDI_R1_R1_7 4889FF4883E707 DEFINE P1_ORI_R1_R1_1 4889FF4883CF01 @@ -59,38 +60,38 @@ DEFINE P1_ORI_R0_R0_2 4889C04883C802 DEFINE P1_ORI_R0_R0_7 4889C04883C807 ## ---- LA + memory ops (tranche 3) --------------------------------- -DEFINE P1_LA_R4 41BA -DEFINE P1_ST_R1_R4_0 49897A00 -DEFINE P1_LD_R1_R4_0 498B7A00 -DEFINE P1_ST_R1_R4_8 49897A08 -DEFINE P1_LD_R1_R4_8 498B7A08 -DEFINE P1_SB_R1_R4_16 49887A10 -DEFINE P1_LB_R1_R4_16 490FB67A10 -DEFINE P1_ST_R1_R4_NEG8 49897AF8 -DEFINE P1_LD_R1_R4_NEG8 498B7AF8 
+DEFINE P1_LA_R4 41BD +DEFINE P1_ST_R1_R4_0 49897D00 +DEFINE P1_LD_R1_R4_0 498B7D00 +DEFINE P1_ST_R1_R4_8 49897D08 +DEFINE P1_LD_R1_R4_8 498B7D08 +DEFINE P1_SB_R1_R4_16 49887D10 +DEFINE P1_LB_R1_R4_16 490FB67D10 +DEFINE P1_ST_R1_R4_NEG8 49897DF8 +DEFINE P1_LD_R1_R4_NEG8 498B7DF8 -## ---- Branches (tranche 4, r7-indirect) --------------------------- -DEFINE P1_B 41FFE4 -DEFINE P1_BEQ_R2_R3_R7 4839D6750341FFE4 -DEFINE P1_BNE_R2_R3_R7 4839D6740341FFE4 -DEFINE P1_BLT_R2_R3_R7 4839D67D0341FFE4 -DEFINE P1_BLT_R4_R2_R7 4939F27D0341FFE4 +## ---- Branches (tranche 4, LI_BR-indirect) ------------------------ +DEFINE P1_B 41FFE3 +DEFINE P1_BEQ_R2_R3 4839D6750341FFE3 +DEFINE P1_BNE_R2_R3 4839D6740341FFE3 +DEFINE P1_BLT_R2_R3 4839D67D0341FFE3 +DEFINE P1_BLT_R4_R2 4939F57D0341FFE3 ## ---- Control: CALL/RET + single-slot and N-slot PROLOGUE/EPILOGUE/TAIL -DEFINE P1_PROLOGUE 415B4883EC104153 -DEFINE P1_EPILOGUE 415B4883C4104153 +DEFINE P1_PROLOGUE 594883EC1051 +DEFINE P1_EPILOGUE 594883C41051 DEFINE P1_RET C3 -DEFINE P1_CALL 41FFD4 -DEFINE P1_TAIL 415B4883C410415341FFE4 -DEFINE P1_PROLOGUE_N2 415B4883EC204153 -DEFINE P1_EPILOGUE_N2 415B4883C4204153 -DEFINE P1_TAIL_N2 415B4883C420415341FFE4 -DEFINE P1_PROLOGUE_N3 415B4883EC204153 -DEFINE P1_EPILOGUE_N3 415B4883C4204153 -DEFINE P1_TAIL_N3 415B4883C420415341FFE4 -DEFINE P1_PROLOGUE_N4 415B4883EC304153 -DEFINE P1_EPILOGUE_N4 415B4883C4304153 -DEFINE P1_TAIL_N4 415B4883C430415341FFE4 +DEFINE P1_CALL 41FFD3 +DEFINE P1_TAIL 594883C4105141FFE3 +DEFINE P1_PROLOGUE_N2 594883EC2051 +DEFINE P1_EPILOGUE_N2 594883C42051 +DEFINE P1_TAIL_N2 594883C4205141FFE3 +DEFINE P1_PROLOGUE_N3 594883EC2051 +DEFINE P1_EPILOGUE_N3 594883C42051 +DEFINE P1_TAIL_N3 594883C4205141FFE3 +DEFINE P1_PROLOGUE_N4 594883EC3051 +DEFINE P1_EPILOGUE_N4 594883C43051 +DEFINE P1_TAIL_N4 594883C4305141FFE3 ## ---- Seed-Lisp step 1 extensions (tranche 6) --------------------- DEFINE P1_MOV_R1_R6 4889DF @@ -102,28 +103,38 @@ DEFINE P1_MOV_R7_R2 4989F4 DEFINE P1_MOV_R2_R6 4889DE 
DEFINE P1_MOV_R3_R7 4C89E2 DEFINE P1_MOV_R2_R7 4C89E6 -DEFINE P1_MOV_R4_R7 4D89E2 +DEFINE P1_MOV_R4_R7 4D89E5 DEFINE P1_MOV_R2_SP 4889E6 -DEFINE P1_MOV_R4_SP 4989E2 +DEFINE P1_MOV_R3_SP 4889E2 +DEFINE P1_MOV_R4_SP 4989E5 DEFINE P1_MOV_R6_SP 4889E3 DEFINE P1_MOV_R2_R0 4889C6 DEFINE P1_LD_R0_R6_0 488B4300 DEFINE P1_LD_R1_R6_16 488B7B10 -DEFINE P1_LD_R3_R4_0 498B5200 -DEFINE P1_LD_R0_R5_0 498B4000 -DEFINE P1_LB_R1_R4_0 490FB67A00 -DEFINE P1_ST_R2_R4_0 49897200 -DEFINE P1_ST_R0_R4_8 49894208 -DEFINE P1_LD_R0_R4_8 498B4208 +DEFINE P1_LD_R3_R4_0 498B5500 +DEFINE P1_LD_R0_R5_0 498B4600 +DEFINE P1_LB_R1_R4_0 490FB67D00 +DEFINE P1_ST_R2_R4_0 49897500 +DEFINE P1_ST_R0_R4_8 49894508 +DEFINE P1_LD_R0_R4_8 498B4508 DEFINE P1_LB_R1_R0_0 480FB67800 DEFINE P1_LD_R0_R1_0 488B4700 DEFINE P1_LD_R0_R1_8 488B4708 DEFINE P1_ST_R1_R0_0 48897800 -DEFINE P1_LD_R2_R4_0 498B7200 +DEFINE P1_LD_R2_R4_0 498B7500 DEFINE P1_ST_R2_R0_8 48897008 -DEFINE P1_LD_R0_R4_0 498B4200 -DEFINE P1_ST_R2_R4_16 49897210 -DEFINE P1_LD_R2_R4_16 498B7210 -DEFINE P1_BLT_R0_R2_R7 4839F07D0341FFE4 -DEFINE P1_BNE_R1_R2_R7 4839F7740341FFE4 -DEFINE P1_BNE_R0_R2_R7 4839F0740341FFE4 +DEFINE P1_LD_R0_R4_0 498B4500 +DEFINE P1_ST_R2_R4_16 49897510 +DEFINE P1_LD_R2_R4_16 498B7510 +DEFINE P1_LD_R0_R3_0 488B4200 +DEFINE P1_ST_R2_R3_0 48897200 +DEFINE P1_ST_R1_R3_8 48897A08 +DEFINE P1_LD_R1_R3_8 488B7A08 +DEFINE P1_ST_R2_R3_16 48897210 +DEFINE P1_LD_R2_R3_16 488B7210 +DEFINE P1_LD_R1_R1_0 488B7F00 +DEFINE P1_ADD_R2_R0_R1 4889C64801FE +DEFINE P1_BLT_R0_R2 4839F07D0341FFE3 +DEFINE P1_BLT_R1_R2 4839F77D0341FFE3 +DEFINE P1_BNE_R1_R2 4839F7740341FFE3 +DEFINE P1_BNE_R0_R2 4839F0740341FFE3 diff --git a/p1_gen.py b/p1_gen.py @@ -7,9 +7,9 @@ rewrites all three in place. 
Why a generator: the hand-written defs diverge across arches in hard-to-review ways, typos silently produce SIGILL'ing binaries (M1 -passes undefined tokens through as literal text — see P1_TODO.md issue -1), and the combinatorial surface (~1200 DEFINEs per arch if fully -enumerated) is past the point of hand-maintainability. +passes undefined tokens through as literal text — see `lint.sh` for +the pre-assemble check), and the combinatorial surface (~1200 DEFINEs +per arch if fully enumerated) is past the point of hand-maintainability. Design: * OPS is a list of emission rows. Each row is a small class whose @@ -39,10 +39,15 @@ ARCHES = ('aarch64', 'amd64', 'riscv64') ## the per-arch encoders insert into instruction fields; the human-facing ## names (rax, x1, a2, …) never appear in this file. +## 4:4 caller/callee-saved split. r0–r3 caller (native argregs); r4–r7 +## callee (native callee-saved). `br` is the hidden branch-target scratch +## (not a P1 reg) — picked so every op's expansion clobbers only what its +## name declares. NAT_AA64 = {'r0': 0, 'r1': 1, 'r2': 2, 'r3': 3, - 'r4': 4, 'r5': 5, 'r6': 19, 'r7': 20, + 'r4': 26, 'r5': 27, 'r6': 19, 'r7': 20, + 'br': 17, # x17 (IP1, caller-saved linker scratch) 'sp': 31, 'xzr': 31, 'lr': 30, - 'x21': 21, 'x22': 22, 'x23': 23, 'x24': 24, 'x25': 25, 'x8': 8} + 'x21': 21, 'x22': 22, 'x23': 23, 'x8': 8} ## amd64 ModRM.reg/rm + REX.R/B bit: native regnums 0..15 with r8..r15 ## setting the REX bit. We store the 4-bit native number directly. 
@@ -50,20 +55,24 @@ NAT_AMD64 = {'r0': 0, # rax 'r1': 7, # rdi 'r2': 6, # rsi 'r3': 2, # rdx - 'r4': 10, # r10 - 'r5': 8, # r8 + 'r4': 13, # r13 (callee-saved) + 'r5': 14, # r14 (callee-saved) 'r6': 3, # rbx 'r7': 12, # r12 + 'br': 11, # r11 — branch/call target scratch + DIV/REM r0 save 'sp': 4, # rsp - 'rcx': 1, # shift-count scratch (not a P1 reg) - 'r9': 9, # syscall arg6 slot - 'r11': 11, # call-retaddr scratch in PROLOGUE + 'rcx': 1, # shift-count scratch + DIV/REM rdx save (not a P1 reg) + 'r10': 10, # syscall arg4 slot (not a P1 reg) + 'r8': 8, # syscall arg5 slot (not a P1 reg) + 'r9': 9, # syscall arg6 slot (not a P1 reg) + 'r11': 11, # alias for br (some expansions spell it r11 directly) } NAT_RV64 = {'r0': 10, 'r1': 11, 'r2': 12, 'r3': 13, - 'r4': 14, 'r5': 15, 'r6': 9, 'r7': 18, + 'r4': 20, 'r5': 21, 'r6': 9, 'r7': 18, + 'br': 30, # t5 (caller-saved temp) 'sp': 2, 'ra': 1, 'zero': 0, 'a7': 17, - 's3': 19, 's4': 20, 's5': 21, 's6': 22, 's7': 23} + 's3': 19, 's6': 22, 's7': 23} ## ---------- Low-level encoding helpers ----------------------------------- @@ -521,19 +530,16 @@ class Mem(Op): if self.op == 'SB': return rv_s(0x00000023, self.rT, self.rN, self.off) -## --- Branches: r7-indirect pattern --- +## --- Branches: LI_BR-indirect pattern --- @dataclass class B(Op): def encode(self, arch): if arch == 'aarch64': - # BR x20 — 0xD61F0280 - return le32(0xD61F0280) + return le32(0xD61F0000 | (NAT_AA64['br'] << 5)) # BR x17 if arch == 'amd64': - # jmp r12 — 41 FF E4 - return '41FFE4' + return '41FFE3' # jmp r11 if arch == 'riscv64': - # jalr x0, 0(s2) — opcode 67 00 09 00 - return le32(0x00090067) + return le32(0x00000067 | (NAT_RV64['br'] << 15)) # jalr x0, 0(t5) @dataclass class CondB(Op): @@ -550,37 +556,25 @@ class CondB(Op): # Skip when NOT cond holds. BEQ→NE(1), BNE→EQ(0), BLT→GE(A). 
cond = {'BEQ': 1, 'BNE': 0, 'BLT': 10}[self.op] - bcond = le32(0x54000040 | cond) - br_x20 = le32(0xD61F0280) - return cmp + bcond + br_x20 + br = le32(0xD61F0000 | (NAT_AA64['br'] << 5)) + return cmp + bcond + br if arch == 'amd64': a, b = NAT_AMD64[self.rA], NAT_AMD64[self.rB] # cmp rA, rB — opcode 39 /r with rA as r/m cmp_ = rex(1, b >> 3, 0, a >> 3) + '39' + modrm(3, b, a) - # jcc rel8 opcode, skip=3 (past jmp r12): + # jcc rel8 opcode, skip=3 (past jmp r11): # BEQ→JNE 75 03 ; BNE→JE 74 03 ; BLT→JGE 7D 03 jop = {'BEQ': '75', 'BNE': '74', 'BLT': '7D'}[self.op] - jmp_r12 = '41FFE4' - return cmp_ + jop + '03' + jmp_r12 + return cmp_ + jop + '03' + '41FFE3' # jmp r11 if arch == 'riscv64': - # B<inv> rA, rB, +8 ; jalr x0, 0(s2) + # B<inv> rA, rB, +8 ; jalr x0, 0(t5) a, b = NAT_RV64[self.rA], NAT_RV64[self.rB] - # B-type imm=8 encoding: imm[12|10:5] = 0, imm[4:1|11] = 0b0100_0. - # That is: [31]=0 [30:25]=0 [11:8]=0b0100 [7]=0 → imm[11:7] = 0b01000, imm[31:25]=0. - # Bytes: xx xx x? 00 with byte[1]_upper=0x4 for the rd-field=8. # funct3 picks the op: BEQ→BNE(1), BNE→BEQ(0), BLT→BGE(5). funct3 = {'BEQ': 1, 'BNE': 0, 'BLT': 5}[self.op] - # Build B-type: opcode=0x63, funct3, rs1=rA, rs2=rB, imm=+8. - # imm encoding for +8: - # bit 12 = 0 - # bits 10:5 = 000000 - # bits 4:1 = 0100 - # bit 11 = 0 - insn = 0x00000063 | (funct3 << 12) | (a << 15) | (b << 20) - # imm bits: [11:7] carries imm[4:1,11]; [31:25] carries imm[12,10:5] - # For +8: imm[4:1]=0b0100 → [10:7] = 0b1000 → byte pos [11:7] = 0b01000 = 8 - insn |= (8 << 7) # bits [11:8] = 0b0100, bit 7=0 - jalr_x0_s2 = 0x00090067 - return le32(insn) + le32(jalr_x0_s2) + # B-type with imm=+8: bits [11:8]=0b0100 land in byte[1]'s low nibble (=4); the high nibble carries funct3. 
+ insn = 0x00000063 | (funct3 << 12) | (a << 15) | (b << 20) | (8 << 7) + jalr = 0x00000067 | (NAT_RV64['br'] << 15) # jalr x0, 0(t5) + return le32(insn) + le32(jalr) ## --- Simple singletons --- @@ -622,11 +616,16 @@ class Prologue(Op): str_lr = aa_ldst_uimm12(0xF9000000, 'lr', 'sp', 0, 3) return sub + str_lr if arch == 'amd64': - # pop r11 ; sub rsp,fb ; push r11 - # pop r11 = 41 5B ; sub rsp,imm8 = 48 83 EC ib (if fb<=127) - # push r11 = 41 53 + # pop rcx ; sub rsp,fb ; push rcx + # pop rcx = 59 ; sub rsp,imm8 = 48 83 EC ib (if fb<=127) + # push rcx = 51 + # rcx is the retaddr-carry scratch. r11 (= P1 'br') is off-limits + # because TAIL = EPILOGUE + `jmp r11` and using r11 here would + # clobber the LI_BR-loaded tail target mid-EPILOGUE. rcx is + # already the shift-count + DIV/REM save scratch — caller-save, + # never a P1 reg. assert fb <= 127 - return '415B' + '4883EC' + byte(fb) + '4153' + return '59' + '4883EC' + byte(fb) + '51' if arch == 'riscv64': # addi sp, sp, -fb ; sd ra, 0(sp) sub = rv_i(0x00000013, 'sp', 'sp', -fb) @@ -644,8 +643,9 @@ class Epilogue(Op): add = aa_add_imm('sp', 'sp', fb, sub=False) return ldr_lr + add if arch == 'amd64': + # Mirror of Prologue — pop rcx ; add rsp,fb ; push rcx. 
assert fb <= 127 - return '415B' + '4883C4' + byte(fb) + '4153' + return '59' + '4883C4' + byte(fb) + '51' if arch == 'riscv64': ld = rv_i(0x00003003, 'ra', 'sp', 0) add = rv_i(0x00000013, 'sp', 'sp', fb) @@ -656,7 +656,7 @@ class Tail(Op): k: int = 1 def encode(self, arch): - # Epilogue + unconditional B (= jump through r7) + # Epilogue + unconditional B (= jump through the branch-scratch reg) epi = Epilogue(name='', k=self.k).encode(arch) br = B(name='').encode(arch) return epi + br @@ -665,11 +665,11 @@ class Tail(Op): class Call(Op): def encode(self, arch): if arch == 'aarch64': - return le32(0xD63F0280) # BLR x20 + return le32(0xD63F0000 | (NAT_AA64['br'] << 5)) # BLR x17 if arch == 'amd64': - return '41FFD4' # call r12 + return '41FFD3' # call r11 if arch == 'riscv64': - return le32(0x000900E7) # jalr ra, 0(s2) + return le32(0x000000E7 | (NAT_RV64['br'] << 15)) # jalr ra, 0(t5) @dataclass class Ret(Op): @@ -684,50 +684,49 @@ class Ret(Op): ## --- SYSCALL / SYSOPEN --- SYSCALL_HEX = { 'aarch64': ( - # mov x8,x0 ; save r1..r5 into x21..x25 ; shuffle into x0..x4 ; - # mov x5,x19 ; svc #0 ; restore r1..r5. + # r4/r5 now live in callee-saved natives (x26/x27), so the + # kernel preserves them — no save/restore needed. Only r1/r2/r3 + # (in caller-saved x1/x2/x3) must be stashed across the shuffle. 
'' .join([ - le32(0xAA0003E8), # mov x8, x0 (ORR x8, xzr, x0) - le32(0xAA0103F5), # mov x21, x1 - le32(0xAA0203F6), # mov x22, x2 - le32(0xAA0303F7), # mov x23, x3 - le32(0xAA0403F8), # mov x24, x4 - le32(0xAA0503F9), # mov x25, x5 - le32(0xAA1503E0), # mov x0, x21 - le32(0xAA1603E1), # mov x1, x22 - le32(0xAA1703E2), # mov x2, x23 - le32(0xAA1803E3), # mov x3, x24 - le32(0xAA1903E4), # mov x4, x25 - le32(0xAA1303E5), # mov x5, x19 + le32(0xAA0003E8), # mov x8, x0 (syscall number) + le32(0xAA0103F5), # mov x21, x1 (save r1) + le32(0xAA0203F6), # mov x22, x2 (save r2) + le32(0xAA0303F7), # mov x23, x3 (save r3) + le32(0xAA1503E0), # mov x0, x21 (arg1 = r1) + le32(0xAA1603E1), # mov x1, x22 (arg2 = r2) + le32(0xAA1703E2), # mov x2, x23 (arg3 = r3) + le32(0xAA1A03E3), # mov x3, x26 (arg4 = r4) + le32(0xAA1B03E4), # mov x4, x27 (arg5 = r5) + le32(0xAA1303E5), # mov x5, x19 (arg6 = r6) le32(0xD4000001), # svc #0 - le32(0xAA1503E1), # mov x1, x21 + le32(0xAA1503E1), # mov x1, x21 (restore r1) le32(0xAA1603E2), # mov x2, x22 le32(0xAA1703E3), # mov x3, x23 - le32(0xAA1803E4), # mov x4, x24 - le32(0xAA1903E5), # mov x5, x25 ]) ), - 'amd64': '4989D9' + '0F05', # mov r9, rbx ; syscall + # r4=r13, r5=r14 are callee-saved natively, but syscall wants args + # 4/5 in r10/r8. r6=rbx, but arg6 lives in r9. Three shuffle moves, + # then syscall. The kernel preserves rdi/rsi/rdx/r12–r15/rbx, so no + # P1 reg is clobbered beyond r0 (syscall return). + 'amd64': '4D89EA' + '4D89F0' + '4989D9' + '0F05', 'riscv64': ( + # Same story as aarch64: r4/r5 in callee-saved s4/s5 (=x20/x21), + # so we only save/restore a1/a2/a3. Scratch slots: s3, s6, s7. 
         ''.join([
-            le32(0x00050893),  # addi a7, a0, 0 (mv a7, a0)
-            le32(0x00058993),  # mv s3, a1
-            le32(0x00060A13),  # mv s4, a2
-            le32(0x00068A93),  # mv s5, a3
-            le32(0x00070B13),  # mv s6, a4
-            le32(0x00078B93),  # mv s7, a5
-            le32(0x00098513),  # mv a0, s3
-            le32(0x000A0593),  # mv a1, s4
-            le32(0x000A8613),  # mv a2, s5
-            le32(0x000B0693),  # mv a3, s6
-            le32(0x000B8713),  # mv a4, s7
-            le32(0x00048793),  # mv a5, s1
+            le32(0x00050893),  # mv a7, a0 (syscall number)
+            le32(0x00058993),  # mv s3, a1 (save r1)
+            le32(0x00060B13),  # mv s6, a2 (save r2)
+            le32(0x00068B93),  # mv s7, a3 (save r3)
+            le32(0x00098513),  # mv a0, s3 (arg1 = r1)
+            le32(0x000B0593),  # mv a1, s6 (arg2 = r2)
+            le32(0x000B8613),  # mv a2, s7 (arg3 = r3)
+            le32(0x000A0693),  # mv a3, s4 (arg4 = r4)
+            le32(0x000A8713),  # mv a4, s5 (arg5 = r5)
+            le32(0x00048793),  # mv a5, s1 (arg6 = r6)
             le32(0x00000073),  # ecall
-            le32(0x00098593),  # mv a1, s3
-            le32(0x000A0613),  # mv a2, s4
-            le32(0x000A8693),  # mv a3, s5
-            le32(0x000B0713),  # mv a4, s6
-            le32(0x000B8793),  # mv a5, s7
+            le32(0x00098593),  # mv a1, s3 (restore r1)
+            le32(0x000B0613),  # mv a2, s6
+            le32(0x000B8693),  # mv a3, s7
         ])
     ),
 }
@@ -770,6 +769,10 @@ def rows():
     R.append(Banner('LI — load 4-byte zero-extended literal from inline data slot'))
     for rd in ['r0', 'r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7']:
         R.append(Li(name=f'LI_{rd.upper()}', rD=rd))
+    # LI_BR loads into the hidden branch-target scratch (x17/r11/t5).
+    # Every branch and call site is `LI_BR &target ; P1_B` (or the
+    # matching conditional / CALL). The scratch is *not* a P1 reg.
+    R.append(Li(name='LI_BR', rD='br'))
 
     # --- SYSCALL / SYSOPEN ---
     R.append(Banner('SYSCALL / SYSOPEN — uniform (clobbers r0 only) across arches'))
@@ -841,12 +844,12 @@ def rows():
                      op=op, rT=rt, rN=rn, off=off))
 
     # --- Branches ---
-    R.append(Banner('Branches (tranche 4, r7-indirect)'))
+    R.append(Banner('Branches (tranche 4, LI_BR-indirect)'))
     R.append(B(name='B'))
     for op, a, b in [
         ('BEQ','r2','r3'), ('BNE','r2','r3'), ('BLT','r2','r3'), ('BLT','r4','r2'),
     ]:
-        R.append(CondB(name=f'{op}_{a.upper()}_{b.upper()}_R7', op=op, rA=a, rB=b))
+        R.append(CondB(name=f'{op}_{a.upper()}_{b.upper()}', op=op, rA=a, rB=b))
 
     # --- PROLOGUE / EPILOGUE / CALL / RET / TAIL — single-slot + Nk variants ---
     R.append(Banner('Control: CALL/RET + single-slot and N-slot PROLOGUE/EPILOGUE/TAIL'))
@@ -874,6 +877,7 @@ def rows():
     R.append(Mov(name='MOV_R4_R7', rD='r4', rA='r7'))
     # MOV rD, sp variants
     R.append(Mov(name='MOV_R2_SP', rD='r2', rA='sp'))
+    R.append(Mov(name='MOV_R3_SP', rD='r3', rA='sp'))
     R.append(Mov(name='MOV_R4_SP', rD='r4', rA='sp'))
     R.append(Mov(name='MOV_R6_SP', rD='r6', rA='sp'))
     # Extra MOVs needed around calls
@@ -886,18 +890,28 @@ def rows():
         ('ST','r0','r4',8), ('LD','r0','r4',8), ('LB','r1','r0',0),
         ('LD','r0','r1',0), ('LD','r0','r1',8), ('ST','r1','r0',0),
         ('LD','r2','r4',0), ('ST','r2','r0',8), ('LD','r0','r4',0),
-        # N=2 scratch slot 2 at [sp+16]: used by cons after dropping
-        # the :cons_save_cdr BSS spill (P1_TODO.md issue 4).
         ('ST','r2','r4',16), ('LD','r2','r4',16),
+        # r3-based addressing: after the 4:4 split, r4 is callee-saved
+        # and cons/alloc use r3 (caller-saved) as sp-temp and heap-cell
+        # scratch instead. LD_R1_R1_0 is alloc's `r1 = *heap_end`.
+        ('LD','r0','r3',0), ('ST','r2','r3',0),
+        ('ST','r1','r3',8), ('LD','r1','r3',8),
+        ('ST','r2','r3',16), ('LD','r2','r3',16),
+        ('LD','r1','r1',0),
     ]:
         suf = f'NEG{-off}' if off < 0 else f'{off}'
         R.append(Mem(name=f'{op}_{rt.upper()}_{rn.upper()}_{suf}',
                      op=op, rT=rt, rN=rn, off=off))
-    # BLT r0, r2 ; BNE r1, r2 ; BNE r0, r2
-    R.append(CondB(name='BLT_R0_R2_R7', op='BLT', rA='r0', rB='r2'))
-    R.append(CondB(name='BNE_R1_R2_R7', op='BNE', rA='r1', rB='r2'))
-    R.append(CondB(name='BNE_R0_R2_R7', op='BNE', rA='r0', rB='r2'))
+    # ADD_R2_R0_R1: alloc's `new_next = old + size`. The other
+    # tranche-1 ADDs don't happen to have (d=r2, a=r0, b=r1).
+    R.append(RRR(name='ADD_R2_R0_R1', op='ADD', rD='r2', rA='r0', rB='r1'))
+
+    # Branches used by lisp step 2 / alloc OOM.
+    R.append(CondB(name='BLT_R0_R2', op='BLT', rA='r0', rB='r2'))
+    R.append(CondB(name='BLT_R1_R2', op='BLT', rA='r1', rB='r2'))
+    R.append(CondB(name='BNE_R1_R2', op='BNE', rA='r1', rB='r2'))
+    R.append(CondB(name='BNE_R0_R2', op='BNE', rA='r0', rB='r2'))
 
     return R
diff --git a/p1_riscv64.M1 b/p1_riscv64.M1
@@ -10,13 +10,14 @@
 DEFINE P1_LI_R0 170500000365C5006F008000
 DEFINE P1_LI_R1 9705000083E5C5006F008000
 DEFINE P1_LI_R2 170600000366C6006F008000
 DEFINE P1_LI_R3 9706000083E6C6006F008000
-DEFINE P1_LI_R4 170700000367C7006F008000
-DEFINE P1_LI_R5 9707000083E7C7006F008000
+DEFINE P1_LI_R4 170A0000036ACA006F008000
+DEFINE P1_LI_R5 970A000083EACA006F008000
 DEFINE P1_LI_R6 9704000083E4C4006F008000
 DEFINE P1_LI_R7 170900000369C9006F008000
+DEFINE P1_LI_BR 170F0000036FCF006F008000
 
 ## ---- SYSCALL / SYSOPEN — uniform (clobbers r0 only) across arches
-DEFINE P1_SYSCALL 9308050093890500130A0600938A0600130B0700938B07001385090093050A0013860A0093060B0013870B0093870400730000009385090013060A0093860A0013070B0093870B00
+DEFINE P1_SYSCALL 9308050093890500130B0600938B06001385090093050B0013860B0093060A0013870A0093870400730000009385090013060B0093860B00
 DEFINE P1_SYSOPEN 1305C0F99308800373000000
 
 ## ---- Linux syscall numbers (per-arch table). LE-32 immediate operands for LI.
@@ -27,31 +28,31 @@
 DEFINE SYS_CLOSE 39000000
 
 ## ---- Reg-reg-reg arithmetic (tranche 1) --------------------------
 DEFINE P1_ADD_R1_R1_R2 B385C500
-DEFINE P1_ADD_R1_R1_R4 B385E500
+DEFINE P1_ADD_R1_R1_R4 B3854501
 DEFINE P1_ADD_R2_R2_R6 33069600
 DEFINE P1_ADD_R2_R3_R1 3386B600
 DEFINE P1_SUB_R1_R1_R2 B385C540
 DEFINE P1_SUB_R2_R2_R6 33069640
-DEFINE P1_AND_R1_R1_R5 B3F5F500
+DEFINE P1_AND_R1_R1_R5 B3F55501
 DEFINE P1_OR_R1_R1_R2 B3E5C500
 DEFINE P1_XOR_R1_R1_R2 B3C5C500
 DEFINE P1_MUL_R1_R1_R2 B385C502
 DEFINE P1_DIV_R1_R1_R2 B3C5C502
-DEFINE P1_REM_R1_R1_R5 B3E5F502
+DEFINE P1_REM_R1_R1_R5 B3E55503
 DEFINE P1_SHL_R1_R1_R2 B395C500
 DEFINE P1_SHR_R1_R1_R2 B3D5C500
-DEFINE P1_SAR_R4_R4_R2 3357C740
+DEFINE P1_SAR_R4_R4_R2 335ACA40
 
 ## ---- Immediate arithmetic (tranche 2) ----------------------------
 DEFINE P1_ADDI_R1_R1_3 93853500
 DEFINE P1_ADDI_R1_R1_1 93851500
 DEFINE P1_ADDI_R1_R1_NEG3 9385D5FF
-DEFINE P1_ADDI_R4_R4_NEG1 1307F7FF
+DEFINE P1_ADDI_R4_R4_NEG1 130AFAFF
 DEFINE P1_ADDI_R1_R1_NEG2 9385E5FF
 DEFINE P1_ADDI_R0_R0_1 13051500
 DEFINE P1_SHLI_R1_R1_1 93951500
 DEFINE P1_SHRI_R1_R1_1 93D51500
-DEFINE P1_SARI_R4_R4_1 13571740
+DEFINE P1_SARI_R4_R4_1 135A1A40
 DEFINE P1_ANDI_R1_R1_6 93F56500
 DEFINE P1_ANDI_R1_R1_7 93F57500
 DEFINE P1_ORI_R1_R1_1 93E51500
@@ -59,38 +60,38 @@
 DEFINE P1_ORI_R0_R0_2 13652500
 DEFINE P1_ORI_R0_R0_7 13657500
 
 ## ---- LA + memory ops (tranche 3) ---------------------------------
-DEFINE P1_LA_R4 170700000367C7006F008000
-DEFINE P1_ST_R1_R4_0 2330B700
-DEFINE P1_LD_R1_R4_0 83350700
-DEFINE P1_ST_R1_R4_8 2334B700
-DEFINE P1_LD_R1_R4_8 83358700
-DEFINE P1_SB_R1_R4_16 2308B700
-DEFINE P1_LB_R1_R4_16 83450701
-DEFINE P1_ST_R1_R4_NEG8 233CB7FE
-DEFINE P1_LD_R1_R4_NEG8 833587FF
+DEFINE P1_LA_R4 170A0000036ACA006F008000
+DEFINE P1_ST_R1_R4_0 2330BA00
+DEFINE P1_LD_R1_R4_0 83350A00
+DEFINE P1_ST_R1_R4_8 2334BA00
+DEFINE P1_LD_R1_R4_8 83358A00
+DEFINE P1_SB_R1_R4_16 2308BA00
+DEFINE P1_LB_R1_R4_16 83450A01
+DEFINE P1_ST_R1_R4_NEG8 233CBAFE
+DEFINE P1_LD_R1_R4_NEG8 83358AFF
 
-## ---- Branches (tranche 4, r7-indirect) ---------------------------
-DEFINE P1_B 67000900
-DEFINE P1_BEQ_R2_R3_R7 6314D60067000900
-DEFINE P1_BNE_R2_R3_R7 6304D60067000900
-DEFINE P1_BLT_R2_R3_R7 6354D60067000900
-DEFINE P1_BLT_R4_R2_R7 6354C70067000900
+## ---- Branches (tranche 4, LI_BR-indirect) ------------------------
+DEFINE P1_B 67000F00
+DEFINE P1_BEQ_R2_R3 6314D60067000F00
+DEFINE P1_BNE_R2_R3 6304D60067000F00
+DEFINE P1_BLT_R2_R3 6354D60067000F00
+DEFINE P1_BLT_R4_R2 6354CA0067000F00
 
 ## ---- Control: CALL/RET + single-slot and N-slot PROLOGUE/EPILOGUE/TAIL
 DEFINE P1_PROLOGUE 130101FF23301100
 DEFINE P1_EPILOGUE 8330010013010101
 DEFINE P1_RET 67800000
-DEFINE P1_CALL E7000900
-DEFINE P1_TAIL 833001001301010167000900
+DEFINE P1_CALL E7000F00
+DEFINE P1_TAIL 833001001301010167000F00
 DEFINE P1_PROLOGUE_N2 130101FE23301100
 DEFINE P1_EPILOGUE_N2 8330010013010102
-DEFINE P1_TAIL_N2 833001001301010267000900
+DEFINE P1_TAIL_N2 833001001301010267000F00
 DEFINE P1_PROLOGUE_N3 130101FE23301100
 DEFINE P1_EPILOGUE_N3 8330010013010102
-DEFINE P1_TAIL_N3 833001001301010267000900
+DEFINE P1_TAIL_N3 833001001301010267000F00
 DEFINE P1_PROLOGUE_N4 130101FD23301100
 DEFINE P1_EPILOGUE_N4 8330010013010103
-DEFINE P1_TAIL_N4 833001001301010367000900
+DEFINE P1_TAIL_N4 833001001301010367000F00
 
 ## ---- Seed-Lisp step 1 extensions (tranche 6) ---------------------
 DEFINE P1_MOV_R1_R6 93850400
@@ -102,28 +103,38 @@
 DEFINE P1_MOV_R7_R2 13090600
 DEFINE P1_MOV_R2_R6 13860400
 DEFINE P1_MOV_R3_R7 93060900
 DEFINE P1_MOV_R2_R7 13060900
-DEFINE P1_MOV_R4_R7 13070900
+DEFINE P1_MOV_R4_R7 130A0900
 DEFINE P1_MOV_R2_SP 13060100
-DEFINE P1_MOV_R4_SP 13070100
+DEFINE P1_MOV_R3_SP 93060100
+DEFINE P1_MOV_R4_SP 130A0100
 DEFINE P1_MOV_R6_SP 93040100
 DEFINE P1_MOV_R2_R0 13060500
 DEFINE P1_LD_R0_R6_0 03B50400
 DEFINE P1_LD_R1_R6_16 83B50401
-DEFINE P1_LD_R3_R4_0 83360700
-DEFINE P1_LD_R0_R5_0 03B50700
-DEFINE P1_LB_R1_R4_0 83450700
-DEFINE P1_ST_R2_R4_0 2330C700
-DEFINE P1_ST_R0_R4_8 2334A700
-DEFINE P1_LD_R0_R4_8 03358700
+DEFINE P1_LD_R3_R4_0 83360A00
+DEFINE P1_LD_R0_R5_0 03B50A00
+DEFINE P1_LB_R1_R4_0 83450A00
+DEFINE P1_ST_R2_R4_0 2330CA00
+DEFINE P1_ST_R0_R4_8 2334AA00
+DEFINE P1_LD_R0_R4_8 03358A00
 DEFINE P1_LB_R1_R0_0 83450500
 DEFINE P1_LD_R0_R1_0 03B50500
 DEFINE P1_LD_R0_R1_8 03B58500
 DEFINE P1_ST_R1_R0_0 2330B500
-DEFINE P1_LD_R2_R4_0 03360700
+DEFINE P1_LD_R2_R4_0 03360A00
 DEFINE P1_ST_R2_R0_8 2334C500
-DEFINE P1_LD_R0_R4_0 03350700
-DEFINE P1_ST_R2_R4_16 2338C700
-DEFINE P1_LD_R2_R4_16 03360701
-DEFINE P1_BLT_R0_R2_R7 6354C50067000900
-DEFINE P1_BNE_R1_R2_R7 6384C50067000900
-DEFINE P1_BNE_R0_R2_R7 6304C50067000900
+DEFINE P1_LD_R0_R4_0 03350A00
+DEFINE P1_ST_R2_R4_16 2338CA00
+DEFINE P1_LD_R2_R4_16 03360A01
+DEFINE P1_LD_R0_R3_0 03B50600
+DEFINE P1_ST_R2_R3_0 23B0C600
+DEFINE P1_ST_R1_R3_8 23B4B600
+DEFINE P1_LD_R1_R3_8 83B58600
+DEFINE P1_ST_R2_R3_16 23B8C600
+DEFINE P1_LD_R2_R3_16 03B60601
+DEFINE P1_LD_R1_R1_0 83B50500
+DEFINE P1_ADD_R2_R0_R1 3306B500
+DEFINE P1_BLT_R0_R2 6354C50067000F00
+DEFINE P1_BLT_R1_R2 63D4C50067000F00
+DEFINE P1_BNE_R1_R2 6384C50067000F00
+DEFINE P1_BNE_R0_R2 6304C50067000F00