commit 4f257aaba799025e6f84cccd9311991e67c65681
parent 78312f7592353f51c3b5957eca18ea098cbd4a66
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Mon, 20 Apr 2026 22:20:13 -0700
P1: extract branch-scratch; 4:4 caller/callee split; stack-walking GC roots
Branch target lives in a hidden native scratch reg (aarch64 x17, amd64
r11, riscv64 t5) instead of a P1 GPR. Source code uses LI_BR in place
of LI_R7, and branch/call ops drop the _R7 suffix — the scratch is no
longer a named P1 register, so it can't be accidentally referenced from
program code.
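For the record, the shape of a branch site before and after (tokens as
they appear in lisp.M1/demo.M1; the target lives in the scratch reg
only between the load and the branch that consumes it):

```
## before                    ## after
P1_LI_R7                     P1_LI_BR
&fail                        &fail
P1_BNE_R1_R2_R7              P1_BNE_R1_R2
```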
The freed slot realigns the P1 register file as 4:4 caller/callee-saved:
r0-r3 caller-saved (native argregs on each arch), r4-r7 callee-saved.
r4/r5 remap to native callee-saved regs (aarch64 x26/x27, amd64 r13/r14,
riscv64 s4/s5) so SYSCALL no longer has to preserve them across the
kernel trap. Function bodies get four walkable cross-call roots without
touching the frame.
amd64 PROLOGUE/EPILOGUE now use rcx (not r11) as the retaddr-carry
scratch. TAIL = EPILOGUE + jmp r11, so the old sequence popped the
retaddr into its own branch target one instruction before jumping
through it; the bug surfaced as demo exiting 13 on amd64 while
aarch64/riscv64 passed. rcx was already a non-P1 scratch used by
shifts and DIV/REM, so no new reservations.
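A toy model of the bug (illustrative only — the real code is M1 macro
expansions in p1_amd64.M1; register names here stand in for the native
regs). TAIL loads its target via LI_BR, then runs the epilogue, then
jumps; when the epilogue's retaddr carry was r11, it overwrote the
target the jump was about to use:

```python
# Toy model of the amd64 TAIL sequence. TAIL = LI_BR &callee, then
# EPILOGUE (pop retaddr into a scratch reg), then jmp r11.
def tail(regs, target, epilogue_scratch):
    regs["r11"] = target                        # LI_BR &callee
    regs[epilogue_scratch] = regs["retaddr"]    # EPILOGUE retaddr carry
    return regs["r11"]                          # jmp r11

# New sequence: rcx carries the retaddr, jump lands on the callee.
assert tail({"retaddr": 0x1000}, 0x2000, "rcx") == 0x2000
# Old sequence: r11 carry clobbers the branch target mid-epilogue.
assert tail({"retaddr": 0x1000}, 0x2000, "r11") == 0x1000
```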
lisp.M1 cons() switches its sp-temp from r4 to r3 (caller-saved, no
preservation burden) and alloc() now uses only r0-r3, shrinking its
callee-save pressure to zero.
LISP.md §Settled decisions #3 rewrites the GC root model from "32 fixed
BSS root_slots + live_mask, stack not scanned" to "per-frame root slots
walked on the P1 stack." This was the discipline N-slot PROLOGUE was
built for in 78312f7 — the doc just hadn't caught up. global_env and
symbol_table remain named BSS roots (long-lived singletons with no
stack frame to hang off of).
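A sketch of the per-frame walk, under assumptions: the frame model and
tag value (low-3-bit pair tag of 0b001) are illustrative, not the
encodings LISP.md §Value representation actually assigns, and the real
mark phase walks the raw P1 stack rather than a Python list:

```python
# Each frame declares its PROLOGUE_Nk slot count; the walker treats
# every slot word as a candidate tagged root. Non-pointer words
# self-filter because their low-bit tags match no heap object.
PAIR_TAG = 0b001  # illustrative tag value

def roots_from_stack(frames):
    roots = []
    for slot_count, slots in frames:          # one entry per active frame
        for word in slots[:slot_count]:
            if word & 0b111 == PAIR_TAG:      # tag check = self-filter
                roots.append(word)
    return roots

# Frame 1: a rooted pair pointer plus a raw fixnum (tag 0, filtered).
# Frame 2: a bare P1 integer (tag 0b110, filtered).
frames = [(2, [0x600560 | PAIR_TAG, 42 << 3]), (1, [6])]
assert roots_from_stack(frames) == [0x600560 | PAIR_TAG]
```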
P1.md and comments across lisp.M1/demo.M1/p1_gen.py updated to match.
P1_TODO.md removed — all five issues are landed.
Diffstat:
| M | LISP.md | | | 42 | ++++++++++++++++++++++++------------------ |
| M | Makefile | | | 3 | +-- |
| M | P1.md | | | 156 | +++++++++++++++++++++++++++++++++++++++++++------------------------------------ |
| D | P1_TODO.md | | | 251 | ------------------------------------------------------------------------------- |
| M | demo.M1 | | | 56 | ++++++++++++++++++++++++++++---------------------------- |
| M | lisp.M1 | | | 73 | +++++++++++++++++++++++++++++++++++++++---------------------------------- |
| M | p1_aarch64.M1 | | | 99 | ++++++++++++++++++++++++++++++++++++++++++++----------------------------------- |
| M | p1_amd64.M1 | | | 115 | +++++++++++++++++++++++++++++++++++++++++++------------------------------------ |
| M | p1_gen.py | | | 198 | ++++++++++++++++++++++++++++++++++++++++++------------------------------------- |
| M | p1_riscv64.M1 | | | 99 | ++++++++++++++++++++++++++++++++++++++++++++----------------------------------- |
10 files changed, 456 insertions(+), 636 deletions(-)
diff --git a/LISP.md b/LISP.md
@@ -31,9 +31,11 @@ Load-bearing; the rest of the document assumes them.
§Value representation.
2. **Mark-sweep GC over a fixed BSS arena.** 20 MB, linked in at build
time, no `brk`/`mmap`.
-3. **Fixed root-spill slots.** 32 named cells in BSS; Scheme-facing
- functions spill live values there before any allocating call. The
- P1 stack is not scanned.
+3. **Per-frame root slots on the P1 stack.** Scheme-facing functions
+ declare `PROLOGUE_Nk` frames with one slot per live tagged value
+ they need across an allocating call; the GC mark phase walks those
+ frames. `global_env` and `symbol_table` remain named BSS roots
+ (long-lived singletons with no stack frame to hang off of).
4. **Source file via `argv[1]`.** The interpreter opens the path given
on the command line, reads it whole into a string, parses it,
evaluates top-to-bottom. No linked-in blob.
@@ -144,14 +146,15 @@ Three sources:
transitively.
2. **Symbol table.** Intern table's backing array plus all interned
symbols and their name strings. Fixed pointer (`symbol_table`).
-3. **Root-spill slots.** 32 tagged words at `root_slots[0..32]` in BSS.
- A 32-bit `live_mask` word marks which slots currently hold a live
- value. Scheme-facing functions set and clear bits in the mask
- across their dynamic extent.
+3. **Per-frame stack slots.** The mark phase walks the P1 stack from
+ the current `sp` up to the saved frame-chain bottom. Each frame
+ declares its slot count via `PROLOGUE_Nk` (see P1.md §"Semantics"),
+ and the GC treats every slot word as a tagged-value root —
+ non-pointer slots self-filter because their tags don't match any
+ heap object.
-The P1 stack is **not scanned.** Stack frames hold P1 integers, return
-addresses, and scratch — not Scheme-value-shaped words unless those
-words are also rooted in a spill slot.
+Return addresses and raw P1 integers on the stack sit outside the
+PROLOGUE_Nk slot range, so the walker doesn't see them.
### Algorithm
@@ -196,11 +199,13 @@ uncommon in compiler workloads.
### Invariants
- Between any two GC-observable points, every live Scheme value must be
- reachable from `global_env`, `symbol_table`, or a live root slot.
- "GC-observable" = any call to `cons`, `make-vector`, `make-string`,
- `make-closure`, `intern`, or another Scheme function.
-- `live_mask` bits are owned by the function that set them; callers
- don't set a callee's bits.
+ reachable from `global_env`, `symbol_table`, or a PROLOGUE_Nk slot
+ in an active frame. "GC-observable" = any call to `cons`,
+ `make-vector`, `make-string`, `make-closure`, `intern`, or another
+ Scheme function.
+- Scheme-facing functions spill each live tagged value into a frame
+ slot before any allocating call. A value kept only in a P1 register
+ will be lost if GC fires mid-call.
Tradeoffs taken:
@@ -348,8 +353,8 @@ startup the interpreter populates `global_env` with one binding per
primitive, each wrapped as a primitive-type heap object.
Primitives allocate through the same `alloc`/GC path as user code and
-must honor the root-spill discipline when they make more than one
-allocation.
+must honor the per-frame-slot root discipline when they make more than
+one allocation.
Target set: ~40 primitives per PLAN.md. Roughly:
@@ -464,7 +469,8 @@ coverage before the next begins.
via syscalls, `error` path, BSS layout, bump allocator. Exits with
the file's first byte as proof-of-life. ~200 P1 LOC.
2. **Tagged values.** Fixnum/pair/singleton encoding; `cons`/`car`/`cdr`/
- tag predicates. Hand-build a pair; print its car. ~300 LOC.
+ tag predicates. Hand-build `cons(42, nil)` and exit with the decoded
+ car as status. ~300 LOC.
3. **Strings + symbol interning.** Heap strings, 4096-slot intern table,
symbol allocator. Two `(intern "foo")` calls return `eq?` pointers.
~400 LOC.
diff --git a/Makefile b/Makefile
@@ -84,8 +84,7 @@ $(TOOLS_DIR)/M0 $(TOOLS_DIR)/hex2-0 $(TOOLS_DIR)/catm $(TOOLS_DIR)/hex0 $(TOOLS_
#
# Lint catches P1_*/SYS_* tokens with no matching DEFINE — M1 otherwise
# silently emits the literal token text and produces a SIGILL-on-run
-# binary (see P1_TODO.md issue 1). Runs on the host (plain POSIX sh);
-# no podman dependency.
+# binary. Runs on the host (plain POSIX sh); no podman dependency.
#
# M0 takes a single positional input (no -f flag), so we catm the two
# sources together first. The intermediate .combined.M1 is kept in OUT_DIR
diff --git a/P1.md b/P1.md
@@ -44,9 +44,10 @@ All runs on stock stage0 `M0` + `hex2-0`, bootstrapped per-arch from
The DEFINE table is generator-driven (`p1_gen.py`); tranches 1–8 are
enumerated there, plus the full PROLOGUE_Nk family (k=1..4). Branch
-offsets are realized by the r7-indirect pattern
-(`LI_R7 &target ; BXX_rA_rB_R7`), sidestepping the missing
-branch-offset support in hex2.
+offsets are realized by the LI_BR-indirect pattern
+(`LI_BR &target ; BXX_rA_rB`), sidestepping the missing
+branch-offset support in hex2. The branch-target scratch is a
+reserved native reg (`x17`/`r11`/`t5`), not a P1 GPR.
### Spike deviations from the design
@@ -72,15 +73,16 @@ branch-offset support in hex2.
| Registers | 8 GPRs (`r0`–`r7`) + `sp`, `lr`-on-stack | Fits x86-64's usable register budget |
| Narrow imm | Signed 12-bit | riscv I-type width; aarch64 ≤12 also OK |
| Wide imm | Pool-loaded via PC-relative `LI` | Avoids arch-specific immediate synthesis |
-| Calling conv | r0 = return, r1–r6 = args, r6 callee-saved, r7 per-op scratch | P1-defined; not platform ABI |
+| Calling conv | r0 = return, r1–r3 = args (caller-saved), r4–r7 callee-saved | P1-defined; not platform ABI |
| Return address | Always spilled to stack on entry | Hides x86's missing `lr` uniformly |
| Syscall | `SYSCALL` with num in r0, args r1–r6; clobbers r0 only | Per-arch wrapper emits native sequence |
| Spill slot | `[sp + 8]` is callee-private scratch after `PROLOGUE` | Frame already 16 B for alignment; second cell was otherwise unused |
## Register mapping
-`r0`–`r5` are caller-saved. `r6` is callee-saved. `r7` is per-op scratch
-(see below). `sp` is special-purpose — see `PROLOGUE` semantics.
+`r0`–`r3` are caller-saved. `r4`–`r7` are callee-saved, general-purpose,
+and preserved across `CALL`/`SYSCALL`. `sp` is special-purpose — see
+`PROLOGUE` semantics.
| P1 | amd64 | aarch64 | riscv64 |
|------|-------|---------|---------|
@@ -88,43 +90,51 @@ branch-offset support in hex2.
| `r1` | `rdi` | `x1` | `a1` |
| `r2` | `rsi` | `x2` | `a2` |
| `r3` | `rdx` | `x3` | `a3` |
-| `r4` | `r10` | `x4` | `a4` |
-| `r5` | `r8` | `x5` | `a5` |
+| `r4` | `r13` | `x26` | `s4` |
+| `r5` | `r14` | `x27` | `s5` |
| `r6` | `rbx` | `x19` | `s1` |
| `r7` | `r12` | `x20` | `s2` |
| `sp` | `rsp` | `sp` | `sp` |
| `lr` | (mem) | `x30` | `ra` |
+`r4`–`r7` all map to native callee-saved regs on each arch, so the SysV
+kernel+libc "callee preserves these" rule does the work for us across
+syscalls without explicit save/restore in the `SYSCALL` expansion.
+
x86-64 has no link register; `CALL`/`RET` macros push/pop the return address
on the stack. On aarch64/riscv64, the prologue spills `lr` (`x30`/`ra`) to
the stack too, so all three converge on "return address lives in
`[sp + 0]` after prologue." This uniformity is worth the extra store on the
register-rich arches.
-**`r7` is not a general-purpose callee-saved register.** Every conditional
-and unconditional branch in P1 compiles through the r7-indirect pattern
-(`LI_R7 &target ; BLT/B`), which overwrites `r7` at every branch site.
-Treat `r7` as a per-instruction scratch owned by the branch / call
-machinery: its value is meaningful only between the `LI_R7` that loads a
-target and the branch that consumes it. Never carry a live value through a
-branch or call in `r7`.
-
**Reserved scratch registers (not available to P1):** certain native
regs are used internally by op expansions and are never exposed as P1
registers. Every P1 op writes only what its name says it writes —
reserved scratch is save/restored within the expansion so no hidden
clobbers leak across op boundaries.
-- **aarch64** — `x21`–`x25` hold `r1`–`r5` across the `SYSCALL` arg
- shuffle. `x16` (ARM IP0) is scratch for `REM` (carries the `SDIV`
- quotient into the following `MSUB`). `x8` holds the syscall number.
-- **amd64** — `rcx` and `r11` are kernel-clobbered by the `syscall`
- instruction itself; `PROLOGUE` additionally uses `r11` to temporarily
- hold the return address across the `sub rsp, N`. `DIV`/`REM` use both
- `rcx` (to save `rdx` = P1 `r3`) and `r11` (to save `rax` = P1 `r0`)
- so that idiv's implicit writes to rax/rdx stay invisible.
-- **riscv64** — `s3`–`s7` hold `r1`–`r5` across the `SYSCALL` arg
- shuffle; `a7` holds the syscall number.
+- **Branch-target scratch (all arches).** `B`/`BEQ`/`BNE`/`BLT`/`CALL`/
+ `TAIL` jump through a dedicated native reg pre-loaded via `LI_BR`:
+ `x17` (ARM IP1) on aarch64, `r11` on amd64, `t5` on riscv64. The reg
+ is caller-saved natively and never carries a live P1 value past the
+ following branch. Treat it as existing only between the `LI_BR` that
+ loads a target and the branch that consumes it.
+- **aarch64** — `x21`–`x23` hold `r1`–`r3` across the `SYSCALL` arg
+ shuffle (`r4`/`r5` live in callee-saved `x26`/`x27` so the kernel
+ preserves them for us). `x16` (ARM IP0) is scratch for `REM`
+ (carries the `SDIV` quotient into the following `MSUB`). `x8` holds
+ the syscall number.
+- **amd64** — `rcx` and the branch-target `r11` are kernel-clobbered by
+ the `syscall` instruction itself. `PROLOGUE`/`EPILOGUE` use `rcx` to
+ carry the retaddr across the `sub rsp, N` (can't use `r11` here — it
+ is the branch-target reg, and `TAIL` = `EPILOGUE` + `jmp r11`).
+ `DIV`/`REM` use `rcx` (to save `rdx` = P1 `r3`) and `r11` (to save
+ `rax` = P1 `r0`) so that `idiv`'s implicit writes to rax/rdx stay
+ invisible; the `r11` save is fine because no branch op can interrupt
+ the DIV/REM expansion.
+- **riscv64** — `s3`,`s6`,`s7` hold `r1`–`r3` across the `SYSCALL` arg
+ shuffle (`r4`/`r5` live in callee-saved `s4`/`s5`, same trick as
+ aarch64). `a7` holds the syscall number.
All of these are off-limits to hand-written P1 programs and are never
mentioned in P1 source. If you see a register name not in the r0–r7 /
@@ -134,28 +144,26 @@ sp / lr set, it belongs to an op's internal expansion.
P1 has no PC-relative branch immediates (hex2 offers no label-arithmetic
sigil — branch ranges can't be expressed in hex2 source). Every branch,
-conditional or not, compiles through the **r7-indirect** pattern: the
-caller loads the target into `r7` with a wide `LI`, then the branch op
-jumps through `r7`. A conditional like "jump to `fail` if `r1 != r2`" is
-three source lines:
+conditional or not, compiles through the **LI_BR-indirect** pattern: the
+caller loads the target into the dedicated branch-target scratch reg
+with `LI_BR`, then the branch op jumps through it. A conditional like
+"jump to `fail` if `r1 != r2`" is three source lines:
```
-P1_LI_R7
+P1_LI_BR
&fail
-P1_BNE_R1_R2_R7
+P1_BNE_R1_R2
```
-The `_R7` suffix on every branch op is a reminder that `r7` is the
-branch-target register: the value loaded by the preceding `LI_R7` is
-consumed by the branch, and both lines always appear together. `CALL`
-follows the same shape (`LI_R7 &callee ; P1_CALL`).
+`LI_BR` writes a reserved native reg (`x17`/`r11`/`t5` — see Register
+mapping), not a P1 GPR. The branch op that follows consumes it and
+jumps. `CALL` and `TAIL` follow the same shape
+(`LI_BR &callee ; P1_CALL`).
-Because every branch site overwrites `r7`, **`r7` is branch-scratch, not
-a general-purpose callee-saved register** — see the "`r7` is not a
-general-purpose callee-saved register" paragraph above. If you need to
-keep a value live across a branch or call, use `r0`–`r5` (caller-saved,
-lives across un-branched straight-line code) or `r6` (callee-saved,
-lives across calls).
+The branch-target reg is owned by the branch machinery: never carry a
+live value across a branch in it. Since it isn't a P1 reg, this is
+automatic — there's no P1-level way to read or write it outside
+`LI_BR`.
## Instruction set (~30 ops)
@@ -204,13 +212,14 @@ SYSCALL # num in r0, args r1-r6, ret in r0
with tagged-cell pointers only need signed comparisons. Synthesize from
`BLT` via operand-bias if unsigned compare is ever required.
- `BGE rA, rB, %L` is not in the ISA: synthesize as
- `BLT rA, rB, %skip; B %L; :skip` (the r7-indirect branch pattern makes
- the skip cheap). `BLT rB, rA, %L` handles the strict-greater case.
+ `BLT rA, rB, %skip; B %L; :skip` (the LI_BR-indirect branch pattern
+ makes the skip cheap). `BLT rB, rA, %L` handles the strict-greater
+ case.
- Branch offsets are PC-relative. In the v0.1 spike they are realized by
- loading the target address via `LI` into `r7` and jumping through the
- register; range is therefore unbounded within the 4 GiB address space.
- Native-encoded branches (with tighter range limits) are an optional
- future optimization.
+ loading the target address via `LI_BR` into the reserved branch-target
+ reg and jumping through it; range is therefore unbounded within the
+ 4 GiB address space. Native-encoded branches (with tighter range
+ limits) are an optional future optimization.
- `MOV rD, rA` copies `rA` into `rD`. The source may be `sp` (read the
current stack pointer into a GPR — used e.g. for stack-balance assertions
around a call tree). The reverse (`MOV sp, rA`) is not provided; `sp`
@@ -246,8 +255,10 @@ SYSCALL # num in r0, args r1-r6, ret in r0
Per-arch mechanics differ — aarch64/riscv64 `PROLOGUE` subtracts the
frame size from `sp` and stores `lr`/`ra` at `[sp + 0]`; amd64 pops
the retaddr native `call` already pushed into a non-P1 scratch
- (`r11`), subtracts the frame size, then re-pushes it so the final
- layout matches. Access slots via `MOV rX, sp` followed by
+ (`rcx`), subtracts the frame size, then re-pushes it so the final
+ layout matches. (`rcx` rather than `r11`, because `r11` is the
+ branch-target reg and `TAIL` would otherwise clobber its own
+ destination mid-epilogue.) Access slots via `MOV rX, sp` followed by
`LD rY, rX, <off>` / `ST rY, rX, <off>`; `sp` itself isn't a valid
base for `LD`/`ST`.
- `TAIL %label` is a tail call — it performs the current function's
@@ -312,19 +323,21 @@ values through syscalls without per-arch save/restore dances.
The per-arch expansions:
-- **amd64** — P1 args already occupy the native arg regs except for arg6
- (`r9` vs. P1 `r6`/`rbx`). Expansion is `mov r9, rbx ; syscall`. The
- kernel preserves everything except `rax`, `rcx`, `r11`; none of `rcx`
- or `r11` are P1 registers, so the only visible clobber is P1 `r0`.
+- **amd64** — P1 args already occupy the native arg regs except for args
+ 4/5/6. Three shuffle moves cover those: `mov r10, r13` (arg4 = P1 `r4`),
+ `mov r8, r14` (arg5 = P1 `r5`), `mov r9, rbx` (arg6 = P1 `r6`); then
+ `syscall`. The kernel preserves everything except `rax`, `rcx`, `r11`,
+ and `rax` = P1 `r0` is the only visible clobber.
- **aarch64** — native arg regs are `x0`–`x5` but P1 puts args in
- `x1`–`x5`,`x19` (one register higher). The expansion saves P1 `r1`–
- `r5` into `x21`–`x25` (reserved for SYSCALL, see Register mapping),
- shuffles the saved values into `x0`–`x4` plus `x19` into `x5`, moves
- the number into `x8`, `svc #0`s, then restores `r1`–`r5` from the
- `x21`–`x25` saves. Net cost: 12 additional moves vs. a bare
- single-instruction syscall.
-- **riscv64** — same shape as aarch64, with `s3`–`s7` as the save slots
- and `a7` as the number register.
+ `x1`–`x3`,`x26`,`x27`,`x19` (the three caller-saved arg regs one slot
+ higher, plus three callee-saved for `r4`–`r6`). The expansion saves
+ P1 `r1`–`r3` into `x21`–`x23`, shuffles them and `r4`/`r5`/`r6` down
+ into `x0`–`x5`, moves the number into `x8`, `svc #0`s, then restores
+ `r1`–`r3` from `x21`–`x23`. No save/restore of `r4`/`r5` is needed
+ because they live in callee-saved natives that the kernel preserves.
+- **riscv64** — same shape as aarch64, with `s3`/`s6`/`s7` as the `r1`–
+ `r3` save slots, `s4`/`s5` already holding `r4`/`r5`, and `a7` as the
+ number register.
The extra moves on aarch64/riscv64 are a few nanoseconds per syscall.
Trading them for uniform "clobbers `r0` only" semantics is worth it:
@@ -429,12 +442,11 @@ caller.
the inline-data `LI` trick sidesteps them. Order was reversed from
the original plan: aarch64 first (where the trick was designed),
then amd64 and riscv64.
-2. **Broaden the demonstrated op set.** Next. Control flow (`B`, `BEQ`,
- `CALL`, `RET`, `TAIL`), loads/stores (`LD`/`ST`/`LW`/`LB`), and the
- rest of arithmetic/logical/shift/mul-div. All reachable with stock
- hex2; no extensions required. A loop-and-branch demo (e.g. print
- digits 0–9, or sum 1..N) is the natural next program — it forces
- conditional branching through the indirect-via-r7 pattern.
+2. **Broaden the demonstrated op set.** *Done.* `demo.M1` exercises
+ control flow (`B`, `BEQ`, `BNE`, `BLT`, `CALL`, `RET`, `TAIL`),
+ loads/stores (`LD`/`ST`/`LB`/`SB`), and the full
+ arithmetic/logical/shift/mul-div set across tranches 1–5. All
+ reachable with stock hex2; no extensions required.
3. **Generator for the ~30-op × register matrix.** *Done.*
`p1_gen.py` is the single source of truth for all three
`p1_<arch>.M1` defs files. Each row is an `(op, reg-tuple, imm)`
@@ -444,9 +456,11 @@ caller.
4. **Cross-arch differential harness.** Assemble each P1 source three
ways and diff runtime behavior. Currently eyeballed via
`make run-all`.
-5. **Write something real.** Seed Lisp interpreter (cons, car, cdr, eq,
- atom, cond, lambda, quote) in ~500 lines of P1, running identically
- on all three targets.
+5. **Write something real.** *In progress.* `lisp.M1` is the seed Lisp
+ interpreter target (cons, car, cdr, eq, atom, cond, lambda, quote)
+ running identically on all three arches. Step 2 (cons/car/cdr +
+ tagged values) landed; the remaining staged steps live in
+ `LISP.md`.
## Open questions
diff --git a/P1_TODO.md b/P1_TODO.md
@@ -1,251 +0,0 @@
-# P1 — follow-up work from `lisp.M1` step 2
-
-Issues surfaced while bringing up step 2 (tagged values: cons/car/cdr),
-with the decision recorded for each. This is a work list, not a design
-doc; `P1.md` is the source of truth for the ISA itself.
-
-Two of these (issue 1, issue 2) were caught the hard way during step 2:
-the first build passed on aarch64 and failed on amd64/riscv64 because
-`heap_start` happened to land at different offsets from the ELF base on
-each arch, and the second build SIGILL'd because a `P1_LD_R0_R4_0` macro
-was missing from the defs files but M1 silently passed the undefined
-token through as literal text. Both are preventable upstream.
-
-## 1. Undefined `P1_*` tokens pass silently through M1
-
-**Symptom.** In step 2, `lisp.M1` used `P1_LD_R0_R4_0` (offset 0) but the
-defs files only had `P1_LD_R0_R4_8` (offset 8). M1 emitted the literal
-string `P1_LD_R0_R4_0` into the output stream rather than expanding it.
-The resulting binary SIGILL'd on aarch64 and segfaulted on amd64. No
-diagnostic from M1, hex2, or the loader — it took `xxd` + grep to find.
-
-This will keep biting us as we add more `(op, register-tuple)` entries
-to the defs: the combinatorial surface (~1200 DEFINEs per arch) is
-exactly where a missing tuple is easy to miss and hard to notice.
-
-**Options considered.**
-
-1. **Makefile lint.** Before invoking M1, grep the combined source for
- `P1_[A-Z0-9_]*` tokens and assert every one has a matching `DEFINE`
- in the per-arch defs file. Fails the build with the offending token
- name. ~10 lines of shell, runs in milliseconds.
-2. **Patch M1** to error on undefined tokens. Correct fix, but M1 is
- upstream stage0 — we don't want to fork it.
-3. **Exhaustive DEFINE table.** Generate every `(op, rD, rA, rB)` tuple
- so no lookup can miss. Waits on issue 3 (generator).
-
-**Decision — option 1: Makefile lint.** Add a `lint` target to
-`lispcc/Makefile` that runs before each arch's assemble step. Keeps the
-toolchain unchanged; catches the bug at the exact layer where it's
-introduced (M1 input); works today with no generator. Option 3 also
-helps but doesn't cover typos (`P1_LD_R0_R4_0` where the author meant
-`P1_LD_R0_R4_8` would still silently miss if both existed and the wrong
-one was spelled).
-
-## 2. Data alignment is implicit and arch-dependent
-
-**Symptom.** `heap_start` is a label into a zero-padded byte blob.
-Nothing in hex2 or M1 forces the label's file offset (and therefore its
-runtime address under the fixed ELF base) to be 8-byte aligned. Because
-the code size between the ELF header and `heap_start` differs per arch,
-the three targets landed on different alignments:
-
-- aarch64: `0x00600560` → low 3 bits = `000` (aligned)
-- amd64: `0x006002CB` → low 3 bits = `011` (misaligned)
-- riscv64: `0x00600604` → low 3 bits = `100` (misaligned)
-
-Pair pointers carry a 3-bit tag in the low bits, so the allocator MUST
-return pointers with `ptr & 7 == 0`. If `heap_start` itself is
-misaligned, every tagged pair is corrupt.
-
-**Options considered.**
-
-1. **Application-level: align at boot + make alignment an allocator
- contract.** `_start` rounds `heap_next` up with `(x | 7) + 1` before
- any allocation. `alloc()` is documented to require 8-aligned sizes
- and to return 8-aligned pointers. Document both in `P1.md`. This is
- what step 2 already does.
-2. **hex2-level alignment directive.** Add something like `.align 8` to
- hex2 so labels can be padded at assembly time.
-
-Investigated hex2 capabilities (both the High Level Prototype and the
-actual bootstrap AMD64 `hex2_AMD64.M1` in stage0-posix). hex2 supports
-only the pointer sigils `: ! @ $ ~ % &` and the `>` base override — no
-alignment directive, no label arithmetic. Option 2 would require forking
-hex2, which is off the table for the same reason we don't fork M1.
-
-**Decision — option 1: doc + allocator contract.** Add a §"Data
-alignment" subsection to `P1.md`:
-
-- Labels have no inherent alignment. The offset of any label from the
- ELF base is the cumulative code/data size before it, which varies per
- arch.
-- Programs that care about alignment (tagged pointers, cache lines,
- atomic fields — the first being immediate for us) must align at boot
- with `(x | mask) + 1`.
-- Any allocator that returns tagged-pointer-eligible cells MUST return
- pointers aligned to at least the tag width (8 bytes, low-3-bit tag).
-- Caller must pass tag-width-aligned sizes to bump allocators for now.
- When the real allocator lands in step 9, it rounds internally.
-
-## 3. The ~1200-entry DEFINE generator is now a blocker
-
-**Symptom.** Step 2 added ~13 new `(op, register-tuple)` entries per
-arch by hand. Step 3 (strings + symbol interning) and step 4 (reader)
-will need many more — load/store variants across r0–r7 × base-register
-× common offsets, more arithmetic tuples, more branches. Hand-writing
-each tuple × 3 arches is already where most of step 2's wall-clock
-went.
-
-**Options considered.**
-
-1. **Build generator before step 3.** Design a shared op table
- (`op, rD, rA, rB` plus a template for each arch's native encoding),
- emit `p1_<arch>.M1` from it. Pause Lisp work until done.
-2. **Defer until step 4 or 5.** Hand-add entries as each step needs
- them; accept the per-step friction.
-3. **Lazy defs.** Only enumerate tuples actually referenced by the
- program, via a source-scan + per-arch hand-encoder. More complex
- than full generation.
-
-**Decision — option 1: build before step 3.** Deferring pushes the
-problem into a worse spot (each new step adds friction, and the defs
-files diverge across arches in hard-to-review ways). Design the op
-table to be appendable — tranches 1–8 already map cleanly onto rows of
-`(op, rD, rA, rB)` plus per-arch encoding templates. The generator
-unblocks step 3 and is a prerequisite for the ~1200-entry target in
-`P1.md`.
-
-This adds one item to the staged plan between steps 2 and 3.
-
-## 4. Root spill across `CALL` needs more than one scratch slot
-
-**Symptom.** `cons(car, cdr)` must keep both args live across the
-`alloc(16)` call. The P1 ABI gives each frame exactly one callee-private
-scratch cell at `[sp+8]`, which fits one value. Step 2 worked around
-this by parking `cdr` in a BSS cell (`:cons_save_cdr`), a single-threaded
-hack that breaks under the step-9 GC (which needs every live tagged
-value to be on the stack so the mark phase can find it).
-
-The same pattern will recur everywhere: anything multi-arg that calls
-into the allocator (pair constructors, list builders, environment frame
-builders, closure constructors) needs per-function root-spill slots
-proportional to its live set, not a fixed one.
-
-**Options considered.**
-
-1. **Expand `PROLOGUE` frame + per-function spill-slot count.** Bump
- standard frame from 16 B to N B, where each function declares how
- many callee-private cells it needs at `[sp+8]`, `[sp+16]`,
- `[sp+24]`, etc. `PROLOGUE`/`EPILOGUE` take the slot count as an
- immediate. Update all three arch defs files.
-2. **Keep one scratch slot, use BSS for overflow.** What step 2 does.
- Breaks under GC, doesn't scale past single-threaded.
-3. **Caller-managed save area.** Caller subtracts from `sp` itself,
- spills, calls, restores. Works but breaks the "`sp` only mutated by
- `PROLOGUE`/`EPILOGUE`" invariant from `P1.md`.
-
-**Decision — option 1: expand PROLOGUE frame + spill-slot count.**
-Aligns with GC discipline (all roots on stack, walkable), keeps `sp`
-mutation localized to PROLOGUE/EPILOGUE, and scales with live-set size.
-Specifics to pin down when implementing:
-
-- Encoding of the slot count. Probably a suffix on the op name
- (`PROLOGUE_N2`, `PROLOGUE_N3`, …) so each variant is a distinct
- `DEFINE` and the generator can enumerate them the same way it
- enumerates register tuples. Cap at some N (4? 8?) and grow if a
- function ever needs more.
-- Frame size must stay 16-aligned on aarch64 — round up.
-- The existing single-slot `PROLOGUE`/`EPILOGUE` stay as aliases for
- the N=1 case.
-
-Replace `cons_save_cdr` in `lisp.M1` with a second frame slot once the
-N-slot prologue lands.
-
-## 5. r7-indirect branch pattern is verbose but fine
-
-**Symptom.** Every branch or call compiles through three source lines:
-
-```
-P1_LI_R7
-&target
-P1_BNE_R1_R2_R7
-```
-
-`lisp.M1` has 8 such sites and will grow roughly linearly with the
-interpreter's control flow. Readers unfamiliar with P1 don't immediately
-see this is "branch if r1 != r2 to `target`."
-
-A conditional in a high-level language:
-
-```
-if x == 0 goto fail;
-```
-
-lands in P1 as:
-
-```
-P1_LI_R7
-&fail
-P1_BEQ_R1_R0_R7 ## (after loading 0 into r0)
-```
-
-**Options considered.**
-
-1. **Document the pattern only.** Add a §"Reading P1 source" subsection
- to `P1.md` that explains: every branch is
- `LI_R7 &target ; BXX_rA_rB_R7`; r7 is branch-scratch owned by this
- pattern; target load + branch are always together. Once seen, it
- reads fluently.
-2. **Introduce `*_TO` pseudos.** E.g. `P1_BNE_R1_R2_TO &target` expands
- to the 3-op sequence. Cosmetic — saves one source line per site —
- but it ripples through the DEFINE table (each pseudo needs per-arch
- encoding that emits a forward-jump-around-a-literal-pool or similar)
- and hides the r7 clobber, which is load-bearing.
-3. **Change the ISA to have a 32-bit branch-target field.** A real
- native-encoded conditional branch would save the literal and the
- `LI_R7`. Breaks hex2's "no label arithmetic" contract (branch
- offsets are label_A - label_B), and hex2 is not forkable.
-
-**Decision — option 1: document the pattern only.** `*_TO` pseudos
-would require the generator to emit paired (branch, literal-pool-entry)
-per call site, and hiding the r7 clobber invites bugs when someone
-assumes r7 survives a branch. The pattern is learnable; the workaround
-for hex2's missing branch-offset support is exactly what `P1.md` calls
-out ("unbounded range within 4 GiB at the cost of one r7 clobber per
-branch"). Documenting is sufficient.
-
-Add §"Reading P1 source" under the existing §"Register mapping" in
-`P1.md`: one paragraph on the pattern, one on the r7 clobber rule (some
-of this already lives in the "r7 is not a general-purpose callee-saved
-register" paragraph, which should be cross-referenced).
-
-## Ordering
-
-All five land before step 3. Reasons:
-
-- **Issue 4 blocks step 3 directly.** Strings and symbol interning are
- allocator-calling constructors — same shape as `cons`, more live
- values per frame. Deferring issue 4 means reproducing the
- `cons_save_cdr` BSS hack across every step-3 constructor, then
- ripping them all out for step 9 (GC). Cheaper to land N-slot
- PROLOGUE once.
-- **Issue 4 must land with issue 3, not after.** If the generator
- emits the defs table first and N-slot PROLOGUE lands later, the
- generator runs twice and the defs files churn. Do them together.
-- **Issues 1, 2, 5 are cheap and compounding.** Makefile lint (1)
- prevents the exact class of bug that ate an hour in step 2 — every
- step 3 tuple we add is another chance to mistype. Alignment doc (2)
- and r7-pattern doc (5) are one sitting each and make step 3 review
- easier for anyone but the author.
-
-Suggested order within the pre-step-3 block:
-
-1. Issue 1 (Makefile lint) — ~10 lines of shell, unblocks confident
- iteration on the rest.
-2. Issue 5 (`P1.md` §"Reading P1 source") and issue 2
- (`P1.md` §"Data alignment") — doc-only, batch into one commit.
-3. Issue 3 (generator) + issue 4 (N-slot PROLOGUE) — together.
- Design the op table so PROLOGUE/EPILOGUE variants are rows like any
- other `(op, reg-tuple)`. Rip out `cons_save_cdr` from `lisp.M1`
- as the validation case.
diff --git a/demo.M1 b/demo.M1
@@ -22,7 +22,7 @@
## synthesize from LD + mask/shift if needed. Signed imm12 on
## LD/ST is exercised via a NEG8-offset round-trip through
## scratch_mid, mirroring the ADDI NEG1/NEG3 signed-imm test.
-## Tranche 4: r7-indirect branches. B, BEQ, BNE, BLT each tested
+## Tranche 4: LI_BR-indirect branches. B, BEQ, BNE, BLT each tested
## in BOTH taken and fall-through directions so an inverted
## condition encoding doesn't pass. BLT additionally tested with
## a negative operand (-1 < 0 signed) to prove signed semantics.
@@ -31,10 +31,10 @@
## stresses PROLOGUE lr-save; TAIL unwinds the frame. Stack
## balance is additionally verified by snapshotting sp via
## MOV_R6_SP before tranche 5 and comparing after — a TAIL that
-## omits its epilogue leaks fn_parent_tail's 16-byte frame and
-## the delta folds into the accumulator. On amd64 PROLOGUE/
-## EPILOGUE are NOPs so the delta is always 0; the check bites
-## on aarch64/riscv64 where prologue does real sp work.
+## omits its epilogue (or whose epilogue clobbers the branch
+## target) leaks fn_parent_tail's 16-byte frame and the delta
+## folds into the accumulator. All three arches now run real
+## sp-moving PROLOGUE/EPILOGUE, so the check bites everywhere.
##
## Run-and-verify:
## make PROG=demo ARCH=<arch> run && echo "exit=$?"
@@ -160,7 +160,7 @@
'01000000' # r3 = 1
## B — unconditional. Correct: jump to b4_1_ok, skipping clobber.
- P1_LI_R7
+ P1_LI_BR
&b4_1_ok
P1_B
P1_LI_R1
@@ -170,9 +170,9 @@
## BEQ taken: r3=0 so r2==r3.
P1_LI_R3
'00000000'
- P1_LI_R7
+ P1_LI_BR
&b4_2_ok
- P1_BEQ_R2_R3_R7 # 0 == 0, taken
+ P1_BEQ_R2_R3 # 0 == 0, taken
P1_LI_R1
'00000000'
:b4_2_ok
@@ -181,10 +181,10 @@
## BEQ fall-through: 0 != 1, branch must NOT fire. If it
## (incorrectly) fires we jump to b4_3_bad and clobber r1.
- P1_LI_R7
+ P1_LI_BR
&b4_3_bad
- P1_BEQ_R2_R3_R7 # 0 == 1? no, fall through
- P1_LI_R7
+ P1_BEQ_R2_R3 # 0 == 1? no, fall through
+ P1_LI_BR
&b4_3_ok
P1_B
:b4_3_bad
@@ -193,9 +193,9 @@
:b4_3_ok
## BNE taken: 0 != 1.
- P1_LI_R7
+ P1_LI_BR
&b4_4_ok
- P1_BNE_R2_R3_R7 # 0 != 1, taken
+ P1_BNE_R2_R3 # 0 != 1, taken
P1_LI_R1
'00000000'
:b4_4_ok
@@ -203,10 +203,10 @@
## BNE fall-through: r3=0 so r2==r3; branch must NOT fire.
P1_LI_R3
'00000000'
- P1_LI_R7
+ P1_LI_BR
&b4_5_bad
- P1_BNE_R2_R3_R7 # 0 != 0? no, fall through
- P1_LI_R7
+ P1_BNE_R2_R3 # 0 != 0? no, fall through
+ P1_LI_BR
&b4_5_ok
P1_B
:b4_5_bad
@@ -217,9 +217,9 @@
'01000000' # restore r3 = 1
## BLT taken: 0 < 1 (signed).
- P1_LI_R7
+ P1_LI_BR
&b4_6_ok
- P1_BLT_R2_R3_R7 # 0 < 1, taken
+ P1_BLT_R2_R3 # 0 < 1, taken
P1_LI_R1
'00000000'
:b4_6_ok
@@ -229,10 +229,10 @@
'01000000' # r2 = 1
P1_LI_R3
'00000000' # r3 = 0
- P1_LI_R7
+ P1_LI_BR
&b4_7_bad
- P1_BLT_R2_R3_R7 # 1 < 0? no, fall through
- P1_LI_R7
+ P1_BLT_R2_R3 # 1 < 0? no, fall through
+ P1_LI_BR
&b4_7_ok
P1_B
:b4_7_bad
@@ -248,9 +248,9 @@
P1_LI_R4
'00000000'
P1_ADDI_R4_R4_NEG1 # r4 = -1 (sign-extended)
- P1_LI_R7
+ P1_LI_BR
&b4_8_ok
- P1_BLT_R4_R2_R7 # -1 < 0 (signed)? yes, taken
+ P1_BLT_R4_R2 # -1 < 0 (signed)? yes, taken
P1_LI_R1
'00000000'
:b4_8_ok
@@ -269,15 +269,15 @@
## into the accumulator below via SUB r1, r1, delta.
P1_MOV_R6_SP # r6 = sp snapshot (pre-tranche)
- P1_LI_R7
+ P1_LI_BR
&fn_identity
P1_CALL # nested-CALL test: returns r1 unchanged
- P1_LI_R7
+ P1_LI_BR
&fn_parent_tail
P1_CALL # TAIL test: fn_identity RETs to here
- P1_LI_R7
+ P1_LI_BR
&b5_end
P1_B # skip over the inlined function bodies
@@ -288,7 +288,7 @@
:fn_identity
P1_PROLOGUE
- P1_LI_R7
+ P1_LI_BR
&fn_inner
P1_CALL
P1_EPILOGUE
@@ -296,7 +296,7 @@
:fn_parent_tail
P1_PROLOGUE
- P1_LI_R7
+ P1_LI_BR
&fn_identity
P1_TAIL
diff --git a/lisp.M1 b/lisp.M1
@@ -64,7 +64,7 @@
## cons(r1=fixnum(42), r2=nil=0x07).
P1_LI_R2
'07000000' ## r2 = nil
- P1_LI_R7
+ P1_LI_BR
&cons
P1_CALL ## r0 = pair pointer|2
P1_MOV_R6_R0 ## r6 = pair (callee-saved across calls)
@@ -74,24 +74,24 @@
P1_ANDI_R1_R1_7
P1_LI_R2
'02000000'
- P1_LI_R7
+ P1_LI_BR
¬_pair_err
- P1_BNE_R1_R2_R7
+ P1_BNE_R1_R2
## Check cdr(pair) == nil.
P1_MOV_R1_R6
- P1_LI_R7
+ P1_LI_BR
&cdr
P1_CALL ## r0 = cdr
P1_LI_R2
'07000000'
- P1_LI_R7
+ P1_LI_BR
¬_nil_err
- P1_BNE_R0_R2_R7
+ P1_BNE_R0_R2
## x = car(pair); expect tag == fixnum.
P1_MOV_R1_R6
- P1_LI_R7
+ P1_LI_BR
&car
P1_CALL ## r0 = car
P1_MOV_R6_R0 ## r6 = car (callee-saved)
@@ -100,9 +100,9 @@
P1_ANDI_R1_R1_7
P1_LI_R2
'01000000'
- P1_LI_R7
+ P1_LI_BR
¬_fixnum_err
- P1_BNE_R1_R2_R7
+ P1_BNE_R1_R2
## exit(car >> 3) — yields 42.
P1_MOV_R1_R6
@@ -120,27 +120,29 @@
## (ptr | PAIR_TAG). Uses the N=2 prologue — car lives in frame slot 1
## at [sp+8], cdr in slot 2 at [sp+16] — to shepherd both args across
## alloc(). Both roots stay on the walkable stack, which is the
-## discipline the step-9 GC will depend on (P1_TODO.md issue 4).
+## discipline the step-9 mark-sweep GC will depend on.
:cons
P1_PROLOGUE_N2
- ## Spill car to slot 1 [sp+8], cdr to slot 2 [sp+16].
- P1_MOV_R4_SP
- P1_ST_R1_R4_8
- P1_ST_R2_R4_16
+    ## Spill car to slot 1 [sp+8], cdr to slot 2 [sp+16]. r3 is caller-
+    ## saved, so it is free to clobber as the sp-copy base register for
+    ## the slot stores and loads.
+ P1_MOV_R3_SP
+ P1_ST_R1_R3_8
+ P1_ST_R2_R3_16
## alloc(16). 16 is already 8-byte aligned.
P1_LI_R1
'10000000'
- P1_LI_R7
+ P1_LI_BR
&alloc
P1_CALL ## r0 = raw (untagged) pair ptr
## [ptr+0] = car ; [ptr+8] = cdr.
- P1_MOV_R4_SP
- P1_LD_R1_R4_8 ## r1 = car
+ P1_MOV_R3_SP
+ P1_LD_R1_R3_8 ## r1 = car
P1_ST_R1_R0_0
- P1_LD_R2_R4_16 ## r2 = cdr
+ P1_LD_R2_R3_16 ## r2 = cdr
P1_ST_R2_R0_8
## Tag as a pair and return.
@@ -170,26 +172,29 @@
## Bump allocator over [heap_start, heap_end). Caller must pass an
## 8-byte-aligned size (caller-rounds for now). On overflow branches to
## :alloc_oom, which funnels through `error`.
+##
+## Uses only r0-r3 (caller-saved) so it needn't preserve any callee-
+## saved regs across its own body. Keeps the plain PROLOGUE/EPILOGUE
+## just for the lr save — the frame carries no root slots.
:alloc
P1_PROLOGUE
- P1_LI_R4
+ P1_LI_R3
&heap_next
- P1_LD_R3_R4_0 ## r3 = *heap_next (old bump ptr)
+ P1_LD_R0_R3_0 ## r0 = *heap_next (old bump ptr; also return value)
- P1_ADD_R2_R3_R1 ## r2 = new_next = old + size
+ P1_ADD_R2_R0_R1 ## r2 = new_next = old + size
- P1_LI_R5
+ P1_LI_R1
&heap_end
- P1_LD_R0_R5_0 ## r0 = *heap_end
+ P1_LD_R1_R1_0 ## r1 = *heap_end
## heap_end < new_next -> OOM
- P1_LI_R7
+ P1_LI_BR
&alloc_oom
- P1_BLT_R0_R2_R7
+ P1_BLT_R1_R2
- P1_ST_R2_R4_0 ## *heap_next = new_next
- P1_MOV_R0_R3 ## return old bump ptr
+ P1_ST_R2_R3_0 ## *heap_next = new_next
P1_EPILOGUE
P1_RET
@@ -197,7 +202,7 @@
## ---- error(msg_ptr, msg_len) ----------------------------------------
## Writes "error: " + msg + "\n" to fd 2, then exit(1). Never returns.
-## Callers reach this via LI_R7 &error ; P1_B (branch, not call) and
+## Callers reach this via LI_BR &error ; P1_B (branch, not call) and
## arrange r1 = msg_ptr, r2 = msg_len beforehand.
:error
## Spill msg_ptr/msg_len into callee-saved regs across the write
@@ -247,14 +252,14 @@
## ---- Error landing pads ---------------------------------------------
## Plain branch targets (no PROLOGUE, no return). Load (msg_ptr, msg_len)
-## into r1/r2, then r7-indirect branch into `error`. Keep message-length
-## imm32s in sync with the strings below.
+## into r1/r2, then LI_BR-indirect branch into `error`. Keep message-
+## length imm32s in sync with the strings below.
:alloc_oom
P1_LI_R1
&msg_oom
P1_LI_R2
'0E000000' ## strlen("heap exhausted") = 14
- P1_LI_R7
+ P1_LI_BR
&error
P1_B
@@ -263,7 +268,7 @@
&msg_not_pair
P1_LI_R2
'0D000000' ## strlen("expected pair") = 13
- P1_LI_R7
+ P1_LI_BR
&error
P1_B
@@ -272,7 +277,7 @@
&msg_not_nil
P1_LI_R2
'0C000000' ## strlen("expected nil") = 12
- P1_LI_R7
+ P1_LI_BR
&error
P1_B
@@ -281,7 +286,7 @@
&msg_not_fixnum
P1_LI_R2
'0F000000' ## strlen("expected fixnum") = 15
- P1_LI_R7
+ P1_LI_BR
&error
P1_B
diff --git a/p1_aarch64.M1 b/p1_aarch64.M1
@@ -10,13 +10,14 @@ DEFINE P1_LI_R0 4000001802000014
DEFINE P1_LI_R1 4100001802000014
DEFINE P1_LI_R2 4200001802000014
DEFINE P1_LI_R3 4300001802000014
-DEFINE P1_LI_R4 4400001802000014
-DEFINE P1_LI_R5 4500001802000014
+DEFINE P1_LI_R4 5A00001802000014
+DEFINE P1_LI_R5 5B00001802000014
DEFINE P1_LI_R6 5300001802000014
DEFINE P1_LI_R7 5400001802000014
+DEFINE P1_LI_BR 5100001802000014
## ---- SYSCALL / SYSOPEN — uniform (clobbers r0 only) across arches
-DEFINE P1_SYSCALL E80300AAF50301AAF60302AAF70303AAF80304AAF90305AAE00315AAE10316AAE20317AAE30318AAE40319AAE50313AA010000D4E10315AAE20316AAE30317AAE40318AAE50319AA
+DEFINE P1_SYSCALL E80300AAF50301AAF60302AAF70303AAE00315AAE10316AAE20317AAE3031AAAE4031BAAE50313AA010000D4E10315AAE20316AAE30317AA
DEFINE P1_SYSOPEN 600C8092080780D2010000D4
## ---- Linux syscall numbers (per-arch table). LE-32 immediate operands for LI.
@@ -27,31 +28,31 @@ DEFINE SYS_CLOSE 39000000
## ---- Reg-reg-reg arithmetic (tranche 1) --------------------------
DEFINE P1_ADD_R1_R1_R2 2100028B
-DEFINE P1_ADD_R1_R1_R4 2100048B
+DEFINE P1_ADD_R1_R1_R4 21001A8B
DEFINE P1_ADD_R2_R2_R6 4200138B
DEFINE P1_ADD_R2_R3_R1 6200018B
DEFINE P1_SUB_R1_R1_R2 210002CB
DEFINE P1_SUB_R2_R2_R6 420013CB
-DEFINE P1_AND_R1_R1_R5 2100058A
+DEFINE P1_AND_R1_R1_R5 21001B8A
DEFINE P1_OR_R1_R1_R2 210002AA
DEFINE P1_XOR_R1_R1_R2 210002CA
DEFINE P1_MUL_R1_R1_R2 217C029B
DEFINE P1_DIV_R1_R1_R2 210CC29A
-DEFINE P1_REM_R1_R1_R5 300CC59A0106059B
+DEFINE P1_REM_R1_R1_R5 300CDB9A01061B9B
DEFINE P1_SHL_R1_R1_R2 2120C29A
DEFINE P1_SHR_R1_R1_R2 2124C29A
-DEFINE P1_SAR_R4_R4_R2 8428C29A
+DEFINE P1_SAR_R4_R4_R2 5A2BC29A
## ---- Immediate arithmetic (tranche 2) ----------------------------
DEFINE P1_ADDI_R1_R1_3 210C0091
DEFINE P1_ADDI_R1_R1_1 21040091
DEFINE P1_ADDI_R1_R1_NEG3 210C00D1
-DEFINE P1_ADDI_R4_R4_NEG1 840400D1
+DEFINE P1_ADDI_R4_R4_NEG1 5A0700D1
DEFINE P1_ADDI_R1_R1_NEG2 210800D1
DEFINE P1_ADDI_R0_R0_1 00040091
DEFINE P1_SHLI_R1_R1_1 21F87FD3
DEFINE P1_SHRI_R1_R1_1 21FC41D3
-DEFINE P1_SARI_R4_R4_1 84FC4193
+DEFINE P1_SARI_R4_R4_1 5AFF4193
DEFINE P1_ANDI_R1_R1_6 21047F92
DEFINE P1_ANDI_R1_R1_7 21084092
DEFINE P1_ORI_R1_R1_1 210040B2
@@ -59,38 +60,38 @@ DEFINE P1_ORI_R0_R0_2 00007FB2
DEFINE P1_ORI_R0_R0_7 000840B2
## ---- LA + memory ops (tranche 3) ---------------------------------
-DEFINE P1_LA_R4 4400001802000014
-DEFINE P1_ST_R1_R4_0 810000F9
-DEFINE P1_LD_R1_R4_0 810040F9
-DEFINE P1_ST_R1_R4_8 810400F9
-DEFINE P1_LD_R1_R4_8 810440F9
-DEFINE P1_SB_R1_R4_16 81400039
-DEFINE P1_LB_R1_R4_16 81404039
-DEFINE P1_ST_R1_R4_NEG8 81801FF8
-DEFINE P1_LD_R1_R4_NEG8 81805FF8
+DEFINE P1_LA_R4 5A00001802000014
+DEFINE P1_ST_R1_R4_0 410300F9
+DEFINE P1_LD_R1_R4_0 410340F9
+DEFINE P1_ST_R1_R4_8 410700F9
+DEFINE P1_LD_R1_R4_8 410740F9
+DEFINE P1_SB_R1_R4_16 41430039
+DEFINE P1_LB_R1_R4_16 41434039
+DEFINE P1_ST_R1_R4_NEG8 41831FF8
+DEFINE P1_LD_R1_R4_NEG8 41835FF8
-## ---- Branches (tranche 4, r7-indirect) ---------------------------
-DEFINE P1_B 80021FD6
-DEFINE P1_BEQ_R2_R3_R7 5F0003EB4100005480021FD6
-DEFINE P1_BNE_R2_R3_R7 5F0003EB4000005480021FD6
-DEFINE P1_BLT_R2_R3_R7 5F0003EB4A00005480021FD6
-DEFINE P1_BLT_R4_R2_R7 9F0002EB4A00005480021FD6
+## ---- Branches (tranche 4, LI_BR-indirect) ------------------------
+DEFINE P1_B 20021FD6
+DEFINE P1_BEQ_R2_R3 5F0003EB4100005420021FD6
+DEFINE P1_BNE_R2_R3 5F0003EB4000005420021FD6
+DEFINE P1_BLT_R2_R3 5F0003EB4A00005420021FD6
+DEFINE P1_BLT_R4_R2 5F0302EB4A00005420021FD6
## ---- Control: CALL/RET + single-slot and N-slot PROLOGUE/EPILOGUE/TAIL
DEFINE P1_PROLOGUE FF4300D1FE0300F9
DEFINE P1_EPILOGUE FE0340F9FF430091
DEFINE P1_RET C0035FD6
-DEFINE P1_CALL 80023FD6
-DEFINE P1_TAIL FE0340F9FF43009180021FD6
+DEFINE P1_CALL 20023FD6
+DEFINE P1_TAIL FE0340F9FF43009120021FD6
DEFINE P1_PROLOGUE_N2 FF8300D1FE0300F9
DEFINE P1_EPILOGUE_N2 FE0340F9FF830091
-DEFINE P1_TAIL_N2 FE0340F9FF83009180021FD6
+DEFINE P1_TAIL_N2 FE0340F9FF83009120021FD6
DEFINE P1_PROLOGUE_N3 FF8300D1FE0300F9
DEFINE P1_EPILOGUE_N3 FE0340F9FF830091
-DEFINE P1_TAIL_N3 FE0340F9FF83009180021FD6
+DEFINE P1_TAIL_N3 FE0340F9FF83009120021FD6
DEFINE P1_PROLOGUE_N4 FFC300D1FE0300F9
DEFINE P1_EPILOGUE_N4 FE0340F9FFC30091
-DEFINE P1_TAIL_N4 FE0340F9FFC3009180021FD6
+DEFINE P1_TAIL_N4 FE0340F9FFC3009120021FD6
## ---- Seed-Lisp step 1 extensions (tranche 6) ---------------------
DEFINE P1_MOV_R1_R6 E10313AA
@@ -102,28 +103,38 @@ DEFINE P1_MOV_R7_R2 F40302AA
DEFINE P1_MOV_R2_R6 E20313AA
DEFINE P1_MOV_R3_R7 E30314AA
DEFINE P1_MOV_R2_R7 E20314AA
-DEFINE P1_MOV_R4_R7 E40314AA
+DEFINE P1_MOV_R4_R7 FA0314AA
DEFINE P1_MOV_R2_SP E2030091
-DEFINE P1_MOV_R4_SP E4030091
+DEFINE P1_MOV_R3_SP E3030091
+DEFINE P1_MOV_R4_SP FA030091
DEFINE P1_MOV_R6_SP F3030091
DEFINE P1_MOV_R2_R0 E20300AA
DEFINE P1_LD_R0_R6_0 600240F9
DEFINE P1_LD_R1_R6_16 610A40F9
-DEFINE P1_LD_R3_R4_0 830040F9
-DEFINE P1_LD_R0_R5_0 A00040F9
-DEFINE P1_LB_R1_R4_0 81004039
-DEFINE P1_ST_R2_R4_0 820000F9
-DEFINE P1_ST_R0_R4_8 800400F9
-DEFINE P1_LD_R0_R4_8 800440F9
+DEFINE P1_LD_R3_R4_0 430340F9
+DEFINE P1_LD_R0_R5_0 600340F9
+DEFINE P1_LB_R1_R4_0 41034039
+DEFINE P1_ST_R2_R4_0 420300F9
+DEFINE P1_ST_R0_R4_8 400700F9
+DEFINE P1_LD_R0_R4_8 400740F9
DEFINE P1_LB_R1_R0_0 01004039
DEFINE P1_LD_R0_R1_0 200040F9
DEFINE P1_LD_R0_R1_8 200440F9
DEFINE P1_ST_R1_R0_0 010000F9
-DEFINE P1_LD_R2_R4_0 820040F9
+DEFINE P1_LD_R2_R4_0 420340F9
DEFINE P1_ST_R2_R0_8 020400F9
-DEFINE P1_LD_R0_R4_0 800040F9
-DEFINE P1_ST_R2_R4_16 820800F9
-DEFINE P1_LD_R2_R4_16 820840F9
-DEFINE P1_BLT_R0_R2_R7 1F0002EB4A00005480021FD6
-DEFINE P1_BNE_R1_R2_R7 3F0002EB4000005480021FD6
-DEFINE P1_BNE_R0_R2_R7 1F0002EB4000005480021FD6
+DEFINE P1_LD_R0_R4_0 400340F9
+DEFINE P1_ST_R2_R4_16 420B00F9
+DEFINE P1_LD_R2_R4_16 420B40F9
+DEFINE P1_LD_R0_R3_0 600040F9
+DEFINE P1_ST_R2_R3_0 620000F9
+DEFINE P1_ST_R1_R3_8 610400F9
+DEFINE P1_LD_R1_R3_8 610440F9
+DEFINE P1_ST_R2_R3_16 620800F9
+DEFINE P1_LD_R2_R3_16 620840F9
+DEFINE P1_LD_R1_R1_0 210040F9
+DEFINE P1_ADD_R2_R0_R1 0200018B
+DEFINE P1_BLT_R0_R2 1F0002EB4A00005420021FD6
+DEFINE P1_BLT_R1_R2 3F0002EB4A00005420021FD6
+DEFINE P1_BNE_R1_R2 3F0002EB4000005420021FD6
+DEFINE P1_BNE_R0_R2 1F0002EB4000005420021FD6
diff --git a/p1_amd64.M1 b/p1_amd64.M1
@@ -10,13 +10,14 @@ DEFINE P1_LI_R0 B8
DEFINE P1_LI_R1 BF
DEFINE P1_LI_R2 BE
DEFINE P1_LI_R3 BA
-DEFINE P1_LI_R4 41BA
-DEFINE P1_LI_R5 41B8
+DEFINE P1_LI_R4 41BD
+DEFINE P1_LI_R5 41BE
DEFINE P1_LI_R6 BB
DEFINE P1_LI_R7 41BC
+DEFINE P1_LI_BR 41BB
## ---- SYSCALL / SYSOPEN — uniform (clobbers r0 only) across arches
-DEFINE P1_SYSCALL 4989D90F05
+DEFINE P1_SYSCALL 4D89EA4D89F04989D90F05
DEFINE P1_SYSOPEN B8020000000F05
## ---- Linux syscall numbers (per-arch table). LE-32 immediate operands for LI.
@@ -27,31 +28,31 @@ DEFINE SYS_CLOSE 03000000
## ---- Reg-reg-reg arithmetic (tranche 1) --------------------------
DEFINE P1_ADD_R1_R1_R2 4889FF4801F7
-DEFINE P1_ADD_R1_R1_R4 4889FF4C01D7
+DEFINE P1_ADD_R1_R1_R4 4889FF4C01EF
DEFINE P1_ADD_R2_R2_R6 4889F64801DE
DEFINE P1_ADD_R2_R3_R1 4889D64801FE
DEFINE P1_SUB_R1_R1_R2 4889FF4829F7
DEFINE P1_SUB_R2_R2_R6 4889F64829DE
-DEFINE P1_AND_R1_R1_R5 4889FF4C21C7
+DEFINE P1_AND_R1_R1_R5 4889FF4C21F7
DEFINE P1_OR_R1_R1_R2 4889FF4809F7
DEFINE P1_XOR_R1_R1_R2 4889FF4831F7
DEFINE P1_MUL_R1_R1_R2 4889FF480FAFFE
DEFINE P1_DIV_R1_R1_R2 4989C34889D14889F8489948F7FE4889C74889CA4C89D8
-DEFINE P1_REM_R1_R1_R5 4989C34889D14889F8489949F7F84889D74889CA4C89D8
+DEFINE P1_REM_R1_R1_R5 4989C34889D14889F8489949F7FE4889D74889CA4C89D8
DEFINE P1_SHL_R1_R1_R2 4889FF4889F148D3E7
DEFINE P1_SHR_R1_R1_R2 4889FF4889F148D3EF
-DEFINE P1_SAR_R4_R4_R2 4D89D24889F149D3FA
+DEFINE P1_SAR_R4_R4_R2 4D89ED4889F149D3FD
## ---- Immediate arithmetic (tranche 2) ----------------------------
DEFINE P1_ADDI_R1_R1_3 4889FF4883C703
DEFINE P1_ADDI_R1_R1_1 4889FF4883C701
DEFINE P1_ADDI_R1_R1_NEG3 4889FF4883C7FD
-DEFINE P1_ADDI_R4_R4_NEG1 4D89D24983C2FF
+DEFINE P1_ADDI_R4_R4_NEG1 4D89ED4983C5FF
DEFINE P1_ADDI_R1_R1_NEG2 4889FF4883C7FE
DEFINE P1_ADDI_R0_R0_1 4889C04883C001
DEFINE P1_SHLI_R1_R1_1 4889FF48C1E701
DEFINE P1_SHRI_R1_R1_1 4889FF48C1EF01
-DEFINE P1_SARI_R4_R4_1 4D89D249C1FA01
+DEFINE P1_SARI_R4_R4_1 4D89ED49C1FD01
DEFINE P1_ANDI_R1_R1_6 4889FF4883E706
DEFINE P1_ANDI_R1_R1_7 4889FF4883E707
DEFINE P1_ORI_R1_R1_1 4889FF4883CF01
@@ -59,38 +60,38 @@ DEFINE P1_ORI_R0_R0_2 4889C04883C802
DEFINE P1_ORI_R0_R0_7 4889C04883C807
## ---- LA + memory ops (tranche 3) ---------------------------------
-DEFINE P1_LA_R4 41BA
-DEFINE P1_ST_R1_R4_0 49897A00
-DEFINE P1_LD_R1_R4_0 498B7A00
-DEFINE P1_ST_R1_R4_8 49897A08
-DEFINE P1_LD_R1_R4_8 498B7A08
-DEFINE P1_SB_R1_R4_16 49887A10
-DEFINE P1_LB_R1_R4_16 490FB67A10
-DEFINE P1_ST_R1_R4_NEG8 49897AF8
-DEFINE P1_LD_R1_R4_NEG8 498B7AF8
+DEFINE P1_LA_R4 41BD
+DEFINE P1_ST_R1_R4_0 49897D00
+DEFINE P1_LD_R1_R4_0 498B7D00
+DEFINE P1_ST_R1_R4_8 49897D08
+DEFINE P1_LD_R1_R4_8 498B7D08
+DEFINE P1_SB_R1_R4_16 49887D10
+DEFINE P1_LB_R1_R4_16 490FB67D10
+DEFINE P1_ST_R1_R4_NEG8 49897DF8
+DEFINE P1_LD_R1_R4_NEG8 498B7DF8
-## ---- Branches (tranche 4, r7-indirect) ---------------------------
-DEFINE P1_B 41FFE4
-DEFINE P1_BEQ_R2_R3_R7 4839D6750341FFE4
-DEFINE P1_BNE_R2_R3_R7 4839D6740341FFE4
-DEFINE P1_BLT_R2_R3_R7 4839D67D0341FFE4
-DEFINE P1_BLT_R4_R2_R7 4939F27D0341FFE4
+## ---- Branches (tranche 4, LI_BR-indirect) ------------------------
+DEFINE P1_B 41FFE3
+DEFINE P1_BEQ_R2_R3 4839D6750341FFE3
+DEFINE P1_BNE_R2_R3 4839D6740341FFE3
+DEFINE P1_BLT_R2_R3 4839D67D0341FFE3
+DEFINE P1_BLT_R4_R2 4939F57D0341FFE3
## ---- Control: CALL/RET + single-slot and N-slot PROLOGUE/EPILOGUE/TAIL
-DEFINE P1_PROLOGUE 415B4883EC104153
-DEFINE P1_EPILOGUE 415B4883C4104153
+DEFINE P1_PROLOGUE 594883EC1051
+DEFINE P1_EPILOGUE 594883C41051
DEFINE P1_RET C3
-DEFINE P1_CALL 41FFD4
-DEFINE P1_TAIL 415B4883C410415341FFE4
-DEFINE P1_PROLOGUE_N2 415B4883EC204153
-DEFINE P1_EPILOGUE_N2 415B4883C4204153
-DEFINE P1_TAIL_N2 415B4883C420415341FFE4
-DEFINE P1_PROLOGUE_N3 415B4883EC204153
-DEFINE P1_EPILOGUE_N3 415B4883C4204153
-DEFINE P1_TAIL_N3 415B4883C420415341FFE4
-DEFINE P1_PROLOGUE_N4 415B4883EC304153
-DEFINE P1_EPILOGUE_N4 415B4883C4304153
-DEFINE P1_TAIL_N4 415B4883C430415341FFE4
+DEFINE P1_CALL 41FFD3
+DEFINE P1_TAIL 594883C4105141FFE3
+DEFINE P1_PROLOGUE_N2 594883EC2051
+DEFINE P1_EPILOGUE_N2 594883C42051
+DEFINE P1_TAIL_N2 594883C4205141FFE3
+DEFINE P1_PROLOGUE_N3 594883EC2051
+DEFINE P1_EPILOGUE_N3 594883C42051
+DEFINE P1_TAIL_N3 594883C4205141FFE3
+DEFINE P1_PROLOGUE_N4 594883EC3051
+DEFINE P1_EPILOGUE_N4 594883C43051
+DEFINE P1_TAIL_N4 594883C4305141FFE3
## ---- Seed-Lisp step 1 extensions (tranche 6) ---------------------
DEFINE P1_MOV_R1_R6 4889DF
@@ -102,28 +103,38 @@ DEFINE P1_MOV_R7_R2 4989F4
DEFINE P1_MOV_R2_R6 4889DE
DEFINE P1_MOV_R3_R7 4C89E2
DEFINE P1_MOV_R2_R7 4C89E6
-DEFINE P1_MOV_R4_R7 4D89E2
+DEFINE P1_MOV_R4_R7 4D89E5
DEFINE P1_MOV_R2_SP 4889E6
-DEFINE P1_MOV_R4_SP 4989E2
+DEFINE P1_MOV_R3_SP 4889E2
+DEFINE P1_MOV_R4_SP 4989E5
DEFINE P1_MOV_R6_SP 4889E3
DEFINE P1_MOV_R2_R0 4889C6
DEFINE P1_LD_R0_R6_0 488B4300
DEFINE P1_LD_R1_R6_16 488B7B10
-DEFINE P1_LD_R3_R4_0 498B5200
-DEFINE P1_LD_R0_R5_0 498B4000
-DEFINE P1_LB_R1_R4_0 490FB67A00
-DEFINE P1_ST_R2_R4_0 49897200
-DEFINE P1_ST_R0_R4_8 49894208
-DEFINE P1_LD_R0_R4_8 498B4208
+DEFINE P1_LD_R3_R4_0 498B5500
+DEFINE P1_LD_R0_R5_0 498B4600
+DEFINE P1_LB_R1_R4_0 490FB67D00
+DEFINE P1_ST_R2_R4_0 49897500
+DEFINE P1_ST_R0_R4_8 49894508
+DEFINE P1_LD_R0_R4_8 498B4508
DEFINE P1_LB_R1_R0_0 480FB67800
DEFINE P1_LD_R0_R1_0 488B4700
DEFINE P1_LD_R0_R1_8 488B4708
DEFINE P1_ST_R1_R0_0 48897800
-DEFINE P1_LD_R2_R4_0 498B7200
+DEFINE P1_LD_R2_R4_0 498B7500
DEFINE P1_ST_R2_R0_8 48897008
-DEFINE P1_LD_R0_R4_0 498B4200
-DEFINE P1_ST_R2_R4_16 49897210
-DEFINE P1_LD_R2_R4_16 498B7210
-DEFINE P1_BLT_R0_R2_R7 4839F07D0341FFE4
-DEFINE P1_BNE_R1_R2_R7 4839F7740341FFE4
-DEFINE P1_BNE_R0_R2_R7 4839F0740341FFE4
+DEFINE P1_LD_R0_R4_0 498B4500
+DEFINE P1_ST_R2_R4_16 49897510
+DEFINE P1_LD_R2_R4_16 498B7510
+DEFINE P1_LD_R0_R3_0 488B4200
+DEFINE P1_ST_R2_R3_0 48897200
+DEFINE P1_ST_R1_R3_8 48897A08
+DEFINE P1_LD_R1_R3_8 488B7A08
+DEFINE P1_ST_R2_R3_16 48897210
+DEFINE P1_LD_R2_R3_16 488B7210
+DEFINE P1_LD_R1_R1_0 488B7F00
+DEFINE P1_ADD_R2_R0_R1 4889C64801FE
+DEFINE P1_BLT_R0_R2 4839F07D0341FFE3
+DEFINE P1_BLT_R1_R2 4839F77D0341FFE3
+DEFINE P1_BNE_R1_R2 4839F7740341FFE3
+DEFINE P1_BNE_R0_R2 4839F0740341FFE3
diff --git a/p1_gen.py b/p1_gen.py
@@ -7,9 +7,9 @@ rewrites all three in place.
Why a generator: the hand-written defs diverge across arches in
hard-to-review ways, typos silently produce SIGILL'ing binaries (M1
-passes undefined tokens through as literal text — see P1_TODO.md issue
-1), and the combinatorial surface (~1200 DEFINEs per arch if fully
-enumerated) is past the point of hand-maintainability.
+passes undefined tokens through as literal text — see `lint.sh` for
+the pre-assemble check), and the combinatorial surface (~1200 DEFINEs
+per arch if fully enumerated) is past the point of hand-maintainability.
Design:
* OPS is a list of emission rows. Each row is a small class whose
@@ -39,10 +39,15 @@ ARCHES = ('aarch64', 'amd64', 'riscv64')
## the per-arch encoders insert into instruction fields; the human-facing
## names (rax, x1, a2, …) never appear in this file.
+## 4:4 caller/callee-saved split. r0–r3 caller (native argregs); r4–r7
+## callee (native callee-saved). `br` is the hidden branch-target scratch
+## (not a P1 reg) — picked so every op's expansion clobbers only what its
+## name declares.
NAT_AA64 = {'r0': 0, 'r1': 1, 'r2': 2, 'r3': 3,
- 'r4': 4, 'r5': 5, 'r6': 19, 'r7': 20,
+ 'r4': 26, 'r5': 27, 'r6': 19, 'r7': 20,
+ 'br': 17, # x17 (IP1, caller-saved linker scratch)
'sp': 31, 'xzr': 31, 'lr': 30,
- 'x21': 21, 'x22': 22, 'x23': 23, 'x24': 24, 'x25': 25, 'x8': 8}
+ 'x21': 21, 'x22': 22, 'x23': 23, 'x8': 8}
## amd64 ModRM.reg/rm + REX.R/B bit: native regnums 0..15 with r8..r15
## setting the REX bit. We store the 4-bit native number directly.
@@ -50,20 +55,24 @@ NAT_AMD64 = {'r0': 0, # rax
'r1': 7, # rdi
'r2': 6, # rsi
'r3': 2, # rdx
- 'r4': 10, # r10
- 'r5': 8, # r8
+ 'r4': 13, # r13 (callee-saved)
+ 'r5': 14, # r14 (callee-saved)
'r6': 3, # rbx
'r7': 12, # r12
+ 'br': 11, # r11 — branch/call target scratch + DIV/REM r0 save
'sp': 4, # rsp
- 'rcx': 1, # shift-count scratch (not a P1 reg)
- 'r9': 9, # syscall arg6 slot
- 'r11': 11, # call-retaddr scratch in PROLOGUE
+ 'rcx': 1, # shift-count scratch + DIV/REM rdx save (not a P1 reg)
+ 'r10': 10, # syscall arg4 slot (not a P1 reg)
+ 'r8': 8, # syscall arg5 slot (not a P1 reg)
+ 'r9': 9, # syscall arg6 slot (not a P1 reg)
+ 'r11': 11, # alias for br (some expansions spell it r11 directly)
}
NAT_RV64 = {'r0': 10, 'r1': 11, 'r2': 12, 'r3': 13,
- 'r4': 14, 'r5': 15, 'r6': 9, 'r7': 18,
+ 'r4': 20, 'r5': 21, 'r6': 9, 'r7': 18,
+ 'br': 30, # t5 (caller-saved temp)
'sp': 2, 'ra': 1, 'zero': 0, 'a7': 17,
- 's3': 19, 's4': 20, 's5': 21, 's6': 22, 's7': 23}
+ 's3': 19, 's6': 22, 's7': 23}
## ---------- Low-level encoding helpers -----------------------------------
@@ -521,19 +530,16 @@ class Mem(Op):
if self.op == 'SB':
return rv_s(0x00000023, self.rT, self.rN, self.off)
-## --- Branches: r7-indirect pattern ---
+## --- Branches: LI_BR-indirect pattern ---
@dataclass
class B(Op):
def encode(self, arch):
if arch == 'aarch64':
- # BR x20 — 0xD61F0280
- return le32(0xD61F0280)
+ return le32(0xD61F0000 | (NAT_AA64['br'] << 5)) # BR x17
if arch == 'amd64':
- # jmp r12 — 41 FF E4
- return '41FFE4'
+ return '41FFE3' # jmp r11
if arch == 'riscv64':
- # jalr x0, 0(s2) — opcode 67 00 09 00
- return le32(0x00090067)
+ return le32(0x00000067 | (NAT_RV64['br'] << 15)) # jalr x0, 0(t5)
@dataclass
class CondB(Op):
@@ -550,37 +556,25 @@ class CondB(Op):
# Skip when NOT cond holds. BEQ→NE(1), BNE→EQ(0), BLT→GE(A).
cond = {'BEQ': 1, 'BNE': 0, 'BLT': 10}[self.op]
bcond = le32(0x54000040 | cond)
- br_x20 = le32(0xD61F0280)
- return cmp + bcond + br_x20
+ br = le32(0xD61F0000 | (NAT_AA64['br'] << 5))
+ return cmp + bcond + br
if arch == 'amd64':
a, b = NAT_AMD64[self.rA], NAT_AMD64[self.rB]
# cmp rA, rB — opcode 39 /r with rA as r/m
cmp_ = rex(1, b >> 3, 0, a >> 3) + '39' + modrm(3, b, a)
- # jcc rel8 opcode, skip=3 (past jmp r12):
+ # jcc rel8 opcode, skip=3 (past jmp r11):
# BEQ→JNE 75 03 ; BNE→JE 74 03 ; BLT→JGE 7D 03
jop = {'BEQ': '75', 'BNE': '74', 'BLT': '7D'}[self.op]
- jmp_r12 = '41FFE4'
- return cmp_ + jop + '03' + jmp_r12
+ return cmp_ + jop + '03' + '41FFE3' # jmp r11
if arch == 'riscv64':
- # B<inv> rA, rB, +8 ; jalr x0, 0(s2)
+ # B<inv> rA, rB, +8 ; jalr x0, 0(t5)
a, b = NAT_RV64[self.rA], NAT_RV64[self.rB]
- # B-type imm=8 encoding: imm[12|10:5] = 0, imm[4:1|11] = 0b0100_0.
- # That is: [31]=0 [30:25]=0 [11:8]=0b0100 [7]=0 → imm[11:7] = 0b01000, imm[31:25]=0.
- # Bytes: xx xx x? 00 with byte[1]_upper=0x4 for the rd-field=8.
# funct3 picks the op: BEQ→BNE(1), BNE→BEQ(0), BLT→BGE(5).
funct3 = {'BEQ': 1, 'BNE': 0, 'BLT': 5}[self.op]
- # Build B-type: opcode=0x63, funct3, rs1=rA, rs2=rB, imm=+8.
- # imm encoding for +8:
- # bit 12 = 0
- # bits 10:5 = 000000
- # bits 4:1 = 0100
- # bit 11 = 0
- insn = 0x00000063 | (funct3 << 12) | (a << 15) | (b << 20)
- # imm bits: [11:7] carries imm[4:1,11]; [31:25] carries imm[12,10:5]
- # For +8: imm[4:1]=0b0100 → [10:7] = 0b1000 → byte pos [11:7] = 0b01000 = 8
- insn |= (8 << 7) # bits [11:8] = 0b0100, bit 7=0
- jalr_x0_s2 = 0x00090067
- return le32(insn) + le32(jalr_x0_s2)
+            # B-type with imm=+8: instruction bits [11:8] carry imm[4:1] = 0b0100.
+ insn = 0x00000063 | (funct3 << 12) | (a << 15) | (b << 20) | (8 << 7)
+ jalr = 0x00000067 | (NAT_RV64['br'] << 15) # jalr x0, 0(t5)
+ return le32(insn) + le32(jalr)
## --- Simple singletons ---
@@ -622,11 +616,16 @@ class Prologue(Op):
str_lr = aa_ldst_uimm12(0xF9000000, 'lr', 'sp', 0, 3)
return sub + str_lr
if arch == 'amd64':
- # pop r11 ; sub rsp,fb ; push r11
- # pop r11 = 41 5B ; sub rsp,imm8 = 48 83 EC ib (if fb<=127)
- # push r11 = 41 53
+ # pop rcx ; sub rsp,fb ; push rcx
+ # pop rcx = 59 ; sub rsp,imm8 = 48 83 EC ib (if fb<=127)
+ # push rcx = 51
+ # rcx is the retaddr-carry scratch. r11 (= P1 'br') is off-limits
+ # because TAIL = EPILOGUE + `jmp r11` and using r11 here would
+ # clobber the LI_BR-loaded tail target mid-EPILOGUE. rcx is
+ # already the shift-count + DIV/REM save scratch — caller-save,
+ # never a P1 reg.
assert fb <= 127
- return '415B' + '4883EC' + byte(fb) + '4153'
+ return '59' + '4883EC' + byte(fb) + '51'
if arch == 'riscv64':
# addi sp, sp, -fb ; sd ra, 0(sp)
sub = rv_i(0x00000013, 'sp', 'sp', -fb)
@@ -644,8 +643,9 @@ class Epilogue(Op):
add = aa_add_imm('sp', 'sp', fb, sub=False)
return ldr_lr + add
if arch == 'amd64':
+ # Mirror of Prologue — pop rcx ; add rsp,fb ; push rcx.
assert fb <= 127
- return '415B' + '4883C4' + byte(fb) + '4153'
+ return '59' + '4883C4' + byte(fb) + '51'
if arch == 'riscv64':
ld = rv_i(0x00003003, 'ra', 'sp', 0)
add = rv_i(0x00000013, 'sp', 'sp', fb)
@@ -656,7 +656,7 @@ class Tail(Op):
k: int = 1
def encode(self, arch):
- # Epilogue + unconditional B (= jump through r7)
+ # Epilogue + unconditional B (= jump through the branch-scratch reg)
epi = Epilogue(name='', k=self.k).encode(arch)
br = B(name='').encode(arch)
return epi + br
@@ -665,11 +665,11 @@ class Tail(Op):
class Call(Op):
def encode(self, arch):
if arch == 'aarch64':
- return le32(0xD63F0280) # BLR x20
+ return le32(0xD63F0000 | (NAT_AA64['br'] << 5)) # BLR x17
if arch == 'amd64':
- return '41FFD4' # call r12
+ return '41FFD3' # call r11
if arch == 'riscv64':
- return le32(0x000900E7) # jalr ra, 0(s2)
+ return le32(0x000000E7 | (NAT_RV64['br'] << 15)) # jalr ra, 0(t5)
@dataclass
class Ret(Op):
@@ -684,50 +684,49 @@ class Ret(Op):
## --- SYSCALL / SYSOPEN ---
SYSCALL_HEX = {
'aarch64': (
- # mov x8,x0 ; save r1..r5 into x21..x25 ; shuffle into x0..x4 ;
- # mov x5,x19 ; svc #0 ; restore r1..r5.
+        # r4/r5 now live in x26/x27, outside the x0-x5 argument window,
+        # so the shuffle never touches them — no save/restore needed.
+        # Only r1/r2/r3 (in caller-saved x1/x2/x3) must be stashed
+        # across the shuffle.
'' .join([
- le32(0xAA0003E8), # mov x8, x0 (ORR x8, xzr, x0)
- le32(0xAA0103F5), # mov x21, x1
- le32(0xAA0203F6), # mov x22, x2
- le32(0xAA0303F7), # mov x23, x3
- le32(0xAA0403F8), # mov x24, x4
- le32(0xAA0503F9), # mov x25, x5
- le32(0xAA1503E0), # mov x0, x21
- le32(0xAA1603E1), # mov x1, x22
- le32(0xAA1703E2), # mov x2, x23
- le32(0xAA1803E3), # mov x3, x24
- le32(0xAA1903E4), # mov x4, x25
- le32(0xAA1303E5), # mov x5, x19
+ le32(0xAA0003E8), # mov x8, x0 (syscall number)
+ le32(0xAA0103F5), # mov x21, x1 (save r1)
+ le32(0xAA0203F6), # mov x22, x2 (save r2)
+ le32(0xAA0303F7), # mov x23, x3 (save r3)
+ le32(0xAA1503E0), # mov x0, x21 (arg1 = r1)
+ le32(0xAA1603E1), # mov x1, x22 (arg2 = r2)
+ le32(0xAA1703E2), # mov x2, x23 (arg3 = r3)
+ le32(0xAA1A03E3), # mov x3, x26 (arg4 = r4)
+ le32(0xAA1B03E4), # mov x4, x27 (arg5 = r5)
+ le32(0xAA1303E5), # mov x5, x19 (arg6 = r6)
le32(0xD4000001), # svc #0
- le32(0xAA1503E1), # mov x1, x21
+ le32(0xAA1503E1), # mov x1, x21 (restore r1)
le32(0xAA1603E2), # mov x2, x22
le32(0xAA1703E3), # mov x3, x23
- le32(0xAA1803E4), # mov x4, x24
- le32(0xAA1903E5), # mov x5, x25
])
),
- 'amd64': '4989D9' + '0F05', # mov r9, rbx ; syscall
+ # r4=r13, r5=r14 are callee-saved natively, but syscall wants args
+ # 4/5 in r10/r8. r6=rbx, but arg6 lives in r9. Three shuffle moves,
+ # then syscall. The kernel preserves rdi/rsi/rdx/r12–r15/rbx, so no
+ # P1 reg is clobbered beyond r0 (syscall return).
+ 'amd64': '4D89EA' + '4D89F0' + '4989D9' + '0F05',
'riscv64': (
+ # Same story as aarch64: r4/r5 in callee-saved s4/s5 (=x20/x21),
+ # so we only save/restore a1/a2/a3. Scratch slots: s3, s6, s7.
''.join([
- le32(0x00050893), # addi a7, a0, 0 (mv a7, a0)
- le32(0x00058993), # mv s3, a1
- le32(0x00060A13), # mv s4, a2
- le32(0x00068A93), # mv s5, a3
- le32(0x00070B13), # mv s6, a4
- le32(0x00078B93), # mv s7, a5
- le32(0x00098513), # mv a0, s3
- le32(0x000A0593), # mv a1, s4
- le32(0x000A8613), # mv a2, s5
- le32(0x000B0693), # mv a3, s6
- le32(0x000B8713), # mv a4, s7
- le32(0x00048793), # mv a5, s1
+ le32(0x00050893), # mv a7, a0 (syscall number)
+ le32(0x00058993), # mv s3, a1 (save r1)
+ le32(0x00060B13), # mv s6, a2 (save r2)
+ le32(0x00068B93), # mv s7, a3 (save r3)
+ le32(0x00098513), # mv a0, s3 (arg1 = r1)
+ le32(0x000B0593), # mv a1, s6 (arg2 = r2)
+ le32(0x000B8613), # mv a2, s7 (arg3 = r3)
+ le32(0x000A0693), # mv a3, s4 (arg4 = r4)
+ le32(0x000A8713), # mv a4, s5 (arg5 = r5)
+ le32(0x00048793), # mv a5, s1 (arg6 = r6)
le32(0x00000073), # ecall
- le32(0x00098593), # mv a1, s3
- le32(0x000A0613), # mv a2, s4
- le32(0x000A8693), # mv a3, s5
- le32(0x000B0713), # mv a4, s6
- le32(0x000B8793), # mv a5, s7
+ le32(0x00098593), # mv a1, s3 (restore r1)
+ le32(0x000B0613), # mv a2, s6
+ le32(0x000B8693), # mv a3, s7
])
),
}
@@ -770,6 +769,10 @@ def rows():
R.append(Banner('LI — load 4-byte zero-extended literal from inline data slot'))
for rd in ['r0', 'r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7']:
R.append(Li(name=f'LI_{rd.upper()}', rD=rd))
+ # LI_BR loads into the hidden branch-target scratch (x17/r11/t5).
+ # Every branch and call site is `LI_BR &target ; P1_B` (or the
+ # matching conditional / CALL). The scratch is *not* a P1 reg.
+ R.append(Li(name='LI_BR', rD='br'))
# --- SYSCALL / SYSOPEN ---
R.append(Banner('SYSCALL / SYSOPEN — uniform (clobbers r0 only) across arches'))
@@ -841,12 +844,12 @@ def rows():
op=op, rT=rt, rN=rn, off=off))
# --- Branches ---
- R.append(Banner('Branches (tranche 4, r7-indirect)'))
+ R.append(Banner('Branches (tranche 4, LI_BR-indirect)'))
R.append(B(name='B'))
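+ # Each CondB pairs the *inverted* native compare with an indirect jump
+ # through the branch scratch, so the taken path goes via LI_BR's target:
+ # e.g. riscv64 BEQ_R2_R3 assembles to `bne a2, a3, +8 ; jalr x0, t5`.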
for op, a, b in [
('BEQ','r2','r3'), ('BNE','r2','r3'), ('BLT','r2','r3'), ('BLT','r4','r2'),
]:
- R.append(CondB(name=f'{op}_{a.upper()}_{b.upper()}_R7', op=op, rA=a, rB=b))
+ R.append(CondB(name=f'{op}_{a.upper()}_{b.upper()}', op=op, rA=a, rB=b))
# --- PROLOGUE / EPILOGUE / CALL / RET / TAIL — single-slot + Nk variants ---
R.append(Banner('Control: CALL/RET + single-slot and N-slot PROLOGUE/EPILOGUE/TAIL'))
@@ -874,6 +877,7 @@ def rows():
R.append(Mov(name='MOV_R4_R7', rD='r4', rA='r7'))
# MOV rD, sp variants
R.append(Mov(name='MOV_R2_SP', rD='r2', rA='sp'))
+ R.append(Mov(name='MOV_R3_SP', rD='r3', rA='sp'))
R.append(Mov(name='MOV_R4_SP', rD='r4', rA='sp'))
R.append(Mov(name='MOV_R6_SP', rD='r6', rA='sp'))
# Extra MOVs needed around calls
@@ -886,18 +890,28 @@ def rows():
('ST','r0','r4',8), ('LD','r0','r4',8), ('LB','r1','r0',0),
('LD','r0','r1',0), ('LD','r0','r1',8), ('ST','r1','r0',0),
('LD','r2','r4',0), ('ST','r2','r0',8), ('LD','r0','r4',0),
- # N=2 scratch slot 2 at [sp+16]: used by cons after dropping
- # the :cons_save_cdr BSS spill (P1_TODO.md issue 4).
('ST','r2','r4',16), ('LD','r2','r4',16),
+ # r3-based addressing: after the 4:4 split, r4 is callee-saved
+ # and cons/alloc use r3 (caller-saved) as sp-temp and heap-cell
+ # scratch instead. LD_R1_R1_0 is alloc's `r1 = *heap_end`.
+ ('LD','r0','r3',0), ('ST','r2','r3',0),
+ ('ST','r1','r3',8), ('LD','r1','r3',8),
+ ('ST','r2','r3',16), ('LD','r2','r3',16),
+ ('LD','r1','r1',0),
]:
suf = f'NEG{-off}' if off < 0 else f'{off}'
R.append(Mem(name=f'{op}_{rt.upper()}_{rn.upper()}_{suf}',
op=op, rT=rt, rN=rn, off=off))
- # BLT r0, r2 ; BNE r1, r2 ; BNE r0, r2
- R.append(CondB(name='BLT_R0_R2_R7', op='BLT', rA='r0', rB='r2'))
- R.append(CondB(name='BNE_R1_R2_R7', op='BNE', rA='r1', rB='r2'))
- R.append(CondB(name='BNE_R0_R2_R7', op='BNE', rA='r0', rB='r2'))
+ # ADD_R2_R0_R1: alloc's `new_next = old + size`. None of the existing
+ # tranche-1 ADDs has the (d=r2, a=r0, b=r1) shape, so add it here.
+ R.append(RRR(name='ADD_R2_R0_R1', op='ADD', rD='r2', rA='r0', rB='r1'))
+
+ # Branches used by lisp step 2 / alloc OOM.
+ R.append(CondB(name='BLT_R0_R2', op='BLT', rA='r0', rB='r2'))
+ R.append(CondB(name='BLT_R1_R2', op='BLT', rA='r1', rB='r2'))
+ R.append(CondB(name='BNE_R1_R2', op='BNE', rA='r1', rB='r2'))
+ R.append(CondB(name='BNE_R0_R2', op='BNE', rA='r0', rB='r2'))
return R
diff --git a/p1_riscv64.M1 b/p1_riscv64.M1
@@ -10,13 +10,14 @@ DEFINE P1_LI_R0 170500000365C5006F008000
DEFINE P1_LI_R1 9705000083E5C5006F008000
DEFINE P1_LI_R2 170600000366C6006F008000
DEFINE P1_LI_R3 9706000083E6C6006F008000
-DEFINE P1_LI_R4 170700000367C7006F008000
-DEFINE P1_LI_R5 9707000083E7C7006F008000
+DEFINE P1_LI_R4 170A0000036ACA006F008000
+DEFINE P1_LI_R5 970A000083EACA006F008000
DEFINE P1_LI_R6 9704000083E4C4006F008000
DEFINE P1_LI_R7 170900000369C9006F008000
+DEFINE P1_LI_BR 170F0000036FCF006F008000
## ---- SYSCALL / SYSOPEN — uniform (clobbers r0 only) across arches
-DEFINE P1_SYSCALL 9308050093890500130A0600938A0600130B0700938B07001385090093050A0013860A0093060B0013870B0093870400730000009385090013060A0093860A0013070B0093870B00
+DEFINE P1_SYSCALL 9308050093890500130B0600938B06001385090093050B0013860B0093060A0013870A0093870400730000009385090013060B0093860B00
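+## (mv a7,a0 ; stash a1-a3 in s3/s6/s7 ; marshal a0-a5 from
+##  s3/s6/s7/s4/s5/s1 ; ecall ; restore a1-a3 — see p1_gen.py)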
DEFINE P1_SYSOPEN 1305C0F99308800373000000
## ---- Linux syscall numbers (per-arch table). LE-32 immediate operands for LI.
@@ -27,31 +28,31 @@ DEFINE SYS_CLOSE 39000000
## ---- Reg-reg-reg arithmetic (tranche 1) --------------------------
DEFINE P1_ADD_R1_R1_R2 B385C500
-DEFINE P1_ADD_R1_R1_R4 B385E500
+DEFINE P1_ADD_R1_R1_R4 B3854501
DEFINE P1_ADD_R2_R2_R6 33069600
DEFINE P1_ADD_R2_R3_R1 3386B600
DEFINE P1_SUB_R1_R1_R2 B385C540
DEFINE P1_SUB_R2_R2_R6 33069640
-DEFINE P1_AND_R1_R1_R5 B3F5F500
+DEFINE P1_AND_R1_R1_R5 B3F55501
DEFINE P1_OR_R1_R1_R2 B3E5C500
DEFINE P1_XOR_R1_R1_R2 B3C5C500
DEFINE P1_MUL_R1_R1_R2 B385C502
DEFINE P1_DIV_R1_R1_R2 B3C5C502
-DEFINE P1_REM_R1_R1_R5 B3E5F502
+DEFINE P1_REM_R1_R1_R5 B3E55503
DEFINE P1_SHL_R1_R1_R2 B395C500
DEFINE P1_SHR_R1_R1_R2 B3D5C500
-DEFINE P1_SAR_R4_R4_R2 3357C740
+DEFINE P1_SAR_R4_R4_R2 335ACA40
## ---- Immediate arithmetic (tranche 2) ----------------------------
DEFINE P1_ADDI_R1_R1_3 93853500
DEFINE P1_ADDI_R1_R1_1 93851500
DEFINE P1_ADDI_R1_R1_NEG3 9385D5FF
-DEFINE P1_ADDI_R4_R4_NEG1 1307F7FF
+DEFINE P1_ADDI_R4_R4_NEG1 130AFAFF
DEFINE P1_ADDI_R1_R1_NEG2 9385E5FF
DEFINE P1_ADDI_R0_R0_1 13051500
DEFINE P1_SHLI_R1_R1_1 93951500
DEFINE P1_SHRI_R1_R1_1 93D51500
-DEFINE P1_SARI_R4_R4_1 13571740
+DEFINE P1_SARI_R4_R4_1 135A1A40
DEFINE P1_ANDI_R1_R1_6 93F56500
DEFINE P1_ANDI_R1_R1_7 93F57500
DEFINE P1_ORI_R1_R1_1 93E51500
@@ -59,38 +60,38 @@ DEFINE P1_ORI_R0_R0_2 13652500
DEFINE P1_ORI_R0_R0_7 13657500
## ---- LA + memory ops (tranche 3) ---------------------------------
-DEFINE P1_LA_R4 170700000367C7006F008000
-DEFINE P1_ST_R1_R4_0 2330B700
-DEFINE P1_LD_R1_R4_0 83350700
-DEFINE P1_ST_R1_R4_8 2334B700
-DEFINE P1_LD_R1_R4_8 83358700
-DEFINE P1_SB_R1_R4_16 2308B700
-DEFINE P1_LB_R1_R4_16 83450701
-DEFINE P1_ST_R1_R4_NEG8 233CB7FE
-DEFINE P1_LD_R1_R4_NEG8 833587FF
+DEFINE P1_LA_R4 170A0000036ACA006F008000
+DEFINE P1_ST_R1_R4_0 2330BA00
+DEFINE P1_LD_R1_R4_0 83350A00
+DEFINE P1_ST_R1_R4_8 2334BA00
+DEFINE P1_LD_R1_R4_8 83358A00
+DEFINE P1_SB_R1_R4_16 2308BA00
+DEFINE P1_LB_R1_R4_16 83450A01
+DEFINE P1_ST_R1_R4_NEG8 233CBAFE
+DEFINE P1_LD_R1_R4_NEG8 83358AFF
-## ---- Branches (tranche 4, r7-indirect) ---------------------------
-DEFINE P1_B 67000900
-DEFINE P1_BEQ_R2_R3_R7 6314D60067000900
-DEFINE P1_BNE_R2_R3_R7 6304D60067000900
-DEFINE P1_BLT_R2_R3_R7 6354D60067000900
-DEFINE P1_BLT_R4_R2_R7 6354C70067000900
+## ---- Branches (tranche 4, LI_BR-indirect) ------------------------
+DEFINE P1_B 67000F00
+DEFINE P1_BEQ_R2_R3 6314D60067000F00
+DEFINE P1_BNE_R2_R3 6304D60067000F00
+DEFINE P1_BLT_R2_R3 6354D60067000F00
+DEFINE P1_BLT_R4_R2 6354CA0067000F00
## ---- Control: CALL/RET + single-slot and N-slot PROLOGUE/EPILOGUE/TAIL
DEFINE P1_PROLOGUE 130101FF23301100
DEFINE P1_EPILOGUE 8330010013010101
DEFINE P1_RET 67800000
-DEFINE P1_CALL E7000900
-DEFINE P1_TAIL 833001001301010167000900
+DEFINE P1_CALL E7000F00
+DEFINE P1_TAIL 833001001301010167000F00
DEFINE P1_PROLOGUE_N2 130101FE23301100
DEFINE P1_EPILOGUE_N2 8330010013010102
-DEFINE P1_TAIL_N2 833001001301010267000900
+DEFINE P1_TAIL_N2 833001001301010267000F00
DEFINE P1_PROLOGUE_N3 130101FE23301100
DEFINE P1_EPILOGUE_N3 8330010013010102
-DEFINE P1_TAIL_N3 833001001301010267000900
+DEFINE P1_TAIL_N3 833001001301010267000F00
DEFINE P1_PROLOGUE_N4 130101FD23301100
DEFINE P1_EPILOGUE_N4 8330010013010103
-DEFINE P1_TAIL_N4 833001001301010367000900
+DEFINE P1_TAIL_N4 833001001301010367000F00
## ---- Seed-Lisp step 1 extensions (tranche 6) ---------------------
DEFINE P1_MOV_R1_R6 93850400
@@ -102,28 +103,38 @@ DEFINE P1_MOV_R7_R2 13090600
DEFINE P1_MOV_R2_R6 13860400
DEFINE P1_MOV_R3_R7 93060900
DEFINE P1_MOV_R2_R7 13060900
-DEFINE P1_MOV_R4_R7 13070900
+DEFINE P1_MOV_R4_R7 130A0900
DEFINE P1_MOV_R2_SP 13060100
-DEFINE P1_MOV_R4_SP 13070100
+DEFINE P1_MOV_R3_SP 93060100
+DEFINE P1_MOV_R4_SP 130A0100
DEFINE P1_MOV_R6_SP 93040100
DEFINE P1_MOV_R2_R0 13060500
DEFINE P1_LD_R0_R6_0 03B50400
DEFINE P1_LD_R1_R6_16 83B50401
-DEFINE P1_LD_R3_R4_0 83360700
-DEFINE P1_LD_R0_R5_0 03B50700
-DEFINE P1_LB_R1_R4_0 83450700
-DEFINE P1_ST_R2_R4_0 2330C700
-DEFINE P1_ST_R0_R4_8 2334A700
-DEFINE P1_LD_R0_R4_8 03358700
+DEFINE P1_LD_R3_R4_0 83360A00
+DEFINE P1_LD_R0_R5_0 03B50A00
+DEFINE P1_LB_R1_R4_0 83450A00
+DEFINE P1_ST_R2_R4_0 2330CA00
+DEFINE P1_ST_R0_R4_8 2334AA00
+DEFINE P1_LD_R0_R4_8 03358A00
DEFINE P1_LB_R1_R0_0 83450500
DEFINE P1_LD_R0_R1_0 03B50500
DEFINE P1_LD_R0_R1_8 03B58500
DEFINE P1_ST_R1_R0_0 2330B500
-DEFINE P1_LD_R2_R4_0 03360700
+DEFINE P1_LD_R2_R4_0 03360A00
DEFINE P1_ST_R2_R0_8 2334C500
-DEFINE P1_LD_R0_R4_0 03350700
-DEFINE P1_ST_R2_R4_16 2338C700
-DEFINE P1_LD_R2_R4_16 03360701
-DEFINE P1_BLT_R0_R2_R7 6354C50067000900
-DEFINE P1_BNE_R1_R2_R7 6384C50067000900
-DEFINE P1_BNE_R0_R2_R7 6304C50067000900
+DEFINE P1_LD_R0_R4_0 03350A00
+DEFINE P1_ST_R2_R4_16 2338CA00
+DEFINE P1_LD_R2_R4_16 03360A01
+DEFINE P1_LD_R0_R3_0 03B50600
+DEFINE P1_ST_R2_R3_0 23B0C600
+DEFINE P1_ST_R1_R3_8 23B4B600
+DEFINE P1_LD_R1_R3_8 83B58600
+DEFINE P1_ST_R2_R3_16 23B8C600
+DEFINE P1_LD_R2_R3_16 03B60601
+DEFINE P1_LD_R1_R1_0 83B50500
+DEFINE P1_ADD_R2_R0_R1 3306B500
+DEFINE P1_BLT_R0_R2 6354C50067000F00
+DEFINE P1_BLT_R1_R2 63D4C50067000F00
+DEFINE P1_BNE_R1_R2 6384C50067000F00
+DEFINE P1_BNE_R0_R2 6304C50067000F00