boot2

Playing with the bootstrap
git clone https://git.ryansepassi.com/git/boot2.git

commit 4f257aaba799025e6f84cccd9311991e67c65681
parent 78312f7592353f51c3b5957eca18ea098cbd4a66
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Mon, 20 Apr 2026 22:20:13 -0700

P1: extract branch-scratch; 4:4 caller/callee split; stack-walking GC roots

Branch target lives in a hidden native scratch reg (aarch64 x17, amd64
r11, riscv64 t5) instead of a P1 GPR. Source code uses LI_BR in place
of LI_R7, and branch/call ops drop the _R7 suffix — the scratch is no
longer a named P1 register, so it can't be accidentally referenced from
program code.
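
At a branch site the change is a pure rename; the before/after shape (a sketch using a hypothetical `&fail` label, following the P1.md pattern) is:

```
## before: branch target staged in P1 r7
P1_LI_R7
&fail
P1_BNE_R1_R2_R7

## after: target staged in the hidden native scratch (x17/r11/t5)
P1_LI_BR
&fail
P1_BNE_R1_R2
```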

The freed slot realigns the P1 register file as 4:4 caller/callee-saved:
r0-r3 caller-saved (native argregs on each arch), r4-r7 callee-saved.
r4/r5 remap to native callee-saved regs (aarch64 x26/x27, amd64 r13/r14,
riscv64 s4/s5) so SYSCALL no longer has to preserve them across the
kernel trap. Function bodies get four walkable cross-call roots without
touching the frame.
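
A sketch of what the remap buys (hypothetical snippet; op spellings assumed, in the demo.M1 style): a value parked in r4 now rides through a SYSCALL with no explicit spill, because the native reg backing it (x26/r13/s4) is callee-saved under the kernel ABI.

```
P1_LI_R4            ## r4 = live tagged value (hypothetical)
'51010000'          ## fixnum(42) = 42<<3|1 = 0x151
                    ## ... set up r0-r3 for the syscall ...
P1_SYSCALL          ## clobbers r0 only; r4 still holds 0x151 after
```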

amd64 PROLOGUE/EPILOGUE now use rcx (not r11) as the retaddr-carry
scratch. TAIL = EPILOGUE + jmp r11, so the old sequence popped its own
branch target one instruction before jumping through it; the bug
surfaced as demo exiting 13 on amd64 while aarch64/riscv64 passed. rcx was
already a non-P1 scratch used by shifts and DIV/REM, so no new
reservations.
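
The failure mode, sketched in pseudo-amd64 terms (not the literal M1 expansion text):

```
## TAIL = EPILOGUE + jmp r11
##
## old:  LI_BR &target    ; r11 <- target
##       EPILOGUE         ; retaddr carried through r11 -- clobbers target
##       jmp r11          ; jumps to the caller's retaddr, not &target
##
## new:  EPILOGUE carries the retaddr in rcx, so r11 still holds
##       &target when the jmp fires.
```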

lisp.M1 cons() switches its sp-temp from r4 to r3 (caller-saved, no
preservation burden) and alloc() now uses only r0-r3, shrinking its
callee-save pressure to zero.
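
In register-pressure terms (a hypothetical sketch, not the actual cons body): a caller-saved temp is free in straight-line code before the allocating call, while a callee-saved one would now oblige cons's callees to preserve it.

```
## cons(r1=car, r2=cdr), sketched:
##   r3 = sp/temp scratch            (caller-saved: nothing to restore)
##   r1/r2 spilled to frame slots    (GC-walkable roots across alloc)
##   LI_BR &alloc ; CALL             (alloc itself touches only r0-r3)
```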

LISP.md §Settled decisions #3 rewrites the GC root model from "32 fixed
BSS root_slots + live_mask, stack not scanned" to "per-frame root slots
walked on the P1 stack." This was the discipline N-slot PROLOGUE was
built for in 78312f7 — the doc just hadn't caught up. global_env and
symbol_table remain named BSS roots (long-lived singletons with no
stack frame to hang off of).
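
The new root model is easy to state as a toy (illustrative Python, not the M1 code — `walk_roots` and the frame layout are hypothetical; tag scheme per LISP.md: low 3 bits, fixnum = 1, pair = 2):

```python
# Toy model of the per-frame GC root walk. Frames model the region
# between sp and the frame-chain bottom; each frame is its PROLOGUE_Nk
# slot words. Non-pointer slots self-filter: their low-bit tag doesn't
# name a heap object.
PAIR_TAG = 2

def walk_roots(stack_frames, heap):
    """Return heap addresses found in frame slots."""
    roots = []
    for frame in stack_frames:
        for word in frame:
            if word & 7 == PAIR_TAG and (word & ~7) in heap:
                roots.append(word & ~7)
    return roots

heap = {0x1000, 0x1010}                  # 8-aligned pair cells in the arena
frames = [
    [0x1000 | PAIR_TAG, (42 << 3) | 1],  # a live pair and fixnum(42)
    [0x9999, 0x1010 | PAIR_TAG],         # junk word self-filters
]
print(walk_roots(frames, heap))          # -> [4096, 4112]
```

`global_env` and `symbol_table` would be marked separately, as named BSS roots.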

P1.md and comments across lisp.M1/demo.M1/p1_gen.py updated to match.
P1_TODO.md removed — all five issues are landed.

Diffstat:
M LISP.md      |  42 ++++++++++++++++++++++++------------------
M Makefile     |   3 +--
M P1.md        | 156 +++++++++++++++++++++++++++++++++++++++++++------------------------------------
D P1_TODO.md   | 251 -------------------------------------------------------------------------------
M demo.M1      |  56 ++++++++++++++++++++++++++++----------------------------
M lisp.M1      |  73 +++++++++++++++++++++++++++++++++++++++----------------------------------
M p1_aarch64.M1 | 99 ++++++++++++++++++++++++++++++++++++++++++++-----------------------------------
M p1_amd64.M1  | 115 +++++++++++++++++++++++++++++++++++++++++++------------------------------------
M p1_gen.py    | 198 ++++++++++++++++++++++++++++++++++++++++++-------------------------------------
M p1_riscv64.M1 | 99 ++++++++++++++++++++++++++++++++++++++++++++-----------------------------------
10 files changed, 456 insertions(+), 636 deletions(-)

diff --git a/LISP.md b/LISP.md
@@ -31,9 +31,11 @@ Load-bearing; the rest of the document assumes them.
    §Value representation.
 2. **Mark-sweep GC over a fixed BSS arena.** 20 MB, linked in at build
    time, no `brk`/`mmap`.
-3. **Fixed root-spill slots.** 32 named cells in BSS; Scheme-facing
-   functions spill live values there before any allocating call. The
-   P1 stack is not scanned.
+3. **Per-frame root slots on the P1 stack.** Scheme-facing functions
+   declare `PROLOGUE_Nk` frames with one slot per live tagged value
+   they need across an allocating call; the GC mark phase walks those
+   frames. `global_env` and `symbol_table` remain named BSS roots
+   (long-lived singletons with no stack frame to hang off of).
 4. **Source file via `argv[1]`.** The interpreter opens the path given
    on the command line, reads it whole into a string, parses it,
    evaluates top-to-bottom. No linked-in blob.
@@ -144,14 +146,15 @@ Three sources:
    transitively.
 2. **Symbol table.** Intern table's backing array plus all interned
    symbols and their name strings. Fixed pointer (`symbol_table`).
-3. **Root-spill slots.** 32 tagged words at `root_slots[0..32]` in BSS.
-   A 32-bit `live_mask` word marks which slots currently hold a live
-   value. Scheme-facing functions set and clear bits in the mask
-   across their dynamic extent.
+3. **Per-frame stack slots.** The mark phase walks the P1 stack from
+   the current `sp` up to the saved frame-chain bottom. Each frame
+   declares its slot count via `PROLOGUE_Nk` (see P1.md §"Semantics"),
+   and the GC treats every slot word as a tagged-value root —
+   non-pointer slots self-filter because their tags don't match any
+   heap object.
 
-The P1 stack is **not scanned.** Stack frames hold P1 integers, return
-addresses, and scratch — not Scheme-value-shaped words unless those
-words are also rooted in a spill slot.
+Return addresses and raw P1 integers on the stack sit outside the
+PROLOGUE_Nk slot range, so the walker doesn't see them.
 
 ### Algorithm
@@ -196,11 +199,13 @@ uncommon in compiler workloads.
 ### Invariants
 
 - Between any two GC-observable points, every live Scheme value must be
-  reachable from `global_env`, `symbol_table`, or a live root slot.
-  "GC-observable" = any call to `cons`, `make-vector`, `make-string`,
-  `make-closure`, `intern`, or another Scheme function.
-- `live_mask` bits are owned by the function that set them; callers
-  don't set a callee's bits.
+  reachable from `global_env`, `symbol_table`, or a PROLOGUE_Nk slot
+  in an active frame. "GC-observable" = any call to `cons`,
+  `make-vector`, `make-string`, `make-closure`, `intern`, or another
+  Scheme function.
+- Scheme-facing functions spill each live tagged value into a frame
+  slot before any allocating call. A value kept only in a P1 register
+  will be lost if GC fires mid-call.
 
 Tradeoffs taken:
 
@@ -348,8 +353,8 @@ startup the interpreter populates `global_env` with one binding per
 primitive, each wrapped as a primitive-type heap object.
 
 Primitives allocate through the same `alloc`/GC path as user code and
-must honor the root-spill discipline when they make more than one
-allocation.
+must honor the per-frame-slot root discipline when they make more than
+one allocation.
 
 Target set: ~40 primitives per PLAN.md. Roughly:
@@ -464,7 +469,8 @@ coverage before the next begins.
    via syscalls, `error` path, BSS layout, bump allocator. Exits with
    the file's first byte as proof-of-life. ~200 P1 LOC.
 2. **Tagged values.** Fixnum/pair/singleton encoding; `cons`/`car`/`cdr`/
-   tag predicates. Hand-build a pair; print its car. ~300 LOC.
+   tag predicates. Hand-build `cons(42, nil)` and exit with the decoded
+   car as status. ~300 LOC.
 3. **Strings + symbol interning.** Heap strings, 4096-slot intern
    table, symbol allocator. Two `(intern "foo")` calls return `eq?`
    pointers. ~400 LOC.
diff --git a/Makefile b/Makefile
@@ -84,8 +84,7 @@ $(TOOLS_DIR)/M0 $(TOOLS_DIR)/hex2-0 $(TOOLS_DIR)/catm $(TOOLS_DIR)/hex0 $(TOOLS_
 #
 # Lint catches P1_*/SYS_* tokens with no matching DEFINE — M1 otherwise
 # silently emits the literal token text and produces a SIGILL-on-run
-# binary (see P1_TODO.md issue 1). Runs on the host (plain POSIX sh);
-# no podman dependency.
+# binary. Runs on the host (plain POSIX sh); no podman dependency.
 #
 # M0 takes a single positional input (no -f flag), so we catm the two
 # sources together first. The intermediate .combined.M1 is kept in OUT_DIR
diff --git a/P1.md b/P1.md
@@ -44,9 +44,10 @@ All runs on stock stage0 `M0` + `hex2-0`, bootstrapped per-arch from
 The DEFINE table is generator-driven (`p1_gen.py`); tranches 1–8 are
 enumerated there, plus the full PROLOGUE_Nk family (k=1..4). Branch
-offsets are realized by the r7-indirect pattern
-(`LI_R7 &target ; BXX_rA_rB_R7`), sidestepping the missing
-branch-offset support in hex2.
+offsets are realized by the LI_BR-indirect pattern
+(`LI_BR &target ; BXX_rA_rB`), sidestepping the missing
+branch-offset support in hex2. The branch-target scratch is a
+reserved native reg (`x17`/`r11`/`t5`), not a P1 GPR.
 
 ### Spike deviations from the design
 
@@ -72,15 +73,16 @@ branch-offset support in hex2.
 | Registers | 8 GPRs (`r0`–`r7`) + `sp`, `lr`-on-stack | Fits x86-64's usable register budget |
 | Narrow imm | Signed 12-bit | riscv I-type width; aarch64 ≤12 also OK |
 | Wide imm | Pool-loaded via PC-relative `LI` | Avoids arch-specific immediate synthesis |
-| Calling conv | r0 = return, r1–r6 = args, r6 callee-saved, r7 per-op scratch | P1-defined; not platform ABI |
+| Calling conv | r0 = return, r1–r3 = args (caller-saved), r4–r7 callee-saved | P1-defined; not platform ABI |
 | Return address | Always spilled to stack on entry | Hides x86's missing `lr` uniformly |
 | Syscall | `SYSCALL` with num in r0, args r1–r6; clobbers r0 only | Per-arch wrapper emits native sequence |
 | Spill slot | `[sp + 8]` is callee-private scratch after `PROLOGUE` | Frame already 16 B for alignment; second cell was otherwise unused |
 
 ## Register mapping
 
-`r0`–`r5` are caller-saved. `r6` is callee-saved. `r7` is per-op scratch
-(see below). `sp` is special-purpose — see `PROLOGUE` semantics.
+`r0`–`r3` are caller-saved. `r4`–`r7` are callee-saved, general-purpose,
+and preserved across `CALL`/`SYSCALL`. `sp` is special-purpose — see
+`PROLOGUE` semantics.
 
 | P1 | amd64 | aarch64 | riscv64 |
 |------|-------|---------|---------|
@@ -88,43 +90,51 @@ branch-offset support in hex2.
 | `r1` | `rdi` | `x1` | `a1` |
 | `r2` | `rsi` | `x2` | `a2` |
 | `r3` | `rdx` | `x3` | `a3` |
-| `r4` | `r10` | `x4` | `a4` |
-| `r5` | `r8` | `x5` | `a5` |
+| `r4` | `r13` | `x26` | `s4` |
+| `r5` | `r14` | `x27` | `s5` |
 | `r6` | `rbx` | `x19` | `s1` |
 | `r7` | `r12` | `x20` | `s2` |
 | `sp` | `rsp` | `sp` | `sp` |
 | `lr` | (mem) | `x30` | `ra` |
 
+`r4`–`r7` all map to native callee-saved regs on each arch, so the SysV
+kernel+libc "callee preserves these" rule does the work for us across
+syscalls without explicit save/restore in the `SYSCALL` expansion.
+
 x86-64 has no link register; `CALL`/`RET` macros push/pop the return
 address on the stack.
 
 On aarch64/riscv64, the prologue spills `lr` (`x30`/`ra`) to the stack
 too, so all three converge on "return address lives in `[sp + 0]` after
 prologue." This uniformity is worth the extra store on the register-rich
 arches.
 
-**`r7` is not a general-purpose callee-saved register.** Every conditional
-and unconditional branch in P1 compiles through the r7-indirect pattern
-(`LI_R7 &target ; BLT/B`), which overwrites `r7` at every branch site.
-Treat `r7` as a per-instruction scratch owned by the branch / call
-machinery: its value is meaningful only between the `LI_R7` that loads a
-target and the branch that consumes it. Never carry a live value through a
-branch or call in `r7`.
-
 **Reserved scratch registers (not available to P1):** certain native regs
 are used internally by op expansions and are never exposed as P1
 registers. Every P1 op writes only what its name says it writes — reserved
 scratch is save/restored within the expansion so no hidden clobbers leak
 across op boundaries.
 
-- **aarch64** — `x21`–`x25` hold `r1`–`r5` across the `SYSCALL` arg
-  shuffle. `x16` (ARM IP0) is scratch for `REM` (carries the `SDIV`
-  quotient into the following `MSUB`). `x8` holds the syscall number.
-- **amd64** — `rcx` and `r11` are kernel-clobbered by the `syscall`
-  instruction itself; `PROLOGUE` additionally uses `r11` to temporarily
-  hold the return address across the `sub rsp, N`. `DIV`/`REM` use both
-  `rcx` (to save `rdx` = P1 `r3`) and `r11` (to save `rax` = P1 `r0`)
-  so that idiv's implicit writes to rax/rdx stay invisible.
-- **riscv64** — `s3`–`s7` hold `r1`–`r5` across the `SYSCALL` arg
-  shuffle; `a7` holds the syscall number.
+- **Branch-target scratch (all arches).** `B`/`BEQ`/`BNE`/`BLT`/`CALL`/
+  `TAIL` jump through a dedicated native reg pre-loaded via `LI_BR`:
+  `x17` (ARM IP1) on aarch64, `r11` on amd64, `t5` on riscv64. The reg
+  is caller-saved natively and never carries a live P1 value past the
+  following branch. Treat it as existing only between the `LI_BR` that
+  loads a target and the branch that consumes it.
+- **aarch64** — `x21`–`x23` hold `r1`–`r3` across the `SYSCALL` arg
+  shuffle (`r4`/`r5` live in callee-saved `x26`/`x27` so the kernel
+  preserves them for us). `x16` (ARM IP0) is scratch for `REM`
+  (carries the `SDIV` quotient into the following `MSUB`). `x8` holds
+  the syscall number.
+- **amd64** — `rcx` and the branch-target `r11` are kernel-clobbered by
+  the `syscall` instruction itself. `PROLOGUE`/`EPILOGUE` use `rcx` to
+  carry the retaddr across the `sub rsp, N` (can't use `r11` here — it
+  is the branch-target reg, and `TAIL` = `EPILOGUE` + `jmp r11`).
+  `DIV`/`REM` use `rcx` (to save `rdx` = P1 `r3`) and `r11` (to save
+  `rax` = P1 `r0`) so that `idiv`'s implicit writes to rax/rdx stay
+  invisible; the `r11` save is fine because no branch op can interrupt
+  the DIV/REM expansion.
+- **riscv64** — `s3`,`s6`,`s7` hold `r1`–`r3` across the `SYSCALL` arg
+  shuffle (`r4`/`r5` live in callee-saved `s4`/`s5`, same trick as
+  aarch64). `a7` holds the syscall number.
 
 All of these are off-limits to hand-written P1 programs and are never
 mentioned in P1 source. If you see a register name not in the r0–r7 /
@@ -134,28 +144,26 @@ sp / lr set, it belongs to an op's internal expansion.
 
 P1 has no PC-relative branch immediates (hex2 offers no label-arithmetic
 sigil — branch ranges can't be expressed in hex2 source). Every branch,
-conditional or not, compiles through the **r7-indirect** pattern: the
-caller loads the target into `r7` with a wide `LI`, then the branch op
-jumps through `r7`. A conditional like "jump to `fail` if `r1 != r2`" is
-three source lines:
+conditional or not, compiles through the **LI_BR-indirect** pattern: the
+caller loads the target into the dedicated branch-target scratch reg
+with `LI_BR`, then the branch op jumps through it. A conditional like
+"jump to `fail` if `r1 != r2`" is three source lines:
 
 ```
-P1_LI_R7
+P1_LI_BR
 &fail
-P1_BNE_R1_R2_R7
+P1_BNE_R1_R2
 ```
 
-The `_R7` suffix on every branch op is a reminder that `r7` is the
-branch-target register: the value loaded by the preceding `LI_R7` is
-consumed by the branch, and both lines always appear together. `CALL`
-follows the same shape (`LI_R7 &callee ; P1_CALL`).
+`LI_BR` writes a reserved native reg (`x17`/`r11`/`t5` — see Register
+mapping), not a P1 GPR. The branch op that follows consumes it and
+jumps. `CALL` and `TAIL` follow the same shape
+(`LI_BR &callee ; P1_CALL`).
 
-Because every branch site overwrites `r7`, **`r7` is branch-scratch, not
-a general-purpose callee-saved register** — see the "`r7` is not a
-general-purpose callee-saved register" paragraph above. If you need to
-keep a value live across a branch or call, use `r0`–`r5` (caller-saved,
-lives across un-branched straight-line code) or `r6` (callee-saved,
-lives across calls).
+The branch-target reg is owned by the branch machinery: never carry a
+live value across a branch in it. Since it isn't a P1 reg, this is
+automatic — there's no P1-level way to read or write it outside
+`LI_BR`.
 
 ## Instruction set (~30 ops)
 
@@ -204,13 +212,14 @@ SYSCALL # num in r0, args r1-r6, ret in r0
   with tagged-cell pointers only need signed comparisons. Synthesize
   from `BLT` via operand-bias if unsigned compare is ever required.
   `BGE rA, rB, %L` is not in the ISA: synthesize as
-  `BLT rA, rB, %skip; B %L; :skip` (the r7-indirect branch pattern makes
-  the skip cheap). `BLT rB, rA, %L` handles the strict-greater case.
+  `BLT rA, rB, %skip; B %L; :skip` (the LI_BR-indirect branch pattern
+  makes the skip cheap). `BLT rB, rA, %L` handles the strict-greater
+  case.
 - Branch offsets are PC-relative. In the v0.1 spike they are realized by
-  loading the target address via `LI` into `r7` and jumping through the
-  register; range is therefore unbounded within the 4 GiB address space.
-  Native-encoded branches (with tighter range limits) are an optional
-  future optimization.
+  loading the target address via `LI_BR` into the reserved branch-target
+  reg and jumping through it; range is therefore unbounded within the
+  4 GiB address space. Native-encoded branches (with tighter range
+  limits) are an optional future optimization.
 - `MOV rD, rA` copies `rA` into `rD`. The source may be `sp` (read the
   current stack pointer into a GPR — used e.g. for stack-balance assertions
  around a call tree). The reverse (`MOV sp, rA`) is not provided; `sp`
@@ -246,8 +255,10 @@ SYSCALL # num in r0, args r1-r6, ret in r0
   Per-arch mechanics differ — aarch64/riscv64 `PROLOGUE` subtracts the
   frame size from `sp` and stores `lr`/`ra` at `[sp + 0]`; amd64 pops
   the retaddr native `call` already pushed into a non-P1 scratch
-  (`r11`), subtracts the frame size, then re-pushes it so the final
-  layout matches. Access slots via `MOV rX, sp` followed by
+  (`rcx`), subtracts the frame size, then re-pushes it so the final
+  layout matches. (`rcx` rather than `r11`, because `r11` is the
+  branch-target reg and `TAIL` would otherwise clobber its own
+  destination mid-epilogue.) Access slots via `MOV rX, sp` followed by
   `LD rY, rX, <off>` / `ST rY, rX, <off>`; `sp` itself isn't a valid
   base for `LD`/`ST`.
 - `TAIL %label` is a tail call — it performs the current function's
@@ -312,19 +323,21 @@ values through syscalls without per-arch save/restore dances.
 
 The per-arch expansions:
 
-- **amd64** — P1 args already occupy the native arg regs except for arg6
-  (`r9` vs. P1 `r6`/`rbx`). Expansion is `mov r9, rbx ; syscall`. The
-  kernel preserves everything except `rax`, `rcx`, `r11`; none of `rcx`
-  or `r11` are P1 registers, so the only visible clobber is P1 `r0`.
+- **amd64** — P1 args already occupy the native arg regs except for args
+  4/5/6. Three shuffle moves cover those: `mov r10, r13` (arg4 = P1 `r4`),
+  `mov r8, r14` (arg5 = P1 `r5`), `mov r9, rbx` (arg6 = P1 `r6`); then
+  `syscall`. The kernel preserves everything except `rax`, `rcx`, `r11`,
+  and `rax` = P1 `r0` is the only visible clobber.
 - **aarch64** — native arg regs are `x0`–`x5` but P1 puts args in
-  `x1`–`x5`,`x19` (one register higher). The expansion saves P1 `r1`–
-  `r5` into `x21`–`x25` (reserved for SYSCALL, see Register mapping),
-  shuffles the saved values into `x0`–`x4` plus `x19` into `x5`, moves
-  the number into `x8`, `svc #0`s, then restores `r1`–`r5` from the
-  `x21`–`x25` saves. Net cost: 12 additional moves vs. a bare
-  single-instruction syscall.
-- **riscv64** — same shape as aarch64, with `s3`–`s7` as the save slots
-  and `a7` as the number register.
+  `x1`–`x3`,`x26`,`x27`,`x19` (the three caller-saved arg regs one slot
+  higher, plus three callee-saved for `r4`–`r6`). The expansion saves
+  P1 `r1`–`r3` into `x21`–`x23`, shuffles them and `r4`/`r5`/`r6` down
+  into `x0`–`x5`, moves the number into `x8`, `svc #0`s, then restores
+  `r1`–`r3` from `x21`–`x23`. No save/restore of `r4`/`r5` is needed
+  because they live in callee-saved natives that the kernel preserves.
+- **riscv64** — same shape as aarch64, with `s3`/`s6`/`s7` as the `r1`–
+  `r3` save slots, `s4`/`s5` already holding `r4`/`r5`, and `a7` as the
+  number register.
 
 The extra moves on aarch64/riscv64 are a few nanoseconds per syscall.
 Trading them for uniform "clobbers `r0` only" semantics is worth it:
@@ -429,12 +442,11 @@ caller.
   the inline-data `LI` trick sidesteps them. Order was reversed from
   the original plan: aarch64 first (where the trick was designed), then
   amd64 and riscv64.
-2. **Broaden the demonstrated op set.** Next. Control flow (`B`, `BEQ`,
-   `CALL`, `RET`, `TAIL`), loads/stores (`LD`/`ST`/`LW`/`LB`), and the
-   rest of arithmetic/logical/shift/mul-div. All reachable with stock
-   hex2; no extensions required. A loop-and-branch demo (e.g. print
-   digits 0–9, or sum 1..N) is the natural next program — it forces
-   conditional branching through the indirect-via-r7 pattern.
+2. **Broaden the demonstrated op set.** *Done.* `demo.M1` exercises
+   control flow (`B`, `BEQ`, `BNE`, `BLT`, `CALL`, `RET`, `TAIL`),
+   loads/stores (`LD`/`ST`/`LB`/`SB`), and the full
+   arithmetic/logical/shift/mul-div set across tranches 1–5. All
+   reachable with stock hex2; no extensions required.
 3. **Generator for the ~30-op × register matrix.** *Done.*
    `p1_gen.py` is the single source of truth for all three
   `p1_<arch>.M1` defs files. Each row is an `(op, reg-tuple, imm)`
@@ -444,9 +456,11 @@ caller.
 4. **Cross-arch differential harness.** Assemble each P1 source three
    ways and diff runtime behavior. Currently eyeballed via
    `make run-all`.
-5. **Write something real.** Seed Lisp interpreter (cons, car, cdr, eq,
-   atom, cond, lambda, quote) in ~500 lines of P1, running identically
-   on all three targets.
+5. **Write something real.** *In progress.* `lisp.M1` is the seed Lisp
+   interpreter target (cons, car, cdr, eq, atom, cond, lambda, quote)
+   running identically on all three arches. Step 2 (cons/car/cdr +
+   tagged values) landed; the remaining staged steps live in
+   `LISP.md`.
 
 ## Open questions
 
diff --git a/P1_TODO.md b/P1_TODO.md
@@ -1,251 +0,0 @@
-# P1 — follow-up work from `lisp.M1` step 2
-
-Issues surfaced while bringing up step 2 (tagged values: cons/car/cdr),
-with the decision recorded for each. This is a work list, not a design
-doc; `P1.md` is the source of truth for the ISA itself.
-
-Two of these (issue 1, issue 2) were caught the hard way during step 2:
-the first build passed on aarch64 and failed on amd64/riscv64 because
-`heap_start` happened to land at different offsets from the ELF base on
-each arch, and the second build SIGILL'd because a `P1_LD_R0_R4_0` macro
-was missing from the defs files but M1 silently passed the undefined
-token through as literal text. Both are preventable upstream.
-
-## 1. Undefined `P1_*` tokens pass silently through M1
-
-**Symptom.** In step 2, `lisp.M1` used `P1_LD_R0_R4_0` (offset 0) but the
-defs files only had `P1_LD_R0_R4_8` (offset 8). M1 emitted the literal
-string `P1_LD_R0_R4_0` into the output stream rather than expanding it.
-The resulting binary SIGILL'd on aarch64 and segfaulted on amd64. No
-diagnostic from M1, hex2, or the loader — it took `xxd` + grep to find.
-
-This will keep biting us as we add more `(op, register-tuple)` entries
-to the defs: the combinatorial surface (~1200 DEFINEs per arch) is
-exactly where a missing tuple is easy to miss and hard to notice.
-
-**Options considered.**
-
-1. **Makefile lint.** Before invoking M1, grep the combined source for
-   `P1_[A-Z0-9_]*` tokens and assert every one has a matching `DEFINE`
-   in the per-arch defs file. Fails the build with the offending token
-   name. ~10 lines of shell, runs in milliseconds.
-2. **Patch M1** to error on undefined tokens. Correct fix, but M1 is
-   upstream stage0 — we don't want to fork it.
-3. **Exhaustive DEFINE table.** Generate every `(op, rD, rA, rB)` tuple
-   so no lookup can miss. Waits on issue 3 (generator).
-
-**Decision — option 1: Makefile lint.** Add a `lint` target to
-`lispcc/Makefile` that runs before each arch's assemble step. Keeps the
-toolchain unchanged; catches the bug at the exact layer where it's
-introduced (M1 input); works today with no generator. Option 3 also
-helps but doesn't cover typos (`P1_LD_R0_R4_0` where the author meant
-`P1_LD_R0_R4_8` would still silently miss if both existed and the wrong
-one was spelled).
-
-## 2. Data alignment is implicit and arch-dependent
-
-**Symptom.** `heap_start` is a label into a zero-padded byte blob.
-Nothing in hex2 or M1 forces the label's file offset (and therefore its
-runtime address under the fixed ELF base) to be 8-byte aligned. Because
-the code size between the ELF header and `heap_start` differs per arch,
-the three targets landed on different alignments:
-
-- aarch64: `0x00600560` → low 3 bits = `000` (aligned)
-- amd64: `0x006002CB` → low 3 bits = `011` (misaligned)
-- riscv64: `0x00600604` → low 3 bits = `100` (misaligned)
-
-Pair pointers carry a 3-bit tag in the low bits, so the allocator MUST
-return pointers with `ptr & 7 == 0`. If `heap_start` itself is
-misaligned, every tagged pair is corrupt.
-
-**Options considered.**
-
-1. **Application-level: align at boot + make alignment an allocator
-   contract.** `_start` rounds `heap_next` up with `(x | 7) + 1` before
-   any allocation. `alloc()` is documented to require 8-aligned sizes
-   and to return 8-aligned pointers. Document both in `P1.md`. This is
-   what step 2 already does.
-2. **hex2-level alignment directive.** Add something like `.align 8` to
-   hex2 so labels can be padded at assembly time.
-
-Investigated hex2 capabilities (both the High Level Prototype and the
-actual bootstrap AMD64 `hex2_AMD64.M1` in stage0-posix). hex2 supports
-only the pointer sigils `: ! @ $ ~ % &` and the `>` base override — no
-alignment directive, no label arithmetic. Option 2 would require forking
-hex2, which is off the table for the same reason we don't fork M1.
-
-**Decision — option 1: doc + allocator contract.** Add a §"Data
-alignment" subsection to `P1.md`:
-
-- Labels have no inherent alignment. The offset of any label from the
-  ELF base is the cumulative code/data size before it, which varies per
-  arch.
-- Programs that care about alignment (tagged pointers, cache lines,
-  atomic fields — the first being immediate for us) must align at boot
-  with `(x | mask) + 1`.
-- Any allocator that returns tagged-pointer-eligible cells MUST return
-  pointers aligned to at least the tag width (8 bytes, low-3-bit tag).
-- Caller must pass tag-width-aligned sizes to bump allocators for now.
-  When the real allocator lands in step 9, it rounds internally.
-
-## 3. The ~1200-entry DEFINE generator is now a blocker
-
-**Symptom.** Step 2 added ~13 new `(op, register-tuple)` entries per
-arch by hand. Step 3 (strings + symbol interning) and step 4 (reader)
-will need many more — load/store variants across r0–r7 × base-register
-× common offsets, more arithmetic tuples, more branches. Hand-writing
-each tuple × 3 arches is already where most of step 2's wall-clock
-went.
-
-**Options considered.**
-
-1. **Build generator before step 3.** Design a shared op table
-   (`op, rD, rA, rB` plus a template for each arch's native encoding),
-   emit `p1_<arch>.M1` from it. Pause Lisp work until done.
-2. **Defer until step 4 or 5.** Hand-add entries as each step needs
-   them; accept the per-step friction.
-3. **Lazy defs.** Only enumerate tuples actually referenced by the
-   program, via a source-scan + per-arch hand-encoder. More complex
-   than full generation.
-
-**Decision — option 1: build before step 3.** Deferring pushes the
-problem into a worse spot (each new step adds friction, and the defs
-files diverge across arches in hard-to-review ways). Design the op
-table to be appendable — tranches 1–8 already map cleanly onto rows of
-`(op, rD, rA, rB)` plus per-arch encoding templates. The generator
-unblocks step 3 and is a prerequisite for the ~1200-entry target in
-`P1.md`.
-
-This adds one item to the staged plan between steps 2 and 3.
-
-## 4. Root spill across `CALL` needs more than one scratch slot
-
-**Symptom.** `cons(car, cdr)` must keep both args live across the
-`alloc(16)` call. The P1 ABI gives each frame exactly one callee-private
-scratch cell at `[sp+8]`, which fits one value. Step 2 worked around
-this by parking `cdr` in a BSS cell (`:cons_save_cdr`), a single-threaded
-hack that breaks under the step-9 GC (which needs every live tagged
-value to be on the stack so the mark phase can find it).
-
-The same pattern will recur everywhere: anything multi-arg that calls
-into the allocator (pair constructors, list builders, environment frame
-builders, closure constructors) needs per-function root-spill slots
-proportional to its live set, not a fixed one.
-
-**Options considered.**
-
-1. **Expand `PROLOGUE` frame + per-function spill-slot count.** Bump
-   standard frame from 16 B to N B, where each function declares how
-   many callee-private cells it needs at `[sp+8]`, `[sp+16]`,
-   `[sp+24]`, etc. `PROLOGUE`/`EPILOGUE` take the slot count as an
-   immediate. Update all three arch defs files.
-2. **Keep one scratch slot, use BSS for overflow.** What step 2 does.
-   Breaks under GC, doesn't scale past single-threaded.
-3. **Caller-managed save area.** Caller subtracts from `sp` itself,
-   spills, calls, restores. Works but breaks the "`sp` only mutated by
-   `PROLOGUE`/`EPILOGUE`" invariant from `P1.md`.
-
-**Decision — option 1: expand PROLOGUE frame + spill-slot count.**
-Aligns with GC discipline (all roots on stack, walkable), keeps `sp`
-mutation localized to PROLOGUE/EPILOGUE, and scales with live-set size.
-Specifics to pin down when implementing:
-
-- Encoding of the slot count. Probably a suffix on the op name
-  (`PROLOGUE_N2`, `PROLOGUE_N3`, …) so each variant is a distinct
-  `DEFINE` and the generator can enumerate them the same way it
-  enumerates register tuples. Cap at some N (4? 8?) and grow if a
-  function ever needs more.
-- Frame size must stay 16-aligned on aarch64 — round up.
-- The existing single-slot `PROLOGUE`/`EPILOGUE` stay as aliases for
-  the N=1 case.
-
-Replace `cons_save_cdr` in `lisp.M1` with a second frame slot once the
-N-slot prologue lands.
-
-## 5. r7-indirect branch pattern is verbose but fine
-
-**Symptom.** Every branch or call compiles through three source lines:
-
-```
-P1_LI_R7
-&target
-P1_BNE_R1_R2_R7
-```
-
-`lisp.M1` has 8 such sites and will grow roughly linearly with the
-interpreter's control flow. Readers unfamiliar with P1 don't immediately
-see this is "branch if r1 != r2 to `target`."
-
-A conditional in a high-level language:
-
-```
-if x == 0 goto fail;
-```
-
-lands in P1 as:
-
-```
-P1_LI_R7
-&fail
-P1_BEQ_R1_R0_R7 ## (after loading 0 into r0)
-```
-
-**Options considered.**
-
-1. **Document the pattern only.** Add a §"Reading P1 source" subsection
-   to `P1.md` that explains: every branch is
-   `LI_R7 &target ; BXX_rA_rB_R7`; r7 is branch-scratch owned by this
-   pattern; target load + branch are always together. Once seen, it
-   reads fluently.
-2. **Introduce `*_TO` pseudos.** E.g. `P1_BNE_R1_R2_TO &target` expands
-   to the 3-op sequence. Cosmetic — saves one source line per site —
-   but it ripples through the DEFINE table (each pseudo needs per-arch
-   encoding that emits a forward-jump-around-a-literal-pool or similar)
-   and hides the r7 clobber, which is load-bearing.
-3. **Change the ISA to have a 32-bit branch-target field.** A real
-   native-encoded conditional branch would save the literal and the
-   `LI_R7`. Breaks hex2's "no label arithmetic" contract (branch
-   offsets are label_A - label_B), and hex2 is not forkable.
-
-**Decision — option 1: document the pattern only.** `*_TO` pseudos
-would require the generator to emit paired (branch, literal-pool-entry)
-per call site, and hiding the r7 clobber invites bugs when someone
-assumes r7 survives a branch. The pattern is learnable; the workaround
-for hex2's missing branch-offset support is exactly what `P1.md` calls
-out ("unbounded range within 4 GiB at the cost of one r7 clobber per
-branch"). Documenting is sufficient.
-
-Add §"Reading P1 source" under the existing §"Register mapping" in
-`P1.md`: one paragraph on the pattern, one on the r7 clobber rule (some
-of this already lives in the "r7 is not a general-purpose callee-saved
-register" paragraph, which should be cross-referenced).
-
-## Ordering
-
-All five land before step 3. Reasons:
-
-- **Issue 4 blocks step 3 directly.** Strings and symbol interning are
-  allocator-calling constructors — same shape as `cons`, more live
-  values per frame. Deferring issue 4 means reproducing the
-  `cons_save_cdr` BSS hack across every step-3 constructor, then
-  ripping them all out for step 9 (GC). Cheaper to land N-slot
-  PROLOGUE once.
-- **Issue 4 must land with issue 3, not after.** If the generator
-  emits the defs table first and N-slot PROLOGUE lands later, the
-  generator runs twice and the defs files churn. Do them together.
-- **Issues 1, 2, 5 are cheap and compounding.** Makefile lint (1)
-  prevents the exact class of bug that ate an hour in step 2 — every
-  step 3 tuple we add is another chance to mistype. Alignment doc (2)
-  and r7-pattern doc (5) are one sitting each and make step 3 review
-  easier for anyone but the author.
-
-Suggested order within the pre-step-3 block:
-
-1. Issue 1 (Makefile lint) — ~10 lines of shell, unblocks confident
-   iteration on the rest.
-2. Issue 5 (`P1.md` §"Reading P1 source") and issue 2
-   (`P1.md` §"Data alignment") — doc-only, batch into one commit.
-3. Issue 3 (generator) + issue 4 (N-slot PROLOGUE) — together.
-   Design the op table so PROLOGUE/EPILOGUE variants are rows like any
-   other `(op, reg-tuple)`. Rip out `cons_save_cdr` from `lisp.M1`
-   as the validation case.
diff --git a/demo.M1 b/demo.M1
@@ -22,7 +22,7 @@
 ##   synthesize from LD + mask/shift if needed. Signed imm12 on
 ##   LD/ST is exercised via a NEG8-offset round-trip through
 ##   scratch_mid, mirroring the ADDI NEG1/NEG3 signed-imm test.
-## Tranche 4: r7-indirect branches. B, BEQ, BNE, BLT each tested
+## Tranche 4: LI_BR-indirect branches. B, BEQ, BNE, BLT each tested
 ##   in BOTH taken and fall-through directions so an inverted
 ##   condition encoding doesn't pass. BLT additionally tested with
 ##   a negative operand (-1 < 0 signed) to prove signed semantics.
@@ -31,10 +31,10 @@
 ##   stresses PROLOGUE lr-save; TAIL unwinds the frame. Stack
 ##   balance is additionally verified by snapshotting sp via
 ##   MOV_R6_SP before tranche 5 and comparing after — a TAIL that
-##   omits its epilogue leaks fn_parent_tail's 16-byte frame and
-##   the delta folds into the accumulator. On amd64 PROLOGUE/
-##   EPILOGUE are NOPs so the delta is always 0; the check bites
-##   on aarch64/riscv64 where prologue does real sp work.
+##   omits its epilogue (or whose epilogue clobbers the branch
+##   target) leaks fn_parent_tail's 16-byte frame and the delta
+##   folds into the accumulator. All three arches now run real
+##   sp-moving PROLOGUE/EPILOGUE, so the check bites everywhere.
 ##
 ## Run-and-verify:
 ##   make PROG=demo ARCH=<arch> run && echo "exit=$?"
@@ -160,7 +160,7 @@
 '01000000'        # r3 = 1
 
 ## B — unconditional. Correct: jump to b4_1_ok, skipping clobber.
-P1_LI_R7
+P1_LI_BR
 &b4_1_ok
 P1_B
 P1_LI_R1
@@ -170,9 +170,9 @@
 ## BEQ taken: r3=0 so r2==r3.
 P1_LI_R3
 '00000000'
-P1_LI_R7
+P1_LI_BR
 &b4_2_ok
-P1_BEQ_R2_R3_R7   # 0 == 0, taken
+P1_BEQ_R2_R3      # 0 == 0, taken
 P1_LI_R1
 '00000000'
 :b4_2_ok
@@ -181,10 +181,10 @@
 
 ## BEQ fall-through: 0 != 1, branch must NOT fire. If it
 ## (incorrectly) fires we jump to b4_3_bad and clobber r1.
-P1_LI_R7
+P1_LI_BR
 &b4_3_bad
-P1_BEQ_R2_R3_R7   # 0 == 1? no, fall through
-P1_LI_R7
+P1_BEQ_R2_R3      # 0 == 1? no, fall through
+P1_LI_BR
 &b4_3_ok
 P1_B
 :b4_3_bad
@@ -193,9 +193,9 @@
 :b4_3_ok
 
 ## BNE taken: 0 != 1.
-P1_LI_R7
+P1_LI_BR
 &b4_4_ok
-P1_BNE_R2_R3_R7   # 0 != 1, taken
+P1_BNE_R2_R3      # 0 != 1, taken
 P1_LI_R1
 '00000000'
 :b4_4_ok
@@ -203,10 +203,10 @@
 ## BNE fall-through: r3=0 so r2==r3; branch must NOT fire.
 P1_LI_R3
 '00000000'
-P1_LI_R7
+P1_LI_BR
 &b4_5_bad
-P1_BNE_R2_R3_R7   # 0 != 0? no, fall through
-P1_LI_R7
+P1_BNE_R2_R3      # 0 != 0? no, fall through
+P1_LI_BR
 &b4_5_ok
 P1_B
 :b4_5_bad
@@ -217,9 +217,9 @@
 '01000000'        # restore r3 = 1
 
 ## BLT taken: 0 < 1 (signed).
-P1_LI_R7
+P1_LI_BR
 &b4_6_ok
-P1_BLT_R2_R3_R7   # 0 < 1, taken
+P1_BLT_R2_R3      # 0 < 1, taken
 P1_LI_R1
 '00000000'
 :b4_6_ok
@@ -229,10 +229,10 @@
 '01000000'        # r2 = 1
 P1_LI_R3
 '00000000'        # r3 = 0
-P1_LI_R7
+P1_LI_BR
 &b4_7_bad
-P1_BLT_R2_R3_R7   # 1 < 0? no, fall through
-P1_LI_R7
+P1_BLT_R2_R3      # 1 < 0? no, fall through
+P1_LI_BR
 &b4_7_ok
 P1_B
 :b4_7_bad
@@ -248,9 +248,9 @@
 P1_LI_R4
 '00000000'
 P1_ADDI_R4_R4_NEG1  # r4 = -1 (sign-extended)
-P1_LI_R7
+P1_LI_BR
 &b4_8_ok
-P1_BLT_R4_R2_R7   # -1 < 0 (signed)? yes, taken
+P1_BLT_R4_R2      # -1 < 0 (signed)? yes, taken
 P1_LI_R1
 '00000000'
 :b4_8_ok
@@ -269,15 +269,15 @@
 ## into the accumulator below via SUB r1, r1, delta.
 P1_MOV_R6_SP      # r6 = sp snapshot (pre-tranche)
 
-P1_LI_R7
+P1_LI_BR
 &fn_identity
 P1_CALL           # nested-CALL test: returns r1 unchanged
 
-P1_LI_R7
+P1_LI_BR
 &fn_parent_tail
 P1_CALL           # TAIL test: fn_identity RETs to here
 
-P1_LI_R7
+P1_LI_BR
 &b5_end
 P1_B              # skip over the inlined function bodies
 
@@ -288,7 +288,7 @@
 :fn_identity
 P1_PROLOGUE
-P1_LI_R7
+P1_LI_BR
 &fn_inner
 P1_CALL
 P1_EPILOGUE
@@ -296,7 +296,7 @@
 :fn_parent_tail
 P1_PROLOGUE
-P1_LI_R7
+P1_LI_BR
 &fn_identity
 P1_TAIL
diff --git a/lisp.M1 b/lisp.M1
@@ -64,7 +64,7 @@
 ## cons(r1=fixnum(42), r2=nil=0x07).
 P1_LI_R2
 '07000000'        ## r2 = nil
-P1_LI_R7
+P1_LI_BR
 &cons
 P1_CALL           ## r0 = pair pointer|2
 P1_MOV_R6_R0      ## r6 = pair (callee-saved across calls)
@@ -74,24 +74,24 @@
 P1_ANDI_R1_R1_7
 P1_LI_R2
 '02000000'
-P1_LI_R7
+P1_LI_BR
 &not_pair_err
-P1_BNE_R1_R2_R7
+P1_BNE_R1_R2
 
 ## Check cdr(pair) == nil.
 P1_MOV_R1_R6
-P1_LI_R7
+P1_LI_BR
 &cdr
 P1_CALL           ## r0 = cdr
 P1_LI_R2
 '07000000'
-P1_LI_R7
+P1_LI_BR
 &not_nil_err
-P1_BNE_R0_R2_R7
+P1_BNE_R0_R2
 
 ## x = car(pair); expect tag == fixnum.
 P1_MOV_R1_R6
-P1_LI_R7
+P1_LI_BR
 &car
 P1_CALL           ## r0 = car
 P1_MOV_R6_R0      ## r6 = car (callee-saved)
@@ -100,9 +100,9 @@
 P1_ANDI_R1_R1_7
 P1_LI_R2
 '01000000'
-P1_LI_R7
+P1_LI_BR
 &not_fixnum_err
-P1_BNE_R1_R2_R7
+P1_BNE_R1_R2
 
 ## exit(car >> 3) — yields 42.
P1_MOV_R1_R6 @@ -120,27 +120,29 @@ ## (ptr | PAIR_TAG). Uses the N=2 prologue — car lives in frame slot 1 ## at [sp+8], cdr in slot 2 at [sp+16] — to shepherd both args across ## alloc(). Both roots stay on the walkable stack, which is the -## discipline the step-9 GC will depend on (P1_TODO.md issue 4). +## discipline the step-9 mark-sweep GC will depend on. :cons P1_PROLOGUE_N2 - ## Spill car to slot 1 [sp+8], cdr to slot 2 [sp+16]. - P1_MOV_R4_SP - P1_ST_R1_R4_8 - P1_ST_R2_R4_16 + ## Spill car to slot 1 [sp+8], cdr to slot 2 [sp+16]. r3 is caller- + ## saved so we can clobber it freely as an sp-temp for the ST/LD + ## displacements. + P1_MOV_R3_SP + P1_ST_R1_R3_8 + P1_ST_R2_R3_16 ## alloc(16). 16 is already 8-byte aligned. P1_LI_R1 '10000000' - P1_LI_R7 + P1_LI_BR &alloc P1_CALL ## r0 = raw (untagged) pair ptr ## [ptr+0] = car ; [ptr+8] = cdr. - P1_MOV_R4_SP - P1_LD_R1_R4_8 ## r1 = car + P1_MOV_R3_SP + P1_LD_R1_R3_8 ## r1 = car P1_ST_R1_R0_0 - P1_LD_R2_R4_16 ## r2 = cdr + P1_LD_R2_R3_16 ## r2 = cdr P1_ST_R2_R0_8 ## Tag as a pair and return. @@ -170,26 +172,29 @@ ## Bump allocator over [heap_start, heap_end). Caller must pass an ## 8-byte-aligned size (caller-rounds for now). On overflow branches to ## :alloc_oom, which funnels through `error`. +## +## Uses only r0-r3 (caller-saved) so it needn't preserve any callee- +## saved regs across its own body. Keeps the PROLOGUE/EPILOGUE just for +## the lr save — the N=0 frame has no slots. 
:alloc P1_PROLOGUE - P1_LI_R4 + P1_LI_R3 &heap_next - P1_LD_R3_R4_0 ## r3 = *heap_next (old bump ptr) + P1_LD_R0_R3_0 ## r0 = *heap_next (old bump ptr; also return value) - P1_ADD_R2_R3_R1 ## r2 = new_next = old + size + P1_ADD_R2_R0_R1 ## r2 = new_next = old + size - P1_LI_R5 + P1_LI_R1 &heap_end - P1_LD_R0_R5_0 ## r0 = *heap_end + P1_LD_R1_R1_0 ## r1 = *heap_end ## heap_end < new_next -> OOM - P1_LI_R7 + P1_LI_BR &alloc_oom - P1_BLT_R0_R2_R7 + P1_BLT_R1_R2 - P1_ST_R2_R4_0 ## *heap_next = new_next - P1_MOV_R0_R3 ## return old bump ptr + P1_ST_R2_R3_0 ## *heap_next = new_next P1_EPILOGUE P1_RET @@ -197,7 +202,7 @@ ## ---- error(msg_ptr, msg_len) ---------------------------------------- ## Writes "error: " + msg + "\n" to fd 2, then exit(1). Never returns. -## Callers reach this via LI_R7 &error ; P1_B (branch, not call) and +## Callers reach this via LI_BR &error ; P1_B (branch, not call) and ## arrange r1 = msg_ptr, r2 = msg_len beforehand. :error ## Spill msg_ptr/msg_len into callee-saved regs across the write @@ -247,14 +252,14 @@ ## ---- Error landing pads --------------------------------------------- ## Plain branch targets (no PROLOGUE, no return). Load (msg_ptr, msg_len) -## into r1/r2, then r7-indirect branch into `error`. Keep message-length -## imm32s in sync with the strings below. +## into r1/r2, then LI_BR-indirect branch into `error`. Keep message- +## length imm32s in sync with the strings below. 
:alloc_oom P1_LI_R1 &msg_oom P1_LI_R2 '0E000000' ## strlen("heap exhausted") = 14 - P1_LI_R7 + P1_LI_BR &error P1_B @@ -263,7 +268,7 @@ &msg_not_pair P1_LI_R2 '0D000000' ## strlen("expected pair") = 13 - P1_LI_R7 + P1_LI_BR &error P1_B @@ -272,7 +277,7 @@ &msg_not_nil P1_LI_R2 '0C000000' ## strlen("expected nil") = 12 - P1_LI_R7 + P1_LI_BR &error P1_B @@ -281,7 +286,7 @@ &msg_not_fixnum P1_LI_R2 '0F000000' ## strlen("expected fixnum") = 15 - P1_LI_R7 + P1_LI_BR &error P1_B diff --git a/p1_aarch64.M1 b/p1_aarch64.M1 @@ -10,13 +10,14 @@ DEFINE P1_LI_R0 4000001802000014 DEFINE P1_LI_R1 4100001802000014 DEFINE P1_LI_R2 4200001802000014 DEFINE P1_LI_R3 4300001802000014 -DEFINE P1_LI_R4 4400001802000014 -DEFINE P1_LI_R5 4500001802000014 +DEFINE P1_LI_R4 5A00001802000014 +DEFINE P1_LI_R5 5B00001802000014 DEFINE P1_LI_R6 5300001802000014 DEFINE P1_LI_R7 5400001802000014 +DEFINE P1_LI_BR 5100001802000014 ## ---- SYSCALL / SYSOPEN — uniform (clobbers r0 only) across arches -DEFINE P1_SYSCALL E80300AAF50301AAF60302AAF70303AAF80304AAF90305AAE00315AAE10316AAE20317AAE30318AAE40319AAE50313AA010000D4E10315AAE20316AAE30317AAE40318AAE50319AA +DEFINE P1_SYSCALL E80300AAF50301AAF60302AAF70303AAE00315AAE10316AAE20317AAE3031AAAE4031BAAE50313AA010000D4E10315AAE20316AAE30317AA DEFINE P1_SYSOPEN 600C8092080780D2010000D4 ## ---- Linux syscall numbers (per-arch table). LE-32 immediate operands for LI. 
@@ -27,31 +28,31 @@ DEFINE SYS_CLOSE 39000000 ## ---- Reg-reg-reg arithmetic (tranche 1) -------------------------- DEFINE P1_ADD_R1_R1_R2 2100028B -DEFINE P1_ADD_R1_R1_R4 2100048B +DEFINE P1_ADD_R1_R1_R4 21001A8B DEFINE P1_ADD_R2_R2_R6 4200138B DEFINE P1_ADD_R2_R3_R1 6200018B DEFINE P1_SUB_R1_R1_R2 210002CB DEFINE P1_SUB_R2_R2_R6 420013CB -DEFINE P1_AND_R1_R1_R5 2100058A +DEFINE P1_AND_R1_R1_R5 21001B8A DEFINE P1_OR_R1_R1_R2 210002AA DEFINE P1_XOR_R1_R1_R2 210002CA DEFINE P1_MUL_R1_R1_R2 217C029B DEFINE P1_DIV_R1_R1_R2 210CC29A -DEFINE P1_REM_R1_R1_R5 300CC59A0106059B +DEFINE P1_REM_R1_R1_R5 300CDB9A01061B9B DEFINE P1_SHL_R1_R1_R2 2120C29A DEFINE P1_SHR_R1_R1_R2 2124C29A -DEFINE P1_SAR_R4_R4_R2 8428C29A +DEFINE P1_SAR_R4_R4_R2 5A2BC29A ## ---- Immediate arithmetic (tranche 2) ---------------------------- DEFINE P1_ADDI_R1_R1_3 210C0091 DEFINE P1_ADDI_R1_R1_1 21040091 DEFINE P1_ADDI_R1_R1_NEG3 210C00D1 -DEFINE P1_ADDI_R4_R4_NEG1 840400D1 +DEFINE P1_ADDI_R4_R4_NEG1 5A0700D1 DEFINE P1_ADDI_R1_R1_NEG2 210800D1 DEFINE P1_ADDI_R0_R0_1 00040091 DEFINE P1_SHLI_R1_R1_1 21F87FD3 DEFINE P1_SHRI_R1_R1_1 21FC41D3 -DEFINE P1_SARI_R4_R4_1 84FC4193 +DEFINE P1_SARI_R4_R4_1 5AFF4193 DEFINE P1_ANDI_R1_R1_6 21047F92 DEFINE P1_ANDI_R1_R1_7 21084092 DEFINE P1_ORI_R1_R1_1 210040B2 @@ -59,38 +60,38 @@ DEFINE P1_ORI_R0_R0_2 00007FB2 DEFINE P1_ORI_R0_R0_7 000840B2 ## ---- LA + memory ops (tranche 3) --------------------------------- -DEFINE P1_LA_R4 4400001802000014 -DEFINE P1_ST_R1_R4_0 810000F9 -DEFINE P1_LD_R1_R4_0 810040F9 -DEFINE P1_ST_R1_R4_8 810400F9 -DEFINE P1_LD_R1_R4_8 810440F9 -DEFINE P1_SB_R1_R4_16 81400039 -DEFINE P1_LB_R1_R4_16 81404039 -DEFINE P1_ST_R1_R4_NEG8 81801FF8 -DEFINE P1_LD_R1_R4_NEG8 81805FF8 +DEFINE P1_LA_R4 5A00001802000014 +DEFINE P1_ST_R1_R4_0 410300F9 +DEFINE P1_LD_R1_R4_0 410340F9 +DEFINE P1_ST_R1_R4_8 410700F9 +DEFINE P1_LD_R1_R4_8 410740F9 +DEFINE P1_SB_R1_R4_16 41430039 +DEFINE P1_LB_R1_R4_16 41434039 +DEFINE P1_ST_R1_R4_NEG8 41831FF8 +DEFINE 
P1_LD_R1_R4_NEG8 41835FF8 -## ---- Branches (tranche 4, r7-indirect) --------------------------- -DEFINE P1_B 80021FD6 -DEFINE P1_BEQ_R2_R3_R7 5F0003EB4100005480021FD6 -DEFINE P1_BNE_R2_R3_R7 5F0003EB4000005480021FD6 -DEFINE P1_BLT_R2_R3_R7 5F0003EB4A00005480021FD6 -DEFINE P1_BLT_R4_R2_R7 9F0002EB4A00005480021FD6 +## ---- Branches (tranche 4, LI_BR-indirect) ------------------------ +DEFINE P1_B 20021FD6 +DEFINE P1_BEQ_R2_R3 5F0003EB4100005420021FD6 +DEFINE P1_BNE_R2_R3 5F0003EB4000005420021FD6 +DEFINE P1_BLT_R2_R3 5F0003EB4A00005420021FD6 +DEFINE P1_BLT_R4_R2 5F0302EB4A00005420021FD6 ## ---- Control: CALL/RET + single-slot and N-slot PROLOGUE/EPILOGUE/TAIL DEFINE P1_PROLOGUE FF4300D1FE0300F9 DEFINE P1_EPILOGUE FE0340F9FF430091 DEFINE P1_RET C0035FD6 -DEFINE P1_CALL 80023FD6 -DEFINE P1_TAIL FE0340F9FF43009180021FD6 +DEFINE P1_CALL 20023FD6 +DEFINE P1_TAIL FE0340F9FF43009120021FD6 DEFINE P1_PROLOGUE_N2 FF8300D1FE0300F9 DEFINE P1_EPILOGUE_N2 FE0340F9FF830091 -DEFINE P1_TAIL_N2 FE0340F9FF83009180021FD6 +DEFINE P1_TAIL_N2 FE0340F9FF83009120021FD6 DEFINE P1_PROLOGUE_N3 FF8300D1FE0300F9 DEFINE P1_EPILOGUE_N3 FE0340F9FF830091 -DEFINE P1_TAIL_N3 FE0340F9FF83009180021FD6 +DEFINE P1_TAIL_N3 FE0340F9FF83009120021FD6 DEFINE P1_PROLOGUE_N4 FFC300D1FE0300F9 DEFINE P1_EPILOGUE_N4 FE0340F9FFC30091 -DEFINE P1_TAIL_N4 FE0340F9FFC3009180021FD6 +DEFINE P1_TAIL_N4 FE0340F9FFC3009120021FD6 ## ---- Seed-Lisp step 1 extensions (tranche 6) --------------------- DEFINE P1_MOV_R1_R6 E10313AA @@ -102,28 +103,38 @@ DEFINE P1_MOV_R7_R2 F40302AA DEFINE P1_MOV_R2_R6 E20313AA DEFINE P1_MOV_R3_R7 E30314AA DEFINE P1_MOV_R2_R7 E20314AA -DEFINE P1_MOV_R4_R7 E40314AA +DEFINE P1_MOV_R4_R7 FA0314AA DEFINE P1_MOV_R2_SP E2030091 -DEFINE P1_MOV_R4_SP E4030091 +DEFINE P1_MOV_R3_SP E3030091 +DEFINE P1_MOV_R4_SP FA030091 DEFINE P1_MOV_R6_SP F3030091 DEFINE P1_MOV_R2_R0 E20300AA DEFINE P1_LD_R0_R6_0 600240F9 DEFINE P1_LD_R1_R6_16 610A40F9 -DEFINE P1_LD_R3_R4_0 830040F9 -DEFINE P1_LD_R0_R5_0 A00040F9 -DEFINE 
P1_LB_R1_R4_0 81004039 -DEFINE P1_ST_R2_R4_0 820000F9 -DEFINE P1_ST_R0_R4_8 800400F9 -DEFINE P1_LD_R0_R4_8 800440F9 +DEFINE P1_LD_R3_R4_0 430340F9 +DEFINE P1_LD_R0_R5_0 600340F9 +DEFINE P1_LB_R1_R4_0 41034039 +DEFINE P1_ST_R2_R4_0 420300F9 +DEFINE P1_ST_R0_R4_8 400700F9 +DEFINE P1_LD_R0_R4_8 400740F9 DEFINE P1_LB_R1_R0_0 01004039 DEFINE P1_LD_R0_R1_0 200040F9 DEFINE P1_LD_R0_R1_8 200440F9 DEFINE P1_ST_R1_R0_0 010000F9 -DEFINE P1_LD_R2_R4_0 820040F9 +DEFINE P1_LD_R2_R4_0 420340F9 DEFINE P1_ST_R2_R0_8 020400F9 -DEFINE P1_LD_R0_R4_0 800040F9 -DEFINE P1_ST_R2_R4_16 820800F9 -DEFINE P1_LD_R2_R4_16 820840F9 -DEFINE P1_BLT_R0_R2_R7 1F0002EB4A00005480021FD6 -DEFINE P1_BNE_R1_R2_R7 3F0002EB4000005480021FD6 -DEFINE P1_BNE_R0_R2_R7 1F0002EB4000005480021FD6 +DEFINE P1_LD_R0_R4_0 400340F9 +DEFINE P1_ST_R2_R4_16 420B00F9 +DEFINE P1_LD_R2_R4_16 420B40F9 +DEFINE P1_LD_R0_R3_0 600040F9 +DEFINE P1_ST_R2_R3_0 620000F9 +DEFINE P1_ST_R1_R3_8 610400F9 +DEFINE P1_LD_R1_R3_8 610440F9 +DEFINE P1_ST_R2_R3_16 620800F9 +DEFINE P1_LD_R2_R3_16 620840F9 +DEFINE P1_LD_R1_R1_0 210040F9 +DEFINE P1_ADD_R2_R0_R1 0200018B +DEFINE P1_BLT_R0_R2 1F0002EB4A00005420021FD6 +DEFINE P1_BLT_R1_R2 3F0002EB4A00005420021FD6 +DEFINE P1_BNE_R1_R2 3F0002EB4000005420021FD6 +DEFINE P1_BNE_R0_R2 1F0002EB4000005420021FD6 diff --git a/p1_amd64.M1 b/p1_amd64.M1 @@ -10,13 +10,14 @@ DEFINE P1_LI_R0 B8 DEFINE P1_LI_R1 BF DEFINE P1_LI_R2 BE DEFINE P1_LI_R3 BA -DEFINE P1_LI_R4 41BA -DEFINE P1_LI_R5 41B8 +DEFINE P1_LI_R4 41BD +DEFINE P1_LI_R5 41BE DEFINE P1_LI_R6 BB DEFINE P1_LI_R7 41BC +DEFINE P1_LI_BR 41BB ## ---- SYSCALL / SYSOPEN — uniform (clobbers r0 only) across arches -DEFINE P1_SYSCALL 4989D90F05 +DEFINE P1_SYSCALL 4D89EA4D89F04989D90F05 DEFINE P1_SYSOPEN B8020000000F05 ## ---- Linux syscall numbers (per-arch table). LE-32 immediate operands for LI. 
@@ -27,31 +28,31 @@ DEFINE SYS_CLOSE 03000000 ## ---- Reg-reg-reg arithmetic (tranche 1) -------------------------- DEFINE P1_ADD_R1_R1_R2 4889FF4801F7 -DEFINE P1_ADD_R1_R1_R4 4889FF4C01D7 +DEFINE P1_ADD_R1_R1_R4 4889FF4C01EF DEFINE P1_ADD_R2_R2_R6 4889F64801DE DEFINE P1_ADD_R2_R3_R1 4889D64801FE DEFINE P1_SUB_R1_R1_R2 4889FF4829F7 DEFINE P1_SUB_R2_R2_R6 4889F64829DE -DEFINE P1_AND_R1_R1_R5 4889FF4C21C7 +DEFINE P1_AND_R1_R1_R5 4889FF4C21F7 DEFINE P1_OR_R1_R1_R2 4889FF4809F7 DEFINE P1_XOR_R1_R1_R2 4889FF4831F7 DEFINE P1_MUL_R1_R1_R2 4889FF480FAFFE DEFINE P1_DIV_R1_R1_R2 4989C34889D14889F8489948F7FE4889C74889CA4C89D8 -DEFINE P1_REM_R1_R1_R5 4989C34889D14889F8489949F7F84889D74889CA4C89D8 +DEFINE P1_REM_R1_R1_R5 4989C34889D14889F8489949F7FE4889D74889CA4C89D8 DEFINE P1_SHL_R1_R1_R2 4889FF4889F148D3E7 DEFINE P1_SHR_R1_R1_R2 4889FF4889F148D3EF -DEFINE P1_SAR_R4_R4_R2 4D89D24889F149D3FA +DEFINE P1_SAR_R4_R4_R2 4D89ED4889F149D3FD ## ---- Immediate arithmetic (tranche 2) ---------------------------- DEFINE P1_ADDI_R1_R1_3 4889FF4883C703 DEFINE P1_ADDI_R1_R1_1 4889FF4883C701 DEFINE P1_ADDI_R1_R1_NEG3 4889FF4883C7FD -DEFINE P1_ADDI_R4_R4_NEG1 4D89D24983C2FF +DEFINE P1_ADDI_R4_R4_NEG1 4D89ED4983C5FF DEFINE P1_ADDI_R1_R1_NEG2 4889FF4883C7FE DEFINE P1_ADDI_R0_R0_1 4889C04883C001 DEFINE P1_SHLI_R1_R1_1 4889FF48C1E701 DEFINE P1_SHRI_R1_R1_1 4889FF48C1EF01 -DEFINE P1_SARI_R4_R4_1 4D89D249C1FA01 +DEFINE P1_SARI_R4_R4_1 4D89ED49C1FD01 DEFINE P1_ANDI_R1_R1_6 4889FF4883E706 DEFINE P1_ANDI_R1_R1_7 4889FF4883E707 DEFINE P1_ORI_R1_R1_1 4889FF4883CF01 @@ -59,38 +60,38 @@ DEFINE P1_ORI_R0_R0_2 4889C04883C802 DEFINE P1_ORI_R0_R0_7 4889C04883C807 ## ---- LA + memory ops (tranche 3) --------------------------------- -DEFINE P1_LA_R4 41BA -DEFINE P1_ST_R1_R4_0 49897A00 -DEFINE P1_LD_R1_R4_0 498B7A00 -DEFINE P1_ST_R1_R4_8 49897A08 -DEFINE P1_LD_R1_R4_8 498B7A08 -DEFINE P1_SB_R1_R4_16 49887A10 -DEFINE P1_LB_R1_R4_16 490FB67A10 -DEFINE P1_ST_R1_R4_NEG8 49897AF8 -DEFINE P1_LD_R1_R4_NEG8 498B7AF8 
+DEFINE P1_LA_R4 41BD +DEFINE P1_ST_R1_R4_0 49897D00 +DEFINE P1_LD_R1_R4_0 498B7D00 +DEFINE P1_ST_R1_R4_8 49897D08 +DEFINE P1_LD_R1_R4_8 498B7D08 +DEFINE P1_SB_R1_R4_16 49887D10 +DEFINE P1_LB_R1_R4_16 490FB67D10 +DEFINE P1_ST_R1_R4_NEG8 49897DF8 +DEFINE P1_LD_R1_R4_NEG8 498B7DF8 -## ---- Branches (tranche 4, r7-indirect) --------------------------- -DEFINE P1_B 41FFE4 -DEFINE P1_BEQ_R2_R3_R7 4839D6750341FFE4 -DEFINE P1_BNE_R2_R3_R7 4839D6740341FFE4 -DEFINE P1_BLT_R2_R3_R7 4839D67D0341FFE4 -DEFINE P1_BLT_R4_R2_R7 4939F27D0341FFE4 +## ---- Branches (tranche 4, LI_BR-indirect) ------------------------ +DEFINE P1_B 41FFE3 +DEFINE P1_BEQ_R2_R3 4839D6750341FFE3 +DEFINE P1_BNE_R2_R3 4839D6740341FFE3 +DEFINE P1_BLT_R2_R3 4839D67D0341FFE3 +DEFINE P1_BLT_R4_R2 4939F57D0341FFE3 ## ---- Control: CALL/RET + single-slot and N-slot PROLOGUE/EPILOGUE/TAIL -DEFINE P1_PROLOGUE 415B4883EC104153 -DEFINE P1_EPILOGUE 415B4883C4104153 +DEFINE P1_PROLOGUE 594883EC1051 +DEFINE P1_EPILOGUE 594883C41051 DEFINE P1_RET C3 -DEFINE P1_CALL 41FFD4 -DEFINE P1_TAIL 415B4883C410415341FFE4 -DEFINE P1_PROLOGUE_N2 415B4883EC204153 -DEFINE P1_EPILOGUE_N2 415B4883C4204153 -DEFINE P1_TAIL_N2 415B4883C420415341FFE4 -DEFINE P1_PROLOGUE_N3 415B4883EC204153 -DEFINE P1_EPILOGUE_N3 415B4883C4204153 -DEFINE P1_TAIL_N3 415B4883C420415341FFE4 -DEFINE P1_PROLOGUE_N4 415B4883EC304153 -DEFINE P1_EPILOGUE_N4 415B4883C4304153 -DEFINE P1_TAIL_N4 415B4883C430415341FFE4 +DEFINE P1_CALL 41FFD3 +DEFINE P1_TAIL 594883C4105141FFE3 +DEFINE P1_PROLOGUE_N2 594883EC2051 +DEFINE P1_EPILOGUE_N2 594883C42051 +DEFINE P1_TAIL_N2 594883C4205141FFE3 +DEFINE P1_PROLOGUE_N3 594883EC2051 +DEFINE P1_EPILOGUE_N3 594883C42051 +DEFINE P1_TAIL_N3 594883C4205141FFE3 +DEFINE P1_PROLOGUE_N4 594883EC3051 +DEFINE P1_EPILOGUE_N4 594883C43051 +DEFINE P1_TAIL_N4 594883C4305141FFE3 ## ---- Seed-Lisp step 1 extensions (tranche 6) --------------------- DEFINE P1_MOV_R1_R6 4889DF @@ -102,28 +103,38 @@ DEFINE P1_MOV_R7_R2 4989F4 DEFINE P1_MOV_R2_R6 4889DE 
DEFINE P1_MOV_R3_R7 4C89E2 DEFINE P1_MOV_R2_R7 4C89E6 -DEFINE P1_MOV_R4_R7 4D89E2 +DEFINE P1_MOV_R4_R7 4D89E5 DEFINE P1_MOV_R2_SP 4889E6 -DEFINE P1_MOV_R4_SP 4989E2 +DEFINE P1_MOV_R3_SP 4889E2 +DEFINE P1_MOV_R4_SP 4989E5 DEFINE P1_MOV_R6_SP 4889E3 DEFINE P1_MOV_R2_R0 4889C6 DEFINE P1_LD_R0_R6_0 488B4300 DEFINE P1_LD_R1_R6_16 488B7B10 -DEFINE P1_LD_R3_R4_0 498B5200 -DEFINE P1_LD_R0_R5_0 498B4000 -DEFINE P1_LB_R1_R4_0 490FB67A00 -DEFINE P1_ST_R2_R4_0 49897200 -DEFINE P1_ST_R0_R4_8 49894208 -DEFINE P1_LD_R0_R4_8 498B4208 +DEFINE P1_LD_R3_R4_0 498B5500 +DEFINE P1_LD_R0_R5_0 498B4600 +DEFINE P1_LB_R1_R4_0 490FB67D00 +DEFINE P1_ST_R2_R4_0 49897500 +DEFINE P1_ST_R0_R4_8 49894508 +DEFINE P1_LD_R0_R4_8 498B4508 DEFINE P1_LB_R1_R0_0 480FB67800 DEFINE P1_LD_R0_R1_0 488B4700 DEFINE P1_LD_R0_R1_8 488B4708 DEFINE P1_ST_R1_R0_0 48897800 -DEFINE P1_LD_R2_R4_0 498B7200 +DEFINE P1_LD_R2_R4_0 498B7500 DEFINE P1_ST_R2_R0_8 48897008 -DEFINE P1_LD_R0_R4_0 498B4200 -DEFINE P1_ST_R2_R4_16 49897210 -DEFINE P1_LD_R2_R4_16 498B7210 -DEFINE P1_BLT_R0_R2_R7 4839F07D0341FFE4 -DEFINE P1_BNE_R1_R2_R7 4839F7740341FFE4 -DEFINE P1_BNE_R0_R2_R7 4839F0740341FFE4 +DEFINE P1_LD_R0_R4_0 498B4500 +DEFINE P1_ST_R2_R4_16 49897510 +DEFINE P1_LD_R2_R4_16 498B7510 +DEFINE P1_LD_R0_R3_0 488B4200 +DEFINE P1_ST_R2_R3_0 48897200 +DEFINE P1_ST_R1_R3_8 48897A08 +DEFINE P1_LD_R1_R3_8 488B7A08 +DEFINE P1_ST_R2_R3_16 48897210 +DEFINE P1_LD_R2_R3_16 488B7210 +DEFINE P1_LD_R1_R1_0 488B7F00 +DEFINE P1_ADD_R2_R0_R1 4889C64801FE +DEFINE P1_BLT_R0_R2 4839F07D0341FFE3 +DEFINE P1_BLT_R1_R2 4839F77D0341FFE3 +DEFINE P1_BNE_R1_R2 4839F7740341FFE3 +DEFINE P1_BNE_R0_R2 4839F0740341FFE3 diff --git a/p1_gen.py b/p1_gen.py @@ -7,9 +7,9 @@ rewrites all three in place. 
Why a generator: the hand-written defs diverge across arches in hard-to-review ways, typos silently produce SIGILL'ing binaries (M1 -passes undefined tokens through as literal text — see P1_TODO.md issue -1), and the combinatorial surface (~1200 DEFINEs per arch if fully -enumerated) is past the point of hand-maintainability. +passes undefined tokens through as literal text — see `lint.sh` for +the pre-assemble check), and the combinatorial surface (~1200 DEFINEs +per arch if fully enumerated) is past the point of hand-maintainability. Design: * OPS is a list of emission rows. Each row is a small class whose @@ -39,10 +39,15 @@ ARCHES = ('aarch64', 'amd64', 'riscv64') ## the per-arch encoders insert into instruction fields; the human-facing ## names (rax, x1, a2, …) never appear in this file. +## 4:4 caller/callee-saved split. r0–r3 caller (native argregs); r4–r7 +## callee (native callee-saved). `br` is the hidden branch-target scratch +## (not a P1 reg) — picked so every op's expansion clobbers only what its +## name declares. NAT_AA64 = {'r0': 0, 'r1': 1, 'r2': 2, 'r3': 3, - 'r4': 4, 'r5': 5, 'r6': 19, 'r7': 20, + 'r4': 26, 'r5': 27, 'r6': 19, 'r7': 20, + 'br': 17, # x17 (IP1, caller-saved linker scratch) 'sp': 31, 'xzr': 31, 'lr': 30, - 'x21': 21, 'x22': 22, 'x23': 23, 'x24': 24, 'x25': 25, 'x8': 8} + 'x21': 21, 'x22': 22, 'x23': 23, 'x8': 8} ## amd64 ModRM.reg/rm + REX.R/B bit: native regnums 0..15 with r8..r15 ## setting the REX bit. We store the 4-bit native number directly. 
@@ -50,20 +55,24 @@ NAT_AMD64 = {'r0': 0, # rax 'r1': 7, # rdi 'r2': 6, # rsi 'r3': 2, # rdx - 'r4': 10, # r10 - 'r5': 8, # r8 + 'r4': 13, # r13 (callee-saved) + 'r5': 14, # r14 (callee-saved) 'r6': 3, # rbx 'r7': 12, # r12 + 'br': 11, # r11 — branch/call target scratch + DIV/REM r0 save 'sp': 4, # rsp - 'rcx': 1, # shift-count scratch (not a P1 reg) - 'r9': 9, # syscall arg6 slot - 'r11': 11, # call-retaddr scratch in PROLOGUE + 'rcx': 1, # shift-count scratch + DIV/REM rdx save (not a P1 reg) + 'r10': 10, # syscall arg4 slot (not a P1 reg) + 'r8': 8, # syscall arg5 slot (not a P1 reg) + 'r9': 9, # syscall arg6 slot (not a P1 reg) + 'r11': 11, # alias for br (some expansions spell it r11 directly) } NAT_RV64 = {'r0': 10, 'r1': 11, 'r2': 12, 'r3': 13, - 'r4': 14, 'r5': 15, 'r6': 9, 'r7': 18, + 'r4': 20, 'r5': 21, 'r6': 9, 'r7': 18, + 'br': 30, # t5 (caller-saved temp) 'sp': 2, 'ra': 1, 'zero': 0, 'a7': 17, - 's3': 19, 's4': 20, 's5': 21, 's6': 22, 's7': 23} + 's3': 19, 's6': 22, 's7': 23} ## ---------- Low-level encoding helpers ----------------------------------- @@ -521,19 +530,16 @@ class Mem(Op): if self.op == 'SB': return rv_s(0x00000023, self.rT, self.rN, self.off) -## --- Branches: r7-indirect pattern --- +## --- Branches: LI_BR-indirect pattern --- @dataclass class B(Op): def encode(self, arch): if arch == 'aarch64': - # BR x20 — 0xD61F0280 - return le32(0xD61F0280) + return le32(0xD61F0000 | (NAT_AA64['br'] << 5)) # BR x17 if arch == 'amd64': - # jmp r12 — 41 FF E4 - return '41FFE4' + return '41FFE3' # jmp r11 if arch == 'riscv64': - # jalr x0, 0(s2) — opcode 67 00 09 00 - return le32(0x00090067) + return le32(0x00000067 | (NAT_RV64['br'] << 15)) # jalr x0, 0(t5) @dataclass class CondB(Op): @@ -550,37 +556,25 @@ class CondB(Op): # Skip when NOT cond holds. BEQ→NE(1), BNE→EQ(0), BLT→GE(A). 
cond = {'BEQ': 1, 'BNE': 0, 'BLT': 10}[self.op] - bcond = le32(0x54000040 | cond) - br_x20 = le32(0xD61F0280) - return cmp + bcond + br_x20 + br = le32(0xD61F0000 | (NAT_AA64['br'] << 5)) + return cmp + bcond + br if arch == 'amd64': a, b = NAT_AMD64[self.rA], NAT_AMD64[self.rB] # cmp rA, rB — opcode 39 /r with rA as r/m cmp_ = rex(1, b >> 3, 0, a >> 3) + '39' + modrm(3, b, a) - # jcc rel8 opcode, skip=3 (past jmp r12): + # jcc rel8 opcode, skip=3 (past jmp r11): # BEQ→JNE 75 03 ; BNE→JE 74 03 ; BLT→JGE 7D 03 jop = {'BEQ': '75', 'BNE': '74', 'BLT': '7D'}[self.op] - jmp_r12 = '41FFE4' - return cmp_ + jop + '03' + jmp_r12 + return cmp_ + jop + '03' + '41FFE3' # jmp r11 if arch == 'riscv64': - # B<inv> rA, rB, +8 ; jalr x0, 0(s2) + # B<inv> rA, rB, +8 ; jalr x0, 0(t5) a, b = NAT_RV64[self.rA], NAT_RV64[self.rB] - # B-type imm=8 encoding: imm[12|10:5] = 0, imm[4:1|11] = 0b0100_0. - # That is: [31]=0 [30:25]=0 [11:8]=0b0100 [7]=0 → imm[11:7] = 0b01000, imm[31:25]=0. - # Bytes: xx xx x? 00 with byte[1]_upper=0x4 for the rd-field=8. # funct3 picks the op: BEQ→BNE(1), BNE→BEQ(0), BLT→BGE(5). funct3 = {'BEQ': 1, 'BNE': 0, 'BLT': 5}[self.op] - # Build B-type: opcode=0x63, funct3, rs1=rA, rs2=rB, imm=+8. - # imm encoding for +8: - # bit 12 = 0 - # bits 10:5 = 000000 - # bits 4:1 = 0100 - # bit 11 = 0 - insn = 0x00000063 | (funct3 << 12) | (a << 15) | (b << 20) - # imm bits: [11:7] carries imm[4:1,11]; [31:25] carries imm[12,10:5] - # For +8: imm[4:1]=0b0100 → [10:7] = 0b1000 → byte pos [11:7] = 0b01000 = 8 - insn |= (8 << 7) # bits [11:8] = 0b0100, bit 7=0 - jalr_x0_s2 = 0x00090067 - return le32(insn) + le32(jalr_x0_s2) + # B-type with imm=+8: bits [11:8]=0b0100 land in byte[1]'s low nibble (=4); the high nibble carries funct3. 
+ insn = 0x00000063 | (funct3 << 12) | (a << 15) | (b << 20) | (8 << 7) + jalr = 0x00000067 | (NAT_RV64['br'] << 15) # jalr x0, 0(t5) + return le32(insn) + le32(jalr) ## --- Simple singletons --- @@ -622,11 +616,16 @@ class Prologue(Op): str_lr = aa_ldst_uimm12(0xF9000000, 'lr', 'sp', 0, 3) return sub + str_lr if arch == 'amd64': - # pop r11 ; sub rsp,fb ; push r11 - # pop r11 = 41 5B ; sub rsp,imm8 = 48 83 EC ib (if fb<=127) - # push r11 = 41 53 + # pop rcx ; sub rsp,fb ; push rcx + # pop rcx = 59 ; sub rsp,imm8 = 48 83 EC ib (if fb<=127) + # push rcx = 51 + # rcx is the retaddr-carry scratch. r11 (= P1 'br') is off-limits + # because TAIL = EPILOGUE + `jmp r11` and using r11 here would + # clobber the LI_BR-loaded tail target mid-EPILOGUE. rcx is + # already the shift-count + DIV/REM save scratch — caller-save, + # never a P1 reg. assert fb <= 127 - return '415B' + '4883EC' + byte(fb) + '4153' + return '59' + '4883EC' + byte(fb) + '51' if arch == 'riscv64': # addi sp, sp, -fb ; sd ra, 0(sp) sub = rv_i(0x00000013, 'sp', 'sp', -fb) @@ -644,8 +643,9 @@ class Epilogue(Op): add = aa_add_imm('sp', 'sp', fb, sub=False) return ldr_lr + add if arch == 'amd64': + # Mirror of Prologue — pop rcx ; add rsp,fb ; push rcx. 
assert fb <= 127 - return '415B' + '4883C4' + byte(fb) + '4153' + return '59' + '4883C4' + byte(fb) + '51' if arch == 'riscv64': ld = rv_i(0x00003003, 'ra', 'sp', 0) add = rv_i(0x00000013, 'sp', 'sp', fb) @@ -656,7 +656,7 @@ class Tail(Op): k: int = 1 def encode(self, arch): - # Epilogue + unconditional B (= jump through r7) + # Epilogue + unconditional B (= jump through the branch-scratch reg) epi = Epilogue(name='', k=self.k).encode(arch) br = B(name='').encode(arch) return epi + br @@ -665,11 +665,11 @@ class Tail(Op): class Call(Op): def encode(self, arch): if arch == 'aarch64': - return le32(0xD63F0280) # BLR x20 + return le32(0xD63F0000 | (NAT_AA64['br'] << 5)) # BLR x17 if arch == 'amd64': - return '41FFD4' # call r12 + return '41FFD3' # call r11 if arch == 'riscv64': - return le32(0x000900E7) # jalr ra, 0(s2) + return le32(0x000000E7 | (NAT_RV64['br'] << 15)) # jalr ra, 0(t5) @dataclass class Ret(Op): @@ -684,50 +684,49 @@ class Ret(Op): ## --- SYSCALL / SYSOPEN --- SYSCALL_HEX = { 'aarch64': ( - # mov x8,x0 ; save r1..r5 into x21..x25 ; shuffle into x0..x4 ; - # mov x5,x19 ; svc #0 ; restore r1..r5. + # r4/r5 now live in callee-saved natives (x26/x27), so the + # kernel preserves them — no save/restore needed. Only r1/r2/r3 + # (in caller-saved x1/x2/x3) must be stashed across the shuffle. 
'' .join([ - le32(0xAA0003E8), # mov x8, x0 (ORR x8, xzr, x0) - le32(0xAA0103F5), # mov x21, x1 - le32(0xAA0203F6), # mov x22, x2 - le32(0xAA0303F7), # mov x23, x3 - le32(0xAA0403F8), # mov x24, x4 - le32(0xAA0503F9), # mov x25, x5 - le32(0xAA1503E0), # mov x0, x21 - le32(0xAA1603E1), # mov x1, x22 - le32(0xAA1703E2), # mov x2, x23 - le32(0xAA1803E3), # mov x3, x24 - le32(0xAA1903E4), # mov x4, x25 - le32(0xAA1303E5), # mov x5, x19 + le32(0xAA0003E8), # mov x8, x0 (syscall number) + le32(0xAA0103F5), # mov x21, x1 (save r1) + le32(0xAA0203F6), # mov x22, x2 (save r2) + le32(0xAA0303F7), # mov x23, x3 (save r3) + le32(0xAA1503E0), # mov x0, x21 (arg1 = r1) + le32(0xAA1603E1), # mov x1, x22 (arg2 = r2) + le32(0xAA1703E2), # mov x2, x23 (arg3 = r3) + le32(0xAA1A03E3), # mov x3, x26 (arg4 = r4) + le32(0xAA1B03E4), # mov x4, x27 (arg5 = r5) + le32(0xAA1303E5), # mov x5, x19 (arg6 = r6) le32(0xD4000001), # svc #0 - le32(0xAA1503E1), # mov x1, x21 + le32(0xAA1503E1), # mov x1, x21 (restore r1) le32(0xAA1603E2), # mov x2, x22 le32(0xAA1703E3), # mov x3, x23 - le32(0xAA1803E4), # mov x4, x24 - le32(0xAA1903E5), # mov x5, x25 ]) ), - 'amd64': '4989D9' + '0F05', # mov r9, rbx ; syscall + # r4=r13, r5=r14 are callee-saved natively, but syscall wants args + # 4/5 in r10/r8. r6=rbx, but arg6 lives in r9. Three shuffle moves, + # then syscall. The kernel preserves rdi/rsi/rdx/r12–r15/rbx, so no + # P1 reg is clobbered beyond r0 (syscall return). + 'amd64': '4D89EA' + '4D89F0' + '4989D9' + '0F05', 'riscv64': ( + # Same story as aarch64: r4/r5 in callee-saved s4/s5 (=x20/x21), + # so we only save/restore a1/a2/a3. Scratch slots: s3, s6, s7. 
         ''.join([
-            le32(0x00050893),  # addi a7, a0, 0 (mv a7, a0)
-            le32(0x00058993),  # mv s3, a1
-            le32(0x00060A13),  # mv s4, a2
-            le32(0x00068A93),  # mv s5, a3
-            le32(0x00070B13),  # mv s6, a4
-            le32(0x00078B93),  # mv s7, a5
-            le32(0x00098513),  # mv a0, s3
-            le32(0x000A0593),  # mv a1, s4
-            le32(0x000A8613),  # mv a2, s5
-            le32(0x000B0693),  # mv a3, s6
-            le32(0x000B8713),  # mv a4, s7
-            le32(0x00048793),  # mv a5, s1
+            le32(0x00050893),  # mv a7, a0 (syscall number)
+            le32(0x00058993),  # mv s3, a1 (save r1)
+            le32(0x00060B13),  # mv s6, a2 (save r2)
+            le32(0x00068B93),  # mv s7, a3 (save r3)
+            le32(0x00098513),  # mv a0, s3 (arg1 = r1)
+            le32(0x000B0593),  # mv a1, s6 (arg2 = r2)
+            le32(0x000B8613),  # mv a2, s7 (arg3 = r3)
+            le32(0x000A0693),  # mv a3, s4 (arg4 = r4)
+            le32(0x000A8713),  # mv a4, s5 (arg5 = r5)
+            le32(0x00048793),  # mv a5, s1 (arg6 = r6)
             le32(0x00000073),  # ecall
-            le32(0x00098593),  # mv a1, s3
-            le32(0x000A0613),  # mv a2, s4
-            le32(0x000A8693),  # mv a3, s5
-            le32(0x000B0713),  # mv a4, s6
-            le32(0x000B8793),  # mv a5, s7
+            le32(0x00098593),  # mv a1, s3 (restore r1)
+            le32(0x000B0613),  # mv a2, s6
+            le32(0x000B8693),  # mv a3, s7
         ])
     ),
 }
@@ -770,6 +769,10 @@ def rows():
     R.append(Banner('LI — load 4-byte zero-extended literal from inline data slot'))
     for rd in ['r0', 'r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7']:
         R.append(Li(name=f'LI_{rd.upper()}', rD=rd))
+    # LI_BR loads into the hidden branch-target scratch (x17/r11/t5).
+    # Every branch and call site is `LI_BR &target ; P1_B` (or the
+    # matching conditional / CALL). The scratch is *not* a P1 reg.
+    R.append(Li(name='LI_BR', rD='br'))
 
     # --- SYSCALL / SYSOPEN ---
     R.append(Banner('SYSCALL / SYSOPEN — uniform (clobbers r0 only) across arches'))
@@ -841,12 +844,12 @@ def rows():
                      op=op, rT=rt, rN=rn, off=off))
 
     # --- Branches ---
-    R.append(Banner('Branches (tranche 4, r7-indirect)'))
+    R.append(Banner('Branches (tranche 4, LI_BR-indirect)'))
     R.append(B(name='B'))
     for op, a, b in [
         ('BEQ','r2','r3'), ('BNE','r2','r3'), ('BLT','r2','r3'), ('BLT','r4','r2'),
     ]:
-        R.append(CondB(name=f'{op}_{a.upper()}_{b.upper()}_R7', op=op, rA=a, rB=b))
+        R.append(CondB(name=f'{op}_{a.upper()}_{b.upper()}', op=op, rA=a, rB=b))
 
     # --- PROLOGUE / EPILOGUE / CALL / RET / TAIL — single-slot + Nk variants ---
     R.append(Banner('Control: CALL/RET + single-slot and N-slot PROLOGUE/EPILOGUE/TAIL'))
@@ -874,6 +877,7 @@ def rows():
     R.append(Mov(name='MOV_R4_R7', rD='r4', rA='r7'))
     # MOV rD, sp variants
     R.append(Mov(name='MOV_R2_SP', rD='r2', rA='sp'))
+    R.append(Mov(name='MOV_R3_SP', rD='r3', rA='sp'))
     R.append(Mov(name='MOV_R4_SP', rD='r4', rA='sp'))
     R.append(Mov(name='MOV_R6_SP', rD='r6', rA='sp'))
     # Extra MOVs needed around calls
@@ -886,18 +890,28 @@ def rows():
         ('ST','r0','r4',8), ('LD','r0','r4',8), ('LB','r1','r0',0),
         ('LD','r0','r1',0), ('LD','r0','r1',8), ('ST','r1','r0',0),
         ('LD','r2','r4',0), ('ST','r2','r0',8), ('LD','r0','r4',0),
-        # N=2 scratch slot 2 at [sp+16]: used by cons after dropping
-        # the :cons_save_cdr BSS spill (P1_TODO.md issue 4).
         ('ST','r2','r4',16), ('LD','r2','r4',16),
+        # r3-based addressing: after the 4:4 split, r4 is callee-saved
+        # and cons/alloc use r3 (caller-saved) as sp-temp and heap-cell
+        # scratch instead. LD_R1_R1_0 is alloc's `r1 = *heap_end`.
+        ('LD','r0','r3',0), ('ST','r2','r3',0),
+        ('ST','r1','r3',8), ('LD','r1','r3',8),
+        ('ST','r2','r3',16), ('LD','r2','r3',16),
+        ('LD','r1','r1',0),
     ]:
         suf = f'NEG{-off}' if off < 0 else f'{off}'
         R.append(Mem(name=f'{op}_{rt.upper()}_{rn.upper()}_{suf}',
                      op=op, rT=rt, rN=rn, off=off))
-    # BLT r0, r2 ; BNE r1, r2 ; BNE r0, r2
-    R.append(CondB(name='BLT_R0_R2_R7', op='BLT', rA='r0', rB='r2'))
-    R.append(CondB(name='BNE_R1_R2_R7', op='BNE', rA='r1', rB='r2'))
-    R.append(CondB(name='BNE_R0_R2_R7', op='BNE', rA='r0', rB='r2'))
+    # ADD_R2_R0_R1: alloc's `new_next = old + size`. The other
+    # tranche-1 ADDs don't happen to have (d=r2, a=r0, b=r1).
+    R.append(RRR(name='ADD_R2_R0_R1', op='ADD', rD='r2', rA='r0', rB='r1'))
+
+    # Branches used by lisp step 2 / alloc OOM.
+    R.append(CondB(name='BLT_R0_R2', op='BLT', rA='r0', rB='r2'))
+    R.append(CondB(name='BLT_R1_R2', op='BLT', rA='r1', rB='r2'))
+    R.append(CondB(name='BNE_R1_R2', op='BNE', rA='r1', rB='r2'))
+    R.append(CondB(name='BNE_R0_R2', op='BNE', rA='r0', rB='r2'))
 
     return R
diff --git a/p1_riscv64.M1 b/p1_riscv64.M1
@@ -10,13 +10,14 @@
 DEFINE P1_LI_R0 170500000365C5006F008000
 DEFINE P1_LI_R1 9705000083E5C5006F008000
 DEFINE P1_LI_R2 170600000366C6006F008000
 DEFINE P1_LI_R3 9706000083E6C6006F008000
-DEFINE P1_LI_R4 170700000367C7006F008000
-DEFINE P1_LI_R5 9707000083E7C7006F008000
+DEFINE P1_LI_R4 170A0000036ACA006F008000
+DEFINE P1_LI_R5 970A000083EACA006F008000
 DEFINE P1_LI_R6 9704000083E4C4006F008000
 DEFINE P1_LI_R7 170900000369C9006F008000
+DEFINE P1_LI_BR 170F0000036FCF006F008000
 
 ## ---- SYSCALL / SYSOPEN — uniform (clobbers r0 only) across arches
-DEFINE P1_SYSCALL 9308050093890500130A0600938A0600130B0700938B07001385090093050A0013860A0093060B0013870B0093870400730000009385090013060A0093860A0013070B0093870B00
+DEFINE P1_SYSCALL 9308050093890500130B0600938B06001385090093050B0013860B0093060A0013870A0093870400730000009385090013060B0093860B00
 DEFINE P1_SYSOPEN 1305C0F99308800373000000
 
 ## ---- Linux syscall numbers (per-arch table). LE-32 immediate operands for LI.
@@ -27,31 +28,31 @@
 DEFINE SYS_CLOSE 39000000
 
 ## ---- Reg-reg-reg arithmetic (tranche 1) --------------------------
 DEFINE P1_ADD_R1_R1_R2 B385C500
-DEFINE P1_ADD_R1_R1_R4 B385E500
+DEFINE P1_ADD_R1_R1_R4 B3854501
 DEFINE P1_ADD_R2_R2_R6 33069600
 DEFINE P1_ADD_R2_R3_R1 3386B600
 DEFINE P1_SUB_R1_R1_R2 B385C540
 DEFINE P1_SUB_R2_R2_R6 33069640
-DEFINE P1_AND_R1_R1_R5 B3F5F500
+DEFINE P1_AND_R1_R1_R5 B3F55501
 DEFINE P1_OR_R1_R1_R2 B3E5C500
 DEFINE P1_XOR_R1_R1_R2 B3C5C500
 DEFINE P1_MUL_R1_R1_R2 B385C502
 DEFINE P1_DIV_R1_R1_R2 B3C5C502
-DEFINE P1_REM_R1_R1_R5 B3E5F502
+DEFINE P1_REM_R1_R1_R5 B3E55503
 DEFINE P1_SHL_R1_R1_R2 B395C500
 DEFINE P1_SHR_R1_R1_R2 B3D5C500
-DEFINE P1_SAR_R4_R4_R2 3357C740
+DEFINE P1_SAR_R4_R4_R2 335ACA40
 
 ## ---- Immediate arithmetic (tranche 2) ----------------------------
 DEFINE P1_ADDI_R1_R1_3 93853500
 DEFINE P1_ADDI_R1_R1_1 93851500
 DEFINE P1_ADDI_R1_R1_NEG3 9385D5FF
-DEFINE P1_ADDI_R4_R4_NEG1 1307F7FF
+DEFINE P1_ADDI_R4_R4_NEG1 130AFAFF
 DEFINE P1_ADDI_R1_R1_NEG2 9385E5FF
 DEFINE P1_ADDI_R0_R0_1 13051500
 DEFINE P1_SHLI_R1_R1_1 93951500
 DEFINE P1_SHRI_R1_R1_1 93D51500
-DEFINE P1_SARI_R4_R4_1 13571740
+DEFINE P1_SARI_R4_R4_1 135A1A40
 DEFINE P1_ANDI_R1_R1_6 93F56500
 DEFINE P1_ANDI_R1_R1_7 93F57500
 DEFINE P1_ORI_R1_R1_1 93E51500
@@ -59,38 +60,38 @@
 DEFINE P1_ORI_R0_R0_2 13652500
 DEFINE P1_ORI_R0_R0_7 13657500
 
 ## ---- LA + memory ops (tranche 3) ---------------------------------
-DEFINE P1_LA_R4 170700000367C7006F008000
-DEFINE P1_ST_R1_R4_0 2330B700
-DEFINE P1_LD_R1_R4_0 83350700
-DEFINE P1_ST_R1_R4_8 2334B700
-DEFINE P1_LD_R1_R4_8 83358700
-DEFINE P1_SB_R1_R4_16 2308B700
-DEFINE P1_LB_R1_R4_16 83450701
-DEFINE P1_ST_R1_R4_NEG8 233CB7FE
-DEFINE P1_LD_R1_R4_NEG8 833587FF
+DEFINE P1_LA_R4 170A0000036ACA006F008000
+DEFINE P1_ST_R1_R4_0 2330BA00
+DEFINE P1_LD_R1_R4_0 83350A00
+DEFINE P1_ST_R1_R4_8 2334BA00
+DEFINE P1_LD_R1_R4_8 83358A00
+DEFINE P1_SB_R1_R4_16 2308BA00
+DEFINE P1_LB_R1_R4_16 83450A01
+DEFINE P1_ST_R1_R4_NEG8 233CBAFE
+DEFINE P1_LD_R1_R4_NEG8 83358AFF
 
-## ---- Branches (tranche 4, r7-indirect) ---------------------------
-DEFINE P1_B 67000900
-DEFINE P1_BEQ_R2_R3_R7 6314D60067000900
-DEFINE P1_BNE_R2_R3_R7 6304D60067000900
-DEFINE P1_BLT_R2_R3_R7 6354D60067000900
-DEFINE P1_BLT_R4_R2_R7 6354C70067000900
+## ---- Branches (tranche 4, LI_BR-indirect) ------------------------
+DEFINE P1_B 67000F00
+DEFINE P1_BEQ_R2_R3 6314D60067000F00
+DEFINE P1_BNE_R2_R3 6304D60067000F00
+DEFINE P1_BLT_R2_R3 6354D60067000F00
+DEFINE P1_BLT_R4_R2 6354CA0067000F00
 
 ## ---- Control: CALL/RET + single-slot and N-slot PROLOGUE/EPILOGUE/TAIL
 DEFINE P1_PROLOGUE 130101FF23301100
 DEFINE P1_EPILOGUE 8330010013010101
 DEFINE P1_RET 67800000
-DEFINE P1_CALL E7000900
-DEFINE P1_TAIL 833001001301010167000900
+DEFINE P1_CALL E7000F00
+DEFINE P1_TAIL 833001001301010167000F00
 DEFINE P1_PROLOGUE_N2 130101FE23301100
 DEFINE P1_EPILOGUE_N2 8330010013010102
-DEFINE P1_TAIL_N2 833001001301010267000900
+DEFINE P1_TAIL_N2 833001001301010267000F00
 DEFINE P1_PROLOGUE_N3 130101FE23301100
 DEFINE P1_EPILOGUE_N3 8330010013010102
-DEFINE P1_TAIL_N3 833001001301010267000900
+DEFINE P1_TAIL_N3 833001001301010267000F00
 DEFINE P1_PROLOGUE_N4 130101FD23301100
 DEFINE P1_EPILOGUE_N4 8330010013010103
-DEFINE P1_TAIL_N4 833001001301010367000900
+DEFINE P1_TAIL_N4 833001001301010367000F00
 
 ## ---- Seed-Lisp step 1 extensions (tranche 6) ---------------------
 DEFINE P1_MOV_R1_R6 93850400
@@ -102,28 +103,38 @@
 DEFINE P1_MOV_R7_R2 13090600
 DEFINE P1_MOV_R2_R6 13860400
 DEFINE P1_MOV_R3_R7 93060900
 DEFINE P1_MOV_R2_R7 13060900
-DEFINE P1_MOV_R4_R7 13070900
+DEFINE P1_MOV_R4_R7 130A0900
 DEFINE P1_MOV_R2_SP 13060100
-DEFINE P1_MOV_R4_SP 13070100
+DEFINE P1_MOV_R3_SP 93060100
+DEFINE P1_MOV_R4_SP 130A0100
 DEFINE P1_MOV_R6_SP 93040100
 DEFINE P1_MOV_R2_R0 13060500
 DEFINE P1_LD_R0_R6_0 03B50400
 DEFINE P1_LD_R1_R6_16 83B50401
-DEFINE P1_LD_R3_R4_0 83360700
-DEFINE P1_LD_R0_R5_0 03B50700
-DEFINE P1_LB_R1_R4_0 83450700
-DEFINE P1_ST_R2_R4_0 2330C700
-DEFINE P1_ST_R0_R4_8 2334A700
-DEFINE P1_LD_R0_R4_8 03358700
+DEFINE P1_LD_R3_R4_0 83360A00
+DEFINE P1_LD_R0_R5_0 03B50A00
+DEFINE P1_LB_R1_R4_0 83450A00
+DEFINE P1_ST_R2_R4_0 2330CA00
+DEFINE P1_ST_R0_R4_8 2334AA00
+DEFINE P1_LD_R0_R4_8 03358A00
 DEFINE P1_LB_R1_R0_0 83450500
 DEFINE P1_LD_R0_R1_0 03B50500
 DEFINE P1_LD_R0_R1_8 03B58500
 DEFINE P1_ST_R1_R0_0 2330B500
-DEFINE P1_LD_R2_R4_0 03360700
+DEFINE P1_LD_R2_R4_0 03360A00
 DEFINE P1_ST_R2_R0_8 2334C500
-DEFINE P1_LD_R0_R4_0 03350700
-DEFINE P1_ST_R2_R4_16 2338C700
-DEFINE P1_LD_R2_R4_16 03360701
-DEFINE P1_BLT_R0_R2_R7 6354C50067000900
-DEFINE P1_BNE_R1_R2_R7 6384C50067000900
-DEFINE P1_BNE_R0_R2_R7 6304C50067000900
+DEFINE P1_LD_R0_R4_0 03350A00
+DEFINE P1_ST_R2_R4_16 2338CA00
+DEFINE P1_LD_R2_R4_16 03360A01
+DEFINE P1_LD_R0_R3_0 03B50600
+DEFINE P1_ST_R2_R3_0 23B0C600
+DEFINE P1_ST_R1_R3_8 23B4B600
+DEFINE P1_LD_R1_R3_8 83B58600
+DEFINE P1_ST_R2_R3_16 23B8C600
+DEFINE P1_LD_R2_R3_16 03B60601
+DEFINE P1_LD_R1_R1_0 83B50500
+DEFINE P1_ADD_R2_R0_R1 3306B500
+DEFINE P1_BLT_R0_R2 6354C50067000F00
+DEFINE P1_BLT_R1_R2 63D4C50067000F00
+DEFINE P1_BNE_R1_R2 6384C50067000F00
+DEFINE P1_BNE_R0_R2 6304C50067000F00