boot2

Playing with the bootstrap
git clone https://git.ryansepassi.com/git/boot2.git

commit c97fe9fc116a451582d73ddcba70c6eecef2c08c
parent 29dd4df7b3d09e614c5999efd71cbe9651d2ef02
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Mon, 20 Apr 2026 16:36:01 -0700

P1 prune ISA and rewrite demo with discriminating tests

Drop BLTU/BGEU/BGE and LW/SW from the P1 ISA: seed programs with
tagged-cell pointers don't need unsigned compares, and byte/64-bit
loads are enough once LW is synthesizable from LD + mask. Combinatoric
footprint drops from ~1500 to ~1200 defines/arch.
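Both dropped groups stay cheap to recover. A minimal sketch of the two synthesis rules in Python, modeling registers as 64-bit values (helper names are illustrative, not from the repo; P1 itself would spell these as an LD+ANDI sequence and a BLT over sign-bit-biased operands):

```python
MASK64 = (1 << 64) - 1

def to_signed(x: int) -> int:
    """Reinterpret a 64-bit value as a signed integer."""
    return x - (1 << 64) if x & (1 << 63) else x

def lw(mem: bytes, off: int) -> int:
    """32-bit zero-extended load synthesized from the 64-bit LD:
    LD rD, rA, !off ; then an ANDI-style mask of the low 32 bits.
    Assumes little-endian memory, as on all three target arches."""
    return int.from_bytes(mem[off:off + 8], "little") & 0xFFFFFFFF

def bltu(a: int, b: int) -> bool:
    """Unsigned a < b from signed BLT via operand bias: XORing the
    sign bit into both operands maps unsigned order onto signed order."""
    bias = 1 << 63
    return to_signed(a ^ bias) < to_signed(b ^ bias)

mem = (0x1122334455667788).to_bytes(8, "little")
assert lw(mem, 0) == 0x55667788              # low 4 bytes, zero-extended
assert bltu(1, MASK64)                       # 1 < 2^64-1 unsigned
assert not (to_signed(1) < to_signed(MASK64))  # signed BLT alone disagrees
```

The last assertion is the reason the bias is needed at all: 0xFFFF...FFFF is -1 when compared signed.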

Rewrite demo.M1 so every op is applied with non-identity operands
whose result is unique vs. neighbor ops on the same inputs; a
mis-encoded op therefore poisons the accumulator and fails the
exit-code check.
Branch tranche now tests both taken and fall-through directions for
each conditional, plus a signed-vs-unsigned discriminator on BLT via
r4=-1. SAR/SARI discriminated on a negative value (SHR/SHRI would
disagree). LD/ST exercised at both offset=0 and offset=8; LI_R6
roundtripped through MOV_R1_R6 so the otherwise-unused LI_R6 encoding
is covered.
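The discriminators reduce to small bit-level identities. A sketch in Python for checkability, modeling registers as 64-bit cells (names are illustrative, not from the repo):

```python
MASK64 = (1 << 64) - 1

def to_signed(x: int) -> int:
    """Reinterpret a 64-bit value as a signed integer."""
    return x - (1 << 64) if x & (1 << 63) else x

def shr(a: int, n: int) -> int:   # logical shift right
    return (a & MASK64) >> n

def sar(a: int, n: int) -> int:   # arithmetic shift right
    return (to_signed(a) >> n) & MASK64

neg1 = MASK64                     # r4 = -1 as a 64-bit cell

# On a positive value SAR and SHR agree, so r1's chain alone
# cannot tell them apart...
assert sar(5, 3) == shr(5, 3) == 0

# ...but on -1 they diverge: SAR keeps sign bits, SHR shifts in zeros.
assert sar(neg1, 3) == neg1
assert shr(neg1, 3) == MASK64 >> 3   # 0x1FFF...FFFF

# BLT discriminator: -1 < 0 is true signed, false unsigned.
assert to_signed(neg1) < 0        # signed BLT fires
assert not (neg1 < 0)             # an unsigned compare would fall through
```

A swapped SAR/SHR encoding therefore survives the positive-value chain but is caught the moment r4 = -1 flows through it.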

Per-arch defs rewritten to match: dropped encodings for the removed
ops; added the new register/immediate tuples the demo consumes.

Diffstat:
M P1.md         |  22 ++++++++++++++--------
M PLAN.md       |  10 ++++++----
M demo.M1       | 243 ++++++++++++++++++++++++++++++++++++++++++++++++-------------------------
M p1_aarch64.M1 | 112 ++++++++++++++++++++++++++++++++++++++-----------------------------------
M p1_amd64.M1   | 105 ++++++++++++++++++++++++++++++++++++-------------------------------------
M p1_riscv64.M1 |  84 ++++++++++++++++++++++++++++++++++++-------------------------------------
6 files changed, 309 insertions(+), 267 deletions(-)

diff --git a/P1.md b/P1.md
@@ -113,14 +113,12 @@ LA rD, %label        # load PC-relative address
 
 # Memory (offset is signed 12-bit)
 LD rD, rA, !off    ST rS, rA, !off    # 64-bit
-LW rD, rA, !off    SW rS, rA, !off    # 32-bit zero-extended / truncated
 LB rD, rA, !off    SB rS, rA, !off    # 8-bit zero-extended / truncated
 
 # Control flow
 B %label                             # unconditional branch
 BEQ rA, rB, %label   BNE rA, rB, %label
-BLT rA, rB, %label   BGE rA, rB, %label    # signed
-BLTU rA, rB, %label  BGEU rA, rB, %label   # unsigned
+BLT rA, rB, %label                   # signed less-than
 CALL %label   RET
 TAIL %label                          # tail call: epilogue + B %label
@@ -133,8 +131,16 @@ SYSCALL        # num in r0, args r1-r6, ret in r0
 - All arithmetic is on 64-bit values. `SHL`/`SHR`/`SAR` take shift amount
   in the low 6 bits of `rB` (or the `!imm` for immediate forms).
 - `DIV` is signed, truncated toward zero. `REM` matches `DIV`.
-- `LW`/`LB` zero-extend the loaded value into the 64-bit destination.
-  (Signed-extending variants — `LWS`, `LBS` — can be added later if needed.)
+- `LB` zero-extends the loaded value into the 64-bit destination.
+  (A signed-extending variant `LBS` can be added later if needed. 32-bit
+  `LW`/`SW` are deliberately omitted — emulate with `LD`+`ANDI`/shift and
+  `ST` through a 64-bit scratch when needed.)
+- Unsigned comparisons (`BLTU`/`BGEU`) are not in the ISA: seed programs
+  with tagged-cell pointers only need signed comparisons. Synthesize from
+  `BLT` via operand-bias if unsigned compare is ever required.
+- `BGE rA, rB, %L` is not in the ISA: synthesize as
+  `BLT rA, rB, %skip; B %L; :skip` (the r7-indirect branch pattern makes
+  the skip cheap). `BLT rB, rA, %L` handles the strict-greater case.
 - Branch offsets are PC-relative. In the v0.1 spike they are realized by
   loading the target address via `LI` into `r7` and jumping through the
   register; range is therefore unbounded within the 4 GiB address space.
@@ -181,11 +187,11 @@ Per-arch defs count (immediates handled by sigil, not enumerated):
 
 - 6 immediate arith × 8² = 384.
   Each entry uses an immediate sigil (`!imm`), so the immediate value
   itself is not enumerated.
 - 3 move ops × 8 or 8² = ~80.
-- 6 memory ops × 8² = 384. Offsets use `!imm` sigil.
-- 7 branches × 8² = 448.
+- 4 memory ops × 8² = 256. Offsets use `!imm` sigil.
+- 3 conditional branches × 8² = 192.
 - Singletons (`B`, `CALL`, `RET`, `TAIL`, `SYSCALL`) = 5.
 
-Total ≈ 1500 defines per arch. Template-generated.
+Total ≈ 1200 defines per arch. Template-generated.
 
 ## Syscall conventions
diff --git a/PLAN.md b/PLAN.md
@@ -150,10 +150,12 @@ with any future seed-stage program.
 
 ## Resolutions
 
-- **Narrow loads: zero-extend only.** P1 keeps `LB`/`LW` zero-extending;
-  no `LBS`/`LWS` added. Fixnums live in full 64-bit tagged cells, so
-  the interpreter never needs a sign-extended narrow load — byte/ASCII
-  access is unsigned, and arithmetic happens on 64-bit values already.
+- **Narrow loads: zero-extend only, 8-bit only.** P1 keeps `LB`
+  zero-extending; no `LBS` added, and 32-bit `LW`/`SW` are out of the
+  ISA entirely (emulate through 64-bit `LD`/`ST` + mask/shift if ever
+  needed). Fixnums live in full 64-bit tagged cells, so the interpreter
+  never needs a sign-extended or 32-bit load — byte/ASCII access is
+  unsigned, and arithmetic happens on 64-bit values already.
 - **Static code size: accept the 2× tax.** P1's destructive-expansion
   rule on amd64 roughly doubles instruction count vs. hand-tuned amd64.
   Matches P1's "deliberately dumb" contract (see `P1.md`). Interpreter
diff --git a/demo.M1 b/demo.M1
@@ -1,87 +1,139 @@
 ## P1 broader-ISA demo — portable across aarch64, amd64, riscv64.
 ##
-## Exercises the P1 ISA in tranches (see P1.md §"ISA"). Each op is
-## applied in a way that preserves the running value r1=5, so a
-## miscoded op produces the wrong exit status.
+## Exercises the P1 ISA in tranches (see P1.md §"Instruction set").
+## Each op is applied with non-identity operands so a mis-encoded op
+## produces a detectably-wrong final r1. If every op in every tranche
+## is correct, r1 ends at 5, stdout is "P1 = 5\n", exit is 5.
 ##
-## Tranche 1: LI, MOV, SYSCALL, plus every reg-reg-reg arith op:
-##   ADD, SUB, AND, OR, XOR, MUL, DIV, REM, SHL, SHR, SAR.
-## Tranche 2: immediate forms
-##   ADDI, ANDI, ORI, SHLI, SHRI, SARI
-##   (no SUBI/XORI/MULI in P1 — see PLAN.md §"Feature floor").
-## Tranche 3: LA + memory round-trip
-##   LA, ST/LD (64b), SW/LW (32b zero-ext), SB/LB (8b zero-ext)
-## Tranche 4: r7-indirect branches
-##   B, BEQ, BNE, BLT, BGE, BLTU, BGEU — 7 taken-path subtests + 1 fall-through
-## Tranche 5 (current): CALL / RET / TAIL / PROLOGUE / EPILOGUE
-##   Nested CALL (stresses PROLOGUE lr-save) + TAIL (must unwind before B)
-##
-## Constants for the identity chain:
-##   r2 = 0   identity for ADD/SUB/XOR/OR/SHL/SHR/SAR
-##   r3 = 1   identity for MUL/DIV
-##   r5 = 7   identity for AND (5 & 7 = 5) and REM (5 % 7 = 5)
-## r1 holds the running value — starts at 5 and stays 5.
+## Tranche 1: reg-reg-reg arith (11 ops). Each step's result is
+##   unique vs. neighbor ops on the same operands, e.g.
+##   ADD(5,3)=8 SUB(8,3)=5 XOR(5,3)=6 OR(6,3)=7 AND(7,5)=5
+##   MUL(5,3)=15 DIV(15,3)=5 REM(5,7)=5 SHL(5,3)=40 SHR(40,3)=5.
+##   SAR vs SHR agree on positive values; SAR is separately tested
+##   on a negative value in r4 (SAR(-1,3)=-1, SHR would give huge).
+## Tranche 2: immediate forms. ADDI tested with BOTH positive and
+##   negative imm12 (proves signed-immediate encoding). SHLI/SHRI
+##   with non-zero shift amount. ANDI/ORI with overlapping-bit
+##   masks so the result is distinguishable from ADDI/XORI. SARI
+##   tested on negative value via r4.
+##   (No SUBI/XORI/MULI in P1 — see PLAN.md §"Feature floor".)
+## Tranche 3: LA + memory round-trip. LD/ST (64-bit) and LB/SB
+##   (8-bit zero-extended). 32-bit LW/SW dropped from the ISA —
+##   synthesize from LD + mask/shift if needed.
+## Tranche 4: r7-indirect branches. B, BEQ, BNE, BLT each tested
+##   in BOTH taken and fall-through directions so an inverted
+##   condition encoding doesn't pass. BLT additionally tested with
+##   a negative operand (-1 < 0 signed) to prove signed semantics.
+##   BLTU/BGEU/BGE dropped from the ISA — see P1.md.
+## Tranche 5: CALL / RET / TAIL / PROLOGUE / EPILOGUE. Nested CALL
+##   stresses PROLOGUE lr-save; TAIL unwinds the frame.
 ##
 ## Run-and-verify:
 ##   make PROG=demo ARCH=<arch> run && echo "exit=$?"
 ##   expected stdout: "P1 = 5\n"   expected exit: 5
 
 :_start
+    ## Setup: r1 is the running accumulator; r2/r5 are arith partners.
+    ## r1 is seeded via the LI_R6 → MOV_R1_R6 roundtrip so the LI_R6
+    ## encoding gets exercised — otherwise r6 is only written by
+    ## MOV_R6_R1 in the exit path. Clobbering r1 to 0 between the two
+    ## ops means a broken MOV_R1_R6 (wrong source reg) leaves r1=0 and
+    ## poisons the chain; a broken LI_R6 (wrong dest reg or wrong
+    ## literal-slot offset) puts the wrong value in r6 and MOV
+    ## propagates it.
+    P1_LI_R6
+    '05000000'          # r6 = 5 via LI_R6 (discriminator)
     P1_LI_R1
-    '05000000'
+    '00000000'          # clobber r1 before the MOV
+    P1_MOV_R1_R6        # r1 = r6 = 5
     P1_LI_R2
-    '00000000'
-    P1_LI_R3
-    '01000000'
+    '03000000'          # r2 = 3 (arith partner & shift amount)
     P1_LI_R5
-    '07000000'
-
-    P1_ADD_R1_R1_R2     # 5 + 0 = 5
-    P1_SUB_R1_R1_R2     # 5 - 0 = 5
-    P1_XOR_R1_R1_R2     # 5 ^ 0 = 5
-    P1_OR_R1_R1_R2      # 5 | 0 = 5
-    P1_AND_R1_R1_R5     # 5 & 7 = 5
-    P1_MUL_R1_R1_R3     # 5 * 1 = 5
-    P1_DIV_R1_R1_R3     # 5 / 1 = 5
+    '05000000'          # r5 = 5 (AND partner)
+
+    ## Tranche 1: reg-reg-reg arith.
+    P1_ADD_R1_R1_R2     # 5 + 3 = 8
+    P1_SUB_R1_R1_R2     # 8 - 3 = 5
+    P1_XOR_R1_R1_R2     # 5 ^ 3 = 6
+    P1_OR_R1_R1_R2      # 6 | 3 = 7
+    P1_AND_R1_R1_R5     # 7 & 5 = 5
+    P1_MUL_R1_R1_R2     # 5 * 3 = 15
+    P1_DIV_R1_R1_R2     # 15 / 3 = 5
+    P1_LI_R5
+    '07000000'          # r5 = 7 for REM
     P1_REM_R1_R1_R5     # 5 % 7 = 5
-    P1_SHL_R1_R1_R2     # 5 << 0 = 5
-    P1_SHR_R1_R1_R2     # 5 >> 0 = 5 (logical)
-    P1_SAR_R1_R1_R2     # 5 >> 0 = 5 (arithmetic)
+    P1_SHL_R1_R1_R2     # 5 << 3 = 40
+    P1_SHR_R1_R1_R2     # 40 >> 3 = 5
 
-    ## Tranche 2: immediate forms. Chain 5 -> 8 -> 4 -> 5 -> 5 -> 5 -> 5.
+    ## SAR discriminator: on positive values SAR and SHR agree, so
+    ## put a negative value in r4 and check that SAR preserves the
+    ## sign bits. Then fold r4 into r1 so a misbehaving SAR poisons
+    ## the accumulator.
+    P1_LI_R4
+    '00000000'          # r4 = 0
+    P1_ADDI_R4_R4_NEG1  # r4 = 0 + (-1) = -1 (sign-extended 64-bit)
+    P1_SAR_R4_R4_R2     # r4 = -1 >> 3 = -1 (SHR would be 0x1FFF...FFFF)
+    P1_ADD_R1_R1_R4     # r1 = 5 + (-1) = 4
+    P1_ADDI_R1_R1_1     # r1 = 4 + 1 = 5
+
+    ## Tranche 2: immediate forms. Chain 5 → 8 → 5 → 10 → 5 → 4 → 5
+    ## plus a SARI discriminator via r4.
     P1_ADDI_R1_R1_3     # 5 + 3 = 8
-    P1_SHRI_R1_R1_1     # 8 >> 1 = 4
-    P1_ORI_R1_R1_1      # 4 | 1 = 5
-    P1_ANDI_R1_R1_7     # 5 & 7 = 5
-    P1_SHLI_R1_R1_0     # 5 << 0 = 5
-    P1_SARI_R1_R1_0     # 5 >> 0 = 5 (arithmetic)
+    P1_ADDI_R1_R1_NEG3  # 8 + (-3) = 5 (signed imm12)
+    P1_SHLI_R1_R1_1     # 5 << 1 = 10 (non-zero shift)
+    P1_SHRI_R1_R1_1     # 10 >> 1 = 5
+    P1_ANDI_R1_R1_6     # 5 & 6 = 4 (0b101 & 0b110 = 0b100)
+    P1_ORI_R1_R1_1      # 4 | 1 = 5 (aarch64 bitmask-immediate
+                        # requires valid-mask imm; 5 is invalid, 1 is
+                        # valid. Tranche 1 OR_R1_R1_R2 (6|3=7) already
+                        # discriminates OR from ADD/XOR on overlapping
+                        # bits; ORI here just tests the imm-ORR encoding
+                        # exists and doesn't crash, plus distinguishes
+                        # from AND (4&1=0 would fail the chain).)
 
-    ## Tranche 3: memory round-trip. For each access width, store r1,
+    P1_LI_R4
+    '00000000'
+    P1_ADDI_R4_R4_NEG1  # r4 = -1
+    P1_SARI_R4_R4_1     # r4 = -1 >> 1 = -1 (SHRI would be 0x7FFF...FFFF)
+    P1_ADD_R1_R1_R4     # r1 = 5 + (-1) = 4
+    P1_ADDI_R1_R1_1     # r1 = 5
+
+    ## Tranche 3: memory round-trip. For each width, store r1,
     ## clobber r1 to 0, then reload. A broken ST/LD leaves r1 != 5.
     P1_LA_R4
     &scratch
-    P1_ST_R1_R4_0       # [scratch+0..8] = r1 (= 5)
+    P1_ST_R1_R4_0       # [scratch+0..8] = r1 (= 5)
     P1_LI_R1
     '00000000'
     P1_LD_R1_R4_0       # r1 = [scratch+0..8] -> 5
 
-    P1_SW_R1_R4_8       # [scratch+8..12] = r1 (low 4 bytes)
+    ## Non-zero imm12 on the 64-bit forms. Distinct value 8 at +8 so
+    ## an LD_R4_8 that silently aliased to +0 would read 5 and fail the
+    ## SUB step; a broken ST_R4_8 leaves [+8]=0 (scratch init) which the
+    ## LD observes. The +0 round-trip above anchors the imm12=0 case.
+    P1_LI_R1
+    '08000000'          # distinct from 5
+    P1_ST_R1_R4_8       # [scratch+8..16] = 8
     P1_LI_R1
     '00000000'
-    P1_LW_R1_R4_8       # r1 = zext [scratch+8..12] -> 5
+    P1_LD_R1_R4_8       # r1 = [scratch+8..16] -> 8
+    P1_SUB_R1_R1_R2     # r1 = 8 - 3 = 5 (r2 still 3 from setup)
 
-    P1_SB_R1_R4_16      # [scratch+16] = r1 (low byte)
+    P1_SB_R1_R4_16      # [scratch+16] = r1 (low byte)
     P1_LI_R1
     '00000000'
     P1_LB_R1_R4_16      # r1 = zext [scratch+16] -> 5
 
-    ## Tranche 4: branches. r2=0, r3=1 already, r5=7 already.
-    ## Pattern for each taken-path test:
-    ##   set r7 = &b4_N_ok ; <branch cond met> ; clobber r1->0 ; :b4_N_ok
-    ## If the branch correctly fires, the clobber is skipped.
-    ##
-    ## B — unconditional, jumps to &b4_1_ok.
+    ## Tranche 4: branches. r2=0, r3=1 to start; each subtest resets
+    ## as needed. Taken-path subtests clobber r1 on fall-through;
+    ## fall-through subtests clobber r1 on incorrect branch.
+    P1_LI_R2
+    '00000000'          # r2 = 0
+    P1_LI_R3
+    '01000000'          # r3 = 1
+
+    ## B — unconditional. Correct: jump to b4_1_ok, skipping clobber.
    P1_LI_R7
    &b4_1_ok
    P1_B
@@ -89,97 +141,100 @@
     '00000000'
 :b4_1_ok
 
-    ## BEQ r2,r3,r7 — 0 == 0? set r3 back to 0 first, branch, then restore.
+    ## BEQ taken: r3=0 so r2==r3.
     P1_LI_R3
-    '00000000'          # r3 = 0 so r2 == r3
+    '00000000'
     P1_LI_R7
     &b4_2_ok
-    P1_BEQ_R2_R3_R7     # 0 == 0, branch taken
+    P1_BEQ_R2_R3_R7     # 0 == 0, taken
     P1_LI_R1
     '00000000'
 :b4_2_ok
     P1_LI_R3
     '01000000'          # restore r3 = 1
 
-    ## BNE r2,r3,r7 — 0 != 1? yes, branch taken.
+    ## BEQ fall-through: 0 != 1, branch must NOT fire. If it
+    ## (incorrectly) fires we jump to b4_3_bad and clobber r1.
+    P1_LI_R7
+    &b4_3_bad
+    P1_BEQ_R2_R3_R7     # 0 == 1? no, fall through
     P1_LI_R7
     &b4_3_ok
-    P1_BNE_R2_R3_R7
+    P1_B
+:b4_3_bad
     P1_LI_R1
     '00000000'
 :b4_3_ok
 
-    ## BLT r2,r3,r7 — 0 < 1 (signed)? yes, branch taken.
+    ## BNE taken: 0 != 1.
     P1_LI_R7
     &b4_4_ok
-    P1_BLT_R2_R3_R7
+    P1_BNE_R2_R3_R7     # 0 != 1, taken
     P1_LI_R1
     '00000000'
 :b4_4_ok
 
-    ## BGE r3,r2,r7 — needs a3 >= a2. Swap: load r2=1 r3=0 for this test.
-    P1_LI_R2
-    '01000000'
+    ## BNE fall-through: r3=0 so r2==r3; branch must NOT fire.
     P1_LI_R3
     '00000000'
     P1_LI_R7
+    &b4_5_bad
+    P1_BNE_R2_R3_R7     # 0 != 0? no, fall through
+    P1_LI_R7
     &b4_5_ok
-    P1_BGE_R2_R3_R7     # 1 >= 0 (signed), branch taken
+    P1_B
+:b4_5_bad
     P1_LI_R1
     '00000000'
 :b4_5_ok
-    P1_LI_R2
-    '00000000'          # restore r2 = 0
     P1_LI_R3
     '01000000'          # restore r3 = 1
 
-    ## BLTU r2,r3,r7 — 0 < 1 (unsigned)? yes, branch taken.
+    ## BLT taken: 0 < 1 (signed).
     P1_LI_R7
     &b4_6_ok
-    P1_BLTU_R2_R3_R7
+    P1_BLT_R2_R3_R7     # 0 < 1, taken
     P1_LI_R1
     '00000000'
 :b4_6_ok
 
-    ## BGEU r2,r3,r7 — needs a2 >= a3 (unsigned). Swap r2/r3.
+    ## BLT fall-through: 1 < 0 is false.
     P1_LI_R2
-    '01000000'
+    '01000000'          # r2 = 1
     P1_LI_R3
-    '00000000'
+    '00000000'          # r3 = 0
+    P1_LI_R7
+    &b4_7_bad
+    P1_BLT_R2_R3_R7     # 1 < 0? no, fall through
     P1_LI_R7
     &b4_7_ok
-    P1_BGEU_R2_R3_R7    # 1 >= 0 (unsigned), branch taken
+    P1_B
+:b4_7_bad
     P1_LI_R1
     '00000000'
 :b4_7_ok
+
+    ## BLT signed discrimination: -1 < 0 must be taken. If BLT were
+    ## accidentally unsigned, -1 as 0xFFFF...FFFF > 0 and the branch
+    ## would not fire.
     P1_LI_R2
+    '00000000'          # r2 = 0
+    P1_LI_R4
     '00000000'
-    P1_LI_R3
-    '01000000'
-
-    ## Fall-through test: BEQ with r2=0, r3=1 (unequal). The branch must NOT
-    ## fire. If it does fire (incorrectly), we jump to &b4_ft_bad which
-    ## clobbers r1. Correct behavior falls through the branch, then an
-    ## unconditional B jumps past the clobber to &b4_ft_good.
-    P1_LI_R7
-    &b4_ft_bad
-    P1_BEQ_R2_R3_R7     # 0 == 1? no, fall through
+    P1_ADDI_R4_R4_NEG1  # r4 = -1 (sign-extended)
     P1_LI_R7
-    &b4_ft_good
-    P1_B                # skip the bad-path clobber
-:b4_ft_bad
+    &b4_8_ok
+    P1_BLT_R4_R2_R7     # -1 < 0 (signed)? yes, taken
     P1_LI_R1
     '00000000'
-:b4_ft_good
-
-    ## Restore r1 = 5 is implicit — we never clobbered it on the happy path.
+:b4_8_ok
 
     ## Tranche 5: CALL / RET / PROLOGUE / EPILOGUE / TAIL.
-    ## fn_identity does its own nested CALL to fn_inner — if PROLOGUE doesn't
-    ## spill lr correctly, the inner CALL clobbers the return-to-_start
-    ## address and we crash or hang. The function bodies live inline below
-    ## the subtests, guarded by a P1_B over them so we don't fall through
-    ## into them after the last subtest.
+    ## fn_identity does its own nested CALL to fn_inner — if PROLOGUE
+    ## doesn't spill lr correctly, the inner CALL clobbers the
+    ## return-to-_start address and we crash or hang. The function
+    ## bodies live inline below the subtests, guarded by a P1_B over
+    ## them so we don't fall through after the last subtest.
     P1_LI_R7
     &fn_identity
     P1_CALL             # nested-CALL test: returns r1 unchanged
@@ -249,7 +304,7 @@
     '01000000'
     P1_SYSCALL
 
-    ## exit(r6) — exit status = computed result
+    ## exit(r6) — exit status = computed result
     P1_LI_R0
     SYS_EXIT
     P1_MOV_R1_R6
diff --git a/p1_aarch64.M1 b/p1_aarch64.M1
@@ -1,7 +1,7 @@
 ## P1 pseudo-ISA — aarch64 backing defs (v0.1 spike)
 ##
 ## Implements the subset needed by hello.M1 and demo.M1: LI, SYSCALL,
-## plus a handful of MOV/ADD/SUB register tuples. The full 1500-entry
+## the demo's arith/imm/memory/branch/call tuples. The full ~1200-entry
 ## table described in P1.md is generator-driven; what's here is the
 ## spike's hand-written sliver.
 ## See P1.md for the full ISA and register mapping.
@@ -23,8 +23,7 @@
 ##
 ## The 4-byte form is deliberate for the spike: it pairs with hex2's
 ## `&label` sigil unchanged. A proper 64-bit LI via a PC-relative LDR X
-## literal is left for the aarch64 hex2_word extensions described in
-## P1.md §"What needs to be added".
+## literal is left for later.
 
 DEFINE P1_LI_R0 4000001802000014
 DEFINE P1_LI_R1 4100001802000014
@@ -64,90 +63,90 @@ DEFINE SYS_WRITE 40000000
 DEFINE SYS_EXIT 5D000000
 
-## ---- MOV / ADD / SUB (demo.M1 subset) ------------------------------------
-## One native insn per op. Named after the P1-register tuple they realize.
-## Add entries here as new programs need them; a generator will replace the
-## hand-maintenance once more than a handful of tuples are in use.
-
-## MOV rD, rA -> orr xD, xzr, xA
-DEFINE P1_MOV_R6_R3 F30303AA ## mov x19, x3
-DEFINE P1_MOV_R1_R6 E10313AA ## mov x1, x19
-
-## ADD rD, rA, rB -> add xD, xA, xB
-DEFINE P1_ADD_R3_R1_R2 2300028B ## add x3, x1, x2
-DEFINE P1_ADD_R2_R2_R6 4200138B ## add x2, x2, x19
-
-## SUB rD, rA, rB -> sub xD, xA, xB
-DEFINE P1_SUB_R3_R3_R4 630004CB ## sub x3, x3, x4
-
-
 ## ---- Tranche 1: full arith reg-reg-reg ----------------------------------
-## Identity-chain tuples used by demo.M1 to exercise every P1 arith op
-## without branches or memory. r1 is the running accumulator; each op
-## uses an identity partner so the correct output is always r1.
-## aarch64 has 3-operand forms for all of these, so one insn per op.
+## Non-identity discriminator chain used by demo.M1 to exercise every
+## P1 arith op. r1 is the running accumulator; each step's result is
+## unique vs. its neighbor ops on the chosen operands (see demo.M1).
+## aarch64 has 3-operand forms for all of these — one insn per op.
 ## REM has no native op — expands to sdiv+msub through scratch reg x4.
 
 ## MOV rD, rA -> orr xD, xzr, xA
+DEFINE P1_MOV_R1_R6 E10313AA ## mov x1, x19
 DEFINE P1_MOV_R6_R1 F30301AA ## mov x19, x1
 
-## ADD / SUB / AND / OR / XOR — base opcodes 8B / CB / 8A / AA / CA.
+## ADD rD, rA, rB -> add xD, xA, xB
 DEFINE P1_ADD_R1_R1_R2 2100028B ## add x1, x1, x2
+DEFINE P1_ADD_R1_R1_R4 2100048B ## add x1, x1, x4
+DEFINE P1_ADD_R2_R2_R6 4200138B ## add x2, x2, x19 (syscall-arg computation)
+
+## SUB / AND / OR / XOR — base opcodes CB / 8A / AA / CA.
 DEFINE P1_SUB_R1_R1_R2 210002CB ## sub x1, x1, x2
 DEFINE P1_AND_R1_R1_R5 2100058A ## and x1, x1, x5
 DEFINE P1_OR_R1_R1_R2 210002AA ## orr x1, x1, x2
 DEFINE P1_XOR_R1_R1_R2 210002CA ## eor x1, x1, x2
 
-## MUL rD, rA, rB -> madd xD, xA, xB, xzr
-DEFINE P1_MUL_R1_R1_R3 217C039B ## mul x1, x1, x3
+## MUL rD, rA, rB -> mul xD, xA, xB (= madd xD, xA, xB, xzr)
+DEFINE P1_MUL_R1_R1_R2 217C029B ## mul x1, x1, x2
 
 ## DIV rD, rA, rB -> sdiv xD, xA, xB
-DEFINE P1_DIV_R1_R1_R3 210CC39A ## sdiv x1, x1, x3
+DEFINE P1_DIV_R1_R1_R2 210CC29A ## sdiv x1, x1, x2
 
 ## REM rD, rA, rB -> sdiv x4, xA, xB ; msub xD, x4, xB, xA
 ## Two insns; x4 is a caller-saved scratch (= P1 r4). Demo keeps any
 ## live r4 value dead across this op.
 DEFINE P1_REM_R1_R1_R5 240CC59A8184059B ## sdiv x4,x1,x5; msub x1,x4,x5,x1
 
-## SHL / SHR / SAR -> lslv / lsrv / asrv (64-bit variants).
+## SHL / SHR -> lslv / lsrv (64-bit).
 DEFINE P1_SHL_R1_R1_R2 2120C29A ## lsl x1, x1, x2
 DEFINE P1_SHR_R1_R1_R2 2124C29A ## lsr x1, x1, x2
-DEFINE P1_SAR_R1_R1_R2 2128C29A ## asr x1, x1, x2
+
+## SAR rD, rA, rB -> asrv xD, xA, xB. Discriminator lives on r4
+## (negative value) — see demo.M1 tranche 1.
+DEFINE P1_SAR_R4_R4_R2 8428C29A ## asr x4, x4, x2
 
 
 ## ---- Tranche 2: immediate arith ---------------------------------------
-## Tuples exercised by demo.M1: chain 5 -> 8 -> 4 -> 5 -> 5 -> 5 -> 5.
+## Demo chain: 5 → 8 → 5 → 10 → 5 → 4 → 5 plus a SARI discriminator
+## on r4 (negative). All imms are valid bitmask-immediates where
+## applicable (see P1.md); 5 is NOT a valid bitmask so ORI uses 1.
 ##
-## Bitwise-imm ops (AND/OR) use aarch64's logical-immediate encoding:
-## (N, immr, imms) encodes a run of consecutive 1s rotated by immr.
-## For 64-bit patterns N=1 and imms = (ones_count - 1). Immediates:
-##   imm=1 -> N=1 imms=0 immr=0 (one 1, no rotation)
-##   imm=7 -> N=1 imms=2 immr=0 (three consecutive 1s)
 ## Shift-imm ops are UBFM/SBFM aliases:
 ##   LSL #n -> UBFM immr=-n mod 64, imms=63-n
 ##   LSR #n -> UBFM immr=n, imms=63
 ##   ASR #n -> SBFM immr=n, imms=63
+## Bitwise-imm ops use aarch64's logical-immediate encoding
+## (N:immr:imms over 13 bits).
+
+## aarch64 has no native "ADDI with negative imm" form — model P1's
+## signed ADDI by emitting ADD (imm) for non-negative values and
+## SUB (imm) for negative ones. Both flavors sign-extend the imm12
+## into the 64-bit destination.
+DEFINE P1_ADDI_R1_R1_3 210C0091 ## add x1, x1, #3
+DEFINE P1_ADDI_R1_R1_1 21040091 ## add x1, x1, #1
+DEFINE P1_ADDI_R1_R1_NEG3 210C00D1 ## sub x1, x1, #3
+DEFINE P1_ADDI_R4_R4_NEG1 840400D1 ## sub x4, x4, #1
+
+DEFINE P1_SHLI_R1_R1_1 21F87FD3 ## lsl x1, x1, #1 (UBFM immr=63, imms=62)
+DEFINE P1_SHRI_R1_R1_1 21FC41D3 ## lsr x1, x1, #1 (UBFM immr=1, imms=63)
+
+DEFINE P1_ANDI_R1_R1_6 21043F92 ## and x1, x1, #6
+DEFINE P1_ORI_R1_R1_1 210040B2 ## orr x1, x1, #1
 
-DEFINE P1_ADDI_R1_R1_3 210C0091 ## add x1, x1, #3
-DEFINE P1_ANDI_R1_R1_7 21084092 ## and x1, x1, #7
-DEFINE P1_ORI_R1_R1_1 210040B2 ## orr x1, x1, #1
-DEFINE P1_SHLI_R1_R1_0 21FC40D3 ## lsl x1, x1, #0 (UBFM #0,#63)
-DEFINE P1_SHRI_R1_R1_1 21FC41D3 ## lsr x1, x1, #1 (UBFM #1,#63)
-DEFINE P1_SARI_R1_R1_0 21FC4093 ## asr x1, x1, #0 (SBFM #0,#63)
+DEFINE P1_SARI_R4_R4_1 84FC4193 ## asr x4, x4, #1 (SBFM immr=1, imms=63)
 
 
 ## ---- Tranche 3: LA + memory ops ---------------------------------------
 ## LA is LI in the spike — both load a 4-byte zero-extended literal,
-## which is enough to address the ELF (base 0x00600000 < 2^32). Extending
-## LA to a full 64-bit load is future work (P1.md §"What needs added").
+## which is enough to address the ELF (base 0x00600000 < 2^32).
 DEFINE P1_LA_R4 4400001802000014 ## ldr w4, [pc+8] ; b +8 ; <4 bytes>
 
-## Unsigned-offset forms of LDR/STR at 8/4/1-byte widths.
-## imm12 is scaled by access size, so e.g. LDR-w offset 8 uses imm12=2.
+## Unsigned-offset forms of LDR/STR at 8/1-byte widths. imm12 is scaled
+## by access size, so e.g. LDR-B offset 16 uses imm12=16.
+## 32-bit LW/SW dropped from the ISA.
 DEFINE P1_ST_R1_R4_0 810000F9 ## str x1, [x4, #0]
 DEFINE P1_LD_R1_R4_0 810040F9 ## ldr x1, [x4, #0]
-DEFINE P1_SW_R1_R4_8 810800B9 ## str w1, [x4, #8]
-DEFINE P1_LW_R1_R4_8 810840B9 ## ldr w1, [x4, #8] (zero-ext)
+DEFINE P1_ST_R1_R4_8 810400F9 ## str x1, [x4, #8] (imm12=1, scaled ×8)
+DEFINE P1_LD_R1_R4_8 810440F9 ## ldr x1, [x4, #8]
 DEFINE P1_SB_R1_R4_16 81400039 ## strb w1, [x4, #16]
 DEFINE P1_LB_R1_R4_16 81404039 ## ldrb w1, [x4, #16] (zero-ext)
@@ -162,18 +161,15 @@ DEFINE P1_LB_R1_R4_16 81404039 ## ldrb w1, [x4, #16] (zero-ext)
 ##
 ## Conditions for "skip if NOT cond":
 ##   BEQ -> B.NE (cond 1)   BNE -> B.EQ (cond 0)
-##   BLT -> B.GE (cond A)   BGE -> B.LT (cond B)
-##   BLTU-> B.HS (cond 2)   BGEU -> B.LO (cond 3)
-## CMP x2, x3 = SUBS xzr, x2, x3 = 0xEB03005F (Rm=3, Rn=2, Rd=31).
-## BR x20 = 0xD61F0280.
+##   BLT -> B.GE (cond A)
+## Unsigned branches (BLTU/BGEU/BGE) dropped from the ISA — see P1.md.
+## CMP xN, xM = SUBS xzr, xN, xM. BR x20 = 0xD61F0280.
 
 DEFINE P1_B 80021FD6
-DEFINE P1_BEQ_R2_R3_R7 5F0003EB4100005480021FD6 ## cmp ; b.ne +8 ; br x20
-DEFINE P1_BNE_R2_R3_R7 5F0003EB4000005480021FD6 ## cmp ; b.eq +8 ; br x20
-DEFINE P1_BLT_R2_R3_R7 5F0003EB4A00005480021FD6 ## cmp ; b.ge +8 ; br x20
-DEFINE P1_BGE_R2_R3_R7 5F0003EB4B00005480021FD6 ## cmp ; b.lt +8 ; br x20
-DEFINE P1_BLTU_R2_R3_R7 5F0003EB4200005480021FD6 ## cmp ; b.hs +8 ; br x20
-DEFINE P1_BGEU_R2_R3_R7 5F0003EB4300005480021FD6 ## cmp ; b.lo +8 ; br x20
+DEFINE P1_BEQ_R2_R3_R7 5F0003EB4100005480021FD6 ## cmp x2,x3 ; b.ne +8 ; br x20
+DEFINE P1_BNE_R2_R3_R7 5F0003EB4000005480021FD6 ## cmp x2,x3 ; b.eq +8 ; br x20
+DEFINE P1_BLT_R2_R3_R7 5F0003EB4A00005480021FD6 ## cmp x2,x3 ; b.ge +8 ; br x20
+DEFINE P1_BLT_R4_R2_R7 9F0002EB4A00005480021FD6 ## cmp x4,x2 ; b.ge +8 ; br x20
 
 
 ## ---- Tranche 5: CALL / RET / PROLOGUE / EPILOGUE / TAIL -----------------
diff --git a/p1_amd64.M1 b/p1_amd64.M1
@@ -1,7 +1,6 @@
 ## P1 pseudo-ISA — amd64 backing defs (v0.1 spike)
 ##
-## Implements the subset needed by hello.M1 and demo.M1: LI, SYSCALL,
-## plus a handful of MOV/ADD/SUB register tuples. See P1.md.
+## Implements the subset needed by hello.M1 and demo.M1. See P1.md.
 ##
 ## Register mapping (P1 → amd64):
 ##   r0 → rax , r1 → rdi , r2 → rsi , r3 → rdx
@@ -37,11 +36,6 @@ DEFINE P1_LI_R7 41BC # mov r12d, imm32
 ## Expansion:
 ##   mov r9, rbx ; 49 89 D9
 ##   syscall     ; 0F 05
-##
-## The shuffle runs unconditionally — syscalls that ignore arg 6 don't
-## care that r9 got overwritten with rbx (both are caller-saved on the
-## platform ABI; the kernel reads only the registers the specific
-## syscall cares about).
 DEFINE P1_SYSCALL 4989D90F05
@@ -51,109 +45,107 @@ DEFINE SYS_WRITE 01000000
 DEFINE SYS_EXIT 3C000000
 
-## ---- MOV / ADD / SUB (demo.M1 subset) ------------------------------------
+## ---- Tranche 1: full arith reg-reg-reg ----------------------------------
 ## amd64 lacks a 3-operand arithmetic form, so every "rD = rA op rB"
 ## expands unconditionally to two native insns:
 ##   mov rD_native, rA_native
 ##   <op> rD_native, rB_native
 ## When rD == rA the leading mov is a no-op write, kept anyway — P1 is
 ## deliberately unoptimized (P1.md §"Non-goals").
-
-## MOV rD, rA -> mov rD_native, rA_native (single insn)
-DEFINE P1_MOV_R6_R3 4889D3 ## mov rbx, rdx
-DEFINE P1_MOV_R1_R6 4889DF ## mov rdi, rbx
-
-## ADD rD, rA, rB -> mov rD,rA ; add rD,rB
-DEFINE P1_ADD_R3_R1_R2 4889FA4801F2 ## mov rdx,rdi ; add rdx,rsi
-DEFINE P1_ADD_R2_R2_R6 4889F64801DE ## mov rsi,rsi ; add rsi,rbx
-
-## SUB rD, rA, rB -> mov rD,rA ; sub rD,rB
-DEFINE P1_SUB_R3_R3_R4 4889D24C29D2 ## mov rdx,rdx ; sub rdx,r10
-
-
-## ---- Tranche 1: full arith reg-reg-reg ----------------------------------
-## Identity-chain tuples used by demo.M1. r1=rdi runs through each op.
 ## x86-64 oddities:
 ##  - Shifts need count in cl; three-insn form mov rD,rA; mov rcx,rB; shl.
 ##  - IDIV needs dividend in rdx:rax. We save rdx to rcx before CQO so
 ##    both the divisor (when it is rdx) and the caller's r3 survive.
 
 ## MOV rD, rA -> mov rD_native, rA_native
+DEFINE P1_MOV_R1_R6 4889DF ## mov rdi, rbx
 DEFINE P1_MOV_R6_R1 4889FB ## mov rbx, rdi
 
-## ADD / SUB / AND / OR / XOR — 2-insn form, leading mov rdi,rdi kept.
+## ADD rD, rA, rB -> mov rD,rA ; add rD,rB
 DEFINE P1_ADD_R1_R1_R2 4889FF4801F7 ## mov rdi,rdi ; add rdi,rsi
+DEFINE P1_ADD_R1_R1_R4 4889FF4C01D7 ## mov rdi,rdi ; add rdi,r10
+DEFINE P1_ADD_R2_R2_R6 4889F64801DE ## mov rsi,rsi ; add rsi,rbx (syscall arg)
+
+## SUB / AND / OR / XOR — 2-insn form, leading mov rdi,rdi kept.
 DEFINE P1_SUB_R1_R1_R2 4889FF4829F7 ## mov rdi,rdi ; sub rdi,rsi
 DEFINE P1_XOR_R1_R1_R2 4889FF4831F7 ## mov rdi,rdi ; xor rdi,rsi
 DEFINE P1_OR_R1_R1_R2 4889FF4809F7 ## mov rdi,rdi ; or rdi,rsi
 DEFINE P1_AND_R1_R1_R5 4889FF4C21C7 ## mov rdi,rdi ; and rdi,r8
 
-## MUL rD, rA, rB -> mov rD,rA ; imul rD,rB (IMUL r64,r/m64 = 0F AF)
-DEFINE P1_MUL_R1_R1_R3 4889FF480FAFFA ## mov rdi,rdi ; imul rdi,rdx
+## MUL rD, rA, rB -> mov rD,rA ; imul rD,rB (IMUL r64,r/m64 = 0F AF)
+DEFINE P1_MUL_R1_R1_R2 4889FF480FAFFE ## mov rdi,rdi ; imul rdi,rsi
 
-## DIV rD, rA, rB -> divisor in rdx (r3). Save rdx to rcx so the
-## divisor survives CQO's clobber of rdx, and so r3 is restored after.
-## mov rcx,rdx ; mov rax,rdi ; cqo ; idiv rcx ; mov rdi,rax ; mov rdx,rcx
-DEFINE P1_DIV_R1_R1_R3 4889D14889F8489948F7F94889C74889CA
+## DIV rD, rA, rB -> divisor in rsi (r2). CQO still clobbers rdx,
+## so save/restore caller's r3 through rcx.
+## mov rcx,rdx ; mov rax,rdi ; cqo ; idiv rsi ; mov rdi,rax ; mov rdx,rcx
+DEFINE P1_DIV_R1_R1_R2 4889D14889F8489948F7FE4889C74889CA
 
-## REM rD, rA, rB -> divisor in r8 (r5). CQO still clobbers rdx, so
-## save/restore r3 through rcx.
+## REM rD, rA, rB -> divisor in r8 (r5). Same rdx-save dance.
 ## mov rcx,rdx ; mov rax,rdi ; cqo ; idiv r8 ; mov rdi,rdx ; mov rdx,rcx
 DEFINE P1_REM_R1_R1_R5 4889D14889F8489949F7F84889D74889CA
 
-## SHL / SHR / SAR -> mov rD,rA ; mov rcx,rB ; shl/shr/sar rD,cl
+## SHL / SHR -> mov rD,rA ; mov rcx,rB ; shl/shr rD,cl
 ## Opcode D3 /n with REX.W: /4=SHL, /5=SHR, /7=SAR.
 DEFINE P1_SHL_R1_R1_R2 4889FF4889F148D3E7 ## mov rdi,rdi; mov rcx,rsi; shl rdi,cl
 DEFINE P1_SHR_R1_R1_R2 4889FF4889F148D3EF ## mov rdi,rdi; mov rcx,rsi; shr rdi,cl
-DEFINE P1_SAR_R1_R1_R2 4889FF4889F148D3FF ## mov rdi,rdi; mov rcx,rsi; sar rdi,cl
+
+## SAR — lives on r4 (r10) with negative input; see demo.M1 tranche 1.
+## mov r10,r10 ; mov rcx,rsi ; sar r10,cl
+DEFINE P1_SAR_R4_R4_R2 4D89D24889F149D3FA
 
 
 ## ---- Tranche 2: immediate arith ---------------------------------------
-## mov rdi,rdi ; <op> rdi, imm8 (sign-extended imm8 forms via opcode 83;
+## Pattern: mov rD,rA ; <op> rD, imm8 (sign-extended imm8 via opcode 83;
 ## shifts via C1). /n is the opcode-extension field in ModRM.reg.
+## Negative ADDI uses the same `add r/m64, imm8` with a sign-extended
+## imm8 — x86 handles the sign natively.
 
-DEFINE P1_ADDI_R1_R1_3 4889FF4883C703 ## add rdi, 3 (83 /0 ib)
-DEFINE P1_ANDI_R1_R1_7 4889FF4883E707 ## and rdi, 7 (83 /4 ib)
-DEFINE P1_ORI_R1_R1_1 4889FF4883CF01 ## or rdi, 1 (83 /1 ib)
-DEFINE P1_SHLI_R1_R1_0 4889FF48C1E700 ## shl rdi, 0 (C1 /4 ib)
+DEFINE P1_ADDI_R1_R1_3 4889FF4883C703 ## add rdi, 3 (83 /0 ib)
+DEFINE P1_ADDI_R1_R1_1 4889FF4883C701 ## add rdi, 1
+DEFINE P1_ADDI_R1_R1_NEG3 4889FF4883C7FD ## add rdi, -3 (imm8 FD = -3)
+DEFINE P1_ADDI_R4_R4_NEG1 4D89D24983C2FF ## mov r10,r10 ; add r10, -1
+
+DEFINE P1_SHLI_R1_R1_1 4889FF48C1E701 ## shl rdi, 1 (C1 /4 ib)
 DEFINE P1_SHRI_R1_R1_1 4889FF48C1EF01 ## shr rdi, 1 (C1 /5 ib)
-DEFINE P1_SARI_R1_R1_0 4889FF48C1FF00 ## sar rdi, 0 (C1 /7 ib)
+
+DEFINE P1_ANDI_R1_R1_6 4889FF4883E706 ## and rdi, 6 (83 /4 ib)
+DEFINE P1_ORI_R1_R1_1 4889FF4883CF01 ## or rdi, 1 (83 /1 ib)
+
+DEFINE P1_SARI_R4_R4_1 4D89D249C1FA01 ## mov r10,r10 ; sar r10, 1
 
 
 ## ---- Tranche 3: LA + memory ops ---------------------------------------
 ## LA is LI in the spike (addresses fit in 32 bits → zero-extends cleanly
-## through the mov-to-r32 form). r4 is r10, so base-reg encoding uses
-## REX.B=1 and ModRM rm=010. No SIB byte needed (r10's low3 bits = 010).
+## through the mov-to-r32 form). r4 is r10.
 DEFINE P1_LA_R4 41BA ## mov r10d, imm32
 
-## Plain MOV r/m, r / MOV r, r/m with 8-bit displacement.
-## REX: W=1 for 64-bit moves; B=1 always (r10 is the base register).
+## Plain MOV r/m, r / MOV r, r/m with 8-bit displacement. REX.W=1 for
+## 64-bit moves; REX.B=1 always (r10 is the base register).
+## 32-bit LW/SW dropped from the ISA.
 DEFINE P1_ST_R1_R4_0 49893A ## mov [r10], rdi
 DEFINE P1_LD_R1_R4_0 498B3A ## mov rdi, [r10]
-DEFINE P1_SW_R1_R4_8 41897A08 ## mov [r10+8], edi
-DEFINE P1_LW_R1_R4_8 418B7A08 ## mov edi, [r10+8] (zero-ext)
+DEFINE P1_ST_R1_R4_8 49897A08 ## mov [r10+8], rdi (disp8)
+DEFINE P1_LD_R1_R4_8 498B7A08 ## mov rdi, [r10+8] (disp8)
 DEFINE P1_SB_R1_R4_16 41887A10 ## mov [r10+16], dil
 DEFINE P1_LB_R1_R4_16 490FB67A10 ## movzx rdi, byte [r10+16]
 
 
 ## ---- Tranche 4: branches (r7-indirect, no hex2_word needed) ------------
-## Same pattern as aarch64: cmp ra,rb ; short native jcc over a jmp-through-r7.
-## If cond is false the native "skip" jcc fires (opposite of the P1 cond) and
-## steps past the 3-byte `jmp r12`, falling through. If cond is true we take
-## the jmp r12 to the address the caller stashed in r7.
+## Pattern: cmp ra,rb ; short native jcc over a jmp-through-r7.
+## If cond is false the native "skip" jcc fires (opposite of the P1 cond)
+## and steps past the 3-byte `jmp r12`, falling through. If cond is true
+## we take the jmp r12 to the address the caller stashed in r7.
 ## P1_B is just `jmp r12` unconditionally.
 ##
 ## CMP rsi, rdx = 48 39 D6 (REX.W, opcode 39 /r, ModRM: 11 010 110).
 ## JMP r12 = 41 FF E4 (REX.B, opcode FF /4, ModRM: 11 100 100).
-## jcc rel8 opcodes (skip when NOT cond): JE=74 JNE=75 JL=7C JGE=7D JB=72 JAE=73.
+## jcc rel8 opcodes (skip when NOT cond): JE=74 JNE=75 JL=7C JGE=7D.
 
 DEFINE P1_B 41FFE4 ## jmp r12
 DEFINE P1_BEQ_R2_R3_R7 4839D6750341FFE4 ## cmp ; jne +3 ; jmp r12
 DEFINE P1_BNE_R2_R3_R7 4839D6740341FFE4 ## cmp ; je +3 ; jmp r12
-DEFINE P1_BLT_R2_R3_R7 4839D67D0341FFE4 ## cmp ; jge +3 ; jmp r12
-DEFINE P1_BGE_R2_R3_R7 4839D67C0341FFE4 ## cmp ; jl +3 ; jmp r12
-DEFINE P1_BLTU_R2_R3_R7 4839D6730341FFE4 ## cmp ; jae +3 ; jmp r12
-DEFINE P1_BGEU_R2_R3_R7 4839D6720341FFE4 ## cmp ; jb +3 ; jmp r12
+DEFINE P1_BLT_R2_R3_R7 4839D67D0341FFE4 ## cmp rsi,rdx ; jge +3 ; jmp r12
+DEFINE P1_BLT_R4_R2_R7 4939F27D0341FFE4 ## cmp r10,rsi ; jge +3 ; jmp r12
 
 
 ## ---- Tranche 5: CALL / RET / PROLOGUE / EPILOGUE / TAIL -----------------
@@ -164,8 +156,7 @@ DEFINE P1_BGEU_R2_R3_R7 4839D6720341FFE4 ## cmp ; jb +3 ; jmp r12
 ## work spilling/reloading lr.
 ##
 ## CALL expects the target pre-loaded into r7 (= r12); expands to `call r12`.
-## TAIL = EPILOGUE + unconditional B (= `jmp r12`), so the caller of the
-## tail-calling function receives the return from the tail target directly.
+## TAIL = EPILOGUE + unconditional B (= `jmp r12`).
 
 DEFINE P1_PROLOGUE 90 ## nop (retaddr already on stack)
 DEFINE P1_EPILOGUE 90 ## nop
diff --git a/p1_riscv64.M1 b/p1_riscv64.M1
@@ -1,7 +1,6 @@
 ## P1 pseudo-ISA — riscv64 backing defs (v0.1 spike)
 ##
-## Implements the subset needed by hello.M1 and demo.M1: LI, SYSCALL,
-## plus a handful of MOV/ADD/SUB register tuples. See P1.md.
+## Implements the subset needed by hello.M1 and demo.M1. See P1.md.
 ##
 ## Register mapping (P1 → RISC-V):
 ##   r0 → a0 (x10) , r1 → a1 (x11) , r2 → a2 (x12) , r3 → a3 (x13)
@@ -52,9 +51,6 @@ DEFINE P1_LI_R7 170900000369C9006F008000
 ##   mv a4, a5 ; P1 r5 -> native arg5
 ##   mv a5, s1 ; P1 r6 -> native arg6
 ##   ecall
-##
-## Unconditional — unused shuffles read caller-saved scratch, and the
-## kernel reads only the regs each syscall cares about.
 DEFINE P1_SYSCALL 9308050013850500930506001386060093060700138707009387040073000000
@@ -63,56 +59,55 @@ DEFINE SYS_WRITE 40000000
 DEFINE SYS_EXIT 5D000000
 
-## ---- MOV / ADD / SUB (demo.M1 subset) ------------------------------------
-## One native insn per op.
-
-## MOV rD, rA -> addi rD, rA, 0 (the `mv` pseudo)
-DEFINE P1_MOV_R6_R3 93840600 ## mv s1, a3
-DEFINE P1_MOV_R1_R6 93850400 ## mv a1, s1
-
-## ADD rD, rA, rB -> add rD, rA, rB
-DEFINE P1_ADD_R3_R1_R2 B386C500 ## add a3, a1, a2
-DEFINE P1_ADD_R2_R2_R6 33069600 ## add a2, a2, s1
-
-## SUB rD, rA, rB -> sub rD, rA, rB
-DEFINE P1_SUB_R3_R3_R4 B386E640 ## sub a3, a3, a4
-
-
 ## ---- Tranche 1: full arith reg-reg-reg ----------------------------------
-## Identity-chain tuples used by demo.M1. All one-insn R-type ops.
-## MUL/DIV/REM are M-extension ops (standard on rv64gc).
+## Non-identity discriminator chain used by demo.M1. All one-insn R-type
+## ops. MUL/DIV/REM are M-extension ops (standard on rv64gc).
 
-## MOV rD, rA -> addi rD, rA, 0
+## MOV rD, rA -> addi rD, rA, 0 (the `mv` pseudo)
+DEFINE P1_MOV_R1_R6 93850400 ## mv a1, s1
 DEFINE P1_MOV_R6_R1 93840500 ## mv s1, a1
 
 ## ADD / SUB / AND / OR / XOR — R-type, funct3 picks the op.
 DEFINE P1_ADD_R1_R1_R2 B385C500 ## add a1, a1, a2
+DEFINE P1_ADD_R1_R1_R4 B385E500 ## add a1, a1, a4
+DEFINE P1_ADD_R2_R2_R6 33069600 ## add a2, a2, s1 (syscall-arg computation)
+
 DEFINE P1_SUB_R1_R1_R2 B385C540 ## sub a1, a1, a2 (funct7=0x20)
 DEFINE P1_XOR_R1_R1_R2 B3C5C500 ## xor a1, a1, a2 (funct3=4)
 DEFINE P1_OR_R1_R1_R2 B3E5C500 ## or a1, a1, a2 (funct3=6)
 DEFINE P1_AND_R1_R1_R5 B3F5F500 ## and a1, a1, a5 (funct3=7)
 
 ## MUL / DIV / REM — M extension, funct7=1.
-DEFINE P1_MUL_R1_R1_R3 B385D502 ## mul a1, a1, a3 -DEFINE P1_DIV_R1_R1_R3 B3C5D502 ## div a1, a1, a3 (funct3=4) +DEFINE P1_MUL_R1_R1_R2 B385C502 ## mul a1, a1, a2 +DEFINE P1_DIV_R1_R1_R2 B3C5C502 ## div a1, a1, a2 (funct3=4) DEFINE P1_REM_R1_R1_R5 B3E5F502 ## rem a1, a1, a5 (funct3=6) -## SHL / SHR / SAR -> sll / srl / sra. +## SHL / SHR -> sll / srl. DEFINE P1_SHL_R1_R1_R2 B395C500 ## sll a1, a1, a2 (funct3=1) DEFINE P1_SHR_R1_R1_R2 B3D5C500 ## srl a1, a1, a2 (funct3=5) -DEFINE P1_SAR_R1_R1_R2 B3D5C540 ## sra a1, a1, a2 (funct7=0x20) + +## SAR rD, rA, rB -> sra xD, xA, xB. Discriminator lives on r4 +## (negative value) — see demo.M1 tranche 1. +DEFINE P1_SAR_R4_R4_R2 3357C740 ## sra a4, a4, a2 (funct7=0x20) ## ---- Tranche 2: immediate arith --------------------------------------- ## I-type / shift-immediate forms. Shift amount is 6 bits (shamt6) for -## rv64. SRAI sets bit 30 to distinguish from SRLI. +## rv64. SRAI sets bit 30 to distinguish from SRLI. Negative ADDI uses +## sign-extended imm12 directly — no separate "subi" needed. 
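The R-type and I-type byte strings in these tranches can be regenerated from first principles. A small Python sketch (illustrative only, not part of this commit) that packs the RISC-V fields and emits the little-endian hex used by the defs — register numbers per the mapping comment at the top of the file (a1=x11, a2=x12, a4=x14):

```python
# RV64 R-type: funct7 | rs2 | rs1 | funct3 | rd | opcode(0x33).
def rtype(funct7, rs2, rs1, funct3, rd, opcode=0x33):
    word = (funct7 << 25) | (rs2 << 20) | (rs1 << 15) \
         | (funct3 << 12) | (rd << 7) | opcode
    return word.to_bytes(4, "little").hex().upper()

# RV64 I-type: imm12 | rs1 | funct3 | rd | opcode(0x13). imm is sign-packed.
def itype(imm, rs1, funct3, rd, opcode=0x13):
    word = ((imm & 0xFFF) << 20) | (rs1 << 15) \
         | (funct3 << 12) | (rd << 7) | opcode
    return word.to_bytes(4, "little").hex().upper()

print(rtype(0x20, 12, 14, 5, 14))        # sra a4, a4, a2 -> 3357C740
print(rtype(1, 12, 11, 0, 11))           # mul a1, a1, a2 -> B385C502
print(itype(-3, 11, 0, 11))              # addi a1, a1, -3 -> 9385D5FF
# SRAI: shamt6 in imm[5:0]; imm[11:6]=0x10 sets bit 30 (vs. SRLI).
print(itype((0x10 << 6) | 1, 14, 5, 14)) # srai a4, a4, 1 -> 13571740
```

Each printed string matches the corresponding DEFINE, including the sign-extended imm12 for negative ADDI.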
-DEFINE P1_ADDI_R1_R1_3 93853500 ## addi a1, a1, 3 -DEFINE P1_ANDI_R1_R1_7 93F57500 ## andi a1, a1, 7 (funct3=7) -DEFINE P1_ORI_R1_R1_1 93E51500 ## ori a1, a1, 1 (funct3=6) -DEFINE P1_SHLI_R1_R1_0 93950500 ## slli a1, a1, 0 (funct3=1) +DEFINE P1_ADDI_R1_R1_3 93853500 ## addi a1, a1, 3 +DEFINE P1_ADDI_R1_R1_1 93851500 ## addi a1, a1, 1 +DEFINE P1_ADDI_R1_R1_NEG3 9385D5FF ## addi a1, a1, -3 +DEFINE P1_ADDI_R4_R4_NEG1 1307F7FF ## addi a4, a4, -1 + +DEFINE P1_SHLI_R1_R1_1 93951500 ## slli a1, a1, 1 (funct3=1) DEFINE P1_SHRI_R1_R1_1 93D51500 ## srli a1, a1, 1 (funct3=5) -DEFINE P1_SARI_R1_R1_0 93D50540 ## srai a1, a1, 0 (funct7=0x20) + +DEFINE P1_ANDI_R1_R1_6 93F56500 ## andi a1, a1, 6 (funct3=7) +DEFINE P1_ORI_R1_R1_1 93E51500 ## ori a1, a1, 1 (funct3=6) + +DEFINE P1_SARI_R4_R4_1 13571740 ## srai a4, a4, 1 (imm[11:6]=0x10) ## ---- Tranche 3: LA + memory ops --------------------------------------- @@ -120,12 +115,12 @@ DEFINE P1_SARI_R1_R1_0 93D50540 ## srai a1, a1, 0 (funct7=0x20) DEFINE P1_LA_R4 170700000367C7006F008000 ## auipc a4,0; lwu a4,12(a4); jal x0,+8 ## LOAD (opcode 0x03) / STORE (opcode 0x23) with signed 12-bit offset. -## For P1: LD=64b, LW=32b zero-ext (= LWU), LB=8b zero-ext (= LBU). -## ST=SD (64b), SW=32b, SB=8b. +## For P1: LD=64b, LB=8b zero-ext (= LBU). ST=SD (64b), SB=8b. +## 32-bit LW/SW dropped from the ISA. DEFINE P1_ST_R1_R4_0 2330B700 ## sd a1, 0(a4) DEFINE P1_LD_R1_R4_0 83350700 ## ld a1, 0(a4) -DEFINE P1_SW_R1_R4_8 2324B700 ## sw a1, 8(a4) -DEFINE P1_LW_R1_R4_8 83658700 ## lwu a1, 8(a4) +DEFINE P1_ST_R1_R4_8 2334B700 ## sd a1, 8(a4) +DEFINE P1_LD_R1_R4_8 83358700 ## ld a1, 8(a4) DEFINE P1_SB_R1_R4_16 2308B700 ## sb a1, 16(a4) DEFINE P1_LB_R1_R4_16 83450701 ## lbu a1, 16(a4) @@ -137,20 +132,17 @@ DEFINE P1_LB_R1_R4_16 83450701 ## lbu a1, 16(a4) ## the unconditional `jalr x0, 0(s2)` when the P1 condition is NOT met. ## ## Fixed offset: the conditional branches below all use imm=8 (skip past the -## 4-byte JALR on the false path). 
B-type imm=8 encodes to (imm[12|10:5]=0,
-## imm[4:1|11]=4, so imms spread as funct7 bits 0x00 and rd bits 0x4).
-## Branch instruction byte strings below were assembled and verified by hand.
+## 4-byte JALR on the false path). B-type imm=8 encodes to [11:7]=0x08,
+## [31:25]=0x00.
##
## P1_B is just `jalr x0, 0(s2)` — unconditional jump to address in r7.
-## All comparisons use a2 (P1 r2) vs a3 (P1 r3).
+## BLTU/BGEU (unsigned) and BGE dropped from the ISA — see P1.md.

DEFINE P1_B 67000900 ## jalr x0, 0(s2)

DEFINE P1_BEQ_R2_R3_R7 6314D60067000900 ## bne a2,a3,+8 ; jalr x0,0(s2)
-DEFINE P1_BNE_R2_R3_R7 6304D60067000900 ## beq ; jalr
-DEFINE P1_BLT_R2_R3_R7 6354D60067000900 ## bge ; jalr
-DEFINE P1_BGE_R2_R3_R7 6344D60067000900 ## blt ; jalr
-DEFINE P1_BLTU_R2_R3_R7 6374D60067000900 ## bgeu; jalr
-DEFINE P1_BGEU_R2_R3_R7 6364D60067000900 ## bltu; jalr
+DEFINE P1_BNE_R2_R3_R7 6304D60067000900 ## beq a2,a3,+8 ; jalr
+DEFINE P1_BLT_R2_R3_R7 6354D60067000900 ## bge a2,a3,+8 ; jalr
+DEFINE P1_BLT_R4_R2_R7 6354C70067000900 ## bge a4,a2,+8 ; jalr


## ---- Tranche 5: CALL / RET / PROLOGUE / EPILOGUE / TAIL -----------------
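The tranche-4 branch encodings above can be spot-checked the same way. A small Python sketch (illustrative only, not part of this commit) that packs a B-type word — with imm=8 landing in bits [11:7] as claimed — plus the jalr-through-s2 tail (a2=x12, a3=x13, a4=x14, s2=x18):

```python
# RV64 B-type: imm[12|10:5] | rs2 | rs1 | funct3 | imm[4:1|11] | opcode(0x63).
def btype(imm, rs2, rs1, funct3, opcode=0x63):
    word = (((imm >> 12) & 1) << 31) | (((imm >> 5) & 0x3F) << 25) \
         | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) \
         | (((imm >> 1) & 0xF) << 8) | (((imm >> 11) & 1) << 7) | opcode
    return word.to_bytes(4, "little").hex().upper()

# jalr rd, imm(rs1): I-type with opcode 0x67, funct3=0.
def jalr(rd, rs1, imm=0):
    word = ((imm & 0xFFF) << 20) | (rs1 << 15) | (rd << 7) | 0x67
    return word.to_bytes(4, "little").hex().upper()

print(btype(8, 13, 12, 1))  # bne a2,a3,+8 -> 6314D600
print(btype(8, 12, 14, 5))  # bge a4,a2,+8 -> 6354C700
print(jalr(0, 18))          # jalr x0,0(s2) -> 67000900
```

Concatenating the skip-branch word with the jalr word reproduces each conditional define's 16-hex-digit string.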