boot2

Playing with the bootstrap
git clone https://git.ryansepassi.com/git/boot2.git

commit 30273099f630eb2e8f787c413fc863ad2e44edfa
parent d19a402ecc3df0f912aecf4ec2a8d400f26c5f05
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Fri, 24 Apr 2026 09:03:03 -0700

Hide P1 frame header and merge LEAVE+RET into ERET

Portable sp after ENTER points to the frame-local base; the saved
retaddr and saved caller sp become backend-private. LEAVE is dropped
as a standalone op — ERET atomically tears down the frame and returns,
mirroring TAIL/TAILR, which already bundle the epilogue. Leaf functions
still return with bare RET.
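For portable code, the visible effect is an offset shift: frame-local slots that used to sit above a two-word frame header at `sp + 2*WORD + k` are now addressed directly at `sp + k`, with the header hidden behind the backend. A minimal sketch of the remap (assuming 64-bit words; the helper names are illustrative, not part of the toolchain):

```python
WORD = 8  # P1v2-64; P1v2-32 uses WORD = 4

def old_local_offset(k):
    # Pre-commit layout: [sp+0] = saved retaddr, [sp+WORD] = saved caller sp,
    # frame-local storage begins at sp + 2*WORD.
    return 2 * WORD + k

def new_local_offset(k):
    # Post-commit layout: portable sp IS the frame-local base; the saved
    # retaddr and caller sp live in a backend-private header outside it.
    return k

# Outgoing arg word 0 moves from [sp+16] to [sp+0] on 64-bit targets,
# matching the m1pp.M1 offset rewrites in the diff below.
assert old_local_offset(0) == 16
assert new_local_offset(0) == 0
```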

Diffstat:
Mdocs/P1.md | 97+++++++++++++++++++++++++++++++++++++++----------------------------------------
Mm1pp/m1pp.M1 | 359+++++++++++++++++++++++++++++++++++--------------------------------------------
Mp1/P1-aarch64.M1pp | 17++++++++++++-----
Mp1/P1-amd64.M1pp | 240++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
Mp1/P1-riscv64.M1pp | 21+++++++++++++++------
Mp1/P1.M1pp | 6+++---
Mp1/aarch64.py | 38+++++++++++++++++++++++++++-----------
Mp1/p1_gen.py | 2+-
Mpost.md | 12++++++------
Mtests/p1/double.P1 | 5++---
Mtests/p1/p1-aliasing.P1 | 3+--
Mtests/p1/p1-call.P1 | 8+++-----
12 files changed, 505 insertions(+), 303 deletions(-)
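The amd64 `p1_eret` sequence in the diff below (load retaddr, load saved caller sp, switch stacks, re-push retaddr, `ret`) can be checked with a toy memory/register model: control must land back at the caller with the caller's sp restored. This is an illustrative model only, not backend code; the addresses are made up:

```python
def eret(mem, regs):
    """Toy model of the amd64 ERET epilogue:
    r9 = [rsp+0]; rax = [rsp+8]; rsp = rax; push r9; ret."""
    retaddr = mem[regs["rsp"] + 0]    # backend-private saved retaddr
    caller_sp = mem[regs["rsp"] + 8]  # backend-private saved caller sp
    regs["rsp"] = caller_sp           # unwind to the caller's stack
    regs["rsp"] -= 8                  # push r9: reinstall retaddr
    mem[regs["rsp"]] = retaddr
    regs["rip"] = mem[regs["rsp"]]    # ret: pop retaddr into rip
    regs["rsp"] += 8

# Hypothetical frame: header at rsp holds retaddr 0x401000, caller sp 0x1000.
mem = {0x0F80: 0x401000, 0x0F88: 0x1000}
regs = {"rsp": 0x0F80, "rip": 0}
eret(mem, regs)
assert regs["rip"] == 0x401000  # control returns to the caller
assert regs["rsp"] == 0x1000    # caller sp restored
```

TAIL/TAILR perform the same first four steps but end in a jump instead of the final `ret`, which is why the diff expands them inline rather than sharing a `p1_leave` body.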

diff --git a/docs/P1.md b/docs/P1.md @@ -47,7 +47,7 @@ So the notation in this document is descriptive rather than literal: opcodes - `BR rs`, `CALLR rs`, and `TAILR rs` mean register-specific control-flow opcodes -- `LEAVE`, `CALL`, `RET`, `TAIL`, `B`, and `SYSCALL` remain operand-free +- `ERET`, `CALL`, `RET`, `TAIL`, `B`, and `SYSCALL` remain operand-free Labels still appear in source where the toolchain supports them directly, such as `LA rd, %label` and `LA_BR %label`. @@ -176,8 +176,8 @@ those words in the first `m` words of its frame-local storage immediately before the call: ``` -[sp + 2*WORD + 0*WORD] = outgoing arg word 0 -[sp + 2*WORD + 1*WORD] = outgoing arg word 1 +[sp + 0*WORD] = outgoing arg word 0 +[sp + 1*WORD] = outgoing arg word 1 ... ``` @@ -191,29 +191,30 @@ addressed prefix available for outgoing argument staging across the call. ### Standard frame layout -Functions that need local stack storage use a standard frame layout. After -frame establishment: +Functions that need local stack storage establish a standard frame with +`ENTER size`. After frame establishment, the portable-visible frame-local +storage occupies the first `size` bytes above `sp`: ``` -[sp + 0*WORD] = saved return address -[sp + 1*WORD] = saved caller stack pointer -[sp + 2*WORD ... sp + 2*WORD + local_bytes - 1] = frame-local storage -... +[sp + 0 ... sp + size - 1] = frame-local storage ``` Frame-local storage is byte-addressed. Portable code may use it for ordinary locals, spilled callee-saved registers, and the caller-staged outgoing stack-argument words described above. -Total frame size is: +Each frame also carries backend-private per-frame state — typically the +saved return continuation, saved caller `sp`, and any padding needed to +satisfy `STACK_ALIGN`. That state is not addressable by portable source, +and the backend chooses its layout and total allocation size. 
-`round_up(STACK_ALIGN, 2*WORD_SIZE + local_bytes)` +Word sizes: -Where: +- `WORD = 8` in P1v2-64 +- `WORD = 4` in P1v2-32 -- `WORD_SIZE = 8` in P1v2-64 -- `WORD_SIZE = 4` in P1v2-32 -- `STACK_ALIGN` is target-defined and must satisfy the native call ABI +`STACK_ALIGN` is target-defined and must satisfy the native call ABI at +every call boundary. Leaf functions that need no frame-local storage may omit the frame entirely. @@ -235,8 +236,8 @@ Leaf functions that need no frame-local storage may omit the frame entirely. | Memory | `LD`, `ST`, `LB`, `SB` | | ABI access | `LDARG` | | Branching | `B`, `BR`, `BEQ`, `BNE`, `BLT`, `BLTU`, `BEQZ`, `BNEZ`, `BLTZ` | -| Calls / returns | `CALL`, `CALLR`, `RET`, `TAIL`, `TAILR` | -| Frame management | `ENTER`, `LEAVE` | +| Calls / returns | `CALL`, `CALLR`, `RET`, `ERET`, `TAIL`, `TAILR` | +| Frame management | `ENTER` | | System | `SYSCALL` | ## Immediates @@ -284,9 +285,14 @@ instruction after the `CALL`. `CALL` requires an active standard frame. the code pointer value held in `rs` and establishes the same return continuation semantics as `CALL`. -`RET` returns through the current return continuation. `RET` is valid whether -or not the current function has established a standard frame, provided any -frame established by the function has already been torn down. +`RET` returns from a leaf function through the hidden return continuation +captured at call time. `RET` is valid only when the current function has no +active standard frame. + +`ERET` returns from a function that has an active standard frame. It +performs the standard epilogue — restoring `sp` and the hidden return +continuation — and then returns to the caller. `ERET` is valid only when +the current function has an active standard frame. `TAIL` is a tail call to the target most recently loaded by `LA_BR`. It is valid only when the current function has an active standard frame. `TAIL` @@ -300,7 +306,7 @@ current function has an active standard frame. 
Because stack-passed outgoing argument words are staged in the caller's own frame-local storage, `TAIL` and `TAILR` are portable only when the tail-called callee requires no stack-passed argument words. Portable compilers must lower -other tail-call cases to an ordinary `CALL` / `RET` sequence. +other tail-call cases to an ordinary `CALL` / `ERET` sequence. Portable source must treat the return continuation as hidden machine state. It must not assume that the return address lives in any exposed register or stack @@ -309,40 +315,32 @@ establishment. ### Prologue / Epilogue -P1 v2 defines the following frame-establishment and frame-teardown operations: - -- `ENTER size` -- `LEAVE` +P1 v2 has a single frame-establishment op, `ENTER size`. Frame teardown is +not a standalone op; it is embedded in `ERET`, `TAIL`, and `TAILR`. -`ENTER size` establishes the standard frame layout with `size` bytes of -frame-local storage: +`ENTER size` establishes a standard frame with `size` bytes of frame-local +storage. After it executes: ``` -[sp + 0*WORD] = saved return address -[sp + 1*WORD] = saved caller stack pointer -[sp + 2*WORD ... sp + 2*WORD + size - 1] = frame-local storage +[sp + 0 ... sp + size - 1] = frame-local storage ``` -The total allocation size is: - -`round_up(STACK_ALIGN, 2*WORD_SIZE + size)` - -The named frame-local bytes are the usable local storage. Any additional bytes -introduced by alignment rounding are padding, not extra local bytes. - -`LEAVE` tears down the current standard frame and restores the hidden return -continuation so that a subsequent `RET` returns correctly. +Any backend-private per-frame state (saved return continuation, saved +caller `sp`, alignment padding) lives outside the portable-visible `size` +bytes. Portable source may not address it. -Because every standard frame stores the saved caller stack pointer at -`[sp + 1*WORD]`, `LEAVE` does not need to know the frame-local byte count used -by the corresponding `ENTER`. 
+`ERET`, `TAIL`, and `TAILR` each perform the standard epilogue — restoring +`sp` and the hidden return continuation — and then transfer control: `ERET` +to the caller, `TAIL` to the target in `br`, and `TAILR` to the target in +`rs`. Portable source must use one of these ops (not `RET`) to exit a +function that has established a frame. -A function may omit `ENTER` / `LEAVE` entirely if it is a leaf and needs no -standard frame. +A function may omit `ENTER` entirely if it is a leaf and needs no frame. +Such a function exits with `RET`. -`ENTER` and `LEAVE` do not implicitly save or restore `s0` or `s1`. A -function that modifies `s0` or `s1` must preserve them explicitly, typically by -storing them in frame-local storage within its standard frame. +`ENTER` does not implicitly save or restore `s0`-`s3`. A function that +modifies any callee-saved register must preserve it explicitly, typically +by storing it in frame-local storage within its standard frame. ### Branching @@ -419,7 +417,8 @@ register. Portable source may also read the current stack pointer through `MOV rd, sp`. Portable source may not write `sp` through `MOV`. Stack-pointer updates are only -performed by `ENTER`, `LEAVE`, and backend-private call/return machinery. +performed by `ENTER`, `ERET`, `TAIL`, `TAILR`, and backend-private call/return +machinery. `LI` materializes an integer bit-pattern. `LA` materializes the address of a label. `LA_BR` is a separate control-flow-target materialization form and is not @@ -483,7 +482,7 @@ At entry to `p1_main`, the native entry-stack layout has already been consumed by the backend stub. Portable source may not assume anything about the `sp` value inherited from `_start` except that it satisfies the call-boundary alignment rule and that the standard frame protocol -(`ENTER` / `LEAVE`) works correctly from it. +(`ENTER` / `ERET`) works correctly from it. `p1_main` may return normally, or it may call `sys_exit` directly at any point. 
diff --git a/m1pp/m1pp.M1 b/m1pp/m1pp.M1 @@ -20,7 +20,7 @@ ## without emitting output. ## ## P1v2 ABI: a0..a3 arg/return, t0..t2 caller-saved temps, s0..s3 callee-saved -## (unused here). Non-leaf functions use enter_0 / leave. _start has no frame; +## (unused here). Non-leaf functions use enter_0 / eret. _start has no frame; ## the kernel-supplied SP carries argv/argc directly. ## --- Constants & sizing ------------------------------------------------------ @@ -114,13 +114,13 @@ DEFINE EXPR_INVALID 1200000000000000 :_start # if (argc < 3) usage - ld_a0,sp,0 + ld_a0,sp,neg16 li_a1 %3 %0 la_br &err_usage blt_a0,a1 # output_path = argv[2] - ld_t0,sp,24 + ld_t0,sp,8 la_a0 &output_path st_t0,a0,0 @@ -140,7 +140,7 @@ DEFINE EXPR_INVALID 1200000000000000 # input_fd = openat(AT_FDCWD, argv[1], O_RDONLY, 0) li_a0 sys_openat li_a1 AT_FDCWD - ld_a2,sp,16 + ld_a2,sp,0 li_a3 %0 %0 li_t0 %0 %0 syscall @@ -719,8 +719,7 @@ DEFINE EXPR_INVALID 1200000000000000 b :lex_done - leave - ret + eret ## --- Output: normalized token stream to output_buf --------------------------- ## emit_newline writes '\n' and clears output_need_space. @@ -855,7 +854,7 @@ DEFINE EXPR_INVALID 1200000000000000 call la_br &proc_done beqz_a0 - st_a0,sp,16 + st_a0,sp,0 # if (s->pos == s->end) pop and continue ld_t0,a0,16 @@ -864,7 +863,7 @@ DEFINE EXPR_INVALID 1200000000000000 beq_t0,t1 # tok = s->pos - st_t0,sp,24 + st_t0,sp,8 # ---- line_start && tok->kind == TOK_WORD && tok eq "%macro" ---- ld_a1,a0,24 @@ -888,7 +887,7 @@ DEFINE EXPR_INVALID 1200000000000000 # holds in practice (line_start in expansion streams is cleared # before any %macro could matter). After it returns we copy # proc_pos back into s->pos and set s->line_start = 1. 
- ld_t0,sp,24 + ld_t0,sp,8 la_a0 &proc_pos st_t0,a0,0 la_a0 &proc_line_start @@ -896,7 +895,7 @@ DEFINE EXPR_INVALID 1200000000000000 st_a1,a0,0 la_br &define_macro call - ld_a0,sp,16 + ld_a0,sp,0 la_a1 &proc_pos ld_t0,a1,0 st_t0,a0,16 @@ -909,7 +908,7 @@ DEFINE EXPR_INVALID 1200000000000000 ## The %macro guard above already proved line_start && kind == TOK_WORD; if ## we reach here via a %macro non-match, those gates still hold. :proc_check_struct - ld_t0,sp,24 + ld_t0,sp,8 mov_a0,t0 la_a1 &const_struct li_a2 %7 %0 @@ -919,7 +918,7 @@ DEFINE EXPR_INVALID 1200000000000000 beqz_a0 # %struct matched: shim into define_fielded(stride=8, total="SIZE", len=4) - ld_t0,sp,24 + ld_t0,sp,8 la_a0 &proc_pos st_t0,a0,0 la_a0 &proc_line_start @@ -930,7 +929,7 @@ DEFINE EXPR_INVALID 1200000000000000 li_a2 %4 %0 la_br &define_fielded call - ld_a0,sp,16 + ld_a0,sp,0 la_a1 &proc_pos ld_t0,a1,0 st_t0,a0,16 @@ -941,7 +940,7 @@ DEFINE EXPR_INVALID 1200000000000000 ## ---- line_start && tok eq "%enum" ---- :proc_check_enum - ld_t0,sp,24 + ld_t0,sp,8 mov_a0,t0 la_a1 &const_enum li_a2 %5 %0 @@ -951,7 +950,7 @@ DEFINE EXPR_INVALID 1200000000000000 beqz_a0 # %enum matched: shim into define_fielded(stride=1, total="COUNT", len=5) - ld_t0,sp,24 + ld_t0,sp,8 la_a0 &proc_pos st_t0,a0,0 la_a0 &proc_line_start @@ -962,7 +961,7 @@ DEFINE EXPR_INVALID 1200000000000000 li_a2 %5 %0 la_br &define_fielded call - ld_a0,sp,16 + ld_a0,sp,0 la_a1 &proc_pos ld_t0,a1,0 st_t0,a0,16 @@ -973,8 +972,8 @@ DEFINE EXPR_INVALID 1200000000000000 :proc_check_newline # reload s, tok - ld_a0,sp,16 - ld_t0,sp,24 + ld_a0,sp,0 + ld_t0,sp,8 ld_a1,t0,0 li_a2 TOK_NEWLINE la_br &proc_check_builtin @@ -992,8 +991,8 @@ DEFINE EXPR_INVALID 1200000000000000 :proc_check_builtin # tok->kind == TOK_WORD && tok+1 < s->end && (tok+1)->kind == TOK_LPAREN ? 
- ld_a0,sp,16 - ld_t0,sp,24 + ld_a0,sp,0 + ld_t0,sp,8 ld_a1,t0,0 li_a2 TOK_WORD la_br &proc_check_macro @@ -1018,35 +1017,35 @@ DEFINE EXPR_INVALID 1200000000000000 call la_br &proc_do_builtin bnez_a0 - ld_a0,sp,24 + ld_a0,sp,8 la_a1 &const_at li_a2 %1 %0 la_br &tok_eq_const call la_br &proc_do_builtin bnez_a0 - ld_a0,sp,24 + ld_a0,sp,8 la_a1 &const_pct li_a2 %1 %0 la_br &tok_eq_const call la_br &proc_do_builtin bnez_a0 - ld_a0,sp,24 + ld_a0,sp,8 la_a1 &const_dlr li_a2 %1 %0 la_br &tok_eq_const call la_br &proc_do_builtin bnez_a0 - ld_a0,sp,24 + ld_a0,sp,8 la_a1 &const_select li_a2 %7 %0 la_br &tok_eq_const call la_br &proc_do_builtin bnez_a0 - ld_a0,sp,24 + ld_a0,sp,8 la_a1 &const_str li_a2 %4 %0 la_br &tok_eq_const @@ -1058,8 +1057,8 @@ DEFINE EXPR_INVALID 1200000000000000 :proc_do_builtin # expand_builtin_call(s, tok) - ld_a0,sp,16 - ld_a1,sp,24 + ld_a0,sp,0 + ld_a1,sp,8 la_br &expand_builtin_call call la_br &proc_loop @@ -1069,14 +1068,14 @@ DEFINE EXPR_INVALID 1200000000000000 # macro = find_macro(tok); if non-zero AND # ((tok+1 < s->end AND (tok+1)->kind == TOK_LPAREN) OR macro->param_count == 0) # then expand_call. (§4 paren-less 0-arg calls.) 
- ld_a0,sp,24 + ld_a0,sp,8 la_br &find_macro call la_br &proc_emit beqz_a0 mov_t2,a0 - ld_a0,sp,16 - ld_t0,sp,24 + ld_a0,sp,0 + ld_t0,sp,8 addi_t1,t0,24 ld_a1,a0,8 la_br &proc_macro_has_next @@ -1088,7 +1087,7 @@ DEFINE EXPR_INVALID 1200000000000000 li_a2 TOK_LPAREN la_br &proc_macro_zero_arg bne_a1,a2 - ld_a0,sp,16 + ld_a0,sp,0 mov_a1,t2 la_br &expand_call call @@ -1099,7 +1098,7 @@ DEFINE EXPR_INVALID 1200000000000000 ld_t0,t2,16 la_br &proc_emit bnez_t0 - ld_a0,sp,16 + ld_a0,sp,0 mov_a1,t2 la_br &expand_call call @@ -1108,10 +1107,10 @@ DEFINE EXPR_INVALID 1200000000000000 :proc_emit # emit_token(tok); s->pos += 24; s->line_start = 0 - ld_a0,sp,24 + ld_a0,sp,8 la_br &emit_token call - ld_a0,sp,16 + ld_a0,sp,0 ld_t0,a0,16 addi_t0,t0,24 st_t0,a0,16 @@ -1127,13 +1126,12 @@ DEFINE EXPR_INVALID 1200000000000000 b :proc_done - leave - ret + eret ## --- %macro storage: parse header + body into macros[] / macro_body_tokens -- ## Called at proc_pos == line-start `%macro`. Leaves proc_pos past the %endm ## line with proc_line_start = 1. Uses BSS scratch (def_m_ptr, def_param_ptr, -## def_body_line_start) since P1v2 enter/leave does not save s* registers. +## def_body_line_start) since P1v2 enter/eret does not save s* registers. ## ## Macro record layout (296 bytes, see M1PP_MACRO_RECORD_SIZE): ## +0 name.ptr (8) @@ -1447,8 +1445,7 @@ DEFINE EXPR_INVALID 1200000000000000 la_a0 &proc_line_start li_a1 %1 %0 st_a1,a0,0 - leave - ret + eret ## --- %struct / %enum directive ---------------------------------------------- ## define_fielded(a0=stride, a1=total_name_ptr, a2=total_name_len). @@ -1653,8 +1650,7 @@ DEFINE EXPR_INVALID 1200000000000000 la_a0 &proc_line_start li_a1 %1 %0 st_a1,a0,0 - leave - ret + eret ## df_emit_field(): read df_base_*, df_suffix_*, df_value from BSS; synthesize ## one macro record + one body token. 
Builds the "NAME.field" identifier in @@ -1798,8 +1794,7 @@ DEFINE EXPR_INVALID 1200000000000000 la_a1 &macros_end st_t2,a1,0 - leave - ret + eret ## df_render_decimal(): reads df_value; writes a reverse-filled decimal ## rendering into df_digit_scratch[cursor..end) and stores df_digit_count + @@ -1992,8 +1987,7 @@ DEFINE EXPR_INVALID 1200000000000000 la_br &push_stream_span call :ppsfm_done - leave - ret + eret ## ============================================================================ ## --- Argument parsing ------------------------------------------------------- @@ -2480,15 +2474,15 @@ DEFINE EXPR_INVALID 1200000000000000 la_br &err_bad_macro_header beq_a0,a1 # spill a0/a1 so arg_is_braced can clobber regs - st_a0,sp,16 - st_a1,sp,24 + st_a0,sp,0 + st_a1,sp,8 la_br &arg_is_braced call la_br &catp_plain beqz_a0 # braced: strip outer braces (start+24, end-24) - ld_a0,sp,16 - ld_a1,sp,24 + ld_a0,sp,0 + ld_a1,sp,8 addi_a0,a0,24 addi_a1,a1,neg24 la_br &catp_done @@ -2498,13 +2492,12 @@ DEFINE EXPR_INVALID 1200000000000000 la_br &catp_done b :catp_plain - ld_a0,sp,16 - ld_a1,sp,24 + ld_a0,sp,0 + ld_a1,sp,8 la_br &copy_span_to_pool call :catp_done - leave - ret + eret ## copy_paste_arg_to_pool(a0=arg_start, a1=arg_end) -> void (fatal unless len 1) ## Enforces the single-token-argument rule for params adjacent to ##. @@ -2512,14 +2505,14 @@ DEFINE EXPR_INVALID 1200000000000000 :copy_paste_arg_to_pool enter_16 # spill a0/a1 for the arg_is_braced call - st_a0,sp,16 - st_a1,sp,24 + st_a0,sp,0 + st_a1,sp,8 la_br &arg_is_braced call la_br &err_bad_macro_header bnez_a0 - ld_a0,sp,16 - ld_a1,sp,24 + ld_a0,sp,0 + ld_a1,sp,8 # if ((arg_end - arg_start) != 24) fatal sub_a2,a1,a0 li_a3 M1PP_TOK_SIZE @@ -2527,8 +2520,7 @@ DEFINE EXPR_INVALID 1200000000000000 bne_a2,a3 la_br &copy_span_to_pool call - leave - ret + eret ## expand_macro_tokens(a0=call_tok, a1=limit, a2=macro_ptr) -> void (fatal on bad) ## Requires call_tok+1 is TOK_LPAREN. 
Runs parse_args(call_tok+1, limit), @@ -3027,8 +3019,7 @@ DEFINE EXPR_INVALID 1200000000000000 la_br &paste_pool_range call - leave - ret + eret ## expand_call(a0=stream_ptr, a1=macro_ptr) -> void (fatal on bad call) ## Calls expand_macro_tokens for the call at stream->pos, sets @@ -3039,7 +3030,7 @@ DEFINE EXPR_INVALID 1200000000000000 # spill stream_ptr to local frame slot (sp+16 is the first local; sp+0/+8 # hold the saved return address and saved caller sp). - st_a0,sp,16 + st_a0,sp,0 # expand_macro_tokens(stream->pos, stream->end, macro) # stream->pos at +16, stream->end at +8 @@ -3052,7 +3043,7 @@ DEFINE EXPR_INVALID 1200000000000000 call # stream->pos = emt_after_pos - ld_a0,sp,16 + ld_a0,sp,0 la_a1 &emt_after_pos ld_t0,a1,0 st_t0,a0,16 @@ -3067,8 +3058,7 @@ DEFINE EXPR_INVALID 1200000000000000 la_br &push_pool_stream_from_mark call - leave - ret + eret ## ============================================================================ ## --- ## token paste compaction ---------------------------------------------- @@ -3171,8 +3161,7 @@ DEFINE EXPR_INVALID 1200000000000000 ld_a1,a1,0 st_a1,t0,16 - leave - ret + eret ## paste_pool_range(a0=mark) -> void (fatal on bad paste) ## In-place compactor over expand_pool[mark..pool_used). 
For each TOK_PASTE, @@ -3311,8 +3300,7 @@ DEFINE EXPR_INVALID 1200000000000000 sub_t0,t0,a1 la_a1 &pool_used st_t0,a1,0 - leave - ret + eret ## ============================================================================ ## --- Integer atoms + S-expression evaluator --------------------------------- @@ -3678,80 +3666,61 @@ DEFINE EXPR_INVALID 1200000000000000 :eoc_invalid li_a0 EXPR_INVALID - leave - ret + eret :eoc_add li_a0 EXPR_ADD - leave - ret + eret :eoc_sub li_a0 EXPR_SUB - leave - ret + eret :eoc_mul li_a0 EXPR_MUL - leave - ret + eret :eoc_div li_a0 EXPR_DIV - leave - ret + eret :eoc_mod li_a0 EXPR_MOD - leave - ret + eret :eoc_shl li_a0 EXPR_SHL - leave - ret + eret :eoc_shr li_a0 EXPR_SHR - leave - ret + eret :eoc_and li_a0 EXPR_AND - leave - ret + eret :eoc_or li_a0 EXPR_OR - leave - ret + eret :eoc_xor li_a0 EXPR_XOR - leave - ret + eret :eoc_not li_a0 EXPR_NOT - leave - ret + eret :eoc_eq li_a0 EXPR_EQ - leave - ret + eret :eoc_ne li_a0 EXPR_NE - leave - ret + eret :eoc_lt li_a0 EXPR_LT - leave - ret + eret :eoc_le li_a0 EXPR_LE - leave - ret + eret :eoc_gt li_a0 EXPR_GT - leave - ret + eret :eoc_ge li_a0 EXPR_GE - leave - ret + eret :eoc_strlen li_a0 EXPR_STRLEN - leave - ret + eret ## apply_expr_op(a0=op_code, a1=args_ptr, a2=argc) -> a0 = i64 result ## Reduce args[0..argc) per op: @@ -4209,8 +4178,7 @@ DEFINE EXPR_INVALID 1200000000000000 :aeo_finish la_a0 &aeo_acc ld_a0,a0,0 - leave - ret + eret ## helper: validate argc >= 1; fatal otherwise. (Returns to caller.) 
:aeo_require_argc_ge1 @@ -4286,13 +4254,13 @@ DEFINE EXPR_INVALID 1200000000000000 ## sp+48 saved emt_mark :eval_expr_atom enter_40 - st_a0,sp,16 - st_a1,sp,24 + st_a0,sp,0 + st_a1,sp,8 # macro_ptr = find_macro(tok) la_br &find_macro call - st_a0,sp,32 + st_a0,sp,16 # if (macro_ptr == 0) -> integer atom branch la_br &eea_int_atom @@ -4301,9 +4269,9 @@ DEFINE EXPR_INVALID 1200000000000000 # §4 paren-less 0-arg atom: # Take the macro-call branch if (tok+1 < limit AND (tok+1)->kind == TOK_LPAREN) # OR macro->param_count == 0. Otherwise fall through to int atom (unchanged). - ld_t0,sp,16 + ld_t0,sp,0 addi_t0,t0,24 - ld_t1,sp,24 + ld_t1,sp,8 la_br &eea_check_zero_arg blt_t1,t0 la_br &eea_check_zero_arg @@ -4317,7 +4285,7 @@ DEFINE EXPR_INVALID 1200000000000000 :eea_check_zero_arg # No trailing LPAREN. Take the macro branch only if param_count == 0. - ld_t0,sp,32 + ld_t0,sp,16 ld_t1,t0,16 la_br &eea_int_atom bnez_t1 @@ -4325,30 +4293,30 @@ DEFINE EXPR_INVALID 1200000000000000 :eea_do_macro # Macro call branch: # expand_macro_tokens(tok, limit, macro_ptr) - ld_a0,sp,16 - ld_a1,sp,24 - ld_a2,sp,32 + ld_a0,sp,0 + ld_a1,sp,8 + ld_a2,sp,16 la_br &expand_macro_tokens call # Snapshot emt outputs immediately. la_a0 &emt_after_pos ld_t0,a0,0 - st_t0,sp,40 + st_t0,sp,24 la_a0 &emt_mark ld_t0,a0,0 - st_t0,sp,48 + st_t0,sp,32 # If pool was not extended (pool_used == mark) -> bad expression. 
la_a0 &pool_used ld_t0,a0,0 - ld_t1,sp,48 + ld_t1,sp,32 la_br &err_bad_macro_header beq_t0,t1 # eval_expr_range(expand_pool + mark, expand_pool + pool_used) la_a0 &expand_pool - ld_t1,sp,48 + ld_t1,sp,32 add_a0,a0,t1 la_a1 &expand_pool la_a2 &pool_used @@ -4363,33 +4331,31 @@ DEFINE EXPR_INVALID 1200000000000000 # restore pool_used = mark la_a0 &pool_used - ld_t0,sp,48 + ld_t0,sp,32 st_t0,a0,0 # eval_after_pos = saved emt_after_pos la_a0 &eval_after_pos - ld_t0,sp,40 + ld_t0,sp,24 st_t0,a0,0 - leave - ret + eret :eea_int_atom # parse_int_token(tok) -> i64 - ld_a0,sp,16 + ld_a0,sp,0 la_br &parse_int_token call la_a1 &eval_value st_a0,a1,0 # eval_after_pos = tok + 24 - ld_t0,sp,16 + ld_t0,sp,0 addi_t0,t0,24 la_a0 &eval_after_pos st_t0,a0,0 - leave - ret + eret ## eval_expr_range(a0=start_tok, a1=end_tok) -> a0 = i64 result (fatal on bad) ## Main S-expression evaluator loop, driven by the explicit ExprFrame stack @@ -4412,28 +4378,28 @@ DEFINE EXPR_INVALID 1200000000000000 ## used as the local base for stack checks) :eval_expr_range enter_56 - st_a0,sp,16 - st_a1,sp,24 + st_a0,sp,0 + st_a1,sp,8 li_t0 %0 %0 + st_t0,sp,16 + st_t0,sp,24 st_t0,sp,32 st_t0,sp,40 - st_t0,sp,48 - st_t0,sp,56 # entry_frame_top = expr_frame_top la_a0 &expr_frame_top ld_t0,a0,0 - st_t0,sp,64 + st_t0,sp,48 :eer_loop # If have_value, deliver it. - ld_t0,sp,48 + ld_t0,sp,32 la_br &eer_no_have_value beqz_t0 # have_value: feed into top frame, or set result. la_a0 &expr_frame_top ld_t0,a0,0 - ld_t1,sp,64 + ld_t1,sp,48 la_br &eer_set_result beq_t0,t1 # frame = &expr_frames[frame_top - 1] @@ -4456,42 +4422,42 @@ DEFINE EXPR_INVALID 1200000000000000 add_a3,a0,a2 shli_a2,t1,3 add_a3,a3,a2 - ld_t2,sp,32 + ld_t2,sp,16 st_t2,a3,0 # frame->argc++ addi_t1,t1,1 st_t1,a1,0 # have_value = 0 li_t0 %0 %0 - st_t0,sp,48 + st_t0,sp,32 la_br &eer_loop b :eer_set_result # No frame open; this value is the top-level result. 
- ld_t0,sp,56 + ld_t0,sp,40 la_br &err_bad_macro_header bnez_t0 - ld_t0,sp,32 - st_t0,sp,40 + ld_t0,sp,16 + st_t0,sp,24 li_t0 %1 %0 - st_t0,sp,56 + st_t0,sp,40 li_t0 %0 %0 - st_t0,sp,48 + st_t0,sp,32 la_br &eer_loop b :eer_no_have_value # skip_expr_newlines(pos, end) - ld_a0,sp,16 - ld_a1,sp,24 + ld_a0,sp,0 + ld_a1,sp,8 la_br &skip_expr_newlines call - st_a0,sp,16 + st_a0,sp,0 # if (pos >= end) break - ld_t0,sp,16 - ld_t1,sp,24 + ld_t0,sp,0 + ld_t1,sp,8 la_br &eer_loop_done beq_t0,t1 @@ -4505,38 +4471,38 @@ DEFINE EXPR_INVALID 1200000000000000 beq_t2,a3 # atom: eval_expr_atom(pos, end); value = eval_value; pos = eval_after_pos - ld_a0,sp,16 - ld_a1,sp,24 + ld_a0,sp,0 + ld_a1,sp,8 la_br &eval_expr_atom call la_a0 &eval_value ld_t0,a0,0 - st_t0,sp,32 + st_t0,sp,16 la_a0 &eval_after_pos ld_t0,a0,0 - st_t0,sp,16 + st_t0,sp,0 li_t0 %1 %0 - st_t0,sp,48 + st_t0,sp,32 la_br &eer_loop b :eer_lparen # pos++ addi_t0,t0,24 - st_t0,sp,16 + st_t0,sp,0 # skip_expr_newlines - ld_a0,sp,16 - ld_a1,sp,24 + ld_a0,sp,0 + ld_a1,sp,8 la_br &skip_expr_newlines call - st_a0,sp,16 + st_a0,sp,0 # if (pos >= end) fatal - ld_t0,sp,16 - ld_t1,sp,24 + ld_t0,sp,0 + ld_t1,sp,8 la_br &err_bad_macro_header beq_t0,t1 # op = expr_op_code(pos) - ld_a0,sp,16 + ld_a0,sp,0 la_br &expr_op_code call # if (op == EXPR_INVALID) fatal @@ -4572,9 +4538,9 @@ DEFINE EXPR_INVALID 1200000000000000 addi_t0,t0,1 st_t0,a1,0 # pos++ (skip operator token) - ld_t0,sp,16 + ld_t0,sp,0 addi_t0,t0,24 - st_t0,sp,16 + st_t0,sp,0 la_br &eer_loop b @@ -4582,7 +4548,7 @@ DEFINE EXPR_INVALID 1200000000000000 # if (frame_top <= entry_frame_top) fatal la_a0 &expr_frame_top ld_t0,a0,0 - ld_t1,sp,64 + ld_t1,sp,48 la_br &err_bad_macro_header beq_t0,t1 la_br &err_bad_macro_header @@ -4603,16 +4569,16 @@ DEFINE EXPR_INVALID 1200000000000000 la_br &apply_expr_op call # value = result; frame_top--; pos++; have_value = 1 - st_a0,sp,32 + st_a0,sp,16 la_a1 &expr_frame_top ld_t0,a1,0 addi_t0,t0,neg1 st_t0,a1,0 - ld_t0,sp,16 + ld_t0,sp,0 
addi_t0,t0,24 - st_t0,sp,16 + st_t0,sp,0 li_t0 %1 %0 - st_t0,sp,48 + st_t0,sp,32 la_br &eer_loop b @@ -4620,18 +4586,18 @@ DEFINE EXPR_INVALID 1200000000000000 # (strlen "literal") — degenerate unary op whose argument is a # TOK_STRING atom, not a recursive expression. # pos++ past the "strlen" operator word. - ld_t0,sp,16 + ld_t0,sp,0 addi_t0,t0,24 - st_t0,sp,16 + st_t0,sp,0 # skip_expr_newlines(pos, end) - ld_a0,sp,16 - ld_a1,sp,24 + ld_a0,sp,0 + ld_a1,sp,8 la_br &skip_expr_newlines call - st_a0,sp,16 + st_a0,sp,0 # if (pos >= end) fatal - ld_t0,sp,16 - ld_t1,sp,24 + ld_t0,sp,0 + ld_t1,sp,8 la_br &err_bad_macro_header beq_t0,t1 # if (pos->kind != TOK_STRING) fatal @@ -4652,19 +4618,19 @@ DEFINE EXPR_INVALID 1200000000000000 bne_a3,a0 # value = pos->text.len - 2 addi_a1,a1,neg2 - st_a1,sp,32 + st_a1,sp,16 # pos++ addi_t0,t0,24 - st_t0,sp,16 + st_t0,sp,0 # skip_expr_newlines(pos, end) - ld_a0,sp,16 - ld_a1,sp,24 + ld_a0,sp,0 + ld_a1,sp,8 la_br &skip_expr_newlines call - st_a0,sp,16 + st_a0,sp,0 # if (pos >= end) fatal - ld_t0,sp,16 - ld_t1,sp,24 + ld_t0,sp,0 + ld_t1,sp,8 la_br &err_bad_macro_header beq_t0,t1 # if (pos->kind != TOK_RPAREN) fatal @@ -4674,10 +4640,10 @@ DEFINE EXPR_INVALID 1200000000000000 bne_t2,a3 # pos++ addi_t0,t0,24 - st_t0,sp,16 + st_t0,sp,0 # have_value = 1 li_t0 %1 %0 - st_t0,sp,48 + st_t0,sp,32 la_br &eer_loop b @@ -4685,22 +4651,21 @@ DEFINE EXPR_INVALID 1200000000000000 # frame_top must equal entry_frame_top la_a0 &expr_frame_top ld_t0,a0,0 - ld_t1,sp,64 + ld_t1,sp,48 la_br &err_bad_macro_header bne_t0,t1 # have_result must be 1 - ld_t0,sp,56 + ld_t0,sp,40 la_br &err_bad_macro_header beqz_t0 # pos must equal end - ld_t0,sp,16 - ld_t1,sp,24 + ld_t0,sp,0 + ld_t1,sp,8 la_br &err_bad_macro_header bne_t0,t1 # return result - ld_a0,sp,40 - leave - ret + ld_a0,sp,24 + eret ## ============================================================================ ## --- Hex emit for !@%$ ------------------------------------------------------ @@ -4820,8 
+4785,7 @@ DEFINE EXPR_INVALID 1200000000000000 la_br &emit_token call - leave - ret + eret ## ============================================================================ ## --- Builtin dispatcher ( ! @ % $ %select ) --------------------------------- @@ -5030,8 +4994,7 @@ DEFINE EXPR_INVALID 1200000000000000 la_br &emit_hex_value call - leave - ret + eret :ebc_select # require arg_count == 3 @@ -5142,8 +5105,7 @@ DEFINE EXPR_INVALID 1200000000000000 call :ebc_select_done - leave - ret + eret ## %str(IDENT): stringify a single WORD argument into a TOK_STRING literal. ## Validation: arg_count == 1, arg span length == 1 token, and that token's @@ -5261,8 +5223,7 @@ DEFINE EXPR_INVALID 1200000000000000 la_br &emit_token call - leave - ret + eret ## --- Error paths ------------------------------------------------------------- ## Each err_* loads a (msg, len) pair for fatal; fatal writes "m1pp: <msg>\n" diff --git a/p1/P1-aarch64.M1pp b/p1/P1-aarch64.M1pp @@ -165,7 +165,7 @@ %select((= %aa64_is_sp(dst) 1), %aa64_add_imm(sp, src, 0), %select((= %aa64_is_sp(src) 1), - %aa64_add_imm(dst, sp, 0), + %aa64_add_imm(dst, sp, 16), %((| 0xAA000000 (<< %aa64_reg(src) 16) (<< 31 5) %aa64_reg(dst))))) %endm @@ -408,7 +408,9 @@ %endm %macro p1_mem(op, rt, rn, off) -%aa64_mem(op, rt, rn, off) +%select((= %aa64_is_sp(rn) 1), + %aa64_mem(op, rt, rn, (+ off 16)), + %aa64_mem(op, rt, rn, off)) %endm %macro p1_ldarg(rd, slot) @@ -436,19 +438,24 @@ %aa64_ret() %endm -%macro p1_leave() +%macro p1_eret() %aa64_mem(LD, lr, sp, 0) %aa64_mem(LD, x8, sp, 8) %aa64_mov_rr(sp, x8) +%aa64_ret() %endm %macro p1_tail() -%p1_leave() +%aa64_mem(LD, lr, sp, 0) +%aa64_mem(LD, x8, sp, 8) +%aa64_mov_rr(sp, x8) %aa64_br(br) %endm %macro p1_tailr(rs) -%p1_leave() +%aa64_mem(LD, lr, sp, 0) +%aa64_mem(LD, x8, sp, 8) +%aa64_mov_rr(sp, x8) %aa64_br(rs) %endm diff --git a/p1/P1-amd64.M1pp b/p1/P1-amd64.M1pp @@ -562,7 +562,50 @@ %endm %macro p1_mov(rd, rs) -%amd_mov_rr(rd, rs) +%p1_mov_##rs(rd) +%endm + +# All 
non-sp sources: plain register copy. +%macro p1_mov_a0(rd) +%amd_mov_rr(rd, a0) +%endm +%macro p1_mov_a1(rd) +%amd_mov_rr(rd, a1) +%endm +%macro p1_mov_a2(rd) +%amd_mov_rr(rd, a2) +%endm +%macro p1_mov_a3(rd) +%amd_mov_rr(rd, a3) +%endm +%macro p1_mov_t0(rd) +%amd_mov_rr(rd, t0) +%endm +%macro p1_mov_t1(rd) +%amd_mov_rr(rd, t1) +%endm +%macro p1_mov_t2(rd) +%amd_mov_rr(rd, t2) +%endm +%macro p1_mov_s0(rd) +%amd_mov_rr(rd, s0) +%endm +%macro p1_mov_s1(rd) +%amd_mov_rr(rd, s1) +%endm +%macro p1_mov_s2(rd) +%amd_mov_rr(rd, s2) +%endm +%macro p1_mov_s3(rd) +%amd_mov_rr(rd, s3) +%endm + +# sp-source: portable sp is the frame-local base, which is native rsp + 16 +# (the 16-byte backend-private frame header sits at [rsp+0..rsp+15]). +# Emit `mov rd, rsp ; add rd, 16`. +%macro p1_mov_sp(rd) +%amd_mov_rr(rd, sp) +%amd_alu_ri8(0, rd, 16) %endm %macro p1_rrr(op, rd, ra, rb) @@ -618,18 +661,176 @@ %p1_shifti_##op(rd, ra, imm) %endm +# p1_mem dispatches on (op, base). When the base is sp, portable sp is the +# frame-local base — 16 bytes above native rsp — so the physical access needs +# the supplied portable offset plus 16. For any other base, the portable and +# native offset coincide. Internal backend callers that need raw native-rsp +# access (p1_enter, p1_eret, _start stub, p1_ldarg, p1_syscall) use +# amd_mem_LD/amd_mem_ST directly and bypass this translation. 
+ +%macro p1_mem_LD_sp(rt, off) +%amd_mem_LD(rt, sp, (+ off 16)) +%endm +%macro p1_mem_ST_sp(rt, off) +%amd_mem_ST(rt, sp, (+ off 16)) +%endm +%macro p1_mem_LB_sp(rt, off) +%amd_mem_LB(rt, sp, (+ off 16)) +%endm +%macro p1_mem_SB_sp(rt, off) +%amd_mem_SB(rt, sp, (+ off 16)) +%endm + %macro p1_mem_LD(rt, rn, off) -%amd_mem_LD(rt, rn, off) +%p1_mem_LD_##rn(rt, off) %endm %macro p1_mem_ST(rt, rn, off) -%amd_mem_ST(rt, rn, off) +%p1_mem_ST_##rn(rt, off) %endm %macro p1_mem_LB(rt, rn, off) -%amd_mem_LB(rt, rn, off) +%p1_mem_LB_##rn(rt, off) %endm %macro p1_mem_SB(rt, rn, off) -%amd_mem_SB(rt, rn, off) +%p1_mem_SB_##rn(rt, off) +%endm + +# Non-sp bases for each op -- plain native load/store with portable offset. +%macro p1_mem_LD_a0(rt, off) +%amd_mem_LD(rt, a0, off) +%endm +%macro p1_mem_LD_a1(rt, off) +%amd_mem_LD(rt, a1, off) +%endm +%macro p1_mem_LD_a2(rt, off) +%amd_mem_LD(rt, a2, off) +%endm +%macro p1_mem_LD_a3(rt, off) +%amd_mem_LD(rt, a3, off) +%endm +%macro p1_mem_LD_t0(rt, off) +%amd_mem_LD(rt, t0, off) +%endm +%macro p1_mem_LD_t1(rt, off) +%amd_mem_LD(rt, t1, off) +%endm +%macro p1_mem_LD_t2(rt, off) +%amd_mem_LD(rt, t2, off) +%endm +%macro p1_mem_LD_s0(rt, off) +%amd_mem_LD(rt, s0, off) +%endm +%macro p1_mem_LD_s1(rt, off) +%amd_mem_LD(rt, s1, off) +%endm +%macro p1_mem_LD_s2(rt, off) +%amd_mem_LD(rt, s2, off) +%endm +%macro p1_mem_LD_s3(rt, off) +%amd_mem_LD(rt, s3, off) +%endm + +%macro p1_mem_ST_a0(rt, off) +%amd_mem_ST(rt, a0, off) +%endm +%macro p1_mem_ST_a1(rt, off) +%amd_mem_ST(rt, a1, off) +%endm +%macro p1_mem_ST_a2(rt, off) +%amd_mem_ST(rt, a2, off) +%endm +%macro p1_mem_ST_a3(rt, off) +%amd_mem_ST(rt, a3, off) +%endm +%macro p1_mem_ST_t0(rt, off) +%amd_mem_ST(rt, t0, off) +%endm +%macro p1_mem_ST_t1(rt, off) +%amd_mem_ST(rt, t1, off) %endm +%macro p1_mem_ST_t2(rt, off) +%amd_mem_ST(rt, t2, off) +%endm +%macro p1_mem_ST_s0(rt, off) +%amd_mem_ST(rt, s0, off) +%endm +%macro p1_mem_ST_s1(rt, off) +%amd_mem_ST(rt, s1, off) +%endm +%macro 
p1_mem_ST_s2(rt, off) +%amd_mem_ST(rt, s2, off) +%endm +%macro p1_mem_ST_s3(rt, off) +%amd_mem_ST(rt, s3, off) +%endm + +%macro p1_mem_LB_a0(rt, off) +%amd_mem_LB(rt, a0, off) +%endm +%macro p1_mem_LB_a1(rt, off) +%amd_mem_LB(rt, a1, off) +%endm +%macro p1_mem_LB_a2(rt, off) +%amd_mem_LB(rt, a2, off) +%endm +%macro p1_mem_LB_a3(rt, off) +%amd_mem_LB(rt, a3, off) +%endm +%macro p1_mem_LB_t0(rt, off) +%amd_mem_LB(rt, t0, off) +%endm +%macro p1_mem_LB_t1(rt, off) +%amd_mem_LB(rt, t1, off) +%endm +%macro p1_mem_LB_t2(rt, off) +%amd_mem_LB(rt, t2, off) +%endm +%macro p1_mem_LB_s0(rt, off) +%amd_mem_LB(rt, s0, off) +%endm +%macro p1_mem_LB_s1(rt, off) +%amd_mem_LB(rt, s1, off) +%endm +%macro p1_mem_LB_s2(rt, off) +%amd_mem_LB(rt, s2, off) +%endm +%macro p1_mem_LB_s3(rt, off) +%amd_mem_LB(rt, s3, off) +%endm + +%macro p1_mem_SB_a0(rt, off) +%amd_mem_SB(rt, a0, off) +%endm +%macro p1_mem_SB_a1(rt, off) +%amd_mem_SB(rt, a1, off) +%endm +%macro p1_mem_SB_a2(rt, off) +%amd_mem_SB(rt, a2, off) +%endm +%macro p1_mem_SB_a3(rt, off) +%amd_mem_SB(rt, a3, off) +%endm +%macro p1_mem_SB_t0(rt, off) +%amd_mem_SB(rt, t0, off) +%endm +%macro p1_mem_SB_t1(rt, off) +%amd_mem_SB(rt, t1, off) +%endm +%macro p1_mem_SB_t2(rt, off) +%amd_mem_SB(rt, t2, off) +%endm +%macro p1_mem_SB_s0(rt, off) +%amd_mem_SB(rt, s0, off) +%endm +%macro p1_mem_SB_s1(rt, off) +%amd_mem_SB(rt, s1, off) +%endm +%macro p1_mem_SB_s2(rt, off) +%amd_mem_SB(rt, s2, off) +%endm +%macro p1_mem_SB_s3(rt, off) +%amd_mem_SB(rt, s3, off) +%endm + %macro p1_mem(op, rt, rn, off) %p1_mem_##op(rt, rn, off) %endm @@ -659,25 +860,38 @@ %amd_ret() %endm -# LEAVE -# r9 = [sp + 0] -- retaddr into scratch -# rax = [sp + 8] -- saved caller sp into rax (an unused native reg) -# sp = rax -- unwind to caller sp -# push r9 -- reinstall retaddr so RET returns correctly -%macro p1_leave() +# ERET -- atomic frame epilogue + return from a framed function. 
+# r9 = [rsp + 0] -- retaddr into scratch (native rsp; backend-private) +# rax = [rsp + 8] -- saved caller sp into rax (an unused native reg) +# rsp = rax -- unwind to caller sp +# push r9 -- reinstall retaddr so the trailing ret returns +# correctly +# ret -- pop reinstated retaddr into rip +%macro p1_eret() %amd_mem_LD(scratch, sp, 0) %amd_mem_LD(rax, sp, 8) %amd_mov_rr(sp, rax) %amd_push(scratch) +%amd_ret() %endm +# TAIL / TAILR -- frame epilogue followed by an unconditional jump to the +# target. The epilogue is the same sequence as the first four steps of +# p1_eret (we omit the trailing ret because we jmp to a fresh target +# instead). %macro p1_tail() -%p1_leave() +%amd_mem_LD(scratch, sp, 0) +%amd_mem_LD(rax, sp, 8) +%amd_mov_rr(sp, rax) +%amd_push(scratch) %amd_jmp_r(br) %endm %macro p1_tailr(rs) -%p1_leave() +%amd_mem_LD(scratch, sp, 0) +%amd_mem_LD(rax, sp, 8) +%amd_mov_rr(sp, rax) +%amd_push(scratch) %amd_jmp_r(rs) %endm diff --git a/p1/P1-riscv64.M1pp b/p1/P1-riscv64.M1pp @@ -9,7 +9,7 @@ # save0 = t4 (x29) -- transient across SYSCALL only # save1 = t3 (x28) # save2 = a6 (x16) -# saved_fp = fp (x8) -- used by ENTER/LEAVE to capture caller sp +# saved_fp = fp (x8) -- used by ENTER/ERET to capture caller sp # a7 = x17 -- Linux riscv64 syscall-number slot # a4 = x14 -- syscall arg4 slot # a5 = x15 -- syscall arg5 slot @@ -331,7 +331,9 @@ %endm %macro p1_mov(rd, rs) -%rv_mov_rr(rd, rs) +%select((= %rv_is_sp(rs) 1), + %rv_addi(rd, sp, 16), + %rv_mov_rr(rd, rs)) %endm %macro p1_rrr(op, rd, ra, rb) @@ -378,7 +380,9 @@ %rv_sb(rt, rn, off) %endm %macro p1_mem(op, rt, rn, off) -%p1_mem_##op(rt, rn, off) +%select((= %rv_is_sp(rn) 1), + %p1_mem_##op(rt, rn, (+ off 16)), + %p1_mem_##op(rt, rn, off)) %endm %macro p1_ldarg(rd, slot) @@ -406,19 +410,24 @@ %rv_jalr(zero, ra, 0) %endm -%macro p1_leave() +%macro p1_eret() %rv_ld(ra, sp, 0) %rv_ld(fp, sp, 8) %rv_mov_rr(sp, fp) +%rv_jalr(zero, ra, 0) %endm %macro p1_tail() -%p1_leave() +%rv_ld(ra, sp, 0) +%rv_ld(fp, sp, 8) 
+%rv_mov_rr(sp, fp) %rv_jalr(zero, br, 0) %endm %macro p1_tailr(rs) -%p1_leave() +%rv_ld(ra, sp, 0) +%rv_ld(fp, sp, 8) +%rv_mov_rr(sp, fp) %rv_jalr(zero, rs, 0) %endm diff --git a/p1/P1.M1pp b/p1/P1.M1pp @@ -4,7 +4,7 @@ # The backend must provide the target hooks used below: # %p1_li, %p1_la, %p1_labr, %p1_mov, %p1_rrr, %p1_addi, %p1_logi, # %p1_shifti, %p1_mem, %p1_ldarg, %p1_b, %p1_br, %p1_call, %p1_callr, -# %p1_ret, %p1_leave, %p1_tail, %p1_tailr, %p1_condb, %p1_condbz, +# %p1_ret, %p1_eret, %p1_tail, %p1_tailr, %p1_condb, %p1_condbz, # %p1_enter, %p1_syscall, and %p1_sys_*. # ---- Materialization ------------------------------------------------------ @@ -185,8 +185,8 @@ %p1_enter(size) %endm -%macro leave() -%p1_leave() +%macro eret() +%p1_eret() %endm # ---- System --------------------------------------------------------------- diff --git a/p1/aarch64.py b/p1/aarch64.py @@ -185,6 +185,18 @@ def aa_ret(): return le32(0xD65F03C0) +def aa_epilogue(): + # Frame teardown, shared by ERET, TAIL, TAILR. Loads lr and the + # saved caller sp from the hidden header at native_sp+0/+8, then + # unwinds sp. Does NOT transfer control; the caller appends an + # aa_ret / aa_br as appropriate. + return ( + aa_mem('LD', 'lr', 'sp', 0) + + aa_mem('LD', 'x8', 'sp', 8) + + aa_mov_rr('sp', 'x8') + ) + + def aa_lit64_prefix(rd): ## 64-bit literal-pool prefix for LI: ldr xN, [pc,#8]; b PC+12. ## The 8 bytes that follow in source become the literal; b skips them. @@ -219,6 +231,12 @@ def encode_labr(_arch, _row): def encode_mov(_arch, row): + # Portable `sp` is the frame-local base, which is 16 bytes above + # native sp (the backend's 2-word hidden header sits at the low end + # of each frame allocation). So reading sp into a register yields + # native_sp + 16, not native_sp itself. 
+ if row.rs == 'sp': + return aa_add_imm(row.rd, 'sp', 16, sub=False) return aa_mov_rr(row.rd, row.rs) @@ -263,7 +281,11 @@ def encode_shifti(_arch, row): def encode_mem(_arch, row): - return aa_mem(row.op, row.rt, row.rn, row.off) + # Portable sp points to the frame-local base; the 2-word hidden + # header sits at native_sp+0/+8 and is not portable-addressable. + # Shift sp-relative offsets past the header. + off = row.off + 16 if row.rn == 'sp' else row.off + return aa_mem(row.op, row.rt, row.rn, off) def encode_ldarg(_arch, row): @@ -276,8 +298,7 @@ def encode_branch_reg(_arch, row): if row.kind == 'CALLR': return aa_blr(row.rs) if row.kind == 'TAILR': - leave = encode_nullary(_arch, Nullary('LEAVE', 'LEAVE')) - return leave + aa_br(row.rs) + return aa_epilogue() + aa_br(row.rs) raise ValueError(f'unknown branch-reg kind: {row.kind}') @@ -314,15 +335,10 @@ def encode_nullary(_arch, row): return aa_blr('br') if row.kind == 'RET': return aa_ret() - if row.kind == 'LEAVE': - return ( - aa_mem('LD', 'lr', 'sp', 0) - + aa_mem('LD', 'x8', 'sp', 8) - + aa_mov_rr('sp', 'x8') - ) + if row.kind == 'ERET': + return aa_epilogue() + aa_ret() if row.kind == 'TAIL': - leave = encode_nullary(_arch, Nullary('LEAVE', 'LEAVE')) - return leave + aa_br('br') + return aa_epilogue() + aa_br('br') if row.kind == 'SYSCALL': return ''.join([ aa_mov_rr('x8', 'a0'), diff --git a/p1/p1_gen.py b/p1/p1_gen.py @@ -139,6 +139,7 @@ def rows(arch): out.append(Banner('Calls And Returns')) out.append(Nullary(name='CALL', kind='CALL')) out.append(Nullary(name='RET', kind='RET')) + out.append(Nullary(name='ERET', kind='ERET')) out.append(Nullary(name='TAIL', kind='TAIL')) for rs in P1_GPRS: out.append(BranchReg(name=f'CALLR_{rs.upper()}', kind='CALLR', rs=rs)) @@ -148,7 +149,6 @@ def rows(arch): out.append(Banner('Frame Management')) for size in ENTER_SIZES: out.append(Enter(name=f'ENTER_{size}', size=size)) - out.append(Nullary(name='LEAVE', kind='LEAVE')) out.append(Banner('System')) 
out.append(Nullary(name='SYSCALL', kind='SYSCALL')) diff --git a/post.md b/post.md @@ -239,8 +239,8 @@ Ops: - Branching: `B`, `BR`, `BEQ`, `BNE`, `BLT`, `BLTU`, `BEQZ`, `BNEZ`, `BLTZ`. Signed and unsigned less-than; `>=`, `>`, `<=` are synthesized by swapping operands or inverting branch sense. -- Calls / returns: `CALL`, `CALLR`, `RET`, `TAIL`, `TAILR`. -- Frame management: `ENTER`, `LEAVE`. +- Calls / returns: `CALL`, `CALLR`, `RET`, `ERET`, `TAIL`, `TAILR`. +- Frame management: `ENTER`. - ABI arg access: `LDARG` — reads stack-passed incoming args without hard-coding the frame layout. - System: `SYSCALL`. @@ -252,7 +252,8 @@ Calling convention: - `a0` is the one-word return register. Two-word returns use `a0`/`a1`. - `a0`-`a3` and `t0`-`t2` are caller-saved; `s0`-`s3` and `sp` are callee-saved. -- `ENTER` builds the standard frame; `LEAVE` tears it down. +- `ENTER` builds the standard frame; `ERET` tears it down and returns + (`TAIL`/`TAILR` likewise combine teardown with a jump). - Stack-passed outgoing args are staged in a dedicated frame-local area before `CALL`, so the callee finds them at a known offset from `sp`. - Wider-than-two-word returns use the usual hidden-pointer trick: caller @@ -323,8 +324,7 @@ A function call, with a helper that doubles its argument: %enter(0) %la_br() &double %call() - %leave() - %ret() + %eret() :ELF_end ``` @@ -333,7 +333,7 @@ A function call, with a helper that doubles its argument: op consumes it. `double` is a leaf and needs no frame. `p1_main` is not — it calls `double`, so it opens a standard frame with `%enter(0)` to preserve the hidden return-address state across the call, and closes it -with `%leave()` before returning. Run with `./double a b c` and the exit +with `%eret()`, which tears down the frame and returns in one step. Run with `./double a b c` and the exit status is `8` (argc=4, doubled). 
## What it cost diff --git a/tests/p1/double.P1 b/tests/p1/double.P1 @@ -2,7 +2,7 @@ # # `:double` is a leaf function that shifts its one-word argument left by # one and returns. `:p1_main` is not a leaf (it calls `double`), so it -# establishes a standard frame with %enter/%leave to preserve the hidden +# establishes a standard frame with %enter/%eret to preserve the hidden # return-address state across the call. argc arrives in a0, is handed to # double unchanged, and the doubled result comes back in a0. @@ -14,7 +14,6 @@ %enter(0) %la_br() &double %call() - %leave() - %ret() + %eret() :ELF_end diff --git a/tests/p1/p1-aliasing.P1 b/tests/p1/p1-aliasing.P1 @@ -60,8 +60,7 @@ %syscall() %li(a0) $(0) - %leave() - %ret() + %eret() # Two-byte output scratch: [0] = computed byte, [1] = newline. The space # placeholder gets overwritten by SB before the write syscall. diff --git a/tests/p1/p1-call.P1 b/tests/p1/p1-call.P1 @@ -1,4 +1,4 @@ -# tests/p1/p1-call.P1 -- exercise ENTER, LEAVE, CALL, RET, MOV, ADDI +# tests/p1/p1-call.P1 -- exercise ENTER, ERET, CALL, RET, MOV, ADDI # across a nontrivial P1 program. Calls a `write_msg` subroutine twice # and returns argc + 1 as the exit status so we also verify the argv- # aware _start stub (argc is always >= 1). @@ -22,8 +22,7 @@ # exit status = argc + 1 (so it's always >= 2). %addi(a0, s0, 1) - %leave() - %ret() + %eret() # write_msg(buf=a0, len=a1) -> void :write_msg @@ -34,8 +33,7 @@ %li(a0) %sys_write() %li(a1) $(1) %syscall() - %leave() - %ret() + %eret() :msg_a "A