boot2

Playing with the bootstrap
git clone https://git.ryansepassi.com/git/boot2.git

commit 30273099f630eb2e8f787c413fc863ad2e44edfa
parent d19a402ecc3df0f912aecf4ec2a8d400f26c5f05
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Fri, 24 Apr 2026 09:03:03 -0700

Hide P1 frame header and merge LEAVE+RET into ERET

Portable sp after ENTER points to the frame-local base; the saved
retaddr and saved caller sp become backend-private. LEAVE is dropped
as a standalone op — ERET atomically tears down the frame and returns,
mirroring TAIL/TAILR, which already bundle the epilogue. Leaf functions
still return with bare RET.
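For portable code, the visible effect is an offset shift: frame-local slots that used to sit above a two-word frame header at `sp + 2*WORD + k` are now addressed directly at `sp + k`, with the header hidden behind the backend. A minimal sketch of the remap (assuming 64-bit words; the helper names are illustrative, not part of the toolchain):

```python
WORD = 8  # P1v2-64; P1v2-32 uses WORD = 4

def old_local_offset(k):
    # Pre-commit layout: [sp+0] = saved retaddr, [sp+WORD] = saved caller sp,
    # frame-local storage begins at sp + 2*WORD.
    return 2 * WORD + k

def new_local_offset(k):
    # Post-commit layout: portable sp IS the frame-local base; the saved
    # retaddr and caller sp live in a backend-private header outside it.
    return k

# Outgoing arg word 0 moves from [sp+16] to [sp+0] on 64-bit targets,
# matching the m1pp.M1 offset rewrites in the diff below.
assert old_local_offset(0) == 16
assert new_local_offset(0) == 0
```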

Diffstat:
Mdocs/P1.md | 97+++++++++++++++++++++++++++++++++++++++----------------------------------------
Mm1pp/m1pp.M1 | 359+++++++++++++++++++++++++++++++++++--------------------------------------------
Mp1/P1-aarch64.M1pp | 17++++++++++++-----
Mp1/P1-amd64.M1pp | 240++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
Mp1/P1-riscv64.M1pp | 21+++++++++++++++------
Mp1/P1.M1pp | 6+++---
Mp1/aarch64.py | 38+++++++++++++++++++++++++++-----------
Mp1/p1_gen.py | 2+-
Mpost.md | 12++++++------
Mtests/p1/double.P1 | 5++---
Mtests/p1/p1-aliasing.P1 | 3+--
Mtests/p1/p1-call.P1 | 8+++-----
12 files changed, 505 insertions(+), 303 deletions(-)
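The amd64 `p1_eret` sequence in the diff below (load retaddr, load saved caller sp, switch stacks, re-push retaddr, `ret`) can be checked with a toy memory/register model: control must land back at the caller with the caller's sp restored. This is an illustrative model only, not backend code; the addresses are made up:

```python
def eret(mem, regs):
    """Toy model of the amd64 ERET epilogue:
    r9 = [rsp+0]; rax = [rsp+8]; rsp = rax; push r9; ret."""
    retaddr = mem[regs["rsp"] + 0]    # backend-private saved retaddr
    caller_sp = mem[regs["rsp"] + 8]  # backend-private saved caller sp
    regs["rsp"] = caller_sp           # unwind to the caller's stack
    regs["rsp"] -= 8                  # push r9: reinstall retaddr
    mem[regs["rsp"]] = retaddr
    regs["rip"] = mem[regs["rsp"]]    # ret: pop retaddr into rip
    regs["rsp"] += 8

# Hypothetical frame: header at rsp holds retaddr 0x401000, caller sp 0x1000.
mem = {0x0F80: 0x401000, 0x0F88: 0x1000}
regs = {"rsp": 0x0F80, "rip": 0}
eret(mem, regs)
assert regs["rip"] == 0x401000  # control returns to the caller
assert regs["rsp"] == 0x1000    # caller sp restored
```

TAIL/TAILR perform the same first four steps but end in a jump instead of the final `ret`, which is why the diff expands them inline rather than sharing a `p1_leave` body.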

diff --git a/docs/P1.md b/docs/P1.md @@ -47,7 +47,7 @@ So the notation in this document is descriptive rather than literal: opcodes - `BR rs`, `CALLR rs`, and `TAILR rs` mean register-specific control-flow opcodes -- `LEAVE`, `CALL`, `RET`, `TAIL`, `B`, and `SYSCALL` remain operand-free +- `ERET`, `CALL`, `RET`, `TAIL`, `B`, and `SYSCALL` remain operand-free Labels still appear in source where the toolchain supports them directly, such as `LA rd, %label` and `LA_BR %label`. @@ -176,8 +176,8 @@ those words in the first `m` words of its frame-local storage immediately before the call: ``` -[sp + 2*WORD + 0*WORD] = outgoing arg word 0 -[sp + 2*WORD + 1*WORD] = outgoing arg word 1 +[sp + 0*WORD] = outgoing arg word 0 +[sp + 1*WORD] = outgoing arg word 1 ... ``` @@ -191,29 +191,30 @@ addressed prefix available for outgoing argument staging across the call. ### Standard frame layout -Functions that need local stack storage use a standard frame layout. After -frame establishment: +Functions that need local stack storage establish a standard frame with +`ENTER size`. After frame establishment, the portable-visible frame-local +storage occupies the first `size` bytes above `sp`: ``` -[sp + 0*WORD] = saved return address -[sp + 1*WORD] = saved caller stack pointer -[sp + 2*WORD ... sp + 2*WORD + local_bytes - 1] = frame-local storage -... +[sp + 0 ... sp + size - 1] = frame-local storage ``` Frame-local storage is byte-addressed. Portable code may use it for ordinary locals, spilled callee-saved registers, and the caller-staged outgoing stack-argument words described above. -Total frame size is: +Each frame also carries backend-private per-frame state — typically the +saved return continuation, saved caller `sp`, and any padding needed to +satisfy `STACK_ALIGN`. That state is not addressable by portable source, +and the backend chooses its layout and total allocation size. 
-`round_up(STACK_ALIGN, 2*WORD_SIZE + local_bytes)` +Word sizes: -Where: +- `WORD = 8` in P1v2-64 +- `WORD = 4` in P1v2-32 -- `WORD_SIZE = 8` in P1v2-64 -- `WORD_SIZE = 4` in P1v2-32 -- `STACK_ALIGN` is target-defined and must satisfy the native call ABI +`STACK_ALIGN` is target-defined and must satisfy the native call ABI at +every call boundary. Leaf functions that need no frame-local storage may omit the frame entirely. @@ -235,8 +236,8 @@ Leaf functions that need no frame-local storage may omit the frame entirely. | Memory | `LD`, `ST`, `LB`, `SB` | | ABI access | `LDARG` | | Branching | `B`, `BR`, `BEQ`, `BNE`, `BLT`, `BLTU`, `BEQZ`, `BNEZ`, `BLTZ` | -| Calls / returns | `CALL`, `CALLR`, `RET`, `TAIL`, `TAILR` | -| Frame management | `ENTER`, `LEAVE` | +| Calls / returns | `CALL`, `CALLR`, `RET`, `ERET`, `TAIL`, `TAILR` | +| Frame management | `ENTER` | | System | `SYSCALL` | ## Immediates @@ -284,9 +285,14 @@ instruction after the `CALL`. `CALL` requires an active standard frame. the code pointer value held in `rs` and establishes the same return continuation semantics as `CALL`. -`RET` returns through the current return continuation. `RET` is valid whether -or not the current function has established a standard frame, provided any -frame established by the function has already been torn down. +`RET` returns from a leaf function through the hidden return continuation +captured at call time. `RET` is valid only when the current function has no +active standard frame. + +`ERET` returns from a function that has an active standard frame. It +performs the standard epilogue — restoring `sp` and the hidden return +continuation — and then returns to the caller. `ERET` is valid only when +the current function has an active standard frame. `TAIL` is a tail call to the target most recently loaded by `LA_BR`. It is valid only when the current function has an active standard frame. `TAIL` @@ -300,7 +306,7 @@ current function has an active standard frame. 
Because stack-passed outgoing argument words are staged in the caller's own frame-local storage, `TAIL` and `TAILR` are portable only when the tail-called callee requires no stack-passed argument words. Portable compilers must lower -other tail-call cases to an ordinary `CALL` / `RET` sequence. +other tail-call cases to an ordinary `CALL` / `ERET` sequence. Portable source must treat the return continuation as hidden machine state. It must not assume that the return address lives in any exposed register or stack @@ -309,40 +315,32 @@ establishment. ### Prologue / Epilogue -P1 v2 defines the following frame-establishment and frame-teardown operations: - -- `ENTER size` -- `LEAVE` +P1 v2 has a single frame-establishment op, `ENTER size`. Frame teardown is +not a standalone op; it is embedded in `ERET`, `TAIL`, and `TAILR`. -`ENTER size` establishes the standard frame layout with `size` bytes of -frame-local storage: +`ENTER size` establishes a standard frame with `size` bytes of frame-local +storage. After it executes: ``` -[sp + 0*WORD] = saved return address -[sp + 1*WORD] = saved caller stack pointer -[sp + 2*WORD ... sp + 2*WORD + size - 1] = frame-local storage +[sp + 0 ... sp + size - 1] = frame-local storage ``` -The total allocation size is: - -`round_up(STACK_ALIGN, 2*WORD_SIZE + size)` - -The named frame-local bytes are the usable local storage. Any additional bytes -introduced by alignment rounding are padding, not extra local bytes. - -`LEAVE` tears down the current standard frame and restores the hidden return -continuation so that a subsequent `RET` returns correctly. +Any backend-private per-frame state (saved return continuation, saved +caller `sp`, alignment padding) lives outside the portable-visible `size` +bytes. Portable source may not address it. -Because every standard frame stores the saved caller stack pointer at -`[sp + 1*WORD]`, `LEAVE` does not need to know the frame-local byte count used -by the corresponding `ENTER`. 
+`ERET`, `TAIL`, and `TAILR` each perform the standard epilogue — restoring +`sp` and the hidden return continuation — and then transfer control: `ERET` +to the caller, `TAIL` to the target in `br`, and `TAILR` to the target in +`rs`. Portable source must use one of these ops (not `RET`) to exit a +function that has established a frame. -A function may omit `ENTER` / `LEAVE` entirely if it is a leaf and needs no -standard frame. +A function may omit `ENTER` entirely if it is a leaf and needs no frame. +Such a function exits with `RET`. -`ENTER` and `LEAVE` do not implicitly save or restore `s0` or `s1`. A -function that modifies `s0` or `s1` must preserve them explicitly, typically by -storing them in frame-local storage within its standard frame. +`ENTER` does not implicitly save or restore `s0`-`s3`. A function that +modifies any callee-saved register must preserve it explicitly, typically +by storing it in frame-local storage within its standard frame. ### Branching @@ -419,7 +417,8 @@ register. Portable source may also read the current stack pointer through `MOV rd, sp`. Portable source may not write `sp` through `MOV`. Stack-pointer updates are only -performed by `ENTER`, `LEAVE`, and backend-private call/return machinery. +performed by `ENTER`, `ERET`, `TAIL`, `TAILR`, and backend-private call/return +machinery. `LI` materializes an integer bit-pattern. `LA` materializes the address of a label. `LA_BR` is a separate control-flow-target materialization form and is not @@ -483,7 +482,7 @@ At entry to `p1_main`, the native entry-stack layout has already been consumed by the backend stub. Portable source may not assume anything about the `sp` value inherited from `_start` except that it satisfies the call-boundary alignment rule and that the standard frame protocol -(`ENTER` / `LEAVE`) works correctly from it. +(`ENTER` / `ERET`) works correctly from it. `p1_main` may return normally, or it may call `sys_exit` directly at any point. 
diff --git a/m1pp/m1pp.M1 b/m1pp/m1pp.M1 @@ -20,7 +20,7 @@ ## without emitting output. ## ## P1v2 ABI: a0..a3 arg/return, t0..t2 caller-saved temps, s0..s3 callee-saved -## (unused here). Non-leaf functions use enter_0 / leave. _start has no frame; +## (unused here). Non-leaf functions use enter_0 / eret. _start has no frame; ## the kernel-supplied SP carries argv/argc directly. ## --- Constants & sizing ------------------------------------------------------ @@ -114,13 +114,13 @@ DEFINE EXPR_INVALID 1200000000000000 :_start # if (argc < 3) usage - ld_a0,sp,0 + ld_a0,sp,neg16 li_a1 %3 %0 la_br &err_usage blt_a0,a1 # output_path = argv[2] - ld_t0,sp,24 + ld_t0,sp,8 la_a0 &output_path st_t0,a0,0 @@ -140,7 +140,7 @@ DEFINE EXPR_INVALID 1200000000000000 # input_fd = openat(AT_FDCWD, argv[1], O_RDONLY, 0) li_a0 sys_openat li_a1 AT_FDCWD - ld_a2,sp,16 + ld_a2,sp,0 li_a3 %0 %0 li_t0 %0 %0 syscall @@ -719,8 +719,7 @@ DEFINE EXPR_INVALID 1200000000000000 b :lex_done - leave - ret + eret ## --- Output: normalized token stream to output_buf --------------------------- ## emit_newline writes '\n' and clears output_need_space. @@ -855,7 +854,7 @@ DEFINE EXPR_INVALID 1200000000000000 call la_br &proc_done beqz_a0 - st_a0,sp,16 + st_a0,sp,0 # if (s->pos == s->end) pop and continue ld_t0,a0,16 @@ -864,7 +863,7 @@ DEFINE EXPR_INVALID 1200000000000000 beq_t0,t1 # tok = s->pos - st_t0,sp,24 + st_t0,sp,8 # ---- line_start && tok->kind == TOK_WORD && tok eq "%macro" ---- ld_a1,a0,24 @@ -888,7 +887,7 @@ DEFINE EXPR_INVALID 1200000000000000 # holds in practice (line_start in expansion streams is cleared # before any %macro could matter). After it returns we copy # proc_pos back into s->pos and set s->line_start = 1. 
- ld_t0,sp,24 + ld_t0,sp,8 la_a0 &proc_pos st_t0,a0,0 la_a0 &proc_line_start @@ -896,7 +895,7 @@ DEFINE EXPR_INVALID 1200000000000000 st_a1,a0,0 la_br &define_macro call - ld_a0,sp,16 + ld_a0,sp,0 la_a1 &proc_pos ld_t0,a1,0 st_t0,a0,16 @@ -909,7 +908,7 @@ DEFINE EXPR_INVALID 1200000000000000 ## The %macro guard above already proved line_start && kind == TOK_WORD; if ## we reach here via a %macro non-match, those gates still hold. :proc_check_struct - ld_t0,sp,24 + ld_t0,sp,8 mov_a0,t0 la_a1 &const_struct li_a2 %7 %0 @@ -919,7 +918,7 @@ DEFINE EXPR_INVALID 1200000000000000 beqz_a0 # %struct matched: shim into define_fielded(stride=8, total="SIZE", len=4) - ld_t0,sp,24 + ld_t0,sp,8 la_a0 &proc_pos st_t0,a0,0 la_a0 &proc_line_start @@ -930,7 +929,7 @@ DEFINE EXPR_INVALID 1200000000000000 li_a2 %4 %0 la_br &define_fielded call - ld_a0,sp,16 + ld_a0,sp,0 la_a1 &proc_pos ld_t0,a1,0 st_t0,a0,16 @@ -941,7 +940,7 @@ DEFINE EXPR_INVALID 1200000000000000 ## ---- line_start && tok eq "%enum" ---- :proc_check_enum - ld_t0,sp,24 + ld_t0,sp,8 mov_a0,t0 la_a1 &const_enum li_a2 %5 %0 @@ -951,7 +950,7 @@ DEFINE EXPR_INVALID 1200000000000000 beqz_a0 # %enum matched: shim into define_fielded(stride=1, total="COUNT", len=5) - ld_t0,sp,24 + ld_t0,sp,8 la_a0 &proc_pos st_t0,a0,0 la_a0 &proc_line_start @@ -962,7 +961,7 @@ DEFINE EXPR_INVALID 1200000000000000 li_a2 %5 %0 la_br &define_fielded call - ld_a0,sp,16 + ld_a0,sp,0 la_a1 &proc_pos ld_t0,a1,0 st_t0,a0,16 @@ -973,8 +972,8 @@ DEFINE EXPR_INVALID 1200000000000000 :proc_check_newline # reload s, tok - ld_a0,sp,16 - ld_t0,sp,24 + ld_a0,sp,0 + ld_t0,sp,8 ld_a1,t0,0 li_a2 TOK_NEWLINE la_br &proc_check_builtin @@ -992,8 +991,8 @@ DEFINE EXPR_INVALID 1200000000000000 :proc_check_builtin # tok->kind == TOK_WORD && tok+1 < s->end && (tok+1)->kind == TOK_LPAREN ? 
- ld_a0,sp,16 - ld_t0,sp,24 + ld_a0,sp,0 + ld_t0,sp,8 ld_a1,t0,0 li_a2 TOK_WORD la_br &proc_check_macro @@ -1018,35 +1017,35 @@ DEFINE EXPR_INVALID 1200000000000000 call la_br &proc_do_builtin bnez_a0 - ld_a0,sp,24 + ld_a0,sp,8 la_a1 &const_at li_a2 %1 %0 la_br &tok_eq_const call la_br &proc_do_builtin bnez_a0 - ld_a0,sp,24 + ld_a0,sp,8 la_a1 &const_pct li_a2 %1 %0 la_br &tok_eq_const call la_br &proc_do_builtin bnez_a0 - ld_a0,sp,24 + ld_a0,sp,8 la_a1 &const_dlr li_a2 %1 %0 la_br &tok_eq_const call la_br &proc_do_builtin bnez_a0 - ld_a0,sp,24 + ld_a0,sp,8 la_a1 &const_select li_a2 %7 %0 la_br &tok_eq_const call la_br &proc_do_builtin bnez_a0 - ld_a0,sp,24 + ld_a0,sp,8 la_a1 &const_str li_a2 %4 %0 la_br &tok_eq_const @@ -1058,8 +1057,8 @@ DEFINE EXPR_INVALID 1200000000000000 :proc_do_builtin # expand_builtin_call(s, tok) - ld_a0,sp,16 - ld_a1,sp,24 + ld_a0,sp,0 + ld_a1,sp,8 la_br &expand_builtin_call call la_br &proc_loop @@ -1069,14 +1068,14 @@ DEFINE EXPR_INVALID 1200000000000000 # macro = find_macro(tok); if non-zero AND # ((tok+1 < s->end AND (tok+1)->kind == TOK_LPAREN) OR macro->param_count == 0) # then expand_call. (§4 paren-less 0-arg calls.) 
- ld_a0,sp,24 + ld_a0,sp,8 la_br &find_macro call la_br &proc_emit beqz_a0 mov_t2,a0 - ld_a0,sp,16 - ld_t0,sp,24 + ld_a0,sp,0 + ld_t0,sp,8 addi_t1,t0,24 ld_a1,a0,8 la_br &proc_macro_has_next @@ -1088,7 +1087,7 @@ DEFINE EXPR_INVALID 1200000000000000 li_a2 TOK_LPAREN la_br &proc_macro_zero_arg bne_a1,a2 - ld_a0,sp,16 + ld_a0,sp,0 mov_a1,t2 la_br &expand_call call @@ -1099,7 +1098,7 @@ DEFINE EXPR_INVALID 1200000000000000 ld_t0,t2,16 la_br &proc_emit bnez_t0 - ld_a0,sp,16 + ld_a0,sp,0 mov_a1,t2 la_br &expand_call call @@ -1108,10 +1107,10 @@ DEFINE EXPR_INVALID 1200000000000000 :proc_emit # emit_token(tok); s->pos += 24; s->line_start = 0 - ld_a0,sp,24 + ld_a0,sp,8 la_br &emit_token call - ld_a0,sp,16 + ld_a0,sp,0 ld_t0,a0,16 addi_t0,t0,24 st_t0,a0,16 @@ -1127,13 +1126,12 @@ DEFINE EXPR_INVALID 1200000000000000 b :proc_done - leave - ret + eret ## --- %macro storage: parse header + body into macros[] / macro_body_tokens -- ## Called at proc_pos == line-start `%macro`. Leaves proc_pos past the %endm ## line with proc_line_start = 1. Uses BSS scratch (def_m_ptr, def_param_ptr, -## def_body_line_start) since P1v2 enter/leave does not save s* registers. +## def_body_line_start) since P1v2 enter/eret does not save s* registers. ## ## Macro record layout (296 bytes, see M1PP_MACRO_RECORD_SIZE): ## +0 name.ptr (8) @@ -1447,8 +1445,7 @@ DEFINE EXPR_INVALID 1200000000000000 la_a0 &proc_line_start li_a1 %1 %0 st_a1,a0,0 - leave - ret + eret ## --- %struct / %enum directive ---------------------------------------------- ## define_fielded(a0=stride, a1=total_name_ptr, a2=total_name_len). @@ -1653,8 +1650,7 @@ DEFINE EXPR_INVALID 1200000000000000 la_a0 &proc_line_start li_a1 %1 %0 st_a1,a0,0 - leave - ret + eret ## df_emit_field(): read df_base_*, df_suffix_*, df_value from BSS; synthesize ## one macro record + one body token. 
Builds the "NAME.field" identifier in @@ -1798,8 +1794,7 @@ DEFINE EXPR_INVALID 1200000000000000 la_a1 &macros_end st_t2,a1,0 - leave - ret + eret ## df_render_decimal(): reads df_value; writes a reverse-filled decimal ## rendering into df_digit_scratch[cursor..end) and stores df_digit_count + @@ -1992,8 +1987,7 @@ DEFINE EXPR_INVALID 1200000000000000 la_br &push_stream_span call :ppsfm_done - leave - ret + eret ## ============================================================================ ## --- Argument parsing ------------------------------------------------------- @@ -2480,15 +2474,15 @@ DEFINE EXPR_INVALID 1200000000000000 la_br &err_bad_macro_header beq_a0,a1 # spill a0/a1 so arg_is_braced can clobber regs - st_a0,sp,16 - st_a1,sp,24 + st_a0,sp,0 + st_a1,sp,8 la_br &arg_is_braced call la_br &catp_plain beqz_a0 # braced: strip outer braces (start+24, end-24) - ld_a0,sp,16 - ld_a1,sp,24 + ld_a0,sp,0 + ld_a1,sp,8 addi_a0,a0,24 addi_a1,a1,neg24 la_br &catp_done @@ -2498,13 +2492,12 @@ DEFINE EXPR_INVALID 1200000000000000 la_br &catp_done b :catp_plain - ld_a0,sp,16 - ld_a1,sp,24 + ld_a0,sp,0 + ld_a1,sp,8 la_br &copy_span_to_pool call :catp_done - leave - ret + eret ## copy_paste_arg_to_pool(a0=arg_start, a1=arg_end) -> void (fatal unless len 1) ## Enforces the single-token-argument rule for params adjacent to ##. @@ -2512,14 +2505,14 @@ DEFINE EXPR_INVALID 1200000000000000 :copy_paste_arg_to_pool enter_16 # spill a0/a1 for the arg_is_braced call - st_a0,sp,16 - st_a1,sp,24 + st_a0,sp,0 + st_a1,sp,8 la_br &arg_is_braced call la_br &err_bad_macro_header bnez_a0 - ld_a0,sp,16 - ld_a1,sp,24 + ld_a0,sp,0 + ld_a1,sp,8 # if ((arg_end - arg_start) != 24) fatal sub_a2,a1,a0 li_a3 M1PP_TOK_SIZE @@ -2527,8 +2520,7 @@ DEFINE EXPR_INVALID 1200000000000000 bne_a2,a3 la_br &copy_span_to_pool call - leave - ret + eret ## expand_macro_tokens(a0=call_tok, a1=limit, a2=macro_ptr) -> void (fatal on bad) ## Requires call_tok+1 is TOK_LPAREN. 
Runs parse_args(call_tok+1, limit), @@ -3027,8 +3019,7 @@ DEFINE EXPR_INVALID 1200000000000000 la_br &paste_pool_range call - leave - ret + eret ## expand_call(a0=stream_ptr, a1=macro_ptr) -> void (fatal on bad call) ## Calls expand_macro_tokens for the call at stream->pos, sets @@ -3039,7 +3030,7 @@ DEFINE EXPR_INVALID 1200000000000000 # spill stream_ptr to local frame slot (sp+16 is the first local; sp+0/+8 # hold the saved return address and saved caller sp). - st_a0,sp,16 + st_a0,sp,0 # expand_macro_tokens(stream->pos, stream->end, macro) # stream->pos at +16, stream->end at +8 @@ -3052,7 +3043,7 @@ DEFINE EXPR_INVALID 1200000000000000 call # stream->pos = emt_after_pos - ld_a0,sp,16 + ld_a0,sp,0 la_a1 &emt_after_pos ld_t0,a1,0 st_t0,a0,16 @@ -3067,8 +3058,7 @@ DEFINE EXPR_INVALID 1200000000000000 la_br &push_pool_stream_from_mark call - leave - ret + eret ## ============================================================================ ## --- ## token paste compaction ---------------------------------------------- @@ -3171,8 +3161,7 @@ DEFINE EXPR_INVALID 1200000000000000 ld_a1,a1,0 st_a1,t0,16 - leave - ret + eret ## paste_pool_range(a0=mark) -> void (fatal on bad paste) ## In-place compactor over expand_pool[mark..pool_used). 
For each TOK_PASTE, @@ -3311,8 +3300,7 @@ DEFINE EXPR_INVALID 1200000000000000 sub_t0,t0,a1 la_a1 &pool_used st_t0,a1,0 - leave - ret + eret ## ============================================================================ ## --- Integer atoms + S-expression evaluator --------------------------------- @@ -3678,80 +3666,61 @@ DEFINE EXPR_INVALID 1200000000000000 :eoc_invalid li_a0 EXPR_INVALID - leave - ret + eret :eoc_add li_a0 EXPR_ADD - leave - ret + eret :eoc_sub li_a0 EXPR_SUB - leave - ret + eret :eoc_mul li_a0 EXPR_MUL - leave - ret + eret :eoc_div li_a0 EXPR_DIV - leave - ret + eret :eoc_mod li_a0 EXPR_MOD - leave - ret + eret :eoc_shl li_a0 EXPR_SHL - leave - ret + eret :eoc_shr li_a0 EXPR_SHR - leave - ret + eret :eoc_and li_a0 EXPR_AND - leave - ret + eret :eoc_or li_a0 EXPR_OR - leave - ret + eret :eoc_xor li_a0 EXPR_XOR - leave - ret + eret :eoc_not li_a0 EXPR_NOT - leave - ret + eret :eoc_eq li_a0 EXPR_EQ - leave - ret + eret :eoc_ne li_a0 EXPR_NE - leave - ret + eret :eoc_lt li_a0 EXPR_LT - leave - ret + eret :eoc_le li_a0 EXPR_LE - leave - ret + eret :eoc_gt li_a0 EXPR_GT - leave - ret + eret :eoc_ge li_a0 EXPR_GE - leave - ret + eret :eoc_strlen li_a0 EXPR_STRLEN - leave - ret + eret ## apply_expr_op(a0=op_code, a1=args_ptr, a2=argc) -> a0 = i64 result ## Reduce args[0..argc) per op: @@ -4209,8 +4178,7 @@ DEFINE EXPR_INVALID 1200000000000000 :aeo_finish la_a0 &aeo_acc ld_a0,a0,0 - leave - ret + eret ## helper: validate argc >= 1; fatal otherwise. (Returns to caller.) 
:aeo_require_argc_ge1 @@ -4286,13 +4254,13 @@ DEFINE EXPR_INVALID 1200000000000000 ## sp+48 saved emt_mark :eval_expr_atom enter_40 - st_a0,sp,16 - st_a1,sp,24 + st_a0,sp,0 + st_a1,sp,8 # macro_ptr = find_macro(tok) la_br &find_macro call - st_a0,sp,32 + st_a0,sp,16 # if (macro_ptr == 0) -> integer atom branch la_br &eea_int_atom @@ -4301,9 +4269,9 @@ DEFINE EXPR_INVALID 1200000000000000 # §4 paren-less 0-arg atom: # Take the macro-call branch if (tok+1 < limit AND (tok+1)->kind == TOK_LPAREN) # OR macro->param_count == 0. Otherwise fall through to int atom (unchanged). - ld_t0,sp,16 + ld_t0,sp,0 addi_t0,t0,24 - ld_t1,sp,24 + ld_t1,sp,8 la_br &eea_check_zero_arg blt_t1,t0 la_br &eea_check_zero_arg @@ -4317,7 +4285,7 @@ DEFINE EXPR_INVALID 1200000000000000 :eea_check_zero_arg # No trailing LPAREN. Take the macro branch only if param_count == 0. - ld_t0,sp,32 + ld_t0,sp,16 ld_t1,t0,16 la_br &eea_int_atom bnez_t1 @@ -4325,30 +4293,30 @@ DEFINE EXPR_INVALID 1200000000000000 :eea_do_macro # Macro call branch: # expand_macro_tokens(tok, limit, macro_ptr) - ld_a0,sp,16 - ld_a1,sp,24 - ld_a2,sp,32 + ld_a0,sp,0 + ld_a1,sp,8 + ld_a2,sp,16 la_br &expand_macro_tokens call # Snapshot emt outputs immediately. la_a0 &emt_after_pos ld_t0,a0,0 - st_t0,sp,40 + st_t0,sp,24 la_a0 &emt_mark ld_t0,a0,0 - st_t0,sp,48 + st_t0,sp,32 # If pool was not extended (pool_used == mark) -> bad expression. 
la_a0 &pool_used ld_t0,a0,0 - ld_t1,sp,48 + ld_t1,sp,32 la_br &err_bad_macro_header beq_t0,t1 # eval_expr_range(expand_pool + mark, expand_pool + pool_used) la_a0 &expand_pool - ld_t1,sp,48 + ld_t1,sp,32 add_a0,a0,t1 la_a1 &expand_pool la_a2 &pool_used @@ -4363,33 +4331,31 @@ DEFINE EXPR_INVALID 1200000000000000 # restore pool_used = mark la_a0 &pool_used - ld_t0,sp,48 + ld_t0,sp,32 st_t0,a0,0 # eval_after_pos = saved emt_after_pos la_a0 &eval_after_pos - ld_t0,sp,40 + ld_t0,sp,24 st_t0,a0,0 - leave - ret + eret :eea_int_atom # parse_int_token(tok) -> i64 - ld_a0,sp,16 + ld_a0,sp,0 la_br &parse_int_token call la_a1 &eval_value st_a0,a1,0 # eval_after_pos = tok + 24 - ld_t0,sp,16 + ld_t0,sp,0 addi_t0,t0,24 la_a0 &eval_after_pos st_t0,a0,0 - leave - ret + eret ## eval_expr_range(a0=start_tok, a1=end_tok) -> a0 = i64 result (fatal on bad) ## Main S-expression evaluator loop, driven by the explicit ExprFrame stack @@ -4412,28 +4378,28 @@ DEFINE EXPR_INVALID 1200000000000000 ## used as the local base for stack checks) :eval_expr_range enter_56 - st_a0,sp,16 - st_a1,sp,24 + st_a0,sp,0 + st_a1,sp,8 li_t0 %0 %0 + st_t0,sp,16 + st_t0,sp,24 st_t0,sp,32 st_t0,sp,40 - st_t0,sp,48 - st_t0,sp,56 # entry_frame_top = expr_frame_top la_a0 &expr_frame_top ld_t0,a0,0 - st_t0,sp,64 + st_t0,sp,48 :eer_loop # If have_value, deliver it. - ld_t0,sp,48 + ld_t0,sp,32 la_br &eer_no_have_value beqz_t0 # have_value: feed into top frame, or set result. la_a0 &expr_frame_top ld_t0,a0,0 - ld_t1,sp,64 + ld_t1,sp,48 la_br &eer_set_result beq_t0,t1 # frame = &expr_frames[frame_top - 1] @@ -4456,42 +4422,42 @@ DEFINE EXPR_INVALID 1200000000000000 add_a3,a0,a2 shli_a2,t1,3 add_a3,a3,a2 - ld_t2,sp,32 + ld_t2,sp,16 st_t2,a3,0 # frame->argc++ addi_t1,t1,1 st_t1,a1,0 # have_value = 0 li_t0 %0 %0 - st_t0,sp,48 + st_t0,sp,32 la_br &eer_loop b :eer_set_result # No frame open; this value is the top-level result. 
- ld_t0,sp,56 + ld_t0,sp,40 la_br &err_bad_macro_header bnez_t0 - ld_t0,sp,32 - st_t0,sp,40 + ld_t0,sp,16 + st_t0,sp,24 li_t0 %1 %0 - st_t0,sp,56 + st_t0,sp,40 li_t0 %0 %0 - st_t0,sp,48 + st_t0,sp,32 la_br &eer_loop b :eer_no_have_value # skip_expr_newlines(pos, end) - ld_a0,sp,16 - ld_a1,sp,24 + ld_a0,sp,0 + ld_a1,sp,8 la_br &skip_expr_newlines call - st_a0,sp,16 + st_a0,sp,0 # if (pos >= end) break - ld_t0,sp,16 - ld_t1,sp,24 + ld_t0,sp,0 + ld_t1,sp,8 la_br &eer_loop_done beq_t0,t1 @@ -4505,38 +4471,38 @@ DEFINE EXPR_INVALID 1200000000000000 beq_t2,a3 # atom: eval_expr_atom(pos, end); value = eval_value; pos = eval_after_pos - ld_a0,sp,16 - ld_a1,sp,24 + ld_a0,sp,0 + ld_a1,sp,8 la_br &eval_expr_atom call la_a0 &eval_value ld_t0,a0,0 - st_t0,sp,32 + st_t0,sp,16 la_a0 &eval_after_pos ld_t0,a0,0 - st_t0,sp,16 + st_t0,sp,0 li_t0 %1 %0 - st_t0,sp,48 + st_t0,sp,32 la_br &eer_loop b :eer_lparen # pos++ addi_t0,t0,24 - st_t0,sp,16 + st_t0,sp,0 # skip_expr_newlines - ld_a0,sp,16 - ld_a1,sp,24 + ld_a0,sp,0 + ld_a1,sp,8 la_br &skip_expr_newlines call - st_a0,sp,16 + st_a0,sp,0 # if (pos >= end) fatal - ld_t0,sp,16 - ld_t1,sp,24 + ld_t0,sp,0 + ld_t1,sp,8 la_br &err_bad_macro_header beq_t0,t1 # op = expr_op_code(pos) - ld_a0,sp,16 + ld_a0,sp,0 la_br &expr_op_code call # if (op == EXPR_INVALID) fatal @@ -4572,9 +4538,9 @@ DEFINE EXPR_INVALID 1200000000000000 addi_t0,t0,1 st_t0,a1,0 # pos++ (skip operator token) - ld_t0,sp,16 + ld_t0,sp,0 addi_t0,t0,24 - st_t0,sp,16 + st_t0,sp,0 la_br &eer_loop b @@ -4582,7 +4548,7 @@ DEFINE EXPR_INVALID 1200000000000000 # if (frame_top <= entry_frame_top) fatal la_a0 &expr_frame_top ld_t0,a0,0 - ld_t1,sp,64 + ld_t1,sp,48 la_br &err_bad_macro_header beq_t0,t1 la_br &err_bad_macro_header @@ -4603,16 +4569,16 @@ DEFINE EXPR_INVALID 1200000000000000 la_br &apply_expr_op call # value = result; frame_top--; pos++; have_value = 1 - st_a0,sp,32 + st_a0,sp,16 la_a1 &expr_frame_top ld_t0,a1,0 addi_t0,t0,neg1 st_t0,a1,0 - ld_t0,sp,16 + ld_t0,sp,0 
addi_t0,t0,24 - st_t0,sp,16 + st_t0,sp,0 li_t0 %1 %0 - st_t0,sp,48 + st_t0,sp,32 la_br &eer_loop b @@ -4620,18 +4586,18 @@ DEFINE EXPR_INVALID 1200000000000000 # (strlen "literal") — degenerate unary op whose argument is a # TOK_STRING atom, not a recursive expression. # pos++ past the "strlen" operator word. - ld_t0,sp,16 + ld_t0,sp,0 addi_t0,t0,24 - st_t0,sp,16 + st_t0,sp,0 # skip_expr_newlines(pos, end) - ld_a0,sp,16 - ld_a1,sp,24 + ld_a0,sp,0 + ld_a1,sp,8 la_br &skip_expr_newlines call - st_a0,sp,16 + st_a0,sp,0 # if (pos >= end) fatal - ld_t0,sp,16 - ld_t1,sp,24 + ld_t0,sp,0 + ld_t1,sp,8 la_br &err_bad_macro_header beq_t0,t1 # if (pos->kind != TOK_STRING) fatal @@ -4652,19 +4618,19 @@ DEFINE EXPR_INVALID 1200000000000000 bne_a3,a0 # value = pos->text.len - 2 addi_a1,a1,neg2 - st_a1,sp,32 + st_a1,sp,16 # pos++ addi_t0,t0,24 - st_t0,sp,16 + st_t0,sp,0 # skip_expr_newlines(pos, end) - ld_a0,sp,16 - ld_a1,sp,24 + ld_a0,sp,0 + ld_a1,sp,8 la_br &skip_expr_newlines call - st_a0,sp,16 + st_a0,sp,0 # if (pos >= end) fatal - ld_t0,sp,16 - ld_t1,sp,24 + ld_t0,sp,0 + ld_t1,sp,8 la_br &err_bad_macro_header beq_t0,t1 # if (pos->kind != TOK_RPAREN) fatal @@ -4674,10 +4640,10 @@ DEFINE EXPR_INVALID 1200000000000000 bne_t2,a3 # pos++ addi_t0,t0,24 - st_t0,sp,16 + st_t0,sp,0 # have_value = 1 li_t0 %1 %0 - st_t0,sp,48 + st_t0,sp,32 la_br &eer_loop b @@ -4685,22 +4651,21 @@ DEFINE EXPR_INVALID 1200000000000000 # frame_top must equal entry_frame_top la_a0 &expr_frame_top ld_t0,a0,0 - ld_t1,sp,64 + ld_t1,sp,48 la_br &err_bad_macro_header bne_t0,t1 # have_result must be 1 - ld_t0,sp,56 + ld_t0,sp,40 la_br &err_bad_macro_header beqz_t0 # pos must equal end - ld_t0,sp,16 - ld_t1,sp,24 + ld_t0,sp,0 + ld_t1,sp,8 la_br &err_bad_macro_header bne_t0,t1 # return result - ld_a0,sp,40 - leave - ret + ld_a0,sp,24 + eret ## ============================================================================ ## --- Hex emit for !@%$ ------------------------------------------------------ @@ -4820,8 
+4785,7 @@ DEFINE EXPR_INVALID 1200000000000000 la_br &emit_token call - leave - ret + eret ## ============================================================================ ## --- Builtin dispatcher ( ! @ % $ %select ) --------------------------------- @@ -5030,8 +4994,7 @@ DEFINE EXPR_INVALID 1200000000000000 la_br &emit_hex_value call - leave - ret + eret :ebc_select # require arg_count == 3 @@ -5142,8 +5105,7 @@ DEFINE EXPR_INVALID 1200000000000000 call :ebc_select_done - leave - ret + eret ## %str(IDENT): stringify a single WORD argument into a TOK_STRING literal. ## Validation: arg_count == 1, arg span length == 1 token, and that token's @@ -5261,8 +5223,7 @@ DEFINE EXPR_INVALID 1200000000000000 la_br &emit_token call - leave - ret + eret ## --- Error paths ------------------------------------------------------------- ## Each err_* loads a (msg, len) pair for fatal; fatal writes "m1pp: <msg>\n" diff --git a/p1/P1-aarch64.M1pp b/p1/P1-aarch64.M1pp @@ -165,7 +165,7 @@ %select((= %aa64_is_sp(dst) 1), %aa64_add_imm(sp, src, 0), %select((= %aa64_is_sp(src) 1), - %aa64_add_imm(dst, sp, 0), + %aa64_add_imm(dst, sp, 16), %((| 0xAA000000 (<< %aa64_reg(src) 16) (<< 31 5) %aa64_reg(dst))))) %endm @@ -408,7 +408,9 @@ %endm %macro p1_mem(op, rt, rn, off) -%aa64_mem(op, rt, rn, off) +%select((= %aa64_is_sp(rn) 1), + %aa64_mem(op, rt, rn, (+ off 16)), + %aa64_mem(op, rt, rn, off)) %endm %macro p1_ldarg(rd, slot) @@ -436,19 +438,24 @@ %aa64_ret() %endm -%macro p1_leave() +%macro p1_eret() %aa64_mem(LD, lr, sp, 0) %aa64_mem(LD, x8, sp, 8) %aa64_mov_rr(sp, x8) +%aa64_ret() %endm %macro p1_tail() -%p1_leave() +%aa64_mem(LD, lr, sp, 0) +%aa64_mem(LD, x8, sp, 8) +%aa64_mov_rr(sp, x8) %aa64_br(br) %endm %macro p1_tailr(rs) -%p1_leave() +%aa64_mem(LD, lr, sp, 0) +%aa64_mem(LD, x8, sp, 8) +%aa64_mov_rr(sp, x8) %aa64_br(rs) %endm diff --git a/p1/P1-amd64.M1pp b/p1/P1-amd64.M1pp @@ -562,7 +562,50 @@ %endm %macro p1_mov(rd, rs) -%amd_mov_rr(rd, rs) +%p1_mov_##rs(rd) +%endm + +# All 
non-sp sources: plain register copy. +%macro p1_mov_a0(rd) +%amd_mov_rr(rd, a0) +%endm +%macro p1_mov_a1(rd) +%amd_mov_rr(rd, a1) +%endm +%macro p1_mov_a2(rd) +%amd_mov_rr(rd, a2) +%endm +%macro p1_mov_a3(rd) +%amd_mov_rr(rd, a3) +%endm +%macro p1_mov_t0(rd) +%amd_mov_rr(rd, t0) +%endm +%macro p1_mov_t1(rd) +%amd_mov_rr(rd, t1) +%endm +%macro p1_mov_t2(rd) +%amd_mov_rr(rd, t2) +%endm +%macro p1_mov_s0(rd) +%amd_mov_rr(rd, s0) +%endm +%macro p1_mov_s1(rd) +%amd_mov_rr(rd, s1) +%endm +%macro p1_mov_s2(rd) +%amd_mov_rr(rd, s2) +%endm +%macro p1_mov_s3(rd) +%amd_mov_rr(rd, s3) +%endm + +# sp-source: portable sp is the frame-local base, which is native rsp + 16 +# (the 16-byte backend-private frame header sits at [rsp+0..rsp+15]). +# Emit `mov rd, rsp ; add rd, 16`. +%macro p1_mov_sp(rd) +%amd_mov_rr(rd, sp) +%amd_alu_ri8(0, rd, 16) %endm %macro p1_rrr(op, rd, ra, rb) @@ -618,18 +661,176 @@ %p1_shifti_##op(rd, ra, imm) %endm +# p1_mem dispatches on (op, base). When the base is sp, portable sp is the +# frame-local base — 16 bytes above native rsp — so the physical access needs +# the supplied portable offset plus 16. For any other base, the portable and +# native offset coincide. Internal backend callers that need raw native-rsp +# access (p1_enter, p1_eret, _start stub, p1_ldarg, p1_syscall) use +# amd_mem_LD/amd_mem_ST directly and bypass this translation. 
+ +%macro p1_mem_LD_sp(rt, off) +%amd_mem_LD(rt, sp, (+ off 16)) +%endm +%macro p1_mem_ST_sp(rt, off) +%amd_mem_ST(rt, sp, (+ off 16)) +%endm +%macro p1_mem_LB_sp(rt, off) +%amd_mem_LB(rt, sp, (+ off 16)) +%endm +%macro p1_mem_SB_sp(rt, off) +%amd_mem_SB(rt, sp, (+ off 16)) +%endm + %macro p1_mem_LD(rt, rn, off) -%amd_mem_LD(rt, rn, off) +%p1_mem_LD_##rn(rt, off) %endm %macro p1_mem_ST(rt, rn, off) -%amd_mem_ST(rt, rn, off) +%p1_mem_ST_##rn(rt, off) %endm %macro p1_mem_LB(rt, rn, off) -%amd_mem_LB(rt, rn, off) +%p1_mem_LB_##rn(rt, off) %endm %macro p1_mem_SB(rt, rn, off) -%amd_mem_SB(rt, rn, off) +%p1_mem_SB_##rn(rt, off) +%endm + +# Non-sp bases for each op -- plain native load/store with portable offset. +%macro p1_mem_LD_a0(rt, off) +%amd_mem_LD(rt, a0, off) +%endm +%macro p1_mem_LD_a1(rt, off) +%amd_mem_LD(rt, a1, off) +%endm +%macro p1_mem_LD_a2(rt, off) +%amd_mem_LD(rt, a2, off) +%endm +%macro p1_mem_LD_a3(rt, off) +%amd_mem_LD(rt, a3, off) +%endm +%macro p1_mem_LD_t0(rt, off) +%amd_mem_LD(rt, t0, off) +%endm +%macro p1_mem_LD_t1(rt, off) +%amd_mem_LD(rt, t1, off) +%endm +%macro p1_mem_LD_t2(rt, off) +%amd_mem_LD(rt, t2, off) +%endm +%macro p1_mem_LD_s0(rt, off) +%amd_mem_LD(rt, s0, off) +%endm +%macro p1_mem_LD_s1(rt, off) +%amd_mem_LD(rt, s1, off) +%endm +%macro p1_mem_LD_s2(rt, off) +%amd_mem_LD(rt, s2, off) +%endm +%macro p1_mem_LD_s3(rt, off) +%amd_mem_LD(rt, s3, off) +%endm + +%macro p1_mem_ST_a0(rt, off) +%amd_mem_ST(rt, a0, off) +%endm +%macro p1_mem_ST_a1(rt, off) +%amd_mem_ST(rt, a1, off) +%endm +%macro p1_mem_ST_a2(rt, off) +%amd_mem_ST(rt, a2, off) +%endm +%macro p1_mem_ST_a3(rt, off) +%amd_mem_ST(rt, a3, off) +%endm +%macro p1_mem_ST_t0(rt, off) +%amd_mem_ST(rt, t0, off) +%endm +%macro p1_mem_ST_t1(rt, off) +%amd_mem_ST(rt, t1, off) %endm +%macro p1_mem_ST_t2(rt, off) +%amd_mem_ST(rt, t2, off) +%endm +%macro p1_mem_ST_s0(rt, off) +%amd_mem_ST(rt, s0, off) +%endm +%macro p1_mem_ST_s1(rt, off) +%amd_mem_ST(rt, s1, off) +%endm +%macro 
p1_mem_ST_s2(rt, off) +%amd_mem_ST(rt, s2, off) +%endm +%macro p1_mem_ST_s3(rt, off) +%amd_mem_ST(rt, s3, off) +%endm + +%macro p1_mem_LB_a0(rt, off) +%amd_mem_LB(rt, a0, off) +%endm +%macro p1_mem_LB_a1(rt, off) +%amd_mem_LB(rt, a1, off) +%endm +%macro p1_mem_LB_a2(rt, off) +%amd_mem_LB(rt, a2, off) +%endm +%macro p1_mem_LB_a3(rt, off) +%amd_mem_LB(rt, a3, off) +%endm +%macro p1_mem_LB_t0(rt, off) +%amd_mem_LB(rt, t0, off) +%endm +%macro p1_mem_LB_t1(rt, off) +%amd_mem_LB(rt, t1, off) +%endm +%macro p1_mem_LB_t2(rt, off) +%amd_mem_LB(rt, t2, off) +%endm +%macro p1_mem_LB_s0(rt, off) +%amd_mem_LB(rt, s0, off) +%endm +%macro p1_mem_LB_s1(rt, off) +%amd_mem_LB(rt, s1, off) +%endm +%macro p1_mem_LB_s2(rt, off) +%amd_mem_LB(rt, s2, off) +%endm +%macro p1_mem_LB_s3(rt, off) +%amd_mem_LB(rt, s3, off) +%endm + +%macro p1_mem_SB_a0(rt, off) +%amd_mem_SB(rt, a0, off) +%endm +%macro p1_mem_SB_a1(rt, off) +%amd_mem_SB(rt, a1, off) +%endm +%macro p1_mem_SB_a2(rt, off) +%amd_mem_SB(rt, a2, off) +%endm +%macro p1_mem_SB_a3(rt, off) +%amd_mem_SB(rt, a3, off) +%endm +%macro p1_mem_SB_t0(rt, off) +%amd_mem_SB(rt, t0, off) +%endm +%macro p1_mem_SB_t1(rt, off) +%amd_mem_SB(rt, t1, off) +%endm +%macro p1_mem_SB_t2(rt, off) +%amd_mem_SB(rt, t2, off) +%endm +%macro p1_mem_SB_s0(rt, off) +%amd_mem_SB(rt, s0, off) +%endm +%macro p1_mem_SB_s1(rt, off) +%amd_mem_SB(rt, s1, off) +%endm +%macro p1_mem_SB_s2(rt, off) +%amd_mem_SB(rt, s2, off) +%endm +%macro p1_mem_SB_s3(rt, off) +%amd_mem_SB(rt, s3, off) +%endm + %macro p1_mem(op, rt, rn, off) %p1_mem_##op(rt, rn, off) %endm @@ -659,25 +860,38 @@ %amd_ret() %endm -# LEAVE -# r9 = [sp + 0] -- retaddr into scratch -# rax = [sp + 8] -- saved caller sp into rax (an unused native reg) -# sp = rax -- unwind to caller sp -# push r9 -- reinstall retaddr so RET returns correctly -%macro p1_leave() +# ERET -- atomic frame epilogue + return from a framed function. 
+# r9 = [rsp + 0] -- retaddr into scratch (native rsp; backend-private) +# rax = [rsp + 8] -- saved caller sp into rax (an unused native reg) +# rsp = rax -- unwind to caller sp +# push r9 -- reinstall retaddr so the trailing ret returns +# correctly +# ret -- pop reinstated retaddr into rip +%macro p1_eret() %amd_mem_LD(scratch, sp, 0) %amd_mem_LD(rax, sp, 8) %amd_mov_rr(sp, rax) %amd_push(scratch) +%amd_ret() %endm +# TAIL / TAILR -- frame epilogue followed by an unconditional jump to the +# target. The epilogue is the same sequence as the first four steps of +# p1_eret (we omit the trailing ret because we jmp to a fresh target +# instead). %macro p1_tail() -%p1_leave() +%amd_mem_LD(scratch, sp, 0) +%amd_mem_LD(rax, sp, 8) +%amd_mov_rr(sp, rax) +%amd_push(scratch) %amd_jmp_r(br) %endm %macro p1_tailr(rs) -%p1_leave() +%amd_mem_LD(scratch, sp, 0) +%amd_mem_LD(rax, sp, 8) +%amd_mov_rr(sp, rax) +%amd_push(scratch) %amd_jmp_r(rs) %endm diff --git a/p1/P1-riscv64.M1pp b/p1/P1-riscv64.M1pp @@ -9,7 +9,7 @@ # save0 = t4 (x29) -- transient across SYSCALL only # save1 = t3 (x28) # save2 = a6 (x16) -# saved_fp = fp (x8) -- used by ENTER/LEAVE to capture caller sp +# saved_fp = fp (x8) -- used by ENTER/ERET to capture caller sp # a7 = x17 -- Linux riscv64 syscall-number slot # a4 = x14 -- syscall arg4 slot # a5 = x15 -- syscall arg5 slot @@ -331,7 +331,9 @@ %endm %macro p1_mov(rd, rs) -%rv_mov_rr(rd, rs) +%select((= %rv_is_sp(rs) 1), + %rv_addi(rd, sp, 16), + %rv_mov_rr(rd, rs)) %endm %macro p1_rrr(op, rd, ra, rb) @@ -378,7 +380,9 @@ %rv_sb(rt, rn, off) %endm %macro p1_mem(op, rt, rn, off) -%p1_mem_##op(rt, rn, off) +%select((= %rv_is_sp(rn) 1), + %p1_mem_##op(rt, rn, (+ off 16)), + %p1_mem_##op(rt, rn, off)) %endm %macro p1_ldarg(rd, slot) @@ -406,19 +410,24 @@ %rv_jalr(zero, ra, 0) %endm -%macro p1_leave() +%macro p1_eret() %rv_ld(ra, sp, 0) %rv_ld(fp, sp, 8) %rv_mov_rr(sp, fp) +%rv_jalr(zero, ra, 0) %endm %macro p1_tail() -%p1_leave() +%rv_ld(ra, sp, 0) +%rv_ld(fp, sp, 8) 
+%rv_mov_rr(sp, fp) %rv_jalr(zero, br, 0) %endm %macro p1_tailr(rs) -%p1_leave() +%rv_ld(ra, sp, 0) +%rv_ld(fp, sp, 8) +%rv_mov_rr(sp, fp) %rv_jalr(zero, rs, 0) %endm diff --git a/p1/P1.M1pp b/p1/P1.M1pp @@ -4,7 +4,7 @@ # The backend must provide the target hooks used below: # %p1_li, %p1_la, %p1_labr, %p1_mov, %p1_rrr, %p1_addi, %p1_logi, # %p1_shifti, %p1_mem, %p1_ldarg, %p1_b, %p1_br, %p1_call, %p1_callr, -# %p1_ret, %p1_leave, %p1_tail, %p1_tailr, %p1_condb, %p1_condbz, +# %p1_ret, %p1_eret, %p1_tail, %p1_tailr, %p1_condb, %p1_condbz, # %p1_enter, %p1_syscall, and %p1_sys_*. # ---- Materialization ------------------------------------------------------ @@ -185,8 +185,8 @@ %p1_enter(size) %endm -%macro leave() -%p1_leave() +%macro eret() +%p1_eret() %endm # ---- System --------------------------------------------------------------- diff --git a/p1/aarch64.py b/p1/aarch64.py @@ -185,6 +185,18 @@ def aa_ret(): return le32(0xD65F03C0) +def aa_epilogue(): + # Frame teardown, shared by ERET, TAIL, TAILR. Loads lr and the + # saved caller sp from the hidden header at native_sp+0/+8, then + # unwinds sp. Does NOT transfer control; the caller appends an + # aa_ret / aa_br as appropriate. + return ( + aa_mem('LD', 'lr', 'sp', 0) + + aa_mem('LD', 'x8', 'sp', 8) + + aa_mov_rr('sp', 'x8') + ) + + def aa_lit64_prefix(rd): ## 64-bit literal-pool prefix for LI: ldr xN, [pc,#8]; b PC+12. ## The 8 bytes that follow in source become the literal; b skips them. @@ -219,6 +231,12 @@ def encode_labr(_arch, _row): def encode_mov(_arch, row): + # Portable `sp` is the frame-local base, which is 16 bytes above + # native sp (the backend's 2-word hidden header sits at the low end + # of each frame allocation). So reading sp into a register yields + # native_sp + 16, not native_sp itself. 
+ if row.rs == 'sp': + return aa_add_imm(row.rd, 'sp', 16, sub=False) return aa_mov_rr(row.rd, row.rs) @@ -263,7 +281,11 @@ def encode_shifti(_arch, row): def encode_mem(_arch, row): - return aa_mem(row.op, row.rt, row.rn, row.off) + # Portable sp points to the frame-local base; the 2-word hidden + # header sits at native_sp+0/+8 and is not portable-addressable. + # Shift sp-relative offsets past the header. + off = row.off + 16 if row.rn == 'sp' else row.off + return aa_mem(row.op, row.rt, row.rn, off) def encode_ldarg(_arch, row): @@ -276,8 +298,7 @@ def encode_branch_reg(_arch, row): if row.kind == 'CALLR': return aa_blr(row.rs) if row.kind == 'TAILR': - leave = encode_nullary(_arch, Nullary('LEAVE', 'LEAVE')) - return leave + aa_br(row.rs) + return aa_epilogue() + aa_br(row.rs) raise ValueError(f'unknown branch-reg kind: {row.kind}') @@ -314,15 +335,10 @@ def encode_nullary(_arch, row): return aa_blr('br') if row.kind == 'RET': return aa_ret() - if row.kind == 'LEAVE': - return ( - aa_mem('LD', 'lr', 'sp', 0) - + aa_mem('LD', 'x8', 'sp', 8) - + aa_mov_rr('sp', 'x8') - ) + if row.kind == 'ERET': + return aa_epilogue() + aa_ret() if row.kind == 'TAIL': - leave = encode_nullary(_arch, Nullary('LEAVE', 'LEAVE')) - return leave + aa_br('br') + return aa_epilogue() + aa_br('br') if row.kind == 'SYSCALL': return ''.join([ aa_mov_rr('x8', 'a0'), diff --git a/p1/p1_gen.py b/p1/p1_gen.py @@ -139,6 +139,7 @@ def rows(arch): out.append(Banner('Calls And Returns')) out.append(Nullary(name='CALL', kind='CALL')) out.append(Nullary(name='RET', kind='RET')) + out.append(Nullary(name='ERET', kind='ERET')) out.append(Nullary(name='TAIL', kind='TAIL')) for rs in P1_GPRS: out.append(BranchReg(name=f'CALLR_{rs.upper()}', kind='CALLR', rs=rs)) @@ -148,7 +149,6 @@ def rows(arch): out.append(Banner('Frame Management')) for size in ENTER_SIZES: out.append(Enter(name=f'ENTER_{size}', size=size)) - out.append(Nullary(name='LEAVE', kind='LEAVE')) out.append(Banner('System')) 
out.append(Nullary(name='SYSCALL', kind='SYSCALL')) diff --git a/post.md b/post.md @@ -239,8 +239,8 @@ Ops: - Branching: `B`, `BR`, `BEQ`, `BNE`, `BLT`, `BLTU`, `BEQZ`, `BNEZ`, `BLTZ`. Signed and unsigned less-than; `>=`, `>`, `<=` are synthesized by swapping operands or inverting branch sense. -- Calls / returns: `CALL`, `CALLR`, `RET`, `TAIL`, `TAILR`. -- Frame management: `ENTER`, `LEAVE`. +- Calls / returns: `CALL`, `CALLR`, `RET`, `ERET`, `TAIL`, `TAILR`. +- Frame management: `ENTER`. - ABI arg access: `LDARG` — reads stack-passed incoming args without hard-coding the frame layout. - System: `SYSCALL`. @@ -252,7 +252,8 @@ Calling convention: - `a0` is the one-word return register. Two-word returns use `a0`/`a1`. - `a0`-`a3` and `t0`-`t2` are caller-saved; `s0`-`s3` and `sp` are callee-saved. -- `ENTER` builds the standard frame; `LEAVE` tears it down. +- `ENTER` builds the standard frame; `ERET` tears it down and returns + (`TAIL`/`TAILR` likewise combine teardown with a jump). - Stack-passed outgoing args are staged in a dedicated frame-local area before `CALL`, so the callee finds them at a known offset from `sp`. - Wider-than-two-word returns use the usual hidden-pointer trick: caller @@ -323,8 +324,7 @@ A function call, with a helper that doubles its argument: %enter(0) %la_br() &double %call() - %leave() - %ret() + %eret() :ELF_end ``` @@ -333,7 +333,7 @@ A function call, with a helper that doubles its argument: op consumes it. `double` is a leaf and needs no frame. `p1_main` is not — it calls `double`, so it opens a standard frame with `%enter(0)` to preserve the hidden return-address state across the call, and closes it -with `%leave()` before returning. Run with `./double a b c` and the exit +with `%eret()`, which tears down the frame and returns in one step. Run with `./double a b c` and the exit status is `8` (argc=4, doubled). 
## What it cost diff --git a/tests/p1/double.P1 b/tests/p1/double.P1 @@ -2,7 +2,7 @@ # # `:double` is a leaf function that shifts its one-word argument left by # one and returns. `:p1_main` is not a leaf (it calls `double`), so it -# establishes a standard frame with %enter/%leave to preserve the hidden +# establishes a standard frame with %enter/%eret to preserve the hidden # return-address state across the call. argc arrives in a0, is handed to # double unchanged, and the doubled result comes back in a0. @@ -14,7 +14,6 @@ %enter(0) %la_br() &double %call() - %leave() - %ret() + %eret() :ELF_end diff --git a/tests/p1/p1-aliasing.P1 b/tests/p1/p1-aliasing.P1 @@ -60,8 +60,7 @@ %syscall() %li(a0) $(0) - %leave() - %ret() + %eret() # Two-byte output scratch: [0] = computed byte, [1] = newline. The space # placeholder gets overwritten by SB before the write syscall. diff --git a/tests/p1/p1-call.P1 b/tests/p1/p1-call.P1 @@ -1,4 +1,4 @@ -# tests/p1/p1-call.P1 -- exercise ENTER, LEAVE, CALL, RET, MOV, ADDI +# tests/p1/p1-call.P1 -- exercise ENTER, ERET, CALL, RET, MOV, ADDI # across a nontrivial P1 program. Calls a `write_msg` subroutine twice # and returns argc + 1 as the exit status so we also verify the argv- # aware _start stub (argc is always >= 1). @@ -22,8 +22,7 @@ # exit status = argc + 1 (so it's always >= 2). %addi(a0, s0, 1) - %leave() - %ret() + %eret() # write_msg(buf=a0, len=a1) -> void :write_msg @@ -34,8 +33,7 @@ %li(a0) %sys_write() %li(a1) $(1) %syscall() - %leave() - %ret() + %eret() :msg_a "A