rv64: port to NativeTarget API (-O0 core) - kit

commit 6eaf127a60f04b7a93730df0bb959e140818a196
parent 429defa5e448fa66882d0d0b729424b59b73fa09
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Fri, 29 May 2026 04:55:10 -0700

rv64: port to NativeTarget API (-O0 core)

Implement src/arch/rv64/native.c: the RISC-V NativeTarget physical-emission
backend driven by the shared NativeDirectTarget at -O0, mirroring the aa64
port. Covers the frame model (s0 fp, single top-record, single-pass
reserve-and-patch prologue), register tables, operand legality, memory
load/store, arithmetic/compare/convert (flag-free SLT/branch lowering),
calls/returns/params via the abi interface, spill/reload, alloca, and
control flow. Atomics, va_args, inline asm, bitfields, TLS, intrinsics, and
the -O1 known-frame path are stubbed for follow-up.

Rewire rv64/arch.c through native_direct_target_new; adapt asm.c inline-asm
binding to the physical-operand pseudo-kind scheme (RV64_INLINE_OPK_REG);
delete the dead legacy CGTarget files (ops/emit/alloc/opt_coord.c +
internal.h); enable CFREE_ARCH_RV64_ENABLED.

Porting reference: doc/NATIVE_PORT_RV64.md. All 13 core toy cases (arith,
bitwise, cmp, if/else, loops, break/continue, recursion, params, pointers,
globals, short-circuit) pass cross-exec at -O0 via podman/riscv64.

rv64: fold indexed addresses in load/store (Zba), fix global/frame-value base resolution

The -O0 NativeDirectTarget passes indexed NativeAddrs to load/store (aa64
encodes them natively); RISC-V has no indexed memory ops, so rv_emit_mem now
folds the index via Zba sh{1,2,3}add and resolves GLOBAL/FRAME_VALUE bases.
Fixes array indexing, global string/array data, pointer-to-array, slices.

test/toy: batch path-X cross-exec into one container per arch

The toy X path ran one emulated podman per case (~30 min for the full suite
under riscv64 emulation). Defer each case's exec via exec_target_queue and
drain the whole queue with a single exec_target_flush per target arch after
compile+link, matching what parse/run.sh path E already does. Cuts a full
rv64 -O0 cross run to a couple of minutes. Adds scripts/toy_cross_batch.sh
(standalone batched runner) for ad-hoc triage.

rv64: implement atomics, va_args, inline asm, bitfields, TLS, intrinsics (-O0)

Fill in the hooks stubbed in the initial port, mirroring the legacy RISC-V
ISA sequences: A-extension atomics (LR/SC + AMO with .aq/.rl ordering),
LP64D va_list (pointer model + GP save area), inline asm binding (direct +
optimizer paths via the RV64_INLINE_OPK_* pseudo-kinds), bitfield load/store,
TLS (local-exec + initial-exec/GOT), and intrinsics (popcount/ctz/clz/bswap/
overflow/memcpy/memset/trap). test-toy X-O0 for rv64: 244 pass, 12 fail
(6 expected tail-call + 6 real bugs in varargs/overflow to triage).

rv64: fix va-pointer load, 32-bit popcount mask, disjoint emit/scratch regs

- rv_direct_va_base loaded the *address* of the OPK_LOCAL holding &ap instead
  of its pointer value; mirror aa64's *_pointer_addr (load the home value).
  Fixes all varargs cases + spec_demo.
- 32-bit SWAR popcount used a 64-bit multiply then >>24, leaving product bits
  [32,64) in the result; mask to a byte. Fixes clz/ctz/popcount on i32
  (132_intrinsic_bit_and_overflow).
- Make emit-internal temps (t0..t3) disjoint from the driver scratch pool
  (t4,t5) so a hook can't clobber an operand the driver parked in scratch;
  t6 is the lone caller-saved cache reg. test-toy X-O0 rv64: 250 pass, 6 fail
  (all deferred tail-call cases), 0 skip.

rv64: fix signed-overflow UB in prologue far-offset hi/lo split

(hi << 12) overflowed i32 for frames/offsets > 2048 bytes (UBSan trap when
the compiler itself compiled rt/qsort.c). Compute the shift unsigned. The
full riscv64 runtime (qsort/printf/string/...) now compiles cleanly.

rv64: pass variadic FP args in integer registers (LP64D)

rv_param_abi synthesized ABI_CLASS_FP for unnamed FP args, so plan_call put
variadic doubles in fa-regs while the callee reads them from the integer
save area. LP64D passes variadic FP args per the integer convention; classify
synthesized variadic float parts as ABI_CLASS_INT. Fixes variadic_05_double,
variadic_06_mixed. (Note: parse harness binaries statically link libcfree.a
and must be rebuilt after backend changes.)

rv64: fix sret/aggregate copy clobbering dest pointer

rv_copy_bytes used RV_TMP1 as both the per-granule transfer reg and (via
rv_resolve_mem_addr) for far-offset materialization, so when plan_ret's sret
path passed the destination pointer in RV_TMP1 the first loaded word
overwrote it -> store-to-data-address SIGSEGV (struct_return_large and
friends). Resolve dst (first, since it may be in RV_TMP1) and src into
dedicated pointer regs (RV_TMP3/RV_TMP0) and copy with advancing pointers at
offset 0, so the transfer reg never aliases a base and offsets never exceed
imm12. Same shape for set_bytes. Fixes the 32-byte sret return cases.

rv64: handle FP equality (CMP_EQ/CMP_NE on float operands)

CMP_EQ/CMP_NE are shared int/FP opcodes; FP == / != (and the x!=x, x!=0
lowerings behind isnan, bool-from-float, etc.) arrive with FP-class operands.
rv_cmp/rv_cmp_branch only treated CMP_LT_F..GE_F as FP, so EQ/NE on floats ran
the integer path, reading the FP register numbers as integer registers. Use
feq.{s,d} (+ xori for NE). Fixes isnan, fabs_inf, bool_from_float,
fp_unary_neg_struct_field.

rv64: reserve a0 for the sret pointer in call arg allocation

For an sret call the hidden destination pointer is the implicit first integer
arg (a0), so real args start at a1. rv_plan_call/rv_call_stack_size started
next_int at 0, placing the first real arg in a0 and then overwriting it with
the sret pointer — the callee read a garbage/NULL first param and faulted
(call_indirect_ret_struct_byval SIGSEGV). Start next_int at 1 when has_sret.

rv64: implement file-scope asm (assemble through the generic .s parser)

rv_file_scope_asm was a no-op, so top-level __asm__("...") data/symbol
definitions were dropped and references failed to link (asm_02_file_scope).
Run the source through asm_lex_open_mem + asm_parse like aa64.

Diffstat:
A doc/NATIVE_PORT_RV64.md  | 3647 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A doc/OPT_O0_NATIVE_DIRECT_NOTES.md  | 186 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A doc/OPT_O0_PERF_NOTES.md  | 168 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M include/cfree/config.h  | 2 +-
A scripts/toy_cross_batch.sh  | 113 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
D src/arch/rv64/alloc.c  | 589 -------------------------------------------------------------------------------
M src/arch/rv64/arch.c  | 32 ++++++++++++++++++++++++++------
M src/arch/rv64/asm.c  | 35 ++++++++++++++++++-----------------
M src/arch/rv64/asm.h  | 13 +++++++++++++
D src/arch/rv64/emit.c  | 631 -------------------------------------------------------------------------------
D src/arch/rv64/internal.h  | 189 -------------------------------------------------------------------------------
A src/arch/rv64/native.c  | 3458 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
D src/arch/rv64/ops.c  | 2699 -------------------------------------------------------------------------------
D src/arch/rv64/opt_coord.c  | 345 -------------------------------------------------------------------------------
M src/arch/rv64/rv64.h  | 11 ++++++++++-
M test/toy/run.sh  | 52 ++++++++++++++++++++++++++++++++++++++++++++--------

16 files changed, 7684 insertions(+), 4486 deletions(-)
diff --git a/doc/NATIVE_PORT_RV64.md b/doc/NATIVE_PORT_RV64.md
@@ -0,0 +1,3647 @@
+# RV64 NativeTarget Porting Reference
+
+(generated guide — cross-references aa64 native.c, the -O0 driver, rv64 legacy ISA/ABI)
+
+
+
+---
+
+# RV64 NativeTarget API Porting Guide — GROUP 1: Skeleton, Frame Model, and Function Lifecycle
+
+## Overview
+
+This guide specifies the rv64 NativeTarget implementation (src/arch/rv64/native.c) for the single-pass -O0 path (NativeDirectTarget). The reference implementation is AA64 (src/arch/aa64/native.c, ~4557 lines). The rv64 legacy code (src/arch/rv64/*.c) provides correct ISA/ABI logic but is written against the old CGTarget interface, not the NativeTarget API. Focus initially on the single-pass path; the known-frame (-O1) path is a separate optimization discussed at the end.
+
+---
+
+## (a) Includes and RV64-Specific Subclass Struct
+
+### Header Includes (model aa64/native.c lines 30–45)
+
+```c
+#include <string.h>
+
+#include "abi/abi.h"
+#include "arch/rv64/isa.h"          // ISA instruction encoders (rv_add, rv_addi, etc.)
+#include "arch/rv64/regs.h"          // Register name/index lookup
+#include "arch/rv64/rv64.h"          // Public rv64_native_target_new() declaration
+#include "asm/asm.h"
+#include "asm/asm_lex.h"
+#include "cg/native_direct_target.h"
+#include "cg/type.h"
+#include "core/arena.h"
+#include "core/bytes.h"
+#include "core/pool.h"
+#include "core/slice.h"
+#include "obj/obj.h"
+```
+
+The rv64/isa.h file contains instruction format helpers (rv_r, rv_i, rv_s, etc.) and named-register constants:
+- RV_S0 = 8 (frame pointer, s0/fp in psABI)
+- RV_RA = 1 (return address)
+- RV_SP = 2 (stack pointer)
+- RV_A0..A7 = 10..17 (argument registers)
+- RV_T0..T6 = 5, 6, 7, 28–31 (temporaries; use T0=5 as primary scratch)
+- RV_S2..S11 = 18..27 (callee-saved; allocatable by register allocator)
+- RV_FS0..FS11 = 40, 41, 50–59 (FP callee-saved)
+- RV_FA0..FA7 = 42–49 (FP argument registers; 32-based DWARF numbering)
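+
+As a rough illustration of what an I-type format helper computes, the sketch below packs the fields of `addi`, `nop` and `ret` by hand. The name `enc_i` and its parameter order are hypothetical; the real `rv_i` in isa.h may take its arguments differently. Only the bit layout (imm[11:0] at bit 20, rs1 at 15, funct3 at 12, rd at 7, opcode at 0) comes from the RISC-V base ISA.
+
+```c
+#include <assert.h>
+#include <stdint.h>
+#include <stdio.h>
+
+/* Hypothetical I-type encoder (not the isa.h helper). */
+static uint32_t enc_i(uint32_t opcode, uint32_t funct3, uint32_t rd,
+                      uint32_t rs1, int32_t imm12) {
+  return (((uint32_t)imm12 & 0xfffu) << 20) | (rs1 << 15) | (funct3 << 12) |
+         (rd << 7) | opcode;
+}
+
+int main(void) {
+  uint32_t addi_sp = enc_i(0x13u, 0u, 2u, 2u, -64); /* addi sp, sp, -64 */
+  uint32_t nop = enc_i(0x13u, 0u, 0u, 0u, 0);       /* addi x0, x0, 0   */
+  uint32_t ret = enc_i(0x67u, 0u, 0u, 1u, 0);       /* jalr x0, ra, 0   */
+  /* prints "fc010113 00000013 00008067" */
+  printf("%08x %08x %08x\n", (unsigned)addi_sp, (unsigned)nop, (unsigned)ret);
+  return 0;
+}
+```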
+
+---
+
+### RvNativeTarget Subclass Struct (model AANativeTarget at aa64/native.c lines 181–261)
+
+```c
+#define RV_PROLOGUE_WORDS 128u  // Worst-case placeholder size (single-pass)
+#define RV_TMP0 5u              // Scratch register (RV_T0 = x5)
+#define RV_TMP1 6u              // Secondary scratch (RV_T1 = x6)
+#define RV_FRAME_SAVE_SIZE 16u  // saved s0 (8B) + saved ra (8B)
+
+typedef struct RvNativeSlot {
+  u32 off;        // Bytes below s0 (positive); address = s0 - off
+  u32 size;
+  u32 align;
+  u8 kind;        // NativeFrameSlotKind
+  u8 pad[3];
+} RvNativeSlot;
+
+typedef struct RvCalleeSave {
+  NativeFrameSlot slot;
+  CfreeCgTypeId type;
+  u8 cls;         // NativeAllocClass (NATIVE_REG_INT or NATIVE_REG_FP)
+  Reg reg;
+} RvCalleeSave;
+
+#define RV_MAX_CALLEE_SAVES 20u // s2–s11 (10) + fs2–fs11 (10)
+
+typedef struct RvNativeTarget {
+  NativeTarget base;
+  SrcLoc loc;
+  const CGFuncDesc* func;
+
+  RvNativeSlot* slots;
+  u32 nslots;
+  u32 slots_cap;
+  u32 cum_off;          // Cumulative frame-slot bytes below s0 (not including saved pair)
+  u32 max_outgoing;     // Max outgoing-arg bytes across all calls
+
+  u32 incoming_stack_size;  // Callee's incoming stack args (for tail-call validation)
+  u32 next_param_int;   // 0–8: next a-register index for INT parts; 8+ = stack
+  u32 next_param_fp;    // 0–8: next fa-register index for FP parts; 8+ = stack
+  u32 next_param_stack; // 0-based byte offset for stack-passed params
+
+  NativeFrameSlot sret_ptr_slot;   // Hidden slot for sret pointer (a0 on entry)
+  NativeFrameSlot va_gp_slot;      // Variadic GP save area slot (if needed)
+
+  // Deferred patches (single-pass path only)
+  struct RvPatch {
+    u32 pos;            // Code offset in text section
+    u32 dst_reg;        // Destination register (for alloca patches)
+  }* patches;
+  u32 npatches;
+  u32 npatches_cap;
+
+  u32 func_start;       // Function start offset in text section
+  u32 prologue_pos;     // Text-section offset of the reserved prologue (start of NOP region)
+  MCLabel epilogue_label;
+
+  RvCalleeSave callee_saves[RV_MAX_CALLEE_SAVES];
+  u32 ncallee_saves;
+
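+  // Frame layout flags (single-pass: only known_frame and has_alloca used)
+  u8 known_frame;       // 0 on single-pass (NativeDirectTarget); 1 on known-frame
+  u8 has_alloca;        // Set if body contains dynamic alloca
+  u8 frame_final;       // Known-frame only: checked by rv_frame_slot, which panics once set
+  u32 nwords_emitted;   // Known-frame only: prologue length in words (read by rv_func_end for CFI)
+} RvNativeTarget;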
+
+static inline RvNativeTarget* rv_of(NativeTarget* t) { return (RvNativeTarget*)t; }
+```
+
+**Key differences from AA64:**
+1. **Frame anchor:** RV64 uses s0 (x8) as the frame pointer, which anchors the saved s0/ra pair at [s0+0]/[s0+8]. Offsets are **bytes below s0** (positive values; address = s0 - off).
+2. **No bottom-record layout:** AA64 has fp_at_bottom for small frames; RV64's simpler ABI (16-byte-aligned stack, no outgoing-area subtlety) uses a single **top-record** layout where the prologue sets s0 to sp + fp_pair_off (= frame_size - 16 - variadic_save_sz).
+3. **Variadic:** If the function is variadic, a 64-byte GP save area sits immediately above the saved pair (at [s0 + 16]) so that it is contiguous with the incoming stack args and va_list can walk straight through. This area is implicit in the frame layout and reserved in the prologue (not via a frame_slot).
+4. **Scratch:** RV_TMP0 (x5 = t0) and RV_TMP1 (x6 = t1) are the primary temporaries for immediate materialization; they are not allocable.
+
+---
+
+## (b) Frame Layout and Helper Functions
+
+### Frame Layout Math (model aa64's aa_build_layout at aa64/native.c lines 121–128 + frame offset helpers lines 275–293)
+
+RV64 uses a single, simpler layout:
+
+```
+  high addr   caller's stack frame
+              +------------------------------+
+              | incoming stack args          |  s0-relative: s0 + 16 + byte_off
+              +------------------------------+
+  s0  -->     | saved s0                     |  s0-relative: 0
+              | saved ra                     |  s0-relative: 8
+              +------------------------------+
+              | frame slots                  |  s0-relative: -(off) where off = cum_off
+              |  (callee-saves + locals      |
+              |   + spills + sret/va)        |
+              +------------------------------+
+              | outgoing args                |  sp-relative: byte_off
+  sp  -->     +------------------------------+
+  low addr                                       CFA = sp + frame_size = s0 + 16 + va_save_sz
+              frame_size = align16(16 + cum_off + max_outgoing + va_save_sz)
+              where va_save_sz = is_variadic ? 64 : 0
+              fp_pair_off = frame_size - 16 - va_save_sz (where saved pair sits in sp frame)
+```
+
+**Inline helpers:**
+
+```c
+#define RV_FRAME_SAVE_SIZE 16u
+
+// s0-relative offset of saved s0 (or ra at +8)
+static inline i32 rv_s0_off_saved_s0(void) { return 0; }
+static inline i32 rv_s0_off_saved_ra(void) { return 8; }
+
+// s0-relative offset of incoming stack arg at byte_off (0-based caller offset)
+// Incoming stack args sit at s0 + 16 [+ 64 for variadic] + byte_off
+static inline i32 rv_s0_off_in_arg(const RvNativeTarget* a, u32 byte_off) {
+  // RvNativeTarget has no is_variadic field; derive it from the function's ABI info
+  u8 is_variadic = a->func->abi && a->func->abi->variadic;
+  u32 base = is_variadic ? 16u + 64u : 16u;
+  return (i32)(base + byte_off);
+}
+
+// s0-relative offset of a frame slot (off = cum_off value from its RvNativeSlot)
+// Slots stack downward from the saved pair: address = s0 - off
+static inline i32 rv_s0_off_slot(u32 slot_off) {
+  return -(i32)slot_off;
+}
+
+// CFA = s0 + (frame_size - fp_pair_off) = s0 + 16 + va_save_sz (absolute offset)
+static inline i32 rv_cfa_off(u32 frame_size, u32 fp_pair_off) {
+  return (i32)(frame_size - fp_pair_off);
+}
+
+// Frame size calculation (called once per function at prologue patch time)
+static inline u32 rv_frame_size(u32 cum_off, u32 max_outgoing, u8 is_variadic) {
+  u32 va_sz = is_variadic ? 64u : 0u;
+  u32 raw = RV_FRAME_SAVE_SIZE + cum_off + max_outgoing + va_sz;
+  return (raw + 15u) & ~15u;  // align to 16 bytes
+}
+
+// fp_pair_off: where the saved s0/ra pair sits within the frame (sp-relative)
+static inline u32 rv_fp_pair_off(u32 frame_size, u8 is_variadic) {
+  return frame_size - RV_FRAME_SAVE_SIZE - (is_variadic ? 64u : 0u);
+}
+```
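+
+A worked instance of the layout arithmetic may help when checking offsets by hand. The sketch below restates the three formulas with plain `stdint.h` types (the function names are illustrative stand-ins, not the helpers in native.c) and evaluates them for one variadic and one non-variadic frame.
+
+```c
+#include <assert.h>
+#include <stdint.h>
+#include <stdio.h>
+
+/* frame_size = align16(16 + cum_off + max_outgoing + va_save_sz) */
+static uint32_t frame_size(uint32_t cum_off, uint32_t max_outgoing, int is_variadic) {
+  uint32_t va_sz = is_variadic ? 64u : 0u;
+  uint32_t raw = 16u + cum_off + max_outgoing + va_sz;
+  return (raw + 15u) & ~15u;
+}
+
+/* sp-relative position of the saved s0/ra pair */
+static uint32_t fp_pair_off(uint32_t fsz, int is_variadic) {
+  return fsz - 16u - (is_variadic ? 64u : 0u);
+}
+
+/* distance from s0 up to the CFA (= sp + frame_size) */
+static int32_t cfa_off(uint32_t fsz, uint32_t pair_off) {
+  return (int32_t)(fsz - pair_off);
+}
+
+int main(void) {
+  /* Variadic: 40 bytes of slots, 8 bytes of outgoing args. */
+  uint32_t fsz = frame_size(40u, 8u, 1);   /* 16+40+8+64 = 128 */
+  uint32_t pair = fp_pair_off(fsz, 1);     /* 128-16-64  = 48  */
+  printf("variadic: frame=%u pair=%u cfa=s0+%d\n",
+         (unsigned)fsz, (unsigned)pair, (int)cfa_off(fsz, pair));
+
+  /* Non-variadic: 24 bytes of slots, no outgoing args. */
+  fsz = frame_size(24u, 0u, 0);            /* 40 rounds up to 48 */
+  pair = fp_pair_off(fsz, 0);              /* 48-16 = 32 */
+  printf("plain:    frame=%u pair=%u cfa=s0+%d\n",
+         (unsigned)fsz, (unsigned)pair, (int)cfa_off(fsz, pair));
+  return 0;
+}
+```
+
+In both cases the CFA distance is 16 plus the variadic save size (80 and 16 here), independent of how many slots the body reserved.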
+
+---
+
+## (c) rv64_native_target_new() Entry (model aa64_native_target_new at aa64/native.c lines 3540–3609)
+
+```c
+NativeTarget* rv64_native_target_new(Compiler* c, ObjBuilder* obj, MCEmitter* mc) {
+  RvNativeTarget* a = arena_znew(c->tu, RvNativeTarget);
+  NativeTarget* t;
+  if (!a) return NULL;
+  t = &a->base;
+  t->c = c;
+  t->obj = obj;
+  t->mc = mc;
+  t->regs = &rv_reg_info;  // Defined in rv64/regs.c (NativeRegInfo with allocable/scratch/phys)
+
+  // Semantic-decision hooks (class, immediates, addressing)
+  t->class_for_type = rv_class_for_type;
+  t->imm_legal = rv_imm_legal;
+  t->addr_legal = rv_addr_legal;
+
+  // Function lifecycle
+  t->func_begin = rv_func_begin;
+  t->func_begin_known_frame = rv_func_begin_known_frame;  // For -O1 path
+  t->note_frame_state = rv_note_frame_state;             // Optional, for patching
+  t->reserve_callee_saves = NULL;  // OR rv_reserve_callee_saves if needed
+  t->signature_stack_bytes = rv_signature_stack_bytes;
+  t->call_stack_bytes = rv_call_stack_bytes;
+  t->has_store_zero_reg = 0;  // x0 always reads as zero, but store-from-x0 is left disabled until tested
+  t->store_zero_reg = 0;
+  t->func_end = rv_func_end;
+
+  // Frame slot and parameter binding
+  t->frame_slot = rv_frame_slot;
+  t->frame_slot_debug_loc = NULL;  // Optional; use if debugger needs per-slot dwarf locs
+  t->bind_param = rv_bind_native_param;
+
+  // Control flow
+  t->label_new = rv_label_new;
+  t->label_place = rv_label_place;
+  t->jump = rv_jump;
+  t->cmp_branch = rv_cmp_branch;
+  t->indirect_branch = rv_indirect_branch;
+  t->load_label_addr = rv_load_label_addr;
+
+  // Instruction emission (scalars, memory, shifts, calls, etc.)
+  t->move = rv_move;
+  t->load_imm = rv_load_imm;
+  t->load_const = rv_load_const;
+  t->load_addr = rv_load_addr;
+  t->load = rv_load;
+  t->store = rv_store;
+  t->tls_addr_of = rv_tls_addr_of;
+  t->copy_bytes = rv_copy_bytes;
+  t->set_bytes = rv_set_bytes;
+  t->bitfield_load = rv_bitfield_load;
+  t->bitfield_store = rv_bitfield_store;
+  t->binop = rv_binop;
+  t->unop = rv_unop;
+  t->cmp = rv_cmp;
+  t->convert = rv_convert;
+  t->alloca_ = rv_alloca;
+
+  // Spill/reload
+  t->spill = rv_spill;
+  t->reload = rv_reload;
+
+  // Calls and returns
+  t->plan_call = rv_plan_call;
+  t->emit_call = rv_emit_call;
+  t->plan_ret = rv_plan_ret;
+  t->ret = rv_ret;
+
+  // Atomics
+  t->atomic_load = rv_atomic_load;
+  t->atomic_store = rv_atomic_store;
+  t->atomic_rmw = rv_atomic_rmw;
+  t->atomic_cas = rv_atomic_cas;
+  t->fence = rv_fence;
+
+  // Variadic
+  t->va_start_ = rv_va_start_;
+  t->va_arg_ = rv_va_arg_;
+  t->va_end_ = rv_va_end_;
+  t->va_copy_ = rv_va_copy_;
+
+  // Inline/file-scope asm and misc
+  t->intrinsic = rv_intrinsic;
+  t->asm_block = rv_asm_block;
+  t->file_scope_asm = rv_file_scope_asm;
+  t->trap = rv_trap;
+  t->set_loc = rv_set_loc;
+  t->finalize = rv_finalize;
+
+  return t;
+}
+```
+
+---
+
+## (d) func_begin, func_begin_known_frame, func_end, and Prologue/Epilogue Emission
+
+### func_begin_common (Single-Pass Setup)
+
+```c
+static void rv_func_begin_common(NativeTarget* t, const CGFuncDesc* fd) {
+  RvNativeTarget* a = rv_of(t);
+  MCEmitter* mc = t->mc;
+
+  a->func = fd;
+  a->nslots = 0;
+  a->cum_off = 0;
+  a->max_outgoing = 0;
+  a->incoming_stack_size = 0;
+  a->next_param_int = 0;
+  a->next_param_fp = 0;
+  a->next_param_stack = 0;
+  a->sret_ptr_slot = NATIVE_FRAME_SLOT_NONE;
+  a->va_gp_slot = NATIVE_FRAME_SLOT_NONE;
+  a->npatches = 0;
+  a->ncallee_saves = 0;
+  a->known_frame = 0;
+  a->has_alloca = 0;
+
+  mc->set_section(mc, fd->text_section_id);
+  mc->emit_align(mc, 4, 0);
+  a->func_start = mc->pos(mc);
+  mc_begin_function(mc, fd->sym, fd->text_section_id, a->func_start);
+  if (mc->cfi_startproc) mc->cfi_startproc(mc);
+
+  a->prologue_pos = mc->pos(mc);
+  a->epilogue_label = mc->label_new(mc);
+}
+```
+
+### func_begin (Single-Pass, NativeDirectTarget Path)
+
+```c
+// Model: aa64_func_begin (aa64/native.c lines 1089–1095)
+static void rv_func_begin(NativeTarget* t, const CGFuncDesc* fd) {
+  RvNativeTarget* a = rv_of(t);
+  MCEmitter* mc = t->mc;
+
+  rv_func_begin_common(t, fd);
+
+  // Reserve a worst-case prologue region (RV_PROLOGUE_WORDS NOPs).
+  // The exact size is unknown until all frame-slot and call-site max_outgoing
+  // data is gathered; func_end patches it once the frame is final.
+  for (u32 i = 0; i < RV_PROLOGUE_WORDS; ++i) {
+    rv64_emit32(mc, RV_NOP);
+  }
+
+  // Reserve entry-save slots (sret pointer; the variadic GP save area is implicit).
+  // Reserving them now means spill/reload addresses are known immediately.
+  rv_reserve_entry_saves(a);
+}
+```
+
+### rv_reserve_entry_saves and rv_emit_entry_save_stores
+
+```c
+// Model: aa64's aa_reserve_entry_saves / aa_emit_entry_save_stores
+// (aa64/native.c lines 1101–1142)
+
+static void rv_reserve_entry_saves(RvNativeTarget* a) {
+  NativeTarget* t = &a->base;
+  const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, a->func->fn_type);
+
+  // sret: hidden slot for incoming a0 (destination pointer for struct return)
+  if (abi && abi->has_sret) {
+    NativeFrameSlotDesc sd;
+    memset(&sd, 0, sizeof sd);
+    sd.type = builtin_id(CFREE_CG_BUILTIN_I64);
+    sd.size = 8;
+    sd.align = 8;
+    sd.kind = NATIVE_FRAME_SLOT_SAVE;
+    a->sret_ptr_slot = t->frame_slot(t, &sd);
+  }
+
+  // Variadic: the GP save area (64 bytes) is implicit at [s0 + 16]; it has no explicit slot.
+  // (The prologue spills the unconsumed a-registers there automatically.)
+}
+
+static void rv_emit_entry_save_stores(RvNativeTarget* a) {
+  NativeTarget* t = &a->base;
+  const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, a->func->fn_type);
+
+  // Spill a0 (sret pointer) into the hidden slot
+  if (abi && abi->has_sret && a->sret_ptr_slot != NATIVE_FRAME_SLOT_NONE) {
+    NativeAddr addr;
+    NativeLoc src;
+    MemAccess mem;
+    CfreeCgTypeId i64 = builtin_id(CFREE_CG_BUILTIN_I64);
+
+    memset(&addr, 0, sizeof addr);
+    addr.base_kind = NATIVE_ADDR_BASE_FRAME;
+    addr.base.frame = a->sret_ptr_slot;
+    addr.base_type = i64;
+
+    memset(&src, 0, sizeof src);
+    src.kind = NATIVE_LOC_REG;
+    src.cls = NATIVE_REG_INT;
+    src.type = i64;
+    src.v.reg = RV_A0;  // Incoming a0
+
+    memset(&mem, 0, sizeof mem);
+    mem.type = i64;
+    mem.size = 8;
+    mem.align = 8;
+
+    rv_emit_mem(a, 0, src, addr, mem);  // Store (0 = write)
+  }
+  // Variadic save spills happen in the prologue itself (auto via rv_build_prologue).
+}
+```
+
+### func_end (Single-Pass Prologue Patching)
+
+```c
+// Model: aa64_func_end (aa64/native.c lines 1493–1543)
+// For single-pass (known_frame=0): patch the reserved prologue region.
+
+static void rv_func_end(NativeTarget* t) {
+  RvNativeTarget* a = rv_of(t);
+  MCEmitter* mc = t->mc;
+  ObjBuilder* obj = t->obj;
+
+  // Compute final frame size now that max_outgoing and callee-saves are known
+  u32 n_int_saves = 0, n_fp_saves = 0;
+  u32 int_regs[10], fp_regs[10];  // Filled by rv_collect_callee_saves below
+
+  if (!a->known_frame) {
+    // Single-pass: collect the callee-saves actually used from allocator state
+    // (filled in by reserve_callee_saves or tracked during body emission)
+    n_int_saves = rv_collect_callee_saves(a, int_regs, fp_regs, &n_fp_saves);
+  }
+
+  u32 frame_size = rv_frame_size(a->cum_off, a->max_outgoing, a->func->abi && a->func->abi->variadic);
+  u32 fp_pair_off = rv_fp_pair_off(frame_size, a->func->abi && a->func->abi->variadic);
+
+  // Place epilogue label
+  mc->label_place(mc, a->epilogue_label);
+
+  // Emit epilogue: restore callee-saves and frame, then ret
+  rv_emit_callee_restores(a, int_regs, n_int_saves, fp_regs, n_fp_saves);
+  rv_emit_restore_frame(a, frame_size, fp_pair_off);
+  rv64_emit32(mc, rv_jalr(RV_ZERO, RV_RA, 0));  // ret (jalr x0, x1, 0)
+
+  if (!a->known_frame) {
+    // Single-pass: patch the prologue region with actual instructions
+    u32 words[RV_PROLOGUE_WORDS];
+    u32 nwords = rv_build_prologue(t, words, RV_PROLOGUE_WORDS,
+                                   frame_size, fp_pair_off,
+                                   a->cum_off, a->max_outgoing,
+                                   int_regs, n_int_saves, fp_regs, n_fp_saves,
+                                   a->func->abi && a->func->abi->has_sret,
+                                   a->func->abi && a->func->abi->variadic);
+    rv64_patch_region(obj, a->func->text_section_id, a->prologue_pos, words, nwords);
+  }
+
+  // CFI frame information
+  if (mc->cfi_set_next_pc_offset && mc->cfi_def_cfa && mc->cfi_offset) {
+    u32 post_prologue = a->prologue_pos + (a->known_frame ? a->nwords_emitted * 4 : RV_PROLOGUE_WORDS * 4);
+    i32 cfa = rv_cfa_off(frame_size, fp_pair_off);
+    mc->cfi_set_next_pc_offset(mc, post_prologue);
+    mc->cfi_def_cfa(mc, RV_S0, cfa);              // CFA = s0 + cfa_dist
+    mc->cfi_offset(mc, RV_S0, -cfa);              // saved s0 at CFA - cfa
+    mc->cfi_offset(mc, RV_RA, -cfa + 8);          // saved ra at CFA - cfa + 8
+    for (u32 i = 0; i < n_int_saves; ++i) {
+      i32 slot_off = -(i32)(a->cum_off + 8u + i * 8u);  // s0-relative
+      i32 cfa_off = slot_off - cfa;
+      mc->cfi_offset(mc, int_regs[i], cfa_off);
+    }
+    for (u32 i = 0; i < n_fp_saves; ++i) {
+      i32 slot_off = -(i32)(a->cum_off + 8u + n_int_saves * 8u + i * 8u);
+      i32 cfa_off = slot_off - cfa;
+      mc->cfi_offset(mc, 32u + fp_regs[i], cfa_off);  // DWARF: fp regs 32–63
+    }
+  }
+
+  obj_symbol_define(obj, a->func->sym, a->func->text_section_id,
+                    a->func_start, mc->pos(mc) - a->func_start);
+  if (a->func->atomize) {
+    obj_atom_define(obj, a->func->text_section_id, a->func_start,
+                    mc->pos(mc) - a->func_start, a->func->sym, 0);
+  }
+  if (mc->debug)
+    debug_func_pc_range(mc->debug, a->func->text_section_id, a->func_start, mc->pos(mc));
+  if (mc->cfi_endproc) mc->cfi_endproc(mc);
+
+  mc_end_function(mc);
+  a->func = NULL;
+}
+```
+
+### rv_build_prologue (Prologue Word Array)
+
+**Pseudo-C sketch** (full ISA details in rv64/isa.h):
+
+```c
+// Model: legacy rv64/emit.c rv_build_prologue (lines 338–416)
+static u32 rv_build_prologue(NativeTarget* t, u32* words, u32 cap,
+                             u32 frame_size, u32 fp_pair_off,
+                             u32 cum_off, u32 max_outgoing,
+                             const u32* int_regs, u32 n_int_saves,
+                             const u32* fp_regs, u32 n_fp_saves,
+                             u8 has_sret, u8 is_variadic) {
+  u32 wi = 0;
+
+  // 1. Adjust sp: sp -= frame_size
+  //    Encoding: ADDI sp, sp, -frame_size (or multi-instruction if imm > 12 bits)
+  if (fits_i12(-(i32)frame_size)) {
+    if (wi >= cap) goto overflow;
+    words[wi++] = rv_addi(RV_SP, RV_SP, -(i32)frame_size);
+  } else {
+    // Use t0 as scratch for large immediates
+    if (wi >= cap) goto overflow;
+    i32 hi = (i32)(((i64)(-(i32)frame_size) + 0x800) >> 12);
+    i32 lo = (i32)((u32)(-(i32)frame_size) - ((u32)hi << 12));  // shift unsigned: hi is negative here
+    words[wi++] = rv_lui(RV_TMP0, (u32)hi & 0xfffffu);          // LUI immediate is 20 bits
+    if (lo) {
+      if (wi >= cap) goto overflow;
+      words[wi++] = rv_addiw(RV_TMP0, RV_TMP0, lo);
+    }
+    if (wi >= cap) goto overflow;
+    words[wi++] = rv_add(RV_SP, RV_SP, RV_TMP0);
+  }
+
+  // 2. Save s0 and ra at [sp + fp_pair_off]
+  if (fits_i12((i32)fp_pair_off)) {
+    if (wi + 2 > cap) goto overflow;
+    words[wi++] = rv_sd(RV_S0, RV_SP, (i32)fp_pair_off);
+    words[wi++] = rv_sd(RV_RA, RV_SP, (i32)fp_pair_off + 8);
+  } else {
+    // Use t0 to compute address
+    if (wi >= cap) goto overflow;
+    i32 hi = (i32)(((i64)fp_pair_off + 0x800) >> 12);
+    i32 lo = (i32)(fp_pair_off - ((u32)hi << 12));      // shift unsigned to avoid signed-overflow UB
+    words[wi++] = rv_lui(RV_TMP0, (u32)hi & 0xfffffu);  // LUI immediate is 20 bits
+    if (lo) {
+      if (wi >= cap) goto overflow;
+      words[wi++] = rv_addiw(RV_TMP0, RV_TMP0, lo);
+    }
+    if (wi >= cap) goto overflow;
+    words[wi++] = rv_add(RV_TMP0, RV_SP, RV_TMP0);
+    if (wi + 2 > cap) goto overflow;
+    words[wi++] = rv_sd(RV_S0, RV_TMP0, 0);
+    words[wi++] = rv_sd(RV_RA, RV_TMP0, 8);
+  }
+
+  // 3. Set s0 = sp + fp_pair_off
+  if (fits_i12((i32)fp_pair_off)) {
+    if (wi >= cap) goto overflow;
+    words[wi++] = rv_addi(RV_S0, RV_SP, (i32)fp_pair_off);
+  } else {
+    // Already in t0 from step 2
+    if (wi >= cap) goto overflow;
+    words[wi++] = rv_addi(RV_S0, RV_TMP0, 0);
+  }
+
+  // 4. If sret: spill a0 into hidden slot
+  if (has_sret) {
+    // (Assume sret_ptr_slot.off is known from frame_slot calls)
+    // For now, emit stores via rv_store_int_s0 helper
+    // words[wi++] = rv_sd(RV_A0, RV_S0, -(i32)sret_slot_off);
+  }
+
+  // 5. If variadic: spill unconsumed a-regs into save area at [s0 + 16]
+  if (is_variadic) {
+    u32 first_var = /* computed from fixed param count */;
+    for (u32 i = first_var; i < 8; ++i) {
+      if (wi >= cap) goto overflow;
+      words[wi++] = rv_sd(RV_A0 + i, RV_S0, 16 + (i32)(i * 8));
+    }
+  }
+
+  // 6. Save callee-saved integer registers (s2–s11)
+  for (u32 i = 0; i < n_int_saves; ++i) {
+    u32 r = int_regs[i];
+    i32 off = -(i32)(cum_off + 8u * (i + 1u));  // s0-relative
+    if (fits_i12(off)) {
+      if (wi >= cap) goto overflow;
+      words[wi++] = rv_sd(r, RV_S0, off);
+    } else {
+      // Use t0 for far offset
+      if (wi >= cap) goto overflow;
+      i32 hi = (i32)(((i64)off + 0x800) >> 12);
+      i32 lo = (i32)((u32)off - ((u32)hi << 12));         // shift unsigned: hi is negative here
+      words[wi++] = rv_lui(RV_TMP0, (u32)hi & 0xfffffu);  // LUI immediate is 20 bits
+      if (lo) {
+        if (wi >= cap) goto overflow;
+        words[wi++] = rv_addiw(RV_TMP0, RV_TMP0, lo);
+      }
+      if (wi >= cap) goto overflow;
+      words[wi++] = rv_add(RV_TMP0, RV_S0, RV_TMP0);
+      if (wi >= cap) goto overflow;
+      words[wi++] = rv_sd(r, RV_TMP0, 0);
+    }
+  }
+
+  // 7. Save callee-saved FP registers (fs2–fs11)
+  for (u32 i = 0; i < n_fp_saves; ++i) {
+    u32 r = fp_regs[i];
+    i32 off = -(i32)(cum_off + 8u * (n_int_saves + i + 1u));
+    // Similar store logic as int saves
+    if (wi >= cap) goto overflow;
+    words[wi++] = rv_fsd(r, RV_S0, off);
+  }
+
+  return wi;
+
+overflow:
+  compiler_panic(t->c, rv_of(t)->loc, "rv64: prologue overflow (cap %u)", cap);
+  return 0;
+}
+```
+
+**Helper functions for prologue emission:**
+
+```c
+static inline int fits_i12(i32 imm) {
+  return imm >= -2048 && imm <= 2047;
+}
+
+static void rv64_patch_region(ObjBuilder* obj, u32 sec_id, u32 ofs,
+                              const u32* words, u32 nwords) {
+  for (u32 i = 0; i < nwords; ++i) {
+    u8 b[4];
+    u32 word = words[i];
+    b[0] = (u8)(word & 0xff);
+    b[1] = (u8)((word >> 8) & 0xff);
+    b[2] = (u8)((word >> 16) & 0xff);
+    b[3] = (u8)((word >> 24) & 0xff);
+    obj_patch(obj, sec_id, ofs + i * 4, b, 4);
+  }
+}
+
+static void rv_emit_callee_restores(RvNativeTarget* a,
+                                     const u32* int_regs, u32 n_int_saves,
+                                     const u32* fp_regs, u32 n_fp_saves) {
+  NativeTarget* t = &a->base;
+  MCEmitter* mc = t->mc;
+  // Restore in reverse order of saves
+  for (i32 i = (i32)n_int_saves - 1; i >= 0; --i) {
+    i32 off = -(i32)(a->cum_off + 8u * (i + 1u));
+    rv64_emit32(mc, rv_ld(int_regs[i], RV_S0, off));
+  }
+  for (i32 i = (i32)n_fp_saves - 1; i >= 0; --i) {
+    i32 off = -(i32)(a->cum_off + 8u * (n_int_saves + i + 1u));
+    rv64_emit32(mc, rv_fld(fp_regs[i], RV_S0, off));
+  }
+}
+
+static void rv_emit_restore_frame(RvNativeTarget* a, u32 frame_size, u32 fp_pair_off) {
+  NativeTarget* t = &a->base;
+  MCEmitter* mc = t->mc;
+  // Load s0, ra from [sp + fp_pair_off]
+  if (fits_i12((i32)fp_pair_off)) {
+    rv64_emit32(mc, rv_ld(RV_S0, RV_SP, (i32)fp_pair_off));
+    rv64_emit32(mc, rv_ld(RV_RA, RV_SP, (i32)fp_pair_off + 8));
+  } else {
+    rv64_emit32(mc, rv_lui(RV_TMP0, (u32)(((i64)fp_pair_off + 0x800) >> 12) & 0xffffu));
+    if ((i32)fp_pair_off - ((i32)(((i64)fp_pair_off + 0x800) >> 12) << 12)) {
+      rv64_emit32(mc, rv_addiw(RV_TMP0, RV_TMP0, 
+                               (i32)fp_pair_off - ((i32)(((i64)fp_pair_off + 0x800) >> 12) << 12)));
+    }
+    rv64_emit32(mc, rv_add(RV_TMP0, RV_SP, RV_TMP0));
+    rv64_emit32(mc, rv_ld(RV_S0, RV_TMP0, 0));
+    rv64_emit32(mc, rv_ld(RV_RA, RV_TMP0, 8));
+  }
+  // Adjust sp: sp += frame_size (inverse of prologue)
+  if (fits_i12((i32)frame_size)) {
+    rv64_emit32(mc, rv_addi(RV_SP, RV_SP, (i32)frame_size));
+  } else {
+    rv64_emit32(mc, rv_lui(RV_TMP0, (u32)(((i64)frame_size + 0x800) >> 12) & 0xfffffu));  // LUI imm is 20 bits
+    if ((i32)frame_size - ((i32)(((i64)frame_size + 0x800) >> 12) << 12)) {
+      rv64_emit32(mc, rv_addiw(RV_TMP0, RV_TMP0, 
+                               (i32)frame_size - ((i32)(((i64)frame_size + 0x800) >> 12) << 12)));
+    }
+    rv64_emit32(mc, rv_add(RV_SP, RV_SP, RV_TMP0));
+  }
+}
+```
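+
+The large-offset branches above rely on the usual LUI/ADDI(W) split: adding 0x800 before the shift rounds the upper part so that the remainder fits a signed 12-bit immediate. The sketch below isolates that arithmetic. `RvHiLo` and `rv_split_hi_lo` are hypothetical names used only for illustration (they are not part of the backend), and the sketch assumes a non-negative offset below 0x7ffff800.
+
+```c
+#include <stdint.h>
+
+/* Hypothetical helper: upper 20 bits for LUI, signed low 12 bits for ADDI/ADDIW. */
+typedef struct {
+  uint32_t hi20;
+  int32_t lo12;
+} RvHiLo;
+
+static RvHiLo rv_split_hi_lo(uint32_t off) {
+  RvHiLo r;
+  /* Round to the nearest multiple of 4096 so lo12 lands in [-2048, 2047]. */
+  int64_t hi = ((int64_t)off + 0x800) >> 12;
+  r.hi20 = (uint32_t)hi & 0xfffffu;
+  r.lo12 = (int32_t)((int64_t)off - hi * 4096);
+  return r;
+}
+```
+
+An offset of 2048, for instance, splits into `hi20 = 1` and `lo12 = -2048`, because 4096 - 2048 = 2048.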
+
+---
+
+## (e) frame_slot, reserve_callee_saves, note_frame_state, signature_stack_bytes, call_stack_bytes
+
+### frame_slot (model aa64/native.c lines 1545–1567)
+
+```c
+static NativeFrameSlot rv_frame_slot(NativeTarget* t,
+                                     const NativeFrameSlotDesc* d) {
+  RvNativeTarget* a = rv_of(t);
+  RvNativeSlot* s;
+  u32 size = d->size ? d->size : 8u;
+  u32 align = d->align ? d->align : 1u;
+
+  // Panic on known-frame path if frame is already finalized
+  if (a->frame_final)
+    compiler_panic(a->base.c, a->loc, "rv64: frame slot requested after prologue");
+
+  // Grow slots array if needed
+  if (a->nslots == a->slots_cap) {
+    u32 cap = a->slots_cap ? a->slots_cap * 2u : 16u;
+    RvNativeSlot* nb = arena_zarray(t->c->tu, RvNativeSlot, cap);
+    if (a->slots) memcpy(nb, a->slots, sizeof(*nb) * a->nslots);
+    a->slots = nb;
+    a->slots_cap = cap;
+  }
+
+  // Allocate: bump cum_off by size, then round up to the slot's alignment.
+  // The slot occupies [s0 - cum_off, s0 - cum_off + size).
+  a->cum_off = align_up_u32(a->cum_off + size, align);
+  s = &a->slots[a->nslots++];
+  s->off = a->cum_off;       // Bytes below s0; the slot address is s0 - off
+  s->size = size;
+  s->align = align;
+  s->kind = d->kind;
+
+  return (NativeFrameSlot)a->nslots;  // 1-based slot ID
+}
+```
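+
+To make the offset arithmetic concrete, here is a minimal standalone model of the bump allocation in `rv_frame_slot`. `sketch_align_up` stands in for `align_up_u32` and assumes a power-of-two alignment; both `sketch_` names are hypothetical and exist only for this illustration.
+
+```c
+#include <stdint.h>
+
+/* Hypothetical stand-in for align_up_u32; align must be a power of two. */
+static uint32_t sketch_align_up(uint32_t v, uint32_t align) {
+  return (v + align - 1u) & ~(align - 1u);
+}
+
+/* Reserve one slot below s0 and return its offset in bytes below s0. */
+static uint32_t sketch_alloc_slot(uint32_t* cum_off, uint32_t size,
+                                  uint32_t align) {
+  *cum_off = sketch_align_up(*cum_off + size, align);
+  return *cum_off;
+}
+```
+
+Starting from an empty frame, requesting (size 4, align 4), then (8, 8), then (1, 1) yields offsets 4, 16, and 17: the second slot skips four bytes of padding so that `s0 - 16` stays 8-byte aligned.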
+
+### reserve_callee_saves (Optional, model aa64/native.c lines 1230–1286)
+
+If implemented (not required on single-pass path):
+
+```c
+static void rv_reserve_callee_saves(NativeTarget* t, const u32* used_by_class,
+                                    u32 nclasses) {
+  RvNativeTarget* a = rv_of(t);
+  // For each class, walk the callee-saved registers (s2–s11 for INT,
+  // fs2–fs11 for FP) and reserve frame slots for those the allocator used.
+  // Each mask is indexed by register number within its class (bits 0–31),
+  // so both ranges span bits 18–27; shifting by a DWARF number such as 50
+  // would be undefined behavior on a 32-bit mask.
+  for (u32 cls = 0; cls < nclasses; ++cls) {
+    u32 mask = used_by_class[cls];
+    if (mask == 0) continue;
+    u32 first = (cls == NATIVE_REG_INT) ? RV_S2 : RV_FS2;
+    u32 last = (cls == NATIVE_REG_INT) ? RV_S11 : RV_FS11;
+    for (u32 reg = first; reg <= last; ++reg) {
+      if (!(mask & (1u << reg))) continue;
+      NativeFrameSlotDesc sd;
+      memset(&sd, 0, sizeof sd);
+      sd.type = (cls == NATIVE_REG_INT) ? builtin_id(CFREE_CG_BUILTIN_I64)
+                                        : builtin_id(CFREE_CG_BUILTIN_F64);
+      sd.size = 8;
+      sd.align = 8;
+      sd.kind = NATIVE_FRAME_SLOT_SAVE;
+      NativeFrameSlot slot = t->frame_slot(t, &sd);
+      a->callee_saves[a->ncallee_saves].slot = slot;
+      a->callee_saves[a->ncallee_saves].reg = reg;
+      a->callee_saves[a->ncallee_saves].cls = cls;
+      a->ncallee_saves++;
+    }
+  }
+}
+```
+
+### note_frame_state (Optional, for deferred patching)
+
+```c
+static void rv_note_frame_state(NativeTarget* t,
+                                 const NativeFramePatchState* state) {
+  RvNativeTarget* a = rv_of(t);
+  if (state->max_outgoing > a->max_outgoing)
+    a->max_outgoing = state->max_outgoing;
+}
+```
+
+### signature_stack_bytes (model aa64/native.c lines 1350–1370)
+
+Query the incoming stack-argument bytes for a function signature. Used for tail-call validation.
+
+```c
+static u32 rv_signature_stack_bytes(NativeTarget* t, CfreeCgTypeId fn_type,
+                                    int* variadic, u32* nparams) {
+  const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, fn_type);
+  if (!abi) {
+    if (variadic) *variadic = 0;
+    if (nparams) *nparams = 0;
+    return 0;
+  }
+  if (variadic) *variadic = abi->variadic;
+  if (nparams) *nparams = abi->nparams;
+
+  // Per-class register cursors, mirroring next_param_int / next_param_fp in
+  // bind_param: the first eight parts of each class travel in a0-a7 / fa0-fa7.
+  u32 next_int = 0, next_fp = 0, stack_bytes = 0;
+  for (u32 i = 0; i < abi->nparams; ++i) {
+    const ABIArgInfo* ai = &abi->params[i];
+    if (ai->kind == ABI_ARG_IGNORE) continue;
+    if (ai->kind == ABI_ARG_INDIRECT) {
+      // Indirect arg: the pointer takes one a-register or one 8-byte stack slot
+      if (next_int < 8u) next_int++;
+      else stack_bytes += 8u;
+      continue;
+    }
+    // Direct parts: each part takes a register of its class or an 8-byte slot
+    for (u16 j = 0; j < ai->nparts; ++j) {
+      const ABIArgPart* pt = &ai->parts[j];
+      if (pt->cls == ABI_CLASS_FP && next_fp < 8u) next_fp++;
+      else if (pt->cls != ABI_CLASS_FP && next_int < 8u) next_int++;
+      else stack_bytes += 8u;
+    }
+  }
+  return align_up_u32(stack_bytes, 16u);
+}
+```
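+
+The per-class cursor idea can be tried in isolation. The sketch below is a simplified model, not the backend's code: `SK_INT`, `SK_FP`, and `sketch_stack_bytes` are hypothetical names, every part is assumed to occupy one register or one 8-byte stack slot, and the psABI rule that lets FP arguments fall back to integer registers is deliberately ignored, as it is in `bind_param` above.
+
+```c
+#include <stdint.h>
+
+enum { SK_INT = 0, SK_FP = 1 };  /* hypothetical part classes */
+
+/* Count the stack bytes needed once the eight registers per class run out. */
+static uint32_t sketch_stack_bytes(const int* part_cls, uint32_t nparts) {
+  uint32_t next_int = 0, next_fp = 0, bytes = 0;
+  for (uint32_t i = 0; i < nparts; ++i) {
+    if (part_cls[i] == SK_FP && next_fp < 8u) next_fp++;
+    else if (part_cls[i] == SK_INT && next_int < 8u) next_int++;
+    else bytes += 8u;  /* no register left in this class */
+  }
+  return (bytes + 15u) & ~15u;  /* keep the 16-byte stack alignment */
+}
+```
+
+Eight integer parts need no stack; a ninth needs one 8-byte slot, which rounds up to 16 bytes.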
+
+### call_stack_bytes (model aa64/native.c lines 1371–1400)
+
+Pure query: given a NativeCallDesc (already marshalled with locations), return the outgoing stack-argument bytes.
+
+```c
+static u32 rv_call_stack_bytes(NativeTarget* t, const NativeCallDesc* desc) {
+  if (!desc || desc->nargs == 0) return 0;
+  // Walk desc->args, which are already assigned to physical locations.
+  // Track the highest end offset among stack-resident args (NATIVE_LOC_STACK).
+  u32 max_off = 0;
+  for (u32 i = 0; i < desc->nargs; ++i) {
+    if (desc->args[i].kind == NATIVE_LOC_STACK) {
+      u32 end = desc->args[i].v.stack.offset + cg_type_size(t->c, desc->args[i].type);
+      if (end > max_off) max_off = end;
+    }
+  }
+  return align_up_u32(max_off, 16u);
+}
+```
+
+---
+
+## (f) bind_param (NativeTarget Hook) — Parameter Binding
+
+### High-Level Contract
+
+`bind_param` is called once per parameter after register allocation and frame slots are final. It reads the parameter from its ABI-mandated incoming location (a-register or stack) and places it in the allocator-chosen destination:
+- **NATIVE_LOC_REG:** The allocator assigned the param to a hard register.
+- **NATIVE_LOC_FRAME:** The allocator assigned the param to a frame slot (address-taken, large aggregate, or spilled).
+- **NATIVE_LOC_NONE:** The param is unused; only the ABI cursor advances.
+
+### bind_param Pseudo-C (model aa64/native.c lines 3616–3696)
+
+```c
+static void rv_bind_native_param(NativeTarget* t, const CGParamDesc* p,
+                                 NativeLoc dst) {
+  RvNativeTarget* a = rv_of(t);
+  const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, a->func->fn_type);
+  const ABIArgInfo* ai = (p->index < abi->nparams) ? &abi->params[p->index] : NULL;
+  int to_reg = (dst.kind == NATIVE_LOC_REG);
+
+  if (!ai || ai->kind == ABI_ARG_IGNORE) return;
+
+  // INDIRECT argument: sret or byval aggregate.
+  // The caller passes a pointer in an a-register (or stack if beyond a7).
+  if (ai->kind == ABI_ARG_INDIRECT) {
+    NativeAddr d_addr, from;
+    AggregateAccess access;
+    NativeLoc src;
+
+    // Fetch the pointer from the next available a-register or stack
+    if (a->next_param_int < 8u) {
+      src = rv_reg_loc(p->type, NATIVE_REG_INT, a->next_param_int++);
+    } else {
+      // Stack-passed pointer: load into RV_TMP0 (t4)
+      src = rv_reg_loc(p->type, NATIVE_REG_INT, RV_TMP0);
+      NativeAddr saddr;
+      memset(&saddr, 0, sizeof saddr);
+      saddr.base_kind = NATIVE_ADDR_BASE_REG;
+      saddr.base.reg = RV_S0;  // Frame pointer (could also use sp with offset calc)
+      saddr.offset = rv_s0_off_in_arg(a, a->next_param_stack);
+      a->next_param_stack += 8u;
+      rv_emit_mem(a, 1, src, saddr, rv_mem_for_type(t, p->type, 8));  // Load
+    }
+
+    // Destination must be a frame slot (indirect params can't go to registers)
+    if (dst.kind != NATIVE_LOC_FRAME)
+      compiler_panic(t->c, a->loc, "rv64: indirect param requires frame dest");
+
+    // Copy aggregate from [pointer] to [dst slot]
+    memset(&d_addr, 0, sizeof d_addr);
+    d_addr.base_kind = NATIVE_ADDR_BASE_FRAME;
+    d_addr.base.frame = dst.v.frame;
+    d_addr.base_type = p->type;
+
+    memset(&from, 0, sizeof from);
+    from.base_kind = NATIVE_ADDR_BASE_REG;
+    from.base.reg = src.v.reg;
+    from.base_type = p->type;
+
+    memset(&access, 0, sizeof access);
+    access.type = p->type;
+    access.size = p->size ? p->size : (u32)cg_type_size(t->c, p->type);
+    access.align = p->align ? p->align : type_align32(t, p->type);
+
+    rv_copy_bytes(t, d_addr, from, access);
+    return;
+  }
+
+  // DIRECT argument: one or more parts (INT / FP scalars or small aggregates).
+  for (u32 i = 0; i < ai->nparts; ++i) {
+    const ABIArgPart* part = &ai->parts[i];
+    NativeAllocClass cls = (part->cls == ABI_CLASS_FP) ? NATIVE_REG_FP : NATIVE_REG_INT;
+    int reg_dst = to_reg && (NativeAllocClass)dst.cls == cls;
+    NativeLoc src;
+
+    // Fetch the part from the next available a-reg/fa-reg or stack
+    if (cls == NATIVE_REG_FP && a->next_param_fp < 8u) {
+      src = rv_reg_loc(p->type, cls, a->next_param_fp++);
+    } else if (cls == NATIVE_REG_INT && a->next_param_int < 8u) {
+      src = rv_reg_loc(p->type, cls, a->next_param_int++);
+    } else {
+      // Stack-passed part: load into a scratch (RV_TMP0 for int, ft0 for fp) or directly into dst reg
+      Reg tmp = reg_dst ? (Reg)dst.v.reg : (cls == NATIVE_REG_FP ? RV_FT0 : RV_TMP0);  // ft0 = f0 (DWARF 32); f8 is callee-saved fs0
+      src = rv_reg_loc(p->type, cls, tmp);
+
+      // Align and load from stack
+      a->next_param_stack = align_up_u32(a->next_param_stack, rv_part_stack_align(part));
+      NativeAddr saddr;
+      memset(&saddr, 0, sizeof saddr);
+      saddr.base_kind = NATIVE_ADDR_BASE_REG;
+      saddr.base.reg = RV_S0;
+      saddr.base_type = p->type;
+      saddr.offset = rv_s0_off_in_arg(a, a->next_param_stack);
+      rv_emit_mem(a, 1, src, saddr, rv_mem_for_type(t, p->type, part->size));
+      a->next_param_stack += 8u;
+    }
+
+    // Place src into dst
+    if (dst.kind == NATIVE_LOC_NONE) {
+      // Unused parameter: only the ABI cursor advances.
+    } else if (to_reg) {
+      NativeLoc d = rv_reg_loc(dst.type ? dst.type : p->type,
+                               (NativeAllocClass)dst.cls, (Reg)dst.v.reg);
+      if (!(src.kind == NATIVE_LOC_REG && src.v.reg == d.v.reg &&
+            (NativeAllocClass)src.cls == (NativeAllocClass)d.cls)) {
+        rv_move(t, d, src);
+      }
+    } else {
+      // Store part into frame slot at offset part->src_offset
+      rv_store_part(t, rv_stack_loc(p->type, dst.v.frame, (i32)part->src_offset),
+                    src, 0, part->size);
+    }
+  }
+
+  a->incoming_stack_size = align_up_u32(a->next_param_stack, 16u);
+}
+```
+
+### Helpers for bind_param
+
+```c
+static inline NativeLoc rv_reg_loc(CfreeCgTypeId type, NativeAllocClass cls, Reg reg) {
+  NativeLoc loc;
+  memset(&loc, 0, sizeof loc);
+  loc.kind = NATIVE_LOC_REG;
+  loc.cls = cls;
+  loc.type = type;
+  loc.v.reg = reg;
+  return loc;
+}
+
+static inline NativeLoc rv_stack_loc(CfreeCgTypeId type, NativeFrameSlot slot, i32 offset) {
+  NativeLoc loc;
+  memset(&loc, 0, sizeof loc);
+  loc.kind = NATIVE_LOC_FRAME;
+  loc.type = type;
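+  loc.v.frame = slot;
+  // NOTE: this sketch drops the byte offset within the slot; the real helper
+  // must carry it so that multi-part params land at part->src_offset.
+  (void)offset;
+  return loc;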
+}
+
+static inline u32 rv_part_stack_align(const ABIArgPart* part) {
+  return part->align ? part->align : 8u;
+}
+
+static inline MemAccess rv_mem_for_type(NativeTarget* t, CfreeCgTypeId type, u32 size) {
+  MemAccess mem;
+  memset(&mem, 0, sizeof mem);
+  mem.type = type;
+  mem.size = size;
+  mem.align = size;  // Simplified; real code queries type alignment
+  return mem;
+}
+
+static void rv_store_part(NativeTarget* t, NativeLoc dst, NativeLoc src,
+                          u32 dst_offset, u32 size) {
+  // Generalized store of a part into a frame location
+  // (Pseudo-implementation; real code uses rv_emit_mem with computed addresses)
+  rv_move(t, dst, src);
+}
+
+static void rv_emit_mem(RvNativeTarget* a, int is_load, NativeLoc dst,
+                        NativeAddr addr, MemAccess mem) {
+  // Emit a memory load or store instruction with the given operands.
+  // dst is the register location; addr is the address (base_kind + base + offset).
+  // This is a facade that dispatches to rv_load / rv_store.
+}
+```
+
+---
+
+## (g) Summary: Single-Pass vs Known-Frame Flow
+
+### Single-Pass (-O0, NativeDirectTarget)
+
+1. **func_begin:** Reserve prologue region (RV_PROLOGUE_WORDS NOPs), reserve entry-save slots.
+2. **Body:** Frame slots grow (via frame_slot) as needed; max_outgoing grows (via note_frame_state) as calls are encountered.
+3. **func_end:** 
+   - Compute final frame size = align16(16 + cum_off + max_outgoing + va_sz).
+   - Patch prologue region with rv_build_prologue (exact instructions).
+   - Emit epilogue (restore + ret).
+   - Post CFI metadata.
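+
+The frame-size formula in step 3 can be checked with a few lines of C. This is a sketch under stated assumptions: `sketch_frame_size` is a hypothetical name, and the leading 16 is taken to be the saved s0/ra pair (two 8-byte registers).
+
+```c
+#include <stdint.h>
+
+/* align16(16 + cum_off + max_outgoing + va_sz), as described above. */
+static uint32_t sketch_frame_size(uint32_t cum_off, uint32_t max_outgoing,
+                                  uint32_t va_sz) {
+  uint32_t raw = 16u + cum_off + max_outgoing + va_sz;  /* 16: s0/ra pair */
+  return (raw + 15u) & ~15u;                            /* round up to 16 */
+}
+```
+
+A function with 17 bytes of slots and no outgoing or variadic area therefore gets a 48-byte frame: 16 + 17 = 33, rounded up to the next multiple of 16.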
+
+### Known-Frame (-O1, Optimizer Emit Path)
+
+1. **func_begin_known_frame:** Receives NativeKnownFrameDesc (slots, max_outgoing, callee_saved_used pre-computed).
+2. **Frame is final immediately:** Call the rv64 counterparts of aa_reserve_callee_saves and aa_frame_slot for all planned slots, then set frame_final.
+3. **Emit prologue inline:** Build the prologue words once (aa64 does this in aa_build_prologue_words) and emit the exact word count (no patching).
+4. **Body:** No frame growth; allocas / tail-epilogues can be emitted final with no back-patching.
+5. **func_end:** Just place epilogue label and emit epilogue (no patching).
+
+For this porting guide (GROUP 1), the focus is **single-pass only**. Known-frame is a future -O1 optimization; the entry point is rv_func_begin_known_frame (stub or minimal), and the frame_final flag prevents post-prologue frame changes.
+
+---
+
+## Key Takeaways for the RV64 Implementation
+
+1. **Register model:** s0 (x8) is the frame pointer. Offsets are **bytes below s0** (positive = downward). Stack grows downward (sp decreases on entry).
+2. **Frame layout:** Single top-record: saved pair at [sp + fp_pair_off], then locals/slots below.
+3. **ABI alignment:** 16-byte stack alignment; variadic save area (64B) sits at [s0 + 16].
+4. **Prologue:** Multi-phase: sp adjust → save pair → set s0 → save callee-saves → spill sret/variadic.
+5. **Epilogue:** Reverse: restore callee-saves → restore pair → sp adjust → ret.
+6. **Immediate materialization:** Use LUI+ADDIW for immediates outside the signed 12-bit range; hold the materialized offset in RV_TMP0/RV_TMP1 (t4/t5).
+7. **Parameter binding:** Walk ABI parts, read from a0–a7 (or fa0–fa7), move to destination (register or frame).
+8. **Tail calls:** Compare outgoing stack bytes vs incoming (via signature_stack_bytes); if smaller or equal, sibling call is feasible.
+
+
+
+---
+
+# RV64 NativeTarget: Register Tables and Operand Legality (GROUP 2)
+
+## Overview
+
+This group implements the register and operand legality infrastructure for the rv64 NativeTarget. The reference is `/Users/ryan/code/cfree/src/arch/aa64/native.c` (lines ~3370-3526 for register tables), modeled exactly on the aa64 structure but parameterized by RISC-V ISA and LP64D ABI specifics.
+
+---
+
+## 1. Register Constants and Reserved Registers
+
+Define at the top of `src/arch/rv64/native.c` (before the allocable/scratch/phys tables):
+
+```c
+enum {
+  RV_PROLOGUE_WORDS = 128u,  /* reserved NOP region for -O0 prologue */
+  RV_TMP0 = 29,              /* t4: backend scratch (caller-saved, temp) */
+  RV_TMP1 = 30,              /* t5: backend scratch (caller-saved, temp) */
+  RV_FP = 8,                 /* s0: frame pointer (callee-saved) */
+  RV_RA = 1,                 /* return address (caller-saved per psABI; saved in the frame record) */
+  RV_SP = 2,                 /* stack pointer (reserved) */
+  RV_GP = 3,                 /* global pointer (reserved) */
+  RV_TP = 4,                 /* thread pointer (reserved) */
+  RV_ZERO = 0,               /* x0: hardware zero register (reserved) */
+  /* Argument registers (a0-a7 / fa0-fa7, caller-saved) */
+  RV_A0 = 10, RV_A1 = 11, RV_A2 = 12, RV_A3 = 13,
+  RV_A4 = 14, RV_A5 = 15, RV_A6 = 16, RV_A7 = 17,
+  /* Callee-saved integer registers (s2-s11) */
+  RV_S2 = 18, RV_S3 = 19, RV_S4 = 20, RV_S5 = 21,
+  RV_S6 = 22, RV_S7 = 23, RV_S8 = 24, RV_S9 = 25,
+  RV_S10 = 26, RV_S11 = 27,
+  /* Temporary registers: t0-t2 (x5-x7) and t3-t6 (x28-x31) */
+  RV_T0 = 5, RV_T1 = 6, RV_T2 = 7, RV_T3 = 28, RV_T4 = 29, RV_T5 = 30, RV_T6 = 31,
+  /* FP temporaries */
+  RV_FT0 = 0, RV_FT1 = 1, RV_FT2 = 2, RV_FT3 = 3, RV_FT4 = 4,
+  RV_FT5 = 5, RV_FT6 = 6, RV_FT7 = 7,
+  RV_FS0 = 8, RV_FS1 = 9,  /* callee-saved */
+  RV_FA0 = 10, RV_FA1 = 11, RV_FA2 = 12, RV_FA3 = 13,  /* argument regs */
+  RV_FA4 = 14, RV_FA5 = 15, RV_FA6 = 16, RV_FA7 = 17,
+  RV_FS2 = 18, RV_FS3 = 19, RV_FS4 = 20, RV_FS5 = 21,  /* callee-saved */
+  RV_FS6 = 22, RV_FS7 = 23, RV_FS8 = 24, RV_FS9 = 25,
+  RV_FS10 = 26, RV_FS11 = 27,
+  RV_FT8 = 28, RV_FT9 = 29, RV_FT10 = 30, RV_FT11 = 31,
+};
+```
+
+**Source of truth for RISC-V register names:** `/Users/ryan/code/cfree/src/arch/rv64/isa.h` lines 17-67 (RV_X0 through RV_T6 enum definitions).
+
+---
+
+## 2. Integer Register Allocable and Scratch Tables
+
+### rv_int_allocable[]
+
+Allocable integer registers come from the RISC-V psABI callee-saved pool (s2-s11 / x18-x27). The allocator prefers caller-saved temporaries when not under pressure, so those are listed separately and exposed through a distinct scratch array (`rv_int_scratch[]`).
+
+**Source:** `/Users/ryan/code/cfree/src/arch/rv64/opt_coord.c` lines 8, 11 (legacy tables; note these enumerate both the allocable and reserved scratch regs).
+
+```c
+/* Allocable integers: s2-s11 (callee-saved, x18-x27).
+ * These are the only registers available for general allocation after
+ * reserving tmp0/tmp1 (t4/t5, x29/x30), ra/sp/gp/tp/zero, and the
+ * FP (s0). */
+static const Reg rv_int_allocable[] = {
+  20, 21, 22, 23, 24, 25, 26, 27,  /* s4-s11 */
+  18, 19,                            /* s2-s3: allocated only under register pressure */
+};
+```
+
+**Rationale:** s2 and s3 are marked as "reserved by opt_emit" in the legacy code (line 11), indicating the original backend used them for special purposes. However, under the NativeTarget contract they can be allocable since emit hooks (not alloc) control their usage.
+
+### rv_int_scratch[]
+
+Scratch registers available for temporary materialization without forcing the allocator into the callee-saved pool. They are drawn from the caller-saved temporaries; only t4/t5 (`RV_TMP0`/`RV_TMP1`) are exposed in the table below.
+
+```c
+/* Scratch integers available to emit without spilling.
+ * t4/t5 (x29/x30) are reserved for backend internal use (e.g., atomic RMW
+ * helper, address computation). t0-t2/t3/t6 are available to the emitter.
+ * For simplicity, we expose t4/t5 here; emit hooks may use them if free. */
+static const Reg rv_int_scratch[] = {29, 30};  /* t4, t5 */
+```
+
+---
+
+## 3. FP Register Allocable and Scratch Tables
+
+### rv_fp_allocable[]
+
+Allocable FP registers come from the RISC-V psABI callee-saved pool: fs0-fs1 (f8-f9) plus fs2-fs11 (f18-f27). Note: fa0-fa7 are argument registers and are not listed in this table. ft0-ft7 and ft8-ft11 are caller-saved temporaries.
+
+```c
+/* Allocable FP: fs0-fs1 (callee-saved, f8-f9), fs2-fs11 (f18-f27).
+ * fa0-fa7 are argument registers (reserved for ABI). */
+static const Reg rv_fp_allocable[] = {
+  8, 9,                             /* fs0-fs1 (callee-saved, prefer first) */
+  18, 19, 20, 21, 22, 23, 24, 25, 26, 27,  /* fs2-fs11 */
+};
+```
+
+### rv_fp_scratch[]
+
+FP scratch registers for temporary use without spilling.
+
+```c
+/* Scratch FP: ft8-ft11 (f28-f31, caller-saved). */
+static const Reg rv_fp_scratch[] = {28, 29, 30, 31};  /* ft8-ft11 */
+```
+
+---
+
+## 4. NativePhysRegInfo Arrays (phys[])
+
+Each register in the physical register file gets a descriptor. Order the integer array as: argument/return registers first (a0-a7), then allocables (s2-s11), then reserved (zero, ra, sp, gp, tp, t0-t2, s0, s1, t3-t6).
+
+### Integer Physical Registers
+
+**Source:** `/Users/ryan/code/cfree/src/arch/aa64/native.c` lines 3424-3441 (aa_int_phys[] macro pattern).
+
+```c
+#define RV_PHYS_INT_ARG(r, idx) \
+  {.reg = (r), \
+   .cls = NATIVE_REG_INT, \
+   .abi_index = (idx), \
+   .flags = NATIVE_REG_ALLOCABLE | NATIVE_REG_CALLER_SAVED | NATIVE_REG_ARG | \
+            ((idx) < 2u ? NATIVE_REG_RET : 0), \
+   .spill_cost = 1u, \
+   .copy_cost = 1u}
+
+#define RV_PHYS_INT_ALLOC(r) \
+  {.reg = (r), \
+   .cls = NATIVE_REG_INT, \
+   .abi_index = 0xffu, \
+   .flags = NATIVE_REG_ALLOCABLE | NATIVE_REG_CALLEE_SAVED, \
+   .spill_cost = 4u, \
+   .copy_cost = 1u}
+
+#define RV_PHYS_INT_RESERVED(r) \
+  {.reg = (r), \
+   .cls = NATIVE_REG_INT, \
+   .abi_index = 0xffu, \
+   .flags = NATIVE_REG_RESERVED, \
+   .spill_cost = 0u, \
+   .copy_cost = 0u}
+
+static const NativePhysRegInfo rv_int_phys[] = {
+  /* Argument/return registers (a0-a7, x10-x17) */
+  RV_PHYS_INT_ARG(10, 0),  /* a0 / x10 — return arg 0 + arg 0 */
+  RV_PHYS_INT_ARG(11, 1),  /* a1 / x11 — return arg 1 + arg 1 */
+  RV_PHYS_INT_ARG(12, 2),  /* a2 / x12 — arg 2 only */
+  RV_PHYS_INT_ARG(13, 3),  /* a3 / x13 — arg 3 only */
+  RV_PHYS_INT_ARG(14, 4),  /* a4 / x14 — arg 4 only */
+  RV_PHYS_INT_ARG(15, 5),  /* a5 / x15 — arg 5 only */
+  RV_PHYS_INT_ARG(16, 6),  /* a6 / x16 — arg 6 only */
+  RV_PHYS_INT_ARG(17, 7),  /* a7 / x17 — arg 7 only */
+  /* Allocable callee-saved (s2-s11, x18-x27) */
+  RV_PHYS_INT_ALLOC(18),   /* s2 / x18 */
+  RV_PHYS_INT_ALLOC(19),   /* s3 / x19 */
+  RV_PHYS_INT_ALLOC(20),   /* s4 / x20 */
+  RV_PHYS_INT_ALLOC(21),   /* s5 / x21 */
+  RV_PHYS_INT_ALLOC(22),   /* s6 / x22 */
+  RV_PHYS_INT_ALLOC(23),   /* s7 / x23 */
+  RV_PHYS_INT_ALLOC(24),   /* s8 / x24 */
+  RV_PHYS_INT_ALLOC(25),   /* s9 / x25 */
+  RV_PHYS_INT_ALLOC(26),   /* s10 / x26 */
+  RV_PHYS_INT_ALLOC(27),   /* s11 / x27 */
+  /* Reserved: temporaries, frame pointer, return address, zero, etc. */
+  RV_PHYS_INT_RESERVED(0),   /* zero / x0 */
+  RV_PHYS_INT_RESERVED(1),   /* ra / x1 */
+  RV_PHYS_INT_RESERVED(2),   /* sp / x2 */
+  RV_PHYS_INT_RESERVED(3),   /* gp / x3 */
+  RV_PHYS_INT_RESERVED(4),   /* tp / x4 */
+  RV_PHYS_INT_RESERVED(5),   /* t0 / x5 */
+  RV_PHYS_INT_RESERVED(6),   /* t1 / x6 */
+  RV_PHYS_INT_RESERVED(7),   /* t2 / x7 */
+  RV_PHYS_INT_RESERVED(8),   /* s0 / x8 (frame pointer) */
+  RV_PHYS_INT_RESERVED(9),   /* s1 / x9 */
+  RV_PHYS_INT_RESERVED(28),  /* t3 / x28 */
+  RV_PHYS_INT_RESERVED(29),  /* t4 / x29 (backend tmp0) */
+  RV_PHYS_INT_RESERVED(30),  /* t5 / x30 (backend tmp1) */
+  RV_PHYS_INT_RESERVED(31),  /* t6 / x31 */
+};
+```
+
+**Key points:**
+- `abi_index`: Position in ABI register order (a0-a7 map to indices 0-7; non-arg registers get 0xff).
+- `flags`: Argument registers have NATIVE_REG_ARG; return arg registers (a0-a1) additionally have NATIVE_REG_RET.
+- `spill_cost`: Callee-saved set to 4u (higher cost discourages allocation under light pressure); caller-saved and arg regs 1u.
+- **Return register mask:** a0 alone carries integer returns up to 64 bits; a1 joins it for values up to 128 bits (2×XLEN).
+
+### FP Physical Registers
+
+```c
+#define RV_PHYS_FP_ARG(r, idx) \
+  {.reg = (r), \
+   .cls = NATIVE_REG_FP, \
+   .abi_index = (idx), \
+   .flags = NATIVE_REG_ALLOCABLE | NATIVE_REG_CALLER_SAVED | NATIVE_REG_ARG | \
+            ((idx) < 2u ? NATIVE_REG_RET : 0), \
+   .spill_cost = 1u, \
+   .copy_cost = 1u}
+
+#define RV_PHYS_FP_ALLOC(r) \
+  {.reg = (r), \
+   .cls = NATIVE_REG_FP, \
+   .abi_index = 0xffu, \
+   .flags = NATIVE_REG_ALLOCABLE | NATIVE_REG_CALLEE_SAVED, \
+   .spill_cost = 4u, \
+   .copy_cost = 1u}
+
+#define RV_PHYS_FP_RESERVED(r) \
+  {.reg = (r), \
+   .cls = NATIVE_REG_FP, \
+   .abi_index = 0xffu, \
+   .flags = NATIVE_REG_RESERVED, \
+   .spill_cost = 0u, \
+   .copy_cost = 0u}
+
+static const NativePhysRegInfo rv_fp_phys[] = {
+  /* Argument/return registers (fa0-fa7, f10-f17) */
+  RV_PHYS_FP_ARG(10, 0),   /* fa0 / f10 */
+  RV_PHYS_FP_ARG(11, 1),   /* fa1 / f11 */
+  RV_PHYS_FP_ARG(12, 2),   /* fa2 / f12 */
+  RV_PHYS_FP_ARG(13, 3),   /* fa3 / f13 */
+  RV_PHYS_FP_ARG(14, 4),   /* fa4 / f14 */
+  RV_PHYS_FP_ARG(15, 5),   /* fa5 / f15 */
+  RV_PHYS_FP_ARG(16, 6),   /* fa6 / f16 */
+  RV_PHYS_FP_ARG(17, 7),   /* fa7 / f17 */
+  /* Allocable callee-saved (fs0-fs1, fs2-fs11) */
+  RV_PHYS_FP_ALLOC(8),     /* fs0 / f8 */
+  RV_PHYS_FP_ALLOC(9),     /* fs1 / f9 */
+  RV_PHYS_FP_ALLOC(18),    /* fs2 / f18 */
+  RV_PHYS_FP_ALLOC(19),    /* fs3 / f19 */
+  RV_PHYS_FP_ALLOC(20),    /* fs4 / f20 */
+  RV_PHYS_FP_ALLOC(21),    /* fs5 / f21 */
+  RV_PHYS_FP_ALLOC(22),    /* fs6 / f22 */
+  RV_PHYS_FP_ALLOC(23),    /* fs7 / f23 */
+  RV_PHYS_FP_ALLOC(24),    /* fs8 / f24 */
+  RV_PHYS_FP_ALLOC(25),    /* fs9 / f25 */
+  RV_PHYS_FP_ALLOC(26),    /* fs10 / f26 */
+  RV_PHYS_FP_ALLOC(27),    /* fs11 / f27 */
+  /* Reserved: caller-saved temporaries (ft0-ft7, ft8-ft11). */
+  RV_PHYS_FP_RESERVED(0),  /* ft0 / f0 */
+  RV_PHYS_FP_RESERVED(1),  /* ft1 / f1 */
+  RV_PHYS_FP_RESERVED(2),  /* ft2 / f2 */
+  RV_PHYS_FP_RESERVED(3),  /* ft3 / f3 */
+  RV_PHYS_FP_RESERVED(4),  /* ft4 / f4 */
+  RV_PHYS_FP_RESERVED(5),  /* ft5 / f5 */
+  RV_PHYS_FP_RESERVED(6),  /* ft6 / f6 */
+  RV_PHYS_FP_RESERVED(7),  /* ft7 / f7 */
+  RV_PHYS_FP_RESERVED(28), /* ft8 / f28 */
+  RV_PHYS_FP_RESERVED(29), /* ft9 / f29 */
+  RV_PHYS_FP_RESERVED(30), /* ft10 / f30 */
+  RV_PHYS_FP_RESERVED(31), /* ft11 / f31 */
+};
+```
+
+---
+
+## 5. NativeAllocClassInfo Arrays
+
+Define two class infos: one for NATIVE_REG_INT, one for NATIVE_REG_FP. Include the five per-class masks (caller-saved, callee-saved, argument, return, reserved).
+
+```c
+static const NativeAllocClassInfo rv_classes[] = {
+  /* INTEGER CLASS */
+  {.cls = NATIVE_REG_INT,
+   .allocable = rv_int_allocable,
+   .nallocable = sizeof rv_int_allocable / sizeof rv_int_allocable[0],
+   .scratch = rv_int_scratch,
+   .nscratch = sizeof rv_int_scratch / sizeof rv_int_scratch[0],
+   .phys = rv_int_phys,
+   .nphys = sizeof rv_int_phys / sizeof rv_int_phys[0],
+   /* Caller-saved mask: a0-a7 (x10-x17) + t0-t2, t3-t6 (x5-x7, x28-x31).
+    * RISC-V psABI: x5-x7, x10-x17, x28-x31 are caller-saved. */
+   .caller_saved_mask = 
+     ((1u << 5) | (1u << 6) | (1u << 7) |  /* t0-t2 */
+      (1u << 10) | (1u << 11) | (1u << 12) | (1u << 13) |  /* a0-a3 */
+      (1u << 14) | (1u << 15) | (1u << 16) | (1u << 17) |  /* a4-a7 */
+      (1u << 28) | (1u << 29) | (1u << 30) | (1u << 31)),  /* t3-t6 */
+   /* Callee-saved mask: s0-s11 (x8-x9, x18-x27). */
+   .callee_saved_mask =
+     ((1u << 8) | (1u << 9) |  /* s0-s1 */
+      (1u << 18) | (1u << 19) | (1u << 20) | (1u << 21) |  /* s2-s5 */
+      (1u << 22) | (1u << 23) | (1u << 24) | (1u << 25) |  /* s6-s9 */
+      (1u << 26) | (1u << 27)),  /* s10-s11 */
+   /* Argument mask: a0-a7 (x10-x17). */
+   .arg_mask =
+     ((1u << 10) | (1u << 11) | (1u << 12) | (1u << 13) |
+      (1u << 14) | (1u << 15) | (1u << 16) | (1u << 17)),
+   /* Return mask: a0-a1 (x10-x11); a0 alone up to 64 bits, a0+a1 up to 128 bits. */
+   .ret_mask = ((1u << 10) | (1u << 11)),
+   /* Reserved: zero, ra, sp, gp, tp, s0 (fp), s1, t4/t5 (tmp0/tmp1). */
+   .reserved_mask =
+     ((1u << 0) |   /* zero */
+      (1u << 1) |   /* ra */
+      (1u << 2) |   /* sp */
+      (1u << 3) |   /* gp */
+      (1u << 4) |   /* tp */
+      (1u << 8) |   /* s0 / fp */
+      (1u << 9) |   /* s1 */
+      (1u << 29) | (1u << 30))},  /* t4/t5 tmp0/tmp1 */
+
+  /* FLOATING-POINT CLASS */
+  {.cls = NATIVE_REG_FP,
+   .allocable = rv_fp_allocable,
+   .nallocable = sizeof rv_fp_allocable / sizeof rv_fp_allocable[0],
+   .scratch = rv_fp_scratch,
+   .nscratch = sizeof rv_fp_scratch / sizeof rv_fp_scratch[0],
+   .phys = rv_fp_phys,
+   .nphys = sizeof rv_fp_phys / sizeof rv_fp_phys[0],
+   /* Caller-saved FP: ft0-ft7 (f0-f7) + fa0-fa7 (f10-f17) + ft8-ft11 (f28-f31). */
+   .caller_saved_mask =
+     ((1u << 0) | (1u << 1) | (1u << 2) | (1u << 3) |  /* ft0-ft3 */
+      (1u << 4) | (1u << 5) | (1u << 6) | (1u << 7) |  /* ft4-ft7 */
+      (1u << 10) | (1u << 11) | (1u << 12) | (1u << 13) |  /* fa0-fa3 */
+      (1u << 14) | (1u << 15) | (1u << 16) | (1u << 17) |  /* fa4-fa7 */
+      (1u << 28) | (1u << 29) | (1u << 30) | (1u << 31)),  /* ft8-ft11 */
+   /* Callee-saved FP: fs0-fs11 (f8-f9, f18-f27). */
+   .callee_saved_mask =
+     ((1u << 8) | (1u << 9) |  /* fs0-fs1 */
+      (1u << 18) | (1u << 19) | (1u << 20) | (1u << 21) |  /* fs2-fs5 */
+      (1u << 22) | (1u << 23) | (1u << 24) | (1u << 25) |  /* fs6-fs9 */
+      (1u << 26) | (1u << 27)),  /* fs10-fs11 */
+   /* Argument mask: fa0-fa7 (f10-f17). */
+   .arg_mask =
+     ((1u << 10) | (1u << 11) | (1u << 12) | (1u << 13) |
+      (1u << 14) | (1u << 15) | (1u << 16) | (1u << 17)),
+   /* Return mask: fa0-fa1 (f10-f11); fa1 is needed only for two-part FP returns. */
+   .ret_mask = ((1u << 10) | (1u << 11)),
+   /* Reserved: all temp registers. */
+   .reserved_mask =
+     ((1u << 0) | (1u << 1) | (1u << 2) | (1u << 3) |  /* ft0-ft3 */
+      (1u << 4) | (1u << 5) | (1u << 6) | (1u << 7) |  /* ft4-ft7 */
+      (1u << 28) | (1u << 29) | (1u << 30) | (1u << 31))},  /* ft8-ft11 */
+};
+```
+
+---
+
+## 6. NativeRegInfo Global
+
+```c
+static const NativeRegInfo rv_reg_info = {
+  .classes = rv_classes,
+  .nclasses = sizeof rv_classes / sizeof rv_classes[0],
+  /* Function pointers are set to NULL here; no resolve_name / debug_name / dwarf_reg
+   * implementations are exposed via NativeTarget (they are used internally via
+   * rv64_register_index / rv64_register_name from src/arch/rv64/regs.c if needed). */
+};
+```
+
+---
+
+## 7. Operand Legality: class_for_type
+
+Query which register class a type occupies. RISC-V uses GPR (INT) for integers/pointers and FPR (FP) for floats/doubles. Define it inline or place it before `imm_legal`.
+
+```c
+static NativeAllocClass rv_class_for_type(NativeTarget* t, CfreeCgTypeId type) {
+  /* FP types use the FP register class. All others (including aggregates,
+   * which get passed by reference) use INT. */
+  if (type && cg_type_is_float(t->c, type) && cg_type_size(t->c, type) <= 8u)
+    return NATIVE_REG_FP;
+  return NATIVE_REG_INT;
+}
+```
+
+---
+
+## 8. Operand Legality: addr_legal
+
+Check if a memory address mode is legal. RISC-V supports base+imm12 only (no indexed addressing). The Zba extension can fold an index into the base via sh{1,2,3}add, but that is a codegen decision made at emit time, not a legality check. Note that the hook as written does not range-check `addr->offset`.
+
+```c
+/* RISC-V memory addressing: base + imm12 (signed 12-bit) only.
+ * No indexed addressing without Zba transforms (handled by emit). */
+static int rv_addr_legal(NativeTarget* t, const NativeAddr* addr,
+                         MemAccess mem) {
+  (void)t;
+  (void)mem;
+  if (!addr) return 0;
+  /* Index must be absent. */
+  if (addr->index_kind != NATIVE_ADDR_INDEX_NONE) return 0;
+  /* Base must be present (NATIVE_ADDR_BASE_REG or NATIVE_ADDR_BASE_FRAME). */
+  return addr->base_kind == NATIVE_ADDR_BASE_REG ||
+         addr->base_kind == NATIVE_ADDR_BASE_FRAME;
+}
+```
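+
+For readers unfamiliar with Zba, the following sketch models what folding an index into the base amounts to. Only the `shNadd` semantics (rd = rs2 + (rs1 << n), n in 1..3) come from the ISA; the `sketch_` function names and the choice to fall back to a plain add for scale 1 are assumptions made for illustration, not a description of `rv_emit_mem`.
+
+```c
+#include <stdint.h>
+
+/* Model of Zba sh1add/sh2add/sh3add: rd = rs2 + (rs1 << n), n in {1, 2, 3}. */
+static uint64_t sketch_shadd(unsigned n, uint64_t rs1, uint64_t rs2) {
+  return rs2 + (rs1 << n);
+}
+
+/* Effective address of base + (index << log2_scale) + off after folding:
+ * one shNadd (or add) produces a temporary base, leaving a base+imm access. */
+static uint64_t sketch_fold_indexed(uint64_t base, uint64_t index,
+                                    unsigned log2_scale, int32_t off) {
+  uint64_t tmp = log2_scale ? sketch_shadd(log2_scale, index, base)
+                            : base + index;
+  return tmp + (uint64_t)(int64_t)off;
+}
+```
+
+After the fold, the remaining access has the base+imm12 shape that `rv_addr_legal` accepts.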
+
+---
+
+## 9. Operand Legality: imm_legal
+
+Check if an immediate can be folded directly into an instruction without materialization.
+
+```c
+/* RISC-V immediate legality.
+ * - ALU/load immediates: 12-bit signed [-2048, 2047] via I-type.
+ * - Shifts: 6-bit (shamt) in [0, 63] for 64-bit, [0, 31] for 32-bit.
+ * - Moves: any value can be materialized via LUI+ADDI or LI pseudo.
+ * - Comparisons: 12-bit signed (SLTI/SLTIU immediate; RISC-V has no CMP).
+ *
+ * This is a simplified query used by the optimizer to avoid materializing
+ * large constants. The emitter has full responsibility for folding or
+ * rejecting each case. */
+static int rv_imm_legal(NativeTarget* t, NativeImmUse use, u32 op,
+                        CfreeCgTypeId type, i64 imm) {
+  (void)t;
+  (void)type;
+  
+  switch (use) {
+    case NATIVE_IMM_MOVE:
+      /* Any constant can be materialized. */
+      return 1;
+    
+    case NATIVE_IMM_BINOP:
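+      /* IADD folds into ADDI: 12-bit signed immediate. */
+      if ((BinOp)op == BO_IADD) {
+        return imm >= -2048 && imm <= 2047;
+      }
+      /* ISUB has no SUBI form; it folds into ADDI with -imm, so -imm must fit. */
+      if ((BinOp)op == BO_ISUB) {
+        return imm >= -2047 && imm <= 2048;
+      }
+      /* Shifts: 6-bit shamt for 64-bit operands, 5-bit for 32-bit (*W forms). */
+      if ((BinOp)op == BO_SHL || (BinOp)op == BO_LSHR || (BinOp)op == BO_ASHR) {
+        i64 max_shamt = (type && cg_type_size(t->c, type) <= 4u) ? 31 : 63;
+        return imm >= 0 && imm <= max_shamt;
+      }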
+      /* AND, OR, XOR: 12-bit immediate. */
+      if ((BinOp)op == BO_AND || (BinOp)op == BO_OR || (BinOp)op == BO_XOR) {
+        return imm >= -2048 && imm <= 2047;
+      }
+      return 0;
+    
+    case NATIVE_IMM_CMP:
+      /* CMP uses subtraction, so 12-bit signed immediate. */
+      return imm >= -2048 && imm <= 2047;
+    
+    case NATIVE_IMM_ADDR_OFFSET:
+      /* Address computations and load/store offsets: 12-bit signed. */
+      return imm >= -2048 && imm <= 2047;
+  }
+  return 0;
+}
+```
+
+---
+
+## 10. NativeTarget Initialization
+
+In `rv64_native_target_new()` (or wherever the NativeTarget is created), set:
+
+```c
+  t->regs = &rv_reg_info;
+  t->class_for_type = rv_class_for_type;
+  t->imm_legal = rv_imm_legal;
+  t->addr_legal = rv_addr_legal;
+  t->has_store_zero_reg = 1;      /* x0 is hardware zero */
+  t->store_zero_reg = RV_ZERO;    /* x0 */
+```
+
+---
+
+## 11. Summary: Mask Computation
+
+For reference, the per-class masks are computed as follows:
+
+**Integer:**
+- `caller_saved_mask`: All registers in (a0-a7, t0-t2, t3-t6) = bits [5-7] | [10-17] | [28-31]
+- `callee_saved_mask`: All registers in (s0-s11) = bits [8-9] | [18-27]
+- `arg_mask`: All registers in (a0-a7) = bits [10-17]
+- `ret_mask`: Registers that receive return values (a0-a1) = bits [10-11]
+- `reserved_mask`: zero, ra, sp, gp, tp, s0 (fp), s1, tmp0 (t4), tmp1 (t5) = bits [0-4] | [8-9] | [29-30]
+
+**FP:**
+- `caller_saved_mask`: All registers in (fa0-fa7, ft0-ft7, ft8-ft11) = bits [0-7] | [10-17] | [28-31]
+- `callee_saved_mask`: All registers in (fs0-fs11) = bits [8-9] | [18-27]
+- `arg_mask`: All registers in (fa0-fa7) = bits [10-17]
+- `ret_mask`: Registers that receive FP returns (fa0-fa1) = bits [10-11]
+- `reserved_mask`: All temporaries = bits [0-7] | [28-31]
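
As a rough illustration of the integer masks listed above, the sketch below builds each one from inclusive bit ranges. The helper `rv_bits` and the `rv_int_*` function names are hypothetical and exist only for this example; the bit ranges themselves are taken from the list.

```c
#include <stdint.h>

/* Hypothetical helper: mask with bits lo..hi (inclusive) set,
 * for 0 <= lo <= hi <= 31. Uses 64-bit math so hi == 31 is safe. */
static uint32_t rv_bits(unsigned lo, unsigned hi) {
  uint64_t upto_hi = (1ull << (hi + 1)) - 1; /* bits 0..hi */
  uint64_t below_lo = (1ull << lo) - 1;      /* bits 0..lo-1 */
  return (uint32_t)(upto_hi & ~below_lo);
}

/* t0-t2 | a0-a7 | t3-t6 */
static uint32_t rv_int_caller_saved(void) {
  return rv_bits(5, 7) | rv_bits(10, 17) | rv_bits(28, 31);
}

/* s0-s1 | s2-s11 */
static uint32_t rv_int_callee_saved(void) {
  return rv_bits(8, 9) | rv_bits(18, 27);
}

/* a0-a7 */
static uint32_t rv_int_arg(void) { return rv_bits(10, 17); }

/* a0-a1 */
static uint32_t rv_int_ret(void) { return rv_bits(10, 11); }

/* zero, ra, sp, gp, tp | s0, s1 | t4, t5 */
static uint32_t rv_int_reserved(void) {
  return rv_bits(0, 4) | rv_bits(8, 9) | rv_bits(29, 30);
}
```

Two checks fall out of this form: the caller-saved and callee-saved masks should not overlap, and together with x0..x4 they should cover all 32 registers.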
+
+---
+
+## 12. Key Differences from AA64
+
+1. **Immediate range:** AA64 uses 12-bit with optional shift-left-12; RV64 uses plain 12-bit [-2048, 2047].
+2. **Shift operands:** AA64 shift has a 6-bit immediate field for both 32 and 64-bit ops; RV64 uses a 6-bit shamt for 64-bit shifts (SLLI/SRLI/SRAI) and a 5-bit shamt for the 32-bit W forms (SLLIW/SRLIW/SRAIW).
+3. **Indexed addressing:** AA64 supports optional shift-and-add on the index (log2_scale checked against memory size); RV64 has no indexed addressing—the emit layer folds index via Zba if needed.
+4. **Register names:** DWARF indices match hardware register numbers (0-31 for x-regs, 32-63 for f-regs). Use `rv64_register_index()` from `src/arch/rv64/regs.c` for name → index mapping if needed.
+5. **Return registers:** a0-a1 (for 128-bit integer returns) and fa0-fa1 (for FP pair returns) are all marked NATIVE_REG_RET in the phys[] descriptor, but single-value returns use only a0 or fa0.
+
+---
+
+## 13. Integration Checklist
+
+- [ ] Define RV_* register constants at the top of native.c
+- [ ] Implement rv_int_allocable[], rv_int_scratch[], rv_fp_allocable[], rv_fp_scratch[]
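+- [ ] Populate rv_int_phys[] (per the masks in section 11: 8 argument/return + 5 allocatable temporaries + 10 allocatable callee-saved + 9 reserved = 32 entries)
+- [ ] Populate rv_fp_phys[] (per the masks in section 11: 8 argument/return + 12 callee-saved + 12 reserved temporaries = 32 entries)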
+- [ ] Define rv_classes[] with correct caller_saved, callee_saved, arg, ret, reserved masks
+- [ ] Create rv_reg_info pointing to rv_classes[]
+- [ ] Implement rv_class_for_type(), rv_addr_legal(), rv_imm_legal()
+- [ ] Wire up .regs, .class_for_type, .imm_legal, .addr_legal, .has_store_zero_reg, .store_zero_reg in rv64_native_target_new()
+
+
+
+
+---
+
+# RV64 NativeTarget Porting Guide — GROUP 3: Data Movement, ALU, Control Flow, Addressing
+
+## Overview
+This guide details the implementation of rv64 NativeTarget hooks for GROUP 3 operations, mirroring the aa64 reference implementation (/Users/ryan/code/cfree/src/arch/aa64/native.c) and mining the correct rv64 legacy code (/Users/ryan/code/cfree/src/arch/rv64/ops.c, emit.c, alloc.c, isa.h). RV64 has no condition flags (unlike aa64's NZCV), so a comparison either materializes its boolean result into a register (SLT/SLTU for integers, FEQ/FLT/FLE for FP) or is fused into a compare-and-branch instruction (BEQ/BNE/BLT/BGE/BLTU/BGEU).
+
+## ISA Encoder Helpers (src/arch/rv64/isa.h)
+- **rv_r(funct7, rs2, rs1, funct3, rd, op)** — R-type: ADD/SUB/SLL/SRL/SRA/MUL/DIV etc.
+- **rv_i(imm12, rs1, funct3, rd, op)** — I-type: ADDI/ANDI/ORI/XORI/SLTI/SLTIU/loads/JALR
+- **rv_s(imm12, rs2, rs1, funct3, op)** — S-type: SB/SH/SW/SD/FSW/FSD
+- **rv_b(imm13, rs2, rs1, funct3, op)** — B-type: BEQ/BNE/BLT/BGE/BLTU/BGEU
+- **rv_u(imm32_hi20, rd, op)** — U-type: LUI/AUIPC (imm32_hi20 = upper 20 bits, shifted left 12)
+- **rv_j(imm21, rd, op)** — J-type: JAL
+- **rv_sh1add/rv_sh2add/rv_sh3add(rd, rs1, rs2)** — Zba: (rs1 << {1,2,3}) + rs2
+
+Integer register mnemonics: RV_ZERO/RV_X0 (x0), RV_RA/RV_X1 (x1, return address), RV_SP/RV_X2 (x2, stack pointer), RV_GP/RV_X3, RV_TP/RV_X4, RV_T0..RV_T6 (x5, x6, x7, x28..x31 temp), RV_S0..RV_S11 (x8, x9, x18..x27 callee-saved), RV_A0..RV_A7 (x10..x17 args).
+
+FP format constants: RV_FMT_S (0) for float, RV_FMT_D (1) for double.
+
+## Key Implementation Patterns
+
+### Load Immediate (rv64_emit_load_imm — src/arch/rv64/emit.c:117)
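+For **sf=1 (64-bit)** and large immediates:
+1. Recursively decompose via hi20/lo12 split; lo12 is sign-extended by the add, so hi20 is taken from imm + 0x800
+2. Use LUI rd, hi20 (upper 20 bits) for the part that fits in 32 bits
+3. Add ADDIW rd, rd, lo12 if lo12 ≠ 0
+4. For values wider than 32 bits, shift with SLLI and add each further 12-bit chunk with ADDI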
+
+For **sf=0 (32-bit)**:
+- Fits in 12-bit signed: ADDI rd, x0, imm12
+- Otherwise: LUI rd, hi; ADDIW rd, rd, lo (as above)
+
+Address-computation path: In load_imm_native and during addr materialization, use emit_li_32 (emit.c:76) or rv64_emit_load_imm. The latter auto-detects sign-extend range and chooses the shortest encoding.
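
The hi20/lo12 split is easy to get wrong at the boundary where bit 11 is set, so here is a minimal sketch of it for values that fit in 32 bits, together with a software model of the LUI + ADDIW pair. The names `RvHiLo`, `rv_split_imm32`, and `rv_model_lui_addiw` are hypothetical and are not taken from emit.c; the model assumes standard RV64 semantics (LUI sign-extends its 32-bit result, ADDIW adds in 32 bits and sign-extends).

```c
#include <stdint.h>

typedef struct {
  uint32_t hi20; /* 20-bit field for LUI */
  int32_t lo12;  /* signed 12-bit immediate for ADDIW */
} RvHiLo;

/* Split a 32-bit constant so that (hi20 << 12) + lo12 == imm (mod 2^32).
 * lo12 is the sign-extended low 12 bits, so hi20 absorbs the carry
 * whenever bit 11 of imm is set. */
static RvHiLo rv_split_imm32(int32_t imm) {
  RvHiLo r;
  r.lo12 = (int32_t)((((uint32_t)imm & 0xFFFu) ^ 0x800u)) - 0x800;
  /* imm - lo12 is an exact multiple of 4096, so division is exact. */
  r.hi20 = (uint32_t)(((int64_t)imm - r.lo12) / 4096) & 0xFFFFFu;
  return r;
}

/* Model of: LUI rd, hi20 ; ADDIW rd, rd, lo12 on RV64. */
static int64_t rv_model_lui_addiw(RvHiLo s) {
  int64_t lui = (int64_t)(int32_t)(s.hi20 << 12);          /* sign-extended */
  uint32_t sum32 = (uint32_t)((uint64_t)lui + (uint64_t)(int64_t)s.lo12);
  return (int64_t)(int32_t)sum32;                          /* ADDIW result */
}
```

The case 0x7FFFFFFF shows why the second instruction is ADDIW rather than ADDI: LUI 0x80000 yields a negative 64-bit value, and only the 32-bit wrapping add brings the result back to the intended positive constant.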
+
+### Address Folding (Zba Index — src/arch/rv64/ops.c:273)
+rv_fold_indexed materializes `base + (index << log2_scale)` into a scratch register using:
+- log2_scale=0: ADD scratch, base, index
+- log2_scale=1: SH1ADD scratch, index, base  (= (index<<1) + base)
+- log2_scale=2: SH2ADD scratch, index, base  (= (index<<2) + base)
+- log2_scale=3: SH3ADD scratch, index, base  (= (index<<3) + base)
+
+Then update the addr tuple: base ← scratch, index ← REG_NONE, log2_scale ← 0.
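
The operand order is the easy thing to get backwards here: ADD takes (base, index) while SHxADD takes (index, base), because the shifted operand is rs1. The sketch below picks the instruction word for each scale. `enc_r` is a local stand-in that only mirrors the argument order of `rv_r` listed above, and `fold_index_word` is a hypothetical name, not the signature of `rv_fold_indexed`.

```c
#include <stdint.h>

/* Local stand-in for an R-type encoder (argument order as in rv_r). */
static uint32_t enc_r(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                      uint32_t funct3, uint32_t rd, uint32_t op) {
  return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) |
         (rd << 7) | op;
}

/* Instruction word for scratch = base + (index << log2_scale).
 * Returns 0 for scales that have no single-instruction form. */
static uint32_t fold_index_word(unsigned log2_scale, uint32_t scratch,
                                uint32_t base, uint32_t index) {
  switch (log2_scale) {
    case 0: /* ADD scratch, base, index */
      return enc_r(0x00, index, base, 0x0, scratch, 0x33);
    case 1: /* SH1ADD scratch, index, base */
      return enc_r(0x10, base, index, 0x2, scratch, 0x33);
    case 2: /* SH2ADD scratch, index, base */
      return enc_r(0x10, base, index, 0x4, scratch, 0x33);
    case 3: /* SH3ADD scratch, index, base */
      return enc_r(0x10, base, index, 0x6, scratch, 0x33);
    default:
      return 0;
  }
}
```

A caller would treat the zero return as "shift the index separately, then ADD", which this sketch leaves out.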
+
+### Integer Sign/Zero Extension (src/arch/rv64/ops.c:322, 329, src/arch/rv64/alloc.c:322)
+For **CV_SEXT** on 32-bit source (src/arch/rv64/ops.c:901):
+- To 64-bit: ADDIW rd, rs, 0 (sign-extends low 32)
+
+For **CV_ZEXT** on 32-bit source (src/arch/rv64/ops.c:914):
+- To 64-bit: SLLI rd, rs, 32 ; SRLI rd, rd, 32
+
+For **CV_SEXT** on <32-bit (e.g., 16-bit):
+- sh = 64 - src_bits
+- SLLI rd, rs, sh ; SRAI rd, rd, sh
+
+For **CV_ZEXT** on <32-bit:
+- sh = 64 - src_bits
+- SLLI rd, rs, sh ; SRLI rd, rd, sh
+
+Canonical i32 CMP operands (src/arch/rv64/alloc.c:337): Before signed order comparisons (CMP_LT_S etc.) or EQ/NE on 32-bit types, sign-extend both operands. For unsigned order comparisons, zero-extend both.
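
To sanity-check the shift pairs above, the following sketch models SLLI/SRAI and SLLI/SRLI on a 64-bit value in plain C. The function names are hypothetical; `ext_sra64` is written without relying on the implementation-defined behaviour of `>>` on negative signed values.

```c
#include <stdint.h>

/* Portable model of SRAI for 0 <= sh <= 63. */
static uint64_t ext_sra64(uint64_t v, unsigned sh) {
  uint64_t logical = v >> sh;
  uint64_t fill = (v >> 63) ? ~(~0ull >> sh) : 0; /* replicate sign bit */
  return logical | fill;
}

/* Model of: SLLI rd, rs, sh ; SRAI rd, rd, sh   with sh = 64 - src_bits.
 * Valid for 1 <= src_bits <= 64. */
static uint64_t ext_sext(uint64_t rs, unsigned src_bits) {
  unsigned sh = 64 - src_bits;
  return ext_sra64(rs << sh, sh);
}

/* Model of: SLLI rd, rs, sh ; SRLI rd, rd, sh   with sh = 64 - src_bits. */
static uint64_t ext_zext(uint64_t rs, unsigned src_bits) {
  unsigned sh = 64 - src_bits;
  return (rs << sh) >> sh;
}
```

With `src_bits == 32` the sign-extending model produces the same value as ADDIW rd, rs, 0, which is why the 32-bit case can use the single-instruction form.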
+
+## Hook Implementation Sketches
+
+### move(NativeTarget* t, NativeLoc dst, NativeLoc src)
+**Location**: /Users/ryan/code/cfree/src/arch/aa64/native.c:1707
+
+Integer register-to-register: Move via ADDI rd, rs, 0 (rv_addi).
+FP-to-FP: Use FSGNJ.fmt rd, rs, rs (rv_fsgnj(fmt, rd, rs, rs)), the canonical FP register copy.
+Int-to-FP: FMV.D.X rd, rs (rv_fmv_d_x) for 64-bit, FMV.W.X for 32-bit.
+FP-to-Int: FMV.X.D rd, rs (rv_fmv_x_d) for 64-bit, FMV.X.W for 32-bit.
+
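+Elision: Skip only if source and destination are the same register in the same class. The x and f registers are separate files that share the numbering 0-31, so equal indices in different classes still need an FMV.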
+
+### load_imm(NativeTarget* t, NativeLoc dst_reg, i64 imm)
+**Reference**: aa64 /Users/ryan/code/cfree/src/arch/aa64/native.c:1740; rv64 /Users/ryan/code/cfree/src/arch/rv64/emit.c:117
+
+```c
+void rv64_native_load_imm(NativeTarget* t, NativeLoc dst, i64 imm) {
+  u32 rd = loc_reg(dst);
+  int is_64 = (type_size(t, dst.type) == 8);
+  rv64_emit_load_imm(t->mc, is_64, rd, imm);
+}
+```
+
+Call rv64_emit_load_imm (which handles the LUI/ADDIW/SLLI/ADDI sequence internally).
+
+### load_const(NativeTarget* t, NativeLoc dst_reg, ConstBytes cbytes)
+**Reference**: aa64 /Users/ryan/code/cfree/src/arch/aa64/native.c:1744; rv64 /Users/ryan/code/cfree/src/arch/rv64/ops.c:43
+
+Pack the bytes into a u64 (little-endian):
+```c
+u64 v = 0;
+for (u32 i = 0; i < cbytes.size; ++i)
+  v |= (u64)cbytes.bytes[i] << (i * 8);
+```
+
+If FP dst: materialize into a temp (t0), then move to the FP register (move(t, dst, tmp_loc)).
+If Int dst: call load_imm with the packed value.
+
+### load_addr(NativeTarget* t, NativeLoc dst_reg, NativeAddr addr)
+**Reference**: aa64 /Users/ryan/code/cfree/src/arch/aa64/native.c:1759
+
+**NATIVE_ADDR_BASE_FRAME**: Frame slot address.
+- Compute FP offset (frame-relative address)
+- If offset fits the signed 12-bit range [-2048, 2047]: ADDI rd, fp, off
+- Else: LUI/ADDIW sequence to materialize offset, then ADD rd, fp, t0
+- Apply index if present (via Zba sh{1,2,3}add)
+
+**NATIVE_ADDR_BASE_FRAME_VALUE**: Load pointer from frame, add offset.
+- Recursively load base address from frame slot (use enc_int_load for LD)
+- Add offset via ADDI if fits, else materialize then ADD
+- Apply index
+
+**NATIVE_ADDR_BASE_REG**: Register + offset.
+- ADDI rd, base_reg, offset (or LUI/ADDIW + ADD for large offsets)
+- Apply index
+
+**NATIVE_ADDR_BASE_GLOBAL**: Global symbol.
+- If extern-via-GOT: AUIPC rd, %got_pcrel_hi(sym) + LD rd, %pcrel_lo(.)(rd), then ADDI for addend
+- Else: AUIPC rd, %pcrel_hi(sym) + ADDI rd, %pcrel_lo(.)(rd) with relocations
+- Emit relocs at each site (R_RV_PCREL_HI20 on AUIPC, R_RV_PCREL_LO12_I on the load/add)
+- Apply index
+
+Index application (Zba): If index_kind != NONE, fold via rv_sh{1,2,3}add before returning.
+
+### load(NativeTarget* t, NativeLoc dst_reg, NativeAddr addr, MemAccess mem)
+**Reference**: aa64 /Users/ryan/code/cfree/src/arch/aa64/native.c:1825
+
+1. Fold any indexed address component (rv_fold_indexed into a scratch)
+2. Materialize base + offset via addr_mode (lines 160–204 in rv64/ops.c)
+3. Emit the appropriate load instruction:
+   - FP: FLD (8-byte, funct3=0x3) or FLW (4-byte, funct3=0x2)
+   - Int: enc_int_load(mem.size, sign_extend, rd, base, offset) → LB/LH/LW/LD/LBU/LHU/LWU
+
+```c
+void rv64_native_load(NativeTarget* t, NativeLoc dst, NativeAddr addr, MemAccess mem) {
+  // Fold index
+  NativeAddr a = addr;
+  if (addr.index_kind != NATIVE_ADDR_INDEX_NONE) {
+    // Materialize index fold into scratch
+    // (Zba sh1add/sh2add/sh3add as per rv_fold_indexed pattern)
+  }
+  // Materialize base + offset
+  RvAddrMode am = addr_mode(t, a, RV_T0);
+  // Emit load
+  u32 sz = mem.size;
+  if (/* FP */) {
+    rv64_emit32(t->mc, (sz == 8) ? rv_fld(loc_reg(dst), am.base, am.ofs)
+                                  : rv_flw(loc_reg(dst), am.base, am.ofs));
+  } else {
+    rv64_emit32(t->mc, enc_int_load(sz, /* sign_extend */, loc_reg(dst), am.base, am.ofs));
+  }
+}
+```
+
+### store(NativeTarget* t, NativeAddr addr, NativeLoc src_reg, MemAccess mem)
+**Reference**: aa64 /Users/ryan/code/cfree/src/arch/aa64/native.c:1830
+
+Parallel to load:
+1. Fold indexed address
+2. Materialize base + offset
+3. Emit store (enc_int_store or FSW/FSD)
+
+```c
+void rv64_native_store(NativeTarget* t, NativeAddr addr, NativeLoc src, MemAccess mem) {
+  // Fold + materialize as above
+  RvAddrMode am = /* ... */;
+  u32 sz = mem.size;
+  if (/* FP */) {
+    rv64_emit32(t->mc, (sz == 8) ? rv_fsd(loc_reg(src), am.base, am.ofs)
+                                  : rv_fsw(loc_reg(src), am.base, am.ofs));
+  } else {
+    rv64_emit32(t->mc, enc_int_store(sz, loc_reg(src), am.base, am.ofs));
+  }
+}
+```
+
+### tls_addr_of(NativeTarget* t, NativeLoc dst_reg, ObjSymId sym, i64 addend)
+**Reference**: aa64 /Users/ryan/code/cfree/src/arch/aa64/native.c:1835; rv64 /Users/ryan/code/cfree/src/arch/rv64/ops.c:491
+
+**ELF LE (local-exec model)**:
+1. Materialize TLS offset via LUI/ADDIW
+2. ADD rd, RV_TP (thread pointer), offset
+3. Emit R_RV_TPREL_HI20 / R_RV_TPREL_LO12_I relocations
+
+### copy_bytes(NativeTarget* t, NativeAddr dst, NativeAddr src, AggregateAccess access)
+**Reference**: aa64 /Users/ryan/code/cfree/src/arch/aa64/native.c:1922
+
+Forward copy (non-overlapping) or backward (overlapping). For each granule (8/4/2/1 bytes):
+1. Load from src + offset
+2. Store to dst + offset
+
+```c
+void rv64_native_copy_bytes(NativeTarget* t, NativeAddr dst, NativeAddr src, AggregateAccess access) {
+  CfreeCgTypeId i64 = /* i64 type id */, i32, i16, i8;
+  NativeLoc tmp = /* tmp_loc(i64, RV_T0) */;
+  for (u32 off = 0; off < access.size; ) {
+    u32 rem = access.size - off;
+    u32 sz = (rem >= 8) ? 8 : (rem >= 4) ? 4 : (rem >= 2) ? 2 : 1;
+    MemAccess mem = /* set size, align */;
+    load(t, tmp, /* src + off */, mem);
+    store(t, /* dst + off */, tmp, mem);
+    off += sz;
+  }
+}
+```
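
The granule selection in the loop above can be exercised on its own. This sketch replaces the emitted load/store pair with a `memcpy` of the same width and returns how many pairs the hook would emit; `copy_bytes_model` is a hypothetical name used only for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Forward copy using the widest granule (8/4/2/1) that still fits.
 * Returns the number of load/store pairs the real hook would emit. */
static uint32_t copy_bytes_model(unsigned char* dst, const unsigned char* src,
                                 uint32_t size) {
  uint32_t pairs = 0;
  for (uint32_t off = 0; off < size;) {
    uint32_t rem = size - off;
    uint32_t sz = (rem >= 8) ? 8 : (rem >= 4) ? 4 : (rem >= 2) ? 2 : 1;
    memcpy(dst + off, src + off, sz); /* stands in for load + store */
    off += sz;
    ++pairs;
  }
  return pairs;
}
```

A 15-byte aggregate, for example, is covered by one access of each width (8, 4, 2, 1), so the copy costs four pairs rather than fifteen byte moves.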
+
+### set_bytes(NativeTarget* t, NativeAddr dst, NativeLoc byte_value, AggregateAccess access)
+**Reference**: aa64 /Users/ryan/code/cfree/src/arch/aa64/native.c:1927
+
+Loop, storing the byte_value repetitively:
+```c
+void rv64_native_set_bytes(NativeTarget* t, NativeAddr dst, NativeLoc byte_value, AggregateAccess access) {
+  CfreeCgTypeId i8 = /* i8 type id */;
+  MemAccess mem = /* i8, size=1, align=1 */;
+  NativeLoc byte = byte_value;
+  byte.type = i8;
+  for (u32 off = 0; off < access.size; ++off)
+    store(t, /* dst + off */, byte, mem);
+}
+```
+
+### bitfield_load / bitfield_store
+**Reference**: aa64 /Users/ryan/code/cfree/src/arch/aa64/native.c:2328, 2350
+
+Extract/insert a bitfield. Use SLLI/SRLI/SRAI to mask and position:
+- Load: SLLI to align, SRLI/SRAI to extract (sign-extend if signed)
+- Store: Mask old value, shift new value, OR together
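
As a worked model of the two bullets above, the sketch below extracts and inserts a field inside a 64-bit container using only shifts, AND, and OR. The function names and the (lsb, width) parameterization are assumptions made for this example; the actual hook signatures are not shown in this guide. It assumes width >= 1 and lsb + width <= 64.

```c
#include <stdint.h>

/* Portable model of SRAI for 0 <= sh <= 63. */
static uint64_t bf_sra64(uint64_t v, unsigned sh) {
  uint64_t fill = (v >> 63) ? ~(~0ull >> sh) : 0;
  return (v >> sh) | fill;
}

/* Load, unsigned: SLLI to put the field's top bit at bit 63, then SRLI. */
static uint64_t bf_extract_u(uint64_t word, unsigned lsb, unsigned width) {
  unsigned left = 64 - (lsb + width);
  return (word << left) >> (64 - width);
}

/* Load, signed: same SLLI, then SRAI so the field's top bit is replicated. */
static uint64_t bf_extract_s(uint64_t word, unsigned lsb, unsigned width) {
  unsigned left = 64 - (lsb + width);
  return bf_sra64(word << left, 64 - width);
}

/* Store: clear the field in the old word, shift the new value into place,
 * and OR the two together. */
static uint64_t bf_insert(uint64_t word, unsigned lsb, unsigned width,
                          uint64_t value) {
  uint64_t mask = (~0ull >> (64 - width)) << lsb;
  return (word & ~mask) | ((value << lsb) & mask);
}
```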
+
+### binop(NativeTarget* t, BinOp op, NativeLoc dst, NativeLoc a_reg, NativeLoc b_reg_or_imm)
+**Reference**: aa64 /Users/ryan/code/cfree/src/arch/aa64/native.c:2019; rv64 /Users/ryan/code/cfree/src/arch/rv64/ops.c:697
+
+**FP operations** (BO_FADD/FSUB/FMUL/FDIV):
+```c
+u32 fmt = (type_size(t, dst.type) == 8) ? RV_FMT_D : RV_FMT_S;
+switch (op) {
+  case BO_FADD: rv64_emit32(mc, rv_fadd(fmt, rd, ra, rb)); break;
+  case BO_FSUB: rv64_emit32(mc, rv_fsub(fmt, rd, ra, rb)); break;
+  case BO_FMUL: rv64_emit32(mc, rv_fmul(fmt, rd, ra, rb)); break;
+  case BO_FDIV: rv64_emit32(mc, rv_fdiv(fmt, rd, ra, rb)); break;
+}
+```
+
+**Integer immediate fast paths** (sf = type_size == 8):
+- BO_IADD with imm in the signed 12-bit range [-2048, 2047]: ADDI rd, ra, imm (or ADDIW for 32-bit)
+- BO_ISUB with -imm in [-2048, 2047] (that is, imm in [-2047, 2048]): ADDI rd, ra, -imm (negation)
+- BO_AND/OR/XOR with imm: ANDI/ORI/XORI rd, ra, imm
+- BO_SHL/SHR_U/SHR_S with imm: SLLI/SRLI/SRAI rd, ra, shamt (shamt masked to 5 bits for 32-bit, 6 for 64-bit)
+
+**Register-register** (both operands in registers or IMM out of range):
+- BO_IADD: ADD rd, ra, rb (or ADDW for 32-bit)
+- BO_ISUB: SUB rd, ra, rb (or SUBW)
+- BO_IMUL: MUL rd, ra, rb (or MULW)
+- BO_SDIV/UDIV: DIV/DIVU rd, ra, rb (or DIVW/DIVUW)
+- BO_SREM/UREM: REM/REMU rd, ra, rb (or REMW/REMUW)
+- BO_AND/OR/XOR: AND/OR/XOR rd, ra, rb
+- BO_SHL: SLL rd, ra, rb (or SLLW)
+- BO_SHR_U: SRL rd, ra, rb (or SRLW)
+- BO_SHR_S: SRA rd, ra, rb (or SRAW)
+
+Commutative canonicalization (ops.c:728): Swap a_op ↔ b_op for IADD/AND/OR/XOR if a is IMM and b is not, so the imm-form check handles both orders.
+
+### unop(NativeTarget* t, UnOp op, NativeLoc dst, NativeLoc src)
+**Reference**: aa64 /Users/ryan/code/cfree/src/arch/aa64/native.c:2109; rv64 /Users/ryan/code/cfree/src/arch/rv64/ops.c:860
+
+**FP negation** (UO_FNEG):
+- FSGNJN.fmt rd, rs, rs (fsgnj with negation, rv_fsgnjn)
+
+**Integer negation** (UO_NEG):
+- SUB rd, x0, rs (or SUBW for 32-bit); rv_sub(rd, RV_ZERO, rs)
+
+**Bitwise NOT** (UO_BNOT):
+- XORI rd, rs, -1; rv_xori(rd, rs, -1)
+
+**Logical NOT** (UO_NOT):
+- SLTIU rd, rs, 1 (set if rs < 1, i.e., rs == 0); rv_sltiu(rd, rs, 1)
+
+### cmp(NativeTarget* t, CmpOp op, NativeLoc dst, NativeLoc a_reg, NativeLoc b_reg_or_imm)
+**Reference**: aa64 /Users/ryan/code/cfree/src/arch/aa64/native.c:2158; rv64 /Users/ryan/code/cfree/src/arch/rv64/alloc.c:431
+
+**FP comparisons** (CMP_EQ, CMP_NE, CMP_LT_F, CMP_LE_F, CMP_GT_F, CMP_GE_F):
+- FEQ.fmt rd, fa, fb; for CMP_EQ (rv_feq_s/rv_feq_d)
+- FLT.fmt rd, fa, fb; for CMP_LT_F (rv_flt_s/rv_flt_d)
+- FLE.fmt rd, fa, fb; for CMP_LE_F (rv_fle_s/rv_fle_d)
+- Implement GT/GE by swapping operands: CMP_GT → FLT(fb, fa), CMP_GE → FLE(fb, fa)
+- For CMP_NE: FEQ, then XORI rd, rd, 1
+
+**Integer comparisons**:
+1. Canonicalize i32 operands (sign/zero extend as per context)
+2. Use SLT/SLTU/BEQ-based sequences:
+   - CMP_EQ: SUB rd, ra, rb; SLTIU rd, rd, 1 (set if diff == 0)
+   - CMP_NE: SUB rd, ra, rb; SLTU rd, x0, rd (set if diff != 0)
+   - CMP_LT_S: SLT rd, ra, rb
+   - CMP_LT_U: SLTU rd, ra, rb
+   - CMP_GT_S: SLT rd, rb, ra (swapped operands)
+   - CMP_GT_U: SLTU rd, rb, ra
+   - CMP_GE_S: SLT rd, ra, rb; XORI rd, rd, 1 (NOT of <)
+   - CMP_GE_U: SLTU rd, ra, rb; XORI rd, rd, 1
+   - CMP_LE_S: SLT rd, rb, ra; XORI rd, rd, 1
+   - CMP_LE_U: SLTU rd, rb, ra; XORI rd, rd, 1
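
The flag-free integer sequences above can be checked against ordinary C comparisons with a small model. `m_slt` and `m_sltu` stand for the SLT and SLTU instructions; all names here are hypothetical, and operands are assumed to be already canonicalized to 64 bits.

```c
#include <stdint.h>

/* Models of SLT and SLTU: the result is 1 or 0. */
static uint64_t m_slt(int64_t a, int64_t b) { return a < b ? 1u : 0u; }
static uint64_t m_sltu(uint64_t a, uint64_t b) { return a < b ? 1u : 0u; }

/* CMP_EQ: SUB rd, ra, rb ; SLTIU rd, rd, 1 */
static uint64_t lower_eq(uint64_t a, uint64_t b) { return m_sltu(a - b, 1); }

/* CMP_NE: SUB rd, ra, rb ; SLTU rd, x0, rd */
static uint64_t lower_ne(uint64_t a, uint64_t b) { return m_sltu(0, a - b); }

/* CMP_GT_S: SLT rd, rb, ra (swapped operands) */
static uint64_t lower_gt_s(int64_t a, int64_t b) { return m_slt(b, a); }

/* CMP_GE_S: SLT rd, ra, rb ; XORI rd, rd, 1 */
static uint64_t lower_ge_s(int64_t a, int64_t b) { return m_slt(a, b) ^ 1u; }

/* CMP_LE_S: SLT rd, rb, ra ; XORI rd, rd, 1 */
static uint64_t lower_le_s(int64_t a, int64_t b) { return m_slt(b, a) ^ 1u; }

/* CMP_GT_U: SLTU rd, rb, ra */
static uint64_t lower_gt_u(uint64_t a, uint64_t b) { return m_sltu(b, a); }

/* CMP_LE_U: SLTU rd, rb, ra ; XORI rd, rd, 1 */
static uint64_t lower_le_u(uint64_t a, uint64_t b) { return m_sltu(b, a) ^ 1u; }
```

The unsigned variants matter for values with the top bit set: the all-ones pattern is the largest unsigned value but is -1 when signed, so the signed and unsigned forms give opposite answers against 0.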
+
+### convert(NativeTarget* t, ConvKind op, NativeLoc dst, NativeLoc src)
+**Reference**: aa64 /Users/ryan/code/cfree/src/arch/aa64/native.c:2164; rv64 /Users/ryan/code/cfree/src/arch/rv64/ops.c:894
+
+**CV_SEXT** (sign-extend):
+- 32 bits: ADDIW rd, rs, 0
+- < 32 bits: SLLI rd, rs, (64 - src_bits); SRAI rd, rd, (64 - src_bits)
+
+**CV_ZEXT** (zero-extend):
+- 32 bits: SLLI rd, rs, 32; SRLI rd, rd, 32
+- < 32 bits: SLLI rd, rs, (64 - src_bits); SRLI rd, rd, (64 - src_bits)
+
+**CV_TRUNC**: ADDIW rd, rs, 0 (truncates to 32 bits, sign-extends; narrower widths handled by store)
+
+**CV_ITOF_S** (int → float, signed):
+- FCVT.D.L / FCVT.D.W (64-bit / 32-bit src to double)
+- FCVT.S.L / FCVT.S.W (to single)
+
+**CV_ITOF_U** (unsigned):
+- FCVT.D.LU / FCVT.D.WU
+- FCVT.S.LU / FCVT.S.WU
+
+**CV_FTOI_S** (float → int, signed):
+- FCVT.L.D / FCVT.W.D (double to 64-bit / 32-bit)
+- FCVT.L.S / FCVT.W.S (from single)
+
+**CV_FTOI_U** (unsigned):
+- FCVT.LU.D / FCVT.WU.D
+- FCVT.LU.S / FCVT.WU.S
+
+**CV_FEXT** (float extend, single → double):
+- FCVT.D.S rd, rs; rv_fcvt_d_s(rd, rs)
+
+**CV_FTRUNC** (float truncate, double → single):
+- FCVT.S.D rd, rs; rv_fcvt_s_d(rd, rs)
+
+**CV_BITCAST** (bitcast between int/float registers):
+- Int → FP: FMV.D.X / FMV.W.X (rv_fmv_d_x / rv_fmv_w_x)
+- FP → Int: FMV.X.D / FMV.X.W (rv_fmv_x_d / rv_fmv_x_w)
+- Same-class: Move (ADDI for int, FSGNJ for FP) or elide if same register
+
+### alloca_(NativeTarget* t, NativeLoc dst, NativeLoc size, u32 align)
+**Reference**: aa64 /Users/ryan/code/cfree/src/arch/aa64/native.c:2234
+
+Round size up: ADDI t0, size, (align - 1); ANDI t0, t0, -align (both immediates must fit the signed 12-bit range)
+Decrement sp: SUB sp, sp, t0
+Return address: ADDI dst, sp, max_outgoing (or record a patch if max_outgoing not final)
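
The arithmetic behind those three steps, as a minimal sketch: round the size up to a power-of-two alignment, lower sp, and hand back the address just above the outgoing-argument area. `round_up_pow2` and `alloca_model` are hypothetical names, and `align` is assumed to be a power of two.

```c
#include <stdint.h>

/* (size + align - 1) & -align, valid when align is a power of two. */
static uint64_t round_up_pow2(uint64_t size, uint64_t align) {
  return (size + (align - 1)) & (0 - align);
}

/* Model of the sequence: lowers *sp by the rounded size and returns the
 * address of the new block, which sits above the outgoing area. */
static uint64_t alloca_model(uint64_t* sp, uint64_t size, uint64_t align,
                             uint64_t max_outgoing) {
  *sp -= round_up_pow2(size, align);
  return *sp + max_outgoing;
}
```

Because `-align` is all ones above the alignment bit, the mask clears exactly the low bits that the preceding add may have carried past.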
+
+### spill / reload
+**Reference**: aa64 /Users/ryan/code/cfree/src/arch/aa64/native.c (via aa_emit_mem)
+
+Spill src_reg to a frame slot: Materialize the slot's address, then call store.
+Reload from slot to dst_reg: Load from the materialized address.
+
+### label_new / label_place / jump
+**Reference**: rv64 /Users/ryan/code/cfree/src/arch/rv64/alloc.c:261
+
+```c
+MCLabel rv64_native_label_new(NativeTarget* t) {
+  return t->mc->label_new(t->mc);
+}
+
+void rv64_native_label_place(NativeTarget* t, MCLabel label) {
+  t->mc->label_place(t->mc, label);
+}
+
+void rv64_native_jump(NativeTarget* t, MCLabel label) {
+  rv64_emit32(t->mc, rv_jal(RV_ZERO, 0));  // JAL x0, offset (discards return address)
+  t->mc->emit_label_ref(t->mc, label, R_RV_JAL, 4, 0);
+}
+```
+
+### cmp_branch(NativeTarget* t, CmpOp op, NativeLoc a, NativeLoc b, MCLabel label)
+**Reference**: rv64 /Users/ryan/code/cfree/src/arch/rv64/alloc.c:355
+
+**FP branch** (CMP_LT_F, CMP_LE_F, CMP_GT_F, CMP_GE_F):
+1. Materialize comparison into a register via FLT/FLE
+2. Branch: BNE rd, x0, label
+
+**Integer branch**:
+1. Canonicalize i32 operands if needed
+2. Emit appropriate branch:
+   - CMP_EQ: BEQ ra, rb, 0
+   - CMP_NE: BNE ra, rb, 0
+   - CMP_LT_S: BLT ra, rb, 0
+   - CMP_GE_S: BGE ra, rb, 0
+   - CMP_LT_U: BLTU ra, rb, 0
+   - CMP_GE_U: BGEU ra, rb, 0
+   - CMP_GT_S: BLT rb, ra, 0 (swapped)
+   - CMP_LE_S: BGE rb, ra, 0
+   - CMP_GT_U: BLTU rb, ra, 0
+   - CMP_LE_U: BGEU rb, ra, 0
+3. Emit relocation: emit_label_ref(mc, label, R_RV_BRANCH, 4, 0)
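
The ten cases above reduce to six branch opcodes plus an operand swap. The sketch below expresses that as a table and encodes the resulting B-type word. The `ExCmp` enum, the `Ex*`/`ex_*` names, and `ex_enc_b` are hypothetical stand-ins (the real code uses CmpOp and rv_b); the funct3 values and the immediate bit scatter follow the base ISA encoding.

```c
#include <stdint.h>

typedef enum {
  EX_EQ, EX_NE, EX_LT_S, EX_GE_S, EX_LT_U, EX_GE_U,
  EX_GT_S, EX_LE_S, EX_GT_U, EX_LE_U
} ExCmp;

typedef struct {
  uint32_t funct3; /* BEQ=0 BNE=1 BLT=4 BGE=5 BLTU=6 BGEU=7 */
  int swap;        /* 1: branch on (rb, ra) instead of (ra, rb) */
} ExBranch;

static ExBranch ex_branch_for(ExCmp op) {
  switch (op) {
    case EX_EQ:   return (ExBranch){0x0, 0};
    case EX_NE:   return (ExBranch){0x1, 0};
    case EX_LT_S: return (ExBranch){0x4, 0};
    case EX_GE_S: return (ExBranch){0x5, 0};
    case EX_LT_U: return (ExBranch){0x6, 0};
    case EX_GE_U: return (ExBranch){0x7, 0};
    case EX_GT_S: return (ExBranch){0x4, 1}; /* BLT rb, ra */
    case EX_LE_S: return (ExBranch){0x5, 1}; /* BGE rb, ra */
    case EX_GT_U: return (ExBranch){0x6, 1}; /* BLTU rb, ra */
    case EX_LE_U: return (ExBranch){0x7, 1}; /* BGEU rb, ra */
  }
  return (ExBranch){0x0, 0};
}

/* B-type: imm[12|10:5] rs2 rs1 funct3 imm[4:1|11] opcode(0x63). */
static uint32_t ex_enc_b(int32_t imm13, uint32_t rs2, uint32_t rs1,
                         uint32_t funct3) {
  uint32_t u = (uint32_t)imm13;
  return (((u >> 12) & 1u) << 31) | (((u >> 5) & 0x3Fu) << 25) |
         (rs2 << 20) | (rs1 << 15) | (funct3 << 12) |
         (((u >> 1) & 0xFu) << 8) | (((u >> 11) & 1u) << 7) | 0x63u;
}

static uint32_t ex_cmp_branch_word(ExCmp op, uint32_t ra, uint32_t rb,
                                   int32_t imm13) {
  ExBranch b = ex_branch_for(op);
  return b.swap ? ex_enc_b(imm13, ra, rb, b.funct3)
                : ex_enc_b(imm13, rb, ra, b.funct3);
}
```

In the hook itself the offset is emitted as 0 and filled in through the label reference, so only the opcode choice and the operand order are decided at this point.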
+
+### indirect_branch(NativeTarget* t, NativeLoc addr, const MCLabel* valid_targets, u32 ntargets)
+**Reference**: rv64 /Users/ryan/code/cfree/src/arch/rv64/alloc.c:287
+
+```c
+void rv64_native_indirect_branch(NativeTarget* t, NativeLoc addr, const MCLabel* valid_targets, u32 ntargets) {
+  (void)valid_targets;
+  (void)ntargets;  // Not used; CFI is optional
+  u32 rs1 = loc_reg(addr);
+  rv64_emit32(t->mc, rv_jalr(RV_ZERO, rs1, 0));  // JALR x0, rs1, 0 (indirect jump)
+}
+```
+
+### load_label_addr(NativeTarget* t, NativeLoc dst, MCLabel label)
+**Reference**: rv64 /Users/ryan/code/cfree/src/arch/rv64/alloc.c:271
+
+PC-relative pair: AUIPC + ADDI with R_RV_INTRA_AUIPC_ADDI relocation (width=8, addend=0 anchors to AUIPC):
+```c
+void rv64_native_load_label_addr(NativeTarget* t, NativeLoc dst, MCLabel label) {
+  u32 rd = loc_reg(dst);
+  rv64_emit32(t->mc, rv_auipc(rd, 0));
+  rv64_emit32(t->mc, rv_addi(rd, rd, 0));
+  t->mc->emit_label_ref(t->mc, label, R_RV_INTRA_AUIPC_ADDI, 8, 0);
+}
+```
+
+## ABI & Frame Considerations
+
+- **RV64 LP64D ABI**: Integer args in a0..a7 (x10..x17), FP args in fa0..fa7 (f10..f17).
+- **Callee-saved**: s0..s11 (x8, x9, x18..x27).
+- **Frame layout** (known-frame path, /Users/ryan/code/cfree/src/arch/rv64/emit.c:227):
+  - Outgoing area (aligned 16)
+  - Saved GP registers (callee-saves, FP saves)
+  - Local slots
+  - Saved s0/ra (16 bytes)
+- **FP (frame pointer)**: s0 (x8); points to saved-pair; CFA = sp + frame_size.
+
+## Register References
+
+- x0: RV_ZERO (hardwired 0; write discards, read returns 0)
+- x1: RV_RA (return address)
+- x2: RV_SP (stack pointer)
+- x3: RV_GP (global pointer)
+- x4: RV_TP (thread pointer)
+- x5..x7: RV_T0..RV_T2 (temporaries, caller-saved)
+- x8: RV_S0 / RV_FP (frame pointer, callee-saved)
+- x9: RV_S1 (callee-saved)
+- x10..x17: RV_A0..RV_A7 (argument/return registers)
+- x18..x27: RV_S2..RV_S11 (callee-saved)
+- x28..x31: RV_T3..RV_T6 (temporaries, caller-saved)
+
+Floating-point registers: f0..f7 and f28..f31 (ft0..ft11, temporaries, caller-saved); f10..f17 (fa0..fa7, arguments, caller-saved); f8..f9 and f18..f27 (fs0..fs11, callee-saved).
+
+
+
+---
+
+# GROUP 4: Calls, Returns, and the ABI Interface — Porting Guide for rv64 NativeTarget
+
+## Overview
+
+This guide covers porting the call, return, and ABI-binding mechanisms to rv64's NativeTarget implementation. The contract is in `/Users/ryan/code/cfree/src/arch/native_target.h` (NativeCallDesc, NativeCallPlan, plan_call, emit_call, plan_ret, ret). The reference implementation is aa64 (`src/arch/aa64/native.c`, lines ~2614–2891), which queries the ABI via `src/abi/abi.h` (ABIFuncInfo, ABIArgInfo, ABIArgPart). The rv64 legacy code provides ISA and ABI logic in `src/arch/rv64/ops.c` (rv_call, rv_ret, rv_call_stack_size) and `src/arch/rv64/opt_coord.c` (rv_plan_call), plus the RISC-V LP64D ABI rules in `src/abi/abi_rv64.c`.
+
+## ABI Architecture
+
+Both aa64 and rv64 keep ABI decisions **behind the abi/ interface** — hardcoding is forbidden. The flow is:
+
+1. **ABIFuncInfo** (`abi_cg_func_info(c->abi, fn_type)`) gives you the signature's calling convention: parameter ABIArgInfo array, return ABIArgInfo, sret flag, variadic flag, vararg-stack behavior.
+
+2. **ABIArgInfo** classifies one parameter or return value:
+   - `kind`: ABI_ARG_IGNORE, ABI_ARG_DIRECT (parts), ABI_ARG_INDIRECT (caller passes address)
+   - `nparts` + `parts[]` (ABIArgPart): for DIRECT, each part is one register-passed or stack-passed chunk
+   - Each ABIArgPart holds: `cls` (ABI_CLASS_INT/FP), `size`, `align`, `src_offset` (offset in the original value)
+
+3. **Variadic handling**: ABIFuncInfo.variadic flag + vararg_on_stack control whether varargs bypass register pools and go straight to stack (Apple ARM64 sets this; RISC-V does not).
+
+4. **Return values**: ABIFuncInfo.ret + has_sret. If ret.kind == ABI_ARG_INDIRECT, the caller passes a pointer in a0, and plan_ret handles the aggregate copy-back.
+
+## RV64 ABI Specifics (from abi_rv64.c)
+
+- **8 integer argument registers** (a0–a7, aka x10–x17) and **8 FP argument registers** (fa0–fa7, aka f10–f17)
+- **Scalar rules**: integers ≤8B → DIRECT + one INT part; float/double → DIRECT + one FP part; void → IGNORE
+- **Small aggregate rules** (≤16B): homogeneous FP aggregates → FP parts; one FP + one INT → mixed parts; otherwise INT parts (up to 2 GPRs)
+- **Large aggregates** (>16B): INDIRECT (sret for return, byval for args)
+- **Stack arguments**: 8-byte aligned slots at sp+0, sp+8, sp+16, …
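+- **Variadic args**: passed under the integer convention (a0–a7, then stack; never in FP registers, no FP split). They are not forced to the stack, since vararg_on_stack is unset for RISC-V; handled at call/return sites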
+- **Return value registers**: a0/a1 for integers, fa0/fa1 for FP
+- **sret**: a0 holds the destination pointer; the callee copies the return value to [a0]
+
+## Step 1: Implement plan_call (analog: aa64 lines 2614–2762)
+
+### Signature
+```c
+static void rv_plan_call(NativeTarget* t, const NativeCallDesc* desc,
+                         NativeCallPlan* plan) {
+  // plan->callee = desc->callee;
+  // plan->flags = desc->flags;
+  // plan->stack_arg_size = <computed>;
+  // plan->has_sret = abi && abi->has_sret;
+  // plan->is_variadic = abi && abi->variadic;
+  // plan->args[0..nargs-1] with src/dst moves
+  // plan->rets[0..nrets-1] with src/dst return moves
+}
+```
+
+### Body Sketch
+
+1. **Query ABI**: `const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, desc->fn_type);`
+2. **Initialize plan**: zero the struct, copy callee, flags, has_sret, is_variadic.
+3. **Compute stack_arg_size**: walk arguments with lookahead cursors (next_int=0, next_fp=0, stack=0) to find which go on stack:
+   - For each arg, get ABIArgInfo from abi→params[i] (or synthesize one if no ABI)
+   - IGNORE: skip
+   - INDIRECT: takes a register if next_int < 8, else stack (8 bytes)
+   - DIRECT with parts: for each part, check if it fits in registers (next_int/next_fp counters) or stack
+   - Variadic args with vararg_on_stack: force to stack
+   - Accumulate stack offset, align to part's alignment, round final size to 16 bytes (rv64 stack is 16-byte aligned)
+4. **Sret handling**: if has_sret and not a tail call, first arg move writes the destination address to a0
+5. **Prepare arg moves**: for each argument:
+   - Create NativeCallPlanMove entries (src is desc→args[i] location, dst is the target register/stack)
+   - src_kind: NATIVE_CALL_MOVE_VALUE (load the value) or NATIVE_CALL_MOVE_ADDR (write addr-of)
+   - dst_kind: NATIVE_LOC_REG (a0–a7, fa0–fa7), NATIVE_LOC_STACK (sp+offset)
+   - For indirects on stack, compute sp+offset; use a scratch register (t0) to emit the address
+6. **Return value setup**: if not a tail call and nresults > 0:
+   - Query abi→ret
+   - If ret.kind == ABI_ARG_INDIRECT: sret case, no return moves (copy happens in plan_ret)
+   - If ret.kind == ABI_ARG_DIRECT: create NativeCallPlanRet entries, one per part (a0, a1, fa0, fa1, etc.)
+   - Each rets[i]: src is the return register, dst is desc→results[0] (adjusted by part→src_offset)
+
+### RV64-specific Details
+
+- **Scratch registers**: t0 (x5) is available for address calculations
+- **Parallel-copy cycle-breaking**: aa64 handles this with cycle detection; rv64 may not need it if the allocator cooperates, but emit conservatively (process register moves in topological order)
+- **Frame anchoring**: sp is the stack pointer; stack args are at sp+0..sp+N (unlike aa64, which uses fp for incoming args)
+- **Outgoing area growth**: track max_outgoing across all calls; update t→mc state or record patches if needed
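
For the parallel-copy point above, one conservative way to order register moves is sketched here: emit any move whose destination is no longer read by a pending move, and when only cycles remain, park one destination in a scratch register and redirect its readers. This is a generic sequentialization sketch, not the aa64 implementation; `RegMove`, `resolve_parallel_moves`, and the 16-move cap are assumptions made for the example, and the scratch register must not appear in any input move.

```c
#include <stddef.h>

typedef struct { int dst, src; } RegMove;

/* Turns a parallel copy (all sources read before any write) into a
 * sequence of single moves. out must have room for 2 * n entries.
 * Returns the number of moves written, or 0 if n exceeds the cap. */
static size_t resolve_parallel_moves(const RegMove* in, size_t n, int scratch,
                                     RegMove* out) {
  RegMove pend[16];
  int live[16];
  size_t nout = 0, left = 0;
  if (n > 16) return 0;
  for (size_t i = 0; i < n; ++i) {
    pend[i] = in[i];
    live[i] = in[i].dst != in[i].src; /* self-moves are dropped */
    left += (size_t)live[i];
  }
  while (left) {
    int progressed = 0;
    for (size_t i = 0; i < n; ++i) {
      if (!live[i]) continue;
      int blocked = 0; /* does another pending move still read our dst? */
      for (size_t j = 0; j < n; ++j)
        if (j != i && live[j] && pend[j].src == pend[i].dst) blocked = 1;
      if (blocked) continue;
      out[nout++] = pend[i];
      live[i] = 0;
      --left;
      progressed = 1;
    }
    if (progressed) continue;
    /* Only cycles remain: save one destination and redirect its readers. */
    size_t k = 0;
    while (!live[k]) ++k;
    out[nout].dst = scratch;
    out[nout].src = pend[k].dst;
    ++nout;
    for (size_t j = 0; j < n; ++j)
      if (live[j] && pend[j].src == pend[k].dst) pend[j].src = scratch;
  }
  return nout;
}
```

A two-register swap such as a0 ↔ a1 with t0 as scratch comes out as three moves: t0 ← a0, a0 ← a1, a1 ← t0.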
+
+## Step 2: Implement emit_call (analog: aa64 lines 2805–2826)
+
+### Signature
+```c
+static void rv_emit_call(NativeTarget* t, const NativeCallPlan* plan) {
+  // Emit the actual call instruction(s)
+}
+```
+
+### Body Sketch
+
+1. **Tail call path** (plan→flags & CG_CALL_TAIL):
+   - Restore callee-saved registers (if needed)
+   - Restore frame (sp/fp adjustment)
+   - Emit indirect branch (jr) or PC-relative jump (jal with R_RV_CALL reloc)
+   - Return (no fallthrough to regular return path)
+2. **Non-tail call path**:
+   - Emit argument moves (plan→args array): parallel copy of registers / stack stores
+   - If plan→has_sret and not tail: set a0 to destination pointer (already handled in plan_call)
+   - If plan→callee.kind == NATIVE_LOC_GLOBAL: emit auipc ra, 0; jalr ra, ra, 0 with R_RV_CALL reloc on auipc
+   - If plan→callee.kind == NATIVE_LOC_REG: emit jalr ra, <reg>, 0
+   - Emit return-value collection (plan→rets): load each return register into its destination
+
+### RV64 ISA Notes
+
+- **auipc ra, 0; jalr ra, ra, 0**: 2-instruction CALL sequence; R_RV_CALL relocation on auipc
+- **jalr ra, <reg>, 0**: indirect call via register
+- **jr <reg>** (pseudo → jalr x0, <reg>, 0): for tail calls
+- **Scratch t0 (x5)** for intermediate calculations without clobbering live registers
+
+## Step 3: Implement plan_ret (analog: aa64 lines 2828–2889)
+
+### Signature
+```c
+static void rv_plan_ret(NativeTarget* t, const CGFuncDesc* fd,
+                        const NativeLoc* values, u32 nvalues,
+                        NativeCallPlanRet** out_rets, u32* out_nrets) {
+  // Plan the moves of return values into a0/a1 (int), fa0/fa1 (FP), or sret
+}
+```
+
+### Body Sketch
+
+1. **Query ABI**: `const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, fd→fn_type);`
+2. **No return case**: nvalues == 0 → set *out_rets = NULL, *out_nrets = 0, return
+3. **Indirect return (sret)** — abi→ret.kind == ABI_ARG_INDIRECT:
+   - Load the sret pointer (spilled at entry in a reserved frame slot, or already in a0 for tail calls)
+   - Emit aggregate copy from values[0] to [a0] using memcpy
+   - No return moves needed (copy is in-place)
+   - Set *out_rets = NULL, *out_nrets = 0
+4. **Direct return** — abi→ret.kind == ABI_ARG_DIRECT:
+   - Allocate NativeCallPlanRet array
+   - For each part in abi→ret.parts:
+     - src: values[0] (adjusted by part→src_offset if aggregate)
+     - dst: a0/a1 for INT parts, fa0/fa1 for FP parts
+     - mem: MemAccess with part→size
+   - Set *out_rets and *out_nrets
+5. **No ABI or void return**: *out_rets = NULL, *out_nrets = 0
+
+### RV64 Details
+
+- **sret pointer location**: stored at function entry in a reserved frame slot (e.g., rv_sret_ptr_slot). Retrieve with sp/fp offset math.
+- **Return register mapping**: part→cls == ABI_CLASS_FP → fa0 (f10), fa1 (f11); else a0 (x10), a1 (x11)
+- **Part ordering**: emit parts in source order (part→src_offset) for aggregates
+
+## Step 4: Implement ret (analog: aa64 lines 2891–2894)
+
+### Signature
+```c
+static void rv_ret(NativeTarget* t) {
+  // Emit return (jump to epilogue or direct ret instruction)
+}
+```
+
+### Body Sketch
+
+1. **Single-pass mode** (no known_frame): emit jal x0, <epilogue_label> (NOP-placeholder until patched)
+2. **Known-frame mode** (optimized): epilogue is already emitted inline; just emit jr ra
+
+## Step 5: Helper Functions for ABI Queries
+
+### rv_param_abi (analog: aa64 lines 2385–2401)
+```c
+static const ABIArgInfo* rv_param_abi(NativeTarget* t, const ABIFuncInfo* abi,
+                                      const NativeCallDesc* desc, u32 i,
+                                      ABIArgInfo* scratch) {
+  if (abi && i < abi->nparams) return &abi->params[i];
+  // Synthesize a default DIRECT + INT part for untyped/extern calls
+  memset(scratch, 0, sizeof *scratch);
+  scratch->kind = ABI_ARG_DIRECT;
+  scratch->nparts = 1;
+  scratch->parts = arena_zarray(t->c->tu, ABIArgPart, 1);
+  scratch->parts[0].cls = cg_type_is_float(t->c, desc->args[i].type) ? ABI_CLASS_FP : ABI_CLASS_INT;
+  scratch->parts[0].loc = ABI_LOC_REG;
+  scratch->parts[0].size = type_size32(t, desc->args[i].type);
+  scratch->parts[0].align = type_align32(t, desc->args[i].type);
+  scratch->parts[0].src_offset = 0;
+  return scratch;
+}
+```
+
+### rv_part_scalar_type (analog: aa64 lines 2417–2436)
+```c
+static CfreeCgTypeId rv_part_scalar_type(const ABIArgPart* part) {
+  if (part->cls == ABI_CLASS_FP) {
+    if (part->size <= 4) return builtin_id(CFREE_CG_BUILTIN_F32);
+    return builtin_id(CFREE_CG_BUILTIN_F64);  // rv64 locks F128 to F64
+  }
+  switch (part->size) {
+    case 1: return builtin_id(CFREE_CG_BUILTIN_I8);
+    case 2: return builtin_id(CFREE_CG_BUILTIN_I16);
+    case 4: return builtin_id(CFREE_CG_BUILTIN_I32);
+    default: return builtin_id(CFREE_CG_BUILTIN_I64);
+  }
+}
+```
+
+### rv_call_stack_size (analog: aa64 lines 2449–2489)
+```c
+static u32 rv_call_stack_size(NativeTarget* t, const NativeCallDesc* desc) {
+  const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, desc->fn_type);
+  u32 next_int = 0, next_fp = 0, stack = 0;
+  for (u32 i = 0; i < desc->nargs; ++i) {
+    ABIArgInfo tmp;
+    const ABIArgInfo* ai = rv_param_abi(t, abi, desc, i, &tmp);
+    int force_stack = abi && abi->variadic && abi->vararg_on_stack && i >= abi->nparams;
+    
+    if (ai->kind == ABI_ARG_IGNORE) continue;
+    if (force_stack) {
+      // Variadic args: round to 8 bytes, advance stack
+      stack += 8;
+      continue;
+    }
+    if (ai->kind == ABI_ARG_INDIRECT) {
+      if (next_int < 8) next_int++;
+      else stack += 8;
+      continue;
+    }
+    for (u32 p = 0; p < ai->nparts; ++p) {
+      const ABIArgPart* part = &ai->parts[p];
+      if (part->cls == ABI_CLASS_FP) {
+        if (next_fp < 8) next_fp++;
+        else stack += 8;  // FP part on stack
+      } else {
+        if (next_int < 8) next_int++;
+        else stack += 8;  // INT part on stack
+      }
+    }
+  }
+  return (stack + 15) & ~15;  // 16-byte align
+}
+```
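+
+To sanity-check the cursor logic in isolation, here is a small standalone model of the same counting rule. `PartClass` and `model_call_stack_size` are hypothetical stand-ins, not project types, and the model deliberately mirrors the sketch above: it ignores IGNORE/INDIRECT/variadic handling and does not decide whether FP parts that overflow fa0–fa7 should fall back to integer registers (that is left to the ABI layer).
+
+```c
+#include <stdint.h>
+
+typedef enum { PART_INT, PART_FP } PartClass; /* stand-in for ABI_CLASS_* */
+
+/* The first 8 INT parts and the first 8 FP parts travel in registers;
+ * every later part takes one 8-byte stack slot. The total is rounded up
+ * to the 16-byte stack alignment. */
+static uint32_t model_call_stack_size(const PartClass* parts, uint32_t nparts) {
+  uint32_t next_int = 0, next_fp = 0, stack = 0;
+  for (uint32_t i = 0; i < nparts; ++i) {
+    uint32_t* cursor = (parts[i] == PART_FP) ? &next_fp : &next_int;
+    if (*cursor < 8) ++*cursor;
+    else stack += 8;
+  }
+  return (stack + 15u) & ~15u;
+}
+```
+
+Nine integer parts therefore cost 16 bytes, not 8, because of the final alignment step.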
+
+### rv_signature_stack_bytes (analog: aa64 lines 2495–2506)
+```c
+static u32 rv_signature_stack_bytes(NativeTarget* t, CfreeCgTypeId fn_type,
+                                    int* variadic, u32* nparams) {
+  const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, fn_type);
+  NativeCallDesc d;
+  if (variadic) *variadic = abi ? (int)abi->variadic : 0;
+  if (nparams) *nparams = abi ? abi->nparams : 0;
+  memset(&d, 0, sizeof d);
+  d.fn_type = fn_type;
+  d.nargs = abi ? abi->nparams : 0;
+  if (d.nargs) d.args = arena_zarray(t->c->tu, NativeLoc, d.nargs);
+  return rv_call_stack_size(t, &d);
+}
+```
+
+### rv_call_stack_bytes (analog: aa64 lines 2508–2514)
+```c
+static u32 rv_call_stack_bytes(NativeTarget* t, const NativeCallDesc* desc) {
+  return rv_call_stack_size(t, desc);
+}
+```
+
+## Step 6: Register Parameter Binding (NativeDirectTarget path)
+
+The NativeDirectTarget adapter is declared in `src/cg/native_direct_target.h` (NativeOps struct, lines 66–94). For the -O0 semantic path, implement:
+
+### rv_bind_param (analog: aa64 lines 3616–3695 aa_bind_native_param)
+
+**Purpose**: Accept a CGParamDesc (semantic parameter) and route its incoming value from the ABI location (arg register or stack) to the user's allocated home (register or frame slot).
+
+**Signature**:
+```c
+static void rv_bind_param(NativeTarget* t, const CGParamDesc* p,
+                          NativeLoc dst) {
+  // Route p from its ABI location to dst
+}
+```
+
+**Body Sketch**:
+1. Query ABI: `const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, rv_of(t)->func->fn_type);`
+2. Get param ABIArgInfo: `const ABIArgInfo* ai = (p->index < abi->nparams) ? &abi->params[p->index] : NULL;`
+3. If IGNORE or NULL: no action (only ABI cursor advances)
+4. If INDIRECT:
+   - Load the pointer from the current int register (a0+next_param_int) or stack
+   - Copy aggregate from [pointer] to dst
+   - Advance next_param_int or next_param_stack
+5. If DIRECT with parts:
+   - For each part:
+     - Determine source: a0+next_int, fa0+next_fp, or stack+offset
+     - Load from source into intermediate register (if needed)
+     - Move/store into dst (register or frame slot)
+   - Advance int/fp cursors and stack offset
+
+**Key fields to track** (arch-private state):
+- `next_param_int`: cursor for a0..a7
+- `next_param_fp`: cursor for fa0..fa7
+- `next_param_stack`: sp-relative offset for stack params (round up to part alignment)
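+
+The three cursors can be modeled on their own. The sketch below uses hypothetical names (`ParamCursor`, `ParamSrc`, `next_param_source`) and assumes register-sized parts with a power-of-two alignment; it only shows how one part's source is chosen and how the cursors advance, not the load/store emission.
+
+```c
+#include <stdint.h>
+
+typedef enum { SRC_INT_REG, SRC_FP_REG, SRC_STACK } ParamSrcKind;
+typedef struct { ParamSrcKind kind; uint32_t value; } ParamSrc; /* reg number or byte offset */
+typedef struct { uint32_t next_int, next_fp, next_stack; } ParamCursor;
+
+/* Picks the incoming location of one part and advances the matching cursor. */
+static ParamSrc next_param_source(ParamCursor* c, int is_fp, uint32_t align) {
+  ParamSrc s;
+  if (is_fp && c->next_fp < 8) {
+    s.kind = SRC_FP_REG;
+    s.value = 10u + c->next_fp++;   /* fa0..fa7 are f10..f17 */
+  } else if (!is_fp && c->next_int < 8) {
+    s.kind = SRC_INT_REG;
+    s.value = 10u + c->next_int++;  /* a0..a7 are x10..x17 */
+  } else {
+    if (align < 8) align = 8;       /* stack slots are at least 8 bytes */
+    c->next_stack = (c->next_stack + align - 1) & ~(align - 1);
+    s.kind = SRC_STACK;
+    s.value = c->next_stack;        /* offset into the incoming arg area */
+    c->next_stack += 8;
+  }
+  return s;
+}
+```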
+
+## Step 7: Tail Call Unrealizability (NativeDirectTarget path)
+
+### rv_tail_call_unrealizable_reason
+
+**Purpose**: Return a blocker string if the tail call cannot be emitted, else NULL.
+
+**Signature**:
+```c
+static const char* rv_tail_call_unrealizable_reason(NativeDirectTarget* d,
+                                                    const CGCallDesc* call) {
+  // Check if tail call is realizable given the current function's incoming stack args
+}
+```
+
+**Body Sketch**:
+1. Compute the outgoing stack-arg size for the call descriptor (use rv_call_stack_size)
+2. Compare against incoming_stack_size (set at function entry)
+3. If outgoing > incoming: return "rv64 tail call: stack argument area too small"
+4. Else return NULL (realizable)
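+
+The check itself reduces to one comparison: a tail call reuses the caller's incoming stack-argument area, so the callee's outgoing stack arguments must fit inside it. A minimal sketch, with the two sizes passed in directly (the function name and parameters are illustrative, not the hook signature above):
+
+```c
+#include <stddef.h>
+#include <stdint.h>
+
+/* Returns a blocker string, or NULL when the tail call is realizable. */
+static const char* tail_call_blocker(uint32_t outgoing_stack_bytes,
+                                     uint32_t incoming_stack_bytes) {
+  if (outgoing_stack_bytes > incoming_stack_bytes)
+    return "rv64 tail call: stack argument area too small";
+  return NULL;
+}
+```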
+
+## Integration Points
+
+### 1. NativeTarget Hook Registration (in rv_native_target_new or similar)
+
+```c
+t->plan_call = rv_plan_call;
+t->emit_call = rv_emit_call;
+t->plan_ret = rv_plan_ret;
+t->ret = rv_ret;
+t->signature_stack_bytes = rv_signature_stack_bytes;
+t->call_stack_bytes = rv_call_stack_bytes;
+```
+
+### 2. NativeOps Adapter Registration (for NativeDirectTarget)
+
+```c
+static const NativeOps rv_direct_ops = {
+  .bind_param = rv_bind_native_param,  // the NativeTarget version
+  .tail_call_unrealizable_reason = rv_no_tail,  // or version that checks stack
+  .va_start_ = rv_va_start_,
+  .va_arg_ = rv_va_arg_,
+  .va_end_ = rv_va_end_,
+  .va_copy_ = rv_va_copy_,
+  .asm_block = rv_direct_asm_block,
+  .barrier = rv_direct_barrier,
+};
+const NativeOps* rv64_native_direct_ops(void) { return &rv_direct_ops; }
+```
+
+But note: the NativeOps bind_param receives a **semantic CGParamDesc** and routes to a **semantic home** (CGLocal), which is wrapped in a NativeDirectLocal. For the NativeTarget emission path, a separate bind_param hook on the NativeTarget is NOT called directly by the optimizer; instead, the backend's own prologue setup handles parameter binding. The NativeDirectTarget adapter is only for -O0.
+
+## Key Distinctions: aa64 vs rv64
+
+| Aspect | aa64 | rv64 |
+|--------|------|------|
+| Arg registers | x0–x7 + v0–v7 (FP) | a0–a7 (x10–x17) + fa0–fa7 (f10–f17) |
+| Incoming frame anchor | fp (x29) | sp (x2) |
+| Outgoing frame anchor | sp (register 31) | sp (x2) |
+| Variadic stack behavior | vararg_on_stack (Apple) | always uses int regs then stack |
+| Return registers | x0/x1, v0/v1 (FP) | a0/a1 (x10–x11), fa0–fa1 (f10–f11) |
+| sret pointer reg | x8 (always indirect) | a0 (first arg) |
+| Stack alignment | 16 bytes | 16 bytes |
+| Longest scalar type | F128 | double (F128 deferred) |
+
+## Notes
+
+- **Parallel-copy semantics**: aa64 handles cycles explicitly (aa_emit_reg_arg_moves). rv64 may rely on the allocator to avoid cycles; emit conservatively.
+- **Scratch registers**: rv64 uses t0 (x5), t1 (x6), t2 (x7) for temporaries; don't clobber live argument registers.
+- **Frame pointer**: rv64 uses s0 (x8) as fp; sp is x2. Adjust offset calculations accordingly.
+- **ABI queries must go through abi_cg_func_info and friends**; no hardcoding of register numbers.
+- **Tail call forwarding**: if has_sret and (flags & CG_CALL_TAIL), a0 already holds the incoming sret pointer (forwarded); no need to load it again.
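+
+If rv64 ends up needing explicit cycle handling for argument moves, one common approach is to sequentialize the parallel copy and break cycles through a scratch register. The sketch below is not the aa64 `aa_emit_reg_arg_moves` code; `Move` and `sequentialize_moves` are hypothetical, registers are plain integers, destinations are assumed unique, and the scratch register (for example t0) must not appear in the input.
+
+```c
+enum { MAX_MOVES = 16 };
+typedef struct { int dst, src; } Move;
+
+/* Orders the parallel copy moves[0..n) so that no source is overwritten
+ * before it is read. out[] needs room for 2*n entries. Returns the number
+ * of sequential moves written. Requires n <= MAX_MOVES. */
+static int sequentialize_moves(const Move* moves, int n, int scratch, Move* out) {
+  Move pending[MAX_MOVES];
+  int npend = 0, nout = 0;
+  for (int i = 0; i < n; ++i)
+    if (moves[i].dst != moves[i].src) pending[npend++] = moves[i];
+  while (npend > 0) {
+    int pick = -1;
+    /* A move is safe when no other pending move still reads its dst. */
+    for (int i = 0; i < npend && pick < 0; ++i) {
+      int blocked = 0;
+      for (int j = 0; j < npend; ++j)
+        if (j != i && pending[j].src == pending[i].dst) blocked = 1;
+      if (!blocked) pick = i;
+    }
+    if (pick < 0) {
+      /* Only cycles remain: park one destination's old value in scratch
+       * and redirect its readers there, which unblocks pending[0]. */
+      int d = pending[0].dst;
+      out[nout++] = (Move){scratch, d};
+      for (int j = 0; j < npend; ++j)
+        if (pending[j].src == d) pending[j].src = scratch;
+      pick = 0;
+    }
+    out[nout++] = pending[pick];
+    pending[pick] = pending[--npend];
+  }
+  return nout;
+}
+```
+
+A two-register swap comes out as three moves (save, move, restore); acyclic chains need no scratch at all.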
+
+
+
+
+---
+
+# RV64 NativeTarget Porting Guide — GROUP 5: Atomics, Variadics, Inline ASM, Intrinsics, Finalize
+
+## Overview
+This guide covers the five functional groups at the tail of the `src/arch/rv64/native.c` implementation: atomics (load/store/RMW/CAS/fence), variadic support (va_start_/va_arg_/va_end_/va_copy_), inline assembly (asm_block + direct path), compiler intrinsics, and finalization hooks (trap, set_loc, finalize, destroy).
+
+The reference implementation is **aa64 native.c** (4557 lines total). The legacy rv64 code exists but does not compile; it provides correct ISA/ABI logic that must be ported. The -O0 path uses **NativeOps** adapter (native_direct_target.h); the optimizer path uses **NativeTarget** hooks directly.
+
+---
+
+## File Lifecycle: Keep vs. Delete
+
+**DELETE** (old legacy single-pass code; not ported):
+- `src/arch/rv64/ops.c` — semantic CG lowering (replaced by NativeTarget hooks)
+- `src/arch/rv64/emit.c` — emission driver (replaced by MCEmitter in NativeTarget)
+- `src/arch/rv64/alloc.c` — register allocation (replaced by NativeAllocClass + optimizer)
+- `src/arch/rv64/opt_coord.c` — optimizer coordination (not part of native backend)
+- `src/arch/rv64/internal.h` — old RvImpl state struct (replaced by RvNativeTarget private state)
+
+**KEEP** (ISA encoding, disasm, link, debug, register info):
+- `src/arch/rv64/isa.h` — instruction encoders (RV_A, RV_I extensions)
+- `src/arch/rv64/isa.c` — disassembler / utility decoders
+- `src/arch/rv64/regs.c` — register file info (migrate to NativeRegInfo hooks)
+- `src/arch/rv64/regs.h` — register enum definitions
+- `src/arch/rv64/link.c` — object linker integration
+- `src/arch/rv64/dbg.c` — DWARF debug info emission
+- `src/arch/rv64/disasm.c/disasm.h` — disassembler (if needed for diagnostics)
+- `src/arch/rv64/emu.c` — emulator / JIT (independent of CG)
+- `src/arch/rv64/arch.c` — arch initialization hooks
+- `src/arch/rv64/asm.c/asm.h` — assembler (inline asm binding)
+- `src/arch/rv64/rv64.h` — public header
+
+---
+
+## GROUP 5A: Atomics (MemOrder → RISC-V A-Extension)
+
+### Design Overview
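+- **Memory ordering mapping**: MemOrder enum (relaxed, acquire, release, acq_rel, seq_cst) → RISC-V .aq/.rl bits on LR/SC (and AMO) instructions, plus FENCE instructions where a bit alone is not enough (FENCE itself has predecessor/successor sets, not .aq/.rl bits)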
+- **Loop strategy**: LR.W/D (load-reserved) + SC.W/D (store-conditional) retry loop for all ops (preferred over per-op AMO for simplicity)
+- **Alternative**: AMO instructions (AMOADD, AMOAND, AMOOR, AMOXOR, AMOSWAP with .aq/.rl) for hot paths (deferred)
+- **Spill slot for CAS**: unlike aa64 which uses a single saved-tmp-reg, rv64 can allocate temp regs freely, so no backend spill slot needed unless aggressive optimization desired
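+
+To make the ordering-to-bit mapping concrete, the sketch below encodes LR/SC words with aq/rl derived from an ordering. It is an illustration under assumptions: `Order`, `enc_amo`, `enc_lr`, and `enc_sc` are hypothetical names (the real encoders are `rv_lr_w/d` and `rv_sc_w/d` in `isa.h`, whose argument order may differ), and the mapping follows the rule used in this guide: aq on LR for acquire-or-stronger, rl on SC for release-or-stronger.
+
+```c
+#include <stdint.h>
+
+typedef enum { ORD_RELAXED, ORD_ACQUIRE, ORD_RELEASE, ORD_ACQ_REL, ORD_SEQ_CST } Order;
+
+static int order_is_acquire(Order o) {
+  return o == ORD_ACQUIRE || o == ORD_ACQ_REL || o == ORD_SEQ_CST;
+}
+static int order_is_release(Order o) {
+  return o == ORD_RELEASE || o == ORD_ACQ_REL || o == ORD_SEQ_CST;
+}
+
+/* A-extension layout: funct5 | aq | rl | rs2 | rs1 | funct3 | rd | 0x2f.
+ * funct3 is 2 for .W and 3 for .D. */
+static uint32_t enc_amo(uint32_t funct5, int aq, int rl, uint32_t rs2,
+                        uint32_t rs1, int is64, uint32_t rd) {
+  return (funct5 << 27) | ((uint32_t)(aq != 0) << 26) |
+         ((uint32_t)(rl != 0) << 25) | ((rs2 & 0x1fu) << 20) |
+         ((rs1 & 0x1fu) << 15) | ((is64 ? 3u : 2u) << 12) |
+         ((rd & 0x1fu) << 7) | 0x2fu;
+}
+
+/* lr.{w,d} rd, (rs1): funct5 = 0b00010, rs2 must be zero. */
+static uint32_t enc_lr(uint32_t rd, uint32_t rs1, int is64, Order o) {
+  return enc_amo(0x02u, order_is_acquire(o), 0, 0, rs1, is64, rd);
+}
+
+/* sc.{w,d} rd, rs2, (rs1): funct5 = 0b00011; rd receives 0 on success. */
+static uint32_t enc_sc(uint32_t rd, uint32_t rs2, uint32_t rs1, int is64, Order o) {
+  return enc_amo(0x03u, 0, order_is_release(o), rs2, rs1, is64, rd);
+}
+```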
+
+### Source Files for ISA Encoding
+- `src/arch/rv64/isa.h` lines 520–529: `rv_lr_w/d`, `rv_sc_w/d` function signatures
+- `src/arch/rv64/ops.c` lines 1917–2109: complete atomic_load/store/rmw/cas/fence legacy implementation
+
+### NativeTarget Hook Signatures (src/arch/native_target.h, lines 396–405)
+```c
+void (*atomic_load)(NativeTarget*, NativeLoc dst, NativeAddr addr, MemAccess, MemOrder);
+void (*atomic_store)(NativeTarget*, NativeAddr addr, NativeLoc src, MemAccess, MemOrder);
+void (*atomic_rmw)(NativeTarget*, AtomicOp, NativeLoc dst, NativeAddr addr,
+                   NativeLoc val, MemAccess, MemOrder);
+void (*atomic_cas)(NativeTarget*, NativeLoc prior, NativeLoc ok,
+                   NativeAddr addr, NativeLoc expected, NativeLoc desired,
+                   MemAccess, MemOrder success, MemOrder failure);
+void (*fence)(NativeTarget*, MemOrder);
+```
+
+### Implementation Body Sketches
+
+#### 1. `rv_atomic_load(NativeTarget* t, NativeLoc dst, NativeAddr addr, MemAccess mem, MemOrder order)`
+**Location**: `src/arch/rv64/native.c` (new file)
+
+**Pseudo-C**:
+```c
+static void rv_atomic_load(NativeTarget* t, NativeLoc dst, NativeAddr addr, MemAccess mem, MemOrder order) {
+  u32 base_reg = RV_T0;  /* or use first available scratch */
+  u32 dst_reg = dst.v.reg & 0x1f;
+  u32 size = mem.size ? mem.size : type_size32(t, dst.type);
+  
+  /* Materialize address base into base_reg */
+  rv_atomic_addr_reg(t, addr, base_reg);
+  
+  int aq = mem_order_is_acquire(order);
+  int rl = 0;  /* LR ignores rl bit, but API expects it */
+  
+  if (aq) {
+    /* lr.w/d dst, (base) with aq=1 */
+    u32 enc = (size == 8) 
+      ? rv_lr_d(dst_reg, base_reg, aq, rl)
+      : rv_lr_w(dst_reg, base_reg, aq, rl);
+    rv64_emit32(t->mc, enc);
+  } else {
+    /* Plain load (relaxed read) */
+    u32 enc = enc_int_load(size, 0, dst_reg, base_reg, 0);
+    rv64_emit32(t->mc, enc);
+  }
+  
+  if (order == MO_SEQ_CST) {
+    /* fence rw,rw after acquire-load for seq_cst */
+    rv64_emit32(t->mc, rv_fence_rw_rw());
+  }
+}
+```
+
+**Key points**:
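+- LR.W/D with aq=1 serves as the acquire-load; relaxed and release orders use a plain load (rl adds no ordering to a load). LR exists only in word and doubleword forms, so this path assumes a naturally aligned 4- or 8-byte access; narrower acquire loads need a plain load followed by a fence instead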
+- SEQ_CST requires a FENCE after the load to satisfy full ordering
+- Address materialization delegates to `rv_atomic_addr_reg(t, addr, base_reg)` helper (similar to aa64's aa_atomic_addr_reg)
+
+---
+
+#### 2. `rv_atomic_store(NativeTarget* t, NativeAddr addr, NativeLoc src, MemAccess mem, MemOrder order)`
+
+**Pseudo-C**:
+```c
+static void rv_atomic_store(NativeTarget* t, NativeAddr addr, NativeLoc src, MemAccess mem, MemOrder order) {
+  u32 base_reg = RV_T0;
+  u32 src_reg = src.v.reg & 0x1f;
+  u32 size = mem.size ? mem.size : type_size32(t, src.type);
+  
+  if (order == MO_SEQ_CST) {
+    /* fence rw,rw before seq_cst write */
+    rv64_emit32(t->mc, rv_fence_rw_rw());
+  }
+  
+  rv_atomic_addr_reg(t, addr, base_reg);
+  
+  int rl = mem_order_is_release(order);
+  
+  if (rl) {
+    /* fence rw,w before release store (conservative) */
+    rv64_emit32(t->mc, rv_fence_rw_w());
+  }
+  
+  /* Plain store in every case; ordering comes only from the fences */
+  rv64_emit32(t->mc, enc_int_store(size, src_reg, base_reg, 0));
+  
+  if (order == MO_SEQ_CST) {
+    /* fence rw,rw after seq_cst write */
+    rv64_emit32(t->mc, rv_fence_rw_rw());
+  }
+}
+```
+
+**Key points**:
+- Plain RISC-V stores (SW/SD) carry no ordering annotation; naturally aligned stores are still single-copy atomic, so FENCE instructions supply the release/seq_cst ordering
+- Release store uses fence-rw-w (release) + plain store; SEQ_CST wraps in full fences
+
+---
+
+#### 3. `rv_atomic_rmw(NativeTarget* t, AtomicOp op, NativeLoc dst, NativeAddr addr, NativeLoc val, MemAccess mem, MemOrder order)`
+
+**Pseudo-C**:
+```c
+static void rv_atomic_rmw(NativeTarget* t, AtomicOp op, NativeLoc dst, NativeAddr addr,
+                          NativeLoc val, MemAccess mem, MemOrder order) {
+  MCEmitter* mc = t->mc;
+  u32 base_reg = RV_T0;
+  u32 dst_reg = dst.v.reg & 0x1f;
+  u32 val_reg = RV_T1;
+  u32 new_reg = RV_T2;  /* computed result in loop */
+  u32 status = RV_T3;   /* SC.W/D's failure flag */
+  u32 size = mem.size ? mem.size : type_size32(t, dst.type);
+  int sf = (size == 8) ? 1 : 0;
+  
+  if (order == MO_SEQ_CST) {
+    rv64_emit32(mc, rv_fence_rw_rw());
+  }
+  
+  /* Materialize val into val_reg if not already there */
+  if (val.kind == NATIVE_LOC_IMM) {
+    rv64_emit_load_imm(mc, sf, val_reg, val.v.imm);
+  } else {
+    rv64_emit32(mc, rv_addi(val_reg, val.v.reg, 0));
+  }
+  
+  rv_atomic_addr_reg(t, addr, base_reg);
+  
+  int aq = mem_order_is_acquire(order);
+  int rl = mem_order_is_release(order);
+  
+  MCLabel retry = mc->label_new(mc);
+  mc->label_place(mc, retry);
+  
+  /* LR.W/D: load-reserve current value into dst */
+  u32 enc = sf ? rv_lr_d(dst_reg, base_reg, aq, 0)
+               : rv_lr_w(dst_reg, base_reg, aq, 0);
+  rv64_emit32(mc, enc);
+  
+  /* Compute: new = f(dst, val) based on op */
+  switch (op) {
+    case AO_XCHG:
+      rv64_emit32(mc, rv_addi(new_reg, val_reg, 0));
+      break;
+    case AO_ADD:
+      rv64_emit32(mc, sf ? rv_add(new_reg, dst_reg, val_reg)
+                         : rv_addw(new_reg, dst_reg, val_reg));
+      break;
+    case AO_SUB:
+      rv64_emit32(mc, sf ? rv_sub(new_reg, dst_reg, val_reg)
+                         : rv_subw(new_reg, dst_reg, val_reg));
+      break;
+    case AO_AND:
+      rv64_emit32(mc, rv_and(new_reg, dst_reg, val_reg));
+      break;
+    case AO_OR:
+      rv64_emit32(mc, rv_or(new_reg, dst_reg, val_reg));
+      break;
+    case AO_XOR:
+      rv64_emit32(mc, rv_xor(new_reg, dst_reg, val_reg));
+      break;
+    case AO_NAND:
+      rv64_emit32(mc, rv_and(new_reg, dst_reg, val_reg));
+      rv64_emit32(mc, rv_xori(new_reg, new_reg, -1));
+      break;
+    default:
+      /* Unsupported op */
+      compiler_panic(t->c, (SrcLoc){0, 0, 0}, "rv64: unsupported atomic rmw op");
+      break;
+  }
+  
+  /* SC.W/D: try to store new value; status != 0 on failure */
+  enc = sf ? rv_sc_d(status, base_reg, new_reg, 0, rl)
+           : rv_sc_w(status, base_reg, new_reg, 0, rl);
+  rv64_emit32(mc, enc);
+  
+  /* If status != 0 (SC failed), retry */
+  rv64_emit32(mc, rv_bne(status, RV_ZERO, 0));
+  mc->emit_label_ref(mc, retry, R_RV_BRANCH, 4, 0);
+  
+  if (order == MO_SEQ_CST) {
+    rv64_emit32(mc, rv_fence_rw_rw());
+  }
+}
+```
+
+**Key points**:
+- LR.W/D / SC.W/D pair forms the core atomic RMW loop
+- status register (T3) holds SC result (0 = success, != 0 = retry needed)
+- Ordering: aq on LR, rl on SC; SEQ_CST wraps in full fences
+- Each op (XCHG, ADD, SUB, AND, OR, XOR, NAND) is computed in a temp reg between LR and SC; keep that sequence short and free of other memory accesses so the reservation is not lost
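+
+The retry structure can be exercised on a host with C11 atomics, where a weak compare-exchange plays the role of the LR/SC pair (it may also fail spuriously). This is only a behavioral model of the loop for the NAND case, not the emitted code; `fetch_nand_u32` is a hypothetical name.
+
+```c
+#include <stdatomic.h>
+#include <stdint.h>
+
+/* Atomically replaces *p with ~(*p & val) and returns the prior value. */
+static uint32_t fetch_nand_u32(_Atomic uint32_t* p, uint32_t val) {
+  uint32_t old = atomic_load_explicit(p, memory_order_relaxed);
+  /* On failure the call refreshes `old` with the current value, which
+   * corresponds to branching back to the LR at the retry label. */
+  while (!atomic_compare_exchange_weak_explicit(
+      p, &old, ~(old & val), memory_order_seq_cst, memory_order_relaxed)) {
+  }
+  return old;
+}
+```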
+
+---
+
+#### 4. `rv_atomic_cas(NativeTarget* t, NativeLoc prior, NativeLoc ok, NativeAddr addr, NativeLoc expected, NativeLoc desired, MemAccess mem, MemOrder success, MemOrder failure)`
+
+**Pseudo-C**:
+```c
+static void rv_atomic_cas(NativeTarget* t, NativeLoc prior, NativeLoc ok,
+                          NativeAddr addr, NativeLoc expected, NativeLoc desired,
+                          MemAccess mem, MemOrder success, MemOrder failure) {
+  MCEmitter* mc = t->mc;
+  u32 base_reg = RV_T0;
+  u32 prior_reg = prior.v.reg & 0x1f;
+  u32 exp_reg = RV_T1;
+  u32 des_reg = RV_T2;
+  u32 status = RV_T3;
+  u32 ok_reg = ok.v.reg & 0x1f;
+  u32 size = mem.size ? mem.size : type_size32(t, prior.type);
+  int sf = (size == 8) ? 1 : 0;
+  
+  if (success == MO_SEQ_CST || failure == MO_SEQ_CST) {
+    rv64_emit32(mc, rv_fence_rw_rw());
+  }
+  
+  /* Materialize expected and desired into temp regs */
+  if (expected.kind == NATIVE_LOC_IMM) {
+    rv64_emit_load_imm(mc, sf, exp_reg, expected.v.imm);
+  } else {
+    rv64_emit32(mc, rv_addi(exp_reg, expected.v.reg, 0));
+  }
+  
+  if (desired.kind == NATIVE_LOC_IMM) {
+    rv64_emit_load_imm(mc, sf, des_reg, desired.v.imm);
+  } else {
+    rv64_emit32(mc, rv_addi(des_reg, desired.v.reg, 0));
+  }
+  
+  rv_atomic_addr_reg(t, addr, base_reg);
+  
+  int aq = mem_order_is_acquire(success) || mem_order_is_acquire(failure);
+  int rl = mem_order_is_release(success);
+  
+  MCLabel retry = mc->label_new(mc);
+  MCLabel fail = mc->label_new(mc);
+  MCLabel done = mc->label_new(mc);
+  
+  mc->label_place(mc, retry);
+  
+  /* LR.W/D: load-reserve prior value */
+  u32 enc = sf ? rv_lr_d(prior_reg, base_reg, aq, 0)
+               : rv_lr_w(prior_reg, base_reg, aq, 0);
+  rv64_emit32(mc, enc);
+  
+  /* if (prior != expected) goto fail */
+  rv64_emit32(mc, rv_bne(prior_reg, exp_reg, 0));
+  mc->emit_label_ref(mc, fail, R_RV_BRANCH, 4, 0);
+  
+  /* SC.W/D: try to store desired */
+  enc = sf ? rv_sc_d(status, base_reg, des_reg, 0, rl)
+           : rv_sc_w(status, base_reg, des_reg, 0, rl);
+  rv64_emit32(mc, enc);
+  
+  /* if (status != 0) goto retry */
+  rv64_emit32(mc, rv_bne(status, RV_ZERO, 0));
+  mc->emit_label_ref(mc, retry, R_RV_BRANCH, 4, 0);
+  
+  /* ok = 1; goto done */
+  rv64_emit_load_imm(mc, 0, ok_reg, 1);
+  rv64_emit32(mc, rv_jal(RV_ZERO, 0));
+  mc->emit_label_ref(mc, done, R_RV_JAL, 4, 0);
+  
+  /* fail label: ok = 0 */
+  mc->label_place(mc, fail);
+  rv64_emit_load_imm(mc, 0, ok_reg, 0);
+  
+  /* done label */
+  mc->label_place(mc, done);
+  
+  if (success == MO_SEQ_CST || failure == MO_SEQ_CST) {
+    rv64_emit32(mc, rv_fence_rw_rw());
+  }
+}
+```
+
+**Key points**:
+- Three labels: retry (LR loop), fail (mismatch exit), done (all-done exit)
+- ok output is 1 on success, 0 on failure
+- Failure ordering has no separate code path in the simple LR/SC model: the single LR takes aq if either the success or the failure order is acquire-or-stronger, and the SC takes rl from the success order only
+- Fence placement respects both success and failure orders (conservative: if either is SEQ_CST, full fence)
+
+---
+
+#### 5. `rv_fence(NativeTarget* t, MemOrder order)`
+
+**Pseudo-C**:
+```c
+static void rv_fence(NativeTarget* t, MemOrder order) {
+  if (order == MO_RELAXED) return;
+  /* All other orders use fence rw,rw (full barrier) */
+  rv64_emit32(t->mc, rv_fence_rw_rw());
+}
+```
+
+**Key points**:
+- RISC-V FENCE instruction with pred=rw, succ=rw is the full memory barrier
+- Relaxed needs no fence
+- Acquire/release/seq_cst all use rw,rw; fine-grained pred/succ bits are a refinement (not yet needed)
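+
+Should the finer-grained refinement be wanted later, the FENCE word is easy to parameterize. The sketch below assumes hypothetical names (`FenceOrder`, `enc_fence`, `fence_for_order`) and uses the conventional mapping of acquire to `fence r,rw` and release to `fence rw,w`, with acq_rel and seq_cst kept at the full `fence rw,rw` used above.
+
+```c
+#include <stdint.h>
+
+enum { FENCE_W = 1, FENCE_R = 2, FENCE_O = 4, FENCE_I = 8 }; /* pred/succ set bits */
+typedef enum { FO_RELAXED, FO_ACQUIRE, FO_RELEASE, FO_ACQ_REL, FO_SEQ_CST } FenceOrder;
+
+/* fence pred, succ: pred in bits 27:24, succ in bits 23:20, opcode 0x0f. */
+static uint32_t enc_fence(uint32_t pred, uint32_t succ) {
+  return ((pred & 0xfu) << 24) | ((succ & 0xfu) << 20) | 0x0fu;
+}
+
+/* Returns the fence word for an ordering, or 0 when no fence is needed. */
+static uint32_t fence_for_order(FenceOrder o) {
+  switch (o) {
+    case FO_RELAXED: return 0;
+    case FO_ACQUIRE: return enc_fence(FENCE_R, FENCE_R | FENCE_W);
+    case FO_RELEASE: return enc_fence(FENCE_R | FENCE_W, FENCE_W);
+    default:         return enc_fence(FENCE_R | FENCE_W, FENCE_R | FENCE_W);
+  }
+}
+```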
+
+---
+
+## GROUP 5B: Variadics (LP64D Save-Area Spill)
+
+### Design Overview
+- **va_list layout**: single 8-byte pointer (ABI_VA_LIST_POINTER) pointing to the next argument slot
+- **Prologue**: variadic functions spill a_{nparams_int}..a7 (unused GP regs) into a save area at [s0 + 16] (top of callee frame, above saved s0/ra pair)
+- **Calling convention**: va_arg advances the pointer by 8 bytes per call; all variadic args sit in the same save area regardless of type (integer regs are bit-cast to FP when needed)
+- **Setup**: next_param_int cursor in RvNativeTarget tracks how many GP regs have been bound as fixed params; variadic spill begins after that
+
+### Source Files for ABI & Lowering
+- `src/abi/abi_rv64.c` lines 223–251: ABIVaListInfo initialization (.kind = ABI_VA_LIST_POINTER)
+- `src/abi/abi.h` lines 30–50: ABIVaListKind enum and ABIVaListInfo struct
+- `src/arch/rv64/ops.c` lines 1846–1905: legacy va_start_/va_arg_/va_end_/va_copy_ implementation
+
+### NativeTarget Hook Signatures (src/arch/native_target.h, lines 406–417)
+```c
+void (*va_start_)(NativeTarget*, NativeLoc ap_ptr);
+void (*va_arg_)(NativeTarget*, NativeLoc dst, NativeLoc ap_ptr, CfreeCgTypeId type);
+void (*va_end_)(NativeTarget*, NativeLoc ap_ptr);
+void (*va_copy_)(NativeTarget*, NativeLoc dst_ap_ptr, NativeLoc src_ap_ptr);
+```
+
+### NativeOps Adapter (for -O0 path, src/cg/native_direct_target.h lines 81–86)
+```c
+void (*va_start_)(NativeDirectTarget*, Operand ap_addr);
+void (*va_arg_)(NativeDirectTarget*, Operand dst, Operand ap_addr, CfreeCgTypeId type);
+void (*va_end_)(NativeDirectTarget*, Operand ap_addr);
+void (*va_copy_)(NativeDirectTarget*, Operand dst_ap_addr, Operand src_ap_addr);
+```
+
+### Frame Layout State (RvNativeTarget private struct fields)
+```c
+u32 next_param_int;    /* count of integer register params bound so far (0–8) */
+u32 next_param_stack;  /* byte offset of next stack arg past fixed params */
+u8 is_variadic;        /* function is variadic */
+```
+
+### Implementation Body Sketches
+
+#### 1. `rv_va_start_(NativeTarget* t, NativeLoc ap_ptr)`
+
+**Pseudo-C**:
+```c
+static void rv_va_start_(NativeTarget* t, NativeLoc ap_ptr) {
+  RvNativeTarget* a = rv_of(t);
+  MCEmitter* mc = t->mc;
+  
+  /* ap_ptr is a register or frame slot holding &va_list (i.e., the address
+   * where va_start writes the initial ap pointer). */
+  
+  /* Compute first-variadic-slot address: s0 + 16 + next_param_int*8 */
+  u32 ap_value_reg = RV_T0;
+  i32 offset = 16 + (i32)(a->next_param_int * 8u);
+  
+  /* t0 = s0 + offset */
+  rv_emit_add_imm(mc, ap_value_reg, RV_S0, offset);
+  
+  /* Store t0 into the va_list location (*ap_ptr = t0); this assumes
+   * ap_ptr has already been materialized into a register */
+  rv_emit_store(mc, ap_value_reg, ap_ptr.v.reg, 0, 8);
+}
+```
+
+**For -O0 path (NativeOps: rv_va_start_ in native_direct_target.h)**:
+
+```c
+static void rv_va_start_(NativeDirectTarget* d, Operand ap_addr) {
+  RvNativeTarget* a = rv_of(d->native);
+  MCEmitter* mc = d->native->mc;
+  
+  /* ap_addr is a semantic Operand (local, immediate offset, etc.).
+   * Materialize its address into a register. */
+  u32 ap_ptr_reg = RV_T0;
+  rv_direct_materialize_addr(d, ap_addr, ap_ptr_reg);
+  
+  /* Compute ap value: s0 + 16 + next_param_int*8 */
+  u32 ap_val_reg = RV_T1;
+  i32 offset = 16 + (i32)(a->next_param_int * 8u);
+  rv_emit_add_imm(mc, ap_val_reg, RV_S0, offset);
+  
+  /* *ap_ptr = ap_val */
+  rv_emit_store(mc, ap_val_reg, ap_ptr_reg, 0, 8);
+}
+```
+
+**Key points**:
+- RV64 LP64D: all variadic arguments (int and FP) are spilled to the integer save area
+- Offset 16 is above the saved s0/ra pair at [s0+0] and [s0+8]
+- next_param_int tracks how many of {a0..a7} are already consumed by fixed params
+
+---
+
+#### 2. `rv_va_arg_(NativeTarget* t, NativeLoc dst, NativeLoc ap_ptr, CfreeCgTypeId type)`
+
+**Pseudo-C**:
+```c
+static void rv_va_arg_(NativeTarget* t, NativeLoc dst, NativeLoc ap_ptr,
+                       CfreeCgTypeId type) {
+  MCEmitter* mc = t->mc;
+  u32 size = type_size32(t, type);
+  int is_fp = cg_type_is_float(t->c, type);
+  
+  /* Load ap_ptr's current value into t0 (the current ap) */
+  u32 ap_reg = RV_T0;
+  u32 val_reg = RV_T1;
+  u32 next_ap_reg = RV_T2;
+  
+  /* t0 = *ap_ptr (load current va_list pointer); assumes ap_ptr is in a register */
+  rv64_emit32(mc, rv_ld(ap_reg, ap_ptr.v.reg, 0));
+  
+  /* Load value from [t0] into val_reg or dst (depending on type) */
+  if (is_fp && size == 8) {
+    /* FP8: load double from [t0], bit-cast to FP register */
+    rv64_emit32(mc, rv_ld(val_reg, ap_reg, 0));          /* t1 = *ap (int64) */
+    rv64_emit32(mc, rv_fmv_d_x(dst.v.reg, val_reg));     /* dst.fp = bit_cast(t1) */
+  } else if (is_fp && size == 4) {
+    /* FP4: load word, bit-cast to FP */
+    rv64_emit32(mc, rv_lw(val_reg, ap_reg, 0));
+    rv64_emit32(mc, rv_fmv_w_x(dst.v.reg, val_reg));
+  } else {
+    /* Integer: load with sign extension based on type signedness */
+    int sx = type_is_signed(type);
+    u32 enc = enc_int_load(size, sx, dst.v.reg, ap_reg, 0);
+    rv64_emit32(mc, enc);
+  }
+  
+  /* Advance ap_ptr by 8 bytes: t2 = t0 + 8 */
+  rv64_emit32(mc, rv_addi(next_ap_reg, ap_reg, 8));
+  
+  /* Store back: *ap_ptr = t2 */
+  rv64_emit32(mc, rv_sd(next_ap_reg, ap_ptr.v.reg, 0));
+}
+```
+
+**For -O0 path (NativeOps)**:
+
+```c
+static void rv_va_arg_(NativeDirectTarget* d, Operand dst, Operand ap_addr,
+                       CfreeCgTypeId type) {
+  MCEmitter* mc = d->native->mc;
+  u32 size = type_size32(d->native, type);
+  int is_fp = cg_type_is_float(d->base.c, type);
+  
+  /* Materialize ap_addr into a register holding &va_list */
+  u32 ap_ptr_reg = RV_T0;
+  rv_direct_materialize_addr(d, ap_addr, ap_ptr_reg);
+  
+  /* Load current va_list value */
+  u32 ap_reg = RV_T1;
+  rv64_emit32(mc, rv_ld(ap_reg, ap_ptr_reg, 0));
+  
+  /* Load value from [ap] */
+  u32 dst_reg = rv_dst_reg(d, dst);
+  if (is_fp && size == 8) {
+    rv64_emit32(mc, rv_ld(RV_T2, ap_reg, 0));
+    rv64_emit32(mc, rv_fmv_d_x(dst_reg, RV_T2));
+  } else if (is_fp && size == 4) {
+    rv64_emit32(mc, rv_lw(RV_T2, ap_reg, 0));
+    rv64_emit32(mc, rv_fmv_w_x(dst_reg, RV_T2));
+  } else {
+    int sx = type_is_signed(type);
+    u32 enc = enc_int_load(size, sx, dst_reg, ap_reg, 0);
+    rv64_emit32(mc, enc);
+  }
+  
+  /* Advance ap: t2 = t1 + 8; *ap_ptr = t2 */
+  rv64_emit32(mc, rv_addi(RV_T2, ap_reg, 8));
+  rv64_emit32(mc, rv_sd(RV_T2, ap_ptr_reg, 0));
+  
+  /* If dst is a memory location, store dst_reg into it */
+  if (dst.kind != OPK_REG) {
+    rv_direct_store_reg_to_operand(d, dst, dst_reg);
+  }
+}
+```
+
+**Key points**:
+- All variadic args occupy 8-byte slots (even int32 and float32)
+- FP args are stored in the integer save area; va_arg bit-casts them back via fmv_d_x / fmv_w_x
+- Sign extension applies to integer variadic args per type signedness
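+- ap advances by 8 unconditionally; this holds only while every supported variadic type fits one 8-byte slot (wider types such as the deferred F128 would need more)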
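+
+The slot discipline can be checked on a host with a plain C model. The names below (`va_first_slot_offset`, `va_next_i32`, `va_next_f64`) are hypothetical, the save area is modeled as an ordinary byte buffer, and the integer case assumes a little-endian host so that the low 4 bytes of a slot hold the value, as on RV64.
+
+```c
+#include <stdint.h>
+#include <string.h>
+
+/* Byte offset from s0 of the first variadic slot, given how many of
+ * a0..a7 the fixed parameters consumed (save area starts at s0 + 16). */
+static int32_t va_first_slot_offset(uint32_t next_param_int) {
+  return 16 + (int32_t)(next_param_int * 8u);
+}
+
+/* Reads a 32-bit integer from the current 8-byte slot and advances by 8. */
+static int64_t va_next_i32(const uint8_t** ap) {
+  int32_t v;
+  memcpy(&v, *ap, sizeof v);  /* low 4 bytes of the slot */
+  *ap += 8;
+  return (int64_t)v;          /* sign-extend, like lw */
+}
+
+/* Reads a double from the current 8-byte slot and advances by 8. */
+static double va_next_f64(const uint8_t** ap) {
+  double d;
+  memcpy(&d, *ap, sizeof d);  /* bit-cast, like ld followed by fmv.d.x */
+  *ap += 8;
+  return d;
+}
+```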
+
+---
+
+#### 3. `rv_va_end_(NativeTarget* t, NativeLoc ap_ptr)` and `rv_va_copy_(NativeTarget* t, NativeLoc dst_ap_ptr, NativeLoc src_ap_ptr)`
+
+**va_end** is a no-op for pointer-based va_list:
+```c
+static void rv_va_end_(NativeTarget* t, NativeLoc ap_ptr) {
+  (void)t;
+  (void)ap_ptr;
+}
+```
+
+**va_copy** copies the 8-byte pointer:
+```c
+static void rv_va_copy_(NativeTarget* t, NativeLoc dst_ap_ptr, NativeLoc src_ap_ptr) {
+  MCEmitter* mc = t->mc;
+  
+  /* t0 = *src_ap_ptr; *dst_ap_ptr = t0 */
+  u32 tmp_reg = RV_T0;
+  rv64_emit32(mc, rv_ld(tmp_reg, src_ap_ptr.v.reg, 0));
+  rv64_emit32(mc, rv_sd(tmp_reg, dst_ap_ptr.v.reg, 0));
+}
+```
+
+**Key points**:
+- LP64D pointer va_list is 8 bytes; simple load/store copy
+- No complex save-area state like AAPCS64
+
+---
+
+## GROUP 5C: Inline Assembly (NativeTarget + NativeOps)
+
+### Design Overview
+- Two paths: **NativeOps direct** (aa64 aa_direct_asm_block, lines 4293–4395) and **NativeTarget** (aa64 aa_asm_block_native, lines 4480–4545)
+- Direct path: -O0, self-allocates registers for operands, loads/stores operands, binds to template
+- NativeTarget path: optimizer has pre-allocated all registers; backend only materializes memory bases and saves/restores clobbered callee-saves
+- Template binding uses aa64_inline_bind (constraint parsing, named operands, tied operands)
+- RV64 must adapt the same pattern with RV64-specific constraints: r (int), f (FP), i (imm), m (mem)
+- **asm.h pseudo-kinds**: aa64 adds AA64_INLINE_OPK_REG (0xf0) and AA64_INLINE_OPCLS_{INT,FP}; rv64 must add equivalent RV_INLINE_OPK_REG and RV_INLINE_OPCLS_{INT,FP}
+
+### Source Files
+- `src/arch/aa64/native.c` lines 4150–4395: direct-path constraint/register handling (aa_asm_alloc_reg, aa_asm_constraint_class, etc.)
+- `src/arch/aa64/native.c` lines 4409–4545: memory-operand materialization and native-path asm_block
+- `src/arch/aa64/asm.h` lines 27–32: AA64_INLINE_OPK_REG, AA64_INLINE_OPCLS_* pseudo-kinds
+- `src/arch/rv64/asm.c` / `src/arch/rv64/asm.h`: existing assembler (to be adapted for inline asm)
+
+### NativeTarget Hook Signature (native_target.h, lines 420–423)
+```c
+void (*asm_block)(NativeTarget*, const char* tmpl, const AsmConstraint* outs,
+                  u32 nout, NativeLoc* out_locs, const AsmConstraint* ins,
+                  u32 nin, const NativeLoc* in_locs, const Sym* clobbers,
+                  u32 nclob);
+```
+
+### NativeOps Hook Signature (native_direct_target.h, lines 88–91)
+```c
+void (*asm_block)(NativeDirectTarget*, const char* tmpl,
+                  const AsmConstraint* outs, u32 nout, Operand* out_ops,
+                  const AsmConstraint* ins, u32 nin, const Operand* in_ops,
+                  const Sym* clobbers, u32 nclob);
+```
+
+### Constraint Parsing & Register Allocation (Direct Path Only)
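+The constraint-introspection helpers below are useful to both paths; `rv_asm_alloc_reg` is needed only by the direct path, which allocates its own operand registers. They can live in asm.c or native.c: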
+
+```c
+/* Constraint string introspection */
+static const char* rv_asm_constraint_body(const char* s) {
+  if (!s) return "";
+  if (s[0] == '=' && s[1] == '&') return s + 2;  /* =& early-clobber output */
+  if (s[0] == '=' || s[0] == '+' || s[0] == '&') return s + 1;
+  return s;
+}
+
+static int rv_asm_constraint_early(const char* s) {
+  if (!s) return 0;
+  return (s[0] == '=' && s[1] == '&') || s[0] == '&';
+}
+
+static int rv_asm_match_index(const char* s) {
+  int n = 0;
+  if (!s || s[0] < '0' || s[0] > '9') return -1;
+  for (const char* p = s; *p >= '0' && *p <= '9'; ++p) {
+    n = n * 10 + (*p - '0');
+  }
+  return n;
+}
+
+static NativeAllocClass rv_asm_constraint_class(NativeDirectTarget* d,
+                                               const char* body) {
+  if (body[0] == 'r') return NATIVE_REG_INT;
+  if (body[0] == 'f') return NATIVE_REG_FP;
+  /* Panic for unsupported constraint */
+  compiler_panic(d->base.c, d->loc, "rv64 asm: unsupported constraint '%s'", body);
+  return NATIVE_REG_INT;
+}
+
+/* Register allocation for the direct path (scratch pools) */
+static Reg rv_asm_alloc_reg(NativeDirectTarget* d, NativeAllocClass cls,
+                            u32* used_int, u32* used_fp) {
+  /* Allocable scratch: t0-t6 for int (x5–x7, x28–x31), ft0-ft7 for FP (f0–f7) */
+  static const Reg int_pool[] = {5, 6, 7, 28, 29, 30, 31};
+  static const Reg fp_pool[] = {0, 1, 2, 3, 4, 5, 6, 7};
+  const Reg* pool = (cls == NATIVE_REG_FP) ? fp_pool : int_pool;
+  u32 n = (cls == NATIVE_REG_FP) ? 8u : 7u;
+  u32* used = (cls == NATIVE_REG_FP) ? used_fp : used_int;
+  
+  for (u32 i = 0; i < n; ++i) {
+    Reg r = pool[i];
+    if ((*used & (1u << r)) != 0) continue;
+    *used |= 1u << r;
+    return r;
+  }
+  compiler_panic(d->base.c, d->loc, "rv64 asm: out of registers for operands");
+  return REG_NONE;
+}
+```
+
+### asm.h Pseudo-Kinds for RV64
+
+Add to `src/arch/rv64/asm.h`:
+```c
+enum RvAsmPseudoOperandKind {
+  RV_INLINE_OPK_REG = 0xf0u,
+};
+
+enum RvAsmOperandClass {
+  RV_INLINE_OPCLS_INT = 0u,
+  RV_INLINE_OPCLS_FP = 1u,
+};
+```
+
+### Implementation Body Sketches
+
+#### Direct Path: `rv_direct_asm_block(NativeDirectTarget* d, const char* tmpl, const AsmConstraint* outs, u32 nout, Operand* out_ops, const AsmConstraint* ins, u32 nin, const Operand* in_ops, const Sym* clobbers, u32 nclob)`
+
+**Pseudo-C** (lines ~4000–4300 in native.c):
+```c
+static void rv_direct_asm_block(NativeDirectTarget* d, const char* tmpl,
+                                const AsmConstraint* outs, u32 nout,
+                                Operand* out_ops, const AsmConstraint* ins,
+                                u32 nin, const Operand* in_ops,
+                                const Sym* clobbers, u32 nclob) {
+  Operand* bound_outs = nout ? arena_zarray(d->base.c->tu, Operand, nout) : NULL;
+  Operand* bound_ins = nin ? arena_zarray(d->base.c->tu, Operand, nin) : NULL;
+  u32 clob_int, clob_fp, used_int, used_fp;
+  RvAsmSavedClobber* saved;
+  u32 nsaved;
+  
+  /* Parse clobber list into bitmasks */
+  rv_asm_clobber_masks(d->base.c, d->loc, clobbers, nclob, &clob_int, &clob_fp);
+  
+  /* Reserve t0/t1 (scratch), s0 (frame pointer), sp/gp/tp, and every clobber */
+  used_int = clob_int | (1u << RV_T0) | (1u << RV_T1) | (1u << RV_S0) |
+             (1u << RV_SP) | (1u << RV_GP) | (1u << RV_TP);
+  used_fp = clob_fp;
+  
+  /* Bind outputs: allocate registers, load initial values if inout */
+  for (u32 i = 0; i < nout; ++i) {
+    const char* body = rv_asm_constraint_body(outs[i].str);
+    if (body[0] == 'r' || body[0] == 'f') {
+      NativeAllocClass cls = rv_asm_constraint_class(d, body);
+      Reg reg = rv_asm_alloc_reg(d, cls, &used_int, &used_fp);
+      CfreeCgTypeId type = outs[i].type ? outs[i].type : out_ops[i].type;
+      rv_asm_bound_reg(&bound_outs[i], type, cls, reg);
+      if (outs[i].dir == ASM_INOUT) {
+        /* Inout: load initial value from out_ops[i] into allocated reg */
+        NativeLoc loc = rv_reg_loc(type, cls, reg);
+        rv_direct_load_operand_to_reg(d, out_ops[i], loc);
+      }
+    } else if (body[0] == 'm') {
+      /* Memory output: allocate base register */
+      Reg reg = rv_asm_alloc_reg(d, NATIVE_REG_INT, &used_int, &used_fp);
+      NativeLoc loc = rv_reg_loc(builtin_id(CFREE_CG_BUILTIN_I64), NATIVE_REG_INT, reg);
+      CfreeCgTypeId type = outs[i].type ? outs[i].type : out_ops[i].type;
+      rv_direct_load_address_to_reg(d, out_ops[i], loc);
+      rv_asm_bound_mem(&bound_outs[i], type, reg);
+    } else {
+      compiler_panic(d->base.c, d->loc, "rv64 asm: unsupported output constraint");
+    }
+  }
+  
+  /* Bind inputs: match to outputs if tied, else allocate / load */
+  for (u32 i = 0; i < nin; ++i) {
+    const char* body = rv_asm_constraint_body(ins[i].str);
+    int matched = rv_asm_match_index(body);
+    if (matched >= 0) {
+      if ((u32)matched >= nout)
+        compiler_panic(d->base.c, d->loc, "rv64 asm: matching constraint out of range");
+      if (rv_asm_constraint_early(outs[matched].str))
+        compiler_panic(d->base.c, d->loc, "rv64 asm: matching input ties early-clobber output");
+      if (bound_outs[matched].kind != RV_INLINE_OPK_REG)
+        compiler_panic(d->base.c, d->loc, "rv64 asm: matching constraint requires register output");
+      bound_ins[i] = bound_outs[matched];
+      /* Load input value into the matched register */
+      rv_direct_load_operand_to_reg(d, in_ops[i],
+                                    rv_reg_loc(bound_ins[i].type,
+                                               bound_ins[i].pad[0] == RV_INLINE_OPCLS_FP
+                                                   ? NATIVE_REG_FP : NATIVE_REG_INT,
+                                               (Reg)bound_ins[i].v.local));
+      continue;
+    }
+    
+    if (body[0] == 'r' || body[0] == 'f') {
+      NativeAllocClass cls = rv_asm_constraint_class(d, body);
+      Reg reg = rv_asm_alloc_reg(d, cls, &used_int, &used_fp);
+      CfreeCgTypeId type = ins[i].type ? ins[i].type : in_ops[i].type;
+      rv_asm_bound_reg(&bound_ins[i], type, cls, reg);
+      rv_direct_load_operand_to_reg(d, in_ops[i], rv_reg_loc(type, cls, reg));
+    } else if (body[0] == 'i') {
+      if (in_ops[i].kind != OPK_IMM)
+        compiler_panic(d->base.c, d->loc, "rv64 asm: immediate constraint requires immediate");
+      bound_ins[i] = in_ops[i];
+    } else if (body[0] == 'm') {
+      Reg reg = rv_asm_alloc_reg(d, NATIVE_REG_INT, &used_int, &used_fp);
+      NativeLoc loc = rv_reg_loc(builtin_id(CFREE_CG_BUILTIN_I64), NATIVE_REG_INT, reg);
+      CfreeCgTypeId type = ins[i].type ? ins[i].type : in_ops[i].type;
+      rv_direct_load_address_to_reg(d, in_ops[i], loc);
+      rv_asm_bound_mem(&bound_ins[i], type, reg);
+    } else {
+      compiler_panic(d->base.c, d->loc, "rv64 asm: unsupported input constraint");
+    }
+  }
+  
+  /* Save clobbered callee-saved regs */
+  saved = rv_asm_save_callee_clobbers(rv_of(d->native), clob_int, clob_fp, &nsaved);
+  
+  /* Open assembler, bind operands, run template */
+  RvAsm* a = rv_asm_open(d->base.c);
+  rv_inline_bind(a, outs, nout, bound_outs, ins, nin, bound_ins, clobbers, nclob);
+  rv_asm_run_template(a, d->native->mc, tmpl);
+  rv_asm_close(a);
+  
+  /* Store output results back to operands */
+  for (u32 i = 0; i < nout; ++i) {
+    NativeAllocClass cls;
+    NativeLoc src;
+    if (bound_outs[i].kind != RV_INLINE_OPK_REG) continue;
+    cls = bound_outs[i].pad[0] == RV_INLINE_OPCLS_FP ? NATIVE_REG_FP : NATIVE_REG_INT;
+    src = rv_reg_loc(bound_outs[i].type, cls, (Reg)bound_outs[i].v.local);
+    rv_direct_store_reg_to_operand(d, out_ops[i], src);
+  }
+  
+  /* Restore clobbered callee-saves */
+  for (u32 i = nsaved; i > 0; --i) {
+    rv_asm_restore_one(rv_of(d->native), &saved[i - 1u]);
+  }
+}
+```
+
+**Helper functions for the direct path**:
+
+```c
+static void rv_asm_bound_reg(Operand* out, CfreeCgTypeId type,
+                             NativeAllocClass cls, Reg reg) {
+  memset(out, 0, sizeof *out);
+  out->kind = RV_INLINE_OPK_REG;
+  out->pad[0] = (cls == NATIVE_REG_FP) ? RV_INLINE_OPCLS_FP : RV_INLINE_OPCLS_INT;
+  out->type = type;
+  out->v.local = (CGLocal)reg;
+}
+
+static void rv_asm_bound_mem(Operand* out, CfreeCgTypeId type, Reg base) {
+  memset(out, 0, sizeof *out);
+  out->kind = OPK_INDIRECT;
+  out->type = type;
+  out->v.ind.base = (CGLocal)base;
+  out->v.ind.index = CG_LOCAL_NONE;
+}
+
+static void rv_direct_load_operand_to_reg(NativeDirectTarget* d,
+                                          Operand op, NativeLoc dst) {
+  NativeAddr addr;
+  memset(&addr, 0, sizeof addr);
+  switch ((OpKind)op.kind) {
+    case OPK_IMM:
+      if ((NativeAllocClass)dst.cls != NATIVE_REG_INT)
+        compiler_panic(d->base.c, d->loc, "rv64 asm: FP immediate unsupported");
+      d->native->load_imm(d->native, dst, op.v.imm);
+      return;
+    case OPK_LOCAL:
+      addr.base_kind = NATIVE_ADDR_BASE_FRAME;
+      addr.base.frame = d->locals[op.v.local - 1u].home;
+      addr.base_type = op.type;
+      rv_emit_mem(rv_of(d->native), 1, dst, addr,
+                  rv_mem_for_type(d->native, op.type, 0));
+      return;
+    /* Global, indirect cases: similar to aa64 */
+    default:
+      compiler_panic(d->base.c, d->loc, "rv64 asm: unsupported input operand kind");
+  }
+}
+
+static void rv_direct_store_reg_to_operand(NativeDirectTarget* d,
+                                           Operand op, NativeLoc src) {
+  NativeAddr addr;
+  memset(&addr, 0, sizeof addr);
+  if (op.kind == OPK_LOCAL) {
+    addr.base_kind = NATIVE_ADDR_BASE_FRAME;
+    addr.base.frame = d->locals[op.v.local - 1u].home;
+    addr.base_type = op.type;
+  } else {
+    addr = rv_direct_materialize_addr(d, op);
+  }
+  rv_emit_mem(rv_of(d->native), 0, src, addr,
+              rv_mem_for_type(d->native, op.type, 0));
+}
+
+static void rv_asm_clobber_masks(Compiler* c, SrcLoc loc, const Sym* clobbers,
+                                 u32 nclob, u32* clob_int, u32* clob_fp) {
+  *clob_int = 0;
+  *clob_fp = 0;
+  for (u32 i = 0; i < nclob; ++i) {
+    Reg r;
+    NativeAllocClass cls;
+    if (rv_reg_resolve(c, clobbers[i], &r, &cls) != 0) {
+      compiler_panic(c, loc, "rv64 asm: unknown clobbered register");
+    }
+    if (cls == NATIVE_REG_FP)
+      *clob_fp |= 1u << r;
+    else
+      *clob_int |= 1u << r;
+  }
+}
+
+typedef struct RvAsmSavedClobber {
+  NativeFrameSlot slot;
+  NativeAllocClass cls;
+  Reg reg;
+  CfreeCgTypeId type;
+} RvAsmSavedClobber;
+
+static RvAsmSavedClobber* rv_asm_save_callee_clobbers(RvNativeTarget* a,
+                                                      u32 clob_int,
+                                                      u32 clob_fp,
+                                                      u32* nsaved) {
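+  /* Identify which clobbered registers are callee-saved, allocate frame slots */
+  /* Int: s0-s1 (x8-x9) and s2-s11 (x18-x27) */
+  static const u32 callee_saved_int[] = {8, 9, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27};
+  /* FP: fs0-fs1 (f8-f9) and fs2-fs11 (f18-f27); f10/f11 are fa0/fa1 */
+  static const u32 callee_saved_fp[] = {8, 9, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27};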
+  
+  /* Count and allocate slots; return array for later restore */
+  /* Similar to aa64's aa_asm_save_callee_clobbers */
+}
+
+static void rv_asm_save_one(RvNativeTarget* a, RvAsmSavedClobber* s) {
+  /* Allocate frame slot and emit save instruction */
+}
+
+static void rv_asm_restore_one(RvNativeTarget* a, RvAsmSavedClobber* s) {
+  /* Emit restore instruction */
+}
+```
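The stubbed `rv_asm_save_callee_clobbers` has to intersect the clobber mask with the callee-saved set. A minimal sketch of that filtering step, assuming the standard RISC-V psABI register roles; `demo_callee_clobbers` and `DEMO_CALLEE_SAVED_MASK` are hypothetical names, and frame-slot allocation is left out:

```c
#include <assert.h>
#include <stdint.h>

/* RISC-V psABI callee-saved registers as a bitmask over register numbers:
 * s0-s1 = x8-x9 and s2-s11 = x18-x27. The FP set uses the same numbers
 * (fs0-fs1 = f8-f9, fs2-fs11 = f18-f27), so one mask serves both classes. */
#define DEMO_CALLEE_SAVED_MASK ((UINT32_C(3) << 8) | (UINT32_C(0x3ff) << 18))

/* Collect the clobbered registers that need a save/restore pair, in ascending
 * register order. Returns how many entries were written to out[]. */
unsigned demo_callee_clobbers(uint32_t clob_mask, uint8_t out[32]) {
  uint32_t need = clob_mask & DEMO_CALLEE_SAVED_MASK;
  unsigned n = 0;
  for (unsigned r = 0; r < 32; ++r) {
    if (need & (UINT32_C(1) << r)) out[n++] = (uint8_t)r;
  }
  assert(n <= 12); /* at most 12 callee-saved registers per class */
  return n;
}
```

Clobbering `t0`, `s1`, `a0`, and `s2` (x5, x9, x10, x18) therefore produces two saves, for x9 and x18. Restoring in reverse order, as the callers above do, mirrors the save sequence.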
+
+---
+
+#### NativeTarget Path: `rv_asm_block_native(NativeTarget* t, const char* tmpl, const AsmConstraint* outs, u32 nout, NativeLoc* out_locs, const AsmConstraint* ins, u32 nin, const NativeLoc* in_locs, const Sym* clobbers, u32 nclob)`
+
+**Pseudo-C** (lines ~4480–4600 in native.c):
+```c
+static void rv_asm_block_native(NativeTarget* t, const char* tmpl,
+                                const AsmConstraint* outs, u32 nout,
+                                NativeLoc* out_locs, const AsmConstraint* ins,
+                                u32 nin, const NativeLoc* in_locs,
+                                const Sym* clobbers, u32 nclob) {
+  RvNativeTarget* a = rv_of(t);
+  Compiler* c = t->c;
+  SrcLoc loc = a->func ? a->func->loc : (SrcLoc){0, 0, 0};
+  Operand* bound_outs = nout ? arena_zarray(c->tu, Operand, nout) : NULL;
+  Operand* bound_ins = nin ? arena_zarray(c->tu, Operand, nin) : NULL;
+  u32 clob_int, clob_fp, ntmp = 0;
+  RvAsmSavedClobber* saved;
+  u32 nsaved;
+  
+  rv_asm_clobber_masks(c, loc, clobbers, nclob, &clob_int, &clob_fp);
+  
+  /* Bind outputs using pre-allocated physical registers */
+  for (u32 i = 0; i < nout; ++i) {
+    CfreeCgTypeId type = outs[i].type ? outs[i].type : out_locs[i].type;
+    rv_asm_bind_native(a, loc, &bound_outs[i], outs[i].str, type,
+                       out_locs[i], &ntmp);
+  }
+  
+  /* Bind inputs: match to outputs or allocate from reserved scratch */
+  for (u32 i = 0; i < nin; ++i) {
+    const char* body = rv_asm_constraint_body(ins[i].str);
+    int matched = rv_asm_match_index(body);
+    CfreeCgTypeId type;
+    
+    if (matched >= 0) {
+      if ((u32)matched >= nout)
+        compiler_panic(c, loc, "rv64 asm: matching constraint out of range");
+      bound_ins[i] = bound_outs[matched];
+      continue;
+    }
+    
+    type = ins[i].type ? ins[i].type : in_locs[i].type;
+    /* Address-taken locals may arrive in frame slots; load into scratch if needed */
+    NativeLoc inloc = in_locs[i];
+    if (body[0] == 'r' && inloc.kind != NATIVE_LOC_REG) {
+      Reg r;
+      if (ntmp >= 2u)
+        compiler_panic(c, loc, "rv64 asm: too many memory operands");
+      r = (ntmp == 0u) ? RV_T0 : RV_T1;
+      ntmp++;
+      inloc = rv_reg_loc(type, NATIVE_REG_INT, r);
+      rv_emit_mem(a, 1, inloc, rv_asm_loc_to_addr(a, loc, in_locs[i]),
+                  rv_mem_for_type(t, type, type_size32(t, type)));
+    }
+    rv_asm_bind_native(a, loc, &bound_ins[i], ins[i].str, type, inloc, &ntmp);
+  }
+  
+  /* Save clobbered callee-saves */
+  saved = rv_asm_save_callee_clobbers(a, clob_int, clob_fp, &nsaved);
+  
+  /* Open assembler, bind, run template, close */
+  RvAsm* asmh = rv_asm_open(c);
+  rv_inline_bind(asmh, outs, nout, bound_outs, ins, nin, bound_ins, clobbers, nclob);
+  rv_asm_run_template(asmh, t->mc, tmpl);
+  rv_asm_close(asmh);
+  
+  /* Restore clobbered callee-saves */
+  for (u32 i = nsaved; i > 0; --i) {
+    rv_asm_restore_one(a, &saved[i - 1u]);
+  }
+}
+
+/* Helper: convert NativeLoc to NativeAddr (for memory operand materialization) */
+static NativeAddr rv_asm_loc_to_addr(RvNativeTarget* a, SrcLoc loc,
+                                     NativeLoc src) {
+  NativeAddr addr;
+  memset(&addr, 0, sizeof addr);
+  addr.base_type = src.type;
+  switch ((NativeLocKind)src.kind) {
+    case NATIVE_LOC_FRAME:
+      addr.base_kind = NATIVE_ADDR_BASE_FRAME;
+      addr.base.frame = src.v.frame;
+      return addr;
+    case NATIVE_LOC_ADDR:
+      return src.v.addr;
+    case NATIVE_LOC_GLOBAL:
+      addr.base_kind = NATIVE_ADDR_BASE_GLOBAL;
+      addr.base.global.sym = src.v.global.sym;
+      addr.base.global.addend = src.v.global.addend;
+      return addr;
+    case NATIVE_LOC_REG:
+      addr.base_kind = NATIVE_ADDR_BASE_REG;
+      addr.cls = NATIVE_REG_INT;
+      addr.base.reg = src.v.reg;
+      return addr;
+    default:
+      compiler_panic(a->base.c, loc, "rv64 asm: unsupported memory operand");
+  }
+}
+
+/* Helper: materialize memory-operand base into a single register */
+static Reg rv_asm_native_mem_base(RvNativeTarget* a, SrcLoc loc, NativeLoc src,
+                                  u32* ntmp) {
+  NativeAddr addr = rv_asm_loc_to_addr(a, loc, src);
+  u32 base;
+  i32 off;
+  Reg dst;
+  
+  if (addr.index_kind != NATIVE_ADDR_INDEX_NONE)
+    compiler_panic(a->base.c, loc, "rv64 asm: indexed memory operand unsupported");
+  
+  rv_addr_base(a, addr, &base, &off);
+  if (off == 0) return (Reg)base;
+  
+  if (*ntmp >= 2u)
+    compiler_panic(a->base.c, loc, "rv64 asm: too many memory operands");
+  
+  dst = (*ntmp == 0u) ? RV_T0 : RV_T1;
+  (*ntmp)++;
+  rv_emit_add_imm(a, dst, base, off);
+  return dst;
+}
+
+/* Helper: bind a single operand (output or input) to Operand struct for template */
+static void rv_asm_bind_native(RvNativeTarget* a, SrcLoc loc, Operand* out,
+                               const char* constraint, CfreeCgTypeId type,
+                               NativeLoc src, u32* ntmp) {
+  const char* body = rv_asm_constraint_body(constraint);
+  
+  if (body[0] == 'r' || body[0] == 'f') {
+    NativeAllocClass cls = (body[0] == 'f') ? NATIVE_REG_FP : NATIVE_REG_INT;
+    if (src.kind != NATIVE_LOC_REG)
+      compiler_panic(a->base.c, loc, "rv64 asm: register operand not in a register");
+    rv_asm_bound_reg(out, type, cls, (Reg)src.v.reg);
+  } else if (body[0] == 'i') {
+    if (src.kind != NATIVE_LOC_IMM)
+      compiler_panic(a->base.c, loc, "rv64 asm: immediate operand is not immediate");
+    memset(out, 0, sizeof *out);
+    out->kind = OPK_IMM;
+    out->type = type;
+    out->v.imm = src.v.imm;
+  } else if (body[0] == 'm') {
+    rv_asm_bound_mem(out, type, rv_asm_native_mem_base(a, loc, src, ntmp));
+  } else {
+    compiler_panic(a->base.c, loc, "rv64 asm: unsupported constraint '%s'", constraint);
+  }
+}
+```
+
+---
+
+## GROUP 5D: Compiler Intrinsics & Finalization
+
+### Intrinsics: `rv_intrinsic(NativeTarget* t, IntrinKind kind, const NativeLoc* dsts, u32 ndst, const NativeLoc* args, u32 narg)`
+
+**Subset of intrinsics to implement** (from rv64/ops.c lines 2112–2250):
+- INTRIN_EXPECT / INTRIN_ASSUME_ALIGNED: identity on value
+- INTRIN_PREFETCH: no-op
+- INTRIN_CLZ / INTRIN_CTZ / INTRIN_POPCOUNT: via software loops or native instructions if available
+- INTRIN_BSWAP16 / BSWAP32 / BSWAP64: byte-swap sequences
+- INTRIN_SADD/UADD/SSUB/USUB/SMUL/UMUL_OVERFLOW: compute result + overflow bit (via a wider intermediate or sign-bit inspection)
+- INTRIN_MEMCPY / MEMMOVE / MEMSET: bulk memory operations (delegates to copy_bytes / set_bytes)
+- INTRIN_TRAP / UNREACHABLE: EBREAK instruction
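Since RISC-V has no condition flags, the overflow bit for the add/sub intrinsics has to be derived from the operands and the wrapped result. A reference sketch of the 64-bit identities in portable C; the `demo_` helpers are hypothetical and only model the semantics, not the emitted instruction sequence:

```c
#include <assert.h>
#include <stdint.h>

/* Signed add: overflow iff both operands share a sign that the result lacks. */
int demo_sadd_overflow(int64_t a, int64_t b, int64_t* res) {
  uint64_t ua = (uint64_t)a, ub = (uint64_t)b;
  uint64_t r = ua + ub; /* wraps like the ADD instruction */
  *res = (int64_t)r;
  return (int)((~(ua ^ ub) & (ua ^ r)) >> 63);
}

/* Signed sub: overflow iff the operands differ in sign and the result's sign
 * differs from the minuend's. */
int demo_ssub_overflow(int64_t a, int64_t b, int64_t* res) {
  uint64_t ua = (uint64_t)a, ub = (uint64_t)b;
  uint64_t r = ua - ub;
  *res = (int64_t)r;
  return (int)(((ua ^ ub) & (ua ^ r)) >> 63);
}

/* Unsigned add: a carry occurred iff the wrapped sum is below an operand
 * (a single SLTU on the result). */
int demo_uadd_overflow(uint64_t a, uint64_t b, uint64_t* res) {
  *res = a + b;
  return *res < a;
}

/* Unsigned sub: a borrow occurred iff the subtrahend exceeds the minuend. */
int demo_usub_overflow(uint64_t a, uint64_t b, uint64_t* res) {
  *res = a - b;
  return a < b;
}
```

The multiply variants are not covered here; they would typically compare against the high half of the product (MULH/MULHU), which is an assumption about the eventual lowering rather than something the notes specify.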
+
+**Body sketch** (pseudo-C, ~200 lines):
+```c
+static void rv_intrinsic(NativeTarget* t, IntrinKind kind,
+                         const NativeLoc* dsts, u32 ndst,
+                         const NativeLoc* args, u32 narg) {
+  RvNativeTarget* a = rv_of(t);
+  MCEmitter* mc = t->mc;
+  
+  switch (kind) {
+    case INTRIN_EXPECT:
+    case INTRIN_ASSUME_ALIGNED:
+      if (ndst == 1u && narg >= 1u) {
+        if (args[0].kind == NATIVE_LOC_IMM) {
+          t->load_imm(t, dsts[0], args[0].v.imm);
+        } else {
+          t->move(t, dsts[0], args[0]);
+        }
+      }
+      return;
+    
+    case INTRIN_PREFETCH:
+      return;  /* No-op: prefetch hints (e.g. Zicbop prefetch.r/w) are not emitted yet */
+    
+    case INTRIN_TRAP:
+    case INTRIN_UNREACHABLE:
+      rv64_emit32(mc, rv_ebreak());
+      return;
+    
+    case INTRIN_CLZ:
+      if (ndst == 1u && narg == 1u) {
+        /* Count leading zeros (software loop if no Zbb extension) */
+        /* For now: loop via set-bit-on-msb and bit-scan-reverse emulation */
+        /* Deferred: check for Zbb (CLZ native instruction) support */
+        rv_intrinsic_clz(t, dsts[0], args[0]);
+      }
+      return;
+    
+    case INTRIN_CTZ:
+      if (ndst == 1u && narg == 1u) {
+        /* Count trailing zeros (software; Zbb has CTZ) */
+        rv_intrinsic_ctz(t, dsts[0], args[0]);
+      }
+      return;
+    
+    case INTRIN_POPCOUNT:
+      if (ndst == 1u && narg == 1u) {
+        /* Population count (software; Zbb has CPOP) */
+        rv_intrinsic_popcount(t, dsts[0], args[0]);
+      }
+      return;
+    
+    case INTRIN_BSWAP16:
+    case INTRIN_BSWAP32:
+    case INTRIN_BSWAP64: {
+      u32 result_size = (kind == INTRIN_BSWAP16) ? 2 : (kind == INTRIN_BSWAP32) ? 4 : 8;
+      rv_intrinsic_bswap(t, dsts[0], args[0], result_size);
+      return;
+    }
+    
+    case INTRIN_SADD_OVERFLOW:
+    case INTRIN_UADD_OVERFLOW:
+    case INTRIN_SSUB_OVERFLOW:
+    case INTRIN_USUB_OVERFLOW:
+      if (ndst == 2u && narg == 2u) {
+        rv_intrinsic_binop_overflow(t, kind, dsts[0], dsts[1], args[0], args[1]);
+      }
+      return;
+    
+    case INTRIN_SMUL_OVERFLOW:
+    case INTRIN_UMUL_OVERFLOW:
+      if (ndst == 2u && narg == 2u) {
+        rv_intrinsic_mul_overflow(t, kind, dsts[0], dsts[1], args[0], args[1]);
+      }
+      return;
+    
+    case INTRIN_MEMCPY:
+      if (narg == 3u && args[0].kind == NATIVE_LOC_REG &&
+          args[1].kind == NATIVE_LOC_REG && args[2].kind == NATIVE_LOC_IMM) {
+        NativeAddr dst_addr, src_addr;
+        AggregateAccess access;
+        memset(&dst_addr, 0, sizeof dst_addr);
+        memset(&src_addr, 0, sizeof src_addr);
+        memset(&access, 0, sizeof access);
+        access.size = (u32)args[2].v.imm;
+        access.align = 1u;
+        dst_addr.base_kind = NATIVE_ADDR_BASE_REG;
+        dst_addr.base.reg = args[0].v.reg;
+        src_addr.base_kind = NATIVE_ADDR_BASE_REG;
+        src_addr.base.reg = args[1].v.reg;
+        t->copy_bytes(t, dst_addr, src_addr, access);
+      }
+      return;
+    
+    case INTRIN_MEMMOVE: {
+      MCLabel forward = mc->label_new(mc);
+      MCLabel done = mc->label_new(mc);
+      if (narg == 3u && args[0].kind == NATIVE_LOC_REG &&
+          args[1].kind == NATIVE_LOC_REG && args[2].kind == NATIVE_LOC_IMM) {
+        NativeAddr dst_addr, src_addr;
+        AggregateAccess access;
+        memset(&dst_addr, 0, sizeof dst_addr);
+        memset(&src_addr, 0, sizeof src_addr);
+        memset(&access, 0, sizeof access);
+        access.size = (u32)args[2].v.imm;
+        access.align = 1u;
+        dst_addr.base_kind = NATIVE_ADDR_BASE_REG;
+        dst_addr.base.reg = args[0].v.reg;
+        src_addr.base_kind = NATIVE_ADDR_BASE_REG;
+        src_addr.base.reg = args[1].v.reg;
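+        /* Direction check: when dst < src a forward copy is overlap-safe, so
+         * branch to `forward`; otherwise fall through to the backward copy.
+         * Addresses are unsigned, so this wants BLTU rather than signed BLT. */
+        rv64_emit32(mc, rv_blt(args[0].v.reg, args[1].v.reg, 0));
+        mc->emit_label_ref(mc, forward, R_RV_BRANCH, 4, 0);
+        /* Copy backward (from end). Both arms call the same copy_bytes in this
+         * sketch; a true backward copy needs a descending-address variant. */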
+        t->copy_bytes(t, dst_addr, src_addr, access);
+        rv64_emit32(mc, rv_jal(RV_ZERO, 0));
+        mc->emit_label_ref(mc, done, R_RV_JAL, 4, 0);
+        /* Copy forward (from start) */
+        mc->label_place(mc, forward);
+        t->copy_bytes(t, dst_addr, src_addr, access);
+        mc->label_place(mc, done);
+      }
+      return;
+    }
+    
+    case INTRIN_MEMSET:
+      if (narg == 3u && args[0].kind == NATIVE_LOC_REG &&
+          args[2].kind == NATIVE_LOC_IMM) {
+        NativeAddr dst_addr;
+        AggregateAccess access;
+        memset(&dst_addr, 0, sizeof dst_addr);
+        memset(&access, 0, sizeof access);
+        access.size = (u32)args[2].v.imm;
+        access.align = 1u;
+        dst_addr.base_kind = NATIVE_ADDR_BASE_REG;
+        dst_addr.base.reg = args[0].v.reg;
+        if (args[1].kind == NATIVE_LOC_IMM) {
+          NativeLoc byte = rv_reg_loc(builtin_id(CFREE_CG_BUILTIN_I8), NATIVE_REG_INT, RV_T0);
+          t->load_imm(t, byte, args[1].v.imm & 0xffu);
+          t->set_bytes(t, dst_addr, byte, access);
+        } else {
+          t->set_bytes(t, dst_addr, args[1], access);
+        }
+      }
+      return;
+    
+    default:
+      compiler_panic(t->c, (SrcLoc){0, 0, 0}, "rv64: unsupported intrinsic %d", (int)kind);
+  }
+}
+```
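For the `INTRIN_MEMMOVE` case, the semantics the emitted code has to match can be modeled in a few lines. This is a reference sketch only: `demo_memmove` is a hypothetical name, and it copies byte by byte rather than through `copy_bytes`:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Overlap-safe byte copy. The direction is chosen by an unsigned address
 * comparison, which is what a BLTU on the two pointer registers would test. */
void demo_memmove(unsigned char* dst, const unsigned char* src, size_t n) {
  assert(n == 0 || (dst != NULL && src != NULL));
  if ((uintptr_t)dst < (uintptr_t)src) {
    /* dst below src: ascending copy never overwrites unread source bytes */
    for (size_t i = 0; i < n; ++i) dst[i] = src[i];
  } else {
    /* dst at or above src: descending copy for the same reason */
    for (size_t i = n; i > 0; --i) dst[i - 1] = src[i - 1];
  }
}
```

Shifting the first four bytes of `{1,2,3,4,5,6,7,8}` up by two positions gives `{1,2,1,2,3,4,7,8}`; an ascending copy in that case would smear the first two bytes across the destination.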
+
+---
+
+### Finalization Hooks
+
+#### `rv_trap(NativeTarget* t)`
+```c
+static void rv_trap(NativeTarget* t) {
+  rv64_emit32(t->mc, rv_ebreak());
+}
+```
+
+#### `rv_set_loc(NativeTarget* t, SrcLoc loc)`
+```c
+static void rv_set_loc(NativeTarget* t, SrcLoc loc) {
+  RvNativeTarget* a = rv_of(t);
+  a->loc = loc;
+  if (t->mc && t->mc->set_loc) t->mc->set_loc(t->mc, loc);
+}
+```
+
+#### `rv_file_scope_asm(NativeTarget* t, const char* src, size_t len)`
+```c
+static void rv_file_scope_asm(NativeTarget* t, const char* src, size_t len) {
+  RvAsm* lex = rv_asm_open_mem(t->c, "<file-scope-asm>", src, len);
+  rv_asm_parse(t->c, lex, t->mc);
+  rv_asm_close(lex);
+}
+```
+
+#### `rv_finalize(NativeTarget* t)`
+```c
+static void rv_finalize(NativeTarget* t) {
+  if (t->mc) mc_emit_eh_frame(t->mc);  /* DWARF unwind info */
+}
+```
+
+#### `rv_destroy(NativeTarget* t)` (if needed)
+```c
+static void rv_destroy(NativeTarget* t) {
+  /* Cleanup: free RvNativeTarget private state if dynamically allocated */
+  if (t) free((RvNativeTarget*)t);
+}
+```
+
+---
+
+## Initialization Hook Registration (near end of native.c)
+
+Insert before the final `return t;` in the main creation function (e.g., `rv64_backend_make`):
+
+```c
+  t->atomic_load = rv_atomic_load;
+  t->atomic_store = rv_atomic_store;
+  t->atomic_rmw = rv_atomic_rmw;
+  t->atomic_cas = rv_atomic_cas;
+  t->fence = rv_fence;
+  
+  t->va_start_ = rv_va_start_native;
+  t->va_arg_ = rv_va_arg_native;
+  t->va_end_ = rv_va_end_native;
+  t->va_copy_ = rv_va_copy_native;
+  
+  t->intrinsic = rv_intrinsic;
+  t->asm_block = rv_asm_block_native;
+  t->file_scope_asm = rv_file_scope_asm;
+  
+  t->trap = rv_trap;
+  t->set_loc = rv_set_loc;
+  t->finalize = rv_finalize;
+  /* t->destroy optional; only if custom cleanup needed */
+```
+
+And for the **NativeOps** direct-path adapter (to be registered in the NativeOps vtable):
+
+```c
+static const NativeOps rv_direct_ops = {
+    .bind_param = rv_bind_param,
+    .tail_call_unrealizable_reason = rv_no_tail,
+    .va_start_ = rv_va_start_,
+    .va_arg_ = rv_va_arg_,
+    .va_end_ = rv_va_end_,
+    .va_copy_ = rv_va_copy_,
+    .asm_block = rv_direct_asm_block,
+};
+
+const NativeOps* rv64_native_direct_ops(void) { return &rv_direct_ops; }
+```
+
+---
+
+## Summary: Function Counts & Organization
+
+**Group 5 hook count**: 15 core hooks (atomic×5, va×4, asm×2, intrinsic, trap, set_loc, finalize; destroy optional)
+
+**File organization**:
+- `native.c`: ~4500 lines (mirror aa64 volume)
+  - Atomics: ~250 lines
+  - Variadics: ~200 lines (core + helpers)
+  - Inline ASM: ~900 lines (direct + native paths + constraint/register helpers)
+  - Intrinsics: ~200 lines
+  - Finalization: ~100 lines
+  - Helper utilities (addr materialization, scratch alloc, etc.): ~300 lines
+  - Setup/registration: ~50 lines
+
+**Key helper modules** (reused from existing rv64 files):
+- `isa.h` — atomic instruction encoders (rv_lr_w/d, rv_sc_w/d, rv_fence_*)
+- `asm.c/asm.h` — inline ASM binding (rv_inline_bind, rv_asm_run_template)
+- `regs.c` — register metadata (NativeRegInfo, constraint resolution)
+
+**Do NOT duplicate**: regs.c, isa.h, arch.c, link.c, dbg.c, disasm.c — all are reused as-is.
+
diff --git a/doc/OPT_O0_NATIVE_DIRECT_NOTES.md b/doc/OPT_O0_NATIVE_DIRECT_NOTES.md
@@ -0,0 +1,186 @@
+# NativeDirectTarget O0 investigation
+
+Notes from a 2026-05-29 read of the direct native O0 path, focused on reducing
+the aarch64 prologue padding and reducing stack traffic for compiler
+temporaries while keeping semantic codegen single-pass.
+
+Relevant code:
+
+- `src/cg/native_direct_target.c`: shared direct O0 semantic target.
+- `src/arch/aa64/native.c`: physical aarch64 frame/prologue/call lowering.
+- `src/cg/local.c`, `src/cg/memory.c`: value-stack temporary allocation and
+  local/value stack behavior.
+- `src/opt/pass_native_emit.c`: known-frame O1 path for comparison.
+
+## Current direct O0 shape
+
+`NativeDirectTarget` is genuinely single-pass: frontend/value-stack operations
+are lowered immediately to `NativeTarget` calls. It does not record the function
+body, liveness, or a complete frame plan.
+
+On aa64, `aa_func_begin` reserves `AA_PROLOGUE_WORDS` (24 words) at entry,
+emits entry-save stores after that reserved region, and `aa_func_end` patches
+the reserved words once final `cum_off` and `max_outgoing` are known. For common
+small top-record frames the real prologue is only four instructions:
+
+```asm
+sub sp, sp, #N
+add x17, sp, #(N - 16)
+stp x29, x30, [x17]
+add x29, sp, #(N - 16)
+```
+
+Because the reserved region is 24 words, the patcher emits those four
+instructions followed by `b body` and 19 nops. That is the branch/nop blob seen
+in O0 disassembly.
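The padding arithmetic can be written down directly. A small sketch, where `DemoPad` and `demo_prologue_pad` are hypothetical names and the only inputs are the reserved word count and the real prologue length described above:

```c
#include <assert.h>

typedef struct {
  unsigned branch; /* 1 if a `b body` follows the real prologue */
  unsigned nops;   /* padding words after the branch */
} DemoPad;

/* Layout of a reserved prologue region of `reserve` words once a real
 * prologue of `real` instructions has been patched in. */
DemoPad demo_prologue_pad(unsigned reserve, unsigned real) {
  DemoPad p = {0u, 0u};
  assert(real <= reserve);
  if (real < reserve) {
    p.branch = 1u;                /* skip the unused tail of the region */
    p.nops = reserve - real - 1u; /* everything after the branch */
  }
  return p;
}
```

For the current 24-word reserve and a four-instruction prologue this gives one branch and 19 nops; a reserve that exactly matches the prologue length needs neither.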
+
+The O1 path avoids this by planning the whole frame in `plan_frame`, then using
+`func_begin_known_frame`. That is not directly available to O0 without either
+recording/pre-scanning the function or buffering the body.
+
+## Prologue options that keep single-pass semantics
+
+### 1. Small inline region plus out-of-line slow prologue
+
+Reserve a small direct-prologue region, probably four words, and record the
+first body offset. At `func_end`:
+
+- If the final frame fits the four-word top-record prologue, patch those four
+  words directly. No entry branch and no padding in the common case.
+- If it does not fit, patch entry to branch to an out-of-line prologue stub
+  emitted after the function body/epilogue. The stub emits the full prologue
+  and branches back to the recorded body-start label.
+
+This keeps the frontend and direct target single-pass: only the rare large-frame
+machine-code prologue is moved out of line. It needs careful unwind/debug work:
+the first executed prologue instructions are no longer laid out before the body
+in the slow case, so CFI/line handling cannot keep assuming a contiguous
+entry-prologue region.
+
+This is the best shape if we want the common O0 case to have zero padding and
+still support arbitrarily large frames without pre-scanning.
+
+### 2. Smaller fixed reserve
+
+Use a smaller reserved region for direct O0, e.g. 8 or 12 words instead of 24.
+This is simple and low risk, and immediately reduces the current branch target
+distance and static padding.
+
+It does not remove the branch unless the real prologue exactly fills the
+reserved region. For the observed small frames, a reserve of 8 still gives:
+
+```asm
+4 real prologue insns
+b body
+3 nops
+```
+
+This is a conservative stepping stone but not the end state.
+
+### 3. Four-word reserve with hard fallback restriction
+
+Reserve four words and require direct O0 frames to fit the small top-record
+prologue. This would remove the common padding with minimal code churn, but it
+is not acceptable as a general solution unless paired with a fallback. Large
+frames, odd entry-save cases, or future callee-save use could exceed four words.
+
+### 4. Buffer body emission until function end
+
+Keep semantic generation single-pass, but make the native emitter buffer each
+function body in memory. At `func_end`, compute the final frame, emit the exact
+prologue, then append the buffered body and epilogue.
+
+This gives the cleanest machine code and keeps frontend parsing single-pass, but
+it is no longer single-pass object emission. It also needs relocation/label/CFI
+plumbing in the emitter buffer. This is effectively a mini recorded native body,
+so it is less attractive than option 1.
+
+### What is not possible
+
+The `fp_at_bottom` O1 fold needs final `frame_size` before any body memory
+operand is emitted, because slot offsets become `frame_size - slot_off`.
+Without a pre-scan or body buffering, direct O0 cannot use that layout safely.
+For true single-pass direct emission, O0 should keep the top-record layout and
+focus on removing reserved padding.
+
+## Stack traffic and compiler temporaries
+
+`NativeDirectTarget` already has a write-back local register cache:
+
+- cacheable locals are scalar, non-address-taken, non-memory-required;
+- cache regs are caller-saved allocables only, so calls just flush the cache;
+- pure compute destinations use `nd_dst_reg` and stay dirty in a cache reg until
+  a flush or eviction.
+
+The cache is underused. Several ops still force their scalar result through a
+scratch register and immediately store to the frame home:
+
+- `nd_load`
+- `nd_bitfield_load`
+- `nd_addr_of`
+- `nd_load_label_addr`
+- `nd_tls_addr_of`
+- `nd_alloca`
+- `nd_atomic_load` / `nd_atomic_rmw` / `nd_atomic_cas`
+- `nd_intrinsic`
+- direct call returns
+- parameter binding
+
+For scalar, non-escaped destination locals, most of these can use
+`nd_dst_reg` + `nd_dst_writeback` instead of `nd_dst_scratch` +
+`nd_store_operand_from_reg`. That keeps the result in the direct local cache and
+avoids a store/reload pair when the value feeds later compute or address ops.
+
+The safest first wave:
+
+1. Change scalar `nd_load` to write through `nd_dst_reg`.
+2. Change `nd_addr_of`, `nd_load_label_addr`, `nd_tls_addr_of`, and `nd_alloca`
+   similarly for pointer results.
+3. Change `nd_bitfield_load` for scalar results.
+4. For direct call returns, allocate cache regs for scalar result locals before
+   planning the call, let the call plan move return registers into those regs,
+   then mark the locals dirty after the call. Keep aggregate/sret returns on the
+   frame path.
+5. Bind scalar parameters into cache regs when the local is cacheable; flush if
+   their address is later taken.
+
+These are still not a real allocator. They only extend the existing local cache
+to more producer kinds.
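To illustrate why routing more producers through the cache saves frame stores, here is a toy model of a write-back local cache. Everything in it is an assumption made for illustration (two cache registers, round-robin eviction, the `Demo`/`demo_` names); it is not the `NativeDirectTarget` implementation:

```c
#include <assert.h>
#include <stdint.h>

enum { DEMO_NREGS = 2, DEMO_NLOCALS = 4 };

typedef struct {
  int local[DEMO_NREGS];      /* local held by each cache reg, -1 if free */
  int dirty[DEMO_NREGS];      /* reg is newer than the frame home */
  int64_t reg[DEMO_NREGS];
  int64_t home[DEMO_NLOCALS]; /* frame homes */
  unsigned stores;            /* frame stores emitted so far */
  unsigned next;              /* round-robin eviction cursor */
} DemoCache;

void demo_cache_init(DemoCache* c) {
  for (int i = 0; i < DEMO_NREGS; ++i) {
    c->local[i] = -1;
    c->dirty[i] = 0;
    c->reg[i] = 0;
  }
  for (int i = 0; i < DEMO_NLOCALS; ++i) c->home[i] = 0;
  c->stores = 0;
  c->next = 0;
}

/* Write one slot back to its frame home if it is dirty. */
static void demo_writeback(DemoCache* c, int i) {
  if (c->local[i] >= 0 && c->dirty[i]) {
    c->home[c->local[i]] = c->reg[i];
    c->stores++;
  }
  c->dirty[i] = 0;
}

/* Define `local` = v in a cache reg; the frame store is deferred. */
void demo_def(DemoCache* c, int local, int64_t v) {
  int i;
  assert(local >= 0 && local < DEMO_NLOCALS);
  for (i = 0; i < DEMO_NREGS; ++i)
    if (c->local[i] == local) break; /* already cached */
  if (i == DEMO_NREGS)
    for (i = 0; i < DEMO_NREGS; ++i)
      if (c->local[i] < 0) break;    /* free reg */
  if (i == DEMO_NREGS) {             /* evict the next victim */
    i = (int)(c->next++ % DEMO_NREGS);
    demo_writeback(c, i);
  }
  c->local[i] = local;
  c->reg[i] = v;
  c->dirty[i] = 1;
}

/* Barrier such as a call: cache regs are caller-saved, so flush everything. */
void demo_flush(DemoCache* c) {
  for (int i = 0; i < DEMO_NREGS; ++i) {
    demo_writeback(c, i);
    c->local[i] = -1;
  }
}
```

In this model three successive definitions of the same local followed by a flush cost one frame store, where a store-through scheme would emit three.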
+
+## Compiler-temp identity
+
+The public API has `CFREE_CG_LOCAL_COMPILER_TEMP`, and the Wasm frontend uses it
+heavily, but the current semantic `CGLocalDesc` only carries
+`CG_LOCAL_ADDR_TAKEN` and `CG_LOCAL_MEMORY_REQUIRED`. The C parser also creates
+many temporary locals through `pcg_local` / `cg_local` without setting a
+compiler-temp flag.
+
+Short-term improvement:
+
+- Add an internal `CG_LOCAL_COMPILER_TEMP` bit to `CGLocalFlag`.
+- In `cfree_cg_local` / `cfree_cg_param`, propagate
+  `CFREE_CG_LOCAL_COMPILER_TEMP` from `CfreeCgLocalAttrs` into `CGLocalDesc`.
+- Add C-parser helpers for compiler temporaries and use them in
+  `builtin_tmp_slot`, `ll_tmp_slot`, conditional/compound-assignment stash
+  temps, EA normalization temps, and value-stack temps from
+  `api_alloc_temp_local`.
+
+This flag should not by itself remove the frame home in direct O0. Without
+liveness, a temp may still need to survive a branch, call, or cache eviction.
+But it can drive policy:
+
+- do not require debug homes for compiler temps;
+- prefer caching compiler temps over anonymous source locals;
+- make diagnostics/debug omit them;
+- later, allow an explicitly "ephemeral" temp API for values whose lifetime is
+  stack-top-only and cannot cross barriers.
+
+## Practical priority
+
+1. Prologue: implement the small inline region + out-of-line slow prologue, or
+   first reduce the fixed direct reserve as a low-risk intermediate.
+2. Stack traffic: extend the existing cache to scalar load/address/alloca and
+   bitfield-load producers.
+3. Call/param traffic: cache scalar params and scalar direct returns.
+4. Metadata: propagate compiler-temp identity through `CGLocalDesc`, then use it
+   for cache preference and debug/home policy.
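
Note on the "Compiler-temp identity" section above: a minimal C sketch of the proposed flag propagation and the policy it could drive. Every `sketch_`/`SKETCH_` name is hypothetical. The real bit values and layouts of `CfreeCgLocalAttrs`, `CGLocalDesc`, and `CGLocalFlag` are not shown in this patch, and `CG_LOCAL_COMPILER_TEMP` is only proposed. The sketch models policy only; as the notes say, the flag by itself does not remove the frame home.

```c
#include <stdint.h>

/* Hypothetical stand-ins: the real bit values are not shown in these notes. */
enum {
  SKETCH_PUBLIC_COMPILER_TEMP = 1u << 0 /* models CFREE_CG_LOCAL_COMPILER_TEMP */
};
enum {
  SKETCH_ADDR_TAKEN = 1u << 0,      /* models CG_LOCAL_ADDR_TAKEN */
  SKETCH_MEMORY_REQUIRED = 1u << 1, /* models CG_LOCAL_MEMORY_REQUIRED */
  SKETCH_COMPILER_TEMP = 1u << 2    /* models the proposed CG_LOCAL_COMPILER_TEMP */
};

/* Carry the public compiler-temp attribute into the internal flag word. */
static uint32_t sketch_propagate(uint32_t public_attrs, uint32_t internal) {
  if (public_attrs & SKETCH_PUBLIC_COMPILER_TEMP)
    internal |= SKETCH_COMPILER_TEMP;
  return internal;
}

/* Cache preference: 0 = must live in memory, 1 = source local, 2 = temp. */
static int sketch_cache_priority(uint32_t internal) {
  if (internal & (SKETCH_ADDR_TAKEN | SKETCH_MEMORY_REQUIRED)) return 0;
  return (internal & SKETCH_COMPILER_TEMP) ? 2 : 1;
}

/* Debug homes are only required for locals the user can name. */
static int sketch_needs_debug_home(uint32_t internal) {
  return (internal & SKETCH_COMPILER_TEMP) == 0;
}
```

Under these assumptions, a compiler temp whose address is never taken gets priority 2 and no debug home, while an address-taken temp still falls back to memory (priority 0).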
diff --git a/doc/OPT_O0_PERF_NOTES.md b/doc/OPT_O0_PERF_NOTES.md
@@ -0,0 +1,168 @@
+# cfree -O0 runtime gaps vs clang -O0
+
+Focused notes for the unoptimized codegen/runtime gap. `doc/OPT_O1_PERF_TODO.md`
+tracks O1 work; this file catalogs the O0 shape seen in a small aarch64/Apple
+sample against clang O0.
+
+## Snapshot
+
+Measured 2026-05-29 on Apple/aarch64 using the MIR `c-benchmarks` sources from
+`$HOME/tmp/mir/c-benchmarks`. Each runtime is best of two runs with the
+benchmark's checked-in `.arg` and `.expect` files. Artifacts:
+
+- raw timings: `build/bench/o0gap/manual_results.csv`
+- object disassembly: `build/bench/o0gap/disasm/{clang,cfree}.O0.*.txt`
+- executable disassembly for binary-trees:
+  `build/bench/o0gap/disasm/{clang,cfree}.O0.binary-trees.exe.txt`
+
+| bench | clang O0 runtime ms | cfree O0 runtime ms | cfree / clang | object text: clang -> cfree |
+| --- | ---: | ---: | ---: | ---: |
+| binary-trees | 2720 | 3754 | 1.38x slower | 975 B -> 10459 B |
+| hash2 | 8754 | 15138 | 1.73x slower | 3127 B -> 18258 B |
+| sieve | 13668 | 12607 | 0.92x (faster) | 411 B -> 2571 B |
+| mandelbrot | 10544 | 13215 | 1.25x slower | 774 B -> 3870 B |
+
+Geomean over these four: cfree O0 is 1.29x slower than clang O0.
+
+## Cross-cutting gaps
+
+### O0 still emits patched-prologue padding
+
+Every cfree O0 function in this sample starts with a real prologue, then an
+unconditional branch over roughly 19 nops:
+
+```asm
+sub sp, sp, #0x60
+add x17, sp, #0x50
+stp x29, x30, [x17]
+add x29, sp, #0x50
+b   body
+nop
+...
+body:
+```
+
+This is the old fat-prologue reservation shape that O1's known-frame path fixed.
+At runtime it costs one extra unconditional branch per function entry, and
+statically it bloats hot functions enough to hurt I-cache locality. In
+binary-trees, the recursive hot functions all have it:
+
+| function | clang O0 insns | cfree O0 insns |
+| --- | ---: | ---: |
+| NewTreeNode | 18 | 58 |
+| ItemCheck | 27 | 73 |
+| BottomUpTree | 29 | 82 |
+| DeleteTree | 20 | 59 |
+
+### Compiler-generated temporaries spill as if they were user variables
+
+cfree O0 preserves a very literal expression lowering: many values are copied
+through temps, stored to stack, then immediately reloaded. Clang O0 still uses
+stack slots heavily, but it keeps simple expression values in registers across a
+single expression.
+
+Concrete examples:
+
+- `NewTreeNode`: cfree stores the constant `16` to the frame and reloads it into
+  `x0` before `malloc`; clang uses `mov x0, #0x10`.
+- `BottomUpTree(NULL, NULL)`: cfree creates several zero/copy temporaries before
+  loading `x0`/`x1`; clang does `mov x1, #0; mov x0, x1`.
+- `ht_find_new`: cfree is 171 instructions with a 0x130-byte frame; clang is 61
+  instructions with a 0x40-byte frame.
+- `mandelbrot`: cfree main is 562 instructions with a large temporary area;
+  clang main is 191 instructions.
+
+This is an O0-specific quality issue, not necessarily an optimizer issue: keep
+debuggable stack locations for source variables, but avoid materializing
+compiler-only intermediates as mandatory stack homes.
+
+### Header inline/helper emission bloats O0 objects
+
+cfree emits many SDK/header helper functions that clang does not emit into these
+objects:
+
+- `binary-trees` includes `<math.h>` and cfree emits `___inline_isfinite*`,
+  `___inline_isinf*`, `___inline_isnan*`, `___sincos*`, `___sputc`,
+  `__OSSwapInt*`, etc.
+- `hash2` includes `simple_hash.h`, which includes `<ctype.h>`; cfree emits a
+  large set of ctype helpers such as `_isalnum`, `_isalpha`, `_isdigit`,
+  `_tolower`, `_toupper`, etc.
+- even `sieve` and `mandelbrot` get `___sputc` / `__OSSwapInt*` helpers.
+
+Most of this is not on the hot runtime path, but it inflates object text and the
+linked image. It also obscures disassembly and compile/link-time measurements.
+O0 should not eagerly emit unused static inline/header helpers unless they are
+referenced.
+
+### FP O0 misses cheap instruction selection wins
+
+The `mandelbrot` escape loop shows a runtime-visible gap:
+
+- clang uses `fmov` immediates for `1.0` / `2.0` and uses `fmadd` for
+  `2.0 * Zr * Zi + Ci` even at O0.
+- cfree repeatedly materializes FP constants with `mov`/`movk`/`fmov`, spills
+  intermediate FP values, and emits separate `fmul` + `fadd`.
+
+These are instruction-selection/local lowering issues rather than global
+optimization. They should be safe to improve at O0 without changing source-level
+debuggability.
+
+## Per-benchmark notes
+
+### binary-trees
+
+Runtime: cfree O0 is 1.38x slower than clang O0.
+
+The main gap is hot recursive function shape: every call pays the O0 padded
+prologue branch and roughly 3x the instruction count (2.7x to 3.2x in the
+table above). The final executable uses stubs for allocator/libc calls on both
+compilers, so unlike the O1 note this sample does not look like a pure
+external-call-path issue; cfree's function bodies are substantially larger at O0.
+
+Priority fixes:
+
+1. Reuse the known-frame/no-padding prologue path for O0 when the frame is known.
+2. Stop stack-homing compiler-only temporaries in the expression-lowering path.
+3. Extend the existing call-argument immediate/coalescing work to O0 lowering.
+
+### hash2
+
+Runtime: cfree O0 is 1.73x slower than clang O0, the worst case in this sample.
+
+The hot helpers are much larger:
+
+| function | clang O0 insns | cfree O0 insns |
+| --- | ---: | ---: |
+| ht_hashcode | 30 | 87 |
+| ht_find | 37 | 105 |
+| ht_find_new | 61 | 171 |
+| ht_next | 55 | 142 |
+
+Both compilers use `udiv` for `val % ht->size` at O0, so the O1 modulo/reciprocal
+hypothesis is not the O0 gap here. The O0 gap is mostly stack/copy expansion,
+plus eager emission of unused ctype helpers from headers.
+
+### sieve
+
+Runtime: cfree O0 takes 0.92x the runtime of clang O0 in this run (about 8%
+faster), so this is not a runtime gap today despite cfree's larger object.
+
+The disassembly still shows the cross-cutting O0 problems: padded prologue,
+larger frame, repeated constant/base materialization, and extra stack temporaries.
+However, the hot loop is simple enough that the extra code did not lose in this
+two-run sample. Treat sieve as a regression guard rather than a first target.
+
+### mandelbrot
+
+Runtime: cfree O0 is 1.25x slower than clang O0.
+
+This is the clearest local instruction-selection case. The inner FP loop has
+substantial temporary spills, repeated FP constant construction, and misses
+`fmadd`. Clang O0 is still unoptimized, but its local FP lowering is compact
+enough to matter here.
+
+Priority fixes:
+
+1. Use encodable FP immediates directly where available.
+2. Select `fmadd` for multiply-add expression trees during lowering.
+3. Avoid spilling FP expression intermediates that have no source-level storage.
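
Note on mandelbrot priority fix 2 above: a minimal sketch of selecting `fmadd` for a multiply-add tree during lowering. The mini-IR (`Ex`, `SelOp`, `select_root`, `count_insns`) is hypothetical and is not cfree's IR; it only shows the pattern match. A fused multiply-add rounds once, so the sketch assumes contraction of the expression is permitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mini-IR, just enough to show the pattern match. */
typedef enum { EX_LEAF, EX_MUL, EX_ADD } ExKind;
typedef struct Ex {
  ExKind kind;
  const struct Ex *lhs, *rhs; /* NULL for leaves */
} Ex;

typedef enum { SEL_NONE, SEL_FMUL, SEL_FADD, SEL_FMADD } SelOp;

/* Pick the root instruction for an FP expression node.
 * ADD(MUL(a, b), c) and ADD(c, MUL(a, b)) both map to one fmadd. */
static SelOp select_root(const Ex *e) {
  assert(e != NULL);
  switch (e->kind) {
    case EX_MUL:
      return SEL_FMUL;
    case EX_ADD:
      if (e->lhs->kind == EX_MUL || e->rhs->kind == EX_MUL) return SEL_FMADD;
      return SEL_FADD;
    default:
      return SEL_NONE;
  }
}

/* Count the arithmetic instructions emitted for the whole tree. */
static int count_insns(const Ex *e) {
  assert(e != NULL);
  if (e->kind == EX_LEAF) return 0;
  if (select_root(e) == SEL_FMADD) {
    /* The fused MUL child costs nothing extra; recurse into its operands. */
    const Ex *mul = (e->lhs->kind == EX_MUL) ? e->lhs : e->rhs;
    const Ex *addend = (mul == e->lhs) ? e->rhs : e->lhs;
    return 1 + count_insns(mul->lhs) + count_insns(mul->rhs) +
           count_insns(addend);
  }
  return 1 + count_insns(e->lhs) + count_insns(e->rhs);
}
```

For `2.0 * Zr * Zi + Ci`, modeled as `ADD(MUL(MUL(2.0, Zr), Zi), Ci)`, this selection yields one `fmul` plus one `fmadd` instead of two `fmul` and one `fadd`.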
diff --git a/include/cfree/config.h b/include/cfree/config.h
@@ -26,7 +26,7 @@
 /* Backend architectures. */
 #define CFREE_ARCH_AA64_ENABLED 1
 #define CFREE_ARCH_X64_ENABLED 0
-#define CFREE_ARCH_RV64_ENABLED 0
+#define CFREE_ARCH_RV64_ENABLED 1
 #define CFREE_ARCH_WASM_ENABLED 1
 #define CFREE_ARCH_C_TARGET_ENABLED 1
 
diff --git a/scripts/toy_cross_batch.sh b/scripts/toy_cross_batch.sh
@@ -0,0 +1,113 @@
+#!/usr/bin/env bash
+# scripts/toy_cross_batch.sh — fast batched cross-exec of the toy suite for a
+# single target arch + opt level. Compiles/links every case natively (fast),
+# then runs all executables inside ONE podman container, instead of the
+# per-case podman invocation test/toy/run.sh uses. Cuts a full-suite run from
+# ~30 min (257 emulated container starts) to a couple of minutes.
+#
+# Usage: scripts/toy_cross_batch.sh <arch> <opt> [name_substr]
+#   arch: rv64 | x64 | aa64        opt: 0 | 1
+# Env: CFREE (path to cfree binary, default build/cfree)
+#
+# Checks only the X path (cross-compile + link + exec rc vs <name>.expected).
+# Compile/link failures and nonzero/mismatched rc count as FAIL.
+set -u
+ROOT="$(cd "$(dirname "$0")/.." && pwd)"
+CFREE="${CFREE:-$ROOT/build/cfree}"
+ARCH="${1:?arch}"
+OPT="${2:?opt}"
+FILTER="${3:-}"
+CASES="$ROOT/test/toy/cases"
+WORK="/tmp/toy_cross_batch/$ARCH-O$OPT"
+rm -rf "$WORK"; mkdir -p "$WORK"
+
+case "$ARCH" in
+  rv64) TRIPLE=riscv64-linux-gnu; PLAT=linux/riscv64 ;;
+  x64)  TRIPLE=x86_64-linux-gnu;  PLAT=linux/amd64 ;;
+  aa64) TRIPLE=aarch64-linux-gnu; PLAT=linux/arm64 ;;
+  *) echo "unknown arch $ARCH"; exit 2 ;;
+esac
+IMAGE="${IMAGE:-alpine:latest}"
+
+# Build a freestanding _start.o for the target via clang (mirrors run.sh).
+START_C="$WORK/start.c"
+cat > "$START_C" <<'EOF'
+extern int main(void);
+__attribute__((noreturn)) static void do_exit(int code) {
+#if defined(__aarch64__)
+  register long x8 __asm__("x8") = 94; register long x0 __asm__("x0") = code;
+  __asm__ volatile("svc #0" ::"r"(x8), "r"(x0) : "memory");
+#elif defined(__x86_64__)
+  register long rax __asm__("rax") = 231; register long rdi __asm__("rdi") = code;
+  __asm__ volatile("syscall" ::"r"(rax), "r"(rdi) : "memory");
+#elif defined(__riscv) && __riscv_xlen == 64
+  register long a7 __asm__("a7") = 94; register long a0 __asm__("a0") = code;
+  __asm__ volatile("ecall" ::"r"(a7), "r"(a0) : "memory");
+#endif
+  __builtin_unreachable();
+}
+#if defined(__x86_64__)
+__attribute__((force_align_arg_pointer))
+#endif
+void _start(void) { do_exit(main()); }
+EOF
+if ! clang --target="$TRIPLE" -O1 -ffreestanding -fno-stack-protector -fno-PIC \
+     -fno-pie -c "$START_C" -o "$WORK/start.o" 2>"$WORK/start.err"; then
+  echo "FATAL: could not build start.o for $TRIPLE"; cat "$WORK/start.err"; exit 2
+fi
+
+PASS=0; FAIL=0; SKIP=0; FAILED=()
+RUNLIST="$WORK/runlist.txt"; : > "$RUNLIST"
+declare -A EXPECT
+
+for src in "$CASES"/*.toy; do
+  name="$(basename "$src" .toy)"
+  [ -n "$FILTER" ] && [[ "$name" != *"$FILTER"* ]] && continue
+  # asmnop is aa64-specific before toy asm selectors.
+  if [ "$ARCH" != "aa64" ] && grep -q 'asmnop' "$src"; then
+    SKIP=$((SKIP+1)); continue
+  fi
+  exp=0; [ -f "${src%.toy}.expected" ] && exp="$(tr -d '[:space:]' < "${src%.toy}.expected")"
+  EXPECT[$name]=$exp
+  if ! "$CFREE" cc "-O$OPT" -target "$TRIPLE" -c "$src" -o "$WORK/$name.o" \
+       >"$WORK/$name.cc.out" 2>"$WORK/$name.cc.err"; then
+    FAIL=$((FAIL+1)); FAILED+=("$name [cc] $(head -1 "$WORK/$name.cc.err")"); continue
+  fi
+  if [ -s "$WORK/$name.cc.err" ]; then
+    FAIL=$((FAIL+1)); FAILED+=("$name [cc-stderr] $(head -1 "$WORK/$name.cc.err")"); continue
+  fi
+  if ! "$CFREE" ld "$WORK/$name.o" "$WORK/start.o" -o "$WORK/$name.exe" \
+       >"$WORK/$name.ld.out" 2>"$WORK/$name.ld.err"; then
+    FAIL=$((FAIL+1)); FAILED+=("$name [ld] $(head -1 "$WORK/$name.ld.err")"); continue
+  fi
+  chmod +x "$WORK/$name.exe" 2>/dev/null || true
+  echo "$name" >> "$RUNLIST"
+done
+
+# Single batched container run: execute every linked exe, print "name rc".
+RESULTS="$WORK/results.txt"
+if [ -s "$RUNLIST" ]; then
+  podman run --rm --pull=never --platform "$PLAT" --net=none \
+    -v "$WORK:$WORK" -w "$WORK" "$IMAGE" sh -c '
+      while read n; do
+        "./$n.exe" >/dev/null 2>"'"$WORK"'/$n.run.err"; echo "$n $?";
+      done < "'"$RUNLIST"'"' > "$RESULTS" 2>"$WORK/podman.err" || {
+        echo "FATAL: podman batch run failed"; cat "$WORK/podman.err"; exit 2; }
+fi
+
+while read -r name rc; do
+  exp=${EXPECT[$name]:-0}; exp=$((exp & 255))
+  if [ "$rc" -eq "$exp" ] && [ ! -s "$WORK/$name.run.err" ]; then
+    PASS=$((PASS+1))
+  else
+    FAIL=$((FAIL+1))
+    msg="rc=$rc want=$exp"; [ -s "$WORK/$name.run.err" ] && msg="$msg stderr=$(head -1 "$WORK/$name.run.err")"
+    FAILED+=("$name [run] $msg")
+  fi
+done < "$RESULTS"
+
+echo "==== $ARCH -O$OPT : $PASS pass, $FAIL fail, $SKIP skip ===="
+if [ "${#FAILED[@]}" -gt 0 ]; then
+  printf '%s\n' "${FAILED[@]}" | sort
+fi
+[ "$FAIL" -eq 0 ]
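
Note on the pass/fail check in scripts/toy_cross_batch.sh above: a small C model of why the script masks the expected value with `exp & 255` before comparing. A process exit status only carries the low 8 bits of the value passed to exit, so a case whose `main` returns 300 is observed as 44. The function names are hypothetical and only mirror the shell logic.

```c
/* Model of `exp=$((exp & 255))`: the status the shell can observe.
 * The unsigned cast keeps the masking well-defined for negative values. */
static int expected_exit_status(int expected) {
  return (int)((unsigned)expected & 0xFFu);
}

/* Model of the script's pass condition: matching status and empty stderr. */
static int case_passes(int rc, int expected, int stderr_nonempty) {
  return rc == expected_exit_status(expected) && !stderr_nonempty;
}
```

A case that expects -1 therefore has to observe `rc=255`, and any output on stderr fails the case even when the status matches.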
diff --git a/src/arch/rv64/alloc.c b/src/arch/rv64/alloc.c
@@ -1,589 +0,0 @@
-/* src/arch/rv64/alloc.c — register pool, spill/reload, labels, control flow. */
-
-#include "arch/rv64/internal.h"
-
-/* ---- frame ---- */
-
-FrameSlot rv_frame_slot(CGTarget* t, const FrameSlotDesc* d) {
-  RImpl* a = impl_of(t);
-  if (a->nslots == a->slots_cap) {
-    u32 ncap = a->slots_cap ? a->slots_cap * 2 : 8;
-    RvSlot* nbuf = arena_array(t->c->tu, RvSlot, ncap);
-    if (a->slots) memcpy(nbuf, a->slots, sizeof(RvSlot) * a->nslots);
-    a->slots = nbuf;
-    a->slots_cap = ncap;
-  }
-  u32 size = d->size ? d->size : 8;
-  u32 align = d->align ? d->align : 1;
-  u32 next = a->cum_off + size;
-  u32 mask = align - 1;
-  next = (next + mask) & ~mask;
-
-  RvSlot* s = &a->slots[a->nslots];
-  s->off = next;
-  s->size = size;
-  s->align = align;
-  s->kind = d->kind;
-
-  a->cum_off = next;
-  a->nslots++;
-  return (FrameSlot)(a->nslots);
-}
-
-RvSlot* rv64_slot_get(RImpl* a, FrameSlot fs) {
-  if (fs == FRAME_SLOT_NONE || fs > a->nslots) return NULL;
-  return &a->slots[fs - 1];
-}
-
-/* ---- param ---- */
-
-static void rv_consume_param_location(RImpl* a, const ABIArgInfo* ai) {
-  if (!ai || ai->kind == ABI_ARG_IGNORE) return;
-  if (ai->kind == ABI_ARG_INDIRECT) {
-    if (a->next_param_int < 8)
-      ++a->next_param_int;
-    else
-      a->next_param_stack += 8;
-    return;
-  }
-  for (u16 i = 0; i < ai->nparts; ++i) {
-    const ABIArgPart* pt = &ai->parts[i];
-    if (pt->cls == ABI_CLASS_INT) {
-      if (a->next_param_int < 8)
-        ++a->next_param_int;
-      else
-        a->next_param_stack += 8;
-    } else if (pt->cls == ABI_CLASS_FP) {
-      if (a->next_param_fp < 8)
-        ++a->next_param_fp;
-      else
-        a->next_param_stack += 8;
-    }
-  }
-}
-
-CGLocalStorage rv_param(CGTarget* t, const CGParamDesc* p) {
-  RImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-  CGLocalStorage st = p->storage;
-  if (st.kind == CG_LOCAL_STORAGE_FRAME && st.v.frame_slot == FRAME_SLOT_NONE) {
-    FrameSlotDesc fsd = {0};
-    fsd.type = p->type;
-    fsd.name = p->name;
-    fsd.loc = p->loc;
-    fsd.size = p->size;
-    fsd.align = p->align;
-    fsd.kind = FS_PARAM;
-    if (p->flags & CG_LOCAL_ADDR_TAKEN) fsd.flags |= FSF_ADDR_TAKEN;
-    st.v.frame_slot = rv_frame_slot(t, &fsd);
-  }
-  RvSlot* s = st.kind == CG_LOCAL_STORAGE_FRAME
-                  ? rv64_slot_get(a, st.v.frame_slot)
-                  : NULL;
-  if (st.kind == CG_LOCAL_STORAGE_FRAME && !s)
-    compiler_panic(t->c, a->loc, "rv64 param: bad slot");
-  const ABIArgInfo* ai = p->abi;
-  /* Caller's stack args start above the saved-s0/ra pair, plus the
-   * 64-byte variadic save area when this function is variadic. */
-  u32 incoming_stack_base = a->omit_frame ? RV_SP : RV_S0;
-  i32 caller_stack_base = a->omit_frame ? 0 : 16 + (a->is_variadic ? 64 : 0);
-
-  if (ai->kind == ABI_ARG_IGNORE) return st;
-  if (st.kind == CG_LOCAL_STORAGE_REG && st.v.reg == (Reg)REG_NONE) {
-    rv_consume_param_location(a, ai);
-    return st;
-  }
-  if (st.kind == CG_LOCAL_STORAGE_REG) {
-    if (ai->kind != ABI_ARG_DIRECT || ai->nparts != 1) {
-      compiler_panic(t->c, a->loc,
-                     "rv64 param: register storage requires one direct part");
-    }
-    const ABIArgPart* pt = &ai->parts[0];
-    u32 sz = pt->size;
-    if (pt->cls == ABI_CLASS_INT) {
-      u32 dst = st.v.reg;
-      if (a->next_param_int < 8) {
-        u32 src = RV_A0 + a->next_param_int;
-        a->next_param_int++;
-        if (dst != src) rv64_emit32(mc, rv_addi(dst, src, 0));
-      } else {
-        u32 caller_off = a->next_param_stack;
-        a->next_param_stack += 8;
-        rv64_emit32(mc, enc_int_load(sz, 0, dst, incoming_stack_base,
-                                     caller_stack_base + (i32)caller_off));
-      }
-    } else if (pt->cls == ABI_CLASS_FP) {
-      u32 dst = st.v.reg;
-      if (a->next_param_fp < 8) {
-        u32 src = 10u + a->next_param_fp;
-        a->next_param_fp++;
-        if (dst != src) {
-          rv64_emit32(mc, rv_fsgnj(sz == 8 ? 1u : 0u, dst, src, src));
-        }
-      } else {
-        u32 caller_off = a->next_param_stack;
-        a->next_param_stack += 8;
-        if (sz == 8)
-          rv64_emit32(mc, rv_fld(dst, incoming_stack_base,
-                                 caller_stack_base + (i32)caller_off));
-        else
-          rv64_emit32(mc, rv_flw(dst, incoming_stack_base,
-                                 caller_stack_base + (i32)caller_off));
-      }
-    } else {
-      compiler_panic(t->c, a->loc, "rv64 param: ABI class %d unimpl",
-                     (int)pt->cls);
-    }
-    return st;
-  }
-  if (ai->kind == ABI_ARG_INDIRECT) {
-    /* Pointer-to-copy passed in a-register. Copy bytes from there into
-     * the home slot.  Source pointer is in a0..a7. */
-    u32 ptr_reg;
-    if (a->next_param_int < 8) {
-      ptr_reg = RV_A0 + a->next_param_int;
-      a->next_param_int++;
-    } else {
-      u32 caller_off = a->next_param_stack;
-      a->next_param_stack += 8;
-      /* Incoming stack args live in the caller's outgoing-arg area,
-       * which is `frame_size - fp_pair_off` (= 16 + the saved-s0/ra
-       * pair) above s0 — same logic as aa64's `16 + caller_off`. */
-      rv64_emit32(mc, rv_ld(RV_T1, incoming_stack_base,
-                            caller_stack_base + (i32)caller_off));
-      ptr_reg = RV_T1;
-    }
-    u32 nbytes = s->size;
-    u32 i = 0;
-    while (i + 8 <= nbytes) {
-      rv64_emit32(mc, rv_ld(RV_T2, ptr_reg, (i32)i));
-      rv64_emit32(mc, rv_sd(RV_T2, RV_S0, -(i32)s->off + (i32)i));
-      i += 8;
-    }
-    while (i + 4 <= nbytes) {
-      rv64_emit32(mc, rv_lwu(RV_T2, ptr_reg, (i32)i));
-      rv64_emit32(mc, rv_sw(RV_T2, RV_S0, -(i32)s->off + (i32)i));
-      i += 4;
-    }
-    while (i + 2 <= nbytes) {
-      rv64_emit32(mc, rv_lhu(RV_T2, ptr_reg, (i32)i));
-      rv64_emit32(mc, rv_sh(RV_T2, RV_S0, -(i32)s->off + (i32)i));
-      i += 2;
-    }
-    while (i < nbytes) {
-      rv64_emit32(mc, rv_lbu(RV_T2, ptr_reg, (i32)i));
-      rv64_emit32(mc, rv_sb(RV_T2, RV_S0, -(i32)s->off + (i32)i));
-      i += 1;
-    }
-    return st;
-  }
-  /* DIRECT */
-  for (u16 i = 0; i < ai->nparts; ++i) {
-    const ABIArgPart* pt = &ai->parts[i];
-    u32 part_off = pt->src_offset;
-    u32 sz = pt->size;
-
-    if (pt->cls == ABI_CLASS_INT) {
-      if (a->next_param_int < 8) {
-        u32 reg = RV_A0 + a->next_param_int;
-        a->next_param_int++;
-        rv64_emit32(
-            mc, enc_int_store(sz, reg, RV_S0, -(i32)s->off + (i32)part_off));
-      } else {
-        u32 caller_off = a->next_param_stack;
-        a->next_param_stack += 8;
-        rv64_emit32(mc, enc_int_load(sz, 0, RV_T2, incoming_stack_base,
-                                     caller_stack_base + (i32)caller_off));
-        rv64_emit32(
-            mc, enc_int_store(sz, RV_T2, RV_S0, -(i32)s->off + (i32)part_off));
-      }
-    } else if (pt->cls == ABI_CLASS_FP) {
-      if (a->next_param_fp < 8) {
-        u32 reg = a->next_param_fp; /* fa0..fa7 → freg 10..17 */
-        u32 freg = 10u + reg;
-        a->next_param_fp++;
-        if (sz == 8) {
-          rv64_emit32(mc, rv_fsd(freg, RV_S0, -(i32)s->off + (i32)part_off));
-        } else {
-          rv64_emit32(mc, rv_fsw(freg, RV_S0, -(i32)s->off + (i32)part_off));
-        }
-      } else {
-        u32 caller_off = a->next_param_stack;
-        a->next_param_stack += 8;
-        if (sz == 8) {
-          rv64_emit32(mc, rv_fld(0, incoming_stack_base,
-                                 caller_stack_base + (i32)caller_off));
-          rv64_emit32(mc, rv_fsd(0, RV_S0, -(i32)s->off + (i32)part_off));
-        } else {
-          rv64_emit32(mc, rv_flw(0, incoming_stack_base,
-                                 caller_stack_base + (i32)caller_off));
-          rv64_emit32(mc, rv_fsw(0, RV_S0, -(i32)s->off + (i32)part_off));
-        }
-      }
-    } else {
-      compiler_panic(t->c, a->loc, "rv64 param: ABI class %d unimpl",
-                     (int)pt->cls);
-    }
-  }
-  return st;
-}
-
-void rv_spill_reg(CGTarget* t, Operand src, FrameSlot slot, MemAccess ma) {
-  RImpl* a = impl_of(t);
-  if (src.kind != OPK_REG) {
-    compiler_panic(t->c, a->loc, "rv64 spill_reg: src is not OPK_REG");
-  }
-  Operand addr;
-  memset(&addr, 0, sizeof addr);
-  addr.kind = OPK_LOCAL;
-  addr.cls = RC_INT;
-  addr.type = ma.type;
-  addr.v.frame_slot = slot;
-  rv_store(t, addr, src, ma);
-}
-
-void rv_reload_reg(CGTarget* t, Operand dst, FrameSlot slot, MemAccess ma) {
-  RImpl* a = impl_of(t);
-  if (dst.kind != OPK_REG) {
-    compiler_panic(t->c, a->loc, "rv64 reload_reg: dst is not OPK_REG");
-  }
-  Operand addr;
-  memset(&addr, 0, sizeof addr);
-  addr.kind = OPK_LOCAL;
-  addr.cls = RC_INT;
-  addr.type = ma.type;
-  addr.v.frame_slot = slot;
-  rv_load(t, dst, addr, ma);
-}
-
-/* ---- labels / control flow ---- */
-
-Label rv_label_new(CGTarget* t) { return (Label)t->mc->label_new(t->mc); }
-void rv_label_place(CGTarget* t, Label l) {
-  t->mc->label_place(t->mc, (MCLabel)l);
-}
-void rv_jump(CGTarget* t, Label l) {
-  MCEmitter* mc = t->mc;
-  rv64_emit32(mc, rv_jal(RV_ZERO, 0));
-  mc->emit_label_ref(mc, (MCLabel)l, R_RV_JAL, 4, 0);
-}
-
-void rv_load_label_addr(CGTarget* t, Operand dst, Label l) {
-  /* AUIPC rd, %hi(L); ADDI rd, rd, %lo(L) — PC-relative pair fixed up
-   * via R_RV_INTRA_AUIPC_ADDI (width=8, addend=0 references the AUIPC
-   * site). */
-  MCEmitter* mc = t->mc;
-  u32 rd;
-  if (dst.kind != OPK_REG) {
-    compiler_panic(t->c, impl_of(t)->loc,
-                   "rv64: load_label_addr dst must be REG");
-  }
-  rd = reg_num(dst);
-  rv64_emit32(mc, rv_auipc(rd, 0));
-  rv64_emit32(mc, rv_addi(rd, rd, 0));
-  mc->emit_label_ref(mc, (MCLabel)l, R_RV_INTRA_AUIPC_ADDI, 8, 0);
-}
-
-void rv_indirect_branch(CGTarget* t, Operand addr, const Label* targets,
-                        u32 ntargets) {
-  /* JALR x0, rd, 0  — register-indirect jump (discards return address). */
-  MCEmitter* mc = t->mc;
-  u32 rs1;
-  (void)targets;
-  (void)ntargets;
-  if (addr.kind != OPK_REG) {
-    compiler_panic(t->c, impl_of(t)->loc,
-                   "rv64: indirect_branch expects REG operand");
-  }
-  rs1 = reg_num(addr);
-  rv64_emit32(mc, rv_i(0, rs1, 0, RV_ZERO, RV_JALR));
-}
-
-/* Force an integer Operand into a register; materializes IMM via scratch. */
-u32 rv64_force_reg_int(CGTarget* t, Operand op, u32 scratch) {
-  if (op.kind == OPK_REG) return reg_num(op);
-  if (op.kind == OPK_IMM) {
-    u32 sf = type_is_64(op.type) ? 1u : 0u;
-    rv64_emit_load_imm(t->mc, sf, scratch, op.v.imm);
-    return scratch;
-  }
-  compiler_panic(t->c, impl_of(t)->loc,
-                 "rv64: operand kind %d unsupported here", (int)op.kind);
-}
-
-static int signed_order_cmp_op(CmpOp op) {
-  return op == CMP_LT_S || op == CMP_LE_S || op == CMP_GT_S || op == CMP_GE_S;
-}
-
-static int unsigned_order_cmp_op(CmpOp op) {
-  return op == CMP_LT_U || op == CMP_LE_U || op == CMP_GT_U || op == CMP_GE_U;
-}
-
-static u32 sign_extend_i32_for_cmp(MCEmitter* mc, u32 src, u32 other) {
-  u32 dst =
-      (src == RV_T0 || src == RV_T1) ? src : ((other == RV_T0) ? RV_T1 : RV_T0);
-  rv64_emit32(mc, rv_addiw(dst, src, 0));
-  return dst;
-}
-
-static u32 zero_extend_i32_for_cmp(MCEmitter* mc, u32 src, u32 other) {
-  u32 dst =
-      (src == RV_T0 || src == RV_T1) ? src : ((other == RV_T0) ? RV_T1 : RV_T0);
-  rv64_emit32(mc, rv_slli(dst, src, 32));
-  rv64_emit32(mc, rv_srli(dst, dst, 32));
-  return dst;
-}
-
-static void canonicalize_i32_cmp_operands(MCEmitter* mc, CmpOp op,
-                                          CfreeCgTypeId type, u32* ra,
-                                          u32* rb) {
-  if (type_byte_size(type) != 4u) return;
-
-  if (unsigned_order_cmp_op(op)) {
-    *ra = zero_extend_i32_for_cmp(mc, *ra, *rb);
-    *rb = zero_extend_i32_for_cmp(mc, *rb, *ra);
-    return;
-  }
-
-  if (signed_order_cmp_op(op) || op == CMP_EQ || op == CMP_NE) {
-    *ra = sign_extend_i32_for_cmp(mc, *ra, *rb);
-    *rb = sign_extend_i32_for_cmp(mc, *rb, *ra);
-  }
-}
-
-/* Emit a conditional branch (a OP b) → label.  Uses BEQ/BNE/BLT/BGE etc. */
-void rv_cmp_branch(CGTarget* t, CmpOp op, Operand a_op, Operand b_op, Label l) {
-  MCEmitter* mc = t->mc;
-  RImpl* a = impl_of(t);
-  /* FP compares: materialize the comparison into a GPR via FLT/FLE,
-   * then branch on (result != 0). Inverted predicates are handled by
-   * swapping operands (a > b ↔ b < a, a >= b ↔ b <= a). */
-  if (op == CMP_LT_F || op == CMP_LE_F || op == CMP_GT_F || op == CMP_GE_F) {
-    int is_d = type_is_fp_double(a_op.type);
-    u32 fa = reg_num(a_op);
-    u32 fb = reg_num(b_op);
-    u32 rd = RV_T0;
-    switch (op) {
-      case CMP_LT_F:
-        rv64_emit32(mc, is_d ? rv_flt_d(rd, fa, fb) : rv_flt_s(rd, fa, fb));
-        break;
-      case CMP_LE_F:
-        rv64_emit32(mc, is_d ? rv_fle_d(rd, fa, fb) : rv_fle_s(rd, fa, fb));
-        break;
-      case CMP_GT_F:
-        rv64_emit32(mc, is_d ? rv_flt_d(rd, fb, fa) : rv_flt_s(rd, fb, fa));
-        break;
-      case CMP_GE_F:
-        rv64_emit32(mc, is_d ? rv_fle_d(rd, fb, fa) : rv_fle_s(rd, fb, fa));
-        break;
-      default:
-        break;
-    }
-    rv64_emit32(mc, rv_bne(rd, RV_ZERO, 0));
-    mc->emit_label_ref(mc, (MCLabel)l, R_RV_BRANCH, 4, 0);
-    return;
-  }
-  u32 ra = rv64_force_reg_int(t, a_op, RV_T0);
-  u32 rb = rv64_force_reg_int(t, b_op, (ra == RV_T0) ? RV_T1 : RV_T0);
-  canonicalize_i32_cmp_operands(mc, op, a_op.type, &ra, &rb);
-  u32 word = 0;
-  switch (op) {
-    case CMP_EQ:
-      word = rv_beq(ra, rb, 0);
-      break;
-    case CMP_NE:
-      word = rv_bne(ra, rb, 0);
-      break;
-    case CMP_LT_S:
-      word = rv_blt(ra, rb, 0);
-      break;
-    case CMP_GE_S:
-      word = rv_bge(ra, rb, 0);
-      break;
-    case CMP_LT_U:
-      word = rv_bltu(ra, rb, 0);
-      break;
-    case CMP_GE_U:
-      word = rv_bgeu(ra, rb, 0);
-      break;
-    /* >= can become <  with operands swapped: a > b  ↔  b < a;
-     *                                          a <= b ↔  b >= a. */
-    case CMP_GT_S:
-      word = rv_blt(rb, ra, 0);
-      break;
-    case CMP_LE_S:
-      word = rv_bge(rb, ra, 0);
-      break;
-    case CMP_GT_U:
-      word = rv_bltu(rb, ra, 0);
-      break;
-    case CMP_LE_U:
-      word = rv_bgeu(rb, ra, 0);
-      break;
-    default:
-      compiler_panic(t->c, a->loc, "rv64 cmp_branch: op %d unimpl", (int)op);
-  }
-  rv64_emit32(mc, word);
-  mc->emit_label_ref(mc, (MCLabel)l, R_RV_BRANCH, 4, 0);
-}
-
-/* Materialize 0/1 into dst from a comparison. */
-void rv_cmp(CGTarget* t, CmpOp op, Operand dst, Operand a_op, Operand b_op) {
-  MCEmitter* mc = t->mc;
-  RImpl* a = impl_of(t);
-  u32 rd = reg_num(dst);
-
-  if ((a_op.cls == RC_FP || b_op.cls == RC_FP) &&
-      (op == CMP_EQ || op == CMP_NE || op == CMP_LT_F || op == CMP_LE_F ||
-       op == CMP_GT_F || op == CMP_GE_F)) {
-    /* FP compare in fa,fb → rd. Use FLT/FLE/FEQ depending on op. */
-    int is_d = type_is_fp_double(a_op.type);
-    u32 fa = reg_num(a_op);
-    u32 fb = reg_num(b_op);
-    switch (op) {
-      case CMP_EQ:
-        rv64_emit32(mc, is_d ? rv_feq_d(rd, fa, fb) : rv_feq_s(rd, fa, fb));
-        return;
-      case CMP_NE:
-        rv64_emit32(mc, is_d ? rv_feq_d(rd, fa, fb) : rv_feq_s(rd, fa, fb));
-        rv64_emit32(mc, rv_xori(rd, rd, 1));
-        return;
-      case CMP_LT_F:
-        rv64_emit32(mc, is_d ? rv_flt_d(rd, fa, fb) : rv_flt_s(rd, fa, fb));
-        return;
-      case CMP_LE_F:
-        rv64_emit32(mc, is_d ? rv_fle_d(rd, fa, fb) : rv_fle_s(rd, fa, fb));
-        return;
-      case CMP_GT_F:
-        rv64_emit32(mc, is_d ? rv_flt_d(rd, fb, fa) : rv_flt_s(rd, fb, fa));
-        return;
-      case CMP_GE_F:
-        rv64_emit32(mc, is_d ? rv_fle_d(rd, fb, fa) : rv_fle_s(rd, fb, fa));
-        return;
-      default:
-        break;
-    }
-  }
-  u32 ra = rv64_force_reg_int(t, a_op, RV_T0);
-  u32 rb = rv64_force_reg_int(t, b_op, (ra == RV_T0) ? RV_T1 : RV_T0);
-  canonicalize_i32_cmp_operands(mc, op, a_op.type, &ra, &rb);
-
-  switch (op) {
-    case CMP_EQ:
-      rv64_emit32(mc, rv_sub(rd, ra, rb));
-      rv64_emit32(mc, rv_sltiu(rd, rd, 1));
-      return;
-    case CMP_NE:
-      rv64_emit32(mc, rv_sub(rd, ra, rb));
-      rv64_emit32(mc, rv_sltu(rd, RV_ZERO, rd));
-      return;
-    case CMP_LT_S:
-      rv64_emit32(mc, rv_slt(rd, ra, rb));
-      return;
-    case CMP_LT_U:
-      rv64_emit32(mc, rv_sltu(rd, ra, rb));
-      return;
-    case CMP_GT_S:
-      rv64_emit32(mc, rv_slt(rd, rb, ra));
-      return;
-    case CMP_GT_U:
-      rv64_emit32(mc, rv_sltu(rd, rb, ra));
-      return;
-    case CMP_GE_S:
-      rv64_emit32(mc, rv_slt(rd, ra, rb));
-      rv64_emit32(mc, rv_xori(rd, rd, 1));
-      return;
-    case CMP_GE_U:
-      rv64_emit32(mc, rv_sltu(rd, ra, rb));
-      rv64_emit32(mc, rv_xori(rd, rd, 1));
-      return;
-    case CMP_LE_S:
-      rv64_emit32(mc, rv_slt(rd, rb, ra));
-      rv64_emit32(mc, rv_xori(rd, rd, 1));
-      return;
-    case CMP_LE_U:
-      rv64_emit32(mc, rv_sltu(rd, rb, ra));
-      rv64_emit32(mc, rv_xori(rd, rd, 1));
-      return;
-    default:
-      compiler_panic(t->c, a->loc, "rv64 cmp: op %d unimpl", (int)op);
-  }
-}
-
-/* ---- structured scopes (SCOPE_IF + SCOPE_LOOP/BLOCK bookkeep) ---- */
-
-CGScope rv_scope_begin(CGTarget* t, const CGScopeDesc* d) {
-  RImpl* a = impl_of(t);
-  if (a->nscopes == a->scopes_cap) {
-    u32 ncap = a->scopes_cap ? a->scopes_cap * 2u : 4u;
-    RvScope* nb = arena_array(t->c->tu, RvScope, ncap);
-    if (a->scopes) memcpy(nb, a->scopes, sizeof(RvScope) * a->nscopes);
-    a->scopes = nb;
-    a->scopes_cap = ncap;
-  }
-  RvScope* sc = &a->scopes[a->nscopes];
-  sc->kind = (u8)d->kind;
-  sc->has_else = 0;
-  sc->else_label = 0;
-  sc->end_label = 0;
-  sc->break_label = d->break_label;
-  sc->continue_label = d->continue_label;
-
-  if (d->kind == SCOPE_IF) {
-    sc->else_label = t->mc->label_new(t->mc);
-    sc->end_label = t->mc->label_new(t->mc);
-    u32 rn = rv64_force_reg_int(t, d->cond, RV_T0);
-    /* beq rn, x0, else_label */
-    rv64_emit32(t->mc, rv_beq(rn, RV_ZERO, 0));
-    t->mc->emit_label_ref(t->mc, sc->else_label, R_RV_BRANCH, 4, 0);
-  } else if (d->kind == SCOPE_LOOP || d->kind == SCOPE_BLOCK) {
-    /* bookkeep only */
-  } else {
-    compiler_panic(t->c, a->loc,
-                   "rv64 scope_begin: kind %d not yet implemented",
-                   (int)d->kind);
-  }
-  a->nscopes++;
-  return (CGScope)a->nscopes;
-}
-
-void rv_scope_else(CGTarget* t, CGScope s) {
-  RImpl* a = impl_of(t);
-  if (s == CG_SCOPE_NONE || s > a->nscopes) {
-    compiler_panic(t->c, a->loc, "rv64 scope_else: bad scope");
-  }
-  RvScope* sc = &a->scopes[s - 1];
-  /* jump end ; place else */
-  rv64_emit32(t->mc, rv_jal(RV_ZERO, 0));
-  t->mc->emit_label_ref(t->mc, sc->end_label, R_RV_JAL, 4, 0);
-  t->mc->label_place(t->mc, sc->else_label);
-  sc->has_else = 1;
-}
-
-void rv_scope_end(CGTarget* t, CGScope s) {
-  RImpl* a = impl_of(t);
-  if (s == CG_SCOPE_NONE || s > a->nscopes) {
-    compiler_panic(t->c, a->loc, "rv64 scope_end: bad scope");
-  }
-  RvScope* sc = &a->scopes[s - 1];
-  if (sc->kind == SCOPE_IF) {
-    if (!sc->has_else) t->mc->label_place(t->mc, sc->else_label);
-    t->mc->label_place(t->mc, sc->end_label);
-  }
-}
-
-void rv_break_to(CGTarget* t, CGScope s) {
-  RImpl* a = impl_of(t);
-  if (s == CG_SCOPE_NONE || s > a->nscopes) {
-    compiler_panic(t->c, a->loc, "rv64 break_to: bad scope");
-  }
-  rv_jump(t, a->scopes[s - 1].break_label);
-}
-
-void rv_continue_to(CGTarget* t, CGScope s) {
-  RImpl* a = impl_of(t);
-  if (s == CG_SCOPE_NONE || s > a->nscopes) {
-    compiler_panic(t->c, a->loc, "rv64 continue_to: bad scope");
-  }
-  rv_jump(t, a->scopes[s - 1].continue_label);
-}
diff --git a/src/arch/rv64/arch.c b/src/arch/rv64/arch.c
@@ -4,6 +4,7 @@
 #include "arch/rv64/disasm.h"
 #include "arch/rv64/regs.h"
 #include "arch/rv64/rv64.h"
+#include "cg/native_direct_target.h"
 #include "core/bytes.h"
 #include "link/link_arch.h"
 #include "obj/obj.h"
@@ -121,23 +122,42 @@ static const CfreePredefinedMacro rv64_predefined_macros[] = {
     {CFREE_SLICE_LIT("__LITTLE_ENDIAN__"), CFREE_SLICE_LIT("1")},
 };
 
-static CGTarget* rv64_backend_make(Compiler* c, ObjBuilder* o,
+static CgTarget* rv64_backend_make(Compiler* c, ObjBuilder* o,
                                    const CfreeCodeOptions* opts) {
   MCEmitter* mc = NULL;
   Debug* debug = NULL;
-  CGTarget* t;
+  CgTarget* t;
+  NativeTarget* native;
+  NativeDirectTargetConfig cfg;
   if (cg_mc_debug_new(c, o, opts, &mc, &debug) != CFREE_OK) return NULL;
-  t = rv64_cgtarget_new(c, o, mc);
-  if (!t) return NULL;
-  t->debug = debug;
+  native = rv64_native_target_new(c, o, mc);
+  if (!native) return NULL;
+  memset(&cfg, 0, sizeof cfg);
+  cfg.native = native;
+  cfg.ops = rv64_native_direct_ops();
+  t = native_direct_target_new(c, o, &cfg);
+  if (t) t->debug = debug;
   return t;
 }
 
+static CgTarget* rv64_semantic_target_new(Compiler* c, ObjBuilder* o,
+                                          MCEmitter* mc) {
+  NativeTarget* native;
+  NativeDirectTargetConfig cfg;
+  if (!mc) mc = mc_new(c, o);
+  native = rv64_native_target_new(c, o, mc);
+  if (!native) return NULL;
+  memset(&cfg, 0, sizeof cfg);
+  cfg.native = native;
+  cfg.ops = rv64_native_direct_ops();
+  return native_direct_target_new(c, o, &cfg);
+}
+
 const ArchImpl arch_impl_rv64 = {
     .backend = {.name = "rv64", .make = rv64_backend_make},
     .kind = CFREE_ARCH_RV64,
     .name = "rv64",
-    .cgtarget_new = rv64_cgtarget_new,
+    .cgtarget_new = rv64_semantic_target_new,
     .asm_new = rv64_arch_asm_new,
     .disasm_new = rv64_disasm_new,
     .apply_label_fixup = rv64_apply_label_fixup,
diff --git a/src/arch/rv64/asm.c b/src/arch/rv64/asm.c
@@ -15,9 +15,10 @@
 
 #include <string.h>
 
-#include "arch/rv64/internal.h"
 #include "arch/rv64/isa.h"
 #include "arch/rv64/regs.h"
+#include "arch/rv64/rv64.h"
+#include "obj/obj.h"
 #include "asm/asm_helpers.h"
 #include "core/arena.h"
 #include "core/pool.h"
@@ -779,11 +780,11 @@ static void render_operand(Rv64Asm* a, StrBuf* sb, u32 idx, int form) {
   switch (form) {
     case 1: /* %wN — accept any reg/imm; rv64 has no narrower spelling. */
     case 2: /* %xN — same. */
-      if (op->kind == OPK_REG) {
-        if (op->cls == RC_FP)
-          render_freg(sb, (u32)op->v.reg);
+      if (op->kind == RV64_INLINE_OPK_REG) {
+        if (op->pad[0] == RV64_INLINE_OPCLS_FP)
+          render_freg(sb, (u32)op->v.local);
         else
-          render_xreg(sb, (u32)op->v.reg);
+          render_xreg(sb, (u32)op->v.local);
         return;
       }
       if (op->kind == OPK_IMM) {
@@ -793,22 +794,22 @@ static void render_operand(Rv64Asm* a, StrBuf* sb, u32 idx, int form) {
       inline_panic(a, "%w/%x on unsupported operand kind");
     case 3: /* %aN — memory addressing form */
       if (op->kind != OPK_INDIRECT) inline_panic(a, "%a on non-memory operand");
-      if (op->v.ind.index != REG_NONE)
+      if (op->v.ind.index != CG_LOCAL_NONE)
         inline_panic(a,
                      "%a on indexed memory operand: rv64 inline asm "
                      "requires base+disp only");
-      render_indirect(a, sb, op->v.ind.base, op->v.ind.ofs);
+      render_indirect(a, sb, (Reg)op->v.ind.base, op->v.ind.ofs);
       return;
     case 4: /* %zN — zero-or-reg */
       if (op->kind == OPK_IMM && op->v.imm == 0) {
         strbuf_puts(sb, "zero");
         return;
       }
-      if (op->kind == OPK_REG) {
-        if (op->cls == RC_FP)
-          render_freg(sb, (u32)op->v.reg);
+      if (op->kind == RV64_INLINE_OPK_REG) {
+        if (op->pad[0] == RV64_INLINE_OPCLS_FP)
+          render_freg(sb, (u32)op->v.local);
         else
-          render_xreg(sb, (u32)op->v.reg);
+          render_xreg(sb, (u32)op->v.local);
         return;
       }
       inline_panic(a, "%z on unsupported operand kind");
@@ -816,21 +817,21 @@ static void render_operand(Rv64Asm* a, StrBuf* sb, u32 idx, int form) {
       break;
   }
   switch (op->kind) {
-    case OPK_REG:
-      if (op->cls == RC_FP)
-        render_freg(sb, (u32)op->v.reg);
+    case RV64_INLINE_OPK_REG:
+      if (op->pad[0] == RV64_INLINE_OPCLS_FP)
+        render_freg(sb, (u32)op->v.local);
       else
-        render_xreg(sb, (u32)op->v.reg);
+        render_xreg(sb, (u32)op->v.local);
       return;
     case OPK_IMM:
       render_imm(sb, op->v.imm);
       return;
     case OPK_INDIRECT:
-      if (op->v.ind.index != REG_NONE)
+      if (op->v.ind.index != CG_LOCAL_NONE)
         inline_panic(a,
                      "indexed memory operand in inline asm: rv64 requires "
                      "base+disp only");
-      render_indirect(a, sb, op->v.ind.base, op->v.ind.ofs);
+      render_indirect(a, sb, (Reg)op->v.ind.base, op->v.ind.ofs);
       return;
     default:
       inline_panic(a, "unsupported operand kind for %N");
diff --git a/src/arch/rv64/asm.h b/src/arch/rv64/asm.h
@@ -14,6 +14,19 @@
 typedef struct AsmDriver AsmDriver;
 typedef struct Rv64Asm Rv64Asm;
 
+/* Private pseudo operand used by the rv64 inline-asm binder. Semantic CG
+ * operands never expose physical registers, so native.c lowers register
+ * constraints into this arch-private shape before template substitution:
+ * Operand.kind = RV64_INLINE_OPK_REG, Operand.v.local carries the 5-bit
+ * physical register number, Operand.pad[0] carries RV64_INLINE_OPCLS_*.
+ * Memory operands reuse OPK_INDIRECT with v.ind.base holding the physical
+ * base register and v.ind.index == CG_LOCAL_NONE. */
+enum {
+  RV64_INLINE_OPK_REG = 0xf0u,
+  RV64_INLINE_OPCLS_INT = 0u,
+  RV64_INLINE_OPCLS_FP = 1u,
+};
+
 ArchAsm* rv64_arch_asm_new(Compiler*);
 
 /* ---- inline-asm entry points (parallel to aa64) ---- */
diff --git a/src/arch/rv64/emit.c b/src/arch/rv64/emit.c
@@ -1,631 +0,0 @@
-/* src/arch/rv64/emit.c — immediate encoding, function lifecycle, frame setup.
- */
-
-#include "arch/rv64/internal.h"
-#include "core/slice.h"
-
-static u32 collect_mask_regs(u32 mask, u32 first, u32 last, u32* out) {
-  u32 n = 0;
-  for (u32 r = first; r <= last; ++r) {
-    if (mask & (1u << r)) out[n++] = r;
-  }
-  return n;
-}
-
-static u32 count_mask_regs(u32 mask, u32 first, u32 last) {
-  u32 n = 0;
-  for (u32 r = first; r <= last; ++r) {
-    if (mask & (1u << r)) ++n;
-  }
-  return n;
-}
-
-static u32 rv_planned_prologue_words(const RImpl* a) {
-  u32 n = RV_PROLOGUE_FRAME_WORDS;
-  if (a->has_sret) ++n;
-  if (a->is_variadic) n += 8u;
-  n += 4u * count_mask_regs(a->planned_cs_int_mask, 18u, 27u);
-  n += 4u * count_mask_regs(a->planned_cs_fp_mask, 18u, 27u);
-  return n ? n : 1u;
-}
-
-void rv64_emit32(MCEmitter* mc, u32 word) {
-  u32 ofs = obj_pos(mc->obj, mc->section_id);
-  u8 b[4];
-  b[0] = (u8)(word & 0xff);
-  b[1] = (u8)((word >> 8) & 0xff);
-  b[2] = (u8)((word >> 16) & 0xff);
-  b[3] = (u8)((word >> 24) & 0xff);
-  mc->emit_bytes(mc, b, 4);
-  if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc);
-}
-
-void rv64_emit16(MCEmitter* mc, u32 halfword) {
-  u32 ofs = obj_pos(mc->obj, mc->section_id);
-  u8 b[2];
-  b[0] = (u8)(halfword & 0xff);
-  b[1] = (u8)((halfword >> 8) & 0xff);
-  mc->emit_bytes(mc, b, 2);
-  if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc);
-}
-
-void rv64_patch32(ObjBuilder* obj, u32 sec_id, u32 ofs, u32 word) {
-  u8 b[4];
-  b[0] = (u8)(word & 0xff);
-  b[1] = (u8)((word >> 8) & 0xff);
-  b[2] = (u8)((word >> 16) & 0xff);
-  b[3] = (u8)((word >> 24) & 0xff);
-  obj_patch(obj, sec_id, ofs, b, 4);
-}
-
-_Noreturn void rv_panic(CGTarget* t, const char* what) {
-  SrcLoc loc = impl_of(t)->loc;
-  compiler_panic(t->c, loc, "rv64: %.*s not implemented",
-                 SLICE_ARG(slice_from_cstr(what)));
-}
-
-int fits_signed32(i64 v) {
-  return v >= (i64)(i32)0x80000000 && v <= (i64)(i32)0x7fffffff;
-}
-
-static i64 floor_div_4096(i64 v) {
-  if (v >= 0) return v / 4096;
-  return -((-v + 4095) / 4096);
-}
-
-void emit_li_32(MCEmitter* mc, u32 rd, i32 imm) {
-  if (imm >= -2048 && imm <= 2047) {
-    rv64_emit32(mc, rv_addi(rd, RV_ZERO, imm));
-    return;
-  }
-  /* hi20 + lo12, with 0x800 bias to compensate ADDIW's sign-ext. */
-  i64 hi64 = floor_div_4096((i64)imm + 0x800);
-  i64 lo64 = (i64)imm - hi64 * 4096;
-  i32 hi = (i32)hi64;
-  i32 lo = (i32)lo64;
-  rv64_emit32(mc, rv_lui(rd, (u32)hi & 0xfffffu));
-  if (lo) rv64_emit32(mc, rv_addiw(rd, rd, lo));
-}
-
-static i32 sign_extend_12(u32 v) {
-  v &= 0xfffu;
-  return (v & 0x800u) ? (i32)v - 4096 : (i32)v;
-}
-
-static int fits_signed32_bits(u64 v) {
-  return v <= 0x7fffffffull || v >= 0xffffffff80000000ull;
-}
-
-static i32 i32_from_bits(u32 v) {
-  if (v <= 0x7fffffffu) return (i32)v;
-  if (v == 0x80000000u) return -2147483647 - 1;
-  return -(i32)(~v + 1u);
-}
-
-static void emit_li_64(MCEmitter* mc, u32 rd, u64 imm) {
-  if (fits_signed32_bits(imm)) {
-    emit_li_32(mc, rd, i32_from_bits((u32)imm));
-    return;
-  }
-  i32 lo = sign_extend_12((u32)imm);
-  u64 hi = (imm - (u64)(i64)lo) >> 12;
-  emit_li_64(mc, rd, hi);
-  rv64_emit32(mc, rv_slli(rd, rd, 12));
-  if (lo) rv64_emit32(mc, rv_addi(rd, rd, lo));
-}
-
-void rv64_emit_load_imm(MCEmitter* mc, u32 sf, u32 rd, i64 imm) {
-  if (!sf) {
-    /* 32-bit destination: low 32 bits, sign-extended. */
-    emit_li_32(mc, rd, (i32)imm);
-    return;
-  }
-  if (fits_signed32(imm)) {
-    emit_li_32(mc, rd, (i32)imm);
-    return;
-  }
-  emit_li_64(mc, rd, (u64)imm);
-}
-
-/* sp += imm.  imm can be any signed value the caller passes — we pick
- * the shortest sequence. */
-void emit_sp_addi(MCEmitter* mc, i64 imm) {
-  if (imm >= -2048 && imm <= 2047) {
-    rv64_emit32(mc, rv_addi(RV_SP, RV_SP, (i32)imm));
-    return;
-  }
-  rv64_emit_load_imm(mc, 1, RV_T0, imm);
-  rv64_emit32(mc, rv_add(RV_SP, RV_SP, RV_T0));
-}
-
-/* ---- function lifecycle ---- */
-
-typedef struct RvFrameLayout RvFrameLayout;
-static void rv_emit_cfi_frame(CGTarget* t, u32 post_prologue_off,
-                              const RvFrameLayout* fl, const u32* int_regs,
-                              u32 n_int_saves, const u32* fp_regs,
-                              u32 n_fp_saves, int omit_frame);
-
-struct RvFrameLayout {
-  u32 max_out;
-  u32 fp_saves_sz;
-  u32 fp_pair_off;
-  u32 frame_size;
-  i32 fp_save_base;
-  i32 int_save_base;
-};
-
-static void rv_func_begin_init(CGTarget* t, const CGFuncDesc* fd) {
-  RImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-
-  mc->set_section(mc, fd->text_section_id);
-  mc->emit_align(mc, 4, 0);
-
-  a->fd = fd;
-  a->func_start = mc->pos(mc);
-  mc_begin_function(mc, fd->sym, fd->text_section_id, a->func_start);
-  a->next_param_int = 0;
-  a->next_param_fp = 0;
-  a->next_param_stack = 0;
-  a->has_sret = (fd->abi && fd->abi->has_sret) ? 1 : 0;
-  a->is_variadic = (fd->abi && fd->abi->variadic) ? 1 : 0;
-  a->known_frame = 0;
-  a->omit_frame = 0;
-  a->cum_off = 0;
-  a->max_outgoing = 0;
-  a->fp_pair_off = 0;
-  a->used_cs_int_mask = a->has_planned_regs ? a->planned_cs_int_mask : 0;
-  a->used_cs_fp_mask = a->has_planned_regs ? a->planned_cs_fp_mask : 0;
-  a->prologue_words =
-      a->has_planned_regs ? rv_planned_prologue_words(a) : RV_PROLOGUE_WORDS;
-  a->post_prologue_off = 0;
-  a->planned_cs_int_mask = 0;
-  a->planned_cs_fp_mask = 0;
-  a->has_planned_regs = 0;
-  a->nslots = 0;
-  a->nscopes = 0;
-  a->has_alloca = 0;
-  a->nadd_patches = 0;
-  a->gp_save_slot = FRAME_SLOT_NONE;
-  a->sret_ptr_slot = FRAME_SLOT_NONE;
-  a->epilogue_label = mc->label_new(mc);
-
-  mc->cfi_startproc(mc);
-}
-
-static void rv_add_entry_frame_slots(CGTarget* t) {
-  RImpl* a = impl_of(t);
-
-  /* For an sret return, the caller passed the destination pointer in
-   * a0; reserve a hidden slot to spill it into so the body can use a0
-   * freely. The actual SD a0, ...(s0) is emitted in the patched
-   * prologue once the slot offset is known. */
-  if (a->has_sret) {
-    FrameSlotDesc fsd = {
-        .type = CFREE_CG_TYPE_NONE,
-        .name = 0,
-        .loc = (SrcLoc){0, 0, 0},
-        .size = 8,
-        .align = 8,
-        .kind = FS_SPILL,
-        .flags = 0,
-    };
-    a->sret_ptr_slot = rv_frame_slot(t, &fsd);
-    /* Consume a0 — it is no longer available for the first real param. */
-    a->next_param_int = 1;
-  }
-
-  /* Variadic: a 64-byte GP save area for a0..a7 lives at the very top
-   * of the frame, immediately above the saved-s0/ra pair, so its bytes
-   * are contiguous with the caller's stack args. The patcher spills the
-   * unnamed a-regs into it as part of the prologue. The slot is implicit
-   * (not allocated through rv_frame_slot) — it sits at [s0 + 16] when
-   * is_variadic is set. */
-}
-
-static void rv_compute_frame(const RImpl* a, u32 n_int_saves, u32 n_fp_saves,
-                             RvFrameLayout* fl) {
-  fl->max_out = (a->max_outgoing + 15u) & ~15u;
-  u32 int_saves_sz = n_int_saves * 8u;
-  fl->fp_saves_sz = n_fp_saves * 8u;
-
-  /* Variadic functions reserve a 64-byte save area at the very top of
-   * the frame so the save area and caller's stack args form a single
-   * contiguous byte stream walked by the va_list pointer. */
-  u32 va_save_sz = a->is_variadic ? 64u : 0u;
-  u32 locals_off = fl->max_out + int_saves_sz + fl->fp_saves_sz;
-  fl->fp_pair_off = locals_off + a->cum_off;
-  fl->frame_size = fl->fp_pair_off + 16u + va_save_sz;
-  fl->frame_size = (fl->frame_size + 15u) & ~15u;
-  fl->fp_pair_off = fl->frame_size - 16u - va_save_sz;
-
-  /* Save slots sit at the start of an 8-byte cell below the locals
-   * area. fp_save_base = offset of the first fp save (=-(L+8)); each
-   * subsequent save is 8 bytes lower. int saves start below the fp
-   * block. */
-  fl->fp_save_base = -(i32)a->cum_off - 8;
-  fl->int_save_base = fl->fp_save_base - (i32)fl->fp_saves_sz;
-}
-
-static u32 rv_variadic_first_saved_int(const CGFuncDesc* fd) {
-  u32 next_int = (fd->abi && fd->abi->has_sret) ? 1u : 0u;
-  u32 next_fp = 0;
-  for (u32 pidx = 0; pidx < fd->nparams; ++pidx) {
-    const ABIArgInfo* ai = fd->params[pidx].abi;
-    if (!ai || ai->kind == ABI_ARG_IGNORE) continue;
-    if (ai->kind == ABI_ARG_INDIRECT) {
-      if (next_int < 8) ++next_int;
-      continue;
-    }
-    for (u16 i = 0; i < ai->nparts; ++i) {
-      const ABIArgPart* pt = &ai->parts[i];
-      if (pt->cls == ABI_CLASS_INT) {
-        if (next_int < 8) ++next_int;
-      } else if (pt->cls == ABI_CLASS_FP) {
-        if (next_fp < 8) ++next_fp;
-      }
-    }
-  }
-  return next_int;
-}
-
-static void rv_words_addr_adjust(CGTarget* t, u32* words, u32 cap, u32* wi,
-                                 u32 rd, u32 base, i32 off) {
-  if (off == 0) {
-    if (rd != base) {
-      if (*wi >= cap) goto overflow;
-      words[(*wi)++] = rv_addi(rd, base, 0);
-    }
-    return;
-  }
-  if (off >= -2048 && off <= 2047) {
-    if (*wi >= cap) goto overflow;
-    words[(*wi)++] = rv_addi(rd, base, off);
-    return;
-  }
-  i32 hi = (i32)(((i64)off + 0x800) >> 12);
-  i32 lo = off - (hi << 12);
-  if (*wi >= cap) goto overflow;
-  words[(*wi)++] = rv_lui(rd, (u32)hi & 0xfffffu);
-  if (lo) {
-    if (*wi >= cap) goto overflow;
-    words[(*wi)++] = rv_addiw(rd, rd, lo);
-  }
-  if (*wi >= cap) goto overflow;
-  words[(*wi)++] = rv_add(rd, base, rd);
-  return;
-
-overflow:
-  compiler_panic(t->c, impl_of(t)->loc,
-                 "rv64: prologue placeholder too small (cap %u)", cap);
-}
-
-static void rv_words_store_int_s0(CGTarget* t, u32* words, u32 cap, u32* wi,
-                                  u32 reg, i32 off) {
-  if (off >= -2048 && off <= 2047) {
-    if (*wi >= cap) goto overflow;
-    words[(*wi)++] = rv_sd(reg, RV_S0, off);
-    return;
-  }
-  rv_words_addr_adjust(t, words, cap, wi, RV_T0, RV_S0, off);
-  if (*wi >= cap) goto overflow;
-  words[(*wi)++] = rv_sd(reg, RV_T0, 0);
-  return;
-
-overflow:
-  compiler_panic(t->c, impl_of(t)->loc,
-                 "rv64: prologue placeholder too small (cap %u)", cap);
-}
-
-static void rv_words_store_fp_s0(CGTarget* t, u32* words, u32 cap, u32* wi,
-                                 u32 reg, i32 off) {
-  if (off >= -2048 && off <= 2047) {
-    if (*wi >= cap) goto overflow;
-    words[(*wi)++] = rv_fsd(reg, RV_S0, off);
-    return;
-  }
-  rv_words_addr_adjust(t, words, cap, wi, RV_T0, RV_S0, off);
-  if (*wi >= cap) goto overflow;
-  words[(*wi)++] = rv_fsd(reg, RV_T0, 0);
-  return;
-
-overflow:
-  compiler_panic(t->c, impl_of(t)->loc,
-                 "rv64: prologue placeholder too small (cap %u)", cap);
-}
-
-static u32 rv_build_prologue(CGTarget* t, u32* words, u32 cap,
-                             const RvFrameLayout* fl, const u32* int_regs,
-                             u32 n_int_saves, const u32* fp_regs,
-                             u32 n_fp_saves, u32 variadic_first_int) {
-  RImpl* a = impl_of(t);
-  u32 wi = 0;
-
-  /* addi sp, sp, -frame_size  (or multi-insn if too large) */
-  if ((i64)fl->frame_size <= 2048) {
-    if (wi >= cap) goto overflow;
-    words[wi++] = rv_addi(RV_SP, RV_SP, -(i32)fl->frame_size);
-  } else {
-    i64 neg = -(i64)fl->frame_size;
-    if (!fits_signed32(neg))
-      compiler_panic(t->c, a->loc, "rv64: frame_size too large to patch");
-    i32 hi = (i32)((u32)((i32)neg + 0x800) >> 12);
-    i32 lo = (i32)neg - (hi << 12);
-    if (wi >= cap) goto overflow;
-    words[wi++] = rv_lui(RV_T0, (u32)hi & 0xfffffu);
-    if (lo) {
-      if (wi >= cap) goto overflow;
-      words[wi++] = rv_addiw(RV_T0, RV_T0, lo);
-    }
-    if (wi >= cap) goto overflow;
-    words[wi++] = rv_add(RV_SP, RV_SP, RV_T0);
-  }
-
-  if ((i32)fl->fp_pair_off <= 2039) {
-    if (wi + 3 > cap) goto overflow;
-    words[wi++] = rv_sd(RV_S0, RV_SP, (i32)fl->fp_pair_off);
-    words[wi++] = rv_sd(RV_RA, RV_SP, (i32)fl->fp_pair_off + 8);
-    words[wi++] = rv_addi(RV_S0, RV_SP, (i32)fl->fp_pair_off);
-  } else {
-    i32 off = (i32)fl->fp_pair_off;
-    i32 hi = (i32)(((i64)off + 0x800) >> 12);
-    i32 lo = off - (hi << 12);
-    if (fl->fp_pair_off > 0x7fffffffu)
-      compiler_panic(t->c, a->loc, "rv64: fp_pair_off too large");
-    if (wi + 6 > cap) goto overflow;
-    words[wi++] = rv_lui(RV_T0, (u32)hi & 0xfffffu);
-    if (lo) words[wi++] = rv_addiw(RV_T0, RV_T0, lo);
-    words[wi++] = rv_add(RV_T0, RV_SP, RV_T0);
-    words[wi++] = rv_sd(RV_S0, RV_T0, 0);
-    words[wi++] = rv_sd(RV_RA, RV_T0, 8);
-    words[wi++] = rv_addi(RV_S0, RV_T0, 0);
-  }
-
-  /* If sret, spill incoming a0 into the hidden slot. */
-  if (a->has_sret && a->sret_ptr_slot != FRAME_SLOT_NONE) {
-    RvSlot* s = rv64_slot_get(a, a->sret_ptr_slot);
-    if (s) {
-      if (wi >= cap) goto overflow;
-      words[wi++] = rv_sd(RV_A0, RV_S0, -(i32)s->off);
-    }
-  }
-  /* Variadic: spill the still-unconsumed a-regs into the save area. */
-  if (a->is_variadic) {
-    for (u32 i = variadic_first_int; i < 8; ++i) {
-      if (wi >= cap) goto overflow;
-      words[wi++] = rv_sd(RV_A0 + i, RV_S0, 16 + (i32)i * 8);
-    }
-  }
-  for (u32 i = 0; i < n_int_saves; ++i) {
-    u32 r = int_regs[i];
-    i32 off = fl->int_save_base - 8 * (i32)i;
-    rv_words_store_int_s0(t, words, cap, &wi, r, off);
-  }
-  for (u32 i = 0; i < n_fp_saves; ++i) {
-    u32 r = fp_regs[i];
-    i32 off = fl->fp_save_base - 8 * (i32)i;
-    rv_words_store_fp_s0(t, words, cap, &wi, r, off);
-  }
-  return wi;
-
-overflow:
-  compiler_panic(t->c, a->loc, "rv64: prologue placeholder too small (cap %u)",
-                 cap);
-  return 0;
-}
-
-void rv_func_begin(CGTarget* t, const CGFuncDesc* fd) {
-  RImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-
-  rv_func_begin_init(t, fd);
-
-  /* Reserve a NOP-filled prologue placeholder; func_end patches it. */
-  a->prologue_pos = mc->pos(mc);
-  for (u32 i = 0; i < a->prologue_words; ++i) rv64_emit32(mc, RV_NOP);
-
-  rv_add_entry_frame_slots(t);
-  /* Capture end-of-prologue position for CFI emission in func_end. */
-  a->post_prologue_off = mc->pos(mc) - a->func_start;
-}
-
-void rv_func_begin_known_frame(CGTarget* t, const CGFuncDesc* fd,
-                               const CGKnownFrameDesc* frame,
-                               FrameSlot* out_slots) {
-  RImpl* a = impl_of(t);
-  u32 int_regs[10];
-  u32 fp_regs[10];
-  u32 words[RV_PROLOGUE_WORDS];
-  RvFrameLayout fl;
-
-  rv_func_begin_init(t, fd);
-  a->known_frame = 1;
-  rv_add_entry_frame_slots(t);
-  for (u32 i = 0; frame && i < frame->nslots; ++i) {
-    FrameSlot fs = rv_frame_slot(t, &frame->slots[i]);
-    if (out_slots) out_slots[i] = fs;
-  }
-  if (frame) {
-    a->max_outgoing = frame->max_outgoing;
-    a->has_alloca = frame->has_alloca ? 1u : 0u;
-  }
-
-  u32 n_int_saves = collect_mask_regs(a->used_cs_int_mask, 18u, 27u, int_regs);
-  u32 n_fp_saves = collect_mask_regs(a->used_cs_fp_mask, 18u, 27u, fp_regs);
-  if (frame && frame->may_omit_frame && frame->nslots == 0 &&
-      frame->max_outgoing == 0 && !frame->has_alloca && !frame->has_call &&
-      !a->has_sret && !a->is_variadic && n_int_saves == 0 && n_fp_saves == 0) {
-    a->omit_frame = 1;
-    return;
-  }
-  rv_compute_frame(a, n_int_saves, n_fp_saves, &fl);
-  a->fp_pair_off = fl.fp_pair_off;
-  a->prologue_pos = t->mc->pos(t->mc);
-  u32 nwords =
-      rv_build_prologue(t, words, RV_PROLOGUE_WORDS, &fl, int_regs, n_int_saves,
-                        fp_regs, n_fp_saves, rv_variadic_first_saved_int(fd));
-  for (u32 i = 0; i < nwords; ++i) rv64_emit32(t->mc, words[i]);
-  {
-    u32 post = t->mc->pos(t->mc) - a->func_start;
-    rv_emit_cfi_frame(t, post, &fl, int_regs, n_int_saves, fp_regs, n_fp_saves,
-                      /*omit_frame=*/0);
-  }
-}
-
-/* CFI for the post-prologue state of an RV64 frame.
- *   s0 (x8) = sp + fp_pair_off; pre-call sp = s0 + (frame_size - fp_pair_off)
- *   ⇒ CFA = s0 + (frame_size - fp_pair_off)
- *   saved caller-s0 at [s0+0] = CFA - (frame_size - fp_pair_off)
- *   saved ra      at [s0+8] = saved-s0 offset + 8
- *   each callee-save at s0-relative offsets recorded in RvFrameLayout
- */
-static void rv_emit_cfi_frame(CGTarget* t, u32 post_prologue_off,
-                              const RvFrameLayout* fl, const u32* int_regs,
-                              u32 n_int_saves, const u32* fp_regs,
-                              u32 n_fp_saves, int omit_frame) {
-  MCEmitter* mc = t->mc;
-  i32 fp_dist;
-  if (omit_frame) return;
-  fp_dist = (i32)fl->frame_size - (i32)fl->fp_pair_off;
-  mc->cfi_set_next_pc_offset(mc, post_prologue_off);
-  mc->cfi_def_cfa(mc, 8u, fp_dist);
-  mc->cfi_offset(mc, 8u, -fp_dist);     /* saved s0 at [s0+0] */
-  mc->cfi_offset(mc, 1u, -fp_dist + 8); /* saved ra at [s0+8] */
-  {
-    u32 i;
-    for (i = 0; i < n_int_saves; ++i) {
-      i32 slot = fl->int_save_base - 8 * (i32)i;
-      i32 cfa_off = slot - fp_dist;
-      mc->cfi_offset(mc, int_regs[i], cfa_off);
-    }
-    for (i = 0; i < n_fp_saves; ++i) {
-      i32 slot = fl->fp_save_base - 8 * (i32)i;
-      i32 cfa_off = slot - fp_dist;
-      /* DWARF FP regs: f0..f31 → 32..63 */
-      mc->cfi_offset(mc, 32u + fp_regs[i], cfa_off);
-    }
-  }
-}
-
-void rv_func_end(CGTarget* t) {
-  RImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-  ObjBuilder* obj = t->obj;
-  u32 sec = a->fd->text_section_id;
-
-  u32 int_regs[10];
-  u32 fp_regs[10];
-  u32 n_int_saves = collect_mask_regs(a->used_cs_int_mask, 18u, 27u, int_regs);
-  u32 n_fp_saves = collect_mask_regs(a->used_cs_fp_mask, 18u, 27u, fp_regs);
-  RvFrameLayout fl;
-  rv_compute_frame(a, n_int_saves, n_fp_saves, &fl);
-  a->fp_pair_off = fl.fp_pair_off;
-
-  if (!a->known_frame) {
-    rv_emit_cfi_frame(t, a->post_prologue_off, &fl, int_regs, n_int_saves,
-                      fp_regs, n_fp_saves, /*omit_frame=*/a->omit_frame);
-  }
-
-  if (a->omit_frame) goto finish;
-
-  /* Place the epilogue label at current pos. */
-  mc->label_place(mc, a->epilogue_label);
-
-  /* Restore int and fp saves using s0-relative addressing so they
-   * don't depend on the final frame_size encoding (and survive
-   * alloca-induced sp shifts). */
-  /* layout below s0:
-   *   s0 - 8 .. s0 - 16  saved s0/ra ?  No — those are at sp+fp_pair_off
-   *   We arranged saved-s0/ra at [sp+fp_pair_off], not below s0. So
-   *   immediately below s0 are: int saves, then fp saves, then locals.
-   *   Wait — let me recompute.
-   *
-   *   sp + 0           outgoing args (max_out bytes)
-   *   sp + max_out     int saves
-   *   sp + max_out + I fp saves
-   *   sp + max_out+I+F locals (cum_off)
-   *   sp + fp_pair_off saved s0_caller (8)
-   *   sp + fp_pair_off+8 saved ra (8)
-   *   sp + frame_size  end
-   *
-   *   s0 = sp + fp_pair_off  (so [s0+0] = saved s0_caller).
-   *   Locals at [s0 - off] where off in [1..cum_off].
-   *   FP saves at [s0 - cum_off - 8*i].
-   *   Int saves at [s0 - cum_off - F - 8*i]. */
-  /* Save slots sit at the start of an 8-byte cell below the locals
-   * area. fp_save_base = offset of the first fp save (=-(L+8)); each
-   * subsequent save is 8 bytes lower. int saves start below the fp
-   * block. */
-  /* Reverse order: ints first (lowest address) on restore, but we emit
-   * the restore loop in reverse to keep the prologue/epilogue symmetric. */
-  for (i32 i = (i32)n_int_saves - 1; i >= 0; --i) {
-    u32 r = int_regs[i];
-    i32 off = fl.int_save_base - 8 * (i32)i;
-    if (off >= -2048 && off <= 2047) {
-      rv64_emit32(mc, rv_ld(r, RV_S0, off));
-    } else {
-      rv64_emit_addr_adjust(mc, RV_T0, RV_S0, off);
-      rv64_emit32(mc, rv_ld(r, RV_T0, 0));
-    }
-  }
-  for (i32 i = (i32)n_fp_saves - 1; i >= 0; --i) {
-    u32 r = fp_regs[i];
-    i32 off = fl.fp_save_base - 8 * (i32)i;
-    if (off >= -2048 && off <= 2047) {
-      rv64_emit32(mc, rv_fld(r, RV_S0, off));
-    } else {
-      rv64_emit_addr_adjust(mc, RV_T0, RV_S0, off);
-      rv64_emit32(mc, rv_fld(r, RV_T0, 0));
-    }
-  }
-  /* Restore sp from s0 first so alloca-induced offsets don't matter.
-   * After this, sp == its post-prologue value. */
-  if (a->has_alloca) {
-    rv64_emit_addr_adjust(mc, RV_SP, RV_S0, -(i32)fl.fp_pair_off);
-  }
-  rv64_emit32(mc, rv_ld(RV_RA, RV_S0, 8));
-  rv64_emit32(mc, rv_ld(RV_S0, RV_S0, 0));
-  emit_sp_addi(mc, (i64)fl.frame_size);
-  rv64_emit32(mc, rv_ret_());
-
-  if (!a->known_frame) {
-    u32 pos = a->prologue_pos;
-    u32 words[RV_PROLOGUE_WORDS];
-    u32 prologue_words =
-        a->prologue_words ? a->prologue_words : RV_PROLOGUE_WORDS;
-    for (u32 i = 0; i < prologue_words; ++i) words[i] = RV_NOP;
-    (void)rv_build_prologue(t, words, prologue_words, &fl, int_regs,
-                            n_int_saves, fp_regs, n_fp_saves,
-                            a->next_param_int);
-    for (u32 i = 0; i < prologue_words; ++i)
-      rv64_patch32(obj, sec, pos + i * 4u, words[i]);
-  }
-
-  /* Patch alloca placeholders with max_outgoing. */
-  if (fl.max_out > 2047u) {
-    compiler_panic(t->c, a->loc,
-                   "rv64: max_outgoing %u out of imm12 for alloca patch",
-                   fl.max_out);
-  }
-  for (u32 i = 0; i < a->nadd_patches; ++i) {
-    u32 dr = a->add_patches[i].dst_reg;
-    u32 word = rv_addi(dr, RV_SP, (i32)fl.max_out);
-    rv64_patch32(obj, sec, a->add_patches[i].pos, word);
-  }
-
-finish:;
-  /* Define the function symbol. */
-  u32 end = mc->pos(mc);
-  obj_symbol_define(obj, a->fd->sym, sec, (u64)a->func_start,
-                    (u64)(end - a->func_start));
-  if (a->fd->atomize) {
-    obj_atom_define(obj, sec, a->func_start, end - a->func_start, a->fd->sym,
-                    0);
-  }
-  if (t->debug) debug_func_pc_range(t->debug, sec, a->func_start, end);
-
-  mc->cfi_endproc(mc);
-  mc_end_function(mc);
-  a->fd = NULL;
-}
diff --git a/src/arch/rv64/internal.h b/src/arch/rv64/internal.h
@@ -1,189 +0,0 @@
-/* src/arch/rv64/internal.h — private header shared by emit.c, alloc.c, ops.c.
- * Do not include from outside src/arch/rv64/. */
-#pragma once
-
-#include <string.h>
-
-#include "arch/mc.h"
-#include "arch/rv64/isa.h"
-#include "arch/rv64/rv64.h"
-#include "core/arena.h"
-#include "core/pool.h"
-#include "obj/obj.h"
-
-#define RV_PROLOGUE_WORDS 128u
-#define RV_PROLOGUE_FRAME_WORDS \
-  10u /* sp adjust + far/near s0/ra save + set s0 */
-
-/* ---- RvSlot / RvScope ---- */
-typedef struct RvSlot {
-  u32 off; /* bytes below s0 (positive); address = s0 - off */
-  u32 size;
-  u32 align;
-  u8 kind;
-  u8 pad[3];
-} RvSlot;
-
-typedef struct RvScope {
-  u8 kind;
-  u8 has_else;
-  u8 pad[2];
-  MCLabel else_label;
-  MCLabel end_label;
-  Label break_label;
-  Label continue_label;
-} RvScope;
-
-/* ---- RImpl ---- */
-typedef struct RImpl {
-  CGTarget base;
-  SrcLoc loc;
-  const CGFuncDesc* fd;
-
-  u32 func_start;
-  u32 prologue_pos;
-  u32 prologue_words;
-  u32 post_prologue_off; /* end-of-prologue offset within function, for CFI */
-  MCLabel epilogue_label;
-
-  RvSlot* slots;
-  u32 nslots;
-  u32 slots_cap;
-  u32 cum_off;
-  u32 max_outgoing;
-  u32 fp_pair_off;
-
-  u32 next_param_int;
-  u32 next_param_fp;
-  u32 next_param_stack;
-  u8 has_sret;
-  u8 known_frame;
-  u8 omit_frame;
-  u8 pad0;
-  FrameSlot sret_ptr_slot;
-
-  u32 used_cs_int_mask; /* bit reg set for s2-s11 */
-  u32 used_cs_fp_mask;  /* bit reg set for fs2-fs11 */
-  u32 planned_cs_int_mask;
-  u32 planned_cs_fp_mask;
-  u8 has_planned_regs;
-  u8 pad1[3];
-
-  RvScope* scopes;
-  u32 nscopes;
-  u32 scopes_cap;
-
-  u8 has_alloca;
-  struct RvAllocaPatch {
-    u32 pos;
-    u32 dst_reg;
-  }* add_patches;
-  u32 nadd_patches;
-  u32 add_patches_cap;
-
-  u8 is_variadic;
-  FrameSlot gp_save_slot;
-} RImpl;
-
-/* ---- impl_of ---- */
-static inline RImpl* impl_of(CGTarget* t) { return (RImpl*)t; }
-
-/* ---- type helpers ---- */
-#define CG_BUILTIN_ID(k) ((CfreeCgTypeId)((1u << 6) | (u32)(k)))
-static inline int type_is_64(CfreeCgTypeId t) {
-  return t == CG_BUILTIN_ID(CFREE_CG_BUILTIN_I64) ||
-         t == CG_BUILTIN_ID(CFREE_CG_BUILTIN_F64) ||
-         t >= (CfreeCgTypeId)(2u << 6);
-}
-static inline int type_is_fp_double(CfreeCgTypeId t) {
-  return t == CG_BUILTIN_ID(CFREE_CG_BUILTIN_F64);
-}
-static inline u32 type_byte_size(CfreeCgTypeId t) {
-  if (t == CG_BUILTIN_ID(CFREE_CG_BUILTIN_I8) ||
-      t == CG_BUILTIN_ID(CFREE_CG_BUILTIN_BOOL))
-    return 1;
-  if (t == CG_BUILTIN_ID(CFREE_CG_BUILTIN_I16)) return 2;
-  if (t == CG_BUILTIN_ID(CFREE_CG_BUILTIN_I32) ||
-      t == CG_BUILTIN_ID(CFREE_CG_BUILTIN_F32))
-    return 4;
-  if (t == CG_BUILTIN_ID(CFREE_CG_BUILTIN_F128)) return 16;
-  return 8;
-}
-static inline int type_is_signed(CfreeCgTypeId t) {
-  (void)t;
-  return 0;
-}
-
-static inline u32 reg_num(Operand op) { return op.v.reg & 0x1fu; }
-
-/* ---- emit.c: function lifecycle (referenced by ops.c vtable) ---- */
-void rv_func_begin(CGTarget* t, const CGFuncDesc* fd);
-void rv_func_begin_known_frame(CGTarget* t, const CGFuncDesc* fd,
-                               const CGKnownFrameDesc* frame,
-                               FrameSlot* out_slots);
-void rv_func_end(CGTarget* t);
-
-void rv_coord_vtable_init(CGTarget* t);
-
-/* ---- emit helpers (defined in emit.c, used cross-file) ---- */
-extern void debug_emit_row(Debug*, ObjSecId text_section, u32 offset, SrcLoc);
-extern void debug_func_pc_range(Debug*, ObjSecId text_section, u32 begin_ofs,
-                                u32 end_ofs);
-
-void rv64_emit32(MCEmitter* mc, u32 word);
-void rv64_emit16(MCEmitter* mc, u32 halfword);
-void rv64_patch32(ObjBuilder* obj, u32 sec_id, u32 ofs, u32 word);
-int fits_signed32(i64 v);
-void emit_li_32(MCEmitter* mc, u32 rd, i32 imm);
-void rv64_emit_load_imm(MCEmitter* mc, u32 sf, u32 rd, i64 imm);
-void emit_sp_addi(MCEmitter* mc, i64 imm);
-_Noreturn void rv_panic(CGTarget* t, const char* what);
-
-/* ---- alloc.c: all functions (non-static; referenced by ops.c vtable) ---- */
-FrameSlot rv_frame_slot(CGTarget* t, const FrameSlotDesc* d);
-RvSlot* rv64_slot_get(RImpl* a, FrameSlot fs);
-CGLocalStorage rv_param(CGTarget* t, const CGParamDesc* p);
-void rv_spill_reg(CGTarget* t, Operand src, FrameSlot slot, MemAccess ma);
-void rv_reload_reg(CGTarget* t, Operand dst, FrameSlot slot, MemAccess ma);
-Label rv_label_new(CGTarget* t);
-void rv_label_place(CGTarget* t, Label l);
-void rv_jump(CGTarget* t, Label l);
-void rv_load_label_addr(CGTarget* t, Operand dst, Label l);
-void rv_indirect_branch(CGTarget* t, Operand addr, const Label* targets,
-                        u32 ntargets);
-u32 rv64_force_reg_int(CGTarget* t, Operand op, u32 scratch);
-void rv_cmp_branch(CGTarget* t, CmpOp op, Operand a_op, Operand b_op, Label l);
-void rv_cmp(CGTarget* t, CmpOp op, Operand dst, Operand a_op, Operand b_op);
-CGScope rv_scope_begin(CGTarget* t, const CGScopeDesc* d);
-void rv_scope_else(CGTarget* t, CGScope s);
-void rv_scope_end(CGTarget* t, CGScope s);
-void rv_break_to(CGTarget* t, CGScope s);
-void rv_continue_to(CGTarget* t, CGScope s);
-
-/* ---- ops.c: functions used cross-file ---- */
-void rv_load(CGTarget* t, Operand dst, Operand addr, MemAccess ma);
-void rv_store(CGTarget* t, Operand addr, Operand src, MemAccess ma);
-u32 enc_int_store(u32 nbytes, u32 src, u32 base, i32 off);
-u32 enc_int_load(u32 nbytes, int sign_ext, u32 rd, u32 base, i32 off);
-
-/* Effective-address tuple returned by addr_mode: `base + (index << log2_scale)
- * + ofs`, where `index == REG_NONE` means no index operand. rv64 has no
- * indexed load/store instructions even with Zba, so load/store fold any
- * index into a scratch register up front via Zba `sh{1,2,3}add` (see
- * rv_fold_indexed in ops.c); other paths (atomics, spill/reload, ...)
- * assert that input OPK_INDIRECT operands already have `index == REG_NONE`. */
-typedef struct RvAddrMode {
-  u32 base;
-  u32 index;
-  u8 log2_scale;
-  i32 ofs;
-} RvAddrMode;
-
-RvAddrMode addr_mode(CGTarget* t, Operand addr, u32 tmp_reg);
-void rv64_emit_addr_adjust(MCEmitter* mc, u32 rd, u32 base, i32 off);
-ObjSymId emit_pcrel_anchor(CGTarget* t, u32 sec, u32 auipc_pos);
-void rv64_emit_got_load_addr(CGTarget* t, u32 dst_reg, ObjSymId sym);
-u32 agg_addr_reg(CGTarget* t, Operand op, u32 scratch);
-int rv64_use_got_for_sym(CGTarget* t, ObjSymId sym);
-int mem_order_is_acquire(MemOrder o);
-int mem_order_is_release(MemOrder o);
diff --git a/src/arch/rv64/native.c b/src/arch/rv64/native.c
@@ -0,0 +1,3458 @@
+/* src/arch/rv64/native.c — RISC-V (RV64GC, LP64D) NativeTarget implementation.
+ *
+ * Mirrors the aa64 reference (src/arch/aa64/native.c): a physical-emission
+ * NativeTarget driven at -O0 by the shared NativeDirectTarget and at -O1+ by
+ * the optimizer emit path. ABI decisions go through the abi/ interface; this
+ * file owns only ISA emission and the RV64 frame layout.
+ *
+ * Frame model (single, top-record): s0 (x8) is the frame pointer anchored at
+ * the saved s0/ra pair; slots live below s0 at positive byte offsets `off`
+ * (address = s0 - off); outgoing args sit at the bottom of the frame (sp+0..).
+ *   frame_size  = align16(16 + cum_off + max_outgoing + va_save_sz)
+ *   fp_pair_off = frame_size - 16 - va_save_sz   (saved pair, sp-relative)
+ *   CFA = s0 + (frame_size - fp_pair_off)
+ * RISC-V has no condition flags: comparisons materialize a 0/1 via SLT/SLTU or
+ * FLT/FLE; branches compare two registers directly. x0 is a hardware zero. */
+
+#include <string.h>
+
+#include "abi/abi.h"
+#include "arch/rv64/asm.h"
+#include "arch/rv64/isa.h"
+#include "asm/asm.h"
+#include "asm/asm_lex.h"
+#include "arch/rv64/regs.h"
+#include "arch/rv64/rv64.h"
+#include "cg/native_direct_target.h"
+#include "cg/type.h"
+#include "core/arena.h"
+#include "core/bytes.h"
+#include "core/pool.h"
+#include "core/slice.h"
+#include "obj/obj.h"
+
+enum {
+  RV_TMP0 = 5u,  /* t0: emit-internal scratch (reserved, never allocable) */
+  RV_TMP1 = 6u,  /* t1: emit-internal scratch */
+  RV_TMP2 = 7u,  /* t2: emit-internal scratch (reserved in phys table) */
+  RV_TMP3 = 28u, /* t3: emit-internal scratch (reserved in phys table) */
+  RV_FTMP0 = 0u, /* ft0: emit-internal FP scratch */
+  RV_FTMP1 = 1u, /* ft1: emit-internal FP scratch */
+  RV_FA0 = 10u,  /* fa0..fa7 = f10..f17 (FP arg/return registers) */
+  RV_FA7 = 17u,
+  /* Single-pass (-O0) worst-case prologue: sp adjust (3) + far save pair (7)
+   * + sret spill (1) + variadic GP spills (8). No callee-saves at -O0. */
+  RV_PROLOGUE_WORDS = 32u,
+  RV_FRAME_SAVE_SIZE = 16u,
+};
+
+#define RV_MAX_CALLEE_SAVES 22u /* s1..s11 (11) + fs0..fs11 (12) = 23; capped */
+#define RV_MAX_REG_ARG_MOVES 16u
+
+extern void debug_emit_row(Debug*, ObjSecId text_section, u32 offset, SrcLoc);
+extern void debug_func_pc_range(Debug*, ObjSecId text_section, u32 begin_ofs,
+                                u32 end_ofs);
+
+/* ============================ low-level emit ============================ */
+
+void rv64_emit32(MCEmitter* mc, u32 word) {
+  u8 b[4];
+  u32 ofs = obj_pos(mc->obj, mc->section_id);
+  wr_u32_le(b, word);
+  mc->emit_bytes(mc, b, sizeof b);
+  if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc);
+}
+
+void rv64_emit16(MCEmitter* mc, u32 halfword) {
+  u8 b[2];
+  u32 ofs = obj_pos(mc->obj, mc->section_id);
+  b[0] = (u8)(halfword & 0xff);
+  b[1] = (u8)((halfword >> 8) & 0xff);
+  mc->emit_bytes(mc, b, sizeof b);
+  if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc);
+}
+
+static void rv_patch32(ObjBuilder* obj, ObjSecId sec, u32 off, u32 word) {
+  u8 b[4];
+  wr_u32_le(b, word);
+  obj_patch(obj, sec, off, b, sizeof b);
+}
+
+static int fits_i12(i64 v) { return v >= -2048 && v <= 2047; }
+static int fits_i32(i64 v) {
+  return v >= (i64)(i32)0x80000000 && v <= (i64)(i32)0x7fffffff;
+}
+
+static u32 align_up_u32(u32 v, u32 align) {
+  u32 mask = align ? align - 1u : 0u;
+  return (v + mask) & ~mask;
+}
+
+static i64 floor_div_4096(i64 v) {
+  if (v >= 0) return v / 4096;
+  return -((-v + 4095) / 4096);
+}
+
+static void rv_emit_li32(MCEmitter* mc, u32 rd, i32 imm) {
+  if (imm >= -2048 && imm <= 2047) {
+    rv64_emit32(mc, rv_addi(rd, RV_ZERO, imm));
+    return;
+  }
+  {
+    i64 hi64 = floor_div_4096((i64)imm + 0x800);
+    i32 hi = (i32)hi64;
+    i32 lo = (i32)((i64)imm - hi64 * 4096);
+    rv64_emit32(mc, rv_lui(rd, (u32)hi & 0xfffffu));
+    if (lo) rv64_emit32(mc, rv_addiw(rd, rd, lo));
+  }
+}
+
+static i32 sext12(u32 v) {
+  v &= 0xfffu;
+  return (v & 0x800u) ? (i32)v - 4096 : (i32)v;
+}
+
+static void rv_emit_li64(MCEmitter* mc, u32 rd, u64 imm) {
+  if (fits_i32((i64)imm)) {
+    rv_emit_li32(mc, rd, (i32)(i64)imm);
+    return;
+  }
+  {
+    i32 lo = sext12((u32)imm);
+    u64 hi = (imm - (u64)(i64)lo) >> 12;
+    rv_emit_li64(mc, rd, hi);
+    rv64_emit32(mc, rv_slli(rd, rd, 12));
+    if (lo) rv64_emit32(mc, rv_addi(rd, rd, lo));
+  }
+}
+
+/* sf!=0 selects a full 64-bit materialization; sf==0 a 32-bit value. */
+static void rv_emit_load_imm(MCEmitter* mc, u32 sf, u32 rd, i64 imm) {
+  if (!sf) {
+    rv_emit_li32(mc, rd, (i32)imm);
+    return;
+  }
+  if (fits_i32(imm))
+    rv_emit_li32(mc, rd, (i32)imm);
+  else
+    rv_emit_li64(mc, rd, (u64)imm);
+}
+
+/* rd = base + off, materializing the offset when it exceeds imm12. Uses RV_TMP1
+ * as scratch for the wide path, so callers must keep RV_TMP1 free. */
+static void rv_emit_addr_adjust(MCEmitter* mc, u32 rd, u32 base, i32 off) {
+  if (off == 0) {
+    if (rd != base) rv64_emit32(mc, rv_addi(rd, base, 0));
+    return;
+  }
+  if (fits_i12(off)) {
+    rv64_emit32(mc, rv_addi(rd, base, off));
+    return;
+  }
+  rv_emit_load_imm(mc, 1, RV_TMP1, (i64)off);
+  rv64_emit32(mc, rv_add(rd, base, RV_TMP1));
+}
+
+static u32 enc_int_store(u32 nbytes, u32 src, u32 base, i32 off) {
+  switch (nbytes) {
+    case 1: return rv_sb(src, base, off);
+    case 2: return rv_sh(src, base, off);
+    case 4: return rv_sw(src, base, off);
+    default: return rv_sd(src, base, off);
+  }
+}
+static u32 enc_int_load(u32 nbytes, int sign_ext, u32 rd, u32 base, i32 off) {
+  switch (nbytes) {
+    case 1: return sign_ext ? rv_lb(rd, base, off) : rv_lbu(rd, base, off);
+    case 2: return sign_ext ? rv_lh(rd, base, off) : rv_lhu(rd, base, off);
+    case 4: return sign_ext ? rv_lw(rd, base, off) : rv_lwu(rd, base, off);
+    default: return rv_ld(rd, base, off);
+  }
+}
+
+/* ============================ target state ============================ */
+
+typedef struct RvNativeSlot {
+  u32 off; /* bytes below s0 (positive); address = s0 - off */
+  u32 size;
+  u32 align;
+  u8 kind; /* NativeFrameSlotKind */
+  u8 pad[3];
+} RvNativeSlot;
+
+typedef struct RvCalleeSave {
+  NativeFrameSlot slot;
+  CfreeCgTypeId type;
+  u8 cls; /* NativeAllocClass */
+  Reg reg;
+} RvCalleeSave;
+
+typedef enum RvPatchKind { RV_PATCH_ALLOCA } RvPatchKind;
+
+typedef struct RvPatch {
+  u8 kind; /* RvPatchKind */
+  u32 pos;
+  u32 dst_reg;
+} RvPatch;
+
+typedef struct RvNativeTarget {
+  NativeTarget base;
+  SrcLoc loc;
+  const CGFuncDesc* func;
+
+  RvNativeSlot* slots;
+  u32 nslots;
+  u32 slots_cap;
+  u32 cum_off;      /* sum of frame-slot reservations below s0 */
+  u32 max_outgoing; /* max outgoing-arg bytes across all calls */
+  u32 frame_size_final;
+  u32 fp_pair_off;
+
+  u32 incoming_stack_size; /* fixed-param stack bytes (tail-call check) */
+  u32 next_param_int;
+  u32 next_param_fp;
+  u32 next_param_stack;
+  u8 has_sret;
+  u8 is_variadic;
+  NativeFrameSlot sret_ptr_slot;
+
+  RvPatch* patches;
+  u32 npatches;
+  u32 patches_cap;
+  u32 nalloca;
+
+  u32 func_start;
+  u32 prologue_pos;
+  MCLabel epilogue_label;
+
+  RvCalleeSave callee_saves[RV_MAX_CALLEE_SAVES];
+  u32 ncallee_saves;
+
+  u8 known_frame;
+  u8 has_alloca;
+  u8 frame_final;
+} RvNativeTarget;
+
+static RvNativeTarget* rv_of(NativeTarget* t) { return (RvNativeTarget*)t; }
+
+static _Noreturn void rv_panic(RvNativeTarget* a, const char* msg) {
+  compiler_panic(a->base.c, a->loc, "rv64 native target: %s", msg);
+}
+
+static RvNativeSlot* rv_slot_get(RvNativeTarget* a, NativeFrameSlot fs) {
+  if (fs == NATIVE_FRAME_SLOT_NONE || fs > a->nslots)
+    rv_panic(a, "bad frame slot");
+  return &a->slots[fs - 1u];
+}
+
+/* s0-relative byte offset of a frame slot's base (address = s0 + ret). */
+static i32 rv_s0_off_slot(const RvNativeSlot* s) { return -(i32)s->off; }
+
+/* s0-relative byte offset of the incoming stack arg at byte_off. Stack args
+ * sit just above the saved pair; in a variadic function the 64-byte GP save
+ * area occupies [s0+16, s0+80) and the stack args follow it contiguously. */
+static i32 rv_s0_off_in_arg(const RvNativeTarget* a, u32 byte_off) {
+  u32 base = a->is_variadic ? 16u + 64u : 16u;
+  return (i32)(base + byte_off);
+}
+
+static u32 rv_va_save_sz(const RvNativeTarget* a) {
+  return a->is_variadic ? 64u : 0u;
+}
+
+static u32 rv_frame_size(const RvNativeTarget* a) {
+  u32 raw = RV_FRAME_SAVE_SIZE + a->cum_off + a->max_outgoing + rv_va_save_sz(a);
+  return align_up_u32(raw, 16u);
+}
+
+static u32 rv_fp_pair_off(const RvNativeTarget* a, u32 frame_size) {
+  return frame_size - RV_FRAME_SAVE_SIZE - rv_va_save_sz(a);
+}
+
+/* ============================ type helpers ============================ */
+
+static u32 rv_type_size(NativeTarget* t, CfreeCgTypeId type) {
+  u64 n = type ? cg_type_size(t->c, type) : 8u;
+  if (n == 0) n = 8u;
+  return (u32)n;
+}
+
+static u32 rv_type_align(NativeTarget* t, CfreeCgTypeId type) {
+  u64 n = type ? cg_type_align(t->c, type) : 8u;
+  if (n == 0) n = 1u;
+  if (n > 16u) n = 16u;
+  return (u32)n;
+}
+
+/* A scalar value occupies a 64-bit register when it is pointer-sized or wider,
+ * else it is a 32-bit value (drives ADDW vs ADD selection etc). */
+static int rv_is_64(NativeTarget* t, CfreeCgTypeId type) {
+  return rv_type_size(t, type) >= 8u || cg_type_is_ptr(t->c, type);
+}
+
+static int loc_is_fp(NativeLoc loc) {
+  return (NativeAllocClass)loc.cls == NATIVE_REG_FP;
+}
+static u32 loc_reg(NativeLoc loc) { return loc.v.reg & 0x1fu; }
+
+static NativeAllocClass rv_class_for_type(NativeTarget* t, CfreeCgTypeId type) {
+  if (type && cg_type_is_float(t->c, type) && cg_type_size(t->c, type) <= 8u)
+    return NATIVE_REG_FP;
+  return NATIVE_REG_INT;
+}
+
+static MemAccess rv_mem_for_type(NativeTarget* t, CfreeCgTypeId type, u32 size) {
+  MemAccess m;
+  memset(&m, 0, sizeof m);
+  m.type = type;
+  m.size = size ? size : rv_type_size(t, type);
+  m.align = rv_type_align(t, type);
+  return m;
+}
+
+static NativeLoc rv_reg_loc(CfreeCgTypeId type, NativeAllocClass cls, Reg reg) {
+  NativeLoc loc;
+  memset(&loc, 0, sizeof loc);
+  loc.kind = NATIVE_LOC_REG;
+  loc.cls = (u8)cls;
+  loc.type = type;
+  loc.v.reg = reg;
+  return loc;
+}
+
+static NativeLoc rv_stack_loc(CfreeCgTypeId type, NativeFrameSlot slot,
+                              i32 offset) {
+  NativeLoc loc;
+  memset(&loc, 0, sizeof loc);
+  loc.kind = NATIVE_LOC_STACK;
+  loc.cls = NATIVE_REG_INT;
+  loc.type = type;
+  loc.v.stack.slot = slot;
+  loc.v.stack.offset = offset;
+  return loc;
+}
+
+/* ============================ register tables ============================ */
+
+#define RV_PHYS_INT_ARG(r, idx)                          \
+  {.reg = (r),                                           \
+   .cls = NATIVE_REG_INT,                                \
+   .abi_index = (idx),                                   \
+   .flags = NATIVE_REG_CALLER_SAVED | NATIVE_REG_ARG |   \
+            ((idx) < 2u ? NATIVE_REG_RET : 0),           \
+   .spill_cost = 1u,                                     \
+   .copy_cost = 1u}
+#define RV_PHYS_INT_CALLER(r)                                            \
+  {.reg = (r),                                                           \
+   .cls = NATIVE_REG_INT,                                                \
+   .abi_index = 0xffu,                                                   \
+   .flags = NATIVE_REG_ALLOCABLE | NATIVE_REG_CALLER_SAVED,              \
+   .spill_cost = 1u,                                                     \
+   .copy_cost = 1u}
+#define RV_PHYS_INT_CALLEE(r)                                            \
+  {.reg = (r),                                                           \
+   .cls = NATIVE_REG_INT,                                                \
+   .abi_index = 0xffu,                                                   \
+   .flags = NATIVE_REG_ALLOCABLE | NATIVE_REG_CALLEE_SAVED,              \
+   .spill_cost = 4u,                                                     \
+   .copy_cost = 1u}
+#define RV_PHYS_INT_RESERVED(r)  \
+  {.reg = (r),                   \
+   .cls = NATIVE_REG_INT,        \
+   .abi_index = 0xffu,           \
+   .flags = NATIVE_REG_RESERVED, \
+   .spill_cost = 0u,             \
+   .copy_cost = 0u}
+
+/* t0..t3 (x5,x6,x7,x28) are emit-internal scratch (RV_TMP0..RV_TMP3), reserved
+ * and never handed to the allocator or driver. t4/t5 are the driver scratch
+ * pool (disjoint from the emit temps so a hook can never clobber an operand the
+ * driver parked there). t6 is the lone caller-saved allocable (the -O0 cache's
+ * only caller-saved home); s1..s11 are appended callee-saved, chosen under
+ * pressure (and saved by the optimizer prologue at -O1). */
+static const Reg rv_int_allocable[] = {31u, 9u,  18u, 19u, 20u, 21u, 22u,
+                                       23u, 24u, 25u, 26u, 27u};
+static const Reg rv_int_scratch[] = {29u, 30u}; /* t4, t5 */
+
+static const NativePhysRegInfo rv_int_phys[] = {
+    RV_PHYS_INT_RESERVED(0u),   /* zero */
+    RV_PHYS_INT_RESERVED(1u),   /* ra */
+    RV_PHYS_INT_RESERVED(2u),   /* sp */
+    RV_PHYS_INT_RESERVED(3u),   /* gp */
+    RV_PHYS_INT_RESERVED(4u),   /* tp */
+    RV_PHYS_INT_RESERVED(5u),   /* t0 = TMP0 */
+    RV_PHYS_INT_RESERVED(6u),   /* t1 = TMP1 */
+    RV_PHYS_INT_RESERVED(7u),   /* t2 = TMP2 (emit) */
+    RV_PHYS_INT_RESERVED(8u),   /* s0/fp */
+    RV_PHYS_INT_CALLEE(9u),     /* s1 */
+    RV_PHYS_INT_ARG(10u, 0u),   RV_PHYS_INT_ARG(11u, 1u),
+    RV_PHYS_INT_ARG(12u, 2u),   RV_PHYS_INT_ARG(13u, 3u),
+    RV_PHYS_INT_ARG(14u, 4u),   RV_PHYS_INT_ARG(15u, 5u),
+    RV_PHYS_INT_ARG(16u, 6u),   RV_PHYS_INT_ARG(17u, 7u),
+    RV_PHYS_INT_CALLEE(18u),    RV_PHYS_INT_CALLEE(19u),
+    RV_PHYS_INT_CALLEE(20u),    RV_PHYS_INT_CALLEE(21u),
+    RV_PHYS_INT_CALLEE(22u),    RV_PHYS_INT_CALLEE(23u),
+    RV_PHYS_INT_CALLEE(24u),    RV_PHYS_INT_CALLEE(25u),
+    RV_PHYS_INT_CALLEE(26u),    RV_PHYS_INT_CALLEE(27u),
+    RV_PHYS_INT_RESERVED(28u),  /* t3 = TMP3 (emit) */
+    RV_PHYS_INT_RESERVED(29u),  /* t4 = driver scratch */
+    RV_PHYS_INT_RESERVED(30u),  /* t5 = driver scratch */
+    RV_PHYS_INT_CALLER(31u),    /* t6 = caller-saved allocable */
+};
+
+#define RV_PHYS_FP_ARG(r, idx)                           \
+  {.reg = (r),                                           \
+   .cls = NATIVE_REG_FP,                                 \
+   .abi_index = (idx),                                   \
+   .flags = NATIVE_REG_CALLER_SAVED | NATIVE_REG_ARG |   \
+            ((idx) < 2u ? NATIVE_REG_RET : 0),           \
+   .spill_cost = 1u,                                     \
+   .copy_cost = 1u}
+#define RV_PHYS_FP_CALLER(r)                                             \
+  {.reg = (r),                                                           \
+   .cls = NATIVE_REG_FP,                                                 \
+   .abi_index = 0xffu,                                                   \
+   .flags = NATIVE_REG_ALLOCABLE | NATIVE_REG_CALLER_SAVED,              \
+   .spill_cost = 1u,                                                     \
+   .copy_cost = 1u}
+#define RV_PHYS_FP_CALLEE(r)                                             \
+  {.reg = (r),                                                           \
+   .cls = NATIVE_REG_FP,                                                 \
+   .abi_index = 0xffu,                                                   \
+   .flags = NATIVE_REG_ALLOCABLE | NATIVE_REG_CALLEE_SAVED,              \
+   .spill_cost = 4u,                                                     \
+   .copy_cost = 1u}
+#define RV_PHYS_FP_RESERVED(r)   \
+  {.reg = (r),                   \
+   .cls = NATIVE_REG_FP,         \
+   .abi_index = 0xffu,           \
+   .flags = NATIVE_REG_RESERVED, \
+   .spill_cost = 0u,             \
+   .copy_cost = 0u}
+
+/* Caller-saved allocable first (ft4..ft7, ft8..ft11), then callee (fs0..fs11).
+ * ft0/ft1 reserved as emit-internal scratch; ft2/ft3 driver scratch. */
+static const Reg rv_fp_allocable[] = {4u,  5u,  6u,  7u,  28u, 29u, 30u, 31u,
+                                      8u,  9u,  18u, 19u, 20u, 21u, 22u, 23u,
+                                      24u, 25u, 26u, 27u};
+static const Reg rv_fp_scratch[] = {2u, 3u}; /* ft2, ft3 */
+
+static const NativePhysRegInfo rv_fp_phys[] = {
+    RV_PHYS_FP_RESERVED(0u),   /* ft0 = FTMP0 */
+    RV_PHYS_FP_RESERVED(1u),   /* ft1 = FTMP1 */
+    RV_PHYS_FP_RESERVED(2u),   /* ft2 = scratch */
+    RV_PHYS_FP_RESERVED(3u),   /* ft3 = scratch */
+    RV_PHYS_FP_CALLER(4u),     RV_PHYS_FP_CALLER(5u),
+    RV_PHYS_FP_CALLER(6u),     RV_PHYS_FP_CALLER(7u),
+    RV_PHYS_FP_CALLEE(8u),     RV_PHYS_FP_CALLEE(9u),
+    RV_PHYS_FP_ARG(10u, 0u),   RV_PHYS_FP_ARG(11u, 1u),
+    RV_PHYS_FP_ARG(12u, 2u),   RV_PHYS_FP_ARG(13u, 3u),
+    RV_PHYS_FP_ARG(14u, 4u),   RV_PHYS_FP_ARG(15u, 5u),
+    RV_PHYS_FP_ARG(16u, 6u),   RV_PHYS_FP_ARG(17u, 7u),
+    RV_PHYS_FP_CALLEE(18u),    RV_PHYS_FP_CALLEE(19u),
+    RV_PHYS_FP_CALLEE(20u),    RV_PHYS_FP_CALLEE(21u),
+    RV_PHYS_FP_CALLEE(22u),    RV_PHYS_FP_CALLEE(23u),
+    RV_PHYS_FP_CALLEE(24u),    RV_PHYS_FP_CALLEE(25u),
+    RV_PHYS_FP_CALLEE(26u),    RV_PHYS_FP_CALLEE(27u),
+    RV_PHYS_FP_CALLER(28u),    RV_PHYS_FP_CALLER(29u),
+    RV_PHYS_FP_CALLER(30u),    RV_PHYS_FP_CALLER(31u),
+};
+
+static const NativeAllocClassInfo rv_classes[] = {
+    {.cls = NATIVE_REG_INT,
+     .allocable = rv_int_allocable,
+     .nallocable = sizeof rv_int_allocable / sizeof rv_int_allocable[0],
+     .scratch = rv_int_scratch,
+     .nscratch = sizeof rv_int_scratch / sizeof rv_int_scratch[0],
+     .phys = rv_int_phys,
+     .nphys = sizeof rv_int_phys / sizeof rv_int_phys[0],
+     /* t0-t6 (5-7,28-31) + a0-a7 (10-17) */
+     .caller_saved_mask = 0xf00000e0u | 0x0003fc00u,
+     /* s0-s11 (8,9,18-27) */
+     .callee_saved_mask = 0x0ffc0300u,
+     .arg_mask = 0x0003fc00u,
+     .ret_mask = 0x00000c00u,
+     /* zero,ra,sp,gp,tp,t0,t1,t2,s0 (bits 0-8) + t3 (bit 28). t4/t5 are the
+      * driver scratch pool (reserved-from-alloc but listed in scratch[]). */
+     .reserved_mask = 0x000001ffu | (1u << 28)},
+    {.cls = NATIVE_REG_FP,
+     .allocable = rv_fp_allocable,
+     .nallocable = sizeof rv_fp_allocable / sizeof rv_fp_allocable[0],
+     .scratch = rv_fp_scratch,
+     .nscratch = sizeof rv_fp_scratch / sizeof rv_fp_scratch[0],
+     .phys = rv_fp_phys,
+     .nphys = sizeof rv_fp_phys / sizeof rv_fp_phys[0],
+     /* ft0-ft7 (0-7), fa0-fa7 (10-17), ft8-ft11 (28-31) */
+     .caller_saved_mask = 0xf00000ffu | 0x0003fc00u,
+     /* fs0-fs11 (8,9,18-27) */
+     .callee_saved_mask = 0x0ffc0300u,
+     .arg_mask = 0x0003fc00u,
+     .ret_mask = 0x00000c00u,
+     .reserved_mask = 0x0000000fu /* ft0-ft3 */},
+};
+
+static const NativeRegInfo rv_reg_info = {
+    .classes = rv_classes,
+    .nclasses = sizeof rv_classes / sizeof rv_classes[0],
+};
+
+/* ============================ legality ============================ */
+
+static int rv_imm_legal(NativeTarget* t, NativeImmUse use, u32 op,
+                        CfreeCgTypeId type, i64 imm) {
+  (void)t;
+  (void)type;
+  switch (use) {
+    case NATIVE_IMM_MOVE:
+      return 1;
+    case NATIVE_IMM_BINOP:
+      switch ((BinOp)op) {
+        case BO_IADD:
+          return fits_i12(imm);
+        case BO_ISUB:
+          return fits_i12(-imm); /* emitted as ADDI with negated imm */
+        case BO_AND:
+        case BO_OR:
+        case BO_XOR:
+          return fits_i12(imm);
+        case BO_SHL:
+        case BO_SHR_S:
+        case BO_SHR_U:
+          return imm >= 0 && imm <= 63;
+        default:
+          return 0;
+      }
+    case NATIVE_IMM_CMP:
+      return imm == 0; /* compares need both ends in registers (SLT/branch) */
+    case NATIVE_IMM_ADDR_OFFSET:
+      return fits_i12(imm);
+  }
+  return 0;
+}
+
+static int rv_addr_legal(NativeTarget* t, const NativeAddr* addr,
+                         MemAccess mem) {
+  (void)t;
+  (void)mem;
+  if (!addr) return 0;
+  if (addr->index_kind != NATIVE_ADDR_INDEX_NONE) return 0;
+  if (addr->base_kind != NATIVE_ADDR_BASE_REG &&
+      addr->base_kind != NATIVE_ADDR_BASE_FRAME)
+    return 0;
+  return fits_i12(addr->offset);
+}
+
+/* ============================ memory ============================ */
+
+/* Materialize the runtime address of a global into `dst`, including addend. */
+static void rv_emit_global_addr(RvNativeTarget* a, u32 dst, ObjSymId sym,
+                                i64 addend) {
+  NativeTarget* t = &a->base;
+  MCEmitter* mc = t->mc;
+  u32 sec = mc->section_id;
+  if (obj_symbol_extern_via_got(t->c, t->obj, sym)) {
+    u32 ap = mc->pos(mc);
+    rv64_emit32(mc, rv_auipc(dst, 0));
+    mc->emit_reloc_at(mc, sec, ap, R_RV_GOT_HI20, sym, 0, 0, 0);
+    {
+      Sym an = pool_intern_slice(t->c->global, SLICE_LIT(".LpcrelHi"));
+      ObjSymId anchor = obj_symbol(t->obj, an, SB_LOCAL, SK_OBJ, sec, (u64)ap, 0);
+      u32 lp = mc->pos(mc);
+      rv64_emit32(mc, rv_ld(dst, dst, 0));
+      mc->emit_reloc_at(mc, sec, lp, R_RV_PCREL_LO12_I, anchor, 0, 0, 0);
+    }
+  } else {
+    u32 ap = mc->pos(mc);
+    rv64_emit32(mc, rv_auipc(dst, 0));
+    mc->emit_reloc_at(mc, sec, ap, R_RV_PCREL_HI20, sym, 0, 0, 0);
+    {
+      Sym an = pool_intern_slice(t->c->global, SLICE_LIT(".LpcrelHi"));
+      ObjSymId anchor = obj_symbol(t->obj, an, SB_LOCAL, SK_OBJ, sec, (u64)ap, 0);
+      u32 lp = mc->pos(mc);
+      rv64_emit32(mc, rv_addi(dst, dst, 0));
+      mc->emit_reloc_at(mc, sec, lp, R_RV_PCREL_LO12_I, anchor, 0, 0, 0);
+    }
+  }
+  if (addend) rv_emit_addr_adjust(mc, dst, dst, (i32)addend);
+}
+
+/* Fold base + (index << log2_scale) into RV_TMP0 (ADD, or Zba shNadd). */
+static u32 rv_fold_index(RvNativeTarget* a, u32 base, u32 idx, u8 log2_scale) {
+  MCEmitter* mc = a->base.mc;
+  switch (log2_scale) {
+    case 0: rv64_emit32(mc, rv_add(RV_TMP0, base, idx)); break;
+    case 1: rv64_emit32(mc, rv_sh1add(RV_TMP0, idx, base)); break;
+    case 2: rv64_emit32(mc, rv_sh2add(RV_TMP0, idx, base)); break;
+    default: rv64_emit32(mc, rv_sh3add(RV_TMP0, idx, base)); break;
+  }
+  return RV_TMP0;
+}
+
+/* Resolve any NativeAddr to a base register + imm12 offset. RISC-V has no
+ * indexed load/store, so an index is folded into RV_TMP0 via Zba; far offsets
+ * and FRAME/FRAME_VALUE/GLOBAL bases are materialized into RV_TMP0/RV_TMP1. */
+static void rv_resolve_mem_addr(RvNativeTarget* a, const NativeAddr* addr,
+                                u32* base_out, i32* off_out) {
+  MCEmitter* mc = a->base.mc;
+  u32 base;
+  i32 off;
+  switch (addr->base_kind) {
+    case NATIVE_ADDR_BASE_REG:
+      base = addr->base.reg & 0x1fu;
+      off = addr->offset;
+      break;
+    case NATIVE_ADDR_BASE_FRAME: {
+      RvNativeSlot* s = rv_slot_get(a, addr->base.frame);
+      base = RV_S0;
+      off = rv_s0_off_slot(s) + addr->offset;
+      break;
+    }
+    case NATIVE_ADDR_BASE_FRAME_VALUE: {
+      RvNativeSlot* s = rv_slot_get(a, addr->base.frame);
+      rv64_emit32(mc, rv_ld(RV_TMP0, RV_S0, rv_s0_off_slot(s)));
+      base = RV_TMP0;
+      off = addr->offset;
+      break;
+    }
+    case NATIVE_ADDR_BASE_GLOBAL:
+      rv_emit_global_addr(a, RV_TMP0, addr->base.global.sym,
+                          addr->base.global.addend);
+      base = RV_TMP0;
+      off = addr->offset;
+      break;
+    default:
+      rv_panic(a, "unsupported address base");
+  }
+  if (addr->index_kind == NATIVE_ADDR_INDEX_REG) {
+    base = rv_fold_index(a, base, addr->index.reg & 0x1fu, addr->log2_scale);
+  } else if (addr->index_kind == NATIVE_ADDR_INDEX_FRAME_VALUE) {
+    RvNativeSlot* s = rv_slot_get(a, addr->index.frame);
+    rv64_emit32(mc, rv_ld(RV_TMP1, RV_S0, rv_s0_off_slot(s)));
+    base = rv_fold_index(a, base, RV_TMP1, addr->log2_scale);
+  }
+  if (!fits_i12(off)) {
+    rv_emit_load_imm(mc, 1, RV_TMP1, (i64)off);
+    rv64_emit32(mc, rv_add(RV_TMP0, base, RV_TMP1));
+    base = RV_TMP0;
+    off = 0;
+  }
+  *base_out = base;
+  *off_out = off;
+}
+
+/* Central load/store primitive. is_load: 1 load into reg, 0 store reg to mem. */
+static void rv_emit_mem(RvNativeTarget* a, int is_load, NativeLoc reg,
+                        NativeAddr addr, MemAccess mem) {
+  NativeTarget* t = &a->base;
+  MCEmitter* mc = t->mc;
+  u32 r = loc_reg(reg);
+  int fp = loc_is_fp(reg);
+  u32 sz = mem.size ? mem.size : rv_type_size(t, reg.type);
+  u32 base;
+  i32 off;
+
+  rv_resolve_mem_addr(a, &addr, &base, &off);
+  if (fp) {
+    rv64_emit32(mc, is_load ? (sz == 8u ? rv_fld(r, base, off)
+                                        : rv_flw(r, base, off))
+                            : (sz == 8u ? rv_fsd(r, base, off)
+                                        : rv_fsw(r, base, off)));
+  } else {
+    rv64_emit32(mc, is_load ? enc_int_load(sz, 0, r, base, off)
+                            : enc_int_store(sz, r, base, off));
+  }
+}
+
+/* ============================ moves / data ============================ */
+
+static void rv_move(NativeTarget* t, NativeLoc dst, NativeLoc src) {
+  MCEmitter* mc = t->mc;
+  int dfp = loc_is_fp(dst), sfp = loc_is_fp(src);
+  u32 rd = loc_reg(dst), rs = loc_reg(src);
+  if (dfp && sfp) {
+    u32 fmt = rv_type_size(t, dst.type) == 8u ? RV_FMT_D : RV_FMT_S;
+    if (rd == rs) return;
+    rv64_emit32(mc, rv_fsgnj(fmt, rd, rs, rs));
+    return;
+  }
+  if (!dfp && sfp) {
+    u32 sz = rv_type_size(t, src.type);
+    rv64_emit32(mc, sz == 8u ? rv_fmv_x_d(rd, rs) : rv_fmv_x_w(rd, rs));
+    return;
+  }
+  if (dfp && !sfp) {
+    u32 sz = rv_type_size(t, dst.type);
+    rv64_emit32(mc, sz == 8u ? rv_fmv_d_x(rd, rs) : rv_fmv_w_x(rd, rs));
+    return;
+  }
+  if (rd == rs) return;
+  rv64_emit32(mc, rv_addi(rd, rs, 0));
+}
+
+static void rv_load_imm(NativeTarget* t, NativeLoc dst, i64 imm) {
+  rv_emit_load_imm(t->mc, rv_is_64(t, dst.type) ? 1u : 0u, loc_reg(dst), imm);
+}
+
+static void rv_load_const(NativeTarget* t, NativeLoc dst, ConstBytes cb) {
+  RvNativeTarget* a = rv_of(t);
+  u64 v = 0;
+  u32 i;
+  if (!loc_is_fp(dst)) {
+    for (i = 0; i < cb.size && i < 8u; ++i) v |= (u64)cb.bytes[i] << (i * 8u);
+    rv_load_imm(t, dst, (i64)v);
+    return;
+  }
+  /* FP constant: materialize the bit pattern in TMP0, bitcast into the FPR. */
+  for (i = 0; i < cb.size && i < 8u; ++i) v |= (u64)cb.bytes[i] << (i * 8u);
+  rv_emit_load_imm(t->mc, 1, RV_TMP0, (i64)v);
+  if (cb.size == 8u)
+    rv64_emit32(t->mc, rv_fmv_d_x(loc_reg(dst), RV_TMP0));
+  else
+    rv64_emit32(t->mc, rv_fmv_w_x(loc_reg(dst), RV_TMP0));
+  (void)a;
+}
+
+static void rv_load_addr(NativeTarget* t, NativeLoc dst, NativeAddr addr) {
+  RvNativeTarget* a = rv_of(t);
+  MCEmitter* mc = t->mc;
+  u32 rd = loc_reg(dst);
+  u32 base;
+  i32 off;
+  if (addr.base_kind == NATIVE_ADDR_BASE_GLOBAL) {
+    rv_emit_global_addr(a, rd, addr.base.global.sym,
+                        addr.base.global.addend + addr.offset);
+    base = rd;
+    off = 0;
+  } else if (addr.base_kind == NATIVE_ADDR_BASE_FRAME_VALUE) {
+    /* Load the pointer stored in the frame slot, then add the offset. */
+    RvNativeSlot* s = rv_slot_get(a, addr.base.frame);
+    rv64_emit32(mc, rv_ld(rd, RV_S0, rv_s0_off_slot(s)));
+    base = rd;
+    off = addr.offset;
+  } else if (addr.base_kind == NATIVE_ADDR_BASE_FRAME) {
+    RvNativeSlot* s = rv_slot_get(a, addr.base.frame);
+    base = RV_S0;
+    off = rv_s0_off_slot(s) + addr.offset;
+  } else if (addr.base_kind == NATIVE_ADDR_BASE_REG) {
+    base = addr.base.reg & 0x1fu;
+    off = addr.offset;
+  } else {
+    rv_panic(a, "unsupported address base in load_addr");
+  }
+  /* Fold any index via Zba sh{1,2,3}add (index << scale) + base. */
+  if (addr.index_kind == NATIVE_ADDR_INDEX_REG) {
+    u32 idx = addr.index.reg & 0x1fu;
+    if (off != 0 || base != rd) rv_emit_addr_adjust(mc, rd, base, off);
+    switch (addr.log2_scale) {
+      case 0: rv64_emit32(mc, rv_add(rd, rd, idx)); break;
+      case 1: rv64_emit32(mc, rv_sh1add(rd, idx, rd)); break;
+      case 2: rv64_emit32(mc, rv_sh2add(rd, idx, rd)); break;
+      default: rv64_emit32(mc, rv_sh3add(rd, idx, rd)); break;
+    }
+    return;
+  }
+  rv_emit_addr_adjust(mc, rd, base, off);
+}
+
+static void rv_load(NativeTarget* t, NativeLoc dst, NativeAddr addr,
+                    MemAccess mem) {
+  rv_emit_mem(rv_of(t), 1, dst, addr, mem);
+}
+static void rv_store(NativeTarget* t, NativeAddr addr, NativeLoc src,
+                     MemAccess mem) {
+  rv_emit_mem(rv_of(t), 0, src, addr, mem);
+}
+
+/* copy_bytes: resolve dst and src to dedicated pointer regs (RV_TMP3 / RV_TMP0)
+ * once, then copy granule-by-granule advancing both pointers. dst is resolved
+ * first because its base may itself live in RV_TMP1 (the transfer reg, e.g. the
+ * sret pointer from plan_ret); capturing it into RV_TMP3 before src resolution
+ * (which may clobber RV_TMP1 for far offsets) keeps it live. Advancing the
+ * pointers keeps every load/store at offset 0, so no offset ever exceeds imm12
+ * and the transfer reg never aliases a base. */
+static void rv_copy_bytes(NativeTarget* t, NativeAddr dst, NativeAddr src,
+                          AggregateAccess access) {
+  MCEmitter* mc = t->mc;
+  CfreeCgTypeId i64t = builtin_id(CFREE_CG_BUILTIN_I64);
+  u32 rem = access.size;
+  rv_load_addr(t, rv_reg_loc(i64t, NATIVE_REG_INT, RV_TMP3), dst);
+  rv_load_addr(t, rv_reg_loc(i64t, NATIVE_REG_INT, RV_TMP0), src);
+  while (rem) {
+    u32 sz = rem >= 8u ? 8u : rem >= 4u ? 4u : rem >= 2u ? 2u : 1u;
+    rv64_emit32(mc, enc_int_load(sz, 0, RV_TMP1, RV_TMP0, 0));
+    rv64_emit32(mc, enc_int_store(sz, RV_TMP1, RV_TMP3, 0));
+    rv64_emit32(mc, rv_addi(RV_TMP0, RV_TMP0, (i32)sz));
+    rv64_emit32(mc, rv_addi(RV_TMP3, RV_TMP3, (i32)sz));
+    rem -= sz;
+  }
+}
+
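+/* set_bytes: resolve dst into RV_TMP3 once, then store the low byte of
+ * byte_value one byte at a time, advancing the pointer so every store uses
+ * offset 0. byte_value must not live in RV_TMP3, which is overwritten here. */
+static void rv_set_bytes(NativeTarget* t, NativeAddr dst, NativeLoc byte_value,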
+                         AggregateAccess access) {
+  MCEmitter* mc = t->mc;
+  CfreeCgTypeId i64t = builtin_id(CFREE_CG_BUILTIN_I64);
+  u32 bv = loc_reg(byte_value);
+  u32 rem = access.size;
+  rv_load_addr(t, rv_reg_loc(i64t, NATIVE_REG_INT, RV_TMP3), dst);
+  while (rem) {
+    rv64_emit32(mc, rv_sb(bv, RV_TMP3, 0));
+    rv64_emit32(mc, rv_addi(RV_TMP3, RV_TMP3, 1));
+    rem -= 1u;
+  }
+}
+
+/* ============================ arithmetic ============================ */
+
+static void rv_binop(NativeTarget* t, BinOp op, NativeLoc dst, NativeLoc aop,
+                     NativeLoc bop) {
+  MCEmitter* mc = t->mc;
+  u32 rd = loc_reg(dst);
+  u32 ra = loc_reg(aop);
+  int sf = rv_is_64(t, dst.type);
+  int b_imm = bop.kind == NATIVE_LOC_IMM;
+  u32 rb = b_imm ? 0u : loc_reg(bop);
+  i64 imm = b_imm ? bop.v.imm : 0;
+
+  switch (op) {
+    case BO_FADD:
+    case BO_FSUB:
+    case BO_FMUL:
+    case BO_FDIV: {
+      u32 fmt = rv_type_size(t, dst.type) == 8u ? RV_FMT_D : RV_FMT_S;
+      switch (op) {
+        case BO_FADD: rv64_emit32(mc, rv_fadd(fmt, rd, ra, rb)); break;
+        case BO_FSUB: rv64_emit32(mc, rv_fsub(fmt, rd, ra, rb)); break;
+        case BO_FMUL: rv64_emit32(mc, rv_fmul(fmt, rd, ra, rb)); break;
+        default: rv64_emit32(mc, rv_fdiv(fmt, rd, ra, rb)); break;
+      }
+      return;
+    }
+    case BO_IADD:
+      if (b_imm) {
+        rv64_emit32(mc, sf ? rv_addi(rd, ra, (i32)imm) : rv_addiw(rd, ra, (i32)imm));
+      } else {
+        rv64_emit32(mc, sf ? rv_add(rd, ra, rb) : rv_addw(rd, ra, rb));
+      }
+      return;
+    case BO_ISUB:
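+      if (b_imm) {
+        /* x - imm is emitted as x + (-imm); assumes operand legality has
+         * already ruled out imm == -2048, whose negation overflows imm12. */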
+        rv64_emit32(mc, sf ? rv_addi(rd, ra, (i32)-imm)
+                           : rv_addiw(rd, ra, (i32)-imm));
+      } else {
+        rv64_emit32(mc, sf ? rv_sub(rd, ra, rb) : rv_subw(rd, ra, rb));
+      }
+      return;
+    case BO_IMUL:
+      rv64_emit32(mc, sf ? rv_mul(rd, ra, rb) : rv_mulw(rd, ra, rb));
+      return;
+    case BO_SDIV:
+      rv64_emit32(mc, sf ? rv_div(rd, ra, rb) : rv_divw(rd, ra, rb));
+      return;
+    case BO_UDIV:
+      rv64_emit32(mc, sf ? rv_divu(rd, ra, rb) : rv_divuw(rd, ra, rb));
+      return;
+    case BO_SREM:
+      rv64_emit32(mc, sf ? rv_rem(rd, ra, rb) : rv_remw(rd, ra, rb));
+      return;
+    case BO_UREM:
+      rv64_emit32(mc, sf ? rv_remu(rd, ra, rb) : rv_remuw(rd, ra, rb));
+      return;
+    case BO_AND:
+      rv64_emit32(mc, b_imm ? rv_andi(rd, ra, (i32)imm) : rv_and(rd, ra, rb));
+      return;
+    case BO_OR:
+      rv64_emit32(mc, b_imm ? rv_ori(rd, ra, (i32)imm) : rv_or(rd, ra, rb));
+      return;
+    case BO_XOR:
+      rv64_emit32(mc, b_imm ? rv_xori(rd, ra, (i32)imm) : rv_xor(rd, ra, rb));
+      return;
+    case BO_SHL:
+      if (b_imm)
+        rv64_emit32(mc, sf ? rv_slli(rd, ra, (u32)imm & 63u)
+                           : rv_slliw(rd, ra, (u32)imm & 31u));
+      else
+        rv64_emit32(mc, sf ? rv_sll(rd, ra, rb) : rv_sllw(rd, ra, rb));
+      return;
+    case BO_SHR_U:
+      if (b_imm)
+        rv64_emit32(mc, sf ? rv_srli(rd, ra, (u32)imm & 63u)
+                           : rv_srliw(rd, ra, (u32)imm & 31u));
+      else
+        rv64_emit32(mc, sf ? rv_srl(rd, ra, rb) : rv_srlw(rd, ra, rb));
+      return;
+    case BO_SHR_S:
+      if (b_imm)
+        rv64_emit32(mc, sf ? rv_srai(rd, ra, (u32)imm & 63u)
+                           : rv_sraiw(rd, ra, (u32)imm & 31u));
+      else
+        rv64_emit32(mc, sf ? rv_sra(rd, ra, rb) : rv_sraw(rd, ra, rb));
+      return;
+    default:
+      rv_panic(rv_of(t), "unsupported binop");
+  }
+}
+
+static void rv_unop(NativeTarget* t, UnOp op, NativeLoc dst, NativeLoc src) {
+  MCEmitter* mc = t->mc;
+  u32 rd = loc_reg(dst), rs = loc_reg(src);
+  int sf = rv_is_64(t, dst.type);
+  switch (op) {
+    case UO_NEG:
+      rv64_emit32(mc, sf ? rv_sub(rd, RV_ZERO, rs) : rv_subw(rd, RV_ZERO, rs));
+      return;
+    case UO_FNEG: {
+      u32 fmt = rv_type_size(t, dst.type) == 8u ? RV_FMT_D : RV_FMT_S;
+      rv64_emit32(mc, rv_fsgnjn(fmt, rd, rs, rs));
+      return;
+    }
+    case UO_BNOT:
+      rv64_emit32(mc, rv_xori(rd, rs, -1));
+      return;
+    case UO_NOT:
+      rv64_emit32(mc, rv_sltiu(rd, rs, 1));
+      return;
+    default:
+      rv_panic(rv_of(t), "unsupported unop");
+  }
+}
+
+/* Sign/zero-extend the low 32 bits of a non-64-bit operand into tmp for a
+ * 64-bit compare; 64-bit operands pass through. Returns the register to use. */
+static u32 rv_cmp_ext(NativeTarget* t, int is_signed, NativeLoc op, u32 tmp) {
+  MCEmitter* mc = t->mc;
+  u32 r = loc_reg(op);
+  if (rv_is_64(t, op.type)) return r;
+  if (is_signed) {
+    rv64_emit32(mc, rv_addiw(tmp, r, 0)); /* sign-extend low 32 */
+  } else {
+    rv64_emit32(mc, rv_slli(tmp, r, 32));
+    rv64_emit32(mc, rv_srli(tmp, tmp, 32));
+  }
+  return tmp;
+}
+
+static int cmp_is_signed(CmpOp op) {
+  switch (op) {
+    case CMP_LT_U:
+    case CMP_LE_U:
+    case CMP_GT_U:
+    case CMP_GE_U:
+      return 0;
+    default:
+      return 1;
+  }
+}
+
+/* Emit a 0/1 comparison result into rd from two integer registers. */
+static void rv_emit_icmp(NativeTarget* t, CmpOp op, u32 rd, u32 ra, u32 rb) {
+  MCEmitter* mc = t->mc;
+  switch (op) {
+    case CMP_EQ:
+      rv64_emit32(mc, rv_sub(rd, ra, rb));
+      rv64_emit32(mc, rv_sltiu(rd, rd, 1));
+      return;
+    case CMP_NE:
+      rv64_emit32(mc, rv_sub(rd, ra, rb));
+      rv64_emit32(mc, rv_sltu(rd, RV_ZERO, rd));
+      return;
+    case CMP_LT_S: rv64_emit32(mc, rv_slt(rd, ra, rb)); return;
+    case CMP_LT_U: rv64_emit32(mc, rv_sltu(rd, ra, rb)); return;
+    case CMP_GT_S: rv64_emit32(mc, rv_slt(rd, rb, ra)); return;
+    case CMP_GT_U: rv64_emit32(mc, rv_sltu(rd, rb, ra)); return;
+    case CMP_GE_S:
+      rv64_emit32(mc, rv_slt(rd, ra, rb));
+      rv64_emit32(mc, rv_xori(rd, rd, 1));
+      return;
+    case CMP_GE_U:
+      rv64_emit32(mc, rv_sltu(rd, ra, rb));
+      rv64_emit32(mc, rv_xori(rd, rd, 1));
+      return;
+    case CMP_LE_S:
+      rv64_emit32(mc, rv_slt(rd, rb, ra));
+      rv64_emit32(mc, rv_xori(rd, rd, 1));
+      return;
+    case CMP_LE_U:
+      rv64_emit32(mc, rv_sltu(rd, rb, ra));
+      rv64_emit32(mc, rv_xori(rd, rd, 1));
+      return;
+    default:
+      rv_panic(rv_of(t), "unsupported integer cmp");
+  }
+}
+
+static void rv_cmp(NativeTarget* t, CmpOp op, NativeLoc dst, NativeLoc aop,
+                   NativeLoc bop) {
+  MCEmitter* mc = t->mc;
+  u32 rd = loc_reg(dst);
+  /* EQ/NE are shared int/FP opcodes; FP equality (and FP x!=x for isnan,
+   * bool-from-float, etc.) arrives as CMP_EQ/CMP_NE with FP-class operands and
+   * must use feq, not an integer compare on the FP register numbers. */
+  if (op >= CMP_LT_F ||
+      ((op == CMP_EQ || op == CMP_NE) && loc_is_fp(aop))) {
+    u32 fmt = rv_type_size(t, aop.type) == 8u ? RV_FMT_D : RV_FMT_S;
+    u32 ra = loc_reg(aop), rb = loc_reg(bop);
+    switch (op) {
+      case CMP_EQ:
+        rv64_emit32(mc, fmt == RV_FMT_D ? rv_feq_d(rd, ra, rb)
+                                        : rv_feq_s(rd, ra, rb));
+        return;
+      case CMP_NE:
+        rv64_emit32(mc, fmt == RV_FMT_D ? rv_feq_d(rd, ra, rb)
+                                        : rv_feq_s(rd, ra, rb));
+        rv64_emit32(mc, rv_xori(rd, rd, 1));
+        return;
+      case CMP_LT_F:
+        rv64_emit32(mc, fmt == RV_FMT_D ? rv_flt_d(rd, ra, rb)
+                                        : rv_flt_s(rd, ra, rb));
+        return;
+      case CMP_LE_F:
+        rv64_emit32(mc, fmt == RV_FMT_D ? rv_fle_d(rd, ra, rb)
+                                        : rv_fle_s(rd, ra, rb));
+        return;
+      case CMP_GT_F:
+        rv64_emit32(mc, fmt == RV_FMT_D ? rv_flt_d(rd, rb, ra)
+                                        : rv_flt_s(rd, rb, ra));
+        return;
+      case CMP_GE_F:
+        rv64_emit32(mc, fmt == RV_FMT_D ? rv_fle_d(rd, rb, ra)
+                                        : rv_fle_s(rd, rb, ra));
+        return;
+      default:
+        rv_panic(rv_of(t), "unsupported fp cmp");
+    }
+  }
+  {
+    int sg = cmp_is_signed(op);
+    u32 ra = rv_cmp_ext(t, sg, aop, RV_TMP0);
+    u32 rb = rv_cmp_ext(t, sg, bop, RV_TMP1);
+    rv_emit_icmp(t, op, rd, ra, rb);
+  }
+}
+
+static void rv_convert(NativeTarget* t, ConvKind op, NativeLoc dst,
+                       NativeLoc src) {
+  MCEmitter* mc = t->mc;
+  u32 rd = loc_reg(dst), rs = loc_reg(src);
+  u32 src_sz = rv_type_size(t, src.type);
+  u32 dst_sz = rv_type_size(t, dst.type);
+  switch (op) {
+    case CV_SEXT:
+      if (src_sz >= 4u) {
+        rv64_emit32(mc, rv_addiw(rd, rs, 0));
+      } else {
+        u32 sh = 64u - src_sz * 8u;
+        rv64_emit32(mc, rv_slli(rd, rs, sh));
+        rv64_emit32(mc, rv_srai(rd, rd, sh));
+      }
+      return;
+    case CV_ZEXT: {
+      u32 sh = 64u - src_sz * 8u;
+      rv64_emit32(mc, rv_slli(rd, rs, sh));
+      rv64_emit32(mc, rv_srli(rd, rd, sh));
+      return;
+    }
+    case CV_TRUNC:
+      if (rd != rs || dst_sz <= 4u)
+        rv64_emit32(mc, rv_addi(rd, rs, 0)); /* low bits; users re-narrow */
+      return;
+    case CV_ITOF_S:
+      if (rv_type_size(t, dst.type) == 8u)
+        rv64_emit32(mc, src_sz == 8u ? rv_fcvt_d_l(rd, rs) : rv_fcvt_d_w(rd, rs));
+      else
+        rv64_emit32(mc, src_sz == 8u ? rv_fcvt_s_l(rd, rs) : rv_fcvt_s_w(rd, rs));
+      return;
+    case CV_ITOF_U:
+      if (rv_type_size(t, dst.type) == 8u)
+        rv64_emit32(mc, src_sz == 8u ? rv_fcvt_d_lu(rd, rs) : rv_fcvt_d_wu(rd, rs));
+      else
+        rv64_emit32(mc, src_sz == 8u ? rv_fcvt_s_lu(rd, rs) : rv_fcvt_s_wu(rd, rs));
+      return;
+    case CV_FTOI_S:
+      if (src_sz == 8u)
+        rv64_emit32(mc, dst_sz == 8u ? rv_fcvt_l_d(rd, rs) : rv_fcvt_w_d(rd, rs));
+      else
+        rv64_emit32(mc, dst_sz == 8u ? rv_fcvt_l_s(rd, rs) : rv_fcvt_w_s(rd, rs));
+      return;
+    case CV_FTOI_U:
+      if (src_sz == 8u)
+        rv64_emit32(mc, dst_sz == 8u ? rv_fcvt_lu_d(rd, rs) : rv_fcvt_wu_d(rd, rs));
+      else
+        rv64_emit32(mc, dst_sz == 8u ? rv_fcvt_lu_s(rd, rs) : rv_fcvt_wu_s(rd, rs));
+      return;
+    case CV_FEXT:
+      rv64_emit32(mc, rv_fcvt_d_s(rd, rs));
+      return;
+    case CV_FTRUNC:
+      rv64_emit32(mc, rv_fcvt_s_d(rd, rs));
+      return;
+    case CV_BITCAST:
+      rv_move(t, dst, src);
+      return;
+    default:
+      rv_panic(rv_of(t), "unsupported convert");
+  }
+}
+
+/* ============================ spill / reload ============================ */
+
+static void rv_spill(NativeTarget* t, NativeLoc src, NativeFrameSlot slot,
+                     MemAccess mem) {
+  NativeAddr addr;
+  memset(&addr, 0, sizeof addr);
+  addr.base_kind = NATIVE_ADDR_BASE_FRAME;
+  addr.base.frame = slot;
+  addr.base_type = src.type;
+  rv_emit_mem(rv_of(t), 0, src, addr, mem);
+}
+static void rv_reload(NativeTarget* t, NativeLoc dst, NativeFrameSlot slot,
+                      MemAccess mem) {
+  NativeAddr addr;
+  memset(&addr, 0, sizeof addr);
+  addr.base_kind = NATIVE_ADDR_BASE_FRAME;
+  addr.base.frame = slot;
+  addr.base_type = dst.type;
+  rv_emit_mem(rv_of(t), 1, dst, addr, mem);
+}
+
+/* ============================ control flow ============================ */
+
+static MCLabel rv_label_new(NativeTarget* t) { return t->mc->label_new(t->mc); }
+static void rv_label_place(NativeTarget* t, MCLabel l) {
+  t->mc->label_place(t->mc, l);
+}
+static void rv_jump(NativeTarget* t, MCLabel l) {
+  rv64_emit32(t->mc, rv_jal(RV_ZERO, 0));
+  t->mc->emit_label_ref(t->mc, l, R_RV_JAL, 4, 0);
+}
+
+static void rv_cmp_branch(NativeTarget* t, CmpOp op, NativeLoc aop,
+                          NativeLoc bop, MCLabel l) {
+  MCEmitter* mc = t->mc;
+  /* FP compares (incl. EQ/NE on FP operands) have no register-register branch
+   * form: materialize the 0/1 into TMP0 via rv_cmp, then branch on nonzero. */
+  if (op >= CMP_LT_F ||
+      ((op == CMP_EQ || op == CMP_NE) && loc_is_fp(aop))) {
+    NativeLoc tmp = rv_reg_loc(builtin_id(CFREE_CG_BUILTIN_I64), NATIVE_REG_INT,
+                               RV_TMP0);
+    rv_cmp(t, op, tmp, aop, bop);
+    rv64_emit32(mc, rv_bne(RV_TMP0, RV_ZERO, 0));
+    mc->emit_label_ref(mc, l, R_RV_BRANCH, 4, 0);
+    return;
+  }
+  {
+    int sg = cmp_is_signed(op);
+    u32 ra = rv_cmp_ext(t, sg, aop, RV_TMP0);
+    u32 rb = rv_cmp_ext(t, sg, bop, RV_TMP1);
+    u32 word;
+    switch (op) {
+      case CMP_EQ: word = rv_beq(ra, rb, 0); break;
+      case CMP_NE: word = rv_bne(ra, rb, 0); break;
+      case CMP_LT_S: word = rv_blt(ra, rb, 0); break;
+      case CMP_GE_S: word = rv_bge(ra, rb, 0); break;
+      case CMP_LT_U: word = rv_bltu(ra, rb, 0); break;
+      case CMP_GE_U: word = rv_bgeu(ra, rb, 0); break;
+      case CMP_GT_S: word = rv_blt(rb, ra, 0); break;
+      case CMP_LE_S: word = rv_bge(rb, ra, 0); break;
+      case CMP_GT_U: word = rv_bltu(rb, ra, 0); break;
+      case CMP_LE_U: word = rv_bgeu(rb, ra, 0); break;
+      default: rv_panic(rv_of(t), "unsupported cmp_branch");
+    }
+    rv64_emit32(mc, word);
+    mc->emit_label_ref(mc, l, R_RV_BRANCH, 4, 0);
+  }
+}
+
+static void rv_indirect_branch(NativeTarget* t, NativeLoc addr,
+                               const MCLabel* valid_targets, u32 ntargets) {
+  (void)valid_targets;
+  (void)ntargets;
+  rv64_emit32(t->mc, rv_jalr(RV_ZERO, loc_reg(addr), 0));
+}
+
+static void rv_load_label_addr(NativeTarget* t, NativeLoc dst, MCLabel l) {
+  u32 rd = loc_reg(dst);
+  rv64_emit32(t->mc, rv_auipc(rd, 0));
+  rv64_emit32(t->mc, rv_addi(rd, rd, 0));
+  t->mc->emit_label_ref(t->mc, l, R_RV_INTRA_AUIPC_ADDI, 8, 0);
+}
+
+/* ============================ frame / lifecycle ============================ */
+
+static NativeFrameSlot rv_frame_slot(NativeTarget* t,
+                                     const NativeFrameSlotDesc* d) {
+  RvNativeTarget* a = rv_of(t);
+  RvNativeSlot* s;
+  u32 size = d->size ? d->size : 8u;
+  u32 align = d->align ? d->align : 1u;
+  if (a->frame_final) rv_panic(a, "frame slot requested after prologue");
+  if (a->nslots == a->slots_cap) {
+    u32 cap = a->slots_cap ? a->slots_cap * 2u : 16u;
+    RvNativeSlot* nb = arena_zarray(t->c->tu, RvNativeSlot, cap);
+    if (a->slots) memcpy(nb, a->slots, sizeof(*nb) * a->nslots);
+    a->slots = nb;
+    a->slots_cap = cap;
+  }
+  a->cum_off = align_up_u32(a->cum_off + size, align);
+  s = &a->slots[a->nslots++];
+  s->off = a->cum_off;
+  s->size = size;
+  s->align = align;
+  s->kind = d->kind;
+  return (NativeFrameSlot)a->nslots;
+}
+
+static void rv_func_begin_common(NativeTarget* t, const CGFuncDesc* fd) {
+  RvNativeTarget* a = rv_of(t);
+  MCEmitter* mc = t->mc;
+  const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, fd->fn_type);
+  a->func = fd;
+  a->loc = fd->loc;
+  a->nslots = 0;
+  a->cum_off = 0;
+  a->max_outgoing = 0;
+  a->incoming_stack_size = 0;
+  a->next_param_int = 0;
+  a->next_param_fp = 0;
+  a->next_param_stack = 0;
+  a->has_sret = (abi && abi->has_sret) ? 1u : 0u;
+  a->is_variadic = (abi && abi->variadic) ? 1u : 0u;
+  a->sret_ptr_slot = NATIVE_FRAME_SLOT_NONE;
+  a->npatches = 0;
+  a->nalloca = 0;
+  a->ncallee_saves = 0;
+  a->known_frame = 0;
+  a->has_alloca = 0;
+  a->frame_final = 0;
+
+  mc->set_section(mc, fd->text_section_id);
+  mc->emit_align(mc, 4, 0);
+  a->func_start = mc->pos(mc);
+  mc_begin_function(mc, fd->sym, fd->text_section_id, a->func_start);
+  if (mc->cfi_startproc) mc->cfi_startproc(mc);
+  a->epilogue_label = mc->label_new(mc);
+}
+
+/* sret: reserve a hidden slot for the incoming destination pointer (a0). */
+static void rv_reserve_entry_saves(RvNativeTarget* a) {
+  NativeTarget* t = &a->base;
+  if (a->has_sret) {
+    NativeFrameSlotDesc sd;
+    memset(&sd, 0, sizeof sd);
+    sd.type = builtin_id(CFREE_CG_BUILTIN_I64);
+    sd.size = 8;
+    sd.align = 8;
+    sd.kind = NATIVE_FRAME_SLOT_SAVE;
+    a->sret_ptr_slot = t->frame_slot(t, &sd);
+    a->next_param_int = 1; /* a0 consumed by the sret pointer */
+  }
+}
+
+static void rv_emit_entry_save_stores(RvNativeTarget* a) {
+  NativeTarget* t = &a->base;
+  if (a->has_sret && a->sret_ptr_slot != NATIVE_FRAME_SLOT_NONE) {
+    CfreeCgTypeId i64t = builtin_id(CFREE_CG_BUILTIN_I64);
+    NativeAddr addr;
+    memset(&addr, 0, sizeof addr);
+    addr.base_kind = NATIVE_ADDR_BASE_FRAME;
+    addr.base.frame = a->sret_ptr_slot;
+    addr.base_type = i64t;
+    rv_emit_mem(a, 0, rv_reg_loc(i64t, NATIVE_REG_INT, RV_A0), addr,
+                rv_mem_for_type(t, i64t, 8));
+  }
+}
+
+/* Collect the callee-saves the body used (none at -O0). */
+static u32 rv_collect_int_saves(RvNativeTarget* a, u32* regs) {
+  u32 n = 0, i;
+  for (i = 0; i < a->ncallee_saves; ++i)
+    if (a->callee_saves[i].cls == NATIVE_REG_INT)
+      regs[n++] = a->callee_saves[i].reg;
+  return n;
+}
+static u32 rv_collect_fp_saves(RvNativeTarget* a, u32* regs) {
+  u32 n = 0, i;
+  for (i = 0; i < a->ncallee_saves; ++i)
+    if (a->callee_saves[i].cls == NATIVE_REG_FP)
+      regs[n++] = a->callee_saves[i].reg;
+  return n;
+}
+
+/* s0-relative offset of saved register idx (saves sit below the locals). */
+static i32 rv_save_off(RvNativeTarget* a, u32 idx) {
+  return -(i32)(a->cum_off) - 8 - 8 * (i32)idx;
+}
+
+static void rv_load_s0(MCEmitter* mc, int fp, u32 reg, i32 off) {
+  if (fits_i12(off)) {
+    rv64_emit32(mc, fp ? rv_fld(reg, RV_S0, off) : rv_ld(reg, RV_S0, off));
+    return;
+  }
+  rv_emit_load_imm(mc, 1, RV_TMP0, (i64)off);
+  rv64_emit32(mc, rv_add(RV_TMP0, RV_S0, RV_TMP0));
+  rv64_emit32(mc, fp ? rv_fld(reg, RV_TMP0, 0) : rv_ld(reg, RV_TMP0, 0));
+}
+
+/* Build the prologue instruction sequence into words[]. Returns count. */
+static u32 rv_build_prologue(RvNativeTarget* a, u32* words, u32 cap,
+                             u32 frame_size, u32 fp_pair_off,
+                             const u32* int_regs, u32 n_int,
+                             const u32* fp_regs, u32 n_fp) {
+  u32 wi = 0;
+#define PUSH(w)                                            \
+  do {                                                     \
+    if (wi >= cap) rv_panic(a, "prologue placeholder overflow"); \
+    words[wi++] = (w);                                     \
+  } while (0)
+  /* sp -= frame_size */
+  if (fits_i12(-(i32)frame_size)) {
+    PUSH(rv_addi(RV_SP, RV_SP, -(i32)frame_size));
+  } else {
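+    /* Split into lui hi20 + addiw lo12. Adding 0x800 before the shift rounds
+     * hi so that the sign-extended lo12 falls in [-2048, 2047]. */
+    i32 neg = -(i32)frame_size;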
+    i32 hi = (i32)(((i64)neg + 0x800) >> 12);
+    i32 lo = neg - (i32)((u32)hi << 12);
+    PUSH(rv_lui(RV_TMP0, (u32)hi & 0xfffffu));
+    if (lo) PUSH(rv_addiw(RV_TMP0, RV_TMP0, lo));
+    PUSH(rv_add(RV_SP, RV_SP, RV_TMP0));
+  }
+  /* save s0/ra at [sp + fp_pair_off], set s0 = sp + fp_pair_off */
+  if (fits_i12((i32)fp_pair_off + 8)) {
+    PUSH(rv_sd(RV_S0, RV_SP, (i32)fp_pair_off));
+    PUSH(rv_sd(RV_RA, RV_SP, (i32)fp_pair_off + 8));
+    PUSH(rv_addi(RV_S0, RV_SP, (i32)fp_pair_off));
+  } else {
+    i32 off = (i32)fp_pair_off;
+    i32 hi = (i32)(((i64)off + 0x800) >> 12);
+    i32 lo = off - (i32)((u32)hi << 12);
+    PUSH(rv_lui(RV_TMP0, (u32)hi & 0xfffffu));
+    if (lo) PUSH(rv_addiw(RV_TMP0, RV_TMP0, lo));
+    PUSH(rv_add(RV_TMP0, RV_SP, RV_TMP0));
+    PUSH(rv_sd(RV_S0, RV_TMP0, 0));
+    PUSH(rv_sd(RV_RA, RV_TMP0, 8));
+    PUSH(rv_addi(RV_S0, RV_TMP0, 0));
+  }
+  /* sret a0 spill */
+  if (a->has_sret && a->sret_ptr_slot != NATIVE_FRAME_SLOT_NONE) {
+    RvNativeSlot* s = rv_slot_get(a, a->sret_ptr_slot);
+    PUSH(rv_sd(RV_A0, RV_S0, rv_s0_off_slot(s)));
+  }
+  /* variadic GP save area: spill unconsumed a-regs at [s0 + 16 + i*8] */
+  if (a->is_variadic) {
+    u32 i;
+    for (i = a->next_param_int; i < 8u; ++i)
+      PUSH(rv_sd(RV_A0 + i, RV_S0, 16 + (i32)i * 8));
+  }
+  /* callee saves */
+  {
+    u32 i;
+    for (i = 0; i < n_int; ++i) {
+      i32 off = rv_save_off(a, i);
+      if (fits_i12(off)) {
+        PUSH(rv_sd(int_regs[i], RV_S0, off));
+      } else {
+        /* Rare. The known-frame path could emit this directly, but the
+         * single-pass placeholder must hold it too, so use the wide form. */
+        i32 hi = (i32)(((i64)off + 0x800) >> 12);
+        i32 lo = off - (i32)((u32)hi << 12);
+        PUSH(rv_lui(RV_TMP0, (u32)hi & 0xfffffu));
+        if (lo) PUSH(rv_addiw(RV_TMP0, RV_TMP0, lo));
+        PUSH(rv_add(RV_TMP0, RV_S0, RV_TMP0));
+        PUSH(rv_sd(int_regs[i], RV_TMP0, 0));
+      }
+    }
+    for (i = 0; i < n_fp; ++i) {
+      i32 off = rv_save_off(a, n_int + i);
+      if (fits_i12(off)) {
+        PUSH(rv_fsd(fp_regs[i], RV_S0, off));
+      } else {
+        i32 hi = (i32)(((i64)off + 0x800) >> 12);
+        i32 lo = off - (i32)((u32)hi << 12);
+        PUSH(rv_lui(RV_TMP0, (u32)hi & 0xfffffu));
+        if (lo) PUSH(rv_addiw(RV_TMP0, RV_TMP0, lo));
+        PUSH(rv_add(RV_TMP0, RV_S0, RV_TMP0));
+        PUSH(rv_fsd(fp_regs[i], RV_TMP0, 0));
+      }
+    }
+  }
+#undef PUSH
+  return wi;
+}
+
+static void rv_func_begin(NativeTarget* t, const CGFuncDesc* fd) {
+  RvNativeTarget* a = rv_of(t);
+  MCEmitter* mc = t->mc;
+  u32 i;
+  rv_func_begin_common(t, fd);
+  a->prologue_pos = mc->pos(mc);
+  for (i = 0; i < RV_PROLOGUE_WORDS; ++i) rv64_emit32(mc, RV_NOP);
+  rv_reserve_entry_saves(a);
+  rv_emit_entry_save_stores(a);
+}
+
+static void rv_func_end(NativeTarget* t) {
+  RvNativeTarget* a = rv_of(t);
+  MCEmitter* mc = t->mc;
+  ObjBuilder* obj = t->obj;
+  ObjSecId sec = a->func->text_section_id;
+  u32 int_regs[16], fp_regs[16];
+  u32 n_int = rv_collect_int_saves(a, int_regs);
+  u32 n_fp = rv_collect_fp_saves(a, fp_regs);
+  u32 frame_size = rv_frame_size(a);
+  u32 fp_pair_off = rv_fp_pair_off(a, frame_size);
+  u32 end;
+  i32 i;
+  a->frame_size_final = frame_size;
+  a->fp_pair_off = fp_pair_off;
+
+  /* epilogue */
+  mc->label_place(mc, a->epilogue_label);
+  for (i = (i32)n_int - 1; i >= 0; --i)
+    rv_load_s0(mc, 0, int_regs[i], rv_save_off(a, (u32)i));
+  for (i = (i32)n_fp - 1; i >= 0; --i)
+    rv_load_s0(mc, 1, fp_regs[i], rv_save_off(a, n_int + (u32)i));
+  if (a->has_alloca)
+    rv_emit_addr_adjust(mc, RV_SP, RV_S0, -(i32)fp_pair_off);
+  rv64_emit32(mc, rv_ld(RV_RA, RV_S0, 8));
+  rv64_emit32(mc, rv_ld(RV_S0, RV_S0, 0));
+  /* sp += frame_size */
+  if (fits_i12((i32)frame_size)) {
+    rv64_emit32(mc, rv_addi(RV_SP, RV_SP, (i32)frame_size));
+  } else {
+    rv_emit_load_imm(mc, 1, RV_TMP0, (i64)frame_size);
+    rv64_emit32(mc, rv_add(RV_SP, RV_SP, RV_TMP0));
+  }
+  rv64_emit32(mc, rv_jalr(RV_ZERO, RV_RA, 0));
+
+  /* patch prologue */
+  if (!a->known_frame) {
+    u32 words[RV_PROLOGUE_WORDS];
+    u32 nwords, k;
+    for (k = 0; k < RV_PROLOGUE_WORDS; ++k) words[k] = RV_NOP;
+    nwords = rv_build_prologue(a, words, RV_PROLOGUE_WORDS, frame_size,
+                               fp_pair_off, int_regs, n_int, fp_regs, n_fp);
+    (void)nwords;
+    for (k = 0; k < RV_PROLOGUE_WORDS; ++k)
+      rv_patch32(obj, sec, a->prologue_pos + k * 4u, words[k]);
+  }
+  /* patch alloca sites: addi dst, sp, max_outgoing */
+  {
+    u32 mo = align_up_u32(a->max_outgoing, 16u);
+    u32 k;
+    if (a->npatches && mo > 2047u)
+      rv_panic(a, "max_outgoing too large for alloca patch");
+    for (k = 0; k < a->npatches; ++k)
+      rv_patch32(obj, sec, a->patches[k].pos,
+                 rv_addi(a->patches[k].dst_reg, RV_SP, (i32)mo));
+  }
+
+  /* CFI: CFA = s0 + (frame_size - fp_pair_off) */
+  if (mc->cfi_set_next_pc_offset && mc->cfi_def_cfa && mc->cfi_offset) {
+    i32 cfa = (i32)frame_size - (i32)fp_pair_off;
+    u32 post = a->prologue_pos + (a->known_frame ? 0u : RV_PROLOGUE_WORDS * 4u);
+    u32 k;
+    mc->cfi_set_next_pc_offset(mc, post - a->func_start);
+    mc->cfi_def_cfa(mc, RV_S0, cfa);
+    mc->cfi_offset(mc, RV_S0, -cfa);
+    mc->cfi_offset(mc, RV_RA, -cfa + 8);
+    for (k = 0; k < n_int; ++k)
+      mc->cfi_offset(mc, int_regs[k], rv_save_off(a, k) - cfa);
+    for (k = 0; k < n_fp; ++k)
+      mc->cfi_offset(mc, 32u + fp_regs[k], rv_save_off(a, n_int + k) - cfa);
+  }
+
+  end = mc->pos(mc);
+  obj_symbol_define(obj, a->func->sym, sec, (u64)a->func_start,
+                    (u64)(end - a->func_start));
+  if (a->func->atomize)
+    obj_atom_define(obj, sec, a->func_start, end - a->func_start, a->func->sym,
+                    0);
+  if (mc->debug) debug_func_pc_range(mc->debug, sec, a->func_start, end);
+  if (mc->cfi_endproc) mc->cfi_endproc(mc);
+  mc_end_function(mc);
+  a->func = NULL;
+}
+
+static void rv_func_begin_known_frame(NativeTarget* t, const CGFuncDesc* fd,
+                                      const NativeKnownFrameDesc* frame,
+                                      NativeFrameSlot* out_slots) {
+  (void)fd;
+  (void)frame;
+  (void)out_slots;
+  rv_panic(rv_of(t), "known-frame path not implemented yet");
+}
+
+/* ============================ params / ABI helpers ============================ */
+
+static const ABIArgInfo* rv_param_abi(NativeTarget* t, const ABIFuncInfo* abi,
+                                      const NativeCallDesc* desc, u32 i,
+                                      ABIArgInfo* scratch) {
+  /* Synthesized for unnamed (variadic) args or untyped calls. RISC-V LP64D
+   * passes variadic FP args in integer registers (as their bit pattern), not
+   * the FP pool, so a variadic float part is ABI_CLASS_INT. */
+  int variadic = abi && i >= abi->nparams;
+  if (abi && i < abi->nparams) return &abi->params[i];
+  memset(scratch, 0, sizeof *scratch);
+  scratch->kind = ABI_ARG_DIRECT;
+  scratch->nparts = 1;
+  scratch->parts = arena_zarray(t->c->tu, ABIArgPart, 1);
+  ((ABIArgPart*)scratch->parts)[0].cls =
+      (!variadic && cg_type_is_float(t->c, desc->args[i].type)) ? ABI_CLASS_FP
+                                                                : ABI_CLASS_INT;
+  ((ABIArgPart*)scratch->parts)[0].loc = ABI_LOC_REG;
+  ((ABIArgPart*)scratch->parts)[0].size = rv_type_size(t, desc->args[i].type);
+  ((ABIArgPart*)scratch->parts)[0].align = rv_type_align(t, desc->args[i].type);
+  return scratch;
+}
+
+static u32 rv_part_stack_size(const ABIArgPart* part) {
+  return align_up_u32(part->size ? part->size : 8u, 8u);
+}
+static u32 rv_part_stack_align(const ABIArgPart* part) {
+  u32 al = part->align ? part->align : 8u;
+  if (al < 8u) al = 8u;
+  if (al > 16u) al = 16u;
+  return al;
+}
+
+static CfreeCgTypeId rv_part_scalar_type(const ABIArgPart* part) {
+  if (part->cls == ABI_CLASS_FP) {
+    if (part->size <= 4u) return builtin_id(CFREE_CG_BUILTIN_F32);
+    return builtin_id(CFREE_CG_BUILTIN_F64);
+  }
+  switch (part->size) {
+    case 1u: return builtin_id(CFREE_CG_BUILTIN_I8);
+    case 2u: return builtin_id(CFREE_CG_BUILTIN_I16);
+    case 4u: return builtin_id(CFREE_CG_BUILTIN_I32);
+    default: return builtin_id(CFREE_CG_BUILTIN_I64);
+  }
+}
+
+static u32 rv_class_stack_size(const ABIArgInfo* ai) {
+  u32 total = 0, p;
+  if (!ai || ai->kind == ABI_ARG_IGNORE) return 0;
+  if (ai->kind == ABI_ARG_INDIRECT) return 8u;
+  for (p = 0; p < ai->nparts; ++p) {
+    total = align_up_u32(total, rv_part_stack_align(&ai->parts[p]));
+    total += rv_part_stack_size(&ai->parts[p]);
+  }
+  return align_up_u32(total ? total : 8u, 8u);
+}
+
+static u32 rv_call_stack_size(NativeTarget* t, const NativeCallDesc* desc) {
+  const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, desc->fn_type);
+  /* sret consumes a0 as the implicit first integer argument. */
+  u32 next_int = (abi && abi->has_sret) ? 1u : 0u;
+  u32 next_fp = 0, stack = 0, i, p;
+  for (i = 0; i < desc->nargs; ++i) {
+    ABIArgInfo tmp;
+    const ABIArgInfo* ai = rv_param_abi(t, abi, desc, i, &tmp);
+    int force_stack =
+        abi && abi->variadic && abi->vararg_on_stack && i >= abi->nparams;
+    if (ai->kind == ABI_ARG_IGNORE) continue;
+    if (force_stack) {
+      stack += rv_class_stack_size(ai);
+      continue;
+    }
+    if (ai->kind == ABI_ARG_INDIRECT) {
+      if (next_int < 8u)
+        next_int++;
+      else
+        stack += 8u;
+      continue;
+    }
+    for (p = 0; p < ai->nparts; ++p) {
+      const ABIArgPart* part = &ai->parts[p];
+      if (part->cls == ABI_CLASS_FP) {
+        if (next_fp < 8u)
+          next_fp++;
+        else {
+          stack = align_up_u32(stack, rv_part_stack_align(part));
+          stack += rv_part_stack_size(part);
+        }
+      } else {
+        if (next_int < 8u)
+          next_int++;
+        else {
+          stack = align_up_u32(stack, rv_part_stack_align(part));
+          stack += rv_part_stack_size(part);
+        }
+      }
+    }
+  }
+  return align_up_u32(stack, 16u);
+}
+
+static u32 rv_signature_stack_bytes(NativeTarget* t, CfreeCgTypeId fn_type,
+                                    int* variadic, u32* nparams) {
+  const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, fn_type);
+  NativeCallDesc d;
+  if (variadic) *variadic = abi ? (int)abi->variadic : 0;
+  if (nparams) *nparams = abi ? abi->nparams : 0u;
+  memset(&d, 0, sizeof d);
+  d.fn_type = fn_type;
+  d.nargs = abi ? abi->nparams : 0u;
+  if (d.nargs) d.args = arena_zarray(t->c->tu, NativeLoc, d.nargs);
+  return rv_call_stack_size(t, &d);
+}
+
+static u32 rv_call_stack_bytes(NativeTarget* t, const NativeCallDesc* desc) {
+  return rv_call_stack_size(t, desc);
+}
+
+/* Resolve a NativeLoc to an addressable NativeAddr (frame/stack/addr). */
+static NativeAddr rv_loc_addr(RvNativeTarget* a, NativeLoc loc, u32 offset) {
+  NativeAddr addr;
+  memset(&addr, 0, sizeof addr);
+  switch ((NativeLocKind)loc.kind) {
+    case NATIVE_LOC_FRAME:
+      addr.base_kind = NATIVE_ADDR_BASE_FRAME;
+      addr.base.frame = loc.v.frame;
+      addr.base_type = loc.type;
+      addr.offset = (i32)offset;
+      return addr;
+    case NATIVE_LOC_STACK:
+      addr.base_kind = NATIVE_ADDR_BASE_FRAME;
+      addr.base.frame = loc.v.stack.slot;
+      addr.base_type = loc.type;
+      addr.offset = loc.v.stack.offset + (i32)offset;
+      return addr;
+    case NATIVE_LOC_ADDR:
+      addr = loc.v.addr;
+      addr.offset += (i32)offset;
+      return addr;
+    default:
+      rv_panic(a, "location is not addressable");
+  }
+  return addr;
+}
+
+static void rv_load_part(NativeTarget* t, NativeLoc dst, NativeLoc src,
+                         u32 offset, u32 size) {
+  RvNativeTarget* a = rv_of(t);
+  if (src.kind == NATIVE_LOC_REG) {
+    rv_move(t, dst, src);
+    return;
+  }
+  if (src.kind == NATIVE_LOC_FRAME || src.kind == NATIVE_LOC_STACK ||
+      src.kind == NATIVE_LOC_ADDR) {
+    NativeAddr addr = rv_loc_addr(a, src, offset);
+    addr.base_type = dst.type;
+    rv_emit_mem(a, 1, dst, addr, rv_mem_for_type(t, dst.type, size));
+    return;
+  }
+  if (src.kind == NATIVE_LOC_IMM) {
+    rv_emit_load_imm(t->mc, rv_is_64(t, dst.type) ? 1u : 0u, loc_reg(dst),
+                     src.v.imm);
+    return;
+  }
+  rv_panic(a, "unsupported part source");
+}
+
+static void rv_store_part(NativeTarget* t, NativeLoc dst, NativeLoc src,
+                          u32 offset, u32 size) {
+  RvNativeTarget* a = rv_of(t);
+  if (dst.kind == NATIVE_LOC_FRAME || dst.kind == NATIVE_LOC_STACK ||
+      dst.kind == NATIVE_LOC_ADDR) {
+    NativeAddr addr = rv_loc_addr(a, dst, offset);
+    addr.base_type = src.type;
+    rv_emit_mem(a, 0, src, addr, rv_mem_for_type(t, src.type, size));
+    return;
+  }
+  if (dst.kind == NATIVE_LOC_REG) {
+    rv_move(t, dst, src);
+    return;
+  }
+  rv_panic(a, "unsupported part destination");
+}
+
+static void rv_addr_of_loc(NativeTarget* t, NativeLoc dst, NativeLoc src) {
+  NativeAddr addr = rv_loc_addr(rv_of(t), src, 0);
+  rv_load_addr(t, dst, addr);
+}
+
+static void rv_store_outgoing_part(NativeTarget* t, u32 stack_off, NativeLoc src,
+                                   u32 size) {
+  NativeAddr addr;
+  memset(&addr, 0, sizeof addr);
+  addr.base_kind = NATIVE_ADDR_BASE_REG;
+  addr.base.reg = RV_SP;
+  addr.base_type = src.type;
+  addr.offset = (i32)stack_off;
+  rv_emit_mem(rv_of(t), 0, src, addr, rv_mem_for_type(t, src.type, size));
+}
+
+/* NativeTarget bind_param: route incoming param (ABI loc) into dst. */
+static void rv_bind_native_param(NativeTarget* t, const CGParamDesc* p,
+                                 NativeLoc dst) {
+  RvNativeTarget* a = rv_of(t);
+  const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, a->func->fn_type);
+  const ABIArgInfo* ai = p->index < abi->nparams ? &abi->params[p->index] : NULL;
+  int to_reg = dst.kind == NATIVE_LOC_REG;
+  u32 i;
+  if (!ai || ai->kind == ABI_ARG_IGNORE) return;
+  if (ai->kind == ABI_ARG_INDIRECT) {
+    NativeLoc src = rv_reg_loc(builtin_id(CFREE_CG_BUILTIN_I64), NATIVE_REG_INT,
+                               a->next_param_int < 8u ? RV_A0 + a->next_param_int
+                                                      : RV_TMP0);
+    NativeAddr d_addr, from;
+    AggregateAccess access;
+    if (a->next_param_int < 8u) {
+      a->next_param_int++;
+    } else {
+      NativeAddr sa;
+      memset(&sa, 0, sizeof sa);
+      sa.base_kind = NATIVE_ADDR_BASE_REG;
+      sa.base.reg = RV_S0;
+      sa.offset = rv_s0_off_in_arg(a, a->next_param_stack);
+      sa.base_type = src.type;
+      rv_emit_mem(a, 1, src, sa, rv_mem_for_type(t, src.type, 8));
+      a->next_param_stack += 8u;
+    }
+    if (dst.kind != NATIVE_LOC_FRAME)
+      rv_panic(a, "indirect parameter requires a frame destination");
+    memset(&d_addr, 0, sizeof d_addr);
+    d_addr.base_kind = NATIVE_ADDR_BASE_FRAME;
+    d_addr.base.frame = dst.v.frame;
+    d_addr.base_type = p->type;
+    memset(&from, 0, sizeof from);
+    from.base_kind = NATIVE_ADDR_BASE_REG;
+    from.base.reg = loc_reg(src);
+    from.base_type = p->type;
+    memset(&access, 0, sizeof access);
+    access.type = p->type;
+    access.size = p->size ? p->size : (u32)cg_type_size(t->c, p->type);
+    access.align = p->align ? p->align : rv_type_align(t, p->type);
+    rv_copy_bytes(t, d_addr, from, access);
+    return;
+  }
+  for (i = 0; i < ai->nparts; ++i) {
+    const ABIArgPart* part = &ai->parts[i];
+    NativeAllocClass cls =
+        part->cls == ABI_CLASS_FP ? NATIVE_REG_FP : NATIVE_REG_INT;
+    NativeLoc src;
+    if (cls == NATIVE_REG_FP && a->next_param_fp < 8u) {
+      src = rv_reg_loc(p->type, cls, RV_FA0 + a->next_param_fp++);
+    } else if (cls == NATIVE_REG_INT && a->next_param_int < 8u) {
+      src = rv_reg_loc(p->type, cls, RV_A0 + a->next_param_int++);
+    } else {
+      Reg tmp = (cls == NATIVE_REG_FP) ? RV_FTMP0 : RV_TMP0;
+      NativeAddr sa;
+      src = rv_reg_loc(p->type, cls, tmp);
+      a->next_param_stack =
+          align_up_u32(a->next_param_stack, rv_part_stack_align(part));
+      memset(&sa, 0, sizeof sa);
+      sa.base_kind = NATIVE_ADDR_BASE_REG;
+      sa.base.reg = RV_S0;
+      sa.base_type = p->type;
+      sa.offset = rv_s0_off_in_arg(a, a->next_param_stack);
+      rv_emit_mem(a, 1, src, sa, rv_mem_for_type(t, p->type, part->size));
+      a->next_param_stack += rv_part_stack_size(part);
+    }
+    if (dst.kind == NATIVE_LOC_NONE) {
+      /* unused parameter; cursors already advanced */
+    } else if (to_reg) {
+      NativeLoc d = rv_reg_loc(dst.type ? dst.type : p->type,
+                               (NativeAllocClass)dst.cls, (Reg)dst.v.reg);
+      if (!(src.kind == NATIVE_LOC_REG && loc_reg(src) == loc_reg(d) &&
+            (NativeAllocClass)src.cls == (NativeAllocClass)d.cls))
+        rv_move(t, d, src);
+    } else {
+      rv_store_part(t, rv_stack_loc(p->type, dst.v.frame, (i32)part->src_offset),
+                    src, 0, part->size);
+    }
+  }
+  a->incoming_stack_size = align_up_u32(a->next_param_stack, 16u);
+}
+
+/* ============================ calls / returns ============================ */
+
+typedef struct {
+  NativeLoc dst;
+  NativeLoc src;
+  u32 src_offset;
+  u32 size;
+  int is_addr;
+} RvArgMove;
+
+static Reg rv_arg_move_src_reg(const RvArgMove* m, NativeAllocClass* cls_out) {
+  if (!m->is_addr && m->src.kind == NATIVE_LOC_REG) {
+    *cls_out = (NativeAllocClass)m->src.cls;
+    return m->src.v.reg;
+  }
+  return REG_NONE;
+}
+
+static void rv_emit_one_arg_move(NativeTarget* t, const RvArgMove* m) {
+  if (m->is_addr)
+    rv_addr_of_loc(t, m->dst, m->src);
+  else
+    rv_load_part(t, m->dst, m->src, m->src_offset, m->size);
+}
+
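+/* Parallel-copy register arg moves with cycle breaking: emit any move whose
+ * destination no pending move still reads; if all pending moves are blocked
+ * (a cycle), save one blocked destination in RV_TMP1/RV_FTMP1 and redirect
+ * its readers to that scratch. */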
+static void rv_emit_reg_arg_moves(NativeTarget* t, RvArgMove* moves, u32 n) {
+  u8 done[RV_MAX_REG_ARG_MOVES];
+  u32 emitted = 0;
+  if (n > RV_MAX_REG_ARG_MOVES) rv_panic(rv_of(t), "too many register args");
+  memset(done, 0, sizeof done);
+  while (emitted < n) {
+    int progress = 0;
+    u32 i, j;
+    for (i = 0; i < n; ++i) {
+      int blocked = 0;
+      if (done[i]) continue;
+      for (j = 0; j < n && !blocked; ++j) {
+        NativeAllocClass sc;
+        Reg sr;
+        if (done[j] || j == i) continue;
+        sr = rv_arg_move_src_reg(&moves[j], &sc);
+        if (sr != REG_NONE && sr == moves[i].dst.v.reg &&
+            sc == (NativeAllocClass)moves[i].dst.cls)
+          blocked = 1;
+      }
+      if (!blocked) {
+        rv_emit_one_arg_move(t, &moves[i]);
+        done[i] = 1;
+        emitted++;
+        progress = 1;
+      }
+    }
+    if (!progress) {
+      u32 k = 0;
+      NativeAllocClass bc, sc;
+      Reg scratch_reg;
+      NativeLoc scratchloc;
+      while (k < n &&
+             (done[k] || rv_arg_move_src_reg(&moves[k], &sc) == REG_NONE))
+        ++k;
+      bc = (NativeAllocClass)moves[k].dst.cls;
+      scratch_reg = bc == NATIVE_REG_FP ? RV_FTMP1 : RV_TMP1;
+      scratchloc = rv_reg_loc(moves[k].dst.type, bc, scratch_reg);
+      rv_move(t, scratchloc, moves[k].dst);
+      {
+        u32 jj;
+        for (jj = 0; jj < n; ++jj) {
+          Reg sr = rv_arg_move_src_reg(&moves[jj], &sc);
+          if (!done[jj] && sr != REG_NONE && sr == moves[k].dst.v.reg &&
+              sc == bc) {
+            moves[jj].src = scratchloc;
+            moves[jj].src_offset = 0;
+          }
+        }
+      }
+    }
+  }
+}
+
+static void rv_plan_call(NativeTarget* t, const NativeCallDesc* desc,
+                         NativeCallPlan* plan) {
+  RvNativeTarget* a = rv_of(t);
+  const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, desc->fn_type);
+  NativeCallPlanRet* rets;
+  CfreeCgTypeId i64t = builtin_id(CFREE_CG_BUILTIN_I64);
+  memset(plan, 0, sizeof *plan);
+  rets = desc->nresults ? arena_zarray(t->c->tu, NativeCallPlanRet, 4) : NULL;
+  plan->callee = desc->callee;
+  plan->rets = rets;
+  plan->flags = desc->flags;
+  plan->has_sret = abi && abi->has_sret;
+  plan->is_variadic = abi && abi->variadic;
+  plan->stack_arg_size = rv_call_stack_size(t, desc);
+  if (plan->stack_arg_size > a->max_outgoing)
+    a->max_outgoing = plan->stack_arg_size;
+  /* Indirect callee in an arg register would be clobbered by arg loads. */
+  if (plan->callee.kind == NATIVE_LOC_REG &&
+      (NativeAllocClass)plan->callee.cls == NATIVE_REG_INT &&
+      plan->callee.v.reg >= RV_A0 && plan->callee.v.reg <= RV_A7) {
+    NativeLoc scratch = rv_reg_loc(plan->callee.type, NATIVE_REG_INT, RV_TMP0);
+    rv_move(t, scratch, plan->callee);
+    plan->callee = scratch;
+  }
+  {
+    /* sret returns pass the hidden destination pointer as the implicit first
+     * integer argument (a0), so the real args start at a1. */
+    u32 next_int = (abi && abi->has_sret) ? 1u : 0u;
+    u32 next_fp = 0, stack = 0, nmoves = 0, i, p;
+    RvArgMove moves[RV_MAX_REG_ARG_MOVES];
+    for (i = 0; i < desc->nargs; ++i) {
+      ABIArgInfo tmp;
+      const ABIArgInfo* ai = rv_param_abi(t, abi, desc, i, &tmp);
+      int force_stack =
+          abi && abi->variadic && abi->vararg_on_stack && i >= abi->nparams;
+      if (ai->kind == ABI_ARG_IGNORE) continue;
+      if (force_stack) {
+        NativeLoc tmpreg = rv_reg_loc(desc->args[i].type, NATIVE_REG_INT, RV_TMP0);
+        u32 n = rv_class_stack_size(ai), off = 0;
+        while (off < n) {
+          rv_load_part(t, tmpreg, desc->args[i], off, 8);
+          rv_store_outgoing_part(t, stack + off, tmpreg, 8);
+          off += 8;
+        }
+        stack += n;
+        continue;
+      }
+      if (ai->kind == ABI_ARG_INDIRECT) {
+        if (next_int < 8u) {
+          RvArgMove* m = &moves[nmoves++];
+          m->dst = rv_reg_loc(i64t, NATIVE_REG_INT, RV_A0 + next_int++);
+          m->src = desc->args[i];
+          m->src_offset = 0;
+          m->size = 8;
+          m->is_addr = 1;
+        } else {
+          NativeLoc ptr = rv_reg_loc(i64t, NATIVE_REG_INT, RV_TMP0);
+          rv_addr_of_loc(t, ptr, desc->args[i]);
+          rv_store_outgoing_part(t, stack, ptr, 8);
+          stack += 8u;
+        }
+        continue;
+      }
+      for (p = 0; p < ai->nparts; ++p) {
+        const ABIArgPart* part = &ai->parts[p];
+        NativeAllocClass cls =
+            part->cls == ABI_CLASS_FP ? NATIVE_REG_FP : NATIVE_REG_INT;
+        if ((cls == NATIVE_REG_FP && next_fp < 8u) ||
+            (cls == NATIVE_REG_INT && next_int < 8u)) {
+          RvArgMove* m = &moves[nmoves++];
+          Reg areg = cls == NATIVE_REG_FP ? RV_FA0 + next_fp++
+                                          : RV_A0 + next_int++;
+          m->dst = rv_reg_loc(desc->args[i].type, cls, areg);
+          m->src = desc->args[i];
+          m->src_offset = part->src_offset;
+          m->size = part->size;
+          m->is_addr = 0;
+        } else {
+          Reg tmp = cls == NATIVE_REG_FP ? RV_FTMP0 : RV_TMP0;
+          NativeLoc tmpreg = rv_reg_loc(desc->args[i].type, cls, tmp);
+          rv_load_part(t, tmpreg, desc->args[i], part->src_offset, part->size);
+          stack = align_up_u32(stack, rv_part_stack_align(part));
+          rv_store_outgoing_part(t, stack, tmpreg, part->size);
+          stack += rv_part_stack_size(part);
+        }
+      }
+    }
+    rv_emit_reg_arg_moves(t, moves, nmoves);
+    if (abi && abi->has_sret && desc->nresults) {
+      /* sret pointer goes in a0; arg loads have completed. */
+      NativeLoc a0 = rv_reg_loc(i64t, NATIVE_REG_INT, RV_A0);
+      rv_addr_of_loc(t, a0, desc->results[0]);
+    }
+  }
+  if (abi && abi->ret.kind == ABI_ARG_DIRECT && desc->nresults) {
+    u32 nr = 0, ni = 0, nf = 0, p;
+    for (p = 0; p < abi->ret.nparts; ++p) {
+      const ABIArgPart* part = &abi->ret.parts[p];
+      NativeAllocClass cls =
+          part->cls == ABI_CLASS_FP ? NATIVE_REG_FP : NATIVE_REG_INT;
+      CfreeCgTypeId pty = rv_part_scalar_type(part);
+      Reg rreg = cls == NATIVE_REG_FP ? RV_FA0 + nf++ : RV_A0 + ni++;
+      rets[nr].src = rv_reg_loc(pty, cls, rreg);
+      rets[nr].dst = desc->results[0];
+      if (rets[nr].dst.kind == NATIVE_LOC_FRAME)
+        rets[nr].dst =
+            rv_stack_loc(pty, desc->results[0].v.frame, (i32)part->src_offset);
+      else if (rets[nr].dst.kind == NATIVE_LOC_STACK) {
+        rets[nr].dst.v.stack.offset += (i32)part->src_offset;
+        rets[nr].dst.type = pty;
+      }
+      rets[nr].mem = rv_mem_for_type(t, pty, part->size);
+      nr++;
+    }
+    plan->nrets = nr;
+  } else if (abi && abi->ret.kind == ABI_ARG_IGNORE) {
+    plan->nrets = 0;
+  } else if (!abi && desc->nresults) {
+    rets[0].src = rv_reg_loc(desc->results[0].type, NATIVE_REG_INT, RV_A0);
+    rets[0].dst = desc->results[0];
+    rets[0].mem = rv_mem_for_type(t, desc->results[0].type, 0);
+    plan->nrets = 1;
+  }
+}
+
+static void rv_emit_call(NativeTarget* t, const NativeCallPlan* plan) {
+  MCEmitter* mc = t->mc;
+  ObjSecId sec = mc->section_id;
+  if (plan->flags & CG_CALL_TAIL) rv_panic(rv_of(t), "tail call not implemented");
+  if (plan->callee.kind == NATIVE_LOC_GLOBAL) {
+    u32 pos = mc->pos(mc);
+    rv64_emit32(mc, rv_auipc(RV_RA, 0));
+    rv64_emit32(mc, rv_jalr(RV_RA, RV_RA, 0));
+    mc->emit_reloc_at(mc, sec, pos, R_RV_CALL, plan->callee.v.global.sym,
+                      plan->callee.v.global.addend, 0, 0);
+    return;
+  }
+  if (plan->callee.kind == NATIVE_LOC_REG) {
+    rv64_emit32(mc, rv_jalr(RV_RA, loc_reg(plan->callee), 0));
+    return;
+  }
+  rv_panic(rv_of(t), "unsupported call target");
+}
+
+static void rv_plan_ret(NativeTarget* t, const CGFuncDesc* fd,
+                        const NativeLoc* values, u32 nvalues,
+                        NativeCallPlanRet** out_rets, u32* out_nrets) {
+  RvNativeTarget* a = rv_of(t);
+  const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, fd->fn_type);
+  NativeCallPlanRet* rets = NULL;
+  u32 nr = 0;
+  if (nvalues > 1u) rv_panic(a, "multiple returns unsupported");
+  if (nvalues) rets = arena_zarray(t->c->tu, NativeCallPlanRet, 4);
+  if (nvalues && abi && abi->ret.kind == ABI_ARG_INDIRECT) {
+    CfreeCgTypeId i64t = builtin_id(CFREE_CG_BUILTIN_I64);
+    NativeLoc dstp = rv_reg_loc(i64t, NATIVE_REG_INT, RV_TMP1);
+    NativeLoc saved = rv_stack_loc(i64t, a->sret_ptr_slot, 0);
+    NativeAddr dst_addr, src_addr;
+    AggregateAccess access;
+    rv_load_part(t, dstp, saved, 0, 8);
+    memset(&dst_addr, 0, sizeof dst_addr);
+    dst_addr.base_kind = NATIVE_ADDR_BASE_REG;
+    dst_addr.base.reg = RV_TMP1;
+    dst_addr.base_type = values[0].type;
+    src_addr = rv_loc_addr(a, values[0], 0);
+    src_addr.base_type = values[0].type;
+    memset(&access, 0, sizeof access);
+    access.type = values[0].type;
+    access.size = (u32)cg_type_size(t->c, values[0].type);
+    access.align = rv_type_align(t, values[0].type);
+    rv_copy_bytes(t, dst_addr, src_addr, access);
+    *out_rets = NULL;
+    *out_nrets = 0;
+    return;
+  }
+  if (nvalues && abi && abi->ret.kind == ABI_ARG_DIRECT) {
+    u32 ni = 0, nf = 0, p;
+    for (p = 0; p < abi->ret.nparts; ++p) {
+      const ABIArgPart* part = &abi->ret.parts[p];
+      NativeAllocClass cls =
+          part->cls == ABI_CLASS_FP ? NATIVE_REG_FP : NATIVE_REG_INT;
+      CfreeCgTypeId pty = rv_part_scalar_type(part);
+      Reg rreg = cls == NATIVE_REG_FP ? RV_FA0 + nf++ : RV_A0 + ni++;
+      rets[nr].src = values[0];
+      if (rets[nr].src.kind == NATIVE_LOC_FRAME)
+        rets[nr].src = rv_stack_loc(pty, values[0].v.frame, (i32)part->src_offset);
+      else if (rets[nr].src.kind == NATIVE_LOC_STACK) {
+        rets[nr].src.v.stack.offset += (i32)part->src_offset;
+        rets[nr].src.type = pty;
+      }
+      rets[nr].dst = rv_reg_loc(pty, cls, rreg);
+      rets[nr].mem = rv_mem_for_type(t, pty, part->size);
+      nr++;
+    }
+  } else if (nvalues) {
+    rets[0].src = values[0];
+    rets[0].dst = rv_reg_loc(values[0].type, NATIVE_REG_INT, RV_A0);
+    rets[0].mem = rv_mem_for_type(t, values[0].type, 0);
+    nr = 1;
+  }
+  *out_rets = rets;
+  *out_nrets = nr;
+}
+
+static void rv_ret(NativeTarget* t) {
+  RvNativeTarget* a = rv_of(t);
+  rv_jump(t, a->epilogue_label);
+}
+
+/* ============================ alloca ============================ */
+
+static void rv_alloca(NativeTarget* t, NativeLoc dst, NativeLoc size,
+                      u32 align) {
+  RvNativeTarget* a = rv_of(t);
+  MCEmitter* mc = t->mc;
+  u32 rsz = loc_reg(size);
+  u32 rd = loc_reg(dst);
+  u32 al = align ? align : 16u;
+  if (al < 16u) al = 16u;
+  /* round up: t0 = (size + (al-1)) & ~(al-1) */
+  rv64_emit32(mc, rv_addi(RV_TMP0, rsz, (i32)(al - 1u)));
+  rv_emit_load_imm(mc, 1, RV_TMP1, -(i64)al);
+  rv64_emit32(mc, rv_and(RV_TMP0, RV_TMP0, RV_TMP1));
+  rv64_emit32(mc, rv_sub(RV_SP, RV_SP, RV_TMP0));
+  a->has_alloca = 1;
+  /* dst = sp + max_outgoing (patched in func_end) */
+  if (a->npatches == a->patches_cap) {
+    u32 cap = a->patches_cap ? a->patches_cap * 2u : 8u;
+    RvPatch* nb = arena_zarray(t->c->tu, RvPatch, cap);
+    if (a->patches) memcpy(nb, a->patches, sizeof(*nb) * a->npatches);
+    a->patches = nb;
+    a->patches_cap = cap;
+  }
+  a->patches[a->npatches].kind = RV_PATCH_ALLOCA;
+  a->patches[a->npatches].pos = mc->pos(mc);
+  a->patches[a->npatches].dst_reg = rd;
+  a->npatches++;
+  a->nalloca++;
+  rv64_emit32(mc, RV_NOP); /* placeholder for addi dst, sp, max_outgoing */
+}
+
+/* ============================ TLS / bitfield / atomics ============================ */
+
+/* Define a fresh local .LpcrelHi anchor pointing at `ap`, for the
+ * R_RV_PCREL_LO12_I follow-on that pairs with an AUIPC-relative HI20 reloc. */
+static ObjSymId rv_pcrel_anchor(RvNativeTarget* a, ObjSecId sec, u32 ap) {
+  NativeTarget* t = &a->base;
+  Sym an = pool_intern_slice(t->c->global, SLICE_LIT(".LpcrelHi"));
+  return obj_symbol(t->obj, an, SB_LOCAL, SK_OBJ, sec, (u64)ap, 0);
+}
+
+static void rv_tls_addr_of(NativeTarget* t, NativeLoc dst, ObjSymId sym,
+                           i64 addend) {
+  RvNativeTarget* a = rv_of(t);
+  MCEmitter* mc = t->mc;
+  u32 sec = mc->section_id;
+  u32 rd = loc_reg(dst);
+  if (obj_symbol_extern_via_got(t->c, t->obj, sym)) {
+    /* Initial-Exec: auipc t0, %tls_ie_pcrel_hi(sym)
+     *               ld   t0, %pcrel_lo(.LpcrelHi)(t0)
+     *               add  dst, tp, t0
+     * GOT relocs disallow an addend, so apply it after the GOT load. */
+    u32 ap = mc->pos(mc);
+    rv64_emit32(mc, rv_auipc(RV_TMP0, 0));
+    mc->emit_reloc_at(mc, sec, ap, R_RV_TLS_GOT_HI20, sym, 0, 0, 0);
+    {
+      ObjSymId anchor = rv_pcrel_anchor(a, sec, ap);
+      u32 lp = mc->pos(mc);
+      rv64_emit32(mc, rv_ld(RV_TMP0, RV_TMP0, 0));
+      mc->emit_reloc_at(mc, sec, lp, R_RV_PCREL_LO12_I, anchor, 0, 0, 0);
+    }
+    rv64_emit32(mc, rv_add(rd, RV_TP, RV_TMP0));
+    if (addend) rv_emit_addr_adjust(mc, rd, rd, (i32)addend);
+    return;
+  }
+  /* Local-Exec: lui t0, %tprel_hi(sym); add t0, tp, t0; addi dst, t0,
+   * %tprel_lo(sym). */
+  {
+    u32 hp = mc->pos(mc);
+    rv64_emit32(mc, rv_lui(RV_TMP0, 0));
+    mc->emit_reloc_at(mc, sec, hp, R_RV_TPREL_HI20, sym, addend, 0, 0);
+    rv64_emit32(mc, rv_add(RV_TMP0, RV_TP, RV_TMP0));
+    {
+      u32 lp = mc->pos(mc);
+      rv64_emit32(mc, rv_addi(rd, RV_TMP0, 0));
+      mc->emit_reloc_at(mc, sec, lp, R_RV_TPREL_LO12_I, sym, addend, 0, 0);
+    }
+  }
+}
+static void rv_bitfield_load(NativeTarget* t, NativeLoc dst, NativeAddr ra,
+                             BitFieldAccess bf) {
+  RvNativeTarget* a = rv_of(t);
+  MCEmitter* mc = t->mc;
+  u32 storage_bytes = bf.storage.size ? bf.storage.size : 4u;
+  u32 rd = loc_reg(dst);
+  u32 base;
+  i32 off;
+  u32 lsb = bf.bit_offset;
+  u32 width = bf.bit_width ? bf.bit_width : 1u;
+  /* Shift left so the field's MSB lands at bit 63, then shift right to
+   * sign/zero extend it down. Use 64-bit shifts throughout. */
+  u32 sh_left = 64u - (lsb + width);
+  u32 sh_right = 64u - width;
+  ra.offset += (i32)bf.storage_offset;
+  rv_resolve_mem_addr(a, &ra, &base, &off);
+  rv64_emit32(mc, enc_int_load(storage_bytes, 0, rd, base, off));
+  rv64_emit32(mc, rv_slli(rd, rd, sh_left));
+  if (bf.signed_)
+    rv64_emit32(mc, rv_srai(rd, rd, sh_right));
+  else
+    rv64_emit32(mc, rv_srli(rd, rd, sh_right));
+}
+static void rv_bitfield_store(NativeTarget* t, NativeAddr ra, NativeLoc src,
+                              BitFieldAccess bf) {
+  RvNativeTarget* a = rv_of(t);
+  MCEmitter* mc = t->mc;
+  u32 storage_bytes = bf.storage.size ? bf.storage.size : 4u;
+  u32 src_reg = loc_reg(src);
+  u32 base;
+  i32 off;
+  u32 lsb = bf.bit_offset;
+  u32 width = bf.bit_width ? bf.bit_width : 1u;
+  u64 ones = width >= 64u ? ~(u64)0 : (((u64)1 << width) - 1u);
+  u64 mask_in = ones << lsb;
+  ra.offset += (i32)bf.storage_offset;
+  /* Resolve the field address; rv_resolve_mem_addr may use RV_TMP0/RV_TMP1, so
+   * stabilize the base into RV_TMP1 before consuming the scratch temps. */
+  rv_resolve_mem_addr(a, &ra, &base, &off);
+  if (base != RV_S0 && base != RV_TMP1) {
+    rv_emit_addr_adjust(mc, RV_TMP1, base, off);
+    base = RV_TMP1;
+    off = 0;
+  } else if (base == RV_TMP1 && off != 0) {
+    rv_emit_addr_adjust(mc, RV_TMP1, RV_TMP1, off);
+    off = 0;
+  }
+  /* word in RV_TMP2; merged via RV_TMP0 (clear mask, then shifted src). */
+  rv64_emit32(mc, enc_int_load(storage_bytes, 0, RV_TMP2, base, off));
+  rv_emit_load_imm(mc, 1, RV_TMP0, (i64)~mask_in);
+  rv64_emit32(mc, rv_and(RV_TMP2, RV_TMP2, RV_TMP0));
+  rv_emit_load_imm(mc, 1, RV_TMP0, (i64)ones);
+  rv64_emit32(mc, rv_and(RV_TMP0, src_reg, RV_TMP0));
+  if (lsb) rv64_emit32(mc, rv_slli(RV_TMP0, RV_TMP0, lsb));
+  rv64_emit32(mc, rv_or(RV_TMP2, RV_TMP2, RV_TMP0));
+  rv64_emit32(mc, enc_int_store(storage_bytes, RV_TMP2, base, off));
+}
+static int rv_order_acquire(MemOrder o) {
+  return o == MO_CONSUME || o == MO_ACQUIRE || o == MO_ACQ_REL ||
+         o == MO_SEQ_CST;
+}
+static int rv_order_release(MemOrder o) {
+  return o == MO_RELEASE || o == MO_ACQ_REL || o == MO_SEQ_CST;
+}
+
+/* Materialize the atomic operand address into RV_TMP0 (a bare pointer, since
+ * LR/SC and AMO take a base register with no offset) and return it. */
+static u32 rv_atomic_addr_reg(RvNativeTarget* a, NativeAddr addr) {
+  NativeLoc dst =
+      rv_reg_loc(builtin_id(CFREE_CG_BUILTIN_I64), NATIVE_REG_INT, RV_TMP0);
+  rv_load_addr(&a->base, dst, addr);
+  return RV_TMP0;
+}
+
+static void rv_atomic_load(NativeTarget* t, NativeLoc dst, NativeAddr addr,
+                           MemAccess mem, MemOrder mo) {
+  RvNativeTarget* a = rv_of(t);
+  MCEmitter* mc = t->mc;
+  u32 sf = (mem.size ? mem.size : rv_type_size(t, dst.type)) == 8u ? 1u : 0u;
+  u32 base = rv_atomic_addr_reg(a, addr);
+  if (mo == MO_SEQ_CST) rv64_emit32(mc, rv_fence_rw_rw());
+  if (rv_order_acquire(mo)) {
+    /* lr.w/d as an ordered load (aq=1). */
+    rv64_emit32(mc, sf ? rv_lr_d(loc_reg(dst), base, 1, 0)
+                       : rv_lr_w(loc_reg(dst), base, 1, 0));
+  } else {
+    rv64_emit32(mc, enc_int_load(mem.size ? mem.size : rv_type_size(t, dst.type),
+                                 0, loc_reg(dst), base, 0));
+  }
+}
+
+static void rv_atomic_store(NativeTarget* t, NativeAddr addr, NativeLoc src,
+                            MemAccess mem, MemOrder mo) {
+  RvNativeTarget* a = rv_of(t);
+  MCEmitter* mc = t->mc;
+  u32 sz = mem.size ? mem.size : rv_type_size(t, src.type);
+  /* RV_TMP0 holds the address; never collides with src (an allocable reg). */
+  u32 base = rv_atomic_addr_reg(a, addr);
+  if (rv_order_release(mo)) rv64_emit32(mc, rv_fence_rw_rw());
+  rv64_emit32(mc, enc_int_store(sz, loc_reg(src), base, 0));
+  if (mo == MO_SEQ_CST) rv64_emit32(mc, rv_fence_rw_rw());
+}
+
+static void rv_atomic_rmw(NativeTarget* t, AtomicOp op, NativeLoc dst,
+                          NativeAddr addr, NativeLoc val, MemAccess mem,
+                          MemOrder mo) {
+  RvNativeTarget* a = rv_of(t);
+  MCEmitter* mc = t->mc;
+  u32 sf = (mem.size ? mem.size : rv_type_size(t, dst.type)) == 8u ? 1u : 0u;
+  u32 base = rv_atomic_addr_reg(a, addr); /* RV_TMP0 */
+  u32 vreg = loc_reg(val);
+  u32 rd = loc_reg(dst);
+  u32 aq = (u32)rv_order_acquire(mo);
+  u32 rl = (u32)rv_order_release(mo);
+  MCLabel retry = mc->label_new(mc);
+  /* LR/SC loop: dst = *base; new = dst op val; sc new; retry on failure.
+   * RV_TMP1 carries the SC status, RV_TMP3 the computed new value. */
+  mc->label_place(mc, retry);
+  rv64_emit32(mc, sf ? rv_lr_d(rd, base, aq, 0) : rv_lr_w(rd, base, aq, 0));
+  switch (op) {
+    case AO_XCHG:
+      rv64_emit32(mc, rv_addi(RV_TMP3, vreg, 0));
+      break;
+    case AO_ADD:
+      rv64_emit32(mc, sf ? rv_add(RV_TMP3, rd, vreg) : rv_addw(RV_TMP3, rd, vreg));
+      break;
+    case AO_SUB:
+      rv64_emit32(mc, sf ? rv_sub(RV_TMP3, rd, vreg) : rv_subw(RV_TMP3, rd, vreg));
+      break;
+    case AO_AND:
+      rv64_emit32(mc, rv_and(RV_TMP3, rd, vreg));
+      break;
+    case AO_OR:
+      rv64_emit32(mc, rv_or(RV_TMP3, rd, vreg));
+      break;
+    case AO_XOR:
+      rv64_emit32(mc, rv_xor(RV_TMP3, rd, vreg));
+      break;
+    case AO_NAND:
+      rv64_emit32(mc, rv_and(RV_TMP3, rd, vreg));
+      rv64_emit32(mc, rv_xori(RV_TMP3, RV_TMP3, -1));
+      break;
+    default:
+      rv_panic(a, "unsupported atomic rmw op");
+  }
+  rv64_emit32(mc, sf ? rv_sc_d(RV_TMP1, base, RV_TMP3, 0, rl)
+                     : rv_sc_w(RV_TMP1, base, RV_TMP3, 0, rl));
+  rv64_emit32(mc, rv_bne(RV_TMP1, RV_ZERO, 0));
+  mc->emit_label_ref(mc, retry, R_RV_BRANCH, 4, 0);
+}
+
+static void rv_atomic_cas(NativeTarget* t, NativeLoc prior, NativeLoc ok,
+                          NativeAddr addr, NativeLoc expected, NativeLoc desired,
+                          MemAccess mem, MemOrder success, MemOrder failure) {
+  RvNativeTarget* a = rv_of(t);
+  MCEmitter* mc = t->mc;
+  u32 sf = (mem.size ? mem.size : rv_type_size(t, prior.type)) == 8u ? 1u : 0u;
+  u32 base = rv_atomic_addr_reg(a, addr); /* RV_TMP0 */
+  u32 rprior = loc_reg(prior);
+  u32 rexp = loc_reg(expected);
+  u32 rdes = loc_reg(desired);
+  u32 rok = loc_reg(ok);
+  u32 aq = (u32)rv_order_acquire(success);
+  u32 rl = (u32)rv_order_release(success);
+  MCLabel retry = mc->label_new(mc);
+  MCLabel fail = mc->label_new(mc);
+  MCLabel done = mc->label_new(mc);
+  (void)failure;
+  mc->label_place(mc, retry);
+  rv64_emit32(mc, sf ? rv_lr_d(rprior, base, aq, 0) : rv_lr_w(rprior, base, aq, 0));
+  /* if (prior != expected) -> fail */
+  rv64_emit32(mc, rv_bne(rprior, rexp, 0));
+  mc->emit_label_ref(mc, fail, R_RV_BRANCH, 4, 0);
+  /* sc.w/d status, desired, (base); retry on failure. */
+  rv64_emit32(mc, sf ? rv_sc_d(RV_TMP1, base, rdes, 0, rl)
+                     : rv_sc_w(RV_TMP1, base, rdes, 0, rl));
+  rv64_emit32(mc, rv_bne(RV_TMP1, RV_ZERO, 0));
+  mc->emit_label_ref(mc, retry, R_RV_BRANCH, 4, 0);
+  /* ok = 1; jump done. */
+  rv_emit_load_imm(mc, 0, rok, 1);
+  rv64_emit32(mc, rv_jal(RV_ZERO, 0));
+  mc->emit_label_ref(mc, done, R_RV_JAL, 4, 0);
+  mc->label_place(mc, fail);
+  rv_emit_load_imm(mc, 0, rok, 0);
+  mc->label_place(mc, done);
+}
+
+static void rv_fence(NativeTarget* t, MemOrder mo) {
+  if (mo == MO_RELAXED) return;
+  rv64_emit32(t->mc, rv_fence_rw_rw());
+}
+/* ---- variadics (LP64D ABI_VA_LIST_POINTER) ----
+ * va_list is a single void* to the next argument slot. The prologue spilled
+ * unconsumed a-regs into the 64-byte save area [s0+16, s0+80); incoming stack
+ * args follow contiguously, so a uniform 8-byte stride covers both. `ap` is a
+ * NativeAddr that addresses the va_list object itself. */
+
+static void rv_va_start_core(RvNativeTarget* a, NativeAddr ap) {
+  NativeTarget* t = &a->base;
+  MCEmitter* mc = t->mc;
+  ABIVaListInfo vai = abi_va_list_layout(t->c->abi);
+  CfreeCgTypeId i64t = builtin_id(CFREE_CG_BUILTIN_I64);
+  if (vai.kind != ABI_VA_LIST_POINTER) rv_panic(a, "unsupported va_list layout");
+  if (!a->is_variadic) rv_panic(a, "va_start: function not variadic");
+  /* *ap = s0 + 16 + next_param_int*8 (skip past named-int save slots). */
+  rv64_emit32(mc, rv_addi(RV_TMP1, RV_S0, 16 + (i32)(a->next_param_int * 8u)));
+  rv_emit_mem(a, 0, rv_reg_loc(i64t, NATIVE_REG_INT, RV_TMP1), ap,
+              rv_mem_for_type(t, i64t, 8));
+}
+
+static void rv_va_arg_core(RvNativeTarget* a, NativeLoc dst, NativeAddr ap,
+                           CfreeCgTypeId type) {
+  NativeTarget* t = &a->base;
+  MCEmitter* mc = t->mc;
+  ABIVaListInfo vai = abi_va_list_layout(t->c->abi);
+  CfreeCgTypeId i64t = builtin_id(CFREE_CG_BUILTIN_I64);
+  u32 sz = rv_type_size(t, type);
+  NativeLoc cur = rv_reg_loc(i64t, NATIVE_REG_INT, RV_TMP1);
+  NativeAddr from;
+  if (vai.kind != ABI_VA_LIST_POINTER) rv_panic(a, "unsupported va_list layout");
+  if (dst.kind != NATIVE_LOC_REG) rv_panic(a, "va_arg destination must be reg");
+  /* cur = *ap; load value from [cur]; *ap = cur + 8 (each slot is 8 bytes). */
+  rv_emit_mem(a, 1, cur, ap, rv_mem_for_type(t, i64t, 8));
+  memset(&from, 0, sizeof from);
+  from.base_kind = NATIVE_ADDR_BASE_REG;
+  from.base.reg = RV_TMP1;
+  from.base_type = type;
+  if (loc_is_fp(dst)) {
+    /* Variadic FP args sit in the integer save area as their bit pattern;
+     * load into RV_TMP2 and bitcast into the FPR. */
+    NativeLoc itmp = rv_reg_loc(type, NATIVE_REG_INT, RV_TMP2);
+    rv_emit_mem(a, 1, itmp, from, rv_mem_for_type(t, type, sz));
+    rv64_emit32(mc, sz == 8u ? rv_fmv_d_x(loc_reg(dst), RV_TMP2)
+                             : rv_fmv_w_x(loc_reg(dst), RV_TMP2));
+  } else {
+    rv_emit_mem(a, 1, dst, from, rv_mem_for_type(t, type, sz));
+  }
+  rv64_emit32(mc, rv_addi(RV_TMP1, RV_TMP1, 8));
+  rv_emit_mem(a, 0, cur, ap, rv_mem_for_type(t, i64t, 8));
+}
+
+static void rv_va_copy_core(RvNativeTarget* a, NativeAddr dst_ap,
+                            NativeAddr src_ap) {
+  NativeTarget* t = &a->base;
+  CfreeCgTypeId i64t = builtin_id(CFREE_CG_BUILTIN_I64);
+  NativeLoc tmp = rv_reg_loc(i64t, NATIVE_REG_INT, RV_TMP1);
+  /* va_list is a single 8-byte pointer. */
+  rv_emit_mem(a, 1, tmp, src_ap, rv_mem_for_type(t, i64t, 8));
+  rv_emit_mem(a, 0, tmp, dst_ap, rv_mem_for_type(t, i64t, 8));
+}
+
+static NativeAddr rv_va_addr_from_ptr(NativeLoc ap_ptr) {
+  NativeAddr addr;
+  memset(&addr, 0, sizeof addr);
+  addr.base_kind = NATIVE_ADDR_BASE_REG;
+  addr.cls = NATIVE_REG_INT;
+  addr.base.reg = ap_ptr.v.reg;
+  addr.base_type = ap_ptr.type;
+  return addr;
+}
+
+static void rv_va_start_native(NativeTarget* t, NativeLoc ap_ptr) {
+  rv_va_start_core(rv_of(t), rv_va_addr_from_ptr(ap_ptr));
+}
+static void rv_va_arg_native(NativeTarget* t, NativeLoc dst, NativeLoc ap_ptr,
+                             CfreeCgTypeId type) {
+  rv_va_arg_core(rv_of(t), dst, rv_va_addr_from_ptr(ap_ptr), type);
+}
+static void rv_va_end_native(NativeTarget* t, NativeLoc ap_ptr) {
+  (void)t;
+  (void)ap_ptr;
+}
+static void rv_va_copy_native(NativeTarget* t, NativeLoc dst, NativeLoc src) {
+  rv_va_copy_core(rv_of(t), rv_va_addr_from_ptr(dst), rv_va_addr_from_ptr(src));
+}
+/* Software popcount of RV_TMP1 (already width-normalized) into rd, using
+ * RV_TMP1/RV_TMP2/RV_TMP3 as scratch. Mirrors the legacy bit-twiddling. */
+static void rv_emit_popcount(MCEmitter* mc, u32 rd, int is64) {
+  rv64_emit32(mc, rv_srli(RV_TMP2, RV_TMP1, 1));
+  rv_emit_load_imm(mc, 1, RV_TMP3,
+                   is64 ? (i64)0x5555555555555555ll : (i64)0x55555555);
+  rv64_emit32(mc, rv_and(RV_TMP2, RV_TMP2, RV_TMP3));
+  rv64_emit32(mc, rv_sub(RV_TMP1, RV_TMP1, RV_TMP2));
+  rv_emit_load_imm(mc, 1, RV_TMP3,
+                   is64 ? (i64)0x3333333333333333ll : (i64)0x33333333);
+  rv64_emit32(mc, rv_and(RV_TMP2, RV_TMP1, RV_TMP3));
+  rv64_emit32(mc, rv_srli(RV_TMP1, RV_TMP1, 2));
+  rv64_emit32(mc, rv_and(RV_TMP1, RV_TMP1, RV_TMP3));
+  rv64_emit32(mc, rv_add(RV_TMP1, RV_TMP1, RV_TMP2));
+  rv64_emit32(mc, rv_srli(RV_TMP2, RV_TMP1, 4));
+  rv64_emit32(mc, rv_add(RV_TMP1, RV_TMP1, RV_TMP2));
+  rv_emit_load_imm(mc, 1, RV_TMP3,
+                   is64 ? (i64)0x0f0f0f0f0f0f0f0fll : (i64)0x0f0f0f0f);
+  rv64_emit32(mc, rv_and(RV_TMP1, RV_TMP1, RV_TMP3));
+  rv_emit_load_imm(mc, 1, RV_TMP3,
+                   is64 ? (i64)0x0101010101010101ll : (i64)0x01010101);
+  rv64_emit32(mc, rv_mul(RV_TMP1, RV_TMP1, RV_TMP3));
+  rv64_emit32(mc, rv_srli(rd, RV_TMP1, is64 ? 56u : 24u));
+  /* The 32-bit SWAR sum lives in product bits [24,32); since the multiply is
+   * 64-bit, bits [32,64) survive the >>24 and must be masked off. (The 64-bit
+   * path's >>56 already isolates the top byte, so it needs no mask.) */
+  if (!is64) rv64_emit32(mc, rv_andi(rd, rd, 0xff));
+}
+
+/* Inline byte-granule copy/set between bare base registers (memcpy/memmove/
+ * memset intrinsics). backward != 0 copies high-to-low (memmove backward). */
+static void rv_intrin_copy(MCEmitter* mc, u32 dr, u32 sr, u32 n, int backward) {
+  if (!backward) {
+    u32 i = 0;
+    while (i + 8u <= n) { rv64_emit32(mc, rv_ld(RV_TMP3, sr, (i32)i)); rv64_emit32(mc, rv_sd(RV_TMP3, dr, (i32)i)); i += 8u; }
+    while (i + 4u <= n) { rv64_emit32(mc, rv_lwu(RV_TMP3, sr, (i32)i)); rv64_emit32(mc, rv_sw(RV_TMP3, dr, (i32)i)); i += 4u; }
+    while (i + 2u <= n) { rv64_emit32(mc, rv_lhu(RV_TMP3, sr, (i32)i)); rv64_emit32(mc, rv_sh(RV_TMP3, dr, (i32)i)); i += 2u; }
+    while (i < n) { rv64_emit32(mc, rv_lbu(RV_TMP3, sr, (i32)i)); rv64_emit32(mc, rv_sb(RV_TMP3, dr, (i32)i)); i += 1u; }
+  } else {
+    u32 i = n;
+    while (i >= 8u) { i -= 8u; rv64_emit32(mc, rv_ld(RV_TMP3, sr, (i32)i)); rv64_emit32(mc, rv_sd(RV_TMP3, dr, (i32)i)); }
+    while (i >= 4u) { i -= 4u; rv64_emit32(mc, rv_lwu(RV_TMP3, sr, (i32)i)); rv64_emit32(mc, rv_sw(RV_TMP3, dr, (i32)i)); }
+    while (i >= 2u) { i -= 2u; rv64_emit32(mc, rv_lhu(RV_TMP3, sr, (i32)i)); rv64_emit32(mc, rv_sh(RV_TMP3, dr, (i32)i)); }
+    while (i >= 1u) { i -= 1u; rv64_emit32(mc, rv_lbu(RV_TMP3, sr, (i32)i)); rv64_emit32(mc, rv_sb(RV_TMP3, dr, (i32)i)); }
+  }
+}
+
+static void rv_intrinsic(NativeTarget* t, IntrinKind kind, const NativeLoc* dsts,
+                         u32 ndst, const NativeLoc* args, u32 narg) {
+  RvNativeTarget* a = rv_of(t);
+  MCEmitter* mc = t->mc;
+  (void)ndst;
+  (void)narg;
+  switch (kind) {
+    case INTRIN_NONE:
+      break;
+    case INTRIN_EXPECT:
+    case INTRIN_ASSUME_ALIGNED: {
+      /* dst = val (hint dropped). */
+      if (args[0].kind == NATIVE_LOC_IMM)
+        rv_emit_load_imm(mc, rv_is_64(t, dsts[0].type) ? 1u : 0u,
+                         loc_reg(dsts[0]), args[0].v.imm);
+      else
+        rv_move(t, dsts[0], args[0]);
+      return;
+    }
+    case INTRIN_PREFETCH:
+      return;
+    case INTRIN_UNREACHABLE:
+    case INTRIN_TRAP:
+      rv64_emit32(mc, rv_ebreak());
+      return;
+    case INTRIN_BSWAP16: {
+      u32 rd = loc_reg(dsts[0]), rs = loc_reg(args[0]);
+      /* rd = ((rs & 0xff) << 8) | ((rs >> 8) & 0xff). */
+      rv64_emit32(mc, rv_addi(RV_TMP2, RV_ZERO, 0xff));
+      rv64_emit32(mc, rv_slli(RV_TMP2, RV_TMP2, 8)); /* 0xff00 */
+      rv64_emit32(mc, rv_slli(RV_TMP1, rs, 8));
+      rv64_emit32(mc, rv_and(RV_TMP1, RV_TMP1, RV_TMP2));
+      rv64_emit32(mc, rv_srli(RV_TMP3, rs, 8));
+      rv64_emit32(mc, rv_andi(RV_TMP3, RV_TMP3, 0xff));
+      rv64_emit32(mc, rv_or(rd, RV_TMP1, RV_TMP3));
+      return;
+    }
+    case INTRIN_BSWAP32: {
+      u32 rd = loc_reg(dsts[0]), rs = loc_reg(args[0]);
+      rv64_emit32(mc, rv_srliw(RV_TMP1, rs, 24));
+      rv64_emit32(mc, rv_andi(RV_TMP1, RV_TMP1, 0xff));
+      rv64_emit32(mc, rv_srliw(RV_TMP2, rs, 16));
+      rv64_emit32(mc, rv_andi(RV_TMP2, RV_TMP2, 0xff));
+      rv64_emit32(mc, rv_slli(RV_TMP2, RV_TMP2, 8));
+      rv64_emit32(mc, rv_or(RV_TMP1, RV_TMP1, RV_TMP2));
+      rv64_emit32(mc, rv_srliw(RV_TMP2, rs, 8));
+      rv64_emit32(mc, rv_andi(RV_TMP2, RV_TMP2, 0xff));
+      rv64_emit32(mc, rv_slli(RV_TMP2, RV_TMP2, 16));
+      rv64_emit32(mc, rv_or(RV_TMP1, RV_TMP1, RV_TMP2));
+      rv64_emit32(mc, rv_andi(RV_TMP2, rs, 0xff));
+      rv64_emit32(mc, rv_slli(RV_TMP2, RV_TMP2, 24));
+      rv64_emit32(mc, rv_or(rd, RV_TMP1, RV_TMP2));
+      rv64_emit32(mc, rv_slli(rd, rd, 32));
+      rv64_emit32(mc, rv_srli(rd, rd, 32));
+      return;
+    }
+    case INTRIN_BSWAP64: {
+      u32 rd = loc_reg(dsts[0]), rs = loc_reg(args[0]);
+      int i;
+      rv64_emit32(mc, rv_addi(RV_TMP1, RV_ZERO, 0));
+      for (i = 0; i < 8; ++i) {
+        int sh = 56 - 8 * i;
+        if (i == 0) {
+          rv64_emit32(mc, rv_andi(RV_TMP2, rs, 0xff));
+        } else {
+          rv64_emit32(mc, rv_srli(RV_TMP2, rs, (u32)(8 * i)));
+          rv64_emit32(mc, rv_andi(RV_TMP2, RV_TMP2, 0xff));
+        }
+        if (sh) rv64_emit32(mc, rv_slli(RV_TMP2, RV_TMP2, (u32)sh));
+        rv64_emit32(mc, rv_or(RV_TMP1, RV_TMP1, RV_TMP2));
+      }
+      rv64_emit32(mc, rv_addi(rd, RV_TMP1, 0));
+      return;
+    }
+    case INTRIN_POPCOUNT: {
+      u32 rd = loc_reg(dsts[0]), rs = loc_reg(args[0]);
+      int is64 = rv_is_64(t, args[0].type);
+      rv64_emit32(mc, rv_addi(RV_TMP1, rs, 0));
+      if (!is64) {
+        rv64_emit32(mc, rv_slli(RV_TMP1, RV_TMP1, 32));
+        rv64_emit32(mc, rv_srli(RV_TMP1, RV_TMP1, 32));
+      }
+      rv_emit_popcount(mc, rd, is64);
+      return;
+    }
+    case INTRIN_CTZ: {
+      /* ctz(x) = popcount((x & -x) - 1) for x != 0. */
+      u32 rd = loc_reg(dsts[0]), rs = loc_reg(args[0]);
+      int is64 = rv_is_64(t, args[0].type);
+      rv64_emit32(mc, rv_sub(RV_TMP1, RV_ZERO, rs));
+      rv64_emit32(mc, rv_and(RV_TMP1, RV_TMP1, rs));
+      rv64_emit32(mc, rv_addi(RV_TMP1, RV_TMP1, -1));
+      if (!is64) {
+        rv64_emit32(mc, rv_slli(RV_TMP1, RV_TMP1, 32));
+        rv64_emit32(mc, rv_srli(RV_TMP1, RV_TMP1, 32));
+      }
+      rv_emit_popcount(mc, rd, is64);
+      return;
+    }
+    case INTRIN_CLZ: {
+      /* Fold the high bit downward, then clz = popcount(~folded). */
+      u32 rd = loc_reg(dsts[0]), rs = loc_reg(args[0]);
+      int is64 = rv_is_64(t, args[0].type);
+      u32 shifts[6] = {1, 2, 4, 8, 16, 32};
+      u32 ns = is64 ? 6u : 5u, i;
+      rv64_emit32(mc, rv_addi(RV_TMP1, rs, 0));
+      if (!is64) {
+        rv64_emit32(mc, rv_slli(RV_TMP1, RV_TMP1, 32));
+        rv64_emit32(mc, rv_srli(RV_TMP1, RV_TMP1, 32));
+      }
+      for (i = 0; i < ns; ++i) {
+        rv64_emit32(mc, rv_srli(RV_TMP2, RV_TMP1, shifts[i]));
+        rv64_emit32(mc, rv_or(RV_TMP1, RV_TMP1, RV_TMP2));
+      }
+      rv64_emit32(mc, rv_xori(RV_TMP1, RV_TMP1, -1));
+      if (!is64) {
+        rv64_emit32(mc, rv_slli(RV_TMP1, RV_TMP1, 32));
+        rv64_emit32(mc, rv_srli(RV_TMP1, RV_TMP1, 32));
+      }
+      rv_emit_popcount(mc, rd, is64);
+      return;
+    }
+    case INTRIN_SADD_OVERFLOW:
+    case INTRIN_SSUB_OVERFLOW: {
+      /* dsts: [val, ovf]. ADD: ovf=((a^r)&(b^r))>>(w-1);
+       * SUB: ovf=((a^b)&(a^r))>>(w-1). */
+      int is64 = rv_is_64(t, dsts[0].type);
+      u32 ra = loc_reg(args[0]), rb = loc_reg(args[1]);
+      u32 rd = loc_reg(dsts[0]), rovf = loc_reg(dsts[1]);
+      u32 sh = is64 ? 63u : 31u;
+      if (kind == INTRIN_SADD_OVERFLOW)
+        rv64_emit32(mc, is64 ? rv_add(RV_TMP2, ra, rb) : rv_addw(RV_TMP2, ra, rb));
+      else
+        rv64_emit32(mc, is64 ? rv_sub(RV_TMP2, ra, rb) : rv_subw(RV_TMP2, ra, rb));
+      rv64_emit32(mc, rv_xor(RV_TMP3, ra, RV_TMP2)); /* a ^ r */
+      if (kind == INTRIN_SADD_OVERFLOW) {
+        rv64_emit32(mc, rv_xor(rovf, rb, RV_TMP2)); /* b ^ r */
+        rv64_emit32(mc, rv_and(rovf, rovf, RV_TMP3));
+      } else {
+        rv64_emit32(mc, rv_xor(rovf, ra, rb)); /* a ^ b */
+        rv64_emit32(mc, rv_and(rovf, rovf, RV_TMP3));
+      }
+      rv64_emit32(mc, is64 ? rv_srli(rovf, rovf, sh) : rv_srliw(rovf, rovf, sh));
+      rv64_emit32(mc, rv_andi(rovf, rovf, 1));
+      rv64_emit32(mc, rv_addi(rd, RV_TMP2, 0));
+      return;
+    }
+    case INTRIN_UADD_OVERFLOW:
+    case INTRIN_USUB_OVERFLOW: {
+      int is64 = rv_is_64(t, dsts[0].type);
+      u32 ra = loc_reg(args[0]), rb = loc_reg(args[1]);
+      u32 rd = loc_reg(dsts[0]), rovf = loc_reg(dsts[1]);
+      if (!is64) {
+        rv64_emit32(mc, rv_slli(RV_TMP2, ra, 32));
+        rv64_emit32(mc, rv_srli(RV_TMP2, RV_TMP2, 32));
+        rv64_emit32(mc, rv_slli(RV_TMP3, rb, 32));
+        rv64_emit32(mc, rv_srli(RV_TMP3, RV_TMP3, 32));
+        ra = RV_TMP2;
+        rb = RV_TMP3;
+      }
+      if (kind == INTRIN_UADD_OVERFLOW) {
+        if (is64) {
+          rv64_emit32(mc, rv_add(RV_TMP2, ra, rb));
+          rv64_emit32(mc, rv_sltu(rovf, RV_TMP2, ra));
+        } else {
+          rv64_emit32(mc, rv_add(RV_TMP2, ra, rb));
+          rv64_emit32(mc, rv_srli(rovf, RV_TMP2, 32));
+          rv64_emit32(mc, rv_sltu(rovf, RV_ZERO, rovf));
+          rv64_emit32(mc, rv_addiw(RV_TMP2, RV_TMP2, 0));
+        }
+      } else {
+        rv64_emit32(mc, rv_sltu(rovf, ra, rb));
+        rv64_emit32(mc, is64 ? rv_sub(RV_TMP2, ra, rb) : rv_subw(RV_TMP2, ra, rb));
+      }
+      rv64_emit32(mc, rv_addi(rd, RV_TMP2, 0));
+      return;
+    }
+    case INTRIN_SMUL_OVERFLOW: {
+      int is64 = rv_is_64(t, dsts[0].type);
+      u32 ra = loc_reg(args[0]), rb = loc_reg(args[1]);
+      u32 rd = loc_reg(dsts[0]), rovf = loc_reg(dsts[1]);
+      if (is64) {
+        rv64_emit32(mc, rv_mul(RV_TMP2, ra, rb));
+        rv64_emit32(mc, rv_mulh(RV_TMP3, ra, rb));
+        rv64_emit32(mc, rv_srai(rovf, RV_TMP2, 63));
+        rv64_emit32(mc, rv_xor(rovf, RV_TMP3, rovf));
+        rv64_emit32(mc, rv_sltu(rovf, RV_ZERO, rovf));
+        rv64_emit32(mc, rv_addi(rd, RV_TMP2, 0));
+      } else {
+        rv64_emit32(mc, rv_addiw(RV_TMP2, ra, 0));
+        rv64_emit32(mc, rv_addiw(RV_TMP3, rb, 0));
+        rv64_emit32(mc, rv_mul(RV_TMP2, RV_TMP2, RV_TMP3));
+        rv64_emit32(mc, rv_addiw(RV_TMP3, RV_TMP2, 0));
+        rv64_emit32(mc, rv_xor(rovf, RV_TMP2, RV_TMP3));
+        rv64_emit32(mc, rv_sltu(rovf, RV_ZERO, rovf));
+        rv64_emit32(mc, rv_addiw(rd, RV_TMP2, 0));
+      }
+      return;
+    }
+    case INTRIN_UMUL_OVERFLOW: {
+      int is64 = rv_is_64(t, dsts[0].type);
+      u32 ra = loc_reg(args[0]), rb = loc_reg(args[1]);
+      u32 rd = loc_reg(dsts[0]), rovf = loc_reg(dsts[1]);
+      if (is64) {
+        rv64_emit32(mc, rv_mulhu(rovf, ra, rb));
+        rv64_emit32(mc, rv_mul(rd, ra, rb));
+        rv64_emit32(mc, rv_sltu(rovf, RV_ZERO, rovf));
+      } else {
+        rv64_emit32(mc, rv_slli(RV_TMP2, ra, 32));
+        rv64_emit32(mc, rv_srli(RV_TMP2, RV_TMP2, 32));
+        rv64_emit32(mc, rv_slli(RV_TMP3, rb, 32));
+        rv64_emit32(mc, rv_srli(RV_TMP3, RV_TMP3, 32));
+        rv64_emit32(mc, rv_mul(RV_TMP2, RV_TMP2, RV_TMP3));
+        rv64_emit32(mc, rv_srli(rovf, RV_TMP2, 32));
+        rv64_emit32(mc, rv_sltu(rovf, RV_ZERO, rovf));
+        rv64_emit32(mc, rv_addiw(rd, RV_TMP2, 0));
+      }
+      return;
+    }
+    case INTRIN_MEMCPY:
+    case INTRIN_MEMMOVE: {
+      u32 dr, sr, n;
+      if (narg != 3u || args[0].kind != NATIVE_LOC_REG ||
+          args[1].kind != NATIVE_LOC_REG || args[2].kind != NATIVE_LOC_IMM)
+        rv_panic(a, "unsupported memory intrinsic operands");
+      if (args[2].v.imm < 0 || args[2].v.imm > 0xffffffffll)
+        rv_panic(a, "unsupported memory intrinsic size");
+      dr = loc_reg(args[0]);
+      sr = loc_reg(args[1]);
+      n = (u32)args[2].v.imm;
+      rv_intrin_copy(mc, dr, sr, n, kind == INTRIN_MEMMOVE);
+      return;
+    }
+    case INTRIN_MEMSET: {
+      u32 dr, n, src;
+      if (narg != 3u || args[0].kind != NATIVE_LOC_REG ||
+          args[2].kind != NATIVE_LOC_IMM)
+        rv_panic(a, "unsupported memset operands");
+      if (args[2].v.imm < 0 || args[2].v.imm > 0xffffffffll)
+        rv_panic(a, "unsupported memset size");
+      dr = loc_reg(args[0]);
+      n = (u32)args[2].v.imm;
+      if (args[1].kind == NATIVE_LOC_IMM) {
+        u32 byte = (u32)(args[1].v.imm & 0xffu);
+        if (byte == 0) {
+          src = RV_ZERO;
+        } else {
+          u64 b = byte;
+          b |= b << 8;
+          b |= b << 16;
+          b |= b << 32;
+          rv_emit_load_imm(mc, 1, RV_TMP3, (i64)b);
+          src = RV_TMP3;
+        }
+      } else {
+        /* Replicate the low byte of a register value across 8 bytes. */
+        u32 rb = loc_reg(args[1]);
+        rv64_emit32(mc, rv_andi(RV_TMP3, rb, 0xff));
+        rv64_emit32(mc, rv_slli(RV_TMP2, RV_TMP3, 8));
+        rv64_emit32(mc, rv_or(RV_TMP3, RV_TMP3, RV_TMP2));
+        rv64_emit32(mc, rv_slli(RV_TMP2, RV_TMP3, 16));
+        rv64_emit32(mc, rv_or(RV_TMP3, RV_TMP3, RV_TMP2));
+        rv64_emit32(mc, rv_slli(RV_TMP2, RV_TMP3, 32));
+        rv64_emit32(mc, rv_or(RV_TMP3, RV_TMP3, RV_TMP2));
+        src = RV_TMP3;
+      }
+      {
+        u32 i = 0;
+        while (i + 8u <= n) { rv64_emit32(mc, rv_sd(src, dr, (i32)i)); i += 8u; }
+        while (i + 4u <= n) { rv64_emit32(mc, rv_sw(src, dr, (i32)i)); i += 4u; }
+        while (i + 2u <= n) { rv64_emit32(mc, rv_sh(src, dr, (i32)i)); i += 2u; }
+        while (i < n) { rv64_emit32(mc, rv_sb(src, dr, (i32)i)); i += 1u; }
+      }
+      return;
+    }
+    default:
+      break;
+  }
+  rv_panic(a, "unsupported compiler intrinsic");
+}
+/* ============================ inline asm ============================ */
+
+_Noreturn static void rv_asm_panic_at(Compiler* c, SrcLoc loc, const char* msg) {
+  compiler_panic(c, loc, "rv64 inline asm: %s", msg);
+}
+_Noreturn static void rv_asm_panic(NativeDirectTarget* d, const char* msg) {
+  rv_asm_panic_at(d->base.c, d->loc, msg);
+}
+
+static const char* rv_asm_constraint_body(const char* s) {
+  if (!s) return "";
+  if ((s[0] == '=' || s[0] == '+') && s[1] == '&') return s + 2;
+  if (s[0] == '=' || s[0] == '+' || s[0] == '&') return s + 1;
+  return s;
+}
+static int rv_asm_constraint_early(const char* s) {
+  if (!s) return 0;
+  return ((s[0] == '=' || s[0] == '+') && s[1] == '&') || s[0] == '&';
+}
+static int rv_asm_match_index(const char* s) {
+  int n = 0;
+  const char* p;
+  if (!s || s[0] < '0' || s[0] > '9') return -1;
+  for (p = s; *p >= '0' && *p <= '9'; ++p) n = n * 10 + (*p - '0');
+  return n;
+}
+
+/* Build a bound register pseudo-operand in the rv64 inline shape. */
+static void rv_asm_bound_reg(Operand* out, CfreeCgTypeId type,
+                             NativeAllocClass cls, Reg reg) {
+  memset(out, 0, sizeof *out);
+  out->kind = RV64_INLINE_OPK_REG;
+  out->pad[0] =
+      (cls == NATIVE_REG_FP) ? RV64_INLINE_OPCLS_FP : RV64_INLINE_OPCLS_INT;
+  out->type = type;
+  out->v.local = (CGLocal)reg;
+}
+static void rv_asm_bound_mem(Operand* out, CfreeCgTypeId type, Reg base) {
+  memset(out, 0, sizeof *out);
+  out->kind = OPK_INDIRECT;
+  out->type = type;
+  out->v.ind.base = (CGLocal)base;
+  out->v.ind.index = CG_LOCAL_NONE;
+  out->v.ind.ofs = 0;
+}
+
+/* Parse a clobber register name into (class, reg). Returns 1 on success, 0 for
+ * the special "cc"/"memory" clobbers, and panics on an unknown register. RV64
+ * DWARF numbering: int x0..x31 = 0..31, fp f0..f31 = 32..63. */
+static int rv_asm_parse_reg_clobber(Compiler* c, SrcLoc loc, Sym name,
+                                    NativeAllocClass* cls_out, Reg* reg_out) {
+  Slice s = pool_slice(c->global, name);
+  char buf[16];
+  uint32_t dwarf;
+  if (!s.s || !s.len) return 0;
+  if (s.len == 2 && s.s[0] == 'c' && s.s[1] == 'c') return 0;
+  if (s.len == 6 && memcmp(s.s, "memory", 6) == 0) return 0;
+  if (s.len >= sizeof buf) rv_asm_panic_at(c, loc, "clobber name is too long");
+  memcpy(buf, s.s, s.len);
+  buf[s.len] = '\0';
+  if (rv64_register_index(buf, &dwarf) != 0)
+    rv_asm_panic_at(c, loc, "unknown clobber register");
+  if (dwarf <= 31u) {
+    *cls_out = NATIVE_REG_INT;
+    *reg_out = (Reg)dwarf;
+    return 1;
+  }
+  if (dwarf >= 32u && dwarf <= 63u) {
+    *cls_out = NATIVE_REG_FP;
+    *reg_out = (Reg)(dwarf - 32u);
+    return 1;
+  }
+  rv_asm_panic_at(c, loc, "unsupported clobber register");
+  return 0;
+}
+
+static void rv_asm_clobber_masks(Compiler* c, SrcLoc loc, const Sym* clobbers,
+                                 u32 nclob, u32* int_mask, u32* fp_mask) {
+  u32 i;
+  *int_mask = 0;
+  *fp_mask = 0;
+  for (i = 0; i < nclob; ++i) {
+    NativeAllocClass cls;
+    Reg reg;
+    if (!rv_asm_parse_reg_clobber(c, loc, clobbers[i], &cls, &reg)) continue;
+    if (cls == NATIVE_REG_INT)
+      *int_mask |= 1u << reg;
+    else
+      *fp_mask |= 1u << reg;
+  }
+}
+
+static NativeAllocClass rv_asm_constraint_class(NativeDirectTarget* d,
+                                                const char* body) {
+  if (body[0] == 'r') return NATIVE_REG_INT;
+  if (body[0] == 'f') return NATIVE_REG_FP;
+  rv_asm_panic(d, "constraint is not a register constraint");
+  return NATIVE_REG_INT;
+}
+
+/* Pick a free register from the arch's caller-saved allocable pools for an
+ * asm operand the direct path must self-allocate. */
+static Reg rv_asm_alloc_reg(NativeDirectTarget* d, NativeAllocClass cls,
+                            u32* used_int, u32* used_fp) {
+  /* int: a0..a7 (10..17) then t-temps that aren't emit scratch. */
+  static const Reg int_pool[] = {10u, 11u, 12u, 13u, 14u, 15u,
+                                 16u, 17u, 29u, 30u, 31u};
+  /* fp: fa0..fa7 (10..17) then ft caller-saved. */
+  static const Reg fp_pool[] = {10u, 11u, 12u, 13u, 14u, 15u, 16u, 17u,
+                                4u,  5u,  6u,  7u,  28u, 29u, 30u, 31u};
+  const Reg* pool = cls == NATIVE_REG_FP ? fp_pool : int_pool;
+  u32 n = cls == NATIVE_REG_FP ? (u32)(sizeof fp_pool / sizeof fp_pool[0])
+                               : (u32)(sizeof int_pool / sizeof int_pool[0]);
+  u32* used = cls == NATIVE_REG_FP ? used_fp : used_int;
+  u32 i;
+  for (i = 0; i < n; ++i) {
+    Reg r = pool[i];
+    if ((*used & (1u << r)) != 0) continue;
+    *used |= 1u << r;
+    return r;
+  }
+  rv_asm_panic(d, "out of registers for asm operands");
+  return REG_NONE;
+}
+
+/* Direct (-O0) path: resolve a semantic Operand to a NativeAddr. */
+static NativeAddr rv_direct_addr(NativeDirectTarget* d, Operand op) {
+  NativeAddr addr;
+  memset(&addr, 0, sizeof addr);
+  switch ((OpKind)op.kind) {
+    case OPK_LOCAL:
+      addr.base_kind = NATIVE_ADDR_BASE_FRAME;
+      addr.base.frame = d->locals[op.v.local - 1u].home;
+      addr.base_type = op.type;
+      return addr;
+    case OPK_INDIRECT:
+      addr.base_kind = NATIVE_ADDR_BASE_FRAME_VALUE;
+      addr.base.frame = d->locals[op.v.ind.base - 1u].home;
+      addr.cls = d->locals[op.v.ind.base - 1u].cls;
+      addr.base_type = d->locals[op.v.ind.base - 1u].type;
+      addr.offset = op.v.ind.ofs;
+      return addr;
+    default:
+      rv_asm_panic(d, "operand is not addressable");
+  }
+}
+
+/* Materialize an OPK_INDIRECT (frame-value) base into a register, returning a
+ * plain register-based NativeAddr. */
+static NativeAddr rv_direct_materialize_addr(NativeDirectTarget* d, Operand op) {
+  RvNativeTarget* a = rv_of(d->native);
+  NativeAddr addr = rv_direct_addr(d, op);
+  if (addr.base_kind == NATIVE_ADDR_BASE_FRAME_VALUE) {
+    NativeLoc base = rv_reg_loc(addr.base_type, NATIVE_REG_INT, RV_TMP1);
+    NativeAddr load;
+    memset(&load, 0, sizeof load);
+    load.base_kind = NATIVE_ADDR_BASE_FRAME;
+    load.base.frame = addr.base.frame;
+    load.base_type = addr.base_type;
+    rv_emit_mem(a, 1, base, load, rv_mem_for_type(d->native, addr.base_type, 8));
+    addr.base_kind = NATIVE_ADDR_BASE_REG;
+    addr.base.reg = RV_TMP1;
+  }
+  return addr;
+}
+
+static void rv_direct_load_operand_to_reg(NativeDirectTarget* d, Operand op,
+                                          NativeLoc dst) {
+  RvNativeTarget* a = rv_of(d->native);
+  NativeAddr addr;
+  memset(&addr, 0, sizeof addr);
+  switch ((OpKind)op.kind) {
+    case OPK_IMM:
+      if ((NativeAllocClass)dst.cls != NATIVE_REG_INT)
+        rv_asm_panic(d, "floating-point immediate asm input is unsupported");
+      d->native->load_imm(d->native, dst, op.v.imm);
+      return;
+    case OPK_LOCAL:
+      addr.base_kind = NATIVE_ADDR_BASE_FRAME;
+      addr.base.frame = d->locals[op.v.local - 1u].home;
+      addr.base_type = op.type;
+      rv_emit_mem(a, 1, dst, addr, rv_mem_for_type(d->native, op.type, 0));
+      return;
+    case OPK_GLOBAL:
+      addr.base_kind = NATIVE_ADDR_BASE_GLOBAL;
+      addr.base.global.sym = op.v.global.sym;
+      addr.base.global.addend = op.v.global.addend;
+      addr.base_type = op.type;
+      d->native->load_addr(d->native, dst, addr);
+      return;
+    case OPK_INDIRECT:
+      addr = rv_direct_materialize_addr(d, op);
+      rv_emit_mem(a, 1, dst, addr, rv_mem_for_type(d->native, op.type, 0));
+      return;
+  }
+  rv_asm_panic(d, "unsupported asm input operand");
+}
+
+static void rv_direct_load_address_to_reg(NativeDirectTarget* d, Operand op,
+                                          NativeLoc dst) {
+  d->native->load_addr(d->native, dst, rv_direct_addr(d, op));
+}
+
+static void rv_direct_store_reg_to_operand(NativeDirectTarget* d, Operand op,
+                                           NativeLoc src) {
+  RvNativeTarget* a = rv_of(d->native);
+  NativeAddr addr;
+  memset(&addr, 0, sizeof addr);
+  if (op.kind == OPK_LOCAL) {
+    addr.base_kind = NATIVE_ADDR_BASE_FRAME;
+    addr.base.frame = d->locals[op.v.local - 1u].home;
+    addr.base_type = op.type;
+  } else {
+    addr = rv_direct_materialize_addr(d, op);
+  }
+  rv_emit_mem(a, 0, src, addr, rv_mem_for_type(d->native, op.type, 0));
+}
+
+/* Callee-saved registers an asm block clobbers must be spilled/restored around
+ * the block (the only ABI duty the allocator cannot discharge itself). */
+typedef struct RvAsmSavedClobber {
+  NativeFrameSlot slot;
+  NativeAllocClass cls;
+  Reg reg;
+  CfreeCgTypeId type;
+} RvAsmSavedClobber;
+
+static void rv_asm_save_one(RvNativeTarget* a, RvAsmSavedClobber* s) {
+  NativeFrameSlotDesc desc;
+  NativeAddr addr;
+  memset(&desc, 0, sizeof desc);
+  desc.type = s->type;
+  desc.size = 8;
+  desc.align = 8;
+  desc.kind = NATIVE_FRAME_SLOT_SAVE;
+  s->slot = a->base.frame_slot(&a->base, &desc);
+  memset(&addr, 0, sizeof addr);
+  addr.base_kind = NATIVE_ADDR_BASE_FRAME;
+  addr.base.frame = s->slot;
+  addr.base_type = s->type;
+  rv_emit_mem(a, 0, rv_reg_loc(s->type, s->cls, s->reg), addr,
+              rv_mem_for_type(&a->base, s->type, 8));
+}
+static void rv_asm_restore_one(RvNativeTarget* a, const RvAsmSavedClobber* s) {
+  NativeAddr addr;
+  memset(&addr, 0, sizeof addr);
+  addr.base_kind = NATIVE_ADDR_BASE_FRAME;
+  addr.base.frame = s->slot;
+  addr.base_type = s->type;
+  rv_emit_mem(a, 1, rv_reg_loc(s->type, s->cls, s->reg), addr,
+              rv_mem_for_type(&a->base, s->type, 8));
+}
+
+/* psABI callee-saved: integer s0..s11 (x8,x9,x18..x27), fp fs0..fs11
+ * (f8,f9,f18..f27). x8 is the frame pointer and never asm-clobbered. */
+static int rv_reg_is_callee_int(Reg r) {
+  return r == 9u || (r >= 18u && r <= 27u);
+}
+static int rv_reg_is_callee_fp(Reg r) {
+  return r == 8u || r == 9u || (r >= 18u && r <= 27u);
+}
+
+static RvAsmSavedClobber* rv_asm_save_callee_clobbers(RvNativeTarget* a,
+                                                      u32 int_mask, u32 fp_mask,
+                                                      u32* nsaved_out) {
+  RvAsmSavedClobber* saved = arena_zarray(a->base.c->tu, RvAsmSavedClobber, 24u);
+  CfreeCgTypeId i64 = builtin_id(CFREE_CG_BUILTIN_I64);
+  CfreeCgTypeId f64 = builtin_id(CFREE_CG_BUILTIN_F64);
+  u32 n = 0;
+  Reg r;
+  for (r = 0; r <= 31u; ++r) {
+    if ((int_mask & (1u << r)) == 0 || !rv_reg_is_callee_int(r)) continue;
+    saved[n].cls = NATIVE_REG_INT;
+    saved[n].reg = r;
+    saved[n].type = i64;
+    rv_asm_save_one(a, &saved[n++]);
+  }
+  for (r = 0; r <= 31u; ++r) {
+    if ((fp_mask & (1u << r)) == 0 || !rv_reg_is_callee_fp(r)) continue;
+    saved[n].cls = NATIVE_REG_FP;
+    saved[n].reg = r;
+    saved[n].type = f64;
+    rv_asm_save_one(a, &saved[n++]);
+  }
+  *nsaved_out = n;
+  return saved;
+}
+
+/* ---- NativeTarget (optimizer) asm hook ----
+ * The optimizer pre-allocated every operand register and arranged surrounding
+ * data flow, so this binds pre-allocated registers to the template and only
+ * materializes memory-operand bases into the reserved scratch + spills the
+ * callee-saved registers the asm clobbers. */
+
+static NativeAddr rv_asm_loc_to_addr(RvNativeTarget* a, SrcLoc loc,
+                                     NativeLoc src) {
+  NativeAddr addr;
+  memset(&addr, 0, sizeof addr);
+  addr.base_type = src.type;
+  switch ((NativeLocKind)src.kind) {
+    case NATIVE_LOC_FRAME:
+      addr.base_kind = NATIVE_ADDR_BASE_FRAME;
+      addr.base.frame = src.v.frame;
+      return addr;
+    case NATIVE_LOC_ADDR:
+      return src.v.addr;
+    case NATIVE_LOC_GLOBAL:
+      addr.base_kind = NATIVE_ADDR_BASE_GLOBAL;
+      addr.base.global.sym = src.v.global.sym;
+      addr.base.global.addend = src.v.global.addend;
+      return addr;
+    case NATIVE_LOC_REG:
+      addr.base_kind = NATIVE_ADDR_BASE_REG;
+      addr.cls = NATIVE_REG_INT;
+      addr.base.reg = src.v.reg;
+      return addr;
+    default:
+      rv_asm_panic_at(a->base.c, loc, "unsupported memory asm operand");
+  }
+}
+
+/* Resolve a memory-constraint operand to a single base register with zero
+ * offset, folding any frame/global/offset into a reserved scratch register. */
+static Reg rv_asm_native_mem_base(RvNativeTarget* a, SrcLoc loc, NativeLoc src,
+                                  u32* ntmp) {
+  NativeAddr addr = rv_asm_loc_to_addr(a, loc, src);
+  u32 base;
+  i32 off;
+  Reg dst;
+  if (addr.index_kind != NATIVE_ADDR_INDEX_NONE)
+    rv_asm_panic_at(a->base.c, loc, "indexed memory asm operand unsupported");
+  rv_resolve_mem_addr(a, &addr, &base, &off);
+  if (off == 0 && base != RV_TMP0 && base != RV_TMP1) return (Reg)base;
+  if (*ntmp >= 2u)
+    rv_asm_panic_at(a->base.c, loc, "too many memory asm operands");
+  dst = (*ntmp == 0u) ? RV_TMP0 : RV_TMP1;
+  (*ntmp)++;
+  rv_emit_addr_adjust(a->base.mc, dst, base, off);
+  return dst;
+}
+
+static void rv_asm_bind_native(RvNativeTarget* a, SrcLoc loc, Operand* out,
+                               const char* constraint, CfreeCgTypeId type,
+                               NativeLoc src, u32* ntmp) {
+  const char* body = rv_asm_constraint_body(constraint);
+  if (body[0] == 'r' || body[0] == 'f') {
+    NativeAllocClass cls = (body[0] == 'f') ? NATIVE_REG_FP : NATIVE_REG_INT;
+    if (src.kind != NATIVE_LOC_REG)
+      rv_asm_panic_at(a->base.c, loc, "register asm operand not in a register");
+    rv_asm_bound_reg(out, type, cls, (Reg)src.v.reg);
+  } else if (body[0] == 'i') {
+    if (src.kind != NATIVE_LOC_IMM)
+      rv_asm_panic_at(a->base.c, loc, "immediate asm operand is not immediate");
+    memset(out, 0, sizeof *out);
+    out->kind = OPK_IMM;
+    out->type = type;
+    out->v.imm = src.v.imm;
+  } else if (body[0] == 'm') {
+    rv_asm_bound_mem(out, type, rv_asm_native_mem_base(a, loc, src, ntmp));
+  } else {
+    rv_asm_panic_at(a->base.c, loc, "unsupported asm constraint");
+  }
+}
+
+static void rv_asm_block_native(NativeTarget* t, const char* tmpl,
+                                const AsmConstraint* outs, u32 nout,
+                                NativeLoc* out_locs, const AsmConstraint* ins,
+                                u32 nin, const NativeLoc* in_locs,
+                                const Sym* clobbers, u32 nclob) {
+  RvNativeTarget* a = rv_of(t);
+  Compiler* c = t->c;
+  SrcLoc loc = a->func ? a->func->loc : (SrcLoc){0, 0, 0};
+  Operand* bound_outs = nout ? arena_zarray(c->tu, Operand, nout) : NULL;
+  Operand* bound_ins = nin ? arena_zarray(c->tu, Operand, nin) : NULL;
+  u32 clob_int, clob_fp, ntmp = 0;
+  RvAsmSavedClobber* saved;
+  u32 nsaved, i;
+  Rv64Asm* asmh;
+
+  rv_asm_clobber_masks(c, loc, clobbers, nclob, &clob_int, &clob_fp);
+
+  for (i = 0; i < nout; ++i) {
+    CfreeCgTypeId type = outs[i].type ? outs[i].type : out_locs[i].type;
+    rv_asm_bind_native(a, loc, &bound_outs[i], outs[i].str, type, out_locs[i],
+                       &ntmp);
+  }
+  for (i = 0; i < nin; ++i) {
+    const char* body = rv_asm_constraint_body(ins[i].str);
+    int matched = rv_asm_match_index(body);
+    CfreeCgTypeId type;
+    NativeLoc inloc;
+    if (matched >= 0) {
+      if ((u32)matched >= nout)
+        rv_asm_panic_at(c, loc, "matching constraint out of range");
+      bound_ins[i] = bound_outs[matched];
+      continue;
+    }
+    type = ins[i].type ? ins[i].type : in_locs[i].type;
+    inloc = in_locs[i];
+    /* A register-constrained input that lives in a frame slot (address-taken
+     * local) must be loaded into a reserved scratch first. */
+    if (body[0] == 'r' && inloc.kind != NATIVE_LOC_REG) {
+      Reg r;
+      if (ntmp >= 2u) rv_asm_panic_at(c, loc, "too many memory asm operands");
+      r = (ntmp == 0u) ? RV_TMP0 : RV_TMP1;
+      ntmp++;
+      inloc = rv_reg_loc(type, NATIVE_REG_INT, r);
+      rv_emit_mem(a, 1, inloc, rv_asm_loc_to_addr(a, loc, in_locs[i]),
+                  rv_mem_for_type(t, type, rv_type_size(t, type)));
+    }
+    rv_asm_bind_native(a, loc, &bound_ins[i], ins[i].str, type, inloc, &ntmp);
+  }
+
+  saved = rv_asm_save_callee_clobbers(a, clob_int, clob_fp, &nsaved);
+  asmh = rv64_asm_open(c);
+  rv64_inline_bind(asmh, outs, nout, bound_outs, ins, nin, bound_ins, clobbers,
+                   nclob);
+  rv64_asm_run_template(asmh, t->mc, tmpl);
+  rv64_asm_close(asmh);
+  for (i = nsaved; i > 0; --i) rv_asm_restore_one(a, &saved[i - 1u]);
+}
+static void rv_file_scope_asm(NativeTarget* t, const char* src, size_t len) {
+  /* Top-level __asm__("...") — assemble through the generic .s parser, which
+   * dispatches instruction lines to the rv64 asm driver and handles directives
+   * (.data/.word/.globl/.text/...) itself. */
+  AsmLexer* lex = asm_lex_open_mem(t->c, "<file-scope-asm>", src, len);
+  asm_parse(t->c, lex, t->mc);
+  asm_lex_close(lex);
+}
+static void rv_trap(NativeTarget* t) { rv64_emit32(t->mc, rv_ebreak()); }
+static void rv_set_loc(NativeTarget* t, SrcLoc loc) {
+  rv_of(t)->loc = loc;
+  if (t->mc->set_loc) t->mc->set_loc(t->mc, loc);
+}
+static void rv_finalize(NativeTarget* t) {
+  if (t->mc) mc_emit_eh_frame(t->mc);
+}
+
+/* ============================ construction ============================ */
+
+NativeTarget* rv64_native_target_new(Compiler* c, ObjBuilder* obj,
+                                     MCEmitter* mc) {
+  RvNativeTarget* a = arena_znew(c->tu, RvNativeTarget);
+  NativeTarget* t;
+  if (!a) return NULL;
+  t = &a->base;
+  t->c = c;
+  t->obj = obj;
+  t->mc = mc;
+  t->regs = &rv_reg_info;
+  t->class_for_type = rv_class_for_type;
+  t->imm_legal = rv_imm_legal;
+  t->addr_legal = rv_addr_legal;
+  t->func_begin = rv_func_begin;
+  t->func_begin_known_frame = rv_func_begin_known_frame;
+  t->note_frame_state = NULL;
+  t->reserve_callee_saves = NULL;
+  t->signature_stack_bytes = rv_signature_stack_bytes;
+  t->call_stack_bytes = rv_call_stack_bytes;
+  t->has_store_zero_reg = 1;
+  t->store_zero_reg = RV_ZERO;
+  t->func_end = rv_func_end;
+  t->frame_slot = rv_frame_slot;
+  t->frame_slot_debug_loc = NULL;
+  t->bind_param = rv_bind_native_param;
+  t->label_new = rv_label_new;
+  t->label_place = rv_label_place;
+  t->jump = rv_jump;
+  t->cmp_branch = rv_cmp_branch;
+  t->indirect_branch = rv_indirect_branch;
+  t->load_label_addr = rv_load_label_addr;
+  t->move = rv_move;
+  t->load_imm = rv_load_imm;
+  t->load_const = rv_load_const;
+  t->load_addr = rv_load_addr;
+  t->load = rv_load;
+  t->store = rv_store;
+  t->tls_addr_of = rv_tls_addr_of;
+  t->copy_bytes = rv_copy_bytes;
+  t->set_bytes = rv_set_bytes;
+  t->bitfield_load = rv_bitfield_load;
+  t->bitfield_store = rv_bitfield_store;
+  t->binop = rv_binop;
+  t->unop = rv_unop;
+  t->cmp = rv_cmp;
+  t->convert = rv_convert;
+  t->alloca_ = rv_alloca;
+  t->spill = rv_spill;
+  t->reload = rv_reload;
+  t->plan_call = rv_plan_call;
+  t->emit_call = rv_emit_call;
+  t->plan_ret = rv_plan_ret;
+  t->ret = rv_ret;
+  t->atomic_load = rv_atomic_load;
+  t->atomic_store = rv_atomic_store;
+  t->atomic_rmw = rv_atomic_rmw;
+  t->atomic_cas = rv_atomic_cas;
+  t->fence = rv_fence;
+  t->va_start_ = rv_va_start_native;
+  t->va_arg_ = rv_va_arg_native;
+  t->va_end_ = rv_va_end_native;
+  t->va_copy_ = rv_va_copy_native;
+  t->intrinsic = rv_intrinsic;
+  t->asm_block = rv_asm_block_native;
+  t->file_scope_asm = rv_file_scope_asm;
+  t->trap = rv_trap;
+  t->set_loc = rv_set_loc;
+  t->finalize = rv_finalize;
+  return t;
+}
+
+/* ============================ NativeOps (-O0) ============================ */
+
+static void rv_bind_param(NativeDirectTarget* d, const CGParamDesc* p,
+                          CGLocal local, NativeDirectLocal* l) {
+  NativeLoc dst;
+  (void)local;
+  memset(&dst, 0, sizeof dst);
+  dst.kind = NATIVE_LOC_FRAME;
+  dst.type = p->type;
+  dst.v.frame = l->home;
+  rv_bind_native_param(d->native, p, dst);
+}
+
+static const char* rv_no_tail(NativeDirectTarget* d, const CGCallDesc* call) {
+  (void)d;
+  (void)call;
+  return "rv64 tail calls not implemented yet";
+}
+
+/* Resolve ap_addr, the pointer value &ap (the va_list object's address), to a
+ * NativeAddr. For an OPK_LOCAL the local HOLDS that pointer, so load its home
+ * value into TMP1 and return a TMP1-based address; an OPK_INDIRECT names
+ * *(base+ofs), whose address base+ofs is the pointer, so it is materialized
+ * directly. Mirrors aa64's aa_direct_pointer_addr.
+ *
+ * rv_direct_va_base (below) then loads the resulting address into a
+ * caller-chosen `reg`. The va cores use TMP1/TMP2 internally, so that `reg`
+ * must be distinct from those (callers pass TMP0 / TMP3). */
+static NativeAddr rv_direct_pointer_addr(NativeDirectTarget* d, Operand op) {
+  RvNativeTarget* a = rv_of(d->native);
+  NativeAddr addr;
+  memset(&addr, 0, sizeof addr);
+  if (op.kind == OPK_LOCAL) {
+    NativeLoc base = rv_reg_loc(op.type, NATIVE_REG_INT, RV_TMP1);
+    NativeAddr load;
+    memset(&load, 0, sizeof load);
+    load.base_kind = NATIVE_ADDR_BASE_FRAME;
+    load.base.frame = d->locals[op.v.local - 1u].home;
+    load.base_type = op.type;
+    rv_emit_mem(a, 1, base, load, rv_mem_for_type(d->native, op.type, 8));
+    addr.base_kind = NATIVE_ADDR_BASE_REG;
+    addr.base.reg = RV_TMP1;
+    addr.base_type = op.type;
+    return addr;
+  }
+  return rv_direct_materialize_addr(d, op);
+}
+
+static NativeAddr rv_direct_va_base(NativeDirectTarget* d, Operand ap_addr,
+                                    Reg reg) {
+  NativeLoc dst = rv_reg_loc(builtin_id(CFREE_CG_BUILTIN_I64), NATIVE_REG_INT, reg);
+  NativeAddr addr;
+  d->native->load_addr(d->native, dst, rv_direct_pointer_addr(d, ap_addr));
+  memset(&addr, 0, sizeof addr);
+  addr.base_kind = NATIVE_ADDR_BASE_REG;
+  addr.cls = NATIVE_REG_INT;
+  addr.base.reg = reg;
+  addr.base_type = builtin_id(CFREE_CG_BUILTIN_I64);
+  return addr;
+}
+
+static void rv_va_start_(NativeDirectTarget* d, Operand ap_addr) {
+  rv_va_start_core(rv_of(d->native), rv_direct_va_base(d, ap_addr, RV_TMP3));
+}
+static void rv_va_arg_(NativeDirectTarget* d, Operand dst, Operand ap_addr,
+                       CfreeCgTypeId type) {
+  RvNativeTarget* a = rv_of(d->native);
+  int is_fp = cg_type_is_float(d->base.c, type);
+  NativeLoc res = rv_reg_loc(type, is_fp ? NATIVE_REG_FP : NATIVE_REG_INT,
+                             is_fp ? RV_FTMP0 : RV_TMP0);
+  NativeAddr dst_addr;
+  rv_va_arg_core(a, res, rv_direct_va_base(d, ap_addr, RV_TMP3), type);
+  /* Store the fetched value back into the semantic destination. */
+  dst_addr = rv_direct_addr(d, dst);
+  if (dst_addr.base_kind == NATIVE_ADDR_BASE_FRAME_VALUE) {
+    NativeLoc base = rv_reg_loc(dst_addr.base_type, NATIVE_REG_INT, RV_TMP1);
+    NativeAddr load;
+    memset(&load, 0, sizeof load);
+    load.base_kind = NATIVE_ADDR_BASE_FRAME;
+    load.base.frame = dst_addr.base.frame;
+    load.base_type = dst_addr.base_type;
+    rv_emit_mem(a, 1, base, load, rv_mem_for_type(d->native, dst_addr.base_type, 8));
+    dst_addr.base_kind = NATIVE_ADDR_BASE_REG;
+    dst_addr.base.reg = RV_TMP1;
+  }
+  rv_emit_mem(a, 0, res, dst_addr,
+              rv_mem_for_type(d->native, type, rv_type_size(d->native, type)));
+}
+static void rv_va_end_(NativeDirectTarget* d, Operand ap_addr) {
+  (void)d;
+  (void)ap_addr;
+}
+static void rv_va_copy_(NativeDirectTarget* d, Operand dst, Operand src) {
+  RvNativeTarget* a = rv_of(d->native);
+  NativeAddr src_ap = rv_direct_va_base(d, src, RV_TMP0);
+  NativeAddr dst_ap = rv_direct_va_base(d, dst, RV_TMP3);
+  rv_va_copy_core(a, dst_ap, src_ap);
+}
+
+static void rv_direct_asm_block(NativeDirectTarget* d, const char* tmpl,
+                                const AsmConstraint* outs, u32 nout,
+                                Operand* out_ops, const AsmConstraint* ins,
+                                u32 nin, const Operand* in_ops,
+                                const Sym* clobbers, u32 nclob) {
+  RvNativeTarget* a = rv_of(d->native);
+  Compiler* c = d->base.c;
+  Operand* bound_outs = nout ? arena_zarray(c->tu, Operand, nout) : NULL;
+  Operand* bound_ins = nin ? arena_zarray(c->tu, Operand, nin) : NULL;
+  u32 clob_int, clob_fp, used_int, used_fp;
+  RvAsmSavedClobber* saved;
+  u32 nsaved, i;
+  Rv64Asm* asmh;
+
+  rv_asm_clobber_masks(c, d->loc, clobbers, nclob, &clob_int, &clob_fp);
+  /* Reserve emit scratch (t0/t1/t2/t3), sp/gp/tp/zero/ra and the frame pointer
+   * so the operand allocator never hands them out. */
+  used_int = clob_int | (1u << RV_ZERO) | (1u << RV_RA) | (1u << RV_SP) |
+             (1u << RV_GP) | (1u << RV_TP) | (1u << RV_TMP0) | (1u << RV_TMP1) |
+             (1u << RV_TMP2) | (1u << RV_TMP3) | (1u << RV_S0);
+  used_fp = clob_fp | (1u << RV_FTMP0) | (1u << RV_FTMP1) | (1u << 2u) |
+            (1u << 3u);
+
+  for (i = 0; i < nout; ++i) {
+    const char* body = rv_asm_constraint_body(outs[i].str);
+    CfreeCgTypeId type = outs[i].type ? outs[i].type : out_ops[i].type;
+    if (body[0] == 'r' || body[0] == 'f') {
+      NativeAllocClass cls = rv_asm_constraint_class(d, body);
+      Reg reg = rv_asm_alloc_reg(d, cls, &used_int, &used_fp);
+      rv_asm_bound_reg(&bound_outs[i], type, cls, reg);
+      if (outs[i].dir == ASM_INOUT)
+        rv_direct_load_operand_to_reg(d, out_ops[i], rv_reg_loc(type, cls, reg));
+    } else if (body[0] == 'm') {
+      Reg reg = rv_asm_alloc_reg(d, NATIVE_REG_INT, &used_int, &used_fp);
+      NativeLoc lloc =
+          rv_reg_loc(builtin_id(CFREE_CG_BUILTIN_I64), NATIVE_REG_INT, reg);
+      rv_direct_load_address_to_reg(d, out_ops[i], lloc);
+      rv_asm_bound_mem(&bound_outs[i], type, reg);
+    } else {
+      rv_asm_panic(d, "unsupported output constraint");
+    }
+  }
+
+  for (i = 0; i < nin; ++i) {
+    const char* body = rv_asm_constraint_body(ins[i].str);
+    int matched = rv_asm_match_index(body);
+    CfreeCgTypeId type = ins[i].type ? ins[i].type : in_ops[i].type;
+    if (matched >= 0) {
+      if ((u32)matched >= nout)
+        rv_asm_panic(d, "matching constraint out of range");
+      if (rv_asm_constraint_early(outs[matched].str))
+        rv_asm_panic(d, "matching input names early-clobber output");
+      if (bound_outs[matched].kind != RV64_INLINE_OPK_REG)
+        rv_asm_panic(d, "matching constraint requires register output");
+      bound_ins[i] = bound_outs[matched];
+      rv_direct_load_operand_to_reg(
+          d, in_ops[i],
+          rv_reg_loc(bound_ins[i].type,
+                     bound_ins[i].pad[0] == RV64_INLINE_OPCLS_FP ? NATIVE_REG_FP
+                                                                 : NATIVE_REG_INT,
+                     (Reg)bound_ins[i].v.local));
+      continue;
+    }
+    if (body[0] == 'r' || body[0] == 'f') {
+      NativeAllocClass cls = rv_asm_constraint_class(d, body);
+      Reg reg = rv_asm_alloc_reg(d, cls, &used_int, &used_fp);
+      rv_asm_bound_reg(&bound_ins[i], type, cls, reg);
+      rv_direct_load_operand_to_reg(d, in_ops[i], rv_reg_loc(type, cls, reg));
+    } else if (body[0] == 'i') {
+      if (in_ops[i].kind != OPK_IMM)
+        rv_asm_panic(d, "immediate constraint requires immediate operand");
+      bound_ins[i] = in_ops[i];
+    } else if (body[0] == 'm') {
+      Reg reg = rv_asm_alloc_reg(d, NATIVE_REG_INT, &used_int, &used_fp);
+      NativeLoc lloc =
+          rv_reg_loc(builtin_id(CFREE_CG_BUILTIN_I64), NATIVE_REG_INT, reg);
+      rv_direct_load_address_to_reg(d, in_ops[i], lloc);
+      rv_asm_bound_mem(&bound_ins[i], type, reg);
+    } else {
+      rv_asm_panic(d, "unsupported input constraint");
+    }
+  }
+
+  saved = rv_asm_save_callee_clobbers(a, clob_int, clob_fp, &nsaved);
+  asmh = rv64_asm_open(c);
+  rv64_inline_bind(asmh, outs, nout, bound_outs, ins, nin, bound_ins, clobbers,
+                   nclob);
+  rv64_asm_run_template(asmh, d->native->mc, tmpl);
+  rv64_asm_close(asmh);
+
+  for (i = 0; i < nout; ++i) {
+    NativeAllocClass cls;
+    NativeLoc src;
+    if (bound_outs[i].kind != RV64_INLINE_OPK_REG) continue;
+    cls = bound_outs[i].pad[0] == RV64_INLINE_OPCLS_FP ? NATIVE_REG_FP
+                                                       : NATIVE_REG_INT;
+    src = rv_reg_loc(bound_outs[i].type, cls, (Reg)bound_outs[i].v.local);
+    rv_direct_store_reg_to_operand(d, out_ops[i], src);
+  }
+  for (i = nsaved; i > 0; --i) rv_asm_restore_one(a, &saved[i - 1u]);
+}
+
+static const NativeOps rv_direct_ops = {
+    .bind_param = rv_bind_param,
+    .tail_call_unrealizable_reason = rv_no_tail,
+    .va_start_ = rv_va_start_,
+    .va_arg_ = rv_va_arg_,
+    .va_end_ = rv_va_end_,
+    .va_copy_ = rv_va_copy_,
+    .asm_block = rv_direct_asm_block,
+};
+
+const NativeOps* rv64_native_direct_ops(void) { return &rv_direct_ops; }
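
Review note on rv_direct_asm_block above: the operand allocator works from a "used" bitmask that starts as the clobber mask plus every register the backend must never hand out (zero, ra, sp, gp, tp, the emit scratch t0..t3, and the frame pointer s0). The sketch below shows that reserve-then-allocate pattern in isolation. It is not the code in this commit: the `sketch_*` names are hypothetical, the register numbers assume the standard RISC-V integer numbering (t0..t2 = x5..x7, s0 = x8, t3 = x28), and the lowest-free-first policy is an assumption, since the real policy lives in rv_asm_alloc_reg, which is not shown in this hunk.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register numbers for illustration only; they follow the
 * standard RISC-V integer register numbering, not this repo's regs.h. */
enum {
  SK_ZERO = 0, SK_RA = 1, SK_SP = 2, SK_GP = 3, SK_TP = 4,
  SK_T0 = 5, SK_T1 = 6, SK_T2 = 7, SK_S0 = 8, SK_T3 = 28
};

/* Build the initial "used" mask: caller-declared clobbers plus the fixed
 * registers and emit scratch that must never be bound to an asm operand. */
static uint32_t sketch_reserved_int_mask(uint32_t clobber_mask) {
  return clobber_mask |
         (1u << SK_ZERO) | (1u << SK_RA) | (1u << SK_SP) | (1u << SK_GP) |
         (1u << SK_TP) | (1u << SK_T0) | (1u << SK_T1) | (1u << SK_T2) |
         (1u << SK_T3) | (1u << SK_S0);
}

/* Hand out the lowest-numbered register whose bit is clear in *used and mark
 * it used. Returns -1 when every register is taken (assumed policy). */
static int sketch_alloc_reg(uint32_t* used) {
  assert(used != 0);
  for (int r = 0; r < 32; ++r) {
    uint32_t bit = (uint32_t)1 << r;
    if ((*used & bit) == 0) {
      *used |= bit;
      return r;
    }
  }
  return -1;
}
```

With no clobbers, the first two allocations in this sketch yield x9 and x10, because x0..x8 are all reserved. Adding x9 to the clobber mask makes the first allocation skip to x10, which is the same effect the clobber list has on operand binding in the real function.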
diff --git a/src/arch/rv64/ops.c b/src/arch/rv64/ops.c
@@ -1,2699 +0,0 @@
-/* src/arch/rv64/ops.c — data movement, arithmetic, calls, atomics, vtable. */
-
-#include "arch/rv64/asm.h"
-#include "arch/rv64/internal.h"
-#include "arch/rv64/regs.h"
-#include "cfree/config.h"
-#include "core/pool.h"
-#include "core/slice.h"
-
-/* ---- For a memory access of `nbytes`, pick the right store opcode. ---- */
-u32 enc_int_store(u32 nbytes, u32 src, u32 base, i32 off) {
-  switch (nbytes) {
-    case 1:
-      return rv_sb(src, base, off);
-    case 2:
-      return rv_sh(src, base, off);
-    case 4:
-      return rv_sw(src, base, off);
-    default:
-      return rv_sd(src, base, off);
-  }
-}
-u32 enc_int_load(u32 nbytes, int sign_ext, u32 rd, u32 base, i32 off) {
-  switch (nbytes) {
-    case 1:
-      return sign_ext ? rv_lb(rd, base, off) : rv_lbu(rd, base, off);
-    case 2:
-      return sign_ext ? rv_lh(rd, base, off) : rv_lhu(rd, base, off);
-    case 4:
-      return sign_ext ? rv_lw(rd, base, off) : rv_lwu(rd, base, off);
-    default:
-      return rv_ld(rd, base, off);
-  }
-}
-
-/* ---- data movement ---- */
-
-static void rv_load_imm(CGTarget* t, Operand dst, i64 imm) {
-  u32 sf = type_is_64(dst.type) ? 1u : 0u;
-  rv64_emit_load_imm(t->mc, sf, reg_num(dst), imm);
-}
-
-static void rv_load_const(CGTarget* t, Operand dst, ConstBytes cb) {
-  RImpl* a = impl_of(t);
-  if (dst.cls != RC_FP) {
-    compiler_panic(t->c, a->loc, "rv64 load_const: only FP supported in v1");
-  }
-  Sym ro_name = pool_intern_slice(t->c->global, SLICE_LIT(".rodata"));
-  ObjSecId ro = obj_section(t->obj, ro_name, SEC_RODATA, SF_ALLOC, 1u);
-
-  u32 cur_section = t->mc->section_id;
-  t->mc->set_section(t->mc, ro);
-  u32 ro_off = obj_align_to(t->obj, ro, cb.align ? cb.align : 4);
-  t->mc->emit_bytes(t->mc, cb.bytes, cb.size);
-
-  char namebuf[64];
-  static u32 lit_seq = 0;
-  int len = 0;
-  {
-    const char* prefix = ".LCFP";
-    for (; prefix[len]; ++len) namebuf[len] = prefix[len];
-    u32 v = lit_seq++;
-    char tmp[16];
-    int tn = 0;
-    if (v == 0)
-      tmp[tn++] = '0';
-    else {
-      while (v) {
-        tmp[tn++] = '0' + (char)(v % 10);
-        v /= 10;
-      }
-    }
-    for (int i = tn - 1; i >= 0; --i) namebuf[len++] = tmp[i];
-    namebuf[len] = 0;
-  }
-  Sym sname = pool_intern_slice(t->c->global, slice_from_cstr(namebuf));
-  ObjSymId sym = obj_symbol(t->obj, sname, SB_LOCAL, SK_OBJ, ro, (u64)ro_off,
-                            (u64)cb.size);
-  t->mc->set_section(t->mc, cur_section);
-
-  /* auipc t0, %pcrel_hi(sym) ; flw/fld dst, %pcrel_lo(...)(t0)
-   * The LO12_I reloc references the AUIPC's site address (a label/sym
-   * placed at the AUIPC). For simplicity we make a local symbol at the
-   * AUIPC and bind LO12_I to it. */
-  u32 sec = t->mc->section_id;
-  u32 auipc_pos = t->mc->pos(t->mc);
-  rv64_emit32(t->mc, rv_auipc(RV_T0, 0));
-  t->mc->emit_reloc_at(t->mc, sec, auipc_pos, R_RV_PCREL_HI20, sym, 0, 0, 0);
-  /* Create a local symbol at the AUIPC site to anchor PCREL_LO12. */
-  char anchor_buf[64];
-  int al = 0;
-  {
-    const char* p2 = ".LpcrelHi";
-    for (; p2[al]; ++al) anchor_buf[al] = p2[al];
-    static u32 seq2 = 0;
-    u32 v = seq2++;
-    char tmp[16];
-    int tn = 0;
-    if (v == 0)
-      tmp[tn++] = '0';
-    else {
-      while (v) {
-        tmp[tn++] = '0' + (char)(v % 10);
-        v /= 10;
-      }
-    }
-    for (int i = tn - 1; i >= 0; --i) anchor_buf[al++] = tmp[i];
-    anchor_buf[al] = 0;
-  }
-  Sym aname = pool_intern_slice(t->c->global, slice_from_cstr(anchor_buf));
-  ObjSymId anchor =
-      obj_symbol(t->obj, aname, SB_LOCAL, SK_OBJ, sec, (u64)auipc_pos, 0);
-  u32 lpos = t->mc->pos(t->mc);
-  if (cb.size == 8) {
-    rv64_emit32(t->mc, rv_fld(reg_num(dst), RV_T0, 0));
-  } else {
-    rv64_emit32(t->mc, rv_flw(reg_num(dst), RV_T0, 0));
-  }
-  t->mc->emit_reloc_at(t->mc, sec, lpos, R_RV_PCREL_LO12_I, anchor, 0, 0, 0);
-}
-
-static void rv_copy(CGTarget* t, Operand dst, Operand src) {
-  if (dst.cls == RC_FP && src.cls == RC_FP) {
-    u32 fmt = type_is_fp_double(dst.type) ? RV_FMT_D : RV_FMT_S;
-    /* fmv.fmt rd, rs  = fsgnj.fmt rd, rs, rs */
-    u32 r = reg_num(src);
-    rv64_emit32(t->mc, rv_fsgnj(fmt, reg_num(dst), r, r));
-    return;
-  }
-  if (dst.cls == RC_INT && src.cls == RC_FP) {
-    /* Variadic FP arg routed to an integer a-reg per RV64 LP64D psABI:
-     * bitcast FP -> INT via FMV.X.{D,W}. Width is determined by the FP
-     * source's type (the dst's integer type is the carrier, not the value). */
-    u32 sz = type_byte_size(src.type);
-    rv64_emit32(t->mc, (sz == 8) ? rv_fmv_x_d(reg_num(dst), reg_num(src))
-                                 : rv_fmv_x_w(reg_num(dst), reg_num(src)));
-    return;
-  }
-  if (dst.cls == RC_FP && src.cls == RC_INT) {
-    /* Reverse direction: INT bitpattern back into an FP register. */
-    u32 sz = type_byte_size(dst.type);
-    rv64_emit32(t->mc, (sz == 8) ? rv_fmv_d_x(reg_num(dst), reg_num(src))
-                                 : rv_fmv_w_x(reg_num(dst), reg_num(src)));
-    return;
-  }
-  /* mv rd, rs = addi rd, rs, 0  (works for both 32 and 64-bit copies) */
-  rv64_emit32(t->mc, rv_addi(reg_num(dst), reg_num(src), 0));
-}
-
-/* ---- address resolution ---- */
-
-/* Materialize the address of `addr` (LOCAL or INDIRECT) into a
- * base-register + signed-offset pair, possibly using `tmp_reg` when the
- * raw offset exceeds the imm[11:0] range. The returned tuple carries an
- * optional index (`REG_NONE` for "no index"); rv64 has no indexed loads
- * or stores even with Zba, so callers must have already folded any index
- * away (load/store do this via rv_fold_indexed). OPK_GLOBAL is not
- * handled here — its callers emit AUIPC + an LO12 reloc on the load/store
- * directly. */
-RvAddrMode addr_mode(CGTarget* t, Operand addr, u32 tmp_reg) {
-  RImpl* a = impl_of(t);
-  RvAddrMode am = {0};
-  am.index = REG_NONE;
-  if (addr.kind == OPK_LOCAL) {
-    RvSlot* s = rv64_slot_get(a, addr.v.frame_slot);
-    if (!s) compiler_panic(t->c, a->loc, "rv64 addr_mode: bad slot");
-    i32 off = -(i32)s->off;
-    if (off >= -2048 && off <= 2047) {
-      am.base = RV_S0;
-      am.ofs = off;
-      return am;
-    }
-    rv64_emit_load_imm(t->mc, 1, tmp_reg, (i64)off);
-    rv64_emit32(t->mc, rv_add(tmp_reg, RV_S0, tmp_reg));
-    am.base = tmp_reg;
-    am.ofs = 0;
-    return am;
-  }
-  if (addr.kind == OPK_INDIRECT) {
-    /* This helper does not encode an index — rv64 has no indexed
-     * load/store even with Zba. Load/store fold the index via
-     * rv_fold_indexed before calling here; all other paths take
-     * pointer-only operands. */
-    if (addr.v.ind.index != REG_NONE) {
-      compiler_panic(t->c, a->loc,
-                     "rv64 addr_mode: indexed addressing not supported here "
-                     "(caller must fold via rv_fold_indexed)");
-    }
-    i32 off = addr.v.ind.ofs;
-    u32 base = addr.v.ind.base & 0x1f;
-    if (off >= -2048 && off <= 2047) {
-      am.base = base;
-      am.ofs = off;
-      return am;
-    }
-    rv64_emit_load_imm(t->mc, 1, tmp_reg, (i64)off);
-    rv64_emit32(t->mc, rv_add(tmp_reg, base, tmp_reg));
-    am.base = tmp_reg;
-    am.ofs = 0;
-    return am;
-  }
-  compiler_panic(t->c, a->loc, "rv64 addr_mode: kind %d unsupported",
-                 (int)addr.kind);
-}
-
-int rv64_use_got_for_sym(CGTarget* t, ObjSymId sym) {
-  return obj_symbol_extern_via_got(t->c, t->obj, sym);
-}
-
-/* Anchor symbol management for PCREL_LO12_*. Each AUIPC site gets a
- * fresh local sym; the paired LO12 reloc references the anchor. */
-ObjSymId emit_pcrel_anchor(CGTarget* t, u32 sec, u32 auipc_pos) {
-  char buf[64];
-  int len = 0;
-  const char* p = ".LpcrelHi";
-  for (; p[len]; ++len) buf[len] = p[len];
-  static u32 seq = 0;
-  u32 v = seq++;
-  char tmp[16];
-  int tn = 0;
-  if (v == 0)
-    tmp[tn++] = '0';
-  else {
-    while (v) {
-      tmp[tn++] = '0' + (char)(v % 10);
-      v /= 10;
-    }
-  }
-  for (int i = tn - 1; i >= 0; --i) buf[len++] = tmp[i];
-  buf[len] = 0;
-  Sym n = pool_intern_slice(t->c->global, slice_from_cstr(buf));
-  return obj_symbol(t->obj, n, SB_LOCAL, SK_OBJ, sec, (u64)auipc_pos, 0);
-}
-
-/* Emit `auipc dst, %got_pcrel_hi(sym) ; ld dst, %pcrel_lo(.)(dst)`,
- * leaving the runtime address of `sym` (the GOT slot's contents) in
- * `dst_reg`. Addends are omitted from the GOT relocs — most loaders
- * disallow nonzero addends on GOT-load fixups — so callers apply any
- * displacement with a follow-on ADDI/ADD against the loaded base. */
-void rv64_emit_got_load_addr(CGTarget* t, u32 dst_reg, ObjSymId sym) {
-  MCEmitter* mc = t->mc;
-  u32 sec = mc->section_id;
-  u32 ap = mc->pos(mc);
-  rv64_emit32(mc, rv_auipc(dst_reg, 0));
-  mc->emit_reloc_at(mc, sec, ap, R_RV_GOT_HI20, sym, 0, 0, 0);
-  ObjSymId anchor = emit_pcrel_anchor(t, sec, ap);
-  u32 lp = mc->pos(mc);
-  rv64_emit32(mc, rv_ld(dst_reg, dst_reg, 0));
-  mc->emit_reloc_at(mc, sec, lp, R_RV_PCREL_LO12_I, anchor, 0, 0, 0);
-}
-
-/* Add a signed displacement `off` to `base`, writing into `rd`. Uses
- * ADDI for ±2047, otherwise materializes the offset via rv64_emit_load_imm
- * + ADD. Mirrors rv64_emit_addr_adjust in aarch64.c. */
-void rv64_emit_addr_adjust(MCEmitter* mc, u32 rd, u32 base, i32 off) {
-  if (off == 0) {
-    if (rd != base) rv64_emit32(mc, rv_addi(rd, base, 0));
-    return;
-  }
-  if (off >= -2048 && off <= 2047) {
-    rv64_emit32(mc, rv_addi(rd, base, off));
-    return;
-  }
-  rv64_emit_load_imm(mc, 1, RV_T1, (i64)off);
-  rv64_emit32(mc, rv_add(rd, base, RV_T1));
-}
-
-/* Fold an indexed OPK_INDIRECT into a plain base+disp by emitting one Zba
- * `sh{1,2,3}add` (or a plain `add` when log2_scale == 0) into `scratch`.
- * Returns an OPK_INDIRECT(scratch, ofs) with `index = REG_NONE`. When the
- * input has no index the operand is returned unchanged. Zba is assumed
- * available on rv64 targets — no feature gate. */
-static Operand rv_fold_indexed(CGTarget* t, Operand addr, u32 scratch) {
-  if (addr.kind != OPK_INDIRECT || addr.v.ind.index == REG_NONE) return addr;
-  u32 base = addr.v.ind.base & 0x1fu;
-  u32 index = addr.v.ind.index & 0x1fu;
-  u8 s = addr.v.ind.log2_scale;
-  MCEmitter* mc = t->mc;
-  /* sh{1,2,3}add rd, rs1, rs2 = (rs1 << s) + rs2, so rs1=index, rs2=base. */
-  switch (s) {
-    case 0:
-      rv64_emit32(mc, rv_add(scratch, base, index));
-      break;
-    case 1:
-      rv64_emit32(mc, rv_sh1add(scratch, index, base));
-      break;
-    case 2:
-      rv64_emit32(mc, rv_sh2add(scratch, index, base));
-      break;
-    case 3:
-      rv64_emit32(mc, rv_sh3add(scratch, index, base));
-      break;
-    default:
-      compiler_panic(t->c, impl_of(t)->loc,
-                     "rv64 rv_fold_indexed: bad log2_scale %u", (u32)s);
-  }
-  addr.v.ind.base = scratch;
-  addr.v.ind.index = REG_NONE;
-  addr.v.ind.log2_scale = 0;
-  return addr;
-}
-
-void rv_load(CGTarget* t, Operand dst, Operand addr, MemAccess ma) {
-  u32 sz = ma.size ? ma.size : type_byte_size(addr.type);
-  MCEmitter* mc = t->mc;
-
-  if (addr.kind == OPK_GLOBAL) {
-    u32 sec = mc->section_id;
-    ObjSymId sym = addr.v.global.sym;
-    i64 add = addr.v.global.addend;
-    /* Extern-via-GOT path: load &sym from GOT, then load the value at
-     * +addend (addend baked into the data load's imm12; relies on the
-     * common case of `add` fitting ±2047 — larger addends would need a
-     * follow-on ADD). */
-    if (rv64_use_got_for_sym(t, sym)) {
-      rv64_emit_got_load_addr(t, RV_T0, sym);
-      i32 ao = (i32)add;
-      if (dst.cls == RC_FP) {
-        if (sz == 8)
-          rv64_emit32(mc, rv_fld(reg_num(dst), RV_T0, ao));
-        else
-          rv64_emit32(mc, rv_flw(reg_num(dst), RV_T0, ao));
-      } else {
-        int sx = type_is_signed(addr.type);
-        rv64_emit32(mc, enc_int_load(sz, sx, reg_num(dst), RV_T0, ao));
-      }
-      return;
-    }
-    u32 ap = mc->pos(mc);
-    rv64_emit32(mc, rv_auipc(RV_T0, 0));
-    mc->emit_reloc_at(mc, sec, ap, R_RV_PCREL_HI20, sym, add, 0, 0);
-    ObjSymId anchor = emit_pcrel_anchor(t, sec, ap);
-    u32 lp = mc->pos(mc);
-    if (dst.cls == RC_FP) {
-      if (sz == 8)
-        rv64_emit32(mc, rv_fld(reg_num(dst), RV_T0, 0));
-      else
-        rv64_emit32(mc, rv_flw(reg_num(dst), RV_T0, 0));
-    } else {
-      int sx = type_is_signed(addr.type);
-      rv64_emit32(mc, enc_int_load(sz, sx, reg_num(dst), RV_T0, 0));
-    }
-    mc->emit_reloc_at(mc, sec, lp, R_RV_PCREL_LO12_I, anchor, 0, 0, 0);
-    return;
-  }
-
-  /* Fold any index via Zba sh{1,2,3}add into RV_T0 first; addr_mode then
-   * sees a plain base+disp. */
-  addr = rv_fold_indexed(t, addr, RV_T0);
-  RvAddrMode am = addr_mode(t, addr, RV_T0);
-  if (dst.cls == RC_FP) {
-    if (sz == 8)
-      rv64_emit32(mc, rv_fld(reg_num(dst), am.base, am.ofs));
-    else
-      rv64_emit32(mc, rv_flw(reg_num(dst), am.base, am.ofs));
-  } else {
-    int sx = type_is_signed(addr.type);
-    rv64_emit32(mc, enc_int_load(sz, sx, reg_num(dst), am.base, am.ofs));
-  }
-}
-
-void rv_store(CGTarget* t, Operand addr, Operand src, MemAccess ma) {
-  u32 sz = ma.size ? ma.size : type_byte_size(addr.type);
-  MCEmitter* mc = t->mc;
-
-  if (addr.kind == OPK_GLOBAL) {
-    u32 sec = mc->section_id;
-    ObjSymId sym = addr.v.global.sym;
-    i64 add = addr.v.global.addend;
-    u32 src_reg;
-    int src_fp = 0;
-    if (src.kind == OPK_IMM) {
-      u32 sf = (sz == 8) ? 1u : 0u;
-      rv64_emit_load_imm(mc, sf, RV_T1, src.v.imm);
-      src_reg = RV_T1;
-    } else if (src.cls == RC_FP) {
-      src_reg = reg_num(src);
-      src_fp = 1;
-    } else {
-      src_reg = reg_num(src);
-    }
-    /* Extern-via-GOT path: load &sym from GOT into t0, then store with
-     * addend baked into the imm12 (no reloc on the store). */
-    if (rv64_use_got_for_sym(t, sym)) {
-      rv64_emit_got_load_addr(t, RV_T0, sym);
-      i32 ao = (i32)add;
-      if (src_fp) {
-        if (sz == 8)
-          rv64_emit32(mc, rv_fsd(src_reg, RV_T0, ao));
-        else
-          rv64_emit32(mc, rv_fsw(src_reg, RV_T0, ao));
-      } else {
-        rv64_emit32(mc, enc_int_store(sz, src_reg, RV_T0, ao));
-      }
-      return;
-    }
-    u32 ap = mc->pos(mc);
-    rv64_emit32(mc, rv_auipc(RV_T0, 0));
-    mc->emit_reloc_at(mc, sec, ap, R_RV_PCREL_HI20, sym, add, 0, 0);
-    ObjSymId anchor = emit_pcrel_anchor(t, sec, ap);
-    u32 sp_pos = mc->pos(mc);
-    if (src_fp) {
-      if (sz == 8)
-        rv64_emit32(mc, rv_fsd(src_reg, RV_T0, 0));
-      else
-        rv64_emit32(mc, rv_fsw(src_reg, RV_T0, 0));
-    } else {
-      rv64_emit32(mc, enc_int_store(sz, src_reg, RV_T0, 0));
-    }
-    mc->emit_reloc_at(mc, sec, sp_pos, R_RV_PCREL_LO12_S, anchor, 0, 0, 0);
-    return;
-  }
-
-  /* Fold any index into a scratch via Zba sh{1,2,3}add. RV_T0 stays free
-   * for the IMM-src temporary in the OPK_IMM branch below, so route the
-   * fold scratch to RV_T1 in that case; the index-fold scratch matches
-   * addr_mode's tmp_reg. */
-  u32 addr_tmp = (src.kind == OPK_IMM) ? RV_T1 : RV_T0;
-  addr = rv_fold_indexed(t, addr, addr_tmp);
-  RvAddrMode am = addr_mode(t, addr, addr_tmp);
-  if (src.kind == OPK_IMM) {
-    u32 sf = (sz == 8) ? 1u : 0u;
-    rv64_emit_load_imm(mc, sf, RV_T0, src.v.imm);
-    rv64_emit32(mc, enc_int_store(sz, RV_T0, am.base, am.ofs));
-    return;
-  }
-  if (src.cls == RC_FP) {
-    if (sz == 8)
-      rv64_emit32(mc, rv_fsd(reg_num(src), am.base, am.ofs));
-    else
-      rv64_emit32(mc, rv_fsw(reg_num(src), am.base, am.ofs));
-  } else {
-    rv64_emit32(mc, enc_int_store(sz, reg_num(src), am.base, am.ofs));
-  }
-}
-
-static void rv_addr_of(CGTarget* t, Operand dst, Operand lv) {
-  RImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-  u32 rd = reg_num(dst);
-  if (lv.kind == OPK_LOCAL) {
-    RvSlot* s = rv64_slot_get(a, lv.v.frame_slot);
-    if (!s) compiler_panic(t->c, a->loc, "rv64 addr_of: bad slot");
-    i32 off = -(i32)s->off;
-    if (off >= -2048 && off <= 2047) {
-      rv64_emit32(mc, rv_addi(rd, RV_S0, off));
-    } else {
-      rv64_emit_load_imm(mc, 1, rd, (i64)off);
-      rv64_emit32(mc, rv_add(rd, RV_S0, rd));
-    }
-    return;
-  }
-  if (lv.kind == OPK_INDIRECT) {
-    if (lv.v.ind.index != REG_NONE) {
-      compiler_panic(t->c, a->loc,
-                     "rv64 addr_of: indexed INDIRECT not supported");
-    }
-    i32 ofs = lv.v.ind.ofs;
-    u32 base = lv.v.ind.base & 0x1f;
-    if (ofs >= -2048 && ofs <= 2047) {
-      rv64_emit32(mc, rv_addi(rd, base, ofs));
-    } else {
-      rv64_emit_load_imm(mc, 1, rd, (i64)ofs);
-      rv64_emit32(mc, rv_add(rd, base, rd));
-    }
-    return;
-  }
-  if (lv.kind == OPK_GLOBAL) {
-    ObjSymId sym = lv.v.global.sym;
-    i64 addend = lv.v.global.addend;
-    /* Extern-via-GOT path: GOT load yields &sym directly; apply any
-     * addend with a follow-on ADDI/ADD (GOT relocs disallow addends). */
-    if (rv64_use_got_for_sym(t, sym)) {
-      rv64_emit_got_load_addr(t, rd, sym);
-      if (addend) rv64_emit_addr_adjust(mc, rd, rd, (i32)addend);
-      return;
-    }
-    u32 sec = mc->section_id;
-    u32 ap = mc->pos(mc);
-    rv64_emit32(mc, rv_auipc(rd, 0));
-    mc->emit_reloc_at(mc, sec, ap, R_RV_PCREL_HI20, sym, addend, 0, 0);
-    ObjSymId anchor = emit_pcrel_anchor(t, sec, ap);
-    u32 ip = mc->pos(mc);
-    rv64_emit32(mc, rv_addi(rd, rd, 0));
-    mc->emit_reloc_at(mc, sec, ip, R_RV_PCREL_LO12_I, anchor, 0, 0, 0);
-    return;
-  }
-  rv_panic(t, "addr_of");
-}
-
-static void rv_tls_addr_of(CGTarget* t, Operand dst, ObjSymId sym, i64 addend) {
-  /* RV64 TLS lowering.
-   *
-   * Two models are exposed; the choice is driven by symbol locality:
-   *
-   *   Local-Exec (LE): for TU-local TLS symbols. Emits the 3-insn
-   *     `lui + add + addi` sequence with R_RV_TPREL_HI20 /
-   *     R_RV_TPREL_LO12_I; the linker resolves them against the symbol's
-   *     tp-relative offset at link time.
-   *
-   *   Initial-Exec (IE): for externally-defined TLS symbols accessed
-   *     from an executable. Emits `auipc + ld + add` with the new
-   *     R_RV_TLS_GOT_HI20 / R_RV_PCREL_LO12_I pair; the LD loads
-   *     (&sym - tp) from the GOT and the ADD applies tp.
-   *
-   * The IE encoding requires either a real GOT entry (dynamic link) or
-   * a link-time IE->LE relaxation (static link). The reloc plumbing
-   * lives in src/obj + src/link; corpus TLS coverage stays exclusively
-   * on the LE side until that linker piece lands. The IE branch below
-   * is wired through `rv64_use_got_for_sym` so it activates only when
-   * the symbol would otherwise have used the regular GOT path.
-   *
-   * General-Dynamic and TLS-Descriptor models are deferred. */
-  MCEmitter* mc = t->mc;
-  u32 sec = mc->section_id;
-  u32 rd = reg_num(dst);
-
-  if (rv64_use_got_for_sym(t, sym)) {
-    /* Initial-Exec: auipc t0, %tls_ie_pcrel_hi(sym)
-     *               ld   t0, %pcrel_lo(.Ltmp)(t0)
-     *               add  dst, tp, t0
-     * The PCREL_LO12 reloc binds to a fresh anchor pointing at the
-     * AUIPC, mirroring the regular extern-via-GOT lowering. Any addend
-     * is applied after the GOT load (GOT relocs disallow addends). */
-    u32 ap = mc->pos(mc);
-    rv64_emit32(mc, rv_auipc(RV_T0, 0));
-    mc->emit_reloc_at(mc, sec, ap, R_RV_TLS_GOT_HI20, sym, 0, 0, 0);
-    ObjSymId anchor = emit_pcrel_anchor(t, sec, ap);
-    u32 ip = mc->pos(mc);
-    rv64_emit32(mc, rv_ld(RV_T0, RV_T0, 0));
-    mc->emit_reloc_at(mc, sec, ip, R_RV_PCREL_LO12_I, anchor, 0, 0, 0);
-    rv64_emit32(mc, rv_add(rd, RV_TP, RV_T0));
-    if (addend) rv64_emit_addr_adjust(mc, rd, rd, (i32)addend);
-    return;
-  }
-
-  /* Local-Exec: lui + add + addi. */
-  u32 hp = mc->pos(mc);
-  rv64_emit32(mc, rv_lui(RV_T0, 0));
-  mc->emit_reloc_at(mc, sec, hp, R_RV_TPREL_HI20, sym, addend, 0, 0);
-  rv64_emit32(mc, rv_add(RV_T0, RV_TP, RV_T0));
-  u32 lp = mc->pos(mc);
-  rv64_emit32(mc, rv_addi(rd, RV_T0, 0));
-  mc->emit_reloc_at(mc, sec, lp, R_RV_TPREL_LO12_I, sym, addend, 0, 0);
-}
-
-/* ---- aggregate ops ---- */
-
-u32 agg_addr_reg(CGTarget* t, Operand op, u32 scratch) {
-  RImpl* a = impl_of(t);
-  if (op.kind == OPK_REG) return reg_num(op);
-  if (op.kind == OPK_LOCAL) {
-    RvSlot* s = rv64_slot_get(a, op.v.frame_slot);
-    if (!s) compiler_panic(t->c, a->loc, "rv64 agg: bad slot");
-    i32 off = -(i32)s->off;
-    if (off >= -2048 && off <= 2047) {
-      rv64_emit32(t->mc, rv_addi(scratch, RV_S0, off));
-    } else {
-      rv64_emit_load_imm(t->mc, 1, scratch, (i64)off);
-      rv64_emit32(t->mc, rv_add(scratch, RV_S0, scratch));
-    }
-    return scratch;
-  }
-  compiler_panic(t->c, a->loc, "rv64 agg: address kind %d unsupported",
-                 (int)op.kind);
-}
-
-static void rv_copy_bytes(CGTarget* t, Operand dst_addr, Operand src_addr,
-                          AggregateAccess agg) {
-  MCEmitter* mc = t->mc;
-  u32 dr = agg_addr_reg(t, dst_addr, RV_T0);
-  u32 sr = agg_addr_reg(t, src_addr, (dr == RV_T1) ? RV_T2 : RV_T1);
-  u32 n = agg.size;
-  u32 i = 0;
-  while (i + 8 <= n) {
-    rv64_emit32(mc, rv_ld(RV_T3, sr, (i32)i));
-    rv64_emit32(mc, rv_sd(RV_T3, dr, (i32)i));
-    i += 8;
-  }
-  while (i + 4 <= n) {
-    rv64_emit32(mc, rv_lwu(RV_T3, sr, (i32)i));
-    rv64_emit32(mc, rv_sw(RV_T3, dr, (i32)i));
-    i += 4;
-  }
-  while (i + 2 <= n) {
-    rv64_emit32(mc, rv_lhu(RV_T3, sr, (i32)i));
-    rv64_emit32(mc, rv_sh(RV_T3, dr, (i32)i));
-    i += 2;
-  }
-  while (i < n) {
-    rv64_emit32(mc, rv_lbu(RV_T3, sr, (i32)i));
-    rv64_emit32(mc, rv_sb(RV_T3, dr, (i32)i));
-    i += 1;
-  }
-}
-
-static void rv_set_bytes(CGTarget* t, Operand dst_addr, Operand byte_value,
-                         AggregateAccess agg) {
-  MCEmitter* mc = t->mc;
-  u32 dr = agg_addr_reg(t, dst_addr, RV_T0);
-  u32 byte;
-  if (byte_value.kind == OPK_IMM) {
-    byte = (u32)(byte_value.v.imm & 0xffu);
-  } else {
-    compiler_panic(t->c, impl_of(t)->loc, "rv64 set_bytes: REG byte NYI");
-  }
-  u32 n = agg.size;
-  u32 src;
-  if (byte == 0) {
-    src = RV_ZERO;
-  } else {
-    u64 b = byte;
-    b |= b << 8;
-    b |= b << 16;
-    b |= b << 32;
-    rv64_emit_load_imm(mc, 1, RV_T3, (i64)b);
-    src = RV_T3;
-  }
-  u32 i = 0;
-  while (i + 8 <= n) {
-    rv64_emit32(mc, rv_sd(src, dr, (i32)i));
-    i += 8;
-  }
-  while (i + 4 <= n) {
-    rv64_emit32(mc, rv_sw(src, dr, (i32)i));
-    i += 4;
-  }
-  while (i + 2 <= n) {
-    rv64_emit32(mc, rv_sh(src, dr, (i32)i));
-    i += 2;
-  }
-  while (i < n) {
-    rv64_emit32(mc, rv_sb(src, dr, (i32)i));
-    i += 1;
-  }
-}
-
-static void rv_bitfield_load(CGTarget* t, Operand dst, Operand record_addr,
-                             BitFieldAccess bf) {
-  MCEmitter* mc = t->mc;
-  u32 base = agg_addr_reg(t, record_addr, RV_T0);
-  u32 storage_bytes = bf.storage.size ? bf.storage.size : 4u;
-  u32 rd = reg_num(dst);
-  /* Load full storage unit (zero-ext for shifts). */
-  rv64_emit32(mc,
-              enc_int_load(storage_bytes, 0, rd, base, (i32)bf.storage_offset));
-  /* Shift left by (XLEN - (bit_offset + bit_width)), then right-shift by
-   * (XLEN - bit_width): arithmetic if the field is signed, logical
-   * otherwise. Use 64-bit shifts. */
-  u32 lsb = bf.bit_offset;
-  u32 width = bf.bit_width ? bf.bit_width : 1u;
-  u32 sh_left = 64u - (lsb + width);
-  u32 sh_right = 64u - width;
-  rv64_emit32(mc, rv_slli(rd, rd, sh_left));
-  if (bf.signed_)
-    rv64_emit32(mc, rv_srai(rd, rd, sh_right));
-  else
-    rv64_emit32(mc, rv_srli(rd, rd, sh_right));
-}
-
-static void rv_bitfield_store(CGTarget* t, Operand record_addr, Operand src,
-                              BitFieldAccess bf) {
-  MCEmitter* mc = t->mc;
-  u32 base = agg_addr_reg(t, record_addr, RV_T0);
-  u32 storage_bytes = bf.storage.size ? bf.storage.size : 4u;
-  /* Load current value into t1 */
-  rv64_emit32(
-      mc, enc_int_load(storage_bytes, 0, RV_T1, base, (i32)bf.storage_offset));
-  u32 src_reg;
-  if (src.kind == OPK_IMM) {
-    rv64_emit_load_imm(mc, 1, RV_T2, src.v.imm);
-    src_reg = RV_T2;
-  } else if (src.kind == OPK_REG) {
-    src_reg = reg_num(src);
-  } else {
-    compiler_panic(t->c, impl_of(t)->loc,
-                   "rv64 bitfield_store: src kind %d NYI", (int)src.kind);
-  }
-  u32 lsb = bf.bit_offset;
-  u32 width = bf.bit_width ? bf.bit_width : 1u;
-  /* mask = ((1 << width) - 1) << lsb */
-  u64 mask = ((u64)1 << width) - 1u;
-  /* t3 = src & ((1<<width)-1), then shifted to lsb */
-  rv64_emit_load_imm(mc, 1, RV_T3, (i64)mask);
-  rv64_emit32(mc, rv_and(RV_T3, src_reg, RV_T3));
-  if (lsb) rv64_emit32(mc, rv_slli(RV_T3, RV_T3, lsb));
-  /* clear the field bits in t1: andi or and-not pattern */
-  u64 mask_in = mask << lsb;
-  rv64_emit_load_imm(mc, 1, RV_T2, (i64)~mask_in);
-  rv64_emit32(mc, rv_and(RV_T1, RV_T1, RV_T2));
-  rv64_emit32(mc, rv_or(RV_T1, RV_T1, RV_T3));
-  rv64_emit32(
-      mc, enc_int_store(storage_bytes, RV_T1, base, (i32)bf.storage_offset));
-}
-
-/* ---- arithmetic ---- */
-
-static void rv_binop(CGTarget* t, BinOp op, Operand dst, Operand a_op,
-                     Operand b_op) {
-  MCEmitter* mc = t->mc;
-  if (op == BO_FADD || op == BO_FSUB || op == BO_FMUL || op == BO_FDIV) {
-    u32 fmt = type_is_fp_double(dst.type) ? RV_FMT_D : RV_FMT_S;
-    u32 rd = reg_num(dst);
-    u32 fa = reg_num(a_op);
-    u32 fb = reg_num(b_op);
-    switch (op) {
-      case BO_FADD:
-        rv64_emit32(mc, rv_fadd(fmt, rd, fa, fb));
-        return;
-      case BO_FSUB:
-        rv64_emit32(mc, rv_fsub(fmt, rd, fa, fb));
-        return;
-      case BO_FMUL:
-        rv64_emit32(mc, rv_fmul(fmt, rd, fa, fb));
-        return;
-      case BO_FDIV:
-        rv64_emit32(mc, rv_fdiv(fmt, rd, fa, fb));
-        return;
-      default:
-        break;
-    }
-  }
-  u32 sf = type_is_64(dst.type) ? 1u : 0u;
-  u32 rd = reg_num(dst);
-
-  /* Canonicalize IMM to the RHS for commutative ops so the imm-form
-   * check below handles `3 + a` the same as `a + 3`. ISUB is not
-   * commutative — IMM-on-LHS still materializes. */
-  switch (op) {
-    case BO_IADD:
-    case BO_AND:
-    case BO_OR:
-    case BO_XOR: {
-      if (a_op.kind == OPK_IMM && b_op.kind != OPK_IMM) {
-        Operand t_op = a_op;
-        a_op = b_op;
-        b_op = t_op;
-      }
-      break;
-    }
-    default:
-      break;
-  }
-
-  /* IMM-form fast paths. RV-I admits a 12-bit signed immediate for
-   * ADDI/ANDI/ORI/XORI/SLTI/SLTIU (range [-2048, 2047]). ISUB has no
-   * SUBI; it is encoded as ADDI with the negated literal when -imm
-   * fits the same range (i.e., imm ∈ [-2047, 2048]; imm == -2048 is
-   * intentionally excluded since its negation, 2048, does not fit in
-   * 12 bits). Shifts admit a shamt: 6 bits (0..63) on the 64-bit forms,
-   * 5 bits (0..31) on the W-variants. */
-  if (b_op.kind == OPK_IMM && a_op.kind != OPK_IMM) {
-    u32 ra = reg_num(a_op);
-    i64 imm = b_op.v.imm;
-    int fits12 = imm >= -2048 && imm <= 2047;
-    switch (op) {
-      case BO_IADD:
-        if (fits12) {
-          rv64_emit32(
-              mc, sf ? rv_addi(rd, ra, (i32)imm) : rv_addiw(rd, ra, (i32)imm));
-          return;
-        }
-        break;
-      case BO_ISUB:
-        if (imm >= -2047 && imm <= 2048) {
-          rv64_emit32(mc, sf ? rv_addi(rd, ra, (i32)-imm)
-                             : rv_addiw(rd, ra, (i32)-imm));
-          return;
-        }
-        break;
-      case BO_AND:
-        if (fits12) {
-          rv64_emit32(mc, rv_andi(rd, ra, (i32)imm));
-          return;
-        }
-        break;
-      case BO_OR:
-        if (fits12) {
-          rv64_emit32(mc, rv_ori(rd, ra, (i32)imm));
-          return;
-        }
-        break;
-      case BO_XOR:
-        if (fits12) {
-          rv64_emit32(mc, rv_xori(rd, ra, (i32)imm));
-          return;
-        }
-        break;
-      case BO_SHL: {
-        u32 width = sf ? 64u : 32u;
-        u32 sh = (u32)((u64)imm & (width - 1u));
-        rv64_emit32(mc, sf ? rv_slli(rd, ra, sh) : rv_slliw(rd, ra, sh));
-        return;
-      }
-      case BO_SHR_U: {
-        u32 width = sf ? 64u : 32u;
-        u32 sh = (u32)((u64)imm & (width - 1u));
-        rv64_emit32(mc, sf ? rv_srli(rd, ra, sh) : rv_srliw(rd, ra, sh));
-        return;
-      }
-      case BO_SHR_S: {
-        u32 width = sf ? 64u : 32u;
-        u32 sh = (u32)((u64)imm & (width - 1u));
-        rv64_emit32(mc, sf ? rv_srai(rd, ra, sh) : rv_sraiw(rd, ra, sh));
-        return;
-      }
-      default:
-        break;
-    }
-  }
-
-  u32 ra = rv64_force_reg_int(t, a_op, RV_T0);
-  u32 rb = rv64_force_reg_int(t, b_op, (ra == RV_T0) ? RV_T1 : RV_T0);
-
-  switch (op) {
-    case BO_IADD:
-      rv64_emit32(mc, sf ? rv_add(rd, ra, rb) : rv_addw(rd, ra, rb));
-      return;
-    case BO_ISUB:
-      rv64_emit32(mc, sf ? rv_sub(rd, ra, rb) : rv_subw(rd, ra, rb));
-      return;
-    case BO_IMUL:
-      rv64_emit32(mc, sf ? rv_mul(rd, ra, rb) : rv_mulw(rd, ra, rb));
-      return;
-    case BO_AND:
-      rv64_emit32(mc, rv_and(rd, ra, rb));
-      return;
-    case BO_OR:
-      rv64_emit32(mc, rv_or(rd, ra, rb));
-      return;
-    case BO_XOR:
-      rv64_emit32(mc, rv_xor(rd, ra, rb));
-      return;
-    case BO_SHL:
-      rv64_emit32(mc, sf ? rv_sll(rd, ra, rb) : rv_sllw(rd, ra, rb));
-      return;
-    case BO_SHR_U:
-      rv64_emit32(mc, sf ? rv_srl(rd, ra, rb) : rv_srlw(rd, ra, rb));
-      return;
-    case BO_SHR_S:
-      rv64_emit32(mc, sf ? rv_sra(rd, ra, rb) : rv_sraw(rd, ra, rb));
-      return;
-    case BO_SDIV:
-      rv64_emit32(mc, sf ? rv_div(rd, ra, rb) : rv_divw(rd, ra, rb));
-      return;
-    case BO_UDIV:
-      rv64_emit32(mc, sf ? rv_divu(rd, ra, rb) : rv_divuw(rd, ra, rb));
-      return;
-    case BO_SREM:
-      rv64_emit32(mc, sf ? rv_rem(rd, ra, rb) : rv_remw(rd, ra, rb));
-      return;
-    case BO_UREM:
-      rv64_emit32(mc, sf ? rv_remu(rd, ra, rb) : rv_remuw(rd, ra, rb));
-      return;
-    default:
-      compiler_panic(t->c, impl_of(t)->loc, "rv64 binop: op %d unimpl",
-                     (int)op);
-  }
-}
-
-static void rv_unop(CGTarget* t, UnOp op, Operand dst, Operand a_op) {
-  MCEmitter* mc = t->mc;
-  u32 rd = reg_num(dst);
-  if (op == UO_FNEG) {
-    if (dst.cls != RC_FP || a_op.kind != OPK_REG || a_op.cls != RC_FP) {
-      compiler_panic(t->c, impl_of(t)->loc,
-                     "rv64 unop: FP neg requires FP REG operand");
-    }
-    u32 fmt = type_is_fp_double(dst.type) ? RV_FMT_D : RV_FMT_S;
-    rv64_emit32(mc, rv_fsgnjn(fmt, rd, reg_num(a_op), reg_num(a_op)));
-    return;
-  }
-
-  u32 sf = type_is_64(dst.type) ? 1u : 0u;
-  /* IMM operand is legal per the CGTarget contract (arch.h); materialize
-   * into t0 when not already a register. cg folds literal unops upstream
-   * via cg_fold_unop. */
-  u32 rn = rv64_force_reg_int(t, a_op, RV_T0);
-  switch (op) {
-    case UO_NEG:
-      rv64_emit32(mc, sf ? rv_sub(rd, RV_ZERO, rn) : rv_subw(rd, RV_ZERO, rn));
-      return;
-    case UO_BNOT:
-      rv64_emit32(mc, rv_xori(rd, rn, -1));
-      return;
-    case UO_NOT:
-      /* logical: 1 if rn==0 else 0 → sltiu rd, rn, 1 */
-      rv64_emit32(mc, rv_sltiu(rd, rn, 1));
-      return;
-    default:
-      compiler_panic(t->c, impl_of(t)->loc, "rv64 unop: op %d unimpl", (int)op);
-  }
-}
-
-static void rv_convert(CGTarget* t, ConvKind k, Operand dst, Operand src) {
-  RImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-  u32 rd = reg_num(dst);
-  u32 rn = reg_num(src);
-
-  switch (k) {
-    case CV_SEXT: {
-      u32 src_bits = type_byte_size(src.type) * 8u;
-      if (src_bits == 32u) {
-        /* sext.w rd, rs = addiw rd, rs, 0 */
-        rv64_emit32(mc, rv_addiw(rd, rn, 0));
-        return;
-      }
-      /* slli + srai by (64 - src_bits) */
-      u32 sh = 64u - src_bits;
-      rv64_emit32(mc, rv_slli(rd, rn, sh));
-      rv64_emit32(mc, rv_srai(rd, rd, sh));
-      return;
-    }
-    case CV_ZEXT: {
-      u32 src_bits = type_byte_size(src.type) * 8u;
-      if (src_bits == 32u) {
-        /* zext.w: slli rd, rs, 32; srli rd, rd, 32 */
-        rv64_emit32(mc, rv_slli(rd, rn, 32));
-        rv64_emit32(mc, rv_srli(rd, rd, 32));
-      } else {
-        u32 sh = 64u - src_bits;
-        rv64_emit32(mc, rv_slli(rd, rn, sh));
-        rv64_emit32(mc, rv_srli(rd, rd, sh));
-      }
-      return;
-    }
-    case CV_TRUNC:
-      /* Truncate to W: addiw rd, rs, 0 puts low 32 in rd sign-extended.
-       * For narrower widths the consumer (store) handles it. */
-      rv64_emit32(mc, rv_addiw(rd, rn, 0));
-      return;
-    case CV_ITOF_S: {
-      int sf_src = type_is_64(src.type);
-      int dst_d = type_is_fp_double(dst.type);
-      if (dst_d) {
-        rv64_emit32(mc, sf_src ? rv_fcvt_d_l(rd, rn) : rv_fcvt_d_w(rd, rn));
-      } else {
-        rv64_emit32(mc, sf_src ? rv_fcvt_s_l(rd, rn) : rv_fcvt_s_w(rd, rn));
-      }
-      return;
-    }
-    case CV_ITOF_U: {
-      int sf_src = type_is_64(src.type);
-      int dst_d = type_is_fp_double(dst.type);
-      if (dst_d) {
-        rv64_emit32(mc, sf_src ? rv_fcvt_d_lu(rd, rn) : rv_fcvt_d_wu(rd, rn));
-      } else {
-        rv64_emit32(mc, sf_src ? rv_fcvt_s_lu(rd, rn) : rv_fcvt_s_wu(rd, rn));
-      }
-      return;
-    }
-    case CV_FTOI_S: {
-      int sf_dst = type_is_64(dst.type);
-      int src_d = type_is_fp_double(src.type);
-      if (src_d) {
-        rv64_emit32(mc, sf_dst ? rv_fcvt_l_d(rd, rn) : rv_fcvt_w_d(rd, rn));
-      } else {
-        rv64_emit32(mc, sf_dst ? rv_fcvt_l_s(rd, rn) : rv_fcvt_w_s(rd, rn));
-      }
-      return;
-    }
-    case CV_FTOI_U: {
-      int sf_dst = type_is_64(dst.type);
-      int src_d = type_is_fp_double(src.type);
-      if (src_d) {
-        rv64_emit32(mc, sf_dst ? rv_fcvt_lu_d(rd, rn) : rv_fcvt_wu_d(rd, rn));
-      } else {
-        rv64_emit32(mc, sf_dst ? rv_fcvt_lu_s(rd, rn) : rv_fcvt_wu_s(rd, rn));
-      }
-      return;
-    }
-    case CV_FEXT:
-      rv64_emit32(mc, rv_fcvt_d_s(rd, rn));
-      return;
-    case CV_FTRUNC:
-      rv64_emit32(mc, rv_fcvt_s_d(rd, rn));
-      return;
-    case CV_BITCAST: {
-      if (src.cls == RC_INT && dst.cls == RC_FP) {
-        u32 sz = type_byte_size(dst.type);
-        rv64_emit32(mc, sz == 8 ? rv_fmv_d_x(rd, rn) : rv_fmv_w_x(rd, rn));
-      } else if (src.cls == RC_FP && dst.cls == RC_INT) {
-        u32 sz = type_byte_size(src.type);
-        rv64_emit32(mc, sz == 8 ? rv_fmv_x_d(rd, rn) : rv_fmv_x_w(rd, rn));
-      } else if (src.cls == RC_INT && dst.cls == RC_INT) {
-        /* GPR→GPR: mv pseudo (addi rd, rs, 0). */
-        if (rd != rn) rv64_emit32(mc, rv_addi(rd, rn, 0));
-      } else if (src.cls == RC_FP && dst.cls == RC_FP) {
-        /* FPR→FPR: fmv.fmt pseudo (fsgnj.fmt rd, rs, rs). */
-        if (rd != rn) {
-          u32 sz = type_byte_size(src.type);
-          u32 fmt = (sz == 8) ? 1u : 0u; /* 0 = single, 1 = double */
-          rv64_emit32(mc, rv_fsgnj(fmt, rd, rn, rn));
-        }
-      } else {
-        compiler_panic(t->c, a->loc, "rv64 BITCAST: same-class NYI");
-      }
-      return;
-    }
-    default:
-      compiler_panic(t->c, a->loc, "rv64 convert kind %d unimpl", (int)k);
-  }
-}
-
-/* ---- calls / return ---- */
-
-static Operand rv_call_stack_arg_addr(CGTarget* t, u32 stack_offset, int tail) {
-  RImpl* a = impl_of(t);
-  Operand addr;
-  memset(&addr, 0, sizeof addr);
-  addr.kind = OPK_INDIRECT;
-  addr.cls = RC_INT;
-  addr.v.ind.base = tail && !a->omit_frame ? RV_S0 : RV_SP;
-  addr.v.ind.index = REG_NONE;
-  addr.v.ind.log2_scale = 0;
-  addr.v.ind.ofs = (i32)stack_offset;
-  if (tail && !a->omit_frame) {
-    addr.v.ind.ofs += 16 + (a->is_variadic ? 64 : 0);
-  }
-  return addr;
-}
-
-static void rv_check_tail_stack_args(CGTarget* t, u32 stack_size) {
-  RImpl* a = impl_of(t);
-  if (stack_size > a->next_param_stack) {
-    compiler_panic(t->c, a->loc,
-                   "rv64 tail call: stack argument area too small");
-  }
-}
-
-static u32 rv_call_plan_stack_raw_size(const CGCallPlan* p) {
-  u32 size = 0;
-  for (u32 i = 0; i < p->nargs; ++i) {
-    const CGCallPlanMove* m = &p->args[i];
-    if (m->dst_kind == CG_CALL_PLAN_STACK ||
-        m->dst_kind == CG_CALL_PLAN_TAIL_STACK) {
-      u32 end = m->stack_offset + (m->mem.size > 8u ? m->mem.size : 8u);
-      if (end > size) size = end;
-    }
-  }
-  return size;
-}
-
-static void rv_store_stack_reg(CGTarget* t, u32 reg, RegClass cls,
-                               CfreeCgTypeId type, u32 size, u32 stack_offset,
-                               int tail) {
-  Operand addr = rv_call_stack_arg_addr(t, stack_offset, tail);
-  Operand src;
-  MemAccess ma;
-  memset(&src, 0, sizeof src);
-  memset(&ma, 0, sizeof ma);
-  src.kind = OPK_REG;
-  src.cls = (u8)cls;
-  src.type = type;
-  src.v.reg = reg;
-  addr.type = type;
-  ma.type = type;
-  ma.size = size;
-  ma.align = size ? size : 1u;
-  rv_store(t, addr, src, ma);
-}
-
-static Operand rv_offset_mem_operand(CGTarget* t, Operand op, u32 offset) {
-  if (!offset) return op;
-  if (op.kind == OPK_INDIRECT) {
-    op.v.ind.ofs += (i32)offset;
-  } else if (op.kind == OPK_LOCAL) {
-    RImpl* a = impl_of(t);
-    RvSlot* s = rv64_slot_get(a, op.v.frame_slot);
-    if (!s) compiler_panic(t->c, a->loc, "rv64 offset operand: bad slot");
-    op.kind = OPK_INDIRECT;
-    op.v.ind.base = RV_S0;
-    op.v.ind.index = REG_NONE;
-    op.v.ind.log2_scale = 0;
-    op.v.ind.ofs = -(i32)s->off + (i32)offset;
-  }
-  return op;
-}
-
-static void rv_load_abi_part(CGTarget* t, Operand dst, Operand src, u32 offset,
-                             u32 size) {
-  MemAccess ma;
-  memset(&ma, 0, sizeof ma);
-  ma.type = dst.type;
-  ma.size = size;
-  ma.align = size ? size : 1u;
-  rv_load(t, dst, rv_offset_mem_operand(t, src, offset), ma);
-}
-
-static void rv_store_abi_part(CGTarget* t, Operand dst, Operand src, u32 offset,
-                              u32 size) {
-  MemAccess ma;
-  memset(&ma, 0, sizeof ma);
-  ma.type = src.type;
-  ma.size = size;
-  ma.align = size ? size : 1u;
-  rv_store(t, rv_offset_mem_operand(t, dst, offset), src, ma);
-}
-
-static void emit_arg_value(CGTarget* t, const CGABIValue* av, u32* next_int,
-                           u32* next_fp, u32* stack_off, int tail) {
-  RImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-
-  /* For variadic args (av->abi NULL) synthesize a one-part DIRECT shape.
-   * On RV64 LP64D, variadic args go through the integer registers
-   * regardless of FP-ness (per the psABI). */
-  ABIArgInfo va_ai;
-  ABIArgPart va_pt;
-  const ABIArgInfo* ai = av->abi;
-  if (!ai) {
-    u32 sz = type_byte_size(av->type);
-    memset(&va_ai, 0, sizeof va_ai);
-    memset(&va_pt, 0, sizeof va_pt);
-    va_ai.kind = ABI_ARG_DIRECT;
-    va_ai.parts = &va_pt;
-    va_ai.nparts = 1;
-    va_pt.cls = ABI_CLASS_INT;
-    va_pt.size = sz;
-    va_pt.align = sz;
-    va_pt.src_offset = 0;
-    ai = &va_ai;
-  }
-  if (ai->kind == ABI_ARG_IGNORE) return;
-
-  if (ai->kind == ABI_ARG_INDIRECT) {
-    /* Pass the address of the storage in the next integer slot. */
-    int to_stack = (*next_int >= 8);
-    u32 dst_reg = to_stack ? RV_T0 : (RV_A0 + (*next_int)++);
-    if (av->storage.kind == OPK_LOCAL) {
-      RvSlot* s = rv64_slot_get(a, av->storage.v.frame_slot);
-      if (!s) compiler_panic(t->c, a->loc, "rv64 call: bad byval slot");
-      i32 off = -(i32)s->off;
-      if (off >= -2048 && off <= 2047) {
-        rv64_emit32(mc, rv_addi(dst_reg, RV_S0, off));
-      } else {
-        rv64_emit_load_imm(mc, 1, dst_reg, (i64)off);
-        rv64_emit32(mc, rv_add(dst_reg, RV_S0, dst_reg));
-      }
-    } else if (av->storage.kind == OPK_INDIRECT) {
-      if (av->storage.v.ind.index != REG_NONE) {
-        compiler_panic(t->c, a->loc,
-                       "rv64 call byval: indexed storage not supported");
-      }
-      u32 base = av->storage.v.ind.base & 0x1fu;
-      i32 off = av->storage.v.ind.ofs;
-      if (off >= -2048 && off <= 2047) {
-        rv64_emit32(mc, rv_addi(dst_reg, base, off));
-      } else {
-        rv64_emit_load_imm(mc, 1, dst_reg, (i64)off);
-        rv64_emit32(mc, rv_add(dst_reg, base, dst_reg));
-      }
-    } else if (av->storage.kind == OPK_GLOBAL) {
-      /* byval pass-by-pointer of a global aggregate (e.g. a const global
-       * struct). Materialize the symbol address into dst_reg via the
-       * standard PC-relative AUIPC + ADDI(LO12) sequence. */
-      Operand dst_addr;
-      memset(&dst_addr, 0, sizeof dst_addr);
-      dst_addr.kind = OPK_REG;
-      dst_addr.cls = RC_INT;
-      dst_addr.type = av->type;
-      dst_addr.v.reg = dst_reg;
-      rv_addr_of(t, dst_addr, av->storage);
-    } else {
-      compiler_panic(t->c, a->loc, "rv64 call: INDIRECT storage kind %d NYI",
-                     (int)av->storage.kind);
-    }
-    if (to_stack) {
-      rv_store_stack_reg(t, dst_reg, RC_INT, av->type, 8, *stack_off, tail);
-      *stack_off += 8;
-    }
-    return;
-  }
-
-  for (u16 i = 0; i < ai->nparts; ++i) {
-    const ABIArgPart* pt = &ai->parts[i];
-    u32 sz = pt->size;
-
-    if (pt->cls == ABI_CLASS_INT) {
-      int to_stack = (*next_int >= 8);
-      u32 dst_reg = to_stack ? RV_T0 : (RV_A0 + (*next_int)++);
-      switch (av->storage.kind) {
-        case OPK_IMM: {
-          u32 sf = (sz == 8) ? 1u : 0u;
-          rv64_emit_load_imm(mc, sf, dst_reg, av->storage.v.imm);
-          break;
-        }
-        case OPK_REG: {
-          /* Variadic FP arg pinned into an integer register: bitcast
-           * via FMV.X.{D,W}. Otherwise normal MV. */
-          if (av->storage.cls == RC_FP) {
-            rv64_emit32(mc, (sz == 8)
-                                ? rv_fmv_x_d(dst_reg, reg_num(av->storage))
-                                : rv_fmv_x_w(dst_reg, reg_num(av->storage)));
-          } else {
-            rv64_emit32(mc, rv_addi(dst_reg, reg_num(av->storage), 0));
-          }
-          break;
-        }
-        case OPK_LOCAL: {
-          Operand dst = {.kind = OPK_REG, .cls = RC_INT, .type = av->type};
-          dst.v.reg = dst_reg;
-          rv_load_abi_part(t, dst, av->storage, pt->src_offset, sz);
-          break;
-        }
-        case OPK_INDIRECT: {
-          Operand dst = {.kind = OPK_REG, .cls = RC_INT, .type = av->type};
-          dst.v.reg = dst_reg;
-          rv_load_abi_part(t, dst, av->storage, pt->src_offset, sz);
-          break;
-        }
-        default:
-          compiler_panic(t->c, a->loc, "rv64 call: storage kind %d NYI",
-                         (int)av->storage.kind);
-      }
-      if (to_stack) {
-        rv_store_stack_reg(t, dst_reg, RC_INT, av->type, 8, *stack_off, tail);
-        *stack_off += 8;
-      }
-    } else if (pt->cls == ABI_CLASS_FP) {
-      int to_stack = (*next_fp >= 8);
-      if (!to_stack) {
-        u32 freg = 10u + (*next_fp)++;
-        switch (av->storage.kind) {
-          case OPK_REG: {
-            u32 fmt = (sz == 8) ? RV_FMT_D : RV_FMT_S;
-            u32 r = reg_num(av->storage);
-            rv64_emit32(mc, rv_fsgnj(fmt, freg, r, r));
-            break;
-          }
-          case OPK_LOCAL: {
-            Operand dst = {.kind = OPK_REG, .cls = RC_FP, .type = av->type};
-            dst.v.reg = freg;
-            rv_load_abi_part(t, dst, av->storage, pt->src_offset, sz);
-            break;
-          }
-          case OPK_INDIRECT: {
-            Operand dst = {.kind = OPK_REG, .cls = RC_FP, .type = av->type};
-            dst.v.reg = freg;
-            rv_load_abi_part(t, dst, av->storage, pt->src_offset, sz);
-            break;
-          }
-          default:
-            compiler_panic(t->c, a->loc, "rv64 call: FP storage kind %d NYI",
-                           (int)av->storage.kind);
-        }
-      } else {
-        switch (av->storage.kind) {
-          case OPK_REG:
-            rv_store_stack_reg(t, reg_num(av->storage), RC_FP, av->type, sz,
-                               *stack_off, tail);
-            break;
-          case OPK_LOCAL: {
-            Operand tmp = {.kind = OPK_REG, .cls = RC_FP, .type = av->type};
-            tmp.v.reg = 0u;
-            if (sz == 8) {
-              rv_load_abi_part(t, tmp, av->storage, pt->src_offset, sz);
-              rv_store_stack_reg(t, /*ft0=*/0u, RC_FP, av->type, sz, *stack_off,
-                                 tail);
-            } else {
-              rv_load_abi_part(t, tmp, av->storage, pt->src_offset, sz);
-              rv_store_stack_reg(t, /*ft0=*/0u, RC_FP, av->type, sz, *stack_off,
-                                 tail);
-            }
-            break;
-          }
-          case OPK_INDIRECT: {
-            /* Route through ft0 — it is in {ft0..ft7}, caller-saved
-             * scratch outside the cg fs2..fs11 pool. */
-            Operand tmp = {.kind = OPK_REG, .cls = RC_FP, .type = av->type};
-            tmp.v.reg = 0u;
-            if (sz == 8) {
-              rv_load_abi_part(t, tmp, av->storage, pt->src_offset, sz);
-              rv_store_stack_reg(t, /*ft0=*/0u, RC_FP, av->type, sz, *stack_off,
-                                 tail);
-            } else {
-              rv_load_abi_part(t, tmp, av->storage, pt->src_offset, sz);
-              rv_store_stack_reg(t, /*ft0=*/0u, RC_FP, av->type, sz, *stack_off,
-                                 tail);
-            }
-            break;
-          }
-          default:
-            compiler_panic(t->c, a->loc, "rv64 call: FP stack-arg NYI");
-        }
-        *stack_off += 8;
-      }
-    } else {
-      compiler_panic(t->c, a->loc, "rv64 call: ABI class %d unimpl",
-                     (int)pt->cls);
-    }
-  }
-}
-
-static void count_arg_stack(const CGABIValue* av, u32* next_int, u32* next_fp,
-                            u32* stack_off) {
-  ABIArgInfo va_ai;
-  ABIArgPart va_pt;
-  const ABIArgInfo* ai = av->abi;
-  if (!ai) {
-    u32 sz = type_byte_size(av->type);
-    memset(&va_ai, 0, sizeof va_ai);
-    memset(&va_pt, 0, sizeof va_pt);
-    va_ai.kind = ABI_ARG_DIRECT;
-    va_ai.parts = &va_pt;
-    va_ai.nparts = 1;
-    va_pt.cls = ABI_CLASS_INT;
-    va_pt.size = sz;
-    va_pt.align = sz;
-    va_pt.src_offset = 0;
-    ai = &va_ai;
-  }
-  if (ai->kind == ABI_ARG_IGNORE) return;
-  if (ai->kind == ABI_ARG_INDIRECT) {
-    if (*next_int < 8)
-      ++*next_int;
-    else
-      *stack_off += 8;
-    return;
-  }
-  for (u16 i = 0; i < ai->nparts; ++i) {
-    const ABIArgPart* pt = &ai->parts[i];
-    if (pt->cls == ABI_CLASS_INT) {
-      if (*next_int < 8)
-        ++*next_int;
-      else
-        *stack_off += 8;
-    } else if (pt->cls == ABI_CLASS_FP) {
-      if (*next_fp < 8)
-        ++*next_fp;
-      else
-        *stack_off += 8;
-    }
-  }
-}
-
-static u32 rv_call_stack_size(CGTarget* t, const CGCallDesc* d) {
-  (void)t;
-  u32 next_int = (d->abi && d->abi->has_sret) ? 1u : 0u;
-  u32 next_fp = 0, stack_off = 0;
-  for (u32 i = 0; i < d->nargs; ++i)
-    count_arg_stack(&d->args[i], &next_int, &next_fp, &stack_off);
-  return (stack_off + 15u) & ~15u;
-}
-
-/* Realizability of a sibling call (see CGTarget.tail_call_unrealizable_reason).
- * The callee's outgoing stack arguments must fit the area this function itself
- * received (next_param_stack). Variadic callees need no special handling and
- * sret callees are realizable by forwarding this function's own incoming sret
- * pointer (the return-shape precondition guarantees it matches). */
-static const char* rv_tail_call_unrealizable_reason(CGTarget* t,
-                                                    const CGCallDesc* d) {
-  RImpl* a = impl_of(t);
-  u32 next_int = (d->abi && d->abi->has_sret) ? 1u : 0u;
-  u32 next_fp = 0, stack_off = 0;
-  for (u32 i = 0; i < d->nargs; ++i)
-    count_arg_stack(&d->args[i], &next_int, &next_fp, &stack_off);
-  if (stack_off > a->next_param_stack)
-    return "tail call stack arguments exceed the caller's parameter area";
-  return NULL;
-}
-
-typedef struct RvTailFrameLayout {
-  u32 max_out;
-  u32 fp_saves_sz;
-  u32 fp_pair_off;
-  u32 frame_size;
-  i32 fp_save_base;
-  i32 int_save_base;
-} RvTailFrameLayout;
-
-static u32 rv_tail_collect_mask_regs(u32 mask, u32 first, u32 last, u32* out) {
-  u32 n = 0;
-  for (u32 r = first; r <= last; ++r) {
-    if (mask & (1u << r)) out[n++] = r;
-  }
-  return n;
-}
-
-static void rv_tail_compute_frame(const RImpl* a, u32 n_int_saves,
-                                  u32 n_fp_saves, RvTailFrameLayout* fl) {
-  fl->max_out = (a->max_outgoing + 15u) & ~15u;
-  u32 int_saves_sz = n_int_saves * 8u;
-  fl->fp_saves_sz = n_fp_saves * 8u;
-  u32 va_save_sz = a->is_variadic ? 64u : 0u;
-  u32 locals_off = fl->max_out + int_saves_sz + fl->fp_saves_sz;
-  fl->fp_pair_off = locals_off + a->cum_off;
-  fl->frame_size = fl->fp_pair_off + 16u + va_save_sz;
-  fl->frame_size = (fl->frame_size + 15u) & ~15u;
-  fl->fp_pair_off = fl->frame_size - 16u - va_save_sz;
-  fl->fp_save_base = -(i32)a->cum_off - 8;
-  fl->int_save_base = fl->fp_save_base - (i32)fl->fp_saves_sz;
-}
-
-static void rv_tail_restore_frame(CGTarget* t) {
-  RImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-  u32 int_regs[10];
-  u32 fp_regs[10];
-  RvTailFrameLayout fl;
-  u32 n_int_saves =
-      rv_tail_collect_mask_regs(a->used_cs_int_mask, 18u, 27u, int_regs);
-  u32 n_fp_saves =
-      rv_tail_collect_mask_regs(a->used_cs_fp_mask, 18u, 27u, fp_regs);
-  rv_tail_compute_frame(a, n_int_saves, n_fp_saves, &fl);
-
-  if (a->omit_frame) return;
-  for (i32 i = (i32)n_int_saves - 1; i >= 0; --i) {
-    i32 off = fl.int_save_base - 8 * i;
-    if (off >= -2048 && off <= 2047) {
-      rv64_emit32(mc, rv_ld(int_regs[i], RV_S0, off));
-    } else {
-      rv64_emit_addr_adjust(mc, RV_T0, RV_S0, off);
-      rv64_emit32(mc, rv_ld(int_regs[i], RV_T0, 0));
-    }
-  }
-  for (i32 i = (i32)n_fp_saves - 1; i >= 0; --i) {
-    i32 off = fl.fp_save_base - 8 * i;
-    if (off >= -2048 && off <= 2047) {
-      rv64_emit32(mc, rv_fld(fp_regs[i], RV_S0, off));
-    } else {
-      rv64_emit_addr_adjust(mc, RV_T0, RV_S0, off);
-      rv64_emit32(mc, rv_fld(fp_regs[i], RV_T0, 0));
-    }
-  }
-  if (a->has_alloca) {
-    rv64_emit_addr_adjust(mc, RV_SP, RV_S0, -(i32)fl.fp_pair_off);
-  }
-  rv64_emit32(mc, rv_ld(RV_RA, RV_S0, 8));
-  rv64_emit32(mc, rv_ld(RV_S0, RV_S0, 0));
-  emit_sp_addi(mc, (i64)fl.frame_size);
-}
-
-static void rv_tail_branch(CGTarget* t, Operand callee) {
-  RImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-  if (callee.kind == OPK_REG) {
-    if (reg_num(callee) != RV_T1)
-      rv64_emit32(mc, rv_addi(RV_T1, reg_num(callee), 0));
-    rv_tail_restore_frame(t);
-    rv64_emit32(mc, rv_jr(RV_T1));
-  } else if (callee.kind == OPK_GLOBAL) {
-    rv_tail_restore_frame(t);
-    u32 sec = mc->section_id;
-    u32 pos = mc->pos(mc);
-    rv64_emit32(mc, rv_auipc(RV_T1, 0));
-    rv64_emit32(mc, rv_jalr(RV_ZERO, RV_T1, 0));
-    mc->emit_reloc_at(mc, sec, pos, R_RV_CALL, callee.v.global.sym,
-                      callee.v.global.addend, 0, 0);
-  } else {
-    compiler_panic(t->c, a->loc, "rv64 tail call: callee kind %d unsupported",
-                   (int)callee.kind);
-  }
-}
-
-static void rv_call(CGTarget* t, const CGCallDesc* d) {
-  RImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-
-  u32 next_int = 0, next_fp = 0, stack_off = 0;
-
-  /* sret: a0 holds the result pointer. An ordinary call points it at the
-   * destination local; a tail call forwards this function's own incoming sret
-   * pointer (loaded just before the branch below), and ret.storage is the
-   * void sentinel, so only reserve a0 here. */
-  if (d->abi && d->abi->has_sret) {
-    next_int = 1;
-    if ((d->flags & CG_CALL_TAIL) == 0) {
-      if (d->ret.storage.kind != OPK_LOCAL) {
-        compiler_panic(t->c, a->loc, "rv64 call: sret dst must be LOCAL");
-      }
-      RvSlot* s = rv64_slot_get(a, d->ret.storage.v.frame_slot);
-      if (!s) compiler_panic(t->c, a->loc, "rv64 call: bad sret slot");
-      i32 off = -(i32)s->off;
-      if (off >= -2048 && off <= 2047) {
-        rv64_emit32(mc, rv_addi(RV_A0, RV_S0, off));
-      } else {
-        rv64_emit_load_imm(mc, 1, RV_A0, (i64)off);
-        rv64_emit32(mc, rv_add(RV_A0, RV_S0, RV_A0));
-      }
-    }
-  }
-
-  for (u32 i = 0; i < d->nargs; ++i) {
-    emit_arg_value(t, &d->args[i], &next_int, &next_fp, &stack_off,
-                   (d->flags & CG_CALL_TAIL) != 0);
-  }
-  u32 needed = (stack_off + 15u) & ~15u;
-  if ((d->flags & CG_CALL_TAIL) == 0 && needed > a->max_outgoing) {
-    if (a->known_frame)
-      compiler_panic(t->c, a->loc,
-                     "rv64 call: known frame outgoing area too small");
-    a->max_outgoing = needed;
-  }
-
-  if (d->flags & CG_CALL_TAIL) {
-    if (d->abi && d->abi->has_sret) {
-      /* Forward the incoming sret pointer into a0 (spilled to sret_ptr_slot
-       * at entry). Load while s0 is valid, before rv_tail_branch restores the
-       * frame; a0 survives the restore and is unused by the args above. */
-      if (a->sret_ptr_slot == FRAME_SLOT_NONE)
-        compiler_panic(t->c, a->loc,
-                       "rv64 tail call: missing incoming sret slot");
-      RvSlot* s = rv64_slot_get(a, a->sret_ptr_slot);
-      if (!s) compiler_panic(t->c, a->loc, "rv64 tail call: bad sret slot");
-      rv64_emit32(mc, rv_ld(RV_A0, RV_S0, -(i32)s->off));
-    }
-    rv_check_tail_stack_args(t, stack_off);
-    rv_tail_branch(t, d->callee);
-    return;
-  }
-
-  if (d->callee.kind == OPK_GLOBAL) {
-    /* AUIPC ra, 0 ; JALR ra, ra, 0  with R_RV_CALL on AUIPC */
-    u32 sec = mc->section_id;
-    u32 pos = mc->pos(mc);
-    rv64_emit32(mc, rv_auipc(RV_RA, 0));
-    rv64_emit32(mc, rv_jalr(RV_RA, RV_RA, 0));
-    mc->emit_reloc_at(mc, sec, pos, R_RV_CALL, d->callee.v.global.sym,
-                      d->callee.v.global.addend, 0, 0);
-  } else if (d->callee.kind == OPK_REG) {
-    rv64_emit32(mc, rv_jalr(RV_RA, reg_num(d->callee), 0));
-  } else {
-    compiler_panic(t->c, a->loc, "rv64 call: callee kind %d unsupported",
-                   (int)d->callee.kind);
-  }
-
-  /* Receive return value. */
-  const ABIArgInfo* ri = &d->abi->ret;
-  if (ri->kind == ABI_ARG_IGNORE || ri->kind == ABI_ARG_INDIRECT) return;
-  if (ri->nparts == 0) return;
-
-  Operand rs = d->ret.storage;
-  u32 nir = 0, nfr = 0;
-  for (u16 i = 0; i < ri->nparts; ++i) {
-    const ABIArgPart* p = &ri->parts[i];
-    u32 src_reg = (p->cls == ABI_CLASS_INT) ? (RV_A0 + nir++) : (10u + nfr++);
-
-    if (rs.kind == OPK_REG) {
-      if (ri->nparts != 1) {
-        compiler_panic(t->c, a->loc, "rv64 call: REG ret with %u parts",
-                       (unsigned)ri->nparts);
-      }
-      if (p->cls == ABI_CLASS_INT) {
-        rv64_emit32(mc, rv_addi(reg_num(rs), src_reg, 0));
-      } else {
-        u32 fmt = (p->size == 8) ? RV_FMT_D : RV_FMT_S;
-        rv64_emit32(mc, rv_fsgnj(fmt, reg_num(rs), src_reg, src_reg));
-      }
-    } else if (rs.kind == OPK_LOCAL || rs.kind == OPK_INDIRECT) {
-      Operand src = {.kind = OPK_REG,
-                     .cls = (u8)((p->cls == ABI_CLASS_FP) ? RC_FP : RC_INT),
-                     .type = d->ret.type};
-      src.v.reg = src_reg;
-      if (p->cls == ABI_CLASS_INT || p->cls == ABI_CLASS_FP) {
-        rv_store_abi_part(t, rs, src, p->src_offset, p->size);
-      } else {
-        compiler_panic(t->c, a->loc, "rv64 call: ret part cls %d unimpl",
-                       (int)p->cls);
-      }
-    } else if (rs.kind == OPK_IMM &&
-               rs.type == CG_BUILTIN_ID(CFREE_CG_BUILTIN_VOID)) {
-      /* void return placeholder — nothing to do. */
-    } else {
-      compiler_panic(t->c, a->loc, "rv64 call: ret_storage kind %d unsupported",
-                     (int)rs.kind);
-    }
-  }
-}
-
-static void rv_emit_call_plan(CGTarget* t, const CGCallPlan* p) {
-  RImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-
-  if (p->flags & CG_CALL_TAIL) {
-    if (p->has_sret) {
-      /* Forward the incoming sret pointer into a0 (see rv_call). Load before
-       * rv_tail_branch restores the frame; a0 survives the restore. */
-      if (a->sret_ptr_slot == FRAME_SLOT_NONE)
-        compiler_panic(t->c, a->loc,
-                       "rv64 tail call: missing incoming sret slot");
-      RvSlot* s = rv64_slot_get(a, a->sret_ptr_slot);
-      if (!s) compiler_panic(t->c, a->loc, "rv64 tail call: bad sret slot");
-      rv64_emit32(mc, rv_ld(RV_A0, RV_S0, -(i32)s->off));
-    }
-    rv_check_tail_stack_args(t, rv_call_plan_stack_raw_size(p));
-    rv_tail_branch(t, p->callee);
-    return;
-  }
-
-  {
-    u32 needed = (rv_call_plan_stack_raw_size(p) + 15u) & ~15u;
-    if (needed > a->max_outgoing) {
-      if (a->known_frame)
-        compiler_panic(t->c, a->loc,
-                       "rv64 call plan: known frame outgoing area too small");
-      a->max_outgoing = needed;
-    }
-  }
-
-  if (p->callee.kind == OPK_GLOBAL) {
-    u32 sec = mc->section_id;
-    u32 pos = mc->pos(mc);
-    rv64_emit32(mc, rv_auipc(RV_RA, 0));
-    rv64_emit32(mc, rv_jalr(RV_RA, RV_RA, 0));
-    mc->emit_reloc_at(mc, sec, pos, R_RV_CALL, p->callee.v.global.sym,
-                      p->callee.v.global.addend, 0, 0);
-  } else if (p->callee.kind == OPK_REG) {
-    rv64_emit32(mc, rv_jalr(RV_RA, reg_num(p->callee), 0));
-  } else {
-    compiler_panic(t->c, a->loc,
-                   "rv64 emit_call_plan: callee kind %d unsupported",
-                   (int)p->callee.kind);
-  }
-}
-
-static Operand rv_call_plan_offset_operand(CGTarget* t, Operand op,
-                                           u32 offset) {
-  if (!offset) return op;
-  if (op.kind == OPK_INDIRECT) {
-    op.v.ind.ofs += (i32)offset;
-  } else if (op.kind == OPK_LOCAL) {
-    RImpl* a = impl_of(t);
-    RvSlot* s = rv64_slot_get(a, op.v.frame_slot);
-    if (!s) compiler_panic(t->c, a->loc, "rv64 call plan: bad slot");
-    op.kind = OPK_INDIRECT;
-    op.v.ind.base = RV_S0;
-    op.v.ind.index = REG_NONE;
-    op.v.ind.log2_scale = 0;
-    op.v.ind.ofs = -(i32)s->off + (i32)offset;
-  }
-  return op;
-}
-
-static void rv_load_call_arg(CGTarget* t, Operand dst,
-                             const CGCallPlanMove* m) {
-  Operand src = rv_call_plan_offset_operand(t, m->src, m->src_offset);
-  if (m->src_kind == CG_CALL_PLAN_SRC_ADDR) {
-    rv_addr_of(t, dst, src);
-    return;
-  }
-  if (src.kind == OPK_GLOBAL) {
-    rv_addr_of(t, dst, src);
-    return;
-  }
-  rv_load(t, dst, src, m->mem);
-}
-
-static void rv_store_call_ret(CGTarget* t, const CGCallPlanRet* r,
-                              Operand src) {
-  Operand dst = rv_call_plan_offset_operand(t, r->dst, r->dst_offset);
-  rv_store(t, dst, src, r->mem);
-}
-
-static void rv_store_call_arg(CGTarget* t, const CGCallPlanMove* m) {
-  Operand addr;
-  addr = rv_call_stack_arg_addr(t, m->stack_offset,
-                                m->dst_kind == CG_CALL_PLAN_TAIL_STACK);
-  addr.type = m->mem.type;
-
-  if (m->src_kind == CG_CALL_PLAN_SRC_ADDR) {
-    Operand tmp = {.kind = OPK_REG, .cls = RC_INT, .type = m->mem.type};
-    tmp.v.reg = RV_T0;
-    rv_load_call_arg(t, tmp, m);
-    rv_store(t, addr, tmp, m->mem);
-    return;
-  }
-
-  if (m->src.kind == OPK_REG || m->src.kind == OPK_IMM) {
-    rv_store(t, addr, m->src, m->mem);
-    return;
-  }
-  if (m->src.kind == OPK_GLOBAL) {
-    Operand tmp = {.kind = OPK_REG, .cls = RC_INT, .type = m->mem.type};
-    tmp.v.reg = RV_T0;
-    rv_load_call_arg(t, tmp, m);
-    rv_store(t, addr, tmp, m->mem);
-    return;
-  }
-  if (m->src.kind == OPK_LOCAL || m->src.kind == OPK_INDIRECT) {
-    Operand tmp = {.kind = OPK_REG, .cls = m->cls, .type = m->mem.type};
-    tmp.v.reg = m->cls == RC_FP ? 0u : RV_T0;
-    rv_load_call_arg(t, tmp, m);
-    rv_store(t, addr, tmp, m->mem);
-    return;
-  }
-  compiler_panic(t->c, impl_of(t)->loc,
-                 "rv64 store_call_arg: source kind %d unsupported",
-                 (int)m->src.kind);
-}
-
-static void rv_ret(CGTarget* t, const CGABIValue* val) {
-  RImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-
-  if (val) {
-    const ABIArgInfo* ri = val->abi;
-    if (ri && ri->kind == ABI_ARG_INDIRECT) {
-      /* sret: reload destination pointer from sret_ptr_slot into t0,
-       * then memcpy from val->storage into [t0]. */
-      u32 src_base;
-      i32 src_base_off;
-      u32 nbytes;
-      if (val->storage.kind == OPK_LOCAL) {
-        RvSlot* s = rv64_slot_get(a, val->storage.v.frame_slot);
-        if (!s) compiler_panic(t->c, a->loc, "rv64 ret: bad sret slot");
-        src_base = RV_S0;
-        src_base_off = -(i32)s->off;
-        nbytes = s->size;
-      } else if (val->storage.kind == OPK_INDIRECT) {
-        if (val->storage.v.ind.index != REG_NONE) {
-          compiler_panic(t->c, a->loc,
-                         "rv64 ret indirect: indexed storage not supported");
-        }
-        src_base = val->storage.v.ind.base & 0x1fu;
-        src_base_off = val->storage.v.ind.ofs;
-        nbytes = val->size;
-        if (!nbytes) {
-          compiler_panic(t->c, a->loc,
-                         "rv64 ret indirect: missing aggregate size");
-        }
-      } else {
-        compiler_panic(t->c, a->loc, "rv64 ret indirect: storage kind %d NYI",
-                       (int)val->storage.kind);
-      }
-      RvSlot* sp = (a->sret_ptr_slot != FRAME_SLOT_NONE)
-                       ? rv64_slot_get(a, a->sret_ptr_slot)
-                       : NULL;
-      if (sp) rv64_emit32(mc, rv_ld(RV_T0, RV_S0, -(i32)sp->off));
-      u32 i = 0;
-      while (i + 8 <= nbytes) {
-        rv64_emit32(mc, rv_ld(RV_T1, src_base, src_base_off + (i32)i));
-        rv64_emit32(mc, rv_sd(RV_T1, RV_T0, (i32)i));
-        i += 8;
-      }
-      while (i + 4 <= nbytes) {
-        rv64_emit32(mc, rv_lwu(RV_T1, src_base, src_base_off + (i32)i));
-        rv64_emit32(mc, rv_sw(RV_T1, RV_T0, (i32)i));
-        i += 4;
-      }
-      while (i + 2 <= nbytes) {
-        rv64_emit32(mc, rv_lhu(RV_T1, src_base, src_base_off + (i32)i));
-        rv64_emit32(mc, rv_sh(RV_T1, RV_T0, (i32)i));
-        i += 2;
-      }
-      while (i < nbytes) {
-        rv64_emit32(mc, rv_lbu(RV_T1, src_base, src_base_off + (i32)i));
-        rv64_emit32(mc, rv_sb(RV_T1, RV_T0, (i32)i));
-        i += 1;
-      }
-    } else if (val->storage.kind == OPK_REG) {
-      if (val->storage.cls == RC_FP) {
-        u32 fmt = type_is_fp_double(val->storage.type) ? RV_FMT_D : RV_FMT_S;
-        u32 r = reg_num(val->storage);
-        if (r != 10u)
-          rv64_emit32(mc, rv_fsgnj(fmt, 10u, r, r)); /* fa0 = freg 10 */
-      } else {
-        if (reg_num(val->storage) != RV_A0)
-          rv64_emit32(mc, rv_addi(RV_A0, reg_num(val->storage), 0));
-      }
-    } else if (val->storage.kind == OPK_IMM) {
-      u32 sf = type_is_64(val->storage.type) ? 1u : 0u;
-      rv64_emit_load_imm(mc, sf, RV_A0, val->storage.v.imm);
-    } else if (val->storage.kind == OPK_LOCAL ||
-               val->storage.kind == OPK_INDIRECT) {
-      const ABIArgInfo* ri2 = val->abi;
-      u32 nir = 0, nfr = 0;
-      for (u16 i = 0; i < (ri2 ? ri2->nparts : 0); ++i) {
-        const ABIArgPart* pt = &ri2->parts[i];
-        if (pt->cls == ABI_CLASS_INT) {
-          Operand dst = {.kind = OPK_REG, .cls = RC_INT, .type = val->type};
-          dst.v.reg = RV_A0 + nir++;
-          rv_load_abi_part(t, dst, val->storage, pt->src_offset, pt->size);
-        } else if (pt->cls == ABI_CLASS_FP) {
-          Operand dst = {.kind = OPK_REG, .cls = RC_FP, .type = val->type};
-          u32 freg = 10u + nfr++;
-          dst.v.reg = freg;
-          rv_load_abi_part(t, dst, val->storage, pt->src_offset, pt->size);
-        } else {
-          compiler_panic(t->c, a->loc, "rv64 ret: part cls %d unimpl",
-                         (int)pt->cls);
-        }
-      }
-    }
-  }
-  if (a->omit_frame) {
-    rv64_emit32(mc, rv_ret_());
-    return;
-  }
-  /* Jump to epilogue. */
-  rv64_emit32(mc, rv_jal(RV_ZERO, 0));
-  mc->emit_label_ref(mc, a->epilogue_label, R_RV_JAL, 4, 0);
-}
-
-/* ---- alloca and va_list support ---- */
-
-static void rv_alloca_(CGTarget* t, Operand d, Operand sz, u32 align) {
-  RImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-  if (d.kind != OPK_REG) {
-    compiler_panic(t->c, a->loc, "rv64 alloca: dst must be REG");
-  }
-  if (align > 16) {
-    compiler_panic(t->c, a->loc, "rv64 alloca: align %u > 16 not yet supported",
-                   align);
-  }
-  if (sz.kind == OPK_IMM) {
-    i64 v = sz.v.imm;
-    if (v < 0) compiler_panic(t->c, a->loc, "rv64 alloca: negative size");
-    u64 aligned = ((u64)v + 15u) & ~(u64)15u;
-    if (aligned == 0) aligned = 16;
-    if (aligned > 2047u) {
-      compiler_panic(t->c, a->loc,
-                     "rv64 alloca: const size %llu too large for v1",
-                     (unsigned long long)aligned);
-    }
-    rv64_emit32(mc, rv_addi(RV_SP, RV_SP, -(i32)aligned));
-  } else if (sz.kind == OPK_REG) {
-    u32 sz_reg = reg_num(sz);
-    /* t0 = (sz + 15) & ~15; sp -= t0 */
-    rv64_emit32(mc, rv_addi(RV_T0, sz_reg, 15));
-    rv64_emit32(mc, rv_andi(RV_T0, RV_T0, -16));
-    rv64_emit32(mc, rv_sub(RV_SP, RV_SP, RV_T0));
-  } else {
-    compiler_panic(t->c, a->loc, "rv64 alloca: size kind %d unsupported",
-                   (int)sz.kind);
-  }
-
-  /* Placeholder: addi dst, sp, max_outgoing  (imm patched at func_end). */
-  if (a->nadd_patches == a->add_patches_cap) {
-    u32 ncap = a->add_patches_cap ? a->add_patches_cap * 2 : 4;
-    struct RvAllocaPatch* nb =
-        arena_array(t->c->tu, struct RvAllocaPatch, ncap);
-    if (a->add_patches)
-      memcpy(nb, a->add_patches, sizeof(*nb) * a->nadd_patches);
-    a->add_patches = nb;
-    a->add_patches_cap = ncap;
-  }
-  u32 dst_reg = reg_num(d);
-  a->add_patches[a->nadd_patches].pos = mc->pos(mc);
-  a->add_patches[a->nadd_patches].dst_reg = dst_reg;
-  a->nadd_patches++;
-  rv64_emit32(mc, rv_addi(dst_reg, RV_SP, 0));
-  a->has_alloca = 1;
-}
-/* RV64 LP64D va_list: a single `void*` pointing at the next argument
- * slot. The prologue spills a_{nparams_int}..a7 into the save area at
- * [s0 + 16]. The save area lives at the top of the callee frame,
- * immediately above the saved-s0/ra pair, so save_area[8] coincides
- * with the caller's first stack arg — a single 8-byte stride covers
- * register and stack args alike. */
-static void rv_va_start_(CGTarget* t, Operand ap_op) {
-  RImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-  if (!a->is_variadic) {
-    compiler_panic(t->c, a->loc, "rv64 va_start: function not variadic");
-  }
-  u32 ap = reg_num(ap_op);
-  /* *ap = s0 + 16 + next_param_int*8 (skip past named-int slots). */
-  i32 off = 16 + (i32)(a->next_param_int * 8u);
-  rv64_emit32(mc, rv_addi(RV_T0, RV_S0, off));
-  rv64_emit32(mc, rv_sd(RV_T0, ap, 0));
-}
-
-static void rv_va_arg_(CGTarget* t, Operand dst, Operand ap_op,
-                       CfreeCgTypeId ty) {
-  MCEmitter* mc = t->mc;
-  u32 ap = reg_num(ap_op);
-  u32 sz = type_byte_size(ty);
-  /* t1 = *ap; load value; *ap = t1 + 8 (rounded up).
-   * On RV64 LP64D every var arg occupies an 8-byte slot. */
-  rv64_emit32(mc, rv_ld(RV_T1, ap, 0));
-  if (dst.cls == RC_FP) {
-    /* For variadic FP args on RV64 LP64D, the value sits in the integer
-     * save area at the same bit pattern as a double bit-cast. Load and
-     * bitcast. */
-    if (sz == 8) {
-      rv64_emit32(mc, rv_ld(RV_T2, RV_T1, 0));
-      rv64_emit32(mc, rv_fmv_d_x(reg_num(dst), RV_T2));
-    } else {
-      rv64_emit32(mc, rv_lw(RV_T2, RV_T1, 0));
-      rv64_emit32(mc, rv_fmv_w_x(reg_num(dst), RV_T2));
-    }
-  } else {
-    int sx = type_is_signed(ty);
-    rv64_emit32(mc, enc_int_load(sz, sx, reg_num(dst), RV_T1, 0));
-  }
-  /* advance ap by 8 bytes. */
-  rv64_emit32(mc, rv_addi(RV_T1, RV_T1, 8));
-  rv64_emit32(mc, rv_sd(RV_T1, ap, 0));
-}
-
-static void rv_va_end_(CGTarget* t, Operand a) {
-  (void)t;
-  (void)a;
-}
-
-static void rv_va_copy_(CGTarget* t, Operand d, Operand s) {
-  MCEmitter* mc = t->mc;
-  u32 dr = reg_num(d);
-  u32 sr = reg_num(s);
-  /* va_list is a single pointer (8 bytes). */
-  rv64_emit32(mc, rv_ld(RV_T0, sr, 0));
-  rv64_emit32(mc, rv_sd(RV_T0, dr, 0));
-}
-
-/* ---- atomics (LR/SC loops and fences) ---- */
-
-int mem_order_is_acquire(MemOrder o) {
-  return o == MO_ACQUIRE || o == MO_ACQ_REL || o == MO_SEQ_CST ||
-         o == MO_CONSUME;
-}
-int mem_order_is_release(MemOrder o) {
-  return o == MO_RELEASE || o == MO_ACQ_REL || o == MO_SEQ_CST;
-}
-
-static void rv_atomic_load(CGTarget* t, Operand dst, Operand addr, MemAccess ma,
-                           MemOrder o) {
-  MCEmitter* mc = t->mc;
-  u32 sf = (ma.size == 8) ? 1u : 0u;
-  /* Resolve address to a register. */
-  u32 base;
-  if (addr.kind == OPK_REG) {
-    base = reg_num(addr);
-  } else if (addr.kind == OPK_LOCAL) {
-    RvAddrMode am = addr_mode(t, addr, RV_T0);
-    base = am.base;
-    if (am.ofs) {
-      rv64_emit32(mc, rv_addi(RV_T0, base, am.ofs));
-      base = RV_T0;
-    }
-  } else {
-    compiler_panic(t->c, impl_of(t)->loc, "rv64 atomic_load: addr kind %d NYI",
-                   (int)addr.kind);
-  }
-  if (mem_order_is_acquire(o)) {
-    /* lr.w/d as ordered load (aq=1, rl=0). */
-    rv64_emit32(mc, sf ? rv_lr_d(reg_num(dst), base, 1, 0)
-                       : rv_lr_w(reg_num(dst), base, 1, 0));
-  } else {
-    rv64_emit32(mc, enc_int_load(ma.size, 0, reg_num(dst), base, 0));
-  }
-}
-
-static void rv_atomic_store(CGTarget* t, Operand addr, Operand src,
-                            MemAccess ma, MemOrder o) {
-  MCEmitter* mc = t->mc;
-  u32 sf = (ma.size == 8) ? 1u : 0u;
-  u32 src_reg;
-  if (src.kind == OPK_IMM) {
-    rv64_emit_load_imm(mc, sf, RV_T1, src.v.imm);
-    src_reg = RV_T1;
-  } else if (src.kind == OPK_REG) {
-    src_reg = reg_num(src);
-  } else {
-    compiler_panic(t->c, impl_of(t)->loc, "rv64 atomic_store: src kind %d NYI",
-                   (int)src.kind);
-  }
-  u32 base;
-  if (addr.kind == OPK_REG) {
-    base = reg_num(addr);
-  } else if (addr.kind == OPK_LOCAL) {
-    RvAddrMode am = addr_mode(t, addr, RV_T0);
-    base = am.base;
-    if (am.ofs) {
-      rv64_emit32(mc, rv_addi(RV_T0, base, am.ofs));
-      base = RV_T0;
-    }
-  } else {
-    compiler_panic(t->c, impl_of(t)->loc, "rv64 atomic_store: addr kind %d NYI",
-                   (int)addr.kind);
-  }
-  if (mem_order_is_release(o)) {
-    /* fence rw,rw; store src, 0(base). SEQ_CST adds a trailing fence rw,rw. */
-    rv64_emit32(mc, rv_fence_rw_rw());
-    rv64_emit32(mc, enc_int_store(ma.size, src_reg, base, 0));
-    if (o == MO_SEQ_CST) rv64_emit32(mc, rv_fence_rw_rw());
-  } else {
-    rv64_emit32(mc, enc_int_store(ma.size, src_reg, base, 0));
-  }
-}
-
-static void rv_atomic_rmw(CGTarget* t, AtomicOp op, Operand dst, Operand addr,
-                          Operand val, MemAccess ma, MemOrder o) {
-  MCEmitter* mc = t->mc;
-  u32 sf = (ma.size == 8) ? 1u : 0u;
-  u32 base = RV_T0;
-  if (addr.kind == OPK_REG) {
-    rv64_emit32(mc, rv_addi(base, reg_num(addr), 0));
-  } else if (addr.kind == OPK_LOCAL) {
-    RvAddrMode am = addr_mode(t, addr, RV_T0);
-    if (am.base != RV_T0 || am.ofs) {
-      rv64_emit32(mc, rv_addi(base, am.base, am.ofs));
-    }
-  } else {
-    compiler_panic(t->c, impl_of(t)->loc, "rv64 atomic_rmw: addr NYI");
-  }
-  u32 vreg = RV_T1;
-  if (val.kind == OPK_IMM)
-    rv64_emit_load_imm(mc, sf, vreg, val.v.imm);
-  else if (val.kind == OPK_REG)
-    rv64_emit32(mc, rv_addi(vreg, reg_num(val), 0));
-  else
-    compiler_panic(t->c, impl_of(t)->loc, "rv64 atomic_rmw: val kind NYI");
-
-  int aq = mem_order_is_acquire(o);
-  int rl = mem_order_is_release(o);
-
-  /* LR/SC loop for every op. Simpler than per-op AMO encodings, although an
-   * AMO would be preferable for the cases the corpus exercises. */
-  MCLabel L_retry = mc->label_new(mc);
-  mc->label_place(mc, L_retry);
-  rv64_emit32(mc, sf ? rv_lr_d(reg_num(dst), base, (u32)aq, 0)
-                     : rv_lr_w(reg_num(dst), base, (u32)aq, 0));
-  u32 new_r = RV_T2;
-  switch (op) {
-    case AO_XCHG:
-      rv64_emit32(mc, rv_addi(new_r, vreg, 0));
-      break;
-    case AO_ADD:
-      rv64_emit32(mc, sf ? rv_add(new_r, reg_num(dst), vreg)
-                         : rv_addw(new_r, reg_num(dst), vreg));
-      break;
-    case AO_SUB:
-      rv64_emit32(mc, sf ? rv_sub(new_r, reg_num(dst), vreg)
-                         : rv_subw(new_r, reg_num(dst), vreg));
-      break;
-    case AO_AND:
-      rv64_emit32(mc, rv_and(new_r, reg_num(dst), vreg));
-      break;
-    case AO_OR:
-      rv64_emit32(mc, rv_or(new_r, reg_num(dst), vreg));
-      break;
-    case AO_XOR:
-      rv64_emit32(mc, rv_xor(new_r, reg_num(dst), vreg));
-      break;
-    case AO_NAND:
-      rv64_emit32(mc, rv_and(new_r, reg_num(dst), vreg));
-      rv64_emit32(mc, rv_xori(new_r, new_r, -1));
-      break;
-    default:
-      rv64_emit32(mc, rv_addi(new_r, vreg, 0));
-      break;
-  }
-  /* sc.w/d t3, new_r, (base); bnez t3, retry. */
-  rv64_emit32(mc, sf ? rv_sc_d(RV_T3, base, new_r, 0, (u32)rl)
-                     : rv_sc_w(RV_T3, base, new_r, 0, (u32)rl));
-  rv64_emit32(mc, rv_bne(RV_T3, RV_ZERO, 0));
-  mc->emit_label_ref(mc, L_retry, R_RV_BRANCH, 4, 0);
-}
-
-static void rv_atomic_cas(CGTarget* t, Operand prior, Operand ok, Operand addr,
-                          Operand exp, Operand des, MemAccess ma, MemOrder succ,
-                          MemOrder fail) {
-  MCEmitter* mc = t->mc;
-  u32 sf = (ma.size == 8) ? 1u : 0u;
-  (void)fail;
-  u32 base = RV_T0;
-  if (addr.kind == OPK_REG)
-    rv64_emit32(mc, rv_addi(base, reg_num(addr), 0));
-  else if (addr.kind == OPK_LOCAL) {
-    RvAddrMode am = addr_mode(t, addr, RV_T0);
-    if (am.base != RV_T0 || am.ofs)
-      rv64_emit32(mc, rv_addi(base, am.base, am.ofs));
-  } else
-    compiler_panic(t->c, impl_of(t)->loc, "rv64 atomic_cas: addr NYI");
-  u32 ereg = RV_T1, dreg = RV_T2;
-  if (exp.kind == OPK_IMM)
-    rv64_emit_load_imm(mc, sf, ereg, exp.v.imm);
-  else
-    rv64_emit32(mc, rv_addi(ereg, reg_num(exp), 0));
-  if (des.kind == OPK_IMM)
-    rv64_emit_load_imm(mc, sf, dreg, des.v.imm);
-  else
-    rv64_emit32(mc, rv_addi(dreg, reg_num(des), 0));
-
-  int aq = mem_order_is_acquire(succ);
-  int rl = mem_order_is_release(succ);
-
-  MCLabel L_retry = mc->label_new(mc);
-  MCLabel L_fail = mc->label_new(mc);
-  MCLabel L_done = mc->label_new(mc);
-
-  mc->label_place(mc, L_retry);
-  rv64_emit32(mc, sf ? rv_lr_d(reg_num(prior), base, (u32)aq, 0)
-                     : rv_lr_w(reg_num(prior), base, (u32)aq, 0));
-  /* if (prior != expected) -> fail */
-  rv64_emit32(mc, rv_bne(reg_num(prior), ereg, 0));
-  mc->emit_label_ref(mc, L_fail, R_RV_BRANCH, 4, 0);
-  /* sc.w/d t3, des, (base); bnez t3, retry */
-  rv64_emit32(mc, sf ? rv_sc_d(RV_T3, base, dreg, 0, (u32)rl)
-                     : rv_sc_w(RV_T3, base, dreg, 0, (u32)rl));
-  rv64_emit32(mc, rv_bne(RV_T3, RV_ZERO, 0));
-  mc->emit_label_ref(mc, L_retry, R_RV_BRANCH, 4, 0);
-  /* ok = 1; jump done */
-  rv64_emit_load_imm(mc, 0, reg_num(ok), 1);
-  rv64_emit32(mc, rv_jal(RV_ZERO, 0));
-  mc->emit_label_ref(mc, L_done, R_RV_JAL, 4, 0);
-
-  mc->label_place(mc, L_fail);
-  rv64_emit_load_imm(mc, 0, reg_num(ok), 0);
-
-  mc->label_place(mc, L_done);
-}
-
-static void rv_fence(CGTarget* t, MemOrder o) {
-  if (o == MO_RELAXED) return;
-  rv64_emit32(t->mc, rv_fence_rw_rw());
-}
-
-/* ---- intrinsics: do what we can; panic on the rest. ---- */
-static void rv_intrinsic(CGTarget* t, IntrinKind kind, Operand* dsts, u32 nd,
-                         const Operand* args, u32 na) {
-  (void)nd;
-  (void)na;
-  MCEmitter* mc = t->mc;
-  RImpl* a = impl_of(t);
-  switch (kind) {
-    case INTRIN_ASSUME_ALIGNED:
-    case INTRIN_EXPECT: {
-      /* dst = val (hint dropped). */
-      Operand val = args[0];
-      Operand dst = dsts[0];
-      u32 sf = type_is_64(dst.type) ? 1u : 0u;
-      if (val.kind == OPK_REG) {
-        if (reg_num(val) != reg_num(dst))
-          rv64_emit32(mc, rv_addi(reg_num(dst), reg_num(val), 0));
-      } else if (val.kind == OPK_IMM) {
-        rv64_emit_load_imm(mc, sf, reg_num(dst), val.v.imm);
-      } else {
-        compiler_panic(t->c, a->loc, "rv64 intrinsic: val kind %d NYI",
-                       (int)val.kind);
-      }
-      return;
-    }
-    case INTRIN_PREFETCH:
-      return;
-    case INTRIN_UNREACHABLE:
-    case INTRIN_TRAP:
-      rv64_emit32(mc, rv_ebreak());
-      return;
-    case INTRIN_BSWAP16: {
-      /* rd = ((rs & 0xff) << 8) | ((rs >> 8) & 0xff) */
-      u32 rd = reg_num(dsts[0]);
-      u32 rs = reg_num(args[0]);
-      rv64_emit32(mc, rv_slli(RV_T1, rs, 8));    /* t1 = rs << 8 */
-      /* Dead: zeroes t1, which is recomputed below. */
-      rv64_emit32(mc, rv_andi(RV_T1, RV_T1, 0));
-      /* Dead: t2 is overwritten by the next instruction. */
-      rv64_emit32(mc, rv_addi(RV_T2, RV_ZERO, 0));
-      /* 0xff00 does not fit a 12-bit signed immediate, so build the mask
-       * with a shift: t2 = 0xff << 8. */
-      rv64_emit32(mc, rv_addi(RV_T2, RV_ZERO, 0xff));
-      rv64_emit32(mc, rv_slli(RV_T2, RV_T2, 8));
-      /* t1 = (rs << 8) & 0xff00 */
-      rv64_emit32(mc, rv_slli(RV_T1, rs, 8));
-      rv64_emit32(mc, rv_and(RV_T1, RV_T1, RV_T2));
-      /* t3 = (rs >> 8) & 0xff. Mask explicitly, since the upper bits of rs
-       * are not guaranteed to be zero. */
-      rv64_emit32(mc, rv_srli(RV_T3, rs, 8));
-      rv64_emit32(mc, rv_andi(RV_T3, RV_T3, 0xff));
-      rv64_emit32(mc, rv_or(rd, RV_T1, RV_T3));
-      return;
-    }
-    case INTRIN_BSWAP32: {
-      u32 rd = reg_num(dsts[0]);
-      u32 rs = reg_num(args[0]);
-      /* result = (b0<<24)|(b1<<16)|(b2<<8)|b3, where bi = (rs >> (8*i)) & 0xff.
-       */
-      /* t1 = ((rs >> 24) & 0xff) */
-      rv64_emit32(mc, rv_srliw(RV_T1, rs, 24));
-      rv64_emit32(mc, rv_andi(RV_T1, RV_T1, 0xff));
-      /* t2 = ((rs >> 16) & 0xff) << 8 */
-      rv64_emit32(mc, rv_srliw(RV_T2, rs, 16));
-      rv64_emit32(mc, rv_andi(RV_T2, RV_T2, 0xff));
-      rv64_emit32(mc, rv_slli(RV_T2, RV_T2, 8));
-      rv64_emit32(mc, rv_or(RV_T1, RV_T1, RV_T2));
-      /* t2 = ((rs >> 8) & 0xff) << 16 */
-      rv64_emit32(mc, rv_srliw(RV_T2, rs, 8));
-      rv64_emit32(mc, rv_andi(RV_T2, RV_T2, 0xff));
-      rv64_emit32(mc, rv_slli(RV_T2, RV_T2, 16));
-      rv64_emit32(mc, rv_or(RV_T1, RV_T1, RV_T2));
-      /* t2 = (rs & 0xff) << 24 */
-      rv64_emit32(mc, rv_andi(RV_T2, rs, 0xff));
-      rv64_emit32(mc, rv_slli(RV_T2, RV_T2, 24));
-      rv64_emit32(mc, rv_or(rd, RV_T1, RV_T2));
-      /* zero-extend the 32-bit result (done unconditionally) */
-      rv64_emit32(mc, rv_slli(rd, rd, 32));
-      rv64_emit32(mc, rv_srli(rd, rd, 32));
-      return;
-    }
-    case INTRIN_BSWAP64: {
-      u32 rd = reg_num(dsts[0]);
-      u32 rs = reg_num(args[0]);
-      /* General bswap64: iterate over the 8 bytes. */
-      /* t1 accumulator */
-      rv64_emit32(mc, rv_addi(RV_T1, RV_ZERO, 0));
-      for (int i = 0; i < 8; ++i) {
-        /* t2 = (rs >> (8*i)) & 0xff */
-        if (i == 0) {
-          rv64_emit32(mc, rv_andi(RV_T2, rs, 0xff));
-        } else {
-          rv64_emit32(mc, rv_srli(RV_T2, rs, (u32)(8 * i)));
-          rv64_emit32(mc, rv_andi(RV_T2, RV_T2, 0xff));
-        }
-        /* t2 <<= (56 - 8*i) (so byte 0 goes to top) */
-        int sh = 56 - 8 * i;
-        if (sh) rv64_emit32(mc, rv_slli(RV_T2, RV_T2, (u32)sh));
-        rv64_emit32(mc, rv_or(RV_T1, RV_T1, RV_T2));
-      }
-      rv64_emit32(mc, rv_addi(rd, RV_T1, 0));
-      return;
-    }
-    case INTRIN_POPCOUNT: {
-      /* Software popcount. Use the bit-twiddling sequence at the
-       * appropriate width; the operand type (args[0]) drives the width. */
-      u32 rd = reg_num(dsts[0]);
-      u32 rs = reg_num(args[0]);
-      int is64 = type_is_64(args[0].type);
-      /* Move rs into t1 to avoid clobbering input. */
-      rv64_emit32(mc, rv_addi(RV_T1, rs, 0));
-      if (!is64) {
-        /* zext.w t1, t1 */
-        rv64_emit32(mc, rv_slli(RV_T1, RV_T1, 32));
-        rv64_emit32(mc, rv_srli(RV_T1, RV_T1, 32));
-      }
-      /* t1 = t1 - ((t1 >> 1) & 0x5555...) */
-      rv64_emit32(mc, rv_srli(RV_T2, RV_T1, 1));
-      rv64_emit_load_imm(mc, 1, RV_T3,
-                         is64 ? (i64)0x5555555555555555ll : (i64)0x55555555);
-      rv64_emit32(mc, rv_and(RV_T2, RV_T2, RV_T3));
-      rv64_emit32(mc, rv_sub(RV_T1, RV_T1, RV_T2));
-      /* t1 = (t1 & 0x3333...) + ((t1 >> 2) & 0x3333...) */
-      rv64_emit_load_imm(mc, 1, RV_T3,
-                         is64 ? (i64)0x3333333333333333ll : (i64)0x33333333);
-      rv64_emit32(mc, rv_and(RV_T2, RV_T1, RV_T3));
-      rv64_emit32(mc, rv_srli(RV_T1, RV_T1, 2));
-      rv64_emit32(mc, rv_and(RV_T1, RV_T1, RV_T3));
-      rv64_emit32(mc, rv_add(RV_T1, RV_T1, RV_T2));
-      /* t1 = (t1 + (t1 >> 4)) & 0x0f0f... */
-      rv64_emit32(mc, rv_srli(RV_T2, RV_T1, 4));
-      rv64_emit32(mc, rv_add(RV_T1, RV_T1, RV_T2));
-      rv64_emit_load_imm(mc, 1, RV_T3,
-                         is64 ? (i64)0x0f0f0f0f0f0f0f0fll : (i64)0x0f0f0f0f);
-      rv64_emit32(mc, rv_and(RV_T1, RV_T1, RV_T3));
-      /* t1 *= 0x0101010101... ; result in top byte */
-      rv64_emit_load_imm(mc, 1, RV_T3,
-                         is64 ? (i64)0x0101010101010101ll : (i64)0x01010101);
-      rv64_emit32(mc, rv_mul(RV_T1, RV_T1, RV_T3));
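-      /* Shift right by (width - 8). The 32-bit case uses srliw so that the
-       * bits the 64-bit multiply carries above bit 31 are discarded. */
-      rv64_emit32(mc, is64 ? rv_srli(rd, RV_T1, 56u)
-                           : rv_srliw(rd, RV_T1, 24u));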
-      return;
-    }
-    case INTRIN_CTZ: {
-      /* ctz(x) = popcount((x & -x) - 1) for x != 0. */
-      u32 rd = reg_num(dsts[0]);
-      u32 rs = reg_num(args[0]);
-      int is64 = type_is_64(args[0].type);
-      /* t1 = -x */
-      rv64_emit32(mc, rv_sub(RV_T1, RV_ZERO, rs));
-      /* t1 = x & -x */
-      rv64_emit32(mc, rv_and(RV_T1, RV_T1, rs));
-      /* t1 = t1 - 1 */
-      rv64_emit32(mc, rv_addi(RV_T1, RV_T1, -1));
-      if (!is64) {
-        rv64_emit32(mc, rv_slli(RV_T1, RV_T1, 32));
-        rv64_emit32(mc, rv_srli(RV_T1, RV_T1, 32));
-      }
-      /* popcount(t1) into rd */
-      rv64_emit32(mc, rv_srli(RV_T2, RV_T1, 1));
-      rv64_emit_load_imm(mc, 1, RV_T3,
-                         is64 ? (i64)0x5555555555555555ll : (i64)0x55555555);
-      rv64_emit32(mc, rv_and(RV_T2, RV_T2, RV_T3));
-      rv64_emit32(mc, rv_sub(RV_T1, RV_T1, RV_T2));
-      rv64_emit_load_imm(mc, 1, RV_T3,
-                         is64 ? (i64)0x3333333333333333ll : (i64)0x33333333);
-      rv64_emit32(mc, rv_and(RV_T2, RV_T1, RV_T3));
-      rv64_emit32(mc, rv_srli(RV_T1, RV_T1, 2));
-      rv64_emit32(mc, rv_and(RV_T1, RV_T1, RV_T3));
-      rv64_emit32(mc, rv_add(RV_T1, RV_T1, RV_T2));
-      rv64_emit32(mc, rv_srli(RV_T2, RV_T1, 4));
-      rv64_emit32(mc, rv_add(RV_T1, RV_T1, RV_T2));
-      rv64_emit_load_imm(mc, 1, RV_T3,
-                         is64 ? (i64)0x0f0f0f0f0f0f0f0fll : (i64)0x0f0f0f0f);
-      rv64_emit32(mc, rv_and(RV_T1, RV_T1, RV_T3));
-      rv64_emit_load_imm(mc, 1, RV_T3,
-                         is64 ? (i64)0x0101010101010101ll : (i64)0x01010101);
-      rv64_emit32(mc, rv_mul(RV_T1, RV_T1, RV_T3));
-      /* srliw in the 32-bit case drops product bits above bit 31. */
-      rv64_emit32(mc, is64 ? rv_srli(rd, RV_T1, 56u)
-                           : rv_srliw(rd, RV_T1, 24u));
-      return;
-    }
-    case INTRIN_CLZ: {
-      /* Software clz: fold the high bit downward, then popcount the
-       * inverted result.  Standard recipe:
-       *   x |= x>>1; x |= x>>2; x |= x>>4; x |= x>>8; x |= x>>16;
-       *   [x |= x>>32;] // 64-bit
-       *   clz = popcount(~x) [for the appropriate width].
-       */
-      u32 rd = reg_num(dsts[0]);
-      u32 rs = reg_num(args[0]);
-      int is64 = type_is_64(args[0].type);
-      rv64_emit32(mc, rv_addi(RV_T1, rs, 0));
-      if (!is64) {
-        /* zero-ext to 32 to make srli safe */
-        rv64_emit32(mc, rv_slli(RV_T1, RV_T1, 32));
-        rv64_emit32(mc, rv_srli(RV_T1, RV_T1, 32));
-      }
-      u32 shifts[6] = {1, 2, 4, 8, 16, 32};
-      u32 ns = is64 ? 6u : 5u;
-      for (u32 i = 0; i < ns; ++i) {
-        rv64_emit32(mc, rv_srli(RV_T2, RV_T1, shifts[i]));
-        rv64_emit32(mc, rv_or(RV_T1, RV_T1, RV_T2));
-      }
-      /* clz(x) = popcount(~fold(x)) at the operand width.
-       * Example (8-bit): x = 0b00011010, fold => 0b00011111,
-       * ~ => 0b11100000, popcount = 3 = clz(x). */
-      rv64_emit32(mc, rv_xori(RV_T1, RV_T1, -1));
-      if (!is64) {
-        rv64_emit32(mc, rv_slli(RV_T1, RV_T1, 32));
-        rv64_emit32(mc, rv_srli(RV_T1, RV_T1, 32));
-      }
-      /* popcount(t1) into rd */
-      rv64_emit32(mc, rv_srli(RV_T2, RV_T1, 1));
-      rv64_emit_load_imm(mc, 1, RV_T3,
-                         is64 ? (i64)0x5555555555555555ll : (i64)0x55555555);
-      rv64_emit32(mc, rv_and(RV_T2, RV_T2, RV_T3));
-      rv64_emit32(mc, rv_sub(RV_T1, RV_T1, RV_T2));
-      rv64_emit_load_imm(mc, 1, RV_T3,
-                         is64 ? (i64)0x3333333333333333ll : (i64)0x33333333);
-      rv64_emit32(mc, rv_and(RV_T2, RV_T1, RV_T3));
-      rv64_emit32(mc, rv_srli(RV_T1, RV_T1, 2));
-      rv64_emit32(mc, rv_and(RV_T1, RV_T1, RV_T3));
-      rv64_emit32(mc, rv_add(RV_T1, RV_T1, RV_T2));
-      rv64_emit32(mc, rv_srli(RV_T2, RV_T1, 4));
-      rv64_emit32(mc, rv_add(RV_T1, RV_T1, RV_T2));
-      rv64_emit_load_imm(mc, 1, RV_T3,
-                         is64 ? (i64)0x0f0f0f0f0f0f0f0fll : (i64)0x0f0f0f0f);
-      rv64_emit32(mc, rv_and(RV_T1, RV_T1, RV_T3));
-      rv64_emit_load_imm(mc, 1, RV_T3,
-                         is64 ? (i64)0x0101010101010101ll : (i64)0x01010101);
-      rv64_emit32(mc, rv_mul(RV_T1, RV_T1, RV_T3));
-      /* srliw in the 32-bit case drops product bits above bit 31. */
-      rv64_emit32(mc, is64 ? rv_srli(rd, RV_T1, 56u)
-                           : rv_srliw(rd, RV_T1, 24u));
-      return;
-    }
-    case INTRIN_SADD_OVERFLOW:
-    case INTRIN_SSUB_OVERFLOW: {
-      /* dsts: [val, ovf]. Signed overflow check.
-       * For ADD: ovf = ((a XOR result) & (b XOR result)) >> (width-1)
-       * For SUB: ovf = ((a XOR b)      & (a XOR result)) >> (width-1) */
-      Operand a_op = args[0], b_op = args[1];
-      Operand dval = dsts[0], dovf = dsts[1];
-      int is64 = type_is_64(dval.type);
-      u32 ra = rv64_force_reg_int(t, a_op, RV_T0);
-      u32 rb = rv64_force_reg_int(t, b_op, (ra == RV_T0) ? RV_T1 : RV_T0);
-      u32 rd = reg_num(dval);
-      u32 rovf = reg_num(dovf);
-      /* Compute result into t2 (avoid clobbering rd if rd == ra/rb). */
-      if (kind == INTRIN_SADD_OVERFLOW) {
-        rv64_emit32(mc, is64 ? rv_add(RV_T2, ra, rb) : rv_addw(RV_T2, ra, rb));
-      } else {
-        rv64_emit32(mc, is64 ? rv_sub(RV_T2, ra, rb) : rv_subw(RV_T2, ra, rb));
-      }
-      /* t3 = a XOR t2 */
-      rv64_emit32(mc, rv_xor(RV_T3, ra, RV_T2));
-      if (kind == INTRIN_SADD_OVERFLOW) {
-        /* ovf = b XOR t2 */
-        rv64_emit32(mc, rv_xor(rovf, rb, RV_T2));
-        rv64_emit32(mc, rv_and(rovf, rovf, RV_T3));
-      } else {
-        /* ovf = a XOR b */
-        rv64_emit32(mc, rv_xor(rovf, ra, rb));
-        rv64_emit32(mc, rv_and(rovf, rovf, RV_T3));
-      }
-      /* shift right to extract sign bit */
-      u32 sh = is64 ? 63u : 31u;
-      rv64_emit32(mc,
-                  is64 ? rv_srli(rovf, rovf, sh) : rv_srliw(rovf, rovf, sh));
-      rv64_emit32(mc, rv_andi(rovf, rovf, 1));
-      /* Now write the value. */
-      rv64_emit32(mc, rv_addi(rd, RV_T2, 0));
-      return;
-    }
-    case INTRIN_UADD_OVERFLOW:
-    case INTRIN_USUB_OVERFLOW: {
-      Operand a_op = args[0], b_op = args[1];
-      Operand dval = dsts[0], dovf = dsts[1];
-      int is64 = type_is_64(dval.type);
-      u32 ra = rv64_force_reg_int(t, a_op, RV_T0);
-      u32 rb = rv64_force_reg_int(t, b_op, (ra == RV_T0) ? RV_T1 : RV_T0);
-      u32 rd = reg_num(dval);
-      u32 rovf = reg_num(dovf);
-      if (!is64) {
-        rv64_emit32(mc, rv_slli(RV_T2, ra, 32));
-        rv64_emit32(mc, rv_srli(RV_T2, RV_T2, 32));
-        rv64_emit32(mc, rv_slli(RV_T3, rb, 32));
-        rv64_emit32(mc, rv_srli(RV_T3, RV_T3, 32));
-        ra = RV_T2;
-        rb = RV_T3;
-      }
-      if (kind == INTRIN_UADD_OVERFLOW) {
-        if (is64) {
-          rv64_emit32(mc, rv_add(RV_T2, ra, rb));
-          rv64_emit32(mc, rv_sltu(rovf, RV_T2, ra));
-        } else {
-          rv64_emit32(mc, rv_add(RV_T2, ra, rb));
-          rv64_emit32(mc, rv_srli(rovf, RV_T2, 32));
-          rv64_emit32(mc, rv_sltu(rovf, RV_ZERO, rovf));
-          rv64_emit32(mc, rv_addiw(RV_T2, RV_T2, 0));
-        }
-      } else {
-        rv64_emit32(mc, rv_sltu(rovf, ra, rb));
-        rv64_emit32(mc, is64 ? rv_sub(RV_T2, ra, rb) : rv_subw(RV_T2, ra, rb));
-      }
-      rv64_emit32(mc, rv_addi(rd, RV_T2, 0));
-      return;
-    }
-    case INTRIN_SMUL_OVERFLOW: {
-      Operand a_op = args[0], b_op = args[1];
-      Operand dval = dsts[0], dovf = dsts[1];
-      int is64 = type_is_64(dval.type);
-      u32 ra = rv64_force_reg_int(t, a_op, RV_T0);
-      u32 rb = rv64_force_reg_int(t, b_op, (ra == RV_T0) ? RV_T1 : RV_T0);
-      u32 rd = reg_num(dval);
-      u32 rovf = reg_num(dovf);
-      if (is64) {
-        rv64_emit32(mc, rv_mul(RV_T2, ra, rb));
-        rv64_emit32(mc, rv_mulh(RV_T3, ra, rb));
-        rv64_emit32(mc, rv_srai(rovf, RV_T2, 63));
-        rv64_emit32(mc, rv_xor(rovf, RV_T3, rovf));
-        rv64_emit32(mc, rv_sltu(rovf, RV_ZERO, rovf));
-        rv64_emit32(mc, rv_addi(rd, RV_T2, 0));
-      } else {
-        /* Full 64-bit signed product of two i32s, then compare with
-         * sign-extension of the low 32 bits. */
-        rv64_emit32(mc, rv_addiw(RV_T2, ra, 0));
-        rv64_emit32(mc, rv_addiw(RV_T3, rb, 0));
-        rv64_emit32(mc, rv_mul(RV_T2, RV_T2, RV_T3));
-        rv64_emit32(mc, rv_addiw(RV_T3, RV_T2, 0));
-        rv64_emit32(mc, rv_xor(rovf, RV_T2, RV_T3));
-        rv64_emit32(mc, rv_sltu(rovf, RV_ZERO, rovf));
-        rv64_emit32(mc, rv_addiw(rd, RV_T2, 0));
-      }
-      return;
-    }
-    case INTRIN_UMUL_OVERFLOW: {
-      Operand a_op = args[0], b_op = args[1];
-      Operand dval = dsts[0], dovf = dsts[1];
-      int is64 = type_is_64(dval.type);
-      u32 ra = rv64_force_reg_int(t, a_op, RV_T0);
-      u32 rb = rv64_force_reg_int(t, b_op, (ra == RV_T0) ? RV_T1 : RV_T0);
-      u32 rd = reg_num(dval);
-      u32 rovf = reg_num(dovf);
-      if (is64) {
-        rv64_emit32(mc, rv_mulhu(rovf, ra, rb));
-        rv64_emit32(mc, rv_mul(rd, ra, rb));
-        rv64_emit32(mc, rv_sltu(rovf, RV_ZERO, rovf));
-      } else {
-        rv64_emit32(mc, rv_slli(RV_T2, ra, 32));
-        rv64_emit32(mc, rv_srli(RV_T2, RV_T2, 32));
-        rv64_emit32(mc, rv_slli(RV_T3, rb, 32));
-        rv64_emit32(mc, rv_srli(RV_T3, RV_T3, 32));
-        rv64_emit32(mc, rv_mul(RV_T2, RV_T2, RV_T3));
-        rv64_emit32(mc, rv_srli(rovf, RV_T2, 32));
-        rv64_emit32(mc, rv_sltu(rovf, RV_ZERO, rovf));
-        rv64_emit32(mc, rv_addiw(rd, RV_T2, 0));
-      }
-      return;
-    }
-    case INTRIN_MEMCPY:
-    case INTRIN_MEMMOVE: {
-      Operand da = args[0], sa = args[1], nb = args[2];
-      if (da.kind != OPK_REG || sa.kind != OPK_REG || nb.kind != OPK_IMM) {
-        compiler_panic(t->c, a->loc,
-                       "rv64 intrinsic: memcpy/memmove non-const NYI");
-      }
-      u32 dr = reg_num(da), sr = reg_num(sa), n = (u32)nb.v.imm;
-      if (kind == INTRIN_MEMCPY) {
-        u32 i = 0;
-        while (i + 8 <= n) {
-          rv64_emit32(mc, rv_ld(RV_T3, sr, (i32)i));
-          rv64_emit32(mc, rv_sd(RV_T3, dr, (i32)i));
-          i += 8;
-        }
-        while (i + 4 <= n) {
-          rv64_emit32(mc, rv_lwu(RV_T3, sr, (i32)i));
-          rv64_emit32(mc, rv_sw(RV_T3, dr, (i32)i));
-          i += 4;
-        }
-        while (i + 2 <= n) {
-          rv64_emit32(mc, rv_lhu(RV_T3, sr, (i32)i));
-          rv64_emit32(mc, rv_sh(RV_T3, dr, (i32)i));
-          i += 2;
-        }
-        while (i < n) {
-          rv64_emit32(mc, rv_lbu(RV_T3, sr, (i32)i));
-          rv64_emit32(mc, rv_sb(RV_T3, dr, (i32)i));
-          i += 1;
-        }
-      } else {
-        u32 i = n;
-        while (i >= 8) {
-          i -= 8;
-          rv64_emit32(mc, rv_ld(RV_T3, sr, (i32)i));
-          rv64_emit32(mc, rv_sd(RV_T3, dr, (i32)i));
-        }
-        while (i >= 4) {
-          i -= 4;
-          rv64_emit32(mc, rv_lwu(RV_T3, sr, (i32)i));
-          rv64_emit32(mc, rv_sw(RV_T3, dr, (i32)i));
-        }
-        while (i >= 2) {
-          i -= 2;
-          rv64_emit32(mc, rv_lhu(RV_T3, sr, (i32)i));
-          rv64_emit32(mc, rv_sh(RV_T3, dr, (i32)i));
-        }
-        while (i >= 1) {
-          i -= 1;
-          rv64_emit32(mc, rv_lbu(RV_T3, sr, (i32)i));
-          rv64_emit32(mc, rv_sb(RV_T3, dr, (i32)i));
-        }
-      }
-      return;
-    }
-    case INTRIN_MEMSET: {
-      Operand da = args[0], bv = args[1], nb = args[2];
-      if (da.kind != OPK_REG || nb.kind != OPK_IMM) {
-        compiler_panic(t->c, a->loc, "rv64 intrinsic: memset non-const NYI");
-      }
-      u32 dr = reg_num(da), n = (u32)nb.v.imm;
-      u32 src;
-      if (bv.kind == OPK_IMM) {
-        u32 byte = (u32)(bv.v.imm & 0xffu);
-        if (byte == 0)
-          src = RV_ZERO;
-        else {
-          u64 b = byte;
-          b |= b << 8;
-          b |= b << 16;
-          b |= b << 32;
-          rv64_emit_load_imm(mc, 1, RV_T3, (i64)b);
-          src = RV_T3;
-        }
-      } else {
-        compiler_panic(t->c, a->loc, "rv64 intrinsic: memset REG byte NYI");
-      }
-      u32 i = 0;
-      while (i + 8 <= n) {
-        rv64_emit32(mc, rv_sd(src, dr, (i32)i));
-        i += 8;
-      }
-      while (i + 4 <= n) {
-        rv64_emit32(mc, rv_sw(src, dr, (i32)i));
-        i += 4;
-      }
-      while (i + 2 <= n) {
-        rv64_emit32(mc, rv_sh(src, dr, (i32)i));
-        i += 2;
-      }
-      while (i < n) {
-        rv64_emit32(mc, rv_sb(src, dr, (i32)i));
-        i += 1;
-      }
-      return;
-    }
-    default:
-      compiler_panic(t->c, a->loc, "rv64 intrinsic kind %d NYI", (int)kind);
-  }
-}
-
-static void rv_asm_block(CGTarget* t, const char* tmpl,
-                         const AsmConstraint* outs, u32 no, Operand* oo,
-                         const AsmConstraint* ins, u32 ni, const Operand* io,
-                         const Sym* clobs, u32 nc) {
-  RImpl* impl = impl_of(t);
-  /* Bump the callee-save high-water mark for any callee-saved register
-   * named in the clobber list (psABI: s0..s11 are CS for integers, fs0..
-   * fs11 for FP). Same accounting the prologue uses for bound regs. */
-  for (u32 i = 0; i < nc; ++i) {
-    Slice cs = pool_slice(t->c->global, clobs[i]);
-    char buf[16];
-    uint32_t dwarf;
-    if (!cs.s || !cs.len) continue;
-    if (cs.len >= sizeof buf) continue;
-    memcpy(buf, cs.s, cs.len);
-    buf[cs.len] = '\0';
-    if (rv64_register_index(buf, &dwarf) != 0) continue;
-    if (dwarf <= 31u) {
-      /* Integer reg: s0=x8, s1=x9, s2..s11=x18..x27. */
-      if (dwarf == 8u || dwarf == 9u || (dwarf >= 18u && dwarf <= 27u)) {
-        impl->used_cs_int_mask |= 1u << dwarf;
-      }
-    } else if (dwarf >= 32u && dwarf <= 63u) {
-      uint32_t fr = dwarf - 32u;
-      /* fs0=f8, fs1=f9, fs2..fs11=f18..f27. */
-      if (fr == 8u || fr == 9u || (fr >= 18u && fr <= 27u)) {
-        impl->used_cs_fp_mask |= 1u << fr;
-      }
-    }
-  }
-  Rv64Asm* a = rv64_asm_open(t->c);
-  rv64_inline_bind(a, outs, no, oo, ins, ni, io, clobs, nc);
-  rv64_asm_run_template(a, t->mc, tmpl);
-  rv64_asm_close(a);
-}
-
-static void rv_set_loc(CGTarget* t, SrcLoc l) {
-  ((RImpl*)t)->loc = l;
-  if (t->mc) t->mc->set_loc(t->mc, l);
-}
-
-static void rv_finalize(CGTarget* t) { (void)t; }
-static void rv_destroy(CGTarget* t) { (void)t; }
-
-static void cgt_cleanup(void* arg) { cgtarget_free((CGTarget*)arg); }
-
-CGTarget* rv64_cgtarget_new(Compiler* c, ObjBuilder* o, MCEmitter* m) {
-  RImpl* x = arena_new(c->tu, RImpl);
-  memset(x, 0, sizeof *x);
-
-  CGTarget* t = &x->base;
-  t->c = c;
-  t->obj = o;
-  t->mc = m;
-
-  t->func_begin = rv_func_begin;
-  t->func_begin_known_frame = rv_func_begin_known_frame;
-  t->func_end = rv_func_end;
-
-  t->frame_slot = rv_frame_slot;
-  t->param = rv_param;
-  t->spill_reg = rv_spill_reg;
-  t->reload_reg = rv_reload_reg;
-
-  t->label_new = rv_label_new;
-  t->label_place = rv_label_place;
-  t->jump = rv_jump;
-  t->cmp_branch = rv_cmp_branch;
-  t->load_label_addr = rv_load_label_addr;
-  t->indirect_branch = rv_indirect_branch;
-
-  t->scope_begin = rv_scope_begin;
-  t->scope_else = rv_scope_else;
-  t->scope_end = rv_scope_end;
-  t->break_to = rv_break_to;
-  t->continue_to = rv_continue_to;
-
-  t->load_imm = rv_load_imm;
-  t->load_const = rv_load_const;
-  t->copy = rv_copy;
-  t->load = rv_load;
-  t->store = rv_store;
-  t->addr_of = rv_addr_of;
-  t->tls_addr_of = rv_tls_addr_of;
-  t->copy_bytes = rv_copy_bytes;
-  t->set_bytes = rv_set_bytes;
-  t->bitfield_load = rv_bitfield_load;
-  t->bitfield_store = rv_bitfield_store;
-
-  t->binop = rv_binop;
-  t->unop = rv_unop;
-  t->cmp = rv_cmp;
-  t->convert = rv_convert;
-
-  t->call = rv_call;
-  t->load_call_arg = rv_load_call_arg;
-  t->emit_call_plan = rv_emit_call_plan;
-  t->store_call_arg = rv_store_call_arg;
-  t->store_call_ret = rv_store_call_ret;
-  t->call_stack_size = rv_call_stack_size;
-  t->tail_call_unrealizable_reason = rv_tail_call_unrealizable_reason;
-  t->ret = rv_ret;
-
-  t->alloca_ = rv_alloca_;
-  t->va_start_ = rv_va_start_;
-  t->va_arg_ = rv_va_arg_;
-  t->va_end_ = rv_va_end_;
-  t->va_copy_ = rv_va_copy_;
-
-  t->atomic_load = rv_atomic_load;
-  t->atomic_store = rv_atomic_store;
-  t->atomic_rmw = rv_atomic_rmw;
-  t->atomic_cas = rv_atomic_cas;
-  t->fence = rv_fence;
-
-  t->intrinsic = rv_intrinsic;
-  t->asm_block = rv_asm_block;
-
-  t->set_loc = rv_set_loc;
-  t->finalize = rv_finalize;
-  t->destroy = rv_destroy;
-
-#if CFREE_OPT_ENABLED
-  rv_coord_vtable_init(t);
-#endif
-
-  (void)type_is_signed;
-  compiler_defer(c, cgt_cleanup, t);
-  return t;
-}
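
Note on the overflow intrinsics removed above: the deleted comment gives the signed checks as ovf = ((a XOR result) & (b XOR result)) >> (width-1) for ADD and ovf = ((a XOR b) & (a XOR result)) >> (width-1) for SUB, and the 32-bit unsigned multiply widens to 64 bits and tests the upper half. The following is a minimal host-side sketch of those formulas for the 32-bit case only. The function names are hypothetical and do not appear in the tree, and this is not the emitter itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper names; only the formulas come from the deleted code. */

/* Signed add: overflow iff a and b both differ in sign from the result. */
int sadd32_overflows(int32_t a, int32_t b, int32_t *out) {
  uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
  uint32_t r = ua + ub; /* wraps modulo 2^32, like ADDW's low 32 bits */
  *out = (int32_t)r;    /* implementation-defined before C23; wraps on common compilers */
  return (int)(((ua ^ r) & (ub ^ r)) >> 31);
}

/* Signed sub: overflow iff a and b differ in sign and the result's sign
 * differs from a's. */
int ssub32_overflows(int32_t a, int32_t b, int32_t *out) {
  uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
  uint32_t r = ua - ub;
  *out = (int32_t)r;
  return (int)(((ua ^ ub) & (ua ^ r)) >> 31);
}

/* Unsigned mul: zero-extend both operands, take the full 64-bit product,
 * and report overflow when any of the upper 32 bits is set. */
int umul32_overflows(uint32_t a, uint32_t b, uint32_t *out) {
  uint64_t p = (uint64_t)a * (uint64_t)b;
  *out = (uint32_t)p;
  return (p >> 32) != 0;
}

/* Small self-check over boundary values. */
void overflow_sketch_selfcheck(void) {
  int32_t s;
  uint32_t u;
  assert(sadd32_overflows(INT32_MAX, 1, &s) == 1 && s == INT32_MIN);
  assert(ssub32_overflows(5, 7, &s) == 0 && s == -2);
  assert(umul32_overflows(0x10000u, 0x10000u, &u) == 1 && u == 0u);
}
```

Calling overflow_sketch_selfcheck() from a test driver exercises the three boundary cases. The sketch does not cover the 64-bit variants, which in the deleted code used MULH/MULHU for the high half.
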
diff --git a/src/arch/rv64/opt_coord.c b/src/arch/rv64/opt_coord.c
@@ -1,345 +0,0 @@
-/* rv64/opt_coord.c — opt/backend register coordination hooks. */
-
-#include "arch/rv64/internal.h"
-
-/* ============================================================
- * Static register tables reported to caller-owned allocators. */
-
-static const Reg rv_int_allocable[] = {20, 21, 22, 23, 24, 25, 26, 27};
-static const Reg rv_fp_allocable[] = {20, 21, 22, 23, 24, 25, 26, 27};
-
-static const Reg rv_int_scratch[] = {18, 19}; /* s2, s3; reserved by opt_emit */
-static const Reg rv_fp_scratch[] = {18,
-                                    19}; /* fs2, fs3; reserved by opt_emit */
-
-static const CGPhysRegInfo rv_int_phys[] = {
-    {10, RC_INT, 0,
-     CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_ARG | CG_REG_RET, 0, 0},
-    {11, RC_INT, 1,
-     CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_ARG | CG_REG_RET, 0, 0},
-    {12, RC_INT, 2, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_ARG, 0, 0},
-    {13, RC_INT, 3, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_ARG, 0, 0},
-    {14, RC_INT, 4, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_ARG, 0, 0},
-    {15, RC_INT, 5, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_ARG, 0, 0},
-    {16, RC_INT, 6, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_ARG, 0, 0},
-    {17, RC_INT, 7, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_ARG, 0, 0},
-    {20, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
-    {21, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
-    {22, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
-    {23, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
-    {24, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
-    {25, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
-    {26, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
-    {27, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
-    {29, RC_INT, 0xff,
-     CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_TEMP_PREFERRED, 0, 0},
-    {30, RC_INT, 0xff,
-     CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_TEMP_PREFERRED, 0, 0},
-    {31, RC_INT, 0xff,
-     CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_TEMP_PREFERRED, 0, 0},
-};
-static const CGPhysRegInfo rv_fp_phys[] = {
-    {10, RC_FP, 0,
-     CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_ARG | CG_REG_RET, 0, 0},
-    {11, RC_FP, 1,
-     CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_ARG | CG_REG_RET, 0, 0},
-    {12, RC_FP, 2, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_ARG, 0, 0},
-    {13, RC_FP, 3, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_ARG, 0, 0},
-    {14, RC_FP, 4, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_ARG, 0, 0},
-    {15, RC_FP, 5, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_ARG, 0, 0},
-    {16, RC_FP, 6, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_ARG, 0, 0},
-    {17, RC_FP, 7, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_ARG, 0, 0},
-    {20, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
-    {21, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
-    {22, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
-    {23, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
-    {24, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
-    {25, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
-    {26, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
-    {27, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
-    {28, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED, 0, 0},
-    {29, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED, 0, 0},
-    {30, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED, 0, 0},
-    {31, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED, 0, 0},
-};
-
-/* ============================================================
- * Vtable methods */
-
-static void rv_get_allocable_regs(CGTarget* t, RegClass cls, const Reg** out,
-                                  u32* nregs) {
-  (void)t;
-  switch (cls) {
-    case RC_INT:
-      *out = rv_int_allocable;
-      *nregs = sizeof rv_int_allocable / sizeof rv_int_allocable[0];
-      break;
-    case RC_FP:
-      *out = rv_fp_allocable;
-      *nregs = sizeof rv_fp_allocable / sizeof rv_fp_allocable[0];
-      break;
-    default:
-      *out = NULL;
-      *nregs = 0;
-      break;
-  }
-}
-
-static void rv_get_scratch_regs(CGTarget* t, RegClass cls, const Reg** out,
-                                u32* nregs) {
-  (void)t;
-  switch (cls) {
-    case RC_INT:
-      *out = rv_int_scratch;
-      *nregs = sizeof rv_int_scratch / sizeof rv_int_scratch[0];
-      break;
-    case RC_FP:
-      *out = rv_fp_scratch;
-      *nregs = sizeof rv_fp_scratch / sizeof rv_fp_scratch[0];
-      break;
-    default:
-      *out = NULL;
-      *nregs = 0;
-      break;
-  }
-}
-
-static void rv_get_phys_regs(CGTarget* t, RegClass cls,
-                             const CGPhysRegInfo** out, u32* nregs) {
-  (void)t;
-  switch (cls) {
-    case RC_INT:
-      *out = rv_int_phys;
-      *nregs = sizeof rv_int_phys / sizeof rv_int_phys[0];
-      break;
-    case RC_FP:
-      *out = rv_fp_phys;
-      *nregs = sizeof rv_fp_phys / sizeof rv_fp_phys[0];
-      break;
-    default:
-      *out = NULL;
-      *nregs = 0;
-      break;
-  }
-}
-
-static int rv_is_caller_saved(CGTarget* t, RegClass cls, Reg reg) {
-  (void)t;
-  switch (cls) {
-    case RC_INT:
-      /* RV64 psABI caller-saved: x5-x7, x10-x17, x28-x31 */
-      return (reg >= 5 && reg <= 7) || (reg >= 10 && reg <= 17) ||
-             (reg >= 28 && reg <= 31);
-    case RC_FP:
-      /* RV64 psABI caller-saved: f0-f7, f10-f17, f28-f31 */
-      return (reg >= 0 && reg <= 7) || (reg >= 10 && reg <= 17) ||
-             (reg >= 28 && reg <= 31);
-    default:
-      return 0;
-  }
-}
-
-static u32 rv_call_clobber_mask(CGTarget* t, const CGCallDesc* d,
-                                RegClass cls) {
-  (void)t;
-  (void)d;
-  u32 mask = 0;
-  if (cls == RC_INT || cls == RC_FP) {
-    for (u32 r = 5; r <= 7; ++r) mask |= 1u << r;
-    for (u32 r = 10; r <= 17; ++r) mask |= 1u << r;
-    for (u32 r = 28; r <= 31; ++r) mask |= 1u << r;
-  }
-  return mask;
-}
-
-static u32 rv_callee_save_mask(CGTarget* t, RegClass cls) {
-  (void)t;
-  u32 mask = 0;
-  if (cls == RC_INT || cls == RC_FP)
-    for (u32 r = 18; r <= 27; ++r) mask |= 1u << r;
-  return mask;
-}
-
-static u32 rv_return_reg_mask(CGTarget* t, const ABIFuncInfo* abi,
-                              RegClass cls) {
-  (void)t;
-  if (!abi || abi->ret.kind == ABI_ARG_IGNORE ||
-      abi->ret.kind == ABI_ARG_INDIRECT)
-    return 0;
-  u32 mask = 0, ni = 0, nf = 0;
-  for (u16 i = 0; i < abi->ret.nparts; ++i) {
-    const ABIArgPart* p = &abi->ret.parts[i];
-    if (cls == RC_INT && p->cls == ABI_CLASS_INT)
-      mask |= 1u << (RV_A0 + ni++);
-    else if (cls == RC_FP && p->cls == ABI_CLASS_FP)
-      mask |= 1u << (10u + nf++);
-  }
-  return mask;
-}
-
-static void rv_plan_call(CGTarget* t, const CGCallDesc* d, CGCallPlan* out) {
-  memset(out, 0, sizeof *out);
-  out->callee = d->callee;
-  out->flags = d->flags;
-  out->stack_arg_size = t->call_stack_size ? t->call_stack_size(t, d) : 0;
-  out->has_sret = d->abi && d->abi->has_sret;
-  out->is_variadic = d->abi && d->abi->variadic;
-  for (u32 c = 0; c < CG_CALL_PLAN_REG_CLASSES; ++c) {
-    out->clobber_mask[c] = rv_call_clobber_mask(t, d, (RegClass)c);
-    out->return_mask[c] = rv_return_reg_mask(t, d->abi, (RegClass)c);
-  }
-  u32 cap = d->nargs * 2u + 2u;
-  out->args = arena_zarray(t->c->tu, CGCallPlanMove, cap ? cap : 1u);
-  out->rets = arena_zarray(t->c->tu, CGCallPlanRet, 4);
-  u32 next_int = d->abi && d->abi->has_sret ? 1u : 0u, next_fp = 0, stack = 0;
-  /* Ordinary sret call: pass the destination address in a0. A tail call
-   * instead forwards the function's own incoming sret pointer (handled in
-   * rv_emit_call_plan), and ret.storage is the void sentinel, so skip it. */
-  if (d->abi && d->abi->has_sret && (d->flags & CG_CALL_TAIL) == 0) {
-    CGCallPlanMove* m = &out->args[out->nargs++];
-    m->src = d->ret.storage;
-    m->src_kind = CG_CALL_PLAN_SRC_ADDR;
-    m->dst_kind = CG_CALL_PLAN_REG;
-    m->cls = RC_INT;
-    m->dst_reg = RV_A0;
-    m->mem.type = d->ret.type;
-    m->mem.size = 8;
-    m->mem.align = 8;
-  }
-  for (u32 a = 0; a < d->nargs; ++a) {
-    const CGABIValue* av = &d->args[a];
-    const ABIArgInfo* ai = av->abi;
-    ABIArgInfo vai;
-    ABIArgPart vap;
-    if (!ai) {
-      memset(&vai, 0, sizeof vai);
-      memset(&vap, 0, sizeof vap);
-      vap.cls = ABI_CLASS_INT;
-      vap.size = type_byte_size(av->type);
-      vai.kind = ABI_ARG_DIRECT;
-      vai.nparts = 1;
-      vai.parts = &vap;
-      ai = &vai;
-    }
-    if (ai->kind == ABI_ARG_IGNORE) continue;
-    if (ai->kind == ABI_ARG_INDIRECT) {
-      CGCallPlanMove* m = &out->args[out->nargs++];
-      m->src = av->storage;
-      m->src_kind = CG_CALL_PLAN_SRC_ADDR;
-      m->cls = RC_INT;
-      if (next_int < 8) {
-        m->dst_kind = CG_CALL_PLAN_REG;
-        m->dst_reg = RV_A0 + next_int++;
-      } else {
-        m->dst_kind = CG_CALL_PLAN_STACK;
-        m->stack_offset = stack;
-        stack += 8;
-      }
-      m->mem.type = av->type;
-      m->mem.size = 8;
-      m->mem.align = 8;
-      continue;
-    }
-    for (u16 i = 0; i < ai->nparts; ++i) {
-      const ABIArgPart* p = &ai->parts[i];
-      CGCallPlanMove* m = &out->args[out->nargs++];
-      m->src = av->nparts ? av->parts[i].op : av->storage;
-      m->src_offset = av->nparts ? av->parts[i].src_offset : p->src_offset;
-      m->mem.type = av->type;
-      m->mem.size = p->size;
-      m->mem.align = p->align ? p->align : p->size;
-      if (p->cls == ABI_CLASS_FP) {
-        m->cls = RC_FP;
-        if (next_fp < 8) {
-          m->dst_kind = CG_CALL_PLAN_REG;
-          m->dst_reg = 10u + next_fp++;
-        } else {
-          m->dst_kind = CG_CALL_PLAN_STACK;
-          m->stack_offset = stack;
-          stack += 8;
-        }
-      } else {
-        m->cls = RC_INT;
-        if (next_int < 8) {
-          m->dst_kind = CG_CALL_PLAN_REG;
-          m->dst_reg = RV_A0 + next_int++;
-        } else {
-          m->dst_kind = CG_CALL_PLAN_STACK;
-          m->stack_offset = stack;
-          stack += 8;
-        }
-      }
-    }
-  }
-  if ((d->flags & CG_CALL_TAIL) == 0 && d->abi &&
-      d->abi->ret.kind != ABI_ARG_IGNORE &&
-      d->abi->ret.kind != ABI_ARG_INDIRECT) {
-    u32 ni = 0, nf = 0;
-    for (u16 i = 0; i < d->abi->ret.nparts; ++i) {
-      const ABIArgPart* p = &d->abi->ret.parts[i];
-      CGCallPlanRet* r = &out->rets[out->nrets++];
-      r->dst = d->ret.storage;
-      r->dst_offset = p->src_offset;
-      r->mem.type = d->ret.type;
-      r->mem.size = p->size;
-      r->mem.align = p->align ? p->align : p->size;
-      if (p->cls == ABI_CLASS_FP) {
-        r->cls = RC_FP;
-        r->src_reg = 10u + nf++;
-      } else {
-        r->cls = RC_INT;
-        r->src_reg = RV_A0 + ni++;
-      }
-    }
-  }
-}
-
-static void rv_reserve_hard_regs(CGTarget* t, RegClass cls, const Reg* regs,
-                                 u32 n) {
-  RImpl* a = impl_of(t);
-  for (u32 i = 0; i < n; ++i) {
-    Reg r = regs[i];
-    switch (cls) {
-      case RC_INT:
-        if (r >= 18u && r <= 27u) a->used_cs_int_mask |= 1u << r;
-        break;
-      case RC_FP:
-        if (r >= 18u && r <= 27u) a->used_cs_fp_mask |= 1u << r;
-        break;
-      default:
-        break;
-    }
-  }
-}
-
-static void rv_plan_hard_regs(CGTarget* t, RegClass cls, const Reg* regs,
-                              u32 n) {
-  RImpl* a = impl_of(t);
-  a->has_planned_regs = 1;
-  for (u32 i = 0; i < n; ++i) {
-    Reg r = regs[i];
-    switch (cls) {
-      case RC_INT:
-        if (r >= 18u && r <= 27u) a->planned_cs_int_mask |= 1u << r;
-        break;
-      case RC_FP:
-        if (r >= 18u && r <= 27u) a->planned_cs_fp_mask |= 1u << r;
-        break;
-      default:
-        break;
-    }
-  }
-}
-
-void rv_coord_vtable_init(CGTarget* t) {
-  t->get_allocable_regs = rv_get_allocable_regs;
-  t->get_phys_regs = rv_get_phys_regs;
-  t->get_scratch_regs = rv_get_scratch_regs;
-  t->is_caller_saved = rv_is_caller_saved;
-  t->call_clobber_mask = rv_call_clobber_mask;
-  t->return_reg_mask = rv_return_reg_mask;
-  t->callee_save_mask = rv_callee_save_mask;
-  t->plan_call = rv_plan_call;
-  t->plan_hard_regs = rv_plan_hard_regs;
-  t->reserve_hard_regs = rv_reserve_hard_regs;
-}
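
Note on the register masks removed above: rv_call_clobber_mask and rv_callee_save_mask build 32-bit masks from fixed register ranges (x5-x7, x10-x17, x28-x31 as call-clobbered; x18-x27 as callee-saved). The following is a minimal sketch of that construction for the integer class only. The function names are hypothetical. It mirrors the ranges in the deleted code, which leave s0/s1 (x8/x9) out of the callee-save mask even though the rv_asm_block comment lists them as callee-saved.

```c
#include <stdint.h>

/* Hypothetical helper: set bits lo..hi inclusive, with 0 <= lo <= hi <= 31. */
uint32_t reg_range_mask(unsigned lo, unsigned hi) {
  uint32_t m = 0;
  for (unsigned r = lo; r <= hi; ++r) m |= UINT32_C(1) << r;
  return m;
}

/* Integer registers treated as clobbered across a call in the deleted code:
 * t0-t2 (x5-x7), a0-a7 (x10-x17), t3-t6 (x28-x31). */
uint32_t int_call_clobber_mask(void) {
  return reg_range_mask(5, 7) | reg_range_mask(10, 17) |
         reg_range_mask(28, 31);
}

/* Integer callee-saved range used by the deleted code: s2-s11 (x18-x27). */
uint32_t int_callee_save_mask(void) {
  return reg_range_mask(18, 27);
}
```

The two masks are disjoint by construction, which is the property an allocator relying on them would need: a register is never both preserved by the callee and clobbered by the call.
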
diff --git a/src/arch/rv64/rv64.h b/src/arch/rv64/rv64.h
@@ -2,7 +2,16 @@
 #define CFREE_ARCH_RV64_H
 
 #include "arch/mc.h"
+#include "arch/native_target.h"
 
-CGTarget* rv64_cgtarget_new(Compiler*, ObjBuilder*, MCEmitter*);
+typedef struct NativeOps NativeOps;
+
+NativeTarget* rv64_native_target_new(Compiler*, ObjBuilder*, MCEmitter*);
+const NativeOps* rv64_native_direct_ops(void);
+
+/* Shared low-level word emitters, defined in native.c and used by the
+ * standalone assembler (asm.c). */
+void rv64_emit32(MCEmitter* mc, u32 word);
+void rv64_emit16(MCEmitter* mc, u32 halfword);
 
 #endif
diff --git a/test/toy/run.sh b/test/toy/run.sh
@@ -296,14 +296,14 @@ run_case_cross_one() {
     fi
 
     chmod +x "$exe" 2>/dev/null || true
-    exec_target_run "$tag" "$exe" "$out" "$err"
-    rc=$RUN_RC
-    if [ -s "$err" ]; then
-        grep -v '^WARNING: image platform .* does not match the expected platform' \
-            "$err" > "$err.clean" || true
-        mv "$err.clean" "$err"
-    fi
-    check_rc "$label" "$rc" "$expected" "$err"
+    # Defer execution: queue for one batched container run per arch (drained
+    # by exec_target_flush after every case is compiled+linked). The
+    # platform-mismatch warning is filtered and the rc checked post-flush.
+    XQ_LABELS+=("$label")
+    XQ_EXPECTED+=("$expected")
+    XQ_ERRS+=("$err")
+    XQ_RCS+=("$work/$arch.rc")
+    exec_target_queue "$tag" "$name" "$exe" "$out" "$err" "$work/$arch.rc"
 }
 
 run_case_cross() {
@@ -424,6 +424,14 @@ if [ $RUN_X -eq 1 ]; then
     export have_qemu QEMU_BIN have_podman is_aarch64
     # shellcheck source=../lib/exec_target.sh
     source "$ROOT/test/lib/exec_target.sh"
+    # Every queued exe/out/err/rc path lives under BUILD_DIR; bind-mount it
+    # once so a single batched container drains the whole path-X queue.
+    EXEC_TARGET_MOUNT_ROOT="$BUILD_DIR"
+    # Deferred path-X bookkeeping, checked after exec_target_flush.
+    XQ_LABELS=()
+    XQ_EXPECTED=()
+    XQ_ERRS=()
+    XQ_RCS=()
 fi
 
 shopt -s nullglob
@@ -466,6 +474,34 @@ for src in "${cases[@]}"; do
     done
 done
 
+# Drain the path-X queue in a single batched container per target arch, then
+# check each deferred case's exit code (the cross-compile + link already
+# passed/failed inline above; only the exec was deferred).
+if [ $RUN_X -eq 1 ] && [ "$(exec_target_queue_size)" -gt 0 ]; then
+    printf 'Running path X (%d cases batched)...\n' "$(exec_target_queue_size)"
+    exec_target_flush
+    xi=0
+    xn=${#XQ_LABELS[@]}
+    while [ $xi -lt "$xn" ]; do
+        xlabel="${XQ_LABELS[$xi]}"
+        xexp="${XQ_EXPECTED[$xi]}"
+        xerr="${XQ_ERRS[$xi]}"
+        xrcf="${XQ_RCS[$xi]}"
+        if [ -s "$xerr" ]; then
+            grep -v '^WARNING: image platform .* does not match the expected platform' \
+                "$xerr" > "$xerr.clean" 2>/dev/null || true
+            mv "$xerr.clean" "$xerr" 2>/dev/null || true
+        fi
+        if [ -f "$xrcf" ]; then
+            xrc="$(cat "$xrcf")"
+        else
+            xrc=127
+        fi
+        check_rc "$xlabel" "$xrc" "$xexp" "$xerr"
+        xi=$((xi+1))
+    done
+fi
+
 # err cases exercise compile-failure paths; they aren't relevant to path C
 # (which goes through the same cc invocation). Only run them when at least
 # one of the native compile paths (R/L/X) is enabled.

A  doc/NATIVE_PORT_RV64.md           | 3647 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A  doc/OPT_O0_NATIVE_DIRECT_NOTES.md |  186 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A  doc/OPT_O0_PERF_NOTES.md          |  168 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M  include/cfree/config.h            |    2 +-
A  scripts/toy_cross_batch.sh        |  113 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
D  src/arch/rv64/alloc.c             |  589 -------------------------------------------------------------------------------
M  src/arch/rv64/arch.c              |   32 ++++++++++++++++++++++++++------
M  src/arch/rv64/asm.c               |   35 ++++++++++++++++++-----------------
M  src/arch/rv64/asm.h               |   13 +++++++++++++
D  src/arch/rv64/emit.c              |  631 -------------------------------------------------------------------------------
D  src/arch/rv64/internal.h          |  189 -------------------------------------------------------------------------------
A  src/arch/rv64/native.c            | 3458 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
D  src/arch/rv64/ops.c               | 2699 -------------------------------------------------------------------------------
D  src/arch/rv64/opt_coord.c         |  345 -------------------------------------------------------------------------------
M  src/arch/rv64/rv64.h              |   11 ++++++++++-
M  test/toy/run.sh                   |   52 ++++++++++++++++++++++++++++++++++++++++++++--------