Arch-Backend Completeness (planned work)

This roadmap consolidates the remaining native-backend work across the three machine-code targets (aa64, x64, rv64). The bulk of the NativeTarget port -- the single-pass (-O0) path and the known-frame (-O1) path for all three arches, plus the asm/disasm/link-reloc/dwarf matrix -- is already in tree and is treated here as the baseline, not as planned work. What follows is the genuinely-open follow-up: per-arch hooks where x64/rv64 still trail the aa64 reference, prologue/epilogue and tail-call cost-model parity, and a small set of niche asm/disasm and debugger gaps. The backend abstraction and ABI layer this work sits behind are documented in the design set: ../CODEGEN.md, ../OPT.md, ../ASM.md, ../DWARF.md.

Baseline (done -- context, not planned work)

All three backends implement the full NativeTarget vtable on both the single-pass (NativeDirectTarget) and known-frame (func_begin_known_frame) paths: prologue/epilogue, bind_param, frame slots, calls/returns, atomics, variadics, inline/file-scope asm, TLS (Local-Exec), and intrinsics.
x64 carries both ABIs (SysV + Win64), including shadow space, callee-saved XMMs, the __chkstk large-frame probe, and the SysV 176-byte variadic register-save area. rv64 covers LP64D with the s0-anchored frame and Zba/Zbb use where available.
asm/disasm/link-reloc/dwarf parity across the OS matrix (ELF/COFF/Mach-O as applicable) is in place: rv64 relocation emission, the x64 .eh_frame RBP DWARF-reg fix, the aa64 FP/SIMD and x64 SSE disasm rows, named params/locals in x64 and rv64 DWARF, and the shared LEB128 / .comm assembler directives.
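
The 176-byte figure for the SysV variadic register-save area follows from the ABI's argument-register counts. The arithmetic is spelled out below as a small sketch; the enum names are hypothetical and are not identifiers from the tree.

```c
#include <assert.h>

/* Hypothetical names; the counts and slot sizes are the SysV x86-64
 * argument-register conventions. */
enum {
    SYSV_GP_ARG_REGS   = 6,  /* rdi, rsi, rdx, rcx, r8, r9 */
    SYSV_FP_ARG_REGS   = 8,  /* xmm0 .. xmm7 */
    SYSV_GP_SLOT_BYTES = 8,
    SYSV_FP_SLOT_BYTES = 16,

    /* Byte offset where the FP slots start inside the save area. */
    SYSV_FP_AREA_START = SYSV_GP_ARG_REGS * SYSV_GP_SLOT_BYTES,

    /* 6 * 8 + 8 * 16 = 176 */
    SYSV_REG_SAVE_AREA_BYTES =
        SYSV_FP_AREA_START + SYSV_FP_ARG_REGS * SYSV_FP_SLOT_BYTES
};

static_assert(SYSV_REG_SAVE_AREA_BYTES == 176,
              "SysV variadic register-save area is 176 bytes");
```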

The items below are what is not yet at aa64 parity.

1. Tail-call realization on x64 and rv64 (blocker-check removal)

aa64 realizes sibling (tail) calls whenever the outgoing stack-argument area fits the caller's incoming parameter window -- aa_no_tail returns a blocker only when that size check fails. x64 and rv64 apply the same size check (x64 even accounts for the shadow-space prefix), and the restore-before-jump machinery already exists on both: x64_emit_tail_site / rv_emit_tail_site emit the callee-save restore and frame teardown ahead of the tail jump exactly the way aa64 does. What remains is conservatism in the realizability gate -- x64_no_tail and rv_no_tail still bail out with "callee-saved registers in use" whenever the function uses any callee-saved register (frame.ncallee_saves != 0), so those functions fall back to a normal call + return even though the tail site could handle them.

This is the single largest aa64-vs-rest divergence and matters most for the recursion-heavy / interpreter-dispatch workloads that the O(1)-tail-call work targets (see the interpreter and toy musttail tracks).

Remove the ncallee_saves guard from x64_no_tail / rv_no_tail so the size check alone gates realizability, letting the existing tail-site restore sequences run for callee-saves-live functions.
x64: confirm the existing restore ordering interacts correctly with the SysV vs Win64 callee-save sets (Win64 also saves XMM6-15) and with a forwarded sret pointer before lifting the guard.
rv64: confirm the s2-s11 / fs2-fs11 and s0/ra restore-then-jr sequence holds for the previously-blocked frames once the guard is gone.
Extend the tail-call test corpus to cover the callee-saves-live case on x64 and rv64, since those paths were unexercised while the guard masked them.
Win64 FP-arg tail-call interaction is noted as a deferred sub-case in the port notes; validate or document the restriction explicitly.
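
The gate change in the first item can be sketched as follows. This is a minimal illustration under stated assumptions, not the in-tree x64_no_tail / rv_no_tail code: TailFrameInfo, tail_blocker, and every field other than ncallee_saves are hypothetical names, and the x64 shadow-space accounting is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical digest of what the gate inspects; only ncallee_saves is a
 * field name taken from the roadmap text. */
typedef struct {
    uint32_t incoming_stack_bytes; /* caller's incoming parameter window */
    uint32_t outgoing_stack_bytes; /* stack args the tail callee needs   */
    uint32_t ncallee_saves;        /* callee-saved registers in use      */
} TailFrameInfo;

/* Returns NULL when the sibling call is realizable, else a blocker string.
 * keep_callee_save_guard = true  models the current x64/rv64 gate;
 * keep_callee_save_guard = false models the aa64-style gate, where the
 * size check alone decides. */
static const char *tail_blocker(const TailFrameInfo *f,
                                bool keep_callee_save_guard)
{
    assert(f != NULL);
    if (f->outgoing_stack_bytes > f->incoming_stack_bytes)
        return "outgoing stack args exceed incoming parameter window";
    if (keep_callee_save_guard && f->ncallee_saves != 0)
        return "callee-saved registers in use";
    return NULL;
}
```

A function with two callee-saves live and a fitting argument area is blocked with the guard and realizable without it, which is exactly the case the extended test corpus needs to exercise.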

2. Prologue / epilogue cost-model parity (per-call overhead)

The fixed per-call overhead -- prologue + epilogue + arg setup, independent of the body -- is the dominant cost on call-heavy code. aa64 picks one of four frame shapes per function to minimize it, and x64 and rv64 now select a cheaper known-frame shape too (see Done below). The design rationale lives in ../ARCH.md; the aa64 measurements and the remaining body-level warts are tracked alongside ../OPT.md and OPTIMIZER.md.

aa64 tiers (baseline, for reference):

tier	when	fixed insns
`slim_prologue` (Tier A)	no callee-saves, no alloca, no body slots, no outgoing stack	3 (optimal)
`fp_at_bottom`	>=1 callee-save/body slot, no outgoing stack args, frame <= 504	5 (optimal)
`slim_small_frame`	as above but with outgoing stack args	7
fat	large frame / alloca / big saved-pair offset	7+
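
Read as a decision procedure, the table above might look like the sketch below. The type and field names are hypothetical, the order of the checks is an assumption, and the "big saved-pair offset" condition is approximated by the same 504-byte bound; the real selection logic may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    TIER_SLIM_PROLOGUE,    /* Tier A */
    TIER_FP_AT_BOTTOM,
    TIER_SLIM_SMALL_FRAME,
    TIER_FAT
} Aa64Tier;

/* Hypothetical digest of the frame plan. */
typedef struct {
    uint32_t ncallee_saves;
    uint32_t nbody_slots;
    uint32_t outgoing_stack_bytes;
    uint32_t frame_size;
    bool     uses_alloca;
} Aa64FrameFacts;

static Aa64Tier aa64_pick_tier(const Aa64FrameFacts *f)
{
    assert(f != NULL);
    /* fat: alloca or a frame past the 504-byte bound (assumed to also
     * stand in for the "big saved-pair offset" case). */
    if (f->uses_alloca || f->frame_size > 504)
        return TIER_FAT;
    /* Outgoing stack args rule out both cheaper shapes. */
    if (f->outgoing_stack_bytes != 0)
        return TIER_SLIM_SMALL_FRAME;
    /* Nothing to save and nothing to store: the 3-instruction shape. */
    if (f->ncallee_saves == 0 && f->nbody_slots == 0)
        return TIER_SLIM_PROLOGUE;
    return TIER_FP_AT_BOTTOM;
}
```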

The known-frame asymmetry (bottom-record only on the -O1 path) is intentional: the frame-size-dependent offsets require the frame to be final before the body, which only the optimizer's frame planner guarantees.

Leaf-ness is surfaced to the backends through NativeKnownFrameDesc.is_leaf (set in plan_frame, pass_native_emit.c, as "no IR_CALL of any kind -- regular or sibling/tail"). A leaf never clobbers the return-address register or the stack below sp, which is what unlocks the no-frame / red-zone shapes below.
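
The leaf test amounts to a single scan for call-like operations. A minimal sketch, assuming a toy opcode enum: the text names only IR_CALL, so the separate tail-call opcode here is an illustrative way to model "regular or sibling/tail", not the actual IR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical opcode set for illustration only. */
typedef enum {
    SK_OP_ADD, SK_OP_LOAD, SK_OP_STORE,
    SK_OP_CALL, SK_OP_TAIL_CALL, SK_OP_RET
} SketchOp;

/* A function is a leaf when it contains no call of any kind. */
static bool sketch_is_leaf(const SketchOp *ops, size_t n)
{
    assert(ops != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (ops[i] == SK_OP_CALL || ops[i] == SK_OP_TAIL_CALL)
            return false;
    }
    return true;
}
```

Counting tail calls as calls matters: a function that ends in a sibling jump still needs its frame torn down at the tail site, so it cannot use the no-frame shapes.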

Done:

x64 slim + red-zone tiers (x64_func_begin_known_frame). Two known-frame shapes, both keeping the push rbp; mov rbp,rsp record (so the leave epilogue, the CFA = rbp+16 CFI, and every rbp-relative offset are unchanged) and only dropping the sub rsp reservation:
- slim_frame -- empty frame (no callee-saves, no body slots, no outgoing args, no alloca). Safe for non-leaves too: push rbp keeps rsp 16-aligned for any register-only call, and nothing lives below rsp. SysV + Win64.
- redzone_leaf -- SysV leaf with a small frame (is_leaf, no alloca, no outgoing args, frame_size <= 128). Locals/callee-saves stay at their rbp-relative offsets, which now land in the 128-byte red zone. Leaf-only, since any call would clobber the red zone; Win64 (no red zone) is excluded by the shadow_space == 0 gate.
No x64 fold tier: push rbp already folds the sp-move into the store, so there is no aa64-fp_at_bottom-style win to capture.
rv64 leaf tier (rv_func_begin_known_frame, slim_prologue). A leaf with no callee-saves, no body slots, no outgoing args, no sret/variadic and register-only params (signature_stack_bytes == 0) neither reads s0 nor clobbers ra (both are reserved, never allocable), so it emits no prologue and a bare ret -- the whole frame setup/teardown is elided (~8 insns/leaf). CFI is def_cfa(sp, 0), matching the CIE default (ra stays live in its register).
rv64 frame fold: intentionally not ported. Porting aa64's fp_at_bottom to rv64 was measured at a zero instruction win: RISC-V has no pre/post-indexed store, so moving the saved s0/ra pair to the bottom still needs a separate addi sp,sp,-N plus the sd/addi s0,sp -- the same four instructions as the top-record shape. The fold only relocates data and removes no instruction, so it was skipped rather than adding a fold-aware offset-helper layer for no benefit, per the "quantify the win before committing" guidance that previously stood here. The rv64 leaf tier above is the real rv64 win.
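
The x64 and rv64 gates listed under Done can be summarized as two predicates. This is a sketch, not the code in x64_func_begin_known_frame or rv_func_begin_known_frame: FrameFacts and the function names are hypothetical, only is_leaf, frame_size, shadow_space and signature_stack_bytes are names from the text, and the alloca exclusion on rv64 is an extra assumption the text does not state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical digest of the planner's frame description. */
typedef struct {
    bool     is_leaf, uses_alloca, has_sret, is_variadic;
    uint32_t ncallee_saves, nbody_slots, outgoing_stack_bytes;
    uint32_t frame_size, shadow_space, signature_stack_bytes;
} FrameFacts;

typedef enum { X64_FULL_FRAME, X64_SLIM_FRAME, X64_REDZONE_LEAF } X64Shape;

static X64Shape x64_pick_shape(const FrameFacts *f)
{
    assert(f != NULL);
    /* Both cheap shapes require no alloca and no outgoing stack args. */
    if (f->uses_alloca || f->outgoing_stack_bytes != 0)
        return X64_FULL_FRAME;
    /* slim_frame: empty frame, valid for leaves and non-leaves alike. */
    if (f->ncallee_saves == 0 && f->nbody_slots == 0)
        return X64_SLIM_FRAME;
    /* redzone_leaf: SysV only (shadow_space == 0), leaf, frame <= 128. */
    if (f->is_leaf && f->shadow_space == 0 && f->frame_size <= 128)
        return X64_REDZONE_LEAF;
    return X64_FULL_FRAME;
}

/* True when the rv64 leaf tier may drop the prologue and emit a bare ret. */
static bool rv_leaf_elides_frame(const FrameFacts *f)
{
    assert(f != NULL);
    return f->is_leaf && !f->has_sret && !f->is_variadic &&
           !f->uses_alloca /* assumption of this sketch */ &&
           f->ncallee_saves == 0 && f->nbody_slots == 0 &&
           f->outgoing_stack_bytes == 0 && f->signature_stack_bytes == 0;
}
```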

Still open:

Cost-model alignment. signature_stack_bytes / call_stack_bytes are the shared hooks the optimizer uses to size the outgoing area and gate tail-call realizability; they exist on all three. As the tail-call paths (section 1) land, verify the optimizer's per-call cost estimates reflect the cheaper shapes so frame/spill decisions stay consistent across arches.

Body-level per-call warts from the aa64 study that are arch-shared and still open:

Redundant branch chain. An if/else merge can emit b A; A: b B -- a branch to a label that itself just branches unconditionally onward. cleanup_layout_fallthrough_branches in the jump pass does not yet thread this shape; this is an optimizer pass fix, surfaced per-arch at the call site.
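
Threading the b A; A: b B shape means retargeting a branch past any block whose only content is an unconditional branch. A minimal sketch of that lookup follows; SketchBlock and thread_target are hypothetical and do not reflect how the jump pass represents blocks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical block summary: a "trampoline" block contains nothing but
 * an unconditional branch to `target`. */
typedef struct {
    bool   only_uncond_branch;
    size_t target; /* meaningful only when only_uncond_branch is set */
} SketchBlock;

/* Follow trampoline blocks from `label` to the final destination.
 * The hop limit bounds the walk so a cycle (A: b B; B: b A) terminates. */
static size_t thread_target(const SketchBlock *blocks, size_t nblocks,
                            size_t label)
{
    assert(blocks != NULL && label < nblocks);
    for (size_t hops = 0;
         hops < nblocks && blocks[label].only_uncond_branch;
         hops++) {
        assert(blocks[label].target < nblocks);
        label = blocks[label].target;
    }
    return label;
}
```

A pass would rewrite each branch operand to thread_target of its current label; trampoline blocks that lose all predecessors can then be dropped by ordinary dead-block cleanup.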

3. x64 debugger step-out / unwind

kit_dwarf_unwind_step has no memory provider, and x64 (unlike aa64/rv64, which have a link register) has no link-register fallback, so step-out cannot recover the return address from the stack. Compounding this, the JIT debugger does not populate .eh_frame for in-process images.

Add a memory-reading unwind variant so the unwinder can read the saved RA / RBP from the stack on x64.
Populate .eh_frame (or an equivalent CFI source) for JIT in-process images so the debugger has unwind data to consume.
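
One possible shape for the memory-reading variant is a caller-supplied read callback. The sketch below hard-codes the push rbp; mov rbp,rsp record (CFA = rbp + 16) rather than evaluating CFI rules, and every name in it is hypothetical; it is not the signature of kit_dwarf_unwind_step. The array-backed provider exists only to make the example self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical memory provider: copy `len` bytes at debuggee address
 * `addr` into `out`; return false if the range is unreadable. */
typedef bool (*SketchReadMem)(void *ctx, uint64_t addr, void *out, size_t len);

typedef struct { uint64_t pc, sp, fp; } SketchRegs;

/* One step-out for a frame with the standard rbp record:
 * return address at CFA - 8, caller's rbp at CFA - 16. */
static bool sketch_unwind_step_x64(SketchRegs *r, SketchReadMem read_mem,
                                   void *ctx)
{
    assert(r != NULL && read_mem != NULL);
    uint64_t cfa = r->fp + 16, ra, saved_fp;
    if (!read_mem(ctx, cfa - 8, &ra, sizeof ra))
        return false;
    if (!read_mem(ctx, cfa - 16, &saved_fp, sizeof saved_fp))
        return false;
    r->pc = ra;
    r->sp = cfa;
    r->fp = saved_fp;
    return true;
}

/* Example provider backed by an array of 64-bit stack words. */
typedef struct {
    uint64_t        base;
    const uint64_t *words;
    size_t          nwords;
} SketchStack;

static bool sketch_stack_read(void *ctx, uint64_t addr, void *out, size_t len)
{
    const SketchStack *s = ctx;
    if (len != 8 || addr < s->base || (addr - s->base) % 8 != 0)
        return false;
    uint64_t i = (addr - s->base) / 8;
    if (i >= s->nwords)
        return false;
    memcpy(out, &s->words[i], len);
    return true;
}
```

Keeping the provider behind a callback lets the same step function serve an in-process JIT image and, later, an out-of-process target.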

This is a debugging-UX robustness item with test-infra dependencies; see ../DBG.md and ../DWARF.md. Sibling debugger roadmap: DEBUG.md.

4. Niche assembler / disassembler gaps

These are in the standalone as / inline-asm() encode-decode paths only. The compiler's codegen emits machine code directly and never routes through the text assembler, and the shipped runtime .s/.S files don't use these forms, so none of this blocks any build. They are GNU-as / llvm-mc parity gaps for hand-written assembly. Design context: ../ASM.md.

aa64 atomics, remaining encode forms. CASP, the LSE min/max family (ldsmax/ldsmin/ldumax/ldumin), and LDAPR/STLLR are not yet encoded.
aa64 disasm rows for the new encode-only forms. The recently-added exclusive/LSE atomics, register-offset, and writeback load/store forms encode correctly but have no decode rows, so a round-trip currently renders them as .inst. Add the matching disasm rows.
TLS relocation modifiers in operands. :tprel_*: (aa64) and %tls_* (rv64) operand syntax is not yet accepted; the non-TLS modifiers (:lo12:/:got:, %hi/%lo/%pcrel_*, x64 @PLT/@GOTPCREL) are done.
.L-prefixed local-label spellings in operand references. Plain labels work (including as the %pcrel_lo anchor); the .L-prefixed spelling in an operand position is a shared-lexer change.

5. Cross-cutting hygiene

Keep the three backends converging on the shared NativeFrame / native_argmove (parallel-copy shuffle) scaffolding rather than re-implementing per-arch; new fold tiers and tail-call paths should reuse it.
As each gap above closes, prefer locking it in with a targeted corpus case (per-arch, per-form) over broad sweeps, per the testing guidance in ../TESTING.md.
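
For readers unfamiliar with the parallel-copy shuffle that native_argmove provides, the generic technique is sketched below: emit any move whose destination is no longer read, and break a remaining cycle by parking one value in a scratch register. This is a textbook illustration with hypothetical names, not the native_argmove implementation; it assumes each destination appears at most once and that the scratch register is not used by any move.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { int dst, src; } SketchMove;

/* Orders the parallel moves in `pending` (modified in place) into a
 * sequence written to `out`, which must hold at least 2 * n entries.
 * Returns the number of moves emitted. */
static size_t sequence_moves(SketchMove *pending, size_t n, int scratch,
                             SketchMove *out)
{
    assert(pending != NULL && out != NULL);
    size_t live = 0, nout = 0;
    for (size_t i = 0; i < n; i++) {
        assert(pending[i].dst != scratch && pending[i].src != scratch);
        if (pending[i].dst != pending[i].src) /* drop no-op self moves */
            pending[live++] = pending[i];
    }
    while (live > 0) {
        bool progressed = false;
        for (size_t i = 0; i < live && !progressed; i++) {
            bool dst_still_read = false;
            for (size_t j = 0; j < live; j++)
                if (j != i && pending[j].src == pending[i].dst)
                    dst_still_read = true;
            if (!dst_still_read) {
                out[nout++] = pending[i];     /* safe to overwrite dst */
                pending[i] = pending[--live]; /* remove from the set   */
                progressed = true;
            }
        }
        if (!progressed) {
            /* Only cycles remain: save one destination's current value
             * in scratch and redirect its reader to scratch. */
            int d = pending[0].dst;
            out[nout++] = (SketchMove){ scratch, d };
            for (size_t j = 0; j < live; j++)
                if (pending[j].src == d)
                    pending[j].src = scratch;
        }
    }
    return nout;
}
```

Each cycle costs exactly one extra move, which is why a shared implementation is worth reusing: the per-arch code then only has to name a scratch register and emit the resulting sequence.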
