Arch-Backend Completeness (planned work)
This roadmap consolidates the remaining native-backend work across the three machine-code targets (aa64, x64, rv64). The bulk of the NativeTarget port -- the single-pass (-O0) path and the known-frame (-O1) path for all three arches, plus the asm/disasm/link-reloc/dwarf matrix -- is already in tree and is treated here as the baseline, not as planned work. What follows is the genuinely-open follow-up: per-arch hooks where x64/rv64 still trail the aa64 reference, prologue/epilogue and tail-call cost-model parity, and a small set of niche asm/disasm and debugger gaps. The backend abstraction and ABI layer this work sits behind are documented in the design set: ../CODEGEN.md, ../OPT.md, ../ASM.md, ../DWARF.md.
Baseline (done -- context, not planned work)
- All three backends implement the full NativeTarget vtable on both the single-pass (NativeDirectTarget) and known-frame (func_begin_known_frame) paths: prologue/epilogue, bind_param, frame slots, calls/returns, atomics, variadics, inline/file-scope asm, TLS (Local-Exec), and intrinsics.
- x64 carries both ABIs (SysV + Win64), including shadow space, callee-saved XMMs, the __chkstk large-frame probe, and the SysV 176-byte variadic register-save area. rv64 covers LP64D with the s0-anchored frame and Zba/Zbb use where available.
- asm/disasm/link-reloc/dwarf parity across the OS matrix (ELF/COFF/Mach-O as applicable) is in place: rv64 relocation emission, the x64 .eh_frame RBP DWARF-reg fix, the aa64 FP/SIMD and x64 SSE disasm rows, named params/locals in x64 and rv64 DWARF, and the shared LEB128 / .comm assembler directives.
The items below are what is not yet at aa64 parity.
1. Tail-call realization on x64 and rv64 (blocker-check removal)
aa64 realizes sibling (tail) calls whenever the outgoing stack-argument area fits
the caller's incoming parameter window -- aa_no_tail returns a blocker only when
that size check fails. x64 and rv64 apply the same size check (x64 even
accounts for the shadow-space prefix), and the restore-before-jump machinery
already exists on both: x64_emit_tail_site / rv_emit_tail_site emit the
callee-save restore and frame teardown ahead of the tail jump exactly the way
aa64 does. What remains is conservatism in the realizability gate: x64_no_tail
and rv_no_tail still bail out with "callee-saved registers in use" whenever
the function uses any callee-saved register (frame.ncallee_saves != 0), so those
functions fall back to a normal call + return even though the tail site could
handle them.
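
The gate described above can be sketched roughly as follows. This is a minimal illustration, not the in-tree code: SketchFrame, sketch_no_tail, the strict_guard flag, and the wording of the size-check blocker are all hypothetical. Only the "callee-saved registers in use" string and the ncallee_saves field name come from the text, and the x64 shadow-space prefix is not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of the frame state the gate inspects. */
typedef struct {
    unsigned incoming_stack_bytes; /* caller's incoming parameter window */
    unsigned ncallee_saves;        /* callee-saved registers in use */
} SketchFrame;

/* Returns NULL when the sibling call is realizable, else a blocker string.
 * strict_guard != 0 models the current x64/rv64 gate; 0 models aa64. */
static const char *sketch_no_tail(const SketchFrame *f,
                                  unsigned outgoing_stack_bytes,
                                  int strict_guard)
{
    assert(f != NULL);
    if (outgoing_stack_bytes > f->incoming_stack_bytes)
        return "outgoing stack args exceed incoming window"; /* wording assumed */
    if (strict_guard && f->ncallee_saves != 0)
        return "callee-saved registers in use";
    return NULL;
}
```

With strict_guard cleared, a function that uses callee-saved registers but passes the size check is no longer blocked, which is the behavior change this section proposes for x64 and rv64.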
This is the single largest aa64-vs-rest divergence and matters most for the
recursion-heavy / interpreter-dispatch workloads that the O(1)-tail-call work
targets (see the interpreter and toy musttail tracks).
- Remove the ncallee_saves guard from x64_no_tail / rv_no_tail so the size check alone gates realizability, letting the existing tail-site restore sequences run for callee-saves-live functions.
- x64: confirm the existing restore ordering interacts correctly with the SysV vs Win64 callee-save sets (Win64 also saves XMM6-15) and with a forwarded sret pointer before lifting the guard.
- rv64: confirm the s2-s11 / fs2-fs11 and s0/ra restore-then-jr sequence holds for the previously-blocked frames once the guard is gone.
- Extend the tail-call test corpus to cover the callee-saves-live case on x64 and rv64, since those paths were unexercised while the guard masked them.
- Win64 FP-arg tail-call interaction is noted as a deferred sub-case in the port notes; validate or document the restriction explicitly.
2. Prologue / epilogue cost-model parity (per-call overhead)
The fixed per-call overhead -- prologue + epilogue + arg setup, independent of the body -- is the dominant cost on call-heavy code. aa64 picks one of four frame shapes per function to minimize it. x64 and rv64 now select a cheaper known-frame shape too (see Done below). The design rationale lives in ../ARCH.md; the aa64 measurements and the remaining body-level warts are tracked alongside ../OPT.md and OPTIMIZER.md.
aa64 tiers (baseline, for reference):
| tier | when | fixed insns |
|---|---|---|
| slim_prologue (Tier A) | no callee-saves, no alloca, no body slots, no outgoing stack | 3 (optimal) |
| fp_at_bottom | >=1 callee-save/body slot, no outgoing stack args, frame <= 504 | 5 (optimal) |
| slim_small_frame | as above but with outgoing stack args | 7 |
| fat | large frame / alloca / big saved-pair offset | 7+ |
The known-frame asymmetry (bottom-record only on the -O1 path) is intentional: the frame-size-dependent offsets require the frame to be final before the body, which only the optimizer's frame planner guarantees.
Leaf-ness is surfaced to the backends through NativeKnownFrameDesc.is_leaf
(set in plan_frame, pass_native_emit.c, as "no IR_CALL of any kind --
regular or sibling/tail"). A leaf never clobbers the return-address register or
the stack below sp, which is what unlocks the no-frame / red-zone shapes below.
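
As a rough illustration of the leaf test quoted above, the following sketch scans a flat opcode list and reports a leaf only when it sees no call of either kind. The SketchOp enum is hypothetical (only IR_CALL is named in the text; the tail-call opcode name is invented here), and the real plan_frame presumably walks the IR in whatever form the optimizer uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical opcode set; the tail-call spelling is an assumption. */
typedef enum {
    SK_IR_ADD,
    SK_IR_LOAD,
    SK_IR_CALL,       /* regular call */
    SK_IR_TAIL_CALL,  /* sibling/tail call */
    SK_IR_RET
} SketchOp;

/* A function is a leaf when it contains no call of any kind. */
static bool sketch_is_leaf(const SketchOp *ops, size_t n)
{
    assert(ops != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (ops[i] == SK_IR_CALL || ops[i] == SK_IR_TAIL_CALL)
            return false;
    return true;
}
```

Counting sibling calls as calls matters here: a tail jump still transfers control to code that may use the stack below sp, so a function containing one cannot use the red-zone or no-frame shapes.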
Done:
- x64 slim + red-zone tiers (x64_func_begin_known_frame). Two known-frame shapes, both keeping the push rbp; mov rbp,rsp record (so the leave epilogue, the CFA = rbp+16 CFI, and every rbp-relative offset are unchanged) and only dropping the sub rsp reservation:
  - slim_frame -- empty frame (no callee-saves, no body slots, no outgoing args, no alloca). Safe for non-leaves too: push rbp keeps rsp 16-aligned for any register-only call, and nothing lives below rsp. SysV + Win64.
  - redzone_leaf -- SysV leaf with a small frame (is_leaf, no alloca, no outgoing args, frame_size <= 128). Locals/callee-saves stay at their rbp-relative offsets, which now land in the 128-byte red zone. Leaf-only, since any call would clobber the red zone; Win64 (no red zone) is excluded by the shadow_space == 0 gate.
- No x64 fold tier: push rbp already folds the sp-move into the store, so there is no aa64-fp_at_bottom-style win to capture.
- rv64 leaf tier (rv_func_begin_known_frame, slim_prologue). A leaf with no callee-saves, no body slots, no outgoing args, no sret/variadic and register-only params (signature_stack_bytes == 0) never reads s0 nor clobbers ra (both are reserved, never allocable), so it emits no prologue and a bare ret -- the whole frame setup/teardown is elided (~8 insns/leaf). CFI is def_cfa(sp, 0), matching the CIE default (ra stays live in its register).
- rv64 frame fold: intentionally not ported. Porting aa64's fp_at_bottom to rv64 was measured at a zero instruction win: RISC-V has no pre/post-indexed store, so moving the saved s0/ra pair to the bottom still needs a separate addi sp,sp,-N plus the sd / addi s0,sp -- the same four instructions as the top-record shape. The fold only relocates data, it removes no instruction, so it was skipped rather than add a fold-aware offset-helper layer for no benefit. (Per the "quantify the win before committing" guidance that previously stood here.) The rv64 leaf tier above is the real rv64 win.
Still open:
- Cost-model alignment. signature_stack_bytes / call_stack_bytes are the shared hooks the optimizer uses to size the outgoing area and gate tail-call realizability; they exist on all three. As the tail-call paths (section 1) land, verify the optimizer's per-call cost estimates reflect the cheaper shapes so frame/spill decisions stay consistent across arches.
Body-level per-call warts from the aa64 study that are arch-shared and still open:
- Redundant branch chain. An if/else merge can emit b A; A: b B -- a conditional branch to a label that just unconditionally branches onward. cleanup_layout_fallthrough_branches in the jump pass does not yet thread this shape; this is an optimizer pass fix, surfaced per-arch at the call site.
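
Threading the b A; A: b B shape amounts to following forwarding blocks until a block with a real body is reached. The sketch below shows that idea on a hypothetical forwarding table; it is not the structure cleanup_layout_fallthrough_branches operates on, and the names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SK_NO_FORWARD ((size_t)-1)

/* forward[i] is the label that block i unconditionally branches to when the
 * block contains nothing else, or SK_NO_FORWARD when it has a real body.
 * Returns the final label a branch to `label` should be retargeted to. */
static size_t sketch_thread_target(const size_t *forward, size_t nblocks,
                                   size_t label)
{
    assert(forward != NULL && label < nblocks);
    size_t hops = 0;
    /* The hop bound stops the walk on a cycle of empty blocks. */
    while (forward[label] != SK_NO_FORWARD && hops < nblocks) {
        label = forward[label];
        assert(label < nblocks);
        hops++;
    }
    return label;
}
```

Once every branch is retargeted this way, blocks that only forward become unreachable and can be dropped by ordinary dead-block cleanup.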
3. x64 debugger step-out / unwind
kit_dwarf_unwind_step has no memory provider, and x64 (unlike aa64/rv64, which
have a link register) has no link-register fallback, so step-out can't recover the
return address from the stack. Compounding it, the JIT debugger doesn't populate
.eh_frame for in-process images.
- Add a memory-reading unwind variant so the unwinder can read the saved RA / RBP from the stack on x64.
- Populate .eh_frame (or an equivalent CFI source) for JIT in-process images so the debugger has unwind data to consume.
This is a debugging-UX robustness item with test-infra dependencies; see ../DBG.md and ../DWARF.md. Sibling debugger roadmap: DEBUG.md.
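
To make the memory-provider idea concrete, here is a minimal frame-pointer walk for one step-out. It is a simplification, not a DWARF CFI evaluator, and it does not reflect the actual kit_dwarf_unwind_step signature: the callback type, SketchRegs, and the toy stack provider are all hypothetical. It relies only on the layout stated earlier for frames that keep the push rbp; mov rbp,rsp record (CFA = rbp+16).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical memory provider: reads one 64-bit word of target memory. */
typedef bool (*SketchReadU64)(void *ctx, uint64_t addr, uint64_t *out);

typedef struct { uint64_t pc, sp, fp; } SketchRegs;

/* One step-out: saved RBP at [rbp], return address at [rbp+8], CFA = rbp+16.
 * Leaves *r untouched and returns false if either read fails. */
static bool sketch_unwind_step_fp(SketchRegs *r, SketchReadU64 read, void *ctx)
{
    assert(r != NULL && read != NULL);
    uint64_t saved_fp, ret_addr;
    if (!read(ctx, r->fp, &saved_fp)) return false;
    if (!read(ctx, r->fp + 8, &ret_addr)) return false;
    r->sp = r->fp + 16;  /* caller's sp is the CFA */
    r->fp = saved_fp;
    r->pc = ret_addr;
    return true;
}

/* Toy provider backed by an array of words starting at `base`. */
typedef struct {
    uint64_t base;
    const uint64_t *words;
    size_t nwords;
} SketchStack;

static bool sketch_stack_read(void *ctx, uint64_t addr, uint64_t *out)
{
    const SketchStack *s = ctx;
    if (addr < s->base || (addr - s->base) % 8 != 0) return false;
    uint64_t idx = (addr - s->base) / 8;
    if (idx >= s->nwords) return false;
    *out = s->words[idx];
    return true;
}
```

A CFI-driven unwinder would take the same kind of callback but compute the CFA and saved-register locations from .eh_frame rules instead of assuming the rbp record.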
4. Niche assembler / disassembler gaps
These are in the standalone as / inline-asm() encode-decode paths only. The
compiler's codegen emits machine code directly and never routes through the text
assembler, and the shipped runtime .s/.S files don't use these forms, so
none of this blocks any build. They are GNU-as / llvm-mc parity gaps for
hand-written assembly. Design context: ../ASM.md.
- aa64 atomics, remaining encode forms. CASP, the LSE min/max family (ldsmax/ldsmin/ldumax/ldumin), and LDAPR/STLLR are not yet encoded.
- aa64 disasm rows for the new encode-only forms. The recently-added exclusive/LSE atomics, register-offset, and writeback load/store forms encode correctly but have no decode rows, so a round-trip currently renders them as .inst. Add the matching disasm rows.
- TLS relocation modifiers in operands. :tprel_*: (aa64) and %tls_* (rv64) operand syntax is not yet accepted; the non-TLS modifiers (:lo12: / :got:, %hi/%lo/%pcrel_*, x64 @PLT/@GOTPCREL) are done.
- .L-prefixed local-label spellings in operand references. Plain labels work (including as the %pcrel_lo anchor); the .L-prefixed spelling in an operand position is a shared-lexer change.
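
The missing-decode-row behavior can be pictured with a tiny mask/match table: a word that matches a row prints its mnemonic, and anything else falls back to a raw .inst line. The table layout, function name, and the exact .inst output format below are assumptions for illustration; the two rows use the standard A64 encodings for nop and ret.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical decode row: (word & mask) == match selects the row. */
typedef struct {
    uint32_t mask, match;
    const char *text;
} SketchDecodeRow;

static const SketchDecodeRow sketch_rows[] = {
    { 0xFFFFFFFFu, 0xD503201Fu, "nop" },
    { 0xFFFFFFFFu, 0xD65F03C0u, "ret" },
};

/* Renders one instruction word; unknown words become a raw .inst line. */
static void sketch_disasm(uint32_t word, char *buf, size_t buflen)
{
    assert(buf != NULL && buflen > 0);
    for (size_t i = 0; i < sizeof sketch_rows / sizeof sketch_rows[0]; i++) {
        if ((word & sketch_rows[i].mask) == sketch_rows[i].match) {
            snprintf(buf, buflen, "%s", sketch_rows[i].text);
            return;
        }
    }
    snprintf(buf, buflen, ".inst 0x%08" PRIx32, word);
}
```

Closing the disasm gap for an encode-only form then means adding a row (plus operand formatting) so the round-trip no longer reaches the fallback.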
5. Cross-cutting hygiene
- Keep the three backends converging on the shared NativeFrame / native_argmove (parallel-copy shuffle) scaffolding rather than re-implementing per-arch; new fold tiers and tail-call paths should reuse it.
- As each gap above closes, prefer locking it in with a targeted corpus case (per-arch, per-form) over broad sweeps, per the testing guidance in ../TESTING.md.
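
For readers unfamiliar with the parallel-copy shuffle mentioned above, the following is a generic textbook sequentialization, not the native_argmove implementation. It assumes all destinations are distinct and that the scratch register appears in no move; SketchMove and sketch_parallel_copy are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int dst, src; } SketchMove;

/* Turns a parallel copy into an ordered list of moves in `out`, using
 * `scratch` to break cycles. `pending` is consumed. `out` needs room for
 * 2 * n entries. Returns the number of moves written. */
static size_t sketch_parallel_copy(SketchMove *pending, size_t n,
                                   int scratch, SketchMove *out)
{
    assert((pending != NULL && out != NULL) || n == 0);
    size_t nout = 0, live = 0;

    /* Drop no-op self moves. */
    for (size_t i = 0; i < n; i++)
        if (pending[i].dst != pending[i].src)
            pending[live++] = pending[i];

    while (live > 0) {
        int progressed = 0;
        for (size_t i = 0; i < live; ) {
            int blocked = 0;
            /* A move is blocked while its dst is still read by another. */
            for (size_t j = 0; j < live; j++)
                if (j != i && pending[j].src == pending[i].dst) {
                    blocked = 1;
                    break;
                }
            if (blocked) { i++; continue; }
            out[nout++] = pending[i];
            pending[i] = pending[--live];  /* swap-remove, recheck slot i */
            progressed = 1;
        }
        if (!progressed) {
            /* Only cycles remain: park one dst in scratch and redirect
             * its readers, which unblocks the move into that dst. */
            int d = pending[0].dst;
            out[nout++] = (SketchMove){ scratch, d };
            for (size_t j = 0; j < live; j++)
                if (pending[j].src == d)
                    pending[j].src = scratch;
        }
    }
    return nout;
}
```

A three-register rotation, for example, comes out as four moves: one into the scratch register followed by the three copies in dependency order.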