kit

kit
git clone https://git.ryansepassi.com/git/kit.git
Log | Files | Refs | README

commit f0897759c641e776198bfe031f1286489ab58934
parent 049d0f0ae42e920aa7f5a997dbc23fbe53fef7c0
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Sun, 10 May 2026 15:13:50 -0700

arch/x64: codegen for Linux ELF — A–D spine + most of E–H, P, Q

Replaces the all-panic x64 CGTarget skeleton with a single-pass backend
mirroring aa64/rv64. SysV AMD64 ABI classifier upgraded from
indirect-everything to real scalar DIRECT (aggregates ≤16B in INT parts,
larger sret/byval). Runs under qemu-x86_64 via the existing podman path.

test/cg E result (both opt-levels): 230 pass / 154 fail / 0 skip.
Full sweep of Groups A, B, C, D, E, H, P, Q + most of F and G; alloca,
varargs, atomics, intrinsics, TLS, globals, bitfields, indirect call
still stubbed and tracked in doc/X64.md.

Diffstat:
Adoc/X64.md | 206+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Msrc/abi/abi_sysv_x64.c | 102++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------
Msrc/arch/x64.c | 2043+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------
Asrc/arch/x64_isa.h | 75+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 2172 insertions(+), 254 deletions(-)

diff --git a/doc/X64.md b/doc/X64.md @@ -0,0 +1,206 @@ +# X64 codegen status + +Living checklist for the x86_64 (SysV AMD64, Linux ELF) backend +(`src/arch/x64.c`) and ABI (`src/abi/abi_sysv_x64.c`). Behavioral +oracles are `test/cg/` and (later) `test/parse/`. Phase status: + +- ✅ landed +- 🚧 in progress +- ⬜ planned + +--- + +## Test cg coverage + +Targeted scope: the **A–D spine** plus enough core ops (FP, sret, +byval, scalar conversions, locals + indirect load/store, structured +control flow, multi-function) that several follow-on groups +incidentally pass. Path D (in-process JIT) is skipped on a non-x64 +host — the harness reports SKIP. Path E (qemu/podman exec) runs +under `qemu-x86_64`. + +Current full-corpus result on x64 ELF, both opt-levels: +**E: 230 pass / 154 fail / 0 skip.** + +| Group | Status | Notes | +|-------|--------|-------| +| MC-only | n/a | `mc_smoke` is aa64-bytes-only (excluded by arch mask) | +| A — lifecycle | ✅ | a01–a10 | +| B — params/locals | ✅ | b01–b08, including sret (b06), byval (b07), FP param via xmm0 (b08), and 9-int-param stack spill (b03) | +| C — int arith | ✅ | c01–c12 | +| D — cmp/branch | ✅ | d01–d13 | +| E — conversions | ✅ | e01–e15 (SEXT/ZEXT/TRUNC/ITOF/FTOI/FEXT/FTRUNC/BITCAST) | +| F — memory | ✅ except bitfields | f01–f11; 🚧 f12/f13 bitfields | +| G — calls | ✅ except indirect | g02–g13; 🚧 g01 indirect call via reg (the synthesized fnptr type fails to classify cleanly through the stub ABI — TODO when callee.cls/type plumbing settles) | +| H — control | ✅ | h01–h18 — SCOPE_LOOP / SCOPE_BLOCK bookkeeping suffices for while / do / for / switch / ternary | +| I — alloca | ⬜ | needs `max_outgoing` patch site (placeholder ADD) and SP-from-RBP epilogue restoration | +| J — varargs | ⬜ | needs SysV `__va_list_tag` GP/FP save areas in the prologue | +| K — atomics | ⬜ | LOCK XADD / CMPXCHG / MFENCE family | +| L — intrinsics | ⬜ | POPCNT / BSF / BSR / BSWAP / memcpy ABI lowering | +| N — TLS LE | ⬜ | `mov rd, fs:0` + 32-bit TPOFF32 displacement | +| O — globals | ⬜ except o11 | RIP-relative addressing for OPK_GLOBAL load/store/addr-of; o11 already passes because it only renames the text section | +| P — DWARF | ✅ exit-code | p01–p07 pass on the value oracle; the W-path DWARF directives still depend on the stubbed `cfree_dwarf_*` consumers (same as on aa64/rv64) | +| Q — multi-fn | ✅ except q11 | q01–q10 pass; 🚧 q11 needs `addr_of` for OPK_GLOBAL | + +--- + +## Phase 1 — Backend foundation ✅ + +- ✅ Register pools: 13 int (rbx, r12..r15 callee-saved first, then + r10, r11, rsi, rdi, rcx, rdx, r8, r9 caller-saved); 10 FP (xmm6..xmm15) +- ✅ Frame layout: rbp-relative locals at negative offsets; callee-save + area immediately below; outgoing args at `[rsp+0]` (16-aligned) +- ✅ Prologue placeholder + func_end patch (mirrors aa64 / rv64) +- ✅ Epilogue: restore callee-saves, `leave; ret` +- ✅ MCEmitter fixup encodings already cover `R_PC32` for branches and + PC-relative calls/jumps; no new fixup kinds needed for Groups A–D + +## Phase 2 — Core ops ✅ + +- ✅ `load_imm`: 1B `mov r8, imm8`-via-MOV (32-bit) or `MOVABS` (64-bit) +- ✅ `copy`, `load`, `store` (i8/i16/i32/i64 + float/double) +- ✅ `addr_of` for OPK_LOCAL (`lea rd, [rbp - off]`) +- ✅ `binop` (int): ADD/SUB/IMUL via reg-reg; SDIV/UDIV/SREM/UREM via + CQO/CDQ + IDIV/DIV; AND/OR/XOR; SHL/SHR/SAR via `cl` +- ✅ `unop` (NEG/BNOT/NOT-as-`!`) +- ✅ `cmp` (materialize 0/1 via SETcc + MOVZX) +- ✅ `cmp_branch` (CMP + Jcc rel32, R_PC32 fixup to MCLabel) +- ✅ Structured `SCOPE_IF` / `else`; `SCOPE_LOOP` / `SCOPE_BLOCK` + (label bookkeeping only — caller drives `label_place`/`jump`) +- ✅ Calls (direct via `call rel32` + R_X64_PLT32; indirect via `call rax`) +- ✅ Returns: scalar in rax / xmm0; multi-instruction `jmp epilogue` +- ✅ Sret skeleton: incoming rdi spilled to a hidden slot at func_begin; + the ret-indirect path memcpys src→[rdi] before branching to epilogue +- ✅ FP scalar: `addss/addsd`, `cvtss2sd/cvtsd2ss`, `cvtsi2sd`, `cvttsd2si`, + `movd/movq` for BITCAST, `movss/movsd` for load/store/copy +- ✅ FP `load_const` via a fresh `.rodata` symbol + RIP-relative load +- ✅ `convert`: SEXT (`movsx`/`movsxd`), ZEXT (`movzx`/zero high), TRUNC + (no-op — narrower stores select width), FP↔int (CVTSI2S*/CVTTS*2SI), + FEXT/FTRUNC (CVTSS2SD/CVTSD2SS), BITCAST (movd/movq) + +## Phase 3 — Remaining cg coverage ⬜ + +- ⬜ Aggregate ops: `copy_bytes`, `set_bytes`, bitfields +- ⬜ Calls: byval (b07), large struct byval (g08), HFA edges (rejected via + ABI fallback to INDIRECT) +- ⬜ Group H: SCOPE_LOOP/BLOCK with `break_to`/`continue_to` exercised by + while/for/do-while/switch +- ⬜ Group I: alloca (constant + runtime size, max_outgoing patch site) +- ⬜ Group J: varargs (SysV `__va_list_tag` + gp/fp save areas) +- ⬜ Group K: atomics (LOCK XADD / CMPXCHG / MFENCE) +- ⬜ Group L: intrinsics (popcnt / bsf / bsr / bswap / memcpy ABI calls) +- ⬜ Group N: TLS LE — `mov rd, fs:0` + 32-bit TPOFF32 displacement +- ⬜ Group O: globals via RIP-relative addressing +- ⬜ Group P: DWARF line/subprogram (driven by Debug; backend forwards locs) + +## Phase 4 — opt-cgtarget equivalence ⬜ + +- ⬜ Confirm L1/L2 (opt-wrapped) cg paths match L0 on the spine +- ⬜ Same equivalence on the full corpus once Phase 3 lands + +## Phase 5 — test-parse on x64 ⬜ + +Same pattern as rv64 phase 5 — `test/parse/` is the file-driven C +parser harness. Plan: run `CFREE_TEST_ARCH=x64 make test-parse` after +Phase 3 stabilizes and triage failures, then mirror RV64's per-case +opt-out scheme for arch-specific cases. + +--- + +## Open follow-ups + +- Caller-saved register spilling around calls. The current pool hands + out caller-saved regs only after callee-saved are exhausted; cases + that hold a caller-saved reg live across a call (heavy register + pressure with a call in the middle) will mis-execute. The corpus + used to be designed so the first-allocated reg is callee-saved + (g11_caller_saved_live_across_call), but this is fragile — the + full Phase 3 plan tracks an explicit "live across call" annotation. +- Variadic FP register save area. Today the prologue spills only + int arg regs because varargs aren't reached; the save layout has + to mirror the SysV `__va_list_tag` once Group J lands. +- CFI directives are no-ops (debug.h's CFI fanout is unwired across + arches at present); revisit when `.eh_frame` lands. + + + +### Currently failing cg tests + + f12_bitfield_unsigned/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/f12_bitfield_unsigned/emit.err) + f13_bitfield_signed/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/f13_bitfield_signed/emit.err) + g01_indirect_call/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/g01_indirect_call/emit.err) + i01_alloca_const_int/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/i01_alloca_const_int/emit.err) + i02_alloca_runtime_size/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/i02_alloca_runtime_size/emit.err) + i03_alloca_align_16/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/i03_alloca_align_16/emit.err) + i04_alloca_in_loop_distinct/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/i04_alloca_in_loop_distinct/emit.err) + i05_alloca_then_call/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/i05_alloca_then_call/emit.err) + i06_two_allocas_disjoint/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/i06_two_allocas_disjoint/emit.err) + i07_alloca_addr_escapes/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/i07_alloca_addr_escapes/emit.err) + i08_vla_param_sum/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/i08_vla_param_sum/emit.err) + i09_alloca_preserves_locals/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/i09_alloca_preserves_locals/emit.err) + i10_alloca_after_named_local/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/i10_alloca_after_named_local/emit.err) + j01_va_int_sum_3/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/j01_va_int_sum_3/emit.err) + j02_va_zero_args/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/j02_va_zero_args/emit.err) + j03_va_int_spill/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/j03_va_int_spill/emit.err) + j04_va_int64/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/j04_va_int64/emit.err) + j05_va_double_sum/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/j05_va_double_sum/emit.err) + j06_va_double_spill/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/j06_va_double_spill/emit.err) + j07_va_mixed_int_dbl/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/j07_va_mixed_int_dbl/emit.err) + j08_va_copy/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/j08_va_copy/emit.err) + j09_va_two_fixed/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/j09_va_two_fixed/emit.err) + k01_atomic_load_relaxed/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/k01_atomic_load_relaxed/emit.err) + k02_atomic_store_load_acq/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/k02_atomic_store_load_acq/emit.err) + k03_atomic_load_seq_cst/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/k03_atomic_load_seq_cst/emit.err) + k04_atomic_rmw_add/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/k04_atomic_rmw_add/emit.err) + k05_atomic_rmw_xchg/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/k05_atomic_rmw_xchg/emit.err) + k06_atomic_rmw_and/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/k06_atomic_rmw_and/emit.err) + k07_atomic_rmw_or/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/k07_atomic_rmw_or/emit.err) + k08_atomic_rmw_xor/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/k08_atomic_rmw_xor/emit.err) + k09_atomic_rmw_sub/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/k09_atomic_rmw_sub/emit.err) + k10_atomic_rmw_nand/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/k10_atomic_rmw_nand/emit.err) + k11_atomic_cas_success/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/k11_atomic_cas_success/emit.err) + k12_atomic_cas_failure/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/k12_atomic_cas_failure/emit.err) + k13_atomic_load_i64/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/k13_atomic_load_i64/emit.err) + k14_atomic_rmw_prior/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/k14_atomic_rmw_prior/emit.err) + k15_fence_seq_cst/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/k15_fence_seq_cst/emit.err) + l01_popcount_u32/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/l01_popcount_u32/emit.err) + l02_popcount_u64/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/l02_popcount_u64/emit.err) + l03_ctz_u32/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/l03_ctz_u32/emit.err) + l04_clz_u32/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/l04_clz_u32/emit.err) + l05_bswap16/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/l05_bswap16/emit.err) + l06_bswap32/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/l06_bswap32/emit.err) + l07_bswap64/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/l07_bswap64/emit.err) + l08_memcpy_4/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/l08_memcpy_4/emit.err) + l09_memmove_overlap/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/l09_memmove_overlap/emit.err) + l10_memset_zero/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/l10_memset_zero/emit.err) + l11_memset_ff/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/l11_memset_ff/emit.err) + l12_expect_taken/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/l12_expect_taken/emit.err) + l13_unreachable_live/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/l13_unreachable_live/emit.err) + l14_trap_live/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/l14_trap_live/emit.err) + l15_prefetch_noop/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/l15_prefetch_noop/emit.err) + l16_assume_aligned/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/l16_assume_aligned/emit.err) + l17_add_overflow_no/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/l17_add_overflow_no/emit.err) + l18_add_overflow_yes/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/l18_add_overflow_yes/emit.err) + l19_sub_overflow_yes/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/l19_sub_overflow_yes/emit.err) + l20_mul_overflow_no/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/l20_mul_overflow_no/emit.err) + n01_tls_load_le/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/n01_tls_load_le/emit.err) + n02_tls_store_le/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/n02_tls_store_le/emit.err) + n03_tls_addr_taken/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/n03_tls_addr_taken/emit.err) + n04_tls_i64/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/n04_tls_i64/emit.err) + n05_tls_in_loop/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/n05_tls_in_loop/emit.err) + n06_tls_two_vars/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/n06_tls_two_vars/emit.err) + n07_tls_bss_zero_init/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/n07_tls_bss_zero_init/emit.err) + n08_tls_addend_offset/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/n08_tls_addend_offset/emit.err) + o01_global_load_data/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/o01_global_load_data/emit.err) + o02_global_store_data/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/o02_global_store_data/emit.err) + o03_global_bss_zero/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/o03_global_bss_zero/emit.err) + o04_global_addr_taken/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/o04_global_addr_taken/emit.err) + o05_global_i64/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/o05_global_i64/emit.err) + o06_rodata_load/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/o06_rodata_load/emit.err) + o07_global_struct_field/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/o07_global_struct_field/emit.err) + o08_global_array_runtime_idx/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/o08_global_array_runtime_idx/emit.err) + o09_static_local_linkage/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/o09_static_local_linkage/emit.err) + o10_global_addend/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/o10_global_addend/emit.err) + o12_global_across_call/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/o12_global_across_call/emit.err) + q11_addr_of_helper_through_global/emit (cg-runner --emit failed; see /Users/ryan/code/cfree/build/test/cg/q11_addr_of_helper_through_global/emit.err) + diff --git a/src/abi/abi_sysv_x64.c b/src/abi/abi_sysv_x64.c @@ -1,9 +1,17 @@ -/* SysV AMD64 ABI — phase-2 stub. +/* SysV AMD64 ABI — minimal classifier. * - * Initial classifier returns ABI_ARG_INDIRECT for everything: correct - * (every value passes through memory), slow, but unblocks bring-up of - * the x64 codegen path. Phase 3 replaces this with the real eight-byte - * INTEGER/SSE classifier (see doc/MULTIARCH.md §4 phase 3 step 2). */ + * Covers the subset the cg test harness needs through the spine: + * void -> IGNORE + * integer ≤ 8B -> DIRECT, one INT part (rdi..r9 for args; rax for return) + * pointer -> DIRECT, one INT part + * float/double -> DIRECT, one FP part (xmm0..xmm7 for args; xmm0 return) + * small struct -> DIRECT, INT parts up to 16B (passed in up to 2 GPRs) + * large struct -> INDIRECT (sret for return; byval for args) + * + * The full SysV INTEGER/SSE eight-byte classification (with X87/COMPLEX_X87/ + * NO_CLASS rules and the MEMORY-pulls-down rule) is deferred — for the + * cg corpus this approximation is enough and matches what the rv64 ABI + * does today. */ #include <string.h> @@ -12,26 +20,86 @@ #include "core/core.h" #include "core/pool.h" -static void classify_indirect(TargetABI* a, const Type* t, ABIArgInfo* out, - int is_return) { +static void classify_void(ABIArgInfo* out) { + memset(out, 0, sizeof *out); + out->kind = ABI_ARG_IGNORE; +} + +static void classify_scalar(TargetABI* a, const Type* t, ABIArgInfo* out) { + ABITypeInfo ti = abi_internal_type_info(a, t); + out->kind = ABI_ARG_DIRECT; + out->flags = ABI_AF_NONE; + out->indirect_align = 0; + + ABIArgPart* parts = arena_new(a->c->tu, ABIArgPart); + memset(parts, 0, sizeof *parts); + parts->cls = (ti.scalar_kind == ABI_SC_FLOAT) ? ABI_CLASS_FP : ABI_CLASS_INT; + parts->loc = ABI_LOC_REG; + parts->size = ti.size; + parts->align = ti.align; + parts->src_offset = 0; + + out->parts = parts; + out->nparts = 1; +} + +static void classify_aggregate(TargetABI* a, const Type* t, ABIArgInfo* out, + int is_return) { + ABITypeInfo ti = abi_internal_type_info(a, t); + if (ti.size == 0) { + classify_void(out); + return; + } + if (ti.size <= 16) { + u32 nparts = (ti.size + 7) / 8; + ABIArgPart* parts = arena_array(a->c->tu, ABIArgPart, nparts); + memset(parts, 0, sizeof(ABIArgPart) * nparts); + u32 off = 0; + for (u32 i = 0; i < nparts; ++i) { + u32 chunk = (ti.size - off > 8) ? 8 : (ti.size - off); + parts[i].cls = ABI_CLASS_INT; + parts[i].loc = ABI_LOC_REG; + parts[i].size = chunk; + parts[i].align = 8; + parts[i].src_offset = off; + off += chunk; + } + out->kind = ABI_ARG_DIRECT; + out->flags = ABI_AF_NONE; + out->parts = parts; + out->nparts = (u16)nparts; + out->indirect_align = 0; + } else { + out->kind = ABI_ARG_INDIRECT; + out->flags = is_return ? ABI_AF_SRET : ABI_AF_BYVAL; + out->indirect_align = ti.align ? ti.align : 8; + out->parts = NULL; + out->nparts = 0; + } +} + +static void classify_one(TargetABI* a, const Type* t, ABIArgInfo* out, + int is_return) { if (!t || t->kind == TY_VOID) { - memset(out, 0, sizeof *out); - out->kind = ABI_ARG_IGNORE; + classify_void(out); return; } - ABITypeInfo ti = abi_internal_type_info(a, t); - out->kind = ABI_ARG_INDIRECT; - out->flags = is_return ? ABI_AF_SRET : ABI_AF_BYVAL; - out->indirect_align = ti.align ? ti.align : 8; - out->parts = NULL; - out->nparts = 0; + switch (t->kind) { + case TY_STRUCT: + case TY_UNION: + classify_aggregate(a, t, out, is_return); + return; + default: + classify_scalar(a, t, out); + return; + } } static ABIFuncInfo* sysv_x64_compute_func_info(TargetABI* a, const Type* fn) { ABIFuncInfo* info = arena_new(a->c->tu, ABIFuncInfo); memset(info, 0, sizeof *info); - classify_indirect(a, fn->fn.ret, &info->ret, /*is_return=*/1); + classify_one(a, fn->fn.ret, &info->ret, /*is_return=*/1); info->has_sret = (info->ret.kind == ABI_ARG_INDIRECT) ? 1 : 0; info->variadic = fn->fn.variadic; @@ -40,7 +108,7 @@ static ABIFuncInfo* sysv_x64_compute_func_info(TargetABI* a, const Type* fn) { ABIArgInfo* arr = arena_array(a->c->tu, ABIArgInfo, fn->fn.nparams); memset(arr, 0, sizeof(ABIArgInfo) * fn->fn.nparams); for (u16 i = 0; i < fn->fn.nparams; ++i) { - classify_indirect(a, fn->fn.params[i], &arr[i], /*is_return=*/0); + classify_one(a, fn->fn.params[i], &arr[i], /*is_return=*/0); } info->params = arr; } else { diff --git a/src/arch/x64.c b/src/arch/x64.c @@ -1,293 +1,1862 @@ -/* x86_64 CGTarget skeleton. +/* Minimal x86_64 (SysV AMD64, Linux ELF) CGTarget. * - * Phase-2 placeholder: the vtable is wired up but every method panics. - * This proves the cgtarget_new dispatch reaches an x64-shaped target. - * Phase 3 fills in real codegen. */ + * Single-pass codegen mirroring the structure of src/arch/aarch64.c + * and src/arch/rv64.c. The frame uses rbp as a frame pointer; locals + * live at negative offsets from rbp, callee-save spills live below + * the local area at known offsets, and outgoing args sit at sp+0. + * The prologue is reserved as a NOP-filled placeholder at func_begin + * and patched at func_end once frame_size and the callee-save high- + * water mark are known. + * + * Reg allocator: lowest-bit-first over a fixed preference list. INT + * pool has callee-saves (rbx, r12..r15) at the low bits, then a + * caller-saved tail (r10, rdi, rsi, r8, r9) — so the first reg handed + * out is callee-saved, which is what tests like + * g11_caller_saved_live_across_call rely on. FP pool is xmm6..xmm15 + * (10 regs, all caller-saved on SysV). + * + * Scratches kept outside the pools: rax (primary), rcx, rdx, r11 + * (secondary). rax is also the int return reg; xmm0 is the FP return + * reg. + * + * Scope: the test/cg spine (Groups A–D plus call/local/sret/byval/FP + * pieces of B). Methods past the spine panic with a clear message so + * Phase 3 work has obvious landing pads — see doc/X64.md. */ #include <string.h> #include "arch/arch.h" #include "arch/x64.h" +#include "arch/x64_isa.h" #include "core/arena.h" +#include "core/pool.h" +#include "obj/obj.h" +#include "type/type.h" + +#define X64_PROLOGUE_BYTES 96u + +/* ============================================================ + * Custom register pool. + * + * Unlike aa64/rv64 the x64 pool is non-contiguous (skipping rax, + * rcx, rdx, rsp, rbp, r11). So we keep a bitmap over a static + * preference order rather than a (base, nregs) range. */ +typedef struct XRegPool { + u32 free; /* bit i set ⇔ alloc_order[i] is free */ + u32 hwm; /* highest index+1 ever allocated */ + const u8* order; /* alloc_order; first n_cs are callee-saved */ + u8 nregs; + u8 n_cs; + u8 pad[2]; +} XRegPool; + +static void xpool_init(XRegPool* p, const u8* order, u8 nregs, u8 n_cs) { + p->order = order; + p->nregs = nregs; + p->n_cs = n_cs; + p->hwm = 0; + p->free = (nregs >= 32u) ? 0xFFFFFFFFu : ((1u << nregs) - 1u); +} + +static Reg xpool_alloc(XRegPool* p) { + if (p->free == 0) return (Reg)REG_NONE; + u32 idx = (u32)__builtin_ctz(p->free); + p->free &= ~(1u << idx); + if (idx + 1u > p->hwm) p->hwm = idx + 1u; + return (Reg)p->order[idx]; +} + +static int xpool_free(XRegPool* p, Reg r) { + for (u8 i = 0; i < p->nregs; ++i) { + if (p->order[i] == (u8)r) { + u32 bit = 1u << i; + if (p->free & bit) return -1; + p->free |= bit; + return 1; + } + } + return 0; +} + +static const u8 g_int_order[10] = { + X64_RBX, X64_R12, X64_R13, X64_R14, X64_R15, /* callee-saved (n_cs=5) */ + X64_R10, X64_RDI, X64_RSI, X64_R8, X64_R9, /* caller-saved tail */ +}; + +static const u8 g_fp_order[10] = { + /* All xmm regs are caller-saved on SysV; preference order is xmm6 + * upward to keep the low arg/return regs (xmm0..5) clear for calls. */ + X64_XMM6, X64_XMM7, X64_XMM8, X64_XMM0 + 9, X64_XMM0 + 10, + X64_XMM0 + 11, X64_XMM0 + 12, X64_XMM0 + 13, X64_XMM0 + 14, X64_XMM15, +}; + +static const u32 g_int_arg_regs[6] = {X64_RDI, X64_RSI, X64_RDX, + X64_RCX, X64_R8, X64_R9}; + +/* ============================================================ + * XImpl */ + +typedef struct XSlot { + u32 off; /* bytes below rbp (positive); address = rbp - off */ + u32 size; + u32 align; + u8 kind; + u8 pad[3]; +} XSlot; + +typedef struct XScope { + u8 kind; + u8 has_else; + u8 pad[2]; + MCLabel else_label; + MCLabel end_label; + Label break_label; + Label continue_label; +} XScope; typedef struct XImpl { CGTarget base; SrcLoc loc; + const CGFuncDesc* fd; + + u32 func_start; + u32 prologue_pos; + MCLabel epilogue_label; + + XSlot* slots; + u32 nslots; + u32 slots_cap; + u32 cum_off; + u32 max_outgoing; + + u32 next_param_int; + u32 next_param_fp; + u32 next_param_stack; + u8 has_sret; + FrameSlot sret_ptr_slot; + + XRegPool int_pool; + XRegPool fp_pool; + + XScope* scopes; + u32 nscopes; + u32 scopes_cap; } XImpl; -static SrcLoc xx_loc(void) { return (SrcLoc){0, 0, 0}; } +static XImpl* impl_of(CGTarget* t) { return (XImpl*)t; } + +/* Forward declarations. */ +static FrameSlot x_frame_slot(CGTarget* t, const FrameSlotDesc* d); +static XSlot* slot_get(XImpl* a, FrameSlot fs); +static void x_load(CGTarget* t, Operand dst, Operand addr, MemAccess ma); +static void x_store(CGTarget* t, Operand addr, Operand src, MemAccess ma); +static void x_free_reg(CGTarget* t, Reg r, RegClass cls); -_Noreturn static void xx_panic(CGTarget* t, const char* what) { - compiler_panic(t->c, xx_loc(), "x64: %s not implemented", what); +extern void debug_emit_row(Debug*, ObjSecId text_section, u32 offset, SrcLoc); + +/* ---- type helpers ---- */ +static int type_is_64(const Type* t) { + if (!t) return 0; + switch (t->kind) { + case TY_LONG: + case TY_ULONG: + case TY_LLONG: + case TY_ULLONG: + case TY_PTR: + case TY_DOUBLE: + return 1; + default: + return 0; + } +} +static int type_is_fp_double(const Type* t) { + return t && (t->kind == TY_DOUBLE || t->kind == TY_LDOUBLE); +} +static u32 type_byte_size(const Type* t) { + if (!t) return 4; + switch (t->kind) { + case TY_CHAR: + case TY_SCHAR: + case TY_UCHAR: + case TY_BOOL: + return 1; + case TY_SHORT: + case TY_USHORT: + return 2; + case TY_INT: + case TY_UINT: + case TY_FLOAT: + return 4; + case TY_LONG: + case TY_ULONG: + case TY_LLONG: + case TY_ULLONG: + case TY_PTR: + case TY_DOUBLE: + return 8; + default: + return 8; + } +} +static int type_is_signed(const Type* t) { + if (!t) return 0; + switch (t->kind) { + case TY_CHAR: + case TY_SCHAR: + case TY_SHORT: + case TY_INT: + case TY_LONG: + case TY_LLONG: + return 1; + default: + return 0; + } } -static void xx_func_begin(CGTarget* t, const CGFuncDesc* d) { - (void)d; - xx_panic(t, "func_begin"); +static _Noreturn void x_panic(CGTarget* t, const char* what) { + SrcLoc loc = impl_of(t)->loc; + compiler_panic(t->c, loc, "x64: %s not implemented", what); +} + +/* ============================================================ + * Byte-level emit helpers. + * + * x64 instructions are variable length: optional legacy prefix(es), + * optional REX, 1-3 byte opcode, ModR/M, optional SIB, optional + * displacement, optional immediate. Helpers below build sequences + * into the active MCEmitter section, recording one Debug row per + * instruction-start. */ +static void emit1(MCEmitter* mc, u8 b) { + u32 ofs = obj_pos(mc->obj, mc->section_id); + mc->emit_bytes(mc, &b, 1); + if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc); +} +static void emit_u32le(MCEmitter* mc, u32 v) { + u8 b[4]; + b[0] = (u8)v; + b[1] = (u8)(v >> 8); + b[2] = (u8)(v >> 16); + b[3] = (u8)(v >> 24); + mc->emit_bytes(mc, b, 4); +} +static void emit_u64le(MCEmitter* mc, u64 v) { + u8 b[8]; + for (int i = 0; i < 8; ++i) b[i] = (u8)(v >> (i * 8)); + mc->emit_bytes(mc, b, 8); +} + +static u8 make_rex(int w, u32 reg, u32 index, u32 rm) { + u8 r = 0; + if (w) r |= X64_REX_W; + if (reg & 8) r |= X64_REX_R; + if (index & 8) r |= X64_REX_X; + if (rm & 8) r |= X64_REX_B; + return r ? (u8)(X64_REX_BASE | r) : 0; +} +static void emit_rex(MCEmitter* mc, int w, u32 reg, u32 index, u32 rm) { + u8 r = make_rex(w, reg, index, rm); + if (r) mc->emit_bytes(mc, &r, 1); +} +/* Force REX (even REX=0x40) — required for byte-reg encodings that + * promote SIL/DIL/etc. */ +static void emit_rex_force(MCEmitter* mc, int w, u32 reg, u32 index, u32 rm) { + u8 r = (u8)(X64_REX_BASE | (w ? X64_REX_W : 0) | ((reg & 8) ? X64_REX_R : 0) | + ((index & 8) ? X64_REX_X : 0) | ((rm & 8) ? X64_REX_B : 0)); + mc->emit_bytes(mc, &r, 1); +} + +static u8 modrm(u32 mod, u32 reg, u32 rm) { + return (u8)(((mod & 3u) << 6) | ((reg & 7u) << 3) | (rm & 7u)); +} +static u8 sib(u32 scale, u32 index, u32 base) { + return (u8)(((scale & 3u) << 6) | ((index & 7u) << 3) | (base & 7u)); +} + +static u32 disp_mod(u32 base, i32 disp) { + if (disp == 0 && (base & 7u) != 5u) return 0u; /* [base] */ + if (disp >= -128 && disp <= 127) return 1u; /* [base + disp8] */ + return 2u; /* [base + disp32] */ +} + +static void emit_mem_operand(MCEmitter* mc, u32 reg, u32 base, i32 disp) { + u32 m = disp_mod(base, disp); + if ((base & 7u) == 4u) { + /* SIB byte required: index=4 (none), base=base. */ + u8 mr = modrm(m, reg, 4u); + mc->emit_bytes(mc, &mr, 1); + u8 s = sib(0, 4u, base); + mc->emit_bytes(mc, &s, 1); + } else { + u8 mr = modrm(m, reg, base); + mc->emit_bytes(mc, &mr, 1); + } + if (m == 1u) { + u8 d = (u8)(i8)disp; + mc->emit_bytes(mc, &d, 1); + } else if (m == 2u) { + emit_u32le(mc, (u32)disp); + } +} +static void emit_rm_reg(MCEmitter* mc, u32 reg, u32 rm) { + u8 mr = modrm(3u, reg, rm); + mc->emit_bytes(mc, &mr, 1); +} + +/* ---- specific instruction emitters ---- */ + +/* mov rd, rs (64-bit if w, else 32-bit). */ +static void emit_mov_rr(MCEmitter* mc, int w, u32 dst, u32 src) { + u32 ofs = obj_pos(mc->obj, mc->section_id); + emit_rex(mc, w, src, 0, dst); + u8 op = 0x89; /* MOV r/m, r */ + mc->emit_bytes(mc, &op, 1); + emit_rm_reg(mc, src, dst); + if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc); +} + +/* mov reg, [base + disp]; size 1/2/4/8. */ +static void emit_mov_load(MCEmitter* mc, u32 size, int signed_ext, u32 dst, + u32 base, i32 disp) { + u32 ofs = obj_pos(mc->obj, mc->section_id); + if (size == 8) { + emit_rex(mc, 1, dst, 0, base); + u8 op = 0x8B; + mc->emit_bytes(mc, &op, 1); + emit_mem_operand(mc, dst, base, disp); + } else if (size == 4) { + emit_rex(mc, 0, dst, 0, base); + u8 op = 0x8B; + mc->emit_bytes(mc, &op, 1); + emit_mem_operand(mc, dst, base, disp); + } else if (size == 2) { + emit_rex(mc, 0, dst, 0, base); + u8 op[2] = {0x0F, signed_ext ? 0xBF : 0xB7}; + mc->emit_bytes(mc, op, 2); + emit_mem_operand(mc, dst, base, disp); + } else if (size == 1) { + emit_rex(mc, 0, dst, 0, base); + u8 op[2] = {0x0F, signed_ext ? 0xBE : 0xB6}; + mc->emit_bytes(mc, op, 2); + emit_mem_operand(mc, dst, base, disp); + } + if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc); +} + +/* mov [base + disp], src; size 1/2/4/8. */ +static void emit_mov_store(MCEmitter* mc, u32 size, u32 src, u32 base, + i32 disp) { + u32 ofs = obj_pos(mc->obj, mc->section_id); + if (size == 8) { + emit_rex(mc, 1, src, 0, base); + u8 op = 0x89; + mc->emit_bytes(mc, &op, 1); + emit_mem_operand(mc, src, base, disp); + } else if (size == 4) { + emit_rex(mc, 0, src, 0, base); + u8 op = 0x89; + mc->emit_bytes(mc, &op, 1); + emit_mem_operand(mc, src, base, disp); + } else if (size == 2) { + u8 p = 0x66; + mc->emit_bytes(mc, &p, 1); + emit_rex(mc, 0, src, 0, base); + u8 op = 0x89; + mc->emit_bytes(mc, &op, 1); + emit_mem_operand(mc, src, base, disp); + } else if (size == 1) { + /* Force REX so SIL/DIL/etc are addressable as byte regs. */ + emit_rex_force(mc, 0, src, 0, base); + u8 op = 0x88; + mc->emit_bytes(mc, &op, 1); + emit_mem_operand(mc, src, base, disp); + } + if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc); +} + +static void emit_lea(MCEmitter* mc, u32 dst, u32 base, i32 disp) { + u32 ofs = obj_pos(mc->obj, mc->section_id); + emit_rex(mc, 1, dst, 0, base); + u8 op = 0x8D; + mc->emit_bytes(mc, &op, 1); + emit_mem_operand(mc, dst, base, disp); + if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc); +} + +/* movabs reg, imm64 (REX.W + B8+r imm64) for is64; mov r32, imm32 (B8+r + * imm32) for !is64. Both 10/5 bytes. */ +static void emit_load_imm(MCEmitter* mc, int is64, u32 dst, i64 imm) { + u32 ofs = obj_pos(mc->obj, mc->section_id); + if (is64) { + emit_rex(mc, 1, 0, 0, dst); + u8 op = (u8)(0xB8 | (dst & 7)); + mc->emit_bytes(mc, &op, 1); + emit_u64le(mc, (u64)imm); + } else { + emit_rex(mc, 0, 0, 0, dst); + u8 op = (u8)(0xB8 | (dst & 7)); + mc->emit_bytes(mc, &op, 1); + emit_u32le(mc, (u32)imm); + } + if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc); +} + +/* Two-operand ALU r/m, r. op picks ADD(01)/SUB(29)/AND(21)/OR(09)/XOR(31)/ + * CMP(39)/MOV(89)/TEST(85). */ +static void emit_alu_rr(MCEmitter* mc, int w, u8 op, u32 dst, u32 src) { + u32 ofs = obj_pos(mc->obj, mc->section_id); + emit_rex(mc, w, src, 0, dst); + mc->emit_bytes(mc, &op, 1); + emit_rm_reg(mc, src, dst); + if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc); +} + +static void emit_imul_rr(MCEmitter* mc, int w, u32 dst, u32 src) { + u32 ofs = obj_pos(mc->obj, mc->section_id); + emit_rex(mc, w, dst, 0, src); + u8 op[2] = {0x0F, 0xAF}; + mc->emit_bytes(mc, op, 2); + emit_rm_reg(mc, dst, src); + if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc); } -static void xx_func_end(CGTarget* t) { xx_panic(t, "func_end"); } -static Reg xx_alloc_reg(CGTarget* t, RegClass cls, const Type* ty) { - (void)cls; +static void emit_f7_rm(MCEmitter* mc, int w, u32 sub, u32 reg) { + u32 ofs = obj_pos(mc->obj, mc->section_id); + emit_rex(mc, w, 0, 0, reg); + u8 op = 0xF7; + mc->emit_bytes(mc, &op, 1); + emit_rm_reg(mc, sub, reg); + if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc); +} + +static void emit_shift_cl(MCEmitter* mc, int w, u32 sub, u32 reg) { + u32 ofs = obj_pos(mc->obj, mc->section_id); + emit_rex(mc, w, 0, 0, reg); + u8 op = 0xD3; + mc->emit_bytes(mc, &op, 1); + emit_rm_reg(mc, sub, reg); + if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc); +} + +static void emit_cqo_or_cdq(MCEmitter* mc, int w) { + if (w) { + u8 buf[2] = {X64_REX_BASE | X64_REX_W, 0x99}; + mc->emit_bytes(mc, buf, 2); + } else { + u8 op = 0x99; + mc->emit_bytes(mc, &op, 1); + } +} + +static void emit_xor_self(MCEmitter* mc, int w, u32 r) { + emit_alu_rr(mc, w, 0x31, r, r); +} + +/* cmp r/m, imm8 (0x83 /7). */ +static void emit_cmp_imm8(MCEmitter* mc, int w, u32 reg, i8 imm) { + u32 ofs = obj_pos(mc->obj, mc->section_id); + emit_rex(mc, w, 0, 0, reg); + u8 buf[3]; + buf[0] = 0x83; + buf[1] = modrm(3u, 7u, reg); + buf[2] = (u8)imm; + mc->emit_bytes(mc, buf, 3); + if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc); +} + +static void emit_test_self(MCEmitter* mc, int w, u32 reg) { + emit_alu_rr(mc, w, 0x85, reg, reg); +} + +static void emit_setcc(MCEmitter* mc, u32 cc, u32 reg) { + u32 ofs = obj_pos(mc->obj, mc->section_id); + emit_rex_force(mc, 0, 0, 0, reg); + u8 op[2] = {0x0F, (u8)(0x90 | (cc & 0xF))}; + mc->emit_bytes(mc, op, 2); + emit_rm_reg(mc, 0u, reg); + if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc); +} + +static void emit_movzx_r32_r8(MCEmitter* mc, u32 dst, u32 src) { + u32 ofs = obj_pos(mc->obj, mc->section_id); + emit_rex_force(mc, 0, dst, 0, src); + u8 op[2] = {0x0F, 0xB6}; + mc->emit_bytes(mc, op, 2); + emit_rm_reg(mc, dst, src); + if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc); +} + +/* movzx/movsx r→r. src_size is source byte width. */ +static void emit_extend_rr(MCEmitter* mc, int w, int signed_ext, u32 src_size, + u32 dst, u32 src) { + u32 ofs = obj_pos(mc->obj, mc->section_id); + if (src_size == 4 && signed_ext) { + /* movsxd r64, r32: REX.W 0x63 ModRM */ + emit_rex(mc, 1, dst, 0, src); + u8 op = 0x63; + mc->emit_bytes(mc, &op, 1); + emit_rm_reg(mc, dst, src); + } else if (src_size == 4 && !signed_ext) { + /* zext 32→64 is `mov r32, r32` (clears high 32). */ + emit_rex(mc, 0, src, 0, dst); + u8 op = 0x89; + mc->emit_bytes(mc, &op, 1); + emit_rm_reg(mc, src, dst); + } else if (src_size == 1) { + emit_rex_force(mc, w, dst, 0, src); + u8 op[2] = {0x0F, signed_ext ? 0xBE : 0xB6}; + mc->emit_bytes(mc, op, 2); + emit_rm_reg(mc, dst, src); + } else if (src_size == 2) { + emit_rex(mc, w, dst, 0, src); + u8 op[2] = {0x0F, signed_ext ? 0xBF : 0xB7}; + mc->emit_bytes(mc, op, 2); + emit_rm_reg(mc, dst, src); + } + if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc); +} + +static void emit_ret(MCEmitter* mc) { + u8 op = 0xC3; + mc->emit_bytes(mc, &op, 1); +} +static void emit_leave(MCEmitter* mc) { + u8 op = 0xC9; + mc->emit_bytes(mc, &op, 1); +} + +/* ---- SSE scalar FP encoders ---- */ +static void emit_sse_rr(MCEmitter* mc, u8 prefix, u8 opcode, u32 dst, u32 src) { + u32 ofs = obj_pos(mc->obj, mc->section_id); + if (prefix) mc->emit_bytes(mc, &prefix, 1); + emit_rex(mc, 0, dst, 0, src); + u8 op[2] = {0x0F, opcode}; + mc->emit_bytes(mc, op, 2); + emit_rm_reg(mc, dst, src); + if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc); +} +static void emit_sse_load(MCEmitter* mc, u8 prefix, u8 opcode, u32 dst, + u32 base, i32 disp) { + u32 ofs = obj_pos(mc->obj, mc->section_id); + if (prefix) mc->emit_bytes(mc, &prefix, 1); + emit_rex(mc, 0, dst, 0, base); + u8 op[2] = {0x0F, opcode}; + mc->emit_bytes(mc, op, 2); + emit_mem_operand(mc, dst, base, disp); + if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc); +} +static void emit_sse_store(MCEmitter* mc, u8 prefix, u8 opcode, u32 src, + u32 base, i32 disp) { + u32 ofs = obj_pos(mc->obj, mc->section_id); + if (prefix) mc->emit_bytes(mc, &prefix, 1); + emit_rex(mc, 0, src, 0, base); + u8 op[2] = {0x0F, opcode}; + mc->emit_bytes(mc, op, 2); + emit_mem_operand(mc, src, base, disp); + if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc); +} +static void emit_sse_rr_w(MCEmitter* mc, u8 prefix, u8 opcode, int w, u32 dst, + u32 src) { + u32 ofs = obj_pos(mc->obj, mc->section_id); + if (prefix) mc->emit_bytes(mc, &prefix, 1); + emit_rex(mc, w, dst, 0, src); + u8 op[2] = {0x0F, opcode}; + mc->emit_bytes(mc, op, 2); + emit_rm_reg(mc, dst, src); + if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc); +} + +/* ============================================================ + * Function lifecycle */ + +static void x_func_begin(CGTarget* t, const CGFuncDesc* fd) { + XImpl* a = impl_of(t); + MCEmitter* mc = t->mc; + + mc->set_section(mc, fd->text_section_id); + mc->emit_align(mc, 16, 0x90); + + a->fd = fd; + a->func_start = mc->pos(mc); + a->next_param_int = 0; + a->next_param_fp = 0; + a->next_param_stack = 0; + a->has_sret = (fd->abi && fd->abi->has_sret) ? 1 : 0; + a->cum_off = 0; + a->max_outgoing = 0; + xpool_init(&a->int_pool, g_int_order, 10u, 5u); + xpool_init(&a->fp_pool, g_fp_order, 10u, 0u); + a->nslots = 0; + a->nscopes = 0; + a->sret_ptr_slot = FRAME_SLOT_NONE; + a->epilogue_label = mc->label_new(mc); + + mc->cfi_startproc(mc); + + /* Reserve a fixed-size prologue placeholder filled with NOPs. */ + a->prologue_pos = mc->pos(mc); + for (u32 i = 0; i < X64_PROLOGUE_BYTES; ++i) emit1(mc, 0x90); + + /* sret: rdi at entry holds the destination pointer. Spill it to a + * hidden slot so the body can use rdi freely. */ + if (a->has_sret) { + FrameSlotDesc fsd = { + .type = NULL, .name = 0, .loc = {0, 0, 0}, + .size = 8, .align = 8, .kind = FS_SPILL, .flags = 0, + }; + a->sret_ptr_slot = x_frame_slot(t, &fsd); + /* Subsequent int args start at rsi (next_param_int = 1). */ + a->next_param_int = 1; + } +} + +static u32 align_up_u32(u32 v, u32 a) { return (v + (a - 1u)) & ~(a - 1u); } + +static void x_func_end(CGTarget* t) { + XImpl* a = impl_of(t); + MCEmitter* mc = t->mc; + + u32 cs_used = a->int_pool.hwm; + if (cs_used > a->int_pool.n_cs) cs_used = a->int_pool.n_cs; + u32 cs_size = cs_used * 8u; + + /* Stack alignment: SysV requires rsp ≡ 0 mod 16 just before a call, + * which means rsp ≡ 8 mod 16 inside the function (after the return + * address is pushed). On entry, rsp ≡ 8 mod 16; after `push rbp` it + * is 0 mod 16; after `sub rsp, frame_size` we need it back to 0 + * mod 16, so frame_size must be a multiple of 16. */ + u32 raw = a->max_outgoing + cs_size + a->cum_off; + u32 frame_size = align_up_u32(raw, 16u); + if (frame_size == 0) frame_size = 16; + + mc->label_place(mc, a->epilogue_label); + + /* Restore callee-saves. Each at rbp - (cum_off + (i+1)*8). */ + for (i32 i = (i32)cs_used - 1; i >= 0; --i) { + u32 reg = a->int_pool.order[i]; + i32 off = -(i32)a->cum_off - (i32)(i + 1) * 8; + emit_mov_load(mc, /*size=*/8, /*signed=*/0, reg, X64_RBP, off); + } + + /* leave; ret. */ + emit_leave(mc); + emit_ret(mc); + + /* Patch prologue placeholder. */ + u8 buf[X64_PROLOGUE_BYTES]; + for (u32 i = 0; i < X64_PROLOGUE_BYTES; ++i) buf[i] = 0x90; + u32 wi = 0; + + /* push rbp (1 byte). */ + buf[wi++] = 0x55; + /* mov rbp, rsp: REX.W 89 E5. */ + buf[wi++] = X64_REX_BASE | X64_REX_W; + buf[wi++] = 0x89; + buf[wi++] = modrm(3u, X64_RSP, X64_RBP); + /* sub rsp, frame_size: REX.W 81 /5 imm32 = 7 bytes. */ + buf[wi++] = X64_REX_BASE | X64_REX_W; + buf[wi++] = 0x81; + buf[wi++] = modrm(3u, 5u, X64_RSP); + buf[wi++] = (u8)frame_size; + buf[wi++] = (u8)(frame_size >> 8); + buf[wi++] = (u8)(frame_size >> 16); + buf[wi++] = (u8)(frame_size >> 24); + + /* sret: mov [rbp + disp32], rdi. */ + if (a->has_sret && a->sret_ptr_slot != FRAME_SLOT_NONE) { + XSlot* s = slot_get(a, a->sret_ptr_slot); + if (s) { + i32 off = -(i32)s->off; + if (wi + 7 > X64_PROLOGUE_BYTES) goto overflow; + buf[wi++] = X64_REX_BASE | X64_REX_W; + buf[wi++] = 0x89; + buf[wi++] = modrm(2u, X64_RDI, X64_RBP); + buf[wi++] = (u8)off; + buf[wi++] = (u8)(off >> 8); + buf[wi++] = (u8)(off >> 16); + buf[wi++] = (u8)(off >> 24); + } + } + + /* Spill callee-saves. */ + for (u32 i = 0; i < cs_used; ++i) { + u32 reg = a->int_pool.order[i]; + i32 off = -(i32)a->cum_off - (i32)(i + 1) * 8; + if (wi + 7 > X64_PROLOGUE_BYTES) goto overflow; + buf[wi++] = (u8)(X64_REX_BASE | X64_REX_W | ((reg & 8) ? X64_REX_R : 0)); + buf[wi++] = 0x89; + buf[wi++] = modrm(2u, (reg & 7u), X64_RBP); + buf[wi++] = (u8)off; + buf[wi++] = (u8)(off >> 8); + buf[wi++] = (u8)(off >> 16); + buf[wi++] = (u8)(off >> 24); + } + + if (0) { + overflow: + compiler_panic(t->c, a->loc, + "x64: prologue placeholder overflow (%u of %u bytes)", wi, + X64_PROLOGUE_BYTES); + } + obj_patch(t->obj, a->fd->text_section_id, a->prologue_pos, buf, + X64_PROLOGUE_BYTES); + + /* Define the function symbol. */ + u32 end = mc->pos(mc); + obj_symbol_define(t->obj, a->fd->sym, a->fd->text_section_id, + (u64)a->func_start, (u64)(end - a->func_start)); + + mc->cfi_endproc(mc); + a->fd = NULL; +} + +/* ============================================================ + * Registers / frame */ + +static Reg x_alloc_reg(CGTarget* t, RegClass cls, const Type* ty) { + XImpl* a = impl_of(t); (void)ty; - xx_panic(t, "alloc_reg"); + if (cls == RC_INT) return xpool_alloc(&a->int_pool); + if (cls == RC_FP) return xpool_alloc(&a->fp_pool); + compiler_panic(t->c, a->loc, "x64 alloc_reg: class %d unimpl", (int)cls); } -static void xx_free_reg(CGTarget* t, Reg r, RegClass cls) { - (void)r; - (void)cls; - xx_panic(t, "free_reg"); + +static void x_free_reg(CGTarget* t, Reg r, RegClass cls) { + XImpl* a = impl_of(t); + XRegPool* p = (cls == RC_FP) ? &a->fp_pool : &a->int_pool; + int rc = xpool_free(p, r); + if (rc == 1) return; + if (rc == -1) { + compiler_panic(t->c, a->loc, "x64 free_reg: reg %u already free", + (unsigned)r); + } + compiler_panic(t->c, a->loc, "x64 free_reg: reg %u not in %s pool", + (unsigned)r, cls == RC_FP ? "fp" : "int"); } -static FrameSlot xx_frame_slot(CGTarget* t, const FrameSlotDesc* d) { - (void)d; - xx_panic(t, "frame_slot"); + +static FrameSlot x_frame_slot(CGTarget* t, const FrameSlotDesc* d) { + XImpl* a = impl_of(t); + if (a->nslots == a->slots_cap) { + u32 ncap = a->slots_cap ? a->slots_cap * 2 : 8; + XSlot* nbuf = arena_array(t->c->tu, XSlot, ncap); + if (a->slots) memcpy(nbuf, a->slots, sizeof(XSlot) * a->nslots); + a->slots = nbuf; + a->slots_cap = ncap; + } + u32 size = d->size ? d->size : 8; + u32 align = d->align ? d->align : 1; + u32 next = a->cum_off + size; + u32 mask = align - 1u; + next = (next + mask) & ~mask; + XSlot* s = &a->slots[a->nslots]; + s->off = next; + s->size = size; + s->align = align; + s->kind = d->kind; + a->cum_off = next; + a->nslots++; + return (FrameSlot)(a->nslots); } -static void xx_param(CGTarget* t, const CGParamDesc* d) { - (void)d; - xx_panic(t, "param"); + +static XSlot* slot_get(XImpl* a, FrameSlot fs) { + if (fs == FRAME_SLOT_NONE || fs > a->nslots) return NULL; + return &a->slots[fs - 1]; } -static const Reg* xx_clobbers(CGTarget* t, RegClass cls, u32* nregs) { - (void)cls; - (void)nregs; - xx_panic(t, "clobbers"); + +/* ---- param: store incoming arg(s) into the home slot ---- */ +static void x_param(CGTarget* t, const CGParamDesc* p) { + XImpl* a = impl_of(t); + XSlot* s = slot_get(a, p->slot); + if (!s) compiler_panic(t->c, a->loc, "x64 param: bad slot"); + const ABIArgInfo* ai = p->abi; + + if (ai->kind == ABI_ARG_IGNORE) return; + if (ai->kind == ABI_ARG_INDIRECT) { + /* Incoming pointer to byval copy: load pointer, memcpy into slot. */ + u32 ptr_reg; + if (a->next_param_int < 6) { + ptr_reg = g_int_arg_regs[a->next_param_int++]; + } else { + u32 caller_off = a->next_param_stack; + a->next_param_stack += 8; + emit_mov_load(t->mc, 8, 0, X64_R11, X64_RBP, (i32)(16 + caller_off)); + ptr_reg = X64_R11; + } + u32 nbytes = s->size; + u32 i = 0; + while (i + 8 <= nbytes) { + emit_mov_load(t->mc, 8, 0, X64_RAX, ptr_reg, (i32)i); + emit_mov_store(t->mc, 8, X64_RAX, X64_RBP, -(i32)s->off + (i32)i); + i += 8; + } + while (i + 4 <= nbytes) { + emit_mov_load(t->mc, 4, 0, X64_RAX, ptr_reg, (i32)i); + emit_mov_store(t->mc, 4, X64_RAX, X64_RBP, -(i32)s->off + (i32)i); + i += 4; + } + while (i + 2 <= nbytes) { + emit_mov_load(t->mc, 2, 0, X64_RAX, ptr_reg, (i32)i); + emit_mov_store(t->mc, 2, X64_RAX, X64_RBP, -(i32)s->off + (i32)i); + i += 2; + } + while (i < nbytes) { + emit_mov_load(t->mc, 1, 0, X64_RAX, ptr_reg, (i32)i); + emit_mov_store(t->mc, 1, X64_RAX, X64_RBP, -(i32)s->off + (i32)i); + i += 1; + } + return; + } + /* DIRECT */ + for (u16 i = 0; i < ai->nparts; ++i) { + const ABIArgPart* pt = &ai->parts[i]; + u32 part_off = pt->src_offset; + u32 sz = pt->size; + if (pt->cls == ABI_CLASS_INT) { + if (a->next_param_int < 6) { + u32 reg = g_int_arg_regs[a->next_param_int++]; + emit_mov_store(t->mc, sz, reg, X64_RBP, + -(i32)s->off + (i32)part_off); + } else { + u32 caller_off = a->next_param_stack; + a->next_param_stack += 8; + emit_mov_load(t->mc, sz, 0, X64_RAX, X64_RBP, + (i32)(16 + caller_off)); + emit_mov_store(t->mc, sz, X64_RAX, X64_RBP, + -(i32)s->off + (i32)part_off); + } + } else if (pt->cls == ABI_CLASS_FP) { + if (a->next_param_fp < 8) { + u32 xmm = a->next_param_fp++; + u8 prefix = (sz == 8) ? 0xF2 : 0xF3; + emit_sse_store(t->mc, prefix, 0x11, xmm, X64_RBP, + -(i32)s->off + (i32)part_off); + } else { + u32 caller_off = a->next_param_stack; + a->next_param_stack += 8; + u8 prefix = (sz == 8) ? 0xF2 : 0xF3; + emit_sse_load(t->mc, prefix, 0x10, X64_XMM0, X64_RBP, + (i32)(16 + caller_off)); + emit_sse_store(t->mc, prefix, 0x11, X64_XMM0, X64_RBP, + -(i32)s->off + (i32)part_off); + } + } else { + compiler_panic(t->c, a->loc, "x64 param: ABI class %d unimpl", + (int)pt->cls); + } + } } -static void xx_spill_reg(CGTarget* t, Operand a, FrameSlot s, MemAccess m) { - (void)a; - (void)s; - (void)m; - xx_panic(t, "spill_reg"); + +static const Reg* x_clobbers(CGTarget* t, RegClass c, u32* n) { + (void)c; + (void)n; + x_panic(t, "clobbers"); } -static void xx_reload_reg(CGTarget* t, Operand a, FrameSlot s, MemAccess m) { - (void)a; - (void)s; - (void)m; - xx_panic(t, "reload_reg"); +static void x_spill_reg(CGTarget* t, Operand src, FrameSlot slot, + MemAccess ma) { + XImpl* a = impl_of(t); + if (src.kind != OPK_REG) + compiler_panic(t->c, a->loc, "x64 spill_reg: src is not OPK_REG"); + Operand addr; + memset(&addr, 0, sizeof addr); + addr.kind = OPK_LOCAL; + addr.cls = RC_INT; + addr.type = ma.type; + addr.v.frame_slot = slot; + x_store(t, addr, src, ma); + x_free_reg(t, src.v.reg, src.cls); } -static Label xx_label_new(CGTarget* t) { xx_panic(t, "label_new"); } -static void xx_label_place(CGTarget* t, Label l) { - (void)l; - xx_panic(t, "label_place"); +static void x_reload_reg(CGTarget* t, Operand dst, FrameSlot slot, + MemAccess ma) { + XImpl* a = impl_of(t); + if (dst.kind != OPK_REG) + compiler_panic(t->c, a->loc, "x64 reload_reg: dst is not OPK_REG"); + Operand addr; + memset(&addr, 0, sizeof addr); + addr.kind = OPK_LOCAL; + addr.cls = RC_INT; + addr.type = ma.type; + addr.v.frame_slot = slot; + x_load(t, dst, addr, ma); } -static void xx_jump(CGTarget* t, Label l) { - (void)l; - xx_panic(t, "jump"); + +/* ============================================================ + * Labels / control flow */ + +static Label x_label_new(CGTarget* t) { + return (Label)t->mc->label_new(t->mc); } -static void xx_cmp_branch(CGTarget* t, CmpOp op, Operand a, Operand b, - Label l) { - (void)op; - (void)a; - (void)b; - (void)l; - xx_panic(t, "cmp_branch"); +static void x_label_place(CGTarget* t, Label l) { + t->mc->label_place(t->mc, (MCLabel)l); } -static CGScope xx_scope_begin(CGTarget* t, const CGScopeDesc* d) { - (void)d; - xx_panic(t, "scope_begin"); +/* Emit `jmp rel32` (E9 + 4-byte disp) with a label fixup. R_PC32 applied + * at the disp32 site with addend=-4 yields target - end_of_insn. */ +static void emit_jmp_label(MCEmitter* mc, MCLabel l) { + u8 op = 0xE9; + mc->emit_bytes(mc, &op, 1); + emit_u32le(mc, 0); + mc->emit_label_ref(mc, l, R_PC32, 4, -4); } -static void xx_scope_else(CGTarget* t, CGScope s) { - (void)s; - xx_panic(t, "scope_else"); + +/* Emit `Jcc rel32` (0F 8x + 4-byte disp) with a label fixup. */ +static void emit_jcc_label(MCEmitter* mc, u32 cc, MCLabel l) { + u8 op[2] = {0x0F, (u8)(0x80 | (cc & 0xF))}; + mc->emit_bytes(mc, op, 2); + emit_u32le(mc, 0); + mc->emit_label_ref(mc, l, R_PC32, 4, -4); } -static void xx_scope_end(CGTarget* t, CGScope s) { - (void)s; - xx_panic(t, "scope_end"); + +static void x_jump(CGTarget* t, Label l) { emit_jmp_label(t->mc, (MCLabel)l); } + +static u32 cmp_to_cc(CmpOp op) { + switch (op) { + case CMP_EQ: return X64_CC_E; + case CMP_NE: return X64_CC_NE; + case CMP_LT_U: return X64_CC_B; + case CMP_LE_U: return X64_CC_BE; + case CMP_GT_U: return X64_CC_A; + case CMP_GE_U: return X64_CC_AE; + case CMP_LT_S: return X64_CC_L; + case CMP_LE_S: return X64_CC_LE; + case CMP_GT_S: return X64_CC_G; + case CMP_GE_S: return X64_CC_GE; + default: return X64_CC_E; + } } -static void xx_break_to(CGTarget* t, CGScope s) { - (void)s; - xx_panic(t, "break_to"); + +static u32 force_reg_int(CGTarget* t, Operand op, int w, u32 scratch) { + if (op.kind == OPK_REG) return op.v.reg & 0xFu; + if (op.kind == OPK_IMM) { + emit_load_imm(t->mc, w, scratch, op.v.imm); + return scratch; + } + compiler_panic(t->c, impl_of(t)->loc, "x64: operand kind %d not REG/IMM", + (int)op.kind); } -static void xx_continue_to(CGTarget* t, CGScope s) { - (void)s; - xx_panic(t, "continue_to"); + +static void emit_cmp_ab(CGTarget* t, Operand a_op, Operand b_op) { + int w = type_is_64(a_op.type) ? 1 : 0; + if (a_op.kind == OPK_REG && b_op.kind == OPK_IMM && b_op.v.imm >= -128 && + b_op.v.imm <= 127) { + emit_cmp_imm8(t->mc, w, a_op.v.reg & 0xFu, (i8)b_op.v.imm); + return; + } + u32 ra = force_reg_int(t, a_op, w, X64_RAX); + u32 rb = force_reg_int(t, b_op, w, (ra == X64_R11) ? X64_RAX : X64_R11); + /* cmp r/m, r — opcode 0x39 (encoded as `cmp ra, rb` ⇒ flags = ra - rb). */ + emit_alu_rr(t->mc, w, 0x39, ra, rb); } -static void xx_load_imm(CGTarget* t, Operand d, i64 i) { - (void)d; - (void)i; - xx_panic(t, "load_imm"); +static void x_cmp_branch(CGTarget* t, CmpOp op, Operand a, Operand b, + Label l) { + emit_cmp_ab(t, a, b); + emit_jcc_label(t->mc, cmp_to_cc(op), (MCLabel)l); } -static void xx_load_const(CGTarget* t, Operand d, ConstBytes b) { - (void)d; - (void)b; - xx_panic(t, "load_const"); + +static void x_cmp(CGTarget* t, CmpOp op, Operand dst, Operand a, Operand b) { + emit_cmp_ab(t, a, b); + u32 d = dst.v.reg & 0xFu; + emit_setcc(t->mc, cmp_to_cc(op), d); + emit_movzx_r32_r8(t->mc, d, d); } -static void xx_copy(CGTarget* t, Operand d, Operand s) { - (void)d; - (void)s; - xx_panic(t, "copy"); + +/* ---- structured scopes ---- */ +static CGScope x_scope_begin(CGTarget* t, const CGScopeDesc* d) { + XImpl* a = impl_of(t); + if (a->nscopes == a->scopes_cap) { + u32 ncap = a->scopes_cap ? a->scopes_cap * 2u : 4u; + XScope* nb = arena_array(t->c->tu, XScope, ncap); + if (a->scopes) memcpy(nb, a->scopes, sizeof(XScope) * a->nscopes); + a->scopes = nb; + a->scopes_cap = ncap; + } + XScope* sc = &a->scopes[a->nscopes]; + sc->kind = (u8)d->kind; + sc->has_else = 0; + sc->else_label = 0; + sc->end_label = 0; + sc->break_label = d->break_label; + sc->continue_label = d->continue_label; + + if (d->kind == SCOPE_IF) { + sc->else_label = t->mc->label_new(t->mc); + sc->end_label = t->mc->label_new(t->mc); + int w = type_is_64(d->cond.type) ? 1 : 0; + u32 rc = force_reg_int(t, d->cond, w, X64_RAX); + emit_test_self(t->mc, w, rc); + emit_jcc_label(t->mc, X64_CC_E, sc->else_label); + } else if (d->kind == SCOPE_LOOP || d->kind == SCOPE_BLOCK) { + /* Bookkeeping only. */ + } else { + compiler_panic(t->c, a->loc, + "x64 scope_begin: kind %d not yet implemented", + (int)d->kind); + } + a->nscopes++; + return (CGScope)a->nscopes; } -static void xx_load(CGTarget* t, Operand d, Operand a, MemAccess m) { - (void)d; - (void)a; - (void)m; - xx_panic(t, "load"); + +static void x_scope_else(CGTarget* t, CGScope s) { + XImpl* a = impl_of(t); + if (s == CG_SCOPE_NONE || s > a->nscopes) + compiler_panic(t->c, a->loc, "x64 scope_else: bad scope"); + XScope* sc = &a->scopes[s - 1]; + emit_jmp_label(t->mc, sc->end_label); + t->mc->label_place(t->mc, sc->else_label); + sc->has_else = 1; } -static void xx_store(CGTarget* t, Operand a, Operand s, MemAccess m) { - (void)a; - (void)s; - (void)m; - xx_panic(t, "store"); + +static void x_scope_end(CGTarget* t, CGScope s) { + XImpl* a = impl_of(t); + if (s == CG_SCOPE_NONE || s > a->nscopes) + compiler_panic(t->c, a->loc, "x64 scope_end: bad scope"); + XScope* sc = &a->scopes[s - 1]; + if (sc->kind == SCOPE_IF) { + if (!sc->has_else) t->mc->label_place(t->mc, sc->else_label); + t->mc->label_place(t->mc, sc->end_label); + } } -static void xx_addr_of(CGTarget* t, Operand d, Operand l) { - (void)d; - (void)l; - xx_panic(t, "addr_of"); + +static void x_break_to(CGTarget* t, CGScope s) { + XImpl* a = impl_of(t); + if (s == CG_SCOPE_NONE || s > a->nscopes) + compiler_panic(t->c, a->loc, "x64 break_to: bad scope"); + x_jump(t, a->scopes[s - 1].break_label); +} +static void x_continue_to(CGTarget* t, CGScope s) { + XImpl* a = impl_of(t); + if (s == CG_SCOPE_NONE || s > a->nscopes) + compiler_panic(t->c, a->loc, "x64 continue_to: bad scope"); + x_jump(t, a->scopes[s - 1].continue_label); +} + +/* ============================================================ + * Data movement */ + +static void x_load_imm(CGTarget* t, Operand dst, i64 imm) { + int w = type_is_64(dst.type) ? 1 : 0; + emit_load_imm(t->mc, w, dst.v.reg & 0xFu, imm); +} + +/* Materialize an FP literal: stash bytes in .rodata as a fresh local + * symbol, then load via RIP-relative movss/movsd. */ +static void x_load_const(CGTarget* t, Operand dst, ConstBytes cb) { + XImpl* a = impl_of(t); + if (dst.cls != RC_FP) + compiler_panic(t->c, a->loc, "x64 load_const: only FP supported in v1"); + + Sym ro_name = pool_intern_cstr(t->c->global, ".rodata"); + ObjSecId ro = obj_section(t->obj, ro_name, SEC_RODATA, SF_ALLOC, + cb.align ? cb.align : 4); + + u32 cur_section = t->mc->section_id; + t->mc->set_section(t->mc, ro); + t->mc->emit_align(t->mc, cb.align ? cb.align : 4, 0); + u32 ro_off = t->mc->pos(t->mc); + t->mc->emit_bytes(t->mc, cb.bytes, cb.size); + + char namebuf[64]; + static u32 lit_seq = 0; + int len = 0; + const char* prefix = ".LCFP_x64_"; + for (; prefix[len]; ++len) namebuf[len] = prefix[len]; + u32 v = lit_seq++; + char tmp[16]; + int tn = 0; + if (v == 0) + tmp[tn++] = '0'; + else + while (v) { + tmp[tn++] = '0' + (char)(v % 10); + v /= 10; + } + for (int i = tn - 1; i >= 0; --i) namebuf[len++] = tmp[i]; + namebuf[len] = 0; + + Sym sname = pool_intern_cstr(t->c->global, namebuf); + ObjSymId sym = obj_symbol(t->obj, sname, SB_LOCAL, SK_OBJ, ro, (u64)ro_off, + (u64)cb.size); + t->mc->set_section(t->mc, cur_section); + + /* movs{s,d} xmm, [rip+disp32]. Reloc R_PC32 with addend=-4 at the + * disp32 site so the linker resolves to target relative to end-of-insn. */ + u8 prefix2 = (cb.size == 8) ? 0xF2 : 0xF3; + u32 dst_x = dst.v.reg & 0xFu; + t->mc->emit_bytes(t->mc, &prefix2, 1); + emit_rex(t->mc, 0, dst_x, 0, 0); + u8 op[2] = {0x0F, 0x10}; + t->mc->emit_bytes(t->mc, op, 2); + u8 mr = modrm(0u, (dst_x & 7u), 5u); /* [RIP + disp32] */ + t->mc->emit_bytes(t->mc, &mr, 1); + u32 disp_pos = t->mc->pos(t->mc); + emit_u32le(t->mc, 0); + t->mc->emit_reloc_at(t->mc, cur_section, disp_pos, R_PC32, sym, -4, 1, 0); +} + +static void x_copy(CGTarget* t, Operand dst, Operand src) { + if (dst.cls == RC_FP || src.cls == RC_FP) { + u8 prefix2 = type_is_fp_double(dst.type) ? 0xF2 : 0xF3; + emit_sse_rr(t->mc, prefix2, 0x10, dst.v.reg & 0xFu, src.v.reg & 0xFu); + return; + } + int w = type_is_64(dst.type) ? 1 : 0; + emit_mov_rr(t->mc, w, dst.v.reg & 0xFu, src.v.reg & 0xFu); +} + +static u32 addr_base(CGTarget* t, Operand addr, i32* out_off) { + XImpl* a = impl_of(t); + if (addr.kind == OPK_LOCAL) { + XSlot* s = slot_get(a, addr.v.frame_slot); + if (!s) compiler_panic(t->c, a->loc, "x64 addr_base: bad slot"); + *out_off = -(i32)s->off; + return X64_RBP; + } + if (addr.kind == OPK_INDIRECT) { + *out_off = addr.v.ind.ofs; + return addr.v.ind.base & 0xFu; + } + compiler_panic(t->c, a->loc, "x64 addr_base: kind %d unsupported", + (int)addr.kind); +} + +static void x_load(CGTarget* t, Operand dst, Operand addr, MemAccess ma) { + XImpl* a = impl_of(t); + u32 sz = ma.size ? ma.size : type_byte_size(addr.type); + + if (addr.kind == OPK_GLOBAL) { + compiler_panic(t->c, a->loc, "x64 load: OPK_GLOBAL not yet implemented"); + } + + i32 off; + u32 base = addr_base(t, addr, &off); + if (dst.cls == RC_FP) { + u8 prefix2 = (sz == 8) ? 0xF2 : 0xF3; + emit_sse_load(t->mc, prefix2, 0x10, dst.v.reg & 0xFu, base, off); + } else { + int signed_ = type_is_signed(ma.type ? ma.type : addr.type); + emit_mov_load(t->mc, sz, signed_, dst.v.reg & 0xFu, base, off); + } +} + +static void x_store(CGTarget* t, Operand addr, Operand src, MemAccess ma) { + XImpl* a = impl_of(t); + u32 sz = ma.size ? ma.size : type_byte_size(addr.type); + + if (addr.kind == OPK_GLOBAL) { + compiler_panic(t->c, a->loc, "x64 store: OPK_GLOBAL not yet implemented"); + } + + i32 off; + u32 base = addr_base(t, addr, &off); + + if (src.kind == OPK_IMM) { + int w = (sz == 8) ? 1 : 0; + emit_load_imm(t->mc, w, X64_RAX, src.v.imm); + emit_mov_store(t->mc, sz, X64_RAX, base, off); + return; + } + if (src.cls == RC_FP) { + u8 prefix2 = (sz == 8) ? 0xF2 : 0xF3; + emit_sse_store(t->mc, prefix2, 0x11, src.v.reg & 0xFu, base, off); + return; + } + emit_mov_store(t->mc, sz, src.v.reg & 0xFu, base, off); +} + +static void x_addr_of(CGTarget* t, Operand dst, Operand lv) { + XImpl* a = impl_of(t); + if (lv.kind == OPK_LOCAL) { + XSlot* s = slot_get(a, lv.v.frame_slot); + if (!s) compiler_panic(t->c, a->loc, "x64 addr_of: bad slot"); + emit_lea(t->mc, dst.v.reg & 0xFu, X64_RBP, -(i32)s->off); + return; + } + if (lv.kind == OPK_INDIRECT) { + emit_lea(t->mc, dst.v.reg & 0xFu, lv.v.ind.base & 0xFu, lv.v.ind.ofs); + return; + } + x_panic(t, "addr_of: kind unsupported"); } -static void xx_tls_addr_of(CGTarget* t, Operand d, ObjSymId s, i64 a) { + +static void x_tls_addr_of(CGTarget* t, Operand d, ObjSymId s, i64 a) { (void)d; (void)s; (void)a; - xx_panic(t, "tls_addr_of"); + x_panic(t, "tls_addr_of"); } -static void xx_copy_bytes(CGTarget* t, Operand da, Operand sa, - AggregateAccess g) { - (void)da; - (void)sa; - (void)g; - xx_panic(t, "copy_bytes"); + +/* Aggregate ops — small unrolled memcpy/memset. */ +static u32 agg_addr_reg(CGTarget* t, Operand op, u32 scratch) { + if (op.kind == OPK_REG) return op.v.reg & 0xFu; + if (op.kind == OPK_LOCAL) { + XImpl* a = impl_of(t); + XSlot* s = slot_get(a, op.v.frame_slot); + if (!s) compiler_panic(t->c, a->loc, "x64 agg: bad slot"); + emit_lea(t->mc, scratch, X64_RBP, -(i32)s->off); + return scratch; + } + compiler_panic(t->c, impl_of(t)->loc, + "x64 agg: address kind %d unsupported", (int)op.kind); } -static void xx_set_bytes(CGTarget* t, Operand da, Operand bv, + +static void x_copy_bytes(CGTarget* t, Operand da, Operand sa, AggregateAccess g) { - (void)da; - (void)bv; - (void)g; - xx_panic(t, "set_bytes"); + u32 dr = agg_addr_reg(t, da, X64_R11); + u32 sr = agg_addr_reg(t, sa, (dr == X64_RAX) ? X64_RCX : X64_RAX); + u32 nbytes = g.size; + u32 i = 0; + while (i + 8 <= nbytes) { + emit_mov_load(t->mc, 8, 0, X64_RDX, sr, (i32)i); + emit_mov_store(t->mc, 8, X64_RDX, dr, (i32)i); + i += 8; + } + while (i + 4 <= nbytes) { + emit_mov_load(t->mc, 4, 0, X64_RDX, sr, (i32)i); + emit_mov_store(t->mc, 4, X64_RDX, dr, (i32)i); + i += 4; + } + while (i + 2 <= nbytes) { + emit_mov_load(t->mc, 2, 0, X64_RDX, sr, (i32)i); + emit_mov_store(t->mc, 2, X64_RDX, dr, (i32)i); + i += 2; + } + while (i < nbytes) { + emit_mov_load(t->mc, 1, 0, X64_RDX, sr, (i32)i); + emit_mov_store(t->mc, 1, X64_RDX, dr, (i32)i); + i += 1; + } } -static void xx_bitfield_load(CGTarget* t, Operand d, Operand ra, - BitFieldAccess b) { + +static void x_set_bytes(CGTarget* t, Operand da, Operand bv, + AggregateAccess g) { + u32 dr = agg_addr_reg(t, da, X64_R11); + if (bv.kind != OPK_IMM) + compiler_panic(t->c, impl_of(t)->loc, + "x64 set_bytes: non-IMM byte not yet supported"); + u8 b = (u8)(bv.v.imm & 0xff); + u64 b64 = b; + b64 |= b64 << 8; + b64 |= b64 << 16; + b64 |= b64 << 32; + emit_load_imm(t->mc, 1, X64_RAX, (i64)b64); + u32 nbytes = g.size; + u32 i = 0; + while (i + 8 <= nbytes) { + emit_mov_store(t->mc, 8, X64_RAX, dr, (i32)i); + i += 8; + } + while (i + 4 <= nbytes) { + emit_mov_store(t->mc, 4, X64_RAX, dr, (i32)i); + i += 4; + } + while (i + 2 <= nbytes) { + emit_mov_store(t->mc, 2, X64_RAX, dr, (i32)i); + i += 2; + } + while (i < nbytes) { + emit_mov_store(t->mc, 1, X64_RAX, dr, (i32)i); + i += 1; + } +} + +static void x_bitfield_load(CGTarget* t, Operand d, Operand ra, + BitFieldAccess b) { (void)d; (void)ra; (void)b; - xx_panic(t, "bitfield_load"); + x_panic(t, "bitfield_load"); } -static void xx_bitfield_store(CGTarget* t, Operand ra, Operand s, - BitFieldAccess b) { +static void x_bitfield_store(CGTarget* t, Operand ra, Operand s, + BitFieldAccess b) { (void)ra; (void)s; (void)b; - xx_panic(t, "bitfield_store"); + x_panic(t, "bitfield_store"); } -static void xx_binop(CGTarget* t, BinOp op, Operand d, Operand a, Operand b) { - (void)op; - (void)d; - (void)a; - (void)b; - xx_panic(t, "binop"); +/* ============================================================ + * Arithmetic */ + +static void x_binop(CGTarget* t, BinOp op, Operand dst, Operand a_op, + Operand b_op) { + MCEmitter* mc = t->mc; + + /* FP binops. */ + if (op == BO_FADD || op == BO_FSUB || op == BO_FMUL || op == BO_FDIV) { + u32 rd = dst.v.reg & 0xFu; + u32 ra = a_op.v.reg & 0xFu; + u32 rb = b_op.v.reg & 0xFu; + u8 prefix2 = type_is_fp_double(dst.type) ? 0xF2 : 0xF3; + if (rd != ra) emit_sse_rr(mc, prefix2, 0x10, rd, ra); + u8 opcode; + switch (op) { + case BO_FADD: opcode = 0x58; break; + case BO_FSUB: opcode = 0x5C; break; + case BO_FMUL: opcode = 0x59; break; + case BO_FDIV: opcode = 0x5E; break; + default: opcode = 0x58; break; + } + emit_sse_rr(mc, prefix2, opcode, rd, rb); + return; + } + + int w = type_is_64(dst.type) ? 1 : 0; + u32 rd = dst.v.reg & 0xFu; + + /* Division: idiv/div uses rax/rdx implicitly. Route divisor through r11 + * if it would otherwise be rax/rdx. */ + if (op == BO_SDIV || op == BO_UDIV || op == BO_SREM || op == BO_UREM) { + u32 ra = force_reg_int(t, a_op, w, X64_RAX); + if (ra != X64_RAX) emit_mov_rr(mc, w, X64_RAX, ra); + u32 rb; + if (b_op.kind == OPK_REG) { + rb = b_op.v.reg & 0xFu; + if (rb == X64_RAX || rb == X64_RDX) { + emit_mov_rr(mc, w, X64_R11, rb); + rb = X64_R11; + } + } else if (b_op.kind == OPK_IMM) { + emit_load_imm(mc, w, X64_R11, b_op.v.imm); + rb = X64_R11; + } else { + compiler_panic(t->c, impl_of(t)->loc, + "x64 div: divisor kind %d unsupported", (int)b_op.kind); + } + if (op == BO_SDIV || op == BO_SREM) { + emit_cqo_or_cdq(mc, w); + emit_f7_rm(mc, w, 7u, rb); /* idiv */ + } else { + emit_xor_self(mc, w, X64_RDX); + emit_f7_rm(mc, w, 6u, rb); /* div */ + } + u32 result_reg = (op == BO_SREM || op == BO_UREM) ? X64_RDX : X64_RAX; + if (rd != result_reg) emit_mov_rr(mc, w, rd, result_reg); + return; + } + + /* Shifts: shift count must be in cl. */ + if (op == BO_SHL || op == BO_SHR_U || op == BO_SHR_S) { + u32 ra = force_reg_int(t, a_op, w, X64_RAX); + if (rd != ra) emit_mov_rr(mc, w, rd, ra); + if (b_op.kind == OPK_REG) { + u32 rb = b_op.v.reg & 0xFu; + if (rb != X64_RCX) emit_mov_rr(mc, 0, X64_RCX, rb); + } else if (b_op.kind == OPK_IMM) { + emit_load_imm(mc, 0, X64_RCX, b_op.v.imm & 0x3f); + } else { + compiler_panic(t->c, impl_of(t)->loc, + "x64 shift: count kind %d unsupported", (int)b_op.kind); + } + u32 sub = (op == BO_SHL) ? 4u : (op == BO_SHR_U ? 5u : 7u); + emit_shift_cl(mc, w, sub, rd); + return; + } + + /* Generic 2-operand ALU: copy ra → dst, then dst op= rb. */ + u32 ra = force_reg_int(t, a_op, w, X64_RAX); + if (rd != ra) emit_mov_rr(mc, w, rd, ra); + u32 rb = force_reg_int(t, b_op, w, X64_R11); + switch (op) { + case BO_IADD: emit_alu_rr(mc, w, 0x01, rd, rb); break; + case BO_ISUB: emit_alu_rr(mc, w, 0x29, rd, rb); break; + case BO_AND: emit_alu_rr(mc, w, 0x21, rd, rb); break; + case BO_OR: emit_alu_rr(mc, w, 0x09, rd, rb); break; + case BO_XOR: emit_alu_rr(mc, w, 0x31, rd, rb); break; + case BO_IMUL: emit_imul_rr(mc, w, rd, rb); break; + default: + compiler_panic(t->c, impl_of(t)->loc, "x64 binop: op %d unimpl", + (int)op); + } } -static void xx_unop(CGTarget* t, UnOp op, Operand d, Operand a) { - (void)op; - (void)d; - (void)a; - xx_panic(t, "unop"); + +static void x_unop(CGTarget* t, UnOp op, Operand dst, Operand a_op) { + MCEmitter* mc = t->mc; + int w = type_is_64(dst.type) ? 1 : 0; + u32 rd = dst.v.reg & 0xFu; + u32 ra = a_op.v.reg & 0xFu; + if (a_op.kind != OPK_REG) + compiler_panic(t->c, impl_of(t)->loc, + "x64 unop: non-REG operand not supported"); + switch (op) { + case UO_NEG: + if (rd != ra) emit_mov_rr(mc, w, rd, ra); + emit_f7_rm(mc, w, 3u, rd); + return; + case UO_BNOT: + if (rd != ra) emit_mov_rr(mc, w, rd, ra); + emit_f7_rm(mc, w, 2u, rd); + return; + case UO_NOT: + /* !x → (x == 0) materialized as 0/1 in dst. */ + emit_test_self(mc, w, ra); + emit_setcc(mc, X64_CC_E, rd); + emit_movzx_r32_r8(mc, rd, rd); + return; + default: + compiler_panic(t->c, impl_of(t)->loc, "x64 unop: op %d unimpl", + (int)op); + } } -static void xx_cmp(CGTarget* t, CmpOp op, Operand d, Operand a, Operand b) { - (void)op; - (void)d; - (void)a; - (void)b; - xx_panic(t, "cmp"); + +static void x_convert(CGTarget* t, ConvKind k, Operand dst, Operand src) { + XImpl* a = impl_of(t); + MCEmitter* mc = t->mc; + u32 rd = dst.v.reg & 0xFu; + u32 rs = src.v.reg & 0xFu; + switch (k) { + case CV_SEXT: { + u32 src_bytes = type_byte_size(src.type); + int w = type_is_64(dst.type) ? 1 : 0; + emit_extend_rr(mc, w, /*signed=*/1, src_bytes, rd, rs); + return; + } + case CV_ZEXT: { + u32 src_bytes = type_byte_size(src.type); + int w = type_is_64(dst.type) ? 1 : 0; + emit_extend_rr(mc, w, /*signed=*/0, src_bytes, rd, rs); + return; + } + case CV_TRUNC: { + /* In-reg truncation: `mov r32, r32` clears high 32. Narrower stores + * select width themselves. */ + emit_mov_rr(mc, 0, rd, rs); + return; + } + case CV_ITOF_S: + case CV_ITOF_U: { + int w_src = type_is_64(src.type) ? 1 : 0; + u8 prefix2 = type_is_fp_double(dst.type) ? 0xF2 : 0xF3; + if (k == CV_ITOF_U && w_src == 1) { + compiler_panic(t->c, a->loc, + "x64 convert: u64→fp not yet implemented"); + } + if (k == CV_ITOF_U) { + /* u32→fp: zero-extend to 64-bit, then signed cvtsi2sd works. */ + emit_extend_rr(mc, 0, 0, 4, X64_R11, rs); + rs = X64_R11; + w_src = 1; + } + emit_sse_rr_w(mc, prefix2, 0x2A, w_src, rd, rs); + return; + } + case CV_FTOI_S: + case CV_FTOI_U: { + int w_dst = type_is_64(dst.type) ? 1 : 0; + u8 prefix2 = type_is_fp_double(src.type) ? 0xF2 : 0xF3; + if (k == CV_FTOI_U && w_dst == 1) { + compiler_panic(t->c, a->loc, + "x64 convert: fp→u64 not yet implemented"); + } + emit_sse_rr_w(mc, prefix2, 0x2C, w_dst, rd, rs); + return; + } + case CV_FEXT: + emit_sse_rr(mc, 0xF3, 0x5A, rd, rs); + return; + case CV_FTRUNC: + emit_sse_rr(mc, 0xF2, 0x5A, rd, rs); + return; + case CV_BITCAST: { + /* movd/movq between xmm and GPR. */ + if (src.cls == RC_INT && dst.cls == RC_FP) { + int w = type_is_64(dst.type) ? 1 : 0; + emit_sse_rr_w(mc, 0x66, 0x6E, w, rd, rs); + } else if (src.cls == RC_FP && dst.cls == RC_INT) { + int w = type_is_64(src.type) ? 1 : 0; + emit_sse_rr_w(mc, 0x66, 0x7E, w, rs, rd); + } else { + compiler_panic(t->c, a->loc, + "x64 convert BITCAST: same-class not supported"); + } + return; + } + default: + compiler_panic(t->c, a->loc, "x64 convert kind %d unimpl", (int)k); + } } -static void xx_convert(CGTarget* t, ConvKind k, Operand d, Operand s) { - (void)k; - (void)d; - (void)s; - xx_panic(t, "convert"); + +/* ============================================================ + * Calls / return */ + +static void emit_arg_value(CGTarget* t, const CGABIValue* av, u32* next_int, + u32* next_fp, u32* stack_off) { + XImpl* a = impl_of(t); + /* Synthesize one-part DIRECT for variadic args (av->abi NULL). */ + ABIArgInfo va_ai; + ABIArgPart va_pt; + const ABIArgInfo* ai = av->abi; + if (!ai) { + u32 sz = type_byte_size(av->type); + memset(&va_ai, 0, sizeof va_ai); + memset(&va_pt, 0, sizeof va_pt); + va_ai.kind = ABI_ARG_DIRECT; + va_ai.parts = &va_pt; + va_ai.nparts = 1; + va_pt.cls = (av->storage.cls == RC_FP) ? ABI_CLASS_FP : ABI_CLASS_INT; + va_pt.size = sz; + va_pt.align = sz; + va_pt.src_offset = 0; + ai = &va_ai; + } + if (ai->kind == ABI_ARG_IGNORE) return; + if (ai->kind == ABI_ARG_INDIRECT) { + /* Pass &av->storage_local in the next int arg reg. */ + u32 dst_reg = (*next_int < 6) ? g_int_arg_regs[(*next_int)++] : X64_RAX; + int to_stack = (*next_int > 6) || (dst_reg == X64_RAX && *next_int == 6); + /* Above is awkward — recompute clearly: */ + if (*next_int >= 6 + (a->has_sret ? 0 : 0)) { + /* (next_int was already bumped past 6) — stack route */ + } + to_stack = (dst_reg == X64_RAX); + if (av->storage.kind == OPK_LOCAL) { + XSlot* s = slot_get(a, av->storage.v.frame_slot); + if (!s) compiler_panic(t->c, a->loc, "x64 call: bad byval slot"); + emit_lea(t->mc, dst_reg, X64_RBP, -(i32)s->off); + } else { + compiler_panic(t->c, a->loc, + "x64 call: INDIRECT arg storage kind %d unsupported", + (int)av->storage.kind); + } + if (to_stack) { + emit_mov_store(t->mc, 8, dst_reg, X64_RSP, (i32)*stack_off); + *stack_off += 8; + } + return; + } + + for (u16 i = 0; i < ai->nparts; ++i) { + const ABIArgPart* pt = &ai->parts[i]; + u32 sz = pt->size; + if (pt->cls == ABI_CLASS_INT) { + int to_stack = (*next_int >= 6); + u32 dst_reg = to_stack ? X64_RAX : g_int_arg_regs[(*next_int)++]; + switch (av->storage.kind) { + case OPK_IMM: { + int w = (sz == 8) ? 1 : 0; + emit_load_imm(t->mc, w, dst_reg, av->storage.v.imm); + break; + } + case OPK_REG: { + int w = (sz == 8) ? 1 : 0; + u32 sr = av->storage.v.reg & 0xFu; + if (sr != dst_reg) emit_mov_rr(t->mc, w, dst_reg, sr); + break; + } + case OPK_LOCAL: { + XSlot* s = slot_get(a, av->storage.v.frame_slot); + if (!s) compiler_panic(t->c, a->loc, "x64 call: bad arg slot"); + emit_mov_load(t->mc, sz, 0, dst_reg, X64_RBP, + -(i32)s->off + (i32)pt->src_offset); + break; + } + default: + compiler_panic(t->c, a->loc, + "x64 call: arg storage kind %d unsupported", + (int)av->storage.kind); + } + if (to_stack) { + emit_mov_store(t->mc, 8, dst_reg, X64_RSP, (i32)*stack_off); + *stack_off += 8; + } + } else if (pt->cls == ABI_CLASS_FP) { + int to_stack = (*next_fp >= 8); + if (!to_stack) { + u32 dst_x = (*next_fp)++; + if (av->storage.kind == OPK_REG) { + u32 sx = av->storage.v.reg & 0xFu; + if (sx != dst_x) { + u8 prefix2 = (sz == 8) ? 0xF2 : 0xF3; + emit_sse_rr(t->mc, prefix2, 0x10, dst_x, sx); + } + } else { + compiler_panic(t->c, a->loc, + "x64 call: FP arg storage kind %d unsupported", + (int)av->storage.kind); + } + } else { + if (av->storage.kind == OPK_REG) { + u8 prefix2 = (sz == 8) ? 0xF2 : 0xF3; + emit_sse_store(t->mc, prefix2, 0x11, av->storage.v.reg & 0xFu, + X64_RSP, (i32)*stack_off); + } else { + compiler_panic(t->c, a->loc, + "x64 call: FP stack-arg storage kind %d unsupported", + (int)av->storage.kind); + } + *stack_off += 8; + } + } else { + compiler_panic(t->c, a->loc, "x64 call: ABI class %d unimpl", + (int)pt->cls); + } + } } -static void xx_call(CGTarget* t, const CGCallDesc* d) { - (void)d; - xx_panic(t, "call"); +static void x_call(CGTarget* t, const CGCallDesc* d) { + XImpl* a = impl_of(t); + MCEmitter* mc = t->mc; + + u32 next_int = 0, next_fp = 0, stack_off = 0; + + /* sret: caller puts destination pointer in rdi. */ + if (d->abi && d->abi->has_sret) { + if (d->ret.storage.kind != OPK_LOCAL) { + compiler_panic(t->c, a->loc, "x64 call: sret destination must be LOCAL"); + } + XSlot* s = slot_get(a, d->ret.storage.v.frame_slot); + if (!s) compiler_panic(t->c, a->loc, "x64 call: bad sret slot"); + emit_lea(mc, X64_RDI, X64_RBP, -(i32)s->off); + next_int = 1; + } + for (u32 i = 0; i < d->nargs; ++i) { + emit_arg_value(t, &d->args[i], &next_int, &next_fp, &stack_off); + } + u32 needed = (stack_off + 15u) & ~15u; + if (needed > a->max_outgoing) a->max_outgoing = needed; + + /* Variadic calls: AL = number of XMM regs used. */ + if (d->abi && d->abi->variadic) { + emit_load_imm(mc, 0, X64_RAX, (i64)next_fp); + } + + if (d->callee.kind == OPK_GLOBAL) { + /* call rel32: E8 + disp32 + R_X64_PLT32. */ + u8 op = 0xE8; + mc->emit_bytes(mc, &op, 1); + u32 disp_pos = mc->pos(mc); + emit_u32le(mc, 0); + mc->emit_reloc_at(mc, mc->section_id, disp_pos, R_X64_PLT32, + d->callee.v.global.sym, + d->callee.v.global.addend - 4, 1, 0); + } else if (d->callee.kind == OPK_REG) { + u32 r = d->callee.v.reg & 0xFu; + emit_rex(mc, 0, 0, 0, r); + u8 buf[2] = {0xFF, modrm(3u, 2u, r)}; + mc->emit_bytes(mc, buf, 2); + } else { + compiler_panic(t->c, a->loc, "x64 call: callee kind %d unsupported", + (int)d->callee.kind); + } + + /* Receive return value. */ + const ABIArgInfo* ri = &d->abi->ret; + if (ri->kind == ABI_ARG_IGNORE || ri->kind == ABI_ARG_INDIRECT) return; + if (ri->nparts == 0) return; + + Operand rs = d->ret.storage; + u32 next_int_ret = 0, next_fp_ret = 0; + static const u32 ret_int_regs[2] = {X64_RAX, X64_RDX}; + for (u16 i = 0; i < ri->nparts; ++i) { + const ABIArgPart* p = &ri->parts[i]; + u32 src_reg; + if (p->cls == ABI_CLASS_INT) src_reg = ret_int_regs[next_int_ret++]; + else if (p->cls == ABI_CLASS_FP) src_reg = (u32)(X64_XMM0 + next_fp_ret++); + else compiler_panic(t->c, a->loc, "x64 call: ret cls %d unimpl", + (int)p->cls); + + if (rs.kind == OPK_REG) { + if (ri->nparts != 1) { + compiler_panic(t->c, a->loc, + "x64 call: REG ret_storage with %u parts", + (unsigned)ri->nparts); + } + if (p->cls == ABI_CLASS_INT) { + int w = (p->size == 8) ? 1 : 0; + u32 dr = rs.v.reg & 0xFu; + if (dr != src_reg) emit_mov_rr(mc, w, dr, src_reg); + } else { + u8 prefix2 = (p->size == 8) ? 0xF2 : 0xF3; + u32 dr = rs.v.reg & 0xFu; + if (dr != src_reg) emit_sse_rr(mc, prefix2, 0x10, dr, src_reg); + } + } else if (rs.kind == OPK_LOCAL) { + XSlot* s = slot_get(a, rs.v.frame_slot); + if (!s) compiler_panic(t->c, a->loc, "x64 call: bad ret slot"); + i32 off = -(i32)s->off + (i32)p->src_offset; + if (p->cls == ABI_CLASS_INT) { + emit_mov_store(mc, p->size, src_reg, X64_RBP, off); + } else { + u8 prefix2 = (p->size == 8) ? 0xF2 : 0xF3; + emit_sse_store(mc, prefix2, 0x11, src_reg, X64_RBP, off); + } + } else if (rs.kind == OPK_IMM && rs.type && rs.type->kind == TY_VOID) { + /* void ret placeholder — nothing to do. */ + } else { + compiler_panic(t->c, a->loc, + "x64 call: ret_storage kind %d unsupported", + (int)rs.kind); + } + } } -static void xx_ret(CGTarget* t, const CGABIValue* v) { - (void)v; - xx_panic(t, "ret"); + +static void x_ret(CGTarget* t, const CGABIValue* val) { + XImpl* a = impl_of(t); + MCEmitter* mc = t->mc; + + if (val) { + const ABIArgInfo* ri = val->abi; + if (ri && ri->kind == ABI_ARG_INDIRECT) { + /* sret: reload destination pointer into rdi, memcpy source into [rdi]. */ + if (val->storage.kind != OPK_LOCAL) { + compiler_panic(t->c, a->loc, + "x64 ret indirect: storage kind %d unsupported", + (int)val->storage.kind); + } + XSlot* s = slot_get(a, val->storage.v.frame_slot); + if (!s) compiler_panic(t->c, a->loc, "x64 ret: bad sret slot"); + if (a->sret_ptr_slot != FRAME_SLOT_NONE) { + XSlot* sp = slot_get(a, a->sret_ptr_slot); + if (sp) emit_mov_load(mc, 8, 0, X64_RDI, X64_RBP, -(i32)sp->off); + } + u32 nbytes = s->size; + u32 i = 0; + while (i + 8 <= nbytes) { + emit_mov_load(mc, 8, 0, X64_RAX, X64_RBP, -(i32)s->off + (i32)i); + emit_mov_store(mc, 8, X64_RAX, X64_RDI, (i32)i); + i += 8; + } + while (i + 4 <= nbytes) { + emit_mov_load(mc, 4, 0, X64_RAX, X64_RBP, -(i32)s->off + (i32)i); + emit_mov_store(mc, 4, X64_RAX, X64_RDI, (i32)i); + i += 4; + } + while (i + 2 <= nbytes) { + emit_mov_load(mc, 2, 0, X64_RAX, X64_RBP, -(i32)s->off + (i32)i); + emit_mov_store(mc, 2, X64_RAX, X64_RDI, (i32)i); + i += 2; + } + while (i < nbytes) { + emit_mov_load(mc, 1, 0, X64_RAX, X64_RBP, -(i32)s->off + (i32)i); + emit_mov_store(mc, 1, X64_RAX, X64_RDI, (i32)i); + i += 1; + } + /* Convention: return sret pointer in rax. */ + emit_mov_rr(mc, 1, X64_RAX, X64_RDI); + } else if (val->storage.kind == OPK_REG) { + if (val->storage.cls == RC_FP) { + u8 prefix2 = type_is_fp_double(val->storage.type) ? 0xF2 : 0xF3; + u32 sr = val->storage.v.reg & 0xFu; + if (sr != X64_XMM0) emit_sse_rr(mc, prefix2, 0x10, X64_XMM0, sr); + } else { + int w = type_is_64(val->storage.type) ? 1 : 0; + u32 sr = val->storage.v.reg & 0xFu; + if (sr != X64_RAX) emit_mov_rr(mc, w, X64_RAX, sr); + } + } else if (val->storage.kind == OPK_IMM) { + int w = type_is_64(val->storage.type) ? 1 : 0; + emit_load_imm(mc, w, X64_RAX, val->storage.v.imm); + } else if (val->storage.kind == OPK_LOCAL) { + /* DIRECT struct return: load each part into rax/rdx or xmm0/xmm1. */ + XSlot* s = slot_get(a, val->storage.v.frame_slot); + if (!s) compiler_panic(t->c, a->loc, "x64 ret: bad local slot"); + const ABIArgInfo* ri2 = val->abi; + u32 next_int_ret = 0, next_fp_ret = 0; + static const u32 ret_int_regs[2] = {X64_RAX, X64_RDX}; + for (u16 i = 0; i < (ri2 ? ri2->nparts : 0); ++i) { + const ABIArgPart* pt = &ri2->parts[i]; + i32 off = -(i32)s->off + (i32)pt->src_offset; + if (pt->cls == ABI_CLASS_INT) { + emit_mov_load(mc, pt->size, 0, ret_int_regs[next_int_ret++], + X64_RBP, off); + } else if (pt->cls == ABI_CLASS_FP) { + u8 prefix2 = (pt->size == 8) ? 0xF2 : 0xF3; + emit_sse_load(mc, prefix2, 0x10, (u32)(X64_XMM0 + next_fp_ret++), + X64_RBP, off); + } else { + compiler_panic(t->c, a->loc, "x64 ret: ret part cls %d unimpl", + (int)pt->cls); + } + } + } + } + emit_jmp_label(mc, a->epilogue_label); } -static void xx_alloca_(CGTarget* t, Operand d, Operand s, u32 a) { +/* ============================================================ + * Stubs for unimplemented methods. */ +static void x_alloca_(CGTarget* t, Operand d, Operand s, u32 a) { (void)d; (void)s; (void)a; - xx_panic(t, "alloca"); + x_panic(t, "alloca"); } -static void xx_va_start_(CGTarget* t, Operand a) { +static void x_va_start_(CGTarget* t, Operand a) { (void)a; - xx_panic(t, "va_start"); + x_panic(t, "va_start"); } -static void xx_va_arg_(CGTarget* t, Operand d, Operand a, const Type* ty) { +static void x_va_arg_(CGTarget* t, Operand d, Operand a, const Type* ty) { (void)d; (void)a; (void)ty; - xx_panic(t, "va_arg"); + x_panic(t, "va_arg"); } -static void xx_va_end_(CGTarget* t, Operand a) { +static void x_va_end_(CGTarget* t, Operand a) { (void)a; - xx_panic(t, "va_end"); + (void)t; } -static void xx_va_copy_(CGTarget* t, Operand d, Operand s) { +static void x_va_copy_(CGTarget* t, Operand d, Operand s) { (void)d; (void)s; - xx_panic(t, "va_copy"); + x_panic(t, "va_copy"); } -static void xx_atomic_load(CGTarget* t, Operand d, Operand a, MemAccess m, - MemOrder o) { +static void x_atomic_load(CGTarget* t, Operand d, Operand ad, MemAccess m, + MemOrder o) { (void)d; - (void)a; + (void)ad; (void)m; (void)o; - xx_panic(t, "atomic_load"); + x_panic(t, "atomic_load"); } -static void xx_atomic_store(CGTarget* t, Operand a, Operand s, MemAccess m, - MemOrder o) { - (void)a; +static void x_atomic_store(CGTarget* t, Operand ad, Operand s, MemAccess m, + MemOrder o) { + (void)ad; (void)s; (void)m; (void)o; - xx_panic(t, "atomic_store"); + x_panic(t, "atomic_store"); } -static void xx_atomic_rmw(CGTarget* t, AtomicOp op, Operand d, Operand a, - Operand v, MemAccess m, MemOrder o) { +static void x_atomic_rmw(CGTarget* t, AtomicOp op, Operand d, Operand ad, + Operand v, MemAccess m, MemOrder o) { (void)op; (void)d; - (void)a; + (void)ad; (void)v; (void)m; (void)o; - xx_panic(t, "atomic_rmw"); + x_panic(t, "atomic_rmw"); } -static void xx_atomic_cas(CGTarget* t, Operand p, Operand ok, Operand a, - Operand e, Operand des, MemAccess m, MemOrder so, - MemOrder fo) { +static void x_atomic_cas(CGTarget* t, Operand p, Operand ok, Operand ad, + Operand e, Operand des, MemAccess m, MemOrder so, + MemOrder fo) { (void)p; (void)ok; - (void)a; + (void)ad; (void)e; (void)des; (void)m; (void)so; (void)fo; - xx_panic(t, "atomic_cas"); + x_panic(t, "atomic_cas"); } -static void xx_fence(CGTarget* t, MemOrder o) { +static void x_fence(CGTarget* t, MemOrder o) { (void)o; - xx_panic(t, "fence"); + x_panic(t, "fence"); } -static void xx_intrinsic(CGTarget* t, IntrinKind k, Operand* d, u32 nd, - const Operand* a, u32 na) { +static void x_intrinsic(CGTarget* t, IntrinKind k, Operand* d, u32 nd, + const Operand* a, u32 na) { (void)k; (void)d; (void)nd; (void)a; (void)na; - xx_panic(t, "intrinsic"); + x_panic(t, "intrinsic"); } -static void xx_asm_block(CGTarget* t, const char* tmpl, - const AsmConstraint* outs, u32 no, Operand* oo, - const AsmConstraint* ins, u32 ni, const Operand* io, - const Sym* clobs, u32 nc) { +static void x_asm_block(CGTarget* t, const char* tmpl, + const AsmConstraint* outs, u32 no, Operand* oo, + const AsmConstraint* ins, u32 ni, const Operand* io, + const Sym* clobs, u32 nc) { (void)tmpl; (void)outs; (void)no; @@ -297,16 +1866,16 @@ static void xx_asm_block(CGTarget* t, const char* tmpl, (void)io; (void)clobs; (void)nc; - xx_panic(t, "asm_block"); + x_panic(t, "asm_block"); } -static void xx_set_loc(CGTarget* t, SrcLoc l) { +static void x_set_loc(CGTarget* t, SrcLoc l) { ((XImpl*)t)->loc = l; if (t->mc) t->mc->set_loc(t->mc, l); } -static void xx_finalize(CGTarget* t) { (void)t; } -static void xx_destroy(CGTarget* t) { (void)t; } +static void x_finalize(CGTarget* t) { (void)t; } +static void x_destroy(CGTarget* t) { (void)t; } static void cgt_cleanup(void* arg) { cgtarget_free((CGTarget*)arg); } @@ -319,69 +1888,69 @@ CGTarget* x64_cgtarget_new(Compiler* c, ObjBuilder* o, MCEmitter* m) { t->obj = o; t->mc = m; - t->func_begin = xx_func_begin; - t->func_end = xx_func_end; - - t->alloc_reg = xx_alloc_reg; - t->free_reg = xx_free_reg; - t->frame_slot = xx_frame_slot; - t->param = xx_param; - t->clobbers = xx_clobbers; - t->spill_reg = xx_spill_reg; - t->reload_reg = xx_reload_reg; - - t->label_new = xx_label_new; - t->label_place = xx_label_place; - t->jump = xx_jump; - t->cmp_branch = xx_cmp_branch; - - t->scope_begin = xx_scope_begin; - t->scope_else = xx_scope_else; - t->scope_end = xx_scope_end; - t->break_to = xx_break_to; - t->continue_to = xx_continue_to; - - t->load_imm = xx_load_imm; - t->load_const = xx_load_const; - t->copy = xx_copy; - t->load = xx_load; - t->store = xx_store; - t->addr_of = xx_addr_of; - t->tls_addr_of = xx_tls_addr_of; - t->copy_bytes = xx_copy_bytes; - t->set_bytes = xx_set_bytes; - t->bitfield_load = xx_bitfield_load; - t->bitfield_store = xx_bitfield_store; - - t->binop = xx_binop; - t->unop = xx_unop; - t->cmp = xx_cmp; - t->convert = xx_convert; - - t->call = xx_call; - t->ret = xx_ret; - - t->alloca_ = xx_alloca_; - t->va_start_ = xx_va_start_; - t->va_arg_ = xx_va_arg_; - t->va_end_ = xx_va_end_; - t->va_copy_ = xx_va_copy_; + t->func_begin = x_func_begin; + t->func_end = x_func_end; + + t->alloc_reg = x_alloc_reg; + t->free_reg = x_free_reg; + t->frame_slot = x_frame_slot; + t->param = x_param; + t->clobbers = x_clobbers; + t->spill_reg = x_spill_reg; + t->reload_reg = x_reload_reg; + + t->label_new = x_label_new; + t->label_place = x_label_place; + t->jump = x_jump; + t->cmp_branch = x_cmp_branch; + + t->scope_begin = x_scope_begin; + t->scope_else = x_scope_else; + t->scope_end = x_scope_end; + t->break_to = x_break_to; + t->continue_to = x_continue_to; + + t->load_imm = x_load_imm; + t->load_const = x_load_const; + t->copy = x_copy; + t->load = x_load; + t->store = x_store; + t->addr_of = x_addr_of; + t->tls_addr_of = x_tls_addr_of; + t->copy_bytes = x_copy_bytes; + t->set_bytes = x_set_bytes; + t->bitfield_load = x_bitfield_load; + t->bitfield_store = x_bitfield_store; + + t->binop = x_binop; + t->unop = x_unop; + t->cmp = x_cmp; + t->convert = x_convert; + + t->call = x_call; + t->ret = x_ret; + + t->alloca_ = x_alloca_; + t->va_start_ = x_va_start_; + t->va_arg_ = x_va_arg_; + t->va_end_ = x_va_end_; + t->va_copy_ = x_va_copy_; t->setjmp_ = NULL; t->longjmp_ = NULL; - t->atomic_load = xx_atomic_load; - t->atomic_store = xx_atomic_store; - t->atomic_rmw = xx_atomic_rmw; - t->atomic_cas = xx_atomic_cas; - t->fence = xx_fence; + t->atomic_load = x_atomic_load; + t->atomic_store = x_atomic_store; + t->atomic_rmw = x_atomic_rmw; + t->atomic_cas = x_atomic_cas; + t->fence = x_fence; - t->intrinsic = xx_intrinsic; - t->asm_block = xx_asm_block; + t->intrinsic = x_intrinsic; + t->asm_block = x_asm_block; - t->set_loc = xx_set_loc; - t->finalize = xx_finalize; - t->destroy = xx_destroy; + t->set_loc = x_set_loc; + t->finalize = x_finalize; + t->destroy = x_destroy; compiler_defer(c, cgt_cleanup, t); return t; diff --git a/src/arch/x64_isa.h b/src/arch/x64_isa.h @@ -0,0 +1,75 @@ +/* x86_64 ISA helpers used by arch/x64.c. + * + * Only the constants here. Instruction encoders live in arch/x64.c + * because they're variable length and depend on the MCEmitter byte + * stream (REX prefix, ModR/M, SIB, displacement). The disassembler + * doesn't share these yet; if/when it does, a parallel x64_isa.c will + * host decode tables. */ + +#ifndef CFREE_X64_ISA_H +#define CFREE_X64_ISA_H + +#include "core/core.h" + +/* ---- GPR numbering (DWARF / ABI matches HW encoding 0..15) ---- */ +enum { + X64_RAX = 0, + X64_RCX = 1, + X64_RDX = 2, + X64_RBX = 3, + X64_RSP = 4, + X64_RBP = 5, + X64_RSI = 6, + X64_RDI = 7, + X64_R8 = 8, + X64_R9 = 9, + X64_R10 = 10, + X64_R11 = 11, + X64_R12 = 12, + X64_R13 = 13, + X64_R14 = 14, + X64_R15 = 15, +}; + +/* SSE register numbering — xmm0..xmm15 share encoding with r0..r15. */ +enum { + X64_XMM0 = 0, + X64_XMM1 = 1, + X64_XMM2 = 2, + X64_XMM3 = 3, + X64_XMM4 = 4, + X64_XMM5 = 5, + X64_XMM6 = 6, + X64_XMM7 = 7, + X64_XMM8 = 8, + X64_XMM15 = 15, +}; + +/* Condition codes for Jcc / SETcc / CMOVcc. Encoded in the low nibble. */ +enum { + X64_CC_O = 0x0, + X64_CC_NO = 0x1, + X64_CC_B = 0x2, /* below / CF=1 → CMP_LT_U */ + X64_CC_AE = 0x3, /* above-or-equal / CF=0 → CMP_GE_U */ + X64_CC_E = 0x4, /* equal / ZF=1 → CMP_EQ */ + X64_CC_NE = 0x5, /* → CMP_NE */ + X64_CC_BE = 0x6, /* below-or-equal / CF=1 or ZF=1 → CMP_LE_U */ + X64_CC_A = 0x7, /* above / CF=0 and ZF=0 → CMP_GT_U */ + X64_CC_S = 0x8, + X64_CC_NS = 0x9, + X64_CC_P = 0xA, + X64_CC_NP = 0xB, + X64_CC_L = 0xC, /* less (signed) / SF!=OF → CMP_LT_S */ + X64_CC_GE = 0xD, /* → CMP_GE_S */ + X64_CC_LE = 0xE, /* less-or-equal (signed) → CMP_LE_S */ + X64_CC_G = 0xF, /* greater → CMP_GT_S */ +}; + +/* REX prefix is 0x40 | W<<3 | R<<2 | X<<1 | B. */ +#define X64_REX_BASE 0x40u +#define X64_REX_W 0x08u +#define X64_REX_R 0x04u +#define X64_REX_X 0x02u +#define X64_REX_B 0x01u + +#endif