x64: port to NativeTarget API (-O0) - kit

commit 910583a6980b0af766d502b20a688260efa409ec
parent 6eaf127a60f04b7a93730df0bb959e140818a196
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Fri, 29 May 2026 08:01:09 -0700

x64: port to NativeTarget API (-O0)

Implement src/arch/x64/native.c: the x86-64 NativeTarget backend driven by the
shared NativeDirectTarget at -O0, mirroring the rv64/aa64 ports. Covers the
rbp frame model (single-pass reserve-and-patch prologue, Win64 chkstk),
register tables (SysV masks; per-OS ABI via x64_abi_for_os), two-address ALU,
flags/setcc/jcc compares, fixed-register division, converts, memory load/store
+ rip-relative globals, calls/returns/params (sret as implicit first int arg),
atomics, va_args (SysV reg-save-area + Win64), TLS, intrinsics, bitfields,
inline + file-scope asm. SysV + Win64 ABI-abstract; tail calls and the -O1
known-frame path deferred (panic) like rv64.

Restructure: keep emit.c stripped to the shared byte-encoders + SysV/Win64
X64ABIRegs tables (new emit.h, shared with asm.c); delete the dead legacy
{ops,alloc,opt_coord}.c + internal.h; rewire arch.c through
native_direct_target_new; adapt asm.c inline-asm binding to physical operands
(X64_INLINE_OPK_*); enable CFREE_ARCH_X64_ENABLED. Reference:
doc/NATIVE_PORT_X64.md. test-toy X-O0 x64: 246 pass, 10 fail (5 deferred
tail-call + 5 varargs to triage).

x64: hold va_list base in R11, not RAX, on the direct va path

x64_va_start_core/va_arg_core materialize the SysV va_list field values
(gp_offset, fp_offset, overflow/reg-save-area pointers) through RAX, so
passing the va_list base in RAX (x64_direct_va_base ..., X64_RAX) had the
first field store clobber the base -> store-to-garbage SIGSEGV. The optimizer
path already passes a real allocated reg; the direct path now uses R11.
Fixes all x64 varargs toy cases (133/139/19/23 + spec_demo). x64 toy X-O0:
251 pass, 5 fail (all deferred tail calls).

x64: skip aa64-asm-specific parse case asm_01_grammar (covered by x64_inline_test)

asm_01_grammar's templates use aarch64 register names (w0/x0); rv64 already
skips it via .rv64.skip. Add the matching .x64.skip — x64 inline-asm coverage
is in test/arch/x64_inline_test.c. x64 parse E-O0: 450 pass, 14 fail (all
deferred ldbl128), 1 skip.

Diffstat:
A doc/INTERFACES.md  | 309 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A doc/NATIVE_PORT_X64.md  | 4342 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M include/cfree/config.h  | 2 +-
D src/arch/x64/alloc.c  | 598 -------------------------------------------------------------------------------
M src/arch/x64/arch.c  | 34 ++++++++++++++++++++++++++++------
M src/arch/x64/asm.c  | 23 ++++++++++++++++++-----
M src/arch/x64/asm.h  | 13 +++++++++++++
M src/arch/x64/emit.c  | 506 +------------------------------------------------------------------------------
A src/arch/x64/emit.h  | 147 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
D src/arch/x64/internal.h  | 314 -------------------------------------------------------------------------------
A src/arch/x64/native.c  | 3751 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
D src/arch/x64/ops.c  | 2939 -------------------------------------------------------------------------------
D src/arch/x64/opt_coord.c  | 410 -------------------------------------------------------------------------------
M src/arch/x64/x64.h  | 6 +++++-
A test/parse/cases/asm_01_grammar.x64.skip  | 1 +

15 files changed, 8618 insertions(+), 4777 deletions(-)
diff --git a/doc/INTERFACES.md b/doc/INTERFACES.md
@@ -0,0 +1,309 @@
+# cfree Interfaces
+
+Modularity and clean interfaces are a top project priority. This document is the
+**interface inventory** and the **interface-review checklist** for cfree.
+
+It complements `doc/DESIGN.md` (which describes the layering narrative). This doc
+is operational: it lists every interface worth reviewing, the contract each one
+carries, and a checklist to apply when adding to or changing one of them.
+
+- **Scope**: public API (`include/cfree/`), the backend/codegen contracts,
+  shared internal subsystem boundaries, the core utilities, and the
+  frontend↔library boundary.
+- **How to use**: when touching an interface, run the [review checklist](#interface-review-checklist)
+  against it and update the [status table](#review-status). When adding a new
+  cross-module header, add it here first.
+
+---
+
+## Boundary map
+
+From outside to inside (see `DESIGN.md` for the full narrative):
+
+```
+driver/                 CLI policy + host I/O. Includes ONLY <cfree/*.h>.
+  └─ lang/              Frontends (c, cpp, toy, wasm). API consumers; ONLY <cfree/*.h> + own private headers.
+       └─ include/cfree/   PUBLIC BOUNDARY. The library's entire contract.
+            └─ src/api/     Composition layer: public handles ↔ internal subsystems.
+                 └─ src/    Internal subsystems. Share private headers; expose nothing except through include/.
+```
+
+**Invariants (verified by grep; keep them true):**
+- `driver/` and `lang/` include only `<cfree/...>` headers — never `src/`.
+- There is **no** umbrella `include/cfree.h`; consumers include the specific
+  headers they use. (`DESIGN.md` still references `include/cfree.h` and
+  `include/cfree/hashmap.h` — stale; real paths are `include/cfree/*.h` and
+  `include/cfree/support/hashmap.h`.)
+- `*_internal.h` headers are private to their subsystem and must not be included
+  across subsystem boundaries.
+
+---
+
+## Tier 1 — Public API (`include/cfree/`)
+
+The library's entire stable contract. 19 headers + 2 support headers. No umbrella.
+
+| Header | Purpose | Key opaque type(s) | Primary consumer |
+|--------|---------|--------------------|------------------|
+| `core.h` | Foundational substrate: compiler lifecycle, target triple, slices, status codes, host vtables (`CfreeHeap`/`CfreeWriter`/`CfreeDiagSink`/`CfreeContext`), symbol interning. | `CfreeCompiler` | everyone |
+| `config.h` | Build-time component enable flags (arch / obj-format / language / subsystem / tool). Preprocessor-only. | — | build |
+| `compile.h` | High-level source→object compilation; frontend registration vtable; dep iteration. | `CfreeCompileSession`, `CfreeDepIter` | driver, frontends |
+| `cg.h` | Code-generation API (~53 KB): stack-machine typed IR over `CfreeCg`. Types/ABI, functions, control flow, memory, arithmetic, calls, intrinsics, inline+file asm, static data. | `CfreeCg` | frontends |
+| `frontend.h` | Frontend convenience bridge: panic boundary (`cfree_frontend_run`), metrics scopes, fatal helpers. | — | frontends |
+| `source.h` | Source registry: stable file IDs + include-edge recording. | — | frontends |
+| `preprocess.h` | Standalone C preprocessor entry. | — | driver |
+| `object.h` | Format-neutral object model: builder + read-only inspection; section/symbol/reloc enums. | `CfreeObjBuilder`, `CfreeObjFile` | cg, link, jit, disasm, dwarf |
+| `link.h` | Linker: byte/object/archive/DSO inputs, linker-script model, emit or JIT. | `CfreeLinkSession`, `CfreeLinkScript` | driver, jit |
+| `jit.h` | JIT image: mapped pages, symbol resolution, publish/append/replace, object view. | `CfreeJit` | runtime, dbg |
+| `dbg.h` | In-process JIT execution control: breakpoints, stepping, regs/mem, signal host. | `CfreeJitSession` | debuggers |
+| `dwarf.h` | DWARF5 consumer: PC↔line, type/var/subprogram queries, structural iterators. | `CfreeDebugInfo`, `CfreeDwarfType` | debuggers, dumpers |
+| `disasm.h` | Disassembly of byte ranges and objects, with symbol/reloc annotation. | `CfreeDisasmIter` | objdump, dbg |
+| `emu.h` | User-mode guest-ELF emulator (per-block JIT). | `CfreeEmu` | emu tool |
+| `arch.h` | Arch-agnostic register/unwind-frame metadata helpers. | — | dbg, dwarf, disasm |
+| `archive.h` | POSIX `ar` reader/writer + symbol index. | `CfreeArIter` | ar/ranlib |
+| `asm_emit.h` | Emit assembled object bytes as GAS text. | — | objdump |
+| `wasm.h` | WebAssembly host-import resolver/binder. | `CfreeWasmInstance` | wasm runners |
+| `support/arena.h` | Public bump allocator (narrowed mirror of `src/core/arena.h`). | `CfreeArena` | frontends |
+| `support/hashmap.h` | Header-only `CFREE_HASHMAP_DEFINE` template + hash fns. | — (macro) | frontends |
+
+**Public-tier review notes:**
+- `cg.h` is by far the largest contract and the one frontends couple to hardest.
+  Changes here ripple to every frontend — treat it as the highest-risk public
+  interface.
+- `core.h` defines the host vtables (`CfreeHeap`, `CfreeWriter`, `CfreeDiagSink`,
+  `CfreeContext`) — these are the project's "no global state" enforcement point;
+  every subsystem threads context through them.
+
+---
+
+## Tier 2 — Backend / codegen contract (internal)
+
+The codegen path is **tiered**; each tier is a distinct vtable a backend or layer
+fills in. This is the most actively-changing area (x64/rv64 are being ported onto
+`NativeTarget`; **aa64 is the done reference**).
+
+| Tier | Header | Contract type | What implements it | Role |
+|------|--------|---------------|--------------------|------|
+| ABI | `src/abi/abi.h` (+ `abi_internal.h`) | `TargetABI` / `ABIVtable` | per-ABI TUs (`aapcs64`, `sysv_x64`, `rv64`, `wasm32`, `apple_*`, `win64_x64`) | calling-convention + layout queries; `abi_new(Compiler)` selects vtable |
+| Arch registry | `src/arch/arch.h` | `ArchImpl` | one singleton per arch, via `arch_lookup(kind)` | discovery + dispatch to decode/emu/link/dbg/dwarf surfaces; CFI defaults |
+| Semantic CG | `src/cg/cgtarget.h` | `CgTarget` | `native_direct_target` (-O0) or `opt_cgtarget` (-O≥1) | frontend-facing lowering, pre-regalloc |
+| -O0 adapter | `src/cg/native_direct_target.h` | `NativeDirectTarget` + `NativeOps` | shared, parameterized by arch `NativeOps` | adapts `NativeTarget` to `CgTarget` for -O0 |
+| Physical emit | `src/arch/native_target.h` | `NativeTarget` | `aa64`/`x64`/`rv64` `*_native_target_new()` | hard-register, machine-code emission + frame/CFI |
+| Machine code | `src/arch/mc.h` | `MCEmitter` | one generic impl, `mc_new(Compiler, ObjBuilder)` | section/label/reloc/CFI byte emission for all MC archs |
+
+**Per-arch entry points** (the surface each backend exposes to the rest of the
+compiler):
+
+| Arch | Header | Entry points |
+|------|--------|--------------|
+| aa64 (reference) | `src/arch/aa64/aa64.h` | `aa64_native_target_new`, `aa64_native_direct_ops` |
+| x64 (porting) | `src/arch/x64/x64.h` | `x64_native_target_new`, `x64_native_direct_ops` |
+| rv64 (porting) | `src/arch/rv64/rv64.h` | `rv64_native_target_new`, `rv64_native_direct_ops`, `rv64_emit32/16` |
+| c_target | `src/arch/c_target/{c_emit,ir_emit}.h` | C-source emission backend |
+| wasm | `src/arch/wasm/*` | wasm emission backend |
+
+**Backend-tier review notes:**
+- `NativeTarget` (~35 hooks: frame, control flow, data movement, arithmetic,
+  calls, atomics, variadics, intrinsics, asm) is the contract the port must
+  satisfy. Reviewing a port = checking every hook against the aa64 reference for
+  semantics, not just compilation.
+- The same arch fills both `NativeTarget` (physical) and `NativeOps` (semantic
+  shims used by the -O0 adapter). Keep the split clean: semantic decisions in
+  `NativeOps`, pure emission in `NativeTarget`.
+- `mc.h` is arch-neutral; per-arch reloc encoding lives behind
+  `ArchImpl.apply_label_fixup` + CFI constants. Don't leak arch knowledge into
+  the generic emitter.
+
+---
+
+## Tier 3 — Internal subsystem boundaries
+
+Each subsystem exposes a single shared header (its boundary) and may keep an
+`*_internal.h` private to its own TUs.
+
+| Subsystem | Boundary header | Internal header | Key exported types | Main entry points |
+|-----------|-----------------|-----------------|--------------------|-------------------|
+| obj | `src/obj/obj.h` | — (format headers private) | `ObjBuilder`, `Section`, `ObjSym`, `Reloc`, `RelocKind`, `ObjImage` | `obj_new/free`, `obj_section/symbol/reloc`, `obj_finalize`, `obj_sweep_dead`; format emit/read via `ObjFormatImpl` |
+| ↳ formats | `src/obj/{elf,macho,coff,wasm}/*.h`, `format.h`, `reloc_apply.h` | — | `ObjFormatImpl`, per-format arch ops | `emit_*`/`read_*` per format; `link_reloc_apply` |
+| link | `src/link/link.h` (+ `link_arch.h`) | `link_internal.h` | `LinkInput`, `LinkSymbol`, `LinkImage`, `LinkArchDesc` | `link_new`, `link_add_*`, `link_resolve[_extend]`, `link_emit_*_writer`, `cfree_jit_from_image` |
+| opt | `src/opt/opt.h` (+ `ir.h`) | `opt_internal.h` | `OptOperand`, `OptFrameSlot`, optimizer `Func` | `opt_cgtarget_new`, `opt_func_from_cg_ir`, pass entries (`opt_build_ssa`, `opt_regalloc`, `opt_lower_to_mir`, …) |
+| cg | `src/cg/{ir,ir_recorder,type}.h` | `internal.h` | `CgIrFunc`, `CgIrInst`, `CgType`, `CgTypeField` | IR recording, `cg_type_*` queries |
+| debug | `src/debug/debug.h` (+ `dwarf_defs.h`) | `debug_internal.h`, `dwarf_internal.h` | `Debug`, `DebugTypeId`, `DebugVarLoc` | `debug_new`, `debug_type_*`, `debug_func_*`, `debug_line`, `debug_emit` |
+| emu | `<cfree/emu.h>` (public face) | `src/emu/emu.h` | `EmuProcess`, `EmuThread`, `ObjFormatEmuOps` | runs via public API; format hooks in `ObjFormatEmuOps` |
+| dbg | `<cfree/dbg.h>` (public face) | `src/dbg/dbg.h` | session internals | public `cfree_jit_session_*` |
+| asm | `src/asm/asm.h` (+ `asm_lex.h`) | — (`asm_helpers.h` shared) | `AsmLexer`, `AsmTok` | `asm_parse(Compiler, AsmLexer, MCEmitter)`; `asm_driver_*` helpers |
+| jit | `src/jit/tlv_thunk.h` | — | — | `cfree_jit_tlv_thunk` (Mach-O TLV); rest via `LinkImage` |
+| wasm | `src/wasm/wasm.h` | — | `WasmValType`, `WasmFeatureSet`, `WasmInsnKind` | module model / codec / WAT / validate (public Cfree types only) |
+| api | `src/api/lang_registry.h` | — | — | `lang_registry_init(Compiler)` wires enabled frontends |
+
+**Subsystem-tier review notes:**
+- `obj.h` is the hub: cg, link, jit, disasm, and dwarf all depend on it. Format
+  knowledge (ELF/Mach-O/COFF/Wasm) stays behind `ObjFormatImpl` — verify new code
+  doesn't hard-code one format above that line.
+- `link` exposes both single-shot (`link_resolve`) and incremental
+  (`link_resolve_extend`) surfaces; keep them consistent.
+- `emu` and `dbg` present their real contract through the **public** headers;
+  the `src/` headers are implementation. Don't grow a second public-ish surface
+  in `src/`.
+
+---
+
+## Tier 4 — Core utilities (`src/core/`)
+
+Foundational data structures. Enforce the project rules: **no global state**
+(everything takes an explicit `Heap*`/`Arena*`/`Compiler*`), **no VLAs**.
+
+| Header | Purpose | Takes explicit allocator? | Public mirror |
+|--------|---------|---------------------------|---------------|
+| `core.h` | Type aliases, `Compiler` struct, panic/defer machinery | `Compiler` holds context | partially via `include/cfree/core.h` |
+| `arena.h` | Bump allocator; reset frees all | `Heap*` | `include/cfree/support/arena.h` (narrowed) |
+| `pool.h` | Symbol interning (`Sym` canonical IDs) | `Heap*` | — |
+| `buf.h` | Chunked byte buffer with patch/seek | `Heap*` | — |
+| `vec.h` | Doubling-growth vector (`VEC_GROW` macro) | `Heap*` | — |
+| `segvec.h` | Segmented append-only array, stable pointers (`SEGVEC_DEFINE`) | `Heap*` | — |
+| `hashmap.h` | Alias to public template | n/a | `include/cfree/support/hashmap.h` |
+| `heap.h` | Heap abstraction + JIT exec-mmap helper | wraps `CfreeHeap` | `CfreeHeap` in `core.h` |
+| `strbuf.h` | Bounded text builder, caller-owned buffer | none (caller buffer) | — |
+| `slice.h` | Fat-pointer byte view (alias of `CfreeSlice`) | `Arena*` for dup | `CfreeSlice` in `core.h` |
+| `bytes.h` | LE/BE int serialize helpers | none | — |
+| `diag.h` | Diagnostic-sink convenience wrappers | none (delegates) | — |
+| `metrics.h` | Telemetry dispatch to optional callbacks | none (reads Compiler) | — |
+| `sha256.h` | Streaming SHA-256 | none | — |
+| `util.h` | `MIN`/`MAX`/`ALIGN_*`/`CONTAINER_OF` macros | none | — |
+
+**Core-tier review notes:**
+- Public mirrors (`arena`, `hashmap`, parts of `core`) deliberately expose a
+  *narrowed* surface, not the full internal one. When changing an internal
+  utility that has a mirror, decide explicitly whether the public mirror moves
+  too — they are allowed to diverge.
+- These are the foundation for the "no global state" rule; any new core utility
+  that reaches for a static/global is a red flag.
+
+---
+
+## Tier 5 — Frontend ↔ library boundary
+
+Frontends live in `lang/`, are API consumers, and register per-`CfreeCompiler`.
+
+**The contract a frontend must implement** — `CfreeFrontendVTable`
+(`include/cfree/compile.h`):
+
+```c
+typedef struct CfreeFrontendVTable {
+  CfreeFrontendNewFn  new_frontend;   // CfreeFrontendState* (*)(CfreeCompiler*)
+  CfreeFrontendCompileFn compile;     // source -> CfreeObjBuilder
+  CfreeFrontendFreeFn free_frontend;
+  const CfreeSlice*  extensions;      // file extensions claimed (no leading dot)
+  uint32_t           nextensions;
+} CfreeFrontendVTable;
+```
+
+Registered via `cfree_register_frontend(compiler, language, vtable)`;
+`src/api/lang_registry.h::lang_registry_init` auto-wires the
+`CFREE_LANG_*_ENABLED` frontends at construction.
+
+**What frontends consume** (public only): `cg.h`, `frontend.h`, `source.h`,
+`object.h`, `support/arena.h`, `support/hashmap.h`, `core.h`.
+
+| Frontend | Public entry | Notable internal headers |
+|----------|--------------|--------------------------|
+| C (`lang/c/`) | `cfree_c_frontend_vtable` (`c.h`) | `type/`, `decl/`, `sem/`, `abi/c_abi.h`, `parse/parse.h` |
+| cpp (`lang/cpp/`) | shared by C; `pp/pp.h`, `lex/lex.h` | `cpp_support.h`, `pp/pp_priv.h` |
+| toy (`lang/toy/`) | `cfree_toy_frontend_vtable` (`toy.h`) | `internal.h`, `lexer.h` |
+| wasm (`lang/wasm/`) | `cfree_wasm_frontend_vtable` (`wasm.h`) | `runtime_abi.h` |
+
+**Frontend-tier review notes & flags:**
+- ⚠️ **`lang/c/parse/cg_public_compat.h`** is a compatibility shim wrapping
+  `cg.h` with C-semantic sugar (lvalue aux, type stack, `pcg_*` helpers, >100
+  functions). It is the real coupling point between the C parser and codegen.
+  Its existence suggests `cg.h` doesn't yet serve the C frontend's needs
+  directly — worth tracking as interface debt: does the shim hide gaps that
+  should be promoted into `cg.h`, or is it legitimately C-specific policy?
+- ⚠️ `lang/wasm/wasm.h` exposes `cfree_wasm_wat_to_wasm()` — a test/dev helper
+  living in a public-facing header. Confirm it belongs in the public surface vs.
+  a test-only header.
+- Frontends must not reach into `src/` (verified clean today). New frontend code
+  that needs something from `src/` is a signal the public API is missing
+  something — add it to `include/cfree/`, don't cross the boundary.
+
+---
+
+## Interface review checklist
+
+Apply to any interface (header / vtable) you add or change. Tier-1 (public) and
+Tier-2 (backend contract) changes warrant the full list; lower tiers can be
+lighter.
+
+### Boundary & layering
+- [ ] Header lives at the right tier; consumers at the correct layer can reach it.
+- [ ] No layering violation: `driver/`+`lang/` use only `<cfree/*.h>`; subsystems
+      don't include each other's `*_internal.h`.
+- [ ] Format/arch/OS specifics stay behind their dispatch vtable
+      (`ObjFormatImpl`, `ArchImpl`, `*Vtable`) — not leaked above it.
+- [ ] If a public mirror exists (arena/hashmap/core), the divergence from the
+      internal version is intentional and documented.
+
+### Surface shape
+- [ ] Opaque handles where the consumer shouldn't see layout; concrete structs
+      only where the layout *is* the contract.
+- [ ] Minimal surface — no entry points added "just in case"; each has a caller.
+- [ ] Naming consistent with tier (`cfree_*` public; subsystem-prefixed internal).
+- [ ] Enums/flags are explicitly valued where they cross a format/wire boundary.
+
+### State & ownership
+- [ ] No global/static state — context threaded via `Compiler`/`Heap`/`Arena`/
+      `CfreeContext` (project rule).
+- [ ] Allocation ownership is explicit: who allocates, who frees, lifetime rules.
+- [ ] No VLAs (project rule).
+- [ ] Borrowed vs. owned bytes (`CfreeSlice` etc.) documented at the boundary.
+
+### Errors & contracts
+- [ ] Errors reported via `CfreeStatus` / diag sink, not ad-hoc returns; failure
+      modes documented.
+- [ ] Pre/postconditions and ordering constraints stated (e.g. `obj_finalize`
+      before read-side queries; `func_begin`/`func_end` pairing).
+- [ ] No magic numbers — shared constants promoted to a header (project rule).
+
+### Vtable / backend contracts (Tier 2)
+- [ ] Every hook present in the reference (aa64) is implemented and matches its
+      semantics, not just its signature.
+- [ ] Semantic vs. physical responsibilities kept on the right side
+      (`NativeOps` vs. `NativeTarget`).
+- [ ] New hook added to the contract is implemented by *all* live backends (or
+      has a documented capability-query fallback, e.g. `supports_label_table`).
+
+### Stability & docs
+- [ ] Public (Tier-1) change: is it source-compatible? If not, callers updated in
+      the same change.
+- [ ] This document's inventory + status table updated.
+- [ ] `DESIGN.md` updated if the layering narrative changed.
+
+---
+
+## Review status
+
+Track interface-review passes here. Status: ⬜ not reviewed · 🔶 in progress · ✅ reviewed.
+
+| Interface | Tier | Status | Notes |
+|-----------|------|--------|-------|
+| `core.h` | 1 | ⬜ | host vtables = no-global-state enforcement point |
+| `cg.h` | 1 | ⬜ | largest/highest-risk; frontends couple hard |
+| `object.h` | 1 | ⬜ | hub for cg/link/jit/disasm/dwarf |
+| `link.h` | 1 | ⬜ | single-shot vs. incremental surfaces |
+| `jit.h` / `dbg.h` | 1 | ⬜ | — |
+| `dwarf.h` / `disasm.h` / `arch.h` | 1 | ⬜ | inspection family |
+| `compile.h` / `frontend.h` / `source.h` | 1 | ⬜ | frontend-facing |
+| other Tier-1 (`archive`, `asm_emit`, `emu`, `preprocess`, `wasm`, `config`, support) | 1 | ⬜ | smaller surfaces |
+| `NativeTarget` (`native_target.h`) | 2 | 🔶 | aa64 ✅ reference; x64/rv64 porting |
+| `CgTarget` (`cgtarget.h`) | 2 | ⬜ | — |
+| `NativeDirectTarget`/`NativeOps` | 2 | ⬜ | -O0 adapter; semantic/physical split |
+| `MCEmitter` (`mc.h`) | 2 | ⬜ | arch-neutral; keep it that way |
+| `TargetABI` (`abi.h`) | 2 | ⬜ | — |
+| `ArchImpl` (`arch.h`) | 2 | ⬜ | dispatch hub |
+| obj boundary (`obj.h` + formats) | 3 | ⬜ | format dispatch via `ObjFormatImpl` |
+| link / opt / cg / debug boundaries | 3 | ⬜ | — |
+| asm / emu / dbg / jit / wasm / api | 3 | ⬜ | — |
+| core utilities (`src/core/`) | 4 | ⬜ | no-global-state foundation |
+| `CfreeFrontendVTable` | 5 | ⬜ | the frontend contract |
+| `cg_public_compat.h` shim | 5 | ⬜ | ⚠️ interface debt — gap in `cg.h`? |
+| `cfree_wasm_wat_to_wasm` placement | 5 | ⬜ | ⚠️ test helper in public header? |
diff --git a/doc/NATIVE_PORT_X64.md b/doc/NATIVE_PORT_X64.md
@@ -0,0 +1,4342 @@
+# x64 NativeTarget Porting Reference
+
+
+
+---
+
+# X64 NativeTarget Porting Guide (GROUP 1: Skeleton, Encoders, Frame Model, Lifecycle)
+
+## Overview
+
+This guide covers porting the x64 backend from the legacy CGTarget API (disabled, non-compiling as of commit 429defa) to the NativeTarget API. GROUP 1 establishes the infrastructure: the X64NativeTarget subclass, encoder header restructuring, frame model, and the func_begin/func_end/bind_param lifecycle.
+
+References:
+- **Contract**: `src/arch/native_target.h` (NativeTarget vtable & types)
+- **Driver**: `src/cg/native_direct_target.h` (NativeDirectTarget & NativeOps adapter)
+- **RV64 Template**: `src/arch/rv64/native.c` (working reference, 1500+ lines)
+- **AA64 Template**: `src/arch/aa64/native.c` (aarch64 reference, 2000+ lines)
+- **Legacy x64 (non-compiling)**:
+  - `git show 429defa:src/arch/x64/ops.c` (old vtable, data-movement hooks)
+  - `git show 429defa:src/arch/x64/emit.c` (byte-level encoders, prologue/epilogue)
+  - `git show 429defa:src/arch/x64/alloc.c` (frame slots, parameter binding)
+  - `git show 429defa:src/arch/x64/internal.h` (XImpl state, X64ABIRegs, type helpers)
+- **ISA Constants**: `src/arch/x64/isa.h` (X64_* opcodes, REX, ModRM, condition codes)
+- **ABI**: `src/abi/abi_sysv_x64.c` & `src/abi/abi_win64_x64.c` (parameter/return lowering)
+
+---
+
+## (a) X64NativeTarget Subclass Struct
+
+The x64 backend maintains per-function state in a subclass of NativeTarget. Model the struct on RvNativeTarget (rv64/native.c:198–234), adapted for x64's SysV vs. Win64 ABI duality.
+
+### Struct Definition (Pseudo-C)
+
+```c
+typedef struct X64NativeSlot {
+  u32 off;     /* bytes below rbp (positive); address = rbp - off */
+  u32 size;
+  u32 align;
+  u8 kind;     /* NativeFrameSlotKind */
+  u8 pad[3];
+} X64NativeSlot;
+
+typedef struct X64Patch {
+  u8 kind;     /* X64PatchKind (e.g., X64_PATCH_ALLOCA) */
+  u32 pos;     /* byte offset in text section */
+  u32 dst_reg; /* for X64_PATCH_ALLOCA: destination register */
+} X64Patch;
+
+typedef struct X64NativeTarget {
+  NativeTarget base;           /* parent vtable & MCEmitter */
+  SrcLoc loc;
+  const CGFuncDesc* func;
+
+  /* Frame slots (locals, spills, sret, variadic save area). */
+  X64NativeSlot* slots;
+  u32 nslots;
+  u32 slots_cap;
+  u32 cum_off;                 /* sum of slot reservations below rbp */
+  u32 max_outgoing;            /* max stack-arg bytes across all calls */
+  u32 frame_size_final;        /* final frame size (patched at func_end) */
+
+  /* Parameter tracking (for bind_param). */
+  u32 incoming_stack_size;     /* fixed-param stack bytes (tail-call check) */
+  u32 next_param_int;          /* index into ABI int-arg register list */
+  u32 next_param_fp;           /* index into ABI FP-arg register list */
+  u32 next_param_stack;        /* cumulative stack-arg offset */
+  u8 has_sret;
+  u8 is_variadic;
+  NativeFrameSlot sret_ptr_slot;
+
+  /* Patches deferred to func_end (alloca disp32, etc). */
+  X64Patch* patches;
+  u32 npatches;
+  u32 patches_cap;
+  u32 nalloca;                 /* count of allocas (gates slim epilogue) */
+
+  /* Prologue/epilogue state. */
+  u32 func_start;              /* text-section offset at func_begin */
+  u32 prologue_pos;            /* offset of prologue placeholder */
+  MCLabel epilogue_label;
+
+  /* Callee-saved registers assigned by the allocator. */
+  struct X64CalleeSave {
+    NativeFrameSlot slot;
+    CfreeCgTypeId type;
+    u8 cls;                    /* NATIVE_REG_INT or NATIVE_REG_FP */
+    Reg reg;
+  } callee_saves[X64_MAX_CS_REGS]; /* X64_MAX_CS_REGS = 15u (5 int + 10 xmm) */
+  u32 ncallee_saves;
+
+  /* Flags. */
+  u8 known_frame;              /* set by func_begin_known_frame */
+  u8 has_alloca;               /* dynamic alloca in body */
+  u8 frame_final;              /* prologue emitted (not patched at func_end) */
+  u8 pad[1];
+
+  /* ABI selection: points to either SysV or Win64 config. */
+  const X64ABIRegs* abi;
+} X64NativeTarget;
+
+static inline X64NativeTarget* x64_of(NativeTarget* t) {
+  return (X64NativeTarget*)t;
+}
+```
+
+### Key Fields
+
+- **`cum_off`**: Running total of frame-slot reservations. Each new slot sets `cum_off = (cum_off + size + align-1) & ~(align-1)` and takes the new value as its `off`, so the slot sits at `rbp - off`.
+- **`next_param_int`, `next_param_fp`, `next_param_stack`**: Cursors into the ABI's argument lists. On SysV, int args live in rdi/rsi/rdx/rcx/r8/r9 (6 total); on Win64, rcx/rdx/r8/r9 (4 total). FP args differ similarly. Parameters beyond the register pool land on the stack; the cursors track both.
+- **`abi`**: Resolved once at func_begin from `c->target.os` via `x64_abi_for_os()`. Points to either `&g_x64_abi_sysv` or `&g_x64_abi_win64`.
+- **`frame_size_final`**: Calculated at func_end after alloca/call site patches are known. On single-pass (-O0), the prologue is emitted with a NOP placeholder and patched with the real size; on known-frame (optimizer), the prologue is emitted once with the final size.
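
The `cum_off` bump described in the list above can be sketched in isolation. This is a minimal illustration, not the code in native.c: `SketchFrame` and `sketch_reserve_slot` are hypothetical names, and the sketch assumes `align` is a power of two and that rbp itself is 16-byte aligned.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the frame-slot bookkeeping (not the real struct). */
typedef struct SketchFrame {
  uint32_t cum_off; /* bytes reserved below rbp so far */
} SketchFrame;

/* Reserve `size` bytes aligned to `align` and return the slot's `off`.
 * The slot then occupies [rbp - off, rbp - off + size). */
static uint32_t sketch_reserve_slot(SketchFrame* f, uint32_t size,
                                    uint32_t align) {
  assert(align != 0 && (align & (align - 1)) == 0); /* power of two */
  f->cum_off = (f->cum_off + size + align - 1) & ~(align - 1);
  return f->cum_off;
}
```

Because `off` is rounded up after adding `size`, each slot's lowest address (`rbp - off`) is a multiple of `align` whenever rbp is.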
+
+---
+
+## (b) Emit.c Restructuring & emit.h
+
+### Current State
+
+Legacy x64/emit.c (`git show 429defa:src/arch/x64/emit.c`) contains:
+1. **Byte-level encoders** (emit1, emit_rex, emit_mem_operand, emit_rm_reg, emit_alu_rr, emit_imul_rr, etc.)
+2. **Constant tables** (g_int_order, g_fp_order, X64ABIRegs SysV/Win64)
+3. **Prologue/epilogue builders** (x_build_prologue, x_compute_frame_size, x_func_begin, x_func_end)
+
+### Keep (Byte-Encoders → emit.h)
+
+These are low-level helpers used by both native.c (new NativeTarget) and asm.c (standalone assembler):
+
+- `emit1(mc, b)` – emit one byte
+- `emit_u32le(mc, v)` – emit 32-bit little-endian
+- `make_rex(w, reg, index, rm)` → `u8`; `emit_rex(mc, ...)`; `emit_rex_force(mc, ...)`
+- `modrm(mod, reg, rm)` → `u8`; `sib(scale, index, base)` → `u8`
+- `emit_mem_operand(mc, reg, base, disp)` – ModR/M + SIB + displacement
+- `emit_rm_reg(mc, reg, rm)` – ModR/M for reg-to-reg (mod=3)
+- `emit_mov_rr(mc, w, dst, src)` – MOV reg64/reg32
+- `emit_mov_load(mc, size, signed, dst, base, disp)` – MOV from [base+disp]
+- `emit_mov_store(mc, size, dst_reg, base, disp)` – MOV to [base+disp]
+- `emit_lea(mc, dst, base, disp)` – LEA dst, [base+disp]
+- `emit_sse_rr(mc, prefix, opcode, dst, src)` – SSE reg-to-reg (MOVSD, ADDPD, etc.)
+- `emit_sse_load(mc, prefix, opcode, dst, base, disp)` – SSE load from memory
+- `emit_sse_store(mc, prefix, opcode, src, base, disp)` – SSE store to memory
+- `emit_alu_rr(mc, w, op, dst, src)` – ALU two-operand (ADD, SUB, AND, OR, XOR, CMP, MOV, TEST)
+- `emit_alu_imm8(mc, w, op, dst, imm)` – ALU with imm8
+- `emit_alu_imm32(mc, w, op, dst, imm)` – ALU with imm32
+- `emit_imul_rr(mc, w, dst, src)` – IMUL (reg * reg form)
+- `emit_imul_imm8(mc, w, dst, src, imm)` – IMUL with imm8
+- `emit_imul_imm32(mc, w, dst, src, imm)` – IMUL with imm32
+- `emit_f7_rm(mc, w, sub, reg)` – F7 opcode (DIV, MUL, NEG, NOT)
+- `emit_shift_cl(mc, w, sub, reg)` – Shift by CL (SHL reg, cl; etc.)
+- `emit_shift_imm(mc, w, sub, reg, imm)` – Shift by imm8 (SHL reg, imm; etc.)
+- `emit_cqo_or_cdq(mc, w)` – CQO/CDQ (sign-extend rax→rdx:rax / eax→edx:eax)
+- `emit_setcc(mc, cc, dst)` – SETCC (SETcc reg8)
+- `emit_movzx(mc, w_src, dst, src)` – MOVZX (zero-extend)
+- `emit_extend_rr(mc, w_src, w_dst, dst, src)` – Sign-extend or zero-extend based on type
+- `emit_ret(mc)` – RET
+- `emit_leave(mc)` – LEAVE
+- `x64_emit_load_imm(mc, is64, dst, imm)` – Load immediate (handles splits for >32-bit on 32-bit form)
+- `x64_abi_for_os(os)` → `const X64ABIRegs*` – Resolve ABI from OS kind
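
To make the role of the REX and ModR/M helpers in the list above concrete, here is a small sketch that builds a 64-bit register-to-register `MOV` by hand. The `sketch_*` names are hypothetical and write into a plain byte array rather than an `MCEmitter`; the real `make_rex`/`modrm` signatures may differ (in particular, this sketch assumes full register numbers 0–15 are passed in).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* REX prefix layout is 0100WRXB. Only bit 3 of each register number
 * goes into the prefix; the low three bits go into ModR/M or SIB. */
static uint8_t sketch_rex(int w, unsigned reg, unsigned index, unsigned rm) {
  return (uint8_t)(0x40u | ((w ? 1u : 0u) << 3) | (((reg >> 3) & 1u) << 2) |
                   (((index >> 3) & 1u) << 1) | ((rm >> 3) & 1u));
}

/* ModR/M byte: mod in bits 7-6, reg in bits 5-3, rm in bits 2-0. */
static uint8_t sketch_modrm(unsigned mod, unsigned reg, unsigned rm) {
  return (uint8_t)(((mod & 3u) << 6) | ((reg & 7u) << 3) | (rm & 7u));
}

/* Encode `mov dst, src` for 64-bit GPRs using opcode 89 /r
 * (MOV r/m64, r64): the source goes in ModR/M.reg, the destination
 * in ModR/M.rm. Writes three bytes and returns the count. */
static size_t sketch_mov_rr64(uint8_t out[3], unsigned dst, unsigned src) {
  assert(dst < 16 && src < 16);
  out[0] = sketch_rex(1, src, 0, dst);
  out[1] = 0x89;
  out[2] = sketch_modrm(3, src, dst); /* mod=3: register-direct */
  return 3;
}
```

With rax = 0, rbx = 3, and r8 = 8, `sketch_mov_rr64(buf, 0, 3)` yields `48 89 D8` (`mov rax, rbx`) and `sketch_mov_rr64(buf, 8, 0)` yields `49 89 C0` (`mov r8, rax`).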
+
+### New Header: x64/emit.h
+
+Create `src/arch/x64/emit.h` with:
+- Declarations of all byte-encoder functions above
+- Inline helpers: `make_rex()`, `modrm()`, `sib()`
+- Inline immediate-legality checks: `imm_fits_i8(v)`, `imm_fits_i32(v)`
+- The byte-encoder constants already defined in isa.h (X64_REX_*, X64_NOP1, etc.)
+- No internal.h dependency (no XImpl, XSlot, etc.)
+- Include `"arch/mc.h"`, `"arch/x64/isa.h"`, `"core/bytes.h"`
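
The immediate-legality checks listed above typically decide between the short and long ALU immediate forms. The following is a sketch under stated assumptions: the `sketch_*` names are hypothetical, output goes to a plain byte array, and only `ADD r64, imm` is shown (opcode `83 /0 ib` for a sign-extended imm8, `81 /0 id` for imm32).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static inline int sketch_imm_fits_i8(int64_t v) {
  return v >= INT8_MIN && v <= INT8_MAX;
}

static inline int sketch_imm_fits_i32(int64_t v) {
  return v >= INT32_MIN && v <= INT32_MAX;
}

/* Encode `add dst, imm` for a 64-bit GPR, choosing the shortest form.
 * Writes at most 7 bytes and returns the number written. */
static size_t sketch_add_r64_imm(uint8_t out[7], unsigned dst, int64_t imm) {
  assert(dst < 16 && sketch_imm_fits_i32(imm));
  size_t n = 0;
  out[n++] = (uint8_t)(0x48u | (dst >> 3));   /* REX.W, plus REX.B for r8-r15 */
  if (sketch_imm_fits_i8(imm)) {
    out[n++] = 0x83;                          /* ADD r/m64, imm8 */
    out[n++] = (uint8_t)(0xC0u | (dst & 7u)); /* mod=3, /0, rm=dst */
    out[n++] = (uint8_t)imm;
  } else {
    uint32_t u = (uint32_t)imm;
    out[n++] = 0x81;                          /* ADD r/m64, imm32 */
    out[n++] = (uint8_t)(0xC0u | (dst & 7u));
    for (int i = 0; i < 4; i++) {
      out[n++] = (uint8_t)(u >> (8 * i));     /* little-endian imm32 */
    }
  }
  return n;
}
```

For rsp (register 4), an immediate of 8 gives the four-byte `48 83 C4 08`, while 256 needs the seven-byte `48 81 C4 00 01 00 00`.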
+
+### Delete
+
+- All `x_*` semantic wrappers (x_load_imm, x_copy, x_load, x_call, etc.) → moved to native.c as NativeTarget hooks
+- The legacy XImpl-only internal state helpers
+- internal.h (subsumed by native.c's X64NativeTarget struct)
+
+### Update asm.c
+
+The standalone assembler (`src/arch/x64/asm.c`) currently includes internal.h. Swap that for:
+```c
+#include "arch/x64/emit.h"  /* instead of internal.h */
+```
+No functional change to asm.c's own encoder-wrapping logic; just link against the new emit.h symbols instead of inline copies.
+
+---
+
+## (c) Frame Model: SysV vs. Win64
+
+x64 uses an **RBP-anchored frame** on both ABIs. The prologue saves the caller's RBP and chains frames; slots live below RBP at negative offsets. The two ABIs differ in:
+1. Caller-saved and callee-saved register sets
+2. Argument registers and register-to-stack mapping
+3. Shadow space: Win64 requires the caller to reserve 32 bytes of "home space" above the return address for the first 4 register arguments
+4. Stack probing: Win64 requires a `__chkstk` call for frames > 4096 bytes
+5. XMM preservation: Win64 treats XMM6–15 as callee-saved; SysV treats every XMM register as caller-saved
+
+### Frame Layout (SysV / Win64 with RBP frame)
+
+```
+    high addr (caller's frame)
+   +---------------------+
+   | incoming stack args | [rbp + 16 + shadow_space + ...] (Win64 shadow=32)
+   +---------------------+
+   | return address      | [rbp + 8]
+   +---------------------+
+rbp →| saved rbp          | [rbp + 0]
+   +---------------------+
+   | sret pointer (if)   | [rbp - 8]  (if has_sret)
+   | local vars / spills | [rbp - off]
+   | callee-saved XMMs   | [rbp - xmm_base - (i+1)*16]  (Win64 only, xmm6–15)
+   | callee-saved GPRs   | [rbp - xmm_base - n_xmm*16 - (i+1)*8]
+   +---------------------+
+rsp →| outgoing args      | [rsp + 0 .. rsp + max_outgoing)
+   +---------------------+
+    low addr
+```
+
+**Frame-size formula:**
+```
+xmm_base = (sret_slot_exists ? 8 : 0) + cum_off
+frame_size = align_up_16(xmm_base + cs_xmm_count * 16 + cs_gpr_count * 8 + max_outgoing)
+```
+
+**RBP-relative offsets:**
+- Saved RBP: rbp + 0
+- Return address: rbp + 8
+- Incoming stack arg at stack offset K: rbp + 16 + shadow_space + K
+- Local/spill slot with `off` bytes: rbp - off
+- Saved GPR i (i=0 for first callee-saved): rbp - xmm_base - xmm_count*16 - (i+1)*8
+- Saved XMM i (Win64): rbp - xmm_base - (i+1)*16
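To make the formula and offsets above concrete, here is a small worked sketch. `DemoFrame` and the `demo_*` helpers are hypothetical names invented for this example; they only restate the arithmetic given in this section.

```c
#include <stdint.h>

/* Hypothetical, simplified frame description (illustration only). */
typedef struct {
  uint32_t has_sret;      /* 1 if an sret pointer slot exists */
  uint32_t cum_off;       /* bytes of locals/spills */
  uint32_t n_xmm;         /* saved XMM count (Win64 only) */
  uint32_t n_gpr;         /* saved GPR count */
  uint32_t max_outgoing;  /* outgoing argument area in bytes */
} DemoFrame;

static uint32_t demo_xmm_base(const DemoFrame *f) {
  return (f->has_sret ? 8u : 0u) + f->cum_off;
}

/* frame_size = align_up_16(xmm_base + n_xmm*16 + n_gpr*8 + max_outgoing) */
static uint32_t demo_frame_size(const DemoFrame *f) {
  uint32_t raw = demo_xmm_base(f) + f->n_xmm * 16u + f->n_gpr * 8u + f->max_outgoing;
  return (raw + 15u) & ~15u;
}

/* RBP-relative displacement of the i-th saved XMM. */
static int32_t demo_saved_xmm_disp(const DemoFrame *f, uint32_t i) {
  return -(int32_t)(demo_xmm_base(f) + (i + 1u) * 16u);
}

/* RBP-relative displacement of the i-th saved GPR (below the XMM saves). */
static int32_t demo_saved_gpr_disp(const DemoFrame *f, uint32_t i) {
  return -(int32_t)(demo_xmm_base(f) + f->n_xmm * 16u + (i + 1u) * 8u);
}
```

With `cum_off = 32`, two saved XMMs, three saved GPRs and 32 bytes of outgoing arguments, the raw size is 120 and rounds up to 128; the XMM saves land at rbp-48 and rbp-64, and the GPR saves at rbp-72, rbp-80 and rbp-88.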
+
+**Red zone (SysV only):** The 128 bytes below RSP are guaranteed not to be clobbered by signal or interrupt handlers, so a leaf function may use them without adjusting RSP. Win64 has no red zone.
+
+**Stack alignment (both):** RSP must be 16-byte aligned *before* a call. After `push rbp`, RSP ≡ 0 mod 16. After `sub rsp, frame_size`, RSP must be 16-byte aligned again so the next call's implicit `push return_address` leaves RSP ≡ 8 mod 16 (the ABI invariant on function entry). Thus `frame_size ≡ 0 mod 16`.
+
+### Win64-Specific: Shadow Space & Stack Probing
+
+Win64 reserves 32 bytes of caller-provided "home space" immediately above the return address. The first 4 register arguments (int or FP) correspond to home slots [rbp+16], [rbp+24], [rbp+32], [rbp+40] respectively. Stack-passed arguments sit at [rbp+48] onward.
+
+**Stack probing** (large frames > 4096B):
+```asm
+mov eax, frame_size
+call __chkstk           ; __chkstk probes page-by-page; does NOT adjust rsp
+sub rsp, rax            ; explicit adjustment
+```
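+Touching every page in order guarantees that the guard page is hit before any deeper access, so the OS can grow the stack instead of the function faulting on (or silently writing past) an uncommitted page.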
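The page-by-page behaviour can be pictured with a small model. This is a conceptual sketch only: the real `__chkstk` is a runtime-provided assembly routine, and `demo_probe_count` and `DEMO_PAGE_SIZE` are names made up here.

```c
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u  /* assumed page size */

/* Conceptual model: counts the pages a probe loop would touch while walking
   down from the page holding rsp to the page holding (rsp - frame_size).
   It never skips a page, which is the property that keeps the guard page
   from being jumped over. */
static uint32_t demo_probe_count(uint64_t rsp, uint32_t frame_size) {
  uint64_t target = rsp - frame_size;
  uint64_t page = rsp & ~(uint64_t)(DEMO_PAGE_SIZE - 1u);
  uint32_t touched = 0;
  while (page > target) {
    page -= DEMO_PAGE_SIZE;  /* one page at a time */
    touched++;
  }
  return touched;
}
```

In this model a 12 KiB frame allocated from a page-aligned RSP touches three pages, while a 100-byte frame touches one.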
+
+---
+
+## (d) Lifecycle: func_begin, func_end, bind_param
+
+### x_func_begin_init (helper, called by both func_begin paths)
+
+**Purpose:** Initialize frame/ABI state once per function.
+
+**Pseudo-code:**
+```c
+static void x_func_begin_init(NativeTarget* t, const CGFuncDesc* fd) {
+  X64NativeTarget* a = x64_of(t);
+  a->func = fd;
+  a->loc = fd->text_section_id; // or fd->loc
+  
+  /* Resolve ABI (SysV or Win64) from compiler's OS target. */
+  a->abi = x64_abi_for_os(t->c->target.os);
+  
+  /* Initialize cursors & counters. */
+  a->cum_off = 0;
+  a->max_outgoing = 0;
+  a->next_param_int = 0;
+  a->next_param_fp = 0;
+  a->next_param_stack = 0;
+  a->has_sret = 0;
+  a->is_variadic = 0;
+  a->sret_ptr_slot = NATIVE_FRAME_SLOT_NONE;
+  a->nslots = 0;
+  a->ncallee_saves = 0;
+  a->nalloca = 0;
+  a->known_frame = 0;
+  a->has_alloca = 0;
+  a->frame_final = 0;
+  
+  /* Mark function start in text section. */
+  a->func_start = t->mc->pos(t->mc);
+  a->epilogue_label = t->mc->label_new(t->mc);
+}
+```
+
+### x_func_begin (NativeTarget hook, -O0 single-pass path)
+
+**Purpose:** Reserve the prologue placeholder (filled with NOPs), then allocate the entry frame slots: the sret pointer slot and, for SysV variadic functions, the register-save area.
+
+**Pseudo-code:**
+```c
+static void x_func_begin(NativeTarget* t, const CGFuncDesc* fd) {
+  X64NativeTarget* a = x64_of(t);
+  MCEmitter* mc = t->mc;
+  
+  x_func_begin_init(t, fd);
+  
+  /* Query ABI to populate variadic / sret flags. */
+  const ABIFuncInfo* abi_info = abi_cg_func_info(t->c->abi, fd->fn_type);
+  a->has_sret = abi_info->has_sret ? 1 : 0;
+  a->is_variadic = abi_info->variadic ? 1 : 0;
+  
+  /* Determine prologue size budget based on OS. */
+  u32 prologue_nbytes = (a->abi == &g_x64_abi_win64)
+                          ? X64_PROLOGUE_BYTES_WIN64  // 192
+                          : X64_PROLOGUE_BYTES;       // 96
+  
+  /* Reserve prologue region filled with NOPs. */
+  a->prologue_pos = mc->pos(mc);
+  for (u32 i = 0; i < prologue_nbytes; ++i) {
+    emit1(mc, X64_NOP1);  // 0x90
+  }
+  
+  /* Allocate frame slots for incoming sret pointer, variadic GP save area. */
+  if (a->has_sret) {
+    NativeFrameSlotDesc sret_desc = {
+      .type = /* ptr type */,
+      .size = 8,
+      .align = 8,
+      .kind = NATIVE_FRAME_SLOT_SAVE,
+      .flags = 0,
+    };
+    a->sret_ptr_slot = t->frame_slot(t, &sret_desc);
+  }
+  
+  if (a->is_variadic) {
+    /* SysV variadic: reserve the 176-byte register-save area
+       (6 GPRs * 8 + 8 XMMs * 16) that va_list's reg_save_area points at.
+       Win64 variadic: no special reg-save area (args already in home space). */
+    if (a->abi == &g_x64_abi_sysv) {
+      NativeFrameSlotDesc va_desc = {
+        .type = 0,  // untyped
+        .size = 176,
+        .align = 8,
+        .kind = NATIVE_FRAME_SLOT_SAVE,
+        .flags = 0,
+      };
+      /* Store for later access by va_start_ hook. */
+      a->va_reg_save_slot = t->frame_slot(t, &va_desc);
+    }
+  }
+}
+```
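The reserve-and-patch idea behind `x_func_begin` and `x_func_end` can be sketched on a plain byte buffer. `DemoBuf` and the `demo_*` functions are hypothetical stand-ins for the MC emitter and `obj_patch`; the budget value mirrors `X64_PROLOGUE_BYTES`.

```c
#include <stdint.h>
#include <string.h>

#define DEMO_PROLOGUE_BYTES 96u  /* mirrors X64_PROLOGUE_BYTES */
#define DEMO_NOP1 0x90u          /* one-byte NOP */

/* Hypothetical flat code buffer standing in for the emitter and object. */
typedef struct {
  uint8_t bytes[512];
  uint32_t pos;
} DemoBuf;

/* func_begin: reserve a NOP-filled window and return where it starts,
   or UINT32_MAX if the buffer is too small. */
static uint32_t demo_reserve_prologue(DemoBuf *b) {
  if (b->pos + DEMO_PROLOGUE_BYTES > sizeof b->bytes) return UINT32_MAX;
  uint32_t at = b->pos;
  memset(b->bytes + at, DEMO_NOP1, DEMO_PROLOGUE_BYTES);
  b->pos += DEMO_PROLOGUE_BYTES;
  return at;
}

/* func_end: overwrite the start of the window with the real prologue.
   Bytes past n stay as NOPs. Returns -1 if the prologue exceeds the budget. */
static int demo_patch_prologue(DemoBuf *b, uint32_t at, const uint8_t *pro, uint32_t n) {
  if (n > DEMO_PROLOGUE_BYTES) return -1;
  memcpy(b->bytes + at, pro, n);
  return 0;
}
```

The unused tail of the window executes as NOPs, which is why the budget has to cover the worst-case prologue while staying small enough not to bloat every function.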
+
+### x_func_begin_known_frame (NativeTarget hook, optimizer path)
+
+**Purpose:** Called with a pre-computed frame layout; emit the final prologue immediately (no placeholder).
+
+**Pseudo-code:**
+```c
+static void x_func_begin_known_frame(
+    NativeTarget* t,
+    const CGFuncDesc* fd,
+    const NativeKnownFrameDesc* frame,
+    NativeFrameSlot* out_slots) {
+  X64NativeTarget* a = x64_of(t);
+  MCEmitter* mc = t->mc;
+  
+  x_func_begin_init(t, fd);
+  a->known_frame = 1;
+  
+  /* ABI info. */
+  const ABIFuncInfo* abi_info = abi_cg_func_info(t->c->abi, fd->fn_type);
+  a->has_sret = abi_info->has_sret ? 1 : 0;
+  a->is_variadic = abi_info->variadic ? 1 : 0;
+  
+  /* Allocate frame slots from the known-frame descriptor. */
+  for (u32 i = 0; i < frame->nslots; ++i) {
+    NativeFrameSlot fs = t->frame_slot(t, &frame->slots[i]);
+    if (out_slots) out_slots[i] = fs;
+  }
+  
+  /* Populate frame dimensions. */
+  a->cum_off = 0;  // already accounted in frame->slots
+  a->max_outgoing = frame->max_outgoing;
+  a->has_alloca = frame->has_alloca ? 1 : 0;
+  
+  /* Compute frame size from callee-saves and slot sum. */
+  /* Collect the allocator's assigned callee-saved masks. */
+  u32 cs_int_mask = frame->callee_saved_used[NATIVE_REG_INT];
+  u32 cs_fp_mask = frame->callee_saved_used[NATIVE_REG_FP];
+  
+  u32 frame_size = x_compute_frame_size(a, cs_int_mask, cs_fp_mask);
+  a->frame_size_final = frame_size;
+  
+  /* Check if prologue can be omitted entirely (leaf function, no frame needed). */
+  if (x_can_omit_frame(a, cs_int_mask, cs_fp_mask)) {
+    a->frame_final = 1;
+    return;
+  }
+  
+  /* Emit final prologue (no placeholder, no patching). */
+  u8 buf[X64_PROLOGUE_BYTES_WIN64];
+  a->prologue_pos = mc->pos(mc);
+  u32 chkstk_disp_pos = (u32)-1;
+  u32 nbytes = x_build_prologue(t, buf, sizeof buf, frame_size,
+                                 cs_int_mask, cs_fp_mask, &chkstk_disp_pos);
+  mc->emit_bytes(mc, buf, nbytes);
+  if (chkstk_disp_pos != (u32)-1) {
+    ObjSymId chk = x_chkstk_sym(t);
+    mc->emit_reloc_at(mc, mc->section_id, a->prologue_pos + chkstk_disp_pos,
+                      R_X64_PLT32, chk, -4, 1, 0);
+  }
+  
+  a->frame_final = 1;
+}
+```
+
+### x_build_prologue (helper)
+
+**Purpose:** Generate prologue bytes: `push rbp; mov rbp, rsp; [chkstk]; sub rsp, frame_size; [sret spill]; [callee-save spills]`.
+
+**Structure (from legacy emit.c):**
+
+```c
+static u32 x_build_prologue(
+    NativeTarget* t,
+    u8* buf,
+    u32 cap,
+    u32 frame_size,
+    u32 cs_int_mask,      // bitmask of used callee-saved GPRs
+    u32 cs_fp_mask,       // bitmask of used callee-saved XMMs (Win64)
+    u32* chkstk_disp_pos_out) {
+  X64NativeTarget* a = x64_of(t);
+  u32 wi = 0;
+  
+  if (chkstk_disp_pos_out) *chkstk_disp_pos_out = (u32)-1;
+  
+  // 1. push rbp (1 byte)
+  buf[wi++] = 0x55;
+  
+  // 2. mov rbp, rsp (3 bytes: REX.W 89 E5)
+  buf[wi++] = X64_REX_BASE | X64_REX_W;
+  buf[wi++] = 0x89;
+  buf[wi++] = modrm(3, X64_RSP, X64_RBP);  // 0xE5
+  
+  // 3. (Win64 only) chkstk if frame_size > 4096
+  if (a->abi->shadow_space && frame_size > X64_WIN64_CHKSTK_THRESHOLD) {
+    // mov eax, frame_size (5 bytes: B8 imm32)
+    buf[wi++] = 0xB8;
+    wr_u32_le(buf + wi, frame_size);
+    wi += 4;
+    
+    // call __chkstk (5 bytes: E8 disp32)
+    buf[wi++] = 0xE8;
+    if (chkstk_disp_pos_out) *chkstk_disp_pos_out = wi;
+    wi += 4;  // disp32 patched by caller
+    
+    // sub rsp, rax (3 bytes: REX.W 29 C4)
+    buf[wi++] = X64_REX_BASE | X64_REX_W;
+    buf[wi++] = 0x29;
+    buf[wi++] = modrm(3, X64_RAX, X64_RSP);
+  } else {
+    // sub rsp, frame_size (7 bytes: REX.W 81 EC imm32)
+    buf[wi++] = X64_REX_BASE | X64_REX_W;
+    buf[wi++] = 0x81;
+    buf[wi++] = modrm(3, 5, X64_RSP);  // /5 for SUB
+    wr_u32_le(buf + wi, frame_size);
+    wi += 4;
+  }
+  
+  // 4. Spill sret pointer (if present)
+  if (a->has_sret && a->sret_ptr_slot != NATIVE_FRAME_SLOT_NONE) {
+    X64NativeSlot* s = x64_slot_get(a, a->sret_ptr_slot);
+    u32 sret_reg = a->abi->int_args[0];  // RDI (SysV) or RCX (Win64)
+    i32 off = -(i32)s->off;
+    // mov [rbp + disp32], sret_reg  (7 bytes)
+    buf[wi++] = X64_REX_BASE | X64_REX_W | ((sret_reg & 8) ? X64_REX_R : 0);
+    buf[wi++] = 0x89;
+    buf[wi++] = modrm(2, sret_reg & 7, X64_RBP);
+    wr_u32_le(buf + wi, (u32)off);
+    wi += 4;
+  }
+  
+  // 5. Spill callee-saved GPRs
+  u32 xmm_base = (a->has_sret ? 8 : 0) + a->cum_off;
+  u32 cs_fp_count = __builtin_popcount(cs_fp_mask);
+  for (u32 reg = 0; reg < 16; ++reg) {
+    if (!(cs_int_mask & (1u << reg))) continue;
+    // Position in save order = number of saved GPRs with a lower encoding.
+    u32 idx = __builtin_popcount(cs_int_mask & ((1u << reg) - 1));
+    i32 off = -(i32)(xmm_base + cs_fp_count * 16 + (idx + 1) * 8);
+    // mov [rbp + disp32], reg  (7 bytes)
+    buf[wi++] = X64_REX_BASE | X64_REX_W | ((reg & 8) ? X64_REX_R : 0);
+    buf[wi++] = 0x89;
+    buf[wi++] = modrm(2, reg & 7, X64_RBP);
+    wr_u32_le(buf + wi, (u32)off);
+    wi += 4;
+  }
+  
+  // 6. Spill callee-saved XMMs (Win64 only)
+  if (a->abi == &g_x64_abi_win64) {
+    for (u32 xmm = 0; xmm < 16; ++xmm) {
+      if (!(cs_fp_mask & (1u << xmm))) continue;
+      u32 idx = __builtin_popcount(cs_fp_mask & ((1u << xmm) - 1));
+      i32 off = -(i32)(xmm_base + (idx + 1) * 16);
+      // movaps [rbp + disp32], xmmN  (7 bytes, 8 with a REX prefix).
+      // movaps faults on unaligned addresses, so xmm_base must be a multiple of 16.
+      u8 rex = (xmm & 8) ? (X64_REX_BASE | X64_REX_R) : 0;
+      if (rex) buf[wi++] = rex;
+      buf[wi++] = 0x0F;
+      buf[wi++] = 0x29;
+      buf[wi++] = modrm(2, xmm & 7, X64_RBP);
+      wr_u32_le(buf + wi, (u32)off);
+      wi += 4;
+    }
+  }
+  
+  return wi;
+}
+```
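For reference, the non-probing prefix of the prologue built above can be checked byte by byte with a standalone sketch. `demo_base_prologue` is an illustrative helper, not part of the backend; it hard-codes the same encodings the pseudo-code derives via `modrm()`.

```c
#include <stdint.h>

/* Writes `push rbp; mov rbp, rsp; sub rsp, imm32` into buf (needs 11 bytes)
   and returns the number of bytes written. */
static uint32_t demo_base_prologue(uint8_t *buf, uint32_t frame_size) {
  uint32_t wi = 0;
  buf[wi++] = 0x55;                                       /* push rbp */
  buf[wi++] = 0x48; buf[wi++] = 0x89; buf[wi++] = 0xE5;   /* mov rbp, rsp */
  buf[wi++] = 0x48; buf[wi++] = 0x81; buf[wi++] = 0xEC;   /* sub rsp, imm32 */
  for (int i = 0; i < 4; ++i)                             /* little-endian imm32 */
    buf[wi++] = (uint8_t)(frame_size >> (8 * i));
  return wi;
}
```

A frame of 0x120 bytes therefore produces `55 48 89 E5 48 81 EC 20 01 00 00`, which fits well inside the 96-byte SysV placeholder.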
+
+### x_func_end (NativeTarget hook)
+
+**Purpose:** Patch prologue (if single-pass), emit epilogue, patch alloca sites, define function symbol.
+
+**Pseudo-code:**
+```c
+static void x_func_end(NativeTarget* t) {
+  X64NativeTarget* a = x64_of(t);
+  MCEmitter* mc = t->mc;
+  
+  if (a->frame_final) {
+    // Known-frame path: prologue already emitted final, skip patching.
+    goto emit_epilogue;
+  }
+  
+  // Single-pass path: collect actual callee-saves used & patch prologue.
+  u32 cs_int_mask = 0, cs_fp_mask = 0;
+  for (u32 i = 0; i < a->ncallee_saves; ++i) {
+    if (a->callee_saves[i].cls == NATIVE_REG_INT) {
+      cs_int_mask |= (1 << a->callee_saves[i].reg);
+    } else {
+      cs_fp_mask |= (1 << a->callee_saves[i].reg);
+    }
+  }
+  
+  u32 frame_size = x_compute_frame_size(a, cs_int_mask, cs_fp_mask);
+  a->frame_size_final = frame_size;
+  
+  if (!x_can_omit_frame(a, cs_int_mask, cs_fp_mask)) {
+    // Patch prologue placeholder.
+    u8 buf[X64_PROLOGUE_BYTES_WIN64];
+    for (u32 i = 0; i < sizeof buf; ++i) buf[i] = 0x90;
+    u32 chkstk_disp_pos = (u32)-1;
+    u32 nbytes = x_build_prologue(t, buf, sizeof buf, frame_size,
+                                   cs_int_mask, cs_fp_mask, &chkstk_disp_pos);
+    obj_patch(t->obj, a->func->text_section_id, a->prologue_pos, buf, nbytes);
+    if (chkstk_disp_pos != (u32)-1) {
+      ObjSymId chk = x_chkstk_sym(t);
+      mc->emit_reloc_at(mc, a->func->text_section_id,
+                        a->prologue_pos + chkstk_disp_pos, R_X64_PLT32, chk, -4, 1, 0);
+    }
+  }
+  
+emit_epilogue:
+  // Place epilogue label (target of tail-call or exception-unwind).
+  mc->label_place(mc, a->epilogue_label);
+  
+  // Restore callee-saved XMMs (Win64).
+  u32 xmm_base = (a->has_sret ? 8 : 0) + a->cum_off;
+  for (i32 i = (i32)a->ncallee_saves - 1; i >= 0; --i) {
+    if (a->callee_saves[i].cls != NATIVE_REG_FP) continue;
+    u32 xmm = a->callee_saves[i].reg;
+    u32 idx = /* position in order */;
+    i32 off = -(i32)(xmm_base + (idx + 1) * 16);
+    emit_sse_load(mc, 0, 0x28, xmm, X64_RBP, off);  // movaps xmm, [rbp+off]
+  }
+  
+  // Restore callee-saved GPRs.
+  u32 cs_fp_count = /* count of XMMs saved */;
+  for (i32 i = (i32)a->ncallee_saves - 1; i >= 0; --i) {
+    if (a->callee_saves[i].cls != NATIVE_REG_INT) continue;
+    u32 reg = a->callee_saves[i].reg;
+    u32 idx = /* position in order */;
+    i32 off = -(i32)(xmm_base + cs_fp_count * 16 + (idx + 1) * 8);
+    emit_mov_load(mc, 8, 0, reg, X64_RBP, off);
+  }
+  
+  // leave; ret (2 bytes)
+  emit_leave(mc);  // 0xC9
+  emit_ret(mc);    // 0xC3
+  
+  // Patch alloca sites with final max_outgoing.
+  for (u32 i = 0; i < a->npatches; ++i) {
+    if (a->patches[i].kind != X64_PATCH_ALLOCA) continue;
+    u8 dbuf[4];
+    wr_u32_le(dbuf, a->max_outgoing);
+    obj_patch(t->obj, a->func->text_section_id, a->patches[i].pos, dbuf, 4);
+  }
+  
+  // Define function symbol.
+  u32 end = mc->pos(mc);
+  obj_symbol_define(t->obj, a->func->sym, a->func->text_section_id,
+                    (u64)a->func_start, (u64)(end - a->func_start));
+  if (a->func->atomize) {
+    obj_atom_define(t->obj, a->func->text_section_id, a->func_start,
+                    end - a->func_start, a->func->sym, 0);
+  }
+  if (t->debug) {
+    debug_func_pc_range(t->debug, a->func->text_section_id, a->func_start, end);
+  }
+  
+  mc->cfi_endproc(mc);
+  mc_end_function(mc);
+  a->func = NULL;
+}
+```
+
+### x_compute_frame_size (helper)
+
+```c
+static u32 x_compute_frame_size(const X64NativeTarget* a,
+                                 u32 cs_int_mask,
+                                 u32 cs_fp_mask) {
+  u32 xmm_base = (a->has_sret ? 8 : 0) + a->cum_off;
+  u32 cs_gpr_count = __builtin_popcount(cs_int_mask);
+  u32 cs_xmm_count = __builtin_popcount(cs_fp_mask);
+  u32 raw = a->max_outgoing + cs_gpr_count * 8 + cs_xmm_count * 16 + xmm_base;
+  u32 frame_size = align_up_u32(raw, 16);
+  return frame_size ? frame_size : 16;  // never 0
+}
+```
+
+### bind_param (NativeTarget hook)
+
+**Purpose:** Move incoming parameter from ABI register/stack location into the caller-selected destination (hard reg or frame slot).
+
+**Pseudo-code (SysV example):**
+
+```c
+static void x_bind_param(NativeTarget* t, const CGParamDesc* p, NativeLoc dst) {
+  X64NativeTarget* a = x64_of(t);
+  MCEmitter* mc = t->mc;
+  
+  if (dst.kind == NATIVE_LOC_NONE) {
+    // Parameter unused; just advance the ABI cursor.
+    x_consume_param_location(a, p->abi);
+    return;
+  }
+  
+  const ABIArgInfo* ai = p->abi;
+  if (!ai || ai->kind == ABI_ARG_IGNORE) return;
+  
+  // Incoming stack bias: the offset from rbp to the first stack-passed arg.
+  // On entry the return address is at [rsp]; after `push rbp; mov rbp, rsp`,
+  // rbp = entry_rsp - 8, so the return address is at [rbp + 8] and the first
+  // incoming stack arg is at [rbp + 16] (SysV); Win64 adds shadow_space (32).
+  i32 incoming_stack_bias = 16 + a->abi->shadow_space;
+  
+  // Handle INDIRECT (byval): incoming is a pointer to the actual data.
+  if (ai->kind == ABI_ARG_INDIRECT) {
+    u32 ptr_reg;
+    if (a->next_param_int < a->abi->n_int_args) {
+      ptr_reg = a->abi->int_args[a->next_param_int++];
+    } else {
+      ptr_reg = X64_R11;  // scratch
+      emit_mov_load(mc, 8, 0, ptr_reg, X64_RBP, incoming_stack_bias + (i32)a->next_param_stack);
+      a->next_param_stack += 8;
+    }
+    
+    // Copy byval data from [ptr_reg] into dst.
+    if (dst.kind == NATIVE_LOC_FRAME) {
+      X64NativeSlot* s = x64_slot_get(a, dst.v.frame);
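+      // memcpy [rbp - s->off], [ptr_reg], p->size in 8-byte chunks; as written this
+      // over-copies when p->size is not a multiple of 8 (a tail needs narrower moves).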
+      u32 nbytes = p->size;
+      for (u32 off = 0; off < nbytes; off += 8) {
+        emit_mov_load(mc, 8, 0, X64_RAX, ptr_reg, (i32)off);
+        emit_mov_store(mc, 8, X64_RAX, X64_RBP, -(i32)s->off + (i32)off);
+      }
+    }
+    return;
+  }
+  
+  // Handle DIRECT: one or more ABI parts (int/FP scalars, or aggregate pieces).
+  if (ai->kind == ABI_ARG_DIRECT || ai->kind == ABI_ARG_EXPAND) {
+    for (u16 i = 0; i < ai->nparts; ++i) {
+      const ABIArgPart* pt = &ai->parts[i];
+      u32 part_size = pt->size;
+      NativeLoc part_dst = dst;  // same destination for all parts (or split across?)
+      
+      if (pt->cls == ABI_CLASS_INT) {
+        u32 src_reg;
+        if (a->next_param_int < a->abi->n_int_args) {
+          src_reg = a->abi->int_args[a->next_param_int++];
+        } else {
+          src_reg = X64_RAX;  // load from stack
+          emit_mov_load(mc, part_size, 0, src_reg, X64_RBP,
+                        incoming_stack_bias + (i32)a->next_param_stack);
+          a->next_param_stack += 8;
+        }
+        
+        // Move src_reg to dst.
+        if (dst.kind == NATIVE_LOC_REG) {
+          if (dst.v.reg != src_reg) {
+            emit_mov_rr(mc, part_size == 8, dst.v.reg & 15, src_reg & 15);
+          }
+        } else if (dst.kind == NATIVE_LOC_FRAME) {
+          X64NativeSlot* s = x64_slot_get(a, dst.v.frame);
+          emit_mov_store(mc, part_size, src_reg, X64_RBP, -(i32)s->off);
+        }
+      } else if (pt->cls == ABI_CLASS_FP) {
+        u32 src_xmm;
+        u8 prefix = (part_size == 8) ? 0xF2 : 0xF3;  // MOVSD vs. MOVSS
+        if (a->next_param_fp < a->abi->n_fp_args) {
+          src_xmm = a->next_param_fp++;
+        } else {
+          src_xmm = 0;  // XMM0
+          emit_sse_load(mc, prefix, 0x10, src_xmm, X64_RBP,
+                        incoming_stack_bias + (i32)a->next_param_stack);
+          a->next_param_stack += 8;
+        }
+        
+        // Move src_xmm to dst.
+        if (dst.kind == NATIVE_LOC_REG) {
+          if (dst.v.reg != src_xmm) {
+            emit_sse_rr(mc, prefix, 0x10, dst.v.reg & 15, src_xmm & 15);
+          }
+        } else if (dst.kind == NATIVE_LOC_FRAME) {
+          X64NativeSlot* s = x64_slot_get(a, dst.v.frame);
+          emit_sse_store(mc, prefix, 0x11, src_xmm, X64_RBP, -(i32)s->off);
+        }
+      }
+    }
+    return;
+  }
+}
+```
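The incoming-stack-argument addressing used by `x_bind_param` reduces to one addition. The helper below is a hypothetical restatement of `incoming_stack_bias + next_param_stack`, with the two 8-byte terms spelled out.

```c
#include <stdint.h>

/* shadow_space: 0 for SysV, 32 for Win64.
   stack_off: running byte offset of the stack-passed argument (0 = first). */
static int32_t demo_incoming_arg_disp(uint32_t shadow_space, uint32_t stack_off) {
  const uint32_t saved_rbp = 8u;    /* [rbp + 0] */
  const uint32_t return_addr = 8u;  /* [rbp + 8] */
  return (int32_t)(saved_rbp + return_addr + shadow_space + stack_off);
}
```

The first stack-passed argument is at `[rbp + 16]` on SysV and `[rbp + 48]` on Win64, matching the frame layout described earlier.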
+
+---
+
+## Summary: Files to Create/Modify
+
+### Create
+- **`src/arch/x64/native.c`** (~2500 lines, modeled after rv64/native.c)
+  - X64NativeTarget struct, lifecycle hooks (func_begin, func_end, bind_param)
+  - Byte-encoder inline wrappers (rv_* → x_* pattern)
+  - All NativeTarget vtable hooks
+  - x64_native_target_new constructor
+
+- **`src/arch/x64/emit.h`** (~300 lines)
+  - Declarations of all byte-encoder functions
+  - Inline helpers (modrm, sib, REX builder)
+  - Immediates-legality checks
+  - No struct/state definitions; pure functions
+
+### Modify
+- **`src/arch/x64/emit.c`** (legacy, commit 429defa)
+  - Move the inline helpers (`make_rex()`, `modrm()`, `sib()`, immediate checks) into emit.h
+  - Keep only the byte-emission function definitions (for asm.c and native.c to link)
+  - Delete XImpl/XSlot struct definitions, XImpl-only helpers
+  - Delete x_* semantic wrappers (they move to native.c)
+
+- **`src/arch/x64/asm.c`**
+  - Replace `#include "arch/x64/internal.h"` with `#include "arch/x64/emit.h"`
+  - No other functional changes
+
+### Delete
+- **`src/arch/x64/internal.h`** (subsumed by native.c's X64NativeTarget)
+- **`src/arch/x64/ops.c`** (legacy vtable; body moves to native.c)
+- **`src/arch/x64/alloc.c`** (legacy frame/param; body moves to native.c)
+- **`src/arch/x64/opt_coord.c`** (legacy reg tables; move to native.c)
+
+### Keep (Already Compiling)
+- `src/arch/x64/isa.h` (opcode constants)
+- `src/arch/x64/isa.c` (disasm)
+- `src/arch/x64/regs.c` (DWARF names)
+- `src/arch/x64/link.c` (linker integration)
+- `src/arch/x64/dbg.c` (debugging)
+- `src/arch/x64/disasm.c` (disassembly)
+- `src/arch/x64/x64.h` (public header if any)
+
+---
+
+## Key References & Constants
+
+From `/Users/ryan/code/cfree/src/arch/x64/isa.h`:
+```c
+enum {
+  X64_RAX = 0, ..., X64_R15 = 15,  /* GPR hardware encoding (not DWARF numbering) */
+  X64_XMM0 = 0, ..., X64_XMM15 = 15,
+  X64_CC_* = 0x0..0xF,  /* condition codes for Jcc/SETcc/CMOVcc */
+};
+#define X64_REX_BASE 0x40u
+#define X64_REX_W 0x08u
+#define X64_REX_R 0x04u
+#define X64_REX_X 0x02u
+#define X64_REX_B 0x01u
+```
+
+From legacy emit.c (`git show 429defa:src/arch/x64/emit.c`):
+```c
+#define X64_PROLOGUE_BYTES 96u           /* SysV budget */
+#define X64_PROLOGUE_BYTES_WIN64 192u    /* Win64 budget */
+#define X64_WIN64_SHADOW_SPACE 32u       /* home space */
+#define X64_WIN64_CHKSTK_THRESHOLD 4096u /* stack probe threshold */
+#define X64_MAX_CS_INT_REGS 7u           /* SysV 5 + Win64 +2 for RDI/RSI */
+#define X64_MAX_CS_FP_REGS 10u           /* Win64 XMM6..15 */
+```
+
+---
+
+## Notes for Implementation
+
+1. **ABI Dispatch:** Use `x64_abi_for_os(t->c->target.os)` once at func_begin to resolve the ABI struct. Store in `a->abi` so all parameter/call logic reads from one place.
+
+2. **Frame Slot Offsets:** RBP-relative offsets are always negative. A slot with `off` bytes is at address `rbp - off`. Incoming stack args are at positive offsets from RBP (e.g., `[rbp + 16 + shadow + byte_off]`).
+
+3. **Emit Patterns:** Follow rv64/native.c's pattern: small inline wrappers (rv_reg_loc, rv_stack_loc, rv_mem_for_type) that construct NativeLoc/MemAccess, then call the byte-encoder (rv_emit_li64, rv_emit_addr_adjust, etc.). This keeps semantic logic tight and encoders reusable.
+
+4. **Placeholder & Patch:** Single-pass -O0 reserves X64_PROLOGUE_BYTES (X64_PROLOGUE_BYTES_WIN64 on Win64) of NOPs at prologue_pos, then patches at func_end once max_outgoing/callee-saves are known. Known-frame emits immediately with the final size.
+
+5. **Callee-Save Tracking:** The allocator provides a bitmask per reg class (int/fp) in reserve_callee_saves or known_frame. Collect from the allocator, then lay them out in x_build_prologue and epilogue in reverse order (highest reg first in the stack layout).
+
+6. **Win64 Specifics:** 
+   - Shadow space: first 4 args get 32 bytes of home slots, so stack args start at rbp+48.
+   - Stack probing: frame > 4096 → `mov eax, N; call __chkstk; sub rsp, rax`.
+   - Callee-saved XMMs: Win64 must save xmm6–15; SysV never saves XMMs (all caller-saved).
+
+7. **Relocation:** Emit R_X64_PLT32 for the __chkstk call site with addend -4 (PC-relative at end of insn).
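For note 5, the slot index of a saved register within its bitmask is the number of saved registers with a lower encoding, which is a population count rather than a trailing-zero count. The sketch below uses a portable loop in place of `__builtin_popcount`; both function names are invented for this example.

```c
#include <stdint.h>

/* Number of set bits in v (portable stand-in for __builtin_popcount). */
static uint32_t demo_popcount(uint32_t v) {
  uint32_t n = 0;
  while (v) {
    v &= v - 1u;  /* clear the lowest set bit */
    n++;
  }
  return n;
}

/* Slot index of `reg` among the saved registers in `mask`: the count of
   saved registers with a lower encoding. Assumes bit `reg` is set. */
static uint32_t demo_cs_index(uint32_t mask, uint32_t reg) {
  return demo_popcount(mask & ((1u << reg) - 1u));
}
```

For a mask holding RBX (3), R12 (12) and R13 (13), the indices are 0, 1 and 2; the first saved register has no lower bits set, a case where a trailing-zero count would be undefined.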
+
+
+
+---
+
+# x64 NativeTarget Backend Port: Register Tables & Legality (GROUP 2)
+
+## Overview
+
+This guide produces the exact x64 register metadata tables (NativePhysRegInfo, NativeAllocClassInfo, NativeRegInfo) required by the NativeTarget contract, along with ABI-abstract routing for SysV and Win64. The source is legacy commit 429defa (`src/arch/x64/{ops,emit,alloc,opt_coord}.c` and `src/arch/x64/internal.h`), which encodes the x64 register landscape and calling conventions.
+
+---
+
+## Register Enumeration (Hardware Model)
+
+From `src/arch/x64/isa.h` (lines 36–68):
+
+**Integer Registers (16 total, hardware ModRM/REX encoding 0–15; not DWARF numbering):**
+```
+X64_RAX = 0,  X64_RCX = 1,  X64_RDX = 2,  X64_RBX = 3,
+X64_RSP = 4,  X64_RBP = 5,  X64_RSI = 6,  X64_RDI = 7,
+X64_R8 = 8,   X64_R9 = 9,   X64_R10 = 10, X64_R11 = 11,
+X64_R12 = 12, X64_R13 = 13, X64_R14 = 14, X64_R15 = 15,
+```
+
+**FP/SSE Registers (16 total, XMM0–XMM15, encoding 0–15):**
+```
+X64_XMM0 = 0 through X64_XMM15 = 15
+```
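The numbers listed above are the hardware encodings that go into ModRM and REX bits. DWARF register numbers order the first eight GPRs differently, so a debug-info layer needs a translation step. The table below follows the System V x86-64 psABI DWARF mapping; the `demo_*` names are invented for this sketch and say nothing about how regs.c does it.

```c
#include <stdint.h>

/* Hardware encoding (array index) -> DWARF register number.
   psABI DWARF order for 0..7 is rax, rdx, rcx, rbx, rsi, rdi, rbp, rsp;
   r8..r15 map to themselves. */
static const uint8_t demo_hw_to_dwarf[16] = {
  0, 2, 1, 3, 7, 6, 4, 5, 8, 9, 10, 11, 12, 13, 14, 15
};

/* The low three bits of an encoding go into a ModRM or SIB field. */
static inline unsigned demo_reg_low3(unsigned reg) { return reg & 7u; }

/* Bit 3 goes into a REX.R, REX.X or REX.B bit, depending on the field. */
static inline unsigned demo_reg_needs_rex(unsigned reg) { return (reg >> 3) & 1u; }
```

For example, RSP has hardware encoding 4 but DWARF number 7, and R12 splits into low bits 4 plus a REX extension bit.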
+
+---
+
+## ABI Constants & SysV vs Win64 Differences
+
+From `git show 429defa:src/arch/x64/emit.c` (lines 15–65):
+
+### SysV x86-64 (Linux, BSD, most Unix):
+- **Int arg regs (6):** RDI, RSI, RDX, RCX, R8, R9
+- **FP arg regs (8):** XMM0–XMM7
+- **Callee-saved int regs:** RBX, RBP, R12, R13, R14, R15 (5 + RBP = 6 total)
+  - cs_int_mask = `(1ull << RBX) | (1ull << RBP) | (1ull << R12) | (1ull << R13) | (1ull << R14) | (1ull << R15)`
+- **Callee-saved FP regs:** none
+  - cs_fp_mask = 0
+- **Return regs:** RAX, RDX (int); XMM0, XMM1 (FP)
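+- **Shadow space:** 0 (stack-passed args sit directly above the return address)
+- **Variadic save:** 176-byte register-save area (6 GPRs × 8 + 8 XMMs × 16), reserved as a frame slot and filled by the prologue; `va_list`'s `reg_save_area` field points at it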
+
+### Win64 x86-64 (Windows):
+- **Int arg regs (4):** RCX, RDX, R8, R9
+- **FP arg regs (4):** XMM0–XMM3
+- **Callee-saved int regs:** RBX, RBP, R12–R15, **RDI, RSI** (7 + RBP = 8 total, 2 more than SysV)
+  - cs_int_mask = `(1ull << RBX) | (1ull << RBP) | (1ull << R12) | (1ull << R13) | (1ull << R14) | (1ull << R15) | (1ull << RDI) | (1ull << RSI)`
+- **Callee-saved FP regs:** XMM6–XMM15 (10 regs)
+  - cs_fp_mask = `(1ull << XMM6) | (1ull << XMM7) | (1ull << XMM8) | ... | (1ull << XMM15)` (all 10 bits set)
+- **Return regs:** RAX (int); XMM0 (FP). Unlike SysV, the Microsoft x64 convention does not return values in RDX or XMM1
+- **Shadow space:** 32 bytes (4 × 8B home slots for first 4 args, even if passed in regs)
+- **Variadic:** no register-save area; a variadic FP argument passed in an XMM register must also be copied into the matching GPR so the callee can spill it to its home slot and read it with `va_arg`
+- **Stack probing:** Win64 requires a `__chkstk` probe call for allocations > 4096 bytes
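The two ABI descriptions above can be captured in a pair of constant tables selected by OS. The sketch below is a hypothetical stand-in for `X64ABIRegs` and `x64_abi_for_os`: the `Demo*` types, the field layout and the OS enum are assumptions, and only the register lists and masks come from this section.

```c
#include <stdint.h>

enum {
  DEMO_RAX, DEMO_RCX, DEMO_RDX, DEMO_RBX, DEMO_RSP, DEMO_RBP, DEMO_RSI, DEMO_RDI,
  DEMO_R8, DEMO_R9, DEMO_R10, DEMO_R11, DEMO_R12, DEMO_R13, DEMO_R14, DEMO_R15
};

#define DEMO_BIT(r) (1u << (r))
#define DEMO_CS_COMMON (DEMO_BIT(DEMO_RBX) | DEMO_BIT(DEMO_RBP) | DEMO_BIT(DEMO_R12) | \
                        DEMO_BIT(DEMO_R13) | DEMO_BIT(DEMO_R14) | DEMO_BIT(DEMO_R15))

/* Hypothetical stand-in for X64ABIRegs; field names are assumptions. */
typedef struct {
  uint8_t int_args[6];
  uint8_t n_int_args;
  uint8_t n_fp_args;     /* FP args use XMM0..XMM(n_fp_args - 1) */
  uint8_t shadow_space;  /* bytes of caller-reserved home space */
  uint32_t cs_int_mask;
  uint32_t cs_fp_mask;
} DemoABIRegs;

static const DemoABIRegs demo_abi_sysv = {
  .int_args = {DEMO_RDI, DEMO_RSI, DEMO_RDX, DEMO_RCX, DEMO_R8, DEMO_R9},
  .n_int_args = 6, .n_fp_args = 8, .shadow_space = 0,
  .cs_int_mask = DEMO_CS_COMMON, .cs_fp_mask = 0u,
};

static const DemoABIRegs demo_abi_win64 = {
  .int_args = {DEMO_RCX, DEMO_RDX, DEMO_R8, DEMO_R9},
  .n_int_args = 4, .n_fp_args = 4, .shadow_space = 32,
  .cs_int_mask = DEMO_CS_COMMON | DEMO_BIT(DEMO_RDI) | DEMO_BIT(DEMO_RSI),
  .cs_fp_mask = 0xFFC0u,  /* XMM6..XMM15 */
};

typedef enum { DEMO_OS_LINUX, DEMO_OS_WINDOWS } DemoOS;  /* hypothetical */

static const DemoABIRegs *demo_abi_for_os(DemoOS os) {
  return os == DEMO_OS_WINDOWS ? &demo_abi_win64 : &demo_abi_sysv;
}
```

Resolving the table once per function keeps every later parameter and call decision a plain field read instead of an OS check.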
+
+---
+
+## Legacy Register Pool Extraction
+
+From `git show 429defa:src/arch/x64/opt_coord.c` (lines 4–49):
+
+### Allocable Register Pools (for optimizer spill/reload):
+```c
+// INT allocable (4 regs for opt, excludes callee-saves used by -O0 single-pass)
+static const Reg x_int_allocable[] = {X64_R13, X64_R14, X64_R15, X64_R10};
+
+// FP allocable (8 regs for opt)
+static const Reg x_fp_allocable[] = {
+    X64_XMM6,  X64_XMM7,  X64_XMM8,  X64_XMM0 + 9,
+    X64_XMM0 + 10, X64_XMM0 + 11, X64_XMM0 + 12, X64_XMM0 + 13
+};
+```
+
+**Key insight:** The optimizer's allocable set is curated to avoid the backend's internal scratch registers (RAX, R11 for int; XMM14, XMM15 for FP), leaving those free for lowering paths.
+
+### Scratch Registers (for emit-internal temporary use):
+```c
+// INT scratch: must NOT overlap allocable or ABI-protected regs
+static const Reg x_int_scratch[] = {X64_RBX, X64_R12};
+
+// FP scratch
+static const Reg x_fp_scratch[] = {X64_XMM0 + 14, X64_XMM15};  // XMM14, XMM15
+```
+
+**Reserve regs (never allocable/scratch):** RAX, RBP (frame ptr), RSP (stack ptr), R11 (internal scratch).
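Because the allocable, scratch and reserved pools must not overlap, a bitmask check is a cheap way to validate the tables. The sketch below is an illustrative self-check using the integer pools listed above; the `demo_*` names and `D_*` constants are invented here.

```c
#include <stdint.h>

/* Hardware encodings of the GPRs named in the pools above. */
enum {
  D_RAX = 0, D_RBX = 3, D_RSP = 4, D_RBP = 5,
  D_R10 = 10, D_R11 = 11, D_R12 = 12, D_R13 = 13, D_R14 = 14, D_R15 = 15
};

static const uint8_t demo_int_allocable[] = {D_R13, D_R14, D_R15, D_R10};
static const uint8_t demo_int_scratch[]   = {D_RBX, D_R12};
static const uint8_t demo_int_reserved[]  = {D_RAX, D_RBP, D_RSP, D_R11};

/* Fold a register list into a bitmask. */
static uint32_t demo_mask(const uint8_t *regs, uint32_t n) {
  uint32_t m = 0;
  for (uint32_t i = 0; i < n; ++i) m |= 1u << regs[i];
  return m;
}

/* Returns 1 when no register appears in more than one pool. */
static int demo_int_pools_disjoint(void) {
  uint32_t a = demo_mask(demo_int_allocable, 4);
  uint32_t s = demo_mask(demo_int_scratch, 2);
  uint32_t r = demo_mask(demo_int_reserved, 4);
  return (a & s) == 0 && (a & r) == 0 && (s & r) == 0;
}
```

A check like this could run as a debug assertion at target construction so that a later edit to one pool cannot silently collide with another.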
+
+---
+
+## NativePhysRegInfo Tables (Legacy to NativeTarget Mapping)
+
+From `git show 429defa:src/arch/x64/opt_coord.c` (lines 52–95), we reconstruct the full register inventory:
+
+### x64 Integer Register Set (16 regs):
+
+```c
+// Full NativePhysRegInfo[] for INT class (all 16 GPRs)
+// The last two fields are (spill_cost, copy_cost); entries with no cost
+// carried over from the legacy table use the default 0.
+
+static const NativePhysRegInfo x64_int_phys[] = {
+  // ABI arg regs (SysV or Win64, resolved via ABIFuncInfo)
+  // SysV: RDI=0, RSI=1, RDX=2, RCX=3, R8=4, R9=5; legacy used abi_index for ordering
+  // Win64: RCX=0, RDX=1, R8=2, R9=3; RDI/RSI not args
+  
+  {X64_RAX, NATIVE_REG_INT, 0xff, NATIVE_REG_RESERVED | NATIVE_REG_RET, 0, 0},
+  // RAX: return value, reserved scratch (not allocable)
+  
+  {X64_RCX, NATIVE_REG_INT, 0xff, NATIVE_REG_CALLER_SAVED, 0, 0},
+  // RCX: caller-saved arg reg on both ABIs (SysV 4th, Win64 1st); 0xff = no fixed ABI position
+  
+  {X64_RDX, NATIVE_REG_INT, 0xff, NATIVE_REG_CALLER_SAVED | NATIVE_REG_RET, 0, 0},
+  // RDX: return value, caller-saved; arg reg on both ABIs (SysV 3rd, Win64 2nd)
+  
+  {X64_RBX, NATIVE_REG_INT, 0xff, NATIVE_REG_CALLEE_SAVED, 0, 0},
+  // RBX: callee-saved (both SysV/Win64)
+  
+  {X64_RSP, NATIVE_REG_INT, 0xff, NATIVE_REG_RESERVED, 0, 0},
+  // RSP: stack pointer, reserved
+  
+  {X64_RBP, NATIVE_REG_INT, 0xff, NATIVE_REG_RESERVED, 0, 0},
+  // RBP: frame pointer, reserved (saved/restored by prologue)
+  
+  {X64_RSI, NATIVE_REG_INT, 0xff, NATIVE_REG_CALLER_SAVED, 0, 0},
+  // RSI: arg reg (SysV, or callee-saved on Win64)
+  
+  {X64_RDI, NATIVE_REG_INT, 0xff, NATIVE_REG_CALLER_SAVED, 0, 0},
+  // RDI: arg reg (SysV, or callee-saved on Win64)
+  
+  {X64_R8, NATIVE_REG_INT, 0xff, NATIVE_REG_CALLER_SAVED, 0, 0},
+  // R8: arg reg (both ABIs)
+  
+  {X64_R9, NATIVE_REG_INT, 0xff, NATIVE_REG_CALLER_SAVED, 0, 0},
+  // R9: arg reg (both ABIs)
+  
+  {X64_R10, NATIVE_REG_INT, 0xff, NATIVE_REG_CALLER_SAVED | NATIVE_REG_TEMP_PREFERRED, 0, 0},
+  // R10: caller-saved, scratch-preferred (used for internal ops)
+  
+  {X64_R11, NATIVE_REG_INT, 0xff, NATIVE_REG_RESERVED | NATIVE_REG_CALLER_SAVED, 0, 0},
+  // R11: reserved scratch (used by emit paths for immediates, not allocable)
+  
+  {X64_R12, NATIVE_REG_INT, 0xff, NATIVE_REG_CALLEE_SAVED, 50, 4},
+  // R12: callee-saved (both ABIs); spill_cost/copy_cost from legacy table
+  
+  {X64_R13, NATIVE_REG_INT, 0xff, NATIVE_REG_CALLEE_SAVED, 50, 4},
+  // R13: callee-saved (both ABIs)
+  
+  {X64_R14, NATIVE_REG_INT, 0xff, NATIVE_REG_CALLEE_SAVED, 50, 4},
+  // R14: callee-saved (both ABIs)
+  
+  {X64_R15, NATIVE_REG_INT, 0xff, NATIVE_REG_CALLEE_SAVED, 50, 4},
+  // R15: callee-saved (both ABIs)
+};
+```
+
+### x64 FP/SSE Register Set (16 regs):
+
+```c
+static const NativePhysRegInfo x64_fp_phys[] = {
+  // Arg/return regs (SysV XMM0–7 are args, all caller-saved; Win64 XMM0–3 args, XMM6–15 callee-saved)
+  
+  {X64_XMM0, NATIVE_REG_FP, 0xff, NATIVE_REG_ALLOCABLE | NATIVE_REG_CALLER_SAVED | NATIVE_REG_ARG | NATIVE_REG_RET, 0, 0},
+  {X64_XMM1, NATIVE_REG_FP, 0xff, NATIVE_REG_ALLOCABLE | NATIVE_REG_CALLER_SAVED | NATIVE_REG_ARG | NATIVE_REG_RET, 0, 0},
+  {X64_XMM2, NATIVE_REG_FP, 0xff, NATIVE_REG_ALLOCABLE | NATIVE_REG_CALLER_SAVED | NATIVE_REG_ARG, 0, 0},
+  {X64_XMM3, NATIVE_REG_FP, 0xff, NATIVE_REG_ALLOCABLE | NATIVE_REG_CALLER_SAVED | NATIVE_REG_ARG, 0, 0},
+  {X64_XMM4, NATIVE_REG_FP, 0xff, NATIVE_REG_ALLOCABLE | NATIVE_REG_CALLER_SAVED | NATIVE_REG_ARG, 0, 0},
+  // XMM4–5 are arg-passable on SysV (arg 5–6 of 8); not args on Win64 but still caller-saved
+  {X64_XMM5, NATIVE_REG_FP, 0xff, NATIVE_REG_ALLOCABLE | NATIVE_REG_CALLER_SAVED | NATIVE_REG_ARG, 0, 0},
+  
+  // XMM6–7: args on SysV (7–8), but on Win64 both are callee-saved (not args)
+  {X64_XMM6, NATIVE_REG_FP, 0xff, NATIVE_REG_ALLOCABLE | NATIVE_REG_CALLER_SAVED | NATIVE_REG_ARG, 0, 0},
+  {X64_XMM7, NATIVE_REG_FP, 0xff, NATIVE_REG_ALLOCABLE | NATIVE_REG_CALLER_SAVED | NATIVE_REG_ARG, 0, 0},
+  
+  // XMM8–15: caller-saved on SysV (not args); callee-saved on Win64
+  {X64_XMM8, NATIVE_REG_FP, 0xff, NATIVE_REG_ALLOCABLE | NATIVE_REG_CALLER_SAVED, 0, 0},
+  {X64_XMM9, NATIVE_REG_FP, 0xff, NATIVE_REG_ALLOCABLE | NATIVE_REG_CALLER_SAVED, 0, 0},
+  {X64_XMM10, NATIVE_REG_FP, 0xff, NATIVE_REG_ALLOCABLE | NATIVE_REG_CALLER_SAVED, 0, 0},
+  {X64_XMM11, NATIVE_REG_FP, 0xff, NATIVE_REG_ALLOCABLE | NATIVE_REG_CALLER_SAVED, 0, 0},
+  {X64_XMM12, NATIVE_REG_FP, 0xff, NATIVE_REG_ALLOCABLE | NATIVE_REG_CALLER_SAVED, 0, 0},
+  {X64_XMM13, NATIVE_REG_FP, 0xff, NATIVE_REG_ALLOCABLE | NATIVE_REG_CALLER_SAVED, 0, 0},
+  
+  // XMM14: caller-saved, reserved for emit scratch (like R11 for int)
+  {X64_XMM14, NATIVE_REG_FP, 0xff, NATIVE_REG_RESERVED | NATIVE_REG_CALLER_SAVED, 0, 0},
+  
+  // XMM15: caller-saved on SysV, callee-saved on Win64; reserved for scratch on both
+  {X64_XMM15, NATIVE_REG_FP, 0xff, NATIVE_REG_RESERVED | NATIVE_REG_CALLER_SAVED, 0, 0},
+};
+```
+
+**Key abi_index note:** The legacy code sets abi_index = 0–5 for SysV arg regs (RDI–R9), 0xff for non-args. The NativeTarget contract uses abi_index for **ordered ABI sequencing** in call/return marshalling. Since x64 routing differs by OS, this must be resolved **per-OS** at initialization: either the per-OS NativeRegInfo is built at native_new time, or the NativeTarget legality hooks query the ABI directly.
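One way to resolve `abi_index` per OS at construction time is to derive it from the ABI's ordered argument-register list. The helper below is a hypothetical sketch of that first option; `demo_fill_abi_index` and `DEMO_ABI_INDEX_NONE` are made-up names.

```c
#include <stdint.h>

#define DEMO_ABI_INDEX_NONE 0xffu  /* mirrors the 0xff "no ABI position" value */

/* Fill idx[reg] with the argument position of each register listed in
   `args`, and DEMO_ABI_INDEX_NONE for every other register. */
static void demo_fill_abi_index(uint8_t idx[16], const uint8_t *args, uint32_t nargs) {
  for (uint32_t r = 0; r < 16; ++r) idx[r] = DEMO_ABI_INDEX_NONE;
  for (uint32_t i = 0; i < nargs; ++i) idx[args[i] & 15u] = (uint8_t)i;
}
```

Fed the SysV order (RDI, RSI, RDX, RCX, R8, R9) it gives RCX index 3, while the Win64 order (RCX, RDX, R8, R9) gives RCX index 0 and leaves RDI unassigned.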
+
+---
+
+## NativeAllocClassInfo Structure Definition
+
+From `src/arch/native_target.h` (lines 95–113), each class needs:
+
+```c
+typedef struct NativeAllocClassInfo {
+  u8 cls;  // NATIVE_REG_INT or NATIVE_REG_FP
+  u8 pad[3];
+  
+  const Reg* allocable;       // array of allocable register enums
+  u32 nallocable;             // count
+  
+  const Reg* scratch;         // array of emit-internal scratch regs
+  u32 nscratch;               // count
+  
+  const NativePhysRegInfo* phys;  // full register inventory for this class
+  u32 nphys;                      // count (16 for INT, 16 for FP)
+  
+  u32 caller_saved_mask;      // bitmask of caller-saved regs in this class
+  u32 callee_saved_mask;      // bitmask of callee-saved regs in this class
+  u32 arg_mask;               // bitmask of argument-passing regs
+  u32 ret_mask;               // bitmask of return-value regs
+  u32 reserved_mask;          // bitmask of reserved (non-allocable) regs
+} NativeAllocClassInfo;
+```
+
+### Concrete x64 INT class:
+
+```c
+// Allocable: {R13, R14, R15, R10}
+static const Reg x64_int_allocable[] = {X64_R13, X64_R14, X64_R15, X64_R10};
+
+// Scratch: {RBX, R12} (not allocable; used internally by emit)
+static const Reg x64_int_scratch[] = {X64_RBX, X64_R12};
+
+static const NativeAllocClassInfo x64_int_class_sysv = {
+  .cls = NATIVE_REG_INT,
+  .allocable = x64_int_allocable,
+  .nallocable = 4,
+  .scratch = x64_int_scratch,
+  .nscratch = 2,
+  .phys = x64_int_phys,
+  .nphys = 16,
+  
+  // SysV: every GPR except RSP is either caller-saved or callee-saved (RBP is callee-saved)
+  .caller_saved_mask = (1u << X64_RAX) | (1u << X64_RCX) | (1u << X64_RDX) |
+                       (1u << X64_RSI) | (1u << X64_RDI) | (1u << X64_R8) |
+                       (1u << X64_R9) | (1u << X64_R10) | (1u << X64_R11),
+  
+  .callee_saved_mask = (1u << X64_RBX) | (1u << X64_RBP) | (1u << X64_R12) |
+                       (1u << X64_R13) | (1u << X64_R14) | (1u << X64_R15),
+  
+  .arg_mask = (1u << X64_RDI) | (1u << X64_RSI) | (1u << X64_RDX) |
+              (1u << X64_RCX) | (1u << X64_R8) | (1u << X64_R9),  // 6 args
+  
+  .ret_mask = (1u << X64_RAX) | (1u << X64_RDX),  // RAX, RDX
+  
+  .reserved_mask = (1u << X64_RSP) | (1u << X64_RBP) | (1u << X64_RAX) |
+                   (1u << X64_R11),  // RAX and R11 are emit-internal scratch
+};
+
+static const NativeAllocClassInfo x64_int_class_win64 = {
+  .cls = NATIVE_REG_INT,
+  .allocable = x64_int_allocable,  // {R13, R14, R15, R10} same
+  .nallocable = 4,
+  .scratch = x64_int_scratch,
+  .nscratch = 2,
+  .phys = x64_int_phys,
+  .nphys = 16,
+  
+  // Win64: RDI/RSI are callee-saved (unlike SysV)
+  .caller_saved_mask = (1u << X64_RAX) | (1u << X64_RCX) | (1u << X64_RDX) |
+                       (1u << X64_R8) | (1u << X64_R9) | (1u << X64_R10) |
+                       (1u << X64_R11),
+  
+  .callee_saved_mask = (1u << X64_RBX) | (1u << X64_RBP) | (1u << X64_R12) |
+                       (1u << X64_R13) | (1u << X64_R14) | (1u << X64_R15) |
+                       (1u << X64_RDI) | (1u << X64_RSI),  // +2 vs SysV
+  
+  .arg_mask = (1u << X64_RCX) | (1u << X64_RDX) | (1u << X64_R8) |
+              (1u << X64_R9),  // 4 args
+  
+  .ret_mask = (1u << X64_RAX) | (1u << X64_RDX),
+  
+  .reserved_mask = (1u << X64_RSP) | (1u << X64_RBP) | (1u << X64_RAX) |
+                   (1u << X64_R11),
+};
+```
+
+### Concrete x64 FP class:
+
+```c
+static const Reg x64_fp_allocable[] = {
+    X64_XMM6, X64_XMM7, X64_XMM8, X64_XMM9,
+    X64_XMM10, X64_XMM11, X64_XMM12, X64_XMM13
+};
+
+static const Reg x64_fp_scratch[] = {X64_XMM14, X64_XMM15};
+
+// SysV: all XMMs are caller-saved; none are callee-saved
+static const NativeAllocClassInfo x64_fp_class_sysv = {
+  .cls = NATIVE_REG_FP,
+  .allocable = x64_fp_allocable,
+  .nallocable = 8,
+  .scratch = x64_fp_scratch,
+  .nscratch = 2,
+  .phys = x64_fp_phys,
+  .nphys = 16,
+  
+  .caller_saved_mask = 0xFFFF,  // all 16 XMMs
+  .callee_saved_mask = 0,        // none
+  
+  .arg_mask = 0xFF,  // XMM0–7
+  
+  .ret_mask = (1u << X64_XMM0) | (1u << X64_XMM1),
+  
+  .reserved_mask = (1u << X64_XMM14) | (1u << X64_XMM15),
+};
+
+// Win64: XMM0–5 caller-saved (4 args + 2 more), XMM6–15 callee-saved
+static const NativeAllocClassInfo x64_fp_class_win64 = {
+  .cls = NATIVE_REG_FP,
+  .allocable = x64_fp_allocable,  // XMM6–13 (all callee-saved on Win64; the prologue must save any that are used)
+  .nallocable = 8,
+  .scratch = x64_fp_scratch,
+  .nscratch = 2,
+  .phys = x64_fp_phys,
+  .nphys = 16,
+  
+  .caller_saved_mask = 0x3F,  // XMM0–5 (XMM0–3 carry args; XMM4–5 are volatile but not args)
+  .callee_saved_mask = 0xFFC0,  // XMM6–15 (10 regs)
+  
+  .arg_mask = 0x0F,  // XMM0–3 (4 args only)
+  
+  .ret_mask = (1u << X64_XMM0) | (1u << X64_XMM1),
+  
+  .reserved_mask = (1u << X64_XMM14) | (1u << X64_XMM15),
+};
+```
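+
+Hand-written masks like these are easy to get subtly wrong, so a cheap consistency check at startup can help. The sketch below is an assumption about what such a check could look like; `ToyClassMasks` and `toy_masks_consistent` are hypothetical names, and the rules it enforces (no register both caller- and callee-saved, argument and return registers must be caller-saved) are general ABI properties rather than anything the NativeTarget contract is known to require.
+
+```c
+#include <assert.h>
+#include <stdint.h>
+
+typedef struct {
+  uint32_t caller_saved, callee_saved, arg, ret, reserved;
+} ToyClassMasks;
+
+/* Returns 1 when the masks agree with each other for a class of nregs registers. */
+static int toy_masks_consistent(const ToyClassMasks* m, unsigned nregs) {
+  assert(nregs >= 1u && nregs <= 32u);
+  uint32_t all = (nregs == 32u) ? 0xffffffffu : ((1u << nregs) - 1u);
+  if (m->caller_saved & m->callee_saved) return 0;           /* cannot be both */
+  if ((m->caller_saved | m->callee_saved) & ~all) return 0;  /* bit out of range */
+  if (m->reserved & ~all) return 0;                          /* bit out of range */
+  if (m->arg & ~m->caller_saved) return 0;  /* arg regs are clobbered by calls */
+  if (m->ret & ~m->caller_saved) return 0;  /* so are return regs */
+  return 1;
+}
+```
+
+Fed the FP values from the tables above (for example caller `0x3F`, callee `0xFFC0`, arg `0x0F` for Win64), the check passes; an accidental overlap such as callee `0xFFE0` is rejected.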
+
+---
+
+## NativeRegInfo & ABI-Abstract Initialization
+
+From `src/arch/native_target.h` (lines 115–124):
+
+```c
+typedef struct NativeRegInfo {
+  const NativeAllocClassInfo* classes;  // array of per-class info
+  u32 nclasses;                        // 2: INT + FP (VEC optional)
+  
+  int (*resolve_name)(const NativeRegInfo*, Sym name, Reg* out, NativeAllocClass* cls_out);
+  const char* (*debug_name)(const NativeRegInfo*, NativeAllocClass, Reg);
+  u32 (*dwarf_reg)(const NativeRegInfo*, NativeAllocClass, Reg);
+} NativeRegInfo;
+```
+
+### x64 Implementation Sketch (per-OS):
+
+```c
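+// Per-OS class tables. They live at file scope so the NativeRegInfo
+// initializers below can reference them. Note: strict C does not accept one
+// static object as the initializer of another, so the elements may need to be
+// written as brace initializers in place (or the arrays filled at init time).
+static const NativeAllocClassInfo x64_classes_sysv[2] = {x64_int_class_sysv, x64_fp_class_sysv};
+static const NativeAllocClassInfo x64_classes_win64[2] = {x64_int_class_win64, x64_fp_class_win64};
+
+// Helper: dispatch to SysV or Win64 based on compiler OS
+static const NativeAllocClassInfo* x64_get_classes(Compiler* c, u32* out_nclasses) {
+  *out_nclasses = 2;
+  return (c->target.os == CFREE_OS_WINDOWS) ? x64_classes_win64 : x64_classes_sysv;
+}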
+
+static int x64_resolve_name(const NativeRegInfo* ri, Sym name, Reg* out, NativeAllocClass* cls_out) {
+  // Use regs.c x64_register_hw_index or x64_register_index (which returns DWARF index)
+  // to resolve name to hardware register number, then classify as INT or FP
+  Slice ns = pool_slice(/* compiler pool */, name);
+  // ... (see alloc.c x_resolve_reg_name for full mapping)
+}
+
+// Hardware encoding order is rax, rcx, rdx, rbx, rsp, rbp, rsi, rdi, r8–r15.
+// DWARF orders the low eight GPRs as rax, rdx, rcx, rbx, rsi, rdi, rbp, rsp,
+// so the two numberings differ and need a table.
+static const u8 x64_gpr_hw_to_dwarf[16] = {0, 2, 1, 3, 7, 6, 4, 5, 8, 9, 10, 11, 12, 13, 14, 15};
+
+static u32 x64_dwarf_reg(const NativeRegInfo* ri, NativeAllocClass cls, Reg reg) {
+  // DWARF numbering: GPRs via the table above; XMM0–15 = DWARF 17–32
+  if (cls == NATIVE_REG_INT) return x64_gpr_hw_to_dwarf[reg & 0xFu];
+  if (cls == NATIVE_REG_FP) return 17u + (u32)reg;
+  return 0xffffffffu;
+}
+
+static const char* x64_debug_name(const NativeRegInfo* ri, NativeAllocClass cls, Reg reg) {
+  // Return assembler name: "rax", "xmm0", etc. regs.c x64_register_name() takes a
+  // DWARF index, so translate the hardware index first.
+  if (cls != NATIVE_REG_INT && cls != NATIVE_REG_FP) return NULL;
+  return x64_register_name(x64_dwarf_reg(ri, cls, reg));
+}
+
+static NativeRegInfo x64_reg_info_sysv = {
+  .classes = x64_classes_sysv,
+  .nclasses = 2,
+  .resolve_name = x64_resolve_name,
+  .debug_name = x64_debug_name,
+  .dwarf_reg = x64_dwarf_reg,
+};
+
+static NativeRegInfo x64_reg_info_win64 = {
+  .classes = x64_classes_win64,
+  .nclasses = 2,
+  .resolve_name = x64_resolve_name,
+  .debug_name = x64_debug_name,
+  .dwarf_reg = x64_dwarf_reg,
+};
+
+// At x64_native_target_new(Compiler* c, ...):
+const NativeRegInfo* x64_get_reg_info(Compiler* c) {
+  return (c->target.os == CFREE_OS_WINDOWS) ? &x64_reg_info_win64 : &x64_reg_info_sysv;
+}
+```
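+
+Hardware encodings and DWARF register numbers do not line up for the low eight GPRs, so the translation is usually a lookup table. The following is a self-contained sketch of that mapping; `toy_gpr_hw_to_dwarf` and `toy_dwarf_reg` are illustrative names, while the numbering itself follows the x86-64 psABI DWARF register table.
+
+```c
+#include <assert.h>
+#include <stdint.h>
+
+/* Index: hardware encoding (rax rcx rdx rbx rsp rbp rsi rdi r8..r15).
+ * Value: DWARF number  (rax=0 rdx=1 rcx=2 rbx=3 rsi=4 rdi=5 rbp=6 rsp=7). */
+static const uint8_t toy_gpr_hw_to_dwarf[16] = {
+  0, 2, 1, 3, 7, 6, 4, 5, 8, 9, 10, 11, 12, 13, 14, 15
+};
+
+/* Translates a hardware register index to its DWARF number.
+ * XMM0..XMM15 occupy DWARF 17..32. */
+static uint32_t toy_dwarf_reg(int is_fp, unsigned hw) {
+  assert(hw < 16u);
+  return is_fp ? 17u + hw : toy_gpr_hw_to_dwarf[hw];
+}
+```
+
+Only rax, rbx and r8–r15 map to themselves; rsp (hardware 4) becomes DWARF 7, for instance, which matters for CFI and debug-info correctness.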
+
+---
+
+## Operand & Addressing Legality (class_for_type, imm_legal, addr_legal)
+
+### class_for_type:
+
+```c
+static NativeAllocClass x64_class_for_type(NativeTarget* nt, CfreeCgTypeId type) {
+  // Dispatch on type_is_fp_scalar / type_is_fp_double
+  CfreeCgTypeInfo ti = cg_type_info(nt->c, type);
+  if (ti.scalar_kind == ABI_SC_FLOAT || ti.scalar_kind == ABI_SC_DOUBLE) {
+    // But this NativeTarget is -O0 direct emission, so no intrinsic SIMD yet
+    // For now: all FP → NATIVE_REG_FP
+    return NATIVE_REG_FP;
+  }
+  return NATIVE_REG_INT;
+}
+```
+
+### imm_legal:
+
+From `git show 429defa:src/arch/x64/ops.c` (immediate-legality check patterns):
+
+```c
+static int x64_imm_legal(NativeTarget* nt, NativeImmUse use, u32 op, CfreeCgTypeId type, i64 imm) {
+  // x64 immediates:
+  // - MOV: imm32 sign-extended to 64, or movabs imm64 (10 bytes, expensive)
+  // - ALU (ADD/SUB/AND/OR/XOR/CMP): imm8 (1 byte) or imm32 (sign-extended to 64)
+  // - SHIFT: imm8 (1 byte) or CL register
+  // - Large imms: must use movabs or two-step (mov imm32, shift+or imm32)
+  
+  switch (use) {
+    case NATIVE_IMM_MOVE:
+      // MOV is always legal; will use movabs if imm doesn't fit i32
+      return 1;
+    case NATIVE_IMM_BINOP:
+      // ALU operand: i8 is a subset of i32, so one range check covers both;
+      // the emitter picks the short 83 /sub form when the value fits i8
+      return imm >= -2147483648LL && imm <= 2147483647LL;
+    case NATIVE_IMM_CMP:
+      // CMP is an ALU op, same rules
+      return imm >= -2147483648LL && imm <= 2147483647LL;
+    case NATIVE_IMM_ADDR_OFFSET:
+      // disp32 in addressing mode: [base + disp32]
+      return imm >= -2147483648LL && imm <= 2147483647LL;
+    default:
+      return 0;
+  }
+}
+```
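+
+Legality only says whether an immediate can be encoded at all; the emitter still has to choose between the short and long ALU forms. The sketch below shows that choice in isolation. `ToyAluImmForm`, `toy_alu_imm_form` and `toy_alu_imm_opcode` are hypothetical names; the opcodes `0x83` (imm8) and `0x81` (imm32) are the standard x86 group-1 encodings.
+
+```c
+#include <assert.h>
+#include <stdint.h>
+
+typedef enum { TOY_ALU_IMM8, TOY_ALU_IMM32, TOY_ALU_NEEDS_REG } ToyAluImmForm;
+
+static int toy_fits_i8(int64_t v)  { return v >= INT8_MIN && v <= INT8_MAX; }
+static int toy_fits_i32(int64_t v) { return v >= INT32_MIN && v <= INT32_MAX; }
+
+/* Picks the shortest encodable form for an ALU immediate. */
+static ToyAluImmForm toy_alu_imm_form(int64_t imm) {
+  if (toy_fits_i8(imm)) return TOY_ALU_IMM8;    /* 83 /sub ib, sign-extended */
+  if (toy_fits_i32(imm)) return TOY_ALU_IMM32;  /* 81 /sub id, sign-extended */
+  return TOY_ALU_NEEDS_REG;                     /* load into a scratch reg first */
+}
+
+/* Primary opcode byte for the chosen form; only valid for encodable forms. */
+static uint8_t toy_alu_imm_opcode(ToyAluImmForm form) {
+  assert(form != TOY_ALU_NEEDS_REG);
+  return form == TOY_ALU_IMM8 ? 0x83u : 0x81u;
+}
+```
+
+A value such as `0x80000000` (as a positive 64-bit number) falls into the third case: it does not survive sign extension from 32 bits, so it has to go through a register.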
+
+### addr_legal:
+
+From legacy ops.c (addr_mode, emit_global_lea patterns):
+
+```c
+static int x64_addr_legal(NativeTarget* nt, const NativeAddr* addr, MemAccess mem) {
+  // x64 addressing: [base + index<<scale + disp32]
+  // - base: any GPR (or RIP for RIP-relative literals)
+  // - index: any GPR except RSP (SIB index encoding 100b means "no index")
+  // - scale: 1, 2, 4, 8 (log2_scale 0–3)
+  // - disp32: signed i32 (-2^31 .. 2^31-1)
+  
+  // The NativeDirect path materializes index into a register before the load,
+  // so we only check validity of the mode itself (no index register aliasing).
+  
+  if (addr->base_kind == NATIVE_ADDR_BASE_NONE) return 0;
+  if (addr->offset < -2147483648LL || addr->offset > 2147483647LL) return 0;
+  
+  // x64 permits index scaling; log2_scale is 0–3 (scales 1, 2, 4, 8)
+  if (addr->log2_scale > 3) return 0;
+  
+  // All base kinds (REG, FRAME, FRAME_VALUE, GLOBAL) are valid;
+  // the backend resolves them to physical addresses before emission.
+  return 1;
+}
+```
+
+---
+
+## Reserved Registers & Scratch Strategy
+
+From `git show 429defa:src/arch/x64/internal.h` (prologue budgets, register counts):
+
+**Reserved (not allocable by register allocator):**
+- RAX: return value, emit-internal scratch for immediates (movabs, load-const relocs)
+- R11: emit-internal scratch for immediates and address calculations
+- RBP: frame pointer (saved by the prologue, restored by the epilogue)
+- RSP: stack pointer
+
+**Callee-saved GPRs (both ABIs, saved/restored automatically by prologue/epilogue):**
+- RBX, R12, R13, R14, R15 (5 regs on SysV)
+- **Win64 adds:** RDI, RSI (2 extra = 7 total)
+
+**Callee-saved XMMs (Win64 only; SysV none):**
+- XMM6–XMM15 (10 regs on Win64; must be saved if used)
+
+---
+
+## Practical Pseudo-Code: Materialization in NativeTarget Hooks
+
+### bind_param (incoming argument binding):
+
+```c
+void x64_bind_param(NativeTarget* nt, const CGParamDesc* pd, NativeLoc dst) {
+  // Query ABIFuncInfo for the current function's parameter layout
+  const ABIFuncInfo* abi = abi_cg_func_info(nt->c->abi, /* fn_type */);
+  const ABIArgInfo* ai = &abi->params[param_index];  // resolved by allocator
+  
+  // Materialize the source location from the ABI (register or stack)
+  NativeLoc src = {0};
+  if (ai->kind == ABI_ARG_INDIRECT) {
+    // Sret or byval: caller passes the address in the next int arg register (for sret, the first)
+    u32 arg_idx = x64_next_param_int++;  // 0 = RDI/RCX, etc.
+    src.kind = NATIVE_LOC_REG;
+    src.cls = NATIVE_REG_INT;
+    src.v.reg = get_int_arg_reg(abi, arg_idx);
+  } else if (ai->kind == ABI_ARG_DIRECT) {
+    // Direct args: split across registers and stack per ABIArgPart
+    for (u16 i = 0; i < ai->nparts; ++i) {
+      const ABIArgPart* part = &ai->parts[i];
+      if (part->cls == ABI_CLASS_INT && x64_next_param_int < abi->n_int_args) {
+        src.kind = NATIVE_LOC_REG;
+        src.cls = NATIVE_REG_INT;
+        src.v.reg = get_int_arg_reg(abi, x64_next_param_int++);
+      } else if (part->cls == ABI_CLASS_FP && x64_next_param_fp < abi->n_fp_args) {
+        src.kind = NATIVE_LOC_REG;
+        src.cls = NATIVE_REG_FP;
+        src.v.reg = get_fp_arg_reg(abi, x64_next_param_fp++);
+      } else {
+        // Stack-passed: incoming args live above the saved RBP / return-address pair
+        src.kind = NATIVE_LOC_FRAME;
+        src.v.frame = ...; // compute incoming-stack frame slot
+      }
+    }
+  }
+  
+  // Emit move from src (ABI location) to dst (allocator location)
+  if (dst.kind == NATIVE_LOC_REG) {
+    nt->move(nt, dst, src);
+  } else if (dst.kind == NATIVE_LOC_FRAME) {
+    nt->store(nt, native_addr_of_frame(dst.v.frame), src, mem_access_for_param_type(pd->type));
+  }
+}
+```
+
+### plan_call (outgoing argument marshalling):
+
+```c
+void x64_plan_call(NativeTarget* nt, const NativeCallDesc* desc, NativeCallPlan* plan) {
+  // Query the callee's ABI to decide which registers each argument goes in
+  const ABIFuncInfo* callee_abi = abi_cg_func_info(nt->c->abi, desc->fn_type);
+  
+  u32 stack_arg_size = 0;
+  u32 arg_idx_int = 0, arg_idx_fp = 0;
+  
+  for (u32 i = 0; i < desc->nargs; ++i) {
+    const ABIArgInfo* ai = &callee_abi->params[i];
+    const NativeLoc* arg_loc = &desc->args[i];  // desc is const, so the pointer must be too
+    
+    // Decide where this argument goes (register or stack)
+    for (u16 j = 0; j < ai->nparts; ++j) {
+      const ABIArgPart* part = &ai->parts[j];
+      NativeCallPlanMove* move = &plan->args[plan->nargs++];
+      
+      if (part->cls == ABI_CLASS_INT && arg_idx_int < callee_abi->n_int_args) {
+        move->dst.kind = NATIVE_LOC_REG;
+        move->dst.cls = NATIVE_REG_INT;
+        move->dst.v.reg = get_int_arg_reg(callee_abi, arg_idx_int++);
+      } else if (part->cls == ABI_CLASS_FP && arg_idx_fp < callee_abi->n_fp_args) {
+        move->dst.kind = NATIVE_LOC_REG;
+        move->dst.cls = NATIVE_REG_FP;
+        move->dst.v.reg = get_fp_arg_reg(callee_abi, arg_idx_fp++);
+      } else {
+        // Stack-passed argument
+        move->dst.kind = NATIVE_LOC_STACK;
+        move->dst.v.stack.slot = 0;  // stack offset computed from stack_arg_size
+        move->dst.v.stack.offset = stack_arg_size;
+        stack_arg_size += 8;  // aligned to 8 on x64 (or ABI-specified stack_align)
+      }
+      
+      move->src = *arg_loc;  // pre-materialized by caller
+      move->src_kind = NATIVE_CALL_MOVE_VALUE;
+    }
+  }
+  
+  // Win64: shadow_space is the 32-byte home area; SysV: shadow_space = 0
+  plan->stack_arg_size = align_up(stack_arg_size + callee_abi->shadow_space, 16);
+}
+```
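+
+The final sizing step in `plan_call` is plain arithmetic and can be checked on its own. A minimal sketch, assuming 8-byte stack slots and a 16-byte call-site alignment as in the pseudo-code above; `toy_align_up` and `toy_outgoing_stack_size` are illustrative names.
+
+```c
+#include <assert.h>
+#include <stdint.h>
+
+/* Rounds v up to the next multiple of a; a must be a power of two. */
+static uint32_t toy_align_up(uint32_t v, uint32_t a) {
+  assert(a != 0u && (a & (a - 1u)) == 0u);
+  return (v + a - 1u) & ~(a - 1u);
+}
+
+/* Bytes to reserve below the call: stack-passed args plus shadow space,
+ * rounded so RSP stays 16-byte aligned at the call instruction. */
+static uint32_t toy_outgoing_stack_size(uint32_t n_stack_args, uint32_t shadow_space) {
+  return toy_align_up(n_stack_args * 8u + shadow_space, 16u);
+}
+```
+
+With no shadow space (SysV), three stack arguments need 32 bytes; with a 32-byte shadow area (Win64), one stack argument needs 48.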
+
+---
+
+## Summary: Data Flow for x64 NativeTarget
+
+1. **At init (x64_native_target_new):**
+   - Resolve `const NativeRegInfo* regs = x64_get_reg_info(compiler);`
+   - Select SysV or Win64 class tables based on compiler->target.os
+   - Wire up class_for_type, imm_legal, addr_legal hooks
+
+2. **Per-function (func_begin):**
+   - Read abi_cg_func_info(current function's type) to learn parameter/return layout
+   - Track next_param_int / next_param_fp indices for incoming args
+   - Allocate frame slots for sret pointer, variadic save area, etc.
+
+3. **Per-parameter (bind_param):**
+   - Consult ABIArgInfo for the parameter's class (INT/FP), location (reg/stack), and part layout
+   - Materialize the source from the ABI location (e.g., RDI for SysV param 0)
+   - Emit a move to the allocator-chosen destination
+
+4. **Per-call (plan_call):**
+   - Iterate callee ABIArgInfo to decide arg destination (register or outgoing stack)
+   - Emit setup moves (caller is responsible; NativeTarget validates legality)
+
+5. **Legality (class_for_type, imm_legal, addr_legal):**
+   - class_for_type: FP type → NATIVE_REG_FP, else NATIVE_REG_INT
+   - imm_legal: permit i8 or i32 sign-extended immediates for ALU, disp32 for addresses
+   - addr_legal: validate [base + index<<scale + disp32] format
+
+This design keeps x64 emission **independent of the allocator** and **agnostic to direct vs. optimized lowering**, while delegating ABI decisions to the abi/ layer (which is already OS-aware).
+
+
+
+
+---
+
+# x64 NativeTarget Porting Guide — GROUP 3: Data Movement, ALU, Flags, Convert, Control Flow
+
+## Overview
+
+This guide ports GROUP 3 hooks (move, load_imm, load_const, load_addr, load/store, tls_addr_of, copy_bytes, set_bytes, bitfield_load/store, binop, unop, cmp, convert, alloca_, spill/reload, label/jump, cmp_branch, indirect_branch, load_label_addr) from the disabled x64 legacy backend to the NativeTarget API. The contract is in `src/arch/native_target.h`; working templates are in `src/arch/rv64/native.c` (just finished) and `src/arch/aa64/native.c` (aarch64, both -O0 and -O1+ via optimizer).
+
+x64 is **two-address** (destination is also a source) and uses the **flags register** instead of materializing condition bits. Division, shifts, and multiplication have special implicit-register requirements (rax/rdx, cl, etc.). The byte-level emit helpers in the legacy `emit.c` (kept and still compiling) are reusable; this guide quotes line ranges and function signatures to call.
+
+---
+
+## Key Differences: x64 vs. Templates
+
+1. **Flags-based comparisons**: x64 cmp/test/ucomis set RFLAGS, and jcc/setcc consume it; rv64/aa64 materialize 0/1 directly.
+2. **Two-address ALU**: x64 ALU ops read-modify-write their destination operand; rv64 and aa64 have three-register forms.
+3. **Implicit registers**: div/idiv read and write rdx:rax, one-operand mul/imul write rdx:rax, variable shifts take their count in cl, and cdq/cqo sign-extend rax into rdx.
+4. **Variable-width encoding**: sizes 1/2/4/8 bytes map to distinct opcodes (movsxd for 32→64, movzx for byte/word, movabs for imm64).
+5. **RIP-relative addressing**: a RIP-relative disp32 is measured from the end of the instruction, so relocations against it (R_PC32, R_X64_PLT32, R_X64_REX_GOTPCRELX) carry a -4 addend when the disp32 is the instruction's last field.
+6. **ABI dual-path**: SysV vs. Win64 differ in arg regs, shadow space, callee-save masks. Use `abi_cg_func_info()` to abstract.
+
+---
+
+## Kept Emit.c Encoders (Reusable Byte-Level Helpers)
+
+File: `git show 429defa:src/arch/x64/emit.c`
+
+### Low-Level Primitives
+
+- **`x64_make_rex(w, reg, index, rm)`** (isa.h:374): Build REX byte or return 0.
+- **`x64_pack_rex(out, w, reg, index, rm)`** (isa.h:474): Emit optional REX.
+- **`x64_pack_mem(out, reg, base, disp)`** (isa.h:418): ModR/M + disp for `[base+disp]`.
+- **`x64_pack_mem_sib(out, reg, base, index, log2_scale, disp)`** (isa.h:442): ModR/M + SIB for `[base+index*scale+disp]`.
+- **`x64_pack_rm_reg(out, reg, rm)`** (isa.h:468): ModR/M for reg-reg (mod=3).
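+
+For orientation, this is roughly what a REX builder with the shape of `x64_make_rex` computes. It is a sketch of the standard prefix layout (`0100WRXB`), written under the assumption that the real helper takes full 0–15 register numbers; `toy_make_rex` is a hypothetical stand-in and ignores the byte-register case (spl/bpl/sil/dil), where a bare `0x40` prefix is required.
+
+```c
+#include <assert.h>
+#include <stdint.h>
+
+/* Returns the REX prefix byte, or 0 when no prefix is needed.
+ * w:     1 selects 64-bit operand size (REX.W)
+ * reg:   ModR/M.reg register   (bit 3 -> REX.R)
+ * index: SIB index register    (bit 3 -> REX.X)
+ * rm:    ModR/M.rm or SIB base (bit 3 -> REX.B) */
+static uint8_t toy_make_rex(int w, unsigned reg, unsigned index, unsigned rm) {
+  assert(reg < 16u && index < 16u && rm < 16u);
+  uint8_t rex = (uint8_t)(0x40u | ((w ? 1u : 0u) << 3) | ((reg >> 3) << 2) |
+                          ((index >> 3) << 1) | (rm >> 3));
+  return rex == 0x40u ? 0 : rex;
+}
+```
+
+For example, a 64-bit operation on two low registers yields `0x48`, and the same operation with `r8` in the r/m position yields `0x49`.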
+
+### Instruction Encoders (Called from emit.c, wrapped by native.c)
+
+Each returns byte count; caller reserves ≥16 bytes in buffer.
+
+**Movement & Loads:**
+- **`x64_mov_ri_pack()`** (isa.h:552): MOV r, imm32/imm64 (B8+rd).
+- **`x64_mov_rm_load_pack()`** (isa.h:572): MOV r, [base+disp] or LEA.
+- **`x64_movzx_rr_pack()`** (isa.h:594): MOVZX/MOVSX r,r (0F B6/B7/BE/BF).
+- **`x64_movsxd_pack()`** (isa.h:611): MOVSXD r64, r32 (REX.W 63).
+
+**SSE (Scalar FP):**
+- **`x64_sse_rr_pack()`** (isa.h:743): SSE reg-reg with optional prefix (0x66/0xF2/0xF3).
+- **`x64_sse_mem_pack()`** (isa.h:760): SSE load/store via [base+disp].
+
+**ALU:**
+- **`x64_alu_rr_pack()`** (isa.h:515): op r/m, r (MOV/ADD/SUB/AND/OR/XOR/CMP/TEST).
+- **`x64_alu_rm_pack()`** (isa.h:534): op [base+disp], r (memory form).
+- **`x64_alu_imm8_pack()`** (isa.h:625): op r/m, imm8 (83 /sub).
+- **`x64_alu_imm32_pack()`** (isa.h:640): op r/m, imm32 (81 /sub).
+- **`x64_imul_rr_pack()`** (isa.h:654): IMUL r, r (0F AF).
+- **`x64_imul_rri_pack()`** (isa.h:670): IMUL r, r, imm (69/6B).
+- **`x64_f7_rm_pack()`** (isa.h:687): F7 /sub (NOT/NEG/MUL/IMUL/DIV/IDIV).
+- **`x64_shift_imm_pack()`** (isa.h:701): SHL/SHR/SAR r, imm8 (C1).
+- **`x64_shift_cl_pack()`** (isa.h:715): SHL/SHR/SAR r, cl (D3).
+
+**Branches & Setcc:**
+- **`x64_setcc_pack()`** (isa.h:727): SETcc r8 (0F 9x /0).
+- **`x64_nullary_pack()`** (isa.h:496): RET, CQO/CDQ, etc.
+
+### Called Emit Functions (emit.c impl side, called from native.c)
+
+Each emits debug row via `debug_emit_row()` when `mc->debug` is set.
+
+```c
+void emit_mov_rr(MCEmitter* mc, int w, u32 dst, u32 src);              // MOV w=1→r64, w=0→r32
+void emit_mov_load(MCEmitter* mc, u32 size, int signed_ext, u32 dst, u32 base, i32 disp);  // size 1/2/4/8
+void emit_mov_store(MCEmitter* mc, u32 size, u32 src, u32 base, i32 disp);
+void emit_lea(MCEmitter* mc, u32 dst, u32 base, i32 disp);             // LEA (always 64-bit in our ISA)
+void emit_mov_load_idx(MCEmitter* mc, u32 size, int signed_ext, u32 dst, u32 base, u32 index, u32 log2_scale, i32 disp);
+void emit_mov_store_idx(MCEmitter* mc, u32 size, u32 src, u32 base, u32 index, u32 log2_scale, i32 disp);
+void x64_emit_load_imm(MCEmitter* mc, int is64, u32 dst, i64 imm);    // MOV/MOVABS
+void emit_alu_rr(MCEmitter* mc, int w, u8 op, u32 dst, u32 src);
+void emit_imul_rr(MCEmitter* mc, int w, u32 dst, u32 src);
+void emit_f7_rm(MCEmitter* mc, int w, u32 sub, u32 reg);              // NOT/NEG/MUL/IMUL/DIV/IDIV
+void emit_shift_cl(MCEmitter* mc, int w, u32 sub, u32 reg);
+void emit_shift_imm(MCEmitter* mc, int w, u32 sub, u32 reg, u8 imm);
+void emit_alu_imm8(MCEmitter* mc, int w, u32 sub, u32 reg, i8 imm);
+void emit_alu_imm32(MCEmitter* mc, int w, u32 sub, u32 reg, i32 imm);
+void emit_imul_imm8(MCEmitter* mc, int w, u32 dst, u32 src, i8 imm);
+void emit_imul_imm32(MCEmitter* mc, int w, u32 dst, u32 src, i32 imm);
+void emit_cmp_imm8(MCEmitter* mc, int w, u32 reg, i8 imm);
+void emit_test_self(MCEmitter* mc, int w, u32 reg);                  // TEST r, r
+void emit_setcc(MCEmitter* mc, u32 cc, u32 reg);                      // SETcc (cc = X64_CC_*)
+void emit_extend_rr(MCEmitter* mc, int w, int signed_ext, u32 src_size, u32 dst, u32 src);
+void emit_cqo_or_cdq(MCEmitter* mc, int w);                           // CQO/CDQ
+void emit_xor_self(MCEmitter* mc, int w, u32 r);                      // XOR r, r (zero)
+void emit_movzx_r32_r8(MCEmitter* mc, u32 dst, u32 src);             // MOVZX r32, r8
+void emit_ret(MCEmitter* mc);
+void emit_sse_rr(MCEmitter* mc, u8 prefix, u8 opcode, u32 dst, u32 src);
+void emit_sse_load(MCEmitter* mc, u8 prefix, u8 opcode, u32 dst, u32 base, i32 disp);
+void emit_sse_store(MCEmitter* mc, u8 prefix, u8 opcode, u32 src, u32 base, i32 disp);
+void emit_sse_load_idx(MCEmitter* mc, u8 prefix, u8 opcode, u32 dst, u32 base, u32 index, u32 log2_scale, i32 disp);
+void emit_sse_store_idx(MCEmitter* mc, u8 prefix, u8 opcode, u32 src, u32 base, u32 index, u32 log2_scale, i32 disp);
+void emit_sse_rr_w(MCEmitter* mc, u8 prefix, u8 opcode, int w, u32 dst, u32 src);
+int imm_fits_i8(i64 imm);
+int imm_fits_i32(i64 imm);
+```
+
+### Width Flags
+
+- **`w=1`**: 64-bit (q suffix), REX.W set.
+- **`w=0`**: 32-bit (l suffix), REX.W clear (32-bit write zero-extends to 64).
+- **`size` (load/store)**: 1, 2, 4, or 8 bytes; calls use MOVZX/MOVSX for 1/2-byte loads.
+
+### Opcode Constants (from isa.h)
+
+Key ones for GROUP 3:
+
+```c
+#define X64_OPC_MOV_RM_R  0x89u    // MOV r/m, r (store)
+#define X64_OPC_MOV_R_RM  0x8Bu    // MOV r, r/m (load)
+#define X64_OPC_MOV_RI    0xB8u    // MOV r, imm (with +rd in low bits)
+#define X64_OPC_LEA       0x8Du
+#define X64_OPC_MOVSXD    0x63u    // MOVSXD r64, r32 (REX.W 63)
+#define X64_OPC_ALU_ADD   0x01u    // ADD r/m, r
+#define X64_OPC_ALU_SUB   0x29u    // SUB r/m, r
+#define X64_OPC_ALU_AND   0x21u    // AND r/m, r
+#define X64_OPC_ALU_OR    0x09u    // OR r/m, r
+#define X64_OPC_ALU_XOR   0x31u    // XOR r/m, r
+#define X64_OPC_ALU_CMP   0x39u    // CMP r/m, r
+#define X64_OPC_ALU_TEST  0x85u    // TEST r/m, r
+#define X64_OPC_IMUL_2B   0xAFu    // IMUL r, r/m (0F AF)
+#define X64_OPC_IMUL_IMM8 0x6Bu    // IMUL r, r, imm8
+#define X64_OPC_IMUL_IMM32 0x69u   // IMUL r, r, imm32
+#define X64_OPC_F7        0xF7u    // NOT/NEG/MUL/IMUL/DIV/IDIV (sub picks op)
+#define X64_F7_SUB_NOT    2u
+#define X64_F7_SUB_NEG    3u
+#define X64_F7_SUB_DIV    6u
+#define X64_F7_SUB_IDIV   7u
+#define X64_OPC_SHIFT_IMM 0xC1u    // SHL/SHR/SAR r, imm8
+#define X64_OPC_SHIFT_CL  0xD3u    // SHL/SHR/SAR r, cl
+#define X64_SHIFT_SUB_SHL 4u
+#define X64_SHIFT_SUB_SHR 5u
+#define X64_SHIFT_SUB_SAR 7u
+#define X64_OPC_SETCC_BASE 0x90u   // SETcc (cc in low nibble, 0F 9x)
+#define X64_OPC_CDQ_CQO   0x99u    // CQO (REX.W) / CDQ
+#define X64_ALU_SUB_ADD   0u        // For 83/81 encoding
+#define X64_ALU_SUB_CMP   7u
+#define X64_ALU_SUB_SUB   5u
+#define X64_ALU_SUB_AND   4u
+#define X64_ALU_SUB_OR    1u
+#define X64_ALU_SUB_XOR   6u
+
+// Condition codes (for jcc, setcc, cmovcc)
+#define X64_CC_E   0x4u   // equal / ZF=1
+#define X64_CC_NE  0x5u
+#define X64_CC_B   0x2u   // below (unsigned) / CF=1
+#define X64_CC_AE  0x3u   // above-or-equal (unsigned) / CF=0
+#define X64_CC_BE  0x6u   // below-or-equal (unsigned)
+#define X64_CC_A   0x7u   // above (unsigned)
+#define X64_CC_L   0xCu   // less (signed) / SF!=OF
+#define X64_CC_GE  0xDu
+#define X64_CC_LE  0xEu
+#define X64_CC_G   0xFu   // greater (signed)
+#define X64_CC_S   0x8u   // sign set
+#define X64_CC_NS  0x9u
+#define X64_CC_P   0xAu   // parity (FP unordered)
+#define X64_CC_NP  0xBu   // no parity (FP ordered)
+```
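+
+The condition-code numbering above has a convenient property: each code and its negation differ only in bit 0, and the same nibble selects the opcode for both Jcc and SETcc. A minimal sketch of those two facts, using hypothetical helper names:
+
+```c
+#include <assert.h>
+#include <stdint.h>
+
+/* Negates a condition code: E<->NE, B<->AE, L<->GE, and so on. */
+static unsigned toy_cc_invert(unsigned cc) {
+  assert(cc < 16u);
+  return cc ^ 1u;
+}
+
+/* Second opcode byte of "Jcc rel32" (the first byte is 0x0F). */
+static uint8_t toy_jcc_rel32_op2(unsigned cc) {
+  assert(cc < 16u);
+  return (uint8_t)(0x80u | cc);
+}
+
+/* Second opcode byte of "SETcc r/m8" (the first byte is 0x0F). */
+static uint8_t toy_setcc_op2(unsigned cc) {
+  assert(cc < 16u);
+  return (uint8_t)(0x90u | cc);
+}
+```
+
+Inversion is handy in `cmp_branch`: branching to the false target on the inverted condition lets the true target fall through without an extra jump.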
+
+### SSE Prefixes and Opcodes
+
+```c
+#define X64_PFX_66  0x66u   // Operand-size override (16-bit); also picks double precision for SSE
+#define X64_PFX_F2  0xF2u   // Scalar double (ADDSD, etc.)
+#define X64_PFX_F3  0xF3u   // Scalar single (ADDSS, etc.)
+#define X64_OPC_TWOBYTE 0x0Fu  // Prefix for two-byte opcodes (SSE, shift 0Fxx, etc.)
+
+// SSE scalar FP opcodes (second byte after 0x0F; prefix F3 = single, F2 = double unless noted).
+// Listed as comments because the source gives no macro names for them:
+//   0x10  movs{s,d}       load / reg-reg move (0x11 is the store form)
+//   0x58  adds{s,d}
+//   0x5C  subs{s,d}
+//   0x59  muls{s,d}
+//   0x5E  divs{s,d}
+//   0x2E  ucomis{s,d}     compare, sets flags (no prefix = single, 0x66 = double)
+//   0x2A  cvtsi2s{s,d}    int→fp
+//   0x2C  cvtts{s,d}2si   fp→int, truncate
+//   0x6E  MOVD/MOVQ       GPR→XMM, used for bitcast (0x66 prefix)
+//   0x7E  MOVD/MOVQ       XMM→GPR, used for bitcast (0x66 prefix)
+```
+
+---
+
+## ABI Query Interface
+
+Call via `abi.h`:
+
+```c
+const ABIFuncInfo* abi_cg_func_info(Compiler*, CfreeCgTypeId fn_type);
+```
+
+Returns ABIFuncInfo with:
+- `nparams`: Number of fixed parameters.
+- `is_variadic`: Boolean.
+- `params[]`: Array of ABIArgInfo per parameter.
+- `ret_*`: Return value layout(s).
+
+For x64, also check `os_kind`:
+```c
+if (c->config->os_kind == CFREE_OS_WINDOWS) { /* Win64 path */ }
+else { /* SysV path */ }
+```
+
+---
+
+## NativeOps Adapter Structure
+
+File: `src/cg/native_direct_target.h:66`
+
+For -O0 (direct lowering), the NativeOps adapter bridges semantic operands (Operand type with OPK_REG/OPK_IMM/OPK_LOCAL/OPK_INDIRECT) to NativeLoc/NativeAddr. At -O1+, the optimizer emits NativeInst directly, so NativeOps is not called.
+
+Key callbacks used in the emit path (from ops.c, now native.c):
+- **`operand_legal()`**: Check if semantic Operand is legal for the arch.
+- **`semantic_addr_legal()`**: Check if an Operand address is legal and reachable.
+- **`plan_call()`, `emit_call()`, `emit_ret()`**: Lowering calls and returns.
+- **`va_start_()`, `va_arg_()`, `va_end_()`, `va_copy_()`**: Variadic setup.
+
+For GROUP 3, focus is on NativeTarget emission; NativeOps is used only in the -O0 path to map semantic operands to NativeLoc. See `src/arch/aa64/native_direct.c` for the adapter implementation.
+
+---
+
+## Group 3 Hook Bodies
+
+### move(dst_reg, src_reg)
+
+**Input:** Two NATIVE_LOC_REG locations, same class (int→int or fp→fp).
+**Emit:** Register move or elide (same reg).
+
+**Pseudo-C:**
+```c
+static void x_move(NativeTarget* t, NativeLoc dst, NativeLoc src) {
+  // Elide if same reg and class.
+  if (dst.kind == NATIVE_LOC_REG && src.kind == NATIVE_LOC_REG &&
+      (NativeAllocClass)dst.cls == (NativeAllocClass)src.cls &&
+      dst.v.reg == src.v.reg)
+    return;
+  
+  u32 rd = dst.v.reg & 0xFu;
+  u32 rs = src.v.reg & 0xFu;
+  
+  if ((NativeAllocClass)dst.cls == NATIVE_REG_FP) {
+    // FP reg move: prefix selects width (0xF2=double, 0xF3=single).
+    u8 prefix = type_size32(t, dst.type) == 8u ? 0xF2u : 0xF3u;
+    emit_sse_rr(t->mc, prefix, 0x10, rd, rs);  // movs{d,s}
+  } else {
+    // Integer: width from type size or pointer size.
+    int w = (dst.type && cg_type_size(t->c, dst.type) >= 8u) ? 1 : 0;
+    emit_mov_rr(t->mc, w, rd, rs);
+  }
+}
+```
+
+**Emit.c calls:** `emit_mov_rr()`, `emit_sse_rr()`.
+
+---
+
+### load_imm(dst_reg, imm)
+
+**Input:** NativeLoc dst (REG), immediate value.
+**Emit:** MOV r, imm32 or MOVABS r, imm64.
+
+**Pseudo-C:**
+```c
+static void x_load_imm(NativeTarget* t, NativeLoc dst, i64 imm) {
+  int is64 = (dst.type && cg_type_size(t->c, dst.type) >= 8u) ? 1 : 0;
+  u32 rd = dst.v.reg & 0xFu;
+  x64_emit_load_imm(t->mc, is64, rd, imm);
+}
+```
+
+**Emit.c calls:** `x64_emit_load_imm()` (which emits MOV or MOVABS based on is64).
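+
+To make the MOV versus MOVABS trade-off concrete, the sketch below classifies an immediate into the three standard x86-64 encodings and reports the instruction length. It reflects general encoding rules, not necessarily the choices `x64_emit_load_imm()` makes; `ToyLoadImmForm`, `toy_load_imm_form` and `toy_load_imm_len` are hypothetical names.
+
+```c
+#include <assert.h>
+#include <stdint.h>
+
+typedef enum {
+  TOY_MOV_R32_IMM32,   /* B8+rd id: 32-bit write, upper half zeroed  */
+  TOY_MOV_R64_SIMM32,  /* REX.W C7 /0 id: imm32 sign-extended to 64  */
+  TOY_MOVABS           /* REX.W B8+rd io: full 64-bit immediate      */
+} ToyLoadImmForm;
+
+/* Picks the shortest form that reproduces imm in a 32- or 64-bit register. */
+static ToyLoadImmForm toy_load_imm_form(int is64, int64_t imm) {
+  if (!is64) return TOY_MOV_R32_IMM32;
+  if (imm >= 0 && imm <= (int64_t)UINT32_MAX) return TOY_MOV_R32_IMM32;
+  if (imm >= INT32_MIN && imm < 0) return TOY_MOV_R64_SIMM32;
+  return TOY_MOVABS;
+}
+
+/* Encoded length in bytes for destination register rd (0..15). */
+static unsigned toy_load_imm_len(ToyLoadImmForm form, unsigned rd) {
+  assert(rd < 16u);
+  switch (form) {
+    case TOY_MOV_R32_IMM32:  return rd >= 8u ? 6u : 5u;  /* REX.B only for r8d..r15d */
+    case TOY_MOV_R64_SIMM32: return 7u;   /* REX + C7 + ModR/M + imm32 */
+    default:                 return 10u;  /* REX + B8+rd + imm64 */
+  }
+}
+```
+
+Relying on the implicit zero extension of 32-bit writes means a value like `0xFFFFFFFF` still fits the 5-byte form, while `-1` as a 64-bit value takes the 7-byte sign-extending form rather than a 10-byte MOVABS.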
+
+---
+
+### load_const(dst_reg, const_bytes)
+
+**Input:** FP constant (size 4 or 8), dst is FP register.
+**Emit:** .rodata symbol, RIP-relative movss/movsd with R_PC32 reloc.
+
+**Pseudo-C:**
+```c
+static void x_load_const(NativeTarget* t, NativeLoc dst, ConstBytes cb) {
+  // Must route through .rodata to emit RIP-relative load.
+  // 1. Allocate .rodata section if needed.
+  // 2. Align and emit const bytes at rodata_offset.
+  // 3. Create local symbol for the constant.
+  // 4. Emit: movs{s,d} xmm_dst, [rip + disp32]
+  //    with R_PC32 reloc (addend=-4).
+  
+  // Pseudo: allocate rodata offset, emit symbol, store current section, switch to .rodata.
+  u32 ro_off = /* rodata-aligned offset */;
+  ObjSymId sym = /* create symbol at ro_off */;
+  
+  u8 prefix = (cb.size == 8) ? 0xF2u : 0xF3u;  // F2=double, F3=single
+  u32 dst_x = dst.v.reg & 0xFu;
+  
+  // Emit: prefix 0F 10 /r [RIP + disp32]
+  // Use emit_sse_load with base=X64_RBP (signals rip-relative in our ISA).
+  // OR manually build with x64_sse_mem_pack and emit_reloc_at.
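+  emit_sse_load(t->mc, prefix, 0x10, dst_x, X64_RBP, 0);  // [RIP] form
+  // The reloc patches the disp32, i.e. the last 4 bytes of the instruction
+  // (assuming emit_reloc_at takes the offset of the patched field).
+  u32 pos = t->mc->pos(t->mc) - 4u;
+  t->mc->emit_reloc_at(t->mc, t->mc->section_id, pos, R_PC32, sym, -4, 1, 0);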
+}
+```
+
+**Note:** Implementation deferred to platform-specific rodata handling. For now, treat like rv64: load via register + immediate offset.
+
+---
+
+### load_addr(dst_reg, addr)
+
+**Input:** NativeLoc dst (REG), NativeAddr with base (reg/frame/global), index, scale, offset.
+**Emit:** LEA [base+disp], or for globals: LEA [RIP+disp] with R_PC32/R_X64_PLT32 for function symbols / R_X64_REX_GOTPCRELX for GOT-accessed externs.
+
+**Pseudo-C:**
+```c
+static void x_load_addr(NativeTarget* t, NativeLoc dst, NativeAddr addr) {
+  XNativeTarget* x = (XNativeTarget*)t;
+  u32 rd = dst.v.reg & 0xFu;
+  
+  if ((NativeAddrBaseKind)addr.base_kind == NATIVE_ADDR_BASE_GLOBAL) {
+    ObjSymId sym = addr.base.global.sym;
+    i64 addend = addr.base.global.addend + (i64)addr.offset;
+    
+    // Route through GOT for extern undef symbols in PIC/PIE.
+    if (obj_symbol_extern_via_got(t->c, t->obj, sym)) {
+      // mov rd, [rip + disp32]  (R_X64_REX_GOTPCRELX)
+      // addend applied post-load if nonzero.
+      emit_mov_load(t->mc, 8, /*signed=*/0, rd, X64_RBP, 0);  // [RIP] form
+      // Reloc offset = start of the trailing disp32 (assumes emit_reloc_at patches at pos)
+      u32 pos = t->mc->pos(t->mc) - 4u;
+      t->mc->emit_reloc_at(t->mc, t->mc->section_id, pos, R_X64_REX_GOTPCRELX, sym, -4, 1, 0);
+      if (addend) {
+        if (imm_fits_i32(addend)) {  // x64 ADD takes a sign-extended imm32 (not rv64's 12-bit range)
+          emit_alu_imm32(t->mc, 1, X64_ALU_SUB_ADD, rd, (i32)addend);
+        } else {
+          x64_emit_load_imm(t->mc, 1, X64_R11, addend);
+          emit_alu_rr(t->mc, 1, X64_OPC_ALU_ADD, rd, X64_R11);
+        }
+      }
+      return;
+    }
+    
+    // lea rd, [rip + disp32]  (R_PC32 for data, R_X64_PLT32 for funcs)
+    u32 reloc_kind = (obj_symbol_get(t->obj, sym)->kind == SK_FUNC) ?
+                      R_X64_PLT32 : R_PC32;
+    u32 pos = t->mc->pos(t->mc);
+    emit_lea(t->mc, rd, X64_RBP, 0);  // [RIP] form
+    t->mc->emit_reloc_at(t->mc, t->mc->section_id, pos, reloc_kind, sym, addend - 4, 1, 0);
+    return;
+  }
+  
+  if ((NativeAddrBaseKind)addr.base_kind == NATIVE_ADDR_BASE_FRAME) {
+    // lea rd, [rbp - slot_offset + addr.offset (+ ri*scale)]
+    XNativeSlot* s = x_slot_get(x, addr.base.frame);
+    i32 disp = -(i32)s->off + addr.offset;
+    if ((NativeAddrIndexKind)addr.index_kind == NATIVE_ADDR_INDEX_REG) {
+      // Fold the scaled index into one SIB-form LEA (rare for frame addresses).
+      // A separate "lea rd, [rbp+disp]" first would clobber the index when rd == ri.
+      u32 ri = addr.index.reg & 0xFu;
+      emit_lea_sib(t->mc, rd, X64_RBP, ri, addr.log2_scale, disp);
+    } else {
+      emit_lea(t->mc, rd, X64_RBP, disp);
+    }
+    return;
+  }
+  
+  if ((NativeAddrBaseKind)addr.base_kind == NATIVE_ADDR_BASE_REG) {
+    // lea rd, [base_reg + offset (+ ri*scale)]
+    u32 rb = addr.base.reg & 0xFu;
+    if ((NativeAddrIndexKind)addr.index_kind == NATIVE_ADDR_INDEX_REG) {
+      // Single SIB-form LEA, for the same rd == ri reason as above.
+      u32 ri = addr.index.reg & 0xFu;
+      emit_lea_sib(t->mc, rd, rb, ri, addr.log2_scale, addr.offset);
+    } else {
+      emit_lea(t->mc, rd, rb, addr.offset);
+    }
+    return;
+  }
+  
+  compiler_panic(t->c, x->loc, "x64 load_addr: unsupported base kind");
+}
+```
+
+**Emit.c calls:** `emit_lea()`, `emit_lea_sib()`, `emit_mov_load()` (for GOT), `emit_alu_imm32()`, `emit_alu_rr()`, `x64_emit_load_imm()`.
+
+---
+
+### load(dst_reg, addr, mem_access)
+
+**Input:** NativeLoc dst (REG), NativeAddr, MemAccess (type, size, align).
+**Emit:** MOV with size 1/2/4/8 bytes; MOVSX/MOVZX for sub-word; MOVSD/MOVSS for FP.
+
+**Pseudo-C:**
+```c
+static void x_load(NativeTarget* t, NativeLoc dst, NativeAddr addr, MemAccess mem) {
+  u32 rd = dst.v.reg & 0xFu;
+  u32 size = mem.size;
+  u32 base = addr.base.reg & 0xFu;  // assume simple base (no SIB for now)
+  
+  if ((NativeAllocClass)dst.cls == NATIVE_REG_FP) {
+    // FP load: movs{s,d} xmm_dst, [base + disp]
+    u8 prefix = (size == 8) ? 0xF2u : 0xF3u;
+    emit_sse_load(t->mc, prefix, 0x10, rd, base, addr.offset);
+  } else {
+    // Integer load: size determines extension.
+    int signed_ext = cg_type_is_signed(t->c, mem.type) ? 1 : 0;
+    emit_mov_load(t->mc, size, signed_ext, rd, base, addr.offset);
+  }
+}
+```
+
+**Emit.c calls:** `emit_mov_load()`, `emit_sse_load()`.
+
+---
+
+### store(addr, src_reg, mem_access)
+
+**Input:** NativeAddr, NativeLoc src (REG), MemAccess.
+**Emit:** MOV [base+disp], src with size 1/2/4/8.
+
+**Pseudo-C:**
+```c
+static void x_store(NativeTarget* t, NativeAddr addr, NativeLoc src, MemAccess mem) {
+  u32 rs = src.v.reg & 0xFu;
+  u32 size = mem.size;
+  u32 base = addr.base.reg & 0xFu;
+  
+  if ((NativeAllocClass)src.cls == NATIVE_REG_FP) {
+    // FP store: movs{s,d} [base + disp], xmm_src
+    u8 prefix = (size == 8) ? 0xF2u : 0xF3u;
+    emit_sse_store(t->mc, prefix, 0x11, rs, base, addr.offset);  // 0x11 = MOVSS/MOVSD store form
+  } else {
+    // Integer store.
+    emit_mov_store(t->mc, size, rs, base, addr.offset);
+  }
+}
+```
+
+**Emit.c calls:** `emit_mov_store()`, `emit_sse_store()`.
+
+---
+
+### tls_addr_of(dst_reg, sym, addend)
+
+**Input:** TLS symbol, NativeLoc dst (REG), addend.
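+**Emit:** SysV (initial-exec model): `mov rd, fs:[0]` then `add rd, [rip+disp32]`, where the disp32 carries a GOTTPOFF-style relocation (`R_X86_64_GOTTPOFF` in ELF terms). The general-dynamic and local-dynamic models (TLSGD/TLSLD relocations) instead call `__tls_get_addr`. Win64: reserved (OS-specific path).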
+
+**Pseudo-C:**
+```c
+static void x_tls_addr_of(NativeTarget* t, NativeLoc dst, ObjSymId sym, i64 addend) {
+  XNativeTarget* x = (XNativeTarget*)t;
+  u32 rd = dst.v.reg & 0xFu;
+  
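+  // x86-64 TLS is complex; for now, emit panic to mark as not yet implemented.
+  // SysV initial-exec sketch: mov rd, fs:[0]; add rd, [rip + sym@gottpoff]; then add the addend.
+  // (MOVABS cannot be used here: it takes no RIP-relative operand.)
+  // Win64: no TLS support in -O0 direct lowering.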
+  
+  compiler_panic(t->c, x->loc, "x64 tls_addr_of: not yet implemented");
+}
+```
+
+---
+
+### copy_bytes(dst_addr, src_addr, agg_access)
+
+**Input:** Two NativeAddr, AggregateAccess (size, align).
+**Emit:** REP MOVS or unrolled MOV loop.
+
+**Pseudo-C:**
+```c
+static void x_copy_bytes(NativeTarget* t, NativeAddr dst, NativeAddr src, AggregateAccess agg) {
+  u32 size = agg.size;
+  
+  // Unrolled loop for small sizes or misaligned; REP MOVS for larger aligned blocks.
+  if (size <= 32u) {
+    // Unroll: emit mov for each chunk (8, 4, 2, 1 as needed).
+    u32 off = 0;
+    while (off < size) {
+      u32 chunk = (size - off >= 8) ? 8 : (size - off >= 4) ? 4 : (size - off >= 2) ? 2 : 1;
+      u32 rs = X64_R10;  // scratch for load
+      emit_mov_load(t->mc, chunk, /*signed=*/0, rs, src.base.reg, src.offset + (i32)off);
+      emit_mov_store(t->mc, chunk, rs, dst.base.reg, dst.offset + (i32)off);
+      off += chunk;
+    }
+  } else {
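+    // REP MOVSQ copies size/8 qwords; the size%8 tail is copied afterwards.
+    // Clobbers RCX/RSI/RDI (R10/R11 are scratch) and relies on DF=0, which both ABIs guarantee.
+    // Compute dst into R11 first so a dst base of RSI is read before RSI is overwritten.
+    emit_lea(t->mc, X64_R11, dst.base.reg, dst.offset);
+    emit_lea(t->mc, X64_RSI, src.base.reg, src.offset);
+    emit_mov_rr(t->mc, 1, X64_RDI, X64_R11);
+    x64_emit_load_imm(t->mc, 1, X64_RCX, (i64)(size / 8));
+    emit_rep_movsq(t->mc);  // not shown; emit 0xF3 0x48 0xA5
+    // RSI/RDI now point just past the copied qwords: finish the tail bytes.
+    for (u32 off = 0, rem = size % 8u; off < rem; ) {
+      u32 chunk = (rem - off >= 4) ? 4 : (rem - off >= 2) ? 2 : 1;
+      emit_mov_load(t->mc, chunk, /*signed=*/0, X64_R10, X64_RSI, (i32)off);
+      emit_mov_store(t->mc, chunk, X64_R10, X64_RDI, (i32)off);
+      off += chunk;
+    }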
+  }
+}
+```
+
+---
+
+### set_bytes(dst_addr, byte_value, agg_access)
+
+**Input:** NativeAddr dst, NativeLoc byte_value (REG with 0..255), size.
+**Emit:** Unrolled or REP STOS.
+
+**Pseudo-C:**
+```c
+static void x_set_bytes(NativeTarget* t, NativeAddr dst, NativeLoc byte_value, AggregateAccess agg) {
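+  u32 size = agg.size;
+  u32 rs = byte_value.v.reg & 0xFu;
+  
+  // For small sizes, unroll MOV stores.
+  if (size <= 32u) {
+    // Replicate the byte across all 8 lanes first: a wide store of the raw
+    // register would write the byte followed by zeros.
+    emit_extend_rr(t->mc, 0, /*signed=*/0, 1, X64_R11, rs);       // movzx r11d, rs8
+    x64_emit_load_imm(t->mc, 1, X64_R10, 0x0101010101010101LL);
+    emit_imul_rr(t->mc, 1, X64_R11, X64_R10);                     // r11 = byte * 0x0101..01
+    u32 off = 0;
+    while (off < size) {
+      u32 chunk = (size - off >= 8) ? 8 : (size - off >= 4) ? 4 : (size - off >= 2) ? 2 : 1;
+      emit_mov_store(t->mc, chunk, X64_R11, dst.base.reg, dst.offset + (i32)off);
+      off += chunk;
+    }
+  } else {
+    // REP STOSB: al=byte, rcx=size in bytes, rdi=dst. Byte granularity also
+    // covers sizes that are not a multiple of 8. Clobbers RAX/RCX/RDI.
+    emit_mov_rr(t->mc, 0, X64_R11, rs);       // stash the value before RDI/RAX are overwritten
+    emit_lea(t->mc, X64_RDI, dst.base.reg, dst.offset);
+    emit_mov_rr(t->mc, 0, X64_RAX, X64_R11);  // al = byte
+    x64_emit_load_imm(t->mc, 1, X64_RCX, (i64)size);
+    emit_rep_stosb(t->mc);  // not shown; emit 0xF3 0xAA
+  }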
+}
+```
+
+---
+
+### bitfield_load(dst_reg, record_addr, bf_access)
+
+**Input:** NativeLoc dst (REG), NativeAddr (record), BitFieldAccess (width, offset, sign-extend).
+**Emit:** Load, then mask/shift/extend.
+
+**Pseudo-C:**
+```c
+static void x_bitfield_load(NativeTarget* t, NativeLoc dst, NativeAddr record_addr, BitFieldAccess bf) {
+  u32 rd = dst.v.reg & 0xFu;
+  u32 byte_off = bf.byte_offset;
+  u32 bit_off = bf.bit_offset;
+  u32 width = bf.width;
+  
+  // Load the container: the smallest 1/2/4/8-byte unit covering bit_off + width bits.
+  u32 need = (bit_off + width + 7u) / 8u;
+  u32 container_size = (need <= 1u) ? 1u : (need <= 2u) ? 2u : (need <= 4u) ? 4u : 8u;
+  emit_mov_load(t->mc, container_size, /*signed=*/0, rd, record_addr.base.reg,
+                record_addr.offset + (i32)byte_off);
+  
+  // Shift right to align LSB to bit 0.
+  if (bit_off > 0) {
+    emit_shift_imm(t->mc, 1, X64_SHIFT_SUB_SHR, rd, (u8)bit_off);
+  }
+  
+  if (width < 64) {
+    if (bf.sign_extend) {
+      // Sign-extend from an arbitrary bit width: shl then sar by (64 - width).
+      // This also discards any bits above the field, so no mask is needed.
+      emit_shift_imm(t->mc, 1, X64_SHIFT_SUB_SHL, rd, (u8)(64u - width));
+      emit_shift_imm(t->mc, 1, X64_SHIFT_SUB_SAR, rd, (u8)(64u - width));
+    } else {
+      // Mask to width bits (width < 64 keeps 1ull << width defined).
+      u64 mask = (1ull << width) - 1;
+      int w = (width > 32) ? 1 : 0;
+      x64_emit_load_imm(t->mc, w, X64_R11, (i64)mask);
+      emit_alu_rr(t->mc, w, X64_OPC_ALU_AND, rd, X64_R11);
+    }
+  }
+}
+```
+
+---
+
+### binop(op, dst_reg, a_reg, b_reg_or_imm)
+
+**Input:** BinOp, NativeLoc dst (REG), NativeLoc a (REG), NativeLoc b (REG or IMM).
+**Emit:** Two-address: copy a→dst, then dst op= b.
+
+**Key branches by op:**
+
+**FP binops (BO_FADD/FSUB/FMUL/FDIV):**
+- FADD/FMUL are commutative; FSUB/FDIV are not. Use SSE opcodes (0x58/0x5C/0x59/0x5E for ADD/SUB/MUL/DIV).
+- If `rd == rb && rd != ra`: commutative ops can emit `op rd, ra`; non-commutative ops first copy rb to a temp (XMM15).
+
+**Integer division/remainder (BO_SDIV/UDIV/SREM/UREM):**
+- Signed: emit CQO/CDQ, then IDIV. Result in RAX (quotient) or RDX (remainder).
+- Unsigned: XOR RDX, RDX, then DIV. Same result regs.
+- Route divisor through R11 if it's RAX/RDX.
+
+**Shifts (BO_SHL/SHR_U/SHR_S):**
+- Count must be in CL or encoded as imm8. Route RHS through CL if not immediate.
+- Sub-opcodes: SHL=4, SHR_U=5, SHR_S=7.
+
+**ALU (BO_IADD/ISUB/AND/OR/XOR/IMUL):**
+- Commutative ops: swap to move immediate to RHS for fast-path immediate forms.
+- Immediate fast-paths: 0x83 (imm8 sext) or 0x81 (imm32 sext).
+- Fallback: mov ra→rd, then op with register form (0x01/0x29/0x21/0x09/0x31/0xAF).
+
+**Pseudo-C:**
+```c
+static void x_binop(NativeTarget* t, BinOp op, NativeLoc dst, NativeLoc a, NativeLoc b) {
+  MCEmitter* mc = t->mc;
+  u32 rd = dst.v.reg & 0xFu;
+  
+  // FP binops.
+  if (op == BO_FADD || op == BO_FSUB || op == BO_FMUL || op == BO_FDIV) {
+    u32 ra = a.v.reg & 0xFu;
+    u32 rb = b.v.reg & 0xFu;
+    u8 prefix = (dst.type && cg_type_size(t->c, dst.type) == 8u) ? 0xF2u : 0xF3u;
+    u8 opcode = (op == BO_FADD) ? 0x58 : (op == BO_FSUB) ? 0x5C : 
+                (op == BO_FMUL) ? 0x59 : 0x5E;  // FDIV
+    
+    if (rd == rb && rd != ra) {
+      // Can use commutative source swap.
+      if (op == BO_FADD || op == BO_FMUL) {
+        emit_sse_rr(mc, prefix, opcode, rd, ra);
+        return;
+      }
+      // For FSUB/FDIV, must preserve order: use temp.
+      emit_sse_rr(mc, prefix, 0x10, X64_XMM15, rb);   // spill rb
+      emit_sse_rr(mc, prefix, 0x10, rd, ra);          // rd = ra
+      emit_sse_rr(mc, prefix, opcode, rd, X64_XMM15); // rd = ra op temp
+      return;
+    }
+    if (rd != ra) emit_sse_rr(mc, prefix, 0x10, rd, ra);  // movs{s,d}
+    emit_sse_rr(mc, prefix, opcode, rd, rb);
+    return;
+  }
+  
+  int w = (dst.type && cg_type_size(t->c, dst.type) >= 8u) ? 1 : 0;
+  
+  // Integer division.
+  if (op == BO_SDIV || op == BO_UDIV || op == BO_SREM || op == BO_UREM) {
+    // Move the divisor out of RAX/RDX first: both are overwritten below
+    // (RAX by the dividend, RDX by CQO/CDQ or the zeroing XOR).
+    u32 rb;
+    if (b.kind == NATIVE_LOC_IMM) {
+      x64_emit_load_imm(mc, w, X64_R11, b.v.imm);  // DIV/IDIV have no immediate form
+      rb = X64_R11;
+    } else {
+      rb = b.v.reg & 0xFu;
+      if (rb == X64_RAX || rb == X64_RDX) {
+        emit_mov_rr(mc, w, X64_R11, rb);
+        rb = X64_R11;
+      }
+    }
+    
+    // Dividend into RAX.
+    if (a.kind == NATIVE_LOC_IMM) {
+      x64_emit_load_imm(mc, w, X64_RAX, a.v.imm);
+    } else if ((a.v.reg & 0xFu) != X64_RAX) {
+      emit_mov_rr(mc, w, X64_RAX, a.v.reg & 0xFu);
+    }
+    
+    if (op == BO_SDIV || op == BO_SREM) {
+      emit_cqo_or_cdq(mc, w);  // sign-extend rax→rdx:rax
+      emit_f7_rm(mc, w, X64_F7_SUB_IDIV, rb);
+    } else {
+      emit_xor_self(mc, w, X64_RDX);  // zero rdx
+      emit_f7_rm(mc, w, X64_F7_SUB_DIV, rb);
+    }
+    
+    u32 result_reg = (op == BO_SREM || op == BO_UREM) ? X64_RDX : X64_RAX;
+    if (rd != result_reg) emit_mov_rr(mc, w, rd, result_reg);
+    return;
+  }
+  
+  // Shifts.
+  if (op == BO_SHL || op == BO_SHR_U || op == BO_SHR_S) {
+    u32 ra;
+    if (a.kind == NATIVE_LOC_IMM) {
+      ra = X64_R11;
+      x64_emit_load_imm(mc, w, ra, a.v.imm);
+    } else {
+      ra = a.v.reg & 0xFu;
+    }
+    
+    u32 sub = (op == BO_SHL) ? X64_SHIFT_SUB_SHL : 
+              (op == BO_SHR_U) ? X64_SHIFT_SUB_SHR : X64_SHIFT_SUB_SAR;
+    
+    if (b.kind == NATIVE_LOC_IMM) {
+      // Immediate shift: encode in C1 /sub ib.
+      if (rd != ra) emit_mov_rr(mc, w, rd, ra);
+      u32 width = w ? 64u : 32u;
+      emit_shift_imm(mc, w, sub, rd, (u8)(b.v.imm & (width - 1u)));
+      return;
+    }
+    
+    // Register shift: count in CL.
+    u32 rc = b.v.reg & 0xFu;
+    if (ra == X64_RCX || rd == X64_RCX || rd == rc) {
+      // Hazard: loading CL or copying a→rd would clobber the other operand.
+      // Shift in R11 and copy the result out.
+      if (ra != X64_R11) emit_mov_rr(mc, w, X64_R11, ra);
+      if (rc != X64_RCX) emit_mov_rr(mc, 0, X64_RCX, rc);
+      emit_shift_cl(mc, w, sub, X64_R11);
+      emit_mov_rr(mc, w, rd, X64_R11);
+      return;
+    }
+    if (rc != X64_RCX) emit_mov_rr(mc, 0, X64_RCX, rc);
+    if (rd != ra) emit_mov_rr(mc, w, rd, ra);
+    emit_shift_cl(mc, w, sub, rd);
+    return;
+  }
+  
+  // Canonicalize commutative ops: IMM to RHS.
+  if ((op == BO_IADD || op == BO_AND || op == BO_OR || op == BO_XOR || op == BO_IMUL) &&
+      a.kind == NATIVE_LOC_IMM && b.kind != NATIVE_LOC_IMM) {
+    NativeLoc tmp = a;
+    a = b;
+    b = tmp;
+  }
+  
+  // Immediate fast-paths (ALU/IMUL).
+  if (b.kind == NATIVE_LOC_IMM && a.kind == NATIVE_LOC_REG &&
+      (op == BO_IADD || op == BO_ISUB || op == BO_AND || op == BO_OR || op == BO_XOR || op == BO_IMUL)) {
+    i64 imm = b.v.imm;
+    u32 ra = a.v.reg & 0xFu;
+    
+    if (op == BO_IMUL) {
+      if (imm >= -128 && imm <= 127) {
+        emit_imul_imm8(mc, w, rd, ra, (i8)imm);
+        return;
+      } else if (imm >= -(1LL<<31) && imm <= (1LL<<31)-1) {
+        emit_imul_imm32(mc, w, rd, ra, (i32)imm);
+        return;
+      }
+    } else {
+      u32 sub = (op == BO_IADD) ? X64_ALU_SUB_ADD : (op == BO_OR) ? X64_ALU_SUB_OR :
+                (op == BO_AND) ? X64_ALU_SUB_AND : (op == BO_ISUB) ? X64_ALU_SUB_SUB : X64_ALU_SUB_XOR;
+      if (imm >= -128 && imm <= 127) {
+        if (rd != ra) emit_mov_rr(mc, w, rd, ra);
+        emit_alu_imm8(mc, w, sub, rd, (i8)imm);
+        return;
+      } else if (imm >= -(1LL<<31) && imm <= (1LL<<31)-1) {
+        if (rd != ra) emit_mov_rr(mc, w, rd, ra);
+        emit_alu_imm32(mc, w, sub, rd, (i32)imm);
+        return;
+      }
+    }
+    // Fall through to full materialization.
+  }
+  
+  // Generic two-operand: copy ra→dst, then dst op= rb.
+  // Immediates that missed the fast paths are materialized in scratch: a in R11, b in R10.
+  u32 ra, rb;
+  if (a.kind == NATIVE_LOC_IMM) {
+    ra = X64_R11;
+    x64_emit_load_imm(mc, w, ra, a.v.imm);
+  } else {
+    ra = a.v.reg & 0xFu;
+  }
+  if (b.kind == NATIVE_LOC_IMM) {
+    rb = X64_R10;
+    x64_emit_load_imm(mc, w, rb, b.v.imm);
+  } else {
+    rb = b.v.reg & 0xFu;
+  }
+  
+  // If dst aliases b (and not a), the copy a→dst below would destroy b.
+  if (rd == rb && rd != ra) {
+    if (op == BO_IADD || op == BO_AND || op == BO_OR || op == BO_XOR || op == BO_IMUL) {
+      // Commutative: compute dst = b op a in place, so no copy is needed.
+      rb = ra;
+      ra = rd;
+    } else {
+      // Non-commutative (ISUB): save b in scratch first.
+      emit_mov_rr(mc, w, X64_R10, rb);
+      rb = X64_R10;
+    }
+  }
+  
+  if (rd != ra) emit_mov_rr(mc, w, rd, ra);
+  
+  switch (op) {
+    case BO_IADD: emit_alu_rr(mc, w, X64_OPC_ALU_ADD, rd, rb); break;
+    case BO_ISUB: emit_alu_rr(mc, w, X64_OPC_ALU_SUB, rd, rb); break;
+    case BO_AND:  emit_alu_rr(mc, w, X64_OPC_ALU_AND, rd, rb); break;
+    case BO_OR:   emit_alu_rr(mc, w, X64_OPC_ALU_OR, rd, rb); break;
+    case BO_XOR:  emit_alu_rr(mc, w, X64_OPC_ALU_XOR, rd, rb); break;
+    case BO_IMUL: emit_imul_rr(mc, w, rd, rb); break;
+    default: compiler_panic(t->c, ((XNativeTarget*)t)->loc, "x64 binop: unsupported op");
+  }
+}
+```
+
+**Emit.c calls:** `emit_sse_rr()`, `x64_emit_load_imm()`, `emit_mov_rr()`, `emit_f7_rm()`, `emit_xor_self()`, `emit_cqo_or_cdq()`, `emit_shift_imm()`, `emit_shift_cl()`, `emit_alu_imm8()`, `emit_alu_imm32()`, `emit_imul_imm8()`, `emit_imul_imm32()`, `emit_alu_rr()`, `emit_imul_rr()`.
+
+---
+
+### unop(op, dst_reg, src_reg)
+
+**Input:** UnOp, NativeLoc dst (REG), NativeLoc src (REG).
+**Emit:** F7 /sub for NEG/BNOT, TEST+SETCC for NOT, SSE xor-sign for FNEG.
+
+**Pseudo-C:**
+```c
+static void x_unop(NativeTarget* t, UnOp op, NativeLoc dst, NativeLoc src) {
+  MCEmitter* mc = t->mc;
+  u32 rd = dst.v.reg & 0xFu;
+  u32 rs = src.v.reg & 0xFu;
+  
+  if (op == UO_FNEG) {
+    // FP negate: flip sign bit via xor with sign-mask constant.
+    if (rd != rs) {
+      u8 prefix = (dst.type && cg_type_size(t->c, dst.type) == 8u) ? 0xF2u : 0xF3u;
+      emit_sse_rr(mc, prefix, 0x10, rd, rs);  // movs{s,d}
+    }
+    
+    // Build the sign mask, load it from .rodata into XMM15, and xor.
+    u32 fsize = cg_type_size(t->c, dst.type);
+    u8 mask_bytes[8];
+    memset(mask_bytes, 0, sizeof mask_bytes);
+    mask_bytes[fsize == 8u ? 7 : 3] = 0x80u;  // sign bit: 63 (double) or 31 (single)
+    // Declared at this scope so it is still live for the load below.
+    ConstBytes cb = {mask_bytes, fsize, fsize, dst.type};
+    x_load_const(t, (NativeLoc){.kind=NATIVE_LOC_REG, .cls=NATIVE_REG_FP, .v.reg=X64_XMM15}, cb);
+    // XORPD takes the 0x66 prefix; XORPS takes none.
+    u8 xor_prefix = (fsize == 8u) ? 0x66u : 0u;
+    emit_sse_rr(mc, xor_prefix, 0x57, rd, X64_XMM15);  // xorpd/xorps
+    return;
+  }
+  
+  int w = (dst.type && cg_type_size(t->c, dst.type) >= 8u) ? 1 : 0;
+  
+  if (op == UO_NEG) {
+    if (rd != rs) emit_mov_rr(mc, w, rd, rs);
+    emit_f7_rm(mc, w, X64_F7_SUB_NEG, rd);
+    return;
+  }
+  
+  if (op == UO_BNOT) {  // Bitwise NOT.
+    if (rd != rs) emit_mov_rr(mc, w, rd, rs);
+    emit_f7_rm(mc, w, X64_F7_SUB_NOT, rd);
+    return;
+  }
+  
+  if (op == UO_NOT) {  // Logical NOT (x ? 0 : 1).
+    emit_test_self(mc, w, rs);
+    emit_setcc(mc, X64_CC_E, rd);  // ZF set if rs == 0
+    emit_movzx_r32_r8(mc, rd, rd);  // zero-extend rd8→rd32
+    return;
+  }
+  
+  compiler_panic(t->c, ((XNativeTarget*)t)->loc, "x64 unop: unsupported op");
+}
+```
+
+**Emit.c calls:** `emit_sse_rr()`, `emit_mov_rr()`, `emit_f7_rm()`, `emit_test_self()`, `emit_setcc()`, `emit_movzx_r32_r8()`.
+
+---
+
+### cmp(op, dst_reg, a_reg, b_reg_or_imm)
+
+**Input:** CmpOp, NativeLoc dst (REG), two operands.
+**Emit:** Compare (CMP or UCOMISD), then SETCC to materialize 0/1.
+
+**Pseudo-C:**
+```c
+static u32 cmp_to_cc(CmpOp op);  // defined below
+
+static void x_cmp(NativeTarget* t, CmpOp op, NativeLoc dst, NativeLoc a, NativeLoc b) {
+  MCEmitter* mc = t->mc;
+  u32 rd = dst.v.reg & 0xFu;
+  u32 ra = a.v.reg & 0xFu;
+  
+  if ((NativeAllocClass)a.cls == NATIVE_REG_FP) {
+    // FP comparison: UCOMISD/UCOMISS, then handle unordered/ordered cases.
+    u8 prefix = (cg_type_size(t->c, a.type) == 8u) ? 0x66u : 0u;
+    u32 rb = b.v.reg & 0xFu;
+    emit_sse_rr(mc, prefix, 0x2E, ra, rb);  // ucomisd/ucomiss sets EFLAGS
+    
+    switch (op) {
+      case CMP_NE:
+        // Unordered OR not-equal: set if P (unordered) OR NE.
+        emit_setcc(mc, X64_CC_P, rd);  // P (parity, i.e., unordered)
+        emit_movzx_r32_r8(mc, rd, rd);
+        emit_setcc(mc, X64_CC_NE, X64_R11);
+        emit_movzx_r32_r8(mc, X64_R11, X64_R11);
+        emit_alu_rr(mc, 0, X64_OPC_ALU_OR, rd, X64_R11);
+        return;
+      case CMP_EQ:
+      case CMP_LT_F:
+      case CMP_LE_F: {
+        // Ordered comparisons: must also check NP (not unordered).
+        // Set if (cond AND ordered). The braces scope the declaration after the case label.
+        u32 cc = (op == CMP_EQ) ? X64_CC_E : (op == CMP_LT_F) ? X64_CC_B : X64_CC_BE;
+        emit_setcc(mc, cc, rd);
+        emit_movzx_r32_r8(mc, rd, rd);
+        emit_setcc(mc, X64_CC_NP, X64_R11);
+        emit_movzx_r32_r8(mc, X64_R11, X64_R11);
+        emit_alu_rr(mc, 0, X64_OPC_ALU_AND, rd, X64_R11);
+        return;
+      }
+      case CMP_GT_F:
+        emit_setcc(mc, X64_CC_A, rd);
+        emit_movzx_r32_r8(mc, rd, rd);
+        return;
+      case CMP_GE_F:
+        emit_setcc(mc, X64_CC_AE, rd);
+        emit_movzx_r32_r8(mc, rd, rd);
+        return;
+      default:
+        emit_setcc(mc, cmp_to_cc(op), rd);
+        emit_movzx_r32_r8(mc, rd, rd);
+        return;
+    }
+  }
+  
+  // Integer comparison.
+  int w = (a.type && cg_type_size(t->c, a.type) >= 8u) ? 1 : 0;
+  
+  if (b.kind == NATIVE_LOC_IMM && imm_fits_i8(b.v.imm)) {
+    emit_cmp_imm8(mc, w, ra, (i8)b.v.imm);
+  } else if (b.kind == NATIVE_LOC_IMM && imm_fits_i32(b.v.imm)) {
+    emit_alu_imm32(mc, w, X64_ALU_SUB_CMP, ra, (i32)b.v.imm);
+  } else {
+    u32 rb = (b.kind == NATIVE_LOC_REG) ? (b.v.reg & 0xFu) : X64_R11;
+    if (b.kind == NATIVE_LOC_IMM) x64_emit_load_imm(mc, w, rb, b.v.imm);
+    else rb = b.v.reg & 0xFu;
+    emit_alu_rr(mc, w, X64_OPC_ALU_CMP, ra, rb);
+  }
+  
+  emit_setcc(mc, cmp_to_cc(op), rd);
+  emit_movzx_r32_r8(mc, rd, rd);
+}
+
+static u32 cmp_to_cc(CmpOp op) {
+  switch (op) {
+    case CMP_EQ:    return X64_CC_E;
+    case CMP_NE:    return X64_CC_NE;
+    case CMP_LT_U:  return X64_CC_B;
+    case CMP_LE_U:  return X64_CC_BE;
+    case CMP_GT_U:  return X64_CC_A;
+    case CMP_GE_U:  return X64_CC_AE;
+    case CMP_LT_S:  return X64_CC_L;
+    case CMP_LE_S:  return X64_CC_LE;
+    case CMP_GT_S:  return X64_CC_G;
+    case CMP_GE_S:  return X64_CC_GE;
+    default:        return X64_CC_E;
+  }
+}
+```
+
+**Emit.c calls:** `emit_sse_rr()`, `emit_setcc()`, `emit_movzx_r32_r8()`, `emit_alu_rr()`, `emit_cmp_imm8()`, `emit_alu_imm32()`, `x64_emit_load_imm()`, `imm_fits_i8()`, `imm_fits_i32()`.
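+
+The FP condition choices above follow from the flags UCOMISS/UCOMISD set (unordered: ZF=PF=CF=1; greater: all clear; less: CF=1; equal: ZF=1). A minimal sketch that models those flags in plain C and checks each SETcc combination against the expected predicate; `Flags`, `ucomisd_flags`, and the `fp_*` helpers are hypothetical names used only for illustration:
+
+```c
+#include <math.h>
+#include <stdbool.h>
+
+typedef struct { bool zf, pf, cf; } Flags;
+
+/* Model of the EFLAGS result of UCOMISD a, b. */
+static Flags ucomisd_flags(double a, double b) {
+  if (isnan(a) || isnan(b)) return (Flags){true, true, true};  /* unordered */
+  if (a > b) return (Flags){false, false, false};
+  if (a < b) return (Flags){false, false, true};
+  return (Flags){true, false, false};                          /* equal */
+}
+
+static bool fp_eq(Flags f) { return f.zf && !f.pf; }            /* SETE  AND SETNP */
+static bool fp_ne(Flags f) { return !f.zf || f.pf; }            /* SETNE OR  SETP  */
+static bool fp_lt(Flags f) { return f.cf && !f.pf; }            /* SETB  AND SETNP */
+static bool fp_le(Flags f) { return (f.cf || f.zf) && !f.pf; }  /* SETBE AND SETNP */
+static bool fp_gt(Flags f) { return !f.cf && !f.zf; }           /* SETA alone  */
+static bool fp_ge(Flags f) { return !f.cf; }                    /* SETAE alone */
+```
+
+GT/GE need no parity check because an unordered result sets CF, which already makes A/AE false.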
+
+---
+
+### convert(kind, dst_reg, src_reg)
+
+**Input:** ConvKind, NativeLoc dst, NativeLoc src.
+**Emit:** MOVZX/MOVSX for extension, MOVSXD for 32→64 signed, MOV r32,r32 for 32→64 unsigned (zero-extends high 32), CVTSI2SS/CVTSI2SD and CVTTSS2SI/CVTTSD2SI for int↔fp, with special paths for unsigned 64-bit.
+
+**Pseudo-C:**
+```c
+static void x_convert(NativeTarget* t, ConvKind kind, NativeLoc dst, NativeLoc src) {
+  MCEmitter* mc = t->mc;
+  u32 rd = dst.v.reg & 0xFu;
+  u32 rs = src.v.reg & 0xFu;
+  
+  switch (kind) {
+    case CV_SEXT: {
+      u32 src_bytes = cg_type_size(t->c, src.type);
+      int w = (cg_type_size(t->c, dst.type) >= 8u) ? 1 : 0;
+      emit_extend_rr(mc, w, /*signed=*/1, src_bytes, rd, rs);
+      return;
+    }
+    case CV_ZEXT: {
+      u32 src_bytes = cg_type_size(t->c, src.type);
+      int w = (cg_type_size(t->c, dst.type) >= 8u) ? 1 : 0;
+      emit_extend_rr(mc, w, /*signed=*/0, src_bytes, rd, rs);
+      return;
+    }
+    case CV_TRUNC:
+      // In-register truncation: mov r32, r32 clears high 32.
+      emit_mov_rr(mc, 0, rd, rs);
+      return;
+    case CV_ITOF_S: {
+      int w_src = (cg_type_size(t->c, src.type) >= 8u) ? 1 : 0;
+      u8 prefix = (cg_type_size(t->c, dst.type) == 8u) ? 0xF2u : 0xF3u;
+      emit_sse_rr_w(mc, prefix, 0x2A, w_src, rd, rs);  // cvtsi2sd/cvtsi2ss
+      return;
+    }
+    case CV_ITOF_U: {
+      int w_src = (cg_type_size(t->c, src.type) >= 8u) ? 1 : 0;
+      u8 prefix = (cg_type_size(t->c, dst.type) == 8u) ? 0xF2u : 0xF3u;
+      
+      if (w_src == 1) {
+        // Unsigned 64→FP: special path (test sign, branch, two paths).
+        MCLabel L_high = mc->label_new(mc), L_done = mc->label_new(mc);
+        emit_test_self(mc, 1, rs);
+        emit_jcc_label(mc, X64_CC_S, L_high);
+        emit_sse_rr_w(mc, prefix, 0x2A, 1, rd, rs);
+        emit_jmp_label(mc, L_done);
+        
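+        // High bit set: halve with a sticky low bit, convert as signed, then double.
+        // RAX and R11 are scratch here and must not hold live values.
+        mc->label_place(mc, L_high);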
+        emit_mov_rr(mc, 1, X64_R11, rs);
+        emit_mov_rr(mc, 1, X64_RAX, rs);
+        emit_alu_imm8(mc, 1, X64_ALU_SUB_AND, X64_RAX, 1);  // and rax, 1
+        emit_shift_imm(mc, 1, X64_SHIFT_SUB_SHR, X64_R11, 1);  // shr r11, 1
+        emit_alu_rr(mc, 1, X64_OPC_ALU_OR, X64_R11, X64_RAX);  // or r11, rax
+        emit_sse_rr_w(mc, prefix, 0x2A, 1, rd, X64_R11);
+        emit_sse_rr(mc, prefix, 0x58, rd, rd);  // adds{s,d} dst, dst
+        
+        mc->label_place(mc, L_done);
+        return;
+      } else {
+        // u32→fp: zero-extend to 64-bit, then signed convert works.
+        emit_extend_rr(mc, 1, /*signed=*/0, 4, X64_R11, rs);
+        emit_sse_rr_w(mc, prefix, 0x2A, 1, rd, X64_R11);
+        return;
+      }
+    }
+    case CV_FTOI_S: {
+      int w_dst = (cg_type_size(t->c, dst.type) >= 8u) ? 1 : 0;
+      u8 prefix = (cg_type_size(t->c, src.type) == 8u) ? 0xF2u : 0xF3u;
+      emit_sse_rr_w(mc, prefix, 0x2C, w_dst, rd, rs);  // cvtts{d,s}2si
+      return;
+    }
+    case CV_FTOI_U: {
+      int w_dst = (cg_type_size(t->c, dst.type) >= 8u) ? 1 : 0;
+      u8 prefix = (cg_type_size(t->c, src.type) == 8u) ? 0xF2u : 0xF3u;
+      
+      if (w_dst == 1) {
+        // FP→u64: special path (compare against 2^63, two branches).
+        // (Shortened for space; see legacy ops.c:1050-1075.)
+        MCLabel L_small = mc->label_new(mc), L_done = mc->label_new(mc);
+        
+        // Load 2^63 constant and compare.
+        ConstBytes cb = {/*2^63 bytes*/};
+        x_load_const(t, (NativeLoc){.kind=NATIVE_LOC_REG, .cls=NATIVE_REG_FP, .v.reg=X64_XMM15}, cb);
+        
+        emit_sse_rr(mc, prefix == 0xF2u ? 0x66u : 0u, 0x2E, rs, X64_XMM15);
+        emit_jcc_label(mc, X64_CC_B, L_small);
+        
+        // src >= 2^63: subtract 2^63, convert, add sign bit back (XMM14 is a second FP scratch).
+        emit_sse_rr(mc, prefix, 0x10, X64_XMM14, rs);
+        emit_sse_rr(mc, prefix, 0x5C, X64_XMM14, X64_XMM15);  // sub
+        emit_sse_rr_w(mc, prefix, 0x2C, 1, rd, X64_XMM14);
+        x64_emit_load_imm(mc, 1, X64_R11, -9223372036854775807LL - 1LL);
+        emit_alu_rr(mc, 1, X64_OPC_ALU_XOR, rd, X64_R11);
+        emit_jmp_label(mc, L_done);
+        
+        mc->label_place(mc, L_small);
+        emit_sse_rr_w(mc, prefix, 0x2C, 1, rd, rs);
+        mc->label_place(mc, L_done);
+        return;
+      } else {
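+        // FP→u32: convert to a 64-bit signed integer and keep the low 32 bits.
+        // A 32-bit CVTT would overflow for values >= 2^31.
+        emit_sse_rr_w(mc, prefix, 0x2C, 1, rd, rs);
+        emit_mov_rr(mc, 0, rd, rd);  // mov r32, r32 clears the high 32 bits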
+        return;
+      }
+    }
+    case CV_BITCAST:
+      if ((NativeAllocClass)src.cls == NATIVE_REG_FP && (NativeAllocClass)dst.cls == NATIVE_REG_INT) {
+        // FP→int: movq r64, xmm (66 REX.W 0F 7E /r; the xmm is the ModRM.reg operand).
+        emit_sse_rr_w(mc, 0x66u, 0x7E, 1, rs, rd);
+      } else if ((NativeAllocClass)src.cls == NATIVE_REG_INT && (NativeAllocClass)dst.cls == NATIVE_REG_FP) {
+        // int→FP: movq xmm, r64 (66 REX.W 0F 6E /r).
+        emit_sse_rr_w(mc, 0x66u, 0x6E, 1, rd, rs);
+      } else {
+        // Same class: plain register copy if the registers differ (GPR form shown).
+        if (rd != rs) emit_mov_rr(mc, 1, rd, rs);
+      }
+      return;
+    default:
+      compiler_panic(t->c, ((XNativeTarget*)t)->loc, "x64 convert: unsupported kind");
+  }
+}
+```
+
+**Emit.c calls:** `emit_extend_rr()`, `emit_mov_rr()`, `emit_sse_rr_w()`, `emit_sse_rr()`, `emit_test_self()`, `emit_jcc_label()`, `emit_jmp_label()`, `mc->label_place()`, `emit_alu_imm8()`, `emit_shift_imm()`, `emit_alu_rr()`, `x64_emit_load_imm()`.
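+
+The `CV_ITOF_U` high path (and/shr/or, signed convert, add) can be checked in plain C. A minimal sketch, assuming the usual two's-complement behaviour for the `uint64_t`→`int64_t` casts and SSE double arithmetic; `u64_to_double` is a hypothetical name, not part of the port:
+
+```c
+#include <stdint.h>
+
+/* Unsigned 64-bit to double using only a signed conversion,
+ * mirroring the branchy sequence in the pseudo-C above. */
+static double u64_to_double(uint64_t x) {
+  if ((int64_t)x >= 0) {
+    return (double)(int64_t)x;          /* sign bit clear: signed convert is exact enough */
+  }
+  uint64_t half = (x >> 1) | (x & 1u);  /* halve, keeping the dropped bit as a sticky bit */
+  double d = (double)(int64_t)half;     /* now non-negative, so the signed convert applies */
+  return d + d;                         /* doubling is exact */
+}
+```
+
+The sticky bit matters for rounding: without the `| (x & 1u)`, an input such as `0x8000000000000401` would land exactly on a tie after halving and round the wrong way.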
+
+---
+
+### cmp_branch(op, a_reg, b_reg, target_label)
+
+**Input:** CmpOp, two operands, MCLabel target.
+**Emit:** Compare (CMP or UCOMISD), then conditional branch (Jcc rel32) with label reloc.
+
+**Pseudo-C:**
+```c
+static void x_cmp_branch(NativeTarget* t, CmpOp op, NativeLoc a, NativeLoc b, MCLabel label) {
+  MCEmitter* mc = t->mc;
+  u32 ra = a.v.reg & 0xFu;
+  
+  if ((NativeAllocClass)a.cls == NATIVE_REG_FP) {
+    // FP: UCOMISD/UCOMISS, then Jcc for ordered/unordered case.
+    u8 prefix = (cg_type_size(t->c, a.type) == 8u) ? 0x66u : 0u;
+    u32 rb = b.v.reg & 0xFu;
+    emit_sse_rr(mc, prefix, 0x2E, ra, rb);
+    
+    // Pick the CC for the flags UCOMIS* produces (CF/ZF behave like an unsigned compare).
+    // cmp_to_cc() only maps the integer ops, so the FP orderings are mapped here.
+    u32 cc = (op == CMP_LT_F) ? X64_CC_B : (op == CMP_LE_F) ? X64_CC_BE :
+             (op == CMP_GT_F) ? X64_CC_A : (op == CMP_GE_F) ? X64_CC_AE : cmp_to_cc(op);
+    if (op == CMP_NE) {
+      // Branch if unordered OR not-equal: JP label; JNE label.
+      emit_jcc_label(mc, X64_CC_P, label);
+      emit_jcc_label(mc, X64_CC_NE, label);
+    } else if (op == CMP_EQ || op == CMP_LT_F || op == CMP_LE_F) {
+      // Ordered: JP skips the branch when unordered, then Jcc.
+      MCLabel skip = mc->label_new(mc);
+      emit_jcc_label(mc, X64_CC_P, skip);
+      emit_jcc_label(mc, cc, label);
+      mc->label_place(mc, skip);
+    } else {
+      // GT/GE: JA/JAE require CF=0, which is already false when unordered (CF=1).
+      emit_jcc_label(mc, cc, label);
+    }
+    return;
+  }
+  
+  // Integer comparison.
+  int w = (a.type && cg_type_size(t->c, a.type) >= 8u) ? 1 : 0;
+  
+  if (b.kind == NATIVE_LOC_IMM && imm_fits_i8(b.v.imm)) {
+    emit_cmp_imm8(mc, w, ra, (i8)b.v.imm);
+  } else if (b.kind == NATIVE_LOC_IMM && imm_fits_i32(b.v.imm)) {
+    emit_alu_imm32(mc, w, X64_ALU_SUB_CMP, ra, (i32)b.v.imm);
+  } else {
+    u32 rb = (b.kind == NATIVE_LOC_REG) ? (b.v.reg & 0xFu) : X64_R11;
+    if (b.kind == NATIVE_LOC_IMM) x64_emit_load_imm(mc, w, rb, b.v.imm);
+    emit_alu_rr(mc, w, X64_OPC_ALU_CMP, ra, rb);
+  }
+  
+  emit_jcc_label(mc, cmp_to_cc(op), label);
+}
+
+void emit_jcc_label(MCEmitter* mc, u32 cc, MCLabel label) {
+  // Jcc rel32: 0F 8x followed by a 4-byte displacement.
+  u8 buf[2];
+  buf[0] = X64_OPC_TWOBYTE;
+  buf[1] = (u8)(X64_OPC_JCC_BASE | (cc & 0xFu));
+  mc->emit_bytes(mc, buf, 2);
+  mc->emit_u32le(mc, 0);  // placeholder
+  mc->emit_label_ref(mc, label, R_PC32, 4, -4);  // or R_X64_PC32_RELOC if x64-specific
+}
+```
+
+**Emit.c calls:** `emit_sse_rr()`, `emit_cmp_imm8()`, `emit_alu_imm32()`, `emit_alu_rr()`, `x64_emit_load_imm()`, and custom `emit_jcc_label()`.
+
+---
+
+### Additional Hooks (Brief Outlines)
+
+- **jump(label):** Emit JMP rel32 with label reloc.
+- **indirect_branch(addr_reg, valid_targets, ntargets):** JMP r/m64.
+- **load_label_addr(dst_reg, label):** LEA [RIP+offset] with label fixup or jump-over technique.
+- **label_new() / label_place():** Delegate to mc.
+- **alloca_(dst_reg, size_reg, align):** SUB RSP, size; LEA dst, [RSP + max_outgoing].
+- **spill(src_reg, slot, mem_access):** MOV [slot_addr], src.
+- **reload(dst_reg, slot, mem_access):** MOV dst, [slot_addr].
+
+---
+
+## Summary
+
+The x64 NativeTarget port reuses the legacy byte-level emit.c helpers for instruction encoding, abstracts frame layout and ABI details via NativeFrameSlot and abi.h queries, and maps semantic operations (two-address ALU, flags comparisons, implicit-register division/shifts, FP SSE) to NativeLoc-based NativeTarget hooks. Key differences from rv64/aa64 templates:
+
+1. **Two-address ALU**: Always copy `a → dst` before `dst op= b`.
+2. **Flags-based branches**: Emit CMP+JCC instead of materializing condition bits.
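+3. **Implicit registers**: Route division through RAX/RDX and variable shift counts through CL; multiplies use the two- and three-operand IMUL forms, so they need no fixed registers.
+4. **Width encoding**: Pass `w` (0 = 32-bit, 1 = 64-bit) and `size` (1/2/4/8 bytes) to the emit helpers to select the correct opcodes.
+5. **Relocations**: RIP-relative addressing is relative to the end of the instruction, so a relocation on a trailing disp32 carries a -4 addend.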
+
+See `src/arch/aa64/native.c` for the complete working template; apply this guide to write `src/arch/x64/native.c` with identical hook signatures and the x64-specific emission logic outlined above.
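+
+The -4 addend in point 5 follows from the PC-relative relocation formula S + A - P, where P is the address of the patched disp32 field, while the CPU adds the displacement to the address of the next instruction. A worked sketch with hypothetical helper names and made-up addresses, valid when the disp32 is the last field of the instruction:
+
+```c
+#include <stdint.h>
+
+/* Value stored for a 32-bit PC-relative relocation: S + A - P.
+ * S = symbol address, A = addend, P = address of the 4-byte field. */
+static int32_t pc32_value(uint64_t S, int64_t A, uint64_t P) {
+  return (int32_t)(S + (uint64_t)A - P);
+}
+
+/* Address the CPU forms for [rip + disp32]: next instruction + disp32. */
+static uint64_t rip_target(uint64_t next_insn, int32_t disp) {
+  return next_insn + (uint64_t)(int64_t)disp;
+}
+```
+
+For a 7-byte `lea rax, [rip+disp32]` at 0x1000, the field sits at 0x1003 and the next instruction at 0x1007; with the symbol at 0x2000 the stored value is 0xFF9, and 0x1007 + 0xFF9 gives back 0x2000.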
+
+
+
+
+---
+
+# x64 NativeTarget Port: GROUP 4 — Calls, Returns, and ABI
+
+## Executive Summary
+
+GROUP 4 handles **calling conventions, return-value marshalling, and ABI routing** for both **SysV x86-64** (Unix/Linux) and **Win64** (Windows x64). The x64 ABI is dramatically different from the RV64/AA64 references: **two separate register passing windows** (SysV: rdi/rsi/rdx/rcx/r8/r9 GPR + xmm0-7 FPR; Win64: rcx/rdx/r8/r9 GPR + xmm0-3 FPR with 32-byte shadow space), **sret hidden pointers that consume an integer argument slot**, and **stack-passing mismatches between the two ABIs**. Return values span rax/rdx (GPR) or xmm0/xmm1 (FPR). Tail calls must fit the caller's incoming parameter area. The implementation must dispatch at runtime via `c->target.os` to select SysV or Win64 mode **once**, store the ABI dispatch table in the backend state, and route all argument/return logic through it.
+
+---
+
+## ABI Dispatch & Register Tables
+
+### X64ABIRegs Structure
+The legacy implementation stores a pointer to one of two dispatch tables, **`X64ABIRegs`**, selected at `func_begin` based on `c->target.os`. On the NativeTarget port, this table must be created once per function and held in state so that `plan_call`, `bind_param`, and return logic reuse it.
+
+**Source references:**
+- Legacy internal.h ~lines 60–72: X64ABIRegs definition
+- Legacy abi_sysv_x64.c & abi_win64_x64.c: ABI initialization functions (not part of this port; the abi/ interface already computes ABIFuncInfo)
+
+**Table structure (pseudo-code):**
+```c
+typedef struct X64ABIRegs {
+  const u32* int_args;       // rdi/rsi/rdx/rcx/r8/r9 (SysV, 6) 
+                             // rcx/rdx/r8/r9 (Win64, 4)
+  u32 n_int_args;            // 6 or 4
+  u32 n_fp_args;             // 8 (SysV) or 4 (Win64)
+  int slot_shared_int_fp;    // 0 (SysV) or 1 (Win64): slots shared between int/fp
+  u32 shadow_space;          // 0 (SysV) or 32 (Win64)
+  int emit_sysv_vararg_save; // 1 (SysV only): emit 176-B GP/FP save area
+  int vararg_fp_dup_to_gpr;  // 1 (Win64): variadic FPs duplicated to GPRs
+  u64 cs_int_mask;           // callee-saved GPRs eligible for save:
+                             // SysV: rbx, r12-r15
+                             // Win64: same + rdi/rsi
+  u64 cs_fp_mask;            // callee-saved XMMs: Win64 xmm6-15 (10 regs, 0xFFC0);
+                             // SysV has none (0)
+} X64ABIRegs;
+```
+
+**Register orderings in src/arch/x64/isa.h:**
+- `X64_RDI` = 7u, `X64_RSI` = 6u, `X64_RDX` = 2u, `X64_RCX` = 1u, `X64_R8` = 8u, `X64_R9` = 9u
+- `X64_RAX` = 0u, `X64_RBP` = 5u (frame pointer, reserved)
+- `X64_RSP` = 4u (stack pointer, reserved)
+- `X64_RBX`, `X64_R12`..`X64_R15` are callee-saved
+- `X64_XMM0`..`X64_XMM15` are SSE/FP registers
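+
+As a cross-check on the numbering above, the following sketch lists the architectural hardware encodings and derives register masks from them. The `HW_*` names, the table names, and the `reg_mask` helper are hypothetical (they are not taken from `isa.h`), and the sketch assumes masks are indexed by hardware register number.
+
+```c
+#include <stdint.h>
+
+/* x86-64 hardware register numbers (ModR/M field plus REX extension bit). */
+enum {
+  HW_RAX = 0, HW_RCX = 1, HW_RDX = 2, HW_RBX = 3,
+  HW_RSP = 4, HW_RBP = 5, HW_RSI = 6, HW_RDI = 7,
+  HW_R8 = 8, HW_R9 = 9, HW_R10 = 10, HW_R11 = 11,
+  HW_R12 = 12, HW_R13 = 13, HW_R14 = 14, HW_R15 = 15
+};
+
+/* Integer argument registers in ABI order (hypothetical table names). */
+static const uint32_t sysv_int_args[6] = {HW_RDI, HW_RSI, HW_RDX,
+                                          HW_RCX, HW_R8, HW_R9};
+static const uint32_t win64_int_args[4] = {HW_RCX, HW_RDX, HW_R8, HW_R9};
+
+/* Callee-saved GPRs, leaving out rbp/rsp (reserved for the frame). */
+static const uint32_t sysv_callee_saved[5] = {HW_RBX, HW_R12, HW_R13,
+                                              HW_R14, HW_R15};
+static const uint32_t win64_callee_saved[7] = {HW_RBX, HW_RSI, HW_RDI, HW_R12,
+                                               HW_R13, HW_R14, HW_R15};
+
+/* Build a bitmask with one bit per hardware register number. */
+static uint64_t reg_mask(const uint32_t* regs, uint32_t n) {
+  uint64_t m = 0;
+  for (uint32_t i = 0; i < n; ++i) m |= (uint64_t)1 << regs[i];
+  return m;
+}
+```
+
+Under these assumptions the SysV callee-saved mask is `0xF008` and the Win64 one is `0xF0C8`; the two differ only in the rsi/rdi bits.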
+
+---
+
+## Call Site Marshalling: `plan_call` & `emit_call`
+
+### Design Pattern
+
+The legacy code separates **planning** (compute argument locations, stack usage) from **emission** (generate actual code). The NativeTarget port mirrors this with `plan_call` (populates `NativeCallPlan*`) and `emit_call` (reads the plan, emits moves and the call instruction).
+
+**Key invariants:**
+1. **sret pointer reserves the first integer argument slot** (rdi in SysV, rcx in Win64). The callee receives it as a hidden first parameter in that register and must return the same pointer in rax.
+2. **Win64 shadow space** (32 bytes = 4 home slots) is caller-reserved at [rsp+0..31] and is counted as part of `stack_arg_size`. SysV has no shadow space.
+3. **Variadic calls**: SysV places variadic args into the same register pools as fixed args; Win64 duplicates each register-passed variadic FP value into the matching GPR, because a variadic callee reads its arguments from the integer home slots.
+4. **Stack alignment**: rsp must be 16-byte aligned at the `call` instruction, so outgoing stack args + shadow space are rounded up to a multiple of 16. The `call` itself then pushes the 8-byte return address, leaving rsp at 8 mod 16 on callee entry.
+5. **Tail calls** are realized if the callee's outgoing stack args fit in the caller's incoming parameter area (checked via `signature_stack_bytes` against `call_stack_bytes`).
+
+### `plan_call` Body Sketch
+
+**Inputs:** `NativeCallDesc* desc` (fn_type via abi_cg_func_info, callee, args[], nargs, results[], nresults, flags, tail_policy, inline_policy)
+
+**Outputs:** `NativeCallPlan* out` (callee, args[], nargs; rets[], nrets; stack_arg_size; clobber_mask[], return_mask[]; has_sret, is_variadic)
+
+**Pseudo-code:**
+```c
+void x64_plan_call(NativeTarget* t, const NativeCallDesc* desc, 
+                   NativeCallPlan* out) {
+  // 1. Fetch ABI function signature via the abi/ interface
+  const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, desc->fn_type);
+  const X64ABIRegs* x_abi = x64_abi_for_os(t->c->target.os); // [selected once]
+  
+  // 2. Initialize output
+  memset(out, 0, sizeof *out);
+  out->callee = desc->callee;
+  out->flags = desc->flags;
+  out->has_sret = abi && abi->has_sret;
+  out->is_variadic = abi && abi->variadic;
+  
+  // 3. Allocate the argument move array (up to two parts per arg, plus sret)
+  out->args = arena_array(t->c->tu, NativeCallPlanMove, 
+                          desc->nargs * 2 + 2);
+  out->nargs = 0;
+  
+  // 4. Start with shadow space (Win64 only)
+  u32 stack = x_abi->shadow_space;
+  u32 next_int = 0, next_fp = 0;
+  
+  // 5. sret argument (if present): occupies first int arg slot and stack reservation
+  if (abi && abi->has_sret) {
+    if ((desc->flags & CG_CALL_TAIL) == 0) {
+      // Ordinary call: pass destination address in first int arg reg
+      NativeCallPlanMove* m = &out->args[out->nargs++];
+      m->src = desc->ret.storage;  // caller's destination (frame/reg/indirect)
+      m->src_kind = NATIVE_CALL_MOVE_ADDR;  // compute address, not value
+      m->dst = native_loc_reg(builtin_i64(), NATIVE_REG_INT, 
+                              x_abi->int_args[0]);
+      m->dst_kind = NATIVE_LOC_REG;
+    }
+    next_int = 1;  // sret consumed first slot; any FP args skip if slot_shared
+    if (x_abi->slot_shared_int_fp) next_fp = 1;
+  }
+  
+  // 6. Iterate through each argument descriptor
+  for (u32 i = 0; i < desc->nargs; ++i) {
+    const NativeLoc* arg = &desc->args[i];
+    const ABIArgInfo* ai = abi ? &abi->params[i] : NULL;
+    
+    // Handle ABI_ARG_IGNORE, ABI_ARG_INDIRECT (pass address), ABI_ARG_DIRECT (parts)
+    // For each ABIArgPart:
+    //   - If next_int < n_int_args and cls == INT: assign to int_args[next_int++]
+    //   - Else if next_fp < n_fp_args and cls == FP:  assign to xmm_args[next_fp++]
+    //   - Else: stack-pass at offset stack; stack += 8 (aligned per part->align)
+    // Win64: if slot_shared_int_fp, next_fp mirrors next_int
+  }
+  
+  // 7. Finalize stack size and alignment
+  out->stack_arg_size = (stack + 15) & ~15;  // 16-byte align
+  
+  // 8. Allocate return-value array and clobber/return masks
+  out->rets = arena_array(t->c->tu, NativeCallPlanRet, 4);
+  out->nrets = 0;  // populated by caller after plan_call via abi->ret
+  
+  // Clobber mask: all caller-saved registers
+  for (u32 c = 0; c < NATIVE_CALL_PLAN_CLASSES; ++c)
+    out->clobber_mask[c] = ~x_abi->cs_*_mask;
+  
+  // Return mask: ret registers (rax, rdx for int; xmm0, xmm1 for fp)
+  // Populated from abi->ret parts
+}
+```
+
+**Key points:**
+- **sret as implicit first arg**: SysV passes it in rdi, Win64 in rcx. It occupies a slot and is not separately stack-passed.
+- **slot_shared_int_fp (Win64)**: When true, next_int and next_fp advance in lockstep so that int_args[1] and xmm_args[1] both point to slot 1, etc.
+- **Stack alignment**: must be 16-byte-aligned *before* the call (the call instruction pushes an 8-byte return address, misaligning to 8 mod 16, which is correct for the first instruction of the callee).
+- **Variadic handling**: SysV uses the same pools; Win64 duplicates FP args to GPRs (handled in emit_call by checking is_variadic and writing xmm→gpr moves).
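+
+The slot accounting described in these points can be modelled in isolation. The sketch below is a simplified stand-in, not the port's code: `AbiModel`, `ArgClass`, and `assign_slots` are hypothetical names, every argument is assumed to be a single 8-byte part, and a slot value of -1 means "passed on the stack".
+
+```c
+#include <stdint.h>
+
+typedef enum { CLS_INT, CLS_FP } ArgClass; /* hypothetical */
+
+typedef struct {
+  uint32_t n_int;   /* integer argument registers */
+  uint32_t n_fp;    /* FP argument registers */
+  uint32_t shadow;  /* caller-reserved shadow space in bytes */
+  int shared;       /* 1 if int and FP args share one slot counter */
+} AbiModel;         /* hypothetical */
+
+static const AbiModel SYSV_MODEL = {6, 8, 0, 0};
+static const AbiModel WIN64_MODEL = {4, 4, 32, 1};
+
+/* Assigns each argument a register index within its pool, or -1 for stack.
+   Returns the outgoing stack bytes (shadow space included), 16-byte aligned. */
+static uint32_t assign_slots(const AbiModel* abi, const ArgClass* cls,
+                             uint32_t n, int has_sret, int* out_slot) {
+  uint32_t next_int = has_sret ? 1u : 0u;          /* sret takes int slot 0 */
+  uint32_t next_fp = abi->shared ? next_int : 0u;  /* and FP slot 0 if shared */
+  uint32_t stack = abi->shadow;
+  for (uint32_t i = 0; i < n; ++i) {
+    if (cls[i] == CLS_INT && next_int < abi->n_int) {
+      out_slot[i] = (int)next_int++;
+    } else if (cls[i] == CLS_FP && next_fp < abi->n_fp) {
+      out_slot[i] = (int)next_fp++;
+    } else {
+      out_slot[i] = -1;
+      stack += 8;
+    }
+    if (abi->shared) { /* keep both counters on the same slot number */
+      uint32_t used = next_int > next_fp ? next_int : next_fp;
+      next_int = next_fp = used;
+    }
+  }
+  return (stack + 15u) & ~15u;
+}
+```
+
+For the argument classes (int, fp, int, fp, int) this model yields registers 0, 0, 1, 1, 2 and no stack bytes on SysV, but slots 0, 1, 2, 3 plus one stack argument on Win64, where 32 bytes of shadow space and 8 bytes of argument round up to 48.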
+
+### `emit_call` Body Sketch
+
+**Inputs:** `NativeCallPlan* plan` (pre-computed by plan_call)
+
+**Pseudo-code:**
+```c
+void x64_emit_call(NativeTarget* t, const NativeCallPlan* plan) {
+  MCEmitter* mc = t->mc;
+  const X64ABIRegs* x_abi = x64_abi_for_os(t->c->target.os);
+  
+  // Tail-call path (covered in tail_call_unrealizable_reason)
+  if (plan->flags & CG_CALL_TAIL) {
+    if (plan->has_sret) {
+      // Load incoming sret pointer from frame into rdi (SysV) or rcx (Win64)
+      // Spilled at func_begin to preserve it across the body
+    }
+    // Restore frame (callee-saved regs, rbp), then jmp/call to callee
+    return;
+  }
+  
+  // Ordinary call: emit argument moves
+  for (u32 i = 0; i < plan->nargs; ++i) {
+    const NativeCallPlanMove* m = &plan->args[i];
+    // Resolve m->src (NativeLoc: reg, frame, stack, imm, global, addr)
+    // Move into m->dst (always a physical register or stack slot)
+    // For ADDR moves: compute address and store pointer; for VALUE: load/move
+  }
+  
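+  // SysV only: a variadic callee reads AL as an upper bound on the number of
+  // XMM registers carrying arguments. Win64 has no such convention.
+  if (plan->is_variadic && x_abi->emit_sysv_vararg_save) {
+    // Count XMM regs used and emit: mov al, (count)
+  }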
+  
+  // Emit the call instruction
+  if (plan->callee.kind == NATIVE_LOC_GLOBAL) {
+    // call rel32 + R_X64_PLT32 reloc for function symbols
+    // call rel32 + R_PC32 reloc for data symbols (rare)
+    emit_call_rel32(mc, plan->callee.v.global.sym, 
+                    plan->callee.v.global.addend);
+  } else if (plan->callee.kind == NATIVE_LOC_REG) {
+    // call r/m (opcode FF /2)
+    u32 r = plan->callee.v.reg & 0xFu;
+    emit_rex(mc, 0, 0, 0, r);
+    u8 buf[2] = {0xFF, modrm(3u, 2u, r)};
+    mc->emit_bytes(mc, buf, 2);
+  }
+  
+  // Return-value harvest (via rets[] populated elsewhere)
+  // Caller moves from rax/rdx/xmm0/xmm1 into destination
+}
+```
+
+**Emit helpers from legacy emit.c:**
+- `emit_rex(mc, w, reg, index, rm)`: emit REX prefix; x64_emit_load_imm calls this
+- `emit_mov_rr(mc, w, dst, src)`: register-to-register move (opcode 89/8B)
+- `emit_mov_load(mc, sz, signed_ext, dst, base, disp)`: load from [base+disp] into reg
+- `emit_mov_store(mc, sz, src, base, disp)`: store reg into [base+disp]
+- `emit_sse_rr(mc, prefix2, opcode, dst, src)`: SSE move (movss/movsd)
+- `x64_emit_load_imm(mc, is64, dst, imm)`: materialize immediate into register (e.g. `mov r32, imm32`, sign-extended `mov r/m64, imm32`, or `movabs r64, imm64`, depending on the value)
+
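+The indirect-call branch of the sketch above reduces to two or three bytes. The following standalone encoder writes into a plain buffer instead of an `MCEmitter`; `encode_call_reg` is a hypothetical name used only for illustration.
+
+```c
+#include <stddef.h>
+#include <stdint.h>
+
+/* Encode `call reg` (opcode FF /2, register-direct) into buf.
+   reg is a hardware register number 0..15. Returns the byte count. */
+static size_t encode_call_reg(uint8_t* buf, uint32_t reg) {
+  size_t n = 0;
+  if (reg & 8u) buf[n++] = 0x41;  /* REX.B selects r8..r15 */
+  buf[n++] = 0xFF;                /* opcode group 5 */
+  buf[n++] = (uint8_t)(0xC0u | (2u << 3) | (reg & 7u)); /* mod=11, /2, rm */
+  return n;
+}
+```
+
+`call rax` encodes as `FF D0` and `call r11` as `41 FF D3`. No REX.W is needed because near calls default to a 64-bit operand in long mode.
+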
+---
+
+## Return-Value Marshalling: `plan_ret` & Return Instruction
+
+### `plan_ret` Body Sketch
+
+The return path is simpler than calls: the function takes each return value from wherever its body left it (frame slot, register, indirect address) and moves it into the standard return registers.
+
+**Inputs:** `CGFuncDesc* func`, `NativeLoc* values[]`, `u32 nvalues` (semantic return locations from the optimizer)
+
+**Outputs:** `NativeCallPlanRet** out_rets`, `u32* out_nrets` (array of moves from return values to rax/rdx/xmm0/xmm1)
+
+**Pseudo-code:**
+```c
+void x64_plan_ret(NativeTarget* t, const CGFuncDesc* func,
+                  const NativeLoc* values, u32 nvalues,
+                  NativeCallPlanRet** out_rets, u32* out_nrets) {
+  // 1. Query function signature
+  const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, func->type);
+  
+  // 2. Handle no-return case (void or unreachable)
+  if (!abi || abi->ret.kind == ABI_ARG_IGNORE) {
+    *out_nrets = 0;
+    return;
+  }
+  
+  // 3. Handle sret (structure return): copy struct to caller-supplied address
+  if (abi->ret.kind == ABI_ARG_INDIRECT) {
+    // The sret pointer arrived in rdi (SysV) or rcx (Win64).
+    // It was spilled to a frame slot at func_begin (sret_ptr_slot).
+    // 
+    // Two patterns:
+    // (a) struct value is in a frame slot: memcpy each granule into [rdi]
+    // (b) struct already at an address (indirect): memcpy from there
+    //
+    // Emit inline memcpy-like moves (8B / 4B / 1B chunks)
+    // Then move rdi into rax (return the sret pointer)
+    
+    *out_nrets = 1;  // one "move" that represents the memcpy + rax setup
+    // (*out_rets)[0] encodes the source struct and destination (rdi)
+    return;  // sret is fully handled; do not fall through to the direct path
+  }
+  
+  // 4. Handle direct return (scalar or small aggregate)
+  // abi->ret.parts[] lists the parts and their classes (INT or FP)
+  u32 next_int_ret = 0, next_fp_ret = 0;  // indices into rax/rdx, xmm0/xmm1
+  *out_nrets = 0;                         // incremented once per part below
+  for (u16 i = 0; i < abi->ret.nparts; ++i) {
+    const ABIArgPart* p = &abi->ret.parts[i];
+    NativeCallPlanRet* r = &(*out_rets)[(*out_nrets)++];
+    
+    // Determine source location for this part
+    // (from values[], or constructed from the aggregate)
+    
+    // Determine destination register
+    if (p->cls == ABI_CLASS_INT) {
+      r->dst = native_loc_reg(..., NATIVE_REG_INT, 
+                              next_int_ret == 0 ? X64_RAX : X64_RDX);
+      next_int_ret++;
+    } else if (p->cls == ABI_CLASS_FP) {
+      r->dst = native_loc_reg(..., NATIVE_REG_FP,
+                              X64_XMM0 + next_fp_ret);
+      next_fp_ret++;
+    }
+  }
+}
+```
+
+### Return Instruction Emission
+
+The legacy `x_ret` emits a plain `ret` (opcode C3) after setting up the return-value registers. The NativeTarget `ret` hook is simpler: it just emits `ret`.
+
+**Pseudo-code:**
+```c
+void x64_ret(NativeTarget* t) {
+  u8 op = 0xC3;  // ret
+  t->mc->emit_bytes(t->mc, &op, 1);
+}
+```
+
+**Return register handling:**
+- **Int return**: value in rax (64-bit) or eax (32-bit, zero-extended to 64)
+- **FP return**: value in xmm0 (float or double)
+- **Multi-register return** (e.g., __int128 or 16-byte struct): rax (first 8B), rdx (next 8B); or xmm0/xmm1
+- **sret return**: struct copied to the caller-supplied buffer (pointer arrives in rdi on SysV, rcx on Win64); the same pointer is returned in rax
+
+---
+
+## Parameter Binding: `bind_param`
+
+Parameters arrive at the callee in one of three places:
+1. **Integer registers**: rdi/rsi/rdx/rcx/r8/r9 (SysV) or rcx/rdx/r8/r9 (Win64)
+2. **FP registers**: xmm0–xmm7 (SysV) or xmm0–xmm3 (Win64)
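+3. **Stack**: at function entry [rsp+0] holds the return address, so SysV stack arguments sit at [rsp+8], [rsp+16], ...; on Win64 the 32-byte shadow space occupies [rsp+8..39] and stack arguments start at [rsp+40]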
+
+### `bind_param` Body Sketch
+
+**Inputs:** `CGParamDesc* param`, `NativeLoc dst` (destination chosen by allocator: REG or FRAME)
+
+**Pseudo-code:**
+```c
+void x64_bind_param(NativeTarget* t, const CGParamDesc* param, NativeLoc dst) {
+  const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, param->fn_type);
+  const ABIArgInfo* ai = &abi->params[param->index];  // param's ABI classification
+  const X64ABIRegs* x_abi = x64_abi_for_os(t->c->target.os);
+  
+  // sret handling (implicit first parameter):
+  if (param->index == 0 && abi->has_sret) {
+    // sret pointer arrives in rdi (SysV) or rcx (Win64)
+    // If dst is NATIVE_LOC_REG: direct move into dst->v.reg
+    // If dst is NATIVE_LOC_FRAME: store into frame slot
+    // Special: spill sret_ptr_slot = frame_slot where pointer is saved for tail calls
+    u32 src_reg = x_abi->int_args[0];  // rdi or rcx
+    if (dst.kind == NATIVE_LOC_REG) {
+      emit_mov_rr(t->mc, 1, dst.v.reg, src_reg);
+    } else if (dst.kind == NATIVE_LOC_FRAME) {
+      // Store into frame slot; also record sret_ptr_slot for tail calls
+      emit_mov_store(t->mc, 8, src_reg, X64_RBP, -frame_slot_offset(dst));
+    }
+    return;
+  }
+  
+  // Ordinary parameter: query its ABI part (Int or FP, Reg or Stack)
+  const ABIArgPart* part = &ai->parts[0];  // assume single-part for now
+  
+  if (part->loc == ABI_LOC_REG) {
+    u32 src_reg;
+    if (part->cls == ABI_CLASS_INT) {
+      src_reg = x_abi->int_args[next_int];
+    } else {
+      src_reg = X64_XMM0 + next_fp;
+    }
+    
+    if (dst.kind == NATIVE_LOC_REG) {
+      if (part->cls == ABI_CLASS_INT) {
+        emit_mov_rr(t->mc, w, dst.v.reg, src_reg);
+      } else {
+        emit_sse_rr(t->mc, prefix, 0x10, dst.v.reg, src_reg);
+      }
+    } else if (dst.kind == NATIVE_LOC_FRAME) {
+      if (part->cls == ABI_CLASS_INT) {
+        emit_mov_store(t->mc, part->size, src_reg, X64_RBP, -frame_offset);
+      } else {
+        emit_sse_store(t->mc, prefix, 0x11, src_reg, X64_RBP, -frame_offset);
+      }
+    } else if (dst.kind == NATIVE_LOC_NONE) {
+      // Parameter is unused; ABI register advances but nothing is emitted
+    }
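+  } else if (part->loc == ABI_LOC_STACK) {
+    // Parameter on stack. After `push rbp; mov rbp, rsp`, [rbp+0] holds the
+    // saved rbp and [rbp+8] the return address, so incoming stack args start
+    // at [rbp+16] (after the 32-byte shadow space on Win64).
+    // stack_slot: index among stack-passed parts (placeholder, like next_int);
+    // it is not param->index, since register params occupy no stack slot.
+    i32 in_arg_offset = 16 + (i32)x_abi->shadow_space + (i32)(stack_slot * 8);
+    
+    if (dst.kind == NATIVE_LOC_REG) {
+      // Load from stack into register
+      emit_mov_load(t->mc, part->size, 0, dst.v.reg, X64_RBP, in_arg_offset);
+    } else if (dst.kind == NATIVE_LOC_FRAME) {
+      // Copy from incoming stack to frame slot (via RAX scratch)
+      emit_mov_load(t->mc, part->size, 0, X64_RAX, X64_RBP, in_arg_offset);
+      emit_mov_store(t->mc, part->size, X64_RAX, X64_RBP, -frame_offset);
+    }
+  }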
+}
+```
+
+**Calling context:**
+- Called once per parameter during `func_begin` (NativeDirectTarget path) or `func_begin_known_frame` (optimizer path)
+- Incoming ABI registers are **never allocable** (they are not in the allocable register pool), so no collision checking is needed
+- The caller's allocator pre-computes `dst` (register vs. frame), and this hook simply moves the incoming value to that destination
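+
+For stack-passed parameters, the address arithmetic is easy to get wrong, so here is a small sketch of it. It assumes the standard `push rbp; mov rbp, rsp` prologue (saved rbp at [rbp+0], return address at [rbp+8]); `incoming_stack_arg_offset` and its `stack_slot` argument are hypothetical names.
+
+```c
+#include <stdint.h>
+
+/* rbp-relative offset of an incoming stack argument.
+   shadow_space: 0 on SysV, 32 on Win64.
+   stack_slot:   0-based index among the stack-passed 8-byte slots. */
+static int32_t incoming_stack_arg_offset(uint32_t shadow_space,
+                                         uint32_t stack_slot) {
+  /* 8 bytes of saved rbp + 8 bytes of return address = 16 */
+  return (int32_t)(16u + shadow_space + stack_slot * 8u);
+}
+```
+
+The first SysV stack argument is then at [rbp+16] and the first Win64 stack argument (the fifth argument overall) at [rbp+48].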
+
+---
+
+## Signature Stack Bytes & Call Stack Bytes
+
+### `signature_stack_bytes`
+
+Computes the stack-parameter bytes a function's *fixed parameters* use (beyond the register pools). Used to gate tail-call realizability.
+
+**Pseudo-code:**
+```c
+u32 x64_signature_stack_bytes(NativeTarget* t, CfreeCgTypeId fn_type,
+                              int* out_variadic, u32* out_nparams) {
+  const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, fn_type);
+  if (!abi) {
+    if (out_variadic) *out_variadic = 0;
+    if (out_nparams) *out_nparams = 0;
+    return 0;
+  }
+  
+  if (out_variadic) *out_variadic = abi->variadic;
+  if (out_nparams) *out_nparams = abi->nparams;
+  
+  const X64ABIRegs* x_abi = x64_abi_for_os(t->c->target.os);
+  u32 next_int = abi->has_sret ? 1 : 0;
+  u32 next_fp = x_abi->slot_shared_int_fp ? next_int : 0;  // Win64: sret also takes FP slot 0
+  u32 stack = 0;  // no shadow space here; shadow is caller's responsibility
+  
+  for (u32 i = 0; i < abi->nparams; ++i) {
+    const ABIArgInfo* ai = &abi->params[i];
+    for (u16 j = 0; j < ai->nparts; ++j) {
+      const ABIArgPart* p = &ai->parts[j];
+      if (p->cls == ABI_CLASS_INT && next_int < x_abi->n_int_args) {
+        next_int++;
+      } else if (p->cls == ABI_CLASS_FP && next_fp < x_abi->n_fp_args) {
+        next_fp++;
+      } else {
+        stack += 8;  // Stack-passed argument
+      }
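+      if (x_abi->slot_shared_int_fp) {
+        // Win64: one shared slot counter; whichever pool advanced, keep both in step
+        u32 used = next_int > next_fp ? next_int : next_fp;
+        next_int = next_fp = used;
+      }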
+    }
+  }
+  
+  return (stack + 15) & ~15;  // 16-byte align
+}
+```
+
+### `call_stack_bytes`
+
+Computes the outgoing stack-argument bytes for a specific *call*, including shadow space. Used in a pre-pass frame-planning phase.
+
+**Pseudo-code:**
+```c
+u32 x64_call_stack_bytes(NativeTarget* t, const NativeCallDesc* desc) {
+  const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, desc->fn_type);
+  const X64ABIRegs* x_abi = x64_abi_for_os(t->c->target.os);
+  
+  u32 stack = x_abi->shadow_space;  // Win64: 32; SysV: 0
+  u32 next_int = abi && abi->has_sret ? 1 : 0;
+  u32 next_fp = x_abi->slot_shared_int_fp ? next_int : 0;  // Win64: sret also takes FP slot 0
+  
+  for (u32 i = 0; i < desc->nargs; ++i) {
+    const NativeLoc* arg = &desc->args[i];
+    const ABIArgInfo* ai = abi ? &abi->params[i] : NULL;
+    if (!ai) continue;
+    
+    for (u16 j = 0; j < ai->nparts; ++j) {
+      const ABIArgPart* p = &ai->parts[j];
+      if (p->cls == ABI_CLASS_INT && next_int < x_abi->n_int_args) {
+        next_int++;
+      } else if (p->cls == ABI_CLASS_FP && next_fp < x_abi->n_fp_args) {
+        next_fp++;
+      } else {
+        stack += 8;
+      }
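+      if (x_abi->slot_shared_int_fp) {
+        // Win64: one shared slot counter; whichever pool advanced, keep both in step
+        u32 used = next_int > next_fp ? next_int : next_fp;
+        next_int = next_fp = used;
+      }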
+    }
+  }
+  
+  return (stack + 15) & ~15;
+}
+```
+
+---
+
+## Tail-Call Realizability & Emission
+
+### `tail_call_unrealizable_reason`
+
+A tail call is realized as a **sibling call**: the callee reuses the caller's stack frame and incoming parameter area, so the callee's outgoing stack args must fit in the caller's incoming parameter window.
+
+**Pseudo-code:**
+```c
+const char* x64_tail_call_unrealizable_reason(NativeTarget* t,
+                                              const NativeCallDesc* desc) {
+  // Compute the callee's outgoing stack-arg bytes
+  u32 callee_stack = x64_call_stack_bytes(t, desc);
+  
+  // Compute the caller's incoming stack-arg bytes
+  // (This requires querying the caller's signature; held in backend state)
+  u32 caller_incoming = x_impl->incoming_stack_size;  // set at func_begin
+  
+  if (callee_stack > caller_incoming) {
+    return "tail call stack arguments exceed the caller's parameter area";
+  }
+  
+  // Win64: additional check for FP arguments (no true tail calls if fp args conflict)
+  // Deferred: handle as "not yet implemented"
+  
+  return NULL;  // realizable
+}
+```
+
+### Tail Call Emission
+
+On the NativeTarget path, tail calls are **deferred as not implemented** (return a blocker string). A full implementation would:
+
+1. **Restore callee-saved registers** in reverse order (mirrors `func_begin` prologue)
+2. **Emit variadic AL setup** (count of XMM args)
+3. **Restore the frame pointer** (mov rsp, rbp; pop rbp / leave instruction)
+4. **Emit a jump** to the callee (jmp rel32 for global, jmp r/m for indirect)
+
+**Key constraint:** sret pointer must be forwarded from the caller's incoming sret (spilled to `sret_ptr_slot` at entry).
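+
+Steps 3 and 4 of the list above can be sketched as a byte encoder for the direct-jump case. This is an illustration only: `encode_leave_jmp_rel32` is a hypothetical helper, callee-saved restores and AL setup are left out, and the target address is assumed known (with a symbol target the displacement would instead be filled in by a PC-relative relocation).
+
+```c
+#include <stddef.h>
+#include <stdint.h>
+
+/* Encode `leave; jmp rel32` at address insn_addr, jumping to target.
+   Returns the number of bytes written (always 6). */
+static size_t encode_leave_jmp_rel32(uint8_t* buf, uint64_t insn_addr,
+                                     uint64_t target) {
+  size_t n = 0;
+  buf[n++] = 0xC9; /* leave: mov rsp, rbp; pop rbp */
+  buf[n++] = 0xE9; /* jmp rel32 */
+  /* rel32 is relative to the end of the jmp, i.e. insn_addr + 1 + 5. */
+  uint32_t rel = (uint32_t)(target - (insn_addr + 6));
+  for (int i = 0; i < 4; ++i) buf[n++] = (uint8_t)(rel >> (8 * i));
+  return n;
+}
+```
+
+Because the displacement counts from the end of the instruction, a jump emitted at 0x1000 to 0x2000 carries 0xFFA rather than 0x1000.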
+
+---
+
+## NativeOps Adapter Block
+
+The `-O0` (NativeDirectTarget) path requires a `NativeOps` struct with semantic-level call marshalling. The key members for GROUP 4:
+
+```c
+struct NativeOps {
+  // ... other hooks ...
+  
+  void (*plan_call)(NativeDirectTarget*, const NativeCallDesc*,
+                    NativeCallPlan*);
+  const char* (*tail_call_unrealizable_reason)(NativeDirectTarget*,
+                                               const CGCallDesc*);
+  void (*emit_call)(NativeDirectTarget*, const NativeCallPlan*);
+  void (*emit_ret)(NativeDirectTarget*, const CGLocal* values, u32 nvalues);
+  
+  // ... variadics, asm ...
+};
+```
+
+**Implementation notes:**
+- `plan_call`: map the semantic `NativeCallDesc` (with CGLocal homes) to a `NativeCallPlan` (with physical locations)
+- `emit_call`: emit the code to move arguments into registers/stack, call the function, and harvest return values
+- `emit_ret`: emit code to place return values in rax/rdx/xmm0/xmm1 and emit `ret`
+
+---
+
+## Integration Checklist
+
+### Per-Function State (XImpl analog)
+
+Store in the NativeTarget backend state:
+- **X64ABIRegs\*** x_abi: selected once at func_begin from c->target.os
+- **u32** incoming_stack_size: fixed-param stack bytes (for tail-call checks)
+- **NativeFrameSlot** sret_ptr_slot: where the incoming sret pointer is spilled (for tail calls)
+- **u32** max_outgoing: largest call's stack arg size (updated by plan_call / emit_call)
+
+### Relocation Constants (from src/arch/x64/isa.h)
+
+- **R_X64_PLT32** (ELF `R_X86_64_PLT32`, type 4): PC-relative function-call relocation (collapsible to local)
+- **R_PC32** (ELF `R_X86_64_PC32`, type 2): PC-relative data-symbol relocation
+- **R_X64_REX_GOTPCRELX** (ELF `R_X86_64_REX_GOTPCRELX`, type 42): GOT-indirect relocation for extern symbols in PIC/PIE
+
+### Byte Encoders (from legacy emit.c)
+
+Reuse / migrate to emit.h:
+- `emit_rex(mc, w, reg, index, rm)`: REX prefix
+- `emit_mov_rr(mc, w, dst, src)`: move register to register (opcode 89 / 8B)
+- `emit_mov_load(mc, sz, signed_ext, dst, base, disp)`: load from [base+disp]
+- `emit_mov_store(mc, sz, src, base, disp)`: store to [base+disp]
+- `emit_sse_rr(mc, prefix, opcode, dst, src)`: SSE move
+- `x64_emit_load_imm(mc, is64, dst, imm)`: load immediate into register
+- `modrm(mod, reg, rm)`: encode ModR/M byte
+- `emit_u32le(mc, v)`: emit 32-bit little-endian value
+
+---
+
+## Summary of Data Flow
+
+1. **func_begin**: Select X64ABIRegs from c->target.os; query caller's signature for incoming_stack_size; compute prologue.
+2. **bind_param**: Route each parameter (rdi/rsi/rdx/rcx/r8/r9/xmm0-7 or stack) to its destination (register or frame slot).
+3. **plan_call** (when encountering a call): Query callee signature; lay out arg slots (int/fp registers, then stack); return NativeCallPlan with stack_arg_size.
+4. **emit_call**: Emit argument moves, set AL (variadic), emit call instruction, harvest return values.
+5. **plan_ret**: Route return values from their current locations into the ABI return registers (rax/rdx/xmm0/xmm1), or copy through the sret pointer.
+6. **ret**: Emit the ret instruction.
+7. **func_end**: Finalize frame size, patch prologue/epilogue.
+
+---
+
+## Critical Divergences from RV64/AA64
+
+| Aspect | x64 | RV64/AA64 |
+|--------|-----|-----------|
+| **Arg regs (int)** | rdi/rsi/rdx/rcx/r8/r9 (SysV); rcx/rdx/r8/r9 (Win64) | a0–a7 (uniform) |
+| **Arg regs (fp)** | xmm0–7 (8, SysV); xmm0–3 (4, Win64) | fa0–fa7 (8, uniform) |
+| **Shadow space** | 32 bytes (Win64 only); SysV none | None |
+| **Slot sharing** | Win64: int & fp slots shared (slot_shared_int_fp=1); SysV: separate | Separate pools |
+| **sret** | Reserves first int arg slot; callee also returns pointer in rax | Passed as pointer in register; no special return |
+| **Tail calls** | Stack-arg fit check; forward sret if present | Similar, no sret complication |
+| **Variadic** | SysV: same pools; Win64: FP dup to GPR | Same pools (RV64); register-save-area (AA64) |
+
+**Key bug risk (noted in rv64 comments):** sret reserves the first int argument slot even when sret is not an ordinary parameter. The callee must not allocate the first int arg slot to any user parameter. The ABI query ensures the caller knows this; bind_param must skip the first slot when sret is present.
+
+
+
+---
+
+# X64 NativeTarget Porting Guide: GROUP 5
+## Atomics, Variadics, Inline Asm, Intrinsics, File-Scope Asm, Finalize & Cleanup
+
+---
+
+## Overview
+
+This guide covers the final set of NativeTarget hooks for x64 porting, plus the file management strategy. The x64 backend must transition from the legacy ops.c/alloc.c/opt_coord.c/internal.h architecture to the NativeTarget abstraction (src/arch/native_target.h), driven at -O0 by NativeDirectTarget + NativeOps adapter and at -O1+ by the optimizer.
+
+Key constraint: **x64 must support both SysV (Linux/BSD/Unix) and Win64 ABIs** without Apple variants. All ABI queries route through `src/abi/{abi_sysv_x64.c, abi_win64_x64.c}` via `abi_cg_func_info()` and `abi_va_list_layout()`.
+
+Emit-level byte encoders (emit_rex, emit_mem_operand, emit_mov_rr, etc.) live in a new emit.h header shared by native.c and asm.c; function prologue/epilogue and lifecycle hooks remain in native.c.
+
+---
+
+## Part 1: Atomic Operations
+
+### Context: x86-64 TSO Model
+
+x86-64 implements **Total Store Order (TSO)**: loads are not reordered with other loads, stores are not reordered with other stores, and the only reordering visible to software is a store followed by a later load from a different address (the store can still sit in the store buffer when the load completes). A plain `mov` therefore gives acquire semantics for loads and release semantics for stores; only seq_cst needs an explicit barrier. The scheme below is conservative: the common compiler mapping fences only seq_cst stores.
+
+- **atomic_load(seq_cst)**: plain load + **mfence** after
+- **atomic_store(release)**: plain store (no fence; earlier loads and stores cannot pass it under TSO)
+- **atomic_store(seq_cst)**: **mfence** before, plain store, **mfence** after
+- **atomic_rmw/cas**: lock-prefixed instruction (implicit full barrier)
+- **fence(seq_cst)**: **mfence**
+
+### NativeTarget Hooks
+
+Located in src/arch/native_target.h lines 396–405:
+
+```c
+void (*atomic_load)(NativeTarget*, NativeLoc dst, NativeAddr addr, MemAccess, MemOrder);
+void (*atomic_store)(NativeTarget*, NativeAddr addr, NativeLoc src, MemAccess, MemOrder);
+void (*atomic_rmw)(NativeTarget*, AtomicOp, NativeLoc dst, NativeAddr addr,
+                   NativeLoc val, MemAccess, MemOrder);
+void (*atomic_cas)(NativeTarget*, NativeLoc prior, NativeLoc ok,
+                   NativeAddr addr, NativeLoc expected, NativeLoc desired,
+                   MemAccess, MemOrder success, MemOrder failure);
+void (*fence)(NativeTarget*, MemOrder);
+```
+
+### Body Sketch: x_atomic_load
+
+**Caller contract**: `dst` = NATIVE_LOC_REG (allocable register); `addr` resolved (base_reg + imm offset, no index).
+
+```c
+static void x_atomic_load(NativeTarget* t, NativeLoc dst, NativeAddr addr,
+                          MemAccess mem, MemOrder order) {
+  MCEmitter* mc = t->mc;
+  u32 sz = mem.size ? mem.size : x_type_byte_size(t, dst.type);
+  
+  /* Resolve addr to (base_reg, imm_offset) — no index. */
+  u32 base = x_atomic_addr_base(t, addr);  /* Helper: extracts addr.base.reg */
+  i32 offset = addr.offset;
+  
+  /* x86: plain MOV satisfies all acquire/release/seq_cst for load due to TSO.
+     Only seq_cst needs post-load mfence. */
+  
+  int signext = type_is_signed(mem.type ? mem.type : dst.type);
+  x_emit_mov_load(mc, sz, signext, dst.v.reg & 0xfu, base, offset);
+  
+  if (order == MO_SEQ_CST)
+    emit_mfence(mc);
+}
+```
+
+**Emit helpers** (from git 429defa:src/arch/x64/emit.c):
+- `emit_mfence()`: 0x0f 0xae 0xf0 (3 bytes)
+- `x_emit_mov_load(sz, signext, dst_reg, base, offset)`: mov reg, [base+offset] with size/sign handling (reuse emit.h encoders)
+
+### Body Sketch: x_atomic_store
+
+```c
+static void x_atomic_store(NativeTarget* t, NativeAddr addr, NativeLoc src,
+                           MemAccess mem, MemOrder order) {
+  MCEmitter* mc = t->mc;
+  u32 sz = mem.size ? mem.size : x_type_byte_size(t, src.type);
+  u32 base = x_atomic_addr_base(t, addr);
+  i32 offset = addr.offset;
+  
+  if (order == MO_SEQ_CST)
+    emit_mfence(mc);
+  
+  x_emit_mov_store(mc, sz, src.v.reg & 0xfu, base, offset);
+  
+  if (order == MO_SEQ_CST)
+    emit_mfence(mc);
+}
+```
+
+### Body Sketch: x_atomic_rmw
+
+**Lock-prefixed read-modify-write** for add/sub/xchg (single instruction); **cmpxchg retry loop** for and/or/xor/nand.
+
+```c
+static void x_atomic_rmw(NativeTarget* t, AtomicOp op, NativeLoc dst,
+                         NativeAddr addr, NativeLoc val, MemAccess mem,
+                         MemOrder order) {
+  MCEmitter* mc = t->mc;
+  u32 sz = mem.size ? mem.size : x_type_byte_size(t, dst.type);
+  int w = (sz == 8) ? 1 : 0;
+  u32 base = x_atomic_addr_base(t, addr);
+  i32 offset = addr.offset;
+  u32 dr = dst.v.reg & 0xfu;
+  u32 tmp_reg = X64_R11;  /* Working register */
+  
+  /* (void)order; LOCK prefixed ops are unconditionally full barriers. */
+  
+  /* Materialize val into tmp_reg. For SUB, negate it. */
+  if (val.kind == NATIVE_LOC_IMM) {
+    i64 v = val.v.imm;
+    if (op == AO_SUB) v = -v;
+    x_emit_load_imm(mc, w, tmp_reg, v);
+  } else if (val.kind == NATIVE_LOC_REG) {
+    u32 vr = val.v.reg & 0xfu;
+    if (vr != tmp_reg)
+      emit_mov_rr(mc, w, tmp_reg, vr);
+    if (op == AO_SUB)
+      emit_f7_rm(mc, w, 3u, tmp_reg);  /* NEG tmp_reg */
+  } else {
+    compiler_panic(t->c, /* loc */, "x64 atomic_rmw: val kind unsupported");
+  }
+  
+  if (op == AO_ADD || op == AO_SUB) {
+    /* LOCK XADD [base+offset], tmp_reg
+       Opcode: 0xf0 (lock) 0x0f 0xc1 /r
+       Afterwards tmp_reg = prior value. */
+    emit_lock_xadd(mc, w, tmp_reg, base, offset);
+    if (dr != tmp_reg)
+      emit_mov_rr(mc, w, dr, tmp_reg);
+    return;
+  }
+  
+  if (op == AO_XCHG) {
+    /* LOCK XCHG [base+offset], tmp_reg (lock is implicit for XCHG mem)
+       Opcode: 0xf0 (explicit) 0x87 /r */
+    emit_lock_xchg_mem(mc, w, tmp_reg, base, offset);
+    if (dr != tmp_reg)
+      emit_mov_rr(mc, w, dr, tmp_reg);
+    return;
+  }
+  
+  /* AND/OR/XOR/NAND: CMPXCHG retry loop
+     rax = prior, rcx = new, r11 = val
+              mov rax, [mem]           (initial plain load)
+     .retry:  mov rcx, rax
+              <op> rcx, r11
+              [NAND: not rcx]
+              lock cmpxchg [mem], rcx  (on failure, rax = current value)
+              jne .retry */
+  
+  x_emit_mov_load(mc, sz, 0, X64_RAX, base, offset);
+  MCLabel L_retry = mc->label_new(mc);
+  mc->label_place(mc, L_retry);
+  emit_mov_rr(mc, w, X64_RCX, X64_RAX);
+  
+  switch (op) {
+    case AO_AND:
+      emit_alu_rr(mc, w, 0x21, X64_RCX, tmp_reg);  /* AND */
+      break;
+    case AO_OR:
+      emit_alu_rr(mc, w, 0x09, X64_RCX, tmp_reg);  /* OR */
+      break;
+    case AO_XOR:
+      emit_alu_rr(mc, w, 0x31, X64_RCX, tmp_reg);  /* XOR */
+      break;
+    case AO_NAND:
+      emit_alu_rr(mc, w, 0x21, X64_RCX, tmp_reg);  /* AND */
+      emit_f7_rm(mc, w, 2u, X64_RCX);               /* NOT rcx */
+      break;
+    default:
+      compiler_panic(t->c, /* loc */, "unsupported atomic rmw op");
+  }
+  
+  emit_lock_cmpxchg(mc, w, X64_RCX, base, offset);
+  /* jne .retry (ZF = 0 if failed) */
+  emit_jcc_label(mc, X64_CC_NE, L_retry);
+  
+  if (dr != X64_RAX)
+    emit_mov_rr(mc, w, dr, X64_RAX);
+}
+```
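+
+The retry loop in the sketch above has the same shape as a fetch-and-NAND written with C11 atomics. The model below is only meant to show the semantics the emitted sequence should have; `fetch_nand_u64` is a hypothetical name, and a C compiler is free to lower it differently.
+
+```c
+#include <stdatomic.h>
+#include <stdint.h>
+
+/* Atomically replaces *p with ~(*p & val) and returns the prior value. */
+static uint64_t fetch_nand_u64(_Atomic uint64_t* p, uint64_t val) {
+  uint64_t prior = atomic_load(p); /* plays the role of rax */
+  uint64_t desired;                /* plays the role of rcx */
+  do {
+    desired = ~(prior & val);
+    /* On failure, prior is refreshed with the value currently in memory,
+       just as cmpxchg reloads rax. */
+  } while (!atomic_compare_exchange_strong(p, &prior, desired));
+  return prior;
+}
+```
+
+Starting from 0xFF with an operand of 0x0F, the call returns 0xFF and leaves 0xFFFFFFFFFFFFFFF0 in memory.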
+
+**Emit helpers**:
+- `emit_lock_xadd(w, src, base, offset)`: lock.prefix + rex + 0x0f 0xc1 + modrm+sib+disp
+- `emit_lock_xchg_mem(w, src, base, offset)`: lock.prefix + rex + 0x87 + modrm+sib+disp
+- `emit_lock_cmpxchg(w, src, base, offset)`: lock.prefix + rex + 0x0f 0xb1 + modrm+sib+disp
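+
+To make the byte layout of these helpers concrete, here is a standalone encoder for the `lock xadd` form. It is a sketch under simplifying assumptions: it writes to a plain buffer, always uses a 32-bit displacement (mod=10), and supports only 32-bit and 64-bit operand sizes; `encode_lock_xadd` is a hypothetical name.
+
+```c
+#include <stddef.h>
+#include <stdint.h>
+
+/* Encode `lock xadd [base + disp32], src`. w=1 selects 64-bit operands.
+   src and base are hardware register numbers 0..15. */
+static size_t encode_lock_xadd(uint8_t* buf, int w, uint32_t src,
+                               uint32_t base, int32_t disp) {
+  size_t n = 0;
+  uint8_t rex = (uint8_t)(0x40u | (w ? 0x08u : 0u) |
+                          ((src & 8u) ? 0x04u : 0u) |   /* REX.R */
+                          ((base & 8u) ? 0x01u : 0u));  /* REX.B */
+  buf[n++] = 0xF0;                 /* LOCK prefix comes before REX */
+  if (rex != 0x40u) buf[n++] = rex;
+  buf[n++] = 0x0F;
+  buf[n++] = 0xC1;                 /* XADD r/m, r */
+  buf[n++] = (uint8_t)(0x80u | ((src & 7u) << 3) | (base & 7u));
+  if ((base & 7u) == 4u) buf[n++] = 0x24; /* rsp/r12 as base need a SIB byte */
+  uint32_t d = (uint32_t)disp;
+  for (int i = 0; i < 4; ++i) buf[n++] = (uint8_t)(d >> (8 * i));
+  return n;
+}
+```
+
+For example, `lock xadd [rdi+0], r11` (64-bit) encodes as `F0 4C 0F C1 9F 00 00 00 00`. The `lock cmpxchg` form differs only in the second opcode byte (`B1` instead of `C1`).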
+
+### Body Sketch: x_atomic_cas
+
+**Caller contract**: Compare-and-swap with two memory orders (success/failure). Returns:
+- `prior`: the value observed in memory (equal to `expected` on success, the current contents on failure)
+- `ok`: condition flag (0/1 indicating success)
+
+```c
+static void x_atomic_cas(NativeTarget* t, NativeLoc prior, NativeLoc ok,
+                         NativeAddr addr, NativeLoc expected, NativeLoc desired,
+                         MemAccess mem, MemOrder success, MemOrder failure) {
+  MCEmitter* mc = t->mc;
+  u32 sz = mem.size ? mem.size : x_type_byte_size(t, prior.type);
+  int w = (sz == 8) ? 1 : 0;
+  u32 base = x_atomic_addr_base(t, addr);
+  i32 offset = addr.offset;
+  
+  /* CMPXCHG uses RAX as implicit operand (compare value).
+     On success, ZF=1; on failure, ZF=0 (and RAX=actual). */
+  
+  /* Materialize expected into RAX. */
+  if (expected.kind == NATIVE_LOC_IMM)
+    x_emit_load_imm(mc, w, X64_RAX, expected.v.imm);
+  else
+    emit_mov_rr(mc, w, X64_RAX, expected.v.reg & 0xfu);
+  
+  /* Materialize desired into RCX (working reg). */
+  if (desired.kind == NATIVE_LOC_IMM)
+    x_emit_load_imm(mc, w, X64_RCX, desired.v.imm);
+  else
+    emit_mov_rr(mc, w, X64_RCX, desired.v.reg & 0xfu);
+  
+  /* LOCK CMPXCHG [base+offset], rcx
+     Implicit: rax = expected, mem = current. ZF = (mem == expected). */
+  emit_lock_cmpxchg(mc, w, X64_RCX, base, offset);
+  
+  /* prior = rax: the value that was in memory before the operation
+     (equals expected on success, the actual contents on failure).
+     MOV leaves the flags untouched, so ZF from CMPXCHG survives. */
+  if ((prior.v.reg & 0xfu) != X64_RAX)
+    emit_mov_rr(mc, w, prior.v.reg & 0xfu, X64_RAX);
+  
+  /* ok = ZF, captured with SETE (1 on success, 0 on failure). */
+  u32 ok_reg = ok.v.reg & 0xfu;
+  emit_setcc(mc, X64_CC_E, ok_reg);  /* sete ok_reg */
+  /* Zero-extend the byte result. */
+  emit_movzx(mc, 0, ok_reg, ok_reg, 0);  /* movzx ok_reg, ok_reg_b */
+}
+```
+
+**Emit helpers**:
+- `emit_setcc(cond, reg)`: 0x0f 0x90+cond modrm; sets the low byte of reg to 0/1 from the condition (ZF, OF, etc.)
+- `emit_movzx(w, dst, src, sign)`: zero/sign extends src into dst
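+
+The `prior`/`ok` contract matches what C11 `atomic_compare_exchange_strong` exposes. A minimal sketch for reference, where `CasResult` and `cas_u64` are hypothetical names chosen for this example:
+
+```c
+#include <stdatomic.h>
+#include <stdbool.h>
+#include <stdint.h>
+
+typedef struct {
+  uint64_t prior;  /* value in memory before the operation */
+  bool ok;         /* true if the swap happened */
+} CasResult;
+
+/* On failure the library call overwrites its `expected` argument with the
+   current contents, just as CMPXCHG leaves the current contents in RAX. */
+static CasResult cas_u64(_Atomic uint64_t* p, uint64_t expected, uint64_t desired) {
+  CasResult r;
+  r.prior = expected;
+  r.ok = atomic_compare_exchange_strong(p, &r.prior, desired);
+  return r;
+}
+```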
+
+### Body Sketch: x_fence
+
+```c
+static void x_fence(NativeTarget* t, MemOrder order) {
+  if (order == MO_SEQ_CST)
+    emit_mfence(t->mc);
+  /* Other orders (acquire, release, relaxed) are implicit in TSO. */
+}
+```
+
+---
+
+## Part 2: Variadic Argument Handling
+
+### Context: SysV vs Win64 va_list Layout
+
+Two fundamentally different designs:
+
+**SysV x64** (Linux/BSD):
+- `va_list` is a 24-byte struct holding gp_offset, fp_offset, overflow_arg_area, reg_save_area
+- Prologue saves 6 GPR (rdi, rsi, rdx, rcx, r8, r9) + 8 XMM (xmm0..7) to a 176-byte register save area (6*8 + 8*16)
+- va_arg scans offsets and fetches from either the save area (if offset < max) or overflow area
+
+**Win64** (Windows):
+- `va_list` is a single pointer to the next variadic stack slot
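+- No separate register save area: the callee spills RCX/RDX/R8/R9 into the caller-allocated 32-byte home (shadow) space, so all variadic args end up contiguous on the stack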
+- Caller-side **vararg_fp_dup_to_gpr** flag: FP varargs are duplicated into the matching GPR slot by the call site
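+
+The two layouts can be written down as plain C types. This is an illustrative sketch only: `SysVVaList`, `Win64VaList` and the enum names are hypothetical, and pointer fields are spelled `uint64_t` so the size checks hold on any host:
+
+```c
+#include <stddef.h>
+#include <stdint.h>
+
+/* SysV x86-64 va_list element (24 bytes). */
+typedef struct {
+  uint32_t gp_offset;          /* next GPR slot, 0..48 */
+  uint32_t fp_offset;          /* next XMM slot, 48..176 */
+  uint64_t overflow_arg_area;  /* next stack-passed argument */
+  uint64_t reg_save_area;      /* base of the register save area */
+} SysVVaList;
+
+enum {
+  GP_SAVE_BYTES = 6 * 8,   /* rdi, rsi, rdx, rcx, r8, r9 */
+  FP_SAVE_BYTES = 8 * 16,  /* xmm0..xmm7 */
+  REG_SAVE_BYTES = GP_SAVE_BYTES + FP_SAVE_BYTES
+};
+
+/* Win64 va_list: a single pointer to the next 8-byte slot. */
+typedef uint64_t Win64VaList;
+```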
+
+### NativeTarget Hooks
+
+Lines 413–417 in src/arch/native_target.h:
+
+```c
+void (*va_start_)(NativeTarget*, NativeLoc ap_ptr);
+void (*va_arg_)(NativeTarget*, NativeLoc dst, NativeLoc ap_ptr, CfreeCgTypeId type);
+void (*va_end_)(NativeTarget*, NativeLoc ap_ptr);
+void (*va_copy_)(NativeTarget*, NativeLoc dst_ap_ptr, NativeLoc src_ap_ptr);
+```
+
+**Caller contract**:
+- `ap_ptr`: NATIVE_LOC_REG or NATIVE_LOC_FRAME (the va_list object's address)
+- `dst`: for va_arg, NATIVE_LOC_REG (destination register for the fetched value)
+- The backend queries the ABI via `abi_va_list_layout(t->c->abi)` to determine structure layout
+
+### Body Sketch: x_va_start_
+
+**Must query the ABI** to determine SysV vs Win64. Use `abi_cg_func_info()` + `abi_va_list_layout()`.
+
+```c
+static void x_va_start_(NativeTarget* t, NativeLoc ap_ptr) {
+  X64NativeTarget* a = x64_of(t);
+  MCEmitter* mc = t->mc;
+  
+  if (!a->is_variadic)
+    compiler_panic(t->c, a->func ? a->func->loc : (SrcLoc){0, 0, 0},
+                   "x64 va_start: function not variadic");
+  
+  ABIVaListInfo vai = abi_va_list_layout(t->c->abi);
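+  /* Address of the va_list object. Must not be RAX: the field values below
+     are materialized through RAX and would clobber the base. */
+  u32 ap_reg = ap_ptr.v.reg & 0xfu;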
+  
+  if (vai.kind == ABI_VA_LIST_POINTER) {
+    /* Win64: va_list = pointer to next stack slot.
+       Incoming variadics start at [rbp + 16 + (n_named_int * 8) + n_named_stack].
+       ([rbp+0] is the saved rbp and [rbp+8] the return address; the caller-allocated
+       32-byte home space at [rbp+16..rbp+48) receives the RCX/RDX/R8/R9 spills.) */
+    u32 first_var_off = 16u + (a->next_param_int * 8u) + a->next_param_stack;
+    x_emit_lea(mc, X64_RAX, X64_RBP, (i32)first_var_off);
+    x_emit_mov_store(mc, 8, X64_RAX, ap_reg, 0);  /* *ap = lea result */
+    return;
+  }
+  
+  if (vai.kind == ABI_VA_LIST_SYSV_STRUCT) {
+    /* SysV: 24-byte va_list struct with 4 fields, plus 176-byte register save area.
+       *ap = { gp_offset=0, fp_offset=48, overflow_arg_area, reg_save_area } */
+    
+    /* Get the register save slot (allocated during func_begin). */
+    X64NativeSlot* rs = x64_slot_get(a, a->reg_save_slot);
+    if (!rs)
+      compiler_panic(t->c, a->func ? a->func->loc : (SrcLoc){0, 0, 0},
+                     "x64 va_start: no reg_save_slot");
+    
+    /* gp_offset = next_param_int * 8 (bytes into save area for next GP reg) */
+    x_emit_load_imm(mc, 0, X64_RAX, (i64)(a->next_param_int * 8u));
+    x_emit_mov_store(mc, 4, X64_RAX, ap_reg, 0);
+    
+    /* fp_offset = 48 + next_param_fp * 16 (XMM area starts at byte 48) */
+    x_emit_load_imm(mc, 0, X64_RAX, (i64)(48u + a->next_param_fp * 16u));
+    x_emit_mov_store(mc, 4, X64_RAX, ap_reg, 4);
+    
+    /* overflow_arg_area = rbp + 16 + next_param_stack (stack args start above saved pair) */
+    x_emit_lea(mc, X64_RAX, X64_RBP, (i32)(16u + a->next_param_stack));
+    x_emit_mov_store(mc, 8, X64_RAX, ap_reg, 8);
+    
+    /* reg_save_area = rbp - rs->off (address of register save area) */
+    x_emit_lea(mc, X64_RAX, X64_RBP, -(i32)rs->off);
+    x_emit_mov_store(mc, 8, X64_RAX, ap_reg, 16);
+    return;
+  }
+  
+  compiler_panic(t->c, a->func ? a->func->loc : (SrcLoc){0, 0, 0},
+                 "x64 va_start: unsupported va_list layout");
+}
+```
+
+**Key state tracked** (set during func_begin):
+- `a->next_param_int`: count of GPR params consumed by the signature
+- `a->next_param_fp`: count of FP params
+- `a->next_param_stack`: bytes of stack-passed named params (the first variadic stack arg sits at rbp + 16 + this value)
+- `a->reg_save_slot`: frame slot holding the register save area (SysV only)
+- `a->is_variadic`: whether the function signature is variadic
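+
+As a worked instance of how these counters feed the SysV `va_start` stores, the sketch below computes the three derived values on the host. `VaStartFields` and `sysv_va_start_fields` are hypothetical names, and the formulas are the ones used in the body sketch above:
+
+```c
+#include <stdint.h>
+
+typedef struct {
+  uint32_t gp_offset;        /* stored at va_list + 0 */
+  uint32_t fp_offset;        /* stored at va_list + 4 */
+  int32_t overflow_rbp_off;  /* overflow_arg_area = rbp + this */
+} VaStartFields;
+
+static VaStartFields sysv_va_start_fields(uint32_t next_param_int,
+                                          uint32_t next_param_fp,
+                                          uint32_t next_param_stack) {
+  VaStartFields f;
+  f.gp_offset = next_param_int * 8u;        /* GPR slots are 8 bytes */
+  f.fp_offset = 48u + next_param_fp * 16u;  /* XMM area starts at byte 48 */
+  f.overflow_rbp_off = (int32_t)(16u + next_param_stack);
+  return f;
+}
+```
+
+For `int f(int a, double b, ...)` the named params consume one GPR and one XMM register, giving gp_offset = 8, fp_offset = 64 and an overflow area at rbp + 16.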
+
+### Body Sketch: x_va_arg_
+
+Complex: SysV has two paths (register or overflow area), Win64 is simpler.
+
+```c
+static void x_va_arg_(NativeTarget* t, NativeLoc dst, NativeLoc ap_ptr,
+                      CfreeCgTypeId type) {
+  X64NativeTarget* a = x64_of(t);
+  MCEmitter* mc = t->mc;
+  
+  ABIVaListInfo vai = abi_va_list_layout(t->c->abi);
+  u32 ap_reg = ap_ptr.v.reg & 0xfu;
+  u32 sz = x_type_byte_size(t, type);
+  int is_fp = (dst.cls == NATIVE_REG_FP);
+  u32 dr = dst.v.reg & 0xfu;
+  
+  if (vai.kind == ABI_VA_LIST_POINTER) {
+    /* Win64: va_list is a plain pointer. All variadics are 8-byte slots
+       on the stack. FP varargs are duplicated into GPR by call site.
+       
+       r11 = *ap              (current slot address)
+       if (is_fp) xmm_dst = [r11] else dst = [r11]
+       r11 += 8
+       *ap = r11 */
+    
+    x_emit_mov_load(mc, 8, 0, X64_R11, ap_reg, 0);
+    if (is_fp) {
+      u8 prefix = (sz == 8) ? 0xf2 : 0xf3;
+      x_emit_sse_load(mc, prefix, 0x10, dr, X64_R11, 0);  /* movs[sd] */
+    } else {
+      int sx = type_is_signed(type);
+      x_emit_mov_load(mc, sz, sx, dr, X64_R11, 0);
+    }
+    
+    /* add r11, 8 (advance to next slot) */
+    x_emit_alu_imm(mc, 1, 0, X64_R11, 8);  /* ADD */
+    x_emit_mov_store(mc, 8, X64_R11, ap_reg, 0);
+    return;
+  }
+  
+  if (vai.kind == ABI_VA_LIST_SYSV_STRUCT) {
+    /* SysV: check if arg came via register or overflow.
+       offs_field = (is_fp ? 4 : 0)    (gp_offset or fp_offset field offset)
+       max_offs = (is_fp ? 176 : 48)   (end of register save area)
+       stride = (is_fp ? 16 : 8)       (bytes per register slot)
+       
+       eax = va_list[offs_field]
+       if (eax >= max_offs) goto L_stack
+       dst = *(va_list.reg_save_area + eax)
+       eax += stride
+       va_list[offs_field] = eax
+       goto L_done
+       L_stack:
+       dst = *va_list.overflow_arg_area
+       va_list.overflow_arg_area += 8
+       L_done: */
+    
+    u32 offs_field = is_fp ? 4u : 0u;
+    u32 max_offs = is_fp ? 176u : 48u;
+    u32 stride = is_fp ? 16u : 8u;
+    
+    MCLabel L_stack = mc->label_new(mc);
+    MCLabel L_done = mc->label_new(mc);
+    
+    /* Load offset field. */
+    x_emit_mov_load(mc, 4, 0, X64_RAX, ap_reg, (i32)offs_field);
+    
+    /* Compare with max_offs. */
+    if (max_offs <= 127u) {
+      x_emit_cmp_imm8(mc, 0, X64_RAX, (i8)max_offs);
+    } else {
+      /* cmp eax, imm32 via 0x3d (EAX-specific form). */
+      u32 ofs = obj_pos(mc->obj, mc->section_id);
+      u8 op = 0x3d;
+      mc->emit_bytes(mc, &op, 1);
+      emit_u32le(mc, max_offs);
+      /* Debug row if needed */
+    }
+    
+    /* jae L_stack (jump if >=, i.e., not in register range). */
+    emit_jcc_label(mc, X64_CC_AE, L_stack);
+    
+    /* In-register path: dst = *(reg_save_area + offset). */
+    x_emit_mov_load(mc, 8, 0, X64_RCX, ap_reg, 16);  /* rcx = reg_save_area */
+    /* 64-bit add: rcx holds a pointer. The 32-bit load above zero-extended
+       the offset into rax. */
+    emit_alu_rr(mc, 1, 0x01, X64_RCX, X64_RAX);  /* add rcx, rax */
+    
+    if (is_fp) {
+      u8 prefix = (sz == 8) ? 0xf2 : 0xf3;
+      x_emit_sse_load(mc, prefix, 0x10, dr, X64_RCX, 0);
+    } else {
+      int sx = type_is_signed(type);
+      x_emit_mov_load(mc, sz, sx, dr, X64_RCX, 0);
+    }
+    
+    /* Update offset: offset += stride. */
+    if (stride <= 127u) {
+      x_emit_alu_imm8(mc, 0, 0, X64_RAX, (i8)stride);  /* add eax, stride */
+    } else {
+      x_emit_alu_imm32(mc, 0, 0, X64_RAX, stride);
+    }
+    x_emit_mov_store(mc, 4, X64_RAX, ap_reg, (i32)offs_field);
+    
+    emit_jmp_label(mc, L_done);
+    
+    /* Overflow area path: dst = overflow_arg_area; overflow_arg_area += 8. */
+    mc->label_place(mc, L_stack);
+    x_emit_mov_load(mc, 8, 0, X64_RCX, ap_reg, 8);  /* rcx = overflow_arg_area */
+    
+    if (is_fp) {
+      u8 prefix = (sz == 8) ? 0xf2 : 0xf3;
+      x_emit_sse_load(mc, prefix, 0x10, dr, X64_RCX, 0);
+    } else {
+      int sx = type_is_signed(type);
+      x_emit_mov_load(mc, sz, sx, dr, X64_RCX, 0);
+    }
+    
+    x_emit_alu_imm(mc, 1, 0, X64_RCX, 8);  /* add rcx, 8 */
+    x_emit_mov_store(mc, 8, X64_RCX, ap_reg, 8);
+    
+    mc->label_place(mc, L_done);
+    return;
+  }
+  
+  compiler_panic(t->c, a->func ? a->func->loc : (SrcLoc){0, 0, 0},
+                 "x64 va_arg: unsupported va_list layout");
+}
+```
+
+**Key point**: For SysV, the register-save area is **prologue-emitted** in x_emit_variadic_reg_saves() (part of x_func_begin). func_begin must reserve 176 bytes in the frame (`reg_save_slot`) and save the 6 GPRs + 8 XMM registers at fixed offsets within it, because va_arg indexes the area directly with gp_offset/fp_offset.
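+
+The branch structure that x_va_arg_ emits can be checked against a host-side model. The following is a simulation sketch, not backend code; `SimVaList`, `sim_va_arg_u64` and `sim_va_arg_f64` are hypothetical names:
+
+```c
+#include <stdint.h>
+#include <string.h>
+
+typedef struct {
+  uint32_t gp_offset;
+  uint32_t fp_offset;
+  const uint8_t* overflow_arg_area;
+  const uint8_t* reg_save_area;
+} SimVaList;
+
+/* Integer fetch: register slot while gp_offset < 48, else the overflow area. */
+static uint64_t sim_va_arg_u64(SimVaList* ap) {
+  uint64_t v;
+  if (ap->gp_offset < 48u) {
+    memcpy(&v, ap->reg_save_area + ap->gp_offset, 8);
+    ap->gp_offset += 8u;
+  } else {
+    memcpy(&v, ap->overflow_arg_area, 8);
+    ap->overflow_arg_area += 8;
+  }
+  return v;
+}
+
+/* Double fetch: XMM slots are 16 bytes apart and end at byte 176. */
+static double sim_va_arg_f64(SimVaList* ap) {
+  double v;
+  if (ap->fp_offset < 176u) {
+    memcpy(&v, ap->reg_save_area + ap->fp_offset, 8);
+    ap->fp_offset += 16u;
+  } else {
+    memcpy(&v, ap->overflow_arg_area, 8);
+    ap->overflow_arg_area += 8;
+  }
+  return v;
+}
+```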
+
+### Body Sketch: x_va_end_ & x_va_copy_
+
+```c
+static void x_va_end_(NativeTarget* t, NativeLoc ap_ptr) {
+  (void)t;
+  (void)ap_ptr;
+  /* x64 va_end is a no-op (no resources to clean up). */
+}
+
+static void x_va_copy_(NativeTarget* t, NativeLoc dst_ap_ptr, NativeLoc src_ap_ptr) {
+  MCEmitter* mc = t->mc;
+  /* Copy 24 bytes (SysV) or 8 bytes (Win64) from src to dst as
+     8-byte moves through R10 (scratch). */
+  
+  ABIVaListInfo vai = abi_va_list_layout(t->c->abi);
+  u32 copy_sz = (vai.kind == ABI_VA_LIST_POINTER) ? 8u : 24u;
+  
+  u32 src_ptr = src_ap_ptr.v.reg & 0xfu;
+  u32 dst_ptr = dst_ap_ptr.v.reg & 0xfu;
+  u32 off;
+  
+  /* One 8-byte load/store pair per word: 1 word on Win64, 3 on SysV. */
+  for (off = 0; off < copy_sz; off += 8u) {
+    x_emit_mov_load(mc, 8, 0, X64_R10, src_ptr, (i32)off);
+    x_emit_mov_store(mc, 8, X64_R10, dst_ptr, (i32)off);
+  }
+}
+```
+
+### NativeOps Adapter (NativeDirectTarget path)
+
+The -O0 direct path uses semantic operands (OPK_REG, OPK_LOCAL), not NativeLoc. The NativeOps vtable (src/cg/native_direct_target.h lines 81–86) bridges:
+
+```c
+struct NativeOps {
+  ...
+  void (*va_start_)(NativeDirectTarget*, Operand ap_addr);
+  void (*va_arg_)(NativeDirectTarget*, Operand dst, Operand ap_addr, CfreeCgTypeId);
+  void (*va_end_)(NativeDirectTarget*, Operand ap_addr);
+  void (*va_copy_)(NativeDirectTarget*, Operand dst_ap_addr, Operand src_ap_addr);
+  ...
+};
+```
+
+Pattern (mirror rv64's rv_direct_va_base + rv_va_*_core):
+
+```c
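+/* Convert semantic Operand (OPK_LOCAL holding va_list struct) to NativeAddr.
+   Key issue: OPK_LOCAL is the address of the frame slot; we need to pass that
+   pointer as a register location to the native hooks.
+   scratch_reg must not be RAX: the va_start/va_arg cores materialize va_list
+   field values through RAX, so a base held there is clobbered. Use R11. */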
+static NativeAddr x_direct_va_base(NativeDirectTarget* d, Operand ap_addr, 
+                                    u32 scratch_reg) {
+  NativeAddr addr;
+  memset(&addr, 0, sizeof addr);
+  
+  if (ap_addr.kind == OPK_LOCAL) {
+    /* Load the address into scratch_reg, then use that register. */
+    NativeTarget* nt = d->native;
+    OperandLoc floc = /* resolve ap_addr to frame slot */;
+    /* Load frame address into scratch_reg. */
+    addr.base_kind = NATIVE_ADDR_BASE_REG;
+    addr.base.reg = scratch_reg;
+  } else if (ap_addr.kind == OPK_REG) {
+    addr.base_kind = NATIVE_ADDR_BASE_REG;
+    addr.base.reg = ap_addr.v.reg;
+  }
+  return addr;
+}
+
+static void x_va_start_direct(NativeDirectTarget* d, Operand ap_addr) {
+  NativeTarget* nt = d->native;
+  /* Resolve ap_addr to a register (or load it into a scratch). */
+  NativeLoc ap_ptr = nd_materialize_operand(d, ap_addr);  /* Helper: load if needed */
+  nt->va_start_(nt, ap_ptr);
+}
+```
+
+---
+
+## Part 3: Inline Assembly
+
+### Context: x64 Inline Assembly Constraints
+
+x64 constraints in the legacy asm.c (git 429defa:src/arch/x64/asm.c):
+- **r**: GPR (rax, rbx, ..., r15)
+- **m**: Memory operand (direct reg, indirect [reg], indexed [reg+scale*idx], RIP-relative)
+- **i**: Immediate (any constant)
+- **a,b,c,d,S,D**: Specific registers (rax, rbx, rcx, rdx, rsi, rdi)
+- **x**: XMM registers (xmm0..xmm15)
+- Width qualifiers: b (byte), w (word), l (dword), q (qword)
+
+### NativeTarget Hooks
+
+Lines 420–423 in src/arch/native_target.h:
+
+```c
+void (*asm_block)(NativeTarget*, const char* tmpl, const AsmConstraint* outs,
+                  u32 nout, NativeLoc* out_locs, const AsmConstraint* ins,
+                  u32 nin, const NativeLoc* in_locs, const Sym* clobbers,
+                  u32 nclob);
+```
+
+**Caller contract**: All operand locations are already physically bound (NATIVE_LOC_REG, NATIVE_LOC_FRAME, NATIVE_LOC_IMM, NATIVE_LOC_ADDR). No further allocation needed.
+
+### Body Sketch: x_asm_block_native
+
+Mirrors aa64/rv64 pattern: open an Asm context, bind operands (semantic → physical), run template expansion, close.
+
+```c
+static void x_asm_block_native(NativeTarget* t, const char* tmpl,
+                               const AsmConstraint* outs, u32 nout,
+                               NativeLoc* out_locs,
+                               const AsmConstraint* ins, u32 nin,
+                               const NativeLoc* in_locs,
+                               const Sym* clobbers, u32 nclob) {
+  X64NativeTarget* a = x64_of(t);
+  Compiler* c = t->c;
+  SrcLoc loc = a->func ? a->func->loc : (SrcLoc){0, 0, 0};
+  
+  /* Allocate bound operand arrays. */
+  Operand* bound_outs = nout ? arena_zarray(c->tu, Operand, nout) : NULL;
+  Operand* bound_ins = nin ? arena_zarray(c->tu, Operand, nin) : NULL;
+  
+  X64Asm* asmh;
+  u32 i;
+  
+  /* Track clobbered registers for prologue/epilogue purposes. */
+  for (i = 0; i < nclob; ++i) {
+    Reg phys;
+    RegClass cls;
+    if (!c->resolve_reg_name || c->resolve_reg_name(c, clobbers[i], &phys, &cls) != 0)
+      continue;
+    if (cls == RC_INT) {
+      /* Mark callee-saved regs that are clobbered (SysV set shown; Win64
+         additionally treats RSI and RDI as callee-saved). */
+      if (phys == X64_RBX || phys == X64_R12 || phys == X64_R13 ||
+          phys == X64_R14 || phys == X64_R15)
+        a->used_cs_int_mask |= (1u << phys);
+    }
+  }
+  
+  /* Bind outputs: constraint + out_locs[i] → Operand. */
+  for (i = 0; i < nout; ++i) {
+    CfreeCgTypeId type = outs[i].type ? outs[i].type : out_locs[i].type;
+    x_asm_bind_native(a, loc, &bound_outs[i], outs[i].str, type, out_locs[i]);
+  }
+  
+  /* Bind inputs. */
+  for (i = 0; i < nin; ++i) {
+    const char* constraint = ins[i].str;
+    /* Check for matching constraint (e.g., "0" = matches output 0). */
+    int matched = x_asm_match_index(constraint);
+    CfreeCgTypeId type = ins[i].type ? ins[i].type : in_locs[i].type;
+    
+    if (matched >= 0) {
+      if ((u32)matched >= nout)
+        compiler_panic(c, loc, "x64 asm: matching constraint out of range");
+      bound_ins[i] = bound_outs[matched];
+      continue;
+    }
+    
+    /* Regular constraint. */
+    x_asm_bind_native(a, loc, &bound_ins[i], constraint, type, in_locs[i]);
+  }
+  
+  /* Open asm template processor, run, close. */
+  asmh = x64_asm_open(c);
+  x64_inline_bind(asmh, outs, nout, bound_outs, ins, nin, bound_ins,
+                  clobbers, nclob);
+  x64_asm_run_template(asmh, t->mc, tmpl);
+  x64_asm_close(asmh);
+}
+
+/* Helper: bind a single constraint to a native location.
+   Handles "r" → allocable GPR, "m" → [reg+offset], "i" → immediate, etc. */
+static void x_asm_bind_native(X64NativeTarget* a, SrcLoc loc,
+                              Operand* out, const char* constraint,
+                              CfreeCgTypeId type, NativeLoc nloc) {
+  /* Parse constraint (e.g., "=r", "+m", "i", "0", etc.). */
+  const char* body = x_asm_constraint_body(constraint);
+  
+  memset(out, 0, sizeof *out);
+  
+  if (body[0] == 'r') {
+    /* Register constraint: expect nloc.kind == NATIVE_LOC_REG. */
+    if (nloc.kind != NATIVE_LOC_REG)
+      compiler_panic(a->base.c, loc, "x64 asm: 'r' constraint needs register");
+    out->kind = OPK_REG;
+    out->cls = x_class_from_type(type);  /* RC_INT or RC_FP */
+    out->v.reg = nloc.v.reg;
+    return;
+  }
+  
+  if (body[0] == 'm') {
+    /* Memory constraint: nloc is NATIVE_LOC_FRAME or NATIVE_LOC_ADDR. */
+    if (nloc.kind == NATIVE_LOC_FRAME) {
+      out->kind = OPK_LOCAL;
+      out->v.frame_slot = nloc.v.frame;
+    } else if (nloc.kind == NATIVE_LOC_ADDR) {
+      /* Materialized address: [base + disp]. */
+      out->kind = OPK_INDIRECT;
+      out->v.ind.base = nloc.v.addr.base.reg;
+      out->v.ind.index = REG_NONE;
+      out->v.ind.ofs = nloc.v.addr.offset;
+    } else {
+      compiler_panic(a->base.c, loc, "x64 asm: 'm' constraint needs memory");
+    }
+    return;
+  }
+  
+  if (body[0] == 'i') {
+    /* Immediate constraint: nloc.kind == NATIVE_LOC_IMM. */
+    if (nloc.kind != NATIVE_LOC_IMM)
+      compiler_panic(a->base.c, loc, "x64 asm: 'i' constraint needs immediate");
+    out->kind = OPK_IMM;
+    out->v.imm = nloc.v.imm;
+    return;
+  }
+  
+  if (body[0] == 'a' || body[0] == 'b' || body[0] == 'c' || body[0] == 'd' ||
+      body[0] == 'S' || body[0] == 'D') {
+    /* Specific register constraint (a=rax, b=rbx, c=rcx, d=rdx, S=rsi, D=rdi). */
+    static const Reg map[] = {X64_RAX, X64_RBX, X64_RCX, X64_RDX, X64_RSI, X64_RDI};
+    static const char* const reg_names[] = {"rax", "rbx", "rcx", "rdx", "rsi", "rdi"};
+    const char* names = "abcdSD";
+    for (int j = 0; j < 6; ++j) {
+      if (body[0] == names[j]) {
+        if (nloc.kind != NATIVE_LOC_REG || (nloc.v.reg & 0xfu) != map[j])
+          compiler_panic(a->base.c, loc, "x64 asm: constraint '%c' requires %s",
+                         body[0], reg_names[j]);
+        out->kind = OPK_REG;
+        out->cls = RC_INT;
+        out->v.reg = nloc.v.reg;
+        return;
+      }
+    }
+  }
+  
+  if (body[0] == 'x') {
+    /* XMM constraint: expect NATIVE_LOC_REG with FP class. */
+    if (nloc.kind != NATIVE_LOC_REG || nloc.cls != NATIVE_REG_FP)
+      compiler_panic(a->base.c, loc, "x64 asm: 'x' constraint needs XMM");
+    out->kind = OPK_REG;
+    out->cls = RC_FP;
+    out->v.reg = nloc.v.reg;
+    return;
+  }
+  
+  compiler_panic(a->base.c, loc, "x64 asm: unsupported constraint '%s'", constraint);
+}
+```
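+
+The sketch above calls `x_asm_constraint_body` and `x_asm_match_index` without showing them. One plausible shape, assuming GCC-style constraint strings, is given below under the stand-in names `asm_constraint_body` and `asm_match_index`; the real helpers may differ:
+
+```c
+/* Skips the modifier prefix ('=', '+', '&', '%') and returns a pointer to
+   the first constraint letter. */
+static const char* asm_constraint_body(const char* s) {
+  while (*s == '=' || *s == '+' || *s == '&' || *s == '%') ++s;
+  return s;
+}
+
+/* Returns the output index for an all-digit matching constraint such as
+   "0", or -1 for anything else (including the empty string). */
+static int asm_match_index(const char* s) {
+  int v = 0;
+  if (*s < '0' || *s > '9') return -1;
+  for (; *s >= '0' && *s <= '9'; ++s) v = v * 10 + (*s - '0');
+  return *s == '\0' ? v : -1;
+}
+```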
+
+### Legacy asm.c Integration
+
+The standalone assembler (asm.c) currently includes internal.h to access legacy types and must be updated to include a new emit.h header instead. See file management section.
+
+---
+
+## Part 4: Intrinsics
+
+### NativeTarget Hook
+
+Lines 418–419 in src/arch/native_target.h:
+
+```c
+void (*intrinsic)(NativeTarget*, IntrinKind, const NativeLoc* dsts, u32 ndst,
+                  const NativeLoc* args, u32 narg);
+```
+
+IntrinKind enum (from cg/cgtarget.h): INTRIN_POPCOUNT, INTRIN_CTZ, INTRIN_CLZ, INTRIN_BSWAP*, INTRIN_EXPECT, INTRIN_ASSUME_ALIGNED, INTRIN_PREFETCH, INTRIN_TRAP, INTRIN_UNREACHABLE, INTRIN_*_OVERFLOW, and memory intrinsics (MEMCPY, MEMSET).
+
+### Body Sketch: x_intrinsic
+
+```c
+static void x_intrinsic(NativeTarget* t, IntrinKind kind, const NativeLoc* dsts,
+                        u32 ndst, const NativeLoc* args, u32 narg) {
+  MCEmitter* mc = t->mc;
+  /* ndst/narg are checked per case below; a malformed call emits nothing. */
+  
+  switch (kind) {
+    case INTRIN_POPCOUNT: {
+      /* POPCNT rd, rs: F3 0F B8 /r. Gated by its own CPUID feature bit (POPCNT),
+         which Intel introduced alongside SSE4.2. */
+      if (ndst >= 1 && narg >= 1) {
+        u32 sz = x_type_byte_size(t, args[0].type);
+        int w = (sz == 8) ? 1 : 0;
+        emit_popcnt(mc, w, dsts[0].v.reg & 0xfu, args[0].v.reg & 0xfu);
+      }
+      return;
+    }
+    
+    case INTRIN_CTZ: {
+      /* BSF gives index of lowest set bit; if the input is 0, ZF=1 and the
+         destination is undefined. */
+      if (ndst >= 1 && narg >= 1) {
+        u32 sz = x_type_byte_size(t, args[0].type);
+        int w = (sz == 8) ? 1 : 0;
+        emit_bs(mc, w, 0xbc, dsts[0].v.reg & 0xfu, args[0].v.reg & 0xfu);  /* BSF */
+      }
+      return;
+    }
+    
+    case INTRIN_CLZ: {
+      /* BSR gives index of highest set bit; XOR with (bits-1) for CLZ. */
+      if (ndst >= 1 && narg >= 1) {
+        u32 sz = x_type_byte_size(t, args[0].type);
+        int w = (sz == 8) ? 1 : 0;
+        u32 dr = dsts[0].v.reg & 0xfu;
+        emit_bs(mc, w, 0xbd, dr, args[0].v.reg & 0xfu);  /* BSR */
+        emit_xor_imm32(mc, w, dr, w ? 63 : 31);
+      }
+      return;
+    }
+    
+    case INTRIN_BSWAP16:
+    case INTRIN_BSWAP32:
+    case INTRIN_BSWAP64: {
+      if (ndst >= 1 && narg >= 1) {
+        u32 dr = dsts[0].v.reg & 0xfu;
+        u32 sr = args[0].v.reg & 0xfu;
+        int w = (kind == INTRIN_BSWAP64) ? 1 : 0;
+        if (dr != sr) emit_mov_rr(mc, w, dr, sr);
+        if (kind == INTRIN_BSWAP16) {
+          emit_rol16_imm8(mc, dr, 8);  /* ROL r16, 8 (swaps the two bytes) */
+        } else {
+          emit_bswap(mc, w, dr);  /* BSWAP r32/r64 */
+        }
+      }
+      return;
+    }
+    
+    case INTRIN_SADD_OVERFLOW:
+    case INTRIN_UADD_OVERFLOW:
+    case INTRIN_SSUB_OVERFLOW:
+    case INTRIN_USUB_OVERFLOW:
+    case INTRIN_SMUL_OVERFLOW:
+    case INTRIN_UMUL_OVERFLOW: {
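+      /* Result in dsts[0], overflow flag in dsts[1]. Signed ops report
+         overflow in OF; unsigned add/sub report it in CF. */
+      if (ndst >= 2 && narg >= 2) {
+        u32 sz = x_type_byte_size(t, dsts[0].type);
+        int w = (sz == 8) ? 1 : 0;
+        u32 rd = dsts[0].v.reg & 0xfu;
+        u32 ro = dsts[1].v.reg & 0xfu;  /* overflow flag */
+        
+        u32 ra = args[0].v.reg & 0xfu;
+        u32 rb = args[1].v.reg & 0xfu;
+        u32 cc = X64_CC_O;  /* signed overflow by default */
+        
+        /* rd = ra for every op, IMUL included (assumes rd != rb). */
+        emit_mov_rr(mc, w, rd, ra);
+        switch (kind) {
+          case INTRIN_UADD_OVERFLOW:
+            cc = X64_CC_B;  /* CF; assumes a carry code exists next to X64_CC_AE */
+            /* fallthrough */
+          case INTRIN_SADD_OVERFLOW:
+            emit_alu_rr(mc, w, 0x01, rd, rb);  /* ADD */
+            break;
+          case INTRIN_USUB_OVERFLOW:
+            cc = X64_CC_B;  /* CF = borrow */
+            /* fallthrough */
+          case INTRIN_SSUB_OVERFLOW:
+            emit_alu_rr(mc, w, 0x29, rd, rb);  /* SUB */
+            break;
+          case INTRIN_SMUL_OVERFLOW:
+            emit_imul_rr(mc, w, rd, rb);  /* IMUL: OF = signed overflow */
+            break;
+          case INTRIN_UMUL_OVERFLOW:
+            /* TODO: IMUL's OF only reports signed overflow; unsigned needs
+               one-operand MUL (RDX:RAX = RAX * r/m, CF/OF set when RDX != 0). */
+            emit_imul_rr(mc, w, rd, rb);
+            break;
+          default:
+            break;
+        }
+        
+        /* SETcc ro (0x0F 0x90+cc), then zero-extend the byte. */
+        emit_setcc(mc, cc, ro);
+        emit_movzx(mc, 0, ro, ro, 0);
+      }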
+      return;
+    }
+    
+    case INTRIN_TRAP:
+    case INTRIN_UNREACHABLE: {
+      /* UD2: 0x0F 0x0B (undefined instruction) */
+      mc->emit_bytes(mc, (u8[]){0x0f, 0x0b}, 2);
+      return;
+    }
+    
+    case INTRIN_EXPECT:
+    case INTRIN_ASSUME_ALIGNED: {
+      /* Hints dropped; just copy value. */
+      if (ndst >= 1 && narg >= 1) {
+        if (args[0].kind == NATIVE_LOC_IMM)
+          x_emit_load_imm(mc, x_is_64(t, dsts[0].type) ? 1 : 0,
+                          dsts[0].v.reg & 0xfu, args[0].v.imm);
+        else
+          emit_mov_rr(mc, x_is_64(t, dsts[0].type) ? 1 : 0,
+                      dsts[0].v.reg & 0xfu, args[0].v.reg & 0xfu);
+      }
+      return;
+    }
+    
+    case INTRIN_PREFETCH:
+      /* No-op on x64 (or PREFETCHT0 if we want to emit one). */
+      return;
+    
+    case INTRIN_MEMCPY:
+    case INTRIN_MEMSET: {
+      /* Inline byte-at-a-time copy/set (from legacy x_copy_bytes / x_set_bytes). */
+      if (kind == INTRIN_MEMCPY && ndst == 0 && narg == 3) {
+        /* dst_addr, src_addr, count: use REP MOVSB or byte loop. */
+        x_intrinsic_memcpy(t, args[0], args[1], args[2]);
+      } else if (kind == INTRIN_MEMSET && ndst == 0 && narg == 3) {
+        /* dst_addr, byte_val, count. */
+        x_intrinsic_memset(t, args[0], args[1], args[2]);
+      }
+      return;
+    }
+    
+    default:
+      /* Unimplemented intrinsic. */
+      break;
+  }
+}
+
+/* Helper: inline memcpy via REP MOVSB. */
+static void x_intrinsic_memcpy(NativeTarget* t, NativeLoc dst, NativeLoc src,
+                               NativeLoc count) {
+  MCEmitter* mc = t->mc;
+  /* REP MOVSB takes dst = rdi, src = rsi, count = rcx and relies on DF = 0.
+     Simplified: the moves below are not a parallel move, so they assume no
+     operand already lives in a register that an earlier move overwrites. */
+  u32 dst_r = dst.v.reg & 0xfu;
+  u32 src_r = src.v.reg & 0xfu;
+  u32 cnt_r = count.v.reg & 0xfu;
+  
+  if (dst_r != X64_RDI) emit_mov_rr(mc, 1, X64_RDI, dst_r);
+  if (src_r != X64_RSI) emit_mov_rr(mc, 1, X64_RSI, src_r);
+  if (cnt_r != X64_RCX) emit_mov_rr(mc, 1, X64_RCX, cnt_r);
+  
+  /* REP MOVSB: 0xf3 0xa4 */
+  u8 rep_movsb[] = {0xf3, 0xa4};
+  mc->emit_bytes(mc, rep_movsb, 2);
+}
+```
+
+**Emit helpers** (from git 429defa:src/arch/x64/emit.c):
+- `emit_popcnt(w, dst, src)`: F3 0F B8 /r (POPCNT CPUID feature)
+- `emit_bs(w, opcode, dst, src)`: 0x0f (0xbc=BSF or 0xbd=BSR) /r
+- `emit_bswap(w, reg)`: 0x0f 0xc8+reg (BSWAP r32; a REX.W prefix selects the 64-bit form)
+- `emit_setcc(cond, reg)`: 0x0f 0x90+cond modrm (SETCC byte)
+- `emit_movzx(w, dst, src, sign)`: 0x0f (0xb6/0xb7/0xbe/0xbf) modrm
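+
+The identities these lowerings rely on (CLZ as BSR XOR 31, BSWAP16 as a 16-bit rotate by 8, unsigned add overflow as the carry out) can be checked with small portable models. The function names below are hypothetical and the code is a host-side sketch, not emitter code:
+
+```c
+#include <stdint.h>
+
+/* BSR model: index of the highest set bit; x must be nonzero. */
+static unsigned bsr32(uint32_t x) {
+  unsigned i = 31;
+  while (!(x >> i)) --i;
+  return i;
+}
+
+/* CLZ = BSR ^ 31 for 32-bit operands (same as 31 - BSR for 0..31). */
+static unsigned clz32(uint32_t x) { return bsr32(x) ^ 31u; }
+
+/* BSWAP16 as a 16-bit rotate by 8. */
+static uint16_t bswap16(uint16_t x) { return (uint16_t)((x << 8) | (x >> 8)); }
+
+/* Unsigned add overflow is the carry out (CF): the wrapped sum is smaller
+   than an operand exactly when a carry occurred. */
+static int uadd32_overflow(uint32_t a, uint32_t b, uint32_t* sum) {
+  *sum = a + b;
+  return *sum < a;
+}
+```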
+
+---
+
+## Part 5: File-Scope Asm & Finalize
+
+### file_scope_asm Hook
+
+Line 424 in src/arch/native_target.h:
+
+```c
+void (*file_scope_asm)(NativeTarget*, const char* src, size_t len);
+```
+
+Used to emit raw assembly at the file scope (e.g., from asm_parse pseudo-statements in the source).
+
+### Body Sketch: x_file_scope_asm
+
+```c
+static void x_file_scope_asm(NativeTarget* t, const char* src, size_t len) {
+  /* Most backends emit the source as a blob and let the assembler parse it;
+     for native emission, run the asm parser here. Assumes src is
+     NUL-terminated, since len is not forwarded. */
+  (void)len;
+  X64Asm* a = x64_asm_open(t->c);
+  x64_asm_run_template(a, t->mc, src);
+  x64_asm_close(a);
+}
+```
+
+Alternatively, if full template processing is unneeded, pass through to the standalone asm.c parser.
+
+### finalize Hook
+
+Line 429 in src/arch/native_target.h:
+
+```c
+void (*finalize)(NativeTarget*);
+```
+
+Called at the end of code generation to finalize any pending state (e.g., unwind info, extra frame state patches, register allocation reporting for debuggers).
+
+### Body Sketch: x_finalize
+
+```c
+static void x_finalize(NativeTarget* t) {
+  /* Finalize any pending state (unwind info, profile tables, etc). */
+  (void)t;  /* No-op on x64 v1. */
+}
+```
+
+### trap Hook
+
+Line 427 in src/arch/native_target.h:
+
+```c
+void (*trap)(NativeTarget*);
+```
+
+Emit an unconditional trap/breakpoint (used for unreachable code, panic sites).
+
+```c
+static void x_trap(NativeTarget* t) {
+  /* UD2: undefined instruction. */
+  u8 ud2[] = {0x0f, 0x0b};
+  t->mc->emit_bytes(t->mc, ud2, 2);
+}
+```
+
+---
+
+## Part 6: File Management Plan
+
+### DELETE (Replaced by native.c)
+
+1. **src/arch/x64/ops.c** (104,924 bytes): All semantic-level operations (loads, stores, arithmetic, atomics, calls, intrinsics, variadics). Fully subsumed by native.c hooks.
+
+2. **src/arch/x64/alloc.c** (19,461 bytes): Register allocation state machine (XImpl frame slots, param binding, spill/reload). Replaced by NativeTarget frame_slot/bind_param/spill/reload hooks.
+
+3. **src/arch/x64/opt_coord.c** (14,220 bytes): Register tables and ABI coordination (x_int_allocable, x_fp_allocable, x_plan_call, X64ABIRegs dispatch). Replaced by native.c register tables and NativeOps vtable.
+
+4. **src/arch/x64/internal.h** (12,299 bytes): XImpl state struct and shared helpers. Absorbed into native.c and a new x64.h header (if needed for ISA helpers).
+
+### STRIP (Extract byte encoders; keep ISA/relocs)
+
+1. **src/arch/x64/emit.c** (41,172 bytes):
+   - **KEEP**: emit_rex(), emit_mem_operand(), emit_rm_reg(), emit_mov_rr(), emit_mov_load(), emit_mov_store(), emit_lea(), specific instruction emitters (emit_alu_rr, emit_imul_rr, emit_f7_rm, emit_shift_*, emit_cqo_or_cdq, emit_alu_imm*, emit_setcc, emit_movzx, emit_extend_rr, emit_ret, emit_sse_*), plus the shared ABI constant tables (g_int_order, g_fp_order, g_x64_abi_sysv, g_x64_abi_win64, x64_abi_for_os).
+   - **DELETE**: x_func_begin, x_func_end, x_func_begin_known_frame, x_func_begin_init, x_build_prologue, x_compute_frame_size, x_collect_cs_regs, x_emit_variadic_reg_saves, x_add_entry_frame_slots, x_chkstk_sym, x_planned_prologue_bytes, x_prologue_placeholder. (These move to native.c.)
+   - **OUTPUT**: Create src/arch/x64/emit.h (header exporting only the byte-level encoders and ABI tables).
+
+### KEEP-AS-IS (No changes required)
+
+1. **src/arch/x64/isa.h** (26,086 bytes): X64_* opcode constants, REX/ModRM helpers, register enums. No changes.
+
+2. **src/arch/x64/isa.c** (40,662 bytes): Disassembler. No changes.
+
+3. **src/arch/x64/regs.c** (3,251 bytes): DWARF register names. No changes.
+
+4. **src/arch/x64/link.c** (3,689 bytes): Object file relocation helpers. No changes.
+
+5. **src/arch/x64/dbg.c** (12,299 bytes): Debug info emission. No changes (or minimal updates for the new frame slot API).
+
+6. **src/arch/x64/disasm.c** (4,282 bytes): Disassembler helpers. No changes.
+
+### ADAPT (Update includes; minor rewiring)
+
+1. **src/arch/x64/asm.c** (53,046 bytes):
+   - **CHANGE**: Replace `#include "arch/x64/internal.h"` with `#include "arch/x64/emit.h"` (for emit helpers).
+   - **CHANGE**: Adapt x64_inline_bind() if it references deleted functions from ops.c/alloc.c (e.g., constraint resolution should use NativeOps adapter or standalone resolution).
+   - **KEEP**: Inline assembly template parsing and operand binding (x64_asm_open, x64_asm_run_template, x64_asm_close, inline_bind helpers).
+
+2. **src/arch/x64/arch.c** (2,936 bytes):
+   - **CHANGE**: Replace `x64_cgtarget_new()` call with `x64_native_target_new()` and `native_direct_target_new()` wiring.
+   - **DELETE**: Any old CGTarget construction code.
+   - **ADD**: NativeTarget → NativeDirectTarget → CgTarget adapter chain.
+
+3. **src/arch/x64/x64.h** (145 bytes):
+   - **KEEP**: public API declarations.
+   - **ADD**: `X64NativeTarget* x64_native_target_new(Compiler*, ObjBuilder*, MCEmitter*);`
+
+### NEW FILE
+
+**src/arch/x64/emit.h** (exported byte-level encoder declarations):
+
+Extract the following from emit.c into a header:
+
+```c
+#ifndef CFREE_ARCH_X64_EMIT_H
+#define CFREE_ARCH_X64_EMIT_H
+
+#include "arch/mc.h"
+#include "arch/x64/isa.h"
+
+/* Byte-level emit helpers. */
+void emit_rex(MCEmitter*, int w, u32 reg, u32 index, u32 rm);
+void emit_rex_force(MCEmitter*, int w, u32 reg, u32 index, u32 rm);
+void emit_mem_operand(MCEmitter*, u32 reg, u32 base, i32 disp);
+void emit_rm_reg(MCEmitter*, u32 reg, u32 rm);
+void emit_mov_rr(MCEmitter*, int w, u32 dst, u32 src);
+void emit_mov_load(MCEmitter*, u32 size, int signed_ext, u32 dst, u32 base, i32 disp);
+void emit_mov_store(MCEmitter*, u32 size, int is_store, u32 reg, u32 base, i32 disp);
+void emit_lea(MCEmitter*, u32 dst, u32 base, i32 disp);
+/* ... etc for all instruction emitters ... */
+
+/* Shared ABI tables. */
+extern const Reg g_int_order[];
+extern const Reg g_fp_order[];
+typedef struct { /* X64ABIRegs ... */ } X64ABIRegs;
+const X64ABIRegs* x64_abi_for_os(CfreeOSKind os);
+
+#endif
+```
+
+---
+
+## Integration Checklist
+
+1. **native.c creation**:
+   - Copy the structure from rv64/native.c and aa64/native.c.
+   - Implement all NativeTarget hooks (func_begin, bind_param, load_imm, move, load, store, binop, ..., atomic_load, atomic_rmw, atomic_cas, va_start_, va_arg_, intrinsic, asm_block, file_scope_asm, finalize).
+   - Use src/abi/abi_{sysv,win64}_x64.c for ABI queries (abi_cg_func_info, abi_va_list_layout).
+   - Register tables: x_int_allocable, x_fp_allocable (from opt_coord.c); x_int_phys, x_fp_phys; NativeAllocClassInfo x_classes.
+   - Frame layout: X64NativeSlot, X64NativeTarget state struct.
+   - Prologue/epilogue: x_func_begin, x_func_end (adapting from emit.c).
+
+2. **emit.h extraction**:
+   - Create src/arch/x64/emit.h exporting byte-level encoders and ABI constants.
+   - Update emit.c to include it (if not self-contained).
+
+3. **asm.c rewiring**:
+   - Replace `#include "arch/x64/internal.h"` with `#include "arch/x64/emit.h"`.
+   - Keep inline template parsing intact.
+
+4. **arch.c rewiring**:
+   - Replace the x64_cgtarget_new() vtable construction with x64_native_target_new() plus the NativeOps adapter, joined by native_direct_target_new().
+   - Wire native.c hooks into the CgTarget path.
+
+5. **Delete obsolete files**:
+   - ops.c, alloc.c, opt_coord.c, internal.h.
+
+6. **Update build system**:
+   - Remove ops.o, alloc.o, opt_coord.o from x64 link.
+   - Add native.o.
+
+---
+
+## Summary of NativeTarget Hooks (x64)
+
+| Hook | Implemented in | Notes |
+|------|-------------|-------|
+| func_begin | native.c | Prologue placeholder, ABI setup |
+| func_begin_known_frame | native.c | Optimizer path: exact prologue |
+| note_frame_state | native.c | Prologue patch state |
+| reserve_callee_saves | native.c | Mark save slots for allocator-assigned regs |
+| emit_prologue | native.c | Emit minimal prologue (opt path) |
+| frame_slot | native.c | Allocate frame slot |
+| bind_param | native.c | Move incoming param to dst (reg or frame) |
+| label_new, label_place, jump, cmp_branch, indirect_branch, load_label_addr | native.c | Control flow |
+| move, load_imm, load_const, load_addr, load, store, tls_addr_of | native.c | Data movement |
+| copy_bytes, set_bytes, bitfield_load, bitfield_store | native.c | Aggregate ops |
+| binop, unop, cmp, convert, alloca_ | native.c | Arithmetic/conversion |
+| spill, reload | native.c | Register save/restore |
+| plan_call, emit_call, plan_ret, ret | native.c | Call ABI |
+| **atomic_load, atomic_store, atomic_rmw, atomic_cas, fence** | **native.c GROUP 5** | **x86 TSO semantics** |
+| **va_start_, va_arg_, va_end_, va_copy_** | **native.c GROUP 5** | **SysV & Win64 layouts** |
+| **intrinsic** | **native.c GROUP 5** | **popcount, bswap, overflow, trap, etc.** |
+| **asm_block** | **native.c GROUP 5** | **Inline asm with constraints** |
+| **file_scope_asm** | **native.c GROUP 5** | **File-scope assembly** |
+| trap, set_loc, finalize, destroy | native.c | Utilities |
+
+---
+
+## Legend: Code Locations (git 429defa)
+
+- **ops.c atomics** (lines ~1600–1800): x_atomic_load/store/rmw/cas, emit_lock_*, emit_mfence
+- **ops.c variadics** (lines ~1200–1400): x_va_start_, x_va_arg_, x_va_end_, x_va_copy_
+- **ops.c intrinsics** (lines ~1900–2100): x_intrinsic, emit_popcnt, emit_bs, emit_bswap, emit_*_overflow
+- **ops.c asm** (lines ~2100–2250): x_asm_block, x_set_loc, x_finalize
+- **emit.c prologues** (lines ~1000–1400): x_func_begin, x_func_end, x_build_prologue, x_emit_variadic_reg_saves
+- **emit.c encoders** (lines ~200–900): emit_rex, emit_mem_operand, emit_mov_*, emit_alu_*, emit_sse_*, etc.
+- **isa.h opcodes** (lines 1–300): X64_* constants, modrm/rex helpers
+- **internal.h structs** (lines 80–150): XImpl, XSlot, X64ABIRegs
+- **opt_coord.c tables** (lines 1–200): register tables, x_plan_call, X64ABIRegs dispatch
+- **alloc.c frame** (lines 1–300): x_frame_slot, x_param, frame layout computation
+
diff --git a/include/cfree/config.h b/include/cfree/config.h
@@ -25,7 +25,7 @@
 
 /* Backend architectures. */
 #define CFREE_ARCH_AA64_ENABLED 1
-#define CFREE_ARCH_X64_ENABLED 0
+#define CFREE_ARCH_X64_ENABLED 1
 #define CFREE_ARCH_RV64_ENABLED 1
 #define CFREE_ARCH_WASM_ENABLED 1
 #define CFREE_ARCH_C_TARGET_ENABLED 1
diff --git a/src/arch/x64/alloc.c b/src/arch/x64/alloc.c
@@ -1,598 +0,0 @@
-/* arch/x64/alloc.c — frame slots, spill/reload, labels, control flow.
- *
- * Covers: x_frame_slot, x64_slot_get, x_param, x_spill_reg, x_reload_reg,
- * x_label_*,
- * emit_jmp_label, emit_jcc_label, x_jump, x64_force_reg_int, emit_cmp_ab,
- * x_cmp_branch, x_cmp, x_scope_*, x_break_to, x_continue_to. */
-
-#include <string.h>
-
-#include "arch/mc.h"
-#include "arch/x64/internal.h"
-#include "arch/x64/isa.h"
-#include "arch/x64/regs.h"
-#include "arch/x64/x64.h"
-#include "core/arena.h"
-#include "core/pool.h"
-#include "core/slice.h"
-#include "obj/obj.h"
-
-/* ============================================================
- * Registers / frame */
-
-int x_resolve_reg_name(CGTarget* t, Sym name, Reg* out, RegClass* cls_out) {
-  Slice ns = pool_slice(t->c->global, name);
-  char buf[16];
-  u32 idx;
-  if (!ns.s || !ns.len || ns.len >= sizeof buf) return 1;
-  memcpy(buf, ns.s, ns.len);
-  buf[ns.len] = '\0';
-  Slice q = slice_from_cstr(buf);
-  if (slice_eq_cstr(q, "ah")) {
-    if (out) *out = X64_RAX;
-    if (cls_out) *cls_out = RC_INT;
-    return 0;
-  }
-  if (slice_eq_cstr(q, "ch")) {
-    if (out) *out = X64_RCX;
-    if (cls_out) *cls_out = RC_INT;
-    return 0;
-  }
-  if (slice_eq_cstr(q, "dh")) {
-    if (out) *out = X64_RDX;
-    if (cls_out) *cls_out = RC_INT;
-    return 0;
-  }
-  if (slice_eq_cstr(q, "bh")) {
-    if (out) *out = X64_RBX;
-    if (cls_out) *cls_out = RC_INT;
-    return 0;
-  }
-  if (x64_register_hw_index(buf, &idx) == 0) {
-    if (out) *out = (Reg)idx;
-    if (cls_out) *cls_out = RC_INT;
-    return 0;
-  }
-  if (x64_register_index(buf, &idx) != 0) return 1;
-  if (idx >= 17u && idx <= 32u) {
-    if (out) *out = (Reg)(idx - 17u);
-    if (cls_out) *cls_out = RC_FP;
-    return 0;
-  }
-  return 1;
-}
-
-FrameSlot x_frame_slot(CGTarget* t, const FrameSlotDesc* d) {
-  XImpl* a = impl_of(t);
-  if (a->nslots == a->slots_cap) {
-    u32 ncap = a->slots_cap ? a->slots_cap * 2 : 8;
-    XSlot* nbuf = arena_array(t->c->tu, XSlot, ncap);
-    if (a->slots) memcpy(nbuf, a->slots, sizeof(XSlot) * a->nslots);
-    a->slots = nbuf;
-    a->slots_cap = ncap;
-  }
-  u32 size = d->size ? d->size : 8;
-  u32 align = d->align ? d->align : 1;
-  u32 next = a->cum_off + size;
-  u32 mask = align - 1u;
-  next = (next + mask) & ~mask;
-  XSlot* s = &a->slots[a->nslots];
-  s->off = next;
-  s->size = size;
-  s->align = align;
-  s->kind = d->kind;
-  a->cum_off = next;
-  a->nslots++;
-  return (FrameSlot)(a->nslots);
-}
-
-XSlot* x64_slot_get(XImpl* a, FrameSlot fs) {
-  if (fs == FRAME_SLOT_NONE || fs > a->nslots) return NULL;
-  return &a->slots[fs - 1];
-}
-
-/* ---- param: bind incoming arg(s) to the requested storage ---- */
-
-/* Win64 shares one arg-slot counter across int and FP regs; the kth
- * argument consumes either GPR-k or XMM-k but never both. Keep
- * next_param_int and next_param_fp in lockstep so a later FP/int arg
- * sees the same slot index. */
-static inline void x_param_sync_slot(XImpl* a) {
-  if (!a->abi->slot_shared_int_fp) return;
-  u32 m = a->next_param_int > a->next_param_fp ? a->next_param_int
-                                               : a->next_param_fp;
-  a->next_param_int = m;
-  a->next_param_fp = m;
-}
-
-static void x_consume_param_location(XImpl* a, const ABIArgInfo* ai) {
-  if (!ai || ai->kind == ABI_ARG_IGNORE) return;
-  if (ai->kind == ABI_ARG_INDIRECT) {
-    if (a->next_param_int < a->abi->n_int_args)
-      ++a->next_param_int;
-    else
-      a->next_param_stack += 8;
-    x_param_sync_slot(a);
-    return;
-  }
-  if (ai->kind == ABI_ARG_DIRECT &&
-      x64_abi_direct_to_stack(ai, a->next_param_int, a->next_param_fp)) {
-    a->next_param_stack += (u32)ai->nparts * 8u;
-    return;
-  }
-  for (u16 i = 0; i < ai->nparts; ++i) {
-    const ABIArgPart* pt = &ai->parts[i];
-    if (pt->cls == ABI_CLASS_INT) {
-      if (a->next_param_int < a->abi->n_int_args)
-        ++a->next_param_int;
-      else
-        a->next_param_stack += 8;
-    } else if (pt->cls == ABI_CLASS_FP) {
-      if (a->next_param_fp < a->abi->n_fp_args)
-        ++a->next_param_fp;
-      else
-        a->next_param_stack += 8;
-    }
-    x_param_sync_slot(a);
-  }
-}
-
-CGLocalStorage x_param(CGTarget* t, const CGParamDesc* p) {
-  XImpl* a = impl_of(t);
-  CGLocalStorage st = p->storage;
-  if (st.kind == CG_LOCAL_STORAGE_FRAME && st.v.frame_slot == FRAME_SLOT_NONE) {
-    FrameSlotDesc fsd = {0};
-    fsd.type = p->type;
-    fsd.name = p->name;
-    fsd.loc = p->loc;
-    fsd.size = p->size;
-    fsd.align = p->align;
-    fsd.kind = FS_PARAM;
-    if (p->flags & CG_LOCAL_ADDR_TAKEN) fsd.flags |= FSF_ADDR_TAKEN;
-    st.v.frame_slot = x_frame_slot(t, &fsd);
-  }
-  XSlot* s = st.kind == CG_LOCAL_STORAGE_FRAME
-                 ? x64_slot_get(a, st.v.frame_slot)
-                 : NULL;
-  if (st.kind == CG_LOCAL_STORAGE_FRAME && !s)
-    compiler_panic(t->c, a->loc, "x64 param: bad slot");
-  const ABIArgInfo* ai = p->abi;
-  u32 incoming_stack_base = a->omit_frame ? X64_RSP : X64_RBP;
-  /* incoming_stack_bias is the offset from the base register to the
-   * first stack-passed argument. After `push rbp` we are at +0; +8
-   * skips the saved RBP and +16 skips the saved return address.
-   * Win64 reserves 32 B of caller-provided "home space" for the 4
-   * register arg slots immediately above the return address, so stack
-   * args start at [rbp + 16 + 32] = +48. SysV has no shadow space. */
-  i32 incoming_stack_bias =
-      a->omit_frame ? 8 : (i32)(16u + a->abi->shadow_space);
-
-  if (ai->kind == ABI_ARG_IGNORE) return st;
-  if (st.kind == CG_LOCAL_STORAGE_REG && st.v.reg == (Reg)REG_NONE) {
-    x_consume_param_location(a, ai);
-    return st;
-  }
-  if (st.kind == CG_LOCAL_STORAGE_REG) {
-    if (ai->kind != ABI_ARG_DIRECT || ai->nparts != 1) {
-      compiler_panic(t->c, a->loc,
-                     "x64 param: register storage requires one direct part");
-    }
-    const ABIArgPart* pt = &ai->parts[0];
-    u32 sz = pt->size;
-    if (pt->cls == ABI_CLASS_INT) {
-      if (a->next_param_int < a->abi->n_int_args) {
-        u32 src = a->abi->int_args[a->next_param_int++];
-        u32 dst = st.v.reg & 0xFu;
-        int w = (sz == 8) ? 1 : 0;
-        if (dst != src) emit_mov_rr(t->mc, w, dst, src);
-      } else {
-        u32 caller_off = a->next_param_stack;
-        a->next_param_stack += 8;
-        emit_mov_load(t->mc, sz, 0, st.v.reg & 0xFu, incoming_stack_base,
-                      incoming_stack_bias + (i32)caller_off);
-      }
-    } else if (pt->cls == ABI_CLASS_FP) {
-      u8 prefix = (sz == 8) ? 0xF2 : 0xF3;
-      u32 dst = st.v.reg & 0xFu;
-      if (a->next_param_fp < a->abi->n_fp_args) {
-        u32 src = a->next_param_fp++;
-        if (dst != src) emit_sse_rr(t->mc, prefix, 0x10, dst, src);
-      } else {
-        u32 caller_off = a->next_param_stack;
-        a->next_param_stack += 8;
-        emit_sse_load(t->mc, prefix, 0x10, dst, incoming_stack_base,
-                      incoming_stack_bias + (i32)caller_off);
-      }
-    } else {
-      compiler_panic(t->c, a->loc, "x64 param: ABI class %d unimpl",
-                     (int)pt->cls);
-    }
-    x_param_sync_slot(a);
-    return st;
-  }
-  if (ai->kind == ABI_ARG_INDIRECT) {
-    /* Incoming pointer to byval copy: load pointer, memcpy into slot. */
-    u32 ptr_reg;
-    if (a->next_param_int < a->abi->n_int_args) {
-      ptr_reg = a->abi->int_args[a->next_param_int++];
-    } else {
-      u32 caller_off = a->next_param_stack;
-      a->next_param_stack += 8;
-      emit_mov_load(t->mc, 8, 0, X64_R11, incoming_stack_base,
-                    incoming_stack_bias + (i32)caller_off);
-      ptr_reg = X64_R11;
-    }
-    x_param_sync_slot(a);
-    u32 nbytes = s->size;
-    u32 i = 0;
-    while (i + 8 <= nbytes) {
-      emit_mov_load(t->mc, 8, 0, X64_RAX, ptr_reg, (i32)i);
-      emit_mov_store(t->mc, 8, X64_RAX, X64_RBP, -(i32)s->off + (i32)i);
-      i += 8;
-    }
-    while (i + 4 <= nbytes) {
-      emit_mov_load(t->mc, 4, 0, X64_RAX, ptr_reg, (i32)i);
-      emit_mov_store(t->mc, 4, X64_RAX, X64_RBP, -(i32)s->off + (i32)i);
-      i += 4;
-    }
-    while (i + 2 <= nbytes) {
-      emit_mov_load(t->mc, 2, 0, X64_RAX, ptr_reg, (i32)i);
-      emit_mov_store(t->mc, 2, X64_RAX, X64_RBP, -(i32)s->off + (i32)i);
-      i += 2;
-    }
-    while (i < nbytes) {
-      emit_mov_load(t->mc, 1, 0, X64_RAX, ptr_reg, (i32)i);
-      emit_mov_store(t->mc, 1, X64_RAX, X64_RBP, -(i32)s->off + (i32)i);
-      i += 1;
-    }
-    return st;
-  }
-  /* DIRECT */
-  if (x64_abi_direct_to_stack(ai, a->next_param_int, a->next_param_fp)) {
-    for (u16 i = 0; i < ai->nparts; ++i) {
-      const ABIArgPart* pt = &ai->parts[i];
-      u32 caller_off = a->next_param_stack;
-      u32 sz = pt->size;
-      a->next_param_stack += 8;
-      if (pt->cls == ABI_CLASS_FP) {
-        u8 prefix = (sz == 8) ? 0xF2 : 0xF3;
-        emit_sse_load(t->mc, prefix, 0x10, X64_XMM0, incoming_stack_base,
-                      incoming_stack_bias + (i32)caller_off);
-        emit_sse_store(t->mc, prefix, 0x11, X64_XMM0, X64_RBP,
-                       -(i32)s->off + (i32)pt->src_offset);
-      } else {
-        emit_mov_load(t->mc, sz, 0, X64_RAX, incoming_stack_base,
-                      incoming_stack_bias + (i32)caller_off);
-        emit_mov_store(t->mc, sz, X64_RAX, X64_RBP,
-                       -(i32)s->off + (i32)pt->src_offset);
-      }
-    }
-    return st;
-  }
-  for (u16 i = 0; i < ai->nparts; ++i) {
-    const ABIArgPart* pt = &ai->parts[i];
-    u32 part_off = pt->src_offset;
-    u32 sz = pt->size;
-    if (pt->cls == ABI_CLASS_INT) {
-      if (a->next_param_int < a->abi->n_int_args) {
-        u32 reg = a->abi->int_args[a->next_param_int++];
-        emit_mov_store(t->mc, sz, reg, X64_RBP, -(i32)s->off + (i32)part_off);
-      } else {
-        u32 caller_off = a->next_param_stack;
-        a->next_param_stack += 8;
-        emit_mov_load(t->mc, sz, 0, X64_RAX, incoming_stack_base,
-                      incoming_stack_bias + (i32)caller_off);
-        emit_mov_store(t->mc, sz, X64_RAX, X64_RBP,
-                       -(i32)s->off + (i32)part_off);
-      }
-    } else if (pt->cls == ABI_CLASS_FP) {
-      if (a->next_param_fp < a->abi->n_fp_args) {
-        u32 xmm = a->next_param_fp++;
-        u8 prefix = (sz == 8) ? 0xF2 : 0xF3;
-        emit_sse_store(t->mc, prefix, 0x11, xmm, X64_RBP,
-                       -(i32)s->off + (i32)part_off);
-      } else {
-        u32 caller_off = a->next_param_stack;
-        a->next_param_stack += 8;
-        u8 prefix = (sz == 8) ? 0xF2 : 0xF3;
-        emit_sse_load(t->mc, prefix, 0x10, X64_XMM0, incoming_stack_base,
-                      incoming_stack_bias + (i32)caller_off);
-        emit_sse_store(t->mc, prefix, 0x11, X64_XMM0, X64_RBP,
-                       -(i32)s->off + (i32)part_off);
-      }
-    } else {
-      compiler_panic(t->c, a->loc, "x64 param: ABI class %d unimpl",
-                     (int)pt->cls);
-    }
-    x_param_sync_slot(a);
-  }
-  return st;
-}
-
-void x_spill_reg(CGTarget* t, Operand src, FrameSlot slot, MemAccess ma) {
-  XImpl* a = impl_of(t);
-  if (src.kind != OPK_REG)
-    compiler_panic(t->c, a->loc, "x64 spill_reg: src is not OPK_REG");
-  Operand addr;
-  memset(&addr, 0, sizeof addr);
-  addr.kind = OPK_LOCAL;
-  addr.cls = RC_INT;
-  addr.type = ma.type;
-  addr.v.frame_slot = slot;
-  x_store(t, addr, src, ma);
-}
-
-void x_reload_reg(CGTarget* t, Operand dst, FrameSlot slot, MemAccess ma) {
-  XImpl* a = impl_of(t);
-  if (dst.kind != OPK_REG)
-    compiler_panic(t->c, a->loc, "x64 reload_reg: dst is not OPK_REG");
-  Operand addr;
-  memset(&addr, 0, sizeof addr);
-  addr.kind = OPK_LOCAL;
-  addr.cls = RC_INT;
-  addr.type = ma.type;
-  addr.v.frame_slot = slot;
-  x_load(t, dst, addr, ma);
-}
-
-/* ============================================================
- * Labels / control flow */
-
-Label x_label_new(CGTarget* t) { return (Label)t->mc->label_new(t->mc); }
-void x_label_place(CGTarget* t, Label l) {
-  t->mc->label_place(t->mc, (MCLabel)l);
-}
-
-/* Emit `jmp rel32` (E9 + 4-byte disp) with a label fixup. R_PC32 applied
- * at the disp32 site with addend=-4 yields target - end_of_insn. */
-void emit_jmp_label(MCEmitter* mc, MCLabel l) {
-  u8 op = 0xE9;
-  mc->emit_bytes(mc, &op, 1);
-  emit_u32le(mc, 0);
-  mc->emit_label_ref(mc, l, R_PC32, 4, -4);
-}
-
-/* Emit `Jcc rel32` (0F 8x + 4-byte disp) with a label fixup. */
-void emit_jcc_label(MCEmitter* mc, u32 cc, MCLabel l) {
-  u8 op[2] = {0x0F, (u8)(0x80 | (cc & 0xF))};
-  mc->emit_bytes(mc, op, 2);
-  emit_u32le(mc, 0);
-  mc->emit_label_ref(mc, l, R_PC32, 4, -4);
-}
-
-void x_jump(CGTarget* t, Label l) { emit_jmp_label(t->mc, (MCLabel)l); }
-
-void x_load_label_addr(CGTarget* t, Operand dst, Label l) {
-  /* lea %dst, [rip + disp32]
-   *   REX.W + 0x8D /5 (mod=00 r/m=101 = RIP-relative)
-   * The disp32 is fixed up at label_place via R_PC32 with addend -4
-   * (because the PC is end-of-instruction). */
-  MCEmitter* mc = t->mc;
-  u32 dr = dst.v.reg & 0xFu;
-  emit_rex(mc, 1, dr, 0, 0);
-  u8 op = 0x8D;
-  mc->emit_bytes(mc, &op, 1);
-  u8 mr = modrm(0u, (dr & 7u), 5u);
-  mc->emit_bytes(mc, &mr, 1);
-  emit_u32le(mc, 0);
-  mc->emit_label_ref(mc, (MCLabel)l, R_PC32, 4, -4);
-}
-
-void x_indirect_branch(CGTarget* t, Operand addr, const Label* targets,
-                       u32 ntargets) {
-  /* jmpq *%reg
-   *   FF /4 with mod=11 r/m=reg */
-  MCEmitter* mc = t->mc;
-  u32 reg;
-  (void)targets;
-  (void)ntargets;
-  if (addr.kind != OPK_REG) {
-    compiler_panic(t->c, mc->loc, "x64: indirect_branch expects REG operand");
-  }
-  reg = addr.v.reg & 0xFu;
-  /* REX.B if reg >= 8 (no REX.W needed for jmpq *) */
-  if (reg & 8u) {
-    u8 rex = 0x41;
-    mc->emit_bytes(mc, &rex, 1);
-  }
-  u8 op = 0xFF;
-  mc->emit_bytes(mc, &op, 1);
-  u8 mr = modrm(3u, 4u /* sub-opcode */, (reg & 7u));
-  mc->emit_bytes(mc, &mr, 1);
-}
-
-static u32 cmp_to_cc(CmpOp op) {
-  switch (op) {
-    case CMP_EQ:
-      return X64_CC_E;
-    case CMP_NE:
-      return X64_CC_NE;
-    case CMP_LT_U:
-      return X64_CC_B;
-    case CMP_LE_U:
-      return X64_CC_BE;
-    case CMP_GT_U:
-      return X64_CC_A;
-    case CMP_GE_U:
-      return X64_CC_AE;
-    case CMP_LT_S:
-      return X64_CC_L;
-    case CMP_LE_S:
-      return X64_CC_LE;
-    case CMP_GT_S:
-      return X64_CC_G;
-    case CMP_GE_S:
-      return X64_CC_GE;
-    default:
-      return X64_CC_E;
-  }
-}
-
-static void emit_fp_setcc_ordered(CGTarget* t, CmpOp op, u32 dst) {
-  u32 primary;
-  switch (op) {
-    case CMP_EQ:
-      primary = X64_CC_E;
-      break;
-    case CMP_LT_F:
-      primary = X64_CC_B;
-      break;
-    case CMP_LE_F:
-      primary = X64_CC_BE;
-      break;
-    default:
-      primary = cmp_to_cc(op);
-      break;
-  }
-  emit_setcc(t->mc, primary, dst);
-  emit_movzx_r32_r8(t->mc, dst, dst);
-  emit_setcc(t->mc, 0xBu /* NP */, X64_R11);
-  emit_movzx_r32_r8(t->mc, X64_R11, X64_R11);
-  emit_alu_rr(t->mc, 0, 0x21, dst, X64_R11); /* AND */
-}
-
-static void emit_fp_setcc_unordered_ne(CGTarget* t, u32 dst) {
-  emit_setcc(t->mc, 0xAu /* P */, dst);
-  emit_movzx_r32_r8(t->mc, dst, dst);
-  emit_setcc(t->mc, X64_CC_NE, X64_R11);
-  emit_movzx_r32_r8(t->mc, X64_R11, X64_R11);
-  emit_alu_rr(t->mc, 0, 0x09, dst, X64_R11); /* OR */
-}
-
-u32 x64_force_reg_int(CGTarget* t, Operand op, int w, u32 scratch) {
-  if (op.kind == OPK_REG) return op.v.reg & 0xFu;
-  if (op.kind == OPK_IMM) {
-    x64_emit_load_imm(t->mc, w, scratch, op.v.imm);
-    return scratch;
-  }
-  compiler_panic(t->c, impl_of(t)->loc, "x64: operand kind %d not REG/IMM",
-                 (int)op.kind);
-}
-
-static void emit_cmp_ab(CGTarget* t, Operand a_op, Operand b_op) {
-  int w = type_is_64(a_op.type) ? 1 : 0;
-  /* IMM RHS imm8 / imm32 fast paths. CMP is not commutative across the
-   * cond codes, so IMM-on-LHS still has to materialize. */
-  if (b_op.kind == OPK_IMM && a_op.kind == OPK_REG) {
-    if (imm_fits_i8(b_op.v.imm)) {
-      emit_cmp_imm8(t->mc, w, a_op.v.reg & 0xFu, (i8)b_op.v.imm);
-      return;
-    }
-    if (imm_fits_i32(b_op.v.imm)) {
-      emit_alu_imm32(t->mc, w, /*sub=CMP*/ 7u, a_op.v.reg & 0xFu,
-                     (i32)b_op.v.imm);
-      return;
-    }
-  }
-  u32 ra = x64_force_reg_int(t, a_op, w, X64_RAX);
-  u32 rb = x64_force_reg_int(t, b_op, w, (ra == X64_R11) ? X64_RAX : X64_R11);
-  /* cmp r/m, r — opcode 0x39 (encoded as `cmp ra, rb` ⇒ flags = ra - rb). */
-  emit_alu_rr(t->mc, w, 0x39, ra, rb);
-}
-
-void x_cmp_branch(CGTarget* t, CmpOp op, Operand a, Operand b, Label l) {
-  emit_cmp_ab(t, a, b);
-  emit_jcc_label(t->mc, cmp_to_cc(op), (MCLabel)l);
-}
-
-void x_cmp(CGTarget* t, CmpOp op, Operand dst, Operand a, Operand b) {
-  if (a.cls == RC_FP || b.cls == RC_FP) {
-    u8 prefix = type_is_fp_double(a.type) ? 0x66 : 0x00;
-    u32 d = dst.v.reg & 0xFu;
-    emit_sse_rr(t->mc, prefix, 0x2E, a.v.reg & 0xFu, b.v.reg & 0xFu);
-    switch (op) {
-      case CMP_NE:
-        emit_fp_setcc_unordered_ne(t, d);
-        return;
-      case CMP_EQ:
-      case CMP_LT_F:
-      case CMP_LE_F:
-        emit_fp_setcc_ordered(t, op, d);
-        return;
-      case CMP_GT_F:
-        emit_setcc(t->mc, X64_CC_A, d);
-        break;
-      case CMP_GE_F:
-        emit_setcc(t->mc, X64_CC_AE, d);
-        break;
-      default:
-        emit_setcc(t->mc, cmp_to_cc(op), d);
-        break;
-    }
-    emit_movzx_r32_r8(t->mc, d, d);
-    return;
-  }
-  emit_cmp_ab(t, a, b);
-  u32 d = dst.v.reg & 0xFu;
-  emit_setcc(t->mc, cmp_to_cc(op), d);
-  emit_movzx_r32_r8(t->mc, d, d);
-}
-
-/* ---- structured scopes ---- */
-CGScope x_scope_begin(CGTarget* t, const CGScopeDesc* d) {
-  XImpl* a = impl_of(t);
-  if (a->nscopes == a->scopes_cap) {
-    u32 ncap = a->scopes_cap ? a->scopes_cap * 2u : 4u;
-    XScope* nb = arena_array(t->c->tu, XScope, ncap);
-    if (a->scopes) memcpy(nb, a->scopes, sizeof(XScope) * a->nscopes);
-    a->scopes = nb;
-    a->scopes_cap = ncap;
-  }
-  XScope* sc = &a->scopes[a->nscopes];
-  sc->kind = (u8)d->kind;
-  sc->has_else = 0;
-  sc->else_label = 0;
-  sc->end_label = 0;
-  sc->break_label = d->break_label;
-  sc->continue_label = d->continue_label;
-
-  if (d->kind == SCOPE_IF) {
-    sc->else_label = t->mc->label_new(t->mc);
-    sc->end_label = t->mc->label_new(t->mc);
-    int w = type_is_64(d->cond.type) ? 1 : 0;
-    u32 rc = x64_force_reg_int(t, d->cond, w, X64_RAX);
-    emit_test_self(t->mc, w, rc);
-    emit_jcc_label(t->mc, X64_CC_E, sc->else_label);
-  } else if (d->kind == SCOPE_LOOP || d->kind == SCOPE_BLOCK) {
-    /* Bookkeeping only. */
-  } else {
-    compiler_panic(t->c, a->loc, "x64 scope_begin: kind %d not yet implemented",
-                   (int)d->kind);
-  }
-  a->nscopes++;
-  return (CGScope)a->nscopes;
-}
-
-void x_scope_else(CGTarget* t, CGScope s) {
-  XImpl* a = impl_of(t);
-  if (s == CG_SCOPE_NONE || s > a->nscopes)
-    compiler_panic(t->c, a->loc, "x64 scope_else: bad scope");
-  XScope* sc = &a->scopes[s - 1];
-  emit_jmp_label(t->mc, sc->end_label);
-  t->mc->label_place(t->mc, sc->else_label);
-  sc->has_else = 1;
-}
-
-void x_scope_end(CGTarget* t, CGScope s) {
-  XImpl* a = impl_of(t);
-  if (s == CG_SCOPE_NONE || s > a->nscopes)
-    compiler_panic(t->c, a->loc, "x64 scope_end: bad scope");
-  XScope* sc = &a->scopes[s - 1];
-  if (sc->kind == SCOPE_IF) {
-    if (!sc->has_else) t->mc->label_place(t->mc, sc->else_label);
-    t->mc->label_place(t->mc, sc->end_label);
-  }
-}
-
-void x_break_to(CGTarget* t, CGScope s) {
-  XImpl* a = impl_of(t);
-  if (s == CG_SCOPE_NONE || s > a->nscopes)
-    compiler_panic(t->c, a->loc, "x64 break_to: bad scope");
-  x_jump(t, a->scopes[s - 1].break_label);
-}
-void x_continue_to(CGTarget* t, CGScope s) {
-  XImpl* a = impl_of(t);
-  if (s == CG_SCOPE_NONE || s > a->nscopes)
-    compiler_panic(t->c, a->loc, "x64 continue_to: bad scope");
-  x_jump(t, a->scopes[s - 1].continue_label);
-}
diff --git a/src/arch/x64/arch.c b/src/arch/x64/arch.c
@@ -1,9 +1,12 @@
 #include "arch/arch.h"
 
+#include <string.h>
+
 #include "arch/x64/asm.h"
 #include "arch/x64/disasm.h"
 #include "arch/x64/regs.h"
 #include "arch/x64/x64.h"
+#include "cg/native_direct_target.h"
 #include "core/bytes.h"
 #include "link/link_arch.h"
 #include "obj/obj.h"
@@ -45,23 +48,42 @@ static int x64_register_at_public(uint32_t idx, CfreeArchReg* out) {
   return rc;
 }
 
-static CGTarget* x64_backend_make(Compiler* c, ObjBuilder* o,
+static CgTarget* x64_backend_make(Compiler* c, ObjBuilder* o,
                                   const CfreeCodeOptions* opts) {
   MCEmitter* mc = NULL;
   Debug* debug = NULL;
-  CGTarget* t;
+  CgTarget* t;
+  NativeTarget* native;
+  NativeDirectTargetConfig cfg;
   if (cg_mc_debug_new(c, o, opts, &mc, &debug) != CFREE_OK) return NULL;
-  t = x64_cgtarget_new(c, o, mc);
-  if (!t) return NULL;
-  t->debug = debug;
+  native = x64_native_target_new(c, o, mc);
+  if (!native) return NULL;
+  memset(&cfg, 0, sizeof cfg);
+  cfg.native = native;
+  cfg.ops = x64_native_direct_ops();
+  t = native_direct_target_new(c, o, &cfg);
+  if (t) t->debug = debug;
   return t;
 }
 
+static CgTarget* x64_semantic_target_new(Compiler* c, ObjBuilder* o,
+                                         MCEmitter* mc) {
+  NativeTarget* native;
+  NativeDirectTargetConfig cfg;
+  if (!mc) mc = mc_new(c, o);
+  native = x64_native_target_new(c, o, mc);
+  if (!native) return NULL;
+  memset(&cfg, 0, sizeof cfg);
+  cfg.native = native;
+  cfg.ops = x64_native_direct_ops();
+  return native_direct_target_new(c, o, &cfg);
+}
+
 const ArchImpl arch_impl_x64 = {
     .backend = {.name = "x64", .make = x64_backend_make},
     .kind = CFREE_ARCH_X86_64,
     .name = "x64",
-    .cgtarget_new = x64_cgtarget_new,
+    .cgtarget_new = x64_semantic_target_new,
     .asm_new = x64_arch_asm_new,
     .disasm_new = x64_disasm_new,
     .apply_label_fixup = x64_apply_label_fixup,
diff --git a/src/arch/x64/asm.c b/src/arch/x64/asm.c
@@ -2,7 +2,7 @@
 
 #include <string.h>
 
-#include "arch/x64/internal.h"
+#include "arch/x64/emit.h"
 #include "arch/x64/regs.h"
 #include "asm/asm_helpers.h"
 #include "core/arena.h"
@@ -1360,6 +1360,14 @@ _Noreturn static void inline_panic(X64Asm* a, const char* msg) {
 #define X64_REG_WIDTH_16 3
 #define X64_REG_WIDTH_H8 4
 
+static void render_xmm(StrBuf* sb, u32 reg) {
+  strbuf_putc(sb, '%');
+  strbuf_puts(sb, "xmm");
+  reg &= 15u;
+  if (reg >= 10u) strbuf_putc(sb, (char)('0' + (reg / 10u)));
+  strbuf_putc(sb, (char)('0' + (reg % 10u)));
+}
+
 static const char* x64_reg_spelling(u32 reg, int width) {
   static const char* r64[16] = {
       "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
@@ -1455,15 +1463,20 @@ static void render_operand(X64Asm* a, StrBuf* sb, u32 idx, int form) {
     render_indirect(sb, op->v.ind.base, op->v.ind.ofs);
     return;
   }
-  if ((form == X64_FORM_B || form == X64_FORM_H) && op->kind != OPK_REG) {
+  if ((form == X64_FORM_B || form == X64_FORM_H) &&
+      op->kind != X64_INLINE_OPK_REG) {
     inline_panic(a, "byte-register modifier requires a register operand");
   }
-  if (op->kind == OPK_REG) {
+  if (op->kind == X64_INLINE_OPK_REG) {
     int width;
+    if (op->pad[0] == X64_INLINE_OPCLS_FP) {
+      render_xmm(sb, (u32)op->v.local);
+      return;
+    }
     if (form == X64_FORM_B)
       width = X64_REG_WIDTH_8;
     else if (form == X64_FORM_H) {
-      if (op->v.reg > X64_RBX) {
+      if (op->v.local > X64_RBX) {
         inline_panic(a, "%h modifier requires ax/cx/dx/bx register");
       }
       width = X64_REG_WIDTH_H8;
@@ -1476,7 +1489,7 @@ static void render_operand(X64Asm* a, StrBuf* sb, u32 idx, int form) {
     else
       width =
           x64_type_prefers_32(op->type) ? X64_REG_WIDTH_32 : X64_REG_WIDTH_64;
-    render_reg(sb, (u32)op->v.reg, width);
+    render_reg(sb, (u32)op->v.local, width);
     return;
   }
   if (op->kind == OPK_IMM) {
diff --git a/src/arch/x64/asm.h b/src/arch/x64/asm.h
@@ -3,6 +3,19 @@
 
 #include "arch/arch.h"
 
+/* Private pseudo operand used by the x64 inline-asm binder. Semantic CG
+ * operands never expose physical registers, so native.c lowers register
+ * constraints into this arch-private shape before template substitution:
+ * Operand.kind = X64_INLINE_OPK_REG, Operand.v.local carries the 4-bit
+ * physical register number, Operand.pad[0] carries X64_INLINE_OPCLS_*.
+ * Memory operands reuse OPK_INDIRECT with v.ind.base holding the physical
+ * base register and v.ind.index == CG_LOCAL_NONE. */
+enum {
+  X64_INLINE_OPK_REG = 0xf0u,
+  X64_INLINE_OPCLS_INT = 0u,
+  X64_INLINE_OPCLS_FP = 1u,
+};
+
 typedef struct X64Asm X64Asm;
 
 X64Asm* x64_asm_open(Compiler*);
diff --git a/src/arch/x64/emit.c b/src/arch/x64/emit.c
@@ -7,11 +7,8 @@
 #include <string.h>
 
 #include "arch/mc.h"
-#include "arch/x64/internal.h"
+#include "arch/x64/emit.h"
 #include "arch/x64/isa.h"
-#include "arch/x64/x64.h"
-#include "core/arena.h"
-#include "core/pool.h"
 #include "core/slice.h"
 #include "obj/obj.h"
 
@@ -77,7 +74,7 @@ const X64ABIRegs* x64_abi_for_os(CfreeOSKind os) {
  * displacement, optional immediate. Helpers below build sequences
  * into the active MCEmitter section, recording one Debug row per
  * instruction-start. */
-static void emit1(MCEmitter* mc, u8 b) {
+void emit1(MCEmitter* mc, u8 b) {
   u32 ofs = obj_pos(mc->obj, mc->section_id);
   mc->emit_bytes(mc, &b, 1);
   if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc);
@@ -528,7 +525,7 @@ void emit_ret(MCEmitter* mc) {
   u8 op = X64_OPC_RET;
   mc->emit_bytes(mc, &op, 1);
 }
-static void emit_leave(MCEmitter* mc) {
+void emit_leave(MCEmitter* mc) {
   u8 op = X64_OPC_LEAVE;
   mc->emit_bytes(mc, &op, 1);
 }
@@ -603,500 +600,3 @@ void emit_sse_rr_w(MCEmitter* mc, u8 prefix, u8 opcode, int w, u32 dst,
   mc->emit_bytes(mc, buf, n);
   if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc);
 }
-
-/* ============================================================
- * Function lifecycle */
-
-/* Count the callee-saved GPR bits in `mask` that the ABI's cs_int_mask
- * actually owns. RBP is excluded because the prologue head saves it via
- * `push rbp`, not via the per-reg slot loop. */
-static u32 count_x64_cs_int(u32 mask, u64 cs_int_mask) {
-  u32 n = 0;
-  u64 eligible = (u64)mask & cs_int_mask;
-  eligible &= ~(1ull << X64_RBP);
-  while (eligible) {
-    eligible &= (eligible - 1);
-    ++n;
-  }
-  return n;
-}
-
-/* Count callee-saved XMM bits the ABI claims (Win64 only — SysV's
- * cs_fp_mask is empty). */
-static u32 count_x64_cs_fp(u32 mask, u64 cs_fp_mask) {
-  u32 n = 0;
-  u64 eligible = (u64)mask & cs_fp_mask;
-  while (eligible) {
-    eligible &= (eligible - 1);
-    ++n;
-  }
-  return n;
-}
-
-static u32 x64_planned_prologue_bytes(const XImpl* a) {
-  u32 n = X64_PROLOGUE_BASE_BYTES;
-  if (a->has_sret) n += X64_PROLOGUE_SRET_BYTES;
-  n += count_x64_cs_int(a->planned_cs_int_mask, a->abi->cs_int_mask) *
-       X64_PROLOGUE_SAVE_BYTES;
-  n += count_x64_cs_fp(a->planned_cs_fp_mask, a->abi->cs_fp_mask) *
-       X64_PROLOGUE_XMM_SAVE_BYTES;
-  /* We don't know the final frame size at planning time; reserve the
-   * chkstk delta whenever the ABI requires it so the placeholder is
-   * large enough if the body grows past 4 KiB. */
-  if (a->abi->shadow_space) n += X64_PROLOGUE_CHKSTK_DELTA;
-  return n ? n : 1u;
-}
-
-static void x_func_begin_init(CGTarget* t, const CGFuncDesc* fd) {
-  XImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-
-  mc->set_section(mc, fd->text_section_id);
-  mc->emit_align(mc, 16, 0x90);
-
-  a->fd = fd;
-  a->abi = x64_abi_for_os(t->c->target.os);
-  a->func_start = mc->pos(mc);
-  mc_begin_function(mc, fd->sym, fd->text_section_id, a->func_start);
-  a->next_param_int = 0;
-  a->next_param_fp = 0;
-  a->next_param_stack = 0;
-  a->has_sret = (fd->abi && fd->abi->has_sret) ? 1 : 0;
-  a->has_alloca = 0;
-  a->is_variadic = (fd->abi && fd->abi->variadic) ? 1 : 0;
-  a->known_frame = 0;
-  a->omit_frame = 0;
-  a->cum_off = 0;
-  a->max_outgoing = 0;
-  a->used_cs_int_mask = a->has_planned_regs ? a->planned_cs_int_mask : 0;
-  a->used_cs_fp_mask = a->has_planned_regs ? a->planned_cs_fp_mask : 0;
-  a->prologue_nbytes = a->has_planned_regs
-                           ? x64_planned_prologue_bytes(a)
-                           : (a->abi->shadow_space ? X64_PROLOGUE_BYTES_WIN64
-                                                   : X64_PROLOGUE_BYTES);
-  a->planned_cs_int_mask = 0;
-  a->planned_cs_fp_mask = 0;
-  a->has_planned_regs = 0;
-  a->nslots = 0;
-  a->nscopes = 0;
-  a->nalloca_patches = 0;
-  a->sret_ptr_slot = FRAME_SLOT_NONE;
-  a->reg_save_slot = FRAME_SLOT_NONE;
-  a->epilogue_label = mc->label_new(mc);
-
-  mc->cfi_startproc(mc);
-}
-
-static void x_add_entry_frame_slots(CGTarget* t) {
-  XImpl* a = impl_of(t);
-
-  /* sret: the first int arg reg at entry holds the destination pointer
-   * (RDI on SysV, RCX on Win64). Spill it to a hidden slot so the body
-   * can use that register freely. */
-  if (a->has_sret) {
-    FrameSlotDesc fsd = {
-        .type = CFREE_CG_TYPE_NONE,
-        .name = 0,
-        .loc = {0, 0, 0},
-        .size = 8,
-        .align = 8,
-        .kind = FS_SPILL,
-        .flags = 0,
-    };
-    a->sret_ptr_slot = x_frame_slot(t, &fsd);
-    /* Subsequent int args start at the next slot. */
-    a->next_param_int = 1;
-  }
-
-  /* Variadic SysV: reserve the 176 B reg-save area (rdi..r9 at +0..+40,
-   * then xmm0..xmm7 at +48..+160 with 16-byte stride) and emit the
-   * saves after the prologue placeholder. Win64 variadic uses the
-   * caller-provided 32 B home space at [rbp+16..+47] instead — no
-   * callee-allocated reg-save slot. */
-  if (a->is_variadic && a->abi->emit_sysv_vararg_save) {
-    FrameSlotDesc rsd = {
-        .type = CFREE_CG_TYPE_NONE,
-        .name = 0,
-        .loc = {0, 0, 0},
-        .size = 176,
-        .align = 8,
-        .kind = FS_SPILL,
-        .flags = 0,
-    };
-    a->reg_save_slot = x_frame_slot(t, &rsd);
-  }
-}
-
-static void x_emit_variadic_reg_saves(CGTarget* t) {
-  XImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-
-  if (!a->is_variadic) return;
-  if (a->abi->emit_sysv_vararg_save) {
-    XSlot* rs = x64_slot_get(a, a->reg_save_slot);
-    static const u32 gprs[6] = {X64_RDI, X64_RSI, X64_RDX,
-                                X64_RCX, X64_R8,  X64_R9};
-    for (u32 i = 0; i < 6; ++i) {
-      emit_mov_store(mc, 8, gprs[i], X64_RBP, -(i32)rs->off + (i32)(i * 8u));
-    }
-    /* movsd writes the low 8 bytes of each xmm; va_arg reads 8 bytes per
-     * FP slot, so the upper half of the 16-byte stride stays unused. */
-    for (u32 i = 0; i < 8; ++i) {
-      emit_sse_store(mc, 0xF2, 0x11, (u32)(X64_XMM0 + i), X64_RBP,
-                     -(i32)rs->off + (i32)(48u + i * 16u));
-    }
-    return;
-  }
-  /* Win64 variadic: spill RCX, RDX, R8, R9 into the caller's 32 B home
-   * space at [rbp+16..+47]. va_start ends up pointing at
-   * [rbp+16 + named_int_slots*8] (a contiguous arg array). FP variadic
-   * args are duplicated into the matching GPR at the call site (see
-   * vararg_fp_dup_to_gpr), so by the time the callee accesses them
-   * they're already in the GPR home slot. */
-  emit_mov_store(mc, 8, X64_RCX, X64_RBP, 16);
-  emit_mov_store(mc, 8, X64_RDX, X64_RBP, 24);
-  emit_mov_store(mc, 8, X64_R8, X64_RBP, 32);
-  emit_mov_store(mc, 8, X64_R9, X64_RBP, 40);
-}
-
-static u32 align_up_u32(u32 v, u32 a) { return (v + (a - 1u)) & ~(a - 1u); }
-
-/* Spill order for the per-ABI callee-saved set. SysV: RBX, R12..R15 (the
- * leading entries of g_int_order). Win64 adds RDI then RSI at the tail
- * (mingw/MSVC pick a stable order; the saved slot is offsets-only for
- * cfree's purposes). RBP is excluded — handled by the prologue head. */
-static const Reg g_cs_int_order_all[X64_MAX_CS_INT_REGS] = {
-    X64_RBX, X64_R12, X64_R13, X64_R14, X64_R15, X64_RDI, X64_RSI,
-};
-
-/* Spill order for Win64 callee-saved XMMs (XMM6..XMM15). */
-#define X64_MAX_CS_FP_REGS 10u
-static const Reg g_cs_fp_order_all[X64_MAX_CS_FP_REGS] = {
-    X64_XMM6,      X64_XMM7,      X64_XMM8,      X64_XMM0 + 9,  X64_XMM0 + 10,
-    X64_XMM0 + 11, X64_XMM0 + 12, X64_XMM0 + 13, X64_XMM0 + 14, X64_XMM15,
-};
-
-static u32 x_collect_cs_regs(const XImpl* a, Reg* cs_regs) {
-  u32 cs_used = 0;
-  u64 mask = (u64)a->used_cs_int_mask & a->abi->cs_int_mask;
-  mask &= ~(1ull << X64_RBP);
-  for (u32 i = 0; i < X64_MAX_CS_INT_REGS; ++i) {
-    Reg r = g_cs_int_order_all[i];
-    if (mask & (1ull << r)) cs_regs[cs_used++] = r;
-  }
-  return cs_used;
-}
-
-static u32 x_collect_cs_fp_regs(const XImpl* a, Reg* cs_fp_regs) {
-  u32 n = 0;
-  u64 mask = (u64)a->used_cs_fp_mask & a->abi->cs_fp_mask;
-  for (u32 i = 0; i < X64_MAX_CS_FP_REGS; ++i) {
-    Reg r = g_cs_fp_order_all[i];
-    if (mask & (1ull << r)) cs_fp_regs[n++] = r;
-  }
-  return n;
-}
-
-/* Frame layout (rbp-relative, high → low):
- *   [rbp]                        : saved rbp (push rbp)
- *   [rbp - cum_off]              : locals + spills (cum_off bytes)
- *   [rbp - xmm_base]             : XMM saves, 16 B each (16-aligned)
- *   [rbp - xmm_base - cs_size]   : GPR callee-saves
- *   [rsp]                        : outgoing args (max_outgoing, 16-aligned)
- * xmm_base = align_up(cum_off, 16) when any XMM saved, else == cum_off.
- * Frame size includes the alignment pad so rsp lands at 0 mod 16. */
-static u32 x_xmm_base(const XImpl* a, u32 cs_fp_used) {
-  if (cs_fp_used == 0) return a->cum_off;
-  return align_up_u32(a->cum_off, 16u);
-}
-
-static u32 x_compute_frame_size(const XImpl* a, u32 cs_used, u32 cs_fp_used) {
-  u32 xmm_base = x_xmm_base(a, cs_fp_used);
-  u32 cs_size = cs_used * 8u;
-  u32 xmm_size = cs_fp_used * 16u;
-  u32 raw = a->max_outgoing + cs_size + xmm_size + xmm_base;
-  u32 frame_size = align_up_u32(raw, 16u);
-  return frame_size ? frame_size : 16u;
-}
-
-/* Cached lookup/creation of __chkstk as a SK_UNDEF symbol. The Win64
- * stack-probe helper is provided by mingw's libmingwex / MSVC's CRT;
- * cfree references it on demand from the prologue and lets the linker
- * resolve it. */
-static ObjSymId x_chkstk_sym(CGTarget* t) {
-  Sym name = pool_intern_slice(t->c->global, SLICE_LIT("__chkstk"));
-  ObjSymId s = obj_symbol_find(t->obj, name);
-  if (s != 0) return s;
-  return obj_symbol(t->obj, name, SB_GLOBAL, SK_UNDEF, OBJ_SEC_NONE, 0, 0);
-}
-
-/* Build the prologue byte sequence. Returns the number of bytes
- * written. If `chkstk_disp_pos_out` is non-NULL and the chkstk path was
- * taken, stores the byte offset of the `call __chkstk` disp32 within
- * `buf` so the caller can emit the matching R_X64_PLT32 reloc. Sets
- * it to UINT32_MAX otherwise. */
-static u32 x_build_prologue(CGTarget* t, u8* buf, u32 cap, u32 frame_size,
-                            const Reg* cs_regs, u32 cs_used,
-                            const Reg* cs_fp_regs, u32 cs_fp_used,
-                            u32* chkstk_disp_pos_out) {
-  XImpl* a = impl_of(t);
-  u32 wi = 0;
-  if (chkstk_disp_pos_out) *chkstk_disp_pos_out = (u32)-1;
-
-  if (wi + 4 > cap) goto overflow;
-  /* push rbp (1 byte). */
-  buf[wi++] = 0x55;
-  /* mov rbp, rsp: REX.W 89 E5. */
-  buf[wi++] = X64_REX_BASE | X64_REX_W;
-  buf[wi++] = 0x89;
-  buf[wi++] = modrm(3u, X64_RSP, X64_RBP);
-
-  int need_chkstk =
-      a->abi->shadow_space && frame_size > X64_WIN64_CHKSTK_THRESHOLD;
-  if (need_chkstk) {
-    /* Win64 large-frame probe sequence (matches what GCC/clang emit on
-     * x86_64-windows):
-     *   mov eax, frame_size    ; B8 imm32             (5 bytes)
-     *   call __chkstk          ; E8 disp32            (5 bytes)
-     *   sub rsp, rax           ; REX.W 29 C4          (3 bytes)
-     * __chkstk probes one page at a time over the requested allocation
-     * but does NOT adjust rsp itself; the explicit `sub rsp, rax`
-     * after the call does that. */
-    if (wi + 13 > cap) goto overflow;
-    buf[wi++] = 0xB8;
-    buf[wi++] = (u8)frame_size;
-    buf[wi++] = (u8)(frame_size >> 8);
-    buf[wi++] = (u8)(frame_size >> 16);
-    buf[wi++] = (u8)(frame_size >> 24);
-    buf[wi++] = 0xE8;
-    if (chkstk_disp_pos_out) *chkstk_disp_pos_out = wi;
-    buf[wi++] = 0;
-    buf[wi++] = 0;
-    buf[wi++] = 0;
-    buf[wi++] = 0;
-    buf[wi++] = X64_REX_BASE | X64_REX_W;
-    buf[wi++] = 0x29;
-    buf[wi++] = modrm(3u, X64_RAX, X64_RSP);
-  } else {
-    /* sub rsp, frame_size: REX.W 81 /5 imm32 = 7 bytes. */
-    if (wi + 7 > cap) goto overflow;
-    buf[wi++] = X64_REX_BASE | X64_REX_W;
-    buf[wi++] = 0x81;
-    buf[wi++] = modrm(3u, 5u, X64_RSP);
-    buf[wi++] = (u8)frame_size;
-    buf[wi++] = (u8)(frame_size >> 8);
-    buf[wi++] = (u8)(frame_size >> 16);
-    buf[wi++] = (u8)(frame_size >> 24);
-  }
-
-  /* sret: spill the first int arg reg (which holds the destination
-   * pointer at entry) to the hidden slot. SysV uses RDI; Win64 uses
-   * RCX. */
-  if (a->has_sret && a->sret_ptr_slot != FRAME_SLOT_NONE) {
-    XSlot* s = x64_slot_get(a, a->sret_ptr_slot);
-    if (s) {
-      i32 off = -(i32)s->off;
-      u32 sret_reg = a->abi->int_args[0];
-      if (wi + 7 > cap) goto overflow;
-      buf[wi++] =
-          (u8)(X64_REX_BASE | X64_REX_W | ((sret_reg & 8) ? X64_REX_R : 0));
-      buf[wi++] = 0x89;
-      buf[wi++] = modrm(2u, (sret_reg & 7u), X64_RBP);
-      buf[wi++] = (u8)off;
-      buf[wi++] = (u8)(off >> 8);
-      buf[wi++] = (u8)(off >> 16);
-      buf[wi++] = (u8)(off >> 24);
-    }
-  }
-
-  u32 xmm_base = x_xmm_base(a, cs_fp_used);
-
-  /* Spill callee-saves. */
-  for (u32 i = 0; i < cs_used; ++i) {
-    u32 reg = cs_regs[i];
-    i32 off = -(i32)xmm_base - (i32)(cs_fp_used) * 16 - (i32)(i + 1) * 8;
-    if (wi + 7 > cap) goto overflow;
-    buf[wi++] = (u8)(X64_REX_BASE | X64_REX_W | ((reg & 8) ? X64_REX_R : 0));
-    buf[wi++] = 0x89;
-    buf[wi++] = modrm(2u, (reg & 7u), X64_RBP);
-    buf[wi++] = (u8)off;
-    buf[wi++] = (u8)(off >> 8);
-    buf[wi++] = (u8)(off >> 16);
-    buf[wi++] = (u8)(off >> 24);
-  }
-
-  /* Spill callee-saved XMMs (Win64 only). movaps [rbp+disp32], xmm_n.
-   * Layout: xmm[0] at -(xmm_base+16), xmm[1] at -(xmm_base+32), ...
-   * Each slot is 16-aligned because rbp is 16-aligned at entry and
-   * xmm_base is rounded up to 16. */
-  for (u32 i = 0; i < cs_fp_used; ++i) {
-    u32 xmm = cs_fp_regs[i];
-    i32 off = -(i32)xmm_base - (i32)(i + 1) * 16;
-    u8 rex = (u8)((xmm & 8) ? (X64_REX_BASE | X64_REX_R) : 0);
-    u32 n = rex ? 8u : 7u;
-    if (wi + n > cap) goto overflow;
-    if (rex) buf[wi++] = rex;
-    buf[wi++] = 0x0F;
-    buf[wi++] = 0x29; /* MOVAPS r/m128, xmm */
-    buf[wi++] = modrm(2u, (xmm & 7u), X64_RBP);
-    buf[wi++] = (u8)off;
-    buf[wi++] = (u8)(off >> 8);
-    buf[wi++] = (u8)(off >> 16);
-    buf[wi++] = (u8)(off >> 24);
-  }
-  return wi;
-
-overflow:
-  compiler_panic(t->c, a->loc,
-                 "x64: prologue placeholder overflow (cap %u bytes)", cap);
-  return 0;
-}
-
-void x_func_begin(CGTarget* t, const CGFuncDesc* fd) {
-  XImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-
-  x_func_begin_init(t, fd);
-
-  /* Reserve a fixed-size prologue placeholder filled with NOPs. */
-  a->prologue_pos = mc->pos(mc);
-  for (u32 i = 0; i < a->prologue_nbytes; ++i) emit1(mc, 0x90);
-
-  x_add_entry_frame_slots(t);
-  x_emit_variadic_reg_saves(t);
-}
-
-void x_func_begin_known_frame(CGTarget* t, const CGFuncDesc* fd,
-                              const CGKnownFrameDesc* frame,
-                              FrameSlot* out_slots) {
-  XImpl* a = impl_of(t);
-  Reg cs_regs[X64_MAX_CS_INT_REGS];
-  Reg cs_fp_regs[X64_MAX_CS_FP_REGS];
-  u8 buf[X64_PROLOGUE_BYTES_WIN64];
-
-  x_func_begin_init(t, fd);
-  a->known_frame = 1;
-  x_add_entry_frame_slots(t);
-  for (u32 i = 0; frame && i < frame->nslots; ++i) {
-    FrameSlot fs = x_frame_slot(t, &frame->slots[i]);
-    if (out_slots) out_slots[i] = fs;
-  }
-  if (frame) {
-    a->max_outgoing = frame->max_outgoing;
-    a->has_alloca = frame->has_alloca ? 1u : 0u;
-  }
-
-  u32 cs_used = x_collect_cs_regs(a, cs_regs);
-  u32 cs_fp_used = x_collect_cs_fp_regs(a, cs_fp_regs);
-  if (frame && frame->may_omit_frame && frame->nslots == 0 &&
-      frame->max_outgoing == 0 && !frame->has_alloca && !frame->has_call &&
-      !a->has_sret && !a->is_variadic && cs_used == 0 && cs_fp_used == 0) {
-    a->omit_frame = 1;
-    return;
-  }
-  u32 frame_size = x_compute_frame_size(a, cs_used, cs_fp_used);
-  a->prologue_pos = t->mc->pos(t->mc);
-  u32 chkstk_disp_pos = (u32)-1;
-  u32 nbytes =
-      x_build_prologue(t, buf, sizeof buf, frame_size, cs_regs, cs_used,
-                       cs_fp_regs, cs_fp_used, &chkstk_disp_pos);
-  t->mc->emit_bytes(t->mc, buf, nbytes);
-  if (chkstk_disp_pos != (u32)-1) {
-    ObjSymId chk = x_chkstk_sym(t);
-    t->mc->emit_reloc_at(t->mc, t->mc->section_id,
-                         a->prologue_pos + chkstk_disp_pos, R_X64_PLT32, chk,
-                         -4, 1, 0);
-  }
-  x_emit_variadic_reg_saves(t);
-}
-
-void x_func_end(CGTarget* t) {
-  XImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-
-  Reg cs_regs[X64_MAX_CS_INT_REGS];
-  Reg cs_fp_regs[X64_MAX_CS_FP_REGS];
-  u32 cs_used = x_collect_cs_regs(a, cs_regs);
-  u32 cs_fp_used = x_collect_cs_fp_regs(a, cs_fp_regs);
-
-  /* Stack alignment: SysV requires rsp ≡ 0 mod 16 just before a call,
-   * which means rsp ≡ 8 mod 16 inside the function (after the return
-   * address is pushed). On entry, rsp ≡ 8 mod 16; after `push rbp` it
-   * is 0 mod 16; after `sub rsp, frame_size` we need it back to 0
-   * mod 16, so frame_size must be a multiple of 16. */
-  u32 frame_size = x_compute_frame_size(a, cs_used, cs_fp_used);
-
-  if (a->omit_frame) goto finish;
-
-  mc->label_place(mc, a->epilogue_label);
-
-  u32 xmm_base = x_xmm_base(a, cs_fp_used);
-
-  /* Restore callee-saved XMMs (Win64). movaps xmm_n, [rbp+disp32]. */
-  for (i32 i = (i32)cs_fp_used - 1; i >= 0; --i) {
-    u32 xmm = cs_fp_regs[i];
-    i32 off = -(i32)xmm_base - (i32)(i + 1) * 16;
-    /* prefix=0 selects MOVAPS (0F 28 /r) when used through emit_sse_load. */
-    emit_sse_load(mc, /*prefix=*/0, /*opcode=*/0x28, xmm, X64_RBP, off);
-  }
-
-  /* Restore callee-saved GPRs. */
-  for (i32 i = (i32)cs_used - 1; i >= 0; --i) {
-    u32 reg = cs_regs[i];
-    i32 off = -(i32)xmm_base - (i32)(cs_fp_used) * 16 - (i32)(i + 1) * 8;
-    emit_mov_load(mc, /*size=*/8, /*signed=*/0, reg, X64_RBP, off);
-  }
-
-  /* leave; ret. */
-  emit_leave(mc);
-  emit_ret(mc);
-
-  if (!a->known_frame) {
-    /* Patch prologue placeholder. */
-    u8 buf[X64_PROLOGUE_BYTES_WIN64];
-    u32 prologue_nbytes =
-        a->prologue_nbytes ? a->prologue_nbytes : X64_PROLOGUE_BYTES;
-    for (u32 i = 0; i < prologue_nbytes; ++i) buf[i] = 0x90;
-    u32 chkstk_disp_pos = (u32)-1;
-    (void)x_build_prologue(t, buf, prologue_nbytes, frame_size, cs_regs,
-                           cs_used, cs_fp_regs, cs_fp_used, &chkstk_disp_pos);
-    obj_patch(t->obj, a->fd->text_section_id, a->prologue_pos, buf,
-              prologue_nbytes);
-    if (chkstk_disp_pos != (u32)-1) {
-      ObjSymId chk = x_chkstk_sym(t);
-      mc->emit_reloc_at(mc, a->fd->text_section_id,
-                        a->prologue_pos + chkstk_disp_pos, R_X64_PLT32, chk, -4,
-                        1, 0);
-    }
-  }
-
-  /* Patch each alloca's `lea dst, [rsp + 0]` disp32 with the final
-   * max_outgoing (already 16-aligned via the `(stack_off+15)&~15` round
-   * at every call site). */
-  for (u32 i = 0; i < a->nalloca_patches; ++i) {
-    u8 dbuf[4];
-    u32 m = a->max_outgoing;
-    dbuf[0] = (u8)m;
-    dbuf[1] = (u8)(m >> 8);
-    dbuf[2] = (u8)(m >> 16);
-    dbuf[3] = (u8)(m >> 24);
-    obj_patch(t->obj, a->fd->text_section_id, a->alloca_patches[i].disp_pos,
-              dbuf, 4);
-  }
-
-finish:;
-  /* Define the function symbol. */
-  u32 end = mc->pos(mc);
-  obj_symbol_define(t->obj, a->fd->sym, a->fd->text_section_id,
-                    (u64)a->func_start, (u64)(end - a->func_start));
-  if (a->fd->atomize) {
-    obj_atom_define(t->obj, a->fd->text_section_id, a->func_start,
-                    end - a->func_start, a->fd->sym, 0);
-  }
-  if (t->debug)
-    debug_func_pc_range(t->debug, a->fd->text_section_id, a->func_start, end);
-
-  mc->cfi_endproc(mc);
-  mc_end_function(mc);
-  a->fd = NULL;
-}
diff --git a/src/arch/x64/emit.h b/src/arch/x64/emit.h
@@ -0,0 +1,147 @@
+/* arch/x64/emit.h — shared x86-64 byte-level encoders, ABI tables, and type
+ * helpers. Used by emit.c (definitions), the standalone assembler (asm.c), and
+ * the NativeTarget codegen (native.c). Carries no codegen state (XImpl lives in
+ * native.c). */
+#ifndef CFREE_ARCH_X64_EMIT_H
+#define CFREE_ARCH_X64_EMIT_H
+
+#include <cfree/cg.h>
+#include <cfree/compile.h>
+
+#include "abi/abi.h"
+#include "arch/mc.h"
+#include "arch/x64/isa.h"
+#include "core/core.h"
+#include "core/slice.h"
+#include "obj/obj.h"
+
+/* ---- prologue placeholder budgets / Win64 constants ---- */
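+#define X64_PROLOGUE_BYTES 96u /* SysV placeholder budget */
+#define X64_PROLOGUE_BYTES_WIN64 192u /* Win64 placeholder budget */
+#define X64_PROLOGUE_BASE_BYTES 11u /* push rbp; mov rbp,rsp; sub rsp,imm32 */
+#define X64_PROLOGUE_SRET_BYTES 7u /* mov [rbp+disp32], reg */
+#define X64_PROLOGUE_SAVE_BYTES 7u /* per GPR: mov [rbp+disp32], reg */
+#define X64_PROLOGUE_XMM_SAVE_BYTES 8u /* movaps store, REX.R worst case */
+#define X64_PROLOGUE_CHKSTK_DELTA 6u /* 13 B chkstk sequence - 7 B plain sub */
+#define X64_WIN64_SHADOW_SPACE 32u /* 4 home slots, 8 B each */
+#define X64_WIN64_CHKSTK_THRESHOLD 4096u /* frames above this call __chkstk */
+#define X64_MAX_CS_INT_REGS 7u /* RBX, R12..R15; Win64 adds RDI, RSI */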
+
+/* ---- per-OS ABI register layout (SysV vs Win64) ---- */
+typedef struct X64ABIRegs {
+  const u32* int_args;       /* n_int_args regs; SysV RDI..R9, Win64 RCX..R9 */
+  u32 n_int_args;            /* 6 (SysV) / 4 (Win64) */
+  u32 n_fp_args;             /* 8 (SysV) / 4 (Win64) */
+  int slot_shared_int_fp;    /* 1 (Win64): int_args[i] and XMMi share slot i */
+  u32 shadow_space;          /* 0 (SysV) / 32 (Win64) */
+  int emit_sysv_vararg_save; /* 1 (SysV): emit the 176 B reg-save area */
+  int vararg_fp_dup_to_gpr;  /* 1 (Win64): call site copies FP vararg to GPR */
+  u64 cs_int_mask;           /* callee-saved GPRs (eligible set) */
+  u64 cs_fp_mask;            /* callee-saved XMMs (eligible set) */
+} X64ABIRegs;
+
+const X64ABIRegs* x64_abi_for_os(CfreeOSKind os);
+
+extern const Reg g_int_order[6];
+extern const Reg g_fp_order[10];
+
+/* Per-instruction debug line rows. Declared here (mc.h only forward-declares
+ * Debug) so emit.c's encoders and native.c's lifecycle can both record rows
+ * without taking a full dependency on debug/debug.h. */
+extern void debug_emit_row(Debug*, ObjSecId text_section, u32 offset, SrcLoc);
+extern void debug_func_pc_range(Debug*, ObjSecId text_section, u32 begin_ofs,
+                                u32 end_ofs);
+
+/* ---- type helpers ---- */
+#define CG_BUILTIN_ID(k) ((CfreeCgTypeId)((1u << 6) | (u32)(k)))
+static inline int type_is_64(CfreeCgTypeId t) {
+  return t == CG_BUILTIN_ID(CFREE_CG_BUILTIN_I64) ||
+         t == CG_BUILTIN_ID(CFREE_CG_BUILTIN_F64) ||
+         t >= (CfreeCgTypeId)(2u << 6);
+}
+static inline int type_is_fp_double(CfreeCgTypeId t) {
+  return t == CG_BUILTIN_ID(CFREE_CG_BUILTIN_F64);
+}
+static inline u32 type_byte_size(CfreeCgTypeId t) {
+  if (t == CG_BUILTIN_ID(CFREE_CG_BUILTIN_I8) ||
+      t == CG_BUILTIN_ID(CFREE_CG_BUILTIN_BOOL))
+    return 1;
+  if (t == CG_BUILTIN_ID(CFREE_CG_BUILTIN_I16)) return 2;
+  if (t == CG_BUILTIN_ID(CFREE_CG_BUILTIN_I32) ||
+      t == CG_BUILTIN_ID(CFREE_CG_BUILTIN_F32))
+    return 4;
+  if (t == CG_BUILTIN_ID(CFREE_CG_BUILTIN_F128)) return 16;
+  return 8;
+}
+static inline int type_is_signed(CfreeCgTypeId t) {
+  (void)t;
+  return 0;
+}
+
+static inline void x64_abi_direct_reg_need(const ABIArgInfo* ai, u32* need_int,
+                                           u32* need_fp) {
+  *need_int = 0;
+  *need_fp = 0;
+  if (!ai || ai->kind != ABI_ARG_DIRECT) return;
+  for (u16 i = 0; i < ai->nparts; ++i) {
+    const ABIArgPart* p = &ai->parts[i];
+    if (p->cls == ABI_CLASS_FP)
+      ++*need_fp;
+    else if (p->cls == ABI_CLASS_INT)
+      ++*need_int;
+  }
+}
+
+/* ---- byte-level encoders (defined in emit.c) ---- */
+void emit1(MCEmitter* mc, u8 b);
+void emit_leave(MCEmitter* mc);
+void emit_u32le(MCEmitter* mc, u32 v);
+void emit_rex(MCEmitter* mc, int w, u32 reg, u32 index, u32 rm);
+void emit_rex_force(MCEmitter* mc, int w, u32 reg, u32 index, u32 rm);
+u8 modrm(u32 mod, u32 reg, u32 rm);
+u8 sib(u32 scale, u32 index, u32 base);
+void emit_mem_operand(MCEmitter* mc, u32 reg, u32 base, i32 disp);
+void emit_rm_reg(MCEmitter* mc, u32 reg, u32 rm);
+void emit_mov_rr(MCEmitter* mc, int w, u32 dst, u32 src);
+void emit_mov_load(MCEmitter* mc, u32 size, int signed_ext, u32 dst, u32 base,
+                   i32 disp);
+void emit_mov_store(MCEmitter* mc, u32 size, u32 src, u32 base, i32 disp);
+void emit_lea(MCEmitter* mc, u32 dst, u32 base, i32 disp);
+void emit_mov_load_idx(MCEmitter* mc, u32 size, int signed_ext, u32 dst,
+                       u32 base, u32 index, u32 log2_scale, i32 disp);
+void emit_mov_store_idx(MCEmitter* mc, u32 size, u32 src, u32 base, u32 index,
+                        u32 log2_scale, i32 disp);
+void emit_ret(MCEmitter* mc);
+void x64_emit_load_imm(MCEmitter* mc, int is64, u32 dst, i64 imm);
+void emit_alu_rr(MCEmitter* mc, int w, u8 op, u32 dst, u32 src);
+void emit_imul_rr(MCEmitter* mc, int w, u32 dst, u32 src);
+void emit_f7_rm(MCEmitter* mc, int w, u32 sub, u32 reg);
+void emit_shift_cl(MCEmitter* mc, int w, u32 sub, u32 reg);
+void emit_shift_imm(MCEmitter* mc, int w, u32 sub, u32 reg, u8 imm);
+void emit_cqo_or_cdq(MCEmitter* mc, int w);
+void emit_xor_self(MCEmitter* mc, int w, u32 r);
+void emit_cmp_imm8(MCEmitter* mc, int w, u32 reg, i8 imm);
+void emit_alu_imm8(MCEmitter* mc, int w, u32 sub, u32 reg, i8 imm);
+void emit_alu_imm32(MCEmitter* mc, int w, u32 sub, u32 reg, i32 imm);
+void emit_imul_imm8(MCEmitter* mc, int w, u32 dst, u32 src, i8 imm);
+void emit_imul_imm32(MCEmitter* mc, int w, u32 dst, u32 src, i32 imm);
+int imm_fits_i8(i64 imm);
+int imm_fits_i32(i64 imm);
+void emit_test_self(MCEmitter* mc, int w, u32 reg);
+void emit_setcc(MCEmitter* mc, u32 cc, u32 reg);
+void emit_movzx_r32_r8(MCEmitter* mc, u32 dst, u32 src);
+void emit_extend_rr(MCEmitter* mc, int w, int signed_ext, u32 src_size, u32 dst,
+                    u32 src);
+void emit_sse_rr(MCEmitter* mc, u8 prefix, u8 opcode, u32 dst, u32 src);
+void emit_sse_load(MCEmitter* mc, u8 prefix, u8 opcode, u32 dst, u32 base,
+                   i32 disp);
+void emit_sse_store(MCEmitter* mc, u8 prefix, u8 opcode, u32 src, u32 base,
+                    i32 disp);
+void emit_sse_load_idx(MCEmitter* mc, u8 prefix, u8 opcode, u32 dst, u32 base,
+                       u32 index, u32 log2_scale, i32 disp);
+void emit_sse_store_idx(MCEmitter* mc, u8 prefix, u8 opcode, u32 src, u32 base,
+                        u32 index, u32 log2_scale, i32 disp);
+void emit_sse_rr_w(MCEmitter* mc, u8 prefix, u8 opcode, int w, u32 dst,
+                   u32 src);
+
+#endif
diff --git a/src/arch/x64/internal.h b/src/arch/x64/internal.h
@@ -1,314 +0,0 @@
-/* arch/x64/internal.h — private header shared by emit.c, alloc.c, ops.c.
- *
- * Contains:
- *   - XSlot, XScope, XAllocaPatch, XImpl struct definitions
- *   - impl_of() accessor
- *   - Small type helpers (static inline)
- *   - Forward declarations of cross-file functions
- *
- * NOT included by external consumers; use arch/x64/x64.h for the public API. */
-
-#pragma once
-
-#include <string.h>
-
-#include "arch/mc.h"
-#include "arch/x64/isa.h"
-#include "arch/x64/x64.h"
-#include "core/arena.h"
-#include "core/pool.h"
-#include "core/slice.h"
-#include "obj/obj.h"
-
-/* Prologue placeholder budget for the unplanned-regs path (the C
- * frontend's default; the opt pipeline pre-plans registers and hits
- * x64_planned_prologue_bytes for tight sizing).
- *
- * SysV worst case: 11 base + 7 sret + 5*7 GPR saves = 53.
- * Win64 worst case adds XMM6-15 (10 * 8 = 80) plus chkstk delta (+6)
- * plus the 2 extra GPR slots for RDI/RSI (2*7 = 14), so 153 — round
- * up to 192. We pick the larger budget for both OSes (the SysV path
- * is unaffected past byte 53) and rely on dead-strip / link-time
- * coalescing if size becomes a concern. */
-#define X64_PROLOGUE_BYTES 96u
-#define X64_PROLOGUE_BYTES_WIN64 192u
-#define X64_PROLOGUE_BASE_BYTES 11u
-#define X64_PROLOGUE_SRET_BYTES 7u
-#define X64_PROLOGUE_SAVE_BYTES 7u
-/* XMM save: movaps [rbp + disp32], xmm_n.
- * XMM0-7  : 0F 29 modrm disp32             = 7 B
- * XMM8-15 : 44 0F 29 modrm disp32 (REX.R)  = 8 B
- * We size with the high-reg worst case so the placeholder always fits. */
-#define X64_PROLOGUE_XMM_SAVE_BYTES 8u
-/* chkstk replaces a 7B sub-rsp-imm32 with 13B (mov eax,imm32 +
- * call disp32 + sub rsp,rax). Net +6 over the plain sub. */
-#define X64_PROLOGUE_CHKSTK_DELTA 6u
-
-/* Win64-specific constants. */
-#define X64_WIN64_SHADOW_SPACE 32u /* 4 home slots, 8 B each. */
-#define X64_WIN64_CHKSTK_THRESHOLD 4096u
-
-/* Maximum callee-saved GPRs across all supported ABIs. SysV saves up to
- * 5 (RBX, R12..R15; RBP is handled separately by the prologue head),
- * Win64 adds RDI + RSI for 7. */
-#define X64_MAX_CS_INT_REGS 7u
-
-/* ============================================================
- * Per-OS ABI register layout.
- *
- * Selected once at x_func_begin_init from t->c->target.os and
- * consulted by the call-site and param-consumer paths so they stop
- * hard-coding SysV reg orders and slot counts. */
-typedef struct X64ABIRegs {
-  const u32* int_args;       /* size = n_int_args; SysV: RDI..R9;
-                                Win64: RCX..R9 */
-  u32 n_int_args;            /* 6 (SysV) or 4 (Win64) */
-  u32 n_fp_args;             /* 8 (SysV) or 4 (Win64) */
-  int slot_shared_int_fp;    /* 1 (Win64): arg slot index shared between
-                                int_args[i] and XMMi; 0 (SysV) */
-  u32 shadow_space;          /* 0 (SysV) or 32 (Win64) */
-  int emit_sysv_vararg_save; /* 1 (SysV): emit the 176 B reg-save area */
-  int vararg_fp_dup_to_gpr;  /* 1 (Win64): call-site duplicates each
-                                variadic FP arg into the matching GPR */
-  u64 cs_int_mask;           /* callee-saved GPRs (eligible set) */
-  u64 cs_fp_mask;            /* callee-saved XMMs (eligible set) */
-} X64ABIRegs;
-
-const X64ABIRegs* x64_abi_for_os(CfreeOSKind os);
-
-/* ============================================================
- * XImpl and friends. */
-
-typedef struct XSlot {
-  u32 off; /* bytes below rbp (positive); address = rbp - off */
-  u32 size;
-  u32 align;
-  u8 kind;
-  u8 pad[3];
-} XSlot;
-
-typedef struct XScope {
-  u8 kind;
-  u8 has_else;
-  u8 pad[2];
-  MCLabel else_label;
-  MCLabel end_label;
-  Label break_label;
-  Label continue_label;
-} XScope;
-
-/* alloca emits a placeholder `lea dst, [rsp + 0]` whose disp32 is patched
- * at func_end with the final max_outgoing value. disp_pos records the
- * byte offset of that disp32 in the active text section. */
-typedef struct XAllocaPatch {
-  u32 disp_pos;
-} XAllocaPatch;
-
-typedef struct XImpl {
-  CGTarget base;
-  SrcLoc loc;
-  const CGFuncDesc* fd;
-
-  u32 func_start;
-  u32 prologue_pos;
-  u32 prologue_nbytes;
-  MCLabel epilogue_label;
-
-  XSlot* slots;
-  u32 nslots;
-  u32 slots_cap;
-  u32 cum_off;
-  u32 max_outgoing;
-
-  u32 next_param_int;
-  u32 next_param_fp;
-  u32 next_param_stack;
-  u8 has_sret;
-  u8 has_alloca;
-  u8 is_variadic;
-  u8 known_frame;
-  u8 omit_frame;
-  u8 pad0[3];
-  FrameSlot sret_ptr_slot;
-  FrameSlot reg_save_slot; /* variadic: 176-byte __va_list_tag reg save area */
-
-  u32 used_cs_int_mask; /* callee-saved GPRs used by this function */
-  u32 used_cs_fp_mask;  /* callee-saved XMMs used by this function */
-  u32 planned_cs_int_mask;
-  u32 planned_cs_fp_mask;
-  u8 has_planned_regs;
-  u8 pad1[3];
-
-  const X64ABIRegs* abi; /* selected from t->c->target.os at func_begin */
-
-  XScope* scopes;
-  u32 nscopes;
-  u32 scopes_cap;
-
-  XAllocaPatch* alloca_patches;
-  u32 nalloca_patches;
-  u32 alloca_patches_cap;
-} XImpl;
-
-static inline XImpl* impl_of(CGTarget* t) { return (XImpl*)t; }
-
-extern void debug_emit_row(Debug*, ObjSecId text_section, u32 offset, SrcLoc);
-extern void debug_func_pc_range(Debug*, ObjSecId text_section, u32 begin_ofs,
-                                u32 end_ofs);
-
-/* ============================================================
- * Type helpers (static inline — used in all three translation units). */
-
-#define CG_BUILTIN_ID(k) ((CfreeCgTypeId)((1u << 6) | (u32)(k)))
-static inline int type_is_64(CfreeCgTypeId t) {
-  return t == CG_BUILTIN_ID(CFREE_CG_BUILTIN_I64) ||
-         t == CG_BUILTIN_ID(CFREE_CG_BUILTIN_F64) ||
-         t >= (CfreeCgTypeId)(2u << 6);
-}
-static inline int type_is_fp_double(CfreeCgTypeId t) {
-  return t == CG_BUILTIN_ID(CFREE_CG_BUILTIN_F64);
-}
-static inline u32 type_byte_size(CfreeCgTypeId t) {
-  if (t == CG_BUILTIN_ID(CFREE_CG_BUILTIN_I8) ||
-      t == CG_BUILTIN_ID(CFREE_CG_BUILTIN_BOOL))
-    return 1;
-  if (t == CG_BUILTIN_ID(CFREE_CG_BUILTIN_I16)) return 2;
-  if (t == CG_BUILTIN_ID(CFREE_CG_BUILTIN_I32) ||
-      t == CG_BUILTIN_ID(CFREE_CG_BUILTIN_F32))
-    return 4;
-  if (t == CG_BUILTIN_ID(CFREE_CG_BUILTIN_F128)) return 16;
-  return 8;
-}
-static inline int type_is_signed(CfreeCgTypeId t) {
-  (void)t;
-  return 0;
-}
-
-static inline _Noreturn void x_panic(CGTarget* t, const char* what) {
-  SrcLoc loc = impl_of(t)->loc;
-  compiler_panic(t->c, loc, "x64: %.*s not implemented",
-                 SLICE_ARG(slice_from_cstr(what)));
-}
-
-/* ============================================================
- * Shared constant tables (defined in alloc.c, used in emit.c and ops.c). */
-
-extern const Reg g_int_order[6];
-extern const Reg g_fp_order[10];
-
-static inline void x64_abi_direct_reg_need(const ABIArgInfo* ai, u32* need_int,
-                                           u32* need_fp) {
-  *need_int = 0;
-  *need_fp = 0;
-  if (!ai || ai->kind != ABI_ARG_DIRECT) return;
-  for (u16 i = 0; i < ai->nparts; ++i) {
-    const ABIArgPart* p = &ai->parts[i];
-    if (p->cls == ABI_CLASS_FP)
-      ++*need_fp;
-    else if (p->cls == ABI_CLASS_INT)
-      ++*need_int;
-  }
-}
-
-static inline int x64_abi_direct_to_stack(const ABIArgInfo* ai, u32 next_int,
-                                          u32 next_fp) {
-  u32 need_int, need_fp;
-  x64_abi_direct_reg_need(ai, &need_int, &need_fp);
-  return next_int + need_int > 6u || next_fp + need_fp > 8u;
-}
-
-/* ============================================================
- * Cross-file function declarations.
- *
- * Functions that are defined in one translation unit but called from
- * another cannot remain static; they are declared here. */
-
-/* --- emit.c exports (lifecycle used by ops.c vtable constructor,
- *     encoding helpers used by alloc.c and ops.c) --- */
-void x_func_begin(CGTarget* t, const CGFuncDesc* fd);
-void x_func_begin_known_frame(CGTarget* t, const CGFuncDesc* fd,
-                              const CGKnownFrameDesc* frame,
-                              FrameSlot* out_slots);
-void x_func_end(CGTarget* t);
-
-void x_coord_vtable_init(CGTarget* t);
-
-/* encoding helpers */
-void emit_u32le(MCEmitter* mc, u32 v);
-void emit_rex(MCEmitter* mc, int w, u32 reg, u32 index, u32 rm);
-void emit_rex_force(MCEmitter* mc, int w, u32 reg, u32 index, u32 rm);
-u8 modrm(u32 mod, u32 reg, u32 rm);
-u8 sib(u32 scale, u32 index, u32 base);
-void emit_mem_operand(MCEmitter* mc, u32 reg, u32 base, i32 disp);
-void emit_rm_reg(MCEmitter* mc, u32 reg, u32 rm);
-void emit_mov_rr(MCEmitter* mc, int w, u32 dst, u32 src);
-void emit_mov_load(MCEmitter* mc, u32 size, int signed_ext, u32 dst, u32 base,
-                   i32 disp);
-void emit_mov_store(MCEmitter* mc, u32 size, u32 src, u32 base, i32 disp);
-void emit_lea(MCEmitter* mc, u32 dst, u32 base, i32 disp);
-/* Indexed-addressing variants: [base + index<<log2_scale + disp]. Pass
- * index = REG_NONE to fall back to the plain [base + disp] encoding. */
-void emit_mov_load_idx(MCEmitter* mc, u32 size, int signed_ext, u32 dst,
-                       u32 base, u32 index, u32 log2_scale, i32 disp);
-void emit_mov_store_idx(MCEmitter* mc, u32 size, u32 src, u32 base, u32 index,
-                        u32 log2_scale, i32 disp);
-void emit_ret(MCEmitter* mc);
-void x64_emit_load_imm(MCEmitter* mc, int is64, u32 dst, i64 imm);
-void emit_alu_rr(MCEmitter* mc, int w, u8 op, u32 dst, u32 src);
-void emit_imul_rr(MCEmitter* mc, int w, u32 dst, u32 src);
-void emit_f7_rm(MCEmitter* mc, int w, u32 sub, u32 reg);
-void emit_shift_cl(MCEmitter* mc, int w, u32 sub, u32 reg);
-void emit_shift_imm(MCEmitter* mc, int w, u32 sub, u32 reg, u8 imm);
-void emit_cqo_or_cdq(MCEmitter* mc, int w);
-void emit_xor_self(MCEmitter* mc, int w, u32 r);
-void emit_cmp_imm8(MCEmitter* mc, int w, u32 reg, i8 imm);
-void emit_alu_imm8(MCEmitter* mc, int w, u32 sub, u32 reg, i8 imm);
-void emit_alu_imm32(MCEmitter* mc, int w, u32 sub, u32 reg, i32 imm);
-void emit_imul_imm8(MCEmitter* mc, int w, u32 dst, u32 src, i8 imm);
-void emit_imul_imm32(MCEmitter* mc, int w, u32 dst, u32 src, i32 imm);
-int imm_fits_i8(i64 imm);
-int imm_fits_i32(i64 imm);
-void emit_test_self(MCEmitter* mc, int w, u32 reg);
-void emit_setcc(MCEmitter* mc, u32 cc, u32 reg);
-void emit_movzx_r32_r8(MCEmitter* mc, u32 dst, u32 src);
-void emit_extend_rr(MCEmitter* mc, int w, int signed_ext, u32 src_size, u32 dst,
-                    u32 src);
-void emit_sse_rr(MCEmitter* mc, u8 prefix, u8 opcode, u32 dst, u32 src);
-void emit_sse_load(MCEmitter* mc, u8 prefix, u8 opcode, u32 dst, u32 base,
-                   i32 disp);
-void emit_sse_store(MCEmitter* mc, u8 prefix, u8 opcode, u32 src, u32 base,
-                    i32 disp);
-void emit_sse_load_idx(MCEmitter* mc, u8 prefix, u8 opcode, u32 dst, u32 base,
-                       u32 index, u32 log2_scale, i32 disp);
-void emit_sse_store_idx(MCEmitter* mc, u8 prefix, u8 opcode, u32 src, u32 base,
-                        u32 index, u32 log2_scale, i32 disp);
-void emit_sse_rr_w(MCEmitter* mc, u8 prefix, u8 opcode, int w, u32 dst,
-                   u32 src);
-
-/* --- alloc.c exports (used by emit.c and/or ops.c) --- */
-XSlot* x64_slot_get(XImpl* a, FrameSlot fs);
-int x_resolve_reg_name(CGTarget* t, Sym name, Reg* out, RegClass* cls_out);
-FrameSlot x_frame_slot(CGTarget* t, const FrameSlotDesc* d);
-CGLocalStorage x_param(CGTarget* t, const CGParamDesc* p);
-void x_spill_reg(CGTarget* t, Operand src, FrameSlot slot, MemAccess ma);
-void x_reload_reg(CGTarget* t, Operand dst, FrameSlot slot, MemAccess ma);
-Label x_label_new(CGTarget* t);
-void x_label_place(CGTarget* t, Label l);
-void emit_jmp_label(MCEmitter* mc, MCLabel l);
-void emit_jcc_label(MCEmitter* mc, u32 cc, MCLabel l);
-void x_jump(CGTarget* t, Label l);
-void x_cmp_branch(CGTarget* t, CmpOp op, Operand a, Operand b, Label l);
-void x_load_label_addr(CGTarget* t, Operand dst, Label l);
-void x_indirect_branch(CGTarget* t, Operand addr, const Label* targets,
-                       u32 ntargets);
-void x_cmp(CGTarget* t, CmpOp op, Operand dst, Operand a, Operand b);
-CGScope x_scope_begin(CGTarget* t, const CGScopeDesc* d);
-void x_scope_else(CGTarget* t, CGScope s);
-void x_scope_end(CGTarget* t, CGScope s);
-void x_break_to(CGTarget* t, CGScope s);
-void x_continue_to(CGTarget* t, CGScope s);
-u32 x64_force_reg_int(CGTarget* t, Operand op, int w, u32 scratch);
-
-/* --- ops.c exports (used by alloc.c) --- */
-void x_load(CGTarget* t, Operand dst, Operand addr, MemAccess ma);
-void x_store(CGTarget* t, Operand addr, Operand src, MemAccess ma);
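Note on the byte-encoders: the `emit_rex` / `modrm` helpers declared in the removed header are the building blocks for the rip-relative forms (`[rip + disp32]` is ModRM mod=0, rm=5). The following is a minimal standalone sketch of that encoding for `lea dst, [rip + disp32]`, assuming only the architectural REX/ModRM layouts. The names `sketch_rex`, `sketch_modrm` and `sketch_lea_rip` are hypothetical and are not the project's implementations; relocation handling is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: ModRM byte = mod(2 bits) | reg(3 bits) | rm(3 bits). */
static uint8_t sketch_modrm(uint32_t mod, uint32_t reg, uint32_t rm) {
  assert(mod <= 3u);
  return (uint8_t)((mod << 6) | ((reg & 7u) << 3) | (rm & 7u));
}

/* Hypothetical helper: REX prefix = 0100WRXB, where R/X/B carry bit 3 of the
 * reg, index and rm/base register numbers. */
static uint8_t sketch_rex(int w, uint32_t reg, uint32_t index, uint32_t rm) {
  return (uint8_t)(0x40u | (w ? 0x08u : 0u) | (((reg >> 3) & 1u) << 2) |
                   (((index >> 3) & 1u) << 1) | ((rm >> 3) & 1u));
}

/* Encode `lea dst, [rip + disp32]` into out (7 bytes). Returns the byte count.
 * A real backend would emit disp32 = 0 and attach a PC-relative relocation at
 * the displacement position; this sketch just stores the given value. */
static size_t sketch_lea_rip(uint8_t* out, uint32_t dst, int32_t disp) {
  uint32_t d = (uint32_t)disp;
  size_t n = 0;
  out[n++] = sketch_rex(1, dst, 0, 0);  /* REX.W (+ REX.R for r8..r15) */
  out[n++] = 0x8D;                      /* LEA r64, m */
  out[n++] = sketch_modrm(0u, dst, 5u); /* mod=0, rm=5: [rip + disp32] */
  out[n++] = (uint8_t)(d & 0xffu);      /* disp32, little-endian */
  out[n++] = (uint8_t)((d >> 8) & 0xffu);
  out[n++] = (uint8_t)((d >> 16) & 0xffu);
  out[n++] = (uint8_t)((d >> 24) & 0xffu);
  return n;
}
```

With dst = 11 (r11) and disp = 0 the sketch produces 4C 8D 1D 00 00 00 00: REX.R supplies the high bit of the register number and ModRM.reg the low three bits.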
diff --git a/src/arch/x64/native.c b/src/arch/x64/native.c
@@ -0,0 +1,3751 @@
+/* src/arch/x64/native.c — x86-64 (SysV / Win64) NativeTarget implementation.
+ *
+ * Mirrors the rv64 reference (src/arch/rv64/native.c): a physical-emission
+ * NativeTarget driven at -O0 by the shared NativeDirectTarget and at -O1+ by
+ * the optimizer emit path. ABI decisions route through abi/ and the per-OS
+ * X64ABIRegs (x64_abi_for_os); this file owns ISA emission and the x64 frame
+ * layout.
+ *
+ * Frame model (single, rbp-anchored): the prologue does `push rbp; mov rbp,rsp;
+ * sub rsp,frame_size`. Local/spill slots live below rbp at positive byte
+ * offsets `off` (address = rbp - off). Incoming stack args sit above the saved
+ * return address at [rbp + 16 + shadow_space + ...]. Callee-saved GPRs (and, on
+ * Win64, XMMs) are saved below the locals; outgoing args sit at [rsp + 0..].
+ * The single-pass (-O0) prologue reserves a NOP placeholder patched in func_end
+ * once max_outgoing and callee-saves are known.
+ *
+ * Register model. INT scratch (never allocable, never driver scratch): RAX and
+ * R11 — the legacy emit paths' fixed temporaries. FP scratch: XMM14 and XMM15.
+ * RSP/RBP are reserved (stack/frame pointers). Everything else is allocable.
+ * The driver scratch pool is RBX/R12 (int) and XMM12/XMM13 (fp), disjoint from
+ * the emit temps so a hook never clobbers an operand parked there. ABI arg/ret
+ * registers are caller-saved-allocable; callee-saved set is resolved per-OS via
+ * x64_abi_for_os at runtime (the legality masks below are SysV's, the
+ * conservative superset that both ABIs' allocators respect; Win64's extra
+ * callee-saves RDI/RSI/xmm6-15 only shrink, never grow, the allocable pool). */
+
+#include <string.h>
+
+#include "abi/abi.h"
+#include "arch/x64/asm.h"
+#include "arch/x64/emit.h"
+#include "arch/x64/isa.h"
+#include "arch/x64/regs.h"
+#include "arch/x64/x64.h"
+#include "asm/asm.h"
+#include "asm/asm_lex.h"
+#include "cg/native_direct_target.h"
+#include "cg/type.h"
+#include "core/arena.h"
+#include "core/bytes.h"
+#include "core/pool.h"
+#include "core/slice.h"
+#include "obj/obj.h"
+
+enum {
+  X64_TMP_INT = X64_RAX,  /* emit-internal int scratch (reserved) */
+  X64_TMP_INT2 = X64_R11, /* emit-internal int scratch (reserved) */
+  X64_TMP_FP = X64_XMM0 + 14, /* emit-internal fp scratch (reserved) */
+  X64_TMP_FP2 = X64_XMM15,    /* emit-internal fp scratch (reserved) */
+  X64_MAX_REG_ARG_MOVES = 16u,
+  X64_MAX_CS_FP_REGS = 10u, /* Win64 xmm6..xmm15 */
+};
+
+/* ============================ target state ============================ */
+
+typedef struct X64NativeSlot {
+  u32 off; /* bytes below rbp (positive); address = rbp - off */
+  u32 size;
+  u32 align;
+  u8 kind; /* NativeFrameSlotKind */
+  u8 pad[3];
+} X64NativeSlot;
+
+typedef struct X64CalleeSave {
+  Reg reg;
+  u8 cls; /* NativeAllocClass */
+} X64CalleeSave;
+
+typedef enum X64PatchKind { X64_PATCH_ALLOCA } X64PatchKind;
+
+typedef struct X64Patch {
+  u8 kind; /* X64PatchKind */
+  u32 pos; /* byte offset of the disp32 to patch */
+} X64Patch;
+
+typedef struct X64NativeTarget {
+  NativeTarget base;
+  SrcLoc loc;
+  const CGFuncDesc* func;
+
+  X64NativeSlot* slots;
+  u32 nslots;
+  u32 slots_cap;
+  u32 cum_off;      /* sum of frame-slot reservations below rbp */
+  u32 max_outgoing; /* max outgoing-arg bytes across all calls */
+  u32 frame_size_final;
+
+  u32 incoming_stack_size; /* fixed-param stack bytes (tail-call check) */
+  u32 next_param_int;
+  u32 next_param_fp;
+  u32 next_param_stack;
+  u8 has_sret;
+  u8 is_variadic;
+  NativeFrameSlot sret_ptr_slot;
+  NativeFrameSlot reg_save_slot; /* SysV variadic 176B register save area */
+
+  X64Patch* patches;
+  u32 npatches;
+  u32 patches_cap;
+  u32 nalloca;
+
+  u32 func_start;
+  u32 prologue_pos;
+  u32 prologue_nbytes;
+  MCLabel epilogue_label;
+
+  X64CalleeSave callee_saves[X64_MAX_CS_INT_REGS + X64_MAX_CS_FP_REGS];
+  u32 ncallee_saves;
+
+  u8 known_frame;
+  u8 has_alloca;
+  u8 frame_final;
+
+  const X64ABIRegs* abi;
+} X64NativeTarget;
+
+static X64NativeTarget* x64_of(NativeTarget* t) { return (X64NativeTarget*)t; }
+
+static _Noreturn void x64_panic(X64NativeTarget* a, const char* msg) {
+  compiler_panic(a->base.c, a->loc, "x64 native target: %s", msg);
+}
+
+static X64NativeSlot* x64_slot_get(X64NativeTarget* a, NativeFrameSlot fs) {
+  if (fs == NATIVE_FRAME_SLOT_NONE || fs > a->nslots)
+    x64_panic(a, "bad frame slot");
+  return &a->slots[fs - 1u];
+}
+
+static u32 align_up_u32(u32 v, u32 align) {
+  u32 mask = align ? align - 1u : 0u;
+  return (v + mask) & ~mask;
+}
+
+/* ============================ type helpers ============================ */
+
+static u32 x64_type_size(NativeTarget* t, CfreeCgTypeId type) {
+  u64 n = type ? cg_type_size(t->c, type) : 8u;
+  if (n == 0) n = 8u;
+  return (u32)n;
+}
+
+static u32 x64_type_align(NativeTarget* t, CfreeCgTypeId type) {
+  u64 n = type ? cg_type_align(t->c, type) : 8u;
+  if (n == 0) n = 1u;
+  if (n > 16u) n = 16u;
+  return (u32)n;
+}
+
+/* A scalar value occupies a 64-bit register when it is pointer-sized or wider
+ * (drives REX.W selection). */
+static int x64_is_64(NativeTarget* t, CfreeCgTypeId type) {
+  return x64_type_size(t, type) >= 8u || cg_type_is_ptr(t->c, type);
+}
+
+static int loc_is_fp(NativeLoc loc) {
+  return (NativeAllocClass)loc.cls == NATIVE_REG_FP;
+}
+static u32 loc_reg(NativeLoc loc) { return loc.v.reg & 0xfu; }
+
+static NativeAllocClass x64_class_for_type(NativeTarget* t, CfreeCgTypeId type) {
+  if (type && cg_type_is_float(t->c, type) && cg_type_size(t->c, type) <= 8u)
+    return NATIVE_REG_FP;
+  return NATIVE_REG_INT;
+}
+
+static MemAccess x64_mem_for_type(NativeTarget* t, CfreeCgTypeId type,
+                                  u32 size) {
+  MemAccess m;
+  memset(&m, 0, sizeof m);
+  m.type = type;
+  m.size = size ? size : x64_type_size(t, type);
+  m.align = x64_type_align(t, type);
+  return m;
+}
+
+static NativeLoc x64_reg_loc(CfreeCgTypeId type, NativeAllocClass cls, Reg reg) {
+  NativeLoc loc;
+  memset(&loc, 0, sizeof loc);
+  loc.kind = NATIVE_LOC_REG;
+  loc.cls = (u8)cls;
+  loc.type = type;
+  loc.v.reg = reg;
+  return loc;
+}
+
+static NativeLoc x64_stack_loc(CfreeCgTypeId type, NativeFrameSlot slot,
+                               i32 offset) {
+  NativeLoc loc;
+  memset(&loc, 0, sizeof loc);
+  loc.kind = NATIVE_LOC_STACK;
+  loc.cls = NATIVE_REG_INT;
+  loc.type = type;
+  loc.v.stack.slot = slot;
+  loc.v.stack.offset = offset;
+  return loc;
+}
+
+/* SSE scalar prefix: F2 (double / 8-byte) vs F3 (single / 4-byte). */
+static u8 sse_scalar_prefix(u32 size) { return size == 8u ? 0xF2u : 0xF3u; }
+
+/* Forward decls for the rel32 branch emitters (used by convert before the
+ * control-flow section defines them). */
+static void emit_jmp_rel32(MCEmitter* mc, MCLabel l);
+static void emit_jcc_rel32(MCEmitter* mc, u32 cc, MCLabel l);
+
+/* ============================ register tables ============================ */
+
+#define X64_PHYS_INT_ARG(r)                                              \
+  {.reg = (r),                                                           \
+   .cls = NATIVE_REG_INT,                                                \
+   .abi_index = 0xffu,                                                   \
+   .flags = NATIVE_REG_ALLOCABLE | NATIVE_REG_CALLER_SAVED |             \
+            NATIVE_REG_ARG,                                              \
+   .spill_cost = 1u,                                                     \
+   .copy_cost = 1u}
+#define X64_PHYS_INT_RET_ARG(r)                                          \
+  {.reg = (r),                                                           \
+   .cls = NATIVE_REG_INT,                                                \
+   .abi_index = 0xffu,                                                   \
+   .flags = NATIVE_REG_ALLOCABLE | NATIVE_REG_CALLER_SAVED |             \
+            NATIVE_REG_ARG | NATIVE_REG_RET,                             \
+   .spill_cost = 1u,                                                     \
+   .copy_cost = 1u}
+#define X64_PHYS_INT_CALLER(r)                                           \
+  {.reg = (r),                                                           \
+   .cls = NATIVE_REG_INT,                                                \
+   .abi_index = 0xffu,                                                   \
+   .flags = NATIVE_REG_ALLOCABLE | NATIVE_REG_CALLER_SAVED,              \
+   .spill_cost = 1u,                                                     \
+   .copy_cost = 1u}
+#define X64_PHYS_INT_CALLEE(r)                                           \
+  {.reg = (r),                                                           \
+   .cls = NATIVE_REG_INT,                                                \
+   .abi_index = 0xffu,                                                   \
+   .flags = NATIVE_REG_ALLOCABLE | NATIVE_REG_CALLEE_SAVED,              \
+   .spill_cost = 4u,                                                     \
+   .copy_cost = 1u}
+#define X64_PHYS_INT_RESERVED(r) \
+  {.reg = (r),                   \
+   .cls = NATIVE_REG_INT,        \
+   .abi_index = 0xffu,           \
+   .flags = NATIVE_REG_RESERVED, \
+   .spill_cost = 0u,             \
+   .copy_cost = 0u}
+
+/* Allocable int pool, opt's spill/reload set: callee-saved R13..R15 first, then
+ * caller-saved R10, matching the order of the array below. RAX/R11 are
+ * emit scratch (reserved); RBX/R12 are the driver scratch pool. */
+static const Reg x64_int_allocable[] = {X64_R13, X64_R14, X64_R15, X64_R10};
+static const Reg x64_int_scratch[] = {X64_RBX, X64_R12};
+
+static const NativePhysRegInfo x64_int_phys[] = {
+    X64_PHYS_INT_RESERVED(X64_RAX), /* return / emit scratch */
+    X64_PHYS_INT_ARG(X64_RCX),
+    X64_PHYS_INT_RET_ARG(X64_RDX),
+    X64_PHYS_INT_RESERVED(X64_RBX), /* driver scratch */
+    X64_PHYS_INT_RESERVED(X64_RSP), /* stack pointer */
+    X64_PHYS_INT_RESERVED(X64_RBP), /* frame pointer */
+    X64_PHYS_INT_ARG(X64_RSI),
+    X64_PHYS_INT_ARG(X64_RDI),
+    X64_PHYS_INT_ARG(X64_R8),
+    X64_PHYS_INT_ARG(X64_R9),
+    X64_PHYS_INT_CALLER(X64_R10),
+    X64_PHYS_INT_RESERVED(X64_R11), /* emit scratch */
+    X64_PHYS_INT_RESERVED(X64_R12), /* driver scratch */
+    X64_PHYS_INT_CALLEE(X64_R13),
+    X64_PHYS_INT_CALLEE(X64_R14),
+    X64_PHYS_INT_CALLEE(X64_R15),
+};
+
+#define X64_PHYS_FP_ARG_RET(r)                                           \
+  {.reg = (r),                                                           \
+   .cls = NATIVE_REG_FP,                                                 \
+   .abi_index = 0xffu,                                                   \
+   .flags = NATIVE_REG_ALLOCABLE | NATIVE_REG_CALLER_SAVED |             \
+            NATIVE_REG_ARG | NATIVE_REG_RET,                             \
+   .spill_cost = 1u,                                                     \
+   .copy_cost = 1u}
+#define X64_PHYS_FP_ARG(r)                                               \
+  {.reg = (r),                                                           \
+   .cls = NATIVE_REG_FP,                                                 \
+   .abi_index = 0xffu,                                                   \
+   .flags = NATIVE_REG_ALLOCABLE | NATIVE_REG_CALLER_SAVED |             \
+            NATIVE_REG_ARG,                                              \
+   .spill_cost = 1u,                                                     \
+   .copy_cost = 1u}
+#define X64_PHYS_FP_CALLER(r)                                            \
+  {.reg = (r),                                                           \
+   .cls = NATIVE_REG_FP,                                                 \
+   .abi_index = 0xffu,                                                   \
+   .flags = NATIVE_REG_ALLOCABLE | NATIVE_REG_CALLER_SAVED,              \
+   .spill_cost = 1u,                                                     \
+   .copy_cost = 1u}
+#define X64_PHYS_FP_RESERVED(r)  \
+  {.reg = (r),                   \
+   .cls = NATIVE_REG_FP,         \
+   .abi_index = 0xffu,           \
+   .flags = NATIVE_REG_RESERVED, \
+   .spill_cost = 0u,             \
+   .copy_cost = 0u}
+
+/* Allocable FP pool: xmm6..xmm11 (keep arg/ret xmm0..5 clear). xmm14/xmm15 are
+ * emit scratch; xmm12/xmm13 the driver scratch pool. */
+static const Reg x64_fp_allocable[] = {
+    X64_XMM6,     X64_XMM7,      X64_XMM8,      X64_XMM0 + 9,
+    X64_XMM0 + 10, X64_XMM0 + 11};
+static const Reg x64_fp_scratch[] = {X64_XMM0 + 12, X64_XMM0 + 13};
+
+static const NativePhysRegInfo x64_fp_phys[] = {
+    X64_PHYS_FP_ARG_RET(X64_XMM0),
+    X64_PHYS_FP_ARG_RET(X64_XMM1),
+    X64_PHYS_FP_ARG(X64_XMM2),
+    X64_PHYS_FP_ARG(X64_XMM3),
+    X64_PHYS_FP_ARG(X64_XMM4),
+    X64_PHYS_FP_ARG(X64_XMM5),
+    X64_PHYS_FP_CALLER(X64_XMM6),
+    X64_PHYS_FP_CALLER(X64_XMM7),
+    X64_PHYS_FP_CALLER(X64_XMM8),
+    X64_PHYS_FP_CALLER(X64_XMM0 + 9),
+    X64_PHYS_FP_CALLER(X64_XMM0 + 10),
+    X64_PHYS_FP_CALLER(X64_XMM0 + 11),
+    X64_PHYS_FP_RESERVED(X64_XMM0 + 12), /* driver scratch */
+    X64_PHYS_FP_RESERVED(X64_XMM0 + 13), /* driver scratch */
+    X64_PHYS_FP_RESERVED(X64_XMM0 + 14), /* emit scratch */
+    X64_PHYS_FP_RESERVED(X64_XMM15),     /* emit scratch */
+};
+
+static const NativeAllocClassInfo x64_classes[] = {
+    {.cls = NATIVE_REG_INT,
+     .allocable = x64_int_allocable,
+     .nallocable = sizeof x64_int_allocable / sizeof x64_int_allocable[0],
+     .scratch = x64_int_scratch,
+     .nscratch = sizeof x64_int_scratch / sizeof x64_int_scratch[0],
+     .phys = x64_int_phys,
+     .nphys = sizeof x64_int_phys / sizeof x64_int_phys[0],
+     /* caller-saved: rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 (SysV) */
+     .caller_saved_mask = (1u << X64_RAX) | (1u << X64_RCX) | (1u << X64_RDX) |
+                          (1u << X64_RSI) | (1u << X64_RDI) | (1u << X64_R8) |
+                          (1u << X64_R9) | (1u << X64_R10) | (1u << X64_R11),
+     /* callee-saved: rbx,r12,r13,r14,r15 (rbp handled by prologue head) */
+     .callee_saved_mask = (1u << X64_RBX) | (1u << X64_R12) | (1u << X64_R13) |
+                          (1u << X64_R14) | (1u << X64_R15),
+     /* SysV arg regs rdi,rsi,rdx,rcx,r8,r9 */
+     .arg_mask = (1u << X64_RDI) | (1u << X64_RSI) | (1u << X64_RDX) |
+                 (1u << X64_RCX) | (1u << X64_R8) | (1u << X64_R9),
+     .ret_mask = (1u << X64_RAX) | (1u << X64_RDX),
+     /* rax, rsp, rbp, r11 reserved (plus the rbx/r12 driver scratch pool) */
+     .reserved_mask = (1u << X64_RAX) | (1u << X64_RSP) | (1u << X64_RBP) |
+                      (1u << X64_R11) | (1u << X64_RBX) | (1u << X64_R12)},
+    {.cls = NATIVE_REG_FP,
+     .allocable = x64_fp_allocable,
+     .nallocable = sizeof x64_fp_allocable / sizeof x64_fp_allocable[0],
+     .scratch = x64_fp_scratch,
+     .nscratch = sizeof x64_fp_scratch / sizeof x64_fp_scratch[0],
+     .phys = x64_fp_phys,
+     .nphys = sizeof x64_fp_phys / sizeof x64_fp_phys[0],
+     /* All xmm caller-saved on SysV. */
+     .caller_saved_mask = 0xffffu,
+     .callee_saved_mask = 0u,
+     .arg_mask = 0xffu, /* xmm0..xmm7 */
+     .ret_mask = (1u << X64_XMM0) | (1u << X64_XMM1),
+     /* xmm12..xmm15 reserved (driver scratch + emit scratch) */
+     .reserved_mask = (1u << (X64_XMM0 + 12)) | (1u << (X64_XMM0 + 13)) |
+                      (1u << (X64_XMM0 + 14)) | (1u << X64_XMM15)},
+};
+
+static const NativeRegInfo x64_reg_info = {
+    .classes = x64_classes,
+    .nclasses = sizeof x64_classes / sizeof x64_classes[0],
+};
+
+/* ============================ legality ============================ */
+
+static int x64_imm_legal(NativeTarget* t, NativeImmUse use, u32 op,
+                         CfreeCgTypeId type, i64 imm) {
+  (void)t;
+  (void)type;
+  switch (use) {
+    case NATIVE_IMM_MOVE:
+      return 1;
+    case NATIVE_IMM_BINOP:
+      switch ((BinOp)op) {
+        case BO_IADD:
+        case BO_ISUB:
+        case BO_AND:
+        case BO_OR:
+        case BO_XOR:
+        case BO_IMUL:
+          return imm_fits_i32(imm);
+        case BO_SHL:
+        case BO_SHR_S:
+        case BO_SHR_U:
+          return imm >= 0 && imm <= 63;
+        default:
+          return 0;
+      }
+    case NATIVE_IMM_CMP:
+      return imm_fits_i32(imm);
+    case NATIVE_IMM_ADDR_OFFSET:
+      return imm_fits_i32(imm);
+  }
+  return 0;
+}
+
+static int x64_addr_legal(NativeTarget* t, const NativeAddr* addr,
+                          MemAccess mem) {
+  (void)t;
+  (void)mem;
+  if (!addr) return 0;
+  if (addr->base_kind != NATIVE_ADDR_BASE_REG &&
+      addr->base_kind != NATIVE_ADDR_BASE_FRAME)
+    return 0;
+  /* x64 supports [base + index*scale + disp32]; index must be a register. */
+  if (addr->index_kind != NATIVE_ADDR_INDEX_NONE &&
+      addr->index_kind != NATIVE_ADDR_INDEX_REG)
+    return 0;
+  return imm_fits_i32(addr->offset);
+}
+
+/* ============================ globals / addresses ============================
+ */
+
+static int x64_use_got_for_sym(NativeTarget* t, ObjSymId sym) {
+  return obj_symbol_extern_via_got(t->c, t->obj, sym);
+}
+
+/* PC-relative reloc kind for a non-GOT &sym reference. Functions use PLT32 so
+ * the linker can route through a PLT; data uses plain PC32. */
+static u32 x64_pcrel_reloc_for_sym(NativeTarget* t, ObjSymId sym) {
+  const ObjSym* s = obj_symbol_get(t->obj, sym);
+  if (s && (s->kind == SK_FUNC || s->kind == SK_IFUNC)) return R_X64_PLT32;
+  return R_PC32;
+}
+
+/* Materialize &sym + addend into dst_reg. Local/static-link symbols use
+ * `lea rd, [rip + disp32]`; GOT-routed externs use `mov rd, [rip + GOT]` then
+ * add any nonzero addend. */
+static void x64_emit_global_lea(NativeTarget* t, u32 dst_reg, ObjSymId sym,
+                                i64 addend) {
+  MCEmitter* mc = t->mc;
+  u32 sec = mc->section_id;
+  if (x64_use_got_for_sym(t, sym)) {
+    u8 op;
+    u32 disp_pos;
+    emit_rex(mc, 1, dst_reg, 0, 0);
+    op = X64_OPC_MOV_R_RM;
+    mc->emit_bytes(mc, &op, 1);
+    {
+      u8 mr = modrm(0u, dst_reg & 7u, 5u); /* [rip + disp32] */
+      mc->emit_bytes(mc, &mr, 1);
+    }
+    disp_pos = mc->pos(mc);
+    emit_u32le(mc, 0);
+    mc->emit_reloc_at(mc, sec, disp_pos, R_X64_REX_GOTPCRELX, sym, -4, 1, 0);
+    if (addend) {
+      i32 a = (i32)addend;
+      emit_rex(mc, 1, 0, 0, dst_reg);
+      if (imm_fits_i8(a)) {
+        u8 buf[3] = {X64_OPC_ALU_IMM8, modrm(3u, X64_ALU_SUB_ADD, dst_reg & 7u),
+                     (u8)a};
+        mc->emit_bytes(mc, buf, 3);
+      } else {
+        u8 buf[2] = {X64_OPC_ALU_IMM32, modrm(3u, X64_ALU_SUB_ADD, dst_reg & 7u)};
+        mc->emit_bytes(mc, buf, 2);
+        emit_u32le(mc, (u32)a);
+      }
+    }
+    return;
+  }
+  {
+    u8 op = X64_OPC_LEA;
+    u32 disp_pos;
+    emit_rex(mc, 1, dst_reg, 0, 0);
+    mc->emit_bytes(mc, &op, 1);
+    {
+      u8 mr = modrm(0u, dst_reg & 7u, 5u); /* [rip + disp32] */
+      mc->emit_bytes(mc, &mr, 1);
+    }
+    disp_pos = mc->pos(mc);
+    emit_u32le(mc, 0);
+    mc->emit_reloc_at(mc, sec, disp_pos, x64_pcrel_reloc_for_sym(t, sym), sym,
+                      addend - 4, 1, 0);
+  }
+}
+
+/* Resolve a NativeAddr to (base, index, log2_scale, off). FRAME bases address
+ * off rbp; FRAME_VALUE/GLOBAL bases are materialized into `scratch`. */
+static u32 x64_resolve_addr(X64NativeTarget* a, const NativeAddr* addr,
+                            u32 scratch, u32* idx_out, u32* scale_out,
+                            i32* off_out) {
+  NativeTarget* t = &a->base;
+  u32 base;
+  i32 off;
+  switch (addr->base_kind) {
+    case NATIVE_ADDR_BASE_REG:
+      base = addr->base.reg & 0xfu;
+      off = addr->offset;
+      break;
+    case NATIVE_ADDR_BASE_FRAME: {
+      X64NativeSlot* s = x64_slot_get(a, addr->base.frame);
+      base = X64_RBP;
+      off = -(i32)s->off + addr->offset;
+      break;
+    }
+    case NATIVE_ADDR_BASE_FRAME_VALUE: {
+      X64NativeSlot* s = x64_slot_get(a, addr->base.frame);
+      emit_mov_load(t->mc, 8, 0, scratch, X64_RBP, -(i32)s->off);
+      base = scratch;
+      off = addr->offset;
+      break;
+    }
+    case NATIVE_ADDR_BASE_GLOBAL:
+      x64_emit_global_lea(t, scratch, addr->base.global.sym,
+                          addr->base.global.addend);
+      base = scratch;
+      off = addr->offset;
+      break;
+    default:
+      x64_panic(a, "unsupported address base");
+  }
+  if (addr->index_kind == NATIVE_ADDR_INDEX_REG) {
+    *idx_out = addr->index.reg & 0xfu;
+    *scale_out = addr->log2_scale;
+  } else if (addr->index_kind == NATIVE_ADDR_INDEX_FRAME_VALUE) {
+    X64NativeSlot* s = x64_slot_get(a, addr->index.frame);
+    emit_mov_load(t->mc, 8, 0, X64_TMP_INT2, X64_RBP, -(i32)s->off);
+    *idx_out = X64_TMP_INT2;
+    *scale_out = addr->log2_scale;
+  } else {
+    *idx_out = REG_NONE;
+    *scale_out = 0;
+  }
+  *off_out = off;
+  return base;
+}
+
+/* ============================ memory ============================ */
+
+/* Central load/store primitive. is_load: 1 load into reg, 0 store reg to mem.
+ * Materializes the address through X64_TMP_INT2 (r11) for non-reg bases. */
+static void x64_emit_mem(X64NativeTarget* a, int is_load, NativeLoc reg,
+                         NativeAddr addr, MemAccess mem) {
+  NativeTarget* t = &a->base;
+  MCEmitter* mc = t->mc;
+  u32 r = loc_reg(reg);
+  int fp = loc_is_fp(reg);
+  u32 sz = mem.size ? mem.size : x64_type_size(t, reg.type);
+  u32 base, idx, scale;
+  i32 off;
+
+  /* Global base: fold into a single rip-relative access when local. */
+  if (addr.base_kind == NATIVE_ADDR_BASE_GLOBAL &&
+      addr.index_kind == NATIVE_ADDR_INDEX_NONE &&
+      !x64_use_got_for_sym(t, addr.base.global.sym)) {
+    ObjSymId sym = addr.base.global.sym;
+    i64 ad = addr.base.global.addend + addr.offset;
+    u32 sec = mc->section_id;
+    u32 disp_pos;
+    if (fp) {
+      u8 prefix = sse_scalar_prefix(sz);
+      mc->emit_bytes(mc, &prefix, 1);
+      emit_rex(mc, 0, r, 0, 0);
+      {
+        u8 op2[2] = {X64_OPC_TWOBYTE, (u8)(is_load ? 0x10u : 0x11u)};
+        mc->emit_bytes(mc, op2, 2);
+      }
+    } else if (sz == 8 || sz == 4) {
+      emit_rex(mc, sz == 8, r, 0, 0);
+      {
+        u8 op = is_load ? X64_OPC_MOV_R_RM : X64_OPC_MOV_RM_R;
+        mc->emit_bytes(mc, &op, 1);
+      }
+    } else if (sz == 2) {
+      if (is_load) {
+        emit_rex(mc, 0, r, 0, 0);
+        {
+          u8 op2[2] = {X64_OPC_TWOBYTE, X64_OPC_MOVZX_W};
+          mc->emit_bytes(mc, op2, 2);
+        }
+      } else {
+        u8 p = X64_OPSIZE_PFX;
+        mc->emit_bytes(mc, &p, 1);
+        emit_rex(mc, 0, r, 0, 0);
+        {
+          u8 op = X64_OPC_MOV_RM_R;
+          mc->emit_bytes(mc, &op, 1);
+        }
+      }
+    } else { /* size 1 */
+      if (is_load) {
+        emit_rex(mc, 0, r, 0, 0);
+        {
+          u8 op2[2] = {X64_OPC_TWOBYTE, X64_OPC_MOVZX_B};
+          mc->emit_bytes(mc, op2, 2);
+        }
+      } else {
+        emit_rex_force(mc, 0, r, 0, 0);
+        {
+          u8 op = X64_OPC_MOV_RM_R8;
+          mc->emit_bytes(mc, &op, 1);
+        }
+      }
+    }
+    {
+      u8 mr = modrm(0u, r & 7u, 5u);
+      mc->emit_bytes(mc, &mr, 1);
+    }
+    disp_pos = mc->pos(mc);
+    emit_u32le(mc, 0);
+    mc->emit_reloc_at(mc, sec, disp_pos, x64_pcrel_reloc_for_sym(t, sym), sym,
+                      ad - 4, 1, 0);
+    return;
+  }
+
+  base = x64_resolve_addr(a, &addr, X64_TMP_INT2, &idx, &scale, &off);
+  if (fp) {
+    u8 prefix = sse_scalar_prefix(sz);
+    if (is_load)
+      emit_sse_load_idx(mc, prefix, 0x10, r, base, idx, scale, off);
+    else
+      emit_sse_store_idx(mc, prefix, 0x11, r, base, idx, scale, off);
+  } else if (is_load) {
+    /* Loads narrower than 4 bytes zero-extend (sign-extension is applied by a
+     * later CV_SEXT). */
+    emit_mov_load_idx(mc, sz, 0, r, base, idx, scale, off);
+  } else {
+    emit_mov_store_idx(mc, sz, r, base, idx, scale, off);
+  }
+}
+
+/* ============================ moves / data ============================ */
+
+static void x64_move(NativeTarget* t, NativeLoc dst, NativeLoc src) {
+  MCEmitter* mc = t->mc;
+  int dfp = loc_is_fp(dst), sfp = loc_is_fp(src);
+  u32 rd = loc_reg(dst), rs = loc_reg(src);
+  if (dfp && sfp) {
+    if (rd == rs) return;
+    emit_sse_rr(mc, sse_scalar_prefix(x64_type_size(t, dst.type)), 0x10, rd, rs);
+    return;
+  }
+  if (dfp && !sfp) { /* movd/movq gpr -> xmm: 66 0F 6E /r */
+    int w = x64_type_size(t, dst.type) == 8u;
+    emit_sse_rr_w(mc, 0x66, 0x6E, w, rd, rs);
+    return;
+  }
+  if (!dfp && sfp) { /* movd/movq xmm -> gpr: 66 0F 7E /r (xmm is reg field) */
+    int w = x64_type_size(t, src.type) == 8u;
+    emit_sse_rr_w(mc, 0x66, 0x7E, w, rs, rd);
+    return;
+  }
+  if (rd == rs) return;
+  emit_mov_rr(mc, x64_is_64(t, dst.type) ? 1 : 0, rd, rs);
+}
+
+static void x64_load_imm(NativeTarget* t, NativeLoc dst, i64 imm) {
+  x64_emit_load_imm(t->mc, x64_is_64(t, dst.type) ? 1 : 0, loc_reg(dst), imm);
+}
+
+/* FP constant: materialize the bit pattern in a GPR scratch, then movd/movq
+ * into the FPR. Integer constant: plain load_imm. */
+static void x64_load_const(NativeTarget* t, NativeLoc dst, ConstBytes cb) {
+  u64 v = 0;
+  u32 i;
+  for (i = 0; i < cb.size && i < 8u; ++i) v |= (u64)cb.bytes[i] << (i * 8u);
+  if (!loc_is_fp(dst)) {
+    x64_load_imm(t, dst, (i64)v);
+    return;
+  }
+  x64_emit_load_imm(t->mc, cb.size == 8u, X64_TMP_INT, (i64)v);
+  emit_sse_rr_w(t->mc, 0x66, 0x6E, cb.size == 8u, loc_reg(dst), X64_TMP_INT);
+}
+
+static void x64_load_addr(NativeTarget* t, NativeLoc dst, NativeAddr addr) {
+  X64NativeTarget* a = x64_of(t);
+  MCEmitter* mc = t->mc;
+  u32 rd = loc_reg(dst);
+  u32 base, idx, scale;
+  i32 off;
+  if (addr.base_kind == NATIVE_ADDR_BASE_GLOBAL &&
+      addr.index_kind == NATIVE_ADDR_INDEX_NONE) {
+    x64_emit_global_lea(t, rd, addr.base.global.sym,
+                        addr.base.global.addend + addr.offset);
+    return;
+  }
+  base = x64_resolve_addr(a, &addr, rd, &idx, &scale, &off);
+  if (idx == REG_NONE) {
+    if (base == rd && off == 0) return; /* already &slot in rd */
+    emit_lea(mc, rd, base, off);
+    return;
+  }
+  /* lea rd, [base + idx*scale + off] */
+  {
+    u8 buf[16];
+    u32 n = 0;
+    n += x64_pack_rex(buf + n, 1, rd, idx, base);
+    buf[n++] = X64_OPC_LEA;
+    n += x64_pack_mem_sib(buf + n, rd, base, idx, scale, off);
+    mc->emit_bytes(mc, buf, n);
+  }
+}
+
+static void x64_load(NativeTarget* t, NativeLoc dst, NativeAddr addr,
+                     MemAccess mem) {
+  x64_emit_mem(x64_of(t), 1, dst, addr, mem);
+}
+static void x64_store(NativeTarget* t, NativeAddr addr, NativeLoc src,
+                      MemAccess mem) {
+  x64_emit_mem(x64_of(t), 0, src, addr, mem);
+}
+
+/* Resolve an addressable NativeAddr to a bare base register (no index, off 0)
+ * by emitting an lea into `scratch` when needed. */
+static u32 x64_addr_to_base_reg(X64NativeTarget* a, NativeAddr addr,
+                                u32 scratch) {
+  MCEmitter* mc = a->base.mc;
+  u32 base, idx, scale;
+  i32 off;
+  if (addr.base_kind == NATIVE_ADDR_BASE_GLOBAL &&
+      addr.index_kind == NATIVE_ADDR_INDEX_NONE) {
+    x64_emit_global_lea(&a->base, scratch, addr.base.global.sym,
+                        addr.base.global.addend + addr.offset);
+    return scratch;
+  }
+  base = x64_resolve_addr(a, &addr, scratch, &idx, &scale, &off);
+  if (idx == REG_NONE && off == 0) return base;
+  if (idx == REG_NONE) {
+    emit_lea(mc, scratch, base, off);
+    return scratch;
+  }
+  {
+    u8 buf[16];
+    u32 n = 0;
+    n += x64_pack_rex(buf + n, 1, scratch, idx, base);
+    buf[n++] = X64_OPC_LEA;
+    n += x64_pack_mem_sib(buf + n, scratch, base, idx, scale, off);
+    mc->emit_bytes(mc, buf, n);
+  }
+  return scratch;
+}
+
+/* copy_bytes: resolve dst into r11 and src into rax (both bare pointers), then
+ * unrolled granule copy through rdx. dst is resolved first (its base may live in
+ * r11 from a FRAME_VALUE load) and src second so the two never alias. */
+static void x64_copy_bytes(NativeTarget* t, NativeAddr dst, NativeAddr src,
+                           AggregateAccess access) {
+  X64NativeTarget* a = x64_of(t);
+  MCEmitter* mc = t->mc;
+  u32 dr = x64_addr_to_base_reg(a, dst, X64_TMP_INT2);
+  u32 sr = x64_addr_to_base_reg(a, src, X64_TMP_INT);
+  u32 n = access.size, i = 0;
+  while (i + 8u <= n) {
+    emit_mov_load(mc, 8, 0, X64_RDX, sr, (i32)i);
+    emit_mov_store(mc, 8, X64_RDX, dr, (i32)i);
+    i += 8u;
+  }
+  while (i + 4u <= n) {
+    emit_mov_load(mc, 4, 0, X64_RDX, sr, (i32)i);
+    emit_mov_store(mc, 4, X64_RDX, dr, (i32)i);
+    i += 4u;
+  }
+  while (i + 2u <= n) {
+    emit_mov_load(mc, 2, 0, X64_RDX, sr, (i32)i);
+    emit_mov_store(mc, 2, X64_RDX, dr, (i32)i);
+    i += 2u;
+  }
+  while (i < n) {
+    emit_mov_load(mc, 1, 0, X64_RDX, sr, (i32)i);
+    emit_mov_store(mc, 1, X64_RDX, dr, (i32)i);
+    i += 1u;
+  }
+}
+
+static void x64_set_bytes(NativeTarget* t, NativeAddr dst, NativeLoc byte_value,
+                          AggregateAccess access) {
+  X64NativeTarget* a = x64_of(t);
+  MCEmitter* mc = t->mc;
+  u32 dr = x64_addr_to_base_reg(a, dst, X64_TMP_INT2);
+  u32 n = access.size, i = 0;
+  /* Broadcast the byte across 8 bytes into rax. */
+  if (byte_value.kind == NATIVE_LOC_IMM) {
+    u8 b = (u8)(byte_value.v.imm & 0xffu);
+    u64 b64 = b;
+    b64 |= b64 << 8;
+    b64 |= b64 << 16;
+    b64 |= b64 << 32;
+    x64_emit_load_imm(mc, 1, X64_RAX, (i64)b64);
+  } else {
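+    /* Replicate the low byte of a register via multiply by 0x0101..01.
+     * Zero-extend first so stray upper bits cannot leak into the pattern,
+     * and keep the multiplier in rdx: r11 may hold the resolved dst pointer. */
+    emit_movzx_r32_r8(mc, X64_RAX, loc_reg(byte_value));
+    x64_emit_load_imm(mc, 1, X64_RDX, (i64)0x0101010101010101ll);
+    emit_imul_rr(mc, 1, X64_RAX, X64_RDX);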
+  }
+  while (i + 8u <= n) {
+    emit_mov_store(mc, 8, X64_RAX, dr, (i32)i);
+    i += 8u;
+  }
+  while (i + 4u <= n) {
+    emit_mov_store(mc, 4, X64_RAX, dr, (i32)i);
+    i += 4u;
+  }
+  while (i + 2u <= n) {
+    emit_mov_store(mc, 2, X64_RAX, dr, (i32)i);
+    i += 2u;
+  }
+  while (i < n) {
+    emit_mov_store(mc, 1, X64_RAX, dr, (i32)i);
+    i += 1u;
+  }
+}
+
+/* ============================ bitfields ============================ */
+
+static void x64_bitfield_load(NativeTarget* t, NativeLoc dst, NativeAddr ra,
+                              BitFieldAccess bf) {
+  X64NativeTarget* a = x64_of(t);
+  MCEmitter* mc = t->mc;
+  u32 storage_bytes = bf.storage.size ? bf.storage.size : 4u;
+  int w = storage_bytes == 8u ? 1 : 0;
+  u32 reg_size = w ? 64u : 32u;
+  u32 lsb = bf.bit_offset;
+  u32 width = bf.bit_width ? bf.bit_width : 1u;
+  u32 rd = loc_reg(dst);
+  u32 base;
+  ra.offset += (i32)bf.storage_offset;
+  base = x64_addr_to_base_reg(a, ra, X64_TMP_INT2);
+  emit_mov_load(mc, storage_bytes, 0, rd, base, 0);
+  {
+    u8 left = (u8)(reg_size - lsb - width);
+    u8 right = (u8)(reg_size - width);
+    if (left) emit_shift_imm(mc, w, X64_SHIFT_SUB_SHL, rd, left);
+    if (right)
+      emit_shift_imm(mc, w, bf.signed_ ? X64_SHIFT_SUB_SAR : X64_SHIFT_SUB_SHR,
+                     rd, right);
+  }
+}
+
+static void x64_bitfield_store(NativeTarget* t, NativeAddr ra, NativeLoc src,
+                               BitFieldAccess bf) {
+  X64NativeTarget* a = x64_of(t);
+  MCEmitter* mc = t->mc;
+  u32 storage_bytes = bf.storage.size ? bf.storage.size : 4u;
+  int w = storage_bytes == 8u ? 1 : 0;
+  u32 lsb = bf.bit_offset;
+  u32 width = bf.bit_width ? bf.bit_width : 1u;
+  u64 ones = width >= 64u ? ~(u64)0 : (((u64)1 << width) - 1u);
+  u64 mask = ones << lsb;
+  u32 src_reg = loc_reg(src);
+  u32 base;
+  ra.offset += (i32)bf.storage_offset;
+  /* Stabilize the base into r11 before consuming rax/rcx/rdx scratch. */
+  base = x64_addr_to_base_reg(a, ra, X64_TMP_INT2);
+  /* rax = storage; rax &= ~mask. */
+  emit_mov_load(mc, storage_bytes, 0, X64_RAX, base, 0);
+  x64_emit_load_imm(mc, w, X64_RCX, (i64)~mask);
+  emit_alu_rr(mc, w, X64_OPC_ALU_AND, X64_RAX, X64_RCX);
+  /* rcx = (src & ones) << lsb. */
+  emit_mov_rr(mc, w, X64_RCX, src_reg);
+  x64_emit_load_imm(mc, w, X64_RDX, (i64)ones);
+  emit_alu_rr(mc, w, X64_OPC_ALU_AND, X64_RCX, X64_RDX);
+  if (lsb) emit_shift_imm(mc, w, X64_SHIFT_SUB_SHL, X64_RCX, (u8)lsb);
+  emit_alu_rr(mc, w, X64_OPC_ALU_OR, X64_RAX, X64_RCX);
+  emit_mov_store(mc, storage_bytes, X64_RAX, base, 0);
+}
+
+/* ============================ arithmetic ============================ */
+
+static void x64_binop(NativeTarget* t, BinOp op, NativeLoc dst, NativeLoc aop,
+                      NativeLoc bop) {
+  X64NativeTarget* a = x64_of(t);
+  MCEmitter* mc = t->mc;
+  u32 rd = loc_reg(dst);
+
+  /* FP binops: two-address. dst = aop op bop. */
+  if (op == BO_FADD || op == BO_FSUB || op == BO_FMUL || op == BO_FDIV) {
+    u32 ra = loc_reg(aop), rb = loc_reg(bop);
+    u8 prefix = sse_scalar_prefix(x64_type_size(t, dst.type));
+    u8 opcode;
+    switch (op) {
+      case BO_FADD: opcode = 0x58; break;
+      case BO_FSUB: opcode = 0x5C; break;
+      case BO_FMUL: opcode = 0x59; break;
+      default: opcode = 0x5E; break; /* BO_FDIV */
+    }
+    if (rd == rb && rd != ra) {
+      if (op == BO_FADD || op == BO_FMUL) { /* commutative */
+        emit_sse_rr(mc, prefix, opcode, rd, ra);
+        return;
+      }
+      /* non-commutative dst==rb: stage rb in fp scratch. */
+      emit_sse_rr(mc, prefix, 0x10, X64_TMP_FP2, rb);
+      emit_sse_rr(mc, prefix, 0x10, rd, ra);
+      emit_sse_rr(mc, prefix, opcode, rd, X64_TMP_FP2);
+      return;
+    }
+    if (rd != ra) emit_sse_rr(mc, prefix, 0x10, rd, ra);
+    emit_sse_rr(mc, prefix, opcode, rd, rb);
+    return;
+  }
+
+  {
+    int w = x64_is_64(t, dst.type) ? 1 : 0;
+    int b_imm = bop.kind == NATIVE_LOC_IMM;
+    i64 imm = b_imm ? bop.v.imm : 0;
+    u32 ra = loc_reg(aop);
+
+    /* Division: rax/rdx implicit; divisor must avoid rax/rdx. Stage a
+     * register divisor before rax is overwritten with the dividend. */
+    if (op == BO_SDIV || op == BO_UDIV || op == BO_SREM || op == BO_UREM) {
+      u32 rb = X64_R11;
+      if (!b_imm) {
+        rb = loc_reg(bop);
+        if (rb == X64_RAX || rb == X64_RDX) {
+          emit_mov_rr(mc, w, X64_R11, rb);
+          rb = X64_R11;
+        }
+      }
+      if (ra != X64_RAX) emit_mov_rr(mc, w, X64_RAX, ra);
+      if (b_imm) x64_emit_load_imm(mc, w, X64_R11, imm);
+      if (op == BO_SDIV || op == BO_SREM) {
+        emit_cqo_or_cdq(mc, w);
+        emit_f7_rm(mc, w, X64_F7_SUB_IDIV, rb);
+      } else {
+        emit_xor_self(mc, w, X64_RDX);
+        emit_f7_rm(mc, w, X64_F7_SUB_DIV, rb);
+      }
+      {
+        u32 result = (op == BO_SREM || op == BO_UREM) ? X64_RDX : X64_RAX;
+        if (rd != result) emit_mov_rr(mc, w, rd, result);
+      }
+      return;
+    }
+
+    /* Shifts: count in CL or imm8. */
+    if (op == BO_SHL || op == BO_SHR_U || op == BO_SHR_S) {
+      u32 sub = (op == BO_SHL) ? X64_SHIFT_SUB_SHL
+                : (op == BO_SHR_U) ? X64_SHIFT_SUB_SHR
+                                   : X64_SHIFT_SUB_SAR;
+      if (b_imm) {
+        u32 wbits = w ? 64u : 32u;
+        if (rd != ra) emit_mov_rr(mc, w, rd, ra);
+        emit_shift_imm(mc, w, sub, rd, (u8)((u64)imm & (wbits - 1u)));
+        return;
+      }
+      {
+        u32 rb = loc_reg(bop);
+        if (rb != X64_RCX) emit_mov_rr(mc, 0, X64_RCX, rb);
+      }
+      if (rd != ra) emit_mov_rr(mc, w, rd, ra);
+      emit_shift_cl(mc, w, sub, rd);
+      return;
+    }
+
+    /* IMM-form fast paths (b_imm guaranteed legal by imm_legal: imm32). */
+    if (b_imm &&
+        (op == BO_IADD || op == BO_ISUB || op == BO_AND || op == BO_OR ||
+         op == BO_XOR || op == BO_IMUL)) {
+      if (op == BO_IMUL) {
+        if (imm_fits_i8(imm)) {
+          emit_imul_imm8(mc, w, rd, ra, (i8)imm);
+          return;
+        }
+        emit_imul_imm32(mc, w, rd, ra, (i32)imm);
+        return;
+      }
+      {
+        u32 sub;
+        switch (op) {
+          case BO_IADD: sub = X64_ALU_SUB_ADD; break;
+          case BO_OR: sub = X64_ALU_SUB_OR; break;
+          case BO_AND: sub = X64_ALU_SUB_AND; break;
+          case BO_ISUB: sub = X64_ALU_SUB_SUB; break;
+          default: sub = X64_ALU_SUB_XOR; break; /* BO_XOR */
+        }
+        if (rd != ra) emit_mov_rr(mc, w, rd, ra);
+        if (imm_fits_i8(imm))
+          emit_alu_imm8(mc, w, sub, rd, (i8)imm);
+        else
+          emit_alu_imm32(mc, w, sub, rd, (i32)imm);
+        return;
+      }
+    }
+
+    /* Generic 2-operand ALU: dst = ra op rb. Preserve rb if dst == rb. */
+    {
+      u32 rb = loc_reg(bop);
+      if (rd == rb && rd != ra) {
+        switch (op) {
+          case BO_IADD: emit_alu_rr(mc, w, X64_OPC_ALU_ADD, rd, ra); return;
+          case BO_AND: emit_alu_rr(mc, w, X64_OPC_ALU_AND, rd, ra); return;
+          case BO_OR: emit_alu_rr(mc, w, X64_OPC_ALU_OR, rd, ra); return;
+          case BO_XOR: emit_alu_rr(mc, w, X64_OPC_ALU_XOR, rd, ra); return;
+          case BO_IMUL: emit_imul_rr(mc, w, rd, ra); return;
+          default: break; /* ISUB falls through: stage rb */
+        }
+        emit_mov_rr(mc, w, X64_R11, rb);
+        rb = X64_R11;
+      }
+      if (rd != ra) emit_mov_rr(mc, w, rd, ra);
+      switch (op) {
+        case BO_IADD: emit_alu_rr(mc, w, X64_OPC_ALU_ADD, rd, rb); break;
+        case BO_ISUB: emit_alu_rr(mc, w, X64_OPC_ALU_SUB, rd, rb); break;
+        case BO_AND: emit_alu_rr(mc, w, X64_OPC_ALU_AND, rd, rb); break;
+        case BO_OR: emit_alu_rr(mc, w, X64_OPC_ALU_OR, rd, rb); break;
+        case BO_XOR: emit_alu_rr(mc, w, X64_OPC_ALU_XOR, rd, rb); break;
+        case BO_IMUL: emit_imul_rr(mc, w, rd, rb); break;
+        default: x64_panic(a, "unsupported binop");
+      }
+    }
+  }
+}
+
+/* Unary ops. FNEG flips the sign bit with a mask built in fp scratch. */
+static void x64_unop(NativeTarget* t, UnOp op, NativeLoc dst, NativeLoc src) {
+  X64NativeTarget* a = x64_of(t);
+  MCEmitter* mc = t->mc;
+  u32 rd = loc_reg(dst), rs = loc_reg(src);
+  if (op == UO_FNEG) {
+    int dbl = x64_type_size(t, dst.type) == 8u;
+    if (rd != rs) emit_sse_rr(mc, sse_scalar_prefix(dbl ? 8u : 4u), 0x10, rd, rs);
+    /* sign mask into fp scratch via gpr, then XORPS/XORPD. */
+    x64_emit_load_imm(mc, dbl, X64_TMP_INT, dbl ? (i64)0x8000000000000000ull
+                                                : (i64)0x80000000ull);
+    emit_sse_rr_w(mc, 0x66, 0x6E, dbl, X64_TMP_FP2, X64_TMP_INT);
+    emit_sse_rr(mc, dbl ? 0x66 : 0, 0x57, rd, X64_TMP_FP2);
+    return;
+  }
+  {
+    int w = x64_is_64(t, dst.type) ? 1 : 0;
+    switch (op) {
+      case UO_NEG:
+        if (rd != rs) emit_mov_rr(mc, w, rd, rs);
+        emit_f7_rm(mc, w, X64_F7_SUB_NEG, rd);
+        return;
+      case UO_BNOT:
+        if (rd != rs) emit_mov_rr(mc, w, rd, rs);
+        emit_f7_rm(mc, w, X64_F7_SUB_NOT, rd);
+        return;
+      case UO_NOT:
+        /* !x -> (x == 0) as 0/1. */
+        emit_test_self(mc, w, rs);
+        emit_setcc(mc, X64_CC_E, rd);
+        emit_movzx_r32_r8(mc, rd, rd);
+        return;
+      default:
+        x64_panic(a, "unsupported unop");
+    }
+  }
+}
+
+/* ============================ compares ============================ */
+
+static u32 cmp_to_cc(CmpOp op) {
+  switch (op) {
+    case CMP_EQ: return X64_CC_E;
+    case CMP_NE: return X64_CC_NE;
+    case CMP_LT_U: return X64_CC_B;
+    case CMP_LE_U: return X64_CC_BE;
+    case CMP_GT_U: return X64_CC_A;
+    case CMP_GE_U: return X64_CC_AE;
+    case CMP_LT_S: return X64_CC_L;
+    case CMP_LE_S: return X64_CC_LE;
+    case CMP_GT_S: return X64_CC_G;
+    case CMP_GE_S: return X64_CC_GE;
+    default: return X64_CC_E;
+  }
+}
+
+static int cmp_is_fp(CmpOp op, NativeLoc aop) {
+  return op >= CMP_LT_F || ((op == CMP_EQ || op == CMP_NE) && loc_is_fp(aop));
+}
+
+/* Emit `cmp ra, rb` (or ucomis[sd] for FP), setting flags from ra - rb. */
+static void x64_emit_cmp_flags(NativeTarget* t, NativeLoc aop, NativeLoc bop,
+                               int fp) {
+  MCEmitter* mc = t->mc;
+  if (fp) {
+    u8 prefix = x64_type_size(t, aop.type) == 8u ? 0x66u : 0u;
+    emit_sse_rr(mc, prefix, 0x2E, loc_reg(aop), loc_reg(bop)); /* ucomis */
+    return;
+  }
+  {
+    int w = x64_is_64(t, aop.type) ? 1 : 0;
+    u32 ra = loc_reg(aop);
+    if (bop.kind == NATIVE_LOC_IMM) {
+      i64 imm = bop.v.imm;
+      if (imm_fits_i8(imm))
+        emit_alu_imm8(mc, w, X64_ALU_SUB_CMP, ra, (i8)imm);
+      else
+        emit_alu_imm32(mc, w, X64_ALU_SUB_CMP, ra, (i32)imm);
+      return;
+    }
+    emit_alu_rr(mc, w, X64_OPC_ALU_CMP, ra, loc_reg(bop));
+  }
+}
+
+/* FP ordered setcc: result = (primary cc) && !unordered (NP). */
+static void x64_fp_setcc_ordered(NativeTarget* t, u32 primary, u32 dst) {
+  MCEmitter* mc = t->mc;
+  emit_setcc(mc, primary, dst);
+  emit_movzx_r32_r8(mc, dst, dst);
+  emit_setcc(mc, X64_CC_NP, X64_R11);
+  emit_movzx_r32_r8(mc, X64_R11, X64_R11);
+  emit_alu_rr(mc, 0, X64_OPC_ALU_AND, dst, X64_R11);
+}
+
+/* FP NE: result = unordered (P) || NE. */
+static void x64_fp_setcc_ne(NativeTarget* t, u32 dst) {
+  MCEmitter* mc = t->mc;
+  emit_setcc(mc, X64_CC_P, dst);
+  emit_movzx_r32_r8(mc, dst, dst);
+  emit_setcc(mc, X64_CC_NE, X64_R11);
+  emit_movzx_r32_r8(mc, X64_R11, X64_R11);
+  emit_alu_rr(mc, 0, X64_OPC_ALU_OR, dst, X64_R11);
+}
+
+static void x64_cmp(NativeTarget* t, CmpOp op, NativeLoc dst, NativeLoc aop,
+                   NativeLoc bop) {
+  MCEmitter* mc = t->mc;
+  u32 d = loc_reg(dst);
+  int fp = cmp_is_fp(op, aop);
+  x64_emit_cmp_flags(t, aop, bop, fp);
+  if (fp) {
+    /* ucomis sets ZF/PF/CF; an unordered result sets all three. GT/GE map
+     * to A/AE with the operands unswapped: both need CF=0, so they are
+     * already false on unordered inputs. EQ/LT/LE test ZF or CF, which
+     * unordered also sets, so they additionally require NP (ordered). */
+    switch (op) {
+      case CMP_NE: x64_fp_setcc_ne(t, d); return;
+      case CMP_EQ: x64_fp_setcc_ordered(t, X64_CC_E, d); return;
+      case CMP_LT_F: x64_fp_setcc_ordered(t, X64_CC_B, d); return;
+      case CMP_LE_F: x64_fp_setcc_ordered(t, X64_CC_BE, d); return;
+      case CMP_GT_F: emit_setcc(mc, X64_CC_A, d); break;
+      case CMP_GE_F: emit_setcc(mc, X64_CC_AE, d); break;
+      default: emit_setcc(mc, cmp_to_cc(op), d); break;
+    }
+    emit_movzx_r32_r8(mc, d, d);
+    return;
+  }
+  emit_setcc(mc, cmp_to_cc(op), d);
+  emit_movzx_r32_r8(mc, d, d);
+}
+
+/* ============================ converts ============================ */
+
+static void x64_convert(NativeTarget* t, ConvKind k, NativeLoc dst,
+                        NativeLoc src) {
+  X64NativeTarget* a = x64_of(t);
+  MCEmitter* mc = t->mc;
+  u32 rd = loc_reg(dst), rs = loc_reg(src);
+  switch (k) {
+    case CV_SEXT: {
+      u32 src_sz = x64_type_size(t, src.type);
+      int w = x64_is_64(t, dst.type) ? 1 : 0;
+      emit_extend_rr(mc, w, 1, src_sz, rd, rs);
+      return;
+    }
+    case CV_ZEXT: {
+      u32 src_sz = x64_type_size(t, src.type);
+      int w = x64_is_64(t, dst.type) ? 1 : 0;
+      emit_extend_rr(mc, w, 0, src_sz, rd, rs);
+      return;
+    }
+    case CV_TRUNC:
+      emit_mov_rr(mc, 0, rd, rs); /* low 32 bits; clears high */
+      return;
+    case CV_ITOF_S:
+    case CV_ITOF_U: {
+      int w_src = x64_is_64(t, src.type) ? 1 : 0;
+      u8 prefix = sse_scalar_prefix(x64_type_size(t, dst.type));
+      if (k == CV_ITOF_U && w_src == 1) {
+        MCLabel L_high = mc->label_new(mc);
+        MCLabel L_done = mc->label_new(mc);
+        emit_test_self(mc, 1, rs);
+        emit_jcc_rel32(mc, X64_CC_S, L_high);
+        emit_sse_rr_w(mc, prefix, 0x2A, 1, rd, rs);
+        emit_jmp_rel32(mc, L_done);
+        mc->label_place(mc, L_high);
+        emit_mov_rr(mc, 1, X64_R11, rs);
+        emit_mov_rr(mc, 1, X64_RAX, rs);
+        emit_alu_imm8(mc, 1, X64_ALU_SUB_AND, X64_RAX, 1);
+        emit_shift_imm(mc, 1, X64_SHIFT_SUB_SHR, X64_R11, 1);
+        emit_alu_rr(mc, 1, X64_OPC_ALU_OR, X64_R11, X64_RAX);
+        emit_sse_rr_w(mc, prefix, 0x2A, 1, rd, X64_R11);
+        emit_sse_rr(mc, prefix, 0x58, rd, rd);
+        mc->label_place(mc, L_done);
+        return;
+      }
+      if (k == CV_ITOF_U) {
+        emit_extend_rr(mc, 0, 0, 4, X64_R11, rs); /* zext u32 -> 64 */
+        rs = X64_R11;
+        w_src = 1;
+      }
+      emit_sse_rr_w(mc, prefix, 0x2A, w_src, rd, rs);
+      return;
+    }
+    case CV_FTOI_S:
+    case CV_FTOI_U: {
+      int w_dst = x64_is_64(t, dst.type) ? 1 : 0;
+      u8 prefix = sse_scalar_prefix(x64_type_size(t, src.type));
+      /* Unsigned 64-bit FTOI needs the 2^63 bias dance; otherwise cvtt
+       * (with the destination widened to 64 for u32) is exact. */
+      if (k == CV_FTOI_U && w_dst == 1) {
+        int dbl = x64_type_size(t, src.type) == 8u;
+        MCLabel L_small = mc->label_new(mc);
+        MCLabel L_done = mc->label_new(mc);
+        /* limit = 2^63 in fp scratch. */
+        x64_emit_load_imm(mc, 1, X64_R11,
+                          dbl ? (i64)0x43E0000000000000ull
+                              : (i64)0x5F000000ull);
+        emit_sse_rr_w(mc, 0x66, 0x6E, dbl, X64_TMP_FP2, X64_R11);
+        emit_sse_rr(mc, dbl ? 0x66 : 0, 0x2E, rs, X64_TMP_FP2); /* ucomis */
+        emit_jcc_rel32(mc, X64_CC_B, L_small);
+        emit_sse_rr(mc, prefix, 0x10, X64_TMP_FP, rs);
+        emit_sse_rr(mc, prefix, 0x5C, X64_TMP_FP, X64_TMP_FP2); /* sub bias */
+        emit_sse_rr_w(mc, prefix, 0x2C, 1, rd, X64_TMP_FP);
+        x64_emit_load_imm(mc, 1, X64_R11, (i64)0x8000000000000000ull);
+        emit_alu_rr(mc, 1, X64_OPC_ALU_XOR, rd, X64_R11);
+        emit_jmp_rel32(mc, L_done);
+        mc->label_place(mc, L_small);
+        emit_sse_rr_w(mc, prefix, 0x2C, 1, rd, rs);
+        mc->label_place(mc, L_done);
+        return;
+      }
+      if (k == CV_FTOI_U) w_dst = 1; /* widen u32 result */
+      emit_sse_rr_w(mc, prefix, 0x2C, w_dst, rd, rs);
+      return;
+    }
+    case CV_FEXT:
+      emit_sse_rr(mc, 0xF3, 0x5A, rd, rs); /* cvtss2sd */
+      return;
+    case CV_FTRUNC:
+      emit_sse_rr(mc, 0xF2, 0x5A, rd, rs); /* cvtsd2ss */
+      return;
+    case CV_BITCAST:
+      if (!loc_is_fp(src) && loc_is_fp(dst)) {
+        emit_sse_rr_w(mc, 0x66, 0x6E, x64_is_64(t, dst.type), rd, rs);
+      } else if (loc_is_fp(src) && !loc_is_fp(dst)) {
+        emit_sse_rr_w(mc, 0x66, 0x7E, x64_is_64(t, src.type), rs, rd);
+      } else {
+        x64_move(t, dst, src);
+      }
+      return;
+    default:
+      x64_panic(a, "unsupported convert");
+  }
+}
+
+/* ============================ spill / reload ============================ */
+
+static void x64_spill(NativeTarget* t, NativeLoc src, NativeFrameSlot slot,
+                     MemAccess mem) {
+  NativeAddr addr;
+  memset(&addr, 0, sizeof addr);
+  addr.base_kind = NATIVE_ADDR_BASE_FRAME;
+  addr.base.frame = slot;
+  addr.base_type = src.type;
+  x64_emit_mem(x64_of(t), 0, src, addr, mem);
+}
+static void x64_reload(NativeTarget* t, NativeLoc dst, NativeFrameSlot slot,
+                      MemAccess mem) {
+  NativeAddr addr;
+  memset(&addr, 0, sizeof addr);
+  addr.base_kind = NATIVE_ADDR_BASE_FRAME;
+  addr.base.frame = slot;
+  addr.base_type = dst.type;
+  x64_emit_mem(x64_of(t), 1, dst, addr, mem);
+}
+
+/* ============================ control flow ============================ */
+
+static void emit_jmp_rel32(MCEmitter* mc, MCLabel l) {
+  u8 op = X64_OPC_JMP_REL32;
+  mc->emit_bytes(mc, &op, 1);
+  emit_u32le(mc, 0);
+  mc->emit_label_ref(mc, l, R_PC32, 4, -4);
+}
+static void emit_jcc_rel32(MCEmitter* mc, u32 cc, MCLabel l) {
+  u8 op[2] = {X64_OPC_TWOBYTE, (u8)(X64_OPC_JCC_BASE | (cc & 0xfu))};
+  mc->emit_bytes(mc, op, 2);
+  emit_u32le(mc, 0);
+  mc->emit_label_ref(mc, l, R_PC32, 4, -4);
+}
+
+static MCLabel x64_label_new(NativeTarget* t) { return t->mc->label_new(t->mc); }
+static void x64_label_place(NativeTarget* t, MCLabel l) {
+  t->mc->label_place(t->mc, l);
+}
+static void x64_jump(NativeTarget* t, MCLabel l) { emit_jmp_rel32(t->mc, l); }
+
+static void x64_cmp_branch(NativeTarget* t, CmpOp op, NativeLoc aop,
+                          NativeLoc bop, MCLabel l) {
+  MCEmitter* mc = t->mc;
+  int fp = cmp_is_fp(op, aop);
+  if (fp) {
+    /* Materialize the 0/1 result, then branch on nonzero. */
+    NativeLoc tmp =
+        x64_reg_loc(builtin_id(CFREE_CG_BUILTIN_I32), NATIVE_REG_INT, X64_RAX);
+    x64_cmp(t, op, tmp, aop, bop);
+    emit_test_self(mc, 0, X64_RAX);
+    emit_jcc_rel32(mc, X64_CC_NE, l);
+    return;
+  }
+  x64_emit_cmp_flags(t, aop, bop, 0);
+  emit_jcc_rel32(mc, cmp_to_cc(op), l);
+}
+
+static void x64_indirect_branch(NativeTarget* t, NativeLoc addr,
+                               const MCLabel* valid_targets, u32 ntargets) {
+  MCEmitter* mc = t->mc;
+  u32 r = loc_reg(addr);
+  (void)valid_targets;
+  (void)ntargets;
+  if (r & 8u) {
+    u8 rex = X64_REX_BASE | X64_REX_B;
+    mc->emit_bytes(mc, &rex, 1);
+  }
+  {
+    u8 buf[2] = {X64_OP_JMP_RM64, modrm(3u, 4u, r & 7u)};
+    mc->emit_bytes(mc, buf, 2);
+  }
+}
+
+static void x64_load_label_addr(NativeTarget* t, NativeLoc dst, MCLabel l) {
+  MCEmitter* mc = t->mc;
+  u32 rd = loc_reg(dst);
+  emit_rex(mc, 1, rd, 0, 0);
+  {
+    u8 op = X64_OPC_LEA;
+    mc->emit_bytes(mc, &op, 1);
+  }
+  {
+    u8 mr = modrm(0u, rd & 7u, 5u); /* [rip + disp32] */
+    mc->emit_bytes(mc, &mr, 1);
+  }
+  emit_u32le(mc, 0);
+  mc->emit_label_ref(mc, l, R_PC32, 4, -4);
+}
+
+/* ============================ frame / lifecycle ============================ */
+
+static NativeFrameSlot x64_frame_slot(NativeTarget* t,
+                                      const NativeFrameSlotDesc* d) {
+  X64NativeTarget* a = x64_of(t);
+  X64NativeSlot* s;
+  u32 size = d->size ? d->size : 8u;
+  u32 align = d->align ? d->align : 1u;
+  if (a->frame_final) x64_panic(a, "frame slot requested after prologue");
+  if (a->nslots == a->slots_cap) {
+    u32 cap = a->slots_cap ? a->slots_cap * 2u : 16u;
+    X64NativeSlot* nb = arena_zarray(t->c->tu, X64NativeSlot, cap);
+    if (a->slots) memcpy(nb, a->slots, sizeof(*nb) * a->nslots);
+    a->slots = nb;
+    a->slots_cap = cap;
+  }
+  a->cum_off = align_up_u32(a->cum_off + size, align);
+  s = &a->slots[a->nslots++];
+  s->off = a->cum_off;
+  s->size = size;
+  s->align = align;
+  s->kind = d->kind;
+  return (NativeFrameSlot)a->nslots;
+}
+
+/* xmm save area base (rbp-relative). XMM saves are 16-aligned. */
+static u32 x64_xmm_base(const X64NativeTarget* a, u32 cs_fp) {
+  if (cs_fp == 0) return a->cum_off;
+  return align_up_u32(a->cum_off, 16u);
+}
+
+static u32 x64_compute_frame_size(const X64NativeTarget* a, u32 cs_int,
+                                  u32 cs_fp) {
+  u32 xmm_base = x64_xmm_base(a, cs_fp);
+  u32 raw = a->max_outgoing + cs_int * 8u + cs_fp * 16u + xmm_base;
+  u32 fs = align_up_u32(raw, 16u);
+  return fs ? fs : 16u;
+}
+
+/* Collect the callee-saves the body actually used. */
+static u32 x64_collect_int_saves(X64NativeTarget* a, Reg* regs) {
+  u32 n = 0, i;
+  for (i = 0; i < a->ncallee_saves; ++i)
+    if (a->callee_saves[i].cls == NATIVE_REG_INT)
+      regs[n++] = a->callee_saves[i].reg;
+  return n;
+}
+static u32 x64_collect_fp_saves(X64NativeTarget* a, Reg* regs) {
+  u32 n = 0, i;
+  for (i = 0; i < a->ncallee_saves; ++i)
+    if (a->callee_saves[i].cls == NATIVE_REG_FP)
+      regs[n++] = a->callee_saves[i].reg;
+  return n;
+}
+
+static ObjSymId x64_chkstk_sym(NativeTarget* t) {
+  Sym name = pool_intern_slice(t->c->global, SLICE_LIT("__chkstk"));
+  ObjSymId s = obj_symbol_find(t->obj, name);
+  if (s != 0) return s;
+  return obj_symbol(t->obj, name, SB_GLOBAL, SK_UNDEF, OBJ_SEC_NONE, 0, 0);
+}
+
+/* Build the prologue byte sequence into buf and return the number of bytes
+ * written. *chkstk_disp_pos_out receives the buf offset of the call's disp32
+ * when the Win64 chkstk path fires, else (u32)-1. */
+static u32 x64_build_prologue(X64NativeTarget* a, u8* buf, u32 cap,
+                              u32 frame_size, const Reg* cs_int, u32 n_int,
+                              const Reg* cs_fp, u32 n_fp,
+                              u32* chkstk_disp_pos_out) {
+  u32 wi = 0;
+  u32 xmm_base = x64_xmm_base(a, n_fp);
+  u32 i;
+  *chkstk_disp_pos_out = (u32)-1;
+  if (cap < X64_PROLOGUE_BASE_BYTES) x64_panic(a, "prologue placeholder overflow");
+  /* push rbp; mov rbp, rsp. */
+  buf[wi++] = (u8)(X64_OPC_PUSH_R | (X64_RBP & 7u));
+  buf[wi++] = X64_REX_BASE | X64_REX_W;
+  buf[wi++] = X64_OPC_MOV_RM_R;
+  buf[wi++] = modrm(3u, X64_RSP, X64_RBP);
+  /* sub rsp, frame_size (or chkstk on Win64 large frame). */
+  if (a->abi->shadow_space && frame_size > X64_WIN64_CHKSTK_THRESHOLD) {
+    if (wi + 13u > cap) x64_panic(a, "prologue placeholder overflow");
+    buf[wi++] = (u8)(X64_OPC_MOV_RI | (X64_RAX & 7u)); /* mov eax, imm32 */
+    wr_u32_le(buf + wi, frame_size);
+    wi += 4;
+    buf[wi++] = X64_OPC_CALL_REL32;
+    *chkstk_disp_pos_out = wi;
+    wr_u32_le(buf + wi, 0);
+    wi += 4;
+    buf[wi++] = X64_REX_BASE | X64_REX_W; /* sub rsp, rax */
+    buf[wi++] = X64_OPC_ALU_SUB;
+    buf[wi++] = modrm(3u, X64_RAX, X64_RSP);
+  } else {
+    if (wi + 7u > cap) x64_panic(a, "prologue placeholder overflow");
+    buf[wi++] = X64_REX_BASE | X64_REX_W;
+    buf[wi++] = X64_OPC_ALU_IMM32;
+    buf[wi++] = modrm(3u, X64_ALU_SUB_SUB, X64_RSP);
+    wr_u32_le(buf + wi, frame_size);
+    wi += 4;
+  }
+  /* sret: spill the first int arg reg (destination pointer) into its slot. */
+  if (a->has_sret && a->sret_ptr_slot != NATIVE_FRAME_SLOT_NONE) {
+    X64NativeSlot* s = x64_slot_get(a, a->sret_ptr_slot);
+    u32 sret_reg = a->abi->int_args[0];
+    i32 off = -(i32)s->off;
+    if (wi + 7u > cap) x64_panic(a, "prologue placeholder overflow");
+    buf[wi++] = (u8)(X64_REX_BASE | X64_REX_W |
+                     ((sret_reg & 8u) ? X64_REX_R : 0u));
+    buf[wi++] = X64_OPC_MOV_RM_R;
+    buf[wi++] = modrm(2u, sret_reg & 7u, X64_RBP);
+    wr_u32_le(buf + wi, (u32)off);
+    wi += 4;
+  }
+  /* Spill callee-saved GPRs. */
+  for (i = 0; i < n_int; ++i) {
+    u32 reg = cs_int[i];
+    i32 off = -(i32)xmm_base - (i32)n_fp * 16 - (i32)(i + 1u) * 8;
+    if (wi + 7u > cap) x64_panic(a, "prologue placeholder overflow");
+    buf[wi++] = (u8)(X64_REX_BASE | X64_REX_W | ((reg & 8u) ? X64_REX_R : 0u));
+    buf[wi++] = X64_OPC_MOV_RM_R;
+    buf[wi++] = modrm(2u, reg & 7u, X64_RBP);
+    wr_u32_le(buf + wi, (u32)off);
+    wi += 4;
+  }
+  /* Spill callee-saved XMMs (Win64). movaps [rbp+disp32], xmm. */
+  for (i = 0; i < n_fp; ++i) {
+    u32 xmm = cs_fp[i];
+    i32 off = -(i32)xmm_base - (i32)(i + 1u) * 16;
+    u8 rex = (u8)((xmm & 8u) ? (X64_REX_BASE | X64_REX_R) : 0u);
+    u32 need = rex ? 8u : 7u;
+    if (wi + need > cap) x64_panic(a, "prologue placeholder overflow");
+    if (rex) buf[wi++] = rex;
+    buf[wi++] = X64_OPC_TWOBYTE;
+    buf[wi++] = 0x29; /* MOVAPS r/m128, xmm */
+    buf[wi++] = modrm(2u, xmm & 7u, X64_RBP);
+    wr_u32_le(buf + wi, (u32)off);
+    wi += 4;
+  }
+  return wi;
+}
+
+static void x64_func_begin_common(NativeTarget* t, const CGFuncDesc* fd) {
+  X64NativeTarget* a = x64_of(t);
+  MCEmitter* mc = t->mc;
+  const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, fd->fn_type);
+  a->func = fd;
+  a->loc = fd->loc;
+  a->abi = x64_abi_for_os(t->c->target.os);
+  a->nslots = 0;
+  a->cum_off = 0;
+  a->max_outgoing = 0;
+  a->incoming_stack_size = 0;
+  a->next_param_int = 0;
+  a->next_param_fp = 0;
+  a->next_param_stack = 0;
+  a->has_sret = (abi && abi->has_sret) ? 1u : 0u;
+  a->is_variadic = (abi && abi->variadic) ? 1u : 0u;
+  a->sret_ptr_slot = NATIVE_FRAME_SLOT_NONE;
+  a->reg_save_slot = NATIVE_FRAME_SLOT_NONE;
+  a->npatches = 0;
+  a->nalloca = 0;
+  a->ncallee_saves = 0;
+  a->known_frame = 0;
+  a->has_alloca = 0;
+  a->frame_final = 0;
+  a->prologue_nbytes =
+      a->abi->shadow_space ? X64_PROLOGUE_BYTES_WIN64 : X64_PROLOGUE_BYTES;
+
+  mc->set_section(mc, fd->text_section_id);
+  mc->emit_align(mc, 16, X64_NOP1);
+  a->func_start = mc->pos(mc);
+  mc_begin_function(mc, fd->sym, fd->text_section_id, a->func_start);
+  if (mc->cfi_startproc) mc->cfi_startproc(mc);
+  a->epilogue_label = mc->label_new(mc);
+}
+
+/* Reserve the sret-pointer slot and (SysV) the 176-byte variadic reg-save
+ * area (6 GPRs * 8 + 8 XMMs * 16). Advances next_param_int past the sret
+ * pointer, which arrives in the first integer argument register. */
+static void x64_reserve_entry_saves(X64NativeTarget* a) {
+  NativeTarget* t = &a->base;
+  if (a->has_sret) {
+    NativeFrameSlotDesc sd;
+    memset(&sd, 0, sizeof sd);
+    sd.type = builtin_id(CFREE_CG_BUILTIN_I64);
+    sd.size = 8;
+    sd.align = 8;
+    sd.kind = NATIVE_FRAME_SLOT_SAVE;
+    a->sret_ptr_slot = t->frame_slot(t, &sd);
+    a->next_param_int = 1;
+  }
+  if (a->is_variadic && a->abi->emit_sysv_vararg_save) {
+    NativeFrameSlotDesc rd;
+    memset(&rd, 0, sizeof rd);
+    rd.type = builtin_id(CFREE_CG_BUILTIN_I64);
+    rd.size = 176;
+    rd.align = 8;
+    rd.kind = NATIVE_FRAME_SLOT_SAVE;
+    a->reg_save_slot = t->frame_slot(t, &rd);
+  }
+}
+
+static void x64_emit_variadic_reg_saves(X64NativeTarget* a) {
+  NativeTarget* t = &a->base;
+  MCEmitter* mc = t->mc;
+  if (!a->is_variadic) return;
+  if (a->abi->emit_sysv_vararg_save) {
+    X64NativeSlot* rs = x64_slot_get(a, a->reg_save_slot);
+    static const u32 gprs[6] = {X64_RDI, X64_RSI, X64_RDX,
+                                X64_RCX, X64_R8,  X64_R9};
+    u32 i;
+    for (i = 0; i < 6u; ++i)
+      emit_mov_store(mc, 8, gprs[i], X64_RBP, -(i32)rs->off + (i32)(i * 8u));
+    for (i = 0; i < 8u; ++i)
+      emit_sse_store(mc, 0xF2, 0x11, (u32)(X64_XMM0 + i), X64_RBP,
+                     -(i32)rs->off + (i32)(48u + i * 16u));
+    return;
+  }
+  /* Win64 variadic: spill the 4 GPR arg slots to the home space. */
+  emit_mov_store(mc, 8, X64_RCX, X64_RBP, 16);
+  emit_mov_store(mc, 8, X64_RDX, X64_RBP, 24);
+  emit_mov_store(mc, 8, X64_R8, X64_RBP, 32);
+  emit_mov_store(mc, 8, X64_R9, X64_RBP, 40);
+}
+
+static void x64_func_begin(NativeTarget* t, const CGFuncDesc* fd) {
+  X64NativeTarget* a = x64_of(t);
+  MCEmitter* mc = t->mc;
+  u32 i;
+  x64_func_begin_common(t, fd);
+  a->prologue_pos = mc->pos(mc);
+  for (i = 0; i < a->prologue_nbytes; ++i) emit1(mc, X64_NOP1);
+  x64_reserve_entry_saves(a);
+  x64_emit_variadic_reg_saves(a);
+}
+
+static void x64_func_begin_known_frame(NativeTarget* t, const CGFuncDesc* fd,
+                                       const NativeKnownFrameDesc* frame,
+                                       NativeFrameSlot* out_slots) {
+  (void)fd;
+  (void)frame;
+  (void)out_slots;
+  x64_panic(x64_of(t), "known-frame path not implemented yet");
+}
+
+static void x64_func_end(NativeTarget* t) {
+  X64NativeTarget* a = x64_of(t);
+  MCEmitter* mc = t->mc;
+  ObjBuilder* obj = t->obj;
+  ObjSecId sec = a->func->text_section_id;
+  Reg cs_int[X64_MAX_CS_INT_REGS], cs_fp[X64_MAX_CS_FP_REGS];
+  u32 n_int = x64_collect_int_saves(a, cs_int);
+  u32 n_fp = x64_collect_fp_saves(a, cs_fp);
+  u32 frame_size = x64_compute_frame_size(a, n_int, n_fp);
+  u32 xmm_base = x64_xmm_base(a, n_fp);
+  u32 end;
+  i32 i;
+  a->frame_size_final = frame_size;
+
+  /* Epilogue. */
+  mc->label_place(mc, a->epilogue_label);
+  for (i = (i32)n_fp - 1; i >= 0; --i) {
+    i32 off = -(i32)xmm_base - (i32)(i + 1) * 16;
+    emit_sse_load(mc, 0, 0x28, cs_fp[i], X64_RBP, off); /* movaps */
+  }
+  for (i = (i32)n_int - 1; i >= 0; --i) {
+    i32 off = -(i32)xmm_base - (i32)n_fp * 16 - (i32)(i + 1) * 8;
+    emit_mov_load(mc, 8, 0, cs_int[i], X64_RBP, off);
+  }
+  emit_leave(mc);
+  emit_ret(mc);
+
+  /* Patch the single-pass prologue placeholder. */
+  if (!a->known_frame) {
+    u8 buf[X64_PROLOGUE_BYTES_WIN64];
+    u32 chkstk_disp_pos;
+    u32 nbytes;
+    u32 k;
+    for (k = 0; k < a->prologue_nbytes; ++k) buf[k] = X64_NOP1;
+    nbytes = x64_build_prologue(a, buf, a->prologue_nbytes, frame_size, cs_int,
+                                n_int, cs_fp, n_fp, &chkstk_disp_pos);
+    (void)nbytes;
+    obj_patch(obj, sec, a->prologue_pos, buf, a->prologue_nbytes);
+    if (chkstk_disp_pos != (u32)-1) {
+      ObjSymId chk = x64_chkstk_sym(t);
+      mc->emit_reloc_at(mc, sec, a->prologue_pos + chkstk_disp_pos, R_X64_PLT32,
+                        chk, -4, 1, 0);
+    }
+  }
+
+  /* Patch each alloca's `lea dst, [rsp + disp32]`: the block must start above
+   * the 16-aligned outgoing-argument area, whose final size is known here. */
+  {
+    u32 mo = align_up_u32(a->max_outgoing, 16u);
+    u32 k;
+    for (k = 0; k < a->npatches; ++k) {
+      u8 dbuf[4];
+      wr_u32_le(dbuf, mo);
+      obj_patch(obj, sec, a->patches[k].pos, dbuf, 4);
+    }
+  }
+
+  /* CFI: after the prologue, CFA = rbp + 16; rbp at cfa-16, ra at cfa-8. */
+  if (mc->cfi_set_next_pc_offset && mc->cfi_def_cfa && mc->cfi_offset) {
+    u32 post =
+        a->prologue_pos + (a->known_frame ? 0u : a->prologue_nbytes);
+    u32 k;
+    mc->cfi_set_next_pc_offset(mc, post - a->func_start);
+    mc->cfi_def_cfa(mc, X64_RBP, 16);
+    mc->cfi_offset(mc, X64_RBP, -16);
+    mc->cfi_offset(mc, 16u /* rip */, -8);
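+    for (k = 0; k < n_int; ++k) {
+      /* Save slots are rbp-relative (see the epilogue loads above), while
+       * cfi_offset is CFA-relative and CFA = rbp + 16: rebase by -16. */
+      i32 off = -(i32)xmm_base - (i32)n_fp * 16 - (i32)(k + 1u) * 8 - 16;
+      mc->cfi_offset(mc, cs_int[k], off);
+    }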
+  }
+
+  end = mc->pos(mc);
+  obj_symbol_define(obj, a->func->sym, sec, (u64)a->func_start,
+                    (u64)(end - a->func_start));
+  if (a->func->atomize)
+    obj_atom_define(obj, sec, a->func_start, end - a->func_start, a->func->sym,
+                    0);
+  if (mc->debug) debug_func_pc_range(mc->debug, sec, a->func_start, end);
+  if (mc->cfi_endproc) mc->cfi_endproc(mc);
+  mc_end_function(mc);
+  a->func = NULL;
+}
+
+/* ============================ params / ABI helpers ============================
+ */
+
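+/* Win64 assigns arg slots positionally: slot N uses up both the Nth int
+ * register and XMM N. Advance both cursors to the larger of the two. This is
+ * a no-op for ABIs that do not share slots. */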
+static void x64_sync_slot(const X64ABIRegs* abi, u32* next_int, u32* next_fp) {
+  u32 m;
+  if (!abi->slot_shared_int_fp) return;
+  m = *next_int > *next_fp ? *next_int : *next_fp;
+  *next_int = m;
+  *next_fp = m;
+}
+
+static const ABIArgInfo* x64_param_abi(NativeTarget* t, const ABIFuncInfo* abi,
+                                       const NativeCallDesc* desc, u32 i,
+                                       ABIArgInfo* scratch) {
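+  /* Prototyped parameter: use the precomputed ABI classification. */
+  if (abi && i < abi->nparams) return &abi->params[i];
+  /* Variadic tail or unprototyped call: synthesize a one-part DIRECT
+   * classification from the argument's own type. */
+  memset(scratch, 0, sizeof *scratch);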
+  scratch->kind = ABI_ARG_DIRECT;
+  scratch->nparts = 1;
+  scratch->parts = arena_zarray(t->c->tu, ABIArgPart, 1);
+  ((ABIArgPart*)scratch->parts)[0].cls =
+      cg_type_is_float(t->c, desc->args[i].type) ? ABI_CLASS_FP : ABI_CLASS_INT;
+  ((ABIArgPart*)scratch->parts)[0].loc = ABI_LOC_REG;
+  ((ABIArgPart*)scratch->parts)[0].size = x64_type_size(t, desc->args[i].type);
+  ((ABIArgPart*)scratch->parts)[0].align = x64_type_align(t, desc->args[i].type);
+  return scratch;
+}
+
+static CfreeCgTypeId x64_part_scalar_type(const ABIArgPart* part) {
+  if (part->cls == ABI_CLASS_FP)
+    return part->size <= 4u ? builtin_id(CFREE_CG_BUILTIN_F32)
+                            : builtin_id(CFREE_CG_BUILTIN_F64);
+  switch (part->size) {
+    case 1u: return builtin_id(CFREE_CG_BUILTIN_I8);
+    case 2u: return builtin_id(CFREE_CG_BUILTIN_I16);
+    case 4u: return builtin_id(CFREE_CG_BUILTIN_I32);
+    default: return builtin_id(CFREE_CG_BUILTIN_I64);
+  }
+}
+
+/* Is the whole DIRECT arg forced to the stack (not enough reg slots)? */
+static int x64_direct_to_stack(const X64ABIRegs* abi, const ABIArgInfo* ai,
+                               u32 next_int, u32 next_fp) {
+  u32 need_int, need_fp;
+  x64_abi_direct_reg_need(ai, &need_int, &need_fp);
+  return next_int + need_int > abi->n_int_args ||
+         next_fp + need_fp > abi->n_fp_args;
+}
+
+/* Outgoing stack bytes a call needs, 16-aligned: the ABI's shadow space
+ * (Win64 home area) plus one 8-byte slot per stack-passed part. */
+static u32 x64_call_stack_size(NativeTarget* t, const NativeCallDesc* desc) {
+  const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, desc->fn_type);
+  const X64ABIRegs* aregs = x64_abi_for_os(t->c->target.os);
+  u32 next_int = (abi && abi->has_sret) ? 1u : 0u;
+  u32 next_fp = 0;
+  u32 stack = aregs->shadow_space;
+  u32 i;
+  x64_sync_slot(aregs, &next_int, &next_fp);
+  for (i = 0; i < desc->nargs; ++i) {
+    ABIArgInfo tmp;
+    const ABIArgInfo* ai = x64_param_abi(t, abi, desc, i, &tmp);
+    u16 p;
+    if (ai->kind == ABI_ARG_IGNORE) continue;
+    if (ai->kind == ABI_ARG_INDIRECT) {
+      if (next_int < aregs->n_int_args)
+        ++next_int;
+      else
+        stack += 8u;
+      x64_sync_slot(aregs, &next_int, &next_fp);
+      continue;
+    }
+    if (ai->kind == ABI_ARG_DIRECT &&
+        x64_direct_to_stack(aregs, ai, next_int, next_fp)) {
+      stack += (u32)ai->nparts * 8u;
+      continue;
+    }
+    for (p = 0; p < ai->nparts; ++p) {
+      const ABIArgPart* part = &ai->parts[p];
+      if (part->cls == ABI_CLASS_FP) {
+        if (next_fp < aregs->n_fp_args)
+          ++next_fp;
+        else
+          stack += 8u;
+      } else {
+        if (next_int < aregs->n_int_args)
+          ++next_int;
+        else
+          stack += 8u;
+      }
+      x64_sync_slot(aregs, &next_int, &next_fp);
+    }
+  }
+  return align_up_u32(stack, 16u);
+}
+
+static u32 x64_call_stack_bytes(NativeTarget* t, const NativeCallDesc* desc) {
+  return x64_call_stack_size(t, desc);
+}
+
+static u32 x64_signature_stack_bytes(NativeTarget* t, CfreeCgTypeId fn_type,
+                                     int* variadic, u32* nparams) {
+  const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, fn_type);
+  NativeCallDesc d;
+  if (variadic) *variadic = abi ? (int)abi->variadic : 0;
+  if (nparams) *nparams = abi ? abi->nparams : 0u;
+  memset(&d, 0, sizeof d);
+  d.fn_type = fn_type;
+  d.nargs = abi ? abi->nparams : 0u;
+  if (d.nargs) d.args = arena_zarray(t->c->tu, NativeLoc, d.nargs);
+  return x64_call_stack_size(t, &d);
+}
+
+/* Resolve a NativeLoc to an addressable NativeAddr (frame/stack/addr). */
+static NativeAddr x64_loc_addr(X64NativeTarget* a, NativeLoc loc, u32 offset) {
+  NativeAddr addr;
+  memset(&addr, 0, sizeof addr);
+  switch ((NativeLocKind)loc.kind) {
+    case NATIVE_LOC_FRAME:
+      addr.base_kind = NATIVE_ADDR_BASE_FRAME;
+      addr.base.frame = loc.v.frame;
+      addr.base_type = loc.type;
+      addr.offset = (i32)offset;
+      return addr;
+    case NATIVE_LOC_STACK:
+      addr.base_kind = NATIVE_ADDR_BASE_FRAME;
+      addr.base.frame = loc.v.stack.slot;
+      addr.base_type = loc.type;
+      addr.offset = loc.v.stack.offset + (i32)offset;
+      return addr;
+    case NATIVE_LOC_ADDR:
+      addr = loc.v.addr;
+      addr.offset += (i32)offset;
+      return addr;
+    default:
+      x64_panic(a, "location is not addressable");
+  }
+  return addr;
+}
+
+static void x64_load_part(NativeTarget* t, NativeLoc dst, NativeLoc src,
+                          u32 offset, u32 size) {
+  X64NativeTarget* a = x64_of(t);
+  if (src.kind == NATIVE_LOC_REG) {
+    x64_move(t, dst, src);
+    return;
+  }
+  if (src.kind == NATIVE_LOC_FRAME || src.kind == NATIVE_LOC_STACK ||
+      src.kind == NATIVE_LOC_ADDR) {
+    NativeAddr addr = x64_loc_addr(a, src, offset);
+    addr.base_type = dst.type;
+    x64_emit_mem(a, 1, dst, addr, x64_mem_for_type(t, dst.type, size));
+    return;
+  }
+  if (src.kind == NATIVE_LOC_IMM) {
+    x64_emit_load_imm(t->mc, x64_is_64(t, dst.type) ? 1 : 0, loc_reg(dst),
+                      src.v.imm);
+    return;
+  }
+  x64_panic(a, "unsupported part source");
+}
+
+static void x64_store_part(NativeTarget* t, NativeLoc dst, NativeLoc src,
+                           u32 offset, u32 size) {
+  X64NativeTarget* a = x64_of(t);
+  if (dst.kind == NATIVE_LOC_FRAME || dst.kind == NATIVE_LOC_STACK ||
+      dst.kind == NATIVE_LOC_ADDR) {
+    NativeAddr addr = x64_loc_addr(a, dst, offset);
+    addr.base_type = src.type;
+    x64_emit_mem(a, 0, src, addr, x64_mem_for_type(t, src.type, size));
+    return;
+  }
+  if (dst.kind == NATIVE_LOC_REG) {
+    x64_move(t, dst, src);
+    return;
+  }
+  x64_panic(a, "unsupported part destination");
+}
+
+static void x64_addr_of_loc(NativeTarget* t, NativeLoc dst, NativeLoc src) {
+  NativeAddr addr = x64_loc_addr(x64_of(t), src, 0);
+  x64_load_addr(t, dst, addr);
+}
+
+static void x64_store_outgoing_part(NativeTarget* t, u32 stack_off,
+                                    NativeLoc src, u32 size) {
+  NativeAddr addr;
+  memset(&addr, 0, sizeof addr);
+  addr.base_kind = NATIVE_ADDR_BASE_REG;
+  addr.base.reg = X64_RSP;
+  addr.base_type = src.type;
+  addr.offset = (i32)stack_off;
+  x64_emit_mem(x64_of(t), 0, src, addr, x64_mem_for_type(t, src.type, size));
+}
+
+/* NativeTarget bind_param: route incoming param (ABI loc) into dst. */
+static void x64_bind_native_param(NativeTarget* t, const CGParamDesc* p,
+                                  NativeLoc dst) {
+  X64NativeTarget* a = x64_of(t);
+  const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, a->func->fn_type);
+  const ABIArgInfo* ai =
+      (abi && p->index < abi->nparams) ? &abi->params[p->index] : NULL;
+  int to_reg = dst.kind == NATIVE_LOC_REG;
+  /* Incoming stack args sit above the saved rbp + return addr (+16); Win64
+   * additionally reserves 32B of home space. */
+  i32 incoming_bias = (i32)(16u + a->abi->shadow_space);
+  u16 i;
+  if (!ai || ai->kind == ABI_ARG_IGNORE) return;
+
+  if (ai->kind == ABI_ARG_INDIRECT) {
+    /* Incoming pointer to a byval copy: load pointer, memcpy into dst frame. */
+    u32 ptr_reg;
+    NativeAddr d_addr, from;
+    AggregateAccess access;
+    if (a->next_param_int < a->abi->n_int_args) {
+      ptr_reg = a->abi->int_args[a->next_param_int++];
+    } else {
+      ptr_reg = X64_R11;
+      emit_mov_load(t->mc, 8, 0, ptr_reg, X64_RBP,
+                    incoming_bias + (i32)a->next_param_stack);
+      a->next_param_stack += 8u;
+    }
+    x64_sync_slot(a->abi, &a->next_param_int, &a->next_param_fp);
+    if (dst.kind != NATIVE_LOC_FRAME)
+      x64_panic(a, "indirect parameter requires a frame destination");
+    memset(&d_addr, 0, sizeof d_addr);
+    d_addr.base_kind = NATIVE_ADDR_BASE_FRAME;
+    d_addr.base.frame = dst.v.frame;
+    d_addr.base_type = p->type;
+    memset(&from, 0, sizeof from);
+    from.base_kind = NATIVE_ADDR_BASE_REG;
+    from.base.reg = ptr_reg;
+    from.base_type = p->type;
+    memset(&access, 0, sizeof access);
+    access.type = p->type;
+    access.size = p->size ? p->size : (u32)cg_type_size(t->c, p->type);
+    access.align = p->align ? p->align : x64_type_align(t, p->type);
+    x64_copy_bytes(t, d_addr, from, access);
+    return;
+  }
+
+  if (ai->kind == ABI_ARG_DIRECT &&
+      x64_direct_to_stack(a->abi, ai, a->next_param_int, a->next_param_fp)) {
+    /* Whole arg on the stack. */
+    for (i = 0; i < ai->nparts; ++i) {
+      const ABIArgPart* part = &ai->parts[i];
+      NativeAllocClass cls =
+          part->cls == ABI_CLASS_FP ? NATIVE_REG_FP : NATIVE_REG_INT;
+      Reg tmp = cls == NATIVE_REG_FP ? X64_TMP_FP : X64_TMP_INT;
+      NativeLoc src = x64_reg_loc(p->type, cls, tmp);
+      NativeAddr sa;
+      memset(&sa, 0, sizeof sa);
+      sa.base_kind = NATIVE_ADDR_BASE_REG;
+      sa.base.reg = X64_RBP;
+      sa.base_type = p->type;
+      sa.offset = incoming_bias + (i32)a->next_param_stack;
+      a->next_param_stack += 8u;
+      x64_emit_mem(a, 1, src, sa, x64_mem_for_type(t, p->type, part->size));
+      if (dst.kind == NATIVE_LOC_NONE) {
+        /* unused */
+      } else if (to_reg) {
+        x64_move(t, x64_reg_loc(dst.type ? dst.type : p->type,
+                                (NativeAllocClass)dst.cls, (Reg)dst.v.reg),
+                 src);
+      } else {
+        x64_store_part(t,
+                       x64_stack_loc(p->type, dst.v.frame, (i32)part->src_offset),
+                       src, 0, part->size);
+      }
+    }
+    return;
+  }
+
+  for (i = 0; i < ai->nparts; ++i) {
+    const ABIArgPart* part = &ai->parts[i];
+    NativeAllocClass cls =
+        part->cls == ABI_CLASS_FP ? NATIVE_REG_FP : NATIVE_REG_INT;
+    NativeLoc src;
+    if (cls == NATIVE_REG_FP && a->next_param_fp < a->abi->n_fp_args) {
+      src = x64_reg_loc(p->type, cls, (Reg)(X64_XMM0 + a->next_param_fp++));
+    } else if (cls == NATIVE_REG_INT &&
+               a->next_param_int < a->abi->n_int_args) {
+      src = x64_reg_loc(p->type, cls, a->abi->int_args[a->next_param_int++]);
+    } else {
+      Reg tmp = cls == NATIVE_REG_FP ? X64_TMP_FP : X64_TMP_INT;
+      NativeAddr sa;
+      src = x64_reg_loc(p->type, cls, tmp);
+      memset(&sa, 0, sizeof sa);
+      sa.base_kind = NATIVE_ADDR_BASE_REG;
+      sa.base.reg = X64_RBP;
+      sa.base_type = p->type;
+      sa.offset = incoming_bias + (i32)a->next_param_stack;
+      x64_emit_mem(a, 1, src, sa, x64_mem_for_type(t, p->type, part->size));
+      a->next_param_stack += 8u;
+    }
+    x64_sync_slot(a->abi, &a->next_param_int, &a->next_param_fp);
+    if (dst.kind == NATIVE_LOC_NONE) {
+      /* unused parameter; cursors advanced */
+    } else if (to_reg) {
+      NativeLoc d = x64_reg_loc(dst.type ? dst.type : p->type,
+                                (NativeAllocClass)dst.cls, (Reg)dst.v.reg);
+      if (!(src.kind == NATIVE_LOC_REG && loc_reg(src) == loc_reg(d) &&
+            (NativeAllocClass)src.cls == (NativeAllocClass)d.cls))
+        x64_move(t, d, src);
+    } else {
+      x64_store_part(t,
+                     x64_stack_loc(p->type, dst.v.frame, (i32)part->src_offset),
+                     src, 0, part->size);
+    }
+  }
+  a->incoming_stack_size = align_up_u32(a->next_param_stack, 16u);
+}
+
+/* ============================ calls / returns ============================ */
+
+typedef struct {
+  NativeLoc dst;
+  NativeLoc src;
+  u32 src_offset;
+  u32 size;
+  int is_addr;
+  int dup_to_gpr;  /* Win64: also place into the matching GPR */
+  Reg dup_gpr;
+} X64ArgMove;
+
+static Reg x64_arg_move_src_reg(const X64ArgMove* m, NativeAllocClass* cls_out) {
+  if (!m->is_addr && m->src.kind == NATIVE_LOC_REG) {
+    *cls_out = (NativeAllocClass)m->src.cls;
+    return m->src.v.reg;
+  }
+  return REG_NONE;
+}
+
+static void x64_emit_one_arg_move(NativeTarget* t, const X64ArgMove* m) {
+  if (m->is_addr) {
+    x64_addr_of_loc(t, m->dst, m->src);
+  } else {
+    x64_load_part(t, m->dst, m->src, m->src_offset, m->size);
+  }
+  if (m->dup_to_gpr) {
+    /* movq gpr, xmm: 66 REX.W 0F 7E /r (xmm in reg field). */
+    emit_sse_rr_w(t->mc, 0x66, 0x7E, 1, loc_reg(m->dst), m->dup_gpr);
+  }
+}
+
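+/* Emit register arg moves as a parallel copy. A move is ready once no other
+ * pending move still reads its destination register. If every pending move is
+ * blocked (a cycle, e.g. rdi<-rsi and rsi<-rdi), save one destination to a
+ * scratch register and retarget its readers to the scratch. */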
+static void x64_emit_reg_arg_moves(NativeTarget* t, X64ArgMove* moves, u32 n) {
+  u8 done[X64_MAX_REG_ARG_MOVES];
+  u32 emitted = 0;
+  if (n > X64_MAX_REG_ARG_MOVES) x64_panic(x64_of(t), "too many register args");
+  memset(done, 0, sizeof done);
+  while (emitted < n) {
+    int progress = 0;
+    u32 i, j;
+    for (i = 0; i < n; ++i) {
+      int blocked = 0;
+      if (done[i]) continue;
+      for (j = 0; j < n && !blocked; ++j) {
+        NativeAllocClass sc;
+        Reg sr;
+        if (done[j] || j == i) continue;
+        sr = x64_arg_move_src_reg(&moves[j], &sc);
+        if (sr != REG_NONE && sr == moves[i].dst.v.reg &&
+            sc == (NativeAllocClass)moves[i].dst.cls)
+          blocked = 1;
+      }
+      if (!blocked) {
+        x64_emit_one_arg_move(t, &moves[i]);
+        done[i] = 1;
+        emitted++;
+        progress = 1;
+      }
+    }
+    if (!progress) {
+      u32 k = 0;
+      NativeAllocClass bc, sc;
+      Reg scratch_reg;
+      NativeLoc scratchloc;
+      u32 jj;
+      while (k < n &&
+             (done[k] || x64_arg_move_src_reg(&moves[k], &sc) == REG_NONE))
+        ++k;
+      bc = (NativeAllocClass)moves[k].dst.cls;
+      scratch_reg = bc == NATIVE_REG_FP ? X64_TMP_FP : X64_TMP_INT2;
+      scratchloc = x64_reg_loc(moves[k].dst.type, bc, scratch_reg);
+      x64_move(t, scratchloc, moves[k].dst);
+      for (jj = 0; jj < n; ++jj) {
+        Reg sr = x64_arg_move_src_reg(&moves[jj], &sc);
+        if (!done[jj] && sr != REG_NONE && sr == moves[k].dst.v.reg &&
+            sc == bc) {
+          moves[jj].src = scratchloc;
+          moves[jj].src_offset = 0;
+        }
+      }
+    }
+  }
+}
+
+/* Clobber mask for a call: every register of the class that is not
+ * callee-saved. rsp and rbp are never reported for the int class. */
+static u32 x64_clobber_mask(const X64ABIRegs* abi, NativeAllocClass cls) {
+  u32 mask = 0, r;
+  if (cls == NATIVE_REG_INT) {
+    for (r = 0; r < 16u; ++r) {
+      if (r == X64_RSP || r == X64_RBP) continue;
+      if ((abi->cs_int_mask & (1ull << r)) == 0) mask |= 1u << r;
+    }
+  } else if (cls == NATIVE_REG_FP) {
+    for (r = 0; r < 16u; ++r)
+      if ((abi->cs_fp_mask & (1ull << r)) == 0) mask |= 1u << r;
+  }
+  return mask;
+}
+
+static u32 x64_return_mask(const ABIFuncInfo* abi, NativeAllocClass cls) {
+  u32 mask = 0, ni = 0, nf = 0;
+  static const u32 iregs[2] = {X64_RAX, X64_RDX};
+  u16 i;
+  if (!abi || abi->ret.kind == ABI_ARG_IGNORE ||
+      abi->ret.kind == ABI_ARG_INDIRECT)
+    return 0;
+  for (i = 0; i < abi->ret.nparts; ++i) {
+    const ABIArgPart* p = &abi->ret.parts[i];
+    if (cls == NATIVE_REG_INT && p->cls == ABI_CLASS_INT && ni < 2)
+      mask |= 1u << iregs[ni++];
+    else if (cls == NATIVE_REG_FP && p->cls == ABI_CLASS_FP && nf < 2)
+      mask |= 1u << (X64_XMM0 + nf++);
+  }
+  return mask;
+}
+
+static void x64_plan_call(NativeTarget* t, const NativeCallDesc* desc,
+                          NativeCallPlan* plan) {
+  X64NativeTarget* a = x64_of(t);
+  const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, desc->fn_type);
+  const X64ABIRegs* aregs = a->abi ? a->abi : x64_abi_for_os(t->c->target.os);
+  NativeCallPlanRet* rets;
+  CfreeCgTypeId i64t = builtin_id(CFREE_CG_BUILTIN_I64);
+  u32 c;
+  memset(plan, 0, sizeof *plan);
+  rets = desc->nresults ? arena_zarray(t->c->tu, NativeCallPlanRet, 4) : NULL;
+  plan->callee = desc->callee;
+  plan->rets = rets;
+  plan->flags = desc->flags;
+  plan->has_sret = abi && abi->has_sret;
+  plan->is_variadic = abi && abi->variadic;
+  plan->stack_arg_size = x64_call_stack_size(t, desc);
+  if (plan->stack_arg_size > a->max_outgoing)
+    a->max_outgoing = plan->stack_arg_size;
+  for (c = 0; c < NATIVE_CALL_PLAN_CLASSES; ++c) {
+    plan->clobber_mask[c] = x64_clobber_mask(aregs, (NativeAllocClass)c);
+    plan->return_mask[c] = x64_return_mask(abi, (NativeAllocClass)c);
+  }
+  /* Indirect callee in a clobbered/arg register would be lost; stage in r11. */
+  if (plan->callee.kind == NATIVE_LOC_REG &&
+      (NativeAllocClass)plan->callee.cls == NATIVE_REG_INT &&
+      plan->callee.v.reg != X64_R11) {
+    NativeLoc scratch = x64_reg_loc(plan->callee.type, NATIVE_REG_INT, X64_R11);
+    x64_move(t, scratch, plan->callee);
+    plan->callee = scratch;
+  }
+  {
+    u32 next_int = (abi && abi->has_sret) ? 1u : 0u;
+    u32 next_fp = 0, stack = aregs->shadow_space, nmoves = 0, i;
+    u16 p;
+    X64ArgMove moves[X64_MAX_REG_ARG_MOVES];
+    x64_sync_slot(aregs, &next_int, &next_fp);
+    for (i = 0; i < desc->nargs; ++i) {
+      ABIArgInfo tmp;
+      const ABIArgInfo* ai = x64_param_abi(t, abi, desc, i, &tmp);
+      int variadic_arg = abi && i >= abi->nparams;
+      if (ai->kind == ABI_ARG_IGNORE) continue;
+      if (ai->kind == ABI_ARG_INDIRECT) {
+        if (next_int < aregs->n_int_args) {
+          X64ArgMove* m = &moves[nmoves++];
+          memset(m, 0, sizeof *m);
+          m->dst = x64_reg_loc(i64t, NATIVE_REG_INT,
+                               aregs->int_args[next_int++]);
+          m->src = desc->args[i];
+          m->size = 8;
+          m->is_addr = 1;
+        } else {
+          NativeLoc ptr = x64_reg_loc(i64t, NATIVE_REG_INT, X64_RAX);
+          x64_addr_of_loc(t, ptr, desc->args[i]);
+          x64_store_outgoing_part(t, stack, ptr, 8);
+          stack += 8u;
+        }
+        x64_sync_slot(aregs, &next_int, &next_fp);
+        continue;
+      }
+      if (ai->kind == ABI_ARG_DIRECT &&
+          x64_direct_to_stack(aregs, ai, next_int, next_fp)) {
+        for (p = 0; p < ai->nparts; ++p) {
+          const ABIArgPart* part = &ai->parts[p];
+          NativeAllocClass cls =
+              part->cls == ABI_CLASS_FP ? NATIVE_REG_FP : NATIVE_REG_INT;
+          Reg tmp = cls == NATIVE_REG_FP ? X64_TMP_FP : X64_TMP_INT;
+          NativeLoc tmpreg = x64_reg_loc(desc->args[i].type, cls, tmp);
+          x64_load_part(t, tmpreg, desc->args[i], part->src_offset, part->size);
+          x64_store_outgoing_part(t, stack, tmpreg, part->size);
+          stack += 8u;
+        }
+        continue;
+      }
+      for (p = 0; p < ai->nparts; ++p) {
+        const ABIArgPart* part = &ai->parts[p];
+        NativeAllocClass cls =
+            part->cls == ABI_CLASS_FP ? NATIVE_REG_FP : NATIVE_REG_INT;
+        if (cls == NATIVE_REG_FP && next_fp < aregs->n_fp_args) {
+          X64ArgMove* m = &moves[nmoves++];
+          u32 slot = next_fp;
+          memset(m, 0, sizeof *m);
+          m->dst = x64_reg_loc(desc->args[i].type, cls, (Reg)(X64_XMM0 + next_fp++));
+          m->src = desc->args[i];
+          m->src_offset = part->src_offset;
+          m->size = part->size;
+          if (aregs->vararg_fp_dup_to_gpr && variadic_arg &&
+              slot < aregs->n_int_args) {
+            m->dup_to_gpr = 1;
+            m->dup_gpr = aregs->int_args[slot];
+          }
+          x64_sync_slot(aregs, &next_int, &next_fp);
+        } else if (cls == NATIVE_REG_INT && next_int < aregs->n_int_args) {
+          X64ArgMove* m = &moves[nmoves++];
+          memset(m, 0, sizeof *m);
+          m->dst = x64_reg_loc(desc->args[i].type, cls,
+                               aregs->int_args[next_int++]);
+          m->src = desc->args[i];
+          m->src_offset = part->src_offset;
+          m->size = part->size;
+          x64_sync_slot(aregs, &next_int, &next_fp);
+        } else {
+          Reg tmp = cls == NATIVE_REG_FP ? X64_TMP_FP : X64_TMP_INT;
+          NativeLoc tmpreg = x64_reg_loc(desc->args[i].type, cls, tmp);
+          x64_load_part(t, tmpreg, desc->args[i], part->src_offset, part->size);
+          x64_store_outgoing_part(t, stack, tmpreg, part->size);
+          stack += 8u;
+          x64_sync_slot(aregs, &next_int, &next_fp);
+        }
+      }
+    }
+    x64_emit_reg_arg_moves(t, moves, nmoves);
+    if (abi && abi->has_sret && desc->nresults) {
+      NativeLoc sret = x64_reg_loc(i64t, NATIVE_REG_INT, aregs->int_args[0]);
+      x64_addr_of_loc(t, sret, desc->results[0]);
+    }
+    /* SysV variadic call: AL carries the number of vector registers used.
+     * Emitted for every variadic call; Win64 callees ignore it. */
+    if (abi && abi->variadic)
+      x64_emit_load_imm(t->mc, 0, X64_RAX, (i64)next_fp);
+  }
+  /* Return value receipt. */
+  if (abi && abi->ret.kind == ABI_ARG_DIRECT && desc->nresults) {
+    u32 nr = 0, ni = 0, nf = 0;
+    static const u32 ret_int_regs[2] = {X64_RAX, X64_RDX};
+    u16 p;
+    for (p = 0; p < abi->ret.nparts; ++p) {
+      const ABIArgPart* part = &abi->ret.parts[p];
+      NativeAllocClass cls =
+          part->cls == ABI_CLASS_FP ? NATIVE_REG_FP : NATIVE_REG_INT;
+      CfreeCgTypeId pty = x64_part_scalar_type(part);
+      Reg rreg = cls == NATIVE_REG_FP ? (Reg)(X64_XMM0 + nf++)
+                                      : (Reg)ret_int_regs[ni++];
+      rets[nr].src = x64_reg_loc(pty, cls, rreg);
+      rets[nr].dst = desc->results[0];
+      if (rets[nr].dst.kind == NATIVE_LOC_FRAME)
+        rets[nr].dst =
+            x64_stack_loc(pty, desc->results[0].v.frame, (i32)part->src_offset);
+      else if (rets[nr].dst.kind == NATIVE_LOC_STACK) {
+        rets[nr].dst.v.stack.offset += (i32)part->src_offset;
+        rets[nr].dst.type = pty;
+      }
+      rets[nr].mem = x64_mem_for_type(t, pty, part->size);
+      nr++;
+    }
+    plan->nrets = nr;
+  } else if (abi && abi->ret.kind == ABI_ARG_IGNORE) {
+    plan->nrets = 0;
+  } else if (!abi && desc->nresults) {
+    rets[0].src = x64_reg_loc(desc->results[0].type, NATIVE_REG_INT, X64_RAX);
+    rets[0].dst = desc->results[0];
+    rets[0].mem = x64_mem_for_type(t, desc->results[0].type, 0);
+    plan->nrets = 1;
+  }
+}
+
+static void x64_emit_call(NativeTarget* t, const NativeCallPlan* plan) {
+  MCEmitter* mc = t->mc;
+  ObjSecId sec = mc->section_id;
+  if (plan->flags & CG_CALL_TAIL) x64_panic(x64_of(t), "tail call not implemented");
+  if (plan->callee.kind == NATIVE_LOC_GLOBAL) {
+    u8 op = X64_OPC_CALL_REL32;
+    u32 disp_pos;
+    mc->emit_bytes(mc, &op, 1);
+    disp_pos = mc->pos(mc);
+    emit_u32le(mc, 0);
+    mc->emit_reloc_at(mc, sec, disp_pos, R_X64_PLT32, plan->callee.v.global.sym,
+                      plan->callee.v.global.addend - 4, 1, 0);
+    return;
+  }
+  if (plan->callee.kind == NATIVE_LOC_REG) {
+    u32 r = loc_reg(plan->callee);
+    if (r & 8u) {
+      u8 rex = X64_REX_BASE | X64_REX_B;
+      mc->emit_bytes(mc, &rex, 1);
+    }
+    {
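+      /* call r/m64 is /2 in the same opcode group as jmp r/m64 (/4), hence
+       * the JMP-named opcode byte with reg field 2. */
+      u8 buf[2] = {X64_OP_JMP_RM64, modrm(3u, 2u, r & 7u)};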
+      mc->emit_bytes(mc, buf, 2);
+    }
+    return;
+  }
+  x64_panic(x64_of(t), "unsupported call target");
+}
+
+static void x64_plan_ret(NativeTarget* t, const CGFuncDesc* fd,
+                         const NativeLoc* values, u32 nvalues,
+                         NativeCallPlanRet** out_rets, u32* out_nrets) {
+  X64NativeTarget* a = x64_of(t);
+  const ABIFuncInfo* abi = abi_cg_func_info(t->c->abi, fd->fn_type);
+  NativeCallPlanRet* rets = NULL;
+  u32 nr = 0;
+  if (nvalues > 1u) x64_panic(a, "multiple returns unsupported");
+  if (nvalues) rets = arena_zarray(t->c->tu, NativeCallPlanRet, 4);
+  if (nvalues && abi && abi->ret.kind == ABI_ARG_INDIRECT) {
+    /* sret: reload destination pointer (spilled at entry) into r11, memcpy the
+     * source aggregate into [r11], and convention-return the pointer in rax. */
+    CfreeCgTypeId i64t = builtin_id(CFREE_CG_BUILTIN_I64);
+    NativeLoc dstp = x64_reg_loc(i64t, NATIVE_REG_INT, X64_R11);
+    NativeLoc saved = x64_stack_loc(i64t, a->sret_ptr_slot, 0);
+    NativeAddr dst_addr, src_addr;
+    AggregateAccess access;
+    x64_load_part(t, dstp, saved, 0, 8);
+    memset(&dst_addr, 0, sizeof dst_addr);
+    dst_addr.base_kind = NATIVE_ADDR_BASE_REG;
+    dst_addr.base.reg = X64_R11;
+    dst_addr.base_type = values[0].type;
+    src_addr = x64_loc_addr(a, values[0], 0);
+    src_addr.base_type = values[0].type;
+    memset(&access, 0, sizeof access);
+    access.type = values[0].type;
+    access.size = (u32)cg_type_size(t->c, values[0].type);
+    access.align = x64_type_align(t, values[0].type);
+    x64_copy_bytes(t, dst_addr, src_addr, access);
+    /* Return the sret pointer in rax. Reload it from the spill slot, since
+     * copy_bytes may have clobbered both r11 and rax. */
+    x64_load_part(t, x64_reg_loc(i64t, NATIVE_REG_INT, X64_RAX), saved, 0, 8);
+    *out_rets = NULL;
+    *out_nrets = 0;
+    return;
+  }
+  if (nvalues && abi && abi->ret.kind == ABI_ARG_DIRECT) {
+    u32 ni = 0, nf = 0;
+    static const u32 ret_int_regs[2] = {X64_RAX, X64_RDX};
+    u16 p;
+    for (p = 0; p < abi->ret.nparts; ++p) {
+      const ABIArgPart* part = &abi->ret.parts[p];
+      NativeAllocClass cls =
+          part->cls == ABI_CLASS_FP ? NATIVE_REG_FP : NATIVE_REG_INT;
+      CfreeCgTypeId pty = x64_part_scalar_type(part);
+      Reg rreg = cls == NATIVE_REG_FP ? (Reg)(X64_XMM0 + nf++)
+                                      : (Reg)ret_int_regs[ni++];
+      rets[nr].src = values[0];
+      if (rets[nr].src.kind == NATIVE_LOC_FRAME)
+        rets[nr].src =
+            x64_stack_loc(pty, values[0].v.frame, (i32)part->src_offset);
+      else if (rets[nr].src.kind == NATIVE_LOC_STACK) {
+        rets[nr].src.v.stack.offset += (i32)part->src_offset;
+        rets[nr].src.type = pty;
+      }
+      rets[nr].dst = x64_reg_loc(pty, cls, rreg);
+      rets[nr].mem = x64_mem_for_type(t, pty, part->size);
+      nr++;
+    }
+  } else if (nvalues) {
+    rets[0].src = values[0];
+    rets[0].dst = x64_reg_loc(values[0].type, NATIVE_REG_INT, X64_RAX);
+    rets[0].mem = x64_mem_for_type(t, values[0].type, 0);
+    nr = 1;
+  }
+  *out_rets = rets;
+  *out_nrets = nr;
+}
+
+static void x64_ret(NativeTarget* t) {
+  X64NativeTarget* a = x64_of(t);
+  x64_jump(t, a->epilogue_label);
+}
+
+/* ============================ alloca ============================ */
+
+static void x64_alloca(NativeTarget* t, NativeLoc dst, NativeLoc size,
+                       u32 align) {
+  X64NativeTarget* a = x64_of(t);
+  MCEmitter* mc = t->mc;
+  u32 rsz = loc_reg(size);
+  u32 rd = loc_reg(dst);
+  u32 al = align ? align : 16u;
+  if (al < 16u) al = 16u;
+  if (al > 16u) x64_panic(a, "alloca align > 16 not supported");
+  if (size.kind == NATIVE_LOC_IMM) {
+    u64 aligned = ((u64)size.v.imm + 15u) & ~(u64)15u;
+    if (aligned == 0) aligned = 16;
+    /* sub rsp, imm32. */
+    emit_rex(mc, 1, 0, 0, X64_RSP);
+    {
+      u8 buf[2] = {X64_OPC_ALU_IMM32, modrm(3u, X64_ALU_SUB_SUB, X64_RSP)};
+      mc->emit_bytes(mc, buf, 2);
+    }
+    emit_u32le(mc, (u32)aligned);
+  } else {
+    /* rax = (size + 15) & ~15; sub rsp, rax. */
+    emit_lea(mc, X64_RAX, rsz, 15);
+    emit_rex(mc, 1, 0, 0, X64_RAX);
+    {
+      u8 buf[3] = {X64_OPC_ALU_IMM8, modrm(3u, X64_ALU_SUB_AND, X64_RAX), 0xF0};
+      mc->emit_bytes(mc, buf, 3);
+    }
+    emit_alu_rr(mc, 1, X64_OPC_ALU_SUB, X64_RSP, X64_RAX);
+  }
+  a->has_alloca = 1;
+  /* lea dst, [rsp + max_outgoing] — disp32 patched in func_end. */
+  if (a->npatches == a->patches_cap) {
+    u32 cap = a->patches_cap ? a->patches_cap * 2u : 8u;
+    X64Patch* nb = arena_zarray(t->c->tu, X64Patch, cap);
+    if (a->patches) memcpy(nb, a->patches, sizeof(*nb) * a->npatches);
+    a->patches = nb;
+    a->patches_cap = cap;
+  }
+  emit_rex(mc, 1, rd, 0, X64_RSP);
+  {
+    u8 op = X64_OPC_LEA;
+    mc->emit_bytes(mc, &op, 1);
+  }
+  {
+    u8 mr = modrm(2u, rd & 7u, 4u);
+    mc->emit_bytes(mc, &mr, 1);
+  }
+  {
+    u8 s = sib(0u, 4u, X64_RSP);
+    mc->emit_bytes(mc, &s, 1);
+  }
+  a->patches[a->npatches].kind = X64_PATCH_ALLOCA;
+  a->patches[a->npatches].pos = mc->pos(mc);
+  a->npatches++;
+  a->nalloca++;
+  emit_u32le(mc, 0); /* placeholder disp32 */
+}
+
+/* ============================ TLS ============================ */
+
+/* Win64 TLS Local-Exec (PE-COFF): load the TEB's TLS array pointer, index it
+ * by _tls_index to reach this module's TLS block, then add the symbol's
+ * section-relative offset (sym@SECREL). R11 is scratch. */
+static void x64_tls_addr_of_win64(NativeTarget* t, NativeLoc dst, ObjSymId sym,
+                                  i64 addend) {
+  MCEmitter* mc = t->mc;
+  u32 sec = mc->section_id;
+  u32 rd = loc_reg(dst);
+  /* (1) mov rd, gs:[0x58] (TEB.ThreadLocalStoragePointer). */
+  {
+    u8 gs = 0x65;
+    mc->emit_bytes(mc, &gs, 1);
+    emit_rex(mc, 1, rd, 0, 0);
+    {
+      u8 op = X64_OPC_MOV_R_RM;
+      mc->emit_bytes(mc, &op, 1);
+    }
+    {
+      u8 mr = modrm(0u, rd & 7u, 4u);
+      mc->emit_bytes(mc, &mr, 1);
+    }
+    {
+      u8 s = sib(0u, 4u, 5u);
+      mc->emit_bytes(mc, &s, 1);
+    }
+    emit_u32le(mc, 0x58u);
+  }
+  /* (2) mov r11d, [rip + _tls_index]. */
+  {
+    Sym idx_name = pool_intern_slice(t->c->global, SLICE_LIT("_tls_index"));
+    ObjSymId idx_sym = obj_symbol_find(t->obj, idx_name);
+    u8 rex_r, op, mr;
+    u32 disp_pos;
+    if (idx_sym == 0)
+      idx_sym =
+          obj_symbol(t->obj, idx_name, SB_GLOBAL, SK_UNDEF, OBJ_SEC_NONE, 0, 0);
+    rex_r = X64_REX_BASE | X64_REX_R;
+    mc->emit_bytes(mc, &rex_r, 1);
+    op = X64_OPC_MOV_R_RM;
+    mc->emit_bytes(mc, &op, 1);
+    mr = modrm(0u, 3u, 5u); /* r11&7, rip-rel */
+    mc->emit_bytes(mc, &mr, 1);
+    disp_pos = mc->pos(mc);
+    emit_u32le(mc, 0);
+    mc->emit_reloc_at(mc, sec, disp_pos, R_PC32, idx_sym, -4, 1, 0);
+  }
+  /* (3) mov rd, [rd + r11*8]. */
+  {
+    u8 rex = X64_REX_BASE | X64_REX_W | X64_REX_X;
+    u8 op;
+    if (rd & 8u) rex |= X64_REX_R | X64_REX_B;
+    mc->emit_bytes(mc, &rex, 1);
+    op = X64_OPC_MOV_R_RM;
+    mc->emit_bytes(mc, &op, 1);
+    if ((rd & 7u) == 5u) {
+      u8 mr = modrm(1u, rd & 7u, 4u);
+      u8 s = sib(3u, 3u, rd & 7u);
+      u8 zero = 0;
+      mc->emit_bytes(mc, &mr, 1);
+      mc->emit_bytes(mc, &s, 1);
+      mc->emit_bytes(mc, &zero, 1);
+    } else {
+      u8 mr = modrm(0u, rd & 7u, 4u);
+      u8 s = sib(3u, 3u, rd & 7u);
+      mc->emit_bytes(mc, &mr, 1);
+      mc->emit_bytes(mc, &s, 1);
+    }
+  }
+  /* (4) lea rd, [rd + sym@SECREL]. A base of rsp/r12 (low bits 100) needs a
+   * SIB byte. */
+  {
+    u8 rex = X64_REX_BASE | X64_REX_W;
+    u8 op;
+    u32 disp_pos;
+    if (rd & 8u) rex |= X64_REX_R | X64_REX_B;
+    mc->emit_bytes(mc, &rex, 1);
+    op = X64_OPC_LEA;
+    mc->emit_bytes(mc, &op, 1);
+    if ((rd & 7u) == 4u) {
+      u8 mr = modrm(2u, rd & 7u, 4u);
+      u8 s = sib(0u, 4u, rd & 7u);
+      mc->emit_bytes(mc, &mr, 1);
+      mc->emit_bytes(mc, &s, 1);
+    } else {
+      u8 mr = modrm(2u, rd & 7u, rd & 7u);
+      mc->emit_bytes(mc, &mr, 1);
+    }
+    disp_pos = mc->pos(mc);
+    emit_u32le(mc, 0);
+    mc->emit_reloc_at(mc, sec, disp_pos, R_COFF_SECREL, sym, addend, 1, 0);
+  }
+}
+
+/* x86-64 TLS Local-Exec (ELF): mov rd, fs:[0]; lea rd, [rd + sym@tpoff].
+ * Windows targets are routed to x64_tls_addr_of_win64. */
+static void x64_tls_addr_of(NativeTarget* t, NativeLoc dst, ObjSymId sym,
+                            i64 addend) {
+  MCEmitter* mc = t->mc;
+  u32 sec = mc->section_id;
+  u32 rd = loc_reg(dst);
+  u32 disp_pos;
+  if (t->c->target.os == CFREE_OS_WINDOWS) {
+    x64_tls_addr_of_win64(t, dst, sym, addend);
+    return;
+  }
+  /* mov rd, fs:[0]. */
+  {
+    u8 fs = 0x64;
+    mc->emit_bytes(mc, &fs, 1);
+    emit_rex(mc, 1, rd, 0, 0);
+    {
+      u8 op = X64_OPC_MOV_R_RM;
+      mc->emit_bytes(mc, &op, 1);
+    }
+    {
+      u8 mr = modrm(0u, rd & 7u, 4u);
+      mc->emit_bytes(mc, &mr, 1);
+    }
+    {
+      u8 s = sib(0u, 4u, 5u);
+      mc->emit_bytes(mc, &s, 1);
+    }
+    emit_u32le(mc, 0);
+  }
+  /* lea rd, [rd + disp32@tpoff]. */
+  emit_rex(mc, 1, rd, 0, rd);
+  {
+    u8 op = X64_OPC_LEA;
+    mc->emit_bytes(mc, &op, 1);
+  }
+  if ((rd & 7u) == 4u) {
+    u8 mr = modrm(2u, rd & 7u, 4u);
+    u8 s = sib(0u, 4u, rd & 7u);
+    mc->emit_bytes(mc, &mr, 1);
+    mc->emit_bytes(mc, &s, 1);
+  } else {
+    u8 mr = modrm(2u, rd & 7u, rd & 7u);
+    mc->emit_bytes(mc, &mr, 1);
+  }
+  disp_pos = mc->pos(mc);
+  emit_u32le(mc, 0);
+  mc->emit_reloc_at(mc, sec, disp_pos, R_X64_TPOFF32, sym, addend, 0, 0);
+}
+
+/* ============================ atomics ============================ */
+
+static void emit_lock_prefix(MCEmitter* mc) {
+  u8 b = 0xF0;
+  mc->emit_bytes(mc, &b, 1);
+}
+static void emit_mfence(MCEmitter* mc) {
+  u8 b[3] = {0x0F, 0xAE, 0xF0};
+  mc->emit_bytes(mc, b, 3);
+}
+
+/* Resolve an atomic addr to a bare base register (r11) + disp 0. */
+static u32 x64_atomic_base(X64NativeTarget* a, NativeAddr addr) {
+  return x64_addr_to_base_reg(a, addr, X64_TMP_INT2);
+}
+
+static void x64_atomic_load(NativeTarget* t, NativeLoc dst, NativeAddr addr,
+                            MemAccess mem, MemOrder mo) {
+  X64NativeTarget* a = x64_of(t);
+  u32 sz = mem.size ? mem.size : x64_type_size(t, dst.type);
+  u32 base;
+  (void)mo; /* x86 plain MOV is an acquire load. */
+  base = x64_atomic_base(a, addr);
+  emit_mov_load(t->mc, sz, 0, loc_reg(dst), base, 0);
+}
+
+static void x64_atomic_store(NativeTarget* t, NativeAddr addr, NativeLoc src,
+                             MemAccess mem, MemOrder mo) {
+  X64NativeTarget* a = x64_of(t);
+  MCEmitter* mc = t->mc;
+  u32 sz = mem.size ? mem.size : x64_type_size(t, src.type);
+  int w = sz == 8u ? 1 : 0;
+  u32 base = x64_atomic_base(a, addr);
+  u32 sr = loc_reg(src);
+  if (mo == MO_SEQ_CST) {
+    /* xchg with a memory operand is implicitly locked and acts as a full
+     * barrier. Stage src in rax (r11 holds base) and emit xchg [base], rax. */
+    if (sr != X64_RAX) emit_mov_rr(mc, w, X64_RAX, sr);
+    emit_lock_prefix(mc);
+    emit_rex(mc, w, X64_RAX, 0, base);
+    {
+      u8 op = 0x87; /* xchg r/m, r */
+      mc->emit_bytes(mc, &op, 1);
+    }
+    emit_mem_operand(mc, X64_RAX, base, 0);
+    return;
+  }
+  emit_mov_store(mc, sz, sr, base, 0);
+}
+
+static void x64_atomic_rmw(NativeTarget* t, AtomicOp op, NativeLoc dst,
+                           NativeAddr addr, NativeLoc val, MemAccess mem,
+                           MemOrder mo) {
+  X64NativeTarget* a = x64_of(t);
+  MCEmitter* mc = t->mc;
+  u32 sz = mem.size ? mem.size : x64_type_size(t, dst.type);
+  int w = sz == 8u ? 1 : 0;
+  u32 base = x64_atomic_base(a, addr); /* r11 */
+  u32 dr = loc_reg(dst);
+  u32 vr = loc_reg(val);
+  (void)mo; /* LOCK ops are full barriers. */
+  /* val staged in rdx (base owns r11; rax/rcx used by the cmpxchg loop). */
+  emit_mov_rr(mc, w, X64_RDX, vr);
+  if (op == AO_ADD || op == AO_SUB) {
+    if (op == AO_SUB) emit_f7_rm(mc, w, X64_F7_SUB_NEG, X64_RDX);
+    emit_lock_prefix(mc);
+    emit_rex(mc, w, X64_RDX, 0, base);
+    {
+      u8 op2[2] = {X64_OPC_TWOBYTE, 0xC1}; /* xadd */
+      mc->emit_bytes(mc, op2, 2);
+    }
+    emit_mem_operand(mc, X64_RDX, base, 0);
+    if (dr != X64_RDX) emit_mov_rr(mc, w, dr, X64_RDX);
+    return;
+  }
+  if (op == AO_XCHG) {
+    emit_lock_prefix(mc);
+    emit_rex(mc, w, X64_RDX, 0, base);
+    {
+      u8 op2 = 0x87; /* xchg */
+      mc->emit_bytes(mc, &op2, 1);
+    }
+    emit_mem_operand(mc, X64_RDX, base, 0);
+    if (dr != X64_RDX) emit_mov_rr(mc, w, dr, X64_RDX);
+    return;
+  }
+  /* AND/OR/XOR/NAND: cmpxchg retry loop. rax=prior, rcx=new, rdx=val. */
+  {
+    MCLabel retry = mc->label_new(mc);
+    emit_mov_load(mc, sz, 0, X64_RAX, base, 0);
+    mc->label_place(mc, retry);
+    emit_mov_rr(mc, w, X64_RCX, X64_RAX);
+    switch (op) {
+      case AO_AND: emit_alu_rr(mc, w, X64_OPC_ALU_AND, X64_RCX, X64_RDX); break;
+      case AO_OR: emit_alu_rr(mc, w, X64_OPC_ALU_OR, X64_RCX, X64_RDX); break;
+      case AO_XOR: emit_alu_rr(mc, w, X64_OPC_ALU_XOR, X64_RCX, X64_RDX); break;
+      case AO_NAND:
+        emit_alu_rr(mc, w, X64_OPC_ALU_AND, X64_RCX, X64_RDX);
+        emit_f7_rm(mc, w, X64_F7_SUB_NOT, X64_RCX);
+        break;
+      default: x64_panic(a, "unsupported atomic rmw op");
+    }
+    emit_lock_prefix(mc);
+    emit_rex(mc, w, X64_RCX, 0, base);
+    {
+      u8 op2[2] = {X64_OPC_TWOBYTE, 0xB1}; /* cmpxchg */
+      mc->emit_bytes(mc, op2, 2);
+    }
+    emit_mem_operand(mc, X64_RCX, base, 0);
+    emit_jcc_rel32(mc, X64_CC_NE, retry);
+    if (dr != X64_RAX) emit_mov_rr(mc, w, dr, X64_RAX);
+  }
+}
+
+static void x64_atomic_cas(NativeTarget* t, NativeLoc prior, NativeLoc ok,
+                           NativeAddr addr, NativeLoc expected, NativeLoc desired,
+                           MemAccess mem, MemOrder success, MemOrder failure) {
+  X64NativeTarget* a = x64_of(t);
+  MCEmitter* mc = t->mc;
+  u32 sz = mem.size ? mem.size : x64_type_size(t, prior.type);
+  int w = sz == 8u ? 1 : 0;
+  u32 base = x64_atomic_base(a, addr); /* r11 */
+  u32 rprior = loc_reg(prior);
+  u32 rok = loc_reg(ok);
+  u32 rexp = loc_reg(expected);
+  u32 rdes = loc_reg(desired);
+  (void)success;
+  (void)failure;
+  /* rax = expected; rcx = desired. NOTE: this ordering assumes `desired` does
+   * not live in rax; the first move would overwrite it. */
+  if (rexp != X64_RAX) emit_mov_rr(mc, w, X64_RAX, rexp);
+  if (rdes != X64_RCX) emit_mov_rr(mc, w, X64_RCX, rdes);
+  emit_lock_prefix(mc);
+  emit_rex(mc, w, X64_RCX, 0, base);
+  {
+    u8 op2[2] = {X64_OPC_TWOBYTE, 0xB1}; /* cmpxchg [base], rcx */
+    mc->emit_bytes(mc, op2, 2);
+  }
+  emit_mem_operand(mc, X64_RCX, base, 0);
+  emit_setcc(mc, X64_CC_E, rok);
+  emit_movzx_r32_r8(mc, rok, rok);
+  if (rprior != X64_RAX) emit_mov_rr(mc, w, rprior, X64_RAX);
+}
+
+static void x64_fence(NativeTarget* t, MemOrder mo) {
+  if (mo == MO_SEQ_CST) emit_mfence(t->mc);
+}
+
+/* ============================ variadics ============================
+ * SysV: __va_list_tag (gp_offset@0, fp_offset@4, overflow@8, reg_save@16). The
+ * prologue filled the 176B reg-save area (6 GPRs * 8B, then 8 XMMs * 16B).
+ * Win64: va_list is a single pointer to the next 8-byte slot in the
+ * home/overflow area; FP varargs are duplicated into the matching GPR slot at
+ * the call site. `ap` addresses the va_list object. */
+
+static void x64_va_start_core(X64NativeTarget* a, NativeAddr ap) {
+  NativeTarget* t = &a->base;
+  MCEmitter* mc = t->mc;
+  u32 ap_base;
+  if (!a->is_variadic) x64_panic(a, "va_start: function not variadic");
+  ap_base = x64_addr_to_base_reg(a, ap, X64_TMP_INT2);
+  if (a->abi->shadow_space) {
+    /* Win64: *ap = rbp + 16 + named_int*8 + named_stack. */
+    u32 first = 16u + a->next_param_int * 8u + a->next_param_stack;
+    emit_lea(mc, X64_RAX, X64_RBP, (i32)first);
+    emit_mov_store(mc, 8, X64_RAX, ap_base, 0);
+    return;
+  }
+  {
+    X64NativeSlot* rs = x64_slot_get(a, a->reg_save_slot);
+    /* gp_offset = next_param_int * 8 */
+    x64_emit_load_imm(mc, 0, X64_RAX, (i64)(a->next_param_int * 8u));
+    emit_mov_store(mc, 4, X64_RAX, ap_base, 0);
+    /* fp_offset = 48 + next_param_fp * 16 */
+    x64_emit_load_imm(mc, 0, X64_RAX, (i64)(48u + a->next_param_fp * 16u));
+    emit_mov_store(mc, 4, X64_RAX, ap_base, 4);
+    /* overflow_arg_area = rbp + 16 + next_param_stack */
+    emit_lea(mc, X64_RAX, X64_RBP, (i32)(16u + a->next_param_stack));
+    emit_mov_store(mc, 8, X64_RAX, ap_base, 8);
+    /* reg_save_area = rbp - reg_save_slot.off */
+    emit_lea(mc, X64_RAX, X64_RBP, -(i32)rs->off);
+    emit_mov_store(mc, 8, X64_RAX, ap_base, 16);
+  }
+}
+
+static void x64_va_arg_core(X64NativeTarget* a, NativeLoc dst, NativeAddr ap,
+                            CfreeCgTypeId type) {
+  NativeTarget* t = &a->base;
+  MCEmitter* mc = t->mc;
+  u32 sz = x64_type_size(t, type);
+  int is_fp = loc_is_fp(dst);
+  u32 dr = loc_reg(dst);
+  u32 ap_base = x64_addr_to_base_reg(a, ap, X64_TMP_INT2);
+  if (a->abi->shadow_space) {
+    /* Win64: r10 = *ap; load; *ap += 8. (r10 is caller-saved scratch here.) */
+    emit_mov_load(mc, 8, 0, X64_R10, ap_base, 0);
+    if (is_fp)
+      emit_sse_load(mc, sse_scalar_prefix(sz), 0x10, dr, X64_R10, 0);
+    else
+      emit_mov_load(mc, sz, 0, dr, X64_R10, 0);
+    /* add r10, 8; *ap = r10. */
+    emit_rex(mc, 1, 0, 0, X64_R10);
+    {
+      u8 buf[3] = {X64_OPC_ALU_IMM8, modrm(3u, X64_ALU_SUB_ADD, X64_R10 & 7u), 8};
+      mc->emit_bytes(mc, buf, 3);
+    }
+    emit_mov_store(mc, 8, X64_R10, ap_base, 0);
+    return;
+  }
+  {
+    u32 offs_field = is_fp ? 4u : 0u;
+    u32 max_offs = is_fp ? 176u : 48u;
+    u32 stride = is_fp ? 16u : 8u;
+    MCLabel L_stack = mc->label_new(mc);
+    MCLabel L_done = mc->label_new(mc);
+    /* eax = ap[offs]; cmp eax, max; jae L_stack. */
+    emit_mov_load(mc, 4, 0, X64_RAX, ap_base, (i32)offs_field);
+    emit_alu_imm32(mc, 0, X64_ALU_SUB_CMP, X64_RAX, (i32)max_offs);
+    emit_jcc_rel32(mc, X64_CC_AE, L_stack);
+    /* reg path: r10 = ap[16] + rax; load; eax += stride; ap[offs] = eax. */
+    emit_mov_load(mc, 8, 0, X64_R10, ap_base, 16);
+    emit_alu_rr(mc, 1, X64_OPC_ALU_ADD, X64_R10, X64_RAX);
+    if (is_fp)
+      emit_sse_load(mc, sse_scalar_prefix(sz), 0x10, dr, X64_R10, 0);
+    else
+      emit_mov_load(mc, sz, 0, dr, X64_R10, 0);
+    emit_alu_imm8(mc, 0, X64_ALU_SUB_ADD, X64_RAX, (i8)stride);
+    emit_mov_store(mc, 4, X64_RAX, ap_base, (i32)offs_field);
+    emit_jmp_rel32(mc, L_done);
+    /* stack path: r10 = ap[8]; load; r10 += 8; ap[8] = r10. */
+    mc->label_place(mc, L_stack);
+    emit_mov_load(mc, 8, 0, X64_R10, ap_base, 8);
+    if (is_fp)
+      emit_sse_load(mc, sse_scalar_prefix(sz), 0x10, dr, X64_R10, 0);
+    else
+      emit_mov_load(mc, sz, 0, dr, X64_R10, 0);
+    emit_rex(mc, 1, 0, 0, X64_R10);
+    {
+      u8 buf[3] = {X64_OPC_ALU_IMM8, modrm(3u, X64_ALU_SUB_ADD, X64_R10 & 7u), 8};
+      mc->emit_bytes(mc, buf, 3);
+    }
+    emit_mov_store(mc, 8, X64_R10, ap_base, 8);
+    mc->label_place(mc, L_done);
+  }
+}
+
+static void x64_va_copy_core(X64NativeTarget* a, NativeAddr dst_ap,
+                             NativeAddr src_ap) {
+  NativeTarget* t = &a->base;
+  MCEmitter* mc = t->mc;
+  /* Resolve dst into r11 first, src into rax (disjoint); rdx carries each
+   * 8-byte word. SysV copies the 24-byte __va_list_tag, Win64 one pointer. */
+  u32 dst_base = x64_addr_to_base_reg(a, dst_ap, X64_TMP_INT2);
+  u32 src_base = x64_addr_to_base_reg(a, src_ap, X64_TMP_INT);
+  u32 n = a->abi->shadow_space ? 8u : 24u, i;
+  for (i = 0; i < n; i += 8u) {
+    emit_mov_load(mc, 8, 0, X64_RDX, src_base, (i32)i);
+    emit_mov_store(mc, 8, X64_RDX, dst_base, (i32)i);
+  }
+}
+
+static NativeAddr x64_va_addr_from_ptr(NativeLoc ap_ptr) {
+  NativeAddr addr;
+  memset(&addr, 0, sizeof addr);
+  addr.base_kind = NATIVE_ADDR_BASE_REG;
+  addr.cls = NATIVE_REG_INT;
+  addr.base.reg = ap_ptr.v.reg;
+  addr.base_type = ap_ptr.type;
+  return addr;
+}
+
+static void x64_va_start_native(NativeTarget* t, NativeLoc ap_ptr) {
+  x64_va_start_core(x64_of(t), x64_va_addr_from_ptr(ap_ptr));
+}
+static void x64_va_arg_native(NativeTarget* t, NativeLoc dst, NativeLoc ap_ptr,
+                              CfreeCgTypeId type) {
+  x64_va_arg_core(x64_of(t), dst, x64_va_addr_from_ptr(ap_ptr), type);
+}
+static void x64_va_end_native(NativeTarget* t, NativeLoc ap_ptr) {
+  (void)t;
+  (void)ap_ptr;
+}
+static void x64_va_copy_native(NativeTarget* t, NativeLoc dst, NativeLoc src) {
+  x64_va_copy_core(x64_of(t), x64_va_addr_from_ptr(dst),
+                   x64_va_addr_from_ptr(src));
+}
+
+/* ============================ intrinsics ============================ */
+
+static void emit_popcnt(MCEmitter* mc, int w, u32 dst, u32 src) {
+  u8 p = 0xF3;
+  mc->emit_bytes(mc, &p, 1);
+  emit_rex(mc, w, dst, 0, src);
+  {
+    u8 op[2] = {X64_OPC_TWOBYTE, 0xB8};
+    mc->emit_bytes(mc, op, 2);
+  }
+  emit_rm_reg(mc, dst, src);
+}
+static void emit_bs(MCEmitter* mc, int w, u8 opcode2, u32 dst, u32 src) {
+  emit_rex(mc, w, dst, 0, src);
+  {
+    u8 op[2] = {X64_OPC_TWOBYTE, opcode2};
+    mc->emit_bytes(mc, op, 2);
+  }
+  emit_rm_reg(mc, dst, src);
+}
+static void emit_bswap(MCEmitter* mc, int w, u32 reg) {
+  emit_rex(mc, w, 0, 0, reg);
+  {
+    u8 op[2] = {X64_OPC_TWOBYTE, (u8)(0xC8 + (reg & 7u))};
+    mc->emit_bytes(mc, op, 2);
+  }
+}
+static void emit_rol16_imm8(MCEmitter* mc, u32 reg, u8 imm) {
+  u8 p = X64_OPSIZE_PFX;
+  mc->emit_bytes(mc, &p, 1);
+  emit_rex(mc, 0, 0, 0, reg);
+  {
+    u8 buf[3] = {X64_OPC_SHIFT_IMM, modrm(3u, 0u, reg & 7u), imm};
+    mc->emit_bytes(mc, buf, 3);
+  }
+}
+static void emit_ud2(MCEmitter* mc) {
+  u8 b[2] = {0x0F, 0x0B};
+  mc->emit_bytes(mc, b, 2);
+}
+
+static void x64_intrinsic(NativeTarget* t, IntrinKind kind,
+                          const NativeLoc* dsts, u32 ndst, const NativeLoc* args,
+                          u32 narg) {
+  X64NativeTarget* a = x64_of(t);
+  MCEmitter* mc = t->mc;
+  (void)ndst;
+  switch (kind) {
+    case INTRIN_NONE:
+      break;
+    case INTRIN_EXPECT:
+    case INTRIN_ASSUME_ALIGNED:
+      if (args[0].kind == NATIVE_LOC_IMM)
+        x64_emit_load_imm(mc, x64_is_64(t, dsts[0].type) ? 1 : 0,
+                          loc_reg(dsts[0]), args[0].v.imm);
+      else
+        x64_move(t, dsts[0], args[0]);
+      return;
+    case INTRIN_PREFETCH:
+      return;
+    case INTRIN_UNREACHABLE:
+    case INTRIN_TRAP:
+      emit_ud2(mc);
+      return;
+    case INTRIN_POPCOUNT:
+      emit_popcnt(mc, x64_is_64(t, args[0].type) ? 1 : 0, loc_reg(dsts[0]),
+                  loc_reg(args[0]));
+      return;
+    case INTRIN_CTZ:
+      emit_bs(mc, x64_is_64(t, args[0].type) ? 1 : 0, 0xBC /* bsf */,
+              loc_reg(dsts[0]), loc_reg(args[0]));
+      return;
+    case INTRIN_CLZ: {
+      int w = x64_is_64(t, args[0].type) ? 1 : 0;
+      u32 dr = loc_reg(dsts[0]);
+      emit_bs(mc, w, 0xBD /* bsr */, dr, loc_reg(args[0]));
+      /* clz = (bits-1) - bsr. bsr yields a value in [0, bits-1] and bits-1 is
+       * an all-ones mask over that range, so the subtraction is an xor. */
+      emit_rex(mc, w, 0, 0, dr);
+      {
+        u8 op = X64_OPC_ALU_IMM32;
+        mc->emit_bytes(mc, &op, 1);
+      }
+      emit_rm_reg(mc, X64_ALU_SUB_XOR, dr);
+      emit_u32le(mc, w ? 63u : 31u);
+      return;
+    }
+    case INTRIN_BSWAP16: {
+      u32 dr = loc_reg(dsts[0]), sr = loc_reg(args[0]);
+      if (dr != sr) emit_mov_rr(mc, 0, dr, sr);
+      emit_rol16_imm8(mc, dr, 8);
+      return;
+    }
+    case INTRIN_BSWAP32: {
+      u32 dr = loc_reg(dsts[0]), sr = loc_reg(args[0]);
+      if (dr != sr) emit_mov_rr(mc, 0, dr, sr);
+      emit_bswap(mc, 0, dr);
+      return;
+    }
+    case INTRIN_BSWAP64: {
+      u32 dr = loc_reg(dsts[0]), sr = loc_reg(args[0]);
+      if (dr != sr) emit_mov_rr(mc, 1, dr, sr);
+      emit_bswap(mc, 1, dr);
+      return;
+    }
+    case INTRIN_SADD_OVERFLOW:
+    case INTRIN_UADD_OVERFLOW:
+    case INTRIN_SSUB_OVERFLOW:
+    case INTRIN_USUB_OVERFLOW: {
+      int w = x64_is_64(t, dsts[0].type) ? 1 : 0;
+      u32 rd = loc_reg(dsts[0]), rovf = loc_reg(dsts[1]);
+      u32 ra = loc_reg(args[0]), rb = loc_reg(args[1]);
+      u8 op = (kind == INTRIN_SADD_OVERFLOW || kind == INTRIN_UADD_OVERFLOW)
+                  ? X64_OPC_ALU_ADD
+                  : X64_OPC_ALU_SUB;
+      u32 cc = (kind == INTRIN_UADD_OVERFLOW || kind == INTRIN_USUB_OVERFLOW)
+                   ? X64_CC_B
+                   : X64_CC_O;
+      if (rd != ra) emit_mov_rr(mc, w, rd, ra);
+      emit_alu_rr(mc, w, op, rd, rb);
+      emit_setcc(mc, cc, rovf);
+      emit_movzx_r32_r8(mc, rovf, rovf);
+      return;
+    }
+    case INTRIN_SMUL_OVERFLOW: {
+      int w = x64_is_64(t, dsts[0].type) ? 1 : 0;
+      u32 rd = loc_reg(dsts[0]), rovf = loc_reg(dsts[1]);
+      u32 ra = loc_reg(args[0]), rb = loc_reg(args[1]);
+      if (rd != ra) emit_mov_rr(mc, w, rd, ra);
+      emit_imul_rr(mc, w, rd, rb);
+      emit_setcc(mc, X64_CC_O, rovf);
+      emit_movzx_r32_r8(mc, rovf, rovf);
+      return;
+    }
+    case INTRIN_UMUL_OVERFLOW: {
+      int w = x64_is_64(t, dsts[0].type) ? 1 : 0;
+      u32 rd = loc_reg(dsts[0]), rovf = loc_reg(dsts[1]);
+      u32 ra = loc_reg(args[0]), rb = loc_reg(args[1]);
+      if (rb == X64_RAX || rb == X64_RDX) {
+        emit_mov_rr(mc, w, X64_R11, rb);
+        rb = X64_R11;
+      }
+      if (ra != X64_RAX) emit_mov_rr(mc, w, X64_RAX, ra);
+      emit_f7_rm(mc, w, X64_F7_SUB_MUL, rb); /* MUL: rdx:rax = rax * rb */
+      if (rd != X64_RAX) emit_mov_rr(mc, w, rd, X64_RAX);
+      emit_setcc(mc, X64_CC_O, rovf);
+      emit_movzx_r32_r8(mc, rovf, rovf);
+      return;
+    }
+    case INTRIN_MEMCPY:
+    case INTRIN_MEMMOVE: {
+      u32 dr, sr, n;
+      if (narg != 3u || args[0].kind != NATIVE_LOC_REG ||
+          args[1].kind != NATIVE_LOC_REG || args[2].kind != NATIVE_LOC_IMM)
+        x64_panic(a, "unsupported memory intrinsic operands");
+      if (args[2].v.imm < 0 || args[2].v.imm > 0xffffffffll)
+        x64_panic(a, "unsupported memory intrinsic size");
+      dr = loc_reg(args[0]);
+      sr = loc_reg(args[1]);
+      n = (u32)args[2].v.imm;
+      if (kind == INTRIN_MEMCPY) {
+        /* Ascending copy, widest chunks first, staged through rax. */
+        u32 i = 0;
+        while (i + 8u <= n) {
+          emit_mov_load(mc, 8, 0, X64_RAX, sr, (i32)i);
+          emit_mov_store(mc, 8, X64_RAX, dr, (i32)i);
+          i += 8u;
+        }
+        while (i + 4u <= n) {
+          emit_mov_load(mc, 4, 0, X64_RAX, sr, (i32)i);
+          emit_mov_store(mc, 4, X64_RAX, dr, (i32)i);
+          i += 4u;
+        }
+        while (i + 2u <= n) {
+          emit_mov_load(mc, 2, 0, X64_RAX, sr, (i32)i);
+          emit_mov_store(mc, 2, X64_RAX, dr, (i32)i);
+          i += 2u;
+        }
+        while (i < n) {
+          emit_mov_load(mc, 1, 0, X64_RAX, sr, (i32)i);
+          emit_mov_store(mc, 1, X64_RAX, dr, (i32)i);
+          i += 1u;
+        }
+      } else {
+        /* Descending copy from the end, staged through rax. NOTE: the
+         * direction is fixed at compile time, so overlapping ranges are
+         * handled only when dst lies above src. */
+        u32 i = n;
+        while (i >= 8u) {
+          i -= 8u;
+          emit_mov_load(mc, 8, 0, X64_RAX, sr, (i32)i);
+          emit_mov_store(mc, 8, X64_RAX, dr, (i32)i);
+        }
+        while (i >= 4u) {
+          i -= 4u;
+          emit_mov_load(mc, 4, 0, X64_RAX, sr, (i32)i);
+          emit_mov_store(mc, 4, X64_RAX, dr, (i32)i);
+        }
+        while (i >= 2u) {
+          i -= 2u;
+          emit_mov_load(mc, 2, 0, X64_RAX, sr, (i32)i);
+          emit_mov_store(mc, 2, X64_RAX, dr, (i32)i);
+        }
+        while (i >= 1u) {
+          i -= 1u;
+          emit_mov_load(mc, 1, 0, X64_RAX, sr, (i32)i);
+          emit_mov_store(mc, 1, X64_RAX, dr, (i32)i);
+        }
+      }
+      return;
+    }
+    case INTRIN_MEMSET: {
+      u32 dr, n;
+      if (narg != 3u || args[0].kind != NATIVE_LOC_REG ||
+          args[2].kind != NATIVE_LOC_IMM)
+        x64_panic(a, "unsupported memset operands");
+      if (args[2].v.imm < 0 || args[2].v.imm > 0xffffffffll)
+        x64_panic(a, "unsupported memset size");
+      dr = loc_reg(args[0]);
+      n = (u32)args[2].v.imm;
+      if (args[1].kind == NATIVE_LOC_IMM) {
+        u8 byte = (u8)(args[1].v.imm & 0xffu);
+        u64 b64 = byte;
+        b64 |= b64 << 8;
+        b64 |= b64 << 16;
+        b64 |= b64 << 32;
+        x64_emit_load_imm(mc, 1, X64_RAX, (i64)b64);
+      } else {
+        /* Broadcast the low byte of a register: zero-extend it into rax so
+         * stray upper bits cannot leak into the pattern, then multiply by
+         * 0x0101010101010101. */
+        emit_movzx_r32_r8(mc, X64_RAX, loc_reg(args[1]));
+        x64_emit_load_imm(mc, 1, X64_R11, (i64)0x0101010101010101ll);
+        emit_imul_rr(mc, 1, X64_RAX, X64_R11);
+      }
+      {
+        u32 i = 0;
+        while (i + 8u <= n) {
+          emit_mov_store(mc, 8, X64_RAX, dr, (i32)i);
+          i += 8u;
+        }
+        while (i + 4u <= n) {
+          emit_mov_store(mc, 4, X64_RAX, dr, (i32)i);
+          i += 4u;
+        }
+        while (i + 2u <= n) {
+          emit_mov_store(mc, 2, X64_RAX, dr, (i32)i);
+          i += 2u;
+        }
+        while (i < n) {
+          emit_mov_store(mc, 1, X64_RAX, dr, (i32)i);
+          i += 1u;
+        }
+      }
+      return;
+    }
+    default:
+      break;
+  }
+  x64_panic(a, "unsupported compiler intrinsic");
+}
+
+/* ============================ inline asm ============================ */
+
+_Noreturn static void x64_asm_panic_at(Compiler* c, SrcLoc loc,
+                                       const char* msg) {
+  compiler_panic(c, loc, "x64 inline asm: %s", msg);
+}
+_Noreturn static void x64_asm_panic(NativeDirectTarget* d, const char* msg) {
+  x64_asm_panic_at(d->base.c, d->loc, msg);
+}
+
+static const char* x64_asm_constraint_body(const char* s) {
+  if (!s) return "";
+  /* Skip the output modifier (= or +) and an optional early-clobber (&). */
+  if ((s[0] == '=' || s[0] == '+') && s[1] == '&') return s + 2;
+  if (s[0] == '=' || s[0] == '+' || s[0] == '&') return s + 1;
+  return s;
+}
+static int x64_asm_constraint_early(const char* s) {
+  if (!s) return 0;
+  return ((s[0] == '=' || s[0] == '+') && s[1] == '&') || s[0] == '&';
+}
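+/* Parse the operand index of a matching constraint (a constraint that starts
+ * with a decimal digit). Returns -1 if s is not a matching constraint. */
+static int x64_asm_match_index(const char* s) {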
+  int n = 0;
+  const char* p;
+  if (!s || s[0] < '0' || s[0] > '9') return -1;
+  for (p = s; *p >= '0' && *p <= '9'; ++p) n = n * 10 + (*p - '0');
+  return n;
+}
+
+static void x64_asm_bound_reg(Operand* out, CfreeCgTypeId type,
+                              NativeAllocClass cls, Reg reg) {
+  memset(out, 0, sizeof *out);
+  out->kind = X64_INLINE_OPK_REG;
+  out->pad[0] =
+      (cls == NATIVE_REG_FP) ? X64_INLINE_OPCLS_FP : X64_INLINE_OPCLS_INT;
+  out->type = type;
+  out->v.local = (CGLocal)reg;
+}
+static void x64_asm_bound_mem(Operand* out, CfreeCgTypeId type, Reg base) {
+  memset(out, 0, sizeof *out);
+  out->kind = OPK_INDIRECT;
+  out->type = type;
+  out->v.ind.base = (CGLocal)base;
+  out->v.ind.index = CG_LOCAL_NONE;
+  out->v.ind.ofs = 0;
+}
+
+/* Parse a clobber register name into (class, reg). Returns 0 for cc/memory.
+ * GPR names map to HW encoding via x64_register_hw_index; xmm names map via the
+ * DWARF table (xmm0..15 = dwarf 17..32). */
+static int x64_asm_parse_reg_clobber(Compiler* c, SrcLoc loc, Sym name,
+                                     NativeAllocClass* cls_out, Reg* reg_out) {
+  Slice s = pool_slice(c->global, name);
+  char buf[16];
+  uint32_t idx;
+  if (!s.s || !s.len) return 0;
+  if (s.len == 2 && s.s[0] == 'c' && s.s[1] == 'c') return 0;
+  if (s.len == 6 && memcmp(s.s, "memory", 6) == 0) return 0;
+  if (s.len >= sizeof buf) x64_asm_panic_at(c, loc, "clobber name is too long");
+  memcpy(buf, s.s, s.len);
+  buf[s.len] = '\0';
+  if (x64_register_hw_index(buf, &idx) == 0 && idx <= 15u) {
+    *cls_out = NATIVE_REG_INT;
+    *reg_out = (Reg)idx;
+    return 1;
+  }
+  if (x64_register_index(buf, &idx) == 0 && idx >= 17u && idx <= 32u) {
+    *cls_out = NATIVE_REG_FP;
+    *reg_out = (Reg)(idx - 17u);
+    return 1;
+  }
+  x64_asm_panic_at(c, loc, "unknown clobber register");
+  return 0;
+}
+
+static void x64_asm_clobber_masks(Compiler* c, SrcLoc loc, const Sym* clobbers,
+                                  u32 nclob, u32* int_mask, u32* fp_mask) {
+  u32 i;
+  *int_mask = 0;
+  *fp_mask = 0;
+  for (i = 0; i < nclob; ++i) {
+    NativeAllocClass cls;
+    Reg reg;
+    if (!x64_asm_parse_reg_clobber(c, loc, clobbers[i], &cls, &reg)) continue;
+    if (cls == NATIVE_REG_INT)
+      *int_mask |= 1u << reg;
+    else
+      *fp_mask |= 1u << reg;
+  }
+}
+
+static NativeAllocClass x64_asm_constraint_class(NativeDirectTarget* d,
+                                                 const char* body) {
+  if (body[0] == 'r' || body[0] == 'q' || body[0] == 'a' || body[0] == 'b' ||
+      body[0] == 'c' || body[0] == 'd' || body[0] == 'S' || body[0] == 'D')
+    return NATIVE_REG_INT;
+  if (body[0] == 'x' || body[0] == 'v') return NATIVE_REG_FP;
+  x64_asm_panic(d, "constraint is not a register constraint");
+  return NATIVE_REG_INT;
+}
+
+/* Pick a free register from the allocable pools for an asm operand the direct
+ * path self-allocates. The pools are caller-saved under SysV; on Win64 rdi,
+ * rsi and xmm6 and up are callee-saved. */
+static Reg x64_asm_alloc_reg(NativeDirectTarget* d, NativeAllocClass cls,
+                             u32* used_int, u32* used_fp) {
+  static const Reg int_pool[] = {X64_RDI, X64_RSI, X64_RDX, X64_RCX,
+                                 X64_R8,  X64_R9,  X64_R10};
+  static const Reg fp_pool[] = {X64_XMM0,     X64_XMM1,      X64_XMM2,
+                                X64_XMM3,     X64_XMM4,      X64_XMM5,
+                                X64_XMM6,     X64_XMM7,      X64_XMM8,
+                                X64_XMM0 + 9, X64_XMM0 + 10, X64_XMM0 + 11};
+  const Reg* pool = cls == NATIVE_REG_FP ? fp_pool : int_pool;
+  u32 n = cls == NATIVE_REG_FP ? (u32)(sizeof fp_pool / sizeof fp_pool[0])
+                               : (u32)(sizeof int_pool / sizeof int_pool[0]);
+  u32* used = cls == NATIVE_REG_FP ? used_fp : used_int;
+  u32 i;
+  for (i = 0; i < n; ++i) {
+    Reg r = pool[i];
+    if ((*used & (1u << r)) != 0) continue;
+    *used |= 1u << r;
+    return r;
+  }
+  x64_asm_panic(d, "out of registers for asm operands");
+  return REG_NONE;
+}
+
+/* Direct (-O0) path: resolve a semantic Operand to a NativeAddr. */
+static NativeAddr x64_direct_addr(NativeDirectTarget* d, Operand op) {
+  NativeAddr addr;
+  memset(&addr, 0, sizeof addr);
+  switch ((OpKind)op.kind) {
+    case OPK_LOCAL:
+      addr.base_kind = NATIVE_ADDR_BASE_FRAME;
+      addr.base.frame = d->locals[op.v.local - 1u].home;
+      addr.base_type = op.type;
+      return addr;
+    case OPK_INDIRECT:
+      addr.base_kind = NATIVE_ADDR_BASE_FRAME_VALUE;
+      addr.base.frame = d->locals[op.v.ind.base - 1u].home;
+      addr.cls = d->locals[op.v.ind.base - 1u].cls;
+      addr.base_type = d->locals[op.v.ind.base - 1u].type;
+      addr.offset = op.v.ind.ofs;
+      return addr;
+    default:
+      x64_asm_panic(d, "operand is not addressable");
+  }
+}
+
+static NativeAddr x64_direct_materialize_addr(NativeDirectTarget* d,
+                                              Operand op) {
+  X64NativeTarget* a = x64_of(d->native);
+  NativeAddr addr = x64_direct_addr(d, op);
+  if (addr.base_kind == NATIVE_ADDR_BASE_FRAME_VALUE) {
+    /* The local holds a pointer: load it from its frame home into the scratch
+     * register X64_TMP_INT2 and address through that register. */
+    emit_mov_load(a->base.mc, 8, 0, X64_TMP_INT2, X64_RBP,
+                  -(i32)x64_slot_get(a, addr.base.frame)->off);
+    addr.base_kind = NATIVE_ADDR_BASE_REG;
+    addr.base.reg = X64_TMP_INT2;
+  }
+  return addr;
+}
+
+static void x64_direct_load_operand_to_reg(NativeDirectTarget* d, Operand op,
+                                           NativeLoc dst) {
+  X64NativeTarget* a = x64_of(d->native);
+  NativeAddr addr;
+  memset(&addr, 0, sizeof addr);
+  switch ((OpKind)op.kind) {
+    case OPK_IMM:
+      if ((NativeAllocClass)dst.cls != NATIVE_REG_INT)
+        x64_asm_panic(d, "floating-point immediate asm input is unsupported");
+      d->native->load_imm(d->native, dst, op.v.imm);
+      return;
+    case OPK_LOCAL:
+      addr.base_kind = NATIVE_ADDR_BASE_FRAME;
+      addr.base.frame = d->locals[op.v.local - 1u].home;
+      addr.base_type = op.type;
+      x64_emit_mem(a, 1, dst, addr, x64_mem_for_type(d->native, op.type, 0));
+      return;
+    case OPK_GLOBAL:
+      addr.base_kind = NATIVE_ADDR_BASE_GLOBAL;
+      addr.base.global.sym = op.v.global.sym;
+      addr.base.global.addend = op.v.global.addend;
+      addr.base_type = op.type;
+      d->native->load_addr(d->native, dst, addr);
+      return;
+    case OPK_INDIRECT:
+      addr = x64_direct_materialize_addr(d, op);
+      x64_emit_mem(a, 1, dst, addr, x64_mem_for_type(d->native, op.type, 0));
+      return;
+  }
+  x64_asm_panic(d, "unsupported asm input operand");
+}
+
+static void x64_direct_load_address_to_reg(NativeDirectTarget* d, Operand op,
+                                           NativeLoc dst) {
+  d->native->load_addr(d->native, dst, x64_direct_addr(d, op));
+}
+
+static void x64_direct_store_reg_to_operand(NativeDirectTarget* d, Operand op,
+                                            NativeLoc src) {
+  X64NativeTarget* a = x64_of(d->native);
+  NativeAddr addr;
+  memset(&addr, 0, sizeof addr);
+  if (op.kind == OPK_LOCAL) {
+    addr.base_kind = NATIVE_ADDR_BASE_FRAME;
+    addr.base.frame = d->locals[op.v.local - 1u].home;
+    addr.base_type = op.type;
+  } else {
+    addr = x64_direct_materialize_addr(d, op);
+  }
+  x64_emit_mem(a, 0, src, addr, x64_mem_for_type(d->native, op.type, 0));
+}
+
+/* Callee-saved registers an asm block clobbers must be saved around the block. */
+typedef struct X64AsmSavedClobber {
+  NativeFrameSlot slot;
+  NativeAllocClass cls;
+  Reg reg;
+  CfreeCgTypeId type;
+} X64AsmSavedClobber;
+
+static void x64_asm_save_one(X64NativeTarget* a, X64AsmSavedClobber* s) {
+  NativeFrameSlotDesc desc;
+  NativeAddr addr;
+  memset(&desc, 0, sizeof desc);
+  desc.type = s->type;
+  desc.size = s->cls == NATIVE_REG_FP ? 16u : 8u;
+  desc.align = desc.size;
+  desc.kind = NATIVE_FRAME_SLOT_SAVE;
+  s->slot = a->base.frame_slot(&a->base, &desc);
+  memset(&addr, 0, sizeof addr);
+  addr.base_kind = NATIVE_ADDR_BASE_FRAME;
+  addr.base.frame = s->slot;
+  addr.base_type = s->type;
+  x64_emit_mem(a, 0, x64_reg_loc(s->type, s->cls, s->reg), addr,
+               x64_mem_for_type(&a->base, s->type, desc.size));
+}
+static void x64_asm_restore_one(X64NativeTarget* a, const X64AsmSavedClobber* s) {
+  NativeAddr addr;
+  memset(&addr, 0, sizeof addr);
+  addr.base_kind = NATIVE_ADDR_BASE_FRAME;
+  addr.base.frame = s->slot;
+  addr.base_type = s->type;
+  x64_emit_mem(a, 1, x64_reg_loc(s->type, s->cls, s->reg), addr,
+               x64_mem_for_type(&a->base,
+                                s->type, s->cls == NATIVE_REG_FP ? 16u : 8u));
+}
+
+/* SysV callee-saved: int rbx,r12-r15; no fp. Win64 adds rdi,rsi + xmm6-15. */
+static int x64_reg_is_callee_int(const X64ABIRegs* abi, Reg r) {
+  if (r == X64_RBP) return 0; /* prologue head handles rbp */
+  return (abi->cs_int_mask & (1ull << r)) != 0;
+}
+static int x64_reg_is_callee_fp(const X64ABIRegs* abi, Reg r) {
+  return (abi->cs_fp_mask & (1ull << r)) != 0;
+}
+
+static X64AsmSavedClobber* x64_asm_save_callee_clobbers(X64NativeTarget* a,
+                                                        u32 int_mask,
+                                                        u32 fp_mask,
+                                                        u32* nsaved_out) {
+  X64AsmSavedClobber* saved =
+      arena_zarray(a->base.c->tu, X64AsmSavedClobber, 32u);
+  /* Named ty_* so the locals do not shadow the i64 integer typedef. */
+  CfreeCgTypeId ty_i64 = builtin_id(CFREE_CG_BUILTIN_I64);
+  CfreeCgTypeId ty_f64 = builtin_id(CFREE_CG_BUILTIN_F64);
+  u32 n = 0;
+  Reg r;
+  for (r = 0; r <= 15u; ++r) {
+    if ((int_mask & (1u << r)) == 0 || !x64_reg_is_callee_int(a->abi, r))
+      continue;
+    saved[n].cls = NATIVE_REG_INT;
+    saved[n].reg = r;
+    saved[n].type = ty_i64;
+    x64_asm_save_one(a, &saved[n++]);
+  }
+  for (r = 0; r <= 15u; ++r) {
+    if ((fp_mask & (1u << r)) == 0 || !x64_reg_is_callee_fp(a->abi, r)) continue;
+    saved[n].cls = NATIVE_REG_FP;
+    saved[n].reg = r;
+    saved[n].type = ty_f64;
+    x64_asm_save_one(a, &saved[n++]);
+  }
+    x64_asm_save_one(a, &saved[n++]);
+  }
+  *nsaved_out = n;
+  return saved;
+}
+
+/* ---- NativeTarget (optimizer) asm hook ---- */
+
+static NativeAddr x64_asm_loc_to_addr(X64NativeTarget* a, SrcLoc loc,
+                                      NativeLoc src) {
+  NativeAddr addr;
+  memset(&addr, 0, sizeof addr);
+  addr.base_type = src.type;
+  switch ((NativeLocKind)src.kind) {
+    case NATIVE_LOC_FRAME:
+      addr.base_kind = NATIVE_ADDR_BASE_FRAME;
+      addr.base.frame = src.v.frame;
+      return addr;
+    case NATIVE_LOC_ADDR:
+      return src.v.addr;
+    case NATIVE_LOC_GLOBAL:
+      addr.base_kind = NATIVE_ADDR_BASE_GLOBAL;
+      addr.base.global.sym = src.v.global.sym;
+      addr.base.global.addend = src.v.global.addend;
+      return addr;
+    case NATIVE_LOC_REG:
+      addr.base_kind = NATIVE_ADDR_BASE_REG;
+      addr.cls = NATIVE_REG_INT;
+      addr.base.reg = src.v.reg;
+      return addr;
+    default:
+      x64_asm_panic_at(a->base.c, loc, "unsupported memory asm operand");
+  }
+}
+
+static Reg x64_asm_native_mem_base(X64NativeTarget* a, SrcLoc loc, NativeLoc src,
+                                   u32* ntmp) {
+  NativeAddr addr = x64_asm_loc_to_addr(a, loc, src);
+  Reg dst;
+  if (addr.base_kind == NATIVE_ADDR_BASE_REG && addr.offset == 0 &&
+      addr.index_kind == NATIVE_ADDR_INDEX_NONE) {
+    if ((addr.base.reg & 0xfu) != X64_RAX && (addr.base.reg & 0xfu) != X64_R11)
+      return (Reg)(addr.base.reg & 0xfu);
+  }
+  if (*ntmp >= 2u)
+    x64_asm_panic_at(a->base.c, loc, "too many memory asm operands");
+  dst = (*ntmp == 0u) ? (Reg)X64_RAX : (Reg)X64_R11;
+  (*ntmp)++;
+  x64_addr_to_base_reg(a, addr, dst);
+  return dst;
+}
+
+static void x64_asm_bind_native(X64NativeTarget* a, SrcLoc loc, Operand* out,
+                                const char* constraint, CfreeCgTypeId type,
+                                NativeLoc src, u32* ntmp) {
+  const char* body = x64_asm_constraint_body(constraint);
+  if (body[0] == 'r' || body[0] == 'x') {
+    NativeAllocClass cls = (body[0] == 'x') ? NATIVE_REG_FP : NATIVE_REG_INT;
+    if (src.kind != NATIVE_LOC_REG)
+      x64_asm_panic_at(a->base.c, loc, "register asm operand not in a register");
+    x64_asm_bound_reg(out, type, cls, (Reg)src.v.reg);
+  } else if (body[0] == 'i') {
+    if (src.kind != NATIVE_LOC_IMM)
+      x64_asm_panic_at(a->base.c, loc, "immediate asm operand is not immediate");
+    memset(out, 0, sizeof *out);
+    out->kind = OPK_IMM;
+    out->type = type;
+    out->v.imm = src.v.imm;
+  } else if (body[0] == 'm') {
+    x64_asm_bound_mem(out, type, x64_asm_native_mem_base(a, loc, src, ntmp));
+  } else {
+    x64_asm_panic_at(a->base.c, loc, "unsupported asm constraint");
+  }
+}
+
+static void x64_asm_block_native(NativeTarget* t, const char* tmpl,
+                                 const AsmConstraint* outs, u32 nout,
+                                 NativeLoc* out_locs, const AsmConstraint* ins,
+                                 u32 nin, const NativeLoc* in_locs,
+                                 const Sym* clobbers, u32 nclob) {
+  X64NativeTarget* a = x64_of(t);
+  Compiler* c = t->c;
+  SrcLoc loc = a->func ? a->func->loc : (SrcLoc){0, 0, 0};
+  Operand* bound_outs = nout ? arena_zarray(c->tu, Operand, nout) : NULL;
+  Operand* bound_ins = nin ? arena_zarray(c->tu, Operand, nin) : NULL;
+  u32 clob_int, clob_fp, ntmp = 0;
+  X64AsmSavedClobber* saved;
+  u32 nsaved, i;
+  X64Asm* asmh;
+
+  x64_asm_clobber_masks(c, loc, clobbers, nclob, &clob_int, &clob_fp);
+
+  for (i = 0; i < nout; ++i) {
+    CfreeCgTypeId type = outs[i].type ? outs[i].type : out_locs[i].type;
+    x64_asm_bind_native(a, loc, &bound_outs[i], outs[i].str, type, out_locs[i],
+                        &ntmp);
+  }
+  for (i = 0; i < nin; ++i) {
+    const char* body = x64_asm_constraint_body(ins[i].str);
+    int matched = x64_asm_match_index(body);
+    CfreeCgTypeId type;
+    NativeLoc inloc;
+    if (matched >= 0) {
+      if ((u32)matched >= nout)
+        x64_asm_panic_at(c, loc, "matching constraint out of range");
+      bound_ins[i] = bound_outs[matched];
+      continue;
+    }
+    type = ins[i].type ? ins[i].type : in_locs[i].type;
+    inloc = in_locs[i];
+    if ((body[0] == 'r') && inloc.kind != NATIVE_LOC_REG) {
+      Reg r;
+      if (ntmp >= 2u) x64_asm_panic_at(c, loc, "too many memory asm operands");
+      r = (ntmp == 0u) ? (Reg)X64_RAX : (Reg)X64_R11;
+      ntmp++;
+      inloc = x64_reg_loc(type, NATIVE_REG_INT, r);
+      x64_emit_mem(a, 1, inloc, x64_asm_loc_to_addr(a, loc, in_locs[i]),
+                   x64_mem_for_type(t, type, x64_type_size(t, type)));
+    }
+    x64_asm_bind_native(a, loc, &bound_ins[i], ins[i].str, type, inloc, &ntmp);
+  }
+
+  saved = x64_asm_save_callee_clobbers(a, clob_int, clob_fp, &nsaved);
+  asmh = x64_asm_open(c);
+  x64_inline_bind(asmh, outs, nout, bound_outs, ins, nin, bound_ins, clobbers,
+                  nclob);
+  x64_asm_run_template(asmh, t->mc, tmpl);
+  x64_asm_close(asmh);
+  for (i = nsaved; i > 0; --i) x64_asm_restore_one(a, &saved[i - 1u]);
+}
+
+static void x64_file_scope_asm(NativeTarget* t, const char* src, size_t len) {
+  AsmLexer* lex = asm_lex_open_mem(t->c, "<file-scope-asm>", src, len);
+  asm_parse(t->c, lex, t->mc);
+  asm_lex_close(lex);
+}
+
+static void x64_trap(NativeTarget* t) { emit_ud2(t->mc); }
+static void x64_set_loc(NativeTarget* t, SrcLoc loc) {
+  x64_of(t)->loc = loc;
+  if (t->mc->set_loc) t->mc->set_loc(t->mc, loc);
+}
+static void x64_finalize(NativeTarget* t) {
+  if (t->mc) mc_emit_eh_frame(t->mc);
+}
+
+/* ============================ construction ============================ */
+
+NativeTarget* x64_native_target_new(Compiler* c, ObjBuilder* obj,
+                                    MCEmitter* mc) {
+  X64NativeTarget* a = arena_znew(c->tu, X64NativeTarget);
+  NativeTarget* t;
+  if (!a) return NULL;
+  t = &a->base;
+  t->c = c;
+  t->obj = obj;
+  t->mc = mc;
+  t->regs = &x64_reg_info;
+  t->class_for_type = x64_class_for_type;
+  t->imm_legal = x64_imm_legal;
+  t->addr_legal = x64_addr_legal;
+  t->func_begin = x64_func_begin;
+  t->func_begin_known_frame = x64_func_begin_known_frame;
+  t->note_frame_state = NULL;
+  t->reserve_callee_saves = NULL;
+  t->signature_stack_bytes = x64_signature_stack_bytes;
+  t->call_stack_bytes = x64_call_stack_bytes;
+  t->has_store_zero_reg = 0;
+  t->func_end = x64_func_end;
+  t->frame_slot = x64_frame_slot;
+  t->frame_slot_debug_loc = NULL;
+  t->bind_param = x64_bind_native_param;
+  t->label_new = x64_label_new;
+  t->label_place = x64_label_place;
+  t->jump = x64_jump;
+  t->cmp_branch = x64_cmp_branch;
+  t->indirect_branch = x64_indirect_branch;
+  t->load_label_addr = x64_load_label_addr;
+  t->move = x64_move;
+  t->load_imm = x64_load_imm;
+  t->load_const = x64_load_const;
+  t->load_addr = x64_load_addr;
+  t->load = x64_load;
+  t->store = x64_store;
+  t->tls_addr_of = x64_tls_addr_of;
+  t->copy_bytes = x64_copy_bytes;
+  t->set_bytes = x64_set_bytes;
+  t->bitfield_load = x64_bitfield_load;
+  t->bitfield_store = x64_bitfield_store;
+  t->binop = x64_binop;
+  t->unop = x64_unop;
+  t->cmp = x64_cmp;
+  t->convert = x64_convert;
+  t->alloca_ = x64_alloca;
+  t->spill = x64_spill;
+  t->reload = x64_reload;
+  t->plan_call = x64_plan_call;
+  t->emit_call = x64_emit_call;
+  t->plan_ret = x64_plan_ret;
+  t->ret = x64_ret;
+  t->atomic_load = x64_atomic_load;
+  t->atomic_store = x64_atomic_store;
+  t->atomic_rmw = x64_atomic_rmw;
+  t->atomic_cas = x64_atomic_cas;
+  t->fence = x64_fence;
+  t->va_start_ = x64_va_start_native;
+  t->va_arg_ = x64_va_arg_native;
+  t->va_end_ = x64_va_end_native;
+  t->va_copy_ = x64_va_copy_native;
+  t->intrinsic = x64_intrinsic;
+  t->asm_block = x64_asm_block_native;
+  t->file_scope_asm = x64_file_scope_asm;
+  t->trap = x64_trap;
+  t->set_loc = x64_set_loc;
+  t->finalize = x64_finalize;
+  return t;
+}
+
+/* ============================ NativeOps (-O0) ============================ */
+
+static void x64_bind_param(NativeDirectTarget* d, const CGParamDesc* p,
+                           CGLocal local, NativeDirectLocal* l) {
+  NativeLoc dst;
+  (void)local;
+  memset(&dst, 0, sizeof dst);
+  dst.kind = NATIVE_LOC_FRAME;
+  dst.type = p->type;
+  dst.v.frame = l->home;
+  x64_bind_native_param(d->native, p, dst);
+}
+
+static const char* x64_no_tail(NativeDirectTarget* d, const CGCallDesc* call) {
+  (void)d;
+  (void)call;
+  return "x64 tail calls not implemented yet";
+}
+
+/* Resolve a pointer-typed Operand (the address of a va_list object) to a
+ * NativeAddr; an OPK_LOCAL pointer is loaded from its home slot into R11. */
+static NativeAddr x64_direct_pointer_addr(NativeDirectTarget* d, Operand op) {
+  X64NativeTarget* a = x64_of(d->native);
+  NativeAddr addr;
+  memset(&addr, 0, sizeof addr);
+  if (op.kind == OPK_LOCAL) {
+    emit_mov_load(a->base.mc, 8, 0, X64_R11, X64_RBP,
+                  -(i32)x64_slot_get(a, d->locals[op.v.local - 1u].home)->off);
+    addr.base_kind = NATIVE_ADDR_BASE_REG;
+    addr.base.reg = X64_R11;
+    addr.base_type = op.type;
+    return addr;
+  }
+  return x64_direct_materialize_addr(d, op);
+}
+
+static NativeAddr x64_direct_va_base(NativeDirectTarget* d, Operand ap_addr,
+                                     Reg reg) {
+  NativeLoc dst = x64_reg_loc(builtin_id(CFREE_CG_BUILTIN_I64), NATIVE_REG_INT,
+                              reg);
+  NativeAddr addr;
+  d->native->load_addr(d->native, dst, x64_direct_pointer_addr(d, ap_addr));
+  memset(&addr, 0, sizeof addr);
+  addr.base_kind = NATIVE_ADDR_BASE_REG;
+  addr.cls = NATIVE_REG_INT;
+  addr.base.reg = reg;
+  addr.base_type = builtin_id(CFREE_CG_BUILTIN_I64);
+  return addr;
+}
+
+static void x64_va_start_(NativeDirectTarget* d, Operand ap_addr) {
+  /* Hold the va_list base in R11, not RAX: x64_va_start_core materializes the
+   * gp/fp_offset and overflow/reg-save-area field values through RAX, which
+   * would otherwise clobber the base before the field stores. */
+  x64_va_start_core(x64_of(d->native), x64_direct_va_base(d, ap_addr, X64_R11));
+}
+static void x64_va_arg_(NativeDirectTarget* d, Operand dst, Operand ap_addr,
+                        CfreeCgTypeId type) {
+  X64NativeTarget* a = x64_of(d->native);
+  int is_fp = cg_type_is_float(d->base.c, type);
+  NativeLoc res = x64_reg_loc(type, is_fp ? NATIVE_REG_FP : NATIVE_REG_INT,
+                              is_fp ? X64_TMP_FP : (Reg)X64_RDX);
+  NativeAddr dst_addr;
+  /* Base in R11 (not RAX/R10, which the va_arg core uses as scratch). */
+  x64_va_arg_core(a, res, x64_direct_va_base(d, ap_addr, X64_R11), type);
+  dst_addr = x64_direct_addr(d, dst);
+  if (dst_addr.base_kind == NATIVE_ADDR_BASE_FRAME_VALUE) {
+    emit_mov_load(a->base.mc, 8, 0, X64_R11, X64_RBP,
+                  -(i32)x64_slot_get(a, dst_addr.base.frame)->off);
+    dst_addr.base_kind = NATIVE_ADDR_BASE_REG;
+    dst_addr.base.reg = X64_R11;
+  }
+  x64_emit_mem(a, 0, res, dst_addr,
+               x64_mem_for_type(d->native, type, x64_type_size(d->native, type)));
+}
+static void x64_va_end_(NativeDirectTarget* d, Operand ap_addr) {
+  (void)d;
+  (void)ap_addr;
+}
+static void x64_va_copy_(NativeDirectTarget* d, Operand dst, Operand src) {
+  X64NativeTarget* a = x64_of(d->native);
+  NativeAddr src_ap = x64_direct_va_base(d, src, X64_RAX);
+  NativeAddr dst_ap = x64_direct_va_base(d, dst, X64_R11);
+  x64_va_copy_core(a, dst_ap, src_ap);
+}
+
+static void x64_direct_asm_block(NativeDirectTarget* d, const char* tmpl,
+                                 const AsmConstraint* outs, u32 nout,
+                                 Operand* out_ops, const AsmConstraint* ins,
+                                 u32 nin, const Operand* in_ops,
+                                 const Sym* clobbers, u32 nclob) {
+  X64NativeTarget* a = x64_of(d->native);
+  Compiler* c = d->base.c;
+  Operand* bound_outs = nout ? arena_zarray(c->tu, Operand, nout) : NULL;
+  Operand* bound_ins = nin ? arena_zarray(c->tu, Operand, nin) : NULL;
+  u32 clob_int, clob_fp, used_int, used_fp;
+  X64AsmSavedClobber* saved;
+  u32 nsaved, i;
+  X64Asm* asmh;
+
+  x64_asm_clobber_masks(c, d->loc, clobbers, nclob, &clob_int, &clob_fp);
+  /* Reserve emit scratch (rax,r11), driver scratch, sp/bp, and clobbers. */
+  used_int = clob_int | (1u << X64_RAX) | (1u << X64_R11) | (1u << X64_RSP) |
+             (1u << X64_RBP) | (1u << X64_RBX) | (1u << X64_R12) |
+             (1u << X64_R10);
+  used_fp = clob_fp | (1u << (X64_XMM0 + 12)) | (1u << (X64_XMM0 + 13)) |
+            (1u << (X64_XMM0 + 14)) | (1u << X64_XMM15);
+
+  for (i = 0; i < nout; ++i) {
+    const char* body = x64_asm_constraint_body(outs[i].str);
+    CfreeCgTypeId type = outs[i].type ? outs[i].type : out_ops[i].type;
+    if (body[0] == 'r' || body[0] == 'x') {
+      NativeAllocClass cls = x64_asm_constraint_class(d, body);
+      Reg reg = x64_asm_alloc_reg(d, cls, &used_int, &used_fp);
+      x64_asm_bound_reg(&bound_outs[i], type, cls, reg);
+      if (outs[i].dir == ASM_INOUT)
+        x64_direct_load_operand_to_reg(d, out_ops[i], x64_reg_loc(type, cls, reg));
+    } else if (body[0] == 'm') {
+      Reg reg = x64_asm_alloc_reg(d, NATIVE_REG_INT, &used_int, &used_fp);
+      NativeLoc lloc =
+          x64_reg_loc(builtin_id(CFREE_CG_BUILTIN_I64), NATIVE_REG_INT, reg);
+      x64_direct_load_address_to_reg(d, out_ops[i], lloc);
+      x64_asm_bound_mem(&bound_outs[i], type, reg);
+    } else {
+      x64_asm_panic(d, "unsupported output constraint");
+    }
+  }
+
+  for (i = 0; i < nin; ++i) {
+    const char* body = x64_asm_constraint_body(ins[i].str);
+    int matched = x64_asm_match_index(body);
+    CfreeCgTypeId type = ins[i].type ? ins[i].type : in_ops[i].type;
+    if (matched >= 0) {
+      if ((u32)matched >= nout)
+        x64_asm_panic(d, "matching constraint out of range");
+      if (x64_asm_constraint_early(outs[matched].str))
+        x64_asm_panic(d, "matching input names early-clobber output");
+      if (bound_outs[matched].kind != X64_INLINE_OPK_REG)
+        x64_asm_panic(d, "matching constraint requires register output");
+      bound_ins[i] = bound_outs[matched];
+      x64_direct_load_operand_to_reg(
+          d, in_ops[i],
+          x64_reg_loc(bound_ins[i].type,
+                      bound_ins[i].pad[0] == X64_INLINE_OPCLS_FP ? NATIVE_REG_FP
+                                                                 : NATIVE_REG_INT,
+                      (Reg)bound_ins[i].v.local));
+      continue;
+    }
+    if (body[0] == 'r' || body[0] == 'x') {
+      NativeAllocClass cls = x64_asm_constraint_class(d, body);
+      Reg reg = x64_asm_alloc_reg(d, cls, &used_int, &used_fp);
+      x64_asm_bound_reg(&bound_ins[i], type, cls, reg);
+      x64_direct_load_operand_to_reg(d, in_ops[i], x64_reg_loc(type, cls, reg));
+    } else if (body[0] == 'i') {
+      if (in_ops[i].kind != OPK_IMM)
+        x64_asm_panic(d, "immediate constraint requires immediate operand");
+      bound_ins[i] = in_ops[i];
+    } else if (body[0] == 'm') {
+      Reg reg = x64_asm_alloc_reg(d, NATIVE_REG_INT, &used_int, &used_fp);
+      NativeLoc lloc =
+          x64_reg_loc(builtin_id(CFREE_CG_BUILTIN_I64), NATIVE_REG_INT, reg);
+      x64_direct_load_address_to_reg(d, in_ops[i], lloc);
+      x64_asm_bound_mem(&bound_ins[i], type, reg);
+    } else {
+      x64_asm_panic(d, "unsupported input constraint");
+    }
+  }
+
+  saved = x64_asm_save_callee_clobbers(a, clob_int, clob_fp, &nsaved);
+  asmh = x64_asm_open(c);
+  x64_inline_bind(asmh, outs, nout, bound_outs, ins, nin, bound_ins, clobbers,
+                  nclob);
+  x64_asm_run_template(asmh, d->native->mc, tmpl);
+  x64_asm_close(asmh);
+
+  for (i = 0; i < nout; ++i) {
+    NativeAllocClass cls;
+    NativeLoc src;
+    if (bound_outs[i].kind != X64_INLINE_OPK_REG) continue;
+    cls = bound_outs[i].pad[0] == X64_INLINE_OPCLS_FP ? NATIVE_REG_FP
+                                                      : NATIVE_REG_INT;
+    src = x64_reg_loc(bound_outs[i].type, cls, (Reg)bound_outs[i].v.local);
+    x64_direct_store_reg_to_operand(d, out_ops[i], src);
+  }
+  for (i = nsaved; i > 0; --i) x64_asm_restore_one(a, &saved[i - 1u]);
+}
+
+static const NativeOps x64_direct_ops = {
+    .bind_param = x64_bind_param,
+    .tail_call_unrealizable_reason = x64_no_tail,
+    .va_start_ = x64_va_start_,
+    .va_arg_ = x64_va_arg_,
+    .va_end_ = x64_va_end_,
+    .va_copy_ = x64_va_copy_,
+    .asm_block = x64_direct_asm_block,
+};
+
+const NativeOps* x64_native_direct_ops(void) { return &x64_direct_ops; }
diff --git a/src/arch/x64/ops.c b/src/arch/x64/ops.c
@@ -1,2939 +0,0 @@
-/* arch/x64/ops.c — data movement, arithmetic, calls, atomics, intrinsics,
- * and the vtable constructor x64_cgtarget_new.
- *
- * Covers: x_load_imm, x_load_const, x_copy, x_load, x_store, x_addr_of,
- * x_tls_addr_of, x_copy_bytes, x_set_bytes, x_bitfield_load/store,
- * x_binop, x_unop, x_convert, emit_arg_value, x_call, x_ret,
- * x_alloca_, x_va_start_, x_va_arg_, x_va_end_, x_va_copy_,
- * emit_lock_*, x_atomic_load/store/rmw/cas, x_fence,
- * emit_popcnt, emit_bs, emit_bswap, emit_rol16_imm8, emit_xor_imm32,
- * x_intrinsic, x_asm_block, x_set_loc, x_finalize, x_destroy,
- * x64_cgtarget_new. */
-
-#include <string.h>
-
-#include "arch/mc.h"
-#include "arch/x64/asm.h"
-#include "arch/x64/internal.h"
-#include "arch/x64/isa.h"
-#include "arch/x64/x64.h"
-#include "cfree/config.h"
-#include "core/arena.h"
-#include "core/pool.h"
-#include "core/slice.h"
-#include "obj/obj.h"
-
-/* ============================================================
- * Data movement */
-
-static void x_load_imm(CGTarget* t, Operand dst, i64 imm) {
-  int w = type_is_64(dst.type) ? 1 : 0;
-  x64_emit_load_imm(t->mc, w, dst.v.reg & 0xFu, imm);
-}
-
-/* Materialize an FP literal: stash bytes in .rodata as a fresh local
- * symbol, then load via RIP-relative movss/movsd. */
-static void x_load_const(CGTarget* t, Operand dst, ConstBytes cb) {
-  XImpl* a = impl_of(t);
-  if (dst.cls != RC_FP)
-    compiler_panic(t->c, a->loc, "x64 load_const: only FP supported in v1");
-
-  Sym ro_name = pool_intern_slice(t->c->global, SLICE_LIT(".rodata"));
-  ObjSecId ro = obj_section(t->obj, ro_name, SEC_RODATA, SF_ALLOC, 1u);
-
-  u32 cur_section = t->mc->section_id;
-  t->mc->set_section(t->mc, ro);
-  u32 ro_off = obj_align_to(t->obj, ro, cb.align ? cb.align : 4);
-  t->mc->emit_bytes(t->mc, cb.bytes, cb.size);
-
-  char namebuf[64];
-  static u32 lit_seq = 0;
-  int len = 0;
-  const char* prefix = ".LCFP_x64_";
-  for (; prefix[len]; ++len) namebuf[len] = prefix[len];
-  u32 v = lit_seq++;
-  char tmp[16];
-  int tn = 0;
-  if (v == 0)
-    tmp[tn++] = '0';
-  else
-    while (v) {
-      tmp[tn++] = '0' + (char)(v % 10);
-      v /= 10;
-    }
-  for (int i = tn - 1; i >= 0; --i) namebuf[len++] = tmp[i];
-  namebuf[len] = 0;
-
-  Sym sname = pool_intern_slice(t->c->global, slice_from_cstr(namebuf));
-  ObjSymId sym = obj_symbol(t->obj, sname, SB_LOCAL, SK_OBJ, ro, (u64)ro_off,
-                            (u64)cb.size);
-  t->mc->set_section(t->mc, cur_section);
-
-  /* movs{s,d} xmm, [rip+disp32]. Reloc R_PC32 with addend=-4 at the
-   * disp32 site so the linker resolves to target relative to end-of-insn. */
-  u8 prefix2 = (cb.size == 8) ? 0xF2 : 0xF3;
-  u32 dst_x = dst.v.reg & 0xFu;
-  t->mc->emit_bytes(t->mc, &prefix2, 1);
-  emit_rex(t->mc, 0, dst_x, 0, 0);
-  u8 op[2] = {0x0F, 0x10};
-  t->mc->emit_bytes(t->mc, op, 2);
-  u8 mr = modrm(0u, (dst_x & 7u), 5u); /* [RIP + disp32] */
-  t->mc->emit_bytes(t->mc, &mr, 1);
-  u32 disp_pos = t->mc->pos(t->mc);
-  emit_u32le(t->mc, 0);
-  t->mc->emit_reloc_at(t->mc, cur_section, disp_pos, R_PC32, sym, -4, 1, 0);
-}
-
-static void x_copy(CGTarget* t, Operand dst, Operand src) {
-  if (dst.cls == RC_FP && src.cls == RC_INT) {
-    u32 sz = type_byte_size(dst.type);
-    int w = sz == 8 ? 1 : 0;
-    emit_sse_rr_w(t->mc, 0x66, 0x6E, w, dst.v.reg & 0xFu, src.v.reg & 0xFu);
-    return;
-  }
-  if (dst.cls == RC_INT && src.cls == RC_FP) {
-    u32 sz = type_byte_size(src.type);
-    int w = sz == 8 ? 1 : 0;
-    emit_sse_rr_w(t->mc, 0x66, 0x7E, w, src.v.reg & 0xFu, dst.v.reg & 0xFu);
-    return;
-  }
-  if (dst.cls == RC_FP || src.cls == RC_FP) {
-    u8 prefix2 = type_is_fp_double(dst.type) ? 0xF2 : 0xF3;
-    emit_sse_rr(t->mc, prefix2, 0x10, dst.v.reg & 0xFu, src.v.reg & 0xFu);
-    return;
-  }
-  int w = type_is_64(dst.type) ? 1 : 0;
-  emit_mov_rr(t->mc, w, dst.v.reg & 0xFu, src.v.reg & 0xFu);
-}
-
-/* Resolve an addr operand to the full effective-address tuple
- * (base, index, log2_scale, ofs). `OPK_LOCAL` resolves to its RBP-relative
- * slot offset with no index. `OPK_INDIRECT` carries the EA verbatim:
- * `index == REG_NONE` for plain base+disp, otherwise the SIB scaled-index
- * form (`log2_scale ∈ {0,1,2,3}` for byte scale 1/2/4/8). */
-static u32 addr_mode(CGTarget* t, Operand addr, u32* out_index,
-                     u32* out_log2_scale, i32* out_off) {
-  XImpl* a = impl_of(t);
-  if (addr.kind == OPK_LOCAL) {
-    XSlot* s = x64_slot_get(a, addr.v.frame_slot);
-    if (!s) compiler_panic(t->c, a->loc, "x64 addr_mode: bad slot");
-    *out_index = REG_NONE;
-    *out_log2_scale = 0;
-    *out_off = -(i32)s->off;
-    return X64_RBP;
-  }
-  if (addr.kind == OPK_INDIRECT) {
-    *out_index =
-        (addr.v.ind.index == REG_NONE) ? REG_NONE : (addr.v.ind.index & 0xFu);
-    *out_log2_scale = addr.v.ind.log2_scale;
-    *out_off = addr.v.ind.ofs;
-    return addr.v.ind.base & 0xFu;
-  }
-  compiler_panic(t->c, a->loc, "x64 addr_mode: kind %d unsupported",
-                 (int)addr.kind);
-}
-
-/* Plain-base+disp accessor for non-load/store paths (atomics, calls,
- * spill/reload, copy_bytes/set_bytes, inline asm). Per the EA contract,
- * those paths always see `index == REG_NONE`; assert that here so any
- * regression is caught at the boundary. */
-static u32 addr_base(CGTarget* t, Operand addr, i32* out_off) {
-  u32 idx, ls;
-  u32 base = addr_mode(t, addr, &idx, &ls, out_off);
-  if (idx != REG_NONE) {
-    compiler_panic(t->c, impl_of(t)->loc,
-                   "x64 addr_base: indexed addr in non-load/store path");
-  }
-  return base;
-}
-
-static int x64_use_got_for_sym(CGTarget* t, ObjSymId sym) {
-  return obj_symbol_extern_via_got(t->c, t->obj, sym);
-}
-
-/* Pick the PC-relative reloc kind for a non-GOT &sym reference.
- * Function symbols use R_X64_PLT32 so the linker can route through a
- * PLT trampoline when needed (calls into a DSO; address-taken function
- * pointers that must agree across DSOs).  Data symbols use the plain
- * R_PC32: PLT32 happens to encode identically when the linker resolves
- * the reference locally, but strict linkers warn when a data symbol
- * carries a PLT-flavored reloc. */
-static u32 x64_pcrel_reloc_for_sym(CGTarget* t, ObjSymId sym) {
-  const ObjSym* s = obj_symbol_get(t->obj, sym);
-  if (s && (s->kind == SK_FUNC || s->kind == SK_IFUNC)) return R_X64_PLT32;
-  return R_PC32;
-}
-
-/* Materialize `&sym + addend` into `dst_reg`.  For locally-defined or
- * static-link extern symbols, emit `lea rd, [rip + disp32]` with
- * R_PC32 for data symbols or R_X64_PLT32 for functions (PLT32 collapses
- * to a plain PC-relative LEA at link time — the PLT routing only fires
- * when the linker actually needs the trampoline, i.e. function calls
- * or address-taken funcs into a DSO).  For undef externs in PIC/PIE we
- * instead emit `mov rd, [rip + disp32]` against a GOT slot
- * (R_X64_REX_GOTPCRELX) so the loader can resolve the symbol by
- * patching a single slot rather than touching .text.
- *
- * Addend -4 because the PC is end-of-instruction.  When routing
- * through the GOT we omit any extra addend on the reloc (most loaders
- * disallow nonzero addends on GOT-load fixups); a follow-up `add` /
- * `lea` would have to add it after the load if the codegen needed
- * `&sym + nonzero`.  In practice the caller only ever passes
- * addend=0 for global references that go through the GOT path. */
-static void emit_global_lea(CGTarget* t, u32 dst_reg, ObjSymId sym,
-                            i64 addend) {
-  if (x64_use_got_for_sym(t, sym)) {
-    /* mov rd, [rip + disp32] */
-    emit_rex(t->mc, 1, dst_reg, 0, 0);
-    u8 op = 0x8B;
-    t->mc->emit_bytes(t->mc, &op, 1);
-    u8 mr = modrm(0u, (dst_reg & 7u), 5u); /* [RIP + disp32] */
-    t->mc->emit_bytes(t->mc, &mr, 1);
-    u32 disp_pos = t->mc->pos(t->mc);
-    emit_u32le(t->mc, 0);
-    t->mc->emit_reloc_at(t->mc, t->mc->section_id, disp_pos,
-                         R_X64_REX_GOTPCRELX, sym, -4, 1, 0);
-    /* Apply any nonzero addend by adjusting the loaded value. */
-    if (addend) {
-      i32 a = (i32)addend;
-      if (a >= -128 && a <= 127) {
-        /* add r/m64, imm8 (REX.W + 0x83 /0 ib) */
-        emit_rex(t->mc, 1, 0, 0, dst_reg);
-        u8 add_op[2] = {0x83, modrm(3u, 0u, (u8)(dst_reg & 7u))};
-        t->mc->emit_bytes(t->mc, add_op, 2);
-        u8 ib = (u8)a;
-        t->mc->emit_bytes(t->mc, &ib, 1);
-      } else {
-        /* add r/m64, imm32 (REX.W + 0x81 /0 id) */
-        emit_rex(t->mc, 1, 0, 0, dst_reg);
-        u8 add_op[2] = {0x81, modrm(3u, 0u, (u8)(dst_reg & 7u))};
-        t->mc->emit_bytes(t->mc, add_op, 2);
-        emit_u32le(t->mc, (u32)a);
-      }
-    }
-    return;
-  }
-  emit_rex(t->mc, 1, dst_reg, 0, 0);
-  u8 op = 0x8D;
-  t->mc->emit_bytes(t->mc, &op, 1);
-  u8 mr = modrm(0u, (dst_reg & 7u), 5u); /* [RIP + disp32] */
-  t->mc->emit_bytes(t->mc, &mr, 1);
-  u32 disp_pos = t->mc->pos(t->mc);
-  emit_u32le(t->mc, 0);
-  t->mc->emit_reloc_at(t->mc, t->mc->section_id, disp_pos,
-                       x64_pcrel_reloc_for_sym(t, sym), sym, addend - 4, 1, 0);
-}
-
-/* Emit a single PC-relative GPR `mov reg, sym(%rip)` (load) or
- * `mov sym(%rip), reg` (store).  Saves one instruction and one scratch
- * register vs. the lea-then-indirect-mov pair the GOT path needs.
- * Caller guarantees the symbol is not GOT-routed. */
-static void emit_global_pcrel_gpr(CGTarget* t, u32 sz, int signed_ext,
-                                  int is_store, u32 reg, ObjSymId sym,
-                                  i64 addend) {
-  MCEmitter* mc = t->mc;
-  /* RIP-relative addressing: mod=00, r/m=101, disp32; pass base=0
-   * to emit_rex so REX.B stays clear (RIP isn't an extended reg). */
-  if (sz == 8) {
-    emit_rex(mc, 1, reg, 0, 0);
-    u8 op = is_store ? 0x89 : 0x8B;
-    mc->emit_bytes(mc, &op, 1);
-  } else if (sz == 4) {
-    emit_rex(mc, 0, reg, 0, 0);
-    u8 op = is_store ? 0x89 : 0x8B;
-    mc->emit_bytes(mc, &op, 1);
-  } else if (sz == 2) {
-    if (is_store) {
-      u8 p = 0x66;
-      mc->emit_bytes(mc, &p, 1);
-      emit_rex(mc, 0, reg, 0, 0);
-      u8 op = 0x89;
-      mc->emit_bytes(mc, &op, 1);
-    } else {
-      emit_rex(mc, 0, reg, 0, 0);
-      u8 op[2] = {0x0F, signed_ext ? 0xBFu : 0xB7u};
-      mc->emit_bytes(mc, op, 2);
-    }
-  } else if (sz == 1) {
-    if (is_store) {
-      /* Force REX so SIL/DIL/etc are addressable as byte regs. */
-      emit_rex_force(mc, 0, reg, 0, 0);
-      u8 op = 0x88;
-      mc->emit_bytes(mc, &op, 1);
-    } else {
-      emit_rex(mc, 0, reg, 0, 0);
-      u8 op[2] = {0x0F, signed_ext ? 0xBEu : 0xB6u};
-      mc->emit_bytes(mc, op, 2);
-    }
-  }
-  u8 mr = modrm(0u, (reg & 7u), 5u); /* [RIP + disp32] */
-  mc->emit_bytes(mc, &mr, 1);
-  u32 disp_pos = mc->pos(mc);
-  emit_u32le(mc, 0);
-  mc->emit_reloc_at(mc, mc->section_id, disp_pos,
-                    x64_pcrel_reloc_for_sym(t, sym), sym, addend - 4, 1, 0);
-}
-
-/* Emit a single PC-relative SSE `movs[sd] xmm, sym(%rip)` (load) or
- * `movs[sd] sym(%rip), xmm` (store).  Caller guarantees the symbol is
- * not GOT-routed. */
-static void emit_global_pcrel_sse(CGTarget* t, u32 sz, int is_store, u32 reg,
-                                  ObjSymId sym, i64 addend) {
-  MCEmitter* mc = t->mc;
-  u8 prefix2 = (sz == 8) ? 0xF2u : 0xF3u;
-  mc->emit_bytes(mc, &prefix2, 1);
-  emit_rex(mc, 0, reg, 0, 0);
-  u8 op[2] = {0x0Fu, is_store ? 0x11u : 0x10u};
-  mc->emit_bytes(mc, op, 2);
-  u8 mr = modrm(0u, (reg & 7u), 5u); /* [RIP + disp32] */
-  mc->emit_bytes(mc, &mr, 1);
-  u32 disp_pos = mc->pos(mc);
-  emit_u32le(mc, 0);
-  mc->emit_reloc_at(mc, mc->section_id, disp_pos,
-                    x64_pcrel_reloc_for_sym(t, sym), sym, addend - 4, 1, 0);
-}
-
-void x_load(CGTarget* t, Operand dst, Operand addr, MemAccess ma) {
-  u32 sz = ma.size ? ma.size : type_byte_size(addr.type);
-
-  if (addr.kind == OPK_GLOBAL) {
-    ObjSymId sym = addr.v.global.sym;
-    i64 addend = addr.v.global.addend;
-    if (!x64_use_got_for_sym(t, sym)) {
-      /* Locally-resolvable: fold lea+load into a single PC-relative mov. */
-      if (dst.cls == RC_FP) {
-        emit_global_pcrel_sse(t, sz, 0, dst.v.reg & 0xFu, sym, addend);
-      } else {
-        int signed_ = type_is_signed(ma.type ? ma.type : addr.type);
-        emit_global_pcrel_gpr(t, sz, signed_, 0, dst.v.reg & 0xFu, sym, addend);
-      }
-      return;
-    }
-    /* GOT path: materialize &sym into R11, then load from [r11]. */
-    emit_global_lea(t, X64_R11, sym, addend);
-    if (dst.cls == RC_FP) {
-      u8 prefix2 = (sz == 8) ? 0xF2 : 0xF3;
-      emit_sse_load(t->mc, prefix2, 0x10, dst.v.reg & 0xFu, X64_R11, 0);
-    } else {
-      int signed_ = type_is_signed(ma.type ? ma.type : addr.type);
-      emit_mov_load(t->mc, sz, signed_, dst.v.reg & 0xFu, X64_R11, 0);
-    }
-    return;
-  }
-
-  i32 off;
-  u32 idx, ls;
-  u32 base = addr_mode(t, addr, &idx, &ls, &off);
-  if (dst.cls == RC_FP) {
-    u8 prefix2 = (sz == 8) ? 0xF2 : 0xF3;
-    emit_sse_load_idx(t->mc, prefix2, 0x10, dst.v.reg & 0xFu, base, idx, ls,
-                      off);
-  } else {
-    int signed_ = type_is_signed(ma.type ? ma.type : addr.type);
-    emit_mov_load_idx(t->mc, sz, signed_, dst.v.reg & 0xFu, base, idx, ls, off);
-  }
-}
-
-void x_store(CGTarget* t, Operand addr, Operand src, MemAccess ma) {
-  u32 sz = ma.size ? ma.size : type_byte_size(addr.type);
-
-  if (addr.kind == OPK_GLOBAL) {
-    ObjSymId sym = addr.v.global.sym;
-    i64 addend = addr.v.global.addend;
-    if (!x64_use_got_for_sym(t, sym)) {
-      /* Locally-resolvable: fold lea+store into a single PC-relative mov. */
-      if (src.kind == OPK_IMM) {
-        int w = (sz == 8) ? 1 : 0;
-        x64_emit_load_imm(t->mc, w, X64_RAX, src.v.imm);
-        emit_global_pcrel_gpr(t, sz, 0, 1, X64_RAX, sym, addend);
-        return;
-      }
-      if (src.cls == RC_FP) {
-        emit_global_pcrel_sse(t, sz, 1, src.v.reg & 0xFu, sym, addend);
-        return;
-      }
-      emit_global_pcrel_gpr(t, sz, 0, 1, src.v.reg & 0xFu, sym, addend);
-      return;
-    }
-    /* GOT path: materialize &sym into R11, then store via [r11]. The
-     * IMM source branch below uses RAX as a scratch for the value, so
-     * R11 stays untouched between the LEA and the store. */
-    emit_global_lea(t, X64_R11, sym, addend);
-    if (src.kind == OPK_IMM) {
-      int w = (sz == 8) ? 1 : 0;
-      x64_emit_load_imm(t->mc, w, X64_RAX, src.v.imm);
-      emit_mov_store(t->mc, sz, X64_RAX, X64_R11, 0);
-      return;
-    }
-    if (src.cls == RC_FP) {
-      u8 prefix2 = (sz == 8) ? 0xF2 : 0xF3;
-      emit_sse_store(t->mc, prefix2, 0x11, src.v.reg & 0xFu, X64_R11, 0);
-      return;
-    }
-    emit_mov_store(t->mc, sz, src.v.reg & 0xFu, X64_R11, 0);
-    return;
-  }
-
-  i32 off;
-  u32 idx, ls;
-  u32 base = addr_mode(t, addr, &idx, &ls, &off);
-
-  if (src.kind == OPK_IMM) {
-    int w = (sz == 8) ? 1 : 0;
-    x64_emit_load_imm(t->mc, w, X64_RAX, src.v.imm);
-    emit_mov_store_idx(t->mc, sz, X64_RAX, base, idx, ls, off);
-    return;
-  }
-  if (src.cls == RC_FP) {
-    u8 prefix2 = (sz == 8) ? 0xF2 : 0xF3;
-    emit_sse_store_idx(t->mc, prefix2, 0x11, src.v.reg & 0xFu, base, idx, ls,
-                       off);
-    return;
-  }
-  emit_mov_store_idx(t->mc, sz, src.v.reg & 0xFu, base, idx, ls, off);
-}
-
-static void x_addr_of(CGTarget* t, Operand dst, Operand lv) {
-  XImpl* a = impl_of(t);
-  if (lv.kind == OPK_LOCAL) {
-    XSlot* s = x64_slot_get(a, lv.v.frame_slot);
-    if (!s) compiler_panic(t->c, a->loc, "x64 addr_of: bad slot");
-    emit_lea(t->mc, dst.v.reg & 0xFu, X64_RBP, -(i32)s->off);
-    return;
-  }
-  if (lv.kind == OPK_INDIRECT) {
-    if (lv.v.ind.index != REG_NONE) {
-      x_panic(t, "addr_of: indexed INDIRECT lvalue (cg should fold)");
-    }
-    emit_lea(t->mc, dst.v.reg & 0xFu, lv.v.ind.base & 0xFu, lv.v.ind.ofs);
-    return;
-  }
-  if (lv.kind == OPK_GLOBAL) {
-    emit_global_lea(t, dst.v.reg & 0xFu, lv.v.global.sym, lv.v.global.addend);
-    return;
-  }
-  x_panic(t, "addr_of: kind unsupported");
-}
-
-/* Win64 TLS Local-Exec materialization (PE-COFF).
- *
- * Sequence (4 instructions, 27-28 bytes depending on register encoding):
- *   mov  rd,  gs:[0x58]              ; TEB.ThreadLocalStoragePointer
- *   mov  r11d,[rip + _tls_index]     ; per-image TLS slot index
- *   mov  rd,  [rd + r11*8]           ; TLS block base for this image
- *   lea  rd,  [rd + sym@SECREL]      ; rd = &sym
- *
- * `_tls_index` is a u32 the CRT defines for each image; the linker
- * resolves the RIP-relative load. The LEA's disp32 carries
- * IMAGE_REL_AMD64_SECREL (via R_COFF_SECREL) against the TLS data
- * symbol — the linker fills in the symbol's offset from the start of
- * the merged .tls section, which matches what gs:[0x58]+index lookup
- * lands on at runtime. R11 is caller-saved under Win64; we use it
- * unconditionally as scratch so we don't have to special-case
- * rd == rcx. */
-static void x_tls_addr_of_win64(CGTarget* t, Operand dst, ObjSymId sym,
-                                i64 addend) {
-  MCEmitter* mc = t->mc;
-  u32 sec = mc->section_id;
-  u32 rd = dst.v.reg & 0xFu;
-
-  /* (1) mov rd, gs:[0x58]: 65 [REX.W|R?] 8B mod=00/reg=rd/rm=100 sib disp32. */
-  u8 gs_prefix = 0x65;
-  mc->emit_bytes(mc, &gs_prefix, 1);
-  emit_rex(mc, 1, rd, 0, 0);
-  u8 op_mov_load = 0x8B;
-  mc->emit_bytes(mc, &op_mov_load, 1);
-  u8 mr1 = modrm(0u, rd & 7u, 4u);
-  mc->emit_bytes(mc, &mr1, 1);
-  u8 s1 = sib(0u, 4u, 5u);
-  mc->emit_bytes(mc, &s1, 1);
-  emit_u32le(mc, 0x58u);
-
-  /* (2) mov r11d, [rip + _tls_index]: 44 8B 1D disp32. */
-  Sym idx_name = pool_intern_slice(t->c->global, SLICE_LIT("_tls_index"));
-  ObjSymId idx_sym = obj_symbol_find(t->obj, idx_name);
-  if (idx_sym == 0) {
-    idx_sym =
-        obj_symbol(t->obj, idx_name, SB_GLOBAL, SK_UNDEF, OBJ_SEC_NONE, 0, 0);
-  }
-  u8 rex_r_only = X64_REX_BASE | X64_REX_R; /* R11 in ModRM.reg. */
-  mc->emit_bytes(mc, &rex_r_only, 1);
-  u8 op_mov_load_32 = 0x8B;
-  mc->emit_bytes(mc, &op_mov_load_32, 1);
-  u8 mr2 = modrm(0u, 3u /* r11 & 7 */, 5u /* RIP-rel */);
-  mc->emit_bytes(mc, &mr2, 1);
-  u32 idx_disp_pos = mc->pos(mc);
-  emit_u32le(mc, 0);
-  mc->emit_reloc_at(mc, sec, idx_disp_pos, R_PC32, idx_sym, -4, 1, 0);
-
-  /* (3) mov rd, [rd + r11*8]: REX.W + REX.X (index = r11) + REX.R/REX.B
-   *     (reg/base = rd >= 8) + 8B modrm(mod, reg=rd&7, rm=4=SIB)
-   *     sib(scale=3, index=3=r11&7, base=rd&7). When base&7 == 5 (rbp/r13),
-   *     mod=00 means "no base, disp32 only"; force mod=01 with disp8=0 to
-   *     actually mean [rd + r11*8 + 0]. */
-  u8 rex3 = X64_REX_BASE | X64_REX_W | X64_REX_X;
-  if (rd & 8) rex3 |= X64_REX_R; /* reg = rd */
-  if (rd & 8) rex3 |= X64_REX_B; /* base = rd */
-  mc->emit_bytes(mc, &rex3, 1);
-  u8 op_mov_load2 = 0x8B;
-  mc->emit_bytes(mc, &op_mov_load2, 1);
-  if ((rd & 7u) == 5u) {
-    u8 mr3 = modrm(1u, rd & 7u, 4u);
-    mc->emit_bytes(mc, &mr3, 1);
-    u8 s3 = sib(3u, 3u, rd & 7u);
-    mc->emit_bytes(mc, &s3, 1);
-    u8 zero = 0;
-    mc->emit_bytes(mc, &zero, 1);
-  } else {
-    u8 mr3 = modrm(0u, rd & 7u, 4u);
-    mc->emit_bytes(mc, &mr3, 1);
-    u8 s3 = sib(3u, 3u, rd & 7u);
-    mc->emit_bytes(mc, &s3, 1);
-  }
-
-  /* (4) lea rd, [rd + disp32@SECREL]: REX.W + (.R/.B for rd) + 8D modrm +
-   * disp32. rsp/r12 (rd&7==4) needs a SIB; rbp/r13 (rd&7==5) already takes
-   *     disp32 form natively at mod=10. */
-  u8 rex4 = X64_REX_BASE | X64_REX_W;
-  if (rd & 8) rex4 |= X64_REX_R; /* reg = rd */
-  if (rd & 8) rex4 |= X64_REX_B; /* base = rd */
-  mc->emit_bytes(mc, &rex4, 1);
-  u8 op_lea = 0x8D;
-  mc->emit_bytes(mc, &op_lea, 1);
-  u32 lea_disp_pos;
-  if ((rd & 7u) == 4u) {
-    u8 mr4 = modrm(2u, rd & 7u, 4u);
-    mc->emit_bytes(mc, &mr4, 1);
-    u8 s4 = sib(0u, 4u, rd & 7u);
-    mc->emit_bytes(mc, &s4, 1);
-    lea_disp_pos = mc->pos(mc);
-    emit_u32le(mc, 0);
-  } else {
-    u8 mr4 = modrm(2u, rd & 7u, rd & 7u);
-    mc->emit_bytes(mc, &mr4, 1);
-    lea_disp_pos = mc->pos(mc);
-    emit_u32le(mc, 0);
-  }
-  mc->emit_reloc_at(mc, sec, lea_disp_pos, R_COFF_SECREL, sym, addend, 1, 0);
-}
-
-/* x86_64 TLS Local-Exec materialization.
- *   mov rd, fs:0                 ; read thread pointer (FS base + 0)
- *   lea rd, [rd + sym@tpoff]     ; add TP-relative offset
- * The disp32 of the LEA carries an R_X64_TPOFF32 reloc; the linker fills
- * in the signed TP-relative offset (negative under variant II — TLS image
- * sits below the TCB that FS points at). */
-static void x_tls_addr_of(CGTarget* t, Operand dst, ObjSymId sym, i64 addend) {
-  MCEmitter* mc = t->mc;
-  u32 sec = mc->section_id;
-  u32 rd = dst.v.reg & 0xFu;
-
-  if (t->c->target.os == CFREE_OS_WINDOWS) {
-    x_tls_addr_of_win64(t, dst, sym, addend);
-    return;
-  }
-
-  /* mov rd, qword ptr fs:[0]
-   *   64 [REX.W|REX.R] 8B mod=00/reg=rd/rm=100 sib(0,4,5) disp32=0 */
-  u8 fs_prefix = 0x64;
-  mc->emit_bytes(mc, &fs_prefix, 1);
-  emit_rex(mc, 1, rd, 0, 0);
-  u8 op_mov = 0x8B;
-  mc->emit_bytes(mc, &op_mov, 1);
-  u8 mr1 = modrm(0u, rd & 7u, 4u);
-  mc->emit_bytes(mc, &mr1, 1);
-  u8 s1 = sib(0u, 4u, 5u);
-  mc->emit_bytes(mc, &s1, 1);
-  emit_u32le(mc, 0);
-
-  /* lea rd, [rd + disp32]
-   *   [REX.W|REX.R|REX.B] 8D mod=10/reg=rd/rm=rd [SIB if rd&7==4] disp32 */
-  emit_rex(mc, 1, rd, 0, rd);
-  u8 op_lea = 0x8D;
-  mc->emit_bytes(mc, &op_lea, 1);
-  u32 disp_pos;
-  if ((rd & 7u) == 4u) {
-    u8 mr2 = modrm(2u, rd & 7u, 4u);
-    mc->emit_bytes(mc, &mr2, 1);
-    u8 s2 = sib(0u, 4u, rd & 7u);
-    mc->emit_bytes(mc, &s2, 1);
-    disp_pos = mc->pos(mc);
-    emit_u32le(mc, 0);
-  } else {
-    u8 mr2 = modrm(2u, rd & 7u, rd & 7u);
-    mc->emit_bytes(mc, &mr2, 1);
-    disp_pos = mc->pos(mc);
-    emit_u32le(mc, 0);
-  }
-  mc->emit_reloc_at(mc, sec, disp_pos, R_X64_TPOFF32, sym, addend, 0, 0);
-}
-
-/* Aggregate ops — small unrolled memcpy/memset. */
-static u32 agg_addr_reg(CGTarget* t, Operand op, u32 scratch) {
-  if (op.kind == OPK_REG) return op.v.reg & 0xFu;
-  if (op.kind == OPK_LOCAL) {
-    XImpl* a = impl_of(t);
-    XSlot* s = x64_slot_get(a, op.v.frame_slot);
-    if (!s) compiler_panic(t->c, a->loc, "x64 agg: bad slot");
-    emit_lea(t->mc, scratch, X64_RBP, -(i32)s->off);
-    return scratch;
-  }
-  compiler_panic(t->c, impl_of(t)->loc, "x64 agg: address kind %d unsupported",
-                 (int)op.kind);
-}
-
-static void x_copy_bytes(CGTarget* t, Operand da, Operand sa,
-                         AggregateAccess g) {
-  u32 dr = agg_addr_reg(t, da, X64_R11);
-  u32 sr = agg_addr_reg(t, sa, (dr == X64_RAX) ? X64_RCX : X64_RAX);
-  u32 nbytes = g.size;
-  u32 i = 0;
-  while (i + 8 <= nbytes) {
-    emit_mov_load(t->mc, 8, 0, X64_RDX, sr, (i32)i);
-    emit_mov_store(t->mc, 8, X64_RDX, dr, (i32)i);
-    i += 8;
-  }
-  while (i + 4 <= nbytes) {
-    emit_mov_load(t->mc, 4, 0, X64_RDX, sr, (i32)i);
-    emit_mov_store(t->mc, 4, X64_RDX, dr, (i32)i);
-    i += 4;
-  }
-  while (i + 2 <= nbytes) {
-    emit_mov_load(t->mc, 2, 0, X64_RDX, sr, (i32)i);
-    emit_mov_store(t->mc, 2, X64_RDX, dr, (i32)i);
-    i += 2;
-  }
-  while (i < nbytes) {
-    emit_mov_load(t->mc, 1, 0, X64_RDX, sr, (i32)i);
-    emit_mov_store(t->mc, 1, X64_RDX, dr, (i32)i);
-    i += 1;
-  }
-}
-
-static void x_set_bytes(CGTarget* t, Operand da, Operand bv,
-                        AggregateAccess g) {
-  u32 dr = agg_addr_reg(t, da, X64_R11);
-  if (bv.kind != OPK_IMM)
-    compiler_panic(t->c, impl_of(t)->loc,
-                   "x64 set_bytes: non-IMM byte not yet supported");
-  u8 b = (u8)(bv.v.imm & 0xff);
-  u64 b64 = b;
-  b64 |= b64 << 8;
-  b64 |= b64 << 16;
-  b64 |= b64 << 32;
-  x64_emit_load_imm(t->mc, 1, X64_RAX, (i64)b64);
-  u32 nbytes = g.size;
-  u32 i = 0;
-  while (i + 8 <= nbytes) {
-    emit_mov_store(t->mc, 8, X64_RAX, dr, (i32)i);
-    i += 8;
-  }
-  while (i + 4 <= nbytes) {
-    emit_mov_store(t->mc, 4, X64_RAX, dr, (i32)i);
-    i += 4;
-  }
-  while (i + 2 <= nbytes) {
-    emit_mov_store(t->mc, 2, X64_RAX, dr, (i32)i);
-    i += 2;
-  }
-  while (i < nbytes) {
-    emit_mov_store(t->mc, 1, X64_RAX, dr, (i32)i);
-    i += 1;
-  }
-}
-
-/* Load the storage unit, then extract the field: shift left so the field's
- * top bit lands at the top of the register, then shift right by
- * (reg_size - width). SAR sign-extends signed fields, SHR zero-extends
- * unsigned ones. */
-static void x_bitfield_load(CGTarget* t, Operand dst, Operand record_addr,
-                            BitFieldAccess bf) {
-  u32 base = agg_addr_reg(t, record_addr, X64_R11);
-  u32 storage_bytes = bf.storage.size ? bf.storage.size : 4u;
-  int w = (storage_bytes == 8u) ? 1 : 0;
-  u32 reg_size = w ? 64u : 32u;
-  u32 lsb = bf.bit_offset;
-  u32 width = bf.bit_width ? bf.bit_width : 1u;
-  u32 rd = dst.v.reg & 0xFu;
-
-  emit_mov_load(t->mc, storage_bytes, 0, rd, base, (i32)bf.storage_offset);
-  u8 left = (u8)(reg_size - lsb - width);
-  u8 right = (u8)(reg_size - width);
-  if (left) emit_shift_imm(t->mc, w, 4u, rd, left);
-  if (right) emit_shift_imm(t->mc, w, bf.signed_ ? 7u : 5u, rd, right);
-}
-
-/* Read-modify-write: clear the field bits in the storage unit via AND ~mask,
- * mask/shift the source into place, OR it in, write back. RAX holds the
- * storage word; RCX is the staged value; RDX holds the source-side mask when
- * needed. Avoids touching the base register. */
-static void x_bitfield_store(CGTarget* t, Operand record_addr, Operand src,
-                             BitFieldAccess bf) {
-  u32 base = agg_addr_reg(t, record_addr, X64_R11);
-  u32 storage_bytes = bf.storage.size ? bf.storage.size : 4u;
-  int w = (storage_bytes == 8u) ? 1 : 0;
-  u32 lsb = bf.bit_offset;
-  u32 width = bf.bit_width ? bf.bit_width : 1u;
-  u64 ones = (width >= 64u) ? ~(u64)0 : (((u64)1 << width) - 1u);
-  u64 mask = ones << lsb;
-
-  emit_mov_load(t->mc, storage_bytes, 0, X64_RAX, base, (i32)bf.storage_offset);
-  x64_emit_load_imm(t->mc, w, X64_RCX, (i64)~mask);
-  emit_alu_rr(t->mc, w, 0x21, X64_RAX, X64_RCX); /* AND rax, rcx */
-
-  if (src.kind == OPK_IMM) {
-    u64 v = ((u64)src.v.imm & ones) << lsb;
-    x64_emit_load_imm(t->mc, w, X64_RCX, (i64)v);
-  } else if (src.kind == OPK_REG) {
-    emit_mov_rr(t->mc, w, X64_RCX, src.v.reg & 0xFu);
-    x64_emit_load_imm(t->mc, w, X64_RDX, (i64)ones);
-    emit_alu_rr(t->mc, w, 0x21, X64_RCX, X64_RDX); /* AND rcx, rdx */
-    if (lsb) emit_shift_imm(t->mc, w, 4u, X64_RCX, (u8)lsb);
-  } else {
-    compiler_panic(t->c, impl_of(t)->loc,
-                   "x64 bitfield_store: src kind %d unsupported",
-                   (int)src.kind);
-  }
-  emit_alu_rr(t->mc, w, 0x09, X64_RAX, X64_RCX); /* OR rax, rcx */
-  emit_mov_store(t->mc, storage_bytes, X64_RAX, base, (i32)bf.storage_offset);
-}
-
-/* ============================================================
- * Arithmetic */
-
-static void x_binop(CGTarget* t, BinOp op, Operand dst, Operand a_op,
-                    Operand b_op) {
-  MCEmitter* mc = t->mc;
-
-  /* FP binops. */
-  if (op == BO_FADD || op == BO_FSUB || op == BO_FMUL || op == BO_FDIV) {
-    u32 rd = dst.v.reg & 0xFu;
-    u32 ra = a_op.v.reg & 0xFu;
-    u32 rb = b_op.v.reg & 0xFu;
-    u8 prefix2 = type_is_fp_double(dst.type) ? 0xF2 : 0xF3;
-    u8 opcode;
-    switch (op) {
-      case BO_FADD:
-        opcode = 0x58;
-        break;
-      case BO_FSUB:
-        opcode = 0x5C;
-        break;
-      case BO_FMUL:
-        opcode = 0x59;
-        break;
-      case BO_FDIV:
-        opcode = 0x5E;
-        break;
-      default:
-        opcode = 0x58;
-        break;
-    }
-    if (rd == rb && rd != ra) {
-      if (op == BO_FADD || op == BO_FMUL) {
-        emit_sse_rr(mc, prefix2, opcode, rd, ra);
-        return;
-      }
-      emit_sse_rr(mc, prefix2, 0x10, X64_XMM15, rb);
-      emit_sse_rr(mc, prefix2, 0x10, rd, ra);
-      emit_sse_rr(mc, prefix2, opcode, rd, X64_XMM15);
-      return;
-    }
-    if (rd != ra) emit_sse_rr(mc, prefix2, 0x10, rd, ra);
-    emit_sse_rr(mc, prefix2, opcode, rd, rb);
-    return;
-  }
-
-  int w = type_is_64(dst.type) ? 1 : 0;
-  u32 rd = dst.v.reg & 0xFu;
-
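-  /* Division: idiv/div take the dividend in rdx:rax implicitly (edx:eax when
-   * w == 0) and leave the quotient in rax and the remainder in rdx. Route the
-   * divisor through r11 if it would otherwise be rax/rdx. */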
-  if (op == BO_SDIV || op == BO_UDIV || op == BO_SREM || op == BO_UREM) {
-    u32 ra = x64_force_reg_int(t, a_op, w, X64_RAX);
-    if (ra != X64_RAX) emit_mov_rr(mc, w, X64_RAX, ra);
-    u32 rb;
-    if (b_op.kind == OPK_REG) {
-      rb = b_op.v.reg & 0xFu;
-      if (rb == X64_RAX || rb == X64_RDX) {
-        emit_mov_rr(mc, w, X64_R11, rb);
-        rb = X64_R11;
-      }
-    } else if (b_op.kind == OPK_IMM) {
-      x64_emit_load_imm(mc, w, X64_R11, b_op.v.imm);
-      rb = X64_R11;
-    } else {
-      compiler_panic(t->c, impl_of(t)->loc,
-                     "x64 div: divisor kind %d unsupported", (int)b_op.kind);
-    }
-    if (op == BO_SDIV || op == BO_SREM) {
-      emit_cqo_or_cdq(mc, w);
-      emit_f7_rm(mc, w, 7u, rb); /* idiv */
-    } else {
-      emit_xor_self(mc, w, X64_RDX);
-      emit_f7_rm(mc, w, 6u, rb); /* div */
-    }
-    u32 result_reg = (op == BO_SREM || op == BO_UREM) ? X64_RDX : X64_RAX;
-    if (rd != result_reg) emit_mov_rr(mc, w, rd, result_reg);
-    return;
-  }
-
-  /* Shifts: shift count must be in cl OR encoded as imm8 directly (C1
-   * /sub ib). Use the imm form when b is OPK_IMM and skip materializing
-   * into cl. */
-  if (op == BO_SHL || op == BO_SHR_U || op == BO_SHR_S) {
-    u32 ra = x64_force_reg_int(t, a_op, w, X64_RAX);
-    u32 sub = (op == BO_SHL) ? 4u : (op == BO_SHR_U ? 5u : 7u);
-    if (b_op.kind == OPK_IMM) {
-      if (rd != ra) emit_mov_rr(mc, w, rd, ra);
-      u32 width = w ? 64u : 32u;
-      emit_shift_imm(mc, w, sub, rd, (u8)((u64)b_op.v.imm & (width - 1u)));
-      return;
-    }
-    if (b_op.kind == OPK_REG) {
-      u32 rb = b_op.v.reg & 0xFu;
-      if (rb != X64_RCX) emit_mov_rr(mc, 0, X64_RCX, rb);
-    } else {
-      compiler_panic(t->c, impl_of(t)->loc,
-                     "x64 shift: count kind %d unsupported", (int)b_op.kind);
-    }
-    if (rd != ra) emit_mov_rr(mc, w, rd, ra);
-    emit_shift_cl(mc, w, sub, rd);
-    return;
-  }
-
-  /* For commutative ops, canonicalize IMM to the RHS so the imm-form
-   * check below fires uniformly. ISUB is non-commutative — IMM-on-LHS
-   * still materializes. */
-  switch (op) {
-    case BO_IADD:
-    case BO_AND:
-    case BO_OR:
-    case BO_XOR:
-    case BO_IMUL: {
-      if (a_op.kind == OPK_IMM && b_op.kind != OPK_IMM) {
-        Operand t_op = a_op;
-        a_op = b_op;
-        b_op = t_op;
-      }
-      break;
-    }
-    default:
-      break;
-  }
-
-  /* IMM-form fast paths. For ADD/SUB/AND/OR/XOR the ALU imm encoding
-   * reads-and-writes a single reg — copy ra → dst first, then `dst OP=
-   * imm`. For IMUL the imm form is three-operand (`dst = src * imm`)
-   * and reads from `ra` directly without the prep copy. */
-  if (b_op.kind == OPK_IMM && a_op.kind == OPK_REG &&
-      (op == BO_IADD || op == BO_ISUB || op == BO_AND || op == BO_OR ||
-       op == BO_XOR || op == BO_IMUL)) {
-    i64 imm = b_op.v.imm;
-    u32 ra = a_op.v.reg & 0xFu;
-    if (op == BO_IMUL) {
-      if (imm_fits_i8(imm)) {
-        emit_imul_imm8(mc, w, rd, ra, (i8)imm);
-        return;
-      }
-      if (imm_fits_i32(imm)) {
-        emit_imul_imm32(mc, w, rd, ra, (i32)imm);
-        return;
-      }
-    } else {
-      u32 sub;
-      switch (op) {
-        case BO_IADD:
-          sub = 0u;
-          break;
-        case BO_OR:
-          sub = 1u;
-          break;
-        case BO_AND:
-          sub = 4u;
-          break;
-        case BO_ISUB:
-          sub = 5u;
-          break;
-        case BO_XOR:
-          sub = 6u;
-          break;
-        default:
-          sub = 0u;
-          break; /* unreachable */
-      }
-      if (imm_fits_i8(imm)) {
-        if (rd != ra) emit_mov_rr(mc, w, rd, ra);
-        emit_alu_imm8(mc, w, sub, rd, (i8)imm);
-        return;
-      }
-      if (imm_fits_i32(imm)) {
-        if (rd != ra) emit_mov_rr(mc, w, rd, ra);
-        emit_alu_imm32(mc, w, sub, rd, (i32)imm);
-        return;
-      }
-    }
-    /* Fall through to materialize for >32-bit literals. */
-  }
-
-  /* Generic 2-operand ALU: copy ra → dst, then dst op= rb.  Preserve rb
-   * first when the allocator chose dst == rb; otherwise the prep copy
-   * would clobber the RHS. */
-  u32 ra = x64_force_reg_int(t, a_op, w, X64_RAX);
-  u32 rb = x64_force_reg_int(t, b_op, w, X64_R11);
-  if (rd == rb && rd != ra) {
-    if (op == BO_IADD || op == BO_AND || op == BO_OR || op == BO_XOR ||
-        op == BO_IMUL) {
-      switch (op) {
-        case BO_IADD:
-          emit_alu_rr(mc, w, 0x01, rd, ra);
-          return;
-        case BO_AND:
-          emit_alu_rr(mc, w, 0x21, rd, ra);
-          return;
-        case BO_OR:
-          emit_alu_rr(mc, w, 0x09, rd, ra);
-          return;
-        case BO_XOR:
-          emit_alu_rr(mc, w, 0x31, rd, ra);
-          return;
-        case BO_IMUL:
-          emit_imul_rr(mc, w, rd, ra);
-          return;
-        default:
-          break;
-      }
-    }
-    emit_mov_rr(mc, w, X64_R11, rb);
-    rb = X64_R11;
-  }
-  if (rd != ra) emit_mov_rr(mc, w, rd, ra);
-  switch (op) {
-    case BO_IADD:
-      emit_alu_rr(mc, w, 0x01, rd, rb);
-      break;
-    case BO_ISUB:
-      emit_alu_rr(mc, w, 0x29, rd, rb);
-      break;
-    case BO_AND:
-      emit_alu_rr(mc, w, 0x21, rd, rb);
-      break;
-    case BO_OR:
-      emit_alu_rr(mc, w, 0x09, rd, rb);
-      break;
-    case BO_XOR:
-      emit_alu_rr(mc, w, 0x31, rd, rb);
-      break;
-    case BO_IMUL:
-      emit_imul_rr(mc, w, rd, rb);
-      break;
-    default:
-      compiler_panic(t->c, impl_of(t)->loc, "x64 binop: op %d unimpl", (int)op);
-  }
-}
-
-static void x_unop(CGTarget* t, UnOp op, Operand dst, Operand a_op) {
-  MCEmitter* mc = t->mc;
-  u32 rd = dst.v.reg & 0xFu;
-  if (op == UO_FNEG) {
-    u8 mask_bytes[8];
-    ConstBytes cb;
-    Operand mask;
-    u32 ra;
-    if (dst.cls != RC_FP || a_op.kind != OPK_REG || a_op.cls != RC_FP) {
-      compiler_panic(t->c, impl_of(t)->loc,
-                     "x64 unop: FP neg requires FP REG operand");
-    }
-    ra = a_op.v.reg & 0xFu;
-    if (rd != ra)
-      emit_sse_rr(mc, type_is_fp_double(dst.type) ? 0xF2 : 0xF3, 0x10, rd, ra);
-    memset(mask_bytes, 0, sizeof mask_bytes);
-    if (type_is_fp_double(dst.type)) {
-      mask_bytes[7] = 0x80u;
-      cb.size = 8;
-      cb.align = 8;
-    } else {
-      mask_bytes[3] = 0x80u;
-      cb.size = 4;
-      cb.align = 4;
-    }
-    cb.type = dst.type;
-    cb.bytes = mask_bytes;
-    memset(&mask, 0, sizeof mask);
-    mask.kind = OPK_REG;
-    mask.cls = RC_FP;
-    mask.type = dst.type;
-    mask.v.reg = X64_XMM15;
-    x_load_const(t, mask, cb);
-    emit_sse_rr(mc, type_is_fp_double(dst.type) ? 0x66 : 0, 0x57, rd,
-                X64_XMM15);
-    return;
-  }
-
-  int w = type_is_64(dst.type) ? 1 : 0;
-  /* IMM operand is legal per the CGTarget contract (arch.h); materialize
-   * into a scratch register when not already a register. cg folds
-   * literal unops upstream (cg_fold_unop), so this path is reached only
-   * when opt's emit hands us an unfolded literal. */
-  u32 ra = x64_force_reg_int(t, a_op, w, X64_R11);
-  switch (op) {
-    case UO_NEG:
-      if (rd != ra) emit_mov_rr(mc, w, rd, ra);
-      emit_f7_rm(mc, w, 3u, rd);
-      return;
-    case UO_BNOT:
-      if (rd != ra) emit_mov_rr(mc, w, rd, ra);
-      emit_f7_rm(mc, w, 2u, rd);
-      return;
-    case UO_NOT:
-      /* !x → (x == 0) materialized as 0/1 in dst. */
-      emit_test_self(mc, w, ra);
-      emit_setcc(mc, X64_CC_E, rd);
-      emit_movzx_r32_r8(mc, rd, rd);
-      return;
-    default:
-      compiler_panic(t->c, impl_of(t)->loc, "x64 unop: op %d unimpl", (int)op);
-  }
-}
-
-static void x_convert(CGTarget* t, ConvKind k, Operand dst, Operand src) {
-  XImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-  u32 rd = dst.v.reg & 0xFu;
-  u32 rs = src.v.reg & 0xFu;
-  switch (k) {
-    case CV_SEXT: {
-      u32 src_bytes = type_byte_size(src.type);
-      int w = type_is_64(dst.type) ? 1 : 0;
-      emit_extend_rr(mc, w, /*signed=*/1, src_bytes, rd, rs);
-      return;
-    }
-    case CV_ZEXT: {
-      u32 src_bytes = type_byte_size(src.type);
-      int w = type_is_64(dst.type) ? 1 : 0;
-      emit_extend_rr(mc, w, /*signed=*/0, src_bytes, rd, rs);
-      return;
-    }
-    case CV_TRUNC: {
-      /* In-reg truncation: `mov r32, r32` clears high 32. Narrower stores
-       * select width themselves. */
-      emit_mov_rr(mc, 0, rd, rs);
-      return;
-    }
-    case CV_ITOF_S:
-    case CV_ITOF_U: {
-      int w_src = type_is_64(src.type) ? 1 : 0;
-      u8 prefix2 = type_is_fp_double(dst.type) ? 0xF2 : 0xF3;
-      if (k == CV_ITOF_U && w_src == 1) {
-        MCLabel L_high = mc->label_new(mc);
-        MCLabel L_done = mc->label_new(mc);
-        u32 rr = rs;
-        emit_test_self(mc, 1, rr);
-        emit_jcc_label(mc, X64_CC_S, L_high);
-        emit_sse_rr_w(mc, prefix2, 0x2A, 1, rd, rr);
-        emit_jmp_label(mc, L_done);
-
-        mc->label_place(mc, L_high);
-        emit_mov_rr(mc, 1, X64_R11, rr);
-        emit_mov_rr(mc, 1, X64_RAX, rr);
-        emit_alu_imm8(mc, 1, 4u, X64_RAX, 1);       /* and rax, 1 */
-        emit_shift_imm(mc, 1, 5u, X64_R11, 1);      /* shr r11, 1 */
-        emit_alu_rr(mc, 1, 0x09, X64_R11, X64_RAX); /* or r11, rax */
-        emit_sse_rr_w(mc, prefix2, 0x2A, 1, rd, X64_R11);
-        emit_sse_rr(mc, prefix2, 0x58, rd, rd); /* addss/addsd dst, dst */
-
-        mc->label_place(mc, L_done);
-        return;
-      }
-      if (k == CV_ITOF_U) {
-        /* u32→fp: zero-extend to 64-bit so the signed 64-bit
-         * cvtsi2ss/cvtsi2sd sees a non-negative value. */
-        emit_extend_rr(mc, 0, 0, 4, X64_R11, rs);
-        rs = X64_R11;
-        w_src = 1;
-      }
-      emit_sse_rr_w(mc, prefix2, 0x2A, w_src, rd, rs);
-      return;
-    }
-    case CV_FTOI_S:
-    case CV_FTOI_U: {
-      int w_dst = type_is_64(dst.type) ? 1 : 0;
-      u8 prefix2 = type_is_fp_double(src.type) ? 0xF2 : 0xF3;
-      if (k == CV_FTOI_U && w_dst == 1) {
-        static const u8 two63_f64[8] = {0, 0, 0, 0, 0, 0, 0xE0, 0x43};
-        static const u8 two63_f32[4] = {0, 0, 0, 0x5F};
-        MCLabel L_small = mc->label_new(mc);
-        MCLabel L_done = mc->label_new(mc);
-        ConstBytes cb;
-        Operand limit;
-        memset(&cb, 0, sizeof cb);
-        if (type_is_fp_double(src.type)) {
-          cb.bytes = two63_f64;
-          cb.size = 8;
-          cb.align = 8;
-        } else {
-          cb.bytes = two63_f32;
-          cb.size = 4;
-          cb.align = 4;
-        }
-        cb.type = src.type;
-        memset(&limit, 0, sizeof limit);
-        limit.kind = OPK_REG;
-        limit.cls = RC_FP;
-        limit.type = src.type;
-        limit.v.reg = X64_XMM15;
-        x_load_const(t, limit, cb);
-
-        emit_sse_rr(mc, type_is_fp_double(src.type) ? 0x66 : 0, 0x2E, rs,
-                    X64_XMM15);
-        emit_jcc_label(mc, X64_CC_B, L_small);
-
-        emit_sse_rr(mc, prefix2, 0x10, X64_XMM0 + 14, rs);
-        emit_sse_rr(mc, prefix2, 0x5C, X64_XMM0 + 14, X64_XMM15);
-        emit_sse_rr_w(mc, prefix2, 0x2C, 1, rd, X64_XMM0 + 14);
-        x64_emit_load_imm(mc, 1, X64_R11, -9223372036854775807LL - 1LL);
-        emit_alu_rr(mc, 1, 0x31, rd, X64_R11); /* xor sign bit */
-        emit_jmp_label(mc, L_done);
-
-        mc->label_place(mc, L_small);
-        emit_sse_rr_w(mc, prefix2, 0x2C, 1, rd, rs);
-        mc->label_place(mc, L_done);
-        return;
-      }
-      emit_sse_rr_w(mc, prefix2, 0x2C, w_dst, rd, rs);
-      return;
-    }
-    case CV_FEXT:
-      emit_sse_rr(mc, 0xF3, 0x5A, rd, rs);
-      return;
-    case CV_FTRUNC:
-      emit_sse_rr(mc, 0xF2, 0x5A, rd, rs);
-      return;
-    case CV_BITCAST: {
-      /* movd/movq between xmm and GPR. */
-      if (src.cls == RC_INT && dst.cls == RC_FP) {
-        int w = type_is_64(dst.type) ? 1 : 0;
-        emit_sse_rr_w(mc, 0x66, 0x6E, w, rd, rs);
-      } else if (src.cls == RC_FP && dst.cls == RC_INT) {
-        int w = type_is_64(src.type) ? 1 : 0;
-        emit_sse_rr_w(mc, 0x66, 0x7E, w, rs, rd);
-      } else {
-        compiler_panic(t->c, a->loc,
-                       "x64 convert BITCAST: same-class not supported");
-      }
-      return;
-    }
-    default:
-      compiler_panic(t->c, a->loc, "x64 convert kind %d unimpl", (int)k);
-  }
-}
-
-/* ============================================================
- * Calls / return */
-
-static Operand x_call_stack_arg_addr(CGTarget* t, u32 stack_offset, int tail) {
-  XImpl* a = impl_of(t);
-  Operand addr;
-  memset(&addr, 0, sizeof addr);
-  addr.kind = OPK_INDIRECT;
-  addr.cls = RC_INT;
-  addr.v.ind.base = tail && !a->omit_frame ? X64_RBP : X64_RSP;
-  addr.v.ind.index = REG_NONE;
-  addr.v.ind.log2_scale = 0;
-  addr.v.ind.ofs = (i32)stack_offset + (tail ? 8 : 0);
-  if (tail && !a->omit_frame) addr.v.ind.ofs = 16 + (i32)stack_offset;
-  return addr;
-}
-
-static void x_check_tail_stack_args(CGTarget* t, u32 stack_size) {
-  XImpl* a = impl_of(t);
-  if (stack_size > a->next_param_stack) {
-    compiler_panic(t->c, a->loc,
-                   "x64 tail call: stack argument area too small");
-  }
-}
-
-static u32 x_call_plan_stack_raw_size(const CGCallPlan* p) {
-  u32 size = 0;
-  for (u32 i = 0; i < p->nargs; ++i) {
-    const CGCallPlanMove* m = &p->args[i];
-    if (m->dst_kind == CG_CALL_PLAN_STACK ||
-        m->dst_kind == CG_CALL_PLAN_TAIL_STACK) {
-      u32 end = m->stack_offset + (m->mem.size > 8u ? m->mem.size : 8u);
-      if (end > size) size = end;
-    }
-  }
-  return size;
-}
-
-static inline void x_call_sync_slot(const X64ABIRegs* abi, u32* next_int,
-                                    u32* next_fp);
-
-static void emit_arg_value(CGTarget* t, const CGABIValue* av, u32* next_int,
-                           u32* next_fp, u32* stack_off, int tail) {
-  XImpl* a = impl_of(t);
-  /* Synthesize one-part DIRECT for variadic args (av->abi NULL). */
-  ABIArgInfo va_ai;
-  ABIArgPart va_pt;
-  const ABIArgInfo* ai = av->abi;
-  if (!ai) {
-    u32 sz = type_byte_size(av->type);
-    memset(&va_ai, 0, sizeof va_ai);
-    memset(&va_pt, 0, sizeof va_pt);
-    va_ai.kind = ABI_ARG_DIRECT;
-    va_ai.parts = &va_pt;
-    va_ai.nparts = 1;
-    va_pt.cls = (av->storage.cls == RC_FP) ? ABI_CLASS_FP : ABI_CLASS_INT;
-    va_pt.size = sz;
-    va_pt.align = sz;
-    va_pt.src_offset = 0;
-    ai = &va_ai;
-  }
-  if (ai->kind == ABI_ARG_IGNORE) return;
-  if (ai->kind == ABI_ARG_INDIRECT) {
-    /* Pass the address of av->storage in the next int arg reg, or via RAX
-     * and a stack slot once the int arg regs are exhausted. */
-    u32 nargs_reg = a->abi->n_int_args;
-    u32 dst_reg =
-        (*next_int < nargs_reg) ? a->abi->int_args[(*next_int)++] : X64_RAX;
-    int to_stack = (dst_reg == X64_RAX);
-    x_call_sync_slot(a->abi, next_int, next_fp);
-    if (av->storage.kind == OPK_LOCAL) {
-      XSlot* s = x64_slot_get(a, av->storage.v.frame_slot);
-      if (!s) compiler_panic(t->c, a->loc, "x64 call: bad byval slot");
-      emit_lea(t->mc, dst_reg, X64_RBP, -(i32)s->off);
-    } else if (av->storage.kind == OPK_INDIRECT) {
-      emit_lea(t->mc, dst_reg, av->storage.v.ind.base & 0xFu,
-               av->storage.v.ind.ofs);
-    } else if (av->storage.kind == OPK_GLOBAL) {
-      emit_global_lea(t, dst_reg, av->storage.v.global.sym,
-                      av->storage.v.global.addend);
-    } else {
-      compiler_panic(t->c, a->loc,
-                     "x64 call: INDIRECT arg storage kind %d unsupported",
-                     (int)av->storage.kind);
-    }
-    if (to_stack) {
-      Operand addr = x_call_stack_arg_addr(t, *stack_off, tail);
-      emit_mov_store(t->mc, 8, dst_reg, addr.v.ind.base & 0xFu, addr.v.ind.ofs);
-      *stack_off += 8;
-    }
-    return;
-  }
-
-  if (ai->kind == ABI_ARG_DIRECT &&
-      x64_abi_direct_to_stack(ai, *next_int, *next_fp)) {
-    for (u16 i = 0; i < ai->nparts; ++i) {
-      const ABIArgPart* pt = &ai->parts[i];
-      u32 sz = pt->size;
-      Operand addr = x_call_stack_arg_addr(t, *stack_off, tail);
-      if (pt->cls == ABI_CLASS_FP) {
-        u8 prefix2 = (sz == 8) ? 0xF2 : 0xF3;
-        if (av->storage.kind == OPK_REG) {
-          emit_sse_store(t->mc, prefix2, 0x11, av->storage.v.reg & 0xFu,
-                         addr.v.ind.base & 0xFu, addr.v.ind.ofs);
-        } else if (av->storage.kind == OPK_LOCAL) {
-          XSlot* s = x64_slot_get(a, av->storage.v.frame_slot);
-          if (!s) compiler_panic(t->c, a->loc, "x64 call: bad FP arg slot");
-          emit_sse_load(t->mc, prefix2, 0x10, X64_XMM15, X64_RBP,
-                        -(i32)s->off + (i32)pt->src_offset);
-          emit_sse_store(t->mc, prefix2, 0x11, X64_XMM15,
-                         addr.v.ind.base & 0xFu, addr.v.ind.ofs);
-        } else if (av->storage.kind == OPK_INDIRECT) {
-          emit_sse_load(t->mc, prefix2, 0x10, X64_XMM15,
-                        av->storage.v.ind.base & 0xFu,
-                        av->storage.v.ind.ofs + (i32)pt->src_offset);
-          emit_sse_store(t->mc, prefix2, 0x11, X64_XMM15,
-                         addr.v.ind.base & 0xFu, addr.v.ind.ofs);
-        } else {
-          compiler_panic(t->c, a->loc,
-                         "x64 call: FP stack-arg storage kind %d unsupported",
-                         (int)av->storage.kind);
-        }
-      } else if (pt->cls == ABI_CLASS_INT) {
-        switch (av->storage.kind) {
-          case OPK_IMM: {
-            int w = (sz == 8) ? 1 : 0;
-            x64_emit_load_imm(t->mc, w, X64_RAX, av->storage.v.imm);
-            break;
-          }
-          case OPK_REG: {
-            int w = (sz == 8) ? 1 : 0;
-            u32 sr = av->storage.v.reg & 0xFu;
-            if (sr != X64_RAX) emit_mov_rr(t->mc, w, X64_RAX, sr);
-            break;
-          }
-          case OPK_LOCAL: {
-            XSlot* s = x64_slot_get(a, av->storage.v.frame_slot);
-            if (!s) compiler_panic(t->c, a->loc, "x64 call: bad arg slot");
-            emit_mov_load(t->mc, sz, 0, X64_RAX, X64_RBP,
-                          -(i32)s->off + (i32)pt->src_offset);
-            break;
-          }
-          case OPK_INDIRECT:
-            emit_mov_load(t->mc, sz, 0, X64_RAX, av->storage.v.ind.base & 0xFu,
-                          av->storage.v.ind.ofs + (i32)pt->src_offset);
-            break;
-          default:
-            compiler_panic(t->c, a->loc,
-                           "x64 call: arg storage kind %d unsupported",
-                           (int)av->storage.kind);
-        }
-        emit_mov_store(t->mc, sz, X64_RAX, addr.v.ind.base & 0xFu,
-                       addr.v.ind.ofs);
-      } else {
-        compiler_panic(t->c, a->loc, "x64 call: ABI class %d unimpl",
-                       (int)pt->cls);
-      }
-      *stack_off += 8;
-    }
-    return;
-  }
-
-  for (u16 i = 0; i < ai->nparts; ++i) {
-    const ABIArgPart* pt = &ai->parts[i];
-    u32 sz = pt->size;
-    if (pt->cls == ABI_CLASS_INT) {
-      int to_stack = (*next_int >= a->abi->n_int_args);
-      u32 dst_reg = to_stack ? X64_RAX : a->abi->int_args[(*next_int)++];
-      if (!to_stack) x_call_sync_slot(a->abi, next_int, next_fp);
-      switch (av->storage.kind) {
-        case OPK_IMM: {
-          int w = (sz == 8) ? 1 : 0;
-          x64_emit_load_imm(t->mc, w, dst_reg, av->storage.v.imm);
-          break;
-        }
-        case OPK_REG: {
-          int w = (sz == 8) ? 1 : 0;
-          u32 sr = av->storage.v.reg & 0xFu;
-          if (sr != dst_reg) emit_mov_rr(t->mc, w, dst_reg, sr);
-          break;
-        }
-        case OPK_LOCAL: {
-          XSlot* s = x64_slot_get(a, av->storage.v.frame_slot);
-          if (!s) compiler_panic(t->c, a->loc, "x64 call: bad arg slot");
-          emit_mov_load(t->mc, sz, 0, dst_reg, X64_RBP,
-                        -(i32)s->off + (i32)pt->src_offset);
-          break;
-        }
-        case OPK_INDIRECT: {
-          /* cg holds INDIRECT base regs in {RBX, R10, R12..R15}, disjoint
-           * from arg regs (RDI/RSI/RDX/RCX/R8/R9) and the RAX scratch, so
-           * the base survives across the part loop. */
-          emit_mov_load(t->mc, sz, 0, dst_reg, av->storage.v.ind.base & 0xFu,
-                        av->storage.v.ind.ofs + (i32)pt->src_offset);
-          break;
-        }
-        default:
-          compiler_panic(t->c, a->loc,
-                         "x64 call: arg storage kind %d unsupported",
-                         (int)av->storage.kind);
-      }
-      if (to_stack) {
-        Operand addr = x_call_stack_arg_addr(t, *stack_off, tail);
-        emit_mov_store(t->mc, 8, dst_reg, addr.v.ind.base & 0xFu,
-                       addr.v.ind.ofs);
-        *stack_off += 8;
-      }
-    } else if (pt->cls == ABI_CLASS_FP) {
-      int to_stack = (*next_fp >= a->abi->n_fp_args);
-      u8 prefix2 = (sz == 8) ? 0xF2 : 0xF3;
-      if (!to_stack) {
-        u32 dst_x = (*next_fp)++;
-        /* Win64: variadic FP args must be duplicated into the matching
-         * GPR so a callee that doesn't know the argument type finds the
-         * bits in either register. `av->abi == NULL` is cfree's marker
-         * that this is a variadic (un-prototyped) arg. */
-        int dup_to_gpr = a->abi->vararg_fp_dup_to_gpr && (av->abi == NULL) &&
-                         (dst_x < a->abi->n_int_args);
-        if (av->storage.kind == OPK_REG) {
-          u32 sx = av->storage.v.reg & 0xFu;
-          if (sx != dst_x) emit_sse_rr(t->mc, prefix2, 0x10, dst_x, sx);
-        } else if (av->storage.kind == OPK_LOCAL) {
-          XSlot* s = x64_slot_get(a, av->storage.v.frame_slot);
-          if (!s) compiler_panic(t->c, a->loc, "x64 call: bad FP arg slot");
-          emit_sse_load(t->mc, prefix2, 0x10, dst_x, X64_RBP,
-                        -(i32)s->off + (i32)pt->src_offset);
-        } else if (av->storage.kind == OPK_INDIRECT) {
-          emit_sse_load(t->mc, prefix2, 0x10, dst_x,
-                        av->storage.v.ind.base & 0xFu,
-                        av->storage.v.ind.ofs + (i32)pt->src_offset);
-        } else {
-          compiler_panic(t->c, a->loc,
-                         "x64 call: FP arg storage kind %d unsupported",
-                         (int)av->storage.kind);
-        }
-        if (dup_to_gpr) {
-          /* movq r64, xmm: 66 REX.W 0F 7E /r (xmm as ModRM:reg,
-           * r64 as ModRM:r/m). emit_sse_rr_w(prefix=0x66, opcode=0x7E,
-           * w=1, dst=xmm, src=gpr) emits that encoding. */
-          u32 gpr = a->abi->int_args[dst_x];
-          emit_sse_rr_w(t->mc, 0x66, 0x7E, /*w=*/1, dst_x, gpr);
-        }
-        /* Keep int/fp slot indices in lockstep on Win64. */
-        x_call_sync_slot(a->abi, next_int, next_fp);
-      } else {
-        if (av->storage.kind == OPK_REG) {
-          Operand addr = x_call_stack_arg_addr(t, *stack_off, tail);
-          emit_sse_store(t->mc, prefix2, 0x11, av->storage.v.reg & 0xFu,
-                         addr.v.ind.base & 0xFu, addr.v.ind.ofs);
-        } else if (av->storage.kind == OPK_LOCAL) {
-          Operand addr = x_call_stack_arg_addr(t, *stack_off, tail);
-          XSlot* s = x64_slot_get(a, av->storage.v.frame_slot);
-          if (!s) compiler_panic(t->c, a->loc, "x64 call: bad FP arg slot");
-          emit_sse_load(t->mc, prefix2, 0x10, X64_XMM15, X64_RBP,
-                        -(i32)s->off + (i32)pt->src_offset);
-          emit_sse_store(t->mc, prefix2, 0x11, X64_XMM15,
-                         addr.v.ind.base & 0xFu, addr.v.ind.ofs);
-        } else if (av->storage.kind == OPK_INDIRECT) {
-          Operand addr = x_call_stack_arg_addr(t, *stack_off, tail);
-          /* Load through xmm15 (scratch — last in g_fp_order so cg won't
-           * have it live mid-call) then store. */
-          emit_sse_load(t->mc, prefix2, 0x10, X64_XMM15,
-                        av->storage.v.ind.base & 0xFu,
-                        av->storage.v.ind.ofs + (i32)pt->src_offset);
-          emit_sse_store(t->mc, prefix2, 0x11, X64_XMM15,
-                         addr.v.ind.base & 0xFu, addr.v.ind.ofs);
-        } else {
-          compiler_panic(t->c, a->loc,
-                         "x64 call: FP stack-arg storage kind %d unsupported",
-                         (int)av->storage.kind);
-        }
-        *stack_off += 8;
-      }
-    } else {
-      compiler_panic(t->c, a->loc, "x64 call: ABI class %d unimpl",
-                     (int)pt->cls);
-    }
-  }
-}
-
-static inline void x_call_sync_slot(const X64ABIRegs* abi, u32* next_int,
-                                    u32* next_fp) {
-  if (!abi->slot_shared_int_fp) return;
-  u32 m = *next_int > *next_fp ? *next_int : *next_fp;
-  *next_int = m;
-  *next_fp = m;
-}
-
-static void count_arg_stack(const X64ABIRegs* abi, const CGABIValue* av,
-                            u32* next_int, u32* next_fp, u32* stack_off) {
-  ABIArgInfo va_ai;
-  ABIArgPart va_pt;
-  const ABIArgInfo* ai = av->abi;
-  if (!ai) {
-    u32 sz = type_byte_size(av->type);
-    memset(&va_ai, 0, sizeof va_ai);
-    memset(&va_pt, 0, sizeof va_pt);
-    va_ai.kind = ABI_ARG_DIRECT;
-    va_ai.parts = &va_pt;
-    va_ai.nparts = 1;
-    va_pt.cls = (av->storage.cls == RC_FP) ? ABI_CLASS_FP : ABI_CLASS_INT;
-    va_pt.size = sz;
-    va_pt.align = sz;
-    va_pt.src_offset = 0;
-    ai = &va_ai;
-  }
-  if (ai->kind == ABI_ARG_IGNORE) return;
-  if (ai->kind == ABI_ARG_INDIRECT) {
-    if (*next_int < abi->n_int_args)
-      ++*next_int;
-    else
-      *stack_off += 8;
-    x_call_sync_slot(abi, next_int, next_fp);
-    return;
-  }
-  if (ai->kind == ABI_ARG_DIRECT &&
-      x64_abi_direct_to_stack(ai, *next_int, *next_fp)) {
-    *stack_off += (u32)ai->nparts * 8u;
-    return;
-  }
-  for (u16 i = 0; i < ai->nparts; ++i) {
-    const ABIArgPart* pt = &ai->parts[i];
-    if (pt->cls == ABI_CLASS_INT) {
-      if (*next_int < abi->n_int_args)
-        ++*next_int;
-      else
-        *stack_off += 8;
-    } else if (pt->cls == ABI_CLASS_FP) {
-      if (*next_fp < abi->n_fp_args)
-        ++*next_fp;
-      else
-        *stack_off += 8;
-    }
-    x_call_sync_slot(abi, next_int, next_fp);
-  }
-}
-
-static u32 x_call_stack_size(CGTarget* t, const CGCallDesc* d) {
-  const X64ABIRegs* abi = x64_abi_for_os(t->c->target.os);
-  u32 next_int = (d->abi && d->abi->has_sret) ? 1u : 0u;
-  u32 next_fp = 0;
-  /* Win64 reserves a 32 B shadow space at [rsp+0..31] which is part of
-   * the caller's outgoing area; stack args land above it. SysV has no
-   * shadow space. */
-  u32 stack_off = abi->shadow_space;
-  x_call_sync_slot(abi, &next_int, &next_fp);
-  for (u32 i = 0; i < d->nargs; ++i)
-    count_arg_stack(abi, &d->args[i], &next_int, &next_fp, &stack_off);
-  return (stack_off + 15u) & ~15u;
-}
-
-static const Reg g_tail_cs_int_order_all[X64_MAX_CS_INT_REGS] = {
-    X64_RBX, X64_R12, X64_R13, X64_R14, X64_R15, X64_RDI, X64_RSI,
-};
-
-#define X64_TAIL_MAX_CS_FP_REGS 10u
-static const Reg g_tail_cs_fp_order_all[X64_TAIL_MAX_CS_FP_REGS] = {
-    X64_XMM6,      X64_XMM7,      X64_XMM8,      X64_XMM0 + 9,  X64_XMM0 + 10,
-    X64_XMM0 + 11, X64_XMM0 + 12, X64_XMM0 + 13, X64_XMM0 + 14, X64_XMM15,
-};
-
-/* Realizability of a sibling call (see CGTarget.tail_call_unrealizable_reason).
- * The callee's outgoing stack arguments must fit the area this function itself
- * received (next_param_stack); the tail prologue restore reuses those slots.
- * Variadic callees need no special handling (AL = #XMM regs is set as usual)
- * and sret callees are realizable by forwarding this function's own incoming
- * sret pointer (the return-shape precondition guarantees it matches). */
-static const char* x_tail_call_unrealizable_reason(CGTarget* t,
-                                                   const CGCallDesc* d) {
-  XImpl* a = impl_of(t);
-  if (x_call_stack_size(t, d) > a->next_param_stack)
-    return "tail call stack arguments exceed the caller's parameter area";
-  return NULL;
-}
-
-static u32 x_tail_collect_cs_regs(const XImpl* a, Reg* cs_regs) {
-  u32 cs_used = 0;
-  u64 mask = (u64)a->used_cs_int_mask & a->abi->cs_int_mask;
-  mask &= ~(1ull << X64_RBP);
-  for (u32 i = 0; i < X64_MAX_CS_INT_REGS; ++i) {
-    Reg r = g_tail_cs_int_order_all[i];
-    if (mask & (1ull << r)) cs_regs[cs_used++] = r;
-  }
-  return cs_used;
-}
-
-static u32 x_tail_collect_cs_fp_regs(const XImpl* a, Reg* cs_fp_regs) {
-  u32 n = 0;
-  u64 mask = (u64)a->used_cs_fp_mask & a->abi->cs_fp_mask;
-  for (u32 i = 0; i < X64_TAIL_MAX_CS_FP_REGS; ++i) {
-    Reg r = g_tail_cs_fp_order_all[i];
-    if (mask & (1ull << r)) cs_fp_regs[n++] = r;
-  }
-  return n;
-}
-
-static void x_tail_restore_frame(CGTarget* t) {
-  XImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-  Reg cs_regs[X64_MAX_CS_INT_REGS];
-  Reg cs_fp_regs[X64_TAIL_MAX_CS_FP_REGS];
-  u32 cs_used = x_tail_collect_cs_regs(a, cs_regs);
-  u32 cs_fp_used = x_tail_collect_cs_fp_regs(a, cs_fp_regs);
-
-  if (a->omit_frame) return;
-  /* Mirror the func_end frame layout: xmm_base is cum_off rounded up to
-   * 16 when any XMM is saved, else == cum_off. */
-  u32 xmm_base = a->cum_off;
-  if (cs_fp_used) xmm_base = (xmm_base + 15u) & ~15u;
-  for (i32 i = (i32)cs_fp_used - 1; i >= 0; --i) {
-    u32 xmm = cs_fp_regs[i];
-    i32 off = -(i32)xmm_base - (i32)(i + 1) * 16;
-    emit_sse_load(mc, /*prefix=*/0, /*opcode=*/0x28, xmm, X64_RBP, off);
-  }
-  for (i32 i = (i32)cs_used - 1; i >= 0; --i) {
-    u32 reg = cs_regs[i];
-    i32 off = -(i32)xmm_base - (i32)(cs_fp_used) * 16 - (i32)(i + 1) * 8;
-    emit_mov_load(mc, 8, 0, reg, X64_RBP, off);
-  }
-  {
-    u8 op = 0xC9;
-    mc->emit_bytes(mc, &op, 1);
-  }
-}
-
-static void x_tail_branch(CGTarget* t, Operand callee) {
-  XImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-  if (callee.kind == OPK_REG) {
-    u32 r = callee.v.reg & 0xFu;
-    if (r != X64_R11) emit_mov_rr(mc, 1, X64_R11, r);
-    x_tail_restore_frame(t);
-    emit_rex(mc, 0, 0, 0, X64_R11);
-    u8 buf[2] = {0xFF, modrm(3u, 4u, X64_R11)};
-    mc->emit_bytes(mc, buf, 2);
-  } else if (callee.kind == OPK_GLOBAL) {
-    x_tail_restore_frame(t);
-    u8 op = 0xE9;
-    mc->emit_bytes(mc, &op, 1);
-    u32 disp_pos = mc->pos(mc);
-    emit_u32le(mc, 0);
-    mc->emit_reloc_at(mc, mc->section_id, disp_pos, R_X64_PLT32,
-                      callee.v.global.sym, callee.v.global.addend - 4, 1, 0);
-  } else {
-    compiler_panic(t->c, a->loc, "x64 tail call: callee kind %d unsupported",
-                   (int)callee.kind);
-  }
-}
-
-static void x_call(CGTarget* t, const CGCallDesc* d) {
-  XImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-
-  u32 next_int = 0, next_fp = 0, stack_off = a->abi->shadow_space;
-  int is_tail = (d->flags & CG_CALL_TAIL) != 0;
-
-  /* sret: the first integer argument register holds the result pointer.
-   * An ordinary call points it at the destination local; a tail call forwards
-   * this function's own incoming sret pointer (loaded just before the branch
-   * below), and ret.storage is the void sentinel, so only reserve the register
-   * here. */
-  if (d->abi && d->abi->has_sret) {
-    next_int = 1;
-    x_call_sync_slot(a->abi, &next_int, &next_fp);
-    if (!is_tail) {
-      if (d->ret.storage.kind != OPK_LOCAL) {
-        compiler_panic(t->c, a->loc,
-                       "x64 call: sret destination must be LOCAL");
-      }
-      XSlot* s = x64_slot_get(a, d->ret.storage.v.frame_slot);
-      if (!s) compiler_panic(t->c, a->loc, "x64 call: bad sret slot");
-      emit_lea(mc, a->abi->int_args[0], X64_RBP, -(i32)s->off);
-    }
-  }
-  for (u32 i = 0; i < d->nargs; ++i) {
-    emit_arg_value(t, &d->args[i], &next_int, &next_fp, &stack_off, is_tail);
-  }
-  u32 needed = (stack_off + 15u) & ~15u;
-  if (!is_tail && needed > a->max_outgoing) {
-    if (a->known_frame)
-      compiler_panic(t->c, a->loc,
-                     "x64 call: known frame outgoing area too small");
-    a->max_outgoing = needed;
-  }
-
-  /* Variadic calls: AL = number of XMM regs used. */
-  if (d->abi && d->abi->variadic) {
-    x64_emit_load_imm(mc, 0, X64_RAX, (i64)next_fp);
-  }
-
-  if (is_tail) {
-    if (d->abi && d->abi->has_sret) {
-      /* Forward the incoming sret pointer into the ABI sret register (spilled
-       * to sret_ptr_slot at entry). Load while rbp is valid, before
-       * x_tail_branch restores the frame; the register survives the restore
-       * and is not used by the args above. */
-      if (a->sret_ptr_slot == FRAME_SLOT_NONE)
-        compiler_panic(t->c, a->loc,
-                       "x64 tail call: missing incoming sret slot");
-      XSlot* s = x64_slot_get(a, a->sret_ptr_slot);
-      if (!s) compiler_panic(t->c, a->loc, "x64 tail call: bad sret slot");
-      emit_mov_load(mc, 8, 0, a->abi->int_args[0], X64_RBP, -(i32)s->off);
-    }
-    x_check_tail_stack_args(t, stack_off);
-    x_tail_branch(t, d->callee);
-    return;
-  }
-
-  if (d->callee.kind == OPK_GLOBAL) {
-    /* call rel32: E8 + disp32 + R_X64_PLT32. */
-    u8 op = 0xE8;
-    mc->emit_bytes(mc, &op, 1);
-    u32 disp_pos = mc->pos(mc);
-    emit_u32le(mc, 0);
-    mc->emit_reloc_at(mc, mc->section_id, disp_pos, R_X64_PLT32,
-                      d->callee.v.global.sym, d->callee.v.global.addend - 4, 1,
-                      0);
-  } else if (d->callee.kind == OPK_REG) {
-    u32 r = d->callee.v.reg & 0xFu;
-    emit_rex(mc, 0, 0, 0, r);
-    u8 buf[2] = {0xFF, modrm(3u, 2u, r)};
-    mc->emit_bytes(mc, buf, 2);
-  } else {
-    compiler_panic(t->c, a->loc, "x64 call: callee kind %d unsupported",
-                   (int)d->callee.kind);
-  }
-
-  /* Receive return value. */
-  const ABIArgInfo* ri = &d->abi->ret;
-  if (ri->kind == ABI_ARG_IGNORE || ri->kind == ABI_ARG_INDIRECT) return;
-  if (ri->nparts == 0) return;
-
-  Operand rs = d->ret.storage;
-  u32 next_int_ret = 0, next_fp_ret = 0;
-  static const u32 ret_int_regs[2] = {X64_RAX, X64_RDX};
-  for (u16 i = 0; i < ri->nparts; ++i) {
-    const ABIArgPart* p = &ri->parts[i];
-    u32 src_reg;
-    if (p->cls == ABI_CLASS_INT)
-      src_reg = ret_int_regs[next_int_ret++];
-    else if (p->cls == ABI_CLASS_FP)
-      src_reg = (u32)(X64_XMM0 + next_fp_ret++);
-    else
-      compiler_panic(t->c, a->loc, "x64 call: ret cls %d unimpl", (int)p->cls);
-
-    if (rs.kind == OPK_REG) {
-      if (ri->nparts != 1) {
-        compiler_panic(t->c, a->loc, "x64 call: REG ret_storage with %u parts",
-                       (unsigned)ri->nparts);
-      }
-      if (p->cls == ABI_CLASS_INT) {
-        int w = (p->size == 8) ? 1 : 0;
-        u32 dr = rs.v.reg & 0xFu;
-        if (dr != src_reg) emit_mov_rr(mc, w, dr, src_reg);
-      } else {
-        u8 prefix2 = (p->size == 8) ? 0xF2 : 0xF3;
-        u32 dr = rs.v.reg & 0xFu;
-        if (dr != src_reg) emit_sse_rr(mc, prefix2, 0x10, dr, src_reg);
-      }
-    } else if (rs.kind == OPK_LOCAL || rs.kind == OPK_INDIRECT) {
-      u32 base_reg;
-      i32 base_off;
-      if (rs.kind == OPK_LOCAL) {
-        XSlot* s = x64_slot_get(a, rs.v.frame_slot);
-        if (!s) compiler_panic(t->c, a->loc, "x64 call: bad ret slot");
-        base_reg = X64_RBP;
-        base_off = -(i32)s->off;
-      } else {
-        base_reg = rs.v.ind.base & 0xFu;
-        base_off = rs.v.ind.ofs;
-      }
-      i32 off = base_off + (i32)p->src_offset;
-      if (p->cls == ABI_CLASS_INT) {
-        emit_mov_store(mc, p->size, src_reg, base_reg, off);
-      } else {
-        u8 prefix2 = (p->size == 8) ? 0xF2 : 0xF3;
-        emit_sse_store(mc, prefix2, 0x11, src_reg, base_reg, off);
-      }
-    } else if (rs.kind == OPK_IMM &&
-               rs.type == CG_BUILTIN_ID(CFREE_CG_BUILTIN_VOID)) {
-      /* void ret placeholder — nothing to do. */
-    } else {
-      compiler_panic(t->c, a->loc, "x64 call: ret_storage kind %d unsupported",
-                     (int)rs.kind);
-    }
-  }
-}
-
-static void x_emit_call_plan(CGTarget* t, const CGCallPlan* p) {
-  XImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-
-  if (p->is_variadic)
-    x64_emit_load_imm(mc, 0, X64_RAX, (i64)p->variadic_fp_count);
-
-  if (p->flags & CG_CALL_TAIL) {
-    if (p->has_sret) {
-      /* Forward the incoming sret pointer into the ABI sret register (see
-       * x_call). Load before x_tail_branch restores the frame. */
-      if (a->sret_ptr_slot == FRAME_SLOT_NONE)
-        compiler_panic(t->c, a->loc,
-                       "x64 tail call: missing incoming sret slot");
-      XSlot* s = x64_slot_get(a, a->sret_ptr_slot);
-      if (!s) compiler_panic(t->c, a->loc, "x64 tail call: bad sret slot");
-      emit_mov_load(mc, 8, 0, a->abi->int_args[0], X64_RBP, -(i32)s->off);
-    }
-    x_check_tail_stack_args(t, x_call_plan_stack_raw_size(p));
-    x_tail_branch(t, p->callee);
-    return;
-  }
-
-  {
-    u32 needed = (x_call_plan_stack_raw_size(p) + 15u) & ~15u;
-    if (needed > a->max_outgoing) {
-      if (a->known_frame)
-        compiler_panic(t->c, a->loc,
-                       "x64 call plan: known frame outgoing area too small");
-      a->max_outgoing = needed;
-    }
-  }
-
-  if (p->callee.kind == OPK_GLOBAL) {
-    u8 op = 0xE8;
-    mc->emit_bytes(mc, &op, 1);
-    u32 disp_pos = mc->pos(mc);
-    emit_u32le(mc, 0);
-    mc->emit_reloc_at(mc, mc->section_id, disp_pos, R_X64_PLT32,
-                      p->callee.v.global.sym, p->callee.v.global.addend - 4, 1,
-                      0);
-  } else if (p->callee.kind == OPK_REG) {
-    u32 r = p->callee.v.reg & 0xFu;
-    emit_rex(mc, 0, 0, 0, r);
-    u8 buf[2] = {0xFF, modrm(3u, 2u, r)};
-    mc->emit_bytes(mc, buf, 2);
-  } else {
-    compiler_panic(t->c, impl_of(t)->loc,
-                   "x64 emit_call_plan: callee kind %d unsupported",
-                   (int)p->callee.kind);
-  }
-}
-
-static Operand x_call_plan_offset_operand(Operand op, u32 offset) {
-  if (!offset) return op;
-  if (op.kind == OPK_INDIRECT) op.v.ind.ofs += (i32)offset;
-  return op;
-}
-
-static void x_load_call_arg(CGTarget* t, Operand dst, const CGCallPlanMove* m) {
-  Operand src = x_call_plan_offset_operand(m->src, m->src_offset);
-  if (m->src_kind == CG_CALL_PLAN_SRC_ADDR) {
-    x_addr_of(t, dst, src);
-    return;
-  }
-  if (src.kind == OPK_LOCAL) {
-    XImpl* a = impl_of(t);
-    XSlot* s = x64_slot_get(a, src.v.frame_slot);
-    if (!s) compiler_panic(t->c, a->loc, "x64 load_call_arg: bad slot");
-    i32 off = -(i32)s->off + (i32)m->src_offset;
-    if (dst.cls == RC_FP) {
-      u8 prefix2 = (m->mem.size == 8) ? 0xF2 : 0xF3;
-      emit_sse_load(t->mc, prefix2, 0x10, dst.v.reg & 0xFu, X64_RBP, off);
-    } else {
-      emit_mov_load(t->mc, m->mem.size, 0, dst.v.reg & 0xFu, X64_RBP, off);
-    }
-    return;
-  }
-  if (src.kind == OPK_INDIRECT && m->src_offset) {
-    x_load(t, dst, src, m->mem);
-    return;
-  }
-  if (src.kind == OPK_GLOBAL) {
-    x_addr_of(t, dst, src);
-    return;
-  }
-  x_load(t, dst, src, m->mem);
-}
-
-static void x_store_call_ret(CGTarget* t, const CGCallPlanRet* r, Operand src) {
-  Operand dst = r->dst;
-  if (dst.kind == OPK_INDIRECT) dst.v.ind.ofs += (i32)r->dst_offset;
-  if (dst.kind == OPK_LOCAL) {
-    XImpl* a = impl_of(t);
-    XSlot* s = x64_slot_get(a, dst.v.frame_slot);
-    if (!s) compiler_panic(t->c, a->loc, "x64 store_call_ret: bad slot");
-    i32 off = -(i32)s->off + (i32)r->dst_offset;
-    if (src.cls == RC_FP) {
-      u8 prefix2 = (r->mem.size == 8) ? 0xF2 : 0xF3;
-      emit_sse_store(t->mc, prefix2, 0x11, src.v.reg & 0xFu, X64_RBP, off);
-    } else {
-      emit_mov_store(t->mc, r->mem.size, src.v.reg & 0xFu, X64_RBP, off);
-    }
-    return;
-  }
-  x_store(t, dst, src, r->mem);
-}
-
-static void x_store_call_arg(CGTarget* t, const CGCallPlanMove* m) {
-  Operand addr;
-  addr = x_call_stack_arg_addr(t, m->stack_offset,
-                               m->dst_kind == CG_CALL_PLAN_TAIL_STACK);
-  addr.type = m->mem.type;
-
-  if (m->src_kind == CG_CALL_PLAN_SRC_ADDR) {
-    Operand tmp = {.kind = OPK_REG, .cls = RC_INT, .type = m->mem.type};
-    tmp.v.reg = X64_RAX;
-    x_load_call_arg(t, tmp, m);
-    x_store(t, addr, tmp, m->mem);
-    return;
-  }
-
-  if (m->src.kind == OPK_REG || m->src.kind == OPK_IMM) {
-    x_store(t, addr, m->src, m->mem);
-    return;
-  }
-  if (m->src.kind == OPK_GLOBAL) {
-    Operand tmp = {.kind = OPK_REG, .cls = RC_INT, .type = m->mem.type};
-    tmp.v.reg = X64_RAX;
-    x_load_call_arg(t, tmp, m);
-    x_store(t, addr, tmp, m->mem);
-    return;
-  }
-  if (m->src.kind == OPK_LOCAL || m->src.kind == OPK_INDIRECT) {
-    Operand tmp = {.kind = OPK_REG, .cls = m->cls, .type = m->mem.type};
-    tmp.v.reg = m->cls == RC_FP ? X64_XMM15 : X64_RAX;
-    x_load_call_arg(t, tmp, m);
-    x_store(t, addr, tmp, m->mem);
-    return;
-  }
-  compiler_panic(t->c, impl_of(t)->loc,
-                 "x64 store_call_arg: source kind %d unsupported",
-                 (int)m->src.kind);
-}
-
-static void x_ret(CGTarget* t, const CGABIValue* val) {
-  XImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-
-  if (val) {
-    const ABIArgInfo* ri = val->abi;
-    if (ri && ri->kind == ABI_ARG_INDIRECT) {
-      /* sret: reload destination pointer into rdi, memcpy source into [rdi]. */
-      u32 src_base;
-      i32 src_base_off;
-      u32 nbytes;
-      if (val->storage.kind == OPK_LOCAL) {
-        XSlot* s = x64_slot_get(a, val->storage.v.frame_slot);
-        if (!s) compiler_panic(t->c, a->loc, "x64 ret: bad sret slot");
-        src_base = X64_RBP;
-        src_base_off = -(i32)s->off;
-        nbytes = s->size;
-      } else if (val->storage.kind == OPK_INDIRECT) {
-        src_base = val->storage.v.ind.base & 0xFu;
-        src_base_off = val->storage.v.ind.ofs;
-        nbytes = val->size;
-        if (!nbytes) {
-          compiler_panic(t->c, a->loc,
-                         "x64 ret indirect: missing aggregate size");
-        }
-      } else {
-        compiler_panic(t->c, a->loc,
-                       "x64 ret indirect: storage kind %d unsupported",
-                       (int)val->storage.kind);
-      }
-      if (val->storage.kind == OPK_INDIRECT &&
-          (src_base == X64_RAX ||
-           (src_base == X64_RDI && a->sret_ptr_slot != FRAME_SLOT_NONE))) {
-        emit_mov_rr(mc, 1, X64_R11, src_base);
-        src_base = X64_R11;
-      }
-      if (a->sret_ptr_slot != FRAME_SLOT_NONE) {
-        XSlot* sp = x64_slot_get(a, a->sret_ptr_slot);
-        if (sp) emit_mov_load(mc, 8, 0, X64_RDI, X64_RBP, -(i32)sp->off);
-      }
-      u32 i = 0;
-      while (i + 8 <= nbytes) {
-        emit_mov_load(mc, 8, 0, X64_RAX, src_base, src_base_off + (i32)i);
-        emit_mov_store(mc, 8, X64_RAX, X64_RDI, (i32)i);
-        i += 8;
-      }
-      while (i + 4 <= nbytes) {
-        emit_mov_load(mc, 4, 0, X64_RAX, src_base, src_base_off + (i32)i);
-        emit_mov_store(mc, 4, X64_RAX, X64_RDI, (i32)i);
-        i += 4;
-      }
-      while (i + 2 <= nbytes) {
-        emit_mov_load(mc, 2, 0, X64_RAX, src_base, src_base_off + (i32)i);
-        emit_mov_store(mc, 2, X64_RAX, X64_RDI, (i32)i);
-        i += 2;
-      }
-      while (i < nbytes) {
-        emit_mov_load(mc, 1, 0, X64_RAX, src_base, src_base_off + (i32)i);
-        emit_mov_store(mc, 1, X64_RAX, X64_RDI, (i32)i);
-        i += 1;
-      }
-      /* Convention: return sret pointer in rax. */
-      emit_mov_rr(mc, 1, X64_RAX, X64_RDI);
-    } else if (val->storage.kind == OPK_REG) {
-      if (val->storage.cls == RC_FP) {
-        u8 prefix2 = type_is_fp_double(val->storage.type) ? 0xF2 : 0xF3;
-        u32 sr = val->storage.v.reg & 0xFu;
-        if (sr != X64_XMM0) emit_sse_rr(mc, prefix2, 0x10, X64_XMM0, sr);
-      } else {
-        int w = type_is_64(val->storage.type) ? 1 : 0;
-        u32 sr = val->storage.v.reg & 0xFu;
-        if (sr != X64_RAX) emit_mov_rr(mc, w, X64_RAX, sr);
-      }
-    } else if (val->storage.kind == OPK_IMM) {
-      int w = type_is_64(val->storage.type) ? 1 : 0;
-      x64_emit_load_imm(mc, w, X64_RAX, val->storage.v.imm);
-    } else if (val->storage.kind == OPK_LOCAL ||
-               val->storage.kind == OPK_INDIRECT) {
-      /* DIRECT struct return: load each part into rax/rdx or xmm0/xmm1. */
-      u32 base_reg;
-      i32 base_off;
-      if (val->storage.kind == OPK_LOCAL) {
-        XSlot* s = x64_slot_get(a, val->storage.v.frame_slot);
-        if (!s) compiler_panic(t->c, a->loc, "x64 ret: bad local slot");
-        base_reg = X64_RBP;
-        base_off = -(i32)s->off;
-      } else {
-        base_reg = val->storage.v.ind.base & 0xFu;
-        base_off = val->storage.v.ind.ofs;
-      }
-      const ABIArgInfo* ri2 = val->abi;
-      u32 next_int_ret = 0, next_fp_ret = 0;
-      static const u32 ret_int_regs[2] = {X64_RAX, X64_RDX};
-      for (u16 i = 0; i < (ri2 ? ri2->nparts : 0); ++i) {
-        const ABIArgPart* pt = &ri2->parts[i];
-        i32 off = base_off + (i32)pt->src_offset;
-        if (pt->cls == ABI_CLASS_INT) {
-          emit_mov_load(mc, pt->size, 0, ret_int_regs[next_int_ret++], base_reg,
-                        off);
-        } else if (pt->cls == ABI_CLASS_FP) {
-          u8 prefix2 = (pt->size == 8) ? 0xF2 : 0xF3;
-          emit_sse_load(mc, prefix2, 0x10, (u32)(X64_XMM0 + next_fp_ret++),
-                        base_reg, off);
-        } else {
-          compiler_panic(t->c, a->loc, "x64 ret: ret part cls %d unimpl",
-                         (int)pt->cls);
-        }
-      }
-    }
-  }
-  if (a->omit_frame) {
-    emit_ret(mc);
-    return;
-  }
-  emit_jmp_label(mc, a->epilogue_label);
-}
-
-/* ============================================================
- * Alloca / VLA.
- *
- * Layout (low → high addresses, after a `sub rsp, aligned_size`):
- *   [rsp + 0, +max_outgoing):           outgoing-arg area
- *   [rsp + max_outgoing, +max_outgoing + aligned_size): newly allocated block
- *
- * max_outgoing is only known at func_end (it is the max across all
- * x_call sites in the function), so each alloca emits a placeholder
- * `lea dst, [rsp + 0]` whose 4-byte disp is patched at func_end. The
- * epilogue restores rsp via `leave` (mov rsp, rbp; pop rbp), so no
- * extra dance is needed when alloca is present. */
-
-static void emit_lea_rsp_disp32(MCEmitter* mc, u32 dst, u32* out_disp_pos) {
-  /* Force the disp32 form (mod=10, rm=SIB, base=rsp, no index, scale=0)
-   * regardless of the displacement value so func_end has a fixed-width
-   * field to patch. 8 bytes: REX.W [+R] | 0x8D | ModRM | SIB | disp32. */
-  u32 ofs = obj_pos(mc->obj, mc->section_id);
-  emit_rex(mc, 1, dst, 0, X64_RSP);
-  u8 op = 0x8D;
-  mc->emit_bytes(mc, &op, 1);
-  u8 mr = modrm(2u, dst & 7u, 4u);
-  mc->emit_bytes(mc, &mr, 1);
-  u8 s = sib(0, 4u, X64_RSP);
-  mc->emit_bytes(mc, &s, 1);
-  *out_disp_pos = mc->pos(mc);
-  emit_u32le(mc, 0);
-  if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc);
-}
-
-static void x_alloca_(CGTarget* t, Operand d, Operand sz, u32 align) {
-  XImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-  if (d.kind != OPK_REG)
-    compiler_panic(t->c, a->loc, "x64 alloca: dst must be REG");
-  if (align > 16) {
-    compiler_panic(t->c, a->loc, "x64 alloca: align %u > 16 not yet supported",
-                   align);
-  }
-
-  if (sz.kind == OPK_IMM) {
-    i64 v = sz.v.imm;
-    if (v < 0) compiler_panic(t->c, a->loc, "x64 alloca: negative size");
-    u64 aligned = ((u64)v + 15u) & ~(u64)15u;
-    if (aligned == 0) aligned = 16;
-    /* sub rsp, imm32 : REX.W 0x81 /5 imm32 (7 bytes). */
-    emit_rex(mc, 1, 0, 0, X64_RSP);
-    u8 buf[2] = {0x81, modrm(3u, 5u, X64_RSP)};
-    mc->emit_bytes(mc, buf, 2);
-    emit_u32le(mc, (u32)aligned);
-  } else if (sz.kind == OPK_REG) {
-    u32 sz_reg = sz.v.reg & 0xFu;
-    /* rax = (sz_reg + 15) & ~15 */
-    emit_lea(mc, X64_RAX, sz_reg, 15);
-    /* and rax, -16 : REX.W 0x83 /4 imm8(0xF0). */
-    emit_rex(mc, 1, 0, 0, X64_RAX);
-    u8 abuf[3] = {0x83, modrm(3u, 4u, X64_RAX), 0xF0};
-    mc->emit_bytes(mc, abuf, 3);
-    /* sub rsp, rax */
-    emit_alu_rr(mc, 1, 0x29, X64_RSP, X64_RAX);
-  } else {
-    compiler_panic(t->c, a->loc, "x64 alloca: size kind %d unsupported",
-                   (int)sz.kind);
-  }
-
-  /* lea dst, [rsp + max_outgoing] — placeholder, patched at func_end. */
-  if (a->nalloca_patches == a->alloca_patches_cap) {
-    u32 ncap = a->alloca_patches_cap ? a->alloca_patches_cap * 2u : 4u;
-    XAllocaPatch* nb = arena_array(t->c->tu, XAllocaPatch, ncap);
-    if (a->alloca_patches)
-      memcpy(nb, a->alloca_patches, sizeof(XAllocaPatch) * a->nalloca_patches);
-    a->alloca_patches = nb;
-    a->alloca_patches_cap = ncap;
-  }
-  u32 disp_pos;
-  emit_lea_rsp_disp32(mc, d.v.reg & 0xFu, &disp_pos);
-  a->alloca_patches[a->nalloca_patches].disp_pos = disp_pos;
-  a->nalloca_patches++;
-  a->has_alloca = 1;
-}
-
-/* SysV AMD64 __va_list_tag (24 bytes, 8-aligned):
- *   off  0  u32  gp_offset    next free GP slot in reg_save_area (0..48)
- *   off  4  u32  fp_offset    next free FP slot                  (48..176)
- *   off  8  ptr  overflow_arg_area  pointer to next stack-passed arg
- *   off 16  ptr  reg_save_area      pointer to the 176-byte save area
- *
- * The reg_save_area layout (filled in func_begin):
- *   +0..+40   : rdi, rsi, rdx, rcx, r8, r9 (8B each)
- *   +48..+160 : xmm0..xmm7 at 16B stride (low 8B written via movsd)
- *
- * va_arg dispatches on dst class. When the relevant offset reaches its
- * max (48 for GP, 176 for FP), fall through to overflow_arg_area at
- * 8-byte stride. */
-
-static void x_va_start_(CGTarget* t, Operand ap_op) {
-  XImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-  if (!a->is_variadic)
-    compiler_panic(t->c, a->loc, "x64 va_start: function not variadic");
-  u32 ap = ap_op.v.reg & 0xFu;
-  if (a->abi->shadow_space) {
-    /* Win64 va_list is a single pointer to the next variadic stack
-     * slot. The 32 B caller-allocated home space at [rbp + 16] holds
-     * the first four named integer args (RCX/RDX/R8/R9, spilled by
-     * the prologue's variadic save). Variadic args start immediately
-     * after the named args at:
-     *     [rbp + 16 + named_int_count * 8 + named_stack_bytes]
-     * x_emit_variadic_reg_saves already spilled the four arg regs to
-     * the home space; va_arg consumes from there onward at 8-byte
-     * stride (the call-site duplicates FP varargs into the matching
-     * GPR, so all FP varargs are reachable through the integer arm). */
-    u32 first_var_off = 16u + a->next_param_int * 8u + a->next_param_stack;
-    emit_lea(mc, X64_RAX, X64_RBP, (i32)first_var_off);
-    emit_mov_store(mc, 8, X64_RAX, ap, 0);
-    return;
-  }
-  XSlot* rs = x64_slot_get(a, a->reg_save_slot);
-  if (!rs) compiler_panic(t->c, a->loc, "x64 va_start: no reg_save_slot");
-
-  /* gp_offset = next_param_int * 8 */
-  x64_emit_load_imm(mc, 0, X64_RAX, (i64)(a->next_param_int * 8u));
-  emit_mov_store(mc, 4, X64_RAX, ap, 0);
-  /* fp_offset = 48 + next_param_fp * 16 */
-  x64_emit_load_imm(mc, 0, X64_RAX, (i64)(48u + a->next_param_fp * 16u));
-  emit_mov_store(mc, 4, X64_RAX, ap, 4);
-  /* overflow_arg_area = rbp + 16 + next_param_stack */
-  emit_lea(mc, X64_RAX, X64_RBP, (i32)(16u + a->next_param_stack));
-  emit_mov_store(mc, 8, X64_RAX, ap, 8);
-  /* reg_save_area = rbp - reg_save_slot.off */
-  emit_lea(mc, X64_RAX, X64_RBP, -(i32)rs->off);
-  emit_mov_store(mc, 8, X64_RAX, ap, 16);
-}
-
-static void x_va_arg_(CGTarget* t, Operand dst, Operand ap_op,
-                      CfreeCgTypeId ty) {
-  XImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-  u32 ap = ap_op.v.reg & 0xFu;
-  u32 sz = type_byte_size(ty);
-  int is_fp = (dst.cls == RC_FP);
-  u32 dr = dst.v.reg & 0xFu;
-  if (a->abi->shadow_space) {
-    /* Win64: va_list is a plain pointer to the next slot. Every
-     * variadic arg occupies exactly one 8-byte slot (aggregates that
-     * do not fit are passed by hidden pointer; cfree's caller side
-     * already handles that). FP varargs are duplicated into the
-     * matching GPR slot at the call site (vararg_fp_dup_to_gpr), so
-     * we always load from the integer slot at *ap.
-     *   r11 = *ap         ; current slot address
-     *   dst = [r11]       ; load
-     *   r11 += 8          ; advance
-     *   *ap = r11         ; write back */
-    emit_mov_load(mc, 8, 0, X64_R11, ap, 0);
-    if (is_fp) {
-      u8 prefix = (sz == 8) ? 0xF2 : 0xF3;
-      emit_sse_load(mc, prefix, 0x10, dr, X64_R11, 0);
-    } else {
-      int sx = type_is_signed(ty);
-      emit_mov_load(mc, sz, sx, dr, X64_R11, 0);
-    }
-    /* add r11, 8 : REX.WB 0x83 /0 imm8. */
-    {
-      u32 ofs = obj_pos(mc->obj, mc->section_id);
-      u8 rex = (u8)(X64_REX_BASE | X64_REX_W | X64_REX_B);
-      mc->emit_bytes(mc, &rex, 1);
-      u8 buf[3] = {0x83, modrm(3u, 0u, X64_R11 & 7u), 8};
-      mc->emit_bytes(mc, buf, 3);
-      if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc);
-    }
-    emit_mov_store(mc, 8, X64_R11, ap, 0);
-    return;
-  }
-  u32 offs_field = is_fp ? 4u : 0u;
-  u32 max_offs = is_fp ? 176u : 48u;
-  u32 stride = is_fp ? 16u : 8u;
-
-  MCLabel L_stack = mc->label_new(mc);
-  MCLabel L_done = mc->label_new(mc);
-
-  /* eax = ap[offs_field]; cmp eax, max_offs; jae L_stack. */
-  emit_mov_load(mc, 4, 0, X64_RAX, ap, (i32)offs_field);
-  if (max_offs <= 127u) {
-    emit_cmp_imm8(mc, 0, X64_RAX, (i8)max_offs);
-  } else {
-    /* cmp eax, imm32 : 0x3D imm32 (5 bytes, EAX-specific form). */
-    u32 ofs = obj_pos(mc->obj, mc->section_id);
-    u8 op = 0x3D;
-    mc->emit_bytes(mc, &op, 1);
-    emit_u32le(mc, max_offs);
-    if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc);
-  }
-  emit_jcc_label(mc, X64_CC_AE, L_stack);
-
-  /* Reg path:
-   *   r11 = ap[16] (reg_save_area)
-   *   r11 = r11 + rax
-   *   load dst from [r11 + 0]
-   *   eax += stride; ap[offs_field] = eax
-   *   jmp L_done */
-  emit_mov_load(mc, 8, 0, X64_R11, ap, 16);
-  emit_alu_rr(mc, 1, 0x01, X64_R11, X64_RAX);
-  if (is_fp) {
-    u8 prefix = (sz == 8) ? 0xF2 : 0xF3;
-    emit_sse_load(mc, prefix, 0x10, dr, X64_R11, 0);
-  } else {
-    int sx = type_is_signed(ty);
-    emit_mov_load(mc, sz, sx, dr, X64_R11, 0);
-  }
-  /* add eax, imm8 : 0x83 /0 imm8 (no REX needed for eax). */
-  {
-    u32 ofs = obj_pos(mc->obj, mc->section_id);
-    u8 buf[3] = {0x83, modrm(3u, 0u, X64_RAX), (u8)stride};
-    mc->emit_bytes(mc, buf, 3);
-    if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc);
-  }
-  emit_mov_store(mc, 4, X64_RAX, ap, (i32)offs_field);
-  emit_jmp_label(mc, L_done);
-
-  /* L_stack:
-   *   r11 = ap[8] (overflow_arg_area)
-   *   load dst from [r11 + 0]
-   *   r11 += 8; ap[8] = r11 */
-  mc->label_place(mc, L_stack);
-  emit_mov_load(mc, 8, 0, X64_R11, ap, 8);
-  if (is_fp) {
-    u8 prefix = (sz == 8) ? 0xF2 : 0xF3;
-    emit_sse_load(mc, prefix, 0x10, dr, X64_R11, 0);
-  } else {
-    int sx = type_is_signed(ty);
-    emit_mov_load(mc, sz, sx, dr, X64_R11, 0);
-  }
-  /* add r11, 8 : REX.WB 0x83 /0 imm8. */
-  {
-    u32 ofs = obj_pos(mc->obj, mc->section_id);
-    u8 rex = (u8)(X64_REX_BASE | X64_REX_W | X64_REX_B);
-    mc->emit_bytes(mc, &rex, 1);
-    u8 buf[3] = {0x83, modrm(3u, 0u, X64_R11 & 7u), 8};
-    mc->emit_bytes(mc, buf, 3);
-    if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc);
-  }
-  emit_mov_store(mc, 8, X64_R11, ap, 8);
-
-  mc->label_place(mc, L_done);
-}
-
-static void x_va_end_(CGTarget* t, Operand a) {
-  (void)t;
-  (void)a;
-}
-
-static void x_va_copy_(CGTarget* t, Operand d, Operand s) {
-  XImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-  u32 dr = d.v.reg & 0xFu;
-  u32 sr = s.v.reg & 0xFu;
-  if (a->abi->shadow_space) {
-    /* Win64 va_list is a single 8-byte pointer. */
-    emit_mov_load(mc, 8, 0, X64_RAX, sr, 0);
-    emit_mov_store(mc, 8, X64_RAX, dr, 0);
-    return;
-  }
-  /* SysV va_list is 24 bytes; three 8B loads + stores via rax. */
-  for (u32 i = 0; i < 24u; i += 8u) {
-    emit_mov_load(mc, 8, 0, X64_RAX, sr, (i32)i);
-    emit_mov_store(mc, 8, X64_RAX, dr, (i32)i);
-  }
-}
-
-/* ============================================================
- * Atomics (Group K).
- *
- * x86 has a strong memory model: plain MOV is acquire on loads and
- * release on stores, so most MemOrders need no extra fence. The
- * exception is SEQ_CST stores, which need a full StoreLoad barrier —
- * realized either via XCHG (which has implicit LOCK) or MOV+MFENCE.
- * All LOCK-prefixed RMWs (XADD/XCHG/CMPXCHG) act as full barriers,
- * subsuming any MemOrder the front end requests. */
-
-static void emit_lock_prefix(MCEmitter* mc) {
-  u8 b = 0xF0;
-  mc->emit_bytes(mc, &b, 1);
-}
-
-static void emit_mfence(MCEmitter* mc) {
-  u32 ofs = obj_pos(mc->obj, mc->section_id);
-  u8 b[3] = {0x0F, 0xAE, 0xF0};
-  mc->emit_bytes(mc, b, 3);
-  if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc);
-}
-
-static void emit_ud2(MCEmitter* mc) {
-  u32 ofs = obj_pos(mc->obj, mc->section_id);
-  u8 b[2] = {0x0F, 0x0B};
-  mc->emit_bytes(mc, b, 2);
-  if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc);
-}
-
-/* LOCK XADD [base+disp], src. Opcode 0F C1 /r (32/64-bit; sets src=prior,
- * mem=mem+src). */
-static void emit_lock_xadd(MCEmitter* mc, int w, u32 src, u32 base, i32 disp) {
-  u32 ofs = obj_pos(mc->obj, mc->section_id);
-  emit_lock_prefix(mc);
-  emit_rex(mc, w, src, 0, base);
-  u8 op[2] = {0x0F, 0xC1};
-  mc->emit_bytes(mc, op, 2);
-  emit_mem_operand(mc, src, base, disp);
-  if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc);
-}
-
-/* XCHG [base+disp], src. Opcode 87 /r. LOCK is implicit when the
- * destination is memory, but we emit it explicitly for clarity. */
-static void emit_lock_xchg_mem(MCEmitter* mc, int w, u32 src, u32 base,
-                               i32 disp) {
-  u32 ofs = obj_pos(mc->obj, mc->section_id);
-  emit_lock_prefix(mc);
-  emit_rex(mc, w, src, 0, base);
-  u8 op = 0x87;
-  mc->emit_bytes(mc, &op, 1);
-  emit_mem_operand(mc, src, base, disp);
-  if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc);
-}
-
-/* LOCK CMPXCHG [base+disp], src. Opcode 0F B1 /r. Compares RAX with [mem];
- * if equal, [mem]=src and ZF=1; else RAX=[mem] and ZF=0. */
-static void emit_lock_cmpxchg(MCEmitter* mc, int w, u32 src, u32 base,
-                              i32 disp) {
-  u32 ofs = obj_pos(mc->obj, mc->section_id);
-  emit_lock_prefix(mc);
-  emit_rex(mc, w, src, 0, base);
-  u8 op[2] = {0x0F, 0xB1};
-  mc->emit_bytes(mc, op, 2);
-  emit_mem_operand(mc, src, base, disp);
-  if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc);
-}
-
-/* POPCNT rd, rs. Encoding: F3 0F B8 /r. */
-static void emit_popcnt(MCEmitter* mc, int w, u32 dst, u32 src) {
-  u32 ofs = obj_pos(mc->obj, mc->section_id);
-  u8 p = 0xF3;
-  mc->emit_bytes(mc, &p, 1);
-  emit_rex(mc, w, dst, 0, src);
-  u8 op[2] = {0x0F, 0xB8};
-  mc->emit_bytes(mc, op, 2);
-  emit_rm_reg(mc, dst, src);
-  if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc);
-}
-
-/* BSF/BSR rd, rs. opcode2 = 0xBC (BSF) or 0xBD (BSR). */
-static void emit_bs(MCEmitter* mc, int w, u8 opcode2, u32 dst, u32 src) {
-  u32 ofs = obj_pos(mc->obj, mc->section_id);
-  emit_rex(mc, w, dst, 0, src);
-  u8 op[2] = {0x0F, opcode2};
-  mc->emit_bytes(mc, op, 2);
-  emit_rm_reg(mc, dst, src);
-  if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc);
-}
-
-/* BSWAP r32/r64. Opcode 0F C8+r; REX.W for r64; REX.B if reg>=8. */
-static void emit_bswap(MCEmitter* mc, int w, u32 reg) {
-  u32 ofs = obj_pos(mc->obj, mc->section_id);
-  emit_rex(mc, w, 0, 0, reg);
-  u8 op[2] = {0x0F, (u8)(0xC8 + (reg & 7))};
-  mc->emit_bytes(mc, op, 2);
-  if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc);
-}
-
-/* ROL r/m16, imm8. Used to swap bytes in a 16-bit value (ROL by 8). */
-static void emit_rol16_imm8(MCEmitter* mc, u32 reg, u8 imm) {
-  u32 ofs = obj_pos(mc->obj, mc->section_id);
-  u8 p = 0x66;
-  mc->emit_bytes(mc, &p, 1);
-  emit_rex(mc, 0, 0, 0, reg);
-  u8 buf[3];
-  buf[0] = 0xC1;
-  buf[1] = modrm(3u, 0u, reg & 7u);
-  buf[2] = imm;
-  mc->emit_bytes(mc, buf, 3);
-  if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc);
-}
-
-/* XOR r/m, imm32 — opcode 81 /6. Used to compute (bits-1) - x via XOR. */
-static void emit_xor_imm32(MCEmitter* mc, int w, u32 reg, i32 imm) {
-  u32 ofs = obj_pos(mc->obj, mc->section_id);
-  emit_rex(mc, w, 0, 0, reg);
-  u8 op = 0x81;
-  mc->emit_bytes(mc, &op, 1);
-  emit_rm_reg(mc, 6u, reg);
-  emit_u32le(mc, (u32)imm);
-  if (mc->debug) debug_emit_row(mc->debug, mc->section_id, ofs, mc->loc);
-}
-
-/* Resolve an atomic addr operand to (base, disp) for a memory operand.
- * Accepts OPK_REG (pointer in reg, disp=0), OPK_LOCAL, or OPK_INDIRECT. */
-static u32 atomic_addr_base(CGTarget* t, Operand addr, i32* out_disp) {
-  if (addr.kind == OPK_REG) {
-    *out_disp = 0;
-    return addr.v.reg & 0xFu;
-  }
-  return addr_base(t, addr, out_disp);
-}
-
-static void x_atomic_load(CGTarget* t, Operand dst, Operand addr, MemAccess ma,
-                          MemOrder ord) {
-  MCEmitter* mc = t->mc;
-  (void)ord; /* x86: plain MOV satisfies all orders for loads. */
-  u32 sz = ma.size ? ma.size : type_byte_size(dst.type);
-  i32 disp;
-  u32 base = atomic_addr_base(t, addr, &disp);
-  int signed_ = type_is_signed(ma.type ? ma.type : dst.type);
-  emit_mov_load(mc, sz, signed_, dst.v.reg & 0xFu, base, disp);
-}
-
-static void x_atomic_store(CGTarget* t, Operand addr, Operand src, MemAccess ma,
-                           MemOrder ord) {
-  XImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-  u32 sz = ma.size ? ma.size : type_byte_size(src.type);
-  int w = (sz == 8) ? 1 : 0;
-  i32 disp;
-  u32 base = atomic_addr_base(t, addr, &disp);
-
-  /* Materialize src into a register. */
-  u32 sr;
-  if (src.kind == OPK_IMM) {
-    x64_emit_load_imm(mc, w, X64_R11, src.v.imm);
-    sr = X64_R11;
-  } else if (src.kind == OPK_REG) {
-    sr = src.v.reg & 0xFu;
-  } else {
-    compiler_panic(t->c, a->loc, "x64 atomic_store: src kind %d unsupported",
-                   (int)src.kind);
-  }
-
-  if (ord == MO_SEQ_CST) {
-    /* SEQ_CST store: XCHG implicitly fences. Move src into r11 so the
-     * caller's reg is unmodified, then xchg [mem], r11. */
-    if (sr != X64_R11) emit_mov_rr(mc, w, X64_R11, sr);
-    emit_lock_xchg_mem(mc, w, X64_R11, base, disp);
-    return;
-  }
-  /* Plain store covers RELAXED / RELEASE. */
-  emit_mov_store(mc, sz, sr, base, disp);
-}
-
-static void x_atomic_rmw(CGTarget* t, AtomicOp op, Operand dst, Operand addr,
-                         Operand val, MemAccess ma, MemOrder ord) {
-  XImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-  (void)ord; /* LOCK-prefixed ops are unconditionally full barriers. */
-  u32 sz = ma.size ? ma.size : type_byte_size(dst.type);
-  int w = (sz == 8) ? 1 : 0;
-  i32 disp;
-  u32 base = atomic_addr_base(t, addr, &disp);
-  u32 dr = dst.v.reg & 0xFu;
-
-  /* Materialize val into r11 (it's our working temp). For SUB we negate
-   * it so the XADD does the subtraction. */
-  if (val.kind == OPK_IMM) {
-    i64 v = val.v.imm;
-    if (op == AO_SUB) v = -v;
-    x64_emit_load_imm(mc, w, X64_R11, v);
-  } else if (val.kind == OPK_REG) {
-    u32 vr = val.v.reg & 0xFu;
-    if (vr != X64_R11) emit_mov_rr(mc, w, X64_R11, vr);
-    if (op == AO_SUB) emit_f7_rm(mc, w, 3u, X64_R11); /* NEG */
-  } else {
-    compiler_panic(t->c, a->loc, "x64 atomic_rmw: val kind %d unsupported",
-                   (int)val.kind);
-  }
-
-  if (op == AO_ADD || op == AO_SUB) {
-    /* LOCK XADD [base], r11 — afterwards r11 holds prior. */
-    emit_lock_xadd(mc, w, X64_R11, base, disp);
-    if (dr != X64_R11) emit_mov_rr(mc, w, dr, X64_R11);
-    return;
-  }
-  if (op == AO_XCHG) {
-    emit_lock_xchg_mem(mc, w, X64_R11, base, disp);
-    if (dr != X64_R11) emit_mov_rr(mc, w, dr, X64_R11);
-    return;
-  }
-
-  /* AND/OR/XOR/NAND: CMPXCHG retry loop.
-   *
-   *     mov rax, [mem]
-   *   .retry:
-   *     mov rcx, rax           ; new = prior
-   *     <op> rcx, r11          ; combine with val
-   *     [NAND: not rcx]
-   *     lock cmpxchg [mem], rcx
-   *     jne .retry
-   *     mov dr, rax
-   *
-   * rax = prior (cmpxchg implicit), rcx = new (scratch), r11 = val. */
-  emit_mov_load(mc, sz, 0, X64_RAX, base, disp);
-  MCLabel L_retry = mc->label_new(mc);
-  mc->label_place(mc, L_retry);
-  emit_mov_rr(mc, w, X64_RCX, X64_RAX);
-  switch (op) {
-    case AO_AND:
-      emit_alu_rr(mc, w, 0x21, X64_RCX, X64_R11);
-      break;
-    case AO_OR:
-      emit_alu_rr(mc, w, 0x09, X64_RCX, X64_R11);
-      break;
-    case AO_XOR:
-      emit_alu_rr(mc, w, 0x31, X64_RCX, X64_R11);
-      break;
-    case AO_NAND:
-      emit_alu_rr(mc, w, 0x21, X64_RCX, X64_R11);
-      emit_f7_rm(mc, w, 2u, X64_RCX); /* NOT */
-      break;
-    default:
-      compiler_panic(t->c, a->loc, "x64 atomic_rmw: op %d unimpl", (int)op);
-  }
-  emit_lock_cmpxchg(mc, w, X64_RCX, base, disp);
-  emit_jcc_label(mc, X64_CC_NE, L_retry);
-  if (dr != X64_RAX) emit_mov_rr(mc, w, dr, X64_RAX);
-}
-
-static void x_atomic_cas(CGTarget* t, Operand prior, Operand ok, Operand addr,
-                         Operand expected, Operand desired, MemAccess ma,
-                         MemOrder succ, MemOrder fail) {
-  XImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-  (void)succ;
-  (void)fail;
-  u32 sz = ma.size ? ma.size : type_byte_size(prior.type);
-  int w = (sz == 8) ? 1 : 0;
-  i32 disp;
-  u32 base = atomic_addr_base(t, addr, &disp);
-
-  /* RAX = expected. */
-  if (expected.kind == OPK_IMM) {
-    x64_emit_load_imm(mc, w, X64_RAX, expected.v.imm);
-  } else if (expected.kind == OPK_REG) {
-    u32 er = expected.v.reg & 0xFu;
-    if (er != X64_RAX) emit_mov_rr(mc, w, X64_RAX, er);
-  } else {
-    compiler_panic(t->c, a->loc, "x64 atomic_cas: exp kind %d unsupported",
-                   (int)expected.kind);
-  }
-  /* R11 = desired. */
-  if (desired.kind == OPK_IMM) {
-    x64_emit_load_imm(mc, w, X64_R11, desired.v.imm);
-  } else if (desired.kind == OPK_REG) {
-    u32 dr2 = desired.v.reg & 0xFu;
-    if (dr2 != X64_R11) emit_mov_rr(mc, w, X64_R11, dr2);
-  } else {
-    compiler_panic(t->c, a->loc, "x64 atomic_cas: des kind %d unsupported",
-                   (int)desired.kind);
-  }
-
-  emit_lock_cmpxchg(mc, w, X64_R11, base, disp);
-
-  /* ok = ZF (success). */
-  u32 ok_r = ok.v.reg & 0xFu;
-  emit_setcc(mc, X64_CC_E, ok_r);
-  emit_movzx_r32_r8(mc, ok_r, ok_r);
-
-  /* prior = rax. */
-  u32 pr = prior.v.reg & 0xFu;
-  if (pr != X64_RAX) emit_mov_rr(mc, w, pr, X64_RAX);
-}
-
-static void x_fence(CGTarget* t, MemOrder o) {
-  /* x86: only SEQ_CST needs an explicit StoreLoad barrier (MFENCE).
-   * Weaker orders emit nothing: plain MOVs are already acquire/release. */
-  if (o == MO_SEQ_CST) emit_mfence(t->mc);
-}
-
-/* ============================================================
- * Intrinsics (Group L). */
-
-static void x_intrinsic(CGTarget* t, IntrinKind kind, Operand* dsts, u32 nd,
-                        const Operand* args, u32 na) {
-  XImpl* a = impl_of(t);
-  MCEmitter* mc = t->mc;
-  (void)nd;
-  (void)na;
-
-  switch (kind) {
-    case INTRIN_POPCOUNT: {
-      Operand src = args[0];
-      Operand dst = dsts[0];
-      int w = type_is_64(src.type) ? 1 : 0;
-      emit_popcnt(mc, w, dst.v.reg & 0xFu, src.v.reg & 0xFu);
-      return;
-    }
-    case INTRIN_CTZ: {
-      /* BSF gives the index of the lowest set bit (undefined for 0). */
-      Operand src = args[0];
-      Operand dst = dsts[0];
-      int w = type_is_64(src.type) ? 1 : 0;
-      emit_bs(mc, w, 0xBC, dst.v.reg & 0xFu, src.v.reg & 0xFu);
-      return;
-    }
-    case INTRIN_CLZ: {
-      /* BSR gives the index of the highest set bit (undefined for 0);
-       * clz = (bits-1) - bsr, computed as bsr XOR (bits-1). */
-      Operand src = args[0];
-      Operand dst = dsts[0];
-      int w = type_is_64(src.type) ? 1 : 0;
-      u32 dr = dst.v.reg & 0xFu;
-      emit_bs(mc, w, 0xBD, dr, src.v.reg & 0xFu);
-      emit_xor_imm32(mc, w, dr, w ? 63 : 31);
-      return;
-    }
-    case INTRIN_BSWAP16: {
-      Operand src = args[0];
-      Operand dst = dsts[0];
-      u32 dr = dst.v.reg & 0xFu;
-      u32 sr = src.v.reg & 0xFu;
-      if (dr != sr) emit_mov_rr(mc, 0, dr, sr);
-      emit_rol16_imm8(mc, dr, 8);
-      return;
-    }
-    case INTRIN_BSWAP32: {
-      Operand src = args[0];
-      Operand dst = dsts[0];
-      u32 dr = dst.v.reg & 0xFu;
-      u32 sr = src.v.reg & 0xFu;
-      if (dr != sr) emit_mov_rr(mc, 0, dr, sr);
-      emit_bswap(mc, 0, dr);
-      return;
-    }
-    case INTRIN_BSWAP64: {
-      Operand src = args[0];
-      Operand dst = dsts[0];
-      u32 dr = dst.v.reg & 0xFu;
-      u32 sr = src.v.reg & 0xFu;
-      if (dr != sr) emit_mov_rr(mc, 1, dr, sr);
-      emit_bswap(mc, 1, dr);
-      return;
-    }
-    case INTRIN_MEMCPY:
-    case INTRIN_MEMMOVE: {
-      /* args = (dst_addr, src_addr, n_bytes). v1: const n, REG ptrs. */
-      Operand da = args[0], sa = args[1], nb = args[2];
-      if (da.kind != OPK_REG || sa.kind != OPK_REG || nb.kind != OPK_IMM) {
-        compiler_panic(
-            t->c, a->loc, "x64 intrinsic: %.*s with non-const n or non-REG ptr",
-            SLICE_ARG(
-                slice_from_cstr(kind == INTRIN_MEMCPY ? "memcpy" : "memmove")));
-      }
-      u32 dr = da.v.reg & 0xFu;
-      u32 sr = sa.v.reg & 0xFu;
-      u32 n = (u32)nb.v.imm;
-      if (kind == INTRIN_MEMCPY) {
-        u32 i = 0;
-        while (i + 8 <= n) {
-          emit_mov_load(mc, 8, 0, X64_RAX, sr, (i32)i);
-          emit_mov_store(mc, 8, X64_RAX, dr, (i32)i);
-          i += 8;
-        }
-        while (i + 4 <= n) {
-          emit_mov_load(mc, 4, 0, X64_RAX, sr, (i32)i);
-          emit_mov_store(mc, 4, X64_RAX, dr, (i32)i);
-          i += 4;
-        }
-        while (i + 2 <= n) {
-          emit_mov_load(mc, 2, 0, X64_RAX, sr, (i32)i);
-          emit_mov_store(mc, 2, X64_RAX, dr, (i32)i);
-          i += 2;
-        }
-        while (i < n) {
-          emit_mov_load(mc, 1, 0, X64_RAX, sr, (i32)i);
-          emit_mov_store(mc, 1, X64_RAX, dr, (i32)i);
-          i += 1;
-        }
-      } else {
-        /* memmove: backward copy; only dst>src overlap is safe. */
-        u32 i = n;
-        while (i >= 8) {
-          i -= 8;
-          emit_mov_load(mc, 8, 0, X64_RAX, sr, (i32)i);
-          emit_mov_store(mc, 8, X64_RAX, dr, (i32)i);
-        }
-        while (i >= 4) {
-          i -= 4;
-          emit_mov_load(mc, 4, 0, X64_RAX, sr, (i32)i);
-          emit_mov_store(mc, 4, X64_RAX, dr, (i32)i);
-        }
-        while (i >= 2) {
-          i -= 2;
-          emit_mov_load(mc, 2, 0, X64_RAX, sr, (i32)i);
-          emit_mov_store(mc, 2, X64_RAX, dr, (i32)i);
-        }
-        while (i >= 1) {
-          i -= 1;
-          emit_mov_load(mc, 1, 0, X64_RAX, sr, (i32)i);
-          emit_mov_store(mc, 1, X64_RAX, dr, (i32)i);
-        }
-      }
-      return;
-    }
-    case INTRIN_MEMSET: {
-      /* args = (dst_addr, byte, n). */
-      Operand da = args[0], bv = args[1], nb = args[2];
-      if (da.kind != OPK_REG || nb.kind != OPK_IMM) {
-        compiler_panic(t->c, a->loc,
-                       "x64 intrinsic: memset with non-const n / non-REG ptr");
-      }
-      u32 dr = da.v.reg & 0xFu;
-      u32 n = (u32)nb.v.imm;
-      /* Build a 64-bit value with the byte broadcast across all 8 bytes. */
-      if (bv.kind == OPK_IMM) {
-        u8 byte = (u8)(bv.v.imm & 0xffu);
-        u64 b64 = byte;
-        b64 |= b64 << 8;
-        b64 |= b64 << 16;
-        b64 |= b64 << 32;
-        x64_emit_load_imm(mc, 1, X64_RAX, (i64)b64);
-      } else if (bv.kind == OPK_REG) {
-        /* Broadcast low byte of bv across 8 bytes: rax = (u8)bv *
-         * 0x0101010101010101 (zero-extend so upper bits cannot leak). */
-        x64_emit_load_imm(mc, 1, X64_R11, (i64)0x0101010101010101ll);
-        emit_movzx_r32_r8(mc, X64_RAX, bv.v.reg & 0xFu);
-        emit_imul_rr(mc, 1, X64_RAX, X64_R11);
-      } else {
-        compiler_panic(t->c, a->loc,
-                       "x64 intrinsic: memset byte kind %d unsupported",
-                       (int)bv.kind);
-      }
-      u32 i = 0;
-      while (i + 8 <= n) {
-        emit_mov_store(mc, 8, X64_RAX, dr, (i32)i);
-        i += 8;
-      }
-      while (i + 4 <= n) {
-        emit_mov_store(mc, 4, X64_RAX, dr, (i32)i);
-        i += 4;
-      }
-      while (i + 2 <= n) {
-        emit_mov_store(mc, 2, X64_RAX, dr, (i32)i);
-        i += 2;
-      }
-      while (i < n) {
-        emit_mov_store(mc, 1, X64_RAX, dr, (i32)i);
-        i += 1;
-      }
-      return;
-    }
-    case INTRIN_PREFETCH:
-      /* Drop the hint. */
-      return;
-    case INTRIN_ASSUME_ALIGNED: {
-      /* dst = src (alignment is a hint only). */
-      Operand src = args[0];
-      Operand dst = dsts[0];
-      u32 dr = dst.v.reg & 0xFu;
-      u32 sr = src.v.reg & 0xFu;
-      if (dr != sr) emit_mov_rr(mc, 1, dr, sr);
-      return;
-    }
-    case INTRIN_EXPECT: {
-      /* dst = val; expected hint dropped. */
-      Operand val = args[0];
-      Operand dst = dsts[0];
-      int w = type_is_64(dst.type) ? 1 : 0;
-      u32 dr = dst.v.reg & 0xFu;
-      if (val.kind == OPK_REG) {
-        u32 sr = val.v.reg & 0xFu;
-        if (sr != dr) emit_mov_rr(mc, w, dr, sr);
-      } else if (val.kind == OPK_IMM) {
-        x64_emit_load_imm(mc, w, dr, val.v.imm);
-      } else {
-        compiler_panic(t->c, a->loc,
-                       "x64 intrinsic: expect val kind %d unsupported",
-                       (int)val.kind);
-      }
-      return;
-    }
-    case INTRIN_UNREACHABLE:
-    case INTRIN_TRAP:
-      emit_ud2(mc);
-      return;
-    case INTRIN_SADD_OVERFLOW:
-    case INTRIN_UADD_OVERFLOW:
-    case INTRIN_SSUB_OVERFLOW:
-    case INTRIN_USUB_OVERFLOW: {
-      /* dsts: [val, ovf]. Signed uses OF; unsigned uses CF. */
-      Operand a_op = args[0], b_op = args[1];
-      Operand dval = dsts[0], dovf = dsts[1];
-      int w = type_is_64(dval.type) ? 1 : 0;
-      u32 rd = dval.v.reg & 0xFu;
-      u32 ra = x64_force_reg_int(t, a_op, w, X64_RAX);
-      if (rd != ra) emit_mov_rr(mc, w, rd, ra);
-      u32 rb = x64_force_reg_int(t, b_op, w, X64_R11);
-      u8 op = (kind == INTRIN_SADD_OVERFLOW || kind == INTRIN_UADD_OVERFLOW)
-                  ? 0x01
-                  : 0x29;
-      u32 cc = (kind == INTRIN_UADD_OVERFLOW || kind == INTRIN_USUB_OVERFLOW)
-                   ? X64_CC_B
-                   : X64_CC_O;
-      emit_alu_rr(mc, w, op, rd, rb);
-      u32 dovf_r = dovf.v.reg & 0xFu;
-      emit_setcc(mc, cc, dovf_r);
-      emit_movzx_r32_r8(mc, dovf_r, dovf_r);
-      return;
-    }
-    case INTRIN_SMUL_OVERFLOW: {
-      /* dsts: [val, ovf]. IMUL r32, r/m32 (0F AF /r) is the signed
-       * two-operand form: low bits of product go to dst, OF set if
-       * the result didn't fit. */
-      Operand a_op = args[0], b_op = args[1];
-      Operand dval = dsts[0], dovf = dsts[1];
-      int w = type_is_64(dval.type) ? 1 : 0;
-      u32 rd = dval.v.reg & 0xFu;
-      u32 ra = x64_force_reg_int(t, a_op, w, X64_RAX);
-      if (rd != ra) emit_mov_rr(mc, w, rd, ra);
-      u32 rb = x64_force_reg_int(t, b_op, w, X64_R11);
-      emit_imul_rr(mc, w, rd, rb);
-      u32 dovf_r = dovf.v.reg & 0xFu;
-      emit_setcc(mc, X64_CC_O, dovf_r);
-      emit_movzx_r32_r8(mc, dovf_r, dovf_r);
-      return;
-    }
-    case INTRIN_UMUL_OVERFLOW: {
-      /* MUL writes the double-width product to RDX:RAX and sets CF/OF
-       * when the high half is non-zero. */
-      Operand a_op = args[0], b_op = args[1];
-      Operand dval = dsts[0], dovf = dsts[1];
-      int w = type_is_64(dval.type) ? 1 : 0;
-      u32 rd = dval.v.reg & 0xFu;
-      u32 rb = x64_force_reg_int(t, b_op, w, X64_R11);
-      if (rb == X64_RAX || rb == X64_RDX) {
-        emit_mov_rr(mc, w, X64_R11, rb);
-        rb = X64_R11;
-      }
-      u32 ra = x64_force_reg_int(t, a_op, w, X64_RAX);
-      if (ra != X64_RAX) emit_mov_rr(mc, w, X64_RAX, ra);
-      emit_f7_rm(mc, w, 4u, rb);
-      if (rd != X64_RAX) emit_mov_rr(mc, w, rd, X64_RAX);
-      u32 dovf_r = dovf.v.reg & 0xFu;
-      emit_setcc(mc, X64_CC_O, dovf_r);
-      emit_movzx_r32_r8(mc, dovf_r, dovf_r);
-      return;
-    }
-    default:
-      compiler_panic(t->c, a->loc, "x64 intrinsic: kind %d unsupported",
-                     (int)kind);
-  }
-}
-static void x_asm_block(CGTarget* t, const char* tmpl,
-                        const AsmConstraint* outs, u32 no, Operand* oo,
-                        const AsmConstraint* ins, u32 ni, const Operand* io,
-                        const Sym* clobs, u32 nc) {
-  XImpl* a_impl = impl_of(t);
-  u32 i;
-  X64Asm* a;
-  for (i = 0; i < nc; ++i) {
-    Reg phys;
-    RegClass cls;
-    if (!t->resolve_reg_name ||
-        t->resolve_reg_name(t, clobs[i], &phys, &cls) != 0)
-      continue;
-    if (cls == RC_INT) {
-      if (phys == X64_RBX || phys == X64_RBP || phys == X64_R12 ||
-          phys == X64_R13 || phys == X64_R14 || phys == X64_R15)
-        a_impl->used_cs_int_mask |= 1u << phys;
-    }
-  }
-  a = x64_asm_open(t->c);
-  x64_inline_bind(a, outs, no, oo, ins, ni, io, clobs, nc);
-  x64_asm_run_template(a, t->mc, tmpl);
-  x64_asm_close(a);
-}
-
-static void x_set_loc(CGTarget* t, SrcLoc l) {
-  ((XImpl*)t)->loc = l;
-  if (t->mc) t->mc->set_loc(t->mc, l);
-}
-
-static void x_finalize(CGTarget* t) { (void)t; }
-static void x_destroy(CGTarget* t) { (void)t; }
-
-static void cgt_cleanup(void* arg) { cgtarget_free((CGTarget*)arg); }
-
-CGTarget* x64_cgtarget_new(Compiler* c, ObjBuilder* o, MCEmitter* m) {
-  XImpl* x = arena_new(c->tu, XImpl);
-  memset(x, 0, sizeof *x);
-
-  CGTarget* t = &x->base;
-  t->c = c;
-  t->obj = o;
-  t->mc = m;
-
-  t->func_begin = x_func_begin;
-  t->func_begin_known_frame = x_func_begin_known_frame;
-  t->func_end = x_func_end;
-
-  t->frame_slot = x_frame_slot;
-  t->param = x_param;
-  t->resolve_reg_name = x_resolve_reg_name;
-  t->spill_reg = x_spill_reg;
-  t->reload_reg = x_reload_reg;
-
-  t->label_new = x_label_new;
-  t->label_place = x_label_place;
-  t->jump = x_jump;
-  t->cmp_branch = x_cmp_branch;
-  t->load_label_addr = x_load_label_addr;
-  t->indirect_branch = x_indirect_branch;
-
-  t->scope_begin = x_scope_begin;
-  t->scope_else = x_scope_else;
-  t->scope_end = x_scope_end;
-  t->break_to = x_break_to;
-  t->continue_to = x_continue_to;
-
-  t->load_imm = x_load_imm;
-  t->load_const = x_load_const;
-  t->copy = x_copy;
-  t->load = x_load;
-  t->store = x_store;
-  t->addr_of = x_addr_of;
-  t->tls_addr_of = x_tls_addr_of;
-  t->copy_bytes = x_copy_bytes;
-  t->set_bytes = x_set_bytes;
-  t->bitfield_load = x_bitfield_load;
-  t->bitfield_store = x_bitfield_store;
-
-  t->binop = x_binop;
-  t->unop = x_unop;
-  t->cmp = x_cmp;
-  t->convert = x_convert;
-
-  t->call = x_call;
-  t->load_call_arg = x_load_call_arg;
-  t->emit_call_plan = x_emit_call_plan;
-  t->store_call_arg = x_store_call_arg;
-  t->store_call_ret = x_store_call_ret;
-  t->call_stack_size = x_call_stack_size;
-  t->tail_call_unrealizable_reason = x_tail_call_unrealizable_reason;
-  t->ret = x_ret;
-
-  t->alloca_ = x_alloca_;
-  t->va_start_ = x_va_start_;
-  t->va_arg_ = x_va_arg_;
-  t->va_end_ = x_va_end_;
-  t->va_copy_ = x_va_copy_;
-
-  t->atomic_load = x_atomic_load;
-  t->atomic_store = x_atomic_store;
-  t->atomic_rmw = x_atomic_rmw;
-  t->atomic_cas = x_atomic_cas;
-  t->fence = x_fence;
-
-  t->intrinsic = x_intrinsic;
-  t->asm_block = x_asm_block;
-
-  t->set_loc = x_set_loc;
-  t->finalize = x_finalize;
-  t->destroy = x_destroy;
-
-#if CFREE_OPT_ENABLED
-  x_coord_vtable_init(t);
-#endif
-
-  compiler_defer(c, cgt_cleanup, t);
-  return t;
-}
diff --git a/src/arch/x64/opt_coord.c b/src/arch/x64/opt_coord.c
@@ -1,410 +0,0 @@
-/* x64/opt_coord.c — opt/backend register coordination hooks. */
-
-#include "arch/x64/internal.h"
-
-/* ============================================================
- * Register tables used by opt.
- *
- * Spill scratch registers are passed back to normal backend emit calls as
- * operands, so keep them out of the backend's internal scratch set (RAX/R11,
- * XMM15 in a few lowering paths) and out of opt's allocable pool. */
-
-static const Reg x_int_allocable[] = {X64_R13, X64_R14, X64_R15, X64_R10};
-static const Reg x_fp_allocable[] = {
-    X64_XMM6,      X64_XMM7,      X64_XMM8,      X64_XMM0 + 9,
-    X64_XMM0 + 10, X64_XMM0 + 11, X64_XMM0 + 12, X64_XMM0 + 13};
-
-static const Reg x_int_scratch[] = {X64_RBX, X64_R12};
-static const Reg x_fp_scratch[] = {X64_XMM0 + 14, X64_XMM15};
-
-static const CGPhysRegInfo x_int_phys[] = {
-    {X64_RDI, RC_INT, 0, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_ARG, 0,
-     0},
-    {X64_RSI, RC_INT, 1, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_ARG, 0,
-     0},
-    {X64_R8, RC_INT, 4, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_ARG, 0,
-     0},
-    {X64_R9, RC_INT, 5, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_ARG, 0,
-     0},
-    {X64_R10, RC_INT, 0xff,
-     CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_TEMP_PREFERRED, 0, 0},
-    {X64_R13, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
-    {X64_R14, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
-    {X64_R15, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
-};
-static const CGPhysRegInfo x_fp_phys[] = {
-    {X64_XMM0, RC_FP, 0,
-     CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_ARG | CG_REG_RET, 0, 0},
-    {X64_XMM0 + 1, RC_FP, 1,
-     CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_ARG | CG_REG_RET, 0, 0},
-    {X64_XMM0 + 2, RC_FP, 2,
-     CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_ARG, 0, 0},
-    {X64_XMM0 + 3, RC_FP, 3,
-     CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_ARG, 0, 0},
-    {X64_XMM0 + 4, RC_FP, 4,
-     CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_ARG, 0, 0},
-    {X64_XMM0 + 5, RC_FP, 5,
-     CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_ARG, 0, 0},
-    {X64_XMM6, RC_FP, 6, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_ARG, 0,
-     0},
-    {X64_XMM7, RC_FP, 7, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_ARG, 0,
-     0},
-    {X64_XMM8, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED, 0, 0},
-    {X64_XMM0 + 9, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED, 0, 0},
-    {X64_XMM0 + 10, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED, 0, 0},
-    {X64_XMM0 + 11, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED, 0, 0},
-    {X64_XMM0 + 12, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED, 0, 0},
-    {X64_XMM0 + 13, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED, 0, 0},
-};
-
-/* ============================================================
- * Vtable methods */
-
-static void x_get_allocable_regs(CGTarget* t, RegClass cls, const Reg** out,
-                                 u32* nregs) {
-  (void)t;
-  switch (cls) {
-    case RC_INT:
-      *out = x_int_allocable;
-      *nregs = sizeof x_int_allocable / sizeof x_int_allocable[0];
-      break;
-    case RC_FP:
-      *out = x_fp_allocable;
-      *nregs = sizeof x_fp_allocable / sizeof x_fp_allocable[0];
-      break;
-    default:
-      *out = NULL;
-      *nregs = 0;
-      break;
-  }
-}
-
-static void x_get_scratch_regs(CGTarget* t, RegClass cls, const Reg** out,
-                               u32* nregs) {
-  (void)t;
-  switch (cls) {
-    case RC_INT:
-      *out = x_int_scratch;
-      *nregs = sizeof x_int_scratch / sizeof x_int_scratch[0];
-      break;
-    case RC_FP:
-      *out = x_fp_scratch;
-      *nregs = sizeof x_fp_scratch / sizeof x_fp_scratch[0];
-      break;
-    default:
-      *out = NULL;
-      *nregs = 0;
-      break;
-  }
-}
-
-static void x_get_phys_regs(CGTarget* t, RegClass cls,
-                            const CGPhysRegInfo** out, u32* nregs) {
-  (void)t;
-  switch (cls) {
-    case RC_INT:
-      *out = x_int_phys;
-      *nregs = sizeof x_int_phys / sizeof x_int_phys[0];
-      break;
-    case RC_FP:
-      *out = x_fp_phys;
-      *nregs = sizeof x_fp_phys / sizeof x_fp_phys[0];
-      break;
-    default:
-      *out = NULL;
-      *nregs = 0;
-      break;
-  }
-}
-
-static int x_is_caller_saved(CGTarget* t, RegClass cls, Reg reg) {
-  const X64ABIRegs* abi = x64_abi_for_os(t->c->target.os);
-  switch (cls) {
-    case RC_INT:
-      /* Everything that isn't callee-saved (and isn't RSP/RBP) is
-       * caller-saved. Inverting the ABI's cs_int_mask handles both
-       * SysV and Win64 in one line. */
-      if (reg == X64_RSP || reg == X64_RBP) return 0;
-      return (abi->cs_int_mask & (1ull << reg)) == 0;
-    case RC_FP:
-      /* SysV: all XMMs caller-saved. Win64: XMM0..XMM5 caller-saved,
-       * XMM6..XMM15 callee-saved. */
-      if (reg < X64_XMM0 || reg > X64_XMM0 + 15) return 0;
-      return (abi->cs_fp_mask & (1ull << reg)) == 0;
-    default:
-      return 0;
-  }
-}
-
-static u32 x_call_clobber_mask(CGTarget* t, const CGCallDesc* d, RegClass cls) {
-  (void)d;
-  const X64ABIRegs* abi = x64_abi_for_os(t->c->target.os);
-  switch (cls) {
-    case RC_INT: {
-      /* All GPRs except callee-saved (and RSP/RBP) are clobbered by a
-       * call. */
-      u32 mask = 0;
-      for (u32 r = 0; r < 16u; ++r) {
-        if (r == X64_RSP || r == X64_RBP) continue;
-        if ((abi->cs_int_mask & (1ull << r)) == 0) mask |= (1u << r);
-      }
-      return mask;
-    }
-    case RC_FP: {
-      /* All XMMs except callee-saved are clobbered by a call. */
-      u32 mask = 0;
-      for (u32 r = 0; r < 16u; ++r) {
-        if ((abi->cs_fp_mask & (1ull << r)) == 0) mask |= (1u << r);
-      }
-      return mask;
-    }
-    default:
-      return 0;
-  }
-}
-
-static u32 x_callee_save_mask(CGTarget* t, RegClass cls) {
-  const X64ABIRegs* abi = x64_abi_for_os(t->c->target.os);
-  if (cls == RC_INT) {
-    /* RBP is saved by the prologue head, not exposed for general
-     * callee-save spill bookkeeping. */
-    return (u32)(abi->cs_int_mask & ~(1ull << X64_RBP));
-  }
-  if (cls == RC_FP) return (u32)abi->cs_fp_mask;
-  return 0;
-}
-
-static u32 x_return_reg_mask(CGTarget* t, const ABIFuncInfo* abi,
-                             RegClass cls) {
-  (void)t;
-  if (!abi || abi->ret.kind == ABI_ARG_IGNORE ||
-      abi->ret.kind == ABI_ARG_INDIRECT)
-    return 0;
-  u32 mask = 0, ni = 0, nf = 0;
-  static const u32 iregs[2] = {X64_RAX, X64_RDX};
-  for (u16 i = 0; i < abi->ret.nparts; ++i) {
-    const ABIArgPart* p = &abi->ret.parts[i];
-    if (cls == RC_INT && p->cls == ABI_CLASS_INT && ni < 2)
-      mask |= 1u << iregs[ni++];
-    else if (cls == RC_FP && p->cls == ABI_CLASS_FP && nf < 2)
-      mask |= 1u << (X64_XMM0 + nf++);
-  }
-  return mask;
-}
-
-static int x_is_void_ret_storage(Operand op) {
-  return op.kind == OPK_IMM && op.type == CG_BUILTIN_ID(CFREE_CG_BUILTIN_VOID);
-}
-
-static void x_plan_call(CGTarget* t, const CGCallDesc* d, CGCallPlan* out) {
-  memset(out, 0, sizeof *out);
-  out->callee = d->callee;
-  out->flags = d->flags;
-  out->stack_arg_size = t->call_stack_size ? t->call_stack_size(t, d) : 0;
-  out->has_sret = d->abi && d->abi->has_sret;
-  out->is_variadic = d->abi && d->abi->variadic;
-  for (u32 c = 0; c < CG_CALL_PLAN_REG_CLASSES; ++c) {
-    out->clobber_mask[c] = x_call_clobber_mask(t, d, (RegClass)c);
-    out->return_mask[c] = x_return_reg_mask(t, d->abi, (RegClass)c);
-  }
-  u32 cap = d->nargs * 2u + 2u;
-  out->args = arena_zarray(t->c->tu, CGCallPlanMove, cap ? cap : 1u);
-  out->rets = arena_zarray(t->c->tu, CGCallPlanRet, 4);
-  const X64ABIRegs* abi = x64_abi_for_os(t->c->target.os);
-  u32 next_int = d->abi && d->abi->has_sret ? 1u : 0u, next_fp = 0;
-  /* Win64 reserves a 32 B shadow space above the return address that
-   * the caller owns; the first stack-passed arg lands above it. SysV
-   * starts at offset 0. */
-  u32 stack = abi->shadow_space;
-  /* Ordinary sret call: pass the destination address. A tail call
-   * instead forwards the function's own incoming sret pointer (handled in
-   * x_emit_call_plan), and ret.storage is the void sentinel, so skip it. */
-  if (d->abi && d->abi->has_sret && (d->flags & CG_CALL_TAIL) == 0) {
-    CGCallPlanMove* m = &out->args[out->nargs++];
-    m->src = d->ret.storage;
-    m->src_kind = CG_CALL_PLAN_SRC_ADDR;
-    m->dst_kind = CG_CALL_PLAN_REG;
-    m->cls = RC_INT;
-    m->dst_reg = abi->int_args[0];
-    m->mem.type = d->ret.type;
-    m->mem.size = 8;
-    m->mem.align = 8;
-  }
-  /* On Win64, advance the FP slot counter in lockstep with the int
-   * slot counter (shared slot). */
-  if (abi->slot_shared_int_fp) next_fp = next_int;
-  for (u32 a = 0; a < d->nargs; ++a) {
-    const CGABIValue* av = &d->args[a];
-    const ABIArgInfo* ai = av->abi;
-    ABIArgInfo vai;
-    ABIArgPart vap;
-    if (!ai) {
-      memset(&vai, 0, sizeof vai);
-      memset(&vap, 0, sizeof vap);
-      vap.cls = av->storage.cls == RC_FP ? ABI_CLASS_FP : ABI_CLASS_INT;
-      vap.size = type_byte_size(av->type);
-      vai.kind = ABI_ARG_DIRECT;
-      vai.nparts = 1;
-      vai.parts = &vap;
-      ai = &vai;
-    }
-    if (ai->kind == ABI_ARG_IGNORE) continue;
-    if (ai->kind == ABI_ARG_INDIRECT) {
-      CGCallPlanMove* m = &out->args[out->nargs++];
-      m->src = av->storage;
-      m->src_kind = CG_CALL_PLAN_SRC_ADDR;
-      m->cls = RC_INT;
-      if (next_int < abi->n_int_args) {
-        m->dst_kind = CG_CALL_PLAN_REG;
-        m->dst_reg = abi->int_args[next_int++];
-      } else {
-        m->dst_kind = CG_CALL_PLAN_STACK;
-        m->stack_offset = stack;
-        stack += 8;
-      }
-      if (abi->slot_shared_int_fp) next_fp = next_int;
-      m->mem.type = av->type;
-      m->mem.size = 8;
-      m->mem.align = 8;
-      continue;
-    }
-    if (ai->kind == ABI_ARG_DIRECT &&
-        x64_abi_direct_to_stack(ai, next_int, next_fp)) {
-      for (u16 i = 0; i < ai->nparts; ++i) {
-        const ABIArgPart* p = &ai->parts[i];
-        CGCallPlanMove* m = &out->args[out->nargs++];
-        m->src = av->nparts ? av->parts[i].op : av->storage;
-        m->src_offset = av->nparts ? av->parts[i].src_offset : p->src_offset;
-        m->dst_kind = CG_CALL_PLAN_STACK;
-        m->stack_offset = stack;
-        m->mem.type = av->type;
-        m->mem.size = p->size;
-        m->mem.align = p->align ? p->align : p->size;
-        if (p->cls == ABI_CLASS_FP)
-          m->cls = RC_FP;
-        else
-          m->cls = RC_INT;
-        stack += 8;
-      }
-      continue;
-    }
-    for (u16 i = 0; i < ai->nparts; ++i) {
-      const ABIArgPart* p = &ai->parts[i];
-      CGCallPlanMove* m = &out->args[out->nargs++];
-      m->src = av->nparts ? av->parts[i].op : av->storage;
-      m->src_offset = av->nparts ? av->parts[i].src_offset : p->src_offset;
-      m->mem.type = av->type;
-      m->mem.size = p->size;
-      m->mem.align = p->align ? p->align : p->size;
-      if (p->cls == ABI_CLASS_FP) {
-        m->cls = RC_FP;
-        if (next_fp < abi->n_fp_args) {
-          u32 dst_x = next_fp;
-          m->dst_kind = CG_CALL_PLAN_REG;
-          m->dst_reg = X64_XMM0 + next_fp++;
-          if (abi->vararg_fp_dup_to_gpr && av->abi == NULL &&
-              dst_x < abi->n_int_args) {
-            CGCallPlanMove* dup = &out->args[out->nargs++];
-            *dup = *m;
-            dup->cls = RC_INT;
-            dup->dst_reg = abi->int_args[dst_x];
-          }
-        } else {
-          m->dst_kind = CG_CALL_PLAN_STACK;
-          m->stack_offset = stack;
-          stack += 8;
-        }
-        if (abi->slot_shared_int_fp) next_int = next_fp;
-      } else {
-        m->cls = RC_INT;
-        if (next_int < abi->n_int_args) {
-          m->dst_kind = CG_CALL_PLAN_REG;
-          m->dst_reg = abi->int_args[next_int++];
-        } else {
-          m->dst_kind = CG_CALL_PLAN_STACK;
-          m->stack_offset = stack;
-          stack += 8;
-        }
-        if (abi->slot_shared_int_fp) next_fp = next_int;
-      }
-    }
-  }
-  out->variadic_fp_count = (u8)next_fp;
-  if ((out->flags & CG_CALL_TAIL) == 0 && d->abi &&
-      d->abi->ret.kind != ABI_ARG_IGNORE &&
-      d->abi->ret.kind != ABI_ARG_INDIRECT &&
-      !x_is_void_ret_storage(d->ret.storage)) {
-    u32 ni = 0, nf = 0;
-    static const u32 rregs[2] = {X64_RAX, X64_RDX};
-    for (u16 i = 0; i < d->abi->ret.nparts; ++i) {
-      const ABIArgPart* p = &d->abi->ret.parts[i];
-      CGCallPlanRet* r = &out->rets[out->nrets++];
-      r->dst = d->ret.storage;
-      r->dst_offset = p->src_offset;
-      r->mem.type = d->ret.type;
-      r->mem.size = p->size;
-      r->mem.align = p->align ? p->align : p->size;
-      if (p->cls == ABI_CLASS_FP) {
-        r->cls = RC_FP;
-        r->src_reg = X64_XMM0 + nf++;
-      } else {
-        r->cls = RC_INT;
-        r->src_reg = rregs[ni++];
-      }
-    }
-  }
-}
-
-static void x_reserve_hard_regs(CGTarget* t, RegClass cls, const Reg* regs,
-                                u32 n) {
-  XImpl* a = impl_of(t);
-  for (u32 i = 0; i < n; ++i) {
-    Reg r = regs[i];
-    switch (cls) {
-      case RC_INT:
-        if (!x_is_caller_saved(t, cls, r) && r < 32u)
-          a->used_cs_int_mask |= 1u << r;
-        break;
-      case RC_FP:
-        if (!x_is_caller_saved(t, cls, r) && r < 32u)
-          a->used_cs_fp_mask |= 1u << r;
-        break;
-      default:
-        break;
-    }
-  }
-}
-
-static void x_plan_hard_regs(CGTarget* t, RegClass cls, const Reg* regs,
-                             u32 n) {
-  XImpl* a = impl_of(t);
-  a->has_planned_regs = 1;
-  for (u32 i = 0; i < n; ++i) {
-    Reg r = regs[i];
-    switch (cls) {
-      case RC_INT:
-        if (!x_is_caller_saved(t, cls, r) && r < 32u)
-          a->planned_cs_int_mask |= 1u << r;
-        break;
-      case RC_FP:
-        if (!x_is_caller_saved(t, cls, r) && r < 32u)
-          a->planned_cs_fp_mask |= 1u << r;
-        break;
-      default:
-        break;
-    }
-  }
-}
-
-void x_coord_vtable_init(CGTarget* t) {
-  t->get_allocable_regs = x_get_allocable_regs;
-  t->get_phys_regs = x_get_phys_regs;
-  t->get_scratch_regs = x_get_scratch_regs;
-  t->is_caller_saved = x_is_caller_saved;
-  t->call_clobber_mask = x_call_clobber_mask;
-  t->return_reg_mask = x_return_reg_mask;
-  t->callee_save_mask = x_callee_save_mask;
-  t->plan_call = x_plan_call;
-  t->plan_hard_regs = x_plan_hard_regs;
-  t->reserve_hard_regs = x_reserve_hard_regs;
-}
diff --git a/src/arch/x64/x64.h b/src/arch/x64/x64.h
@@ -2,7 +2,11 @@
 #define CFREE_ARCH_X64_H
 
 #include "arch/mc.h"
+#include "arch/native_target.h"
 
-CGTarget* x64_cgtarget_new(Compiler*, ObjBuilder*, MCEmitter*);
+typedef struct NativeOps NativeOps;
+
+NativeTarget* x64_native_target_new(Compiler*, ObjBuilder*, MCEmitter*);
+const NativeOps* x64_native_direct_ops(void);
 
 #endif
diff --git a/test/parse/cases/asm_01_grammar.x64.skip b/test/parse/cases/asm_01_grammar.x64.skip
@@ -0,0 +1 @@
+asm_01_grammar template uses aa64-specific mnemonics; x64 inline-asm coverage lives in test/arch/x64_inline_test.c
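
Review note on the removed opt_coord.c helpers: x_is_caller_saved and x_call_clobber_mask both derived the caller-saved set by inverting the ABI's callee-saved mask and excluding RSP/RBP. Below is a minimal standalone sketch of that idea. The register numbering follows the standard x86-64 encoding. The two masks are written out from the public SysV and Win64 calling conventions and are assumptions here; they are not copied from the repository's X64ABIRegs tables. The names GPR_*, SYSV_CS_INT, WIN64_CS_INT and gpr_clobber_mask are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Standard x86-64 GPR encoding (hypothetical local names). */
enum {
  GPR_RAX = 0, GPR_RCX, GPR_RDX, GPR_RBX, GPR_RSP, GPR_RBP, GPR_RSI, GPR_RDI,
  GPR_R8, GPR_R9, GPR_R10, GPR_R11, GPR_R12, GPR_R13, GPR_R14, GPR_R15
};

#define GPR_BIT(r) (1u << (r))

/* Assumed callee-saved integer sets, taken from the public ABI documents.
 * RSP is handled separately below, so it is left out of both masks. */
static const uint32_t SYSV_CS_INT =
    GPR_BIT(GPR_RBX) | GPR_BIT(GPR_RBP) | GPR_BIT(GPR_R12) |
    GPR_BIT(GPR_R13) | GPR_BIT(GPR_R14) | GPR_BIT(GPR_R15);

/* Win64 additionally preserves RSI and RDI across calls. */
static const uint32_t WIN64_CS_INT =
    GPR_BIT(GPR_RBX) | GPR_BIT(GPR_RBP) | GPR_BIT(GPR_R12) |
    GPR_BIT(GPR_R13) | GPR_BIT(GPR_R14) | GPR_BIT(GPR_R15) |
    GPR_BIT(GPR_RSI) | GPR_BIT(GPR_RDI);

/* Every GPR that is not callee-saved, and is not the stack or frame
 * pointer, is treated as clobbered by a call. */
static uint32_t gpr_clobber_mask(uint32_t cs_int_mask) {
  assert((cs_int_mask >> 16) == 0); /* only 16 GPRs exist */
  uint32_t mask = 0;
  for (uint32_t r = 0; r < 16u; ++r) {
    if (r == GPR_RSP || r == GPR_RBP) continue;
    if ((cs_int_mask & GPR_BIT(r)) == 0) mask |= GPR_BIT(r);
  }
  return mask;
}
```

Under these assumed masks, the SysV result covers RAX, RCX, RDX, RSI, RDI and R8..R11, while the Win64 result drops RSI and RDI. One loop serves both ABIs because only the input mask differs.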

	git clone https://git.ryansepassi.com/git/kit.git

A	doc/INTERFACES.md	\|	309	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A	doc/NATIVE_PORT_X64.md	\|	4342	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M	include/cfree/config.h	\|	2	+-
D	src/arch/x64/alloc.c	\|	598	-------------------------------------------------------------------------------
M	src/arch/x64/arch.c	\|	34	++++++++++++++++++++++++++++------
M	src/arch/x64/asm.c	\|	23	++++++++++++++++++-----
M	src/arch/x64/asm.h	\|	13	+++++++++++++
M	src/arch/x64/emit.c	\|	506	+------------------------------------------------------------------------------
A	src/arch/x64/emit.h	\|	147	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
D	src/arch/x64/internal.h	\|	314	-------------------------------------------------------------------------------
A	src/arch/x64/native.c	\|	3751	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
D	src/arch/x64/ops.c	\|	2939	-------------------------------------------------------------------------------
D	src/arch/x64/opt_coord.c	\|	410	-------------------------------------------------------------------------------
M	src/arch/x64/x64.h	\|	6	+++++-
A	test/parse/cases/asm_01_grammar.x64.skip	\|	1	+