Implement O1 call clobber register metadata - kit

commit 98d5b64ce59ab7cddb5ffc28aeece8ca689f4ca1
parent d69b41a89baaf03c6cb9a781705b401cfcb41c8e
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Fri, 15 May 2026 17:35:37 -0700

Implement O1 call clobber register metadata

Diffstat:
M doc/OPT1.md  | 65 +++++++++++++++++++++++++++++++++++++++++++++--------------------
M doc/OPT_REGS_CALL_PLAN.md  | 106 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------
M src/arch/aa64/opt_coord.c  | 188 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M src/arch/arch.h  | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M src/arch/rv64/opt_coord.c  | 178 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M src/arch/x64/opt_coord.c  | 182 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M src/opt/ir.h  | 9 +++++++++
M src/opt/opt.c  | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
M src/opt/pass_lower.c  | 122 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------
M test/opt/opt_test.c  | 137 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

10 files changed, 1055 insertions(+), 53 deletions(-)
diff --git a/doc/OPT1.md b/doc/OPT1.md
@@ -104,9 +104,13 @@ Each value is assigned either:
 - a target-provided hard register in the value's register class; or
 - a compatible `FS_SPILL` frame slot.
 
-Non-tied values known to be live across calls avoid caller-saved registers.
-Tied hard-register values are checked for availability, clobber constraints,
-and live-range conflicts.
+Non-tied values known to be live across calls strongly prefer callee-saved
+registers, but may use caller-saved registers when no better non-conflicting
+location is available. Correctness does not depend on avoiding caller-saved
+registers globally: rewrite preserves hard-assigned values only at calls whose
+target-provided clobber mask actually destroys that register. Tied
+hard-register values are checked for availability, clobber constraints, and
+live-range conflicts.
 
 Occupancy is stored as sorted per-location interval sets:
 
@@ -129,8 +133,9 @@ liveness:
 - hard-assigned pseudos become physical registers;
 - spilled uses receive reloads through target scratch registers;
 - spilled defs receive stores;
-- call save/restore insertion checks only hard-assigned caller-saved values
-  known to be live across some call;
+- call save/restore insertion checks hard-assigned values known to be live
+  across some call, then intersects each value's assigned register with that
+  call's target-provided clobber mask;
 - inline-asm register constraints are applied only for functions containing
   `IR_ASM_BLOCK`.
 
@@ -140,9 +145,17 @@ Rewrite does not materialize per-instruction full live-after sets.
 
 O1 relies on each target backend to provide:
 
-- allocable hard-register pools per register class;
+- physical-register metadata per register class, including allocable,
+  caller-saved, callee-saved, argument, return, reserved, and preference/cost
+  flags;
+- legacy allocable hard-register pools per register class for direct-CG
+  compatibility and as a fallback while older backends migrate;
 - scratch-register pools disjoint from allocable pools;
-- caller-saved register classification;
+- caller-saved register classification and per-call clobber masks;
+- return-register masks and callee-save masks;
+- call-plan construction that describes ABI register destinations, outgoing
+  stack arguments, return registers, call clobbers, return masks, outgoing
+  stack size, and variadic metadata before liveness and allocation;
 - parameter storage binding through `CGTarget.param`, which may return either a
   frame slot or a register-backed local storage for simple direct ABI params;
 - optional hard-register reservation before backend `func_end`;
@@ -157,6 +170,9 @@ The current target pools are:
 Backends still own final prologue/epilogue emission and callee-saved register
 preservation. O1 calls `reserve_hard_regs` with the hard registers still visible
 in replay after cleanup so backend save/restore decisions match the emitted IR.
+Direct CG still uses the legacy `call` hook. O1 records call plans for liveness,
+allocation, and preservation, but backend call emission still performs the final
+argument and return moves through the existing sequential call emitters.
 
 Targets may also provide a known-frame entry path for O1. When
 `func_begin_known_frame` and `call_stack_size` are both available, O1 computes
@@ -264,6 +280,13 @@ b   ...
   local combine pass also retargets safe single-use arithmetic producers to a
   following physical-copy destination, including commutative operand swaps for
   x64-style two-operand overlap cases.
+- Targets now expose descriptive physical-register metadata and per-call
+  clobber/return/callee-save masks to O1. `machinize` attaches a target call
+  plan to each `IR_CALL` before liveness and allocation, and rewrite plus
+  hard-register liveness use that call-specific clobber mask for preservation.
+  This closes the register-preservation correctness issue: hard-assigned values
+  live across calls are saved/restored only when the planned call actually
+  clobbers their assigned physical register.
 - Simple scalar parameters no longer need late mem2reg promotion. `CGTarget.param`
   owns ABI entry binding and O1 can keep non-memory-required direct params as
   virtual-register-backed locals, replaying the final hard register or spill slot
@@ -273,10 +296,13 @@ b   ...
 
 Remaining O1 shape issues visible in the current dumps:
 
-- O1 still saves/restores more callee-saved registers than ideal in some small
-  functions under register pressure or values live across calls. The old
-  unconditional scratch-register saves have been removed, but wider
-  caller-saved allocation needs separate call-argument safety work.
+- The register-preservation correctness issue is closed, but O1 still
+  saves/restores more callee-saved registers than ideal in some small functions
+  under register pressure or values live across calls. The old unconditional
+  scratch-register saves have been removed, and preservation is now
+  call-specific. Remaining excess traffic is code-shape work: conservative
+  register exposure, missing call-argument parallel copies in emitted code, and
+  limited coalescing/argument-copy cleanup.
 - Direct-call tiny functions are still heavy at O1. The x64 `callee(x) + 2`
   probe emitted 167 bytes and 47 instructions across two small functions,
   mostly frame setup, callee-save traffic, copies, and branch-to-epilogue
@@ -288,12 +314,11 @@ Remaining O1 shape issues visible in the current dumps:
 MIR's O1 path suggests these high-value local cleanups that still fit cfree's
 fast tier:
 
-1. Continue reducing callee-save traffic.
-   O1 now reserves/preserves only replay-visible hard registers after final
-   cleanup. Remaining work is mostly coalescing/argument-copy quality,
-   pressure-sensitive choices, and safely broadening caller-saved allocation for
-   values that are not live across calls. The larger structural plan is in
-   `doc/OPT_REGS_CALL_PLAN.md`: targets expose richer physical-register
-   metadata, calls become opt-visible fixed-register/clobber constraints, and
-   call argument setup uses parallel copies so all three backends can safely
-   expose nearly all allocatable registers.
+1. Continue reducing callee-save traffic as a code-shape issue.
+   The preservation correctness piece is done: O1 reserves/preserves only
+   replay-visible hard registers after final cleanup and uses per-call clobber
+   masks for live-across-call saves/restores. Remaining work is mostly
+   coalescing/argument-copy quality, pressure-sensitive choices, switching O1
+   call emission to planned/parallel argument setup, and then safely broadening
+   ABI argument/return registers in the allocable pools. The remaining
+   structural work is tracked in `doc/OPT_REGS_CALL_PLAN.md`.
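Reviewer note on the preservation rule described in the `doc/OPT1.md` hunks above: the sketch below is one way to read "intersect the assigned register with the call's clobber mask". It is not code from this commit. `AssignedValue` and `needs_preserve` are hypothetical names, and the register file is assumed to fit in a 32-bit mask, as the `u32` masks in this patch suggest.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of one hard-assigned value; not the real O1 struct. */
typedef struct AssignedValue {
  uint32_t reg;         /* assigned physical register, assumed 0..31 */
  int live_across_call; /* nonzero if the value is live across this call */
} AssignedValue;

/* A value needs a save/restore pair around a call only when it is live
 * across that call AND the call's clobber mask has its register's bit set. */
static int needs_preserve(const AssignedValue* v, uint32_t call_clobber_mask) {
  assert(v->reg < 32u); /* the mask has one bit per register */
  if (!v->live_across_call) return 0;
  return (int)((call_clobber_mask >> v->reg) & 1u);
}
```

With the AArch64 integer mask from this patch, `((1u << 19) - 1u) | (1u << 30)`, a live value in x9 would be preserved while a live value in x19 would not, since x19 is outside the clobber set.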
diff --git a/doc/OPT_REGS_CALL_PLAN.md b/doc/OPT_REGS_CALL_PLAN.md
@@ -12,6 +12,44 @@ The goal is not to move to a full target machine IR immediately. The goal is to
 make the current O1 path honest about target constraints while keeping replay and
 backend emission intact enough to migrate one architecture at a time.
 
+## Current Status
+
+The correctness foundation for register preservation is implemented. Targets now
+expose descriptive physical-register metadata, per-call clobber masks,
+return-register masks, callee-save masks, and call plans. O1 records each call
+plan during `machinize`, builds its current hard-register tables from
+`CGPhysRegInfo`, uses target save/use costs in allocation scoring, and preserves
+hard-assigned live-across-call values by intersecting the assigned register with
+the planned call's clobber mask. Post-RA hard-register liveness uses the same
+call-specific clobber mask.
+
+What this closes:
+
+- the register-preservation correctness issue for values live across calls;
+- target-provided physical-register metadata as the source for O1 register
+  tables;
+- call-plan construction for scalar integer/FP/direct/indirect/byval/sret-shaped
+  calls in the current descriptor model;
+- conservative allocation scoring that can choose caller-saved registers when
+  rewrite can preserve them, while still preferring callee-saved registers for
+  call-crossing values.
+
+What remains open:
+
+- calls are not yet lowered into explicit opt-visible setup/call/return-copy IR;
+- call argument and return moves are not yet resolved by an opt parallel-copy
+  resolver;
+- backend emission still uses the legacy sequential `call` hook rather than an
+  `emit_call_plan` path with pre-materialized arguments and returns;
+- target `get_phys_regs` tables still expose mostly the old conservative pools,
+  so ABI argument/return registers are not generally allocable yet;
+- direct CG still uses legacy allocation/call hooks;
+- broader call-plan layout tests, parallel-copy hazard tests, and code-shape
+  probes remain to be added.
+
+In phase terms: Phase 1 is done, Phase 2 is mostly done, Phase 3 is partially
+done for clobber/preservation visibility, and Phases 4-6 remain open.
+
 ## Current Problem
 
 The current `CGTarget` contract exposes:
@@ -286,6 +324,10 @@ thin wrapper builds and emits a call plan internally.
 
 ### Phase 1 - Register Description Without Behavior Change
 
+Status: done. `CGPhysRegInfo` and `get_phys_regs` exist, x64/AArch64/RV64
+provide current-pool metadata, and `opt_machinize` consumes it with legacy
+fallbacks. Focused opt tests cover metadata consumption.
+
 - Add `CGPhysRegInfo` and `get_phys_regs`.
 - Implement it for x64, AArch64, and RV64 using the current exposed pools first.
 - Build opt's current hard-reg tables from the richer description.
@@ -297,6 +339,11 @@ Expected result: no codegen behavior change.
 
 ### Phase 2 - Call Plan Construction
 
+Status: mostly done. `CGCallPlan`, `plan_call`, call clobber masks, return
+masks, and callee-save masks exist for the three native backends. O1 attaches
+plans during `machinize`. Remaining work is fuller layout/dump coverage for
+mixed, variadic, stack-arg, and aggregate cases.
+
 - Add `CGCallPlan` and `plan_call`.
 - Implement call planning for simple direct scalar integer and FP args/returns on
   all three architectures.
@@ -309,6 +356,11 @@ yet.
 
 ### Phase 3 - Opt IR Call Constraints
 
+Status: partial. Calls carry plan aux data before liveness/allocation, and
+rewrite plus hard-register liveness use the planned clobber masks. Calls are not
+yet lowered into explicit setup/call/return-copy IR, and implicit arg/return
+register uses/defs are not yet modeled as first-class constrained operations.
+
 - Lower `IR_CALL` into opt-visible call setup, constrained call, and return-copy
   representation during `machinize`.
 - Teach liveness/range building that call ops have implicit register uses,
@@ -321,6 +373,9 @@ allocator starts using those registers widely.
 
 ### Phase 4 - Parallel Copy Resolver
 
+Status: open. Argument/return setup still goes through backend sequential call
+emitters, so ABI argument and return registers cannot be broadly exposed yet.
+
 - Implement local parallel move resolution for call setup and return extraction.
 - Support register-register cycles, register-stack moves, local/indirect loads,
   immediates, and stack arguments.
@@ -336,6 +391,11 @@ Expected result: ABI arg and return registers can be made allocable safely.
 
 ### Phase 5 - Broaden Register Exposure
 
+Status: open except for allocator scoring. O1 now has target-informed scoring
+and per-call preservation, but target phys-reg tables still mostly expose the old
+conservative pools. Broadening ABI arg/return and additional caller-saved regs
+depends on planned/parallel call emission.
+
 - Expand target `get_phys_regs` tables to include nearly all allocable physical
   registers.
 - Update opt scoring to prefer caller-saved regs for non-call-crossing values and
@@ -350,6 +410,9 @@ call correctness.
 
 ### Phase 6 - Remove Legacy Pool Semantics
 
+Status: open. Legacy `get_allocable_regs`, `get_scratch_regs`,
+`is_caller_saved`, and `call` remain active for direct CG and fallback replay.
+
 - Convert direct CG to either use `CGPhysRegInfo` or build call plans internally.
 - Remove allocation-policy dependence on `get_allocable_regs`.
 - Restrict `get_scratch_regs` to legacy direct-CG fallback, then remove it once
@@ -364,12 +427,15 @@ O2 allocation.
 
 Focused unit tests:
 
-- target register metadata per architecture;
-- call-plan layout for scalar, FP, mixed, sret, variadic, and stack-arg calls;
-- call clobber masks;
-- parallel-copy cycles and memory routing;
-- caller-saved live-across-call preservation using per-call masks;
-- callee-save reservation after broadened allocation.
+- done: opt-side target register metadata consumption;
+- done: caller-saved live-across-call preservation using per-call masks;
+- still needed: target register metadata tests per real architecture;
+- still needed: call-plan layout for scalar, FP, mixed, sret, variadic, and
+  stack-arg calls;
+- still needed: direct call-clobber mask tests per real architecture;
+- still needed: parallel-copy cycles and memory routing;
+- still needed: callee-save reservation/code-shape tests after broadened
+  allocation.
 
 Code-shape probes:
 
@@ -410,15 +476,27 @@ make test-smoke-rv64
 
 ## Recommended First Patch Stack
 
+Completed:
+
 1. Add `CGPhysRegInfo` plus current-pool metadata for all three targets.
-2. Teach `opt_machinize` to consume the new metadata while preserving identical
-   allocation order.
-3. Add `CGCallPlan` and plan simple scalar calls without using it for emission.
-4. Add call-plan dump/unit tests.
-5. Lower one simple call shape through opt-visible setup/call/return constraints
-   behind a feature flag or narrow capability check.
-6. Implement parallel copies for that shape.
-7. Broaden AArch64 and RV64 caller-saved temp exposure first, then x64.
+2. Teach `opt_machinize` to consume the new metadata.
+3. Add `CGCallPlan` and plan calls without using it for emission.
+4. Use call-plan clobber masks for rewrite and post-RA hard-register liveness.
+
+Next patch stack:
+
+1. Add call-plan layout/dump tests for real x64/AArch64/RV64 scalar, FP, mixed,
+   sret, variadic, and stack-arg cases.
+2. Lower one simple call shape through explicit opt-visible setup/call/return
+   constraints behind a narrow capability check.
+3. Implement local parallel copies for that shape.
+4. Add red-green hazard tests for argument permutations, indirect callees in
+   argument registers, return-register collisions, and stack-argument sources.
+5. Add `emit_call_plan` for one backend and switch O1 replay for supported plans.
+6. Broaden register exposure incrementally, keeping helper scratch registers
+   reserved until their clobbers are explicit.
+7. Migrate direct CG or wrap it with internal call planning, then remove legacy
+   pool semantics.
 
 This order keeps each step testable and avoids mixing API migration, allocation
 policy, and call move correctness in one change.
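Reviewer note on the Phase 4 item in the plan above: the parallel-copy resolver is still open in this commit, so the following is only a minimal sketch of the general technique, not the planned implementation. It models registers as a toy array, assumes all destinations are distinct, and assumes one reserved scratch register (`SCRATCH`) that no move names. All identifiers here are hypothetical.

```c
#include <assert.h>

enum { NREGS = 8, SCRATCH = 7 }; /* toy register file; r7 is the scratch */

typedef struct Move {
  int dst;  /* destination register */
  int src;  /* source register */
  int done; /* set once the copy has been emitted */
} Move;

/* Nonzero if a not-yet-emitted move still reads `reg`. */
static int read_by_pending(const Move* m, int n, int reg) {
  for (int i = 0; i < n; ++i)
    if (!m[i].done && m[i].src == reg) return 1;
  return 0;
}

/* Performs all moves as if they happened simultaneously, "emitting" each
 * copy by applying it to `regs`. Destinations must be distinct and no move
 * may name SCRATCH. */
static void resolve_parallel_moves(Move* m, int n, long* regs) {
  int pending = 0;
  for (int i = 0; i < n; ++i) {
    assert(m[i].dst != SCRATCH && m[i].src != SCRATCH);
    m[i].done = (m[i].dst == m[i].src); /* self-moves need no code */
    if (!m[i].done) ++pending;
  }
  while (pending > 0) {
    int progressed = 0;
    for (int i = 0; i < n; ++i) {
      if (m[i].done || read_by_pending(m, n, m[i].dst)) continue;
      regs[m[i].dst] = regs[m[i].src]; /* safe: nobody needs the old dst */
      m[i].done = 1;
      --pending;
      progressed = 1;
    }
    if (progressed) continue;
    /* Only cycles remain. Park one source in SCRATCH and redirect its
     * readers, which unblocks the move that overwrites that source. */
    for (int i = 0; i < n; ++i) {
      if (m[i].done) continue;
      int parked = m[i].src;
      regs[SCRATCH] = regs[parked];
      for (int j = 0; j < n; ++j)
        if (!m[j].done && m[j].src == parked) m[j].src = SCRATCH;
      break;
    }
  }
}
```

A sequential emitter would get an argument permutation such as `r0 <- r1, r1 <- r0` wrong, because the first copy destroys the second copy's source. The sketch handles that case by parking one value in the scratch register first.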
diff --git a/src/arch/aa64/opt_coord.c b/src/arch/aa64/opt_coord.c
@@ -15,6 +15,37 @@ static const Reg aa_fp_allocable[]  = {8,  9,  10, 11, 12, 13, 14, 15,
 static const Reg aa_int_scratch[] = {16, 17};
 static const Reg aa_fp_scratch[]  = {24, 25};
 
+static const CGPhysRegInfo aa_int_phys[] = {
+    {19, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {20, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {21, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {22, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {23, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {24, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {25, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {26, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {27, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {28, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+};
+static const CGPhysRegInfo aa_fp_phys[] = {
+    {8, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {9, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {10, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {11, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {12, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {13, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {14, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {15, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {16, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED, 0, 0},
+    {17, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED, 0, 0},
+    {18, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED, 0, 0},
+    {19, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED, 0, 0},
+    {20, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED, 0, 0},
+    {21, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED, 0, 0},
+    {22, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED, 0, 0},
+    {23, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED, 0, 0},
+};
+
 /* ============================================================
  * Vtable methods */
 
@@ -56,6 +87,25 @@ static void aa_get_scratch_regs(CGTarget* t, RegClass cls,
   }
 }
 
+static void aa_get_phys_regs(CGTarget* t, RegClass cls,
+                             const CGPhysRegInfo** out, u32* nregs) {
+  (void)t;
+  switch (cls) {
+    case RC_INT:
+      *out = aa_int_phys;
+      *nregs = sizeof aa_int_phys / sizeof aa_int_phys[0];
+      break;
+    case RC_FP:
+      *out = aa_fp_phys;
+      *nregs = sizeof aa_fp_phys / sizeof aa_fp_phys[0];
+      break;
+    default:
+      *out = NULL;
+      *nregs = 0;
+      break;
+  }
+}
+
 static int aa_is_caller_saved(CGTarget* t, RegClass cls, Reg reg) {
   (void)t;
   switch (cls) {
@@ -70,6 +120,139 @@ static int aa_is_caller_saved(CGTarget* t, RegClass cls, Reg reg) {
   }
 }
 
+static u32 aa_call_clobber_mask(CGTarget* t, const CGCallDesc* d,
+                                RegClass cls) {
+  (void)t;
+  (void)d;
+  if (cls == RC_INT) return ((1u << 19) - 1u) | (1u << 30);
+  if (cls == RC_FP) return 0xFFFF00FFu;
+  return 0;
+}
+
+static u32 aa_callee_save_mask(CGTarget* t, RegClass cls) {
+  (void)t;
+  if (cls == RC_INT) {
+    u32 mask = 0;
+    for (u32 r = 19; r <= 28; ++r) mask |= 1u << r;
+    return mask;
+  }
+  if (cls == RC_FP) return 0x0000FF00u;
+  return 0;
+}
+
+static u32 aa_return_reg_mask(CGTarget* t, const ABIFuncInfo* abi,
+                              RegClass cls) {
+  (void)t;
+  if (!abi || abi->ret.kind == ABI_ARG_IGNORE ||
+      abi->ret.kind == ABI_ARG_INDIRECT)
+    return 0;
+  u32 mask = 0, ni = 0, nf = 0;
+  for (u16 i = 0; i < abi->ret.nparts; ++i) {
+    const ABIArgPart* p = &abi->ret.parts[i];
+    if (cls == RC_INT && p->cls == ABI_CLASS_INT) mask |= 1u << ni++;
+    else if (cls == RC_FP && p->cls == ABI_CLASS_FP) mask |= 1u << nf++;
+  }
+  return mask;
+}
+
+static void aa_plan_call(CGTarget* t, const CGCallDesc* d, CGCallPlan* out) {
+  memset(out, 0, sizeof *out);
+  out->callee = d->callee;
+  out->stack_arg_size = t->call_stack_size ? t->call_stack_size(t, d) : 0;
+  out->has_sret = d->abi && d->abi->has_sret;
+  for (u32 c = 0; c < CG_CALL_PLAN_REG_CLASSES; ++c) {
+    out->clobber_mask[c] = aa_call_clobber_mask(t, d, (RegClass)c);
+    out->return_mask[c] = aa_return_reg_mask(t, d->abi, (RegClass)c);
+  }
+  u32 cap = d->nargs * 2u + 2u;
+  out->args = arena_zarray(t->c->tu, CGCallPlanMove, cap ? cap : 1u);
+  out->rets = arena_zarray(t->c->tu, CGCallPlanRet, 4);
+  u32 next_int = 0, next_fp = 0, stack = 0;
+  for (u32 a = 0; a < d->nargs; ++a) {
+    const CGABIValue* av = &d->args[a];
+    const ABIArgInfo* ai = av->abi;
+    ABIArgInfo vai;
+    ABIArgPart vap;
+    if (!ai) {
+      memset(&vai, 0, sizeof vai);
+      memset(&vap, 0, sizeof vap);
+      vap.cls = av->storage.cls == RC_FP ? ABI_CLASS_FP : ABI_CLASS_INT;
+      vap.size = type_byte_size(av->type);
+      vai.kind = ABI_ARG_DIRECT;
+      vai.nparts = 1;
+      vai.parts = &vap;
+      ai = &vai;
+      if (d->abi && d->abi->vararg_on_stack) next_int = next_fp = 8;
+    }
+    if (ai->kind == ABI_ARG_IGNORE) continue;
+    if (ai->kind == ABI_ARG_INDIRECT) {
+      CGCallPlanMove* m = &out->args[out->nargs++];
+      m->src = av->storage;
+      m->cls = RC_INT;
+      if (next_int < 8) {
+        m->dst_kind = CG_CALL_PLAN_REG;
+        m->dst_reg = next_int++;
+      } else {
+        m->dst_kind = CG_CALL_PLAN_STACK;
+        m->stack_offset = stack;
+        stack += 8;
+      }
+      m->mem.type = av->type;
+      m->mem.size = 8;
+      m->mem.align = 8;
+      continue;
+    }
+    for (u16 i = 0; i < ai->nparts; ++i) {
+      const ABIArgPart* p = &ai->parts[i];
+      CGCallPlanMove* m = &out->args[out->nargs++];
+      m->src = av->nparts ? av->parts[i].op : av->storage;
+      m->mem.type = av->type;
+      m->mem.size = p->size;
+      m->mem.align = p->align ? p->align : p->size;
+      if (p->cls == ABI_CLASS_FP) {
+        m->cls = RC_FP;
+        if (next_fp < 8) {
+          m->dst_kind = CG_CALL_PLAN_REG;
+          m->dst_reg = next_fp++;
+        } else {
+          m->dst_kind = CG_CALL_PLAN_STACK;
+          m->stack_offset = stack;
+          stack += 8;
+        }
+      } else {
+        m->cls = RC_INT;
+        if (next_int < 8) {
+          m->dst_kind = CG_CALL_PLAN_REG;
+          m->dst_reg = next_int++;
+        } else {
+          m->dst_kind = CG_CALL_PLAN_STACK;
+          m->stack_offset = stack;
+          stack += 8;
+        }
+      }
+    }
+  }
+  if (d->abi && d->abi->ret.kind != ABI_ARG_IGNORE &&
+      d->abi->ret.kind != ABI_ARG_INDIRECT) {
+    u32 ni = 0, nf = 0;
+    for (u16 i = 0; i < d->abi->ret.nparts; ++i) {
+      const ABIArgPart* p = &d->abi->ret.parts[i];
+      CGCallPlanRet* r = &out->rets[out->nrets++];
+      r->dst = d->ret.storage;
+      r->mem.type = d->ret.type;
+      r->mem.size = p->size;
+      r->mem.align = p->align ? p->align : p->size;
+      if (p->cls == ABI_CLASS_FP) {
+        r->cls = RC_FP;
+        r->src_reg = nf++;
+      } else {
+        r->cls = RC_INT;
+        r->src_reg = ni++;
+      }
+    }
+  }
+}
+
 static void aa_reserve_hard_regs(CGTarget* t, RegClass cls,
                                  const Reg* regs, u32 n) {
   AAImpl* a = impl_of(t);
@@ -109,8 +292,13 @@ static void aa_plan_hard_regs(CGTarget* t, RegClass cls, const Reg* regs,
 
 void aa_coord_vtable_init(CGTarget* t) {
   t->get_allocable_regs = aa_get_allocable_regs;
+  t->get_phys_regs      = aa_get_phys_regs;
   t->get_scratch_regs   = aa_get_scratch_regs;
   t->is_caller_saved    = aa_is_caller_saved;
+  t->call_clobber_mask  = aa_call_clobber_mask;
+  t->return_reg_mask    = aa_return_reg_mask;
+  t->callee_save_mask   = aa_callee_save_mask;
+  t->plan_call          = aa_plan_call;
   t->plan_hard_regs     = aa_plan_hard_regs;
   t->reserve_hard_regs  = aa_reserve_hard_regs;
 }
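Reviewer note on the argument loop in `aa_plan_call` above: stripped of the ABI descriptor types, the placement rule appears to be "next free register of the part's class while fewer than eight are used, otherwise the next 8-byte outgoing stack slot". The standalone sketch below restates that rule with hypothetical types (`ArgClass`, `ArgLoc`, `assign_arg_locs`). It ignores indirect, ignored, and variadic arguments, which the real planner handles separately.

```c
#include <assert.h>
#include <stdint.h>

enum { MAX_REG_ARGS = 8 }; /* mirrors the `< 8` limit in the planner */

typedef enum ArgClass { ARG_INT, ARG_FP } ArgClass;  /* hypothetical */
typedef enum LocKind { LOC_REG, LOC_STACK } LocKind; /* hypothetical */

typedef struct ArgLoc {
  LocKind kind;
  uint32_t reg;          /* meaningful when kind == LOC_REG */
  uint32_t stack_offset; /* meaningful when kind == LOC_STACK */
} ArgLoc;

/* Gives each argument part the next free register of its class, falling
 * back to consecutive 8-byte stack slots once that class is exhausted.
 * Returns the number of outgoing stack bytes used. */
static uint32_t assign_arg_locs(const ArgClass* cls, uint32_t n, ArgLoc* out) {
  uint32_t next_int = 0, next_fp = 0, stack = 0;
  assert(n == 0 || (cls && out));
  for (uint32_t i = 0; i < n; ++i) {
    uint32_t* next = (cls[i] == ARG_FP) ? &next_fp : &next_int;
    if (*next < MAX_REG_ARGS) {
      out[i].kind = LOC_REG;
      out[i].reg = (*next)++;
      out[i].stack_offset = 0;
    } else {
      out[i].kind = LOC_STACK;
      out[i].reg = 0;
      out[i].stack_offset = stack;
      stack += 8;
    }
  }
  return stack;
}
```

For ten integer parts followed by one FP part, the first eight integers land in registers 0 to 7, the ninth and tenth take stack offsets 0 and 8, and the FP part still gets FP register 0 because the two classes count independently.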
diff --git a/src/arch/arch.h b/src/arch/arch.h
@@ -397,6 +397,64 @@ typedef struct CGCallDesc {
   CGABIValue ret;
 } CGCallDesc;
 
+typedef enum CGPhysRegFlag {
+  CG_REG_ALLOCABLE = 1u << 0,
+  CG_REG_CALLER_SAVED = 1u << 1,
+  CG_REG_CALLEE_SAVED = 1u << 2,
+  CG_REG_ARG = 1u << 3,
+  CG_REG_RET = 1u << 4,
+  CG_REG_TEMP_PREFERRED = 1u << 5,
+  CG_REG_PLATFORM = 1u << 6,
+  CG_REG_RESERVED = 1u << 7,
+} CGPhysRegFlag;
+
+typedef struct CGPhysRegInfo {
+  Reg reg;
+  u8 cls;       /* RegClass */
+  u8 abi_index; /* arg/ret order when applicable, otherwise 0xff */
+  u16 flags;    /* CGPhysRegFlag */
+  u16 save_cost;
+  u16 use_cost;
+} CGPhysRegInfo;
+
+typedef enum CGCallPlanLocKind {
+  CG_CALL_PLAN_REG,
+  CG_CALL_PLAN_STACK,
+  CG_CALL_PLAN_IGNORE,
+} CGCallPlanLocKind;
+
+typedef struct CGCallPlanMove {
+  Operand src;
+  u8 dst_kind; /* CGCallPlanLocKind */
+  u8 cls;      /* RegClass for register destinations */
+  Reg dst_reg;
+  u32 stack_offset;
+  MemAccess mem;
+} CGCallPlanMove;
+
+typedef struct CGCallPlanRet {
+  Operand dst;
+  u8 cls;
+  Reg src_reg;
+  MemAccess mem;
+} CGCallPlanRet;
+
+#define CG_CALL_PLAN_REG_CLASSES 3u
+
+typedef struct CGCallPlan {
+  CGCallPlanMove* args;
+  u32 nargs;
+  CGCallPlanRet* rets;
+  u32 nrets;
+  Operand callee;
+  u32 clobber_mask[CG_CALL_PLAN_REG_CLASSES];
+  u32 return_mask[CG_CALL_PLAN_REG_CLASSES];
+  u32 stack_arg_size;
+  u8 variadic_fp_count;
+  u8 has_sret;
+  u8 pad[2];
+} CGCallPlan;
+
 typedef u32 Label;
 #define LABEL_NONE 0
 
@@ -556,6 +614,13 @@ struct CGTarget {
    * is backend-internal storage that outlives the current function. */
   void (*get_allocable_regs)(CGTarget*, RegClass, const Reg** out, u32* nregs);
 
+  /* Return the target's physical register file for `cls`. Unlike
+   * get_allocable_regs, this is descriptive metadata: callers filter by
+   * CGPhysRegFlag instead of assuming the list is already an allocation
+   * policy. */
+  void (*get_phys_regs)(CGTarget*, RegClass, const CGPhysRegInfo** out,
+                        u32* nregs);
+
   /* Return the target's scratch registers for `cls`.
    * Scratch registers are used internally by the backend (e.g. large
    * immediate materialization) and must not appear in the allocable pool.
@@ -565,6 +630,10 @@ struct CGTarget {
   /* Return non-zero if `reg` in `cls` is caller-saved on this target. */
   int (*is_caller_saved)(CGTarget*, RegClass, Reg);
 
+  u32 (*call_clobber_mask)(CGTarget*, const CGCallDesc*, RegClass);
+  u32 (*return_reg_mask)(CGTarget*, const ABIFuncInfo*, RegClass);
+  u32 (*callee_save_mask)(CGTarget*, RegClass);
+
   /* Tell the backend which hard registers opt is going to assign in the next
    * function before func_begin reserves its prologue placeholder. Backends use
    * this only as a sizing hint; reserve_hard_regs remains the authoritative
@@ -656,6 +725,8 @@ struct CGTarget {
    * for direct, indirect/byval, sret, split, and multi-register values.
    * `callee.kind == OPK_GLOBAL` is direct; any other kind is indirect. */
   void (*call)(CGTarget*, const CGCallDesc*);
+  void (*plan_call)(CGTarget*, const CGCallDesc*, CGCallPlan* out);
+  void (*emit_call_plan)(CGTarget*, const CGCallPlan*);
   void (*ret)(CGTarget*, const CGABIValue* val_or_null);
 
   /* ---- alloca ----
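Reviewer note on the `get_phys_regs` comment in the `arch.h` hunk above: "callers filter by CGPhysRegFlag" could look roughly like the sketch below. The types are local mirrors so the example compiles on its own. `PhysRegInfo`, the `REG_*` constants, and `filter_regs` are hypothetical stand-ins, `Reg` is assumed to be a 32-bit integer, and the actual filtering in `opt_machinize` is not shown in this excerpt.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirrors of a few flag bits from the diff; values copied from it. */
enum {
  REG_ALLOCABLE = 1u << 0,
  REG_CALLER_SAVED = 1u << 1,
  REG_CALLEE_SAVED = 1u << 2,
  REG_RESERVED = 1u << 7
};

typedef struct PhysRegInfo {
  uint32_t reg; /* assumption: Reg is a 32-bit integer */
  uint8_t cls;
  uint8_t abi_index;
  uint16_t flags;
  uint16_t save_cost;
  uint16_t use_cost;
} PhysRegInfo;

/* Copies into `out` every register that has all `want` bits and none of the
 * `avoid` bits, preserving table order. Returns the number copied. */
static uint32_t filter_regs(const PhysRegInfo* info, uint32_t n,
                            uint16_t want, uint16_t avoid,
                            uint32_t* out, uint32_t cap) {
  uint32_t count = 0;
  for (uint32_t i = 0; i < n; ++i) {
    if ((info[i].flags & want) != want) continue;
    if (info[i].flags & avoid) continue;
    assert(count < cap); /* caller must size `out` for the whole table */
    out[count++] = info[i].reg;
  }
  return count;
}
```

Keeping the table descriptive and deriving pools by filter means the same data can answer "which registers may I allocate" and "which of those are callee-saved" without a second hand-maintained list.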
diff --git a/src/arch/rv64/opt_coord.c b/src/arch/rv64/opt_coord.c
@@ -11,6 +11,27 @@ static const Reg rv_fp_allocable[]  = {20, 21, 22, 23, 24, 25, 26, 27};
 static const Reg rv_int_scratch[] = {18, 19}; /* s2, s3; reserved by opt_emit */
 static const Reg rv_fp_scratch[]  = {18, 19}; /* fs2, fs3; reserved by opt_emit */
 
+static const CGPhysRegInfo rv_int_phys[] = {
+    {20, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {21, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {22, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {23, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {24, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {25, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {26, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {27, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+};
+static const CGPhysRegInfo rv_fp_phys[] = {
+    {20, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {21, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {22, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {23, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {24, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {25, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {26, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {27, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+};
+
 /* ============================================================
  * Vtable methods */
 
@@ -52,6 +73,25 @@ static void rv_get_scratch_regs(CGTarget* t, RegClass cls,
   }
 }
 
+static void rv_get_phys_regs(CGTarget* t, RegClass cls,
+                             const CGPhysRegInfo** out, u32* nregs) {
+  (void)t;
+  switch (cls) {
+    case RC_INT:
+      *out = rv_int_phys;
+      *nregs = sizeof rv_int_phys / sizeof rv_int_phys[0];
+      break;
+    case RC_FP:
+      *out = rv_fp_phys;
+      *nregs = sizeof rv_fp_phys / sizeof rv_fp_phys[0];
+      break;
+    default:
+      *out = NULL;
+      *nregs = 0;
+      break;
+  }
+}
+
 static int rv_is_caller_saved(CGTarget* t, RegClass cls, Reg reg) {
   (void)t;
   switch (cls) {
@@ -68,6 +108,139 @@ static int rv_is_caller_saved(CGTarget* t, RegClass cls, Reg reg) {
   }
 }
 
+static u32 rv_call_clobber_mask(CGTarget* t, const CGCallDesc* d,
+                                RegClass cls) {
+  (void)t;
+  (void)d;
+  u32 mask = 0;
+  if (cls == RC_INT || cls == RC_FP) { /* FP temps are ft0-ft7 (f0-f7) */
+    for (u32 r = cls == RC_FP ? 0u : 5u; r <= 7; ++r) mask |= 1u << r;
+    for (u32 r = 10; r <= 17; ++r) mask |= 1u << r;
+    for (u32 r = 28; r <= 31; ++r) mask |= 1u << r;
+  }
+  return mask;
+}
+
+static u32 rv_callee_save_mask(CGTarget* t, RegClass cls) {
+  (void)t;
+  u32 mask = 0;
+  if (cls == RC_INT || cls == RC_FP)
+    for (u32 r = 18; r <= 27; ++r) mask |= 1u << r;
+  return mask;
+}
+
+static u32 rv_return_reg_mask(CGTarget* t, const ABIFuncInfo* abi,
+                              RegClass cls) {
+  (void)t;
+  if (!abi || abi->ret.kind == ABI_ARG_IGNORE ||
+      abi->ret.kind == ABI_ARG_INDIRECT)
+    return 0;
+  u32 mask = 0, ni = 0, nf = 0;
+  for (u16 i = 0; i < abi->ret.nparts; ++i) {
+    const ABIArgPart* p = &abi->ret.parts[i];
+    if (cls == RC_INT && p->cls == ABI_CLASS_INT) mask |= 1u << (RV_A0 + ni++);
+    else if (cls == RC_FP && p->cls == ABI_CLASS_FP) mask |= 1u << (10u + nf++);
+  }
+  return mask;
+}
+
+static void rv_plan_call(CGTarget* t, const CGCallDesc* d, CGCallPlan* out) {
+  memset(out, 0, sizeof *out);
+  out->callee = d->callee;
+  out->stack_arg_size = t->call_stack_size ? t->call_stack_size(t, d) : 0;
+  out->has_sret = d->abi && d->abi->has_sret;
+  for (u32 c = 0; c < CG_CALL_PLAN_REG_CLASSES; ++c) {
+    out->clobber_mask[c] = rv_call_clobber_mask(t, d, (RegClass)c);
+    out->return_mask[c] = rv_return_reg_mask(t, d->abi, (RegClass)c);
+  }
+  u32 cap = d->nargs * 2u + 2u;
+  out->args = arena_zarray(t->c->tu, CGCallPlanMove, cap ? cap : 1u);
+  out->rets = arena_zarray(t->c->tu, CGCallPlanRet, 4);
+  u32 next_int = d->abi && d->abi->has_sret ? 1u : 0u, next_fp = 0, stack = 0;
+  for (u32 a = 0; a < d->nargs; ++a) {
+    const CGABIValue* av = &d->args[a];
+    const ABIArgInfo* ai = av->abi;
+    ABIArgInfo vai;
+    ABIArgPart vap;
+    if (!ai) {
+      memset(&vai, 0, sizeof vai);
+      memset(&vap, 0, sizeof vap);
+      vap.cls = ABI_CLASS_INT;
+      vap.size = type_byte_size(av->type);
+      vai.kind = ABI_ARG_DIRECT;
+      vai.nparts = 1;
+      vai.parts = &vap;
+      ai = &vai;
+    }
+    if (ai->kind == ABI_ARG_IGNORE) continue;
+    if (ai->kind == ABI_ARG_INDIRECT) {
+      CGCallPlanMove* m = &out->args[out->nargs++];
+      m->src = av->storage;
+      m->cls = RC_INT;
+      if (next_int < 8) {
+        m->dst_kind = CG_CALL_PLAN_REG;
+        m->dst_reg = RV_A0 + next_int++;
+      } else {
+        m->dst_kind = CG_CALL_PLAN_STACK;
+        m->stack_offset = stack;
+        stack += 8;
+      }
+      m->mem.type = av->type;
+      m->mem.size = 8;
+      m->mem.align = 8;
+      continue;
+    }
+    for (u16 i = 0; i < ai->nparts; ++i) {
+      const ABIArgPart* p = &ai->parts[i];
+      CGCallPlanMove* m = &out->args[out->nargs++];
+      m->src = av->nparts ? av->parts[i].op : av->storage;
+      m->mem.type = av->type;
+      m->mem.size = p->size;
+      m->mem.align = p->align ? p->align : p->size;
+      if (p->cls == ABI_CLASS_FP) {
+        m->cls = RC_FP;
+        if (next_fp < 8) {
+          m->dst_kind = CG_CALL_PLAN_REG;
+          m->dst_reg = 10u + next_fp++;
+        } else {
+          m->dst_kind = CG_CALL_PLAN_STACK;
+          m->stack_offset = stack;
+          stack += 8;
+        }
+      } else {
+        m->cls = RC_INT;
+        if (next_int < 8) {
+          m->dst_kind = CG_CALL_PLAN_REG;
+          m->dst_reg = RV_A0 + next_int++;
+        } else {
+          m->dst_kind = CG_CALL_PLAN_STACK;
+          m->stack_offset = stack;
+          stack += 8;
+        }
+      }
+    }
+  }
+  if (d->abi && d->abi->ret.kind != ABI_ARG_IGNORE &&
+      d->abi->ret.kind != ABI_ARG_INDIRECT) {
+    u32 ni = 0, nf = 0;
+    for (u16 i = 0; i < d->abi->ret.nparts; ++i) {
+      const ABIArgPart* p = &d->abi->ret.parts[i];
+      CGCallPlanRet* r = &out->rets[out->nrets++];
+      r->dst = d->ret.storage;
+      r->mem.type = d->ret.type;
+      r->mem.size = p->size;
+      r->mem.align = p->align ? p->align : p->size;
+      if (p->cls == ABI_CLASS_FP) {
+        r->cls = RC_FP;
+        r->src_reg = 10u + nf++;
+      } else {
+        r->cls = RC_INT;
+        r->src_reg = RV_A0 + ni++;
+      }
+    }
+  }
+}
+
 static void rv_reserve_hard_regs(CGTarget* t, RegClass cls,
                                  const Reg* regs, u32 n) {
   RImpl* a = impl_of(t);
@@ -107,8 +280,13 @@ static void rv_plan_hard_regs(CGTarget* t, RegClass cls, const Reg* regs,
 
 void rv_coord_vtable_init(CGTarget* t) {
   t->get_allocable_regs = rv_get_allocable_regs;
+  t->get_phys_regs      = rv_get_phys_regs;
   t->get_scratch_regs   = rv_get_scratch_regs;
   t->is_caller_saved    = rv_is_caller_saved;
+  t->call_clobber_mask  = rv_call_clobber_mask;
+  t->return_reg_mask    = rv_return_reg_mask;
+  t->callee_save_mask   = rv_callee_save_mask;
+  t->plan_call          = rv_plan_call;
   t->plan_hard_regs     = rv_plan_hard_regs;
   t->reserve_hard_regs  = rv_reserve_hard_regs;
 }
diff --git a/src/arch/x64/opt_coord.c b/src/arch/x64/opt_coord.c
@@ -18,6 +18,24 @@ static const Reg x_fp_allocable[]  = {X64_XMM6, X64_XMM7, X64_XMM8,
 static const Reg x_int_scratch[] = {X64_RBX, X64_R12};
 static const Reg x_fp_scratch[]  = {X64_XMM0 + 14, X64_XMM15};
 
+static const CGPhysRegInfo x_int_phys[] = {
+    {X64_R13, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {X64_R14, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {X64_R15, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED, 50, 4},
+    {X64_R10, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED |
+                              CG_REG_TEMP_PREFERRED, 0, 0},
+};
+static const CGPhysRegInfo x_fp_phys[] = {
+    {X64_XMM6, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED, 0, 0},
+    {X64_XMM7, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED, 0, 0},
+    {X64_XMM8, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED, 0, 0},
+    {X64_XMM0 + 9, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED, 0, 0},
+    {X64_XMM0 + 10, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED, 0, 0},
+    {X64_XMM0 + 11, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED, 0, 0},
+    {X64_XMM0 + 12, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED, 0, 0},
+    {X64_XMM0 + 13, RC_FP, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED, 0, 0},
+};
+
 /* ============================================================
  * Vtable methods */
 
@@ -59,6 +77,25 @@ static void x_get_scratch_regs(CGTarget* t, RegClass cls,
   }
 }
 
+static void x_get_phys_regs(CGTarget* t, RegClass cls,
+                            const CGPhysRegInfo** out, u32* nregs) {
+  (void)t;
+  switch (cls) {
+    case RC_INT:
+      *out = x_int_phys;
+      *nregs = sizeof x_int_phys / sizeof x_int_phys[0];
+      break;
+    case RC_FP:
+      *out = x_fp_phys;
+      *nregs = sizeof x_fp_phys / sizeof x_fp_phys[0];
+      break;
+    default:
+      *out = NULL;
+      *nregs = 0;
+      break;
+  }
+}
+
 static int x_is_caller_saved(CGTarget* t, RegClass cls, Reg reg) {
   (void)t;
   switch (cls) {
@@ -75,6 +112,146 @@ static int x_is_caller_saved(CGTarget* t, RegClass cls, Reg reg) {
   }
 }
 
+static u32 x_call_clobber_mask(CGTarget* t, const CGCallDesc* d,
+                               RegClass cls) {
+  (void)t;
+  (void)d;
+  switch (cls) {
+    case RC_INT:
+      return (1u << X64_RAX) | (1u << X64_RCX) | (1u << X64_RDX) |
+             (1u << X64_RSI) | (1u << X64_RDI) | (1u << X64_R8) |
+             (1u << X64_R9) | (1u << X64_R10) | (1u << X64_R11);
+    case RC_FP:
+      return 0xFFFFu;
+    default:
+      return 0;
+  }
+}
+
+static u32 x_callee_save_mask(CGTarget* t, RegClass cls) {
+  (void)t;
+  return cls == RC_INT ? ((1u << X64_RBX) | (1u << X64_R12) |
+                          (1u << X64_R13) | (1u << X64_R14) |
+                          (1u << X64_R15))
+                       : 0;
+}
+
+static u32 x_return_reg_mask(CGTarget* t, const ABIFuncInfo* abi,
+                             RegClass cls) {
+  (void)t;
+  if (!abi || abi->ret.kind == ABI_ARG_IGNORE ||
+      abi->ret.kind == ABI_ARG_INDIRECT)
+    return 0;
+  u32 mask = 0, ni = 0, nf = 0;
+  static const u32 iregs[2] = {X64_RAX, X64_RDX};
+  for (u16 i = 0; i < abi->ret.nparts; ++i) {
+    const ABIArgPart* p = &abi->ret.parts[i];
+    if (cls == RC_INT && p->cls == ABI_CLASS_INT && ni < 2) mask |= 1u << iregs[ni++];
+    else if (cls == RC_FP && p->cls == ABI_CLASS_FP && nf < 2) mask |= 1u << (X64_XMM0 + nf++);
+  }
+  return mask;
+}
+
+static void x_plan_call(CGTarget* t, const CGCallDesc* d, CGCallPlan* out) {
+  memset(out, 0, sizeof *out);
+  out->callee = d->callee;
+  out->stack_arg_size = t->call_stack_size ? t->call_stack_size(t, d) : 0;
+  out->has_sret = d->abi && d->abi->has_sret;
+  for (u32 c = 0; c < CG_CALL_PLAN_REG_CLASSES; ++c) {
+    out->clobber_mask[c] = x_call_clobber_mask(t, d, (RegClass)c);
+    out->return_mask[c] = x_return_reg_mask(t, d->abi, (RegClass)c);
+  }
+  u32 cap = d->nargs * 2u + 2u;
+  out->args = arena_zarray(t->c->tu, CGCallPlanMove, cap ? cap : 1u);
+  out->rets = arena_zarray(t->c->tu, CGCallPlanRet, 4);
+  u32 next_int = d->abi && d->abi->has_sret ? 1u : 0u, next_fp = 0, stack = 0;
+  static const u32 iregs[6] = {X64_RDI, X64_RSI, X64_RDX, X64_RCX, X64_R8, X64_R9};
+  for (u32 a = 0; a < d->nargs; ++a) {
+    const CGABIValue* av = &d->args[a];
+    const ABIArgInfo* ai = av->abi;
+    ABIArgInfo vai;
+    ABIArgPart vap;
+    if (!ai) {
+      memset(&vai, 0, sizeof vai);
+      memset(&vap, 0, sizeof vap);
+      vap.cls = av->storage.cls == RC_FP ? ABI_CLASS_FP : ABI_CLASS_INT;
+      vap.size = type_byte_size(av->type);
+      vai.kind = ABI_ARG_DIRECT;
+      vai.nparts = 1;
+      vai.parts = &vap;
+      ai = &vai;
+    }
+    if (ai->kind == ABI_ARG_IGNORE) continue;
+    if (ai->kind == ABI_ARG_INDIRECT) {
+      CGCallPlanMove* m = &out->args[out->nargs++];
+      m->src = av->storage;
+      m->cls = RC_INT;
+      if (next_int < 6) {
+        m->dst_kind = CG_CALL_PLAN_REG;
+        m->dst_reg = iregs[next_int++];
+      } else {
+        m->dst_kind = CG_CALL_PLAN_STACK;
+        m->stack_offset = stack;
+        stack += 8;
+      }
+      m->mem.type = av->type;
+      m->mem.size = 8;
+      m->mem.align = 8;
+      continue;
+    }
+    for (u16 i = 0; i < ai->nparts; ++i) {
+      const ABIArgPart* p = &ai->parts[i];
+      CGCallPlanMove* m = &out->args[out->nargs++];
+      m->src = av->nparts ? av->parts[i].op : av->storage;
+      m->mem.type = av->type;
+      m->mem.size = p->size;
+      m->mem.align = p->align ? p->align : p->size;
+      if (p->cls == ABI_CLASS_FP) {
+        m->cls = RC_FP;
+        if (next_fp < 8) {
+          m->dst_kind = CG_CALL_PLAN_REG;
+          m->dst_reg = X64_XMM0 + next_fp++;
+        } else {
+          m->dst_kind = CG_CALL_PLAN_STACK;
+          m->stack_offset = stack;
+          stack += 8;
+        }
+      } else {
+        m->cls = RC_INT;
+        if (next_int < 6) {
+          m->dst_kind = CG_CALL_PLAN_REG;
+          m->dst_reg = iregs[next_int++];
+        } else {
+          m->dst_kind = CG_CALL_PLAN_STACK;
+          m->stack_offset = stack;
+          stack += 8;
+        }
+      }
+    }
+  }
+  out->variadic_fp_count = (u8)next_fp;
+  if (d->abi && d->abi->ret.kind != ABI_ARG_IGNORE &&
+      d->abi->ret.kind != ABI_ARG_INDIRECT) {
+    u32 ni = 0, nf = 0;
+    static const u32 rregs[2] = {X64_RAX, X64_RDX};
+    for (u16 i = 0; i < d->abi->ret.nparts; ++i) {
+      const ABIArgPart* p = &d->abi->ret.parts[i];
+      CGCallPlanRet* r = &out->rets[out->nrets++];
+      r->dst = d->ret.storage;
+      r->mem.type = d->ret.type;
+      r->mem.size = p->size;
+      r->mem.align = p->align ? p->align : p->size;
+      if (p->cls == ABI_CLASS_FP) {
+        r->cls = RC_FP;
+        r->src_reg = X64_XMM0 + nf++;
+      } else {
+        r->cls = RC_INT;
+        r->src_reg = rregs[ni++];
+      }
+    }
+  }
+}
+
 static void x_reserve_hard_regs(CGTarget* t, RegClass cls,
                                 const Reg* regs, u32 n) {
   XImpl* a = impl_of(t);
@@ -118,8 +295,13 @@ static void x_plan_hard_regs(CGTarget* t, RegClass cls, const Reg* regs,
 
 void x_coord_vtable_init(CGTarget* t) {
   t->get_allocable_regs = x_get_allocable_regs;
+  t->get_phys_regs      = x_get_phys_regs;
   t->get_scratch_regs   = x_get_scratch_regs;
   t->is_caller_saved    = x_is_caller_saved;
+  t->call_clobber_mask  = x_call_clobber_mask;
+  t->return_reg_mask    = x_return_reg_mask;
+  t->callee_save_mask   = x_callee_save_mask;
+  t->plan_call          = x_plan_call;
   t->plan_hard_regs     = x_plan_hard_regs;
   t->reserve_hard_regs  = x_reserve_hard_regs;
 }
diff --git a/src/opt/ir.h b/src/opt/ir.h
@@ -126,6 +126,9 @@ typedef struct IRPhiAux {
  * Val form via the CGABIValue.storage.v.reg fields where applicable. */
 typedef struct IRCallAux {
   CGCallDesc desc;
+  CGCallPlan plan;
+  u8 plan_valid;
+  u8 pad[3];
   /* Result Vals (one per ABI-decomposed return part). 0 for void. */
   u32 nresults;
   Val* results;
@@ -354,9 +357,15 @@ typedef struct Func {
 
   Reg opt_hard_regs[OPT_REG_CLASSES][OPT_MAX_HARD_REGS];
   u32 opt_hard_reg_count[OPT_REG_CLASSES];
+  CGPhysRegInfo opt_phys_regs[OPT_REG_CLASSES][OPT_MAX_HARD_REGS];
+  u32 opt_phys_reg_count[OPT_REG_CLASSES];
   Reg opt_scratch_regs[OPT_REG_CLASSES][OPT_MAX_SCRATCH_REGS];
   u32 opt_scratch_reg_count[OPT_REG_CLASSES];
   u32 opt_caller_saved[OPT_REG_CLASSES]; /* bit r set if hard reg r is caller-saved */
+  u32 opt_callee_saved[OPT_REG_CLASSES];
+  u32 opt_reserved_regs[OPT_REG_CLASSES];
+  u32 opt_arg_regs[OPT_REG_CLASSES];
+  u32 opt_ret_regs[OPT_REG_CLASSES];
 } Func;
 
 /* ---- API ---- */
diff --git a/src/opt/opt.c b/src/opt/opt.c
@@ -468,6 +468,17 @@ static void w_get_allocable_regs(CGTarget* t, RegClass cls, const Reg** out,
   }
 }
 
+static void w_get_phys_regs(CGTarget* t, RegClass cls,
+                            const CGPhysRegInfo** out, u32* nregs) {
+  CGTarget* wr = impl_of(t)->target;
+  if (wr->get_phys_regs)
+    wr->get_phys_regs(wr, cls, out, nregs);
+  else {
+    *out = NULL;
+    *nregs = 0;
+  }
+}
+
 static void w_get_scratch_regs(CGTarget* t, RegClass cls, const Reg** out,
                                u32* nregs) {
   CGTarget* wr = impl_of(t)->target;
@@ -485,6 +496,26 @@ static int w_is_caller_saved(CGTarget* t, RegClass cls, Reg r) {
   return 0;
 }
 
+static u32 w_call_clobber_mask(CGTarget* t, const CGCallDesc* d,
+                               RegClass cls) {
+  CGTarget* wr = impl_of(t)->target;
+  if (wr->call_clobber_mask) return wr->call_clobber_mask(wr, d, cls);
+  return 0;
+}
+
+static u32 w_return_reg_mask(CGTarget* t, const ABIFuncInfo* abi,
+                             RegClass cls) {
+  CGTarget* wr = impl_of(t)->target;
+  if (wr->return_reg_mask) return wr->return_reg_mask(wr, abi, cls);
+  return 0;
+}
+
+static u32 w_callee_save_mask(CGTarget* t, RegClass cls) {
+  CGTarget* wr = impl_of(t)->target;
+  if (wr->callee_save_mask) return wr->callee_save_mask(wr, cls);
+  return 0;
+}
+
 static void w_plan_hard_regs(CGTarget* t, RegClass cls, const Reg* regs,
                              u32 n) {
   CGTarget* wr = impl_of(t)->target;
@@ -874,6 +905,19 @@ static void w_call(CGTarget* t, const CGCallDesc* d) {
   }
 }
 
+static void w_plan_call(CGTarget* t, const CGCallDesc* d, CGCallPlan* out) {
+  CGTarget* wr = impl_of(t)->target;
+  if (wr->plan_call)
+    wr->plan_call(wr, d, out);
+  else
+    memset(out, 0, sizeof *out);
+}
+
+static void w_emit_call_plan(CGTarget* t, const CGCallPlan* p) {
+  CGTarget* wr = impl_of(t)->target;
+  if (wr->emit_call_plan) wr->emit_call_plan(wr, p);
+}
+
 static void w_ret(CGTarget* t, const CGABIValue* v) {
   OptImpl* o = impl_of(t);
   Inst* in = rec(o, IR_RET);
@@ -2027,8 +2071,12 @@ CGTarget* opt_cgtarget_new(Compiler* c, CGTarget* target, int level) {
   t->reload_reg = w_reload_reg;
 
   t->get_allocable_regs = w_get_allocable_regs;
+  t->get_phys_regs = w_get_phys_regs;
   t->get_scratch_regs = w_get_scratch_regs;
   t->is_caller_saved = w_is_caller_saved;
+  t->call_clobber_mask = w_call_clobber_mask;
+  t->return_reg_mask = w_return_reg_mask;
+  t->callee_save_mask = w_callee_save_mask;
   t->plan_hard_regs = w_plan_hard_regs;
   t->reserve_hard_regs = w_reserve_hard_regs;
 
@@ -2061,6 +2109,8 @@ CGTarget* opt_cgtarget_new(Compiler* c, CGTarget* target, int level) {
   t->convert = w_convert;
 
   t->call = w_call;
+  t->plan_call = w_plan_call;
+  t->emit_call_plan = w_emit_call_plan;
   t->ret = w_ret;
 
   t->alloca_ = w_alloca_;
diff --git a/src/opt/pass_lower.c b/src/opt/pass_lower.c
@@ -358,17 +358,45 @@ void opt_machinize(Func* f, CGTarget* target) {
   f->opt_has_target = 1;
   for (u32 c = 0; c < OPT_REG_CLASSES; ++c) {
     f->opt_hard_reg_count[c] = 0;
+    f->opt_phys_reg_count[c] = 0;
     f->opt_scratch_reg_count[c] = 0;
     f->opt_caller_saved[c] = 0;
+    f->opt_callee_saved[c] = 0;
+    f->opt_reserved_regs[c] = 0;
+    f->opt_arg_regs[c] = 0;
+    f->opt_ret_regs[c] = 0;
   }
 
   for (u32 c = 0; c < OPT_REG_CLASSES; ++c) {
-    const Reg* hard = NULL;
-    u32 nhard = 0;
-    if (target->get_allocable_regs)
-      target->get_allocable_regs(target, (RegClass)c, &hard, &nhard);
-    for (u32 i = 0; i < nhard && i < OPT_MAX_HARD_REGS; ++i)
-      f->opt_hard_regs[c][f->opt_hard_reg_count[c]++] = hard[i];
+    const CGPhysRegInfo* phys = NULL;
+    u32 nphys = 0;
+    if (target->get_phys_regs)
+      target->get_phys_regs(target, (RegClass)c, &phys, &nphys);
+    if (phys) {
+      for (u32 i = 0; i < nphys && i < OPT_MAX_HARD_REGS; ++i) {
+        CGPhysRegInfo pi = phys[i];
+        Reg hr = pi.reg;
+        if (hr < 32u) {
+          if (pi.flags & CG_REG_CALLER_SAVED) f->opt_caller_saved[c] |= 1u << hr;
+          if (pi.flags & CG_REG_CALLEE_SAVED) f->opt_callee_saved[c] |= 1u << hr;
+          if (pi.flags & CG_REG_RESERVED) f->opt_reserved_regs[c] |= 1u << hr;
+          if (pi.flags & CG_REG_ARG) f->opt_arg_regs[c] |= 1u << hr;
+          if (pi.flags & CG_REG_RET) f->opt_ret_regs[c] |= 1u << hr;
+        }
+        f->opt_phys_regs[c][f->opt_phys_reg_count[c]++] = pi;
+        if ((pi.flags & CG_REG_ALLOCABLE) &&
+            !(pi.flags & CG_REG_RESERVED)) {
+          f->opt_hard_regs[c][f->opt_hard_reg_count[c]++] = hr;
+        }
+      }
+    } else {
+      const Reg* hard = NULL;
+      u32 nhard = 0;
+      if (target->get_allocable_regs)
+        target->get_allocable_regs(target, (RegClass)c, &hard, &nhard);
+      for (u32 i = 0; i < nhard && i < OPT_MAX_HARD_REGS; ++i)
+        f->opt_hard_regs[c][f->opt_hard_reg_count[c]++] = hard[i];
+    }
 
     const Reg* scratch = NULL;
     u32 nscratch = 0;
@@ -377,13 +405,15 @@ void opt_machinize(Func* f, CGTarget* target) {
     for (u32 i = 0; i < nscratch && i < OPT_MAX_SCRATCH_REGS; ++i)
       f->opt_scratch_regs[c][f->opt_scratch_reg_count[c]++] = scratch[i];
 
-    if (target->is_caller_saved) {
+    if (!phys && target->is_caller_saved) {
       for (u32 i = 0; i < f->opt_hard_reg_count[c]; ++i) {
         Reg hr = f->opt_hard_regs[c][i];
         if (target->is_caller_saved(target, (RegClass)c, hr))
           f->opt_caller_saved[c] |= (1u << hr);
       }
     }
+    if (target->callee_save_mask)
+      f->opt_callee_saved[c] |= target->callee_save_mask(target, (RegClass)c);
   }
 
   for (u32 c = 0; c < OPT_REG_CLASSES; ++c) {
@@ -405,8 +435,15 @@ void opt_machinize(Func* f, CGTarget* target) {
     Block* bl = &f->blocks[b];
     for (u32 i = 0; i < bl->ninsts; ++i) {
       Inst* in = &bl->insts[i];
-      if ((IROp)in->op == IR_ASM_BLOCK)
+      if ((IROp)in->op == IR_ASM_BLOCK) {
         asm_prepare_constraints(f, target, (IRAsmAux*)in->extra.aux);
+      } else if ((IROp)in->op == IR_CALL && target->plan_call) {
+        IRCallAux* aux = (IRCallAux*)in->extra.aux;
+        if (aux) {
+          target->plan_call(target, &aux->desc, &aux->plan);
+          aux->plan_valid = 1;
+        }
+      }
     }
   }
 }
@@ -416,6 +453,18 @@ static int is_caller_saved(Func* f, u8 cls, Reg r) {
   return (f->opt_caller_saved[cls] & (1u << r)) != 0;
 }
 
+typedef struct OptAllocator OptAllocator;
+
+static u32 call_clobber_mask_for(Func* f, const Inst* in, u8 cls) {
+  if (cls >= OPT_REG_CLASSES) return 0;
+  if (in && (IROp)in->op == IR_CALL) {
+    IRCallAux* aux = (IRCallAux*)in->extra.aux;
+    if (aux && aux->plan_valid)
+      return aux->plan.clobber_mask[cls];
+  }
+  return f->opt_caller_saved[cls];
+}
+
 #define OPT_BLK_NONE 0xffffffffu
 
 typedef struct LoopPostorderCtx {
@@ -569,6 +618,13 @@ static int hard_available(Func* f, u8 cls, Reg r) {
   return 0;
 }
 
+static const CGPhysRegInfo* phys_info_for(Func* f, u8 cls, Reg r) {
+  if (cls >= OPT_REG_CLASSES) return NULL;
+  for (u32 i = 0; i < f->opt_phys_reg_count[cls]; ++i)
+    if (f->opt_phys_regs[cls][i].reg == r) return &f->opt_phys_regs[cls][i];
+  return NULL;
+}
+
 static FrameSlot spill_slot_for(Func* f, Val v) {
   if (f->val_info[v].spill_slot != FRAME_SLOT_NONE)
     return f->val_info[v].spill_slot;
@@ -624,6 +680,24 @@ typedef struct OptAllocator {
   u64 stack_mark_points;
 } OptAllocator;
 
+static u32 hard_reg_alloc_score(Func* f, const OptAllocator* a,
+                                const OptValInfo* vi, Reg hr) {
+  const CGPhysRegInfo* pi = phys_info_for(f, vi->cls, hr);
+  u32 score = pi ? pi->use_cost : 0;
+  if (vi->live_across_call_freq) {
+    if (is_caller_saved(f, vi->cls, hr))
+      score += 1000u + vi->live_across_call_freq;
+    else
+      score += 20u;
+  } else if (!is_caller_saved(f, vi->cls, hr)) {
+    u32 bit = hard_loc_bit(vi->cls, hr);
+    int already_open = bit < a->hard_loc_bits &&
+                       a->hard_used_locs[bit].n != 0;
+    if (!already_open) score += pi ? pi->save_cost : 50u;
+  }
+  return score;
+}
+
 static int alloc_candidate_higher(const OptAllocCandidate* a,
                                   const OptAllocCandidate* b) {
   if (a->tied != b->tied) return a->tied > b->tied;
@@ -1007,15 +1081,22 @@ static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges,
     }
 
     int found = 0;
+    Reg best = REG_NONE;
+    u32 best_score = 0xffffffffu;
     for (u32 r = 0; r < f->opt_hard_reg_count[cls]; ++r) {
       Reg hr = f->opt_hard_regs[cls][r];
       if (hr >= 32) continue;
       if (vi->forbidden_hard_regs & (1u << hr)) continue;
-      if (vi->live_across_call_freq && is_caller_saved(f, cls, hr)) continue;
       if (alloc_hard_conflicts(a, ranges, v, hard_loc_bit(cls, hr))) continue;
-      alloc_assign_hard(f, a, ranges, v, hr);
-      found = 1;
-      break;
+      u32 score = hard_reg_alloc_score(f, a, vi, hr);
+      if (!found || score < best_score) {
+        found = 1;
+        best = hr;
+        best_score = score;
+      }
+    }
+    if (found) {
+      alloc_assign_hard(f, a, ranges, v, best);
     }
     if (!found) {
       alloc_assign_stack(f, a, ranges, v);
@@ -1241,6 +1322,7 @@ typedef struct RewriteCallSaveCtx {
   Func* f;
   RewriteList* out;
   const InstRefs* refs;
+  const Inst* call;
   int emit_restore;
 } RewriteCallSaveCtx;
 
@@ -1250,7 +1332,10 @@ static void rewrite_call_save_one(Val v, void* arg) {
   if (v == VAL_NONE || v >= f->nvals) return;
   if (c->refs && refs_has_def(c->refs, v)) return;
   if (f->val_info[v].alloc_kind != OPT_ALLOC_HARD) return;
-  if (!is_caller_saved(f, f->val_info[v].cls, f->val_info[v].hard_reg)) return;
+  u8 cls = f->val_info[v].cls;
+  Reg hr = f->val_info[v].hard_reg;
+  if (cls >= OPT_REG_CLASSES || hr >= 32u) return;
+  if ((call_clobber_mask_for(f, c->call, cls) & (1u << hr)) == 0) return;
   if (c->emit_restore)
     append_load_val(f, c->out, v);
   else
@@ -1258,11 +1343,12 @@ static void rewrite_call_save_one(Val v, void* arg) {
 }
 
 static void append_live_call_saves(Func* f, RewriteList* out,
+                                   const Inst* call,
                                    const u64* live_after, u32 live_active_words,
                                    const InstRefs* refs,
                                    const Val* call_save_vals,
                                    u32 ncall_save_vals, int emit_restore) {
-  RewriteCallSaveCtx ctx = {f, out, refs, emit_restore};
+  RewriteCallSaveCtx ctx = {f, out, refs, call, emit_restore};
   f->opt_rewrite_live_words_touched += ncall_save_vals;
   for (u32 i = 0; i < ncall_save_vals; ++i) {
     Val v = call_save_vals[i];
@@ -1279,7 +1365,6 @@ static Val* rewrite_collect_call_save_vals(Func* f, u32* count_out) {
     OptValInfo* vi = &f->val_info[v];
     if (vi->alloc_kind != OPT_ALLOC_HARD) continue;
     if (!vi->live_across_call_freq) continue;
-    if (!is_caller_saved(f, vi->cls, vi->hard_reg)) continue;
     ++n;
   }
   Val* vals = arena_array(f->arena, Val, n ? n : 1u);
@@ -1288,7 +1373,6 @@ static Val* rewrite_collect_call_save_vals(Func* f, u32* count_out) {
     OptValInfo* vi = &f->val_info[v];
     if (vi->alloc_kind != OPT_ALLOC_HARD) continue;
     if (!vi->live_across_call_freq) continue;
-    if (!is_caller_saved(f, vi->cls, vi->hard_reg)) continue;
     vals[w++] = v;
   }
   *count_out = n;
@@ -1352,9 +1436,9 @@ static void rewrite_func(Func* f, const OptLiveInfo* live_info) {
         walk_inst_operands(f, &in, rewrite_one_operand, &ctx);
       }
       if ((IROp)in.op == IR_CALL) {
-        append_live_call_saves(f, &call_saves, live, live_active_words, &refs,
+        append_live_call_saves(f, &call_saves, &in, live, live_active_words, &refs,
                                call_save_vals, ncall_save_vals, 0);
-        append_live_call_saves(f, &call_restores, live, live_active_words,
+        append_live_call_saves(f, &call_restores, &in, live, live_active_words,
                                &refs, call_save_vals, ncall_save_vals, 1);
       }
 
@@ -2146,7 +2230,7 @@ static void hard_inst_use_def(Func* f, const Inst* in, HardRegSet* use,
         hard_use_abivalue(use, &aux->desc.args[i]);
       hard_def_abivalue(def, &aux->desc.ret);
       for (u32 c = 0; c < OPT_REG_CLASSES; ++c)
-        def->cls[c] |= f->opt_caller_saved[c];
+        def->cls[c] |= call_clobber_mask_for(f, in, (u8)c);
       break;
     }
     case IR_CMP_BRANCH:
diff --git a/test/opt/opt_test.c b/test/opt/opt_test.c
@@ -406,9 +406,14 @@ typedef struct MockCGTarget {
   CGTarget base;
   const Reg* pool[OPT_REG_CLASSES];
   u32 pool_n[OPT_REG_CLASSES];
+  const CGPhysRegInfo* phys[OPT_REG_CLASSES];
+  u32 phys_n[OPT_REG_CLASSES];
   const Reg* scratch[OPT_REG_CLASSES];
   u32 scratch_n[OPT_REG_CLASSES];
   u32 caller_saved_mask[OPT_REG_CLASSES];
+  u32 callee_saved_mask[OPT_REG_CLASSES];
+  u32 call_clobber_mask[OPT_REG_CLASSES];
+  int plan_call_count;
   int plan_calls[OPT_REG_CLASSES];
   int plan_regs[OPT_REG_CLASSES];
   int func_begin_plan_calls;
@@ -443,6 +448,13 @@ static void mock_get_allocable_regs(CGTarget* t, RegClass cls, const Reg** out,
   *nregs = m->pool_n[cls];
 }
 
+static void mock_get_phys_regs(CGTarget* t, RegClass cls,
+                               const CGPhysRegInfo** out, u32* nregs) {
+  MockCGTarget* m = (MockCGTarget*)t;
+  *out = m->phys[cls];
+  *nregs = m->phys_n[cls];
+}
+
 static void mock_get_scratch_regs(CGTarget* t, RegClass cls, const Reg** out,
                                   u32* nregs) {
   MockCGTarget* m = (MockCGTarget*)t;
@@ -456,6 +468,36 @@ static int mock_is_caller_saved(CGTarget* t, RegClass cls, Reg reg) {
   return (m->caller_saved_mask[cls] & (1u << reg)) != 0;
 }
 
+static u32 mock_call_clobber_mask(CGTarget* t, const CGCallDesc* d,
+                                  RegClass cls) {
+  MockCGTarget* m = (MockCGTarget*)t;
+  (void)d;
+  return cls < OPT_REG_CLASSES ? m->call_clobber_mask[cls] : 0;
+}
+
+static u32 mock_callee_save_mask(CGTarget* t, RegClass cls) {
+  MockCGTarget* m = (MockCGTarget*)t;
+  return cls < OPT_REG_CLASSES ? m->callee_saved_mask[cls] : 0;
+}
+
+static u32 mock_return_reg_mask(CGTarget* t, const ABIFuncInfo* abi,
+                                RegClass cls) {
+  (void)t;
+  (void)abi;
+  (void)cls;
+  return 0;
+}
+
+static void mock_plan_call(CGTarget* t, const CGCallDesc* d,
+                           CGCallPlan* out) {
+  MockCGTarget* m = (MockCGTarget*)t;
+  memset(out, 0, sizeof *out);
+  out->callee = d->callee;
+  for (u32 c = 0; c < OPT_REG_CLASSES; ++c)
+    out->clobber_mask[c] = m->call_clobber_mask[c];
+  ++m->plan_call_count;
+}
+
 static void mock_reserve_hard_regs(CGTarget* t, RegClass cls, const Reg* regs,
                                    u32 n) {
   MockCGTarget* m = (MockCGTarget*)t;
@@ -619,8 +661,13 @@ static void mock_init(MockCGTarget* m, Compiler* c) {
   m->base.ret = mock_ret;
   m->base.set_loc = mock_set_loc;
   m->base.get_allocable_regs = mock_get_allocable_regs;
+  m->base.get_phys_regs = mock_get_phys_regs;
   m->base.get_scratch_regs = mock_get_scratch_regs;
   m->base.is_caller_saved = mock_is_caller_saved;
+  m->base.call_clobber_mask = mock_call_clobber_mask;
+  m->base.return_reg_mask = mock_return_reg_mask;
+  m->base.callee_save_mask = mock_callee_save_mask;
+  m->base.plan_call = mock_plan_call;
   m->base.plan_hard_regs = mock_plan_hard_regs;
   m->base.reserve_hard_regs = mock_reserve_hard_regs;
   m->base.resolve_reg_name = mock_resolve_reg_name;
@@ -634,6 +681,22 @@ static void mock_set_pool(MockCGTarget* m, RegClass cls, const Reg* pool,
   m->scratch[cls] = scratch_;
   m->scratch_n[cls] = nscratch;
   m->caller_saved_mask[cls] = caller_mask;
+  m->call_clobber_mask[cls] = caller_mask;
+}
+
+static void mock_set_phys(MockCGTarget* m, RegClass cls,
+                          const CGPhysRegInfo* phys, u32 nphys) {
+  m->phys[cls] = phys;
+  m->phys_n[cls] = nphys;
+  m->caller_saved_mask[cls] = 0;
+  m->callee_saved_mask[cls] = 0;
+  for (u32 i = 0; i < nphys; ++i) {
+    if (phys[i].reg >= 32u) continue;
+    if (phys[i].flags & CG_REG_CALLER_SAVED)
+      m->caller_saved_mask[cls] |= 1u << phys[i].reg;
+    if (phys[i].flags & CG_REG_CALLEE_SAVED)
+      m->callee_saved_mask[cls] |= 1u << phys[i].reg;
+  }
 }
 
 /* ============================================================
@@ -642,6 +705,78 @@ static void mock_set_pool(MockCGTarget* m, RegClass cls, const Reg* pool,
  * injected through MockCGTarget + opt_machinize.
  * ============================================================ */
 
+static void opt_machinize_uses_phys_reg_metadata(void) {
+  TestCtx tc;
+  tc_init(&tc);
+  MockCGTarget mock;
+  mock_init(&mock, tc.c);
+  static const Reg legacy_pool[] = {7};
+  static const Reg scratch[] = {9, 10};
+  static const CGPhysRegInfo phys[] = {
+      {13, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED |
+                         CG_REG_ARG, 0, 1},
+      {19, RC_INT, 0xff, CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED |
+                         CG_REG_RET, 50, 4},
+      {29, RC_INT, 0xff, CG_REG_RESERVED, 0, 0},
+  };
+  mock_set_pool(&mock, RC_INT, legacy_pool, 1, scratch, 2, 0);
+  mock_set_phys(&mock, RC_INT, phys, sizeof phys / sizeof phys[0]);
+
+  Func* f = new_func(&tc);
+  opt_machinize(f, &mock.base);
+  EXPECT(f->opt_hard_reg_count[RC_INT] == 2,
+         "phys metadata should replace legacy allocable pool");
+  EXPECT(f->opt_hard_regs[RC_INT][0] == 13 &&
+             f->opt_hard_regs[RC_INT][1] == 19,
+         "phys allocable order should be preserved");
+  EXPECT((f->opt_caller_saved[RC_INT] & (1u << 13)) != 0,
+         "caller-saved phys flag should be recorded");
+  EXPECT((f->opt_callee_saved[RC_INT] & (1u << 19)) != 0,
+         "callee-saved phys flag should be recorded");
+  EXPECT((f->opt_reserved_regs[RC_INT] & (1u << 29)) != 0,
+         "reserved phys flag should be recorded");
+  EXPECT((f->opt_arg_regs[RC_INT] & (1u << 13)) != 0,
+         "arg phys flag should be recorded");
+  EXPECT((f->opt_ret_regs[RC_INT] & (1u << 19)) != 0,
+         "ret phys flag should be recorded");
+  tc_fini(&tc);
+}
+
+static void opt_call_plan_drives_call_specific_preservation(void) {
+  TestCtx tc;
+  tc_init(&tc);
+  MockCGTarget mock;
+  mock_init(&mock, tc.c);
+  static const Reg pool[] = {13};
+  static const Reg scratch[] = {9, 10};
+  mock_set_pool(&mock, RC_INT, pool, 1, scratch, 2, 1u << 13);
+  mock.call_clobber_mask[RC_INT] = 0;
+
+  Func* f = new_func(&tc);
+  Val live = add_val(f, tc.i32);
+  emit_load_imm(f, f->entry, live, tc.i32, 11);
+  emit_call_void(f, f->entry);
+  emit_ret_val(f, f->entry, live, tc.i32);
+  opt_machinize(f, &mock.base);
+  opt_build_cfg(f);
+  opt_build_loop_tree(f);
+  opt_regalloc(f, 0);
+
+  EXPECT(mock.plan_call_count == 1, "opt_machinize should request call plan");
+  Block* b = &f->blocks[f->entry];
+  int saw_call_save_restore = 0;
+  for (u32 i = 1; i + 1 < b->ninsts; ++i) {
+    if ((IROp)b->insts[i].op == IR_CALL &&
+        (IROp)b->insts[i - 1u].op == IR_STORE &&
+        (IROp)b->insts[i + 1u].op == IR_LOAD) {
+      saw_call_save_restore = 1;
+    }
+  }
+  EXPECT(!saw_call_save_restore,
+         "call-specific non-clobbering plan should suppress save/restore");
+  tc_fini(&tc);
+}
+
 static void opt_liveness_branch(void) {
   TestCtx tc;
   tc_init(&tc);
@@ -2734,6 +2869,8 @@ static void simple_regalloc_reports_exact_used_regs(void) {
 }
 
 int main(void) {
+  opt_machinize_uses_phys_reg_metadata();
+  opt_call_plan_drives_call_specific_preservation();
   opt_cfg_prunes_unreachable();
   opt_cfg_preserves_scope_edges();
   opt_jump_cleanup_forwards_branch_targets();
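
Note on the `pass_lower.c` hunks above: the rewrite step now asks, for each call, whether that call's own clobber mask destroys the register holding a live value. It falls back to the function-wide caller-saved mask only when no valid plan exists. The following is a minimal standalone sketch of that decision, not the commit's code. The names `SketchCall`, `sketch_clobber_mask`, and `sketch_needs_save` are hypothetical stand-ins for `IRCallAux`, `call_clobber_mask_for`, and the check in `rewrite_call_save_one`, and the two-class limit is an assumption made for brevity.

```c
#include <stdint.h>

/* Hypothetical stand-in for OPT_REG_CLASSES; the real value is not shown here. */
enum { SKETCH_REG_CLASSES = 2 };

/* Hypothetical reduction of IRCallAux to the two fields the decision needs. */
typedef struct {
  int plan_valid;                            /* nonzero once a plan was computed */
  uint32_t clobber_mask[SKETCH_REG_CLASSES]; /* bit r set: call destroys reg r */
} SketchCall;

/* Prefer the per-call mask; otherwise fall back to the function-wide
 * caller-saved mask, as the diff does when plan_valid is 0. */
static uint32_t sketch_clobber_mask(const uint32_t *caller_saved,
                                    const SketchCall *call, unsigned cls) {
  if (cls >= SKETCH_REG_CLASSES) return 0;
  if (call && call->plan_valid) return call->clobber_mask[cls];
  return caller_saved[cls];
}

/* A value held in hard register `reg` needs a save/restore pair around the
 * call only if the effective mask has that register's bit set. Registers
 * numbered 32 or higher cannot be represented in the mask and are skipped. */
static int sketch_needs_save(const uint32_t *caller_saved,
                             const SketchCall *call, unsigned cls,
                             unsigned reg) {
  if (cls >= SKETCH_REG_CLASSES || reg >= 32u) return 0;
  return (int)((sketch_clobber_mask(caller_saved, call, cls) >> reg) & 1u);
}
```

This mirrors the scenario in `opt_call_plan_drives_call_specific_preservation`: register 13 is caller-saved, but a valid plan with an empty clobber mask means no save/restore is emitted around the call, while a call without a valid plan still gets the conservative treatment.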
