Move O1 tail calls onto planned replay - kit

commit f753db1d7995abbd515ba6c79f7a22ea8d97de1f
parent af14511ce25c7bb424bcf5465180c0716ebecb73
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Sat, 16 May 2026 09:33:45 -0700

Move O1 tail calls onto planned replay

Diffstat:
M doc/OPT_REGS_CALL_PLAN.md  | 62 +++++++++++++++++++++++++++++---------------------------------
M src/arch/aa64/opt_coord.c  | 4 +++-
M src/arch/arch.h  | 2 ++
M src/arch/rv64/opt_coord.c  | 4 +++-
M src/arch/x64/opt_coord.c  | 4 +++-
M src/opt/opt.c  | 28 +++++-----------------------
M src/opt/pass_lower.c  | 18 ------------------
M test/opt/opt_test.c  | 45 +++++++++++++++++++++++++++++++++++++++++++--
A test/toy/cases/24_tail_arg_permute.expected  | 1 +
A test/toy/cases/24_tail_arg_permute.toy  | 11 +++++++++++
A test/toy/cases/25_tail_many_stack_args.expected  | 1 +
A test/toy/cases/25_tail_many_stack_args.toy  | 14 ++++++++++++++
A test/toy/cases/26_tail_live_pressure.expected  | 1 +
A test/toy/cases/26_tail_live_pressure.toy  | 21 +++++++++++++++++++++
A test/toy/cases/27_tail_mixed_int_fp.expected  | 1 +
A test/toy/cases/27_tail_mixed_int_fp.toy  | 11 +++++++++++
A test/toy/cases/28_tail_chain.expected  | 1 +
A test/toy/cases/28_tail_chain.toy  | 15 +++++++++++++++
A test/toy/cases/29_tail_cross_arch_stack.expected  | 1 +
A test/toy/cases/29_tail_cross_arch_stack.toy  | 14 ++++++++++++++
A test/toy/cases/30_tail_indirect_wanted.expected  | 1 +
A test/toy/cases/30_tail_indirect_wanted.toy  | 11 +++++++++++

22 files changed, 192 insertions(+), 79 deletions(-)
diff --git a/doc/OPT_REGS_CALL_PLAN.md b/doc/OPT_REGS_CALL_PLAN.md
@@ -23,11 +23,12 @@ allocation scoring, and preserves hard-assigned live-across-call values by
 intersecting the assigned register with the planned call's clobber mask.
 Post-RA hard-register liveness uses the same call-specific clobber mask.
 
-For supported non-tail call plans, O1 now replays calls by materializing
+For supported call plans, O1 now replays calls by materializing
 arguments with a local parallel-copy resolver, invoking backend stack-argument
-and branch-only call-plan hooks, and extracting returns from fixed return
-registers. Address-valued call moves cover byval/indirect arguments and hidden
-sret destination pointers. The x64, AArch64, and RV64 backends implement
+and branch-only call-plan hooks, and extracting non-tail returns from fixed
+return registers. Address-valued call moves cover byval/indirect arguments and
+hidden sret destination pointers. Tail calls use the same setup and planned
+branch path, with no return extraction. The x64, AArch64, and RV64 backends implement
 `store_call_arg` for outgoing stack slots and `emit_call_plan` for the call
 branch.
 
@@ -46,31 +47,28 @@ What remains open:
 
 - call setup/return extraction are represented by call-plan aux data rather
   than separate first-class IR ops;
-- tail calls still fall back to the legacy backend `call` hook;
 - target `get_phys_regs` tables expose broader O1 pools, and incoming
   parameter functions can now allocate ABI argument/return registers with
-  opt-side constraints for sequential parameter-copy hazards; legacy call
-  fallback, including tail calls, still suppresses ABI argument/return registers
-  until tail-call setup is opt-visible;
+  opt-side constraints for sequential parameter-copy hazards;
 - direct CG still uses legacy allocation/call hooks;
 - code-shape probes remain to be added.
 
 In phase terms: Phase 1 and Phase 2 are done, Phase 3 is implemented through
 call-plan aux visibility plus planned replay for supported call shapes, Phase 4
-is implemented for register, stack, sret, and return moves with tail fallback,
+is implemented for register, stack, sret, tail-call, and return moves,
 Phase 5 is implemented for call setup/replay, and Phase 6 remains open.
 
 ## Planned Call Replay Boundary
 
-The legacy backend `call` hook remains intentionally active as a correctness
-fallback. The planned replay path currently covers the call shapes needed to
-prove ABI register hazards without moving every ABI corner case at once.
+The legacy backend `call` hook is no longer used by O1 replay. Calls that reach
+optimized replay must have a supported plan; unsupported planned shapes fail
+diagnostically instead of falling back to sequential backend lowering. Direct CG
+continues to use the legacy `call` hook while it is migrated separately.
 
 Planned replay is used only when all of the following are true:
 
 - the call has a valid `CGCallPlan`;
 - the backend provides `emit_call_plan`;
-- the call is not a tail call;
 - every stack argument destination has backend `store_call_arg` support;
 - every offset/address-valued argument source has backend `load_call_arg`
   support;
@@ -87,19 +85,18 @@ For those calls, O1 owns the setup and extraction sequence:
   target-provided scratch register first;
 - the backend emits only required call metadata and the branch through
   `emit_call_plan`;
-- return registers are copied or stored into their planned destinations.
+- non-tail return registers are copied or stored into their planned destinations;
+- tail calls stop after the planned branch and have no return extraction.
 
-The fallback path is still required for:
+The legacy `call` path is still required for:
 
-- **tail calls**: the legacy hook owns epilogue emission, legality checks, and
-  branch-without-continuation behavior;
 - **direct CG**: direct codegen still uses the old backend allocation and call
   hooks while O1 migrates first.
 
 This boundary lets Phase 3/4 tests exercise register argument permutation,
 outgoing stack arguments, sret hidden pointers, indirect-callee clobber hazards,
 call-specific clobber preservation, and return extraction without broadening
-the register file across still-legacy tail-call lowering.
+the register file across legacy call lowering.
 
 ## Current Problem
 
@@ -443,13 +440,13 @@ allocator starts using those registers widely.
 
 ### Phase 4 - Parallel Copy Resolver
 
-Status: implemented for non-tail call plans. O1 replay uses a local
+Status: implemented. O1 replay uses a local
 parallel-copy resolver for planned call setup and return extraction, including
 register-register cycles, local/indirect loads, address-valued moves,
 immediates, globals, register and outgoing stack destinations, local/indirect
 return destinations, and indirect callees that occupy a destination argument
-register. Tail-call plans continue to use the legacy backend `call` fallback
-until epilogue transfer is represented in the target contract.
+register. Tail-call plans use the same setup and planned branch path, then skip
+return extraction.
 
 - done: implement local parallel move resolution for register call setup and
   return extraction;
@@ -461,8 +458,8 @@ until epilogue transfer is represented in the target contract.
 - done: add red-green tests for argument permutation cycles, indirect callees in
   argument registers, stack-argument replay, and address-valued args;
 - done: support `CG_CALL_PLAN_STACK` materialization directly in opt;
-- still open: add return-register collision and stack-source hazard tests once
-  stack materialization is explicit.
+- done: add return-register collision, stack-source hazard, and tail-call replay
+  tests.
 
 Expected result: ABI arg and return registers can be made allocable safely.
 
@@ -471,12 +468,11 @@ Expected result: ABI arg and return registers can be made allocable safely.
 Status: implemented for call setup and incoming scalar parameter setup. O1 has
 target-informed scoring and per-call preservation, and the native target
 phys-reg tables now expose broader O1 pools. Known backend helper scratch
-registers remain hidden. ABI arg/return registers are available when all calls
-in a function use planned replay. Incoming parameter functions keep those
+registers remain hidden. ABI arg/return registers are available to O1. Incoming
+parameter functions keep those
 registers allocable, with opt forbidding earlier parameter values from being
 assigned to later incoming ABI registers that the backend still copies
-sequentially. Legacy tail-call fallback still suppresses ABI registers until
-tail-call setup is replay-visible.
+sequentially.
 
 - done: expand target `get_phys_regs` tables with guarded caller-saved and ABI
   registers for x64, AArch64, and RV64;
@@ -487,8 +483,8 @@ tail-call setup is replay-visible.
 - done: remove call-driven ABI-reg suppression for stack and sret call plans;
 - done: remove incoming-parameter ABI-reg suppression by modeling parameter
   incoming-register clobber hazards in opt allocation constraints;
-- still open: remove the legacy tail-call fallback ABI-reg suppression after
-  tail-call setup is opt-visible;
+- done: remove the legacy tail-call fallback ABI-reg suppression by replaying
+  tail-call setup through call plans;
 - Add code-shape tests for direct-call tiny functions and unused-param functions
   across x64, AArch64, and RV64.
 
@@ -573,7 +569,7 @@ Completed:
 2. Teach `opt_machinize` to consume the new metadata.
 3. Add `CGCallPlan` and plan calls without using it for emission.
 4. Use call-plan clobber masks for rewrite and post-RA hard-register liveness.
-5. Replay non-tail call plans in opt, including ABI register setup, outgoing
+5. Replay call plans in opt, including ABI register setup, outgoing
    stack arguments, address-valued byval/indirect/sret moves, and return
    extraction.
 6. Remove call-driven ABI-reg suppression for stack-argument and sret-shaped
@@ -584,12 +580,12 @@ Completed:
    sources.
 9. Remove incoming-parameter ABI-reg suppression with opt-side constraints for
    incoming parameter copy hazards.
+10. Replay tail-call plans in opt and remove O1's legacy backend `call`
+    fallback.
 
 Next patch stack:
 
-1. Continue broadening register exposure by removing the remaining tail-call
-   ABI-reg guard when tail-call setup becomes opt-visible.
-2. Migrate direct CG or wrap it with internal call planning, then remove legacy
+1. Migrate direct CG or wrap it with internal call planning, then remove legacy
    pool semantics.
 
 This order keeps each step testable and avoids mixing API migration, allocation
diff --git a/src/arch/aa64/opt_coord.c b/src/arch/aa64/opt_coord.c
@@ -205,6 +205,7 @@ static u32 aa_return_reg_mask(CGTarget* t, const ABIFuncInfo* abi,
 static void aa_plan_call(CGTarget* t, const CGCallDesc* d, CGCallPlan* out) {
   memset(out, 0, sizeof *out);
   out->callee = d->callee;
+  out->flags = d->flags;
   out->stack_arg_size = t->call_stack_size ? t->call_stack_size(t, d) : 0;
   out->has_sret = d->abi && d->abi->has_sret;
   out->is_variadic = d->abi && d->abi->variadic;
@@ -293,7 +294,8 @@ static void aa_plan_call(CGTarget* t, const CGCallDesc* d, CGCallPlan* out) {
       }
     }
   }
-  if (d->abi && d->abi->ret.kind != ABI_ARG_IGNORE &&
+  if ((d->flags & CG_CALL_TAIL) == 0 &&
+      d->abi && d->abi->ret.kind != ABI_ARG_IGNORE &&
       d->abi->ret.kind != ABI_ARG_INDIRECT) {
     u32 ni = 0, nf = 0;
     for (u16 i = 0; i < d->abi->ret.nparts; ++i) {
diff --git a/src/arch/arch.h b/src/arch/arch.h
@@ -463,6 +463,8 @@ typedef struct CGCallPlan {
   u8 is_variadic;
   u8 has_sret;
   u8 pad;
+  u16 flags; /* CGCallFlag */
+  u16 pad2;
 } CGCallPlan;
 
 typedef u32 Label;
diff --git a/src/arch/rv64/opt_coord.c b/src/arch/rv64/opt_coord.c
@@ -189,6 +189,7 @@ static u32 rv_return_reg_mask(CGTarget* t, const ABIFuncInfo* abi,
 static void rv_plan_call(CGTarget* t, const CGCallDesc* d, CGCallPlan* out) {
   memset(out, 0, sizeof *out);
   out->callee = d->callee;
+  out->flags = d->flags;
   out->stack_arg_size = t->call_stack_size ? t->call_stack_size(t, d) : 0;
   out->has_sret = d->abi && d->abi->has_sret;
   out->is_variadic = d->abi && d->abi->variadic;
@@ -276,7 +277,8 @@ static void rv_plan_call(CGTarget* t, const CGCallDesc* d, CGCallPlan* out) {
       }
     }
   }
-  if (d->abi && d->abi->ret.kind != ABI_ARG_IGNORE &&
+  if ((d->flags & CG_CALL_TAIL) == 0 &&
+      d->abi && d->abi->ret.kind != ABI_ARG_IGNORE &&
       d->abi->ret.kind != ABI_ARG_INDIRECT) {
     u32 ni = 0, nf = 0;
     for (u16 i = 0; i < d->abi->ret.nparts; ++i) {
diff --git a/src/arch/x64/opt_coord.c b/src/arch/x64/opt_coord.c
@@ -177,6 +177,7 @@ static u32 x_return_reg_mask(CGTarget* t, const ABIFuncInfo* abi,
 static void x_plan_call(CGTarget* t, const CGCallDesc* d, CGCallPlan* out) {
   memset(out, 0, sizeof *out);
   out->callee = d->callee;
+  out->flags = d->flags;
   out->stack_arg_size = t->call_stack_size ? t->call_stack_size(t, d) : 0;
   out->has_sret = d->abi && d->abi->has_sret;
   out->is_variadic = d->abi && d->abi->variadic;
@@ -266,7 +267,8 @@ static void x_plan_call(CGTarget* t, const CGCallDesc* d, CGCallPlan* out) {
     }
   }
   out->variadic_fp_count = (u8)next_fp;
-  if (d->abi && d->abi->ret.kind != ABI_ARG_IGNORE &&
+  if ((d->flags & CG_CALL_TAIL) == 0 &&
+      d->abi && d->abi->ret.kind != ABI_ARG_IGNORE &&
       d->abi->ret.kind != ABI_ARG_INDIRECT) {
     u32 ni = 0, nf = 0;
     static const u32 rregs[2] = {X64_RAX, X64_RDX};
diff --git a/src/opt/opt.c b/src/opt/opt.c
@@ -1555,6 +1555,8 @@ static void replay_planned_call(ReplayCtx* r, const IRCallAux* aux) {
   replay_parallel_moves(r, arg_moves, nargs);
   r->tgt->emit_call_plan(r->tgt, &plan);
 
+  if (plan.flags & CG_CALL_TAIL) return;
+
   ReplayParallelMove* ret_moves =
       src_plan->nrets ? arena_zarray(r->f->arena, ReplayParallelMove,
                                      src_plan->nrets)
@@ -1723,33 +1725,13 @@ static void replay_inst(ReplayCtx* r, u32 b, Inst* in) {
     }
     case IR_CALL: {
       IRCallAux* aux = (IRCallAux*)in->extra.aux;
-      if (aux && aux->use_plan_replay && (aux->desc.flags & CG_CALL_TAIL) == 0 &&
-          w->emit_call_plan &&
+      if (aux && aux->use_plan_replay && w->emit_call_plan &&
           replay_plan_supported(w, &aux->plan)) {
         replay_planned_call(r, aux);
         break;
       }
-      CGCallDesc cd = aux->desc;
-      cd.callee = xlat_op(r, cd.callee);
-      CGABIValue* args = NULL;
-      if (cd.nargs) {
-        args = arena_array(r->f->arena, CGABIValue, cd.nargs);
-        for (u32 k = 0; k < cd.nargs; ++k) {
-          CGABIPart* parts = aux->desc.args[k].nparts
-                                 ? arena_array(r->f->arena, CGABIPart,
-                                               aux->desc.args[k].nparts)
-                                 : NULL;
-          args[k] = xlat_abivalue(r, &aux->desc.args[k], parts);
-        }
-        cd.args = args;
-      } else {
-        cd.args = NULL;
-      }
-      CGABIPart* ret_parts =
-          cd.ret.nparts ? arena_array(r->f->arena, CGABIPart, cd.ret.nparts)
-                        : NULL;
-      cd.ret = xlat_abivalue(r, &aux->desc.ret, ret_parts);
-      w->call(w, &cd);
+      compiler_panic(r->c, in->loc,
+                     "opt replay: call has no supported call plan");
       break;
     }
     case IR_BR: {
diff --git a/src/opt/pass_lower.c b/src/opt/pass_lower.c
@@ -439,19 +439,6 @@ static void asm_prepare_constraints(Func* f, CGTarget* target, IRAsmAux* aux) {
 static int call_plan_replay_supported(const IRCallAux* aux,
                                       const CGTarget* target);
 
-static int func_has_legacy_call_fallback(Func* f) {
-  for (u32 b = 0; b < f->nblocks; ++b) {
-    Block* bl = &f->blocks[b];
-    for (u32 i = 0; i < bl->ninsts; ++i) {
-      Inst* in = &bl->insts[i];
-      if ((IROp)in->op != IR_CALL) continue;
-      IRCallAux* aux = (IRCallAux*)in->extra.aux;
-      if (!aux || !aux->use_plan_replay) return 1;
-    }
-  }
-  return 0;
-}
-
 void opt_machinize(Func* f, CGTarget* target) {
   f->opt_target = target->c->target;
   f->opt_has_target = 1;
@@ -483,8 +470,6 @@ void opt_machinize(Func* f, CGTarget* target) {
     }
   }
 
-  int suppress_abi_regs = func_has_legacy_call_fallback(f);
-
   for (u32 c = 0; c < OPT_REG_CLASSES; ++c) {
     const CGPhysRegInfo* phys = NULL;
     u32 nphys = 0;
@@ -503,8 +488,6 @@ void opt_machinize(Func* f, CGTarget* target) {
         }
         f->opt_phys_regs[c][f->opt_phys_reg_count[c]++] = pi;
         if ((pi.flags & CG_REG_ALLOCABLE) &&
-            (!suppress_abi_regs ||
-             (pi.flags & (CG_REG_ARG | CG_REG_RET)) == 0) &&
             !(pi.flags & CG_REG_RESERVED)) {
           f->opt_hard_regs[c][f->opt_hard_reg_count[c]++] = hr;
         }
@@ -572,7 +555,6 @@ static u32 call_clobber_mask_for(Func* f, const Inst* in, u8 cls) {
 static int call_plan_replay_supported(const IRCallAux* aux,
                                       const CGTarget* target) {
   if (!aux || !aux->plan_valid || !target || !target->emit_call_plan) return 0;
-  if (aux->desc.flags & CG_CALL_TAIL) return 0;
   for (u32 i = 0; i < aux->plan.nargs; ++i) {
     if (aux->plan.args[i].dst_kind == CG_CALL_PLAN_STACK &&
         !target->store_call_arg)
diff --git a/test/opt/opt_test.c b/test/opt/opt_test.c
@@ -517,6 +517,7 @@ static void mock_plan_call(CGTarget* t, const CGCallDesc* d,
   MockCGTarget* m = (MockCGTarget*)t;
   memset(out, 0, sizeof *out);
   out->callee = d->callee;
+  out->flags = d->flags;
   for (u32 c = 0; c < OPT_REG_CLASSES; ++c)
     out->clobber_mask[c] = m->call_clobber_mask[c];
   u32 nargs = m->planned_nargs ? m->planned_nargs : d->nargs;
@@ -948,7 +949,7 @@ static void opt_machinize_uses_phys_reg_metadata(void) {
   tc_fini(&tc);
 }
 
-static void opt_machinize_filters_abi_regs_for_legacy_call_fallback(void) {
+static void opt_machinize_keeps_abi_regs_without_legacy_call_fallback(void) {
   TestCtx tc;
   tc_init(&tc);
   MockCGTarget mock;
@@ -3249,6 +3250,45 @@ static void opt_planned_call_replay_stores_stack_sources_before_clobber(void) {
   tc_fini(&tc);
 }
 
+static void opt_planned_tail_call_uses_replay_without_return_moves(void) {
+  TestCtx tc;
+  tc_init(&tc);
+  MockCGTarget mock;
+  mock_init(&mock, tc.c);
+
+  Func* f = new_func(&tc);
+  Inst* in = ir_emit(f, f->entry, IR_CALL);
+  IRCallAux* aux = arena_znew(f->arena, IRCallAux);
+  in->extra.aux = aux;
+  aux->desc.flags = CG_CALL_TAIL;
+  aux->plan_valid = 1;
+  aux->use_plan_replay = 1;
+  aux->plan.flags = CG_CALL_TAIL;
+  aux->plan.callee = op_reg_(8, tc.i64);
+  aux->plan.args = arena_zarray(f->arena, CGCallPlanMove, 1);
+  aux->plan.nargs = 1;
+  aux->plan.args[0].src = op_reg_(2, tc.i64);
+  aux->plan.args[0].dst_kind = CG_CALL_PLAN_REG;
+  aux->plan.args[0].cls = RC_INT;
+  aux->plan.args[0].dst_reg = 1;
+  aux->plan.args[0].mem = mem_unknown_(tc.i64, 8);
+  aux->plan.rets = arena_zarray(f->arena, CGCallPlanRet, 1);
+  aux->plan.nrets = 1;
+  aux->plan.rets[0].dst = op_reg_(3, tc.i64);
+  aux->plan.rets[0].cls = RC_INT;
+  aux->plan.rets[0].src_reg = 1;
+  aux->plan.rets[0].mem = mem_unknown_(tc.i64, 8);
+
+  opt_emit(tc.c, f, &mock.base);
+
+  EXPECT(mock.emit_call_plan_calls == 1,
+         "tail call should use planned emit path");
+  EXPECT(mock.call_calls == 0, "tail call should not use legacy call fallback");
+  EXPECT(mock.copy_calls == 1,
+         "tail call should materialize args but skip return extraction");
+  tc_fini(&tc);
+}
+
 static void opt_emit_preserves_physical_reg_zero(void) {
   TestCtx tc;
   tc_init(&tc);
@@ -3618,7 +3658,7 @@ static void simple_regalloc_reports_exact_used_regs(void) {
 
 int main(void) {
   opt_machinize_uses_phys_reg_metadata();
-  opt_machinize_filters_abi_regs_for_legacy_call_fallback();
+  opt_machinize_keeps_abi_regs_without_legacy_call_fallback();
   opt_machinize_keeps_abi_regs_for_incoming_params();
   real_arch_call_plan_layouts();
   opt_regalloc_prefers_caller_saved_for_non_call_value();
@@ -3666,6 +3706,7 @@ int main(void) {
   opt_planned_call_replay_materializes_address_args();
   opt_planned_call_replay_resolves_return_reg_collision();
   opt_planned_call_replay_stores_stack_sources_before_clobber();
+  opt_planned_tail_call_uses_replay_without_return_moves();
   opt_emit_preserves_physical_reg_zero();
   opt_emit_no_virtual_alloc();
   opt_records_const_bytes_by_value();
diff --git a/test/toy/cases/24_tail_arg_permute.expected b/test/toy/cases/24_tail_arg_permute.expected
@@ -0,0 +1 @@
+33
diff --git a/test/toy/cases/24_tail_arg_permute.toy b/test/toy/cases/24_tail_arg_permute.toy
@@ -0,0 +1,11 @@
+fn target(a: int, b: int, c: int): int {
+  return a + b * 2 + c * 4;
+}
+
+fn caller(x: int, y: int, z: int): int {
+  return tail target(z, x, y);
+}
+
+fn main(): int {
+  return caller(3, 5, 7);
+}
diff --git a/test/toy/cases/25_tail_many_stack_args.expected b/test/toy/cases/25_tail_many_stack_args.expected
@@ -0,0 +1 @@
+220
diff --git a/test/toy/cases/25_tail_many_stack_args.toy b/test/toy/cases/25_tail_many_stack_args.toy
@@ -0,0 +1,14 @@
+fn target(a: int, b: int, c: int, d: int, e: int,
+          f: int, g: int, h: int, i: int, j: int): int {
+  return a + b * 2 + c * 3 + d * 4 + e * 5 +
+         f * 6 + g * 7 + h * 8 + i * 9 + j * 10;
+}
+
+fn caller(a: int, b: int, c: int, d: int, e: int,
+          f: int, g: int, h: int, i: int, j: int): int {
+  return tail target(j, i, h, g, f, e, d, c, b, a);
+}
+
+fn main(): int {
+  return caller(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
+}
diff --git a/test/toy/cases/26_tail_live_pressure.expected b/test/toy/cases/26_tail_live_pressure.expected
@@ -0,0 +1 @@
+170
diff --git a/test/toy/cases/26_tail_live_pressure.toy b/test/toy/cases/26_tail_live_pressure.toy
@@ -0,0 +1,21 @@
+fn sink(a: int, b: int, c: int, d: int,
+        e: int, f: int, g: int, h: int): int {
+  return a + b * 2 + c * 3 + d * 4 +
+         e * 5 + f * 6 + g * 7 + h * 8;
+}
+
+fn pressure(x: int): int {
+  let a: int = x + 1;
+  let b: int = x + 2;
+  let c: int = x + 3;
+  let d: int = x + 4;
+  let e: int = x + 5;
+  let f: int = x + 6;
+  let g: int = x + 7;
+  let h: int = x + 8;
+  return tail sink(h, f, d, b, g, e, c, a);
+}
+
+fn main(): int {
+  return pressure(1);
+}
diff --git a/test/toy/cases/27_tail_mixed_int_fp.expected b/test/toy/cases/27_tail_mixed_int_fp.expected
@@ -0,0 +1 @@
+47
diff --git a/test/toy/cases/27_tail_mixed_int_fp.toy b/test/toy/cases/27_tail_mixed_int_fp.toy
@@ -0,0 +1,11 @@
+fn target(a: int, b: f64, c: int, d: f64): int {
+  return a + c + (b as int) * 2 + (d as int) * 3;
+}
+
+fn caller(x: int, y: f64, z: int, w: f64): int {
+  return tail target(z, w, x, y);
+}
+
+fn main(): int {
+  return caller(3, 5.0, 7, 11.0);
+}
diff --git a/test/toy/cases/28_tail_chain.expected b/test/toy/cases/28_tail_chain.expected
@@ -0,0 +1 @@
+27
diff --git a/test/toy/cases/28_tail_chain.toy b/test/toy/cases/28_tail_chain.toy
@@ -0,0 +1,15 @@
+fn h(a: int, b: int, c: int): int {
+  return a * 4 + b * 2 + c;
+}
+
+fn g(a: int, b: int, c: int): int {
+  return tail h(c, a, b);
+}
+
+fn f(a: int, b: int, c: int): int {
+  return tail g(b, c, a);
+}
+
+fn main(): int {
+  return f(2, 5, 9);
+}
diff --git a/test/toy/cases/29_tail_cross_arch_stack.expected b/test/toy/cases/29_tail_cross_arch_stack.expected
@@ -0,0 +1 @@
+108
diff --git a/test/toy/cases/29_tail_cross_arch_stack.toy b/test/toy/cases/29_tail_cross_arch_stack.toy
@@ -0,0 +1,14 @@
+fn target(a: int, b: int, c: int, d: int, e: int, f: int,
+          g: int, h: int, i: int, j: int, k: int, l: int): int {
+  return a + b * 2 + c * 3 + d * 4 + e * 5 + f * 6 +
+         g * 7 + h * 8 + i * 9 + j * 10 + k * 11 + l * 12 - 256;
+}
+
+fn caller(a: int, b: int, c: int, d: int, e: int, f: int,
+          g: int, h: int, i: int, j: int, k: int, l: int): int {
+  return tail target(l, k, j, i, h, g, f, e, d, c, b, a);
+}
+
+fn main(): int {
+  return caller(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12);
+}
diff --git a/test/toy/cases/30_tail_indirect_wanted.expected b/test/toy/cases/30_tail_indirect_wanted.expected
@@ -0,0 +1 @@
+42
diff --git a/test/toy/cases/30_tail_indirect_wanted.toy b/test/toy/cases/30_tail_indirect_wanted.toy
@@ -0,0 +1,11 @@
+fn add1(x: int): int {
+  return x + 1;
+}
+
+fn apply(fp: *fn(int): int, x: int): int {
+  return tail fp(x);
+}
+
+fn main(): int {
+  return apply(add1, 41);
+}
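
The documentation hunk above says O1 replay materializes arguments with a local parallel-copy resolver that handles register-register cycles. The following is a minimal sketch of that general technique, not kit's actual resolver: `Move`, `resolve_parallel_moves`, `SCRATCH`, and the integer array standing in for a register file are all hypothetical names chosen for illustration, and destinations are assumed to be unique.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: a tiny register file with one reserved scratch slot. */
enum { NREGS = 8, SCRATCH = NREGS - 1, MAX_MOVES = 8 };

typedef struct { int dst, src; } Move; /* regs[dst] <- regs[src] */

/* Returns 1 if another pending move still needs to read register `reg`. */
static int is_pending_source(const Move *m, const int *done, size_t n,
                             size_t self, int reg) {
  for (size_t j = 0; j < n; ++j)
    if (j != self && !done[j] && m[j].src == reg) return 1;
  return 0;
}

/* Performs all moves as if they happened simultaneously.
 * Returns the number of register copies that were emitted. */
static int resolve_parallel_moves(int *regs, const Move *in, size_t n) {
  assert(n <= MAX_MOVES);
  Move m[MAX_MOVES];
  int done[MAX_MOVES] = {0};
  int copies = 0;
  size_t left = 0;
  for (size_t i = 0; i < n; ++i) {
    m[i] = in[i];
    assert(m[i].dst != SCRATCH && m[i].src != SCRATCH);
    if (m[i].src == m[i].dst) done[i] = 1; /* self-move needs no copy */
    else ++left;
  }
  while (left) {
    int progressed = 0;
    /* Emit every move whose destination nobody else still has to read. */
    for (size_t i = 0; i < n; ++i) {
      if (done[i] || is_pending_source(m, done, n, i, m[i].dst)) continue;
      regs[m[i].dst] = regs[m[i].src];
      ++copies;
      done[i] = 1;
      --left;
      progressed = 1;
    }
    if (progressed || !left) continue;
    /* Only cycles remain: save one destination in the scratch register and
     * redirect its readers there, which turns the cycle into a chain. */
    size_t k = 0;
    while (done[k]) ++k;
    regs[SCRATCH] = regs[m[k].dst];
    ++copies;
    for (size_t j = 0; j < n; ++j)
      if (!done[j] && m[j].src == m[k].dst) m[j].src = SCRATCH;
  }
  return copies;
}
```

As a usage example, the rotation in `24_tail_arg_permute.toy` (`target(z, x, y)` with `x, y, z` held in registers 0, 1, 2 of this model) corresponds to the moves `{0, 2}, {1, 0}, {2, 1}`. Starting from `{3, 5, 7}`, the sketch leaves `{7, 3, 5}` in the first three registers using four copies (one into the scratch slot plus three moves), and `7 + 3 * 2 + 5 * 4` gives the `33` recorded in the `.expected` file.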

	git clone https://git.ryansepassi.com/git/kit.git