Implement O2 allocation coalescing and splitting - kit

commit 3a249786f7441806f9ba1eb92ce1dd40ba07da04
parent b998131d3551a570801f5028d6f55683a4cdac69
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Thu, 21 May 2026 14:12:41 -0700

Implement O2 allocation coalescing and splitting

Diffstat:
M doc/OPT.md  | 63 +++++++++++++++++++++++++++++++++++++++++++++------------------
M src/opt/ir.h  | 27 +++++++++++++++++++++++++++
M src/opt/opt.h  | 2 ++
M src/opt/opt_internal.h  | 1 +
A src/opt/pass_coalesce.c  | 292 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M src/opt/pass_live.c  | 2 ++
M src/opt/pass_lower.c  | 426 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------
M test/opt/opt_test.c  | 180 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A test/toy/cases/129_o2_pressure_coalesce.expected  | 1 +
A test/toy/cases/129_o2_pressure_coalesce.toy  | 32 ++++++++++++++++++++++++++++++++

10 files changed, 941 insertions(+), 85 deletions(-)
diff --git a/doc/OPT.md b/doc/OPT.md
@@ -471,6 +471,8 @@ The optimizer is no longer just a stub:
 - `src/opt/pass_ssa.c` implements pseudo-register SSA construction, mem2reg
   SSA construction, conventional SSA edge-copy lowering, SSA undo, and stable
   SSA dumps for pass tests.
+- `src/opt/pass_coalesce.c` implements O2 move collection, move-related
+  conflict analysis, and conservative coalescing.
 - `src/opt/pass_analysis.c` owns shared CFG order, dominators, dominance
   frontiers, dominator children, def-use rebuilding, and verifier guardrails.
 - `cfree cc` forwards `-O0`/`-O1`/`-O2` into `CfreeCodeOptions`, so native
@@ -483,14 +485,26 @@ Current behavior:
 - Level `1` records IR, builds CFG, machinizes, builds loop frequencies,
   computes PReg liveness and live ranges, runs the current priority
   allocator/rewrite path over mutable PRegs, combines, runs post-RA DCE,
-  cleans jumps/layout, and emits through the wrapped target.
+  cleans jumps/layout, and emits through the wrapped target. This path keeps
+  whole-PReg allocation and does not run coalescing or splitting.
 - Level `2` has its own scheduler entry point. It now converts mutable
   pseudo-registers to SSA values, builds real mem2reg SSA, verifies phi/use-def
   invariants, runs the first small O2 transforms, lowers phis through
-  conventional SSA, undoes SSA, and then routes through the same proven
-  lowering path as `-O1`.
-- `opt_regalloc(..., allow_live_range_split)` accepts the O2 policy bit, but
-  live-range splitting is not implemented yet.
+  conventional SSA, undoes SSA, and then routes through the shared lowering
+  path with O2 allocation policy enabled.
+- `opt_regalloc(..., allow_live_range_split)` now uses the O2 policy bit to
+  enable move coalescing and singleton live-range splitting. Coalescing only
+  considers valid same-class, same-type `IR_COPY` pairs, weights moves by block
+  frequency, builds a bounded conflict matrix for move-related PRegs, and
+  rejects incompatible tied-register or forbidden-register constraints.
+- O2 allocation assigns coalesced groups as a unit. Every member receives the
+  same hard register or spill slot, so coalesced copies rewrite to hard-reg
+  self copies and are removed by post-RA combine/DCE.
+- O2 live-range splitting is intentionally conservative in this slice: only
+  singleton allocation groups with multiple live ranges can split. Split PRegs
+  get a canonical spill home plus per-range segment records. Hard segments
+  insert boundary reload/store traffic through the existing rewrite buffers;
+  coalesced groups, call-crossing values, and O1 allocations remain whole.
 - `opt_build_ssa` is no longer just a shape check: promoted loads/stores are
   rewritten to SSA values, phis have real `Val` defs and predecessor inputs,
   and def-use chains are rebuilt for downstream SSA passes.
@@ -627,28 +641,41 @@ Exit criteria:
 
 ### Phase C - Full Allocation Infrastructure
 
-Status: pending quality work. The current priority allocator/rewrite path is
-shared by O1 and O2 lowering and accepts the O2 live-range-splitting policy
-bit, but coalescing and splitting are not implemented yet. Several small O2
-SSA cleanup passes landed first; this phase remains the allocation-quality
-milestone before heavier value/memory optimization.
+Status: implemented as the first O2 allocation-quality slice. O1 keeps the
+existing whole-PReg priority allocator. O2 enables conservative move
+coalescing, coalesced-group allocation, and singleton live-range splitting
+without changing public `libcfree` APIs or adding CFG edge splitting.
 
 Goal: bring `-O2` allocation quality up to MIR's model before value passes
 start changing code aggressively.
 
 Implement:
 
-- Move collection.
-- Move-only liveness.
-- Conflict matrix for move-related regs.
-- Aggressive coalescing.
-- Live-range splitting and edge/block spill placement.
+- [x] Move collection.
+- [x] Bounded conflict matrix for move-related regs using existing live-range
+  overlap data.
+- [x] Conservative coalescing for same-class, same-type copies with compatible
+  fixed-register and forbidden-register constraints.
+- [x] Coalesced-group allocation before singleton allocation.
+- [x] O2-only split allocation representation with per-PReg segment lists and
+  canonical spill homes.
+- [x] Conservative singleton live-range splitting when whole hard allocation
+  fails.
+- [x] Boundary reload/store insertion for hard split segments through rewrite.
+- [ ] Critical-edge CFG mutation for spill placement. Current placement stays
+  at unambiguous block/range boundaries.
 
 Exit criteria:
 
-- `-O1` remains green.
-- `-O2` can use the same lowering path with coalescing/splitting enabled, even
+- [x] `-O1` remains green.
+- [x] `-O2` can use the same lowering path with coalescing/splitting enabled,
   while preserving the current small SSA cleanup passes.
+- [x] Focused `make test-opt` coverage for coalescing, conflict refusal,
+  incompatible constraints, O1 preservation, and singleton splitting.
+- [x] Toy pressure fixture:
+  `CFREE_TEST_FILTER=129_o2_pressure_coalesce CFREE_TOY_OPT_LEVELS="0 1 2"
+  make test-toy`
+- [x] Full `CFREE_TOY_OPT_LEVELS="0 1 2" make test-toy`
 
 ### Phase D - SSA-Backed Value and Memory Passes
 
@@ -786,7 +813,7 @@ src/opt/
   pass_machinize.c  target ABI and machine-form lowering
   pass_live.c       liveness, pressure, live ranges
   pass_lower.c      current assignment/rewrite implementation
-  pass_coalesce.c   move collection and coalescing (split out when implemented)
+  pass_coalesce.c   move collection and coalescing
   pass_regalloc.c   assignment, rewrite, splitting (split out when warranted)
   pass_combine.c    target-aware code selection
   pass_emit.c       drive wrapped CGTarget
diff --git a/src/opt/ir.h b/src/opt/ir.h
@@ -325,8 +325,24 @@ typedef enum OptAllocKind {
   OPT_ALLOC_NONE,
   OPT_ALLOC_HARD,
   OPT_ALLOC_SPILL,
+  OPT_ALLOC_SPLIT,
 } OptAllocKind;
 
+typedef struct OptAllocSegment {
+  u32 start;
+  u32 end;
+  u32 block;
+  u8 loc_kind;
+  u8 cls;
+  Reg hard_reg;
+  FrameSlot spill_slot;
+  FrameSlot spill_home;
+  u8 reload_at_start;
+  u8 store_at_end;
+  u8 pad[2];
+  u32 next;
+} OptAllocSegment;
+
 typedef struct OptPRegInfo {
   u32 first_pos;
   u32 last_pos;
@@ -435,8 +451,19 @@ typedef struct Func {
   u64 opt_rewrite_inserted_insts;
   u64 opt_rewrite_live_words_touched;
   u64 opt_dde_live_words_touched;
+  u64 opt_coalesce_moves_seen;
+  u64 opt_coalesce_candidates;
+  u64 opt_coalesce_conflicts;
+  u64 opt_coalesce_merge_attempts;
+  u64 opt_coalesce_merges;
   InstId next_inst_id;
   OptPRegInfo* preg_info; /* indexed by the current allocation reg namespace */
+  u32* opt_coalesce_parent;
+  u32* opt_coalesce_size;
+  OptAllocSegment* opt_alloc_segments;
+  u32 opt_nalloc_segments;
+  u32 opt_alloc_segments_cap;
+  u32* opt_first_segment_by_preg;
 
   OptUse* opt_uses;
   u32 opt_nuses, opt_uses_cap;
diff --git a/src/opt/opt.h b/src/opt/opt.h
@@ -107,6 +107,8 @@ typedef struct OptLiveRange {
   PReg preg;
   u32 start;
   u32 end;
+  u32 raw_start;
+  u32 raw_end;
   u32 next;
   u32 block;
   u8 whole_block;
diff --git a/src/opt/opt_internal.h b/src/opt/opt_internal.h
@@ -111,6 +111,7 @@ OptHardBlockLive* opt_maybe_build_hard_live(Func*);
 OptHardRegSet opt_hard_live_out_for_block(const OptHardBlockLive*);
 int opt_block_live_out_has_phys_reg(Func*, const OptHardBlockLive*, u32 block,
                                     const Operand*);
+void opt_coalesce_ranges(Func*, const OptLiveRangeSet*);
 
 void opt_replay(Compiler*, Func*, CGTarget* target);
 
diff --git a/src/opt/pass_coalesce.c b/src/opt/pass_coalesce.c
@@ -0,0 +1,292 @@
+#include <stdlib.h>
+#include <string.h>
+
+#include "core/arena.h"
+#include "core/metrics.h"
+#include "opt/opt_internal.h"
+
+typedef struct CoalesceMove {
+  PReg dst;
+  PReg src;
+  u32 weight;
+} CoalesceMove;
+
+typedef struct CoalesceCtx {
+  Func* f;
+  const OptLiveRangeSet* ranges;
+  PReg* related;
+  u32 nrelated;
+  u32 related_cap;
+  u32* related_index;
+  u64* conflicts;
+  u64* unit_conflicts;
+  u32 conflict_words;
+} CoalesceCtx;
+
+static PReg coalesce_find(Func* f, PReg v) {
+  if (!f->opt_coalesce_parent || v == PREG_NONE || v >= opt_reg_count(f))
+    return v;
+  PReg p = (PReg)f->opt_coalesce_parent[v];
+  while (p != f->opt_coalesce_parent[p]) p = (PReg)f->opt_coalesce_parent[p];
+  while (v != p) {
+    PReg n = (PReg)f->opt_coalesce_parent[v];
+    f->opt_coalesce_parent[v] = p;
+    v = n;
+  }
+  return p;
+}
+
+static void coalesce_add_related(CoalesceCtx* c, PReg v) {
+  Func* f = c->f;
+  if (!opt_reg_valid(f, v)) return;
+  if (c->related_index[v] != OPT_RANGE_NONE) return;
+  if (c->nrelated == c->related_cap) {
+    u32 ncap = c->related_cap ? c->related_cap * 2u : 32u;
+    PReg* nr = arena_array(f->arena, PReg, ncap);
+    if (c->related) memcpy(nr, c->related, sizeof(c->related[0]) * c->nrelated);
+    c->related = nr;
+    c->related_cap = ncap;
+  }
+  c->related_index[v] = c->nrelated;
+  c->related[c->nrelated++] = v;
+}
+
+static int ranges_overlap_kind(const OptLiveRangeSet* ranges, PReg a, PReg b) {
+  int unit_only = 0;
+  for (u32 ar = ranges->first_range_by_preg[a]; ar != OPT_RANGE_NONE;
+       ar = ranges->ranges[ar].next) {
+    const OptLiveRange* ra = &ranges->ranges[ar];
+    for (u32 br = ranges->first_range_by_preg[b]; br != OPT_RANGE_NONE;
+         br = ranges->ranges[br].next) {
+      const OptLiveRange* rb = &ranges->ranges[br];
+      if (ra->start < rb->end && rb->start < ra->end) {
+        u32 start = ra->start > rb->start ? ra->start : rb->start;
+        u32 end = ra->end < rb->end ? ra->end : rb->end;
+        if (end > start + 1u) return 2;
+        unit_only = 1;
+      }
+    }
+  }
+  return unit_only ? 1 : 0;
+}
+
+static void coalesce_set_conflict_bit(u64* bits, u32 nrelated, u32 a, u32 b) {
+  if (!bits) return;
+  if (a == b) return;
+  u32 x = a < b ? a : b;
+  u32 y = a < b ? b : a;
+  u32 bit = x * nrelated + y;
+  bits[bit / 64u] |= 1ull << (bit % 64u);
+}
+
+static void coalesce_set_conflict(CoalesceCtx* c, u32 a, u32 b, int unit) {
+  coalesce_set_conflict_bit(c->conflicts, c->nrelated, a, b);
+  if (unit) coalesce_set_conflict_bit(c->unit_conflicts, c->nrelated, a, b);
+}
+
+static int coalesce_has_bit(const CoalesceCtx* c, const u64* bits, PReg a,
+                            PReg b) {
+  if (!bits || a >= opt_reg_count(c->f) || b >= opt_reg_count(c->f)) return 1;
+  u32 ai = c->related_index[a];
+  u32 bi = c->related_index[b];
+  if (ai == OPT_RANGE_NONE || bi == OPT_RANGE_NONE) return 0;
+  if (ai == bi) return 0;
+  u32 x = ai < bi ? ai : bi;
+  u32 y = ai < bi ? bi : ai;
+  u32 bit = x * c->nrelated + y;
+  return (bits[bit / 64u] & (1ull << (bit % 64u))) != 0;
+}
+
+static int coalesce_has_conflict(const CoalesceCtx* c, PReg a, PReg b) {
+  return coalesce_has_bit(c, c->conflicts, a, b);
+}
+
+static int coalesce_has_unit_conflict(const CoalesceCtx* c, PReg a, PReg b) {
+  return coalesce_has_bit(c, c->unit_conflicts, a, b);
+}
+
+static int group_conflicts(const CoalesceCtx* c, PReg ra, PReg rb, PReg allow_a,
+                           PReg allow_b) {
+  for (u32 i = 0; i < c->nrelated; ++i) {
+    PReg a = c->related[i];
+    if (coalesce_find(c->f, a) != ra) continue;
+    for (u32 j = 0; j < c->nrelated; ++j) {
+      PReg b = c->related[j];
+      if (coalesce_find(c->f, b) != rb) continue;
+      if (((a == allow_a && b == allow_b) || (a == allow_b && b == allow_a)) &&
+          coalesce_has_unit_conflict(c, a, b))
+        continue;
+      if (coalesce_has_conflict(c, a, b)) return 1;
+    }
+  }
+  return 0;
+}
+
+static int hard_reg_possible(Func* f, u8 cls, u32 forbidden) {
+  for (u32 i = 0; i < f->opt_hard_reg_count[cls]; ++i) {
+    Reg r = f->opt_hard_regs[cls][i];
+    if (r < 32 && (forbidden & (1u << r)) == 0) return 1;
+  }
+  return f->opt_hard_reg_count[cls] == 0;
+}
+
+static int group_constraints_compatible(const CoalesceCtx* c, PReg ra,
+                                        PReg rb) {
+  Func* f = c->f;
+  u8 cls = opt_reg_cls(f, ra);
+  CfreeCgTypeId type = opt_reg_type(f, ra);
+  i32 tied = -1;
+  u32 forbidden = 0;
+  for (PReg v = 1; v < opt_reg_count(f); ++v) {
+    PReg r = coalesce_find(f, v);
+    if (r != ra && r != rb) continue;
+    if (opt_reg_cls(f, v) != cls || opt_reg_type(f, v) != type) return 0;
+    const OptPRegInfo* vi = &f->preg_info[v];
+    forbidden |= vi->forbidden_hard_regs;
+    if (vi->tied_hard_reg >= 0) {
+      if (tied >= 0 && tied != vi->tied_hard_reg) return 0;
+      tied = vi->tied_hard_reg;
+    }
+  }
+  if (tied >= 0 && tied < 32 && (forbidden & (1u << (Reg)tied))) return 0;
+  return hard_reg_possible(f, cls, forbidden);
+}
+
+static void coalesce_union(Func* f, PReg a, PReg b) {
+  PReg ra = coalesce_find(f, a);
+  PReg rb = coalesce_find(f, b);
+  if (ra == rb) return;
+  u32 as = f->opt_coalesce_size[ra];
+  u32 bs = f->opt_coalesce_size[rb];
+  if (as < bs || (as == bs && rb < ra)) {
+    PReg t = ra;
+    ra = rb;
+    rb = t;
+  }
+  f->opt_coalesce_parent[rb] = ra;
+  f->opt_coalesce_size[ra] += f->opt_coalesce_size[rb];
+}
+
+static int move_higher(const CoalesceMove* a, const CoalesceMove* b) {
+  if (a->weight != b->weight) return a->weight > b->weight;
+  if (a->dst != b->dst) return a->dst < b->dst;
+  return a->src < b->src;
+}
+
+static int move_cmp(const void* va, const void* vb) {
+  const CoalesceMove* a = (const CoalesceMove*)va;
+  const CoalesceMove* b = (const CoalesceMove*)vb;
+  if (move_higher(a, b)) return -1;
+  if (move_higher(b, a)) return 1;
+  return 0;
+}
+
+static int collect_move(Func* f, const OptLiveRangeSet* ranges, Inst* in,
+                        u32 block, CoalesceMove* out) {
+  if ((IROp)in->op != IR_COPY || in->nopnds < 2) return 0;
+  if (in->opnds[0].kind != OPK_REG || in->opnds[1].kind != OPK_REG) return 0;
+  PReg dst = (PReg)in->opnds[0].v.reg;
+  PReg src = (PReg)in->opnds[1].v.reg;
+  if (!opt_reg_valid(f, dst) || !opt_reg_valid(f, src) || dst == src) return 0;
+  if (ranges->first_range_by_preg[dst] == OPT_RANGE_NONE ||
+      ranges->first_range_by_preg[src] == OPT_RANGE_NONE)
+    return 0;
+  if (opt_reg_cls(f, dst) != opt_reg_cls(f, src)) return 0;
+  if (opt_reg_type(f, dst) != opt_reg_type(f, src)) return 0;
+  out->dst = dst;
+  out->src = src;
+  out->weight = f->blocks[block].frequency ? f->blocks[block].frequency : 1u;
+  return 1;
+}
+
+void opt_coalesce_ranges(Func* f, const OptLiveRangeSet* ranges) {
+  if (!f || !ranges || !f->preg_info) return;
+  u32 nregs = opt_reg_count(f);
+  f->opt_coalesce_moves_seen = 0;
+  f->opt_coalesce_candidates = 0;
+  f->opt_coalesce_conflicts = 0;
+  f->opt_coalesce_merge_attempts = 0;
+  f->opt_coalesce_merges = 0;
+  f->opt_coalesce_parent = arena_array(f->arena, u32, nregs ? nregs : 1u);
+  f->opt_coalesce_size = arena_array(f->arena, u32, nregs ? nregs : 1u);
+  for (PReg v = 0; v < nregs; ++v) {
+    f->opt_coalesce_parent[v] = v;
+    f->opt_coalesce_size[v] = 1;
+  }
+
+  CoalesceMove* moves = NULL;
+  u32 nmoves = 0;
+  u32 move_cap = 0;
+  CoalesceCtx ctx;
+  memset(&ctx, 0, sizeof ctx);
+  ctx.f = f;
+  ctx.ranges = ranges;
+  ctx.related_index = arena_array(f->arena, u32, nregs ? nregs : 1u);
+  memset(ctx.related_index, 0xff,
+         sizeof(ctx.related_index[0]) * (nregs ? nregs : 1u));
+
+  for (u32 b = 0; b < f->nblocks; ++b) {
+    Block* bl = &f->blocks[b];
+    for (u32 i = 0; i < bl->ninsts; ++i) {
+      if ((IROp)bl->insts[i].op == IR_COPY) ++f->opt_coalesce_moves_seen;
+      CoalesceMove m;
+      if (!collect_move(f, ranges, &bl->insts[i], b, &m)) continue;
+      if (nmoves == move_cap) {
+        u32 ncap = move_cap ? move_cap * 2u : 32u;
+        CoalesceMove* nv = arena_array(f->arena, CoalesceMove, ncap);
+        if (moves) memcpy(nv, moves, sizeof(moves[0]) * nmoves);
+        moves = nv;
+        move_cap = ncap;
+      }
+      moves[nmoves++] = m;
+      coalesce_add_related(&ctx, m.dst);
+      coalesce_add_related(&ctx, m.src);
+    }
+  }
+  f->opt_coalesce_candidates = nmoves;
+  if (!nmoves || ctx.nrelated < 2) goto metrics;
+
+  ctx.conflict_words = (ctx.nrelated * ctx.nrelated + 63u) / 64u;
+  ctx.conflicts =
+      arena_zarray(f->arena, u64, ctx.conflict_words ? ctx.conflict_words : 1u);
+  ctx.unit_conflicts =
+      arena_zarray(f->arena, u64, ctx.conflict_words ? ctx.conflict_words : 1u);
+  for (u32 i = 0; i < ctx.nrelated; ++i) {
+    for (u32 j = i + 1u; j < ctx.nrelated; ++j) {
+      int kind = ranges_overlap_kind(ranges, ctx.related[i], ctx.related[j]);
+      if (kind) {
+        coalesce_set_conflict(&ctx, i, j, kind == 1);
+        ++f->opt_coalesce_conflicts;
+      }
+    }
+  }
+
+  qsort(moves, nmoves, sizeof(moves[0]), move_cmp);
+  for (u32 i = 0; i < nmoves; ++i) {
+    PReg ra = coalesce_find(f, moves[i].dst);
+    PReg rb = coalesce_find(f, moves[i].src);
+    if (ra == rb) continue;
+    ++f->opt_coalesce_merge_attempts;
+    if (group_conflicts(&ctx, ra, rb, moves[i].dst, moves[i].src)) continue;
+    if (!group_constraints_compatible(&ctx, ra, rb)) continue;
+    coalesce_union(f, ra, rb);
+    ++f->opt_coalesce_merges;
+  }
+
+metrics:
+  metrics_count(f->c, "opt.coalesce.moves_seen", f->opt_coalesce_moves_seen);
+  metrics_count(f->c, "opt.coalesce.candidates", f->opt_coalesce_candidates);
+  metrics_count(f->c, "opt.coalesce.conflicts", f->opt_coalesce_conflicts);
+  metrics_count(f->c, "opt.coalesce.merge_attempts",
+                f->opt_coalesce_merge_attempts);
+  metrics_count(f->c, "opt.coalesce.merges", f->opt_coalesce_merges);
+}
+
+void opt_coalesce(Func* f) {
+  if (!f) return;
+  OptLiveInfo live;
+  opt_live_blocks(f, &live);
+  OptLiveRangeSet ranges;
+  opt_live_ranges_build(f, &live, &ranges);
+  opt_coalesce_ranges(f, &ranges);
+}
diff --git a/src/opt/pass_live.c b/src/opt/pass_live.c
@@ -520,6 +520,8 @@ static void range_append(RangeBuildCtx* c, PReg r, u32 start, u32 end,
   lr->preg = r;
   lr->start = start;
   lr->end = end;
+  lr->raw_start = start;
+  lr->raw_end = end;
   lr->next = OPT_RANGE_NONE;
   lr->block = block;
   lr->whole_block = whole_block ? 1u : 0u;
diff --git a/src/opt/pass_lower.c b/src/opt/pass_lower.c
@@ -328,6 +328,18 @@ typedef struct OptSpillCandidate {
   u32 last;
 } OptSpillCandidate;
 
+typedef struct OptAllocGroupInfo {
+  PReg root;
+  u32 spill_cost;
+  u32 live_length;
+  u32 first;
+  u32 last;
+  i32 tied_hard_reg;
+  u32 forbidden_hard_regs;
+  u8 cls;
+  u8 pad[3];
+} OptAllocGroupInfo;
+
 typedef struct AllocInterval {
   u32 start;
   u32 end;
@@ -415,6 +427,45 @@ static void alloc_sort_spills(OptSpillCandidate* spills, u32 n) {
   if (n > 1) qsort(spills, n, sizeof(spills[0]), spill_candidate_cmp);
 }
 
+static PReg alloc_coalesce_root(Func* f, PReg v) {
+  if (!f->opt_coalesce_parent || v == PREG_NONE || v >= opt_reg_count(f))
+    return v;
+  PReg p = (PReg)f->opt_coalesce_parent[v];
+  while (p != f->opt_coalesce_parent[p]) p = (PReg)f->opt_coalesce_parent[p];
+  while (v != p) {
+    PReg n = (PReg)f->opt_coalesce_parent[v];
+    f->opt_coalesce_parent[v] = p;
+    v = n;
+  }
+  return p;
+}
+
+static int alloc_group_member(Func* f, PReg root, PReg v) {
+  return alloc_coalesce_root(f, v) == root;
+}
+
+static void alloc_group_info(Func* f, const OptLiveRangeSet* ranges, PReg root,
+                             OptAllocGroupInfo* out) {
+  memset(out, 0, sizeof *out);
+  out->root = root;
+  out->first = (u32)~0u;
+  out->tied_hard_reg = -1;
+  out->cls = f->preg_info[root].cls;
+  for (PReg v = 1; v < opt_reg_count(f); ++v) {
+    if (ranges->first_range_by_preg[v] == OPT_RANGE_NONE) continue;
+    if (!alloc_group_member(f, root, v)) continue;
+    OptPRegInfo* vi = &f->preg_info[v];
+    out->spill_cost += vi->frequency ? vi->frequency : vi->spill_cost;
+    out->live_length += vi->live_length;
+    u32 first = vi->first_pos ? vi->first_pos - 1u : 0;
+    if (first < out->first) out->first = first;
+    if (vi->last_pos > out->last) out->last = vi->last_pos;
+    out->forbidden_hard_regs |= vi->forbidden_hard_regs;
+    if (vi->tied_hard_reg >= 0) out->tied_hard_reg = vi->tied_hard_reg;
+  }
+  if (out->first == (u32)~0u) out->first = 0;
+}
+
 static u32 alloc_interval_lower_bound(const AllocIntervalVec* v, u32 start) {
   u32 lo = 0;
   u32 hi = v ? v->n : 0;
@@ -636,18 +687,30 @@ static int spill_slot_compatible(Func* f, FrameSlot fs, PReg v) {
   return 1;
 }
 
-static int alloc_hard_conflicts(OptAllocator* a, const OptLiveRangeSet* ranges,
-                                PReg v, u32 bit) {
+static int alloc_group_hard_conflicts(Func* f, OptAllocator* a,
+                                      const OptLiveRangeSet* ranges, PReg root,
+                                      u32 bit) {
   if (bit >= a->hard_loc_bits) return 1;
-  return alloc_ranges_overlap_vec(a, ranges, v, &a->hard_used_locs[bit],
-                                  &a->hard_point_visits);
+  for (PReg v = 1; v < opt_reg_count(f); ++v) {
+    if (ranges->first_range_by_preg[v] == OPT_RANGE_NONE) continue;
+    if (!alloc_group_member(f, root, v)) continue;
+    if (alloc_ranges_overlap_vec(a, ranges, v, &a->hard_used_locs[bit],
+                                 &a->hard_point_visits))
+      return 1;
+  }
+  return 0;
 }
 
-static void alloc_mark_hard_loc(OptAllocator* a, const OptLiveRangeSet* ranges,
-                                PReg v, u32 bit, Func* f) {
+static void alloc_mark_group_hard_loc(Func* f, OptAllocator* a,
+                                      const OptLiveRangeSet* ranges, PReg root,
+                                      u32 bit) {
   if (bit >= a->hard_loc_bits) return;
-  alloc_mark_vec(f, a, ranges, v, &a->hard_used_locs[bit],
-                 &a->hard_mark_points);
+  for (PReg v = 1; v < opt_reg_count(f); ++v) {
+    if (ranges->first_range_by_preg[v] == OPT_RANGE_NONE) continue;
+    if (!alloc_group_member(f, root, v)) continue;
+    alloc_mark_vec(f, a, ranges, v, &a->hard_used_locs[bit],
+                   &a->hard_mark_points);
+  }
 }
 
 static void alloc_grow_stack_locs(Func* f, OptAllocator* a, u32 need_slots) {
@@ -670,11 +733,16 @@ static void alloc_grow_stack_locs(Func* f, OptAllocator* a, u32 need_slots) {
   a->stack_slot_cap = ncap;
 }
 
-static void alloc_mark_stack_loc(OptAllocator* a, const OptLiveRangeSet* ranges,
-                                 PReg v, u32 stack_idx, Func* f) {
+static void alloc_mark_group_stack_loc(Func* f, OptAllocator* a,
+                                       const OptLiveRangeSet* ranges, PReg root,
+                                       u32 stack_idx) {
   if (stack_idx >= a->stack_slot_count) return;
-  alloc_mark_vec(f, a, ranges, v, &a->stack_used_locs[stack_idx],
-                 &a->stack_mark_points);
+  for (PReg v = 1; v < opt_reg_count(f); ++v) {
+    if (ranges->first_range_by_preg[v] == OPT_RANGE_NONE) continue;
+    if (!alloc_group_member(f, root, v)) continue;
+    alloc_mark_vec(f, a, ranges, v, &a->stack_used_locs[stack_idx],
+                   &a->stack_mark_points);
+  }
 }
 
 static int stack_slot_heap_less(const OptAllocator* a, u32 lhs, u32 rhs) {
@@ -776,57 +844,179 @@ static int stack_take_free_compatible(Func* f, OptAllocator* a, PReg v,
   return 0;
 }
 
-static void alloc_assign_hard(Func* f, OptAllocator* a,
-                              const OptLiveRangeSet* ranges, PReg v, Reg r) {
-  OptPRegInfo* vi = &f->preg_info[v];
-  vi->alloc_kind = OPT_ALLOC_HARD;
-  vi->hard_reg = r;
-  a->locs[v].kind = OPT_LOC_HARD;
-  a->locs[v].cls = vi->cls;
-  a->locs[v].hard_reg = r;
-  a->locs[v].spill_slot = FRAME_SLOT_NONE;
-  alloc_mark_hard_loc(a, ranges, v, hard_loc_bit(vi->cls, r), f);
+static void alloc_assign_group_hard(Func* f, OptAllocator* a,
+                                    const OptLiveRangeSet* ranges, PReg root,
+                                    Reg r) {
+  u32 bit = hard_loc_bit(f->preg_info[root].cls, r);
+  for (PReg v = 1; v < opt_reg_count(f); ++v) {
+    if (ranges->first_range_by_preg[v] == OPT_RANGE_NONE) continue;
+    if (!alloc_group_member(f, root, v)) continue;
+    OptPRegInfo* vi = &f->preg_info[v];
+    vi->alloc_kind = OPT_ALLOC_HARD;
+    vi->hard_reg = r;
+    a->locs[v].kind = OPT_LOC_HARD;
+    a->locs[v].cls = vi->cls;
+    a->locs[v].hard_reg = r;
+    a->locs[v].spill_slot = FRAME_SLOT_NONE;
+  }
+  alloc_mark_group_hard_loc(f, a, ranges, root, bit);
 }
 
-static void alloc_assign_stack_slot(Func* f, OptAllocator* a,
-                                    const OptLiveRangeSet* ranges, PReg v,
-                                    u32 stack_idx) {
-  OptPRegInfo* vi = &f->preg_info[v];
+static void alloc_assign_group_stack_slot(Func* f, OptAllocator* a,
+                                          const OptLiveRangeSet* ranges,
+                                          PReg root, u32 stack_idx,
+                                          u32 group_last) {
   FrameSlot slot = a->stack_slots[stack_idx];
-  vi->spill_slot = slot;
-  vi->alloc_kind = OPT_ALLOC_SPILL;
-  a->locs[v].kind = OPT_LOC_STACK;
-  a->locs[v].cls = vi->cls;
-  a->locs[v].hard_reg = REG_NONE;
-  a->locs[v].spill_slot = slot;
-  alloc_mark_stack_loc(a, ranges, v, stack_idx, f);
-  a->stack_slot_end[stack_idx] = vi->last_pos;
+  for (PReg v = 1; v < opt_reg_count(f); ++v) {
+    if (ranges->first_range_by_preg[v] == OPT_RANGE_NONE) continue;
+    if (!alloc_group_member(f, root, v)) continue;
+    OptPRegInfo* vi = &f->preg_info[v];
+    vi->spill_slot = slot;
+    vi->alloc_kind = OPT_ALLOC_SPILL;
+    a->locs[v].kind = OPT_LOC_STACK;
+    a->locs[v].cls = vi->cls;
+    a->locs[v].hard_reg = REG_NONE;
+    a->locs[v].spill_slot = slot;
+  }
+  alloc_mark_group_stack_loc(f, a, ranges, root, stack_idx);
+  a->stack_slot_end[stack_idx] = group_last;
   stack_active_push(f, a, stack_idx);
 }
 
-static void alloc_assign_stack(Func* f, OptAllocator* a,
-                               const OptLiveRangeSet* ranges, PReg v) {
-  OptPRegInfo* vi = &f->preg_info[v];
-  u32 first = vi->first_pos ? vi->first_pos - 1u : 0;
+static void alloc_assign_group_stack(Func* f, OptAllocator* a,
+                                     const OptLiveRangeSet* ranges, PReg root,
+                                     const OptAllocGroupInfo* gi) {
   u32 stack_idx = 0;
-  stack_expire_slots(f, a, first);
-  if (!stack_take_free_compatible(f, a, v, &stack_idx)) {
-    FrameSlot slot = spill_slot_for(f, v);
+  stack_expire_slots(f, a, gi->first);
+  if (!stack_take_free_compatible(f, a, root, &stack_idx)) {
+    FrameSlot slot = spill_slot_for(f, root);
     alloc_grow_stack_locs(f, a, a->stack_slot_count + 1u);
     stack_idx = a->stack_slot_count++;
     a->stack_slots[stack_idx] = slot;
   }
-  alloc_assign_stack_slot(f, a, ranges, v, stack_idx);
+  alloc_assign_group_stack_slot(f, a, ranges, root, stack_idx, gi->last);
+}
+
+static void split_segment_grow(Func* f) {
+  if (f->opt_nalloc_segments < f->opt_alloc_segments_cap) return;
+  u32 ncap = f->opt_alloc_segments_cap ? f->opt_alloc_segments_cap * 2u : 32u;
+  OptAllocSegment* ns = arena_array(f->arena, OptAllocSegment, ncap);
+  if (f->opt_alloc_segments)
+    memcpy(ns, f->opt_alloc_segments,
+           sizeof(f->opt_alloc_segments[0]) * f->opt_nalloc_segments);
+  f->opt_alloc_segments = ns;
+  f->opt_alloc_segments_cap = ncap;
+}
+
+static void split_segment_push(Func* f, PReg v, const OptLiveRange* lr,
+                               u8 loc_kind, Reg hard_reg,
+                               FrameSlot spill_home) {
+  split_segment_grow(f);
+  u32 idx = f->opt_nalloc_segments++;
+  OptAllocSegment* s = &f->opt_alloc_segments[idx];
+  memset(s, 0, sizeof *s);
+  s->start = lr->raw_start;
+  s->end = lr->raw_end;
+  s->block = lr->block;
+  s->loc_kind = loc_kind;
+  s->cls = f->preg_info[v].cls;
+  s->hard_reg = hard_reg;
+  s->spill_slot = loc_kind == OPT_LOC_STACK ? spill_home : FRAME_SLOT_NONE;
+  s->spill_home = spill_home;
+  s->reload_at_start = loc_kind == OPT_LOC_HARD ? 1u : 0u;
+  s->store_at_end = loc_kind == OPT_LOC_HARD ? 1u : 0u;
+  s->next = f->opt_first_segment_by_preg[v];
+  f->opt_first_segment_by_preg[v] = idx;
+}
+
+static int alloc_range_hard_conflicts(OptAllocator* a, const OptLiveRange* lr,
+                                      u32 bit) {
+  if (bit >= a->hard_loc_bits) return 1;
+  u32 end = lr->end < a->point_count ? lr->end : a->point_count;
+  ++a->hard_point_visits;
+  if (lr->start >= end) return 0;
+  return alloc_interval_overlaps(&a->hard_used_locs[bit], lr->start, end);
+}
+
+static void alloc_mark_range_hard(Func* f, OptAllocator* a,
+                                  const OptLiveRange* lr, u32 bit) {
+  if (bit >= a->hard_loc_bits) return;
+  u32 end = lr->end < a->point_count ? lr->end : a->point_count;
+  if (lr->start >= end) return;
+  ++a->hard_mark_points;
+  alloc_interval_insert(f, &a->hard_used_locs[bit], lr->start, end);
+}
+
+static int alloc_try_split_singleton(Func* f, OptAllocator* a,
+                                     const OptLiveRangeSet* ranges, PReg v) {
+  if (f->opt_coalesce_parent &&
+      f->opt_coalesce_size[alloc_coalesce_root(f, v)] > 1)
+    return 0;
+  if (ranges->first_range_by_preg[v] == OPT_RANGE_NONE) return 0;
+  if (f->preg_info[v].live_across_call_freq) return 0;
+  u32 nranges = 0;
+  for (u32 ri = ranges->first_range_by_preg[v]; ri != OPT_RANGE_NONE;
+       ri = ranges->ranges[ri].next)
+    ++nranges;
+  if (nranges < 2) return 0;
+  if (!f->opt_first_segment_by_preg) {
+    f->opt_first_segment_by_preg =
+        arena_array(f->arena, u32, opt_reg_count(f) ? opt_reg_count(f) : 1u);
+    memset(f->opt_first_segment_by_preg, 0xff,
+           sizeof(f->opt_first_segment_by_preg[0]) *
+               (opt_reg_count(f) ? opt_reg_count(f) : 1u));
+  }
+
+  OptPRegInfo* vi = &f->preg_info[v];
+  FrameSlot home = spill_slot_for(f, v);
+  vi->alloc_kind = OPT_ALLOC_SPLIT;
+  vi->hard_reg = REG_NONE;
+  vi->spill_slot = home;
+  a->locs[v].kind = OPT_LOC_STACK;
+  a->locs[v].cls = vi->cls;
+  a->locs[v].hard_reg = REG_NONE;
+  a->locs[v].spill_slot = home;
+
+  for (u32 ri = ranges->first_range_by_preg[v]; ri != OPT_RANGE_NONE;
+       ri = ranges->ranges[ri].next) {
+    const OptLiveRange* lr = &ranges->ranges[ri];
+    int found = 0;
+    Reg best = REG_NONE;
+    u32 best_score = 0xffffffffu;
+    for (u32 r = 0; r < f->opt_hard_reg_count[vi->cls]; ++r) {
+      Reg hr = f->opt_hard_regs[vi->cls][r];
+      if (hr >= 32) continue;
+      if (vi->forbidden_hard_regs & (1u << hr)) continue;
+      u32 bit = hard_loc_bit(vi->cls, hr);
+      if (alloc_range_hard_conflicts(a, lr, bit)) continue;
+      u32 score = hard_reg_alloc_score(f, a, vi, hr);
+      if (!found || score < best_score) {
+        found = 1;
+        best = hr;
+        best_score = score;
+      }
+    }
+    if (found) {
+      alloc_mark_range_hard(f, a, lr, hard_loc_bit(vi->cls, best));
+      split_segment_push(f, v, lr, OPT_LOC_HARD, best, home);
+    } else {
+      split_segment_push(f, v, lr, OPT_LOC_STACK, REG_NONE, home);
+    }
+  }
+  return 1;
 }
 
 static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges,
-                              OptAllocator* a) {
+                              OptAllocator* a, int allow_live_range_split) {
   memset(a, 0, sizeof *a);
   a->point_count = ranges->point_count ? ranges->point_count : 1u;
   a->hard_loc_bits = OPT_REG_CLASSES * 32u;
   u32 ncands = 0;
-  for (PReg v = 1; v < opt_reg_count(f); ++v)
-    if (ranges->first_range_by_preg[v] != OPT_RANGE_NONE) ++ncands;
+  for (PReg v = 1; v < opt_reg_count(f); ++v) {
+    if (ranges->first_range_by_preg[v] == OPT_RANGE_NONE) continue;
+    if (alloc_coalesce_root(f, v) != v) continue;
+    ++ncands;
+  }
   a->locs =
       arena_zarray(f->arena, OptLoc, opt_reg_count(f) ? opt_reg_count(f) : 1u);
   a->hard_used_locs = arena_zarray(f->arena, AllocIntervalVec,
@@ -842,11 +1032,14 @@ static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges,
   u32 n = 0;
   for (PReg v = 1; v < opt_reg_count(f); ++v) {
     if (ranges->first_range_by_preg[v] == OPT_RANGE_NONE) continue;
-    OptPRegInfo* vi = &f->preg_info[v];
+    PReg root = alloc_coalesce_root(f, v);
+    if (root != v) continue;
+    OptAllocGroupInfo gi;
+    alloc_group_info(f, ranges, root, &gi);
     cands[n].v = v;
-    cands[n].spill_cost = vi->frequency ? vi->frequency : vi->spill_cost;
-    cands[n].live_length = vi->live_length;
-    cands[n].tied = vi->tied_hard_reg >= 0;
+    cands[n].spill_cost = gi.spill_cost;
+    cands[n].live_length = gi.live_length;
+    cands[n].tied = gi.tied_hard_reg >= 0;
     memset(cands[n].pad, 0, sizeof cands[n].pad);
     ++n;
   }
@@ -855,10 +1048,12 @@ static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges,
   u32 nspills = 0;
   for (u32 i = 0; i < n; ++i) {
     PReg v = cands[i].v;
+    OptAllocGroupInfo gi;
+    alloc_group_info(f, ranges, v, &gi);
     OptPRegInfo* vi = &f->preg_info[v];
-    u8 cls = vi->cls;
-    if (vi->tied_hard_reg >= 0) {
-      Reg fixed = (Reg)vi->tied_hard_reg;
+    u8 cls = gi.cls;
+    if (gi.tied_hard_reg >= 0) {
+      Reg fixed = (Reg)gi.tied_hard_reg;
       if (!hard_available(f, cls, fixed)) {
         SrcLoc loc = {0, 0, 0};
         compiler_panic(
@@ -866,19 +1061,19 @@ static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges,
             "opt regalloc: fixed hard reg %u unavailable in class %u",
             (unsigned)fixed, (unsigned)cls);
       }
-      if (fixed < 32 && (vi->forbidden_hard_regs & (1u << fixed))) {
+      if (fixed < 32 && (gi.forbidden_hard_regs & (1u << fixed))) {
         SrcLoc loc = {0, 0, 0};
         compiler_panic(f->c, loc,
                        "opt regalloc: fixed hard reg %u is clobbered",
                        (unsigned)fixed);
       }
-      if (fixed >= 32 ||
-          alloc_hard_conflicts(a, ranges, v, hard_loc_bit(cls, fixed))) {
+      if (fixed >= 32 || alloc_group_hard_conflicts(f, a, ranges, v,
+                                                    hard_loc_bit(cls, fixed))) {
         SrcLoc loc = {0, 0, 0};
         compiler_panic(f->c, loc, "opt regalloc: conflicting fixed hard reg %u",
                        (unsigned)fixed);
       }
-      alloc_assign_hard(f, a, ranges, v, fixed);
+      alloc_assign_group_hard(f, a, ranges, v, fixed);
       continue;
     }
 
@@ -888,8 +1083,9 @@ static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges,
     for (u32 r = 0; r < f->opt_hard_reg_count[cls]; ++r) {
       Reg hr = f->opt_hard_regs[cls][r];
       if (hr >= 32) continue;
-      if (vi->forbidden_hard_regs & (1u << hr)) continue;
-      if (alloc_hard_conflicts(a, ranges, v, hard_loc_bit(cls, hr))) continue;
+      if (gi.forbidden_hard_regs & (1u << hr)) continue;
+      if (alloc_group_hard_conflicts(f, a, ranges, v, hard_loc_bit(cls, hr)))
+        continue;
       u32 score = hard_reg_alloc_score(f, a, vi, hr);
       if (!found || score < best_score) {
         found = 1;
@@ -898,18 +1094,23 @@ static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges,
       }
     }
     if (found) {
-      alloc_assign_hard(f, a, ranges, v, best);
+      alloc_assign_group_hard(f, a, ranges, v, best);
     }
     if (!found) {
+      if (allow_live_range_split && alloc_try_split_singleton(f, a, ranges, v))
+        continue;
       spills[nspills].v = v;
-      spills[nspills].first = vi->first_pos ? vi->first_pos - 1u : 0;
-      spills[nspills].last = vi->last_pos;
+      spills[nspills].first = gi.first;
+      spills[nspills].last = gi.last;
       ++nspills;
     }
   }
   alloc_sort_spills(spills, nspills);
-  for (u32 i = 0; i < nspills; ++i)
-    alloc_assign_stack(f, a, ranges, spills[i].v);
+  for (u32 i = 0; i < nspills; ++i) {
+    OptAllocGroupInfo gi;
+    alloc_group_info(f, ranges, spills[i].v, &gi);
+    alloc_assign_group_stack(f, a, ranges, spills[i].v, &gi);
+  }
   a->hard_loc_words =
       alloc_interval_storage_words(a->hard_used_locs, a->hard_loc_bits);
   a->stack_loc_words =
@@ -1027,6 +1228,16 @@ static Operand hard_operand(Func* f, PReg v) {
   return o;
 }
 
+static Operand hard_operand_reg(Func* f, PReg v, Reg hard_reg) {
+  Operand o;
+  memset(&o, 0, sizeof o);
+  o.kind = OPK_REG;
+  o.cls = f->preg_info[v].cls;
+  o.type = opt_reg_type(f, v);
+  o.v.reg = hard_reg;
+  return o;
+}
+
 static void append_store_preg(Func* f, RewriteList* out, PReg v) {
   Inst* st = list_push(f, out, IR_STORE);
   st->opnds = arena_array(f->arena, Operand, 2);
@@ -1045,6 +1256,26 @@ static void append_load_preg(Func* f, RewriteList* out, PReg v) {
   ld->extra.mem = spill_mem(f, v);
 }
 
+static void append_store_preg_hard(Func* f, RewriteList* out, PReg v,
+                                   Reg hard_reg) {
+  Inst* st = list_push(f, out, IR_STORE);
+  st->opnds = arena_array(f->arena, Operand, 2);
+  st->opnds[0] = spill_addr(f, v);
+  st->opnds[1] = hard_operand_reg(f, v, hard_reg);
+  st->nopnds = 2;
+  st->extra.mem = spill_mem(f, v);
+}
+
+static void append_load_preg_hard(Func* f, RewriteList* out, PReg v,
+                                  Reg hard_reg) {
+  Inst* ld = list_push(f, out, IR_LOAD);
+  ld->opnds = arena_array(f->arena, Operand, 2);
+  ld->opnds[0] = hard_operand_reg(f, v, hard_reg);
+  ld->opnds[1] = spill_addr(f, v);
+  ld->nopnds = 2;
+  ld->extra.mem = spill_mem(f, v);
+}
+
 static Reg scratch_for(Func* f, u8 cls, u32* next) {
   u32 n = f->opt_scratch_reg_count[cls];
   if (!n) return REG_NONE;
@@ -1057,8 +1288,19 @@ typedef struct RewriteCtx {
   RewriteList* before;
   RewriteList* after;
   u32 next_scratch[OPT_REG_CLASSES];
+  u32 raw_point;
 } RewriteCtx;
 
+static const OptAllocSegment* split_segment_at(Func* f, PReg v, u32 raw_point) {
+  if (!f->opt_first_segment_by_preg || v >= opt_reg_count(f)) return NULL;
+  for (u32 si = f->opt_first_segment_by_preg[v]; si != OPT_RANGE_NONE;
+       si = f->opt_alloc_segments[si].next) {
+    const OptAllocSegment* s = &f->opt_alloc_segments[si];
+    if (s->start <= raw_point && raw_point < s->end) return s;
+  }
+  return NULL;
+}
+
 static void rewrite_one_operand(Func* f, Inst* owner, Operand* op, int is_def,
                                 void* arg) {
   RewriteCtx* c = (RewriteCtx*)arg;
@@ -1070,7 +1312,15 @@ static void rewrite_one_operand(Func* f, Inst* owner, Operand* op, int is_def,
     op->v.reg = vi->hard_reg;
     return;
   }
-  if (vi->alloc_kind != OPT_ALLOC_SPILL) return;
+  if (vi->alloc_kind == OPT_ALLOC_SPLIT) {
+    const OptAllocSegment* seg = split_segment_at(f, v, c->raw_point);
+    if (seg && seg->loc_kind == OPT_LOC_HARD) {
+      op->v.reg = seg->hard_reg;
+      return;
+    }
+  } else if (vi->alloc_kind != OPT_ALLOC_SPILL) {
+    return;
+  }
   u8 cls = vi->cls;
   Reg scratch = scratch_for(f, cls, &c->next_scratch[cls]);
   if (scratch == (Reg)REG_NONE) {
@@ -1104,7 +1354,8 @@ static void rewrite_call_arg_operand(Func* f, Operand* op) {
   OptPRegInfo* vi = &f->preg_info[v];
   if (vi->alloc_kind == OPT_ALLOC_HARD) {
     op->v.reg = vi->hard_reg;
-  } else if (vi->alloc_kind == OPT_ALLOC_SPILL) {
+  } else if (vi->alloc_kind == OPT_ALLOC_SPILL ||
+             vi->alloc_kind == OPT_ALLOC_SPLIT) {
     *op = spill_addr(f, v);
   }
 }
@@ -1192,6 +1443,38 @@ static PReg* rewrite_collect_call_save_pregs(Func* f, u32* count_out) {
   return pregs;
 }
 
+static void append_split_reloads_at(Func* f, RewriteList* out, u32 block,
+                                    u32 raw_point) {
+  if (!f->opt_first_segment_by_preg) return;
+  for (PReg v = 1; v < opt_reg_count(f); ++v) {
+    if (f->preg_info[v].alloc_kind != OPT_ALLOC_SPLIT) continue;
+    for (u32 si = f->opt_first_segment_by_preg[v]; si != OPT_RANGE_NONE;
+         si = f->opt_alloc_segments[si].next) {
+      const OptAllocSegment* s = &f->opt_alloc_segments[si];
+      if (s->block == block && s->start == raw_point && s->reload_at_start &&
+          s->loc_kind == OPT_LOC_HARD) {
+        append_load_preg_hard(f, out, v, s->hard_reg);
+      }
+    }
+  }
+}
+
+static void append_split_stores_at(Func* f, RewriteList* out, u32 block,
+                                   u32 raw_point) {
+  if (!f->opt_first_segment_by_preg) return;
+  for (PReg v = 1; v < opt_reg_count(f); ++v) {
+    if (f->preg_info[v].alloc_kind != OPT_ALLOC_SPLIT) continue;
+    for (u32 si = f->opt_first_segment_by_preg[v]; si != OPT_RANGE_NONE;
+         si = f->opt_alloc_segments[si].next) {
+      const OptAllocSegment* s = &f->opt_alloc_segments[si];
+      if (s->block == block && s->end == raw_point && s->store_at_end &&
+          s->loc_kind == OPT_LOC_HARD) {
+        append_store_preg_hard(f, out, v, s->hard_reg);
+      }
+    }
+  }
+}
+
 static void rewrite_func(Func* f, const OptLiveInfo* live_info) {
   u32 words = live_info ? live_info->words : f->opt_live_words;
   if (!words) words = bit_words(opt_reg_count(f));
@@ -1205,6 +1488,12 @@ static void rewrite_func(Func* f, const OptLiveInfo* live_info) {
   InstRefs refs;
   memset(&refs, 0, sizeof refs);
   u32 live_active_words = 0;
+  u32* block_base = arena_array(f->arena, u32, f->nblocks ? f->nblocks : 1u);
+  u32 raw = 0;
+  for (u32 b = 0; b < f->nblocks; ++b) {
+    block_base[b] = raw;
+    raw += f->blocks[b].ninsts ? f->blocks[b].ninsts : 1u;
+  }
   for (u32 b = 0; b < f->nblocks; ++b) {
     Block* bl = &f->blocks[b];
     RewriteOut out;
@@ -1235,6 +1524,7 @@ static void rewrite_func(Func* f, const OptLiveInfo* live_info) {
       memset(&ctx, 0, sizeof ctx);
       ctx.before = &before;
       ctx.after = &after;
+      ctx.raw_point = block_base[b] + i;
       if ((IROp)in.op == IR_CALL) {
         IRCallAux* aux = (IRCallAux*)in.extra.aux;
         if (aux) {
@@ -1265,6 +1555,8 @@ static void rewrite_func(Func* f, const OptLiveInfo* live_info) {
         append_live_call_saves(f, &call_restores, &in, live, live_active_words,
                                &refs, call_save_pregs, ncall_save_pregs, 1);
       }
+      append_split_reloads_at(f, &before, b, block_base[b] + i);
+      append_split_stores_at(f, &after, b, block_base[b] + i + 1u);
 
       out_prepend_list_reverse(f, &out, &after);
       out_prepend_list_reverse(f, &out, &call_restores);
@@ -1367,7 +1659,6 @@ void opt_dead_def_elim(Func* f) {
 }
 
 void opt_regalloc(Func* f, int allow_live_range_split) {
-  (void)allow_live_range_split;
   metrics_scope_begin(f->c, "opt.live_ranges.regalloc");
   OptLiveInfo live;
   opt_live_blocks(f, &live);
@@ -1376,6 +1667,7 @@ void opt_regalloc(Func* f, int allow_live_range_split) {
   opt_init_preg_info_from_ranges(f, &ranges);
   opt_apply_asm_constraints_from_live(f, &live);
   apply_param_incoming_register_hazards(f);
+  if (allow_live_range_split) opt_coalesce_ranges(f, &ranges);
   metrics_count(f->c, "opt.live_words", f->opt_live_words);
   metrics_count(f->c, "opt.ranges", ranges.nranges);
   metrics_count(f->c, "opt.range_points", ranges.point_count);
@@ -1396,6 +1688,6 @@ void opt_regalloc(Func* f, int allow_live_range_split) {
   metrics_scope_end(f->c, "opt.live_ranges.regalloc");
 
   OptAllocator alloc;
-  opt_assign_ranges(f, &ranges, &alloc);
+  opt_assign_ranges(f, &ranges, &alloc, allow_live_range_split);
   rewrite_func(f, &live);
 }
diff --git a/test/opt/opt_test.c b/test/opt/opt_test.c
@@ -2911,6 +2911,181 @@ static void opt_regalloc_priority(void) {
   tc_fini(&tc);
 }
 
+static void opt_o2_coalesces_nonconflicting_copy(void) {
+  TestCtx tc;
+  tc_init(&tc);
+  MockCGTarget mock;
+  mock_init(&mock, tc.c);
+  static const Reg pool[] = {19};
+  static const Reg scratch[] = {9, 10};
+  mock_set_pool(&mock, RC_INT, pool, 1, scratch, 2, 0x4007FFFFu);
+
+  Func* f = new_func(&tc);
+  opt_machinize(f, &mock.base);
+  Val a = add_val(f, tc.i32);
+  Val b = add_val(f, tc.i32);
+  emit_load_imm(f, f->entry, a, tc.i32, 1);
+  emit_copy(f, f->entry, b, a, tc.i32);
+  emit_ret_val(f, f->entry, b, tc.i32);
+  opt_build_cfg(f);
+  opt_build_loop_tree(f);
+  opt_regalloc(f, 1);
+
+  EXPECT(f->opt_coalesce_candidates == 1,
+         "copy should be a coalesce candidate");
+  EXPECT(f->opt_coalesce_merges == 1, "non-overlapping copy should coalesce");
+  EXPECT(f->preg_info[a].alloc_kind == OPT_ALLOC_HARD &&
+             f->preg_info[b].alloc_kind == OPT_ALLOC_HARD,
+         "coalesced values should allocate hard under one-reg pressure");
+  EXPECT(f->preg_info[a].hard_reg == f->preg_info[b].hard_reg,
+         "coalesced values should share a hard register");
+  opt_combine(f);
+  opt_dce(f);
+  EXPECT(count_op(f, IR_COPY) == 0,
+         "coalesced hard-reg self copy should be removed");
+  tc_fini(&tc);
+}
+
+static void opt_o1_ignores_coalescing(void) {
+  TestCtx tc;
+  tc_init(&tc);
+  MockCGTarget mock;
+  mock_init(&mock, tc.c);
+  static const Reg pool[] = {19};
+  static const Reg scratch[] = {9, 10};
+  mock_set_pool(&mock, RC_INT, pool, 1, scratch, 2, 0x4007FFFFu);
+
+  Func* f = new_func(&tc);
+  opt_machinize(f, &mock.base);
+  Val a = add_val(f, tc.i32);
+  Val b = add_val(f, tc.i32);
+  emit_load_imm(f, f->entry, a, tc.i32, 1);
+  emit_copy(f, f->entry, b, a, tc.i32);
+  emit_ret_val(f, f->entry, b, tc.i32);
+  opt_build_cfg(f);
+  opt_build_loop_tree(f);
+  opt_regalloc(f, 0);
+
+  EXPECT(f->opt_coalesce_candidates == 0 && f->opt_coalesce_merges == 0,
+         "O1 regalloc should not run coalescing");
+  tc_fini(&tc);
+}
+
+static void opt_o2_refuses_overlapping_copy_coalesce(void) {
+  TestCtx tc;
+  tc_init(&tc);
+  MockCGTarget mock;
+  mock_init(&mock, tc.c);
+  static const Reg pool[] = {19, 20};
+  static const Reg scratch[] = {9, 10};
+  mock_set_pool(&mock, RC_INT, pool, 2, scratch, 2, 0x4007FFFFu);
+
+  Func* f = new_func(&tc);
+  opt_machinize(f, &mock.base);
+  Val a = add_val(f, tc.i32);
+  Val b = add_val(f, tc.i32);
+  Val out = add_val(f, tc.i32);
+  emit_load_imm(f, f->entry, a, tc.i32, 1);
+  emit_copy(f, f->entry, b, a, tc.i32);
+  emit_binop(f, f->entry, out, a, b, tc.i32);
+  emit_ret_val(f, f->entry, out, tc.i32);
+  opt_build_cfg(f);
+  opt_build_loop_tree(f);
+  opt_regalloc(f, 1);
+
+  EXPECT(f->opt_coalesce_candidates == 1,
+         "overlapping copy should still be collected as a candidate");
+  EXPECT(f->opt_coalesce_conflicts >= 1,
+         "overlapping copy values should be recorded as conflicting");
+  EXPECT(f->opt_coalesce_merges == 0, "overlapping copy should not coalesce");
+  tc_fini(&tc);
+}
+
+static void opt_o2_refuses_incompatible_copy_coalesce(void) {
+  TestCtx tc;
+  tc_init(&tc);
+  MockCGTarget mock;
+  mock_init(&mock, tc.c);
+  static const Reg pool[] = {19, 20};
+  static const Reg scratch[] = {9, 10};
+  mock_set_pool(&mock, RC_INT, pool, 2, scratch, 2, 0x4007FFFFu);
+
+  Func* diff_type = new_func(&tc);
+  opt_machinize(diff_type, &mock.base);
+  Val a = add_val(diff_type, tc.i32);
+  Val b = add_val(diff_type, tc.i64);
+  emit_load_imm(diff_type, diff_type->entry, a, tc.i32, 1);
+  emit_copy(diff_type, diff_type->entry, b, a, tc.i64);
+  emit_ret_val(diff_type, diff_type->entry, b, tc.i64);
+  opt_build_cfg(diff_type);
+  opt_build_loop_tree(diff_type);
+  opt_regalloc(diff_type, 1);
+  EXPECT(diff_type->opt_coalesce_candidates == 0,
+         "different value types should not become coalesce candidates");
+
+  Func* tied = new_func(&tc);
+  opt_machinize(tied, &mock.base);
+  a = add_val(tied, tc.i32);
+  b = add_val(tied, tc.i32);
+  emit_load_imm(tied, tied->entry, a, tc.i32, 1);
+  emit_copy(tied, tied->entry, b, a, tc.i32);
+  emit_ret_val(tied, tied->entry, b, tc.i32);
+  opt_build_cfg(tied);
+  opt_build_loop_tree(tied);
+  ensure_test_preg_info(tied);
+  tied->preg_info[a].tied_hard_reg = 19;
+  tied->preg_info[b].tied_hard_reg = 20;
+  opt_regalloc(tied, 1);
+  EXPECT(tied->opt_coalesce_candidates == 1,
+         "same-shape tied copy should be collected");
+  EXPECT(tied->opt_coalesce_merges == 0,
+         "different fixed hard regs should block coalescing");
+  tc_fini(&tc);
+}
+
+static void opt_o2_splits_singleton_when_whole_alloc_fails(void) {
+  TestCtx tc;
+  tc_init(&tc);
+  MockCGTarget mock;
+  mock_init(&mock, tc.c);
+  static const Reg pool[] = {19};
+  static const Reg scratch[] = {9, 10};
+  mock_set_pool(&mock, RC_INT, pool, 1, scratch, 2, 0x4007FFFFu);
+
+  Func* f = new_func(&tc);
+  opt_machinize(f, &mock.base);
+  Val pinned = add_val(f, tc.i32);
+  Val v = add_val(f, tc.i32);
+  Val tmp = add_val(f, tc.i32);
+  emit_load_imm(f, f->entry, pinned, tc.i32, 1);
+  emit_load_imm(f, f->entry, v, tc.i32, 2);
+  emit_binop(f, f->entry, tmp, pinned, v, tc.i32);
+  emit_load_imm(f, f->entry, v, tc.i32, 3);
+  emit_ret_val(f, f->entry, v, tc.i32);
+  opt_build_cfg(f);
+  opt_build_loop_tree(f);
+  ensure_test_preg_info(f);
+  f->preg_info[pinned].tied_hard_reg = 19;
+  opt_regalloc(f, 1);
+
+  EXPECT(f->preg_info[pinned].alloc_kind == OPT_ALLOC_HARD,
+         "pinned value should keep the only hard register");
+  EXPECT(f->preg_info[v].alloc_kind == OPT_ALLOC_SPLIT,
+         "singleton value should split instead of whole spilling");
+  int saw_hard = 0;
+  int saw_stack = 0;
+  for (u32 si = f->opt_first_segment_by_preg[v]; si != OPT_RANGE_NONE;
+       si = f->opt_alloc_segments[si].next) {
+    if (f->opt_alloc_segments[si].loc_kind == OPT_LOC_HARD) saw_hard = 1;
+    if (f->opt_alloc_segments[si].loc_kind == OPT_LOC_STACK) saw_stack = 1;
+  }
+  EXPECT(saw_hard && saw_stack,
+         "split value should have both hard and spill-home segments");
+  EXPECT(f->preg_info[v].spill_slot != FRAME_SLOT_NONE,
+         "split value should have a canonical spill home");
+  tc_fini(&tc);
+}
+
 static void opt_range_regalloc_no_conflicts_and_stack_reuse(void) {
   TestCtx tc;
   tc_init(&tc);
@@ -4711,6 +4886,11 @@ int main(void) {
   opt_live_across_call_frequency();
   opt_range_overlap_class();
   opt_regalloc_priority();
+  opt_o2_coalesces_nonconflicting_copy();
+  opt_o1_ignores_coalescing();
+  opt_o2_refuses_overlapping_copy_coalesce();
+  opt_o2_refuses_incompatible_copy_coalesce();
+  opt_o2_splits_singleton_when_whole_alloc_fails();
   opt_range_regalloc_no_conflicts_and_stack_reuse();
   opt_stack_spill_assignment_avoids_quadratic_probe();
   opt_rewrite_spill_use_def();
diff --git a/test/toy/cases/129_o2_pressure_coalesce.expected b/test/toy/cases/129_o2_pressure_coalesce.expected
@@ -0,0 +1 @@
+40
diff --git a/test/toy/cases/129_o2_pressure_coalesce.toy b/test/toy/cases/129_o2_pressure_coalesce.toy
@@ -0,0 +1,32 @@
+fn pressure(n0: i64): i64 {
+  var n: i64 = n0;
+  let a0: i64 = n0 + 1;
+  let a1: i64 = n0 + 2;
+  let a2: i64 = n0 + 3;
+  let a3: i64 = n0 + 4;
+  let a4: i64 = n0 + 5;
+  let a5: i64 = n0 + 6;
+  let a6: i64 = n0 + 7;
+  let a7: i64 = n0 + 8;
+  let a8: i64 = n0 + 9;
+  let a9: i64 = n0 + 10;
+  let a10: i64 = n0 + 11;
+  let a11: i64 = n0 + 12;
+  let a12: i64 = n0 + 13;
+  let a13: i64 = n0 + 14;
+  let a14: i64 = n0 + 15;
+  let a15: i64 = n0 + 16;
+  var s: i64 = 0;
+  while n > 0 {
+    s = s + a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7 +
+        a8 + a9 + a10 + a11 + a12 + a13 + a14 + a15;
+    n = n - 1;
+  }
+  return s;
+}
+
+fn __user_main(): i64 {
+  return pressure(3);
+}
+
+fn main(): i32 { return __user_main() as i32; }

	kit kit
	git clone https://git.ryansepassi.com/git/kit.git
	Log \| Files \| Refs \| README

M	doc/OPT.md	\|	63	+++++++++++++++++++++++++++++++++++++++++++++------------------
M	src/opt/ir.h	\|	27	+++++++++++++++++++++++++++
M	src/opt/opt.h	\|	2	++
M	src/opt/opt_internal.h	\|	1	+
A	src/opt/pass_coalesce.c	\|	292	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M	src/opt/pass_live.c	\|	2	++
M	src/opt/pass_lower.c	\|	426	++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------
M	test/opt/opt_test.c	\|	180	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A	test/toy/cases/129_o2_pressure_coalesce.expected	\|	1	+
A	test/toy/cases/129_o2_pressure_coalesce.toy	\|	32	++++++++++++++++++++++++++++++++