Optimize O1 allocator occupancy - kit

commit dbf13315d00e0f39eb2202d426bff83f68f8a0b3
parent 644fccea6714b06256d79a0a0e19f7de24044ae5
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Fri, 15 May 2026 00:41:55 -0700

Optimize O1 allocator occupancy

Diffstat:
M doc/MIR_RA_REPORT.md  | 34 ++++++++++++++++++++--------------
M doc/PERF.md  | 36 ++++++++++++++++++++++++++++++------
M src/opt/pass_lower.c  | 260 ++++++++++++++++++++++++++++++++++++++++++++++---------------------------------

3 files changed, 203 insertions(+), 127 deletions(-)
diff --git a/doc/MIR_RA_REPORT.md b/doc/MIR_RA_REPORT.md
@@ -45,11 +45,11 @@ matrix.
 As of the current O1 implementation, the dense pseudo-conflict allocator is no
 longer the normal path. O1 now has pass-local block liveness, compressed live
 ranges, split hard-register and stack-slot occupancy, assignment-map rewrite,
-and metrics for live/range/allocation/rewrite event traffic. The remaining work
-is not "delete the old matrix"; it is to make the new path more MIR-shaped by
-removing the remaining high-watermark bitmap traffic, reducing live-range build
-cost on spill-heavy wide functions, and replacing repeated range-point
-occupancy walks when they become the dominant bucket.
+per-location interval occupancy, and metrics for live/range/allocation/rewrite
+event traffic. The remaining work is no longer "delete the old matrix" or
+"replace O1 point-index occupancy"; those cleanup steps are complete. The next
+allocator work should move to O2 coalescing/splitting and generated-code
+quality, while keeping O1's fast MIR-shaped path as the baseline.
 
 ## MIR Container Layer
 
@@ -384,14 +384,20 @@ The initial O1 implementation had five major structural mismatches:
 
 4. Hot loops scanned all vals.
    Many of these loops now use operand-reference events, set-bit iteration, or
-   touched-value/generation lists. Remaining work is to remove high-watermark
-   bitmap copies and any finalization paths that still scale with the highest
-   value id instead of live or touched values.
+   touched-value/generation lists. The final O1 cleanup removed the remaining
+   high-watermark live-across-call/rewrite scans that showed up in spill-heavy
+   wide functions.
 
 5. `Block.live_after` stored a full live bitmap per instruction.
    Normal O1 no longer persists per-instruction full live-after storage.
-   Rewrite and DDE walk backward with rolling live state, but rewrite still
-   shows superlinear live-word traffic on spill-heavy wide functions.
+   Rewrite and DDE walk backward with rolling live state, and rewrite's
+   live-word traffic is now linear in the focused spill-pressure ladder.
+
+6. Point-index occupancy loops remained after the matrix was removed.
+   Normal O1 allocation now stores sorted live intervals per physical
+   hard-register and per stack spill slot. Assignment checks interval overlap
+   for each candidate range instead of OR-ing occupied-location bitmap rows for
+   every compressed point.
 
 ## Proposed cfree Structural Target
 
@@ -508,11 +514,11 @@ Deliverables:
 After O1 has the right shape, remove transition scaffolding and fix the
 remaining O1 scaling buckets before adding O2 allocator features:
 
-- rewrite without extra per-instruction temporary list churn
-- delete the old dense liveness/conflict path instead of preserving
-  compatibility for old optimizer-internal tests
+- rewrite without extra per-instruction temporary list churn: done
+- delete the old dense liveness/conflict path from the normal O1 route: done
 - narrow or compress `used_locs` so stack occupancy does not scale as
-  `points * candidates`
+  `points * candidates`: done via split hard/stack occupancy and then
+  per-location interval sets
 
 Only after that should O2 add move collection, move-only liveness, capped
 coalescing matrices, live-range splitting, and edge/block spill placement. This
diff --git a/doc/PERF.md b/doc/PERF.md
@@ -778,14 +778,38 @@ all double as the input doubles. The wall-time buckets are close enough to the
 linear target for O1 that remaining occupancy-loop work is no longer blocking
 the O2 coalescing/splitting work.
 
+### O1 Occupancy-Interval Cleanup
+
+The final O1 cleanup replaces the allocator's point-index occupied-location
+bitmaps with sorted per-location interval sets. Assignment still consumes the
+same compressed live ranges and produces the same `OptLoc` map for rewrite, but
+hard-register and stack-slot conflict checks no longer OR per-point bitmap
+rows. The old `hard_word_ors` and `stack_word_ors` counters therefore stay at
+zero on this path.
+
+Focused single-sample spill-pressure sanity run on a fresh generated input:
+
+```text
+groups opt.o1.total live_ranges_reg regalloc hard_point_visits stack_point_visits hard_mark_points stack_mark_points hard_word_ors stack_word_ors spills rewrite.live_words
+128          19.782           4.971   13.355              40195              17392            10754              2048             0              0   2048              25604
+256          43.253          10.285   29.960              80387              34800            21506              4096             0              0   4096              51204
+512          98.830          21.880   72.118             160771              69616            43010              8192             0              0   8192             102404
+```
+
+The important shape is still linear at the 128 -> 256 -> 512 steps: spills,
+rewrite live traffic, stack interval probes, hard interval probes, and interval
+marks all roughly double as the generated input doubles. This removes the last
+O1 allocator cleanup item identified by the MIR-shape report; further allocator
+work should now be justified by generated-code quality, O2 coalescing, or live
+range splitting rather than by O1 compile-time scaling.
+
 ## Performance Priorities
 
-1. Replace point-index occupancy loops when they become the next bottleneck.
-   `opt.alloc.hard_point_visits` and `opt.alloc.stack_point_visits` are now
-   linear, but their absolute counts are high. MIR's shape suggests
-   per-location interval/event occupancy as the next replacement for repeated
-   point-by-point OR and mark loops once live-range and rewrite traffic are
-   under control.
+1. Keep O1 on interval occupancy.
+   The normal O1 allocator now uses per-location interval sets instead of
+   point-index bitmap rows. Do not reintroduce point-by-point OR/mark loops for
+   O1; if O2 needs richer placement, build it as interval/event state with
+   explicit metrics.
 
 2. Keep the split `used_locs` design.
    `hard_loc_words` and `stack_loc_words` now scale linearly in both normal and
diff --git a/src/opt/pass_lower.c b/src/opt/pass_lower.c
@@ -586,22 +586,6 @@ static u32 loc_bit_words(u32 nbits) { return (nbits + 63u) / 64u; }
 
 static u32 hard_loc_bit(u8 cls, Reg r) { return ((u32)cls * 32u) + (u32)r; }
 
-static void loc_bit_set(u64* bits, u32 bit) {
-  bits[bit / 64u] |= 1ull << (bit % 64u);
-}
-
-static int loc_bit_has(const u64* bits, u32 bit) {
-  return (bits[bit / 64u] & (1ull << (bit % 64u))) != 0;
-}
-
-static void loc_bits_clear(u64* bits, u32 words) {
-  for (u32 i = 0; i < words; ++i) bits[i] = 0;
-}
-
-static void loc_bits_or(u64* dst, const u64* src, u32 words) {
-  for (u32 i = 0; i < words; ++i) dst[i] |= src[i];
-}
-
 typedef struct OptAllocCandidate {
   Val v;
   u32 spill_cost;
@@ -610,10 +594,21 @@ typedef struct OptAllocCandidate {
   u8 pad[3];
 } OptAllocCandidate;
 
+typedef struct AllocInterval {
+  u32 start;
+  u32 end;
+} AllocInterval;
+
+typedef struct AllocIntervalVec {
+  AllocInterval* data;
+  u32 n;
+  u32 cap;
+} AllocIntervalVec;
+
 typedef struct OptAllocator {
   OptLoc* locs;
-  u64* hard_used_locs;
-  u64* stack_used_locs;
+  AllocIntervalVec* hard_used_locs;
+  AllocIntervalVec* stack_used_locs;
   u32 point_count;
   u32 hard_loc_words;
   u32 stack_loc_words;
@@ -649,6 +644,98 @@ static void alloc_sort_candidates(OptAllocCandidate* cands, u32 n) {
   if (n > 1) qsort(cands, n, sizeof(cands[0]), alloc_candidate_cmp);
 }
 
+static u32 alloc_interval_lower_bound(const AllocIntervalVec* v, u32 start) {
+  u32 lo = 0;
+  u32 hi = v ? v->n : 0;
+  while (lo < hi) {
+    u32 mid = lo + (hi - lo) / 2u;
+    if (v->data[mid].start < start)
+      lo = mid + 1u;
+    else
+      hi = mid;
+  }
+  return lo;
+}
+
+static int alloc_interval_overlaps(const AllocIntervalVec* v, u32 start,
+                                   u32 end) {
+  if (!v || !v->n) return 0;
+  u32 pos = alloc_interval_lower_bound(v, start);
+  if (pos > 0) {
+    const AllocInterval* prev = &v->data[pos - 1u];
+    if (prev->end > start && prev->start < end) return 1;
+  }
+  if (pos < v->n) {
+    const AllocInterval* cur = &v->data[pos];
+    if (cur->start < end && cur->end > start) return 1;
+  }
+  return 0;
+}
+
+static void alloc_interval_grow(Func* f, AllocIntervalVec* v) {
+  if (v->n < v->cap) return;
+  u32 ncap = v->cap ? v->cap * 2u : 8u;
+  AllocInterval* nd = arena_array(f->arena, AllocInterval, ncap);
+  if (v->data) memcpy(nd, v->data, sizeof(v->data[0]) * v->n);
+  v->data = nd;
+  v->cap = ncap;
+}
+
+static void alloc_interval_insert(Func* f, AllocIntervalVec* v, u32 start,
+                                  u32 end) {
+  if (end <= start) end = start + 1u;
+  u32 pos = alloc_interval_lower_bound(v, start);
+  if (pos > 0 && v->data[pos - 1u].end >= start) {
+    --pos;
+    if (v->data[pos].end < end) v->data[pos].end = end;
+  } else {
+    alloc_interval_grow(f, v);
+    if (pos < v->n)
+      memmove(v->data + pos + 1u, v->data + pos,
+              sizeof(v->data[0]) * (v->n - pos));
+    v->data[pos].start = start;
+    v->data[pos].end = end;
+    ++v->n;
+  }
+
+  while (pos + 1u < v->n && v->data[pos + 1u].start <= v->data[pos].end) {
+    if (v->data[pos].end < v->data[pos + 1u].end)
+      v->data[pos].end = v->data[pos + 1u].end;
+    if (pos + 2u < v->n)
+      memmove(v->data + pos + 1u, v->data + pos + 2u,
+              sizeof(v->data[0]) * (v->n - pos - 2u));
+    --v->n;
+  }
+}
+
+static int alloc_ranges_overlap_vec(OptAllocator* a,
+                                    const OptLiveRangeSet* ranges, Val v,
+                                    const AllocIntervalVec* vec,
+                                    u64* visits) {
+  for (u32 r = ranges->first_range_by_val[v]; r != OPT_RANGE_NONE;
+       r = ranges->ranges[r].next) {
+    const OptLiveRange* lr = &ranges->ranges[r];
+    u32 end = lr->end < a->point_count ? lr->end : a->point_count;
+    if (visits) ++*visits;
+    if (lr->start >= end) continue;
+    if (alloc_interval_overlaps(vec, lr->start, end)) return 1;
+  }
+  return 0;
+}
+
+static void alloc_mark_vec(Func* f, OptAllocator* a,
+                           const OptLiveRangeSet* ranges, Val v,
+                           AllocIntervalVec* vec, u64* marks) {
+  for (u32 r = ranges->first_range_by_val[v]; r != OPT_RANGE_NONE;
+       r = ranges->ranges[r].next) {
+    const OptLiveRange* lr = &ranges->ranges[r];
+    u32 end = lr->end < a->point_count ? lr->end : a->point_count;
+    if (lr->start >= end) continue;
+    if (marks) ++*marks;
+    alloc_interval_insert(f, vec, lr->start, end);
+  }
+}
+
 static void opt_init_val_info_from_ranges(Func* f,
                                           const OptLiveRangeSet* ranges) {
   OptValInfo* old = f->val_info;
@@ -770,79 +857,51 @@ static int spill_slot_compatible(Func* f, FrameSlot fs, Val v) {
   return 1;
 }
 
-static void alloc_collect_occupied_hard(OptAllocator* a,
-                                        const OptLiveRangeSet* ranges, Val v,
-                                        u64* occupied) {
-  loc_bits_clear(occupied, a->hard_loc_words);
-  for (u32 r = ranges->first_range_by_val[v]; r != OPT_RANGE_NONE;
-       r = ranges->ranges[r].next) {
-    const OptLiveRange* lr = &ranges->ranges[r];
-    for (u32 p = lr->start; p < lr->end && p < a->point_count; ++p) {
-      ++a->hard_point_visits;
-      a->hard_word_ors += a->hard_loc_words;
-      loc_bits_or(occupied,
-                  a->hard_used_locs + ((size_t)p * a->hard_loc_words),
-                  a->hard_loc_words);
-    }
-  }
+static int alloc_hard_conflicts(OptAllocator* a,
+                                const OptLiveRangeSet* ranges, Val v,
+                                u32 bit) {
+  if (bit >= a->hard_loc_bits) return 1;
+  return alloc_ranges_overlap_vec(a, ranges, v, &a->hard_used_locs[bit],
+                                  &a->hard_point_visits);
 }
 
-static void alloc_collect_occupied_stack(OptAllocator* a,
-                                         const OptLiveRangeSet* ranges, Val v,
-                                         u64* occupied) {
-  loc_bits_clear(occupied, a->stack_loc_words);
-  if (!a->stack_loc_words) return;
-  for (u32 r = ranges->first_range_by_val[v]; r != OPT_RANGE_NONE;
-       r = ranges->ranges[r].next) {
-    const OptLiveRange* lr = &ranges->ranges[r];
-    for (u32 p = lr->start; p < lr->end && p < a->point_count; ++p) {
-      ++a->stack_point_visits;
-      a->stack_word_ors += a->stack_loc_words;
-      loc_bits_or(occupied,
-                  a->stack_used_locs + ((size_t)p * a->stack_loc_words),
-                  a->stack_loc_words);
-    }
-  }
+static int alloc_stack_conflicts(OptAllocator* a,
+                                 const OptLiveRangeSet* ranges, Val v,
+                                 u32 stack_idx) {
+  if (stack_idx >= a->stack_slot_count) return 1;
+  return alloc_ranges_overlap_vec(a, ranges, v, &a->stack_used_locs[stack_idx],
+                                  &a->stack_point_visits);
 }
 
 static void alloc_mark_hard_loc(OptAllocator* a, const OptLiveRangeSet* ranges,
-                                Val v, u32 bit) {
-  for (u32 r = ranges->first_range_by_val[v]; r != OPT_RANGE_NONE;
-       r = ranges->ranges[r].next) {
-    const OptLiveRange* lr = &ranges->ranges[r];
-    for (u32 p = lr->start; p < lr->end && p < a->point_count; ++p) {
-      ++a->hard_mark_points;
-      loc_bit_set(a->hard_used_locs + ((size_t)p * a->hard_loc_words), bit);
-    }
-  }
-}
-
-static void alloc_grow_stack_words(Func* f, OptAllocator* a, u32 need_words) {
-  if (need_words <= a->stack_loc_words) return;
-  u64* next =
-      arena_zarray(f->arena, u64, (size_t)a->point_count * need_words);
-  for (u32 p = 0; p < a->point_count; ++p) {
-    u64* dst = next + ((size_t)p * need_words);
-    if (a->stack_loc_words) {
-      const u64* src = a->stack_used_locs + ((size_t)p * a->stack_loc_words);
-      for (u32 w = 0; w < a->stack_loc_words; ++w) dst[w] = src[w];
-    }
+                                Val v, u32 bit, Func* f) {
+  if (bit >= a->hard_loc_bits) return;
+  alloc_mark_vec(f, a, ranges, v, &a->hard_used_locs[bit],
+                 &a->hard_mark_points);
+}
+
+static void alloc_grow_stack_locs(Func* f, OptAllocator* a, u32 need_slots) {
+  if (need_slots <= a->stack_slot_cap) return;
+  u32 ncap = a->stack_slot_cap ? a->stack_slot_cap * 2u : 16u;
+  while (ncap < need_slots) ncap *= 2u;
+  FrameSlot* ns = arena_array(f->arena, FrameSlot, ncap);
+  AllocIntervalVec* ni = arena_zarray(f->arena, AllocIntervalVec, ncap);
+  if (a->stack_slots) {
+    memcpy(ns, a->stack_slots, sizeof(a->stack_slots[0]) *
+                                a->stack_slot_count);
+    memcpy(ni, a->stack_used_locs, sizeof(a->stack_used_locs[0]) *
+                                    a->stack_slot_count);
   }
-  a->stack_used_locs = next;
-  a->stack_loc_words = need_words;
+  a->stack_slots = ns;
+  a->stack_used_locs = ni;
+  a->stack_slot_cap = ncap;
 }
 
 static void alloc_mark_stack_loc(OptAllocator* a, const OptLiveRangeSet* ranges,
-                                 Val v, u32 stack_idx) {
-  for (u32 r = ranges->first_range_by_val[v]; r != OPT_RANGE_NONE;
-       r = ranges->ranges[r].next) {
-    const OptLiveRange* lr = &ranges->ranges[r];
-    for (u32 p = lr->start; p < lr->end && p < a->point_count; ++p) {
-      ++a->stack_mark_points;
-      loc_bit_set(a->stack_used_locs + ((size_t)p * a->stack_loc_words),
-                  stack_idx);
-    }
-  }
+                                 Val v, u32 stack_idx, Func* f) {
+  if (stack_idx >= a->stack_slot_count) return;
+  alloc_mark_vec(f, a, ranges, v, &a->stack_used_locs[stack_idx],
+                 &a->stack_mark_points);
 }
 
 static void alloc_assign_hard(Func* f, OptAllocator* a,
@@ -854,17 +913,16 @@ static void alloc_assign_hard(Func* f, OptAllocator* a,
   a->locs[v].cls = vi->cls;
   a->locs[v].hard_reg = r;
   a->locs[v].spill_slot = FRAME_SLOT_NONE;
-  alloc_mark_hard_loc(a, ranges, v, hard_loc_bit(vi->cls, r));
+  alloc_mark_hard_loc(a, ranges, v, hard_loc_bit(vi->cls, r), f);
 }
 
 static void alloc_assign_stack(Func* f, OptAllocator* a,
-                               const OptLiveRangeSet* ranges, Val v,
-                               const u64* occupied) {
+                               const OptLiveRangeSet* ranges, Val v) {
   OptValInfo* vi = &f->val_info[v];
   u32 stack_idx = a->stack_slot_count;
   FrameSlot slot = FRAME_SLOT_NONE;
   for (u32 i = 0; i < a->stack_slot_count; ++i) {
-    if (loc_bit_has(occupied, i)) continue;
+    if (alloc_stack_conflicts(a, ranges, v, i)) continue;
     if (!spill_slot_compatible(f, a->stack_slots[i], v)) continue;
     stack_idx = i;
     slot = a->stack_slots[i];
@@ -872,18 +930,10 @@ static void alloc_assign_stack(Func* f, OptAllocator* a,
   }
   if (slot == FRAME_SLOT_NONE) {
     slot = spill_slot_for(f, v);
-    if (a->stack_slot_count == a->stack_slot_cap) {
-      u32 ncap = a->stack_slot_cap ? a->stack_slot_cap * 2u : 16u;
-      FrameSlot* ns = arena_array(f->arena, FrameSlot, ncap);
-      if (a->stack_slots)
-        memcpy(ns, a->stack_slots, sizeof(a->stack_slots[0]) *
-                                    a->stack_slot_count);
-      a->stack_slots = ns;
-      a->stack_slot_cap = ncap;
-    }
+    alloc_grow_stack_locs(f, a, a->stack_slot_count + 1u);
     a->stack_slots[a->stack_slot_count++] = slot;
     stack_idx = a->stack_slot_count - 1u;
-    alloc_grow_stack_words(f, a, loc_bit_words(a->stack_slot_count));
+    a->stack_loc_words = loc_bit_words(a->stack_slot_count);
   } else {
     vi->spill_slot = slot;
   }
@@ -892,7 +942,7 @@ static void alloc_assign_stack(Func* f, OptAllocator* a,
   a->locs[v].cls = vi->cls;
   a->locs[v].hard_reg = REG_NONE;
   a->locs[v].spill_slot = slot;
-  alloc_mark_stack_loc(a, ranges, v, stack_idx);
+  alloc_mark_stack_loc(a, ranges, v, stack_idx, f);
 }
 
 static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges,
@@ -906,9 +956,11 @@ static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges,
   a->hard_loc_words = loc_bit_words(a->hard_loc_bits);
   a->locs = arena_zarray(f->arena, OptLoc, f->nvals ? f->nvals : 1u);
   a->hard_used_locs =
-      arena_zarray(f->arena, u64, (size_t)a->point_count * a->hard_loc_words);
+      arena_zarray(f->arena, AllocIntervalVec,
+                   a->hard_loc_bits ? a->hard_loc_bits : 1u);
   a->stack_slots = NULL;
   a->stack_slot_cap = 0;
+  a->stack_used_locs = NULL;
 
   OptAllocCandidate* cands =
       arena_array(f->arena, OptAllocCandidate, ncands ? ncands : 1u);
@@ -925,15 +977,10 @@ static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges,
   }
   alloc_sort_candidates(cands, n);
 
-  u64* hard_occupied =
-      arena_zarray(f->arena, u64, a->hard_loc_words ? a->hard_loc_words : 1u);
-  u64* stack_occupied =
-      arena_zarray(f->arena, u64, loc_bit_words(ncands ? ncands : 1u));
   for (u32 i = 0; i < n; ++i) {
     Val v = cands[i].v;
     OptValInfo* vi = &f->val_info[v];
     u8 cls = vi->cls;
-    alloc_collect_occupied_hard(a, ranges, v, hard_occupied);
     if (vi->tied_hard_reg >= 0) {
       Reg fixed = (Reg)vi->tied_hard_reg;
       if (!hard_available(f, cls, fixed)) {
@@ -950,7 +997,7 @@ static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges,
                        (unsigned)fixed);
       }
       if (fixed >= 32 ||
-          loc_bit_has(hard_occupied, hard_loc_bit(cls, fixed))) {
+          alloc_hard_conflicts(a, ranges, v, hard_loc_bit(cls, fixed))) {
         SrcLoc loc = {0, 0, 0};
         compiler_panic(f->c, loc, "opt regalloc: conflicting fixed hard reg %u",
                        (unsigned)fixed);
@@ -965,14 +1012,13 @@ static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges,
       if (hr >= 32) continue;
       if (vi->forbidden_hard_regs & (1u << hr)) continue;
       if (vi->live_across_call_freq && is_caller_saved(f, cls, hr)) continue;
-      if (loc_bit_has(hard_occupied, hard_loc_bit(cls, hr))) continue;
+      if (alloc_hard_conflicts(a, ranges, v, hard_loc_bit(cls, hr))) continue;
       alloc_assign_hard(f, a, ranges, v, hr);
       found = 1;
       break;
     }
     if (!found) {
-      alloc_collect_occupied_stack(a, ranges, v, stack_occupied);
-      alloc_assign_stack(f, a, ranges, v, stack_occupied);
+      alloc_assign_stack(f, a, ranges, v);
     }
   }
   f->opt_alloc_hard_loc_words = a->point_count * a->hard_loc_words;

	kit kit
	git clone https://git.ryansepassi.com/git/kit.git
	Log \| Files \| Refs \| README

M	doc/MIR_RA_REPORT.md	\|	34	++++++++++++++++++++--------------
M	doc/PERF.md	\|	36	++++++++++++++++++++++++++++++------
M	src/opt/pass_lower.c	\|	260	++++++++++++++++++++++++++++++++++++++++++++++---------------------------------