commit dbf13315d00e0f39eb2202d426bff83f68f8a0b3
parent 644fccea6714b06256d79a0a0e19f7de24044ae5
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Fri, 15 May 2026 00:41:55 -0700
Optimize O1 allocator occupancy
Diffstat:
3 files changed, 203 insertions(+), 127 deletions(-)
diff --git a/doc/MIR_RA_REPORT.md b/doc/MIR_RA_REPORT.md
@@ -45,11 +45,11 @@ matrix.
As of the current O1 implementation, the dense pseudo-conflict allocator is no
longer the normal path. O1 now has pass-local block liveness, compressed live
ranges, split hard-register and stack-slot occupancy, assignment-map rewrite,
-and metrics for live/range/allocation/rewrite event traffic. The remaining work
-is not "delete the old matrix"; it is to make the new path more MIR-shaped by
-removing the remaining high-watermark bitmap traffic, reducing live-range build
-cost on spill-heavy wide functions, and replacing repeated range-point
-occupancy walks when they become the dominant bucket.
+per-location interval occupancy, and metrics for live/range/allocation/rewrite
+event traffic. The remaining work is no longer "delete the old matrix" or
+"replace O1 point-index occupancy"; those cleanup steps are complete. The next
+allocator work should move to O2 coalescing/splitting and generated-code
+quality, while keeping O1's fast MIR-shaped path as the baseline.
## MIR Container Layer
@@ -384,14 +384,20 @@ The initial O1 implementation had five major structural mismatches:
4. Hot loops scanned all vals.
Many of these loops now use operand-reference events, set-bit iteration, or
- touched-value/generation lists. Remaining work is to remove high-watermark
- bitmap copies and any finalization paths that still scale with the highest
- value id instead of live or touched values.
+ touched-value/generation lists. The final O1 cleanup removed the remaining
+ high-watermark live-across-call/rewrite scans that showed up in spill-heavy
+ wide functions.
5. `Block.live_after` stored a full live bitmap per instruction.
Normal O1 no longer persists per-instruction full live-after storage.
- Rewrite and DDE walk backward with rolling live state, but rewrite still
- shows superlinear live-word traffic on spill-heavy wide functions.
+ Rewrite and DDE walk backward with rolling live state, and rewrite's
+ live-word traffic is now linear in the focused spill-pressure ladder.
+
+6. Point-index occupancy loops remained after the matrix was removed.
+ Normal O1 allocation now stores sorted live intervals per physical
+ hard-register and per stack spill slot. Assignment checks interval overlap
+ for each candidate range instead of OR-ing occupied-location bitmap rows for
+ every compressed point.
## Proposed cfree Structural Target
@@ -508,11 +514,11 @@ Deliverables:
After O1 has the right shape, remove transition scaffolding and fix the
remaining O1 scaling buckets before adding O2 allocator features:
-- rewrite without extra per-instruction temporary list churn
-- delete the old dense liveness/conflict path instead of preserving
- compatibility for old optimizer-internal tests
+- rewrite without extra per-instruction temporary list churn: done
+- delete the old dense liveness/conflict path from the normal O1 route: done
- narrow or compress `used_locs` so stack occupancy does not scale as
- `points * candidates`
+ `points * candidates`: done via split hard/stack occupancy and then
+ per-location interval sets
Only after that should O2 add move collection, move-only liveness, capped
coalescing matrices, live-range splitting, and edge/block spill placement. This
diff --git a/doc/PERF.md b/doc/PERF.md
@@ -778,14 +778,38 @@ all double as the input doubles. The wall-time buckets are close enough to the
linear target for O1 that remaining occupancy-loop work is no longer blocking
the O2 coalescing/splitting work.
+### O1 Occupancy-Interval Cleanup
+
+The final O1 cleanup replaces the allocator's point-index occupied-location
+bitmaps with sorted per-location interval sets. Assignment still consumes the
+same compressed live ranges and produces the same `OptLoc` map for rewrite, but
+hard-register and stack-slot conflict checks no longer OR per-point bitmap
+rows. The old `hard_word_ors` and `stack_word_ors` counters therefore stay at
+zero on this path.
+
+Focused single-sample spill-pressure sanity run on a fresh generated input:
+
+```text
+groups opt.o1.total live_ranges_reg regalloc hard_point_visits stack_point_visits hard_mark_points stack_mark_points hard_word_ors stack_word_ors spills rewrite.live_words
+128 19.782 4.971 13.355 40195 17392 10754 2048 0 0 2048 25604
+256 43.253 10.285 29.960 80387 34800 21506 4096 0 0 4096 51204
+512 98.830 21.880 72.118 160771 69616 43010 8192 0 0 8192 102404
+```
+
+The important shape is still linear at the 128 -> 256 -> 512 steps: spills,
+rewrite live traffic, stack interval probes, hard interval probes, and interval
+marks all roughly double as the generated input doubles. This removes the last
+O1 allocator cleanup item identified by the MIR-shape report; further allocator
+work should now be justified by generated-code quality, O2 coalescing, or live
+range splitting rather than by O1 compile-time scaling.
+
## Performance Priorities
-1. Replace point-index occupancy loops when they become the next bottleneck.
- `opt.alloc.hard_point_visits` and `opt.alloc.stack_point_visits` are now
- linear, but their absolute counts are high. MIR's shape suggests
- per-location interval/event occupancy as the next replacement for repeated
- point-by-point OR and mark loops once live-range and rewrite traffic are
- under control.
+1. Keep O1 on interval occupancy.
+ The normal O1 allocator now uses per-location interval sets instead of
+ point-index bitmap rows. Do not reintroduce point-by-point OR/mark loops for
+ O1; if O2 needs richer placement, build it as interval/event state with
+ explicit metrics.
2. Keep the split `used_locs` design.
`hard_loc_words` and `stack_loc_words` now scale linearly in both normal and
diff --git a/src/opt/pass_lower.c b/src/opt/pass_lower.c
@@ -586,22 +586,6 @@ static u32 loc_bit_words(u32 nbits) { return (nbits + 63u) / 64u; }
static u32 hard_loc_bit(u8 cls, Reg r) { return ((u32)cls * 32u) + (u32)r; }
-static void loc_bit_set(u64* bits, u32 bit) {
- bits[bit / 64u] |= 1ull << (bit % 64u);
-}
-
-static int loc_bit_has(const u64* bits, u32 bit) {
- return (bits[bit / 64u] & (1ull << (bit % 64u))) != 0;
-}
-
-static void loc_bits_clear(u64* bits, u32 words) {
- for (u32 i = 0; i < words; ++i) bits[i] = 0;
-}
-
-static void loc_bits_or(u64* dst, const u64* src, u32 words) {
- for (u32 i = 0; i < words; ++i) dst[i] |= src[i];
-}
-
typedef struct OptAllocCandidate {
Val v;
u32 spill_cost;
@@ -610,10 +594,21 @@ typedef struct OptAllocCandidate {
u8 pad[3];
} OptAllocCandidate;
+typedef struct AllocInterval {
+ u32 start;
+ u32 end;
+} AllocInterval;
+
+typedef struct AllocIntervalVec {
+ AllocInterval* data;
+ u32 n;
+ u32 cap;
+} AllocIntervalVec;
+
typedef struct OptAllocator {
OptLoc* locs;
- u64* hard_used_locs;
- u64* stack_used_locs;
+ AllocIntervalVec* hard_used_locs;
+ AllocIntervalVec* stack_used_locs;
u32 point_count;
u32 hard_loc_words;
u32 stack_loc_words;
@@ -649,6 +644,98 @@ static void alloc_sort_candidates(OptAllocCandidate* cands, u32 n) {
if (n > 1) qsort(cands, n, sizeof(cands[0]), alloc_candidate_cmp);
}
+static u32 alloc_interval_lower_bound(const AllocIntervalVec* v, u32 start) {
+ u32 lo = 0;
+ u32 hi = v ? v->n : 0;
+ while (lo < hi) {
+ u32 mid = lo + (hi - lo) / 2u;
+ if (v->data[mid].start < start)
+ lo = mid + 1u;
+ else
+ hi = mid;
+ }
+ return lo;
+}
+
+static int alloc_interval_overlaps(const AllocIntervalVec* v, u32 start,
+ u32 end) {
+ if (!v || !v->n) return 0;
+ u32 pos = alloc_interval_lower_bound(v, start);
+ if (pos > 0) {
+ const AllocInterval* prev = &v->data[pos - 1u];
+ if (prev->end > start && prev->start < end) return 1;
+ }
+ if (pos < v->n) {
+ const AllocInterval* cur = &v->data[pos];
+ if (cur->start < end && cur->end > start) return 1;
+ }
+ return 0;
+}
+
+static void alloc_interval_grow(Func* f, AllocIntervalVec* v) {
+ if (v->n < v->cap) return;
+ u32 ncap = v->cap ? v->cap * 2u : 8u;
+ AllocInterval* nd = arena_array(f->arena, AllocInterval, ncap);
+ if (v->data) memcpy(nd, v->data, sizeof(v->data[0]) * v->n);
+ v->data = nd;
+ v->cap = ncap;
+}
+
+static void alloc_interval_insert(Func* f, AllocIntervalVec* v, u32 start,
+ u32 end) {
+ if (end <= start) end = start + 1u;
+ u32 pos = alloc_interval_lower_bound(v, start);
+ if (pos > 0 && v->data[pos - 1u].end >= start) {
+ --pos;
+ if (v->data[pos].end < end) v->data[pos].end = end;
+ } else {
+ alloc_interval_grow(f, v);
+ if (pos < v->n)
+ memmove(v->data + pos + 1u, v->data + pos,
+ sizeof(v->data[0]) * (v->n - pos));
+ v->data[pos].start = start;
+ v->data[pos].end = end;
+ ++v->n;
+ }
+
+ while (pos + 1u < v->n && v->data[pos + 1u].start <= v->data[pos].end) {
+ if (v->data[pos].end < v->data[pos + 1u].end)
+ v->data[pos].end = v->data[pos + 1u].end;
+ if (pos + 2u < v->n)
+ memmove(v->data + pos + 1u, v->data + pos + 2u,
+ sizeof(v->data[0]) * (v->n - pos - 2u));
+ --v->n;
+ }
+}
+
+static int alloc_ranges_overlap_vec(OptAllocator* a,
+ const OptLiveRangeSet* ranges, Val v,
+ const AllocIntervalVec* vec,
+ u64* visits) {
+ for (u32 r = ranges->first_range_by_val[v]; r != OPT_RANGE_NONE;
+ r = ranges->ranges[r].next) {
+ const OptLiveRange* lr = &ranges->ranges[r];
+ u32 end = lr->end < a->point_count ? lr->end : a->point_count;
+ if (visits) ++*visits;
+ if (lr->start >= end) continue;
+ if (alloc_interval_overlaps(vec, lr->start, end)) return 1;
+ }
+ return 0;
+}
+
+static void alloc_mark_vec(Func* f, OptAllocator* a,
+ const OptLiveRangeSet* ranges, Val v,
+ AllocIntervalVec* vec, u64* marks) {
+ for (u32 r = ranges->first_range_by_val[v]; r != OPT_RANGE_NONE;
+ r = ranges->ranges[r].next) {
+ const OptLiveRange* lr = &ranges->ranges[r];
+ u32 end = lr->end < a->point_count ? lr->end : a->point_count;
+ if (lr->start >= end) continue;
+ if (marks) ++*marks;
+ alloc_interval_insert(f, vec, lr->start, end);
+ }
+}
+
static void opt_init_val_info_from_ranges(Func* f,
const OptLiveRangeSet* ranges) {
OptValInfo* old = f->val_info;
@@ -770,79 +857,51 @@ static int spill_slot_compatible(Func* f, FrameSlot fs, Val v) {
return 1;
}
-static void alloc_collect_occupied_hard(OptAllocator* a,
- const OptLiveRangeSet* ranges, Val v,
- u64* occupied) {
- loc_bits_clear(occupied, a->hard_loc_words);
- for (u32 r = ranges->first_range_by_val[v]; r != OPT_RANGE_NONE;
- r = ranges->ranges[r].next) {
- const OptLiveRange* lr = &ranges->ranges[r];
- for (u32 p = lr->start; p < lr->end && p < a->point_count; ++p) {
- ++a->hard_point_visits;
- a->hard_word_ors += a->hard_loc_words;
- loc_bits_or(occupied,
- a->hard_used_locs + ((size_t)p * a->hard_loc_words),
- a->hard_loc_words);
- }
- }
+static int alloc_hard_conflicts(OptAllocator* a,
+ const OptLiveRangeSet* ranges, Val v,
+ u32 bit) {
+ if (bit >= a->hard_loc_bits) return 1;
+ return alloc_ranges_overlap_vec(a, ranges, v, &a->hard_used_locs[bit],
+ &a->hard_point_visits);
}
-static void alloc_collect_occupied_stack(OptAllocator* a,
- const OptLiveRangeSet* ranges, Val v,
- u64* occupied) {
- loc_bits_clear(occupied, a->stack_loc_words);
- if (!a->stack_loc_words) return;
- for (u32 r = ranges->first_range_by_val[v]; r != OPT_RANGE_NONE;
- r = ranges->ranges[r].next) {
- const OptLiveRange* lr = &ranges->ranges[r];
- for (u32 p = lr->start; p < lr->end && p < a->point_count; ++p) {
- ++a->stack_point_visits;
- a->stack_word_ors += a->stack_loc_words;
- loc_bits_or(occupied,
- a->stack_used_locs + ((size_t)p * a->stack_loc_words),
- a->stack_loc_words);
- }
- }
+static int alloc_stack_conflicts(OptAllocator* a,
+ const OptLiveRangeSet* ranges, Val v,
+ u32 stack_idx) {
+ if (stack_idx >= a->stack_slot_count) return 1;
+ return alloc_ranges_overlap_vec(a, ranges, v, &a->stack_used_locs[stack_idx],
+ &a->stack_point_visits);
}
static void alloc_mark_hard_loc(OptAllocator* a, const OptLiveRangeSet* ranges,
- Val v, u32 bit) {
- for (u32 r = ranges->first_range_by_val[v]; r != OPT_RANGE_NONE;
- r = ranges->ranges[r].next) {
- const OptLiveRange* lr = &ranges->ranges[r];
- for (u32 p = lr->start; p < lr->end && p < a->point_count; ++p) {
- ++a->hard_mark_points;
- loc_bit_set(a->hard_used_locs + ((size_t)p * a->hard_loc_words), bit);
- }
- }
-}
-
-static void alloc_grow_stack_words(Func* f, OptAllocator* a, u32 need_words) {
- if (need_words <= a->stack_loc_words) return;
- u64* next =
- arena_zarray(f->arena, u64, (size_t)a->point_count * need_words);
- for (u32 p = 0; p < a->point_count; ++p) {
- u64* dst = next + ((size_t)p * need_words);
- if (a->stack_loc_words) {
- const u64* src = a->stack_used_locs + ((size_t)p * a->stack_loc_words);
- for (u32 w = 0; w < a->stack_loc_words; ++w) dst[w] = src[w];
- }
+ Val v, u32 bit, Func* f) {
+ if (bit >= a->hard_loc_bits) return;
+ alloc_mark_vec(f, a, ranges, v, &a->hard_used_locs[bit],
+ &a->hard_mark_points);
+}
+
+static void alloc_grow_stack_locs(Func* f, OptAllocator* a, u32 need_slots) {
+ if (need_slots <= a->stack_slot_cap) return;
+ u32 ncap = a->stack_slot_cap ? a->stack_slot_cap * 2u : 16u;
+ while (ncap < need_slots) ncap *= 2u;
+ FrameSlot* ns = arena_array(f->arena, FrameSlot, ncap);
+ AllocIntervalVec* ni = arena_zarray(f->arena, AllocIntervalVec, ncap);
+ if (a->stack_slots) {
+ memcpy(ns, a->stack_slots, sizeof(a->stack_slots[0]) *
+ a->stack_slot_count);
+ memcpy(ni, a->stack_used_locs, sizeof(a->stack_used_locs[0]) *
+ a->stack_slot_count);
}
- a->stack_used_locs = next;
- a->stack_loc_words = need_words;
+ a->stack_slots = ns;
+ a->stack_used_locs = ni;
+ a->stack_slot_cap = ncap;
}
static void alloc_mark_stack_loc(OptAllocator* a, const OptLiveRangeSet* ranges,
- Val v, u32 stack_idx) {
- for (u32 r = ranges->first_range_by_val[v]; r != OPT_RANGE_NONE;
- r = ranges->ranges[r].next) {
- const OptLiveRange* lr = &ranges->ranges[r];
- for (u32 p = lr->start; p < lr->end && p < a->point_count; ++p) {
- ++a->stack_mark_points;
- loc_bit_set(a->stack_used_locs + ((size_t)p * a->stack_loc_words),
- stack_idx);
- }
- }
+ Val v, u32 stack_idx, Func* f) {
+ if (stack_idx >= a->stack_slot_count) return;
+ alloc_mark_vec(f, a, ranges, v, &a->stack_used_locs[stack_idx],
+ &a->stack_mark_points);
}
static void alloc_assign_hard(Func* f, OptAllocator* a,
@@ -854,17 +913,16 @@ static void alloc_assign_hard(Func* f, OptAllocator* a,
a->locs[v].cls = vi->cls;
a->locs[v].hard_reg = r;
a->locs[v].spill_slot = FRAME_SLOT_NONE;
- alloc_mark_hard_loc(a, ranges, v, hard_loc_bit(vi->cls, r));
+ alloc_mark_hard_loc(a, ranges, v, hard_loc_bit(vi->cls, r), f);
}
static void alloc_assign_stack(Func* f, OptAllocator* a,
- const OptLiveRangeSet* ranges, Val v,
- const u64* occupied) {
+ const OptLiveRangeSet* ranges, Val v) {
OptValInfo* vi = &f->val_info[v];
u32 stack_idx = a->stack_slot_count;
FrameSlot slot = FRAME_SLOT_NONE;
for (u32 i = 0; i < a->stack_slot_count; ++i) {
- if (loc_bit_has(occupied, i)) continue;
+ if (alloc_stack_conflicts(a, ranges, v, i)) continue;
if (!spill_slot_compatible(f, a->stack_slots[i], v)) continue;
stack_idx = i;
slot = a->stack_slots[i];
@@ -872,18 +930,10 @@ static void alloc_assign_stack(Func* f, OptAllocator* a,
}
if (slot == FRAME_SLOT_NONE) {
slot = spill_slot_for(f, v);
- if (a->stack_slot_count == a->stack_slot_cap) {
- u32 ncap = a->stack_slot_cap ? a->stack_slot_cap * 2u : 16u;
- FrameSlot* ns = arena_array(f->arena, FrameSlot, ncap);
- if (a->stack_slots)
- memcpy(ns, a->stack_slots, sizeof(a->stack_slots[0]) *
- a->stack_slot_count);
- a->stack_slots = ns;
- a->stack_slot_cap = ncap;
- }
+ alloc_grow_stack_locs(f, a, a->stack_slot_count + 1u);
a->stack_slots[a->stack_slot_count++] = slot;
stack_idx = a->stack_slot_count - 1u;
- alloc_grow_stack_words(f, a, loc_bit_words(a->stack_slot_count));
+ a->stack_loc_words = loc_bit_words(a->stack_slot_count);
} else {
vi->spill_slot = slot;
}
@@ -892,7 +942,7 @@ static void alloc_assign_stack(Func* f, OptAllocator* a,
a->locs[v].cls = vi->cls;
a->locs[v].hard_reg = REG_NONE;
a->locs[v].spill_slot = slot;
- alloc_mark_stack_loc(a, ranges, v, stack_idx);
+ alloc_mark_stack_loc(a, ranges, v, stack_idx, f);
}
static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges,
@@ -906,9 +956,11 @@ static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges,
a->hard_loc_words = loc_bit_words(a->hard_loc_bits);
a->locs = arena_zarray(f->arena, OptLoc, f->nvals ? f->nvals : 1u);
a->hard_used_locs =
- arena_zarray(f->arena, u64, (size_t)a->point_count * a->hard_loc_words);
+ arena_zarray(f->arena, AllocIntervalVec,
+ a->hard_loc_bits ? a->hard_loc_bits : 1u);
a->stack_slots = NULL;
a->stack_slot_cap = 0;
+ a->stack_used_locs = NULL;
OptAllocCandidate* cands =
arena_array(f->arena, OptAllocCandidate, ncands ? ncands : 1u);
@@ -925,15 +977,10 @@ static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges,
}
alloc_sort_candidates(cands, n);
- u64* hard_occupied =
- arena_zarray(f->arena, u64, a->hard_loc_words ? a->hard_loc_words : 1u);
- u64* stack_occupied =
- arena_zarray(f->arena, u64, loc_bit_words(ncands ? ncands : 1u));
for (u32 i = 0; i < n; ++i) {
Val v = cands[i].v;
OptValInfo* vi = &f->val_info[v];
u8 cls = vi->cls;
- alloc_collect_occupied_hard(a, ranges, v, hard_occupied);
if (vi->tied_hard_reg >= 0) {
Reg fixed = (Reg)vi->tied_hard_reg;
if (!hard_available(f, cls, fixed)) {
@@ -950,7 +997,7 @@ static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges,
(unsigned)fixed);
}
if (fixed >= 32 ||
- loc_bit_has(hard_occupied, hard_loc_bit(cls, fixed))) {
+ alloc_hard_conflicts(a, ranges, v, hard_loc_bit(cls, fixed))) {
SrcLoc loc = {0, 0, 0};
compiler_panic(f->c, loc, "opt regalloc: conflicting fixed hard reg %u",
(unsigned)fixed);
@@ -965,14 +1012,13 @@ static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges,
if (hr >= 32) continue;
if (vi->forbidden_hard_regs & (1u << hr)) continue;
if (vi->live_across_call_freq && is_caller_saved(f, cls, hr)) continue;
- if (loc_bit_has(hard_occupied, hard_loc_bit(cls, hr))) continue;
+ if (alloc_hard_conflicts(a, ranges, v, hard_loc_bit(cls, hr))) continue;
alloc_assign_hard(f, a, ranges, v, hr);
found = 1;
break;
}
if (!found) {
- alloc_collect_occupied_stack(a, ranges, v, stack_occupied);
- alloc_assign_stack(f, a, ranges, v, stack_occupied);
+ alloc_assign_stack(f, a, ranges, v);
}
}
f->opt_alloc_hard_loc_words = a->point_count * a->hard_loc_words;