kit

kit
git clone https://git.ryansepassi.com/git/kit.git
Log | Files | Refs | README

commit dbf13315d00e0f39eb2202d426bff83f68f8a0b3
parent 644fccea6714b06256d79a0a0e19f7de24044ae5
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Fri, 15 May 2026 00:41:55 -0700

Optimize O1 allocator occupancy

Diffstat:
Mdoc/MIR_RA_REPORT.md | 34++++++++++++++++++++--------------
Mdoc/PERF.md | 36++++++++++++++++++++++++++++++------
Msrc/opt/pass_lower.c | 260++++++++++++++++++++++++++++++++++++++++++++++---------------------------------
3 files changed, 203 insertions(+), 127 deletions(-)

diff --git a/doc/MIR_RA_REPORT.md b/doc/MIR_RA_REPORT.md @@ -45,11 +45,11 @@ matrix. As of the current O1 implementation, the dense pseudo-conflict allocator is no longer the normal path. O1 now has pass-local block liveness, compressed live ranges, split hard-register and stack-slot occupancy, assignment-map rewrite, -and metrics for live/range/allocation/rewrite event traffic. The remaining work -is not "delete the old matrix"; it is to make the new path more MIR-shaped by -removing the remaining high-watermark bitmap traffic, reducing live-range build -cost on spill-heavy wide functions, and replacing repeated range-point -occupancy walks when they become the dominant bucket. +per-location interval occupancy, and metrics for live/range/allocation/rewrite +event traffic. The remaining work is no longer "delete the old matrix" or +"replace O1 point-index occupancy"; those cleanup steps are complete. The next +allocator work should move to O2 coalescing/splitting and generated-code +quality, while keeping O1's fast MIR-shaped path as the baseline. ## MIR Container Layer @@ -384,14 +384,20 @@ The initial O1 implementation had five major structural mismatches: 4. Hot loops scanned all vals. Many of these loops now use operand-reference events, set-bit iteration, or - touched-value/generation lists. Remaining work is to remove high-watermark - bitmap copies and any finalization paths that still scale with the highest - value id instead of live or touched values. + touched-value/generation lists. The final O1 cleanup removed the remaining + high-watermark live-across-call/rewrite scans that showed up in spill-heavy + wide functions. 5. `Block.live_after` stored a full live bitmap per instruction. Normal O1 no longer persists per-instruction full live-after storage. - Rewrite and DDE walk backward with rolling live state, but rewrite still - shows superlinear live-word traffic on spill-heavy wide functions. + Rewrite and DDE walk backward with rolling live state, and rewrite's + live-word traffic is now linear in the focused spill-pressure ladder. + +6. Point-index occupancy loops remained after the matrix was removed. + Normal O1 allocation now stores sorted live intervals per physical + hard-register and per stack spill slot. Assignment checks interval overlap + for each candidate range instead of OR-ing occupied-location bitmap rows for + every compressed point. ## Proposed cfree Structural Target @@ -508,11 +514,11 @@ Deliverables: After O1 has the right shape, remove transition scaffolding and fix the remaining O1 scaling buckets before adding O2 allocator features: -- rewrite without extra per-instruction temporary list churn -- delete the old dense liveness/conflict path instead of preserving - compatibility for old optimizer-internal tests +- rewrite without extra per-instruction temporary list churn: done +- delete the old dense liveness/conflict path from the normal O1 route: done - narrow or compress `used_locs` so stack occupancy does not scale as - `points * candidates` + `points * candidates`: done via split hard/stack occupancy and then + per-location interval sets Only after that should O2 add move collection, move-only liveness, capped coalescing matrices, live-range splitting, and edge/block spill placement. This diff --git a/doc/PERF.md b/doc/PERF.md @@ -778,14 +778,38 @@ all double as the input doubles. The wall-time buckets are close enough to the linear target for O1 that remaining occupancy-loop work is no longer blocking the O2 coalescing/splitting work. +### O1 Occupancy-Interval Cleanup + +The final O1 cleanup replaces the allocator's point-index occupied-location +bitmaps with sorted per-location interval sets. Assignment still consumes the +same compressed live ranges and produces the same `OptLoc` map for rewrite, but +hard-register and stack-slot conflict checks no longer OR per-point bitmap +rows. The old `hard_word_ors` and `stack_word_ors` counters therefore stay at +zero on this path. + +Focused single-sample spill-pressure sanity run on a fresh generated input: + +```text +groups opt.o1.total live_ranges_reg regalloc hard_point_visits stack_point_visits hard_mark_points stack_mark_points hard_word_ors stack_word_ors spills rewrite.live_words +128 19.782 4.971 13.355 40195 17392 10754 2048 0 0 2048 25604 +256 43.253 10.285 29.960 80387 34800 21506 4096 0 0 4096 51204 +512 98.830 21.880 72.118 160771 69616 43010 8192 0 0 8192 102404 +``` + +The important shape is still linear at the 128 -> 256 -> 512 steps: spills, +rewrite live traffic, stack interval probes, hard interval probes, and interval +marks all roughly double as the generated input doubles. This removes the last +O1 allocator cleanup item identified by the MIR-shape report; further allocator +work should now be justified by generated-code quality, O2 coalescing, or live +range splitting rather than by O1 compile-time scaling. + ## Performance Priorities -1. Replace point-index occupancy loops when they become the next bottleneck. - `opt.alloc.hard_point_visits` and `opt.alloc.stack_point_visits` are now - linear, but their absolute counts are high. MIR's shape suggests - per-location interval/event occupancy as the next replacement for repeated - point-by-point OR and mark loops once live-range and rewrite traffic are - under control. +1. Keep O1 on interval occupancy. + The normal O1 allocator now uses per-location interval sets instead of + point-index bitmap rows. Do not reintroduce point-by-point OR/mark loops for + O1; if O2 needs richer placement, build it as interval/event state with + explicit metrics. 2. Keep the split `used_locs` design. `hard_loc_words` and `stack_loc_words` now scale linearly in both normal and diff --git a/src/opt/pass_lower.c b/src/opt/pass_lower.c @@ -586,22 +586,6 @@ static u32 loc_bit_words(u32 nbits) { return (nbits + 63u) / 64u; } static u32 hard_loc_bit(u8 cls, Reg r) { return ((u32)cls * 32u) + (u32)r; } -static void loc_bit_set(u64* bits, u32 bit) { - bits[bit / 64u] |= 1ull << (bit % 64u); -} - -static int loc_bit_has(const u64* bits, u32 bit) { - return (bits[bit / 64u] & (1ull << (bit % 64u))) != 0; -} - -static void loc_bits_clear(u64* bits, u32 words) { - for (u32 i = 0; i < words; ++i) bits[i] = 0; -} - -static void loc_bits_or(u64* dst, const u64* src, u32 words) { - for (u32 i = 0; i < words; ++i) dst[i] |= src[i]; -} - typedef struct OptAllocCandidate { Val v; u32 spill_cost; @@ -610,10 +594,21 @@ typedef struct OptAllocCandidate { u8 pad[3]; } OptAllocCandidate; +typedef struct AllocInterval { + u32 start; + u32 end; +} AllocInterval; + +typedef struct AllocIntervalVec { + AllocInterval* data; + u32 n; + u32 cap; +} AllocIntervalVec; + typedef struct OptAllocator { OptLoc* locs; - u64* hard_used_locs; - u64* stack_used_locs; + AllocIntervalVec* hard_used_locs; + AllocIntervalVec* stack_used_locs; u32 point_count; u32 hard_loc_words; u32 stack_loc_words; @@ -649,6 +644,98 @@ static void alloc_sort_candidates(OptAllocCandidate* cands, u32 n) { if (n > 1) qsort(cands, n, sizeof(cands[0]), alloc_candidate_cmp); } +static u32 alloc_interval_lower_bound(const AllocIntervalVec* v, u32 start) { + u32 lo = 0; + u32 hi = v ? v->n : 0; + while (lo < hi) { + u32 mid = lo + (hi - lo) / 2u; + if (v->data[mid].start < start) + lo = mid + 1u; + else + hi = mid; + } + return lo; +} + +static int alloc_interval_overlaps(const AllocIntervalVec* v, u32 start, + u32 end) { + if (!v || !v->n) return 0; + u32 pos = alloc_interval_lower_bound(v, start); + if (pos > 0) { + const AllocInterval* prev = &v->data[pos - 1u]; + if (prev->end > start && prev->start < end) return 1; + } + if (pos < v->n) { + const AllocInterval* cur = &v->data[pos]; + if (cur->start < end && cur->end > start) return 1; + } + return 0; +} + +static void alloc_interval_grow(Func* f, AllocIntervalVec* v) { + if (v->n < v->cap) return; + u32 ncap = v->cap ? v->cap * 2u : 8u; + AllocInterval* nd = arena_array(f->arena, AllocInterval, ncap); + if (v->data) memcpy(nd, v->data, sizeof(v->data[0]) * v->n); + v->data = nd; + v->cap = ncap; +} + +static void alloc_interval_insert(Func* f, AllocIntervalVec* v, u32 start, + u32 end) { + if (end <= start) end = start + 1u; + u32 pos = alloc_interval_lower_bound(v, start); + if (pos > 0 && v->data[pos - 1u].end >= start) { + --pos; + if (v->data[pos].end < end) v->data[pos].end = end; + } else { + alloc_interval_grow(f, v); + if (pos < v->n) + memmove(v->data + pos + 1u, v->data + pos, + sizeof(v->data[0]) * (v->n - pos)); + v->data[pos].start = start; + v->data[pos].end = end; + ++v->n; + } + + while (pos + 1u < v->n && v->data[pos + 1u].start <= v->data[pos].end) { + if (v->data[pos].end < v->data[pos + 1u].end) + v->data[pos].end = v->data[pos + 1u].end; + if (pos + 2u < v->n) + memmove(v->data + pos + 1u, v->data + pos + 2u, + sizeof(v->data[0]) * (v->n - pos - 2u)); + --v->n; + } +} + +static int alloc_ranges_overlap_vec(OptAllocator* a, + const OptLiveRangeSet* ranges, Val v, + const AllocIntervalVec* vec, + u64* visits) { + for (u32 r = ranges->first_range_by_val[v]; r != OPT_RANGE_NONE; + r = ranges->ranges[r].next) { + const OptLiveRange* lr = &ranges->ranges[r]; + u32 end = lr->end < a->point_count ? lr->end : a->point_count; + if (visits) ++*visits; + if (lr->start >= end) continue; + if (alloc_interval_overlaps(vec, lr->start, end)) return 1; + } + return 0; +} + +static void alloc_mark_vec(Func* f, OptAllocator* a, + const OptLiveRangeSet* ranges, Val v, + AllocIntervalVec* vec, u64* marks) { + for (u32 r = ranges->first_range_by_val[v]; r != OPT_RANGE_NONE; + r = ranges->ranges[r].next) { + const OptLiveRange* lr = &ranges->ranges[r]; + u32 end = lr->end < a->point_count ? lr->end : a->point_count; + if (lr->start >= end) continue; + if (marks) ++*marks; + alloc_interval_insert(f, vec, lr->start, end); + } +} + static void opt_init_val_info_from_ranges(Func* f, const OptLiveRangeSet* ranges) { OptValInfo* old = f->val_info; @@ -770,79 +857,51 @@ static int spill_slot_compatible(Func* f, FrameSlot fs, Val v) { return 1; } -static void alloc_collect_occupied_hard(OptAllocator* a, - const OptLiveRangeSet* ranges, Val v, - u64* occupied) { - loc_bits_clear(occupied, a->hard_loc_words); - for (u32 r = ranges->first_range_by_val[v]; r != OPT_RANGE_NONE; - r = ranges->ranges[r].next) { - const OptLiveRange* lr = &ranges->ranges[r]; - for (u32 p = lr->start; p < lr->end && p < a->point_count; ++p) { - ++a->hard_point_visits; - a->hard_word_ors += a->hard_loc_words; - loc_bits_or(occupied, - a->hard_used_locs + ((size_t)p * a->hard_loc_words), - a->hard_loc_words); - } - } +static int alloc_hard_conflicts(OptAllocator* a, + const OptLiveRangeSet* ranges, Val v, + u32 bit) { + if (bit >= a->hard_loc_bits) return 1; + return alloc_ranges_overlap_vec(a, ranges, v, &a->hard_used_locs[bit], + &a->hard_point_visits); } -static void alloc_collect_occupied_stack(OptAllocator* a, - const OptLiveRangeSet* ranges, Val v, - u64* occupied) { - loc_bits_clear(occupied, a->stack_loc_words); - if (!a->stack_loc_words) return; - for (u32 r = ranges->first_range_by_val[v]; r != OPT_RANGE_NONE; - r = ranges->ranges[r].next) { - const OptLiveRange* lr = &ranges->ranges[r]; - for (u32 p = lr->start; p < lr->end && p < a->point_count; ++p) { - ++a->stack_point_visits; - a->stack_word_ors += a->stack_loc_words; - loc_bits_or(occupied, - a->stack_used_locs + ((size_t)p * a->stack_loc_words), - a->stack_loc_words); - } - } +static int alloc_stack_conflicts(OptAllocator* a, + const OptLiveRangeSet* ranges, Val v, + u32 stack_idx) { + if (stack_idx >= a->stack_slot_count) return 1; + return alloc_ranges_overlap_vec(a, ranges, v, &a->stack_used_locs[stack_idx], + &a->stack_point_visits); } static void alloc_mark_hard_loc(OptAllocator* a, const OptLiveRangeSet* ranges, - Val v, u32 bit) { - for (u32 r = ranges->first_range_by_val[v]; r != OPT_RANGE_NONE; - r = ranges->ranges[r].next) { - const OptLiveRange* lr = &ranges->ranges[r]; - for (u32 p = lr->start; p < lr->end && p < a->point_count; ++p) { - ++a->hard_mark_points; - loc_bit_set(a->hard_used_locs + ((size_t)p * a->hard_loc_words), bit); - } - } -} - -static void alloc_grow_stack_words(Func* f, OptAllocator* a, u32 need_words) { - if (need_words <= a->stack_loc_words) return; - u64* next = - arena_zarray(f->arena, u64, (size_t)a->point_count * need_words); - for (u32 p = 0; p < a->point_count; ++p) { - u64* dst = next + ((size_t)p * need_words); - if (a->stack_loc_words) { - const u64* src = a->stack_used_locs + ((size_t)p * a->stack_loc_words); - for (u32 w = 0; w < a->stack_loc_words; ++w) dst[w] = src[w]; - } + Val v, u32 bit, Func* f) { + if (bit >= a->hard_loc_bits) return; + alloc_mark_vec(f, a, ranges, v, &a->hard_used_locs[bit], + &a->hard_mark_points); +} + +static void alloc_grow_stack_locs(Func* f, OptAllocator* a, u32 need_slots) { + if (need_slots <= a->stack_slot_cap) return; + u32 ncap = a->stack_slot_cap ? a->stack_slot_cap * 2u : 16u; + while (ncap < need_slots) ncap *= 2u; + FrameSlot* ns = arena_array(f->arena, FrameSlot, ncap); + AllocIntervalVec* ni = arena_zarray(f->arena, AllocIntervalVec, ncap); + if (a->stack_slots) { + memcpy(ns, a->stack_slots, sizeof(a->stack_slots[0]) * + a->stack_slot_count); + memcpy(ni, a->stack_used_locs, sizeof(a->stack_used_locs[0]) * + a->stack_slot_count); } - a->stack_used_locs = next; - a->stack_loc_words = need_words; + a->stack_slots = ns; + a->stack_used_locs = ni; + a->stack_slot_cap = ncap; } static void alloc_mark_stack_loc(OptAllocator* a, const OptLiveRangeSet* ranges, - Val v, u32 stack_idx) { - for (u32 r = ranges->first_range_by_val[v]; r != OPT_RANGE_NONE; - r = ranges->ranges[r].next) { - const OptLiveRange* lr = &ranges->ranges[r]; - for (u32 p = lr->start; p < lr->end && p < a->point_count; ++p) { - ++a->stack_mark_points; - loc_bit_set(a->stack_used_locs + ((size_t)p * a->stack_loc_words), - stack_idx); - } - } + Val v, u32 stack_idx, Func* f) { + if (stack_idx >= a->stack_slot_count) return; + alloc_mark_vec(f, a, ranges, v, &a->stack_used_locs[stack_idx], + &a->stack_mark_points); } static void alloc_assign_hard(Func* f, OptAllocator* a, @@ -854,17 +913,16 @@ static void alloc_assign_hard(Func* f, OptAllocator* a, a->locs[v].cls = vi->cls; a->locs[v].hard_reg = r; a->locs[v].spill_slot = FRAME_SLOT_NONE; - alloc_mark_hard_loc(a, ranges, v, hard_loc_bit(vi->cls, r)); + alloc_mark_hard_loc(a, ranges, v, hard_loc_bit(vi->cls, r), f); } static void alloc_assign_stack(Func* f, OptAllocator* a, - const OptLiveRangeSet* ranges, Val v, - const u64* occupied) { + const OptLiveRangeSet* ranges, Val v) { OptValInfo* vi = &f->val_info[v]; u32 stack_idx = a->stack_slot_count; FrameSlot slot = FRAME_SLOT_NONE; for (u32 i = 0; i < a->stack_slot_count; ++i) { - if (loc_bit_has(occupied, i)) continue; + if (alloc_stack_conflicts(a, ranges, v, i)) continue; if (!spill_slot_compatible(f, a->stack_slots[i], v)) continue; stack_idx = i; slot = a->stack_slots[i]; @@ -872,18 +930,10 @@ static void alloc_assign_stack(Func* f, OptAllocator* a, } if (slot == FRAME_SLOT_NONE) { slot = spill_slot_for(f, v); - if (a->stack_slot_count == a->stack_slot_cap) { - u32 ncap = a->stack_slot_cap ? a->stack_slot_cap * 2u : 16u; - FrameSlot* ns = arena_array(f->arena, FrameSlot, ncap); - if (a->stack_slots) - memcpy(ns, a->stack_slots, sizeof(a->stack_slots[0]) * - a->stack_slot_count); - a->stack_slots = ns; - a->stack_slot_cap = ncap; - } + alloc_grow_stack_locs(f, a, a->stack_slot_count + 1u); a->stack_slots[a->stack_slot_count++] = slot; stack_idx = a->stack_slot_count - 1u; - alloc_grow_stack_words(f, a, loc_bit_words(a->stack_slot_count)); + a->stack_loc_words = loc_bit_words(a->stack_slot_count); } else { vi->spill_slot = slot; } @@ -892,7 +942,7 @@ static void alloc_assign_stack(Func* f, OptAllocator* a, a->locs[v].cls = vi->cls; a->locs[v].hard_reg = REG_NONE; a->locs[v].spill_slot = slot; - alloc_mark_stack_loc(a, ranges, v, stack_idx); + alloc_mark_stack_loc(a, ranges, v, stack_idx, f); } static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges, @@ -906,9 +956,11 @@ static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges, a->hard_loc_words = loc_bit_words(a->hard_loc_bits); a->locs = arena_zarray(f->arena, OptLoc, f->nvals ? f->nvals : 1u); a->hard_used_locs = - arena_zarray(f->arena, u64, (size_t)a->point_count * a->hard_loc_words); + arena_zarray(f->arena, AllocIntervalVec, + a->hard_loc_bits ? a->hard_loc_bits : 1u); a->stack_slots = NULL; a->stack_slot_cap = 0; + a->stack_used_locs = NULL; OptAllocCandidate* cands = arena_array(f->arena, OptAllocCandidate, ncands ? ncands : 1u); @@ -925,15 +977,10 @@ static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges, } alloc_sort_candidates(cands, n); - u64* hard_occupied = - arena_zarray(f->arena, u64, a->hard_loc_words ? a->hard_loc_words : 1u); - u64* stack_occupied = - arena_zarray(f->arena, u64, loc_bit_words(ncands ? ncands : 1u)); for (u32 i = 0; i < n; ++i) { Val v = cands[i].v; OptValInfo* vi = &f->val_info[v]; u8 cls = vi->cls; - alloc_collect_occupied_hard(a, ranges, v, hard_occupied); if (vi->tied_hard_reg >= 0) { Reg fixed = (Reg)vi->tied_hard_reg; if (!hard_available(f, cls, fixed)) { @@ -950,7 +997,7 @@ static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges, (unsigned)fixed); } if (fixed >= 32 || - loc_bit_has(hard_occupied, hard_loc_bit(cls, fixed))) { + alloc_hard_conflicts(a, ranges, v, hard_loc_bit(cls, fixed))) { SrcLoc loc = {0, 0, 0}; compiler_panic(f->c, loc, "opt regalloc: conflicting fixed hard reg %u", (unsigned)fixed); @@ -965,14 +1012,13 @@ static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges, if (hr >= 32) continue; if (vi->forbidden_hard_regs & (1u << hr)) continue; if (vi->live_across_call_freq && is_caller_saved(f, cls, hr)) continue; - if (loc_bit_has(hard_occupied, hard_loc_bit(cls, hr))) continue; + if (alloc_hard_conflicts(a, ranges, v, hard_loc_bit(cls, hr))) continue; alloc_assign_hard(f, a, ranges, v, hr); found = 1; break; } if (!found) { - alloc_collect_occupied_stack(a, ranges, v, stack_occupied); - alloc_assign_stack(f, a, ranges, v, stack_occupied); + alloc_assign_stack(f, a, ranges, v); } } f->opt_alloc_hard_loc_words = a->point_count * a->hard_loc_words;