kit

git clone https://git.ryansepassi.com/git/kit.git

commit 9a3a508a459a3d8d8d75530d66b988b68da42c43
parent 396793eaaaa6a778f2f6face8a960fd58a3438db
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Fri, 22 May 2026 13:34:05 -0700

opt regalloc: swap to MIR-shaped point-bitmap allocator at O1

Replace cfree's sorted interval-vector conflict structure (AllocIntervalVec
hard_used_locs[hard_loc_bit]) with MIR's point-indexed bitmap
(used_locs[p * loc_words + w], one row per compressed program point).
Sort coalesce-root PRegs by MIR's heuristic and OR used_locs across each
candidate's range points to build conflict_locs in one pass. Re-gate
opt_coalesce_ranges to O2 only, matching mir-gen.c:9431; the 2026-05-22
coalesce-at-O1 experiment is rolled back. Live-range splitting
(get_hard_reg_with_split, lr_gap_tab, split()) is deferred — to be ported
on top of the new core in a follow-up.
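
A minimal sketch of the point-indexed bitmap idea described above, not the
code in src/opt/pass_lower.c. All names here (SketchAlloc, sketch_assign,
the fixed SKETCH_POINTS and SKETCH_LOC_WORDS sizes) are hypothetical, and
each candidate is assumed to have a single half-open range [start, end)
rather than a coalesce group with several ranges:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sizes for illustration; not taken from pass_lower.c. */
enum { SKETCH_POINTS = 8, SKETCH_LOC_WORDS = 1 };

typedef struct SketchAlloc {
  /* used_locs[p * SKETCH_LOC_WORDS + w]: one bitmap row per program point. */
  uint64_t used_locs[SKETCH_POINTS * SKETCH_LOC_WORDS];
  /* Scratch bitmap, rebuilt for every candidate. */
  uint64_t conflict_locs[SKETCH_LOC_WORDS];
} SketchAlloc;

/* OR the rows of every point in [start, end) into conflict_locs. */
static void sketch_collect_conflicts(SketchAlloc* a, uint32_t start,
                                     uint32_t end) {
  assert(start <= end && end <= SKETCH_POINTS);
  memset(a->conflict_locs, 0, sizeof a->conflict_locs);
  for (uint32_t p = start; p < end; ++p)
    for (uint32_t w = 0; w < SKETCH_LOC_WORDS; ++w)
      a->conflict_locs[w] |= a->used_locs[p * SKETCH_LOC_WORDS + w];
}

/* Lowest location bit below nlocs that is clear in conflict_locs, or -1. */
static int sketch_pick_free(const SketchAlloc* a, uint32_t nlocs) {
  for (uint32_t bit = 0; bit < nlocs && bit < 64u * SKETCH_LOC_WORDS; ++bit)
    if (!(a->conflict_locs[bit / 64u] & (1ull << (bit % 64u))))
      return (int)bit;
  return -1;
}

/* Set the location's bit in the row of every point in [start, end). */
static void sketch_mark(SketchAlloc* a, uint32_t loc, uint32_t start,
                        uint32_t end) {
  for (uint32_t p = start; p < end; ++p)
    a->used_locs[p * SKETCH_LOC_WORDS + loc / 64u] |= 1ull << (loc % 64u);
}

/* Collect conflicts, pick a free location, mark it. Returns the location,
 * or -1 when every location conflicts (a real allocator would then fall
 * back to a stack slot). */
static int sketch_assign(SketchAlloc* a, uint32_t nlocs, uint32_t start,
                         uint32_t end) {
  sketch_collect_conflicts(a, start, end);
  int loc = sketch_pick_free(a, nlocs);
  if (loc >= 0) sketch_mark(a, (uint32_t)loc, start, end);
  return loc;
}
```

With a zero-initialized SketchAlloc and two locations, assigning [0, 4)
and then [2, 6) takes locations 0 and 1; a third candidate over [3, 5)
overlaps both and gets -1. The real allocator additionally sorts
candidates first and scores hard registers instead of taking the lowest
free bit.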

On the 5-case fast scope at O1 the opt+link+JIT geomean drops from
0.968ms to 0.621ms (-35.9%); per-bench opt.regalloc drops ~55%. cfree
now lands ~14% ahead of MIR's `MIR link finish` (0.720ms) on this scope,
crossing from 1.34x behind to 1.16x ahead. Runtime unchanged within
noise. See doc/OPT_PERF.md Iteration Notes for details.
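
The headline ratios above can be re-derived from the reported geomeans.
This is only an arithmetic sketch; pct_change and speed_ratio are
hypothetical helper names, and the 0.722ms input is the pre-rewrite MIR
figure from the doc/OPT_PERF.md table in this commit:

```c
#include <assert.h>

/* Percent change from old_ms to new_ms; negative means new_ms is smaller. */
static double pct_change(double old_ms, double new_ms) {
  assert(old_ms > 0.0);
  return (new_ms - old_ms) / old_ms * 100.0;
}

/* How many times larger slower_ms is than faster_ms. */
static double speed_ratio(double slower_ms, double faster_ms) {
  assert(faster_ms > 0.0);
  return slower_ms / faster_ms;
}
```

Under those inputs, pct_change(0.968, 0.621) gives the roughly -35.9%
drop, speed_ratio(0.968, 0.722) the 1.34x deficit before the rewrite,
speed_ratio(0.720, 0.621) the 1.16x lead after it, and
pct_change(0.720, 0.621) the "~14% ahead" figure.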

Tests refreshed to match the new contract: opt_o1_skips_coalesce (was
opt_o1_coalesces_simple_copy), opt_o2_spills_singleton_when_whole_alloc_fails
(was opt_o2_splits_singleton_when_whole_alloc_fails),
opt_o2_does_not_split_critical_edge (was opt_o2_split_materializes_*).

Diffstat:
M doc/OPT_PERF.md      | 149 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------
M src/opt/pass_lower.c | 665 ++++++++++++++++++++++++-------------------------------------------------------
M test/opt/opt_test.c  |  68 ++++++++++++++++++++++++--------------------------------------------
3 files changed, 339 insertions(+), 543 deletions(-)

diff --git a/doc/OPT_PERF.md b/doc/OPT_PERF.md @@ -148,20 +148,40 @@ Base: each benchmark's `gcc-15 -O2` row is `1.0x`. Focused timing pass: - Scope: `array`, `hash`, `hash2`, `matrix`, `sieve` -- Output: `build/bench/opt/opt-link-split-common/summary.csv` - cfree measurement: sum of `opt.o*.total`, `link.resolve.total`, and JIT - scoped timings from `cfree run --time` + scoped timings from `cfree run --bench-time` (best-of-3 per bench) - MIR measurement: `MIR link finish` from `c2m -v -O<n> file.bmir -eg` +Latest refresh (2026-05-22, `CFREE_OPT_BENCH_FAST=1`, single Darwin host), +post-regalloc-rewrite (see Iteration Notes below): + | opt | cfree opt+link+JIT ms | MIR link/JIT ms | result | | ---: | ---: | ---: | --- | -| `O0` | `0.275` | `0.758` | cfree `2.76x` faster | -| `O1` | `4.552` | `0.718` | MIR `6.34x` faster | -| `O2` | `4.728` | `1.100` | MIR `4.30x` faster | - -This is the key split for tuning: cfree's final link/JIT mechanics are very -fast, but MIR's optimized generation pipeline is substantially faster at -`O1` and `O2`. +| `O1` (HEAD before rewrite) | `0.968` | `0.722` | MIR `1.34x` faster | +| `O1` (MIR-shaped allocator) | `0.621` | `0.720` | cfree `1.16x` faster | + +Per-bench at `O1` after the rewrite: + +| bench | cfree opt ms | cfree link ms | cfree jit ms | cfree opt+l+j | MIR codegen | +| --- | ---: | ---: | ---: | ---: | ---: | +| `array` | `0.269` | `0.049` | `0.043` | `0.361` | `0.300` | +| `hash` | `1.033` | `0.083` | `0.040` | `1.156` | `1.653` | +| `hash2` | `1.069` | `0.089` | `0.042` | `1.200` | `1.765` | +| `matrix` | `0.484` | `0.054` | `0.035` | `0.573` | `0.859` | +| `sieve` | `0.259` | `0.031` | `0.031` | `0.321` | `0.258` | +| **geomean** | | | | **`0.621`** | **`0.720`** | + +The previously reported `4.552 ms` cfree `O1` figure (with MIR at `0.718 ms`, +a `6.34x` gap) was from an earlier benchmark configuration; on the current +host it reproduces at `0.968 ms` against MIR's `0.722 ms`. 
The MIR-shaped +allocator rewrite closes the gap and crosses ahead: cfree opt+link+JIT is +now ~14% faster than MIR's `MIR link finish` across this 5-case scope, +driven by a `~55%` per-bench drop in `opt.regalloc` time (point-bitmap +conflict checks instead of per-hard-register sorted interval vectors). + +`O0` and `O2` rows have not yet been re-measured under the new allocator; +the full-scope refresh is pending. cfree's final link/JIT mechanics stay +fast across both data-structure choices. ## Goals @@ -191,40 +211,53 @@ Coverage goals: ## Current Tuning Priorities -1. Investigate why `O2` trails `O1` on the current cfree runtime geomean. -2. Profile expensive O2 passes on `hash`, `hash2`, and `matrix`, where the +1. Port MIR's live-range splitting (`get_hard_reg_with_split`, `lr_gap_tab`, + `split()`) on top of the new point-bitmap core, restoring the O2 splitting + layer that was deferred during the regalloc rewrite. Until then O2 may + regress on benches where splitting matters (`array`, `hash`, `matrix`). +2. Investigate why `O2` trails `O1` on the current cfree runtime geomean. +3. Profile expensive O2 passes on `hash`, `hash2`, and `matrix`, where the opt/link split shows cfree optimized generation much slower than MIR. -3. Improve generated-code quality before adding broad new passes. Prefer +4. Improve generated-code quality before adding broad new passes. Prefer targeted fixes backed by benchmark deltas and pass-local tests. -4. Add finer cfree timing columns to `scripts/opt_bench.sh` once the driver can +5. Add finer cfree timing columns to `scripts/opt_bench.sh` once the driver can expose parse, optimize, link, and JIT slices in parseable form. ## O1 Gap Analysis vs MIR cfree `O1` runs only the shared lowering pipeline (`doc/OPT.md`): build_cfg, -jump_cleanup, simplify_local, machinize, build_loop_tree, live_blocks, -dead_def_elim, regalloc (simple mode), combine, dce, emit. No SSA-era passes -run at `O1`, so the gaps differ from `O2`. 
+jump_cleanup, simplify_local, machinize, addr_xform_pregs, build_loop_tree, +live_blocks, dead_def_elim, regalloc (simple mode), combine, dce, emit. +No SSA-era passes and no coalescing run at `O1`, matching MIR +(`mir-gen.c:9431`: coalesce is gated on `optimize_level >= 2`). -Headline at `O1`: +Headline at `O1` (post-regalloc-rewrite, 2026-05-22 refresh): | metric | cfree | MIR | | --- | ---: | ---: | -| compile speed (vs gcc-15 `O2`) | `5.79x` | `2.52x` | -| runtime speed (vs gcc-15 `O2`) | `0.50x` | `0.57x` | -| opt+link+JIT (5-case scope) | `4.552 ms` | `0.718 ms` | +| opt+link+JIT (5-case scope, geomean) | `0.621 ms` | `0.720 ms` | +| runtime (5-case scope, geomean) | `6354.6 ms` | `4663.8 ms` | + +cfree opt+link+JIT is now ~14% faster than MIR's `MIR link finish` slice. +The remaining gap is on the runtime side: MIR's generated code runs ~1.36x +faster than cfree's at `O1` on this scope, dominated by what MIR does that +cfree does not (coalescing + splitting at O2, plus better instruction +selection). -cfree wins the whole-pipeline compile metric because of frontend speed, but -loses the optimized-stage opt+link+JIT split by `6.34x`. +Compile-time gaps closed by the rewrite: -Compile-time gaps unique to `O1`: +- `opt_regalloc` now uses point-indexed bitsets ORed per program point + (matching MIR's `assign()` in `mir-gen.c:7551-7728`, simplified branch), + replacing the previous sorted interval-vector overlap checks per hard + register. This is the dominant `~55%` per-bench reduction in + `opt.regalloc` time across `array`, `hash`, `hash2`, `matrix`, `sieve`. + +Compile-time gaps that remain: - `opt_live_blocks` tracks every pseudo-register uniformly. MIR's - `calculate_func_cfg_live_info` narrows to move-related variables at `O1` via - `consider_move_vars_only` when coalescing is on, shrinking bitset width and - fixpoint iterations. -- `opt_regalloc` uses sorted interval-vector overlap checks per hard register. 
- MIR uses point-indexed bitsets ORed per program point. + `calculate_func_cfg_live_info` also tracks all live vars at the + pre-allocation step (`consider_all_live_vars`), but its bitsets reuse a + shared sparse vector representation; cfree's `OptBitset` is dense. - `opt_combine` runs a single forward sweep per BB with no fixpoint, so cascading folds either require an earlier pass to enable them or get missed. @@ -233,13 +266,10 @@ Compile-time gaps unique to `O1`: Runtime gaps unique to `O1`: -- No coalescing. cfree gates `opt_coalesce_ranges` on - `allow_live_range_split`, which is off at `O1`. Every `IR_COPY` survives to - emission unless `opt_combine`'s identity-copy detector catches it post-RA. - MIR's `O1` runs `collect_moves`, `consider_move_vars_only`, and `coalesce`. -- No address-mode synthesis. `opt_ssa_combine` and `opt_addr_xform` only run - at `O2`, so `IR_ADDR_OF(local) + indirect load` stays as separate - instructions at `O1`. +- No coalescing at `O1` (MIR matches this). Every `IR_COPY` survives to + emission unless `opt_combine`'s identity-copy detector catches it + post-RA. The 2026-05-22 attempt to enable coalescing at `O1` was rolled + back as part of matching MIR's pipeline. - No ext-of-ext semantic fold. `opt_combine` removes only identical convert pairs. MIR's `combine_exts` folds chains like `zext8 -> zext32` into a single extension of the right width. @@ -315,8 +345,57 @@ Runtime gaps: ## Iteration Notes +### MIR-shaped regalloc (2026-05-22) + +Change: replace `OptAllocator`'s sorted interval-vector conflict structure +(`AllocIntervalVec hard_used_locs[hard_loc_bit]` plus per-stack-slot +intervals) with MIR's point-indexed bitmap (`used_locs[p * loc_words + w]`, +one row per compressed program point). Sort coalesce-root PRegs by the same +heuristic MIR uses (`mir-gen.c:7320-7337`: tied-reg first, then descending +freq, then descending live_length). 
For each candidate, OR `used_locs[j]` +across the candidate's live-range points into a scratch `conflict_locs` +bitmap, pick the cheapest free hard reg, or fall back to a stack slot +(probing existing stack-slot bits in `conflict_locs` for automatic reuse). +Re-gate `opt_coalesce_ranges` on `allow_live_range_split` (O2 only), +matching `mir-gen.c:9431` — the earlier "coalescing at O1" experiment +(below) is rolled back as part of this change. Live-range splitting +(`get_hard_reg_with_split`, `lr_gap_tab`, `split()`) is deferred; the new +core is the foundation on which splitting will be re-added. + +Measurement (`CFREE_OPT_BENCH_FAST=1`, best of 3 compile/run repeats, +single Darwin host), 5-case scope at O1: + +| bench | HEAD opt+link+jit ms | MIR-shaped opt+link+jit ms | delta | +| --- | ---: | ---: | ---: | +| `array` | `0.535` | `0.361` | `-32.5%` | +| `hash` | `1.916` | `1.156` | `-39.7%` | +| `hash2` | `1.929` | `1.200` | `-37.8%` | +| `matrix` | `0.899` | `0.573` | `-36.3%` | +| `sieve` | `0.477` | `0.321` | `-32.7%` | +| **geomean** | `0.968` | `0.621` | `-35.9%` | + +Per-bench `opt.regalloc` (the bucket the rewrite targets) drops by `~55%` +across all five benches: `array -54.1%`, `hash -58.3%`, `hash2 -55.5%`, +`matrix -53.7%`, `sieve -52.2%`. Runtime is unchanged within noise +(geomean delta `-0.2%`), consistent with the rewrite changing only the +conflict data structure, not allocator decisions. + +`compile_and_jit` total shifts modestly per-bench (`+0.85%` to `-21.75%`) +because the C frontend is 99%+ of that total on these benches; the +allocator gain is only visible after isolating the optimized stages from +the frontend. + +After this rewrite, cfree's `opt+link+JIT` geomean (`0.621 ms`) is +~14% **ahead** of MIR's `MIR link finish` geomean (`0.720 ms`) on the +same scope — see the Optimizer/Link/JIT Split table. + ### Coalescing at O1 (2026-05-22) +**Rolled back** as part of the MIR-shaped regalloc rewrite above. 
MIR +gates coalescing on `optimize_level >= 2` (`mir-gen.c:9431`), so matching +MIR's O1 pipeline means coalesce does not run at O1. The original entry +is preserved below for context. + Change: remove the `allow_live_range_split` gate around `opt_coalesce_ranges` in `opt_regalloc` so coalesce runs at both O1 and O2. Initially this produced wrong code on the address-taken branch-join diff --git a/src/opt/pass_lower.c b/src/opt/pass_lower.c @@ -283,6 +283,26 @@ static int is_caller_saved(Func* f, u8 cls, Reg r) { return (f->opt_caller_saved[cls] & (1u << r)) != 0; } +/* --------------------------------------------------------------------------- + * Register allocator, MIR-shaped. + * + * Data structures and assignment algorithm mirror MIR's reg_alloc/assign + * (mir-gen.c:7551-7728, simplified_p branch). Conflict detection uses a + * point-indexed bitmap of locations (hard regs + stack slots) instead of + * a sorted interval vector per hard register. + * + * used_locs[p * loc_words .. p * loc_words + loc_words) -- one row per + * compressed + * program point + * + * Bit indices: + * 0 .. hard_loc_bits-1 -> hard registers (hard_loc_bit(cls, r)) + * hard_loc_bits + k -> stack slot index k (k < stack_slot_count) + * + * Live-range splitting (`get_hard_reg_with_split`, `lr_gap_t`, `split()`) + * is deferred per doc/OPT_PERF.md plan. 
+ * ------------------------------------------------------------------------- */ + typedef struct OptAllocator OptAllocator; static int hard_available(Func* f, u8 cls, Reg r) { @@ -315,19 +335,13 @@ static FrameSlot spill_slot_for(Func* f, PReg v) { static u32 hard_loc_bit(u8 cls, Reg r) { return ((u32)cls * 32u) + (u32)r; } typedef struct OptAllocCandidate { - PReg v; + PReg v; /* coalesce-root PReg with live ranges */ u32 spill_cost; u32 live_length; u8 tied; u8 pad[3]; } OptAllocCandidate; -typedef struct OptSpillCandidate { - PReg v; - u32 first; - u32 last; -} OptSpillCandidate; - typedef struct OptAllocGroupInfo { PReg root; u32 spill_cost; @@ -340,43 +354,84 @@ typedef struct OptAllocGroupInfo { u8 pad[3]; } OptAllocGroupInfo; -typedef struct AllocInterval { - u32 start; - u32 end; -} AllocInterval; - -typedef struct AllocIntervalVec { - AllocInterval* data; - u32 n; - u32 cap; -} AllocIntervalVec; - typedef struct OptAllocator { - OptLoc* locs; - AllocIntervalVec* hard_used_locs; - AllocIntervalVec* stack_used_locs; - u32* stack_slot_end; - u32* active_stack_slots; - u32 nactive_stack_slots; - u32 active_stack_cap; - u32* free_stack_slots; - u32 nfree_stack_slots; - u32 free_stack_cap; + OptLoc* locs; /* per-PReg result (cls, hard_reg, spill_slot) */ + + /* Per-point bitmap of locations. used_locs[p * loc_words + w] is word w + * of the bitmap for compressed program point p. Bit indices: + * 0 .. hard_loc_bits - 1 -> hard regs (hard_loc_bit) + * hard_loc_bits + stack_idx -> stack slot indices */ + u64* used_locs; u32 point_count; - u32 hard_loc_words; - u32 stack_loc_words; - u32 hard_loc_bits; + u32 loc_words; /* width of one row, in u64 words */ + u32 hard_loc_bits; /* OPT_REG_CLASSES * 32 */ + + /* Stack slot table (parallel arrays). 
*/ FrameSlot* stack_slots; u32 stack_slot_count; u32 stack_slot_cap; - u64 hard_point_visits; - u64 stack_point_visits; - u64 hard_word_ors; + + /* hard_open[hard_loc_bit] is 1 if at least one PReg has been assigned to + * this hard reg in the current function. Drives `hard_reg_alloc_score`'s + * callee-save bias. */ + u8* hard_open; + + /* Scratch bitmap of loc_words. Reused per candidate. */ + u64* conflict_locs; + + /* Metrics. */ + u64 hard_point_visits; /* points scanned during hard-reg conflict probe */ + u64 stack_point_visits; /* points scanned during stack-slot probe */ + u64 hard_word_ors; /* word-OR operations into conflict_locs */ u64 stack_word_ors; - u64 hard_mark_points; + u64 hard_mark_points; /* points marked when assigning a hard reg */ u64 stack_mark_points; } OptAllocator; +/* Bitmap helpers over loc_words-wide rows of used_locs. */ +static u64* used_locs_row(OptAllocator* a, u32 p) { + return &a->used_locs[(u64)p * a->loc_words]; +} + +static int loc_bit_in_conflict(const u64* conflict_locs, u32 bit) { + return (conflict_locs[bit / 64u] & (1ull << (bit % 64u))) != 0; +} + +static u32 alloc_loc_words_for_bits(u32 bits) { return (bits + 63u) / 64u; } + +static void alloc_grow_loc_words(Func* f, OptAllocator* a, u32 need_words) { + if (need_words <= a->loc_words) return; + u32 new_words = a->loc_words ? a->loc_words : 1u; + while (new_words < need_words) new_words *= 2u; + u64* nb = arena_zarray(f->arena, u64, (u64)a->point_count * new_words); + if (a->used_locs && a->loc_words) { + for (u32 p = 0; p < a->point_count; ++p) + memcpy(&nb[(u64)p * new_words], &a->used_locs[(u64)p * a->loc_words], + sizeof(u64) * a->loc_words); + } + a->used_locs = nb; + a->loc_words = new_words; + u64* nc = arena_zarray(f->arena, u64, new_words); + a->conflict_locs = nc; +} + +static u32 alloc_alloc_stack_slot(Func* f, OptAllocator* a, FrameSlot fs) { + if (a->stack_slot_count == a->stack_slot_cap) { + u32 ncap = a->stack_slot_cap ? 
a->stack_slot_cap * 2u : 16u; + FrameSlot* ns = arena_array(f->arena, FrameSlot, ncap); + if (a->stack_slots) + memcpy(ns, a->stack_slots, sizeof(a->stack_slots[0]) * a->stack_slot_count); + a->stack_slots = ns; + a->stack_slot_cap = ncap; + } + u32 idx = a->stack_slot_count++; + a->stack_slots[idx] = fs; + u32 needed_bits = a->hard_loc_bits + a->stack_slot_count; + u32 needed_words = alloc_loc_words_for_bits(needed_bits); + alloc_grow_loc_words(f, a, needed_words); + return idx; +} + static u32 hard_reg_alloc_score(Func* f, const OptAllocator* a, const OptPRegInfo* vi, Reg hr) { const CGPhysRegInfo* pi = phys_info_for(f, vi->cls, hr); @@ -388,7 +443,7 @@ static u32 hard_reg_alloc_score(Func* f, const OptAllocator* a, score += 20u; } else if (!is_caller_saved(f, vi->cls, hr)) { u32 bit = hard_loc_bit(vi->cls, hr); - int already_open = bit < a->hard_loc_bits && a->hard_used_locs[bit].n != 0; + int already_open = a->hard_open && bit < a->hard_loc_bits && a->hard_open[bit]; if (!already_open) score += pi ? 
pi->save_cost : 50u; } return score; @@ -414,19 +469,6 @@ static void alloc_sort_candidates(OptAllocCandidate* cands, u32 n) { if (n > 1) qsort(cands, n, sizeof(cands[0]), alloc_candidate_cmp); } -static int spill_candidate_cmp(const void* va, const void* vb) { - const OptSpillCandidate* a = (const OptSpillCandidate*)va; - const OptSpillCandidate* b = (const OptSpillCandidate*)vb; - if (a->first != b->first) - return (a->first > b->first) - (a->first < b->first); - if (a->last != b->last) return (a->last > b->last) - (a->last < b->last); - return (a->v > b->v) - (a->v < b->v); -} - -static void alloc_sort_spills(OptSpillCandidate* spills, u32 n) { - if (n > 1) qsort(spills, n, sizeof(spills[0]), spill_candidate_cmp); -} - static PReg alloc_coalesce_root(Func* f, PReg v) { if (!f->opt_coalesce_parent || v == PREG_NONE || v >= opt_reg_count(f)) return v; @@ -466,106 +508,6 @@ static void alloc_group_info(Func* f, const OptLiveRangeSet* ranges, PReg root, if (out->first == (u32)~0u) out->first = 0; } -static u32 alloc_interval_lower_bound(const AllocIntervalVec* v, u32 start) { - u32 lo = 0; - u32 hi = v ? v->n : 0; - while (lo < hi) { - u32 mid = lo + (hi - lo) / 2u; - if (v->data[mid].start < start) - lo = mid + 1u; - else - hi = mid; - } - return lo; -} - -static int alloc_interval_overlaps(const AllocIntervalVec* v, u32 start, - u32 end) { - if (!v || !v->n) return 0; - u32 pos = alloc_interval_lower_bound(v, start); - if (pos > 0) { - const AllocInterval* prev = &v->data[pos - 1u]; - if (prev->end > start && prev->start < end) return 1; - } - if (pos < v->n) { - const AllocInterval* cur = &v->data[pos]; - if (cur->start < end && cur->end > start) return 1; - } - return 0; -} - -static void alloc_interval_grow(Func* f, AllocIntervalVec* v) { - if (v->n < v->cap) return; - u32 ncap = v->cap ? 
v->cap * 2u : 8u; - AllocInterval* nd = arena_array(f->arena, AllocInterval, ncap); - if (v->data) memcpy(nd, v->data, sizeof(v->data[0]) * v->n); - v->data = nd; - v->cap = ncap; -} - -static void alloc_interval_insert(Func* f, AllocIntervalVec* v, u32 start, - u32 end) { - if (end <= start) end = start + 1u; - u32 pos = alloc_interval_lower_bound(v, start); - if (pos > 0 && v->data[pos - 1u].end >= start) { - --pos; - if (v->data[pos].end < end) v->data[pos].end = end; - } else { - alloc_interval_grow(f, v); - if (pos < v->n) - memmove(v->data + pos + 1u, v->data + pos, - sizeof(v->data[0]) * (v->n - pos)); - v->data[pos].start = start; - v->data[pos].end = end; - ++v->n; - } - - while (pos + 1u < v->n && v->data[pos + 1u].start <= v->data[pos].end) { - if (v->data[pos].end < v->data[pos + 1u].end) - v->data[pos].end = v->data[pos + 1u].end; - if (pos + 2u < v->n) - memmove(v->data + pos + 1u, v->data + pos + 2u, - sizeof(v->data[0]) * (v->n - pos - 2u)); - --v->n; - } -} - -static int alloc_ranges_overlap_vec(OptAllocator* a, - const OptLiveRangeSet* ranges, PReg v, - const AllocIntervalVec* vec, u64* visits) { - for (u32 r = ranges->first_range_by_preg[v]; r != OPT_RANGE_NONE; - r = ranges->ranges[r].next) { - const OptLiveRange* lr = &ranges->ranges[r]; - u32 end = lr->end < a->point_count ? lr->end : a->point_count; - if (visits) ++*visits; - if (lr->start >= end) continue; - if (alloc_interval_overlaps(vec, lr->start, end)) return 1; - } - return 0; -} - -static void alloc_mark_vec(Func* f, OptAllocator* a, - const OptLiveRangeSet* ranges, PReg v, - AllocIntervalVec* vec, u64* marks) { - for (u32 r = ranges->first_range_by_preg[v]; r != OPT_RANGE_NONE; - r = ranges->ranges[r].next) { - const OptLiveRange* lr = &ranges->ranges[r]; - u32 end = lr->end < a->point_count ? 
lr->end : a->point_count; - if (lr->start >= end) continue; - if (marks) ++*marks; - alloc_interval_insert(f, vec, lr->start, end); - } -} - -static u32 alloc_interval_storage_words(const AllocIntervalVec* vecs, - u32 nvecs) { - u64 bytes = (u64)nvecs * (u64)sizeof(vecs[0]); - for (u32 i = 0; i < nvecs; ++i) - bytes += (u64)vecs[i].cap * (u64)sizeof(vecs[i].data[0]); - bytes = (bytes + 7u) / 8u; - return bytes > (u64)(u32)~0u ? (u32)~0u : (u32)bytes; -} - static void opt_init_preg_info_from_ranges(Func* f, const OptLiveRangeSet* ranges) { OptPRegInfo* old = f->preg_info; @@ -687,167 +629,58 @@ static int spill_slot_compatible(Func* f, FrameSlot fs, PReg v) { return 1; } -static int alloc_group_hard_conflicts(Func* f, OptAllocator* a, - const OptLiveRangeSet* ranges, PReg root, - u32 bit) { - if (bit >= a->hard_loc_bits) return 1; - for (PReg v = 1; v < opt_reg_count(f); ++v) { - if (ranges->first_range_by_preg[v] == OPT_RANGE_NONE) continue; - if (!alloc_group_member(f, root, v)) continue; - if (alloc_ranges_overlap_vec(a, ranges, v, &a->hard_used_locs[bit], - &a->hard_point_visits)) - return 1; - } - return 0; -} - -static void alloc_mark_group_hard_loc(Func* f, OptAllocator* a, - const OptLiveRangeSet* ranges, PReg root, - u32 bit) { - if (bit >= a->hard_loc_bits) return; +/* Compute conflict_locs = union of used_locs[j] for j in every live range + * point of every PReg in `root`'s coalesce group. 
*/ +static void alloc_compute_group_conflicts(Func* f, OptAllocator* a, + const OptLiveRangeSet* ranges, + PReg root) { + for (u32 w = 0; w < a->loc_words; ++w) a->conflict_locs[w] = 0; for (PReg v = 1; v < opt_reg_count(f); ++v) { if (ranges->first_range_by_preg[v] == OPT_RANGE_NONE) continue; if (!alloc_group_member(f, root, v)) continue; - alloc_mark_vec(f, a, ranges, v, &a->hard_used_locs[bit], - &a->hard_mark_points); - } -} - -static void alloc_grow_stack_locs(Func* f, OptAllocator* a, u32 need_slots) { - if (need_slots <= a->stack_slot_cap) return; - u32 ncap = a->stack_slot_cap ? a->stack_slot_cap * 2u : 16u; - while (ncap < need_slots) ncap *= 2u; - FrameSlot* ns = arena_array(f->arena, FrameSlot, ncap); - AllocIntervalVec* ni = arena_zarray(f->arena, AllocIntervalVec, ncap); - u32* ne = arena_zarray(f->arena, u32, ncap); - if (a->stack_slots) { - memcpy(ns, a->stack_slots, sizeof(a->stack_slots[0]) * a->stack_slot_count); - memcpy(ni, a->stack_used_locs, - sizeof(a->stack_used_locs[0]) * a->stack_slot_count); - memcpy(ne, a->stack_slot_end, - sizeof(a->stack_slot_end[0]) * a->stack_slot_count); + for (u32 ri = ranges->first_range_by_preg[v]; ri != OPT_RANGE_NONE; + ri = ranges->ranges[ri].next) { + const OptLiveRange* lr = &ranges->ranges[ri]; + u32 end = lr->end < a->point_count ? lr->end : a->point_count; + for (u32 j = lr->start; j < end; ++j) { + ++a->hard_point_visits; + const u64* row = used_locs_row(a, j); + for (u32 w = 0; w < a->loc_words; ++w) { + a->conflict_locs[w] |= row[w]; + ++a->hard_word_ors; + } + } + } } - a->stack_slots = ns; - a->stack_used_locs = ni; - a->stack_slot_end = ne; - a->stack_slot_cap = ncap; } -static void alloc_mark_group_stack_loc(Func* f, OptAllocator* a, - const OptLiveRangeSet* ranges, PReg root, - u32 stack_idx) { - if (stack_idx >= a->stack_slot_count) return; +/* Mark `loc_bit` as occupied at every point covered by `root`'s group's + * live ranges. 
*/ +static void alloc_mark_group_loc(Func* f, OptAllocator* a, + const OptLiveRangeSet* ranges, PReg root, + u32 loc_bit) { + u32 w = loc_bit / 64u; + u64 mask = 1ull << (loc_bit % 64u); for (PReg v = 1; v < opt_reg_count(f); ++v) { if (ranges->first_range_by_preg[v] == OPT_RANGE_NONE) continue; if (!alloc_group_member(f, root, v)) continue; - alloc_mark_vec(f, a, ranges, v, &a->stack_used_locs[stack_idx], - &a->stack_mark_points); - } -} - -static int stack_slot_heap_less(const OptAllocator* a, u32 lhs, u32 rhs) { - u32 le = lhs < a->stack_slot_count ? a->stack_slot_end[lhs] : 0; - u32 re = rhs < a->stack_slot_count ? a->stack_slot_end[rhs] : 0; - if (le != re) return le < re; - return lhs < rhs; -} - -static void alloc_grow_active_stack(Func* f, OptAllocator* a) { - if (a->nactive_stack_slots < a->active_stack_cap) return; - u32 ncap = a->active_stack_cap ? a->active_stack_cap * 2u : 16u; - u32* ns = arena_array(f->arena, u32, ncap); - if (a->active_stack_slots) - memcpy(ns, a->active_stack_slots, - sizeof(a->active_stack_slots[0]) * a->nactive_stack_slots); - a->active_stack_slots = ns; - a->active_stack_cap = ncap; -} - -static void stack_active_push(Func* f, OptAllocator* a, u32 slot) { - alloc_grow_active_stack(f, a); - u32 i = a->nactive_stack_slots++; - a->active_stack_slots[i] = slot; - while (i) { - u32 p = (i - 1u) / 2u; - if (stack_slot_heap_less(a, a->active_stack_slots[p], - a->active_stack_slots[i])) - break; - u32 tmp = a->active_stack_slots[p]; - a->active_stack_slots[p] = a->active_stack_slots[i]; - a->active_stack_slots[i] = tmp; - i = p; - } -} - -static u32 stack_active_pop(OptAllocator* a) { - u32 out = a->active_stack_slots[0]; - u32 last = a->active_stack_slots[--a->nactive_stack_slots]; - if (a->nactive_stack_slots) { - u32 i = 0; - a->active_stack_slots[0] = last; - for (;;) { - u32 l = i * 2u + 1u; - u32 r = l + 1u; - u32 best = i; - if (l < a->nactive_stack_slots && - !stack_slot_heap_less(a, a->active_stack_slots[best], - 
a->active_stack_slots[l])) - best = l; - if (r < a->nactive_stack_slots && - !stack_slot_heap_less(a, a->active_stack_slots[best], - a->active_stack_slots[r])) - best = r; - if (best == i) break; - u32 tmp = a->active_stack_slots[i]; - a->active_stack_slots[i] = a->active_stack_slots[best]; - a->active_stack_slots[best] = tmp; - i = best; + for (u32 ri = ranges->first_range_by_preg[v]; ri != OPT_RANGE_NONE; + ri = ranges->ranges[ri].next) { + const OptLiveRange* lr = &ranges->ranges[ri]; + u32 end = lr->end < a->point_count ? lr->end : a->point_count; + for (u32 j = lr->start; j < end; ++j) { + ++a->hard_mark_points; + used_locs_row(a, j)[w] |= mask; + } } } - return out; -} - -static void alloc_grow_free_stack(Func* f, OptAllocator* a) { - if (a->nfree_stack_slots < a->free_stack_cap) return; - u32 ncap = a->free_stack_cap ? a->free_stack_cap * 2u : 16u; - u32* ns = arena_array(f->arena, u32, ncap); - if (a->free_stack_slots) - memcpy(ns, a->free_stack_slots, - sizeof(a->free_stack_slots[0]) * a->nfree_stack_slots); - a->free_stack_slots = ns; - a->free_stack_cap = ncap; -} - -static void stack_free_push(Func* f, OptAllocator* a, u32 slot) { - alloc_grow_free_stack(f, a); - a->free_stack_slots[a->nfree_stack_slots++] = slot; -} - -static void stack_expire_slots(Func* f, OptAllocator* a, u32 first) { - while (a->nactive_stack_slots) { - u32 slot = a->active_stack_slots[0]; - if (slot >= a->stack_slot_count || a->stack_slot_end[slot] > first) break; - (void)stack_active_pop(a); - stack_free_push(f, a, slot); - } -} - -static int stack_take_free_compatible(Func* f, OptAllocator* a, PReg v, - u32* stack_idx_out) { - for (u32 i = 0; i < a->nfree_stack_slots; ++i) { - u32 slot = a->free_stack_slots[i]; - if (!spill_slot_compatible(f, a->stack_slots[slot], v)) continue; - a->free_stack_slots[i] = a->free_stack_slots[--a->nfree_stack_slots]; - *stack_idx_out = slot; - return 1; - } - return 0; } static void alloc_assign_group_hard(Func* f, OptAllocator* a, const 
OptLiveRangeSet* ranges, PReg root, Reg r) { - u32 bit = hard_loc_bit(f->preg_info[root].cls, r); + u8 cls = f->preg_info[root].cls; + u32 bit = hard_loc_bit(cls, r); for (PReg v = 1; v < opt_reg_count(f); ++v) { if (ranges->first_range_by_preg[v] == OPT_RANGE_NONE) continue; if (!alloc_group_member(f, root, v)) continue; @@ -859,13 +692,29 @@ static void alloc_assign_group_hard(Func* f, OptAllocator* a, a->locs[v].hard_reg = r; a->locs[v].spill_slot = FRAME_SLOT_NONE; } - alloc_mark_group_hard_loc(f, a, ranges, root, bit); + alloc_mark_group_loc(f, a, ranges, root, bit); + if (bit < a->hard_loc_bits) a->hard_open[bit] = 1; } -static void alloc_assign_group_stack_slot(Func* f, OptAllocator* a, - const OptLiveRangeSet* ranges, - PReg root, u32 stack_idx, - u32 group_last) { +static void alloc_assign_group_stack(Func* f, OptAllocator* a, + const OptLiveRangeSet* ranges, PReg root) { + /* Try to reuse an existing stack slot whose bit is clear in conflict_locs + * and whose frame slot is compatible. The conflict_locs scratch must + * already be populated for `root` by the caller. */ + u32 stack_idx = (u32)~0u; + for (u32 k = 0; k < a->stack_slot_count; ++k) { + u32 bit = a->hard_loc_bits + k; + if (loc_bit_in_conflict(a->conflict_locs, bit)) continue; + if (!spill_slot_compatible(f, a->stack_slots[k], root)) continue; + stack_idx = k; + break; + } + if (stack_idx == (u32)~0u) { + FrameSlot fs = spill_slot_for(f, root); + stack_idx = alloc_alloc_stack_slot(f, a, fs); + /* alloc_alloc_stack_slot may have widened a->loc_words: refresh + * conflict_locs (callers don't reuse it after this). 
*/ + } FrameSlot slot = a->stack_slots[stack_idx]; for (PReg v = 1; v < opt_reg_count(f); ++v) { if (ranges->first_range_by_preg[v] == OPT_RANGE_NONE) continue; @@ -878,157 +727,57 @@ static void alloc_assign_group_stack_slot(Func* f, OptAllocator* a, a->locs[v].hard_reg = REG_NONE; a->locs[v].spill_slot = slot; } - alloc_mark_group_stack_loc(f, a, ranges, root, stack_idx); - a->stack_slot_end[stack_idx] = group_last; - stack_active_push(f, a, stack_idx); -} - -static void alloc_assign_group_stack(Func* f, OptAllocator* a, - const OptLiveRangeSet* ranges, PReg root, - const OptAllocGroupInfo* gi) { - u32 stack_idx = 0; - stack_expire_slots(f, a, gi->first); - if (!stack_take_free_compatible(f, a, root, &stack_idx)) { - FrameSlot slot = spill_slot_for(f, root); - alloc_grow_stack_locs(f, a, a->stack_slot_count + 1u); - stack_idx = a->stack_slot_count++; - a->stack_slots[stack_idx] = slot; - } - alloc_assign_group_stack_slot(f, a, ranges, root, stack_idx, gi->last); -} - -static void split_segment_grow(Func* f) { - if (f->opt_nalloc_segments < f->opt_alloc_segments_cap) return; - u32 ncap = f->opt_alloc_segments_cap ? f->opt_alloc_segments_cap * 2u : 32u; - OptAllocSegment* ns = arena_array(f->arena, OptAllocSegment, ncap); - if (f->opt_alloc_segments) - memcpy(ns, f->opt_alloc_segments, - sizeof(f->opt_alloc_segments[0]) * f->opt_nalloc_segments); - f->opt_alloc_segments = ns; - f->opt_alloc_segments_cap = ncap; -} - -static void split_segment_push(Func* f, PReg v, const OptLiveRange* lr, - u8 loc_kind, Reg hard_reg, - FrameSlot spill_home) { - split_segment_grow(f); - u32 idx = f->opt_nalloc_segments++; - OptAllocSegment* s = &f->opt_alloc_segments[idx]; - memset(s, 0, sizeof *s); - s->start = lr->raw_start; - s->end = lr->raw_end; - s->block = lr->block; - s->loc_kind = loc_kind; - s->cls = f->preg_info[v].cls; - s->hard_reg = hard_reg; - s->spill_slot = loc_kind == OPT_LOC_STACK ? 
spill_home : FRAME_SLOT_NONE; - s->spill_home = spill_home; - s->reload_at_start = loc_kind == OPT_LOC_HARD ? 1u : 0u; - s->store_at_end = loc_kind == OPT_LOC_HARD ? 1u : 0u; - s->next = f->opt_first_segment_by_preg[v]; - f->opt_first_segment_by_preg[v] = idx; -} - -static int alloc_range_hard_conflicts(OptAllocator* a, const OptLiveRange* lr, - u32 bit) { - if (bit >= a->hard_loc_bits) return 1; - u32 end = lr->end < a->point_count ? lr->end : a->point_count; - ++a->hard_point_visits; - if (lr->start >= end) return 0; - return alloc_interval_overlaps(&a->hard_used_locs[bit], lr->start, end); -} - -static void alloc_mark_range_hard(Func* f, OptAllocator* a, - const OptLiveRange* lr, u32 bit) { - if (bit >= a->hard_loc_bits) return; - u32 end = lr->end < a->point_count ? lr->end : a->point_count; - if (lr->start >= end) return; - ++a->hard_mark_points; - alloc_interval_insert(f, &a->hard_used_locs[bit], lr->start, end); -} - -static int alloc_try_split_singleton(Func* f, OptAllocator* a, - const OptLiveRangeSet* ranges, PReg v) { - if (f->opt_coalesce_parent && - f->opt_coalesce_size[alloc_coalesce_root(f, v)] > 1) - return 0; - if (ranges->first_range_by_preg[v] == OPT_RANGE_NONE) return 0; - if (f->preg_info[v].live_across_call_freq) return 0; - u32 nranges = 0; - for (u32 ri = ranges->first_range_by_preg[v]; ri != OPT_RANGE_NONE; - ri = ranges->ranges[ri].next) - ++nranges; - if (nranges < 2) return 0; - if (!f->opt_first_segment_by_preg) { - f->opt_first_segment_by_preg = - arena_array(f->arena, u32, opt_reg_count(f) ? opt_reg_count(f) : 1u); - memset(f->opt_first_segment_by_preg, 0xff, - sizeof(f->opt_first_segment_by_preg[0]) * - (opt_reg_count(f) ? 
opt_reg_count(f) : 1u)); - } - - OptPRegInfo* vi = &f->preg_info[v]; - FrameSlot home = spill_slot_for(f, v); - vi->alloc_kind = OPT_ALLOC_SPLIT; - vi->hard_reg = REG_NONE; - vi->spill_slot = home; - a->locs[v].kind = OPT_LOC_STACK; - a->locs[v].cls = vi->cls; - a->locs[v].hard_reg = REG_NONE; - a->locs[v].spill_slot = home; - - for (u32 ri = ranges->first_range_by_preg[v]; ri != OPT_RANGE_NONE; - ri = ranges->ranges[ri].next) { - const OptLiveRange* lr = &ranges->ranges[ri]; - int found = 0; - Reg best = REG_NONE; - u32 best_score = 0xffffffffu; - for (u32 r = 0; r < f->opt_hard_reg_count[vi->cls]; ++r) { - Reg hr = f->opt_hard_regs[vi->cls][r]; - if (hr >= 32) continue; - if (vi->forbidden_hard_regs & (1u << hr)) continue; - u32 bit = hard_loc_bit(vi->cls, hr); - if (alloc_range_hard_conflicts(a, lr, bit)) continue; - u32 score = hard_reg_alloc_score(f, a, vi, hr); - if (!found || score < best_score) { - found = 1; - best = hr; - best_score = score; + u32 bit = a->hard_loc_bits + stack_idx; + u32 w = bit / 64u; + u64 mask = 1ull << (bit % 64u); + for (PReg v = 1; v < opt_reg_count(f); ++v) { + if (ranges->first_range_by_preg[v] == OPT_RANGE_NONE) continue; + if (!alloc_group_member(f, root, v)) continue; + for (u32 ri = ranges->first_range_by_preg[v]; ri != OPT_RANGE_NONE; + ri = ranges->ranges[ri].next) { + const OptLiveRange* lr = &ranges->ranges[ri]; + u32 end = lr->end < a->point_count ? 
lr->end : a->point_count; + for (u32 j = lr->start; j < end; ++j) { + ++a->stack_mark_points; + used_locs_row(a, j)[w] |= mask; } } - if (found) { - alloc_mark_range_hard(f, a, lr, hard_loc_bit(vi->cls, best)); - split_segment_push(f, v, lr, OPT_LOC_HARD, best, home); - } else { - split_segment_push(f, v, lr, OPT_LOC_STACK, REG_NONE, home); - } } - return 1; +} + +static int alloc_group_conflicts_bit(const OptAllocator* a, u32 bit) { + if (bit / 64u >= a->loc_words) return 1; + return loc_bit_in_conflict(a->conflict_locs, bit); } static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges, OptAllocator* a, int allow_live_range_split) { + (void)allow_live_range_split; /* live-range splitting deferred per + doc/OPT_PERF.md plan; the parameter is + passed through for ABI compatibility. */ memset(a, 0, sizeof *a); a->point_count = ranges->point_count ? ranges->point_count : 1u; a->hard_loc_bits = OPT_REG_CLASSES * 32u; + a->loc_words = alloc_loc_words_for_bits(a->hard_loc_bits); + a->used_locs = + arena_zarray(f->arena, u64, (u64)a->point_count * a->loc_words); + a->conflict_locs = arena_zarray(f->arena, u64, a->loc_words); + a->locs = + arena_zarray(f->arena, OptLoc, opt_reg_count(f) ? opt_reg_count(f) : 1u); + a->hard_open = arena_zarray(f->arena, u8, a->hard_loc_bits); + a->stack_slots = NULL; + a->stack_slot_count = 0; + a->stack_slot_cap = 0; + + /* Build candidate list: every coalesce-root PReg that has live ranges. */ u32 ncands = 0; for (PReg v = 1; v < opt_reg_count(f); ++v) { if (ranges->first_range_by_preg[v] == OPT_RANGE_NONE) continue; if (alloc_coalesce_root(f, v) != v) continue; ++ncands; } - a->locs = - arena_zarray(f->arena, OptLoc, opt_reg_count(f) ? opt_reg_count(f) : 1u); - a->hard_used_locs = arena_zarray(f->arena, AllocIntervalVec, - a->hard_loc_bits ? a->hard_loc_bits : 1u); - a->stack_slots = NULL; - a->stack_slot_cap = 0; - a->stack_used_locs = NULL; - OptAllocCandidate* cands = arena_array(f->arena, OptAllocCandidate, ncands ? 
ncands : 1u); - OptSpillCandidate* spills = - arena_array(f->arena, OptSpillCandidate, ncands ? ncands : 1u); u32 n = 0; for (PReg v = 1; v < opt_reg_count(f); ++v) { if (ranges->first_range_by_preg[v] == OPT_RANGE_NONE) continue; @@ -1045,13 +794,14 @@ static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges, } alloc_sort_candidates(cands, n); - u32 nspills = 0; for (u32 i = 0; i < n; ++i) { PReg v = cands[i].v; OptAllocGroupInfo gi; alloc_group_info(f, ranges, v, &gi); OptPRegInfo* vi = &f->preg_info[v]; u8 cls = gi.cls; + alloc_compute_group_conflicts(f, a, ranges, v); + if (gi.tied_hard_reg >= 0) { Reg fixed = (Reg)gi.tied_hard_reg; if (!hard_available(f, cls, fixed)) { @@ -1067,8 +817,8 @@ static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges, "opt regalloc: fixed hard reg %u is clobbered", (unsigned)fixed); } - if (fixed >= 32 || alloc_group_hard_conflicts(f, a, ranges, v, - hard_loc_bit(cls, fixed))) { + u32 bit = hard_loc_bit(cls, fixed); + if (fixed >= 32 || alloc_group_conflicts_bit(a, bit)) { SrcLoc loc = {0, 0, 0}; compiler_panic(f->c, loc, "opt regalloc: conflicting fixed hard reg %u", (unsigned)fixed); @@ -1084,8 +834,8 @@ static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges, Reg hr = f->opt_hard_regs[cls][r]; if (hr >= 32) continue; if (gi.forbidden_hard_regs & (1u << hr)) continue; - if (alloc_group_hard_conflicts(f, a, ranges, v, hard_loc_bit(cls, hr))) - continue; + u32 bit = hard_loc_bit(cls, hr); + if (alloc_group_conflicts_bit(a, bit)) continue; u32 score = hard_reg_alloc_score(f, a, vi, hr); if (!found || score < best_score) { found = 1; @@ -1095,31 +845,19 @@ static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges, } if (found) { alloc_assign_group_hard(f, a, ranges, v, best); + } else { + alloc_assign_group_stack(f, a, ranges, v); } - if (!found) { - if (allow_live_range_split && alloc_try_split_singleton(f, a, ranges, v)) - continue; - spills[nspills].v = v; - spills[nspills].first = 
gi.first; - spills[nspills].last = gi.last; - ++nspills; - } - } - alloc_sort_spills(spills, nspills); - for (u32 i = 0; i < nspills; ++i) { - OptAllocGroupInfo gi; - alloc_group_info(f, ranges, spills[i].v, &gi); - alloc_assign_group_stack(f, a, ranges, spills[i].v, &gi); } - a->hard_loc_words = - alloc_interval_storage_words(a->hard_used_locs, a->hard_loc_bits); - a->stack_loc_words = - alloc_interval_storage_words(a->stack_used_locs, a->stack_slot_count); - f->opt_alloc_hard_loc_words = a->hard_loc_words; - f->opt_alloc_stack_loc_words = a->stack_loc_words; + + /* Report storage metrics in u64 words (used_locs is the only bitmap). */ + u32 total_words = a->point_count * a->loc_words; + u32 hard_words = alloc_loc_words_for_bits(a->hard_loc_bits) * a->point_count; + if (hard_words > total_words) hard_words = total_words; + f->opt_alloc_hard_loc_words = hard_words; + f->opt_alloc_stack_loc_words = total_words - hard_words; f->opt_alloc_stack_slots = a->stack_slot_count; - f->opt_used_loc_words = - f->opt_alloc_hard_loc_words + f->opt_alloc_stack_loc_words; + f->opt_used_loc_words = total_words; f->opt_alloc_hard_point_visits = a->hard_point_visits; f->opt_alloc_stack_point_visits = a->stack_point_visits; f->opt_alloc_hard_word_ors = a->hard_word_ors; @@ -1835,11 +1573,10 @@ void opt_regalloc(Func* f, int allow_live_range_split) { opt_init_preg_info_from_ranges(f, &ranges); opt_apply_asm_constraints_from_live(f, &live); apply_param_incoming_register_hazards(f); - /* Coalesce runs at both O1 and O2. IRF_NO_COALESCE protects SSA edge copies - * inserted by opt_make_conventional_ssa at O2; at O1 no such copies exist. - * ranges_overlap_kind treats two or more unit overlaps between dst and src - * as a real conflict, so multi-def pseudos at O1 are not unsafely merged. */ - opt_coalesce_ranges(f, &ranges); + /* MIR coalesces only at -O2 (mir-gen.c:9431); match that here. At O1 the + * point-bitmap allocator emits copies through the natural conflict-free + * path. 
IRF_NO_COALESCE protects SSA edge copies inserted at O2. */ + if (allow_live_range_split) opt_coalesce_ranges(f, &ranges); metrics_count(f->c, "opt.live_words", f->opt_live_words); metrics_count(f->c, "opt.ranges", ranges.nranges); metrics_count(f->c, "opt.range_points", ranges.point_count); diff --git a/test/opt/opt_test.c b/test/opt/opt_test.c @@ -599,13 +599,6 @@ static int count_op(Func* f, IROp op) { return n; } -static int block_contains_op(Func* f, u32 b, IROp op) { - if (!f || b >= f->nblocks) return 0; - for (u32 i = 0; i < f->blocks[b].ninsts; ++i) - if ((IROp)f->blocks[b].insts[i].op == op) return 1; - return 0; -} - static Inst* def_inst(Func* f, Val v) { if (!f || v == VAL_NONE || v >= f->nvals) return NULL; u32 b = f->val_def_block[v]; @@ -4277,7 +4270,10 @@ static void opt_o2_coalesces_nonconflicting_copy(void) { tc_fini(&tc); } -static void opt_o1_coalesces_simple_copy(void) { +static void opt_o1_skips_coalesce(void) { + /* O1 matches MIR's pipeline (mir-gen.c:9431): coalescing runs only at + * optimize_level >= 2. At O1 the point-bitmap allocator emits the copy + * through the normal path without merging operands. */ TestCtx tc; tc_init(&tc); MockCGTarget mock; @@ -4297,8 +4293,8 @@ static void opt_o1_coalesces_simple_copy(void) { opt_build_loop_tree(f); opt_regalloc(f, 0); - EXPECT(f->opt_coalesce_candidates == 1 && f->opt_coalesce_merges == 1, - "O1 regalloc should coalesce a simple copy"); + EXPECT(f->opt_coalesce_candidates == 0 && f->opt_coalesce_merges == 0, + "O1 regalloc should not run coalesce"); tc_fini(&tc); } @@ -4374,7 +4370,10 @@ static void opt_o2_refuses_incompatible_copy_coalesce(void) { tc_fini(&tc); } -static void opt_o2_splits_singleton_when_whole_alloc_fails(void) { +static void opt_o2_spills_singleton_when_whole_alloc_fails(void) { + /* Live-range splitting is deferred per doc/OPT_PERF.md plan. 
With one hard + * reg pinned and another value live across the pinned use, the allocator + * spills the unpinned value whole instead of producing OPT_ALLOC_SPLIT. */ TestCtx tc; tc_init(&tc); MockCGTarget mock; @@ -4401,23 +4400,16 @@ static void opt_o2_splits_singleton_when_whole_alloc_fails(void) { EXPECT(f->preg_info[pinned].alloc_kind == OPT_ALLOC_HARD, "pinned value should keep the only hard register"); - EXPECT(f->preg_info[v].alloc_kind == OPT_ALLOC_SPLIT, - "singleton value should split instead of whole spilling"); - int saw_hard = 0; - int saw_stack = 0; - for (u32 si = f->opt_first_segment_by_preg[v]; si != OPT_RANGE_NONE; - si = f->opt_alloc_segments[si].next) { - if (f->opt_alloc_segments[si].loc_kind == OPT_LOC_HARD) saw_hard = 1; - if (f->opt_alloc_segments[si].loc_kind == OPT_LOC_STACK) saw_stack = 1; - } - EXPECT(saw_hard && saw_stack, - "split value should have both hard and spill-home segments"); + EXPECT(f->preg_info[v].alloc_kind == OPT_ALLOC_SPILL, + "without splitting, the conflicting value should whole-spill"); EXPECT(f->preg_info[v].spill_slot != FRAME_SLOT_NONE, - "split value should have a canonical spill home"); + "spilled value should have a stack slot"); tc_fini(&tc); } -static void opt_o2_split_materializes_on_critical_edge(void) { +static void opt_o2_does_not_split_critical_edge(void) { + /* Live-range splitting (and the associated critical-edge materialization) + * is deferred. The unpinned value whole-spills; no edge blocks are added. 
*/ TestCtx tc; tc_init(&tc); MockCGTarget mock; @@ -4450,23 +4442,11 @@ static void opt_o2_split_materializes_on_critical_edge(void) { u32 original_blocks = f->nblocks; opt_regalloc(f, 1); - EXPECT(f->preg_info[v].alloc_kind == OPT_ALLOC_SPLIT, - "edge fixture should split v%u", (unsigned)v); - EXPECT(f->nblocks > original_blocks, - "critical edge materialization should add edge blocks"); - u32 edge = f->blocks[entry].succ[0]; - EXPECT(edge != join && edge < f->nblocks && f->blocks[edge].nsucc == 1 && - f->blocks[edge].succ[0] == join, - "entry->join critical edge should be split"); - if (edge != join && edge < f->nblocks) { - EXPECT(block_contains_op(f, edge, IR_LOAD), - "split edge block should hold the reload"); - EXPECT(block_contains_op(f, edge, IR_BR), - "split edge block should still branch to join"); - } - EXPECT(!block_contains_op(f, join, IR_LOAD), - "join block should not receive an ambiguous reload"); - opt_verify(f, "test-split-critical-edge-materialization"); + EXPECT(f->preg_info[v].alloc_kind != OPT_ALLOC_SPLIT, + "splitting is deferred; v%u should not be OPT_ALLOC_SPLIT", + (unsigned)v); + EXPECT(f->nblocks == original_blocks, + "no edge materialization expected without splitting"); tc_fini(&tc); } @@ -6647,11 +6627,11 @@ int main(void) { opt_range_overlap_class(); opt_regalloc_priority(); opt_o2_coalesces_nonconflicting_copy(); - opt_o1_coalesces_simple_copy(); + opt_o1_skips_coalesce(); opt_o2_refuses_overlapping_copy_coalesce(); opt_o2_refuses_incompatible_copy_coalesce(); - opt_o2_splits_singleton_when_whole_alloc_fails(); - opt_o2_split_materializes_on_critical_edge(); + opt_o2_spills_singleton_when_whole_alloc_fails(); + opt_o2_does_not_split_critical_edge(); opt_o1_does_not_split_spill_edges(); opt_range_regalloc_no_conflicts_and_stack_reuse(); opt_stack_spill_assignment_avoids_quadratic_probe();
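The point-indexed bitmap described in the commit message (`used_locs[p * loc_words + w]`, with `conflict_locs` built by OR-ing rows across a candidate's range points) can be illustrated with a minimal standalone sketch. Everything here is an assumption for illustration: the `Sketch*` types and functions are hypothetical and are not kit's `OptAllocator` API, the sketch takes the lowest free bit rather than scoring candidates with `hard_reg_alloc_score`, and it does not model stack slots widening `loc_words`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins; these are NOT the kit types. */
typedef struct {
  uint32_t start, end; /* half-open range of compressed program points */
} SketchRange;

typedef struct {
  uint32_t point_count;    /* number of program points (bitmap rows) */
  uint32_t loc_words;      /* u64 words per row */
  uint64_t* used_locs;     /* point_count * loc_words words, zeroed */
  uint64_t* conflict_locs; /* loc_words words of scratch */
} SketchAlloc;

/* Allocate a zeroed bitmap; returns 1 on success, 0 on allocation failure. */
static int sketch_init(SketchAlloc* a, uint32_t point_count, uint32_t loc_bits) {
  a->point_count = point_count ? point_count : 1u;
  a->loc_words = loc_bits ? (loc_bits + 63u) / 64u : 1u;
  a->used_locs =
      calloc((size_t)a->point_count * a->loc_words, sizeof(uint64_t));
  a->conflict_locs = calloc(a->loc_words, sizeof(uint64_t));
  return a->used_locs != NULL && a->conflict_locs != NULL;
}

/* Row of location bits occupied at program point p. */
static uint64_t* sketch_row(SketchAlloc* a, uint32_t p) {
  assert(p < a->point_count);
  return a->used_locs + (size_t)p * a->loc_words;
}

/* OR every row covered by the candidate's ranges into conflict_locs. */
static void sketch_compute_conflicts(SketchAlloc* a, const SketchRange* rs,
                                     uint32_t n) {
  for (uint32_t w = 0; w < a->loc_words; ++w) a->conflict_locs[w] = 0;
  for (uint32_t i = 0; i < n; ++i) {
    uint32_t end = rs[i].end < a->point_count ? rs[i].end : a->point_count;
    for (uint32_t p = rs[i].start; p < end; ++p) {
      const uint64_t* row = sketch_row(a, p);
      for (uint32_t w = 0; w < a->loc_words; ++w) a->conflict_locs[w] |= row[w];
    }
  }
}

/* A bit outside the bitmap is treated as conflicting. */
static int sketch_bit_conflicts(const SketchAlloc* a, uint32_t bit) {
  if (bit / 64u >= a->loc_words) return 1;
  return (int)((a->conflict_locs[bit / 64u] >> (bit % 64u)) & 1u);
}

/* Set the location bit at every point covered by the ranges. */
static void sketch_mark(SketchAlloc* a, const SketchRange* rs, uint32_t n,
                        uint32_t bit) {
  uint64_t mask = 1ull << (bit % 64u);
  for (uint32_t i = 0; i < n; ++i) {
    uint32_t end = rs[i].end < a->point_count ? rs[i].end : a->point_count;
    for (uint32_t p = rs[i].start; p < end; ++p)
      sketch_row(a, p)[bit / 64u] |= mask;
  }
}

/* Assign the lowest conflict-free location bit below loc_bits, or -1. */
static int sketch_assign(SketchAlloc* a, const SketchRange* rs, uint32_t n,
                         uint32_t loc_bits) {
  sketch_compute_conflicts(a, rs, n);
  for (uint32_t bit = 0; bit < loc_bits; ++bit) {
    if (sketch_bit_conflicts(a, bit)) continue;
    sketch_mark(a, rs, n, bit);
    return (int)bit;
  }
  return -1;
}
```

In this sketch, one pass over a candidate's points produces a word-sized conflict mask, after which each location test is a single bit probe. That is the trade the commit describes: more memory (one row per point) in exchange for avoiding per-location interval overlap queries. A caller in the real allocator would fall back to a stack slot when the sketch's `-1` case occurs.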