commit 9a3a508a459a3d8d8d75530d66b988b68da42c43
parent 396793eaaaa6a778f2f6face8a960fd58a3438db
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Fri, 22 May 2026 13:34:05 -0700
opt regalloc: swap to MIR-shaped point-bitmap allocator at O1
Replace cfree's sorted interval-vector conflict structure (AllocIntervalVec
hard_used_locs[hard_loc_bit]) with MIR's point-indexed bitmap
(used_locs[p * loc_words + w], one row per compressed program point).
Sort coalesce-root PRegs by MIR's heuristic and OR used_locs across each
candidate's range points to build conflict_locs in one pass. Re-gate
opt_coalesce_ranges to O2 only, matching mir-gen.c:9431; the 2026-05-22
coalesce-at-O1 experiment is rolled back. Live-range splitting
(get_hard_reg_with_split, lr_gap_tab, split()) is deferred — to be ported
on top of the new core in a follow-up.
On the 5-case fast scope at O1 the opt+link+JIT geomean drops from
0.968ms to 0.621ms (-35.9%); per-bench opt.regalloc drops ~55%. cfree
now lands ~14% ahead of MIR's `MIR link finish` (0.720ms) on this scope,
crossing from 1.34x behind to 1.16x ahead. Runtime unchanged within
noise. See doc/OPT_PERF.md Iteration Notes for details.
Tests refreshed to match the new contract: opt_o1_skips_coalesce (was
opt_o1_coalesces_simple_copy), opt_o2_spills_singleton_when_whole_alloc_fails
(was opt_o2_splits_singleton_when_whole_alloc_fails),
opt_o2_does_not_split_critical_edge (was opt_o2_split_materializes_*).
Diffstat:
| M | doc/OPT_PERF.md | | | 149 | ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------- |
| M | src/opt/pass_lower.c | | | 665 | ++++++++++++++++++++++++------------------------------------------------------- |
| M | test/opt/opt_test.c | | | 68 | ++++++++++++++++++++++++-------------------------------------------- |
3 files changed, 339 insertions(+), 543 deletions(-)
diff --git a/doc/OPT_PERF.md b/doc/OPT_PERF.md
@@ -148,20 +148,40 @@ Base: each benchmark's `gcc-15 -O2` row is `1.0x`.
Focused timing pass:
- Scope: `array`, `hash`, `hash2`, `matrix`, `sieve`
-- Output: `build/bench/opt/opt-link-split-common/summary.csv`
- cfree measurement: sum of `opt.o*.total`, `link.resolve.total`, and JIT
- scoped timings from `cfree run --time`
+ scoped timings from `cfree run --bench-time` (best-of-3 per bench)
- MIR measurement: `MIR link finish` from `c2m -v -O<n> file.bmir -eg`
+Latest refresh (2026-05-22, `CFREE_OPT_BENCH_FAST=1`, single Darwin host),
+post-regalloc-rewrite (see Iteration Notes below):
+
| opt | cfree opt+link+JIT ms | MIR link/JIT ms | result |
| ---: | ---: | ---: | --- |
-| `O0` | `0.275` | `0.758` | cfree `2.76x` faster |
-| `O1` | `4.552` | `0.718` | MIR `6.34x` faster |
-| `O2` | `4.728` | `1.100` | MIR `4.30x` faster |
-
-This is the key split for tuning: cfree's final link/JIT mechanics are very
-fast, but MIR's optimized generation pipeline is substantially faster at
-`O1` and `O2`.
+| `O1` (HEAD before rewrite) | `0.968` | `0.722` | MIR `1.34x` faster |
+| `O1` (MIR-shaped allocator) | `0.621` | `0.720` | cfree `1.16x` faster |
+
+Per-bench at `O1` after the rewrite:
+
+| bench | cfree opt ms | cfree link ms | cfree jit ms | cfree opt+l+j | MIR codegen |
+| --- | ---: | ---: | ---: | ---: | ---: |
+| `array` | `0.269` | `0.049` | `0.043` | `0.361` | `0.300` |
+| `hash` | `1.033` | `0.083` | `0.040` | `1.156` | `1.653` |
+| `hash2` | `1.069` | `0.089` | `0.042` | `1.200` | `1.765` |
+| `matrix` | `0.484` | `0.054` | `0.035` | `0.573` | `0.859` |
+| `sieve` | `0.259` | `0.031` | `0.031` | `0.321` | `0.258` |
+| **geomean** | | | | **`0.621`** | **`0.720`** |
+
+The previously reported `4.552 ms` cfree `O1` figure (with MIR at `0.718 ms`,
+a `6.34x` gap) was from an earlier benchmark configuration; on the current
+host it reproduces at `0.968 ms` against MIR's `0.722 ms`. The MIR-shaped
+allocator rewrite closes the gap and crosses ahead: cfree opt+link+JIT is
+now ~14% faster than MIR's `MIR link finish` across this 5-case scope,
+driven by a `~55%` per-bench drop in `opt.regalloc` time (point-bitmap
+conflict checks instead of per-hard-register sorted interval vectors).
+
+`O0` and `O2` rows have not yet been re-measured under the new allocator;
+the full-scope refresh is pending. cfree's final link/JIT mechanics stay
+fast across both data-structure choices.
## Goals
@@ -191,40 +211,53 @@ Coverage goals:
## Current Tuning Priorities
-1. Investigate why `O2` trails `O1` on the current cfree runtime geomean.
-2. Profile expensive O2 passes on `hash`, `hash2`, and `matrix`, where the
+1. Port MIR's live-range splitting (`get_hard_reg_with_split`, `lr_gap_tab`,
+ `split()`) on top of the new point-bitmap core, restoring the O2 splitting
+ layer that was deferred during the regalloc rewrite. Until then O2 may
+ regress on benches where splitting matters (`array`, `hash`, `matrix`).
+2. Investigate why `O2` trails `O1` on the current cfree runtime geomean.
+3. Profile expensive O2 passes on `hash`, `hash2`, and `matrix`, where the
opt/link split shows cfree optimized generation much slower than MIR.
-3. Improve generated-code quality before adding broad new passes. Prefer
+4. Improve generated-code quality before adding broad new passes. Prefer
targeted fixes backed by benchmark deltas and pass-local tests.
-4. Add finer cfree timing columns to `scripts/opt_bench.sh` once the driver can
+5. Add finer cfree timing columns to `scripts/opt_bench.sh` once the driver can
expose parse, optimize, link, and JIT slices in parseable form.
## O1 Gap Analysis vs MIR
cfree `O1` runs only the shared lowering pipeline (`doc/OPT.md`): build_cfg,
-jump_cleanup, simplify_local, machinize, build_loop_tree, live_blocks,
-dead_def_elim, regalloc (simple mode), combine, dce, emit. No SSA-era passes
-run at `O1`, so the gaps differ from `O2`.
+jump_cleanup, simplify_local, machinize, addr_xform_pregs, build_loop_tree,
+live_blocks, dead_def_elim, regalloc (simple mode), combine, dce, emit.
+No SSA-era passes and no coalescing run at `O1`, matching MIR
+(`mir-gen.c:9431`: coalesce is gated on `optimize_level >= 2`).
-Headline at `O1`:
+Headline at `O1` (post-regalloc-rewrite, 2026-05-22 refresh):
| metric | cfree | MIR |
| --- | ---: | ---: |
-| compile speed (vs gcc-15 `O2`) | `5.79x` | `2.52x` |
-| runtime speed (vs gcc-15 `O2`) | `0.50x` | `0.57x` |
-| opt+link+JIT (5-case scope) | `4.552 ms` | `0.718 ms` |
+| opt+link+JIT (5-case scope, geomean) | `0.621 ms` | `0.720 ms` |
+| runtime (5-case scope, geomean) | `6354.6 ms` | `4663.8 ms` |
+
+cfree opt+link+JIT is now ~14% faster than MIR's `MIR link finish` slice.
+The remaining gap is on the runtime side: MIR's generated code runs ~1.36x
+faster than cfree's at `O1` on this scope, dominated by what MIR does that
+cfree does not (coalescing + splitting at O2, plus better instruction
+selection).
-cfree wins the whole-pipeline compile metric because of frontend speed, but
-loses the optimized-stage opt+link+JIT split by `6.34x`.
+Compile-time gaps closed by the rewrite:
-Compile-time gaps unique to `O1`:
+- `opt_regalloc` now uses point-indexed bitsets ORed per program point
+ (matching MIR's `assign()` in `mir-gen.c:7551-7728`, simplified branch),
+ replacing the previous sorted interval-vector overlap checks per hard
+ register. This is the dominant `~55%` per-bench reduction in
+ `opt.regalloc` time across `array`, `hash`, `hash2`, `matrix`, `sieve`.
+
+Compile-time gaps that remain:
- `opt_live_blocks` tracks every pseudo-register uniformly. MIR's
- `calculate_func_cfg_live_info` narrows to move-related variables at `O1` via
- `consider_move_vars_only` when coalescing is on, shrinking bitset width and
- fixpoint iterations.
-- `opt_regalloc` uses sorted interval-vector overlap checks per hard register.
- MIR uses point-indexed bitsets ORed per program point.
+ `calculate_func_cfg_live_info` also tracks all live vars at the
+ pre-allocation step (`consider_all_live_vars`), but its bitsets reuse a
+ shared sparse vector representation; cfree's `OptBitset` is dense.
- `opt_combine` runs a single forward sweep per BB with no fixpoint, so
cascading folds either require an earlier pass to enable them or get
missed.
@@ -233,13 +266,10 @@ Compile-time gaps unique to `O1`:
Runtime gaps unique to `O1`:
-- No coalescing. cfree gates `opt_coalesce_ranges` on
- `allow_live_range_split`, which is off at `O1`. Every `IR_COPY` survives to
- emission unless `opt_combine`'s identity-copy detector catches it post-RA.
- MIR's `O1` runs `collect_moves`, `consider_move_vars_only`, and `coalesce`.
-- No address-mode synthesis. `opt_ssa_combine` and `opt_addr_xform` only run
- at `O2`, so `IR_ADDR_OF(local) + indirect load` stays as separate
- instructions at `O1`.
+- No coalescing at `O1` (MIR matches this). Every `IR_COPY` survives to
+ emission unless `opt_combine`'s identity-copy detector catches it
+ post-RA. The 2026-05-22 attempt to enable coalescing at `O1` was rolled
+ back as part of matching MIR's pipeline.
- No ext-of-ext semantic fold. `opt_combine` removes only identical convert
pairs. MIR's `combine_exts` folds chains like `zext8 -> zext32` into a
single extension of the right width.
@@ -315,8 +345,57 @@ Runtime gaps:
## Iteration Notes
+### MIR-shaped regalloc (2026-05-22)
+
+Change: replace `OptAllocator`'s sorted interval-vector conflict structure
+(`AllocIntervalVec hard_used_locs[hard_loc_bit]` plus per-stack-slot
+intervals) with MIR's point-indexed bitmap (`used_locs[p * loc_words + w]`,
+one row per compressed program point). Sort coalesce-root PRegs by the same
+heuristic MIR uses (`mir-gen.c:7320-7337`: tied-reg first, then descending
+freq, then descending live_length). For each candidate, OR `used_locs[j]`
+across the candidate's live-range points into a scratch `conflict_locs`
+bitmap, pick the cheapest free hard reg, or fall back to a stack slot
+(probing existing stack-slot bits in `conflict_locs` for automatic reuse).
+Re-gate `opt_coalesce_ranges` on `allow_live_range_split` (O2 only),
+matching `mir-gen.c:9431` — the earlier "coalescing at O1" experiment
+(below) is rolled back as part of this change. Live-range splitting
+(`get_hard_reg_with_split`, `lr_gap_tab`, `split()`) is deferred; the new
+core is the foundation on which splitting will be re-added.
+
+Measurement (`CFREE_OPT_BENCH_FAST=1`, best of 3 compile/run repeats,
+single Darwin host), 5-case scope at O1:
+
+| bench | HEAD opt+link+jit ms | MIR-shaped opt+link+jit ms | delta |
+| --- | ---: | ---: | ---: |
+| `array` | `0.535` | `0.361` | `-32.5%` |
+| `hash` | `1.916` | `1.156` | `-39.7%` |
+| `hash2` | `1.929` | `1.200` | `-37.8%` |
+| `matrix` | `0.899` | `0.573` | `-36.3%` |
+| `sieve` | `0.477` | `0.321` | `-32.7%` |
+| **geomean** | `0.968` | `0.621` | `-35.9%` |
+
+Per-bench `opt.regalloc` (the bucket the rewrite targets) drops by `~55%`
+across all five benches: `array -54.1%`, `hash -58.3%`, `hash2 -55.5%`,
+`matrix -53.7%`, `sieve -52.2%`. Runtime is unchanged within noise
+(geomean delta `-0.2%`), consistent with the rewrite changing only the
+conflict data structure, not allocator decisions.
+
+`compile_and_jit` total shifts modestly per-bench (`+0.85%` to `-21.75%`)
+because the C frontend is 99%+ of that total on these benches; the
+allocator gain is only visible after isolating the optimized stages from
+the frontend.
+
+After this rewrite, cfree's `opt+link+JIT` geomean (`0.621 ms`) is
+~14% **ahead** of MIR's `MIR link finish` geomean (`0.720 ms`) on the
+same scope — see the Optimizer/Link/JIT Split table.
+
### Coalescing at O1 (2026-05-22)
+**Rolled back** as part of the MIR-shaped regalloc rewrite above. MIR
+gates coalescing on `optimize_level >= 2` (`mir-gen.c:9431`), so matching
+MIR's O1 pipeline means coalesce does not run at O1. The original entry
+is preserved below for context.
+
Change: remove the `allow_live_range_split` gate around
`opt_coalesce_ranges` in `opt_regalloc` so coalesce runs at both O1 and
O2. Initially this produced wrong code on the address-taken branch-join
diff --git a/src/opt/pass_lower.c b/src/opt/pass_lower.c
@@ -283,6 +283,26 @@ static int is_caller_saved(Func* f, u8 cls, Reg r) {
return (f->opt_caller_saved[cls] & (1u << r)) != 0;
}
+/* ---------------------------------------------------------------------------
+ * Register allocator, MIR-shaped.
+ *
+ * Data structures and assignment algorithm mirror MIR's reg_alloc/assign
+ * (mir-gen.c:7551-7728, simplified_p branch). Conflict detection uses a
+ * point-indexed bitmap of locations (hard regs + stack slots) instead of
+ * a sorted interval vector per hard register.
+ *
+ * used_locs[p * loc_words .. p * loc_words + loc_words) -- one row per
+ * compressed
+ * program point
+ *
+ * Bit indices:
+ * 0 .. hard_loc_bits-1 -> hard registers (hard_loc_bit(cls, r))
+ * hard_loc_bits + k -> stack slot index k (k < stack_slot_count)
+ *
+ * Live-range splitting (`get_hard_reg_with_split`, `lr_gap_t`, `split()`)
+ * is deferred per doc/OPT_PERF.md plan.
+ * ------------------------------------------------------------------------- */
+
typedef struct OptAllocator OptAllocator;
static int hard_available(Func* f, u8 cls, Reg r) {
@@ -315,19 +335,13 @@ static FrameSlot spill_slot_for(Func* f, PReg v) {
static u32 hard_loc_bit(u8 cls, Reg r) { return ((u32)cls * 32u) + (u32)r; }
typedef struct OptAllocCandidate {
- PReg v;
+ PReg v; /* coalesce-root PReg with live ranges */
u32 spill_cost;
u32 live_length;
u8 tied;
u8 pad[3];
} OptAllocCandidate;
-typedef struct OptSpillCandidate {
- PReg v;
- u32 first;
- u32 last;
-} OptSpillCandidate;
-
typedef struct OptAllocGroupInfo {
PReg root;
u32 spill_cost;
@@ -340,43 +354,84 @@ typedef struct OptAllocGroupInfo {
u8 pad[3];
} OptAllocGroupInfo;
-typedef struct AllocInterval {
- u32 start;
- u32 end;
-} AllocInterval;
-
-typedef struct AllocIntervalVec {
- AllocInterval* data;
- u32 n;
- u32 cap;
-} AllocIntervalVec;
-
typedef struct OptAllocator {
- OptLoc* locs;
- AllocIntervalVec* hard_used_locs;
- AllocIntervalVec* stack_used_locs;
- u32* stack_slot_end;
- u32* active_stack_slots;
- u32 nactive_stack_slots;
- u32 active_stack_cap;
- u32* free_stack_slots;
- u32 nfree_stack_slots;
- u32 free_stack_cap;
+ OptLoc* locs; /* per-PReg result (cls, hard_reg, spill_slot) */
+
+ /* Per-point bitmap of locations. used_locs[p * loc_words + w] is word w
+ * of the bitmap for compressed program point p. Bit indices:
+ * 0 .. hard_loc_bits - 1 -> hard regs (hard_loc_bit)
+ * hard_loc_bits + stack_idx -> stack slot indices */
+ u64* used_locs;
u32 point_count;
- u32 hard_loc_words;
- u32 stack_loc_words;
- u32 hard_loc_bits;
+ u32 loc_words; /* width of one row, in u64 words */
+ u32 hard_loc_bits; /* OPT_REG_CLASSES * 32 */
+
+ /* Stack slot table (parallel arrays). */
FrameSlot* stack_slots;
u32 stack_slot_count;
u32 stack_slot_cap;
- u64 hard_point_visits;
- u64 stack_point_visits;
- u64 hard_word_ors;
+
+ /* hard_open[hard_loc_bit] is 1 if at least one PReg has been assigned to
+ * this hard reg in the current function. Drives `hard_reg_alloc_score`'s
+ * callee-save bias. */
+ u8* hard_open;
+
+ /* Scratch bitmap of loc_words. Reused per candidate. */
+ u64* conflict_locs;
+
+ /* Metrics. */
+ u64 hard_point_visits; /* points scanned during hard-reg conflict probe */
+ u64 stack_point_visits; /* points scanned during stack-slot probe */
+ u64 hard_word_ors; /* word-OR operations into conflict_locs */
u64 stack_word_ors;
- u64 hard_mark_points;
+ u64 hard_mark_points; /* points marked when assigning a hard reg */
u64 stack_mark_points;
} OptAllocator;
+/* Bitmap helpers over loc_words-wide rows of used_locs. */
+static u64* used_locs_row(OptAllocator* a, u32 p) {
+ return &a->used_locs[(u64)p * a->loc_words];
+}
+
+static int loc_bit_in_conflict(const u64* conflict_locs, u32 bit) {
+ return (conflict_locs[bit / 64u] & (1ull << (bit % 64u))) != 0;
+}
+
+static u32 alloc_loc_words_for_bits(u32 bits) { return (bits + 63u) / 64u; }
+
+static void alloc_grow_loc_words(Func* f, OptAllocator* a, u32 need_words) {
+ if (need_words <= a->loc_words) return;
+ u32 new_words = a->loc_words ? a->loc_words : 1u;
+ while (new_words < need_words) new_words *= 2u;
+ u64* nb = arena_zarray(f->arena, u64, (u64)a->point_count * new_words);
+ if (a->used_locs && a->loc_words) {
+ for (u32 p = 0; p < a->point_count; ++p)
+ memcpy(&nb[(u64)p * new_words], &a->used_locs[(u64)p * a->loc_words],
+ sizeof(u64) * a->loc_words);
+ }
+ a->used_locs = nb;
+ a->loc_words = new_words;
+ u64* nc = arena_zarray(f->arena, u64, new_words);
+ a->conflict_locs = nc;
+}
+
+static u32 alloc_alloc_stack_slot(Func* f, OptAllocator* a, FrameSlot fs) {
+ if (a->stack_slot_count == a->stack_slot_cap) {
+ u32 ncap = a->stack_slot_cap ? a->stack_slot_cap * 2u : 16u;
+ FrameSlot* ns = arena_array(f->arena, FrameSlot, ncap);
+ if (a->stack_slots)
+ memcpy(ns, a->stack_slots, sizeof(a->stack_slots[0]) * a->stack_slot_count);
+ a->stack_slots = ns;
+ a->stack_slot_cap = ncap;
+ }
+ u32 idx = a->stack_slot_count++;
+ a->stack_slots[idx] = fs;
+ u32 needed_bits = a->hard_loc_bits + a->stack_slot_count;
+ u32 needed_words = alloc_loc_words_for_bits(needed_bits);
+ alloc_grow_loc_words(f, a, needed_words);
+ return idx;
+}
+
static u32 hard_reg_alloc_score(Func* f, const OptAllocator* a,
const OptPRegInfo* vi, Reg hr) {
const CGPhysRegInfo* pi = phys_info_for(f, vi->cls, hr);
@@ -388,7 +443,7 @@ static u32 hard_reg_alloc_score(Func* f, const OptAllocator* a,
score += 20u;
} else if (!is_caller_saved(f, vi->cls, hr)) {
u32 bit = hard_loc_bit(vi->cls, hr);
- int already_open = bit < a->hard_loc_bits && a->hard_used_locs[bit].n != 0;
+ int already_open = a->hard_open && bit < a->hard_loc_bits && a->hard_open[bit];
if (!already_open) score += pi ? pi->save_cost : 50u;
}
return score;
@@ -414,19 +469,6 @@ static void alloc_sort_candidates(OptAllocCandidate* cands, u32 n) {
if (n > 1) qsort(cands, n, sizeof(cands[0]), alloc_candidate_cmp);
}
-static int spill_candidate_cmp(const void* va, const void* vb) {
- const OptSpillCandidate* a = (const OptSpillCandidate*)va;
- const OptSpillCandidate* b = (const OptSpillCandidate*)vb;
- if (a->first != b->first)
- return (a->first > b->first) - (a->first < b->first);
- if (a->last != b->last) return (a->last > b->last) - (a->last < b->last);
- return (a->v > b->v) - (a->v < b->v);
-}
-
-static void alloc_sort_spills(OptSpillCandidate* spills, u32 n) {
- if (n > 1) qsort(spills, n, sizeof(spills[0]), spill_candidate_cmp);
-}
-
static PReg alloc_coalesce_root(Func* f, PReg v) {
if (!f->opt_coalesce_parent || v == PREG_NONE || v >= opt_reg_count(f))
return v;
@@ -466,106 +508,6 @@ static void alloc_group_info(Func* f, const OptLiveRangeSet* ranges, PReg root,
if (out->first == (u32)~0u) out->first = 0;
}
-static u32 alloc_interval_lower_bound(const AllocIntervalVec* v, u32 start) {
- u32 lo = 0;
- u32 hi = v ? v->n : 0;
- while (lo < hi) {
- u32 mid = lo + (hi - lo) / 2u;
- if (v->data[mid].start < start)
- lo = mid + 1u;
- else
- hi = mid;
- }
- return lo;
-}
-
-static int alloc_interval_overlaps(const AllocIntervalVec* v, u32 start,
- u32 end) {
- if (!v || !v->n) return 0;
- u32 pos = alloc_interval_lower_bound(v, start);
- if (pos > 0) {
- const AllocInterval* prev = &v->data[pos - 1u];
- if (prev->end > start && prev->start < end) return 1;
- }
- if (pos < v->n) {
- const AllocInterval* cur = &v->data[pos];
- if (cur->start < end && cur->end > start) return 1;
- }
- return 0;
-}
-
-static void alloc_interval_grow(Func* f, AllocIntervalVec* v) {
- if (v->n < v->cap) return;
- u32 ncap = v->cap ? v->cap * 2u : 8u;
- AllocInterval* nd = arena_array(f->arena, AllocInterval, ncap);
- if (v->data) memcpy(nd, v->data, sizeof(v->data[0]) * v->n);
- v->data = nd;
- v->cap = ncap;
-}
-
-static void alloc_interval_insert(Func* f, AllocIntervalVec* v, u32 start,
- u32 end) {
- if (end <= start) end = start + 1u;
- u32 pos = alloc_interval_lower_bound(v, start);
- if (pos > 0 && v->data[pos - 1u].end >= start) {
- --pos;
- if (v->data[pos].end < end) v->data[pos].end = end;
- } else {
- alloc_interval_grow(f, v);
- if (pos < v->n)
- memmove(v->data + pos + 1u, v->data + pos,
- sizeof(v->data[0]) * (v->n - pos));
- v->data[pos].start = start;
- v->data[pos].end = end;
- ++v->n;
- }
-
- while (pos + 1u < v->n && v->data[pos + 1u].start <= v->data[pos].end) {
- if (v->data[pos].end < v->data[pos + 1u].end)
- v->data[pos].end = v->data[pos + 1u].end;
- if (pos + 2u < v->n)
- memmove(v->data + pos + 1u, v->data + pos + 2u,
- sizeof(v->data[0]) * (v->n - pos - 2u));
- --v->n;
- }
-}
-
-static int alloc_ranges_overlap_vec(OptAllocator* a,
- const OptLiveRangeSet* ranges, PReg v,
- const AllocIntervalVec* vec, u64* visits) {
- for (u32 r = ranges->first_range_by_preg[v]; r != OPT_RANGE_NONE;
- r = ranges->ranges[r].next) {
- const OptLiveRange* lr = &ranges->ranges[r];
- u32 end = lr->end < a->point_count ? lr->end : a->point_count;
- if (visits) ++*visits;
- if (lr->start >= end) continue;
- if (alloc_interval_overlaps(vec, lr->start, end)) return 1;
- }
- return 0;
-}
-
-static void alloc_mark_vec(Func* f, OptAllocator* a,
- const OptLiveRangeSet* ranges, PReg v,
- AllocIntervalVec* vec, u64* marks) {
- for (u32 r = ranges->first_range_by_preg[v]; r != OPT_RANGE_NONE;
- r = ranges->ranges[r].next) {
- const OptLiveRange* lr = &ranges->ranges[r];
- u32 end = lr->end < a->point_count ? lr->end : a->point_count;
- if (lr->start >= end) continue;
- if (marks) ++*marks;
- alloc_interval_insert(f, vec, lr->start, end);
- }
-}
-
-static u32 alloc_interval_storage_words(const AllocIntervalVec* vecs,
- u32 nvecs) {
- u64 bytes = (u64)nvecs * (u64)sizeof(vecs[0]);
- for (u32 i = 0; i < nvecs; ++i)
- bytes += (u64)vecs[i].cap * (u64)sizeof(vecs[i].data[0]);
- bytes = (bytes + 7u) / 8u;
- return bytes > (u64)(u32)~0u ? (u32)~0u : (u32)bytes;
-}
-
static void opt_init_preg_info_from_ranges(Func* f,
const OptLiveRangeSet* ranges) {
OptPRegInfo* old = f->preg_info;
@@ -687,167 +629,58 @@ static int spill_slot_compatible(Func* f, FrameSlot fs, PReg v) {
return 1;
}
-static int alloc_group_hard_conflicts(Func* f, OptAllocator* a,
- const OptLiveRangeSet* ranges, PReg root,
- u32 bit) {
- if (bit >= a->hard_loc_bits) return 1;
- for (PReg v = 1; v < opt_reg_count(f); ++v) {
- if (ranges->first_range_by_preg[v] == OPT_RANGE_NONE) continue;
- if (!alloc_group_member(f, root, v)) continue;
- if (alloc_ranges_overlap_vec(a, ranges, v, &a->hard_used_locs[bit],
- &a->hard_point_visits))
- return 1;
- }
- return 0;
-}
-
-static void alloc_mark_group_hard_loc(Func* f, OptAllocator* a,
- const OptLiveRangeSet* ranges, PReg root,
- u32 bit) {
- if (bit >= a->hard_loc_bits) return;
+/* Compute conflict_locs = union of used_locs[j] for j in every live range
+ * point of every PReg in `root`'s coalesce group. */
+static void alloc_compute_group_conflicts(Func* f, OptAllocator* a,
+ const OptLiveRangeSet* ranges,
+ PReg root) {
+ for (u32 w = 0; w < a->loc_words; ++w) a->conflict_locs[w] = 0;
for (PReg v = 1; v < opt_reg_count(f); ++v) {
if (ranges->first_range_by_preg[v] == OPT_RANGE_NONE) continue;
if (!alloc_group_member(f, root, v)) continue;
- alloc_mark_vec(f, a, ranges, v, &a->hard_used_locs[bit],
- &a->hard_mark_points);
- }
-}
-
-static void alloc_grow_stack_locs(Func* f, OptAllocator* a, u32 need_slots) {
- if (need_slots <= a->stack_slot_cap) return;
- u32 ncap = a->stack_slot_cap ? a->stack_slot_cap * 2u : 16u;
- while (ncap < need_slots) ncap *= 2u;
- FrameSlot* ns = arena_array(f->arena, FrameSlot, ncap);
- AllocIntervalVec* ni = arena_zarray(f->arena, AllocIntervalVec, ncap);
- u32* ne = arena_zarray(f->arena, u32, ncap);
- if (a->stack_slots) {
- memcpy(ns, a->stack_slots, sizeof(a->stack_slots[0]) * a->stack_slot_count);
- memcpy(ni, a->stack_used_locs,
- sizeof(a->stack_used_locs[0]) * a->stack_slot_count);
- memcpy(ne, a->stack_slot_end,
- sizeof(a->stack_slot_end[0]) * a->stack_slot_count);
+ for (u32 ri = ranges->first_range_by_preg[v]; ri != OPT_RANGE_NONE;
+ ri = ranges->ranges[ri].next) {
+ const OptLiveRange* lr = &ranges->ranges[ri];
+ u32 end = lr->end < a->point_count ? lr->end : a->point_count;
+ for (u32 j = lr->start; j < end; ++j) {
+ ++a->hard_point_visits;
+ const u64* row = used_locs_row(a, j);
+ for (u32 w = 0; w < a->loc_words; ++w) {
+ a->conflict_locs[w] |= row[w];
+ ++a->hard_word_ors;
+ }
+ }
+ }
}
- a->stack_slots = ns;
- a->stack_used_locs = ni;
- a->stack_slot_end = ne;
- a->stack_slot_cap = ncap;
}
-static void alloc_mark_group_stack_loc(Func* f, OptAllocator* a,
- const OptLiveRangeSet* ranges, PReg root,
- u32 stack_idx) {
- if (stack_idx >= a->stack_slot_count) return;
+/* Mark `loc_bit` as occupied at every point covered by `root`'s group's
+ * live ranges. */
+static void alloc_mark_group_loc(Func* f, OptAllocator* a,
+ const OptLiveRangeSet* ranges, PReg root,
+ u32 loc_bit) {
+ u32 w = loc_bit / 64u;
+ u64 mask = 1ull << (loc_bit % 64u);
for (PReg v = 1; v < opt_reg_count(f); ++v) {
if (ranges->first_range_by_preg[v] == OPT_RANGE_NONE) continue;
if (!alloc_group_member(f, root, v)) continue;
- alloc_mark_vec(f, a, ranges, v, &a->stack_used_locs[stack_idx],
- &a->stack_mark_points);
- }
-}
-
-static int stack_slot_heap_less(const OptAllocator* a, u32 lhs, u32 rhs) {
- u32 le = lhs < a->stack_slot_count ? a->stack_slot_end[lhs] : 0;
- u32 re = rhs < a->stack_slot_count ? a->stack_slot_end[rhs] : 0;
- if (le != re) return le < re;
- return lhs < rhs;
-}
-
-static void alloc_grow_active_stack(Func* f, OptAllocator* a) {
- if (a->nactive_stack_slots < a->active_stack_cap) return;
- u32 ncap = a->active_stack_cap ? a->active_stack_cap * 2u : 16u;
- u32* ns = arena_array(f->arena, u32, ncap);
- if (a->active_stack_slots)
- memcpy(ns, a->active_stack_slots,
- sizeof(a->active_stack_slots[0]) * a->nactive_stack_slots);
- a->active_stack_slots = ns;
- a->active_stack_cap = ncap;
-}
-
-static void stack_active_push(Func* f, OptAllocator* a, u32 slot) {
- alloc_grow_active_stack(f, a);
- u32 i = a->nactive_stack_slots++;
- a->active_stack_slots[i] = slot;
- while (i) {
- u32 p = (i - 1u) / 2u;
- if (stack_slot_heap_less(a, a->active_stack_slots[p],
- a->active_stack_slots[i]))
- break;
- u32 tmp = a->active_stack_slots[p];
- a->active_stack_slots[p] = a->active_stack_slots[i];
- a->active_stack_slots[i] = tmp;
- i = p;
- }
-}
-
-static u32 stack_active_pop(OptAllocator* a) {
- u32 out = a->active_stack_slots[0];
- u32 last = a->active_stack_slots[--a->nactive_stack_slots];
- if (a->nactive_stack_slots) {
- u32 i = 0;
- a->active_stack_slots[0] = last;
- for (;;) {
- u32 l = i * 2u + 1u;
- u32 r = l + 1u;
- u32 best = i;
- if (l < a->nactive_stack_slots &&
- !stack_slot_heap_less(a, a->active_stack_slots[best],
- a->active_stack_slots[l]))
- best = l;
- if (r < a->nactive_stack_slots &&
- !stack_slot_heap_less(a, a->active_stack_slots[best],
- a->active_stack_slots[r]))
- best = r;
- if (best == i) break;
- u32 tmp = a->active_stack_slots[i];
- a->active_stack_slots[i] = a->active_stack_slots[best];
- a->active_stack_slots[best] = tmp;
- i = best;
+ for (u32 ri = ranges->first_range_by_preg[v]; ri != OPT_RANGE_NONE;
+ ri = ranges->ranges[ri].next) {
+ const OptLiveRange* lr = &ranges->ranges[ri];
+ u32 end = lr->end < a->point_count ? lr->end : a->point_count;
+ for (u32 j = lr->start; j < end; ++j) {
+ ++a->hard_mark_points;
+ used_locs_row(a, j)[w] |= mask;
+ }
}
}
- return out;
-}
-
-static void alloc_grow_free_stack(Func* f, OptAllocator* a) {
- if (a->nfree_stack_slots < a->free_stack_cap) return;
- u32 ncap = a->free_stack_cap ? a->free_stack_cap * 2u : 16u;
- u32* ns = arena_array(f->arena, u32, ncap);
- if (a->free_stack_slots)
- memcpy(ns, a->free_stack_slots,
- sizeof(a->free_stack_slots[0]) * a->nfree_stack_slots);
- a->free_stack_slots = ns;
- a->free_stack_cap = ncap;
-}
-
-static void stack_free_push(Func* f, OptAllocator* a, u32 slot) {
- alloc_grow_free_stack(f, a);
- a->free_stack_slots[a->nfree_stack_slots++] = slot;
-}
-
-static void stack_expire_slots(Func* f, OptAllocator* a, u32 first) {
- while (a->nactive_stack_slots) {
- u32 slot = a->active_stack_slots[0];
- if (slot >= a->stack_slot_count || a->stack_slot_end[slot] > first) break;
- (void)stack_active_pop(a);
- stack_free_push(f, a, slot);
- }
-}
-
-static int stack_take_free_compatible(Func* f, OptAllocator* a, PReg v,
- u32* stack_idx_out) {
- for (u32 i = 0; i < a->nfree_stack_slots; ++i) {
- u32 slot = a->free_stack_slots[i];
- if (!spill_slot_compatible(f, a->stack_slots[slot], v)) continue;
- a->free_stack_slots[i] = a->free_stack_slots[--a->nfree_stack_slots];
- *stack_idx_out = slot;
- return 1;
- }
- return 0;
}
static void alloc_assign_group_hard(Func* f, OptAllocator* a,
const OptLiveRangeSet* ranges, PReg root,
Reg r) {
- u32 bit = hard_loc_bit(f->preg_info[root].cls, r);
+ u8 cls = f->preg_info[root].cls;
+ u32 bit = hard_loc_bit(cls, r);
for (PReg v = 1; v < opt_reg_count(f); ++v) {
if (ranges->first_range_by_preg[v] == OPT_RANGE_NONE) continue;
if (!alloc_group_member(f, root, v)) continue;
@@ -859,13 +692,29 @@ static void alloc_assign_group_hard(Func* f, OptAllocator* a,
a->locs[v].hard_reg = r;
a->locs[v].spill_slot = FRAME_SLOT_NONE;
}
- alloc_mark_group_hard_loc(f, a, ranges, root, bit);
+ alloc_mark_group_loc(f, a, ranges, root, bit);
+ if (bit < a->hard_loc_bits) a->hard_open[bit] = 1;
}
-static void alloc_assign_group_stack_slot(Func* f, OptAllocator* a,
- const OptLiveRangeSet* ranges,
- PReg root, u32 stack_idx,
- u32 group_last) {
+static void alloc_assign_group_stack(Func* f, OptAllocator* a,
+ const OptLiveRangeSet* ranges, PReg root) {
+ /* Try to reuse an existing stack slot whose bit is clear in conflict_locs
+ * and whose frame slot is compatible. The conflict_locs scratch must
+ * already be populated for `root` by the caller. */
+ u32 stack_idx = (u32)~0u;
+ for (u32 k = 0; k < a->stack_slot_count; ++k) {
+ u32 bit = a->hard_loc_bits + k;
+ if (loc_bit_in_conflict(a->conflict_locs, bit)) continue;
+ if (!spill_slot_compatible(f, a->stack_slots[k], root)) continue;
+ stack_idx = k;
+ break;
+ }
+ if (stack_idx == (u32)~0u) {
+ FrameSlot fs = spill_slot_for(f, root);
+ stack_idx = alloc_alloc_stack_slot(f, a, fs);
+ /* alloc_alloc_stack_slot may have widened a->loc_words: refresh
+ * conflict_locs (callers don't reuse it after this). */
+ }
FrameSlot slot = a->stack_slots[stack_idx];
for (PReg v = 1; v < opt_reg_count(f); ++v) {
if (ranges->first_range_by_preg[v] == OPT_RANGE_NONE) continue;
@@ -878,157 +727,57 @@ static void alloc_assign_group_stack_slot(Func* f, OptAllocator* a,
a->locs[v].hard_reg = REG_NONE;
a->locs[v].spill_slot = slot;
}
- alloc_mark_group_stack_loc(f, a, ranges, root, stack_idx);
- a->stack_slot_end[stack_idx] = group_last;
- stack_active_push(f, a, stack_idx);
-}
-
-static void alloc_assign_group_stack(Func* f, OptAllocator* a,
- const OptLiveRangeSet* ranges, PReg root,
- const OptAllocGroupInfo* gi) {
- u32 stack_idx = 0;
- stack_expire_slots(f, a, gi->first);
- if (!stack_take_free_compatible(f, a, root, &stack_idx)) {
- FrameSlot slot = spill_slot_for(f, root);
- alloc_grow_stack_locs(f, a, a->stack_slot_count + 1u);
- stack_idx = a->stack_slot_count++;
- a->stack_slots[stack_idx] = slot;
- }
- alloc_assign_group_stack_slot(f, a, ranges, root, stack_idx, gi->last);
-}
-
-static void split_segment_grow(Func* f) {
- if (f->opt_nalloc_segments < f->opt_alloc_segments_cap) return;
- u32 ncap = f->opt_alloc_segments_cap ? f->opt_alloc_segments_cap * 2u : 32u;
- OptAllocSegment* ns = arena_array(f->arena, OptAllocSegment, ncap);
- if (f->opt_alloc_segments)
- memcpy(ns, f->opt_alloc_segments,
- sizeof(f->opt_alloc_segments[0]) * f->opt_nalloc_segments);
- f->opt_alloc_segments = ns;
- f->opt_alloc_segments_cap = ncap;
-}
-
-static void split_segment_push(Func* f, PReg v, const OptLiveRange* lr,
- u8 loc_kind, Reg hard_reg,
- FrameSlot spill_home) {
- split_segment_grow(f);
- u32 idx = f->opt_nalloc_segments++;
- OptAllocSegment* s = &f->opt_alloc_segments[idx];
- memset(s, 0, sizeof *s);
- s->start = lr->raw_start;
- s->end = lr->raw_end;
- s->block = lr->block;
- s->loc_kind = loc_kind;
- s->cls = f->preg_info[v].cls;
- s->hard_reg = hard_reg;
- s->spill_slot = loc_kind == OPT_LOC_STACK ? spill_home : FRAME_SLOT_NONE;
- s->spill_home = spill_home;
- s->reload_at_start = loc_kind == OPT_LOC_HARD ? 1u : 0u;
- s->store_at_end = loc_kind == OPT_LOC_HARD ? 1u : 0u;
- s->next = f->opt_first_segment_by_preg[v];
- f->opt_first_segment_by_preg[v] = idx;
-}
-
-static int alloc_range_hard_conflicts(OptAllocator* a, const OptLiveRange* lr,
- u32 bit) {
- if (bit >= a->hard_loc_bits) return 1;
- u32 end = lr->end < a->point_count ? lr->end : a->point_count;
- ++a->hard_point_visits;
- if (lr->start >= end) return 0;
- return alloc_interval_overlaps(&a->hard_used_locs[bit], lr->start, end);
-}
-
-static void alloc_mark_range_hard(Func* f, OptAllocator* a,
- const OptLiveRange* lr, u32 bit) {
- if (bit >= a->hard_loc_bits) return;
- u32 end = lr->end < a->point_count ? lr->end : a->point_count;
- if (lr->start >= end) return;
- ++a->hard_mark_points;
- alloc_interval_insert(f, &a->hard_used_locs[bit], lr->start, end);
-}
-
-static int alloc_try_split_singleton(Func* f, OptAllocator* a,
- const OptLiveRangeSet* ranges, PReg v) {
- if (f->opt_coalesce_parent &&
- f->opt_coalesce_size[alloc_coalesce_root(f, v)] > 1)
- return 0;
- if (ranges->first_range_by_preg[v] == OPT_RANGE_NONE) return 0;
- if (f->preg_info[v].live_across_call_freq) return 0;
- u32 nranges = 0;
- for (u32 ri = ranges->first_range_by_preg[v]; ri != OPT_RANGE_NONE;
- ri = ranges->ranges[ri].next)
- ++nranges;
- if (nranges < 2) return 0;
- if (!f->opt_first_segment_by_preg) {
- f->opt_first_segment_by_preg =
- arena_array(f->arena, u32, opt_reg_count(f) ? opt_reg_count(f) : 1u);
- memset(f->opt_first_segment_by_preg, 0xff,
- sizeof(f->opt_first_segment_by_preg[0]) *
- (opt_reg_count(f) ? opt_reg_count(f) : 1u));
- }
-
- OptPRegInfo* vi = &f->preg_info[v];
- FrameSlot home = spill_slot_for(f, v);
- vi->alloc_kind = OPT_ALLOC_SPLIT;
- vi->hard_reg = REG_NONE;
- vi->spill_slot = home;
- a->locs[v].kind = OPT_LOC_STACK;
- a->locs[v].cls = vi->cls;
- a->locs[v].hard_reg = REG_NONE;
- a->locs[v].spill_slot = home;
-
- for (u32 ri = ranges->first_range_by_preg[v]; ri != OPT_RANGE_NONE;
- ri = ranges->ranges[ri].next) {
- const OptLiveRange* lr = &ranges->ranges[ri];
- int found = 0;
- Reg best = REG_NONE;
- u32 best_score = 0xffffffffu;
- for (u32 r = 0; r < f->opt_hard_reg_count[vi->cls]; ++r) {
- Reg hr = f->opt_hard_regs[vi->cls][r];
- if (hr >= 32) continue;
- if (vi->forbidden_hard_regs & (1u << hr)) continue;
- u32 bit = hard_loc_bit(vi->cls, hr);
- if (alloc_range_hard_conflicts(a, lr, bit)) continue;
- u32 score = hard_reg_alloc_score(f, a, vi, hr);
- if (!found || score < best_score) {
- found = 1;
- best = hr;
- best_score = score;
+ u32 bit = a->hard_loc_bits + stack_idx;
+ u32 w = bit / 64u;
+ u64 mask = 1ull << (bit % 64u);
+ for (PReg v = 1; v < opt_reg_count(f); ++v) {
+ if (ranges->first_range_by_preg[v] == OPT_RANGE_NONE) continue;
+ if (!alloc_group_member(f, root, v)) continue;
+ for (u32 ri = ranges->first_range_by_preg[v]; ri != OPT_RANGE_NONE;
+ ri = ranges->ranges[ri].next) {
+ const OptLiveRange* lr = &ranges->ranges[ri];
+ u32 end = lr->end < a->point_count ? lr->end : a->point_count;
+ for (u32 j = lr->start; j < end; ++j) {
+ ++a->stack_mark_points;
+ used_locs_row(a, j)[w] |= mask;
}
}
- if (found) {
- alloc_mark_range_hard(f, a, lr, hard_loc_bit(vi->cls, best));
- split_segment_push(f, v, lr, OPT_LOC_HARD, best, home);
- } else {
- split_segment_push(f, v, lr, OPT_LOC_STACK, REG_NONE, home);
- }
}
- return 1;
+}
+
+static int alloc_group_conflicts_bit(const OptAllocator* a, u32 bit) {
+ if (bit / 64u >= a->loc_words) return 1;
+ return loc_bit_in_conflict(a->conflict_locs, bit);
}
static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges,
OptAllocator* a, int allow_live_range_split) {
+ (void)allow_live_range_split; /* live-range splitting deferred per
+ doc/OPT_PERF.md plan; the parameter is
+ passed through for ABI compatibility. */
memset(a, 0, sizeof *a);
a->point_count = ranges->point_count ? ranges->point_count : 1u;
a->hard_loc_bits = OPT_REG_CLASSES * 32u;
+ a->loc_words = alloc_loc_words_for_bits(a->hard_loc_bits);
+ a->used_locs =
+ arena_zarray(f->arena, u64, (u64)a->point_count * a->loc_words);
+ a->conflict_locs = arena_zarray(f->arena, u64, a->loc_words);
+ a->locs =
+ arena_zarray(f->arena, OptLoc, opt_reg_count(f) ? opt_reg_count(f) : 1u);
+ a->hard_open = arena_zarray(f->arena, u8, a->hard_loc_bits);
+ a->stack_slots = NULL;
+ a->stack_slot_count = 0;
+ a->stack_slot_cap = 0;
+
+ /* Build candidate list: every coalesce-root PReg that has live ranges. */
u32 ncands = 0;
for (PReg v = 1; v < opt_reg_count(f); ++v) {
if (ranges->first_range_by_preg[v] == OPT_RANGE_NONE) continue;
if (alloc_coalesce_root(f, v) != v) continue;
++ncands;
}
- a->locs =
- arena_zarray(f->arena, OptLoc, opt_reg_count(f) ? opt_reg_count(f) : 1u);
- a->hard_used_locs = arena_zarray(f->arena, AllocIntervalVec,
- a->hard_loc_bits ? a->hard_loc_bits : 1u);
- a->stack_slots = NULL;
- a->stack_slot_cap = 0;
- a->stack_used_locs = NULL;
-
OptAllocCandidate* cands =
arena_array(f->arena, OptAllocCandidate, ncands ? ncands : 1u);
- OptSpillCandidate* spills =
- arena_array(f->arena, OptSpillCandidate, ncands ? ncands : 1u);
u32 n = 0;
for (PReg v = 1; v < opt_reg_count(f); ++v) {
if (ranges->first_range_by_preg[v] == OPT_RANGE_NONE) continue;
@@ -1045,13 +794,14 @@ static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges,
}
alloc_sort_candidates(cands, n);
- u32 nspills = 0;
for (u32 i = 0; i < n; ++i) {
PReg v = cands[i].v;
OptAllocGroupInfo gi;
alloc_group_info(f, ranges, v, &gi);
OptPRegInfo* vi = &f->preg_info[v];
u8 cls = gi.cls;
+ alloc_compute_group_conflicts(f, a, ranges, v);
+
if (gi.tied_hard_reg >= 0) {
Reg fixed = (Reg)gi.tied_hard_reg;
if (!hard_available(f, cls, fixed)) {
@@ -1067,8 +817,8 @@ static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges,
"opt regalloc: fixed hard reg %u is clobbered",
(unsigned)fixed);
}
- if (fixed >= 32 || alloc_group_hard_conflicts(f, a, ranges, v,
- hard_loc_bit(cls, fixed))) {
+ u32 bit = hard_loc_bit(cls, fixed);
+ if (fixed >= 32 || alloc_group_conflicts_bit(a, bit)) {
SrcLoc loc = {0, 0, 0};
compiler_panic(f->c, loc, "opt regalloc: conflicting fixed hard reg %u",
(unsigned)fixed);
@@ -1084,8 +834,8 @@ static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges,
Reg hr = f->opt_hard_regs[cls][r];
if (hr >= 32) continue;
if (gi.forbidden_hard_regs & (1u << hr)) continue;
- if (alloc_group_hard_conflicts(f, a, ranges, v, hard_loc_bit(cls, hr)))
- continue;
+ u32 bit = hard_loc_bit(cls, hr);
+ if (alloc_group_conflicts_bit(a, bit)) continue;
u32 score = hard_reg_alloc_score(f, a, vi, hr);
if (!found || score < best_score) {
found = 1;
@@ -1095,31 +845,19 @@ static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges,
}
if (found) {
alloc_assign_group_hard(f, a, ranges, v, best);
+ } else {
+ alloc_assign_group_stack(f, a, ranges, v);
}
- if (!found) {
- if (allow_live_range_split && alloc_try_split_singleton(f, a, ranges, v))
- continue;
- spills[nspills].v = v;
- spills[nspills].first = gi.first;
- spills[nspills].last = gi.last;
- ++nspills;
- }
- }
- alloc_sort_spills(spills, nspills);
- for (u32 i = 0; i < nspills; ++i) {
- OptAllocGroupInfo gi;
- alloc_group_info(f, ranges, spills[i].v, &gi);
- alloc_assign_group_stack(f, a, ranges, spills[i].v, &gi);
}
- a->hard_loc_words =
- alloc_interval_storage_words(a->hard_used_locs, a->hard_loc_bits);
- a->stack_loc_words =
- alloc_interval_storage_words(a->stack_used_locs, a->stack_slot_count);
- f->opt_alloc_hard_loc_words = a->hard_loc_words;
- f->opt_alloc_stack_loc_words = a->stack_loc_words;
+
+ /* Report storage metrics in u64 words (used_locs is the only bitmap). */
+ u32 total_words = a->point_count * a->loc_words;
+ u32 hard_words = alloc_loc_words_for_bits(a->hard_loc_bits) * a->point_count;
+ if (hard_words > total_words) hard_words = total_words;
+ f->opt_alloc_hard_loc_words = hard_words;
+ f->opt_alloc_stack_loc_words = total_words - hard_words;
f->opt_alloc_stack_slots = a->stack_slot_count;
- f->opt_used_loc_words =
- f->opt_alloc_hard_loc_words + f->opt_alloc_stack_loc_words;
+ f->opt_used_loc_words = total_words;
f->opt_alloc_hard_point_visits = a->hard_point_visits;
f->opt_alloc_stack_point_visits = a->stack_point_visits;
f->opt_alloc_hard_word_ors = a->hard_word_ors;
@@ -1835,11 +1573,10 @@ void opt_regalloc(Func* f, int allow_live_range_split) {
opt_init_preg_info_from_ranges(f, &ranges);
opt_apply_asm_constraints_from_live(f, &live);
apply_param_incoming_register_hazards(f);
- /* Coalesce runs at both O1 and O2. IRF_NO_COALESCE protects SSA edge copies
- * inserted by opt_make_conventional_ssa at O2; at O1 no such copies exist.
- * ranges_overlap_kind treats two or more unit overlaps between dst and src
- * as a real conflict, so multi-def pseudos at O1 are not unsafely merged. */
- opt_coalesce_ranges(f, &ranges);
+ /* MIR coalesces only at -O2 (mir-gen.c:9431); match that here. At O1 the
+ * point-bitmap allocator emits copies through the natural conflict-free
+ * path. IRF_NO_COALESCE protects SSA edge copies inserted at O2. */
+ if (allow_live_range_split) opt_coalesce_ranges(f, &ranges);
metrics_count(f->c, "opt.live_words", f->opt_live_words);
metrics_count(f->c, "opt.ranges", ranges.nranges);
metrics_count(f->c, "opt.range_points", ranges.point_count);
diff --git a/test/opt/opt_test.c b/test/opt/opt_test.c
@@ -599,13 +599,6 @@ static int count_op(Func* f, IROp op) {
return n;
}
-static int block_contains_op(Func* f, u32 b, IROp op) {
- if (!f || b >= f->nblocks) return 0;
- for (u32 i = 0; i < f->blocks[b].ninsts; ++i)
- if ((IROp)f->blocks[b].insts[i].op == op) return 1;
- return 0;
-}
-
static Inst* def_inst(Func* f, Val v) {
if (!f || v == VAL_NONE || v >= f->nvals) return NULL;
u32 b = f->val_def_block[v];
@@ -4277,7 +4270,10 @@ static void opt_o2_coalesces_nonconflicting_copy(void) {
tc_fini(&tc);
}
-static void opt_o1_coalesces_simple_copy(void) {
+static void opt_o1_skips_coalesce(void) {
+ /* O1 matches MIR's pipeline (mir-gen.c:9431): coalescing runs only at
+ * optimize_level >= 2. At O1 the point-bitmap allocator emits the copy
+ * through the normal path without merging operands. */
TestCtx tc;
tc_init(&tc);
MockCGTarget mock;
@@ -4297,8 +4293,8 @@ static void opt_o1_coalesces_simple_copy(void) {
opt_build_loop_tree(f);
opt_regalloc(f, 0);
- EXPECT(f->opt_coalesce_candidates == 1 && f->opt_coalesce_merges == 1,
- "O1 regalloc should coalesce a simple copy");
+ EXPECT(f->opt_coalesce_candidates == 0 && f->opt_coalesce_merges == 0,
+ "O1 regalloc should not run coalesce");
tc_fini(&tc);
}
@@ -4374,7 +4370,10 @@ static void opt_o2_refuses_incompatible_copy_coalesce(void) {
tc_fini(&tc);
}
-static void opt_o2_splits_singleton_when_whole_alloc_fails(void) {
+static void opt_o2_spills_singleton_when_whole_alloc_fails(void) {
+ /* Live-range splitting is deferred per doc/OPT_PERF.md plan. With one hard
+ * reg pinned and another value live across the pinned use, the allocator
+ * spills the unpinned value whole instead of producing OPT_ALLOC_SPLIT. */
TestCtx tc;
tc_init(&tc);
MockCGTarget mock;
@@ -4401,23 +4400,16 @@ static void opt_o2_splits_singleton_when_whole_alloc_fails(void) {
EXPECT(f->preg_info[pinned].alloc_kind == OPT_ALLOC_HARD,
"pinned value should keep the only hard register");
- EXPECT(f->preg_info[v].alloc_kind == OPT_ALLOC_SPLIT,
- "singleton value should split instead of whole spilling");
- int saw_hard = 0;
- int saw_stack = 0;
- for (u32 si = f->opt_first_segment_by_preg[v]; si != OPT_RANGE_NONE;
- si = f->opt_alloc_segments[si].next) {
- if (f->opt_alloc_segments[si].loc_kind == OPT_LOC_HARD) saw_hard = 1;
- if (f->opt_alloc_segments[si].loc_kind == OPT_LOC_STACK) saw_stack = 1;
- }
- EXPECT(saw_hard && saw_stack,
- "split value should have both hard and spill-home segments");
+ EXPECT(f->preg_info[v].alloc_kind == OPT_ALLOC_SPILL,
+ "without splitting, the conflicting value should whole-spill");
EXPECT(f->preg_info[v].spill_slot != FRAME_SLOT_NONE,
- "split value should have a canonical spill home");
+ "spilled value should have a stack slot");
tc_fini(&tc);
}
-static void opt_o2_split_materializes_on_critical_edge(void) {
+static void opt_o2_does_not_split_critical_edge(void) {
+ /* Live-range splitting (and the associated critical-edge materialization)
+ * is deferred. The unpinned value whole-spills; no edge blocks are added. */
TestCtx tc;
tc_init(&tc);
MockCGTarget mock;
@@ -4450,23 +4442,11 @@ static void opt_o2_split_materializes_on_critical_edge(void) {
u32 original_blocks = f->nblocks;
opt_regalloc(f, 1);
- EXPECT(f->preg_info[v].alloc_kind == OPT_ALLOC_SPLIT,
- "edge fixture should split v%u", (unsigned)v);
- EXPECT(f->nblocks > original_blocks,
- "critical edge materialization should add edge blocks");
- u32 edge = f->blocks[entry].succ[0];
- EXPECT(edge != join && edge < f->nblocks && f->blocks[edge].nsucc == 1 &&
- f->blocks[edge].succ[0] == join,
- "entry->join critical edge should be split");
- if (edge != join && edge < f->nblocks) {
- EXPECT(block_contains_op(f, edge, IR_LOAD),
- "split edge block should hold the reload");
- EXPECT(block_contains_op(f, edge, IR_BR),
- "split edge block should still branch to join");
- }
- EXPECT(!block_contains_op(f, join, IR_LOAD),
- "join block should not receive an ambiguous reload");
- opt_verify(f, "test-split-critical-edge-materialization");
+ EXPECT(f->preg_info[v].alloc_kind != OPT_ALLOC_SPLIT,
+ "splitting is deferred; v%u should not be OPT_ALLOC_SPLIT",
+ (unsigned)v);
+ EXPECT(f->nblocks == original_blocks,
+ "no edge materialization expected without splitting");
tc_fini(&tc);
}
@@ -6647,11 +6627,11 @@ int main(void) {
opt_range_overlap_class();
opt_regalloc_priority();
opt_o2_coalesces_nonconflicting_copy();
- opt_o1_coalesces_simple_copy();
+ opt_o1_skips_coalesce();
opt_o2_refuses_overlapping_copy_coalesce();
opt_o2_refuses_incompatible_copy_coalesce();
- opt_o2_splits_singleton_when_whole_alloc_fails();
- opt_o2_split_materializes_on_critical_edge();
+ opt_o2_spills_singleton_when_whole_alloc_fails();
+ opt_o2_does_not_split_critical_edge();
opt_o1_does_not_split_spill_edges();
opt_range_regalloc_no_conflicts_and_stack_reuse();
opt_stack_spill_assignment_avoids_quadratic_probe();