Track O1 MIR-shape metrics - kit

commit 37df47e0fdbfb2e47fd439d37c2c297cc9b29395
parent 13d1a28d0b81cc43eb901962ef049aa68b8b2569
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Thu, 14 May 2026 21:32:12 -0700

Track O1 MIR-shape metrics

Diffstat:
M doc/MIR_RA_REPORT.md  | 141 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------------
M doc/PERF.md  | 127 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------
M src/opt/ir.h  | 9 +++++++++
M src/opt/opt.c  | 33 ++++++++++++++++++++++++---------
M src/opt/opt.h  | 6 ++++++
M src/opt/pass_live.c  | 170 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------
M src/opt/pass_lower.c  | 155 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------
M test/opt/phase0_guardrails.sh  | 17 ++++++++++++++++-

8 files changed, 547 insertions(+), 111 deletions(-)
diff --git a/doc/MIR_RA_REPORT.md b/doc/MIR_RA_REPORT.md
@@ -26,10 +26,10 @@ sets, converts them into compressed live ranges, assigns physical locations by
 checking occupied hard-register and stack locations at live-range program
 points, then rewrites instructions.
 
-cfree's current O1 path is conflict-matrix based. `opt_live_info` builds a
-dense `nvals x words` pseudo interference matrix, stores per-instruction full
-`live_after` bitsets, and scans `1..nvals` in hot loops. `doc/PERF.md` shows
-this structure is the scaling problem for one large function.
+cfree's original O1 path was conflict-matrix based. `opt_live_info` built a
+dense `nvals x words` pseudo interference matrix, stored per-instruction full
+`live_after` bitsets, and scanned `1..nvals` in hot loops. `doc/PERF.md` shows
+why that structure was the scaling problem for one large function.
 
 The main recommendation is to reshape cfree around MIR's allocation model:
 
@@ -42,6 +42,15 @@ The main recommendation is to reshape cfree around MIR's allocation model:
 This is a structural change, not an incremental optimization of the current
 matrix.
 
+As of the current O1 implementation, the dense pseudo-conflict allocator is no
+longer the normal path. O1 now has pass-local block liveness, compressed live
+ranges, split hard-register and stack-slot occupancy, assignment-map rewrite,
+and metrics for live/range/allocation/rewrite event traffic. The remaining work
+is not "delete the old matrix"; it is to make the new path more MIR-shaped by
+removing the remaining high-watermark bitmap traffic, reducing live-range build
+cost on spill-heavy wide functions, and replacing repeated range-point
+occupancy walks when they become the dominant bucket.
+
 ## MIR Container Layer
 
 MIR uses small generic data-structure building blocks throughout `mir-gen.c`.
@@ -352,29 +361,37 @@ cfree should follow this model:
 - optional move-only matrix in O2 coalescing
 - hard cap and metric counters for matrix size
 
-## Current cfree Mismatch
+## cfree Mismatch Status
 
-The current O1 implementation has these structural mismatches:
+The initial O1 implementation had five major structural mismatches:
 
-1. `opt_live_info` is overloaded.
-   It computes block liveness, per-value summaries, per-instruction live-after,
+1. `opt_live_info` was overloaded.
+   It computed block liveness, per-value summaries, per-instruction live-after,
    conflict matrix, call live-across summaries, and asm constraints in one pass.
-
-2. DDE pays for full allocation analysis.
-   The pre-DDE liveness call builds data needed only by regalloc. This is the
-   first obvious waste, but fixing only that is still incremental.
-
-3. Dense conflicts are the allocator's core dependency.
-   `Func.val_conflicts` is `nvals x opt_conflict_words`. This should not be a
-   fundamental `Func` field in the MIR-like design.
-
-4. Hot loops scan all vals.
-   Several loops walk `1..nvals` inside instruction processing. MIR mostly
-   iterates instruction operands or set bits.
-
-5. `Block.live_after` stores a full live bitmap per instruction.
-   This multiplies memory by instruction count and value count. MIR derives the
-   information needed for rewrite with backward walks.
+   This has been split into pass-local liveness, range construction,
+   allocation, and rewrite state.
+
+2. DDE paid for full allocation analysis.
+   Pre-DDE no longer builds allocation ranges or conflicts. It still runs block
+   liveness before DDE, and O1 runs block liveness again for regalloc after DDE
+   changes the instruction stream. Avoiding or reusing that second pass is a
+   remaining O1 cleanup item, not the main current bottleneck.
+
+3. Dense conflicts were the allocator's core dependency.
+   The normal O1 allocator no longer allocates or consumes `Func.val_conflicts`.
+   Dense matrices should stay out of O1 and return only as bounded, optional
+   move-coalescing support in O2.
+
+4. Hot loops scanned all vals.
+   Many of these loops now use operand-reference events, set-bit iteration, or
+   touched-value/generation lists. Remaining work is to remove high-watermark
+   bitmap copies and any finalization paths that still scale with the highest
+   value id instead of live or touched values.
+
+5. `Block.live_after` stored a full live bitmap per instruction.
+   Normal O1 no longer persists per-instruction full live-after storage.
+   Rewrite and DDE walk backward with rolling live state, but rewrite still
+   shows superlinear live-word traffic on spill-heavy wide functions.
 
 ## Proposed cfree Structural Target
 
@@ -528,15 +545,30 @@ a clean fast implementation over shims for old pass boundaries.
 - [x] Add metrics names for the new shape before deleting old counters:
   - [x] `opt.live.blocks`
   - [x] `opt.live.active_words`
+  - [x] `opt.live.bitset_words_touched`
+  - [x] `opt.live.dataflow_iterations`
+  - [x] `opt.live.dataflow_block_visits`
   - [x] `opt.ranges`
   - [x] `opt.range_points`
+  - [x] `opt.range.point_visits`
+  - [x] `opt.range.value_scans`
+  - [x] `opt.range.live_words_touched`
   - [x] `opt.alloc.used_loc_words`
   - [x] `opt.alloc.hard_loc_words`
   - [x] `opt.alloc.stack_loc_words`
   - [x] `opt.alloc.stack_slots`
+  - [x] `opt.alloc.hard_point_visits`
+  - [x] `opt.alloc.stack_point_visits`
+  - [x] `opt.alloc.hard_word_ors`
+  - [x] `opt.alloc.stack_word_ors`
+  - [x] `opt.alloc.hard_mark_points`
+  - [x] `opt.alloc.stack_mark_points`
   - [x] `opt.alloc.spills`
+  - [x] `opt.dde.live_words_touched`
   - [x] `opt.rewrite.reloads`
   - [x] `opt.rewrite.stores`
+  - [x] `opt.rewrite.inserted_insts`
+  - [x] `opt.rewrite.live_words_touched`
 
 Implemented by `test/opt/phase0_guardrails.sh`, wired into `make test-opt`.
 The script runs generated O0/O1 guardrail programs for the stress shapes above
@@ -730,7 +762,60 @@ tables. Hard occupancy remains `OPT_REG_CLASSES * 32` bits per point; stack
 occupancy grows only when a new spill slot crosses a word boundary. The sparse
 interval form is deferred until timing data shows it is needed.
 
-### Phase 8: O2 Coalescing
+### Phase 8: O1 MIR-Shape Cleanup
+
+The range allocator has the right high-level shape, but current timings show
+there is still O1 work to do before adding O2 coalescing or splitting. The next
+areas of focus are the remaining places where cfree is less MIR-shaped than it
+looks at the pass boundary: fixed-width live traffic, point-by-point occupancy
+loops, and rewrite liveness work.
+
+Implemented in the current cleanup pass:
+
+- [x] Emit event counters for live, range, allocation, DDE, and rewrite work.
+- [x] Remove the duplicate pre-DDE live-range build from the normal O1 path.
+- [x] Replace DDE per-instruction full-width use/def bitsets with
+      operand-reference events.
+- [x] Replace rewrite per-instruction full-width use/def bitsets with
+      operand-reference events.
+- [x] Track open/touched range values with generation marks during live-range
+      construction.
+- [x] Avoid assigning caller-saved registers to non-tied values that are known
+      to be live across calls.
+- [x] Update `doc/PERF.md` with current O1 scaling data and counter ratios.
+
+Remaining O1 focus:
+
+- [ ] Make `OptBitset` operations truly MIR-like: track active length, trim
+      trailing zero words, and bound copy/ior/and-not work by active words
+      rather than the function's maximum value id.
+- [ ] Reduce `opt.live_ranges.regalloc` on spill-heavy wide functions.
+      Current `opt.range.value_scans` is linear, so investigate live-word
+      touches, range insertion/compression, final per-value summaries, and
+      live-across-call bookkeeping in `src/opt/pass_live.c`.
+- [ ] Reduce `opt.rewrite.live_words_touched`. Rewrite no longer stores
+      `Block.live_after`, but its rolling live set still grows faster than
+      input on spill-heavy wide functions.
+- [ ] Replace repeated point-index occupancy OR/mark loops with per-location
+      interval or event state once live-range/rewrite traffic is under control.
+      `opt.alloc.*point_visits` is linear now, but the absolute count is high.
+- [ ] Decide whether the second O1 block-liveness pass can be avoided after
+      DDE, either by reusing/revalidating liveness or by having DDE produce
+      enough local update information.
+- [ ] Keep split hard-register and stack-slot occupancy. It fixed the old
+      `points * candidates` stack growth and should remain the O1 baseline.
+
+Exit criteria:
+
+- [ ] Straight-line and table-shaped stress inputs remain near-linear at the
+      512-to-1024 step.
+- [ ] Spill-pressure `opt.live_ranges.regalloc` and `opt.regalloc` stop
+      growing near `3x` on each input doubling.
+- [ ] `opt.rewrite.live_words_touched` no longer grows superlinearly on the
+      spill ladder.
+- [ ] O1 remains free of dense pseudo-conflict construction on the normal path.
+
+### Phase 9: O2 Coalescing
 
 - [ ] Collect move candidates after machinize.
 - [ ] Add move-only liveness using `scan_vars`-style restricted variables.
@@ -748,7 +833,7 @@ Exit criteria:
 - [ ] Coalescing is optional and does not affect O1 compile-time shape.
 - [ ] Matrix memory is bounded and reported.
 
-### Phase 9: O2 Splitting and Spill Placement
+### Phase 10: O2 Splitting and Spill Placement
 
 - [ ] Add split-capable occupancy state similar to MIR's `busy_used_locs`.
 - [ ] Identify profitable live-range gaps.
@@ -762,11 +847,11 @@ Exit criteria:
 - [ ] O2 can reduce spills without changing O1 assignment behavior.
 - [ ] Edge/block placement tests cover diamonds, loops, and critical edges.
 
-### Phase 10: Cleanup and Documentation
+### Phase 11: Cleanup and Documentation
 
 - [ ] Update `doc/OPT.md` with the final implemented module names and pass
       order.
-- [ ] Update `doc/PERF.md` with new O1 scaling data.
+- [x] Update `doc/PERF.md` with new O1 scaling data.
 - [ ] Add a short allocator invariant section near the implementation.
 - [ ] Ensure pass dumps are stable enough for targeted regression tests.
 - [ ] Remove obsolete comments that describe the dense conflict allocator as
diff --git a/doc/PERF.md b/doc/PERF.md
@@ -30,11 +30,20 @@ Important counters:
 - `opt.funcs`, `opt.blocks`, `opt.insts`, `opt.vals`
 - `opt.live_words`, `opt.live.blocks`, `opt.live.active_words`
 - `opt.live.block_bytes`, `opt.live.set_bit_scans`
+- `opt.live.bitset_words_touched`, `opt.live.dataflow_iterations`,
+  `opt.live.dataflow_block_visits`
 - `opt.ranges`, `opt.range_points`, `opt.range_raw_points`
+- `opt.range.point_visits`, `opt.range.value_scans`,
+  `opt.range.live_words_touched`
 - `opt.alloc.used_loc_words`, `opt.alloc.spills`
 - `opt.alloc.hard_loc_words`, `opt.alloc.stack_loc_words`,
   `opt.alloc.stack_slots`
-- `opt.rewrite.reloads`, `opt.rewrite.stores`
+- `opt.alloc.hard_point_visits`, `opt.alloc.stack_point_visits`,
+  `opt.alloc.hard_word_ors`, `opt.alloc.stack_word_ors`,
+  `opt.alloc.hard_mark_points`, `opt.alloc.stack_mark_points`
+- `opt.dde.live_words_touched`
+- `opt.rewrite.reloads`, `opt.rewrite.stores`,
+  `opt.rewrite.inserted_insts`, `opt.rewrite.live_words_touched`
 - `opt.conflict_bytes`
 - `link.inputs`, `link.sections`, `link.segments`, `link.syms`, `link.relocs`
 - `jit.master_size`, `jit.nsegments`, `jit.input_section_bytes`,
@@ -681,44 +690,120 @@ Next investigation should start in:
   Both still maintain rolling live bitsets with function-width scratch storage;
   after range/occupancy work, convert these to set-bit or operand-event updates.
 
-## Performance Priorities
+### Current O1 MIR-Shape Rerun
+
+After the follow-up O1 cleanup, the normal path is closer to MIR in four
+important ways:
+
+- pre-DDE no longer builds a duplicate live-range set;
+- DDE and rewrite use operand-reference events instead of materializing
+  per-instruction full-width use/def bitsets;
+- live-range construction tracks open and touched values with generation marks
+  instead of rescanning every value at block boundaries;
+- O1 assignment avoids caller-saved registers for values known to be live
+  across calls when the value is not tied to a hard register.
+
+The table below is from a focused 3-sample p50 rerun over freshly generated
+stress inputs. It is not byte-for-byte comparable to the earlier phase tables,
+but it shows the current bucket shape and the counters that should drive the
+next work.
+
+```text
+case          compile.tu opt.o1.total live_pre dde   live_ranges_reg regalloc range.value_scans alloc.hard_point_visits rewrite.live_words dde.live_words spills
+straight 512     140.387      116.194   17.700 3.159           30.384   63.670             25616                   28689              81672          31398      0
+straight 1024    286.085      234.626   35.394 6.410           62.155  129.088             51216                   57361             261624          62774      0
+table 512        135.180      113.238   17.650 2.934           29.298   61.829             19506                   21058              25156          25152      0
+table 1024       273.757      226.367   35.209 6.007           58.357  123.448             38962                   42050              50244          50240      0
+spill 128         44.702       19.570    0.712 1.018            7.844   12.942             25700                   64252              77762          26306   2048
+spill 256        109.002       43.998    1.327 2.021           20.569   31.053             51300                  128380             257818          52506   4096
+spill 512        300.108      109.833    2.538 3.964           61.995   83.985            102500                  256636             925130         104906   8192
+```
 
-1. Continue reducing wide-function high-watermark bitset work.
-   Phase 8 removed several avoidable dense scans and the candidate-sort
-   quadratic, but the spill-pressure ladder is still superlinear. The next
-   target is persistent rolling-live updates and live/block metrics that still
-   scale with the highest live value id rather than the number of live values.
+The normal straight-line and table-shaped ladders are now essentially linear.
+For 512 to 1024 functions, `opt.o1.total`, `opt.live_ranges.regalloc`,
+`opt.regalloc`, `opt.range.value_scans`, `opt.alloc.hard_point_visits`, and
+`opt.dde.live_words_touched` are all about `2.0x`. The one outlier is
+`opt.rewrite.live_words_touched` on straight-line input, which is `3.20x`,
+although its absolute time is no longer the dominant bucket.
 
-2. Reduce range-point occupancy walks.
-   `hard_loc_words` and `stack_loc_words` scale linearly, but assignment still
-   revisits range points for every candidate. Per-location interval lists or
-   event sweeps should replace repeated point-by-point OR/mark loops on wide
-   functions.
+The spill-pressure ladder is much better in absolute time, but still bends:
+
+```text
+spill 128 -> 256:
+opt.o1.total             19.570 ->  43.998 ms  2.25x
+opt.live_ranges.regalloc  7.844 ->  20.569 ms  2.62x
+opt.regalloc             12.942 ->  31.053 ms  2.40x
+range.value_scans         25700 ->   51300     2.00x
+alloc.hard_point_visits   64252 ->  128380     2.00x
+rewrite.live_words        77762 ->  257818     3.32x
+
+spill 256 -> 512:
+opt.o1.total             43.998 -> 109.833 ms  2.50x
+opt.live_ranges.regalloc 20.569 ->  61.995 ms  3.01x
+opt.regalloc             31.053 ->  83.985 ms  2.70x
+range.value_scans         51300 ->  102500     2.00x
+alloc.hard_point_visits  128380 ->  256636     2.00x
+rewrite.live_words       257818 ->  925130     3.59x
+```
+
+This narrows the next O1 work. DDE is now roughly linear, range value scans are
+linear, and allocation point visits are linear but high. The remaining
+MIR-shape gap is concentrated in `opt.live_ranges.regalloc`, `opt.regalloc`,
+and rewrite live-word traffic on spill-heavy wide functions.
+
+## Performance Priorities
 
-3. Keep the split `used_locs` design.
+1. Finish MIR-like bitset behavior.
+   `OptBitset` hides most fixed-width details, but hot passes still copy or
+   touch `nwords` in several places. The next target is active-length
+   copy/ior/and-not behavior that trims trailing zero words and makes live
+   traffic proportional to the active set instead of the function high-water
+   value id.
+
+2. Reduce `opt.live_ranges.regalloc` on spill-heavy wide functions.
+   The new `opt.range.value_scans` counter is linear, so the remaining
+   `live_ranges_reg` bend is likely in live-word copying, range insertion and
+   sorting/compression, or live-across-call bookkeeping. Start with
+   `opt.range.live_words_touched` and the per-block finalization path in
+   `src/opt/pass_live.c`.
+
+3. Reduce rewrite rolling-live traffic.
+   DDE is now linear, but `opt.rewrite.live_words_touched` still grows faster
+   than input in wide spill tests. Either make the rolling live set operate on
+   active words/set bits, or continue moving call-clobber preservation into
+   allocation constraints so rewrite does less save/restore liveness work.
+
+4. Replace point-index occupancy loops when they become the next bottleneck.
+   `opt.alloc.hard_point_visits` and `opt.alloc.stack_point_visits` are now
+   linear, but their absolute counts are high. MIR's shape suggests
+   per-location interval/event occupancy as the next replacement for repeated
+   point-by-point OR and mark loops once live-range and rewrite traffic are
+   under control.
+
+5. Keep the split `used_locs` design.
    `hard_loc_words` and `stack_loc_words` now scale linearly in both normal and
    spill-pressure ladders. Stack occupancy is demand-sized by allocated stack
    slots, so it is no longer the explanation for the wide-spill superlinear
    timings.
 
-4. Add live-range event metrics.
-   The next timing question is how much work is coming from `instructions *
-   live_words` versus range-point scans and rewrite insertions. Add counters for
-   bitset words touched during live updates, range point visits, assignment
-   point visits, and rewrite inserted instructions.
+6. Decide whether O1 can avoid the second block-liveness pass.
+   O1 still runs block liveness before DDE and again for regalloc because DDE
+   changes the instruction stream. Keep this until the other buckets are
+   smaller, then either reuse/revalidate liveness after DDE or make DDE produce
+   enough local update information to avoid a full rebuild.
 
-5. Avoid whole-function contiguous growth where hot passes scan repeatedly.
+7. Avoid whole-function contiguous growth where hot passes scan repeatedly.
    Large `Val`/block/instruction arrays and dense bit matrices are likely to
    hurt cache locality. Segmented arrays may help by reducing large copies and
    stabilizing addresses, but pass algorithms need to avoid turning segment
    traversal into extra indirection in inner loops.
 
-6. Keep link/JIT lower priority for now.
+8. Keep link/JIT lower priority for now.
    Even at 1024 functions, `link_jit.all` is around `0.5 ms`; copy, reloc, and
    icache flush are small and roughly linear. They are not the bottleneck for
    submit-to-execution latency in these tests.
 
-7. Split `compile.c.parse_codegen` further before changing parser/CG data
+9. Split `compile.c.parse_codegen` further before changing parser/CG data
    structures.
    The non-O1 compile residual is meaningful at large source sizes, but it is
    currently broad. More counters are needed before choosing parser, PP, type,
diff --git a/src/opt/ir.h b/src/opt/ir.h
@@ -327,6 +327,15 @@ typedef struct Func {
   u32 opt_alloc_stack_loc_words;
   u32 opt_alloc_stack_slots;
   u32 opt_position_count;
+  u64 opt_alloc_hard_point_visits;
+  u64 opt_alloc_stack_point_visits;
+  u64 opt_alloc_hard_word_ors;
+  u64 opt_alloc_stack_word_ors;
+  u64 opt_alloc_hard_mark_points;
+  u64 opt_alloc_stack_mark_points;
+  u64 opt_rewrite_inserted_insts;
+  u64 opt_rewrite_live_words_touched;
+  u64 opt_dde_live_words_touched;
   OptValInfo* val_info; /* indexed by Val, length nvals */
 
   Reg opt_hard_regs[OPT_REG_CLASSES][OPT_MAX_HARD_REGS];
diff --git a/src/opt/opt.c b/src/opt/opt.c
@@ -1408,19 +1408,18 @@ static void w_func_end(CGTarget* t) {
     metrics_count(o->c, "opt.live.active_words", live.active_words);
     metrics_count(o->c, "opt.live.block_bytes", live.block_bytes);
     metrics_count(o->c, "opt.live.set_bit_scans", live.set_bit_scans);
-    OptLiveRangeSet ranges;
-    opt_live_ranges_build(o->f, &live, &ranges);
-    metrics_count(o->c, "opt.ranges", ranges.nranges);
-    metrics_count(o->c, "opt.range_points", ranges.point_count);
-    metrics_count(o->c, "opt.range_raw_points", ranges.raw_point_count);
-    metrics_count(o->c, "opt.range_max_per_val", ranges.max_ranges_per_val);
-    metrics_count(o->c, "opt.range_max_length", ranges.max_live_length);
-    metrics_count(o->c, "opt.range_whole_block_spans",
-                  ranges.whole_block_spans);
+    metrics_count(o->c, "opt.live.bitset_words_touched",
+                  live.bitset_words_touched);
+    metrics_count(o->c, "opt.live.dataflow_iterations",
+                  live.dataflow_iterations);
+    metrics_count(o->c, "opt.live.dataflow_block_visits",
+                  live.dataflow_block_visits);
     metrics_count(o->c, "opt.conflict_bytes", 0);
     metrics_scope_end(o->c, "opt.live_blocks.pre_dde");
     metrics_scope_begin(o->c, "opt.dead_def_elim");
     opt_dead_def_elim_with_live(o->f, &live);
+    metrics_count(o->c, "opt.dde.live_words_touched",
+                  o->f->opt_dde_live_words_touched);
     metrics_scope_end(o->c, "opt.dead_def_elim");
     metrics_scope_begin(o->c, "opt.regalloc");
     opt_regalloc(o->f, 0);
@@ -1432,9 +1431,25 @@ static void w_func_end(CGTarget* t) {
                   o->f->opt_alloc_stack_loc_words);
     metrics_count(o->c, "opt.alloc.stack_slots",
                   o->f->opt_alloc_stack_slots);
+    metrics_count(o->c, "opt.alloc.hard_point_visits",
+                  o->f->opt_alloc_hard_point_visits);
+    metrics_count(o->c, "opt.alloc.stack_point_visits",
+                  o->f->opt_alloc_stack_point_visits);
+    metrics_count(o->c, "opt.alloc.hard_word_ors",
+                  o->f->opt_alloc_hard_word_ors);
+    metrics_count(o->c, "opt.alloc.stack_word_ors",
+                  o->f->opt_alloc_stack_word_ors);
+    metrics_count(o->c, "opt.alloc.hard_mark_points",
+                  o->f->opt_alloc_hard_mark_points);
+    metrics_count(o->c, "opt.alloc.stack_mark_points",
+                  o->f->opt_alloc_stack_mark_points);
     metrics_count(o->c, "opt.alloc.spills", func_spill_alloc_count(o->f));
     metrics_count(o->c, "opt.rewrite.reloads", func_spill_load_count(o->f));
     metrics_count(o->c, "opt.rewrite.stores", func_spill_store_count(o->f));
+    metrics_count(o->c, "opt.rewrite.inserted_insts",
+                  o->f->opt_rewrite_inserted_insts);
+    metrics_count(o->c, "opt.rewrite.live_words_touched",
+                  o->f->opt_rewrite_live_words_touched);
     metrics_scope_end(o->c, "opt.regalloc");
     metrics_scope_begin(o->c, "opt.combine");
     opt_combine(o->f);
diff --git a/src/opt/opt.h b/src/opt/opt.h
@@ -88,6 +88,9 @@ typedef struct OptLiveInfo {
   u64 active_words;
   u64 block_bytes;
   u64 set_bit_scans;
+  u64 bitset_words_touched;
+  u64 dataflow_iterations;
+  u64 dataflow_block_visits;
   OptBlockLive* blocks;
 } OptLiveInfo;
 
@@ -121,6 +124,9 @@ typedef struct OptLiveRangeSet {
   u32 max_ranges_per_val;
   u32 max_live_length;
   u32 whole_block_spans;
+  u64 range_point_visits;
+  u64 value_scans;
+  u64 live_words_touched;
 } OptLiveRangeSet;
 
 typedef enum OptLocKind {
diff --git a/src/opt/pass_live.c b/src/opt/pass_live.c
@@ -228,6 +228,39 @@ static u64 live_count_set_bits(const OptBitset* bs) {
   return n;
 }
 
+static void live_metric_clear(OptLiveInfo* live, OptBitset* bs) {
+  if (live && bs) live->bitset_words_touched += bs->active_words;
+  opt_bitset_clear(bs);
+}
+
+static int live_metric_copy(OptLiveInfo* live, OptBitset* dst,
+                            const OptBitset* src) {
+  u32 n = 0;
+  if (dst && src) n = dst->nwords < src->nwords ? dst->nwords : src->nwords;
+  if (live) live->bitset_words_touched += n;
+  return opt_bitset_copy(dst, src);
+}
+
+static int live_metric_union(OptLiveInfo* live, OptBitset* dst,
+                             const OptBitset* src) {
+  u32 n = 0;
+  if (dst && src) n = dst->nwords < src->active_words ? dst->nwords
+                                                      : src->active_words;
+  if (live) live->bitset_words_touched += n;
+  return opt_bitset_union(dst, src);
+}
+
+static int live_metric_union_and_not(OptLiveInfo* live, OptBitset* dst,
+                                     const OptBitset* src,
+                                     const OptBitset* not_bits) {
+  u32 n = 0;
+  (void)not_bits;
+  if (dst && src) n = dst->nwords < src->active_words ? dst->nwords
+                                                      : src->active_words;
+  if (live) live->bitset_words_touched += n;
+  return opt_bitset_union_and_not(dst, src, not_bits);
+}
+
 static void live_recompute_metrics(OptLiveInfo* live) {
   live->active_words = 0;
   live->block_bytes = 0;
@@ -279,21 +312,23 @@ void opt_live_blocks(Func* f, OptLiveInfo* live) {
   int changed;
   do {
     changed = 0;
+    ++live->dataflow_iterations;
     for (u32 bi = f->nblocks; bi > 0; --bi) {
       u32 b = bi - 1u;
       Block* bl = &f->blocks[b];
       OptBlockLive* lb = &live->blocks[b];
-      opt_bitset_clear(&new_out);
-      opt_bitset_clear(&tmp);
+      ++live->dataflow_block_visits;
+      live_metric_clear(live, &new_out);
+      live_metric_clear(live, &tmp);
       for (u32 s = 0; s < bl->nsucc; ++s) {
         u32 t = bl->succ[s];
         if (t < f->nblocks)
-          opt_bitset_union(&new_out, &live->blocks[t].live_in);
+          live_metric_union(live, &new_out, &live->blocks[t].live_in);
       }
-      opt_bitset_copy(&tmp, &lb->live_use);
-      opt_bitset_union_and_not(&tmp, &new_out, &lb->live_def);
-      changed |= opt_bitset_copy(&lb->live_out, &new_out);
-      changed |= opt_bitset_copy(&lb->live_in, &tmp);
+      live_metric_copy(live, &tmp, &lb->live_use);
+      live_metric_union_and_not(live, &tmp, &new_out, &lb->live_def);
+      changed |= live_metric_copy(live, &lb->live_out, &new_out);
+      changed |= live_metric_copy(live, &lb->live_in, &tmp);
     }
   } while (changed);
 
@@ -338,6 +373,14 @@ typedef struct RangeBuildCtx {
   OptLiveRangeSet* ranges;
   u32* last_range_by_val;
   u32* open_end_by_val;
+  u32* open_gen_by_val;
+  u32 open_gen;
+  Val* open_vals;
+  u32 nopen_vals;
+  u32 open_vals_cap;
+  Val* range_vals;
+  u32 nrange_vals;
+  u32 range_vals_cap;
   u32* raw_points;
   u32 nraw_points;
   u32 raw_points_cap;
@@ -429,6 +472,61 @@ static void range_push_raw_point(RangeBuildCtx* c, u32 p) {
   c->raw_points[c->nraw_points++] = p;
 }
 
+static void range_push_open_val(RangeBuildCtx* c, Val v) {
+  if (c->nopen_vals == c->open_vals_cap) {
+    u32 ncap = c->open_vals_cap ? c->open_vals_cap * 2u : 64u;
+    Val* nv = arena_array(c->f->arena, Val, ncap);
+    if (c->open_vals)
+      memcpy(nv, c->open_vals, sizeof(c->open_vals[0]) * c->nopen_vals);
+    c->open_vals = nv;
+    c->open_vals_cap = ncap;
+  }
+  c->open_vals[c->nopen_vals++] = v;
+}
+
+static void range_push_range_val(RangeBuildCtx* c, Val v) {
+  if (c->nrange_vals == c->range_vals_cap) {
+    u32 ncap = c->range_vals_cap ? c->range_vals_cap * 2u : 64u;
+    Val* nv = arena_array(c->f->arena, Val, ncap);
+    if (c->range_vals)
+      memcpy(nv, c->range_vals, sizeof(c->range_vals[0]) * c->nrange_vals);
+    c->range_vals = nv;
+    c->range_vals_cap = ncap;
+  }
+  c->range_vals[c->nrange_vals++] = v;
+}
+
+static void range_block_reset(RangeBuildCtx* c) {
+  c->nopen_vals = 0;
+  ++c->open_gen;
+  if (c->open_gen) return;
+  memset(c->open_gen_by_val, 0,
+         sizeof(c->open_gen_by_val[0]) * (c->f->nvals ? c->f->nvals : 1u));
+  c->open_gen = 1;
+}
+
+static int range_is_open(const RangeBuildCtx* c, Val v) {
+  return v < c->f->nvals && c->open_gen_by_val[v] == c->open_gen;
+}
+
+static u32 range_open_end(const RangeBuildCtx* c, Val v) {
+  return range_is_open(c, v) ? c->open_end_by_val[v] : OPT_RANGE_NONE;
+}
+
+static void range_set_open_end(RangeBuildCtx* c, Val v, u32 end) {
+  if (v == VAL_NONE || v >= c->f->nvals) return;
+  if (!range_is_open(c, v)) {
+    c->open_gen_by_val[v] = c->open_gen;
+    range_push_open_val(c, v);
+  }
+  c->open_end_by_val[v] = end;
+}
+
+static void range_close_open_end(RangeBuildCtx* c, Val v) {
+  if (v == VAL_NONE || v >= c->f->nvals) return;
+  c->open_gen_by_val[v] = 0;
+}
+
 static void range_append(RangeBuildCtx* c, Val v, u32 start, u32 end, u32 block,
                          int whole_block) {
   if (v == VAL_NONE || v >= c->f->nvals) return;
@@ -453,6 +551,7 @@ static void range_append(RangeBuildCtx* c, Val v, u32 start, u32 end, u32 block,
   r->whole_block = whole_block ? 1u : 0u;
   if (ranges->first_range_by_val[v] == OPT_RANGE_NONE) {
     ranges->first_range_by_val[v] = idx;
+    range_push_range_val(c, v);
   } else {
     ranges->ranges[c->last_range_by_val[v]].next = idx;
   }
@@ -469,7 +568,7 @@ typedef struct RangeOpenCtx {
 
 static void range_open_at_end(Val v, void* arg) {
   RangeOpenCtx* c = (RangeOpenCtx*)arg;
-  if (v < c->build->f->nvals) c->build->open_end_by_val[v] = c->raw_end;
+  range_set_open_end(c->build, v, c->raw_end);
 }
 
 typedef struct RangeCloseCtx {
@@ -482,9 +581,10 @@ static void range_close_def(Val v, void* arg) {
   RangeCloseCtx* c = (RangeCloseCtx*)arg;
   RangeBuildCtx* b = c->build;
   if (v >= b->f->nvals) return;
-  if (b->open_end_by_val[v] != OPT_RANGE_NONE) {
-    range_append(b, v, c->start, b->open_end_by_val[v], c->block, 0);
-    b->open_end_by_val[v] = OPT_RANGE_NONE;
+  u32 open_end = range_open_end(b, v);
+  if (open_end != OPT_RANGE_NONE) {
+    range_append(b, v, c->start, open_end, c->block, 0);
+    range_close_open_end(b, v);
   } else {
     range_append(b, v, c->start, c->start + 1u, c->block, 0);
   }
@@ -501,7 +601,7 @@ static void range_open_use(Val v, void* arg) {
   RangeUseCtx* c = (RangeUseCtx*)arg;
   RangeBuildCtx* b = c->build;
   if (v >= b->f->nvals) return;
-  if (b->open_end_by_val[v] == OPT_RANGE_NONE) b->open_end_by_val[v] = c->end;
+  if (!range_is_open(b, v)) range_set_open_end(b, v, c->end);
   b->ranges->use_freq_by_val[v] += b->f->blocks[c->block].frequency;
 }
 
@@ -573,6 +673,7 @@ static void range_compress_points(OptLiveRangeSet* ranges,
     r->end = range_point_index(build->raw_points, unique, r->end);
     if (r->end <= r->start) r->end = r->start + 1u;
     u32 len = r->end - r->start;
+    ranges->range_point_visits += len;
     ranges->live_length_by_val[r->val] += len;
     if (ranges->max_live_length < len) ranges->max_live_length = len;
   }
@@ -598,8 +699,8 @@ void opt_live_ranges_build(Func* f, const OptLiveInfo* live,
       arena_zarray(f->arena, u32, f->nvals ? f->nvals : 1u);
   ranges->spill_cost_by_val =
       arena_zarray(f->arena, u32, f->nvals ? f->nvals : 1u);
-  for (Val v = 0; v < f->nvals; ++v)
-    ranges->first_range_by_val[v] = OPT_RANGE_NONE;
+  memset(ranges->first_range_by_val, 0xff,
+         sizeof(ranges->first_range_by_val[0]) * (f->nvals ? f->nvals : 1u));
 
   RangeBuildCtx build;
   memset(&build, 0, sizeof build);
@@ -608,10 +709,10 @@ void opt_live_ranges_build(Func* f, const OptLiveInfo* live,
   build.last_range_by_val =
       arena_array(f->arena, u32, f->nvals ? f->nvals : 1u);
   build.open_end_by_val = arena_array(f->arena, u32, f->nvals ? f->nvals : 1u);
-  for (Val v = 0; v < f->nvals; ++v) {
-    build.last_range_by_val[v] = OPT_RANGE_NONE;
-    build.open_end_by_val[v] = OPT_RANGE_NONE;
-  }
+  build.open_gen_by_val = arena_zarray(f->arena, u32, f->nvals ? f->nvals : 1u);
+  build.open_gen = 1;
+  memset(build.last_range_by_val, 0xff,
+         sizeof(build.last_range_by_val[0]) * (f->nvals ? f->nvals : 1u));
 
   u32* block_base = arena_array(f->arena, u32, f->nblocks ? f->nblocks : 1u);
   u32 raw = 0;
@@ -631,17 +732,27 @@ void opt_live_ranges_build(Func* f, const OptLiveInfo* live,
   for (u32 b = 0; b < f->nblocks; ++b) {
     Block* bl = &f->blocks[b];
     const OptBlockLive* lb = &live->blocks[b];
-    for (Val v = 0; v < f->nvals; ++v)
-      build.open_end_by_val[v] = OPT_RANGE_NONE;
+    range_block_reset(&build);
     u32 raw_start = block_base[b];
     u32 raw_end = raw_start + (bl->ninsts ? bl->ninsts : 1u);
     RangeOpenCtx open_ctx = {&build, raw_end};
+    ranges->live_words_touched += lb->live_out.active_words;
     opt_bitset_iter_set(&lb->live_out, range_open_at_end, &open_ctx);
+    ranges->live_words_touched += live_after.nwords < lb->live_out.nwords
+                                      ? live_after.nwords
+                                      : lb->live_out.nwords;
     opt_bitset_copy(&live_after, &lb->live_out);
 
     RangeBlockFreqCtx freq_ctx = {ranges, bl->frequency};
+    ranges->live_words_touched += block_live.nwords < lb->live_in.nwords
+                                      ? block_live.nwords
+                                      : lb->live_in.nwords;
     opt_bitset_copy(&block_live, &lb->live_in);
+    ranges->live_words_touched += block_live.nwords < lb->live_out.active_words
+                                      ? block_live.nwords
+                                      : lb->live_out.active_words;
     opt_bitset_union(&block_live, &lb->live_out);
+    ranges->live_words_touched += block_live.active_words;
     opt_bitset_iter_set(&block_live, range_add_block_freq, &freq_ctx);
 
     for (u32 ri = bl->ninsts; ri > 0; --ri) {
@@ -652,6 +763,7 @@ void opt_live_ranges_build(Func* f, const OptLiveInfo* live,
 
       if ((IROp)in->op == IR_CALL) {
         RangeCallCtx call_ctx = {f, ranges, &refs, bl->frequency};
+        ranges->live_words_touched += live_after.active_words;
         opt_bitset_iter_set(&live_after, range_add_live_across_call, &call_ctx);
       }
 
@@ -664,19 +776,24 @@ void opt_live_ranges_build(Func* f, const OptLiveInfo* live,
       range_update_live_before(&live_after, &refs);
     }
 
-    for (Val v = 1; v < f->nvals; ++v) {
-      if (build.open_end_by_val[v] == OPT_RANGE_NONE) continue;
-      int whole_block = build.open_end_by_val[v] == raw_end &&
+    for (u32 oi = 0; oi < build.nopen_vals; ++oi) {
+      Val v = build.open_vals[oi];
+      if (!range_is_open(&build, v)) continue;
+      u32 open_end = range_open_end(&build, v);
+      if (open_end == OPT_RANGE_NONE) continue;
+      int whole_block = open_end == raw_end &&
                         !opt_bitset_has(&lb->live_use, v) &&
                         !opt_bitset_has(&lb->live_def, v);
-      range_append(&build, v, raw_start, build.open_end_by_val[v], b,
-                   whole_block);
+      range_append(&build, v, raw_start, open_end, b, whole_block);
+      range_close_open_end(&build, v);
     }
+    ranges->value_scans += build.nopen_vals;
   }
 
   range_compress_points(ranges, &build);
 
-  for (Val v = 1; v < f->nvals; ++v) {
+  for (u32 vi = 0; vi < build.nrange_vals; ++vi) {
+    Val v = build.range_vals[vi];
     u32 nranges = 0;
     for (u32 r = ranges->first_range_by_val[v]; r != OPT_RANGE_NONE;
          r = ranges->ranges[r].next)
@@ -688,6 +805,7 @@ void opt_live_ranges_build(Func* f, const OptLiveInfo* live,
                                    ranges->live_across_call_freq_by_val[v] +
                                    ranges->live_block_freq_by_val[v];
   }
+  ranges->value_scans += build.nrange_vals;
 }
 
 void opt_live_dump_ranges(Func* f, const OptLiveRangeSet* ranges, Writer* w) {
diff --git a/src/opt/pass_lower.c b/src/opt/pass_lower.c
@@ -47,6 +47,9 @@ static u32 type_size_fallback(const Func* f, CfreeCgTypeId t) {
 static u32 bit_words(u32 nvals) { return (nvals + 63u) / 64u; }
 
 static void bit_set(u64* bits, Val v) { bits[v / 64u] |= 1ull << (v % 64u); }
+static void bit_clear(u64* bits, Val v) {
+  bits[v / 64u] &= ~(1ull << (v % 64u));
+}
 static int bit_has(const u64* bits, Val v) {
   return (bits[v / 64u] & (1ull << (v % 64u))) != 0;
 }
@@ -145,6 +148,15 @@ typedef struct BitsCtx {
   u64* def;
 } BitsCtx;
 
+typedef struct InstRefs {
+  Val* uses;
+  Val* defs;
+  u32 nuses;
+  u32 ndefs;
+  u32 use_cap;
+  u32 def_cap;
+} InstRefs;
+
 static void collect_bits(Func* f, Inst* in, Operand* op, int is_def,
                          void* arg) {
   (void)in;
@@ -158,6 +170,50 @@ static void collect_bits(Func* f, Inst* in, Operand* op, int is_def,
     bit_set(c->use, v);
 }
 
+static void refs_reset(InstRefs* refs) {
+  refs->nuses = 0;
+  refs->ndefs = 0;
+}
+
+static void refs_push(Func* f, Val** vals, u32* nvals, u32* cap, Val v) {
+  for (u32 i = 0; i < *nvals; ++i)
+    if ((*vals)[i] == v) return;
+  if (*nvals == *cap) {
+    u32 ncap = *cap ? *cap * 2u : 8u;
+    Val* nv = arena_array(f->arena, Val, ncap);
+    if (*vals) memcpy(nv, *vals, sizeof((*vals)[0]) * *nvals);
+    *vals = nv;
+    *cap = ncap;
+  }
+  (*vals)[(*nvals)++] = v;
+}
+
+static void refs_collect(Func* f, Inst* in, Operand* op, int is_def,
+                         void* arg) {
+  (void)in;
+  InstRefs* refs = (InstRefs*)arg;
+  if (op->kind != OPK_REG) return;
+  Val v = (Val)op->v.reg;
+  if (v == VAL_NONE || v >= f->nvals) return;
+  if (is_def)
+    refs_push(f, &refs->defs, &refs->ndefs, &refs->def_cap, v);
+  else
+    refs_push(f, &refs->uses, &refs->nuses, &refs->use_cap, v);
+}
+
+static int refs_has_def(const InstRefs* refs, Val v) {
+  for (u32 i = 0; i < refs->ndefs; ++i)
+    if (refs->defs[i] == v) return 1;
+  return 0;
+}
+
+static void live_update_refs_before(u64* live, const InstRefs* refs) {
+  for (u32 i = 0; i < refs->ndefs; ++i)
+    bit_clear(live, refs->defs[i]);
+  for (u32 i = 0; i < refs->nuses; ++i)
+    bit_set(live, refs->uses[i]);
+}
+
 static void forbid_val_reg(Func* f, Val v, u8 cls, Reg r) {
   if (v == VAL_NONE || v >= f->nvals || cls >= OPT_REG_CLASSES || r >= 32)
     return;
@@ -544,6 +600,12 @@ typedef struct OptAllocator {
   FrameSlot* stack_slots;
   u32 stack_slot_count;
   u32 stack_slot_cap;
+  u64 hard_point_visits;
+  u64 stack_point_visits;
+  u64 hard_word_ors;
+  u64 stack_word_ors;
+  u64 hard_mark_points;
+  u64 stack_mark_points;
 } OptAllocator;
 
 static int alloc_candidate_higher(const OptAllocCandidate* a,
@@ -675,21 +737,24 @@ static int spill_slot_compatible(Func* f, FrameSlot fs, Val v) {
   return 1;
 }
 
-static void alloc_collect_occupied_hard(const OptAllocator* a,
+static void alloc_collect_occupied_hard(OptAllocator* a,
                                         const OptLiveRangeSet* ranges, Val v,
                                         u64* occupied) {
   loc_bits_clear(occupied, a->hard_loc_words);
   for (u32 r = ranges->first_range_by_val[v]; r != OPT_RANGE_NONE;
        r = ranges->ranges[r].next) {
     const OptLiveRange* lr = &ranges->ranges[r];
-    for (u32 p = lr->start; p < lr->end && p < a->point_count; ++p)
+    for (u32 p = lr->start; p < lr->end && p < a->point_count; ++p) {
+      ++a->hard_point_visits;
+      a->hard_word_ors += a->hard_loc_words;
       loc_bits_or(occupied,
                   a->hard_used_locs + ((size_t)p * a->hard_loc_words),
                   a->hard_loc_words);
+    }
   }
 }
 
-static void alloc_collect_occupied_stack(const OptAllocator* a,
+static void alloc_collect_occupied_stack(OptAllocator* a,
                                          const OptLiveRangeSet* ranges, Val v,
                                          u64* occupied) {
   loc_bits_clear(occupied, a->stack_loc_words);
@@ -697,10 +762,13 @@ static void alloc_collect_occupied_stack(const OptAllocator* a,
   for (u32 r = ranges->first_range_by_val[v]; r != OPT_RANGE_NONE;
        r = ranges->ranges[r].next) {
     const OptLiveRange* lr = &ranges->ranges[r];
-    for (u32 p = lr->start; p < lr->end && p < a->point_count; ++p)
+    for (u32 p = lr->start; p < lr->end && p < a->point_count; ++p) {
+      ++a->stack_point_visits;
+      a->stack_word_ors += a->stack_loc_words;
       loc_bits_or(occupied,
                   a->stack_used_locs + ((size_t)p * a->stack_loc_words),
                   a->stack_loc_words);
+    }
   }
 }
 
@@ -709,8 +777,10 @@ static void alloc_mark_hard_loc(OptAllocator* a, const OptLiveRangeSet* ranges,
   for (u32 r = ranges->first_range_by_val[v]; r != OPT_RANGE_NONE;
        r = ranges->ranges[r].next) {
     const OptLiveRange* lr = &ranges->ranges[r];
-    for (u32 p = lr->start; p < lr->end && p < a->point_count; ++p)
+    for (u32 p = lr->start; p < lr->end && p < a->point_count; ++p) {
+      ++a->hard_mark_points;
       loc_bit_set(a->hard_used_locs + ((size_t)p * a->hard_loc_words), bit);
+    }
   }
 }
 
@@ -734,9 +804,11 @@ static void alloc_mark_stack_loc(OptAllocator* a, const OptLiveRangeSet* ranges,
   for (u32 r = ranges->first_range_by_val[v]; r != OPT_RANGE_NONE;
        r = ranges->ranges[r].next) {
     const OptLiveRange* lr = &ranges->ranges[r];
-    for (u32 p = lr->start; p < lr->end && p < a->point_count; ++p)
+    for (u32 p = lr->start; p < lr->end && p < a->point_count; ++p) {
+      ++a->stack_mark_points;
       loc_bit_set(a->stack_used_locs + ((size_t)p * a->stack_loc_words),
                   stack_idx);
+    }
   }
 }
 
@@ -859,6 +931,7 @@ static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges,
       Reg hr = f->opt_hard_regs[cls][r];
       if (hr >= 32) continue;
       if (vi->forbidden_hard_regs & (1u << hr)) continue;
+      if (vi->live_across_call_freq && is_caller_saved(f, cls, hr)) continue;
       if (loc_bit_has(hard_occupied, hard_loc_bit(cls, hr))) continue;
       alloc_assign_hard(f, a, ranges, v, hr);
       found = 1;
@@ -874,6 +947,12 @@ static void opt_assign_ranges(Func* f, const OptLiveRangeSet* ranges,
   f->opt_alloc_stack_slots = a->stack_slot_count;
   f->opt_used_loc_words =
       f->opt_alloc_hard_loc_words + f->opt_alloc_stack_loc_words;
+  f->opt_alloc_hard_point_visits = a->hard_point_visits;
+  f->opt_alloc_stack_point_visits = a->stack_point_visits;
+  f->opt_alloc_hard_word_ors = a->hard_word_ors;
+  f->opt_alloc_stack_word_ors = a->stack_word_ors;
+  f->opt_alloc_hard_mark_points = a->hard_mark_points;
+  f->opt_alloc_stack_mark_points = a->stack_mark_points;
 }
 
 typedef struct RewriteList {
@@ -899,6 +978,7 @@ static Inst* list_push(Func* f, RewriteList* l, IROp op) {
   Inst* in = &l->data[l->n++];
   memset(in, 0, sizeof *in);
   in->op = (u16)op;
+  ++f->opt_rewrite_inserted_insts;
   return in;
 }
 
@@ -1081,7 +1161,7 @@ static void rewrite_call_arg_value(Func* f, Inst* owner, CGABIValue* v,
 typedef struct RewriteCallSaveCtx {
   Func* f;
   RewriteList* out;
-  const u64* def;
+  const InstRefs* refs;
   int emit_restore;
 } RewriteCallSaveCtx;
 
@@ -1089,7 +1169,7 @@ static void rewrite_call_save_one(Val v, void* arg) {
   RewriteCallSaveCtx* c = (RewriteCallSaveCtx*)arg;
   Func* f = c->f;
   if (v == VAL_NONE || v >= f->nvals) return;
-  if (bit_has(c->def, v)) return;
+  if (c->refs && refs_has_def(c->refs, v)) return;
   if (f->val_info[v].alloc_kind != OPT_ALLOC_HARD) return;
   if (!is_caller_saved(f, f->val_info[v].cls, f->val_info[v].hard_reg)) return;
   if (c->emit_restore)
@@ -1099,9 +1179,11 @@ static void rewrite_call_save_one(Val v, void* arg) {
 }
 
 static void append_live_call_saves(Func* f, RewriteList* out,
-                                   const u64* live_after, const u64* def,
-                                   u32 words, int emit_restore) {
-  RewriteCallSaveCtx ctx = {f, out, def, emit_restore};
+                                   const u64* live_after,
+                                   const InstRefs* refs, u32 words,
+                                   int emit_restore) {
+  RewriteCallSaveCtx ctx = {f, out, refs, emit_restore};
+  f->opt_rewrite_live_words_touched += words;
   for (u32 w = 0; w < words; ++w) {
     u64 bits = live_after[w];
     while (bits) {
@@ -1116,10 +1198,12 @@ static void rewrite_func(Func* f, const OptLiveInfo* live_info) {
   u32 words = live_info ? live_info->words : f->opt_live_words;
   if (!words) words = bit_words(f->nvals);
   f->opt_live_words = (u16)words;
+  f->opt_rewrite_inserted_insts = 0;
+  f->opt_rewrite_live_words_touched = 0;
 
   u64* live = arena_zarray(f->arena, u64, words ? words : 1u);
-  u64* use = arena_zarray(f->arena, u64, words ? words : 1u);
-  u64* def = arena_zarray(f->arena, u64, words ? words : 1u);
+  InstRefs refs;
+  memset(&refs, 0, sizeof refs);
   for (u32 b = 0; b < f->nblocks; ++b) {
     Block* bl = &f->blocks[b];
     RewriteOut out;
@@ -1130,14 +1214,13 @@ static void rewrite_func(Func* f, const OptLiveInfo* live_info) {
     memset(&call_saves, 0, sizeof call_saves);
     memset(&call_restores, 0, sizeof call_restores);
     live_copy_block_out(f, live_info, b, live, words);
+    f->opt_rewrite_live_words_touched += words;
 
     for (u32 ri = bl->ninsts; ri > 0; --ri) {
       u32 i = ri - 1u;
       Inst in = bl->insts[i];
-      bits_clear(use, words);
-      bits_clear(def, words);
-      BitsCtx db = {use, def};
-      walk_inst_operands(f, &in, collect_bits, &db);
+      refs_reset(&refs);
+      walk_inst_operands(f, &in, refs_collect, &refs);
       list_reset(&before);
       list_reset(&after);
       list_reset(&call_saves);
@@ -1159,8 +1242,8 @@ static void rewrite_func(Func* f, const OptLiveInfo* live_info) {
         walk_inst_operands(f, &in, rewrite_one_operand, &ctx);
       }
       if ((IROp)in.op == IR_CALL) {
-        append_live_call_saves(f, &call_saves, live, def, words, 0);
-        append_live_call_saves(f, &call_restores, live, def, words, 1);
+        append_live_call_saves(f, &call_saves, live, &refs, words, 0);
+        append_live_call_saves(f, &call_restores, live, &refs, words, 1);
       }
 
       out_prepend_list_reverse(f, &out, &after);
@@ -1168,7 +1251,8 @@ static void rewrite_func(Func* f, const OptLiveInfo* live_info) {
       out_prepend_inst(f, &out, &in);
       out_prepend_list_reverse(f, &out, &call_saves);
       out_prepend_list_reverse(f, &out, &before);
-      live_update_before(live, use, def, words);
+      live_update_refs_before(live, &refs);
+      f->opt_rewrite_live_words_touched += refs.nuses + refs.ndefs;
     }
     bl->insts = out.data + out.start;
     bl->ninsts = out.cap - out.start;
@@ -1214,10 +1298,14 @@ static int all_defs_dead(Func* f, Inst* in, u64* live) {
 
 void opt_dead_def_elim_with_live(Func* f, const OptLiveInfo* live_info) {
   u32 words = live_info ? live_info->words : f->opt_live_words;
+  f->opt_dde_live_words_touched = 0;
+  InstRefs refs;
+  memset(&refs, 0, sizeof refs);
   for (u32 b = 0; b < f->nblocks; ++b) {
     Block* bl = &f->blocks[b];
     u64* live = arena_zarray(f->arena, u64, words);
     const u64* live_out = live_info ? live_info->blocks[b].live_out.words : NULL;
+    f->opt_dde_live_words_touched += words;
     if (live_out) {
       for (u32 w = 0; w < words; ++w) live[w] = live_out[w];
     }
@@ -1232,12 +1320,10 @@ void opt_dead_def_elim_with_live(Func* f, const OptLiveInfo* live_info) {
       }
       new_insts[w++] = *in;
 
-      u64* use = arena_zarray(f->arena, u64, words);
-      u64* def = arena_zarray(f->arena, u64, words);
-      BitsCtx bc = {use, def};
-      walk_inst_operands(f, in, collect_bits, &bc);
-      for (u32 wi = 0; wi < words; ++wi)
-        live[wi] = (live[wi] & ~def[wi]) | use[wi];
+      refs_reset(&refs);
+      walk_inst_operands(f, in, refs_collect, &refs);
+      live_update_refs_before(live, &refs);
+      f->opt_dde_live_words_touched += refs.nuses + refs.ndefs;
     }
 
     for (u32 i = 0; i < w / 2; ++i) {
@@ -1268,6 +1354,23 @@ void opt_regalloc(Func* f, int allow_live_range_split) {
   opt_init_val_info_from_ranges(f, &ranges);
   opt_apply_asm_constraints_from_live(f, &live);
   metrics_count(f->c, "opt.live_words", f->opt_live_words);
+  metrics_count(f->c, "opt.ranges", ranges.nranges);
+  metrics_count(f->c, "opt.range_points", ranges.point_count);
+  metrics_count(f->c, "opt.range_raw_points", ranges.raw_point_count);
+  metrics_count(f->c, "opt.range_max_per_val", ranges.max_ranges_per_val);
+  metrics_count(f->c, "opt.range_max_length", ranges.max_live_length);
+  metrics_count(f->c, "opt.range_whole_block_spans",
+                ranges.whole_block_spans);
+  metrics_count(f->c, "opt.live.bitset_words_touched",
+                live.bitset_words_touched);
+  metrics_count(f->c, "opt.live.dataflow_iterations",
+                live.dataflow_iterations);
+  metrics_count(f->c, "opt.live.dataflow_block_visits",
+                live.dataflow_block_visits);
+  metrics_count(f->c, "opt.range.point_visits", ranges.range_point_visits);
+  metrics_count(f->c, "opt.range.value_scans", ranges.value_scans);
+  metrics_count(f->c, "opt.range.live_words_touched",
+                ranges.live_words_touched);
   metrics_count(f->c, "opt.conflict_bytes", 0);
   metrics_scope_end(f->c, "opt.live_ranges.regalloc");
 
diff --git a/test/opt/phase0_guardrails.sh b/test/opt/phase0_guardrails.sh
@@ -91,19 +91,34 @@ check_metrics() {
   for metric in \
     opt.live.blocks \
     opt.live.active_words \
+    opt.live.bitset_words_touched \
+    opt.live.dataflow_iterations \
+    opt.live.dataflow_block_visits \
     opt.ranges \
     opt.range_points \
     opt.range_raw_points \
     opt.range_max_per_val \
     opt.range_max_length \
     opt.range_whole_block_spans \
+    opt.range.point_visits \
+    opt.range.value_scans \
+    opt.range.live_words_touched \
+    opt.dde.live_words_touched \
     opt.alloc.used_loc_words \
     opt.alloc.hard_loc_words \
     opt.alloc.stack_loc_words \
     opt.alloc.stack_slots \
+    opt.alloc.hard_point_visits \
+    opt.alloc.stack_point_visits \
+    opt.alloc.hard_word_ors \
+    opt.alloc.stack_word_ors \
+    opt.alloc.hard_mark_points \
+    opt.alloc.stack_mark_points \
     opt.alloc.spills \
     opt.rewrite.reloads \
-    opt.rewrite.stores; do
+    opt.rewrite.stores \
+    opt.rewrite.inserted_insts \
+    opt.rewrite.live_words_touched; do
     if ! grep -q "$metric" "$err"; then
       echo "phase0 metrics: missing $metric" >&2
       sed -n '1,160p' "$err" >&2

	kit kit
	git clone https://git.ryansepassi.com/git/kit.git
	Log \| Files \| Refs \| README

M	doc/MIR_RA_REPORT.md	\|	141	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------------
M	doc/PERF.md	\|	127	++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------
M	src/opt/ir.h	\|	9	+++++++++
M	src/opt/opt.c	\|	33	++++++++++++++++++++++++---------
M	src/opt/opt.h	\|	6	++++++
M	src/opt/pass_live.c	\|	170	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------
M	src/opt/pass_lower.c	\|	155	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------
M	test/opt/phase0_guardrails.sh	\|	17	++++++++++++++++-