commit 1fc79efe18e7f925fad06ad0323ee4ea25edcc5f
parent 68c9946c896959ba7af3e72b6cefdedef6bf6caf
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Thu, 14 May 2026 18:34:49 -0700
RA plan
Diffstat:
| A | doc/MIR_RA_REPORT.md | | | 740 | +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ |
1 file changed, 740 insertions(+), 0 deletions(-)
diff --git a/doc/MIR_RA_REPORT.md b/doc/MIR_RA_REPORT.md
@@ -0,0 +1,740 @@
+# MIR Data Structures and Register Allocation Report
+
+This report summarizes the data structures and algorithms in MIR that matter
+most for cfree's optimizer direction. The emphasis is structural: cfree should
+move toward MIR's allocation shape before spending more effort on small
+localized speedups in the current dense-conflict implementation.
+
+Primary MIR source references:
+
+- `~/tmp/mir/mir-gen.c`
+- `~/tmp/mir/mir.h`
+- `~/tmp/mir/mir-bitmap.h`
+- `~/tmp/mir/mir-varr.h`
+- `~/tmp/mir/mir-dlist.h`
+
+Related cfree notes:
+
+- `doc/OPT.md`: broad MIR-informed optimizer plan.
+- `doc/PERF.md`: current timing and scaling data.
+
+## Executive Summary
+
+MIR's register allocation pipeline is live-range based. It does not allocate by
+building a full pseudo-register interference matrix. MIR computes block live
+sets, converts them into compressed live ranges, assigns physical locations by
+checking occupied hard-register and stack locations at live-range program
+points, then rewrites instructions.
+
+cfree's current O1 path is conflict-matrix based. `opt_live_info` builds a
+dense `nvals x words` pseudo interference matrix, stores per-instruction full
+`live_after` bitsets, and scans `1..nvals` in hot loops. `doc/PERF.md` shows
+this structure is the scaling problem for one large function.
+
+The main recommendation is to reshape cfree around MIR's allocation model:
+
+1. Separate block liveness from register-allocation analysis.
+2. Build live ranges from block liveness.
+3. Allocate against per-program-point occupied physical locations.
+4. Keep conflict matrices only for bounded, optional coalescing.
+5. Rewrite pseudos after assignment, with later live-range splitting for O2.
+
+This is a structural change, not an incremental optimization of the current
+matrix.
+
+## MIR Container Layer
+
+MIR uses small generic data-structure building blocks throughout `mir-gen.c`.
+They are simple but important because they determine how passes are organized.
+
+### VARR
+
+`mir-varr.h` defines typed growable arrays. A `VARR(T)` stores:
+
+- element count
+- capacity
+- element pointer
+- allocator
+
+Growth uses `realloc` and increases capacity by about 1.5x. MIR uses `VARR`
+for pass-owned vectors such as temporary instruction lists, live-range tables,
+sorted allocation candidates, and per-point used-location bitmaps.
+
+cfree already has arena-backed arrays and typed vector macros in several
+subsystems. We should not copy `VARR` mechanically, but we should mirror the
+ownership pattern: pass state owns resizable vectors explicitly instead of
+embedding all allocation artifacts permanently in `Func`.
+
+### DLIST
+
+`mir-dlist.h` defines intrusive typed doubly linked lists. MIR functions and
+basic blocks use linked instruction lists because passes insert, delete, and
+move instructions frequently:
+
+- function instruction list: `MIR_func.insns`
+- CFG block list: `func_cfg.bbs`
+- block instruction list: `bb.bb_insns`
+- edge lists: `bb.in_edges`, `bb.out_edges`
+
+cfree currently stores each `Block` as a contiguous `Inst*` array. That is good
+for replay and scanning, but less natural for MIR-style lowering, rewrite,
+spill insertion, edge splitting, and post-RA combine. For the first structural
+rewrite we can keep arrays if we introduce a pass-local edit buffer, but the
+final shape should support cheap insertion around instructions and on block
+boundaries.
+
+### Bitmap
+
+`mir-bitmap.h` is a growable sparse-ish bitmap built on `VARR<uint64_t>`.
+Important details:
+
+- It expands only as far as the highest needed bit.
+- `bitmap_clear` truncates length to zero.
+- binary/ternary operations trim trailing zero words.
+- iteration uses `FOREACH_BITMAP_BIT` to visit set bits.
+
+This is a key contrast with cfree's current fixed-width bitsets. In cfree,
+`opt_live_words = ceil(nvals / 64)`, and every live set for the function has
+that width. MIR's bitmaps still use dense words internally, but their active
+length tracks actual high bits, and many loops iterate set bits.
+
+cfree should add an optimizer bitmap abstraction before rewriting RA. It should
+support:
+
+- clear without touching max capacity
+- copy/ior/and-not with change reporting
+- set-bit iteration
+- optional trimming of trailing zero words
+
+The exact storage can remain arena-backed for pass lifetime, but the API should
+hide fixed full-function width from users.
+
+## MIR Function and CFG Shape
+
+MIR's generator owns a `gen_ctx` with pass subcontexts:
+
+- target context
+- data-flow context
+- SSA context
+- GVN context
+- live-range context
+- coalesce context
+- RA context
+- combine context
+
+This keeps pass state out of the public MIR function structure. The current
+function being generated stores a `func_cfg` in `func_item->data` during
+generation.
+
+### Instructions
+
+`MIR_insn` is a variable-sized object:
+
+```c
+struct MIR_insn {
+ void *data;
+ DLIST_LINK(MIR_insn_t) insn_link;
+ MIR_insn_code_t code : 32;
+ unsigned int nops : 32;
+ MIR_op_t ops[1];
+};
+```
+
+`data` points to pass-local metadata, usually `bb_insn` after CFG construction.
+This lets MIR keep the IR instruction object stable while attaching and
+replacing analysis state.
+
+cfree's `Inst` is a fixed struct with side arrays for operands and aux data.
+That is fine for recording, but pass state should move out of `Inst`, `Block`,
+and `Func` where possible. In particular, liveness, ranges, allocation
+assignment, and rewrite scratch state should be pass contexts, not persistent
+fields scattered through `Func`.
+
+### Basic Blocks
+
+MIR `bb` contains:
+
+- CFG links: in/out edge lists
+- instruction list
+- flags and reachability
+- four data-flow bitmaps: `in`, `out`, `gen`, `kill`
+- dominator bitmaps reused by several passes
+- loop-node pointer
+- max integer and FP pressure
+
+MIR reuses the bitmap fields for different data-flow problems through local
+macros such as `live_in`, `live_out`, `live_gen`, and `live_kill`.
+
+cfree currently stores `live_in`, `live_out`, `live_use`, `live_def`, and
+`live_after` directly on `Block`. That ties block storage to one analysis. The
+MIR-like direction is to keep persistent block structure focused on CFG and
+instruction ownership, and keep data-flow bitmaps in pass-local arrays indexed
+by block id.
+
+## MIR Pipeline Shape
+
+For the allocation-related portion, MIR runs:
+
+1. `target_machinize`
+2. build loop tree for O1+
+3. optional move collection and coalescing for O2
+4. full live info
+5. `reg_alloc`
+6. rewrite
+7. optional splitting for O2
+8. post-RA combine
+9. post-combine DCE
+
+This split is more important than individual implementation details. In MIR,
+liveness is a reusable input to multiple later phases, but register allocation
+does not depend on a full pseudo interference graph.
+
+For cfree, O1 should become:
+
+```text
+build_cfg
+machinize
+build_loop_tree
+live_blocks
+build_live_ranges
+assign_locations(no splitting)
+rewrite_locations
+combine
+dce
+emit
+```
+
+O2 can add coalescing, splitting, SSA value passes, and richer cleanup after
+this shape is stable.
+
+## MIR Liveness
+
+MIR's liveness is block-first:
+
+1. Walk each block backward to compute `live_gen` and `live_kill`.
+2. Solve backward data-flow:
+ - `live_out = union(successor.live_in)`
+ - `live_in = live_gen | (live_out & ~live_kill)`
+3. Track register frequency and block pressure while scanning.
+4. Iterate set bits for live-out pressure accounting.
+
+MIR also has a `scan_vars` mode. Normally `scan_vars_num == 0`, meaning all
+variables participate. For move coalescing, MIR can restrict liveness to only
+move-related variables.
+
+cfree should copy this separation. The first replacement for `opt_live_info`
+should not build:
+
+- `val_conflicts`
+- per-instruction `live_after`
+- spill/rewrite data
+
+It should only produce block live-in/live-out, use/def, frequency, and pressure
+inputs. Later passes can build the extra views they need.
+
+## MIR Live Ranges
+
+MIR represents each variable as a linked list of live ranges:
+
+```c
+struct live_range {
+ lr_bb_t lr_bb;
+ int start, finish;
+ int ref_cost;
+ live_range_t next;
+};
+```
+
+The range builder walks blocks backward from `live_out`, just like liveness,
+but emits range start/end points. It also records whole-block ranges using
+`lr_bb` for values that are live through a block without being referenced
+inside it.
+
+MIR then compresses live-range points:
+
+- gather all born/dead points
+- map old instruction points to a smaller point space
+- merge adjacent ranges where legal
+
+This compression matters. It reduces the size of the later `used_locs`
+structure and avoids making the allocator pay for every instruction position
+when only lifetime boundaries matter.
+
+cfree should introduce:
+
+```c
+typedef struct OptLiveRange {
+ Val val;
+ u32 start;
+ u32 end;
+ u32 next;
+ u32 first_block_span; /* optional: whole-block span list */
+} OptLiveRange;
+
+typedef struct OptLiveRangeSet {
+ OptLiveRange* ranges;
+ u32 nranges;
+ u32* first_range_by_val;
+ u32 point_count;
+} OptLiveRangeSet;
+```
+
+The exact fields can change, but the structural point should not: allocation
+should reason from ranges, not from a dense value-value conflict matrix.
+
+## MIR Register Allocation
+
+MIR allocation has three core pieces:
+
+### Candidate Ordering
+
+MIR builds `sorted_regs` from pseudo regs, computes live length from ranges,
+and sorts allocation candidates. The priority includes tied hard-register
+requirements, frequency, and live length.
+
+cfree already has a similar priority idea in `val_higher_priority`. The
+problem is not ordering. The problem is what each candidate checks against.
+
+### Occupied Locations
+
+MIR maintains:
+
+- `used_locs`: vector of bitmaps indexed by compressed program point
+- `busy_used_locs`: extra occupancy for split-capable allocation
+- `conflict_locs1`: temporary bitmap of occupied physical locations
+
+For each pseudo range, MIR unions the occupied locations for all points in the
+range into `conflict_locs`, then asks for a hard register or stack slot not in
+that set.
+
+This is the central data-structure difference:
+
+- MIR: pseudo -> live ranges -> occupied hard locations by point
+- cfree today: pseudo -> dense conflicts with other pseudos -> scan assigned
+ pseudos to test hard-register conflicts
+
+The MIR shape bounds allocation checks by range length and physical-location
+bitmap width, not by all pseudo pairs. It is a better fit for large generated
+functions where many values never overlap.
+
+### Rewrite
+
+After assignment, MIR rewrites instruction operands:
+
+- hard-register assigned pseudos become hard regs
+- stack-assigned pseudos become loads/stores around uses/defs
+- some spilled uses can become memory operands directly if target-legal
+- call-clobbered live regs are saved/restored
+- noop moves are deleted
+
+cfree's current rewrite is simpler and tightly coupled to `live_after`. A
+MIR-like cfree rewrite should walk each block backward with a rolling live set,
+using allocation assignment and target scratch registers. It should not require
+persisting a full live-after bitmap for every instruction.
+
+## MIR Conflict Matrix Use
+
+MIR still has a conflict matrix, but only for coalescing. It is intentionally
+narrow:
+
+- collect moves
+- collect only source/destination regs from those moves
+- cap the number at `MIR_MAX_COALESCE_VARS`
+- build a matrix over this reduced scan-var space
+- coalesce non-conflicting move-related regs
+
+The comment in MIR says 10K scan vars is about 8 MB for the coalescing matrix.
+That cap is a design signal: even MIR avoids an unbounded full pseudo matrix.
+
+cfree should follow this model:
+
+- no full matrix in normal O1 allocation
+- optional move-only matrix in O2 coalescing
+- hard cap and metric counters for matrix size
+
+## Current cfree Mismatch
+
+The current O1 implementation has these structural mismatches:
+
+1. `opt_live_info` is overloaded.
+ It computes block liveness, per-value summaries, per-instruction live-after,
+ conflict matrix, call live-across summaries, and asm constraints in one pass.
+
+2. DDE pays for full allocation analysis.
+ The pre-DDE liveness call builds data needed only by regalloc. This is the
+ first obvious waste, but fixing only that is still incremental.
+
+3. Dense conflicts are the allocator's core dependency.
+ `Func.val_conflicts` is `nvals x opt_conflict_words`. This should not be a
+ fundamental `Func` field in the MIR-like design.
+
+4. Hot loops scan all vals.
+ Several loops walk `1..nvals` inside instruction processing. MIR mostly
+ iterates instruction operands or set bits.
+
+5. `Block.live_after` stores a full live bitmap per instruction.
+ This multiplies memory by instruction count and value count. MIR derives the
+ information needed for rewrite with backward walks.
+
+## Proposed cfree Structural Target
+
+Introduce pass-local optimizer state:
+
+```c
+typedef struct OptBitset OptBitset;
+typedef struct OptDataflow OptDataflow;
+typedef struct OptLiveRanges OptLiveRanges;
+typedef struct OptAllocator OptAllocator;
+typedef struct OptRewrite OptRewrite;
+
+typedef struct OptPassCtx {
+ Compiler* c;
+ Func* f;
+ Arena* scratch;
+ OptDataflow live;
+ OptLiveRanges ranges;
+ OptAllocator alloc;
+} OptPassCtx;
+```
+
+Move these out of persistent `Func` where possible:
+
+- `live_in`, `live_out`, `live_use`, `live_def`
+- `live_after`
+- `val_conflicts`
+- allocation scratch arrays
+
+Keep these or equivalents as persistent-enough function metadata:
+
+- value type/class tables
+- block/instruction ownership
+- CFG edges
+- frame slots
+- target hard-register availability
+
+Allocation result should be a separate mapping:
+
+```c
+typedef enum OptLocKind {
+ OPT_LOC_NONE,
+ OPT_LOC_HARD,
+ OPT_LOC_STACK,
+} OptLocKind;
+
+typedef struct OptLoc {
+ u8 kind;
+ u8 cls;
+ Reg hard_reg;
+ FrameSlot spill_slot;
+} OptLoc;
+```
+
+Rewrite consumes `OptLoc[val]` and emits final target operands.
+
+## Implementation Direction
+
+This is the recommended structural sequence.
+
+### Phase 1: Data Structure Extraction
+
+Add pass-local bitset and data-flow structures. Keep current behavior running,
+but make new liveness code independent of `Block.live_*` fields.
+
+Deliverables:
+
+- `pass_live.c` owns block liveness storage.
+- bitset supports set-bit iteration and trimmed operations.
+- DDE uses block liveness only.
+- metrics count live words, active words, block live bytes, and set-bit scans.
+
+### Phase 2: Live Range Builder
+
+Build live ranges from block live sets. Do not allocate from them yet.
+
+Deliverables:
+
+- range list per value
+- compressed point numbering
+- metrics for point count, range count, average ranges per value, max range
+ length, and whole-block spans
+- dump output for range tests
+
+### Phase 3: Range-Based O1 Allocator
+
+Replace dense-conflict allocation with MIR-style occupied-location assignment.
+No live-range splitting yet.
+
+Deliverables:
+
+- `used_locs[point]` bitmaps for hard regs and stack slots
+- candidate ordering by tied hard reg, frequency, live length, stable id
+- hard-register assignment when possible
+- stack assignment otherwise
+- no `val_conflicts` allocation in normal O1
+
+### Phase 4: Rewrite Without `live_after`
+
+Rewrite by walking blocks and instructions with rolling live state plus
+assignment maps.
+
+Deliverables:
+
+- hard-reg operand rewrite
+- stack reload/store insertion
+- scratch-reg handling by target class
+- call-clobber preservation
+- noop move deletion
+- no required per-instruction full live-after storage
+
+### Phase 5: Coalescing and Splitting
+
+After O1 has the right shape, add O2 allocator features:
+
+- move collection
+- move-only liveness
+- capped conflict matrix for coalescing
+- live-range splitting
+- edge/block spill placement
+
+This mirrors MIR's division: the matrix is an optional coalescing tool, not the
+allocator's main model.
+
+## Implementation Checklist
+
+Use this as the working tracker for the structural rewrite. Each phase should
+land with focused tests and metrics before moving to the next phase.
+
+### Phase 0: Baseline and Guardrails
+
+- [ ] Preserve current O0 behavior as the reference path.
+- [ ] Keep existing O1 tests green before structural changes begin.
+- [ ] Add or identify stress inputs for:
+ - [ ] one large straight-line function
+ - [ ] many small functions
+ - [ ] branch liveness
+ - [ ] call-clobber preservation
+ - [ ] spills
+ - [ ] tied hard registers from inline asm
+- [ ] Keep `cfree run --time -O1` metrics available throughout the rewrite.
+- [ ] Add metrics names for the new shape before deleting old counters:
+ - [ ] `opt.live.blocks`
+ - [ ] `opt.live.active_words`
+ - [ ] `opt.ranges`
+ - [ ] `opt.range_points`
+ - [ ] `opt.alloc.used_loc_words`
+ - [ ] `opt.alloc.spills`
+ - [ ] `opt.rewrite.reloads`
+ - [ ] `opt.rewrite.stores`
+
+### Phase 1: Pass-Local Liveness
+
+- [ ] Add an optimizer bitset abstraction with:
+ - [ ] set/clear/test
+ - [ ] copy
+ - [ ] union
+ - [ ] union-and-not
+ - [ ] change reporting
+ - [ ] set-bit iteration
+ - [ ] active-word tracking or trailing-zero trimming
+- [ ] Add pass-local block liveness storage.
+- [ ] Move block `use`/`def` calculation into the new liveness module.
+- [ ] Compute `live_in`/`live_out` without allocating conflicts.
+- [ ] Make pre-DDE use only block liveness.
+- [ ] Keep old full `opt_live_info` available for regalloc during transition.
+- [ ] Add dump output for block liveness.
+- [ ] Add tests for branch, loop, and call liveness.
+
+Exit criteria:
+
+- [ ] Existing O1 behavior is unchanged.
+- [ ] Pre-DDE no longer builds `val_conflicts`.
+- [ ] Metrics show separate block-liveness timing and no pre-DDE conflict bytes.
+
+### Phase 2: Live Range Model
+
+- [ ] Define `OptLiveRange` and `OptLiveRangeSet`.
+- [ ] Build ranges from block `live_out` with a backward block walk.
+- [ ] Track per-value:
+ - [ ] first range
+ - [ ] live length
+ - [ ] frequency/spill cost
+ - [ ] live-across-call information
+- [ ] Add compressed point numbering.
+- [ ] Add whole-block span representation or an equivalent compact model.
+- [ ] Add range dump output.
+- [ ] Add range metrics:
+ - [ ] total ranges
+ - [ ] total points before compression
+ - [ ] total points after compression
+ - [ ] max ranges per value
+ - [ ] max live length
+
+Exit criteria:
+
+- [ ] Range dumps match expected lifetimes on small hand-written cases.
+- [ ] Range construction works without using `Block.live_after`.
+- [ ] Current allocator still works while range data is built side by side.
+
+### Phase 3: Range-Based Assignment
+
+- [ ] Define `OptLoc` assignment mapping for every `Val`.
+- [ ] Build allocation candidate list from values with ranges.
+- [ ] Sort candidates by:
+ - [ ] tied hard-register requirement
+ - [ ] frequency/spill cost
+ - [ ] shorter live length
+ - [ ] stable value id
+- [ ] Add `used_locs[point]` bitmaps.
+- [ ] Mark hard-register live ranges in `used_locs`.
+- [ ] For each candidate, union occupied locations across its ranges.
+- [ ] Assign a legal hard register when available.
+- [ ] Assign/reuse stack slots when no hard register is available.
+- [ ] Preserve tied hard-register validation.
+- [ ] Preserve forbidden hard-register constraints from asm/calls.
+- [ ] Keep live-range splitting disabled for O1.
+
+Exit criteria:
+
+- [ ] O1 regalloc no longer depends on `val_conflicts`.
+- [ ] `opt.conflict_bytes` is zero or absent on the normal O1 allocation path.
+- [ ] Branch, loop, call-clobber, spill, and tied-register tests pass.
+
+### Phase 4: Rewrite From Assignment Map
+
+- [ ] Rewrite hard-assigned values to hard registers.
+- [ ] Insert reloads for stack-assigned uses.
+- [ ] Insert stores for stack-assigned defs.
+- [ ] Reserve target scratch registers per class.
+- [ ] Handle operands nested in:
+ - [ ] normal operands
+ - [ ] indirect operands
+ - [ ] call descriptors
+ - [ ] return descriptors
+ - [ ] intrinsic descriptors
+ - [ ] inline asm descriptors
+- [ ] Preserve call-clobber saves/restores with a rolling live set.
+- [ ] Delete noop moves after rewrite.
+- [ ] Avoid persistent per-instruction full `live_after` storage.
+- [ ] Add rewrite dump output.
+
+Exit criteria:
+
+- [ ] O1 emits correct code without `Block.live_after`.
+- [ ] Reload/store metrics are stable and understandable.
+- [ ] Existing O1 tests and smoke JIT tests pass.
+
+### Phase 5: Remove Old Dense Path
+
+- [ ] Remove normal-path allocation of `Func.val_conflicts`.
+- [ ] Remove or demote `opt_conflict_words` from persistent `Func` state.
+- [ ] Remove old all-values conflict construction from O1.
+- [ ] Keep only compatibility helpers still required by unported passes.
+- [ ] Update docs and metrics names to reflect range-based RA.
+
+Exit criteria:
+
+- [ ] Large straight-line benchmark scales by range/point count, not
+ `nvals * live_words`.
+- [ ] `doc/PERF.md` can be updated with before/after O1 scaling numbers.
+
+### Phase 6: O2 Coalescing
+
+- [ ] Collect move candidates after machinize.
+- [ ] Add move-only liveness using `scan_vars`-style restricted variables.
+- [ ] Build a capped conflict matrix only for move-related values.
+- [ ] Add metrics for:
+ - [ ] move candidates
+ - [ ] coalescing scan vars
+ - [ ] coalescing matrix bytes
+ - [ ] coalesced moves
+- [ ] Coalesce non-conflicting move groups.
+- [ ] Keep the matrix cap explicit and configurable at compile time.
+
+Exit criteria:
+
+- [ ] Coalescing is optional and does not affect O1 compile-time shape.
+- [ ] Matrix memory is bounded and reported.
+
+### Phase 7: O2 Splitting and Spill Placement
+
+- [ ] Add split-capable occupancy state similar to MIR's `busy_used_locs`.
+- [ ] Identify profitable live-range gaps.
+- [ ] Place spills/restores on block boundaries.
+- [ ] Split critical edges when required for edge placement.
+- [ ] Add spill-placement dumps.
+- [ ] Add metrics for split ranges and edge/block spill placements.
+
+Exit criteria:
+
+- [ ] O2 can reduce spills without changing O1 assignment behavior.
+- [ ] Edge/block placement tests cover diamonds, loops, and critical edges.
+
+### Phase 8: Cleanup and Documentation
+
+- [ ] Update `doc/OPT.md` with the final implemented module names and pass
+ order.
+- [ ] Update `doc/PERF.md` with new O1 scaling data.
+- [ ] Add a short allocator invariant section near the implementation.
+- [ ] Ensure pass dumps are stable enough for targeted regression tests.
+- [ ] Remove obsolete comments that describe the dense conflict allocator as
+ the normal O1 path.
+
+## Non-Goals For The First Rewrite
+
+Avoid spending early effort on:
+
+- micro-optimizing the existing full conflict matrix
+- adding sparse rows to `val_conflicts` as a long-term solution
+- tuning candidate priority before range-based allocation exists
+- link/JIT performance
+- parser/CG data-structure changes
+
+Those may still matter later, but they do not fix the current structural
+scaling issue.
+
+## Design Risks
+
+### Instruction Storage
+
+cfree's block instruction arrays make insertion more awkward than MIR's linked
+lists. We can handle this with rewrite output buffers initially. If combine,
+spill insertion, and edge splitting become too cumbersome, revisit instruction
+storage after the range allocator lands.
+
+### Arena Lifetime
+
+MIR explicitly destroys many pass structures. cfree's arenas make allocation
+cheap but can hide large transient memory. The new pass contexts should use a
+scratch arena or explicit freeable heap where transient memory is substantial.
+
+### Debuggability
+
+The rewrite should add stable dumps early:
+
+- block live sets
+- live ranges
+- allocation map
+- inserted reloads/stores
+- final rewritten IR
+
+Without dumps, range allocation bugs will be hard to diagnose.
+
+### Target Coupling
+
+MIR's rewrite relies heavily on target hooks. cfree should keep target
+knowledge behind explicit helpers:
+
+- hard-register class availability
+- caller-saved/callee-saved masks
+- scratch registers
+- memory-operand legality
+- stack-slot addressability
+
+## Bottom Line
+
+MIR's RA performance shape comes from avoiding a full pseudo interference graph
+in the normal allocation path. cfree should mirror that. The next serious O1
+work should be a structural rewrite around pass-local liveness, compressed live
+ranges, occupied-location assignment, and rewrite from an allocation map.
+
+Once that shape exists, incremental tuning will have a much better foundation.