kit

kit
git clone https://git.ryansepassi.com/git/kit.git
Log | Files | Refs | README

commit 1fc79efe18e7f925fad06ad0323ee4ea25edcc5f
parent 68c9946c896959ba7af3e72b6cefdedef6bf6caf
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Thu, 14 May 2026 18:34:49 -0700

RA plan

Diffstat:
Adoc/MIR_RA_REPORT.md | 740+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 740 insertions(+), 0 deletions(-)

diff --git a/doc/MIR_RA_REPORT.md b/doc/MIR_RA_REPORT.md @@ -0,0 +1,740 @@ +# MIR Data Structures and Register Allocation Report + +This report summarizes the data structures and algorithms in MIR that matter +most for cfree's optimizer direction. The emphasis is structural: cfree should +move toward MIR's allocation shape before spending more effort on small +localized speedups in the current dense-conflict implementation. + +Primary MIR source references: + +- `~/tmp/mir/mir-gen.c` +- `~/tmp/mir/mir.h` +- `~/tmp/mir/mir-bitmap.h` +- `~/tmp/mir/mir-varr.h` +- `~/tmp/mir/mir-dlist.h` + +Related cfree notes: + +- `doc/OPT.md`: broad MIR-informed optimizer plan. +- `doc/PERF.md`: current timing and scaling data. + +## Executive Summary + +MIR's register allocation pipeline is live-range based. It does not allocate by +building a full pseudo-register interference matrix. MIR computes block live +sets, converts them into compressed live ranges, assigns physical locations by +checking occupied hard-register and stack locations at live-range program +points, then rewrites instructions. + +cfree's current O1 path is conflict-matrix based. `opt_live_info` builds a +dense `nvals x words` pseudo interference matrix, stores per-instruction full +`live_after` bitsets, and scans `1..nvals` in hot loops. `doc/PERF.md` shows +this structure is the scaling problem for one large function. + +The main recommendation is to reshape cfree around MIR's allocation model: + +1. Separate block liveness from register-allocation analysis. +2. Build live ranges from block liveness. +3. Allocate against per-program-point occupied physical locations. +4. Keep conflict matrices only for bounded, optional coalescing. +5. Rewrite pseudos after assignment, with later live-range splitting for O2. + +This is a structural change, not an incremental optimization of the current +matrix. + +## MIR Container Layer + +MIR uses small generic data-structure building blocks throughout `mir-gen.c`. +They are simple but important because they determine how passes are organized. + +### VARR + +`mir-varr.h` defines typed growable arrays. A `VARR(T)` stores: + +- element count +- capacity +- element pointer +- allocator + +Growth uses `realloc` and increases capacity by about 1.5x. MIR uses `VARR` +for pass-owned vectors such as temporary instruction lists, live-range tables, +sorted allocation candidates, and per-point used-location bitmaps. + +cfree already has arena-backed arrays and typed vector macros in several +subsystems. We should not copy `VARR` mechanically, but we should mirror the +ownership pattern: pass state owns resizable vectors explicitly instead of +embedding all allocation artifacts permanently in `Func`. + +### DLIST + +`mir-dlist.h` defines intrusive typed doubly linked lists. MIR functions and +basic blocks use linked instruction lists because passes insert, delete, and +move instructions frequently: + +- function instruction list: `MIR_func.insns` +- CFG block list: `func_cfg.bbs` +- block instruction list: `bb.bb_insns` +- edge lists: `bb.in_edges`, `bb.out_edges` + +cfree currently stores each `Block` as a contiguous `Inst*` array. That is good +for replay and scanning, but less natural for MIR-style lowering, rewrite, +spill insertion, edge splitting, and post-RA combine. For the first structural +rewrite we can keep arrays if we introduce a pass-local edit buffer, but the +final shape should support cheap insertion around instructions and on block +boundaries. + +### Bitmap + +`mir-bitmap.h` is a growable sparse-ish bitmap built on `VARR<uint64_t>`. +Important details: + +- It expands only as far as the highest needed bit. +- `bitmap_clear` truncates length to zero. +- binary/ternary operations trim trailing zero words. +- iteration uses `FOREACH_BITMAP_BIT` to visit set bits. + +This is a key contrast with cfree's current fixed-width bitsets. In cfree, +`opt_live_words = ceil(nvals / 64)`, and every live set for the function has +that width. MIR's bitmaps still use dense words internally, but their active +length tracks actual high bits, and many loops iterate set bits. + +cfree should add an optimizer bitmap abstraction before rewriting RA. It should +support: + +- clear without touching max capacity +- copy/ior/and-not with change reporting +- set-bit iteration +- optional trimming of trailing zero words + +The exact storage can remain arena-backed for pass lifetime, but the API should +hide fixed full-function width from users. + +## MIR Function and CFG Shape + +MIR's generator owns a `gen_ctx` with pass subcontexts: + +- target context +- data-flow context +- SSA context +- GVN context +- live-range context +- coalesce context +- RA context +- combine context + +This keeps pass state out of the public MIR function structure. The current +function being generated stores a `func_cfg` in `func_item->data` during +generation. + +### Instructions + +`MIR_insn` is a variable-sized object: + +```c +struct MIR_insn { + void *data; + DLIST_LINK(MIR_insn_t) insn_link; + MIR_insn_code_t code : 32; + unsigned int nops : 32; + MIR_op_t ops[1]; +}; +``` + +`data` points to pass-local metadata, usually `bb_insn` after CFG construction. +This lets MIR keep the IR instruction object stable while attaching and +replacing analysis state. + +cfree's `Inst` is a fixed struct with side arrays for operands and aux data. +That is fine for recording, but pass state should move out of `Inst`, `Block`, +and `Func` where possible. In particular, liveness, ranges, allocation +assignment, and rewrite scratch state should be pass contexts, not persistent +fields scattered through `Func`. + +### Basic Blocks + +MIR `bb` contains: + +- CFG links: in/out edge lists +- instruction list +- flags and reachability +- four data-flow bitmaps: `in`, `out`, `gen`, `kill` +- dominator bitmaps reused by several passes +- loop-node pointer +- max integer and FP pressure + +MIR reuses the bitmap fields for different data-flow problems through local +macros such as `live_in`, `live_out`, `live_gen`, and `live_kill`. + +cfree currently stores `live_in`, `live_out`, `live_use`, `live_def`, and +`live_after` directly on `Block`. That ties block storage to one analysis. The +MIR-like direction is to keep persistent block structure focused on CFG and +instruction ownership, and keep data-flow bitmaps in pass-local arrays indexed +by block id. + +## MIR Pipeline Shape + +For the allocation-related portion, MIR runs: + +1. `target_machinize` +2. build loop tree for O1+ +3. optional move collection and coalescing for O2 +4. full live info +5. `reg_alloc` +6. rewrite +7. optional splitting for O2 +8. post-RA combine +9. post-combine DCE + +This split is more important than individual implementation details. In MIR, +liveness is a reusable input to multiple later phases, but register allocation +does not depend on a full pseudo interference graph. + +For cfree, O1 should become: + +```text +build_cfg +machinize +build_loop_tree +live_blocks +build_live_ranges +assign_locations(no splitting) +rewrite_locations +combine +dce +emit +``` + +O2 can add coalescing, splitting, SSA value passes, and richer cleanup after +this shape is stable. + +## MIR Liveness + +MIR's liveness is block-first: + +1. Walk each block backward to compute `live_gen` and `live_kill`. +2. Solve backward data-flow: + - `live_out = union(successor.live_in)` + - `live_in = live_gen | (live_out & ~live_kill)` +3. Track register frequency and block pressure while scanning. +4. Iterate set bits for live-out pressure accounting. + +MIR also has a `scan_vars` mode. Normally `scan_vars_num == 0`, meaning all +variables participate. For move coalescing, MIR can restrict liveness to only +move-related variables. + +cfree should copy this separation. The first replacement for `opt_live_info` +should not build: + +- `val_conflicts` +- per-instruction `live_after` +- spill/rewrite data + +It should only produce block live-in/live-out, use/def, frequency, and pressure +inputs. Later passes can build the extra views they need. + +## MIR Live Ranges + +MIR represents each variable as a linked list of live ranges: + +```c +struct live_range { + lr_bb_t lr_bb; + int start, finish; + int ref_cost; + live_range_t next; +}; +``` + +The range builder walks blocks backward from `live_out`, just like liveness, +but emits range start/end points. It also records whole-block ranges using +`lr_bb` for values that are live through a block without being referenced +inside it. + +MIR then compresses live-range points: + +- gather all born/dead points +- map old instruction points to a smaller point space +- merge adjacent ranges where legal + +This compression matters. It reduces the size of the later `used_locs` +structure and avoids making the allocator pay for every instruction position +when only lifetime boundaries matter. + +cfree should introduce: + +```c +typedef struct OptLiveRange { + Val val; + u32 start; + u32 end; + u32 next; + u32 first_block_span; /* optional: whole-block span list */ +} OptLiveRange; + +typedef struct OptLiveRangeSet { + OptLiveRange* ranges; + u32 nranges; + u32* first_range_by_val; + u32 point_count; +} OptLiveRangeSet; +``` + +The exact fields can change, but the structural point should not: allocation +should reason from ranges, not from a dense value-value conflict matrix. + +## MIR Register Allocation + +MIR allocation has three core pieces: + +### Candidate Ordering + +MIR builds `sorted_regs` from pseudo regs, computes live length from ranges, +and sorts allocation candidates. The priority includes tied hard-register +requirements, frequency, and live length. + +cfree already has a similar priority idea in `val_higher_priority`. The +problem is not ordering. The problem is what each candidate checks against. + +### Occupied Locations + +MIR maintains: + +- `used_locs`: vector of bitmaps indexed by compressed program point +- `busy_used_locs`: extra occupancy for split-capable allocation +- `conflict_locs1`: temporary bitmap of occupied physical locations + +For each pseudo range, MIR unions the occupied locations for all points in the +range into `conflict_locs`, then asks for a hard register or stack slot not in +that set. + +This is the central data-structure difference: + +- MIR: pseudo -> live ranges -> occupied hard locations by point +- cfree today: pseudo -> dense conflicts with other pseudos -> scan assigned + pseudos to test hard-register conflicts + +The MIR shape bounds allocation checks by range length and physical-location +bitmap width, not by all pseudo pairs. It is a better fit for large generated +functions where many values never overlap. + +### Rewrite + +After assignment, MIR rewrites instruction operands: + +- hard-register assigned pseudos become hard regs +- stack-assigned pseudos become loads/stores around uses/defs +- some spilled uses can become memory operands directly if target-legal +- call-clobbered live regs are saved/restored +- noop moves are deleted + +cfree's current rewrite is simpler and tightly coupled to `live_after`. A +MIR-like cfree rewrite should walk each block backward with a rolling live set, +using allocation assignment and target scratch registers. It should not require +persisting a full live-after bitmap for every instruction. + +## MIR Conflict Matrix Use + +MIR still has a conflict matrix, but only for coalescing. It is intentionally +narrow: + +- collect moves +- collect only source/destination regs from those moves +- cap the number at `MIR_MAX_COALESCE_VARS` +- build a matrix over this reduced scan-var space +- coalesce non-conflicting move-related regs + +The comment in MIR says 10K scan vars is about 8 MB for the coalescing matrix. +That cap is a design signal: even MIR avoids an unbounded full pseudo matrix. + +cfree should follow this model: + +- no full matrix in normal O1 allocation +- optional move-only matrix in O2 coalescing +- hard cap and metric counters for matrix size + +## Current cfree Mismatch + +The current O1 implementation has these structural mismatches: + +1. `opt_live_info` is overloaded. + It computes block liveness, per-value summaries, per-instruction live-after, + conflict matrix, call live-across summaries, and asm constraints in one pass. + +2. DDE pays for full allocation analysis. + The pre-DDE liveness call builds data needed only by regalloc. This is the + first obvious waste, but fixing only that is still incremental. + +3. Dense conflicts are the allocator's core dependency. + `Func.val_conflicts` is `nvals x opt_conflict_words`. This should not be a + fundamental `Func` field in the MIR-like design. + +4. Hot loops scan all vals. + Several loops walk `1..nvals` inside instruction processing. MIR mostly + iterates instruction operands or set bits. + +5. `Block.live_after` stores a full live bitmap per instruction. + This multiplies memory by instruction count and value count. MIR derives the + information needed for rewrite with backward walks. + +## Proposed cfree Structural Target + +Introduce pass-local optimizer state: + +```c +typedef struct OptBitset OptBitset; +typedef struct OptDataflow OptDataflow; +typedef struct OptLiveRanges OptLiveRanges; +typedef struct OptAllocator OptAllocator; +typedef struct OptRewrite OptRewrite; + +typedef struct OptPassCtx { + Compiler* c; + Func* f; + Arena* scratch; + OptDataflow live; + OptLiveRanges ranges; + OptAllocator alloc; +} OptPassCtx; +``` + +Move these out of persistent `Func` where possible: + +- `live_in`, `live_out`, `live_use`, `live_def` +- `live_after` +- `val_conflicts` +- allocation scratch arrays + +Keep these or equivalents as persistent-enough function metadata: + +- value type/class tables +- block/instruction ownership +- CFG edges +- frame slots +- target hard-register availability + +Allocation result should be a separate mapping: + +```c +typedef enum OptLocKind { + OPT_LOC_NONE, + OPT_LOC_HARD, + OPT_LOC_STACK, +} OptLocKind; + +typedef struct OptLoc { + u8 kind; + u8 cls; + Reg hard_reg; + FrameSlot spill_slot; +} OptLoc; +``` + +Rewrite consumes `OptLoc[val]` and emits final target operands. + +## Implementation Direction + +This is the recommended structural sequence. + +### Phase 1: Data Structure Extraction + +Add pass-local bitset and data-flow structures. Keep current behavior running, +but make new liveness code independent of `Block.live_*` fields. + +Deliverables: + +- `pass_live.c` owns block liveness storage. +- bitset supports set-bit iteration and trimmed operations. +- DDE uses block liveness only. +- metrics count live words, active words, block live bytes, and set-bit scans. + +### Phase 2: Live Range Builder + +Build live ranges from block live sets. Do not allocate from them yet. + +Deliverables: + +- range list per value +- compressed point numbering +- metrics for point count, range count, average ranges per value, max range + length, and whole-block spans +- dump output for range tests + +### Phase 3: Range-Based O1 Allocator + +Replace dense-conflict allocation with MIR-style occupied-location assignment. +No live-range splitting yet. + +Deliverables: + +- `used_locs[point]` bitmaps for hard regs and stack slots +- candidate ordering by tied hard reg, frequency, live length, stable id +- hard-register assignment when possible +- stack assignment otherwise +- no `val_conflicts` allocation in normal O1 + +### Phase 4: Rewrite Without `live_after` + +Rewrite by walking blocks and instructions with rolling live state plus +assignment maps. + +Deliverables: + +- hard-reg operand rewrite +- stack reload/store insertion +- scratch-reg handling by target class +- call-clobber preservation +- noop move deletion +- no required per-instruction full live-after storage + +### Phase 5: Coalescing and Splitting + +After O1 has the right shape, add O2 allocator features: + +- move collection +- move-only liveness +- capped conflict matrix for coalescing +- live-range splitting +- edge/block spill placement + +This mirrors MIR's division: the matrix is an optional coalescing tool, not the +allocator's main model. + +## Implementation Checklist + +Use this as the working tracker for the structural rewrite. Each phase should +land with focused tests and metrics before moving to the next phase. + +### Phase 0: Baseline and Guardrails + +- [ ] Preserve current O0 behavior as the reference path. +- [ ] Keep existing O1 tests green before structural changes begin. +- [ ] Add or identify stress inputs for: + - [ ] one large straight-line function + - [ ] many small functions + - [ ] branch liveness + - [ ] call-clobber preservation + - [ ] spills + - [ ] tied hard registers from inline asm +- [ ] Keep `cfree run --time -O1` metrics available throughout the rewrite. +- [ ] Add metrics names for the new shape before deleting old counters: + - [ ] `opt.live.blocks` + - [ ] `opt.live.active_words` + - [ ] `opt.ranges` + - [ ] `opt.range_points` + - [ ] `opt.alloc.used_loc_words` + - [ ] `opt.alloc.spills` + - [ ] `opt.rewrite.reloads` + - [ ] `opt.rewrite.stores` + +### Phase 1: Pass-Local Liveness + +- [ ] Add an optimizer bitset abstraction with: + - [ ] set/clear/test + - [ ] copy + - [ ] union + - [ ] union-and-not + - [ ] change reporting + - [ ] set-bit iteration + - [ ] active-word tracking or trailing-zero trimming +- [ ] Add pass-local block liveness storage. +- [ ] Move block `use`/`def` calculation into the new liveness module. +- [ ] Compute `live_in`/`live_out` without allocating conflicts. +- [ ] Make pre-DDE use only block liveness. +- [ ] Keep old full `opt_live_info` available for regalloc during transition. +- [ ] Add dump output for block liveness. +- [ ] Add tests for branch, loop, and call liveness. + +Exit criteria: + +- [ ] Existing O1 behavior is unchanged. +- [ ] Pre-DDE no longer builds `val_conflicts`. +- [ ] Metrics show separate block-liveness timing and no pre-DDE conflict bytes. + +### Phase 2: Live Range Model + +- [ ] Define `OptLiveRange` and `OptLiveRangeSet`. +- [ ] Build ranges from block `live_out` with a backward block walk. +- [ ] Track per-value: + - [ ] first range + - [ ] live length + - [ ] frequency/spill cost + - [ ] live-across-call information +- [ ] Add compressed point numbering. +- [ ] Add whole-block span representation or an equivalent compact model. +- [ ] Add range dump output. +- [ ] Add range metrics: + - [ ] total ranges + - [ ] total points before compression + - [ ] total points after compression + - [ ] max ranges per value + - [ ] max live length + +Exit criteria: + +- [ ] Range dumps match expected lifetimes on small hand-written cases. +- [ ] Range construction works without using `Block.live_after`. +- [ ] Current allocator still works while range data is built side by side. + +### Phase 3: Range-Based Assignment + +- [ ] Define `OptLoc` assignment mapping for every `Val`. +- [ ] Build allocation candidate list from values with ranges. +- [ ] Sort candidates by: + - [ ] tied hard-register requirement + - [ ] frequency/spill cost + - [ ] shorter live length + - [ ] stable value id +- [ ] Add `used_locs[point]` bitmaps. +- [ ] Mark hard-register live ranges in `used_locs`. +- [ ] For each candidate, union occupied locations across its ranges. +- [ ] Assign a legal hard register when available. +- [ ] Assign/reuse stack slots when no hard register is available. +- [ ] Preserve tied hard-register validation. +- [ ] Preserve forbidden hard-register constraints from asm/calls. +- [ ] Keep live-range splitting disabled for O1. + +Exit criteria: + +- [ ] O1 regalloc no longer depends on `val_conflicts`. +- [ ] `opt.conflict_bytes` is zero or absent on the normal O1 allocation path. +- [ ] Branch, loop, call-clobber, spill, and tied-register tests pass. + +### Phase 4: Rewrite From Assignment Map + +- [ ] Rewrite hard-assigned values to hard registers. +- [ ] Insert reloads for stack-assigned uses. +- [ ] Insert stores for stack-assigned defs. +- [ ] Reserve target scratch registers per class. +- [ ] Handle operands nested in: + - [ ] normal operands + - [ ] indirect operands + - [ ] call descriptors + - [ ] return descriptors + - [ ] intrinsic descriptors + - [ ] inline asm descriptors +- [ ] Preserve call-clobber saves/restores with a rolling live set. +- [ ] Delete noop moves after rewrite. +- [ ] Avoid persistent per-instruction full `live_after` storage. +- [ ] Add rewrite dump output. + +Exit criteria: + +- [ ] O1 emits correct code without `Block.live_after`. +- [ ] Reload/store metrics are stable and understandable. +- [ ] Existing O1 tests and smoke JIT tests pass. + +### Phase 5: Remove Old Dense Path + +- [ ] Remove normal-path allocation of `Func.val_conflicts`. +- [ ] Remove or demote `opt_conflict_words` from persistent `Func` state. +- [ ] Remove old all-values conflict construction from O1. +- [ ] Keep only compatibility helpers still required by unported passes. +- [ ] Update docs and metrics names to reflect range-based RA. + +Exit criteria: + +- [ ] Large straight-line benchmark scales by range/point count, not + `nvals * live_words`. +- [ ] `doc/PERF.md` can be updated with before/after O1 scaling numbers. + +### Phase 6: O2 Coalescing + +- [ ] Collect move candidates after machinize. +- [ ] Add move-only liveness using `scan_vars`-style restricted variables. +- [ ] Build a capped conflict matrix only for move-related values. +- [ ] Add metrics for: + - [ ] move candidates + - [ ] coalescing scan vars + - [ ] coalescing matrix bytes + - [ ] coalesced moves +- [ ] Coalesce non-conflicting move groups. +- [ ] Keep the matrix cap explicit and configurable at compile time. + +Exit criteria: + +- [ ] Coalescing is optional and does not affect O1 compile-time shape. +- [ ] Matrix memory is bounded and reported. + +### Phase 7: O2 Splitting and Spill Placement + +- [ ] Add split-capable occupancy state similar to MIR's `busy_used_locs`. +- [ ] Identify profitable live-range gaps. +- [ ] Place spills/restores on block boundaries. +- [ ] Split critical edges when required for edge placement. +- [ ] Add spill-placement dumps. +- [ ] Add metrics for split ranges and edge/block spill placements. + +Exit criteria: + +- [ ] O2 can reduce spills without changing O1 assignment behavior. +- [ ] Edge/block placement tests cover diamonds, loops, and critical edges. + +### Phase 8: Cleanup and Documentation + +- [ ] Update `doc/OPT.md` with the final implemented module names and pass + order. +- [ ] Update `doc/PERF.md` with new O1 scaling data. +- [ ] Add a short allocator invariant section near the implementation. +- [ ] Ensure pass dumps are stable enough for targeted regression tests. +- [ ] Remove obsolete comments that describe the dense conflict allocator as + the normal O1 path. + +## Non-Goals For The First Rewrite + +Avoid spending early effort on: + +- micro-optimizing the existing full conflict matrix +- adding sparse rows to `val_conflicts` as a long-term solution +- tuning candidate priority before range-based allocation exists +- link/JIT performance +- parser/CG data-structure changes + +Those may still matter later, but they do not fix the current structural +scaling issue. + +## Design Risks + +### Instruction Storage + +cfree's block instruction arrays make insertion more awkward than MIR's linked +lists. We can handle this with rewrite output buffers initially. If combine, +spill insertion, and edge splitting become too cumbersome, revisit instruction +storage after the range allocator lands. + +### Arena Lifetime + +MIR explicitly destroys many pass structures. cfree's arenas make allocation +cheap but can hide large transient memory. The new pass contexts should use a +scratch arena or explicit freeable heap where transient memory is substantial. + +### Debuggability + +The rewrite should add stable dumps early: + +- block live sets +- live ranges +- allocation map +- inserted reloads/stores +- final rewritten IR + +Without dumps, range allocation bugs will be hard to diagnose. + +### Target Coupling + +MIR's rewrite relies heavily on target hooks. cfree should keep target +knowledge behind explicit helpers: + +- hard-register class availability +- caller-saved/callee-saved masks +- scratch registers +- memory-operand legality +- stack-slot addressability + +## Bottom Line + +MIR's RA performance shape comes from avoiding a full pseudo interference graph +in the normal allocation path. cfree should mirror that. The next serious O1 +work should be a structural rewrite around pass-local liveness, compressed live +ranges, occupied-location assignment, and rewrite from an allocation map. + +Once that shape exists, incremental tuning will have a much better foundation.