opt: Phase 3 — IR-native recording, SSA construction, dry-run - kit

commit 32f88989b99113efd573a10ab332c1e7732b54ff
parent bf1dc146e22105508838aeb887215cd591aeb6fb
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Sun, 10 May 2026 10:24:27 -0700

opt: Phase 3 — IR-native recording, SSA construction, dry-run

Refactor opt_cgtarget to record CGTarget calls directly into the SSA
IR (Func/Block/Inst) instead of into a shadow tape. Each method lands
as exactly one Inst; replay walks the Func and re-issues calls onto
the wrapped target. Levels 1 and 2 share the recording and replay
path; level 2 additionally runs build_cfg + build_ssa as a dry-run
(output discarded) to shake out IR-shape bugs ahead of Phase 4's
lowering pipeline.

- src/opt/ir.h, src/opt/ir.c: extend IR with CGTarget-shape ops
  (IR_COPY, IR_ADDR_OF, IR_TLS_ADDR_OF, IR_CMP_BRANCH,
  IR_SCOPE_BEGIN/ELSE/END, IR_BREAK_TO/CONTINUE_TO,
  IR_BINOP/UNOP/CMP/CONVERT, IR_LOAD_IMM/CONST, etc.); switch
  Inst.opnds from Val* to Operand* (collapsing Reg ↔ Val per
  doc/OPT.md §5.1); add Func.emit_order so replay walks blocks in
  the order CG visited them (cmp_branch fallthrough blocks are
  created after a label_new'd block but must physically follow it).
- src/opt/opt.c: drop the tape; record straight into Func; replay
  allocates target Regs lazily on first def via target->alloc_reg.
- src/opt/pass_cfg.c: rebuild Block.preds from succ[]/nsucc.
- src/opt/pass_ssa.c: Cooper-Harvey-Kennedy iterative dominators,
  dominance frontiers, mem2reg phi insertion at iterated DF, dom-
  tree rename DFS. Promotable slots identified via OPK_LOCAL on
  IR_LOAD/STORE.
- test/cg/run.sh: default CFREE_OPT_LEVELS to "0 1 2".
- doc/OPT.md: revise the level dial — both levels build CFG + SSA
  and (post-Phase 4) emit through SSA → machinize → regalloc →
  emit; the level selects only the optimization schedule. Level 1
  is the minimal/bisection-floor set; level 2 is the full pipeline
  with IPO.

Corpus stays green at all three levels (1742 pass).

Diffstat:
M doc/OPT.md  | 225 +++++++++++++++++++++++++++++++++++++++++++++----------------------------------
A src/opt/ir.c  | 176 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M src/opt/ir.h  | 353 ++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------
M src/opt/opt.c  | 2325 ++++++++++++++++++++++++++++++-------------------------------------------------
A src/opt/pass_cfg.c  | 116 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A src/opt/pass_ssa.c  | 427 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M test/cg/run.sh  | 60 +++++++++++++++++++++++++++++++++++++-----------------------

7 files changed, 1966 insertions(+), 1716 deletions(-)
diff --git a/doc/OPT.md b/doc/OPT.md
@@ -19,49 +19,63 @@ and builds out to a real intra+IPO pipeline.
 
 ## 1. What working opt must look like
 
-Two distinct replay engines, sharing one recording front-end.
-
-### 1.1 Level-1 (function-at-a-time, no SSA)
+A single shared engine. CG drives `opt_cgtarget`; every CGTarget call
+lands as exactly one `Inst` in a per-function flat-CFG IR (§5.1). On
+`func_end` (intra-procedural) and `finalize` (inter-procedural), the
+wrapper runs an optimization schedule, lowers through machinize →
+regalloc → emit, and drives the wrapped target CGTarget to produce
+machine code.
 
 ```
-parse → cg → opt_cgtarget {record into Func} → on func_end:
-            optionally rewrite the recorded tape (peephole, constfold)
-            replay each Inst back as the matching CGTarget call →
-            wrapped target → MCEmitter → ObjBuilder
+parse → cg → opt_cgtarget {record into Func} →
+        on func_end: build_cfg → build_ssa → <intra schedule> → <lower>
+        on finalize: <inter schedule> → for each dirty Func: <lower>
+                  → wrapped target CGTarget → MCEmitter → ObjBuilder
+
+  <lower> = make_conventional_ssa → ssa_combine → undo_ssa →
+            machinize → live_info → coalesce → regalloc → combine →
+            dce → opt_emit
 ```
 
-Replay is 1:1: each recorded `Inst` corresponds to one `target->method(...)`
-call. No phis, no register allocation, no IR-level CFG transforms that
-would break the correspondence. This preserves `doc/DESIGN.md` §8's
-"function-at-a-time" streaming guarantee at -O1.
+The level dial selects only the optimization schedule; everything
+else is shared.
 
-### 1.2 Level-2 (TU-buffered, SSA, IPO, full lowering)
+### 1.1 Level 1 — minimal
 
-```
-parse → cg → opt_cgtarget {record into Func} → on func_end:
-            store Func; do nothing else.
-        on cgtarget_finalize:
-            for each Func: build_cfg, build_ssa (incl. mem2reg), gvn,
-                copy_prop, dse, ssa_dce, licm, pressure_relief,
-                make_conventional_ssa, ssa_combine, undo_ssa, jump_opt
-            opt_inline + opt_cleanup (bounded iterations, -finline-iters=N)
-            for each Func: machinize, live_info, coalesce, regalloc,
-                combine, dce, opt_emit → wrapped target → MCEmitter
-```
+Intra: `build_cfg`, `build_ssa` (mem2reg), `jump_opt`. No GVN, no
+LICM, no DSE.
+Inter: none (no inlining; no cleanup iteration).
+
+Just enough to land code through the SSA pipeline at quality
+comparable to direct `CGTarget` lowering. The point of level 1 is
+isolation: when a bug reproduces here, it is in IR construction,
+SSA, or lowering — not in any of the value-changing passes. It is
+the working bisection floor for the rest of the pipeline.
 
-The level-2 path *cannot* preserve the 1:1 record→replay correspondence
-once SSA/regalloc has run — the lowering pipeline is the only path
-back to the wrapped CGTarget. This split is the single biggest
-load-bearing decision in the design.
+### 1.2 Level 2 — full
+
+Intra (per `doc/DESIGN.md` §9.2): `build_cfg`, `build_ssa` (mem2reg),
+`gvn` (incl. constprop, redundant-load elim), `copy_prop` (incl.
+redundant-extension elim), `dse`, `ssa_dce`, `jump_opt`,
+`build_loop_tree` + `licm`, `addr_xform`, `block_cloning`,
+`pressure_relief`.
+Inter: `opt_inline` + `opt_cleanup` (bounded by `-finline-iters=N`,
+default 1).
 
 ### 1.3 Equivalence we commit to
 
 For every case in `test/cg/CORPUS.md` Groups A–Q, building with
-`opt_cgtarget_new(c, target, level)` for `level ∈ {1, 2}` must produce
-the same `test_main` exit code as building against the AArch64 target
-directly. DWARF (W path) equivalence is weaker — opt at level 2 may
-collapse line rows or move locations into loclists — but Group P at
-level 0/1 must be byte-identical.
+`opt_cgtarget_new(c, target, level)` for `level ∈ {1, 2}` must
+produce the same `test_main` exit code as building against the
+AArch64 target directly. Both levels emit through the SSA →
+machinize → regalloc → emit path, so neither is byte-equivalent to
+level 0 — exit-code equivalence is the contract.
+
+DWARF (W path) equivalence is weaker. Level 1 aims for row-by-row
+parity with level 0 in Group P; level 2 may collapse line rows or
+move locations into loclists when value-changing passes fire.
+Neither is a hard contract until the DWARF cross-level harness lands
+(§3 Phase 5+).
 
 ---
 
@@ -126,8 +140,10 @@ level 0/1 must be byte-identical.
 
 Each phase ends at a green test surface. Phases 0–3 are reversible —
 strip the wrapper back to identity replay and the system still works.
-Phase 4 is the wedge: once SSA/regalloc is in the pipeline, you need
-the full lowering path to get bytes back.
+Phase 4 is the wedge: once the SSA → lowering pipeline is the only
+path to bytes, both levels go through it. Levels 1 and 2 share the
+pipeline from Phase 4 onward and diverge only in which optimization
+passes the schedule includes (§1.1, §1.2).
 
 ### Phase 0 — wrapper skeleton + equivalence harness
 
@@ -152,10 +168,14 @@ Harness:
 - Add `--opt-level N` flag to `test/cg/harness/cg_runner.c`. When
   `N > 0`, wrap the constructed `CGTarget*` with
   `opt_cgtarget_new(c, target, N)` before running the case.
-- `test/cg/run.sh`: each case in Groups A–Q runs at `--opt-level 0`
-  and `--opt-level 1`; exit codes must match. Add `--opt-level 2`
-  once Phase 4 lands. Group P (DWARF/W path) stays at level 0 only
-  for now — opt-level equivalence on DWARF is a Phase 5+ concern.
+- `test/cg/run.sh`: each case in Groups A–Q runs at `--opt-level 0`,
+  `1`, and `2`; exit codes must match across all three. Through
+  Phase 3, levels 1 and 2 share the recorder's 1:1 replay path with
+  level 2 additionally exercising `build_cfg` + `build_ssa` as a
+  dry-run (output discarded). Phase 4 promotes both levels to the
+  shared SSA → lower path. Group P (DWARF/W path) stays at level 0
+  only for now — opt-level equivalence on DWARF is a Phase 5+
+  concern.
 
 Tests: full Groups A–Q D/R/E/J pass at `--opt-level 0` and
 `--opt-level 1`. Any divergence means the forwarding lost data — fix
@@ -222,30 +242,35 @@ disassembly diff against `--opt-level 0`.
 ### Phase 3 — SSA construction, dry-run
 
 Goal: build SSA without consuming it. Catches IR-shape bugs before
-they matter.
+they matter — Phase 4's lowering pipeline depends on the SSA being
+correct, so we shake out construction bugs separately.
 
 - `src/opt/pass_cfg.c`: `opt_build_cfg` — derives
   `Block.preds`/`succ`/`nsucc` from terminators (`IR_BR`, `IR_CONDBR`,
-  `IR_RET`, `IR_LONGJMP`, `IR_INTRINSIC{TRAP,UNREACHABLE}`).
-  `IR_SETJMP` is a control barrier — splits its block but is not a
-  terminator (control falls through).
+  `IR_CMP_BRANCH`, `IR_RET`, `IR_LONGJMP`, `IR_BREAK_TO`,
+  `IR_CONTINUE_TO`, `IR_INTRINSIC{TRAP,UNREACHABLE}`). `IR_SETJMP` is
+  a control barrier — splits its block but is not a terminator
+  (control falls through).
 - `src/opt/pass_ssa.c`: `opt_build_ssa` — standard dominance-frontier
   algorithm; promotes any `FrameSlot` whose `FSF_ADDR_TAKEN` bit was
   never set (mem2reg folded in per `doc/DESIGN.md` §12). Inserts
   `IR_PHI` instructions with `IRPhiAux` populated.
-- At level 2 these run on `func_end` but their output is *discarded*
-  before replay. Goal is "no panics on the corpus", not "improved
+- These run at both levels on `func_end`, but their output is
+  *discarded* before replay until Phase 4 lands the lowering path.
+  Goal at this phase is "no panics on the corpus", not "improved
   code".
 
-Tests: at level 2, recorder runs `build_cfg` + `build_ssa` then
-forces level-1 replay. Corpus stays green; any panic is a real bug
-(unhandled IROp, malformed CFG from goto/switch lowering, address-
-taken-detection miss).
+Tests: at levels 1 and 2, recorder runs `build_cfg` + `build_ssa`
+then falls back to the recorder's 1:1 replay. Corpus stays green;
+any panic is a real bug (unhandled IROp, malformed CFG from
+goto/switch lowering, address-taken-detection miss).
 
 ### Phase 4 — lowering pipeline (the wedge)
 
-Goal: real level-2 path: SSA → machinize → regalloc → emit. From
-this point, level-2 cannot fall back to level-1 replay.
+Goal: SSA → make_conventional_ssa → undo_ssa → machinize →
+live_info → coalesce → regalloc → combine → dce → opt_emit, replacing
+the recorder's 1:1 replay at *both* levels. From this point, neither
+level falls back to record→replay.
 
 - `src/opt/pass_machinize.c`: `opt_machinize(Func*, Target)` — ABI
   lowering (calls into `TargetABI` for argument/return part
@@ -259,53 +284,59 @@ this point, level-2 cannot fall back to level-1 replay.
   target for clobbers and the param/sret physical mapping).
 - `src/opt/pass_emit.c`: `opt_emit(Compiler*, Func*, CGTarget*)` —
   walks the lowered IR and drives the wrapped target via the
-  emit-side CGTarget surface. This is symmetric with `replay.c` from
-  Phase 1: Inst → target call. Difference is that operands are now
-  physical Operands (post-RA) and prolog/epilog/spill insertion has
-  happened.
-- Wrapper's `finalize` at level 2 runs the per-Func intra pipeline,
-  then for each Func runs the lowering pipeline, then calls
-  `target->finalize(target)`.
-
-Initial pass set inside the intra pipeline is *empty* — `build_ssa`
-followed immediately by `undo_ssa`/`make_conventional_ssa`. The
-lowering pipeline does the real work. This isolates the lowering bugs
-from the optimization bugs.
-
-Tests: `--opt-level 2` corpus is green. Equivalence harness extended
-to cover level 2.
-
-Exit criterion: level-2 corpus green, no regressions in the level-1
-or level-0 suites.
-
-### Phase 5 — intra-procedural passes, one at a time
-
-Goal: enable real optimizations. Each pass is independently
-toggleable for bisection.
-
-Order (tracking `doc/DESIGN.md` §9.2):
-
-1. `opt_gvn` (with constprop, redundant-load elim folded in).
-2. `opt_copy_prop` (with redundant-extension elim).
-3. `opt_dse`.
-4. `opt_ssa_dce`.
-5. `opt_jump_opt`.
-6. `opt_build_loop_tree` + `opt_licm`.
-7. `opt_addr_xform`.
-8. `opt_block_cloning`.
-9. `opt_pressure_relief`.
-10. Lowering-time additions: `opt_coalesce`, `opt_combine`,
-    `opt_dce` (post-RA), live-range splitting in `opt_regalloc`.
-
-Each pass lands behind a flag so the equivalence harness can run with
-just-this-pass to localize bugs. No UB-exploiting transformations
-(`doc/DESIGN.md` §9): no signed-overflow-is-unreachable, no
-shift-by-≥-width-is-unreachable, no division-by-zero-is-unreachable,
-no null-deref-is-unreachable.
-
-### Phase 6 — inter-procedural
-
-Goal: cross-function inlining + cleanup iteration.
+  emit-side CGTarget surface. Inst → target call, but operands are
+  now physical Operands (post-RA) and prolog/epilog/spill insertion
+  has happened.
+- Wrapper's `func_end` runs `build_cfg` + `build_ssa` +
+  `make_conventional_ssa` + `undo_ssa` + the lowering pipeline at
+  both levels. No optimization passes between SSA build and undo at
+  this phase — the lowering pipeline does all the work, so we
+  isolate lowering bugs from optimization-pass bugs. Level 1 and
+  level 2 are functionally identical at the end of Phase 4.
+
+Tests: `--opt-level 1` and `--opt-level 2` corpora green. The
+recorder's old 1:1 replay path is removed; SSA → lower is the only
+path to bytes.
+
+Exit criterion: levels 1 and 2 both green, no regressions in the
+level-0 suite.
+
+### Phase 5 — intra-procedural passes, level-gated
+
+Goal: populate the optimization schedule. All new passes land in the
+*level-2* schedule only; level 1's schedule stays at the Phase 4 set
+(`build_cfg`, `build_ssa`, `jump_opt` once it's wired in) so it
+remains the bisection floor for IR/SSA/lowering bugs.
+
+Order (tracking `doc/DESIGN.md` §9.2; each pass lands behind a flag
+so the equivalence harness can run with just-this-pass to localize
+bugs):
+
+1. `opt_jump_opt` — moves into the level-1 schedule too (cheap,
+   high-value, doesn't change values).
+2. `opt_gvn` (with constprop, redundant-load elim folded in) —
+   level 2 only.
+3. `opt_copy_prop` (with redundant-extension elim) — level 2 only.
+4. `opt_dse` — level 2 only.
+5. `opt_ssa_dce` — level 2 only.
+6. `opt_build_loop_tree` + `opt_licm` — level 2 only.
+7. `opt_addr_xform` — level 2 only.
+8. `opt_block_cloning` — level 2 only.
+9. `opt_pressure_relief` — level 2 only.
+10. Lowering-time additions (run at both levels): `opt_coalesce`,
+    `opt_combine`, `opt_dce` (post-RA), live-range splitting in
+    `opt_regalloc`.
+
+No UB-exploiting transformations (`doc/DESIGN.md` §9): no
+signed-overflow-is-unreachable, no shift-by-≥-width-is-unreachable,
+no division-by-zero-is-unreachable, no null-deref-is-unreachable.
+
+### Phase 6 — inter-procedural (level 2 only)
+
+Goal: cross-function inlining + cleanup iteration. Level 1 stays
+intra-procedural — the inliner is the largest source of correctness
+risk and the largest divergence from level-0 codegen, so it earns
+its own gating.
 
 - `opt_inline`: bottom-up call-graph walk. SCCs (mutual recursion)
   skipped. Heuristic: instruction count + call site count. Inlining
diff --git a/src/opt/ir.c b/src/opt/ir.c
@@ -0,0 +1,176 @@
+/* ir.c — Func/Block/Inst plumbing for the SSA IR (doc/OPT.md §1).
+ *
+ * Each CGTarget call recorded by opt_cgtarget produces exactly one Inst
+ * (or none, for pure bookkeeping calls like alloc_reg). Storage is per-
+ * Func arena, allocated against c->tu so the Func survives until
+ * cgtarget_finalize.
+ *
+ * Invariants:
+ *   - VAL_NONE (= 0) is reserved; first allocated Val is 1.
+ *   - val_def_block / val_def_inst / val_type / val_cls are parallel
+ *     arrays indexed by Val.
+ *   - Inst.opnds is Operand[] (not Val[]): Reg/Val are collapsed
+ *     (doc/OPT.md §5.1) and OPK_REG operands' v.reg field IS the Val
+ *     used at this site. Other OpKinds (IMM/LOCAL/GLOBAL/INDIRECT) are
+ *     not Val uses for SSA dataflow.
+ */
+
+#include "opt/ir.h"
+
+#include <string.h>
+
+#include "core/arena.h"
+#include "core/core.h"
+
+/* ---- val table ---- */
+
+static void val_table_grow(Func* f, u32 needed) {
+  if (needed <= f->vals_cap) return;
+  u32 ncap = f->vals_cap ? f->vals_cap : 16u;
+  while (ncap < needed) ncap *= 2u;
+  u32* nb_blk = arena_zarray(f->arena, u32, ncap);
+  u32* nb_ins = arena_zarray(f->arena, u32, ncap);
+  const Type** nb_ty = arena_zarray(f->arena, const Type*, ncap);
+  u8* nb_cls = arena_zarray(f->arena, u8, ncap);
+  if (f->nvals) {
+    memcpy(nb_blk, f->val_def_block, sizeof(u32) * f->nvals);
+    memcpy(nb_ins, f->val_def_inst, sizeof(u32) * f->nvals);
+    memcpy(nb_ty, f->val_type, sizeof(const Type*) * f->nvals);
+    memcpy(nb_cls, f->val_cls, sizeof(u8) * f->nvals);
+  }
+  f->val_def_block = nb_blk;
+  f->val_def_inst = nb_ins;
+  f->val_type = nb_ty;
+  f->val_cls = nb_cls;
+  f->vals_cap = ncap;
+}
+
+Val ir_alloc_val(Func* f, const Type* t, u8 cls) {
+  Val v;
+  if (f->nvals == 0) {
+    val_table_grow(f, 16);
+    f->nvals = 1; /* reserve slot 0 for VAL_NONE */
+  }
+  if (f->nvals == f->vals_cap) val_table_grow(f, f->nvals + 1);
+  v = f->nvals++;
+  f->val_def_block[v] = 0;
+  f->val_def_inst[v] = 0;
+  f->val_type[v] = t;
+  f->val_cls[v] = cls;
+  return v;
+}
+
+/* ---- blocks ---- */
+
+u32 ir_block_new(Func* f) {
+  Block* b;
+  if (f->nblocks == f->blocks_cap) {
+    u32 ncap = f->blocks_cap ? f->blocks_cap * 2u : 8u;
+    Block* nb = arena_zarray(f->arena, Block, ncap);
+    if (f->blocks) memcpy(nb, f->blocks, sizeof(Block) * f->nblocks);
+    f->blocks = nb;
+    f->blocks_cap = ncap;
+  }
+  b = &f->blocks[f->nblocks];
+  memset(b, 0, sizeof *b);
+  b->id = f->nblocks;
+  return f->nblocks++;
+}
+
+/* ---- emit order ---- */
+
+void ir_note_emit(Func* f, u32 block) {
+  /* Linear scan: emit_order is small in practice (one entry per
+   * placed block, dozens at most for the corpus) and we only ever
+   * append, so a hash table would be overkill. */
+  for (u32 i = 0; i < f->emit_order_n; ++i)
+    if (f->emit_order[i] == block) return;
+  if (f->emit_order_n == f->emit_order_cap) {
+    u32 ncap = f->emit_order_cap ? f->emit_order_cap * 2u : 8u;
+    u32* nb = arena_array(f->arena, u32, ncap);
+    if (f->emit_order)
+      memcpy(nb, f->emit_order, sizeof(u32) * f->emit_order_n);
+    f->emit_order = nb;
+    f->emit_order_cap = ncap;
+  }
+  f->emit_order[f->emit_order_n++] = block;
+}
+
+/* ---- inst append ---- */
+
+Inst* ir_emit(Func* f, u32 block, IROp op) {
+  Block* b = &f->blocks[block];
+  Inst* in;
+  if (b->ninsts == b->cap) {
+    u32 ncap = b->cap ? b->cap * 2u : 8u;
+    Inst* nb = arena_zarray(f->arena, Inst, ncap);
+    if (b->insts) memcpy(nb, b->insts, sizeof(Inst) * b->ninsts);
+    b->insts = nb;
+    b->cap = ncap;
+  }
+  in = &b->insts[b->ninsts++];
+  memset(in, 0, sizeof *in);
+  in->op = (u16)op;
+  return in;
+}
+
+/* ---- frame slots / params ---- */
+
+FrameSlot ir_frame_slot_new(Func* f, const FrameSlotDesc* d) {
+  IRFrameSlot* s;
+  FrameSlot id;
+  if (f->nframe_slots == f->frame_slots_cap) {
+    u32 ncap = f->frame_slots_cap ? f->frame_slots_cap * 2u : 8u;
+    IRFrameSlot* nb = arena_zarray(f->arena, IRFrameSlot, ncap);
+    if (f->frame_slots)
+      memcpy(nb, f->frame_slots, sizeof(IRFrameSlot) * f->nframe_slots);
+    f->frame_slots = nb;
+    f->frame_slots_cap = ncap;
+  }
+  id = (FrameSlot)(f->nframe_slots + 1);
+  s = &f->frame_slots[f->nframe_slots++];
+  s->id = id;
+  s->type = d->type;
+  s->name = d->name;
+  s->loc = d->loc;
+  s->size = d->size;
+  s->align = d->align;
+  s->kind = d->kind;
+  s->flags = d->flags;
+  return id;
+}
+
+void ir_param_add(Func* f, const CGParamDesc* d) {
+  IRParam* p;
+  if (f->nparams == f->params_cap) {
+    u32 ncap = f->params_cap ? f->params_cap * 2u : 4u;
+    IRParam* nb = arena_zarray(f->arena, IRParam, ncap);
+    if (f->params) memcpy(nb, f->params, sizeof(IRParam) * f->nparams);
+    f->params = nb;
+    f->params_cap = ncap;
+  }
+  p = &f->params[f->nparams++];
+  p->index = d->index;
+  p->name = d->name;
+  p->type = d->type;
+  p->slot = d->slot;
+  p->abi = d->abi;
+  p->loc = d->loc;
+}
+
+/* ---- construction ---- */
+
+Func* ir_func_new(Compiler* c, const CGFuncDesc* desc) {
+  Func* f = arena_znew(c->tu, Func);
+  f->arena = c->tu;
+  f->desc = *desc;
+  f->name = desc->sym;
+  f->type = desc->fn_type;
+  /* Reserve slot 0 of the val table eagerly so the very first ir_alloc_val
+   * returns Val=1. */
+  val_table_grow(f, 16);
+  f->nvals = 1;
+  /* Caller is expected to ir_block_new(f) for entry, then assign
+   * f->entry. */
+  return f;
+}
diff --git a/src/opt/ir.h b/src/opt/ir.h
@@ -6,122 +6,104 @@
 #include "core/core.h"
 #include "type/type.h"
 
+/* SSA value id. Identical to the Reg space (doc/OPT.md §5.1): when an
+ * Operand has kind OPK_REG, v.reg names the Val it carries. VAL_NONE=0
+ * is reserved as a sentinel. */
 typedef u32 Val;
 #define VAL_NONE 0u
 
+/* IROps mirror the CGTarget surface 1:1 plus a handful of SSA-only ops
+ * (IR_PHI, IR_CONST_I, IR_CONST_BYTES). Each CGTarget method records as
+ * exactly one Inst, so level-1 replay is a flat walk that re-issues
+ * each Inst as one wrapped target call. doc/OPT.md §1.1. */
 typedef enum IROp {
   IR_NOP,
+  /* SSA-only constants (used by const-propagation; recording uses
+   * load_imm/load_const which become IR_LOAD_IMM/IR_LOAD_CONST). */
   IR_CONST_I,
   IR_CONST_BYTES,
-  IR_PARAM,
-  IR_ALLOCA,
-  IR_LOAD,
-  IR_STORE,
-  IR_AGG_COPY,
-  IR_AGG_SET,
-  IR_BITFIELD_LOAD,
-  IR_BITFIELD_STORE,
-  IR_IADD,
-  IR_ISUB,
-  IR_IMUL,
-  IR_SDIV,
-  IR_UDIV,
-  IR_SREM,
-  IR_UREM,
-  IR_FADD,
-  IR_FSUB,
-  IR_FMUL,
-  IR_FDIV,
-  IR_AND,
-  IR_OR,
-  IR_XOR,
-  IR_SHL,
-  IR_ASHR,
-  IR_LSHR,
-  IR_NEG,
-  IR_BNOT,
-  IR_CMP_EQ,
-  IR_CMP_NE,
-  IR_CMP_SLT,
-  IR_CMP_SLE,
-  IR_CMP_ULT,
-  IR_CMP_ULE,
-  IR_CMP_FLT,
-  IR_CMP_FLE,
-  IR_CMP_FEQ,
-  IR_CMP_FNE,
-  IR_SEXT,
-  IR_ZEXT,
-  IR_TRUNC,
-  IR_BITCAST,
-  IR_SITOFP,
-  IR_UITOFP,
-  IR_FPTOSI,
-  IR_FPTOUI,
-  IR_FPEXT,
-  IR_FPTRUNC,
-  IR_GEP,
+
+  /* Param/frame declarations recorded from CGTarget.param. The frame
+   * slot id table lives separately on Func; this op records the
+   * sequence point so replay can re-issue target->param in order. */
+  IR_PARAM_DECL,
+
+  /* Address-bearing data movement. */
+  IR_LOAD_IMM,    /* opnds[0] dst REG; extra.imm = imm */
+  IR_LOAD_CONST,  /* opnds[0] dst REG; extra.cbytes */
+  IR_COPY,        /* opnds[0] dst REG, opnds[1] src REG */
+  IR_LOAD,        /* opnds[0] dst REG, opnds[1] addr; extra.mem */
+  IR_STORE,       /* opnds[0] addr, opnds[1] src REG|IMM; extra.mem */
+  IR_ADDR_OF,     /* opnds[0] dst REG, opnds[1] lv (LOCAL|GLOBAL|INDIRECT) */
+  IR_TLS_ADDR_OF, /* opnds[0] dst REG; extra.aux = IRTlsAux */
+  IR_AGG_COPY,    /* opnds[0] dst, opnds[1] src; extra.aux = IRAggAux */
+  IR_AGG_SET,     /* opnds[0] dst, opnds[1] byte; extra.aux = IRAggAux */
+  IR_BITFIELD_LOAD,  /* opnds[0] dst REG, opnds[1] record; extra.aux */
+  IR_BITFIELD_STORE, /* opnds[0] record, opnds[1] src; extra.aux */
+
+  /* Arithmetic / cmp / convert. opnds[0] dst REG, [1] a, optionally [2] b.
+   * extra.imm carries the BinOp/UnOp/CmpOp/ConvKind tag. */
+  IR_BINOP,
+  IR_UNOP,
+  IR_CMP,
+  IR_CONVERT,
+
+  /* Calls. extra.aux = IRCallAux (see below). defs = result Vals. */
   IR_CALL,
+
+  /* Phis. extra.aux = IRPhiAux. */
   IR_PHI,
-  IR_BR,
-  IR_CONDBR,
-  IR_RET,
-  IR_ATOMIC_LOAD,
-  IR_ATOMIC_STORE,
-  IR_ATOMIC_RMW, /* extra.imm encodes (AtomicOp << 8) | MemOrder */
-  IR_ATOMIC_CAS, /* extra.imm encodes (success << 8) | failure   */
-  IR_FENCE,      /* extra.imm = MemOrder */
-  IR_VA_START,
-  IR_VA_ARG,
-  IR_VA_END,
-  IR_VA_COPY,
-  IR_SETJMP,    /* returns-twice; opt treats as control barrier */
-  IR_LONGJMP,   /* terminator-like; control does not return */
-  IR_ASM_BLOCK, /* opaque to most passes; preserves order, defines outs,
-                   clobbers */
-  IR_INTRINSIC, /* extra.imm = IntrinKind; multi-result for *_OVERFLOW */
-} IROp;
 
-typedef struct IRCallAux {
-  const Type* fn_type;
-  const ABIFuncInfo* abi;
-  ObjSymId direct_sym; /* OBJ_SYM_NONE for indirect */
-  Val callee;          /* VAL_NONE for direct_sym calls */
-  u32 nargs;
-  Val* args;
-  u32 nresults;
-  Val* results; /* ABI return parts and multi-result builtins */
-  CGABIValue ret_abi;
-} IRCallAux;
+  /* Control flow / scopes. */
+  IR_BR,         /* unconditional. block.succ[0] = target block id. */
+  IR_CONDBR,     /* opnds[0] cond REG; succ[0] = true, succ[1] = false. */
+  IR_CMP_BRANCH, /* fused. opnds = [a, b]; extra.imm = CmpOp;
+                    succ[0] = taken, succ[1] = fallthrough. */
+  IR_RET,        /* extra.aux = IRRetAux* (NULL for void). */
+  IR_SCOPE_BEGIN,   /* extra.aux = IRScopeAux. defs[0] = scope id Val. */
+  IR_SCOPE_ELSE,    /* extra.imm = scope id (Val). */
+  IR_SCOPE_END,     /* extra.imm = scope id (Val). */
+  IR_BREAK_TO,      /* extra.imm = scope id (Val). */
+  IR_CONTINUE_TO,   /* extra.imm = scope id (Val). */
 
-typedef struct IRFrameSlot {
-  FrameSlot id;
-  const Type* type;
-  Sym name;
-  SrcLoc loc;
-  u32 size;
-  u32 align;
-  u8 kind; /* FrameSlotKind */
-  u8 pad;
-  u16 flags; /* FrameSlotFlag */
-} IRFrameSlot;
+  /* alloca / variadics. */
+  IR_ALLOCA,    /* opnds = [dst REG, size]; extra.imm = align */
+  IR_VA_START,  /* opnds = [ap] */
+  IR_VA_ARG,    /* opnds = [dst REG, ap]; extra.aux = const Type* */
+  IR_VA_END,    /* opnds = [ap] */
+  IR_VA_COPY,   /* opnds = [dst, src] */
 
-typedef struct IRParam {
-  u32 index;
-  Sym name;
-  const Type* type;
-  FrameSlot slot;
-  const ABIArgInfo* abi;
-  SrcLoc loc;
-} IRParam;
+  /* setjmp/longjmp. */
+  IR_SETJMP,    /* opnds = [dst REG, buf] */
+  IR_LONGJMP,   /* opnds = [buf, val]; (terminator-like, no fallthrough) */
 
-typedef struct IRMemAux {
-  MemAccess mem;
-} IRMemAux;
+  /* Atomics. */
+  IR_ATOMIC_LOAD,   /* opnds = [dst, addr]; extra.aux = IRAtomicAux */
+  IR_ATOMIC_STORE,  /* opnds = [addr, src]; extra.aux = IRAtomicAux */
+  IR_ATOMIC_RMW,    /* opnds = [dst, addr, val]; extra.aux */
+  IR_ATOMIC_CAS,    /* defs = [prior, ok]; extra.aux = IRCasAux */
+  IR_FENCE,         /* extra.imm = MemOrder */
+
+  /* Inline asm. extra.aux = IRAsmAux. */
+  IR_ASM_BLOCK,
 
-typedef struct IRAggregateAux {
+  /* Compiler intrinsics (see arch.h IntrinKind). extra.aux = IRIntrinAux. */
+  IR_INTRINSIC,
+
+  /* set_loc is *not* an IR op — SrcLoc is sticky on Inst.loc, applied at
+   * recording time from the wrapper's pending_loc. */
+} IROp;
+
+/* ---- per-op aux structs ---- */
+
+typedef struct IRTlsAux {
+  ObjSymId sym;
+  i64 addend;
+} IRTlsAux;
+
+typedef struct IRAggAux {
   AggregateAccess access;
-} IRAggregateAux;
+} IRAggAux;
 
 typedef struct IRBitFieldAux {
   BitFieldAccess access;
@@ -137,42 +119,119 @@ typedef struct IRPhiAux {
   u32 npreds;
   u32* pred_blocks;
   Val* pred_vals;
+  u32 slot_id; /* 0 if not from mem2reg; else 1-based FrameSlot id */
 } IRPhiAux;
 
-typedef struct IRAsmAux {
-  const char* tmpl;
-  const AsmConstraint* outs;
-  const AsmConstraint* ins;
-  const Sym* clobbers;
-  u32 nout, nin, nclob;
-} IRAsmAux;
+/* IR_CALL aux. The CGTarget interface is rich enough that we keep the
+ * full descriptor for replay; SSA passes inspect args/results in their
+ * Val form via the CGABIValue.storage.v.reg fields where applicable. */
+typedef struct IRCallAux {
+  CGCallDesc desc;
+  /* Result Vals (one per ABI-decomposed return part). 0 for void. */
+  u32 nresults;
+  Val* results;
+  /* For SSA-aware passes: the call may have args that are REG operands
+   * (carrying Val uses). They live in desc.args[i].storage.v.reg or
+   * desc.args[i].parts[k].op.v.reg. */
+} IRCallAux;
+
+typedef struct IRRetAux {
+  u8 present; /* 0 → void return; ret(NULL) at replay */
+  CGABIValue val;
+} IRRetAux;
+
+typedef struct IRScopeAux {
+  CGScopeDesc desc;
+  u32 scope_id; /* 1-based; the CGScope handed back to the caller */
+  /* For SCOPE_IF: blocks for then-arm, else-arm, and join after end.
+   * For SCOPE_LOOP/SCOPE_BLOCK: the caller-supplied break/continue
+   * labels translated to block ids; if the desc passed LABEL_NONE we
+   * leave 0 (unused — break_to/continue_to is illegal in that case). */
+  u32 if_then_block;
+  u32 if_else_block;
+  u32 if_end_block;
+  u32 loop_break_block;
+  u32 loop_continue_block;
+  u8 if_has_else;
+} IRScopeAux;
+
+typedef struct IRAtomicAux {
+  MemAccess mem;
+  MemOrder mo;
+  u8 op; /* AtomicOp; valid for IR_ATOMIC_RMW */
+} IRAtomicAux;
 
 typedef struct IRCasAux {
   MemAccess mem;
   MemOrder success;
   MemOrder failure;
-  Val prior;
-  Val ok;
 } IRCasAux;
 
+typedef struct IRAsmAux {
+  const char* tmpl;
+  AsmConstraint* outs;
+  AsmConstraint* ins;
+  Sym* clobbers;
+  Operand* out_ops; /* nout slots; the wrapped target may fill in REG location */
+  u32 nout, nin, nclob;
+} IRAsmAux;
+
+typedef struct IRIntrinAux {
+  IntrinKind kind;
+  Operand* dsts; /* ndst */
+  Operand* args; /* narg */
+  Val* result_vals; /* one per dst that's a REG, parallel to dsts */
+  u32 ndst, narg;
+} IRIntrinAux;
+
+typedef struct IRParamDeclAux {
+  CGParamDesc desc;
+} IRParamDeclAux;
+
+/* ---- frame slots / params ---- */
+
+typedef struct IRFrameSlot {
+  FrameSlot id;
+  const Type* type;
+  Sym name;
+  SrcLoc loc;
+  u32 size;
+  u32 align;
+  u8 kind; /* FrameSlotKind */
+  u8 pad;
+  u16 flags; /* FrameSlotFlag */
+} IRFrameSlot;
+
+typedef struct IRParam {
+  u32 index;
+  Sym name;
+  const Type* type;
+  FrameSlot slot;
+  const ABIArgInfo* abi;
+  SrcLoc loc;
+} IRParam;
+
+/* ---- Inst / Block / Func ---- */
+
 typedef struct Inst {
   u16 op;
-  u16 flags;  /* per-op flags (e.g. nsw/nuw, volatile) */
-  SrcLoc loc; /* set from CGTarget.set_loc when this insn was recorded */
+  u16 flags;
+  SrcLoc loc; /* sticky from CGTarget.set_loc at recording */
   const Type* type;
-  Val def;   /* this instruction's SSA value, or VAL_NONE */
-  u32 ndefs; /* multi-result instructions use defs[0..ndefs) */
-  Val* defs; /* arena-allocated; NULL when ndefs <= 1 */
+  Val def;     /* primary SSA def, or VAL_NONE */
+  u32 ndefs;   /* multi-result */
+  Val* defs;
+  /* Operands. We use Operand instead of Val so that the original
+   * CGTarget call shape (IMM / LOCAL / GLOBAL / INDIRECT in addition to
+   * REG) round-trips at replay. SSA passes treat OPK_REG operands as
+   * Val uses (via .v.reg). */
   u32 nopnds;
-  Val* opnds; /* arena-allocated */
+  Operand* opnds;
   union {
     i64 imm;
     ConstBytes cbytes;
-    struct {
-      ObjSymId sym;
-    } objsym;
     MemAccess mem;
-    void* aux; /* one of IR*Aux, arena-owned and typed by op */
+    void* aux;
   } extra;
 } Inst;
 
@@ -182,44 +241,66 @@ typedef struct Block {
   u32 ninsts, cap;
   u32* preds;
   u32 npreds;
-  u32 succ[2]; /* condbr: 2; br: 1; ret: 0 */
+  u32 succ[2];
   u8 nsucc;
 } Block;
 
 typedef struct Func {
-  /* IR storage. Lives until cgtarget_finalize so inter-procedural passes can
-   * read every Func in the TU. Per-pass scratch goes in Arena scratch, not
-   * here. */
   Arena* arena;
-  ObjSymId name;
+  CGFuncDesc desc; /* preserved for level-1 replay func_begin */
+  ObjSymId name;   /* alias for desc.sym (kept for older callers) */
   const Type* type;
   Block* blocks;
   u32 nblocks, blocks_cap;
-  u32 entry; /* index of entry block */
+  u32 entry;
 
   IRFrameSlot* frame_slots;
   u32 nframe_slots, frame_slots_cap;
   IRParam* params;
   u32 nparams, params_cap;
 
-  /* Value table: for each Val, where it's defined and its type. */
+  /* Value table. Index 0 is VAL_NONE; first allocated Val is 1. */
   u32* val_def_block;
   u32* val_def_inst;
   const Type** val_type;
+  u8* val_cls; /* RegClass per Val, used by replay to reconstruct REG operands */
   u32 nvals, vals_cap;
+
+  /* Scope id table. Indexed by scope_id (1-based). Values map to
+   * IRScopeAux entries (via the IR_SCOPE_BEGIN inst). Stored as a flat
+   * pointer table for O(1) lookup during scope_else/end/break/continue
+   * recording and replay. */
+  Inst** scope_aux_inst;
+  u32 nscopes, scopes_cap;
+
+  /* Emit order: the sequence in which blocks first became `cur` during
+   * recording. This is the natural order CG's emit cursor visited each
+   * block, so replay must follow it (block-creation order can differ —
+   * e.g. label_new(L) precedes cmp_branch but the cmp_branch's
+   * fallthrough block is what physically follows the cmp_branch in
+   * code). Blocks not present here are unreachable / unplaced; replay
+   * skips them. */
+  u32* emit_order;
+  u32 emit_order_n, emit_order_cap;
 } Func;
 
-Func* ir_func_new(Arena*, ObjSymId, const Type* fn_type);
+/* ---- API ---- */
+
+Func* ir_func_new(Compiler*, const CGFuncDesc*);
+
 u32 ir_block_new(Func*);
-Val ir_emit(Func*, u32 block, IROp, const Type* result, const Val* opnds,
-            u32 n);
-void ir_emit_multi(Func*, u32 block, IROp, const Type** results, Val* defs,
-                   u32 ndefs, const Val* opnds, u32 nopnds);
-Val ir_emit_const_i(Func*, u32 block, const Type*, i64);
-Val ir_emit_const_bytes(Func*, u32 block, ConstBytes);
 FrameSlot ir_frame_slot_new(Func*, const FrameSlotDesc*);
 void ir_param_add(Func*, const CGParamDesc*);
-void ir_set_terminator(Func*, u32 block, IROp, u32 succ_a, u32 succ_b,
-                       Val cond);
+
+Val ir_alloc_val(Func*, const Type*, u8 cls);
+
+Inst* ir_emit(Func*, u32 block, IROp);
+
+/* Append `block` to f->emit_order if not already present. Called by
+ * the wrapper whenever cur transitions to a block. */
+void ir_note_emit(Func*, u32 block);
+/* Append Inst to block; caller fills op-specific fields. The Inst is
+ * arena-resident; its address is stable until the Block.insts array is
+ * reallocated by phi insertion (which fixes up references). */
 
 #endif
diff --git a/src/opt/opt.c b/src/opt/opt.c
@@ -1,22 +1,13 @@
-/* opt — CGTarget wrapper that records each function as a tape of
- * CGTarget calls, then replays them onto the wrapped target on
- * func_end. See doc/OPT.md for the phased plan.
+/* opt.c — CGTarget wrapper that records each function as IR (doc/OPT.md
+ * §1). Each CGTarget call lands as exactly one Inst in the current
+ * Func's current block; the wrapper's vreg/vlabel/vslot/vscope ids
+ * coincide with their IR counterparts (Reg ↔ Val, label ↔ block id,
+ * vslot ↔ IR FrameSlot, vscope ↔ scope_aux index — doc/OPT.md §5.1).
  *
- * Phase 1 (current): record every emit-side call into a per-function
- *   tape; alloc_reg / frame_slot / label_new / scope_begin hand out
- *   wrapper-local virtual ids. On func_end the tape is replayed
- *   linearly: each entry produces exactly one wrapped target call,
- *   with virtual ids translated to target-side ids on the fly. This
- *   preserves doc/DESIGN.md §8's "function-at-a-time" streaming
- *   guarantee at -O1.
- *
- * Phase 2 (current): a small, safe peephole pass runs over the tape
- *   between recording and replay. See try_peephole_constfold.
- *
- * Phase 3+ (deferred): build CFG and SSA from the tape, run
- *   intra-procedural passes, lower through machinize → regalloc →
- *   emit. Until that lands, level 2 is functionally identical to
- *   level 1 (per-function record + replay).
+ * Phase 3 (current): on func_end the wrapper optionally runs
+ *   opt_build_cfg + opt_build_ssa (level 2; output discarded), then
+ *   replays each Func into the wrapped target. Level 1 skips the SSA
+ *   passes; in both cases replay is the level-1 1:1 walk.
  *
  * Methods the wrapper rejects under unbounded virtuals:
  *   - clobbers / spill_reg / reload_reg are CG -O0 register-pressure
@@ -31,346 +22,21 @@
 
 #include "core/arena.h"
 #include "core/core.h"
-
-/* ---- tape op tags ---- */
-
-typedef enum {
-  TOP_FUNC_BEGIN,
-  TOP_FUNC_END,
-  TOP_ALLOC_REG,
-  TOP_FRAME_SLOT,
-  TOP_PARAM,
-  TOP_LABEL_NEW,
-  TOP_LABEL_PLACE,
-  TOP_JUMP,
-  TOP_CMP_BRANCH,
-  TOP_SCOPE_BEGIN,
-  TOP_SCOPE_ELSE,
-  TOP_SCOPE_END,
-  TOP_BREAK_TO,
-  TOP_CONTINUE_TO,
-  TOP_LOAD_IMM,
-  TOP_LOAD_CONST,
-  TOP_COPY,
-  TOP_LOAD,
-  TOP_STORE,
-  TOP_ADDR_OF,
-  TOP_TLS_ADDR_OF,
-  TOP_COPY_BYTES,
-  TOP_SET_BYTES,
-  TOP_BITFIELD_LOAD,
-  TOP_BITFIELD_STORE,
-  TOP_BINOP,
-  TOP_UNOP,
-  TOP_CMP,
-  TOP_CONVERT,
-  TOP_CALL,
-  TOP_RET,
-  TOP_ALLOCA,
-  TOP_VA_START,
-  TOP_VA_ARG,
-  TOP_VA_END,
-  TOP_VA_COPY,
-  TOP_SETJMP,
-  TOP_LONGJMP,
-  TOP_ATOMIC_LOAD,
-  TOP_ATOMIC_STORE,
-  TOP_ATOMIC_RMW,
-  TOP_ATOMIC_CAS,
-  TOP_FENCE,
-  TOP_INTRINSIC,
-  TOP_SET_LOC,
-} TapeOpKind;
-
-/* TapeEntry: one recorded CGTarget call. The tagged union is wide; we
- * pay arena bytes for clarity. */
-typedef struct TapeEntry {
-  u8 op;       /* TapeOpKind */
-  u8 dead;     /* set by peepholes; replay skips dead entries */
-  u16 padding;
-  SrcLoc loc;
-  union {
-    /* WOP_FUNC_BEGIN: deep-copied descriptor. The caller's CGFuncDesc
-     * may be stack-allocated, so we copy by value into our arena.
-     * params[] is also copied; field shapes inside (Type*, ABIArgInfo*,
-     * incoming pointer) are TU-lifetime and shared. */
-    struct {
-      CGFuncDesc desc;
-      CGParamDesc* params; /* arena copy of fd.params */
-    } func_begin;
-
-    /* WOP_ALLOC_REG: returns a vreg, indexed into reg_map at replay. */
-    struct {
-      RegClass cls;
-      const Type* ty;
-      Reg vreg;
-    } alloc_reg;
-
-    /* WOP_FRAME_SLOT */
-    struct {
-      FrameSlotDesc desc;
-      FrameSlot vslot;
-    } frame_slot;
-
-    /* WOP_PARAM */
-    struct {
-      CGParamDesc desc;
-    } param;
-
-    /* WOP_LABEL_NEW */
-    struct {
-      Label vlabel;
-    } label_new;
-
-    /* WOP_LABEL_PLACE / WOP_JUMP */
-    struct {
-      Label vlabel;
-    } label_op;
-
-    /* WOP_CMP_BRANCH */
-    struct {
-      CmpOp op;
-      Operand a, b;
-      Label vlabel;
-    } cmp_branch;
-
-    /* WOP_SCOPE_BEGIN */
-    struct {
-      CGScopeDesc desc;
-      CGScope vscope;
-    } scope_begin;
-
-    /* WOP_SCOPE_ELSE / WOP_SCOPE_END / WOP_BREAK_TO / WOP_CONTINUE_TO */
-    struct {
-      CGScope vscope;
-    } scope_op;
-
-    /* WOP_LOAD_IMM */
-    struct {
-      Operand dst;
-      i64 imm;
-    } load_imm;
-
-    /* WOP_LOAD_CONST */
-    struct {
-      Operand dst;
-      ConstBytes cb;
-    } load_const;
-
-    /* WOP_COPY / WOP_ADDR_OF / WOP_VA_COPY */
-    struct {
-      Operand dst;
-      Operand src;
-    } copy;
-
-    /* WOP_LOAD */
-    struct {
-      Operand dst;
-      Operand addr;
-      MemAccess mem;
-    } load;
-
-    /* WOP_STORE */
-    struct {
-      Operand addr;
-      Operand src;
-      MemAccess mem;
-    } store;
-
-    /* WOP_TLS_ADDR_OF */
-    struct {
-      Operand dst;
-      ObjSymId sym;
-      i64 addend;
-    } tls_addr_of;
-
-    /* WOP_COPY_BYTES / WOP_SET_BYTES */
-    struct {
-      Operand a;
-      Operand b;
-      AggregateAccess agg;
-    } agg;
-
-    /* WOP_BITFIELD_LOAD */
-    struct {
-      Operand dst;
-      Operand record;
-      BitFieldAccess bf;
-    } bitfield_load;
-
-    /* WOP_BITFIELD_STORE */
-    struct {
-      Operand record;
-      Operand src;
-      BitFieldAccess bf;
-    } bitfield_store;
-
-    /* WOP_BINOP */
-    struct {
-      BinOp op;
-      Operand dst, a, b;
-    } binop;
-
-    /* WOP_UNOP */
-    struct {
-      UnOp op;
-      Operand dst, a;
-    } unop;
-
-    /* WOP_CMP */
-    struct {
-      CmpOp op;
-      Operand dst, a, b;
-    } cmp;
-
-    /* WOP_CONVERT */
-    struct {
-      ConvKind kind;
-      Operand dst, src;
-    } convert;
-
-    /* WOP_CALL: deep-copied descriptor and inner arrays. */
-    struct {
-      CGCallDesc desc;
-      CGABIValue* args;       /* len = desc.nargs */
-      CGABIPart* ret_parts;   /* len = desc.ret.nparts; NULL if 0 */
-      CGABIPart** arg_parts;  /* per-arg parts arrays; entry is NULL if 0 */
-    } call;
-
-    /* WOP_RET: present == 1 means there is a CGABIValue; otherwise a
-     * void return. parts is deep-copied. */
-    struct {
-      u8 present;
-      CGABIValue val;
-      CGABIPart* parts; /* len = val.nparts */
-    } ret;
-
-    /* WOP_ALLOCA */
-    struct {
-      Operand dst;
-      Operand size;
-      u32 align;
-    } alloca_;
-
-    /* WOP_VA_START / WOP_VA_END */
-    struct {
-      Operand ap;
-    } va_se;
-
-    /* WOP_VA_ARG */
-    struct {
-      Operand dst;
-      Operand ap;
-      const Type* ty;
-    } va_arg_;
-
-    /* WOP_SETJMP */
-    struct {
-      Operand dst;
-      Operand buf;
-    } setjmp_;
-
-    /* WOP_LONGJMP */
-    struct {
-      Operand buf;
-      Operand val;
-    } longjmp_;
-
-    /* WOP_ATOMIC_LOAD */
-    struct {
-      Operand dst;
-      Operand addr;
-      MemAccess mem;
-      MemOrder mo;
-    } atomic_load;
-
-    /* WOP_ATOMIC_STORE */
-    struct {
-      Operand addr;
-      Operand src;
-      MemAccess mem;
-      MemOrder mo;
-    } atomic_store;
-
-    /* WOP_ATOMIC_RMW */
-    struct {
-      AtomicOp op;
-      Operand dst;
-      Operand addr;
-      Operand val;
-      MemAccess mem;
-      MemOrder mo;
-    } atomic_rmw;
-
-    /* WOP_ATOMIC_CAS */
-    struct {
-      Operand prior;
-      Operand ok;
-      Operand addr;
-      Operand expected;
-      Operand desired;
-      MemAccess mem;
-      MemOrder success;
-      MemOrder failure;
-    } atomic_cas;
-
-    /* WOP_FENCE */
-    struct {
-      MemOrder mo;
-    } fence;
-
-    /* WOP_INTRINSIC */
-    struct {
-      IntrinKind kind;
-      Operand* dsts; /* deep-copied */
-      u32 ndst;
-      Operand* args; /* deep-copied */
-      u32 narg;
-    } intrinsic;
-
-    /* WOP_SET_LOC */
-    struct {
-      SrcLoc loc;
-    } set_loc;
-  } u;
-} TapeEntry;
+#include "opt/ir.h"
 
 /* ---- wrapper state ---- */
 
 typedef struct OptImpl {
   CGTarget base;
-  CGTarget* target; /* wrapped */
+  CGTarget* target;
   int level;
   Compiler* c;
 
-  /* Tape: per-function, reset on func_begin. Allocated from c->tu so
-   * the buffer survives panic via compiler_defer cleanups. */
-  TapeEntry* tape;
-  u32 ntape, tape_cap;
-
-  /* Wrapper-local virtual id counters. 1-based; 0 reserved as NONE.
-   * Reset on each func_begin. */
-  Reg next_vreg;
-  FrameSlot next_vslot;
-  Label next_vlabel;
-  CGScope next_vscope;
-
-  /* Replay-time translation tables. Index by virtual id; entry 0 is
-   * the NONE sentinel and never referenced. Allocated lazily on first
-   * replay so peak size matches the largest function. */
-  Reg* reg_map;
-  u32 reg_map_cap;
-  FrameSlot* slot_map;
-  u32 slot_map_cap;
-  Label* label_map;
-  u32 label_map_cap;
-  CGScope* scope_map;
-  u32 scope_map_cap;
-
-  SrcLoc pending_loc; /* most recent set_loc; stamped onto each entry */
+  /* Current function being recorded. NULL between functions. */
+  Func* f;
+  u32 cur;             /* current block id */
+  SrcLoc pending_loc;  /* most recent set_loc; stamped on each Inst */
 
-  /* If non-NULL, dump the tape to this writer on each func_end (before
-   * replay). Used by cg-runner --dump-tape and ad-hoc debugging. */
   Writer* dump_writer;
 } OptImpl;
 
@@ -378,179 +44,94 @@ static OptImpl* impl_of(CGTarget* t) { return (OptImpl*)t; }
 
 static _Noreturn void panic_unsupported(OptImpl* o, const char* what) {
   SrcLoc loc = {0, 0, 0};
-  compiler_panic(o->c, loc, "opt_cgtarget: %s called under unbounded virtuals",
-                 what);
+  compiler_panic(o->c, loc,
+                 "opt_cgtarget: %s called under unbounded virtuals", what);
 }
 
-/* ---- tape append ---- */
+/* ---- recording helpers ---- */
 
-static TapeEntry* tape_append(OptImpl* o, TapeOpKind op) {
-  TapeEntry* e;
-  if (o->ntape == o->tape_cap) {
-    u32 ncap = o->tape_cap ? o->tape_cap * 2u : 64u;
-    TapeEntry* nb = arena_array(o->c->tu, TapeEntry, ncap);
-    if (o->tape) memcpy(nb, o->tape, sizeof(TapeEntry) * o->ntape);
-    o->tape = nb;
-    o->tape_cap = ncap;
-  }
-  e = &o->tape[o->ntape++];
-  memset(e, 0, sizeof *e);
-  e->op = (u8)op;
-  e->loc = o->pending_loc;
-  return e;
+static Inst* rec(OptImpl* o, IROp op) {
+  Inst* in = ir_emit(o->f, o->cur, op);
+  in->loc = o->pending_loc;
+  return in;
 }
 
-/* ---- deep-copy helpers ---- */
+static void set_def(Func* f, Inst* in, u32 block, Val v, const Type* t) {
+  in->def = v;
+  in->type = t;
+  if (v != VAL_NONE && v < f->nvals) {
+    f->val_def_block[v] = block;
+    f->val_def_inst[v] = f->blocks[block].ninsts - 1u;
+  }
+}
 
-static CGParamDesc* copy_params(Compiler* c, const CGParamDesc* src, u32 n) {
-  CGParamDesc* dst;
+static Operand* dup_opnds(Func* f, const Operand* src, u32 n) {
   if (!n) return NULL;
-  dst = arena_array(c->tu, CGParamDesc, n);
-  memcpy(dst, src, sizeof(CGParamDesc) * n);
+  Operand* dst = arena_array(f->arena, Operand, n);
+  memcpy(dst, src, sizeof(Operand) * n);
   return dst;
 }
 
-static CGABIPart* copy_parts(Compiler* c, const CGABIPart* src, u32 n) {
-  CGABIPart* dst;
-  if (!n) return NULL;
-  dst = arena_array(c->tu, CGABIPart, n);
-  memcpy(dst, src, sizeof(CGABIPart) * n);
-  return dst;
+static int cur_terminated(OptImpl* o) {
+  Block* b = &o->f->blocks[o->cur];
+  if (b->nsucc > 0) return 1;
+  if (b->ninsts == 0) return 0;
+  IROp last = (IROp)b->insts[b->ninsts - 1].op;
+  return last == IR_RET || last == IR_LONGJMP;
 }
 
-static Operand* copy_operands(Compiler* c, const Operand* src, u32 n) {
-  Operand* dst;
-  if (!n) return NULL;
-  dst = arena_array(c->tu, Operand, n);
-  memcpy(dst, src, sizeof(Operand) * n);
-  return dst;
+static void set_cur(OptImpl* o, u32 b) {
+  o->cur = b;
+  ir_note_emit(o->f, b);
 }
 
-/* ---- map helpers (replay-time) ----
- * The maps are direct-indexed by the 1-based virtual id; entry 0 is
- * the NONE sentinel. */
-
-static void map_reg_grow(OptImpl* o, u32 needed) {
-  u32 ncap;
-  Reg* nb;
-  if (needed <= o->reg_map_cap) return;
-  ncap = o->reg_map_cap ? o->reg_map_cap : 16u;
-  while (ncap < needed) ncap *= 2u;
-  nb = arena_array(o->c->tu, Reg, ncap);
-  if (o->reg_map) memcpy(nb, o->reg_map, sizeof(Reg) * o->reg_map_cap);
-  /* New slots default to REG_NONE (0xffffffff). */
-  for (u32 i = o->reg_map_cap; i < ncap; ++i) nb[i] = REG_NONE;
-  o->reg_map = nb;
-  o->reg_map_cap = ncap;
-}
-
-static void map_slot_grow(OptImpl* o, u32 needed) {
-  u32 ncap;
-  FrameSlot* nb;
-  if (needed <= o->slot_map_cap) return;
-  ncap = o->slot_map_cap ? o->slot_map_cap : 16u;
-  while (ncap < needed) ncap *= 2u;
-  nb = arena_array(o->c->tu, FrameSlot, ncap);
-  if (o->slot_map) memcpy(nb, o->slot_map, sizeof(FrameSlot) * o->slot_map_cap);
-  for (u32 i = o->slot_map_cap; i < ncap; ++i) nb[i] = FRAME_SLOT_NONE;
-  o->slot_map = nb;
-  o->slot_map_cap = ncap;
-}
-
-static void map_label_grow(OptImpl* o, u32 needed) {
-  u32 ncap;
-  Label* nb;
-  if (needed <= o->label_map_cap) return;
-  ncap = o->label_map_cap ? o->label_map_cap : 16u;
-  while (ncap < needed) ncap *= 2u;
-  nb = arena_array(o->c->tu, Label, ncap);
-  if (o->label_map) memcpy(nb, o->label_map, sizeof(Label) * o->label_map_cap);
-  for (u32 i = o->label_map_cap; i < ncap; ++i) nb[i] = LABEL_NONE;
-  o->label_map = nb;
-  o->label_map_cap = ncap;
-}
-
-static void map_scope_grow(OptImpl* o, u32 needed) {
-  u32 ncap;
-  CGScope* nb;
-  if (needed <= o->scope_map_cap) return;
-  ncap = o->scope_map_cap ? o->scope_map_cap : 8u;
-  while (ncap < needed) ncap *= 2u;
-  nb = arena_array(o->c->tu, CGScope, ncap);
-  if (o->scope_map) memcpy(nb, o->scope_map, sizeof(CGScope) * o->scope_map_cap);
-  for (u32 i = o->scope_map_cap; i < ncap; ++i) nb[i] = CG_SCOPE_NONE;
-  o->scope_map = nb;
-  o->scope_map_cap = ncap;
-}
-
-/* ---- recording: every emit-side method records a tape entry.
- *
- * Allocator methods (alloc_reg, frame_slot, label_new, scope_begin)
- * additionally hand back a wrapper-local virtual id; the underlying
- * target is not consulted until replay. */
+/* After emitting a terminator, allocate a fresh block for any
+ * subsequent (likely unreachable) recording. */
+static void after_terminator(OptImpl* o) {
+  set_cur(o, ir_block_new(o->f));
+}
+
+/* ---- function lifecycle ---- */
 
 static void w_func_begin(CGTarget* t, const CGFuncDesc* fd) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e;
-
-  /* Reset per-function state. */
-  o->tape = NULL;
-  o->ntape = 0;
-  o->tape_cap = 0;
-  o->next_vreg = 1;
-  o->next_vslot = 1;
-  o->next_vlabel = 1;
-  o->next_vscope = 1;
+  o->f = ir_func_new(o->c, fd);
+  u32 entry = ir_block_new(o->f);
+  o->f->entry = entry;
+  set_cur(o, entry);
   o->pending_loc = (SrcLoc){0, 0, 0};
-  /* Reset translation maps; capacities are kept for amortization. */
-  for (u32 i = 0; i < o->reg_map_cap; ++i) o->reg_map[i] = REG_NONE;
-  for (u32 i = 0; i < o->slot_map_cap; ++i) o->slot_map[i] = FRAME_SLOT_NONE;
-  for (u32 i = 0; i < o->label_map_cap; ++i) o->label_map[i] = LABEL_NONE;
-  for (u32 i = 0; i < o->scope_map_cap; ++i) o->scope_map[i] = CG_SCOPE_NONE;
-
-  e = tape_append(o, TOP_FUNC_BEGIN);
-  /* Shallow-copy the descriptor by value, then deep-copy the params
-   * array — the harness mutates pds[i].slot AFTER func_begin returns,
-   * so we can't rely on pointer-shallow-copy for that field. The slots
-   * we record here are wrapper vslots (allocated by w_frame_slot in the
-   * subsequent param-setup loop); replay translates them. */
-  e->u.func_begin.desc = *fd;
-  e->u.func_begin.params = copy_params(o->c, fd->params, fd->nparams);
-  e->u.func_begin.desc.params = e->u.func_begin.params;
 }
 
 static void w_func_end(CGTarget* t);
 
+/* ---- registers and frame slots ---- */
+
 static Reg w_alloc_reg(CGTarget* t, RegClass cls, const Type* ty) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_ALLOC_REG);
-  Reg vreg = o->next_vreg++;
-  e->u.alloc_reg.cls = cls;
-  e->u.alloc_reg.ty = ty;
-  e->u.alloc_reg.vreg = vreg;
-  return vreg;
+  Val v = ir_alloc_val(o->f, ty, (u8)cls);
+  return (Reg)v;
 }
 
 static void w_free_reg(CGTarget* t, Reg r) {
-  /* Hint; opt_cgtarget ignores. The wrapper's vregs are unbounded —
-   * there is no pool to return to. */
   (void)t;
   (void)r;
 }
 
 static FrameSlot w_frame_slot(CGTarget* t, const FrameSlotDesc* d) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_FRAME_SLOT);
-  FrameSlot vslot = o->next_vslot++;
-  e->u.frame_slot.desc = *d;
-  e->u.frame_slot.vslot = vslot;
-  return vslot;
+  return ir_frame_slot_new(o->f, d);
 }
 
 static void w_param(CGTarget* t, const CGParamDesc* d) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_PARAM);
-  e->u.param.desc = *d;
+  /* Deep-copy parts so caller-stack memory isn't relied on. */
+  CGParamDesc copy = *d;
+  if (d->nincoming) {
+    CGABIPart* parts = arena_array(o->f->arena, CGABIPart, d->nincoming);
+    memcpy(parts, d->incoming, sizeof(CGABIPart) * d->nincoming);
+    copy.incoming = parts;
+  }
+  ir_param_add(o->f, &copy);
 }
 
 static const Reg* w_clobbers(CGTarget* t, RegClass cls, u32* nregs) {
@@ -571,317 +152,541 @@ static void w_reload_reg(CGTarget* t, Operand dst, FrameSlot s, MemAccess m) {
   panic_unsupported(impl_of(t), "reload_reg");
 }
 
+/* ---- labels and control flow ---- */
+
 static Label w_label_new(CGTarget* t) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_LABEL_NEW);
-  Label v = o->next_vlabel++;
-  e->u.label_new.vlabel = v;
-  return v;
+  return (Label)ir_block_new(o->f);
 }
 
 static void w_label_place(CGTarget* t, Label l) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_LABEL_PLACE);
-  e->u.label_op.vlabel = l;
+  u32 target_blk = (u32)l;
+  if (target_blk >= o->f->nblocks) {
+    SrcLoc loc = {0, 0, 0};
+    compiler_panic(o->c, loc, "opt: label_place(%u) out of range",
+                   (unsigned)l);
+  }
+  if (!cur_terminated(o)) {
+    Block* cb = &o->f->blocks[o->cur];
+    rec(o, IR_BR);
+    cb->succ[0] = target_blk;
+    cb->nsucc = 1;
+  }
+  set_cur(o, target_blk);
 }
+
 static void w_jump(CGTarget* t, Label l) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_JUMP);
-  e->u.label_op.vlabel = l;
+  u32 target_blk = (u32)l;
+  Block* cb = &o->f->blocks[o->cur];
+  rec(o, IR_BR);
+  cb->succ[0] = target_blk;
+  cb->nsucc = 1;
+  after_terminator(o);
 }
+
 static void w_cmp_branch(CGTarget* t, CmpOp op, Operand a, Operand b, Label l) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_CMP_BRANCH);
-  e->u.cmp_branch.op = op;
-  e->u.cmp_branch.a = a;
-  e->u.cmp_branch.b = b;
-  e->u.cmp_branch.vlabel = l;
+  u32 taken = (u32)l;
+  Inst* in = rec(o, IR_CMP_BRANCH);
+  Operand ops[2] = {a, b};
+  in->opnds = dup_opnds(o->f, ops, 2);
+  in->nopnds = 2;
+  in->extra.imm = (i64)op;
+  Block* cb = &o->f->blocks[o->cur];
+  cb->succ[0] = taken;
+  u32 ft = ir_block_new(o->f);
+  cb->succ[1] = ft;
+  cb->nsucc = 2;
+  set_cur(o, ft);
+}
+
+/* ---- structured scopes ---- */
+
+static u32 scope_register(Func* f, Inst* in) {
+  if (f->nscopes == f->scopes_cap) {
+    u32 ncap = f->scopes_cap ? f->scopes_cap * 2u : 4u;
+    Inst** nb = arena_zarray(f->arena, Inst*, ncap);
+    if (f->scope_aux_inst)
+      memcpy(nb, f->scope_aux_inst, sizeof(Inst*) * f->nscopes);
+    f->scope_aux_inst = nb;
+    f->scopes_cap = ncap;
+  }
+  f->scope_aux_inst[f->nscopes++] = in;
+  return f->nscopes;
+}
+
+static IRScopeAux* scope_lookup(OptImpl* o, CGScope s) {
+  if (s == CG_SCOPE_NONE || s > o->f->nscopes) {
+    SrcLoc loc = {0, 0, 0};
+    compiler_panic(o->c, loc, "opt: bad scope id %u", (unsigned)s);
+  }
+  return (IRScopeAux*)o->f->scope_aux_inst[s - 1]->extra.aux;
 }
 
 static CGScope w_scope_begin(CGTarget* t, const CGScopeDesc* d) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_SCOPE_BEGIN);
-  CGScope v = o->next_vscope++;
-  e->u.scope_begin.desc = *d;
-  e->u.scope_begin.vscope = v;
-  return v;
+  Inst* in = rec(o, IR_SCOPE_BEGIN);
+  IRScopeAux* aux = arena_znew(o->f->arena, IRScopeAux);
+  aux->desc = *d;
+  in->extra.aux = aux;
+  u32 sid = scope_register(o->f, in);
+  aux->scope_id = sid;
+
+  if (d->kind == SCOPE_IF) {
+    aux->if_then_block = ir_block_new(o->f);
+    aux->if_else_block = ir_block_new(o->f);
+    aux->if_end_block = ir_block_new(o->f);
+    Block* cb = &o->f->blocks[o->cur];
+    cb->succ[0] = aux->if_then_block;
+    cb->succ[1] = aux->if_else_block;
+    cb->nsucc = 2;
+    set_cur(o, aux->if_then_block);
+  } else if (d->kind == SCOPE_LOOP || d->kind == SCOPE_BLOCK) {
+    aux->loop_break_block =
+        d->break_label != LABEL_NONE ? (u32)d->break_label : 0;
+    aux->loop_continue_block =
+        d->continue_label != LABEL_NONE ? (u32)d->continue_label : 0;
+  }
+  return (CGScope)sid;
 }
+
 static void w_scope_else(CGTarget* t, CGScope s) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_SCOPE_ELSE);
-  e->u.scope_op.vscope = s;
+  IRScopeAux* aux = scope_lookup(o, s);
+  if (aux->desc.kind != SCOPE_IF) {
+    SrcLoc loc = {0, 0, 0};
+    compiler_panic(o->c, loc, "opt: scope_else on non-IF scope %u",
+                   (unsigned)s);
+  }
+  Inst* in = rec(o, IR_SCOPE_ELSE);
+  in->extra.imm = (i64)s;
+  if (!cur_terminated(o)) {
+    Block* cb = &o->f->blocks[o->cur];
+    cb->succ[0] = aux->if_end_block;
+    cb->nsucc = 1;
+  }
+  aux->if_has_else = 1;
+  set_cur(o, aux->if_else_block);
 }
+
 static void w_scope_end(CGTarget* t, CGScope s) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_SCOPE_END);
-  e->u.scope_op.vscope = s;
+  IRScopeAux* aux = scope_lookup(o, s);
+  Inst* in = rec(o, IR_SCOPE_END);
+  in->extra.imm = (i64)s;
+  if (aux->desc.kind == SCOPE_IF) {
+    if (!cur_terminated(o)) {
+      Block* cb = &o->f->blocks[o->cur];
+      cb->succ[0] = aux->if_end_block;
+      cb->nsucc = 1;
+    }
+    if (!aux->if_has_else) {
+      Block* eb = &o->f->blocks[aux->if_else_block];
+      if (eb->nsucc == 0) {
+        eb->succ[0] = aux->if_end_block;
+        eb->nsucc = 1;
+      }
+      /* Else block was never visited as cur, but it has code (the
+       * fall-through from scope_begin) — record it before end so emit
+       * order has it. */
+      ir_note_emit(o->f, aux->if_else_block);
+    }
+    set_cur(o, aux->if_end_block);
+  }
 }
+
 static void w_break_to(CGTarget* t, CGScope s) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_BREAK_TO);
-  e->u.scope_op.vscope = s;
+  IRScopeAux* aux = scope_lookup(o, s);
+  Inst* in = rec(o, IR_BREAK_TO);
+  in->extra.imm = (i64)s;
+  Block* cb = &o->f->blocks[o->cur];
+  cb->succ[0] = aux->loop_break_block;
+  cb->nsucc = 1;
+  after_terminator(o);
 }
+
 static void w_continue_to(CGTarget* t, CGScope s) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_CONTINUE_TO);
-  e->u.scope_op.vscope = s;
+  IRScopeAux* aux = scope_lookup(o, s);
+  Inst* in = rec(o, IR_CONTINUE_TO);
+  in->extra.imm = (i64)s;
+  Block* cb = &o->f->blocks[o->cur];
+  cb->succ[0] = aux->loop_continue_block;
+  cb->nsucc = 1;
+  after_terminator(o);
 }
 
+/* ---- data movement ---- */
+
 static void w_load_imm(CGTarget* t, Operand dst, i64 imm) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_LOAD_IMM);
-  e->u.load_imm.dst = dst;
-  e->u.load_imm.imm = imm;
+  Inst* in = rec(o, IR_LOAD_IMM);
+  Operand ops[1] = {dst};
+  in->opnds = dup_opnds(o->f, ops, 1);
+  in->nopnds = 1;
+  in->extra.imm = imm;
+  if (dst.kind == OPK_REG) set_def(o->f, in, o->cur, (Val)dst.v.reg, dst.type);
 }
+
 static void w_load_const(CGTarget* t, Operand dst, ConstBytes cb) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_LOAD_CONST);
-  e->u.load_const.dst = dst;
-  e->u.load_const.cb = cb;
+  Inst* in = rec(o, IR_LOAD_CONST);
+  Operand ops[1] = {dst};
+  in->opnds = dup_opnds(o->f, ops, 1);
+  in->nopnds = 1;
+  in->extra.cbytes = cb;
+  if (dst.kind == OPK_REG) set_def(o->f, in, o->cur, (Val)dst.v.reg, dst.type);
 }
+
 static void w_copy(CGTarget* t, Operand dst, Operand src) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_COPY);
-  e->u.copy.dst = dst;
-  e->u.copy.src = src;
+  Inst* in = rec(o, IR_COPY);
+  Operand ops[2] = {dst, src};
+  in->opnds = dup_opnds(o->f, ops, 2);
+  in->nopnds = 2;
+  if (dst.kind == OPK_REG) set_def(o->f, in, o->cur, (Val)dst.v.reg, dst.type);
 }
+
 static void w_load(CGTarget* t, Operand dst, Operand addr, MemAccess m) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_LOAD);
-  e->u.load.dst = dst;
-  e->u.load.addr = addr;
-  e->u.load.mem = m;
+  Inst* in = rec(o, IR_LOAD);
+  Operand ops[2] = {dst, addr};
+  in->opnds = dup_opnds(o->f, ops, 2);
+  in->nopnds = 2;
+  in->extra.mem = m;
+  if (dst.kind == OPK_REG) set_def(o->f, in, o->cur, (Val)dst.v.reg, dst.type);
 }
+
 static void w_store(CGTarget* t, Operand addr, Operand src, MemAccess m) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_STORE);
-  e->u.store.addr = addr;
-  e->u.store.src = src;
-  e->u.store.mem = m;
+  Inst* in = rec(o, IR_STORE);
+  Operand ops[2] = {addr, src};
+  in->opnds = dup_opnds(o->f, ops, 2);
+  in->nopnds = 2;
+  in->extra.mem = m;
 }
+
 static void w_addr_of(CGTarget* t, Operand dst, Operand lv) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_ADDR_OF);
-  e->u.copy.dst = dst;
-  e->u.copy.src = lv;
+  Inst* in = rec(o, IR_ADDR_OF);
+  Operand ops[2] = {dst, lv};
+  in->opnds = dup_opnds(o->f, ops, 2);
+  in->nopnds = 2;
+  if (dst.kind == OPK_REG) set_def(o->f, in, o->cur, (Val)dst.v.reg, dst.type);
 }
+
 static void w_tls_addr_of(CGTarget* t, Operand dst, ObjSymId sym, i64 addend) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_TLS_ADDR_OF);
-  e->u.tls_addr_of.dst = dst;
-  e->u.tls_addr_of.sym = sym;
-  e->u.tls_addr_of.addend = addend;
+  Inst* in = rec(o, IR_TLS_ADDR_OF);
+  Operand ops[1] = {dst};
+  in->opnds = dup_opnds(o->f, ops, 1);
+  in->nopnds = 1;
+  IRTlsAux* aux = arena_znew(o->f->arena, IRTlsAux);
+  aux->sym = sym;
+  aux->addend = addend;
+  in->extra.aux = aux;
+  if (dst.kind == OPK_REG) set_def(o->f, in, o->cur, (Val)dst.v.reg, dst.type);
 }
+
 static void w_copy_bytes(CGTarget* t, Operand dst, Operand src,
                          AggregateAccess agg) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_COPY_BYTES);
-  e->u.agg.a = dst;
-  e->u.agg.b = src;
-  e->u.agg.agg = agg;
+  Inst* in = rec(o, IR_AGG_COPY);
+  Operand ops[2] = {dst, src};
+  in->opnds = dup_opnds(o->f, ops, 2);
+  in->nopnds = 2;
+  IRAggAux* aux = arena_znew(o->f->arena, IRAggAux);
+  aux->access = agg;
+  in->extra.aux = aux;
 }
+
 static void w_set_bytes(CGTarget* t, Operand dst, Operand byte,
                         AggregateAccess agg) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_SET_BYTES);
-  e->u.agg.a = dst;
-  e->u.agg.b = byte;
-  e->u.agg.agg = agg;
+  Inst* in = rec(o, IR_AGG_SET);
+  Operand ops[2] = {dst, byte};
+  in->opnds = dup_opnds(o->f, ops, 2);
+  in->nopnds = 2;
+  IRAggAux* aux = arena_znew(o->f->arena, IRAggAux);
+  aux->access = agg;
+  in->extra.aux = aux;
 }
+
 static void w_bitfield_load(CGTarget* t, Operand dst, Operand record,
                             BitFieldAccess bf) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_BITFIELD_LOAD);
-  e->u.bitfield_load.dst = dst;
-  e->u.bitfield_load.record = record;
-  e->u.bitfield_load.bf = bf;
+  Inst* in = rec(o, IR_BITFIELD_LOAD);
+  Operand ops[2] = {dst, record};
+  in->opnds = dup_opnds(o->f, ops, 2);
+  in->nopnds = 2;
+  IRBitFieldAux* aux = arena_znew(o->f->arena, IRBitFieldAux);
+  aux->access = bf;
+  in->extra.aux = aux;
+  if (dst.kind == OPK_REG) set_def(o->f, in, o->cur, (Val)dst.v.reg, dst.type);
 }
+
 static void w_bitfield_store(CGTarget* t, Operand record, Operand src,
                              BitFieldAccess bf) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_BITFIELD_STORE);
-  e->u.bitfield_store.record = record;
-  e->u.bitfield_store.src = src;
-  e->u.bitfield_store.bf = bf;
+  Inst* in = rec(o, IR_BITFIELD_STORE);
+  Operand ops[2] = {record, src};
+  in->opnds = dup_opnds(o->f, ops, 2);
+  in->nopnds = 2;
+  IRBitFieldAux* aux = arena_znew(o->f->arena, IRBitFieldAux);
+  aux->access = bf;
+  in->extra.aux = aux;
 }
 
+/* ---- arithmetic / cmp / convert ---- */
+
 static void w_binop(CGTarget* t, BinOp op, Operand dst, Operand a, Operand b) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_BINOP);
-  e->u.binop.op = op;
-  e->u.binop.dst = dst;
-  e->u.binop.a = a;
-  e->u.binop.b = b;
+  Inst* in = rec(o, IR_BINOP);
+  Operand ops[3] = {dst, a, b};
+  in->opnds = dup_opnds(o->f, ops, 3);
+  in->nopnds = 3;
+  in->extra.imm = (i64)op;
+  if (dst.kind == OPK_REG) set_def(o->f, in, o->cur, (Val)dst.v.reg, dst.type);
 }
+
 static void w_unop(CGTarget* t, UnOp op, Operand dst, Operand a) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_UNOP);
-  e->u.unop.op = op;
-  e->u.unop.dst = dst;
-  e->u.unop.a = a;
+  Inst* in = rec(o, IR_UNOP);
+  Operand ops[2] = {dst, a};
+  in->opnds = dup_opnds(o->f, ops, 2);
+  in->nopnds = 2;
+  in->extra.imm = (i64)op;
+  if (dst.kind == OPK_REG) set_def(o->f, in, o->cur, (Val)dst.v.reg, dst.type);
 }
+
 static void w_cmp(CGTarget* t, CmpOp op, Operand dst, Operand a, Operand b) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_CMP);
-  e->u.cmp.op = op;
-  e->u.cmp.dst = dst;
-  e->u.cmp.a = a;
-  e->u.cmp.b = b;
+  Inst* in = rec(o, IR_CMP);
+  Operand ops[3] = {dst, a, b};
+  in->opnds = dup_opnds(o->f, ops, 3);
+  in->nopnds = 3;
+  in->extra.imm = (i64)op;
+  if (dst.kind == OPK_REG) set_def(o->f, in, o->cur, (Val)dst.v.reg, dst.type);
 }
+
 static void w_convert(CGTarget* t, ConvKind k, Operand dst, Operand src) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_CONVERT);
-  e->u.convert.kind = k;
-  e->u.convert.dst = dst;
-  e->u.convert.src = src;
+  Inst* in = rec(o, IR_CONVERT);
+  Operand ops[2] = {dst, src};
+  in->opnds = dup_opnds(o->f, ops, 2);
+  in->nopnds = 2;
+  in->extra.imm = (i64)k;
+  if (dst.kind == OPK_REG) set_def(o->f, in, o->cur, (Val)dst.v.reg, dst.type);
+}
+
+/* ---- calls / return ---- */
+
+static CGABIPart* dup_parts(Arena* a, const CGABIPart* src, u32 n) {
+  if (!n) return NULL;
+  CGABIPart* dst = arena_array(a, CGABIPart, n);
+  memcpy(dst, src, sizeof(CGABIPart) * n);
+  return dst;
 }
 
 static void w_call(CGTarget* t, const CGCallDesc* d) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_CALL);
-  CGABIValue* args_copy = NULL;
-  CGABIPart** arg_parts_copy = NULL;
-  CGABIPart* ret_parts_copy = NULL;
-  u32 i;
-
-  /* Deep-copy the argv. Caller-owned d may be on the stack, and
-   * args[i].parts may be too. */
+  Inst* in = rec(o, IR_CALL);
+  IRCallAux* aux = arena_znew(o->f->arena, IRCallAux);
+  aux->desc = *d;
   if (d->nargs) {
-    args_copy = arena_array(o->c->tu, CGABIValue, d->nargs);
-    arg_parts_copy = arena_array(o->c->tu, CGABIPart*, d->nargs);
-    for (i = 0; i < d->nargs; ++i) {
-      args_copy[i] = d->args[i];
-      arg_parts_copy[i] =
-          copy_parts(o->c, d->args[i].parts, d->args[i].nparts);
-      args_copy[i].parts = arg_parts_copy[i];
+    CGABIValue* args = arena_array(o->f->arena, CGABIValue, d->nargs);
+    for (u32 i = 0; i < d->nargs; ++i) {
+      args[i] = d->args[i];
+      args[i].parts =
+          dup_parts(o->f->arena, d->args[i].parts, d->args[i].nparts);
     }
+    aux->desc.args = args;
   }
-  ret_parts_copy = copy_parts(o->c, d->ret.parts, d->ret.nparts);
-
-  e->u.call.desc = *d;
-  e->u.call.desc.args = args_copy;
-  e->u.call.desc.ret.parts = ret_parts_copy;
-  e->u.call.args = args_copy;
-  e->u.call.arg_parts = arg_parts_copy;
-  e->u.call.ret_parts = ret_parts_copy;
+  aux->desc.ret = d->ret;
+  aux->desc.ret.parts = dup_parts(o->f->arena, d->ret.parts, d->ret.nparts);
+  in->extra.aux = aux;
+  in->type = d->fn_type;
 }
 
 static void w_ret(CGTarget* t, const CGABIValue* v) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_RET);
-  if (!v) {
-    e->u.ret.present = 0;
-    return;
+  Inst* in = rec(o, IR_RET);
+  IRRetAux* aux = arena_znew(o->f->arena, IRRetAux);
+  if (v) {
+    aux->present = 1;
+    aux->val = *v;
+    aux->val.parts = dup_parts(o->f->arena, v->parts, v->nparts);
   }
-  e->u.ret.present = 1;
-  e->u.ret.val = *v;
-  e->u.ret.parts = copy_parts(o->c, v->parts, v->nparts);
-  e->u.ret.val.parts = e->u.ret.parts;
+  in->extra.aux = aux;
+  Block* cb = &o->f->blocks[o->cur];
+  cb->nsucc = 0;
+  after_terminator(o);
 }
 
+/* ---- alloca / variadics / setjmp / atomics / fence / intrinsic ---- */
+
 static void w_alloca_(CGTarget* t, Operand dst, Operand size, u32 align) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_ALLOCA);
-  e->u.alloca_.dst = dst;
-  e->u.alloca_.size = size;
-  e->u.alloca_.align = align;
+  Inst* in = rec(o, IR_ALLOCA);
+  Operand ops[2] = {dst, size};
+  in->opnds = dup_opnds(o->f, ops, 2);
+  in->nopnds = 2;
+  in->extra.imm = (i64)align;
+  if (dst.kind == OPK_REG) set_def(o->f, in, o->cur, (Val)dst.v.reg, dst.type);
 }
 
 static void w_va_start_(CGTarget* t, Operand ap) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_VA_START);
-  e->u.va_se.ap = ap;
+  Inst* in = rec(o, IR_VA_START);
+  Operand ops[1] = {ap};
+  in->opnds = dup_opnds(o->f, ops, 1);
+  in->nopnds = 1;
 }
+
 static void w_va_arg_(CGTarget* t, Operand dst, Operand ap, const Type* ty) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_VA_ARG);
-  e->u.va_arg_.dst = dst;
-  e->u.va_arg_.ap = ap;
-  e->u.va_arg_.ty = ty;
+  Inst* in = rec(o, IR_VA_ARG);
+  Operand ops[2] = {dst, ap};
+  in->opnds = dup_opnds(o->f, ops, 2);
+  in->nopnds = 2;
+  in->extra.aux = (void*)ty;
+  if (dst.kind == OPK_REG) set_def(o->f, in, o->cur, (Val)dst.v.reg, dst.type);
 }
+
 static void w_va_end_(CGTarget* t, Operand ap) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_VA_END);
-  e->u.va_se.ap = ap;
+  Inst* in = rec(o, IR_VA_END);
+  Operand ops[1] = {ap};
+  in->opnds = dup_opnds(o->f, ops, 1);
+  in->nopnds = 1;
 }
+
 static void w_va_copy_(CGTarget* t, Operand dst, Operand src) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_VA_COPY);
-  e->u.copy.dst = dst;
-  e->u.copy.src = src;
+  Inst* in = rec(o, IR_VA_COPY);
+  Operand ops[2] = {dst, src};
+  in->opnds = dup_opnds(o->f, ops, 2);
+  in->nopnds = 2;
 }
 
 static void w_setjmp_(CGTarget* t, Operand dst, Operand buf) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_SETJMP);
-  e->u.setjmp_.dst = dst;
-  e->u.setjmp_.buf = buf;
+  Inst* in = rec(o, IR_SETJMP);
+  Operand ops[2] = {dst, buf};
+  in->opnds = dup_opnds(o->f, ops, 2);
+  in->nopnds = 2;
+  if (dst.kind == OPK_REG) set_def(o->f, in, o->cur, (Val)dst.v.reg, dst.type);
 }
+
 static void w_longjmp_(CGTarget* t, Operand buf, Operand val) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_LONGJMP);
-  e->u.longjmp_.buf = buf;
-  e->u.longjmp_.val = val;
+  Inst* in = rec(o, IR_LONGJMP);
+  Operand ops[2] = {buf, val};
+  in->opnds = dup_opnds(o->f, ops, 2);
+  in->nopnds = 2;
+  Block* cb = &o->f->blocks[o->cur];
+  cb->nsucc = 0;
+  after_terminator(o);
 }
 
 static void w_atomic_load(CGTarget* t, Operand dst, Operand addr, MemAccess m,
                           MemOrder mo) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_ATOMIC_LOAD);
-  e->u.atomic_load.dst = dst;
-  e->u.atomic_load.addr = addr;
-  e->u.atomic_load.mem = m;
-  e->u.atomic_load.mo = mo;
+  Inst* in = rec(o, IR_ATOMIC_LOAD);
+  Operand ops[2] = {dst, addr};
+  in->opnds = dup_opnds(o->f, ops, 2);
+  in->nopnds = 2;
+  IRAtomicAux* aux = arena_znew(o->f->arena, IRAtomicAux);
+  aux->mem = m;
+  aux->mo = mo;
+  in->extra.aux = aux;
+  if (dst.kind == OPK_REG) set_def(o->f, in, o->cur, (Val)dst.v.reg, dst.type);
 }
+
 static void w_atomic_store(CGTarget* t, Operand addr, Operand src, MemAccess m,
                            MemOrder mo) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_ATOMIC_STORE);
-  e->u.atomic_store.addr = addr;
-  e->u.atomic_store.src = src;
-  e->u.atomic_store.mem = m;
-  e->u.atomic_store.mo = mo;
+  Inst* in = rec(o, IR_ATOMIC_STORE);
+  Operand ops[2] = {addr, src};
+  in->opnds = dup_opnds(o->f, ops, 2);
+  in->nopnds = 2;
+  IRAtomicAux* aux = arena_znew(o->f->arena, IRAtomicAux);
+  aux->mem = m;
+  aux->mo = mo;
+  in->extra.aux = aux;
 }
+
 static void w_atomic_rmw(CGTarget* t, AtomicOp op, Operand dst, Operand addr,
                          Operand val, MemAccess m, MemOrder mo) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_ATOMIC_RMW);
-  e->u.atomic_rmw.op = op;
-  e->u.atomic_rmw.dst = dst;
-  e->u.atomic_rmw.addr = addr;
-  e->u.atomic_rmw.val = val;
-  e->u.atomic_rmw.mem = m;
-  e->u.atomic_rmw.mo = mo;
+  Inst* in = rec(o, IR_ATOMIC_RMW);
+  Operand ops[3] = {dst, addr, val};
+  in->opnds = dup_opnds(o->f, ops, 3);
+  in->nopnds = 3;
+  IRAtomicAux* aux = arena_znew(o->f->arena, IRAtomicAux);
+  aux->mem = m;
+  aux->mo = mo;
+  aux->op = (u8)op;
+  in->extra.aux = aux;
+  if (dst.kind == OPK_REG) set_def(o->f, in, o->cur, (Val)dst.v.reg, dst.type);
 }
+
 static void w_atomic_cas(CGTarget* t, Operand prior, Operand ok, Operand addr,
                          Operand expected, Operand desired, MemAccess m,
                          MemOrder s, MemOrder f) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_ATOMIC_CAS);
-  e->u.atomic_cas.prior = prior;
-  e->u.atomic_cas.ok = ok;
-  e->u.atomic_cas.addr = addr;
-  e->u.atomic_cas.expected = expected;
-  e->u.atomic_cas.desired = desired;
-  e->u.atomic_cas.mem = m;
-  e->u.atomic_cas.success = s;
-  e->u.atomic_cas.failure = f;
+  Inst* in = rec(o, IR_ATOMIC_CAS);
+  Operand ops[5] = {prior, ok, addr, expected, desired};
+  in->opnds = dup_opnds(o->f, ops, 5);
+  in->nopnds = 5;
+  IRCasAux* aux = arena_znew(o->f->arena, IRCasAux);
+  aux->mem = m;
+  aux->success = s;
+  aux->failure = f;
+  in->extra.aux = aux;
+  if (prior.kind == OPK_REG)
+    set_def(o->f, in, o->cur, (Val)prior.v.reg, prior.type);
 }
+
 static void w_fence(CGTarget* t, MemOrder mo) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_FENCE);
-  e->u.fence.mo = mo;
+  Inst* in = rec(o, IR_FENCE);
+  in->extra.imm = (i64)mo;
 }
 
-static void w_intrinsic(CGTarget* t, IntrinKind k, Operand* dsts, u32 nd,
+static void w_intrinsic(CGTarget* t, IntrinKind kind, Operand* dsts, u32 nd,
                         const Operand* args, u32 na) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e = tape_append(o, TOP_INTRINSIC);
-  e->u.intrinsic.kind = k;
-  e->u.intrinsic.ndst = nd;
-  e->u.intrinsic.narg = na;
-  e->u.intrinsic.dsts = copy_operands(o->c, dsts, nd);
-  e->u.intrinsic.args = copy_operands(o->c, args, na);
+  Inst* in = rec(o, IR_INTRINSIC);
+  IRIntrinAux* aux = arena_znew(o->f->arena, IRIntrinAux);
+  aux->kind = kind;
+  aux->ndst = nd;
+  aux->narg = na;
+  aux->dsts = nd ? arena_array(o->f->arena, Operand, nd) : NULL;
+  aux->args = na ? arena_array(o->f->arena, Operand, na) : NULL;
+  if (nd) memcpy(aux->dsts, dsts, sizeof(Operand) * nd);
+  if (na) memcpy(aux->args, args, sizeof(Operand) * na);
+  in->extra.aux = aux;
+  if (nd == 1 && dsts[0].kind == OPK_REG) {
+    set_def(o->f, in, o->cur, (Val)dsts[0].v.reg, dsts[0].type);
+  } else if (nd > 1) {
+    in->ndefs = nd;
+    in->defs = arena_array(o->f->arena, Val, nd);
+    for (u32 i = 0; i < nd; ++i) {
+      in->defs[i] =
+          (dsts[i].kind == OPK_REG) ? (Val)dsts[i].v.reg : VAL_NONE;
+      if (in->defs[i] != VAL_NONE && in->defs[i] < o->f->nvals) {
+        o->f->val_def_block[in->defs[i]] = o->cur;
+        o->f->val_def_inst[in->defs[i]] =
+            o->f->blocks[o->cur].ninsts - 1u;
+      }
+    }
+    in->def = in->defs[0];
+    in->type = dsts[0].type;
+  }
 }
 
 static void w_asm_block(CGTarget* t, const char* tmpl,
@@ -897,877 +702,473 @@ static void w_asm_block(CGTarget* t, const char* tmpl,
   (void)in_ops;
   (void)clobbers;
   (void)nclob;
-  /* Group M (inline asm) is deferred in the corpus; the wrapper does
-   * not yet support it. */
   panic_unsupported(impl_of(t), "asm_block");
 }
 
 static void w_set_loc(CGTarget* t, SrcLoc loc) {
   OptImpl* o = impl_of(t);
-  TapeEntry* e;
   o->pending_loc = loc;
-  e = tape_append(o, TOP_SET_LOC);
-  e->u.set_loc.loc = loc;
 }
 
-/* ---- replay-time translation ---- */
+/* ============================================================
+ * Replay: walk the recorded Func and emit to the wrapped target.
+ * ============================================================ */
 
-static Reg xlat_reg(OptImpl* o, Reg vreg) {
-  if (vreg == REG_NONE || vreg == 0u) return vreg;
-  if (vreg >= o->reg_map_cap || o->reg_map[vreg] == REG_NONE) {
-    SrcLoc loc = {0, 0, 0};
-    compiler_panic(o->c, loc, "opt replay: unmapped vreg %u", (unsigned)vreg);
-  }
-  return o->reg_map[vreg];
-}
-
-static FrameSlot xlat_slot(OptImpl* o, FrameSlot vs) {
-  if (vs == FRAME_SLOT_NONE) return FRAME_SLOT_NONE;
-  if (vs >= o->slot_map_cap || o->slot_map[vs] == FRAME_SLOT_NONE) {
+typedef struct ReplayCtx {
+  OptImpl* o;
+  CGTarget* tgt;
+  Reg* val_to_reg;
+  FrameSlot* slot_map;
+  Label* label_map;
+  CGScope* scope_map;
+  u8* val_alloced;
+  u8* block_label_placed;
+} ReplayCtx;
+
+static Reg val_to_target_reg(ReplayCtx* r, Val v) {
+  Func* f = r->o->f;
+  if (v == VAL_NONE) return REG_NONE;
+  if (v >= f->nvals) {
     SrcLoc loc = {0, 0, 0};
-    compiler_panic(o->c, loc, "opt replay: unmapped vslot %u", (unsigned)vs);
+    compiler_panic(r->o->c, loc, "opt replay: Val %u out of range", v);
   }
-  return o->slot_map[vs];
-}
-
-static Label xlat_label(OptImpl* o, Label vl) {
-  if (vl == LABEL_NONE) return LABEL_NONE;
-  if (vl >= o->label_map_cap || o->label_map[vl] == LABEL_NONE) {
-    SrcLoc loc = {0, 0, 0};
-    compiler_panic(o->c, loc, "opt replay: unmapped vlabel %u", (unsigned)vl);
+  if (!r->val_alloced[v]) {
+    r->val_to_reg[v] =
+        r->tgt->alloc_reg(r->tgt, (RegClass)f->val_cls[v], f->val_type[v]);
+    r->val_alloced[v] = 1;
   }
-  return o->label_map[vl];
+  return r->val_to_reg[v];
 }
 
-static CGScope xlat_scope(OptImpl* o, CGScope vs) {
-  if (vs == CG_SCOPE_NONE) return CG_SCOPE_NONE;
-  if (vs >= o->scope_map_cap || o->scope_map[vs] == CG_SCOPE_NONE) {
+static FrameSlot slot_to_target(ReplayCtx* r, FrameSlot vs) {
+  if (vs == FRAME_SLOT_NONE) return FRAME_SLOT_NONE;
+  if (vs >= r->o->f->nframe_slots + 1u) {
     SrcLoc loc = {0, 0, 0};
-    compiler_panic(o->c, loc, "opt replay: unmapped vscope %u", (unsigned)vs);
+    compiler_panic(r->o->c, loc, "opt replay: vslot %u out of range",
+                   (unsigned)vs);
   }
-  return o->scope_map[vs];
+  return r->slot_map[vs];
 }
 
-static Operand xlat_op(OptImpl* o, Operand op) {
+static Operand xlat_op(ReplayCtx* r, Operand op) {
   switch ((OpKind)op.kind) {
     case OPK_IMM:
     case OPK_GLOBAL:
       return op;
     case OPK_REG:
-      op.v.reg = xlat_reg(o, op.v.reg);
+      op.v.reg = val_to_target_reg(r, (Val)op.v.reg);
       return op;
     case OPK_LOCAL:
-      op.v.frame_slot = xlat_slot(o, op.v.frame_slot);
+      op.v.frame_slot = slot_to_target(r, op.v.frame_slot);
       return op;
     case OPK_INDIRECT:
-      op.v.ind.base = xlat_reg(o, op.v.ind.base);
+      op.v.ind.base = val_to_target_reg(r, (Val)op.v.ind.base);
       return op;
   }
-  /* unreachable */
   return op;
 }
 
-static CGABIValue xlat_abivalue(OptImpl* o, const CGABIValue* in,
+static CGABIValue xlat_abivalue(ReplayCtx* r, const CGABIValue* in,
                                 CGABIPart* parts_out) {
   CGABIValue out = *in;
-  out.storage = xlat_op(o, in->storage);
+  out.storage = xlat_op(r, in->storage);
   if (in->nparts && parts_out) {
     for (u32 i = 0; i < in->nparts; ++i) {
       parts_out[i] = in->parts[i];
-      parts_out[i].op = xlat_op(o, in->parts[i].op);
+      parts_out[i].op = xlat_op(r, in->parts[i].op);
     }
     out.parts = parts_out;
+  } else {
+    out.parts = NULL;
   }
   return out;
 }
 
-/* ---- replay ---- */
-
-static void replay(OptImpl* o) {
-  CGTarget* w = o->target;
-
-  /* Pre-size the maps to the high-water mark for this function. */
-  if (o->next_vreg > 1) map_reg_grow(o, o->next_vreg);
-  if (o->next_vslot > 1) map_slot_grow(o, o->next_vslot);
-  if (o->next_vlabel > 1) map_label_grow(o, o->next_vlabel);
-  if (o->next_vscope > 1) map_scope_grow(o, o->next_vscope);
-
-  for (u32 i = 0; i < o->ntape; ++i) {
-    TapeEntry* e = &o->tape[i];
-    if (e->dead) continue;
-    switch ((TapeOpKind)e->op) {
-      case TOP_FUNC_BEGIN: {
-        /* Build a fresh CGFuncDesc with translated param slots. */
-        CGFuncDesc fd = e->u.func_begin.desc;
-        if (fd.nparams) {
-          CGParamDesc* params = arena_array(o->c->tu, CGParamDesc, fd.nparams);
-          for (u32 k = 0; k < fd.nparams; ++k) {
-            params[k] = e->u.func_begin.params[k];
-            params[k].slot = xlat_slot(o, e->u.func_begin.params[k].slot);
-          }
-          fd.params = params;
-        }
-        w->func_begin(w, &fd);
-        break;
-      }
-      case TOP_FUNC_END:
-        w->func_end(w);
-        break;
-      case TOP_ALLOC_REG: {
-        Reg r =
-            w->alloc_reg(w, e->u.alloc_reg.cls, e->u.alloc_reg.ty);
-        Reg v = e->u.alloc_reg.vreg;
-        if (v >= o->reg_map_cap) map_reg_grow(o, v + 1);
-        o->reg_map[v] = r;
-        break;
-      }
-      case TOP_FRAME_SLOT: {
-        FrameSlot s = w->frame_slot(w, &e->u.frame_slot.desc);
-        FrameSlot v = e->u.frame_slot.vslot;
-        if (v >= o->slot_map_cap) map_slot_grow(o, v + 1);
-        o->slot_map[v] = s;
-        break;
-      }
-      case TOP_PARAM: {
-        CGParamDesc d = e->u.param.desc;
-        d.slot = xlat_slot(o, d.slot);
-        w->param(w, &d);
-        break;
-      }
-      case TOP_LABEL_NEW: {
-        Label l = w->label_new(w);
-        Label v = e->u.label_new.vlabel;
-        if (v >= o->label_map_cap) map_label_grow(o, v + 1);
-        o->label_map[v] = l;
-        break;
-      }
-      case TOP_LABEL_PLACE:
-        w->label_place(w, xlat_label(o, e->u.label_op.vlabel));
-        break;
-      case TOP_JUMP:
-        w->jump(w, xlat_label(o, e->u.label_op.vlabel));
-        break;
-      case TOP_CMP_BRANCH:
-        w->cmp_branch(w, e->u.cmp_branch.op, xlat_op(o, e->u.cmp_branch.a),
-                      xlat_op(o, e->u.cmp_branch.b),
-                      xlat_label(o, e->u.cmp_branch.vlabel));
-        break;
-      case TOP_SCOPE_BEGIN: {
-        CGScopeDesc d = e->u.scope_begin.desc;
-        d.cond = xlat_op(o, d.cond);
-        d.break_label = xlat_label(o, d.break_label);
-        d.continue_label = xlat_label(o, d.continue_label);
-        CGScope s = w->scope_begin(w, &d);
-        CGScope v = e->u.scope_begin.vscope;
-        if (v >= o->scope_map_cap) map_scope_grow(o, v + 1);
-        o->scope_map[v] = s;
-        break;
-      }
-      case TOP_SCOPE_ELSE:
-        w->scope_else(w, xlat_scope(o, e->u.scope_op.vscope));
-        break;
-      case TOP_SCOPE_END:
-        w->scope_end(w, xlat_scope(o, e->u.scope_op.vscope));
-        break;
-      case TOP_BREAK_TO:
-        w->break_to(w, xlat_scope(o, e->u.scope_op.vscope));
-        break;
-      case TOP_CONTINUE_TO:
-        w->continue_to(w, xlat_scope(o, e->u.scope_op.vscope));
-        break;
-      case TOP_LOAD_IMM:
-        w->load_imm(w, xlat_op(o, e->u.load_imm.dst), e->u.load_imm.imm);
-        break;
-      case TOP_LOAD_CONST:
-        w->load_const(w, xlat_op(o, e->u.load_const.dst), e->u.load_const.cb);
-        break;
-      case TOP_COPY:
-        w->copy(w, xlat_op(o, e->u.copy.dst), xlat_op(o, e->u.copy.src));
-        break;
-      case TOP_LOAD:
-        w->load(w, xlat_op(o, e->u.load.dst), xlat_op(o, e->u.load.addr),
-                e->u.load.mem);
-        break;
-      case TOP_STORE:
-        w->store(w, xlat_op(o, e->u.store.addr), xlat_op(o, e->u.store.src),
-                 e->u.store.mem);
-        break;
-      case TOP_ADDR_OF:
-        w->addr_of(w, xlat_op(o, e->u.copy.dst), xlat_op(o, e->u.copy.src));
-        break;
-      case TOP_TLS_ADDR_OF:
-        w->tls_addr_of(w, xlat_op(o, e->u.tls_addr_of.dst),
-                       e->u.tls_addr_of.sym, e->u.tls_addr_of.addend);
-        break;
-      case TOP_COPY_BYTES:
-        w->copy_bytes(w, xlat_op(o, e->u.agg.a), xlat_op(o, e->u.agg.b),
-                      e->u.agg.agg);
-        break;
-      case TOP_SET_BYTES:
-        w->set_bytes(w, xlat_op(o, e->u.agg.a), xlat_op(o, e->u.agg.b),
-                     e->u.agg.agg);
-        break;
-      case TOP_BITFIELD_LOAD:
-        w->bitfield_load(w, xlat_op(o, e->u.bitfield_load.dst),
-                         xlat_op(o, e->u.bitfield_load.record),
-                         e->u.bitfield_load.bf);
-        break;
-      case TOP_BITFIELD_STORE:
-        w->bitfield_store(w, xlat_op(o, e->u.bitfield_store.record),
-                          xlat_op(o, e->u.bitfield_store.src),
-                          e->u.bitfield_store.bf);
-        break;
-      case TOP_BINOP:
-        w->binop(w, e->u.binop.op, xlat_op(o, e->u.binop.dst),
-                 xlat_op(o, e->u.binop.a), xlat_op(o, e->u.binop.b));
-        break;
-      case TOP_UNOP:
-        w->unop(w, e->u.unop.op, xlat_op(o, e->u.unop.dst),
-                xlat_op(o, e->u.unop.a));
-        break;
-      case TOP_CMP:
-        w->cmp(w, e->u.cmp.op, xlat_op(o, e->u.cmp.dst),
-               xlat_op(o, e->u.cmp.a), xlat_op(o, e->u.cmp.b));
-        break;
-      case TOP_CONVERT:
-        w->convert(w, e->u.convert.kind, xlat_op(o, e->u.convert.dst),
-                   xlat_op(o, e->u.convert.src));
-        break;
-      case TOP_CALL: {
-        CGCallDesc cd = e->u.call.desc;
-        cd.callee = xlat_op(o, cd.callee);
-        CGABIValue* args = NULL;
-        if (cd.nargs) {
-          args = arena_array(o->c->tu, CGABIValue, cd.nargs);
-          for (u32 k = 0; k < cd.nargs; ++k) {
-            CGABIPart* parts =
-                e->u.call.args[k].nparts
-                    ? arena_array(o->c->tu, CGABIPart,
-                                  e->u.call.args[k].nparts)
-                    : NULL;
-            args[k] = xlat_abivalue(o, &e->u.call.args[k], parts);
-          }
-          cd.args = args;
-        } else {
-          cd.args = NULL;
+static Label ensure_label(ReplayCtx* r, u32 b) {
+  if (b >= r->o->f->nblocks) return LABEL_NONE;
+  if (r->label_map[b] == LABEL_NONE) {
+    r->label_map[b] = r->tgt->label_new(r->tgt);
+  }
+  return r->label_map[b];
+}
+
+static void ensure_label_placed(ReplayCtx* r, u32 b) {
+  if (r->block_label_placed[b]) return;
+  r->block_label_placed[b] = 1;
+  if (b == r->o->f->entry) return;
+  Label l = ensure_label(r, b);
+  r->tgt->label_place(r->tgt, l);
+}
+
+static void replay_inst(ReplayCtx* r, u32 b, Inst* in) {
+  CGTarget* w = r->tgt;
+  w->set_loc(w, in->loc);
+
+  switch ((IROp)in->op) {
+    case IR_NOP:
+    case IR_CONST_I:
+    case IR_CONST_BYTES:
+    case IR_PARAM_DECL:
+    case IR_PHI:
+    case IR_CONDBR:
+    case IR_ASM_BLOCK:
+      break;
+    case IR_LOAD_IMM: {
+      Operand dst = xlat_op(r, in->opnds[0]);
+      w->load_imm(w, dst, in->extra.imm);
+      break;
+    }
+    case IR_LOAD_CONST: {
+      Operand dst = xlat_op(r, in->opnds[0]);
+      w->load_const(w, dst, in->extra.cbytes);
+      break;
+    }
+    case IR_COPY: {
+      Operand dst = xlat_op(r, in->opnds[0]);
+      Operand src = xlat_op(r, in->opnds[1]);
+      w->copy(w, dst, src);
+      break;
+    }
+    case IR_LOAD: {
+      Operand dst = xlat_op(r, in->opnds[0]);
+      Operand addr = xlat_op(r, in->opnds[1]);
+      w->load(w, dst, addr, in->extra.mem);
+      break;
+    }
+    case IR_STORE: {
+      Operand addr = xlat_op(r, in->opnds[0]);
+      Operand src = xlat_op(r, in->opnds[1]);
+      w->store(w, addr, src, in->extra.mem);
+      break;
+    }
+    case IR_ADDR_OF: {
+      Operand dst = xlat_op(r, in->opnds[0]);
+      Operand lv = xlat_op(r, in->opnds[1]);
+      w->addr_of(w, dst, lv);
+      break;
+    }
+    case IR_TLS_ADDR_OF: {
+      Operand dst = xlat_op(r, in->opnds[0]);
+      IRTlsAux* aux = (IRTlsAux*)in->extra.aux;
+      w->tls_addr_of(w, dst, aux->sym, aux->addend);
+      break;
+    }
+    case IR_AGG_COPY: {
+      Operand a = xlat_op(r, in->opnds[0]);
+      Operand bo = xlat_op(r, in->opnds[1]);
+      IRAggAux* aux = (IRAggAux*)in->extra.aux;
+      w->copy_bytes(w, a, bo, aux->access);
+      break;
+    }
+    case IR_AGG_SET: {
+      Operand a = xlat_op(r, in->opnds[0]);
+      Operand bo = xlat_op(r, in->opnds[1]);
+      IRAggAux* aux = (IRAggAux*)in->extra.aux;
+      w->set_bytes(w, a, bo, aux->access);
+      break;
+    }
+    case IR_BITFIELD_LOAD: {
+      Operand dst = xlat_op(r, in->opnds[0]);
+      Operand rec_ = xlat_op(r, in->opnds[1]);
+      IRBitFieldAux* aux = (IRBitFieldAux*)in->extra.aux;
+      w->bitfield_load(w, dst, rec_, aux->access);
+      break;
+    }
+    case IR_BITFIELD_STORE: {
+      Operand rec_ = xlat_op(r, in->opnds[0]);
+      Operand src = xlat_op(r, in->opnds[1]);
+      IRBitFieldAux* aux = (IRBitFieldAux*)in->extra.aux;
+      w->bitfield_store(w, rec_, src, aux->access);
+      break;
+    }
+    case IR_BINOP: {
+      Operand dst = xlat_op(r, in->opnds[0]);
+      Operand a = xlat_op(r, in->opnds[1]);
+      Operand bo = xlat_op(r, in->opnds[2]);
+      w->binop(w, (BinOp)in->extra.imm, dst, a, bo);
+      break;
+    }
+    case IR_UNOP: {
+      Operand dst = xlat_op(r, in->opnds[0]);
+      Operand a = xlat_op(r, in->opnds[1]);
+      w->unop(w, (UnOp)in->extra.imm, dst, a);
+      break;
+    }
+    case IR_CMP: {
+      Operand dst = xlat_op(r, in->opnds[0]);
+      Operand a = xlat_op(r, in->opnds[1]);
+      Operand bo = xlat_op(r, in->opnds[2]);
+      w->cmp(w, (CmpOp)in->extra.imm, dst, a, bo);
+      break;
+    }
+    case IR_CONVERT: {
+      Operand dst = xlat_op(r, in->opnds[0]);
+      Operand src = xlat_op(r, in->opnds[1]);
+      w->convert(w, (ConvKind)in->extra.imm, dst, src);
+      break;
+    }
+    case IR_CALL: {
+      IRCallAux* aux = (IRCallAux*)in->extra.aux;
+      CGCallDesc cd = aux->desc;
+      cd.callee = xlat_op(r, cd.callee);
+      CGABIValue* args = NULL;
+      if (cd.nargs) {
+        args = arena_array(r->o->f->arena, CGABIValue, cd.nargs);
+        for (u32 k = 0; k < cd.nargs; ++k) {
+          CGABIPart* parts =
+              aux->desc.args[k].nparts
+                  ? arena_array(r->o->f->arena, CGABIPart,
+                                aux->desc.args[k].nparts)
+                  : NULL;
+          args[k] = xlat_abivalue(r, &aux->desc.args[k], parts);
         }
-        CGABIPart* ret_parts =
-            cd.ret.nparts
-                ? arena_array(o->c->tu, CGABIPart, cd.ret.nparts)
-                : NULL;
-        cd.ret = xlat_abivalue(o, &e->u.call.desc.ret, ret_parts);
-        w->call(w, &cd);
-        break;
+        cd.args = args;
+      } else {
+        cd.args = NULL;
       }
-      case TOP_RET: {
-        if (!e->u.ret.present) {
-          w->ret(w, NULL);
-          break;
-        }
+      CGABIPart* ret_parts =
+          cd.ret.nparts
+              ? arena_array(r->o->f->arena, CGABIPart, cd.ret.nparts)
+              : NULL;
+      cd.ret = xlat_abivalue(r, &aux->desc.ret, ret_parts);
+      w->call(w, &cd);
+      break;
+    }
+    case IR_BR: {
+      Block* bl = &r->o->f->blocks[b];
+      if (bl->nsucc < 1) break;
+      Label l = ensure_label(r, bl->succ[0]);
+      w->jump(w, l);
+      break;
+    }
+    case IR_CMP_BRANCH: {
+      Operand a = xlat_op(r, in->opnds[0]);
+      Operand bo = xlat_op(r, in->opnds[1]);
+      Block* bl = &r->o->f->blocks[b];
+      Label taken = ensure_label(r, bl->succ[0]);
+      w->cmp_branch(w, (CmpOp)in->extra.imm, a, bo, taken);
+      break;
+    }
+    case IR_RET: {
+      IRRetAux* aux = (IRRetAux*)in->extra.aux;
+      if (!aux || !aux->present) {
+        w->ret(w, NULL);
+      } else {
         CGABIPart* parts =
-            e->u.ret.val.nparts
-                ? arena_array(o->c->tu, CGABIPart, e->u.ret.val.nparts)
+            aux->val.nparts
+                ? arena_array(r->o->f->arena, CGABIPart, aux->val.nparts)
                 : NULL;
-        CGABIValue v = xlat_abivalue(o, &e->u.ret.val, parts);
+        CGABIValue v = xlat_abivalue(r, &aux->val, parts);
         w->ret(w, &v);
-        break;
       }
-      case TOP_ALLOCA:
-        w->alloca_(w, xlat_op(o, e->u.alloca_.dst),
-                   xlat_op(o, e->u.alloca_.size), e->u.alloca_.align);
-        break;
-      case TOP_VA_START:
-        w->va_start_(w, xlat_op(o, e->u.va_se.ap));
-        break;
-      case TOP_VA_ARG:
-        w->va_arg_(w, xlat_op(o, e->u.va_arg_.dst),
-                   xlat_op(o, e->u.va_arg_.ap), e->u.va_arg_.ty);
-        break;
-      case TOP_VA_END:
-        w->va_end_(w, xlat_op(o, e->u.va_se.ap));
-        break;
-      case TOP_VA_COPY:
-        w->va_copy_(w, xlat_op(o, e->u.copy.dst), xlat_op(o, e->u.copy.src));
-        break;
-      case TOP_SETJMP:
-        w->setjmp_(w, xlat_op(o, e->u.setjmp_.dst),
-                   xlat_op(o, e->u.setjmp_.buf));
-        break;
-      case TOP_LONGJMP:
-        w->longjmp_(w, xlat_op(o, e->u.longjmp_.buf),
-                    xlat_op(o, e->u.longjmp_.val));
-        break;
-      case TOP_ATOMIC_LOAD:
-        w->atomic_load(w, xlat_op(o, e->u.atomic_load.dst),
-                       xlat_op(o, e->u.atomic_load.addr),
-                       e->u.atomic_load.mem, e->u.atomic_load.mo);
-        break;
-      case TOP_ATOMIC_STORE:
-        w->atomic_store(w, xlat_op(o, e->u.atomic_store.addr),
-                        xlat_op(o, e->u.atomic_store.src),
-                        e->u.atomic_store.mem, e->u.atomic_store.mo);
-        break;
-      case TOP_ATOMIC_RMW:
-        w->atomic_rmw(w, e->u.atomic_rmw.op, xlat_op(o, e->u.atomic_rmw.dst),
-                      xlat_op(o, e->u.atomic_rmw.addr),
-                      xlat_op(o, e->u.atomic_rmw.val), e->u.atomic_rmw.mem,
-                      e->u.atomic_rmw.mo);
-        break;
-      case TOP_ATOMIC_CAS:
-        w->atomic_cas(w, xlat_op(o, e->u.atomic_cas.prior),
-                      xlat_op(o, e->u.atomic_cas.ok),
-                      xlat_op(o, e->u.atomic_cas.addr),
-                      xlat_op(o, e->u.atomic_cas.expected),
-                      xlat_op(o, e->u.atomic_cas.desired),
-                      e->u.atomic_cas.mem, e->u.atomic_cas.success,
-                      e->u.atomic_cas.failure);
-        break;
-      case TOP_FENCE:
-        w->fence(w, e->u.fence.mo);
-        break;
-      case TOP_INTRINSIC: {
-        Operand* dsts = NULL;
-        Operand* args = NULL;
-        if (e->u.intrinsic.ndst) {
-          dsts = arena_array(o->c->tu, Operand, e->u.intrinsic.ndst);
-          for (u32 k = 0; k < e->u.intrinsic.ndst; ++k) {
-            dsts[k] = xlat_op(o, e->u.intrinsic.dsts[k]);
-          }
-        }
-        if (e->u.intrinsic.narg) {
-          args = arena_array(o->c->tu, Operand, e->u.intrinsic.narg);
-          for (u32 k = 0; k < e->u.intrinsic.narg; ++k) {
-            args[k] = xlat_op(o, e->u.intrinsic.args[k]);
-          }
-        }
-        w->intrinsic(w, e->u.intrinsic.kind, dsts, e->u.intrinsic.ndst, args,
-                     e->u.intrinsic.narg);
-        break;
+      break;
+    }
+    case IR_SCOPE_BEGIN: {
+      IRScopeAux* aux = (IRScopeAux*)in->extra.aux;
+      CGScopeDesc d = aux->desc;
+      d.cond = xlat_op(r, d.cond);
+      if (aux->desc.kind == SCOPE_LOOP || aux->desc.kind == SCOPE_BLOCK) {
+        d.break_label = aux->loop_break_block
+                            ? ensure_label(r, aux->loop_break_block)
+                            : LABEL_NONE;
+        d.continue_label = aux->loop_continue_block
+                               ? ensure_label(r, aux->loop_continue_block)
+                               : LABEL_NONE;
       }
-      case TOP_SET_LOC:
-        w->set_loc(w, e->u.set_loc.loc);
-        break;
+      CGScope cs = w->scope_begin(w, &d);
+      r->scope_map[aux->scope_id] = cs;
+      break;
+    }
+    case IR_SCOPE_ELSE:
+      w->scope_else(w, r->scope_map[(u32)in->extra.imm]);
+      break;
+    case IR_SCOPE_END:
+      w->scope_end(w, r->scope_map[(u32)in->extra.imm]);
+      break;
+    case IR_BREAK_TO:
+      w->break_to(w, r->scope_map[(u32)in->extra.imm]);
+      break;
+    case IR_CONTINUE_TO:
+      w->continue_to(w, r->scope_map[(u32)in->extra.imm]);
+      break;
+    case IR_ALLOCA: {
+      Operand dst = xlat_op(r, in->opnds[0]);
+      Operand size = xlat_op(r, in->opnds[1]);
+      w->alloca_(w, dst, size, (u32)in->extra.imm);
+      break;
+    }
+    case IR_VA_START: {
+      Operand ap = xlat_op(r, in->opnds[0]);
+      w->va_start_(w, ap);
+      break;
+    }
+    case IR_VA_ARG: {
+      Operand dst = xlat_op(r, in->opnds[0]);
+      Operand ap = xlat_op(r, in->opnds[1]);
+      const Type* ty = (const Type*)in->extra.aux;
+      w->va_arg_(w, dst, ap, ty);
+      break;
+    }
+    case IR_VA_END: {
+      Operand ap = xlat_op(r, in->opnds[0]);
+      w->va_end_(w, ap);
+      break;
+    }
+    case IR_VA_COPY: {
+      Operand a = xlat_op(r, in->opnds[0]);
+      Operand src = xlat_op(r, in->opnds[1]);
+      w->va_copy_(w, a, src);
+      break;
+    }
+    case IR_SETJMP: {
+      Operand dst = xlat_op(r, in->opnds[0]);
+      Operand buf = xlat_op(r, in->opnds[1]);
+      w->setjmp_(w, dst, buf);
+      break;
+    }
+    case IR_LONGJMP: {
+      Operand buf = xlat_op(r, in->opnds[0]);
+      Operand val = xlat_op(r, in->opnds[1]);
+      w->longjmp_(w, buf, val);
+      break;
+    }
+    case IR_ATOMIC_LOAD: {
+      Operand dst = xlat_op(r, in->opnds[0]);
+      Operand addr = xlat_op(r, in->opnds[1]);
+      IRAtomicAux* aux = (IRAtomicAux*)in->extra.aux;
+      w->atomic_load(w, dst, addr, aux->mem, aux->mo);
+      break;
+    }
+    case IR_ATOMIC_STORE: {
+      Operand addr = xlat_op(r, in->opnds[0]);
+      Operand src = xlat_op(r, in->opnds[1]);
+      IRAtomicAux* aux = (IRAtomicAux*)in->extra.aux;
+      w->atomic_store(w, addr, src, aux->mem, aux->mo);
+      break;
+    }
+    case IR_ATOMIC_RMW: {
+      Operand dst = xlat_op(r, in->opnds[0]);
+      Operand addr = xlat_op(r, in->opnds[1]);
+      Operand val = xlat_op(r, in->opnds[2]);
+      IRAtomicAux* aux = (IRAtomicAux*)in->extra.aux;
+      w->atomic_rmw(w, (AtomicOp)aux->op, dst, addr, val, aux->mem, aux->mo);
+      break;
+    }
+    case IR_ATOMIC_CAS: {
+      Operand prior = xlat_op(r, in->opnds[0]);
+      Operand ok = xlat_op(r, in->opnds[1]);
+      Operand addr = xlat_op(r, in->opnds[2]);
+      Operand expected = xlat_op(r, in->opnds[3]);
+      Operand desired = xlat_op(r, in->opnds[4]);
+      IRCasAux* aux = (IRCasAux*)in->extra.aux;
+      w->atomic_cas(w, prior, ok, addr, expected, desired, aux->mem,
+                    aux->success, aux->failure);
+      break;
+    }
+    case IR_FENCE:
+      w->fence(w, (MemOrder)in->extra.imm);
+      break;
+    case IR_INTRINSIC: {
+      IRIntrinAux* aux = (IRIntrinAux*)in->extra.aux;
+      Operand* dsts =
+          aux->ndst ? arena_array(r->o->f->arena, Operand, aux->ndst) : NULL;
+      Operand* args =
+          aux->narg ? arena_array(r->o->f->arena, Operand, aux->narg) : NULL;
+      for (u32 k = 0; k < aux->ndst; ++k) dsts[k] = xlat_op(r, aux->dsts[k]);
+      for (u32 k = 0; k < aux->narg; ++k) args[k] = xlat_op(r, aux->args[k]);
+      w->intrinsic(w, aux->kind, dsts, aux->ndst, args, aux->narg);
+      break;
     }
   }
 }
 
-/* ---- printer ---- */
-
-static void wstr(Writer* w, const char* s) {
-  size_t n = 0;
-  while (s[n]) ++n;
-  if (n) w->write(w, s, n);
-}
-
-/* Minimal i64 → decimal formatter. Writes into a 32-byte buffer (enough
- * for INT64_MIN). Returns nothing; the caller hands the buffer to wstr. */
-static void fmt_i64(i64 v, char* out) {
-  char tmp[32];
-  u32 n = 0;
-  u64 u;
-  int neg = 0;
-  if (v < 0) {
-    neg = 1;
-    u = (u64)(-(v + 1)) + 1u; /* avoid UB for INT64_MIN */
-  } else {
-    u = (u64)v;
-  }
-  do {
-    tmp[n++] = (char)('0' + (u % 10u));
-    u /= 10u;
-  } while (u);
-  if (neg) tmp[n++] = '-';
-  /* reverse */
-  for (u32 i = 0; i < n; ++i) out[i] = tmp[n - 1 - i];
-  out[n] = 0;
-}
-
-static void wint(Writer* w, i64 v) {
-  char buf[32];
-  fmt_i64(v, buf);
-  wstr(w, buf);
-}
-
-static const char* binop_name(BinOp op) {
-  switch (op) {
-    case BO_IADD: return "iadd";
-    case BO_ISUB: return "isub";
-    case BO_IMUL: return "imul";
-    case BO_SDIV: return "sdiv";
-    case BO_UDIV: return "udiv";
-    case BO_SREM: return "srem";
-    case BO_UREM: return "urem";
-    case BO_FADD: return "fadd";
-    case BO_FSUB: return "fsub";
-    case BO_FMUL: return "fmul";
-    case BO_FDIV: return "fdiv";
-    case BO_AND:  return "and";
-    case BO_OR:   return "or";
-    case BO_XOR:  return "xor";
-    case BO_SHL:  return "shl";
-    case BO_SHR_S: return "shr_s";
-    case BO_SHR_U: return "shr_u";
+static void replay_block(ReplayCtx* r, u32 b) {
+  Func* f = r->o->f;
+  if (b >= f->nblocks) return;
+  ensure_label_placed(r, b);
+  Block* bl = &f->blocks[b];
+  for (u32 i = 0; i < bl->ninsts; ++i) {
+    replay_inst(r, b, &bl->insts[i]);
   }
-  return "?binop";
 }
 
-static const char* unop_name(UnOp op) {
-  switch (op) {
-    case UO_NEG: return "neg";
-    case UO_NOT: return "not";
-    case UO_BNOT: return "bnot";
-  }
-  return "?unop";
-}
-
-static const char* cmp_name(CmpOp op) {
-  switch (op) {
-    case CMP_EQ: return "eq";
-    case CMP_NE: return "ne";
-    case CMP_LT_S: return "lt_s";
-    case CMP_LE_S: return "le_s";
-    case CMP_GT_S: return "gt_s";
-    case CMP_GE_S: return "ge_s";
-    case CMP_LT_U: return "lt_u";
-    case CMP_LE_U: return "le_u";
-    case CMP_GT_U: return "gt_u";
-    case CMP_GE_U: return "ge_u";
-    case CMP_LT_F: return "lt_f";
-    case CMP_LE_F: return "le_f";
-    case CMP_GT_F: return "gt_f";
-    case CMP_GE_F: return "ge_f";
-  }
-  return "?cmp";
-}
+static void replay_func(OptImpl* o) {
+  Func* f = o->f;
+  CGTarget* w = o->target;
 
-static void print_operand(Writer* w, const Operand* op) {
-  switch ((OpKind)op->kind) {
-    case OPK_IMM:
-      wstr(w, "imm:");
-      wint(w, op->v.imm);
-      return;
-    case OPK_REG:
-      wstr(w, "v");
-      wint(w, (i64)op->v.reg);
-      return;
-    case OPK_LOCAL:
-      wstr(w, "fs");
-      wint(w, (i64)op->v.frame_slot);
-      return;
-    case OPK_GLOBAL:
-      wstr(w, "sym");
-      wint(w, (i64)op->v.global.sym);
-      if (op->v.global.addend) {
-        wstr(w, "+");
-        wint(w, op->v.global.addend);
-      }
-      return;
-    case OPK_INDIRECT:
-      wstr(w, "[v");
-      wint(w, (i64)op->v.ind.base);
-      if (op->v.ind.ofs) {
-        wstr(w, "+");
-        wint(w, op->v.ind.ofs);
-      }
-      wstr(w, "]");
-      return;
+  ReplayCtx r;
+  r.o = o;
+  r.tgt = w;
+  u32 nv = f->nvals ? f->nvals : 1u;
+  r.val_to_reg = arena_zarray(f->arena, Reg, nv);
+  for (u32 i = 0; i < nv; ++i) r.val_to_reg[i] = REG_NONE;
+  r.val_alloced = arena_zarray(f->arena, u8, nv);
+  r.slot_map = arena_zarray(f->arena, FrameSlot, f->nframe_slots + 1u);
+  for (u32 i = 0; i <= f->nframe_slots; ++i) r.slot_map[i] = FRAME_SLOT_NONE;
+  u32 nb = f->nblocks ? f->nblocks : 1u;
+  r.label_map = arena_zarray(f->arena, Label, nb);
+  for (u32 i = 0; i < f->nblocks; ++i) r.label_map[i] = LABEL_NONE;
+  r.scope_map = arena_zarray(f->arena, CGScope, f->nscopes + 1u);
+  for (u32 i = 0; i <= f->nscopes; ++i) r.scope_map[i] = CG_SCOPE_NONE;
+  r.block_label_placed = arena_zarray(f->arena, u8, nb);
+
+  /* func_begin with the recorded descriptor. The desc.params[].slot
+   * fields are wrapper IR slot ids; aarch64's func_begin doesn't
+   * dereference them so we don't translate. */
+  w->func_begin(w, &f->desc);
+
+  for (u32 i = 0; i < f->nframe_slots; ++i) {
+    IRFrameSlot* s = &f->frame_slots[i];
+    FrameSlotDesc d = {0};
+    d.type = s->type;
+    d.name = s->name;
+    d.loc = s->loc;
+    d.size = s->size;
+    d.align = s->align;
+    d.kind = s->kind;
+    d.flags = s->flags;
+    r.slot_map[s->id] = w->frame_slot(w, &d);
   }
-  wstr(w, "?op");
-}
 
-static void print_tape(OptImpl* o, Writer* w) {
-  for (u32 i = 0; i < o->ntape; ++i) {
-    TapeEntry* e = &o->tape[i];
-    if (e->dead) {
-      wstr(w, "  ; dead\n");
-      continue;
-    }
-    wstr(w, "  ");
-    switch ((TapeOpKind)e->op) {
-      case TOP_FUNC_BEGIN:
-        wstr(w, "func_begin sym=");
-        wint(w, (i64)e->u.func_begin.desc.sym);
-        wstr(w, " nparams=");
-        wint(w, (i64)e->u.func_begin.desc.nparams);
-        break;
-      case TOP_FUNC_END:
-        wstr(w, "func_end");
-        break;
-      case TOP_ALLOC_REG:
-        wstr(w, "alloc_reg v");
-        wint(w, (i64)e->u.alloc_reg.vreg);
-        wstr(w, " cls=");
-        wint(w, (i64)e->u.alloc_reg.cls);
-        break;
-      case TOP_FRAME_SLOT:
-        wstr(w, "frame_slot fs");
-        wint(w, (i64)e->u.frame_slot.vslot);
-        wstr(w, " size=");
-        wint(w, (i64)e->u.frame_slot.desc.size);
-        wstr(w, " kind=");
-        wint(w, (i64)e->u.frame_slot.desc.kind);
-        break;
-      case TOP_PARAM:
-        wstr(w, "param idx=");
-        wint(w, (i64)e->u.param.desc.index);
-        wstr(w, " fs=");
-        wint(w, (i64)e->u.param.desc.slot);
-        break;
-      case TOP_LABEL_NEW:
-        wstr(w, "label_new L");
-        wint(w, (i64)e->u.label_new.vlabel);
-        break;
-      case TOP_LABEL_PLACE:
-        wstr(w, "label_place L");
-        wint(w, (i64)e->u.label_op.vlabel);
-        break;
-      case TOP_JUMP:
-        wstr(w, "jump L");
-        wint(w, (i64)e->u.label_op.vlabel);
-        break;
-      case TOP_CMP_BRANCH:
-        wstr(w, "cmp_branch ");
-        wstr(w, cmp_name(e->u.cmp_branch.op));
-        wstr(w, " ");
-        print_operand(w, &e->u.cmp_branch.a);
-        wstr(w, ", ");
-        print_operand(w, &e->u.cmp_branch.b);
-        wstr(w, " -> L");
-        wint(w, (i64)e->u.cmp_branch.vlabel);
-        break;
-      case TOP_SCOPE_BEGIN:
-        wstr(w, "scope_begin S");
-        wint(w, (i64)e->u.scope_begin.vscope);
-        wstr(w, " kind=");
-        wint(w, (i64)e->u.scope_begin.desc.kind);
-        break;
-      case TOP_SCOPE_ELSE:
-        wstr(w, "scope_else S");
-        wint(w, (i64)e->u.scope_op.vscope);
-        break;
-      case TOP_SCOPE_END:
-        wstr(w, "scope_end S");
-        wint(w, (i64)e->u.scope_op.vscope);
-        break;
-      case TOP_BREAK_TO:
-        wstr(w, "break_to S");
-        wint(w, (i64)e->u.scope_op.vscope);
-        break;
-      case TOP_CONTINUE_TO:
-        wstr(w, "continue_to S");
-        wint(w, (i64)e->u.scope_op.vscope);
-        break;
-      case TOP_LOAD_IMM:
-        wstr(w, "load_imm ");
-        print_operand(w, &e->u.load_imm.dst);
-        wstr(w, ", ");
-        wint(w, e->u.load_imm.imm);
-        break;
-      case TOP_LOAD_CONST:
-        wstr(w, "load_const ");
-        print_operand(w, &e->u.load_const.dst);
-        wstr(w, ", <bytes:");
-        wint(w, (i64)e->u.load_const.cb.size);
-        wstr(w, ">");
-        break;
-      case TOP_COPY:
-        wstr(w, "copy ");
-        print_operand(w, &e->u.copy.dst);
-        wstr(w, ", ");
-        print_operand(w, &e->u.copy.src);
-        break;
-      case TOP_LOAD:
-        wstr(w, "load ");
-        print_operand(w, &e->u.load.dst);
-        wstr(w, ", ");
-        print_operand(w, &e->u.load.addr);
-        break;
-      case TOP_STORE:
-        wstr(w, "store ");
-        print_operand(w, &e->u.store.addr);
-        wstr(w, ", ");
-        print_operand(w, &e->u.store.src);
-        break;
-      case TOP_ADDR_OF:
-        wstr(w, "addr_of ");
-        print_operand(w, &e->u.copy.dst);
-        wstr(w, ", ");
-        print_operand(w, &e->u.copy.src);
-        break;
-      case TOP_TLS_ADDR_OF:
-        wstr(w, "tls_addr_of ");
-        print_operand(w, &e->u.tls_addr_of.dst);
-        wstr(w, ", sym");
-        wint(w, (i64)e->u.tls_addr_of.sym);
-        break;
-      case TOP_COPY_BYTES:
-        wstr(w, "copy_bytes ");
-        print_operand(w, &e->u.agg.a);
-        wstr(w, ", ");
-        print_operand(w, &e->u.agg.b);
-        wstr(w, " size=");
-        wint(w, (i64)e->u.agg.agg.size);
-        break;
-      case TOP_SET_BYTES:
-        wstr(w, "set_bytes ");
-        print_operand(w, &e->u.agg.a);
-        wstr(w, ", ");
-        print_operand(w, &e->u.agg.b);
-        wstr(w, " size=");
-        wint(w, (i64)e->u.agg.agg.size);
-        break;
-      case TOP_BITFIELD_LOAD:
-        wstr(w, "bitfield_load ");
-        print_operand(w, &e->u.bitfield_load.dst);
-        wstr(w, ", ");
-        print_operand(w, &e->u.bitfield_load.record);
-        break;
-      case TOP_BITFIELD_STORE:
-        wstr(w, "bitfield_store ");
-        print_operand(w, &e->u.bitfield_store.record);
-        wstr(w, ", ");
-        print_operand(w, &e->u.bitfield_store.src);
-        break;
-      case TOP_BINOP:
-        wstr(w, binop_name(e->u.binop.op));
-        wstr(w, " ");
-        print_operand(w, &e->u.binop.dst);
-        wstr(w, ", ");
-        print_operand(w, &e->u.binop.a);
-        wstr(w, ", ");
-        print_operand(w, &e->u.binop.b);
-        break;
-      case TOP_UNOP:
-        wstr(w, unop_name(e->u.unop.op));
-        wstr(w, " ");
-        print_operand(w, &e->u.unop.dst);
-        wstr(w, ", ");
-        print_operand(w, &e->u.unop.a);
-        break;
-      case TOP_CMP:
-        wstr(w, "cmp.");
-        wstr(w, cmp_name(e->u.cmp.op));
-        wstr(w, " ");
-        print_operand(w, &e->u.cmp.dst);
-        wstr(w, ", ");
-        print_operand(w, &e->u.cmp.a);
-        wstr(w, ", ");
-        print_operand(w, &e->u.cmp.b);
-        break;
-      case TOP_CONVERT:
-        wstr(w, "convert ");
-        print_operand(w, &e->u.convert.dst);
-        wstr(w, ", ");
-        print_operand(w, &e->u.convert.src);
-        wstr(w, " kind=");
-        wint(w, (i64)e->u.convert.kind);
-        break;
-      case TOP_CALL:
-        wstr(w, "call ");
-        print_operand(w, &e->u.call.desc.callee);
-        wstr(w, " nargs=");
-        wint(w, (i64)e->u.call.desc.nargs);
-        break;
-      case TOP_RET:
-        wstr(w, "ret");
-        if (e->u.ret.present) {
-          wstr(w, " ");
-          print_operand(w, &e->u.ret.val.storage);
-        }
-        break;
-      case TOP_ALLOCA:
-        wstr(w, "alloca ");
-        print_operand(w, &e->u.alloca_.dst);
-        wstr(w, ", ");
-        print_operand(w, &e->u.alloca_.size);
-        break;
-      case TOP_VA_START:
-        wstr(w, "va_start ");
-        print_operand(w, &e->u.va_se.ap);
-        break;
-      case TOP_VA_ARG:
-        wstr(w, "va_arg ");
-        print_operand(w, &e->u.va_arg_.dst);
-        wstr(w, ", ");
-        print_operand(w, &e->u.va_arg_.ap);
-        break;
-      case TOP_VA_END:
-        wstr(w, "va_end ");
-        print_operand(w, &e->u.va_se.ap);
-        break;
-      case TOP_VA_COPY:
-        wstr(w, "va_copy ");
-        print_operand(w, &e->u.copy.dst);
-        wstr(w, ", ");
-        print_operand(w, &e->u.copy.src);
-        break;
-      case TOP_SETJMP:
-        wstr(w, "setjmp ");
-        print_operand(w, &e->u.setjmp_.dst);
-        wstr(w, ", ");
-        print_operand(w, &e->u.setjmp_.buf);
-        break;
-      case TOP_LONGJMP:
-        wstr(w, "longjmp ");
-        print_operand(w, &e->u.longjmp_.buf);
-        wstr(w, ", ");
-        print_operand(w, &e->u.longjmp_.val);
-        break;
-      case TOP_ATOMIC_LOAD:
-        wstr(w, "atomic_load ");
-        print_operand(w, &e->u.atomic_load.dst);
-        wstr(w, ", ");
-        print_operand(w, &e->u.atomic_load.addr);
-        break;
-      case TOP_ATOMIC_STORE:
-        wstr(w, "atomic_store ");
-        print_operand(w, &e->u.atomic_store.addr);
-        wstr(w, ", ");
-        print_operand(w, &e->u.atomic_store.src);
-        break;
-      case TOP_ATOMIC_RMW:
-        wstr(w, "atomic_rmw op=");
-        wint(w, (i64)e->u.atomic_rmw.op);
-        wstr(w, " ");
-        print_operand(w, &e->u.atomic_rmw.dst);
-        wstr(w, ", ");
-        print_operand(w, &e->u.atomic_rmw.addr);
-        wstr(w, ", ");
-        print_operand(w, &e->u.atomic_rmw.val);
-        break;
-      case TOP_ATOMIC_CAS:
-        wstr(w, "atomic_cas prior=");
-        print_operand(w, &e->u.atomic_cas.prior);
-        wstr(w, " ok=");
-        print_operand(w, &e->u.atomic_cas.ok);
-        wstr(w, " addr=");
-        print_operand(w, &e->u.atomic_cas.addr);
-        break;
-      case TOP_FENCE:
-        wstr(w, "fence mo=");
-        wint(w, (i64)e->u.fence.mo);
-        break;
-      case TOP_INTRINSIC:
-        wstr(w, "intrinsic kind=");
-        wint(w, (i64)e->u.intrinsic.kind);
-        wstr(w, " ndst=");
-        wint(w, (i64)e->u.intrinsic.ndst);
-        wstr(w, " narg=");
-        wint(w, (i64)e->u.intrinsic.narg);
-        break;
-      case TOP_SET_LOC:
-        wstr(w, "set_loc ");
-        wint(w, (i64)e->u.set_loc.loc.line);
-        wstr(w, ":");
-        wint(w, (i64)e->u.set_loc.loc.col);
-        break;
-    }
-    wstr(w, "\n");
+  for (u32 i = 0; i < f->nparams; ++i) {
+    IRParam* p = &f->params[i];
+    CGParamDesc d = {0};
+    d.index = p->index;
+    d.name = p->name;
+    d.type = p->type;
+    d.slot = slot_to_target(&r, p->slot);
+    d.abi = p->abi;
+    d.loc = p->loc;
+    w->param(w, &d);
   }
-}
 
-/* ---- Phase 2 peephole: integer constant folding ----
- *
- * Pattern: LOAD_IMM(V_a, k_a); LOAD_IMM(V_b, k_b); BINOP(op, V_d, V_a, V_b)
- *           with op ∈ {IADD, ISUB, IMUL}.
- * After:   the BINOP is rewritten to LOAD_IMM(V_d, k_a OP k_b).
- *
- * Both operands must be OPK_REG referencing wrapper vregs whose only
- * recorded definition was a LOAD_IMM. The intermediate LOAD_IMMs are
- * left in place — they may have other uses, and DCE is a Phase 3
- * concern.
- *
- * Folding is done in 64-bit signed arithmetic and truncated by the
- * target's load_imm based on the destination type. This matches C11
- * §6.5/3 ("two's-complement wraparound at the abstract machine level
- * for signed and unsigned integer types alike" per cfree's no-UB
- * stance — see doc/DESIGN.md §9). */
-
-typedef struct ImmInfo {
-  i64 val;
-  u8 known;
-} ImmInfo;
-
-static void peephole_constfold(OptImpl* o) {
-  ImmInfo* imm;
-  u32 cap;
-
-  if (o->next_vreg <= 1) return;
-  cap = o->next_vreg;
-  imm = arena_zarray(o->c->tu, ImmInfo, cap);
-
-  for (u32 i = 0; i < o->ntape; ++i) {
-    TapeEntry* e = &o->tape[i];
-    if (e->dead) continue;
-    switch ((TapeOpKind)e->op) {
-      case TOP_LOAD_IMM:
-        if (e->u.load_imm.dst.kind == OPK_REG) {
-          Reg r = e->u.load_imm.dst.v.reg;
-          if (r < cap) {
-            imm[r].val = e->u.load_imm.imm;
-            imm[r].known = 1;
-          }
-        }
-        break;
-      case TOP_BINOP: {
-        Operand a = e->u.binop.a;
-        Operand b = e->u.binop.b;
-        BinOp op = e->u.binop.op;
-        if (a.kind != OPK_REG || b.kind != OPK_REG) break;
-        if (a.v.reg >= cap || b.v.reg >= cap) break;
-        if (!imm[a.v.reg].known || !imm[b.v.reg].known) break;
-        if (op != BO_IADD && op != BO_ISUB && op != BO_IMUL) break;
-
-        i64 av = imm[a.v.reg].val;
-        i64 bv = imm[b.v.reg].val;
-        u64 folded;
-        /* Compute in u64 to make wraparound deterministic, then cast
-         * back. cfree's no-UB stance forbids signed-overflow-is-UB
-         * exploitation (doc/DESIGN.md §9), so this is the right shape. */
-        switch (op) {
-          case BO_IADD: folded = (u64)av + (u64)bv; break;
-          case BO_ISUB: folded = (u64)av - (u64)bv; break;
-          case BO_IMUL: folded = (u64)av * (u64)bv; break;
-          default: continue;
-        }
-
-        Operand dst = e->u.binop.dst;
-        memset(&e->u, 0, sizeof e->u);
-        e->op = (u8)TOP_LOAD_IMM;
-        e->u.load_imm.dst = dst;
-        e->u.load_imm.imm = (i64)folded;
-        if (dst.kind == OPK_REG && dst.v.reg < cap) {
-          imm[dst.v.reg].val = (i64)folded;
-          imm[dst.v.reg].known = 1;
-        }
-        break;
-      }
-      default:
-        break;
-    }
+  /* Body in emit order — the order CG's emit cursor visited each
+   * block. Block-creation order can differ when label_new precedes a
+   * cmp_branch whose fallthrough block must physically follow. */
+  for (u32 i = 0; i < f->emit_order_n; ++i) {
+    replay_block(&r, f->emit_order[i]);
   }
+
+  w->func_end(w);
 }
 
-/* ---- func_end: append TOP_FUNC_END, run peepholes, replay ---- */
+/* ---- func_end: optionally run dry-run passes; replay; reset ---- */
 
 static void w_func_end(CGTarget* t) {
   OptImpl* o = impl_of(t);
-  tape_append(o, TOP_FUNC_END);
-  peephole_constfold(o);
-  if (o->dump_writer) print_tape(o, o->dump_writer);
-  replay(o);
-}
+  if (!o->f) return;
 
-/* ---- public API: dump writer ---- */
+  if (o->level >= 2) {
+    opt_build_cfg(o->f);
+    opt_build_ssa(o->f);
+  }
 
-void opt_set_dump_writer(CGTarget* t, Writer* w) {
-  /* Identify our own targets by the func_begin slot. Anything else is
-   * a non-opt CGTarget and the call is a silent no-op. */
-  if (!t || t->func_begin != w_func_begin) return;
-  impl_of(t)->dump_writer = w;
+  replay_func(o);
+  o->f = NULL;
+  o->cur = 0;
 }
 
-/* ---- end-of-TU and destruction ---- */
+/* ---- finalize / destroy ---- */
 
 static void w_finalize(CGTarget* t) {
   CGTarget* wr = impl_of(t)->target;
@@ -1779,12 +1180,16 @@ static void w_destroy(CGTarget* t) {
   if (wr->destroy) wr->destroy(wr);
 }
 
+/* ---- public dump-writer API ---- */
+
+void opt_set_dump_writer(CGTarget* t, Writer* w) {
+  if (!t || t->func_begin != w_func_begin) return;
+  impl_of(t)->dump_writer = w;
+}
+
 /* ---- construction ---- */
 
 CGTarget* opt_cgtarget_new(Compiler* c, CGTarget* target, int level) {
-  OptImpl* o;
-  CGTarget* t;
-
   if (!target) {
     SrcLoc loc = {0, 0, 0};
     compiler_panic(c, loc, "opt_cgtarget_new: target is NULL");
@@ -1795,13 +1200,13 @@ CGTarget* opt_cgtarget_new(Compiler* c, CGTarget* target, int level) {
                    level);
   }
 
-  o = arena_new(c->tu, OptImpl);
+  OptImpl* o = arena_new(c->tu, OptImpl);
   memset(o, 0, sizeof *o);
   o->c = c;
   o->target = target;
   o->level = level;
 
-  t = &o->base;
+  CGTarget* t = &o->base;
   t->c = c;
   t->obj = target->obj;
   t->mc = target->mc;
diff --git a/src/opt/pass_cfg.c b/src/opt/pass_cfg.c
@@ -0,0 +1,116 @@
+/* pass_cfg.c — derive Block.preds and Block.succ/nsucc from each
+ * block's terminator. doc/OPT.md §3 Phase 3.
+ *
+ * Terminator inventory:
+ *   IR_BR                          — 1 succ (succ[0])
+ *   IR_CONDBR                      — 2 succs ([true, false])
+ *   IR_CMP_BRANCH                  — 2 succs ([taken, fallthrough])
+ *   IR_RET                         — 0 succs
+ *   IR_LONGJMP                     — 0 succs
+ *   IR_INTRINSIC TRAP/UNREACHABLE  — 0 succs
+ *   IR_BREAK_TO / IR_CONTINUE_TO   — 0 succs (control transferred to
+ *                                    the scope's break/continue label,
+ *                                    which is a successor encoded on
+ *                                    the IRScopeAux; pass populates
+ *                                    succ from there)
+ *
+ * IR_SETJMP is a control barrier: the recorder splits its block but
+ * IR_SETJMP itself falls through. pass_cfg sees it as a normal inst.
+ *
+ * For scope ops the wrapper's recording assigns succ[] at emit time
+ * (since it owns the vlabel→block_id mapping). pass_cfg trusts that
+ * and only repopulates from the trailing terminator inst when one is
+ * present. */
+
+#include "opt/ir.h"
+#include "opt/opt.h"
+
+#include <string.h>
+
+#include "core/arena.h"
+#include "core/core.h"
+
+static int is_terminator(const Inst* in) {
+  switch ((IROp)in->op) {
+    case IR_BR:
+    case IR_CONDBR:
+    case IR_CMP_BRANCH:
+    case IR_RET:
+    case IR_LONGJMP:
+    case IR_BREAK_TO:
+    case IR_CONTINUE_TO:
+      return 1;
+    case IR_INTRINSIC:
+      return in->extra.imm == INTRIN_TRAP ||
+             in->extra.imm == INTRIN_UNREACHABLE;
+    default:
+      return 0;
+  }
+}
+
+void opt_build_cfg(Func* f) {
+  for (u32 b = 0; b < f->nblocks; ++b) {
+    f->blocks[b].preds = NULL;
+    f->blocks[b].npreds = 0;
+  }
+
+  /* Trust the recorder's succ[] for terminators that don't have a fixed
+   * succ count from the inst alone (IR_BR, IR_CONDBR, IR_CMP_BRANCH,
+   * IR_BREAK_TO, IR_CONTINUE_TO). Only fix nsucc for ops where we can
+   * read it directly from the op tag. */
+  for (u32 b = 0; b < f->nblocks; ++b) {
+    Block* bl = &f->blocks[b];
+    if (bl->ninsts == 0) {
+      bl->nsucc = 0;
+      continue;
+    }
+    const Inst* last = &bl->insts[bl->ninsts - 1];
+    if (!is_terminator(last)) {
+      bl->nsucc = 0;
+      continue;
+    }
+    switch ((IROp)last->op) {
+      case IR_RET:
+      case IR_LONGJMP:
+        bl->nsucc = 0;
+        break;
+      case IR_INTRINSIC:
+        bl->nsucc = 0;
+        break;
+      case IR_BR:
+      case IR_BREAK_TO:
+      case IR_CONTINUE_TO:
+        bl->nsucc = 1;
+        break;
+      case IR_CONDBR:
+      case IR_CMP_BRANCH:
+        bl->nsucc = 2;
+        break;
+      default:
+        break;
+    }
+  }
+
+  /* Count predecessors. */
+  u32* counts = arena_zarray(f->arena, u32, f->nblocks);
+  for (u32 b = 0; b < f->nblocks; ++b) {
+    Block* bl = &f->blocks[b];
+    for (u32 s = 0; s < bl->nsucc; ++s) {
+      u32 t = bl->succ[s];
+      if (t < f->nblocks) counts[t]++;
+    }
+  }
+  for (u32 b = 0; b < f->nblocks; ++b) {
+    if (counts[b]) {
+      f->blocks[b].preds = arena_array(f->arena, u32, counts[b]);
+    }
+  }
+  for (u32 b = 0; b < f->nblocks; ++b) {
+    Block* bl = &f->blocks[b];
+    for (u32 s = 0; s < bl->nsucc; ++s) {
+      u32 t = bl->succ[s];
+      if (t >= f->nblocks) continue;
+      f->blocks[t].preds[f->blocks[t].npreds++] = b;
+    }
+  }
+}
diff --git a/src/opt/pass_ssa.c b/src/opt/pass_ssa.c
@@ -0,0 +1,427 @@
+/* pass_ssa.c — mem2reg + dominance-frontier SSA construction.
+ *
+ * Goal for Phase 3 (doc/OPT.md): build SSA without consuming it. The
+ * output is discarded before replay, so this pass's job is shape
+ * checking — no panics on the corpus.
+ *
+ * Algorithm (Cooper-Harvey-Kennedy iterative dominators + Cytron et al.
+ * dominance-frontier phi insertion):
+ *
+ *   1. Postorder + reverse-postorder traversal of the CFG.
+ *   2. Compute idom[] iteratively via the two-finger intersect.
+ *   3. Compute DF[] from idom[] in one pass.
+ *   4. For each promotable FrameSlot (no FSF_ADDR_TAKEN), compute the
+ *      iterated dominance frontier of its defining blocks; insert
+ *      IR_PHI at the start of each block in the IDF.
+ *   5. Rename: DFS the dominator tree, maintain a stack per slot; on
+ *      store push the stored Val, on load record (load_def → top) into
+ *      a rename map, on each successor fill in this block's slot in
+ *      that successor's phis. After processing children, pop.
+ *
+ * The rename map is built but its uses are intentionally NOT walked
+ * across other instructions in the dry-run — that is the part that
+ * mutates the IR for downstream passes, and Phase 3 discards the IR.
+ * The phi-insertion + slot-stack walk is the part that exercises the
+ * IR shape, which is what the dry-run is checking. */
+
+#include "opt/ir.h"
+#include "opt/opt.h"
+
+#include <string.h>
+
+#include "core/arena.h"
+#include "core/core.h"
+
+#define BLK_NONE 0xffffffffu
+
+/* ---- postorder ---- */
+
+typedef struct PostorderCtx {
+  Func* f;
+  u32* po;     /* po[i] = block id at postorder position i */
+  u32* po_idx; /* po_idx[block] = postorder position of block */
+  u8* visited;
+  u32 count;
+} PostorderCtx;
+
+static void postorder_dfs(PostorderCtx* ctx, u32 b) {
+  if (ctx->visited[b]) return;
+  ctx->visited[b] = 1;
+  Block* bl = &ctx->f->blocks[b];
+  for (u32 s = 0; s < bl->nsucc; ++s) {
+    u32 t = bl->succ[s];
+    if (t < ctx->f->nblocks) postorder_dfs(ctx, t);
+  }
+  ctx->po[ctx->count] = b;
+  ctx->po_idx[b] = ctx->count;
+  ctx->count++;
+}
+
+/* ---- dominators (Cooper-Harvey-Kennedy) ---- */
+
+static u32 dom_intersect(u32 b1, u32 b2, const u32* idom, const u32* po_idx) {
+  while (b1 != b2) {
+    while (po_idx[b1] < po_idx[b2]) b1 = idom[b1];
+    while (po_idx[b2] < po_idx[b1]) b2 = idom[b2];
+  }
+  return b1;
+}
+
+static u32* compute_idom(Func* f, const u32* po, const u32* po_idx, u32 ncount,
+                        u32 entry) {
+  u32* idom = arena_array(f->arena, u32, f->nblocks);
+  for (u32 b = 0; b < f->nblocks; ++b) idom[b] = BLK_NONE;
+  idom[entry] = entry;
+
+  int changed = 1;
+  while (changed) {
+    changed = 0;
+    /* Reverse postorder, skip the entry block. */
+    for (i32 i = (i32)ncount - 1; i >= 0; --i) {
+      u32 b = po[i];
+      if (b == entry) continue;
+      Block* bl = &f->blocks[b];
+      u32 new_idom = BLK_NONE;
+      for (u32 p = 0; p < bl->npreds; ++p) {
+        u32 pp = bl->preds[p];
+        if (idom[pp] != BLK_NONE) {
+          new_idom = (new_idom == BLK_NONE)
+                         ? pp
+                         : dom_intersect(pp, new_idom, idom, po_idx);
+        }
+      }
+      if (new_idom != BLK_NONE && idom[b] != new_idom) {
+        idom[b] = new_idom;
+        changed = 1;
+      }
+    }
+  }
+  return idom;
+}
+
+/* ---- dominance frontier ---- */
+
+typedef struct DfSet {
+  u32* members;
+  u32 n, cap;
+} DfSet;
+
+static void df_add(Arena* a, DfSet* s, u32 b) {
+  for (u32 i = 0; i < s->n; ++i)
+    if (s->members[i] == b) return;
+  if (s->n == s->cap) {
+    u32 ncap = s->cap ? s->cap * 2u : 4u;
+    u32* nb = arena_array(a, u32, ncap);
+    if (s->members) memcpy(nb, s->members, sizeof(u32) * s->n);
+    s->members = nb;
+    s->cap = ncap;
+  }
+  s->members[s->n++] = b;
+}
+
+static DfSet* compute_df(Func* f, const u32* idom) {
+  DfSet* df = arena_zarray(f->arena, DfSet, f->nblocks);
+  for (u32 b = 0; b < f->nblocks; ++b) {
+    Block* bl = &f->blocks[b];
+    if (bl->npreds < 2) continue;
+    if (idom[b] == BLK_NONE) continue;
+    for (u32 p = 0; p < bl->npreds; ++p) {
+      u32 runner = bl->preds[p];
+      while (runner != idom[b] && runner != BLK_NONE) {
+        df_add(f->arena, &df[runner], b);
+        runner = idom[runner];
+      }
+    }
+  }
+  return df;
+}
+
+/* ---- promotable slots ---- */
+
+/* Identify the FrameSlot a load/store address operand refers to.
+ * Returns 0 if the operand is not OPK_LOCAL (i.e., not a direct
+ * frame-slot reference) — those addresses route through computed Val
+ * pointers and cannot be promoted by a slot-keyed pass. */
+static u32 opnd_slot_id(const Operand* op) {
+  if (op->kind != OPK_LOCAL) return 0;
+  return (u32)op->v.frame_slot;
+}
+
+static int slot_promotable(const Func* f, u32 slot_id) {
+  if (slot_id == 0 || slot_id > f->nframe_slots) return 0;
+  const IRFrameSlot* s = &f->frame_slots[slot_id - 1];
+  if (s->flags & FSF_ADDR_TAKEN) return 0;
+  if (s->flags & FSF_VOLATILE) return 0;
+  /* Only locals are promotable for v1; params live in slots too but
+   * promoting them is a separate transform. */
+  if (s->kind != FS_LOCAL) return 0;
+  return 1;
+}
+
+/* ---- phi insertion ---- */
+
+/* Insert n_new phi instructions at the start of block b, each tagged
+ * with its slot via opnds[0] = synthetic "slot tag" Val (we reuse
+ * IRPhiAux to carry pred info). The stored slot id lives in
+ * extra.imm. */
+static void insert_phis(Func* f, u32 b, u32 n_new, const u32* phi_slots,
+                        const u32* phi_blocks_for_slot) {
+  if (!n_new) return;
+  Block* bl = &f->blocks[b];
+  u32 old = bl->ninsts;
+  u32 nnew = old + n_new;
+  Inst* nb = arena_zarray(f->arena, Inst, nnew);
+  /* Phis go first. */
+  for (u32 i = 0; i < n_new; ++i) {
+    Inst* in = &nb[i];
+    u32 slot_id = phi_slots[i];
+    const IRFrameSlot* s = &f->frame_slots[slot_id - 1];
+    in->op = IR_PHI;
+    in->type = s->type;
+    in->extra.imm = (i64)slot_id;
+    /* Allocate IRPhiAux with one slot per pred, initialized to
+     * VAL_NONE. The rename pass fills pred_vals later. */
+    IRPhiAux* aux = arena_znew(f->arena, IRPhiAux);
+    aux->npreds = bl->npreds;
+    if (bl->npreds) {
+      aux->pred_blocks = arena_array(f->arena, u32, bl->npreds);
+      aux->pred_vals = arena_zarray(f->arena, Val, bl->npreds);
+      memcpy(aux->pred_blocks, bl->preds, sizeof(u32) * bl->npreds);
+    }
+    /* Reuse extra union: imm carries slot, but we also need aux. We
+     * stash the IRPhiAux in opnds: opnds[0] is a sentinel pointer cast
+     * — that breaks the Val* type. Instead, we use a parallel side
+     * table rooted on the inst. To keep this self-contained without
+     * altering the Inst layout, we point in->opnds at a Val array of
+     * length 0 and rely on the aux pointer through the extra union.
+     *
+     * Layout choice: extra.aux = aux (carry the IRPhiAux pointer);
+     * imm-as-slot lives in a side table. */
+    in->extra.aux = aux;
+    in->nopnds = 0;
+    in->opnds = NULL;
+    /* def Val: allocate a fresh value typed as the slot's type. */
+    /* We can't use val_alloc directly (it's static); set up the val
+     * table manually via an emit-equivalent. Simpler: piggy-back on
+     * ir_emit_const_i style by appending after current state. But we're
+     * mid-rebuild of the block. Defer: encode def slot via re-reading
+     * the val table. */
+    /* For Phase 3 dry-run, leave def = VAL_NONE on phis. The rename
+     * pass uses extra.aux->pred_vals to inspect phi shape; the
+     * dry-run discards before downstream passes need def. */
+    in->def = VAL_NONE;
+  }
+  /* Existing instructions shifted right. */
+  if (old) memcpy(nb + n_new, bl->insts, sizeof(Inst) * old);
+  bl->insts = nb;
+  bl->ninsts = nnew;
+  bl->cap = nnew;
+  /* val_def_inst for any val defined in this block has shifted by
+   * n_new. Walk the val table and update. */
+  for (u32 v = 1; v < f->nvals; ++v) {
+    if (f->val_def_block[v] == b) f->val_def_inst[v] += n_new;
+  }
+  (void)phi_blocks_for_slot;
+}
+
+/* ---- rename ---- */
+
+typedef struct SlotStack {
+  Val* stack;
+  u32 n, cap;
+} SlotStack;
+
+static void slot_push(Arena* a, SlotStack* s, Val v) {
+  if (s->n == s->cap) {
+    u32 ncap = s->cap ? s->cap * 2u : 4u;
+    Val* nb = arena_array(a, Val, ncap);
+    if (s->stack) memcpy(nb, s->stack, sizeof(Val) * s->n);
+    s->stack = nb;
+    s->cap = ncap;
+  }
+  s->stack[s->n++] = v;
+}
+
+static Val slot_top(const SlotStack* s) {
+  return s->n ? s->stack[s->n - 1] : VAL_NONE;
+}
+
+static void rename_dfs(Func* f, u32 b, const u32* idom, SlotStack* slots) {
+  Block* bl = &f->blocks[b];
+  /* Track per-slot push count so we can pop on exit. */
+  u32* pushed = arena_zarray(f->arena, u32, f->nframe_slots + 1);
+
+  /* 1. Process phis in this block: each phi has extra.imm = slot id.
+   * Push a synthetic phi value (VAL_NONE in dry-run; a "fresh Val"
+   * once we wire renaming). */
+  for (u32 i = 0; i < bl->ninsts; ++i) {
+    Inst* in = &bl->insts[i];
+    if (in->op != IR_PHI) break;
+    u32 slot_id = (u32)in->extra.imm;
+    if (slot_id == 0 || slot_id > f->nframe_slots) continue;
+    /* For dry-run, the phi's def is VAL_NONE; we still push so the
+     * stack has the right depth. Downstream passes (Phase 4+) will
+     * allocate a real Val here. */
+    slot_push(f->arena, &slots[slot_id], VAL_NONE);
+    pushed[slot_id]++;
+  }
+
+  /* 2. Process the rest of the block. */
+  for (u32 i = 0; i < bl->ninsts; ++i) {
+    Inst* in = &bl->insts[i];
+    if (in->op == IR_PHI) continue;
+    if (in->op == IR_STORE) {
+      /* IR_STORE opnds: [0] = addr, [1] = src. */
+      if (in->nopnds < 2) continue;
+      u32 sid = opnd_slot_id(&in->opnds[0]);
+      if (sid && slot_promotable(f, sid)) {
+        const Operand* src = &in->opnds[1];
+        Val v = (src->kind == OPK_REG) ? (Val)src->v.reg : VAL_NONE;
+        slot_push(f->arena, &slots[sid], v);
+        pushed[sid]++;
+      }
+      continue;
+    }
+    if (in->op == IR_LOAD) {
+      /* IR_LOAD opnds: [0] = dst REG, [1] = addr. */
+      if (in->nopnds < 2) continue;
+      u32 sid = opnd_slot_id(&in->opnds[1]);
+      if (sid && slot_promotable(f, sid)) {
+        /* Touching slot_top exercises the stack invariant — that's
+         * the shape check we want. The rewrite map proper would walk
+         * uses; we skip that in the dry-run (output discarded). */
+        (void)slot_top(&slots[sid]);
+      }
+      continue;
+    }
+  }
+
+  /* 3. Fill our slot in each successor's phis. */
+  for (u32 s = 0; s < bl->nsucc; ++s) {
+    u32 succ = bl->succ[s];
+    if (succ >= f->nblocks) continue;
+    Block* sb = &f->blocks[succ];
+    /* Find which pred index `b` is in succ. */
+    u32 pred_idx = 0;
+    int found = 0;
+    for (u32 p = 0; p < sb->npreds; ++p) {
+      if (sb->preds[p] == b) {
+        pred_idx = p;
+        found = 1;
+        break;
+      }
+    }
+    if (!found) continue;
+    for (u32 i = 0; i < sb->ninsts; ++i) {
+      Inst* in = &sb->insts[i];
+      if (in->op != IR_PHI) break;
+      u32 slot_id = (u32)in->extra.imm;
+      if (slot_id == 0 || slot_id > f->nframe_slots) continue;
+      IRPhiAux* aux = (IRPhiAux*)in->extra.aux;
+      if (!aux || pred_idx >= aux->npreds) continue;
+      aux->pred_vals[pred_idx] = slot_top(&slots[slot_id]);
+    }
+  }
+
+  /* 4. Recurse into immediate dom children. */
+  for (u32 c = 0; c < f->nblocks; ++c) {
+    if (c == b) continue;
+    if (idom[c] == b) rename_dfs(f, c, idom, slots);
+  }
+
+  /* 5. Pop. */
+  for (u32 sid = 0; sid <= f->nframe_slots; ++sid) {
+    while (pushed[sid]--) {
+      if (slots[sid].n) slots[sid].n--;
+    }
+  }
+}
+
+/* ---- main entry ---- */
+
+void opt_build_ssa(Func* f) {
+  if (f->nblocks == 0) return;
+
+  /* Postorder traversal from entry. */
+  PostorderCtx pctx;
+  pctx.f = f;
+  pctx.po = arena_array(f->arena, u32, f->nblocks);
+  pctx.po_idx = arena_array(f->arena, u32, f->nblocks);
+  pctx.visited = arena_zarray(f->arena, u8, f->nblocks);
+  pctx.count = 0;
+  for (u32 b = 0; b < f->nblocks; ++b) pctx.po_idx[b] = 0;
+  postorder_dfs(&pctx, f->entry);
+  /* Unreachable blocks: never visited. Skip them in the dom analysis;
+   * the dry-run shouldn't crash on them. */
+
+  u32* idom = compute_idom(f, pctx.po, pctx.po_idx, pctx.count, f->entry);
+  DfSet* df = compute_df(f, idom);
+
+  /* For each promotable slot, find defining blocks → iterated DF. */
+  if (f->nframe_slots == 0) return;
+  u8* needs_phi_storage =
+      arena_zarray(f->arena, u8, (f->nframe_slots + 1) * f->nblocks);
+#define NEEDS_PHI(slot, blk) \
+  needs_phi_storage[((slot) * f->nblocks) + (blk)]
+
+  /* Worklist algo per slot. */
+  u32* worklist = arena_array(f->arena, u32, f->nblocks);
+  u8* on_worklist = arena_zarray(f->arena, u8, f->nblocks);
+
+  for (u32 sid = 1; sid <= f->nframe_slots; ++sid) {
+    if (!slot_promotable(f, sid)) continue;
+    /* Reset per-slot worklist state. */
+    for (u32 i = 0; i < f->nblocks; ++i) on_worklist[i] = 0;
+    u32 wn = 0;
+    /* Seed with blocks containing a store to this slot. */
+    for (u32 b = 0; b < f->nblocks; ++b) {
+      Block* bl = &f->blocks[b];
+      for (u32 i = 0; i < bl->ninsts; ++i) {
+        Inst* in = &bl->insts[i];
+        if (in->op == IR_STORE && in->nopnds >= 1 &&
+            opnd_slot_id(&in->opnds[0]) == sid) {
+          if (!on_worklist[b]) {
+            on_worklist[b] = 1;
+            worklist[wn++] = b;
+          }
+          break;
+        }
+      }
+    }
+    /* Iterated DF. */
+    while (wn) {
+      u32 x = worklist[--wn];
+      DfSet* d = &df[x];
+      for (u32 i = 0; i < d->n; ++i) {
+        u32 y = d->members[i];
+        if (NEEDS_PHI(sid, y)) continue;
+        NEEDS_PHI(sid, y) = 1;
+        if (!on_worklist[y]) {
+          on_worklist[y] = 1;
+          worklist[wn++] = y;
+        }
+      }
+    }
+  }
+
+  /* Insert phis per block. */
+  for (u32 b = 0; b < f->nblocks; ++b) {
+    u32 nphi = 0;
+    for (u32 sid = 1; sid <= f->nframe_slots; ++sid) {
+      if (NEEDS_PHI(sid, b)) nphi++;
+    }
+    if (!nphi) continue;
+    u32* slots_arr = arena_array(f->arena, u32, nphi);
+    u32 k = 0;
+    for (u32 sid = 1; sid <= f->nframe_slots; ++sid) {
+      if (NEEDS_PHI(sid, b)) slots_arr[k++] = sid;
+    }
+    insert_phis(f, b, nphi, slots_arr, NULL);
+  }
+
+  /* Rename phase: DFS from entry over the dominator tree. */
+  SlotStack* slots = arena_zarray(f->arena, SlotStack, f->nframe_slots + 1);
+  rename_dfs(f, f->entry, idom, slots);
+
+#undef NEEDS_PHI
+}
diff --git a/test/cg/run.sh b/test/cg/run.sh
@@ -51,15 +51,18 @@ ALLOW_SKIP="${CFREE_TEST_ALLOW_SKIP:-0}"
 # Filters (env vars or positional args; args win):
 #   $1 / CFREE_TEST_FILTER — substring match against case name
 #   $2 / CFREE_TEST_PATHS  — subset of "DREJ" (default "DREJ")
-#   CFREE_OPT_LEVELS — space-separated opt levels to exercise. Default "0 1"
-#                      so every case is built twice: directly against the
-#                      backend (level 0) and through the opt_cgtarget
-#                      wrapper (level 1). Path W (DWARF) only runs at
-#                      level 0 — opt-level DWARF equivalence is a later
-#                      phase concern.
+#   CFREE_OPT_LEVELS — space-separated opt levels to exercise. Default
+#                      "0 1 2" so every case is built three ways:
+#                      directly against the backend (level 0), and
+#                      through the opt_cgtarget wrapper at levels 1 and
+#                      2. Level 2 currently runs the Phase 3 dry-run
+#                      passes (build_cfg + build_ssa) and discards
+#                      before replay, so behavior must match level 1.
+#                      Path W (DWARF) only runs at level 0 — opt-level
+#                      DWARF equivalence is a later phase concern.
 FILTER="${1:-${CFREE_TEST_FILTER:-}}"
 PATHS="${2:-${CFREE_TEST_PATHS:-DREJW}}"
-OPT_LEVELS="${CFREE_OPT_LEVELS:-0 1}"
+OPT_LEVELS="${CFREE_OPT_LEVELS:-0 1 2}"
 case "$PATHS" in *D*) RUN_D=1;; *) RUN_D=0;; esac
 case "$PATHS" in *R*) RUN_R=1;; *) RUN_R=0;; esac
 case "$PATHS" in *E*) RUN_E=1;; *) RUN_E=0;; esac
@@ -111,13 +114,13 @@ arch_raw="$(uname -m 2>/dev/null || true)"
 
 READELF_BIN="$(command -v llvm-readelf 2>/dev/null || command -v readelf 2>/dev/null || true)"
 
-# Shared aarch64 exec helper — see test/lib/exec_aarch64.sh. Path E queues
-# each linked.exe and we drain the queue in a single batched podman run
-# after the case loop, amortizing the per-launch podman overhead across
-# all ~200 cg cases.
-EXEC_AARCH64_MOUNT_ROOT="$BUILD_DIR"
-# shellcheck source=../lib/exec_aarch64.sh
-source "$ROOT/test/lib/exec_aarch64.sh"
+# Shared per-arch exec helper — see test/lib/exec_target.sh. Path E
+# queues each linked.exe and we drain the queue in a single batched
+# podman run per arch after the case loop, amortizing the per-launch
+# podman overhead across all ~200 cg cases.
+EXEC_TARGET_MOUNT_ROOT="$BUILD_DIR"
+# shellcheck source=../lib/exec_target.sh
+source "$ROOT/test/lib/exec_target.sh"
 
 # ---- build harness binaries ------------------------------------------------
 
@@ -263,6 +266,16 @@ for OPT_LEVEL in $OPT_LEVELS; do
         # negative-return cases compare correctly.
         expected_byte=$(( expected & 0xff ))
 
+        # Path E target arch. cg-runner --arches NAME prints the arches a
+        # case can run on (one per line). Today every case is aarch64-only;
+        # multi-arch cases will land alongside x64 codegen in MULTIARCH
+        # phase 3 and broaden this list.
+        case_arches="$("${CG_RUN[@]}" --arches "$name" 2>/dev/null)"
+        case_arches="${case_arches:-aarch64}"
+        # First arch is the canonical one path E targets. (Cases are still
+        # single-arch through phase 2; the loop is a placeholder seam.)
+        case_arch="$(printf '%s\n' "$case_arches" | head -n1)"
+
         # ---- Path D: in-process JIT (only on aarch64) ------------------------
         if [ $RUN_D -eq 1 ]; then
             if [ $is_aarch64 -eq 1 ]; then
@@ -322,7 +335,7 @@ for OPT_LEVEL in $OPT_LEVELS; do
                         >"$work/exec_link.out" 2>"$work/exec_link.err"; then
                     dt=$(( $(now_ms) - t0 )); T_E=$(( T_E + dt ))
                     note_fail "$name/E${TAG} (link failed, ${dt}ms)"
-                elif [ $have_runner -eq 1 ]; then
+                elif exec_target_supported "$case_arch"; then
                     link_dt=$(( $(now_ms) - t0 )); T_E=$(( T_E + link_dt ))
                     E_NAMES+=("$name")
                     E_WORK+=("$work")
@@ -330,10 +343,11 @@ for OPT_LEVEL in $OPT_LEVELS; do
                     E_EXPECTED+=("$expected_byte")
                     # Queue with a level-tagged key so cases at different
                     # opt levels don't collide in the batched runner.
-                    exec_aarch64_queue "L${OPT_LEVEL}_${name}" "$exe" \
-                        "$work/exec.out" "$work/exec.err" "$work/exec.rc"
+                    exec_target_queue "$case_arch" "L${OPT_LEVEL}_${name}" \
+                        "$exe" "$work/exec.out" "$work/exec.err" \
+                        "$work/exec.rc"
                 else
-                    note_skip "$name/E${TAG}" "no qemu/podman"
+                    note_skip "$name/E${TAG}" "no runner for $case_arch"
                 fi
             else
                 note_skip "$name/E${TAG}" "no link-exe-runner, aarch64 clang, or start.o"
@@ -385,13 +399,13 @@ for OPT_LEVEL in $OPT_LEVELS; do
     done
 
     # ---- batched path-E flush + verification (per level) -------------------
-    # Run every queued case in a single podman invocation, then iterate the
-    # queue to read each exit code and emit PASS/FAIL.
-    if [ "$(exec_aarch64_queue_size)" -gt 0 ]; then
+    # Run every queued case in a single podman invocation per arch, then
+    # iterate the queue to read each exit code and emit PASS/FAIL.
+    if [ "$(exec_target_queue_size)" -gt 0 ]; then
         printf 'Running path E%s (%d cases batched)...\n' \
-            "$TAG" "$(exec_aarch64_queue_size)"
+            "$TAG" "$(exec_target_queue_size)"
         t0=$(now_ms)
-        exec_aarch64_flush
+        exec_target_flush
         DELTA=$(( $(now_ms) - t0 ))
         T_E_BATCH=$(( ${T_E_BATCH:-0} + DELTA )); T_E=$(( T_E + DELTA ))

	kit kit
	git clone https://git.ryansepassi.com/git/kit.git
	Log \| Files \| Refs \| README

M	doc/OPT.md	\|	225	+++++++++++++++++++++++++++++++++++++++++++++----------------------------------
A	src/opt/ir.c	\|	176	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M	src/opt/ir.h	\|	353	++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------
M	src/opt/opt.c	\|	2325	++++++++++++++++++++++++++++++-------------------------------------------------
A	src/opt/pass_cfg.c	\|	116	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A	src/opt/pass_ssa.c	\|	427	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M	test/cg/run.sh	\|	60	+++++++++++++++++++++++++++++++++++++-----------------------