Complete optimizer Phase E inlining - kit

commit 7d47b9fc98c761d38925c34d547866bcf426d908
parent ceffaef359a99d6d2695adc06e78f6acde56921e
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Fri, 22 May 2026 02:41:44 -0700

Complete optimizer Phase E inlining

Diffstat:
M doc/OPT.md  | 88 ++++++++++++++++++++++++++++++++++++++++---------------------------------------
M src/opt/opt.c  | 49 +------------------------------------------------
M src/opt/opt.h  | 20 ++++++++++----------
M src/opt/pass_inline.c  | 311 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------------
M src/opt/pass_o2.c  | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++++------
M test/opt/opt_test.c  | 294 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A test/toy/cases/133_inline_wrapper_o2.toy  | 12 ++++++++++++
A test/toy/cases/134_inline_chain_o2.toy  | 16 ++++++++++++++++

8 files changed, 680 insertions(+), 168 deletions(-)
diff --git a/doc/OPT.md b/doc/OPT.md
@@ -1006,12 +1006,14 @@ make the `opt.h` contract real for `-O2`: `func_end` records and stores the
 function, while `finalize` performs interprocedural work, then runs the normal
 per-function O2 pipeline and emission in original function order.
 
-Current implementation slice: retained `-O2` finalize scheduling is in place,
-and `pass_inline.c` inlines small direct calls to retained, acyclic,
-straight-line callees with void/scalar register returns. It intentionally
-refuses multi-block CFG cloning, aggregate/sret/byval returns, tail calls,
-varargs, alloca, inline asm, computed goto, and other hard cases until the
-broader rewrite rules below are implemented with focused tests.
+Status: implemented for the Phase E v1 shape. Retained `-O2` finalize
+scheduling is in place, `opt_cleanup` is the reusable O2 pre-lowering cleanup
+schedule, and `pass_inline.c` inlines small direct calls to retained, acyclic
+callees with void/scalar register returns. It supports straight-line wrappers
+and small branchy callees with multiple scalar returns. It intentionally refuses
+aggregate/sret/byval returns, tail calls, varargs, alloca, inline asm,
+computed goto, and other hard cases until those rewrite rules have focused
+tests.
 
 This matches MIR's useful property: inline while calls, parameters, frame
 slots, returns, labels, and source locations are still target-independent.
@@ -1023,10 +1025,10 @@ constants, copies, dead stores, and dead blocks.
 
 Implementation shape:
 
-- Add real `FuncSet` ownership to `OptImpl`: an arena-backed ordered `Func**`
-  plus a symbol map from `ObjSymId` to the retained definition. No global
-  state. Record order is emission order at finalize.
-- Split scheduling in `opt.c`:
+- [x] Add real `FuncSet` ownership to `OptImpl`: an arena-backed ordered
+  `Func**` plus retained-definition lookup by `ObjSymId`. No global state.
+  Record order is emission order at finalize.
+- [x] Split scheduling in `opt.c`:
   - `func_end`: `opt_frame_home_addr_taken_locals`, classify eligibility, append
     to `FuncSet`, and clear the current recorder state. For `-O1`, keeping the
     current immediate emit path is acceptable; `-O2` must retain functions.
@@ -1036,105 +1038,105 @@ Implementation shape:
   - Refactor the current O2 scheduler into "O2 cleanup/pre-lowering" and
     "lowering/emission" pieces so an inlined dirty function can reuse the
     existing verifier-bracketed pass order.
-- Build the call graph from direct `IR_CALL` sites whose callee is
+- [x] Build the call graph from direct `IR_CALL` sites whose callee is
   `OPK_GLOBAL` with addend zero and whose symbol has a retained `Func`.
   Indirect calls and unresolved/external symbols are ordinary non-inlineable
   calls.
-- Compute SCCs on the retained graph. Inline only calls to callees in already
+- [x] Compute SCCs on the retained graph. Inline only calls to callees in already
   processed acyclic components. Skip self recursion and mutually recursive SCCs
   for v1 instead of trying partial recursive budgets.
-- Rebuild the call graph at each inline iteration. Default `max_iters = 1`;
+- [x] Rebuild the call graph at each inline iteration. Default `max_iters = 1`;
   keep a small internal cap even if a later driver knob requests more.
 
 Eligibility and budgets:
 
-- Candidate callee must be retained, same target/context, non-variadic, not
+- [x] Candidate callee must be retained, same target/context, non-variadic, not
   naked/interrupt/ifunc once those attributes reach `CGFuncDesc`, and not marked
   `noinline` once C declaration flags are plumbed through CG.
-- Candidate call site must be a normal direct call. Skip `CG_CALL_TAIL` and any
+- [x] Candidate call site must be a normal direct call. Skip `CG_CALL_TAIL` and any
   future musttail representation.
-- V1 skips callees containing `IR_ALLOCA`, `IR_VA_*`, `IR_LOAD_LABEL_ADDR`,
+- [x] V1 skips callees containing `IR_ALLOCA`, `IR_VA_*`, `IR_LOAD_LABEL_ADDR`,
   `IR_INDIRECT_BRANCH`, `IR_ASM_BLOCK`, `INTRIN_SETJMP`, `INTRIN_LONGJMP`,
   coroutine switch, or any op whose cloning semantics are not explicit.
-- V1 supports void and scalar register returns first. Aggregate/sret/byval
+- [x] V1 supports void and scalar register returns first. Aggregate/sret/byval
   returns can be enabled later after return materialization has focused tests.
-- Normal-call budget should be deliberately small: start around 12 to 20
+- [x] Normal-call budget should be deliberately small: start around 12 to 20
   non-debug, non-label instructions, with a caller growth cap around 25 percent
   and an absolute per-function growth cap. Once `inline` / `always_inline` /
   `noinline` are carried into CG, use a larger explicit-inline budget and let
   `always_inline` exceed the normal budget but still obey semantic refusal
   rules.
-- Penalize callees with calls, memory barriers, atomics, fences, and multiple
-  returns; discount simple `IR_LOAD_IMM`, `IR_COPY`, and single-use arithmetic.
-  Do not use target-specific instruction costs in Phase E.
+- [x] Keep the heuristic target-independent. Calls, memory barriers, atomics,
+  and fences remain semantic refusals in Phase E v1; multiple returns add cost,
+  and simple scalar instructions use the normal small-instruction budget.
 
 Inline rewrite:
 
-- Split the caller block at the call site into `pre`, cloned callee body, and
+- [x] Split the caller block at the call site into `pre`, cloned callee body, and
   `cont`. The original call instruction is removed.
-- Allocate fresh caller-local ids for every cloned block, frame slot, scope,
+- [x] Allocate fresh caller-local ids for every cloned block, frame slot, scope,
   pseudo-register, and SSA value id present in the callee. Preserve callee
   instruction ids only as source material; cloned instructions get new stable
   ids from the caller.
-- Clone `emit_order` for cloned blocks in callee layout order, inserted between
+- [x] Clone `emit_order` for cloned blocks in callee layout order, inserted between
   `pre` and `cont`. Update CFG successors/predecessors through `opt_build_cfg`
   after the rewrite rather than hand-maintaining all derived metadata.
-- Materialize arguments at the clone entry. For register-like callee parameter
+- [x] Materialize arguments at the clone entry. For register-like callee parameter
   storage, emit `IR_COPY` from the call argument value to the remapped callee
   parameter pseudo. For frame-backed parameters, emit stores to the remapped
   parameter frame slots using the recorded `IRParam` storage and `MemAccess`
   shape. Constants become `IR_LOAD_IMM` or direct stores as appropriate.
-- Rewrite each cloned `IR_RET`:
+- [x] Rewrite each cloned `IR_RET`:
   - void return: replace with `IR_BR cont`.
   - scalar register return: copy the returned value to the call site's result
     operand, then branch to `cont`.
   - multiple returns: all returns branch to the same continuation; if later SSA
     needs a phi, the normal O2 mem2reg/reg-SSA construction will create it from
     the inserted copies/stores.
-- For noreturn callees, omit the continuation edge when every cloned exit is
-  terminating. Keep ordinary call semantics until this is tested; do not infer
-  noreturn only from body shape in v1.
-- Preserve `Inst.loc` from cloned callee instructions. Argument materialization
+- [x] Keep ordinary continuation semantics for noreturn callees in Phase E v1.
+  Do not infer noreturn only from body shape until continuation omission has
+  focused tests.
+- [x] Preserve `Inst.loc` from cloned callee instructions. Argument materialization
   uses the call-site location. This is enough for line-accurate emission now;
   `DW_TAG_inlined_subroutine` can remain a later DWARF phase.
 
 Cleanup after inlining:
 
-- Mark modified callers dirty and invalidate CFG, def-use, dominator, and loop
+- [x] Mark modified callers dirty and invalidate CFG, def-use, dominator, and loop
   analyses.
-- For each retained `-O2` function, run the existing O2 pre-lowering schedule
+- [x] For each retained `-O2` function, run the existing O2 pre-lowering schedule
   after all inline iterations. This is `opt_cleanup` in Phase E terms:
   `build_cfg`, jump cleanup, reg-SSA, mem2reg SSA, simplify, GVN, copy
   propagation, DSE, SSA DCE, LICM, pressure relief, conventional SSA,
   SSA combine, undo SSA, and jump optimization, with the same verifier
   checkpoints as the current O2 path.
-- Then run the shared lowering pipeline once and emit. No function should be
+- [x] Then run the shared lowering pipeline once and emit. No function should be
   emitted before all possible callers have had a chance to inline it.
 
 Testing plan:
 
-- Unit/pass tests in `test/opt` for direct wrapper inline, two-return scalar
+- [x] Unit/pass tests in `test/opt` for direct wrapper inline, two-return scalar
   inline, refusal for recursion/SCCs, refusal for varargs/alloca/setjmp/asm,
   and caller growth cap.
-- End-to-end toy/C cases where `static int add1(int x) { return x + 1; }`
+- [x] End-to-end toy/C cases where `static int add1(int x) { return x + 1; }`
   disappears at `-O2` but remains a call at `-O1`.
-- Keep the Toy `add1(41) == 42` example from the Phase D checklist as a
+- [x] Keep the Toy `add1(41) == 42` example from the Phase D checklist as a
   disassembly sanity case: Phase E is responsible for removing the call; Phase
   D cleanup is responsible for removing the exposed constant/copy/branch
   leftovers.
-- A chained-call case proves bottom-up order: `main -> f -> g` inlines `g`
+- [x] A chained-call case proves bottom-up order: `main -> f -> g` inlines `g`
   into `f` before deciding whether `f` fits into `main`.
-- A negative recursive case proves codegen is unchanged and correct.
-- A focused disassembly sanity check on AArch64 or x64 verifies wrapper-call
+- [x] A negative recursive case proves codegen is unchanged and correct.
+- [x] A focused disassembly sanity check on AArch64 or x64 verifies wrapper-call
   removal after inline+cleanup without broad corpus churn.
 
 Exit criteria:
 
-- `make test-opt`
-- Focused wrapper/chained-call toy fixtures:
+- [x] `make test-opt`
+- [x] Focused wrapper/chained-call toy fixtures:
   `CFREE_TEST_FILTER=<inline-case> CFREE_TOY_OPT_LEVELS="0 1 2" make test-toy`
-- Full `CFREE_TOY_OPT_LEVELS="0 1 2" make test-toy`
-- Recursive and mutually recursive cases are unchanged and correct.
+- [x] Full `CFREE_TOY_OPT_LEVELS="0 1 2" make test-toy`
+- [x] Recursive and mutually recursive cases are unchanged and correct.
 
 ### Phase F - Benchmark Closure
 
diff --git a/src/opt/opt.c b/src/opt/opt.c
@@ -1410,54 +1410,7 @@ static void opt_run_o1_pipeline(OptImpl* o) {
 
 static void opt_run_o2_pipeline(OptImpl* o) {
   metrics_scope_begin(o->c, "opt.o2.ssa");
-  opt_build_cfg(o->f);
-  opt_jump_cleanup(o->f, OPT_JUMP_CLEANUP_CFG);
-  opt_build_cfg(o->f);
-  opt_verify(o->f, "o2-pre-ssa-cfg");
-  opt_build_reg_ssa(o->f);
-  opt_verify(o->f, "o2-reg-ssa");
-  opt_block_cloning(o->f);
-  opt_verify(o->f, "o2-block-clone-cfg");
-  opt_build_ssa(o->f);
-  opt_verify(o->f, "o2-ssa");
-  opt_ssa_dce(o->f);
-  opt_copy_cleanup(o->f);
-  opt_verify(o->f, "o2-pre-addr-cleanup");
-  opt_addr_xform(o->f);
-  opt_verify(o->f, "o2-addr-xform-ssa");
-  opt_ssa_dce(o->f);
-  opt_verify(o->f, "o2-addr-xform-dce");
-  opt_copy_cleanup(o->f);
-  opt_verify(o->f, "o2-addr-xform-copy-cleanup");
-  opt_simplify(o->f);
-  opt_verify(o->f, "o2-simplify-ssa");
-  opt_ssa_dce(o->f);
-  opt_copy_cleanup(o->f);
-  opt_verify(o->f, "o2-simplify-cleanup");
-  opt_gvn(o->f);
-  opt_verify(o->f, "o2-gvn-ssa");
-  opt_copy_prop(o->f);
-  opt_verify(o->f, "o2-copy-prop-ssa");
-  opt_ssa_dce(o->f);
-  opt_verify(o->f, "o2-copy-dce");
-  opt_dse(o->f);
-  opt_verify(o->f, "o2-dse-ssa");
-  opt_ssa_dce(o->f);
-  opt_verify(o->f, "o2-dse-dce");
-  opt_build_loop_tree(o->f);
-  opt_licm(o->f);
-  opt_verify(o->f, "o2-licm-ssa");
-  opt_pressure_relief(o->f);
-  opt_verify(o->f, "o2-pressure-relief-ssa");
-  opt_make_conventional_ssa(o->f);
-  opt_verify(o->f, "o2-conventional-ssa");
-  opt_ssa_combine(o->f);
-  opt_verify(o->f, "o2-ssa-combine");
-  opt_undo_ssa(o->f);
-  opt_copy_cleanup(o->f);
-  opt_verify(o->f, "o2-undo-copy-cleanup");
-  opt_jump_opt(o->f);
-  opt_verify(o->f, "o2-jump-opt");
+  opt_cleanup(o->f);
   metrics_scope_end(o->c, "opt.o2.ssa");
   opt_run_lowering_pipeline(o, "opt.o2.total", 1);
 }
diff --git a/src/opt/opt.h b/src/opt/opt.h
@@ -10,11 +10,12 @@
  *     through the shared simple allocator and passes them to normal emit calls.
  *   - Every other emit-side call is recorded into the current block as one
  *     SSA Inst (with the current SrcLoc from set_loc).
- *   - On CGTarget.func_end it runs the intra-procedural pipeline (down through
- *     jump_opt) and stores the optimized Func in a per-TU set.
- *   - On CGTarget.finalize it runs inter-procedural passes (inlining +
- * cleanup), then for each Func runs machinize → live → coalesce → RA → combine
- * → DCE → prolog/epilog → translate, driving the wrapped target CGTarget.
+ *   - On CGTarget.func_end, level 1 immediately runs the lowering pipeline and
+ *     emits; level 2 retains the raw Func in a per-TU set.
+ *   - On CGTarget.finalize, level 2 runs inter-procedural passes (inlining),
+ *     then for each Func runs O2 cleanup/pre-lowering and machinize → live →
+ *     coalesce → RA → combine → DCE → prolog/epilog → translate, driving the
+ *     wrapped target CGTarget.
  *
  * No machine code is in `obj` until the driver calls cgtarget_finalize.
  * Drivers must call it before reading `obj` or invoking debug_emit.
@@ -27,7 +28,7 @@
  *   2 — full pipeline below. Inlining enabled. */
 CGTarget* opt_cgtarget_new(Compiler*, CGTarget* target, int level);
 
-/* ----- intra-procedural passes (run per Func at func_end on -O2) ----- */
+/* ----- intra-procedural passes (run per retained Func at finalize on -O2) ----- */
 void opt_build_cfg(Func*);
 typedef enum OptJumpCleanupStage {
   OPT_JUMP_CLEANUP_CFG,
@@ -64,10 +65,9 @@ typedef struct FuncSet FuncSet;
  * default 1; cap is enforced by opt_cgtarget). */
 void opt_inline(FuncSet*, int max_iters);
 
-/* Cheap re-run of the intra-procedural pipeline tailored to "what inlining
- * exposes": constfold, copy_prop, gvn, ssa_dce, jump_opt, licm if the
- * function has loops, addr_xform if any GEP-equivalents reach uses. Run on
- * each Func that opt_inline marked dirty. */
+/* Full O2 pre-lowering cleanup: CFG cleanup, pseudo-reg SSA, mem2reg SSA,
+ * value/memory/loop passes, conventional SSA lowering, SSA destruction, and
+ * jump optimization. */
 void opt_cleanup(Func*);
 
 /* ----- lowering / backend prep (per Func, run before driving target CGTarget)
diff --git a/src/opt/pass_inline.c b/src/opt/pass_inline.c
@@ -12,6 +12,8 @@ typedef struct InlineMap {
   u32 npreg;
   FrameSlot* slot;
   u32 nslot;
+  u32* block;
+  u32 nblock;
 } InlineMap;
 
 typedef struct InstVec {
@@ -21,6 +23,38 @@ typedef struct InstVec {
   u32 cap;
 } InstVec;
 
+typedef struct InlineOrderCtx {
+  FuncSet* fs;
+  u8* temp;
+  u8* done;
+  Func** order;
+  u32 norder;
+} InlineOrderCtx;
+
+#define INLINE_NORMAL_COST_LIMIT 20u
+#define INLINE_ABS_GROWTH_LIMIT 64u
+
+static u32 func_inline_cost(Func* f) {
+  u32 cost = 0;
+  if (!f) return 0;
+  for (u32 b = 0; b < f->nblocks; ++b) {
+    Block* bl = &f->blocks[b];
+    for (u32 i = 0; i < bl->ninsts; ++i) {
+      switch ((IROp)bl->insts[i].op) {
+        case IR_NOP:
+        case IR_PARAM_DECL:
+        case IR_BR:
+        case IR_RET:
+          break;
+        default:
+          ++cost;
+          break;
+      }
+    }
+  }
+  return cost;
+}
+
 static void instvec_push(InstVec* iv, const Inst* in) {
   if (iv->n == iv->cap) {
     u32 ncap = iv->cap ? iv->cap * 2u : 16u;
@@ -92,6 +126,9 @@ static int op_supported_in_straightline_inline(IROp op) {
     case IR_UNOP:
     case IR_CMP:
     case IR_CONVERT:
+    case IR_BR:
+    case IR_CONDBR:
+    case IR_CMP_BRANCH:
     case IR_RET:
       return 1;
     default:
@@ -110,31 +147,34 @@ static int callee_inline_shape(Func* callee, u32* cost_out) {
     return 0;
   }
 
-  Block* entry = &callee->blocks[callee->entry];
   u32 nret = 0;
   u32 cost = 0;
-  for (u32 i = 0; i < entry->ninsts; ++i) {
-    Inst* in = &entry->insts[i];
-    IROp op = (IROp)in->op;
-    if (!op_supported_in_straightline_inline(op)) {
-      metrics_count(callee->c, "opt.inline.refuse_shape_op", 1);
-      return 0;
-    }
-    if (op == IR_RET) {
-      ++nret;
-      if (i + 1u != entry->ninsts) {
-        metrics_count(callee->c, "opt.inline.refuse_shape_ret_pos", 1);
+  for (u32 b = 0; b < callee->nblocks; ++b) {
+    Block* bl = &callee->blocks[b];
+    for (u32 i = 0; i < bl->ninsts; ++i) {
+      Inst* in = &bl->insts[i];
+      IROp op = (IROp)in->op;
+      if (!op_supported_in_straightline_inline(op)) {
+        metrics_count(callee->c, "opt.inline.refuse_shape_op", 1);
         return 0;
       }
-      continue;
+      if (op == IR_RET) {
+        ++nret;
+        if (i + 1u != bl->ninsts) {
+          metrics_count(callee->c, "opt.inline.refuse_shape_ret_pos", 1);
+          return 0;
+        }
+        continue;
+      }
+      if (op != IR_NOP && op != IR_PARAM_DECL && op != IR_BR) ++cost;
     }
-    if (op != IR_NOP && op != IR_PARAM_DECL) ++cost;
   }
-  if (nret != 1) {
+  if (nret == 0) {
     metrics_count(callee->c, "opt.inline.refuse_shape_ret_count", 1);
     return 0;
   }
-  if (cost > 20u) {
+  if (nret > 1) cost += (nret - 1u) * 4u;
+  if (cost > INLINE_NORMAL_COST_LIMIT) {
     metrics_count(callee->c, "opt.inline.refuse_shape_budget", 1);
     return 0;
   }
@@ -152,6 +192,11 @@ static FrameSlot map_slot(InlineMap* m, FrameSlot s) {
   return m->slot[s] ? m->slot[s] : s;
 }
 
+static u32 map_block(InlineMap* m, u32 b) {
+  if (b >= m->nblock) return b;
+  return m->block[b] != 0xffffffffu ? m->block[b] : b;
+}
+
 static void map_mem(InlineMap* m, MemAccess* mem) {
   if (!mem) return;
   if (mem->alias.kind == ALIAS_LOCAL && mem->alias.v.local_id > 0) {
@@ -183,8 +228,11 @@ static int build_inline_map(InlineMap* m, Func* caller, Func* callee) {
   m->callee = callee;
   m->npreg = callee->npregs;
   m->nslot = callee->nframe_slots + 1u;
+  m->nblock = callee->nblocks;
   m->preg = arena_zarray(caller->arena, PReg, m->npreg ? m->npreg : 1u);
   m->slot = arena_zarray(caller->arena, FrameSlot, m->nslot ? m->nslot : 1u);
+  m->block = arena_array(caller->arena, u32, m->nblock ? m->nblock : 1u);
+  for (u32 b = 0; b < m->nblock; ++b) m->block[b] = 0xffffffffu;
 
   for (FrameSlot s = 1; s <= callee->nframe_slots; ++s) {
     IRFrameSlot* old = &callee->frame_slots[s - 1u];
@@ -204,6 +252,7 @@ static int build_inline_map(InlineMap* m, Func* caller, Func* callee) {
     m->preg[r] =
         ir_alloc_preg(caller, callee->preg_type[r], callee->preg_cls[r]);
   }
+  for (u32 b = 0; b < callee->nblocks; ++b) m->block[b] = ir_block_new(caller);
   return 1;
 }
 
@@ -310,6 +359,11 @@ static Inst clone_inst(InlineMap* m, const Inst* src) {
   return dst;
 }
 
+static void clone_block_succs(InlineMap* m, Block* dst, const Block* src) {
+  ir_block_set_nsucc(m->caller, dst->id, src->nsucc);
+  for (u32 s = 0; s < src->nsucc; ++s) dst->succ[s] = map_block(m, src->succ[s]);
+}
+
 static int append_return_materialization(InstVec* out, InlineMap* m,
                                          const Inst* call, const Inst* ret) {
   IRCallAux* call_aux = (IRCallAux*)call->extra.aux;
@@ -341,6 +395,47 @@ static int append_return_materialization(InstVec* out, InlineMap* m,
   return 1;
 }
 
+static Inst make_branch(Func* f, SrcLoc loc) { return make_inst(f, IR_BR, loc); }
+
+static void emit_order_push_unique(u32* out, u32* nout, u32 b) {
+  for (u32 i = 0; i < *nout; ++i)
+    if (out[i] == b) return;
+  out[(*nout)++] = b;
+}
+
+static void inline_rebuild_emit_order(Func* caller, u32 anchor, InlineMap* m,
+                                      u32 cont) {
+  u32 cap = caller->emit_order_n + m->callee->emit_order_n + m->callee->nblocks +
+            2u;
+  u32* order = arena_array(caller->arena, u32, cap ? cap : 1u);
+  u32 norder = 0;
+  int inserted = 0;
+  for (u32 i = 0; i < caller->emit_order_n; ++i) {
+    u32 b = caller->emit_order[i];
+    emit_order_push_unique(order, &norder, b);
+    if (b != anchor) continue;
+    for (u32 j = 0; j < m->callee->emit_order_n; ++j)
+      emit_order_push_unique(order, &norder,
+                             map_block(m, m->callee->emit_order[j]));
+    for (u32 cb = 0; cb < m->callee->nblocks; ++cb)
+      emit_order_push_unique(order, &norder, map_block(m, cb));
+    emit_order_push_unique(order, &norder, cont);
+    inserted = 1;
+  }
+  if (!inserted) {
+    emit_order_push_unique(order, &norder, anchor);
+    for (u32 j = 0; j < m->callee->emit_order_n; ++j)
+      emit_order_push_unique(order, &norder,
+                             map_block(m, m->callee->emit_order[j]));
+    for (u32 cb = 0; cb < m->callee->nblocks; ++cb)
+      emit_order_push_unique(order, &norder, map_block(m, cb));
+    emit_order_push_unique(order, &norder, cont);
+  }
+  caller->emit_order = order;
+  caller->emit_order_n = norder;
+  caller->emit_order_cap = cap;
+}
+
 static int inline_rewrite_supported(Func* callee, const Inst* call) {
   IRCallAux* call_aux = (IRCallAux*)call->extra.aux;
   if (!call_aux || call_aux->desc.nargs != callee->nparams) return 0;
@@ -350,62 +445,125 @@ static int inline_rewrite_supported(Func* callee, const Inst* call) {
     if (av->storage.kind != OPK_REG && av->storage.kind != OPK_IMM) return 0;
   }
 
-  Block* cbl = &callee->blocks[callee->entry];
-  const Inst* ret = cbl->ninsts ? &cbl->insts[cbl->ninsts - 1u] : NULL;
-  if (!ret || (IROp)ret->op != IR_RET) return 0;
-  IRRetAux* ret_aux = (IRRetAux*)ret->extra.aux;
-  if (!ret_aux || !ret_aux->present) return 1;
-  if (ret_aux->val.nparts != 0 || call_aux->desc.ret.nparts != 0) return 0;
-  if (call_aux->desc.ret.storage.kind == OPK_IMM) return 1;
-  if (call_aux->desc.ret.storage.kind != OPK_REG) return 0;
-  return ret_aux->val.storage.kind == OPK_REG ||
-         ret_aux->val.storage.kind == OPK_IMM;
+  for (u32 b = 0; b < callee->nblocks; ++b) {
+    Block* bl = &callee->blocks[b];
+    for (u32 i = 0; i < bl->ninsts; ++i) {
+      Inst* in = &bl->insts[i];
+      if ((IROp)in->op != IR_RET) continue;
+      IRRetAux* ret_aux = (IRRetAux*)in->extra.aux;
+      if (!ret_aux || !ret_aux->present) continue;
+      if (ret_aux->val.nparts != 0 || call_aux->desc.ret.nparts != 0)
+        return 0;
+      if (call_aux->desc.ret.storage.kind == OPK_IMM) continue;
+      if (call_aux->desc.ret.storage.kind != OPK_REG) return 0;
+      if (ret_aux->val.storage.kind != OPK_REG &&
+          ret_aux->val.storage.kind != OPK_IMM)
+        return 0;
+    }
+  }
+  return 1;
 }
 
 static int inline_call_site(Func* caller, u32 block_idx, u32 inst_idx,
                             Func* callee) {
-  Block* cbl = &callee->blocks[callee->entry];
-  Block* bl = &caller->blocks[block_idx];
-  Inst call = bl->insts[inst_idx];
+  Block* old_bl = &caller->blocks[block_idx];
+  Inst* old_insts = old_bl->insts;
+  u32 old_ninsts = old_bl->ninsts;
+  u32 old_nsucc = old_bl->nsucc;
+  u32* old_succ = arena_array(caller->arena, u32, old_nsucc ? old_nsucc : 1u);
+  for (u32 s = 0; s < old_nsucc; ++s) old_succ[s] = old_bl->succ[s];
+  Inst call = old_insts[inst_idx];
   InlineMap map;
-  InstVec out;
-  memset(&out, 0, sizeof out);
-  out.f = caller;
   if (!build_inline_map(&map, caller, callee)) return 0;
-
-  for (u32 i = 0; i < inst_idx; ++i) instvec_push(&out, &bl->insts[i]);
-  if (!append_param_materialization(&out, &map, &call)) return 0;
-
-  const Inst* ret = NULL;
-  for (u32 i = 0; i < cbl->ninsts; ++i) {
-    Inst* src = &cbl->insts[i];
-    switch ((IROp)src->op) {
-      case IR_NOP:
-      case IR_PARAM_DECL:
-        break;
-      case IR_RET:
-        ret = src;
-        break;
-      default: {
-        Inst cloned = clone_inst(&map, src);
-        instvec_push(&out, &cloned);
-        break;
+  u32 cont = ir_block_new(caller);
+
+  InstVec pre;
+  memset(&pre, 0, sizeof pre);
+  pre.f = caller;
+  for (u32 i = 0; i < inst_idx; ++i) instvec_push(&pre, &old_insts[i]);
+  if (!append_param_materialization(&pre, &map, &call)) return 0;
+  Inst br = make_branch(caller, call.loc);
+  instvec_push(&pre, &br);
+
+  Block* pre_bl = &caller->blocks[block_idx];
+  pre_bl->insts = pre.v;
+  pre_bl->ninsts = pre.n;
+  pre_bl->cap = pre.cap;
+  ir_block_set_nsucc(caller, block_idx, 1);
+  pre_bl = &caller->blocks[block_idx];
+  pre_bl->succ[0] = map_block(&map, callee->entry);
+
+  Block* cont_bl = &caller->blocks[cont];
+  InstVec cont_out;
+  memset(&cont_out, 0, sizeof cont_out);
+  cont_out.f = caller;
+  for (u32 i = inst_idx + 1u; i < old_ninsts; ++i)
+    instvec_push(&cont_out, &old_insts[i]);
+  cont_bl->insts = cont_out.v;
+  cont_bl->ninsts = cont_out.n;
+  cont_bl->cap = cont_out.cap;
+  ir_block_set_nsucc(caller, cont, old_nsucc);
+  cont_bl = &caller->blocks[cont];
+  for (u32 s = 0; s < old_nsucc; ++s) cont_bl->succ[s] = old_succ[s];
+
+  for (u32 b = 0; b < callee->nblocks; ++b) {
+    Block* src_bl = &callee->blocks[b];
+    u32 dst_b = map_block(&map, b);
+    Block* dst_bl = &caller->blocks[dst_b];
+    InstVec body;
+    memset(&body, 0, sizeof body);
+    body.f = caller;
+    for (u32 i = 0; i < src_bl->ninsts; ++i) {
+      Inst* src = &src_bl->insts[i];
+      switch ((IROp)src->op) {
+        case IR_NOP:
+        case IR_PARAM_DECL:
+          break;
+        case IR_RET: {
+          if (!append_return_materialization(&body, &map, &call, src))
+            return 0;
+          Inst ret_br = make_branch(caller, src->loc);
+          instvec_push(&body, &ret_br);
+          break;
+        }
+        default: {
+          Inst cloned = clone_inst(&map, src);
+          instvec_push(&body, &cloned);
+          break;
+        }
       }
     }
+    dst_bl = &caller->blocks[dst_b];
+    dst_bl->insts = body.v;
+    dst_bl->ninsts = body.n;
+    dst_bl->cap = body.cap;
+    if (body.n && (IROp)body.v[body.n - 1u].op == IR_BR &&
+        src_bl->ninsts && (IROp)src_bl->insts[src_bl->ninsts - 1u].op == IR_RET) {
+      ir_block_set_nsucc(caller, dst_b, 1);
+      caller->blocks[dst_b].succ[0] = cont;
+    } else {
+      clone_block_succs(&map, &caller->blocks[dst_b], src_bl);
+    }
   }
-  if (!ret || !append_return_materialization(&out, &map, &call, ret)) return 0;
-  for (u32 i = inst_idx + 1u; i < bl->ninsts; ++i)
-    instvec_push(&out, &bl->insts[i]);
 
-  bl->insts = out.v;
-  bl->ninsts = out.n;
-  bl->cap = out.cap;
+  inline_rebuild_emit_order(caller, block_idx, &map, cont);
   opt_analysis_invalidate(caller, OPT_ANALYSIS_CFG | OPT_ANALYSIS_DEF_USE |
                                       OPT_ANALYSIS_DOM | OPT_ANALYSIS_LOOP);
   return 1;
 }
 
-static int try_inline_call(FuncSet* fs, Func* caller, u32 b, u32 i) {
+static int caller_growth_ok(FuncSet* fs, Func* caller, const u32* base_cost,
+                            u32 callee_cost) {
+  int idx = funcset_index(fs, caller);
+  u32 base = idx >= 0 ? base_cost[idx] : func_inline_cost(caller);
+  u32 growth_limit = base / 4u;
+  if (growth_limit < INLINE_NORMAL_COST_LIMIT) growth_limit = INLINE_NORMAL_COST_LIMIT;
+  if (growth_limit > INLINE_ABS_GROWTH_LIMIT) growth_limit = INLINE_ABS_GROWTH_LIMIT;
+  return func_inline_cost(caller) + callee_cost <= base + growth_limit;
+}
+
+static int try_inline_call(FuncSet* fs, Func* caller, u32 b, u32 i,
+                           const u32* base_cost) {
   Inst* in = &caller->blocks[b].insts[i];
   Func* callee = direct_callee(fs, in);
   u32 cost = 0;
@@ -419,11 +577,14 @@ static int try_inline_call(FuncSet* fs, Func* caller, u32 b, u32 i) {
     metrics_count(caller->c, "opt.inline.refuse_shape", 1);
     return 0;
   }
+  if (!caller_growth_ok(fs, caller, base_cost, cost)) {
+    metrics_count(caller->c, "opt.inline.refuse_growth", 1);
+    return 0;
+  }
   if (!inline_rewrite_supported(callee, in)) {
     metrics_count(caller->c, "opt.inline.refuse_rewrite_shape", 1);
     return 0;
   }
-  (void)cost;
   if (!inline_call_site(caller, b, i, callee)) {
     metrics_count(caller->c, "opt.inline.refuse_rewrite", 1);
     return 0;
@@ -432,19 +593,47 @@ static int try_inline_call(FuncSet* fs, Func* caller, u32 b, u32 i) {
   return 1;
 }
 
+static void inline_order_visit(InlineOrderCtx* ctx, Func* f) {
+  int idx = funcset_index(ctx->fs, f);
+  if (idx < 0 || ctx->done[idx]) return;
+  if (ctx->temp[idx]) return;
+  ctx->temp[idx] = 1;
+  for (u32 b = 0; b < f->nblocks; ++b) {
+    Block* bl = &f->blocks[b];
+    for (u32 i = 0; i < bl->ninsts; ++i) {
+      Func* callee = direct_callee(ctx->fs, &bl->insts[i]);
+      if (callee) inline_order_visit(ctx, callee);
+    }
+  }
+  ctx->temp[idx] = 0;
+  ctx->done[idx] = 1;
+  ctx->order[ctx->norder++] = f;
+}
+
 void opt_inline(FuncSet* fs, int max_iters) {
   if (!fs || fs->nfuncs == 0 || max_iters <= 0) return;
   if (max_iters > 4) max_iters = 4;
+  u32* base_cost = arena_array(fs->arena, u32, fs->nfuncs);
+  for (u32 i = 0; i < fs->nfuncs; ++i) base_cost[i] = func_inline_cost(fs->funcs[i]);
   for (int iter = 0; iter < max_iters; ++iter) {
     int changed = 0;
-    for (u32 fidx = 0; fidx < fs->nfuncs; ++fidx) {
-      Func* caller = fs->funcs[fidx];
+    InlineOrderCtx ctx;
+    memset(&ctx, 0, sizeof ctx);
+    ctx.fs = fs;
+    ctx.temp = arena_zarray(fs->arena, u8, fs->nfuncs);
+    ctx.done = arena_zarray(fs->arena, u8, fs->nfuncs);
+    ctx.order = arena_array(fs->arena, Func*, fs->nfuncs);
+    for (u32 fidx = 0; fidx < fs->nfuncs; ++fidx)
+      inline_order_visit(&ctx, fs->funcs[fidx]);
+
+    for (u32 fidx = 0; fidx < ctx.norder; ++fidx) {
+      Func* caller = ctx.order[fidx];
       if (!caller || caller->opt_reg_ssa || caller->opt_rewritten) continue;
       for (u32 b = 0; b < caller->nblocks; ++b) {
         Block* bl = &caller->blocks[b];
         for (u32 i = 0; i < bl->ninsts; ++i) {
           if ((IROp)bl->insts[i].op != IR_CALL) continue;
-          if (try_inline_call(fs, caller, b, i)) {
+          if (try_inline_call(fs, caller, b, i, base_cost)) {
             changed = 1;
             bl = &caller->blocks[b];
           }
diff --git a/src/opt/pass_o2.c b/src/opt/pass_o2.c
@@ -11,6 +11,58 @@
 #define GVN_ENTRY_NONE 0xffffffffu
 #define DSE_KEY_NONE 0xffffffffu
 
+void opt_cleanup(Func* f) {
+  if (!f) return;
+  opt_build_cfg(f);
+  opt_jump_cleanup(f, OPT_JUMP_CLEANUP_CFG);
+  opt_build_cfg(f);
+  opt_verify(f, "o2-pre-ssa-cfg");
+  opt_build_reg_ssa(f);
+  opt_verify(f, "o2-reg-ssa");
+  opt_block_cloning(f);
+  opt_verify(f, "o2-block-clone-cfg");
+  opt_build_ssa(f);
+  opt_verify(f, "o2-ssa");
+  opt_ssa_dce(f);
+  opt_copy_cleanup(f);
+  opt_verify(f, "o2-pre-addr-cleanup");
+  opt_addr_xform(f);
+  opt_verify(f, "o2-addr-xform-ssa");
+  opt_ssa_dce(f);
+  opt_verify(f, "o2-addr-xform-dce");
+  opt_copy_cleanup(f);
+  opt_verify(f, "o2-addr-xform-copy-cleanup");
+  opt_simplify(f);
+  opt_verify(f, "o2-simplify-ssa");
+  opt_ssa_dce(f);
+  opt_copy_cleanup(f);
+  opt_verify(f, "o2-simplify-cleanup");
+  opt_gvn(f);
+  opt_verify(f, "o2-gvn-ssa");
+  opt_copy_prop(f);
+  opt_verify(f, "o2-copy-prop-ssa");
+  opt_ssa_dce(f);
+  opt_verify(f, "o2-copy-dce");
+  opt_dse(f);
+  opt_verify(f, "o2-dse-ssa");
+  opt_ssa_dce(f);
+  opt_verify(f, "o2-dse-dce");
+  opt_build_loop_tree(f);
+  opt_licm(f);
+  opt_verify(f, "o2-licm-ssa");
+  opt_pressure_relief(f);
+  opt_verify(f, "o2-pressure-relief-ssa");
+  opt_make_conventional_ssa(f);
+  opt_verify(f, "o2-conventional-ssa");
+  opt_ssa_combine(f);
+  opt_verify(f, "o2-ssa-combine");
+  opt_undo_ssa(f);
+  opt_copy_cleanup(f);
+  opt_verify(f, "o2-undo-copy-cleanup");
+  opt_jump_opt(f);
+  opt_verify(f, "o2-jump-opt");
+}
+
 typedef struct CloneValMap {
   Val old_val;
   Val new_val;
@@ -2664,9 +2716,3 @@ void opt_ssa_combine(Func* f) {
   if (changed) opt_analysis_invalidate(f, OPT_ANALYSIS_DEF_USE);
   opt_rebuild_def_use(f);
 }
-
-void opt_cleanup(Func* f) {
-  if (!f) return;
-  opt_ssa_dce(f);
-  opt_copy_prop(f);
-}
diff --git a/test/opt/opt_test.c b/test/opt/opt_test.c
@@ -1185,6 +1185,95 @@ static Inst* emit_call_one_arg(TestCtx* tc, Func* f, u32 b, Operand arg,
   return in;
 }
 
+static Func* new_named_func(TestCtx* tc, ObjSymId sym, CfreeCgTypeId ret,
+                            const CfreeCgTypeId* params, u32 nparams,
+                            int variadic) {
+  CGFuncDesc fd;
+  memset(&fd, 0, sizeof fd);
+  fd.sym = sym;
+  fd.fn_type = make_func_type(tc, ret, params, nparams, variadic);
+  fd.abi = abi_cg_func_info(tc->c->abi, fd.fn_type);
+  Func* f = ir_func_new(tc->c, &fd);
+  f->entry = ir_block_new(f);
+  ir_note_emit(f, f->entry);
+  return f;
+}
+
+static PReg add_preg(Func* f, CfreeCgTypeId ty) {
+  return ir_alloc_preg(f, ty, RC_INT);
+}
+
+static void add_reg_param(Func* f, PReg r, CfreeCgTypeId ty) {
+  CGParamDesc d;
+  memset(&d, 0, sizeof d);
+  d.index = f->nparams;
+  d.type = ty;
+  d.size = 4;
+  d.align = 4;
+  d.storage.kind = CG_LOCAL_STORAGE_REG;
+  d.storage.v.reg = r;
+  ir_param_add(f, &d);
+
+  Inst* in = ir_emit(f, f->entry, IR_PARAM_DECL);
+  in->opnds = arena_array(f->arena, Operand, 1);
+  in->opnds[0] = op_reg_(r, ty);
+  in->nopnds = 1;
+  in->def = r;
+  in->type = ty;
+}
+
+static Inst* emit_preg_load_imm(Func* f, u32 b, PReg dst, CfreeCgTypeId ty,
+                                i64 imm) {
+  Inst* in = ir_emit(f, b, IR_LOAD_IMM);
+  in->opnds = arena_array(f->arena, Operand, 1);
+  in->opnds[0] = op_reg_(dst, ty);
+  in->nopnds = 1;
+  in->def = dst;
+  in->type = ty;
+  in->extra.imm = imm;
+  return in;
+}
+
+static Inst* emit_preg_binop(Func* f, u32 b, PReg dst, PReg a, PReg c,
+                             CfreeCgTypeId ty) {
+  Inst* in = ir_emit(f, b, IR_BINOP);
+  in->opnds = arena_array(f->arena, Operand, 3);
+  in->opnds[0] = op_reg_(dst, ty);
+  in->opnds[1] = op_reg_(a, ty);
+  in->opnds[2] = op_reg_(c, ty);
+  in->nopnds = 3;
+  in->def = dst;
+  in->type = ty;
+  in->extra.imm = BO_IADD;
+  return in;
+}
+
+static void emit_preg_ret(Func* f, u32 b, PReg v, CfreeCgTypeId ty) {
+  emit_ret_val(f, b, v, ty);
+}
+
+static Inst* emit_direct_call(TestCtx* tc, Func* f, u32 b, ObjSymId callee_sym,
+                              CfreeCgTypeId fn_ty, const Operand* args,
+                              u32 nargs, Operand ret) {
+  Inst* in = ir_emit(f, b, IR_CALL);
+  IRCallAux* aux = arena_znew(f->arena, IRCallAux);
+  CGABIValue* av = NULL;
+  if (nargs) {
+    av = arena_zarray(f->arena, CGABIValue, nargs);
+    const ABIFuncInfo* abi = abi_cg_func_info(tc->c->abi, fn_ty);
+    for (u32 i = 0; i < nargs; ++i)
+      av[i] = call_arg_value(args[i].type, abi ? &abi->params[i] : NULL,
+                             args[i]);
+  }
+  CGABIValue rv;
+  memset(&rv, 0, sizeof rv);
+  if (ret.kind) rv = call_arg_value(ret.type, NULL, ret);
+  aux->desc = make_call_desc(tc, fn_ty, av, nargs, rv);
+  aux->desc.callee = op_global_(callee_sym, 0, fn_ty);
+  in->extra.aux = aux;
+  return in;
+}
+
 static void expect_plan_arg(const CGCallPlan* p, u32 i, u8 dst_kind, u8 cls,
                             Reg reg, u32 stack_offset, const char* ctx) {
   EXPECT(i < p->nargs, "%s missing arg %u", ctx, (unsigned)i);
@@ -6100,6 +6189,206 @@ static void simple_regalloc_reports_exact_used_regs(void) {
          "reserved reg should not be allocated");
 }
 
+static void opt_inline_direct_wrapper(void) {
+  TestCtx tc;
+  tc_init(&tc);
+  CfreeCgTypeId ps[1] = {tc.i32};
+  Func* callee = new_named_func(&tc, (ObjSymId)2, tc.i32, ps, 1, 0);
+  PReg x = add_preg(callee, tc.i32);
+  add_reg_param(callee, x, tc.i32);
+  PReg one = add_preg(callee, tc.i32);
+  PReg sum = add_preg(callee, tc.i32);
+  emit_preg_load_imm(callee, callee->entry, one, tc.i32, 1);
+  emit_preg_binop(callee, callee->entry, sum, x, one, tc.i32);
+  emit_preg_ret(callee, callee->entry, sum, tc.i32);
+
+  Func* caller = new_named_func(&tc, (ObjSymId)1, tc.i32, NULL, 0, 0);
+  PReg arg = add_preg(caller, tc.i32);
+  PReg ret = add_preg(caller, tc.i32);
+  emit_preg_load_imm(caller, caller->entry, arg, tc.i32, 41);
+  Operand arg_op = op_reg_(arg, tc.i32);
+  emit_direct_call(&tc, caller, caller->entry, (ObjSymId)2, callee->type,
+                   &arg_op, 1, op_reg_(ret, tc.i32));
+  emit_preg_ret(caller, caller->entry, ret, tc.i32);
+
+  Func* funcs[2] = {caller, callee};
+  FuncSet fs = {tc.c, tc.c->tu, funcs, 2, 2};
+  opt_inline(&fs, 1);
+  EXPECT(count_op(caller, IR_CALL) == 0, "direct wrapper call should inline");
+  EXPECT(count_op(caller, IR_BINOP) == 1,
+         "inlined wrapper should clone callee arithmetic");
+  tc_fini(&tc);
+}
+
+static void opt_inline_two_return_scalar(void) {
+  TestCtx tc;
+  tc_init(&tc);
+  CfreeCgTypeId ps[1] = {tc.i32};
+  Func* callee = new_named_func(&tc, (ObjSymId)3, tc.i32, ps, 1, 0);
+  PReg x = add_preg(callee, tc.i32);
+  add_reg_param(callee, x, tc.i32);
+  u32 then_b = ir_block_new(callee);
+  u32 else_b = ir_block_new(callee);
+  ir_note_emit(callee, then_b);
+  ir_note_emit(callee, else_b);
+  emit_cond_branch(callee, callee->entry, x, then_b, else_b, tc.i32);
+  emit_preg_ret(callee, then_b, x, tc.i32);
+  PReg zero = add_preg(callee, tc.i32);
+  emit_preg_load_imm(callee, else_b, zero, tc.i32, 0);
+  emit_preg_ret(callee, else_b, zero, tc.i32);
+
+  Func* caller = new_named_func(&tc, (ObjSymId)4, tc.i32, NULL, 0, 0);
+  PReg arg = add_preg(caller, tc.i32);
+  PReg ret = add_preg(caller, tc.i32);
+  emit_preg_load_imm(caller, caller->entry, arg, tc.i32, 1);
+  Operand a = op_reg_(arg, tc.i32);
+  emit_direct_call(&tc, caller, caller->entry, (ObjSymId)3, callee->type, &a, 1,
+                   op_reg_(ret, tc.i32));
+  emit_preg_ret(caller, caller->entry, ret, tc.i32);
+
+  Func* funcs[2] = {caller, callee};
+  FuncSet fs = {tc.c, tc.c->tu, funcs, 2, 2};
+  opt_inline(&fs, 1);
+  EXPECT(count_op(caller, IR_CALL) == 0,
+         "two-return scalar callee should inline");
+  EXPECT(count_op(caller, IR_COPY) >= 2,
+         "each cloned return should materialize the scalar result");
+  tc_fini(&tc);
+}
+
+static void opt_inline_bottom_up_chain_single_iter(void) {
+  TestCtx tc;
+  tc_init(&tc);
+  CfreeCgTypeId ps[1] = {tc.i32};
+  Func* g = new_named_func(&tc, (ObjSymId)5, tc.i32, ps, 1, 0);
+  PReg gx = add_preg(g, tc.i32);
+  add_reg_param(g, gx, tc.i32);
+  PReg one = add_preg(g, tc.i32);
+  PReg gout = add_preg(g, tc.i32);
+  emit_preg_load_imm(g, g->entry, one, tc.i32, 1);
+  emit_preg_binop(g, g->entry, gout, gx, one, tc.i32);
+  emit_preg_ret(g, g->entry, gout, tc.i32);
+
+  Func* f = new_named_func(&tc, (ObjSymId)6, tc.i32, ps, 1, 0);
+  PReg fx = add_preg(f, tc.i32);
+  PReg fout = add_preg(f, tc.i32);
+  add_reg_param(f, fx, tc.i32);
+  Operand fa = op_reg_(fx, tc.i32);
+  emit_direct_call(&tc, f, f->entry, (ObjSymId)5, g->type, &fa, 1,
+                   op_reg_(fout, tc.i32));
+  emit_preg_ret(f, f->entry, fout, tc.i32);
+
+  Func* main_f = new_named_func(&tc, (ObjSymId)7, tc.i32, NULL, 0, 0);
+  PReg arg = add_preg(main_f, tc.i32);
+  PReg ret = add_preg(main_f, tc.i32);
+  emit_preg_load_imm(main_f, main_f->entry, arg, tc.i32, 41);
+  Operand ma = op_reg_(arg, tc.i32);
+  emit_direct_call(&tc, main_f, main_f->entry, (ObjSymId)6, f->type, &ma, 1,
+                   op_reg_(ret, tc.i32));
+  emit_preg_ret(main_f, main_f->entry, ret, tc.i32);
+
+  Func* funcs[3] = {main_f, f, g};
+  FuncSet fs = {tc.c, tc.c->tu, funcs, 3, 3};
+  opt_inline(&fs, 1);
+  EXPECT(count_op(f, IR_CALL) == 0,
+         "bottom-up order should inline leaf into wrapper first");
+  EXPECT(count_op(main_f, IR_CALL) == 0,
+         "bottom-up order should inline cleaned wrapper into caller");
+  tc_fini(&tc);
+}
+
+static void opt_inline_refuses_recursive_and_unsupported(void) {
+  TestCtx tc;
+  tc_init(&tc);
+  Func* rec = new_named_func(&tc, (ObjSymId)8, tc.i32, NULL, 0, 0);
+  PReg rr = add_preg(rec, tc.i32);
+  emit_direct_call(&tc, rec, rec->entry, (ObjSymId)8, rec->type, NULL, 0,
+                   op_reg_(rr, tc.i32));
+  emit_preg_ret(rec, rec->entry, rr, tc.i32);
+
+  Func* varg = new_named_func(&tc, (ObjSymId)9, tc.i32, NULL, 0, 1);
+  PReg vr = add_preg(varg, tc.i32);
+  emit_preg_load_imm(varg, varg->entry, vr, tc.i32, 3);
+  emit_preg_ret(varg, varg->entry, vr, tc.i32);
+
+  Func* alloca_f = new_named_func(&tc, (ObjSymId)13, tc.i32, NULL, 0, 0);
+  ir_emit(alloca_f, alloca_f->entry, IR_ALLOCA);
+  PReg ar = add_preg(alloca_f, tc.i32);
+  emit_preg_load_imm(alloca_f, alloca_f->entry, ar, tc.i32, 4);
+  emit_preg_ret(alloca_f, alloca_f->entry, ar, tc.i32);
+
+  Func* asm_f = new_named_func(&tc, (ObjSymId)14, tc.i32, NULL, 0, 0);
+  ir_emit(asm_f, asm_f->entry, IR_ASM_BLOCK);
+  PReg sr = add_preg(asm_f, tc.i32);
+  emit_preg_load_imm(asm_f, asm_f->entry, sr, tc.i32, 5);
+  emit_preg_ret(asm_f, asm_f->entry, sr, tc.i32);
+
+  Func* setjmp_f = new_named_func(&tc, (ObjSymId)15, tc.i32, NULL, 0, 0);
+  Inst* sj = ir_emit(setjmp_f, setjmp_f->entry, IR_INTRINSIC);
+  IRIntrinAux* sj_aux = arena_znew(setjmp_f->arena, IRIntrinAux);
+  sj_aux->kind = INTRIN_SETJMP;
+  sj->extra.aux = sj_aux;
+  PReg jr = add_preg(setjmp_f, tc.i32);
+  emit_preg_load_imm(setjmp_f, setjmp_f->entry, jr, tc.i32, 6);
+  emit_preg_ret(setjmp_f, setjmp_f->entry, jr, tc.i32);
+
+  Func* caller = new_named_func(&tc, (ObjSymId)10, tc.i32, NULL, 0, 0);
+  PReg a = add_preg(caller, tc.i32);
+  PReg b = add_preg(caller, tc.i32);
+  PReg c = add_preg(caller, tc.i32);
+  PReg d = add_preg(caller, tc.i32);
+  PReg e = add_preg(caller, tc.i32);
+  emit_direct_call(&tc, caller, caller->entry, (ObjSymId)8, rec->type, NULL, 0,
+                   op_reg_(a, tc.i32));
+  emit_direct_call(&tc, caller, caller->entry, (ObjSymId)9, varg->type, NULL, 0,
+                   op_reg_(b, tc.i32));
+  emit_direct_call(&tc, caller, caller->entry, (ObjSymId)13, alloca_f->type,
+                   NULL, 0, op_reg_(c, tc.i32));
+  emit_direct_call(&tc, caller, caller->entry, (ObjSymId)14, asm_f->type, NULL,
+                   0, op_reg_(d, tc.i32));
+  emit_direct_call(&tc, caller, caller->entry, (ObjSymId)15, setjmp_f->type,
+                   NULL, 0, op_reg_(e, tc.i32));
+  emit_preg_ret(caller, caller->entry, e, tc.i32);
+
+  Func* funcs[6] = {caller, rec, varg, alloca_f, asm_f, setjmp_f};
+  FuncSet fs = {tc.c, tc.c->tu, funcs, 6, 6};
+  opt_inline(&fs, 1);
+  EXPECT(count_op(rec, IR_CALL) == 1, "self recursion should not inline");
+  EXPECT(count_op(caller, IR_CALL) == 5,
+         "recursive, variadic, alloca, asm, and setjmp callees remain calls");
+  tc_fini(&tc);
+}
+
+static void opt_inline_caller_growth_cap(void) {
+  TestCtx tc;
+  tc_init(&tc);
+  Func* callee = new_named_func(&tc, (ObjSymId)11, tc.i32, NULL, 0, 0);
+  PReg last = PREG_NONE;
+  for (u32 i = 0; i < 20; ++i) {
+    last = add_preg(callee, tc.i32);
+    emit_preg_load_imm(callee, callee->entry, last, tc.i32, (i64)i);
+  }
+  emit_preg_ret(callee, callee->entry, last, tc.i32);
+
+  Func* caller = new_named_func(&tc, (ObjSymId)12, tc.i32, NULL, 0, 0);
+  PReg ret = PREG_NONE;
+  for (u32 i = 0; i < 8; ++i) {
+    ret = add_preg(caller, tc.i32);
+    emit_direct_call(&tc, caller, caller->entry, (ObjSymId)11, callee->type,
+                     NULL, 0, op_reg_(ret, tc.i32));
+  }
+  emit_preg_ret(caller, caller->entry, ret, tc.i32);
+
+  Func* funcs[2] = {caller, callee};
+  FuncSet fs = {tc.c, tc.c->tu, funcs, 2, 2};
+  opt_inline(&fs, 1);
+  EXPECT(count_op(caller, IR_CALL) > 0,
+         "caller growth cap should leave some large calls uninlined");
+  EXPECT(count_op(caller, IR_CALL) < 8,
+         "growth cap should still permit the first affordable inline");
+  tc_fini(&tc);
+}
+
 int main(void) {
   opt_machinize_uses_phys_reg_metadata();
   opt_machinize_keeps_abi_regs_without_legacy_call_fallback();
@@ -6221,6 +6510,11 @@ int main(void) {
   opt_param_memory_required_uses_frame();
   opt_local_addr_taken_uses_frame_and_replays_addr_of();
   opt_register_local_addr_frame_homes();
+  opt_inline_direct_wrapper();
+  opt_inline_two_return_scalar();
+  opt_inline_bottom_up_chain_single_iter();
+  opt_inline_refuses_recursive_and_unsupported();
+  opt_inline_caller_growth_cap();
   simple_regalloc_reports_exact_used_regs();
   if (g_fails) {
     fprintf(stderr, "opt tests: %d failed (%d checks)\n", g_fails, g_checks);
diff --git a/test/toy/cases/133_inline_wrapper_o2.toy b/test/toy/cases/133_inline_wrapper_o2.toy
@@ -0,0 +1,12 @@
+fn add1(x: i64): i64 {
+  return x + 1;
+}
+
+fn __user_main(): i64 {
+  if add1(41) != 42 {
+    return 1;
+  }
+  return 0;
+}
+
+fn main(): i32 { return __user_main() as i32; }
diff --git a/test/toy/cases/134_inline_chain_o2.toy b/test/toy/cases/134_inline_chain_o2.toy
@@ -0,0 +1,16 @@
+fn g(x: i64): i64 {
+  return x + 1;
+}
+
+fn f(x: i64): i64 {
+  return g(x) + 1;
+}
+
+fn __user_main(): i64 {
+  if f(40) != 42 {
+    return 1;
+  }
+  return 0;
+}
+
+fn main(): i32 { return __user_main() as i32; }

	kit kit
	git clone https://git.ryansepassi.com/git/kit.git
	Log \| Files \| Refs \| README

M	doc/OPT.md	\|	88	++++++++++++++++++++++++++++++++++++++++---------------------------------------
M	src/opt/opt.c	\|	49	+------------------------------------------------
M	src/opt/opt.h	\|	20	++++++++++----------
M	src/opt/pass_inline.c	\|	311	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------------
M	src/opt/pass_o2.c	\|	58	++++++++++++++++++++++++++++++++++++++++++++++++++++------
M	test/opt/opt_test.c	\|	294	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A	test/toy/cases/133_inline_wrapper_o2.toy	\|	12	++++++++++++
A	test/toy/cases/134_inline_chain_o2.toy	\|	16	++++++++++++++++