kit

kit
git clone https://git.ryansepassi.com/git/kit.git
Log | Files | Refs | README

commit 7617fbc06d101d6345d167594ce106aca90ac979
parent a7442212be4d2e3a72cad198038e1f75103ac0bb
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Fri, 15 May 2026 15:12:22 -0700

Add O1 local jump cleanup

Diffstat:
Mdoc/OPT1.md | 49+++++++++++++++++--------------------------------
Mdoc/PERF.md | 34+++++++++++++++++++++++++++++++---
Msrc/opt/opt.c | 7+++++++
Msrc/opt/opt.h | 5+++++
Asrc/opt/pass_jump.c | 209+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Mtest/opt/opt_test.c | 130+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
6 files changed, 399 insertions(+), 35 deletions(-)

diff --git a/doc/OPT1.md b/doc/OPT1.md @@ -31,6 +31,8 @@ The normal O1 function-end path is: ```text build_cfg +jump_cleanup.cfg +build_cfg machinize build_loop_tree live_blocks.pre_dde @@ -42,6 +44,9 @@ regalloc: rewrite combine dce +jump_cleanup.cfg +build_cfg +jump_cleanup.layout emit ``` @@ -57,6 +62,11 @@ Important pass boundaries: deletion, safe single-use folds, redundant spill load/store cleanup, and similar local rewrites. - `dce` is conservative after rewrite and removes only trivial dead IR shapes. +- `jump_cleanup` is the cheap O1 branch-shape pass. The CFG stage forwards + branches to branch-only blocks and inverts compare-branches when that removes + an unconditional jump block without doing full block layout. The layout stage + runs immediately before emit and deletes unconditional branches to the + physical fallthrough block. ## Liveness And Ranges @@ -238,6 +248,10 @@ Observed progress: - Branch-consuming compares lower to direct compare branches. For example, AArch64 while-loop conditions use `cmp w19, #0xa` plus `b.ge ...` rather than a `cmp; cset; cmp #0; b.cond` bridge. +- Local branch cleanup now removes O1-only branch artifacts: branch-to-branch + targets are forwarded, safe compare-branches are inverted to remove + fallthrough jump blocks, and final unconditional branches to the physical + fallthrough block are deleted before emit. - Very local constant simplification handles the direct O1 probes covered by `doc/CONSTFOLD.md`: immediate arithmetic, compare literals, delayed arithmetic chains, and straight-line scalar-local shadows. The @@ -251,9 +265,6 @@ b ... Remaining O1 shape issues visible in the current dumps: -- Cheap branch layout cleanup is still missing. Every quality probe had at - least one unconditional branch to the immediately following block, including - trivial returns and shared return epilogues. - Parameter and entry-slot promotion is incomplete. The trivial AArch64 identity function still stores the incoming argument to a frame slot, reloads it into `w19`, then copies it back to `w0`: @@ -287,13 +298,7 @@ mov w19, w21 MIR's O1 path suggests these high-value local cleanups that still fit cfree's fast tier: -1. Clean up local branch layout artifacts. - MIR's full jump optimizer is O2-only, but its cheap pieces are appropriate - for O1: delete branches to immediate fallthrough blocks, forward - branch-to-branch targets, and invert a branch when it removes an - unconditional jump. Avoid full CFG layout work. - -2. Promote remaining scalar entry slots before backend allocation. +1. Promote remaining scalar entry slots before backend allocation. MIR's C frontend represents normal scalar block locals as MIR registers and leaves stack slots for aggregates, forced-stack cases, and address-taken values. O1 now keeps simple loop locals in registers in the probe, but still @@ -301,34 +306,14 @@ fast tier: pass should promote remaining integer/pointer scalars whose address does not escape, starting with parameters and single-entry structured control flow. -3. Avoid unnecessary callee-save traffic. +2. Avoid unnecessary callee-save traffic. Reserve and preserve only hard registers that survive final post-rewrite cleanup, and consider caller-saved registers for values that are not live across calls. This would make small leaf functions much closer to expected O1 output without requiring global optimization. -4. Continue tightening post-rewrite DCE and copy cleanup. +3. Continue tightening post-rewrite DCE and copy cleanup. Model hard-register call arguments and call clobbers precisely enough to delete dead caller-saved defs before calls without removing required ABI traffic. Also fold single-use arithmetic temporaries into their destination when target constraints allow it. - -5. Keep compare-branch fusion covered by tests. - The current probes show direct `cmp` plus branch shapes for branch-consuming - compares on AArch64. Add focused regression coverage so the old - `cmp; cset; cmp #0; b.cond` bridge does not return. - -6. Keep tiny local constant simplification bounded. - The vstack constfold path now removes obvious immediate and straight-line - scalar-local code before allocation. Keep it basic-block-local; broader - propagation belongs in the O2 SSA/value optimizer. - -7. Watch spill-pressure regalloc slope. - Normal-path scaling is linear, but heavy spill pressure still bends slightly - in regalloc time. Rerun the nonconstant wide-local ladder and decide whether - interval probing or stack slot assignment needs another narrow cleanup. - -- Keep `opt_combine` legality target-aware. - Existing one-use copy/immediate/convert folds should stay conservative. New - folds must be gated by target operand legality rather than assuming every - backend accepts the same IR operand forms. diff --git a/doc/PERF.md b/doc/PERF.md @@ -17,6 +17,8 @@ Top-level `cfree run --time` scopes: - `compile.frontend`: registered frontend body. - `compile.c.*`: C frontend setup, PP option application, parse/codegen, cleanup. - `opt.o1.total`: per-function O1 pipeline, nested under C parse/codegen. +- `opt.jump_cleanup`: cheap O1 CFG/layout branch cleanup after post-rewrite + cleanup, nested under `opt.o1.total`. - `link_jit.all`: linker resolve plus JIT image materialization. - `link.resolve.*`: archive ingestion, symbol resolution, GC, layout, reloc emit. - `jit.*`: reservation, segment copy, relocation patching, protection, icache flush, @@ -840,6 +842,26 @@ compare_branch_literal 0.351 0.088 0.188 0 0 local_addr_taken 0.351 0.088 0.187 0 0 0 ``` +### O1 Local Jump Cleanup + +The O1 branch-shape cleanup is intentionally local and bounded. It has two +stages: + +- `jump_cleanup.cfg`: forwards branch-to-branch targets and inverts a + compare-branch only when that removes the immediately following + unconditional-jump block. +- `jump_cleanup.layout`: runs immediately before emit and deletes an + unconditional branch whose target is the next physical block in `emit_order`. + +The scaling property is block/edge-linear for the O1 shapes this pass owns. +The pass builds a per-block emit-order index once per run, memoizes resolved +branch-only chains, and never walks values, liveness bitsets, ranges, or +allocator occupancy. Its scratch storage is proportional to block count, and +its work is proportional to blocks plus terminator edges plus branch-only chain +members. It should therefore remain invisible next to `live_blocks`, +`live_ranges_reg`, and `regalloc` in large O1 scaling probes. If future branch +layout work needs global block ordering, keep that out of O1 and put it in O2. + ## Performance Priorities 1. Keep O1 on interval occupancy. @@ -860,18 +882,24 @@ local_addr_taken 0.351 0.088 0.187 0 0 smaller, then either reuse/revalidate liveness after DDE or make DDE produce enough local update information to avoid a full rebuild. -4. Avoid whole-function contiguous growth where hot passes scan repeatedly. +4. Keep jump cleanup local and block-linear. + The O1 branch cleanup exists to remove obvious code-shape artifacts, not to + perform global layout. Keep it on cached emit-order indexes, memoized + branch-only chains, and block/terminator scans. Do not add value, liveness, + or range traffic to this pass. + +5. Avoid whole-function contiguous growth where hot passes scan repeatedly. Large `Val`/block/instruction arrays and dense bit matrices are likely to hurt cache locality. Segmented arrays may help by reducing large copies and stabilizing addresses, but pass algorithms need to avoid turning segment traversal into extra indirection in inner loops. -5. Keep link/JIT lower priority for now. +6. Keep link/JIT lower priority for now. Even at 1024 functions, `link_jit.all` is around `0.5 ms`; copy, reloc, and icache flush are small and roughly linear. They are not the bottleneck for submit-to-execution latency in these tests. -6. Split `compile.c.parse_codegen` further before changing parser/CG data +7. Split `compile.c.parse_codegen` further before changing parser/CG data structures. The non-O1 compile residual is meaningful at large source sizes, but it is currently broad. More counters are needed before choosing parser, PP, type, diff --git a/src/opt/opt.c b/src/opt/opt.c @@ -1725,6 +1725,8 @@ static void w_func_end(CGTarget* t) { metrics_count(o->c, "opt.vals", o->f->nvals); metrics_scope_begin(o->c, "opt.build_cfg"); opt_build_cfg(o->f); + opt_jump_cleanup(o->f, OPT_JUMP_CLEANUP_CFG); + opt_build_cfg(o->f); metrics_scope_end(o->c, "opt.build_cfg"); metrics_scope_begin(o->c, "opt.machinize"); opt_machinize(o->f, o->target); @@ -1789,6 +1791,11 @@ static void w_func_end(CGTarget* t) { metrics_scope_begin(o->c, "opt.dce"); opt_dce(o->f); metrics_scope_end(o->c, "opt.dce"); + metrics_scope_begin(o->c, "opt.jump_cleanup"); + opt_jump_cleanup(o->f, OPT_JUMP_CLEANUP_CFG); + opt_build_cfg(o->f); + opt_jump_cleanup(o->f, OPT_JUMP_CLEANUP_LAYOUT); + metrics_scope_end(o->c, "opt.jump_cleanup"); metrics_scope_begin(o->c, "opt.emit"); opt_emit(o->c, o->f, o->target); metrics_scope_end(o->c, "opt.emit"); diff --git a/src/opt/opt.h b/src/opt/opt.h @@ -29,6 +29,11 @@ CGTarget* opt_cgtarget_new(Compiler*, CGTarget* target, int level); /* ----- intra-procedural passes (run per Func at func_end on -O2) ----- */ void opt_build_cfg(Func*); +typedef enum OptJumpCleanupStage { + OPT_JUMP_CLEANUP_CFG, + OPT_JUMP_CLEANUP_LAYOUT, +} OptJumpCleanupStage; +void opt_jump_cleanup(Func*, OptJumpCleanupStage); void opt_block_cloning(Func*); void opt_build_ssa(Func*); void opt_addr_xform(Func*); diff --git a/src/opt/pass_jump.c b/src/opt/pass_jump.c @@ -0,0 +1,209 @@ +#include <string.h> + +#include "opt/opt.h" + +#define BLOCK_NONE ((u32)~0u) + +typedef struct JumpCleanupCtx { + Func* f; + u32* emit_index_by_block; + u32* forwarded_target_by_block; + u32* forward_path; +} JumpCleanupCtx; + +static u32 emit_order_index(const JumpCleanupCtx* c, u32 block) { + if (block >= c->f->nblocks) return BLOCK_NONE; + return c->emit_index_by_block[block]; +} + +static u32 next_emit_block(const JumpCleanupCtx* c, u32 block) { + u32 idx = emit_order_index(c, block); + if (idx == BLOCK_NONE || idx + 1u >= c->f->emit_order_n) return BLOCK_NONE; + return c->f->emit_order[idx + 1u]; +} + +static JumpCleanupCtx jump_cleanup_ctx(Func* f) { + JumpCleanupCtx c; + c.f = f; + c.emit_index_by_block = + arena_array(f->arena, u32, f->nblocks ? f->nblocks : 1u); + c.forwarded_target_by_block = + arena_array(f->arena, u32, f->nblocks ? f->nblocks : 1u); + c.forward_path = arena_array(f->arena, u32, f->nblocks ? f->nblocks : 1u); + for (u32 b = 0; b < f->nblocks; ++b) c.emit_index_by_block[b] = BLOCK_NONE; + for (u32 b = 0; b < f->nblocks; ++b) + c.forwarded_target_by_block[b] = BLOCK_NONE; + for (u32 i = 0; i < f->emit_order_n; ++i) { + u32 b = f->emit_order[i]; + if (b < f->nblocks) c.emit_index_by_block[b] = i; + } + return c; +} + +static int single_jump_block(const JumpCleanupCtx* c, u32 block, + u32* target_out) { + Func* f = c->f; + if (block >= f->nblocks) return 0; + Block* bl = &f->blocks[block]; + if (bl->ninsts != 1 || bl->nsucc != 1) return 0; + if ((IROp)bl->insts[0].op != IR_BR) return 0; + if (target_out) *target_out = bl->succ[0]; + return 1; +} + +static u32 forward_jump_target(JumpCleanupCtx* c, u32 target) { + u32 cur = target; + Func* f = c->f; + u32 npath = 0; + u32 result = target; + for (u32 step = 0; step < f->nblocks; ++step) { + u32 next = BLOCK_NONE; + if (cur >= f->nblocks) { + result = cur; + break; + } + if (c->forwarded_target_by_block[cur] != BLOCK_NONE) { + result = c->forwarded_target_by_block[cur]; + break; + } + if (!single_jump_block(c, cur, &next)) { + result = cur; + c->forwarded_target_by_block[cur] = result; + break; + } + c->forward_path[npath++] = cur; + if (next == cur) { + result = target; + break; + } + cur = next; + if (step + 1u == f->nblocks) result = target; + } + for (u32 i = 0; i < npath; ++i) + c->forwarded_target_by_block[c->forward_path[i]] = result; + return result; +} + +static int invert_cmp(CmpOp op, CmpOp* out) { + switch (op) { + case CMP_EQ: + *out = CMP_NE; + return 1; + case CMP_NE: + *out = CMP_EQ; + return 1; + case CMP_LT_S: + *out = CMP_GE_S; + return 1; + case CMP_LE_S: + *out = CMP_GT_S; + return 1; + case CMP_GT_S: + *out = CMP_LE_S; + return 1; + case CMP_GE_S: + *out = CMP_LT_S; + return 1; + case CMP_LT_U: + *out = CMP_GE_U; + return 1; + case CMP_LE_U: + *out = CMP_GT_U; + return 1; + case CMP_GT_U: + *out = CMP_LE_U; + return 1; + case CMP_GE_U: + *out = CMP_LT_U; + return 1; + default: + return 0; + } +} + +static int block_has_only_pred(const Func* f, u32 block, u32 pred) { + if (block >= f->nblocks) return 0; + const Block* bl = &f->blocks[block]; + return bl->npreds == 1 && bl->preds && bl->preds[0] == pred; +} + +static void cleanup_branch_targets(JumpCleanupCtx* c) { + Func* f = c->f; + for (u32 b = 0; b < f->nblocks; ++b) { + Block* bl = &f->blocks[b]; + u32 nsucc = 0; + if (!bl->ninsts) continue; + IROp op = (IROp)bl->insts[bl->ninsts - 1u].op; + switch (op) { + case IR_BR: + nsucc = bl->nsucc; + break; + case IR_CMP_BRANCH: + case IR_CONDBR: + nsucc = bl->nsucc ? 1u : 0u; + break; + default: + continue; + } + for (u32 s = 0; s < nsucc; ++s) { + u32 target = bl->succ[s]; + u32 forwarded = forward_jump_target(c, target); + if (forwarded < f->nblocks) bl->succ[s] = forwarded; + } + } +} + +static void cleanup_invert_jump_fallthrough(JumpCleanupCtx* c) { + Func* f = c->f; + for (u32 b = 0; b < f->nblocks; ++b) { + Block* bl = &f->blocks[b]; + if (!bl->ninsts || bl->nsucc != 2) continue; + Inst* last = &bl->insts[bl->ninsts - 1u]; + if ((IROp)last->op != IR_CMP_BRANCH) continue; + + u32 taken = bl->succ[0]; + u32 fallthrough = bl->succ[1]; + u32 next = next_emit_block(c, b); + u32 fallthrough_idx = emit_order_index(c, fallthrough); + u32 jump_target = BLOCK_NONE; + CmpOp inverted; + if (next != fallthrough) continue; + if (fallthrough_idx == BLOCK_NONE || + fallthrough_idx + 1u >= f->emit_order_n) + continue; + if (f->emit_order[fallthrough_idx + 1u] != taken) continue; + if (!block_has_only_pred(f, fallthrough, b)) continue; + if (!single_jump_block(c, fallthrough, &jump_target)) continue; + jump_target = forward_jump_target(c, jump_target); + if (jump_target >= f->nblocks) continue; + if (!invert_cmp((CmpOp)last->extra.imm, &inverted)) continue; + + last->extra.imm = inverted; + bl->succ[0] = jump_target; + bl->succ[1] = taken; + } +} + +static void cleanup_layout_fallthrough_branches(const JumpCleanupCtx* c) { + Func* f = c->f; + for (u32 b = 0; b < f->nblocks; ++b) { + Block* bl = &f->blocks[b]; + if (!bl->ninsts || bl->nsucc != 1) continue; + Inst* last = &bl->insts[bl->ninsts - 1u]; + if ((IROp)last->op != IR_BR) continue; + if (next_emit_block(c, b) != bl->succ[0]) continue; + memset(last, 0, sizeof *last); + --bl->ninsts; + } +} + +void opt_jump_cleanup(Func* f, OptJumpCleanupStage stage) { + if (!f) return; + JumpCleanupCtx c = jump_cleanup_ctx(f); + if (stage == OPT_JUMP_CLEANUP_CFG) { + cleanup_invert_jump_fallthrough(&c); + cleanup_branch_targets(&c); + } else if (stage == OPT_JUMP_CLEANUP_LAYOUT) { + cleanup_layout_fallthrough_branches(&c); + } +} diff --git a/test/opt/opt_test.c b/test/opt/opt_test.c @@ -1090,6 +1090,132 @@ static void opt_cfg_preserves_scope_edges(void) { tc_fini(&tc); } +static void opt_jump_cleanup_forwards_branch_targets(void) { + TestCtx tc; + tc_init(&tc); + Func* f = new_func(&tc); + u32 b0 = f->entry; + u32 hop = ir_block_new(f); + u32 ret = ir_block_new(f); + ir_note_emit(f, hop); + ir_note_emit(f, ret); + + Val v = add_val(f, tc.i32); + emit_br_to(f, b0, hop); + emit_br_to(f, hop, ret); + emit_load_imm(f, ret, v, tc.i32, 7); + emit_ret_val(f, ret, v, tc.i32); + + opt_build_cfg(f); + opt_jump_cleanup(f, OPT_JUMP_CLEANUP_CFG); + opt_build_cfg(f); + + EXPECT(f->blocks[b0].nsucc == 1 && f->blocks[b0].succ[0] == ret, + "jump cleanup should forward branch target to final block"); + EXPECT(f->blocks[hop].ninsts == 0, + "forwarded branch-only block should be pruned as unreachable"); + EXPECT(f->emit_order_n == 2, + "emit_order should drop pruned branch-only block, got %u", + (unsigned)f->emit_order_n); + tc_fini(&tc); +} + +static void opt_jump_cleanup_inverts_to_remove_jump_block(void) { + TestCtx tc; + tc_init(&tc); + Func* f = new_func(&tc); + u32 b0 = f->entry; + u32 jump_ft = ir_block_new(f); + u32 taken = ir_block_new(f); + u32 join = ir_block_new(f); + ir_note_emit(f, jump_ft); + ir_note_emit(f, taken); + ir_note_emit(f, join); + + Val v = add_val(f, tc.i32); + emit_test_branch(f, b0, taken, jump_ft, tc.i32); + emit_br_to(f, jump_ft, join); + emit_load_imm(f, taken, v, tc.i32, 7); + emit_br_to(f, taken, join); + emit_ret_val(f, join, v, tc.i32); + + opt_build_cfg(f); + opt_jump_cleanup(f, OPT_JUMP_CLEANUP_CFG); + opt_build_cfg(f); + + Inst* br = &f->blocks[b0].insts[f->blocks[b0].ninsts - 1u]; + EXPECT((IROp)br->op == IR_CMP_BRANCH && br->extra.imm == CMP_EQ, + "jump cleanup should invert CMP_NE to CMP_EQ"); + EXPECT(f->blocks[b0].succ[0] == join && f->blocks[b0].succ[1] == taken, + "inverted branch should target join and fall through to taken"); + EXPECT(f->blocks[jump_ft].ninsts == 0, + "removed fallthrough jump block should be pruned"); + EXPECT(f->emit_order_n >= 2 && f->emit_order[1] == taken, + "taken block should become physical fallthrough after pruning"); + tc_fini(&tc); +} + +static void opt_jump_cleanup_keeps_conditional_fallthrough_block(void) { + TestCtx tc; + tc_init(&tc); + Func* f = new_func(&tc); + u32 b0 = f->entry; + u32 jump_ft = ir_block_new(f); + u32 other = ir_block_new(f); + u32 taken = ir_block_new(f); + u32 join = ir_block_new(f); + ir_note_emit(f, jump_ft); + ir_note_emit(f, other); + ir_note_emit(f, taken); + ir_note_emit(f, join); + + Val v = add_val(f, tc.i32); + Val w = add_val(f, tc.i32); + emit_test_branch(f, b0, taken, jump_ft, tc.i32); + emit_br_to(f, jump_ft, join); + emit_load_imm(f, taken, v, tc.i32, 7); + emit_br_to(f, taken, other); + emit_load_imm(f, other, w, tc.i32, 1); + emit_br_to(f, other, join); + emit_ret_val(f, join, w, tc.i32); + + opt_build_cfg(f); + opt_jump_cleanup(f, OPT_JUMP_CLEANUP_CFG); + opt_build_cfg(f); + + EXPECT(f->blocks[b0].succ[1] == jump_ft, + "conditional fallthrough edge should not be forwarded without " + "layout-safe inversion"); + EXPECT(f->blocks[jump_ft].ninsts == 1, + "fallthrough jump block should remain reachable"); + EXPECT(f->emit_order_n >= 2 && f->emit_order[1] == jump_ft, + "fallthrough jump block should remain physical next block"); + tc_fini(&tc); +} + +static void opt_jump_cleanup_layout_deletes_fallthrough_branch(void) { + TestCtx tc; + tc_init(&tc); + Func* f = new_func(&tc); + u32 b0 = f->entry; + u32 b1 = ir_block_new(f); + ir_note_emit(f, b1); + + Val v = add_val(f, tc.i32); + emit_br_to(f, b0, b1); + emit_load_imm(f, b1, v, tc.i32, 7); + emit_ret_val(f, b1, v, tc.i32); + + opt_build_cfg(f); + opt_jump_cleanup(f, OPT_JUMP_CLEANUP_LAYOUT); + + EXPECT(count_op(f, IR_BR) == 0, + "layout cleanup should delete branch to physical fallthrough"); + EXPECT(f->blocks[b0].nsucc == 1 && f->blocks[b0].succ[0] == b1, + "layout cleanup should leave successor metadata for final emit"); + tc_fini(&tc); +} + static void opt_loop_tree_excludes_side_exit(void) { TestCtx tc; tc_init(&tc); @@ -2336,6 +2462,10 @@ static void simple_regalloc_reports_exact_used_regs(void) { int main(void) { opt_cfg_prunes_unreachable(); opt_cfg_preserves_scope_edges(); + opt_jump_cleanup_forwards_branch_targets(); + opt_jump_cleanup_inverts_to_remove_jump_block(); + opt_jump_cleanup_keeps_conditional_fallthrough_block(); + opt_jump_cleanup_layout_deletes_fallthrough_branch(); opt_loop_tree_excludes_side_exit(); opt_loop_tree_nested_depths(); opt_loop_tree_does_not_mutate_cfg();