kit

git clone https://git.ryansepassi.com/git/kit.git

commit d69b41a89baaf03c6cb9a781705b401cfcb41c8e
parent 98c440ab42655f8883d95a4d127341a27c4b75da
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Fri, 15 May 2026 17:15:56 -0700

docs: plan richer opt register constraints

Diffstat:
M doc/OPT1.md | 6 +++++-
A doc/OPT_REGS_CALL_PLAN.md | 424 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 429 insertions(+), 1 deletion(-)

diff --git a/doc/OPT1.md b/doc/OPT1.md
@@ -292,4 +292,8 @@
   fast tier: O1 now reserves/preserves only replay-visible hard registers after
   final cleanup. Remaining work is mostly coalescing/argument-copy quality,
   pressure-sensitive choices, and safely broadening caller-saved allocation for
-  values that are not live across calls.
+  values that are not live across calls. The larger structural plan is in
+  `doc/OPT_REGS_CALL_PLAN.md`: targets expose richer physical-register
+  metadata, calls become opt-visible fixed-register/clobber constraints, and
+  call argument setup uses parallel copies so all three backends can safely
+  expose nearly all allocatable registers.
diff --git a/doc/OPT_REGS_CALL_PLAN.md b/doc/OPT_REGS_CALL_PLAN.md
@@ -0,0 +1,424 @@
+# OPT Register And Call Constraint Plan
+
+This plan expands the O1 register-allocation contract so the optimizer can use
+nearly all target registers safely. It combines two structural changes:
+
+1. targets expose a richer physical register file instead of a small pre-filtered
+   allocable pool; and
+2. calls are lowered into opt-visible fixed-register, stack-argument, and clobber
+   constraints before register allocation.
+
+The goal is not to move to a full target machine IR immediately. The goal is to
+make the current O1 path honest about target constraints while keeping replay and
+backend emission intact enough to migrate one architecture at a time.
+
+## Current Problem
+
+The current `CGTarget` contract exposes:
+
+- `get_allocable_regs`;
+- `get_scratch_regs`;
+- `is_caller_saved`;
+- `plan_hard_regs` / `reserve_hard_regs`;
+- `call_stack_size`.
+
+That contract is too coarse for optimizer-driven allocation. A target has to hide
+registers that are perfectly usable in most instructions because they are unsafe
+for some call-lowering or helper-lowering cases.
+
+Examples:
+
+- ABI argument and return registers are useful for short-lived values, but current
+  call emitters copy arguments sequentially into those same registers.
+- scratch registers are hidden globally even when only a small subset of target
+  operations need them.
+- callee-saved registers are cheap for values live across calls but expensive for
+  one-use temporaries in leaf or tiny functions.
+- the allocator can avoid caller-saved registers for call-crossing values, but it
+  has no target-provided save/restore cost or call-specific clobber masks.
+
+The result is conservative and correct, but it forces unnecessary prologue and
+epilogue traffic in small O1 functions.
+
+## Design Goals
+
+- Keep O1 fast and range-based.
+- Let each target expose all general allocatable physical registers, excluding
+  only permanently reserved registers such as stack pointer, frame pointer when
+  fixed, zero registers, platform registers, and architectural non-registers.
+- Make ABI argument, return, and call-clobber effects explicit before liveness and
+  allocation.
+- Make call argument moves parallel rather than sequential.
+- Preserve the existing backend ownership of final prologue, epilogue, frame
+  layout, and machine-code emission during this migration.
+- Avoid target-specific register knowledge in opt beyond data supplied by the
+  target.
+- Keep direct CG usable while opt grows the richer contract.
+
+Non-goals for this plan:
+
+- full machine IR;
+- global coalescing;
+- live-range splitting;
+- instruction scheduling;
+- target-specific peephole rewrites beyond the call boundary.
+
+## New Target Register Contract
+
+Add a register-file description that replaces the allocation-policy meaning of
+`get_allocable_regs`. The old hook can remain as a compatibility wrapper during
+migration.
+
+```c
+typedef enum CGPhysRegFlag {
+  CG_REG_ALLOCABLE = 1u << 0,
+  CG_REG_CALLER_SAVED = 1u << 1,
+  CG_REG_CALLEE_SAVED = 1u << 2,
+  CG_REG_ARG = 1u << 3,
+  CG_REG_RET = 1u << 4,
+  CG_REG_TEMP_PREFERRED = 1u << 5,
+  CG_REG_PLATFORM = 1u << 6,
+  CG_REG_RESERVED = 1u << 7,
+} CGPhysRegFlag;
+
+typedef struct CGPhysRegInfo {
+  Reg reg;
+  u8 cls;        /* RegClass */
+  u8 abi_index;  /* arg/ret order when applicable, otherwise 0xff */
+  u16 flags;     /* CGPhysRegFlag */
+  u16 save_cost; /* relative prologue/epilogue cost if callee-saved */
+  u16 use_cost;  /* relative preference cost for ordinary allocation */
+} CGPhysRegInfo;
+```
+
+New target hooks:
+
+```c
+void (*get_phys_regs)(CGTarget*, RegClass, const CGPhysRegInfo** out,
+                      u32* nregs);
+u32 (*call_clobber_mask)(CGTarget*, const CGCallDesc*, RegClass);
+u32 (*return_reg_mask)(CGTarget*, const ABIFuncInfo*, RegClass);
+u32 (*callee_save_mask)(CGTarget*, RegClass);
+```
+
+The exact masks may need to grow beyond `u32` if future architectures expose
+larger register files, but `u32` matches the current register numbering model and
+keeps this step consistent with existing code.
+
+Target policy:
+
+- AArch64 should expose normal integer allocation candidates from `x0-x28`,
+  excluding `sp`, `x29`, `x30`, and platform-reserved registers as needed. `x16`
+  and `x17` can be marked temp-preferred or reserved until helper scratch
+  clobbers are modeled.
+- AArch64 FP should expose `v0-v31`, reserving only registers that target helper
+  expansion still requires globally.
+- x64 should expose caller-saved and callee-saved GPRs except fixed `rsp/rbp` and
+  any helper-reserved registers still hidden during migration. It should expose
+  XMM registers with SysV all-caller-saved metadata.
+- RV64 should expose `a*`, `t*`, `s*`, and `f*` equivalents, excluding `sp`,
+  fixed `s0` when used as frame pointer, `ra` unless explicitly modeled, `gp`,
+  `tp`, and zero.
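As a rough illustration of the register contract, the sketch below fills a `CGPhysRegInfo` table for a made-up four-register integer class and derives per-class masks from the flags. The typedefs, the register numbering, the `toy_int_regs` table, and the `mask_with_flags` helper are all assumptions for illustration; none of them come from the kit sources, and only a subset of the flags is repeated here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed typedefs; the project's real definitions may differ. */
typedef uint8_t u8;
typedef uint16_t u16;
typedef uint32_t u32;
typedef u32 Reg;

/* Subset of the CGPhysRegFlag values from the plan. */
enum {
  CG_REG_ALLOCABLE = 1u << 0,
  CG_REG_CALLER_SAVED = 1u << 1,
  CG_REG_CALLEE_SAVED = 1u << 2,
  CG_REG_ARG = 1u << 3,
  CG_REG_RET = 1u << 4,
  CG_REG_RESERVED = 1u << 7,
};

typedef struct CGPhysRegInfo {
  Reg reg;
  u8 cls;
  u8 abi_index;
  u16 flags;
  u16 save_cost;
  u16 use_cost;
} CGPhysRegInfo;

/* Hypothetical four-register integer class, not a real target. */
const CGPhysRegInfo toy_int_regs[4] = {
  {.reg = 0, .cls = 0, .abi_index = 0,
   .flags = CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_ARG | CG_REG_RET,
   .save_cost = 0, .use_cost = 1},
  {.reg = 1, .cls = 0, .abi_index = 1,
   .flags = CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_ARG,
   .save_cost = 0, .use_cost = 1},
  {.reg = 2, .cls = 0, .abi_index = 0xff,
   .flags = CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED,
   .save_cost = 4, .use_cost = 1},
  {.reg = 3, .cls = 0, .abi_index = 0xff,
   .flags = CG_REG_RESERVED, .save_cost = 0, .use_cost = 0},
};

/* Collect a bitmask of registers whose flags include every bit in `want`. */
u32 mask_with_flags(const CGPhysRegInfo *regs, size_t n, u16 want) {
  u32 mask = 0;
  for (size_t i = 0; i < n; i++) {
    assert(regs[i].reg < 32); /* masks are u32 in this plan */
    if ((regs[i].flags & want) == want) mask |= 1u << regs[i].reg;
  }
  return mask;
}
```

With a table like this, a compatibility wrapper for `get_allocable_regs` or `callee_save_mask` reduces to one call with the matching flag combination, which is one way the old hooks could stay as thin wrappers during migration.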
+
+## Opt Register Policy
+
+`opt_machinize` should build per-class register tables from `CGPhysRegInfo`:
+
+- physical register list;
+- caller-saved mask;
+- callee-saved mask;
+- reserved mask;
+- argument mask;
+- return mask;
+- save/use costs.
+
+The O1 allocator should keep its interval assignment model, but candidate
+register scoring should change from pure target order to a target-informed cost:
+
+```text
+base use cost
++ callee-save open cost if this function has not already used that reg
++ caller-save crossing cost if value is live across calls
++ fixed/tied penalty rules
++ spill/reload alternative cost
+```
+
+Hard requirements:
+
+- values live across a call may use caller-saved registers only if rewrite can
+  preserve them at that call;
+- non-call-crossing values should generally prefer caller-saved registers to
+  avoid function-wide callee-save traffic;
+- once a callee-saved register is already used in the function, later allocations
+  may treat its save cost as already paid;
+- tied/fixed registers from ABI lowering and inline asm remain mandatory.
+
+This can land without a global coalescer. It gives the current allocator enough
+information to make better choices while preserving its O1 compile-time shape.
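The first three terms of the scoring formula can be sketched as a small cost function. This is only a model under stated assumptions: `RegCand`, `score_candidate`, `pick_candidate`, and the per-call `crossing_cost` parameter are hypothetical names, and the fixed/tied penalty and the spill/reload alternative are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t u16;
typedef uint32_t u32;

/* Flag values as listed in CGPhysRegFlag. */
enum { CG_REG_CALLER_SAVED = 1u << 1, CG_REG_CALLEE_SAVED = 1u << 2 };

/* Hypothetical trimmed-down view of one allocation candidate. */
typedef struct RegCand {
  u32 reg;
  u16 flags;
  u16 save_cost;
  u16 use_cost;
} RegCand;

/* Cost of putting one interval in candidate `c`.
 * used_callee_saved: callee-saved registers this function already uses.
 * calls_crossed:     number of calls the interval is live across.
 * crossing_cost:     assumed cost of one save/restore pair around a call. */
u32 score_candidate(const RegCand *c, u32 used_callee_saved,
                    u32 calls_crossed, u32 crossing_cost) {
  assert(c->reg < 32);
  u32 cost = c->use_cost;
  /* The callee-save open cost is paid once per function, not per value. */
  if ((c->flags & CG_REG_CALLEE_SAVED) &&
      !(used_callee_saved & (1u << c->reg)))
    cost += c->save_cost;
  /* A caller-saved register must be preserved at every crossed call. */
  if (c->flags & CG_REG_CALLER_SAVED)
    cost += calls_crossed * crossing_cost;
  return cost;
}

/* Return the index of the cheapest candidate; ties keep target order. */
size_t pick_candidate(const RegCand *cands, size_t n, u32 used_callee_saved,
                      u32 calls_crossed, u32 crossing_cost) {
  assert(n > 0);
  size_t best = 0;
  u32 best_cost =
      score_candidate(&cands[0], used_callee_saved, calls_crossed, crossing_cost);
  for (size_t i = 1; i < n; i++) {
    u32 cost =
        score_candidate(&cands[i], used_callee_saved, calls_crossed, crossing_cost);
    if (cost < best_cost) {
      best = i;
      best_cost = cost;
    }
  }
  return best;
}
```

Breaking ties toward the earlier candidate keeps the existing target order as the default, so a target that reports uniform costs would see the same assignments as today.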
+
+## Opt-Visible Call Plan
+
+Add a target hook that converts a `CGCallDesc` into a call plan before liveness
+and allocation:
+
+```c
+typedef enum CGCallPlanLocKind {
+  CG_CALL_PLAN_REG,
+  CG_CALL_PLAN_STACK,
+  CG_CALL_PLAN_IGNORE,
+} CGCallPlanLocKind;
+
+typedef struct CGCallPlanMove {
+  Operand src;      /* virtual value, local, indirect, imm, or global */
+  u8 dst_kind;      /* CGCallPlanLocKind */
+  u8 cls;           /* RegClass for register destinations */
+  Reg dst_reg;      /* valid for CG_CALL_PLAN_REG */
+  u32 stack_offset; /* valid for CG_CALL_PLAN_STACK */
+  MemAccess mem;    /* width/sign for loads/stores */
+} CGCallPlanMove;
+
+typedef struct CGCallPlanRet {
+  Operand dst; /* virtual destination in current IR */
+  u8 cls;
+  Reg src_reg;
+  MemAccess mem;
+} CGCallPlanRet;
+
+typedef struct CGCallPlan {
+  CGCallPlanMove* args;
+  u32 nargs;
+  CGCallPlanRet* rets;
+  u32 nrets;
+  Operand callee;
+  u32 clobber_mask[OPT_REG_CLASSES];
+  u32 return_mask[OPT_REG_CLASSES];
+  u32 stack_arg_size;
+  u8 variadic_fp_count;
+  u8 has_sret;
+} CGCallPlan;
+```
+
+Target hook:
+
+```c
+void (*plan_call)(CGTarget*, const CGCallDesc*, CGCallPlan* out);
+```
+
+The target remains the authority for ABI classification and stack layout. Opt
+becomes the authority for scheduling the moves and preserving live values around
+the call.
+
+Lowering shape:
+
+```text
+CALL_SETUP_BEGIN
+parallel copies: virtual/local/imm -> ABI arg regs or outgoing stack slots
+CALL target, implicit uses arg regs, implicit defs return regs,
+     implicit clobbers call clobber mask
+parallel copies: ABI return regs -> virtual/local destinations
+CALL_SETUP_END
+```
+
+This can be represented either as new IR ops or as expanded existing `IR_COPY`,
+`IR_STORE`, and `IR_CALL` ops with call aux data carrying implicit masks. The new
+IR-op route is clearer and easier to test.
+
+## Parallel Move Resolver
+
+Call argument setup must not use the current sequential backend copy model once
+ABI registers become allocable.
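To make the hazard concrete, the sketch below models registers as array slots and sequentializes a set of register-to-register parallel copies using one spare temporary. It is a textbook cycle-breaking scheme under simplifying assumptions, not the resolver this plan will ship: `Move`, `apply_moves`, and `resolve_parallel_moves` are hypothetical names, destinations are assumed unique, and the temporary is assumed not to appear as any source or destination.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t u32;

typedef struct Move {
  u32 src; /* register read */
  u32 dst; /* register written */
} Move;

/* Execute moves one after another on a simulated register file. */
void apply_moves(u32 *regs, const Move *moves, size_t n) {
  for (size_t i = 0; i < n; i++) regs[moves[i].dst] = regs[moves[i].src];
}

/* True if some pending move still reads register `r`. */
static bool is_pending_source(const Move *pending, size_t n, u32 r) {
  for (size_t i = 0; i < n; i++)
    if (pending[i].src == r) return true;
  return false;
}

/* Turn parallel copies into a safe sequential order.
 * `pending` is modified in place; `out` needs room for 2 * n moves.
 * Returns the number of moves written to `out`. */
size_t resolve_parallel_moves(Move *pending, size_t n, u32 tmp, Move *out) {
  size_t nout = 0;
  size_t kept = 0;
  /* Drop self-moves; they are no-ops and would look like cycles. */
  for (size_t i = 0; i < n; i++)
    if (pending[i].src != pending[i].dst) pending[kept++] = pending[i];
  n = kept;
  while (n > 0) {
    bool progressed = false;
    for (size_t i = 0; i < n; i++) {
      /* A move is safe once nothing pending still reads its destination. */
      if (!is_pending_source(pending, n, pending[i].dst)) {
        out[nout++] = pending[i];
        pending[i] = pending[--n];
        progressed = true;
        break;
      }
    }
    if (progressed) continue;
    /* Only cycles remain: park one source in tmp and redirect its readers. */
    u32 saved = pending[0].src;
    out[nout++] = (Move){.src = saved, .dst = tmp};
    for (size_t i = 0; i < n; i++)
      if (pending[i].src == saved) pending[i].src = tmp;
  }
  return nout;
}
```

For `f(b, a)` with `a` in r0 and `b` in r1, running the two copies sequentially leaves `b` in both registers, while the resolved sequence uses three moves through the temporary and produces the swap. Stack slots, immediates, and memory sources would need additional ordering rules that this register-only model does not cover.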
+
+Add a generic opt pass for parallel copies:
+
+- inputs are `(src operand, dst operand)` pairs;
+- destinations may be physical registers or stack argument slots;
+- sources may be virtual/hard registers, locals, indirect operands, immediates,
+  or globals;
+- cycles are broken with a target-provided temporary register or spill slot;
+- memory-to-memory copies route through a temporary;
+- stack stores are ordered after any register loads that depend on stack source
+  addresses they could overwrite.
+
+For O1, this resolver can be local to call setup and return extraction. It does
+not need to become a general coalescing pass in the first implementation.
+
+## Rewrite And Preservation
+
+The current rewrite inserts stores/loads for hard-assigned caller-saved values
+known to be live across calls. With call plans, this should become:
+
+- for each call, compute values live across the call;
+- intersect their assigned hard registers with the call plan's clobber mask;
+- exclude values defined by the call return;
+- emit save/restore only for that call-specific intersection.
+
+This keeps preservation precise for:
+
+- direct calls;
+- indirect calls;
+- varargs calls;
+- target-specific helper calls if they use a different clobber mask later.
+
+The allocator should still use live-across-call frequency, but correctness should
+come from per-call clobber masks in rewrite.
+
+## Backend Emission Changes
+
+Backends should gain emission hooks for an already-planned call:
+
+```c
+void (*emit_call_plan)(CGTarget*, const CGCallPlan*);
+```
+
+For the transition, this hook should assume arg registers and outgoing stack
+slots have already been materialized by opt. It only emits:
+
+- required varargs metadata such as x64 `AL`;
+- direct or indirect call branch;
+- target-specific call relocation;
+- no sequential argument copies;
+- no return copies.
+
+Direct CG can keep using the existing `call` hook until it is migrated or until a
+thin wrapper builds and emits a call plan internally.
+
+## Migration Phases
+
+### Phase 1 - Register Description Without Behavior Change
+
+- Add `CGPhysRegInfo` and `get_phys_regs`.
+- Implement it for x64, AArch64, and RV64 using the current exposed pools first.
+- Build opt's current hard-reg tables from the richer description.
+- Keep `get_allocable_regs`, `get_scratch_regs`, and `is_caller_saved` as
+  wrappers.
+- Add tests that inspect target register metadata for each architecture.
+
+Expected result: no codegen behavior change.
+
+### Phase 2 - Call Plan Construction
+
+- Add `CGCallPlan` and `plan_call`.
+- Implement call planning for simple direct scalar integer and FP args/returns on
+  all three architectures.
+- Keep backend `call` emission unchanged.
+- Add dump tests that verify planned arg regs, return regs, clobber masks, and
+  outgoing stack size.
+
+Expected result: opt can see call constraints, but does not allocate differently
+yet.
+
+### Phase 3 - Opt IR Call Constraints
+
+- Lower `IR_CALL` into opt-visible call setup, constrained call, and return-copy
+  representation during `machinize`.
+- Teach liveness/range building that call ops have implicit register uses,
+  implicit return defs, and clobber masks.
+- Keep the old path behind a fallback for unsupported call-plan shapes.
+- Add tests for values occupying ABI arg registers before call setup.
+
+Expected result: correctness coverage for arg-register hazards before the
+allocator starts using those registers widely.
+
+### Phase 4 - Parallel Copy Resolver
+
+- Implement local parallel move resolution for call setup and return extraction.
+- Support register-register cycles, register-stack moves, local/indirect loads,
+  immediates, and stack arguments.
+- Use target-provided temporary policy first; later this can use per-instruction
+  temp allocation.
+- Add red-green tests for argument permutation hazards:
+  - `f(b, a)` where `a` and `b` are already in the opposite ABI registers;
+  - indirect callee held in an argument register;
+  - return register also used by a live pre-call value;
+  - stack arguments sourced from registers that are also call destinations.
+
+Expected result: ABI arg and return registers can be made allocable safely.
+
+### Phase 5 - Broaden Register Exposure
+
+- Expand target `get_phys_regs` tables to include nearly all allocable physical
+  registers.
+- Update opt scoring to prefer caller-saved regs for non-call-crossing values and
+  callee-saved regs for call-crossing values.
+- Keep known backend helper scratch registers reserved until their clobbers are
+  expressed.
+- Add code-shape tests for direct-call tiny functions and unused-param functions
+  across x64, AArch64, and RV64.
+
+Expected result: fewer callee-save prologue/epilogue pairs without sacrificing
+call correctness.
+
+### Phase 6 - Remove Legacy Pool Semantics
+
+- Convert direct CG to either use `CGPhysRegInfo` or build call plans internally.
+- Remove allocation-policy dependence on `get_allocable_regs`.
+- Restrict `get_scratch_regs` to legacy direct-CG fallback, then remove it once
+  backend helper clobbers are modeled.
+- Make `reserve_hard_regs` consume actual replay-visible hard registers as it
+  does today, but derive preservation decisions from the richer register metadata.
+
+Expected result: one target register contract serves direct CG, opt, and future
+O2 allocation.
+
+## Test Plan
+
+Focused unit tests:
+
+- target register metadata per architecture;
+- call-plan layout for scalar, FP, mixed, sret, variadic, and stack-arg calls;
+- call clobber masks;
+- parallel-copy cycles and memory routing;
+- caller-saved live-across-call preservation using per-call masks;
+- callee-save reservation after broadened allocation.
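The per-call preservation item in the list above could be checked against a small model like the following sketch. It handles a single register class, and the `LiveValue` record and `call_save_mask` helper are hypothetical names chosen for illustration; the real rewrite pass would read liveness and assignments from opt's own data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t u32;

/* Hypothetical summary of one value at one call site. */
typedef struct LiveValue {
  u32 hard_reg;         /* assigned physical register */
  bool live_across;     /* live both before and after the call */
  bool defined_by_call; /* receives the call's return value */
} LiveValue;

/* Registers that need a save/restore pair around this particular call:
 * hard registers of values live across the call, minus values the call
 * itself defines, intersected with the call's clobber mask. */
u32 call_save_mask(const LiveValue *vals, size_t n, u32 clobber_mask) {
  u32 live = 0;
  for (size_t i = 0; i < n; i++) {
    assert(vals[i].hard_reg < 32);
    if (!vals[i].live_across || vals[i].defined_by_call) continue;
    live |= 1u << vals[i].hard_reg;
  }
  return live & clobber_mask;
}
```

Because the clobber mask is an argument rather than a per-function constant, the same helper covers a normal call that clobbers every caller-saved register and a helper call that clobbers only one or two.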
+
+Code-shape probes:
+
+- `int f(int x) { return 42; }`;
+- `static int callee(int x) { return x + 1; }`
+  plus `int caller(int x) { return callee(x) + 2; }`;
+- multiple non-call-crossing locals under pressure;
+- one value live across a call plus several short-lived call-local values;
+- FP argument and return variants.
+
+Targeted runs:
+
+```sh
+make test-opt
+make test-cg-api
+make test-toy
+make test-aa64-inline
+make test-smoke-x64
+make test-smoke-rv64
+```
+
+## Risks And Open Questions
+
+- The current call emitters still contain target-specific scratch assumptions.
+  Those assumptions must either become call-plan constraints or stay reserved
+  until later.
+- x64 has implicit call metadata for variadic calls (`AL`) and helper scratch use
+  around memory copies; both need explicit representation.
+- AArch64 `x16/x17` and platform register policy differs by OS and relocation
+  model. The register metadata must be target-OS aware.
+- RV64 `ra`, `gp`, `tp`, `s0`, and zero should remain reserved unless the backend
+  grows explicit support for them.
+- Stack argument stores can alias frame or outgoing areas in awkward cases. The
+  call-plan stack area should remain target-owned, with opt only scheduling the
+  materialization.
+- Debug info and unwind data should continue to be backend-owned. Opt only tells
+  the backend which hard registers are actually live in emitted code.
+
+## Recommended First Patch Stack
+
+1. Add `CGPhysRegInfo` plus current-pool metadata for all three targets.
+2. Teach `opt_machinize` to consume the new metadata while preserving identical
+   allocation order.
+3. Add `CGCallPlan` and plan simple scalar calls without using it for emission.
+4. Add call-plan dump/unit tests.
+5. Lower one simple call shape through opt-visible setup/call/return constraints
+   behind a feature flag or narrow capability check.
+6. Implement parallel copies for that shape.
+7. Broaden AArch64 and RV64 caller-saved temp exposure first, then x64.
+
+This order keeps each step testable and avoids mixing API migration, allocation
+policy, and call move correctness in one change.