docs: plan richer opt register constraints - kit

commit d69b41a89baaf03c6cb9a781705b401cfcb41c8e
parent 98c440ab42655f8883d95a4d127341a27c4b75da
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Fri, 15 May 2026 17:15:56 -0700

docs: plan richer opt register constraints

Diffstat:
M doc/OPT1.md  | 6 +++++-
A doc/OPT_REGS_CALL_PLAN.md  | 424 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

2 files changed, 429 insertions(+), 1 deletion(-)
diff --git a/doc/OPT1.md b/doc/OPT1.md
@@ -292,4 +292,8 @@ fast tier:
    O1 now reserves/preserves only replay-visible hard registers after final
    cleanup. Remaining work is mostly coalescing/argument-copy quality,
    pressure-sensitive choices, and safely broadening caller-saved allocation for
-   values that are not live across calls.
+   values that are not live across calls. The larger structural plan is in
+   `doc/OPT_REGS_CALL_PLAN.md`: targets expose richer physical-register
+   metadata, calls become opt-visible fixed-register/clobber constraints, and
+   call argument setup uses parallel copies so all three backends can safely
+   expose nearly all allocatable registers.
diff --git a/doc/OPT_REGS_CALL_PLAN.md b/doc/OPT_REGS_CALL_PLAN.md
@@ -0,0 +1,424 @@
+# OPT Register And Call Constraint Plan
+
+This plan expands the O1 register-allocation contract so the optimizer can use
+nearly all target registers safely. It combines two structural changes:
+
+1. targets expose a richer physical register file instead of a small pre-filtered
+   allocable pool; and
+2. calls are lowered into opt-visible fixed-register, stack-argument, and clobber
+   constraints before register allocation.
+
+The goal is not to move to a full target machine IR immediately. The goal is to
+make the current O1 path honest about target constraints while keeping replay and
+backend emission intact enough to migrate one architecture at a time.
+
+## Current Problem
+
+The current `CGTarget` contract exposes:
+
+- `get_allocable_regs`;
+- `get_scratch_regs`;
+- `is_caller_saved`;
+- `plan_hard_regs` / `reserve_hard_regs`;
+- `call_stack_size`.
+
+That contract is too coarse for optimizer-driven allocation. A target has to hide
+registers that are perfectly usable in most instructions because they are unsafe
+for some call-lowering or helper-lowering cases.
+
+Examples:
+
+- ABI argument and return registers are useful for short-lived values, but current
+  call emitters copy arguments sequentially into those same registers.
+- scratch registers are hidden globally even when only a small subset of target
+  operations need them.
+- callee-saved registers are cheap for values live across calls but expensive for
+  one-use temporaries in leaf or tiny functions.
+- the allocator can avoid caller-saved registers for call-crossing values, but it
+  has no target-provided save/restore cost or call-specific clobber masks.
+
+The result is conservative and correct, but it forces unnecessary prologue and
+epilogue traffic in small O1 functions.
+
+## Design Goals
+
+- Keep O1 fast and range-based.
+- Let each target expose all general allocatable physical registers, excluding
+  only permanently reserved registers such as stack pointer, frame pointer when
+  fixed, zero registers, platform registers, and architectural non-registers.
+- Make ABI argument, return, and call-clobber effects explicit before liveness and
+  allocation.
+- Make call argument moves parallel rather than sequential.
+- Preserve the existing backend ownership of final prologue, epilogue, frame
+  layout, and machine-code emission during this migration.
+- Avoid target-specific register knowledge in opt beyond data supplied by the
+  target.
+- Keep direct CG usable while opt grows the richer contract.
+
+Non-goals for this plan:
+
+- full machine IR;
+- global coalescing;
+- live-range splitting;
+- instruction scheduling;
+- target-specific peephole rewrites beyond the call boundary.
+
+## New Target Register Contract
+
+Add a register-file description that replaces the allocation-policy meaning of
+`get_allocable_regs`. The old hook can remain as a compatibility wrapper during
+migration.
+
+```c
+typedef enum CGPhysRegFlag {
+  CG_REG_ALLOCABLE       = 1u << 0,
+  CG_REG_CALLER_SAVED    = 1u << 1,
+  CG_REG_CALLEE_SAVED    = 1u << 2,
+  CG_REG_ARG             = 1u << 3,
+  CG_REG_RET             = 1u << 4,
+  CG_REG_TEMP_PREFERRED  = 1u << 5,
+  CG_REG_PLATFORM        = 1u << 6,
+  CG_REG_RESERVED        = 1u << 7,
+} CGPhysRegFlag;
+
+typedef struct CGPhysRegInfo {
+  Reg reg;
+  u8 cls;        /* RegClass */
+  u8 abi_index;  /* arg/ret order when applicable, otherwise 0xff */
+  u16 flags;     /* CGPhysRegFlag */
+  u16 save_cost; /* relative prologue/epilogue cost if callee-saved */
+  u16 use_cost;  /* relative preference cost for ordinary allocation */
+} CGPhysRegInfo;
+```
+
+New target hooks:
+
+```c
+void (*get_phys_regs)(CGTarget*, RegClass, const CGPhysRegInfo** out,
+                      u32* nregs);
+u32 (*call_clobber_mask)(CGTarget*, const CGCallDesc*, RegClass);
+u32 (*return_reg_mask)(CGTarget*, const ABIFuncInfo*, RegClass);
+u32 (*callee_save_mask)(CGTarget*, RegClass);
+```
+
+The exact masks may need to grow beyond `u32` if future architectures expose
+larger register files, but `u32` matches the current register numbering model and
+keeps this step consistent with existing code.
+
+Target policy:
+
+- AArch64 should expose normal integer allocation candidates from `x0-x28`,
+  minus platform-reserved registers as needed; `sp`, `x29`, and `x30` stay
+  reserved. `x16` and `x17` can be marked temp-preferred or reserved until
+  helper scratch clobbers are modeled.
+- AArch64 FP should expose `v0-v31`, reserving only registers that target helper
+  expansion still requires globally.
+- x64 should expose caller-saved and callee-saved GPRs except fixed `rsp/rbp` and
+  any helper-reserved registers still hidden during migration. It should expose
+  XMM registers with SysV all-caller-saved metadata.
+- RV64 should expose `a*`, `t*`, `s*`, and `f*` equivalents, excluding `sp`,
+  fixed `s0` when used as frame pointer, `ra` unless explicitly modeled, `gp`,
+  `tp`, and zero.
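A minimal sketch of how the RV64 integer policy above could be expressed as data, assuming the standard RISC-V register numbering (`x0`-`x31`) and the standard calling convention, with `s0` fixed as the frame pointer. The types are local stand-ins for the plan's `CGPhysRegInfo`; `rv64_int_reg_info`, `rv64_count_with_flags`, and the cost values are hypothetical and not part of this commit.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the plan's types; flag values mirror CGPhysRegFlag. */
typedef uint8_t Reg;
enum {
  CG_REG_ALLOCABLE = 1u << 0, CG_REG_CALLER_SAVED = 1u << 1,
  CG_REG_CALLEE_SAVED = 1u << 2, CG_REG_ARG = 1u << 3,
  CG_REG_RET = 1u << 4, CG_REG_RESERVED = 1u << 7,
};
typedef struct CGPhysRegInfo {
  Reg reg;
  uint8_t cls, abi_index;
  uint16_t flags, save_cost, use_cost;
} CGPhysRegInfo;

/* Hypothetical helper: describe RV64 integer register x<n> (0..31). */
static CGPhysRegInfo rv64_int_reg_info(unsigned n) {
  CGPhysRegInfo r = { (Reg)n, 0, 0xff, 0, 0, 1 };
  assert(n < 32);
  if (n <= 4 || n == 8) {
    /* zero, ra, sp, gp, tp, and s0 (assumed fixed frame pointer) */
    r.flags = CG_REG_RESERVED;
  } else if (n >= 10 && n <= 17) {             /* a0-a7 */
    r.flags = CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED | CG_REG_ARG;
    r.abi_index = (uint8_t)(n - 10);
    if (n <= 11) r.flags |= CG_REG_RET;        /* a0-a1 also hold returns */
  } else if (n == 9 || (n >= 18 && n <= 27)) { /* s1, s2-s11 */
    r.flags = CG_REG_ALLOCABLE | CG_REG_CALLEE_SAVED;
    r.save_cost = 2;                           /* assumed: one store, one load */
  } else {                                     /* t0-t2, t3-t6 */
    r.flags = CG_REG_ALLOCABLE | CG_REG_CALLER_SAVED;
  }
  return r;
}

/* Count registers in x0..x31 that carry every bit in `flags`. */
static unsigned rv64_count_with_flags(uint16_t flags) {
  unsigned n, count = 0;
  for (n = 0; n < 32; n++)
    if ((rv64_int_reg_info(n).flags & flags) == flags) count++;
  return count;
}
```

Under these assumptions 26 of the 32 integer registers are allocable: 15 caller-saved and 11 callee-saved, with 6 reserved.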
+
+## Opt Register Policy
+
+`opt_machinize` should build per-class register tables from `CGPhysRegInfo`:
+
+- physical register list;
+- caller-saved mask;
+- callee-saved mask;
+- reserved mask;
+- argument mask;
+- return mask;
+- save/use costs.
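The table construction listed above might look roughly like the following. This is a sketch only: `OptRegTable` and `opt_build_reg_table` are hypothetical names, the struct is a local copy of the plan's `CGPhysRegInfo`, and masks are `u32` as in the proposed hooks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for the plan's types; flag values mirror CGPhysRegFlag. */
enum {
  CG_REG_ALLOCABLE = 1u << 0, CG_REG_CALLER_SAVED = 1u << 1,
  CG_REG_CALLEE_SAVED = 1u << 2, CG_REG_ARG = 1u << 3,
  CG_REG_RET = 1u << 4, CG_REG_RESERVED = 1u << 7,
};
typedef struct CGPhysRegInfo {
  uint8_t reg, cls, abi_index;
  uint16_t flags, save_cost, use_cost;
} CGPhysRegInfo;

/* Hypothetical per-class table derived once from the target description. */
typedef struct OptRegTable {
  uint32_t allocable, caller_saved, callee_saved, reserved, arg, ret;
} OptRegTable;

static OptRegTable opt_build_reg_table(const CGPhysRegInfo *regs, size_t n) {
  OptRegTable t = {0, 0, 0, 0, 0, 0};
  size_t i;
  for (i = 0; i < n; i++) {
    uint16_t f = regs[i].flags;
    uint32_t bit;
    assert(regs[i].reg < 32); /* masks are u32 in this sketch */
    bit = (uint32_t)1 << regs[i].reg;
    if (f & CG_REG_RESERVED) t.reserved |= bit;
    /* A reserved register is never an allocation candidate. */
    if ((f & CG_REG_ALLOCABLE) && !(f & CG_REG_RESERVED)) t.allocable |= bit;
    if (f & CG_REG_CALLER_SAVED) t.caller_saved |= bit;
    if (f & CG_REG_CALLEE_SAVED) t.callee_saved |= bit;
    if (f & CG_REG_ARG) t.arg |= bit;
    if (f & CG_REG_RET) t.ret |= bit;
  }
  return t;
}
```

Keeping the masks as plain bit sets lets later steps (call clobber intersection, candidate filtering) stay as single AND operations, which fits the O1 compile-time budget.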
+
+The O1 allocator should keep its interval assignment model, but candidate
+register scoring should change from pure target order to a target-informed cost:
+
+```text
+base use cost
++ callee-save open cost if this function has not already used that reg
++ caller-save crossing cost if value is live across calls
++ fixed/tied penalty rules
++ spill/reload alternative cost
+```
+
+Hard requirements:
+
+- values live across a call may use caller-saved registers only if rewrite can
+  preserve them at that call;
+- non-call-crossing values should generally prefer caller-saved registers to
+  avoid function-wide callee-save traffic;
+- once a callee-saved register is already used in the function, later allocations
+  may treat its save cost as already paid;
+- tied/fixed registers from ABI lowering and inline asm remain mandatory.
+
+This can land without a global coalescer. It gives the current allocator enough
+information to make better choices while preserving its O1 compile-time shape.
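One way the scoring rule could be sketched, covering only the use, callee-save open, and caller-save crossing terms. Fixed/tied penalties and the spill alternative are left out. `RegCand`, `opt_reg_cost`, `opt_pick_reg`, and the `per_call_cost` parameter are hypothetical, and the sketch assumes rewrite can always preserve a caller-saved value around a call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical candidate record; costs would come from CGPhysRegInfo. */
typedef struct RegCand {
  uint8_t reg;
  uint8_t callee_saved; /* 1 if callee-saved, 0 if caller-saved */
  uint16_t use_cost;
  uint16_t save_cost;
} RegCand;

/* Cost of giving `c` to a value that is live across `calls_crossed` calls. */
static uint32_t opt_reg_cost(const RegCand *c, uint32_t opened_callee_saved,
                             uint32_t calls_crossed, uint32_t per_call_cost) {
  uint32_t cost = c->use_cost;
  assert(c->reg < 32);
  if (c->callee_saved) {
    /* The open cost is paid once per function, then treated as sunk. */
    if (!(opened_callee_saved & ((uint32_t)1 << c->reg))) cost += c->save_cost;
  } else {
    /* Assumed: one save/restore pair around each crossed call. */
    cost += calls_crossed * per_call_cost;
  }
  return cost;
}

/* Return the cheapest free candidate's register, or -1 if none is free.
   Ties go to the earlier candidate, preserving target order. */
static int opt_pick_reg(const RegCand *cands, size_t n, uint32_t free_mask,
                        uint32_t opened_callee_saved, uint32_t calls_crossed,
                        uint32_t per_call_cost) {
  int best = -1;
  uint32_t best_cost = UINT32_MAX;
  size_t i;
  for (i = 0; i < n; i++) {
    uint32_t cost;
    assert(cands[i].reg < 32);
    if (!(free_mask & ((uint32_t)1 << cands[i].reg))) continue;
    cost = opt_reg_cost(&cands[i], opened_callee_saved, calls_crossed,
                        per_call_cost);
    if (cost < best_cost) { best_cost = cost; best = cands[i].reg; }
  }
  return best;
}
```

With a caller-saved candidate and a callee-saved candidate of equal use cost, a value that crosses no calls lands in the caller-saved register, while a value that crosses enough calls to outweigh the save cost moves to the callee-saved one.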
+
+## Opt-Visible Call Plan
+
+Add a target hook that converts a `CGCallDesc` into a call plan before liveness
+and allocation:
+
+```c
+typedef enum CGCallPlanLocKind {
+  CG_CALL_PLAN_REG,
+  CG_CALL_PLAN_STACK,
+  CG_CALL_PLAN_IGNORE,
+} CGCallPlanLocKind;
+
+typedef struct CGCallPlanMove {
+  Operand src;       /* virtual value, local, indirect, imm, or global */
+  u8 dst_kind;       /* CGCallPlanLocKind */
+  u8 cls;            /* RegClass for register destinations */
+  Reg dst_reg;       /* valid for CG_CALL_PLAN_REG */
+  u32 stack_offset;  /* valid for CG_CALL_PLAN_STACK */
+  MemAccess mem;     /* width/sign for loads/stores */
+} CGCallPlanMove;
+
+typedef struct CGCallPlanRet {
+  Operand dst;       /* virtual destination in current IR */
+  u8 cls;
+  Reg src_reg;
+  MemAccess mem;
+} CGCallPlanRet;
+
+typedef struct CGCallPlan {
+  CGCallPlanMove* args;
+  u32 nargs;
+  CGCallPlanRet* rets;
+  u32 nrets;
+  Operand callee;
+  u32 clobber_mask[OPT_REG_CLASSES];
+  u32 return_mask[OPT_REG_CLASSES];
+  u32 stack_arg_size;
+  u8 variadic_fp_count;
+  u8 has_sret;
+} CGCallPlan;
+```
+
+Target hook:
+
+```c
+void (*plan_call)(CGTarget*, const CGCallDesc*, CGCallPlan* out);
+```
+
+The target remains the authority for ABI classification and stack layout. Opt
+becomes the authority for scheduling the moves and preserving live values around
+the call.
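To illustrate the division of labor, here is a deliberately reduced sketch of what a target's planning step could produce for integer-only arguments. It does not use the real `CGCallDesc` or `Operand` types: `PlanMove`, `Plan`, and `plan_int_call` are hypothetical, and the 8-byte stack slot and 16-byte outgoing-area alignment are assumptions that a real target would take from its ABI.

```c
#include <assert.h>
#include <stdint.h>

enum { PLAN_REG, PLAN_STACK }; /* simplified CGCallPlanLocKind */

typedef struct PlanMove {
  int src_vreg;          /* stand-in for the plan's Operand src */
  uint8_t dst_kind;
  uint8_t dst_reg;       /* valid for PLAN_REG */
  uint32_t stack_offset; /* valid for PLAN_STACK */
} PlanMove;

#define PLAN_MAX_ARGS 16
typedef struct Plan {
  PlanMove args[PLAN_MAX_ARGS];
  uint32_t nargs;
  uint32_t arg_reg_mask; /* registers the call implicitly uses */
  uint32_t stack_arg_size;
} Plan;

/* Assign integer args to ABI registers in order, then to stack slots. */
static void plan_int_call(const int *arg_vregs, uint32_t nargs,
                          const uint8_t *abi_arg_regs, uint32_t n_abi_regs,
                          Plan *out) {
  uint32_t i, stack = 0;
  assert(nargs <= PLAN_MAX_ARGS);
  out->nargs = nargs;
  out->arg_reg_mask = 0;
  for (i = 0; i < nargs; i++) {
    PlanMove *m = &out->args[i];
    m->src_vreg = arg_vregs[i];
    m->dst_reg = 0;
    m->stack_offset = 0;
    if (i < n_abi_regs) {
      m->dst_kind = PLAN_REG;
      m->dst_reg = abi_arg_regs[i];
      out->arg_reg_mask |= (uint32_t)1 << abi_arg_regs[i];
    } else {
      m->dst_kind = PLAN_STACK;
      m->stack_offset = stack;
      stack += 8; /* assumed: one 8-byte slot per integer argument */
    }
  }
  out->stack_arg_size = (stack + 15u) & ~15u; /* assumed 16-byte alignment */
}
```

The plan records only where each argument must end up. The order in which the moves are performed is left to opt, which is what makes the parallel-copy step possible.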
+
+Lowering shape:
+
+```text
+CALL_SETUP_BEGIN
+parallel copies: virtual/local/imm -> ABI arg regs or outgoing stack slots
+CALL target, implicit uses arg regs, implicit defs return regs,
+     implicit clobbers call clobber mask
+parallel copies: ABI return regs -> virtual/local destinations
+CALL_SETUP_END
+```
+
+This can be represented either as new IR ops or as expanded existing `IR_COPY`,
+`IR_STORE`, and `IR_CALL` ops with call aux data carrying implicit masks. The new
+IR-op route is clearer and easier to test.
+
+## Parallel Move Resolver
+
+Call argument setup must not use the current sequential backend copy model once
+ABI registers become allocable.
+
+Add a generic opt pass for parallel copies:
+
+- inputs are `(src operand, dst operand)` pairs;
+- destinations may be physical registers or stack argument slots;
+- sources may be virtual/hard registers, locals, indirect operands, immediates,
+  or globals;
+- cycles are broken with a target-provided temporary register or spill slot;
+- memory-to-memory copies route through a temporary;
+- stack stores are ordered after any loads into registers whose stack source
+  addresses they could overwrite.
+
+For O1, this resolver can be local to call setup and return extraction. It does
+not need to become a general coalescing pass in the first implementation.
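A minimal sketch of the register-to-register core of such a resolver, using the common "emit unblocked moves, break cycles through a temporary" approach. It is not taken from the codebase: `Move` and `resolve_parallel_moves` are hypothetical, and stack slots, immediates, and memory sources are omitted. It assumes destinations are distinct and that the temporary appears in no move.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Move { uint8_t dst, src; } Move; /* dst <- src, registers only */

/*
 * Sequentialize a set of parallel register moves. `pending` is consumed and
 * `out` needs room for 2*n entries. Returns the number of moves written.
 */
static size_t resolve_parallel_moves(Move *pending, size_t n, uint8_t tmp,
                                     Move *out) {
  size_t nout = 0, i, j;

  /* Drop self-moves and check the temporary precondition. */
  for (i = 0; i < n;) {
    assert(pending[i].dst != tmp && pending[i].src != tmp);
    if (pending[i].dst == pending[i].src) pending[i] = pending[--n];
    else i++;
  }

  while (n > 0) {
    int progressed = 0;
    for (i = 0; i < n && !progressed; i++) {
      int blocked = 0;
      for (j = 0; j < n; j++)
        if (j != i && pending[j].src == pending[i].dst) blocked = 1;
      if (!blocked) {
        out[nout++] = pending[i]; /* nobody still reads this destination */
        pending[i] = pending[--n];
        progressed = 1;
      }
    }
    if (!progressed) {
      /* Only cycles remain: park one destination's old value in tmp and
         redirect its readers there, which unblocks that destination. */
      uint8_t d = pending[0].dst;
      out[nout].dst = tmp;
      out[nout].src = d;
      nout++;
      for (j = 0; j < n; j++)
        if (pending[j].src == d) pending[j].src = tmp;
    }
  }
  return nout;
}
```

For a swap such as `f(b, a)` with `a` and `b` already in each other's ABI registers, this produces three moves through the temporary instead of two sequential copies that would lose one value. The quadratic scan is acceptable here because call argument lists are short.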
+
+## Rewrite And Preservation
+
+The current rewrite inserts stores/loads for hard-assigned caller-saved values
+known to be live across calls. With call plans, this should become:
+
+- for each call, compute values live across the call;
+- intersect their assigned hard registers with the call plan's clobber mask;
+- exclude values defined by the call return;
+- emit save/restore only for that call-specific intersection.
+
+This keeps preservation precise for:
+
+- direct calls;
+- indirect calls;
+- varargs calls;
+- target-specific helper calls if they use a different clobber mask later.
+
+The allocator should still use live-across-call frequency as a scoring input,
+but correctness should come from the per-call clobber masks applied in rewrite.
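The per-call intersection described in the list above reduces to a few mask operations. A sketch, with `CallLiveVal` and `opt_call_save_mask` as hypothetical names and `u32` masks as in the proposed hooks:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-call view of an allocated value. */
typedef struct CallLiveVal {
  uint8_t hard_reg;        /* assigned physical register */
  uint8_t live_across;     /* live both before and after this call */
  uint8_t defined_by_call; /* receives one of the call's return values */
} CallLiveVal;

/* Registers that need a save before and a restore after this one call. */
static uint32_t opt_call_save_mask(const CallLiveVal *vals, size_t n,
                                   uint32_t clobber_mask) {
  uint32_t live = 0, ret_defs = 0;
  size_t i;
  for (i = 0; i < n; i++) {
    uint32_t bit;
    assert(vals[i].hard_reg < 32);
    bit = (uint32_t)1 << vals[i].hard_reg;
    if (vals[i].live_across) live |= bit;
    if (vals[i].defined_by_call) ret_defs |= bit;
  }
  /* Preserve only what is live, actually clobbered, and not overwritten by
     the call's own return value. */
  return live & clobber_mask & ~ret_defs;
}
```

A value held in a callee-saved register never appears in the result because its bit is absent from the clobber mask, so no save/restore is emitted for it at the call site.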
+
+## Backend Emission Changes
+
+Backends should gain emission hooks for an already-planned call:
+
+```c
+void (*emit_call_plan)(CGTarget*, const CGCallPlan*);
+```
+
+For the transition, this hook should assume arg registers and outgoing stack
+slots have already been materialized by opt. It emits only:
+
+- required varargs metadata such as x64 `AL`;
+- the direct or indirect call branch;
+- any target-specific call relocation.
+
+It must not emit sequential argument copies or return copies.
+
+Direct CG can keep using the existing `call` hook until it is migrated or until a
+thin wrapper builds and emits a call plan internally.
+
+## Migration Phases
+
+### Phase 1 - Register Description Without Behavior Change
+
+- Add `CGPhysRegInfo` and `get_phys_regs`.
+- Implement it for x64, AArch64, and RV64 using the current exposed pools first.
+- Build opt's current hard-reg tables from the richer description.
+- Keep `get_allocable_regs`, `get_scratch_regs`, and `is_caller_saved` as
+  wrappers.
+- Add tests that inspect target register metadata for each architecture.
+
+Expected result: no codegen behavior change.
+
+### Phase 2 - Call Plan Construction
+
+- Add `CGCallPlan` and `plan_call`.
+- Implement call planning for simple direct scalar integer and FP args/returns on
+  all three architectures.
+- Keep backend `call` emission unchanged.
+- Add dump tests that verify planned arg regs, return regs, clobber masks, and
+  outgoing stack size.
+
+Expected result: opt can see call constraints, but does not allocate differently
+yet.
+
+### Phase 3 - Opt IR Call Constraints
+
+- Lower `IR_CALL` into opt-visible call setup, constrained call, and return-copy
+  representation during `machinize`.
+- Teach liveness/range building that call ops have implicit register uses,
+  implicit return defs, and clobber masks.
+- Keep the old path behind a fallback for unsupported call-plan shapes.
+- Add tests for values occupying ABI arg registers before call setup.
+
+Expected result: correctness coverage for arg-register hazards before the
+allocator starts using those registers widely.
+
+### Phase 4 - Parallel Copy Resolver
+
+- Implement local parallel move resolution for call setup and return extraction.
+- Support register-register cycles, register-stack moves, local/indirect loads,
+  immediates, and stack arguments.
+- Use target-provided temporary policy first; later this can use per-instruction
+  temp allocation.
+- Add red-green tests for argument permutation hazards:
+  - `f(b, a)` where `a` and `b` are already in the opposite ABI registers;
+  - indirect callee held in an argument register;
+  - return register also used by a live pre-call value;
+  - stack arguments sourced from registers that are also call destinations.
+
+Expected result: ABI arg and return registers can be made allocable safely.
+
+### Phase 5 - Broaden Register Exposure
+
+- Expand target `get_phys_regs` tables to include nearly all allocable physical
+  registers.
+- Update opt scoring to prefer caller-saved regs for non-call-crossing values and
+  callee-saved regs for call-crossing values.
+- Keep known backend helper scratch registers reserved until their clobbers are
+  expressed.
+- Add code-shape tests for direct-call tiny functions and unused-param functions
+  across x64, AArch64, and RV64.
+
+Expected result: fewer callee-save prologue/epilogue pairs without sacrificing
+call correctness.
+
+### Phase 6 - Remove Legacy Pool Semantics
+
+- Convert direct CG to either use `CGPhysRegInfo` or build call plans internally.
+- Remove allocation-policy dependence on `get_allocable_regs`.
+- Restrict `get_scratch_regs` to legacy direct-CG fallback, then remove it once
+  backend helper clobbers are modeled.
+- Make `reserve_hard_regs` consume actual replay-visible hard registers as it
+  does today, but derive preservation decisions from the richer register metadata.
+
+Expected result: one target register contract serves direct CG, opt, and future
+O2 allocation.
+
+## Test Plan
+
+Focused unit tests:
+
+- target register metadata per architecture;
+- call-plan layout for scalar, FP, mixed, sret, variadic, and stack-arg calls;
+- call clobber masks;
+- parallel-copy cycles and memory routing;
+- caller-saved live-across-call preservation using per-call masks;
+- callee-save reservation after broadened allocation.
+
+Code-shape probes:
+
+- `int f(int x) { return 42; }`;
+- `static int callee(int x) { return x + 1; }`
+  plus `int caller(int x) { return callee(x) + 2; }`;
+- multiple non-call-crossing locals under pressure;
+- one value live across a call plus several short-lived call-local values;
+- FP argument and return variants.
+
+Targeted runs:
+
+```sh
+make test-opt
+make test-cg-api
+make test-toy
+make test-aa64-inline
+make test-smoke-x64
+make test-smoke-rv64
+```
+
+## Risks And Open Questions
+
+- The current call emitters still contain target-specific scratch assumptions.
+  Those assumptions must either become call-plan constraints or stay reserved
+  until later.
+- x64 has implicit call metadata for variadic calls (`AL`) and helper scratch use
+  around memory copies; both need explicit representation.
+- AArch64 `x16/x17` and platform register policy differs by OS and relocation
+  model. The register metadata must be target-OS aware.
+- RV64 `ra`, `gp`, `tp`, `s0`, and zero should remain reserved unless the backend
+  grows explicit support for them.
+- Stack argument stores can alias frame or outgoing areas in awkward cases. The
+  call-plan stack area should remain target-owned, with opt only scheduling the
+  materialization.
+- Debug info and unwind data should continue to be backend-owned. Opt only tells
+  the backend which hard registers are actually live in emitted code.
+
+## Recommended First Patch Stack
+
+1. Add `CGPhysRegInfo` plus current-pool metadata for all three targets.
+2. Teach `opt_machinize` to consume the new metadata while preserving identical
+   allocation order.
+3. Add `CGCallPlan` and plan simple scalar calls without using it for emission.
+4. Add call-plan dump/unit tests.
+5. Lower one simple call shape through opt-visible setup/call/return constraints
+   behind a feature flag or narrow capability check.
+6. Implement parallel copies for that shape.
+7. Broaden AArch64 and RV64 caller-saved temp exposure first, then x64.
+
+This order keeps each step testable and avoids mixing API migration, allocation
+policy, and call move correctness in one change.
