kit

kit
git clone https://git.ryansepassi.com/git/kit.git
Log | Files | Refs | README

commit 8add2caae50139cc197d41b88b205987dc711bfb
parent 6d9c6b8fea4710d515208ecd332168a7327d04e5
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Fri, 29 May 2026 11:00:16 -0700

doc: opt reg constraints plan

Diffstat:
Adoc/OPT_MACHINE_REG_CONSTRAINTS.md | 180+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 180 insertions(+), 0 deletions(-)

diff --git a/doc/OPT_MACHINE_REG_CONSTRAINTS.md b/doc/OPT_MACHINE_REG_CONSTRAINTS.md @@ -0,0 +1,180 @@ +# Teaching the optimizer about fixed-register machine instructions + +## Problem + +Some target instructions pin operands or results to specific physical registers +and clobber others as a side effect of the encoding — *hardware* constraints, +not allocator choices. On x86-64: + +| operation | constraint | +|---|---| +| `idiv` / `div` (DIV) | dividend in `rax`; quotient → `rax`; `rdx` clobbered (`cqo`/`xor`) | +| `idiv` / `div` (MOD) | dividend in `rax`; remainder → `rdx`; `rax` clobbered | +| variable shift `shl/shr/sar` | shift count in `cl` (`rcx`) | +| `mul` / `imul` one-operand (unsigned-mul-high, `*_overflow`) | `rax`*`r/m` → `rdx:rax`; `rdx` clobbered | +| `cmpxchg` (atomic CAS) | expected/prior in `rax`; `rax` clobbered | +| `xadd`/`cmpxchg` loop (atomic RMW) | `rax`/`rcx`/`rdx` used | +| `va_arg` (SysV) | `rax` used internally for the gp/fp offset | + +aarch64 and riscv64 have **no** such instructions — their div, shift, and mul +are ordinary three-operand forms with no fixed registers — which is why this +never came up porting them. + +Today the optimizer (`-O1`) models none of this. It treats `idiv` as a generic +binop `def = a / b`, so the register allocator is free to keep an unrelated live +value in `rdx` across the divide. The x64 backend then emits `xor rdx,rdx; idiv` +and silently destroys that value. Concretely (`6_5_04_div_mod`, the VLA cases, +`continue` cases): a live pointer/sum allocated to `rdx` is gone after the first +`idiv`, producing wrong results or a SIGSEGV. This is the single root cause +behind the x64 parse `-O1` tail (div/mod, VLA size math, modulo in loops, the +atomic and overflow intrinsics, and the variadic accumulators). + +The `-O0` path is immune: NativeDirectTarget keeps values in memory / flushes its +small register cache around such ops, so nothing is live in a clobbered register. + +## The machinery that already exists + +The allocator (`src/opt/pass_lower.c`) already supports exactly the two +primitives we need; today only inline-asm and calls feed them: + +- **`preg_info[v].tied_hard_reg`** — pin value `v` to a physical register. + Set by `apply_fixed_asm_operand` from `aux->in_fixed_regs` / `out_fixed_regs`. +- **`clobber_mask[cls]` → `forbidden_hard_regs`** — for each register an + instruction clobbers, every value live *across* the instruction (live-after, + not a use/def of it) is forbidden from that register. Applied in + `apply_asm_register_constraints` (asm) and `opt_call_clobber_mask_for` (calls). + +`pass_machinize.c` populates the asm constraints in `machinize_prepare_insts` +(`asm_prepare_constraints`), resolving named-register constraint strings into +masks. Calls carry their clobber/return masks in the call plan. + +So the gap is narrow: **a generic machine instruction (div, shift, …) has no way +to declare fixed registers and clobbers.** It has no `aux` to hang them on — a +binop's `extra` union already holds the `BinOp` tag (`extra.imm`). + +## Design + +### 1. A target hook describing an instruction's register constraints + +```c +/* arch/native_target.h */ +#define NATIVE_MAX_FIXED_OPNDS 4u + +typedef struct NativeInstRegConstraints { + /* Use-operand index -> physical register it must occupy, or -1. Indices match + * the instruction's OPK_REG use operands in order. */ + i8 in_fixed[NATIVE_MAX_FIXED_OPNDS]; + /* Result (def) index -> physical register, or -1. */ + i8 out_fixed[NATIVE_MAX_FIXED_OPNDS]; + /* Extra physical registers the instruction clobbers (defines), per class. + * The fixed in/out registers need not be repeated here. */ + u32 clobber_mask[NATIVE_CALL_PLAN_CLASSES]; +} NativeInstRegConstraints; + +/* Fill *out with the register constraints the target's encoding of `op` imposes, + * given the instruction (for the BinOp/Conv sub-tag in extra.imm and operand + * kinds — e.g. a variable vs constant shift count). Return 1 if constrained, 0 + * if the instruction is unconstrained (the common case). May be NULL. + * + * The optimizer speaks only in physical register numbers here; the hook is the + * one place target ISA register rules enter the allocator. */ +int (*inst_reg_constraints)(NativeTarget*, const struct Inst* in, + NativeInstRegConstraints* out); +``` + +aarch64/riscv64 leave the hook NULL (or `return 0`) — zero behaviour change. + +### 2. Where it is computed and stored + +The allocator (`pass_lower`) does not hold a `NativeTarget*`, and a binop has no +free `aux` slot, so constraints are **computed in machinization and stored in a +per-function side table keyed by `Inst.id`**: + +```c +/* Func, built in opt_machinize_native */ +typedef struct OptInstConstraints { + InstId inst; + NativeInstRegConstraints c; +} OptInstConstraints; +/* f->inst_constraints : sorted/dense by InstId; lookup is O(1) by id or a + * small bsearch. Only constrained instructions are stored. */ +``` + +`machinize_prepare_insts` already walks every instruction; extend it: for each +non-asm, non-call instruction, call `inst_reg_constraints`; on a `1` return, +append an entry. (Asm and calls keep their existing paths.) + +### 3. Where it is applied + +In `pass_lower`'s per-instruction allocation walk, beside the existing +`apply_asm_register_constraints(f, in, use, def, live)` call, add +`apply_machine_reg_constraints(f, in, use, def, live)`: + +- Look up `in->id` in `f->inst_constraints`; if none, return. +- For each `in_fixed[i] >= 0`: tie the i-th OPK_REG **use** operand's value via + `preg_info[v].tied_hard_reg` (reuse `apply_fixed_asm_operand`). +- For each `out_fixed[i] >= 0`: tie the i-th **def** value. +- For `clobber_mask`: run the same live-across-forbid loop the asm path uses. + +This reuses 100% of the existing allocator mechanism; the only new code is the +lookup + dispatch. No change to the allocation algorithm itself. + +### 4. The x64 hook + +```c +int x64_inst_reg_constraints(NativeTarget*, const Inst* in, + NativeInstRegConstraints* out) { + zero(out); set all in_fixed/out_fixed = -1; + switch ((IROp)in->op) { + case IR_BINOP: + switch ((BinOp)in->extra.imm) { + SDIV/UDIV: in_fixed[dividend]=RAX; out_fixed[0]=RAX; clobber{RDX}; return 1; + SMOD/UMOD: in_fixed[dividend]=RAX; out_fixed[0]=RDX; clobber{RAX}; return 1; + SHL/SHR/SAR with non-const count: in_fixed[count]=RCX; return 1; + default: return 0; + } + case IR_VA_ARG: clobber{RAX}; return 1; /* internal offset scratch */ + case IR_ATOMIC_CAS: in_fixed[expected]=RAX; out_fixed[prior]=RAX; + clobber{RAX,RCX}; return 1; + case IR_ATOMIC_RMW: clobber{RAX,RCX,RDX}; return 1; + case IR_INTRINSIC (umul_overflow): clobber{RDX}; return 1; + default: return 0; + } +} +``` + +(Exact operand indices follow the IR's use/def layout; `mul`/`imul` two-operand +forms and signed-mul-overflow via `imul r,r` need no constraint.) + +### 5. Backend simplification (follow-on, optional) + +Once the allocator guarantees the dividend is already in `rax` and nothing live +sits in `rdx`/`rcx`, several defensive moves in the x64 backend become +no-ops/removable: the `mov rax, dividend` in `x64_binop`, the address-staging in +`x64_atomic_cas`/`_rmw` (the optimizer ties expected→rax etc.), and the shift +`mov rcx, count`. These can be simplified after the constraints land, but are +left in initially (harmless — they become register-to-itself moves). + +## Validation + +- x64 parse `-O1` div/mod, VLA, continue, atomic, overflow, varargs cases go + green; toy `-O1` stays 156/0; `-O0` unchanged (hook only consulted on the + optimizer path). +- aa64/rv64 unaffected (hook returns 0 / NULL): re-run toy + parse to confirm + byte-identical output. +- Add a focused regression: a function that divides and keeps an unrelated value + live across the divide, compiled at `-O1`. + +## Risks / corner cases + +- **Tied input that is also the clobber/def register** (div: dividend tied to + `rax`, `rax` redefined as quotient). The allocator must accept a use whose + tied register is also written — same shape as a two-address op. Verify + `tied_hard_reg` + the def coexist; if the live-across-forbid loop also forbids + the *result* value from `rax`, exclude defs of this instruction (the asm loop + already skips use/def values — reuse that exactly). +- **Two operands tied to the same register** must be rejected at the hook level + (never emitted for the x64 set above). +- **`InstId` density** for the side table — confirm ids are assigned per + function and stable through `pass_lower`; otherwise key by instruction pointer + in a small open-addressed map.