doc: opt reg constraints plan - kit

commit 8add2caae50139cc197d41b88b205987dc711bfb
parent 6d9c6b8fea4710d515208ecd332168a7327d04e5
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Fri, 29 May 2026 11:00:16 -0700

doc: opt reg constraints plan

Diffstat:
A doc/OPT_MACHINE_REG_CONSTRAINTS.md  | 180 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

1 file changed, 180 insertions(+), 0 deletions(-)
diff --git a/doc/OPT_MACHINE_REG_CONSTRAINTS.md b/doc/OPT_MACHINE_REG_CONSTRAINTS.md
@@ -0,0 +1,180 @@
+# Teaching the optimizer about fixed-register machine instructions
+
+## Problem
+
+Some target instructions pin operands or results to specific physical registers
+and clobber others as a side effect of the encoding — *hardware* constraints,
+not allocator choices. On x86-64:
+
+| operation | constraint |
+|---|---|
+| `idiv` / `div` (DIV) | dividend in `rax`; quotient → `rax`; `rdx` clobbered (`cqo`/`xor`) |
+| `idiv` / `div` (MOD) | dividend in `rax`; remainder → `rdx`; `rax` clobbered |
+| variable shift `shl/shr/sar` | shift count in `cl` (`rcx`) |
+| `mul` / `imul` one-operand (unsigned-mul-high, `*_overflow`) | `rax`*`r/m` → `rdx:rax`; `rdx` clobbered |
+| `cmpxchg` (atomic CAS) | expected/prior in `rax`; `rax` clobbered |
+| `xadd`/`cmpxchg` loop (atomic RMW) | `rax`/`rcx`/`rdx` used |
+| `va_arg` (SysV) | `rax` used internally for the gp/fp offset |
+
+aarch64 and riscv64 have **no** such instructions — their div, shift, and mul
+are ordinary three-operand forms with no fixed registers — which is why this
+never came up porting them.
+
+Today the optimizer (`-O1`) models none of this. It treats `idiv` as a generic
+binop `def = a / b`, so the register allocator is free to keep an unrelated live
+value in `rdx` across the divide. The x64 backend then emits `xor rdx,rdx; idiv`
+and silently destroys that value. Concretely (`6_5_04_div_mod`, the VLA cases,
+`continue` cases): a live pointer/sum allocated to `rdx` is gone after the first
+`idiv`, producing wrong results or a SIGSEGV. This is the single root cause
+behind the x64 parse `-O1` tail (div/mod, VLA size math, modulo in loops, the
+atomic and overflow intrinsics, and the variadic accumulators).
+
+The `-O0` path is immune: NativeDirectTarget keeps values in memory / flushes its
+small register cache around such ops, so nothing is live in a clobbered register.
+
+## The machinery that already exists
+
+The allocator (`src/opt/pass_lower.c`) already supports exactly the two
+primitives we need; today only inline-asm and calls feed them:
+
+- **`preg_info[v].tied_hard_reg`** — pin value `v` to a physical register.
+  Set by `apply_fixed_asm_operand` from `aux->in_fixed_regs` / `out_fixed_regs`.
+- **`clobber_mask[cls]` → `forbidden_hard_regs`** — for each register an
+  instruction clobbers, every value live *across* the instruction (live-after,
+  not a use/def of it) is forbidden from that register. Applied in
+  `apply_asm_register_constraints` (asm) and `opt_call_clobber_mask_for` (calls).
+
+`pass_machinize.c` populates the asm constraints in `machinize_prepare_insts`
+(`asm_prepare_constraints`), resolving named-register constraint strings into
+masks. Calls carry their clobber/return masks in the call plan.
+
+So the gap is narrow: **a generic machine instruction (div, shift, …) has no way
+to declare fixed registers and clobbers.** It has no `aux` to hang them on — a
+binop's `extra` union already holds the `BinOp` tag (`extra.imm`).
+
+## Design
+
+### 1. A target hook describing an instruction's register constraints
+
+```c
+/* arch/native_target.h */
+#define NATIVE_MAX_FIXED_OPNDS 4u
+
+typedef struct NativeInstRegConstraints {
+  /* Use-operand index -> physical register it must occupy, or -1. Indices match
+   * the instruction's OPK_REG use operands in order. */
+  i8 in_fixed[NATIVE_MAX_FIXED_OPNDS];
+  /* Result (def) index -> physical register, or -1. */
+  i8 out_fixed[NATIVE_MAX_FIXED_OPNDS];
+  /* Extra physical registers the instruction clobbers (defines), per class.
+   * The fixed in/out registers need not be repeated here. */
+  u32 clobber_mask[NATIVE_CALL_PLAN_CLASSES];
+} NativeInstRegConstraints;
+
+/* Fill *out with the register constraints the target's encoding of `op` imposes,
+ * given the instruction (for the BinOp/Conv sub-tag in extra.imm and operand
+ * kinds — e.g. a variable vs constant shift count). Return 1 if constrained, 0
+ * if the instruction is unconstrained (the common case). May be NULL.
+ *
+ * The optimizer speaks only in physical register numbers here; the hook is the
+ * one place target ISA register rules enter the allocator. */
+int (*inst_reg_constraints)(NativeTarget*, const struct Inst* in,
+                            NativeInstRegConstraints* out);
+```
+
+aarch64/riscv64 leave the hook NULL (or `return 0`) — zero behaviour change.
+
+### 2. Where it is computed and stored
+
+The allocator (`pass_lower`) does not hold a `NativeTarget*`, and a binop has no
+free `aux` slot, so constraints are **computed in machinization and stored in a
+per-function side table keyed by `Inst.id`**:
+
+```c
+/* Func, built in opt_machinize_native */
+typedef struct OptInstConstraints {
+  InstId inst;
+  NativeInstRegConstraints c;
+} OptInstConstraints;
+/* f->inst_constraints : sorted/dense by InstId; lookup is O(1) by id or a
+ * small bsearch. Only constrained instructions are stored. */
+```
+
+`machinize_prepare_insts` already walks every instruction; extend it: for each
+non-asm, non-call instruction, call `inst_reg_constraints`; on a `1` return,
+append an entry. (Asm and calls keep their existing paths.)
+
+### 3. Where it is applied
+
+In `pass_lower`'s per-instruction allocation walk, beside the existing
+`apply_asm_register_constraints(f, in, use, def, live)` call, add
+`apply_machine_reg_constraints(f, in, use, def, live)`:
+
+- Look up `in->id` in `f->inst_constraints`; if none, return.
+- For each `in_fixed[i] >= 0`: tie the i-th OPK_REG **use** operand's value via
+  `preg_info[v].tied_hard_reg` (reuse `apply_fixed_asm_operand`).
+- For each `out_fixed[i] >= 0`: tie the i-th **def** value.
+- For `clobber_mask`: run the same live-across-forbid loop the asm path uses.
+
+This reuses 100% of the existing allocator mechanism; the only new code is the
+lookup + dispatch. No change to the allocation algorithm itself.
+
+### 4. The x64 hook
+
+```c
+int x64_inst_reg_constraints(NativeTarget*, const Inst* in,
+                             NativeInstRegConstraints* out) {
+  zero(out); set all in_fixed/out_fixed = -1;
+  switch ((IROp)in->op) {
+    case IR_BINOP:
+      switch ((BinOp)in->extra.imm) {
+        SDIV/UDIV:  in_fixed[dividend]=RAX; out_fixed[0]=RAX; clobber{RDX}; return 1;
+        SMOD/UMOD:  in_fixed[dividend]=RAX; out_fixed[0]=RDX; clobber{RAX}; return 1;
+        SHL/SHR/SAR with non-const count: in_fixed[count]=RCX; return 1;
+        default: return 0;
+      }
+    case IR_VA_ARG:        clobber{RAX}; return 1;   /* internal offset scratch */
+    case IR_ATOMIC_CAS:    in_fixed[expected]=RAX; out_fixed[prior]=RAX;
+                           clobber{RAX,RCX}; return 1;
+    case IR_ATOMIC_RMW:    clobber{RAX,RCX,RDX}; return 1;
+    case IR_INTRINSIC (umul_overflow): clobber{RDX}; return 1;
+    default: return 0;
+  }
+}
+```
+
+(Exact operand indices follow the IR's use/def layout; `mul`/`imul` two-operand
+forms and signed-mul-overflow via `imul r,r` need no constraint.)
+
+### 5. Backend simplification (follow-on, optional)
+
+Once the allocator guarantees the dividend is already in `rax` and nothing live
+sits in `rdx`/`rcx`, several defensive moves in the x64 backend become
+no-ops/removable: the `mov rax, dividend` in `x64_binop`, the address-staging in
+`x64_atomic_cas`/`_rmw` (the optimizer ties expected→rax etc.), and the shift
+`mov rcx, count`. These can be simplified after the constraints land, but are
+left in initially (harmless — they become register-to-itself moves).
+
+## Validation
+
+- x64 parse `-O1` div/mod, VLA, continue, atomic, overflow, varargs cases go
+  green; toy `-O1` stays 156/0; `-O0` unchanged (hook only consulted on the
+  optimizer path).
+- aa64/rv64 unaffected (hook returns 0 / NULL): re-run toy + parse to confirm
+  byte-identical output.
+- Add a focused regression: a function that divides and keeps an unrelated value
+  live across the divide, compiled at `-O1`.
+
+## Risks / corner cases
+
+- **Tied input that is also the clobber/def register** (div: dividend tied to
+  `rax`, `rax` redefined as quotient). The allocator must accept a use whose
+  tied register is also written — same shape as a two-address op. Verify
+  `tied_hard_reg` + the def coexist; if the live-across-forbid loop also forbids
+  the *result* value from `rax`, exclude defs of this instruction (the asm loop
+  already skips use/def values — reuse that exactly).
+- **Two operands tied to the same register** must be rejected at the hook level
+  (never emitted for the x64 set above).
+- **`InstId` density** for the side table — confirm ids are assigned per
+  function and stable through `pass_lower`; otherwise key by instruction pointer
+  in a small open-addressed map.

	kit kit
	git clone https://git.ryansepassi.com/git/kit.git
	Log \| Files \| Refs \| README