kit

git clone https://git.ryansepassi.com/git/kit.git

commit 20c4b046d8cbce3704e010a55002c6bc7f37f613
parent aa77e40659f4596f238bec93a379edf8c3f2476c
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Mon,  1 Jun 2026 19:44:42 -0700

plan: rework CODEGEN.md to record landed tracks + focus on what remains

Tracks 1a/1d, 5, 3a, FP_REM, and the AsmDir + Atomic/MemOrder enum slices of
Track 2 are committed and green; condense them into a Done table and replace the
per-track write-ups with the remaining work, sharpened with what executing the
landed slices surfaced: the binop/cmp split design (separate hooks + doubled IR
opcodes + fold-layer restructure, the int/fp decision already made by type), the
multi-result -O1/wasm follow-up, the cg_adapter.h enum-duplication hazard, and
the bswap width-from-type feasibility. Reorders the remaining sequence.

Diffstat:
M doc/plan/CODEGEN.md | 658 ++++++++++++++++++++++++++++++++-----------------------------------------------
1 file changed, 268 insertions(+), 390 deletions(-)

diff --git a/doc/plan/CODEGEN.md b/doc/plan/CODEGEN.md
@@ -1,359 +1,268 @@
-# Codegen Interface Cleanup — Roadmap
+# Codegen Interface Cleanup — Roadmap (remaining work)
 
-Status: **decided — ready to execute** (all eight decisions resolved in §Decisions). Forward-looking
-companion to the canonical design in [doc/CODEGEN.md](../CODEGEN.md). Goal: make the
-`CfreeCg` public API and the internal `CgTarget` contract carry **one clear
-representation per concept, with no advertise-but-ignore surface and no façade**.
-Breaking and sweeping changes are in scope; reducing churn is *not* a priority.
+Status: **partially executed.** The independent, lower-risk tracks are landed and committed
+(see §Done). What remains is the high-blast-radius work: the **binop/cmp op-split** (the
+rest of Track 2), the **op/intrinsic taxonomy** (Track 4), the **fold-layer isolation**
+(Track 6), and the **PLACE/VALUE centerpiece** (Track 7, with Track 3b folded in).
+
+Forward-looking companion to the canonical design in [doc/CODEGEN.md](../CODEGEN.md). Goal:
+make the `CfreeCg` public API and the internal `CgTarget` contract carry **one clear
+representation per concept, with no advertise-but-ignore surface and no façade**. Breaking
+and sweeping changes are in scope; reducing churn is *not* a priority.
 
 The centerpiece is **Track 7** — a strict PLACE/VALUE stack discipline that ends CG's
-inference of what a stack slot *means* (lvalue vs rvalue, by-value vs by-reference). Its
-core is decided (Model B; see §Track 7); the other tracks orbit it.
+inference of what a stack slot *means*. Its core is decided (Model B; see §Track 7).
 
 ## Scope
 
 Two stacked interfaces (see [doc/CODEGEN.md §The two boundaries](../CODEGEN.md)):
 - **Public** `cfree_cg_*` / `CfreeCg` (`include/cfree/cg.h`) — a value-stack machine.
-- **Internal** `CgTarget` (`src/cg/cgtarget.h`) — a three-address operand vtable the
-  5 realizations implement (native `-O0`, IR recorder, C-source, wasm).
+- **Internal** `CgTarget` (`src/cg/cgtarget.h`) — a three-address operand vtable. NOTE the
+  op enums also flow into the **physical** `NativeTarget` (`src/arch/native_target.h`,
+  which `#include`s `cgtarget.h`), the recorder IR (`src/cg/ir.h`) and opt IR
+  (`src/opt/ir.h`), and the interpreter (`src/interp/`). Any enum change touches all of
+  these layers, not just the semantic vtable.
 
 Between them sits the translation layer (`src/cg/value.c`, `arith.c`, `memory.c`,
-`control.c`, `call.c`), which also performs `-O0` constant folding and compare
-fusion. Almost every defect below lives at the **seam** between the two models —
-duplicated, mapped lossily, or advertised on one side and dropped on the other.
-
-## Principles we are enforcing
-
-1. **One representation per concept.** No concept should exist in two structs or two
-   enums that must be hand-kept in sync.
-2. **No advertise-but-ignore.** If a field/flag is in a public struct, it is either
-   honored or it does not exist.
-3. **No façade.** A public enumerator that always panics is a bug. Either implement
-   it, or remove it, or gate it behind a capability query and a clean diagnostic —
-   consistent with how call-convs and symbol features already work.
+`control.c`, `call.c`), which also performs `-O0` constant folding and compare fusion.
+
+## Principles we are enforcing (unchanged)
+
+1. **One representation per concept.** No concept in two structs/enums hand-kept in sync.
+2. **No advertise-but-ignore.** A public field/flag is honored or it does not exist.
+3. **No façade.** A public enumerator that always panics is a bug — implement it, remove
+   it, or gate it behind a capability query + clean diagnostic.
 4. **Width belongs to the type, not the opcode.** `bswap` is one operation, not three.
 5. **Ops vs intrinsics has a stated rule** (§Track 4) and both layers obey it.
-6. **The semantic layer may peephole, but that responsibility is named and isolated**,
-   not smeared across the op families. The vstack peephole is a *feature* (free `-O0`
-   perf), kept and maintained — not removed.
-7. **Completeness over minimalism.** Keep an op/enumerator that has a distinct, sensible
-   meaning and completes an orthogonal set — *even with no current caller*. Judge a
-   surface by whether it is consistent and complete on its own terms, not by present
-   usage. Remove only the *redundant*: two spellings of one behavior.
+6. **The semantic layer may peephole, but that responsibility is named and isolated.** The
+   vstack peephole is a kept feature (free `-O0` perf), not removed.
+7. **Completeness over minimalism.** Keep an op/enumerator with a distinct, sensible
+   meaning that completes an orthogonal set — even with no caller. Remove only the
+   *redundant*: two spellings of one behavior.
 
 ---
 
-## Track 1 — Remove dead/redundant surface
-
-Genuine deletions are 1a (unreachable) and 1d (redundant). Applying Principle 7, **1b
-and 1c are NOT deletions**: 1b (the vstack peephole) is a kept feature to *re-enable*
-(now under Track 6); 1c (conditional control ops) is a complete set we keep and finish.
-The net deletions here are pure subtraction with no behavior change.
-
-### 1a. `SCOPE_IF` / `CGScopeDesc.cond` / `scope_else` are unreachable
-`cfree_cg_scope_begin` → `SCOPE_LOOP`, `cfree_cg_block_begin` → `SCOPE_BLOCK`
-(`control.c:525,529`). **Nothing ever produces `SCOPE_IF`.** Therefore:
-- `CGScopeDesc.cond` is never read for a real value.
-- `scope_else` (implemented in `ir_recorder.c:625`, `native_direct_target.c:1792`,
-  recorded in IR `ir_dump.c:52`/`ir_print.c:94`) is never invoked. `nd_scope_else`
-  guards `if (s->kind != SCOPE_IF) panic` — which would *always* fire.
-- wasm's native `SCOPE_IF` handling (`arch/wasm/emit.c`) is dead; `c_emit.c:1570`
-  already documents "Public API doesn't emit SCOPE_IF."
-
-**Action:** remove `SCOPE_IF`, `CGScopeDesc.cond`, the `scope_else` vtable member and
-all 3+ implementations, and the dead wasm/native branches. `if` keeps lowering as two
-nested `SCOPE_BLOCK` + `break_false` (the `cfree_cg_if_*` helpers).
-
-### 1b. Vstack peephole (`SV_ARITH`) — **keep & re-enable, moved to Track 6**
-`api_can_delay_int_arith()` returns `0` (`value.c:1038`), gating off the delayed-arith
-peephole (`api_make_arith_binop`/`_unop`, `api_materialize_arith_to`, `api_release_arith`,
-`api_try_fold_arith_chain`, `api_try_collapse_binop_identity`, `api_try_fold_unary_chain`,
-the `a_owned`/`b_owned` bookkeeping). It is **disabled, not dead-by-design**: git shows it
-was live (`g && !flags && api_foldable_int_type(...)`) until commit `a126bec` ("extend
-memory ops with effective-address rider") flipped it to `return 0` — the EA rider and the
-delayed forms fought (see the "re-fetch in case alloc materialized a delayed expression"
-workarounds at `memory.c:339`, `control.c:886`). **Decision (owner): keep the vstack
-peephole — it is free `-O0` perf.** Since Track 7 *removes* the EA rider, re-enabling is
-clean. Restoring the original gate + isolating the whole peephole is now **Track 6.3**;
-[doc/CODEGEN.md:76-79](../CODEGEN.md) (which documents delayed arith as live) becomes
-correct again.
-
-### 1c. Conditional control ops — **keep (complete set), per Principle 7**
-`break_true`/`continue_true`/`continue_false` have 0 callers and `break_false` has 1, but
-usage is not the test. The set is the orthogonal cross-product
-`{break, continue} × {unconditional, _true, _false}` — the structured-scope analog of the
-unstructured `branch_true`/`branch_false`, letting a frontend say "exit/continue this
-scope if cond" without materializing a separate branch+label. Its result-carrying
-semantics are well-defined (`cg.h:474-482`: `break_true` on an expression scope is
-`[result, bool] → pop bool; if true pop result and exit`). Deleting only the unused
-arms would make the API *incomplete and asymmetric* — exactly what we are fixing.
-**Action:** keep the full set; **audit it for completeness** (confirm continue is
-rejected on non-loop scopes, that block vs loop scope rules are uniform, and that every
-arm has a test). `cfree_cg_block_begin` (0 direct callers, used via `cfree_cg_if_*`) is a
-distinct, sensible primitive — keep.
-
-### 1d. `CFREE_CG_TAIL_NEVER` — remove (redundant, not incomplete)
-Documented as "Treated as DEFAULT" (`cg.h:813`): a second spelling of `DEFAULT` with
-identical semantics. Unlike 1c, removing it *increases* consistency (no two enumerators
-mean the same thing). **Action:** remove; "no tail" is `CFREE_CG_TAIL_DEFAULT`.
-
-**Affected (1a/1d):** `cg.h`, `cgtarget.h`, `control.c`, `ir_recorder.c`,
-`native_direct_target.c`, `arch/wasm/emit.c`, `arch/c_target/c_emit.c`, `ir_dump.c`,
-`opt/ir_print.c`.
-**Tests:** existing control-flow + toy + wasm suites stay green (1a/1d are no-behavior
-deletions); **add** the missing-arm coverage for 1c (each `break_*`/`continue_*` variant
-exercised end-to-end on a backend).
+## Done (committed, all green: lib · toy 1344/0 · cg-api · smoke x64/rv64 · opt · isa · libc; `make bootstrap` reproduces at -O0 AND -O1)
+
+| Commit | Track | Summary |
+|---|---|---|
+| `e27a288` | **1a / 1d** | Removed `SCOPE_IF` / `CGScopeDesc.cond` / the `scope_else` hook (both IR opcodes `CG_IR_SCOPE_ELSE` + `IR_SCOPE_ELSE`, all 5 realizations, the `desc.cond` opt walkers; ~22 files). Removed `CFREE_CG_TAIL_NEVER` (redundant with `DEFAULT`). |
+| `ae8d0f6` | **5** | Multi-result public API: `CfreeCgFuncSig.results[]`/`nresults` (+ `CfreeCgFuncResult`), `cfree_cg_type_func_nresults`/`_result`, `cfree_cg_ret_void` removed (void = 0-result `cfree_cg_ret`). Type system stores `results[]`; `cfree_cg_call` pushes/`cfree_cg_ret` pops in declaration order (last result on TOS). **Includes a self-host regression fix:** a no-value return on a *non-void* function (UB fall-off) now emits `cfree_cg_unreachable` instead of underflowing the value stack (`pcg_ret` in `lang/c/parse/cg_adapter.c`). |
+| `fabf255` | **3a** | Dropped `CFREE_CG_MEM_NONTEMPORAL`/`_INVARIANT` + `CfreeCgMemAccess.alias_scope`/`noalias_scope` (decision #5) and the matching toy attributes. |
+| `5e1335d` | **4 (FP_REM)** | Removed the `CFREE_CG_FP_REM` façade (always-panic; only dead callers). FP remainder is a libcall the frontend emits. |
+| `917ffe9` | **2 (AsmDir)** | Deleted internal `AsmDir` + `api_map_asm_dir`; `AsmConstraint.dir` and backends use public `CfreeCgAsmDir`. |
+| `a2f6367` | **2 (Atomic/Order)** | Deleted internal `AtomicOp`/`MemOrder` + `api_map_atomic_op`/`api_map_mem_order`; **both** the semantic `CgTarget` and physical `NativeTarget` atomic hooks, the recorder+opt IR aux, and the interpreter now carry public `CfreeCgAtomicOp`/`CfreeCgMemOrder`. |
+
+So **Tracks 1a/1d, 5, 3a are done; Track 2 is 2/3 done** (the 3 identical enums); **Track 4**
+has FP_REM removed.
+
+### Caveats / follow-ups discovered while doing the above
+- **Track 5 multi-result is single-result-complete only.** The `-O0` native path handles
+  `nresults > 1`, but the **opt path** (`src/opt/cg_ir_lower.c`, the `CG_IR_CALL`/`CG_IR_RET`
+  lowering) still only threads `results[0]` — a true 2+-result function is lossy at `-O1`.
+  The **wasm frontend** (`lang/wasm/cg.c`) was also migrated as single-result (takes
+  `f->results[0]`). True multi-value end-to-end (wasm + `-O1`) is unfinished follow-up.
+- **The C frontend keeps its own private copies** of `BinOp`/`AtomicOp`/`MemOrder`/
+  `IntrinKind` in `lang/c/parse/cg_adapter.h`. These are a **separate Principle-1 issue**,
+  deliberately left alone by Track 2 (they're a different namespace; do not blind-rename
+  `AO_*`/`MO_*`/`BO_*` across `lang/`). Worth a follow-up to dedupe against the public enums.
+- **Regression lesson** (in [[doc/plan/BOOTSTRAP.md]] / the self-build): removing a "bare
+  return that ignores result count" primitive means every frontend's *fall-off / default*
+  return must push the right number of values or terminate with `unreachable`. Audit other
+  frontends if you remove more return primitives.
 
 ---
 
-## Track 2 — Unify the op-enum vocabulary
+## Track 1c — Conditional control ops: completeness audit + tests (REMAINING, small)
 
-Every operation enum exists twice and is hand-mapped 1:1 in `value.c`:
+KEEP the full `{break, continue} × {unconditional, _true, _false}` set (Principle 7;
+`break_true`/`continue_true`/`continue_false` have 0 callers, `break_false` 1, but the set
+is the structured analog of `branch_true`/`branch_false`). Remaining work is **test
+coverage + an audit**, not code change:
+- Confirm `continue*` is rejected on non-loop scopes and block-vs-loop rules are uniform.
+- Add an end-to-end test for each `break_*`/`continue_*` variant on a backend (the
+  result-carrying semantics are spec'd at `cg.h` `cfree_cg_break_true` &c).
 
-| Public (`cg.h`) | Internal (`cgtarget.h`) | Relationship | Mapper |
-|---|---|---|---|
-| `CfreeCgAtomicOp` (7) | `AtomicOp` (7) | **identical** | `api_map_atomic_op` |
-| `CfreeCgMemOrder` (6) | `MemOrder` (6) | **identical** | `api_map_mem_order` |
-| `CfreeCgAsmDir` (3) | `AsmDir` (3) | **identical** | `api_map_asm_dir` |
-| `CfreeCgIntBinOp`+`CfreeCgFpBinOp` | `BinOp` | split→merged | `api_map_int_binop`/`api_map_fp_binop` |
-| `CfreeCgIntCmpOp`(10)+`CfreeCgFpCmpOp`(12) | `CmpOp` (14) | split→merged, **lossy** | `api_map_int_cmp`/`api_map_fp_cmp` |
-| `CfreeCgIntUnOp`+`CfreeCgFpUnOp` | `UnOp` | split→merged | `api_map_int_unop` |
-
-Two concrete defects:
-- The split→merge→split round-trip earns nothing: every native backend re-splits
-  int/fp immediately (`aa64/native.c:2070`, `x64/native.c:906`).
-- The merge is **lossy**: `api_map_fp_cmp` collapses `OEQ`/`UEQ`→`CMP_EQ` (`value.c:648-668`)
-  so the public ordered/unordered distinction cannot survive to a backend; and
-  `api_map_fp_binop` maps `CFREE_CG_FP_REM`→`BO_FDIV` (`value.c:605`), which is dead
-  *and* wrong-looking.
-
-**Decision (recommended):** `CgTarget` consumes the public `CfreeCg*` op enums directly.
-Delete the parallel internal `BinOp`/`UnOp`/`CmpOp`/`AtomicOp`/`MemOrder`/`AsmDir` and
-every `api_map_*` (~200 lines of `value.c`). `cgtarget.h` already `#include`s `cfree/cg.h`,
-so this is mechanical. Keep the public **int/fp split** (it is the clearer API and
-matches what backends do anyway); backends switch on `CfreeCgIntBinOp` and
-`CfreeCgFpBinOp` separately. This is a single-repo internal contract, not a published
-backend ABI, so coupling it to the public enum values is acceptable. See §Open
-decisions #2 for the split-vs-merged confirmation.
-
-**Affected:** `cgtarget.h` (enum deletions + signature changes on `binop`/`unop`/`cmp`/
-`atomic_*`/`fence`/`asm_block`), all 5 backends' switch sites, `ir_recorder.c` +
-`opt/` IR (the recorded op field changes type), `value.c`/`arith.c`/`atomic.c`/`asm.c`.
-**Tests:** ISA encode/decode (`test-isa`, `test-arch`), opt, smoke; add a case that
-exercises an unordered FP compare end-to-end (currently lossy).
+---
+
+## Track 2 (remaining) — Split the merged `BinOp`/`UnOp`/`CmpOp`
+
+The 3 *identical* enums (Atomic/Order/AsmDir) are done. What remains is the **split→merged**
+trio, which is the structural core of Track 2 (the largest remaining mechanical change):
+
+| Public (`cg.h`) | Internal (`cgtarget.h`) | Relationship |
+|---|---|---|
+| `CfreeCgIntBinOp`(13) + `CfreeCgFpBinOp`(4) | `BinOp` | split→merged |
+| `CfreeCgIntCmpOp`(10) + `CfreeCgFpCmpOp`(12) | `CmpOp`(14) | split→merged, **lossy** |
+| `CfreeCgIntUnOp`(3) + `CfreeCgFpUnOp`(1) | `UnOp` | split→merged |
+
+**Why it matters:** the merge is **lossy** — `api_map_fp_cmp` collapses `OEQ`/`UEQ` → one
+`CMP_EQ` (`value.c`), so the public ordered/unordered FP-compare distinction cannot reach a
+backend. Fixing that is the real correctness win; the binop/unop dedup is consistency.
+
+**Decision (#2): `CgTarget` consumes the public split enums directly; backends switch on
+`CfreeCgIntBinOp` and `CfreeCgFpBinOp` separately.** Delete `BinOp`/`UnOp`/`CmpOp` and
+`api_map_int_binop`/`api_map_fp_binop`/`api_map_int_unop`/`api_map_int_cmp`/`api_map_fp_cmp`.
+
+### Why this is bigger than the atomic slice — the design to implement
+Unlike Atomic/Order (a 1:1 value-preserving *rename*), this is a genuine **split**:
+
+1. **Hooks split** (`cgtarget.h`, mirrored in `native_target.h` if any binop/cmp is physical
+   — check; binop/cmp are semantic `CgTarget` hooks): `binop`→`int_binop`/`fp_binop`,
+   `unop`→`int_unop`/`fp_unop`, `cmp`→`int_cmp`/`fp_cmp`, `cmp_branch`→
+   `int_cmp_branch`/`fp_cmp_branch`.
+2. **IR opcodes double** — the recorder (`src/cg/ir.h`) and opt IR (`src/opt/ir.h`) store the
+   op in `extra.imm`/aux; a single `CG_IR_BINOP`/`IR_BINOP` can't hold an ambiguous value
+   (`CFREE_CG_INT_ADD == CFREE_CG_FP_ADD == 0`). Either split the opcodes
+   (`CG_IR_INT_BINOP`/`CG_IR_FP_BINOP`, …) **or** add an `is_fp` discriminator bit. Splitting
+   the opcodes is cleaner; both touch `ir_recorder.c`, `cg_ir_lower.c`, `pass_native_emit.c`,
+   `ir_dump.c`/`ir_print.c`, and every opt pass that switches on `IR_BINOP`/`IR_CMP`/
+   `IR_CMP_BRANCH` (`pass_combine`, `pass_simplify`, `pass_o2`, `pass_jump`, …).
+3. **Fold layer restructures** (`arith.c` + `value.c`). `api_cg_binop(BinOp)` /
+   `api_cg_unop(UnOp)` / `api_cg_cmp(CmpOp)` are the shared dispatch. Note the int/fp split
+   is **already made by TYPE** (`api_type_is_float`), not by the enum — so splitting the
+   dispatch is natural: the int path keeps the fold (`api_try_fold_int_binop`/`_unop`/`_cmp`,
+   int-only) and the delayed forms (`SV_ARITH` arith, `SV_CMP` compare; `ApiDelayedArith.bin_op`/
+   `un_op`, `ApiDelayedCmp.op`, `api_make_cmp`, `api_materialize_cmp_to`, `api_invert_cmp`,
+   `api_branch_if`); the fp path is simpler (the f128 helper path in `cfree_cg_fp_binop`
+   already exists, plus the fp hook). **This is the subtle, high-risk part** — get the
+   delayed-compare fusion + constant-fold right per int/fp. Coordinate with Track 6.2 (which
+   moves these into `fold.c`); doing 6.2 first may make this cleaner.
+4. **Backends split their switches:** the 3 native arches (`aa64`/`x64`/`rv64` `native.c` —
+   they already re-split int/fp internally), `c_target/c_emit.c`, `wasm/emit.c`, and the
+   interpreter (`interp/engine.c`).
+
+**Method that worked for the atomic slice:** delete the internal enum + change the hook
+signatures, then let `-Werror` enumerate every cg-side site (the C frontend's `cg_adapter.h`
+copy won't be flagged — it's a different type). Then fix per file. For the value-label
+renames, sed **only** within `src/cg|arch|opt|interp` (never `lang/`, never `src/wasm/`).
+
+**Tests:** `test-isa`/`test-arch` (encode/decode), `test-opt`, smoke; **add an unordered FP
+compare exercised end-to-end** (the currently-lossy case) — that's the regression guard for
+the real fix.
 
 ---
 
-## Track 3 — Unify duplicated representations
-
-### 3a. Two `MemAccess` structs + advertise-but-ignore flags
-Public `CfreeCgMemAccess` vs internal `MemAccess`, with non-overlapping flag enums
-(`CfreeCgMemAccessFlag` vs `MemFlag`). `api_mem_from_access` (`value.c:284-295`)
-translates only `VOLATILE`; **`NONTEMPORAL` and `INVARIANT` are silently dropped**, and
-`alias_scope`/`noalias_scope` are **never read** by anything.
-
-**Action:** either (a) carry `NONTEMPORAL`/`INVARIANT` through to an internal carrier
-and into at least one backend, or (b) remove them from `CfreeCgMemAccess`. Remove
-`alias_scope`/`noalias_scope` until there is a consumer. Keep one access struct as the
-source of truth; derive the internal one by a single documented projection (not a
-parallel hand-maintained type). Recommendation: (b) remove now — no frontend sets them
-except toy, and there is no internal model for them.
-
-### 3b. Bitfields exist in three representations
-- Public rider on `CfreeCgMemAccess` (`bit_offset`/`bit_width`/`storage_size`/`bit_signed`).
-- Public rider on `CfreeCgField` (`bit_width`/`bit_offset`/`bit_storage_size`/`bit_signed`
-  — note `storage_size` vs `bit_storage_size` naming drift).
-- Internal dedicated `BitFieldAccess` + `bitfield_load`/`bitfield_store`.
-
-The public load/store carry 4 bitfield fields that most callers zero, bridged by
-`bf_from_access` (`memory.c:364`) into the dedicated internal path.
-
-**Action:** expose **dedicated public bitfield ops** (`cfree_cg_bitfield_load`/`_store`
-taking an explicit `CfreeCgBitField` struct), mirroring the internal shape. Drop the
-bitfield fields from `CfreeCgMemAccess` entirely. Keep `CfreeCgField`'s layout-query
-fields (they answer record-layout queries) but rename for consistency. This removes the
-"every memop is secretly maybe-a-bitfield" branch from `cfree_cg_load`/`_store`
-(`memory.c:420,577,646`).
-
-### 3c. `scale` vs `log2_scale` — **superseded by Track 7**
-The public `CfreeCgEffAddr` rider is removed entirely in Track 7 (its base+index*scale+
-offset job moves into the place representation built by `field`/`elem`). The scale-form
-mismatch disappears with it. No separate action here.
+## Track 3b — Bitfields as a PLACE subkind (REMAINING — do with/after Track 7)
+
+Three representations today: a rider on `CfreeCgMemAccess`
+(`bit_offset`/`bit_width`/`storage_size`/`bit_signed`), a rider on `CfreeCgField`
+(`bit_width`/`bit_offset`/`bit_storage_size`/`bit_signed` — note `storage_size` vs
+`bit_storage_size` drift), and internal `BitFieldAccess` + `bitfield_load`/`_store`.
+
+**Decision (#7): a bitfield is a PLACE subkind** carrying the descriptor; the normal
+`load`/`store` perform the extract/insert. This merges into Track 7 (it depends on the
+place model). Drop the bitfield fields from `CfreeCgMemAccess`; keep `CfreeCgField`'s
+layout-query fields but fix the naming drift. Removes the "every memop is secretly
+maybe-a-bitfield" branch in `cfree_cg_load`/`_store`.
 
 **Affected:** `cg.h`, `cgtarget.h`, `memory.c`, all backends' `load`/`store`/`bitfield_*`,
-the C frontend's bitfield path (`lang/c/parse/cg_adapter.c`).
-**Tests:** bitfield corpus in toy + C; `test-cg-api`.
+`lang/c/parse/cg_adapter.c`. **Tests:** bitfield corpus in toy + C; `test-cg-api`.
 
 ---
 
-## Track 4 — Fix the op/intrinsic taxonomy
-
-Today "op vs intrinsic" is drawn inconsistently across and within layers:
-- `memcpy`/`memset`: dedicated **public ops**, internal **intrinsics** (`INTRIN_MEMCPY`…).
-- `unreachable`: public **op** documented as "a real terminator, not a side-effect
-  intrinsic" (`cg.h:560`) — yet lowered through the **intrinsic** hook (`control.c:401`,
-  `INTRIN_UNREACHABLE`). Direct doc/impl contradiction.
-- `trap`: public **intrinsic**.
-- `bswap`: **1** public intrinsic but **3** internal (`BSWAP16/32/64`), split by a
-  size test in `api_map_intrinsic` (`arith.c:803-806`).
-
-**The rule (proposed):**
-- **Terminators are first-class `CgTarget` ops** (ret, unreachable, jump, branch,
-  computed_goto, tail-call). Give `unreachable` its own hook and honor its documented
-  terminator status; stop routing it through `intrinsic`.
-- **Primitives that may lower to either an inline sequence or a libcall are intrinsics**
-  (clz/ctz/popcount/bswap/overflow/fma/memcpy/memset). Decide each concept's home once
-  and make public+internal agree. Recommendation: keep `memcpy`/`memset` as dedicated
-  *public* ops (they carry rich `MemAccess`) but stop double-modeling them as a separate
-  public *intrinsic* surface.
-- **Width comes from the operand type, not the opcode.** Collapse `BSWAP16/32/64` → one
-  `BSWAP`; backends read width from the operand. Deletes the size-branch in
-  `api_map_intrinsic`.
-
-### 4b. Façade intrinsics (ties into Track 1)
+## Track 4 (remaining) — op/intrinsic taxonomy
+
+FP_REM removal is done. Remaining:
+
+### 4a. Width-by-type: collapse `BSWAP16/32/64` → one `BSWAP`
+Internal `IntrinKind` has 3 bswaps; public has 1 (`CFREE_CG_INTRIN_BSWAP`). `api_map_intrinsic`
+(`arith.c`) picks the internal one by `abi_cg_sizeof(result_type)`. **Feasible to collapse:**
+`NativeLoc` carries `.type` and `NativeTarget` has `t->c->abi`, so backends derive width from
+`dsts[0].type` (the result type — same source the size-branch uses). Collapse = wrap each
+backend's three existing sequences under a `switch(width)`; preserve the sequences verbatim.
+Touches `cgtarget.h` (enum), `arith.c` (drop the size-branch), and the bswap cases in
+`aa64`/`x64`/`rv64` `native.c`, `interp/engine.c`, `c_target/c_emit.c`, and **wasm
+(`arch/wasm/emit.c`, multi-site ~1577/1708/2894/3113 + capability path)**. NOTE the C
+frontend's `cg_adapter.h` has its own `INTRIN_BSWAP16/32/64`; leave it (it maps to the public
+single `BSWAP` at the call site). Pure internal dedup — public API unchanged.
+
+### 4b. `unreachable` as a first-class terminator hook
+`cfree_cg_unreachable` is documented "a real terminator, not a side-effect intrinsic"
+(`cg.h`) but is routed through the **intrinsic** hook (`control.c`, `INTRIN_UNREACHABLE`).
+Give it its own `CgTarget` hook + its own IR op (recorder + opt), and move the 5 backends'
+`INTRIN_UNREACHABLE` handling onto it. (Terminators are first-class: ret, unreachable, jump,
+branch, computed_goto, tail-call.)
+
+### 4c. Façade intrinsics: query + implement the trivial ones
 `api_map_intrinsic` maps ~16 enumerators (`FMA`, `SYSCALL`, all `IRQ_*`, `DMB`/`DSB`/`ISB`,
 `DCACHE_*`/`ICACHE_*`, `CPU_NOP`/`CPU_YIELD`/`WFI`/`WFE`/`SEV`, `CORO_SWITCH`) → `INTRIN_NONE`,
-and `cfree_cg_intrinsic` turns `INTRIN_NONE` into `compiler_panic("unsupported intrinsic")`
-(`arith.c:884`). The toy frontend calls them in good faith (`builtins.c:507`); the
-expected-error test `test/toy/err/unsupported_cpu_nop.toy` confirms the panic is the
-*current intended behavior*. `CFREE_CG_FP_REM` is the same (`arith.c:573`). And unlike
-call-convs/symbol-features, there is **no `supports_` query for intrinsics**, so a
-frontend cannot check before it panics.
-
-**Action:**
-1. Add `cfree_cg_target_supports_intrinsic(CfreeCompiler*, CfreeCgIntrinsic)`, consistent
-   with `cfree_cg_target_supports_call_conv`/`_symbol_feature`.
-2. Convert the bare `compiler_panic` into a proper unsupported-feature diagnostic.
-3. Implement the trivial single-instruction baremetal/CPU intrinsics on native arches
-   (`cpu_nop`/`cpu_yield`/`wfi`/`wfe`/`sev`/`isb`/`dmb`/`dsb`/`irq_*`) — these are one
-   instruction each and the toy frontend already wants them.
-4. Leave `FMA`/`SYSCALL`/`CORO_SWITCH` reported `false` by the query until implemented;
-   remove `CFREE_CG_FP_REM` (no path, and fp rem is a libcall the frontend can emit).
-
-See §Open decisions #3 (implement-vs-formally-unsupported per intrinsic).
+and `cfree_cg_intrinsic` turns `INTRIN_NONE` into a bare `compiler_panic` (`arith.c`). The toy
+frontend calls them in good faith; `test/toy/err/unsupported_*` encode the panic as current
+behavior. There is **no `supports_` query for intrinsics**.
 
-**Affected:** `cg.h`, `cgtarget.h`, `arith.c`, `control.c`, native backends'
-`intrinsic`, `lang/toy/builtins.c`, `test/toy/err/`.
-**Tests:** add `supports_intrinsic` coverage; convert the toy err-cases that become
-supported into positive smoke cases.
+1. Add `cfree_cg_target_supports_intrinsic(CfreeCompiler*, CfreeCgIntrinsic)` (mirror
+   `cfree_cg_target_supports_call_conv`/`_symbol_feature`). Needs a per-arch capability source.
+2. Convert the bare `compiler_panic` into a proper unsupported-feature diagnostic.
+3. Implement the trivial single-instruction baremetal/CPU intrinsics on the native arches
+   (`cpu_nop`/`cpu_yield`/`wfi`/`wfe`/`sev`/`isb`/`dmb`/`dsb`/`irq_*`) — one instruction each;
+   convert the corresponding `test/toy/err/` cases to positive smoke cases.
+4. Leave `FMA`/`SYSCALL`/`CORO_SWITCH` reported `false` until implemented.
 
----
+Also settle: keep `memcpy`/`memset` as dedicated *public* ops (they carry rich `MemAccess`)
+but stop double-modeling them as a separate public *intrinsic* surface.
 
-## Track 5 — Expose multi-result publicly
-
-The internal stack is already multi-result: `CGCallDesc`/`CGFuncDesc`/`ret` carry
-`nresults`/`nvalues`, and backends realize `>1` via `plan_call`/`plan_ret` (no backend
-asserts ≤1). But the public API tops out at one: `CfreeCgFuncSig` has a single `ret`
-(`cg.h:102`), `session.c:318-324` fills `fn_result_types[1]` with 0 or 1, and
-`cfree_cg_call`/`call_symbol` push exactly one result (`call.c:228,287,161`), `cfree_cg_ret`
-pops one (`call.c:316`). **Decision: expose it.** Because backends already handle it,
-this is a public-API + type-system + `value.c` change with **no backend work**.
-
-### API shape
-```c
-/* Symmetric with CfreeCgFuncParam. */
-typedef struct CfreeCgFuncResult { CfreeCgTypeId type; CfreeCgAbiAttrs attrs; } CfreeCgFuncResult;
-
-typedef struct CfreeCgFuncSig {
-  const CfreeCgFuncResult* results; /* was: CfreeCgTypeId ret; CfreeCgAbiAttrs ret_attrs; */
-  uint32_t nresults; /* 0 = void */
-  const CfreeCgFuncParam* params;
-  uint32_t nparams;
-  CfreeCgCallConv call_conv;
-  bool abi_variadic;
-} CfreeCgFuncSig;
-```
-- Type queries: replace `cfree_cg_type_func_ret`/`_ret_attrs` with
-  `cfree_cg_type_func_nresults` + `cfree_cg_type_func_result(idx)`.
-- Type system: `CgType.func` stores `results[]`+`nresults`; interning (`type.c:344`) and
-  `cg_type_func_ret_id` (`type.c:268,827`) updated.
-- `CfreeCg`: `fn_ret_type`/`fn_result_types[1]` → a small results array.
-- **Stack-order convention (must be specified):** results are pushed by `cfree_cg_call`
-  in declaration order, so TOS is the last result; `cfree_cg_ret` pops `nresults` values
-  expecting the same order (last result on top). Document this on both calls.
-- `void` is `nresults==0`; **`cfree_cg_ret_void` is removed** (decision #4): a void
-  function returns via `cfree_cg_ret` with 0 results — one return entry point.
-
-**Affected:** `cg.h`, `type.c`/`type.h`, `session.c`, `call.c`, every frontend's
-func-type construction and `cfree_cg_type_func_ret` caller (C/toy/wasm adapters), wasm
-backend can now surface true multi-value returns; every `cfree_cg_ret_void` caller
-migrates to a 0-result `cfree_cg_ret`.
-**Tests:** new `test-cg-api` + toy cases returning 2 values; wasm multi-value smoke.
+**Affected:** `cg.h`, `cgtarget.h`, `arith.c`, `control.c`, native backends' `intrinsic`,
+`lang/toy/builtins.c`, `test/toy/err/`.
 
 ---
 
 ## Track 6 — Isolate and complete the semantic peephole
 
-The semantic layer is also a `-O0` peephole optimizer, and that is **a feature we keep**
-(free `-O0` perf, Principle 6). This track gives it a named home and restores the half
-that was switched off.
+The semantic layer is also a `-O0` peephole optimizer — a **kept feature** (Principle 6).
 
 ### Current state
-- **Live:** constant folding (`api_try_fold_int_binop`/`_unop`/`_cmp`, driven from
-  `arith.c:44,126,171`) and the `SV_CMP` fused-compare-into-branch path
-  (`api_make_cmp`/`api_materialize_cmp_to`/`api_branch_if`).
-- **Disabled (not dead-by-design):** the `SV_ARITH` delayed-arith subsystem, gated by
-  `api_can_delay_int_arith()==0`. It was live until `a126bec` flipped it off to ship the
-  EA rider (Track 1b). Track 7 removes that rider.
-- **Live:** scalar store-to-load forwarding (`api_local_const_*`, `value.c:939-1036`).
+- **Live:** constant folding (`api_try_fold_int_binop`/`_unop`/`_cmp`, from `arith.c`) and
+  the `SV_CMP` fused-compare-into-branch path (`api_make_cmp`/`api_materialize_cmp_to`/
+  `api_branch_if`).
+- **Disabled (not dead):** the `SV_ARITH` delayed-arith subsystem, gated by
+  `api_can_delay_int_arith()==0`. It was live until commit `a126bec` flipped it off to ship
+  the EA rider; **Track 7 removes that rider**, so re-enabling is clean.
+- **Live:** scalar store-to-load forwarding (`api_local_const_*`).
 
 ### Action
 1. **6.2 — Extract the live peephole into `src/cg/fold.c` + `fold.h`** with a documented
-   contract: integer fold helpers, the `SV_CMP` lifecycle (make/release/materialize/
-   branch-fuse), and const-local forwarding with its invalidation boundaries
-   (`api_local_const_memory_boundary`/`_control_boundary`/`_address_taken`). The op
-   families (`arith.c`/`memory.c`/`control.c`/`call.c`) call into `fold.h` instead of
-   reaching into `value.c` internals. This also settles `ApiSValue`'s shape before Track 7.
-2. **6.3 — Re-enable delayed arith *after* Track 7** (once the EA rider is gone). Restore
-   the original gate (`g && !flags && api_foldable_int_type(...)`), bring
-   `api_make_arith_*`/`api_materialize_arith_to`/`api_release_arith`/the fold-chain +
-   identity-collapse helpers under `fold.c`, and verify the delayed forms now compose with
-   the place/value model (the old conflict was specifically the EA rider). Net `-O0` win:
-   small immediates flow into `binop`, arith chains and identities fold.
-3. **Fix [doc/CODEGEN.md](../CODEGEN.md)** to match the restored, isolated peephole.
-
-**Affected:** `value.c`, `arith.c`, `internal.h`, new `fold.c`/`fold.h`, `doc/CODEGEN.md`.
-**Tests:** `-O0` smoke + opt suites; snapshot-diff to confirm the peephole *improves*
-`-O0` codegen (const-fold, fused compare, delayed arith) with no `-O1+` regression.
+   contract: integer fold helpers, the `SV_CMP` lifecycle, and const-local forwarding with
+   its invalidation boundaries (`api_local_const_memory_boundary`/`_control_boundary`/
+   `_address_taken`). Op families call into `fold.h` instead of reaching into `value.c`
+   internals. **This settles `ApiSValue`'s shape — do it before Track 7, and it eases the
+   Track 2 binop/cmp split (the fold layer is the entangled part there).**
+2. **6.3 — Re-enable delayed arith *after* Track 7.** Restore the gate
+   (`g && !flags && api_foldable_int_type(...)`); bring `api_make_arith_*`/
+   `api_materialize_arith_to`/`api_release_arith`/the fold-chain + identity-collapse helpers
+   under `fold.c`; verify they compose with the place/value model.
+3. **Fix [doc/CODEGEN.md](../CODEGEN.md)** to match the restored, isolated peephole (it
+   currently documents delayed arith as live).
 
 ---
 
-## Track 7 — Strict place/value discipline (the centerpiece)
+## Track 7 — Strict place/value discipline (the centerpiece, UNTOUCHED)
 
-**Decided:** Model B (explicit place/value kinds); wide-16 scalars are *values*.
+**Decided:** Model B (explicit place/value kinds); wide-16 scalars are *values*. (Track 3c —
+the `scale` vs `log2_scale` rider mismatch — is subsumed here: the `CfreeCgEffAddr` rider is
+removed entirely.)
 
 Today the value stack carries an **inferred** lvalue/rvalue distinction and several ops
-accept multiple operand shapes and dispatch on type + shape. A stack slot's meaning is
-*computed*, not declared. The inference points:
-
-- **`api_is_lvalue_sv` is a heuristic** (`value.c:176-180`): ORs the `lvalue` flag,
-  `bitfield_lvalue`, `api_operand_can_address`, and `source_local!=NONE && OPK_LOCAL`.
-- **`cfree_cg_load` has ~7 behaviors, several of which don't load** (`memory.c:436-568`):
-  aggregate-lvalue@0 re-pushed as-is; ptr-rvalue-to-aggregate re-pushed; `OPK_GLOBAL`
-  aggregate/wide16 flips `lvalue=1`; scalar-local returns the local value directly;
-  wide16 keeps storage; then two general lvalue/ptr-rvalue paths.
+dispatch on type + shape. Inference points to remove:
+- **`api_is_lvalue_sv` is a heuristic** (`value.c`): ORs `lvalue`, `bitfield_lvalue`,
+  `api_operand_can_address`, `source_local!=NONE && OPK_LOCAL`.
+- **`cfree_cg_load` has ~7 behaviors, several of which don't load** (`memory.c`).
 - **`load`/`store` `base` accepts 4 shapes** ({lvalue, ptr-rvalue} × {no-index, indexed});
-  there is **no explicit deref** — a pointer base is silently dereferenceable.
-- **`cfree_cg_index` infers pointer-vs-array-lvalue** (`control.c:849-860`);
-  **`cfree_cg_field` infers record-lvalue-vs-pointer** (`control.c:941-952`).
-- **Aggregates are implicitly by-reference and CG decides it** (`call.c:18-42,101-106,
-  310-315`): the frontend never says "pass by reference"; CG infers it from
-  `cg_type_is_aggregate`.
-- **wide16 (i128/f128) is special-cased** as aggregate-like throughout (`memory.c:504-533`,
-  `call.c:53-66`).
+  there is no explicit deref.
+- **`cfree_cg_index` / `cfree_cg_field` infer** pointer-vs-array / record-vs-pointer.
+- **Aggregates are implicitly by-reference, CG decides it** (`call.c`).
+- **wide16 (i128/f128) is special-cased** as aggregate-like (`memory.c`/`call.c`/`wide.c`). ### The discipline -Every stack entry is exactly one explicit, type-checked kind — no heuristic: +Every stack entry is exactly one explicit, type-checked kind: +- **PLACE** — addressable location of a typed object (`OPK_LOCAL`/`OPK_GLOBAL`/ + `OPK_INDIRECT(base+index*scale+off)`). +- **VALUE** — a scalar rvalue: integers, floats, **pointers, and i128/f128**. -- **PLACE** — an addressable location of a typed object. Representation = the existing - addressable operands (`OPK_LOCAL` / `OPK_GLOBAL` / `OPK_INDIRECT(base+index*scale+off)`). -- **VALUE** — a scalar rvalue: integers, floats, **pointers, and now i128/f128**. - -CG keeps owning **layout** (field offsets, element sizes, types — deterministic -computation from the record/array type). What it stops doing is **guessing the kind or -passing-mode of a stack value**. Every op declares the kinds it consumes/produces and -panics on mismatch. +CG keeps owning **layout** (field offsets, element sizes, types). It stops guessing the kind +or passing-mode of a stack value. Every op declares the kinds it consumes/produces and panics +on mismatch. ### Op signatures (strict, single-shape) | Op | Consumes | Produces | Notes | @@ -364,96 +273,65 @@ panics on mismatch. | `push_local_addr l` | — | VALUE (ptr) | sugar for `push_local; addr` | | `addr` | PLACE | VALUE (ptr) | address of the place | | **`deref`** (NEW) | VALUE (ptr) | PLACE | the explicit ptr→place transition | -| `field i` | PLACE(record) | PLACE(field) | offset/type from layout; for `->` do `deref; field` | -| `elem` (was `index`) | VALUE(ptr to T) + index VALUE | PLACE(T) | `*(p+i)`; scale = `sizeof(T)`. 
Array lvalues decay to ptr first | +| `field i` | PLACE(record) | PLACE(field) | offset/type from layout; `->` is `deref; field` | +| `elem` (was `index`) | VALUE(ptr to T) + index VALUE | PLACE(T) | `*(p+i)`; scale=`sizeof(T)`; arrays decay to ptr first | | `load access` | PLACE | VALUE | always dereferences; **no EA rider** | | `store access` | PLACE, VALUE | — | always dereferences | -The **`CfreeCgEffAddr` rider is removed** from `load`/`store`: addressing is built -explicitly by `field`/`elem`/`deref` and absorbed into the `OPK_INDIRECT` place, so the -backend still receives a single `[base+index*scale+off]` memop. The kept fold layer -(Track 6) recovers `-O0` quality: `load` of `PLACE(local)` folds to the local (no memory -round-trip), and a `deref` of a pointer-arith chain folds back into the place's indirect -form. Per decision #8 this recovery is **desirable but not a gate** — Track 7 may land -ahead of the peephole work; `-O1+` carries quality. +The **`CfreeCgEffAddr` rider is removed** from `load`/`store`: addressing is built explicitly +by `field`/`elem`/`deref` and absorbed into the `OPK_INDIRECT` place, so the backend still +gets a single `[base+index*scale+off]` memop. The kept fold layer (Track 6) recovers `-O0` +quality (`load` of `PLACE(local)` → the local; `deref` of a ptr-arith chain → the indirect +place). Per decision #8 this recovery is **desirable but not a gate**. ### Aggregates (values forbidden) -An aggregate is **always a PLACE**; a VALUE of aggregate type is illegal (panic). Reading -an aggregate = keeping its place. Copies are explicit (`memcpy` between two places, or -field-by-field). Call args/returns of aggregate type pass an explicit place, with the -mode named via the ABI attrs that already exist (`SRET`/`BYVAL`/`BYREF`). This removes -the aggregate branches from `api_materialize_call_local`, `api_push_call_result`, and the -aggregate `ret` path — the frontend states the passing mode instead of CG inferring it. 
- -### wide16 (decided: scalar values) -`i128`/`f128` are VALUES like any scalar; the backend lowers 16-byte storage/moves. The -wide16 special paths in `memory.c`/`call.c`/`wide.c` collapse into the normal value path -(plus backend support for 16-byte value moves where not already present). - -### Inference points removed -`api_is_lvalue_sv` (→ a kind tag check); the 7-way `load` cascade (→ one deref+load); -`load`/`store` 4-shape base (→ one PLACE); `index`/`field` dual-mode (→ `elem` on ptr, -`field` on place); aggregate auto-by-ref (→ explicit place + ABI attr); wide16 special -path (→ value path). - -**Affected:** `cg.h` (new `deref`, `elem` rename, EA rider removed from `load`/`store`, -`ApiSValue` kind tag), `value.c` (kind discipline replaces `lvalue`/`api_is_lvalue_sv`), -`memory.c` (load/store rewritten; `fold_ea_into_operand`/`pop_and_normalize_index` -folded into place-building), `control.c` (`index`→`elem`, `field`), `call.c` (aggregate -branches removed), `wide.c` (wide16 path removed), **every frontend** (insert explicit -`deref`/array-decay where they relied on pointer-base load/store; mark aggregate passing -modes). Backends mostly unaffected (they already consume `OPK_INDIRECT`). -**Tests:** this is the highest-blast-radius track — red-green per op, lean on the toy -corpus and C frontend for *correctness*; snapshot `-O0` codegen to *track* addressing-mode -recovery (decision #8: `-O0` quality is not a gate, so a temporary regression does not -block landing). - -## Recommended sequencing - -Each track is independently shippable and testable. Suggested order by risk/leverage and -dependency: - -1. **Track 1 (remove dead/redundant surface: 1a, 1d) + 1c completeness audit.** Pure - subtraction plus filling test gaps; no behavior change. Shrinks the surface. -2. **Track 6.2 (isolate the live fold layer into `fold.c`).** Settles `ApiSValue`'s shape - and makes the fold layer a clean dependency for Track 7. -3. 
**Track 7 (place/value discipline).** The centerpiece; removes the EA rider; depends on - a solid fold layer. Highest blast radius — do it deliberately, red-green. -4. **Track 6.3 (re-enable delayed arith).** Now that Track 7 removed the EA rider that - killed it; free `-O0` perf, under the isolated `fold.c`. -5. **Track 3a/3b (MemAccess unify + bitfield-as-PLACE-subkind).** On the strict - place-based `load`/`store`. (3c was folded into 7; 3b merges via decision #7.) -6. **Track 2 + Track 4 (op/intrinsic vocabulary).** Independent of the above; reshape the - op/intrinsic vocabulary once. -7. **Track 5 (multi-result).** Independent; public + type-system + `value.c` only. - -## Decisions (all resolved — ready to execute) - -1. **`SV_ARITH`: delete or re-enable?** **DECIDED (owner): re-enable** — the vstack - peephole is a kept feature for free `-O0` perf. It was disabled by `a126bec` to ship - the EA rider, which Track 7 removes; restore + isolate under Track 6.2/6.3. -2. **Op enums: one public vocabulary, int/fp split.** `CgTarget` consumes the public - `CfreeCg*` op enums directly; delete internal `BinOp`/`UnOp`/`CmpOp`/`AtomicOp`/ - `MemOrder`/`AsmDir` + all `api_map_*`. Couples the internal contract to public enum - values (accepted for an in-repo contract). -3. **Façade intrinsics: query + implement the trivial ones.** Add - `cfree_cg_target_supports_intrinsic` + a clean diagnostic; implement the - single-instruction baremetal/CPU intrinsics on native arches; report - `FMA`/`SYSCALL`/`CORO_SWITCH` false until built; remove `FP_REM`. -4. **`cfree_cg_ret_void`: fold into `ret`.** Remove `ret_void`; a void function returns - via `cfree_cg_ret` with 0 results — a single return entry point. -5. **`NONTEMPORAL`/`INVARIANT`/alias scopes: remove now.** Drop them from - `CfreeCgMemAccess`; re-add with a real internal carrier + backend consumer when needed. 
- -**Track 7 model (decided earlier):** Model B (explicit PLACE/VALUE kinds + `deref`; -aggregate values forbidden); wide-16 scalars are values; the `CfreeCgEffAddr` rider is -removed. - -6. **`elem` operand shape: pointer VALUE + explicit array-decay.** `elem` consumes a - pointer VALUE (`*(p+i)`); array lvalues decay via an explicit PLACE(array)→VALUE(ptr) - op. One shape, no dual-mode. -7. **Bitfields: PLACE subkind.** A bitfield is a PLACE subkind carrying the descriptor; - the normal `load`/`store` perform the extract/insert. Merges Track 3b into Track 7. -8. **`-O0` quality: not a gate.** Track 7 may land the cleaner semantics even with `-O0` - codegen regressions; `-O1+` carries quality. The vstack peephole (Track 6.3) is still - restored for the free `-O0` win, but it does **not** block Track 7. +An aggregate is **always a PLACE**; a VALUE of aggregate type is illegal (panic). Copies are +explicit (`memcpy` between places, or field-by-field). Call args/returns of aggregate type +pass an explicit place, mode named via existing ABI attrs (`SRET`/`BYVAL`/`BYREF`). Removes +the aggregate branches in `api_materialize_call_local`, `api_push_call_result`, and the +aggregate `ret` path. + +### wide16 (scalar values) +`i128`/`f128` are VALUES; the backend lowers 16-byte storage/moves. The wide16 special paths +in `memory.c`/`call.c`/`wide.c` collapse into the value path (plus backend 16-byte value-move +support where missing). + +**Affected:** `cg.h` (new `deref`, `elem` rename, EA rider removed, `ApiSValue` kind tag), +`value.c`, `memory.c`, `control.c` (`index`→`elem`, `field`), `call.c`, `wide.c`, **every +frontend** (insert explicit `deref`/array-decay; mark aggregate passing modes). Backends +mostly unaffected (they already consume `OPK_INDIRECT`). **Tests:** highest blast radius — +red-green per op on the toy corpus + C frontend; `-O0` quality is not a gate (decision #8). + +--- + +## Recommended sequencing (remaining) + +1. 
**Track 1c** completeness audit + tests (small, no behavior change). +2. **Track 6.2** — isolate the live fold layer into `fold.c`. Settles `ApiSValue` and is a + clean dependency for both the Track 2 binop/cmp split and Track 7. +3. **Track 2 binop/cmp split** — independent of 6.2 but cleaner after it (shares the fold + layer). Also fixes the lossy FP compare. +4. **Track 7** (place/value) — the centerpiece; removes the EA rider; do it red-green. +5. **Track 6.3** — re-enable delayed arith once Track 7 removed the EA rider. +6. **Track 3b** — bitfield-as-PLACE-subkind, on the strict place-based `load`/`store`. +7. **Track 4** (bswap collapse, `unreachable` hook, `supports_intrinsic`, CPU intrinsics) — + independent; can be done any time. +8. **Track 5 follow-up** — true multi-value at `-O1` (opt `cg_ir_lower`) + wasm, if wanted. + +2, 4, 7 are independent of each other; 6.2 helps 2 and 7; 6.3 and 3b depend on 7. + +## Decisions still governing remaining work + +2. **Op enums: one public vocabulary, int/fp split.** `CgTarget` consumes the public split + enums; delete internal `BinOp`/`UnOp`/`CmpOp` + their `api_map_*`. (Atomic/Order/AsmDir + already done.) +3. **Façade intrinsics: query + implement the trivial ones.** Add `supports_intrinsic` + a + clean diagnostic; implement single-instruction baremetal/CPU intrinsics; report + `FMA`/`SYSCALL`/`CORO_SWITCH` false until built. (`FP_REM` already removed.) +6. **`elem` operand shape: pointer VALUE + explicit array-decay.** +7. **Bitfields: PLACE subkind** (merges Track 3b into Track 7). +8. **`-O0` quality: not a gate.** Track 7 may land with `-O0` regressions; `-O1+` carries + quality. Track 6.3 still restores the peephole for the free `-O0` win but does not block 7. + +(Decisions 1, 4, 5 are realized: 1 = peephole kept/re-enable under Track 6; 4 = `ret_void` +removed; 5 = NONTEMPORAL/INVARIANT/alias scopes removed.)
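
The Track 7 discipline in the plan above (every stack entry is a declared PLACE or VALUE, and each op checks the kinds it consumes) can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the kit implementation: `Machine`, `Slot`, `SlotKind`, `op_addr`, `op_deref`, `op_load`, and `op_store` are hypothetical names, memory is a flat array of untyped words, and a `kind_error` flag stands in for the "panic on mismatch" the plan describes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout: this is NOT the kit API, only a toy model
 * of an explicit kind tag replacing lvalue/rvalue inference. */
typedef enum { SLOT_PLACE, SLOT_VALUE } SlotKind;

typedef struct {
    SlotKind kind;   /* declared by the op that pushed it, never inferred */
    bool     is_ptr; /* VALUE only: the scalar is a pointer */
    int64_t  bits;   /* VALUE: the scalar; PLACE: word index into memory[] */
} Slot;

enum { STACK_MAX = 16, MEM_WORDS = 8 };

typedef struct {
    Slot    stack[STACK_MAX];
    int     depth;
    int64_t memory[MEM_WORDS];
    bool    kind_error; /* stands in for "panic on mismatch" */
} Machine;

static void machine_init(Machine *m) {
    m->depth = 0;
    m->kind_error = false;
    for (int i = 0; i < MEM_WORDS; i++) m->memory[i] = 0;
}

static void push(Machine *m, Slot s) {
    assert(m->depth < STACK_MAX);
    m->stack[m->depth++] = s;
}

/* Pop one slot and check its declared kind; a mismatch is recorded,
 * never papered over by guessing what the slot "means". */
static bool pop_kind(Machine *m, SlotKind want, Slot *out) {
    assert(m->depth > 0);
    *out = m->stack[--m->depth];
    if (out->kind != want) { m->kind_error = true; return false; }
    return true;
}

static void push_place(Machine *m, int word) {
    assert(word >= 0 && word < MEM_WORDS);
    push(m, (Slot){SLOT_PLACE, false, word});
}

static void push_int(Machine *m, int64_t v) {
    push(m, (Slot){SLOT_VALUE, false, v});
}

/* addr: PLACE -> VALUE(ptr) */
static bool op_addr(Machine *m) {
    Slot s;
    if (!pop_kind(m, SLOT_PLACE, &s)) return false;
    push(m, (Slot){SLOT_VALUE, true, s.bits});
    return true;
}

/* deref: VALUE(ptr) -> PLACE, the only ptr-to-place transition */
static bool op_deref(Machine *m) {
    Slot s;
    if (!pop_kind(m, SLOT_VALUE, &s)) return false;
    if (!s.is_ptr) { m->kind_error = true; return false; }
    push(m, (Slot){SLOT_PLACE, false, s.bits});
    return true;
}

/* load: PLACE -> VALUE; always reads memory (words are untyped here) */
static bool op_load(Machine *m) {
    Slot s;
    if (!pop_kind(m, SLOT_PLACE, &s)) return false;
    push(m, (Slot){SLOT_VALUE, false, m->memory[s.bits]});
    return true;
}

/* store: PLACE, VALUE -> nothing; the VALUE is on top of the PLACE */
static bool op_store(Machine *m) {
    Slot v, p;
    if (!pop_kind(m, SLOT_VALUE, &v)) return false;
    if (!pop_kind(m, SLOT_PLACE, &p)) return false;
    m->memory[p.bits] = v.bits;
    return true;
}
```

In this model a frontend that wants to read through a pointer has to emit `deref` before `load`; handing `load` a pointer VALUE is rejected rather than silently dereferenced, which is the single-shape behavior the op table asks for. Aggregates, bitfield PLACE subkinds, and `OPK_INDIRECT` addressing are deliberately left out of the sketch.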