doc: rewrite CBACKEND.md plan around CGTarget seam - kit

commit d364b563010b67b4db8b36087a75ab7a850db57a
parent 7935e136ee6401e739de61fd9dbb8a1209ef9ab3
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Wed, 20 May 2026 08:01:25 -0700

doc: rewrite CBACKEND.md plan around CGTarget seam

The prior plan referenced src/api/cg.c, which no longer exists (CG was
split into src/cg/*.c), and proposed an ABI storage-shape refactor as
the prerequisite for a C backend. That framing was wrong-level: a C
backend is a new CGTarget implementation selected by output format, not
a new arch and not blocked on changing how aggregate operands are
shaped. The new plan uses the existing virtual_regs=1 substrate, keeps
aggregates address-shaped, and is strictly additive in
src/arch/c_target/. Also documents the target-locked nature of the
emitted C.

Diffstat:
M doc/CBACKEND.md  | 753 +++++++++++++++++++++++++++++++++++++++++++++++++++----------------------------

1 file changed, 487 insertions(+), 266 deletions(-)
diff --git a/doc/CBACKEND.md b/doc/CBACKEND.md
@@ -1,298 +1,519 @@
-# C Source Backend and ABI Storage-Shape Refactor
+# C Source Backend
 
 ## Motivation
 
-cfree's no-deps posture rules out linking against LLVM or GCC's optimizer
-directly. The practical path to "industrial-strength" optimization for cfree
-users is to emit C from the CG layer and hand the result to gcc/clang. A C
-backend lives at the same layer as `arch_impl_aa64`, `arch_impl_x64`, etc.: a
-new `arch_impl_c` with its own `CGTarget` and ABI vtable. Frontends do not need
-to know it exists.
+cfree's no-deps posture rules out linking against LLVM or GCC's optimizer.
+The practical path to "industrial-strength" optimization for cfree users is
+to emit C from CG and hand the result to gcc/clang, which exist on every
+build host we care about. The output is `.c` source, not `.o` bytes; the
+host C compiler does ABI lowering, instruction selection, and register
+allocation. cfree's job is to produce *legal* and *complete* C, not fast C.
 
-GCC/clang-extension C covers what looked like blockers on first read:
+GCC/clang-extension C covers everything cfree CG can express. Concretely:
 
-- inline asm — `IRAsmAux` is already GCC's `asm(tmpl : outs : ins : clobbers)` shape.
+- inline asm — `CfreeCgInlineAsm` is already GCC's
+  `asm(tmpl : outs : ins : clobbers)` shape; emit verbatim.
 - overflow/trap — `__builtin_{add,sub,mul}_overflow`, `__builtin_trap`,
   `__builtin_unreachable`.
 - atomics — `_Atomic` + `<stdatomic.h>` with explicit `memory_order_*`.
-- TLS — `__thread` or `_Thread_local`.
+- TLS — `_Thread_local`.
+- `setjmp`/`longjmp` — `<setjmp.h>`.
+- computed goto / label-as-value — GCC `&&label` extension.
+- `__int128`, `long double` — host C compiler types.
+- bitfields — emit as bit-extract/insert on the storage unit (cfree CG
+  already carries `BitFieldAccess.storage_offset + bit_offset + bit_width`,
+  not the original C field declaration).
+
+## Scope: target-locked, not portable
+
+The emitted C is **target-locked**: it must be compiled for the same triple
+that `cfree --target=` selected. Compile it for a different triple and it
+may silently misbehave.
+
+Cause: CG flattens semantic lvalue chains to `(base_reg, byte_offset)`
+before any backend sees them. `cfree_cg_field(g, field_index)` becomes
+`OPK_INDIRECT(reg, ofs=12)` at the vtable; the field identity is gone. The
+offset `12` was computed using the cfree-selected target's
+`abi_cg_record_layout`. If the downstream C compiler assumes a different
+layout, the access is wrong. Same story for array indexing, struct sizes,
+and pointer arithmetic.
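 
 To make the flattening concrete, here is a minimal sketch of what a field
 read looks like once only `(base, byte_offset)` survives. The struct `Pt`,
 the helper `load_i32_at`, and the use of `offsetof` as a stand-in for the
 layout that `abi_cg_record_layout` would have computed are illustrative
 assumptions, not cfree code.
 
 ```c
 #include <stddef.h>
 #include <stdint.h>
 #include <string.h>
 
 /* Hypothetical record; stands in for any frontend struct. */
 struct Pt { char tag; double weight; int32_t count; };
 
 /* What a flattened access boils down to: a typed load at base + offset.
  * memcpy keeps the sketch free of alignment and aliasing assumptions. */
 static int32_t load_i32_at(const void *base, size_t byte_offset) {
   int32_t v;
   memcpy(&v, (const char *)base + byte_offset, sizeof v);
   return v;
 }
 
 /* The offset is fixed when the C is emitted. offsetof plays the role of the
  * layout computed for the selected target; in emitted C it would appear as
  * a literal, so a compiler assuming a different layout reads other bytes. */
 static int32_t read_count(const struct Pt *p) {
   return load_i32_at(p, offsetof(struct Pt, count));
 }
 ```
 
 The field name `count` never reaches the backend in this picture; only the
 number that `offsetof` evaluates to does.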
+
+This is the same trade-off LLVM IR makes (datalayout-locked). It does *not*
+limit usefulness for the stated goal — "industrial-strength optimization
+via the host toolchain" — because the user already controls the triple at
+cfree invocation. Concretely supported:
+
+- `cfree --target=x86_64-linux --emit=c foo.c | {gcc,clang,tcc} -O3 -c`  ✓
+- moving that `.c` to a different-arch host and recompiling           ✗
+
+Producing genuinely portable C source would require a separate emission
+path in the C frontend (`lang/c/`), above CG, where field/element identity
+is still alive. That is a different project from this one. If "portable C
+as a deliverable" ever becomes a goal, expect a new doc, not an extension
+of this plan.
+
+## Where the C backend plugs in
+
+A C backend is *not* a new arch in the sense `arch_impl_x64` is. The eventual
+machine code still runs on the host triple — x86_64, aarch64, rv64. What
+changes is the *form of CG output*: text instead of object bytes. So the
+seam is not `ArchImpl`; it is `CGTarget`.
+
+The two relevant abstractions in `src/arch/arch.h`:
+
+- `MCEmitter` writes bytes to an `ObjBuilder` section. Per-arch concrete
+  backends call into it from each `binop`, `load`, `store`, etc.
+- `CGTarget` is the vtable CG calls — ~50 methods covering function
+  lifecycle, frame slots, data movement, arithmetic, calls/returns,
+  intrinsics, atomics, inline asm, varargs, scopes, source locations.
+
+A C backend is a new `CGTarget` implementation that:
+
+1. Ignores `MCEmitter` and writes C source to a `CfreeWriter` instead.
+2. Inherits the host's `ABIVtable` only for `sizeof`/`alignof`/`record_layout`.
+   It does **not** consult ABI classification for arg routing — gcc will
+   re-do that on the emitted C.
+3. Sets `virtual_regs = 1` so CG hands out fresh, unbounded `Reg` ids; each
+   id becomes a unique C local variable.
+
+The arch identity (`CFREE_ARCH_X86_64` etc.) is preserved end-to-end so type
+sizes and struct layouts match the downstream gcc invocation.
+
+### Selection
+
+Add a new `CfreeObjFmt` variant or a `CodeOptions` flag — `emit_c_source`.
+When set, `cfree_cg_new` constructs the C `CGTarget` instead of dispatching
+through `arch_impl_*.cgtarget_new`. Concretely: branch in
+`src/arch/cgtarget.c:cgtarget_new` (currently the only call site) on the
+new flag and return a C-source `CGTarget`. The `MCEmitter` is still
+constructed (the CG holds a pointer) but receives no calls from the C target.
+
+The downstream driver workflow: `cfree --emit=c foo.c -o foo.c.cfree.c`, then
+the user runs `cc -O2 foo.c.cfree.c`. No object format coupling at the
+cfree boundary.
+
+## Why the prior plan's framing was wrong
+
+The previous version of this doc proposed an "ABI Storage-Shape Refactor":
+add an `ABIStorageShape` enum, make `api_arg_storage_must_be_addr` consult an
+ABI helper, then write a "trivial C ABI" vtable that classifies everything
+as `DIRECT/1-full-part`. Reasons that's the wrong tool:
+
+1. **The CG vtable surface is the real work, not the predicate.** Even if
+   `api_pack_call_arg` produced a value-shaped storage for an aggregate, the
+   C target still needs implementations for ~50 `CGTarget` methods. The
+   predicate refactor would save *one* address-shaped path at the call site
+   and gain nothing for the rest.
+
+2. **`Operand` cannot hold an aggregate by value.** `OpKind` is
+   `IMM/REG/LOCAL/GLOBAL/INDIRECT`. None of those carry a struct value. So
+   the proposed `ABI_STORAGE_VALUE` for a single-part DIRECT aggregate would
+   yield a malformed `Operand` (e.g. `OPK_LOCAL` with frame-slot id but no
+   actual register class for the struct). Native backends — which the prior
+   plan promised would see byte-identical output — would actually break,
+   because their `T->call` path reads `desc.args[i].storage.kind` and is not
+   prepared for "REG that's actually 256 bytes wide".
+
+3. **For C output, the aggregate-via-address shape is fine.** Given an
+   aggregate arg as `OPK_LOCAL slot_3` (address of a frame slot), the C
+   target emits `f(*(struct T*)slot_3)` or, better, `f(slot_3)` where
+   `slot_3` is already typed as `struct T` in the emitted code. No new
+   Operand kind, no ABI invariant change, no native-backend regression risk.
+
+4. **The wide16 / SysV-x64 i128 discussion is unrelated.** GCC accepts
+   `__int128` and `long double` natively. The C backend emits the source
+   type, gcc does the rest. Fixing native ABI classifiers for i128 is real
+   work but is not on the C-backend critical path.
+
+5. **The line numbers in the prior plan are dead.** The CG layer was split
+   from `src/api/cg.c` (gone) into `src/cg/{call,value,memory,...}.c`. Every
+   `src/api/cg.c:NNNN` reference in the old doc points to nothing. The "Prep
+   A" / "Prep B" landings did happen but the helpers now live in
+   `src/cg/call.c` and `src/cg/value.c`.
+
+So we discard the storage-shape framing and plan the actual work directly.
+
+## Architecture sketch
 
-The real blocker is one layer up: the CG layer makes aggregate-passing
-decisions that bypass the ABI vtable. A trivial "C ABI" — every arg
-`ABI_ARG_DIRECT` with one full-coverage part — would still see the CG layer
-materialize aggregates as addresses and allocate sret slots.
-
-This document plans the refactor that makes those decisions ABI-driven, so a
-trivial C ABI vtable produces value-shaped storage suitable for emitting
-`ret = f(a, b, c)` C source.
-
-## Current State
-
-### ABI Vtable Selection
-
-Native ABIs already classify small aggregates as `ABI_ARG_DIRECT` with multiple
-parts (e.g. SysV-x64 splits a 16B struct into two `ABI_CLASS_INT` parts in
-`src/abi/abi_sysv_x64.c:53-71`). Large aggregates classify as
-`ABI_ARG_INDIRECT`. The ABI vtable is selected per-target via
-`ArchImpl.abi_vtable` (`src/arch/arch.h:899`) and dispatched through
-`abi_init` → `select_vtable` (`src/abi/abi.c:176`).
-
-### Preparatory Refactors (landed)
-
-Two preparatory passes shaped `src/api/cg.c` so the functional change can be
-small and confined to a couple of helper bodies:
-
-**Prep A — central predicate.** Added a single helper that today encodes the
-type-shape decision; future change rewrites only its body.
-
-```c
-/* src/api/cg.c:1323 */
-static int api_arg_storage_must_be_addr(Compiler *c, CfreeCgTypeId ty) {
-  return cg_type_is_aggregate(c, ty) || api_is_wide16_scalar_type(c, ty);
-}
+```
++---------------------+
+| frontend (lang/c/)  |  source AST → CG calls
++----------+----------+
+           |
+           v
++---------------------+
+| CfreeCg (src/cg/*)  |  value stack, lvalues, virtual Regs
++----------+----------+
+           | CGTarget vtable
+           v
++---------------------+        +---------------------+
+| arch CGTarget       |   OR   | C-source CGTarget   |
+| (aa64/x64/rv64)     |        | (new, this plan)    |
++----------+----------+        +----------+----------+
+           |                              |
+           v                              v
+     MCEmitter→bytes              CfreeWriter→text
+       ↓ObjBuilder                  ↓.c file
 ```
 
-Used by `api_release_arg_storage` (`src/api/cg.c:2129`) and
-`api_alloc_call_ret_storage` (`src/api/cg.c:6408`).
-
-**Prep B — call-shape helpers.** `cfree_cg_call` and `api_call_symbol_common`
-shared a ~80-line duplicated body. Extracted five helpers and reduced both
-public entry points to thin orchestration:
-
-| Helper                          | Location              | Role                                       |
-| ------------------------------- | --------------------- | ------------------------------------------ |
-| `api_alloc_call_args`           | `src/api/cg.c:6363`   | `avs` array + `avs_in_flight` setup        |
-| `api_pack_call_arg`             | `src/api/cg.c:6374`   | per-arg type resolution + 3-way packaging  |
-| `api_alloc_call_ret_storage`    | `src/api/cg.c:6406`   | return slot vs return register             |
-| `api_release_call_args`         | `src/api/cg.c:6424`   | post-call release loop                     |
-| `api_push_call_result`          | `src/api/cg.c:6432`   | lv/sv push based on storage kind           |
-
-After Prep A+B, the CG-side surface area that needs to change is reduced to
-two helper bodies and one as-yet-unextracted ret packaging function.
+The C target shares the entire frontend pipeline and the CG layer. Only the
+backend differs.
 
-### Remaining Predicate Sites
+### Substrate: virtual_regs
 
-After the prep refactors, the type-shape decisions that still need to become
-ABI-driven live in just three places:
+`CGTarget.virtual_regs = 1` already exists and is used by `opt_cgtarget`.
+Effect in CG (`src/cg/value.c:313`, `:342`):
 
-1. **`api_arg_storage_must_be_addr`** (`src/api/cg.c:1323`) — the central
-   predicate consulted by `api_release_arg_storage` and
-   `api_alloc_call_ret_storage`. Today: `is_aggregate || wide16`.
-2. **`api_pack_call_arg`** (`src/api/cg.c:6374`) — per-arg packaging still
-   has a three-way switch (`api_is_wide16_scalar_type` at line 6387,
-   `cg_type_is_aggregate` at 6392, scalar fall-through). The three branches
-   collapse to "address-shaped" vs "value-shaped" under ABI control.
-3. **`cfree_cg_ret`** (`src/api/cg.c:6636`) — ret packaging still has the
-   same three-way switch inline (`is_aggregate` at 6654, `wide16` at 6662).
-   Not extracted yet because Prep B's scope was call/call_symbol dedupe.
+- `api_regalloc_begin` initializes the regalloc in virtual mode.
+- `api_alloc_reg` mints fresh ids 1, 2, 3, … and never panics.
+- `api_free_reg` is a no-op; spill paths are unreachable.
 
-Together these three are the entire functional surface for Phase 1.
+The C target sets this and is otherwise free of register pool concerns.
+Each minted `Reg` id maps to one declared C local: `uintN_t v17;` or
+`double v23;` keyed on the Operand's `cls` and the source `CfreeCgTypeId`
+carried alongside.
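 
 As a rough illustration of that mapping, the body of a two-argument add
 might come out as below once every minted `Reg` id becomes its own local.
 The `vN` naming follows the text above; the function name `add_emitted`, the
 parameter names, and the exact statement order are hypothetical, not actual
 cfree output.
 
 ```c
 #include <stdint.h>
 
 /* Hypothetical emitted shape: one C local per virtual Reg id. The downstream
  * compiler's own register allocator decides where the values really live. */
 int32_t add_emitted(int32_t p0, int32_t p1) {
   int32_t v1;
   int32_t v2;
   int32_t v3;
   v1 = p0;      /* copy from parameter */
   v2 = p1;      /* copy from parameter */
   v3 = v1 + v2; /* binop */
   return v3;    /* ret */
 }
 ```
 
 The redundant copies are deliberate in this sketch: the goal stated earlier
 is legal and complete C, and an optimizing host compiler removes them.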
 
-## Refactor Plan
+The C target still must implement `get_allocable_regs`,
+`get_phys_regs`, etc. as empty stubs (the CG checks `virtual_regs` and skips
+them); same for `spill_reg`/`reload_reg` (unreachable in virtual mode but
+required by the vtable's non-null-callable contract).
 
-### Invariant to Introduce
+### What about the aggregate-by-address issue?
 
-`CGABIValue.storage` shape is determined by an ABI helper, not by
-`cg_type_is_aggregate`:
+When CG packs a call arg whose type is an aggregate
+(`src/cg/call.c:30`), it materializes an address operand
+(`OPK_LOCAL`/`OPK_GLOBAL`/`OPK_INDIRECT`) referring to a memory image of the
+struct. The C target's `call` method sees `desc.args[i].storage.kind ==
+OPK_LOCAL` (or similar) and the source type `desc.args[i].type` is the
+aggregate.
 
-```c
-/* In src/abi/abi.h */
-typedef enum ABIStorageShape {
-  ABI_STORAGE_VALUE,  /* storage is the value itself (REG/IMM/GLOBAL/LOCAL) */
-  ABI_STORAGE_ADDR,   /* storage is the address of a memory image */
-} ABIStorageShape;
+The emission rule for an aggregate operand:
 
-ABIStorageShape abi_arg_storage_shape(const ABIArgInfo*, u32 type_size);
 ```
+desc.args[i].type = struct S
+desc.args[i].storage.kind = OPK_LOCAL, frame_slot = 7
+  → emit `slot_7` (where slot_7 was declared `struct S slot_7;`)
 
-Rule:
-
-- `ABI_STORAGE_ADDR` iff `kind == ABI_ARG_INDIRECT`, **or**
-  `kind == ABI_ARG_DIRECT && (nparts > 1 || parts[0].src_offset != 0 ||
-  parts[0].size != type_size)`.
-- Otherwise `ABI_STORAGE_VALUE`.
-
-This makes today's native behavior fall out unchanged: small structs
-(multi-part DIRECT) → ADDR; large structs (INDIRECT) → ADDR. Only a
-trivial DIRECT — one part, full coverage, zero offset — produces VALUE,
-which is exactly what the C target will register.
+desc.args[i].storage.kind = OPK_INDIRECT, base = vN, ofs = K
+  → emit `(*(struct S*)((char*)vN + K))`
+```
 
-### Touch List
+Both are valid C. The first is what the common case looks like (caller
+spilled the struct to a named frame slot). No CG change needed.
+
+For *returns* of aggregates, `api_alloc_call_ret_storage` (`src/cg/call.c:44`)
+allocates a fresh frame slot whenever `api_arg_storage_must_be_addr` reports
+the return type as address-shaped. The C target then sees a frame-slot
+Operand for `desc.ret.storage`; after the call it can either:
+
+- emit `slot_R = f(args);` directly (preferred — gcc handles the aggregate
+  return on its end), or
+- emit `f_into(&slot_R, args);` if the backend chooses to lift aggregate
+  returns out (not required).
+
+Either way no `sret` shim is in the emitted C — gcc figures that out.
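 
 Both operand spellings from the rule above can be checked by hand with
 ordinary C. In this sketch `struct S`, `swap_s`, the `slot_N` and `v4`
 names, and the chosen offset are made-up stand-ins for what the C target
 would name and compute.
 
 ```c
 #include <stdint.h>
 
 struct S { int32_t a; int32_t b; };
 
 /* Callee takes and returns the aggregate by value; the host compiler picks
  * the real calling convention (registers, hidden pointer, and so on). */
 static struct S swap_s(struct S s) {
   struct S r = { s.b, s.a };
   return r;
 }
 
 /* OPK_LOCAL-style operand: the frame slot is a typed local, passed by name.
  * The aggregate return is a plain assignment into another slot. */
 static int32_t via_local(void) {
   struct S slot_7 = { 1, 2 };
   struct S slot_8;
   slot_8 = swap_s(slot_7);
   return slot_8.a * 10 + slot_8.b;
 }
 
 /* OPK_INDIRECT-style operand: base register plus byte offset, dereferenced
  * as the aggregate type at the call site. Emitted C would carry a literal
  * offset; sizeof is used here only to keep the sketch layout-independent. */
 static int32_t via_indirect(void) {
   struct S buf[2] = { { 0, 0 }, { 3, 4 } };
   void *v4 = buf;
   struct S slot_9;
   slot_9 = swap_s(*(struct S *)((char *)v4 + sizeof(struct S)));
   return slot_9.a * 10 + slot_9.b;
 }
 ```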
+
+## CGTarget surface, mapped
+
+Methods in `arch.h:592–850`, grouped by emission strategy:
+
+| Method                | C-source emission                                         |
+| --------------------- | --------------------------------------------------------- |
+| `func_begin`/`_end`   | `static? T name(P0 p0, P1 p1, …) {` … `}`                 |
+| `frame_slot`          | declare `T slot_N;` at function entry                     |
+| `local`               | declare `T loc_N;` at function entry                      |
+| `local_addr`          | `vDST = &loc_N;`                                          |
+| `param`               | already a function parameter — track name mapping         |
+| `spill_reg`/`reload_` | unreachable in virtual_regs mode (no-op stub)             |
+| `label_new`/`_place`  | `Lk:` placement; minted ids → unique label names          |
+| `jump`                | `goto Lk;`                                                |
+| `cmp_branch`          | `if ((vA OP vB)) goto Lk;`                                |
+| `scope_begin/end/...` | C `if`/`{ }` block or `for(;;){ … L_break: ;}`           |
+| `load_imm`            | `vDST = K;`                                               |
+| `load_const`          | static const decl at top; `vDST = rodata_N;`              |
+| `copy`                | `vDST = vSRC;`                                            |
+| `load`                | `vDST = *(T*)addr;` (or `__atomic_load` for atomic)       |
+| `store`               | `*(T*)addr = src;`                                        |
+| `addr_of`             | `vDST = &lv;`                                             |
+| `tls_addr_of`         | `vDST = &tls_sym;` — gcc handles model selection          |
+| `copy_bytes`          | `__builtin_memcpy(dst, src, N)`                           |
+| `set_bytes`           | `__builtin_memset(dst, byte, N)`                          |
+| `bitfield_load`       | `vDST = (T)((load(storage) >> bit_offset) & mask);`       |
+| `bitfield_store`      | `store(storage, (load(storage) & ~(mask<<lo)) \| ((src&mask)<<lo))` |
+| `binop`/`unop`/`cmp`  | direct C operator with appropriate cast for signedness    |
+| `convert`             | C cast, except CV_BITCAST → `__builtin_bit_cast` or memcpy |
+| `call`                | `[vRET = ] fname(args…);` (see aggregate rule above)      |
+| `ret`                 | `return vR;` or `return;`                                 |
+| `alloca_`             | `vDST = __builtin_alloca(size);`                          |
+| `va_start_`/`arg`/etc | `__builtin_va_start(ap, last)` and friends                |
+| `atomic_load`/etc     | `<stdatomic.h>` primitives with explicit memory order     |
+| `intrinsic`           | per-kind: `__builtin_{popcount,ctz,clz,bswap,trap,…}`    |
+| `asm_block`           | re-serialize as GCC `__asm__(tmpl : outs : ins : clob);`  |
+| `set_loc`             | `#line N "file"` directive                                |
+| `finalize`            | flush data definitions, close writer                      |
+| `destroy`             | arena-backed, no-op                                       |
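 
 Several rows of the table can be read together in one small function. The
 following is a hand-written guess at how `load_imm`, `cmp_branch`, `binop`,
 `jump`, the label methods, and `ret` might compose for a counting loop; the
 function name, label names, and variable numbering are hypothetical.
 
 ```c
 #include <stdint.h>
 
 /* Sums 0 .. p0-1 using only the statement shapes listed in the table. */
 int32_t sum_below(int32_t p0) {
   int32_t v1;
   int32_t v2;
   int32_t v3;
   v1 = 0;                  /* load_imm: accumulator */
   v2 = 0;                  /* load_imm: counter */
 L1:
   if ((v2 >= p0)) goto L2; /* cmp_branch: exit when counter reaches p0 */
   v1 = v1 + v2;            /* binop */
   v3 = 1;                  /* load_imm */
   v2 = v2 + v3;            /* binop */
   goto L1;                 /* jump: back edge */
 L2:
   return v1;               /* ret */
 }
 ```
 
 Control flow stays as labels and `goto` in this sketch; recovering `for` or
 `while` is left to the host compiler's own analysis.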
+
+CGCallPlan methods (`plan_call`/`load_call_arg`/etc.) are used by
+`opt_cgtarget`'s lowering pass. The C target advertises `virtual_regs = 1`
+which causes opt to be enabled — but the C target should refuse opt levels
+above 0 (or be wrapped so opt is bypassed). See "Sequencing with opt" below.
+
+## Sequencing with opt
+
+`src/cg/session.c:36-39`:
 
-**`src/abi/abi.h` / `src/abi/abi.c`** — add `ABIStorageShape` enum and
-`abi_arg_storage_shape()`.
+```c
+if (opt_level > 0) {
+  target = opt_cgtarget_new((Compiler*)c, target, opt_level);
+}
+```
 
-**`src/api/cg.c`** — rewrite three function bodies:
+`opt_cgtarget` records IR, runs SSA/DCE/combine/loop passes, then lowers
+through the wrapped target's CGCallPlan + value-emission methods. For the
+C backend this is undesirable: the *whole point* is to defer optimization
+to gcc. opt would just churn.
+
+Decision: when `emit_c_source` is set, force `opt_level = 0` regardless of
+what the caller asked for, with a diagnostic note. The C target then sits
+directly under CG with `virtual_regs = 1`, and CG mints virtual reg ids
+without any IR layer between them.
+
+## Type emission
+
+C source needs each composite type declared (typedef'd) before first use.
+The C target maintains a type-emission worklist:
+
+1. As CG calls into `func_begin`, `frame_slot`, `local`, `param`, `load`,
+   `store`, etc., it carries `CfreeCgTypeId`. The target records every
+   `CfreeCgTypeId` it sees.
+2. At `finalize`, walk the recorded types, topologically order by
+   dependency (pointee, element, field, return, param), and emit:
+   - scalars → use `<stdint.h>` / `_Bool` / `__int128` / `float` / `double` /
+     `long double`.
+   - pointers → `T*`.
+   - arrays → `T (*) [N]` for parameters or `T name[N]` for declarations.
+   - records → `struct ___s_N { … };` with explicit `_Alignas(K)` only if
+     `align_override` requires it.
+   - enums → emit underlying integer type; do not emit C `enum` (cfree CG
+     does not preserve the C-level enum identity at this layer).
+   - aliases → `typedef base_T name;`.
+3. Functions: emit signature with proper calling convention attribute
+   (`__attribute__((sysv_abi))` etc.) only when CG requested non-default;
+   the common case is the host's default ABI, which is exactly what we want.
+
+Bitfield-in-record handling: `cfree_cg_field` already returns byte offsets
+and bit-offset/width separately. We do **not** emit the field as a C
+bitfield. Instead, the record's emitted layout uses opaque storage bytes
+(e.g. one `unsigned char raw[N]` member or just a `uint64_t storage`), and
+the frontend's `bitfield_load`/`_store` calls into CG produce explicit
+mask/shift expressions referencing the storage member. This sidesteps the
+ABI ambiguity of C bitfields entirely.
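 
 The mask/shift expressions mentioned above might look like the following
 for an unsigned field inside a 32-bit storage unit. The helper names
 `bf_load` and `bf_store` and the fixed 32-bit unit are assumptions made for
 the sketch; signed fields would additionally need sign extension, which is
 not shown.
 
 ```c
 #include <stdint.h>
 
 /* Extract `width` bits starting at bit `lo` of the storage unit.
  * Assumes 0 < width < 32 and lo + width <= 32. */
 static uint32_t bf_load(uint32_t storage, unsigned lo, unsigned width) {
   uint32_t mask = (UINT32_C(1) << width) - 1u;
   return (storage >> lo) & mask;
 }
 
 /* Return the storage unit with the field replaced by the low `width` bits
  * of `src`; all bits outside the field are preserved. */
 static uint32_t bf_store(uint32_t storage, unsigned lo, unsigned width,
                          uint32_t src) {
   uint32_t mask = (UINT32_C(1) << width) - 1u;
   return (storage & ~(mask << lo)) | ((src & mask) << lo);
 }
 ```
 
 Because the field position is spelled out as integers, the result does not
 depend on how the host compiler would have packed a C bitfield declaration.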
+
+## Symbol and data emission
+
+`cfree_cg_decl`, `cfree_cg_data_*`, `cfree_cg_const_data` define symbols
+and data. The C target maps these to:
+
+- function decl → `T name(args);` forward declaration at TU top.
+- function defn → emitted by `func_begin/end`.
+- object decl (no body) → `extern T name;`.
+- object defn (data_begin/data_end) → `T name = { … };` constructed from
+  the `data_int`/`data_float`/`data_bytes`/`data_zero`/`data_addr` stream
+  buffered during the data definition. Easiest representation: write the
+  bytes verbatim as `static const unsigned char __sym[N] = { 0xAA, … };`
+  and cast to the typed pointer at uses. Loses readability, gains
+  correctness with arbitrary aggregate initializers and inter-object
+  relocations.
+- `data_addr(target, addend, …)` inside an initializer → cannot be expressed
+  as a static const initializer if the target is in another TU; in that
+  case lift to a runtime initializer (`__attribute__((constructor))`) or
+  fail with a diagnostic. v0 may diagnose; v1 lifts.
+- `data_pcrel`, `data_symdiff` → diagnose; these are link-time concepts
+  with no C-source equivalent. Frontends that need them are not viable C
+  targets.
+- TLS objects → `_Thread_local T name;`.
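+- weak, visibility, COMDAT → `__attribute__((weak))`,
+  `__attribute__((visibility("hidden")))`, and `__attribute__((selectany))`
+  respectively. GCC documents `selectany` only for Windows targets, so COMDAT
+  on ELF or Mach-O may need a different spelling or a diagnostic.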
+- Symbol bind/visibility/flags from `CfreeCgSymbolAttrs` map to gcc
+  attributes; no behavioral change.
+
+## Source locations and debug
+
+`cfree_cg_set_loc` triggers `T->set_loc`. The C target emits
+`#line N "path"` immediately before the next significant emission. With
+`-g` set on the downstream gcc invocation, the resulting `.o` carries
+source-mapped debug info back to the original cfree input. The cfree
+`Debug` producer is not used in this mode (no DWARF emission).
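 
 The effect of the directive is easy to see in isolation. In the sketch
 below the file name `orig.c`, the line number 100, and the function name are
 arbitrary; `__LINE__` is used only to show which line the host compiler
 believes it is on.
 
 ```c
 /* After this directive the next source line is treated as orig.c:100. */
 #line 100 "orig.c"
 int line_seen_by_compiler(void) { return __LINE__; }
 ```
 
 Diagnostics and debug line tables produced by the host compiler use the same
 remapped position, which is what ties the emitted C back to the original
 input.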
+
+## Touchpoints in the existing tree
+
+New files (additions only):
+
+- `src/arch/c_target/` — directory for the C `CGTarget` implementation.
+  - `target.c` — vtable construction; `c_cgtarget_new()` entry point.
+  - `emit.c` — per-method emission bodies.
+  - `types.c` — type worklist and typedef ordering.
+  - `data.c` — data-definition buffering and emission.
+  - `names.c` — Reg/FrameSlot/Local/Label/Sym → C identifier mapping.
+  - `internal.h` — local types and helpers.
+
+Existing files touched (minimal, additive):
+
+- `src/arch/cgtarget.c` — branch on the C-output mode and call
+  `c_cgtarget_new()` instead of dispatching through `ArchImpl.cgtarget_new`.
+- `include/cfree/compile.h` / `include/cfree/core.h` — add `emit_c_source`
+  to `CodeOptions` (or add a `CFREE_OBJ_C_SOURCE` value to `CfreeObjFmt` —
+  decision deferred to v0 scoping, but `CodeOptions` flag is the lighter
+  change since the file is conceptually still a translation unit, just a
+  different surface format).
+- `src/cg/session.c` — when the flag is set, force `opt_level = 0`.
+- `driver/cc/` — accept `--emit=c` and wire it through to `CodeOptions`.
+
+Nothing in `src/cg/*.c`, `src/abi/*`, `src/arch/{aa64,x64,rv64}/`,
+`src/arch/regalloc.c`, or `src/arch/mc.c` needs to change. The C backend is
+strictly additive.
+
+## Things to **not** do
+
+- Do not invent a new `ABIStorageShape` enum or new `OpKind` for value-shaped
+  aggregates. Aggregates stay address-shaped at the CG-Operand boundary; the
+  C target emits `*(T*)addr` and gcc rebuilds value semantics.
+- Do not write a "trivial C ABI vtable". The C target inherits the host
+  arch's ABI vtable for sizeof/alignof and ignores classification at the
+  call site.
+- Do not register `arch_impl_c` in `src/arch/registry.c`. There is no "C
+  arch"; the host arch is unchanged.
+- Do not try to make `opt_cgtarget` lower into the C target. Bypass opt.
+
+## Test surface
+
+Add a new test directory `test/cbackend/`. Tests compile a cfree CG fixture
+(or run the C frontend on a `.c` corpus from `test/parse/` and similar),
+emit C source via the new target, then compile that C source with the host
+`cc` and run the resulting binary, asserting the same exit code or stdout
+as a reference run.
+
+Test tiers, in priority order:
+
+1. **v0 sanity** — one fixture per CG primitive family (int arith, fp arith,
+   load/store, branches, switch, calls, returns, const data, scalar params).
+   Pass criterion: emitted C compiles with `cc -Werror -std=c11` and
+   produces the expected exit code.
+2. **Coverage** — aggregates by value, sret, varargs, bitfields,
+   computed-goto, inline asm, atomics, TLS, weak/visibility, alloca,
+   intrinsics (overflow, trap, popcount, etc.), setjmp/longjmp,
+   wide16 (i128, long double, f128). Each its own fixture file.
+3. **Frontend integration** — run the existing `test/toy/` and `test/parse/`
+   corpora through the C backend and require the resulting binary to match
+   the native-backend binary's behavior.
+4. **Self-hosting smoke** — eventually compile libcfree through libcfree-via-C
+   and check that the bootstrapped artifact still passes its test suite. This
+   is a separate effort; flagged here only to note the long-term shape.
 
-| Site                              | Location              | Change                                                                 |
-| --------------------------------- | --------------------- | ---------------------------------------------------------------------- |
-| `api_arg_storage_must_be_addr`    | `src/api/cg.c:1323`   | body becomes `abi_arg_storage_shape(abi, size) == ABI_STORAGE_ADDR`    |
-| `api_pack_call_arg`               | `src/api/cg.c:6374`   | collapse 3-way switch to `must_be_addr`-driven materialization         |
-| `cfree_cg_ret`                    | `src/api/cg.c:6636`   | same collapse, OR pre-extract a `api_pack_ret_value` helper first      |
+## Phasing
 
-Optionally extract `api_pack_ret_value` from `cfree_cg_ret` as Prep C before
-the functional change, so the three-way collapse lives in helper bodies
-rather than mid-public-function. Small, mechanical, ~20 LOC.
+### Phase 0 — scaffolding
 
-**`src/abi/abi_sysv_x64.c`, `abi_aapcs64.c`, `abi_apple_arm64.c`,
-`abi_rv64.c`** — extend `classify_one`/`classify_scalar` to classify wide16
-scalars (i128, long double) as `ABI_ARG_DIRECT` with multi-part shape. See
-Phase 2 below.
+- Add `CodeOptions.emit_c_source`.
+- Branch in `cgtarget_new` and `session.c`.
+- Stub `c_cgtarget_new` returning a vtable where every method is
+  `compiler_panic("C target: <method> not implemented")`.
+- Wire `--emit=c` in `driver/cc/`.
+- Acceptance: `cfree --emit=c empty.c -o /tmp/x.c` panics with a *specific*
+  unimplemented-method message (not a crash, not silent success).
 
-**Untouched** — other `cg_type_is_aggregate` sites in `cg.c` (lines 1754,
-1782, 3823ff, 3945ff, 5094, 5103). Those handle assignments, lvalue
-conversion, and address-of. They are correctly aggregate-policy, not
-ABI-policy.
+### Phase 1 — minimal viable: scalar arithmetic and calls
 
-**Native backends** — no expected change. They already consult
-`desc.args[i].abi` and the invariant preserves what they see today.
+Implement, in roughly this order, only the methods needed for:
 
-## Phasing
+```c
+int add(int a, int b) { return a + b; }
+int main(void) { return add(2, 3); }
+```
 
-### Phase 0 (done) — preparatory refactors
-
-- Prep A: central `api_arg_storage_must_be_addr` predicate.
-- Prep B: extract `api_alloc_call_args`, `api_pack_call_arg`,
-  `api_alloc_call_ret_storage`, `api_release_call_args`,
-  `api_push_call_result`.
-- Verified by `test-cg-api` (610 pass), `test-opt`, `test-toy`,
-  `test-smoke-x64`.
-
-### Phase 1 — Helper bodies become ABI-driven
-
-- Add `abi_arg_storage_shape()` in `abi.h`/`abi.c`.
-- Rewrite `api_arg_storage_must_be_addr` body to delegate to the new helper
-  (needs `const ABIArgInfo*` and `type_size` — adjust the helper signature
-  accordingly, and pass them through from `api_pack_call_arg` /
-  `api_alloc_call_ret_storage` / `api_release_arg_storage`).
-- Collapse the three-way switches in `api_pack_call_arg` and `cfree_cg_ret`
-  (or extracted `api_pack_ret_value`) into a single `must_be_addr` branch.
-
-Acceptance: `make test-cg-api test-opt test-link test-elf test-toy
-test-smoke-x64 test-smoke-rv64 test-aa64-inline` pass; spot-check `.o`
-outputs on a representative corpus against the current state to confirm
-byte-identical codegen for native ABIs.
-
-### Phase 2 — Migrate wide16 to ABI classification
-
-Today `api_is_wide16_scalar_type` papers over incomplete ABI classifiers in
-some native targets (see Risks below). Phase 2 fixes the classifiers, then
-removes the wide16-specific code path from the predicate.
-
-- Fix SysV-x64 `classify_scalar` to emit DIRECT/2-INT-parts for
-  `ti.size == 16 && ti.scalar_kind == ABI_SC_INT` (the i128 case),
-  matching what RV64 and AAPCS64 already do.
-- Defer long-double-as-FP correctness — long double passes through
-  memory in current cfree even on native targets, and the existing
-  wide16 shortcut effectively forces that. Either retain the
-  `is_wide16` check just for long-double cases (narrow the branch),
-  or introduce a dedicated x87 / 16B-FP ABI class as a separate piece
-  of work. The C-backend refactor does not require this fix.
-- After classifiers are correct, drop the `api_is_wide16_scalar_type`
-  clause from `api_arg_storage_must_be_addr`.
-
-Acceptance: same as Phase 1, plus `test-libc` (long double through
-musl/glibc paths) and any i128 coverage.
-
-### Phase 3 — Negative-control fixture
-
-Add a unit test in `test/api/` that constructs a `Compiler` with a
-synthetic ABI vtable returning trivial DIRECT/one-full-part for everything,
-drives `cfree_cg_call` with an aggregate arg and aggregate return, and
-asserts `desc.args[0].storage.kind != OPK_INDIRECT` and that no sret frame
-slot was allocated.
-
-This fixture locks in the new invariant so future changes cannot
-accidentally regress to always-address-for-aggregate.
-
-### Phase 4 (out of scope here, but enabled by this work)
-
-- Add `arch_impl_c` with a `c_abi_vtable` whose `compute_func_info` returns
-  trivial DIRECT/one-full-part for every arg and return.
-- Stub `cgtarget_new` to a placeholder that records call/ret shapes for
-  inspection.
-- The actual C-source emitter is a separate piece of work, driven by the
-  recorded `CGCallDesc` shape that this refactor now makes value-typed for
-  aggregates.
-
-## Risks and Open Items
-
-Investigated post-plan and after the prep refactors:
-
-- **`api_release_arg_storage` (resolved by Prep A).**
-  Originally a fifth open site; now uses `api_arg_storage_must_be_addr`
-  directly (`src/api/cg.c:2129`). Resolution: same helper drives the
-  decision.
-
-- **`call_symbol` duplication (resolved by Prep B).**
-  Both `cfree_cg_call` and `api_call_symbol_common` now share the five
-  extracted helpers and contain only the call-shape orchestration. Drift
-  is no longer a maintenance concern.
-
-- **`fn_abi` is reliably non-null inside a function body.**
-  Set at `cfree_cg_func_begin` (`src/api/cg.c:3125`) and cleared at
-  `func_end` (line 3149). `cfree_cg_ret` only runs within that window.
-  No null-safe fallback needed.
-
-- **CGCallPlan / backends are already fully ABI-driven.**
-  Grep across `src/arch/` finds zero `cg_type_is_aggregate` references.
-  Every site branches on `ai->kind == ABI_ARG_INDIRECT` or iterates
-  `ai->parts`. Examples: `arch/aa64/ops.c:904`, `arch/x64/alloc.c:54`,
-  `arch/rv64/opt_coord.c:178`. The new invariant preserves what native
-  backends observe (multi-part DIRECT aggregates still produce ADDR
-  storage), so backends do not change.
-
-- **Wide16 classification is incomplete in some native ABIs — this is
-  the biggest finding and Phase 2's largest hidden cost.**
-  Today the wide16 check in `api_arg_storage_must_be_addr` papers over
-  bugs in the underlying ABI classifiers. Per-target status:
-
-  - **RV64** (`src/abi/abi_rv64.c:23-43`): correctly classifies 16B
-    INT or FLOAT scalars as DIRECT with two 8B INT parts. ✓
-  - **AAPCS64** (`src/abi/abi_aapcs64.c:23-39`): correctly classifies
-    16B INT scalars (i128) as DIRECT/2-parts. **Missing**: 16B FP
-    (long double on ARM64) should be DIRECT with one or two FP parts
-    in Q-registers, not fall through to single 16B INT part.
-  - **Apple ARM64** (`src/abi/abi_apple_arm64.c`): delegates to AAPCS64;
-    inherits the same long-double gap.
-  - **SysV-x64** (`src/abi/abi_sysv_x64.c:28-44`): **no 16B branch
-    at all**. i128 currently falls through to a single 16B INT part —
-    malformed because no GPR can hold it. Long double is 80-bit x87
-    with 16B alignment and needs a target-specific class entirely.
-    The wide16 clause in `api_arg_storage_must_be_addr` hides both bugs
-    by always routing wide16 through a memory image.
-
-  **Consequence**: if Phase 2 drops the wide16 clause before fixing the
-  SysV-x64 and AAPCS64 classifiers, native codegen breaks. The new
-  `abi_arg_storage_shape` would compute VALUE for a malformed single-part
-  DIRECT (one part, `src_offset==0, size==type_size==16`), but no Operand
-  kind can hold a 16B value.
-
-- **HFA / HVA in AAPCS64**: the existing classifier explicitly defers
-  HFA refinement (see comment at `src/abi/abi_aapcs64.c:9` and `:68-69`).
-  Small aggregates today classify uniformly as DIRECT/INT-parts. Wide16
-  classification (i128) does not collide with HFA logic because the two
-  enter `classify_one` through disjoint type kinds (RECORD vs scalar).
-  Confirmed safe.
-
-- **`tail` interaction**: the tail-call path
-  (`src/api/cg.c:6497-6498`) calls `api_regalloc_finish` before
-  `T->call`, which can mutate live storage state. The storage-shape
-  helper is queried per-arg during pre-call packaging, before this
-  finish call, so the decision sequencing is unchanged. No additional
-  risk.
-
-## Estimated Size
-
-- Phase 0 (done): Prep A (~25 LOC) + Prep B (~95 LOC of helpers, ~120 LOC
-  of duplicate body deleted from `cfree_cg_call` / `api_call_symbol_common`).
-- Phase 1: ~20 LOC for `abi_arg_storage_shape` + rewriting three function
-  bodies in `cg.c` (signature changes to thread `ABIArgInfo*` + size into
-  the helpers).
-- Optional Prep C (extract `api_pack_ret_value` from `cfree_cg_ret`): ~20 LOC.
-- Phase 2a (i128 classification fix): ~30 LOC in `abi_sysv_x64.c` +
-  removing the i128 path from the wide16 clause. ~50 LOC total.
-- Phase 2b (long-double, optional / deferable): not required for the C
-  backend. Treat as separate work.
-- Phase 3 (negative-control fixture): one ~150 LOC test file.
-- Total remaining for C-backend prerequisite: ~250 LOC, no public API change.
+- `func_begin`/`func_end`, `param`, `ret`, `binop`, `load_imm`, `copy`,
+  `call`, plus minimal type emission for `int`/`void`.
+- Identifier-mapping helper (Reg, FrameSlot, Label, Sym → C name).
+- Writer plumbing (the C target owns a `CfreeWriter` set at construction).
+
+Acceptance: the example above round-trips through the C target; compiling
+the output with `cc` and running it yields exit code 5.
+
+### Phase 2 — control flow and memory
+
+- `cmp`, `cmp_branch`, `label_*`, `jump`, `scope_*`.
+- `load`/`store`/`addr_of`/`indirect`, `local`, `frame_slot`, `local_addr`.
+- `convert` for the integer/float conversions; `bitcast` via memcpy.
+
+Acceptance: corpora that exercise loops, conditionals, locals, and pointer
+chasing build and run correctly through the C target.
+
+### Phase 3 — aggregates, varargs, intrinsics
+
+- Type emission for records, arrays, function types.
+- Call/return for aggregate types (address-operand emission rule).
+- `va_start_`/`va_arg_`/`va_end_`/`va_copy_`.
+- `intrinsic` for the full IntrinKind set.
+- `copy_bytes`/`set_bytes`.
+
+### Phase 4 — atomics, asm, TLS, exotic features
+
+- Atomic load/store/rmw/cas/fence.
+- `asm_block` re-serialization.
+- TLS, weak, visibility, `_Alignas`.
+- `bitfield_load`/`_store`.
+- `data_*` definition emission including const_data.
+- `tls_addr_of`, `alloca_`.
+
+### Phase 5 — quality
+
+- Better identifier names (use `Sym name` where available instead of `vN`).
+- Optional: emit C struct field accesses instead of `*(T*)((char*)p + ofs)`
+  when the offset corresponds to a known record field.
+- `#line` directives.
+- C-target-specific diagnostics for unsupported combinations
+  (`data_pcrel`, `data_symdiff`, computed goto across functions, etc.).
+
+## Estimated size
+
+Order-of-magnitude:
+
+- Phase 0: ~150 LOC.
+- Phase 1: ~400 LOC (target.c skeleton + emit.c primitives + names.c +
+  minimal type emission).
+- Phase 2: ~600 LOC.
+- Phase 3: ~800 LOC (type worklist is most of it).
+- Phase 4: ~600 LOC.
+- Phase 5: ongoing.
+
+Plus test fixtures and a `test-cbackend` make target — roughly 1k LOC of
+test infrastructure to start.
+
+Total to a credibly useful C backend: ~3k LOC, isolated to
+`src/arch/c_target/` plus the four small touchpoints listed earlier. No
+modifications to CG, ABI, regalloc, or existing arch backends.
+
+## Open questions
+
+- **Aggregate returns in C source**: should the C target emit
+  `slot_R = f(args)` for aggregate returns (relying on the host compiler
+  to apply the ABI's return convention) or pre-lower to
+  `f_into_buf(&slot_R, args)`? The first is simpler, and the host compiler
+  performs the equivalent lowering itself. Default to the simple form;
+  revisit if it produces bad codegen.
+- **Output format flag location**: `CodeOptions.emit_c_source` (boolean)
+  vs `CFREE_OBJ_C_SOURCE` (enum extension). The latter forces every code
+  path that switches on `obj_fmt` to know about C-source; the former is a
+  narrower addition. Lean toward the boolean.
+- **Multi-TU emission**: cfree compiles one TU at a time today. The C
+  backend follows the same model — one `.c` source out per `.c` source in.
+  Cross-TU LTO is gcc's job downstream.
+- **Floating-point reproducibility**: cfree's FP-flag enum
+  (`CfreeCgFpFlag.REASSOC`/`APPROX`/…) maps to gcc's
+  `-ffast-math`-style behavior, but per-operation. C doesn't have a per-op
+  syntax for these. Options: ignore the flags (correct but pessimistic),
+  wrap in `#pragma STDC FP_CONTRACT OFF` blocks, or emit
+  `__attribute__((optimize(...)))` on the enclosing function when any
+  flag fires. Probably ignore in v0/v1, document the gap.
+- **i128 division/modulo and f128 ops** are emitted by cfree CG today via
+  calls to `__divti3` / `__multf3` / `__addtf3` etc. The C target can
+  prefer native `__int128` operators and `__float128`/`long double` so
+  gcc inlines them. Detail to validate in Phase 1.
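
As an illustration of the aggregate-return question in the plan above, here is a minimal sketch of the two candidate shapes in plain C. The names `struct triple`, `make_triple`, `make_triple_into_buf`, `slot_R`, and `slot_S` are hypothetical and are not taken from cfree. The struct is deliberately 24 bytes so that common 64-bit ABIs return it through a hidden pointer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical aggregate; 3 x 8 bytes, too large for register return
   on the usual 64-bit calling conventions. */
struct triple { int64_t a; int64_t b; int64_t c; };

/* Simple form: return by value. The host compiler decides how the
   result is actually passed back (registers or a hidden pointer). */
struct triple make_triple(int64_t x)
{
    struct triple r = { x, x + 1, x + 2 };
    return r;
}

/* Pre-lowered form: the caller supplies the destination buffer, so the
   return convention is spelled out in the C source itself. */
void make_triple_into_buf(struct triple *out, int64_t x)
{
    out->a = x;
    out->b = x + 1;
    out->c = x + 2;
}

/* Both call shapes must leave the same bytes in the result slot. */
void selfcheck_aggregate_return(void)
{
    struct triple slot_R;
    struct triple slot_S;

    slot_R = make_triple(4);            /* slot_R = f(args)           */
    make_triple_into_buf(&slot_S, 4);   /* f_into_buf(&slot_R, args)  */

    assert(slot_R.a == slot_S.a);
    assert(slot_R.b == slot_S.b);
    assert(slot_R.c == slot_S.c);
}
```

Calling `selfcheck_aggregate_return()` from a `main` exercises both forms. The simple form keeps the emitter free of ABI knowledge, which fits the plan's stated default. The pre-lowered form would only pay off if the by-value form turned out to cost extra copies.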

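Two emission idioms named in the plan could look roughly like this in emitted C. One is `bitcast` via memcpy (Phase 2). The other is the offset-based access `*(T*)((char*)p + ofs)`, which Phase 5 proposes to replace with field access where the offset matches a known record field. This is a sketch only: the function names, `struct point`, and the `v0`/`v1` locals are hypothetical, and the bit pattern in the self-check assumes IEEE 754 binary32 `float`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(float) == sizeof(uint32_t),
               "bitcast requires equally sized source and destination");

/* bitcast f32 -> u32 through memcpy. Unlike *(uint32_t *)&v0 this does
   not violate strict aliasing; optimizing compilers typically reduce
   the copy to a plain register move. */
uint32_t bitcast_f32_to_u32(float v0)
{
    uint32_t v1;
    memcpy(&v1, &v0, sizeof v1);
    return v1;
}

/* Hypothetical record used to compare the two access styles. */
struct point { int32_t x; int32_t y; };

/* Offset-based load: the shape an emitter can produce without knowing
   field names, only a base pointer, a byte offset, and a scalar type. */
int32_t load_y_by_offset(const struct point *p)
{
    return *(const int32_t *)((const char *)p + offsetof(struct point, y));
}

/* Field-based load: the more readable form when the offset is known to
   correspond to a record field. */
int32_t load_y_by_field(const struct point *p)
{
    return p->y;
}

/* The two loads must agree, and 1.0f must have the binary32 encoding. */
void selfcheck_idioms(void)
{
    struct point pt = { 3, 7 };
    assert(load_y_by_offset(&pt) == load_y_by_field(&pt));
    assert(bitcast_f32_to_u32(1.0f) == UINT32_C(0x3f800000));
}
```

Both loads read the same `int32_t` object, so the choice between them affects only the readability of the generated source, not its meaning.
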