kit

git clone https://git.ryansepassi.com/git/kit.git

commit d364b563010b67b4db8b36087a75ab7a850db57a
parent 7935e136ee6401e739de61fd9dbb8a1209ef9ab3
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Wed, 20 May 2026 08:01:25 -0700

doc: rewrite CBACKEND.md plan around CGTarget seam

The prior plan referenced src/api/cg.c, which no longer exists (CG was
split into src/cg/*.c), and proposed an ABI storage-shape refactor as
the prerequisite for a C backend. That framing was wrong-level: a C
backend is a new CGTarget implementation selected by output format, not
a new arch and not blocked on changing how aggregate operands are
shaped. The new plan uses the existing virtual_regs=1 substrate, keeps
aggregates address-shaped, and is strictly additive in
src/arch/c_target/. Also documents the target-locked nature of the
emitted C.

Diffstat:
M doc/CBACKEND.md | 753 +++++++++++++++++++++++++++++++++++++++++++++++++++----------------------------
1 file changed, 487 insertions(+), 266 deletions(-)

diff --git a/doc/CBACKEND.md b/doc/CBACKEND.md
@@ -1,298 +1,519 @@
-# C Source Backend and ABI Storage-Shape Refactor
+# C Source Backend

 ## Motivation

-cfree's no-deps posture rules out linking against LLVM or GCC's optimizer
-directly. The practical path to "industrial-strength" optimization for cfree
-users is to emit C from the CG layer and hand the result to gcc/clang. A C
-backend lives at the same layer as `arch_impl_aa64`, `arch_impl_x64`, etc.: a
-new `arch_impl_c` with its own `CGTarget` and ABI vtable. Frontends do not need
-to know it exists.
+cfree's no-deps posture rules out linking against LLVM or GCC's optimizer.
+The practical path to "industrial-strength" optimization for cfree users is
+to emit C from CG and hand the result to gcc/clang, which exist on every
+build host we care about. The output is `.c` source, not `.o` bytes; the
+host C compiler does ABI lowering, instruction selection, and register
+allocation. cfree's job is to produce *legal* and *complete* C, not fast C.

-GCC/clang-extension C covers what looked like blockers on first read:
+GCC/clang-extension C covers everything cfree CG can express. Concretely:

-- inline asm — `IRAsmAux` is already GCC's `asm(tmpl : outs : ins : clobbers)` shape.
+- inline asm — `CfreeCgInlineAsm` is already GCC's
+  `asm(tmpl : outs : ins : clobbers)` shape; emit verbatim.
 - overflow/trap — `__builtin_{add,sub,mul}_overflow`, `__builtin_trap`,
   `__builtin_unreachable`.
 - atomics — `_Atomic` + `<stdatomic.h>` with explicit `memory_order_*`.
-- TLS — `__thread` or `_Thread_local`.
+- TLS — `_Thread_local`.
+- `setjmp`/`longjmp` — `<setjmp.h>`.
+- computed goto / label-as-value — GCC `&&label` extension.
+- `__int128`, `long double` — host C compiler types.
+- bitfields — emit as bit-extract/insert on the storage unit (cfree CG
+  already carries `BitFieldAccess.storage_offset + bit_offset + bit_width`,
+  not the original C field declaration).
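Illustrative sketch, not part of the diff: two of the constructs listed in the Motivation bullets, spelled the way a C emitter could plausibly write them. The helper names (`checked_add_i32`, `bf_extract`, `bf_insert`) and the 32-bit storage unit are assumptions made for this example; only `__builtin_add_overflow` and the shift/mask treatment of bitfields come from the plan.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Checked add via the GCC/clang builtin: returns true on signed overflow
 * and stores the (wrapped) result through *out either way. */
static bool checked_add_i32(int32_t a, int32_t b, int32_t *out) {
    return __builtin_add_overflow(a, b, out);
}

/* Bitfield read as an explicit shift and mask on the storage unit.
 * This sketch assumes a 32-bit storage unit and a width of 1..31 bits. */
static uint32_t bf_extract(uint32_t storage, unsigned bit_offset,
                           unsigned bit_width) {
    assert(bit_width >= 1 && bit_width < 32 && bit_offset + bit_width <= 32);
    uint32_t mask = (UINT32_C(1) << bit_width) - 1u;
    return (storage >> bit_offset) & mask;
}

/* Bitfield write: clear the field's bits, then OR in the new value. */
static uint32_t bf_insert(uint32_t storage, unsigned bit_offset,
                          unsigned bit_width, uint32_t value) {
    assert(bit_width >= 1 && bit_width < 32 && bit_offset + bit_width <= 32);
    uint32_t mask = (UINT32_C(1) << bit_width) - 1u;
    return (storage & ~(mask << bit_offset)) | ((value & mask) << bit_offset);
}
```

Because the field is addressed by offset and width rather than by a C bitfield declaration, the result does not depend on how the downstream compiler would lay out bitfields.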
+
+## Scope: target-locked, not portable
+
+The emitted C is **target-locked**: it must be compiled for the same triple
+that `cfree --target=` selected. Compile it for a different triple and it
+may silently misbehave.
+
+Cause: CG flattens semantic lvalue chains to `(base_reg, byte_offset)`
+before any backend sees them. `cfree_cg_field(g, field_index)` becomes
+`OPK_INDIRECT(reg, ofs=12)` at the vtable; the field identity is gone. The
+offset `12` was computed using the cfree-selected target's
+`abi_cg_record_layout`. If the downstream C compiler assumes a different
+layout, the access is wrong. Same story for array indexing, struct sizes,
+and pointer arithmetic.
+
+This is the same trade LLVM IR makes (datalayout-locked). It does *not*
+limit usefulness for the stated goal — "industrial-strength optimization
+via the host toolchain" — because the user already controls the triple at
+cfree invocation. Concretely supported:
+
+- `cfree --target=x86_64-linux --emit=c foo.c | {gcc,clang,tcc} -O3 -c` ✓
+- moving that `.c` to a different-arch host and recompiling ✗
+
+Producing genuinely portable C source would require a separate emission
+path in the C frontend (`lang/c/`), above CG, where field/element identity
+is still alive. That is a different project from this one. If "portable C
+as a deliverable" ever becomes a goal, expect a new doc, not an extension
+of this plan.
+
+## Where the C backend plugs in
+
+A C backend is *not* a new arch in the sense `arch_impl_x64` is. The eventual
+machine code still runs on the host triple — x86_64, aarch64, rv64. What
+changes is the *form of CG output*: text instead of object bytes. So the
+seam is not `ArchImpl`; it is `CGTarget`.
+
+The two relevant abstractions in `src/arch/arch.h`:
+
+- `MCEmitter` writes bytes to an `ObjBuilder` section. Per-arch concrete
+  backends call into it from each `binop`, `load`, `store`, etc.
+- `CGTarget` is the vtable CG calls — ~50 methods covering function
+  lifecycle, frame slots, data movement, arithmetic, calls/returns,
+  intrinsics, atomics, inline asm, varargs, scopes, source locations.
+
+A C backend is a new `CGTarget` implementation that:
+
+1. Ignores `MCEmitter` and writes C source to a `CfreeWriter` instead.
+2. Inherits the host's `ABIVtable` only for `sizeof`/`alignof`/`record_layout`.
+   It does **not** consult ABI classification for arg routing — gcc will
+   re-do that on the emitted C.
+3. Sets `virtual_regs = 1` so CG hands out fresh, unbounded `Reg` ids; each
+   id becomes a unique C local variable.
+
+The arch identity (`CFREE_ARCH_X86_64` etc.) is preserved end-to-end so type
+sizes and struct layouts match the downstream gcc invocation.
+
+### Selection
+
+Add a new `CfreeObjFmt` variant or a `CodeOptions` flag — `emit_c_source`.
+When set, `cfree_cg_new` constructs the C `CGTarget` instead of dispatching
+through `arch_impl_*.cgtarget_new`. Concretely: branch in
+`src/arch/cgtarget.c:cgtarget_new` (currently the only call site) on the
+new flag and return a C-source `CGTarget`. The `MCEmitter` is still
+constructed (the CG holds a pointer) but receives no calls from the C target.
+
+The downstream driver workflow: `cfree --emit=c foo.c -o foo.c.cfree.c`, then
+the user runs `cc -O2 foo.c.cfree.c`. No object format coupling at the
+cfree boundary.
+
+## Why the prior plan's framing was wrong
+
+The previous version of this doc proposed an "ABI Storage-Shape Refactor":
+add a `ABIStorageShape` enum, make `api_arg_storage_must_be_addr` consult an
+ABI helper, then write a "trivial C ABI" vtable that classifies everything
+as `DIRECT/1-full-part`. Reasons that's the wrong tool:
+
+1. **The CG vtable surface is the real work, not the predicate.** Even if
+   `api_pack_call_arg` produced a value-shaped storage for an aggregate, the
+   C target still needs implementations for ~50 `CGTarget` methods.
+   The predicate refactor would save *one* address-shaped path at the call site
+   and gain nothing for the rest.
+
+2. **`Operand` cannot hold an aggregate by value.** `OpKind` is
+   `IMM/REG/LOCAL/GLOBAL/INDIRECT`. None of those carry a struct value. So
+   the proposed `ABI_STORAGE_VALUE` for a single-part DIRECT aggregate would
+   yield a malformed `Operand` (e.g. `OPK_LOCAL` with frame-slot id but no
+   actual register class for the struct). Native backends — which the prior
+   plan promised would see byte-identical output — would actually break,
+   because their `T->call` path reads `desc.args[i].storage.kind` and is not
+   prepared for "REG that's actually 256 bytes wide".
+
+3. **For C output, the aggregate-via-address shape is fine.** Given an
+   aggregate arg as `OPK_LOCAL slot_3` (address of a frame slot), the C
+   target emits `f(*(struct T*)slot_3)` or, better, `f(slot_3)` where
+   `slot_3` is already typed as `struct T` in the emitted code. No new
+   Operand kind, no ABI invariant change, no native-backend regression risk.
+
+4. **The wide16 / SysV-x64 i128 discussion is unrelated.** GCC accepts
+   `__int128` and `long double` natively. The C backend emits the source
+   type, gcc does the rest. Fixing native ABI classifiers for i128 is real
+   work but is not on the C-backend critical path.
+
+5. **The line numbers in the prior plan are dead.** The CG layer was split
+   from `src/api/cg.c` (gone) into `src/cg/{call,value,memory,...}.c`. Every
+   `src/api/cg.c:NNNN` reference in the old doc points to nothing. The "Prep
+   A" / "Prep B" landings did happen but the helpers now live in
+   `src/cg/call.c` and `src/cg/value.c`.
+
+So we discard the storage-shape framing and plan the actual work directly.
+
+## Architecture sketch

-The real blocker is one layer up: the CG layer makes aggregate-passing
-decisions that bypass the ABI vtable.
-A trivial "C ABI" — every arg
-`ABI_ARG_DIRECT` with one full-coverage part — would still see the CG layer
-materialize aggregates as addresses and allocate sret slots.
-
-This document plans the refactor that makes those decisions ABI-driven, so a
-trivial C ABI vtable produces value-shaped storage suitable for emitting
-`ret = f(a, b, c)` C source.
-
-## Current State
-
-### ABI Vtable Selection
-
-Native ABIs already classify small aggregates as `ABI_ARG_DIRECT` with multiple
-parts (e.g. SysV-x64 splits a 16B struct into two `ABI_CLASS_INT` parts in
-`src/abi/abi_sysv_x64.c:53-71`). Large aggregates classify as
-`ABI_ARG_INDIRECT`. The ABI vtable is selected per-target via
-`ArchImpl.abi_vtable` (`src/arch/arch.h:899`) and dispatched through
-`abi_init` → `select_vtable` (`src/abi/abi.c:176`).
-
-### Preparatory Refactors (landed)
-
-Two preparatory passes shaped `src/api/cg.c` so the functional change can be
-small and confined to a couple of helper bodies:
-
-**Prep A — central predicate.** Added a single helper that today encodes the
-type-shape decision; future change rewrites only its body.
-
-```c
-/* src/api/cg.c:1323 */
-static int api_arg_storage_must_be_addr(Compiler *c, CfreeCgTypeId ty) {
-  return cg_type_is_aggregate(c, ty) || api_is_wide16_scalar_type(c, ty);
-}
+```
++---------------------+
+| frontend (lang/c/)  |   source AST → CG calls
++----------+----------+
+           |
+           v
++---------------------+
+| CfreeCg (src/cg/*)  |   value stack, lvalues, virtual Regs
++----------+----------+
+           | CGTarget vtable
+           v
++---------------------+     +---------------------+
+| arch CGTarget       | OR  | C-source CGTarget   |
+| (aa64/x64/rv64)     |     | (new, this plan)    |
++----------+----------+     +----------+----------+
+           |                           |
+           v                           v
+    MCEmitter→bytes              CfreeWriter→text
+    ↓ObjBuilder                  ↓.c file
 ```

-Used by `api_release_arg_storage` (`src/api/cg.c:2129`) and
-`api_alloc_call_ret_storage` (`src/api/cg.c:6408`).
-
-**Prep B — call-shape helpers.** `cfree_cg_call` and `api_call_symbol_common`
-shared a ~80-line duplicated body. Extracted five helpers and reduced both
-public entry points to thin orchestration:
-
-| Helper                          | Location              | Role                                       |
-| ------------------------------- | --------------------- | ------------------------------------------ |
-| `api_alloc_call_args`           | `src/api/cg.c:6363`   | `avs` array + `avs_in_flight` setup        |
-| `api_pack_call_arg`             | `src/api/cg.c:6374`   | per-arg type resolution + 3-way packaging  |
-| `api_alloc_call_ret_storage`    | `src/api/cg.c:6406`   | return slot vs return register             |
-| `api_release_call_args`         | `src/api/cg.c:6424`   | post-call release loop                     |
-| `api_push_call_result`          | `src/api/cg.c:6432`   | lv/sv push based on storage kind           |
-
-After Prep A+B, the CG-side surface area that needs to change is reduced to
-two helper bodies and one as-yet-unextracted ret packaging function.
+The C target shares the entire frontend pipeline and the CG layer. Only the
+backend differs.

-### Remaining Predicate Sites
+### Substrate: virtual_regs

-After the prep refactors, the type-shape decisions that still need to become
-ABI-driven live in just three places:
+`CGTarget.virtual_regs = 1` already exists and is used by `opt_cgtarget`.
+Effect in CG (`src/cg/value.c:313`, `:342`):

-1. **`api_arg_storage_must_be_addr`** (`src/api/cg.c:1323`) — the central
-   predicate consulted by `api_release_arg_storage` and
-   `api_alloc_call_ret_storage`. Today: `is_aggregate || wide16`.
-2. **`api_pack_call_arg`** (`src/api/cg.c:6374`) — per-arg packaging still
-   has a three-way switch (`api_is_wide16_scalar_type` at line 6387,
-   `cg_type_is_aggregate` at 6392, scalar fall-through). The three branches
-   collapse to "address-shaped" vs "value-shaped" under ABI control.
-3. **`cfree_cg_ret`** (`src/api/cg.c:6636`) — ret packaging still has the
-   same three-way switch inline (`is_aggregate` at 6654, `wide16` at 6662).
-   Not extracted yet because Prep B's scope was call/call_symbol dedupe.
+- `api_regalloc_begin` initializes the regalloc in virtual mode.
+- `api_alloc_reg` mints fresh ids 1, 2, 3, … and never panics.
+- `api_free_reg` is a no-op; spill paths are unreachable.

-Together these three are the entire functional surface for Phase 1.
+The C target sets this and is otherwise free of register pool concerns.
+Each minted `Reg` id maps to one declared C local: `uintN_t v17;` or
+`double v23;` keyed on the Operand's `cls` and the source `CfreeCgTypeId`
+carried alongside.

-## Refactor Plan
+The C target still must implement `get_allocable_regs`,
+`get_phys_regs`, etc. as empty stubs (the CG checks `virtual_regs` and skips
+them); same for `spill_reg`/`reload_reg` (unreachable in virtual mode but
+required by the vtable's non-null-callable contract).

-### Invariant to Introduce
+### What about the aggregate-by-address issue?

-`CGABIValue.storage` shape is determined by an ABI helper, not by
-`cg_type_is_aggregate`:
+When CG packs a call arg whose type is an aggregate
+(`src/cg/call.c:30`), it materializes an address operand
+(`OPK_LOCAL`/`OPK_GLOBAL`/`OPK_INDIRECT`) referring to a memory image of the
+struct. The C target's `call` method sees `desc.args[i].storage.kind ==
+OPK_LOCAL` (or similar) and the source type `desc.args[i].type` is the
+aggregate.

-```c
-/* In src/abi/abi.h */
-typedef enum ABIStorageShape {
-  ABI_STORAGE_VALUE, /* storage is the value itself (REG/IMM/GLOBAL/LOCAL) */
-  ABI_STORAGE_ADDR, /* storage is the address of a memory image */
-} ABIStorageShape;
+The emission rule for an aggregate operand:

-ABIStorageShape abi_arg_storage_shape(const ABIArgInfo*, u32 type_size);
 ```
+desc.args[i].type = struct S
+desc.args[i].storage.kind = OPK_LOCAL, frame_slot = 7
+  → emit `slot_7` (where slot_7 was declared `struct S slot_7;`)

-Rule:
-
-- `ABI_STORAGE_ADDR` iff `kind == ABI_ARG_INDIRECT`, **or**
-  `kind == ABI_ARG_DIRECT && (nparts > 1 || parts[0].src_offset != 0 ||
-  parts[0].size != type_size)`.
-- Otherwise `ABI_STORAGE_VALUE`.
-
-This makes today's native behavior fall out unchanged: small structs
-(multi-part DIRECT) → ADDR; large structs (INDIRECT) → ADDR. Only a
-trivial DIRECT — one part, full coverage, zero offset — produces VALUE,
-which is exactly what the C target will register.
+desc.args[i].storage.kind = OPK_INDIRECT, base = vN, ofs = K
+  → emit `(*(struct S*)((char*)vN + K))`
+```

-### Touch List
+Both are valid C. The first is what the common case looks like (caller
+spilled the struct to a named frame slot). No CG change needed.
+
+For *returns* of aggregates, `api_alloc_call_ret_storage` (`src/cg/call.c:44`)
+allocates a fresh frame slot via `api_arg_storage_must_be_addr`. The C target
+sees a frame-slot Operand for `desc.ret.storage`; after the call it can either:
+
+- emit `slot_R = f(args);` directly (preferred — gcc handles the aggregate
+  return on its end), or
+- emit `f_into(&slot_R, args);` if the backend chooses to lift aggregate
+  returns out (not required).
+
+Either way no `sret` shim is in the emitted C — gcc figures that out.
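Illustrative sketch, not part of the diff: the aggregate emission rule above, written out as compilable C. The struct `S`, the callee names, and the register/slot numbers are assumptions for this example; a virtual register is modeled as a `uintptr_t` local, which is one plausible reading of the plan's `uintN_t vN` convention.

```c
#include <stdint.h>

struct S { int32_t a; int32_t b; int64_t c; };

static int64_t sum_s(struct S s) { return s.a + s.b + s.c; }

static struct S make_s(int32_t a) {
    struct S r = { a, a + 1, a + 2 };
    return r;
}

/* OPK_LOCAL case: the frame slot is declared with the aggregate type, so
 * the argument expression is just the slot's name. */
static int64_t call_via_local(void) {
    struct S slot_7 = { 1, 2, 3 };
    return sum_s(slot_7);
}

/* OPK_INDIRECT case: base register plus byte offset, dereferenced as S.
 * Here the offset selects the second element of an array of S. */
static int64_t call_via_indirect(void) {
    struct S pair[2] = { { 1, 2, 3 }, { 10, 20, 30 } };
    uintptr_t v4 = (uintptr_t)pair;   /* virtual reg holding the base */
    return sum_s(*(struct S *)((char *)v4 + sizeof(struct S)));
}

/* Aggregate return: assign straight into the typed return slot and let
 * the host compiler decide how the value is actually passed back. */
static int64_t ret_into_slot(void) {
    struct S slot_9;
    slot_9 = make_s(5);
    return slot_9.c;
}
```

In all three cases the emitted C names the aggregate type and leaves register splitting or hidden-pointer returns to the downstream compiler.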
+
+## CGTarget surface, mapped
+
+Methods in `arch.h:592–850`, grouped by emission strategy:
+
+| Method                | C-source emission                                         |
+| --------------------- | --------------------------------------------------------- |
+| `func_begin`/`_end`   | `static? T name(P0 p0, P1 p1, …) {` … `}`                 |
+| `frame_slot`          | declare `T slot_N;` at function entry                     |
+| `local`               | declare `T loc_N;` at function entry                      |
+| `local_addr`          | `vDST = &loc_N;`                                          |
+| `param`               | already a function parameter — track name mapping         |
+| `spill_reg`/`reload_` | unreachable in virtual_regs mode (no-op stub)             |
+| `label_new`/`_place`  | `Lk:` placement; minted ids → unique label names          |
+| `jump`                | `goto Lk;`                                                |
+| `cmp_branch`          | `if ((vA OP vB)) goto Lk;`                                |
+| `scope_begin/end/...` | C `if`/`{ }` block or `for(;;){ … L_break: ;}`            |
+| `load_imm`            | `vDST = K;`                                               |
+| `load_const`          | static const decl at top; `vDST = rodata_N;`              |
+| `copy`                | `vDST = vSRC;`                                            |
+| `load`                | `vDST = *(T*)addr;` (or `__atomic_load` for atomic)       |
+| `store`               | `*(T*)addr = src;`                                        |
+| `addr_of`             | `vDST = &lv;`                                             |
+| `tls_addr_of`         | `vDST = &tls_sym;` — gcc handles model selection          |
+| `copy_bytes`          | `__builtin_memcpy(dst, src, N)`                           |
+| `set_bytes`           | `__builtin_memset(dst, byte, N)`                          |
+| `bitfield_load`       | `vDST = (T)((load(storage) >> bit_offset) & mask);`       |
+| `bitfield_store`      | `store(storage, (load(storage) & ~mask) │ ((src&mask)<<lo))` |
+| `binop`/`unop`/`cmp`  | direct C operator with appropriate cast for signedness    |
+| `convert`             | C cast, except CV_BITCAST → `__builtin_bit_cast` or memcpy |
+| `call`                | `[vRET = ] fname(args…);` (see aggregate rule above)      |
+| `ret`                 | `return vR;` or `return;`                                 |
+| `alloca_`             | `vDST = __builtin_alloca(size);`                          |
+| `va_start_`/`arg`/etc | `__builtin_va_start(ap, last)` and friends                |
+| `atomic_load`/etc     | `<stdatomic.h>` primitives with explicit memory order     |
+| `intrinsic`           | per-kind: `__builtin_{popcount,ctz,clz,bswap,trap,…}`     |
+| `asm_block`           | re-serialize as GCC `__asm__(tmpl : outs : ins : clob);`  |
+| `set_loc`             | `# line N "file"` directive                               |
+| `finalize`            | flush data definitions, close writer                      |
+| `destroy`             | arena-backed, no-op                                       |
+
+CGCallPlan methods (`plan_call`/`load_call_arg`/etc.) are used by
+`opt_cgtarget`'s lowering pass. The C target advertises `virtual_regs = 1`
+which causes opt to be enabled — but the C target should refuse opt levels
+above 0 (or be wrapped so opt is bypassed). See "Sequencing with opt" below.
+
+## Sequencing with opt
+
+`src/cg/session.c:36-39`:

-**`src/abi/abi.h` / `src/abi/abi.c`** — add `ABIStorageShape` enum and
-`abi_arg_storage_shape()`.
+```c
+if (opt_level > 0) {
+  target = opt_cgtarget_new((Compiler*)c, target, opt_level);
+}
+```

-**`src/api/cg.c`** — rewrite three function bodies:
+`opt_cgtarget` records IR, runs SSA/DCE/combine/loop passes, then lowers
+through the wrapped target's CGCallPlan + value-emission methods. For the
+C backend this is undesirable: the *whole point* is to defer optimization
+to gcc. opt would just churn.
+
+Decision: when `emit_c_source` is set, force `opt_level = 0` regardless of
+what the caller asked for, with a diagnostic note. The C target then sits
+directly under CG with `virtual_regs = 1`, and CG mints virtual reg ids
+without any IR layer between them.
+
+## Type emission
+
+C source needs each composite type declared (typedef'd) before first use.
+The C target maintains a type-emission worklist:
+
+1. As CG calls into `func_begin`, `frame_slot`, `local`, `param`, `load`,
+   `store`, etc., it carries `CfreeCgTypeId`. The target records every
+   `CfreeCgTypeId` it sees.
+2. At `finalize`, walk the recorded types, topologically order by
+   dependency (pointee, element, field, return, param), and emit:
+   - scalars → use `<stdint.h>` / `_Bool` / `__int128` / `float` / `double` /
+     `long double`.
+   - pointers → `T*`.
+   - arrays → `T (*) [N]` for parameters or `T name[N]` for declarations.
+   - records → `struct ___s_N { … };` with explicit `_Alignas(K)` only if
+     `align_override` requires it.
+   - enums → emit underlying integer type; do not emit C `enum` (cfree CG
+     does not preserve the C-level enum identity at this layer).
+   - aliases → `typedef base_T name;`.
+3. Functions: emit signature with proper calling convention attribute
+   (`__attribute__((sysv_abi))` etc.) only when CG requested non-default;
+   the common case is the host's default ABI, which is exactly what we want.
+
+Bitfield-in-record handling: `cfree_cg_field` already returns byte offsets
+and bit-offset/width separately. We do **not** emit the field as a C
+bitfield. Instead, the record's emitted layout uses opaque storage bytes
+(e.g. one `unsigned char raw[N]` member or just a `uint64_t storage`), and
+the frontend's `bitfield_load`/`_store` calls into CG produce explicit
+mask/shift expressions referencing the storage member. This sidesteps the
+ABI ambiguity of C bitfields entirely.
+
+## Symbol and data emission
+
+`cfree_cg_decl`, `cfree_cg_data_*`, `cfree_cg_const_data` define symbols
+and data. The C target maps these to:
+
+- function decl → `T name(args);` forward declaration at TU top.
+- function defn → emitted by `func_begin/end`.
+- object decl (no body) → `extern T name;`.
+- object defn (data_begin/data_end) → `T name = { … };` constructed from
+  the `data_int`/`data_float`/`data_bytes`/`data_zero`/`data_addr` stream
+  buffered during the data definition. Easiest representation: write the
+  bytes verbatim as `static const unsigned char __sym[N] = { 0xAA, … };`
+  and cast to the typed pointer at uses. Loses readability, gains
+  correctness with arbitrary aggregate initializers and inter-object
+  relocations.
+- `data_addr(target, addend, …)` inside an initializer → cannot be expressed
+  as a static const initializer if the target is in another TU; in that
+  case lift to a runtime initializer (`__attribute__((constructor))`) or
+  fail with a diagnostic.
+  v0 may diagnose; v1 lifts.
+- `data_pcrel`, `data_symdiff` → diagnose; these are link-time concepts
+  with no C-source equivalent. Frontends that need them are not viable C
+  targets.
+- TLS objects → `_Thread_local T name;`.
+- COMDAT, weak, visibility — emit `__attribute__((weak))`,
+  `__attribute__((visibility("hidden")))`, `__attribute__((selectany))`.
+- Symbol bind/visibility/flags from `CfreeCgSymbolAttrs` map to gcc
+  attributes; no behavioral change.
+
+## Source locations and debug
+
+`cfree_cg_set_loc` triggers `T->set_loc`. The C target emits
+`#line N "path"` immediately before the next significant emission. With
+`-g` set on the downstream gcc invocation, the resulting `.o` carries
+source-mapped debug info back to the original cfree input. The cfree
+`Debug` producer is not used in this mode (no DWARF emission).
+
+## Touchpoints in the existing tree
+
+New files (additions only):
+
+- `src/arch/c_target/` — directory for the C `CGTarget` implementation.
+  - `target.c` — vtable construction; `c_cgtarget_new()` entry point.
+  - `emit.c` — per-method emission bodies.
+  - `types.c` — type worklist and typedef ordering.
+  - `data.c` — data-definition buffering and emission.
+  - `names.c` — Reg/FrameSlot/Local/Label/Sym → C identifier mapping.
+  - `internal.h` — local types and helpers.
+
+Existing files touched (minimal, additive):
+
+- `src/arch/cgtarget.c` — branch on the C-output mode and call
+  `c_cgtarget_new()` instead of dispatching through `ArchImpl.cgtarget_new`.
+- `include/cfree/compile.h` / `include/cfree/core.h` — add `emit_c_source`
+  to `CodeOptions` (or add a `CFREE_OBJ_C_SOURCE` value to `CfreeObjFmt` —
+  decision deferred to v0 scoping, but `CodeOptions` flag is the lighter
+  change since the file is conceptually still a translation unit, just a
+  different surface format).
+- `src/cg/session.c` — when the flag is set, force `opt_level = 0`.
+- `driver/cc/` — accept `--emit=c` and wire it through to `CodeOptions`.
+
+Nothing in `src/cg/*.c`, `src/abi/*`, `src/arch/{aa64,x64,rv64}/`,
+`src/arch/regalloc.c`, or `src/arch/mc.c` needs to change. The C backend is
+strictly additive.
+
+## Things to **not** do
+
+- Do not invent a new `ABIStorageShape` enum or new `OpKind` for value-shaped
+  aggregates. Aggregates stay address-shaped at the CG-Operand boundary; the
+  C target emits `*(T*)addr` and gcc rebuilds value semantics.
+- Do not write a "trivial C ABI vtable". The C target inherits the host
+  arch's ABI vtable for sizeof/alignof and ignores classification at the
+  call site.
+- Do not register `arch_impl_c` in `src/arch/registry.c`. There is no "C
+  arch"; the host arch is unchanged.
+- Do not try to make `opt_cgtarget` lower into the C target. Bypass opt.
+
+## Test surface
+
+Add a new test directory `test/cbackend/`. Tests compile a cfree CG fixture
+(or run the C frontend on a `.c` corpus from `test/parse/` and similar),
+emit C source via the new target, then compile that C source with the host
+`cc` and run the resulting binary, asserting the same exit code or stdout
+as a reference run.
+
+Test tiers, in priority order:
+
+1. **v0 sanity** — one fixture per CG primitive family (int arith, fp arith,
+   load/store, branches, switch, calls, returns, const data, scalar params).
+   Pass criterion: emitted C compiles with `cc -Werror -std=c11` and
+   produces the expected exit code.
+2. **Coverage** — aggregates by value, sret, varargs, bitfields,
+   computed-goto, inline asm, atomics, TLS, weak/visibility, alloca,
+   intrinsics (overflow, trap, popcount, etc.), setjmp/longjmp,
+   wide16 (i128, long double, f128). Each its own fixture file.
+3. **Frontend integration** — run the existing `test/toy/` and `test/parse/`
+   corpora through the C backend and require the resulting binary to match
+   the native-backend binary's behavior.
+4. **Self-hosting smoke** — eventually compile libcfree through libcfree-via-C
+   and check that the bootstrapped artifact still passes its test suite.
+   This is a separate effort; flagged here only to note the long-term shape.

-| Site                              | Location              | Change                                                                 |
-| --------------------------------- | --------------------- | ---------------------------------------------------------------------- |
-| `api_arg_storage_must_be_addr`    | `src/api/cg.c:1323`   | body becomes `abi_arg_storage_shape(abi, size) == ABI_STORAGE_ADDR`    |
-| `api_pack_call_arg`               | `src/api/cg.c:6374`   | collapse 3-way switch to `must_be_addr`-driven materialization         |
-| `cfree_cg_ret`                    | `src/api/cg.c:6636`   | same collapse, OR pre-extract a `api_pack_ret_value` helper first      |
+## Phasing

-Optionally extract `api_pack_ret_value` from `cfree_cg_ret` as Prep C before
-the functional change, so the three-way collapse lives in helper bodies
-rather than mid-public-function. Small, mechanical, ~20 LOC.
+### Phase 0 — scaffolding

-**`src/abi/abi_sysv_x64.c`, `abi_aapcs64.c`, `abi_apple_arm64.c`,
-`abi_rv64.c`** — extend `classify_one`/`classify_scalar` to classify wide16
-scalars (i128, long double) as `ABI_ARG_DIRECT` with multi-part shape. See
-Phase 2 below.
+- Add `CodeOptions.emit_c_source`.
+- Branch in `cgtarget_new` and `session.c`.
+- Stub `c_cgtarget_new` returning a vtable where every method is
+  `compiler_panic("C target: <method> not implemented")`.
+- Wire `--emit=c` in `driver/cc/`.
+- Acceptance: `cfree --emit=c empty.c -o /tmp/x.c` panics with a *specific*
+  unimplemented-method message (not a crash, not silent success).

-**Untouched** — other `cg_type_is_aggregate` sites in `cg.c` (lines 1754,
-1782, 3823ff, 3945ff, 5094, 5103). Those handle assignments, lvalue
-conversion, and address-of. They are correctly aggregate-policy, not
-ABI-policy.
+### Phase 1 — minimal viable: scalar arithmetic and calls

-**Native backends** — no expected change. They already consult
-`desc.args[i].abi` and the invariant preserves what they see today.
+Implement, in roughly this order, only the methods needed for:

-## Phasing
+```c
+int add(int a, int b) { return a + b; }
+int main(void) { return add(2, 3); }
+```

-### Phase 0 (done) — preparatory refactors
-
-- Prep A: central `api_arg_storage_must_be_addr` predicate.
-- Prep B: extract `api_alloc_call_args`, `api_pack_call_arg`,
-  `api_alloc_call_ret_storage`, `api_release_call_args`,
-  `api_push_call_result`.
-- Verified by `test-cg-api` (610 pass), `test-opt`, `test-toy`,
-  `test-smoke-x64`.
-
-### Phase 1 — Helper bodies become ABI-driven
-
-- Add `abi_arg_storage_shape()` in `abi.h`/`abi.c`.
-- Rewrite `api_arg_storage_must_be_addr` body to delegate to the new helper
-  (needs `const ABIArgInfo*` and `type_size` — adjust the helper signature
-  accordingly, and pass them through from `api_pack_call_arg` /
-  `api_alloc_call_ret_storage` / `api_release_arg_storage`).
-- Collapse the three-way switches in `api_pack_call_arg` and `cfree_cg_ret`
-  (or extracted `api_pack_ret_value`) into a single `must_be_addr` branch.
-
-Acceptance: `make test-cg-api test-opt test-link test-elf test-toy
-test-smoke-x64 test-smoke-rv64 test-aa64-inline` pass; spot-check `.o`
-outputs on a representative corpus against the current state to confirm
-byte-identical codegen for native ABIs.
-
-### Phase 2 — Migrate wide16 to ABI classification
-
-Today `api_is_wide16_scalar_type` papers over incomplete ABI classifiers in
-some native targets (see Risks below). Phase 2 fixes the classifiers, then
-removes the wide16-specific code path from the predicate.
-
-- Fix SysV-x64 `classify_scalar` to emit DIRECT/2-INT-parts for
-  `ti.size == 16 && ti.scalar_kind == ABI_SC_INT` (the i128 case),
-  matching what RV64 and AAPCS64 already do.
-- Defer long-double-as-FP correctness — long double passes through
-  memory in current cfree even on native targets, and the existing
-  wide16 shortcut effectively forces that.
-  Either retain the
-  `is_wide16` check just for long-double cases (narrow the branch),
-  or introduce a dedicated x87 / 16B-FP ABI class as a separate piece
-  of work. The C-backend refactor does not require this fix.
-- After classifiers are correct, drop the `api_is_wide16_scalar_type`
-  clause from `api_arg_storage_must_be_addr`.
-
-Acceptance: same as Phase 1, plus `test-libc` (long double through
-musl/glibc paths) and any i128 coverage.
-
-### Phase 3 — Negative-control fixture
-
-Add a unit test in `test/api/` that constructs a `Compiler` with a
-synthetic ABI vtable returning trivial DIRECT/one-full-part for everything,
-drives `cfree_cg_call` with an aggregate arg and aggregate return, and
-asserts `desc.args[0].storage.kind != OPK_INDIRECT` and that no sret frame
-slot was allocated.
-
-This fixture locks in the new invariant so future changes cannot
-accidentally regress to always-address-for-aggregate.
-
-### Phase 4 (out of scope here, but enabled by this work)
-
-- Add `arch_impl_c` with a `c_abi_vtable` whose `compute_func_info` returns
-  trivial DIRECT/one-full-part for every arg and return.
-- Stub `cgtarget_new` to a placeholder that records call/ret shapes for
-  inspection.
-- The actual C-source emitter is a separate piece of work, driven by the
-  recorded `CGCallDesc` shape that this refactor now makes value-typed for
-  aggregates.
-
-## Risks and Open Items
-
-Investigated post-plan and after the prep refactors:
-
-- **`api_release_arg_storage` (resolved by Prep A).**
-  Originally a fifth open site; now uses `api_arg_storage_must_be_addr`
-  directly (`src/api/cg.c:2129`). Resolution: same helper drives the
-  decision.
-
-- **`call_symbol` duplication (resolved by Prep B).**
-  Both `cfree_cg_call` and `api_call_symbol_common` now share the five
-  extracted helpers and contain only the call-shape orchestration. Drift
-  is no longer a maintenance concern.
-
-- **`fn_abi` is reliably non-null inside a function body.**
-  Set at `cfree_cg_func_begin` (`src/api/cg.c:3125`) and cleared at
-  `func_end` (line 3149). `cfree_cg_ret` only runs within that window.
-  No null-safe fallback needed.
-
-- **CGCallPlan / backends are already fully ABI-driven.**
-  Grep across `src/arch/` finds zero `cg_type_is_aggregate` references.
-  Every site branches on `ai->kind == ABI_ARG_INDIRECT` or iterates
-  `ai->parts`. Examples: `arch/aa64/ops.c:904`, `arch/x64/alloc.c:54`,
-  `arch/rv64/opt_coord.c:178`. The new invariant preserves what native
-  backends observe (multi-part DIRECT aggregates still produce ADDR
-  storage), so backends do not change.
-
-- **Wide16 classification is incomplete in some native ABIs — this is
-  the biggest finding and Phase 2's largest hidden cost.**
-  Today the wide16 check in `api_arg_storage_must_be_addr` papers over
-  bugs in the underlying ABI classifiers. Per-target status:
-
-  - **RV64** (`src/abi/abi_rv64.c:23-43`): correctly classifies 16B
-    INT or FLOAT scalars as DIRECT with two 8B INT parts. ✓
-  - **AAPCS64** (`src/abi/abi_aapcs64.c:23-39`): correctly classifies
-    16B INT scalars (i128) as DIRECT/2-parts. **Missing**: 16B FP
-    (long double on ARM64) should be DIRECT with one or two FP parts
-    in Q-registers, not fall through to single 16B INT part.
-  - **Apple ARM64** (`src/abi/abi_apple_arm64.c`): delegates to AAPCS64;
-    inherits the same long-double gap.
-  - **SysV-x64** (`src/abi/abi_sysv_x64.c:28-44`): **no 16B branch
-    at all**. i128 currently falls through to a single 16B INT part —
-    malformed because no GPR can hold it. Long double is 80-bit x87
-    with 16B alignment and needs a target-specific class entirely.
-    The wide16 clause in `api_arg_storage_must_be_addr` hides both bugs
-    by always routing wide16 through a memory image.
-
-  **Consequence**: if Phase 2 drops the wide16 clause before fixing the
-  SysV-x64 and AAPCS64 classifiers, native codegen breaks.
The new - `abi_arg_storage_shape` would compute VALUE for a malformed single-part - DIRECT (one part, `src_offset==0, size==type_size==16`), but no Operand - kind can hold a 16B value. - -- **HFA / HVA in AAPCS64**: the existing classifier explicitly defers - HFA refinement (see comment at `src/abi/abi_aapcs64.c:9` and `:68-69`). - Small aggregates today classify uniformly as DIRECT/INT-parts. Wide16 - classification (i128) does not collide with HFA logic because the two - enter `classify_one` through disjoint type kinds (RECORD vs scalar). - Confirmed safe. - -- **`tail` interaction**: the tail-call path - (`src/api/cg.c:6497-6498`) calls `api_regalloc_finish` before - `T->call`, which can mutate live storage state. The storage-shape - helper is queried per-arg during pre-call packaging, before this - finish call, so the decision sequencing is unchanged. No additional - risk. - -## Estimated Size - -- Phase 0 (done): Prep A (~25 LOC) + Prep B (~95 LOC of helpers, ~120 LOC - of duplicate body deleted from `cfree_cg_call` / `api_call_symbol_common`). -- Phase 1: ~20 LOC for `abi_arg_storage_shape` + rewriting three function - bodies in `cg.c` (signature changes to thread `ABIArgInfo*` + size into - the helpers). -- Optional Prep C (extract `api_pack_ret_value` from `cfree_cg_ret`): ~20 LOC. -- Phase 2a (i128 classification fix): ~30 LOC in `abi_sysv_x64.c` + - removing the i128 path from the wide16 clause. ~50 LOC total. -- Phase 2b (long-double, optional / deferable): not required for the C - backend. Treat as separate work. -- Phase 3 (negative-control fixture): one ~150 LOC test file. -- Total remaining for C-backend prerequisite: ~250 LOC, no public API change. +- `func_begin`/`func_end`, `param`, `ret`, `binop`, `load_imm`, `copy`, + `call`, plus minimal type emission for `int`/`void`. +- Identifier-mapping helper (Reg, FrameSlot, Label, Sym → C name). +- Writer plumbing (the C target owns a `CfreeWriter` set at construction). 
+
+Acceptance: the example above round-trips through the C target, `cc` it,
+run it, exit code 5.
+
+### Phase 2 — control flow and memory
+
+- `cmp`, `cmp_branch`, `label_*`, `jump`, `scope_*`.
+- `load`/`store`/`addr_of`/`indirect`, `local`, `frame_slot`, `local_addr`.
+- `convert` for the integer/float conversions; `bitcast` via memcpy.
+
+Acceptance: corpora that exercise loops, conditionals, locals, pointer
+chasing.
+
+### Phase 3 — aggregates, varargs, intrinsics
+
+- Type emission for records, arrays, function types.
+- Call/return for aggregate types (address-operand emission rule).
+- `va_start_`/`va_arg_`/`va_end_`/`va_copy_`.
+- `intrinsic` for the full IntrinKind set.
+- `copy_bytes`/`set_bytes`.
+
+### Phase 4 — atomics, asm, TLS, exotic features
+
+- Atomic load/store/rmw/cas/fence.
+- `asm_block` re-serialization.
+- TLS, weak, visibility, `_Alignas`.
+- `bitfield_load`/`_store`.
+- `data_*` definition emission including const_data.
+- `tls_addr_of`, `alloca_`.
+
+### Phase 5 — quality
+
+- Better identifier names (use `Sym name` where available instead of `vN`).
+- Optional: emit C struct field accesses instead of `*(T*)((char*)p + ofs)`
+  when the offset corresponds to a known record field.
+- `#line` directives.
+- C-target-specific diagnostics for unsupported combinations
+  (`data_pcrel`, `data_symdiff`, computed goto across functions, etc.).
+
+## Estimated size
+
+Order-of-magnitude:
+
+- Phase 0: ~150 LOC.
+- Phase 1: ~400 LOC (target.c skeleton + emit.c primitives + names.c +
+  minimal type emission).
+- Phase 2: ~600 LOC.
+- Phase 3: ~800 LOC (type worklist is most of it).
+- Phase 4: ~600 LOC.
+- Phase 5: ongoing.
+
+Plus test fixtures and a `test-cbackend` make target — roughly 1k LOC of
+test infrastructure to start.
+
+Total to a credibly useful C backend: ~3k LOC, isolated to
+`src/arch/c_target/` plus the four small touchpoints listed earlier. No
+modifications to CG, ABI, regalloc, or existing arch backends.
+
+## Open questions
+
+- **Aggregate returns in C source**: should the C target emit
+  `slot_R = f(args)` for aggregate returns (relying on gcc to handle the
+  ABI) or pre-lower to `f_into_buf(&slot_R, args)`? The first is simpler
+  and is what gcc would do anyway. Default to the simple form; revisit if
+  it produces bad codegen.
+- **Output format flag location**: `CodeOptions.emit_c_source` (boolean)
+  vs `CFREE_OBJ_C_SOURCE` (enum extension). The latter forces every code
+  path that switches on `obj_fmt` to know about C-source; the former is a
+  narrower addition. Lean toward the boolean.
+- **Multi-TU emission**: cfree compiles one TU at a time today. The C
+  backend follows the same model — one `.c` source out per `.c` source in.
+  Cross-TU LTO is gcc's job downstream.
+- **Floating-point reproducibility**: cfree's FP-flag enum
+  (`CfreeCgFpFlag.REASSOC`/`APPROX`/…) maps to gcc's
+  `-ffast-math`-style behavior, but per-operation. C doesn't have a per-op
+  syntax for these. Options: ignore the flags (correct but pessimistic),
+  wrap in `#pragma STDC FP_CONTRACT off` blocks, or emit
+  `__attribute__((optimize(...)))` on the enclosing function when any
+  flag fires. Probably ignore in v0/v1, document the gap.
+- **i128 division/modulo and f128 ops** are emitted by cfree CG today via
+  calls to `__divti3` / `__multf3` / `__addtf3` etc. The C target can
+  prefer native `__int128` operators and `__float128`/`long double` so
+  gcc inlines them. Detail to validate in Phase 1.
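
A minimal sketch of what the last open question above contemplates, assuming a 64-bit gcc/clang host where `__int128` is available. The function names `emitted_div_i128` and `emitted_mod_i128` are hypothetical and only illustrate the shape such emitted C might take; they are not taken from cfree or from this commit.

```c
/* Sketch only: hypothetical shape of emitted C for i128 div/mod.
 * Assumes a gcc/clang host with __int128 (typically 64-bit targets). */
typedef __int128 i128;

/* Native operator instead of an explicit call to __divti3; the host
 * compiler chooses between a libcall and inline code. */
i128 emitted_div_i128(i128 a, i128 b) {
    return a / b;
}

/* Same idea for the remainder. */
i128 emitted_mod_i128(i128 a, i128 b) {
    return a % b;
}
```

Written this way, the decision about calling the runtime helper stays with the host compiler, which is the trade-off the bullet leaves to be validated.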