commit 1806e4076ddd06706caa95e512c54844869694b9
parent 9e7fedaaf62629d397c972e56bcd90b25da3f732
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Sat, 30 May 2026 17:46:56 -0700
test+doc: correct rv64 hostas-cross root cause (cc -S infidelity, not emulation)
Debugging the rv64 cross-exec hang showed it is NOT an emulation problem: a
minimal clang rv64 static exe and the DIRECT cfree `cc -c` object both run
correctly under the same podman qemu-riscv64; only the `cc -S | as` round-trip
hangs. Root cause: rv64 has no symbolizer (no ArchAsmOps), so `cc -S` emits the
call as `auipc ra,0x0; jalr ra,0(ra)` with R_RISCV_CALL unsymbolized — it calls
itself — and branches keep numeric targets (`j 0x90`). Correct the harness
header and doc/ASM_ROUNDTRIP_TESTING.md accordingly, and record that x64 now
assembles 312/312 (272/312 exec; data round-trip backlog). Both are the
"x64/rv64 cc -S round-trip" backlog; no code change here.
Diffstat:
3 files changed, 632 insertions(+), 16 deletions(-)
diff --git a/doc/ASM_ROUNDTRIP_TESTING.md b/doc/ASM_ROUNDTRIP_TESTING.md
@@ -250,18 +250,24 @@ downgrades to SKIP instead of hanging). Status:
- **aarch64-linux**: green end-to-end (cfree-as 312/0, clang-as 312/0) — podman
runs arm64 natively in its VM, so it's fast and the primary verified target.
-- **x86_64-linux**: SKIPS on the x64 `cc -S` symbolizer gap — x64 emits numeric
- branch targets (`jmp 0x77`) the x64 `as` can't reassemble. The aarch64
- symbolizer (intra-section branch-target label synthesis in
- `src/api/asm_emit.c`, and the relocation-operand syntax via
- `ArchAsmOps.reloc_operand`) needs x64 implementations — `is_local_branch`-style
- recognition for `jmp`/`jcc`, plus an x64 `reloc_operand` table
- (`sym(%rip)`/`@PLT`/`@GOTPCREL`). Tracked.
-- **riscv64-linux**: `cc -S | cfree as | cfree ld -static` works; SKIPS where
- riscv64 user-mode emulation is unavailable or too slow/wedged to pass the exec
- smoke (e.g. the macOS/arm64 dev host's podman riscv64 path hangs on the
- cfree-built static ELF even though it runs a clang-built one — likely a
- cfree-rv64-ELF-under-qemu-user issue to chase separately).
+- **x86_64-linux**: the x64 `cc -S` symbolizer landed (the aarch64 symbolizer is
+ now arch-generalized: `ArchAsmOps.is_local_branch` for `jmp`/`jcc`, an x64
+ `reloc_operand` table for `sym(%rip)`/bare-`@PLT`/`@GOTPCREL` with a +4 rel32
+ addend bias, and operand-driven RIP surgery), so the whole corpus
+ **re-assembles 312/312** via both cfree-as and clang. Cross-EXEC is **272/312**:
+ ~23 cases (switch/jump tables, global/array/fp data, varargs) lose fidelity in
+ the x64 cc -S **data** round-trip — confirmed cc -S infidelity, since the
+ DIRECT `cc -c` object executes correctly. That data backlog is the remaining
+ x64 work. Opt-in until it closes.
+- **riscv64-linux**: assembles, but cross-EXEC **hangs** — NOT emulation (a
+ minimal clang rv64 static exe AND the DIRECT cfree `cc -c` object both run
+ correctly under the same qemu-riscv64; only the `cc -S | as` round-trip hangs).
+ Root cause: rv64 has **no symbolizer** (no `ArchAsmOps`), so `cc -S` emits the
+ call as `auipc ra,0x0; jalr ra,0(ra)` with the `R_RISCV_CALL` reloc
+ unsymbolized — it calls itself — and branches keep numeric targets (`j 0x90`).
+ Needs an rv64 `ArchAsmOps`: `is_local_branch` (j/beq/bne/...) and a
+ `reloc_operand` for the RISC-V `%pcrel_hi`/`%pcrel_lo`/`%hi`/`%lo`/`call`
+ syntax (the `%pcrel_lo` label-pairing makes this the hardest of the three).
Override the matrix with `CFREE_HOSTAS_CROSS_TARGETS="tag:triple ..."`, the
exec-smoke cap with `CFREE_HOSTAS_EXEC_TIMEOUT=<secs>`, and per-arch images with
diff --git a/doc/INCREMENTAL_OBJLINK.md b/doc/INCREMENTAL_OBJLINK.md
@@ -0,0 +1,604 @@
+# File-based incremental linking — obj + link internals
+
+Status: design draft. Scope: the `src/obj` and `src/link` substrate that lets a
+file-based rebuild patch a prior image instead of relinking from scratch, plus
+the public interface a build system consumes. This is **distinct from** the two
+existing incremental docs and is sequenced under neither:
+
+- `doc/INCREMENTAL_LINK.md` — append-only growth of a *live JIT image* for
+ `cfree dbg`. In-process, never patches existing code. We reuse its machinery
+ (append cursors, durable reloc records) but target on-disk rebuilds.
+- `doc/HOT_RELOAD.md` — replacing a *running* function body in a live process.
+ Shares the "indirection cell" idea (see §8) but is a different consumer.
+
+The build graph, compile cache, dependency scanning, and daemon/watch modes are
+**out of scope** here — they live in the separate build-system plan and consume
+the interface in §16. This document is deliberately about the linker and object
+internals those layers stand on.
+
+---
+
+## 1. Scope & non-goals
+
+**In scope.** Make the obj/link layers able to:
+- give every object and every function/data *atom* a stable **content identity**;
+- persist a prior link's placement + relocation state as side-band data;
+- on a changed input, **re-resolve and patch only the changed atoms**, keeping
+ every unchanged address byte-stable;
+- detect when a change is *not* provably local and **fall back to a full link**;
+- expose all of the above through a small public API.
+
+**Non-goals (here).**
+- Build graph / DAG, compile cache, header-dependency scanning, `cfree serve`
+ daemon, watch mode — separate build-system plan (consumes §16).
+- Cross-TU / whole-program optimization (ThinLTO-style). Incremental link is a
+ `-O0`/`-O1` *dev* feature; release builds always full-link, clean.
+- Reclaiming dead patch space within a session (we recycle via a free-list but
+ do not compact; a clean link reclaims).
+- Mach-O / COFF in-place patching in v1 (ELF first — see §14).
+
+---
+
+## 2. Target & cost model
+
+**"Instant" defined.** After editing one translation unit in a project of *N*
+TUs, the *link* cost should be `O(changed atoms + their relocations)`, not
+`O(whole program)`. Compile cost is the build system's problem (cache); this doc
+makes the *link* incremental.
+
+**Where link time goes today.** `link_resolve` (`src/link/link_layout.c:1212`)
+is six whole-program phases:
+
+| Phase | Function | Cost |
+|---|---|---|
+| 1 Symbol resolve | `link_resolve_symbols` (`link_resolve.c:228`) | `O(Σ symbols)` — one global `SymHash` |
+| 1b GC | `link_gc_compute` | `O(sections + relocs)` BFS, no delta-marking |
+| 2 Layout | `link_layout_sections` (`link_layout.c:209`) | `O(total_kept)`; **any size change shifts all downstream vaddrs** (`link_layout.c:340-348`, no slack) |
+| 2c Bytes | `link_emit_segment_bytes` (`link_layout.c:1050`) | `O(Σ bytes)` into one monolithic per-segment buffer |
+| 3 Vaddr | `link_assign_symbol_vaddrs` (`link_reloc_layout.c:40`) | `O(Σ symbols)` |
+| 4 Relocs | `link_emit_relocations` (`link_reloc_layout.c:1227`) | `O(Σ relocs)` |
+| 6 Emit + id | format emitter + `link_image_id_compute` (`link_image_id.c:31`) | `O(output)`; FNV-1a over **all** segment bytes *and* vaddrs |
+
+Plus: `link_resolve_at`/`link_resolve_extend` are panic stubs
+(`link.c:629,638`); the GOT is one exactly-sized segment placed after everything
+(`link_reloc_layout.c:710-748`); relocations are applied **destructively** into
+`segment_bytes` at emit (`src/obj/elf/link.c:318-470`) — but the
+`LinkRelocApply` records that *produce* those writes are preserved as data first
+(invariant, internal `src/link/link.h:234-246`).
+
+**Benchmark (true shape).** `tmp/projects/lua`: 35 `.c` files; the Makefile
+compiles **32 objects** (CORE_O=20 + LIB_O=12) into `liblua.a`, then links **two**
+executables — `lua` and `luac` — that share the archive. So the substrate must
+model (a) archive members as link inputs and (b) one edited TU fanning out to
+multiple final images. `sqlite-amalg` (1 huge TU) and `yyjson` (1 TU) exercise
+the single-TU degenerate case.
+
+---
+
+## 3. Design principles
+
+1. **Provable locality, else fall back.** Reuse is correct only when the change
+ cannot alter symbol resolution (mold's cascading-effects argument). The full
+ link is always available and always correct; the incremental path is an
+ accelerator gated on a soundness check (§7.3). A correct-but-slow result
+ always beats a fast-but-wrong one.
+2. **Address stability is the bedrock.** Once a vaddr is published it never
+ moves. Unchanged atoms keep their bytes *and* addresses, so their relocations
+ never reapply — this is what makes a patch `O(change)`. Enforced by
+ overwrite-in-slack / append-to-free-slot, **never compact**.
+3. **Content-hash keying, not transient IDs.** `LinkInputId`/`LinkSymId` are
+ stable only *in-process* (`link.h:240-241`); a file rebuild allocates a fresh
+ `Linker`. So persisted state is keyed by **content hashes** and **symbol
+ names**, never by re-derived IDs (§10). This makes determinism a dedup
+ *nicety*, not a correctness *requirement*.
+4. **Relocation location is relative, target is symbolic.** Persist a reloc as
+ `(atom, offset-within-atom, kind, target-name, addend)`. Derive the absolute
+ write address and target address from *current* placements at apply time.
+ Moving an atom then needs **zero reloc rewriting** — placements change, the
+ reloc re-derives. (Closes the "rewrite `write_vaddr` on move" hazard.)
+5. **The move-on-grow primitive is swappable.** Everything else (atoms, slack,
+ free-list, persisted session, soundness gate) is independent of *how a caller
+ reaches a moved callee*. Ship **thunk-on-grow** first (no codegen change),
+ converge on **GOT-cell** later to share one mechanism with hot reload (§8).
+6. **Frontend-agnostic.** All work attaches at the shared `ObjBuilder` boundary
+ (`obj_finalize`, `src/api/compile.c:356`). C, Toy, asm, and WASM all benefit
+ with no per-frontend code beyond a tiny capability (§15).
+7. **Project rules.** No VLAs; no global state (everything hangs off
+ `Compiler`/`LinkSession`); multi-arch/multi-format behind the existing
+ `ArchImpl`/`ObjFormatImpl` vtables; determinism preserved on the full-link
+ path.
+
+---
+
+## 4. What exists vs what is new
+
+| Need | Status | Where |
+|---|---|---|
+| Durable, non-destructive reloc records | **exists** | `LinkRelocApply`, internal `link.h:234-246`, `link_internal.h:129` |
+| Stable IDs *within a link* | **exists** (in-process only) | `link.h:240-241` |
+| Per-input id translation | **exists** | `InputMap`, `link_internal.h:21` |
+| Atom-level GC granularity | **exists** | `InputMap.section_atom_*`; atoms placed individually `link_layout.c:282-284` |
+| Append cursors + reserved slack | **exists (JIT only)** | `link_jit.c:111-114`; **panics** on exhaustion `link_jit.c:1080` |
+| Apply one reloc to a live mapping | **exists (JIT)** | `cfree_jit_append_obj` path, `reloc_apply.c` |
+| AArch64 call-stub template | **exists (JIT only, off for static exe)** | `link_layout_jit_stubs` `link_reloc_layout.c:429` |
+| GOT slot machinery | **exists, but only for GOT-reloc kinds** | `link_layout_got` `link_reloc_layout.c:654`; `reloc_uses_got` `:376` |
+| BLAKE2b CAS blob/tree store | **exists** | `driver/dist`: `cas.c`/`blob.c`/`tree.c`, `DIST_BLAKE2B_LEN=32` |
+| Dependency iteration (C includes) | **exists** | `cfree_dep_iter_new/next` `src/api/compile.c:417-462`; `cc_dep_finish` `cc.c:2121` |
+| `LinkSession` type | **new** (only sketched in docs; today bare fields on `CfreeJit`) | — |
+| `link_resolve_extend` | **new** (panic stub) | `link.c:638` |
+| Per-atom slack / free-list / overwrite / grow-relocate | **new** | — |
+| Fall-back-instead-of-panic discipline | **new** (JIT preflight panics) | — |
+| Object/atom content identity | **new** | — |
+| Per-atom reloc & symbol indices | **new** (flat scans today: `obj_reloc_count` `obj.c:831`, `obj_symbol_find` `obj.c:528`) | — |
+| Incremental (per-segment) build-id | **new** (FNV-1a is whole-image) | `link_image_id.c:31` |
+| Move-on-grow primitive (thunk / GOT) | **new** (direct `R_AARCH64_CALL26` today) | — |
+
+The honest summary: durable relocs, stable in-process IDs, atom GC, and the JIT
+append *placement* are reusable; **everything that makes a relink incremental —
+content identity, slack/free-list, overwrite/grow, persistence, the soundness
+gate, graceful fallback, and the move primitive — is net-new code.**
+
+---
+
+## 5. The atom model (obj side)
+
+The minimal relocatable unit is an **atom**: one function or one data object.
+cfree already carries atoms for GC; incremental link promotes them to the
+patch unit.
+
+- Under `--incremental` (dev mode), frontends emit one section per
+ function/global (a `-ffunction-sections`/`-fdata-sections` equivalent) so each
+ atom is independently placeable. cfree already lays out kept atoms as
+ individual `LinkSection`s (`link_layout.c:282-284`).
+- Each atom gets a **content id**: BLAKE2b over its canonical form —
+ `bytes || align || flags || canonical(relocs)`, where `canonical(relocs)`
+ encodes each reloc as `(offset-within-atom, kind, target-name, addend)`.
+ Target is the *name*, never a transient id (principle 3).
+- The **object content id** is BLAKE2b over the atom-id list plus object-level
+ metadata (format, arch, ext flags). Two byte-identical compiles → identical
+ object id (modulo the determinism audit; see §12).
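
A minimal sketch of the canonical-form hashing described above, under loud assumptions: 64-bit FNV-1a stands in for BLAKE2b so the example needs no library, and `CanonReloc`, `canon_mix`, and `sketch_atom_id` are hypothetical names, not cfree API. It only illustrates that the id covers `bytes || align || flags || canonical(relocs)` and that the reloc target is hashed by name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical canonical reloc: the target is a NAME, never a transient id. */
typedef struct CanonReloc {
    uint32_t    offset;   /* offset within the atom */
    uint32_t    kind;
    const char* target;   /* symbol name */
    int64_t     addend;
} CanonReloc;

/* FNV-1a 64 step; a stand-in for BLAKE2b (assumption, for self-containment). */
static uint64_t canon_mix(uint64_t h, const void* p, size_t n) {
    const unsigned char* b = (const unsigned char*)p;
    for (size_t i = 0; i < n; i++) { h ^= b[i]; h *= 0x100000001b3ULL; }
    return h;
}

/* Mix an integer in a fixed little-endian encoding so the id is host-independent. */
static uint64_t canon_mix_u64(uint64_t h, uint64_t v) {
    unsigned char le[8];
    for (int i = 0; i < 8; i++) le[i] = (unsigned char)(v >> (8 * i));
    return canon_mix(h, le, sizeof le);
}

uint64_t sketch_atom_id(const uint8_t* bytes, uint32_t size,
                        uint32_t align, uint32_t flags,
                        const CanonReloc* relocs, uint32_t nrelocs) {
    uint64_t h = 0xcbf29ce484222325ULL;      /* FNV-1a 64 offset basis */
    h = canon_mix_u64(h, size);              /* length prefixes keep fields unambiguous */
    h = canon_mix(h, bytes, size);
    h = canon_mix_u64(h, align);
    h = canon_mix_u64(h, flags);
    h = canon_mix_u64(h, nrelocs);
    for (uint32_t i = 0; i < nrelocs; i++) {
        assert(relocs[i].target != NULL);
        size_t len = strlen(relocs[i].target);
        h = canon_mix_u64(h, relocs[i].offset);
        h = canon_mix_u64(h, relocs[i].kind);
        h = canon_mix_u64(h, len);
        h = canon_mix(h, relocs[i].target, len);
        h = canon_mix_u64(h, (uint64_t)relocs[i].addend);
    }
    return h;
}
```

Because nothing placement-dependent enters the hash, an atom whose neighbours changed but whose own bytes and relocs did not keeps its id, which is what lets the diff in §7.2 skip it.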
+
+### 5.1 obj internals additions
+
+```c
+/* src/obj/obj.h — new */
+typedef struct ObjAtomId { u32 v; } ObjAtomId; /* 0 = none */
+
+/* Deterministic content identity over the canonical form above. */
+void obj_atom_content_id(ObjBuilder*, ObjAtomId, u8 out[DIST_BLAKE2B_LEN]);
+void obj_content_id(ObjBuilder*, u8 out[DIST_BLAKE2B_LEN]);
+
+/* O(1) per-atom lookups (today both are linear scans). */
+const Reloc* obj_atom_reloc_first(ObjBuilder*, ObjAtomId, ObjRelocCursor*);
+ObjSymId obj_symbol_by_name(ObjBuilder*, Sym name); /* hash, not O(nsyms) */
+```
+
+Required obj changes:
+1. **Per-atom reloc index.** Today `obj_reloc_count` scans the flat reloc table
+ (`obj.c:831`). Add a per-atom reloc list so that finding the relocs touching
+ atom A is an `O(1)` lookup followed by a walk of only that atom's relocs.
+2. **Symbol-by-name hash.** `obj_symbol_find` is `O(nsyms)` (`obj.c:528`). Add a
+ name→`ObjSymId` hash on the builder.
+3. **Deterministic, lossless serialize/deserialize.** A cached/persisted object
+ must rehydrate identically: atoms, relocs, COMDAT groups, debug sections, and
+ format `ext_type/ext_flags` (round-trip-safe today per survey) all preserved.
+ This is the cache-value contract the build system relies on (§16).
+4. `obj_finalize` (`obj.c`, currently reserved/empty) is the natural place to
+ compute and memoize the content ids once a TU is built.
+
+---
+
+## 6. The `LinkSession` (link side)
+
+A new type that owns the state that must outlive one `link_resolve` and can be
+persisted. It generalizes the per-segment cursor/slack fields currently inlined
+in `struct CfreeJit` (`link_jit.c:92-114`) and adds overwrite, free-list, grow,
+and graceful fallback.
+
+```c
+/* src/link/link_session.h — new. Hangs off Compiler; no global state. */
+typedef struct LinkFreeList LinkFreeList; /* gold-style two-level free list */
+
+typedef struct LinkAtomPlace { /* one per placed atom */
+ u8 content_id[DIST_BLAKE2B_LEN]; /* §5 atom content id (the key) */
+ u64 vaddr; /* published address — STABLE */
+ u64 file_offset;
+ u32 size;
+ u32 capacity; /* size + reserved slack */
+ u32 seg_bucket; /* SEG_RX/R/RW/TLS */
+ /* relocs stored relative: (offset-within-atom, kind, target_name, addend) */
+} LinkAtomPlace;
+
+typedef struct LinkSession {
+ Compiler* c;
+ Linker* l; /* stable LinkInputId -> ObjBuilder* */
+ LinkImage* img; /* now MUTABLE-by-patch */
+ u64 cursor[SEG_NBUCKETS]; /* append cursor per class (from JIT) */
+ u64 limit[SEG_NBUCKETS]; /* reserved ceiling per class */
+ LinkFreeList free[SEG_NBUCKETS]; /* vacated slots, first-fit reuse */
+ u32 slack_pct; /* per-atom reserve, default 10% */
+ /* atom placement table, keyed by content_id; the persisted core (§10) */
+ LinkAtomPlace* atoms; u32 natoms;
+} LinkSession;
+
+/* Fixed-size transaction watermark (no VLA). */
+typedef struct LinkPatchTxn {
+ u32 old_natoms, old_nsyms, old_nsections, old_nrelocs;
+ u64 old_cursor[SEG_NBUCKETS];
+ /* free-list undo log handle */
+} LinkPatchTxn;
+```
+
+`LinkImage` stays the read-side view for inspection/DWARF/emit, but its symbol,
+section, and reloc vectors become append/overwrite-capable (they already grow on
+the JIT path).
+
+---
+
+## 7. Incremental resolve & the soundness gate
+
+Implement `link_resolve_extend` (`link.c:638`, panic stub today) in two stages.
+
+### 7.1 Stage A — append-only (sound subset, first milestone)
+New inputs that only *add* definitions, resolving against the existing image +
+external resolver. This is exactly the JIT append model and reuses its cursor +
+slack *placement* (`link_jit.c` append path) — but writing to a file image and,
+critically, **falling back instead of panicking** on exhaustion.
+
+### 7.2 Stage B — patch changed atoms
+For a changed input, diff its atoms (by content id) against the persisted
+placement table:
+- **Unchanged atom** (id matches): keep placement, keep bytes, **do nothing** —
+ its relocations are never revisited.
+- **Changed atom, fits capacity**: overwrite bytes in place; reapply only *its*
+ relocations (re-derived from current placements, §9).
+- **Changed atom, grows past capacity**: allocate a new slot (free-list, else
+ bump `cursor[seg]`), return the old slot to the free-list, write bytes, and
+ make callers reach it via the move primitive (§8). Reapply its own relocations.
+- **New atom**: place at cursor/free-list; resolve & apply its relocations.
+- **Removed atom**: return its slot to the free-list; drop its symbols.
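
The per-atom decision above can be sketched as a small classifier. This is an illustration only: `SketchPlace`, `SketchPatchAction`, and `sketch_classify` are hypothetical names, and it assumes the caller has already paired each new atom with its prior placement (for example by symbol name, which is an assumption here). A removed atom is the remaining case, a prior placement with no new atom paired to it, and is handled by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { SKETCH_ID_LEN = 32 };   /* mirrors DIST_BLAKE2B_LEN=32 */

typedef enum SketchPatchAction {
    SKETCH_KEEP,        /* id matches: do nothing */
    SKETCH_OVERWRITE,   /* changed, fits capacity: overwrite in place */
    SKETCH_GROW_MOVE,   /* changed, exceeds capacity: new slot + move primitive */
    SKETCH_PLACE_NEW    /* no prior placement: place at cursor/free-list */
} SketchPatchAction;

typedef struct SketchPlace {
    uint8_t  content_id[SKETCH_ID_LEN];
    uint32_t size;
    uint32_t capacity;  /* size + reserved slack */
} SketchPlace;

/* `prior` is NULL when the atom did not exist in the persisted table. */
SketchPatchAction sketch_classify(const SketchPlace* prior,
                                  const uint8_t new_id[SKETCH_ID_LEN],
                                  uint32_t new_size) {
    if (prior == NULL) return SKETCH_PLACE_NEW;
    assert(prior->size <= prior->capacity);
    if (memcmp(prior->content_id, new_id, SKETCH_ID_LEN) == 0) return SKETCH_KEEP;
    if (new_size <= prior->capacity) return SKETCH_OVERWRITE;
    return SKETCH_GROW_MOVE;
}
```

Only `SKETCH_GROW_MOVE` involves the move primitive of §8; in the M2 milestone that outcome would instead trigger the fallback.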
+
+### 7.3 The soundness gate — fall back to full link when
+Compute the changed object's **interface** = { defined global names + bindings,
+COMMON sizes/aligns, set of undefs }. The edit is *local* only if the interface
+is identical to the persisted one **and** no archive pull-in changes. Otherwise
+fall back. Triggers (grounded in `link_resolve.c`):
+1. **Symbol-set / binding change** — added/removed global, weak↔strong flip:
+ changes global resolution (`bind_strength`) and which archive members pull.
+2. **Archive pull-in change** — a new undef now selects a `.a` member that was
+ not in the prior link (`link_ingest_archives` is greedy single-pass).
+3. **COMDAT ownership** — COFF SELECTANY keeps the *earlier* winner
+ (`link_resolve.c:308-323`). If the edited TU is the winner and its group body
+ changed, patch the shared body; if it is a loser, no-op; if ownership would
+ flip, fall back. COMMON size/align merge (`:288-303`) changing → fall back.
+4. **TLS size change** — boundary syms `__tdata_start/end`, `__tbss_size`
+ (`link_layout.c`) shift if any TLS section resizes → fall back.
+5. **Import-set change** (PLT/`.got.plt`/dynamic) → re-synthesize via
+ `fmt->layout_dyn` → fall back.
+6. **Slack/free-list exhaustion** in any segment → fall back.
+7. **Layout-affecting flags / linker script / `--gc-sections` / `-r` / LTO** →
+ full link (GC liveness is whole-graph; incompatible, as in gold).
+
+On fallback, **discard the half-mutated session** (the `LinkPatchTxn` watermark
+rolls back `cursor[]`/free-list/append counts) and run a normal full link, which,
+because the objects are already resident, is still far cheaper than a cold `cfree ld`.
+
+The JIT's duplicate-global *preflight* is the precedent for the gate — but it
+**panics**; converting "detect non-local" into "roll back + full link" is new
+control flow.
+
+---
+
+## 8. Placement, slack, and the move-on-grow primitive
+
+**Slack.** Today sections are contiguous with only alignment padding
+(`link_layout.c:340-348`). Under `--incremental`, reserve per-atom slack
+(`slack_pct`, gold's `--incremental-patch=n` analog) so overwrite-in-place is the
+common case. A two-level free-list (one of free file blocks, one per segment
+bucket) recycles vacated slots, first-fit.
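
As a rough sketch of the slack and first-fit ideas, assuming a flat array of free slots per bucket (the real free-list is described as two-level; this collapses it to one level for brevity). `SketchSlot`, `sketch_capacity`, and `sketch_first_fit` are hypothetical names, and the rounding policy is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct SketchSlot { uint64_t vaddr; uint32_t size; } SketchSlot;

/* capacity = size + slack_pct percent (rounded up), then rounded up to `align`,
   which must be a power of two. */
uint32_t sketch_capacity(uint32_t size, uint32_t slack_pct, uint32_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    uint64_t want = (uint64_t)size + ((uint64_t)size * slack_pct + 99) / 100;
    want = (want + align - 1) & ~((uint64_t)align - 1);
    assert(want <= UINT32_MAX);
    return (uint32_t)want;
}

/* First-fit: take the first vacated slot with at least `need` bytes.
   Returns its vaddr, or 0 if none fits (caller bumps the cursor or falls back).
   The whole slot is consumed; splitting is left out of this sketch. */
uint64_t sketch_first_fit(SketchSlot* slots, uint32_t* nslots, uint32_t need) {
    for (uint32_t i = 0; i < *nslots; i++) {
        if (slots[i].size >= need) {
            uint64_t at = slots[i].vaddr;
            /* close the gap while preserving order, so "first" stays meaningful */
            memmove(&slots[i], &slots[i + 1],
                    (size_t)(*nslots - i - 1) * sizeof slots[0]);
            (*nslots)--;
            return at;
        }
    }
    return 0;
}
```

With the default `slack_pct` of 10, a 100-byte atom aligned to 16 would get a 112-byte capacity under this rounding, so edits that add up to 12 bytes stay on the overwrite-in-place path.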
+
+**The move primitive — swappable.** When an atom moves, callers must still reach
+it without their bytes changing. Abstract this as one hook with two
+implementations; the rest of the design is identical either way.
+
+```c
+/* src/link/link_move.h — the only A/B difference */
+typedef struct LinkMoveOps {
+ /* make all references to `atom` reach its NEW vaddr, without touching callers */
+ void (*atom_moved)(LinkSession*, LinkAtomPlace* atom, u64 new_vaddr);
+} LinkMoveOps;
+```
+
+### 8.1 Thunk-on-grow (ship first) — `LinkMoveOps` = thunk
+Calls stay **direct** (`R_AARCH64_CALL26`, x64 `PLT32`, RV `CALL` — what codegen
+emits today; cross-TU is direct per `src/obj/macho/link.c:537`,
+`src/obj/elf/link.c:251`). On a move, leave a **jump island** at the atom's *old*
+slot pointing to the new location. Callers branch to the old address as before →
+hit the island → jump on. Properties:
+- **No codegen change.** Pure linker. Reuses the `link_layout_jit_stubs`
+ (`link_reloc_layout.c:429`) island shape as a template (per arch: aa64 jit/iplt
+ stub, x64 iplt stub `src/obj/x64/link.c:40`, rv64 trampoline).
+- **Reachability is free**: callers already branched directly to the old slot,
+ so the island there is in range by construction.
+- **Tax**: an extra jump *only for functions that moved* (one island per
+ function that ever grew, re-pointed on subsequent grows). Unmoved functions
+ pay nothing.
+- **Data caveat**: a thunk redirects code only. A grown *global* that must move
+ cannot be thunked. v1 rule: give data atoms generous slack and **fall back to
+ full link if a data atom outgrows its capacity** (never move data). This keeps
+ the thunk path entirely codegen-free.
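
To make the island idea concrete, here is a hedged sketch of the simplest possible x86-64 island: a 5-byte `jmp rel32` written at the old slot. This is not the existing iplt stub shape the text proposes to reuse; `sketch_x64_island` is a hypothetical helper that only shows the displacement arithmetic and the out-of-range case that would force a fallback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode `jmp rel32` (opcode E9) as if placed at old_vaddr, targeting new_vaddr.
   rel32 is relative to the END of the 5-byte instruction.
   Returns 0 on success, -1 if the displacement does not fit in 32 bits. */
int sketch_x64_island(uint8_t out[5], uint64_t old_vaddr, uint64_t new_vaddr) {
    assert(out != NULL);
    int64_t rel = (int64_t)(new_vaddr - (old_vaddr + 5));
    if (rel < INT32_MIN || rel > INT32_MAX) return -1;
    uint32_t u = (uint32_t)(int32_t)rel;
    out[0] = 0xE9;
    for (int i = 0; i < 4; i++) out[1 + i] = (uint8_t)(u >> (8 * i));  /* little-endian */
    return 0;
}
```

The old slot must be at least 5 bytes for this to fit, and a `-1` return is one more reason to take the full-link path rather than publish a broken island.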
+
+### 8.2 GOT-cell (convergence target) — `LinkMoveOps` = got
+Under `--incremental`, codegen emits cross-unit calls (and movable-data loads)
+through a GOT cell (aa64 `ADRP+LDR+BLR`, x64 `call *cell(%rip)`, rv64
+`auipc+ld+jalr`). A move updates **one** cell. Properties:
+- **Per-arch codegen change** (instruction selection + reloc kinds) for calls
+ *and* data — `reloc_uses_got` (`link_reloc_layout.c:376`) currently lists only
+ GOT-relative kinds, and `link_layout_got` only allocates slots for those.
+- **Tax**: one extra indirect load on *every* cross-unit reference, uniformly.
+- **GOT growth**: a new cross-unit target adds a slot, but the GOT is a single
+ exactly-sized segment at the image end (`link_reloc_layout.c:710-748`). Needs
+ **reserved GOT slack + a GOT free-list**, with fall-back on exhaustion —
+ otherwise adding a slot moves the GOT and breaks stability for existing slots.
+- **Strategic upside**: it is the *same* primitive `doc/HOT_RELOAD.md §7` assumes
+ ("one slot per function changes; call sites not patched"). One GOT-cell-update
+ mechanism would then serve JIT hot reload *and* file incremental link.
+
+**Why thunk-first.** Thunk taxes only what moves and needs zero codegen, so it
+proves the slack/free-list/persistence/soundness machinery end-to-end fastest.
+The free-list, slack, session, and gate are reused verbatim when we later swap in
+GOT cells; only `LinkMoveOps` changes. Converge on GOT when hot reload needs it.
+
+---
+
+## 9. Relocation reuse & application
+
+`LinkRelocApply` records are durable data, never burned into bytes before emit
+(invariant `link.h:234-246`). Incremental link leans on this hard.
+
+- **Relative + symbolic form.** Persist each reloc as `(atom_content_id,
+ offset_within_atom, kind, target_name, addend)`. At apply time the absolute
+ write address is `atom.vaddr + offset_within_atom` and the target address is
+ the *current* placement of `target_name`. **An atom that moves needs no reloc
+ rewriting** — both addresses re-derive from current placements (principle 4).
+- **Reapply only the changed atom's relocs**, found via the new per-atom index
+ (§5.1). Unchanged atoms' relocs are never touched.
+- **Apply path.** File emit currently writes relocations destructively into
+ `segment_bytes` at emit (`src/obj/elf/link.c:318-470`). For patching we apply a
+ single atom's relocs into its (possibly newly placed) bytes using the same
+ `reloc_apply.c` kind dispatch the JIT uses, then re-emit only the changed
+ segment ranges. Per-arch reloc kinds already flow through `reloc_apply.c`.
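
A small sketch of the re-derivation rule, using the generic PC-relative formula `S + A - P` as the example kind. `PlacedAtom`, `RelReloc`, and `sketch_pcrel_value` are hypothetical; the linear name lookup is a simplification of whatever name index the session would keep.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct PlacedAtom { const char* name; uint64_t vaddr; } PlacedAtom;

/* Persisted form: location is relative to the owning atom, target is a name. */
typedef struct RelReloc {
    uint32_t    atom_index;   /* which atom holds the reloc site */
    uint32_t    offset;       /* offset within that atom */
    const char* target;       /* symbol name */
    int64_t     addend;
} RelReloc;

static const PlacedAtom* find_atom(const PlacedAtom* atoms, size_t n, const char* name) {
    for (size_t i = 0; i < n; i++)
        if (strcmp(atoms[i].name, name) == 0) return &atoms[i];
    return NULL;
}

/* Computes S + A - P from CURRENT placements. Returns 0 on success,
   -1 if the target has no placement (an unresolved symbol). */
int sketch_pcrel_value(const PlacedAtom* atoms, size_t natoms,
                       const RelReloc* r, int64_t* out) {
    assert(r->atom_index < natoms);
    const PlacedAtom* t = find_atom(atoms, natoms, r->target);
    if (t == NULL) return -1;
    uint64_t P = atoms[r->atom_index].vaddr + r->offset;   /* write address */
    *out = (int64_t)(t->vaddr + (uint64_t)r->addend - P);
    return 0;
}
```

Moving either atom changes only a `vaddr` in the placement table; the `RelReloc` record itself is never edited, which is the point of storing it in this form.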
+
+---
+
+## 10. Persisted incremental state
+
+Side-band, content-addressed — **not** gold's ELF-embedded `.gnu_incremental_*`
+sections (those are ELF-only; we are multi-format). Store one blob in the
+existing `driver/dist` CAS (`dist_cas_put_blob`, BLAKE2b) keyed by the link
+action id (a build-system concern; §16). The blob records, per input and per
+atom:
+- object content id + atom content ids (the diff keys);
+- `LinkAtomPlace` table: vaddr / file_offset / size / capacity / bucket;
+- symbol → vaddr bindings, keyed by **name**;
+- relocations in the relative+symbolic form of §9;
+- free-list state and per-segment cursors/limits.
+
+Because everything is content/name-keyed, reloading does **not** depend on a
+fresh `Linker` re-deriving identical `LinkSymId`s. The determinism audit (§12)
+becomes a dedup optimization, not a correctness gate. We still add a cheap guard:
+on reload, verify each referenced object blob's BLAKE2b matches its recorded id
+before trusting it (defends against a torn/garbage cache entry).
+
+---
+
+## 11. Image identity / build-id
+
+`link_image_id_compute` (`link_image_id.c:31`) is FNV-1a streamed over **every**
+segment's vaddr + file_size + bytes — `O(image)` and not incrementally
+updatable. For patching, compute a **per-segment subhash** and combine them
+(Merkle-style) into the image id, so a patch re-hashes only changed segments.
+Note this hash is FNV-1a and independent of the BLAKE2b used for content/CAS;
+keep them distinct.
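
One way the per-segment subhash could look, as a sketch only: each segment gets its own FNV-1a 64 over vaddr, size, and bytes, and the image id folds the subhashes in segment order. The function names and the exact fold are assumptions; the existing `link_image_id_compute` is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FNV64_BASIS 0xcbf29ce484222325ULL
#define FNV64_PRIME 0x100000001b3ULL

uint64_t sketch_fnv1a(uint64_t h, const void* p, size_t n) {
    const unsigned char* b = (const unsigned char*)p;
    for (size_t i = 0; i < n; i++) { h ^= b[i]; h *= FNV64_PRIME; }
    return h;
}

/* Hash an integer in little-endian byte order so the id is host-independent. */
static uint64_t fnv_u64(uint64_t h, uint64_t v) {
    unsigned char le[8];
    for (int i = 0; i < 8; i++) le[i] = (unsigned char)(v >> (8 * i));
    return sketch_fnv1a(h, le, sizeof le);
}

/* Subhash of one segment: depends only on that segment's vaddr, size, bytes. */
uint64_t sketch_segment_subhash(uint64_t vaddr, const uint8_t* bytes, uint64_t size) {
    assert(bytes != NULL || size == 0);
    uint64_t h = FNV64_BASIS;
    h = fnv_u64(h, vaddr);
    h = fnv_u64(h, size);
    return sketch_fnv1a(h, bytes, (size_t)size);
}

/* Image id: fold of the cached subhashes, O(number of segments). */
uint64_t sketch_image_id(const uint64_t* subs, size_t nsegs) {
    uint64_t h = FNV64_BASIS;
    for (size_t i = 0; i < nsegs; i++) h = fnv_u64(h, subs[i]);
    return h;
}
```

After a patch, only the touched segments' subhashes are recomputed; the fold over the cached array is cheap and yields the same id a from-scratch pass over the patched image would.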
+
+Consequence (acceptable, document loudly): an incremental output is **not
+byte-reproducible against a from-scratch full link** of the same sources — slack
+padding and (under GOT mode) indirection differ. Release builds (`--incremental`
+off) are canonical and reproducible.
+
+---
+
+## 12. Address-stability & determinism invariants
+
+- **Stability (falsifiable):** after a patch, `nm`/`addr2line` on an *unchanged*
+ symbol must return the same vaddr as before the patch. Enforced by
+ overwrite-in-slack / append-to-free-slot, never compact.
+- **Determinism audit (prerequisite for dedup, not correctness):** confirm that
+ identical `(source, flags, target)` yields byte-identical objects — audit
+ symbol ordering and `pool_intern` first-access order in obj emit. With
+ content/name keying (§10) a nondeterministic order only costs cache dedup, not
+ a wrong patch; but byte-stability is still wanted so two machines agree.
+- **Reloc re-derivation:** never store an absolute `write_vaddr`; always
+ `atom.vaddr + offset_within_atom` (principle 4).
+
+---
+
+## 13. Debug info (DWARF) consistency
+
+- A moved atom's `.debug_info`/`.debug_line`/`.debug_aranges` address ranges
+ change → reapply that atom's debug relocs (re-derived like §9). Unchanged
+ atoms' debug stays byte-stable because their addresses do.
+- v1 stance: rebuild only the *changed TU's* debug sections, `O(change)`. An
+ in-slack overwrite that does not move the atom leaves addresses (and therefore
+ `.debug_line` byte content) unchanged — free, but see the open question on
+ line-number-only shifts (§20).
+- `addr2line` and `cfree dbg` re-read debug from the patched image. The JIT path
+ invalidates a cached view by generation counter; a file consumer re-reads the
+ file, so the build-id change (§11) is the staleness signal.
+
+---
+
+## 14. Multi-format & multi-arch
+
+- **ELF first.** The atom + slack + move-primitive model is format-agnostic, but
+ Mach-O carries whole-image structures (chained fixups, `LC_DYLD_INFO`,
+ code-signature, `LC_UUID`) that resist in-place patching; enumerate which load
+ commands must be regenerated before attempting Mach-O incremental. COFF later.
+ Persisted state is side-band CAS for all three (§10), so no per-format on-disk
+ incremental metadata.
+- **Per-arch surface is small:** only (a) the move primitive's island/cell shape
+ and (b) the branch-into-island/cell reloc kind. aa64 has the jit-stub shape to
+ reuse; x64 (`src/obj/x64/link.c:40`) and rv64 each have a trampoline shape to
+ adapt. All reloc kinds already dispatch through `reloc_apply.c`.
+- CI exercises the patch path on **ELF/aa64 + ELF/x64** first (per project
+ "narrow test runs"); rv64 and Mach-O/COFF follow.
+
+---
+
+## 15. The frontend contract (shared across all frontends)
+
+All frontends converge to `ObjBuilder` and join the shared path at
+`obj_finalize` (`src/api/compile.c:356`), so the incremental machinery attaches
+once, frontend-agnostically. **Toy, asm, and WASM get incremental link with no
+frontend-specific code.** The "clear expectations" are a small optional
+capability plus four guarantees.
+
+```c
+/* include/cfree/compile.h — optional addition to CfreeFrontendVTable */
+typedef struct CfreeFrontendCaps {
+ const char* frontend_id; /* "c" / "toy" / "asm" / "wasm" */
+ u32 schema_version; /* bump on any codegen/output change */
+ /* report external inputs read this compile (for the build system's key). */
+ CfreeStatus (*report_deps)(CfreeFrontendState*, const CfreeFrontendCompileOptions*,
+ const CfreeSourceInput*, CfreeDepSink*);
+} CfreeFrontendCaps;
+```
+
+The contract each frontend must honor to be incrementally safe:
+1. **Deterministic output** — identical `(source, flags, target, deps)` ⇒
+ byte-identical `ObjBuilder` (§12).
+2. **Declared dependency set** — report every external input read. C reuses the
+ existing `CfreeDepIter` (`src/api/compile.c:417-462`); asm/Toy/WASM report
+ "none" (single-source TUs).
+3. **Stable, source-derived symbol naming** — no run-varying temp names; atom
+ content ids depend on it (§5).
+4. **Identity + version** — `frontend_id` + `schema_version` salt the
+ build-system key so any frontend change invalidates correctly.
+
+**Per-frontend cost:** C — low (wire `CfreeDepIter` + a version constant). asm,
+Toy, WASM — trivial (no deps; version constant).
+
+**Toy's REPL wrinkle.** Toy's durable module (the existing `commit`/`abort`
+hooks, `lang/toy/compile.c:215-223`) means the REPL path is *not* a pure function
+of source. That path either folds the module snapshot into the input key or opts
+out of caching; Toy's **batch/file** compile conforms like any other frontend.
+
+---
+
+## 16. The interface boundary the build system consumes
+
+The separate build-system plan (build graph, cache, watch) calls only this
+public surface; it never touches `src/link` internals.
+
+```c
+/* include/cfree/object.h */
+CfreeStatus cfree_obj_content_id(CfreeObjBuilder*, uint8_t out[CFREE_BLAKE2B_LEN]);
+
+/* include/cfree/link.h — new incremental session surface */
+typedef enum { CFREE_LINK_PATCHED, CFREE_LINK_FELL_BACK_FULL } CfreeLinkOutcome;
+
+CfreeStatus cfree_link_session_open_incremental(CfreeLinkSession*,
+ const void* persisted, size_t persisted_len); /* NULL = cold */
+CfreeStatus cfree_link_session_replace_input(CfreeLinkSession*, CfreeLinkInputId,
+ CfreeObjBuilder* changed); /* by content */
+CfreeStatus cfree_link_session_patch_emit(CfreeLinkSession*, CfreeWriter* image,
+ CfreeWriter* persisted_out, CfreeLinkOutcome* outcome);
+```
+
+The build system supplies changed objects (it decides *which* via its cache),
+gets back the patched image, the new persisted blob, and — crucially — the
+**outcome** so it knows whether the fast path applied or the link fell back. The
+object content id lets it detect "this TU's object is byte-identical, skip it."
+
+---
+
+## 17. Failure behavior (transactional)
+
+A patch is all-or-nothing from the consumer's view:
+- compile/resolve failure, gate fallback, slack exhaustion, or reloc-apply
+ failure ⇒ the image is unchanged (or a clean full link is produced).
+- Pages/bytes may have been written before a late failure; the `LinkPatchTxn`
+ watermark rolls back `cursor[]`, the free-list undo log, and the
+ atom/symbol/section/reloc counts so no partial result is published.
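
The watermark rollback can be pictured with a reduced session that tracks only an atom count and per-bucket cursors. This is a sketch under assumptions: `SketchSession` and `SketchTxn` are hypothetical cut-downs of `LinkSession` and `LinkPatchTxn`, four buckets are assumed from the `SEG_RX/R/RW/TLS` comment, and the free-list undo log is omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { SKETCH_NBUCKETS = 4 };   /* assumed: RX, R, RW, TLS */

typedef struct SketchSession {
    uint32_t natoms;
    uint64_t cursor[SKETCH_NBUCKETS];
} SketchSession;

/* Fixed-size watermark: a snapshot of the append counts and cursors. */
typedef struct SketchTxn {
    uint32_t old_natoms;
    uint64_t old_cursor[SKETCH_NBUCKETS];
} SketchTxn;

SketchTxn sketch_txn_begin(const SketchSession* s) {
    SketchTxn t;
    t.old_natoms = s->natoms;
    memcpy(t.old_cursor, s->cursor, sizeof t.old_cursor);
    return t;
}

/* Truncate everything appended since `begin`; nothing partial stays published. */
void sketch_txn_rollback(SketchSession* s, const SketchTxn* t) {
    assert(s->natoms >= t->old_natoms);   /* counts only grow inside a txn */
    s->natoms = t->old_natoms;
    memcpy(s->cursor, t->old_cursor, sizeof s->cursor);
}
```

Bytes already written past the restored cursors are simply treated as unallocated again, which is why the rollback itself does not need to touch the image.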
+
+---
+
+## 18. Implementation sequence
+
+**M0 — atom identity & obj indices (no behavior change).**
+`obj_content_id` / `obj_atom_content_id`, per-atom reloc index, symbol-by-name
+hash, deterministic round-trip + a determinism test. Wire `CfreeFrontendCaps`
+and the contract (C deps via `CfreeDepIter`; trivial for others).
+
+**M1 — `LinkSession` + append-only extend (Stage A).**
+Introduce `LinkSession`, implement `link_resolve_extend` for append-only against
+a file image, reusing JIT cursor/slack *placement* but **falling back, not
+panicking**. Persisted blob round-trips (§10).
+
+**M2 — patch changed atoms in slack (Stage B, no move yet).**
+Per-atom diff, overwrite-in-slack, reapply that atom's relocs (§9), per-segment
+build-id (§11), the soundness gate + transactional rollback (§7.3, §17). Atoms
+that would grow past capacity ⇒ fall back (no move primitive yet).
+
+**M3 — move-on-grow via thunk (`LinkMoveOps` = thunk).**
+Free-list, grow-relocate code atoms, jump islands (reuse jit-stub shape), data
+slack + fall-back-on-data-grow. ELF/aa64 then ELF/x64.
+
+**M4 — converge on GOT-cell (`LinkMoveOps` = got), if/when hot reload needs it.**
+`--incremental` codegen mode for cross-unit calls + movable data, reserved GOT
+slack + free-list. Shares the primitive with `doc/HOT_RELOAD.md`.
+
+Mach-O/COFF and rv64 patching follow M3/M4 per §14.
+
+---
+
+## 19. Test plan (narrow, per-arch, red-green)
+
+Prefer targeted runs; redirect output to a file and read it (project rules).
+
+- **M0:** compile `tmp/projects/lua/src/ltable.c` twice ⇒ identical
+ `obj_content_id` (determinism). Edit one function body ⇒ exactly that atom's
+ content id changes, others stable. aa64 + x64 only.
+- **M1:** initial object + appended object where appended code calls an initial
+ function; appended duplicate-strong-def ⇒ fall back (not panic); unresolved
+ ⇒ transactional, image unchanged.
+- **M2:** build `liblua.a` + `lua`; patch one in-slack function body ⇒ unchanged
+ symbols keep vaddrs (`nm` diff), binary runs (`test/lib` `exec_target`), and
+ the `link_resolve` whole-program path was *not* taken (instrument a counter, dump
+ to file). Negative: add a new global ⇒ fall back; weak↔strong flip ⇒ fall back;
+ new archive pull-in ⇒ fall back.
+- **M3:** grow a function past its slack ⇒ it relocates, an island appears at the
+ old slot, callers' bytes are byte-identical, result runs. Grow a *global* past
+ data slack ⇒ fall back. `addr2line` an unchanged function after a patch ⇒
+ correct file:line.
+- **Multi-output:** edit a core TU shared by `lua` and `luac` ⇒ both images
+ patch (or both fall back) consistently.
+
+---
+
+## 20. Open questions / decisions
+
+1. **DWARF on in-slack overwrite:** when an overwrite does not move the atom,
+ leave `.debug_line` byte-identical (free, but stale if source *line numbers*
+ shifted within the body), or always re-emit the atom's `.debug_line` on any
+ body change (correct, slightly slower)? (§13)
+2. **Data movement under thunk mode:** v1 forbids moving data (slack + fall
+ back). Is the slack budget for data atoms tunable per project, or fixed?
+3. **GOT convergence trigger:** build M4 only when hot reload needs the shared
+ cell, or proactively to unify the two paths sooner? (§8.2)
+4. **Determinism guarantee strength:** require byte-stable objects (enables
+ cross-machine dedup) or only content-keyed correctness (§12)?
+5. **Persisted-blob lifetime/keying:** the link action id is a build-system
+ concern (§16) — confirm the boundary: does the build system own the CAS key,
+ or does the link session?
+6. **Mach-O/COFF scope:** confirm ELF-only for v1 (§14); enumerate Mach-O
+ whole-image structures before committing to patch them.
diff --git a/test/asm/hostas_cross.sh b/test/asm/hostas_cross.sh
@@ -36,10 +36,16 @@
# fidelity in the x64 cc -S data round-trip — confirmed cc -S
# infidelity (direct `cc -c` executes correctly). Opt-in until
# that backlog closes. See doc/ASM_ROUNDTRIP_TESTING.md.
-# - riscv64-linux: `cc -S | cfree as | cfree ld` works; cross-EXEC pends a
-# healthy riscv64 user-mode emulator — the cfree-built static
-# rv64 ELF hangs under this host's podman qemu-riscv64 (a
-# clang-built one runs), so the bounded exec smoke SKIPS it.
+# - riscv64-linux: assembles, but cross-EXEC hangs — the rv64 cc -S round-trip
+# is unfaithful because rv64 has no symbolizer (ArchAsmOps):
+# the call emits `auipc ra,0x0; jalr ra,0(ra)` with the
+# R_RISCV_CALL reloc unsymbolized, so it calls itself (and
+# branches like `j 0x90` keep numeric targets). NOT an
+# emulation issue — a minimal clang rv64 static exe and the
+# DIRECT cfree `cc -c` object both run correctly under the
+# same qemu-riscv64. Needs an rv64 ArchAsmOps (is_local_branch
+# for j/beq/...; reloc_operand for %pcrel_hi/%pcrel_lo/call).
+# The bounded exec smoke SKIPS it until then.
#
# Override the matrix with CFREE_HOSTAS_CROSS_TARGETS="tag:triple ..." and the
# clang-as gate with CFREE_HOSTAS_ENFORCE_CLANG=0 (demote lane B to XFAIL).