asm+test: full Toy cc -S → cfree/clang as exec parity across aa64/x64/rv64 - kit

commit 3d661011371f95c34738c1b22bda05e0309a96f5
parent f26028bd7df956414c3f1fc90638cd69ec15e7c8
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Sat, 30 May 2026 23:00:23 -0700

asm+test: full Toy cc -S → cfree/clang as exec parity across aa64/x64/rv64

The cross-compile + cross-exec lane (test-hostas-cross) now passes with BOTH
assemblers, judged by EXECUTION (matching exit codes) rather than by bytes, for
all three ELF targets under podman/qemu: 936/936 each (312 cases × {O0,O1} × 3
arches), ENFORCE_CLANG on. All three arches are now the gating default.

Three independent problems were in the way:

1. Harness wedge (the "rv64 doesn't work under podman" report). The batched
   single-container runner had no per-case timeout, so ONE hanging binary (a
   clang-assembled jump-table case under qemu-user) blocked all 312 cases,
   leaving the rest unscored (read back as rc 127 → 227 false fails). Add a
   per-case `timeout -s KILL` ($EXEC_CASE_TIMEOUT, default 20s) inside the
   in-container loop so a hang fails exactly one case and the loop continues.
   This made podman exec reliable for all three arches.

2. rv64 clang lane. cfree computes some `&&label` and jump-table targets as
   fixed byte offsets that assume its own uncompressed, un-relaxed layout;
   clang's C extension compresses instructions and shifts them. Emit
   `.option norvc`/`.option norelax` (new ArchAsmOps.file_prologue) to pin the
   layout through any assembler; cfree-as accepts `.option` (it never
   compresses/relaxes anyway).

3. x64 clang lane. x86 has no layout-pinning directive (clang picks movabs vs
   mov-imm32, jmp rel32 vs rel8), so reference code locations SYMBOLICALLY
   instead: `&&label` address-takes become `leaq Lcf_*(%rip)` (un-relocated
   PC-relative computes detected via new ArchAsmOps.pcrel_code_target; the
   target gets a synthesized label), and switch jump-table entries become
   `.quad Lcf_*` rather than `.quad fn+off` (absolute data pointers into an
   executable section are re-pointed at a synthesized code label by the
   extended collect_code_anchors + code_target_label). Gated to arches that
   need it (x86_64); aarch64 is fixed-width and rv64 uses .option norvc, both
   unchanged.
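
Why a fixed offset breaks in items 2 and 3, as a minimal sketch rather than
cfree's actual code: the instruction lengths below are illustrative, and
`insn_offset`, `pcrel_target` and `synth_label` are hypothetical helpers (the
exact spelling produced by fmt_synth_label is not shown in this commit). The
target arithmetic mirrors the `off + nb + disp` computation in
collect_code_anchors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Encoded lengths of the same four instructions under two assemblers.
 * Illustrative values: leaq disp32(%rip) = 7 bytes, and the second
 * instruction is a jmp encoded as rel32 (5 bytes) vs rel8 (2 bytes). */
static const uint8_t cfree_lens[4] = {7, 5, 3, 1};
static const uint8_t other_lens[4] = {7, 2, 3, 1};

/* Byte offset at which instruction `idx` starts, given per-insn lengths. */
static uint32_t insn_offset(const uint8_t *lens, size_t n, size_t idx) {
  uint32_t off = 0;
  assert(idx <= n);
  for (size_t i = 0; i < idx; ++i) off += lens[i];
  return off;
}

/* Target of an un-relocated `leaq disp(%rip)`: the %rip base is the END of
 * the instruction, so target = insn_off + insn_len + disp. */
static int64_t pcrel_target(uint32_t insn_off, uint32_t insn_len,
                            int64_t disp) {
  return (int64_t)insn_off + (int64_t)insn_len + disp;
}

/* HYPOTHETICAL label spelling in the Lcf_<sec>_<off> shape; an assembler
 * resolves such a label after it has chosen its own encodings. */
static int synth_label(char *buf, size_t cap, unsigned sec, uint32_t off) {
  return snprintf(buf, cap, "Lcf_%u_%u", sec, off);
}
```

With `cfree_lens`, instruction 2 starts at byte 12, and a `leaq` at offset 0
with displacement 5 resolves to 0 + 7 + 5 = 12. With `other_lens` the same
instruction starts at byte 9, while byte 12 is now the start of instruction 3,
so the baked-in displacement lands on the wrong instruction. A label defined at
instruction 2 moves with it.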

Verified green with no regressions: test-hostas-toy (312/0 both lanes),
test-toy (1338/0), test-asm (27/0), test-asm-x64 (13/0), test-asm-roundtrip
(572/0), test-asm-roundtrip-toy (624/0), test-asm-symmetry (no new asymmetry),
test-diff-llvm (agrees), test-link (122/0), test-elf (40/0), test-driver-ar.
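
Returning to the harness wedge in item 1: the per-case cap can be sketched in C
as follows. This is not the harness's implementation (that is the in-container
shell loop using `timeout -s KILL`); `run_capped` is a hypothetical helper,
POSIX is assumed, and the polling interval is arbitrary. It shows the property
that matters: a hung child is killed and reported as rc 137, and the caller is
free to move on to the next case.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Run `cmd` via /bin/sh and SIGKILL it once `cap_secs` have elapsed.
 * Returns a shell-style status: the exit code, 128+signal if the child was
 * killed by a signal (137 for SIGKILL), or 127 if it could not be run. */
static int run_capped(const char *cmd, int cap_secs) {
  struct timespec start, now, tick = {0, 20 * 1000 * 1000}; /* 20 ms poll */
  int status = 0, killed = 0;
  pid_t pid;

  assert(cmd && cap_secs > 0);
  clock_gettime(CLOCK_MONOTONIC, &start);
  pid = fork();
  if (pid < 0) return 127;
  if (pid == 0) {
    execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
    _exit(127); /* exec failed */
  }
  for (;;) {
    pid_t r = waitpid(pid, &status, WNOHANG);
    double elapsed;
    if (r == pid) break;   /* child finished (or was killed) */
    if (r < 0) return 127; /* unexpected wait failure */
    clock_gettime(CLOCK_MONOTONIC, &now);
    elapsed = (double)(now.tv_sec - start.tv_sec) +
              (double)(now.tv_nsec - start.tv_nsec) / 1e9;
    if (!killed && elapsed >= (double)cap_secs) {
      kill(pid, SIGKILL); /* the hang fails this one case only */
      killed = 1;
    }
    nanosleep(&tick, NULL);
  }
  if (WIFSIGNALED(status)) return 128 + WTERMSIG(status);
  return WEXITSTATUS(status);
}
```

Calling `run_capped` once per case in a loop gives the same shape as the
batched runner: every case gets a recorded status, so nothing is left unscored
behind a hang.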

Diffstat:
M doc/ASM_ROUNDTRIP_TESTING.md  | 57 +++++++++++++++++++++++++++------------------------------
M src/api/asm_emit.c  | 183 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------
M src/arch/arch.h  | 39 +++++++++++++++++++++++++++++++++++++++
M src/arch/registry.c  | 18 ++++++++++++++++++
M src/arch/rv64/asm.c  | 16 ++++++++++++++++
M src/arch/x64/asm.c  | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M src/asm/asm.c  | 8 +++++++-
M test/asm/hostas_cross.sh  | 46 +++++++++++++++++++++++-----------------------
M test/lib/exec_target.sh  | 15 +++++++++++++--

9 files changed, 363 insertions(+), 85 deletions(-)
diff --git a/doc/ASM_ROUNDTRIP_TESTING.md b/doc/ASM_ROUNDTRIP_TESTING.md
@@ -246,36 +246,33 @@ raw `exit_group` syscall.
 Each target **self-skips** (never fails) unless the host has (1) a clang cross
 target, (2) a runner (podman/qemu), (3) a working `cc -S | cfree as` round-trip
 for that arch, and (4) a passing **bounded** exec smoke (so a wedged emulator
-downgrades to SKIP instead of hanging). Status:
-
-- **aarch64-linux**: green end-to-end (cfree-as 312/0, clang-as 312/0) — podman
-  runs arm64 natively in its VM, so it's fast and the primary verified target.
-- **x86_64-linux**: the x64 `cc -S` symbolizer is complete — the aarch64
-  symbolizer was arch-generalized (`ArchAsmOps.is_local_branch` for `jmp`/`jcc`,
-  an x64 `reloc_operand` table for `sym(%rip)`/bare-`@PLT`/`@GOTPCREL` with a +4
-  rel32 addend bias, operand-driven RIP surgery) and the `emit_data_range` data
-  path now handles `R_PC32`/`R_PC64` (jump tables, global/array/fp/static-string
-  data). `cc -S | cfree as` re-assembles AND **cross-EXECS the whole corpus
-  correctly: cfree-as 312/312.** Byte-faithful 300/312 — the 12 are alloca/abi
-  cases where the re-assembled encoding is execution-equivalent (e.g.
-  `leaq (%rsp)` vs `leaq 0(%rsp)`). The clang lane is 301/11 (cfree emits AT&T
-  text clang rejects). Opt-in (the global clang gate would fail on that residue).
-- **riscv64-linux**: the rv64 `cc -S` symbolizer landed — a new `ArchAsmOps` with
-  `is_local_branch` (j/beq/...), a `reloc_operand` covering `%pcrel_hi`/
-  `%pcrel_lo`/`%hi`/`%lo`, the `%pcrel_lo` AUIPC-anchor pairing (synthesized
-  `.Lpcrel` labels via a new `ARCH_RELOC_SURG_RV_LO12` + `emit_anchor`/
-  `ref_anchor`), and an `R_RISCV_CALL` AUIPC+JALR call-pair fusion to `call`/
-  `tail`. `cc -S | cfree as` round-trips AND **cross-EXECS correctly: cfree-as
-  312/312** — the earlier self-call hang is gone. Byte-faithful 282/312 — the 30
-  are tail-call cases where cfree codegen uses `t0` but the standard (and
-  RAS-friendly) `tail` pseudo the assembler emits uses `t1`; execution-identical.
-  The clang lane is 254/58 (rv64 data-symbolization syntax + bare-`fcvt`
-  rounding-mode that clang encodes differently). Opt-in.
-- **Remaining (both arches): the third-party `clang` lane.** cfree's `cc -S` is
-  faithfully re-assemblable and executable by cfree's own `as`, but not yet
-  fully clang-standard for x64 (a few AT&T spellings) or rv64 (data
-  `%`-operator syntax; bare-`fcvt` needs an explicit rounding-mode suffix). A
-  standard-conformance follow-up; does not block the cfree `-S` path.
+downgrades to SKIP instead of hanging). All three ELF targets are in the gating
+default and pass **both** lanes — **936/936** = 312 cases × {O0,O1} × 3 arches,
+cfree-as **and** clang-as, judged purely by execution (matching exit code):
+
+- **aarch64-linux**: green end-to-end — podman runs arm64 natively in its VM, so
+  it's fast. Fixed-width encodings preserve cfree's instruction layout, so code
+  references need no special spelling.
+- **x86_64-linux**: cc -S references code locations **symbolically** so clang's
+  encoding choices (movabs vs mov-imm32, `jmp` rel32 vs rel8) can't shift a fixed
+  byte offset onto the wrong instruction: a `&&label` address-take is
+  `leaq Lcf_*(%rip)` (un-relocated PC-relative computes are detected via
+  `ArchAsmOps.pcrel_code_target`, the target gets a synthesized label), and
+  switch jump-table entries are `.quad Lcf_*` rather than `.quad fn+off`
+  (absolute data pointers into an executable section are re-pointed at a
+  synthesized code label by `collect_code_anchors` + `code_target_label`).
+- **riscv64-linux**: cc -S emits `.option norvc`/`.option norelax`
+  (`ArchAsmOps.file_prologue`) to pin cfree's fixed layout against clang's
+  C-extension compression — cfree computes some `&&label`/jump-table targets as
+  fixed offsets, which compression would otherwise shift — plus the
+  `%pcrel_hi`/`%pcrel_lo` AUIPC-anchor pairing and `R_RISCV_CALL` AUIPC+JALR
+  fusion to `call`/`tail`.
+
+Both lanes are judged by **execution**, never by bytes: cfree and clang emit
+different (execution-equivalent) code, so a byte/text match would be meaningless.
+The batched container runner caps each case at `EXEC_CASE_TIMEOUT` seconds
+(default 20) so a single hanging binary can't wedge the whole single-container
+run, leaving every later case unscored.
 
 Override the matrix with `CFREE_HOSTAS_CROSS_TARGETS="tag:triple ..."`, the
 exec-smoke cap with `CFREE_HOSTAS_EXEC_TIMEOUT=<secs>`, and per-arch images with
diff --git a/src/api/asm_emit.c b/src/api/asm_emit.c
@@ -529,6 +529,8 @@ typedef struct {
   u16 kind;
   Sym sym;
   i64 addend;
+  ObjSecId target_sec; /* section the reloc's symbol is defined in (or NONE) */
+  u64 target_val;      /* the symbol's value (offset within target_sec) */
 } SecReloc;
 
 static int cmp_secreloc(const void* va, const void* vb) {
@@ -564,6 +566,8 @@ static SecReloc* collect_relocs(Compiler* c, ObjBuilder* ob, ObjSecId sec_id,
     arr[n].kind = r->kind;
     arr[n].sym = s ? s->name : (Sym)0;
     arr[n].addend = r->addend;
+    arr[n].target_sec = s ? s->section_id : OBJ_SEC_NONE;
+    arr[n].target_val = s ? s->value : 0;
     ++n;
   }
   if (n > 1) qsort(arr, n, sizeof(SecReloc), cmp_secreloc);
@@ -909,12 +913,40 @@ static int is_btarget(const EmitCtx* x, u32 off) {
   return 0;
 }
 
-/* Pre-scan: collect in-section branch targets of un-relocated local branches. */
-static u32* collect_branch_targets(Compiler* c, ArchDisasm* dasm,
-                                   const SecReloc* relocs, u32 nrelocs,
-                                   const u8* data, u32 total, u32* n_out) {
+/* Append `off` to a dynamic, deduplicated anchor array (arena-grown). */
+static void anchor_add(Compiler* c, u32** arr, u32* n, u32* cap, u32 off) {
+  u32 j;
+  for (j = 0; j < *n; ++j)
+    if ((*arr)[j] == off) return;
+  if (*n == *cap) {
+    u32 nc = *cap ? *cap * 2 : 8;
+    u32* na = arena_array(c->tu, u32, nc);
+    if (!na) return;
+    if (*arr) memcpy(na, *arr, *cap * sizeof(u32));
+    *arr = na;
+    *cap = nc;
+  }
+  (*arr)[(*n)++] = off;
+}
+
+/* Pre-scan: offsets in section `sec_id` that need a synthesized Lcf_ label so a
+ * layout-dependent reference resolves symbolically through any assembler:
+ *   1. targets of un-relocated intra-section local branches (b/jmp/jcc);
+ *   2. targets of un-relocated PC-relative code-address-takes (x86-64 `leaq
+ *      disp(%rip)` for `&&label`);
+ *   3. offsets targeted by an absolute data-pointer relocation living in a
+ *      NON-executable section (switch jump-table `.quad fn+off` entries).
+ * (2)/(3) are exactly the references that break when the assembler picks
+ * different instruction lengths than cfree did, so they are collected only for
+ * arches that need symbolic code refs (x86-64); fixed-width (aarch64) or
+ * layout-pinned (RISC-V .option norvc) arches keep the compact offset forms. */
+static u32* collect_code_anchors(Compiler* c, ObjBuilder* ob, ObjSecId sec_id,
+                                 ArchDisasm* dasm, const SecReloc* relocs,
+                                 u32 nrelocs, const u8* data, u32 total,
+                                 u32* n_out) {
   u32* arr = NULL;
   u32 n = 0, cap = 0, off = 0;
+  int want_sym = arch_needs_symbolic_code_refs(c);
 
   *n_out = 0;
   while (off < total) {
@@ -925,30 +957,44 @@ static u32* collect_branch_targets(Compiler* c, ArchDisasm* dasm,
       off += 1;
       continue;
     }
-    if (!reloc_in_range(relocs, nrelocs, off, nb) &&
-        arch_is_local_branch(c, insn.mnemonic) &&
-        parse_hex_tail(insn.operands, &tgt) && tgt < total) {
-      u32 j;
-      int found = 0;
-      for (j = 0; j < n; ++j)
-        if (arr[j] == (u32)tgt) {
-          found = 1;
-          break;
-        }
-      if (!found) {
-        if (n == cap) {
-          u32 nc = cap ? cap * 2 : 8;
-          u32* na = arena_array(c->tu, u32, nc);
-          if (!na) break;
-          if (arr) memcpy(na, arr, cap * sizeof(u32));
-          arr = na;
-          cap = nc;
+    if (!reloc_in_range(relocs, nrelocs, off, nb)) {
+      if (arch_is_local_branch(c, insn.mnemonic) &&
+          parse_hex_tail(insn.operands, &tgt) && tgt < total) {
+        anchor_add(c, &arr, &n, &cap, (u32)tgt);
+      } else if (want_sym) {
+        i64 disp;
+        if (arch_pcrel_code_target(c, insn.mnemonic, insn.operands, &disp)) {
+          i64 t = (i64)off + (i64)nb + disp;
+          if (t >= 0 && (u64)t < total)
+            anchor_add(c, &arr, &n, &cap, (u32)t);
         }
-        arr[n++] = (u32)tgt;
       }
     }
     off += nb;
   }
+
+  if (want_sym) {
+    u32 nr = obj_reloc_total(ob), i;
+    for (i = 0; i < nr; ++i) {
+      const Reloc* r = obj_reloc_at(ob, i);
+      const Section* host;
+      const ObjSym* s;
+      const char* dir;
+      u32 width;
+      int pcrel;
+      i64 t;
+      if (!r || r->removed) continue;
+      host = obj_section_get(ob, r->section_id);
+      if (!host || (host->flags & SF_EXEC)) continue; /* code reloc: skip */
+      if (!data_reloc_directive(r->kind, &dir, &width, &pcrel) || pcrel)
+        continue; /* only absolute data pointers (jump-table entries) */
+      s = obj_symbol_get(ob, r->sym);
+      if (!s || s->section_id != sec_id) continue;
+      t = (i64)s->value + r->addend;
+      if (t >= 0 && (u64)t < total) anchor_add(c, &arr, &n, &cap, (u32)t);
+    }
+  }
+
   if (n > 1) qsort(arr, n, sizeof(u32), cmp_u32);
   *n_out = n;
   return arr;
@@ -998,10 +1044,55 @@ static CfreeStatus emit_operands(Writer* w, const EmitCtx* x,
       return w_symbolized(w, insn->operands.s, insn->operands.len, name,
                           ARCH_RELOC_SURG_TAIL);
     }
+  } else {
+    /* Un-relocated PC-relative code-address-take (x86-64 `leaq disp(%rip)` for
+     * `&&label`): rewrite the fixed displacement to the synthesized target
+     * label so an encoding-divergent assembler recomputes it. */
+    i64 disp;
+    if (arch_pcrel_code_target(x->c, insn->mnemonic, insn->operands, &disp)) {
+      i64 t = (i64)off + (i64)insn->nbytes + disp;
+      if (t >= 0 && is_btarget(x, (u32)t)) {
+        char name[256];
+        build_label_name(name, sizeof name, x, (u32)t);
+        return w_symbolized(w, insn->operands.s, insn->operands.len, name,
+                            ARCH_RELOC_SURG_RIP);
+      }
+    }
   }
   return cfree_writer_write(w, insn->operands.s, insn->operands.len);
 }
 
+/* Symbolic name for a code location (target_sec:target_off) referenced from a
+ * data directive: an assemblable label defined exactly there if one exists,
+ * else the synthesized `Lcf_<sec>_<off>` that collect_code_anchors guarantees
+ * is emitted in the target section. Mirrors the synth-vs-real choice the label
+ * emitter makes (symbol_at / build_label_name), so both ends agree. */
+static u32 code_target_label(char* buf, u32 cap, Compiler* c, ObjBuilder* ob,
+                             ObjSecId target_sec, u32 target_off) {
+  ObjSymIter* it = obj_symiter_new(ob);
+  if (it) {
+    ObjSymEntry e;
+    while (obj_symiter_next(it, &e)) {
+      const ObjSym* s = e.sym;
+      Slice nm;
+      if (!s || s->removed || !s->name) continue;
+      if (s->section_id != target_sec || (u32)s->value != target_off) continue;
+      if (s->kind == SK_SECTION || s->kind == SK_FILE) continue;
+      nm = pool_slice(c->global, s->name);
+      if (slice_eq_cstr(nm, ".LpcrelHi")) continue;
+      if (sym_is_assemblable(nm)) {
+        u32 p = 0, j;
+        for (j = 0; j < nm.len && p + 1 < cap; ++j) buf[p++] = nm.s[j];
+        buf[p] = '\0';
+        obj_symiter_free(it);
+        return p;
+      }
+    }
+    obj_symiter_free(it);
+  }
+  return fmt_synth_label(buf, cap, (u32)target_sec, target_off);
+}
+
 /* Emit a data range, rendering any covered relocation as a symbolic integer
  * directive (`.quad sym+addend`) so cc -S | as reproduces the data relocation
  * table — switch jump tables (R_ABS64 against the function) and any other
@@ -1009,9 +1100,9 @@ static CfreeStatus emit_operands(Writer* w, const EmitCtx* x,
  * target the assembler can't spell, falls back to raw `.byte`; the dropped
  * reloc then surfaces in the round-trip's reloc comparison. `relocs` is the
  * section's relocation list, sorted by offset. */
-static CfreeStatus emit_data_range(Writer* w, Compiler* c, const u8* data,
-                                   u32 start, u32 end, const SecReloc* relocs,
-                                   u32 nrelocs) {
+static CfreeStatus emit_data_range(Writer* w, Compiler* c, ObjBuilder* ob,
+                                   const u8* data, u32 start, u32 end,
+                                   const SecReloc* relocs, u32 nrelocs) {
   u32 off = start;
   while (off < end) {
     const SecReloc* r = NULL;
@@ -1037,6 +1128,33 @@ static CfreeStatus emit_data_range(Writer* w, Compiler* c, const u8* data,
        * re-derives R_PC{32,64} instead of an absolute reloc. */
       ArchRelocOperand bare = {ARCH_RELOC_SURG_NONE, "", "", 0, 0, 0};
       if (data_reloc_directive(r->kind, &dir, &width, &pcrel) &&
+          off + width <= end) {
+        const Section* tsec = (r->target_sec != OBJ_SEC_NONE)
+                                  ? obj_section_get(ob, r->target_sec)
+                                  : NULL;
+        /* An absolute pointer into executable code (switch jump-table entry):
+         * spell it as a label that moves with the code rather than `fn+off`.
+         * After an encoding-divergent assembler re-lays-out the function, a
+         * fixed offset would point into the wrong instruction; a label is
+         * recomputed to the correct address. Only for arches that need it. */
+        if (!pcrel && tsec && (tsec->flags & SF_EXEC) &&
+            arch_needs_symbolic_code_refs(c)) {
+          char label[256];
+          u64 toff = r->target_val + (u64)r->addend;
+          CfreeStatus st;
+          code_target_label(label, sizeof label, c, ob, r->target_sec,
+                             (u32)toff);
+          st = w_str(w, dir);
+          if (st != CFREE_OK) return st;
+          st = w_str(w, label);
+          if (st != CFREE_OK) return st;
+          st = w_newline(w);
+          if (st != CFREE_OK) return st;
+          off += width;
+          continue;
+        }
+      }
+      if (data_reloc_directive(r->kind, &dir, &width, &pcrel) &&
           off + width <= end &&
           build_symref(symref, sizeof symref, c, &bare, r->sym, r->addend) >=
               0) {
@@ -1198,6 +1316,13 @@ CfreeStatus cfree_obj_builder_emit_asm(CfreeObjBuilder* builder,
   sx.c = c;
   nsec = obj_section_count(ob);
 
+  /* Arch-specific leading directives (e.g. RISC-V `.option norvc` to pin
+   * cfree's fixed instruction layout against a compressing assembler). */
+  {
+    const char* prologue = arch_asm_file_prologue(c);
+    if (prologue) w_str(w, prologue);
+  }
+
   for (i = 1; i < nsec; ++i) {
     const Section* sec = obj_section_get(ob, (ObjSecId)i);
     SymLabel* labels;
@@ -1243,8 +1368,8 @@ CfreeStatus cfree_obj_builder_emit_asm(CfreeObjBuilder* builder,
         buf_flatten(&sec->bytes, heap_data);
         flat_data = heap_data;
         if (dasm)
-          btargets = collect_branch_targets(c, dasm, relocs, nrelocs, flat_data,
-                                            total, &nbt);
+          btargets = collect_code_anchors(c, ob, (ObjSecId)i, dasm, relocs,
+                                          nrelocs, flat_data, total, &nbt);
       }
     } else if (total > 0 && sec->kind != SEC_BSS) {
       Heap* heap = c->ctx->heap;
@@ -1297,7 +1422,7 @@ CfreeStatus cfree_obj_builder_emit_asm(CfreeObjBuilder* builder,
         } else if ((sec->flags & SF_EXEC) && dasm && flat_data) {
           emit_disasm_range(w, &ctx, dasm, flat_data, off, next);
         } else if (flat_data) {
-          emit_data_range(w, c, flat_data, off, next, relocs, nrelocs);
+          emit_data_range(w, c, ob, flat_data, off, next, relocs, nrelocs);
         }
         off = next;
       }
diff --git a/src/arch/arch.h b/src/arch/arch.h
@@ -244,6 +244,28 @@ typedef struct ArchAsmOps {
    * pair fusion for the arch. */
   int (*reloc_call_pair)(u16 reloc_kind, CfreeSlice pair_mnemonic,
                          CfreeSlice pair_ops, const char** mnemonic_out);
+  /* Arch-specific leading directives emitted at the very top of a cc -S file,
+   * before any section, returned as a NUL-terminated string the printer writes
+   * verbatim (NULL = none). RISC-V returns "\t.option norvc\n\t.option norelax\n":
+   * cfree's codegen computes some PC-relative label / jump-table targets as
+   * fixed byte offsets that assume its own uncompressed, un-relaxed instruction
+   * stream, so a third-party assembler (clang) must be told not to compress or
+   * relax, or those offsets shift and the targets break. aarch64 is fixed-width
+   * and x86-64 symbolizes instead (see pcrel_code_target) -> NULL for both. */
+  const char* (*file_prologue)(void);
+  /* 1 if (mnemonic, operands) is an un-relocated PC-relative reference to a
+   * code address computed as a fixed displacement — x86-64 `leaq disp(%rip),
+   * reg` emitted for a `&&label` address-take. Sets *disp_out to the signed
+   * byte displacement from the END of the instruction to the target. The
+   * symbolizer then synthesizes a label at (insn_end + disp) and rewrites the
+   * displacement to that label so a re-encoding assembler recomputes it.
+   * Providing this hook ALSO opts the arch into symbolic switch jump-table
+   * entries (.quad fn+off -> .quad <label>): both are needed precisely when the
+   * arch's assembler may pick different instruction lengths than cfree did
+   * (x86-64 movabs/mov-imm32, jmp rel32/rel8). Fixed-width arches (aarch64) and
+   * arches that pin layout another way (RISC-V .option norvc) leave it NULL. */
+  int (*pcrel_code_target)(CfreeSlice mnemonic, CfreeSlice operands,
+                           i64* disp_out);
 } ArchAsmOps;
 
 typedef struct ArchImpl {
@@ -312,6 +334,23 @@ int arch_reloc_call_pair(const Compiler* c, u16 reloc_kind,
                          CfreeSlice pair_mnemonic, CfreeSlice pair_ops,
                          const char** mnemonic_out);
 
+/* Leading directive string for the top of a cc -S file for the compiler's
+ * target arch (e.g. RISC-V `.option norvc`), or NULL when the arch needs none.
+ * Thin dispatch over ArchAsmOps.file_prologue. */
+const char* arch_asm_file_prologue(const Compiler* c);
+
+/* 1 if `insn` is an un-relocated PC-relative code-address-take for the target
+ * arch, with *disp_out set to the signed displacement from the instruction end
+ * to the target. Thin dispatch over ArchAsmOps.pcrel_code_target. */
+int arch_pcrel_code_target(const Compiler* c, CfreeSlice mnemonic,
+                           CfreeSlice operands, i64* disp_out);
+
+/* 1 if the target arch needs code locations referenced symbolically (by label)
+ * rather than as fixed byte offsets in cc -S — true exactly for arches that
+ * provide pcrel_code_target (x86-64). Drives both `&&label` address-take and
+ * switch jump-table symbolization. */
+int arch_needs_symbolic_code_refs(const Compiler* c);
+
 ArchDisasm* arch_disasm_new(Compiler*);
 u32 arch_disasm_decode(ArchDisasm*, const u8* bytes, size_t len, u64 vaddr,
                        CfreeInsn* out);
diff --git a/src/arch/registry.c b/src/arch/registry.c
@@ -110,6 +110,24 @@ int arch_reloc_call_pair(const Compiler* c, u16 reloc_kind,
                                      mnemonic_out);
 }
 
+const char* arch_asm_file_prologue(const Compiler* c) {
+  const ArchImpl* a = arch_for_compiler(c);
+  if (!a || !a->asm_ops || !a->asm_ops->file_prologue) return NULL;
+  return a->asm_ops->file_prologue();
+}
+
+int arch_pcrel_code_target(const Compiler* c, CfreeSlice mnemonic,
+                           CfreeSlice operands, i64* disp_out) {
+  const ArchImpl* a = arch_for_compiler(c);
+  if (!a || !a->asm_ops || !a->asm_ops->pcrel_code_target) return 0;
+  return a->asm_ops->pcrel_code_target(mnemonic, operands, disp_out);
+}
+
+int arch_needs_symbolic_code_refs(const Compiler* c) {
+  const ArchImpl* a = arch_for_compiler(c);
+  return a && a->asm_ops && a->asm_ops->pcrel_code_target != NULL;
+}
+
 const CGBackend* cg_backend_for_session(const Compiler* c,
                                         const CfreeCodeOptions* opts) {
   if (opts && opts->check_only) {
diff --git a/src/arch/rv64/asm.c b/src/arch/rv64/asm.c
@@ -1110,10 +1110,26 @@ static int rv64_reloc_call_pair(u16 kind, CfreeSlice pair_mnemonic,
   return 0;
 }
 
+/* RISC-V cc -S file prologue. cfree computes a few PC-relative targets as
+ * fixed byte offsets baked into the instruction stream rather than as symbolic
+ * relocations: a `&&label` address-of (auipc+addi with a hardcoded immediate,
+ * no reloc) and switch jump-table entries (`.quad fn+offset`). Both assume
+ * cfree's own 4-byte-per-instruction, un-relaxed layout. A standards-conformant
+ * assembler such as clang defaults to the C extension and would compress
+ * instructions (e.g. `mv`->`c.mv`), shifting every later offset and sending
+ * those targets to the wrong place. `.option norvc`/`.option norelax` pin the
+ * layout so cfree's offsets stay valid through any assembler — cfree's own
+ * codegen never emits compressed/relaxed forms, so this only constrains a
+ * third party to match what cfree already does. */
+static const char* rv64_file_prologue(void) {
+  return "\t.option norvc\n\t.option norelax\n";
+}
+
 const ArchAsmOps rv64_asm_ops = {
     .reloc_operand = rv64_reloc_operand,
     .is_local_branch = rv64_is_local_branch,
     .reloc_call_pair = rv64_reloc_call_pair,
+    .file_prologue = rv64_file_prologue,
 };
 
 ArchAsm* rv64_arch_asm_new(Compiler* c) {
diff --git a/src/arch/x64/asm.c b/src/arch/x64/asm.c
@@ -1637,9 +1637,75 @@ static int x64_is_local_branch(CfreeSlice m) {
   return 0;
 }
 
+/* Parse a leading signed integer (decimal or 0x-hex) from [s, s+len). Returns
+ * chars consumed and sets *out, or 0 if no integer starts here. */
+static u32 x64_parse_leading_int(const char* s, u32 len, i64* out) {
+  u32 i = 0, start;
+  int neg = 0;
+  i64 v = 0;
+  if (i < len && (s[i] == '+' || s[i] == '-')) {
+    neg = (s[i] == '-');
+    ++i;
+  }
+  if (i + 1 < len && s[i] == '0' && (s[i + 1] == 'x' || s[i + 1] == 'X')) {
+    i += 2;
+    start = i;
+    for (; i < len; ++i) {
+      char c = s[i];
+      if (c >= '0' && c <= '9')
+        v = v * 16 + (c - '0');
+      else if (c >= 'a' && c <= 'f')
+        v = v * 16 + (c - 'a' + 10);
+      else if (c >= 'A' && c <= 'F')
+        v = v * 16 + (c - 'A' + 10);
+      else
+        break;
+    }
+  } else {
+    start = i;
+    for (; i < len; ++i) {
+      char c = s[i];
+      if (c >= '0' && c <= '9')
+        v = v * 10 + (c - '0');
+      else
+        break;
+    }
+  }
+  if (i == start) return 0;
+  *out = neg ? -v : v;
+  return i;
+}
+
+/* x86-64 `&&label` address-take: an un-relocated `leaq <disp>(%rip), %reg`. The
+ * disassembler renders the resolved target as a fixed displacement from the
+ * next instruction (the %rip base); report it so the symbolizer can swap in a
+ * label that an encoding-divergent assembler will recompute correctly. */
+static int x64_pcrel_code_target(CfreeSlice mnemonic, CfreeSlice operands,
+                                 i64* disp_out) {
+  const char* o = operands.s;
+  u32 ol = operands.len, i, n;
+  i64 disp = 0;
+  int has_rip = 0;
+  if (!(mnemonic.len == 4 && memcmp(mnemonic.s, "leaq", 4) == 0) &&
+      !(mnemonic.len == 3 && memcmp(mnemonic.s, "lea", 3) == 0))
+    return 0;
+  for (i = 0; i + 6 <= ol; ++i)
+    if (memcmp(o + i, "(%rip)", 6) == 0) {
+      has_rip = 1;
+      break;
+    }
+  if (!has_rip) return 0;
+  n = x64_parse_leading_int(o, ol, &disp);
+  /* The displacement must sit immediately before `(%rip)`. */
+  if (n == 0 || !(n + 6 <= ol && memcmp(o + n, "(%rip)", 6) == 0)) return 0;
+  *disp_out = disp;
+  return 1;
+}
+
 const ArchAsmOps x64_asm_ops = {
     .reloc_operand = x64_reloc_operand,
     .is_local_branch = x64_is_local_branch,
+    .pcrel_code_target = x64_pcrel_code_target,
 };
 
 ArchAsm* x64_arch_asm_new(Compiler* c) { return &x64_asm_open(c)->base; }
diff --git a/src/asm/asm.c b/src/asm/asm.c
@@ -1180,7 +1180,13 @@ static void do_directive(AsmDriver* d, Sym name) {
       sym_eq(d, name, "subsections_via_symbols") || sym_eq(d, name, "macro") ||
       sym_eq(d, name, "endm") || sym_eq(d, name, "if") ||
       sym_eq(d, name, "endif") || sym_eq(d, name, "else") ||
-      sym_eq(d, name, "include")) {
+      sym_eq(d, name, "include") ||
+      /* RISC-V `.option rvc/norvc/relax/norelax/push/pop/...`: cfree's own
+       * cc -S emits `.option norvc`/`.option norelax` to pin its fixed
+       * instruction layout (see rv64_file_prologue). cfree-as never compresses
+       * or relaxes, so it already honors these implicitly — accept and ignore
+       * rather than treat as an unknown directive. */
+      sym_eq(d, name, "option")) {
     d_skip_to_eol(d);
     return;
   }
diff --git a/test/asm/hostas_cross.sh b/test/asm/hostas_cross.sh
@@ -27,23 +27,24 @@
 # clang cross-compiler for it, (2) a runner (podman/qemu) per exec_target, (3) a
 # working `cfree cc -S | cfree as` round-trip for that arch, and (4) a bounded
 # exec smoke that returns the oracle. So the harness runs green on whatever the
-# host supports and self-extends as gaps close. Status at time of writing:
-#   - aarch64-linux: works end-to-end (podman runs arm64 natively in its VM).
-#                    This is the gating default (312/312 both lanes).
-#   - x86_64-linux:  `cc -S | cfree as` round-trips and CROSS-EXECS the whole
-#                    corpus correctly (cfree-as 312/312). The clang lane has a
-#                    small residue (~11 efail: cfree emits AT&T text clang
-#                    rejects). Opt-in: the global clang gate (ENFORCE_CLANG=1)
-#                    would fail on that residue, so x64 isn't in the gating
-#                    default yet — run with CFREE_HOSTAS_CROSS_TARGETS, optionally
-#                    CFREE_HOSTAS_ENFORCE_CLANG=0.
-#   - riscv64-linux: `cc -S | cfree as` round-trips and CROSS-EXECS correctly
-#                    (cfree-as 312/312) — the rv64 symbolizer (ArchAsmOps with
-#                    %pcrel_hi/%pcrel_lo anchor pairing + AUIPC/JALR call fusion)
-#                    landed; the earlier self-call hang is gone. The clang lane
-#                    has a larger residue (~58 efail: rv64 data-symbolization
-#                    syntax + bare-fcvt rounding-mode that clang encodes
-#                    differently). Opt-in, same as x64.
+# host supports and self-extends as gaps close. All three ELF targets now pass
+# BOTH lanes end-to-end (936/936 = 312 cases x {O0,O1} x 3 arches, ENFORCE_CLANG):
+#   - aarch64-linux: podman runs arm64 natively in its VM; fixed-width encodings
+#                    keep cfree's layout, so code references need no special form.
+#   - x86_64-linux:  cc -S references code locations symbolically — `&&label`
+#                    address-takes (`leaq Lcf_*(%rip)`) and switch jump-table
+#                    entries (`.quad Lcf_*`) — so clang's encoding choices
+#                    (movabs vs mov-imm32, jmp rel32 vs rel8) can't shift a fixed
+#                    offset onto the wrong instruction. (ArchAsmOps.pcrel_code_target
+#                    + collect_code_anchors; see src/api/asm_emit.c.)
+#   - riscv64-linux: cc -S emits `.option norvc`/`.option norelax` to pin cfree's
+#                    fixed instruction layout against clang's C-extension
+#                    compression, plus the %pcrel_hi/%pcrel_lo + AUIPC/JALR call
+#                    symbolizer.
+# Execution under qemu-user (x86_64/riscv64 in their podman containers) is the
+# sole judge — cfree and clang emit different code, so a byte/text match would be
+# meaningless. The batched runner caps each case (EXEC_CASE_TIMEOUT) so one
+# hanging binary can't wedge the whole container.
 #
 # Override the matrix with CFREE_HOSTAS_CROSS_TARGETS="tag:triple ..." and the
 # clang-as gate with CFREE_HOSTAS_ENFORCE_CLANG=0 (demote lane B to XFAIL).
@@ -60,12 +61,11 @@ FILTER="${1:-}"
 ENFORCE_CLANG="${CFREE_HOSTAS_ENFORCE_CLANG:-1}"
 EXEC_SMOKE_TIMEOUT="${CFREE_HOSTAS_EXEC_TIMEOUT:-45}"
 
-# "tag:triple" — tag is exec_target.sh's <arch>-<os> spelling. The gating
-# default is the fully-verified target (aarch64-linux). x86_64 and riscv64 are
-# wired and opt-in (see the status notes above) — add them with
-# CFREE_HOSTAS_CROSS_TARGETS once you want to exercise their in-progress lanes:
-#   CFREE_HOSTAS_CROSS_TARGETS="x64-linux:x86_64-linux-gnu rv64-linux:riscv64-linux-gnu"
-TARGETS="${CFREE_HOSTAS_CROSS_TARGETS:-aarch64-linux:aarch64-linux-gnu}"
+# "tag:triple" — tag is exec_target.sh's <arch>-<os> spelling. All three ELF
+# targets are in the gating default (each SKIPs cleanly if its clang cross
+# target or container runner is unavailable). Narrow the matrix with
+# CFREE_HOSTAS_CROSS_TARGETS, e.g. CFREE_HOSTAS_CROSS_TARGETS="x64-linux:x86_64-linux-gnu".
+TARGETS="${CFREE_HOSTAS_CROSS_TARGETS:-aarch64-linux:aarch64-linux-gnu x64-linux:x86_64-linux-gnu rv64-linux:riscv64-linux-gnu}"
 
 # Same TLS-symbolization skip as the sibling lanes.
 SKIP="141_threadlocal_mutate"
diff --git a/test/lib/exec_target.sh b/test/lib/exec_target.sh
@@ -287,7 +287,15 @@ _exec_target_flush_tag() {
             echo "exec_target_flush: EXEC_TARGET_MOUNT_ROOT must be set" >&2
             return 2
         fi
-        local platform image platform_flag=()
+        local platform image platform_flag=() case_to
+        # Per-case wall-clock cap inside the batched container. Without it a
+        # single hanging exe (e.g. a miscompiled loop, or qemu-user wedging on
+        # one binary) blocks the whole single-container run, leaving every
+        # later case with no .rc — which the caller reads back as 127 and
+        # reports as a mass failure. With it, a hang is killed (rc 137) and the
+        # loop moves on, so a real hang fails exactly one case. Override with
+        # EXEC_CASE_TIMEOUT (seconds); generous by default for slow TCG.
+        case_to="${EXEC_CASE_TIMEOUT:-20}"
         platform="$(_exec_target_platform "$tag")"
         image="$(_exec_target_image "$tag")"
         if ! _exec_target_podman_native "$tag"; then
@@ -307,12 +315,15 @@ _exec_target_flush_tag() {
                     "${EXEC_TARGET_RCS[$k]}"
             done
         } | podman run -i --rm --pull=never "${platform_flag[@]}" --net=none \
+                -e EXEC_CASE_TIMEOUT="$case_to" \
                 -v "$EXEC_TARGET_MOUNT_ROOT":"$EXEC_TARGET_MOUNT_ROOT":Z \
                 "$image" \
                 /bin/sh -c '
 set -u
+_to="${EXEC_CASE_TIMEOUT:-20}"
+if command -v timeout >/dev/null 2>&1; then _t="timeout -s KILL $_to"; else _t=""; fi
 while IFS="	" read -r exe out err rc; do
-    "$exe" >"$out" 2>"$err"
+    $_t "$exe" >"$out" 2>"$err"
     echo $? >"$rc"
 done
 '

	git clone https://git.ryansepassi.com/git/kit.git