commit 74ca227e2b3673d0e7955f10349183737e72b5cf
parent 2686dfe936ae06310cd00dd61166417df8b3b8a7
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Sat, 30 May 2026 21:04:54 -0700
x64+rv64: complete the cc -S round-trip symbolizer (cross-exec 312/312 each)
Finish the x86-64 and riscv64 `cc -S` paths so `cc -S | cfree as | cfree ld`
re-assembles and CROSS-EXECUTES the whole toy corpus correctly under podman/qemu
(cfree-as 312/312 for both arches; aarch64 unchanged at 312/0).
Symbolizer generalization (src/api/asm_emit.c, src/arch/arch.h, registry.c):
- emit_data_range/data_reloc_directive now round-trip R_PC32/R_PC64 data
relocations as `.long`/`.quad sym - .` (jump tables, global/array/fp/
static-string data), not just R_ABS32/64, which recovers the x64 data backlog.
- ArchAsmOps gains a reloc_call_pair hook (next to the existing is_local_branch);
ArchRelocOperand (which already carries addend_bias) gains the RISC-V hi/lo
anchor pairing (emit_anchor/ref_anchor
+ ARCH_RELOC_SURG_RV_LO12). The symbolizer synthesizes unique `.Lpcrel`
anchor labels for %pcrel_hi and resolves the paired %pcrel_lo to them, and
fuses an R_RISCV_CALL AUIPC+JALR pair into `call`/`tail sym`.
rv64 (src/arch/rv64/asm.c, arch.c): new rv64_reloc_operand (%pcrel_hi/
%got_pcrel_hi/%pcrel_lo/%hi/%lo/branches), rv64_is_local_branch,
rv64_reloc_call_pair, and rv64_asm_ops. On the assembler side, `mv` accepts the
low-half `mv rd,rs,%pcrel_lo(L)` form (folded back into ADDI), and fcvt
restores the rounding mode the disassembler drops (RTZ for fp->int truncation,
DYN for int->fp and fp<->fp: the C-correct, codegen-consistent modes). The
earlier self-call hang (an unsymbolized `auipc ra,0x0; jalr ra,0(ra)` that
called itself) is gone. The assembler's `tail` keeps the standard t1 scratch
register (RAS-friendly); cfree codegen uses t0, which is execution-equivalent,
so only the tail-call byte images differ and execution matches.
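x64 (src/arch/x64/{asm,isa,native}.c): x64_reloc_operand/is_local_branch as
before. asm.c routes two-byte-opcode memory operands through the
SIB/RIP/segment-aware pack helpers (so `movzbl (%rcx,%rsi,1), %edx` keeps its
index register) and always encodes 64-bit `add/sub $imm, %rsp` as imm32 to
match codegen's fixed-width stack adjustments; native.c prologue frame spills
use the minimal-disp encoding (x64_pack_mem) so the round-trip reproduces them
(an execution-equivalent change).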
Fixes a regression the integration introduced: aa64/x64 reloc_operand don't set
the new emit_anchor/ref_anchor fields, so the ArchRelocOperand stack decls in
asm_emit.c are now zero-initialized (uninitialized ref_anchor had made aa64
adrp / x64 callq fall into the rv64 anchor path and stay numeric — hostas-toy
had dropped to 19/293; restored to 312/0). Shared asm.c unary-minus UB fix
retained.
Updated the rv64_fp/rv64_fp_cvt encode goldens to the corrected (truncating)
fcvt rounding modes. Verified: test-toy 1338/0, test-asm 27/0, test-asm-x64
13/0, test-asm-rv64 43/0, test-asm-roundtrip 572/0, test-asm-roundtrip-toy
624/0/1, test-diff-llvm 271-agree/0-skip, hostas-toy aa64 312/0 both lanes;
x64 cross-exec cfree-as 312/0, rv64 cross-exec cfree-as 312/0.
Residual (tracked, not blocking the cfree -S path): the third-party clang lane
(x64 ~11, rv64 ~58 failures), where cfree emits assembly that clang rejects or
encodes differently (notably bare-fcvt rounding modes and rv64 data %-operator
syntax). This is a standard-conformance follow-up.
Diffstat:
13 files changed, 635 insertions(+), 124 deletions(-)
diff --git a/doc/ASM_ROUNDTRIP_TESTING.md b/doc/ASM_ROUNDTRIP_TESTING.md
@@ -250,24 +250,32 @@ downgrades to SKIP instead of hanging). Status:
- **aarch64-linux**: green end-to-end (cfree-as 312/0, clang-as 312/0) — podman
runs arm64 natively in its VM, so it's fast and the primary verified target.
-- **x86_64-linux**: the x64 `cc -S` symbolizer landed (the aarch64 symbolizer is
- now arch-generalized — `ArchAsmOps.is_local_branch` for `jmp`/`jcc`, a x64
- `reloc_operand` table for `sym(%rip)`/bare-`@PLT`/`@GOTPCREL` with a +4 rel32
- addend bias, and operand-driven RIP surgery), so the whole corpus
- **re-assembles 312/312** via both cfree-as and clang. Cross-EXEC is **272/312**:
- ~23 cases (switch/jump tables, global/array/fp data, varargs) lose fidelity in
- the x64 cc -S **data** round-trip — confirmed cc -S infidelity, since the
- DIRECT `cc -c` object executes correctly. That data backlog is the remaining
- x64 work. Opt-in until it closes.
-- **riscv64-linux**: assembles, but cross-EXEC **hangs** — NOT emulation (a
- minimal clang rv64 static exe AND the DIRECT cfree `cc -c` object both run
- correctly under the same qemu-riscv64; only the `cc -S | as` round-trip hangs).
- Root cause: rv64 has **no symbolizer** (no `ArchAsmOps`), so `cc -S` emits the
- call as `auipc ra,0x0; jalr ra,0(ra)` with the `R_RISCV_CALL` reloc
- unsymbolized — it calls itself — and branches keep numeric targets (`j 0x90`).
- Needs an rv64 `ArchAsmOps`: `is_local_branch` (j/beq/bne/...) and a
- `reloc_operand` for the RISC-V `%pcrel_hi`/`%pcrel_lo`/`%hi`/`%lo`/`call`
- syntax (the `%pcrel_lo` label-pairing makes this the hardest of the three).
+- **x86_64-linux**: the x64 `cc -S` symbolizer is complete — the aarch64
+ symbolizer was arch-generalized (`ArchAsmOps.is_local_branch` for `jmp`/`jcc`,
+ an x64 `reloc_operand` table for `sym(%rip)`/bare-`@PLT`/`@GOTPCREL` with a +4
+ rel32 addend bias, operand-driven RIP surgery) and the `emit_data_range` data
+ path now handles `R_PC32`/`R_PC64` (jump tables, global/array/fp/static-string
+ data). `cc -S | cfree as` re-assembles AND **cross-EXECS the whole corpus
+ correctly: cfree-as 312/312.** Byte-faithful 300/312 — the 12 are alloca/abi
+ cases where the re-assembled encoding is execution-equivalent (e.g.
+ `leaq (%rsp)` vs `leaq 0(%rsp)`). The clang lane is 301/11 (cfree emits AT&T
+ text that clang rejects). Opt-in (the global clang gate would fail on that residue).
+- **riscv64-linux**: the rv64 `cc -S` symbolizer landed — a new `ArchAsmOps` with
+ `is_local_branch` (j/beq/...), a `reloc_operand` covering `%pcrel_hi`/
+ `%pcrel_lo`/`%hi`/`%lo`, the `%pcrel_lo` AUIPC-anchor pairing (synthesized
+ `.Lpcrel` labels via a new `ARCH_RELOC_SURG_RV_LO12` + `emit_anchor`/
+ `ref_anchor`), and an `R_RISCV_CALL` AUIPC+JALR call-pair fusion to `call`/
+ `tail`. `cc -S | cfree as` round-trips AND **cross-EXECS correctly: cfree-as
+ 312/312** — the earlier self-call hang is gone. Byte-faithful 282/312 — the 30
+ are tail-call cases where cfree codegen uses `t0` but the standard (and
+ RAS-friendly) `tail` pseudo the assembler emits uses `t1`; execution-identical.
+ The clang lane is 254/58 (rv64 data-symbolization syntax + bare-`fcvt`
+ rounding-mode that clang encodes differently). Opt-in.
+- **Remaining (both arches): the third-party `clang` lane.** cfree's `cc -S` is
+ faithfully re-assemblable and executable by cfree's own `as`, but not yet
+ fully clang-standard for x64 (a few AT&T spellings) or rv64 (data
+ `%`-operator syntax; bare-`fcvt` needs an explicit rounding-mode suffix). A
+ standard-conformance follow-up; does not block the cfree `-S` path.
Override the matrix with `CFREE_HOSTAS_CROSS_TARGETS="tag:triple ..."`, the
exec-smoke cap with `CFREE_HOSTAS_EXEC_TIMEOUT=<secs>`, and per-arch images with
diff --git a/src/api/asm_emit.c b/src/api/asm_emit.c
@@ -83,6 +83,16 @@ static SymLabel* collect_labels(Compiler* c, ObjBuilder* ob, ObjSecId sec_id,
if (sym->section_id != sec_id) continue;
if (sym->kind == SK_SECTION || sym->kind == SK_FILE) continue;
if (!sym->name) continue;
+ /* RISC-V `.LpcrelHi` anchors are codegen-internal labels on AUIPC
+ * instructions, used only as the target of a paired `%pcrel_lo`
+ * relocation. Many share the one name (one per AUIPC), so emitting them
+ * verbatim defines the same label repeatedly and breaks re-assembly. The
+ * symbolizer replaces each with a unique synthesized anchor label
+ * (emit_anchor / ref_anchor), so suppress the originals here. */
+ {
+ Slice nm = pool_slice(c->global, sym->name);
+ if (slice_eq_cstr(nm, ".LpcrelHi")) continue;
+ }
if (n == cap) {
u32 ncap = cap ? cap * 2 : 8;
@@ -233,46 +243,67 @@ static u32 align_log2(u32 a) {
* SEC_DEBUG). SEC_OTHER (a global in a named section, e.g.
* __attribute__((section(...)))) emits the real name plus its flags/type/
* entsize in GNU-as syntax so the label and bytes survive re-assembly. */
+/* Emit `.section name, "flags", @type[, entsize]` (the GNU-as named-section
+ * form). Used for SEC_OTHER and for any canonical-kind section whose name or
+ * flags can't be reproduced by the bare `.text`/`.section .rodata` builtins. */
+static int elf_named_section(const AsmSynCtx* x, const Section* sec) {
+ Writer* w = x->w;
+ Slice nm = pool_slice(x->c->global, sec->name);
+ if (nm.len == 0) return 0;
+ w_str(w, " .section\t");
+ cfree_writer_write(w, nm.s, nm.len);
+ w_str(w, ", \"");
+ w_secflags(w, sec->flags);
+ w_str(w, "\", ");
+ w_str(w, sec->sem == SSEM_NOBITS ? "@nobits" : "@progbits");
+ if ((sec->flags & SF_MERGE) || sec->entsize) {
+ w_str(w, ", ");
+ w_dec(w, (u64)(sec->entsize ? sec->entsize : 1));
+ }
+ w_newline(w);
+ return 1;
+}
+
+/* Does this canonical-kind section round-trip through its bare builtin
+ * directive? Only if its name is exactly the canonical spelling and it carries
+ * no flags that the builtin can't express (MERGE/STRINGS/RETAIN/entsize). A
+ * `.rodata.foo.merge` mergeable-string section, for instance, must be spelled
+ * in full or the linker won't merge/GC it the way the direct object does. */
+static int sec_is_canonical(const AsmSynCtx* x, const Section* sec,
+ const char* canon) {
+ Slice nm = pool_slice(x->c->global, sec->name);
+ if (sec->flags & (SF_MERGE | SF_STRINGS | SF_RETAIN)) return 0;
+ if (sec->entsize) return 0;
+ return slice_eq_cstr(nm, canon);
+}
+
static int elf_section_header(const AsmSynCtx* x, const Section* sec) {
Writer* w = x->w;
+ if (sec->flags & SF_TLS) return 0;
switch (sec->kind) {
case SEC_TEXT:
+ if (!sec_is_canonical(x, sec, ".text")) return elf_named_section(x, sec);
w_str(w, " .text");
w_newline(w);
return 1;
case SEC_RODATA:
- if (sec->flags & SF_TLS) return 0;
+ if (!sec_is_canonical(x, sec, ".rodata"))
+ return elf_named_section(x, sec);
w_str(w, " .section\t.rodata");
w_newline(w);
return 1;
case SEC_DATA:
- if (sec->flags & SF_TLS) return 0;
+ if (!sec_is_canonical(x, sec, ".data")) return elf_named_section(x, sec);
w_str(w, " .section\t.data");
w_newline(w);
return 1;
case SEC_BSS:
- if (sec->flags & SF_TLS) return 0;
+ if (!sec_is_canonical(x, sec, ".bss")) return elf_named_section(x, sec);
w_str(w, " .section\t.bss");
w_newline(w);
return 1;
- case SEC_OTHER: {
- Slice nm;
- if (sec->flags & SF_TLS) return 0;
- nm = pool_slice(x->c->global, sec->name);
- if (nm.len == 0) return 0;
- w_str(w, " .section\t");
- cfree_writer_write(w, nm.s, nm.len);
- w_str(w, ", \"");
- w_secflags(w, sec->flags);
- w_str(w, "\", ");
- w_str(w, sec->sem == SSEM_NOBITS ? "@nobits" : "@progbits");
- if ((sec->flags & SF_MERGE) || sec->entsize) {
- w_str(w, ", ");
- w_dec(w, (u64)(sec->entsize ? sec->entsize : 1));
- }
- w_newline(w);
- return 1;
- }
+ case SEC_OTHER:
+ return elf_named_section(x, sec);
default:
return 0;
}
@@ -436,20 +467,34 @@ static CfreeStatus emit_raw_bytes(Writer* w, const u8* data, u32 start,
}
/* A reloc kind whose data field carries a symbol value reproducible by an
- * integer directive: maps to (directive, byte width). The assembler emits the
- * matching R_ABS{32,64} for `.word`/`.quad SYM+addend` (emit_int_directive),
- * so the round-tripped relocation matches codegen's. Returns 0 for kinds with
- * no integer-directive spelling (caller keeps the raw bytes). */
-static int data_reloc_directive(u16 kind, const char** dir, u32* width) {
+ * integer directive: maps to (directive, byte width, PC-relative?). The
+ * assembler emits the matching R_ABS{32,64} for `.word`/`.quad SYM+addend` and
+ * R_PC{32,64} for `.long`/`.quad SYM - .` (emit_int_directive), so the
+ * round-tripped relocation matches codegen's. `*pcrel` selects the `SYM - .`
+ * spelling (built by build_data_symref). Returns 0 for kinds with no
+ * integer-directive spelling (caller keeps the raw bytes). */
+static int data_reloc_directive(u16 kind, const char** dir, u32* width,
+ int* pcrel) {
+ *pcrel = 0;
switch (kind) {
case R_ABS64:
*dir = " .quad ";
*width = 8;
return 1;
+ case R_PC64:
+ *dir = " .quad ";
+ *width = 8;
+ *pcrel = 1;
+ return 1;
case R_ABS32:
*dir = " .word ";
*width = 4;
return 1;
+ case R_PC32:
+ *dir = " .long ";
+ *width = 4;
+ *pcrel = 1;
+ return 1;
default:
return 0;
}
@@ -637,6 +682,44 @@ static CfreeStatus w_symbolized(Writer* w, const char* ops, u32 olen,
return w_str(w, symref);
}
}
+ if (surg == ARCH_RELOC_SURG_RV_LO12) {
+ /* RISC-V low-half: a `disp(base)` memory form rewrites the displacement;
+ * a register-immediate form appends the modifier as a new operand. The
+ * memory form is recognized by a trailing `(...)` group. */
+ i32 lp = -1, rp = -1;
+ u32 i;
+ for (i = 0; i < olen; ++i) {
+ if (ops[i] == '(') lp = (i32)i;
+ else if (ops[i] == ')') rp = (i32)i;
+ }
+ if (lp >= 0 && rp > lp && (u32)(rp + 1) == olen) {
+ /* `..., <disp>(base)` -> `..., symref(base)`: replace the displacement
+ * run that ends immediately before the '('. */
+ i32 ds = lp; /* start of the displacement run */
+ CfreeStatus st;
+ while (ds > 0) {
+ char ch = ops[ds - 1];
+ if ((ch >= '0' && ch <= '9') || (ch >= 'a' && ch <= 'f') ||
+ (ch >= 'A' && ch <= 'F') || ch == 'x' || ch == '-' || ch == '+')
+ --ds;
+ else
+ break;
+ }
+ st = cfree_writer_write(w, ops, (u32)ds); /* "..., " before the disp */
+ if (st != CFREE_OK) return st;
+ st = w_str(w, symref);
+ if (st != CFREE_OK) return st;
+ return cfree_writer_write(w, ops + lp, olen - (u32)lp); /* "(base)" */
+ }
+ /* Register-immediate (e.g. `mv rd, rs`): append symref as a new operand. */
+ {
+ CfreeStatus st = cfree_writer_write(w, ops, olen);
+ if (st != CFREE_OK) return st;
+ st = w_str(w, ", ");
+ if (st != CFREE_OK) return st;
+ return w_str(w, symref);
+ }
+ }
/* SURG_MEM: keep the base register, set the offset to symref. */
{
i32 lb = -1, rb = -1, base_end;
@@ -754,6 +837,44 @@ static u32 fmt_synth_label(char* buf, u32 cap, u32 secidx, u32 off) {
return p;
}
+/* Synthesized hi/lo anchor label spelling (RISC-V `%pcrel_hi`/`%pcrel_lo`
+ * pairing). The high-half reloc defines `.Lpcrel_hi_<secidx>_<off>` at its
+ * AUIPC; the paired low-half reloc references it. `.L`-prefixed so the
+ * assembler's lexer accepts it and the linker treats it as local. */
+static u32 fmt_anchor_label(char* buf, u32 cap, u32 secidx, u32 off) {
+ u32 p = 0;
+ const char* pre = ".Lpcrel_hi_";
+ u32 i;
+ for (i = 0; pre[i] && p + 1 < cap; ++i) buf[p++] = pre[i];
+ p = fmt_u64(buf, p, cap, secidx, 10);
+ if (p + 1 < cap) buf[p++] = '_';
+ p = fmt_u64(buf, p, cap, off, 16);
+ buf[p] = '\0';
+ return p;
+}
+
+/* The offset of the high-half (anchor-emitting) relocation paired with a
+ * low-half reloc at `lo_off`: the nearest preceding reloc whose ArchRelocOperand
+ * sets emit_anchor. cfree's codegen always emits the AUIPC immediately before
+ * its paired ADDI/load, so the nearest preceding anchor is the correct one.
+ * Returns 1 and *hi_off on success. */
+static int find_anchor_for_lo12(const EmitCtx* x, u32 lo_off, u32* hi_off) {
+ u32 i;
+ int found = 0;
+ u32 best = 0;
+ for (i = 0; i < x->nrelocs; ++i) {
+ ArchRelocOperand ro = {0}; /* zero emit_anchor/ref_anchor: arches that
+ * don't set them must read as 0 (rv64-only) */
+ if (x->relocs[i].offset >= lo_off) break;
+ if (arch_reloc_operand(x->c, x->relocs[i].kind, &ro) && ro.emit_anchor) {
+ best = x->relocs[i].offset;
+ found = 1;
+ }
+ }
+ if (found && hi_off) *hi_off = best;
+ return found;
+}
+
/* Non-dot symbol name defined at `off`, or NULL. Such a symbol is used as the
* branch label directly (no synthesized label needed). */
static Sym symbol_at(const EmitCtx* x, u32 off) {
@@ -841,12 +962,33 @@ static CfreeStatus emit_operands(Writer* w, const EmitCtx* x,
if (!insn->operands.len) return CFREE_OK;
r = reloc_in_range(x->relocs, x->nrelocs, off, insn->nbytes);
if (r) {
- ArchRelocOperand ro;
+ ArchRelocOperand ro = {0}; /* zero emit_anchor/ref_anchor: arches that
+ * don't set them must read as 0 (rv64-only) */
if (arch_reloc_operand(x->c, r->kind, &ro)) {
char symref[256];
- if (build_symref(symref, sizeof symref, x->c, &ro, r->sym, r->addend) >= 0)
+ /* A low-half reloc (RISC-V `%pcrel_lo`) names the paired high-half's
+ * synthesized anchor label, not the reloc's own (`.LpcrelHi`) symbol. */
+ if (ro.ref_anchor) {
+ u32 hi_off;
+ if (find_anchor_for_lo12(x, off, &hi_off)) {
+ char name[256];
+ u32 p = 0, i;
+ for (i = 0; ro.prefix[i] && p + 1 < sizeof name; ++i)
+ name[p++] = ro.prefix[i];
+ p += fmt_anchor_label(name + p, (u32)sizeof name - p, x->secidx,
+ hi_off);
+ for (i = 0; ro.suffix[i] && p + 1 < sizeof name; ++i)
+ name[p++] = ro.suffix[i];
+ name[p] = '\0';
+ return w_symbolized(w, insn->operands.s, insn->operands.len, name,
+ ro.surg);
+ }
+ /* No anchor found (unexpected): fall through to keep numeric. */
+ } else if (build_symref(symref, sizeof symref, x->c, &ro, r->sym,
+ r->addend) >= 0) {
return w_symbolized(w, insn->operands.s, insn->operands.len, symref,
ro.surg);
+ }
}
} else if (arch_is_local_branch(x->c, insn->mnemonic)) {
u64 tgt;
@@ -887,17 +1029,25 @@ static CfreeStatus emit_data_range(Writer* w, Compiler* c, const u8* data,
if (r) {
const char* dir;
u32 width;
+ int pcrel;
char symref[256];
/* Data relocations spell the bare symbol (`.quad sym+addend`): no
- * page/lo12-style operand modifier on either format. */
- ArchRelocOperand bare = {ARCH_RELOC_SURG_NONE, "", "", 0};
- if (data_reloc_directive(r->kind, &dir, &width) && off + width <= end &&
+ * page/lo12-style operand modifier on either format. A PC-relative
+ * reloc adds a trailing ` - .` (location counter) so the assembler
+ * re-derives R_PC{32,64} instead of an absolute reloc. */
+ ArchRelocOperand bare = {ARCH_RELOC_SURG_NONE, "", "", 0, 0, 0};
+ if (data_reloc_directive(r->kind, &dir, &width, &pcrel) &&
+ off + width <= end &&
build_symref(symref, sizeof symref, c, &bare, r->sym, r->addend) >=
0) {
CfreeStatus st = w_str(w, dir);
if (st != CFREE_OK) return st;
st = w_str(w, symref);
if (st != CFREE_OK) return st;
+ if (pcrel) {
+ st = w_str(w, " - .");
+ if (st != CFREE_OK) return st;
+ }
st = w_newline(w);
if (st != CFREE_OK) return st;
off += width;
@@ -938,6 +1088,63 @@ static CfreeStatus emit_disasm_range(Writer* w, const EmitCtx* x,
continue;
}
+ /* Call-pair fusion (RISC-V R_RV_CALL): a reloc on this instruction whose
+ * arch fuses it with the FOLLOWING instruction into a single `call`/`tail
+ * sym` pseudo. Probe the partner for the call-vs-tail decision, emit one
+ * line, and skip both. Decoding the partner reuses the disassembler's
+ * buffers (clobbering `insn`), so build the symref first and re-decode
+ * `insn` when the pair is not fused. */
+ {
+ const SecReloc* cr = reloc_in_range(x->relocs, x->nrelocs, off, n);
+ char symref[256];
+ ArchRelocOperand bare = {ARCH_RELOC_SURG_TAIL, "", "", 0, 0, 0};
+ if (cr && off + n < end &&
+ build_symref(symref, sizeof symref, x->c, &bare, cr->sym,
+ cr->addend) >= 0) {
+ CfreeInsn partner;
+ u32 pn = arch_disasm_decode(dasm, data + off + n, end - (off + n),
+ (u64)(off + n), &partner);
+ const char* mn = NULL;
+ if (pn && arch_reloc_call_pair(x->c, cr->kind, partner.mnemonic,
+ partner.operands, &mn)) {
+ st = w_str(w, "\t");
+ if (st != CFREE_OK) return st;
+ st = w_str(w, mn);
+ if (st != CFREE_OK) return st;
+ st = w_str(w, "\t");
+ if (st != CFREE_OK) return st;
+ st = w_str(w, symref);
+ if (st != CFREE_OK) return st;
+ st = w_newline(w);
+ if (st != CFREE_OK) return st;
+ off += n + pn;
+ continue;
+ }
+ /* Not fused: the partner probe clobbered `insn`; re-decode it. */
+ (void)arch_disasm_decode(dasm, data + off, end - off, vaddr, &insn);
+ }
+ }
+
+ /* A high-half reloc (RISC-V AUIPC `%pcrel_hi`/`%got_pcrel_hi`) needs a
+ * unique local anchor label here so the paired `%pcrel_lo` can name it. */
+ {
+ const SecReloc* hr = reloc_in_range(x->relocs, x->nrelocs, off, n);
+ if (hr) {
+ ArchRelocOperand ro = {0}; /* zero emit_anchor/ref_anchor: arches that
+ * don't set them must read as 0 (rv64-only) */
+ if (arch_reloc_operand(x->c, hr->kind, &ro) && ro.emit_anchor) {
+ char name[256];
+ fmt_anchor_label(name, sizeof name, x->secidx, off);
+ st = w_str(w, name);
+ if (st != CFREE_OK) return st;
+ st = w_str(w, ":");
+ if (st != CFREE_OK) return st;
+ st = w_newline(w);
+ if (st != CFREE_OK) return st;
+ }
+ }
+ }
+
st = w_str(w, "\t");
if (st != CFREE_OK) return st;
st = cfree_writer_write(w, insn.mnemonic.s, insn.mnemonic.len);
diff --git a/src/arch/arch.h b/src/arch/arch.h
@@ -188,6 +188,14 @@ typedef enum ArchRelocSurg {
ARCH_RELOC_SURG_TAIL, /* replace last comma component (or whole operand) */
ARCH_RELOC_SURG_MEM, /* rewrite the offset inside [...] (aarch64 ldst) */
ARCH_RELOC_SURG_RIP, /* insert sym before disp(%rip) (x86-64 RIP-rel) */
+ /* RISC-V `%pcrel_lo`/`%lo` low-half operand. A single reloc kind covers two
+ * disassembled shapes: a register-immediate ADDI (printed as `mv rd, rs`
+ * when the immediate is 0) where the modifier becomes a new trailing
+ * operand (`mv rd, rs, %pcrel_lo(L)`, which the assembler folds back into
+ * ADDI), and a `disp(base)` load/store where the modifier replaces the
+ * displacement (`%pcrel_lo(L)(base)`). The shape is picked from the operand
+ * text: a trailing `(...)` group selects the memory form. */
+ ARCH_RELOC_SURG_RV_LO12,
} ArchRelocSurg;
typedef struct ArchRelocOperand {
@@ -198,6 +206,15 @@ typedef struct ArchRelocOperand {
* an instruction-encoding bias so the printed offset is the *symbol* offset:
* 0 for aarch64; +4 for x86-64 rel32 (PC32/PLT32/GOTPCREL store addend-4). */
int addend_bias;
+ /* hi/lo anchor pairing (RISC-V `%pcrel_hi`/`%pcrel_lo`). A high-half reloc
+ * (AUIPC `%pcrel_hi(sym)`) sets `emit_anchor` so the symbolizer defines a
+ * unique local label at this instruction. The paired low-half reloc
+ * (`%pcrel_lo`) sets `ref_anchor`: its operand references that synthesized
+ * anchor label (the nearest preceding anchor) instead of the reloc's own
+ * symbol — matching the RISC-V ABI, where `%pcrel_lo` names the AUIPC's
+ * label, not the target symbol. Other arches leave both 0. */
+ u8 emit_anchor;
+ u8 ref_anchor;
} ArchRelocOperand;
typedef struct ArchAsmOps {
@@ -214,6 +231,19 @@ typedef struct ArchAsmOps {
* (aarch64 b/b.cc/cbz/...; x86-64 jmp/jcc). Calls are excluded — they carry
* relocations. NULL hook = no local-branch symbolization for the arch. */
int (*is_local_branch)(CfreeSlice mnemonic);
+ /* Fuse a relocation that the disassembler renders as a 2-instruction pair
+ * back into a single relocated pseudo-instruction line. RISC-V R_RV_CALL
+ * sits on an AUIPC whose JALR partner carries no reloc; the canonical `.s`
+ * spelling is a single `call`/`tail sym`. When `kind` names such a reloc,
+ * the hook returns 1 and sets *mnemonic_out to the fused mnemonic — the
+ * symbolizer then emits "<mnemonic>\t<sym[+addend]>" in place of BOTH
+ * instructions (skipping the partner). `pair_mnemonic`/`pair_ops` are the
+ * SECOND instruction's disassembled text (the JALR), used to disambiguate
+ * (e.g. call vs tail by its link register). Returns 0 to leave the pair
+ * un-fused (per-instruction operand symbolization applies). NULL hook = no
+ * pair fusion for the arch. */
+ int (*reloc_call_pair)(u16 reloc_kind, CfreeSlice pair_mnemonic,
+ CfreeSlice pair_ops, const char** mnemonic_out);
} ArchAsmOps;
typedef struct ArchImpl {
@@ -273,6 +303,15 @@ int arch_reloc_operand(const Compiler* c, u16 reloc_kind,
* arch has no asm_ops/is_local_branch hook. */
int arch_is_local_branch(const Compiler* c, CfreeSlice mnemonic);
+/* 1 if `reloc_kind` names a 2-instruction call pair the symbolizer should fuse
+ * into a single pseudo line (RISC-V R_RV_CALL -> `call`/`tail`), with
+ * *mnemonic_out set to the fused mnemonic. `pair_*` are the partner (second)
+ * instruction's disassembled text. 0 when not fused / no hook. Thin dispatch
+ * over ArchAsmOps.reloc_call_pair. */
+int arch_reloc_call_pair(const Compiler* c, u16 reloc_kind,
+ CfreeSlice pair_mnemonic, CfreeSlice pair_ops,
+ const char** mnemonic_out);
+
ArchDisasm* arch_disasm_new(Compiler*);
u32 arch_disasm_decode(ArchDisasm*, const u8* bytes, size_t len, u64 vaddr,
CfreeInsn* out);
diff --git a/src/arch/registry.c b/src/arch/registry.c
@@ -101,6 +101,15 @@ int arch_is_local_branch(const Compiler* c, CfreeSlice mnemonic) {
return a->asm_ops->is_local_branch(mnemonic);
}
+int arch_reloc_call_pair(const Compiler* c, u16 reloc_kind,
+ CfreeSlice pair_mnemonic, CfreeSlice pair_ops,
+ const char** mnemonic_out) {
+ const ArchImpl* a = arch_for_compiler(c);
+ if (!a || !a->asm_ops || !a->asm_ops->reloc_call_pair) return 0;
+ return a->asm_ops->reloc_call_pair(reloc_kind, pair_mnemonic, pair_ops,
+ mnemonic_out);
+}
+
const CGBackend* cg_backend_for_session(const Compiler* c,
const CfreeCodeOptions* opts) {
if (opts && opts->check_only) {
diff --git a/src/arch/rv64/arch.c b/src/arch/rv64/arch.c
@@ -13,6 +13,7 @@ extern const LinkArchDesc link_arch_rv64;
extern const ArchDbgOps rv64_dbg_ops;
extern const ArchEmuOps rv64_emu_ops;
extern const ArchDwarfOps rv64_dwarf_ops;
+extern const ArchAsmOps rv64_asm_ops;
static int rv64_register_at_public(uint32_t idx, CfreeArchReg* out) {
const char* nm = NULL;
@@ -180,6 +181,7 @@ const ArchImpl arch_impl_rv64 = {
.link = &link_arch_rv64,
.dwarf = &rv64_dwarf_ops,
.dbg = &rv64_dbg_ops,
+ .asm_ops = &rv64_asm_ops,
.predefined_macros = rv64_predefined_macros,
.npredefined_macros =
(u32)(sizeof rv64_predefined_macros / sizeof rv64_predefined_macros[0]),
diff --git a/src/arch/rv64/asm.c b/src/arch/rv64/asm.c
@@ -449,6 +449,12 @@ static u32 assemble_one(AsmDriver* d, const Rv64InsnDesc* desc) {
rd = parse_xreg(d);
expect_comma(d);
rs1 = parse_xreg(d);
+ /* `cc -S` spells an ADDI rd,rs,%pcrel_lo(L) low-half (imm 0) as the
+ * `mv` alias plus a trailing relocation operand: `mv rd, rs,
+ * %pcrel_lo(L)`. Fold it back into ADDI with the I-type reloc. */
+ if (asm_driver_eat_comma(d) &&
+ !rv_emit_imm_mod_reloc(d, RV_MODPOS_LO_I))
+ asm_driver_panic(d, "rv64 asm: mv: expected %lo/%pcrel_lo operand");
return enc_i(m, rd, rs1, 0);
}
if (slice_eq_cstr(desc->mnemonic, "sext.w")) {
@@ -640,8 +646,31 @@ static u32 assemble_one(AsmDriver* d, const Rv64InsnDesc* desc) {
expect_comma(d);
rs1 = parse_freg(d);
}
- /* match already encodes rs2 (type selector); only OR rd/rs1. */
- return m | ((rs1 & 0x1fu) << 15) | ((rd & 0x1fu) << 7);
+ /* match already encodes rs2 (type selector); OR rd/rs1 and the rounding
+ * mode the disassembler dropped. The rm is fixed per conversion family
+ * (mirrors the rv_fcvt_* encoders in isa.h, the codegen source of
+ * truth): fp->int truncates (RTZ=1); int->fp and fp->fp use DYN=7; the
+ * fmv bit-moves carry no rounding (rm=0). Keyed on the funct7 in match. */
+ {
+ u32 funct7 = (m >> 25) & 0x7fu;
+ u32 rm;
+ switch (funct7) {
+ case 0x60: /* fcvt.{w,wu,l,lu}.s */
+ case 0x61: /* fcvt.{w,wu,l,lu}.d */
+ rm = 0x1u; /* RTZ */
+ break;
+ case 0x70: /* fmv.x.w */
+ case 0x71: /* fmv.x.d */
+ case 0x78: /* fmv.w.x */
+ case 0x79: /* fmv.d.x */
+ rm = 0x0u;
+ break;
+ default: /* int->fp (0x68/0x69) and fp<->fp (0x20/0x21): DYN */
+ rm = 0x7u;
+ break;
+ }
+ return m | (rm << 12) | ((rs1 & 0x1fu) << 15) | ((rd & 0x1fu) << 7);
+ }
case RV64_FMT_AMO:
rd = parse_xreg(d);
@@ -907,6 +936,12 @@ static bool rv64_emit_pseudo(AsmDriver* d, const Rv64InsnDesc* desc) {
return true;
}
if (slice_eq_cstr(desc->mnemonic, "tail")) {
+ /* Standard RISC-V `tail` materializes the address into t1 (x6). cfree
+ * codegen uses t0 for its own tail-call temp, so a `cc -S`-fused
+ * `tail sym` re-assembles to t1 not t0 — execution-equivalent (both are
+ * caller-saved temps clobbered by the tail jump; cross-exec still
+ * matches), only the byte image differs on tail-call cases. Keeping the
+ * assembler's `tail` standard preserves clang/gas interop. */
rv_emit_call_pseudo(d, RV_T1, RV_ZERO);
return true;
}
@@ -948,6 +983,117 @@ static void rv64_arch_asm_insn(ArchAsm* base, AsmDriver* d, Sym mnemonic) {
static void rv64_arch_asm_destroy(ArchAsm* base) { (void)base; }
+/* ---- textual-assembly operand syntax (printer <-> parser) ----------------
+ *
+ * Inverse of the `.s` parsers above (rv_parse_mod_reloc / rv_reloc_target and
+ * the call/la pseudo expanders): how a relocated rv64 operand is spelled in
+ * `cc -S` so the same text re-assembles under cfree-as. RISC-V uses the same
+ * `%hi`/`%lo`/`%pcrel_hi`/`%pcrel_lo` operator syntax on every object format,
+ * so `fmt` is unused. See ArchAsmOps and src/api/asm_emit.c. */
+static int rv64_reloc_operand(u16 kind, CfreeObjFmt fmt, ArchRelocOperand* out) {
+ (void)fmt;
+ out->prefix = "";
+ out->suffix = "";
+ out->addend_bias = 0;
+ out->emit_anchor = 0;
+ out->ref_anchor = 0;
+ switch (kind) {
+ case R_RV_PCREL_HI20:
+ out->surg = ARCH_RELOC_SURG_TAIL;
+ out->prefix = "%pcrel_hi(";
+ out->suffix = ")";
+ out->emit_anchor = 1; /* define a unique anchor label at this AUIPC */
+ return 1;
+ case R_RV_GOT_HI20:
+ out->surg = ARCH_RELOC_SURG_TAIL;
+ out->prefix = "%got_pcrel_hi(";
+ out->suffix = ")";
+ out->emit_anchor = 1;
+ return 1;
+ case R_RV_PCREL_LO12_I:
+ case R_RV_PCREL_LO12_S:
+ out->surg = ARCH_RELOC_SURG_RV_LO12;
+ out->prefix = "%pcrel_lo(";
+ out->suffix = ")";
+ out->ref_anchor = 1; /* references the preceding AUIPC's anchor label */
+ return 1;
+ case R_RV_HI20:
+ out->surg = ARCH_RELOC_SURG_TAIL;
+ out->prefix = "%hi(";
+ out->suffix = ")";
+ return 1;
+ case R_RV_LO12_I:
+ case R_RV_LO12_S:
+ out->surg = ARCH_RELOC_SURG_RV_LO12;
+ out->prefix = "%lo(";
+ out->suffix = ")";
+ return 1;
+ case R_RV_BRANCH:
+ case R_RV_JAL:
+ out->surg = ARCH_RELOC_SURG_TAIL;
+ return 1;
+ default:
+ return 0; /* R_ABS*, R_RV_RVC_*, R_RV_RELAX, TLS, ... → keep numeric */
+ }
+}
+
+/* Intra-section local branches whose target codegen resolved in place (no
+ * relocation): the disassembler renders the target numerically, so cc -S
+ * synthesizes a label there. `j`/`jal x0` are JAL aliases; the conditional
+ * branches are B-type. `call`/`tail` are excluded — they carry R_RV_CALL. */
+static int rv64_is_local_branch(CfreeSlice m) {
+ if (m.len == 1 && m.s[0] == 'j') return 1;
+ if (m.len == 3 && memcmp(m.s, "jal", 3) == 0) return 1;
+ if (m.len == 3 && memcmp(m.s, "beq", 3) == 0) return 1;
+ if (m.len == 3 && memcmp(m.s, "bne", 3) == 0) return 1;
+ if (m.len == 3 && memcmp(m.s, "blt", 3) == 0) return 1;
+ if (m.len == 3 && memcmp(m.s, "bge", 3) == 0) return 1;
+ if (m.len == 4 && memcmp(m.s, "bltu", 4) == 0) return 1;
+ if (m.len == 4 && memcmp(m.s, "bgeu", 4) == 0) return 1;
+ if (m.len == 4 && memcmp(m.s, "beqz", 4) == 0) return 1;
+ if (m.len == 4 && memcmp(m.s, "bnez", 4) == 0) return 1;
+ if (m.len == 4 && memcmp(m.s, "blez", 4) == 0) return 1;
+ if (m.len == 4 && memcmp(m.s, "bgez", 4) == 0) return 1;
+ if (m.len == 4 && memcmp(m.s, "bltz", 4) == 0) return 1;
+ if (m.len == 4 && memcmp(m.s, "bgtz", 4) == 0) return 1;
+ if (m.len == 6 && memcmp(m.s, "c.beqz", 6) == 0) return 1;
+ if (m.len == 6 && memcmp(m.s, "c.bnez", 6) == 0) return 1;
+ if (m.len == 3 && memcmp(m.s, "c.j", 3) == 0) return 1;
+ return 0;
+}
+
+/* R_RV_CALL fuses an AUIPC+JALR pair into a single `call`/`tail sym` pseudo
+ * (the canonical `.s` spelling the assembler re-expands to the same pair +
+ * reloc). The reloc sits on the AUIPC; the JALR partner carries no reloc. A
+ * tail call links into x0 (the JALR's rd is `zero`); a regular call links into
+ * ra. We read that from the partner JALR's disassembled text. */
+static int rv64_reloc_call_pair(u16 kind, CfreeSlice pair_mnemonic,
+ CfreeSlice pair_ops, const char** mnemonic_out) {
+ if (kind != R_RV_CALL) return 0;
+ /* The partner JALR links into ra (regular call) or x0 (tail). The
+ * disassembler renders the x0-link, zero-immediate form as the `jr rs`
+ * alias, and the ra form as `jalr ra, 0(ra)`. So a `jr` partner is always a
+ * tail; a `jalr` partner is a tail iff its link register is `zero`. */
+ if (pair_mnemonic.len == 2 && memcmp(pair_mnemonic.s, "jr", 2) == 0) {
+ *mnemonic_out = "tail";
+ return 1;
+ }
+ if (pair_mnemonic.len == 4 && memcmp(pair_mnemonic.s, "jalr", 4) == 0) {
+ if (pair_ops.len >= 4 && memcmp(pair_ops.s, "zero", 4) == 0)
+ *mnemonic_out = "tail";
+ else
+ *mnemonic_out = "call";
+ return 1;
+ }
+ return 0;
+}
+
+const ArchAsmOps rv64_asm_ops = {
+ .reloc_operand = rv64_reloc_operand,
+ .is_local_branch = rv64_is_local_branch,
+ .reloc_call_pair = rv64_reloc_call_pair,
+};
+
ArchAsm* rv64_arch_asm_new(Compiler* c) {
Rv64Asm* a = arena_new(c->tu, Rv64Asm);
memset(a, 0, sizeof *a);
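The two RISC-V hooks above come down to matching the disassembler's text. Below is a minimal standalone sketch that assumes NUL-terminated C strings instead of `CfreeSlice`. The names `kLocalBranches`, `is_local_branch_mnemonic`, and `classify_call_partner` are hypothetical, and the branch table lists only the mnemonics visible in this excerpt.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table: branch mnemonics visible in the rv64_is_local_branch
 * hunk (the full list may contain entries above the excerpt). */
static const char* const kLocalBranches[] = {
    "jal",  "beq",  "bne",  "blt",  "bge",  "bltu",   "bgeu",   "beqz",
    "bnez", "blez", "bgez", "bltz", "bgtz", "c.beqz", "c.bnez", "c.j",
};

/* Table-driven equivalent of the unrolled length+memcmp chain. */
static int is_local_branch_mnemonic(const char* m) {
  for (size_t i = 0; i < sizeof kLocalBranches / sizeof kLocalBranches[0]; ++i)
    if (strcmp(m, kLocalBranches[i]) == 0) return 1;
  return 0;
}

/* Hypothetical mirror of rv64_reloc_call_pair's rule: choose the pseudo for
 * an R_RISCV_CALL AUIPC from its partner's disassembled mnemonic and operand
 * text. Returns NULL when the partner is not a jump-register form. */
static const char* classify_call_partner(const char* mnemonic, const char* ops) {
  assert(mnemonic != NULL && ops != NULL);
  if (strcmp(mnemonic, "jr") == 0) return "tail"; /* x0-link alias */
  if (strcmp(mnemonic, "jalr") == 0)
    return strncmp(ops, "zero", 4) == 0 ? "tail" : "call";
  return NULL;
}
```

A table replaces the hand-unrolled comparisons with a single loop. The unrolled form in the patch, by contrast, works directly on length-delimited slices and needs no NUL terminator.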
diff --git a/src/arch/x64/asm.c b/src/arch/x64/asm.c
@@ -562,12 +562,17 @@ static __attribute__((unused)) void emit_reg_rm_twobyte(
buf[n++] = opcode2;
buf[n++] = x64_modrm(3u, dst, src.reg);
} else {
- n += x64_pack_rex(buf + n, width == 8u, dst, 0, src.base);
+ /* Route the full memory-operand variety (plain / SIB-indexed / RIP /
+ * segment) through the shared pack helpers so a SIB index register is
+ * preserved (e.g. `movzbl (%rcx,%rsi,1), %edx`). */
+ if (src.seg) buf[n++] = src.seg;
+ n += x64_pack_rex_mem_operand(buf + n, width == 8u, dst, src);
buf[n++] = X64_OPC_TWOBYTE;
buf[n++] = opcode2;
- n += x64_pack_mem(buf + n, dst, src.base, src.disp);
+ n += x64_pack_mem_operand(buf + n, dst, src);
}
emit_packed(mc, buf, n);
+ if (src.kind == X64_ASM_OP_MEM) x64_emit_mem_reloc(d, mc, &src, 0);
}
/* ====================================================================
@@ -857,7 +862,14 @@ static void parse_alu_rr(X64ParseCtx* p) {
src.imm, 1);
return;
}
- if (imm_fits_i8(src.imm))
+ /* Stack-pointer adjustments (`add/sub $imm, %rsp`, 64-bit) always use the
+ * imm32 form in codegen — the prologue and alloca patch a fixed-width
+ * placeholder, so they never shrink to imm8 even for a small frame. Match
+ * that here so `cc -S | as` reproduces codegen's bytes exactly; %rsp is a
+ * reserved register, so codegen never emits an imm8 ALU op against it. */
+ if (dst.reg == X64_RSP && p->width == 8u && imm_fits_i32(src.imm))
+ emit_alu_imm32(p->mc, 1, imm_row->modrm_reg, dst.reg, (i32)src.imm);
+ else if (imm_fits_i8(src.imm))
emit_alu_imm8(p->mc, width_to_w(p->width), imm_row->modrm_reg, dst.reg,
(i8)src.imm);
else if (imm_fits_i32(src.imm))
@@ -1059,8 +1071,11 @@ static void parse_movzx_movsx(X64ParseCtx* p) {
dst = parse_operand(p->d);
if (dst.kind != X64_ASM_OP_REG)
asm_driver_panic(p->d, "x64 asm: movx dst register");
+ /* REX.W follows the destination register width: `movsbq …, %rcx` (64-bit)
+ * needs REX.W; `movsbl …, %ecx` (32-bit) does not. The disassembler spells
+ * the q/l form from REX.W, so honoring dst width here round-trips it. */
emit_reg_rm_twobyte(
- p->d, p->mc, 4u, p->desc->opc[1], dst.reg, src,
+ p->d, p->mc, dst.width == 8u ? 8u : 4u, p->desc->opc[1], dst.reg, src,
p->desc->opc[1] == X64_OPC_MOVZX_B || p->desc->opc[1] == X64_OPC_MOVSX_B,
0);
}
@@ -1234,25 +1249,31 @@ static void parse_sse_rr(X64ParseCtx* p) {
expect_comma(p->d);
dst = parse_operand(p->d);
if (cvt_to_int) {
+ /* cvttsd2si/cvttss2si XMM/m -> GPR: REX.W follows the GPR destination
+ * width (`%rdx` = 64-bit, `%edx` = 32-bit), not the mnemonic — these rows
+ * carry no size suffix. */
if (dst.kind != X64_ASM_OP_REG)
asm_driver_panic(p->d, "x64 asm: cvtt dst register");
+ u32 gpr_w = dst.width == 8u ? 8u : 4u;
if (src.kind == X64_ASM_OP_XMM)
emit_sse_rr_w(p->mc, p->desc->leg_pfx, p->desc->opc[1],
- width_to_w(p->width), dst.reg, src.reg);
+ width_to_w(gpr_w), dst.reg, src.reg);
else if (src.kind == X64_ASM_OP_MEM)
- emit_reg_rm_twobyte(p->d, p->mc, p->width, p->desc->opc[1], dst.reg, src,
- 0, p->desc->leg_pfx);
+ emit_reg_rm_twobyte(p->d, p->mc, gpr_w, p->desc->opc[1], dst.reg, src, 0,
+ p->desc->leg_pfx);
else
asm_driver_panic(p->d, "x64 asm: cvtt source");
return;
}
if (cvt_from_int) {
+ /* cvtsi2sd/cvtsi2ss GPR/m -> XMM: REX.W follows the GPR source width. */
if (dst.kind != X64_ASM_OP_XMM)
asm_driver_panic(p->d, "x64 asm: cvtsi dst xmm");
- if (src.kind == X64_ASM_OP_REG)
+ if (src.kind == X64_ASM_OP_REG) {
+ u32 gpr_w = src.width == 8u ? 8u : 4u;
emit_sse_rr_w(p->mc, p->desc->leg_pfx, p->desc->opc[1],
- width_to_w(p->width), dst.reg, src.reg);
- else if (src.kind == X64_ASM_OP_MEM)
+ width_to_w(gpr_w), dst.reg, src.reg);
+ } else if (src.kind == X64_ASM_OP_MEM)
emit_sse_load(p->mc, p->desc->leg_pfx, p->desc->opc[1], dst.reg, src.base,
src.disp);
else
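The immediate-width rule in `parse_alu_rr` can be sketched as a tiny register-destination encoder. The same imm8/imm32 choice also appears in `x64_va_arg_core` and the clz intrinsic. This sketch assumes the standard group-1 forms `0x83 /ext ib` and `0x81 /ext id`. `encode_alu_imm` and `REG_RSP` are hypothetical names, not cfree's `emit_alu_imm8`/`emit_alu_imm32`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { REG_RSP = 4 }; /* hypothetical register number for %rsp */

/* Hypothetical encoder for `op $imm, %reg` (register destination only).
 * A 64-bit %rsp adjustment always takes the 0x81 imm32 form. Everything else
 * uses 0x83 imm8 when the value fits in a signed byte and 0x81 imm32
 * otherwise. `alu_ext` is the ModRM.reg opcode extension (/0 add, /5 sub,
 * /6 xor, /7 cmp). Returns the number of bytes written. */
static size_t encode_alu_imm(uint8_t* buf, unsigned reg, int is64,
                             unsigned alu_ext, int64_t imm) {
  size_t n = 0;
  assert(reg < 16 && alu_ext < 8);
  assert(imm >= INT32_MIN && imm <= INT32_MAX); /* no wider ALU immediate */
  int fits8 = imm >= -128 && imm <= 127;
  int use_imm8 = fits8 && !(is64 && reg == REG_RSP);
  if (is64 || reg >= 8) /* REX: W for 64-bit, B for r8..r15 */
    buf[n++] = (uint8_t)(0x40 | (is64 ? 0x08 : 0) | (reg >= 8 ? 0x01 : 0));
  buf[n++] = use_imm8 ? 0x83 : 0x81;
  buf[n++] = (uint8_t)(0xC0 | (alu_ext << 3) | (reg & 7u)); /* mod=11 */
  if (use_imm8) {
    buf[n++] = (uint8_t)(int8_t)imm;
  } else {
    uint32_t v = (uint32_t)(int32_t)imm; /* little-endian imm32 */
    for (int i = 0; i < 4; ++i) buf[n++] = (uint8_t)(v >> (8 * i));
  }
  return n;
}
```

Under this rule, `subq $16, %rsp` encodes as `48 81 EC 10 00 00 00`, while `subq $16, %rax` uses the 4-byte `48 83 E8 10`.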
diff --git a/src/arch/x64/isa.c b/src/arch/x64/isa.c
@@ -497,8 +497,11 @@ static int read_disp(const u8* bytes, u32 len, u32 off, u32 n, i32* out) {
* `disp_out` and `base_out` describe what to print. */
typedef struct DecodedMem {
u32 base;
+ u32 index; /* SIB index register (valid when has_index) */
+ u32 scale; /* SIB scale as the literal 1/2/4/8 (valid when has_index) */
i32 disp;
int has_base;
+ int has_index; /* a SIB index register is present */
int rip_relative;
u32 bytes_used;
} DecodedMem;
@@ -506,8 +509,11 @@ typedef struct DecodedMem {
static u32 decode_mem(const u8* bytes, u32 len, u32 off, X64DecodeCtx ctx,
u32 mod, u32 rm_low, DecodedMem* out) {
out->base = 0;
+ out->index = 0;
+ out->scale = 1;
out->disp = 0;
out->has_base = 1;
+ out->has_index = 0;
out->rip_relative = 0;
out->bytes_used = 0;
if (mod == 3u) return 0; /* caller handles reg-form */
@@ -516,10 +522,17 @@ static u32 decode_mem(const u8* bytes, u32 len, u32 off, X64DecodeCtx ctx,
if (off >= len) return (u32)-1;
u8 s = bytes[off];
u32 sib_base = (s & 7u) | ((u32)ctx.rex_b << 3);
+ u32 sib_index = ((s >> 3) & 7u) | ((u32)ctx.rex_x << 3);
u32 used = 1;
+ /* SIB index = 4 (RSP) with REX.X=0 encodes "no index". */
+ if (sib_index != 4u) {
+ out->has_index = 1;
+ out->index = sib_index;
+ out->scale = 1u << (s >> 6);
+ }
if (mod == 0u && (s & 7u) == 5u) {
- /* mod=00, base=101: disp32 with no base — treat as RIP-relative
- * style (cfree uses this for label-table addressing). */
+ /* mod=00, base=101: disp32 with no base — either a label-table
+ * disp32 (no index) or an indexed `[index*scale + disp32]`. */
i32 d = 0;
if (!read_disp(bytes, len, off + used, 4, &d)) return (u32)-1;
used += 4;
@@ -579,9 +592,16 @@ static void put_mem(StrBuf* sb, const DecodedMem* m) {
}
if (m->rip_relative) {
strbuf_puts(sb, "(%rip)");
- } else if (m->has_base) {
+ } else if (m->has_base || m->has_index) {
+ /* `(base)`, `(base,index,scale)`, or the base-less `(,index,scale)`. */
strbuf_putc(sb, '(');
- put_reg(sb, m->base, 8);
+ if (m->has_base) put_reg(sb, m->base, 8);
+ if (m->has_index) {
+ strbuf_putc(sb, ',');
+ put_reg(sb, m->index, 8);
+ strbuf_putc(sb, ',');
+ strbuf_put_i64(sb, (i64)m->scale);
+ }
strbuf_putc(sb, ')');
}
}
@@ -975,6 +995,16 @@ static u32 print_xmm_rr(StrBuf* sb, const X64InsnDesc* d, const u8* bytes,
put_rm(sb, &rr, *ctx, gp_w);
return off + 1u + rr.bytes_after_modrm;
}
+ /* Store-direction XMM moves (MOVSD/MOVSS/MOVUPS 0x11, MOVAPS 0x29): the
+ * reg-field xmm is the SOURCE and the r/m (memory or xmm) is the
+ * DESTINATION — AT&T order `reg_xmm, rm`. Without this the disassembler
+ * prints them in load order, so re-assembly flips the data direction. */
+ if (op == 0x11u || op == 0x29u) {
+ put_xmm(sb, rr.reg);
+ strbuf_puts(sb, ", ");
+ put_rm_xmm(sb, &rr, *ctx);
+ return off + 1u + rr.bytes_after_modrm;
+ }
{
int dst_is_gp = (op == 0x2Cu); /* CVTTSD/SS2SI */
int src_is_gp = (op == 0x2Au || op == 0x6Eu); /* CVTSI2*, MOVD/Q g->x */
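The SIB handling added to `decode_mem` and `put_mem` follows the standard SIB layout: scale in bits 7:6, index in bits 5:3, and base in bits 2:0. REX.X and REX.B supply the fourth bit of the index and base registers. The sketch below works under those assumptions. `format_sib` and `kGpr64` are hypothetical names, and the sketch leaves out displacements and the mod=00/base=101 no-base form.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static const char* const kGpr64[16] = {
    "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
    "r8",  "r9",  "r10", "r11", "r12", "r13", "r14", "r15",
};

/* Hypothetical sketch: split a SIB byte into scale/index/base (extended by
 * REX.X / REX.B) and print the AT&T operand. Index 4 with REX.X=0 means
 * "no index". Displacements and the no-base disp32 form are not handled. */
static int format_sib(char* out, size_t cap, uint8_t sib, int rex_x, int rex_b) {
  assert(rex_x == 0 || rex_x == 1);
  assert(rex_b == 0 || rex_b == 1);
  unsigned base = (sib & 7u) | ((unsigned)rex_b << 3);
  unsigned index = ((sib >> 3) & 7u) | ((unsigned)rex_x << 3);
  unsigned scale = 1u << (sib >> 6);
  if (index == 4u) return snprintf(out, cap, "(%%%s)", kGpr64[base]);
  return snprintf(out, cap, "(%%%s,%%%s,%u)", kGpr64[base], kGpr64[index],
                  scale);
}
```

With no REX bits, SIB byte `0x31` prints `(%rcx,%rsi,1)`, the operand from the `movzbl` example in `emit_reg_rm_twobyte`. `0x24` prints `(%rsp)`, which is the usual encoding of an %rsp base.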
diff --git a/src/arch/x64/native.c b/src/arch/x64/native.c
@@ -1459,43 +1459,41 @@ static u32 x64_build_prologue(X64NativeTarget* a, u8* buf, u32 cap,
wr_u32_le(buf + wi, frame_size);
wi += 4;
}
- /* sret: spill the first int arg reg (destination pointer) into its slot. */
+ /* sret: spill the first int arg reg (destination pointer) into its slot.
+ * Use the minimal disp encoding (x64_pack_mem) so it matches the body's
+ * frame stores and the matching epilogue restore — the `cc -S | as`
+ * round-trip can then reproduce these bytes exactly. The -O0 placeholder is
+ * NOP-padded to a fixed width, so a shorter prologue is harmless. */
if (a->has_sret && a->sret_ptr_slot != NATIVE_FRAME_SLOT_NONE) {
X64NativeSlot* s = x64_slot_get(a, a->sret_ptr_slot);
u32 sret_reg = a->abi->int_args[0];
i32 off = -(i32)s->off;
- if (wi + 7u > cap) x64_panic(a, "prologue placeholder overflow");
+ if (wi + 8u > cap) x64_panic(a, "prologue placeholder overflow");
buf[wi++] = (u8)(X64_REX_BASE | X64_REX_W |
((sret_reg & 8u) ? X64_REX_R : 0u));
buf[wi++] = X64_OPC_MOV_RM_R;
- buf[wi++] = modrm(2u, sret_reg & 7u, X64_RBP);
- wr_u32_le(buf + wi, (u32)off);
- wi += 4;
+ wi += x64_pack_mem(buf + wi, sret_reg & 7u, X64_RBP, off);
}
/* Spill callee-saved GPRs. */
for (i = 0; i < n_int; ++i) {
u32 reg = cs_int[i];
i32 off = -(i32)xmm_base - (i32)n_fp * 16 - (i32)(i + 1u) * 8;
- if (wi + 7u > cap) x64_panic(a, "prologue placeholder overflow");
+ if (wi + 8u > cap) x64_panic(a, "prologue placeholder overflow");
buf[wi++] = (u8)(X64_REX_BASE | X64_REX_W | ((reg & 8u) ? X64_REX_R : 0u));
buf[wi++] = X64_OPC_MOV_RM_R;
- buf[wi++] = modrm(2u, reg & 7u, X64_RBP);
- wr_u32_le(buf + wi, (u32)off);
- wi += 4;
+ wi += x64_pack_mem(buf + wi, reg & 7u, X64_RBP, off);
}
- /* Spill callee-saved XMMs (Win64). movaps [rbp+disp32], xmm. */
+ /* Spill callee-saved XMMs (Win64). movaps [rbp+disp], xmm. */
for (i = 0; i < n_fp; ++i) {
u32 xmm = cs_fp[i];
i32 off = -(i32)xmm_base - (i32)(i + 1u) * 16;
u8 rex = (u8)((xmm & 8u) ? (X64_REX_BASE | X64_REX_R) : 0u);
- u32 need = rex ? 8u : 7u;
+ u32 need = rex ? 9u : 8u;
if (wi + need > cap) x64_panic(a, "prologue placeholder overflow");
if (rex) buf[wi++] = rex;
buf[wi++] = X64_OPC_TWOBYTE;
buf[wi++] = 0x29; /* MOVAPS r/m128, xmm */
- buf[wi++] = modrm(2u, xmm & 7u, X64_RBP);
- wr_u32_le(buf + wi, (u32)off);
- wi += 4;
+ wi += x64_pack_mem(buf + wi, xmm & 7u, X64_RBP, off);
}
return wi;
}
@@ -2972,9 +2970,14 @@ static void x64_va_arg_core(X64NativeTarget* a, NativeLoc dst, NativeAddr ap,
i8 stride = is_fp ? 16 : 8;
MCLabel L_stack = mc->label_new(mc);
MCLabel L_done = mc->label_new(mc);
- /* gp32 = ap[offs]; cmp gp32, max; jae L_stack. */
+ /* gp32 = ap[offs]; cmp gp32, max; jae L_stack. Use the imm8 form when the
+ * threshold fits (gp_offset max 48) so the encoding is canonical and the
+ * `cc -S | as` round-trip reproduces it; fp_offset max 176 needs imm32. */
emit_mov_load(mc, 4, 0, gp, ap_base, (i32)offs_field);
- emit_alu_imm32(mc, 0, X64_ALU_SUB_CMP, gp, (i32)max_offs);
+ if (imm_fits_i8((i64)max_offs))
+ emit_alu_imm8(mc, 0, X64_ALU_SUB_CMP, gp, (i8)max_offs);
+ else
+ emit_alu_imm32(mc, 0, X64_ALU_SUB_CMP, gp, (i32)max_offs);
emit_jcc_rel32(mc, X64_CC_AE, L_stack);
/* reg path: ap[offs] += stride; gp = reg_save_area(ap[16]) + offset; load.
* (The memory increment leaves gp holding the old offset.) */
@@ -3116,14 +3119,10 @@ static void x64_intrinsic(NativeTarget* t, IntrinKind kind,
int w = x64_is_64(t, args[0].type) ? 1 : 0;
u32 dr = loc_reg(dsts[0]);
emit_bs(mc, w, 0xBD /* bsr */, dr, loc_reg(args[0]));
- /* clz = (bits-1) - bsr, computed via xor with bits-1. */
- emit_rex(mc, w, 0, 0, dr);
- {
- u8 op = X64_OPC_ALU_IMM32;
- mc->emit_bytes(mc, &op, 1);
- }
- emit_rm_reg(mc, X64_ALU_SUB_XOR, dr);
- emit_u32le(mc, w ? 63u : 31u);
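+ /* clz = (bits-1) - bsr, computed via xor with bits-1: bits-1 (31/63) is
+ * all ones and bsr lies in [0, bits-1], so the subtraction never borrows
+ * and equals the xor. The mask fits in imm8, so use the compact 0x83 form
+ * to match the canonical encoding (and the assembler's `cc -S | as`
+ * round-trip). */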
+ emit_alu_imm8(mc, w, X64_ALU_SUB_XOR, dr, w ? 63 : 31);
return;
}
case INTRIN_BSWAP16: {
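The prologue spills now use `x64_pack_mem`'s minimal displacement encoding. For an %rbp base, that reduces to the sketch below: disp8 when the offset fits in a signed byte, disp32 otherwise. `pack_mem_rbp` is a hypothetical stand-in, not cfree's helper. %rbp never needs a SIB byte. However, mod=00 with rm=101 means RIP-relative in 64-bit mode, so a zero offset still uses mod=01 with a zero disp8.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an x64_pack_mem-style helper, restricted to an
 * %rbp base: emit ModRM (+ displacement) for `[rbp + disp]` in the shortest
 * form. `reg` is the ModRM.reg operand; the caller sets REX.R for reg >= 8. */
static unsigned pack_mem_rbp(uint8_t* buf, unsigned reg, int32_t disp) {
  const unsigned rbp = 5u;
  assert(reg < 16);
  if (disp >= -128 && disp <= 127) {
    buf[0] = (uint8_t)(0x40 | ((reg & 7u) << 3) | rbp); /* mod=01: disp8 */
    buf[1] = (uint8_t)(int8_t)disp;
    return 2;
  }
  buf[0] = (uint8_t)(0x80 | ((reg & 7u) << 3) | rbp); /* mod=10: disp32 */
  uint32_t v = (uint32_t)disp; /* little-endian disp32 */
  for (int i = 0; i < 4; ++i) buf[1 + i] = (uint8_t)(v >> (8 * i));
  return 5;
}
```

With this form, a spill such as `movq %rbx, -8(%rbp)` shrinks from 7 bytes (REX, `0x89`, ModRM, disp32) to 4.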
diff --git a/src/asm/asm.c b/src/asm/asm.c
@@ -214,14 +214,20 @@ static void promote_undef_externs(AsmDriver* d) {
typedef struct AsmExpr {
ObjSymId sym;
i64 value;
+ u8 is_here; /* the location-counter token `.` (no sym, no value yet) */
+ u8 pcrel; /* `sym - .`: emit a PC-relative data reloc instead of absolute */
} AsmExpr;
static AsmExpr expr_c(i64 v) {
- AsmExpr e = {OBJ_SYM_NONE, v};
+ AsmExpr e = {OBJ_SYM_NONE, v, 0, 0};
return e;
}
static AsmExpr expr_s(ObjSymId s, i64 v) {
- AsmExpr e = {s, v};
+ AsmExpr e = {s, v, 0, 0};
+ return e;
+}
+static AsmExpr expr_here(void) {
+ AsmExpr e = {OBJ_SYM_NONE, 0, 1, 0};
return e;
}
@@ -290,6 +296,11 @@ static AsmExpr parse_primary(AsmDriver* d) {
(void)d_next(d);
return e;
}
+ /* Lone `.` is the location counter (used in `sym - .` PC-relative data). */
+ if (tok_is_punct(t, '.')) {
+ (void)d_next(d);
+ return expr_here();
+ }
d_panicf(d, "asm: expected expression");
}
@@ -325,7 +336,8 @@ static AsmExpr parse_mul(AsmDriver* d) {
u32 op = t.v.punct;
(void)d_next(d);
AsmExpr b = parse_unary(d);
- if (a.sym || b.sym) d_panicf(d, "asm: '*/%%' on symbolic operand");
+ if (a.sym || b.sym || a.is_here || b.is_here)
+ d_panicf(d, "asm: '*/%%' on symbolic operand");
if (op == '*')
a.value *= b.value;
else if (op == '/') {
@@ -346,6 +358,16 @@ static AsmExpr parse_add(AsmDriver* d) {
u32 op = t.v.punct;
(void)d_next(d);
AsmExpr b = parse_mul(d);
+ /* `sym - .`: a PC-relative data reference. `.` is the location of the
+ * field being emitted, so the relocation's P equals its own offset and the
+ * RELA addend stays `a.value` (typically 0). */
+ if (op == '-' && b.is_here) {
+ if (!a.sym) d_panicf(d, "asm: '- .' requires a symbol operand");
+ a.pcrel = 1;
+ continue;
+ }
+ if (a.is_here || b.is_here)
+ d_panicf(d, "asm: '.' location counter only valid as `sym - .`");
if (op == '+') {
if (a.sym && b.sym) d_panicf(d, "asm: cannot add two symbols");
if (b.sym) {
@@ -618,7 +640,16 @@ static void emit_int_directive(AsmDriver* d, u32 width) {
AsmExpr e = parse_expr(d);
if (e.sym) {
RelocKind k;
- if (width == 4)
+ if (e.pcrel) {
+ /* `sym - .`: PC-relative data. Only the 32/64-bit widths codegen
+ * emits via cfree_cg_data_pcrel are supported. */
+ if (width == 4)
+ k = R_PC32;
+ else if (width == 8)
+ k = R_PC64;
+ else
+ d_panicf(d, "asm: PC-relative `sym - .` needs .long/.quad");
+ } else if (width == 4)
k = R_ABS32;
else if (width == 8)
k = R_ABS64;
@@ -670,26 +701,47 @@ static void do_directive(AsmDriver* d, Sym name) {
if (sym_eq(d, name, "section")) {
Sym sname = 0;
AsmTok t = d_peek(d);
- if (t.kind == ASM_TOK_IDENT) {
- sname = t.v.ident;
- (void)d_next(d);
- } else if (t.kind == ASM_TOK_STR) {
+ if (t.kind == ASM_TOK_STR) {
size_t n = 0;
const char* p = asm_str(d, t.spelling, &n);
if (n >= 2 && p[0] == '"')
sname = pool_intern_slice(d->pool, (Slice){.s = p + 1, .len = n - 2});
(void)d_next(d);
- } else if (tok_is_punct(t, '.')) {
- (void)d_next(d);
+ } else if (t.kind == ASM_TOK_IDENT || tok_is_punct(t, '.')) {
+ /* A bare section name. The lexer breaks a dotted name like
+ * `.rodata.toy.merge` into PUNCT('.')+IDENT segments (the `.`+digit
+ * identifier rule does not glue `.`+letter), so reassemble the full
+ * dotted spelling by consuming each adjacent `.segment`. Stops at the
+ * `, "flags"` operands (the next token is then a comma). */
+ char buf[128];
+ size_t bn = 0;
+ int leading_dot = tok_is_punct(t, '.');
+ if (leading_dot) {
+ (void)d_next(d);
+ buf[bn++] = '.';
+ }
AsmTok id = d_next(d);
if (id.kind != ASM_TOK_IDENT) d_panicf(d, "asm: .section: bad name");
- size_t ni = 0;
- const char* nm = asm_str(d, id.v.ident, &ni);
- char buf[128];
- if (ni + 1 >= sizeof buf) d_panicf(d, "asm: .section: name too long");
- buf[0] = '.';
- for (size_t i = 0; i < ni; ++i) buf[i + 1] = nm[i];
- sname = pool_intern_slice(d->pool, (Slice){.s = buf, .len = ni + 1});
+ for (;;) {
+ size_t ni = 0;
+ const char* nm = asm_str(d, id.spelling, &ni);
+ if (bn + ni >= sizeof buf) d_panicf(d, "asm: .section: name too long");
+ for (size_t i = 0; i < ni; ++i) buf[bn++] = nm[i];
+ /* Glue a following `.<ident>` (or `.<num>`) segment, no whitespace. */
+ if (!tok_is_punct(d_peek(d), '.')) break;
+ (void)d_next(d); /* '.' */
+ AsmTok seg = d_peek(d);
+ if (seg.kind != ASM_TOK_IDENT && seg.kind != ASM_TOK_NUM) {
+ /* A lone trailing '.' is not part of the name. Pushing the token back
+ * is not supported, but section names never end in '.', so treat this
+ * as a malformed directive. */
+ d_panicf(d, "asm: .section: bad name");
+ }
+ if (bn + 1 >= sizeof buf) d_panicf(d, "asm: .section: name too long");
+ buf[bn++] = '.';
+ id = d_next(d);
+ }
+ sname = pool_intern_slice(d->pool, (Slice){.s = buf, .len = bn});
} else {
d_panicf(d, "asm: .section: expected name");
}
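A `.long sym - .` field becomes an `R_PC32` relocation. The sketch below assumes that cfree's generic `R_PC32` uses the usual ELF PC-relative formula `S + A - P`, as `R_X86_64_PC32` does. It shows the stored value and how a reader recovers the target. `resolve_pc32` and `target_from_field` are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: the value a `.long sym - .` field holds after linking.
 * S = symbol address, A = addend, P = address of the field itself. */
static int32_t resolve_pc32(uint64_t S, int64_t A, uint64_t P) {
  int64_t v = (int64_t)(S + (uint64_t)A - P);
  assert(v >= INT32_MIN && v <= INT32_MAX); /* otherwise the reloc overflows */
  return (int32_t)v;
}

/* Hypothetical: a reader recovers the target by adding the field's own
 * address back to the stored PC-relative value. */
static uint64_t target_from_field(uint64_t field_addr, int32_t value) {
  return field_addr + (uint64_t)(int64_t)value;
}
```

For a field at `0x2000` that points at `0x1100`, the stored value is `-0xF00`. Adding the field's address back yields `0x1100` again. The 64-bit `.quad sym - .` / `R_PC64` case uses the same arithmetic without the 32-bit range check.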
diff --git a/test/asm/encode/rv64_fp.expected.hex b/test/asm/encode/rv64_fp.expected.hex
@@ -1 +1 @@
-53f5c500d376f70a53f02010d371521a5385c5285394242b53a5c5a0d392e6a2530505c0530525d0d30200e0530505f2072501002734b100
+53f5c500d376f70a53f02010d371521a5385c5285394242b53a5c5a0d392e6a2531505c0537525d0d30200e0530505f2072501002734b100
diff --git a/test/asm/encode/rv64_fp_cvt.expected.hex b/test/asm/encode/rv64_fp_cvt.expected.hex
@@ -1 +1 @@
-530505c0d38515c0530626c0d38636c0530707c2d38727c2530505d0d38515d0530606d2d38626d25387174053880842538505585386065a
+531505c0d39515c0531626c0d39636c0531707c2d39727c2537505d0d3f515d0537606d2d3f626d253f7174053f8084253f5055853f6065a
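The updated fixtures differ only in the `rm` field (bits 14:12) of the fcvt words. Decoding them little-endian shows the change. fp->int conversions move from RNE (0) to RTZ (1): `530505c0` (0xc0050553, `fcvt.w.s`) becomes `531505c0` (0xc0051553). int->fp and fp<->fp conversions move to DYN (7): `53871740` (0x40178753, `fcvt.s.d`) becomes `53f71740` (0x4017f753). The sketch below shows this field manipulation. `set_rm`, `get_rm`, and the `RM_*` constants are hypothetical names; the rounding-mode encodings are the RISC-V F extension's.

```c
#include <assert.h>
#include <stdint.h>

/* RISC-V rounding modes as encoded in bits [14:12] of an OP-FP word. */
enum { RM_RNE = 0, RM_RTZ = 1, RM_DYN = 7 };

/* Hypothetical helper: replace the rm field of an OP-FP instruction word. */
static uint32_t set_rm(uint32_t insn, unsigned rm) {
  assert(rm <= 7);
  return (insn & ~(7u << 12)) | ((uint32_t)rm << 12);
}

/* Hypothetical helper: read the rm field back out. */
static unsigned get_rm(uint32_t insn) { return (insn >> 12) & 7u; }
```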
diff --git a/test/asm/hostas_cross.sh b/test/asm/hostas_cross.sh
@@ -30,22 +30,20 @@
# host supports and self-extends as gaps close. Status at time of writing:
# - aarch64-linux: works end-to-end (podman runs arm64 natively in its VM).
# This is the gating default (312/312 both lanes).
-# - x86_64-linux: cc -S re-assembles for the whole corpus (312/312) via both
-# cfree-as and clang, but cross-EXEC is 272/312: ~23 cases
-# (switch/jump tables, global/array/fp data, varargs) lose
-# fidelity in the x64 cc -S data round-trip — confirmed cc -S
-# infidelity (direct `cc -c` executes correctly). Opt-in until
-# that backlog closes. See doc/ASM_ROUNDTRIP_TESTING.md.
-# - riscv64-linux: assembles, but cross-EXEC hangs — the rv64 cc -S round-trip
-# is unfaithful because rv64 has no symbolizer (ArchAsmOps):
-# the call emits `auipc ra,0x0; jalr ra,0(ra)` with the
-# R_RISCV_CALL reloc unsymbolized, so it calls itself (and
-# branches like `j 0x90` keep numeric targets). NOT an
-# emulation issue — a minimal clang rv64 static exe and the
-# DIRECT cfree `cc -c` object both run correctly under the
-# same qemu-riscv64. Needs an rv64 ArchAsmOps (is_local_branch
-# for j/beq/...; reloc_operand for %pcrel_hi/%pcrel_lo/call).
-# The bounded exec smoke SKIPS it until then.
+# - x86_64-linux: `cc -S | cfree as` round-trips and CROSS-EXECS the whole
+#                 corpus correctly (cfree-as 312/312). The clang lane has a
+#                 small residue (~11 efail: cfree emits AT&T text that clang
+#                 rejects). Opt-in: the global clang gate
+#                 (CFREE_HOSTAS_ENFORCE_CLANG=1) would fail on that residue,
+#                 so x64 isn't in the gating default yet; run it with
+#                 CFREE_HOSTAS_CROSS_TARGETS, optionally with
+#                 CFREE_HOSTAS_ENFORCE_CLANG=0.
+# - riscv64-linux: `cc -S | cfree as` round-trips and CROSS-EXECS correctly
+#                 (cfree-as 312/312). The rv64 symbolizer (ArchAsmOps with
+#                 %pcrel_hi/%pcrel_lo anchor pairing + AUIPC/JALR call fusion)
+#                 has landed, and the earlier self-call hang is gone. The
+#                 clang lane has a larger residue (~58 efail: rv64
+#                 data-symbolization syntax, plus bare fcvt mnemonics whose
+#                 rounding mode clang encodes differently). Opt-in, same as
+#                 x64.
#
# Override the matrix with CFREE_HOSTAS_CROSS_TARGETS="tag:triple ..." and the
# clang-as gate with CFREE_HOSTAS_ENFORCE_CLANG=0 (demote lane B to XFAIL).