kit

git clone https://git.ryansepassi.com/git/kit.git

commit 74ca227e2b3673d0e7955f10349183737e72b5cf
parent 2686dfe936ae06310cd00dd61166417df8b3b8a7
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Sat, 30 May 2026 21:04:54 -0700

x64+rv64: complete the cc -S round-trip symbolizer (cross-exec 312/312 each)

Finish the x86-64 and riscv64 `cc -S` paths so that `cc -S | cfree as | cfree ld`
re-assembles and CROSS-EXECUTES the whole toy corpus correctly under podman/qemu
(cfree-as passes 312/312 on both arches; aarch64 is unchanged at 312 pass / 0 fail).

Symbolizer generalization (src/api/asm_emit.c, src/arch/arch.h, src/arch/registry.c):
- emit_data_range/data_reloc_directive now round-trip R_PC32/R_PC64 data
  relocations (jump tables, global/array/fp/static-string data) as
  `.long`/`.quad sym - .`, not just R_ABS32/64. This recovers the x64 data
  backlog.
- ArchAsmOps gains a reloc_call_pair hook (next to the existing is_local_branch),
  and ArchRelocOperand gains, beside addend_bias, the RISC-V hi/lo anchor pairing
  (emit_anchor/ref_anchor + ARCH_RELOC_SURG_RV_LO12). The symbolizer synthesizes
  a unique `.Lpcrel_hi_<secidx>_<off>` anchor label at each %pcrel_hi /
  %got_pcrel_hi AUIPC, resolves the paired %pcrel_lo to that label, and fuses an
  R_RISCV_CALL AUIPC+JALR pair into a single `call sym` / `tail sym`.
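The following is a minimal C sketch of the two rv64 decisions described above. It assumes a simplified reloc array sorted by offset, and it covers three pieces:

- the anchor label spelling;
- the nearest-preceding-%pcrel_hi lookup that a %pcrel_lo resolves to;
- the call-vs-tail choice, which is read from the JALR partner.

`Reloc`, `anchor_label`, `find_anchor` and `call_or_tail` are hypothetical stand-ins for fmt_anchor_label, find_anchor_for_lo12 and rv64_reloc_call_pair. They are not the actual cfree code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical reloc record: only the fields this sketch needs. */
typedef struct {
    unsigned offset;  /* byte offset of the relocated instruction */
    int is_hi;        /* 1 for a %pcrel_hi/%got_pcrel_hi reloc on an AUIPC */
} Reloc;

/* Spell the synthesized anchor for an AUIPC at `off` in section `secidx`:
 * `.Lpcrel_hi_<secidx>_<off in hex>`. The `.L` prefix keeps it local. */
static int anchor_label(char *buf, size_t cap, unsigned secidx, unsigned off) {
    return snprintf(buf, cap, ".Lpcrel_hi_%u_%x", secidx, off);
}

/* Nearest %pcrel_hi reloc strictly before `lo_off` in an offset-sorted
 * array; the %pcrel_lo at lo_off names that AUIPC's anchor label. */
static int find_anchor(const Reloc *r, size_t n, unsigned lo_off,
                       unsigned *hi_off) {
    int found = 0;
    for (size_t i = 0; i < n && r[i].offset < lo_off; ++i) {
        if (r[i].is_hi) {
            *hi_off = r[i].offset;
            found = 1;
        }
    }
    return found;
}

/* R_RISCV_CALL sits on the AUIPC; the JALR partner decides the pseudo.
 * `jr rs` (x0 link) or `jalr zero, ...` is a tail call; a link into ra is a
 * call. Returns NULL when the partner is not a JALR (leave the pair as-is). */
static const char *call_or_tail(const char *partner_mn,
                                const char *partner_ops) {
    if (strcmp(partner_mn, "jr") == 0)
        return "tail";
    if (strcmp(partner_mn, "jalr") == 0)
        return strncmp(partner_ops, "zero", 4) == 0 ? "tail" : "call";
    return NULL;
}
```

Like the commit's lookup, this sketch relies on codegen always emitting the AUIPC immediately before its paired ADDI or load. A more general scheme would instead match on the AUIPC address named by the %pcrel_lo relocation's own symbol.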

rv64 (src/arch/rv64/asm.c, arch.c): new rv64_reloc_operand (%pcrel_hi/%got_pcrel_hi/
%pcrel_lo/%hi/%lo/branches), rv64_is_local_branch and rv64_reloc_call_pair, wired
up as rv64_asm_ops. The assembler now accepts the low-half `mv rd,rs,%pcrel_lo(L)`
form and restores the fcvt rounding mode the disassembler drops: RTZ for fp->int
(C truncation), DYN for int->fp and fp<->fp, matching codegen. The earlier
self-call hang (an unsymbolized `auipc ra,0x0; jalr ra,0(ra)` that called itself)
is gone. `tail` keeps the standard t1 scratch register, which is RAS-friendly:
t0 (x5) is an alternate link register, so `jr t0` hints a return-address-stack
pop. cfree codegen's t0 form is execution-equivalent; only the tail-call byte
images differ.

x64 (src/arch/x64/{asm,isa,native}.c): x64_reloc_operand and is_local_branch are
unchanged. native.c's prologue frame spills now use the minimal-displacement
encoding (x64_pack_mem), so the round-trip reproduces their bytes exactly (the
change is execution-equivalent).
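The encoding choice matters for round-tripping because an assembler always picks the shortest displacement form. Codegen must therefore do the same for `cc -S | as` to reproduce its bytes.

The hypothetical `pack_mem_minimal` below sketches that rule for a `[base + disp]` operand. REX handling is omitted, and it is not x64_pack_mem itself.

```c
#include <assert.h>
#include <stdint.h>

/* ModRM (+SIB) and displacement bytes for [base + disp], using the shortest
 * displacement. `reg` is the ModRM.reg field and `base` a GPR number 0..15;
 * REX bits for registers 8..15 are assumed to be emitted by the caller. */
static int pack_mem_minimal(uint8_t *buf, unsigned reg, unsigned base,
                            int32_t disp) {
    unsigned rm = base & 7u;
    unsigned mod;
    int n = 0;
    assert(reg < 16 && base < 16);
    if (disp == 0 && rm != 5u)
        mod = 0u;   /* no disp; rbp/r13 excluded: mod=00,rm=101 is RIP/disp32 */
    else if (disp >= -128 && disp <= 127)
        mod = 1u;   /* disp8 */
    else
        mod = 2u;   /* disp32 */
    buf[n++] = (uint8_t)((mod << 6) | ((reg & 7u) << 3) | rm);
    if (rm == 4u)
        buf[n++] = 0x24;  /* rsp/r12 base needs a SIB: no index, base=100 */
    if (mod == 1u) {
        buf[n++] = (uint8_t)(int8_t)disp;
    } else if (mod == 2u) {
        uint32_t u = (uint32_t)disp;
        for (int i = 0; i < 4; ++i)
            buf[n++] = (uint8_t)(u >> (8 * i));  /* little-endian disp32 */
    }
    return n;
}
```

Under this rule, `leaq (%rsp), %rax` gets ModRM 0x04 plus SIB 0x24 and no displacement. `leaq 0(%rsp)` with a zero disp8 is a longer spelling that executes identically, but a byte-for-byte comparison still flags it.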

Fixes a regression the integration introduced: the aa64/x64 reloc_operand hooks
don't set the new emit_anchor/ref_anchor fields, so the ArchRelocOperand locals in
asm_emit.c are now zero-initialized. An uninitialized ref_anchor had sent aa64
`adrp` and x64 `callq` operands down the rv64 anchor path, leaving them numeric;
hostas-toy had dropped to 19/293 and is restored to 312/0. The shared asm.c
unary-minus UB fix is retained.
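This is the usual failure mode when a struct grows fields that older callbacks never write. The sketch below uses a simplified, hypothetical stand-in for ArchRelocOperand. `= {0}` zero-fills every member the hook leaves alone, so the rv64-only flags read as 0 on other arches. Without it, `ref_anchor` holds whatever was on the stack.

```c
#include <assert.h>

/* Hypothetical stand-in for ArchRelocOperand: field names follow this
 * commit; the layout and types are illustrative only. */
typedef struct {
    int surg;
    const char *prefix;
    const char *suffix;
    int addend_bias;
    unsigned char emit_anchor;  /* rv64-only: define an anchor here */
    unsigned char ref_anchor;   /* rv64-only: reference the preceding anchor */
} RelocOperand;

enum { PATH_NUMERIC, PATH_SYMREF, PATH_ANCHOR };

/* A hook written before the anchor flags existed: like the aa64/x64
 * reloc_operand, it never writes emit_anchor or ref_anchor. */
static int legacy_reloc_operand(RelocOperand *out) {
    out->surg = 1;
    out->prefix = "";
    out->suffix = "@PLT";
    out->addend_bias = 4;
    return 1;
}

/* The fixed caller: `= {0}` zero-initializes every member, so the flags the
 * hook leaves untouched read as 0 instead of stack garbage. */
static int symbolize_path(void) {
    RelocOperand ro = {0};
    if (!legacy_reloc_operand(&ro))
        return PATH_NUMERIC;
    return ro.ref_anchor ? PATH_ANCHOR : PATH_SYMREF;
}
```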

Updated the rv64_fp/rv64_fp_cvt encode goldens to the corrected (truncating)
fcvt rounding modes. Verified: test-toy 1338/0, test-asm 27/0, test-asm-x64
13/0, test-asm-rv64 43/0, test-asm-roundtrip 572/0, test-asm-roundtrip-toy
624/0/1, test-diff-llvm 271-agree/0-skip, hostas-toy aa64 312/0 both lanes;
x64 cross-exec cfree-as 312/0, rv64 cross-exec cfree-as 312/0.

Residual (tracked, not blocking the cfree -S path): the third-party clang lane
(x64 ~11 failures, rv64 ~58). cfree emits assembly that clang rejects or encodes
differently, notably bare fcvt without a rounding mode and the rv64 data
%-operator syntax; a standard-conformance follow-up.
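One possible shape for the bare-`fcvt` part of that follow-up is sketched below. It assumes the printer has the raw instruction word. The idea is to emit the rm field as an explicit trailing operand, which GNU as and LLVM both accept (e.g. `fcvt.w.d a0, fa0, rtz`). The encoding then no longer depends on each assembler's default for a bare fcvt.

`rm_name` and `print_fcvt` are hypothetical. Only instructions with an rm field (not fmv.*) should take this path.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* RISC-V rm field (bits 14:12) names; encodings 5 and 6 are reserved. */
static const char *rm_name(uint32_t rm) {
    static const char *const names[8] = {"rne", "rtz", "rdn", "rup",
                                         "rmm", NULL,  NULL,  "dyn"};
    return names[rm & 7u];
}

/* Print `mn rd, rs1, rm` with the rounding mode spelled explicitly.
 * Returns snprintf's result, or -1 for a reserved rm (caller would fall
 * back to emitting the raw word). */
static int print_fcvt(char *buf, size_t cap, const char *mn, const char *rd,
                      const char *rs1, uint32_t insn) {
    const char *rm = rm_name((insn >> 12) & 7u);
    if (!rm)
        return -1;
    return snprintf(buf, cap, "%s\t%s, %s, %s", mn, rd, rs1, rm);
}
```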

Diffstat:
M doc/ASM_ROUNDTRIP_TESTING.md | 44 ++++++++++++++++++++++++++------------------
M src/api/asm_emit.c | 269 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------
M src/arch/arch.h | 39 +++++++++++++++++++++++++++++++++++++++
M src/arch/registry.c | 9 +++++++++
M src/arch/rv64/arch.c | 2 ++
M src/arch/rv64/asm.c | 150 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
M src/arch/x64/asm.c | 41 +++++++++++++++++++++++++++++++----------
M src/arch/x64/isa.c | 38 ++++++++++++++++++++++++++++++++++----
M src/arch/x64/native.c | 47 +++++++++++++++++++++++------------------------
M src/asm/asm.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------------
M test/asm/encode/rv64_fp.expected.hex | 2 +-
M test/asm/encode/rv64_fp_cvt.expected.hex | 2 +-
M test/asm/hostas_cross.sh | 30 ++++++++++++++----------------
13 files changed, 635 insertions(+), 124 deletions(-)

diff --git a/doc/ASM_ROUNDTRIP_TESTING.md b/doc/ASM_ROUNDTRIP_TESTING.md
@@ -250,24 +250,32 @@ downgrades to SKIP instead of hanging).
 Status:
 - **aarch64-linux**: green end-to-end (cfree-as 312/0, clang-as 312/0) — podman
   runs arm64 natively in its VM, so it's fast and the primary verified target.
-- **x86_64-linux**: the x64 `cc -S` symbolizer landed (the aarch64 symbolizer is
-  now arch-generalized — `ArchAsmOps.is_local_branch` for `jmp`/`jcc`, a x64
-  `reloc_operand` table for `sym(%rip)`/bare-`@PLT`/`@GOTPCREL` with a +4 rel32
-  addend bias, and operand-driven RIP surgery), so the whole corpus
-  **re-assembles 312/312** via both cfree-as and clang. Cross-EXEC is **272/312**:
-  ~23 cases (switch/jump tables, global/array/fp data, varargs) lose fidelity in
-  the x64 cc -S **data** round-trip — confirmed cc -S infidelity, since the
-  DIRECT `cc -c` object executes correctly. That data backlog is the remaining
-  x64 work. Opt-in until it closes.
-- **riscv64-linux**: assembles, but cross-EXEC **hangs** — NOT emulation (a
-  minimal clang rv64 static exe AND the DIRECT cfree `cc -c` object both run
-  correctly under the same qemu-riscv64; only the `cc -S | as` round-trip hangs).
-  Root cause: rv64 has **no symbolizer** (no `ArchAsmOps`), so `cc -S` emits the
-  call as `auipc ra,0x0; jalr ra,0(ra)` with the `R_RISCV_CALL` reloc
-  unsymbolized — it calls itself — and branches keep numeric targets (`j 0x90`).
-  Needs an rv64 `ArchAsmOps`: `is_local_branch` (j/beq/bne/...) and a
-  `reloc_operand` for the RISC-V `%pcrel_hi`/`%pcrel_lo`/`%hi`/`%lo`/`call`
-  syntax (the `%pcrel_lo` label-pairing makes this the hardest of the three).
+- **x86_64-linux**: the x64 `cc -S` symbolizer is complete — the aarch64
+  symbolizer was arch-generalized (`ArchAsmOps.is_local_branch` for `jmp`/`jcc`,
+  an x64 `reloc_operand` table for `sym(%rip)`/bare-`@PLT`/`@GOTPCREL` with a +4
+  rel32 addend bias, operand-driven RIP surgery) and the `emit_data_range` data
+  path now handles `R_PC32`/`R_PC64` (jump tables, global/array/fp/static-string
+  data). `cc -S | cfree as` re-assembles AND **cross-EXECS the whole corpus
+  correctly: cfree-as 312/312.** Byte-faithful 300/312 — the 12 are alloca/abi
+  cases where the re-assembled encoding is execution-equivalent (e.g.
+  `leaq (%rsp)` vs `leaq 0(%rsp)`). The clang lane is 301/11 (cfree emits AT&T
+  text clang rejects). Opt-in (the global clang gate would fail on that residue).
+- **riscv64-linux**: the rv64 `cc -S` symbolizer landed — a new `ArchAsmOps` with
+  `is_local_branch` (j/beq/...), a `reloc_operand` covering `%pcrel_hi`/
+  `%pcrel_lo`/`%hi`/`%lo`, the `%pcrel_lo` AUIPC-anchor pairing (synthesized
+  `.Lpcrel` labels via a new `ARCH_RELOC_SURG_RV_LO12` + `emit_anchor`/
+  `ref_anchor`), and an `R_RISCV_CALL` AUIPC+JALR call-pair fusion to `call`/
+  `tail`. `cc -S | cfree as` round-trips AND **cross-EXECS correctly: cfree-as
+  312/312** — the earlier self-call hang is gone. Byte-faithful 282/312 — the 30
+  are tail-call cases where cfree codegen uses `t0` but the standard (and
+  RAS-friendly) `tail` pseudo the assembler emits uses `t1`; execution-identical.
+  The clang lane is 254/58 (rv64 data-symbolization syntax + bare-`fcvt`
+  rounding-mode that clang encodes differently). Opt-in.
+- **Remaining (both arches): the third-party `clang` lane.** cfree's `cc -S` is
+  faithfully re-assemblable and executable by cfree's own `as`, but not yet
+  fully clang-standard for x64 (a few AT&T spellings) or rv64 (data
+  `%`-operator syntax; bare-`fcvt` needs an explicit rounding-mode suffix). A
+  standard-conformance follow-up; does not block the cfree `-S` path.
Override the matrix with `CFREE_HOSTAS_CROSS_TARGETS="tag:triple ..."`, the exec-smoke cap with `CFREE_HOSTAS_EXEC_TIMEOUT=<secs>`, and per-arch images with diff --git a/src/api/asm_emit.c b/src/api/asm_emit.c @@ -83,6 +83,16 @@ static SymLabel* collect_labels(Compiler* c, ObjBuilder* ob, ObjSecId sec_id, if (sym->section_id != sec_id) continue; if (sym->kind == SK_SECTION || sym->kind == SK_FILE) continue; if (!sym->name) continue; + /* RISC-V `.LpcrelHi` anchors are codegen-internal labels on AUIPC + * instructions, used only as the target of a paired `%pcrel_lo` + * relocation. Many share the one name (one per AUIPC), so emitting them + * verbatim defines the same label repeatedly and breaks re-assembly. The + * symbolizer replaces each with a unique synthesized anchor label + * (emit_anchor / ref_anchor), so suppress the originals here. */ + { + Slice nm = pool_slice(c->global, sym->name); + if (slice_eq_cstr(nm, ".LpcrelHi")) continue; + } if (n == cap) { u32 ncap = cap ? cap * 2 : 8; @@ -233,46 +243,67 @@ static u32 align_log2(u32 a) { * SEC_DEBUG). SEC_OTHER (a global in a named section, e.g. * __attribute__((section(...)))) emits the real name plus its flags/type/ * entsize in GNU-as syntax so the label and bytes survive re-assembly. */ +/* Emit `.section name, "flags", @type[, entsize]` (the GNU-as named-section + * form). Used for SEC_OTHER and for any canonical-kind section whose name or + * flags can't be reproduced by the bare `.text`/`.section .rodata` builtins. */ +static int elf_named_section(const AsmSynCtx* x, const Section* sec) { + Writer* w = x->w; + Slice nm = pool_slice(x->c->global, sec->name); + if (nm.len == 0) return 0; + w_str(w, " .section\t"); + cfree_writer_write(w, nm.s, nm.len); + w_str(w, ", \""); + w_secflags(w, sec->flags); + w_str(w, "\", "); + w_str(w, sec->sem == SSEM_NOBITS ? "@nobits" : "@progbits"); + if ((sec->flags & SF_MERGE) || sec->entsize) { + w_str(w, ", "); + w_dec(w, (u64)(sec->entsize ? 
sec->entsize : 1)); + } + w_newline(w); + return 1; +} + +/* Does this canonical-kind section round-trip through its bare builtin + * directive? Only if its name is exactly the canonical spelling and it carries + * no flags that the builtin can't express (MERGE/STRINGS/RETAIN/entsize). A + * `.rodata.foo.merge` mergeable-string section, for instance, must be spelled + * in full or the linker won't merge/GC it the way the direct object does. */ +static int sec_is_canonical(const AsmSynCtx* x, const Section* sec, + const char* canon) { + Slice nm = pool_slice(x->c->global, sec->name); + if (sec->flags & (SF_MERGE | SF_STRINGS | SF_RETAIN)) return 0; + if (sec->entsize) return 0; + return slice_eq_cstr(nm, canon); +} + static int elf_section_header(const AsmSynCtx* x, const Section* sec) { Writer* w = x->w; + if (sec->flags & SF_TLS) return 0; switch (sec->kind) { case SEC_TEXT: + if (!sec_is_canonical(x, sec, ".text")) return elf_named_section(x, sec); w_str(w, " .text"); w_newline(w); return 1; case SEC_RODATA: - if (sec->flags & SF_TLS) return 0; + if (!sec_is_canonical(x, sec, ".rodata")) + return elf_named_section(x, sec); w_str(w, " .section\t.rodata"); w_newline(w); return 1; case SEC_DATA: - if (sec->flags & SF_TLS) return 0; + if (!sec_is_canonical(x, sec, ".data")) return elf_named_section(x, sec); w_str(w, " .section\t.data"); w_newline(w); return 1; case SEC_BSS: - if (sec->flags & SF_TLS) return 0; + if (!sec_is_canonical(x, sec, ".bss")) return elf_named_section(x, sec); w_str(w, " .section\t.bss"); w_newline(w); return 1; - case SEC_OTHER: { - Slice nm; - if (sec->flags & SF_TLS) return 0; - nm = pool_slice(x->c->global, sec->name); - if (nm.len == 0) return 0; - w_str(w, " .section\t"); - cfree_writer_write(w, nm.s, nm.len); - w_str(w, ", \""); - w_secflags(w, sec->flags); - w_str(w, "\", "); - w_str(w, sec->sem == SSEM_NOBITS ? "@nobits" : "@progbits"); - if ((sec->flags & SF_MERGE) || sec->entsize) { - w_str(w, ", "); - w_dec(w, (u64)(sec->entsize ? 
sec->entsize : 1)); - } - w_newline(w); - return 1; - } + case SEC_OTHER: + return elf_named_section(x, sec); default: return 0; } @@ -436,20 +467,34 @@ static CfreeStatus emit_raw_bytes(Writer* w, const u8* data, u32 start, } /* A reloc kind whose data field carries a symbol value reproducible by an - * integer directive: maps to (directive, byte width). The assembler emits the - * matching R_ABS{32,64} for `.word`/`.quad SYM+addend` (emit_int_directive), - * so the round-tripped relocation matches codegen's. Returns 0 for kinds with - * no integer-directive spelling (caller keeps the raw bytes). */ -static int data_reloc_directive(u16 kind, const char** dir, u32* width) { + * integer directive: maps to (directive, byte width, PC-relative?). The + * assembler emits the matching R_ABS{32,64} for `.word`/`.quad SYM+addend` and + * R_PC{32,64} for `.long`/`.quad SYM - .` (emit_int_directive), so the + * round-tripped relocation matches codegen's. `*pcrel` selects the `SYM - .` + * spelling (built by build_data_symref). Returns 0 for kinds with no + * integer-directive spelling (caller keeps the raw bytes). */ +static int data_reloc_directive(u16 kind, const char** dir, u32* width, + int* pcrel) { + *pcrel = 0; switch (kind) { case R_ABS64: *dir = " .quad "; *width = 8; return 1; + case R_PC64: + *dir = " .quad "; + *width = 8; + *pcrel = 1; + return 1; case R_ABS32: *dir = " .word "; *width = 4; return 1; + case R_PC32: + *dir = " .long "; + *width = 4; + *pcrel = 1; + return 1; default: return 0; } @@ -637,6 +682,44 @@ static CfreeStatus w_symbolized(Writer* w, const char* ops, u32 olen, return w_str(w, symref); } } + if (surg == ARCH_RELOC_SURG_RV_LO12) { + /* RISC-V low-half: a `disp(base)` memory form rewrites the displacement; + * a register-immediate form appends the modifier as a new operand. The + * memory form is recognized by a trailing `(...)` group. 
*/ + i32 lp = -1, rp = -1; + u32 i; + for (i = 0; i < olen; ++i) { + if (ops[i] == '(') lp = (i32)i; + else if (ops[i] == ')') rp = (i32)i; + } + if (lp >= 0 && rp > lp && (u32)(rp + 1) == olen) { + /* `..., <disp>(base)` -> `..., symref(base)`: replace the displacement + * run that ends immediately before the '('. */ + i32 ds = lp; /* start of the displacement run */ + CfreeStatus st; + while (ds > 0) { + char ch = ops[ds - 1]; + if ((ch >= '0' && ch <= '9') || (ch >= 'a' && ch <= 'f') || + (ch >= 'A' && ch <= 'F') || ch == 'x' || ch == '-' || ch == '+') + --ds; + else + break; + } + st = cfree_writer_write(w, ops, (u32)ds); /* "..., " before the disp */ + if (st != CFREE_OK) return st; + st = w_str(w, symref); + if (st != CFREE_OK) return st; + return cfree_writer_write(w, ops + lp, olen - (u32)lp); /* "(base)" */ + } + /* Register-immediate (e.g. `mv rd, rs`): append symref as a new operand. */ + { + CfreeStatus st = cfree_writer_write(w, ops, olen); + if (st != CFREE_OK) return st; + st = w_str(w, ", "); + if (st != CFREE_OK) return st; + return w_str(w, symref); + } + } /* SURG_MEM: keep the base register, set the offset to symref. */ { i32 lb = -1, rb = -1, base_end; @@ -754,6 +837,44 @@ static u32 fmt_synth_label(char* buf, u32 cap, u32 secidx, u32 off) { return p; } +/* Synthesized hi/lo anchor label spelling (RISC-V `%pcrel_hi`/`%pcrel_lo` + * pairing). The high-half reloc defines `.Lpcrel_hi_<secidx>_<off>` at its + * AUIPC; the paired low-half reloc references it. `.L`-prefixed so the + * assembler's lexer accepts it and the linker treats it as local. 
*/ +static u32 fmt_anchor_label(char* buf, u32 cap, u32 secidx, u32 off) { + u32 p = 0; + const char* pre = ".Lpcrel_hi_"; + u32 i; + for (i = 0; pre[i] && p + 1 < cap; ++i) buf[p++] = pre[i]; + p = fmt_u64(buf, p, cap, secidx, 10); + if (p + 1 < cap) buf[p++] = '_'; + p = fmt_u64(buf, p, cap, off, 16); + buf[p] = '\0'; + return p; +} + +/* The offset of the high-half (anchor-emitting) relocation paired with a + * low-half reloc at `lo_off`: the nearest preceding reloc whose ArchRelocOperand + * sets emit_anchor. cfree's codegen always emits the AUIPC immediately before + * its paired ADDI/load, so the nearest preceding anchor is the correct one. + * Returns 1 and *hi_off on success. */ +static int find_anchor_for_lo12(const EmitCtx* x, u32 lo_off, u32* hi_off) { + u32 i; + int found = 0; + u32 best = 0; + for (i = 0; i < x->nrelocs; ++i) { + ArchRelocOperand ro = {0}; /* zero emit_anchor/ref_anchor: arches that + * don't set them must read as 0 (rv64-only) */ + if (x->relocs[i].offset >= lo_off) break; + if (arch_reloc_operand(x->c, x->relocs[i].kind, &ro) && ro.emit_anchor) { + best = x->relocs[i].offset; + found = 1; + } + } + if (found && hi_off) *hi_off = best; + return found; +} + /* Non-dot symbol name defined at `off`, or NULL. Such a symbol is used as the * branch label directly (no synthesized label needed). 
*/ static Sym symbol_at(const EmitCtx* x, u32 off) { @@ -841,12 +962,33 @@ static CfreeStatus emit_operands(Writer* w, const EmitCtx* x, if (!insn->operands.len) return CFREE_OK; r = reloc_in_range(x->relocs, x->nrelocs, off, insn->nbytes); if (r) { - ArchRelocOperand ro; + ArchRelocOperand ro = {0}; /* zero emit_anchor/ref_anchor: arches that + * don't set them must read as 0 (rv64-only) */ if (arch_reloc_operand(x->c, r->kind, &ro)) { char symref[256]; - if (build_symref(symref, sizeof symref, x->c, &ro, r->sym, r->addend) >= 0) + /* A low-half reloc (RISC-V `%pcrel_lo`) names the paired high-half's + * synthesized anchor label, not the reloc's own (`.LpcrelHi`) symbol. */ + if (ro.ref_anchor) { + u32 hi_off; + if (find_anchor_for_lo12(x, off, &hi_off)) { + char name[256]; + u32 p = 0, i; + for (i = 0; ro.prefix[i] && p + 1 < sizeof name; ++i) + name[p++] = ro.prefix[i]; + p += fmt_anchor_label(name + p, (u32)sizeof name - p, x->secidx, + hi_off); + for (i = 0; ro.suffix[i] && p + 1 < sizeof name; ++i) + name[p++] = ro.suffix[i]; + name[p] = '\0'; + return w_symbolized(w, insn->operands.s, insn->operands.len, name, + ro.surg); + } + /* No anchor found (unexpected): fall through to keep numeric. */ + } else if (build_symref(symref, sizeof symref, x->c, &ro, r->sym, + r->addend) >= 0) { return w_symbolized(w, insn->operands.s, insn->operands.len, symref, ro.surg); + } } } else if (arch_is_local_branch(x->c, insn->mnemonic)) { u64 tgt; @@ -887,17 +1029,25 @@ static CfreeStatus emit_data_range(Writer* w, Compiler* c, const u8* data, if (r) { const char* dir; u32 width; + int pcrel; char symref[256]; /* Data relocations spell the bare symbol (`.quad sym+addend`): no - * page/lo12-style operand modifier on either format. */ - ArchRelocOperand bare = {ARCH_RELOC_SURG_NONE, "", "", 0}; - if (data_reloc_directive(r->kind, &dir, &width) && off + width <= end && + * page/lo12-style operand modifier on either format. 
A PC-relative + * reloc adds a trailing ` - .` (location counter) so the assembler + * re-derives R_PC{32,64} instead of an absolute reloc. */ + ArchRelocOperand bare = {ARCH_RELOC_SURG_NONE, "", "", 0, 0, 0}; + if (data_reloc_directive(r->kind, &dir, &width, &pcrel) && + off + width <= end && build_symref(symref, sizeof symref, c, &bare, r->sym, r->addend) >= 0) { CfreeStatus st = w_str(w, dir); if (st != CFREE_OK) return st; st = w_str(w, symref); if (st != CFREE_OK) return st; + if (pcrel) { + st = w_str(w, " - ."); + if (st != CFREE_OK) return st; + } st = w_newline(w); if (st != CFREE_OK) return st; off += width; @@ -938,6 +1088,63 @@ static CfreeStatus emit_disasm_range(Writer* w, const EmitCtx* x, continue; } + /* Call-pair fusion (RISC-V R_RV_CALL): a reloc on this instruction whose + * arch fuses it with the FOLLOWING instruction into a single `call`/`tail + * sym` pseudo. Probe the partner for the call-vs-tail decision, emit one + * line, and skip both. Decoding the partner reuses the disassembler's + * buffers (clobbering `insn`), so build the symref first and re-decode + * `insn` when the pair is not fused. 
*/ + { + const SecReloc* cr = reloc_in_range(x->relocs, x->nrelocs, off, n); + char symref[256]; + ArchRelocOperand bare = {ARCH_RELOC_SURG_TAIL, "", "", 0, 0, 0}; + if (cr && off + n < end && + build_symref(symref, sizeof symref, x->c, &bare, cr->sym, + cr->addend) >= 0) { + CfreeInsn partner; + u32 pn = arch_disasm_decode(dasm, data + off + n, end - (off + n), + (u64)(off + n), &partner); + const char* mn = NULL; + if (pn && arch_reloc_call_pair(x->c, cr->kind, partner.mnemonic, + partner.operands, &mn)) { + st = w_str(w, "\t"); + if (st != CFREE_OK) return st; + st = w_str(w, mn); + if (st != CFREE_OK) return st; + st = w_str(w, "\t"); + if (st != CFREE_OK) return st; + st = w_str(w, symref); + if (st != CFREE_OK) return st; + st = w_newline(w); + if (st != CFREE_OK) return st; + off += n + pn; + continue; + } + /* Not fused: the partner probe clobbered `insn`; re-decode it. */ + (void)arch_disasm_decode(dasm, data + off, end - off, vaddr, &insn); + } + } + + /* A high-half reloc (RISC-V AUIPC `%pcrel_hi`/`%got_pcrel_hi`) needs a + * unique local anchor label here so the paired `%pcrel_lo` can name it. */ + { + const SecReloc* hr = reloc_in_range(x->relocs, x->nrelocs, off, n); + if (hr) { + ArchRelocOperand ro = {0}; /* zero emit_anchor/ref_anchor: arches that + * don't set them must read as 0 (rv64-only) */ + if (arch_reloc_operand(x->c, hr->kind, &ro) && ro.emit_anchor) { + char name[256]; + fmt_anchor_label(name, sizeof name, x->secidx, off); + st = w_str(w, name); + if (st != CFREE_OK) return st; + st = w_str(w, ":"); + if (st != CFREE_OK) return st; + st = w_newline(w); + if (st != CFREE_OK) return st; + } + } + } + st = w_str(w, "\t"); if (st != CFREE_OK) return st; st = cfree_writer_write(w, insn.mnemonic.s, insn.mnemonic.len); diff --git a/src/arch/arch.h b/src/arch/arch.h @@ -188,6 +188,14 @@ typedef enum ArchRelocSurg { ARCH_RELOC_SURG_TAIL, /* replace last comma component (or whole operand) */ ARCH_RELOC_SURG_MEM, /* rewrite the offset inside [...] 
(aarch64 ldst) */ ARCH_RELOC_SURG_RIP, /* insert sym before disp(%rip) (x86-64 RIP-rel) */ + /* RISC-V `%pcrel_lo`/`%lo` low-half operand. A single reloc kind covers two + * disassembled shapes: a register-immediate ADDI (printed as `mv rd, rs` + * when the immediate is 0) where the modifier becomes a new trailing + * operand (`mv rd, rs, %pcrel_lo(L)`, which the assembler folds back into + * ADDI), and a `disp(base)` load/store where the modifier replaces the + * displacement (`%pcrel_lo(L)(base)`). The shape is picked from the operand + * text: a trailing `(...)` group selects the memory form. */ + ARCH_RELOC_SURG_RV_LO12, } ArchRelocSurg; typedef struct ArchRelocOperand { @@ -198,6 +206,15 @@ typedef struct ArchRelocOperand { * an instruction-encoding bias so the printed offset is the *symbol* offset: * 0 for aarch64; +4 for x86-64 rel32 (PC32/PLT32/GOTPCREL store addend-4). */ int addend_bias; + /* hi/lo anchor pairing (RISC-V `%pcrel_hi`/`%pcrel_lo`). A high-half reloc + * (AUIPC `%pcrel_hi(sym)`) sets `emit_anchor` so the symbolizer defines a + * unique local label at this instruction. The paired low-half reloc + * (`%pcrel_lo`) sets `ref_anchor`: its operand references that synthesized + * anchor label (the nearest preceding anchor) instead of the reloc's own + * symbol — matching the RISC-V ABI, where `%pcrel_lo` names the AUIPC's + * label, not the target symbol. Other arches leave both 0. */ + u8 emit_anchor; + u8 ref_anchor; } ArchRelocOperand; typedef struct ArchAsmOps { @@ -214,6 +231,19 @@ typedef struct ArchAsmOps { * (aarch64 b/b.cc/cbz/...; x86-64 jmp/jcc). Calls are excluded — they carry * relocations. NULL hook = no local-branch symbolization for the arch. */ int (*is_local_branch)(CfreeSlice mnemonic); + /* Fuse a relocation that the disassembler renders as a 2-instruction pair + * back into a single relocated pseudo-instruction line. 
RISC-V R_RV_CALL + * sits on an AUIPC whose JALR partner carries no reloc; the canonical `.s` + * spelling is a single `call`/`tail sym`. When `kind` names such a reloc, + * the hook returns 1 and sets *mnemonic_out to the fused mnemonic — the + * symbolizer then emits "<mnemonic>\t<sym[+addend]>" in place of BOTH + * instructions (skipping the partner). `pair_mnemonic`/`pair_ops` are the + * SECOND instruction's disassembled text (the JALR), used to disambiguate + * (e.g. call vs tail by its link register). Returns 0 to leave the pair + * un-fused (per-instruction operand symbolization applies). NULL hook = no + * pair fusion for the arch. */ + int (*reloc_call_pair)(u16 reloc_kind, CfreeSlice pair_mnemonic, + CfreeSlice pair_ops, const char** mnemonic_out); } ArchAsmOps; typedef struct ArchImpl { @@ -273,6 +303,15 @@ int arch_reloc_operand(const Compiler* c, u16 reloc_kind, * arch has no asm_ops/is_local_branch hook. */ int arch_is_local_branch(const Compiler* c, CfreeSlice mnemonic); +/* 1 if `reloc_kind` names a 2-instruction call pair the symbolizer should fuse + * into a single pseudo line (RISC-V R_RV_CALL -> `call`/`tail`), with + * *mnemonic_out set to the fused mnemonic. `pair_*` are the partner (second) + * instruction's disassembled text. 0 when not fused / no hook. Thin dispatch + * over ArchAsmOps.reloc_call_pair. 
*/ +int arch_reloc_call_pair(const Compiler* c, u16 reloc_kind, + CfreeSlice pair_mnemonic, CfreeSlice pair_ops, + const char** mnemonic_out); + ArchDisasm* arch_disasm_new(Compiler*); u32 arch_disasm_decode(ArchDisasm*, const u8* bytes, size_t len, u64 vaddr, CfreeInsn* out); diff --git a/src/arch/registry.c b/src/arch/registry.c @@ -101,6 +101,15 @@ int arch_is_local_branch(const Compiler* c, CfreeSlice mnemonic) { return a->asm_ops->is_local_branch(mnemonic); } +int arch_reloc_call_pair(const Compiler* c, u16 reloc_kind, + CfreeSlice pair_mnemonic, CfreeSlice pair_ops, + const char** mnemonic_out) { + const ArchImpl* a = arch_for_compiler(c); + if (!a || !a->asm_ops || !a->asm_ops->reloc_call_pair) return 0; + return a->asm_ops->reloc_call_pair(reloc_kind, pair_mnemonic, pair_ops, + mnemonic_out); +} + const CGBackend* cg_backend_for_session(const Compiler* c, const CfreeCodeOptions* opts) { if (opts && opts->check_only) { diff --git a/src/arch/rv64/arch.c b/src/arch/rv64/arch.c @@ -13,6 +13,7 @@ extern const LinkArchDesc link_arch_rv64; extern const ArchDbgOps rv64_dbg_ops; extern const ArchEmuOps rv64_emu_ops; extern const ArchDwarfOps rv64_dwarf_ops; +extern const ArchAsmOps rv64_asm_ops; static int rv64_register_at_public(uint32_t idx, CfreeArchReg* out) { const char* nm = NULL; @@ -180,6 +181,7 @@ const ArchImpl arch_impl_rv64 = { .link = &link_arch_rv64, .dwarf = &rv64_dwarf_ops, .dbg = &rv64_dbg_ops, + .asm_ops = &rv64_asm_ops, .predefined_macros = rv64_predefined_macros, .npredefined_macros = (u32)(sizeof rv64_predefined_macros / sizeof rv64_predefined_macros[0]), diff --git a/src/arch/rv64/asm.c b/src/arch/rv64/asm.c @@ -449,6 +449,12 @@ static u32 assemble_one(AsmDriver* d, const Rv64InsnDesc* desc) { rd = parse_xreg(d); expect_comma(d); rs1 = parse_xreg(d); + /* `cc -S` spells an ADDI rd,rs,%pcrel_lo(L) low-half (imm 0) as the + * `mv` alias plus a trailing relocation operand: `mv rd, rs, + * %pcrel_lo(L)`. 
Fold it back into ADDI with the I-type reloc. */ + if (asm_driver_eat_comma(d) && + !rv_emit_imm_mod_reloc(d, RV_MODPOS_LO_I)) + asm_driver_panic(d, "rv64 asm: mv: expected %lo/%pcrel_lo operand"); return enc_i(m, rd, rs1, 0); } if (slice_eq_cstr(desc->mnemonic, "sext.w")) { @@ -640,8 +646,31 @@ static u32 assemble_one(AsmDriver* d, const Rv64InsnDesc* desc) { expect_comma(d); rs1 = parse_freg(d); } - /* match already encodes rs2 (type selector); only OR rd/rs1. */ - return m | ((rs1 & 0x1fu) << 15) | ((rd & 0x1fu) << 7); + /* match already encodes rs2 (type selector); OR rd/rs1 and the rounding + * mode the disassembler dropped. The rm is fixed per conversion family + * (mirrors the rv_fcvt_* encoders in isa.h, the codegen source of + * truth): fp->int truncates (RTZ=1); int->fp and fp->fp use DYN=7; the + * fmv bit-moves carry no rounding (rm=0). Keyed on the funct7 in match. */ + { + u32 funct7 = (m >> 25) & 0x7fu; + u32 rm; + switch (funct7) { + case 0x60: /* fcvt.{w,wu,l,lu}.s */ + case 0x61: /* fcvt.{w,wu,l,lu}.d */ + rm = 0x1u; /* RTZ */ + break; + case 0x70: /* fmv.x.w */ + case 0x71: /* fmv.x.d */ + case 0x78: /* fmv.w.x */ + case 0x79: /* fmv.d.x */ + rm = 0x0u; + break; + default: /* int->fp (0x68/0x69) and fp<->fp (0x20/0x21): DYN */ + rm = 0x7u; + break; + } + return m | (rm << 12) | ((rs1 & 0x1fu) << 15) | ((rd & 0x1fu) << 7); + } case RV64_FMT_AMO: rd = parse_xreg(d); @@ -907,6 +936,12 @@ static bool rv64_emit_pseudo(AsmDriver* d, const Rv64InsnDesc* desc) { return true; } if (slice_eq_cstr(desc->mnemonic, "tail")) { + /* Standard RISC-V `tail` materializes the address into t1 (x6). cfree + * codegen uses t0 for its own tail-call temp, so a `cc -S`-fused + * `tail sym` re-assembles to t1 not t0 — execution-equivalent (both are + * caller-saved temps clobbered by the tail jump; cross-exec still + * matches), only the byte image differs on tail-call cases. Keeping the + * assembler's `tail` standard preserves clang/gas interop. 
*/ rv_emit_call_pseudo(d, RV_T1, RV_ZERO); return true; } @@ -948,6 +983,117 @@ static void rv64_arch_asm_insn(ArchAsm* base, AsmDriver* d, Sym mnemonic) { static void rv64_arch_asm_destroy(ArchAsm* base) { (void)base; } +/* ---- textual-assembly operand syntax (printer <-> parser) ---------------- + * + * Inverse of the `.s` parsers above (rv_parse_mod_reloc / rv_reloc_target and + * the call/la pseudo expanders): how a relocated rv64 operand is spelled in + * `cc -S` so the same text re-assembles under cfree-as. RISC-V uses the same + * `%hi`/`%lo`/`%pcrel_hi`/`%pcrel_lo` operator syntax on every object format, + * so `fmt` is unused. See ArchAsmOps and src/api/asm_emit.c. */ +static int rv64_reloc_operand(u16 kind, CfreeObjFmt fmt, ArchRelocOperand* out) { + (void)fmt; + out->prefix = ""; + out->suffix = ""; + out->addend_bias = 0; + out->emit_anchor = 0; + out->ref_anchor = 0; + switch (kind) { + case R_RV_PCREL_HI20: + out->surg = ARCH_RELOC_SURG_TAIL; + out->prefix = "%pcrel_hi("; + out->suffix = ")"; + out->emit_anchor = 1; /* define a unique anchor label at this AUIPC */ + return 1; + case R_RV_GOT_HI20: + out->surg = ARCH_RELOC_SURG_TAIL; + out->prefix = "%got_pcrel_hi("; + out->suffix = ")"; + out->emit_anchor = 1; + return 1; + case R_RV_PCREL_LO12_I: + case R_RV_PCREL_LO12_S: + out->surg = ARCH_RELOC_SURG_RV_LO12; + out->prefix = "%pcrel_lo("; + out->suffix = ")"; + out->ref_anchor = 1; /* references the preceding AUIPC's anchor label */ + return 1; + case R_RV_HI20: + out->surg = ARCH_RELOC_SURG_TAIL; + out->prefix = "%hi("; + out->suffix = ")"; + return 1; + case R_RV_LO12_I: + case R_RV_LO12_S: + out->surg = ARCH_RELOC_SURG_RV_LO12; + out->prefix = "%lo("; + out->suffix = ")"; + return 1; + case R_RV_BRANCH: + case R_RV_JAL: + out->surg = ARCH_RELOC_SURG_TAIL; + return 1; + default: + return 0; /* R_ABS*, R_RV_RVC_*, R_RV_RELAX, TLS, ... 
→ keep numeric */ + } +} + +/* Intra-section local branches whose target codegen resolved in place (no + * relocation): the disassembler renders the target numerically, so cc -S + * synthesizes a label there. `j`/`jal x0` are JAL aliases; the conditional + * branches are B-type. `call`/`tail` are excluded — they carry R_RV_CALL. */ +static int rv64_is_local_branch(CfreeSlice m) { + if (m.len == 1 && m.s[0] == 'j') return 1; + if (m.len == 3 && memcmp(m.s, "jal", 3) == 0) return 1; + if (m.len == 3 && memcmp(m.s, "beq", 3) == 0) return 1; + if (m.len == 3 && memcmp(m.s, "bne", 3) == 0) return 1; + if (m.len == 3 && memcmp(m.s, "blt", 3) == 0) return 1; + if (m.len == 3 && memcmp(m.s, "bge", 3) == 0) return 1; + if (m.len == 4 && memcmp(m.s, "bltu", 4) == 0) return 1; + if (m.len == 4 && memcmp(m.s, "bgeu", 4) == 0) return 1; + if (m.len == 4 && memcmp(m.s, "beqz", 4) == 0) return 1; + if (m.len == 4 && memcmp(m.s, "bnez", 4) == 0) return 1; + if (m.len == 4 && memcmp(m.s, "blez", 4) == 0) return 1; + if (m.len == 4 && memcmp(m.s, "bgez", 4) == 0) return 1; + if (m.len == 4 && memcmp(m.s, "bltz", 4) == 0) return 1; + if (m.len == 4 && memcmp(m.s, "bgtz", 4) == 0) return 1; + if (m.len == 6 && memcmp(m.s, "c.beqz", 6) == 0) return 1; + if (m.len == 6 && memcmp(m.s, "c.bnez", 6) == 0) return 1; + if (m.len == 3 && memcmp(m.s, "c.j", 3) == 0) return 1; + return 0; +} + +/* R_RV_CALL fuses an AUIPC+JALR pair into a single `call`/`tail sym` pseudo + * (the canonical `.s` spelling the assembler re-expands to the same pair + + * reloc). The reloc sits on the AUIPC; the JALR partner carries no reloc. A + * tail call links into x0 (the JALR's rd is `zero`); a regular call links into + * ra. We read that from the partner JALR's disassembled text. */ +static int rv64_reloc_call_pair(u16 kind, CfreeSlice pair_mnemonic, + CfreeSlice pair_ops, const char** mnemonic_out) { + if (kind != R_RV_CALL) return 0; + /* The partner JALR links into ra (regular call) or x0 (tail). 
The + * disassembler renders the x0-link, zero-immediate form as the `jr rs` + * alias, and the ra form as `jalr ra, 0(ra)`. So a `jr` partner is always a + * tail; a `jalr` partner is a tail iff its link register is `zero`. */ + if (pair_mnemonic.len == 2 && memcmp(pair_mnemonic.s, "jr", 2) == 0) { + *mnemonic_out = "tail"; + return 1; + } + if (pair_mnemonic.len == 4 && memcmp(pair_mnemonic.s, "jalr", 4) == 0) { + if (pair_ops.len >= 4 && memcmp(pair_ops.s, "zero", 4) == 0) + *mnemonic_out = "tail"; + else + *mnemonic_out = "call"; + return 1; + } + return 0; +} + +const ArchAsmOps rv64_asm_ops = { + .reloc_operand = rv64_reloc_operand, + .is_local_branch = rv64_is_local_branch, + .reloc_call_pair = rv64_reloc_call_pair, +}; + ArchAsm* rv64_arch_asm_new(Compiler* c) { Rv64Asm* a = arena_new(c->tu, Rv64Asm); memset(a, 0, sizeof *a); diff --git a/src/arch/x64/asm.c b/src/arch/x64/asm.c @@ -562,12 +562,17 @@ static __attribute__((unused)) void emit_reg_rm_twobyte( buf[n++] = opcode2; buf[n++] = x64_modrm(3u, dst, src.reg); } else { - n += x64_pack_rex(buf + n, width == 8u, dst, 0, src.base); + /* Route the full memory-operand variety (plain / SIB-indexed / RIP / + * segment) through the shared pack helpers so a SIB index register is + * preserved (e.g. `movzbl (%rcx,%rsi,1), %edx`). 
*/ + if (src.seg) buf[n++] = src.seg; + n += x64_pack_rex_mem_operand(buf + n, width == 8u, dst, src); buf[n++] = X64_OPC_TWOBYTE; buf[n++] = opcode2; - n += x64_pack_mem(buf + n, dst, src.base, src.disp); + n += x64_pack_mem_operand(buf + n, dst, src); } emit_packed(mc, buf, n); + if (src.kind == X64_ASM_OP_MEM) x64_emit_mem_reloc(d, mc, &src, 0); } /* ==================================================================== @@ -857,7 +862,14 @@ static void parse_alu_rr(X64ParseCtx* p) { src.imm, 1); return; } - if (imm_fits_i8(src.imm)) + /* Stack-pointer adjustments (`add/sub $imm, %rsp`, 64-bit) always use the + * imm32 form in codegen — the prologue and alloca patch a fixed-width + * placeholder, so they never shrink to imm8 even for a small frame. Match + * that here so `cc -S | as` reproduces codegen's bytes exactly; %rsp is a + * reserved register, so codegen never emits an imm8 ALU op against it. */ + if (dst.reg == X64_RSP && p->width == 8u && imm_fits_i32(src.imm)) + emit_alu_imm32(p->mc, 1, imm_row->modrm_reg, dst.reg, (i32)src.imm); + else if (imm_fits_i8(src.imm)) emit_alu_imm8(p->mc, width_to_w(p->width), imm_row->modrm_reg, dst.reg, (i8)src.imm); else if (imm_fits_i32(src.imm)) @@ -1059,8 +1071,11 @@ static void parse_movzx_movsx(X64ParseCtx* p) { dst = parse_operand(p->d); if (dst.kind != X64_ASM_OP_REG) asm_driver_panic(p->d, "x64 asm: movx dst register"); + /* REX.W follows the destination register width: `movsbq …, %rcx` (64-bit) + * needs REX.W; `movsbl …, %ecx` (32-bit) does not. The disassembler spells + * the q/l form from REX.W, so honoring dst width here round-trips it. */ emit_reg_rm_twobyte( - p->d, p->mc, 4u, p->desc->opc[1], dst.reg, src, + p->d, p->mc, dst.width == 8u ? 
8u : 4u, p->desc->opc[1], dst.reg, src, p->desc->opc[1] == X64_OPC_MOVZX_B || p->desc->opc[1] == X64_OPC_MOVSX_B, 0); } @@ -1234,25 +1249,31 @@ static void parse_sse_rr(X64ParseCtx* p) { expect_comma(p->d); dst = parse_operand(p->d); if (cvt_to_int) { + /* cvttsd2si/cvttss2si XMM/m -> GPR: REX.W follows the GPR destination + * width (`%rdx` = 64-bit, `%edx` = 32-bit), not the mnemonic — these rows + * carry no size suffix. */ if (dst.kind != X64_ASM_OP_REG) asm_driver_panic(p->d, "x64 asm: cvtt dst register"); + u32 gpr_w = dst.width == 8u ? 8u : 4u; if (src.kind == X64_ASM_OP_XMM) emit_sse_rr_w(p->mc, p->desc->leg_pfx, p->desc->opc[1], - width_to_w(p->width), dst.reg, src.reg); + width_to_w(gpr_w), dst.reg, src.reg); else if (src.kind == X64_ASM_OP_MEM) - emit_reg_rm_twobyte(p->d, p->mc, p->width, p->desc->opc[1], dst.reg, src, - 0, p->desc->leg_pfx); + emit_reg_rm_twobyte(p->d, p->mc, gpr_w, p->desc->opc[1], dst.reg, src, 0, + p->desc->leg_pfx); else asm_driver_panic(p->d, "x64 asm: cvtt source"); return; } if (cvt_from_int) { + /* cvtsi2sd/cvtsi2ss GPR/m -> XMM: REX.W follows the GPR source width. */ if (dst.kind != X64_ASM_OP_XMM) asm_driver_panic(p->d, "x64 asm: cvtsi dst xmm"); - if (src.kind == X64_ASM_OP_REG) + if (src.kind == X64_ASM_OP_REG) { + u32 gpr_w = src.width == 8u ? 8u : 4u; emit_sse_rr_w(p->mc, p->desc->leg_pfx, p->desc->opc[1], - width_to_w(p->width), dst.reg, src.reg); - else if (src.kind == X64_ASM_OP_MEM) + width_to_w(gpr_w), dst.reg, src.reg); + } else if (src.kind == X64_ASM_OP_MEM) emit_sse_load(p->mc, p->desc->leg_pfx, p->desc->opc[1], dst.reg, src.base, src.disp); else diff --git a/src/arch/x64/isa.c b/src/arch/x64/isa.c @@ -497,8 +497,11 @@ static int read_disp(const u8* bytes, u32 len, u32 off, u32 n, i32* out) { * `disp_out` and `base_out` describe what to print. 
*/ typedef struct DecodedMem { u32 base; + u32 index; /* SIB index register (valid when has_index) */ + u32 scale; /* SIB scale as the literal 1/2/4/8 (valid when has_index) */ i32 disp; int has_base; + int has_index; /* a SIB index register is present */ int rip_relative; u32 bytes_used; } DecodedMem; @@ -506,8 +509,11 @@ typedef struct DecodedMem { static u32 decode_mem(const u8* bytes, u32 len, u32 off, X64DecodeCtx ctx, u32 mod, u32 rm_low, DecodedMem* out) { out->base = 0; + out->index = 0; + out->scale = 1; out->disp = 0; out->has_base = 1; + out->has_index = 0; out->rip_relative = 0; out->bytes_used = 0; if (mod == 3u) return 0; /* caller handles reg-form */ @@ -516,10 +522,17 @@ static u32 decode_mem(const u8* bytes, u32 len, u32 off, X64DecodeCtx ctx, if (off >= len) return (u32)-1; u8 s = bytes[off]; u32 sib_base = (s & 7u) | ((u32)ctx.rex_b << 3); + u32 sib_index = ((s >> 3) & 7u) | ((u32)ctx.rex_x << 3); u32 used = 1; + /* SIB index = 4 (RSP) with REX.X=0 encodes "no index". */ + if (sib_index != 4u) { + out->has_index = 1; + out->index = sib_index; + out->scale = 1u << (s >> 6); + } if (mod == 0u && (s & 7u) == 5u) { - /* mod=00, base=101: disp32 with no base — treat as RIP-relative - * style (cfree uses this for label-table addressing). */ + /* mod=00, base=101: disp32 with no base — either a label-table + * disp32 (no index) or an indexed `[index*scale + disp32]`. */ i32 d = 0; if (!read_disp(bytes, len, off + used, 4, &d)) return (u32)-1; used += 4; @@ -579,9 +592,16 @@ static void put_mem(StrBuf* sb, const DecodedMem* m) { } if (m->rip_relative) { strbuf_puts(sb, "(%rip)"); - } else if (m->has_base) { + } else if (m->has_base || m->has_index) { + /* `(base)`, `(base,index,scale)`, or the base-less `(,index,scale)`. 
*/ strbuf_putc(sb, '('); - put_reg(sb, m->base, 8); + if (m->has_base) put_reg(sb, m->base, 8); + if (m->has_index) { + strbuf_putc(sb, ','); + put_reg(sb, m->index, 8); + strbuf_putc(sb, ','); + strbuf_put_i64(sb, (i64)m->scale); + } strbuf_putc(sb, ')'); } } @@ -975,6 +995,16 @@ static u32 print_xmm_rr(StrBuf* sb, const X64InsnDesc* d, const u8* bytes, put_rm(sb, &rr, *ctx, gp_w); return off + 1u + rr.bytes_after_modrm; } + /* Store-direction XMM moves (MOVSD/MOVSS/MOVUPS 0x11, MOVAPS 0x29): the + * reg-field xmm is the SOURCE and the r/m (memory or xmm) is the + * DESTINATION — AT&T order `reg_xmm, rm`. Without this the disassembler + * prints them in load order, so re-assembly flips the data direction. */ + if (op == 0x11u || op == 0x29u) { + put_xmm(sb, rr.reg); + strbuf_puts(sb, ", "); + put_rm_xmm(sb, &rr, *ctx); + return off + 1u + rr.bytes_after_modrm; + } { int dst_is_gp = (op == 0x2Cu); /* CVTTSD/SS2SI */ int src_is_gp = (op == 0x2Au || op == 0x6Eu); /* CVTSI2*, MOVD/Q g->x */ diff --git a/src/arch/x64/native.c b/src/arch/x64/native.c @@ -1459,43 +1459,41 @@ static u32 x64_build_prologue(X64NativeTarget* a, u8* buf, u32 cap, wr_u32_le(buf + wi, frame_size); wi += 4; } - /* sret: spill the first int arg reg (destination pointer) into its slot. */ + /* sret: spill the first int arg reg (destination pointer) into its slot. + * Use the minimal disp encoding (x64_pack_mem) so it matches the body's + * frame stores and the matching epilogue restore — the `cc -S | as` + * round-trip can then reproduce these bytes exactly. The -O0 placeholder is + * NOP-padded to a fixed width, so a shorter prologue is harmless. 
*/ if (a->has_sret && a->sret_ptr_slot != NATIVE_FRAME_SLOT_NONE) { X64NativeSlot* s = x64_slot_get(a, a->sret_ptr_slot); u32 sret_reg = a->abi->int_args[0]; i32 off = -(i32)s->off; - if (wi + 7u > cap) x64_panic(a, "prologue placeholder overflow"); + if (wi + 8u > cap) x64_panic(a, "prologue placeholder overflow"); buf[wi++] = (u8)(X64_REX_BASE | X64_REX_W | ((sret_reg & 8u) ? X64_REX_R : 0u)); buf[wi++] = X64_OPC_MOV_RM_R; - buf[wi++] = modrm(2u, sret_reg & 7u, X64_RBP); - wr_u32_le(buf + wi, (u32)off); - wi += 4; + wi += x64_pack_mem(buf + wi, sret_reg & 7u, X64_RBP, off); } /* Spill callee-saved GPRs. */ for (i = 0; i < n_int; ++i) { u32 reg = cs_int[i]; i32 off = -(i32)xmm_base - (i32)n_fp * 16 - (i32)(i + 1u) * 8; - if (wi + 7u > cap) x64_panic(a, "prologue placeholder overflow"); + if (wi + 8u > cap) x64_panic(a, "prologue placeholder overflow"); buf[wi++] = (u8)(X64_REX_BASE | X64_REX_W | ((reg & 8u) ? X64_REX_R : 0u)); buf[wi++] = X64_OPC_MOV_RM_R; - buf[wi++] = modrm(2u, reg & 7u, X64_RBP); - wr_u32_le(buf + wi, (u32)off); - wi += 4; + wi += x64_pack_mem(buf + wi, reg & 7u, X64_RBP, off); } - /* Spill callee-saved XMMs (Win64). movaps [rbp+disp32], xmm. */ + /* Spill callee-saved XMMs (Win64). movaps [rbp+disp], xmm. */ for (i = 0; i < n_fp; ++i) { u32 xmm = cs_fp[i]; i32 off = -(i32)xmm_base - (i32)(i + 1u) * 16; u8 rex = (u8)((xmm & 8u) ? (X64_REX_BASE | X64_REX_R) : 0u); - u32 need = rex ? 8u : 7u; + u32 need = rex ? 9u : 8u; if (wi + need > cap) x64_panic(a, "prologue placeholder overflow"); if (rex) buf[wi++] = rex; buf[wi++] = X64_OPC_TWOBYTE; buf[wi++] = 0x29; /* MOVAPS r/m128, xmm */ - buf[wi++] = modrm(2u, xmm & 7u, X64_RBP); - wr_u32_le(buf + wi, (u32)off); - wi += 4; + wi += x64_pack_mem(buf + wi, xmm & 7u, X64_RBP, off); } return wi; } @@ -2972,9 +2970,14 @@ static void x64_va_arg_core(X64NativeTarget* a, NativeLoc dst, NativeAddr ap, i8 stride = is_fp ? 
16 : 8; MCLabel L_stack = mc->label_new(mc); MCLabel L_done = mc->label_new(mc); - /* gp32 = ap[offs]; cmp gp32, max; jae L_stack. */ + /* gp32 = ap[offs]; cmp gp32, max; jae L_stack. Use the imm8 form when the + * threshold fits (gp_offset max 48) so the encoding is canonical and the + * `cc -S | as` round-trip reproduces it; fp_offset max 176 needs imm32. */ emit_mov_load(mc, 4, 0, gp, ap_base, (i32)offs_field); - emit_alu_imm32(mc, 0, X64_ALU_SUB_CMP, gp, (i32)max_offs); + if (imm_fits_i8((i64)max_offs)) + emit_alu_imm8(mc, 0, X64_ALU_SUB_CMP, gp, (i8)max_offs); + else + emit_alu_imm32(mc, 0, X64_ALU_SUB_CMP, gp, (i32)max_offs); emit_jcc_rel32(mc, X64_CC_AE, L_stack); /* reg path: ap[offs] += stride; gp = reg_save_area(ap[16]) + offset; load. * (The memory increment leaves gp holding the old offset.) */ @@ -3116,14 +3119,10 @@ static void x64_intrinsic(NativeTarget* t, IntrinKind kind, int w = x64_is_64(t, args[0].type) ? 1 : 0; u32 dr = loc_reg(dsts[0]); emit_bs(mc, w, 0xBD /* bsr */, dr, loc_reg(args[0])); - /* clz = (bits-1) - bsr, computed via xor with bits-1. */ - emit_rex(mc, w, 0, 0, dr); - { - u8 op = X64_OPC_ALU_IMM32; - mc->emit_bytes(mc, &op, 1); - } - emit_rm_reg(mc, X64_ALU_SUB_XOR, dr); - emit_u32le(mc, w ? 63u : 31u); + /* clz = (bits-1) - bsr, computed via xor with bits-1. The mask (31/63) + * fits in imm8, so use the compact 0x83 form to match the canonical + * encoding (and the assembler's `cc -S | as` round-trip). */ + emit_alu_imm8(mc, w, X64_ALU_SUB_XOR, dr, w ? 
63 : 31); return; } case INTRIN_BSWAP16: { diff --git a/src/asm/asm.c b/src/asm/asm.c @@ -214,14 +214,20 @@ static void promote_undef_externs(AsmDriver* d) { typedef struct AsmExpr { ObjSymId sym; i64 value; + u8 is_here; /* the location-counter token `.` (no sym, no value yet) */ + u8 pcrel; /* `sym - .`: emit a PC-relative data reloc instead of absolute */ } AsmExpr; static AsmExpr expr_c(i64 v) { - AsmExpr e = {OBJ_SYM_NONE, v}; + AsmExpr e = {OBJ_SYM_NONE, v, 0, 0}; return e; } static AsmExpr expr_s(ObjSymId s, i64 v) { - AsmExpr e = {s, v}; + AsmExpr e = {s, v, 0, 0}; + return e; +} +static AsmExpr expr_here(void) { + AsmExpr e = {OBJ_SYM_NONE, 0, 1, 0}; return e; } @@ -290,6 +296,11 @@ static AsmExpr parse_primary(AsmDriver* d) { (void)d_next(d); return e; } + /* Lone `.` is the location counter (used in `sym - .` PC-relative data). */ + if (tok_is_punct(t, '.')) { + (void)d_next(d); + return expr_here(); + } d_panicf(d, "asm: expected expression"); } @@ -325,7 +336,8 @@ static AsmExpr parse_mul(AsmDriver* d) { u32 op = t.v.punct; (void)d_next(d); AsmExpr b = parse_unary(d); - if (a.sym || b.sym) d_panicf(d, "asm: '*/%%' on symbolic operand"); + if (a.sym || b.sym || a.is_here || b.is_here) + d_panicf(d, "asm: '*/%%' on symbolic operand"); if (op == '*') a.value *= b.value; else if (op == '/') { @@ -346,6 +358,16 @@ static AsmExpr parse_add(AsmDriver* d) { u32 op = t.v.punct; (void)d_next(d); AsmExpr b = parse_mul(d); + /* `sym - .`: a PC-relative data reference. `.` is the location of the + * field being emitted, so the relocation's P equals its own offset and the + * RELA addend stays `a.value` (typically 0). */ + if (op == '-' && b.is_here) { + if (!a.sym) d_panicf(d, "asm: '- .' requires a symbol operand"); + a.pcrel = 1; + continue; + } + if (a.is_here || b.is_here) + d_panicf(d, "asm: '.' 
location counter only valid as `sym - .`"); if (op == '+') { if (a.sym && b.sym) d_panicf(d, "asm: cannot add two symbols"); if (b.sym) { @@ -618,7 +640,16 @@ static void emit_int_directive(AsmDriver* d, u32 width) { AsmExpr e = parse_expr(d); if (e.sym) { RelocKind k; - if (width == 4) + if (e.pcrel) { + /* `sym - .`: PC-relative data. Only the 32/64-bit widths codegen + * emits via cfree_cg_data_pcrel are supported. */ + if (width == 4) + k = R_PC32; + else if (width == 8) + k = R_PC64; + else + d_panicf(d, "asm: PC-relative `sym - .` needs .long/.quad"); + } else if (width == 4) k = R_ABS32; else if (width == 8) k = R_ABS64; @@ -670,26 +701,47 @@ static void do_directive(AsmDriver* d, Sym name) { if (sym_eq(d, name, "section")) { Sym sname = 0; AsmTok t = d_peek(d); - if (t.kind == ASM_TOK_IDENT) { - sname = t.v.ident; - (void)d_next(d); - } else if (t.kind == ASM_TOK_STR) { + if (t.kind == ASM_TOK_STR) { size_t n = 0; const char* p = asm_str(d, t.spelling, &n); if (n >= 2 && p[0] == '"') sname = pool_intern_slice(d->pool, (Slice){.s = p + 1, .len = n - 2}); (void)d_next(d); - } else if (tok_is_punct(t, '.')) { - (void)d_next(d); + } else if (t.kind == ASM_TOK_IDENT || tok_is_punct(t, '.')) { + /* A bare section name. The lexer breaks a dotted name like + * `.rodata.toy.merge` into PUNCT('.')+IDENT segments (the `.`+digit + * identifier rule does not glue `.`+letter), so reassemble the full + * dotted spelling by consuming each adjacent `.segment`. Stops at the + * `, "flags"` operands (the next token is then a comma). 
*/ + char buf[128]; + size_t bn = 0; + int leading_dot = tok_is_punct(t, '.'); + if (leading_dot) { + (void)d_next(d); + buf[bn++] = '.'; + } AsmTok id = d_next(d); if (id.kind != ASM_TOK_IDENT) d_panicf(d, "asm: .section: bad name"); - size_t ni = 0; - const char* nm = asm_str(d, id.v.ident, &ni); - char buf[128]; - if (ni + 1 >= sizeof buf) d_panicf(d, "asm: .section: name too long"); - buf[0] = '.'; - for (size_t i = 0; i < ni; ++i) buf[i + 1] = nm[i]; - sname = pool_intern_slice(d->pool, (Slice){.s = buf, .len = ni + 1}); + for (;;) { + size_t ni = 0; + const char* nm = asm_str(d, id.spelling, &ni); + if (bn + ni >= sizeof buf) d_panicf(d, "asm: .section: name too long"); + for (size_t i = 0; i < ni; ++i) buf[bn++] = nm[i]; + /* Glue a following `.<ident>` (or `.<num>`) segment, no whitespace. */ + if (!tok_is_punct(d_peek(d), '.')) break; + (void)d_next(d); /* '.' */ + AsmTok seg = d_peek(d); + if (seg.kind != ASM_TOK_IDENT && seg.kind != ASM_TOK_NUM) { + /* A lone trailing '.' is not part of the name; put it back is not + * supported, but section names never end in '.', so this is a + * malformed directive. 
*/ + d_panicf(d, "asm: .section: bad name"); + } + if (bn + 1 >= sizeof buf) d_panicf(d, "asm: .section: name too long"); + buf[bn++] = '.'; + id = d_next(d); + } + sname = pool_intern_slice(d->pool, (Slice){.s = buf, .len = bn}); } else { d_panicf(d, "asm: .section: expected name"); } diff --git a/test/asm/encode/rv64_fp.expected.hex b/test/asm/encode/rv64_fp.expected.hex @@ -1 +1 @@ -53f5c500d376f70a53f02010d371521a5385c5285394242b53a5c5a0d392e6a2530505c0530525d0d30200e0530505f2072501002734b100 +53f5c500d376f70a53f02010d371521a5385c5285394242b53a5c5a0d392e6a2531505c0537525d0d30200e0530505f2072501002734b100 diff --git a/test/asm/encode/rv64_fp_cvt.expected.hex b/test/asm/encode/rv64_fp_cvt.expected.hex @@ -1 +1 @@ -530505c0d38515c0530626c0d38636c0530707c2d38727c2530505d0d38515d0530606d2d38626d25387174053880842538505585386065a +531505c0d39515c0531626c0d39636c0531707c2d39727c2537505d0d3f515d0537606d2d3f626d253f7174053f8084253f5055853f6065a diff --git a/test/asm/hostas_cross.sh b/test/asm/hostas_cross.sh @@ -30,22 +30,20 @@ # host supports and self-extends as gaps close. Status at time of writing: # - aarch64-linux: works end-to-end (podman runs arm64 natively in its VM). # This is the gating default (312/312 both lanes). -# - x86_64-linux: cc -S re-assembles for the whole corpus (312/312) via both -# cfree-as and clang, but cross-EXEC is 272/312: ~23 cases -# (switch/jump tables, global/array/fp data, varargs) lose -# fidelity in the x64 cc -S data round-trip — confirmed cc -S -# infidelity (direct `cc -c` executes correctly). Opt-in until -# that backlog closes. See doc/ASM_ROUNDTRIP_TESTING.md. -# - riscv64-linux: assembles, but cross-EXEC hangs — the rv64 cc -S round-trip -# is unfaithful because rv64 has no symbolizer (ArchAsmOps): -# the call emits `auipc ra,0x0; jalr ra,0(ra)` with the -# R_RISCV_CALL reloc unsymbolized, so it calls itself (and -# branches like `j 0x90` keep numeric targets). 
NOT an -# emulation issue — a minimal clang rv64 static exe and the -# DIRECT cfree `cc -c` object both run correctly under the -# same qemu-riscv64. Needs an rv64 ArchAsmOps (is_local_branch -# for j/beq/...; reloc_operand for %pcrel_hi/%pcrel_lo/call). -# The bounded exec smoke SKIPS it until then. +# - x86_64-linux: `cc -S | cfree as` round-trips and CROSS-EXECS the whole +# corpus correctly (cfree-as 312/312). The clang lane has a +# small residue (~11 efail: cfree emits AT&T text clang +# rejects). Opt-in: the global clang gate (ENFORCE_CLANG=1) +# would fail on that residue, so x64 isn't in the gating +# default yet — run with CFREE_HOSTAS_CROSS_TARGETS, optionally +# CFREE_HOSTAS_ENFORCE_CLANG=0. +# - riscv64-linux: `cc -S | cfree as` round-trips and CROSS-EXECS correctly +# (cfree-as 312/312) — the rv64 symbolizer (ArchAsmOps with +# %pcrel_hi/%pcrel_lo anchor pairing + AUIPC/JALR call fusion) +# landed; the earlier self-call hang is gone. The clang lane +# has a larger residue (~58 efail: rv64 data-symbolization +# syntax + bare-fcvt rounding-mode that clang encodes +# differently). Opt-in, same as x64. # # Override the matrix with CFREE_HOSTAS_CROSS_TARGETS="tag:triple ..." and the # clang-as gate with CFREE_HOSTAS_ENFORCE_CLANG=0 (demote lane B to XFAIL).
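The SIB support added to decode_mem/put_mem in src/arch/x64/isa.c follows the standard x86-64 SIB byte layout:

- scale is in bits 7-6;
- index is in bits 5-3, extended by REX.X;
- base is in bits 2-0, extended by REX.B;
- an index field of 100b with REX.X clear means "no index".

Below is a standalone sketch of that decode and of the AT&T operand spelling it feeds. `SibFields`, `decode_sib` and `format_mem` are hypothetical names, not cfree's. Displacement, segment and RIP-relative printing are left out.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch: fields recovered from an x86-64 SIB byte. */
typedef struct {
  unsigned base;  /* 0..15: SIB.base, bit 3 taken from REX.B */
  unsigned index; /* 0..15: SIB.index, bit 3 taken from REX.X */
  unsigned scale; /* literal 1, 2, 4 or 8 */
  int has_index;  /* 0 when the index field encodes "no index" */
} SibFields;

static SibFields decode_sib(uint8_t sib, int rex_x, int rex_b) {
  SibFields f;
  f.base = (sib & 7u) | ((unsigned)(rex_b != 0) << 3);
  f.index = ((sib >> 3) & 7u) | ((unsigned)(rex_x != 0) << 3);
  /* Index 100b with REX.X clear means "no index"; with REX.X set it is %r12. */
  f.has_index = f.index != 4u;
  f.scale = f.has_index ? 1u << (sib >> 6) : 1u;
  return f;
}

static const char *const kReg64[16] = {
    "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
    "r8",  "r9",  "r10", "r11", "r12", "r13", "r14", "r15"};

/* AT&T spelling of the parenthesized part: `(base)`, `(base,index,scale)`,
 * or the base-less `(,index,scale)` used when mod=00 and SIB.base=101. */
static int format_mem(char *out, size_t cap, const SibFields *f, int has_base) {
  if (f->has_index && has_base)
    return snprintf(out, cap, "(%%%s,%%%s,%u)", kReg64[f->base],
                    kReg64[f->index], f->scale);
  if (f->has_index)
    return snprintf(out, cap, "(,%%%s,%u)", kReg64[f->index], f->scale);
  if (has_base) return snprintf(out, cap, "(%%%s)", kReg64[f->base]);
  if (cap > 0) out[0] = '\0'; /* bare disp32: nothing in parentheses */
  return 0;
}
```

As a usage example, take `movzbl (%rcx,%rsi,1), %edx`, which encodes as `0f b6 14 31`. Its SIB byte 0x31 formats back to `(%rcx,%rsi,1)`. The common `(%rsp)` SIB byte 0x24 decodes with no index, unless REX.X is set, in which case the index is %r12.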
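In src/asm/asm.c, the operand `sym - .` sets AsmExpr.pcrel, and emit_int_directive then emits R_PC32 for `.long` or R_PC64 for `.quad`. The usual ELF definition of these relocations is S + A - P. Here P is the address of the field itself, which is exactly what `.` denotes at that point, so the addend stays as written.

Below is a sketch of both halves, the directive-side choice and the value a linker stores and a reader recovers. It assumes illustrative addresses, and the tags and helpers (`PcRelTag`, `pcrel_data_reloc`, `store_pc32`, `load_pc32_target`) are hypothetical. It models only the "delta from the field's own address" form that `sym - .` produces.

```c
#include <stdint.h>

/* Hypothetical relocation tags for this sketch; cfree's RelocKind differs. */
typedef enum { RK_NONE, RK_PC32, RK_PC64 } PcRelTag;

/* `sym - .` in a data directive: only .long (4) and .quad (8) are accepted. */
static PcRelTag pcrel_data_reloc(unsigned width) {
  if (width == 4) return RK_PC32;
  if (width == 8) return RK_PC64;
  return RK_NONE;
}

/* Linker side: the 32-bit field holds S + A - P, where P is the field's own
 * address. Returns 0 if the value does not fit in 32 bits. */
static int store_pc32(int64_t S, int64_t A, int64_t P, int32_t *field) {
  int64_t v = S + A - P;
  if (v < INT32_MIN || v > INT32_MAX) return 0;
  *field = (int32_t)v;
  return 1;
}

/* Reader side (e.g. a dispatcher walking such a table): add the stored delta
 * back to the field's address to recover the target. */
static int64_t load_pc32_target(int64_t P, int32_t field) { return P + field; }
```

As a usage example, consider a field at 0x1000 that refers to 0x1040. It stores 0x40, and adding 0x40 back to the field's address yields 0x1040 again. Because the stored value is relative, the data stays position-independent.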
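rv64_reloc_operand sets emit_anchor on R_RV_PCREL_HI20 and ref_anchor on R_RV_PCREL_LO12_I/_S. The low half of a RISC-V PC-relative pair is derived from the offset measured from the AUIPC's own address, not from the address of the instruction that consumes it. That is why `%pcrel_lo(...)` must name a label on the AUIPC rather than the target symbol.

Below is a sketch of the hi/lo split the RISC-V psABI defines for this pair. The helper names are hypothetical, and offsets are assumed to stay within the ±2 GiB AUIPC range.

```c
#include <stdint.h>

/* Floor division by 4096 that is portable for negative values. */
static int64_t floor_div_4096(int64_t v) {
  return v >= 0 ? v / 4096 : -((-v + 4095) / 4096);
}

/* %pcrel_hi: the upper 20 bits for AUIPC. Adding 0x800 first rounds so that
 * the sign-extended 12-bit low part lands exactly on the offset. */
static int64_t pcrel_hi20(int64_t off) { return floor_div_4096(off + 0x800); }

/* %pcrel_lo: the remainder, always in [-2048, 2047]. It is a function of the
 * AUIPC's offset, hence the anchor label on the AUIPC. */
static int64_t pcrel_lo12(int64_t off) { return off - pcrel_hi20(off) * 4096; }

/* What the AUIPC + ADDI/LD pair computes at run time. */
static int64_t materialize(int64_t auipc_pc, int64_t hi20, int64_t lo12) {
  return auipc_pc + hi20 * 4096 + lo12;
}
```

As a usage example, an offset of 0x12345FFF splits into hi 0x12346 and lo -1. The rounding pushes hi up by one, and the negative lo12 then pulls the sum back down to the exact offset.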
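Several x64 hunks in this commit turn on one encoding choice: the sign-extended imm8 ALU form (opcode 0x83) versus the imm32 form (opcode 0x81). Codegen and the assembler must agree on that choice for `cc -S | as` to reproduce bytes exactly. The hunks involved are:

- the va_arg `cmp` against 48/176;
- the clz `xor` with 31/63;
- the parse_alu_rr rule that keeps `add/sub $imm, %rsp` at imm32.

Below is a sketch of that selection for the register-direct form. `encode_alu_imm` is a hypothetical name, `force_imm32` models the %rsp rule, and the short EAX/RAX-specific opcodes are deliberately not used.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode `<alu> $imm, %reg` (register form). `ext` is the ModRM.reg opcode
 * extension, e.g. 5 = SUB, 6 = XOR, 7 = CMP. Returns the byte count. */
static size_t encode_alu_imm(uint8_t *buf, int w64, unsigned ext, unsigned reg,
                             int32_t imm, int force_imm32) {
  size_t n = 0;
  /* REX: W for 64-bit operand size, B to reach %r8..%r15. */
  if (w64 || reg >= 8)
    buf[n++] = (uint8_t)(0x40u | (w64 ? 0x08u : 0u) | (reg >= 8 ? 0x01u : 0u));
  int short_form = !force_imm32 && imm >= -128 && imm <= 127;
  buf[n++] = short_form ? 0x83 : 0x81;
  buf[n++] = (uint8_t)(0xC0u | ((ext & 7u) << 3) | (reg & 7u)); /* mod=11 */
  if (short_form) {
    buf[n++] = (uint8_t)(int8_t)imm; /* imm8, sign-extended by the CPU */
  } else {
    uint32_t u = (uint32_t)imm; /* imm32, little-endian */
    for (int i = 0; i < 4; ++i) buf[n++] = (uint8_t)(u >> (8 * i));
  }
  return n;
}
```

With this rule, `cmp $48, %eax` fits imm8 and encodes as `83 f8 30`. `cmp $176, %eax` does not fit imm8, because 176 > 127, so it takes the 6-byte imm32 form. `sub $16, %rsp` stays at the 7-byte `48 81 ec 10 00 00 00` when forced.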
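The native.c prologue change replaces the hand-rolled mod=10 ModRM plus disp32 with x64_pack_mem. As a result, a spill such as `mov %rbx, -8(%rbp)` uses the 1-byte displacement form that the assembler also picks for the same text.

Below is a sketch of that minimal-displacement ModRM selection. `pack_mem_min_disp` is a hypothetical name, not the signature of cfree's x64_pack_mem. Bases that need a SIB byte (%rsp, %r12) are left out, and the REX prefix is left to the caller.

```c
#include <stddef.h>
#include <stdint.h>

/* Write ModRM (+ displacement) for `[base + disp]` with `reg` in ModRM.reg,
 * using the shortest displacement. Returns the byte count, or 0 for bases
 * whose low 3 bits are 100b (they require a SIB byte). */
static size_t pack_mem_min_disp(uint8_t *buf, unsigned reg, unsigned base,
                                int32_t disp) {
  unsigned rm = base & 7u, r = (reg & 7u) << 3;
  size_t n = 0;
  if (rm == 4u) return 0;
  if (disp == 0 && rm != 5u) {
    /* mod=00, no displacement. Not allowed for rm=101 (%rbp/%r13): that
     * combination means RIP-relative in 64-bit mode. */
    buf[n++] = (uint8_t)(r | rm);
  } else if (disp >= -128 && disp <= 127) {
    buf[n++] = (uint8_t)(0x40u | r | rm); /* mod=01: disp8 */
    buf[n++] = (uint8_t)(int8_t)disp;
  } else {
    buf[n++] = (uint8_t)(0x80u | r | rm); /* mod=10: disp32, little-endian */
    uint32_t u = (uint32_t)disp;
    for (int i = 0; i < 4; ++i) buf[n++] = (uint8_t)(u >> (8 * i));
  }
  return n;
}
```

As a usage example, `mov %rbx, -8(%rbp)` (`48 89 5d f8`) gets ModRM 0x5d plus disp8 0xf8. A frame offset of -200 falls back to the 4-byte displacement.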