commit 22b8a80d7271c883acfddcfc5948b058c2c8e716
parent 89ec3480b48b5f7c95293b8df207700f11b3e7f0
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Fri, 29 May 2026 20:23:50 -0700
doc+build: round-trip covers the full core op set; wire L0+L1 into default test

Mark P2 branch relaxation, .L locals, data-section symbolization, FP/SIMD
load/store, and exclusive-atomic decode as done; record the full-core-op-set
coverage (852 lane-checks, 1 skip) and re-scope the remaining follow-ups
(assembler .bss NOBITS, FP reg-offset/q decode, section-symbol/TLS
symbolization, other arches, llvm differential).

Add test-asm-roundtrip (L0+L1, host-independent) to DEFAULT_TEST_TARGETS so
the round-trip runs in the default suite; test-asm-roundtrip-exec (L2) stays
opt-in (native arch).
Diffstat:
2 files changed, 64 insertions(+), 20 deletions(-)
diff --git a/doc/ASM_ROUNDTRIP_TESTING.md b/doc/ASM_ROUNDTRIP_TESTING.md
@@ -6,13 +6,47 @@ round-tripping the **compiler's own output** rather than only a hand-written
corpus. The corpus (`test/asm/`) only tests instructions we thought to write
down; codegen output tests every instruction codegen actually emits.
-Status: plan + aa64 vertical slice landed (2026-05-29). Prereqs from the
-native-arch asm work are in (see `doc/NATIVE_ARCH_COMPLETENESS.md`): the
-assembler now parses the full relocation-operator syntax on all three arches,
-which is exactly what the `-S` symbolizer (Phase 2) must *emit*.
+Status: aa64 slice now covers the **full core op set** (2026-05-29). The corpus
+exercises every CG operation family — int/fp arith, bitwise, shifts, compares,
+unary, conversions (incl. bitcast), loads/stores of every width, control flow,
+switch (compare-chain + jump-table), indirect/recursive/stack-arg calls,
+aggregates (struct by-val/by-ref, bitfields, unions), globals/static-locals, and
+atomics (RMW / compare-exchange via the exclusive-monitor sequence) — and the
+round-trip passes all three lanes at `-O0` and `-O1`: **852 lane-checks pass,
+1 skip** (the lone skip, `glob_bss_write`, is the assembler `.bss`-NOBITS
+follow-up below). L0+L1 are wired into the default `make test` via
+`test-asm-roundtrip`; L2 stays opt-in (`test-asm-roundtrip-exec`, native arch).
### Implemented so far (aa64)
+- **P2 — same-section branch relaxation (DONE).** At `asm_parse` finalize the
+ assembler resolves branch relocations (JUMP26/CONDBR19/TSTBR14, never CALL26)
+ whose target is a defined local non-function symbol in the same section —
+ patching the displacement via `link_reloc_apply` with section-relative S/P and
+ dropping the reloc, matching codegen/GNU-as. L1 now covers control-flow-bearing
+ code (was auto-skipped). `src/asm/asm.c:relax_local_branches`.
+- **`.L` local symbols + data-section symbolization (DONE).** The assembler lexer
+ accepts `.L`-prefixed locals (incl. embedded dots, `.Lcfree_ro.0`) and the
+ `name.N` discriminator mangling (`acc.1`) as identifiers; the `-S` symbolizer
+ emits `.L` operands instead of numeric fallback. `emit_data_range` renders
+ relocated data as `.quad/.word sym+addend` (the inverse of the assembler's
+ `.quad`), so switch jump tables (R_ABS64 against the function) and global
+ pointer tables round-trip. L1 compares relocs across `.text/.rodata/.data`.
+- **FP/SIMD scalar load/store + unscaled ld/st family (DONE).** `p_ldst_core` /
+ `p_ldur_stur` now encode FP transfer registers (Bt..Qt, V=1) and the full
+ unscaled family (`sturb`/`ldurb`/`sturh`/`ldurh`/`ldursb`/`ldursh`/`ldursw`);
+ the disassembler decodes the signed unscaled loads (keying Wt/Xt on opc). This
+ unblocked every FP spill and conversion case.
+- **Exclusive / acquire-release atomic decode (DONE).** The assembler already
+ encoded `ldxr`/`ldaxr`/`stxr`/`stlxr`/`ldar`/`stlr` (+ b/h), but the
+ disassembler rendered them `.inst`, so the atomic RMW sequence codegen emits
+ for `_Atomic` was dropped by `cc -S`. Added `AA64_FMT_LDST_EXCL` +
+ `print_ldst_excl` and the matching decode rows. Found by an adversarial sweep
+ (atomics were the one core-op family the corpus fan-out missed); now
+ `roundtrip/atomic_{rmw,cas,ops}`.
+
+### Earlier vertical-slice notes (aa64)
+
- **L0 decode-completeness** — `cc -S` already emits the distinct, re-assemblable
marker `.inst 0x<word>` for an undecodable word (only `aa64_write_unknown`
produces it), so the gate is "no `.inst` inside .text". No emitter change was
@@ -38,25 +72,34 @@ which is exactly what the `-S` symbolizer (Phase 2) must *emit*.
### Remaining (tracked here)
-- **P2 — assembler same-section branch relaxation (gates L1 for branchy code).**
- Codegen resolves intra-function branches locally (no reloc); the assembler
- emits a JUMP26/CONDBR19 reloc against the (local) label instead. So L1's
- reloc-table comparison diverges for any function with control flow, and the
- L1 lane auto-skips cases whose `-S` contains an `Lcf_` label. Fix: at
- assembler finalize, for a branch reloc whose target symbol is defined in the
- same section, compute the displacement, patch the instruction field (reuse
- `link_reloc_apply`), and drop the reloc — matching GNU as / llvm-mc. Then L1
- covers control flow too. (L0 and L2 already do.)
+- **Assembler `.bss` is PROGBITS, not NOBITS (the one corpus skip).** `cc -S`
+ renders a zero-init global as `.section .bss` + `.zero N`; `as` writes real
+ zero bytes and tracks position by byte count, so the symbol lands at offset 0
+ and the section emits `SHT_PROGBITS`. The round-tripped `.bss` then loads
+ read-only in the JIT image and a store faults (L0/L1 pass, L2 aborts —
+ `roundtrip/glob_bss_write`). Fix: NOBITS position-tracking in the assembler —
+ a `SEC_BSS`/`SSEM_NOBITS` section's symbol offsets and `.zero`/`.skip`/`.align`
+ must advance `bss_size` instead of `obj_write`ing bytes (the obj layer already
+ treats `SEC_BSS` specially in `obj_align_to`; `obj_pos`/`m_emit_fill`/
+ `process_label` need the matching NOBITS path). `glob_rw` covers the
+ global-write path via a `.data` global meanwhile.
+- **FP register-offset + 128-bit `q` decode.** The assembler now *encodes* FP
+ register-offset (`str d0,[x,x,lsl#3]`) and `q` ldr/str, but the disassembler
+ decodes neither (renders `.inst`). Codegen emits neither for scalar C (FP
+ array indexing computes the address in a GPR first), so the round-trip never
+ hits them; add the decode rows if a NEON/vector path later emits them.
- **`.inst` is dropped by `as`** — `cfree as` accepts the `.inst` directive but
emits no bytes for it, so an undecoded word would not round-trip at L1 (L0
still flags it). `as` should emit the word (or error).
-- **Section-relative + TLS reloc symbolization** — `build_symref` skips
- `.`-prefixed (section/local) symbol names; string-literal/static-local data
- refs and TLS kinds fall back to numeric. Extend once `as` accepts those.
-- **Other arches** — the symbolizer switches on aa64 reloc kinds; x64/rv64 keep
- the numeric `-S` output. Broaden per the RelocKind→syntax tables below.
-- **Default suite + differential** — wire L0/L1 into the default `make test`
- once the corpus is broad; add the llvm-mc / llvm-objdump differential lanes.
+- **Section-relative + TLS reloc symbolization** — `build_symref` accepts `.L`
+ locals but still skips bare section symbols (`.text`) and TLS kinds, which
+ fall back to numeric. Extend once `as` accepts those operands.
+- **Other arches** — the symbolizer switches on aa64 reloc kinds, and the
+ branch-relaxation predicate lists only the aa64 branch kinds; x64/rv64 keep
+ the numeric `-S` output and current `as` behavior. Broaden per the
+ RelocKind→syntax tables below.
+- **Differential** — add the llvm-mc / llvm-objdump differential lanes over the
+ same `-S` output as a second-oracle cross-check.
## Background — what cfree can do today (verified)
diff --git a/test/test.mk b/test/test.mk
@@ -110,6 +110,7 @@ DEFAULT_TEST_TARGETS = \
test-asm \
test-asm-x64 \
test-asm-rv64 \
+ test-asm-roundtrip \
test-isa \
test-aa64-inline \
test-rv64-inline \