doc: plan for asm/disasm completeness via codegen round-trip testing - kit

commit d12baa0e9445f2d0c31f0bdc7a422319296e037f
parent ba3ebe2a0a399273258207688f862f1ea92a5e0f
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Fri, 29 May 2026 17:03:10 -0700

doc: plan for asm/disasm completeness via codegen round-trip testing

Three-layer strategy (L0 decode-completeness, L1 byte round-trip, L2 exec
equivalence) that round-trips the compiler's own -S output instead of only the
hand-written corpus. Records the verified current state (cc -S is disasm-to-text
sharing the disassembler; run/emu give host-independent cross-arch exec), the
blocker (-S emits a non-re-assemblable listing — numeric branches, de-symbolized
relocs), the trap (.byte fallback masks decode gaps in a run-only round-trip),
and the keystone Phase 2: symbolize -S using the reloc-operator syntax the
assembler now parses. Includes the RelocKind->operand-syntax mapping per arch
and the harness/file map.

Diffstat:
A doc/ASM_ROUNDTRIP_TESTING.md  | 205 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

1 file changed, 205 insertions(+), 0 deletions(-)
diff --git a/doc/ASM_ROUNDTRIP_TESTING.md b/doc/ASM_ROUNDTRIP_TESTING.md
@@ -0,0 +1,205 @@
+# Asm/disasm completeness via codegen round-trip testing
+
+Goal: measure and lock in **completeness** of the per-arch assembler (`as`),
+disassembler (`objdump -d` / `cc -S`), and the link relocation path, by
+round-tripping the **compiler's own output** rather than only a hand-written
+corpus. The corpus (`test/asm/`) only tests instructions we thought to write
+down; codegen output tests every instruction codegen actually emits.
+
+Status: plan only (2026-05-29). Prereqs from the native-arch asm work are in
+(see `doc/NATIVE_ARCH_COMPLETENESS.md`): the assembler now parses the full
+relocation-operator syntax on all three arches, which is exactly what the
+`-S` symbolizer (Phase 2) must *emit*.
+
+## Background — what cfree can do today (verified)
+
+- **`cc -S` exists** and is *disassembly-to-text plus module scaffolding*:
+  `driver/cc.c` (`emit_asm_source`, ~line 1116) → `cfree_obj_builder_emit_asm`
+  (`src/api/asm_emit.c:256`). It walks each section, emits labels for **symbols**
+  (`collect_labels`, `asm_emit.c:68`), and disassembles `.text` via
+  `emit_disasm_range` (`asm_emit.c:215`) using `arch_disasm_decode`. So `-S`
+  and `objdump -d` share the **same disassembler** — they are one decode
+  surface.
+- **Cross-arch execution needs no qemu/podman**: `cfree run` (in-process JIT,
+  `driver/run.c`) and `cfree emu` (user-mode ELF emulator, `driver/emu.c`) both
+  run guest code on the host. The asm harness already has JIT/exec paths
+  (`test/asm/run.sh`, the `D`/`J`/`E` lanes; `link-exe-runner`, `jit-runner`).
+- **The assembler accepts the reloc-operator syntax** (this is the key enabler):
+  aa64 `:lo12:`/`:got:`/`:got_lo12:`, rv64 `%hi`/`%lo`/`%pcrel_hi`/`%pcrel_lo`,
+  x64 `sym(%rip)`/`@PLT`/`@GOTPCREL` — see `src/arch/{aa64,x64,rv64}/asm.c`.
+
+### Three gotchas the design must handle
+
+1. **`-S` is a listing, not re-assemblable assembly (the blocker).** Today it
+   emits **numeric** branch targets (`b 0x100`) and **de-symbolized**
+   relocated operands (`bl 0x11c` instead of `bl add`; `adrp x16, 0x0` + `ldr
+   [x16]` instead of `adrp x16, g` + `ldr [x16, :lo12:g]`). Re-assembling that
+   branches to the wrong place and loads from address 0. **L1/L2 below are
+   blocked until `-S` symbolizes (Phase 2).**
+
+2. **The `.byte` fallback masks decode gaps (the trap).** When the disassembler
+   can't decode a word, `emit_disasm_range` emits `.byte 0x..` (`asm_emit.c`
+   ~227). Re-assembling a `.byte` reproduces the exact original bytes — so a
+   *run-only* round-trip **passes even when disasm is incomplete**. A decode
+   (L0) or byte/reloc (L1) check must pass before an exec round-trip is trusted.
+
+3. **Padding/data is indistinguishable from a decode failure today.**
+   Inter-function alignment fill is emitted as the *same* `.byte 0x0` token as a
+   real decode failure (observed: x64 zero-pads between functions; aa64/rv64
+   nop-pad, so they show none). The completeness metric must separate "byte the
+   disassembler failed on, inside a function" from "padding/data outside any function".
+
+## The three layers (build cheapest/sharpest first)
+
+| Layer | Check | Catches | Cost | Needs Phase 2 |
+|-------|-------|---------|------|---------------|
+| **L0 decode completeness** | over a program corpus, assert **no in-function decode failure** | disasm can't decode an insn codegen emits | cheap, no exec, pinpoints the word | no |
+| **L1 byte round-trip** | `cc -c` vs `cc -S \| as`; diff `.text` bytes **and** reloc tables | asm⊗disasm disagreements (round-trip violations) | cheap, exact | yes |
+| **L2 exec equivalence** | `cc` direct vs `cc -S \| as \| ld`, run + compare output/exit | semantic bugs; tolerant of benign encoding diffs | exec via `run`/`emu` | yes |
+
+L0 measures the disassembler against codegen. L1 measures asm⊗disasm
+agreement (the exact "round-trip violation" framing of the completeness doc).
+L2 is the end-to-end "it actually runs the same" signal.
+
+## Phase 1 — L0 decode-completeness gate (unblocked, do first)
+
+The single cheapest high-value win; implementable with no new features.
+
+- [ ] Make the decode-failure signal **unambiguous**. Pick one:
+  - (preferred) In `emit_disasm_range`, emit padding/data outside function
+    symbol ranges as `.zero N` / `.p2align`, and emit a *genuine in-stream
+    decode failure* as a distinct token — `.inst 0x<word>` (a real aarch64
+    directive) or `.byte … # UNDECODED`. Then "grep for the marker" is exact.
+  - (alt) Bound the L0 scan to `[sym.value, sym.value+sym.size)` function
+    ranges using the emitted `.size` directives, counting only in-range
+    `.byte`/`.inst`.
+- [ ] Curate an L0 corpus that stresses instruction families codegen emits:
+  int/long arith + shifts/bitops, `float`/`double` (FP/SIMD — most likely gap),
+  `switch` (jump tables), structs/by-ref, loops/memory, `cmp`/`cset`/`cmov`,
+  calls + global/TLS access, varargs, `asm()` inline. Reuse `test/toy` and a
+  small C set. Run at `-O0` and `-O1` (different encodings).
+- [ ] For massive free coverage, also run L0 over the **bootstrap objects**
+  (cfree compiling cfree — see `doc/`/`bootstrap-*` targets); it exercises a
+  huge slice of the ISA.
+- [ ] New target (e.g. `test-disasm-complete`): for each arch in {aa64,x64,rv64},
+  `cfree cc -S -target <triple> <src>` and assert zero in-function decode
+  markers. Wire into the default suite (no exec needed — host-independent).
+
+Prototype already run (rich.c with `double`/`float`/`switch`/bitops): aa64 and
+rv64 disassembled clean; x64 only showed inter-function zero padding (a false
+positive that motivates the "unambiguous signal" item above). A genuine gap —
+e.g. the signed sub-word load decode fixed this session — would have surfaced
+here.
+
+### Force multiplier: differential decode vs llvm
+"No decode failure" does not catch a *wrong* decode (decodes to the wrong
+mnemonic). Add an opt-in lane that diffs `cfree objdump -d` against
+`llvm-objdump -d` over the same objects, normalized (whitespace, hex case,
+`0x` prefixes, address columns). llvm is the oracle for decode *text*.
+
+## Phase 2 — Symbolize `-S` (the keystone; unblocks L1/L2)
+
+Make `cc -S` emit **re-assemblable** assembly. Lives in `src/api/asm_emit.c`
+(and possibly a shared symbolizing-disasm layer also used by `objdump -d`).
+Note `objdump`'s symbol annotations are *comments* (`bl 0x11c <add>`), which
+are not re-assemblable — `-S` needs the symbol to *be* the operand (`bl add`).
+
+Two new inputs into the emit loop, then a two-pass emit:
+
+1. **Section relocations** from the `ObjBuilder`: a map `offset → (RelocKind,
+   target sym, addend)`. When an instruction covers a reloc offset, render its
+   operand symbolically via the reloc-kind → modifier mapping (the inverse of
+   what the assembler parses).
+2. **Branch-target labels**: collect intra-section branch / PC-rel targets,
+   synthesize `.L<sec>_<off>` labels at those offsets, render branches as
+   `b .L...`.
+
+Emit as two passes: pass 1 decodes everything and builds the label set =
+{symbols} ∪ {synthesized branch-target labels} ∪ {rv64 `%pcrel_hi` anchors};
+pass 2 emits, inserting labels at offsets and symbolic operands per the table.
+
+### RelocKind → operand-syntax mapping (inverse of the assembler)
+
+aarch64:
+- `R_AARCH64_CALL26` → `bl sym`; `R_AARCH64_JUMP26` → `b sym`;
+  `R_AARCH64_CONDBR19`/`TSTBR14` → `b.cc sym`/`tbz sym`
+- `R_AARCH64_ADR_PREL_PG_HI21` → `adrp Rd, sym`;
+  `R_AARCH64_ADR_GOT_PAGE` → `adrp Rd, :got:sym`
+- `R_AARCH64_ADD_ABS_LO12_NC` → `add Rd, Rn, :lo12:sym`
+- `R_AARCH64_LDST{8,16,32,64}_ABS_LO12_NC` → `ldr/str …, [Rn, :lo12:sym]`;
+  `R_AARCH64_LD64_GOT_LO12_NC` → `:got_lo12:sym`
+- TLS LE (`TLSLE_*`) → `:tprel_hi12:` / `:tprel_lo12_nc:` (later)
+
+riscv64:
+- `R_RV_CALL` → `call sym` (collapses the auipc+jalr pair)
+- `R_RV_PCREL_HI20` → `auipc Rd, %pcrel_hi(sym)` **and emit a local anchor
+  label** at this offset; `R_RV_PCREL_LO12_I/S` → `… %pcrel_lo(.Lanchor)`
+  referencing that anchor (mirrors codegen's `.LpcrelHi`, `native.c`)
+- `R_RV_HI20`/`R_RV_LO12_I/S` → `%hi(sym)`/`%lo(sym)`;
+  `R_RV_GOT_HI20` → `%got_pcrel_hi(sym)`
+- `R_RV_BRANCH`/`R_RV_JAL` → `beq …, sym` / `j sym`
+
+x86-64:
+- `R_X64_PLT32` on a `call`/`jmp` → `call sym@PLT` / `jmp sym@PLT`
+- `R_PC32` on a rip-relative mem operand → `sym(%rip)`
+- `R_X64_REX_GOTPCRELX`/`GOTPCREL` → `sym@GOTPCREL(%rip)`
+- absolute data refs → `sym` / `sym+addend`
+
+Decisions:
+- [ ] Where does symbolization live — `asm_emit.c` only, or a shared
+  symbolizing-disasm layer reused by `objdump -d`? (Recommend: a small shared
+  "resolve operand at offset → symbolic string" helper fed by a reloc map +
+  label map, with `-S` and `objdump` choosing operand-substitution vs comment.)
+- [ ] How to recover the *instruction operand position* a reloc patches (the
+  disassembler's `CfreeInsn` may not expose which operand the reloc field maps
+  to). May need the decoder to report the immediate/branch field offset, or the
+  symbolizer to re-derive it from the reloc offset within the instruction.
+
+## Phase 3 — L1 + L2 round-trip lanes
+
+Once `-S` is re-assemblable:
+
+- [ ] **L1 (byte round-trip)**: per arch, `cfree cc -c <src> → a.o`;
+  `cfree cc -S <src> | cfree as → b.o`; assert `.text` bytes **and** the
+  relocation table (kind, offset, target, addend) of `a.o` == `b.o`. Exact,
+  host-independent, pinpoints the divergent instruction. Gate on Phase 1
+  (no decode failures) first.
+- [ ] **L2 (exec equivalence)**: `cfree cc <src> → run` (direct) vs
+  `cfree cc -S <src> | cfree as | cfree ld → run` (round-trip); compare
+  stdout + exit. Execute via `cfree run` (host arch) or `cfree emu` (cross
+  arch), reusing the `test/asm/run.sh` J/E plumbing. Robust to benign encoding
+  differences; the end-to-end "it works" signal.
+- [ ] Run L1/L2 across {aa64, x64, rv64} × {-O0, -O1} over the corpus. Make L1
+  default-suite (cheap); L2 opt-in (exec).
+
+### Force multiplier: llvm-mc as a second assembler
+On `-S` output, assemble with **both** `cfree as` and `llvm-mc` and compare
+bytes. This cross-checks cfree's assembler against an oracle on real
+codegen-shaped input — coverage the hand-written corpus can't reach. (Caveat:
+llvm-mc may pick different-but-equivalent encodings; normalize or scope to
+forms where they agree.)
+
+## Harness / file map
+
+- Emitter to extend: `src/api/asm_emit.c` (`emit_disasm_range`,
+  `collect_labels`, `cfree_obj_builder_emit_asm`).
+- Disassembler: `src/arch/disasm.c` + per-arch `src/arch/<arch>/isa.c`
+  (`arch_disasm_decode`, `CfreeInsn`).
+- Reloc kinds: `src/obj/obj.h` (`RelocKind`); per-arch ELF maps under
+  `src/obj/elf/reloc_*.c`.
+- Tools: `cfree as` (`driver/as.c`), `objdump` (`driver/objdump.c`),
+  `run` (`driver/run.c`), `emu` (`driver/emu.c`), `ld` (`driver/ld.c`).
+- Test harness: `test/asm/run.sh` (H/T/L/D/J/E), `test/toy/run.sh`,
+  `test/lib/exec_target.sh`; golden regen `test/asm/regen.sh` /
+  `test/asm/regen-rv64.sh`. Cross toolchain present on dev host: `clang`,
+  `llvm-mc`, `llvm-objdump` (triples aarch64/x86_64/riscv64-linux-gnu).
+- New targets to add: `test-disasm-complete` (L0), `test-asm-roundtrip` (L1),
+  `test-asm-roundtrip-exec` (L2, opt-in).
+
+## Why this is worth it
+
+It converts "we wrote corpus cases for the instructions we remembered" into
+"every instruction the compiler emits is provably decodable (L0), re-encodes
+identically (L1), and runs identically (L2)" — a *coverage-driven* guarantee
+that tracks codegen automatically as new instructions are added, and that
+closes the loop with the relocation-operator syntax already in the assembler.
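
The "alt" option in Phase 1 (bound the L0 scan to function symbol ranges) can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions: the `FuncSym` and `Item` types, the `classify_fallbacks` helper, and the sample listing are all hypothetical, and none of them reflects cfree's actual `-S` output format or harness code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FuncSym:
    """Hypothetical function symbol: start offset and size within .text."""
    name: str
    value: int
    size: int


@dataclass(frozen=True)
class Item:
    """Hypothetical listing item: its .text offset and rendered text."""
    offset: int
    text: str


def classify_fallbacks(items, funcs):
    """Split `.byte` fallback items into in-function failures and padding.

    An item counts as a decode failure only if its offset lies inside
    [value, value + size) of some function; anything else is padding/data.
    """
    failures, padding = [], []
    for it in items:
        if not it.text.lstrip().startswith(".byte"):
            continue  # decoded instruction, not a fallback
        owner = next(
            (f for f in funcs if f.value <= it.offset < f.value + f.size),
            None,
        )
        if owner is not None:
            failures.append((owner.name, it.offset))
        else:
            padding.append(it.offset)
    return failures, padding


# Made-up listing: one undecoded byte inside `add`, zero fill after it.
funcs = [FuncSym("add", 0x0, 6), FuncSym("main", 0x10, 8)]
items = [
    Item(0x0, "mov %edi, %eax"),
    Item(0x2, "add %esi, %eax"),
    Item(0x4, ".byte 0xc5"),  # inside add: treated as a decode gap
    Item(0x5, "ret"),
    Item(0x6, ".byte 0x0"),   # between functions: alignment fill
    Item(0x7, ".byte 0x0"),
    Item(0x10, "call add"),
    Item(0x15, "ret"),
]
failures, padding = classify_fallbacks(items, funcs)
print(failures)  # [('add', 4)]
print(padding)   # [6, 7]
```

A gate built along these lines would fail only when `failures` is non-empty, so inter-function zero padding of the kind observed on x64 would no longer count against the disassembler.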
