kit

git clone https://git.ryansepassi.com/git/kit.git

commit d12baa0e9445f2d0c31f0bdc7a422319296e037f
parent ba3ebe2a0a399273258207688f862f1ea92a5e0f
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Fri, 29 May 2026 17:03:10 -0700

doc: plan for asm/disasm completeness via codegen round-trip testing

Three-layer strategy (L0 decode-completeness, L1 byte round-trip, L2 exec
equivalence) that round-trips the compiler's own -S output instead of only the
hand-written corpus. Records the verified current state (cc -S is disasm-to-text
sharing the disassembler; run/emu give host-independent cross-arch exec), the
blocker (-S emits a non-re-assemblable listing — numeric branches, de-symbolized
relocs), the trap (.byte fallback masks decode gaps in a run-only round-trip),
and the keystone Phase 2: symbolize -S using the reloc-operator syntax the
assembler now parses. Includes the RelocKind->operand-syntax mapping per arch
and the harness/file map.

Diffstat:
A doc/ASM_ROUNDTRIP_TESTING.md | 205 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 205 insertions(+), 0 deletions(-)

diff --git a/doc/ASM_ROUNDTRIP_TESTING.md b/doc/ASM_ROUNDTRIP_TESTING.md
@@ -0,0 +1,205 @@
+# Asm/disasm completeness via codegen round-trip testing
+
+Goal: measure and lock in **completeness** of the per-arch assembler (`as`),
+disassembler (`objdump -d` / `cc -S`), and the link relocation path, by
+round-tripping the **compiler's own output** rather than only a hand-written
+corpus. The corpus (`test/asm/`) only tests instructions we thought to write
+down; codegen output tests every instruction codegen actually emits.
+
+Status: plan only (2026-05-29). Prereqs from the native-arch asm work are in
+(see `doc/NATIVE_ARCH_COMPLETENESS.md`): the assembler now parses the full
+relocation-operator syntax on all three arches, which is exactly what the
+`-S` symbolizer (Phase 2) must *emit*.
+
+## Background — what cfree can do today (verified)
+
+- **`cc -S` exists** and is *disassembly-to-text plus module scaffolding*:
+  `driver/cc.c` (`emit_asm_source`, ~line 1116) → `cfree_obj_builder_emit_asm`
+  (`src/api/asm_emit.c:256`). It walks each section, emits labels for **symbols**
+  (`collect_labels`, `asm_emit.c:68`), and disassembles `.text` via
+  `emit_disasm_range` (`asm_emit.c:215`) using `arch_disasm_decode`. So `-S`
+  and `objdump -d` share the **same disassembler** — they are one decode
+  surface.
+- **Cross-arch execution needs no qemu/podman**: `cfree run` (in-process JIT,
+  `driver/run.c`) and `cfree emu` (user-mode ELF emulator, `driver/emu.c`) both
+  run guest code on the host. The asm harness already has JIT/exec paths
+  (`test/asm/run.sh`, the `D`/`J`/`E` lanes; `link-exe-runner`, `jit-runner`).
+- **The assembler accepts the reloc-operator syntax** (this is the key enabler):
+  aa64 `:lo12:`/`:got:`/`:got_lo12:`, rv64 `%hi`/`%lo`/`%pcrel_hi`/`%pcrel_lo`,
+  x64 `sym(%rip)`/`@PLT`/`@GOTPCREL` — see `src/arch/{aa64,x64,rv64}/asm.c`.
+
+### Three gotchas the design must handle
+
+1. **`-S` is a listing, not re-assemblable assembly (the blocker).** Today it
+   emits **numeric** branch targets (`b 0x100`) and **de-symbolized**
+   relocated operands (`bl 0x11c` instead of `bl add`; `adrp x16, 0x0` + `ldr
+   [x16]` instead of `adrp x16, g` + `ldr [x16, :lo12:g]`). Re-assembling that
+   branches to the wrong place and loads from address 0. **L1/L2 below are
+   blocked until `-S` symbolizes (Phase 2).**
+
+2. **The `.byte` fallback masks decode gaps (the trap).** When the disassembler
+   can't decode a word, `emit_disasm_range` emits `.byte 0x..` (`asm_emit.c`
+   ~227). Re-assembling a `.byte` reproduces the exact original bytes — so a
+   *run-only* round-trip **passes even when disasm is incomplete**. A decode or
+   byte/reloc check must gate before trusting an exec round-trip.
+
+3. **Padding/data is indistinguishable from a decode failure today.**
+   Inter-function alignment fill is emitted as the *same* `.byte 0x0` token as a
+   real decode failure (observed: x64 zero-pads between functions; aa64/rv64
+   nop-pad so show none). The completeness metric must separate "byte the
+   disassembler failed on, inside a function" from "padding/data outside any function".
+
+## The three layers (build cheapest/sharpest first)
+
+| Layer | Check | Catches | Cost | Needs Phase 2 |
+|-------|-------|---------|------|---------------|
+| **L0 decode completeness** | over a program corpus, assert **no in-function decode failure** | disasm can't decode an insn codegen emits | cheap, no exec, pinpoints the word | no |
+| **L1 byte round-trip** | `cc -c` vs `cc -S \| as`; diff `.text` bytes **and** reloc tables | asm⊗disasm disagreements (round-trip violations) | cheap, exact | yes |
+| **L2 exec equivalence** | `cc` direct vs `cc -S \| as \| ld`, run + compare output/exit | semantic bugs; tolerant of benign encoding diffs | exec via `run`/`emu` | yes |
+
+L0 measures the disassembler against codegen. L1 measures asm⊗disasm
+agreement (the exact "round-trip violation" framing of the completeness doc).
+L2 is the end-to-end "it actually runs the same" signal.
+
+## Phase 1 — L0 decode-completeness gate (unblocked, do first)
+
+The single cheapest high-value win; implementable with no new features.
+
+- [ ] Make the decode-failure signal **unambiguous**. Pick one:
+  - (preferred) In `emit_disasm_range`, emit padding/data outside function
+    symbol ranges as `.zero N` / `.p2align`, and emit a *genuine in-stream
+    decode failure* as a distinct token — `.inst 0x<word>` (a real aarch64
+    directive) or `.byte … # UNDECODED`. Then "grep for the marker" is exact.
+  - (alt) Bound the L0 scan to `[sym.value, sym.value+sym.size)` function
+    ranges using the emitted `.size` directives, counting only in-range
+    `.byte`/`.inst`.
+- [ ] Curate an L0 corpus that stresses instruction families codegen emits:
+  int/long arith + shifts/bitops, `float`/`double` (FP/SIMD — most likely gap),
+  `switch` (jump tables), structs/by-ref, loops/memory, `cmp`/`cset`/`cmov`,
+  calls + global/TLS access, varargs, `asm()` inline. Reuse `test/toy` and a
+  small C set. Run at `-O0` and `-O1` (different encodings).
+- [ ] For massive free coverage, also run L0 over the **bootstrap objects**
+  (cfree compiling cfree — see `doc/`/`bootstrap-*` targets); it exercises a
+  huge slice of the ISA.
+- [ ] New target (e.g. `test-disasm-complete`): for each arch in {aa64,x64,rv64},
+  `cfree cc -S -target <triple> <src>` and assert zero in-function decode
+  markers. Wire into the default suite (no exec needed — host-independent).
+
+Prototype already run (rich.c with `double`/`float`/`switch`/bitops): aa64 and
+rv64 disassembled clean; x64 only showed inter-function zero padding (a false
+positive that motivates the "unambiguous signal" item above). A genuine gap —
+e.g. the signed sub-word load decode fixed this session — would have surfaced
+here.
+
+### Force multiplier: differential decode vs llvm
+"No decode failure" does not catch a *wrong* decode (decodes to the wrong
+mnemonic). Add an opt-in lane that diffs `cfree objdump -d` against
+`llvm-objdump -d` over the same objects, normalized (whitespace, hex case,
+`0x` prefixes, address columns). llvm is the oracle for decode *text*.
+
+## Phase 2 — Symbolize `-S` (the keystone; unblocks L1/L2)
+
+Make `cc -S` emit **re-assemblable** assembly. Lives in `src/api/asm_emit.c`
+(and possibly a shared symbolizing-disasm layer also used by `objdump -d`).
+Note `objdump`'s symbol annotations are *comments* (`bl 0x11c <add>`), which
+are not re-assemblable — `-S` needs the symbol to *be* the operand (`bl add`).
+
+Two new inputs into the emit loop, then a two-pass emit:
+
+1. **Section relocations** from the `ObjBuilder`: a map `offset → (RelocKind,
+   target sym, addend)`. When an instruction covers a reloc offset, render its
+   operand symbolically via the reloc-kind → modifier mapping (the inverse of
+   what the assembler parses).
+2. **Branch-target labels**: collect intra-section branch / PC-rel targets,
+   synthesize `.L<sec>_<off>` labels at those offsets, render branches as
+   `b .L...`.
+
+Emit as two passes: pass 1 decodes everything and builds the label set =
+{symbols} ∪ {synthesized branch-target labels} ∪ {rv64 `%pcrel_hi` anchors};
+pass 2 emits, inserting labels at offsets and symbolic operands per the table.
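The two-pass emit can be sketched as follows. This is an illustration under stated assumptions, not the planned implementation: it assumes a decoder that yields one record per instruction with its section offset, its text, and the section offset of any intra-section branch target. The `Insn` tuple, the `emit_section` helper, and the `{target}` placeholder are hypothetical, and the rv64 `%pcrel_hi` anchors and reloc-driven operands are left out.

```python
from typing import NamedTuple, Optional


class Insn(NamedTuple):
    offset: int                    # byte offset within the section
    text: str                      # disassembly; "{target}" marks the branch operand
    target: Optional[int] = None   # intra-section branch target offset, if any


def emit_section(sec: str, insns: list, symbols: dict) -> list:
    """Two-pass emit: build the label set, then print labels and operands."""
    # Pass 1: label set = {symbols} plus synthesized branch-target labels.
    labels = dict(symbols)  # offset -> name
    for insn in insns:
        if insn.target is not None and insn.target not in labels:
            # Hex offsets in the label name are an arbitrary choice here.
            labels[insn.target] = f".L{sec}_{insn.target:x}"

    # Pass 2: emit, inserting a label wherever an offset has one.
    out = []
    for insn in insns:
        if insn.offset in labels:
            out.append(f"{labels[insn.offset]}:")
        if insn.target is not None:
            out.append("    " + insn.text.replace("{target}", labels[insn.target]))
        else:
            out.append("    " + insn.text)
    return out
```

Building the full label set before emitting anything is what lets a forward branch name a label that has not been printed yet.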
+
+### RelocKind → operand-syntax mapping (inverse of the assembler)
+
+aarch64:
+- `R_AARCH64_CALL26` → `bl sym`; `R_AARCH64_JUMP26` → `b sym`;
+  `R_AARCH64_CONDBR19`/`TSTBR14` → `b.cc sym`/`tbz sym`
+- `R_AARCH64_ADR_PREL_PG_HI21` → `adrp Rd, sym`;
+  `R_AARCH64_ADR_GOT_PAGE` → `adrp Rd, :got:sym`
+- `R_AARCH64_ADD_ABS_LO12_NC` → `add Rd, Rn, :lo12:sym`
+- `R_AARCH64_LDST{8,16,32,64}_ABS_LO12_NC` → `ldr/str …, [Rn, :lo12:sym]`;
+  `R_AARCH64_LD64_GOT_LO12_NC` → `:got_lo12:sym`
+- TLS LE (`TLSLE_*`) → `:tprel_hi12:` / `:tprel_lo12_nc:` (later)
+
+riscv64:
+- `R_RV_CALL` → `call sym` (collapses the auipc+jalr pair)
+- `R_RV_PCREL_HI20` → `auipc Rd, %pcrel_hi(sym)` **and emit a local anchor
+  label** at this offset; `R_RV_PCREL_LO12_I/S` → `… %pcrel_lo(.Lanchor)`
+  referencing that anchor (mirrors codegen's `.LpcrelHi`, `native.c`)
+- `R_RV_HI20`/`R_RV_LO12_I/S` → `%hi(sym)`/`%lo(sym)`;
+  `R_RV_GOT_HI20` → `%got_pcrel_hi(sym)`
+- `R_RV_BRANCH`/`R_RV_JAL` → `beq …, sym` / `j sym`
+
+x86-64:
+- `R_X64_PLT32` on a `call`/`jmp` → `call sym@PLT` / `jmp sym@PLT`
+- `R_PC32` on a rip-relative mem operand → `sym(%rip)`
+- `R_X64_REX_GOTPCRELX`/`GOTPCREL` → `sym@GOTPCREL(%rip)`
+- absolute data refs → `sym` / `sym+addend`
+
+Decisions:
+- [ ] Where does symbolization live — `asm_emit.c` only, or a shared
+  symbolizing-disasm layer reused by `objdump -d`? (Recommend: a small shared
+  "resolve operand at offset → symbolic string" helper fed by a reloc map +
+  label map, with `-S` and `objdump` choosing operand-substitution vs comment.)
+- [ ] How to recover the *instruction operand position* a reloc patches (the
+  disassembler's `CfreeInsn` may not expose which operand the reloc field maps
+  to). May need the decoder to report the immediate/branch field offset, or the
+  symbolizer to re-derive it from the reloc offset within the instruction.
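The RelocKind mapping lends itself to a table-driven "resolve operand" helper of the kind recommended under Decisions. A minimal sketch, assuming reloc kinds are available as strings and that only the symbolic operand text is needed. The `OPERAND_SYNTAX` table and `render_operand` function are illustrative names, only a subset of the listed kinds is shown, and the addend handling is deliberately naive (a real symbolizer would likely have to strip the bias that x64 PC-relative relocations usually carry).

```python
# Reloc kind -> operand template; "{sym}" is the target symbol (plus addend).
# Kind names are taken from the mapping in this document; the table itself
# is a hypothetical illustration, not cfree's data structure.
OPERAND_SYNTAX = {
    # aarch64
    "R_AARCH64_CALL26": "{sym}",
    "R_AARCH64_ADR_PREL_PG_HI21": "{sym}",
    "R_AARCH64_ADR_GOT_PAGE": ":got:{sym}",
    "R_AARCH64_ADD_ABS_LO12_NC": ":lo12:{sym}",
    "R_AARCH64_LD64_GOT_LO12_NC": ":got_lo12:{sym}",
    # riscv64
    "R_RV_HI20": "%hi({sym})",
    "R_RV_LO12_I": "%lo({sym})",
    "R_RV_PCREL_HI20": "%pcrel_hi({sym})",
    "R_RV_GOT_HI20": "%got_pcrel_hi({sym})",
    # x86-64
    "R_X64_PLT32": "{sym}@PLT",
    "R_PC32": "{sym}(%rip)",
    "R_X64_GOTPCREL": "{sym}@GOTPCREL(%rip)",
}


def render_operand(kind: str, sym: str, addend: int = 0) -> str:
    """Render the symbolic operand for one relocation (naive addend handling)."""
    template = OPERAND_SYNTAX[kind]
    target = sym if addend == 0 else f"{sym}{addend:+d}"
    return template.format(sym=target)
```

For example, `render_operand("R_AARCH64_ADD_ABS_LO12_NC", "g")` yields `:lo12:g`, which `-S` would substitute into the operand and `objdump -d` could append as a comment.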
+
+## Phase 3 — L1 + L2 round-trip lanes
+
+Once `-S` is re-assemblable:
+
+- [ ] **L1 (byte round-trip)**: per arch, `cfree cc -c <src> → a.o`;
+  `cfree cc -S <src> | cfree as → b.o`; assert `.text` bytes **and** the
+  relocation table (kind, offset, target, addend) of `a.o` == `b.o`. Exact,
+  host-independent, pinpoints the divergent instruction. Gate on Phase 1
+  (no decode failures) first.
+- [ ] **L2 (exec equivalence)**: `cfree cc <src> → run` (direct) vs
+  `cfree cc -S <src> | cfree as | cfree ld → run` (round-trip); compare
+  stdout + exit. Execute via `cfree run` (host arch) or `cfree emu` (cross
+  arch), reusing the `test/asm/run.sh` J/E plumbing. Robust to benign encoding
+  differences; the end-to-end "it works" signal.
+- [ ] Run L1/L2 across {aa64, x64, rv64} × {-O0, -O1} over the corpus. Make L1
+  default-suite (cheap); L2 opt-in (exec).
+
+### Force multiplier: llvm-mc as a second assembler
+On `-S` output, assemble with **both** `cfree as` and `llvm-mc` and compare
+bytes. This cross-checks cfree's assembler against an oracle on real
+codegen-shaped input — coverage the hand-written corpus can't reach. (Caveat:
+llvm-mc may pick different-but-equivalent encodings; normalize or scope to
+forms where they agree.)
+
+## Harness / file map
+
+- Emitter to extend: `src/api/asm_emit.c` (`emit_disasm_range`,
+  `collect_labels`, `cfree_obj_builder_emit_asm`).
+- Disassembler: `src/arch/disasm.c` + per-arch `src/arch/<arch>/isa.c`
+  (`arch_disasm_decode`, `CfreeInsn`).
+- Reloc kinds: `src/obj/obj.h` (`RelocKind`); per-arch ELF maps under
+  `src/obj/elf/reloc_*.c`.
+- Tools: `cfree as` (`driver/as.c`), `objdump` (`driver/objdump.c`),
+  `run` (`driver/run.c`), `emu` (`driver/emu.c`), `ld` (`driver/ld.c`).
+- Test harness: `test/asm/run.sh` (H/T/L/D/J/E), `test/toy/run.sh`,
+  `test/lib/exec_target.sh`; golden regen `test/asm/regen.sh` /
+  `test/asm/regen-rv64.sh`. Cross toolchain present on dev host: `clang`,
+  `llvm-mc`, `llvm-objdump` (triples aarch64/x86_64/riscv64-linux-gnu).
+- New targets to add: `test-disasm-complete` (L0), `test-asm-roundtrip` (L1),
+  `test-asm-roundtrip-exec` (L2, opt-in).
+
+## Why this is worth it
+
+It converts "we wrote corpus cases for the instructions we remembered" into
+"every instruction the compiler emits is provably decodable (L0), re-encodes
+identically (L1), and runs identically (L2)" — a *coverage-driven* guarantee
+that tracks codegen automatically as new instructions are added, and that
+closes the loop with the relocation-operator syntax already in the assembler.
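The L1 check from Phase 3 reduces to comparing two (`.text` bytes, relocation table) pairs and reporting the first divergence. A sketch, assuming both objects have already been parsed into plain Python values by a loader that is not shown. `Reloc`, `first_divergence`, and the message format are hypothetical.

```python
from typing import NamedTuple, Optional


class Reloc(NamedTuple):
    kind: str
    offset: int
    target: str
    addend: int


def first_divergence(text_a: bytes, relocs_a: list,
                     text_b: bytes, relocs_b: list) -> Optional[str]:
    """Return a description of the first mismatch, or None if a.o == b.o."""
    # Byte-for-byte comparison pinpoints the divergent instruction's offset.
    for off, (x, y) in enumerate(zip(text_a, text_b)):
        if x != y:
            return f".text differs at offset {off:#x}: {x:#04x} != {y:#04x}"
    if len(text_a) != len(text_b):
        return f".text length differs: {len(text_a)} != {len(text_b)}"
    # Compare relocations independent of table order, keyed by offset.
    for ra, rb in zip(sorted(relocs_a, key=lambda r: r.offset),
                      sorted(relocs_b, key=lambda r: r.offset)):
        if ra != rb:
            return f"reloc differs at offset {ra.offset:#x}: {ra} != {rb}"
    if len(relocs_a) != len(relocs_b):
        return f"reloc count differs: {len(relocs_a)} != {len(relocs_b)}"
    return None
```

Comparing the relocation tuples as well as the bytes matters because a relocated field is typically zero in both objects, so a wrong symbol or addend would be invisible in the `.text` diff alone.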