kit

git clone https://git.ryansepassi.com/git/kit.git

Testing

kit's test architecture is built around one core idea: drive testing from what codegen actually emits, not from a hand-written corpus of cases someone remembered to write. A hand corpus only ever tests the instructions we thought to write down; the compiler's own output exercises every instruction codegen emits, and tracks codegen automatically as new instructions appear. This doc describes the harness design — the layering, the invariants each lane locks in, and why the lanes are shaped the way they are. For the surfaces these tests exercise see ASM.md (assembler/disassembler/cc -S), ARCH.md (per-arch ISA tables), and FRONTENDS.md (the Toy and C frontends).

The test tree lives under test/, with per-area subdirectories (test/asm/, test/toy/, test/smoke/, test/libc/, plus unit-test areas like test/arch/, test/elf/, test/opt/). Build/run wiring lives in mk/test.mk. Every harness conforms to one of four canonical test types, each backed by a shared library under test/lib/, so test infrastructure is written once and reused rather than re-invented per area.

Canonical test types

  Type             Library                                     What it is
  U  unit          kit_unit.h (+ test_unit.mk build manifest)  a C translation unit linked against libkit, self-checking in-process
  C  corpus        cf_corpus.sh                                a directory of case files run through one or more lanes, each with its own oracle
  K  scripted      kit_sh_kit.sh                               a hand-written shell test driving the kit binary, judged by golden transcript (mode G) or procedural asserts (mode P)
  D  differential  cf_differential.sh                          correctness defined as agreement, either vs a checked-in baseline or vs a reference tool

All four record results through one shared report layer, test/lib/kit_sh_report.sh (cf_pass/cf_fail/cf_skip/cf_skip_na/cf_xfail/cf_xpass/cf_time plus a unified cf_summary/cf_exit). cf_skip.sh unifies the skip sources (sidecar files, phased-rollout diagnostic regexes, target-tuple applicability); exec_target.sh and exec_kernel.sh run guest binaries for cross-arch lanes. The sections below describe what each harness tests; this section describes how the harnesses are structured.

Type C — the corpus engine. cf_corpus.sh owns the whole pipeline: discover a glob of cases, expand the {case × opt-level × target-tuple} matrix, run each enabled lane's recipe, apply that lane's oracle, report, and flush any deferred cross-arch exec. A runner supplies only its corpus glob, its lane set, and a cf_lane_<ID>() hook per lane — which writes only under a per-case $CF_WORK dir and records via exactly one cf_* verb. Because the oracle is lane-local, exit-code, golden-diff, byte-compare, structural-grep, negative (expect-failure) and cross-arch-exec are all just lane bodies, not separate harness types; a suite with disjoint sub-corpora (asm's encode/decode/listing, elf's layers) makes one cf_corpus_run call per sub-corpus. Execution is serial or parallel through the same code path: in parallel mode each case runs as a background worker that appends its results to an event file, and the parent replays those events in index order — so counts, failing-name order, and the cross-arch-exec queue are deterministic regardless of worker completion order, and a runner goes parallel by flipping CF_PARALLELIZABLE with no other change. The engine is proven by test/lib/cf_corpus_selftest.sh (make test-cf-corpus-selftest): it asserts serial≡parallel determinism and the load-bearing invariant that cross-arch exec is queued on the parent, never inside a worker.
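
The replay step can be modeled in a few lines. The sketch below is Python rather than the harness's shell, and every name in it (Event, replay, the case names) is hypothetical; it only illustrates why replaying worker events in case-index order makes counts and failing-name order independent of worker completion order.

```python
import random
from collections import Counter
from typing import NamedTuple

class Event(NamedTuple):
    index: int     # position of the case in the discovered corpus
    name: str      # case name
    verdict: str   # "pass", "fail" or "skip"

def replay(events):
    """Replay worker events in case-index order, as the parent would."""
    ordered = sorted(events, key=lambda e: e.index)
    counts = dict(Counter(e.verdict for e in ordered))
    failing = [e.name for e in ordered if e.verdict == "fail"]
    return counts, failing

serial = [
    Event(0, "add", "pass"),
    Event(1, "switch", "fail"),
    Event(2, "mulh", "fail"),
    Event(3, "wasm_only", "skip"),
]

# Simulate workers finishing in an arbitrary order.
shuffled = serial[:]
random.Random(7).shuffle(shuffled)

serial_result = replay(serial)
parallel_result = replay(shuffled)
```

Both results are equal here because the sort key is the case index, not the arrival order of the event.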

Type K — scripted shell. kit_sh_kit.sh is the single source point for a hand-written tool/driver test: it pulls in the report layer plus the procedural assert verbs (ok/run_ok/run_fail/contains/same_file/is_executable/check_mode) and the golden-transcript runner cf_scenario_case. A suite picks a mode: G diffs a cases/<name>.sh transcript against <name>.expected (ar/strip/objcopy/strings/objdump, dbg); P runs procedural asserts over a $work sandbox (cas/pkg, the driver CLI suite, the COFF Windows smokes). Mode-P suites stay serial (they share one $work and mutate fixtures).

Type D — differential. cf_differential.sh is for tests whose oracle is agreement rather than a fixed per-case answer: cf_diff_baseline regenerates a normalized report and gates on a zero delta vs a checked-in snapshot (CF_DIFF_UPDATE=1 refreshes it), and cf_diff_agree requires two independent producers to match byte-for-byte (with explicit equivalence-skips). Used by the asm self-symmetry sweep and the kit-vs-llvm-mc cross-check.

Why codegen-driven round-trip testing

cc -S is not a separate pretty-printer: it is the disassembler plus module scaffolding (src/api/asm_emit.c). The same arch_disasm_decode surface backs both cc -S and objdump -d, so -S text is kit's disassembly rendered as re-assemblable assembly. That makes a single feedback loop possible:

  source ──cc -c──▶ object (codegen's bytes + relocs)
     │
     └────cc -S──▶ assembly ──as──▶ object (re-encoded bytes + relocs)

If the disassembler and assembler are mutually faithful, the two objects are byte-and-reloc identical. Any instruction codegen emits is, by construction, in the loop — so the coverage tracks the compiler, not a static list.

The three completeness layers (L0/L1/L2)

test/asm/roundtrip.sh runs a C corpus (test/asm/roundtrip/, each case a test_main) through three lanes, cheapest and sharpest first. They build on each other: a higher lane is only trustworthy once the lower one passes.

  L0  decode-completeness   cc -S, assert no in-function decode failure
  L1  byte + reloc round    cc -c  vs  cc -S | as   (.text bytes AND reloc table)
  L2  exec equivalence      run direct.o  vs  run rt.o   (exit code / stdout)

L0 — decode completeness. Disassemble the program and assert no instruction codegen emitted failed to decode. This is host-independent (no execution), cheap, and pinpoints the exact undecodable word. The signal must be unambiguous: padding and data are not decode failures. The harness keys on the .inst 0x<word> marker — a real assembler directive emitted only by the disassembler's unknown-word path (aa64_write_unknown) — and counts it only inside .text. Inter-function padding (.byte 0x0 fill) and data-section .byte are explicitly distinct from the in-stream .inst marker, so "grep for the marker inside .text" is an exact completeness test.
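
As an illustration of the "marker inside .text" rule, here is a minimal Python sketch (the real harness is a shell script, and this is not its code). The function name and the set of section-switch directives it recognizes (.text, .data, .bss, .section <name>) are assumptions made for the example.

```python
def count_undecoded(asm_text):
    """Collect `.inst` operands seen while the current section is .text."""
    in_text = False
    hits = []
    for line in asm_text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        head = tokens[0]
        if head == ".text":
            in_text = True
        elif head in (".data", ".bss"):
            in_text = False
        elif head == ".section" and len(tokens) > 1:
            name = tokens[1].split(",")[0]
            in_text = name == ".text" or name.startswith(".text.")
        elif head == ".inst" and in_text and len(tokens) > 1:
            hits.append(tokens[1])
    return hits

sample = """\
    .text
add:
    add w0, w0, w1
    .inst 0x9b427c20
    ret
    .byte 0x0
    .section .rodata
    .byte 0x2a
    .inst 0xdeadbeef
"""

undecoded = count_undecoded(sample)
```

The padding .byte and everything in .rodata are ignored; only the in-stream .inst word is reported, which is what lets the check name the exact undecodable word.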

L1 — byte + reloc round-trip. Compile the source two ways: cc -c directly, and cc -S | as. Diff both the .text bytes and the relocation table. The reloc-table comparison is essential — a same-section branch resolved in place versus kept as a relocation produces identical bytes but a different relocatable object, and only the reloc diff catches it. L1 covers the sections cc -S reproduces (.text, .rodata, .data, and .bss; switch jump tables live in .rodata). .bss is NOBITS — it carries a size but no bytes, so byte round-tripping does not apply to it; only its presence and symbols are checked. L1 excludes sections -S does not emit (e.g. .eh_frame) so their absence in the round-tripped object is not misread as a divergence. L1 is gated on L0: a byte match is meaningless if the disassembler punted to a .byte/.inst fallback (see the gotcha below).
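
The comparison L1 performs can be modeled as follows. This is a Python sketch under stated assumptions, not the harness's implementation: the Obj type, the tuple layout of a relocation, and the sample bytes are all hypothetical, and .bss is simply left out of the byte comparison.

```python
from dataclasses import dataclass, field

# Sections whose bytes cc -S reproduces (.bss is NOBITS, so no bytes).
COMPARED = (".text", ".rodata", ".data")

@dataclass
class Obj:
    sections: dict                             # name -> bytes
    relocs: set = field(default_factory=set)   # {(section, offset, kind, symbol)}

def compared_relocs(obj):
    """Relocations that apply to a compared section."""
    return {r for r in obj.relocs if r[0] in COMPARED}

def l1_diff(direct, rt):
    """Return divergences; an empty list means L1 passes."""
    problems = []
    for name in COMPARED:
        if direct.sections.get(name, b"") != rt.sections.get(name, b""):
            problems.append(f"{name}: bytes differ")
    if compared_relocs(direct) != compared_relocs(rt):
        problems.append("relocation table differs")
    return problems

code = bytes.fromhex("00000094c0035fd6")   # placeholder .text contents
direct = Obj({".text": code, ".eh_frame": b"\x01"},
             {(".text", 0, "CALL26", "add")})
rt_resolved = Obj({".text": code})         # same bytes, relocation dropped
rt_faithful = Obj({".text": code}, {(".text", 0, "CALL26", "add")})

resolved_report = l1_diff(direct, rt_resolved)
faithful_report = l1_diff(direct, rt_faithful)
```

The first comparison has identical bytes but is still reported, which is the case the relocation-table diff exists for; the extra .eh_frame in the direct object is ignored in both.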

L2 — exec equivalence. Run direct.o and rt.o and compare exit codes (and stdout, and an optional <name>.expected oracle). This is the end-to-end "it runs the same" signal, tolerant of benign encoding differences L1 would flag. Crucially, no qemu is needed for the host arch: execution goes through the in-process JIT (kit run / the jit-runner), and cross-arch execution is available via the emulator (kit emu) — see JIT.md and EMU.md. L2 runs only when the target arch matches the host (native JIT); otherwise it self-skips.

Opt levels matter: -O0 and -O1 emit different encodings, so each lane runs at every level in KIT_TEST_OPTS.

The symbolization invariant

L1/L2 only work because cc -S is re-assemblable, not a listing. A listing emits numeric branch targets (b 0x100) and de-symbolized relocated operands (bl 0x11c instead of bl add); re-assembling such text branches to the wrong place and loads from address 0. The invariant the symbolizer maintains is: every operand a relocation patches is rendered in the assembler's own reloc-operator syntax (aa64 :lo12:/:got:, rv64 %pcrel_hi/%pcrel_lo, x64 sym(%rip)/@PLT), and every intra-section branch/PC-relative target gets a synthesized local label. The symbolizer is the inverse of what the assembler parses; see ASM.md for the RelocKind→syntax mapping.
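
A toy model of the invariant, in Python for illustration only: the instruction tuples, the symbolize function, and the .L<hex> label naming are hypothetical and far simpler than a real symbolizer, but they show the two rules (relocated operands print the symbol, intra-section targets get a local label).

```python
def symbolize(insns, relocs):
    """insns: list of (addr, mnemonic, operand), where an int operand is a
    PC-relative target. relocs: {addr: symbol} for relocated operands."""
    addrs = {addr for addr, _, _ in insns}
    targets = {op for addr, _, op in insns
               if isinstance(op, int) and addr not in relocs and op in addrs}
    lines = []
    for addr, mnem, op in insns:
        if addr in targets:
            lines.append(f".L{addr:x}:")          # synthesized local label
        if addr in relocs:
            op = relocs[addr]                     # relocated: print the symbol
        elif isinstance(op, int):
            op = f".L{op:x}" if op in targets else hex(op)
        lines.append(f"    {mnem} {op}".rstrip())
    return lines

listing = symbolize(
    [(0x100, "b", 0x108), (0x104, "bl", 0x11c), (0x108, "ret", "")],
    relocs={0x104: "add"},
)
```

The numeric bl 0x11c from the listing form becomes bl add, and the branch to 0x108 refers to a label that moves with the code when it is re-assembled.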

A second structural decision makes this robust across different assemblers: code locations an encoding-divergent assembler must be able to recompute — switch jump-table entries and &&label address-takes — are referenced through a per-basic-block local symbol the emitter mints (mc_label_symbol, src/arch/mc.c), uniformly on all three arches. The jump table emits .quad .Lcfblk.* and the address-take a standard PC-relative relocation against that block symbol, rather than a baked numeric offset. Because the reference is genuinely relocatable, a third-party assembler's encoding choices (movabs vs mov-imm32, jmp rel32 vs rel8, RVC compression) cannot shift a baked offset onto the wrong instruction. cc -c and cc -S emit the same relocations, so the L1 byte/reloc lanes stay faithful too.

The .byte/.inst fallback gotcha

The disassembler's fallback for an undecodable word reproduces the exact original bytes (.byte 0x.., or .inst 0x<word>). Re-assembling that fallback yields the same bytes — so a run-only round-trip passes even when the disassembler is incomplete. The decode/byte check (L0, and the reloc half of L1) must gate before an exec round-trip is trusted, otherwise L2 green hides a real disassembler gap. This is why the lanes are ordered, and why L0 keys on the in-stream .inst marker specifically — it is the unambiguous "the disassembler could not decode this word" signal.
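
The gotcha is easy to reproduce with a toy disassembler/assembler pair. The Python sketch below is an illustration only: the two-entry decode table and the function names are made up, and it models words rather than real object files.

```python
# Toy decode table: two AArch64 words the "disassembler" knows.
KNOWN = {0xD65F03C0: "ret", 0xD503201F: "nop"}
ENCODE = {text: word for word, text in KNOWN.items()}

def disasm(words):
    """Decode each word; unknown words fall back to an `.inst` line."""
    return [KNOWN.get(w, f".inst 0x{w:08x}") for w in words]

def assemble(lines):
    """Re-encode; `.inst` reproduces the literal word."""
    out = []
    for line in lines:
        if line.startswith(".inst "):
            out.append(int(line.split()[1], 16))
        else:
            out.append(ENCODE[line])
    return out

code = [0xD503201F, 0x9B427C20, 0xD65F03C0]   # middle word is not in KNOWN
listing = disasm(code)

round_trip_ok = assemble(listing) == code
decode_complete = not any(l.startswith(".inst ") for l in listing)
```

The round-trip succeeds even though one word was never decoded, so only the marker check exposes the gap.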

Self-symmetry sweep + checked-in baseline

The codegen round-trip exercises the disassembler only on what the compiler emits. test/asm/symmetry.sh complements it by sweeping the tools' own instruction set for asm⊗disasm symmetry, independent of codegen:

The two tools cover slightly different ISA subsets on purpose (forms the assembler accepts for completeness that codegen never emits, so the disassembler never had to decode them). Known asymmetries live in a checked-in snapshot, test/asm/symmetry.baseline; the sweep passes iff the current set equals the baseline. So it gates against new asymmetry (a regression) while the baseline documents the disasm-completeness backlog. Closing a gap shrinks the baseline (symmetry.sh --update). This is the standard kit pattern: an honest, checked-in "what is currently known-incomplete" snapshot that turns a regression into a diff.
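
The baseline gate amounts to a set comparison. A minimal Python sketch, assuming each asymmetry is identified by a string; the function name and the form_* identifiers are hypothetical:

```python
def baseline_gate(current, baseline):
    """Pass iff the current asymmetry set equals the checked-in baseline."""
    current, baseline = set(current), set(baseline)
    new = sorted(current - baseline)     # regressions not yet recorded
    fixed = sorted(baseline - current)   # closed gaps: baseline needs updating
    return (not new and not fixed), new, fixed

baseline = {"form_a", "form_b"}

unchanged = baseline_gate({"form_a", "form_b"}, baseline)
regressed = baseline_gate({"form_a", "form_b", "form_c"}, baseline)
improved = baseline_gate({"form_a"}, baseline)
```

Both a new asymmetry and a closed gap show up as a non-empty diff, so the snapshot cannot drift from reality in either direction.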

The diff-vs-llvm second oracle

"No decode failure" does not catch a wrong decode, and a self-round-trip can't either, because kit's own re-encode would repeat the mistake. test/asm/diff_llvm.sh adds an independent oracle (llvm-mc). The comparison is byte-level, so it sidesteps disassembly-text normalization (movz-vs-mov, #16-vs-#0x10), on which a textual comparison would founder because of alias/format differences:

The one benign disagreement is recognized structurally: kit codegen keeps a same-section CALL26/JUMP26/CONDBR relocation that llvm-mc (like GNU as) resolves in place and drops. The bytes are link-equivalent, only the relocatable form differs, so the reloc-table diff distinguishes it and it is not flagged. Opt-in; skips cleanly when llvm-mc is absent.

Host-assembler execution lanes (cc -S is standard assembly)

The round-trip and diff-llvm lanes either use kit's own assembler or compare bytes. A separate question is whether cc -S output is standard assembly: text that a third-party assembler accepts and that means the same thing to it. This is judged by execution, not bytes (kit and clang emit different but execution-equivalent code, so a byte/text match would be meaningless).

test/asm/hostas_toy.sh answers it on the native target. Per Toy case (both -O0/-O1) it emits one cc -S and feeds it to two assemblers, then links each with kit ld and runs it, asserting the exit-code oracle:

  cc -S ──┬── kit as ──kit ld──▶ ./a.out   exit == oracle   (lane A, baseline)
          └── clang -c  ──kit ld──▶ ./b.out   exit == oracle   (lane B, the test)

The assembler is the only variable. Lane A is a baseline (kit both writes and reads its own dialect, so a private-dialect quirk could hide); lane B is the real test — a standard assembler can't paper over such a quirk. This is what proves cc -S emits the clean dialect of the target object format: the format-divergent directive spelling (.type/.size/.section/.p2align) lives behind an AsmSyntax vtable selected by object format in src/api/asm_emit.c, and the relocation operand syntax (ELF :lo12: vs Mach-O @PAGEOFF) behind a per-arch ArchAsmOps.reloc_operand hook — keeping the printer free of arch-specific reloc knowledge while staying format-correct. kit's own as parses the dialect of its target too, so the single cc -S output assembles identically under both. The clang-as lane gates by default; KIT_HOSTAS_ENFORCE_CLANG=0 demotes it to XFAIL (useful while bringing up a new arch/format whose printer side isn't done).

test/asm/hostas_cross.sh is the cross extension: the same two-assembler-by-execution test, but for ELF Linux targets (aarch64/x86_64/riscv64) emitted with cc -S -target <triple>, assembled by both kit-as and clang, linked into a static non-PIE ELF with kit ld -static, and run under podman/qemu via test/lib/exec_target.sh. The executable is made runnable with no libc/loader by linking the freestanding crt test/link/harness/start.c (-Dtest_main=main): _start runs ctors, calls main, and exits with its return value (the oracle) via a raw syscall. Each target self-skips (never fails) unless the host has (1) a clang cross target, (2) a runner (podman/qemu), (3) a working cc -S | kit as round-trip for that arch, and (4) a passing bounded exec smoke, so a wedged emulator downgrades to SKIP instead of hanging. Both lanes are judged purely by matching exit codes. Opt-in; both lanes skip cleanly without clang/podman.
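
The four preconditions form a ladder in which the first missing one decides the verdict. A Python sketch of that shape, with a hypothetical function name and reason strings (the actual probes live in the shell harness):

```python
def gate_target(has_clang_target, has_runner, roundtrip_ok, smoke_ok):
    """Return ("RUN", None) or ("SKIP", reason); a target never fails here."""
    checks = [
        (has_clang_target, "no clang cross target"),
        (has_runner, "no runner (podman/qemu)"),
        (roundtrip_ok, "cc -S | kit as round-trip not working"),
        (smoke_ok, "bounded exec smoke failed"),
    ]
    for ok, reason in checks:
        if not ok:
            return ("SKIP", reason)
    return ("RUN", None)

ready = gate_target(True, True, True, True)
wedged = gate_target(True, True, True, False)
```

A wedged emulator fails only the last check, so the target is reported as SKIP with a reason instead of hanging the suite.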

Shared cross-exec helper

test/lib/exec_target.sh is the one place that knows how to run a guest ELF for a <arch>-<os> target tag. It offers synchronous (exec_target_run) and batched-queue (exec_target_queue + exec_target_flush) modes. The batched mode groups queued cases by target and runs one podman run per group, amortizing podman's per-launch client round-trip across the whole suite; the in-container loop caps each case (EXEC_CASE_TIMEOUT) so one hanging binary can't wedge the batch and silently fail every later case. The helper picks native exec, then qemu-user, then a batched podman container, and reports "no runner" so callers SKIP cleanly. It is shared by the Toy cross lane, the hostas-cross lane, and the link/smoke/libc harnesses, so cross-exec policy lives in exactly one file.
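
Two of the helper's policies, runner preference and per-target batching, can be sketched in Python. The function names and the sample queue are hypothetical; the ordering and the grouping follow the description above.

```python
from collections import defaultdict

def pick_runner(is_native, has_qemu_user, has_podman):
    """Preference order: native exec, then qemu-user, then podman."""
    if is_native:
        return "native"
    if has_qemu_user:
        return "qemu-user"
    if has_podman:
        return "podman"
    return None   # "no runner": the caller reports SKIP

def group_queue(queue):
    """queue: list of (target_tag, binary). One batch per target tag."""
    batches = defaultdict(list)
    for tag, binary in queue:
        batches[tag].append(binary)
    return dict(batches)

queue = [
    ("aarch64-linux", "case_a"),
    ("riscv64-linux", "case_b"),
    ("aarch64-linux", "case_c"),
]
batches = group_queue(queue)
```

Three queued cases become two launches here, which is where the amortization of the per-launch cost comes from.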

The same tag interface also covers <arch>-freebsd and <arch>-windows, whose runner is a VM (test/lib/exec_vm.sh). A VM is expensive to boot and stateful, so unlike the stateless podman/qemu runners it has a lifecycle: exec_target_setup or the flush boots the VM lazily and once (one Windows VM serves both arches) and keeps it warm across flushes, and exec_target_teardown_all (installed via trap … EXIT) shuts down only VMs we booted, never a VM the user already had running. The transport is scripts/{freebsd,windows}_vm.sh run-batch, which ships a staging dir in, runs a generated entry script, and brings each binary's rc/out/err back; Windows exit codes are masked to 8 bits so .rc is a uniform POSIX-style status across every tag. So a harness gains real FreeBSD/Windows execution just by selecting the tag; test/toy/vm.sh and the hosted suite both do.
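
The 8-bit masking is a one-line operation; the Python sketch below (hypothetical function name) shows its effect on a few values.

```python
def posix_status(windows_exit_code):
    """Keep only the low 8 bits of a 32-bit Windows exit code."""
    return windows_exit_code & 0xFF

small = posix_status(42)           # already fits in 8 bits
wide = posix_status(0xC0000005)    # a wide 32-bit code keeps its low byte
```

After masking, an oracle written as a small exit code compares the same way on every tag.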

The Toy corpus as CG-API coverage

The Toy frontend (lang/toy/, see FRONTENDS.md) is a small language that exists to exercise the full CG API op set, and every case carries an exit-code oracle (test/toy/cases/<name>.expected). test/toy/run.sh is a Type C corpus harness that runs each case through several paths — its cf_corpus.sh lanes, each a distinct backend/seam — all judged against the same oracle:

  R   kit run                 in-process JIT, native
  I   kit run --no-jit        the IR interpreter (see INTERPRETER.md)
  L   kit cc -c | kit ld      native object → linked executable
  X   kit cc -target | ld     cross ELF (aa64/x64/rv64) run via exec_target
  C   kit cc --emit=c | host cc   the C-source backend (see CBACKEND.md)
  W   kit cc -target wasm32   the Wasm backend → re-lower → run (see WASM.md)

One corpus thus validates the JIT, the interpreter, the linker, cross targets, and the C and Wasm CGTargets — proving the CGTarget seam is frontend-agnostic. Paths the interpreter/C/Wasm targets don't yet implement emit a greppable "not supported"/"not yet implemented" diagnostic and report SKIP, so the suite signal stays "real regressions". test/toy/err/ holds compile-failure cases checked against an expected diagnostic substring.
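
The SKIP-versus-FAIL decision can be sketched as a small classifier. This is a Python illustration with a hypothetical function name; the regex is an assumption built from the two diagnostics quoted above, not the harness's actual pattern.

```python
import re

# Assumed pattern, built from the diagnostics quoted in the text.
UNSUPPORTED = re.compile(r"not supported|not yet implemented")

def classify(compile_ok, stderr, exit_code, oracle):
    """Map one lane run to PASS, FAIL or SKIP."""
    if not compile_ok:
        # An unimplemented path is a SKIP; any other build failure is real.
        return "SKIP" if UNSUPPORTED.search(stderr) else "FAIL"
    return "PASS" if exit_code == oracle else "FAIL"

verdicts = [
    classify(False, "wasm: op not yet implemented", None, 7),
    classify(False, "internal error", None, 7),
    classify(True, "", 7, 7),
    classify(True, "", 1, 7),
]
```

Only the first case is skipped; a build failure without the diagnostic and a wrong exit code both stay red.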

test/toy/vm.sh <freebsd|windows> adds two hosted-VM lanes on the same corpus + oracle, the hosted counterparts to the freestanding-Linux X lane: it links each case against a real OS sysroot (FreeBSD base.txz extract via scripts/freebsd_sysroot.sh, or the llvm-mingw UCRT sysroot) and runs the binary on the genuine OS in a VM, so the full hosted path — ABI, CRT startup, the platform loader, syscalls/Win32 — is exercised. It compiles every applicable case × opt × link-mode (so a codegen/link bug is caught even with no VM) and executes them through the shared seam (the <arch>-freebsd/<arch>-windows tags + exec_vm.sh lifecycle above), joining each exit code back to the oracle. FreeBSD covers amd64/aarch64/riscv64 (static + dynamic); Windows covers x64 (Prism emulation) + aarch64 on one ARM64 VM. These are opt-in (make test-toy-freebsd-vm / test-toy-windows-vm / test-toy-vm), not in the default set, since they need provisioned VMs + cross sysroots and amd64/riscv64 FreeBSD run under slow TCG. Inapplicable cases SKIP (a committed <name>.{freebsd,windows}.skip sidecar, the shared .link.skip, or arch-only *_aa64/*_x64/*_rv64 suffixes); genuine codegen gaps are left RED.

Because the Toy corpus is broad and oracle-carrying, it is reused for free coverage elsewhere: test/asm/roundtrip_toy.sh runs it through the L2 exec round-trip (cc -S | as | run and | ld | exec, exit must equal the oracle), and both host-assembler lanes use it. This reuse found a real miscompile (a multiply-high the disassembler couldn't decode, silently dropped by as until the .inst-emits-the-word fix) that the hand corpus never reached.

Driver and tool-level tests (Type K)

The kit multitool's command-line behavior is covered by Type K scripted harnesses on kit_sh_kit.sh, in two oracle modes:

Unit tests (Type U)

Lower-level invariants are covered by C unit-test binaries built from mk/test_unit.mk and including test/lib/kit_unit.h. There are two link flavors that differ only in what they can reach: UNIT_TESTS_AR link the public archive (internal symbols hidden, so they exercise the public surface), while UNIT_TESTS_OBJS link the raw objects (internal hidden symbols reachable). These cover ISA encode/decode (test/arch/), the CG API and ABI classification (test/api/), optimizer passes (test/opt/), DWARF roundtrip (test/dwarf/, test/debug/), object/link plumbing, the emulator, and the interpreter. Registering one is two lines (a stem + its _SRC), and both shared test headers are dependencies of every binary, so editing them rebuilds dependents.

Smoke tests per arch

test/smoke/x64.sh and test/smoke/rv64.sh are end-to-end sanity checks for the multi-arch exec pipeline itself. They build a tiny freestanding static ELF (a direct exit syscall in _start, no libc/relocations/PIE) with a cross clang, push it through exec_target_run and exec_target_queue+flush, and assert the expected exit code on both paths. The point is to validate the harness plumbing (cross-compile → podman/qemu → recorded rc) before relying on it for the heavier lanes, and to give a clear per-tool ok/MISSING diagnosis when a host lacks a runner. There is also a header smoke test (test/rt/smoke.c): every freestanding header must parse and expose its required macros/typedefs under a strict freestanding compile.

libc conformance vs glibc/musl

test/libc/ proves kit ld and the kit runtime interoperate with real, unmodified system libcs. Each case in test/libc/cases/*.c is compiled against an extracted sysroot and linked by kit ld against the real libc, then run and checked against an expected exit code and optional stdout substring:

This exercises the dynamic-link path of kit ld (PT_INTERP, PT_DYNAMIC / DT_NEEDED, .dynsym/.gnu.hash, .rela.plt/.got.plt) against a real loader — see LINK.md. Arch selection is KIT_LIBC_ARCHES (aa64/x64/rv64); a missing sysroot or runtime for an enabled arch is SKIP, not failure.

A related guard, test-lib-deps, asserts libkit.a's set of external (undefined) symbols matches a checked-in allowlist (scripts/lib_deps.allowlist) and that a relocatable link of the library exposes no non-public symbols. This keeps the freestanding library's dependency surface from drifting silently.

Hosted suite (cross-OS build + run)

test/hosted/ is a unified hosted-execution suite: each C case in test/hosted/cases/*.c is built for every (target, link-mode) config in the support set and run on it, checked against an exit-code + stdout oracle. It's the principled counterpart to the per-libc lanes above — one corpus, one runner seam, many OSes. The seed case is hello.c. Two verdicts per (case, config): :build (compile + link succeeded) and :run (correct exit code + stdout). The full matrix is 15 configs: Linux {aa64,x64,rv64} × {musl-static, musl-dynamic, glibc} + macOS-arm64 + Windows {x64,aarch64} + FreeBSD {amd64,aarch64,riscv64}. The libc rides in the exec tag (<arch>-linux = musl/alpine, <arch>-linux-glibc = glibc/debian), so a single flush routes every config to its runner (alpine/debian container, native, or VM).

The two pieces it composes are reusable on their own:

Default config set is Linux (all three libc/link shapes) + macOS; KIT_HOSTED_VM=1 (or make test-hosted-vm) adds the FreeBSD + Windows VM configs. On macOS, Linux binaries run under podman (no qemu-user on a Darwin host): musl in the pinned alpine images, glibc in per-arch Debian images (make test-hosted pulls both). Compiling against glibc headers needs kit's __restrict/__restrict__ keyword aliases (glibc uses them as bare GCC keywords and #undefs the fallback macro). Opt-in (make test-hosted), not in the default set; a missing sysroot/runner SKIPs, never fails.
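
The config matrix can be enumerated to check the count. In this Python sketch the config label strings are hypothetical (the suite's real tags may be spelled differently); the arches, libc/link shapes, and OS groups are taken from the description above.

```python
from itertools import product

linux = [f"{arch}-linux/{shape}"
         for arch, shape in product(("aa64", "x64", "rv64"),
                                    ("musl-static", "musl-dynamic", "glibc"))]
macos = ["arm64-macos"]
windows = [f"{arch}-windows" for arch in ("x64", "aarch64")]
freebsd = [f"{arch}-freebsd" for arch in ("amd64", "aarch64", "riscv64")]

full_matrix = linux + macos + windows + freebsd
default_set = linux + macos      # what runs without KIT_HOSTED_VM=1
vm_set = freebsd + windows       # added by KIT_HOSTED_VM=1
```

The full matrix has 9 + 1 + 2 + 3 = 15 entries, of which 10 are in the default set and 5 need a VM.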

Bootstrap reproducibility (stage2 == stage3)

The strongest end-to-end correctness signal is that kit can compile itself to a fixed point. The bootstrap (driven from the top-level Makefile, in both debug/-O0 and release/-O1 modes) builds:

  stage1  = host-built kit, copied into the bootstrap tree
  stage2  = stage1 compiling the full source tree
  stage3  = stage2 compiling the full source tree

The check is cmp stage2/kit stage3/kit — byte-identical binaries (and the recipe also compares every .o). If stage2 mis-compiles any part of kit, the two stages diverge, so this exercises a very large slice of the language, the ISA, the optimizer, the assembler, the linker, and the object writers at once, on real code rather than test fixtures. test-bootstrap-toy additionally runs the full Toy corpus through a bootstrapped compiler to confirm it not only reproduces but works. The diagnostic discipline that makes a bootstrap divergence tractable is to compare a single stage2-compiled object against the host compiler's output for the same TU — separating a malformed-object bug from a link-driver symptom — and to triage O1 codegen without -g, which perturbs object layout.
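
The fixed-point check itself is a plain byte comparison. A Python analogue of cmp, using temporary stand-in files rather than real stage binaries (the function name and file names are hypothetical):

```python
import filecmp
import os
import tempfile

def fixed_point(stage2_path, stage3_path):
    """Byte-for-byte comparison, the analogue of `cmp a b`."""
    return filecmp.cmp(stage2_path, stage3_path, shallow=False)

with tempfile.TemporaryDirectory() as tmp:
    s2 = os.path.join(tmp, "stage2-kit")
    s3 = os.path.join(tmp, "stage3-kit")
    for path in (s2, s3):
        with open(path, "wb") as f:
            f.write(b"\x7fELF stand-in contents")
    same = fixed_point(s2, s3)

    # A single differing byte is enough to break the fixed point.
    with open(s3, "ab") as f:
        f.write(b"!")
    diverged = not fixed_point(s2, s3)
```

shallow=False forces a content comparison instead of trusting file metadata, which matches what cmp does.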

Aggregation and conventions

mk/test.mk defines the targets. A default test aggregate runs the host-independent lanes (frontend corpora, unit tests, L0/L1 round-trip, the cf_corpus engine selftest, the libc-dep guard); the exec-dependent and second-oracle lanes (L2 exec, symmetry, diff-llvm, hostas-toy/cross, smoke, libc conformance) are opt-in so the default run stays host-independent and fast. Bootstrap is not part of this test system: it is a separate top-level target (make bootstrap, make bootstrap-debug/release) driven from the top-level Makefile, not from mk/test.mk — see the bootstrap section above. Conventions shared across all four test types:

Planned work: see doc/plan/.