Testing
kit's test architecture is built around one core idea: drive testing from
what codegen actually emits, not from a hand-written corpus of cases someone
remembered to write. A hand corpus only ever tests the instructions we thought
to write down; the compiler's own output exercises every instruction codegen
emits, and tracks codegen automatically as new instructions appear. This doc
describes the harness design — the layering, the invariants each lane locks in,
and why the lanes are shaped the way they are. For the surfaces these tests
exercise see ASM.md (assembler/disassembler/cc -S),
ARCH.md (per-arch ISA tables), and FRONTENDS.md
(the Toy and C frontends).
The test tree lives under test/, with per-area subdirectories (test/asm/,
test/toy/, test/smoke/, test/libc/, plus unit-test areas like
test/arch/, test/elf/, test/opt/). Build/run wiring lives in
mk/test.mk. Every harness conforms to one of four canonical test types,
each backed by a shared library under test/lib/, so test infrastructure is
written once and reused rather than re-invented per area.
Canonical test types
| Type | Library | What it is |
|---|---|---|
| U unit | kit_unit.h (+ test_unit.mk build manifest) | a C translation unit linked against libkit, self-checking in-process |
| C corpus | cf_corpus.sh | a directory of case files run through one or more lanes, each with its own oracle |
| K scripted | kit_sh_kit.sh | a hand-written shell test driving the kit binary, judged by golden transcript (mode G) or procedural asserts (mode P) |
| D differential | cf_differential.sh | correctness defined as agreement: vs a checked-in baseline, or vs a reference tool |
All four record results through one shared report layer,
test/lib/kit_sh_report.sh (cf_pass/cf_fail/cf_skip/cf_skip_na/
cf_xfail/cf_xpass/cf_time plus a unified cf_summary/cf_exit).
cf_skip.sh unifies the skip sources (sidecar files, phased-rollout diagnostic
regexes, target-tuple applicability), and exec_target.sh and exec_kernel.sh run
guest binaries for cross-arch lanes. The sections below describe what each
harness tests; the rest of this section describes how the harnesses are
structured.
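The shared report layer can be pictured with a small sketch. This is not the real kit_sh_report.sh: the r_* names are hypothetical stand-ins for the cf_* verbs, and the default of CF_SKIP_IS_FAILURE is assumed to be off here. The gating rule it models (XPASS always gates, SKIP gates only when the suite opts in and the run does not set KIT_TEST_ALLOW_SKIP=1) is the one this document states under its conventions.

```sh
#!/bin/sh
# Sketch only: r_* are hypothetical stand-ins, not the real cf_* verbs.
# Only the two environment variable names come from this document.
n_pass=0 n_fail=0 n_skip=0 n_xfail=0 n_xpass=0

r_pass()  { n_pass=$((n_pass + 1)); }
r_fail()  { n_fail=$((n_fail + 1)); }
r_skip()  { n_skip=$((n_skip + 1)); }
r_xfail() { n_xfail=$((n_xfail + 1)); }
r_xpass() { n_xpass=$((n_xpass + 1)); }

# One summary line for every harness, so nobody rolls their own counters.
r_summary() {
    echo "pass=$n_pass fail=$n_fail skip=$n_skip xfail=$n_xfail xpass=$n_xpass"
}

# Returns 0 when the suite should exit green, 1 otherwise.
r_status() {
    [ "$n_fail" -eq 0 ] || return 1
    [ "$n_xpass" -eq 0 ] || return 1      # a stale xfail marker always gates
    if [ "${CF_SKIP_IS_FAILURE:-0}" = 1 ] && [ "${KIT_TEST_ALLOW_SKIP:-0}" != 1 ]; then
        [ "$n_skip" -eq 0 ] || return 1   # SKIP gates only when opted in
    fi
    return 0
}

# Demo: two passes, one expected failure, one skip.
r_pass; r_pass; r_xfail; r_skip
r_summary
CF_SKIP_IS_FAILURE=1
r_status || echo "gated: SKIP counts as failure for this suite"
KIT_TEST_ALLOW_SKIP=1
r_status && echo "ok: SKIP allowed for this run"
```

Keeping the counters and the exit policy in one place is what lets a policy change (for example, making SKIP gate) apply to every harness at once.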
Type C — the corpus engine. cf_corpus.sh owns the whole pipeline:
discover a glob of cases, expand the {case × opt-level × target-tuple} matrix,
run each enabled lane's recipe, apply that lane's oracle, report, and flush any
deferred cross-arch exec. A runner supplies only its corpus glob, its lane set,
and a cf_lane_<ID>() hook per lane — which writes only under a per-case
$CF_WORK dir and records via exactly one cf_* verb. Because the oracle is
lane-local, exit-code, golden-diff, byte-compare, structural-grep, negative
(expect-failure) and cross-arch-exec are all just lane bodies, not separate
harness types; a suite with disjoint sub-corpora (asm's encode/decode/listing,
elf's layers) makes one cf_corpus_run call per sub-corpus. Execution is serial
or parallel through the same code path: in parallel mode each case runs as a
background worker that appends its results to an event file, and the parent
replays those events in index order — so counts, failing-name order, and the
cross-arch-exec queue are deterministic regardless of worker completion order,
and a runner goes parallel by flipping CF_PARALLELIZABLE with no other change.
The engine is proven by test/lib/cf_corpus_selftest.sh
(make test-cf-corpus-selftest): it asserts serial≡parallel determinism and the
load-bearing invariant that cross-arch exec is queued on the parent, never
inside a worker.
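The event-replay idea can be sketched in a few lines of shell. This is an illustration under stated assumptions, not the cf_corpus.sh implementation: run_case, the case names, and the "index verdict name" event format are all hypothetical. It shows only why replaying in index order makes the report independent of worker completion order.

```sh
#!/bin/sh
# Sketch only: run_case and the event-line format are hypothetical,
# not the real cf_corpus.sh internals.
events=$(mktemp)

# A worker "runs" one case and appends exactly one event line:
#   <index> <verdict> <name>
run_case() {
    idx=$1
    name=$2
    case $name in
        *bad*) verdict=FAIL ;;
        *)     verdict=PASS ;;
    esac
    printf '%s %s %s\n' "$idx" "$verdict" "$name" >> "$events"
}

# Dispatch every case as a background worker; completion order is arbitrary.
i=0
for name in alpha bad_beta gamma delta; do
    i=$((i + 1))
    run_case "$i" "$name" &
done
wait

# The parent replays events in index order, so counts and failing-name
# order do not depend on which worker finished first.
replay=$(sort -n "$events")
rm -f "$events"
printf '%s\n' "$replay"
```

Because workers only append events and the parent alone interprets them, serial mode can be the same code path with the workers run in the foreground.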
Type K — scripted shell. kit_sh_kit.sh is the single source point for a
hand-written tool/driver test: it pulls in the report layer plus the procedural
assert verbs (ok/run_ok/run_fail/contains/same_file/is_executable/
check_mode) and the golden-transcript runner cf_scenario_case. A suite picks
a mode — G diffs a cases/<name>.sh transcript against <name>.expected
(ar/strip/objcopy/strings/objdump, dbg); P runs procedural asserts over a
$work sandbox (cas/pkg, the driver CLI suite, the COFF Windows smokes). Mode-P
suites stay serial (they share one $work and mutate fixtures).
Type D — differential. cf_differential.sh is for tests whose oracle is
agreement rather than a fixed per-case answer: cf_diff_baseline regenerates a
normalized report and gates on a zero delta vs a checked-in snapshot
(CF_DIFF_UPDATE=1 refreshes it), and cf_diff_agree requires two independent
producers to match byte-for-byte (with explicit equivalence-skips). Used by the
asm self-symmetry sweep and the kit-vs-llvm-mc cross-check.
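A baseline gate of this shape might look like the sketch below. It is a hypothetical stand-in, not the real cf_diff_baseline: the function name diff_baseline, the one-entry-per-line report format, and the sort-based normalization are assumptions. Only the CF_DIFF_UPDATE=1 switch is taken from the text above.

```sh
#!/bin/sh
# Sketch only: diff_baseline is a hypothetical stand-in, not the real
# cf_diff_baseline. The report format (one entry per line) is assumed.

# diff_baseline <report-file> <baseline-file>
# Returns 0 on zero delta; CF_DIFF_UPDATE=1 refreshes the snapshot instead.
diff_baseline() {
    report=$1
    baseline=$2
    norm=$(mktemp)
    # Normalize so ordering noise and duplicates never show up as a delta.
    LC_ALL=C sort -u "$report" > "$norm"
    if [ "${CF_DIFF_UPDATE:-0}" = 1 ]; then
        cp "$norm" "$baseline"
        rc=0
    elif cmp -s "$norm" "$baseline"; then
        rc=0
    else
        diff "$baseline" "$norm"    # print the delta for the log
        rc=1
    fi
    rm -f "$norm"
    return "$rc"
}

# Demo with made-up entries.
work=$(mktemp -d)
printf 'form_a\nform_b\n' > "$work/baseline"
printf 'form_b\nform_a\n' > "$work/report"          # same set, other order
diff_baseline "$work/report" "$work/baseline" && echo "no delta"
printf 'form_b\nform_a\nform_c\n' > "$work/report"  # a new entry appears
diff_baseline "$work/report" "$work/baseline" > /dev/null || echo "delta: gate fails"
rm -rf "$work"
```

The gate fires on any change to the set, in either direction, which is what turns a regression (or a silently closed gap) into a reviewable diff.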
Why codegen-driven round-trip testing
cc -S is not a separate pretty-printer: it is the disassembler plus module
scaffolding (src/api/asm_emit.c). The same arch_disasm_decode surface backs
both cc -S and objdump -d, so -S text is kit's disassembly rendered as
re-assemblable assembly. That makes a single feedback loop possible:
source ──cc -c──▶ object (codegen's bytes + relocs)
   │
   └────cc -S──▶ assembly ──as──▶ object (re-encoded bytes + relocs)
If the disassembler and assembler are mutually faithful, the two objects are byte-and-reloc identical. Any instruction codegen emits is, by construction, in the loop — so the coverage tracks the compiler, not a static list.
The three completeness layers (L0/L1/L2)
test/asm/roundtrip.sh runs a C corpus (test/asm/roundtrip/, each case a
test_main) through three lanes, cheapest and sharpest first. They build on
each other: a higher lane is only trustworthy once the lower one passes.
L0  decode-completeness  cc -S, assert no in-function decode failure
L1  byte + reloc round   cc -c vs cc -S | as (.text bytes AND reloc table)
L2  exec equivalence     run direct.o vs run rt.o (exit code / stdout)
L0 — decode completeness. Disassemble the program and assert no instruction
codegen emitted failed to decode. This is host-independent (no execution),
cheap, and pinpoints the exact undecodable word. The signal must be
unambiguous: padding and data are not decode failures. The harness keys on the
.inst 0x<word> marker — a real assembler directive emitted only by the
disassembler's unknown-word path (aa64_write_unknown) — and counts it only
inside .text. Inter-function padding (.byte 0x0 fill) and data-section
.byte are explicitly distinct from the in-stream .inst marker, so "grep for
the marker inside .text" is an exact completeness test.
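The "count the marker only inside .text" check can be sketched with awk. This is a minimal sketch, not the harness's actual code: count_text_inst is a hypothetical name, and it assumes section switches appear as `.text`, `.data`, `.bss`, or `.section <name>` directives on their own lines. The sample listing and its instruction word are made up.

```sh
#!/bin/sh
# Sketch only: count_text_inst is hypothetical. It assumes section switches
# are spelled .text / .data / .bss / .section <name>, one per line.
count_text_inst() {
    awk '
        $1 == ".text"                   { sec = ".text"; next }
        $1 == ".data" || $1 == ".bss"   { sec = $1; next }
        $1 == ".section"                { sec = $2; sub(/,.*/, "", sec); next }
        sec == ".text" && $1 == ".inst" { n++ }
        END { print n + 0 }
    ' "$@"
}

# Demo: one undecoded word in .text; padding and data must not count.
count_text_inst <<'EOF'
        .text
add:
        add     w0, w0, w1
        .inst   0xdeadbeef
        ret
        .byte   0x0
        .section .rodata
        .byte   0x2a
        .inst   0x00000000
EOF
```

The demo prints 1: the `.byte` padding is ignored because it is not the marker, and the `.inst` under `.rodata` is ignored because it is outside `.text`.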
L1 — byte + reloc round-trip. Compile the source two ways: cc -c directly,
and cc -S | as. Diff both the .text bytes and the relocation table. The
reloc-table comparison is essential — a same-section branch resolved in place
versus kept as a relocation produces identical bytes but a different relocatable
object, and only the reloc diff catches it. L1 covers the sections cc -S
reproduces (.text, .rodata, .data, and .bss; switch jump tables live in
.rodata). .bss is NOBITS — it carries a size but no bytes, so byte
round-tripping does not apply to it; only its presence and symbols are checked.
L1 excludes sections -S does not emit (e.g. .eh_frame) so their absence in
the round-tripped object is not misread as a divergence. L1 is gated on L0: a
byte match is meaningless if the disassembler punted to a .byte/.inst
fallback (see the gotcha below).
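Why L1 needs both halves can be shown with a toy comparison. The sketch below assumes each object has already been dumped to two text files (a .text byte dump and a relocation listing); the helper name l1_verdict, the dump format, and the sample contents are all hypothetical.

```sh
#!/bin/sh
# Sketch only: l1_verdict and the dump formats are hypothetical.
# l1_verdict <bytes-a> <relocs-a> <bytes-b> <relocs-b>
l1_verdict() {
    if ! cmp -s "$1" "$3"; then
        echo "FAIL: .text bytes differ"
        return 1
    fi
    if ! cmp -s "$2" "$4"; then
        echo "FAIL: reloc table differs"
        return 1
    fi
    echo "PASS"
}

# Demo: identical bytes, but one side kept a relocation the other dropped.
work=$(mktemp -d)
printf 'aa bb cc dd\n'     > "$work/direct.text"
printf 'aa bb cc dd\n'     > "$work/rt.text"
printf '0x0 CALL26 add\n'  > "$work/direct.rel"
: > "$work/rt.rel"
l1_verdict "$work/direct.text" "$work/direct.rel" "$work/rt.text" "$work/rt.rel"
rm -rf "$work"
```

A bytes-only comparison would report PASS for this pair; only the second check surfaces the difference in the relocatable form.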
L2 — exec equivalence. Run direct.o and rt.o and compare exit codes (and
stdout, and an optional <name>.expected oracle). This is the end-to-end "it
runs the same" signal, tolerant of benign encoding differences L1 would flag.
Crucially, no qemu is needed for the host arch: execution goes through the
in-process JIT (kit run / the jit-runner), and cross-arch execution is
available via the emulator (kit emu) — see JIT.md and
EMU.md. L2 runs only when the target arch matches the host
(native JIT); otherwise it self-skips.
Opt levels matter: -O0 and -O1 emit different encodings, so each lane runs
at every level in KIT_TEST_OPTS.
The symbolization invariant
L1/L2 only work because cc -S is re-assemblable, not a listing. A listing
emits numeric branch targets (b 0x100) and de-symbolized relocated operands
(bl 0x11c instead of bl add); re-assembling that branches to the wrong place
and loads from address 0. The invariant the symbolizer maintains is: every
operand a relocation patches is rendered in the assembler's own reloc-operator
syntax (aa64 :lo12:/:got:, rv64 %pcrel_hi/%pcrel_lo, x64 sym(%rip)/
@PLT), and every intra-section branch/PC-relative target gets a synthesized
local label. The symbolizer is the inverse of what the assembler parses; see
ASM.md for the RelocKind→syntax mapping.
A second structural decision makes this robust across different assemblers:
code locations an encoding-divergent assembler must be able to recompute —
switch jump-table entries and &&label address-takes — are referenced through a
per-basic-block local symbol the emitter mints (mc_label_symbol,
src/arch/mc.c), uniformly on all three arches. The jump table emits
.quad .Lcfblk.* and the address-take a standard PC-relative relocation against
that block symbol, rather than a baked numeric offset. Because the reference is
genuinely relocatable, a third-party assembler's encoding choices (movabs vs
mov-imm32, jmp rel32 vs rel8, RVC compression) cannot shift a baked offset onto
the wrong instruction. cc -c and cc -S emit the same relocations, so the L1
byte/reloc lanes stay faithful too.
The .byte/.inst fallback gotcha
The disassembler's fallback for an undecodable word reproduces the exact original
bytes (.byte 0x.., or .inst 0x<word>). Re-assembling that fallback yields the
same bytes — so a run-only round-trip passes even when the disassembler is
incomplete. The decode/byte check (L0, and the reloc half of L1) must gate
before an exec round-trip is trusted, otherwise L2 green hides a real
disassembler gap. This is why the lanes are ordered, and why L0 keys on the
in-stream .inst marker specifically — it is the unambiguous "the disassembler
could not decode this word" signal.
Self-symmetry sweep + checked-in baseline
The codegen round-trip exercises the disassembler only on what the compiler
emits. test/asm/symmetry.sh complements it by sweeping the tools' own
instruction set for asm⊗disasm symmetry, independent of codegen:
- decode-side: test/arch/aa64_sweep_gen.c synthesizes one representative
  encoding per row of the disassembler's instruction table; each is decoded,
  the disassembly re-assembled, and decoded again. The text must be a fixed
  point. Catches a form the disassembler decodes but the assembler can't
  re-encode (decode-only), or where they disagree.
- encode-side: assemble every aa64 test/asm/encode/*.s and disassemble; any
  .inst means the assembler encodes a form the disassembler can't decode
  (encode-only).
The two tools cover slightly different ISA subsets on purpose (forms the
assembler accepts for completeness that codegen never emits, so the
disassembler never had to decode them). Known asymmetries live in a checked-in
snapshot, test/asm/symmetry.baseline; the sweep passes iff the current set
equals the baseline. So it gates against new asymmetry (a regression) while
the baseline documents the disasm-completeness backlog. Closing a gap shrinks
the baseline (symmetry.sh --update). This is the standard kit pattern: an
honest, checked-in "what is currently known-incomplete" snapshot that turns a
regression into a diff.
The diff-vs-llvm second oracle
"No decode failure" does not catch a wrong decode, and a self-round-trip
can't either, since kit's own re-encode would repeat the mistake.
test/asm/diff_llvm.sh adds an independent oracle (llvm-mc). The comparison is
byte-level, so it sidesteps disassembly-text normalization (movz-vs-mov,
#16-vs-#0x10); a text-level comparison would founder on alias/format
differences:
- encode lane: assemble every aa64 encode/*.s with both kit as and llvm-mc;
  the .text bytes must match. Validates kit's assembler.
- disasm lane: cc -c gives codegen's bytes; cc -S gives kit's disassembly as
  re-assemblable text; assemble that with llvm-mc and require the bytes to
  match codegen's. If llvm agrees the -S text means the original bytes, the
  decode is correct.
The one benign disagreement is recognized structurally: kit codegen keeps a
same-section CALL26/JUMP26/CONDBR relocation that llvm-mc (like GNU as) resolves
in place and drops. The bytes are link-equivalent, only the relocatable form
differs, so the reloc-table diff distinguishes it and it is not flagged.
Opt-in; skips cleanly when llvm-mc is absent.
Host-assembler execution lanes (cc -S is standard assembly)
The round-trip and diff-llvm lanes use either kit's own assembler or compare
bytes. A separate question is whether cc -S is standard assembly a
third-party assembler accepts and that means the same thing — judged by
execution, not bytes (kit and clang emit different but execution-equivalent
code, so a byte/text match would be meaningless).
test/asm/hostas_toy.sh answers it on the native target. Per Toy case (both
-O0/-O1) it emits one cc -S and feeds it to two assemblers, then links
each with kit ld and runs it, asserting the exit-code oracle:
cc -S ──┬── kit as ──kit ld──▶ ./a.out exit == oracle (lane A, baseline)
        └── clang -c ──kit ld──▶ ./b.out exit == oracle (lane B, the test)
The assembler is the only variable. Lane A is a baseline (kit both writes and
reads its own dialect, so a private-dialect quirk could hide); lane B is the real
test — a standard assembler can't paper over such a quirk. This is what proves
cc -S emits the clean dialect of the target object format: the format-divergent
directive spelling (.type/.size/.section/.p2align) lives behind an
AsmSyntax vtable selected by object format in src/api/asm_emit.c, and the
relocation operand syntax (ELF :lo12: vs Mach-O @PAGEOFF) behind a per-arch
ArchAsmOps.reloc_operand hook — keeping the printer free of arch-specific reloc
knowledge while staying format-correct. kit's own as parses the dialect of
its target too, so the single cc -S output assembles identically under both.
The clang-as lane gates by default; KIT_HOSTAS_ENFORCE_CLANG=0 demotes it to
XFAIL (useful while bringing up a new arch/format whose printer side isn't done).
test/asm/hostas_cross.sh is the cross extension: the same two-assembler-by-
execution test, but for ELF Linux targets (aarch64/x86_64/riscv64)
emitted with cc -S -target <triple>, assembled by both kit-as and clang,
linked into a static non-PIE ELF with kit ld -static, and run under
podman/qemu via test/lib/exec_target.sh. The executable is made runnable
with no libc/loader by linking the freestanding crt test/link/harness/start.c
(-Dtest_main=main): _start runs ctors, calls main, and exits with its
return (the oracle) via a raw syscall. Each target self-skips (never fails)
unless the host has (1) a clang cross target, (2) a runner (podman/qemu), (3) a
working cc -S | kit as round-trip for that arch, and (4) a passing bounded
exec smoke — so a wedged emulator downgrades to SKIP instead of hanging. Both
lanes are judged purely by matching exit codes. Opt-in; both lanes skip cleanly
without clang/podman.
Shared cross-exec helper
test/lib/exec_target.sh is the one place that knows how to run a guest ELF
for a <arch>-<os> target tag. It offers synchronous (exec_target_run) and
batched-queue (exec_target_queue + exec_target_flush) modes. The batched mode
groups queued cases by target and runs one podman run per group, amortizing
podman's per-launch client round-trip across the whole suite; the in-container
loop caps each case (EXEC_CASE_TIMEOUT) so one hanging binary can't wedge the
batch and silently fail every later case. The helper picks native exec, then
qemu-user, then a batched podman container, and reports "no runner" so callers
SKIP cleanly. It is shared by the Toy cross lane, the hostas-cross lane, and the
link/smoke/libc harnesses, so cross-exec policy lives in exactly one file.
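The batched mode's grouping can be sketched as follows. This is not exec_target.sh: queue_case, flush_queue and launch_group are hypothetical stand-ins for the queue/flush idea, launch_group only prints what it would run, and the target tags are illustrative.

```sh
#!/bin/sh
# Sketch only: queue_case/flush_queue/launch_group are hypothetical;
# launch_group fakes the runner instead of starting a container.
queue=$(mktemp)

queue_case() {          # queue_case <target-tag> <binary>
    printf '%s %s\n' "$1" "$2" >> "$queue"
}

launch_group() {        # one launch per target, many binaries per launch
    target=$1
    shift
    echo "launch $target: $*"
}

flush_queue() {
    for target in $(cut -d' ' -f1 "$queue" | LC_ALL=C sort -u); do
        # awk walks the queue in order, so per-target case order is kept.
        bins=$(awk -v t="$target" '$1 == t { printf "%s ", $2 }' "$queue")
        launch_group "$target" $bins
    done
    : > "$queue"        # the queue is empty again after a flush
}

# Demo: three cases, two targets, two launches.
queue_case aa64-linux a.elf
queue_case rv64-linux b.elf
queue_case aa64-linux c.elf
flush_queue
```

The demo issues two launches for three cases; with a real runner the saving is the per-launch cost multiplied by the number of cases sharing a target.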
The same tag interface also covers <arch>-freebsd and <arch>-windows, whose
runner is a VM (test/lib/exec_vm.sh). A VM is expensive to boot and
stateful, so unlike the stateless podman/qemu runners it has a lifecycle:
exec_target_setup/the flush boot the VM lazily and once (one Windows VM
serves both arches), keep it warm across flushes, and exec_target_teardown_all
(installed via trap … EXIT) shuts down only VMs we booted — never a VM the user
already had running. The transport is scripts/{freebsd,windows}_vm.sh run-batch,
which ships a staging dir in, runs a generated entry script, and brings each
binary's rc/out/err back; Windows exit codes are masked to 8 bits so .rc
is a uniform POSIX-style status across every tag. So a harness gains real
FreeBSD/Windows execution just by selecting the tag — test/toy/vm.sh and the
hosted suite both do.
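The 8-bit masking is plain arithmetic. The sketch below uses a hypothetical helper name, mask_rc, and assumes the mask is applied as a bitwise AND with 255; how the real transport applies it is not shown in this document.

```sh
#!/bin/sh
# Sketch only: mask_rc is a hypothetical helper name.
# Keep the low 8 bits of a (possibly 32-bit) exit code.
mask_rc() {
    echo $(( $1 & 255 ))
}

mask_rc 3             # prints 3
mask_rc 3221225477    # 0xC0000005 -> prints 5
```

One consequence worth keeping in mind under this assumption: any code whose low byte is zero (256, for instance) masks to 0, so an oracle should not rely on such values to signal failure.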
The Toy corpus as CG-API coverage
The Toy frontend (lang/toy/, see FRONTENDS.md) is a small
language that exists to exercise the full CG API op set, and every case carries
an exit-code oracle (test/toy/cases/<name>.expected). test/toy/run.sh is a
Type C corpus harness that runs each case through several paths — its
cf_corpus.sh lanes, each a distinct backend/seam — all judged against the same
oracle:
R  kit run                    in-process JIT, native
I  kit run --no-jit           the IR interpreter (see INTERPRETER.md)
L  kit cc -c | kit ld         native object → linked executable
X  kit cc -target | ld        cross ELF (aa64/x64/rv64) run via exec_target
C  kit cc --emit=c | host cc  the C-source backend (see CBACKEND.md)
W  kit cc -target wasm32      the Wasm backend → re-lower → run (see WASM.md)
One corpus thus validates the JIT, the interpreter, the linker, cross targets,
and the C and Wasm CGTargets — proving the CGTarget seam is frontend-agnostic.
Paths the interpreter/C/Wasm targets don't yet implement emit a greppable
"not supported"/"not yet implemented" diagnostic and report SKIP, so the suite
signal stays "real regressions". test/toy/err/ holds compile-failure cases
checked against an expected diagnostic substring.
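How a lane might turn a run into a verdict, including the phased-rollout SKIP, is sketched below. lane_verdict is a hypothetical helper and the sample diagnostic text is made up; only the two greppable phrases come from the paragraph above.

```sh
#!/bin/sh
# Sketch only: lane_verdict is hypothetical.
# lane_verdict <expected-rc> <actual-rc> <stderr-file>
lane_verdict() {
    # An unimplemented path announces itself; that is a SKIP, not a FAIL.
    if grep -Eq 'not supported|not yet implemented' "$3"; then
        echo SKIP
    elif [ "$2" -eq "$1" ]; then
        echo PASS
    else
        echo FAIL
    fi
}

# Demo with a made-up diagnostic line.
err=$(mktemp)
echo "error: op not yet implemented for this target" > "$err"
lane_verdict 7 1 "$err"    # SKIP
: > "$err"
lane_verdict 7 7 "$err"    # PASS
lane_verdict 7 3 "$err"    # FAIL
rm -f "$err"
```

Checking the diagnostic before the exit code matters: an unimplemented path usually also exits non-zero, and without the first branch it would be reported as a regression.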
test/toy/vm.sh <freebsd|windows> adds two hosted-VM lanes on the same
corpus + oracle, the hosted counterparts to the freestanding-Linux X lane: it
links each case against a real OS sysroot (FreeBSD base.txz extract via
scripts/freebsd_sysroot.sh, or the llvm-mingw UCRT sysroot) and runs the
binary on the genuine OS in a VM, so the full hosted path — ABI, CRT startup,
the platform loader, syscalls/Win32 — is exercised. It compiles every applicable
case × opt × link-mode (so a codegen/link bug is caught even with no VM) and
executes them through the shared seam (the <arch>-freebsd/<arch>-windows
tags + exec_vm.sh lifecycle above), joining each exit code back to the oracle.
FreeBSD covers amd64/aarch64/riscv64 (static + dynamic); Windows covers x64
(Prism emulation) + aarch64 on one ARM64 VM. These
are opt-in (make test-toy-freebsd-vm / test-toy-windows-vm / test-toy-vm),
not in the default set, since they need provisioned VMs + cross sysroots and
amd64/riscv64 FreeBSD run under slow TCG. Inapplicable cases SKIP (a committed
<name>.{freebsd,windows}.skip sidecar, the shared .link.skip, or arch-only
*_aa64/*_x64/*_rv64 suffixes); genuine codegen gaps are left RED.
Because the Toy corpus is broad and oracle-carrying, it is reused for free
coverage elsewhere: test/asm/roundtrip_toy.sh runs it through the L2 exec
round-trip (cc -S | as | run and | ld | exec, exit must equal the oracle),
and both host-assembler lanes use it. This reuse found a real miscompile (a
multiply-high the disassembler couldn't decode, silently dropped by as until
the .inst-emits-the-word fix) that the hand corpus never reached.
Driver and tool-level tests (Type K)
The kit multitool's command-line behavior is covered by Type K scripted
harnesses on kit_sh_kit.sh, in two oracle modes:
- mode G (golden transcript): test/{ar,strip,objcopy,strings,objdump}/run.sh
  and test/dbg/run.sh each run a directory of cases/<name>.sh scripts in a
  sandbox and diff combined stdout+stderr against a checked-in
  <name>.expected. dbg additionally uses xfail/xpass: an xfail case that fails
  is expected (cf_xfail); one that unexpectedly passes (cf_xpass) is always a
  failure, since the stale marker should be removed; and DBG_STRICT_XFAIL
  promotes even an expected failure to a hard error.
- mode P (procedural asserts): test/driver/run.sh (the
  cc/ld/ar/nm/size/addr2line/… behavior suite), test/cas/run.sh,
  test/pkg/run.sh, and the COFF Windows smokes
  (test/coff/windows-*-smoke.sh) drive a sequence of kit invocations over a
  $work sandbox and assert outcomes with
  run_ok/run_fail/contains/same_file/check_mode. These stay serial (they share
  $work and mutate fixtures).
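The xfail/xpass rules for the dbg suite reduce to a small decision table, sketched here. xfail_verdict is a hypothetical helper, and modelling DBG_STRICT_XFAIL as a third argument is an assumption made to keep the sketch self-contained.

```sh
#!/bin/sh
# Sketch only: xfail_verdict is hypothetical.
# xfail_verdict <marked-xfail: 0|1> <case-rc> [strict: 0|1]
xfail_verdict() {
    marked=$1
    rc=$2
    strict=${3:-0}
    if [ "$marked" = 1 ]; then
        if [ "$rc" -eq 0 ]; then
            echo XPASS      # stale marker: always counts as a failure
        elif [ "$strict" = 1 ]; then
            echo FAIL       # strict mode promotes the expected failure
        else
            echo XFAIL      # known failure, does not gate
        fi
    elif [ "$rc" -eq 0 ]; then
        echo PASS
    else
        echo FAIL
    fi
}

xfail_verdict 0 0      # PASS
xfail_verdict 1 1      # XFAIL
xfail_verdict 1 0      # XPASS
xfail_verdict 1 1 1    # FAIL
```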
Unit tests (Type U)
Lower-level invariants are covered by C unit-test binaries built from
mk/test_unit.mk and linking test/lib/kit_unit.h. There are two link
flavors that differ only in what they can reach: UNIT_TESTS_AR link the public
archive (internal symbols hidden — exercises the public surface), UNIT_TESTS_OBJS
link the raw objects (internal hidden symbols reachable). These cover ISA
encode/decode (test/arch/), the CG API and ABI classification (test/api/),
optimizer passes (test/opt/), DWARF roundtrip (test/dwarf/, test/debug/),
object/link plumbing, the emulator, and the interpreter. Registering one is two
lines (a stem + its _SRC), and both shared test headers are dependencies of
every binary so editing them rebuilds dependents.
Smoke tests per arch
test/smoke/x64.sh and test/smoke/rv64.sh are end-to-end sanity checks for the
multi-arch exec pipeline itself. They build a tiny freestanding static ELF (a
direct exit syscall in _start, no libc/relocations/PIE) with a cross clang,
push it through exec_target_run and exec_target_queue+flush, and assert the
expected exit code on both paths. The point is to validate the harness plumbing
(cross-compile → podman/qemu → recorded rc) before relying on it for the heavier
lanes, and to give a clear per-tool ok/MISSING diagnosis when a host lacks a
runner. There is also a header smoke test (test/rt/smoke.c): every freestanding
header must parse and expose its required macros/typedefs under a strict
freestanding compile.
libc conformance vs glibc/musl
test/libc/ proves kit ld and the kit runtime interoperate with real,
unmodified system libcs. Each case in test/libc/cases/*.c is compiled against an
extracted sysroot and linked by kit ld against the real libc, then run and
checked against an expected exit code and optional stdout substring:
- test/libc/musl/: Alpine + musl, where the loader is the libc itself.
- test/libc/glibc/: Debian + glibc, dynamic-link only (static glibc relies on
  dlopen-loaded NSS and isn't a real deployment shape). glibc's loader is a
  separate ELF, so the run passes -dynamic-linker and hands libc.so.6 plus
  libc_nonshared.a directly (kit ld doesn't parse the GROUP linker script).
This exercises the dynamic-link path of kit ld (PT_INTERP, PT_DYNAMIC /
DT_NEEDED, .dynsym/.gnu.hash, .rela.plt/.got.plt) against a real loader —
see LINK.md. Arch selection is KIT_LIBC_ARCHES (aa64/x64/rv64);
a missing sysroot or runtime for an enabled arch is SKIP, not failure.
A related guard, test-lib-deps, asserts libkit.a's set of external
(undefined) symbols matches a checked-in allowlist (scripts/lib_deps.allowlist)
and that a relocatable link of the library exposes no non-public symbols. This
keeps the freestanding library's dependency surface from drifting silently.
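An allowlist guard of this kind is a set comparison. The sketch below is not the test-lib-deps recipe: check_deps is a hypothetical helper, both inputs are assumed to be plain one-symbol-per-line files (extracting the undefined symbols from the archive is left out), and the symbol names in the demo are made up.

```sh
#!/bin/sh
# Sketch only: check_deps is hypothetical; inputs are one symbol per line.
# check_deps <actual-undefined-symbols> <allowlist>
check_deps() {
    a=$(mktemp)
    b=$(mktemp)
    LC_ALL=C sort -u "$1" > "$a"
    LC_ALL=C sort -u "$2" > "$b"
    extra=$(LC_ALL=C comm -23 "$a" "$b")   # used but not on the allowlist
    gone=$(LC_ALL=C comm -13 "$a" "$b")    # on the allowlist but unused
    rm -f "$a" "$b"
    # Unquoted on purpose: printf repeats the format once per symbol.
    [ -n "$extra" ] && printf 'unexpected: %s\n' $extra
    [ -n "$gone" ]  && printf 'stale: %s\n' $gone
    [ -z "$extra$gone" ]                   # status: 0 only on an exact match
}

# Demo with made-up symbol names.
work=$(mktemp -d)
printf 'memcpy\nmemset\n'         > "$work/allow"
printf 'memset\nmemcpy\nmalloc\n' > "$work/actual"
check_deps "$work/actual" "$work/allow" || echo "dependency surface drifted"
rm -rf "$work"
```

Reporting stale allowlist entries as well as unexpected symbols keeps the checked-in list an exact description of the library rather than an upper bound.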
Hosted suite (cross-OS build + run)
test/hosted/ is a unified hosted-execution suite: each C case in
test/hosted/cases/*.c is built for every (target, link-mode) config in the
support set and run on it, checked against an exit-code + stdout oracle. It's
the principled counterpart to the per-libc lanes above — one corpus, one runner
seam, many OSes. The seed case is hello.c. Two verdicts per (case, config):
:build (compile + link succeeded) and :run (correct exit code + stdout). The
full matrix is 15 configs: Linux {aa64,x64,rv64} × {musl-static, musl-dynamic,
glibc} + macOS-arm64 + Windows {x64,aarch64} + FreeBSD {amd64,aarch64,riscv64}.
The libc rides in the exec tag (<arch>-linux = musl/alpine,
<arch>-linux-glibc = glibc/debian), so a single flush routes every config to
its runner (alpine/debian container, native, or VM).
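The 15-config count can be checked by enumerating the matrix. The labels printed below are illustrative only: they are not the real exec tags or hosted.sh target names, and the link-mode suffixes are an assumption about how one might spell each shape.

```sh
#!/bin/sh
# Sketch only: the printed labels are illustrative, not real tags.
configs() {
    for arch in aa64 x64 rv64; do
        for shape in musl-static musl-dynamic glibc; do
            echo "linux-$arch $shape"
        done
    done
    echo "macos-arm64"
    for arch in x64 aarch64; do
        echo "windows-$arch"
    done
    for arch in amd64 aarch64 riscv64; do
        echo "freebsd-$arch"
    done
}

configs
echo "total: $(configs | grep -c .)"
```

The breakdown is 3 arches × 3 Linux shapes = 9, plus 1 macOS, 2 Windows and 3 FreeBSD configs, which gives the 15 quoted above.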
The two pieces it composes are reusable on their own:
- scripts/hosted.sh: one front-end over every sysroot provisioner. Targets
  are <os>[-<libc>]-<arch> (linux-{glibc,musl}-{aa64,x64,rv64},
  freebsd-{amd64,aarch64,riscv64}, windows-{x64,aarch64}, macos-aarch64).
  prepare provisions the sysroot (wrapping freebsd_sysroot.sh /
  llvm_mingw_sysroot.sh / the libc container extract.sh); cc compiles+links
  for a target (adding the right -target/--sysroot/-mconsole/-isysroot);
  path/triple/tag/doctor round it out.
- test/lib/exec_target.sh runs the result (native / qemu-user / podman / VM),
  picking the runner from the tag.
Default config set is Linux (all three libc/link shapes) + macOS;
KIT_HOSTED_VM=1 (or make test-hosted-vm) adds the FreeBSD + Windows VM
configs. On macOS, Linux binaries run under podman (no qemu-user on a Darwin
host): musl in the pinned alpine images, glibc in per-arch Debian images
(make test-hosted pulls both). Compiling against glibc headers needs kit's
__restrict/__restrict__ keyword aliases (glibc uses them as bare GCC
keywords and #undefs the fallback macro). Opt-in (make test-hosted), not in
the default set; a missing sysroot/runner SKIPs, never fails.
Bootstrap reproducibility (stage2 == stage3)
The strongest end-to-end correctness signal is that kit can compile itself to a
fixed point. The bootstrap (driven from the top-level Makefile, in both
debug/-O0 and release/-O1 modes) builds:
stage1 = host-built kit, copied into the bootstrap tree
stage2 = stage1 compiling the full source tree
stage3 = stage2 compiling the full source tree
The check is cmp stage2/kit stage3/kit — byte-identical binaries (and the
recipe also compares every .o). If stage2 mis-compiles any part of kit, the
two stages diverge, so this exercises a very large slice of the language, the
ISA, the optimizer, the assembler, the linker, and the object writers at once, on
real code rather than test fixtures. test-bootstrap-toy additionally runs the
full Toy corpus through a bootstrapped compiler to confirm it not only reproduces
but works. The diagnostic discipline that makes a bootstrap divergence tractable
is to compare a single stage2-compiled object against the host compiler's output
for the same TU — separating a malformed-object bug from a link-driver symptom —
and to triage O1 codegen without -g, which perturbs object layout.
Aggregation and conventions
mk/test.mk defines the targets. A default test aggregate runs the
host-independent lanes (frontend corpora, unit tests, L0/L1 round-trip, the
cf_corpus engine selftest, the libc-dep guard); the exec-dependent and
second-oracle lanes (L2 exec, symmetry, diff-llvm, hostas-toy/cross, smoke, libc
conformance) are opt-in so the default run stays host-independent and fast.
Bootstrap is not part of this test system: it is a separate top-level target
(make bootstrap, make bootstrap-debug/release) driven from the top-level
Makefile, not from mk/test.mk — see the bootstrap section above.
Conventions shared across all four test types:
- One report layer. Every shell harness records through
  cf_pass/cf_fail/cf_skip/cf_skip_na/cf_xfail/cf_xpass and ends with
  cf_summary + cf_exit; no harness rolls its own counters or summary. XPASS
  always gates the exit; SKIP gates only when the suite opts in
  (CF_SKIP_IS_FAILURE=1, the corpus default), overridable per run with
  KIT_TEST_ALLOW_SKIP=1.
- SKIP, never silently pass. A lane that can't run (no runner, no cross
  toolchain, an unimplemented backend path) reports SKIP with a reason via the
  shared cf_skip.sh (sidecar / phased-rollout diagnostic regex / target-tuple
  applicability) rather than vanishing; a structurally-inapplicable case is an
  uncounted SKIP-NA.
- Parallel by a flag flip. A Type C corpus runs serial or parallel through
  the same cf_corpus.sh event-replay path; hooks that write only under
  $CF_WORK and record only via cf_* are parallel-safe by construction, so
  CF_PARALLELIZABLE toggles dispatch with no other change.
- Checked-in baselines turn incompleteness into a regression diff. The
  symmetry baseline and the per-case .skip sidecars document what is
  known-incomplete; the gate fires only on a change to that set.
- Per-case sidecars over conditionals in the harness. Applicability and
  expectations ride alongside the case (<name>.expected, <name>.targets,
  <name>.skip, <name>.objdump, the err/ cases), so adding or quarantining a
  case is a data change, not a script edit.
Planned work: see doc/plan/.