ELF Test Corpus — Target Coverage
What the test/elf/ corpus should cover for full ELF object-file support,
independent of kit's current implementation state. Each row is a
distinct case worth a discrete test (unit/, cases/, exec/, or bad/);
groups starred (★) are highest-leverage and should land first.
Conventions:
- U = `unit/` (hand-built ObjBuilder roundtrip)
- C = `cases/` (clang `.o` → kit-roundtrip → structural+behavioral diff)
- E = `exec/` (kit-link → run, behavioral oracle)
- B = `bad/` (negative read_elf input)
1. ELF header / target identification ★
| Case | Layer | Notes |
|---|---|---|
| `e_machine` per supported arch | C, U | aarch64, x86_64, riscv64, riscv32, arm32 (when supported) |
| `ELFCLASS32` 32-bit ELF | C, U | independent matrix from class |
| `ELFDATA2MSB` big-endian | C, U | aarch64-be, mips, etc. |
| `ELFOSABI_*` variations | U | NONE / LINUX / FREEBSD; emitter must round-trip whatever it reads |
| `e_flags` per-arch | C | RISC-V RVE/RVC, ARM EABI version |
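
The identification fields in the table above can be checked with very little code. The following is a minimal sketch of such a check, not kit's actual reader: `ElfIdent` and `elf_ident_parse` are hypothetical names introduced for illustration. It relies only on the standard header layout (`EI_CLASS` at byte 4, `EI_DATA` at byte 5, `EI_OSABI` at byte 7, `e_machine` at offset 18 in both ELF32 and ELF64).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical result type for illustration; not part of kit. */
typedef struct {
    int      is_64;    /* 1 for ELFCLASS64, 0 for ELFCLASS32 */
    int      is_be;    /* 1 for ELFDATA2MSB, 0 for ELFDATA2LSB */
    uint8_t  osabi;    /* e_ident[EI_OSABI], kept verbatim for round-trip */
    uint16_t machine;  /* e_machine, decoded in the file's byte order */
} ElfIdent;

/* Returns 0 on success, -1 if the buffer is not a plausible ELF header. */
int elf_ident_parse(const uint8_t *buf, size_t len, ElfIdent *out)
{
    assert(buf != NULL && out != NULL);
    if (len < 20) return -1;                        /* e_ident + e_type + e_machine */
    if (memcmp(buf, "\x7f" "ELF", 4) != 0) return -1;
    if (buf[4] != 1 && buf[4] != 2) return -1;      /* EI_CLASS: 1 = 32-bit, 2 = 64-bit */
    if (buf[5] != 1 && buf[5] != 2) return -1;      /* EI_DATA: 1 = LSB, 2 = MSB */
    out->is_64 = (buf[4] == 2);
    out->is_be = (buf[5] == 2);
    out->osabi = buf[7];
    out->machine = out->is_be ? (uint16_t)((buf[18] << 8) | buf[19])
                              : (uint16_t)((buf[19] << 8) | buf[18]);
    return 0;
}
```

A U-layer test could build a header for each class × endianness × machine combination and assert that the parsed triple matches what was written.
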
2. Section types ★
| `sh_type` | Layer | Specific case |
|---|---|---|
| `PROGBITS` (text/data/rodata) | C, U | covered by `01_return42` etc. |
| `NOBITS` | C, U | `.bss` of various sizes; `sh_addralign` 1/8/64/4096 |
| `SYMTAB` / `STRTAB` | U | round-trip preserves logical content (not byte layout) |
| `RELA` / `REL` | C | both encodings; `sh_flags & SHF_INFO_LINK` |
| `NOTE` | C | `.note.gnu.build-id`, `.note.ABI-tag`, `.note.gnu.property` |
| `INIT_ARRAY` / `FINI_ARRAY` / `PREINIT_ARRAY` | C, E | constructor ordering across TUs |
| `GROUP` (COMDAT) | C | C++-style inline funcs across two TUs (`09_comdat_inline.c`) |
| `LLVM_ADDRSIG` and other custom | C | unknown `sh_type` must round-trip via raw-type preservation |
| `GNU_HASH` / `HASH` | C | dynamic objects (post-shared-lib support) |
| `DYNSYM` / `DYNAMIC` | C | shared objects |
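
The `NOBITS` row varies `sh_addralign` across 1/8/64/4096. A small sketch of the checks a unit test might make on those values follows. The helper names are hypothetical, not kit's API. The ELF specification allows only 0 and powers of two for `sh_addralign`, with 0 and 1 both meaning "no constraint".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not part of kit. */

/* Valid sh_addralign values are 0 or a power of two. */
int addralign_is_valid(uint64_t align)
{
    return align == 0 || (align & (align - 1)) == 0;
}

/* Round off up to the next multiple of align (0 and 1 leave it unchanged).
 * Assumes off + align - 1 does not overflow 64 bits. */
uint64_t align_up(uint64_t off, uint64_t align)
{
    assert(addralign_is_valid(align));
    if (align <= 1) return off;
    return (off + align - 1) & ~(align - 1);
}
```

A round-trip test can then assert that every emitted section's address or offset equals `align_up` of itself for the alignment it declares.
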
3. Section flags
| Flag | Coverage |
|---|---|
| `SHF_ALLOC` / `WRITE` / `EXECINSTR` | implicit in every case |
| `SHF_TLS` | `.tdata` / `.tbss` |
| `SHF_MERGE` + `SHF_STRINGS` | `.rodata.str1.1` / `.debug_str` |
| `SHF_MERGE` + fixed `sh_entsize` | `.rodata.cst{4,8,16}` constant pools |
| `SHF_GROUP` | every section inside a COMDAT |
| `SHF_LINK_ORDER` | `-flto` outputs, `.gcc_except_table` |
| `SHF_INFO_LINK` | every `.rela.*` |
| `SHF_EXCLUDE` | `.llvm_addrsig`; linker drop hint |
| `SHF_COMPRESSED` | zlib/zstd-compressed `.debug_*` |
4. Symbol coverage ★
Bindings: STB_LOCAL, STB_GLOBAL, STB_WEAK. (STB_GNU_UNIQUE if kit
ever needs it.)
Types: STT_NOTYPE, STT_FUNC, STT_OBJECT, STT_SECTION, STT_FILE,
STT_COMMON, STT_TLS, STT_GNU_IFUNC.
Visibility: STV_DEFAULT, STV_HIDDEN, STV_PROTECTED, STV_INTERNAL.
shndx values: ordinary index, SHN_UNDEF, SHN_ABS, SHN_COMMON,
SHN_XINDEX (extended for >65279 sections).
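
Binding and type share the `st_info` byte, and visibility lives in the low two bits of `st_other`. The sketch below shows that packing so a matrix test can iterate over the combinations. `SymAttrs`, `sym_attrs_decode`, and `sym_info_encode` are hypothetical names introduced here. The bit layout matches the standard `ELF64_ST_BIND`, `ELF64_ST_TYPE`, and `ELF64_ST_VISIBILITY` macros.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical decoded view of a symbol's attribute bytes; not part of kit. */
typedef struct {
    unsigned bind;  /* STB_*: 0 LOCAL, 1 GLOBAL, 2 WEAK, 10 GNU_UNIQUE */
    unsigned type;  /* STT_*: 0 NOTYPE, 1 OBJECT, 2 FUNC, 3 SECTION, 4 FILE,
                              5 COMMON, 6 TLS, 10 GNU_IFUNC */
    unsigned vis;   /* STV_*: 0 DEFAULT, 1 INTERNAL, 2 HIDDEN, 3 PROTECTED */
} SymAttrs;

SymAttrs sym_attrs_decode(uint8_t st_info, uint8_t st_other)
{
    SymAttrs a;
    a.bind = st_info >> 4;     /* high nibble of st_info */
    a.type = st_info & 0xf;    /* low nibble of st_info */
    a.vis  = st_other & 0x3;   /* low two bits of st_other */
    return a;
}

uint8_t sym_info_encode(unsigned bind, unsigned type)
{
    assert(bind <= 0xf && type <= 0xf);
    return (uint8_t)((bind << 4) | type);
}
```

A round-trip test would encode each combination the emitter can produce, decode it, and compare all three fields.
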
Cases:
| Case | Layer |
|---|---|
| Plain global function definition | C |
| Static (file-local) function | C |
| Tentative definition (common) | C |
| `__attribute__((weak))` defined and undefined | C |
| `__attribute__((visibility("hidden")))` | C |
| TLS variable (`__thread`) | C, E |
| IFUNC (`__attribute__((ifunc("resolver")))`) | C, E |
| Aliased symbols (multiple names, same address) | C |
| Section symbols as relocation targets | C |
| File symbol (`STT_FILE`) round-trip | C |
| AArch64 mapping symbols `$x` / `$d` (`STT_NOTYPE` on defined sym) | C |
5. Relocation coverage ★
For each supported arch, every reloc kind kit's RelocKind enum maps must
have a unit test (round-trip) AND a behavioral test (linked + run gives
the right value).
AArch64
| Reloc | Test | Notes |
|---|---|---|
| `R_AARCH64_NONE` | U | sentinel |
| `R_AARCH64_ABS64` / `ABS32` | C, E | data pointers, absolute jump tables |
| `R_AARCH64_PREL64` / `PREL32` | C, E | `.eh_frame` FDE pointers |
| `R_AARCH64_CALL26` / `JUMP26` | E | direct calls, tail calls |
| `R_AARCH64_ADR_PREL_PG_HI21` + `ADD_ABS_LO12_NC` | E | small-model PIC addressing |
| `R_AARCH64_LDST{8,16,32,64,128}_ABS_LO12_NC` | E | LDR/STR offset materialization |
| `R_AARCH64_GOT_*` family | C, E | shared-lib path |
| `R_AARCH64_TLSGD_*` / `TLSIE_*` / `TLSLE_*` / `TLSDESC_*` | C, E | TLS access models |
| `R_AARCH64_PLT32` | C, E | PIE/shared call through PLT |
x86_64 (when added)
R_X86_64_64, _32, _PC32, _PC64, _PLT32, _GOTPCREL, _GOTPCRELX,
_REX_GOTPCRELX, _TLSGD, _GOTTPOFF, _TPOFF32/64, _DTPOFF32/64.
RISC-V (when added)
R_RISCV_HI20/_LO12_I/_LO12_S, _BRANCH, _JAL, _CALL_PLT,
_PCREL_HI20/_PCREL_LO12_*, _RELAX, _TLS_GD_HI20, etc.
Reloc edge cases (any arch)
- Zero addend, positive, negative, near-overflow
- `r->sym == OBJ_SYM_NONE` (rare but legal; section-relative)
- `RELA` vs `REL` encodings on archs that distinguish (x86)
- Pair relocations (`R_AARCH64_LD_PREL_LO19` paired with prefetch hint)
- Relocations targeting weak undef (resolves to 0)
- Relocations targeting common symbols
- Relocations across COMDAT-merged content
6. Special sections
| Section | Coverage |
|---|---|
| `.text.<fnname>` (function sections) | C: `-ffunction-sections` |
| `.data.<varname>` | C: `-fdata-sections` |
| `.data.rel.ro` | C: relocatable read-only data |
| `.init_array.NNN` / `.fini_array.NNN` | E: priority ctors/dtors |
| `.tdata` / `.tbss` | C, E: TLS |
| `.gcc_except_table` + `.eh_frame` | C: exception tables |
| `.note.gnu.build-id` | C: reproducible-build identity |
| `.note.gnu.property` | C: CET/BTI/PAC markers (AArch64-BTI) |
| `.ARM.attributes` / `.riscv.attributes` / `.note.ABI-tag` | C |
| `.gnu.linkonce.t.<sym>` (legacy COMDAT) | C |
| `.debug_*` (DWARF) | C: opaque preservation; semantic equivalence later |
| `.eh_frame_hdr` | C: when shared/exe path emits it |
| `.got` / `.got.plt` / `.plt` | E: shared-lib link path |
| `.dynamic` / `.dynstr` / `.dynsym` | E: shared-object output |
7. Layout / structure edge cases
- Empty `.o` (NULL section only)
- Section count = 1, 65279, > 65279 (extended indexing via `SHN_XINDEX` in section 0)
- Symbol count > 65535 (ELF64 `r_info` symbol indices are 32-bit, so no extension mechanism applies; this guards against 16-bit index truncation)
- Very large `.strtab` (> 1 MB)
- Sections with `sh_addralign` of 1, 4, 8, 16, 64, 4096
- Per-section `sh_entsize` (mergeable, symtab, rela)
- Self-referential relocations (`X = &X + 8`)
- Multiple sections with the same name (legal pre-merge, common with `-ffunction-sections`)
8. Archive (.a) ★
| Case | Layer |
|---|---|
| Empty archive | B |
| Single `.o` member | C-like (separate ar harness) |
| Multiple members, dependency on later member | E |
| BSD vs SysV format | C |
| Symbol index (`/` / `__.SYMDEF`) present and absent | C, E |
| Long filenames (`//` extended name table) | C |
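
The BSD and SysV variants differ mainly in how the 16-byte, space-padded member-name field is used. The sketch below classifies that field. The enum and function names are hypothetical. It does not handle the 64-bit SysV index (`/SYM64/`) or a BSD symbol index stored under a `#1/` long name, which a real reader would also need.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical classification of the ar_name header field; not part of kit. */
typedef enum {
    AR_NAME_SYSV_SYMTAB,   /* "/"         SysV/GNU symbol index */
    AR_NAME_SYSV_LONGTAB,  /* "//"        SysV/GNU extended name table */
    AR_NAME_SYSV_LONGREF,  /* "/123"      offset into the extended name table */
    AR_NAME_BSD_SYMTAB,    /* "__.SYMDEF" BSD symbol index */
    AR_NAME_BSD_LONG,      /* "#1/20"     BSD long name stored after the header */
    AR_NAME_PLAIN          /* short name stored inline */
} ArNameKind;

/* name points at the 16-byte field; it is space-padded, not NUL-terminated. */
ArNameKind ar_name_kind(const char *name)
{
    assert(name != NULL);
    if (name[0] == '/') {
        if (name[1] == ' ') return AR_NAME_SYSV_SYMTAB;
        if (name[1] == '/' && name[2] == ' ') return AR_NAME_SYSV_LONGTAB;
        if (name[1] >= '0' && name[1] <= '9') return AR_NAME_SYSV_LONGREF;
    }
    if (memcmp(name, "__.SYMDEF", 9) == 0) return AR_NAME_BSD_SYMTAB;
    if (memcmp(name, "#1/", 3) == 0) return AR_NAME_BSD_LONG;
    return AR_NAME_PLAIN;
}
```

An archive test can assert that the first member classifies as a symbol index when one is expected, and as a plain member when the archive was built without one.
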
9. Negative inputs (bad/)
Each blob has a .expect substring; harness asserts compiler_panic
exits cleanly (no segfault).
| Blob | Trigger |
|---|---|
| `truncated_ehdr.elf` | < 64 bytes |
| `bad_magic.elf` | first 4 bytes wrong |
| `e_machine_x86.elf` | machine mismatch (when arch-validated) |
| `wrong_class.elf` | 64-bit machine tagged `ELFCLASS32` (class/arch mismatch) |
| `wrong_endian.elf` | `ELFDATA2MSB` in an LSB pipeline |
| `sh_offset_oob.elf` | `sh_offset + sh_size > file_size` |
| `sh_link_oob.elf` | `sh_link >= e_shnum` |
| `e_shstrndx_oob.elf` | bogus shstrndx |
| `symtab_entsize_bad.elf` | `sh_entsize != sizeof(Elf64_Sym)` |
| `rela_entsize_bad.elf` | `sh_entsize != 24` |
| `r_info_sym_oob.elf` | reloc sym index past symtab |
| `group_cycle.elf` | `SHT_GROUP` referencing itself |
| `nobits_with_data.elf` | `SHT_NOBITS` with non-zero `sh_offset` body |
| `huge_size.elf` | `sh_size = u64::max` |
| `string_no_nul.elf` | strtab without trailing `\0` |
| `unknown_machine.elf` | accepted as opaque or rejected by policy |
10. Cross-tool agreement
For every cases/*.c, the structural diff oracle should pass against:
- `clang -O0` and `clang -O2`
- `gcc -O0` and `gcc -O2` (when available)
- `binutils-as` output (hand-written `.s`)
- `lld` and `ld.bfd` linker outputs (for shared/exe variants)
Any case that diverges across these is a bug in either kit or the
normalizer, and must not be silently marked `.xfail`.
11. Behavioral / runtime
exec/ already covers: exit code, in-section call, ADRP+ADD load,
.rodata load, .data load, BSS, two-TU link. Extend with:
| Case | Exercises |
|---|---|
| Static initializer order across TUs | INIT_ARRAY priority |
| Weak symbol replaced by strong | resolution rule |
| Common symbol coalescing | tentative-def merging |
| Inline function shared via COMDAT | group dedup |
| TLS variable read/written from two TUs | `.tdata` + TLS relocs end-to-end |
| `dlopen`-style runtime relocation (when shared lands) | dynamic relocs |
| `setjmp`/`longjmp` across compilation unit | unwind interaction |
Stratification
When picking what to land next, prioritize in this order:
- ★ Reloc-kind matrix per arch: every kind kit claims to support needs unit + behavioral coverage. This is the single highest-leverage gap.
- ★ Symbol kind/visibility matrix: every `STT_*` × `STB_*` × `STV_*` combo we emit must round-trip.
- ★ Section type matrix: every `sh_type` we admit, especially `NOBITS`, `GROUP`, `INIT_ARRAY`.
- Special sections with semantic flags (`SHF_TLS`, `SHF_MERGE`, etc.).
- Negative inputs (`bad/`).
- Layout edge cases (large/extended).
- Cross-tool agreement (clang vs gcc, lld vs ld.bfd).
- Archive support.
- DWARF semantic equivalence (deferred until the consumer side cares).
A "complete" corpus has a row for every cell in groups 1–4 and at least one representative for every cell in 5–7.