Object Model
kit's src/obj/ is the format-neutral object layer: one in-memory
representation of "a relocatable object or linked image" that every other
subsystem reads and writes, plus a registry that hides ELF/Mach-O/COFF/Wasm
specifics behind a single dispatch seam. Codegen (cg), the static linker, the
JIT linker, the disassembler, the DWARF producer, the emulator loader, and the
inspection tools (objdump/nm/size/strip/objcopy/addr2line) all meet
here. The design goal is that the rest of the compiler reasons about sections,
symbols, relocations, groups, and atoms — never about ELF section header tables
or Mach-O load commands — and that adding a format is a matter of filling one
table, not threading new branches through every caller.
See LINK.md for how the linker consumes this model, ARCH.md for the backends that produce it, and DWARF.md for debug sections.
The ObjBuilder
The single concrete in-memory object is the ObjBuilder (src/obj/obj.c,
declared in src/obj/obj.h; the public handle is KitObjBuilder). It is the
only object representation in the system — there is no separate "parsed
object" type. A builder is produced by exactly two kinds of writer:
- a backend during compilation (cg via the MCEmitter / CGTarget path), or
- an .o reader during linking/inspection (read_elf, read_macho, read_coff, read_wasm).
The central invariant: post-finalize, a backend-produced builder is identical
in shape to what a reader would produce from the same object written to disk.
Consumers therefore never care which path created the builder — the linker reads
a freshly compiled TU and a .o off disk through one API.
Storage: segmented arrays + chunked byte buffers
Sections, symbols, relocations, groups, and atoms each live in their own
segmented array (core/segvec.h). Segmentation is load-bearing: callers hold
const Section* / const ObjSym* pointers returned by obj_*_get across
further appends, so storage must never relocate existing elements the way a
flat realloc would. Section payloads use the chunked Buf type so large
.text/.data bodies grow without copying.
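A minimal sketch of why segmentation preserves pointers, assuming fixed-size segments that are allocated once and never moved. The names here (DemoSegVec, demo_segvec_push, the segment size) are hypothetical stand-ins and are not the core/segvec.h API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a segmented array: 16 elements per segment. */
enum { SEG_SHIFT = 4, SEG_SIZE = 1 << SEG_SHIFT, SEG_MAX = 64 };

typedef struct { int kind; unsigned align; } DemoSection;

typedef struct {
    DemoSection *segs[SEG_MAX]; /* each segment is allocated once, never realloc'd */
    size_t count;
} DemoSegVec;

/* Append one element; the returned pointer stays valid across later appends. */
DemoSection *demo_segvec_push(DemoSegVec *v, DemoSection value)
{
    size_t seg = v->count >> SEG_SHIFT;
    size_t off = v->count & (SEG_SIZE - 1);
    assert(seg < SEG_MAX);
    if (v->segs[seg] == NULL) {
        v->segs[seg] = malloc(sizeof(DemoSection) * SEG_SIZE);
        if (v->segs[seg] == NULL)
            return NULL;
    }
    v->segs[seg][off] = value;
    v->count++;
    return &v->segs[seg][off];
}

/* Index lookup: split the index into (segment, offset). */
DemoSection *demo_segvec_get(DemoSegVec *v, size_t i)
{
    assert(i < v->count);
    return &v->segs[i >> SEG_SHIFT][i & (SEG_SIZE - 1)];
}

void demo_segvec_free(DemoSegVec *v)
{
    for (size_t i = 0; i < SEG_MAX; i++)
        free(v->segs[i]);
    v->count = 0;
}
```

Growth only ever allocates a new segment, so a pointer taken before a hundred further appends still refers to the same element; a flat realloc-backed array gives no such guarantee.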
Handles are small integer ids (ObjSecId, ObjSymId, ObjGroupId,
ObjAtomId), each scoped to one builder, with index 0 reserved as the "none"
sentinel in every id space. Ids are stable for the builder's lifetime;
relocations carry a section + symbol id, supporting forward references (mint an
undefined ObjSymId for a reloc, define it later with obj_symbol_define).
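The forward-reference pattern can be sketched with a toy symbol table in which id 0 is never handed out. This is an illustration under assumptions: DemoSymTab, demo_sym_ref, and demo_sym_define are hypothetical names, not the obj.h API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t DemoSymId;   /* 0 is the "none" sentinel */
typedef uint32_t DemoSecId;   /* 0 is the "none" sentinel */
enum { DEMO_ID_NONE = 0, DEMO_MAX_SYMS = 32 };

typedef struct { const char *name; DemoSecId section; uint64_t value; } DemoSym;
typedef struct { DemoSym syms[DEMO_MAX_SYMS]; uint32_t nsyms; } DemoSymTab;

/* Find a symbol by name, or mint a new undefined one. Slot 0 is never used. */
DemoSymId demo_sym_ref(DemoSymTab *t, const char *name)
{
    for (DemoSymId id = 1; id <= t->nsyms; id++)
        if (strcmp(t->syms[id].name, name) == 0)
            return id;
    assert(t->nsyms + 1 < DEMO_MAX_SYMS);
    DemoSymId id = ++t->nsyms;
    t->syms[id] = (DemoSym){ name, DEMO_ID_NONE, 0 };
    return id;
}

/* Define a previously minted symbol in place; its id does not change. */
void demo_sym_define(DemoSymTab *t, DemoSymId id, DemoSecId sec, uint64_t value)
{
    assert(id != DEMO_ID_NONE && id <= t->nsyms && sec != DEMO_ID_NONE);
    t->syms[id].section = sec;
    t->syms[id].value = value;
}

int demo_sym_is_defined(const DemoSymTab *t, DemoSymId id)
{
    return id != DEMO_ID_NONE && id <= t->nsyms
        && t->syms[id].section != DEMO_ID_NONE;
}
```

A relocation recorded against the id before the definition arrives needs no fix-up afterwards, because the definition mutates the entry rather than replacing it.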
The five tables and what they model:
- Sections: name + SecKind (TEXT/RODATA/DATA/BSS/DEBUG/OTHER) + SecSem (PROGBITS/NOBITS/SYMTAB/RELA/GROUP/...) + neutral SecFlag bits (EXEC/WRITE/ALLOC/TLS/MERGE/GROUP/RETAIN/...). obj_section find-or-creates by (name, kind, PROGBITS) so repeated literal/initializer emissions coalesce into one section with merged align/flags rather than fanning out.
- Symbols: SymBind x SymVis x SymKind, a defining section id (or NONE for undefined externs/commons), value, size. The object owns its whole symbol namespace: locals, section symbols, file symbols, commons, and external references are all ObjSyms.
- Relocs: flat across all sections (filtered by section_id on read), each a (section, offset, RelocKind, sym, addend) tuple. RelocKind is a canonical enum spanning every arch (see "Relocation model" below).
- Groups: COMDAT / section groups, each a signature symbol plus a member section list (dedup keyed on the signature at link time).
- Atoms: sub-section ranges (section, offset, size, signature) that let a format split one section into independently-linkable pieces. Mach-O sets split_sections_as_atoms; the linker uses atoms for dead-strip and -r granularity where the format has no section-per-symbol convention.
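The find-or-create behavior described for sections can be sketched as follows. The merge rule used here (alignment widens to the maximum, flags are OR-ed) is an assumption about what "merged align/flags" means, and every name is hypothetical rather than taken from obj.h.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { DEMO_SF_ALLOC = 1 << 0, DEMO_SF_WRITE = 1 << 1, DEMO_SF_EXEC = 1 << 2 };
enum { DEMO_SEC_RODATA = 1, DEMO_SEC_DATA = 2 };
enum { DEMO_MAX_SECS = 16 };

typedef struct { const char *name; int kind; uint32_t align; uint32_t flags; } DemoSec;
typedef struct { DemoSec secs[DEMO_MAX_SECS]; uint32_t nsecs; } DemoSecTab;

/* Return the id of the (name, kind) section, creating it on first use.
 * On a hit, alignment widens to the max and flags are OR-ed (assumed rule). */
uint32_t demo_section(DemoSecTab *t, const char *name, int kind,
                      uint32_t align, uint32_t flags)
{
    for (uint32_t id = 1; id <= t->nsecs; id++) {
        DemoSec *s = &t->secs[id];
        if (s->kind == kind && strcmp(s->name, name) == 0) {
            if (align > s->align)
                s->align = align;
            s->flags |= flags;
            return id;
        }
    }
    assert(t->nsecs + 1 < DEMO_MAX_SECS);
    uint32_t id = ++t->nsecs;   /* slot 0 stays reserved as "none" */
    t->secs[id] = (DemoSec){ name, kind, align, flags };
    return id;
}
```

Two literal emissions into .rodata with different alignment requests therefore land in one section whose alignment satisfies both.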
Format pass-through without leaking format knowledge
Generic tables stay neutral, but .o round-tripping needs to preserve bits the
canonical model doesn't name. A few targeted escape hatches handle this without
polluting the core:
- Per-section ext_type/ext_flags (raw sh_type/sh_flags) re-emit format-specific section types and flags that collapse to a generic SecSem (SHT_LLVM_ADDRSIG, SHT_ARM_ATTRIBUTES, SHF_EXCLUDE, ...).
- Per-symbol flags carry format attribute bits (today Mach-O n_desc).
- Builder-level fields: ELF e_flags, the COFF short-import DLL annotation.
- Per-image raw fields (ObjImageRaw, kit_obj_image_rawiter_*): a flat (tag, value, extra) list for linked-image values outside the neutral model, such as PE data directories / subsystem / dllchars, ELF raw DT_*, and Mach-O load commands (see "The linked-image dimension").
- obj_ext_set/obj_ext_get attach one opaque payload per ObjExtKind (today the Wasm module model and Wasm import descriptors); the builder owns the payload's lifetime via a registered free function.
Lifecycle and the finalize discipline
obj_new
  │
  ├─ write side: obj_section / obj_symbol / obj_reloc / obj_atom / obj_group
  │              (MCEmitter / CGTarget, or an .o reader)
  │
  ├─ cgtarget_finalize   (flush lowered code into sections; -O2 path)
  ├─ debug_emit          (if -g: writes .debug_* sections)
  │
  ├─ obj_finalize ─────── freezes the read-side view
  │
  └─ read side: obj_section_get / obj_symiter / obj_reloc_at ...
                (file emitters, linker, objdump)
obj_finalize is the read-side gate. The contract is "build mutably, then
finalize before any read-side query." Today it is a deliberate near-no-op — the
build path already keeps the index spaces consistent and section bytes are
flattened on demand by emitters — but it is the designated home for any future
intra-section fixup pass (label-to-offset resolution after a full section is
written), and keeping every consumer routed through it preserves that option.
Mutators and the tombstone sweep
strip/objcopy mutate a finalized builder. Rather than compact storage
(which would invalidate the stable-id contract), mutators flip per-entry
removed tombstones and individual fields. obj_sweep_dead then runs the
cascading cleanup — drop symbols defined in removed sections, prune
non-referenced undefined externs (the historical "spurious extern from a
header" filter, now folded in), kill relocs that became dangling, compact group
member lists, clear stale Section.link. Every file emitter calls
obj_sweep_dead at the top of emit, and raw id-based iteration must consult
removed itself — tombstones are a per-entry field, not hidden behind the
iterators, so the model stays cheap and idempotent.
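The cascade can be sketched on a toy object with three tables. This covers only three of the steps named above (symbols in removed sections, dangling relocs, unreferenced externs); the struct layout and the name demo_sweep_dead are hypothetical, and the real obj_sweep_dead does more.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { DEMO_N = 8 };

typedef struct { bool removed; } DemoSec;
typedef struct { uint32_t section; bool removed; } DemoSym;   /* section 0 = undefined */
typedef struct { uint32_t section, sym; bool removed; } DemoRel;

typedef struct {
    DemoSec secs[DEMO_N];
    DemoSym syms[DEMO_N];
    DemoRel rels[DEMO_N];
    uint32_t nsecs, nsyms, nrels;   /* highest id in use; slot 0 is the sentinel */
} DemoObj;

/* Cascade tombstones without compacting, so every id stays valid. */
void demo_sweep_dead(DemoObj *o)
{
    bool referenced[DEMO_N] = { false };
    assert(o->nsecs < DEMO_N && o->nsyms < DEMO_N && o->nrels < DEMO_N);

    /* 1. Symbols defined in a removed section die with it. */
    for (uint32_t i = 1; i <= o->nsyms; i++)
        if (o->syms[i].section != 0 && o->secs[o->syms[i].section].removed)
            o->syms[i].removed = true;

    /* 2. Relocs in a removed section, or against a removed symbol, dangle. */
    for (uint32_t i = 1; i <= o->nrels; i++) {
        DemoRel *r = &o->rels[i];
        if (o->secs[r->section].removed || o->syms[r->sym].removed)
            r->removed = true;
        if (!r->removed)
            referenced[r->sym] = true;
    }

    /* 3. Undefined externs that no live reloc references are pruned. */
    for (uint32_t i = 1; i <= o->nsyms; i++)
        if (o->syms[i].section == 0 && !referenced[i])
            o->syms[i].removed = true;
}
```

Because the sweep only sets flags and never moves entries, running it a second time changes nothing, which is what makes an unconditional call at the top of every emitter safe.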
Relocation model and the shared byte-patcher
RelocKind (src/obj/obj.h) is a single canonical enum covering every target:
arch-neutral forms (R_ABS32/64, R_REL32/64, R_PC32/64), then per-arch
families (AArch64 ADRP/ADD/LDST/branch/TLS, x86-64 GOT/PLT/TLS, RISC-V
HI20/LO12/branch/ADD/SUB/SET/ULEB128, COFF SECREL/SECTION, Wasm idx
relocs). Backends emit canonical kinds; the per-format reloc translators
(reloc_* in each format dir) map between canonical kinds and on-disk wire
types in both directions.
reloc_apply.c — one byte-patcher, three loaders
src/obj/reloc_apply.c exposes link_reloc_apply(c, kind, P_bytes, S, A, P): a
pure S/P/A byte patcher. It computes nothing about loader or linker policy —
it receives the already-resolved symbol address S, the in-memory patch site
P_bytes, the addend A, and the site's runtime/virtual address P, then
encodes the bits for that RelocKind (with range checks). It owns the fiddly
encoding details: AArch64 imm19/imm26/imm12 field placement and ADRP page math,
RISC-V U/I/S/B/J immediate scatter, the 0x800 HI20 bias, and the
fixed-width-ULEB128 re-encode that lets SET/SUB_ULEB128 relocs rewrite a
DWARF symbol-difference field without shifting section layout.
This routine is a key shared boundary — it is reused verbatim by every consumer that has to put resolved bytes down:
                link_reloc_apply(c, kind, P_bytes, S, A, P)
                   ▲                  ▲                  ▲
static linker ─────┘                  │                  └───── emu guest loader
(src/link, src/obj/*/link.c)     JIT linker              (src/emu/dl.c: dynamic
assembler (src/asm)         (src/link/link_jit.c)         reloc at guest load)
Each caller computes the policy (where S lives — link-time vaddr,
JIT-mapped runtime address, or guest virtual address; whether a reference is
redirected through a GOT/PLT/IAT slot) and then defers the encoding to this
one function. That separation is why the JIT, the static linker, and the
emulator can never disagree on how an R_AARCH64_CALL26 is encoded: there is
exactly one encoder. The few relocs that are intrinsically loader-only
(R_X64_COPY) panic here, since they have no static-byte meaning.
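The policy/encoding split can be illustrated with a small S/P/A patcher covering four kinds. This is a sketch under stated assumptions, not the reloc_apply.c implementation: the DEMO_R_* kinds and demo_reloc_apply are hypothetical, the signature omits the context argument, and a little-endian host and target are assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef enum {
    DEMO_R_ABS64,
    DEMO_R_PC32,
    DEMO_R_AARCH64_CALL26,
    DEMO_R_RISCV_PCREL_HI20
} DemoRelocKind;

/* Patch the bytes at `site`. S, A, and P are already resolved by the caller.
 * Returns 0 on success, -1 on a range or alignment error. */
int demo_reloc_apply(DemoRelocKind kind, uint8_t *site,
                     uint64_t S, int64_t A, uint64_t P)
{
    uint64_t sum = S + (uint64_t)A;            /* S + A */
    int64_t rel = (int64_t)(sum - P);          /* S + A - P */
    uint32_t insn;
    int32_t disp;

    assert(site != NULL);
    switch (kind) {
    case DEMO_R_ABS64:
        memcpy(site, &sum, sizeof sum);
        return 0;
    case DEMO_R_PC32:
        if (rel < INT32_MIN || rel > INT32_MAX)
            return -1;
        disp = (int32_t)rel;
        memcpy(site, &disp, sizeof disp);
        return 0;
    case DEMO_R_AARCH64_CALL26:
        /* B/BL reach +/-128 MiB and the target must be 4-byte aligned. */
        if ((rel & 3) != 0 || rel < -(INT64_C(1) << 27) || rel >= (INT64_C(1) << 27))
            return -1;
        memcpy(&insn, site, sizeof insn);
        insn = (insn & 0xFC000000u) | ((uint32_t)((uint64_t)rel >> 2) & 0x03FFFFFFu);
        memcpy(site, &insn, sizeof insn);
        return 0;
    case DEMO_R_RISCV_PCREL_HI20:
        /* The +0x800 bias pre-compensates for the sign-extended LO12 half. */
        if (rel + 0x800 < INT32_MIN || rel + 0x800 > INT32_MAX)
            return -1;
        memcpy(&insn, site, sizeof insn);
        insn = (insn & 0x00000FFFu) | ((uint32_t)(rel + 0x800) & 0xFFFFF000u);
        memcpy(site, &insn, sizeof insn);
        return 0;
    }
    return -1;
}
```

A static linker would pass link-time virtual addresses for S and P, while a JIT would pass mapped runtime addresses; the function body is the same in both cases, which is the point of the separation.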
The format registry
src/obj/registry.c is the dispatch seam. Each format is one ObjFormatImpl
(src/obj/format.h) — a vtable of function pointers and small per-format
constants:
- read/read_dso: parse bytes into an ObjBuilder (relocatable view, plus DSO export-only view for the linker's -l inputs);
- emit: write a relocatable .o from a builder;
- link_emit/layout_dyn/free_dyn: final-image writer + dynamic-table synthesis hooks, owned by src/link but registered here;
- emu: ObjFormatEmuOps for the user-mode emulator's guest loader (detect/load executable, map object, dynamic-needed/symbol/reloc iterators);
- per-arch sub-tables (elf_arch/elf_machine, macho_arch/macho_cputype, coff_arch/coff_machine) that pair a KitArchKind with its machine code, dynamic reloc type numbers, stub emitters, and reloc translators;
- optional archive-ingestion policy (classify_obj_input, archive_hint, archive_member), which only COFF needs, to reclassify short-import members as DSOs.
obj_format_lookup resolves by ObjFmt; obj_format_lookup_bin by detected
KitBinFmt. Each format is independently gated by a KIT_OBJ_*_ENABLED
build flag, and the whole link/archive/emu machinery is gated too: when those
subsystems are compiled out, the registry binds the hooks to disabled stubs
rather than carrying #ifdefs at the call sites. A backend or tool that wants
format behavior calls through this table; it never names a format directly.
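The "disabled stubs instead of call-site #ifdefs" idea can be sketched with a two-entry table. Everything here is hypothetical (DemoFormatImpl, the DEMO_OBJ_*_ENABLED flags, the hook set); the real ObjFormatImpl carries many more hooks.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_OBJ_ELF_ENABLED  1   /* hypothetical build flags */
#define DEMO_OBJ_WASM_ENABLED 0

typedef enum { DEMO_FMT_ELF, DEMO_FMT_WASM, DEMO_FMT_COUNT } DemoFmt;
typedef enum { DEMO_OK = 0, DEMO_ERR_DISABLED = -1 } DemoStatus;

typedef struct {
    const char *name;
    DemoStatus (*read)(const unsigned char *bytes, size_t len);
    DemoStatus (*emit)(void);
} DemoFormatImpl;

/* Stubs bound in place of a compiled-out format. */
static DemoStatus demo_disabled_read(const unsigned char *b, size_t n)
{ (void)b; (void)n; return DEMO_ERR_DISABLED; }
static DemoStatus demo_disabled_emit(void) { return DEMO_ERR_DISABLED; }

#if DEMO_OBJ_ELF_ENABLED
static DemoStatus demo_elf_read(const unsigned char *b, size_t n)
{ (void)b; (void)n; return DEMO_OK; }
static DemoStatus demo_elf_emit(void) { return DEMO_OK; }
#define DEMO_ELF_HOOKS demo_elf_read, demo_elf_emit
#else
#define DEMO_ELF_HOOKS demo_disabled_read, demo_disabled_emit
#endif

#if DEMO_OBJ_WASM_ENABLED
static DemoStatus demo_wasm_read(const unsigned char *b, size_t n)
{ (void)b; (void)n; return DEMO_OK; }
static DemoStatus demo_wasm_emit(void) { return DEMO_OK; }
#define DEMO_WASM_HOOKS demo_wasm_read, demo_wasm_emit
#else
#define DEMO_WASM_HOOKS demo_disabled_read, demo_disabled_emit
#endif

/* The only place that knows which formats exist. */
static const DemoFormatImpl demo_formats[DEMO_FMT_COUNT] = {
    [DEMO_FMT_ELF]  = { "elf",  DEMO_ELF_HOOKS },
    [DEMO_FMT_WASM] = { "wasm", DEMO_WASM_HOOKS },
};

const DemoFormatImpl *demo_format_lookup(DemoFmt fmt)
{
    assert((unsigned)fmt < DEMO_FMT_COUNT);
    return &demo_formats[fmt];
}
```

Callers always get a non-NULL hook and handle one error code, so the preprocessor conditionals stay confined to the table definition.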
Detection
src/api/object_detect.c sniffs the leading bytes: ar magic, \x7fELF,
\0asm, the five Mach-O magics, MZ/COFF machine words, and the Microsoft
short-import 00 00 FF FF prefix. kit_detect_fmt returns the binary family;
kit_detect_target decodes the arch/OS/pointer-size into a KitTarget.
kit_obj_open (src/api/object_file.c) chains detect → registry lookup →
impl->read, so every inspection tool opens any supported format through one
call.
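A reduced sketch of leading-byte sniffing, assuming a hypothetical demo_detect_fmt. It checks only one of the Mach-O magics (the 64-bit little-endian one) and skips the COFF machine-word check, so it is narrower than what object_detect.c is described as doing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    DEMO_BIN_UNKNOWN, DEMO_BIN_AR, DEMO_BIN_ELF, DEMO_BIN_WASM,
    DEMO_BIN_MACHO, DEMO_BIN_PE, DEMO_BIN_COFF_SHORT_IMPORT
} DemoBinFmt;

/* True when the buffer is long enough and starts with the m-byte magic. */
static int demo_has_prefix(const unsigned char *p, size_t n,
                           const char *magic, size_t m)
{
    return n >= m && memcmp(p, magic, m) == 0;
}

DemoBinFmt demo_detect_fmt(const unsigned char *p, size_t n)
{
    assert(p != NULL || n == 0);
    if (demo_has_prefix(p, n, "!<arch>\n", 8))          return DEMO_BIN_AR;
    /* "\x7f" and "ELF" are split so the hex escape does not swallow 'E'. */
    if (demo_has_prefix(p, n, "\x7f" "ELF", 4))         return DEMO_BIN_ELF;
    if (demo_has_prefix(p, n, "\0asm", 4))              return DEMO_BIN_WASM;
    /* MH_MAGIC_64 (0xfeedfacf) as stored by a little-endian writer. */
    if (demo_has_prefix(p, n, "\xcf\xfa\xed\xfe", 4))   return DEMO_BIN_MACHO;
    /* Microsoft short-import header: Sig1 = 0x0000, Sig2 = 0xFFFF. */
    if (demo_has_prefix(p, n, "\x00\x00\xff\xff", 4))   return DEMO_BIN_COFF_SHORT_IMPORT;
    if (demo_has_prefix(p, n, "MZ", 2))                 return DEMO_BIN_PE;
    return DEMO_BIN_UNKNOWN;
}
```

Length is checked before every comparison, so a truncated file falls through to DEMO_BIN_UNKNOWN instead of reading past the buffer.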
Format-aware policy helpers
The OS/format knowledge that backends would otherwise hardcode is concentrated in two policy TUs so it lands as one case when a format is added, not as fan-out across every CGTarget:
- src/obj/obj_secnames.c: canonical synthetic section names. Most sections keep ELF-style dotted names end-to-end (the writer translates at emit), but linker-synthesized sections diverge: obj_secname_init_array returns .init_array (ELF) / __DATA,__mod_init_func (Mach-O) / .CRT$XCU (COFF); likewise fini/preinit arrays and TLS template sections. This TU also owns the format codegen predicates: obj_format_extern_via_got (Mach-O always, ELF only under PIC/PIE), obj_format_c_mangle (Mach-O leading _), obj_format_default_entry_name (_main vs _start vs mainCRTStartup), and the Mach-O DWARF/section-name spellings shared by the writer and the DWARF reader.
- src/obj/obj_tls.c: format-aware TLS emission (below).
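A sketch of what "one case per format" looks like for two of these helpers. The section names and the Mach-O underscore rule come from the text above; the function names, enum, and signatures are hypothetical and do not reproduce obj_secnames.c.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef enum { DEMO_FMT_ELF, DEMO_FMT_MACHO, DEMO_FMT_COFF } DemoFmt;

/* Name of the constructor-pointer section for each format. */
const char *demo_secname_init_array(DemoFmt fmt)
{
    switch (fmt) {
    case DEMO_FMT_ELF:   return ".init_array";
    case DEMO_FMT_MACHO: return "__DATA,__mod_init_func";
    case DEMO_FMT_COFF:  return ".CRT$XCU";
    }
    assert(!"unknown format");
    return NULL;
}

/* Apply the C symbol mangle: Mach-O prefixes '_', the others keep the name.
 * Returns 0 on success, -1 if the result does not fit in `out`. */
int demo_c_mangle(DemoFmt fmt, const char *name, char *out, size_t cap)
{
    int n = snprintf(out, cap, "%s%s", fmt == DEMO_FMT_MACHO ? "_" : "", name);
    return (n >= 0 && (size_t)n < cap) ? 0 : -1;
}
```

A backend that needs the constructor section asks demo_secname_init_array(fmt) and never spells a format-specific name itself, so adding a format means adding one case to each switch.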
Format-aware TLS emission
_Thread_local storage has one source-level shape but two radically different
on-disk forms, and src/obj/obj_tls.c owns the split so the frontend and
backends stay format-agnostic. The frontend collects a TLS definition's bytes
(or BSS marker), alignment, and any pointer-init relocs, then calls
obj_define_tls; backends consult obj_format_tls_via_descriptor when choosing
an access sequence.
- ELF / COFF: the symbol is defined directly in .tdata/.tbss (ELF) or .tls$ (COFF); access is a direct TP-relative offset (TLSLE relocs on ELF; the COFF SECREL relocs on Windows).
- Mach-O: storage and access split. Bytes live under a private <name>$tlv$init symbol in __DATA,__thread_data/__thread_bss; the user-visible symbol is defined onto a 24-byte TLV descriptor in __DATA,__thread_vars ([_tlv_bootstrap, 0, &init]) that dyld rewrites at load time. Compiled code reaches the variable through an indirect call via the descriptor's slot 0 (TLVP_LOAD_PAGE21/PAGEOFF12 reloc pair). The _tlv_bootstrap undef extern is cached on the builder so multiple TLV vars in one TU share one symbol entry.
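The Mach-O descriptor shape can be written down as a struct, assuming a 64-bit target where each slot is 8 bytes. DemoTlvDescriptor and demo_tlv_init_name are hypothetical illustrations; only the slot order and the $tlv$init suffix come from the description above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Three pointer-sized slots: [_tlv_bootstrap, 0, &init]. */
typedef struct {
    uint64_t thunk;   /* slot 0: resolves to _tlv_bootstrap, rewritten by dyld */
    uint64_t key;     /* slot 1: zero in the object file */
    uint64_t offset;  /* slot 2: points at the <name>$tlv$init storage */
} DemoTlvDescriptor;

_Static_assert(sizeof(DemoTlvDescriptor) == 24, "TLV descriptor must be 24 bytes");

/* Build the private storage symbol name for a thread-local variable.
 * Returns 0 on success, -1 if the result does not fit in `out`. */
int demo_tlv_init_name(const char *name, char *out, size_t cap)
{
    int n = snprintf(out, cap, "%s$tlv$init", name);
    return (n >= 0 && (size_t)n < cap) ? 0 : -1;
}
```

In an object file slots 0 and 2 would be filled through relocations rather than stored as literal addresses; the struct only fixes the layout the access sequence depends on.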
The read-only inspection surface
include/kit/object.h + src/api/object_file.c expose the read side as
KitObjFile: open a blob, then iterate sections, symbols, relocations,
groups, and section data — the format-neutral view the objdump/nm/size/
strip/objcopy/addr2line/strings tools share. src/api/object_builder.c
is the peer write-side adapter (KitObjBuilder), with static asserts pinning
the public KIT_RELOC_* enum to the internal R_* values so the two never
drift.
The linked-image dimension
Relocatable objects (ET_REL/MH_OBJECT/COFF .obj) carry no image:
obj_image(ob) is NULL. Executables and shared objects carry an extra
dimension the section/symbol tables can't model — load segments, an entry
point, image base, interpreter, soname, dependencies, rpaths, dynamic symbols,
and dynamic relocations. The ObjImage (defined in obj.c, hung off the
builder, released by obj_free) holds this common denominator across formats.
All three native formats fill it: ELF (ET_EXEC/ET_DYN), Mach-O
(MH_EXECUTE/MH_DYLIB), and COFF/PE (executables / DLLs). Readers call
obj_image_ensure(ob, OBJ_KIND_EXEC|DYN) and the appenders; the section/symbol
view stays populated where the format still carries it, so a non-stripped ELF
exec presents both views and a table-stripped image presents only segments. The
public API mirrors the relocatable iterators: kit_obj_kind, image-info
scalars, and segment/dep/rpath/dynsym/dynreloc iterators, plus a raw-fields
iterator (kit_obj_image_rawiter_*) — the image-level escape hatch for
format-specific values the neutral model doesn't name (PE data directories /
subsystem / dllchars, ELF raw DT_*, Mach-O load commands). OBJ_KIND_CORE is
reserved — detected and rejected cleanly, not parsed.
Per-format notes
All four formats implement the same read/emit contract over the neutral
model; the differences are in what the wire format carries.
ELF64 (src/obj/elf/)
The most complete path. read.c parses ET_REL into the section/symbol/reloc
view and ET_EXEC/ET_DYN additionally into the ObjImage (program headers →
segments + PT_INTERP; .dynamic → needed/soname/rpath; .dynsym/.rela.* →
dynamic symbols and relocs). emit.c writes relocatable objects;
reloc_aarch64.c/reloc_riscv64.c/reloc_x86_64.c translate canonical kinds
to/from per-arch wire types, paired in the registry with each arch's dynamic
reloc type numbers (RELATIVE/GLOB_DAT/JUMP_SLOT) and default musl interp
string. read_elf_dso produces an export-only builder for -l inputs.
link.c and link_dyn.c (registered as ELF's link_emit/layout_dyn) write
the final image and synthesize the dynamic-link tables — enumerate imports and
DT_NEEDED, reserve and fill .rela.dyn/.rela.plt, lay out .dynamic,
.got/.got.plt, and the dynamic symbol/string tables for PIE/shared output.
emu_load.c provides the guest-ELF loader for the emulator (elf_emu_ops):
detect/load an executable, map dependent objects, and walk dynamic relocs,
applying them through the shared link_reloc_apply.
Mach-O (src/obj/macho/)
read.c/emit.c handle MH_OBJECT plus the MH_EXECUTE/MH_DYLIB image view
(re-walking load commands for segments, dylinker, install-name, dylib deps,
rpaths, entry, LC_SYMTAB dynamic symbols, and LC_DYLD_CHAINED_FIXUPS
binds/rebases). link.c is the final-image writer and link glue;
reloc_aarch64.c/reloc_x86_64.c translate relocs and supply pcrel/length
metadata. Mach-O sets split_sections_as_atoms. Two extra readers feed the
linker's -l path: read_macho_dso (MH_DYLIB exports) and tbd_read.c (Apple
.tbd text stubs from the SDK). Format quirks — the leading-_ C mangle, the
__DATA,__got non-lazy-pointer indirection for externs, the TLV descriptor
model, the __DWARF segment section-name spellings — are concentrated in
obj_secnames.c/obj_tls.c and the writer, not the backends.
COFF / PE (Windows, src/obj/coff/)
64-bit only (x86_64-windows, aarch64-windows); the hosted profile is
mingw/llvm-mingw UCRT, not MSVC. read.c/emit.c round-trip relocatable
PE/COFF: sections with Characteristics, symbols with auxiliary records,
COMDAT groups and SELECTANY dedup, weak externals + mingw alias fallback,
commons, long section names via the string table, and per-arch relocations.
read_image.c (read_coff_image, dispatched from read.c on the DOS MZ
magic) is the peer of read_elf_image/read_macho_image: it parses a linked
.exe/.dll into the ObjImage — one segment per PE section, exports →
dynamic symbols + soname, imports → deps (with imported-name lists) + undefined
dynamic symbols, base relocations → RELATIVE dynamic relocs — plus a full
section/symbol view, and the raw data-directory / subsystem / dllchars fields
through the image escape hatch. read_util.c holds the RVA→offset, bounded
string, and Characteristics→SecKind helpers shared by the .obj, DSO, and
image readers. read_dso.c walks raw PE DLL export directories (and forwarder
ENT entries, surfaced as defined symbols so the OS loader chases the chain at
runtime).
archive.c implements the registry's archive-ingestion hooks: it classifies
import-library members, routing Microsoft short-import records
(Sig1=0, Sig2=0xFFFF) through read_coff to synthesize the imported symbols
and tagging the builder with the providing DLL name (so the link layer
reclassifies the input as a DSO), while long-form members fall through as
regular objects — handling mixed-member archives in one pass.
link.c is the PE32+ writer (registered as COFF's link_emit): DOS stub + PE
headers, PE32+ optional header, Windows-aligned sections, .idata import
descriptors with per-DLL ILT/IAT/hint-name tables, per-arch IAT call stubs (the
registry's emit_iat_stub), .reloc base-relocation blocks, the TLS directory
_tls_used, and subsystem/entry-point selection (mainCRTStartup console / WinMainCRTStartup GUI).
COFF has no ELF GOT/PLT model: the object emits direct references and the linker binds imports through IAT slots and stubs. Windows TLS access is materialized into .tls$ sections with the platform's TEB-relative sequence (x64 gs:[0x58]; aarch64 x18 TEB slot), using the R_COFF_*SECREL* relocs. ABI selection (Win64 calling convention, __chkstk probes, long double == double, the mingw __int128 split) is keyed on (arch, os) in the ABI dispatch; see ARCH.md.
Wasm object (src/obj/wasm/)
Minimal and inspection-oriented. read.c parses a core module's container
sections into neutral ObjBuilder sections carrying their raw payload (so
objdump -h/-s show the real container), marks the code section SF_EXEC for
-d, and adds one function symbol per defined function. It is not a
linkable-object reader: the WebAssembly tool-conventions linking/reloc.*
sections aren't recovered, so relocations don't round-trip. emit.c is the
peer writer and the module model hangs off the builder via OBJ_EXT_WASM /
OBJ_EXT_WASM_IMPORTS. See WASM.md.
Why this shape
- One object type, two writers, many readers. Backends and .o readers converge on the same post-finalize shape, so the linker, emitters, and objdump have a single contract regardless of provenance.
- One byte-patcher. link_reloc_apply is the only place relocation encoding lives, so the static linker, JIT, and emulator loader cannot disagree on a fixup; they differ only in how they compute S/P.
- One registry seam. Format knowledge sits behind ObjFormatImpl and the obj_secnames/obj_tls policy helpers, so a new format is a table entry, not a sweep through callers.
- Stable ids + tombstones. Segmented storage and removed flags let strip/objcopy mutate freely without invalidating outstanding handles.
Planned work (image-inspection extensions, fuller Wasm object support): see doc/plan/.