Seed kernel: virtio-blk for I/O
Plan to replace the cpio-initrd-in / UART-hex-out I/O shape of
seed-kernel/ with a pair of virtio-blk-MMIO devices: one read-only
disk carrying the boot inputs (cpio newc, byte-identical to today's
-initrd) and one read-write disk for outputs. UART stays the console.
Motivation
Current shape (seed-kernel/run.sh, lib-seed-runscm.sh:98-103):
- In: QEMU -initrd loads cpio newc; the kernel finds it via DTB /chosen/linux,initrd-{start,end} (kernel.c:286-296) and unpacks it via parse_cpio (kernel.c:429-465) into the in-memory tmpfs.
- Out: PL011 UART is the only egress. On exit (dumpfs bootarg), dump_tmpfs (kernel.c:926-939) hex-encodes every tmpfs file, framed by === DUMP-BEGIN === / === FILE … === / === DUMP-END === sentinels. seed-kernel/scripts/extract-dump.sh reassembles the files host-side.
This works but has real costs:
- Output throughput. Hex doubles size; PL011 is byte-at-a-time MMIO. For boot5 (full musl + libc.a + crt*.o, tens of MB), UART dump is the dominant wall-time cost on TCG and a non-trivial slice under HVF.
- No mid-run egress. Output exists only at exit. A crash mid-build loses everything in tmpfs.
- Symmetry. Inputs ride a structured device (initrd memory region); outputs ride a debug device (UART). Two separate framings.
- Optionality. Once the kernel can talk to virtio-blk, mounting a real on-disk artifact cache (across multiple runs) becomes trivial.
Non-goals
- Generalised block layer, partitions, ext2/FAT, write-back cache, buffer cache, multi-queue, MSI-X, IRQ delivery.
- virtio-net, virtio-9p, virtio-console, PCI transport.
- Replacing the in-memory tmpfs semantics user code sees (sys_openat, sys_read, sys_write against files[] stay exactly as-is).
The seed kernel keeps its single-process, polling-only, no-interrupts shape. virtio-blk just swaps the boot-time transport and the exit-time dump format.
Design
Two virtio-blk-MMIO devices on QEMU virt:
-global virtio-mmio.force-legacy=false \
-drive file=in.img,if=none,format=raw,id=hd0,readonly=on \
-device virtio-blk-device,drive=hd0 \
-drive file=out.img,if=none,format=raw,id=hd1 \
-device virtio-blk-device,drive=hd1
in.img is the cpio newc archive (today's initramfs.cpio), padded to
a 512-byte multiple. out.img is a pre-allocated zero file sized to an
upper bound (≈256 MB covers boot5 worst case).
Boot flow:
- kmain brings up the MMU as today.
- New virtio_blk_init() walks DTB nodes named virtio_mmio@* (compatible "virtio,mmio"); for each, it probes MagicValue / Version / DeviceID. ID 2 = block; finish device init for each, then read sector 0 to classify (cpio magic → blk0, otherwise → blk1). Panic if the count of either is not exactly 1.
- parse_cpio is fed by blk_read_all(blk0) instead of the initrd memory region. cpio bytes land in the existing kheap-backed buffer; parse_cpio is unchanged.
- On exit, the kernel writes a serialised tmpfs to blk1 in the flat format described below, then PSCI off. No bootarg gating: the write is unconditional (a no-op if files[] is empty).
- Host reads blk1 with a small extractor (extract-blk.sh).
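The classification step in the boot flow can be sketched as a pure function over the sector-0 contents of each enumerated block device. This is a minimal sketch under stated assumptions: the names is_cpio_newc and classify_disks are hypothetical, and the real kernel would panic where this returns -1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: 1 if a sector-0 buffer starts with the cpio newc magic. */
static int is_cpio_newc(const uint8_t *sector0)
{
    return memcmp(sector0, "070701", 6) == 0;
}

/*
 * Classify enumerated block devices by content (hypothetical name).
 * sector0[i] points at the first 512 bytes read from device i.
 * On success stores device indices in *blk0 / *blk1 and returns 0.
 * Returns -1 unless there is exactly one cpio disk and exactly one
 * non-cpio disk; the kernel would panic in that case.
 */
static int classify_disks(const uint8_t *const sector0[], size_t ndev,
                          int *blk0, int *blk1)
{
    int in = -1, out = -1;

    assert(sector0 != NULL && blk0 != NULL && blk1 != NULL);
    for (size_t i = 0; i < ndev; i++) {
        if (is_cpio_newc(sector0[i])) {
            if (in != -1)
                return -1;      /* two input disks */
            in = (int)i;
        } else {
            if (out != -1)
                return -1;      /* two output disks */
            out = (int)i;
        }
    }
    if (in == -1 || out == -1)
        return -1;              /* one of the two roles is missing */
    *blk0 = in;
    *blk1 = out;
    return 0;
}
```

Because the decision depends only on disk contents, the result is the same whichever -drive comes first on the QEMU command line.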
UART stays the console; uart_puts everywhere is untouched. The
dumpfs bootarg, dump_tmpfs, the === DUMP-{BEGIN,END} === /
=== FILE … === sentinels, and scripts/extract-dump.sh are all
deleted in the same change.
On-disk layout for outputs
Tiny custom format — no FS. Sector-aligned (512 B), little-endian, all offsets in sectors:
sector 0 magic "SEEDFS\0\0" (8B) | nfiles u32 | reserved u32
followed by nfiles directory entries:
path[96] | data_offset_sectors u32 | size_bytes u64 | _pad
(entry size 112 B → 4 entries/sector → sector 1.. for table)
sector N.. file data, each file padded up to 512-byte boundary
Reusing the existing path[96] and MAX_FILES=4096 from struct file
bounds the table at 1024 sectors (4096 entries at 4 per sector, 512 KB),
after the single header sector.
The host extractor walks the table and writes each file out.
This is roughly "cpio-without-the-headers-per-file." Could equally write cpio newc back; the flat table is just smaller code in the kernel (no hex name length, no per-entry headers, no parse loop).
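To make the sizes concrete, the layout could be declared roughly as follows. This is a sketch, not the kernel's actual definitions: the struct and field names are assumptions, and the packed attribute (GCC/Clang) is needed because size_bytes lands at byte offset 100, which is not 8-byte aligned.

```c
#include <assert.h>
#include <stdint.h>

#define SEEDFS_SECTOR    512u
#define SEEDFS_PATH_MAX  96u    /* path[96] from struct file */
#define SEEDFS_MAX_FILES 4096u  /* MAX_FILES */

/* Start of sector 0; the rest of the sector is unused. Names are illustrative. */
struct seedfs_header {
    char     magic[8];          /* "SEEDFS\0\0" */
    uint32_t nfiles;
    uint32_t reserved;
};

/* One directory entry. Packed: size_bytes sits at offset 100. */
struct __attribute__((packed)) seedfs_dirent {
    char     path[SEEDFS_PATH_MAX];
    uint32_t data_offset_sectors;
    uint64_t size_bytes;
    uint32_t pad;
};

static_assert(sizeof(struct seedfs_header) == 16, "header is 16 bytes");
static_assert(sizeof(struct seedfs_dirent) == 112, "entry is 112 bytes");

/* Entries never straddle a sector: 512 / 112 = 4, with 64 bytes of slack. */
enum { SEEDFS_ENTRIES_PER_SECTOR = SEEDFS_SECTOR / sizeof(struct seedfs_dirent) };

/* Sectors occupied by the directory table for nfiles entries. */
static uint32_t seedfs_table_sectors(uint32_t nfiles)
{
    return (nfiles + SEEDFS_ENTRIES_PER_SECTOR - 1) / SEEDFS_ENTRIES_PER_SECTOR;
}

/* First sector available for file data: header sector plus table. */
static uint32_t seedfs_first_data_sector(uint32_t nfiles)
{
    return 1 + seedfs_table_sectors(nfiles);
}
```

With MAX_FILES entries the table occupies 1024 sectors, so file data would start at sector 1025 in the worst case.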
Memory / DMA
virtio-mmio descriptors carry physical addresses. Kernel-side buffers must therefore have known PAs.
The current MMU (kernel.c:144-213) gives us this for free in two regions:
- L1[1..3] identity-maps VA 1..4 GB to PA 1..4 GB as Normal memory. The kernel image (0x40080000) and kheap (0x40xxxxxx..0x4b000000) live here, so any kalloc()'d buffer has VA == PA. mem_cpy etc. work directly.
- L1[4] is a 1 GB Device block aliasing PA 0..1 GB at VA 4..5 GB, which is how we already reach the UART; we'll reach the virtio-mmio control regs at 0x0a000000..0x0a004000 through the same alias (DEVICE_ALIAS_BASE + 0x0a000000).
So DMA buffers come from kalloc() (Normal, identity-mapped, VA==PA)
and the device regs from the existing high alias. No MMU changes.
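The address arithmetic implied by the two mappings is small enough to write down. The sketch below assumes DEVICE_ALIAS_BASE is 4 GB (from "VA 4..5 GB") and that the 32 slots divide 0x0a000000..0x0a004000 evenly, giving a 0x200-byte stride; both values and all function names are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed from the text: Device alias of PA 0..1 GB starts at VA 4 GB. */
#define DEVICE_ALIAS_BASE  0x100000000ull
/* Assumed: 32 virtio-mmio slots spread evenly over 0x4000 bytes. */
#define VIRTIO_MMIO_BASE   0x0a000000ull
#define VIRTIO_MMIO_SLOTS  32u
#define VIRTIO_MMIO_STRIDE 0x200ull

/* VA through which the kernel reaches a device register at physical pa. */
static uint64_t mmio_va(uint64_t pa)
{
    assert(pa < 0x40000000ull);     /* alias only covers PA 0..1 GB */
    return DEVICE_ALIAS_BASE + pa;
}

/* Physical base address of virtio-mmio slot n. */
static uint64_t virtio_slot_pa(unsigned slot)
{
    assert(slot < VIRTIO_MMIO_SLOTS);
    return VIRTIO_MMIO_BASE + slot * VIRTIO_MMIO_STRIDE;
}

/* Address to place in a descriptor for a kalloc()'d buffer: VA == PA. */
static uint64_t dma_pa(const void *kalloc_va)
{
    return (uint64_t)(uintptr_t)kalloc_va;
}
```

Only mmio_va applies an offset; descriptor addresses are the buffer pointers themselves, which is why no MMU change is needed.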
Cache coherency. virtio-mmio in QEMU is dma-coherent per the DTB
(/virtio_mmio@…/dma-coherent); virtio-mmio v2 + the modern feature
bits assume coherent DMA. Inner-shareable WBWA (already programmed in
TCR/MAIR) plus a DMB before QueueNotify and a DMB after observing the
used.idx advance is sufficient. No explicit cache maintenance ops.
Reservation. virtio queue memory must be 4 KB-aligned. The cpio read buffer must be sized to the cpio length (fetched from blk0 capacity and trimmed at parse). Sizes of interest today:
- boot5 cpio ≈ 30-80 MB. Already fits in the current kheap (~176 MB).
- Output blob: bounded by tmpfs total bytes. Current kheap_end = 0x4b000000 allows ~176 MB of heap; sufficient for boot5's output (tens of MB). Sizing is unchanged from today's cpio-in-RAM design.
Conclusion: no memory layout changes required. Only one new fixed allocation: a small (single 4 KB page) virtqueue area per device.
virtio-blk-MMIO driver shape
A polling, single-virtqueue, one-request-at-a-time driver. Spec ref: virtio 1.2 §4.2 (MMIO transport) and §5.2 (block device).
Layout (one struct virtio_mmio per device):
volatile u32 *regs; // VA in DEVICE_ALIAS_BASE+phy
struct vring_desc desc[8]; // 8 descriptors plenty (we issue 1 at a time, 3 chained)
struct vring_avail avail;
struct vring_used used;
u16 next_desc;
u16 last_used;
u64 capacity_sectors;
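As a check that an 8-entry split virtqueue fits comfortably in one 4 KB page, the ring structures can be declared and their offsets asserted at compile time. The field layout follows the split-virtqueue format in the virtio spec; the struct vq_page wrapper is a hypothetical name for the per-device page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QSZ 8   /* queue size chosen in this plan */

struct vring_desc {
    uint64_t addr;      /* physical address of the buffer */
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

struct vring_avail {
    uint16_t flags;
    uint16_t idx;
    uint16_t ring[QSZ];
    uint16_t used_event;    /* present in the layout even if unused */
};

struct vring_used_elem {
    uint32_t id;
    uint32_t len;
};

struct vring_used {
    uint16_t flags;
    uint16_t idx;
    struct vring_used_elem ring[QSZ];
    uint16_t avail_event;   /* present in the layout even if unused */
};

/* Hypothetical wrapper: everything the device DMAs for one queue. */
struct vq_page {
    struct vring_desc  desc[QSZ];
    struct vring_avail avail;
    struct vring_used  used;
};

static_assert(sizeof(struct vring_desc) == 16, "descriptor is 16 bytes");
static_assert(offsetof(struct vq_page, avail) % 2 == 0, "avail is 2-aligned");
static_assert(offsetof(struct vq_page, used) % 4 == 0, "used is 4-aligned");
static_assert(sizeof(struct vq_page) <= 4096, "queue fits in one page");
```

The three areas take a little over 200 bytes in total, so the rest of the page is free for the request header and status byte if the driver wants to keep them alongside.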
Registers used (offsets from §4.2):
MagicValue (0x000), Version (0x004), DeviceID (0x008),
DeviceFeatures (0x010) / DeviceFeaturesSel (0x014),
DriverFeatures (0x020) / DriverFeaturesSel (0x024),
QueueSel (0x030), QueueNumMax (0x034), QueueNum (0x038),
QueueReady (0x044), QueueNotify (0x050),
InterruptStatus (0x060), InterruptACK (0x064),
Status (0x070), QueueDescLow/High (0x080/084),
QueueDriverLow/High (0x090/094), QueueDeviceLow/High (0x0a0/0a4),
Config (0x100) (block: 8-byte capacity at +0).
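A probe over these registers might look like the sketch below. The offsets are the ones listed above; the helper names and the three-way return convention are assumptions. Since regs is a u32 pointer, byte offsets have to be divided by 4 before indexing.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offsets of a few of the registers listed above. */
enum {
    VIRTIO_MMIO_MAGIC_VALUE  = 0x000,
    VIRTIO_MMIO_VERSION      = 0x004,
    VIRTIO_MMIO_DEVICE_ID    = 0x008,
    VIRTIO_MMIO_QUEUE_NOTIFY = 0x050,
    VIRTIO_MMIO_STATUS       = 0x070,
    VIRTIO_MMIO_CONFIG       = 0x100,
};

#define VIRTIO_MAGIC    0x74726976u     /* "virt" */
#define VIRTIO_ID_BLOCK 2u

/* Read a 32-bit register by byte offset (hypothetical helper). */
static uint32_t mmio_read32(const volatile uint32_t *regs, unsigned off)
{
    assert((off & 3u) == 0);
    return regs[off / 4];
}

/*
 * Probe one slot (hypothetical name).
 * Returns 1 for a v2 block device, 0 for a slot to skip (no magic,
 * DeviceID 0, or some other device type), -1 for a block device with
 * an unsupported transport version (the kernel would panic).
 */
static int probe_slot(const volatile uint32_t *regs)
{
    if (mmio_read32(regs, VIRTIO_MMIO_MAGIC_VALUE) != VIRTIO_MAGIC)
        return 0;
    if (mmio_read32(regs, VIRTIO_MMIO_DEVICE_ID) != VIRTIO_ID_BLOCK)
        return 0;
    if (mmio_read32(regs, VIRTIO_MMIO_VERSION) != 2)
        return -1;
    return 1;
}
```

Checking DeviceID before Version means an empty slot is skipped without ever tripping the version panic.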
Init sequence (§3.1.1):
- MagicValue == 0x74726976 ("virt"), Version == 2, DeviceID == 2.
- Status = 0 (reset), then |= ACKNOWLEDGE, then |= DRIVER.
- Read DeviceFeatures (sel 0 and 1); negotiate VIRTIO_F_VERSION_1 only (bit 32). Refuse VIRTIO_BLK_F_RO if mismatched with intent (we set readonly=on on blk0 so the device offers RO; the driver doesn't need to negotiate RO since we just won't issue writes).
- Status |= FEATURES_OK; reread Status to confirm.
- QueueSel = 0; read QueueNumMax (≥8 always on QEMU); QueueNum = 8.
- Allocate a 4 KB-aligned 4 KB page; lay out desc[8] / avail / used per §2.7; write QueueDesc{Low,High}, QueueDriver{Low,High}, QueueDevice{Low,High} with their PAs.
- QueueReady = 1; Status |= DRIVER_OK.
- Read Config + 0 for capacity (sectors).
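The feature step of the sequence above reduces to a small amount of bit arithmetic, sketched here without any register access. The status and feature bit values are those defined by the virtio spec; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Device status bits. */
#define VIRTIO_STATUS_ACKNOWLEDGE 1u
#define VIRTIO_STATUS_DRIVER      2u
#define VIRTIO_STATUS_DRIVER_OK   4u
#define VIRTIO_STATUS_FEATURES_OK 8u
#define VIRTIO_STATUS_FAILED      128u

/* Feature bits relevant here. */
#define VIRTIO_BLK_F_RO    (1ull << 5)
#define VIRTIO_F_VERSION_1 (1ull << 32)

/* Combine the two 32-bit DeviceFeatures reads (sel 0, sel 1). */
static uint64_t features_join(uint32_t sel0, uint32_t sel1)
{
    return ((uint64_t)sel1 << 32) | sel0;
}

/* Word to write to DriverFeatures for a given DriverFeaturesSel. */
static uint32_t features_word(uint64_t features, unsigned sel)
{
    assert(sel < 2);
    return (uint32_t)(features >> (32 * sel));
}

/*
 * Accept VIRTIO_F_VERSION_1 and nothing else (hypothetical name).
 * Returns -1 if the device does not offer it; init should then set
 * FAILED in Status and give up.
 */
static int negotiate_features(uint64_t offered, uint64_t *accepted)
{
    if (!(offered & VIRTIO_F_VERSION_1))
        return -1;
    *accepted = VIRTIO_F_VERSION_1;
    return 0;
}
```

A device that offers VIRTIO_BLK_F_RO is handled the same way as one that does not: the bit is simply never accepted, and the driver avoids issuing writes to that disk.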
Request shape (§5.2.6): chain of three descriptors:
desc[0]: read-only, points to struct virtio_blk_req_hdr {u32 type, u32 reserved, u64 sector}
desc[1]: write-only (for read req) / read-only (for write req), points to data buffer (multi-sector OK with one descriptor per spec; in practice we use 1 desc per 1 MB chunk and loop)
desc[2]: write-only, 1 byte status
Add head index to avail.ring[avail.idx % qsz], dmb ishst,
avail.idx++, dmb ishst, regs[QueueNotify] = 0, then poll
used.idx until it advances past last_used. Read status byte;
0 = OK, else fail.
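Filling the three descriptors for one request could look like this. It is a sketch only: build_chain is a hypothetical name, the barriers and avail-ring update are left out, and buffer addresses are used directly as DMA addresses on the assumption that VA == PA for kalloc()'d memory.

```c
#include <assert.h>
#include <stdint.h>

#define VRING_DESC_F_NEXT  1u   /* chain continues in .next */
#define VRING_DESC_F_WRITE 2u   /* device writes this buffer */

#define VIRTIO_BLK_T_IN  0u     /* read from disk */
#define VIRTIO_BLK_T_OUT 1u     /* write to disk */

struct vring_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

struct virtio_blk_req_hdr {
    uint32_t type;
    uint32_t reserved;
    uint64_t sector;
};

/* Fill desc[0..2] for one request (hypothetical helper). */
static void build_chain(struct vring_desc desc[3],
                        struct virtio_blk_req_hdr *hdr,
                        void *buf, uint32_t len, uint8_t *status,
                        uint64_t sector, int is_write)
{
    assert(len % 512 == 0);
    hdr->type = is_write ? VIRTIO_BLK_T_OUT : VIRTIO_BLK_T_IN;
    hdr->reserved = 0;
    hdr->sector = sector;
    *status = 0xff;     /* sentinel; the device overwrites it */

    /* Header: device reads it. */
    desc[0] = (struct vring_desc){ (uint64_t)(uintptr_t)hdr,
                                   (uint32_t)sizeof *hdr,
                                   VRING_DESC_F_NEXT, 1 };
    /* Data: device writes it on a read request, reads it on a write. */
    desc[1] = (struct vring_desc){ (uint64_t)(uintptr_t)buf, len,
        (uint16_t)(VRING_DESC_F_NEXT | (is_write ? 0u : VRING_DESC_F_WRITE)),
        2 };
    /* Status byte: device always writes it; end of chain. */
    desc[2] = (struct vring_desc){ (uint64_t)(uintptr_t)status, 1,
                                   VRING_DESC_F_WRITE, 0 };
}
```

The head index published in avail.ring would be 0 for this chain, since desc[0] starts it.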
Chunk size: pick 1 MB per request (2048 sectors). Cpio fetch loops
until capacity_sectors sectors are read or the cpio TRAILER is seen
(we can also just read all of capacity_sectors since in.img is
sized to the cpio).
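The chunking loop is independent of the transport, so it can be sketched against a callback that issues one request. Everything named here (blk_req_fn, blk_read_chunked, and the counting stub used to exercise it) is hypothetical; in the kernel the callback would be the single-request path that builds the descriptor chain and polls used.idx.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLK_SECTOR_SIZE   512u
#define BLK_CHUNK_SECTORS 2048u     /* 1 MB per request */

/* Issues one request of at most BLK_CHUNK_SECTORS; returns 0 on success. */
typedef int (*blk_req_fn)(void *ctx, uint64_t sector, void *buf,
                          uint32_t nsectors);

/* Split a large read into 1 MB requests, stopping at the first error. */
static int blk_read_chunked(blk_req_fn submit, void *ctx, uint64_t sector,
                            void *buf, uint64_t nsectors)
{
    uint8_t *p = buf;

    while (nsectors > 0) {
        uint32_t n = nsectors > BLK_CHUNK_SECTORS
                         ? BLK_CHUNK_SECTORS : (uint32_t)nsectors;
        int rc = submit(ctx, sector, p, n);
        if (rc != 0)
            return rc;
        sector += n;
        p += (size_t)n * BLK_SECTOR_SIZE;
        nsectors -= n;
    }
    return 0;
}

/* Stub standing in for the device: records what it was asked to do. */
struct count_ctx {
    unsigned requests;
    uint64_t sectors;
    uint32_t last_nsectors;
};

static int count_submit(void *ctx, uint64_t sector, void *buf,
                        uint32_t nsectors)
{
    struct count_ctx *c = ctx;
    (void)sector;
    (void)buf;
    c->requests++;
    c->sectors += nsectors;
    c->last_nsectors = nsectors;
    return 0;
}
```

A 5000-sector read, for instance, becomes two full 2048-sector requests followed by one 904-sector tail.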
Public API
int blk_init(void); // probes DTB, finds blk0/blk1
u64 blk_capacity(int dev); // sectors
int blk_read (int dev, u64 sector, void *buf, u64 nsectors);
int blk_write(int dev, u64 sector, const void *buf, u64 nsectors);
Used by:
- kmain: blk_init(); blk_read(0, 0, cpio_buf, blk_capacity(0)), then parse_cpio(cpio_buf, capacity*512).
- dump_tmpfs_blk(): serialise files[] into the SEEDFS layout described above and blk_write(1, …).
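Before dump_tmpfs_blk() issues any blk_write, it has to decide where each file's data goes. One way to do that, sketched here with hypothetical names, is a single pass that assigns sector offsets and returns the total sector count, which can then be checked against blk_capacity(1).

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR 512u

/* Sectors needed to hold bytes, padded up to a sector boundary. */
static uint64_t sectors_for(uint64_t bytes)
{
    return (bytes + SECTOR - 1) / SECTOR;
}

/*
 * Assign a starting sector to each file (hypothetical helper).
 * sizes[i] is the byte length of file i; first_data_sector is the
 * sector right after the header and directory table.
 * Fills offsets[i] and returns the total number of sectors used,
 * or 0 if an offset would not fit the u32 directory field.
 */
static uint64_t seedfs_plan(const uint64_t sizes[], uint32_t nfiles,
                            uint32_t first_data_sector, uint32_t offsets[])
{
    uint64_t next = first_data_sector;

    for (uint32_t i = 0; i < nfiles; i++) {
        if (next > UINT32_MAX)
            return 0;
        offsets[i] = (uint32_t)next;
        next += sectors_for(sizes[i]);
    }
    return next;
}
```

An empty file takes no data sectors under this scheme; its entry records size 0 and shares its offset with the next file.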
DTB walking
parse_dtb (kernel.c:254-317) currently records only chosen.initrd*
and the memory@… reg. Extend with a callback that, when entering a
node whose name starts with "virtio_mmio@", captures up to N reg
tuples into a dtb_info::virtio_mmio[] array (PA + size). The MMU
device alias already covers all of these.
Subtlety: per QEMU virt, all 32 virtio-mmio slots exist but only those
with a device attached are populated. An unpopulated slot still answers
the MagicValue read but reports DeviceID == 0. The driver init must
skip those.
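The two pieces of DTB handling this needs, matching the node name and decoding a reg tuple, can be sketched as below. The sketch assumes #address-cells = 2 and #size-cells = 2 at the parent (so one tuple is 16 big-endian bytes); the names and the struct are hypothetical and not taken from parse_dtb.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 1 if an FDT node name belongs to a virtio-mmio transport. */
static int is_virtio_mmio_node(const char *name)
{
    static const char prefix[] = "virtio_mmio@";
    return strncmp(name, prefix, sizeof prefix - 1) == 0;
}

/* FDT property data is big-endian. */
static uint64_t be64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

struct mmio_region {    /* hypothetical element of dtb_info::virtio_mmio[] */
    uint64_t pa;
    uint64_t size;
};

/*
 * Decode the first <addr size> tuple of a reg property, assuming
 * two address cells and two size cells. Returns -1 if too short.
 */
static int parse_reg(const uint8_t *prop, size_t len, struct mmio_region *out)
{
    if (len < 16)
        return -1;
    out->pa = be64(prop);
    out->size = be64(prop + 8);
    return 0;
}
```

Each captured pa is then reached through the device alias, with no per-slot mapping work.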
Build-system changes
seed-kernel/Makefile:
- Add a kernel.c dep on a new virtio_blk.c (or inline in kernel.c to keep the single-TU shape; leaning toward inline, since the driver is ≤300 lines).
- Add $(OUT)/in.img rule: copy initramfs.cpio, pad to a 512-byte multiple with truncate -s %512.
- Add $(OUT)/out.img rule: truncate -s 256M.
- Update run.sh: drop -initrd "$INITRD"; add the two -drive/-device pairs above. INITRD becomes IN_IMG, OUT_IMG is created fresh per run.
scripts/lib-seed-runscm.sh:
- Replace -initrd "$INITRAMFS" with the two-disk variant.
- Drop dumpfs from the -append line (no longer recognised).
- Replace "$EXTRACT" … "$TRANSCRIPT" with extract-blk.sh "$S_OUT_DIR" "$OUT_IMG". The DUMP-END grep guard is replaced by checking that extract-blk.sh finds the SEEDFS magic at sector 0; absence means the kernel didn't reach exit.
seed-kernel/scripts/extract-blk.sh: reads sector 0 magic, walks
the table, writes files. Output contract matches what
extract-dump.sh produced (same filenames in the same dump dir),
so seed_runscm_export and downstream acceptance scripts don't
need to change.
seed-kernel/scripts/extract-dump.sh is deleted; tier1-gate.sh
and tier2-gate.sh switch their EXTRACT envvar / direct calls
to extract-blk.sh.
Implementation order
Single branch, single landing. Internal checkpoints in a sensible order; no dual paths in the tree at any commit boundary.
- Add virtio_blk driver and DTB enumeration. Extend parse_dtb to record virtio_mmio@… reg tuples. Add the driver (init, blk_read, blk_write). Not yet wired into kmain. Sanity check: a unit-test kmain that probes and prints capacity for both disks boots cleanly under a hand-built run.sh with two -drive/-device pairs.
- Cut over input path. Replace the initrd-region read in kmain with blk_read(0, 0, cpio_buf, blk_capacity(0)). Delete the chosen.initrd-{start,end} handling from parse_dtb and the "no initrd" panic. Update Makefile to produce in.img from initramfs.cpio.
- Cut over output path. Add dump_tmpfs_blk and extract-blk.sh. Delete dump_tmpfs, dump_tmpfs's sentinels, the dumpfs bootarg parser, g_dumpfs, and scripts/extract-dump.sh. dump_tmpfs_blk runs unconditionally from sys_exit_final before PSCI off.
- Acceptance. Run tier1-gate.sh, tier2-gate.sh, seed-accept.sh, seed-accept-boot34.sh, seed-accept-boot5.sh. All must produce byte-identical artifacts to the prior (cpio+dumpfs) tree at HEAD~1. Expect boot5 to surface any off-by-one in the directory table fastest (≈3900 tmpfs entries).
Decisions (resolved)
- Console. PL011 stays. uart_putc/_puts/_putd/_putx and user write(1, …) are unchanged. Only the file dump moves to virtio.
- virtio version. Pin MMIO Version == 2 (MagicValue == 0x74726976, Version regs read in init). Anything else: uart_puts a panic line and wfe. QEMU 10 (current host) and any QEMU ≥4.0 ship v2; the build harness has a single QEMU floor and we don't support pre-v2. Note that QEMU's virtio-mmio transport defaults to the legacy register layout (Version == 1) unless started with -global virtio-mmio.force-legacy=false, so the harness has to pass that flag for the Version == 2 pin to hold.
- Identifying blk0 vs blk1. Slot order in the DTB does not depend on -drive attachment (verified with dumpdtb: all 32 virtio_mmio@… nodes are present unconditionally), and QEMU's command-line-to-slot mapping is not contractual across versions. Use content-based identification: after enumerating all populated DeviceID==2 devices, read sector 0 of each and call the one whose first 6 bytes are "070701" (cpio newc magic) blk0; the other is blk1. If neither matches or both match, panic. This removes the dependency on -drive ordering on the qemu command line entirely.
- Output image size. Host pre-allocates out.img as a 256 MB sparse file (truncate -s 256M out.img). The header and directory table at the start of the image record which sectors are in use; extract-blk.sh reads only those. No truncation of out.img is needed: sparse + bounded read is free on APFS / ext4.
- Initial kheap sizing. Today kheap_end starts at 0x44000000 (64 MB) and bumps to 0x4b000000 after parse_cpio finishes, because the initrd region was reserved up to 0x4b000000. Without -initrd, that region is free from boot, so set the initial kheap_end = 0x4b000000 (176 MB). The cpio read buffer (kalloc(blk_capacity(0) * 512)) lands in this range. Boot5 cpio ≈ 80 MB; comfortably fits.
- Persistence across runs. Out of scope. out.img is re-created (truncated to 256 MB of zeros) before each run by the harness; the kernel always writes a fresh header at sector 0.
- Per-request chunking. 1 MB chunks (2048 sectors) per virtio request, single data descriptor per chunk (3-descriptor chain: hdr / data / status). 8-entry virtqueue, one in-flight request at a time, polling used.idx. No interrupts (InterruptACK written once per used entry to clear the device-side bit, but no IRQ handler; DAIF stays masked as today).
- Coherency. Inner-shareable WBWA (already programmed); dmb ishst before QueueNotify, dmb ish after observing used.idx advance. No dc civac / ic ops: virtio-mmio is dma-coherent per DTB and the device DMAs into the same inner-shareable domain the kernel reads.
Risks (residual)
- Empty / mis-sized in.img. If the harness stages an empty or non-cpio image on blk0, content-based classification finds no cpio magic and the kernel panics during block init, before parse_cpio runs. A truncated archive still classifies as blk0 and then fails in parse_cpio or find_file("init"), the same failure mode as a missing -initrd today (kernel.c:1136-1139). No new silent failure.
- Boot5 file count growth. MAX_FILES = 4096 and path[96] remain the binding limits, unchanged from today. The on-disk directory table is sized off these constants; bumping either requires a same-commit bump to extract-blk.sh's parser.
Estimated effort
- DTB walk extension + virtio_blk driver + integration in kmain: ~300 lines C, one work session.
- Output serialiser + extractor: ~80 lines C + ~40 lines shell.
- Build/run wiring + acceptance plumbing: ~50 lines shell across Makefile, run.sh, lib-seed-runscm.sh.
- Stabilising acceptance against existing fixtures: a couple sessions to chase any byte-divergence (most likely culprit is dump ordering or padding, both fixable in extractor).
Total: ~1-2 days of focused work, gated by byte-identical acceptance vs the cpio+dumpfs tree at the pre-cutover commit.