boot2

Playing with the bootstrap
git clone https://git.ryansepassi.com/git/boot2.git

Seed kernel: virtio-blk for I/O

Plan to replace the cpio-initrd-in / UART-hex-out I/O shape of seed-kernel/ with a pair of virtio-blk-MMIO devices: one read-only disk carrying the boot inputs (cpio newc, byte-identical to today's -initrd) and one read-write disk for outputs. UART stays the console.

Motivation

Current shape (seed-kernel/run.sh, lib-seed-runscm.sh:98-103):

This works but has real costs:

  1. Output throughput. Hex doubles size; PL011 is byte-at-a-time MMIO. For boot5 (full musl + libc.a + crt*.o, tens of MB), UART dump is the dominant wall-time cost on TCG and a non-trivial slice under HVF.
  2. No mid-run egress. Output exists only at exit. A crash mid-build loses everything in tmpfs.
  3. Asymmetry. Inputs ride a structured device (initrd memory region) while outputs ride a debug device (UART), so there are two separate framings to maintain.
  4. Optionality. Once the kernel can talk to virtio-blk, mounting a real on-disk artifact cache (across multiple runs) becomes trivial.

Non-goals

The seed kernel keeps its single-process, polling-only, no-interrupts shape. virtio-blk just swaps the boot-time transport and the exit-time dump format.

Design

Two virtio-blk-MMIO devices on QEMU virt:

-drive file=in.img,if=none,format=raw,id=hd0,readonly=on \
-device virtio-blk-device,drive=hd0 \
-drive file=out.img,if=none,format=raw,id=hd1 \
-device virtio-blk-device,drive=hd1

in.img is the cpio newc archive (today's initramfs.cpio), padded to a 512-byte multiple. out.img is a pre-allocated zero file sized to an upper bound (≈256 MB covers boot5 worst case).

Boot flow:

  1. kmain brings up MMU as today.
  2. New virtio_blk_init() walks DTB nodes named virtio_mmio@* (compatible "virtio,mmio") and probes MagicValue / Version / DeviceID on each. For every node with DeviceID 2 (block), it finishes device init and then reads sector 0 to classify the disk: cpio newc magic ("070701") → blk0, anything else → blk1. Panic unless exactly one blk0 and exactly one blk1 are found.
  3. parse_cpio is fed by blk_read_all(blk0) instead of the initrd memory region. cpio bytes land in the existing kheap-backed buffer; parse_cpio is unchanged.
  4. On exit, kernel writes a serialised tmpfs to blk1 in the flat format described below, then PSCI off. No bootarg gating — the write is unconditional (a no-op if files[] is empty).
  5. Host reads blk1 with a small extractor (extract-blk.sh).
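The classification in step 2 can be sketched as below. This is a minimal illustration, not the kernel's code: the enum, classify_sector0 and assign_roles are hypothetical names, and the first sectors are assumed to have been read into one contiguous buffer already.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical role tags; the names are illustrative, not from the kernel. */
enum blk_role { BLK_ROLE_INPUT, BLK_ROLE_OUTPUT };

/* cpio newc archives begin with the six ASCII characters "070701". */
static enum blk_role classify_sector0(const uint8_t *sector)
{
    return memcmp(sector, "070701", 6) == 0 ? BLK_ROLE_INPUT : BLK_ROLE_OUTPUT;
}

/*
 * sector0s holds ndev consecutive 512-byte first sectors, one per block
 * device. Returns 0 and fills *in_dev / *out_dev when exactly one input
 * and one output disk exist, or -1 otherwise (where the kernel would panic).
 */
int assign_roles(const uint8_t *sector0s, int ndev, int *in_dev, int *out_dev)
{
    assert(sector0s && in_dev && out_dev);
    int n_in = 0, n_out = 0;
    for (int i = 0; i < ndev; i++) {
        if (classify_sector0(sector0s + (size_t)i * 512) == BLK_ROLE_INPUT) {
            *in_dev = i;
            n_in++;
        } else {
            *out_dev = i;
            n_out++;
        }
    }
    return (n_in == 1 && n_out == 1) ? 0 : -1;
}
```

A zero-filled out.img falls through to the output role, which is why the output disk needs no magic of its own before the first run.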

UART stays the console; uart_puts everywhere is untouched. The dumpfs bootarg, dump_tmpfs, the === DUMP-{BEGIN,END} === / === FILE … === sentinels, and scripts/extract-dump.sh are all deleted in the same change.

On-disk layout for outputs

Tiny custom format — no FS. Sector-aligned (512 B), little-endian, all offsets in sectors:

sector 0           magic "SEEDFS\0\0" (8B) | nfiles u32 | reserved u32
                   followed by nfiles directory entries:
                     path[96] | data_offset_sectors u32 | size_bytes u64 | _pad
                   (entry size 112 B → 4 entries/sector → sector 1.. for table)
sector N..         file data, each file padded up to 512-byte boundary

Reusing the existing path[96] and MAX_FILES=4096 from struct file bounds the directory table: 4096 entries at 4 per sector is 1024 sectors (512 KB), plus the one header sector. The host extractor walks the table and writes each file out.
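For concreteness, the layout could be written down as C types along these lines. This is a sketch under stated assumptions: the struct and function names are hypothetical, the entry is packed in the field order listed above (so size_bytes sits at byte 100 and the pad trails), the table starts at sector 1, and entries never straddle a sector boundary. The packed attribute is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SEEDFS_SECTOR   512u
#define SEEDFS_PATH_MAX 96u

/* Header at the start of sector 0 (16 bytes; the rest of the sector is unused). */
struct seedfs_header {
    char     magic[8];            /* "SEEDFS\0\0" */
    uint32_t nfiles;
    uint32_t reserved;
};

/* One directory entry, packed so the fields follow the listed order exactly. */
struct __attribute__((packed)) seedfs_dirent {
    char     path[SEEDFS_PATH_MAX];
    uint32_t data_offset_sectors; /* assumed absolute, from the start of the disk */
    uint64_t size_bytes;
    uint8_t  _pad[4];
};

static_assert(sizeof(struct seedfs_header) == 16, "header is 16 bytes");
static_assert(sizeof(struct seedfs_dirent) == 112, "entry is 112 bytes");

#define SEEDFS_ENTRIES_PER_SECTOR (SEEDFS_SECTOR / sizeof(struct seedfs_dirent))

/* Sectors occupied by the directory table (4 entries per sector). */
uint32_t seedfs_table_sectors(uint32_t nfiles)
{
    uint32_t per = (uint32_t)SEEDFS_ENTRIES_PER_SECTOR;
    return (nfiles + per - 1) / per;
}

/* First data sector: sector 0 is the header, the table begins at sector 1. */
uint32_t seedfs_first_data_sector(uint32_t nfiles)
{
    return 1 + seedfs_table_sectors(nfiles);
}

/* Sectors consumed by one file, padded up to a sector boundary. */
uint64_t seedfs_file_sectors(uint64_t size_bytes)
{
    return (size_bytes + SEEDFS_SECTOR - 1) / SEEDFS_SECTOR;
}
```

The kernel-side writer would assign each file's data_offset_sectors by starting at seedfs_first_data_sector(nfiles) and advancing by seedfs_file_sectors(size) per file.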

This is roughly "cpio-without-the-headers-per-file." Could equally write cpio newc back; the flat table is just smaller code in the kernel (no hex name length, no per-entry headers, no parse loop).

Memory / DMA

virtio-mmio descriptors carry physical addresses. Kernel-side buffers must therefore have known PAs.

The current MMU (kernel.c:144-213) gives us this for free in two regions:

So DMA buffers come from kalloc() (Normal, identity-mapped, VA==PA) and the device regs from the existing high alias. No MMU changes.

Cache coherency. virtio-mmio in QEMU is dma-coherent per the DTB (/virtio_mmio@…/dma-coherent); virtio-mmio v2 + the modern feature bits assume coherent DMA. Inner-shareable WBWA (already programmed in TCR/MAIR) plus a DMB before writing QueueNotify and a DMB after reading the used ring is sufficient. No explicit cache maintenance ops are needed.

Reservation. virtio queue memory must be 4 KB-aligned. The cpio read buffer must be sized to the cpio length (fetched from blk0 capacity and trimmed at parse). Sizes of interest today:

Conclusion: no memory layout changes required. Only one new fixed allocation: a small (single 4 KB page) virtqueue area per device.

virtio-blk-MMIO driver shape

A polling, single-virtqueue, one-request-at-a-time driver. Spec ref: virtio 1.2 §4.2 (MMIO transport) and §5.2 (block device).

Layout (one struct virtio_mmio per device):

volatile u32 *regs;          // VA in DEVICE_ALIAS_BASE+phy
struct vring_desc  desc[8];  // 8 descriptors plenty (we issue 1 at a time, 3 chained)
struct vring_avail avail;
struct vring_used  used;
u16 next_desc;
u16 last_used;
u64 capacity_sectors;
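The ring types referenced above are not spelled out here; a sketch following the split-virtqueue layout in the virtio spec, fixed at queue size 8, might look like this. struct vq_page is a hypothetical wrapper showing that all three rings fit in one 4 KB page with the alignment the spec asks for (16 for the descriptor table, 2 for the available ring, 4 for the used ring).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QSZ 8u                    /* queue size chosen in this design */

#define VRING_DESC_F_NEXT  1u     /* chain continues via .next */
#define VRING_DESC_F_WRITE 2u     /* device writes this buffer */

struct vring_desc {
    uint64_t addr;                /* physical address of the buffer */
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

struct vring_avail {
    uint16_t flags;
    uint16_t idx;
    uint16_t ring[QSZ];
    uint16_t used_event;
};

struct vring_used_elem {
    uint32_t id;                  /* head index of the completed chain */
    uint32_t len;                 /* bytes the device wrote */
};

struct vring_used {
    uint16_t flags;
    uint16_t idx;
    struct vring_used_elem ring[QSZ];
    uint16_t avail_event;
};

/* Hypothetical wrapper: all three rings laid out in one 4 KB-aligned page. */
struct vq_page {
    struct vring_desc  desc[QSZ];
    struct vring_avail avail;
    struct vring_used  used;
};

static_assert(sizeof(struct vring_desc) == 16, "descriptor is 16 bytes");
static_assert(offsetof(struct vq_page, avail) % 2 == 0, "avail ring alignment");
static_assert(offsetof(struct vq_page, used) % 4 == 0, "used ring alignment");
static_assert(sizeof(struct vq_page) <= 4096, "rings fit in one page");
```

With the page identity-mapped, the three physical addresses programmed into the queue registers are the page base plus offsetof() of each member.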

Registers used (offsets from §4.2):

MagicValue (0x000), Version (0x004), DeviceID (0x008), DeviceFeatures (0x010) / DeviceFeaturesSel (0x014), DriverFeatures (0x020) / DriverFeaturesSel (0x024), QueueSel (0x030), QueueNumMax (0x034), QueueNum (0x038), QueueReady (0x044), QueueNotify (0x050), InterruptStatus (0x060), InterruptACK (0x064), Status (0x070), QueueDescLow/High (0x080/084), QueueDriverLow/High (0x090/094), QueueDeviceLow/High (0x0a0/0a4), Config (0x100) (block: 8-byte capacity at +0).

Init sequence (§3.1.1):

  1. MagicValue == 0x74726976 ("virt"), Version == 2, DeviceID == 2.
  2. Status = 0 (reset), then |= ACKNOWLEDGE, then |= DRIVER.
  3. Read DeviceFeatures (sel 0 and 1); negotiate VIRTIO_F_VERSION_1 only (bit 32). Refuse VIRTIO_BLK_F_RO if mismatched with intent (we set readonly=on on blk0 so the device offers RO; the driver doesn't need to negotiate RO since we just won't issue writes).
  4. Status |= FEATURES_OK; reread Status to confirm.
  5. QueueSel = 0; QueueNumMax (≥8 always on QEMU); QueueNum = 8.
  6. Allocate 4 KB-aligned 4 KB page; lay out desc[8] / avail / used per §2.7, write QueueDesc{Low,High}, QueueDriver{Low,High}, QueueDevice{Low,High} to PAs.
  7. QueueReady = 1; Status |= DRIVER_OK.
  8. Read Config + 0 for capacity (sectors).
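The eight steps above, written against a bare register pointer so the sequence can be exercised on a host with a plain array standing in for the device. A sketch only: virtio_blk_setup, rd, wr and wr64 are hypothetical names, the ring addresses are passed in rather than allocated, and a real driver would also wait for Status to read back 0 after the reset.

```c
#include <assert.h>
#include <stdint.h>

enum {  /* register byte offsets, as listed above */
    R_MAGIC = 0x000, R_VERSION = 0x004, R_DEVICE_ID = 0x008,
    R_DEV_FEAT = 0x010, R_DEV_FEAT_SEL = 0x014,
    R_DRV_FEAT = 0x020, R_DRV_FEAT_SEL = 0x024,
    R_QUEUE_SEL = 0x030, R_QUEUE_NUM_MAX = 0x034, R_QUEUE_NUM = 0x038,
    R_QUEUE_READY = 0x044, R_STATUS = 0x070,
    R_QUEUE_DESC_LO = 0x080, R_QUEUE_DRV_LO = 0x090, R_QUEUE_DEV_LO = 0x0a0,
    R_CONFIG = 0x100,
};

enum {  /* device status bits */
    ST_ACKNOWLEDGE = 1, ST_DRIVER = 2, ST_DRIVER_OK = 4, ST_FEATURES_OK = 8,
};

static uint32_t rd(volatile uint32_t *regs, unsigned off) { return regs[off / 4]; }
static void wr(volatile uint32_t *regs, unsigned off, uint32_t v) { regs[off / 4] = v; }

/* Write a 64-bit physical address to a Low/High register pair. */
static void wr64(volatile uint32_t *regs, unsigned lo_off, uint64_t pa)
{
    wr(regs, lo_off, (uint32_t)pa);
    wr(regs, lo_off + 4, (uint32_t)(pa >> 32));
}

/* Returns 0 and stores the capacity in sectors, or -1 if any step fails. */
int virtio_blk_setup(volatile uint32_t *regs, uint64_t desc_pa,
                     uint64_t avail_pa, uint64_t used_pa, uint64_t *capacity)
{
    assert(regs && capacity);
    if (rd(regs, R_MAGIC) != 0x74726976u || rd(regs, R_VERSION) != 2 ||
        rd(regs, R_DEVICE_ID) != 2)
        return -1;                               /* step 1 */

    wr(regs, R_STATUS, 0);                       /* step 2: reset */
    wr(regs, R_STATUS, rd(regs, R_STATUS) | ST_ACKNOWLEDGE);
    wr(regs, R_STATUS, rd(regs, R_STATUS) | ST_DRIVER);

    wr(regs, R_DEV_FEAT_SEL, 1);                 /* step 3: bits 32..63 */
    if (!(rd(regs, R_DEV_FEAT) & 1u))            /* bit 32 = VIRTIO_F_VERSION_1 */
        return -1;
    wr(regs, R_DRV_FEAT_SEL, 0);
    wr(regs, R_DRV_FEAT, 0);
    wr(regs, R_DRV_FEAT_SEL, 1);
    wr(regs, R_DRV_FEAT, 1u);                    /* accept VERSION_1 only */

    wr(regs, R_STATUS, rd(regs, R_STATUS) | ST_FEATURES_OK);
    if (!(rd(regs, R_STATUS) & ST_FEATURES_OK))  /* step 4 */
        return -1;

    wr(regs, R_QUEUE_SEL, 0);                    /* step 5 */
    if (rd(regs, R_QUEUE_NUM_MAX) < 8)
        return -1;
    wr(regs, R_QUEUE_NUM, 8);
    wr64(regs, R_QUEUE_DESC_LO, desc_pa);        /* step 6 */
    wr64(regs, R_QUEUE_DRV_LO, avail_pa);
    wr64(regs, R_QUEUE_DEV_LO, used_pa);
    wr(regs, R_QUEUE_READY, 1);                  /* step 7 */
    wr(regs, R_STATUS, rd(regs, R_STATUS) | ST_DRIVER_OK);

    *capacity = (uint64_t)rd(regs, R_CONFIG) |   /* step 8 */
                ((uint64_t)rd(regs, R_CONFIG + 4) << 32);
    return 0;
}
```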

Request shape (§5.2.6): chain of three descriptors:

desc[0]: read-only, points to struct virtio_blk_req_hdr {u32 type, u32 reserved, u64 sector}
desc[1]: write-only (for read req) / read-only (for write req), points to data buffer (multi-sector OK with one descriptor per spec; in practice we use 1 desc per ≤4 MB chunk and loop)
desc[2]: write-only, 1 byte status

Write the head index to avail.ring[avail.idx % qsz], dmb ishst, avail.idx++, dmb ishst, write 0 (the queue index) to the QueueNotify register, then poll used.idx until it advances past last_used. Issue a DMB, then read the status byte: 0 = OK, anything else is a failure.
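A host-buildable sketch of the chain construction and the publish / poll steps. The function names are hypothetical, the chain always occupies desc[0..2] because only one request is in flight, and C11 fences stand in for the dmb instructions so the code compiles off-target; the kernel would use the AArch64 barriers named above and follow vq_publish with the QueueNotify write.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define QSZ 8u
#define VRING_DESC_F_NEXT  1u
#define VRING_DESC_F_WRITE 2u

struct vring_desc  { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; };
struct vring_avail { uint16_t flags; uint16_t idx; uint16_t ring[QSZ]; };
struct virtio_blk_req_hdr { uint32_t type; uint32_t reserved; uint64_t sector; };

/* Fill desc[0..2] with header -> data -> status and return the head index. */
uint16_t blk_build_chain(struct vring_desc *desc, uint64_t hdr_pa,
                         uint64_t data_pa, uint32_t data_len,
                         uint64_t status_pa, int is_write)
{
    assert(desc);
    desc[0] = (struct vring_desc){ hdr_pa, sizeof(struct virtio_blk_req_hdr),
                                   VRING_DESC_F_NEXT, 1 };
    /* On a disk read the device writes the data buffer; on a write it reads it. */
    uint16_t data_flags = VRING_DESC_F_NEXT;
    if (!is_write)
        data_flags |= VRING_DESC_F_WRITE;
    desc[1] = (struct vring_desc){ data_pa, data_len, data_flags, 2 };
    desc[2] = (struct vring_desc){ status_pa, 1, VRING_DESC_F_WRITE, 0 };
    return 0;
}

/* Publish a chain head in the available ring. */
void vq_publish(struct vring_avail *avail, uint16_t head)
{
    avail->ring[avail->idx % QSZ] = head;
    atomic_thread_fence(memory_order_release);  /* ring entry before idx */
    avail->idx++;
    atomic_thread_fence(memory_order_release);  /* idx before the notify write */
}

/* One polling step: returns 1 once used.idx has moved past *last_used. */
int vq_completed(const volatile uint16_t *used_idx, uint16_t *last_used)
{
    if (*used_idx == *last_used)
        return 0;
    atomic_thread_fence(memory_order_acquire);  /* idx before the status byte */
    (*last_used)++;
    return 1;
}
```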

Chunk size: pick 1 MB per request (2048 sectors). The cpio fetch loops until capacity_sectors sectors have been read or the cpio TRAILER!!! entry is seen (reading all of capacity_sectors is also fine, since in.img is sized to the cpio).
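The chunking loop, sketched with the single-request primitive abstracted behind a function pointer so it can be exercised without a device. blk_xfer_fn, blk_read_chunked and the logging stub are hypothetical names for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE   512u
#define CHUNK_SECTORS 2048u   /* 1 MB per request */

/* Hypothetical single-request primitive: moves n sectors, n <= CHUNK_SECTORS. */
typedef int (*blk_xfer_fn)(void *ctx, uint64_t sector, void *buf, uint32_t nsectors);

/* Read nsectors starting at sector into buf, one request per chunk. */
int blk_read_chunked(blk_xfer_fn xfer, void *ctx, uint64_t sector,
                     void *buf, uint64_t nsectors)
{
    assert(xfer && buf);
    uint8_t *p = buf;
    while (nsectors > 0) {
        uint32_t n = nsectors > CHUNK_SECTORS ? CHUNK_SECTORS : (uint32_t)nsectors;
        int rc = xfer(ctx, sector, p, n);
        if (rc != 0)
            return rc;                    /* non-zero status: stop at first failure */
        sector   += n;
        p        += (uint64_t)n * SECTOR_SIZE;
        nsectors -= n;
    }
    return 0;
}

/* Stand-in transfer that only records what it was asked to do. */
struct xfer_log { unsigned calls; uint64_t sectors; uint32_t last_n; };

int log_xfer(void *ctx, uint64_t sector, void *buf, uint32_t nsectors)
{
    struct xfer_log *log = ctx;
    (void)sector;
    (void)buf;
    log->calls++;
    log->sectors += nsectors;
    log->last_n = nsectors;
    return 0;
}
```

A 4196-sector read through log_xfer issues three requests: two full 2048-sector chunks and a 100-sector tail.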

Public API

int  blk_init(void);                         // probes DTB, finds blk0/blk1
u64  blk_capacity(int dev);                  // sectors
int  blk_read (int dev, u64 sector, void *buf, u64 nsectors);
int  blk_write(int dev, u64 sector, const void *buf, u64 nsectors);

Used by:

DTB walking

parse_dtb (kernel.c:254-317) currently records only chosen.initrd* and the memory@… reg. Extend with a callback that, when entering a node whose name starts with "virtio_mmio@", captures up to N reg tuples into a dtb_info::virtio_mmio[] array (PA + size). The MMU device alias already covers all of these.

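Subtlety: on QEMU virt, all 32 virtio-mmio slots appear in the DTB but only some are populated. An unpopulated slot reports DeviceID == 0 (QEMU still returns a valid MagicValue for it), so driver init must skip any slot whose MagicValue is wrong or whose DeviceID is 0.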
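The skip logic can be sketched as a filter over the candidate register windows. The function names are hypothetical, and each slot pointer is assumed to be the already-mapped register base of one virtio_mmio@… node.

```c
#include <assert.h>
#include <stdint.h>

#define VIRTIO_MMIO_MAGIC 0x74726976u   /* "virt" */
#define VIRTIO_ID_BLOCK   2u

/* True when the slot holds a modern (Version 2) virtio-blk device. */
int virtio_slot_is_block(const volatile uint32_t *regs)
{
    if (regs[0x000 / 4] != VIRTIO_MMIO_MAGIC)
        return 0;                        /* not a virtio-mmio window */
    if (regs[0x008 / 4] == 0)
        return 0;                        /* DeviceID 0: empty slot, skip */
    if (regs[0x004 / 4] != 2)
        return 0;                        /* legacy transport, not handled here */
    return regs[0x008 / 4] == VIRTIO_ID_BLOCK;
}

/* Record the indices of up to max block-device slots; returns how many were found. */
int virtio_find_blocks(volatile uint32_t **slots, int nslots, int *found, int max)
{
    assert(slots && found);
    int n = 0;
    for (int i = 0; i < nslots && n < max; i++)
        if (virtio_slot_is_block(slots[i]))
            found[n++] = i;
    return n;
}
```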

Build-system changes

seed-kernel/Makefile:

scripts/lib-seed-runscm.sh:

seed-kernel/scripts/extract-blk.sh: reads sector 0 magic, walks the table, writes files. Output contract matches what extract-dump.sh produced (same filenames in the same dump dir), so seed_runscm_export and downstream acceptance scripts don't need to change.
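The extractor itself is a shell script; the table walk it performs is sketched here in C over an in-memory image, purely to pin down the parsing rules. Assumptions, all hypothetical: the function and callback names, the table starting at sector 1 with 4 entries per sector, data_offset_sectors at byte 96 and size_bytes at byte 100 of each entry, and offsets counted from the start of the disk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR             512u
#define ENTRY_SIZE         112u
#define ENTRIES_PER_SECTOR (SECTOR / ENTRY_SIZE)   /* 4 */

static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static uint64_t le64(const uint8_t *p)
{
    return (uint64_t)le32(p) | (uint64_t)le32(p + 4) << 32;
}

typedef void (*seedfs_file_fn)(void *ctx, const char *path,
                               const uint8_t *data, uint64_t size);

/* Walk the image, calling fn per file. Returns the file count, or -1 if malformed. */
long seedfs_walk(const uint8_t *img, size_t img_len, seedfs_file_fn fn, void *ctx)
{
    assert(img && fn);
    if (img_len < SECTOR || memcmp(img, "SEEDFS\0\0", 8) != 0)
        return -1;
    uint32_t nfiles = le32(img + 8);
    for (uint32_t i = 0; i < nfiles; i++) {
        size_t off = (size_t)SECTOR * (1 + i / ENTRIES_PER_SECTOR) +
                     (size_t)ENTRY_SIZE * (i % ENTRIES_PER_SECTOR);
        if (off + ENTRY_SIZE > img_len)
            return -1;
        const uint8_t *e = img + off;
        if (memchr(e, '\0', 96) == NULL)
            return -1;                   /* path must be NUL-terminated */
        uint64_t data_off = (uint64_t)le32(e + 96) * SECTOR;
        uint64_t size = le64(e + 100);
        if (data_off > img_len || size > img_len - data_off)
            return -1;                   /* file would run past the image */
        fn(ctx, (const char *)e, img + data_off, size);
    }
    return (long)nfiles;
}

/* Stand-in callback that tallies what the walk reports. */
struct walk_stats { unsigned files; uint64_t bytes; char last_path[96]; };

void count_file(void *ctx, const char *path, const uint8_t *data, uint64_t size)
{
    struct walk_stats *st = ctx;
    (void)data;
    st->files++;
    st->bytes += size;
    strncpy(st->last_path, path, sizeof st->last_path - 1);
    st->last_path[sizeof st->last_path - 1] = '\0';
}
```

A real extractor would replace count_file with a callback that creates the path under the dump dir and writes size bytes from data.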

seed-kernel/scripts/extract-dump.sh is deleted; tier1-gate.sh and tier2-gate.sh switch their EXTRACT envvar / direct calls to extract-blk.sh.

Implementation order

Single branch, single landing. Internal checkpoints in a sensible order; no dual paths in the tree at any commit boundary.

  1. Add virtio_blk driver and DTB enumeration. Extend parse_dtb to record virtio_mmio@… reg tuples. Add the driver (init, blk_read, blk_write). Not yet wired into kmain. Sanity check: a unit-test kmain that probes and prints capacity for both disks boots cleanly under a hand-built run.sh with two -drive/-device pairs.
  2. Cut over input path. Replace the initrd-region read in kmain with blk_read(0, 0, cpio_buf, blk_capacity(0)). Delete the chosen.initrd-{start,end} handling from parse_dtb and the "no initrd" panic. Update Makefile to produce in.img from initramfs.cpio.
  3. Cut over output path. Add dump_tmpfs_blk and extract-blk.sh. Delete dump_tmpfs, dump_tmpfs's sentinels, the dumpfs bootarg parser, g_dumpfs, and scripts/extract-dump.sh. dump_tmpfs_blk runs unconditionally from sys_exit_final before PSCI off.
  4. Acceptance. Run tier1-gate.sh, tier2-gate.sh, seed-accept.sh, seed-accept-boot34.sh, seed-accept-boot5.sh. All must produce byte-identical artifacts to the prior (cpio+dumpfs) tree at HEAD~1. Expect boot5 to surface any off-by-one in the directory table fastest (≈3900 tmpfs entries).

Decisions (resolved)

Risks (residual)

Estimated effort

Total: ~1-2 days of focused work, gated by byte-identical acceptance vs the cpio+dumpfs tree at the pre-cutover commit.