seed-kernel: close all 11 OS-TODO items + verification gates - boot2

commit 5073ee046922edc1734cd7c6abda83c25d743fef
parent d437cb28c8b53b20923897d9aaa74fd7b0244928
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Mon,  4 May 2026 20:58:34 -0700

seed-kernel: close all 11 OS-TODO items + verification gates

Tier 1 fixes:
- argv from /chosen/bootargs (whitespace-tokenised) with /init.argv
  fallback; build_user_stack now takes (argc, argv[])
- exit_group masks status to low byte
- Per-process L2 page table maps user low VA to a 768 MB physical
  pool. PL011/GIC/virtio reached from kernel via a high alias
  (L1[4] = Device 1 GB block at PA 0, so VA 0x109000000 -> PA
  0x09000000). The boot2 chain runs at its native -B 0x600000.
- load_elf clips PT_LOAD memsz at USER_VA_HI and reports end-of-image
  in g_user_image_end so brk_base sits above the binary's BSS
- Documented RWX-at-EL1 in load_elf as deliberate per OS.md
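
For illustration, whitespace-tokenising bootargs while stripping the
reserved dumpfs token could look like the sketch below. The helper name
split_bootargs, its dumpfs out-parameter, and the in-place
NUL-termination are assumptions made for this example; the kernel's own
tokenise() may differ.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not the kernel's tokenise(): split `src` in place
 * on spaces/tabs/newlines, store up to `cap` token pointers in argv, and
 * drop the reserved "dumpfs" token, reporting it through *dumpfs. */
static int split_bootargs(char *src, char **argv, int cap, int *dumpfs) {
    int argc = 0;
    char *p = src;
    *dumpfs = 0;
    while (*p) {
        while (*p == ' ' || *p == '\t' || *p == '\n') p++;  /* skip gaps */
        if (!*p) break;
        char *tok = p;
        while (*p && *p != ' ' && *p != '\t' && *p != '\n') p++;
        if (*p) *p++ = '\0';                 /* terminate token in place */
        if (strcmp(tok, "dumpfs") == 0) {    /* kernel-only token */
            *dumpfs = 1;
            continue;
        }
        if (argc < cap) argv[argc++] = tok;  /* tokens past cap are dropped */
    }
    return argc;
}
```

The returned count would feed build_user_stack(argc, argv[]) directly;
a zero count is the case where the /init.argv fallback applies.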

Tier 2 syscalls (clone/execve/waitid):
- Pseudo-fork via proc_stack[] (struct proc_save frames), each holding
  a memory snapshot, regs, brk state, and fd table. clone returns 0 to
  the current "child"; execve replaces the user image; exit_group pops
  the snapshot, restores parent state, and resumes the parent's clone()
  with x0=pid. waitid populates info[8]/info[24] per the prelude. envp
  is ignored, so a NULL or empty envp is accepted as the spec requires.
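
The clone / exit_group / waitid hand-off can be modelled in ordinary
user-space C. This is a toy sketch, not the kernel code: the 64-byte
pool stands in for the 768 MB user image, the model_* names are
invented for the example, and -1 stands in for -EAGAIN / -ECHILD.

```c
#include <stdint.h>
#include <string.h>

#define POOL_SIZE 64   /* stand-in for the 768 MB user pool */
#define MAX_DEPTH 1    /* one saved frame, as in the commit */

struct frame { uint64_t child_pid; uint64_t brk; uint8_t mem[POOL_SIZE]; };

static uint8_t      pool[POOL_SIZE];
static uint64_t     brk_cur;
static struct frame stack[MAX_DEPTH];
static int          depth;
static uint64_t     next_pid = 2;
static int          last_valid, last_code;

/* clone: snapshot the parent, then return 0 to the caller, which is now
 * the "child". Returns -1 when the save stack is full. */
static int64_t model_clone(void) {
    if (depth >= MAX_DEPTH) return -1;
    struct frame *f = &stack[depth++];
    f->child_pid = next_pid++;
    f->brk = brk_cur;
    memcpy(f->mem, pool, POOL_SIZE);
    return 0;
}

/* exit_group: with a saved frame, restore the parent and return the child
 * pid (the value the parent's clone() sees). Returns 0 when there is no
 * frame, which is the real-shutdown case in the kernel. */
static int64_t model_exit(int code) {
    if (depth == 0) return 0;
    struct frame *f = &stack[--depth];
    last_code = code & 0xff;       /* status masked to the low byte */
    last_valid = 1;
    memcpy(pool, f->mem, POOL_SIZE);
    brk_cur = f->brk;
    return (int64_t)f->child_pid;
}

/* waitid: zero the siginfo buffer, then write si_code at offset 8 and
 * si_status at offset 24. */
static int model_waitid(uint8_t info[128]) {
    if (!last_valid) return -1;
    uint32_t si_code = 1 /* CLD_EXITED */, si_status = (uint32_t)last_code;
    memset(info, 0, 128);
    memcpy(info + 8, &si_code, sizeof si_code);
    memcpy(info + 24, &si_status, sizeof si_status);
    last_valid = 0;
    return 0;
}
```

The model only holds because the caller is strictly synchronous: no
parent code runs between model_clone() and the matching model_exit().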

Caches enabled at MMU bring-up (SCTLR.C|I) so the 768 MB snapshot
copy isn't unbearably slow under TCG.

Verification harness:
- dumpfs bootargs token triggers a sentinel-framed hex tmpfs dump
  on exit; scripts/extract-dump.sh decodes it back to files.
- scripts/tier1-gate.sh runs an arbitrary chain stage as /init,
  extracts the post-run tmpfs. Verified with boot0/catm and
  boot3/tcc0 (compiles a .c into a valid aarch64 ELF object).
- scripts/tier2-gate.sh + tier2-tcc-driver.scm fixture run
  scheme1 → (run "tcc0" -nostdlib -c -o out.o input.c) → wait,
  end-to-end. Output is a real ELF relocatable with the expected
  symbols. This is the canonical OS.md §Verification Tier-2 case.
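
The decoding that scripts/extract-dump.sh performs on the UART
transcript can be sketched in C as follows. The function names are
hypothetical, and the sketch assumes lowercase hex and paths without
spaces; it is meant to document the framing, not to replace the script.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Parse one "=== FILE path=<p> size=<n> ===" header line.
 * Returns 0 on success, -1 if the line is not a FILE header or the path
 * does not fit in `path` (capacity `cap`). Assumes no spaces in the path. */
static int parse_file_header(const char *line, char *path, size_t cap,
                             unsigned long *size) {
    char buf[256];
    if (sscanf(line, "=== FILE path=%255s size=%lu ===", buf, size) != 2)
        return -1;
    if (strlen(buf) >= cap) return -1;
    strcpy(path, buf);
    return 0;
}

static int hex_val(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Decode exactly n bytes from a lowercase hex string into out.
 * Returns 0 on success, -1 on a non-hex character or short input. */
static int hex_decode(const char *hex, size_t n, unsigned char *out) {
    for (size_t i = 0; i < n; i++) {
        int hi = hex_val(hex[2 * i]);
        if (hi < 0) return -1;            /* also catches early NUL */
        int lo = hex_val(hex[2 * i + 1]);
        if (lo < 0) return -1;
        out[i] = (unsigned char)((hi << 4) | lo);
    }
    return 0;
}
```

A host-side reader would call parse_file_header() on each line between
the DUMP-BEGIN and DUMP-END sentinels, then hex_decode() the following
line using the reported size.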

User demos: forktest.c + child.c exercise clone/execve/waitid
(parent observes si_code=CLD_EXITED, si_status=42 from child).
hello.c grew an asm _start shim to read argc/argv off sp.

run.sh bumped to -m 2048M; kernel layout: 192 MB image+kheap,
768 MB user pool at 0x4c000000, 768 MB pseudo-fork snapshot at
0x7c000000, 320 MB spare.
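
The layout arithmetic can be checked mechanically. The sketch below
restates the constants from this message; RAM_BASE, RAM_SIZE, and the
two helper functions are names introduced for the example, and the
translation mirrors the l2_user slot mapping (slot i backed by
USER_POOL_PA + (i - 1) * 2 MB).

```c
#include <stdint.h>

#define MB              0x100000ULL
#define RAM_BASE        0x40000000ULL       /* QEMU virt RAM start */
#define RAM_SIZE        (2048 * MB)         /* run.sh -m 2048M */
#define USER_POOL_PA    0x4c000000ULL
#define USER_POOL_SIZE  (768 * MB)
#define SNAP_BASE_PA    0x7c000000ULL
#define USER_VA_LO      0x00200000ULL       /* slot 1 */
#define USER_VA_HI      (USER_VA_LO + USER_POOL_SIZE)
#define DEVICE_ALIAS    0x100000000ULL      /* L1[4], VA 4 GB..5 GB */

/* 192 MB image+kheap, 768 MB pool, 768 MB snapshot, 320 MB spare. */
_Static_assert(USER_POOL_PA - RAM_BASE == 192 * MB, "image+kheap");
_Static_assert(SNAP_BASE_PA == USER_POOL_PA + USER_POOL_SIZE,
               "snapshot follows pool");
_Static_assert(RAM_BASE + RAM_SIZE - (SNAP_BASE_PA + USER_POOL_SIZE)
               == 320 * MB, "spare");
_Static_assert(USER_VA_HI == 0x30200000ULL, "slot 385 boundary");

/* Translate a user VA to its backing PA in the pool. Returns 0 for
 * addresses outside the mapped user window (including the NULL slot). */
static uint64_t user_va_to_pa(uint64_t va) {
    if (va < USER_VA_LO || va >= USER_VA_HI) return 0;
    return USER_POOL_PA + (va - USER_VA_LO);
}

/* Kernel VA through which a device PA in the low 1 GB is reached. */
static uint64_t device_alias_va(uint64_t pa) {
    return DEVICE_ALIAS + pa;
}
```

Under these assumptions the boot2 link base 0x600000 lands at PA
0x4c400000, and the PL011 at PA 0x09000000 is reached at VA 0x109000000.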

Diffstat:
A docs/OS-TODO.md  | 117 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M seed-kernel/Makefile  | 18 +++++++++++++++++-
M seed-kernel/kernel.c  | 528 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
M seed-kernel/run.sh  | 2 +-
A seed-kernel/scripts/extract-dump.sh  | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A seed-kernel/scripts/fixtures/tier2-driver.scm  | 5 +++++
A seed-kernel/scripts/fixtures/tier2-tcc-driver.scm  | 8 ++++++++
A seed-kernel/scripts/tier1-gate.sh  | 91 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A seed-kernel/scripts/tier2-gate.sh  | 87 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A seed-kernel/user/child.c  | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
A seed-kernel/user/forktest.c  | 88 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M seed-kernel/user/hello.c  | 23 ++++++++++++++++++++++-
M seed-kernel/user/user.lds  | 8 +++++---

13 files changed, 1030 insertions(+), 52 deletions(-)
diff --git a/docs/OS-TODO.md b/docs/OS-TODO.md
@@ -0,0 +1,117 @@
+# Seed kernel — gaps against docs/OS.md
+
+Audit of [`seed-kernel/`](../seed-kernel/) against the contract in
+[`OS.md`](OS.md). All eleven items are now resolved — the seed kernel
+boots, parses the DTB, unpacks an initramfs into an in-memory tmpfs,
+loads `/init` as a static aarch64 ELF, dispatches the eight Tier-1 +
+three Tier-2 syscalls, and supports both the host-side verification
+gates `scripts/tier1-gate.sh` and `scripts/tier2-gate.sh`. Verified
+against `boot0/catm`, `boot1/M1pp`, and `boot3/tcc0`; the canonical
+Tier-2 case (scheme1 driver spawns tcc0 to compile a `.c` into a
+relocatable ELF object) round-trips end-to-end.
+
+## Tier 1
+
+1. **Real argv.** ✅ `build_user_stack` takes `(argc, argv[])`. argv is
+   sourced from `/chosen/bootargs` (whitespace-tokenised), then from a
+   `/init.argv` file in the initramfs, with `argc=1, argv[0]="init"`
+   as the final fallback. The kernel reserves a `dumpfs` token in
+   bootargs (stripped from user argv) that triggers the UART tmpfs
+   dump on exit (item 9).
+
+2. **User load address.** ✅ Per-process L2 page table installs an
+   `l2_user[]` covering the low 1 GB of VA in 2 MB blocks. Slot 0 is
+   invalid (NULL traps); slots 1…384 are Normal user RAM backed by a
+   768 MB physical pool (`USER_POOL_PA`); slots 385…511 stay Device-
+   identity for safety. The PL011 / GIC / virtio that used to live in
+   the low 1 GB are now reached from kernel code via a high alias —
+   `L1[4]` is a 1 GB Device block at PA 0, so VA `0x109000000` ↔ PA
+   `0x09000000`. This lets the boot2 chain link at its native
+   `-B 0x600000` and run unmodified on the seed kernel.
+
+3. **Bigger heap.** ✅ User pool is 768 MB (slots 1…384 × 2 MB),
+   sized so tcc0/tcc-boot2 (which declare a 512 MB BSS at link base
+   `0x600000` ⇒ end VA `0x20600000`) fit with a healthy brk window
+   above end-of-bss. `load_elf` walks PT_LOAD segments and records
+   the post-clip end-of-image in `g_user_image_end`; kmain and
+   `sys_execve` use it to seed `brk_base`. `brk_max` is
+   `USER_VA_HI - 16 MB` (16 MB stack reserve at the top).
+
+4. **Per-segment ELF permissions.** ✅ Documented as a deliberate
+   spec-permissible choice in `load_elf` — segments are RWX at EL1.
+   OS.md §"Memory model" allows this; tcc-boot2 doesn't JIT.
+
+5. **`exit_group` exit-code masking.** ✅ `code &= 0xff` in
+   `sys_exit_final` / `sys_exit_or_resume_parent`.
+
+## Tier 2
+
+6. **`clone` / `execve` / `waitid`.** ✅ Pseudo-fork via a
+   `proc_stack[]` of saved frames. `sys_clone` snapshots the trap
+   frame + sp_el0 + brk + fd table + the entire 768 MB user image
+   (one snapshot at PA `0x7c000000`), returns 0 to the current
+   context (the "child"). `sys_execve` captures path/argv into a
+   kernel pool before clobbering user memory, loads the new ELF,
+   resets brk above its end-of-bss, and rewrites the trap frame so
+   `eret` lands at the new entry point with a fresh user stack.
+   `sys_waitid` writes `si_code = CLD_EXITED` at siginfo offset 8 and
+   `si_status` at offset 24 per `scheme1/prelude.scm:497-506`. On
+   `sys_exit_or_resume_parent`, if `proc_depth > 0`, the kernel
+   restores the parent's image / regs / brk / fd table, syncs I-cache
+   over the freshly-overwritten user pages, and returns to the
+   parent's `clone()` site with `x0 = child_pid`.
+
+7. **Per-process state on a stack.** ✅ `proc_save` records regs +
+   ELR + SPSR + sp_el0 + brk_base + brk_cur + fd table + a 768 MB
+   memory snapshot. `MAX_PROC_DEPTH = 1` — the scheme1 prelude only
+   forks one level deep before waiting; one snapshot frame is all
+   that's needed and keeps total RAM at 2 GB.
+
+8. **`execve` accepts NULL/empty envp.** ✅ `sys_execve` ignores its
+   `envp` argument; the prelude wrapper passes no envp at all and
+   the value in `x2` at the SVC site is whatever happens to be
+   there.
+
+## Verification harness
+
+9. **Output extraction.** ✅ The kernel emits a sentinel-framed
+   hex dump of every tmpfs file on exit when bootargs contain the
+   `dumpfs` token. Scripts:
+   - [`scripts/extract-dump.sh`](../seed-kernel/scripts/extract-dump.sh) —
+     scans a UART transcript for `=== DUMP-BEGIN ===` … `=== DUMP-END ===`,
+     decodes each `=== FILE path=… size=… ===` payload, writes files.
+
+10. **Tier 1 gate.** ✅
+    [`scripts/tier1-gate.sh`](../seed-kernel/scripts/tier1-gate.sh) —
+    builds an initramfs containing a stage binary as `/init` plus
+    arbitrary input files, runs the seed kernel under qemu with the
+    stage's argv as bootargs, and extracts the post-run tmpfs.
+    Verified against `boot0/catm` (multi-input concatenation, output
+    matches host `cat`) and `boot3/tcc0` (compiles `int main(void)
+    {return 42;}` into a valid aarch64 relocatable object).
+
+11. **Tier 2 gate.** ✅
+    [`scripts/tier2-gate.sh`](../seed-kernel/scripts/tier2-gate.sh) —
+    cats `prelude.scm` + a driver fixture into `combined.scm`, packs
+    initramfs `/init=scheme1, /child-prog=<chain stage>, /combined.scm,
+    <inputs>`, runs the seed kernel, asserts the driver exited 0, and
+    extracts every output file. Verified end-to-end with the canonical
+    fixture
+    [`scripts/fixtures/tier2-tcc-driver.scm`](../seed-kernel/scripts/fixtures/tier2-tcc-driver.scm) —
+    scheme1 evaluates `(run "child-prog" "-nostdlib" "-c" "-o" "out.o"
+    "input.c")`, where `child-prog` is `boot3/tcc0`. Output `out.o` is a
+    valid aarch64 ELF relocatable with the expected `add` and `main`
+    symbols.
+
+## Things still worth doing (out of scope of the original list)
+
+- **Multi-stage Tier 1 driving**: `make tcc-boot2 ARCH=aarch64` could be
+  taught to swap each podman invocation for `tier1-gate.sh`. The hooks
+  exist; it would just be a `seed-kernel/Makefile.gate` overlay.
+- **Snapshot speed**: `mem_cpy(USER_POOL_SIZE = 768 MB)` is the dominant
+  cost of every clone (~30 s under TCG). A copy-on-write or only-
+  touched-pages strategy would help, but isn't needed for compliance.
+- **NULL-page hardening**: slot 0 is unmapped so a NULL deref faults to
+  the kernel as a user sync; the kernel currently panics rather than
+  delivering a SIGSEGV-equivalent. Acceptable per OS.md (default-action
+  termination is sufficient) but a minor polish opportunity.
diff --git a/seed-kernel/Makefile b/seed-kernel/Makefile
@@ -8,7 +8,10 @@ KOBJS    := $(OUT)/start.o $(OUT)/kernel.o
 KIMAGE   := $(OUT)/kernel.elf
 KBIN     := $(OUT)/Image
 USER     := $(OUT)/init
+USER_FORK := $(OUT)/forktest
+USER_CHILD := $(OUT)/child
 INITRAMFS := $(OUT)/initramfs.cpio
+INITRAMFS_FORK := $(OUT)/initramfs-fork.cpio
 
 CFLAGS_COMMON := -nostdlib -nostartfiles -ffreestanding -fno-stack-protector \
                  -fno-pic -static -Wall -Wextra -O2 -mcmodel=large \
@@ -16,7 +19,7 @@ CFLAGS_COMMON := -nostdlib -nostartfiles -ffreestanding -fno-stack-protector \
 KCFLAGS  := $(CFLAGS_COMMON) -mgeneral-regs-only
 
 .PHONY: all clean kernel user initramfs
-all: $(KBIN) $(INITRAMFS)
+all: $(KBIN) $(INITRAMFS) $(INITRAMFS_FORK)
 
 $(OUT):
 	mkdir -p $(OUT)
@@ -37,9 +40,22 @@ $(KBIN): $(KIMAGE)
 $(USER): user/hello.c user/user.lds | $(OUT)
 	gcc $(CFLAGS_COMMON) -mgeneral-regs-only -T user/user.lds -o $@ $<
 
+$(USER_FORK): user/forktest.c user/user.lds | $(OUT)
+	gcc $(CFLAGS_COMMON) -mgeneral-regs-only -T user/user.lds -o $@ $<
+
+$(USER_CHILD): user/child.c user/user.lds | $(OUT)
+	gcc $(CFLAGS_COMMON) -mgeneral-regs-only -T user/user.lds -o $@ $<
+
 $(INITRAMFS): $(USER)
 	cd $(OUT) && printf 'init\n' | cpio -o -H newc > initramfs.cpio
 
+# Tier 2 demo cpio: /init is the fork driver, /child the program it execs.
+$(INITRAMFS_FORK): $(USER_FORK) $(USER_CHILD)
+	rm -rf $(OUT)/fork-stage && mkdir -p $(OUT)/fork-stage
+	cp $(USER_FORK) $(OUT)/fork-stage/init
+	cp $(USER_CHILD) $(OUT)/fork-stage/child
+	cd $(OUT)/fork-stage && printf 'init\nchild\n' | cpio -o -H newc > ../initramfs-fork.cpio
+
 kernel: $(KBIN)
 user: $(USER)
 initramfs: $(INITRAMFS)
diff --git a/seed-kernel/kernel.c b/seed-kernel/kernel.c
@@ -16,7 +16,13 @@ typedef int            i32;
 
 /* ─── PL011 console ─────────────────────────────────────────────────────── */
 
-#define UART0 0x09000000UL
+/* The PL011 lives at PA 0x09000000 on QEMU virt. Once the MMU comes up the
+ * kernel reaches it through the device alias mapped into VA 4 GB..5 GB
+ * (L1[4]). That keeps the entire low 1 GB of VA available as user RAM —
+ * device MMIO at user-space VAs would otherwise collide with the boot2
+ * chain's BSS, which can run past 256 MB. */
+#define DEVICE_ALIAS_BASE 0x100000000UL
+#define UART0 (DEVICE_ALIAS_BASE + 0x09000000UL)
 #define UART_DR ((volatile u32 *)(UART0 + 0x00))
 #define UART_FR ((volatile u32 *)(UART0 + 0x18))
 #define UART_FR_TXFF (1u << 5)
@@ -75,28 +81,83 @@ static void mem_set(void *d, int c, u64 n) {
 }
 
 /* ─── MMU bring-up ──────────────────────────────────────────────────────── */
-/* Identity-map the first 4 GB at L1 (1 GB blocks). One page table — 4 KB.
- * Entry 0 (0..1G):  Device-nGnRnE   (UART/GIC/virtio/flash live here)
- * Entry 1 (1..2G):  Normal WB-WA    (RAM 0x40000000-)
- * Entry 2 (2..3G):  Normal WB-WA    (extra RAM if -m > 1G)
- * Entry 3 (3..4G):  Normal WB-WA    (above-RAM PCI on virt; rarely touched)
+/* Two-level page table:
+ *   L1[0]   → l2_user table descriptor (VA 0..1 GB, 2 MB blocks)
+ *   L1[1..3] = Normal 1 GB blocks identity-mapping VA 1..4 GB (RAM + high MMIO)
+ *   L1[4]   = Device 1 GB block at PA 0 (VA 4..5 GB mirrors PA 0..1 GB as
+ *             Device-nGnRnE — the kernel's only path to UART/GIC/virtio/PCI
+ *             once we hand the low 1 GB over to user code).
+ *
+ * The l2_user table carves the low 1 GB into:
+ *   slot 0           (VA 0..2 MB)        invalid — NULL pointer traps
+ *   slots 1..N       (VA 2 MB..USER_VA_HI)  Normal user RAM, backed by the
+ *                                        physical pool USER_POOL_PA. The
+ *                                        boot2 chain links at 0x600000 and
+ *                                        scheme1 reserves ~256 MB of BSS;
+ *                                        sizing N at 384 (slots 1..384, 768 MB)
+ *                                        gives both code+BSS and the brk
+ *                                        window plenty of room.
+ *   slots N+1..511   (VA USER_VA_HI..1G) Device-identity, kept for safety —
+ *                                        nothing user-side touches them, and
+ *                                        the kernel uses the high alias.
+ *
  * With MMU on + Normal memory, unaligned loads/stores work — gcc's auto-
- * vectorised 64-bit load in be64() stops trapping.
- */
+ * vectorised 64-bit load in be64() stops trapping. */
 __attribute__((aligned(4096))) static u64 l1_pt[512];
+__attribute__((aligned(4096))) static u64 l2_user[512];
+
+/* Physical RAM region reserved as the backing store for user low VAs.
+ * 768 MB (slots 1..384 × 2 MB), placed above the kernel heap end. Sized
+ * to fit tcc0 / tcc-boot2 — they declare a 512 MB BSS and link at
+ * 0x600000, so the binary's VA reach is 0x600000 + 512 MB = 0x20600000.
+ * 768 MB gives that plus a healthy brk window above end-of-bss.
+ *
+ * With QEMU -m 2048M (RAM 0x40000000–0xc0000000) and MAX_PROC_DEPTH=1
+ * (one 768 MB pseudo-fork snapshot above the user pool), the layout is:
+ *   0x40000000–0x4c000000  kernel image + kheap (192 MB)
+ *   0x4c000000–0x7c000000  user RAM pool        (768 MB)
+ *   0x7c000000–0xac000000  pseudo-fork snapshot (768 MB)
+ *   0xac000000–0xc0000000  spare                (320 MB)
+ */
+#define USER_POOL_PA   0x4c000000UL
+#define USER_POOL_SIZE 0x30000000UL    /* 768 MB */
+#define USER_VA_LO     0x00200000UL    /* slot 1 — first mapped 2 MB block */
+#define USER_VA_HI     0x30200000UL    /* slot 385 — first device-only block */
 
 static void setup_mmu(void) {
-    /* AP=00 (RW EL1 only — keep EL0 out for now), SH=ISH, AF=1, AttrIdx=0/1.
-     * Bits: V(0)=1, block(1)=0, AttrIdx[4:2], NS(5)=0, AP[7:6]=00, SH[9:8]=11,
-     *       AF(10)=1, nG(11)=0  →  0x701 (Normal) / 0x705 (Device) */
+    /* Block-descriptor attribute bits (block at L1 = bit[1]=0).
+     *   V(0)=1, block(1)=0, AttrIdx[4:2]=Attr0(Normal)/Attr1(Device),
+     *   NS(5)=0, AP[7:6]=00 (RW EL1 only), SH[9:8]=11 (ISH), AF(10)=1,
+     *   nG(11)=0 → 0x701 (Normal) / 0x705 (Device-nGnRnE).
+     * Block descriptors at L2 use the same bit layout. */
     u64 normal = 0x701;
     u64 device = 0x705;
 
     for (int i = 0; i < 512; i++) l1_pt[i] = 0;
-    l1_pt[0] = 0x00000000UL | device;
+
+    /* L2 user table: slot 0 invalid; slots 1..(USER_POOL_SIZE/2 MB) Normal
+     * RAM backed by the user pool; slots above that Device-identity. */
+    int user_slots = (int)(USER_POOL_SIZE / 0x200000UL);
+    l2_user[0] = 0;
+    for (int i = 1; i <= user_slots; i++) {
+        u64 pa = USER_POOL_PA + (u64)(i - 1) * 0x200000UL;
+        l2_user[i] = pa | normal;
+    }
+    for (int i = user_slots + 1; i < 512; i++) {
+        u64 pa = (u64)i * 0x200000UL;
+        l2_user[i] = pa | device;
+    }
+
+    /* L1[0] table descriptor → l2_user. Table-desc encoding at L1 is
+     *   bits [1:0] = 0b11, bits [47:12] = next-level table PA. */
+    l1_pt[0] = (u64)l2_user | 0x3UL;
     l1_pt[1] = 0x40000000UL | normal;
     l1_pt[2] = 0x80000000UL | normal;
     l1_pt[3] = 0xc0000000UL | normal;
+    /* L1[4]: Device 1 GB block aliasing PA 0..1 GB into VA 4 GB..5 GB so
+     * the kernel can still reach UART/GIC/virtio after we hand the low 1
+     * GB over to user mappings. */
+    l1_pt[4] = 0x00000000UL | device;
 
     /* MAIR: Attr0 = 0xff (Normal WB-WA), Attr1 = 0x00 (Device-nGnRnE) */
     u64 mair = 0x00000000000000ffUL;
@@ -120,8 +181,10 @@ static void setup_mmu(void) {
 
     u64 sctlr;
     asm volatile("mrs %0, sctlr_el1" : "=r"(sctlr));
-    sctlr &= ~(u64)((1 << 1) | (1 << 19)); /* clear A, WXN */
-    sctlr |=  (u64)(1 << 0);                /* M (MMU) only — caches stay off */
+    sctlr &= ~(u64)((1 << 1) | (1 << 19));   /* clear A (alignment), WXN */
+    sctlr |=  (u64)((1 << 0)                  /* M  — MMU on              */
+                  | (1 << 2)                  /* C  — D-cache on          */
+                  | (1 << 12));               /* I  — I-cache on          */
     asm volatile("msr sctlr_el1, %0" :: "r"(sctlr));
     asm volatile("isb");
 }
@@ -318,6 +381,11 @@ struct phdr { u32 p_type, p_flags; u64 p_offset, p_vaddr, p_paddr, p_filesz, p_m
 
 #define PT_LOAD 1
 
+/* Highest VA touched by the most recently loaded image's PT_LOAD segments
+ * (after USER_VA_HI clipping). load_elf updates this; kmain / sys_execve
+ * use it to seed brk_base above the user image's BSS. */
+static u64 g_user_image_end;
+
 static u64 load_elf(const u8 *elf) {
     const struct ehdr *eh = (const struct ehdr *)elf;
     if (!(eh->e_ident[0] == 0x7f && eh->e_ident[1] == 'E' &&
@@ -327,15 +395,38 @@ static u64 load_elf(const u8 *elf) {
     if (eh->e_machine != 0xb7) { /* EM_AARCH64 */
         uart_puts("ELF: not aarch64\n"); return 0;
     }
+    /* p_flags (R/W/X) are deliberately ignored: the L2 user mapping is one
+     * giant Normal-memory RWX-at-EL1 region (see setup_mmu). OS.md
+     * §"Memory model" permits this — there's no W^X enforcement in the
+     * contract, and tcc-boot2 never JITs.
+     *
+     * Segments are clipped at USER_VA_HI: a binary may declare a BSS that
+     * extends past the mapped user window (scheme1 reserves ~256 MB), and
+     * a naive mem_set would walk into the device-block region above and
+     * trigger an external abort. The user image gets only the portion of
+     * its memsz that fits in the user pool; if user code later touches
+     * the unmapped tail, that's a user-space fault, not a kernel panic. */
+    u64 hi = 0;
     for (int i = 0; i < eh->e_phnum; i++) {
         const struct phdr *ph = (const struct phdr *)(elf + eh->e_phoff + (u64)i * eh->e_phentsize);
         if (ph->p_type != PT_LOAD) continue;
-        u8 *dst = (u8 *)ph->p_vaddr;
+        u64 vaddr = ph->p_vaddr;
+        u64 filesz = ph->p_filesz;
+        u64 memsz  = ph->p_memsz;
+        if (vaddr >= USER_VA_HI) continue;          /* segment fully out of window */
+        u64 reach = USER_VA_HI - vaddr;
+        if (filesz > reach) filesz = reach;
+        if (memsz > reach)  memsz = reach;
+        u8 *dst = (u8 *)vaddr;
         const u8 *src = elf + ph->p_offset;
-        mem_cpy(dst, src, ph->p_filesz);
-        if (ph->p_memsz > ph->p_filesz)
-            mem_set(dst + ph->p_filesz, 0, ph->p_memsz - ph->p_filesz);
+        mem_cpy(dst, src, filesz);
+        if (memsz > filesz)
+            mem_set(dst + filesz, 0, memsz - filesz);
+        u64 end = vaddr + memsz;
+        if (end > hi) hi = end;
     }
+    /* Round up to 16 bytes so callers can use it directly as brk_base. */
+    g_user_image_end = (hi + 15) & ~15UL;
     /* I-cache sync (cheap insurance even with caches off). */
     asm volatile("dsb sy" ::: "memory");
     asm volatile("ic iallu" ::: "memory");
@@ -378,7 +469,14 @@ static u64 brk_max;
 #define SYS_read       63
 #define SYS_write      64
 #define SYS_exit_group 93
+#define SYS_waitid     95
 #define SYS_brk        214
+#define SYS_clone      220
+#define SYS_execve     221
+
+#define ECHILD     10
+#define EAGAIN     11
+#define ENOEXEC     8
 
 static i64 sys_write(int fd, const void *buf, u64 len) {
     if (fd == 1 || fd == 2) {
@@ -476,12 +574,226 @@ static i64 sys_unlinkat(int dirfd, const char *path, int flags) {
     return 0;
 }
 
+/* ─── Tier 2: pseudo-fork (clone / execve / waitid / exit_group) ────────── */
+/*
+ * The boot2 chain's clone/execve/waitid pattern (scheme1/prelude.scm:520-537)
+ * is rigidly synchronous: the parent calls clone, the "child" immediately
+ * calls execve and runs to exit_group, then the parent calls waitid. Nothing
+ * else runs between clone and execve in the child, or between clone and
+ * waitid in the parent.
+ *
+ * We implement that as pseudo-fork on a single-threaded kernel:
+ *
+ *   sys_clone   → push parent state (regs, brk, fd table, full user image)
+ *                 onto proc_stack; return 0 to current context (the "child").
+ *   sys_execve  → reset brk, load new ELF over user RAM, build user stack,
+ *                 set tf so eret resumes at the new entry point.
+ *   sys_exit    → if proc_stack non-empty: stash exit code in last_child,
+ *                 restore parent state (regs / brk / fds / memory), set tf
+ *                 so eret resumes the parent's clone() call with x0 = pid.
+ *                 If proc_stack empty: real exit (dump tmpfs, PSCI off).
+ *   sys_waitid  → return last_child's exit code via the siginfo struct.
+ *
+ * No actual concurrency. The "parent" is suspended at the moment of clone
+ * and resumed only when the "child" calls exit_group. This works because
+ * the prelude never schedules other work between fork and wait.
+ */
+
+struct trapframe {
+    u64 x[31];
+    u64 elr;
+    u64 spsr;
+};
+
+/* Forward decls for state defined further down. */
+#define MAX_ARGV 32
+static u64 build_user_stack(u64 stack_top, int argc, char **argv);
+static int tokenise(char *src, char **argv, int cap);
+
+#define MAX_PROC_DEPTH 1
+/* Memory snapshot pool — placed above the user RAM pool. The scheme1
+ * prelude only ever forks one level deep before waiting (clone → execve
+ * in child → exit_group → waitid in parent), so a single 768 MB frame
+ * suffices. Snapshot N lives at SNAP_BASE + N*USER_POOL_SIZE. */
+#define SNAP_BASE_PA 0x7c000000UL
+
+struct proc_save {
+    int active;
+    u64 child_pid;
+    /* Saved trap-frame state — enough to resume the parent at the
+     * instruction after its clone() SVC. x[0] is overwritten with child_pid
+     * at restore time so the parent sees a non-zero return. */
+    u64 regs[31];
+    u64 elr;
+    u64 spsr;
+    u64 sp_el0;
+    /* User image + per-process state at the moment of clone. brk_base
+     * is saved alongside brk_cur because sys_execve resets it above the
+     * new image's end-of-bss — the parent's value needs to come back
+     * with the parent's memory image. */
+    u64 brk_base_save;
+    u64 brk_cur_save;
+    struct fdent fdtab_save[MAX_FD];
+    u8 *mem_snapshot;
+};
+
+static struct proc_save proc_stack[MAX_PROC_DEPTH];
+static int proc_depth = 0;
+static u64 g_next_pid = 2;
+
+/* The most recently exited child, for sys_waitid to consume. */
+static int last_child_valid = 0;
+static u64 last_child_pid = 0;
+static int last_child_code = 0;
+
+/* USER_POOL_PA / USER_POOL_SIZE (defined above) describe the user RAM pool. */
+
+static i64 sys_clone(struct trapframe *tf, u64 flags, u64 stack, u64 ptid,
+                     u64 ctid, u64 tls) {
+    (void)flags; (void)stack; (void)ptid; (void)ctid; (void)tls;
+    if (proc_depth >= MAX_PROC_DEPTH) return -EAGAIN;
+    struct proc_save *p = &proc_stack[proc_depth];
+    p->active = 1;
+    p->child_pid = g_next_pid++;
+    for (int i = 0; i < 31; i++) p->regs[i] = tf->x[i];
+    p->elr = tf->elr;
+    p->spsr = tf->spsr;
+    asm volatile("mrs %0, sp_el0" : "=r"(p->sp_el0));
+    p->brk_base_save = brk_base;
+    p->brk_cur_save  = brk_cur;
+    for (int i = 0; i < MAX_FD; i++) p->fdtab_save[i] = fdtab[i];
+    p->mem_snapshot = (u8 *)(SNAP_BASE_PA + (u64)proc_depth * USER_POOL_SIZE);
+    mem_cpy(p->mem_snapshot, (void *)USER_POOL_PA, USER_POOL_SIZE);
+    proc_depth++;
+    /* Current context becomes the "child"; clone returns 0 here. */
+    return 0;
+}
+
+/* execve must capture path+argv into kernel-side buffers BEFORE load_elf
+ * runs — load_elf clobbers user memory, and the path/argv strings live in
+ * that memory. */
+static char execve_argv_pool[2048];
+static i64 sys_execve(struct trapframe *tf, const char *path,
+                      char **argv, char **envp) {
+    /* envp may be NULL — the prelude wrapper passes no envp arg, so x2 is
+     * whatever happened to be there. We ignore envp regardless. */
+    (void)envp;
+    if (!path) return -EFAULT;
+    /* Copy path before find_file does anything else (path lives in user
+     * memory which load_elf will clobber). */
+    char path_buf[128];
+    int pn = 0;
+    while (path[pn] && pn < 127) { path_buf[pn] = path[pn]; pn++; }
+    path_buf[pn] = 0;
+    int fidx = find_file(path_buf);
+    if (fidx < 0) return -ENOENT;
+
+    /* Capture argv into a kernel-side pool. */
+    int argc = 0;
+    char *new_argv[MAX_ARGV];
+    int pool_off = 0;
+    if (argv) {
+        while (argc < MAX_ARGV - 1 && argv[argc]) {
+            const char *s = argv[argc];
+            int n = 0;
+            while (s[n] && pool_off + n < (int)sizeof(execve_argv_pool) - 1) n++;
+            for (int j = 0; j < n; j++) execve_argv_pool[pool_off + j] = s[j];
+            execve_argv_pool[pool_off + n] = 0;
+            new_argv[argc] = &execve_argv_pool[pool_off];
+            pool_off += n + 1;
+            argc++;
+        }
+    }
+    if (argc == 0) {
+        /* Synthesise argv[0] from the path so user code that reads argv[0]
+         * doesn't crash. */
+        int n = 0;
+        while (path_buf[n] && pool_off + n < (int)sizeof(execve_argv_pool) - 1) n++;
+        for (int j = 0; j < n; j++) execve_argv_pool[pool_off + j] = path_buf[j];
+        execve_argv_pool[pool_off + n] = 0;
+        new_argv[0] = &execve_argv_pool[pool_off];
+        pool_off += n + 1;
+        argc = 1;
+    }
+
+    /* Load new ELF over user RAM. */
+    u64 entry = load_elf(files[fidx].data);
+    if (!entry) return -ENOEXEC;
+    /* Reset brk above the new image's end-of-bss. */
+    brk_base = g_user_image_end ? g_user_image_end : USER_VA_LO;
+    brk_cur  = brk_base;
+    /* Build new user stack at top of user VA window. */
+    u64 new_sp = build_user_stack(USER_VA_HI, argc, new_argv);
+
+    /* Rewrite trap frame so eret jumps to the new image's entry, with a
+     * clean register state and the new stack. */
+    for (int i = 0; i < 31; i++) tf->x[i] = 0;
+    tf->elr = entry;
+    /* sp_el0 isn't in the trap frame — set it directly; it survives until
+     * the eret since the kernel uses SP_ELx while in trap_sync. */
+    asm volatile("msr sp_el0, %0" :: "r"(new_sp));
+    /* x[0] = 0 will be overwritten by the dispatcher's tf->x[0] = (u64)r
+     * assignment. To preserve "argc/argv on the stack only", return 0 and
+     * let the dispatcher write it; user code never sees the return value
+     * because elr now points at _start. */
+    return 0;
+}
+
+static i64 sys_waitid(struct trapframe *tf, int idtype, u64 id,
+                      void *info, int options) {
+    (void)tf; (void)idtype; (void)id; (void)options;
+    if (!last_child_valid) return -ECHILD;
+    /* scheme1/prelude.scm:497-506 reads info[8]=si_code (CLD_EXITED=1) and
+     * info[24]=si_status. siginfo_t is sparsely written — zero the rest so
+     * the prelude's view is deterministic. */
+    if (info) {
+        u8 *p = info;
+        for (int i = 0; i < 128; i++) p[i] = 0;
+        u32 *si_code   = (u32 *)(p + 8);
+        u32 *si_status = (u32 *)(p + 24);
+        *si_code   = 1;                       /* CLD_EXITED */
+        *si_status = (u32)last_child_code;
+    }
+    last_child_valid = 0;
+    return 0;
+}
+
 static int g_exit_code = 0;
 static int g_exited = 0;
 
-static void sys_exit(int code) {
+/* Dump every file in the tmpfs to UART, hex-encoded, framed by sentinels
+ * a host-side extractor can scan for. The chain's verification harness
+ * (qemu-host wrapper) parses this to recover output ELFs etc. without
+ * needing virtio-9p — flat tmpfs over UART is enough for boot2's
+ * file-only IPC. Dump only happens when a "dumpfs" token is present in
+ * /chosen/bootargs; the hello.c demo runs without it and stays quiet. */
+static int g_dumpfs = 0;
+
+static void uart_putc_hex(u8 b) {
+    static const char hex[] = "0123456789abcdef";
+    uart_putc(hex[b >> 4]);
+    uart_putc(hex[b & 0xf]);
+}
+
+static void dump_tmpfs(void) {
+    uart_puts("\n=== DUMP-BEGIN ===\n");
+    for (int i = 0; i < MAX_FILES; i++) {
+        if (!files[i].used) continue;
+        uart_puts("=== FILE path=");
+        uart_puts(files[i].path);
+        uart_puts(" size=");
+        uart_putd((i64)files[i].len);
+        uart_puts(" ===\n");
+        for (u64 j = 0; j < files[i].len; j++) uart_putc_hex(files[i].data[j]);
+        uart_puts("\n");
+    }
+    uart_puts("=== DUMP-END ===\n");
+}
+
+static void sys_exit_final(int code) {
     g_exit_code = code;
     g_exited = 1;
+    if (g_dumpfs) dump_tmpfs();
     uart_puts("\n[seed] user exit_group("); uart_putd(code); uart_puts(")\n");
     /* Try PSCI SYSTEM_OFF so QEMU exits cleanly; fall back to spin. */
     register u64 x0 asm("x0") = 0x84000008;
@@ -491,13 +803,43 @@ static void sys_exit(int code) {
     for (;;) asm volatile("wfi");
 }
 
-/* ─── Trap dispatch (called from start.S vector handlers) ───────────────── */
+/* Dispatcher-side exit_group: pops proc_stack and resumes the parent's
+ * clone() if there's a saved frame, otherwise falls through to the real
+ * shutdown path. Returns child_pid (> 0) if the trap frame was rewritten
+ * to resume the parent, 0 if the caller should take the normal trap-return
+ * path (which never happens, since sys_exit_final does not return). */
+static int sys_exit_or_resume_parent(struct trapframe *tf, int code) {
+    code &= 0xff;
+    if (proc_depth > 0) {
+        struct proc_save *p = &proc_stack[--proc_depth];
+        last_child_pid   = p->child_pid;
+        last_child_code  = code;
+        last_child_valid = 1;
+        /* Restore memory, brk, fd table. */
+        mem_cpy((void *)USER_POOL_PA, p->mem_snapshot, USER_POOL_SIZE);
+        brk_base = p->brk_base_save;
+        brk_cur  = p->brk_cur_save;
+        for (int i = 0; i < MAX_FD; i++) fdtab[i] = p->fdtab_save[i];
+        /* Restore registers (overwriting x[0] with child_pid, since the
+         * dispatcher will write tf->x[0] = (u64)r before eret — we want
+         * the parent's clone() to see child_pid as the syscall return). */
+        for (int i = 0; i < 31; i++) tf->x[i] = p->regs[i];
+        tf->elr  = p->elr;
+        tf->spsr = p->spsr;
+        asm volatile("msr sp_el0, %0" :: "r"(p->sp_el0));
+        /* Instruction cache may hold stale lines from the child's image
+         * that we just overwrote with the parent's. Invalidate. */
+        asm volatile("dsb sy" ::: "memory");
+        asm volatile("ic iallu" ::: "memory");
+        asm volatile("dsb sy" ::: "memory");
+        asm volatile("isb");
+        return (int)p->child_pid;     /* >0: tells dispatcher to write this as r */
+    }
+    sys_exit_final(code);
+    return 0;                        /* unreachable */
+}
 
-struct trapframe {
-    u64 x[31];
-    u64 elr;
-    u64 spsr;
-};
+/* ─── Trap dispatch (called from start.S vector handlers) ───────────────── */
 
 i64 trap_sync(u64 esr, struct trapframe *tf);
 void trap_kernel(u64 esr, struct trapframe *tf);
@@ -518,7 +860,19 @@ i64 trap_sync(u64 esr, struct trapframe *tf) {
         case SYS_lseek:      r = sys_lseek((int)a0, (i64)a1, (int)a2); break;
         case SYS_brk:        r = sys_brk(a0); break;
         case SYS_unlinkat:   r = sys_unlinkat((int)a0, (const char *)a1, (int)a2); break;
-        case SYS_exit_group: sys_exit((int)a0); r = 0; break;
+        case SYS_clone:      r = sys_clone(tf, a0, a1, a2, a3, a4); break;
+        case SYS_execve:     r = sys_execve(tf, (const char *)a0, (char **)a1, (char **)a2); break;
+        case SYS_waitid:     r = sys_waitid(tf, (int)a0, a1, (void *)a2, (int)a3); break;
+        case SYS_exit_group:
+            r = sys_exit_or_resume_parent(tf, (int)a0);
+            /* If we resumed the parent, sys_exit_or_resume_parent has
+             * rewritten tf->x[0..30], tf->elr and tf->spsr. Set x0 to the
+             * child pid (clone()'s return value) and return early. */
+            if (r != 0) {
+                tf->x[0] = (u64)r;
+                return 0;
+            }
+            break;
         default:
             uart_puts("[seed] ENOSYS "); uart_putd((i64)nr); uart_puts("\n");
             r = -38; /* ENOSYS */
@@ -536,8 +890,10 @@ i64 trap_sync(u64 esr, struct trapframe *tf) {
 }
 
 void trap_kernel(u64 esr, struct trapframe *tf) {
+    u64 far; asm volatile("mrs %0, far_el1" : "=r"(far));
     uart_puts("[seed] PANIC: kernel sync, ESR="); uart_putx(esr);
     uart_puts(" ELR=");                          uart_putx(tf->elr);
+    uart_puts(" FAR=");                          uart_putx(far);
     uart_puts("\n");
     for (;;) asm volatile("wfe");
 }
@@ -553,23 +909,48 @@ void trap_unhandled(u64 esr, struct trapframe *tf) {
 
 extern void eret_to_user(u64 entry, u64 sp);
 
-static u64 build_user_stack(u64 stack_top, const char *argv0) {
-    /* Place argv0 string at top, then argc/argv/envp below it.
-     *
-     * SysV layout from low to high at sp:
-     *   argc, argv[0], NULL, NULL (envp term)
-     */
-    int n = str_n(argv0) + 1;
-    char *str = (char *)(stack_top - 32);
-    for (int i = 0; i < n; i++) str[i] = argv0[i];
-
-    u64 sp = (u64)str - 64;
+/* Tokenise `src` in place (whitespace separators) into argv slots.
+ * Writes pointers into argv[0..argc-1] and returns argc. Stops at cap. */
+static int tokenise(char *src, char **argv, int cap) {
+    int argc = 0;
+    char *p = src;
+    while (*p && argc < cap) {
+        while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r') p++;
+        if (!*p) break;
+        argv[argc++] = p;
+        while (*p && *p != ' ' && *p != '\t' && *p != '\n' && *p != '\r') p++;
+        if (*p) *p++ = 0;
+    }
+    return argc;
+}
+
+static u64 build_user_stack(u64 stack_top, int argc, char **argv) {
+    /* SysV layout, low to high at the returned sp:
+     *   argc, argv[0..argc-1], NULL (argv term), NULL (envp term).
+     * Strings live above the vectors, in a string pool placed just below
+     * stack_top so the user image's high-water mark is stable. */
+    if (argc < 1) argc = 1;
+    if (argc > MAX_ARGV) argc = MAX_ARGV;
+
+    /* Lay strings down from stack_top - 16 (16-byte alignment slack). */
+    u64 strs_top = stack_top - 16;
+    u64 strs[MAX_ARGV];
+    char *cursor = (char *)strs_top;
+    for (int i = argc - 1; i >= 0; i--) {
+        int n = str_n(argv[i]) + 1;
+        cursor -= n;
+        for (int j = 0; j < n; j++) cursor[j] = argv[i][j];
+        strs[i] = (u64)cursor;
+    }
+
+    /* Frame size: 8 (argc) + (argc+1)*8 (argv + NULL) + 8 (envp NULL) */
+    u64 sp = (u64)cursor - (u64)((argc + 3) * 8);
     sp &= ~15UL;
     u64 *p = (u64 *)sp;
-    p[0] = 1;                       /* argc */
-    p[1] = (u64)str;                /* argv[0] */
-    p[2] = 0;                       /* argv terminator */
-    p[3] = 0;                       /* envp terminator */
+    p[0] = (u64)argc;
+    for (int i = 0; i < argc; i++) p[1 + i] = strs[i];
+    p[1 + argc] = 0;                /* argv terminator */
+    p[2 + argc] = 0;                /* envp terminator */
     return sp;
 }
 
@@ -614,13 +995,68 @@ void kmain(u64 dtb_phys) {
     if (!entry) { uart_puts("[seed] load_elf failed\n"); for(;;) asm volatile("wfe"); }
     uart_puts("[seed] /init e_entry="); uart_putx(entry); uart_puts("\n");
 
-    /* User stack at top of a reserved high region. brk above that. */
-    u64 ustack_top = 0x46000000UL;
-    brk_base = 0x46000000UL;
+    /* parse_cpio + load_elf are done — original initrd memory is dead.
+     * Bump kheap_end to reclaim it for tmpfs file growth via sys_write. */
+    kheap_end = (u8 *)0x4b000000UL;
+
+    /* User runs in the L2-mapped low-VA window (USER_VA_LO..USER_VA_HI,
+     * physically backed by USER_POOL_PA). Stack grows down from the top
+     * of the window; brk grows up from above the loaded image's
+     * end-of-bss (g_user_image_end, set by load_elf). 16 MB reserved at
+     * the top for the user stack. */
+    u64 ustack_top = USER_VA_HI;
+    brk_base = g_user_image_end ? g_user_image_end : USER_VA_LO;
     brk_cur  = brk_base;
-    brk_max  = 0x4a000000UL;
+    brk_max  = USER_VA_HI - 0x01000000UL;
+
+    /* Build argv. Priority:
+     *   1. DTB /chosen/bootargs (whitespace-tokenised — qemu -append "...").
+     *   2. /init.argv from the initramfs (whitespace-tokenised, e.g. one arg per line).
+     *   3. Fallback: argc=1, argv[0]="init".
+     * For sources 1 and 2, argv passed to user is exactly what the source
+     * provided: no implicit argv[0]="init" prefix is added.
+     *
+     * The seed kernel reserves one bootargs token: "dumpfs". When present,
+     * it is stripped from argv and triggers a hex-encoded dump of the
+     * full tmpfs over UART on exit (sentinel-framed for host extraction). */
+    static char argv_pool[512];
+    char *uargv[MAX_ARGV];
+    int uargc = 0;
+
+    if (dt.bootargs[0]) {
+        int n = 0;
+        while (dt.bootargs[n] && n < (int)sizeof(argv_pool) - 1) {
+            argv_pool[n] = dt.bootargs[n]; n++;
+        }
+        argv_pool[n] = 0;
+        char *raw[MAX_ARGV];
+        int rawc = tokenise(argv_pool, raw, MAX_ARGV);
+        for (int i = 0; i < rawc; i++) {
+            if (str_eq(raw[i], "dumpfs")) { g_dumpfs = 1; continue; }
+            uargv[uargc++] = raw[i];
+        }
+    }
+    if (uargc == 0) {
+        int aidx = find_file("init.argv");
+        if (aidx >= 0) {
+            u64 n = files[aidx].len;
+            if (n >= sizeof(argv_pool)) n = sizeof(argv_pool) - 1;
+            for (u64 i = 0; i < n; i++) argv_pool[i] = (char)files[aidx].data[i];
+            argv_pool[n] = 0;
+            uargc = tokenise(argv_pool, uargv, MAX_ARGV);
+        }
+    }
+    if (uargc == 0) {
+        argv_pool[0] = 'i'; argv_pool[1] = 'n'; argv_pool[2] = 'i';
+        argv_pool[3] = 't'; argv_pool[4] = 0;
+        uargv[0] = argv_pool;
+        uargc = 1;
+    }
+    uart_puts("[seed] argv:");
+    for (int i = 0; i < uargc; i++) { uart_puts(" "); uart_puts(uargv[i]); }
+    uart_puts("\n");
 
-    u64 user_sp = build_user_stack(ustack_top, "init");
+    u64 user_sp = build_user_stack(ustack_top, uargc, uargv);
 
     uart_puts("[seed] eret to user, sp="); uart_putx(user_sp); uart_puts("\n");
     eret_to_user(entry, user_sp);
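
The stack layout that build_user_stack produces can be checked on a host with a small sketch. This is not kernel code: the `sketch_*` names and `SKETCH_MAX_ARGV` are hypothetical, `memcpy`/`strlen` stand in for the kernel's own helpers, and the buffer plays the role of the user stack window. It assumes the same rules as the hunk above: strings are packed downward from 16 bytes below the top, and the `(argc + 3) * 8`-byte vector sits below them with sp rounded down to 16 bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_ARGV 16   /* hypothetical cap, stands in for MAX_ARGV */

/* Store/load a 64-bit slot through memcpy so the sketch makes no
 * alignment or aliasing assumptions about the host buffer. */
static void sketch_put_u64(uintptr_t addr, uint64_t v) {
    memcpy((void *)addr, &v, sizeof v);
}
static uint64_t sketch_get_u64(uintptr_t addr) {
    uint64_t v;
    memcpy(&v, (const void *)addr, sizeof v);
    return v;
}

/* Lay out argc/argv/envp at the top of buf[0..size) and return the
 * initial sp: an address inside buf, rounded down to 16 bytes. */
static uintptr_t sketch_build_stack(unsigned char *buf, size_t size,
                                    int argc, char **argv) {
    assert(argc >= 1 && argc <= SKETCH_MAX_ARGV);

    /* Worst case: 16 bytes slack + strings + vector + alignment loss. */
    size_t need = 16 + (size_t)(argc + 3) * 8 + 15;
    for (int i = 0; i < argc; i++) need += strlen(argv[i]) + 1;
    assert(size >= need);

    /* Copy strings last-to-first so argv[0] ends up at the lowest address. */
    uintptr_t strs[SKETCH_MAX_ARGV];
    uintptr_t cursor = (uintptr_t)buf + size - 16;
    for (int i = argc - 1; i >= 0; i--) {
        size_t n = strlen(argv[i]) + 1;
        cursor -= n;
        memcpy((void *)cursor, argv[i], n);
        strs[i] = cursor;
    }

    /* Vector: argc, argv[0..argc-1], NULL (argv end), NULL (envp end). */
    uintptr_t sp = cursor - (uintptr_t)(argc + 3) * 8;
    sp &= ~(uintptr_t)15;

    sketch_put_u64(sp, (uint64_t)argc);
    for (int i = 0; i < argc; i++)
        sketch_put_u64(sp + 8 * (uintptr_t)(1 + i), (uint64_t)strs[i]);
    sketch_put_u64(sp + 8 * (uintptr_t)(1 + argc), 0);
    sketch_put_u64(sp + 8 * (uintptr_t)(2 + argc), 0);
    return sp;
}
```

Reading the result back with `sketch_get_u64(sp)` gives argc, and the slot at `sp + 8` holds the address of argv[0], which is what the `_start` shims in user/*.c rely on (`ldr x0, [sp]` and `add x1, sp, #8`).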
diff --git a/seed-kernel/run.sh b/seed-kernel/run.sh
@@ -15,7 +15,7 @@ INITRD=build/initramfs.cpio
 exec qemu-system-aarch64 \
     -machine virt \
     -cpu cortex-a72 \
-    -m 512M \
+    -m 2048M \
     -nographic \
     -no-reboot \
     -kernel "$KERNEL" \
diff --git a/seed-kernel/scripts/extract-dump.sh b/seed-kernel/scripts/extract-dump.sh
@@ -0,0 +1,56 @@
+#!/bin/sh
+# Extract files from a seed-kernel UART transcript that was produced with
+# the "dumpfs" bootargs token. Reads transcript from stdin (or $1), writes
+# each dumped file to <outdir>/<path>. Header format emitted by kernel.c
+# dump_tmpfs():
+#
+#     === DUMP-BEGIN ===
+#     === FILE path=<name> size=<N> ===
+#     <2*N hex chars><LF>
+#     ... repeat ...
+#     === DUMP-END ===
+#
+# Anything before DUMP-BEGIN or after DUMP-END is ignored.
+#
+# Usage: extract-dump.sh <outdir> [transcript]
+
+set -eu
+
+[ $# -ge 1 ] || { echo "usage: $0 <outdir> [transcript]" >&2; exit 2; }
+
+outdir=$1
+shift
+mkdir -p "$outdir"
+
+if [ $# -ge 1 ]; then
+    src=$(cat "$1")
+else
+    src=$(cat)
+fi
+
+# Strip CRs that QEMU's nographic UART likes to emit.
+src=$(printf '%s' "$src" | tr -d '\r')
+
+awk -v outdir="$outdir" '
+/^=== DUMP-BEGIN ===$/         { in_dump = 1;  next }
+/^=== DUMP-END ===$/            { in_dump = 0;  next }
+in_dump && /^=== FILE path=/    {
+    sub(/^=== FILE path=/, "")
+    sub(/ ===$/, "")
+    n = split($0, kv, " size=")
+    path = kv[1]
+    size = kv[2]+0
+    out = outdir "/" path
+    # Make any parent dirs (tmpfs is flat but be safe).
+    cmd = "mkdir -p \"$(dirname \"" out "\")\""; system(cmd); close(cmd)
+    next_is_hex = 1
+    print "extract: " path " (" size " bytes) -> " out > "/dev/stderr"
+    # Read the next line as the hex payload and decode it via xxd.
+    getline hex
+    decode_cmd = "printf %s \"" hex "\" | xxd -r -p > \"" out "\""
+    system(decode_cmd); close(decode_cmd)
+    next
+}
+' <<EOF
+$src
+EOF
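
The frame format documented at the top of extract-dump.sh is simple enough to decode without awk or xxd. The following is a minimal host-side sketch in C, assuming the header line has already been stripped of its trailing newline and CR, and that the payload is plain hex with either letter case (the kernel's case is not shown in this diff). The `sketch_*` function names are hypothetical and do not exist in the repository.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Parse "=== FILE path=<name> size=<N> ===" into path and size.
 * Returns 0 on success, -1 if the line is not a FILE header. */
static int sketch_parse_header(const char *line, char *path, size_t cap,
                               unsigned long *size) {
    static const char prefix[] = "=== FILE path=";
    size_t plen = sizeof prefix - 1;
    assert(line && path && size);
    if (strncmp(line, prefix, plen) != 0) return -1;

    const char *name = line + plen;
    const char *sz = strstr(name, " size=");
    if (!sz) return -1;
    size_t nlen = (size_t)(sz - name);
    if (nlen == 0 || nlen >= cap) return -1;
    memcpy(path, name, nlen);
    path[nlen] = '\0';

    /* The decimal size must be followed by exactly " ===". */
    char *end;
    *size = strtoul(sz + 6, &end, 10);
    if (end == sz + 6 || strcmp(end, " ===") != 0) return -1;
    return 0;
}

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int sketch_hex_val(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decode a hex string into out[0..cap). Returns the byte count, or -1
 * on a bad digit, an odd number of digits, or overflow of out. */
static long sketch_hex_decode(const char *hex, unsigned char *out, size_t cap) {
    size_t n = 0;
    while (hex[0] && hex[1]) {
        int hi = sketch_hex_val(hex[0]);
        int lo = sketch_hex_val(hex[1]);
        if (hi < 0 || lo < 0 || n >= cap) return -1;
        out[n++] = (unsigned char)((hi << 4) | lo);
        hex += 2;
    }
    if (hex[0]) return -1;   /* dangling half byte */
    return (long)n;
}
```

Comparing the decoded byte count with the `size` field from the header is a cheap integrity check that the shell script does not perform; a mismatch would suggest the UART transcript was truncated or interleaved with other output.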
diff --git a/seed-kernel/scripts/fixtures/tier2-driver.scm b/seed-kernel/scripts/fixtures/tier2-driver.scm
@@ -0,0 +1,5 @@
+;; Tier 2 acceptance driver: spawn the chain stage, wait, return status.
+(let ((r (run "child-prog" "out.txt" "in1" "in2")))
+  (if (and (car r) (= 0 (cdr r)))
+      (sys-exit 0)
+      (sys-exit 1)))
diff --git a/seed-kernel/scripts/fixtures/tier2-tcc-driver.scm b/seed-kernel/scripts/fixtures/tier2-tcc-driver.scm
@@ -0,0 +1,8 @@
+;; Tier 2 acceptance driver (canonical form): scheme1 spawns tcc0 (the
+;; cc.scm-built bootstrap C compiler) to compile a .c source into a
+;; relocatable object, waits for the child, and exits with the child's
+;; status. The OS-TODO item 11 fixture.
+(let ((r (run "child-prog" "-nostdlib" "-c" "-o" "out.o" "input.c")))
+  (if (and (car r) (= 0 (cdr r)))
+      (sys-exit 0)
+      (sys-exit 1)))
diff --git a/seed-kernel/scripts/tier1-gate.sh b/seed-kernel/scripts/tier1-gate.sh
@@ -0,0 +1,91 @@
+#!/bin/sh
+# tier1-gate.sh — run a single boot2-chain stage binary on the seed
+# kernel and extract its output files. The chain treats stages as
+# pure file→file transformations (catm-style), so a Tier 1 acceptance
+# run is one qemu boot per stage.
+#
+# Usage:
+#   tier1-gate.sh <stage-binary> <output-dir> -- <argv...> -- <input-files...>
+#
+# Builds an initramfs containing the stage binary as /init plus every
+# file from <input-files...> at its basename, runs qemu, and extracts
+# every file in the post-run tmpfs into <output-dir>/. The driver
+# passes <argv...> verbatim through /chosen/bootargs (with a "dumpfs"
+# token appended to trigger the UART dump on exit).
+#
+# Example: run boot0 catm to concatenate a + b into out.
+#   tier1-gate.sh build/aarch64/boot0/catm /tmp/out \
+#                 -- init out a b -- /tmp/a /tmp/b
+
+set -eu
+
+if [ $# -lt 3 ]; then
+    echo "usage: $0 <stage-binary> <output-dir> -- <argv...> -- <input-files...>" >&2
+    exit 2
+fi
+
+STAGE_BIN=$1; shift
+OUTDIR=$1; shift
+[ "$1" = "--" ] || { echo "expected -- before argv" >&2; exit 2; }
+shift
+
+# Collect argv until next --
+ARGV=""
+while [ $# -gt 0 ] && [ "$1" != "--" ]; do
+    if [ -z "$ARGV" ]; then ARGV=$1; else ARGV="$ARGV $1"; fi
+    shift
+done
+[ "${1:-}" = "--" ] || { echo "expected -- before input-files" >&2; exit 2; }
+shift
+
+HERE=$(cd "$(dirname "$0")" && pwd)
+SEED_DIR=$(cd "$HERE/.." && pwd)
+KERNEL=$SEED_DIR/build/Image
+EXTRACT=$HERE/extract-dump.sh
+
+[ -f "$KERNEL" ] || { echo "missing $KERNEL — run 'make' in $SEED_DIR first" >&2; exit 1; }
+[ -x "$EXTRACT" ] || { echo "missing $EXTRACT" >&2; exit 1; }
+
+mkdir -p "$OUTDIR"
+
+# Stage initramfs.
+STAGE=$(mktemp -d -t tier1-stage.XXXXXX)
+trap 'rm -rf "$STAGE"' EXIT
+
+cp "$STAGE_BIN" "$STAGE/init"
+chmod +x "$STAGE/init"
+NAMES="init"
+for inp in "$@"; do
+    base=$(basename "$inp")
+    cp "$inp" "$STAGE/$base"
+    NAMES="$NAMES
+$base"
+done
+INITRAMFS=$STAGE/initramfs.cpio
+( cd "$STAGE" && printf '%s\n' "$NAMES" | cpio -o -H newc 2>/dev/null > initramfs.cpio )
+
+# Run qemu, capture transcript, extract.
+TRANSCRIPT=$STAGE/transcript.txt
+echo "[gate] running stage with argv: $ARGV dumpfs" >&2
+qemu-system-aarch64 \
+    -machine virt -cpu cortex-a72 -m 2048M \
+    -nographic -no-reboot \
+    -kernel "$KERNEL" -initrd "$INITRAMFS" \
+    -append "$ARGV dumpfs" \
+    > "$TRANSCRIPT" 2>&1 &
+QPID=$!
+# Bound the run; the seed kernel ends with PSCI SYSTEM_OFF on exit,
+# but on a hang we still need to come back.
+( sleep 120; kill -9 $QPID 2>/dev/null ) &
+WATCHER=$!
+wait $QPID 2>/dev/null || true
+kill $WATCHER 2>/dev/null || true
+
+if ! grep -q '=== DUMP-END ===' "$TRANSCRIPT"; then
+    echo "[gate] FAIL: no DUMP-END in transcript" >&2
+    tail -40 "$TRANSCRIPT" >&2
+    exit 3
+fi
+
+"$EXTRACT" "$OUTDIR" "$TRANSCRIPT"
+echo "[gate] extracted to $OUTDIR" >&2
diff --git a/seed-kernel/scripts/tier2-gate.sh b/seed-kernel/scripts/tier2-gate.sh
@@ -0,0 +1,87 @@
+#!/bin/sh
+# tier2-gate.sh — end-to-end Tier 2 acceptance: scheme1 driver spawns a
+# chain stage as a subprocess (clone + execve + waitid), waits for it,
+# returns its result file. One qemu boot, end-to-end.
+#
+# Usage:
+#   tier2-gate.sh <scheme1> <prelude.scm> <driver.scm> \
+#                 <child-bin> <output-dir> -- <input-files...>
+#
+# Stages: combines prelude.scm + driver.scm into combined.scm via host
+# `cat`, packs an initramfs containing /init=scheme1, /combined.scm,
+# /child-prog=<child-bin>, plus every input file at its basename, then
+# boots qemu with bootargs "init combined.scm dumpfs". driver.scm is
+# expected to use prelude's (run "child-prog" ...) wrapper.
+#
+# After qemu exits, every file in the post-run tmpfs is extracted into
+# <output-dir>/. The driver's exit status is reflected in this script's
+# exit status (0 = scheme1 driver said success).
+
+set -eu
+
+if [ $# -lt 6 ]; then
+    echo "usage: $0 <scheme1> <prelude.scm> <driver.scm> <child-bin> <outdir> -- <inputs...>" >&2
+    exit 2
+fi
+
+SCHEME1=$1; PRELUDE=$2; DRIVER=$3; CHILD=$4; OUTDIR=$5
+shift 5
+[ "$1" = "--" ] || { echo "expected -- before input files" >&2; exit 2; }
+shift
+
+HERE=$(cd "$(dirname "$0")" && pwd)
+SEED_DIR=$(cd "$HERE/.." && pwd)
+KERNEL=$SEED_DIR/build/Image
+EXTRACT=$HERE/extract-dump.sh
+[ -f "$KERNEL" ] || { echo "missing $KERNEL — run 'make' in $SEED_DIR first" >&2; exit 1; }
+
+mkdir -p "$OUTDIR"
+STAGE=$(mktemp -d -t tier2-stage.XXXXXX)
+trap 'rm -rf "$STAGE"' EXIT
+
+cp "$SCHEME1" "$STAGE/init"; chmod +x "$STAGE/init"
+cp "$CHILD"   "$STAGE/child-prog"; chmod +x "$STAGE/child-prog"
+cat "$PRELUDE" "$DRIVER" > "$STAGE/combined.scm"
+NAMES="init
+child-prog
+combined.scm"
+for inp in "$@"; do
+    base=$(basename "$inp")
+    cp "$inp" "$STAGE/$base"
+    NAMES="$NAMES
+$base"
+done
+( cd "$STAGE" && printf '%s\n' "$NAMES" | cpio -o -H newc 2>/dev/null > initramfs.cpio )
+
+TRANSCRIPT=$STAGE/transcript.txt
+echo "[gate] running scheme1 driver" >&2
+qemu-system-aarch64 \
+    -machine virt -cpu cortex-a72 -m 2048M \
+    -nographic -no-reboot \
+    -kernel "$KERNEL" -initrd "$STAGE/initramfs.cpio" \
+    -append "init combined.scm dumpfs" \
+    > "$TRANSCRIPT" 2>&1 &
+QPID=$!
+( sleep 240; kill -9 $QPID 2>/dev/null ) &
+WATCHER=$!
+wait $QPID 2>/dev/null || true
+kill $WATCHER 2>/dev/null || true
+
+if ! grep -q '=== DUMP-END ===' "$TRANSCRIPT"; then
+    echo "[gate] FAIL: no DUMP-END in transcript" >&2
+    tail -40 "$TRANSCRIPT" >&2
+    exit 3
+fi
+
+# Capture the driver's exit code from the kernel's parting message.
+EXIT_LINE=$(grep -E "user exit_group" "$TRANSCRIPT" | tail -1 || true)
+"$EXTRACT" "$OUTDIR" "$TRANSCRIPT"
+
+case "$EXIT_LINE" in
+    *"exit_group(0)"*)
+        echo "[gate] PASS — driver exit 0; outputs in $OUTDIR" >&2
+        exit 0 ;;
+    *)
+        echo "[gate] FAIL — driver did not exit 0: $EXIT_LINE" >&2
+        exit 4 ;;
+esac
diff --git a/seed-kernel/user/child.c b/seed-kernel/user/child.c
@@ -0,0 +1,51 @@
+/* Tier 2 demo: child program execve'd by forktest. Prints argv and
+ * exits 42 so the parent's waitid can verify si_status round-trips. */
+
+typedef long          i64;
+typedef unsigned long u64;
+
+#define SYS_write       64
+#define SYS_exit_group  93
+
+static i64 sysc(u64 nr, u64 a, u64 b, u64 c) {
+    register u64 x8 asm("x8") = nr;
+    register u64 x0 asm("x0") = a;
+    register u64 x1 asm("x1") = b;
+    register u64 x2 asm("x2") = c;
+    asm volatile("svc #0" : "+r"(x0) : "r"(x8), "r"(x1), "r"(x2) : "memory", "cc");
+    return (i64)x0;
+}
+
+static i64 sys_write(int fd, const void *buf, u64 n) { return sysc(SYS_write, (u64)fd, (u64)buf, n); }
+static void sys_exit(int c) { sysc(SYS_exit_group, (u64)c, 0, 0); for(;;); }
+
+void *memset(void *d, int c, u64 n) {
+    unsigned char *dd = d; for (u64 i = 0; i < n; i++) dd[i] = (unsigned char)c; return d;
+}
+
+static u64 strlen_(const char *s) { u64 n = 0; while (s[n]) n++; return n; }
+static void puts_(const char *s) { sys_write(1, s, strlen_(s)); }
+static void put_d(i64 v) {
+    char buf[24]; int i = 0;
+    if (v == 0) { sys_write(1, "0", 1); return; }
+    if (v < 0) { sys_write(1, "-", 1); v = -v; }
+    while (v) { buf[i++] = '0' + (char)(v % 10); v /= 10; }
+    while (i--) sys_write(1, &buf[i], 1);
+}
+
+void _start_c(long argc, char **argv) {
+    puts_("[child] argc="); put_d(argc); puts_("\n");
+    for (long i = 0; i < argc; i++) {
+        puts_("[child] argv["); put_d(i); puts_("] = "); puts_(argv[i]); puts_("\n");
+    }
+    puts_("[child] exiting 42\n");
+    sys_exit(42);
+}
+
+asm(
+    ".globl _start\n"
+    ".type  _start, %function\n"
+    "_start:\n"
+    "    ldr x0, [sp]\n"
+    "    add x1, sp, #8\n"
+    "    b   _start_c\n");
diff --git a/seed-kernel/user/forktest.c b/seed-kernel/user/forktest.c
@@ -0,0 +1,88 @@
+/* Tier 2 demo: parent does clone() → execve("child") in child →
+ * waitid in parent → reports result. Mirrors the scheme1 prelude's
+ * spawn/run/wait pattern in C. */
+
+typedef long          i64;
+typedef unsigned long u64;
+typedef int           i32;
+
+#define SYS_write       64
+#define SYS_openat      56
+#define SYS_close       57
+#define SYS_read        63
+#define SYS_lseek       62
+#define SYS_brk         214
+#define SYS_exit_group  93
+#define SYS_clone       220
+#define SYS_execve      221
+#define SYS_waitid       95
+
+static i64 sysc(u64 nr, u64 a, u64 b, u64 c, u64 d, u64 e, u64 f) {
+    register u64 x8 asm("x8") = nr;
+    register u64 x0 asm("x0") = a;
+    register u64 x1 asm("x1") = b;
+    register u64 x2 asm("x2") = c;
+    register u64 x3 asm("x3") = d;
+    register u64 x4 asm("x4") = e;
+    register u64 x5 asm("x5") = f;
+    asm volatile("svc #0"
+                 : "+r"(x0)
+                 : "r"(x8), "r"(x1), "r"(x2), "r"(x3), "r"(x4), "r"(x5)
+                 : "memory", "cc");
+    return (i64)x0;
+}
+
+static i64 sys_write(int fd, const void *buf, u64 n) { return sysc(SYS_write, (u64)fd, (u64)buf, n, 0,0,0); }
+static void sys_exit(int c) { sysc(SYS_exit_group, (u64)c, 0,0,0,0,0); for(;;); }
+static i64 sys_clone(void) { return sysc(SYS_clone, 17/*SIGCHLD*/, 0,0,0,0,0); }
+static i64 sys_execve(const char *p, char **argv) { return sysc(SYS_execve, (u64)p, (u64)argv, 0, 0, 0, 0); }
+static i64 sys_waitid(int id, int pid, void *info, int opts) { return sysc(SYS_waitid, (u64)id, (u64)pid, (u64)info, (u64)opts, 0, 0); }
+
+void *memset(void *d, int c, u64 n) {
+    unsigned char *dd = d; for (u64 i = 0; i < n; i++) dd[i] = (unsigned char)c; return d;
+}
+
+static u64 strlen_(const char *s) { u64 n = 0; while (s[n]) n++; return n; }
+static void puts_(const char *s) { sys_write(1, s, strlen_(s)); }
+static void put_d(i64 v) {
+    char buf[24]; int i = 0;
+    if (v < 0) { sys_write(1, "-", 1); v = -v; }
+    if (v == 0) buf[i++] = '0';
+    while (v) { buf[i++] = '0' + (char)(v % 10); v /= 10; }
+    while (i--) sys_write(1, &buf[i], 1);
+}
+
+void _start_c(long argc, char **argv) {
+    puts_("[forktest] argc="); put_d(argc); puts_(" argv[0]="); puts_(argv[0]); puts_("\n");
+
+    long pid = sys_clone();
+    if (pid == 0) {
+        /* child */
+        puts_("[forktest:child] pre-exec\n");
+        char *cargv[3];
+        cargv[0] = "child";
+        cargv[1] = "from-parent";
+        cargv[2] = 0;
+        sys_execve("child", cargv);
+        puts_("[forktest:child] execve failed\n");
+        sys_exit(127);
+    }
+    /* parent */
+    puts_("[forktest:parent] clone returned pid="); put_d(pid); puts_("\n");
+    unsigned char info[128];
+    memset(info, 0, sizeof info);
+    long w = sys_waitid(/*P_PID*/1, (int)pid, info, /*WEXITED*/4);
+    puts_("[forktest:parent] waitid="); put_d(w);
+    puts_(" si_code="); put_d(*(i32 *)(info + 8));
+    puts_(" si_status="); put_d(*(i32 *)(info + 24));
+    puts_("\n");
+    sys_exit(0);
+}
+
+asm(
+    ".globl _start\n"
+    ".type  _start, %function\n"
+    "_start:\n"
+    "    ldr x0, [sp]\n"
+    "    add x1, sp, #8\n"
+    "    b   _start_c\n");
diff --git a/seed-kernel/user/hello.c b/seed-kernel/user/hello.c
@@ -61,8 +61,17 @@ static void put_x(u64 v) {
     for (int i = 60; i >= 0; i -= 4) { char c = hex[(v >> i) & 0xf]; sys_write(1, &c, 1); }
 }
 
-void _start(void) {
+/* aarch64 entry: x0 holds nothing — the SysV stack layout is at sp:
+ *   [argc][argv[0]]...[argv[argc-1]][NULL][envp...][NULL]
+ * We read argc/argv off the initial stack pointer in the asm shim
+ * below, then tail-call into _start_c. */
+void _start_c(long argc, char **argv) {
     puts_("hello from user space (EL1t, identity-map MMU)\n");
+    puts_("argc = "); put_d(argc); puts_("\n");
+    for (long i = 0; i < argc; i++) {
+        puts_("  argv["); put_d(i); puts_("] = ");
+        puts_(argv[i]); puts_("\n");
+    }
 
     /* Exercise brk: ask current break, push it up by 1 MiB, write+read. */
     u64 b0 = (u64)sys_brk(0);
@@ -107,3 +116,15 @@ void _start(void) {
     puts_("[user] all checks passed, exiting 0\n");
     sys_exit(0);
 }
+
+/* Read argc/argv off the initial stack and tail-call into _start_c. The
+ * kernel sets sp_el0 to point at [argc][argv[0]]... before ERETing.
+ * Emitted as a plain global symbol with raw asm — no C-compiler-generated
+ * prologue, since gcc would clobber sp before we read argc. */
+asm(
+    ".globl _start\n"
+    ".type  _start, %function\n"
+    "_start:\n"
+    "    ldr x0, [sp]\n"
+    "    add x1, sp, #8\n"
+    "    b   _start_c\n");
diff --git a/seed-kernel/user/user.lds b/seed-kernel/user/user.lds
@@ -1,10 +1,12 @@
-/* Link the user binary high enough to be clear of the kernel image
- * (which sits at 0x40080000) and the initrd (placed by QEMU). */
+/* Link at the boot2 chain's default base (hex2pp -B 0x600000). This is
+ * below QEMU virt's RAM (which starts at 0x40000000) — the seed kernel
+ * provides a per-process L2 page table that maps user low VAs to a
+ * reserved physical RAM pool, so VA 0x600000 has real backing. */
 
 ENTRY(_start)
 
 SECTIONS {
-    . = 0x42000000;
+    . = 0x00600000;
 
     .text   : { *(.text .text.*) }
     .rodata : ALIGN(8) { *(.rodata .rodata.*) }

	boot2 Playing with the bootstrap
	git clone https://git.ryansepassi.com/git/boot2.git

A	docs/OS-TODO.md	\|	117	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M	seed-kernel/Makefile	\|	18	+++++++++++++++++-
M	seed-kernel/kernel.c	\|	528	++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
M	seed-kernel/run.sh	\|	2	+-
A	seed-kernel/scripts/extract-dump.sh	\|	56	++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A	seed-kernel/scripts/fixtures/tier2-driver.scm	\|	5	+++++
A	seed-kernel/scripts/fixtures/tier2-tcc-driver.scm	\|	8	++++++++
A	seed-kernel/scripts/tier1-gate.sh	\|	91	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A	seed-kernel/scripts/tier2-gate.sh	\|	87	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A	seed-kernel/user/child.c	\|	51	+++++++++++++++++++++++++++++++++++++++++++++++++++
A	seed-kernel/user/forktest.c	\|	88	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M	seed-kernel/user/hello.c	\|	23	++++++++++++++++++++++-
M	seed-kernel/user/user.lds	\|	8	+++++---