commit 5073ee046922edc1734cd7c6abda83c25d743fef
parent d437cb28c8b53b20923897d9aaa74fd7b0244928
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Mon, 4 May 2026 20:58:34 -0700
seed-kernel: close all 11 OS-TODO items + verification gates
Tier 1 fixes:
- argv from /chosen/bootargs (whitespace-tokenised) with /init.argv
fallback; build_user_stack now takes (argc, argv[])
- exit_group masks status to low byte
- Per-process L2 page table maps user low VA to a 768 MB physical
pool. PL011/GIC/virtio reached from kernel via a high alias
(L1[4] = Device 1 GB block at PA 0, so VA 0x109000000 -> PA
0x09000000). The boot2 chain runs at its native -B 0x600000.
- load_elf clips PT_LOAD memsz at USER_VA_HI and reports end-of-image
in g_user_image_end so brk_base sits above the binary's BSS
- Documented RWX-at-EL1 in load_elf as deliberate per OS.md
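Illustrative note on the low-VA mapping (not part of the kernel source):
a minimal sketch of the address arithmetic behind the per-process L2
table and the device alias. The constants are copied from this commit;
the helper names user_va_to_pa and device_alias are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Constants copied from the commit; helper names are hypothetical. */
#define USER_POOL_PA      0x4c000000ULL
#define USER_POOL_SIZE    0x30000000ULL   /* 768 MB */
#define USER_VA_LO        0x00200000ULL   /* slot 1 */
#define USER_VA_HI        0x30200000ULL   /* slot 385 */
#define DEVICE_ALIAS_BASE 0x100000000ULL  /* L1[4]: VA 4 GB..5 GB */

/* Translate a user VA to the pool PA that the L2 table maps it to.
 * Returns 0 for VAs outside the user window (slot 0 or device slots). */
static uint64_t user_va_to_pa(uint64_t va) {
    if (va < USER_VA_LO || va >= USER_VA_HI) return 0;
    return USER_POOL_PA + (va - USER_VA_LO);
}

/* Kernel-side VA for a device register at physical address pa (< 1 GB). */
static uint64_t device_alias(uint64_t pa) {
    assert(pa < 0x40000000ULL);
    return DEVICE_ALIAS_BASE + pa;
}
```

Under these assumptions the boot2 link base 0x600000 lands 4 MB into the
pool, and the PL011 at PA 0x09000000 is reached at VA 0x109000000.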
Tier 2 syscalls (clone/execve/waitid):
- Pseudo-fork via proc_stack[], whose proc_save frames hold a memory
snapshot, regs, brk state, and fd table. clone returns 0 to the
current "child"; execve replaces the user image; exit_group pops the
snapshot, restores parent state, and resumes parent's clone() with
x0=pid. waitid populates info[8]/info[24] per the prelude. envp is
ignored; NULL/empty is accepted, as the spec requires.
Caches enabled at MMU bring-up (SCTLR.C|I) so the 768 MB snapshot
copy isn't unbearably slow under TCG.
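Illustrative note on the pseudo-fork sequence: a toy host-side model,
assuming a 16-byte pool in place of the 768 MB one. Every name in it
(toy_kernel, toy_clone, and so on) is hypothetical; it mirrors only the
snapshot/restore ordering described above, not the trap-frame handling.

```c
#include <assert.h>
#include <string.h>

enum { POOL = 16 };               /* stand-in for the 768 MB user pool */

struct toy_kernel {
    unsigned char pool[POOL];     /* live user memory */
    unsigned char snapshot[POOL]; /* the single pseudo-fork frame */
    int depth;                    /* 0 or 1, as with MAX_PROC_DEPTH = 1 */
    int next_pid, child_pid;
    int last_code, last_valid;    /* exit status stashed for waitid */
};

static void toy_init(struct toy_kernel *k) {
    memset(k, 0, sizeof *k);
    k->next_pid = 2;
}

/* clone: snapshot the parent; the caller continues as the "child". */
static int toy_clone(struct toy_kernel *k) {
    if (k->depth >= 1) return -11;            /* -EAGAIN */
    memcpy(k->snapshot, k->pool, POOL);
    k->child_pid = k->next_pid++;
    k->depth++;
    return 0;
}

/* execve: overwrite the user image with the new program's bytes. */
static void toy_execve(struct toy_kernel *k, unsigned char fill) {
    memset(k->pool, fill, POOL);
}

/* exit_group: with a saved frame, restore the parent and return the pid
 * its clone() should observe; with none, return -1 ("real shutdown"). */
static int toy_exit_group(struct toy_kernel *k, int code) {
    if (k->depth == 0) return -1;
    k->depth--;
    k->last_code = code & 0xff;
    k->last_valid = 1;
    memcpy(k->pool, k->snapshot, POOL);
    return k->child_pid;
}

/* waitid: hand back the stashed exit status exactly once. */
static int toy_waitid(struct toy_kernel *k, int *status) {
    if (!k->last_valid) return -10;           /* -ECHILD */
    *status = k->last_code;
    k->last_valid = 0;
    return 0;
}
```

The model shows why one frame is enough: the parent cannot run again
until the child's exit has popped the snapshot.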
Verification harness:
- dumpfs bootargs token triggers a sentinel-framed hex tmpfs dump
on exit; scripts/extract-dump.sh decodes it back to files.
- scripts/tier1-gate.sh runs an arbitrary chain stage as /init,
extracts the post-run tmpfs. Verified with boot0/catm and
boot3/tcc0 (compiles a .c into a valid aarch64 ELF object).
- scripts/tier2-gate.sh + tier2-tcc-driver.scm fixture run
scheme1 → (run "tcc0" -nostdlib -c -o out.o input.c) → wait,
end-to-end. Output is a real ELF relocatable with the expected
symbols. This is the canonical OS.md §Verification Tier-2 case.
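Illustrative note on the dump framing: a minimal C sketch of decoding
one record from a UART transcript. The sentinel strings follow what
dump_tmpfs emits in this commit; decode_first_file and hex_nibble are
hypothetical helpers and a stand-in only, not a transcription of
scripts/extract-dump.sh.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Value of one lowercase hex digit, or -1 if c is not a hex digit. */
static int hex_nibble(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Decode the first "=== FILE path=... size=... ===" record in log.
 * Copies the path into path[pathcap] and the payload into out[outcap].
 * Returns the payload size, or -1 if no well-formed record is found. */
static long decode_first_file(const char *log, char *path, size_t pathcap,
                              unsigned char *out, size_t outcap) {
    static const char tag[] = "=== FILE path=";
    const char *p = strstr(log, tag);
    if (!p) return -1;
    p += sizeof tag - 1;

    const char *sz = strstr(p, " size=");
    if (!sz || (size_t)(sz - p) >= pathcap) return -1;
    memcpy(path, p, (size_t)(sz - p));
    path[sz - p] = '\0';

    long size = 0;
    if (sscanf(sz, " size=%ld ===", &size) != 1 || size < 0) return -1;
    if ((size_t)size > outcap) return -1;

    const char *hex = strchr(sz, '\n');   /* payload starts on next line */
    if (!hex) return -1;
    hex++;
    for (long i = 0; i < size; i++) {
        int hi = hex_nibble(hex[2 * i]);
        if (hi < 0) return -1;            /* also stops at end of string */
        int lo = hex_nibble(hex[2 * i + 1]);
        if (lo < 0) return -1;
        out[i] = (unsigned char)((hi << 4) | lo);
    }
    return size;
}
```

Trusting the size field rather than the line length lets a truncated
transcript be rejected instead of yielding a short file.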
User demos: forktest.c + child.c exercise clone/execve/waitid
(parent observes si_code=CLD_EXITED, si_status=42 from child).
hello.c grew an asm _start shim to read argc/argv off sp.
run.sh bumped to -m 2048M; kernel layout: 192 MB image+kheap,
768 MB user pool at 0x4c000000, 768 MB pseudo-fork snapshot at
0x7c000000, 320 MB spare.
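Illustrative note on the RAM layout: a small C sketch that checks the
four regions above are back-to-back and add up to the 2048 MB given to
QEMU. The boundaries are copied from this message; the table and helper
names (ram_layout, region_mb, layout_is_contiguous) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct region { const char *name; uint64_t start, end; };

/* Boundaries copied from the commit message (QEMU virt, -m 2048M). */
static const struct region ram_layout[] = {
    { "kernel image + kheap", 0x40000000ULL, 0x4c000000ULL },
    { "user pool",            0x4c000000ULL, 0x7c000000ULL },
    { "pseudo-fork snapshot", 0x7c000000ULL, 0xac000000ULL },
    { "spare",                0xac000000ULL, 0xc0000000ULL },
};
enum { N_REGIONS = sizeof ram_layout / sizeof ram_layout[0] };

/* Size of region i in MB. */
static uint64_t region_mb(int i) {
    return (ram_layout[i].end - ram_layout[i].start) >> 20;
}

/* Returns 1 if every region starts where the previous one ends. */
static int layout_is_contiguous(void) {
    for (int i = 1; i < N_REGIONS; i++)
        if (ram_layout[i].start != ram_layout[i - 1].end) return 0;
    return 1;
}
```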
Diffstat:
13 files changed, 1030 insertions(+), 52 deletions(-)
diff --git a/docs/OS-TODO.md b/docs/OS-TODO.md
@@ -0,0 +1,117 @@
+# Seed kernel — gaps against docs/OS.md
+
+Audit of [`seed-kernel/`](../seed-kernel/) against the contract in
+[`OS.md`](OS.md). All eleven items are now resolved — the seed kernel
+boots, parses the DTB, unpacks an initramfs into an in-memory tmpfs,
+loads `/init` as a static aarch64 ELF, dispatches the eight Tier-1 +
+three Tier-2 syscalls, and supports both the host-side verification
+gates `scripts/tier1-gate.sh` and `scripts/tier2-gate.sh`. Verified
+against `boot0/catm`, `boot1/M1pp`, and `boot3/tcc0`; the canonical
+Tier-2 case (scheme1 driver spawns tcc0 to compile a `.c` into a
+relocatable ELF object) round-trips end-to-end.
+
+## Tier 1
+
+1. **Real argv.** ✅ `build_user_stack` takes `(argc, argv[])`. argv is
+ sourced from `/chosen/bootargs` (whitespace-tokenised), then from a
+ `/init.argv` file in the initramfs, with `argc=1, argv[0]="init"`
+ as the final fallback. The kernel reserves a `dumpfs` token in
+ bootargs (stripped from user argv) that triggers the UART tmpfs
+ dump on exit (item 9).
+
+2. **User load address.** ✅ Per-process L2 page table installs an
+ `l2_user[]` covering the low 1 GB of VA in 2 MB blocks. Slot 0 is
+ invalid (NULL traps); slots 1…384 are Normal user RAM backed by a
+ 768 MB physical pool (`USER_POOL_PA`); slots 385…511 stay Device-
+ identity for safety. The PL011 / GIC / virtio that used to live in
+ the low 1 GB are now reached from kernel code via a high alias —
+ `L1[4]` is a 1 GB Device block at PA 0, so VA `0x109000000` ↔ PA
+ `0x09000000`. This lets the boot2 chain link at its native
+ `-B 0x600000` and run unmodified on the seed kernel.
+
+3. **Bigger heap.** ✅ User pool is 768 MB (slots 1…384 × 2 MB),
+ sized so tcc0/tcc-boot2 (which declare a 512 MB BSS at link base
+ `0x600000` ⇒ end VA `0x20600000`) fit with a healthy brk window
+ above end-of-bss. `load_elf` walks PT_LOAD segments and records
+ the post-clip end-of-image in `g_user_image_end`; kmain and
+ `sys_execve` use it to seed `brk_base`. `brk_max` is
+ `USER_VA_HI - 16 MB` (16 MB stack reserve at the top).
+
+4. **Per-segment ELF permissions.** ✅ Documented as a deliberate
+ spec-permissible choice in `load_elf` — segments are RWX at EL1.
+ OS.md §"Memory model" allows this; tcc-boot2 doesn't JIT.
+
+5. **`exit_group` exit-code masking.** ✅ `code &= 0xff` in
+ `sys_exit_or_resume_parent`, before it reaches `sys_exit_final`.
+
+## Tier 2
+
+6. **`clone` / `execve` / `waitid`.** ✅ Pseudo-fork via a
+ `proc_stack[]` of saved frames. `sys_clone` snapshots the trap
+ frame + sp_el0 + brk + fd table + the entire 768 MB user image
+ (one snapshot at PA `0x7c000000`), returns 0 to the current
+ context (the "child"). `sys_execve` captures path/argv into a
+ kernel pool before clobbering user memory, loads the new ELF,
+ resets brk above its end-of-bss, and rewrites the trap frame so
+ `eret` lands at the new entry point with a fresh user stack.
+ `sys_waitid` writes `si_code = CLD_EXITED` at siginfo offset 8 and
+ `si_status` at offset 24, per `scheme1/prelude.scm:497-506`. On
+ `sys_exit_or_resume_parent`, if `proc_depth > 0`, the kernel
+ restores the parent's image / regs / brk / fd table, syncs I-cache
+ over the freshly-overwritten user pages, and returns to the
+ parent's `clone()` site with `x0 = child_pid`.
+
+7. **Per-process state on a stack.** ✅ `proc_save` records regs +
+ ELR + SPSR + sp_el0 + brk_base + brk_cur + fd table + a 768 MB
+ memory snapshot. `MAX_PROC_DEPTH = 1` — the scheme1 prelude only
+ forks one level deep before waiting; one snapshot frame is all
+ that's needed and keeps total RAM at 2 GB.
+
+8. **`execve` accepts NULL/empty envp.** ✅ `sys_execve` ignores its
+ `envp` argument. The prelude wrapper passes no envp at all, so
+ `x2` at the SVC site holds whatever value happened to be there
+ and is never dereferenced.
+
+## Verification harness
+
+9. **Output extraction.** ✅ The kernel emits a sentinel-framed
+ hex dump of every tmpfs file on exit when bootargs contain the
+ `dumpfs` token. Scripts:
+ - [`scripts/extract-dump.sh`](../seed-kernel/scripts/extract-dump.sh) —
+ scans a UART transcript for `=== DUMP-BEGIN ===` … `=== DUMP-END ===`,
+ decodes each `=== FILE path=… size=… ===` payload, writes files.
+
+10. **Tier 1 gate.** ✅
+ [`scripts/tier1-gate.sh`](../seed-kernel/scripts/tier1-gate.sh) —
+ builds an initramfs containing a stage binary as `/init` plus
+ arbitrary input files, runs the seed kernel under qemu with the
+ stage's argv as bootargs, and extracts the post-run tmpfs.
+ Verified against `boot0/catm` (multi-input concatenation, output
+ matches host `cat`) and `boot3/tcc0` (compiles `int main(void)
+ {return 42;}` into a valid aarch64 relocatable object).
+
+11. **Tier 2 gate.** ✅
+ [`scripts/tier2-gate.sh`](../seed-kernel/scripts/tier2-gate.sh) —
+ cats `prelude.scm` + a driver fixture into `combined.scm`, packs
+ initramfs `/init=scheme1, /child-prog=<chain stage>, /combined.scm,
+ <inputs>`, runs the seed kernel, asserts the driver exited 0, and
+ extracts every output file. Verified end-to-end with the canonical
+ fixture
+ [`scripts/fixtures/tier2-tcc-driver.scm`](../seed-kernel/scripts/fixtures/tier2-tcc-driver.scm) —
+ scheme1 evaluates `(run "child-prog" "-nostdlib" "-c" "-o" "out.o"
+ "input.c")`, where `child-prog` is `boot3/tcc0`. Output `out.o` is a
+ valid aarch64 ELF relocatable with the expected `add` and `main`
+ symbols.
+
+## Things still worth doing (out of scope of the original list)
+
+- **Multi-stage Tier 1 driving**: `make tcc-boot2 ARCH=aarch64` could be
+ taught to swap each podman invocation for `tier1-gate.sh`. The hooks
+ exist; it would just be a `seed-kernel/Makefile.gate` overlay.
+- **Snapshot speed**: `mem_cpy(USER_POOL_SIZE = 768 MB)` is the dominant
+ cost of every clone (~30 s under TCG). A copy-on-write or only-
+ touched-pages strategy would help, but isn't needed for compliance.
+- **NULL-page hardening**: slot 0 is unmapped so a NULL deref faults to
+ the kernel as a user sync; the kernel currently panics rather than
+ delivering a SIGSEGV-equivalent. Acceptable per OS.md (default-action
+ termination is sufficient) but a minor polish opportunity.
diff --git a/seed-kernel/Makefile b/seed-kernel/Makefile
@@ -8,7 +8,10 @@ KOBJS := $(OUT)/start.o $(OUT)/kernel.o
KIMAGE := $(OUT)/kernel.elf
KBIN := $(OUT)/Image
USER := $(OUT)/init
+USER_FORK := $(OUT)/forktest
+USER_CHILD := $(OUT)/child
INITRAMFS := $(OUT)/initramfs.cpio
+INITRAMFS_FORK := $(OUT)/initramfs-fork.cpio
CFLAGS_COMMON := -nostdlib -nostartfiles -ffreestanding -fno-stack-protector \
-fno-pic -static -Wall -Wextra -O2 -mcmodel=large \
@@ -16,7 +19,7 @@ CFLAGS_COMMON := -nostdlib -nostartfiles -ffreestanding -fno-stack-protector \
KCFLAGS := $(CFLAGS_COMMON) -mgeneral-regs-only
.PHONY: all clean kernel user initramfs
-all: $(KBIN) $(INITRAMFS)
+all: $(KBIN) $(INITRAMFS) $(INITRAMFS_FORK)
$(OUT):
mkdir -p $(OUT)
@@ -37,9 +40,22 @@ $(KBIN): $(KIMAGE)
$(USER): user/hello.c user/user.lds | $(OUT)
gcc $(CFLAGS_COMMON) -mgeneral-regs-only -T user/user.lds -o $@ $<
+$(USER_FORK): user/forktest.c user/user.lds | $(OUT)
+ gcc $(CFLAGS_COMMON) -mgeneral-regs-only -T user/user.lds -o $@ $<
+
+$(USER_CHILD): user/child.c user/user.lds | $(OUT)
+ gcc $(CFLAGS_COMMON) -mgeneral-regs-only -T user/user.lds -o $@ $<
+
$(INITRAMFS): $(USER)
cd $(OUT) && printf 'init\n' | cpio -o -H newc > initramfs.cpio
+# Tier 2 demo cpio: /init is the fork driver, /child the program it execs.
+$(INITRAMFS_FORK): $(USER_FORK) $(USER_CHILD)
+ rm -rf $(OUT)/fork-stage && mkdir -p $(OUT)/fork-stage
+ cp $(USER_FORK) $(OUT)/fork-stage/init
+ cp $(USER_CHILD) $(OUT)/fork-stage/child
+ cd $(OUT)/fork-stage && printf 'init\nchild\n' | cpio -o -H newc > ../initramfs-fork.cpio
+
kernel: $(KBIN)
user: $(USER)
initramfs: $(INITRAMFS)
diff --git a/seed-kernel/kernel.c b/seed-kernel/kernel.c
@@ -16,7 +16,13 @@ typedef int i32;
/* ─── PL011 console ─────────────────────────────────────────────────────── */
-#define UART0 0x09000000UL
+/* The PL011 lives at PA 0x09000000 on QEMU virt. Once the MMU comes up the
+ * kernel reaches it through the device alias mapped into VA 4 GB..5 GB
+ * (L1[4]). That keeps the entire low 1 GB of VA available as user RAM —
+ * device MMIO at user-space VAs would otherwise collide with the boot2
+ * chain's BSS, which can run past 256 MB. */
+#define DEVICE_ALIAS_BASE 0x100000000UL
+#define UART0 (DEVICE_ALIAS_BASE + 0x09000000UL)
#define UART_DR ((volatile u32 *)(UART0 + 0x00))
#define UART_FR ((volatile u32 *)(UART0 + 0x18))
#define UART_FR_TXFF (1u << 5)
@@ -75,28 +81,83 @@ static void mem_set(void *d, int c, u64 n) {
}
/* ─── MMU bring-up ──────────────────────────────────────────────────────── */
-/* Identity-map the first 4 GB at L1 (1 GB blocks). One page table — 4 KB.
- * Entry 0 (0..1G): Device-nGnRnE (UART/GIC/virtio/flash live here)
- * Entry 1 (1..2G): Normal WB-WA (RAM 0x40000000-)
- * Entry 2 (2..3G): Normal WB-WA (extra RAM if -m > 1G)
- * Entry 3 (3..4G): Normal WB-WA (above-RAM PCI on virt; rarely touched)
+/* Two-level page table:
+ * L1[0] → l2_user table descriptor (VA 0..1 GB, 2 MB blocks)
+ * L1[1..3] = Normal 1 GB blocks identity-mapping VA 1..4 GB (RAM + high MMIO)
+ * L1[4] = Device 1 GB block at PA 0 (VA 4..5 GB mirrors PA 0..1 GB as
+ * Device-nGnRnE — the kernel's only path to UART/GIC/virtio/PCI
+ * once we hand the low 1 GB over to user code).
+ *
+ * The l2_user table carves the low 1 GB into:
+ * slot 0 (VA 0..2 MB) invalid — NULL pointer traps
+ * slots 1..N (VA 2 MB..USER_VA_HI) Normal user RAM, backed by the
+ * physical pool USER_POOL_PA. The
+ * boot2 chain links at 0x600000 and
+ * scheme1 reserves ~256 MB of BSS;
+ * sizing N at 384 (slots 1..384, 768 MB)
+ * gives both code+BSS and the brk
+ * window plenty of room.
+ * slots N+1..511 (VA USER_VA_HI..1G) Device-identity, kept for safety —
+ * nothing user-side touches them, and
+ * the kernel uses the high alias.
+ *
* With MMU on + Normal memory, unaligned loads/stores work — gcc's auto-
- * vectorised 64-bit load in be64() stops trapping.
- */
+ * vectorised 64-bit load in be64() stops trapping. */
__attribute__((aligned(4096))) static u64 l1_pt[512];
+__attribute__((aligned(4096))) static u64 l2_user[512];
+
+/* Physical RAM region reserved as the backing store for user low VAs.
+ * 768 MB (slots 1..384 × 2 MB), placed above the kernel heap end. Sized
+ * to fit tcc0 / tcc-boot2 — they declare a 512 MB BSS and link at
+ * 0x600000, so the binary's VA reach is 0x600000 + 512 MB = 0x20600000.
+ * 768 MB gives that plus a healthy brk window above end-of-bss.
+ *
+ * With QEMU -m 2048M (RAM 0x40000000–0xc0000000) and MAX_PROC_DEPTH=1
+ * (one 768 MB pseudo-fork snapshot above the user pool), the layout is:
+ * 0x40000000–0x4c000000 kernel image + kheap (192 MB)
+ * 0x4c000000–0x7c000000 user RAM pool (768 MB)
+ * 0x7c000000–0xac000000 pseudo-fork snapshot (768 MB)
+ * 0xac000000–0xc0000000 spare (320 MB)
+ */
+#define USER_POOL_PA 0x4c000000UL
+#define USER_POOL_SIZE 0x30000000UL /* 768 MB */
+#define USER_VA_LO 0x00200000UL /* slot 1 — first mapped 2 MB block */
+#define USER_VA_HI 0x30200000UL /* slot 385 — first device-only block */
static void setup_mmu(void) {
- /* AP=00 (RW EL1 only — keep EL0 out for now), SH=ISH, AF=1, AttrIdx=0/1.
- * Bits: V(0)=1, block(1)=0, AttrIdx[4:2], NS(5)=0, AP[7:6]=00, SH[9:8]=11,
- * AF(10)=1, nG(11)=0 → 0x701 (Normal) / 0x705 (Device) */
+ /* Block-descriptor attribute bits (block at L1 = bit[1]=0).
+ * V(0)=1, block(1)=0, AttrIdx[4:2]=Attr0(Normal)/Attr1(Device),
+ * NS(5)=0, AP[7:6]=00 (RW EL1 only), SH[9:8]=11 (ISH), AF(10)=1,
+ * nG(11)=0 → 0x701 (Normal) / 0x705 (Device-nGnRnE).
+ * Block descriptors at L2 use the same bit layout. */
u64 normal = 0x701;
u64 device = 0x705;
for (int i = 0; i < 512; i++) l1_pt[i] = 0;
- l1_pt[0] = 0x00000000UL | device;
+
+ /* L2 user table: slot 0 invalid; slots 1..(USER_POOL_SIZE/2 MB) Normal
+ * RAM backed by the user pool; slots above that Device-identity. */
+ int user_slots = (int)(USER_POOL_SIZE / 0x200000UL);
+ l2_user[0] = 0;
+ for (int i = 1; i <= user_slots; i++) {
+ u64 pa = USER_POOL_PA + (u64)(i - 1) * 0x200000UL;
+ l2_user[i] = pa | normal;
+ }
+ for (int i = user_slots + 1; i < 512; i++) {
+ u64 pa = (u64)i * 0x200000UL;
+ l2_user[i] = pa | device;
+ }
+
+ /* L1[0] table descriptor → l2_user. Table-desc encoding at L1 is
+ * bits [1:0] = 0b11, bits [47:12] = next-level table PA. */
+ l1_pt[0] = (u64)l2_user | 0x3UL;
l1_pt[1] = 0x40000000UL | normal;
l1_pt[2] = 0x80000000UL | normal;
l1_pt[3] = 0xc0000000UL | normal;
+ /* L1[4]: Device 1 GB block aliasing PA 0..1 GB into VA 4 GB..5 GB so
+ * the kernel can still reach UART/GIC/virtio after we hand the low 1
+ * GB over to user mappings. */
+ l1_pt[4] = 0x00000000UL | device;
/* MAIR: Attr0 = 0xff (Normal WB-WA), Attr1 = 0x00 (Device-nGnRnE) */
u64 mair = 0x00000000000000ffUL;
@@ -120,8 +181,10 @@ static void setup_mmu(void) {
u64 sctlr;
asm volatile("mrs %0, sctlr_el1" : "=r"(sctlr));
- sctlr &= ~(u64)((1 << 1) | (1 << 19)); /* clear A, WXN */
- sctlr |= (u64)(1 << 0); /* M (MMU) only — caches stay off */
+ sctlr &= ~(u64)((1 << 1) | (1 << 19)); /* clear A (alignment), WXN */
+ sctlr |= (u64)((1 << 0) /* M — MMU on */
+ | (1 << 2) /* C — D-cache on */
+ | (1 << 12)); /* I — I-cache on */
asm volatile("msr sctlr_el1, %0" :: "r"(sctlr));
asm volatile("isb");
}
@@ -318,6 +381,11 @@ struct phdr { u32 p_type, p_flags; u64 p_offset, p_vaddr, p_paddr, p_filesz, p_m
#define PT_LOAD 1
+/* Highest VA touched by the most recently loaded image's PT_LOAD segments
+ * (after USER_VA_HI clipping). load_elf updates this; kmain / sys_execve
+ * use it to seed brk_base above the user image's BSS. */
+static u64 g_user_image_end;
+
static u64 load_elf(const u8 *elf) {
const struct ehdr *eh = (const struct ehdr *)elf;
if (!(eh->e_ident[0] == 0x7f && eh->e_ident[1] == 'E' &&
@@ -327,15 +395,38 @@ static u64 load_elf(const u8 *elf) {
if (eh->e_machine != 0xb7) { /* EM_AARCH64 */
uart_puts("ELF: not aarch64\n"); return 0;
}
+ /* p_flags (R/W/X) are deliberately ignored: the L2 user mapping is one
+ * giant Normal-memory RWX-at-EL1 region (see setup_mmu). OS.md
+ * §"Memory model" permits this — there's no W^X enforcement in the
+ * contract, and tcc-boot2 never JITs.
+ *
+ * Segments are clipped at USER_VA_HI: a binary may declare a BSS that
+ * extends past the mapped user window (scheme1 reserves ~256 MB), and
+ * a naive mem_set would walk into the device-block region above and
+ * trigger an external abort. The user image gets only the portion of
+ * its memsz that fits in the user pool; if user code later touches
+ * the unmapped tail, that's a user-space fault, not a kernel panic. */
+ u64 hi = 0;
for (int i = 0; i < eh->e_phnum; i++) {
const struct phdr *ph = (const struct phdr *)(elf + eh->e_phoff + (u64)i * eh->e_phentsize);
if (ph->p_type != PT_LOAD) continue;
- u8 *dst = (u8 *)ph->p_vaddr;
+ u64 vaddr = ph->p_vaddr;
+ u64 filesz = ph->p_filesz;
+ u64 memsz = ph->p_memsz;
+ if (vaddr >= USER_VA_HI) continue; /* segment fully out of window */
+ u64 reach = USER_VA_HI - vaddr;
+ if (filesz > reach) filesz = reach;
+ if (memsz > reach) memsz = reach;
+ u8 *dst = (u8 *)vaddr;
const u8 *src = elf + ph->p_offset;
- mem_cpy(dst, src, ph->p_filesz);
- if (ph->p_memsz > ph->p_filesz)
- mem_set(dst + ph->p_filesz, 0, ph->p_memsz - ph->p_filesz);
+ mem_cpy(dst, src, filesz);
+ if (memsz > filesz)
+ mem_set(dst + filesz, 0, memsz - filesz);
+ u64 end = vaddr + memsz;
+ if (end > hi) hi = end;
}
+ /* Round up to 16 bytes so callers can use it directly as brk_base. */
+ g_user_image_end = (hi + 15) & ~15UL;
/* I-cache sync (cheap insurance even with caches off). */
asm volatile("dsb sy" ::: "memory");
asm volatile("ic iallu" ::: "memory");
@@ -378,7 +469,14 @@ static u64 brk_max;
#define SYS_read 63
#define SYS_write 64
#define SYS_exit_group 93
+#define SYS_waitid 95
#define SYS_brk 214
+#define SYS_clone 220
+#define SYS_execve 221
+
+#define ECHILD 10
+#define EAGAIN 11
+#define ENOEXEC 8
static i64 sys_write(int fd, const void *buf, u64 len) {
if (fd == 1 || fd == 2) {
@@ -476,12 +574,226 @@ static i64 sys_unlinkat(int dirfd, const char *path, int flags) {
return 0;
}
+/* ─── Tier 2: pseudo-fork (clone / execve / waitid / exit_group) ────────── */
+/*
+ * The boot2 chain's clone/execve/waitid pattern (scheme1/prelude.scm:520-537)
+ * is rigidly synchronous: the parent calls clone, the "child" immediately
+ * calls execve and runs to exit_group, then the parent calls waitid. Nothing
+ * else runs between clone and execve in the child, or between clone and
+ * waitid in the parent.
+ *
+ * We implement that as pseudo-fork on a single-threaded kernel:
+ *
+ * sys_clone → push parent state (regs, brk, fd table, full user image)
+ * onto proc_stack; return 0 to current context (the "child").
+ * sys_execve → reset brk, load new ELF over user RAM, build user stack,
+ * set tf so eret resumes at the new entry point.
+ * sys_exit → if proc_stack non-empty: stash exit code in last_child,
+ * restore parent state (regs / brk / fds / memory), set tf
+ * so eret resumes the parent's clone() call with x0 = pid.
+ * If proc_stack empty: real exit (dump tmpfs, PSCI off).
+ * sys_waitid → return last_child's exit code via the siginfo struct.
+ *
+ * No actual concurrency. The "parent" is suspended at the moment of clone
+ * and resumed only when the "child" calls exit_group. This works because
+ * the prelude never schedules other work between fork and wait.
+ */
+
+struct trapframe {
+ u64 x[31];
+ u64 elr;
+ u64 spsr;
+};
+
+/* Forward decls for state defined further down. */
+#define MAX_ARGV 32
+static u64 build_user_stack(u64 stack_top, int argc, char **argv);
+static int tokenise(char *src, char **argv, int cap);
+
+#define MAX_PROC_DEPTH 1
+/* Memory snapshot pool — placed above the user RAM pool. The scheme1
+ * prelude only ever forks one level deep before waiting (clone → execve
+ * in child → exit_group → waitid in parent), so a single 768 MB frame
+ * suffices. Snapshot N lives at SNAP_BASE_PA + N*USER_POOL_SIZE. */
+#define SNAP_BASE_PA 0x7c000000UL
+
+struct proc_save {
+ int active;
+ u64 child_pid;
+ /* Saved trap-frame state: enough to resume the parent at the
+ * instruction following its clone() SVC. x[0] is overwritten with
+ * child_pid at restore time so the parent sees a non-zero return. */
+ u64 regs[31];
+ u64 elr;
+ u64 spsr;
+ u64 sp_el0;
+ /* User image + per-process state at the moment of clone. brk_base
+ * is saved alongside brk_cur because sys_execve resets it above the
+ * new image's end-of-bss — the parent's value needs to come back
+ * with the parent's memory image. */
+ u64 brk_base_save;
+ u64 brk_cur_save;
+ struct fdent fdtab_save[MAX_FD];
+ u8 *mem_snapshot;
+};
+
+static struct proc_save proc_stack[MAX_PROC_DEPTH];
+static int proc_depth = 0;
+static u64 g_next_pid = 2;
+
+/* The most recently exited child, for sys_waitid to consume. */
+static int last_child_valid = 0;
+static u64 last_child_pid = 0;
+static int last_child_code = 0;
+
+/* USER_POOL_PA / USER_POOL_SIZE (defined above) describe the user RAM pool. */
+
+static i64 sys_clone(struct trapframe *tf, u64 flags, u64 stack, u64 ptid,
+ u64 ctid, u64 tls) {
+ (void)flags; (void)stack; (void)ptid; (void)ctid; (void)tls;
+ if (proc_depth >= MAX_PROC_DEPTH) return -EAGAIN;
+ struct proc_save *p = &proc_stack[proc_depth];
+ p->active = 1;
+ p->child_pid = g_next_pid++;
+ for (int i = 0; i < 31; i++) p->regs[i] = tf->x[i];
+ p->elr = tf->elr;
+ p->spsr = tf->spsr;
+ asm volatile("mrs %0, sp_el0" : "=r"(p->sp_el0));
+ p->brk_base_save = brk_base;
+ p->brk_cur_save = brk_cur;
+ for (int i = 0; i < MAX_FD; i++) p->fdtab_save[i] = fdtab[i];
+ p->mem_snapshot = (u8 *)(SNAP_BASE_PA + (u64)proc_depth * USER_POOL_SIZE);
+ mem_cpy(p->mem_snapshot, (void *)USER_POOL_PA, USER_POOL_SIZE);
+ proc_depth++;
+ /* Current context becomes the "child"; clone returns 0 here. */
+ return 0;
+}
+
+/* execve must capture path+argv into kernel-side buffers BEFORE load_elf
+ * runs — load_elf clobbers user memory, and the path/argv strings live in
+ * that memory. */
+static char execve_argv_pool[2048];
+static i64 sys_execve(struct trapframe *tf, const char *path,
+ char **argv, char **envp) {
+ /* envp may be NULL — the prelude wrapper passes no envp arg, so x2 is
+ * whatever happened to be there. We ignore envp regardless. */
+ (void)envp;
+ if (!path) return -EFAULT;
+ /* Copy path before find_file does anything else (path lives in user
+ * memory which load_elf will clobber). */
+ char path_buf[128];
+ int pn = 0;
+ while (path[pn] && pn < 127) { path_buf[pn] = path[pn]; pn++; }
+ path_buf[pn] = 0;
+ int fidx = find_file(path_buf);
+ if (fidx < 0) return -ENOENT;
+
+ /* Capture argv into a kernel-side pool. */
+ int argc = 0;
+ char *new_argv[MAX_ARGV];
+ int pool_off = 0;
+ if (argv) {
+ while (argc < MAX_ARGV - 1 && argv[argc]) {
+ const char *s = argv[argc];
+ int n = 0;
+ while (s[n] && pool_off + n < (int)sizeof(execve_argv_pool) - 1) n++;
+ for (int j = 0; j < n; j++) execve_argv_pool[pool_off + j] = s[j];
+ execve_argv_pool[pool_off + n] = 0;
+ new_argv[argc] = &execve_argv_pool[pool_off];
+ pool_off += n + 1;
+ argc++;
+ }
+ }
+ if (argc == 0) {
+ /* Synthesise argv[0] from the path so user code that reads argv[0]
+ * doesn't crash. */
+ int n = 0;
+ while (path_buf[n] && pool_off + n < (int)sizeof(execve_argv_pool) - 1) n++;
+ for (int j = 0; j < n; j++) execve_argv_pool[pool_off + j] = path_buf[j];
+ execve_argv_pool[pool_off + n] = 0;
+ new_argv[0] = &execve_argv_pool[pool_off];
+ pool_off += n + 1;
+ argc = 1;
+ }
+
+ /* Load new ELF over user RAM. */
+ u64 entry = load_elf(files[fidx].data);
+ if (!entry) return -ENOEXEC;
+ /* Reset brk above the new image's end-of-bss. */
+ brk_base = g_user_image_end ? g_user_image_end : USER_VA_LO;
+ brk_cur = brk_base;
+ /* Build new user stack at top of user VA window. */
+ u64 new_sp = build_user_stack(USER_VA_HI, argc, new_argv);
+
+ /* Rewrite trap frame so eret jumps to the new image's entry, with a
+ * clean register state and the new stack. */
+ for (int i = 0; i < 31; i++) tf->x[i] = 0;
+ tf->elr = entry;
+ /* sp_el0 isn't in the trap frame — set it directly; it survives until
+ * the eret since the kernel uses SP_ELx while in trap_sync. */
+ asm volatile("msr sp_el0, %0" :: "r"(new_sp));
+ /* The dispatcher stores our return value into tf->x[0] after we
+ * return. Returning 0 keeps x0 clear, so argc/argv reach the new
+ * image via the stack only; no execve() call site is left to
+ * observe a return value because elr now points at the new entry. */
+ return 0;
+}
+
+static i64 sys_waitid(struct trapframe *tf, int idtype, u64 id,
+ void *info, int options) {
+ (void)tf; (void)idtype; (void)id; (void)options;
+ if (!last_child_valid) return -ECHILD;
+ /* scheme1/prelude.scm:497-506 reads info[8]=si_code (CLD_EXITED=1) and
+ * info[24]=si_status. siginfo_t is sparsely written — zero the rest so
+ * the prelude's view is deterministic. */
+ if (info) {
+ u8 *p = info;
+ for (int i = 0; i < 128; i++) p[i] = 0;
+ u32 *si_code = (u32 *)(p + 8);
+ u32 *si_status = (u32 *)(p + 24);
+ *si_code = 1; /* CLD_EXITED */
+ *si_status = (u32)last_child_code;
+ }
+ last_child_valid = 0;
+ return 0;
+}
+
static int g_exit_code = 0;
static int g_exited = 0;
-static void sys_exit(int code) {
+/* Dump every file in the tmpfs to UART, hex-encoded, framed by sentinels
+ * a host-side extractor can scan for. The chain's verification harness
+ * (qemu-host wrapper) parses this to recover output ELFs etc. without
+ * needing virtio-9p — flat tmpfs over UART is enough for boot2's
+ * file-only IPC. Dump only happens when a "dumpfs" token is present in
+ * /chosen/bootargs; the hello.c demo runs without it and stays quiet. */
+static int g_dumpfs = 0;
+
+static void uart_putc_hex(u8 b) {
+ static const char hex[] = "0123456789abcdef";
+ uart_putc(hex[b >> 4]);
+ uart_putc(hex[b & 0xf]);
+}
+
+static void dump_tmpfs(void) {
+ uart_puts("\n=== DUMP-BEGIN ===\n");
+ for (int i = 0; i < MAX_FILES; i++) {
+ if (!files[i].used) continue;
+ uart_puts("=== FILE path=");
+ uart_puts(files[i].path);
+ uart_puts(" size=");
+ uart_putd((i64)files[i].len);
+ uart_puts(" ===\n");
+ for (u64 j = 0; j < files[i].len; j++) uart_putc_hex(files[i].data[j]);
+ uart_puts("\n");
+ }
+ uart_puts("=== DUMP-END ===\n");
+}
+
+static void sys_exit_final(int code) {
g_exit_code = code;
g_exited = 1;
+ if (g_dumpfs) dump_tmpfs();
uart_puts("\n[seed] user exit_group("); uart_putd(code); uart_puts(")\n");
/* Try PSCI SYSTEM_OFF so QEMU exits cleanly; fall back to spin. */
register u64 x0 asm("x0") = 0x84000008;
@@ -491,13 +803,43 @@ static void sys_exit(int code) {
for (;;) asm volatile("wfi");
}
-/* ─── Trap dispatch (called from start.S vector handlers) ───────────────── */
+/* Dispatcher-side exit_group: pops proc_stack and resumes the parent's
+ * clone() if there's a saved frame, otherwise falls through to the real
+ * shutdown path. Returns the child's pid (> 0) if the trap frame was
+ * rewritten to resume the parent; the trailing return 0 is unreachable,
+ * since sys_exit_final does not return. */
+static int sys_exit_or_resume_parent(struct trapframe *tf, int code) {
+ code &= 0xff;
+ if (proc_depth > 0) {
+ struct proc_save *p = &proc_stack[--proc_depth];
+ last_child_pid = p->child_pid;
+ last_child_code = code;
+ last_child_valid = 1;
+ /* Restore memory, brk, fd table. */
+ mem_cpy((void *)USER_POOL_PA, p->mem_snapshot, USER_POOL_SIZE);
+ brk_base = p->brk_base_save;
+ brk_cur = p->brk_cur_save;
+ for (int i = 0; i < MAX_FD; i++) fdtab[i] = p->fdtab_save[i];
+ /* Restore registers. x[0] is restored too, but the dispatcher
+ * overwrites it with our return value (child_pid) before eret, so
+ * the parent's clone() sees child_pid as the syscall return. */
+ for (int i = 0; i < 31; i++) tf->x[i] = p->regs[i];
+ tf->elr = p->elr;
+ tf->spsr = p->spsr;
+ asm volatile("msr sp_el0, %0" :: "r"(p->sp_el0));
+ /* Instruction cache may hold stale lines from the child's image
+ * that we just overwrote with the parent's. Invalidate. */
+ asm volatile("dsb sy" ::: "memory");
+ asm volatile("ic iallu" ::: "memory");
+ asm volatile("dsb sy" ::: "memory");
+ asm volatile("isb");
+ return (int)p->child_pid; /* >0: tells dispatcher to write this as r */
+ }
+ sys_exit_final(code);
+ return 0; /* unreachable */
+}
-struct trapframe {
- u64 x[31];
- u64 elr;
- u64 spsr;
-};
+/* ─── Trap dispatch (called from start.S vector handlers) ───────────────── */
i64 trap_sync(u64 esr, struct trapframe *tf);
void trap_kernel(u64 esr, struct trapframe *tf);
@@ -518,7 +860,19 @@ i64 trap_sync(u64 esr, struct trapframe *tf) {
case SYS_lseek: r = sys_lseek((int)a0, (i64)a1, (int)a2); break;
case SYS_brk: r = sys_brk(a0); break;
case SYS_unlinkat: r = sys_unlinkat((int)a0, (const char *)a1, (int)a2); break;
- case SYS_exit_group: sys_exit((int)a0); r = 0; break;
+ case SYS_clone: r = sys_clone(tf, a0, a1, a2, a3, a4); break;
+ case SYS_execve: r = sys_execve(tf, (const char *)a0, (char **)a1, (char **)a2); break;
+ case SYS_waitid: r = sys_waitid(tf, (int)a0, a1, (void *)a2, (int)a3); break;
+ case SYS_exit_group:
+ r = sys_exit_or_resume_parent(tf, (int)a0);
+ /* A non-zero r means the parent was resumed and
+ * sys_exit_or_resume_parent rewrote tf->x[0..30] and tf->elr.
+ * Set x0 to child_pid here and return early. */
+ if (r != 0) {
+ tf->x[0] = (u64)r;
+ return 0;
+ }
+ break;
default:
uart_puts("[seed] ENOSYS "); uart_putd((i64)nr); uart_puts("\n");
r = -38; /* ENOSYS */
@@ -536,8 +890,10 @@ i64 trap_sync(u64 esr, struct trapframe *tf) {
}
void trap_kernel(u64 esr, struct trapframe *tf) {
+ u64 far; asm volatile("mrs %0, far_el1" : "=r"(far));
uart_puts("[seed] PANIC: kernel sync, ESR="); uart_putx(esr);
uart_puts(" ELR="); uart_putx(tf->elr);
+ uart_puts(" FAR="); uart_putx(far);
uart_puts("\n");
for (;;) asm volatile("wfe");
}
@@ -553,23 +909,48 @@ void trap_unhandled(u64 esr, struct trapframe *tf) {
extern void eret_to_user(u64 entry, u64 sp);
-static u64 build_user_stack(u64 stack_top, const char *argv0) {
- /* Place argv0 string at top, then argc/argv/envp below it.
- *
- * SysV layout from low to high at sp:
- * argc, argv[0], NULL, NULL (envp term)
- */
- int n = str_n(argv0) + 1;
- char *str = (char *)(stack_top - 32);
- for (int i = 0; i < n; i++) str[i] = argv0[i];
-
- u64 sp = (u64)str - 64;
+/* Tokenise `src` in place (whitespace separators) into argv slots.
+ * Writes pointers into argv[0..argc-1] and returns argc. Stops at cap. */
+static int tokenise(char *src, char **argv, int cap) {
+ int argc = 0;
+ char *p = src;
+ while (*p && argc < cap) {
+ while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r') p++;
+ if (!*p) break;
+ argv[argc++] = p;
+ while (*p && *p != ' ' && *p != '\t' && *p != '\n' && *p != '\r') p++;
+ if (*p) *p++ = 0;
+ }
+ return argc;
+}
+
+static u64 build_user_stack(u64 stack_top, int argc, char **argv) {
+ /* SysV layout, low to high at the returned sp:
+ * argc, argv[0..argc-1], NULL (argv term), NULL (envp term).
+ * Strings live above the vectors, in a string pool placed just below
+ * stack_top so the user image's high-water mark is stable. */
+ if (argc < 1) argc = 1;
+ if (argc > MAX_ARGV) argc = MAX_ARGV;
+
+ /* Lay strings down from stack_top - 16 (16-byte alignment slack). */
+ u64 strs_top = stack_top - 16;
+ u64 strs[MAX_ARGV];
+ char *cursor = (char *)strs_top;
+ for (int i = argc - 1; i >= 0; i--) {
+ int n = str_n(argv[i]) + 1;
+ cursor -= n;
+ for (int j = 0; j < n; j++) cursor[j] = argv[i][j];
+ strs[i] = (u64)cursor;
+ }
+
+ /* Vector block: 8 (argc) + (argc+1)*8 (argv + NULL) + 8 (envp NULL). */
+ u64 sp = (u64)cursor - (u64)((argc + 3) * 8);
sp &= ~15UL;
u64 *p = (u64 *)sp;
- p[0] = 1; /* argc */
- p[1] = (u64)str; /* argv[0] */
- p[2] = 0; /* argv terminator */
- p[3] = 0; /* envp terminator */
+ p[0] = (u64)argc;
+ for (int i = 0; i < argc; i++) p[1 + i] = strs[i];
+ p[1 + argc] = 0; /* argv terminator */
+ p[2 + argc] = 0; /* envp terminator */
return sp;
}
@@ -614,13 +995,68 @@ void kmain(u64 dtb_phys) {
if (!entry) { uart_puts("[seed] load_elf failed\n"); for(;;) asm volatile("wfe"); }
uart_puts("[seed] /init e_entry="); uart_putx(entry); uart_puts("\n");
- /* User stack at top of a reserved high region. brk above that. */
- u64 ustack_top = 0x46000000UL;
- brk_base = 0x46000000UL;
+ /* parse_cpio + load_elf are done — original initrd memory is dead.
+ * Bump kheap_end to reclaim it for tmpfs file growth via sys_write. */
+ kheap_end = (u8 *)0x4b000000UL;
+
+ /* User runs in the L2-mapped low-VA window (USER_VA_LO..USER_VA_HI,
+ * physically backed by USER_POOL_PA). Stack grows down from the top
+ * of the window; brk grows up from above the loaded image's
+ * end-of-bss (g_user_image_end, set by load_elf). 16 MB reserved at
+ * the top for the user stack. */
+ u64 ustack_top = USER_VA_HI;
+ brk_base = g_user_image_end ? g_user_image_end : USER_VA_LO;
brk_cur = brk_base;
- brk_max = 0x4a000000UL;
+ brk_max = USER_VA_HI - 0x01000000UL;
+
+ /* Build argv. Priority:
+ * 1. DTB /chosen/bootargs (whitespace-tokenised — qemu -append "...").
+ * 2. /init.argv from the initramfs (one arg per line).
+ * 3. Fallback: argc=1, argv[0]="init".
+ * For sources 1 and 2, argv passed to user is exactly what the source
+ * provided, with no implicit argv[0]="init" prefix.
+ *
+ * The seed kernel reserves one bootargs token: "dumpfs". When present,
+ * it is stripped from argv and triggers a hex-encoded dump of the
+ * full tmpfs over UART on exit (sentinel-framed for host extraction). */
+ static char argv_pool[512];
+ char *uargv[MAX_ARGV];
+ int uargc = 0;
+
+ if (dt.bootargs[0]) {
+ int n = 0;
+ while (dt.bootargs[n] && n < (int)sizeof(argv_pool) - 1) {
+ argv_pool[n] = dt.bootargs[n]; n++;
+ }
+ argv_pool[n] = 0;
+ char *raw[MAX_ARGV];
+ int rawc = tokenise(argv_pool, raw, MAX_ARGV);
+ for (int i = 0; i < rawc; i++) {
+ if (str_eq(raw[i], "dumpfs")) { g_dumpfs = 1; continue; }
+ uargv[uargc++] = raw[i];
+ }
+ }
+ if (uargc == 0) {
+ int aidx = find_file("init.argv");
+ if (aidx >= 0) {
+ u64 n = files[aidx].len;
+ if (n >= sizeof(argv_pool)) n = sizeof(argv_pool) - 1;
+ for (u64 i = 0; i < n; i++) argv_pool[i] = (char)files[aidx].data[i];
+ argv_pool[n] = 0;
+ uargc = tokenise(argv_pool, uargv, MAX_ARGV);
+ }
+ }
+ if (uargc == 0) {
+ argv_pool[0] = 'i'; argv_pool[1] = 'n'; argv_pool[2] = 'i';
+ argv_pool[3] = 't'; argv_pool[4] = 0;
+ uargv[0] = argv_pool;
+ uargc = 1;
+ }
+ uart_puts("[seed] argv:");
+ for (int i = 0; i < uargc; i++) { uart_puts(" "); uart_puts(uargv[i]); }
+ uart_puts("\n");
- u64 user_sp = build_user_stack(ustack_top, "init");
+ u64 user_sp = build_user_stack(ustack_top, uargc, uargv);
uart_puts("[seed] eret to user, sp="); uart_putx(user_sp); uart_puts("\n");
eret_to_user(entry, user_sp);
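
Reviewer note: the stack layout produced by build_user_stack can be checked on a host before booting. The following is a minimal sketch, not the kernel code itself. It assumes a 64-bit host, and the names build_stack_host and stack_arg, as well as the MAX_ARGV value of 16, are hypothetical stand-ins chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_ARGV 16 /* assumed cap for this sketch; kernel.c defines the real one */

/* Host-side mirror of the layout: strings are copied downward from
 * stack_top - 16, then the vector block (argc, argv[0..argc-1], NULL,
 * NULL) is placed below them at a 16-byte-aligned address. */
static uintptr_t build_stack_host(uintptr_t stack_top, int argc, char **argv) {
    assert(argc >= 1 && argc <= MAX_ARGV);
    uintptr_t strs[MAX_ARGV];
    char *cursor = (char *)(stack_top - 16);
    for (int i = argc - 1; i >= 0; i--) {
        size_t n = strlen(argv[i]) + 1;     /* include the terminating NUL */
        cursor -= n;
        memcpy(cursor, argv[i], n);
        strs[i] = (uintptr_t)cursor;
    }
    /* (argc + 3) slots of 8 bytes: argc, argc pointers, two NULLs. */
    uintptr_t sp = (uintptr_t)cursor - (uintptr_t)(argc + 3) * 8;
    sp &= ~(uintptr_t)15;
    uint64_t *p = (uint64_t *)sp;
    p[0] = (uint64_t)argc;
    for (int i = 0; i < argc; i++) p[1 + i] = strs[i];
    p[1 + argc] = 0;                        /* argv terminator */
    p[2 + argc] = 0;                        /* envp terminator */
    return sp;
}

/* Read argv[i] back the way the _start shim does: argc at [sp],
 * argv starting at sp + 8. */
static const char *stack_arg(uintptr_t sp, int i) {
    const uint64_t *p = (const uint64_t *)sp;
    return (const char *)(uintptr_t)p[1 + i];
}
```

Calling build_stack_host with the top of a uint64_t scratch array and a small argv, then reading entries back through stack_arg, exercises the same (argc + 3) * 8 arithmetic and the 16-byte alignment step as the kernel path.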
diff --git a/seed-kernel/run.sh b/seed-kernel/run.sh
@@ -15,7 +15,7 @@ INITRD=build/initramfs.cpio
exec qemu-system-aarch64 \
-machine virt \
-cpu cortex-a72 \
- -m 512M \
+ -m 2048M \
-nographic \
-no-reboot \
-kernel "$KERNEL" \
diff --git a/seed-kernel/scripts/extract-dump.sh b/seed-kernel/scripts/extract-dump.sh
@@ -0,0 +1,56 @@
+#!/bin/sh
+# Extract files from a seed-kernel UART transcript that was produced with
+# the "dumpfs" bootargs token. Reads the transcript from stdin (or from the
+# optional [transcript] argument), writes each dumped file to <outdir>/<path>.
+# Header format emitted by kernel.c dump_tmpfs():
+#
+# === DUMP-BEGIN ===
+# === FILE path=<name> size=<N> ===
+# <2*N hex chars><LF>
+# ... repeat ...
+# === DUMP-END ===
+#
+# Anything before DUMP-BEGIN or after DUMP-END is ignored.
+#
+# Usage: extract-dump.sh <outdir> [transcript]
+
+set -eu
+
+[ $# -ge 1 ] || { echo "usage: $0 <outdir> [transcript]" >&2; exit 2; }
+
+outdir=$1
+shift
+mkdir -p "$outdir"
+
+if [ $# -ge 1 ]; then
+ src=$(cat "$1")
+else
+ src=$(cat)
+fi
+
+# Strip CRs that QEMU's nographic UART likes to emit.
+src=$(printf '%s' "$src" | tr -d '\r')
+
+awk -v outdir="$outdir" '
+/^=== DUMP-BEGIN ===$/ { in_dump = 1; next }
+/^=== DUMP-END ===$/ { in_dump = 0; next }
+in_dump && /^=== FILE path=/ {
+ sub(/^=== FILE path=/, "")
+ sub(/ ===$/, "")
+ n = split($0, kv, " size=")
+ path = kv[1]
+ size = kv[2]+0
+ out = outdir "/" path
+ # Make any parent dirs (tmpfs is flat but be safe).
+ cmd = "mkdir -p \"$(dirname \"" out "\")\""; system(cmd); close(cmd)
+ next_is_hex = 1
+ print "extract: " path " (" size " bytes) -> " out > "/dev/stderr"
+ # Read the next line as the hex payload and decode it via xxd.
+ getline hex
+ decode_cmd = "printf %s \"" hex "\" | xxd -r -p > \"" out "\""
+ system(decode_cmd); close(decode_cmd)
+ next
+}
+' <<EOF
+$src
+EOF
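
Reviewer note: the dump framing documented in the header comment above is simple enough to decode without awk or xxd. The following is a rough sketch in C of the two steps (parse a FILE header line, decode the hex payload). The function names are hypothetical, and it assumes that file names contain no whitespace and fit in 255 bytes, and that the payload may use either hex case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Parse "=== FILE path=<name> size=<N> ===". Returns 1 on success.
 * Assumption: <name> has no whitespace and is at most 255 bytes. */
static int parse_file_header(const char *line, char path[256], unsigned long *size) {
    assert(line && path && size);
    return sscanf(line, "=== FILE path=%255s size=%lu ===", path, size) == 2;
}

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_val(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decode 2*n hex chars from hex into out[0..n-1].
 * Returns 0 on success, -1 if the line is too short or has a bad digit. */
static int decode_hex(const char *hex, unsigned char *out, size_t n) {
    if (strlen(hex) < 2 * n) return -1;
    for (size_t i = 0; i < n; i++) {
        int hi = hex_val(hex[2 * i]);
        int lo = hex_val(hex[2 * i + 1]);
        if (hi < 0 || lo < 0) return -1;
        out[i] = (unsigned char)((hi << 4) | lo);
    }
    return 0;
}
```

A host tool built on this would scan for the DUMP-BEGIN sentinel, call parse_file_header on each FILE line, and pass the following line plus the parsed size to decode_hex. Checking the payload length against size is something the awk version does not do.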
diff --git a/seed-kernel/scripts/fixtures/tier2-driver.scm b/seed-kernel/scripts/fixtures/tier2-driver.scm
@@ -0,0 +1,5 @@
+;; Tier 2 acceptance driver: spawn the chain stage, wait, return status.
+(let ((r (run "child-prog" "out.txt" "in1" "in2")))
+ (if (and (car r) (= 0 (cdr r)))
+ (sys-exit 0)
+ (sys-exit 1)))
diff --git a/seed-kernel/scripts/fixtures/tier2-tcc-driver.scm b/seed-kernel/scripts/fixtures/tier2-tcc-driver.scm
@@ -0,0 +1,8 @@
+;; Tier 2 acceptance driver (canonical form): scheme1 spawns tcc0 (the
+;; cc.scm-built bootstrap C compiler) to compile a .c source into a
+;; relocatable object, waits for the child, and exits with the child's
+;; status. The OS-TODO item 11 fixture.
+(let ((r (run "child-prog" "-nostdlib" "-c" "-o" "out.o" "input.c")))
+ (if (and (car r) (= 0 (cdr r)))
+ (sys-exit 0)
+ (sys-exit 1)))
diff --git a/seed-kernel/scripts/tier1-gate.sh b/seed-kernel/scripts/tier1-gate.sh
@@ -0,0 +1,91 @@
+#!/bin/sh
+# tier1-gate.sh — run a single boot2-chain stage binary on the seed
+# kernel and extract its output files. The chain treats stages as
+# pure file→file transformations (catm-style), so a Tier 1 acceptance
+# run is one qemu boot per stage.
+#
+# Usage:
+# tier1-gate.sh <stage-binary> <output-dir> -- <argv...> -- <input-files...>
+#
+# Builds an initramfs containing the stage binary as /init plus every
+# file from <input-files...> at its basename, runs qemu, and extracts
+# every file in the post-run tmpfs into <output-dir>/. The driver
+# passes <argv...> verbatim through /chosen/bootargs (with a "dumpfs"
+# token appended to trigger the UART dump on exit).
+#
+# Example: run boot0 catm to concatenate a + b into out.
+# tier1-gate.sh build/aarch64/boot0/catm /tmp/out \
+# -- init out a b -- /tmp/a /tmp/b
+
+set -eu
+
+if [ $# -lt 3 ]; then
+ echo "usage: $0 <stage-binary> <output-dir> -- <argv...> -- <input-files...>" >&2
+ exit 2
+fi
+
+STAGE_BIN=$1; shift
+OUTDIR=$1; shift
+[ "$1" = "--" ] || { echo "expected -- before argv" >&2; exit 2; }
+shift
+
+# Collect argv until next --
+ARGV=""
+while [ $# -gt 0 ] && [ "$1" != "--" ]; do
+ if [ -z "$ARGV" ]; then ARGV=$1; else ARGV="$ARGV $1"; fi
+ shift
+done
+[ "${1:-}" = "--" ] || { echo "expected -- before input-files" >&2; exit 2; }
+shift
+
+HERE=$(cd "$(dirname "$0")" && pwd)
+SEED_DIR=$(cd "$HERE/.." && pwd)
+KERNEL=$SEED_DIR/build/Image
+EXTRACT=$HERE/extract-dump.sh
+
+[ -f "$KERNEL" ] || { echo "missing $KERNEL — run 'make' in $SEED_DIR first" >&2; exit 1; }
+[ -x "$EXTRACT" ] || { echo "missing $EXTRACT" >&2; exit 1; }
+
+mkdir -p "$OUTDIR"
+
+# Stage initramfs.
+STAGE=$(mktemp -d -t tier1-stage.XXXXXX)
+trap 'rm -rf "$STAGE"' EXIT
+
+cp "$STAGE_BIN" "$STAGE/init"
+chmod +x "$STAGE/init"
+NAMES="init"
+for inp in "$@"; do
+ base=$(basename "$inp")
+ cp "$inp" "$STAGE/$base"
+ NAMES="$NAMES
+$base"
+done
+INITRAMFS=$STAGE/initramfs.cpio
+( cd "$STAGE" && printf '%s\n' "$NAMES" | cpio -o -H newc 2>/dev/null > initramfs.cpio )
+
+# Run qemu, capture transcript, extract.
+TRANSCRIPT=$STAGE/transcript.txt
+echo "[gate] running stage with argv: $ARGV dumpfs" >&2
+qemu-system-aarch64 \
+ -machine virt -cpu cortex-a72 -m 2048M \
+ -nographic -no-reboot \
+ -kernel "$KERNEL" -initrd "$INITRAMFS" \
+ -append "$ARGV dumpfs" \
+ > "$TRANSCRIPT" 2>&1 &
+QPID=$!
+# Bound the run; the seed kernel ends with PSCI SYSTEM_OFF on exit,
+# but on a hang we still need to come back.
+( sleep 120; kill -9 $QPID 2>/dev/null ) &
+WATCHER=$!
+wait $QPID 2>/dev/null || true
+kill $WATCHER 2>/dev/null || true
+
+if ! grep -q '=== DUMP-END ===' "$TRANSCRIPT"; then
+ echo "[gate] FAIL: no DUMP-END in transcript" >&2
+ tail -40 "$TRANSCRIPT" >&2
+ exit 3
+fi
+
+"$EXTRACT" "$OUTDIR" "$TRANSCRIPT"
+echo "[gate] extracted to $OUTDIR" >&2
diff --git a/seed-kernel/scripts/tier2-gate.sh b/seed-kernel/scripts/tier2-gate.sh
@@ -0,0 +1,87 @@
+#!/bin/sh
+# tier2-gate.sh — end-to-end Tier 2 acceptance: scheme1 driver spawns a
+# chain stage as a subprocess (clone + execve + waitid), waits for it,
+# returns its result file. One qemu boot, end-to-end.
+#
+# Usage:
+# tier2-gate.sh <scheme1> <prelude.scm> <driver.scm> \
+# <child-bin> <output-dir> -- <input-files...>
+#
+# Stages: combines prelude.scm + driver.scm into combined.scm via host
+# `cat`, packs an initramfs containing /init=scheme1, /combined.scm,
+# /child-prog=<child-bin>, plus every input file at its basename, then
+# boots qemu with bootargs "init combined.scm dumpfs". driver.scm is
+# expected to use prelude's (run "child-prog" ...) wrapper.
+#
+# After qemu exits, every file in the post-run tmpfs is extracted into
+# <output-dir>/. The driver's exit status is reflected in this script's
+# exit status (0 = scheme1 driver said success).
+
+set -eu
+
+if [ $# -lt 6 ]; then
+ echo "usage: $0 <scheme1> <prelude.scm> <driver.scm> <child-bin> <outdir> -- <inputs...>" >&2
+ exit 2
+fi
+
+SCHEME1=$1; PRELUDE=$2; DRIVER=$3; CHILD=$4; OUTDIR=$5
+shift 5
+[ "$1" = "--" ] || { echo "expected -- before input files" >&2; exit 2; }
+shift
+
+HERE=$(cd "$(dirname "$0")" && pwd)
+SEED_DIR=$(cd "$HERE/.." && pwd)
+KERNEL=$SEED_DIR/build/Image
+EXTRACT=$HERE/extract-dump.sh
+[ -f "$KERNEL" ] || { echo "missing $KERNEL — run 'make' in $SEED_DIR first" >&2; exit 1; }
+
+mkdir -p "$OUTDIR"
+STAGE=$(mktemp -d -t tier2-stage.XXXXXX)
+trap 'rm -rf "$STAGE"' EXIT
+
+cp "$SCHEME1" "$STAGE/init"; chmod +x "$STAGE/init"
+cp "$CHILD" "$STAGE/child-prog"; chmod +x "$STAGE/child-prog"
+cat "$PRELUDE" "$DRIVER" > "$STAGE/combined.scm"
+NAMES="init
+child-prog
+combined.scm"
+for inp in "$@"; do
+ base=$(basename "$inp")
+ cp "$inp" "$STAGE/$base"
+ NAMES="$NAMES
+$base"
+done
+( cd "$STAGE" && printf '%s\n' "$NAMES" | cpio -o -H newc 2>/dev/null > initramfs.cpio )
+
+TRANSCRIPT=$STAGE/transcript.txt
+echo "[gate] running scheme1 driver" >&2
+qemu-system-aarch64 \
+ -machine virt -cpu cortex-a72 -m 2048M \
+ -nographic -no-reboot \
+ -kernel "$KERNEL" -initrd "$STAGE/initramfs.cpio" \
+ -append "init combined.scm dumpfs" \
+ > "$TRANSCRIPT" 2>&1 &
+QPID=$!
+( sleep 240; kill -9 $QPID 2>/dev/null ) &
+WATCHER=$!
+wait $QPID 2>/dev/null || true
+kill $WATCHER 2>/dev/null || true
+
+if ! grep -q '=== DUMP-END ===' "$TRANSCRIPT"; then
+ echo "[gate] FAIL: no DUMP-END in transcript" >&2
+ tail -40 "$TRANSCRIPT" >&2
+ exit 3
+fi
+
+# Capture the driver's exit code from the kernel's parting message.
+EXIT_LINE=$(grep -E "user exit_group" "$TRANSCRIPT" | tail -1 || true)
+"$EXTRACT" "$OUTDIR" "$TRANSCRIPT"
+
+case "$EXIT_LINE" in
+ *"exit_group(0)"*)
+ echo "[gate] PASS — driver exit 0; outputs in $OUTDIR" >&2
+ exit 0 ;;
+ *)
+ echo "[gate] FAIL — driver did not exit 0: $EXIT_LINE" >&2
+ exit 4 ;;
+esac
diff --git a/seed-kernel/user/child.c b/seed-kernel/user/child.c
@@ -0,0 +1,51 @@
+/* Tier 2 demo: child program execve'd by forktest. Prints argv and
+ * exits 42 so the parent's waitid can verify si_status round-trips. */
+
+typedef long i64;
+typedef unsigned long u64;
+
+#define SYS_write 64
+#define SYS_exit_group 93 /* NB: Linux aarch64 numbers exit=93, exit_group=94 */
+
+static i64 sysc(u64 nr, u64 a, u64 b, u64 c) {
+ register u64 x8 asm("x8") = nr;
+ register u64 x0 asm("x0") = a;
+ register u64 x1 asm("x1") = b;
+ register u64 x2 asm("x2") = c;
+ asm volatile("svc #0" : "+r"(x0) : "r"(x8), "r"(x1), "r"(x2) : "memory", "cc");
+ return (i64)x0;
+}
+
+static i64 sys_write(int fd, const void *buf, u64 n) { return sysc(SYS_write, (u64)fd, (u64)buf, n); }
+static void sys_exit(int c) { sysc(SYS_exit_group, (u64)c, 0, 0); for(;;); }
+
+void *memset(void *d, int c, u64 n) {
+ unsigned char *dd = d; for (u64 i = 0; i < n; i++) dd[i] = (unsigned char)c; return d;
+}
+
+static u64 strlen_(const char *s) { u64 n = 0; while (s[n]) n++; return n; }
+static void puts_(const char *s) { sys_write(1, s, strlen_(s)); }
+static void put_d(i64 v) {
+ char buf[24]; int i = 0;
+ if (v == 0) { sys_write(1, "0", 1); return; }
+ if (v < 0) { sys_write(1, "-", 1); v = -v; }
+ while (v) { buf[i++] = '0' + (char)(v % 10); v /= 10; }
+ while (i--) sys_write(1, &buf[i], 1);
+}
+
+void _start_c(long argc, char **argv) {
+ puts_("[child] argc="); put_d(argc); puts_("\n");
+ for (long i = 0; i < argc; i++) {
+ puts_("[child] argv["); put_d(i); puts_("] = "); puts_(argv[i]); puts_("\n");
+ }
+ puts_("[child] exiting 42\n");
+ sys_exit(42);
+}
+
+asm(
+ ".globl _start\n"
+ ".type _start, %function\n"
+ "_start:\n"
+ " ldr x0, [sp]\n"
+ " add x1, sp, #8\n"
+ " b _start_c\n");
diff --git a/seed-kernel/user/forktest.c b/seed-kernel/user/forktest.c
@@ -0,0 +1,88 @@
+/* Tier 2 demo: parent does clone() → execve("child") in child →
+ * waitid in parent → reports result. Mirrors the scheme1 prelude's
+ * spawn/run/wait pattern in C. */
+
+typedef long i64;
+typedef unsigned long u64;
+typedef int i32;
+
+#define SYS_write 64
+#define SYS_openat 56
+#define SYS_close 57
+#define SYS_read 63
+#define SYS_lseek 62
+#define SYS_brk 214
+#define SYS_exit_group 93 /* NB: Linux aarch64 numbers exit=93, exit_group=94 */
+#define SYS_clone 220
+#define SYS_execve 221
+#define SYS_waitid 95
+
+static i64 sysc(u64 nr, u64 a, u64 b, u64 c, u64 d, u64 e, u64 f) {
+ register u64 x8 asm("x8") = nr;
+ register u64 x0 asm("x0") = a;
+ register u64 x1 asm("x1") = b;
+ register u64 x2 asm("x2") = c;
+ register u64 x3 asm("x3") = d;
+ register u64 x4 asm("x4") = e;
+ register u64 x5 asm("x5") = f;
+ asm volatile("svc #0"
+ : "+r"(x0)
+ : "r"(x8), "r"(x1), "r"(x2), "r"(x3), "r"(x4), "r"(x5)
+ : "memory", "cc");
+ return (i64)x0;
+}
+
+static i64 sys_write(int fd, const void *buf, u64 n) { return sysc(SYS_write, (u64)fd, (u64)buf, n, 0,0,0); }
+static void sys_exit(int c) { sysc(SYS_exit_group, (u64)c, 0,0,0,0,0); for(;;); }
+static i64 sys_clone(void) { return sysc(SYS_clone, 17/*SIGCHLD*/, 0,0,0,0,0); }
+static i64 sys_execve(const char *p, char **argv) { return sysc(SYS_execve, (u64)p, (u64)argv, 0, 0, 0, 0); }
+static i64 sys_waitid(int id, int pid, void *info, int opts) { return sysc(SYS_waitid, (u64)id, (u64)pid, (u64)info, (u64)opts, 0, 0); }
+
+void *memset(void *d, int c, u64 n) {
+ unsigned char *dd = d; for (u64 i = 0; i < n; i++) dd[i] = (unsigned char)c; return d;
+}
+
+static u64 strlen_(const char *s) { u64 n = 0; while (s[n]) n++; return n; }
+static void puts_(const char *s) { sys_write(1, s, strlen_(s)); }
+static void put_d(i64 v) {
+ char buf[24]; int i = 0;
+ if (v < 0) { sys_write(1, "-", 1); v = -v; }
+ if (v == 0) buf[i++] = '0';
+ while (v) { buf[i++] = '0' + (char)(v % 10); v /= 10; }
+ while (i--) sys_write(1, &buf[i], 1);
+}
+
+void _start_c(long argc, char **argv) {
+ puts_("[forktest] argc="); put_d(argc); puts_(" argv[0]="); puts_(argv[0]); puts_("\n");
+
+ long pid = sys_clone();
+ if (pid == 0) {
+ /* child */
+ puts_("[forktest:child] pre-exec\n");
+ char *cargv[3];
+ cargv[0] = "child";
+ cargv[1] = "from-parent";
+ cargv[2] = 0;
+ sys_execve("child", cargv);
+ puts_("[forktest:child] execve failed\n");
+ sys_exit(127);
+ }
+ /* parent */
+ puts_("[forktest:parent] clone returned pid="); put_d(pid); puts_("\n");
+ unsigned char info[128];
+ memset(info, 0, sizeof info);
+ long w = sys_waitid(/*P_PID*/1, (int)pid, info, /*WEXITED*/4);
+ puts_("[forktest:parent] waitid="); put_d(w);
+ puts_(" si_code="); put_d(*(i32 *)(info + 8));
+ puts_(" si_status="); put_d(*(i32 *)(info + 24));
+ puts_("\n");
+ sys_exit(0);
+}
+
+asm(
+ ".globl _start\n"
+ ".type _start, %function\n"
+ "_start:\n"
+ " ldr x0, [sp]\n"
+ " add x1, sp, #8\n"
+ " b _start_c\n");
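
Reviewer note: forktest reads si_code and si_status by casting into the raw info buffer at byte offsets 8 and 24. The following sketch shows one alternative that reads the same two fields with memcpy, which avoids relying on the alignment of an unsigned char array. The struct and function names are hypothetical; the offsets come from the code above, and the value 1 for CLD_EXITED matches the Linux definition.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Byte offsets used by forktest.c; CLD_EXITED is 1 on Linux. */
enum { SI_CODE_OFF = 8, SI_STATUS_OFF = 24, SEED_CLD_EXITED = 1 };

/* Hypothetical holder for the two fields the parent inspects. */
struct child_result {
    int32_t si_code;
    int32_t si_status;
};

/* Copy the two 32-bit fields out of the waitid info buffer. */
static struct child_result read_child_result(const unsigned char *info) {
    struct child_result r;
    assert(info);
    memcpy(&r.si_code, info + SI_CODE_OFF, sizeof r.si_code);
    memcpy(&r.si_status, info + SI_STATUS_OFF, sizeof r.si_status);
    return r;
}
```

With the demo pair from this commit, a parent that calls read_child_result on its info buffer after waitid would expect si_code equal to SEED_CLD_EXITED and si_status equal to 42.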
diff --git a/seed-kernel/user/hello.c b/seed-kernel/user/hello.c
@@ -61,8 +61,17 @@ static void put_x(u64 v) {
for (int i = 60; i >= 0; i -= 4) { char c = hex[(v >> i) & 0xf]; sys_write(1, &c, 1); }
}
-void _start(void) {
+/* aarch64 entry: x0 holds nothing — the SysV stack layout is at sp:
+ * [argc][argv[0]]...[argv[argc-1]][NULL][envp...][NULL]
+ * We read argc/argv off the initial stack pointer in the asm shim
+ * below, then tail-call into _start_c. */
+void _start_c(long argc, char **argv) {
puts_("hello from user space (EL1t, identity-map MMU)\n");
+ puts_("argc = "); put_d(argc); puts_("\n");
+ for (long i = 0; i < argc; i++) {
+ puts_(" argv["); put_d(i); puts_("] = ");
+ puts_(argv[i]); puts_("\n");
+ }
/* Exercise brk: ask current break, push it up by 1 MiB, write+read. */
u64 b0 = (u64)sys_brk(0);
@@ -107,3 +116,15 @@ void _start(void) {
puts_("[user] all checks passed, exiting 0\n");
sys_exit(0);
}
+
+/* Read argc/argv off the initial stack and tail-call into _start_c. The
+ * kernel sets sp_el0 to point at [argc][argv[0]]... before ERETing.
+ * Emitted as a plain global symbol with raw asm — no C-compiler-generated
+ * prologue, since gcc would clobber sp before we read argc. */
+asm(
+ ".globl _start\n"
+ ".type _start, %function\n"
+ "_start:\n"
+ " ldr x0, [sp]\n"
+ " add x1, sp, #8\n"
+ " b _start_c\n");
diff --git a/seed-kernel/user/user.lds b/seed-kernel/user/user.lds
@@ -1,10 +1,12 @@
-/* Link the user binary high enough to be clear of the kernel image
- * (which sits at 0x40080000) and the initrd (placed by QEMU). */
+/* Link at the boot2 chain's default base (hex2pp -B 0x600000). This is
+ * below QEMU virt's RAM (which starts at 0x40000000) — the seed kernel
+ * provides a per-process L2 page table that maps user low VAs to a
+ * reserved physical RAM pool, so VA 0x600000 has real backing. */
ENTRY(_start)
SECTIONS {
- . = 0x42000000;
+ . = 0x00600000;
.text : { *(.text .text.*) }
.rodata : ALIGN(8) { *(.rodata .rodata.*) }