boot2

Playing with the bootstrap
git clone https://git.ryansepassi.com/git/boot2.git

commit 5073ee046922edc1734cd7c6abda83c25d743fef
parent d437cb28c8b53b20923897d9aaa74fd7b0244928
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Mon,  4 May 2026 20:58:34 -0700

seed-kernel: close all 11 OS-TODO items + verification gates

Tier 1 fixes:
- argv from /chosen/bootargs (whitespace-tokenised) with /init.argv
  fallback; build_user_stack now takes (argc, argv[])
- exit_group masks status to low byte
- Per-process L2 page table maps user low VA to a 768 MB physical
  pool. PL011/GIC/virtio reached from kernel via a high alias
  (L1[4] = Device 1 GB block at PA 0, so VA 0x109000000 -> PA
  0x09000000). The boot2 chain runs at its native -B 0x600000.
- load_elf clips PT_LOAD memsz at USER_VA_HI and reports end-of-image
  in g_user_image_end so brk_base sits above the binary's BSS
- Documented RWX-at-EL1 in load_elf as deliberate per OS.md
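
The address arithmetic behind the mapping items above can be checked on a
host with a small sketch. The constants are the ones quoted in this commit;
the helper names device_alias_va and user_va_to_pa are hypothetical and do
not appear in kernel.c.

```c
#include <assert.h>
#include <stdint.h>

/* Constants quoted from this commit. */
#define USER_POOL_PA      0x4c000000ULL   /* physical base of the user pool */
#define USER_POOL_SIZE    0x30000000ULL   /* 768 MB */
#define USER_VA_LO        0x00200000ULL   /* slot 1: first mapped 2 MB block */
#define USER_VA_HI        0x30200000ULL   /* slot 385: first device-only block */
#define DEVICE_ALIAS_BASE 0x100000000ULL  /* L1[4]: VA 4 GB..5 GB -> PA 0..1 GB */

/* Hypothetical helper: kernel VA used to reach a device at PA `pa` (< 1 GB). */
static uint64_t device_alias_va(uint64_t pa) {
    return DEVICE_ALIAS_BASE + pa;
}

/* Hypothetical helper: physical address backing a user VA, or 0 when the VA
 * falls outside the user window (slot 0 or the device-only slots). */
static uint64_t user_va_to_pa(uint64_t va) {
    if (va < USER_VA_LO || va >= USER_VA_HI) return 0;
    return USER_POOL_PA + (va - USER_VA_LO);
}
```

Under these assumptions the PL011 at PA 0x09000000 is reached at VA
0x109000000, and the boot2 link base 0x600000 lands 4 MB into the pool.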

Tier 2 syscalls (clone/execve/waitid):
- Pseudo-fork via proc_stack[], an array of struct proc_save frames
  holding a memory snapshot, regs, brk state, and fd table. clone
  returns 0 to the current "child"; execve replaces the user image;
  exit_group pops the snapshot, restores parent state, and resumes
  the parent's clone() with x0=pid. waitid populates info[8]/info[24]
  per the prelude. envp is ignored; NULL/empty is accepted, as the
  spec requires.
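
The clone / exit_group / waitid sequencing described above can be modelled
in a few lines of host-side C. This is only a sketch of the control flow: a
16-byte array stands in for the 768 MB user image, and every name prefixed
model_ or MODEL_ is hypothetical. The errno values match the ECHILD and
EAGAIN defines added in this commit.

```c
#include <assert.h>
#include <string.h>

#define MODEL_MEM       16   /* stand-in for the 768 MB user image */
#define MODEL_MAX_DEPTH 1    /* mirrors MAX_PROC_DEPTH = 1 */
#define MODEL_ECHILD    10
#define MODEL_EAGAIN    11

struct model_proc { unsigned char mem[MODEL_MEM]; };

struct model_kernel {
    struct model_proc cur;                     /* live user image */
    struct model_proc saved[MODEL_MAX_DEPTH];  /* parent snapshots */
    long saved_pid[MODEL_MAX_DEPTH];
    int  depth;
    long next_pid;
    int  last_code, last_valid;                /* most recently exited child */
};

static void model_init(struct model_kernel *k) {
    memset(k, 0, sizeof *k);
    k->next_pid = 2;
}

/* clone: snapshot the parent and keep running as the "child" (returns 0). */
static long model_clone(struct model_kernel *k) {
    if (k->depth >= MODEL_MAX_DEPTH) return -MODEL_EAGAIN;
    k->saved[k->depth] = k->cur;
    k->saved_pid[k->depth] = k->next_pid++;
    k->depth++;
    return 0;
}

/* exit_group: restore the parent and return the pid its clone() should see.
 * Returns 0 when no parent is saved (the real kernel shuts down instead). */
static long model_exit_group(struct model_kernel *k, int code) {
    if (k->depth == 0) return 0;
    k->depth--;
    k->cur = k->saved[k->depth];
    k->last_code = code & 0xff;
    k->last_valid = 1;
    return k->saved_pid[k->depth];
}

/* waitid: hand the stored exit status to the parent exactly once. */
static int model_waitid(struct model_kernel *k, int *status) {
    if (!k->last_valid) return -MODEL_ECHILD;
    *status = k->last_code;
    k->last_valid = 0;
    return 0;
}
```

The model makes the single-snapshot limit visible: a second clone before
the first child exits fails with EAGAIN rather than nesting.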

Caches enabled at MMU bring-up (SCTLR.C|I) so the 768 MB snapshot
copy isn't unbearably slow under TCG.

Verification harness:
- dumpfs bootargs token triggers a sentinel-framed hex tmpfs dump
  on exit; scripts/extract-dump.sh decodes it back to files.
- scripts/tier1-gate.sh runs an arbitrary chain stage as /init,
  extracts the post-run tmpfs. Verified with boot0/catm and
  boot3/tcc0 (compiles a .c into a valid aarch64 ELF object).
- scripts/tier2-gate.sh + tier2-tcc-driver.scm fixture run
  scheme1 → (run "tcc0" -nostdlib -c -o out.o input.c) → wait,
  end-to-end. Output is a real ELF relocatable with the expected
  symbols. This is the canonical OS.md §Verification Tier-2 case.
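
For illustration, decoding one payload line of the dump looks roughly like
the sketch below. It assumes the framing shown in this commit (a lowercase
hex payload on the line following each "=== FILE path=... size=... ==="
header); it is not the code in scripts/extract-dump.sh, which is a shell
script, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Value of one lowercase hex digit, or -1 if `c` is not one. */
static int hex_nibble(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Decode a hex payload line (terminated by '\n' or NUL) into `out`.
 * Returns the number of bytes written, or -1 on a malformed digit, an odd
 * number of digits, or when the payload does not fit in `cap` bytes. */
static long decode_hex_payload(const char *hex, unsigned char *out, size_t cap) {
    size_t n = 0;
    while (hex[0] && hex[0] != '\n') {
        int hi = hex_nibble(hex[0]);
        int lo = hex[1] ? hex_nibble(hex[1]) : -1;
        if (hi < 0 || lo < 0 || n >= cap) return -1;
        out[n++] = (unsigned char)((hi << 4) | lo);
        hex += 2;
    }
    return (long)n;
}
```

A host-side extractor would compare the returned length against the size=
field in the header before writing the file out.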

User demos: forktest.c + child.c exercise clone/execve/waitid
(parent observes si_code=CLD_EXITED, si_status=42 from child).
hello.c grew an asm _start shim to read argc/argv off sp.
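
The siginfo convention that forktest.c relies on can be sketched as follows.
The offsets (8 for si_code, 24 for si_status), the 128-byte zeroing, and
CLD_EXITED = 1 are taken from sys_waitid in this commit; the helper names
are hypothetical, and the low-byte mask is applied here only to keep the
sketch self-contained (the kernel masks in its exit path).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SIGINFO_BYTES    128
#define OFF_SI_CODE      8
#define OFF_SI_STATUS    24
#define CLD_EXITED_VALUE 1

/* Fill a siginfo-sized buffer the way sys_waitid is described to:
 * zero everything, then write si_code and si_status as 32-bit values. */
static void fill_child_exit(uint8_t info[SIGINFO_BYTES], int exit_code) {
    uint32_t code = CLD_EXITED_VALUE;
    uint32_t status = (uint32_t)(exit_code & 0xff);
    memset(info, 0, SIGINFO_BYTES);
    memcpy(info + OFF_SI_CODE, &code, sizeof code);
    memcpy(info + OFF_SI_STATUS, &status, sizeof status);
}

/* Read a 32-bit field back, as a parent would after waitid returns. */
static uint32_t read_u32(const uint8_t *info, size_t off) {
    uint32_t v;
    memcpy(&v, info + off, sizeof v);
    return v;
}
```

A parent that sees read_u32(info, 8) == 1 and read_u32(info, 24) == 42 has
observed the same result the forktest demo reports.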

run.sh bumped to -m 2048M; kernel layout: 192 MB image+kheap,
768 MB user pool at 0x4c000000, 768 MB pseudo-fork snapshot at
0x7c000000, 320 MB spare.
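
As a sanity check on the layout above, the four regions should tile the
2 GB of RAM exactly. The sketch below encodes the bases and sizes from this
commit (RAM starting at 0x40000000 on QEMU virt); the struct and function
names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MB(n) ((uint64_t)(n) << 20)

struct region { const char *name; uint64_t base; uint64_t size; };

/* Bases and sizes as stated in the commit message and kernel.c comment. */
static const struct region seed_layout[] = {
    { "kernel image + kheap", 0x40000000ULL, MB(192) },
    { "user pool",            0x4c000000ULL, MB(768) },
    { "pseudo-fork snapshot", 0x7c000000ULL, MB(768) },
    { "spare",                0xac000000ULL, MB(320) },
};

/* Returns 1 when the regions are back to back with no gap or overlap and
 * together cover exactly [ram_base, ram_base + ram_size). */
static int layout_is_contiguous(uint64_t ram_base, uint64_t ram_size) {
    uint64_t cursor = ram_base;
    size_t n = sizeof seed_layout / sizeof seed_layout[0];
    for (size_t i = 0; i < n; i++) {
        if (seed_layout[i].base != cursor) return 0;
        cursor += seed_layout[i].size;
    }
    return cursor == ram_base + ram_size;
}
```

With -m 2048M the check passes; with a smaller RAM size it fails because
the snapshot region would extend past the end of RAM.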

Diffstat:
A docs/OS-TODO.md | 117 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M seed-kernel/Makefile | 18 +++++++++++++++++-
M seed-kernel/kernel.c | 528 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
M seed-kernel/run.sh | 2 +-
A seed-kernel/scripts/extract-dump.sh | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A seed-kernel/scripts/fixtures/tier2-driver.scm | 5 +++++
A seed-kernel/scripts/fixtures/tier2-tcc-driver.scm | 8 ++++++++
A seed-kernel/scripts/tier1-gate.sh | 91 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A seed-kernel/scripts/tier2-gate.sh | 87 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A seed-kernel/user/child.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
A seed-kernel/user/forktest.c | 88 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M seed-kernel/user/hello.c | 23 ++++++++++++++++++++++-
M seed-kernel/user/user.lds | 8 +++++---
13 files changed, 1030 insertions(+), 52 deletions(-)

diff --git a/docs/OS-TODO.md b/docs/OS-TODO.md
@@ -0,0 +1,117 @@
+# Seed kernel — gaps against docs/OS.md
+
+Audit of [`seed-kernel/`](../seed-kernel/) against the contract in
+[`OS.md`](OS.md). All eleven items are now resolved — the seed kernel
+boots, parses the DTB, unpacks an initramfs into an in-memory tmpfs,
+loads `/init` as a static aarch64 ELF, dispatches the eight Tier-1 +
+three Tier-2 syscalls, and supports both the host-side verification
+gates `scripts/tier1-gate.sh` and `scripts/tier2-gate.sh`. Verified
+against `boot0/catm`, `boot1/M1pp`, and `boot3/tcc0`; the canonical
+Tier-2 case (scheme1 driver spawns tcc0 to compile a `.c` into a
+relocatable ELF object) round-trips end-to-end.
+
+## Tier 1
+
+1. **Real argv.** ✅ `build_user_stack` takes `(argc, argv[])`. argv is
+   sourced from `/chosen/bootargs` (whitespace-tokenised), then from a
+   `/init.argv` file in the initramfs, with `argc=1, argv[0]="init"`
+   as the final fallback. The kernel reserves a `dumpfs` token in
+   bootargs (stripped from user argv) that triggers the UART tmpfs
+   dump on exit (item 9).
+
+2. **User load address.** ✅ Per-process L2 page table installs an
+   `l2_user[]` covering the low 1 GB of VA in 2 MB blocks. Slot 0 is
+   invalid (NULL traps); slots 1…384 are Normal user RAM backed by a
+   768 MB physical pool (`USER_POOL_PA`); slots 385…511 stay Device-
+   identity for safety. The PL011 / GIC / virtio that used to live in
+   the low 1 GB are now reached from kernel code via a high alias —
+   `L1[4]` is a 1 GB Device block at PA 0, so VA `0x109000000` ↔ PA
+   `0x09000000`. This lets the boot2 chain link at its native
+   `-B 0x600000` and run unmodified on the seed kernel.
+
+3. **Bigger heap.** ✅ User pool is 768 MB (slots 1…384 × 2 MB),
+   sized so tcc0/tcc-boot2 (which declare a 512 MB BSS at link base
+   `0x600000` ⇒ end VA `0x20600000`) fit with a healthy brk window
+   above end-of-bss. `load_elf` walks PT_LOAD segments and records
+   the post-clip end-of-image in `g_user_image_end`; kmain and
+   `do_execve` use it to seed `brk_base`. `brk_max` is
+   `USER_VA_HI - 16 MB` (16 MB stack reserve at the top).
+
+4. **Per-segment ELF permissions.** ✅ Documented as a deliberate
+   spec-permissible choice in `load_elf` — segments are RWX at EL1.
+   OS.md §"Memory model" allows this; tcc-boot2 doesn't JIT.
+
+5. **`exit_group` exit-code masking.** ✅ `code &= 0xff` in
+   `sys_exit_final` / `sys_exit_or_resume_parent`.
+
+## Tier 2
+
+6. **`clone` / `execve` / `waitid`.** ✅ Pseudo-fork via a
+   `proc_stack[]` of saved frames. `sys_clone` snapshots the trap
+   frame + sp_el0 + brk + fd table + the entire 768 MB user image
+   (one snapshot at PA `0x7c000000`), returns 0 to the current
+   context (the "child"). `do_execve` captures path/argv into a
+   kernel pool before clobbering user memory, loads the new ELF,
+   resets brk above its end-of-bss, and rewrites the trap frame so
+   `eret` lands at the new entry point with a fresh user stack.
+   `sys_waitid` populates the siginfo at offsets 8 (CLD_EXITED) and
+   24 (status) per `scheme1/prelude.scm:497-506`. On
+   `sys_exit_or_resume_parent`, if `proc_depth > 0`, the kernel
+   restores the parent's image / regs / brk / fd table, syncs I-cache
+   over the freshly-overwritten user pages, and returns to the
+   parent's `clone()` site with `x0 = child_pid`.
+
+7. **Per-process state on a stack.** ✅ `proc_save` records regs +
+   ELR + SPSR + sp_el0 + brk_base + brk_cur + fd table + a 768 MB
+   memory snapshot. `MAX_PROC_DEPTH = 1` — the scheme1 prelude only
+   forks one level deep before waiting; one snapshot frame is all
+   that's needed and keeps total RAM at 2 GB.
+
+8. **`execve` accepts NULL/empty envp.** ✅ `do_execve` ignores its
+   `envp` argument; the prelude wrapper passes no envp at all and
+   the value in `x2` at the SVC site is whatever happens to be
+   there.
+
+## Verification harness
+
+9. **Output extraction.** ✅ The kernel emits a sentinel-framed
+   hex dump of every tmpfs file on exit when bootargs contain the
+   `dumpfs` token. Scripts:
+   - [`scripts/extract-dump.sh`](../seed-kernel/scripts/extract-dump.sh) —
+     scans a UART transcript for `=== DUMP-BEGIN ===` … `=== DUMP-END ===`,
+     decodes each `=== FILE path=… size=… ===` payload, writes files.
+
+10. **Tier 1 gate.** ✅
+    [`scripts/tier1-gate.sh`](../seed-kernel/scripts/tier1-gate.sh) —
+    builds an initramfs containing a stage binary as `/init` plus
+    arbitrary input files, runs the seed kernel under qemu with the
+    stage's argv as bootargs, and extracts the post-run tmpfs.
+    Verified against `boot0/catm` (multi-input concatenation, output
+    matches host `cat`) and `boot3/tcc0` (compiles `int main(void)
+    {return 42;}` into a valid aarch64 relocatable object).
+
+11. **Tier 2 gate.** ✅
+    [`scripts/tier2-gate.sh`](../seed-kernel/scripts/tier2-gate.sh) —
+    cats `prelude.scm` + a driver fixture into `combined.scm`, packs
+    initramfs `/init=scheme1, /child-prog=<chain stage>, /combined.scm,
+    <inputs>`, runs the seed kernel, asserts the driver exited 0, and
+    extracts every output file. Verified end-to-end with the canonical
+    fixture
+    [`scripts/fixtures/tier2-tcc-driver.scm`](../seed-kernel/scripts/fixtures/tier2-tcc-driver.scm) —
+    scheme1 evaluates `(run "child-prog" "-nostdlib" "-c" "-o" "out.o"
+    "input.c")`, where `child-prog` is `boot3/tcc0`. Output `out.o` is a
+    valid aarch64 ELF relocatable with the expected `add` and `main`
+    symbols.
+
+## Things still worth doing (out of scope of the original list)
+
+- **Multi-stage Tier 1 driving**: `make tcc-boot2 ARCH=aarch64` could be
+  taught to swap each podman invocation for `tier1-gate.sh`. The hooks
+  exist; it would just be a `seed-kernel/Makefile.gate` overlay.
+- **Snapshot speed**: `mem_cpy(USER_POOL_SIZE = 768 MB)` is the dominant
+  cost of every clone (~30 s under TCG). A copy-on-write or only-
+  touched-pages strategy would help, but isn't needed for compliance.
+- **NULL-page hardening**: slot 0 is unmapped so a NULL deref faults to
+  the kernel as a user sync; the kernel currently panics rather than
+  delivering a SIGSEGV-equivalent. Acceptable per OS.md (default-action
+  termination is sufficient) but a minor polish opportunity.
diff --git a/seed-kernel/Makefile b/seed-kernel/Makefile
@@ -8,7 +8,10 @@ KOBJS := $(OUT)/start.o $(OUT)/kernel.o
 KIMAGE := $(OUT)/kernel.elf
 KBIN := $(OUT)/Image
 USER := $(OUT)/init
+USER_FORK := $(OUT)/forktest
+USER_CHILD := $(OUT)/child
 INITRAMFS := $(OUT)/initramfs.cpio
+INITRAMFS_FORK := $(OUT)/initramfs-fork.cpio
 
 CFLAGS_COMMON := -nostdlib -nostartfiles -ffreestanding -fno-stack-protector \
     -fno-pic -static -Wall -Wextra -O2 -mcmodel=large \
@@ -16,7 +19,7 @@ CFLAGS_COMMON := -nostdlib -nostartfiles -ffreestanding -fno-stack-protector \
 KCFLAGS := $(CFLAGS_COMMON) -mgeneral-regs-only
 
 .PHONY: all clean kernel user initramfs
-all: $(KBIN) $(INITRAMFS)
+all: $(KBIN) $(INITRAMFS) $(INITRAMFS_FORK)
 
 $(OUT):
 	mkdir -p $(OUT)
@@ -37,9 +40,22 @@ $(KBIN): $(KIMAGE)
 $(USER): user/hello.c user/user.lds | $(OUT)
 	gcc $(CFLAGS_COMMON) -mgeneral-regs-only -T user/user.lds -o $@ $<
 
+$(USER_FORK): user/forktest.c user/user.lds | $(OUT)
+	gcc $(CFLAGS_COMMON) -mgeneral-regs-only -T user/user.lds -o $@ $<
+
+$(USER_CHILD): user/child.c user/user.lds | $(OUT)
+	gcc $(CFLAGS_COMMON) -mgeneral-regs-only -T user/user.lds -o $@ $<
+
 $(INITRAMFS): $(USER)
 	cd $(OUT) && printf 'init\n' | cpio -o -H newc > initramfs.cpio
 
+# Tier 2 demo cpio: /init is the fork driver, /child the program it execs.
+$(INITRAMFS_FORK): $(USER_FORK) $(USER_CHILD)
+	rm -rf $(OUT)/fork-stage && mkdir -p $(OUT)/fork-stage
+	cp $(USER_FORK) $(OUT)/fork-stage/init
+	cp $(USER_CHILD) $(OUT)/fork-stage/child
+	cd $(OUT)/fork-stage && printf 'init\nchild\n' | cpio -o -H newc > ../initramfs-fork.cpio
+
 kernel: $(KBIN)
 user: $(USER)
 initramfs: $(INITRAMFS)
diff --git a/seed-kernel/kernel.c b/seed-kernel/kernel.c
@@ -16,7 +16,13 @@ typedef int i32;
 
 /* ─── PL011 console ─────────────────────────────────────────────────────── */
 
-#define UART0 0x09000000UL
+/* The PL011 lives at PA 0x09000000 on QEMU virt. Once the MMU comes up the
+ * kernel reaches it through the device alias mapped into VA 4 GB..5 GB
+ * (L1[4]). That keeps the entire low 1 GB of VA available as user RAM —
+ * device MMIO at user-space VAs would otherwise collide with the boot2
+ * chain's BSS, which can run past 256 MB. */
+#define DEVICE_ALIAS_BASE 0x100000000UL
+#define UART0 (DEVICE_ALIAS_BASE + 0x09000000UL)
 #define UART_DR ((volatile u32 *)(UART0 + 0x00))
 #define UART_FR ((volatile u32 *)(UART0 + 0x18))
 #define UART_FR_TXFF (1u << 5)
@@ -75,28 +81,83 @@ static void mem_set(void *d, int c, u64 n) {
 }
 
 /* ─── MMU bring-up ──────────────────────────────────────────────────────── */
-/* Identity-map the first 4 GB at L1 (1 GB blocks). One page table — 4 KB.
- *   Entry 0 (0..1G): Device-nGnRnE (UART/GIC/virtio/flash live here)
- *   Entry 1 (1..2G): Normal WB-WA (RAM 0x40000000-)
- *   Entry 2 (2..3G): Normal WB-WA (extra RAM if -m > 1G)
- *   Entry 3 (3..4G): Normal WB-WA (above-RAM PCI on virt; rarely touched)
+/* Two-level page table:
+ *   L1[0] → l2_user table descriptor (VA 0..1 GB, 2 MB blocks)
+ *   L1[1..3] = Normal 1 GB blocks identity-mapping VA 1..4 GB (RAM + high MMIO)
+ *   L1[4] = Device 1 GB block at PA 0 (VA 4..5 GB mirrors PA 0..1 GB as
+ *           Device-nGnRnE — the kernel's only path to UART/GIC/virtio/PCI
+ *           once we hand the low 1 GB over to user code).
+ * + * The l2_user table carves the low 1 GB into: + * slot 0 (VA 0..2 MB) invalid — NULL pointer traps + * slots 1..N (VA 2 MB..USER_VA_HI) Normal user RAM, backed by the + * physical pool USER_POOL_PA. The + * boot2 chain links at 0x600000 and + * scheme1 reserves ~256 MB of BSS; + * sizing N at 256 (slots 1..256, 512 MB) + * gives both code+BSS and the brk + * window plenty of room. + * slots N+1..511 (VA USER_VA_HI..1G) Device-identity, kept for safety — + * nothing user-side touches them, and + * the kernel uses the high alias. + * * With MMU on + Normal memory, unaligned loads/stores work — gcc's auto- - * vectorised 64-bit load in be64() stops trapping. - */ + * vectorised 64-bit load in be64() stops trapping. */ __attribute__((aligned(4096))) static u64 l1_pt[512]; +__attribute__((aligned(4096))) static u64 l2_user[512]; + +/* Physical RAM region reserved as the backing store for user low VAs. + * 768 MB (slots 1..384 × 2 MB), placed above the kernel heap end. Sized + * to fit tcc0 / tcc-boot2 — they declare a 512 MB BSS and link at + * 0x600000, so the binary's VA reach is 0x600000 + 512 MB = 0x20600000. + * 768 MB gives that plus a healthy brk window above end-of-bss. + * + * With QEMU -m 2048M (RAM 0x40000000–0xc0000000) and MAX_PROC_DEPTH=1 + * (one 768 MB pseudo-fork snapshot above the user pool), the layout is: + * 0x40000000–0x4c000000 kernel image + kheap (192 MB) + * 0x4c000000–0x7c000000 user RAM pool (768 MB) + * 0x7c000000–0xac000000 pseudo-fork snapshot (768 MB) + * 0xac000000–0xc0000000 spare (320 MB) + */ +#define USER_POOL_PA 0x4c000000UL +#define USER_POOL_SIZE 0x30000000UL /* 768 MB */ +#define USER_VA_LO 0x00200000UL /* slot 1 — first mapped 2 MB block */ +#define USER_VA_HI 0x30200000UL /* slot 385 — first device-only block */ static void setup_mmu(void) { - /* AP=00 (RW EL1 only — keep EL0 out for now), SH=ISH, AF=1, AttrIdx=0/1. 
- * Bits: V(0)=1, block(1)=0, AttrIdx[4:2], NS(5)=0, AP[7:6]=00, SH[9:8]=11, - * AF(10)=1, nG(11)=0 → 0x701 (Normal) / 0x705 (Device) */ + /* Block-descriptor attribute bits (block at L1 = bit[1]=0). + * V(0)=1, block(1)=0, AttrIdx[4:2]=Attr0(Normal)/Attr1(Device), + * NS(5)=0, AP[7:6]=00 (RW EL1 only), SH[9:8]=11 (ISH), AF(10)=1, + * nG(11)=0 → 0x701 (Normal) / 0x705 (Device-nGnRnE). + * Block descriptors at L2 use the same bit layout. */ u64 normal = 0x701; u64 device = 0x705; for (int i = 0; i < 512; i++) l1_pt[i] = 0; - l1_pt[0] = 0x00000000UL | device; + + /* L2 user table: slot 0 invalid; slots 1..(USER_POOL_SIZE/2 MB) Normal + * RAM backed by the user pool; slots above that Device-identity. */ + int user_slots = (int)(USER_POOL_SIZE / 0x200000UL); + l2_user[0] = 0; + for (int i = 1; i <= user_slots; i++) { + u64 pa = USER_POOL_PA + (u64)(i - 1) * 0x200000UL; + l2_user[i] = pa | normal; + } + for (int i = user_slots + 1; i < 512; i++) { + u64 pa = (u64)i * 0x200000UL; + l2_user[i] = pa | device; + } + + /* L1[0] table descriptor → l2_user. Table-desc encoding at L1 is + * bits [1:0] = 0b11, bits [47:12] = next-level table PA. */ + l1_pt[0] = (u64)l2_user | 0x3UL; l1_pt[1] = 0x40000000UL | normal; l1_pt[2] = 0x80000000UL | normal; l1_pt[3] = 0xc0000000UL | normal; + /* L1[4]: Device 1 GB block aliasing PA 0..1 GB into VA 4 GB..5 GB so + * the kernel can still reach UART/GIC/virtio after we hand the low 1 + * GB over to user mappings. 
*/ + l1_pt[4] = 0x00000000UL | device; /* MAIR: Attr0 = 0xff (Normal WB-WA), Attr1 = 0x00 (Device-nGnRnE) */ u64 mair = 0x00000000000000ffUL; @@ -120,8 +181,10 @@ static void setup_mmu(void) { u64 sctlr; asm volatile("mrs %0, sctlr_el1" : "=r"(sctlr)); - sctlr &= ~(u64)((1 << 1) | (1 << 19)); /* clear A, WXN */ - sctlr |= (u64)(1 << 0); /* M (MMU) only — caches stay off */ + sctlr &= ~(u64)((1 << 1) | (1 << 19)); /* clear A (alignment), WXN */ + sctlr |= (u64)((1 << 0) /* M — MMU on */ + | (1 << 2) /* C — D-cache on */ + | (1 << 12)); /* I — I-cache on */ asm volatile("msr sctlr_el1, %0" :: "r"(sctlr)); asm volatile("isb"); } @@ -318,6 +381,11 @@ struct phdr { u32 p_type, p_flags; u64 p_offset, p_vaddr, p_paddr, p_filesz, p_m #define PT_LOAD 1 +/* Highest VA touched by the most recently loaded image's PT_LOAD segments + * (after USER_VA_HI clipping). load_elf updates this; kmain / sys_execve + * use it to seed brk_base above the user image's BSS. */ +static u64 g_user_image_end; + static u64 load_elf(const u8 *elf) { const struct ehdr *eh = (const struct ehdr *)elf; if (!(eh->e_ident[0] == 0x7f && eh->e_ident[1] == 'E' && @@ -327,15 +395,38 @@ static u64 load_elf(const u8 *elf) { if (eh->e_machine != 0xb7) { /* EM_AARCH64 */ uart_puts("ELF: not aarch64\n"); return 0; } + /* p_flags (R/W/X) are deliberately ignored: the L2 user mapping is one + * giant Normal-memory RWX-at-EL1 region (see setup_mmu). OS.md + * §"Memory model" permits this — there's no W^X enforcement in the + * contract, and tcc-boot2 never JITs. + * + * Segments are clipped at USER_VA_HI: a binary may declare a BSS that + * extends past the mapped user window (scheme1 reserves ~256 MB), and + * a naive mem_set would walk into the device-block region above and + * trigger an external abort. The user image gets only the portion of + * its memsz that fits in the user pool; if user code later touches + * the unmapped tail, that's a user-space fault, not a kernel panic. 
*/ + u64 hi = 0; for (int i = 0; i < eh->e_phnum; i++) { const struct phdr *ph = (const struct phdr *)(elf + eh->e_phoff + (u64)i * eh->e_phentsize); if (ph->p_type != PT_LOAD) continue; - u8 *dst = (u8 *)ph->p_vaddr; + u64 vaddr = ph->p_vaddr; + u64 filesz = ph->p_filesz; + u64 memsz = ph->p_memsz; + if (vaddr >= USER_VA_HI) continue; /* segment fully out of window */ + u64 reach = USER_VA_HI - vaddr; + if (filesz > reach) filesz = reach; + if (memsz > reach) memsz = reach; + u8 *dst = (u8 *)vaddr; const u8 *src = elf + ph->p_offset; - mem_cpy(dst, src, ph->p_filesz); - if (ph->p_memsz > ph->p_filesz) - mem_set(dst + ph->p_filesz, 0, ph->p_memsz - ph->p_filesz); + mem_cpy(dst, src, filesz); + if (memsz > filesz) + mem_set(dst + filesz, 0, memsz - filesz); + u64 end = vaddr + memsz; + if (end > hi) hi = end; } + /* Round up to 16 bytes so callers can use it directly as brk_base. */ + g_user_image_end = (hi + 15) & ~15UL; /* I-cache sync (cheap insurance even with caches off). */ asm volatile("dsb sy" ::: "memory"); asm volatile("ic iallu" ::: "memory"); @@ -378,7 +469,14 @@ static u64 brk_max; #define SYS_read 63 #define SYS_write 64 #define SYS_exit_group 93 +#define SYS_waitid 95 #define SYS_brk 214 +#define SYS_clone 220 +#define SYS_execve 221 + +#define ECHILD 10 +#define EAGAIN 11 +#define ENOEXEC 8 static i64 sys_write(int fd, const void *buf, u64 len) { if (fd == 1 || fd == 2) { @@ -476,12 +574,226 @@ static i64 sys_unlinkat(int dirfd, const char *path, int flags) { return 0; } +/* ─── Tier 2: pseudo-fork (clone / execve / waitid / exit_group) ────────── */ +/* + * The boot2 chain's clone/execve/waitid pattern (scheme1/prelude.scm:520-537) + * is rigidly synchronous: the parent calls clone, the "child" immediately + * calls execve and runs to exit_group, then the parent calls waitid. Nothing + * else runs between clone and execve in the child, or between clone and + * waitid in the parent. 
+ * + * We implement that as pseudo-fork on a single-threaded kernel: + * + * sys_clone → push parent state (regs, brk, fd table, full user image) + * onto proc_stack; return 0 to current context (the "child"). + * sys_execve → reset brk, load new ELF over user RAM, build user stack, + * set tf so eret resumes at the new entry point. + * sys_exit → if proc_stack non-empty: stash exit code in last_child, + * restore parent state (regs / brk / fds / memory), set tf + * so eret resumes the parent's clone() call with x0 = pid. + * If proc_stack empty: real exit (dump tmpfs, PSCI off). + * sys_waitid → return last_child's exit code via the siginfo struct. + * + * No actual concurrency. The "parent" is suspended at the moment of clone + * and resumed only when the "child" calls exit_group. This works because + * the prelude never schedules other work between fork and wait. + */ + +struct trapframe { + u64 x[31]; + u64 elr; + u64 spsr; +}; + +/* Forward decls for state defined further down. */ +#define MAX_ARGV 32 +static u64 build_user_stack(u64 stack_top, int argc, char **argv); +static int tokenise(char *src, char **argv, int cap); + +#define MAX_PROC_DEPTH 1 +/* Memory snapshot pool — placed above the user RAM pool. The scheme1 + * prelude only ever forks one level deep before waiting (clone → execve + * in child → exit_group → waitid in parent), so a single 768 MB frame + * suffices. Snapshot N lives at SNAP_BASE + N*USER_POOL_SIZE. */ +#define SNAP_BASE_PA 0x7c000000UL + +struct proc_save { + int active; + u64 child_pid; + /* Saved trap-frame state — enough to resume the parent at the SVC + * instruction following its clone(). x[0] is overwritten with child_pid + * at restore time so the parent sees a non-zero return. */ + u64 regs[31]; + u64 elr; + u64 spsr; + u64 sp_el0; + /* User image + per-process state at the moment of clone. 
brk_base + * is saved alongside brk_cur because do_execve resets it above the + * new image's end-of-bss — the parent's value needs to come back + * with the parent's memory image. */ + u64 brk_base_save; + u64 brk_cur_save; + struct fdent fdtab_save[MAX_FD]; + u8 *mem_snapshot; +}; + +static struct proc_save proc_stack[MAX_PROC_DEPTH]; +static int proc_depth = 0; +static u64 g_next_pid = 2; + +/* The most recently exited child, for sys_waitid to consume. */ +static int last_child_valid = 0; +static u64 last_child_pid = 0; +static int last_child_code = 0; + +/* USER_POOL_PA / USER_POOL_SIZE (defined above) describe the user RAM pool. */ + +static i64 sys_clone(struct trapframe *tf, u64 flags, u64 stack, u64 ptid, + u64 ctid, u64 tls) { + (void)flags; (void)stack; (void)ptid; (void)ctid; (void)tls; + if (proc_depth >= MAX_PROC_DEPTH) return -EAGAIN; + struct proc_save *p = &proc_stack[proc_depth]; + p->active = 1; + p->child_pid = g_next_pid++; + for (int i = 0; i < 31; i++) p->regs[i] = tf->x[i]; + p->elr = tf->elr; + p->spsr = tf->spsr; + asm volatile("mrs %0, sp_el0" : "=r"(p->sp_el0)); + p->brk_base_save = brk_base; + p->brk_cur_save = brk_cur; + for (int i = 0; i < MAX_FD; i++) p->fdtab_save[i] = fdtab[i]; + p->mem_snapshot = (u8 *)(SNAP_BASE_PA + (u64)proc_depth * USER_POOL_SIZE); + mem_cpy(p->mem_snapshot, (void *)USER_POOL_PA, USER_POOL_SIZE); + proc_depth++; + /* Current context becomes the "child"; clone returns 0 here. */ + return 0; +} + +/* execve must capture path+argv into kernel-side buffers BEFORE load_elf + * runs — load_elf clobbers user memory, and the path/argv strings live in + * that memory. */ +static char execve_argv_pool[2048]; +static i64 sys_execve(struct trapframe *tf, const char *path, + char **argv, char **envp) { + /* envp may be NULL — the prelude wrapper passes no envp arg, so x2 is + * whatever happened to be there. We ignore envp regardless. 
*/ + (void)envp; + if (!path) return -EFAULT; + /* Copy path before find_file does anything else (path lives in user + * memory which load_elf will clobber). */ + char path_buf[128]; + int pn = 0; + while (path[pn] && pn < 127) { path_buf[pn] = path[pn]; pn++; } + path_buf[pn] = 0; + int fidx = find_file(path_buf); + if (fidx < 0) return -ENOENT; + + /* Capture argv into a kernel-side pool. */ + int argc = 0; + char *new_argv[MAX_ARGV]; + int pool_off = 0; + if (argv) { + while (argc < MAX_ARGV - 1 && argv[argc]) { + const char *s = argv[argc]; + int n = 0; + while (s[n] && pool_off + n < (int)sizeof(execve_argv_pool) - 1) n++; + for (int j = 0; j < n; j++) execve_argv_pool[pool_off + j] = s[j]; + execve_argv_pool[pool_off + n] = 0; + new_argv[argc] = &execve_argv_pool[pool_off]; + pool_off += n + 1; + argc++; + } + } + if (argc == 0) { + /* Synthesise argv[0] from the path so user code that reads argv[0] + * doesn't crash. */ + int n = 0; + while (path_buf[n] && pool_off + n < (int)sizeof(execve_argv_pool) - 1) n++; + for (int j = 0; j < n; j++) execve_argv_pool[pool_off + j] = path_buf[j]; + execve_argv_pool[pool_off + n] = 0; + new_argv[0] = &execve_argv_pool[pool_off]; + pool_off += n + 1; + argc = 1; + } + + /* Load new ELF over user RAM. */ + u64 entry = load_elf(files[fidx].data); + if (!entry) return -ENOEXEC; + /* Reset brk above the new image's end-of-bss. */ + brk_base = g_user_image_end ? g_user_image_end : USER_VA_LO; + brk_cur = brk_base; + /* Build new user stack at top of user VA window. */ + u64 new_sp = build_user_stack(USER_VA_HI, argc, new_argv); + + /* Rewrite trap frame so eret jumps to the new image's entry, with a + * clean register state and the new stack. */ + for (int i = 0; i < 31; i++) tf->x[i] = 0; + tf->elr = entry; + /* sp_el0 isn't in the trap frame — set it directly; it survives until + * the eret since the kernel uses SP_ELx while in trap_sync. 
*/ + asm volatile("msr sp_el0, %0" :: "r"(new_sp)); + /* x[0] = 0 will be overwritten by the dispatcher's tf->x[0] = (u64)r + * assignment. To preserve "argc/argv on the stack only", return 0 and + * let the dispatcher write it; user code never sees the return value + * because elr now points at _start. */ + return 0; +} + +static i64 sys_waitid(struct trapframe *tf, int idtype, u64 id, + void *info, int options) { + (void)tf; (void)idtype; (void)id; (void)options; + if (!last_child_valid) return -ECHILD; + /* scheme1/prelude.scm:497-506 reads info[8]=si_code (CLD_EXITED=1) and + * info[24]=si_status. siginfo_t is sparsely written — zero the rest so + * the prelude's view is deterministic. */ + if (info) { + u8 *p = info; + for (int i = 0; i < 128; i++) p[i] = 0; + u32 *si_code = (u32 *)(p + 8); + u32 *si_status = (u32 *)(p + 24); + *si_code = 1; /* CLD_EXITED */ + *si_status = (u32)last_child_code; + } + last_child_valid = 0; + return 0; +} + static int g_exit_code = 0; static int g_exited = 0; -static void sys_exit(int code) { +/* Dump every file in the tmpfs to UART, hex-encoded, framed by sentinels + * a host-side extractor can scan for. The chain's verification harness + * (qemu-host wrapper) parses this to recover output ELFs etc. without + * needing virtio-9p — flat tmpfs over UART is enough for boot2's + * file-only IPC. Dump only happens when a "dumpfs" token is present in + * /chosen/bootargs; the hello.c demo runs without it and stays quiet. 
*/ +static int g_dumpfs = 0; + +static void uart_putc_hex(u8 b) { + static const char hex[] = "0123456789abcdef"; + uart_putc(hex[b >> 4]); + uart_putc(hex[b & 0xf]); +} + +static void dump_tmpfs(void) { + uart_puts("\n=== DUMP-BEGIN ===\n"); + for (int i = 0; i < MAX_FILES; i++) { + if (!files[i].used) continue; + uart_puts("=== FILE path="); + uart_puts(files[i].path); + uart_puts(" size="); + uart_putd((i64)files[i].len); + uart_puts(" ===\n"); + for (u64 j = 0; j < files[i].len; j++) uart_putc_hex(files[i].data[j]); + uart_puts("\n"); + } + uart_puts("=== DUMP-END ===\n"); +} + +static void sys_exit_final(int code) { g_exit_code = code; g_exited = 1; + if (g_dumpfs) dump_tmpfs(); uart_puts("\n[seed] user exit_group("); uart_putd(code); uart_puts(")\n"); /* Try PSCI SYSTEM_OFF so QEMU exits cleanly; fall back to spin. */ register u64 x0 asm("x0") = 0x84000008; @@ -491,13 +803,43 @@ static void sys_exit(int code) { for (;;) asm volatile("wfi"); } -/* ─── Trap dispatch (called from start.S vector handlers) ───────────────── */ +/* Dispatcher-side exit_group: pops proc_stack and resumes the parent's + * clone() if there's a saved frame, otherwise falls through to the real + * shutdown path. Returns 1 if the trap frame was rewritten (resume parent), + * 0 if the caller should treat it as a normal trap-return path (which + * will never happen, since sys_exit_final does not return). */ +static int sys_exit_or_resume_parent(struct trapframe *tf, int code) { + code &= 0xff; + if (proc_depth > 0) { + struct proc_save *p = &proc_stack[--proc_depth]; + last_child_pid = p->child_pid; + last_child_code = code; + last_child_valid = 1; + /* Restore memory, brk, fd table. 
*/
+        mem_cpy((void *)USER_POOL_PA, p->mem_snapshot, USER_POOL_SIZE);
+        brk_base = p->brk_base_save;
+        brk_cur = p->brk_cur_save;
+        for (int i = 0; i < MAX_FD; i++) fdtab[i] = p->fdtab_save[i];
+        /* Restore registers (overwriting x[0] with child_pid, since the
+         * dispatcher will write tf->x[0] = (u64)r before eret — we want
+         * the parent's clone() to see child_pid as the syscall return). */
+        for (int i = 0; i < 31; i++) tf->x[i] = p->regs[i];
+        tf->elr = p->elr;
+        tf->spsr = p->spsr;
+        asm volatile("msr sp_el0, %0" :: "r"(p->sp_el0));
+        /* Instruction cache may hold stale lines from the child's image
+         * that we just overwrote with the parent's. Invalidate. */
+        asm volatile("dsb sy" ::: "memory");
+        asm volatile("ic iallu" ::: "memory");
+        asm volatile("dsb sy" ::: "memory");
+        asm volatile("isb");
+        return (int)p->child_pid; /* >0: tells dispatcher to write this as r */
+    }
+    sys_exit_final(code);
+    return 0; /* unreachable */
+}
-struct trapframe {
-    u64 x[31];
-    u64 elr;
-    u64 spsr;
-};
+/* ─── Trap dispatch (called from start.S vector handlers) ───────────────── */
 i64 trap_sync(u64 esr, struct trapframe *tf);
 void trap_kernel(u64 esr, struct trapframe *tf);
@@ -518,7 +860,19 @@ i64 trap_sync(u64 esr, struct trapframe *tf) {
         case SYS_lseek: r = sys_lseek((int)a0, (i64)a1, (int)a2); break;
         case SYS_brk: r = sys_brk(a0); break;
         case SYS_unlinkat: r = sys_unlinkat((int)a0, (const char *)a1, (int)a2); break;
-        case SYS_exit_group: sys_exit((int)a0); r = 0; break;
+        case SYS_clone: r = sys_clone(tf, a0, a1, a2, a3, a4); break;
+        case SYS_execve: r = sys_execve(tf, (const char *)a0, (char **)a1, (char **)a2); break;
+        case SYS_waitid: r = sys_waitid(tf, (int)a0, a1, (void *)a2, (int)a3); break;
+        case SYS_exit_group:
+            r = sys_exit_or_resume_parent(tf, (int)a0);
+            /* If we resumed the parent, sys_exit_or_resume_parent has
+             * rewritten tf->x[0..30] and tf->elr — overriding tf->x[0]
+             * below would corrupt the parent's register state. */
+            if (proc_depth >= 0 && r != 0) {
+                tf->x[0] = (u64)r;
+                return 0;
+            }
+            break;
         default:
             uart_puts("[seed] ENOSYS "); uart_putd((i64)nr); uart_puts("\n");
             r = -38; /* ENOSYS */
@@ -536,8 +890,10 @@ i64 trap_sync(u64 esr, struct trapframe *tf) {
 }
 void trap_kernel(u64 esr, struct trapframe *tf) {
+    u64 far; asm volatile("mrs %0, far_el1" : "=r"(far));
     uart_puts("[seed] PANIC: kernel sync, ESR="); uart_putx(esr);
     uart_puts(" ELR="); uart_putx(tf->elr);
+    uart_puts(" FAR="); uart_putx(far);
     uart_puts("\n");
     for (;;) asm volatile("wfe");
 }
@@ -553,23 +909,48 @@ void trap_unhandled(u64 esr, struct trapframe *tf) {
 extern void eret_to_user(u64 entry, u64 sp);
-static u64 build_user_stack(u64 stack_top, const char *argv0) {
-    /* Place argv0 string at top, then argc/argv/envp below it.
-     *
-     * SysV layout from low to high at sp:
-     * argc, argv[0], NULL, NULL (envp term)
-     */
-    int n = str_n(argv0) + 1;
-    char *str = (char *)(stack_top - 32);
-    for (int i = 0; i < n; i++) str[i] = argv0[i];
-
-    u64 sp = (u64)str - 64;
+/* Tokenise `src` in place (whitespace separators) into argv slots.
+ * Writes pointers into argv[0..argc-1] and returns argc. Stops at cap. */
+static int tokenise(char *src, char **argv, int cap) {
+    int argc = 0;
+    char *p = src;
+    while (*p && argc < cap) {
+        while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r') p++;
+        if (!*p) break;
+        argv[argc++] = p;
+        while (*p && *p != ' ' && *p != '\t' && *p != '\n' && *p != '\r') p++;
+        if (*p) *p++ = 0;
+    }
+    return argc;
+}
+
+static u64 build_user_stack(u64 stack_top, int argc, char **argv) {
+    /* SysV layout, low to high at the returned sp:
+     * argc, argv[0..argc-1], NULL (argv term), NULL (envp term).
+     * Strings live above the vectors, in a string pool placed just below
+     * stack_top so the user image's high-water mark is stable. */
+    if (argc < 1) argc = 1;
+    if (argc > MAX_ARGV) argc = MAX_ARGV;
+
+    /* Lay strings down from stack_top - 16 (16-byte alignment slack). */
+    u64 strs_top = stack_top - 16;
+    u64 strs[MAX_ARGV];
+    char *cursor = (char *)strs_top;
+    for (int i = argc - 1; i >= 0; i--) {
+        int n = str_n(argv[i]) + 1;
+        cursor -= n;
+        for (int j = 0; j < n; j++) cursor[j] = argv[i][j];
+        strs[i] = (u64)cursor;
+    }
+
+    /* sp must hold: argc + (argc+1)*8 (argv + NULL) + 8 (envp NULL) */
+    u64 sp = (u64)cursor - (u64)((argc + 3) * 8);
     sp &= ~15UL;
     u64 *p = (u64 *)sp;
-    p[0] = 1; /* argc */
-    p[1] = (u64)str; /* argv[0] */
-    p[2] = 0; /* argv terminator */
-    p[3] = 0; /* envp terminator */
+    p[0] = (u64)argc;
+    for (int i = 0; i < argc; i++) p[1 + i] = strs[i];
+    p[1 + argc] = 0; /* argv terminator */
+    p[2 + argc] = 0; /* envp terminator */
     return sp;
 }
@@ -614,13 +995,68 @@ void kmain(u64 dtb_phys) {
     if (!entry) { uart_puts("[seed] load_elf failed\n"); for(;;) asm volatile("wfe"); }
     uart_puts("[seed] /init e_entry="); uart_putx(entry); uart_puts("\n");
-    /* User stack at top of a reserved high region. brk above that. */
-    u64 ustack_top = 0x46000000UL;
-    brk_base = 0x46000000UL;
+    /* parse_cpio + load_elf are done — original initrd memory is dead.
+     * Bump kheap_end to reclaim it for tmpfs file growth via sys_write. */
+    kheap_end = (u8 *)0x4b000000UL;
+
+    /* User runs in the L2-mapped low-VA window (USER_VA_LO..USER_VA_HI,
+     * physically backed by USER_POOL_PA). Stack grows down from the top
+     * of the window; brk grows up from above the loaded image's
+     * end-of-bss (g_user_image_end, set by load_elf). 16 MB reserved at
+     * the top for the user stack. */
+    u64 ustack_top = USER_VA_HI;
+    brk_base = g_user_image_end ? g_user_image_end : USER_VA_LO;
     brk_cur = brk_base;
-    brk_max = 0x4a000000UL;
+    brk_max = USER_VA_HI - 0x01000000UL;
+
+    /* Build argv. Priority:
+     * 1. DTB /chosen/bootargs (whitespace-tokenised — qemu -append "...").
+     * 2. /init.argv from the initramfs (one arg per line).
+     * 3. Fallback: argc=1, argv[0]="init".
+     * In all three cases, argv passed to user is exactly what the source
+     * provided — no implicit argv[0]="init" prefix.
+     *
+     * The seed kernel reserves one bootargs token: "dumpfs". When present,
+     * it is stripped from argv and triggers a hex-encoded dump of the
+     * full tmpfs over UART on exit (sentinel-framed for host extraction). */
+    static char argv_pool[512];
+    char *uargv[MAX_ARGV];
+    int uargc = 0;
+
+    if (dt.bootargs[0]) {
+        int n = 0;
+        while (dt.bootargs[n] && n < (int)sizeof(argv_pool) - 1) {
+            argv_pool[n] = dt.bootargs[n]; n++;
+        }
+        argv_pool[n] = 0;
+        char *raw[MAX_ARGV];
+        int rawc = tokenise(argv_pool, raw, MAX_ARGV);
+        for (int i = 0; i < rawc; i++) {
+            if (str_eq(raw[i], "dumpfs")) { g_dumpfs = 1; continue; }
+            uargv[uargc++] = raw[i];
+        }
+    }
+    if (uargc == 0) {
+        int aidx = find_file("init.argv");
+        if (aidx >= 0) {
+            u64 n = files[aidx].len;
+            if (n >= sizeof(argv_pool)) n = sizeof(argv_pool) - 1;
+            for (u64 i = 0; i < n; i++) argv_pool[i] = (char)files[aidx].data[i];
+            argv_pool[n] = 0;
+            uargc = tokenise(argv_pool, uargv, MAX_ARGV);
+        }
+    }
+    if (uargc == 0) {
+        argv_pool[0] = 'i'; argv_pool[1] = 'n'; argv_pool[2] = 'i';
+        argv_pool[3] = 't'; argv_pool[4] = 0;
+        uargv[0] = argv_pool;
+        uargc = 1;
+    }
+    uart_puts("[seed] argv:");
+    for (int i = 0; i < uargc; i++) { uart_puts(" "); uart_puts(uargv[i]); }
+    uart_puts("\n");
-    u64 user_sp = build_user_stack(ustack_top, "init");
+    u64 user_sp = build_user_stack(ustack_top, uargc, uargv);
     uart_puts("[seed] eret to user, sp="); uart_putx(user_sp); uart_puts("\n");
     eret_to_user(entry, user_sp);
diff --git a/seed-kernel/run.sh b/seed-kernel/run.sh
@@ -15,7 +15,7 @@ INITRD=build/initramfs.cpio
 exec qemu-system-aarch64 \
     -machine virt \
     -cpu cortex-a72 \
-    -m 512M \
+    -m 2048M \
     -nographic \
     -no-reboot \
     -kernel "$KERNEL" \
diff --git a/seed-kernel/scripts/extract-dump.sh b/seed-kernel/scripts/extract-dump.sh
@@ -0,0 +1,56 @@
+#!/bin/sh
+# Extract files from a seed-kernel UART transcript that was produced with
+# the "dumpfs" bootargs token. Reads transcript from stdin (or $1), writes
+# each dumped file to <outdir>/<path>. Header format emitted by kernel.c
+# dump_tmpfs():
+#
+# === DUMP-BEGIN ===
+# === FILE path=<name> size=<N> ===
+# <2*N hex chars><LF>
+# ... repeat ...
+# === DUMP-END ===
+#
+# Anything before DUMP-BEGIN or after DUMP-END is ignored.
+#
+# Usage: extract-dump.sh <outdir> [transcript]
+
+set -eu
+
+[ $# -ge 1 ] || { echo "usage: $0 <outdir> [transcript]"; exit 2; }
+
+outdir=$1
+shift
+mkdir -p "$outdir"
+
+if [ $# -ge 1 ]; then
+    src=$(cat "$1")
+else
+    src=$(cat)
+fi
+
+# Strip CRs that QEMU's nographic UART likes to emit.
+src=$(printf '%s' "$src" | tr -d '\r')
+
+awk -v outdir="$outdir" '
+/^=== DUMP-BEGIN ===$/ { in_dump = 1; next }
+/^=== DUMP-END ===$/ { in_dump = 0; next }
+in_dump && /^=== FILE path=/ {
+    sub(/^=== FILE path=/, "")
+    sub(/ ===$/, "")
+    n = split($0, kv, " size=")
+    path = kv[1]
+    size = kv[2]+0
+    out = outdir "/" path
+    # Make any parent dirs (tmpfs is flat but be safe).
+    cmd = "mkdir -p \"$(dirname \"" out "\")\""; system(cmd); close(cmd)
+    next_is_hex = 1
+    print "extract: " path " (" size " bytes) -> " out > "/dev/stderr"
+    # Capture next non-empty line as hex payload, decode via xxd.
+    getline hex
+    decode_cmd = "printf %s \"" hex "\" | xxd -r -p > \"" out "\""
+    system(decode_cmd); close(decode_cmd)
+    next
+}
+' <<EOF
+$src
+EOF
diff --git a/seed-kernel/scripts/fixtures/tier2-driver.scm b/seed-kernel/scripts/fixtures/tier2-driver.scm
@@ -0,0 +1,5 @@
+;; Tier 2 acceptance driver: spawn the chain stage, wait, return status.
+(let ((r (run "child-prog" "out.txt" "in1" "in2")))
+  (if (and (car r) (= 0 (cdr r)))
+      (sys-exit 0)
+      (sys-exit 1)))
diff --git a/seed-kernel/scripts/fixtures/tier2-tcc-driver.scm b/seed-kernel/scripts/fixtures/tier2-tcc-driver.scm
@@ -0,0 +1,8 @@
+;; Tier 2 acceptance driver (canonical form): scheme1 spawns tcc0 (the
+;; cc.scm-built bootstrap C compiler) to compile a .c source into a
+;; relocatable object, waits for the child, and exits with the child's
+;; status. The OS-TODO item 11 fixture.
+(let ((r (run "child-prog" "-nostdlib" "-c" "-o" "out.o" "input.c")))
+  (if (and (car r) (= 0 (cdr r)))
+      (sys-exit 0)
+      (sys-exit 1)))
diff --git a/seed-kernel/scripts/tier1-gate.sh b/seed-kernel/scripts/tier1-gate.sh
@@ -0,0 +1,91 @@
+#!/bin/sh
+# tier1-gate.sh — run a single boot2-chain stage binary on the seed
+# kernel and extract its output files. The chain treats stages as
+# pure file→file transformations (catm-style), so a Tier 1 acceptance
+# run is one qemu boot per stage.
+#
+# Usage:
+# tier1-gate.sh <stage-binary> <output-dir> -- <argv...> -- <input-files...>
+#
+# Builds an initramfs containing the stage binary as /init plus every
+# file from <input-files...> at its basename, runs qemu, and extracts
+# every file in the post-run tmpfs into <output-dir>/. The driver
+# passes <argv...> verbatim through /chosen/bootargs (with a "dumpfs"
+# token appended to trigger the UART dump on exit).
+#
+# Example: run boot0 catm to concatenate a + b into out.
+# tier1-gate.sh build/aarch64/boot0/catm /tmp/out \
+# -- init out a b -- /tmp/a /tmp/b
+
+set -eu
+
+if [ $# -lt 3 ]; then
+    echo "usage: $0 <stage-binary> <output-dir> -- <argv...> -- <input-files...>" >&2
+    exit 2
+fi
+
+STAGE_BIN=$1; shift
+OUTDIR=$1; shift
+[ "$1" = "--" ] || { echo "expected -- before argv" >&2; exit 2; }
+shift
+
+# Collect argv until next --
+ARGV=""
+while [ $# -gt 0 ] && [ "$1" != "--" ]; do
+    if [ -z "$ARGV" ]; then ARGV=$1; else ARGV="$ARGV $1"; fi
+    shift
+done
+[ "${1:-}" = "--" ] || { echo "expected -- before input-files" >&2; exit 2; }
+shift
+
+HERE=$(cd "$(dirname "$0")" && pwd)
+SEED_DIR=$(cd "$HERE/.." && pwd)
+KERNEL=$SEED_DIR/build/Image
+EXTRACT=$HERE/extract-dump.sh
+
+[ -f "$KERNEL" ] || { echo "missing $KERNEL — run 'make' in $SEED_DIR first" >&2; exit 1; }
+[ -x "$EXTRACT" ] || { echo "missing $EXTRACT" >&2; exit 1; }
+
+mkdir -p "$OUTDIR"
+
+# Stage initramfs.
+STAGE=$(mktemp -d -t tier1-stage.XXXXXX)
+trap 'rm -rf "$STAGE"' EXIT
+
+cp "$STAGE_BIN" "$STAGE/init"
+chmod +x "$STAGE/init"
+NAMES="init"
+for inp in "$@"; do
+    base=$(basename "$inp")
+    cp "$inp" "$STAGE/$base"
+    NAMES="$NAMES
+$base"
+done
+INITRAMFS=$STAGE/initramfs.cpio
+( cd "$STAGE" && printf '%s\n' "$NAMES" | cpio -o -H newc 2>/dev/null > initramfs.cpio )
+
+# Run qemu, capture transcript, extract.
+TRANSCRIPT=$STAGE/transcript.txt
+echo "[gate] running stage with argv: $ARGV dumpfs" >&2
+qemu-system-aarch64 \
+    -machine virt -cpu cortex-a72 -m 2048M \
+    -nographic -no-reboot \
+    -kernel "$KERNEL" -initrd "$INITRAMFS" \
+    -append "$ARGV dumpfs" \
+    > "$TRANSCRIPT" 2>&1 &
+QPID=$!
+# Bound the run; the seed kernel ends with PSCI SYSTEM_OFF on exit,
+# but on a hang we still need to come back.
+( sleep 120; kill -9 $QPID 2>/dev/null ) &
+WATCHER=$!
+wait $QPID 2>/dev/null || true
+kill $WATCHER 2>/dev/null || true
+
+if ! grep -q '=== DUMP-END ===' "$TRANSCRIPT"; then
+    echo "[gate] FAIL: no DUMP-END in transcript" >&2
+    tail -40 "$TRANSCRIPT" >&2
+    exit 3
+fi
+
+"$EXTRACT" "$OUTDIR" "$TRANSCRIPT"
+echo "[gate] extracted to $OUTDIR" >&2
diff --git a/seed-kernel/scripts/tier2-gate.sh b/seed-kernel/scripts/tier2-gate.sh
@@ -0,0 +1,87 @@
+#!/bin/sh
+# tier2-gate.sh — end-to-end Tier 2 acceptance: scheme1 driver spawns a
+# chain stage as a subprocess (clone + execve + waitid), waits for it,
+# returns its result file. One qemu boot, end-to-end.
+#
+# Usage:
+# tier2-gate.sh <scheme1> <prelude.scm> <driver.scm> \
+# <child-bin> <output-dir> -- <input-files...>
+#
+# Stages: combines prelude.scm + driver.scm into combined.scm via host
+# `cat`, packs an initramfs containing /init=scheme1, /combined.scm,
+# /child-prog=<child-bin>, plus every input file at its basename, then
+# boots qemu with bootargs "init combined.scm dumpfs". driver.scm is
+# expected to use prelude's (run "child-prog" ...) wrapper.
+#
+# After qemu exits, every file in the post-run tmpfs is extracted into
+# <output-dir>/. The driver's exit status is reflected in this script's
+# exit status (0 = scheme1 driver said success).
+
+set -eu
+
+if [ $# -lt 6 ]; then
+    echo "usage: $0 <scheme1> <prelude.scm> <driver.scm> <child-bin> <outdir> -- <inputs...>" >&2
+    exit 2
+fi
+
+SCHEME1=$1; PRELUDE=$2; DRIVER=$3; CHILD=$4; OUTDIR=$5
+shift 5
+[ "$1" = "--" ] || { echo "expected -- before input files" >&2; exit 2; }
+shift
+
+HERE=$(cd "$(dirname "$0")" && pwd)
+SEED_DIR=$(cd "$HERE/.." && pwd)
+KERNEL=$SEED_DIR/build/Image
+EXTRACT=$HERE/extract-dump.sh
+[ -f "$KERNEL" ] || { echo "missing $KERNEL — run 'make' in $SEED_DIR first" >&2; exit 1; }
+
+mkdir -p "$OUTDIR"
+STAGE=$(mktemp -d -t tier2-stage.XXXXXX)
+trap 'rm -rf "$STAGE"' EXIT
+
+cp "$SCHEME1" "$STAGE/init"; chmod +x "$STAGE/init"
+cp "$CHILD" "$STAGE/child-prog"; chmod +x "$STAGE/child-prog"
+cat "$PRELUDE" "$DRIVER" > "$STAGE/combined.scm"
+NAMES="init
+child-prog
+combined.scm"
+for inp in "$@"; do
+    base=$(basename "$inp")
+    cp "$inp" "$STAGE/$base"
+    NAMES="$NAMES
+$base"
+done
+( cd "$STAGE" && printf '%s\n' "$NAMES" | cpio -o -H newc 2>/dev/null > initramfs.cpio )
+
+TRANSCRIPT=$STAGE/transcript.txt
+echo "[gate] running scheme1 driver" >&2
+qemu-system-aarch64 \
+    -machine virt -cpu cortex-a72 -m 2048M \
+    -nographic -no-reboot \
+    -kernel "$KERNEL" -initrd "$STAGE/initramfs.cpio" \
+    -append "init combined.scm dumpfs" \
+    > "$TRANSCRIPT" 2>&1 &
+QPID=$!
+( sleep 240; kill -9 $QPID 2>/dev/null ) &
+WATCHER=$!
+wait $QPID 2>/dev/null || true
+kill $WATCHER 2>/dev/null || true
+
+if ! grep -q '=== DUMP-END ===' "$TRANSCRIPT"; then
+    echo "[gate] FAIL: no DUMP-END in transcript" >&2
+    tail -40 "$TRANSCRIPT" >&2
+    exit 3
+fi
+
+# Capture the driver's exit code from the kernel's parting message.
+EXIT_LINE=$(grep -E "user exit_group" "$TRANSCRIPT" | tail -1 || true)
+"$EXTRACT" "$OUTDIR" "$TRANSCRIPT"
+
+case "$EXIT_LINE" in
+    *"exit_group(0)"*)
+        echo "[gate] PASS — driver exit 0; outputs in $OUTDIR" >&2
+        exit 0 ;;
+    *)
+        echo "[gate] FAIL — driver did not exit 0: $EXIT_LINE" >&2
+        exit 4 ;;
+esac
diff --git a/seed-kernel/user/child.c b/seed-kernel/user/child.c
@@ -0,0 +1,51 @@
+/* Tier 2 demo: child program execve'd by forktest. Prints argv and
+ * exits 42 so the parent's waitid can verify si_status round-trips. */
+
+typedef long i64;
+typedef unsigned long u64;
+
+#define SYS_write 64
+#define SYS_exit_group 93
+
+static i64 sysc(u64 nr, u64 a, u64 b, u64 c) {
+    register u64 x8 asm("x8") = nr;
+    register u64 x0 asm("x0") = a;
+    register u64 x1 asm("x1") = b;
+    register u64 x2 asm("x2") = c;
+    asm volatile("svc #0" : "+r"(x0) : "r"(x8), "r"(x1), "r"(x2) : "memory", "cc");
+    return (i64)x0;
+}
+
+static i64 sys_write(int fd, const void *buf, u64 n) { return sysc(SYS_write, (u64)fd, (u64)buf, n); }
+static void sys_exit(int c) { sysc(SYS_exit_group, (u64)c, 0, 0); for(;;); }
+
+void *memset(void *d, int c, u64 n) {
+    unsigned char *dd = d; for (u64 i = 0; i < n; i++) dd[i] = (unsigned char)c; return d;
+}
+
+static u64 strlen_(const char *s) { u64 n = 0; while (s[n]) n++; return n; }
+static void puts_(const char *s) { sys_write(1, s, strlen_(s)); }
+static void put_d(i64 v) {
+    char buf[24]; int i = 0;
+    if (v == 0) { sys_write(1, "0", 1); return; }
+    if (v < 0) { sys_write(1, "-", 1); v = -v; }
+    while (v) { buf[i++] = '0' + (char)(v % 10); v /= 10; }
+    while (i--) sys_write(1, &buf[i], 1);
+}
+
+void _start_c(long argc, char **argv) {
+    puts_("[child] argc="); put_d(argc); puts_("\n");
+    for (long i = 0; i < argc; i++) {
+        puts_("[child] argv["); put_d(i); puts_("] = "); puts_(argv[i]); puts_("\n");
+    }
+    puts_("[child] exiting 42\n");
+    sys_exit(42);
+}
+
+asm(
+    ".globl _start\n"
+    ".type _start, %function\n"
+    "_start:\n"
+    " ldr x0, [sp]\n"
+    " add x1, sp, #8\n"
+    " b _start_c\n");
diff --git a/seed-kernel/user/forktest.c b/seed-kernel/user/forktest.c
@@ -0,0 +1,88 @@
+/* Tier 2 demo: parent does clone() → execve("child") in child →
+ * waitid in parent → reports result. Mirrors the scheme1 prelude's
+ * spawn/run/wait pattern in C. */
+
+typedef long i64;
+typedef unsigned long u64;
+typedef int i32;
+
+#define SYS_write 64
+#define SYS_openat 56
+#define SYS_close 57
+#define SYS_read 63
+#define SYS_lseek 62
+#define SYS_brk 214
+#define SYS_exit_group 93
+#define SYS_clone 220
+#define SYS_execve 221
+#define SYS_waitid 95
+
+static i64 sysc(u64 nr, u64 a, u64 b, u64 c, u64 d, u64 e, u64 f) {
+    register u64 x8 asm("x8") = nr;
+    register u64 x0 asm("x0") = a;
+    register u64 x1 asm("x1") = b;
+    register u64 x2 asm("x2") = c;
+    register u64 x3 asm("x3") = d;
+    register u64 x4 asm("x4") = e;
+    register u64 x5 asm("x5") = f;
+    asm volatile("svc #0"
+                 : "+r"(x0)
+                 : "r"(x8), "r"(x1), "r"(x2), "r"(x3), "r"(x4), "r"(x5)
+                 : "memory", "cc");
+    return (i64)x0;
+}
+
+static i64 sys_write(int fd, const void *buf, u64 n) { return sysc(SYS_write, (u64)fd, (u64)buf, n, 0,0,0); }
+static void sys_exit(int c) { sysc(SYS_exit_group, (u64)c, 0,0,0,0,0); for(;;); }
+static i64 sys_clone(void) { return sysc(SYS_clone, 17/*SIGCHLD*/, 0,0,0,0,0); }
+static i64 sys_execve(const char *p, char **argv) { return sysc(SYS_execve, (u64)p, (u64)argv, 0, 0, 0, 0); }
+static i64 sys_waitid(int id, int pid, void *info, int opts) { return sysc(SYS_waitid, (u64)id, (u64)pid, (u64)info, (u64)opts, 0, 0); }
+
+void *memset(void *d, int c, u64 n) {
+    unsigned char *dd = d; for (u64 i = 0; i < n; i++) dd[i] = (unsigned char)c; return d;
+}
+
+static u64 strlen_(const char *s) { u64 n = 0; while (s[n]) n++; return n; }
+static void puts_(const char *s) { sys_write(1, s, strlen_(s)); }
+static void put_d(i64 v) {
+    char buf[24]; int i = 0;
+    if (v < 0) { sys_write(1, "-", 1); v = -v; }
+    if (v == 0) buf[i++] = '0';
+    while (v) { buf[i++] = '0' + (char)(v % 10); v /= 10; }
+    while (i--) sys_write(1, &buf[i], 1);
+}
+
+void _start_c(long argc, char **argv) {
+    puts_("[forktest] argc="); put_d(argc); puts_(" argv[0]="); puts_(argv[0]); puts_("\n");
+
+    long pid = sys_clone();
+    if (pid == 0) {
+        /* child */
+        puts_("[forktest:child] pre-exec\n");
+        char *cargv[3];
+        cargv[0] = "child";
+        cargv[1] = "from-parent";
+        cargv[2] = 0;
+        sys_execve("child", cargv);
+        puts_("[forktest:child] execve failed\n");
+        sys_exit(127);
+    }
+    /* parent */
+    puts_("[forktest:parent] clone returned pid="); put_d(pid); puts_("\n");
+    unsigned char info[128];
+    memset(info, 0, sizeof info);
+    long w = sys_waitid(/*P_PID*/1, (int)pid, info, /*WEXITED*/4);
+    puts_("[forktest:parent] waitid="); put_d(w);
+    puts_(" si_code="); put_d(*(i32 *)(info + 8));
+    puts_(" si_status="); put_d(*(i32 *)(info + 24));
+    puts_("\n");
+    sys_exit(0);
+}
+
+asm(
+    ".globl _start\n"
+    ".type _start, %function\n"
+    "_start:\n"
+    " ldr x0, [sp]\n"
+    " add x1, sp, #8\n"
+    " b _start_c\n");
diff --git a/seed-kernel/user/hello.c b/seed-kernel/user/hello.c
@@ -61,8 +61,17 @@ static void put_x(u64 v) {
     for (int i = 60; i >= 0; i -= 4) { char c = hex[(v >> i) & 0xf]; sys_write(1, &c, 1); }
 }
-void _start(void) {
+/* aarch64 entry: x0 holds nothing — the SysV stack layout is at sp:
+ * [argc][argv[0]]...[argv[argc-1]][NULL][envp...][NULL]
+ * We read argc/argv off the initial stack pointer in the asm shim
+ * below, then tail-call into _start_c. */
+void _start_c(long argc, char **argv) {
     puts_("hello from user space (EL1t, identity-map MMU)\n");
+    puts_("argc = "); put_d(argc); puts_("\n");
+    for (long i = 0; i < argc; i++) {
+        puts_(" argv["); put_d(i); puts_("] = ");
+        puts_(argv[i]); puts_("\n");
+    }
     /* Exercise brk: ask current break, push it up by 1 MiB, write+read. */
     u64 b0 = (u64)sys_brk(0);
@@ -107,3 +116,15 @@ void _start(void) {
     puts_("[user] all checks passed, exiting 0\n");
     sys_exit(0);
 }
+
+/* Read argc/argv off the initial stack and tail-call into _start_c. The
+ * kernel sets sp_el0 to point at [argc][argv[0]]... before ERETing.
+ * Emitted as a plain global symbol with raw asm — no C-compiler-generated
+ * prologue, since gcc would clobber sp before we read argc. */
+asm(
+    ".globl _start\n"
+    ".type _start, %function\n"
+    "_start:\n"
+    " ldr x0, [sp]\n"
+    " add x1, sp, #8\n"
+    " b _start_c\n");
diff --git a/seed-kernel/user/user.lds b/seed-kernel/user/user.lds
@@ -1,10 +1,12 @@
-/* Link the user binary high enough to be clear of the kernel image
- * (which sits at 0x40080000) and the initrd (placed by QEMU). */
+/* Link at the boot2 chain's default base (hex2pp -B 0x600000). This is
+ * below QEMU virt's RAM (which starts at 0x40000000) — the seed kernel
+ * provides a per-process L2 page table that maps user low VAs to a
+ * reserved physical RAM pool, so VA 0x600000 has real backing. */
 ENTRY(_start)
 SECTIONS {
-    . = 0x42000000;
+    . = 0x00600000;
     .text : { *(.text .text.*) }
     .rodata : ALIGN(8) { *(.rodata .rodata.*) }