boot2

Playing with the bootstrap
git clone https://git.ryansepassi.com/git/boot2.git

commit a661617c6f3c8f260539fe2c8c85a94b5b95e7cf
parent c24ff801990d19566f05e7a00f02a53822889a0d
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Wed, 29 Apr 2026 09:03:40 -0700

docs: libc plan — port mes libc, build libc.a + libtcc1.a with tcc-boot2

LIBC.md is the engineer-facing handoff: Phase A links tcc-boot2 itself
(vendor mes subset + lispcc-syscall.c → libc.P1pp, catm with tcc.P1pp),
Phase B produces the on-disk libc.a and libtcc1.a archives tcc-boot2
needs to link the code it compiles. TCC-TODO.md gets a pointer paragraph
recording why mes libc beats musl at this layer and why both phases are
required for a useful tcc-boot2.

Diffstat:
A docs/LIBC.md     | 372 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
M docs/TCC-TODO.md |  32 +++++
2 files changed, 404 insertions(+), 0 deletions(-)

diff --git a/docs/LIBC.md b/docs/LIBC.md
@@ -0,0 +1,372 @@
+# lispcc libc — implementation plan
+
+Engineer-facing handoff. Goal: a `tcc-boot2` that **runs and produces
+working binaries.** That requires three things, in order:
+
+1. **Phase A** — define every symbol in [LIBC.txt](LIBC.txt) so
+   tcc-boot2 itself links. The output is `libc.P1pp`, catm'd with
+   `tcc.P1pp` to produce the tcc-boot2 ELF.
+2. **Phase B1** — produce a `libc.a` archive on disk at the path
+   tcc-boot2 expects (`$LIBDIR/libc.a`). tcc-boot2 auto-appends `-lc`
+   when linking user code; without this archive, even `hello.c` fails
+   to link.
+3. **Phase B2** — produce a `libtcc1.a` archive on disk at
+   `$LIBDIR/tcc/libtcc1.a`. tcc emits calls to runtime helpers
+   (`__divdi3`, `__floatundidf`, …) in the code it compiles; without
+   this archive, anything using long-long divmod or FP fails to link.
+
+Phases B1/B2 use tcc-boot2 itself as the compiler, the same way
+live-bootstrap uses tcc-mes. They are bootstrap steps, not separate
+projects: until they're done, "tcc-boot2 works" is only true for
+`-version` and similar trivial paths.
+
+Strategy in one sentence: **vendor a curated subset of mes libc as
+source, patch four small things, replace mes's inline-asm syscall
+wrappers with one hand-written file that calls P1pp's labelled
+`sys_*` entry points, then build it two different ways: as P1pp
+(Phase A) and as ELF object files via tcc-boot2 (Phase B1). Phase B2
+compiles upstream tcc's `lib/libtcc1.c` with tcc-boot2.** Rationale
+lives in [TCC-TODO.md §libc strategy](TCC-TODO.md#libc--see-libcmd);
+read it once, then operate from this file.
+
+Anchors: mes source lives at `../mes/lib/`. P1pp syscall block is at
+[P1/P1pp.P1pp:986-1029](../P1/P1pp.P1pp). cc.scm's C linkage is the
+recent commit `6488cca`. Live-bootstrap's reference catm command is
+the long line in
+`../live-bootstrap/steps/tcc-0.9.26/pass1.kaem` (search for
+`unified-libc.c`).
+
+## Prerequisites
+
+- `make scheme1 cc ARCH=aarch64` succeeds (i.e. `build/aarch64/scheme1`
+  and `build/aarch64/cc/cc.scm` exist).
+- `make tcc-boot2 ARCH=aarch64` runs to the linker stage; the unresolved
+  symbols match LIBC.txt. Refresh the list with `scripts/boot-undef.sh`.
+
+## Phase A — link tcc-boot2
+
+### 1. Add `sys_lseek`, `sys_brk`, `sys_unlink` to P1pp
+
+Edit [P1/P1pp.P1pp](../P1/P1pp.P1pp) — append three labelled entries
+next to `:sys_close` (lines 1015-1019), shape mirrors the existing
+`:sys_open` (route `unlink` through `unlinkat(AT_FDCWD, path, 0)`):
+
+```
+:sys_lseek  ; (fd, off, whence) -> off
+:sys_brk    ; (addr) -> new_brk ; addr=0 returns current break
+:sys_unlink ; (path) -> 0 / -errno ; via unlinkat on aarch64/riscv64
+```
+
+Then add the syscall numbers to
+`P1/P1-{aarch64,amd64,riscv64}.M1pp`:
+
+| arch    | lseek | brk | unlink         |
+|---------|------:|----:|----------------|
+| amd64   |     8 |  12 | 87             |
+| aarch64 |    62 | 214 | 35 (unlinkat)  |
+| riscv64 |    62 | 214 | 35 (unlinkat)  |
+
+Acceptance for this step: a hand-written P1pp test that calls each
+of the three (e.g. `tests/p1pp/sys_brk.P1pp`) prints expected values
+under `make test ARCH=aarch64`.
+
+### 2. Vendor mes libc subset to `vendor/mes-libc/`
+
+Mirror mes's directory structure under `vendor/mes-libc/` and copy
+the files listed in the manifest below verbatim. Keep mes's
+copyright headers; add a top-level `LICENSE` (mes is GPLv3+).
+
+**Manifest** (paths are relative to `../mes/lib/`):
+
+```
+ctype/   isalnum.c isalpha.c isascii.c iscntrl.c isdigit.c
+         isgraph.c islower.c isnumber.c isprint.c ispunct.c
+         isspace.c isupper.c isxdigit.c tolower.c toupper.c
+
+string/  memchr.c memcmp.c memcpy.c memmem.c memmove.c memset.c
+         strcat.c strchr.c strcmp.c strcpy.c strcspn.c strdup.c
+         strerror.c strlen.c strncat.c strncmp.c strncpy.c
+         strpbrk.c strrchr.c strspn.c strstr.c strupr.c
+
+stdlib/  abort.c atoi.c atol.c calloc.c exit.c __exit.c free.c
+         qsort.c realloc.c strtof.c strtol.c strtoll.c
+         strtoul.c strtoull.c
+
+stdio/   clearerr.c fclose.c fdopen.c feof.c ferror.c fflush.c
+         fgetc.c fgets.c fileno.c fopen.c fprintf.c fputc.c
+         fputs.c fread.c fseek.c ftell.c fwrite.c getc.c
+         perror.c printf.c putc.c remove.c snprintf.c sprintf.c
+         ungetc.c vfprintf.c vprintf.c vsnprintf.c vsprintf.c
+
+linux/   brk.c close.c lseek.c malloc.c _open3.c _read.c unlink.c
+
+posix/   buffered-read.c execvp.c getcwd.c getenv.c open.c
+         sbrk.c write.c
+
+mes/     abtol.c __assert_fail.c __buffered_read.c cast.c dtoab.c
+         eputc.c eputs.c fdgetc.c fdgets.c fdputc.c fdputs.c
+         fdungetc.c globals.c __init_io.c itoa.c ltoa.c ltoab.c
+         __mes_debug.c mes_open.c ntoab.c oputc.c oputs.c
+         search-path.c ultoa.c utoa.c
+```
+
+Also vendor the headers cc.scm needs to flatten the file list. Copy
+`../mes/include/` → `vendor/mes-libc/include/` (it's already used by
+`stage1-flatten.sh` via `MES_INCLUDE`; reuse the same tree).
+
+### 3. Apply the four surgical patches
+
+Place these as `vendor/mes-libc/patches/*.patch` and apply in
+`scripts/boot-build-libc.sh` the same way `stage1-flatten.sh` applies
+its simple-patches.
+
+1. **`mes/globals.c`** — leave as-is. Sanity-check that it declares
+   `int errno;`, `char **environ;`, and `int __stdin/out/err;` as
+   plain globals. (mes already does this; the patch is empty, listed
+   here so the engineer doesn't accidentally "fix" it to TLS.)
+2. **`linux/malloc.c`** — replace `sizeof (max_align_t)` with the
+   integer literal `16`. cc.scm has no `max_align_t`. The arithmetic
+   is unchanged.
+3. **`string/strstr.c`** — drop `#include <sys/mman.h>`. The function
+   doesn't use mmap; the include is a stray.
+4. **printf-family `ap` shift** — no patch required. The blocks
+   guarded by `#if __GNUC__ && __x86_64__ && !SYSTEM_LIBC` in
+   `stdio/{snprintf,sprintf,vsprintf,fprintf,printf}.c` evaluate to
+   zero under cc.scm (no `__GNUC__`), so they compile out cleanly.
+   Confirm by grep after preprocessing.
+
+### 4. Write `vendor/mes-libc/lispcc-syscall.c`
+
+This is the only file we author. ~80 lines. It replaces every
+`linux/<arch>-mes-mescc/syscall.c` from mes (those rely on inline
+asm). One C wrapper per syscall; each calls a P1pp label by name,
+relying on cc.scm's external-linkage rule (commit `6488cca`).
+
+Sketch:
+
+```c
+extern long sys_read   (long fd, long buf, long n);
+extern long sys_write  (long fd, long buf, long n);
+extern long sys_open   (long path, long flags, long mode);
+extern long sys_close  (long fd);
+extern long sys_lseek  (long fd, long off, long whence);
+extern long sys_brk    (long addr);
+extern long sys_unlink (long path);
+extern long sys_exit   (long code);
+
+extern int errno;
+
+static long set_errno (long r) {
+  if (r < 0) { errno = -r; return -1; }
+  errno = 0; return r;
+}
+
+ssize_t read (int fd, void *buf, size_t n) {
+  return set_errno (sys_read (fd, (long) buf, n));
+}
+ssize_t write (int fd, void const *buf, size_t n) {
+  return set_errno (sys_write (fd, (long) buf, n));
+}
+int close (int fd) { return (int) set_errno (sys_close (fd)); }
+off_t lseek (int fd, off_t off, int w) {
+  return set_errno (sys_lseek (fd, off, w));
+}
+long brk (void *p) { return set_errno (sys_brk ((long) p)); }
+int unlink(char const *p) {
+  return (int) set_errno (sys_unlink ((long) p));
+}
+void _exit (int c) { sys_exit (c); }
+
+/* execve gets a similar wrapper; see mes/lib/linux/execve.c for the
+ * argv/envp marshalling. */
+```
+
+This file replaces these mes files (do not vendor them):
+`linux/<arch>-mes-mescc/syscall.c`, `linux/<arch>-mes-mescc/_exit.c`,
+`linux/<arch>-mes-mescc/_write.c`, `linux/<arch>-mes-gcc/*.c`,
+`linux/<arch>-mes-mescc/syscall-internal.c`.
+
+Also: drop mes's `linux/_read.c` (it dispatches to the inline-asm
+wrapper); our `read` above replaces it. Keep
+`posix/buffered-read.c` — it consumes our `read` via the
+`__buffered_read` indirection.
+
+### 5. Write `scripts/boot-build-libc.sh`
+
+Mirror `boot-build-cc.sh`'s shape. Pseudocode:
+
+```sh
+ROOT=...; ARCH=...
+LIBC_FLAT=build/cc-bootstrap/$ARCH/libc.flat.c
+LIBC_P1PP=build/$ARCH/libc.P1pp
+
+# (a) preprocess + concat (host cc -E -nostdinc, like stage1-flatten)
+host_cc -E -nostdinc \
+    -I vendor/mes-libc/include \
+    -I vendor/mes-libc/include/linux/$MES_ARCH \
+    -D HAVE_CONFIG_H=1 \
+    vendor/mes-libc/unified-libc.c \
+    > "$LIBC_FLAT"
+
+# (b) compile with cc.scm in container
+podman run ... build/$ARCH/scheme1 build/$ARCH/cc/cc.scm \
+    "$LIBC_FLAT" "$LIBC_P1PP"
+```
+
+Where `vendor/mes-libc/unified-libc.c` is a hand-written file that
+just `#include`s every .c in the manifest order (live-bootstrap
+catms; we use `#include` so the host preprocessor handles dedup of
+mes's per-file `#include <mes/lib.h>` etc.). The `#include "*.c"`
+pattern is the same one used by `tcc.flat.c`'s `#include "libtcc.c"`
+upstream.
+
+Wire into the Makefile so `make tcc-boot2 ARCH=$A` runs
+`boot-build-libc.sh` before the link step. The link step gains one
+line:
+
+```
+cat tcc.P1pp libc.P1pp > tcc-boot2.P1pp
+```
+
+(libc *after* tcc — its .bss must follow tcc's data without crossing
+it).
+
+### 6. Phase A smoke tests
+
+- `tests/cc/200-libc-hello.c` — a hand-written `main()` that calls
+  `printf("hi\n")` then `exit(0)`. Compile with cc.scm, link against
+  libc.P1pp, run in the container, check stdout = `hi\n` and exit
+  status 0.
+- `tests/cc/201-libc-malloc.c` — round-trips malloc/free/realloc.
+- `tests/cc/202-libc-stdio.c` — fopen/fwrite/fclose, then re-read
+  and compare bytes.
+
+Phase A acceptance: `make tcc-boot2 ARCH=aarch64` links to a runnable
+ELF, and `tcc-boot2 -version` prints the version string under the
+per-arch container.
+
+## Phase B — build the on-disk archives tcc-boot2 needs
+
+tcc-boot2 produces ELF binaries via its own codegen (X86_64,
+aarch64, riscv64). When it links a user program it auto-appends
+`-lc` and resolves `__divdi3` / `__floatundidf` / etc. against
+`$LIBDIR/tcc/libtcc1.a`. Both archives have to exist on disk before
+tcc-boot2 is useful as a compiler. Phase A's `libc.P1pp` doesn't
+help here — that one is linked into tcc-boot2 itself in P1pp form.
+The archives are tcc-boot2's *output* world.
+
+Build them with tcc-boot2 itself, mirroring live-bootstrap's
+`pass1.kaem` (search for `unified-libc.o` and `libtcc1.o`).
+
+### B1. libc.a from the same vendored sources
+
+Reuse `vendor/mes-libc/unified-libc.c` from Phase A. Compile with
+tcc-boot2 (per arch). Add `scripts/boot-build-libc-archive.sh`:
+
+```sh
+TCC_BOOT2=build/$ARCH/tcc-boot2
+$TCC_BOOT2 -c -D HAVE_CONFIG_H=1 \
+    -I vendor/mes-libc/include \
+    -I vendor/mes-libc/include/linux/$MES_ARCH \
+    -o build/$ARCH/libc.o \
+    vendor/mes-libc/unified-libc.c
+$TCC_BOOT2 -ar cr build/$ARCH/libc.a build/$ARCH/libc.o
+```
+
+Install at `$LIBDIR/libc.a` where `$LIBDIR` is whatever
+`CONFIG_TCC_CRTPREFIX` was baked into tcc-boot2 (default
+`build/$ARCH/sysroot/lib`; align with the `-D CONFIG_TCC_CRTPREFIX`
+in the Makefile). Also produce `crt1.o` from
+`vendor/mes-libc/linux/$MES_ARCH-mes-gcc/crt1.c` if the link needs
+one — for static binaries with our hand-written `_start` it can be
+skipped; check by linking the smoke test below.
+
+The chicken-and-egg concern is moot: tcc-boot2's codegen for
+P1-64 targets does not emit `__divdi3`-class calls when compiling
+mes libc (long-long is native register width on X86_64 / aarch64 /
+riscv64, and `HAVE_FLOAT` paths in mes libc are dead under
+`HAVE_CONFIG_H=1` with our config). So building libc.a needs no
+prior libtcc1.a.
+
+### B2. libtcc1.a from upstream tcc
+
+The file is already vendored implicitly via `stage1-flatten.sh`
+(it's `tcc-0.9.26-1147-gee75a10c/lib/libtcc1.c` inside the tarball).
+Add `scripts/boot-build-libtcc1.sh`:
+
+```sh
+TCC_BOOT2=build/$ARCH/tcc-boot2
+TCC_SRC=build/cc-bootstrap/$ARCH/tcc-0.9.26-1147-gee75a10c
+$TCC_BOOT2 -c -D HAVE_CONFIG_H=1 -D HAVE_LONG_LONG=1 -D HAVE_FLOAT=1 \
+    -I vendor/mes-libc/include \
+    -I vendor/mes-libc/include/linux/$MES_ARCH \
+    -o build/$ARCH/libtcc1.o \
+    $TCC_SRC/lib/libtcc1.c
+# riscv64 also pulls in lib-arm64.c per upstream:
+if [ "$ARCH" = riscv64 ]; then
+    $TCC_BOOT2 -c ... -o build/$ARCH/lib-arm64.o $TCC_SRC/lib/lib-arm64.c
+    EXTRA=build/$ARCH/lib-arm64.o
+fi
+$TCC_BOOT2 -ar cr build/$ARCH/libtcc1.a build/$ARCH/libtcc1.o $EXTRA
+```
+
+Install at `$LIBDIR/tcc/libtcc1.a` (matches tcc.flat.c line 11234
+`tcc_add_support(s1, "libtcc1.a")` against `tcc_lib_path`).
+
+### B3. Wire into the Makefile
+
+```
+make tcc-boot2    ARCH=aarch64   # phase A: links tcc-boot2 itself
+make tcc-archives ARCH=aarch64   # phase B: produces libc.a + libtcc1.a
+make tcc-smoke    ARCH=aarch64   # phase B acceptance, see below
+```
+
+`tcc-archives` depends on `tcc-boot2`. `tcc-smoke` depends on
+`tcc-archives`.
+
+### B4. Phase B smoke tests
+
+- `tests/tcc/300-hello.c` — `printf("hi\n")`. Compile *with
+  tcc-boot2*, run, check output. Exercises libc.a auto-link.
+- `tests/tcc/301-longlong.c` — does `long long a = ...; a / b;`
+  with values that force a real divmod on 32-bit hosts; on our
+  P1-64 targets this should still link (idiv is native), but the
+  test confirms libtcc1.a is on the path and gets searched.
+- `tests/tcc/302-self-host-fragment.c` — pull a small TU out of
+  tcc.c (e.g. one of the smaller pass1 files) and compile it with
+  tcc-boot2 to confirm tcc-on-tcc works end-to-end.
+
+Phase B acceptance: `make tcc-smoke ARCH=aarch64` passes all three.
+
+### Looking ahead — milestone 5
+
+With Phase B done, tcc-boot2 can compile arbitrary C through its
+own codegen. Milestone 5 (use tcc-boot2 to rebuild tcc.c → tcc-boot0,
+checksum-match against the live-bootstrap reference) becomes a
+matter of feeding tcc.c through tcc-boot2 with the boot0 defines.
+That's tracked in [TCC.md](TCC.md), not here.
+
+## Out of scope
+
+- **Threading, locale, dynamic linker, IEEE-754 math.** tcc-mes
+  defines have `HAVE_FLOAT` / `HAVE_SETJMP` off; the fp paths in mes
+  libc compile but are dead code under those defines.
+- **errno from threads.** `errno` is a single global int. cc.scm has
+  no TLS; tcc-boot2 is single-threaded.
+
+## Notes for the engineer
+
+- Refresh LIBC.txt with `scripts/boot-undef.sh > docs/LIBC.txt` after
+  the link starts working — new externals may surface that the
+  current static analysis missed.
+- If a mes file pulls in a header path we don't have, the right move
+  is almost always to copy the matching `mes/include/` header
+  verbatim — don't write a substitute.
+- cc.scm's debug flag (`--cc-debug`, see TCC-TODO.md "Repro") prints
+  per-phase heap usage. libc.flat.c is small (~30 KB after flatten)
+  so heap should be flat; if it isn't, that's a cc.scm bug, not a
+  libc bug.
+- The existing `vendor/seed/` layout is `<tool>/<arch>/...`. mes-libc
+  is per-arch only via headers; the .c manifest is arch-agnostic.
+  Layout `vendor/mes-libc/{ctype,string,...}/` flat, with
+  `vendor/mes-libc/include/linux/<arch>/` per-arch.
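
For reference, the `301-longlong.c` smoke test described in B4 could look roughly like the sketch below. It is hypothetical: the names `ll_div`, `ll_mod` and `ll_check` and the operand values are assumptions made for illustration, not taken from the repository. The only property the plan asks for is that the test performs 64-bit division and remainder on values that do not fit in 32 bits.

```c
/* Hypothetical shape for tests/tcc/301-longlong.c; all names are assumptions. */

/* 64-bit signed division. On 32-bit targets this is typically lowered
 * to a runtime helper such as __divdi3. */
long long ll_div (long long a, long long b) { return a / b; }

/* 64-bit signed remainder. On 32-bit targets this is typically lowered
 * to a runtime helper such as __moddi3. */
long long ll_mod (long long a, long long b) { return a % b; }

/* Returns 0 when both results match the hand-computed values,
 * otherwise a small non-zero code identifying the failing check. */
int ll_check (void)
{
  long long a = 10000000000LL; /* does not fit in 32 bits */
  long long b = 3;
  if (ll_div (a, b) != 3333333333LL) return 1;
  if (ll_mod (a, b) != 1) return 2;
  return 0;
}
```

A real test file would add a `main` that returns `ll_check ()`, so the harness can judge the result from the exit status alone.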
diff --git a/docs/TCC-TODO.md b/docs/TCC-TODO.md
@@ -299,7 +299,39 @@ likely walls live in the assembly side and at runtime:
   (string-table init, command-line parsing, output for `-version`)
   without tripping on cg semantics that pass the small tests but
   diverge from C in subtle ways.
+- **libc**. The 39 unresolved externals in [LIBC.txt](LIBC.txt) are
+  unmet — there is no libc in the link today. See §libc strategy below.
 
 The end goal is milestone 4 in [CC.md §Validation
 milestones](CC.md) — "Compile tcc.c (under the tcc-mes defines) →
 tcc-lispcc; verify `tcc-lispcc -version` runs."
+
+## libc — see [LIBC.md](LIBC.md)
+
+The unresolved externals in [LIBC.txt](LIBC.txt) are met by porting a
+curated subset of mes libc and adding a thin syscall-wrapper file that
+calls P1pp's `sys_*` labels directly. **musl is rejected** for this
+layer (built around gcc-only idioms — inline asm everywhere, TLS
+`errno`, weak/visibility attrs, `_Atomic`, IEEE math, dynamic linker —
+none of which survive cc.scm's subset).
+
+The work splits cleanly into two phases. **Phase A** is what makes
+tcc-boot2 itself link: cc.scm compiles the vendored mes libc subset
+into `libc.P1pp`, catm'd with `tcc.P1pp` to produce the tcc-boot2
+ELF. **Phase B** is what makes tcc-boot2 *useful* — tcc-boot2
+auto-appends `-lc` and resolves runtime helpers
+(`__divdi3`, `__floatundidf`, …) against `$LIBDIR/tcc/libtcc1.a` when
+linking the code it compiles. Both archives have to exist on disk or
+even hello-world won't link. Phase B uses tcc-boot2 itself as the
+compiler (mirrors live-bootstrap's `pass1.kaem` substituting tcc-boot2
+for tcc-mes); upstream tcc's `lib/libtcc1.c` is the source for the
+libtcc1.a side and is already pulled in by `stage1-flatten.sh`'s
+tarball unpack.
+
+The full implementation handoff (manifest of files to vendor, the four
+surgical patches, the new P1pp entry points, the Phase A / B build
+scripts, smoke tests, acceptance criteria) lives in
+[LIBC.md](LIBC.md).
+
+CC.md needs a follow-up edit: its "we link against the same `libc+tcc`
+archive MesCC uses" line is now stale.
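
The thin syscall-wrapper convention mentioned above (a raw `sys_*` result below zero is `-errno`; the C wrapper stores the positive code in a plain global and returns -1) can be exercised on an ordinary host without P1pp. The sketch below is a stand-in under stated assumptions: `fake_sys_close` is a hypothetical stub replacing the real `sys_close` label, and `my_errno` and `my_close` are renamed so they do not clash with the host libc.

```c
/* Host-side sketch of the -errno convention; all names are hypothetical. */

int my_errno; /* plain global, mirroring the single-threaded errno in the plan */

/* Stub standing in for the P1pp sys_close label: succeeds for fd 3,
 * otherwise reports EBADF (9 on Linux) as a negative return value. */
static long fake_sys_close (long fd) { return fd == 3 ? 0 : -9; }

/* Same shape as set_errno in the plan: a negative result is stored as a
 * positive errno and mapped to -1; success clears errno and passes through. */
static long set_errno (long r)
{
  if (r < 0) { my_errno = (int) -r; return -1; }
  my_errno = 0;
  return r;
}

/* Wrapper in the style of the planned close(): call the stub, translate. */
int my_close (int fd) { return (int) set_errno (fake_sys_close (fd)); }
```

Calling `my_close (3)` takes the success path, and any other descriptor takes the error path, so both branches of `set_errno` can be checked before the real labels exist.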