commit a661617c6f3c8f260539fe2c8c85a94b5b95e7cf
parent c24ff801990d19566f05e7a00f02a53822889a0d
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Wed, 29 Apr 2026 09:03:40 -0700
docs: libc plan — port mes libc, build libc.a + libtcc1.a with tcc-boot2
LIBC.md is the engineer-facing handoff: Phase A links tcc-boot2 itself
(vendor mes subset + lispcc-syscall.c → libc.P1pp, catm with tcc.P1pp),
Phase B produces the on-disk libc.a and libtcc1.a archives tcc-boot2
needs to link the code it compiles. TCC-TODO.md gets a pointer paragraph
recording why mes libc beats musl at this layer and why both phases are
required for a useful tcc-boot2.
Diffstat:
A docs/LIBC.md     | 372 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M docs/TCC-TODO.md |  32 ++++++++++++++++++++++++++++++++
2 files changed, 404 insertions(+), 0 deletions(-)
diff --git a/docs/LIBC.md b/docs/LIBC.md
@@ -0,0 +1,372 @@
+# lispcc libc — implementation plan
+
+Engineer-facing handoff. Goal: a `tcc-boot2` that **runs and produces
+working binaries.** That requires three things, in order:
+
+1. **Phase A** — define every symbol in [LIBC.txt](LIBC.txt) so
+ tcc-boot2 itself links. The output is `libc.P1pp`, catm'd with
+ `tcc.P1pp` to produce the tcc-boot2 ELF.
+2. **Phase B1** — produce a `libc.a` archive on disk at the path
+ tcc-boot2 expects (`$LIBDIR/libc.a`). tcc-boot2 auto-appends `-lc`
+ when linking user code; without this archive, even `hello.c` fails
+ to link.
+3. **Phase B2** — produce a `libtcc1.a` archive on disk at
+ `$LIBDIR/tcc/libtcc1.a`. tcc emits calls to runtime helpers
+ (`__divdi3`, `__floatundidf`, …) in the code it compiles; without
+ this archive, anything using long-long divmod or FP fails to link.
+
+Phases B1/B2 use tcc-boot2 itself as the compiler, the same way
+live-bootstrap uses tcc-mes. They are bootstrap steps, not separate
+projects: until they're done, "tcc-boot2 works" is only true for
+`-version` and similar trivial paths.
+
+Strategy in one sentence: **vendor a curated subset of mes libc as
+source, patch four small things, replace mes's inline-asm syscall
+wrappers with one hand-written file that calls P1pp's labelled
+`sys_*` entry points, then build it two different ways: as P1pp
+(Phase A) and as ELF object files via tcc-boot2 (Phase B1). Phase B2
+compiles upstream tcc's `lib/libtcc1.c` with tcc-boot2.** Rationale
+lives in [TCC-TODO.md §libc strategy](TCC-TODO.md#libc--see-libcmd);
+read it once, then operate from this file.
+
+Anchors: mes source lives at `../mes/lib/`. P1pp syscall block is at
+[P1/P1pp.P1pp:986-1029](../P1/P1pp.P1pp). cc.scm's C linkage landed in
+the recent commit `6488cca`. Live-bootstrap's reference catm command is
+the long line in
+`../live-bootstrap/steps/tcc-0.9.26/pass1.kaem` (search for
+`unified-libc.c`).
+
+## Prerequisites
+
+- `make scheme1 cc ARCH=aarch64` succeeds (i.e. `build/aarch64/scheme1`
+ and `build/aarch64/cc/cc.scm` exist).
+- `make tcc-boot2 ARCH=aarch64` runs to the linker stage; the unresolved
+ symbols match LIBC.txt. Refresh the list with `scripts/boot-undef.sh`.
+
+## Phase A — link tcc-boot2
+
+### 1. Add `sys_lseek`, `sys_brk`, `sys_unlink` to P1pp
+
+Edit [P1/P1pp.P1pp](../P1/P1pp.P1pp): append three labelled entries
+next to `:sys_close` (lines 1015-1019). Their shape mirrors the existing
+`:sys_open`; on aarch64 and riscv64, route `unlink` through
+`unlinkat(AT_FDCWD, path, 0)`:
+
+```
+:sys_lseek ; (fd, off, whence) -> off
+:sys_brk ; (addr) -> new_brk ; addr=0 returns current break
+:sys_unlink ; (path) -> 0 / -errno ; via unlinkat on aarch64/riscv64
+```
+
+Then add the syscall numbers to
+`P1/P1-{aarch64,amd64,riscv64}.M1pp`:
+
+| arch | lseek | brk | unlink |
+|---------|------:|----:|----------------|
+| amd64 | 8 | 12 | 87 |
+| aarch64 | 62 | 214 | 35 (unlinkat) |
+| riscv64 | 62 | 214 | 35 (unlinkat) |
+
+Acceptance for this step: a hand-written P1pp test that calls each
+of the three (e.g. `tests/p1pp/sys_brk.P1pp`) prints expected values
+under `make test ARCH=aarch64`.
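The table above can also be written as a small lookup, which may be handy for cross-checking the `.M1pp` edits or for driving the acceptance test. This is only a sketch: the function name `sysno` is hypothetical and does not exist in the repository; the numbers are the ones from the table.

```sh
# Hypothetical helper: print the Linux syscall number for ARCH and NAME.
# Values are copied from the table above. On aarch64 and riscv64 the
# "unlink" entry is really unlinkat, since those ABIs have no plain unlink.
sysno() {
    case "$1:$2" in
        amd64:lseek)                   echo 8 ;;
        amd64:brk)                     echo 12 ;;
        amd64:unlink)                  echo 87 ;;
        aarch64:lseek|riscv64:lseek)   echo 62 ;;
        aarch64:brk|riscv64:brk)       echo 214 ;;
        aarch64:unlink|riscv64:unlink) echo 35 ;;
        *) echo "sysno: unknown $1/$2" >&2; return 1 ;;
    esac
}
```

For example, `sysno aarch64 brk` prints the value to place in `P1/P1-aarch64.M1pp` for `brk`.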
+
+### 2. Vendor mes libc subset to `vendor/mes-libc/`
+
+Mirror mes's directory structure under `vendor/mes-libc/` and copy
+the files listed in the manifest below verbatim. Keep mes's
+copyright headers; add a top-level `LICENSE` (mes is GPLv3+).
+
+**Manifest** (paths are relative to `../mes/lib/`):
+
+```
+ctype/ isalnum.c isalpha.c isascii.c iscntrl.c isdigit.c
+ isgraph.c islower.c isnumber.c isprint.c ispunct.c
+ isspace.c isupper.c isxdigit.c tolower.c toupper.c
+
+string/ memchr.c memcmp.c memcpy.c memmem.c memmove.c memset.c
+ strcat.c strchr.c strcmp.c strcpy.c strcspn.c strdup.c
+ strerror.c strlen.c strncat.c strncmp.c strncpy.c
+ strpbrk.c strrchr.c strspn.c strstr.c strupr.c
+
+stdlib/ abort.c atoi.c atol.c calloc.c exit.c __exit.c free.c
+ qsort.c realloc.c strtof.c strtol.c strtoll.c
+ strtoul.c strtoull.c
+
+stdio/ clearerr.c fclose.c fdopen.c feof.c ferror.c fflush.c
+ fgetc.c fgets.c fileno.c fopen.c fprintf.c fputc.c
+ fputs.c fread.c fseek.c ftell.c fwrite.c getc.c
+ perror.c printf.c putc.c remove.c snprintf.c sprintf.c
+ ungetc.c vfprintf.c vprintf.c vsnprintf.c vsprintf.c
+
+linux/ brk.c close.c lseek.c malloc.c _open3.c unlink.c
+
+posix/ buffered-read.c execvp.c getcwd.c getenv.c open.c
+ sbrk.c write.c
+
+mes/ abtol.c __assert_fail.c __buffered_read.c cast.c dtoab.c
+ eputc.c eputs.c fdgetc.c fdgets.c fdputc.c fdputs.c
+ fdungetc.c globals.c __init_io.c itoa.c ltoa.c ltoab.c
+ __mes_debug.c mes_open.c ntoab.c oputc.c oputs.c
+ search-path.c ultoa.c utoa.c
+```
+
+Also vendor the headers needed to flatten the file list for cc.scm. Copy
+`../mes/include/` → `vendor/mes-libc/include/` (it's already used by
+`stage1-flatten.sh` via `MES_INCLUDE`; reuse the same tree).
+
+### 3. Apply the four surgical patches
+
+Place these as `vendor/mes-libc/patches/*.patch` and apply in
+`scripts/boot-build-libc.sh` the same way `stage1-flatten.sh` applies
+its simple-patches.
+
+1. **`mes/globals.c`** — leave as-is. Sanity-check that it declares
+ `int errno;`, `char **environ;`, and `int __stdin/out/err;` as
+ plain globals. (mes already does this; the patch is empty, listed
+ here so the engineer doesn't accidentally "fix" it to TLS.)
+2. **`linux/malloc.c`** — replace `sizeof (max_align_t)` with the
+ integer literal `16`. cc.scm has no `max_align_t`. The arithmetic
+ is unchanged.
+3. **`string/strstr.c`** — drop `#include <sys/mman.h>`. The function
+ doesn't use mmap; the include is a stray.
+4. **printf-family `ap` shift** — no patch required. The blocks
+ guarded by `#if __GNUC__ && __x86_64__ && !SYSTEM_LIBC` in
+ `stdio/{snprintf,sprintf,vsprintf,fprintf,printf}.c` evaluate to
+ zero under cc.scm (no `__GNUC__`), so they compile out cleanly.
+ Confirm by grep after preprocessing.
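Patches 2 and 3 are mechanical enough to express as stream edits, which can help when checking that a `.patch` file does what the list says. The sketch below uses `sed` as a stand-in for illustration; it is not the `.patch` mechanism the plan calls for, and both function names are hypothetical.

```sh
# Hypothetical: patch 2, applied to the text of linux/malloc.c on stdin.
# Replaces sizeof (max_align_t) with the integer literal 16.
patch_malloc() {
    sed 's/sizeof *( *max_align_t *)/16/g'
}

# Hypothetical: patch 3, applied to the text of string/strstr.c on stdin.
# Deletes the stray #include <sys/mman.h> line and nothing else.
patch_strstr() {
    sed '/^[[:space:]]*#[[:space:]]*include[[:space:]]*<sys\/mman\.h>/d'
}
```

Running `patch_malloc < vendor/mes-libc/linux/malloc.c | grep max_align_t` should print nothing if the substitution covers every use.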
+
+### 4. Write `vendor/mes-libc/lispcc-syscall.c`
+
+This is the only file we author. ~80 lines. It replaces every
+`linux/<arch>-mes-mescc/syscall.c` from mes (those rely on inline
+asm). One C wrapper per syscall; each calls a P1pp label by name,
+relying on cc.scm's external-linkage rule (commit `6488cca`).
+
+Sketch:
+
+```c
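+#include <sys/types.h> /* ssize_t, size_t, off_t; assumes the vendored mes header */
+
+/* P1pp entry points, resolved by label name at link time. */
+extern long sys_read (long fd, long buf, long n);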
+extern long sys_write (long fd, long buf, long n);
+extern long sys_open (long path, long flags, long mode);
+extern long sys_close (long fd);
+extern long sys_lseek (long fd, long off, long whence);
+extern long sys_brk (long addr);
+extern long sys_unlink (long path);
+extern long sys_exit (long code);
+
+extern int errno;
+
+static long set_errno (long r) {
+ if (r < 0) { errno = -r; return -1; }
+ errno = 0; return r;
+}
+
+ssize_t read (int fd, void *buf, size_t n) {
+ return set_errno (sys_read (fd, (long) buf, n));
+}
+ssize_t write (int fd, void const *buf, size_t n) {
+ return set_errno (sys_write (fd, (long) buf, n));
+}
+int close (int fd) { return (int) set_errno (sys_close (fd)); }
+off_t lseek (int fd, off_t off, int w) {
+ return set_errno (sys_lseek (fd, off, w));
+}
+long brk (void *p) { return set_errno (sys_brk ((long) p)); }
+int unlink (char const *p) {
+ return (int) set_errno (sys_unlink ((long) p));
+}
+void _exit (int c) { sys_exit (c); }
+
+/* execve gets a similar wrapper; see mes/lib/linux/execve.c for the
+ * argv/envp marshalling. */
+```
+
+This file replaces these mes files (do not vendor them):
+`linux/<arch>-mes-mescc/syscall.c`, `linux/<arch>-mes-mescc/_exit.c`,
+`linux/<arch>-mes-mescc/_write.c`, `linux/<arch>-mes-gcc/*.c`,
+`linux/<arch>-mes-mescc/syscall-internal.c`.
+
+Also: drop mes's `linux/_read.c` (it dispatches to the inline-asm
+wrapper); our `read` above replaces it. Keep
+`posix/buffered-read.c` — it consumes our `read` via the
+`__buffered_read` indirection.
+
+### 5. Write `scripts/boot-build-libc.sh`
+
+Mirror `boot-build-cc.sh`'s shape. Pseudocode:
+
+```sh
+ROOT=...; ARCH=...
+LIBC_FLAT=build/cc-bootstrap/$ARCH/libc.flat.c
+LIBC_P1PP=build/$ARCH/libc.P1pp
+
+# (a) preprocess + concat (host cc -E -nostdinc, like stage1-flatten)
+host_cc -E -nostdinc \
+ -I vendor/mes-libc/include \
+ -I vendor/mes-libc/include/linux/$MES_ARCH \
+ -D HAVE_CONFIG_H=1 \
+ vendor/mes-libc/unified-libc.c \
+ > "$LIBC_FLAT"
+
+# (b) compile with cc.scm in container
+podman run ... build/$ARCH/scheme1 build/$ARCH/cc/cc.scm \
+ "$LIBC_FLAT" "$LIBC_P1PP"
+```
+
+Where `vendor/mes-libc/unified-libc.c` is a hand-written file that
+just `#include`s every .c in the manifest order (live-bootstrap
+catms; we use `#include` so the host preprocessor handles dedup of
+mes's per-file `#include <mes/lib.h>` etc.). The `#include "*.c"`
+pattern is the same one used by `tcc.flat.c`'s `#include "libtcc.c"`
+upstream.
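Since `unified-libc.c` is just a list of `#include` lines in manifest order, it could be generated rather than typed. The following is a minimal sketch under the assumption that the manifest is kept as plain text in the layout shown in step 2 (a directory name ending in `/`, then file names, with indented continuation lines). The function name `gen_unified` is hypothetical.

```sh
# Hypothetical generator: read the manifest on stdin and print one
# #include line per file, preserving manifest order.
gen_unified() {
    awk '
        # A first field ending in "/" starts a new directory group;
        # otherwise the line continues the current group.
        $1 ~ /\/$/  { dir = $1; first = 2 }
        $1 !~ /\/$/ { first = 1 }
        {
            for (i = first; i <= NF; i++)
                printf "#include \"%s%s\"\n", dir, $i
        }
    '
}
```

The output would still need an `#include` for `lispcc-syscall.c` added by hand, since that file is not part of the manifest.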
+
+Wire into the Makefile so `make tcc-boot2 ARCH=$A` runs
+`boot-build-libc.sh` before the link step. The link step gains one
+line:
+
+```
+cat tcc.P1pp libc.P1pp > tcc-boot2.P1pp
+```
+
+(libc *after* tcc — its .bss must follow tcc's data without crossing
+it).
+
+### 6. Phase A smoke tests
+
+- `tests/cc/200-libc-hello.c` — a hand-written `main()` that calls
+ `printf("hi\n")` then `exit(0)`. Compile with cc.scm, link against
+ libc.P1pp, run in the container, check stdout = `hi\n` and exit
+ status 0.
+- `tests/cc/201-libc-malloc.c` — round-trips malloc/free/realloc.
+- `tests/cc/202-libc-stdio.c` — fopen/fwrite/fclose, then re-read
+ and compare bytes.
+
+Phase A acceptance: `make tcc-boot2 ARCH=aarch64` links to a runnable
+ELF, and `tcc-boot2 -version` prints the version string under the
+per-arch container.
+
+## Phase B — build the on-disk archives tcc-boot2 needs
+
+tcc-boot2 produces ELF binaries via its own codegen (X86_64,
+aarch64, riscv64). When it links a user program it auto-appends
+`-lc`, which it resolves against `$LIBDIR/libc.a`, and resolves
+`__divdi3` / `__floatundidf` / etc. against
+`$LIBDIR/tcc/libtcc1.a`. Both archives have to exist on disk before
+tcc-boot2 is useful as a compiler. Phase A's `libc.P1pp` doesn't
+help here: it is linked into tcc-boot2 itself in P1pp form, whereas
+the archives belong to tcc-boot2's *output* world.
+
+Build them with tcc-boot2 itself, mirroring live-bootstrap's
+`pass1.kaem` (search for `unified-libc.o` and `libtcc1.o`).
+
+### B1. libc.a from the same vendored sources
+
+Reuse `vendor/mes-libc/unified-libc.c` from Phase A. Compile with
+tcc-boot2 (per arch). Add `scripts/boot-build-libc-archive.sh`:
+
+```sh
+TCC_BOOT2=build/$ARCH/tcc-boot2
+$TCC_BOOT2 -c -D HAVE_CONFIG_H=1 \
+ -I vendor/mes-libc/include \
+ -I vendor/mes-libc/include/linux/$MES_ARCH \
+ -o build/$ARCH/libc.o \
+ vendor/mes-libc/unified-libc.c
+$TCC_BOOT2 -ar cr build/$ARCH/libc.a build/$ARCH/libc.o
+```
+
+Install at `$LIBDIR/libc.a` where `$LIBDIR` is whatever
+`CONFIG_TCC_CRTPREFIX` was baked into tcc-boot2 (default
+`build/$ARCH/sysroot/lib`; align with the `-D CONFIG_TCC_CRTPREFIX`
+in the Makefile). Also produce `crt1.o` from
+`vendor/mes-libc/linux/$MES_ARCH-mes-gcc/crt1.c` if the link needs
+one — for static binaries with our hand-written `_start` it can be
+skipped; check by linking the smoke test below.
+
+The chicken-and-egg concern is moot: tcc-boot2's codegen for
+P1-64 targets does not emit `__divdi3`-class calls when compiling
+mes libc (long-long is native register width on X86_64 / aarch64 /
+riscv64, and `HAVE_FLOAT` paths in mes libc are dead under
+`HAVE_CONFIG_H=1` with our config). So building libc.a needs no
+prior libtcc1.a.
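The claim that `libc.o` carries no references to libtcc1 helpers can be checked rather than taken on trust. The sketch below filters a list of undefined symbols for helper-shaped names. It assumes some `nm`-like tool that can read the object is available to produce the list; the function name `libtcc1_refs` is hypothetical, and the pattern covers only a partial set of helper names (64-bit div/mod/shift and float conversions).

```sh
# Hypothetical check: read `nm -u` style output on stdin and print any
# undefined symbols that look like libtcc1 runtime helpers.
# Returns 0 (and prints nothing) when no such symbol is referenced,
# 1 when at least one is found.
libtcc1_refs() {
    if grep -E '__(u?divdi3|u?moddi3|ashldi3|ashrdi3|lshrdi3|float[a-z]*di[sdx]f|fix[a-z]*[sdx]fdi)$'; then
        return 1
    fi
    return 0
}
```

A possible use is `nm -u build/$ARCH/libc.o | libtcc1_refs`, which should succeed silently if building libc.a really needs no prior libtcc1.a.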
+
+### B2. libtcc1.a from upstream tcc
+
+The source is already vendored implicitly via `stage1-flatten.sh`
+(it's `tcc-0.9.26-1147-gee75a10c/lib/libtcc1.c` inside the tarball).
+Add `scripts/boot-build-libtcc1.sh`:
+
+```sh
+TCC_BOOT2=build/$ARCH/tcc-boot2
+TCC_SRC=build/cc-bootstrap/$ARCH/tcc-0.9.26-1147-gee75a10c
+$TCC_BOOT2 -c -D HAVE_CONFIG_H=1 -D HAVE_LONG_LONG=1 -D HAVE_FLOAT=1 \
+ -I vendor/mes-libc/include \
+ -I vendor/mes-libc/include/linux/$MES_ARCH \
+ -o build/$ARCH/libtcc1.o \
+ $TCC_SRC/lib/libtcc1.c
+# riscv64 also pulls in lib-arm64.c per upstream:
+if [ "$ARCH" = riscv64 ]; then
+ $TCC_BOOT2 -c ... -o build/$ARCH/lib-arm64.o $TCC_SRC/lib/lib-arm64.c
+ EXTRA=build/$ARCH/lib-arm64.o
+fi
+$TCC_BOOT2 -ar cr build/$ARCH/libtcc1.a build/$ARCH/libtcc1.o $EXTRA
+```
+
+Install at `$LIBDIR/tcc/libtcc1.a` (matches tcc.flat.c line 11234
+`tcc_add_support(s1, "libtcc1.a")` against `tcc_lib_path`).
+
+### B3. Wire into the Makefile
+
+```
+make tcc-boot2 ARCH=aarch64 # phase A: links tcc-boot2 itself
+make tcc-archives ARCH=aarch64 # phase B: produces libc.a + libtcc1.a
+make tcc-smoke ARCH=aarch64 # phase B acceptance, see below
+```
+
+`tcc-archives` depends on `tcc-boot2`. `tcc-smoke` depends on
+`tcc-archives`.
+
+### B4. Phase B smoke tests
+
+- `tests/tcc/300-hello.c` — `printf("hi\n")`. Compile *with
+ tcc-boot2*, run, check output. Exercises libc.a auto-link.
+- `tests/tcc/301-longlong.c` — does `long long a = ...; a / b;`
+ with values that would force a real divmod helper call on 32-bit
+ targets; on our P1-64 targets this should still link (64-bit
+ division is a native instruction), but the test confirms
+ libtcc1.a is on the path and gets searched.
+- `tests/tcc/302-self-host-fragment.c` — pull a small TU out of
+ tcc.c (e.g. one of the smaller pass1 files) and compile it with
+ tcc-boot2 to confirm tcc-on-tcc works end-to-end.
+
+Phase B acceptance: `make tcc-smoke ARCH=aarch64` passes all three.
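Each smoke test above boils down to "run a binary, compare stdout and exit status". A minimal sketch of such a check follows; the helper name `expect_run` is hypothetical, and how `tcc-smoke` actually drives the tests is not specified in this plan.

```sh
# Hypothetical helper: run a command and compare its stdout and exit
# status against expectations. Usage: expect_run STATUS STDOUT CMD...
# Command substitution strips trailing newlines, so pass "hi" to match
# a program that prints "hi\n".
expect_run() {
    want_status=$1; want_out=$2; shift 2
    got_out=$("$@") && got_status=0 || got_status=$?
    if [ "$got_status" -ne "$want_status" ] || [ "$got_out" != "$want_out" ]; then
        echo "FAIL: $*" >&2
        return 1
    fi
    echo "ok: $*"
}
```

For `300-hello.c` this might be invoked as `expect_run 0 hi ./300-hello` after compiling with tcc-boot2, where the output binary name is an assumption.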
+
+### Looking ahead — milestone 5
+
+With Phase B done, tcc-boot2 can compile arbitrary C through its
+own codegen. Milestone 5 (use tcc-boot2 to rebuild tcc.c → tcc-boot0,
+checksum-match against the live-bootstrap reference) becomes a
+matter of feeding tcc.c through tcc-boot2 with the boot0 defines.
+That's tracked in [TCC.md](TCC.md), not here.
+
+## Out of scope
+
+- **Threading, locale, dynamic linker, IEEE-754 math.** tcc-mes
+ defines have `HAVE_FLOAT` / `HAVE_SETJMP` off; the fp paths in mes
+ libc compile but are dead code under those defines.
+- **errno from threads.** `errno` is a single global int. cc.scm has
+ no TLS; tcc-boot2 is single-threaded.
+
+## Notes for the engineer
+
+- Refresh LIBC.txt with `scripts/boot-undef.sh > docs/LIBC.txt` after
+ the link starts working — new externals may surface that the
+ current static analysis missed.
+- If a mes file pulls in a header path we don't have, the right move
+ is almost always to copy the matching `mes/include/` header
+ verbatim — don't write a substitute.
+- cc.scm's debug flag (`--cc-debug`, see TCC-TODO.md "Repro") prints
+ per-phase heap usage. libc.flat.c is small (~30 KB after flatten)
+ so heap should be flat; if it isn't, that's a cc.scm bug, not a
+ libc bug.
+- The existing `vendor/seed/` layout is `<tool>/<arch>/...`. mes-libc
+ is per-arch only via headers; the .c manifest is arch-agnostic.
+ Lay out `vendor/mes-libc/{ctype,string,...}/` flat, with
+ `vendor/mes-libc/include/linux/<arch>/` per-arch.
diff --git a/docs/TCC-TODO.md b/docs/TCC-TODO.md
@@ -299,7 +299,39 @@ likely walls live in the assembly side and at runtime:
(string-table init, command-line parsing, output for `-version`)
without tripping on cg semantics that pass the small tests but
diverge from C in subtle ways.
+- **libc**. The 39 unresolved externals in [LIBC.txt](LIBC.txt) are
+ unmet — there is no libc in the link today. See §libc strategy below.
The end goal is milestone 4 in [CC.md §Validation milestones](CC.md)
— "Compile tcc.c (under the tcc-mes defines) → tcc-lispcc; verify
`tcc-lispcc -version` runs."
+
+## libc — see [LIBC.md](LIBC.md)
+
+The unresolved externals in [LIBC.txt](LIBC.txt) are met by porting a
+curated subset of mes libc and adding a thin syscall-wrapper file that
+calls P1pp's `sys_*` labels directly. **musl is rejected** for this
+layer (built around gcc-only idioms — inline asm everywhere, TLS
+`errno`, weak/visibility attrs, `_Atomic`, IEEE math, dynamic linker —
+none of which survive cc.scm's subset).
+
+The work splits cleanly into two phases. **Phase A** is what makes
+tcc-boot2 itself link: cc.scm compiles the vendored mes libc subset
+into `libc.P1pp`, catm'd with `tcc.P1pp` to produce the tcc-boot2
+ELF. **Phase B** is what makes tcc-boot2 *useful* — tcc-boot2
+auto-appends `-lc` and resolves runtime helpers
+(`__divdi3`, `__floatundidf`, …) against `$LIBDIR/tcc/libtcc1.a` when
+linking the code it compiles. Both archives have to exist on disk or
+even hello-world won't link. Phase B uses tcc-boot2 itself as the
+compiler (mirrors live-bootstrap's `pass1.kaem` substituting tcc-boot2
+for tcc-mes); upstream tcc's `lib/libtcc1.c` is the source for the
+libtcc1.a side and is already pulled in by `stage1-flatten.sh`'s
+tarball unpack.
+
+The full implementation handoff (manifest of files to vendor, the four
+surgical patches, the new P1pp entry points, the Phase A / B build
+scripts, smoke tests, acceptance criteria) lives in
+[LIBC.md](LIBC.md).
+
+CC.md needs a follow-up edit: its "we link against the same `libc+tcc`
+archive MesCC uses" line is now stale.