boot2

Playing with the bootstrap
git clone https://git.ryansepassi.com/git/boot2.git

Minimal C subset (boot2)

Working doc. Baseline is C99; everything here is a delta against it. The target is just enough C to compile

tcc-0.9.26-1147-gee75a10c/tcc.c

with the same defines used at MesCC's tcc-mes stage in live-bootstrap:

-D BOOTSTRAP=1
-D HAVE_LONG_LONG=1
-D ONE_SOURCE=1
-D TCC_TARGET_X86_64=1
-D inline=
-D CONFIG_TCCDIR="..."  ...etc

Notably not defined: HAVE_FLOAT, HAVE_BITFIELD, HAVE_SETJMP. Those gate off entire code paths in tcc.c (floats, bitfield struct support, setjmp-based error recovery), and we don't have to compile any of it.
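
A minimal sketch of what "gated off" means for the input. The function below is hypothetical and not taken from tcc.c; it only shows that with none of the three macros defined, the guarded lines never reach the parser.

```c
/* Hypothetical illustration, not code from tcc.c: each guarded line is
   dropped by the preprocessor unless its HAVE_* macro is defined. */
int enabled_features(void) {
    int mask = 0;
#ifdef HAVE_FLOAT
    mask |= 1;   /* float code paths */
#endif
#ifdef HAVE_BITFIELD
    mask |= 2;   /* bitfield struct support */
#endif
#ifdef HAVE_SETJMP
    mask |= 4;   /* setjmp-based error recovery */
#endif
    return mask;
}
```

Under the tcc-mes defines listed above, this returns 0.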

The accepted surface is shaped by two intersecting constraints:

  1. Lower bound — what tcc.c (under those defines) actually uses.
  2. Upper bound — what MesCC accepts, since MesCC already builds tcc-mes and we're its replacement. Anything MesCC strips silently (const, inline, __attribute__) we also strip silently.

Things outside both bounds are cut. Things admitted are load-bearing.

Scope

Toolchain envelope

tcc.c + system headers
       │
       │ pre-flatten: resolve #include recursively, splice into one file
       │ (separate tool: scheme1 or shell; not part of cc.scm)
       ▼
tcc.flat.c                          single bytestream, no #include
       │
       │ scheme1 cc.scm
       ▼
tcc.P1pp                            our compiler's output
       │
       │ catm with arch backend + libp1pp.P1pp
       │ m1pp
       ▼
tcc.M1
       │ M0
       ▼
tcc.hex2
       │ hex2
       ▼
tcc-mes                             native ELF, replaces MesCC's tcc-mes

The pre-flatten pass is not a C preprocessor; it only resolves #include. All other directives (#define, #if, …) are handled by the in-Scheme preprocessor in phase 2 (Preprocess).
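
To make "only resolves #include" concrete, here is a sketch of the one recognition step such a tool needs. The real pre-flatten tool is scheme1 or shell, so this C version and the name parse_include are purely illustrative assumptions.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of boot2. Returns 1 and copies the header
   name into `name` when `line` is an #include directive; returns 0 for
   every other line, which a flattener would copy through unchanged. */
int parse_include(const char *line, char *name, size_t cap) {
    const char *p = line;
    char close;
    size_t n = 0;

    if (cap == 0) return 0;
    while (*p == ' ' || *p == '\t') p++;        /* leading whitespace */
    if (*p != '#') return 0;
    p++;
    while (*p == ' ' || *p == '\t') p++;        /* "#  include" is legal */
    if (strncmp(p, "include", 7) != 0) return 0;
    p += 7;
    while (*p == ' ' || *p == '\t') p++;
    if (*p == '"') close = '"';
    else if (*p == '<') close = '>';
    else return 0;                              /* not a plain include form */
    p++;
    while (*p && *p != close) {
        if (n + 1 >= cap) return 0;             /* name does not fit */
        name[n++] = *p++;
    }
    if (*p != close) return 0;                  /* unterminated name */
    name[n] = '\0';
    return 1;
}
```

A flattener built on this would recurse into the named file on a 1 and pass the line through on a 0, leaving #define, #if and the rest for the preprocessor.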

Translation phases

The C standard names eight phases. We collapse them to three:

  1. Lex — bytestream → token list. Trigraphs and line-splicing (backslash-newline) are handled here, alongside numbers / strings / identifiers / punctuators. Comments removed. Newlines preserved as NL tokens (the preprocessor needs them to delimit directives).
  2. Preprocess — token list → expanded token list. Directives consumed, macros expanded, NL tokens stripped on exit.
  3. Parse + emit — token list → P1pp text. xcc-style direct emit; no AST.
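
The line-splicing part of phase 1 can be sketched in a few lines. The actual lexer is Scheme (cc.scm); this C version, with the assumed name splice_lines, only shows the transformation.

```c
#include <stddef.h>

/* Illustrative sketch, not boot2 code. Copies `in` to `out`, deleting every
   backslash-newline pair. `out` must have room for strlen(in) + 1 bytes.
   Returns the number of bytes written, excluding the terminator. */
size_t splice_lines(const char *in, char *out) {
    size_t n = 0;
    while (*in) {
        if (in[0] == '\\' && in[1] == '\n') {
            in += 2;             /* drop the pair: two physical lines join */
            continue;
        }
        out[n++] = *in++;
    }
    out[n] = '\0';
    return n;
}
```

Because splicing happens before NL tokens are produced, a multi-line #define reaches the preprocessor as a single directive line.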

Lexical syntax

Subset of C99 lexical grammar.

Preprocessor

Directive set:

Operators inside the body of a function-like macro:

Built-in macros:

Adjacent string-literal tokens in the post-expansion stream are concatenated (translation phase 6).
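
A small example of why concatenation has to run on the post-expansion stream. The macro name VERSION and the function banner are illustrative only.

```c
#define VERSION "0.9.26"

/* After macro expansion the stream holds three adjacent string literals:
   "tcc " "0.9.26" "\n". Concatenation joins them into one literal. */
const char *banner(void) {
    return "tcc " VERSION "\n";
}
```

If concatenation ran before macro expansion, VERSION would still be an identifier sitting between two literals and nothing would join.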

Expression evaluator (used by #if/#elif):

Macro expansion uses C11 6.10.3.4 hide-set discipline. Each token carries the set of macro names already expanded into it; an identifier inside its own hide-set is not re-expanded. This is the standard defense against mutually recursive definitions such as #define A B followed by #define B A.
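
The mutually recursive case can be checked with any conforming preprocessor. The names ping and pong below stand in for A and B and are otherwise arbitrary.

```c
#define ping pong
#define pong ping

/* ping -> pong (hide-set {ping}) -> ping (hide-set {ping, pong}).
   The final ping is in its own hide-set, so expansion stops there. */
int ping = 7;

/* pong -> ping -> pong, and stops for the same reason. */
int pong = 9;

int sum_ping_pong(void) { return ping + pong; }
```

Each declaration ends up naming the identifier it started with, so the file defines two distinct variables and the function returns their sum.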

Types

Primitives (P1-64)

Type                Size (bytes)  Align  Notes
void                -             -      only as ptr-target / fn-ret
char                1             1      signed by default
signed char         1             1
unsigned char       1             1
short               2             2
unsigned short      2             2
int                 4             4
unsigned int        4             4
long                8             8      LP64
unsigned long       8             8
long long           8             8      same as long in LP64
unsigned long long  8             8
pointer             8             8      tag-free; raw native address
_Bool               1             1      values: 0, 1

size_t is unsigned long; ptrdiff_t is long; intptr_t / uintptr_t are long / unsigned long. These typedefs come from the flattened headers; the language doesn't bake them in.
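
As a worked example of the size/align columns, the sketch below lays out a hypothetical struct { char c; int i; char *p; short s; }. It assumes the conventional natural-alignment rule (each member at the next multiple of its alignment, total size rounded up to the largest member alignment); this document does not state boot2's struct layout rule, so treat that as an assumption.

```c
#include <stddef.h>

/* Round `off` up to the next multiple of `align` (a power of two). */
size_t align_up(size_t off, size_t align) {
    return (off + align - 1) & ~(align - 1);
}

/* Assumed natural-alignment layout of { char; int; pointer; short }
   using the table's (size, align) pairs: 1/1, 4/4, 8/8, 2/2. */
size_t example_struct_size(void) {
    static const size_t size[]  = {1, 4, 8, 2};
    static const size_t align[] = {1, 4, 8, 2};
    size_t off = 0, max_align = 1;
    int k;
    for (k = 0; k < 4; k++) {
        off = align_up(off, align[k]) + size[k];   /* pad, then place member */
        if (align[k] > max_align) max_align = align[k];
    }
    return align_up(off, max_align);               /* tail padding */
}
```

Under that assumption the members land at offsets 0, 4, 8 and 16, and tail padding brings the total to 24 bytes.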

Floating-point types (float, double, long double, _Complex, _Imaginary) are parsed but never codegen'd: prototypes and struct fields involving them are accepted (so the flattened tcc.c TU can be ingested), and sizeof reports 4/8/8 for float/double/long double. Those match the SysV widths for float and double; long double is sized as 8 bytes here, not the 16 of the x86-64 SysV ABI. Any attempt to materialize an fp value (load, store, cast, arithmetic, call/return) dies with fp not codegen'd. tcc.c only uses fp under HAVE_FLOAT, which is off, so live code never trips the codegen guard. Not present: __int128. <float.h> macros and <math.h> are unavailable to the input.
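
A sketch of where the parse-only line falls, based on the description above. The identifiers are made up for illustration and the rejected form is left in a comment.

```c
/* Accepted: fp appears only in types, never as a value. */
double parse_fp_stub(const char *s, char **end);   /* prototype only */

struct cval {
    long long i;
    double d;                                      /* fp struct field */
};

unsigned long cval_size(void) {
    return sizeof(struct cval);                    /* 8 + 8 bytes */
}

/* Expected to die with "fp not codegen'd", since it loads, divides and
   returns an fp value:

   double half(double x) { return x / 2; }
*/
```

The first three declarations only need type sizes, which is all the flattened headers require of fp.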

Derived types

Qualifiers

Declarations and storage

Declarators

Full C99 spiral-declarator grammar:

int  *p             // pointer to int
int  *p[10]         // array of 10 pointers to int
int (*p)[10]        // pointer to array of 10 ints
int (*f)(int, int)  // pointer to function (int,int) returning int
int  *f(int)        // function (int) returning pointer to int
char *(*tab[5])(int) // array of 5 pointers to function (int) returning char*
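
Two of the forms above in use. This is plain C99 with illustrative names; it assumes static initializers may hold function addresses, which the initializer rules in this document do not confirm.

```c
static char *name_even(int n) { (void)n; return "even"; }
static char *name_odd(int n)  { (void)n; return "odd"; }

/* array of 2 pointers to function (int) returning char* */
static char *(*tab[2])(int) = { name_even, name_odd };

char *parity_name(int n) {
    return tab[n & 1](n);    /* index the array, then call through the pointer */
}

/* p is a pointer to an array of 10 ints, so (*p) is the array itself. */
int third(int (*p)[10]) {
    return (*p)[2];
}
```

Reading each declarator from the name outward gives the order of operations at the use site: tab is indexed, then called; p is dereferenced, then indexed.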

Storage classes

Function definitions

[storage] [type-quals] return-type name(params) { body }

Parameter list forms:

K&R-style (int f(a, b) int a, b; { … }) is not supported.

Variable initializers

Inline / attributes

Statements

All standard C statements:

Cut:

Expressions

All standard C operators with standard precedence and associativity:

Tier (high → low)  Operators
postfix            a[i], f(a,...), s.m, p->m, e++, e--
unary              ++e, --e, &e, *e, +e, -e, ~e, !e, sizeof, (T)e
multiplicative     *, /, %
additive           +, -
shift              <<, >>
relational         <, <=, >, >=
equality           ==, !=
bitwise            &, ^, | (in that order)
logical            &&, || (in that order)
conditional        ?:
assignment         =, +=, -=, *=, /=, %=, <<=, >>=, &=, ^=, |=
comma              ,
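
A few spot checks of the table, written as ordinary C functions with made-up names. They rely only on standard C precedence and associativity.

```c
/* Postfix binds tighter than unary: *p++ is *(p++), so each statement
   reads the current element and then advances the pointer. */
int sum3(const int *p) {
    int s = 0;
    s += *p++;
    s += *p++;
    s += *p++;
    return s;
}

/* Additive operators are left-associative: a - b - c is (a - b) - c. */
int left_assoc(int a, int b, int c) { return a - b - c; }

/* Assignment is right-associative: a = b = e assigns b first, then a. */
int chain(int v) {
    int a, b;
    a = b = v + 1;
    return a + b;
}
```

For example, left_assoc(10, 4, 3) is 3 rather than 9, and chain(2) is 6.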

Notes:

Variadic argument access

#include <stdarg.h>     // pre-flattened in
void f(int n, ...) {
    va_list ap; va_start(ap, n);
    int x = va_arg(ap, int);
    va_end(ap);
}

va_list, va_start, va_arg, va_end are macros from the flattened header. They expand to direct frame-slot reads keyed off the ... slot offset our codegen exposes. Implementation detail: our stdarg.h substitute is one of the headers shipped with the compiler.
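
A slightly fuller usage sketch of the same four macros, written against the standard <stdarg.h> interface. The function name sum_ints is illustrative.

```c
#include <stdarg.h>

/* Sums the `n` int arguments that follow the count. */
int sum_ints(int n, ...) {
    va_list ap;
    int total = 0;
    va_start(ap, n);                 /* start after the last named parameter */
    while (n-- > 0)
        total += va_arg(ap, int);    /* read the next variadic slot as int */
    va_end(ap);
    return total;
}
```

A call such as sum_ints(3, 1, 2, 3) yields 6. The caller is responsible for the count matching the arguments actually passed, since nothing in the mechanism checks it.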

Standard library expectations

Our compiler doesn't bundle libc. The bootstrap script links the output against the same libc+tcc archive MesCC uses, which provides:

Nothing from <setjmp.h> is required at the tcc-mes stage (HAVE_SETJMP off), and <math.h> is not required either (HAVE_FLOAT off).

Built-in functions our compiler recognizes (vs. linking against):

Cut from C99 / C11

Kept explicit so additions are deliberate.

Feature                              Status             Rationale
Floats / doubles / _Complex          parse-only         parsed as types; cg rejects fp ops (HAVE_FLOAT off)
long double                          parse-only         same softening; sized as 8 bytes
Bitfields                            rejected           HAVE_BITFIELD off
setjmp / longjmp                     not lib            HAVE_SETJMP off
VLAs                                 rejected           tcc.c doesn't use; complicates frame layout
Compound literals (T){...}           rejected           tcc.c doesn't use
Statement expressions ({...}) (GCC)  rejected           tcc.c doesn't use
_Generic                             rejected           not used
_Atomic, _Thread_local               rejected           not used
_Alignof, _Alignas                   rejected           not used
_Static_assert                       rejected           not used
Wide / UTF strings (L"…", u8"…")     rejected           not used
Anonymous struct/union members       rejected           not used
Multi-character constants ('AB')     rejected           not used
Universal character names (\uXXXX)   rejected           identifier set is ASCII only
K&R-style function definitions       rejected           tcc.c uses ANSI
Nested function definitions (GCC)    rejected           not used
Inline assembly (__asm__)            rejected           not used at this stage
__label__ (GCC)                      rejected           not used
#include                             rejected           external pre-flatten step
const, volatile, restrict            parsed, discarded  match MesCC
inline                               parsed, discarded  -D inline= in bootstrap
__attribute__((...))                 parsed, discarded  match MesCC
register, auto storage classes       parsed, no effect

Undefined behavior policy

Following LISP.md's "Primitive failure" stance: out-of-bounds array access, signed integer overflow, dereferencing a null or uninitialized pointer, integer division by zero, and modifying a string literal are undefined. The compiler emits no runtime checks; the generated P1pp will crash, loop, or produce nonsense, and that's acceptable.
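
Since no runtime checks are emitted, any guarding has to live in the source being compiled. A minimal sketch for two of the cases listed, with a hypothetical helper name and assuming <limits.h> is available from the flattened headers:

```c
#include <limits.h>

/* Hypothetical guard, not part of boot2. Stores a / b in *out and returns 0,
   or returns -1 without dividing when the operation would be undefined:
   a zero divisor, or INT_MIN / -1, which overflows. */
int checked_div(int a, int b, int *out) {
    if (b == 0) return -1;
    if (a == INT_MIN && b == -1) return -1;
    *out = a / b;
    return 0;
}
```

Without such a check the division is simply emitted, and the result is whatever the target does.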

The compiler itself aims to be deterministic: the same input bytes produce identical output bytes. Errors detected at compile time (syntax errors, type errors, unresolved identifiers) abort with a diagnostic on stderr and a nonzero exit code. No partial output is written.

Validation milestones

Status legend: [x] done · [~] in progress · [ ] not started.

  1. Self-tests: a tests/cc/ tree mirroring tests/scheme1/ — one tiny .c file per language feature, exit-status-driven.
  2. Compile a hand-written single-file C "hello world" through to ELF.
  3. Compile the mes libc unified-libc.c (the same file MesCC builds into libc.a).
  4. Compile tcc.c (under the tcc-mes defines) → tcc-boot2; verify tcc-boot2 -version runs.
  5. Use tcc-boot2 to build tcc-boot0; verify checksum matches the live-bootstrap reference.

Hitting (5) is the bootstrap milestone — at that point boot2 has fully replaced MesCC in the chain.