commit 22aa8e4e433beeeada6d84d710d873edca1d2f25
parent 9fd3adda7106605e23b78d7c85f1237f1ea67801
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Sat, 6 Jun 2026 05:17:20 -0700
docs: plan llgen import
Diffstat:
2 files changed, 390 insertions(+), 0 deletions(-)
diff --git a/doc/plan/LLGEN_IMPORT.md b/doc/plan/LLGEN_IMPORT.md
@@ -0,0 +1,389 @@
+# LLGEN Import Plan
+
+This is the import plan for folding `/Users/ryan/code/ll1` into libkit and
+exposing it through the kit driver. The goal is not to make `ll1` a language
+frontend. It should land as a reusable parser/lexer-generator subsystem: EBNF in,
+parse and lexer tables out, with allocation-free push runtimes for generated
+parsers and lexers.
+
+The standalone project already has the right broad shape for libkit: explicit
+allocation, no hidden CLI dependency in the generator API, immutable generated
+tables, and caller-owned runtime state. The import work is mostly namespace,
+host-boundary, build-gating, and file-layout discipline.
+
+## Decision Summary
+
+- Driver command: `llgen`.
+- Public generator API header: `<kit/llgen.h>`.
+- Public parser runtime header: `<kit/llparse.h>`.
+- Public lexer runtime header: `<kit/lllex.h>`.
+- Generated-code support headers: `<kit/support/llparse_tables.h>` and
+ `<kit/support/lllex_tables.h>`.
+- Library subsystem gate: `KIT_LLGEN_ENABLED`.
+- Driver tool gate: `KIT_TOOL_LLGEN_ENABLED`.
+- Implementation directory: `src/llgen/`.
+- Tests: `test/llgen/` plus a driver smoke lane.
+
+`ll1` remains a source repository/project name only. It should not appear in the
+libkit API, installed command name, or user-facing help.
+
+## Boundaries
+
+`llgen` is a library subsystem, not a `lang/` frontend. It does not register a
+`KitFrontendVTable`, does not emit `KitCg`, and does not participate in
+source-to-object compilation unless a future frontend chooses to use it
+internally. Its public surface is consumed by embedders, the driver command, and
+generated C files.
+
+The driver remains the only hosted layer. File reads, directory creation, stdout,
+stderr, and CLI allocation policy live in `driver/cmd/llgen.c` and the hosted
+driver environment. The library accepts input as `KitSlice`, writes output to
+`KitWriter`, allocates through `KitContext.heap`, and reports diagnostics through
+`KitContext.diag`.
+
+Generated grammar tables are immutable static data and are fine as globals.
+Mutable parser, lexer, and generator state must hang off caller-owned runtime
+objects or explicit libkit handles.
+
+## File Moves
+
+Import the C implementation, generated meta tables, runtime, and tests. Do not
+make the Python generator part of the normal libkit build.
+
+| Current file | Kit destination | Notes |
+|--------------|-----------------|-------|
+| `include/llgen.h` | `include/kit/llgen.h` | Public generator API, renamed to Kit types and functions. |
+| `include/llparse.h` | `include/kit/llparse.h` | Public parser runtime API. |
+| `include/lllex.h` | `include/kit/lllex.h` | Public lexer runtime API. |
+| `include/llparse_tables.h` | `include/kit/support/llparse_tables.h` | Public generated-code support, not ordinary embedder API. |
+| `include/lllex_tables.h` | `include/kit/support/lllex_tables.h` | Public generated-code support, not ordinary embedder API. |
+| `include/llunicode.h` | `src/llgen/unicode.h` | Private helper; expose later as `<kit/unicode.h>` only if there is a broader API need. |
+| `include/llunicode_props.h` | `src/llgen/unicode_props.h` | Private generated Unicode property resolver. |
+| `runtime/llparse.c` | `src/llgen/parse_runtime.c` | Implements `<kit/llparse.h>`. |
+| `runtime/lllex.c` | `src/llgen/lex_runtime.c` | Implements `<kit/lllex.h>`. |
+| `runtime/llunicode.c` | `src/llgen/unicode.c` | Private Unicode helpers for generator and lexer runtime. |
+| `runtime/llunicode_props.c` | `src/llgen/unicode_props.c` | Checked-in generated property tables. |
+| `gen/llgen.c` | `src/llgen/generator.c` | Public API implementation and C emitter after Kit context rewrite. |
+| `gen/llgen_ll1.c` | `src/llgen/ll1.c` | LL(1), Pratt, FIRST/FOLLOW, and validation. |
+| `gen/llgen_lex_byte.c` | `src/llgen/lex_byte.c` | Byte-mode lexer compiler. |
+| `gen/llgen_lex_unicode.c` | `src/llgen/lex_unicode.c` | Unicode-mode lexer compiler. |
+| `gen/llgen_internal.h` | `src/llgen/internal.h` | Private to `src/llgen/*.c`. |
+| `gen/meta.ebnf` | `src/llgen/meta.ebnf` | Source grammar for the generator's own parser; not consumed by normal builds. |
+| generated `meta` `.c/.h` | `src/llgen/meta_tables.c`, `src/llgen/meta_tables.h` | Check in generated tables so libkit does not need Python or a previous `llgen` to build. |
+| `gen/llgen_cli.c` | `driver/cmd/llgen.c` | Rewrite as a Kit driver command using public `<kit/llgen.h>`. |
+| `tools/gen_unicode_props.py` | `scripts/gen_llgen_unicode_props.py` | Regeneration helper only; not in the build. |
+| `data/ucd/17.0.0/*` | `data/ucd/17.0.0/*` or `test/llgen/ucd/17.0.0/*` | Keep if we want reproducible Unicode-table regeneration in-tree. |
+| `test/*.c`, `test/*.ebnf` | `test/llgen/*` | Library tests and fixtures. |
+| `test/errors/*` | `test/llgen/errors/*` | Error fixtures. |
+
+`gen/llgen.py` should not be imported into libkit. If parity against the Python
+reference is still useful during the transition, keep it temporarily as
+`scripts/llgen_ref.py` and exclude it from release/build dependencies. Delete it
+once the imported C generator is trusted.
+
+## Public Renames
+
+The standalone `ll_*` and `llgen_*` names are too short for libkit and would
+violate the public symbol discipline. Public definitions must use `Kit`,
+`kit_`, or `KIT_`.
+
+### Generator API
+
+| Standalone name | Kit name |
+|-----------------|----------|
+| `llgen_options` | `KitLlgenOptions` |
+| `llgen_compiled` | `KitLlgenCompiled` |
+| `llgen_codegen` | Remove or narrow; prefer `KitWriter` outputs. |
+| `llgen_compile_text` | `kit_llgen_compile_text` |
+| `llgen_compiled_free` | `kit_llgen_free` |
+| `llgen_dump_sexpr_text` | `kit_llgen_dump_sexpr` |
+| `llgen_generate_c` | `kit_llgen_emit_c` |
+| `llgen_parser_grammar` | `kit_llgen_parser_grammar` |
+| `llgen_lexer_grammar` | `kit_llgen_lexer_grammar` |
+| `llgen_token_count` | `kit_llgen_token_count` |
+| `llgen_token_name` | `kit_llgen_token_name` |
+| `llgen_token_display` | `kit_llgen_token_display` |
+| `llgen_find_token` | `kit_llgen_find_token` |
+| `llgen_rule_count` | `kit_llgen_rule_count` |
+| `llgen_rule_name` | `kit_llgen_rule_name` |
+| `llgen_find_rule` | `kit_llgen_find_rule` |
+
+The generator API should take a `const KitContext*` and `KitSlice` inputs. It
+should not expose `llgen_allocator`; the allocator becomes `ctx->heap`. C output
+should stream to caller-provided `KitWriter`s. Callers that want owned in-memory
+text can use `kit_writer_mem`.
+
+Proposed core shape:
+
+```c
+typedef struct KitLlgenCompiled KitLlgenCompiled;
+
+typedef struct KitLlgenOptions {
+  KitSlice name; /* optional grammar name override */
+} KitLlgenOptions;
+
+typedef struct KitLlgenEmitOptions {
+  KitSlice header_path;
+  KitSlice source_path;
+  KitSlice prefix;
+} KitLlgenEmitOptions;
+
+KIT_API KitStatus kit_llgen_compile_text(const KitContext* ctx,
+                                         KitSlice text, KitSlice path,
+                                         const KitLlgenOptions* opts,
+                                         KitLlgenCompiled** out);
+KIT_API void kit_llgen_free(KitLlgenCompiled*);
+
+KIT_API KitStatus kit_llgen_emit_c(const KitLlgenCompiled*,
+                                   const KitLlgenEmitOptions* opts,
+                                   KitWriter* header, KitWriter* source);
+KIT_API KitStatus kit_llgen_dump_sexpr(const KitContext* ctx, KitSlice text,
+                                       KitSlice path,
+                                       const KitLlgenOptions* opts,
+                                       KitWriter* out);
+```
+
+### Parser Runtime API
+
+| Standalone name | Kit name |
+|-----------------|----------|
+| `ll_tok_kind` | `KitLlTokenKind` |
+| `ll_rule_id` | `KitLlRuleId` |
+| `ll_sem` | `KitLlSem` |
+| `ll_token` | `KitLlToken` |
+| `ll_error` | `KitLlParseError` |
+| `ll_err_action` | `KitLlErrorAction` |
+| `LL_ABORT` | `KIT_LL_ERROR_ABORT` |
+| `LL_SKIP` | `KIT_LL_ERROR_SKIP` |
+| `LL_RESYNC` | `KIT_LL_ERROR_RESYNC` |
+| `ll_actions` | `KitLlActions` |
+| `ll_slot` | `KitLlSlot` |
+| `LL_SLOT_SIZE` | `KIT_LL_SLOT_SIZE` |
+| `ll_config` | `KitLlParserConfig` |
+| `ll_grammar` | `KitLlGrammar` |
+| `ll_parser` | `KitLlParser` |
+| `LL_PARSER_SIZE` | `KIT_LL_PARSER_SIZE` |
+| `ll_status` | `KitLlParseStatus` |
+| `LL_NEED_MORE` | `KIT_LL_PARSE_NEED_MORE` |
+| `LL_PARSE_ACCEPT` | `KIT_LL_PARSE_ACCEPT` |
+| `LL_PARSE_ERROR` | `KIT_LL_PARSE_ERROR` |
+| `ll_parser_init` | `kit_ll_parser_init` |
+| `ll_parser_push` | `kit_ll_parser_push` |
+| `ll_parser_finish` | `kit_ll_parser_finish` |
+| `ll_parser_result` | `kit_ll_parser_result` |
+| `ll_stack_bounds` | `kit_ll_stack_bounds` |
+
+### Lexer Runtime API
+
+| Standalone name | Kit name |
+|-----------------|----------|
+| `ll_lex_grammar` | `KitLlLexGrammar` |
+| `ll_lexer` | `KitLlLexer` |
+| `LL_LEXER_SIZE` | `KIT_LL_LEXER_SIZE` |
+| `ll_lex_config` | `KitLlLexConfig` |
+| `ll_lex_status` | `KitLlLexStatus` |
+| `LL_LEX_TOKEN` | `KIT_LL_LEX_TOKEN` |
+| `LL_LEX_NEED_MORE` | `KIT_LL_LEX_NEED_MORE` |
+| `LL_LEX_EOF` | `KIT_LL_LEX_EOF` |
+| `LL_LEX_ERROR` | `KIT_LL_LEX_ERROR` |
+| `ll_lex_error` | `KitLlLexError` |
+| `ll_lexer_init` | `kit_ll_lexer_init` |
+| `ll_lexer_push` | `kit_ll_lexer_push` |
+| `ll_lexer_finish` | `kit_ll_lexer_finish` |
+| `ll_lexer_next` | `kit_ll_lexer_next` |
+| `ll_lexer_error` | `kit_ll_lexer_error` |
+
+### Generated-Code Support API
+
+The generated-code support headers should use Kit names too:
+
+| Standalone name | Kit name |
+|-----------------|----------|
+| `ll_sym_kind` | `KitLlSymKind` |
+| `LL_S_TERM` | `KIT_LL_SYM_TERM` |
+| `LL_S_RULE` | `KIT_LL_SYM_RULE` |
+| `LL_S_REP` | `KIT_LL_SYM_REP` |
+| `LL_S_OPT` | `KIT_LL_SYM_OPT` |
+| `ll_sym` | `KitLlSym` |
+| `ll_prod` | `KitLlProd` |
+| `ll_pratt_op` | `KitLlPrattOp` |
+| `ll_pratt` | `KitLlPratt` |
+| `ll_rule` | `KitLlRule` |
+| `LL_TERM` | `KIT_LL_TERM` |
+| `LL_RULE` | `KIT_LL_RULE` |
+| `ll_lex_accept` | `KitLlLexAccept` |
+| `LL_LEX_DEAD` | `KIT_LL_LEX_DEAD` |
+| `LL_LEX_ACCEPT_NONE` | `KIT_LL_LEX_ACCEPT_NONE` |
+
+Generated grammar-specific token/rule enums (`TOK_*`, `R_*`, and
+`<prefix>_*`) are user artifacts. They may keep their current shape because they
+are controlled by the generated prefix and are not libkit exports.
+
+## Include Rewrites
+
+The generator should emit installed-style includes:
+
+```c
+/* generated header */
+#include <kit/llparse.h>
+#include <kit/lllex.h> /* only when a generated lexer exists */
+
+/* generated source */
+#include "generated_name.h"
+#include <kit/support/llparse_tables.h>
+#include <kit/support/lllex_tables.h> /* only when needed */
+```
+
+Private `src/llgen/*.c` files include `internal.h` and the public headers they
+implement. They must not include driver headers.
+
+## Driver Command
+
+The user-facing command is:
+
+```text
+kit llgen [--dump-sexpr] [--prefix PREFIX] [-o OUT.c] [--header OUT.h] grammar.ebnf
+```
+
+The first import should preserve existing behavior:
+
+- Default `.c` output path: replace the input suffix with `.c`.
+- Default `.h` output path: replace the input suffix with `.h`.
+- Default prefix: derive from the input basename and append `_`.
+- `--dump-sexpr`: write the meta-grammar dump to stdout.
+- Exit code `0`: success.
+- Exit code `1`: compile, diagnostic, or I/O failure.
+- Exit code `2`: bad command-line usage.
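+
+A minimal sketch of the three defaulting rules above, for illustration only.
+`demo_suffix_start`, `demo_output_path`, and `demo_prefix` are hypothetical
+names, not part of `ll1` or libkit. The sketch uses hosted `snprintf` for
+brevity, whereas the real command would go through `DriverEnv` as this plan
+requires. It assumes "suffix" means the text from the last `.` in the basename.
+
+```c
+#include <assert.h>
+#include <stdio.h>
+#include <string.h>
+
+/* Hypothetical helper: returns a pointer to the final suffix dot in the
+ * basename of `path`, or to the terminating NUL when there is no suffix.
+ * A leading dot (hidden file) is not treated as a suffix. */
+static const char* demo_suffix_start(const char* path, const char** base) {
+  const char* slash = strrchr(path, '/');
+  const char* dot;
+  *base = slash ? slash + 1 : path;
+  dot = strrchr(*base, '.');
+  return (dot && dot != *base) ? dot : path + strlen(path);
+}
+
+/* Hypothetical helper: replace the input suffix with `suffix` (".c" or ".h").
+ * Returns 0 on success, -1 if `out` is too small. */
+static int demo_output_path(const char* input, const char* suffix,
+                            char* out, size_t cap) {
+  const char* base;
+  const char* end;
+  int n;
+  assert(input && suffix && out && cap > 0);
+  end = demo_suffix_start(input, &base);
+  n = snprintf(out, cap, "%.*s%s", (int)(end - input), input, suffix);
+  return (n < 0 || (size_t)n >= cap) ? -1 : 0;
+}
+
+/* Hypothetical helper: basename without its suffix, with `_` appended.
+ * Returns 0 on success, -1 if `out` is too small. */
+static int demo_prefix(const char* input, char* out, size_t cap) {
+  const char* base;
+  const char* end;
+  int n;
+  assert(input && out && cap > 0);
+  end = demo_suffix_start(input, &base);
+  n = snprintf(out, cap, "%.*s_", (int)(end - base), base);
+  return (n < 0 || (size_t)n >= cap) ? -1 : 0;
+}
+```
+
+Under these assumptions, `grammars/calc.ebnf` yields `grammars/calc.c`,
+`grammars/calc.h`, and the prefix `calc_`. Whether the standalone CLI sanitizes
+basenames that are not valid C identifiers is not covered here.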
+
+The driver implementation should use `DriverEnv` for `KitContext`, file reads,
+writer opening, diagnostics, and memory. The command should not call `malloc`,
+`free`, `fopen`, `fprintf`, or `exit` directly.
+
+Driver integration points:
+
+- Add `KIT_TOOL_LLGEN_ENABLED` to `include/kit/config.h`.
+- Add `driver/cmd/llgen.c`.
+- Add `driver_llgen` and `driver_help_llgen` to `driver/driver.h`.
+- Add the `llgen` row to `driver/main.c`, gated by `KIT_TOOL_LLGEN_ENABLED`.
+- Add `$(call tool-cmd,LLGEN,llgen)` to `mk/driver_srcs.mk`.
+- Keep it in `DRIVER_GROUP_OTHER` for now. It is a developer tool, not a
+ default drop-in binutils/toolchain symlink.
+
+## Build Integration
+
+Library integration points:
+
+- Add `KIT_LLGEN_ENABLED` to `include/kit/config.h` as an optional library
+ subsystem.
+- Add `LIB_SRCS_LLGEN := $(shell find src/llgen -name '*.c' ...)` to
+ `mk/lib_srcs.mk`, and include it only when `KIT_LLGEN_ENABLED` is `1`.
+- Add weak public stubs in `src/api/config_stubs.c` for gated-out generator API
+ entry points. Runtime stubs may be omitted if the runtime is considered part
+ of the generated-code ABI and the whole subsystem is always enabled in the
+ default build; if it is gated, public runtime symbols need stubs too.
+- Keep `src/llgen/meta_tables.c` checked in. Normal `make lib` must not require
+ Python, network access, UCD regeneration, or a bootstrap `llgen` binary.
+- Add regeneration-only maintenance targets later, for example
+ `make regen-llgen-meta` and `make regen-llgen-unicode-props`.
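+
+One possible shape for a gated-out generator stub, sketched with stand-in
+types. `DemoStatus`, `DEMO_UNSUPPORTED`, `DemoCompiled`, and
+`demo_llgen_compile_text` are hypothetical; the real stubs would use
+`KitStatus` and the `kit_llgen_*` signatures. `__attribute__((weak))` is a
+GCC/Clang extension and is only an assumption about how
+`src/api/config_stubs.c` marks its stubs.
+
+```c
+#include <assert.h>
+#include <stddef.h>
+
+/* Hypothetical stand-ins for KitStatus and KitLlgenCompiled. */
+typedef enum { DEMO_OK = 0, DEMO_UNSUPPORTED = 1 } DemoStatus;
+typedef struct DemoCompiled DemoCompiled;
+
+/* Weak fallback: the linker picks this definition only when no strong
+ * definition exists, i.e. when the subsystem sources are gated out. */
+__attribute__((weak)) DemoStatus demo_llgen_compile_text(const char* text,
+                                                         size_t len,
+                                                         DemoCompiled** out) {
+  (void)text;
+  (void)len;
+  assert(out != NULL);
+  *out = NULL; /* never hand back a dangling handle */
+  return DEMO_UNSUPPORTED;
+}
+```
+
+With a stub of this kind, embedders that link against a build without the
+subsystem get a defined failure status instead of an unresolved symbol.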
+
+The first import can gate runtime and generator together under
+`KIT_LLGEN_ENABLED`. If size-sensitive embeddings need the generated-parser runtime
+without the generator compiler, split later into:
+
+- `KIT_LLGEN_RUNTIME_ENABLED`: parser/lexer runtime plus support headers.
+- `KIT_LLGEN_ENABLED`: generator compiler, depending on runtime.
+
+Do not introduce that split until there is a real embedding that benefits from
+it; it adds config and stub surface.
+
+## Implementation Phases
+
+### Phase 0: Freeze Standalone Behavior
+
+- Run the current `ll1` test suite and save the passing command set in the
+ import notes.
+- Generate and check in `meta_tables.c` / `meta_tables.h` from `meta.ebnf`.
+- Confirm generated output for representative grammars is stable.
+
+### Phase 1: Mechanical Import, Private Names Still Allowed Internally
+
+- Move files into the destinations above.
+- Keep behavior unchanged while fixing include paths.
+- Add `test/llgen` fixtures and a `make test-llgen` target, initially allowed to
+ fail until the namespace rewrite lands.
+
+### Phase 2: Public Namespace Rewrite
+
+- Rename every public `ll_*`, `llgen_*`, and `LL_*` symbol to Kit spelling.
+- Update emitted C and generated meta tables to use the new names.
+- Run `make test-lib-deps` to catch leaked public symbols outside `Kit`,
+ `kit_`, or `KIT_`.
+
+### Phase 3: Kit Context Rewrite
+
+- Replace `llgen_allocator` with `KitContext.heap`.
+- Replace generated text return structs with `KitWriter` outputs.
+- Replace direct diagnostics with `KitContext.diag`.
+- Remove hosted libc calls from imported library code.
+- Keep `setjmp`/`longjmp` only if the existing frontend panic pattern accepts
+ it for this subsystem; otherwise convert OOM and validation aborts to explicit
+ `KitStatus` unwinding.
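+
+If the `longjmp` path is rejected, explicit status unwinding could look like
+the sketch below. Every `demo_*` and `DEMO_*` name is a hypothetical stand-in,
+not libkit API, and the two stages are placeholders for validation and table
+construction rather than the real generator passes.
+
+```c
+#include <stddef.h>
+
+/* Hypothetical stand-in for KitStatus. */
+typedef enum { DEMO_OK = 0, DEMO_ERR_OOM, DEMO_ERR_INVALID } DemoStatus;
+
+/* Return early from the enclosing function on the first failing step. */
+#define DEMO_TRY(expr)                  \
+  do {                                  \
+    DemoStatus st_ = (expr);            \
+    if (st_ != DEMO_OK) return st_;     \
+  } while (0)
+
+/* Placeholder validation: reject missing or empty grammar text. */
+static DemoStatus demo_validate(const char* text, size_t len) {
+  return (text != NULL && len > 0) ? DEMO_OK : DEMO_ERR_INVALID;
+}
+
+/* Placeholder allocation step: fail when the input exceeds the budget. */
+static DemoStatus demo_build_tables(size_t len, size_t budget) {
+  return len <= budget ? DEMO_OK : DEMO_ERR_OOM;
+}
+
+/* Each failure propagates to the caller as a status; no setjmp/longjmp. */
+static DemoStatus demo_compile(const char* text, size_t len, size_t budget) {
+  DEMO_TRY(demo_validate(text, len));
+  DEMO_TRY(demo_build_tables(len, budget));
+  return DEMO_OK;
+}
+```
+
+The cost of this style is that every internal function on the failure path
+must return a status, which is the conversion work this phase would take on.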
+
+### Phase 4: Driver Command
+
+- Port `gen/llgen_cli.c` to `driver/cmd/llgen.c`.
+- Use the public generator API only.
+- Add help text consistent with other driver commands.
+- Add a focused driver test: invoke `kit llgen` on a small grammar, compile the
+ generated C against libkit, and run the parser.
+
+### Phase 5: Cleanup and Documentation
+
+- Document the stable runtime API in the public headers.
+- Add a durable design doc under `doc/` only after the subsystem ships.
+- Add `llgen` to `README.md` and `doc/DESIGN.md` capability lists after the
+ command works.
+- Remove temporary compatibility shims and any retained Python parity path.
+
+## Tests
+
+Targeted tests should land with the import:
+
+- `make test-llgen`: direct API compile, table introspection, generated parser,
+ generated lexer, Pratt grammar, UTF-8 lexer, and error fixtures.
+- `make test-driver-llgen`: CLI generation and generated-code compile/run smoke.
+- `make test-lib-deps`: symbol discipline and no accidental hosted dependencies.
+
+Map standalone tests as follows:
+
+| Standalone test | Kit test |
+|-----------------|----------|
+| `test/test_llgen_api.c` | `test/llgen/api_test.c` |
+| `test/test_calc.c` | `test/llgen/calc_test.c` |
+| `test/test_features.c` | `test/llgen/features_test.c` |
+| `test/test_lexer.c` | `test/llgen/lexer_test.c` |
+| `test/test_pratt_calc.c` | `test/llgen/pratt_test.c` |
+| `test/test_unicode_support.c` | `test/llgen/unicode_test.c` |
+| `test/test_unicode_lexer.c` | `test/llgen/unicode_lexer_test.c` |
+| `test/test_utf8_runtime.c` | `test/llgen/utf8_runtime_test.c` |
+| `test/errors/*.ebnf` | `test/llgen/errors/*.ebnf` |
+
+Prefer red-green import steps:
+
+1. Add the test target and fixtures first.
+2. Import the runtime until hand-written/generated tables parse again.
+3. Import the generator until direct API tests pass.
+4. Add the driver command and smoke test last.
+
+## Open Questions
+
+- Should generated table-layout headers be documented as stable ABI, or merely
+ stable enough for C emitted by the same libkit version? The first import should
+ promise only same-version compatibility.
+- Should Unicode UCD source data live in-tree permanently, or should only the
+ generated property tables be checked in? Keeping the data improves
+ reproducible regeneration but increases repository size.
+- Should the runtime/generator gate be split immediately? The plan says no until
+ an embedding needs runtime-only size savings.
+- Should `llgen` eventually support non-C output modes? The API should not bake
+ in more than `emit_c` today, but the command can grow `--emit=` later.
diff --git a/doc/plan/README.md b/doc/plan/README.md
@@ -20,5 +20,6 @@ shrinks to whatever remains open.
| [IMAGE_INSPECT.md](IMAGE_INSPECT.md) | Extending object inspection to executables and shared libraries. | [../OBJ.md](../OBJ.md) |
| [BUILD.md](BUILD.md) | A new content-addressed build coordinator (Bazel/Nix-style incremental builds layered on the CAS) — storage state machine, caching algorithm, recipe protocol. Distinct from `../BUILD.md` (kit's own Makefile build). | — (new subsystem) |
| [BUILD_COMMANDS.md](BUILD_COMMANDS.md) | The kit-native `build-exe`/`build-lib`/`build-obj` verbs that replace `compile`: polyglot, in-memory compile+link with `--group` flag scoping and full link-flag control. Distinct from `BUILD.md` (the CAS coordinator). | [../DRIVER.md](../DRIVER.md) |
+| [LLGEN_IMPORT.md](LLGEN_IMPORT.md) | Importing the standalone LL(1)/Pratt parser and lexer generator into libkit, including public API renames, file moves, build gates, and a `kit llgen` command. | — |
| [BACKTRACE.md](BACKTRACE.md) | Stack-trace support: GCC-compatible `__builtin_return_address`/`__builtin_frame_address` primitives, a freestanding `__kit_backtrace` capture helper, and symbolized backtrace printing. | [../FRONTENDS.md](../FRONTENDS.md), [../RUNTIME.md](../RUNTIME.md), [../DWARF.md](../DWARF.md) |
| [TODO.md](TODO.md) | Open deferred fixes and code smells only. Completed items are removed instead of checked off. Not a roadmap; a current backlog. | — |