docs: plan llgen import - kit

commit 86d3164a20fead7c9b3e92932a861a271d915e42
parent 9fd3adda7106605e23b78d7c85f1237f1ea67801
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Sat,  6 Jun 2026 05:17:20 -0700

docs: plan llgen import

Diffstat:
A doc/plan/LLGEN_IMPORT.md  | 390 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M doc/plan/README.md  | 1 +

2 files changed, 391 insertions(+), 0 deletions(-)
diff --git a/doc/plan/LLGEN_IMPORT.md b/doc/plan/LLGEN_IMPORT.md
@@ -0,0 +1,390 @@
+# Gram Import Plan
+
+This is the import plan for folding `/Users/ryan/code/ll1` into libkit and
+exposing it through the kit driver. The goal is not to make `ll1` a language
+frontend. It should land as a reusable parser/lexer-generator subsystem: EBNF in,
+parse and lexer tables out, with allocation-free push runtimes for generated
+parsers and lexers.
+
+The standalone project already has the right broad shape for libkit: explicit
+allocation, no hidden CLI dependency in the generator API, immutable generated
+tables, and caller-owned runtime state. The import work is mostly namespace,
+host-boundary, build-gating, and file-layout discipline.
+
+## Decision Summary
+
+- Driver command: `gram`.
+- Public generator API header: `<kit/gram.h>`.
+- Public parser runtime header: `<kit/gram_parse.h>`.
+- Public lexer runtime header: `<kit/gram_lex.h>`.
+- Generated-code support headers: `<kit/support/gram_parse_tables.h>` and
+  `<kit/support/gram_lex_tables.h>`.
+- Library subsystem gate: `KIT_GRAM_ENABLED`.
+- Driver tool gate: `KIT_TOOL_GRAM_ENABLED`.
+- Implementation directory: `src/gram/`.
+- Tests: `test/gram/` plus a driver smoke lane.
+
+`ll1` remains a source repository/project name only. It should not appear in the
+libkit API, installed command name, or user-facing help.
+
+## Boundaries
+
+`gram` is a library subsystem, not a `lang/` frontend. It does not register a
+`KitFrontendVTable`, does not emit `KitCg`, and does not participate in
+source-to-object compilation unless a future frontend chooses to use it
+internally. Its public surface is consumed by embedders, the driver command, and
+generated C files.
+
+The driver remains the only hosted layer. File reads, directory creation, stdout,
+stderr, and CLI allocation policy live in `driver/cmd/gram.c` and the hosted
+driver environment. The library accepts input as `KitSlice`, writes output to
+`KitWriter`, allocates through `KitContext.heap`, and reports diagnostics through
+`KitContext.diag`.
+
+Generated grammar tables are immutable static data and are fine as globals.
+Mutable parser, lexer, and generator state must hang off caller-owned runtime
+objects or explicit libkit handles.
+
+## File Moves
+
+Import the C implementation, generated meta tables, runtime, and tests. Do not
+make the Python generator part of the normal libkit build.
+
+| Current file | Kit destination | Notes |
+|--------------|-----------------|-------|
+| `include/llgen.h` | `include/kit/gram.h` | Public generator API, renamed to Kit types and functions. |
+| `include/llparse.h` | `include/kit/gram_parse.h` | Public parser runtime API. |
+| `include/lllex.h` | `include/kit/gram_lex.h` | Public lexer runtime API. |
+| `include/llparse_tables.h` | `include/kit/support/gram_parse_tables.h` | Public generated-code support, not ordinary embedder API. |
+| `include/lllex_tables.h` | `include/kit/support/gram_lex_tables.h` | Public generated-code support, not ordinary embedder API. |
+| `include/llunicode.h` | `src/gram/unicode.h` | Private helper; expose later as `<kit/unicode.h>` only if there is a broader API need. |
+| `include/llunicode_props.h` | `src/gram/unicode_props.h` | Private generated Unicode property resolver. |
+| `runtime/llparse.c` | `src/gram/parse_runtime.c` | Implements `<kit/gram_parse.h>`. |
+| `runtime/lllex.c` | `src/gram/lex_runtime.c` | Implements `<kit/gram_lex.h>`. |
+| `runtime/llunicode.c` | `src/gram/unicode.c` | Private Unicode helpers for generator and lexer runtime. |
+| `runtime/llunicode_props.c` | `src/gram/unicode_props.c` | Checked-in generated property tables. |
+| `gen/llgen.c` | `src/gram/generator.c` | Public API implementation and C emitter after Kit context rewrite. |
+| `gen/llgen_ll1.c` | `src/gram/ll1.c` | LL(1), Pratt, FIRST/FOLLOW, and validation. |
+| `gen/llgen_lex_byte.c` | `src/gram/lex_byte.c` | Byte-mode lexer compiler. |
+| `gen/llgen_lex_unicode.c` | `src/gram/lex_unicode.c` | Unicode-mode lexer compiler. |
+| `gen/llgen_internal.h` | `src/gram/internal.h` | Private to `src/gram/*.c`. |
+| `gen/meta.ebnf` | `src/gram/meta.ebnf` | Source grammar for the generator's own parser; not consumed by normal builds. |
+| generated `meta` `.c/.h` | `src/gram/meta_tables.c`, `src/gram/meta_tables.h` | Check in generated tables so libkit does not need Python or a previous `gram` to build. |
+| `gen/llgen_cli.c` | `driver/cmd/gram.c` | Rewrite as a Kit driver command using public `<kit/gram.h>`. |
+| `tools/gen_unicode_props.py` | `scripts/gen_gram_unicode_props.py` | Regeneration helper only; not in the build. |
+| `data/ucd/17.0.0/*` | `data/ucd/17.0.0/*` or `test/gram/ucd/17.0.0/*` | Keep if we want reproducible Unicode-table regeneration in-tree. |
+| `test/*.c`, `test/*.ebnf` | `test/gram/*` | Library tests and fixtures. |
+| `test/errors/*` | `test/gram/errors/*` | Error fixtures. |
+
+`gen/llgen.py` should not be imported into libkit. If parity against the Python
+reference is still useful during the transition, keep it temporarily as
+`scripts/gram_ref.py` and exclude it from release/build dependencies. Delete it
+once the imported C generator is trusted.
+
+## Public Renames
+
+The standalone `ll_*` and `llgen_*` names are too short for libkit and would
+violate the public symbol discipline. Public definitions for the imported
+subsystem must use `KitGram`, `kit_gram_`, or `KIT_GRAM_`.
+
+### Generator API
+
+| Standalone name | Kit name |
+|-----------------|----------|
+| `llgen_options` | `KitGramOptions` |
+| `llgen_compiled` | `KitGramCompiled` |
+| `llgen_codegen` | Remove or narrow; prefer `KitWriter` outputs. |
+| `llgen_compile_text` | `kit_gram_compile_text` |
+| `llgen_compiled_free` | `kit_gram_free` |
+| `llgen_dump_sexpr_text` | `kit_gram_dump_sexpr` |
+| `llgen_generate_c` | `kit_gram_emit_c` |
+| `llgen_parser_grammar` | `kit_gram_parser_grammar` |
+| `llgen_lexer_grammar` | `kit_gram_lexer_grammar` |
+| `llgen_token_count` | `kit_gram_token_count` |
+| `llgen_token_name` | `kit_gram_token_name` |
+| `llgen_token_display` | `kit_gram_token_display` |
+| `llgen_find_token` | `kit_gram_find_token` |
+| `llgen_rule_count` | `kit_gram_rule_count` |
+| `llgen_rule_name` | `kit_gram_rule_name` |
+| `llgen_find_rule` | `kit_gram_find_rule` |
+
+The generator API should take a `const KitContext*` and `KitSlice` inputs. It
+should not expose `llgen_allocator`; the allocator becomes `ctx->heap`. C output
+should stream to caller-provided `KitWriter`s. Callers that want owned in-memory
+text can use `kit_writer_mem`.
+
+Proposed core shape:
+
+```c
+typedef struct KitGramCompiled KitGramCompiled;
+
+typedef struct KitGramOptions {
+  KitSlice name; /* optional grammar name override */
+} KitGramOptions;
+
+typedef struct KitGramEmitOptions {
+  KitSlice header_path;
+  KitSlice source_path;
+  KitSlice prefix;
+} KitGramEmitOptions;
+
+KIT_API KitStatus kit_gram_compile_text(const KitContext* ctx,
+                                        KitSlice text, KitSlice path,
+                                        const KitGramOptions* opts,
+                                        KitGramCompiled** out);
+KIT_API void kit_gram_free(KitGramCompiled*);
+
+KIT_API KitStatus kit_gram_emit_c(const KitGramCompiled*,
+                                  const KitGramEmitOptions* opts,
+                                  KitWriter* header, KitWriter* source);
+KIT_API KitStatus kit_gram_dump_sexpr(const KitContext* ctx, KitSlice text,
+                                      KitSlice path,
+                                      const KitGramOptions* opts,
+                                      KitWriter* out);
+```
+
+### Parser Runtime API
+
+| Standalone name | Kit name |
+|-----------------|----------|
+| `ll_tok_kind` | `KitGramTokenKind` |
+| `ll_rule_id` | `KitGramRuleId` |
+| `ll_sem` | `KitGramSem` |
+| `ll_token` | `KitGramToken` |
+| `ll_error` | `KitGramParseError` |
+| `ll_err_action` | `KitGramErrorAction` |
+| `LL_ABORT` | `KIT_GRAM_ERROR_ABORT` |
+| `LL_SKIP` | `KIT_GRAM_ERROR_SKIP` |
+| `LL_RESYNC` | `KIT_GRAM_ERROR_RESYNC` |
+| `ll_actions` | `KitGramActions` |
+| `ll_slot` | `KitGramSlot` |
+| `LL_SLOT_SIZE` | `KIT_GRAM_SLOT_SIZE` |
+| `ll_config` | `KitGramParserConfig` |
+| `ll_grammar` | `KitGramGrammar` |
+| `ll_parser` | `KitGramParser` |
+| `LL_PARSER_SIZE` | `KIT_GRAM_PARSER_SIZE` |
+| `ll_status` | `KitGramParseStatus` |
+| `LL_NEED_MORE` | `KIT_GRAM_PARSE_NEED_MORE` |
+| `LL_PARSE_ACCEPT` | `KIT_GRAM_PARSE_ACCEPT` |
+| `LL_PARSE_ERROR` | `KIT_GRAM_PARSE_ERROR` |
+| `ll_parser_init` | `kit_gram_parser_init` |
+| `ll_parser_push` | `kit_gram_parser_push` |
+| `ll_parser_finish` | `kit_gram_parser_finish` |
+| `ll_parser_result` | `kit_gram_parser_result` |
+| `ll_stack_bounds` | `kit_gram_stack_bounds` |
+
+### Lexer Runtime API
+
+| Standalone name | Kit name |
+|-----------------|----------|
+| `ll_lex_grammar` | `KitGramLexGrammar` |
+| `ll_lexer` | `KitGramLexer` |
+| `LL_LEXER_SIZE` | `KIT_GRAM_LEXER_SIZE` |
+| `ll_lex_config` | `KitGramLexConfig` |
+| `ll_lex_status` | `KitGramLexStatus` |
+| `LL_LEX_TOKEN` | `KIT_GRAM_LEX_TOKEN` |
+| `LL_LEX_NEED_MORE` | `KIT_GRAM_LEX_NEED_MORE` |
+| `LL_LEX_EOF` | `KIT_GRAM_LEX_EOF` |
+| `LL_LEX_ERROR` | `KIT_GRAM_LEX_ERROR` |
+| `ll_lex_error` | `KitGramLexError` |
+| `ll_lexer_init` | `kit_gram_lexer_init` |
+| `ll_lexer_push` | `kit_gram_lexer_push` |
+| `ll_lexer_finish` | `kit_gram_lexer_finish` |
+| `ll_lexer_next` | `kit_gram_lexer_next` |
+| `ll_lexer_error` | `kit_gram_lexer_error` |
+
+### Generated-Code Support API
+
+The generated-code support headers should use Kit names too:
+
+| Standalone name | Kit name |
+|-----------------|----------|
+| `ll_sym_kind` | `KitGramSymKind` |
+| `LL_S_TERM` | `KIT_GRAM_SYM_TERM` |
+| `LL_S_RULE` | `KIT_GRAM_SYM_RULE` |
+| `LL_S_REP` | `KIT_GRAM_SYM_REP` |
+| `LL_S_OPT` | `KIT_GRAM_SYM_OPT` |
+| `ll_sym` | `KitGramSym` |
+| `ll_prod` | `KitGramProd` |
+| `ll_pratt_op` | `KitGramPrattOp` |
+| `ll_pratt` | `KitGramPratt` |
+| `ll_rule` | `KitGramRule` |
+| `LL_TERM` | `KIT_GRAM_TERM` |
+| `LL_RULE` | `KIT_GRAM_RULE` |
+| `ll_lex_accept` | `KitGramLexAccept` |
+| `LL_LEX_DEAD` | `KIT_GRAM_LEX_DEAD` |
+| `LL_LEX_ACCEPT_NONE` | `KIT_GRAM_LEX_ACCEPT_NONE` |
+
+Generated grammar-specific token/rule enums (`TOK_*`, `R_*`, and
+`<prefix>_*`) are user artifacts. They may keep their current shape because they
+are controlled by the generated prefix and are not libkit exports.
+
+## Include Rewrites
+
+The generator should emit installed-style includes:
+
+```c
+/* generated header */
+#include <kit/gram_parse.h>
+#include <kit/gram_lex.h>       /* only when a generated lexer exists */
+
+/* generated source */
+#include "generated_name.h"
+#include <kit/support/gram_parse_tables.h>
+#include <kit/support/gram_lex_tables.h> /* only when needed */
+```
+
+Private `src/gram/*.c` files include `internal.h` and the public headers they
+implement. They must not include driver headers.
+
+## Driver Command
+
+The user-facing command is:
+
+```text
+kit gram [--dump-sexpr] [--prefix PREFIX] [-o OUT.c] [--header OUT.h] grammar.ebnf
+```
+
+The first import should preserve existing behavior:
+
+- Default `.c` output path: replace the input suffix with `.c`.
+- Default `.h` output path: replace the input suffix with `.h`.
+- Default prefix: derive from the input basename and append `_`.
+- `--dump-sexpr`: write the meta-grammar dump to stdout.
+- Exit code `0`: success.
+- Exit code `1`: compile, diagnostic, or I/O failure.
+- Exit code `2`: bad command-line usage.
+
+The driver implementation should use `DriverEnv` for `KitContext`, file reads,
+writer opening, diagnostics, and memory. The command should not call `malloc`,
+`free`, `fopen`, `fprintf`, or `exit` directly.
+
+Driver integration points:
+
+- Add `KIT_TOOL_GRAM_ENABLED` to `include/kit/config.h`.
+- Add `driver/cmd/gram.c`.
+- Add `driver_gram` and `driver_help_gram` to `driver/driver.h`.
+- Add the `gram` row to `driver/main.c`, gated by `KIT_TOOL_GRAM_ENABLED`.
+- Add `$(call tool-cmd,GRAM,gram)` to `mk/driver_srcs.mk`.
+- Keep it in `DRIVER_GROUP_OTHER` for now. It is a developer tool, not a
+  default drop-in binutils/toolchain symlink.
+
+## Build Integration
+
+Library integration points:
+
+- Add `KIT_GRAM_ENABLED` to `include/kit/config.h` as an optional library
+  subsystem.
+- Add `LIB_SRCS_GRAM := $(shell find src/gram -name '*.c' ...)` to
+  `mk/lib_srcs.mk`, and include it only when `KIT_GRAM_ENABLED` is `1`.
+- Add weak public stubs in `src/api/config_stubs.c` for gated-out generator API
+  entry points. Runtime stubs may be omitted if the runtime is considered part
+  of the generated-code ABI and the whole subsystem is always enabled in the
+  default build; if it is gated, public runtime symbols need stubs too.
+- Keep `src/gram/meta_tables.c` checked in. Normal `make lib` must not require
+  Python, network access, UCD regeneration, or a bootstrap `gram` binary.
+- Add regeneration-only maintenance targets later, for example
+  `make regen-gram-meta` and `make regen-gram-unicode-props`.
+
+The first import can gate runtime and generator together under
+`KIT_GRAM_ENABLED`. If size-sensitive embeddings need generated parser runtime
+without the generator compiler, split later into:
+
+- `KIT_GRAM_RUNTIME_ENABLED`: parser/lexer runtime plus support headers.
+- `KIT_GRAM_ENABLED`: generator compiler, depending on runtime.
+
+Do not introduce that split until there is a real embedding that benefits from
+it; it adds config and stub surface.
+
+## Implementation Phases
+
+### Phase 0: Freeze Standalone Behavior
+
+- Run the current `ll1` test suite and save the passing command set in the
+  import notes.
+- Generate and check in `meta_tables.c` / `meta_tables.h` from `meta.ebnf`.
+- Confirm generated output for representative grammars is stable.
+
+### Phase 1: Mechanical Import, Private Names Still Allowed Internally
+
+- Move files into the destinations above.
+- Keep behavior unchanged while fixing include paths.
+- Add `test/gram` fixtures and a `make test-gram` target, initially allowed to
+  fail until the namespace rewrite lands.
+
+### Phase 2: Public Namespace Rewrite
+
+- Rename every public `ll_*`, `llgen_*`, and `LL_*` symbol to `KitGram`,
+  `kit_gram_`, or `KIT_GRAM_` spelling.
+- Update emitted C and generated meta tables to use the new names.
+- Run `make test-lib-deps` to catch leaked public symbols outside `Kit`,
+  `kit_`, or `KIT`.
+
+### Phase 3: Kit Context Rewrite
+
+- Replace `llgen_allocator` with `KitContext.heap`.
+- Replace generated text return structs with `KitWriter` outputs.
+- Replace direct diagnostics with `KitContext.diag`.
+- Remove hosted libc calls from imported library code.
+- Keep `setjmp`/`longjmp` only if the existing frontend panic pattern accepts
+  it for this subsystem; otherwise convert OOM and validation aborts to explicit
+  `KitStatus` unwinding.
+
+### Phase 4: Driver Command
+
+- Port `gen/llgen_cli.c` to `driver/cmd/gram.c`.
+- Use the public generator API only.
+- Add help text consistent with other driver commands.
+- Add a focused driver test: invoke `kit gram` on a small grammar, compile the
+  generated C against libkit, and run the parser.
+
+### Phase 5: Cleanup and Documentation
+
+- Document the stable runtime API in the public headers.
+- Add a durable design doc under `doc/` only after the subsystem ships.
+- Add `gram` to `README.md` and `doc/DESIGN.md` capability lists after the
+  command works.
+- Remove temporary compatibility shims and any retained Python parity path.
+
+## Tests
+
+Targeted tests should land with the import:
+
+- `make test-gram`: direct API compile, table introspection, generated parser,
+  generated lexer, Pratt grammar, UTF-8 lexer, and error fixtures.
+- `make test-driver-gram`: CLI generation and generated-code compile/run smoke.
+- `make test-lib-deps`: symbol discipline and no accidental hosted dependencies.
+
+Map standalone tests as follows:
+
+| Standalone test | Kit test |
+|-----------------|----------|
+| `test/test_llgen_api.c` | `test/gram/api_test.c` |
+| `test/test_calc.c` | `test/gram/calc_test.c` |
+| `test/test_features.c` | `test/gram/features_test.c` |
+| `test/test_lexer.c` | `test/gram/lexer_test.c` |
+| `test/test_pratt_calc.c` | `test/gram/pratt_test.c` |
+| `test/test_unicode_support.c` | `test/gram/unicode_test.c` |
+| `test/test_unicode_lexer.c` | `test/gram/unicode_lexer_test.c` |
+| `test/test_utf8_runtime.c` | `test/gram/utf8_runtime_test.c` |
+| `test/errors/*.ebnf` | `test/gram/errors/*.ebnf` |
+
+Prefer red-green import steps:
+
+1. Add the test target and fixtures first.
+2. Import the runtime until hand-written/generated tables parse again.
+3. Import the generator until direct API tests pass.
+4. Add the driver command and smoke test last.
+
+## Open Questions
+
+- Should generated table-layout headers be documented as stable ABI, or merely
+  stable enough for C emitted by the same libkit version? The first import should
+  promise only same-version compatibility.
+- Should Unicode UCD source data live in-tree permanently, or should only the
+  generated property tables be checked in? Keeping the data improves
+  reproducible regeneration but increases repository size.
+- Should the runtime/generator gate be split immediately? The plan says no until
+  an embedding needs runtime-only size savings.
+- Should `gram` eventually support non-C output modes? The API should not bake
+  in more than `emit_c` today, but the command can grow `--emit=` later.
diff --git a/doc/plan/README.md b/doc/plan/README.md
@@ -20,5 +20,6 @@ shrinks to whatever remains open.
 | [IMAGE_INSPECT.md](IMAGE_INSPECT.md) | Extending object inspection to executables and shared libraries. | [../OBJ.md](../OBJ.md) |
 | [BUILD.md](BUILD.md) | A new content-addressed build coordinator (Bazel/Nix-style incremental builds layered on the CAS) — storage state machine, caching algorithm, recipe protocol. Distinct from `../BUILD.md` (kit's own Makefile build). | — (new subsystem) |
 | [BUILD_COMMANDS.md](BUILD_COMMANDS.md) | The kit-native `build-exe`/`build-lib`/`build-obj` verbs that replace `compile`: polyglot, in-memory compile+link with `--group` flag scoping and full link-flag control. Distinct from `BUILD.md` (the CAS coordinator). | [../DRIVER.md](../DRIVER.md) |
+| [LLGEN_IMPORT.md](LLGEN_IMPORT.md) | Importing the standalone LL(1)/Pratt parser and lexer generator into libkit, including public API renames, file moves, build gates, and a `kit gram` command. | — |
 | [BACKTRACE.md](BACKTRACE.md) | Stack-trace support: GCC-compatible `__builtin_return_address`/`__builtin_frame_address` primitives, a freestanding `__kit_backtrace` capture helper, and symbolized backtrace printing. | [../FRONTENDS.md](../FRONTENDS.md), [../RUNTIME.md](../RUNTIME.md), [../DWARF.md](../DWARF.md) |
 | [TODO.md](TODO.md) | Open deferred fixes and code smells only. Completed items are removed instead of checked off. Not a roadmap; a current backlog. | — |

	git clone https://git.ryansepassi.com/git/kit.git