kit

git clone https://git.ryansepassi.com/git/kit.git

commit 22aa8e4e433beeeada6d84d710d873edca1d2f25
parent 9fd3adda7106605e23b78d7c85f1237f1ea67801
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Sat,  6 Jun 2026 05:17:20 -0700

docs: plan llgen import

Diffstat:
A doc/plan/LLGEN_IMPORT.md | 389 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M doc/plan/README.md | 1 +
2 files changed, 390 insertions(+), 0 deletions(-)

diff --git a/doc/plan/LLGEN_IMPORT.md b/doc/plan/LLGEN_IMPORT.md
@@ -0,0 +1,389 @@
+# LLGEN Import Plan
+
+This is the import plan for folding `/Users/ryan/code/ll1` into libkit and
+exposing it through the kit driver. The goal is not to make `ll1` a language
+frontend. It should land as a reusable parser/lexer-generator subsystem: EBNF in,
+parse and lexer tables out, with allocation-free push runtimes for generated
+parsers and lexers.
+
+The standalone project already has the right broad shape for libkit: explicit
+allocation, no hidden CLI dependency in the generator API, immutable generated
+tables, and caller-owned runtime state. The import work is mostly namespace,
+host-boundary, build-gating, and file-layout discipline.
+
+## Decision Summary
+
+- Driver command: `llgen`.
+- Public generator API header: `<kit/llgen.h>`.
+- Public parser runtime header: `<kit/llparse.h>`.
+- Public lexer runtime header: `<kit/lllex.h>`.
+- Generated-code support headers: `<kit/support/llparse_tables.h>` and
+  `<kit/support/lllex_tables.h>`.
+- Library subsystem gate: `KIT_LLGEN_ENABLED`.
+- Driver tool gate: `KIT_TOOL_LLGEN_ENABLED`.
+- Implementation directory: `src/llgen/`.
+- Tests: `test/llgen/` plus a driver smoke lane.
+
+`ll1` remains a source repository/project name only. It should not appear in the
+libkit API, installed command name, or user-facing help.
+
+## Boundaries
+
+`llgen` is a library subsystem, not a `lang/` frontend. It does not register a
+`KitFrontendVTable`, does not emit `KitCg`, and does not participate in
+source-to-object compilation unless a future frontend chooses to use it
+internally. Its public surface is consumed by embedders, the driver command, and
+generated C files.
+
+The driver remains the only hosted layer. File reads, directory creation, stdout,
+stderr, and CLI allocation policy live in `driver/cmd/llgen.c` and the hosted
+driver environment. The library accepts input as `KitSlice`, writes output to
+`KitWriter`, allocates through `KitContext.heap`, and reports diagnostics through
+`KitContext.diag`.
+
+Generated grammar tables are immutable static data and are fine as globals.
+Mutable parser, lexer, and generator state must hang off caller-owned runtime
+objects or explicit libkit handles.
+
+## File Moves
+
+Import the C implementation, generated meta tables, runtime, and tests. Do not
+make the Python generator part of the normal libkit build.
+
+| Current file | Kit destination | Notes |
+|--------------|-----------------|-------|
+| `include/llgen.h` | `include/kit/llgen.h` | Public generator API, renamed to Kit types and functions. |
+| `include/llparse.h` | `include/kit/llparse.h` | Public parser runtime API. |
+| `include/lllex.h` | `include/kit/lllex.h` | Public lexer runtime API. |
+| `include/llparse_tables.h` | `include/kit/support/llparse_tables.h` | Public generated-code support, not ordinary embedder API. |
+| `include/lllex_tables.h` | `include/kit/support/lllex_tables.h` | Public generated-code support, not ordinary embedder API. |
+| `include/llunicode.h` | `src/llgen/unicode.h` | Private helper; expose later as `<kit/unicode.h>` only if there is a broader API need. |
+| `include/llunicode_props.h` | `src/llgen/unicode_props.h` | Private generated Unicode property resolver. |
+| `runtime/llparse.c` | `src/llgen/parse_runtime.c` | Implements `<kit/llparse.h>`. |
+| `runtime/lllex.c` | `src/llgen/lex_runtime.c` | Implements `<kit/lllex.h>`. |
+| `runtime/llunicode.c` | `src/llgen/unicode.c` | Private Unicode helpers for generator and lexer runtime. |
+| `runtime/llunicode_props.c` | `src/llgen/unicode_props.c` | Checked-in generated property tables. |
+| `gen/llgen.c` | `src/llgen/generator.c` | Public API implementation and C emitter after Kit context rewrite. |
+| `gen/llgen_ll1.c` | `src/llgen/ll1.c` | LL(1), Pratt, FIRST/FOLLOW, and validation. |
+| `gen/llgen_lex_byte.c` | `src/llgen/lex_byte.c` | Byte-mode lexer compiler. |
+| `gen/llgen_lex_unicode.c` | `src/llgen/lex_unicode.c` | Unicode-mode lexer compiler. |
+| `gen/llgen_internal.h` | `src/llgen/internal.h` | Private to `src/llgen/*.c`. |
+| `gen/meta.ebnf` | `src/llgen/meta.ebnf` | Source grammar for the generator's own parser; not consumed by normal builds. |
+| generated `meta` `.c/.h` | `src/llgen/meta_tables.c`, `src/llgen/meta_tables.h` | Check in generated tables so libkit does not need Python or a previous `llgen` to build. |
+| `gen/llgen_cli.c` | `driver/cmd/llgen.c` | Rewrite as a Kit driver command using public `<kit/llgen.h>`. |
+| `tools/gen_unicode_props.py` | `scripts/gen_llgen_unicode_props.py` | Regeneration helper only; not in the build. |
+| `data/ucd/17.0.0/*` | `data/ucd/17.0.0/*` or `test/llgen/ucd/17.0.0/*` | Keep if we want reproducible Unicode-table regeneration in-tree. |
+| `test/*.c`, `test/*.ebnf` | `test/llgen/*` | Library tests and fixtures. |
+| `test/errors/*` | `test/llgen/errors/*` | Error fixtures. |
+
+`gen/llgen.py` should not be imported into libkit. If parity against the Python
+reference is still useful during the transition, keep it temporarily as
+`scripts/llgen_ref.py` and exclude it from release/build dependencies. Delete it
+once the imported C generator is trusted.
+
+## Public Renames
+
+The standalone `ll_*` and `llgen_*` names are too short for libkit and would
+violate the public symbol discipline. Public definitions must use `Kit`,
+`kit_`, or `KIT_`.
+
+### Generator API
+
+| Standalone name | Kit name |
+|-----------------|----------|
+| `llgen_options` | `KitLlgenOptions` |
+| `llgen_compiled` | `KitLlgenCompiled` |
+| `llgen_codegen` | Remove or narrow; prefer `KitWriter` outputs. |
+| `llgen_compile_text` | `kit_llgen_compile_text` |
+| `llgen_compiled_free` | `kit_llgen_free` |
+| `llgen_dump_sexpr_text` | `kit_llgen_dump_sexpr` |
+| `llgen_generate_c` | `kit_llgen_emit_c` |
+| `llgen_parser_grammar` | `kit_llgen_parser_grammar` |
+| `llgen_lexer_grammar` | `kit_llgen_lexer_grammar` |
+| `llgen_token_count` | `kit_llgen_token_count` |
+| `llgen_token_name` | `kit_llgen_token_name` |
+| `llgen_token_display` | `kit_llgen_token_display` |
+| `llgen_find_token` | `kit_llgen_find_token` |
+| `llgen_rule_count` | `kit_llgen_rule_count` |
+| `llgen_rule_name` | `kit_llgen_rule_name` |
+| `llgen_find_rule` | `kit_llgen_find_rule` |
+
+The generator API should take a `const KitContext*` and `KitSlice` inputs. It
+should not expose `llgen_allocator`; the allocator becomes `ctx->heap`. C output
+should stream to caller-provided `KitWriter`s. Callers that want owned in-memory
+text can use `kit_writer_mem`.
+
+Proposed core shape:
+
+```c
+typedef struct KitLlgenCompiled KitLlgenCompiled;
+
+typedef struct KitLlgenOptions {
+  KitSlice name; /* optional grammar name override */
+} KitLlgenOptions;
+
+typedef struct KitLlgenEmitOptions {
+  KitSlice header_path;
+  KitSlice source_path;
+  KitSlice prefix;
+} KitLlgenEmitOptions;
+
+KIT_API KitStatus kit_llgen_compile_text(const KitContext* ctx,
+                                         KitSlice text, KitSlice path,
+                                         const KitLlgenOptions* opts,
+                                         KitLlgenCompiled** out);
+KIT_API void kit_llgen_free(KitLlgenCompiled*);
+
+KIT_API KitStatus kit_llgen_emit_c(const KitLlgenCompiled*,
+                                   const KitLlgenEmitOptions* opts,
+                                   KitWriter* header, KitWriter* source);
+KIT_API KitStatus kit_llgen_dump_sexpr(const KitContext* ctx, KitSlice text,
+                                       KitSlice path,
+                                       const KitLlgenOptions* opts,
+                                       KitWriter* out);
+```
+
+### Parser Runtime API
+
+| Standalone name | Kit name |
+|-----------------|----------|
+| `ll_tok_kind` | `KitLlTokenKind` |
+| `ll_rule_id` | `KitLlRuleId` |
+| `ll_sem` | `KitLlSem` |
+| `ll_token` | `KitLlToken` |
+| `ll_error` | `KitLlParseError` |
+| `ll_err_action` | `KitLlErrorAction` |
+| `LL_ABORT` | `KIT_LL_ERROR_ABORT` |
+| `LL_SKIP` | `KIT_LL_ERROR_SKIP` |
+| `LL_RESYNC` | `KIT_LL_ERROR_RESYNC` |
+| `ll_actions` | `KitLlActions` |
+| `ll_slot` | `KitLlSlot` |
+| `LL_SLOT_SIZE` | `KIT_LL_SLOT_SIZE` |
+| `ll_config` | `KitLlParserConfig` |
+| `ll_grammar` | `KitLlGrammar` |
+| `ll_parser` | `KitLlParser` |
+| `LL_PARSER_SIZE` | `KIT_LL_PARSER_SIZE` |
+| `ll_status` | `KitLlParseStatus` |
+| `LL_NEED_MORE` | `KIT_LL_PARSE_NEED_MORE` |
+| `LL_PARSE_ACCEPT` | `KIT_LL_PARSE_ACCEPT` |
+| `LL_PARSE_ERROR` | `KIT_LL_PARSE_ERROR` |
+| `ll_parser_init` | `kit_ll_parser_init` |
+| `ll_parser_push` | `kit_ll_parser_push` |
+| `ll_parser_finish` | `kit_ll_parser_finish` |
+| `ll_parser_result` | `kit_ll_parser_result` |
+| `ll_stack_bounds` | `kit_ll_stack_bounds` |
+
+### Lexer Runtime API
+
+| Standalone name | Kit name |
+|-----------------|----------|
+| `ll_lex_grammar` | `KitLlLexGrammar` |
+| `ll_lexer` | `KitLlLexer` |
+| `LL_LEXER_SIZE` | `KIT_LL_LEXER_SIZE` |
+| `ll_lex_config` | `KitLlLexConfig` |
+| `ll_lex_status` | `KitLlLexStatus` |
+| `LL_LEX_TOKEN` | `KIT_LL_LEX_TOKEN` |
+| `LL_LEX_NEED_MORE` | `KIT_LL_LEX_NEED_MORE` |
+| `LL_LEX_EOF` | `KIT_LL_LEX_EOF` |
+| `LL_LEX_ERROR` | `KIT_LL_LEX_ERROR` |
+| `ll_lex_error` | `KitLlLexError` |
+| `ll_lexer_init` | `kit_ll_lexer_init` |
+| `ll_lexer_push` | `kit_ll_lexer_push` |
+| `ll_lexer_finish` | `kit_ll_lexer_finish` |
+| `ll_lexer_next` | `kit_ll_lexer_next` |
+| `ll_lexer_error` | `kit_ll_lexer_error` |
+
+### Generated-Code Support API
+
+The generated-code support headers should use Kit names too:
+
+| Standalone name | Kit name |
+|-----------------|----------|
+| `ll_sym_kind` | `KitLlSymKind` |
+| `LL_S_TERM` | `KIT_LL_SYM_TERM` |
+| `LL_S_RULE` | `KIT_LL_SYM_RULE` |
+| `LL_S_REP` | `KIT_LL_SYM_REP` |
+| `LL_S_OPT` | `KIT_LL_SYM_OPT` |
+| `ll_sym` | `KitLlSym` |
+| `ll_prod` | `KitLlProd` |
+| `ll_pratt_op` | `KitLlPrattOp` |
+| `ll_pratt` | `KitLlPratt` |
+| `ll_rule` | `KitLlRule` |
+| `LL_TERM` | `KIT_LL_TERM` |
+| `LL_RULE` | `KIT_LL_RULE` |
+| `ll_lex_accept` | `KitLlLexAccept` |
+| `LL_LEX_DEAD` | `KIT_LL_LEX_DEAD` |
+| `LL_LEX_ACCEPT_NONE` | `KIT_LL_LEX_ACCEPT_NONE` |
+
+Generated grammar-specific token/rule enums (`TOK_*`, `R_*`, and
+`<prefix>_*`) are user artifacts. They may keep their current shape because they
+are controlled by the generated prefix and are not libkit exports.
+
+## Include Rewrites
+
+The generator should emit installed-style includes:
+
+```c
+/* generated header */
+#include <kit/llparse.h>
+#include <kit/lllex.h> /* only when a generated lexer exists */
+
+/* generated source */
+#include "generated_name.h"
+#include <kit/support/llparse_tables.h>
+#include <kit/support/lllex_tables.h> /* only when needed */
+```
+
+Private `src/llgen/*.c` files include `internal.h` and the public headers they
+implement. They must not include driver headers.
+
+## Driver Command
+
+The user-facing command is:
+
+```text
+kit llgen [--dump-sexpr] [--prefix PREFIX] [-o OUT.c] [--header OUT.h] grammar.ebnf
+```
+
+The first import should preserve existing behavior:
+
+- Default `.c` output path: replace the input suffix with `.c`.
+- Default `.h` output path: replace the input suffix with `.h`.
+- Default prefix: derive from the input basename and append `_`.
+- `--dump-sexpr`: write the meta-grammar dump to stdout.
+- Exit code `0`: success.
+- Exit code `1`: compile, diagnostic, or I/O failure.
+- Exit code `2`: bad command-line usage.
+
+The driver implementation should use `DriverEnv` for `KitContext`, file reads,
+writer opening, diagnostics, and memory. The command should not call `malloc`,
+`free`, `fopen`, `fprintf`, or `exit` directly.
+
+Driver integration points:
+
+- Add `KIT_TOOL_LLGEN_ENABLED` to `include/kit/config.h`.
+- Add `driver/cmd/llgen.c`.
+- Add `driver_llgen` and `driver_help_llgen` to `driver/driver.h`.
+- Add the `llgen` row to `driver/main.c`, gated by `KIT_TOOL_LLGEN_ENABLED`.
+- Add `$(call tool-cmd,LLGEN,llgen)` to `mk/driver_srcs.mk`.
+- Keep it in `DRIVER_GROUP_OTHER` for now. It is a developer tool, not a
+  default drop-in binutils/toolchain symlink.
+
+## Build Integration
+
+Library integration points:
+
+- Add `KIT_LLGEN_ENABLED` to `include/kit/config.h` as an optional library
+  subsystem.
+- Add `LIB_SRCS_LLGEN := $(shell find src/llgen -name '*.c' ...)` to
+  `mk/lib_srcs.mk`, and include it only when `KIT_LLGEN_ENABLED` is `1`.
+- Add weak public stubs in `src/api/config_stubs.c` for gated-out generator API
+  entry points. Runtime stubs may be omitted if the runtime is considered part
+  of the generated-code ABI and the whole subsystem is always enabled in the
+  default build; if it is gated, public runtime symbols need stubs too.
+- Keep `src/llgen/meta_tables.c` checked in. Normal `make lib` must not require
+  Python, network access, UCD regeneration, or a bootstrap `llgen` binary.
+- Add a regeneration-only maintenance target later, for example
+  `make regen-llgen-meta` and `make regen-llgen-unicode-props`.
+
+The first import can gate runtime and generator together under
+`KIT_LLGEN_ENABLED`. If size-sensitive embeddings need generated parser runtime
+without the generator compiler, split later into:
+
+- `KIT_LLGEN_RUNTIME_ENABLED`: parser/lexer runtime plus support headers.
+- `KIT_LLGEN_ENABLED`: generator compiler, depending on runtime.
+
+Do not introduce that split until there is a real embedding that benefits from
+it; it adds config and stub surface.
+
+## Implementation Phases
+
+### Phase 0: Freeze Standalone Behavior
+
+- Run the current `ll1` test suite and save the passing command set in the
+  import notes.
+- Generate and check in `meta_tables.c` / `meta_tables.h` from `meta.ebnf`.
+- Confirm generated output for representative grammars is stable.
+
+### Phase 1: Mechanical Import, Private Names Still Allowed Internally
+
+- Move files into the destinations above.
+- Keep behavior unchanged while fixing include paths.
+- Add `test/llgen` fixtures and a `make test-llgen` target, initially allowed to
+  fail until the namespace rewrite lands.
+
+### Phase 2: Public Namespace Rewrite
+
+- Rename every public `ll_*`, `llgen_*`, and `LL_*` symbol to Kit spelling.
+- Update emitted C and generated meta tables to use the new names.
+- Run `make test-lib-deps` to catch leaked public symbols outside `Kit`,
+  `kit_`, or `KIT`.
+
+### Phase 3: Kit Context Rewrite
+
+- Replace `llgen_allocator` with `KitContext.heap`.
+- Replace generated text return structs with `KitWriter` outputs.
+- Replace direct diagnostics with `KitContext.diag`.
+- Remove hosted libc calls from imported library code.
+- Keep `setjmp`/`longjmp` only if the existing frontend panic pattern accepts
+  it for this subsystem; otherwise convert OOM and validation aborts to explicit
+  `KitStatus` unwinding.
+
+### Phase 4: Driver Command
+
+- Port `gen/llgen_cli.c` to `driver/cmd/llgen.c`.
+- Use the public generator API only.
+- Add help text consistent with other driver commands.
+- Add a focused driver test: invoke `kit llgen` on a small grammar, compile the
+  generated C against libkit, and run the parser.
+
+### Phase 5: Cleanup and Documentation
+
+- Document the stable runtime API in the public headers.
+- Add a durable design doc under `doc/` only after the subsystem ships.
+- Add `llgen` to `README.md` and `doc/DESIGN.md` capability lists after the
+  command works.
+- Remove temporary compatibility shims and any retained Python parity path.
+
+## Tests
+
+Targeted tests should land with the import:
+
+- `make test-llgen`: direct API compile, table introspection, generated parser,
+  generated lexer, Pratt grammar, UTF-8 lexer, and error fixtures.
+- `make test-driver-llgen`: CLI generation and generated-code compile/run smoke.
+- `make test-lib-deps`: symbol discipline and no accidental hosted dependencies.
+
+Map standalone tests as follows:
+
+| Standalone test | Kit test |
+|-----------------|----------|
+| `test/test_llgen_api.c` | `test/llgen/api_test.c` |
+| `test/test_calc.c` | `test/llgen/calc_test.c` |
+| `test/test_features.c` | `test/llgen/features_test.c` |
+| `test/test_lexer.c` | `test/llgen/lexer_test.c` |
+| `test/test_pratt_calc.c` | `test/llgen/pratt_test.c` |
+| `test/test_unicode_support.c` | `test/llgen/unicode_test.c` |
+| `test/test_unicode_lexer.c` | `test/llgen/unicode_lexer_test.c` |
+| `test/test_utf8_runtime.c` | `test/llgen/utf8_runtime_test.c` |
+| `test/errors/*.ebnf` | `test/llgen/errors/*.ebnf` |
+
+Prefer red-green import steps:
+
+1. Add the test target and fixtures first.
+2. Import the runtime until hand-written/generated tables parse again.
+3. Import the generator until direct API tests pass.
+4. Add the driver command and smoke test last.
+
+## Open Questions
+
+- Should generated table-layout headers be documented as stable ABI, or merely
+  stable enough for C emitted by the same libkit version? The first import should
+  promise only same-version compatibility.
+- Should Unicode UCD source data live in-tree permanently, or should only the
+  generated property tables be checked in? Keeping the data improves
+  reproducible regeneration but increases repository size.
+- Should the runtime/generator gate be split immediately? The plan says no until
+  an embedding needs runtime-only size savings.
+- Should `llgen` eventually support non-C output modes? The API should not bake
+  in more than `emit_c` today, but the command can grow `--emit=` later.
diff --git a/doc/plan/README.md b/doc/plan/README.md
@@ -20,5 +20,6 @@ shrinks to whatever remains open.
 | [IMAGE_INSPECT.md](IMAGE_INSPECT.md) | Extending object inspection to executables and shared libraries. | [../OBJ.md](../OBJ.md) |
 | [BUILD.md](BUILD.md) | A new content-addressed build coordinator (Bazel/Nix-style incremental builds layered on the CAS) — storage state machine, caching algorithm, recipe protocol. Distinct from `../BUILD.md` (kit's own Makefile build). | — (new subsystem) |
 | [BUILD_COMMANDS.md](BUILD_COMMANDS.md) | The kit-native `build-exe`/`build-lib`/`build-obj` verbs that replace `compile`: polyglot, in-memory compile+link with `--group` flag scoping and full link-flag control. Distinct from `BUILD.md` (the CAS coordinator). | [../DRIVER.md](../DRIVER.md) |
+| [LLGEN_IMPORT.md](LLGEN_IMPORT.md) | Importing the standalone LL(1)/Pratt parser and lexer generator into libkit, including public API renames, file moves, build gates, and a `kit llgen` command. | — |
 | [BACKTRACE.md](BACKTRACE.md) | Stack-trace support: GCC-compatible `__builtin_return_address`/`__builtin_frame_address` primitives, a freestanding `__kit_backtrace` capture helper, and symbolized backtrace printing. | [../FRONTENDS.md](../FRONTENDS.md), [../RUNTIME.md](../RUNTIME.md), [../DWARF.md](../DWARF.md) |
 | [TODO.md](TODO.md) | Open deferred fixes and code smells only. Completed items are removed instead of checked off. Not a roadmap; a current backlog. | — |
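
Note on the plan's "Driver Command" defaults: the rules "replace the input suffix with `.c`/`.h`" and "derive the prefix from the input basename and append `_`" could be sketched roughly as below. This is only an illustration, not code from the commit. The names `demo_replace_suffix` and `demo_default_prefix` are hypothetical, and several details are assumptions the plan does not state: the suffix is taken to be everything from the last `.` in the basename, a leading-dot name such as `.ebnf` is treated as having no suffix, the prefix drops the suffix before appending `_`, and no identifier sanitizing is done. The helpers write into caller-provided buffers, in keeping with the plan's rule that the command should not call `malloc` directly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return a pointer to the basename part of `path`. */
static const char *demo_basename(const char *path)
{
    const char *slash = strrchr(path, '/');
    return slash ? slash + 1 : path;
}

/* Hypothetical helper: length of the basename without its suffix.
 * Assumption: a leading dot (".ebnf") does not start a suffix. */
static size_t demo_stem_len(const char *base)
{
    const char *dot = strrchr(base, '.');
    return (dot && dot != base) ? (size_t)(dot - base) : strlen(base);
}

/* Copy `path` into `out` with its suffix replaced by `ext` (e.g. ".c").
 * Returns 0 on success, -1 if `out` (capacity `cap`) is too small. */
static int demo_replace_suffix(const char *path, const char *ext,
                               char *out, size_t cap)
{
    assert(path && ext && out);
    const char *base = demo_basename(path);
    size_t keep = (size_t)(base - path) + demo_stem_len(base);
    size_t ext_len = strlen(ext);
    if (keep + ext_len + 1 > cap)
        return -1;
    memcpy(out, path, keep);
    memcpy(out + keep, ext, ext_len + 1); /* includes the terminating NUL */
    return 0;
}

/* Write the default symbol prefix for `path` into `out`: basename stem
 * followed by '_'. Returns 0 on success, -1 if `out` is too small. */
static int demo_default_prefix(const char *path, char *out, size_t cap)
{
    assert(path && out);
    const char *base = demo_basename(path);
    size_t stem = demo_stem_len(base);
    if (stem + 2 > cap)
        return -1;
    memcpy(out, base, stem);
    out[stem] = '_';
    out[stem + 1] = '\0';
    return 0;
}
```

Under these assumptions, `grammars/calc.ebnf` would default to `grammars/calc.c`, `grammars/calc.h`, and the prefix `calc_`; a too-small buffer is reported to the caller, which a driver could map to the plan's exit code `1`.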