Debugger, Debug Info, and Profiling (planned work)
This roadmap consolidates the remaining work across the interactive JIT
debugger (kit dbg), the DWARF producer/consumer, and the not-yet-built
sampling profiler (kit prof). Designs live one level up:
../DBG.md covers the KitJitSession architecture, the
KitDbgOs host vtable, software breakpoints, and displaced single-step;
../DWARF.md covers the producer pipeline and the
kit_dwarf_* consumer surface. This document is forward-looking: it states
the baseline only as a starting point, then enumerates the open gaps, their
rationale, and the next steps. Shipped items are noted as "done (baseline)".
Baseline
What already works and is not re-planned here:
- The JIT debugger session (src/dbg/session.c, bp.c, mem.c, step.c, displaced.c) is real: worker thread, park/unpark, fault classification, refcounted software breakpoints with a read overlay, guarded memory access, displaced single-step, and the STEP_LINE / NEXT_LINE / STEP_OUT state machines. Done (baseline).
- Displaced-step lifter implementations exist for all three backends. The lifter is arch-neutral (dbg_displaced_prepare drives an ArchDbgOps vtable); each backend ships build_displaced_shim + decode_insn: src/arch/aa64/dbg.c, src/arch/x64/dbg.c (INT3 + RIP-relative/rel8/rel32 fixups), and src/arch/rv64/dbg.c (EBREAK + AUIPC/JAL/branch fixups). Done (baseline).
- Session-level integration of displaced-step is complete only on aarch64 hosts. The x64 and rv64 backends decode and fix up instructions correctly in isolation, but their end-to-end REPL session loop (fault classification, trap PC normalization, register marshalling) has not been validated; closing that gap is §1 below. Done (baseline) for aarch64 only.
- The POSIX host adapter (driver/env/posix_dbg.c, macOS/Linux/FreeBSD ucontext marshalling) and the Windows host adapter (driver/env/windows.c: g_dbg_os_win with AddVectoredExceptionHandler, a Set/GetThreadContext interrupt path, and a __try/__except guarded copy) are both wired. Done (baseline).
- The DWARF producer (src/debug/debug*.c: abbrev/form/emit, line program, type DIEs, .eh_frame CFI) and consumer (src/debug/dwarf_*.c: open/line/die/type/loc/query/cfi) are implemented and tested via test-dwarf / test-debug, including multi-input kit_jit_view, kit_dwarf_line_to_addr suffix matching, and graceful "no debug info for this frame" degradation. Done (baseline).
The sections below are the work that remains.
1. Bring x64 / rv64 debug sessions to full parity
The non-aarch64 lifters exist (per baseline) but test/dbg/run.sh self-skips
on any host that is not aarch64. The remaining work is proving the non-aa64
sessions are real, not writing new lifters.
- Run the full test/dbg transcript suite against x64 and rv64 hosts and fix whatever the session-level integration surfaces (fault classification, trap PC normalization, register marshalling). The backends decode and fix up instructions correctly in isolation (test-x64-dbg); the gap is the session/host loop around them.
- Decide and implement the test-dbg host policy: extend the Darwin/Linux + arm64/aarch64 allow-list in test/dbg/run.sh to admit x64 and rv64 once green, and decide whether test-dbg self-skips or hard-fails when the compiled backend/host cannot support a session at all.
- Remove the non-aarch64 degraded-mode warning in driver/cmd/dbg.c once x64 and rv64 sessions are real, not just present.
2. Displaced single-step: remaining instruction coverage
The lifters cover the common PC-relative families per arch. One known decline remains, plus general hardening:
- aarch64 LDR (literal) vector forms (S/D/Q destinations) are declined today (src/arch/aa64/dbg.c returns unsupported when V==1). These are common in optimized and FP-heavy builds; synthesize the same indirect-load shim used for the integer/LDRSW forms but targeting a vector register. WHY: stepping otherwise fails at any FP literal load.
- Audit the x64 and rv64 lifters for analogous declined forms surfaced by §1's parity testing, and either lift them or document the decline with a clear KIT_UNSUPPORTED path so the session degrades to "cannot step here" rather than misbehaving.
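
To make the aarch64 decline concrete, the sketch below classifies an LDR (literal) encoding and reports whether it is a vector form (V==1). It is an illustration only: ldr_lit_decode and LdrLitInfo are hypothetical names, not the ones used in src/arch/aa64/dbg.c, and the bit layout is taken from the A64 load-register-literal encoding class.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type; not the struct used by src/arch/aa64/dbg.c. */
typedef struct {
    bool     is_vector; /* V bit (bit 26): destination is an S/D/Q register */
    unsigned opc;       /* bits 31:30: size selector within the class */
    unsigned rt;        /* bits 4:0: destination register number */
    uint64_t target;    /* absolute address of the literal */
} LdrLitInfo;

/* Returns true if insn belongs to the A64 load-register-literal class.
 * Note that PRFM (literal) also lives in this class (opc == 3, V == 0). */
static bool ldr_lit_decode(uint32_t insn, uint64_t pc, LdrLitInfo *out)
{
    assert(out != NULL);
    /* Class match: bits 29:27 == 011 and bits 25:24 == 00. */
    if ((insn & 0x3B000000u) != 0x18000000u)
        return false;
    /* imm19 sits in bits 23:5; sign-extend it, then scale by 4. */
    int32_t imm19 = (int32_t)((insn >> 5) & 0x7FFFFu);
    if (imm19 & 0x40000)
        imm19 -= 0x80000;
    out->is_vector = ((insn >> 26) & 1u) != 0;
    out->opc = insn >> 30;
    out->rt = insn & 0x1Fu;
    out->target = pc + (uint64_t)((int64_t)imm19 * 4);
    return true;
}
```

A lifter built along these lines could decline only when is_vector is set and no vector shim exists yet, which keeps the integer and LDRSW paths unaffected while the vector shim is being written.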
3. Direct dbg unit + smoke tests
Much verification has gone through transcript tests; the low-level primitives still lack focused unit coverage. Following red-green TDD (see ../TESTING.md):
- test/dbg/bp_patch_roundtrip: install/clear at one address; assert byte restore, refcount behavior, and the dbg_bp_unpatch_read overlay.
- test/dbg/displaced_*: one canned encoding from every PC-relative family, per arch; assert shim bytes + literal-pool layout. (x64 has test-x64-dbg; give aa64 and rv64 the same.)
- test/dbg/guarded_copy_segv: read_mem from NULL returns nonzero and the worker survives the next resume.
- test/smoke/dbg_hello: scripted REPL against a JIT'd C source with a golden-transcript diff (b sym, r, c, s, x ADDR, p NAME, q).
- test/dbg/source_step: scripted n / step / finish, asserting the reported source line at each stop.
- Make session teardown explicit enough to test stopping while the worker is parked. Note: kit_jit_session_free deliberately leaks a worker parked inside the signal handler (no async-safe unwind), so tests must account for this rather than expect a clean join.
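
The invariants a bp_patch_roundtrip test would check can be modelled in a few lines. The sketch below is a toy stand-in, not the code in src/dbg/bp.c: every toy_* name is hypothetical, the breakpoint table is a fixed array, and 0xCC (x64 INT3) is used only as an example trap byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BP_MAX 8
#define TOY_TRAP   0xCCu /* x64 INT3, chosen only for illustration */

typedef struct { size_t off; uint8_t saved; unsigned refs; } ToyBp;
typedef struct {
    uint8_t *code; size_t len;        /* patched code region */
    ToyBp bps[TOY_BP_MAX]; size_t n;  /* live breakpoints */
} ToyBpTable;

static ToyBp *toy_bp_find(ToyBpTable *t, size_t off)
{
    for (size_t i = 0; i < t->n; i++)
        if (t->bps[i].off == off) return &t->bps[i];
    return NULL;
}

/* First install patches the byte; later installs only bump the refcount. */
static int toy_bp_install(ToyBpTable *t, size_t off)
{
    assert(off < t->len);
    ToyBp *bp = toy_bp_find(t, off);
    if (bp) { bp->refs++; return 0; }
    if (t->n == TOY_BP_MAX) return -1;
    bp = &t->bps[t->n++];
    bp->off = off; bp->saved = t->code[off]; bp->refs = 1;
    t->code[off] = TOY_TRAP;
    return 0;
}

/* Last clear restores the original byte and frees the slot. */
static int toy_bp_clear(ToyBpTable *t, size_t off)
{
    ToyBp *bp = toy_bp_find(t, off);
    if (!bp) return -1;
    if (--bp->refs == 0) {
        t->code[off] = bp->saved;
        *bp = t->bps[--t->n]; /* move the last slot into the hole */
    }
    return 0;
}

/* Read overlay: copy memory, then hide trap bytes behind the saved originals. */
static void toy_bp_read(const ToyBpTable *t, size_t off, uint8_t *dst, size_t n)
{
    assert(off <= t->len && n <= t->len - off);
    memcpy(dst, t->code + off, n);
    for (size_t i = 0; i < t->n; i++) {
        size_t b = t->bps[i].off;
        if (b >= off && b - off < n) dst[b - off] = t->bps[i].saved;
    }
}
```

A unit test in this style installs twice at one offset, checks that a read through the overlay still shows the original byte, then clears twice and asserts that the byte is restored only on the second clear.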
4. REPL polish and machine-readable mode
Shared REPL work that improves usability and unblocks tooling/IDE frontends:
- Repeat the last stepping command on a blank line.
- Add memory-format variants for x (bytes, words, strings, pointers).
- Add a stable machine-readable transcript mode for tests and external tools, keeping command parsing factored so an editor/IDE frontend can reuse the command engine without scraping human output. WHY: the REPL is the only programmatic entry to the session; scraping is brittle.
- Add Ctrl-C / interrupt transcript coverage where the host can test it reliably (the interrupt path itself is wired on both POSIX and Windows).
5. Toy and C REPL frontends
The Toy frontend drives the debugger as the first REPL language; C support is the larger follow-on. Design detail lives in ../FRONTENDS.md.
Toy result formatting and structured values:
- Typed result formatting (and pretty-printing via type info) instead of always rendering scalar u64/i64 hex.
- Richer expression-thunk signatures so expressions can accept and return non-integer values: pointers, floats, records, arrays/slices (the toy-structured-expr red transcript is the spec).
- More readable diagnostics that keep the REPL usable after bad input, plus better multi-line / unmatched-brace handling, and stable synthetic file names so list and file:line breakpoints are predictable across runs.
C as a REPL language (after the Toy experience is solid):
- Teach the C frontend KIT_FRONTEND_INPUT_REPL_EXPR / KIT_FRONTEND_INPUT_REPL_BLOCK, and preserve C declarations across snippets without leaking frontend internals into the driver.
- Support C function calls through normal REPL expressions (no separate call command), infer result types and print typed values, allow thunks to refer to stopped-frame locals where feasible, and add transcript tests once Toy is stable.
6. DWARF producer/consumer gaps
Producer and consumer are colocated under src/debug/ but share only the wire
format (dwarf_defs.h); that boundary must hold for any new work. The
remaining gaps:
- Loclists for optimized code (producer-only). The consumer already resolves DW_FORM_loclistx against .debug_loclists (dw_loclist_resolve in src/debug/dwarf_loc.c), and the producer models time-varying locations as DVL_LOCLIST via debug_loclist_new/_add (src/debug/debug.c), but those producer entry points are placeholders that do not serialize a .debug_loclists section. Realize the serialization so a variable's location can vary by PC range. WHY: without it, -O1/-O2 builds lose variable locations the moment a value moves between register and frame slot, which is exactly where a debugger is most needed.
- Richer CFI register recovery. The unwinder (src/debug/dwarf_cfi.c) computes the caller CFA/PC and the return address but does not do CFA-relative loads to recover arbitrary callee-saved registers, which are needed to show correct register values in outer frames.
- Composite locations. Once opt generates split values, the loc-expr evaluator needs DW_OP_piece; defer until opt synthesizes them.
- list file:line for prebuilt inputs. When the JIT image includes .o/.a debug sections whose source file is not on disk, driver/cmd/dbg.c should show the DWARF line number alone and omit the source snippet rather than failing the listing. The line/symbol lookups already degrade; the on-disk source read (via env.file_io) is missing.
Explicitly deferred until a client needs them (carried forward, not planned):
- .debug_macro: emit macro definition/expansion records so a debugger can report the values of #define macros. The SourceManager already tracks the macro-expansion pseudo-files that can seed the records; cheap once wired.
- Inlined-subroutine DIEs: emit DW_TAG_inlined_subroutine with abstract-origin links once the optimizer reports inline decisions to the CG session (consumer-side dw_build_subs already indexes the tag).
- Split DWARF / .dwo, .debug_pubnames, and any LSDA / exception tables (C has none).
7. Sampling profiler — kit prof (not yet built)
A statistical CPU profiler that reuses the debugger's host signal
infrastructure. Nothing exists yet: no prof subcommand in driver/main.c,
no src/dbg/prof.c, no on_sample field on KitDbgSignalOps. Design
intent: SIGPROF fires on the worker, the handler walks the frame-pointer chain
into a pre-allocated ring buffer and returns without parking — the one
property that keeps sampling cheap and guest timing undisturbed — and PCs are
symbolicated after the guest exits.
Public API (include/kit.h):
- Add on_sample(void* session, void* ucontext) to KitDbgSignalOps (NULL = ignore SIGPROF); it receives the raw ucontext_t*, not a marshalled frame, because it extracts only PC and FP on the hot path.
- Declare KitProfBuf (fixed-capacity sample ring: pcs[PROF_MAX_DEPTH] per sample, count/cap/dropped) and KitProfWriter (post-run symbolication callback vtable), plus kit_jit_session_prof_attach(session, buf) (called before session_call) and kit_jit_session_prof_collect(session, buf, writer).
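
One possible shape for the sample buffer is sketched below. Since KitProfBuf does not exist yet, everything here is an assumption: the Toy* names, the field types, and the choice to count a drop when the buffer is full rather than overwrite old samples (which is what the count/cap/dropped fields suggest).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PROF_MAX_DEPTH 64 /* assumed value; the roadmap lists 64 as the --depth default */

typedef struct {
    uint64_t pcs[TOY_PROF_MAX_DEPTH]; /* raw PCs, leaf first */
    uint32_t depth;                   /* number of valid entries in pcs */
} ToyProfSample;

typedef struct {
    ToyProfSample *samples; /* pre-allocated by the caller before the run */
    size_t count;           /* samples recorded so far */
    size_t cap;             /* capacity of samples[] */
    uint64_t dropped;       /* samples discarded because the buffer was full */
} ToyProfBuf;

/* Reserve the next slot, or count a drop when full. Performs no allocation
 * and takes no locks, so it is the kind of operation a signal handler could
 * run; the counters are non-atomic and assume a single writer. */
static ToyProfSample *toy_prof_reserve(ToyProfBuf *b)
{
    assert(b != NULL && b->count <= b->cap);
    if (b->count == b->cap) {
        b->dropped++;
        return NULL;
    }
    ToyProfSample *s = &b->samples[b->count++];
    s->depth = 0;
    return s;
}
```

Keeping the storage caller-owned means the driver can size it from a capacity flag before the guest starts, and the handler never touches the allocator.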
Library (src/dbg/prof.c, freestanding C11):
- dbg_fp_walk(ucontext, sample): frame-pointer walk via dbg_os->guarded_copy for every dereference; terminate on NULL / misaligned / non-advancing FP or PROF_MAX_DEPTH. The three frame layouts are identical ([FP] = saved FP, [FP+8] = saved LR/return addr); FP is x29 / rbp / s0(x8). WHY: no DWARF or symbol lookup on the signal path, raw PCs only.
- on_sample body: capacity check, walk, append or bump dropped (non-atomic; the single worker makes that safe).
- prof_attach / prof_collect bodies; prof_collect symbolicates each PC via kit_jit_addr_to_sym + kit_dwarf_addr_to_line, dispatching to the writer (may allocate freely).
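
The walk itself might look roughly like the following. This is a sketch under stated assumptions, not the planned dbg_fp_walk: it takes PC and FP as plain integers instead of reading a ucontext, it uses the frame layout stated above ([FP] = saved FP, [FP+8] = return address), it treats 8-byte alignment as the misalignment test, and it requires the saved FP to be strictly greater than the current one. The toy_* names and the ToyMem fake are hypothetical; ToyMem stands in for a guarded copy so canned frame chains can be tested without a real stack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX_DEPTH 64

/* Stand-in for a guarded copy: returns nonzero on fault instead of crashing. */
typedef int (*toy_copy_fn)(void *ctx, uint64_t addr, void *dst, size_t n);

/* Record the leaf PC, then follow saved-FP links. Returns the number of PCs
 * stored. Stops on NULL, misaligned, or non-advancing FP, on a faulting
 * read, or when max (clamped to TOY_MAX_DEPTH) entries have been stored. */
static size_t toy_fp_walk(toy_copy_fn copy, void *ctx, uint64_t pc,
                          uint64_t fp, uint64_t *pcs, size_t max)
{
    assert(copy != NULL && pcs != NULL);
    size_t n = 0;
    if (max == 0) return 0;
    if (max > TOY_MAX_DEPTH) max = TOY_MAX_DEPTH;
    pcs[n++] = pc;
    while (n < max && fp != 0 && (fp & 7) == 0) {
        uint64_t rec[2]; /* rec[0] = saved FP, rec[1] = return address */
        if (copy(ctx, fp, rec, sizeof rec) != 0) break;
        pcs[n++] = rec[1];
        if (rec[0] <= fp) break; /* stacks grow down, so the caller FP must be higher */
        fp = rec[0];
    }
    return n;
}

/* Fake memory for canned frame chains: one contiguous array of 64-bit words. */
typedef struct { uint64_t base; const uint64_t *words; size_t nwords; } ToyMem;

static int toy_mem_copy(void *ctx, uint64_t addr, void *dst, size_t n)
{
    const ToyMem *m = ctx;
    uint64_t size = (uint64_t)m->nwords * 8;
    if (addr < m->base || addr - m->base > size || n > size - (addr - m->base))
        return 1; /* simulated fault */
    memcpy(dst, (const uint8_t *)m->words + (addr - m->base), n);
    return 0;
}
```

Because every dereference goes through the copy callback, the same function can run against the real guarded copy on the signal path and against ToyMem in a per-arch fp_walk test.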
Host adapter (driver/env/posix_dbg.c and windows.c):
- Add SIGPROF to the POSIX handler's signal set with an early-return path, if (signo == SIGPROF && on_sample) { on_sample(...); return; }, with no park/unpark. SIGPROF joins the blocked cohort so it does not recurse. Timer arming (setitimer(ITIMER_PROF)) and thread targeting stay driver-side and do not belong behind KitDbgOs.
- Decide the Windows sampling mechanism (no SIGPROF): a periodic SuspendThread + GetThreadContext sampler thread is the natural analog of the VEH interrupt path. WHY: the SIGPROF design has no direct Windows equivalent, so this is a genuine open design question, not a port.
Driver (driver/cmd/prof.c, wired into the multi-call dispatch in
driver/main.c):
- Flags: --rate (default 1ms), --depth (64), --cap (1M), --output (prof.folded), --no-folded, --no-flat. Input handling mirrors kit run, with -g forced on so symbolication always has DWARF.
- Arm the timer before session_call, disarm after, then prof_collect. Emit folded stacks (sorted + RLE for flamegraph.pl) plus a flat self%/cumul% report to stdout and a dropped-sample warning.
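
The "sorted + RLE" folded output can be sketched as a sort followed by a run-length pass. The code below assumes each sample has already been symbolicated and joined root-to-leaf with ';' (the input format flamegraph.pl consumes is one "stack count" line per distinct stack). toy_fold_stacks is a hypothetical helper writing to a caller buffer; a real driver would more likely stream to the output file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator over an array of C strings. */
static int toy_cmp_str(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Sort stacks in place, then write one "stack count\n" line per distinct
 * stack into dst. Returns the bytes written (excluding the terminator),
 * or -1 if dst is too small. */
static long toy_fold_stacks(char *dst, size_t cap, const char **stacks, size_t n)
{
    assert(dst != NULL && cap > 0);
    size_t used = 0;
    dst[0] = '\0';
    if (n > 0)
        qsort(stacks, n, sizeof *stacks, toy_cmp_str);
    for (size_t i = 0; i < n; ) {
        size_t j = i + 1;
        /* Identical stacks are adjacent after sorting; count the run. */
        while (j < n && strcmp(stacks[i], stacks[j]) == 0) j++;
        int w = snprintf(dst + used, cap - used, "%s %zu\n", stacks[i], j - i);
        if (w < 0 || (size_t)w >= cap - used) return -1;
        used += (size_t)w;
        i = j;
    }
    return (long)used;
}
```

Sorting first keeps the output deterministic across runs, which also makes a golden-file comparison in a smoke test straightforward.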
Tests: test/smoke/prof_hello (assert prof.folded is non-empty and main
appears), test/dbg/fp_walk_* (canned frame chains per arch, assert the PC
sequence and termination), and test/dbg/prof_buf_overflow (fill to capacity,
assert dropped increments and count caps).
Profiler follow-ons (deferred): per-thread timers via
timer_create(CLOCK_THREAD_CPUTIME_ID) + SIGEV_THREAD_ID for multi-thread
guests; an ITIMER_REAL wall-clock mode for I/O-bound programs; allocation
profiling via a conditional breakpoint on the allocator; SpeedScope / pprof
output.
8. Bigger follow-ons (cross-cutting)
- Watchpoints, once CGTarget can express them without an ISA-specific debug-register API. All breakpoints are software today; watchpoints need hardware debug registers, hence the abstraction requirement.
- Multi-threaded guests. The session assumes one worker. Concurrent guest threads require widening KitDbgOs with thread enumeration and per-tid stop/event slots; this is also the prerequisite for reliable per-thread profiler timer delivery in §7. Out of scope until a concrete need lands.