kit
git clone https://git.ryansepassi.com/git/kit.git

commit 7aecaa468f135bcada791ccc5b7f76aa8a0a3a2e
parent 61ac2c5548bcea594508b8e586154d02ddea4f9e
Author: Ryan Sepassi <rsepassi@gmail.com>
Date:   Mon, 11 May 2026 18:36:37 -0700

PROF.md plan

Diffstat:
A doc/PROF.md | 392 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 392 insertions(+), 0 deletions(-)

diff --git a/doc/PROF.md b/doc/PROF.md
@@ -0,0 +1,392 @@
+# cfree prof design
+
+Architecture of `cfree prof`, a sampling CPU profiler for JIT'd code.
+Companion to `DBG.md` and `DESIGN.md`. Scope: how the profiler intercepts
+the worker thread without disturbing it, how samples are recorded cheaply,
+and how PCs are turned into human-readable output after the run. Not a
+tutorial; not implementation notes.
+
+## 1. Goals
+
+- `prof` multi-call subcommand: compile C sources, JIT-link, run under a
+  sampling profiler, and emit a profile report.
+- Statistical CPU profiling: deliver SIGPROF to the worker at a
+  configurable rate, capture the current PC and a frame-pointer-based call
+  stack in the signal handler, and resume the worker without a park/unpark
+  cycle.
+- Post-run symbolication: translate raw PCs to `symbol+offset` and
+  `file:line` using the JIT image and DWARF, after the guest exits.
+- Output in folded-stacks format (Brendan Gregg's `flamegraph.pl` input),
+  and a flat function-by-self-time text report.
+- Reuse `CfreeDbgOs` signal infrastructure from `dbg`; `src/dbg/prof.c`
+  stays freestanding C11.
+- v1 target: aarch64 on macOS and Linux.
+
+## 2. Non-goals (v1)
+
+- Heap / allocation profiling. Separate concern; different instrumentation.
+- Wall-clock profiling (`ITIMER_REAL` vs `ITIMER_PROF`). Wall-clock
+  sampling can be added later; v1 samples process CPU time only.
+- Multi-threaded guests. One worker thread per session; per-thread timer
+  targeting is future work.
+- Hardware performance counters (PMU). Pure-software SIGPROF for now.
+- Continuous / streaming profile output. One profile file per run.
+- Instrumented (non-sampling) profiling. Breakpoint-based entry/exit
+  counting is higher overhead and a different tool.
+
+## 3. Layout
+
+```
+include/
+  cfree.h         on_sample added to CfreeDbgSignalOps; CfreeProfSpec;
+                  cfree_jit_session_prof_attach / prof_collect
+
+src/
+  dbg/
+    prof.c        sample ring buffer; frame-pointer walk; symbolication
+    dbg.h         CfreeProfSample, CfreeProfBuf (internal)
+
+driver/
+  prof.c          new driver entry: parse flags, arm timer, run session,
+                  emit folded-stacks and flat report
+  env.c           SIGPROF added to g_dbg_signos[]; on_sample dispatch
+                  path that returns without parking
+```
+
+## 4. Dataflow
+
+```
+setitimer(ITIMER_PROF, rate)
+        │
+        │  SIGPROF → worker thread
+        ▼
+dbg_signal_handler
+  ├── is worker thread? no → re-raise (default)
+  ├── on_sample set?    no → park (normal dbg path)
+  └── yes → dbg_fp_walk(ucontext, buf) → append to ring buffer → return
+                                          (worker continues)
+        │
+        │  after session exits
+        ▼
+cfree_jit_session_prof_collect(session, buf, writer)
+        │
+        ▼
+symbolicate: cfree_jit_addr_to_sym + cfree_dwarf_addr_to_line
+        │
+        ├── emit folded stacks → prof.folded  (flamegraph.pl input)
+        └── emit flat report   → stdout
+```
+
+The critical property: **the worker is never parked**. SIGPROF fires,
+the handler reads the frame pointer chain and appends to a pre-allocated
+buffer, and returns. No event signals, no mutex acquires, no park/unpark
+handshake. The guest's elapsed time and memory state are unaffected except
+for the signal-delivery overhead.
+
+## 5. Signal handler path
+
+`dbg_signal_handler` (in `driver/env.c`) currently dispatches every
+delivered signal through the park/stop handshake. For profiling a second
+path is needed that returns without parking.
+
+The dispatch condition is: `signo == SIGPROF && on_sample != NULL`.
+
+```c
+if (signo == SIGPROF && g_dbg_ops.on_sample) {
+  g_dbg_ops.on_sample(g_dbg_session, uc);
+  return;  /* worker resumes immediately */
+}
+```
+
+`on_sample` is a new field in `CfreeDbgSignalOps` (alongside the existing
+`on_fault`):
+
+```c
+typedef struct CfreeDbgSignalOps {
+  /* existing */
+  int (*on_fault)(void* session, int signo, CfreeUnwindFrame* frame);
+  /* new */
+  void (*on_sample)(void* session, void* ucontext);
+} CfreeDbgSignalOps;
+```
+
+`on_sample` receives the raw `ucontext_t*` rather than a marshalled
+`CfreeUnwindFrame` because it extracts only FP and PC directly;
+marshalling all 32 registers would be unnecessary work in the hot path.
+
+SIGPROF is added to `g_dbg_signos[]` in `driver/env.c` alongside the
+existing fault signals. The `sa_mask` for the handler already blocks the
+cohort of debugger signals during delivery; SIGPROF joins that mask so it
+doesn't recurse.
+
+## 6. Frame-pointer walk
+
+The signal handler calls `dbg_fp_walk(ucontext, sample)` in
+`src/dbg/prof.c`. This is the only code that runs on the signal-delivery
+path outside of the existing dispatcher boilerplate.
+
+Contract:
+- No heap allocation. Writes into a caller-supplied `CfreeProfSample`.
+- No DWARF, no symbol lookup. Raw PCs only.
+- Max depth is a compile-time constant `PROF_MAX_DEPTH` (default 64).
+- Terminates when: FP is NULL, FP is misaligned, FP is outside the JIT
+  image address range, or depth limit is reached.
+- Uses `dbg_os->guarded_copy` for each dereference to tolerate a corrupt
+  or partially-initialised frame chain.
+
+AArch64 frame layout (standard AAPCS64 with frame pointer enabled):
+
+```
+[FP + 0]  → saved FP of caller
+[FP + 8]  → saved LR (return address = PC just past the call in caller)
+```
+
+PC at the point of the sample is taken directly from `uc_mcontext->__ss.__pc`
+(macOS) or `uc_mcontext.pc` (Linux). FP comes from register x29.
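
Editorial sketch (not part of the commit): the termination conditions in the contract above, together with the two-slot frame record, can be written as a small predicate. This is a minimal illustration under assumptions: `FrameRecord`, `frame_record_ok`, and the `lo`/`hi` bounds are hypothetical names that do not appear in the design, and the 8-byte alignment test is an assumed minimum for two 64-bit loads.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the frame record that x29 points at. */
typedef struct FrameRecord {
  uint64_t saved_fp;  /* [FP + 0]: caller's frame pointer */
  uint64_t saved_lr;  /* [FP + 8]: return address into the caller */
} FrameRecord;

_Static_assert(offsetof(FrameRecord, saved_lr) == 8, "LR sits at FP + 8");

/*
 * Returns true when fp may be read as a frame record.
 * [lo, hi) is an assumed range of readable addresses; the design does not
 * say how the real walker obtains its bounds.
 */
static bool frame_record_ok(uint64_t fp, uint64_t lo, uint64_t hi) {
  if (fp == 0) return false;                  /* NULL ends the chain */
  if (fp & 7u) return false;                  /* misaligned */
  if (fp < lo || fp >= hi) return false;      /* outside the range */
  if (hi - fp < sizeof(FrameRecord)) return false;  /* record would spill */
  return true;
}
```

A walker could call such a predicate before each pair of guarded reads; the depth limit stays in the loop condition because it concerns the sample, not the frame.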
+
+Walk pseudocode:
+
+```c
+void dbg_fp_walk(void* uc, CfreeProfSample* s) {
+  uint64_t fp = uc_get_fp(uc);
+  s->nframes = 0;                        /* start a fresh sample */
+  s->pcs[s->nframes++] = uc_get_pc(uc);  /* leaf frame */
+  while (fp && !(fp & 7) && s->nframes < PROF_MAX_DEPTH) {
+    uint64_t saved_fp, saved_lr;
+    if (dbg_os->guarded_copy(&saved_fp, (void*)fp, 8)) break;
+    if (dbg_os->guarded_copy(&saved_lr, (void*)(fp+8), 8)) break;
+    s->pcs[s->nframes++] = saved_lr;
+    if (!saved_fp || saved_fp <= fp) break;  /* FP must move up the stack */
+    fp = saved_fp;
+  }
+}
+```
+
+cfree controls codegen and ensures `-fno-omit-frame-pointer` is the
+default (or forced for prof mode). Frames from the host's C runtime above
+the JIT entry are recorded as-is; symbolication will simply leave them as
+hex if they fall outside the JIT image.
+
+## 7. Sample buffer
+
+```c
+/* CfreeProfSample: one entry per timer tick */
+typedef struct CfreeProfSample {
+  uint64_t pcs[PROF_MAX_DEPTH];
+  uint32_t nframes;
+} CfreeProfSample;
+
+/* CfreeProfBuf: pre-allocated, fixed-size ring */
+typedef struct CfreeProfBuf {
+  CfreeProfSample* samples;
+  uint32_t         cap;      /* allocated slots */
+  uint32_t         count;    /* samples recorded */
+  uint32_t         dropped;  /* samples dropped when full */
+} CfreeProfBuf;
+```
+
+`CfreeProfBuf` is allocated by the driver before calling
+`cfree_jit_session_prof_attach(session, buf)`. The session stores a
+pointer; `on_sample` appends to it. When `count == cap`, the sample is
+discarded and `dropped` is incremented (both writes are non-atomic; single
+worker thread makes that safe).
+
+Default capacity: 1 million samples. At 64 PCs × 8 bytes (512 bytes) per
+sample that is ~512 MB; reduce `PROF_MAX_DEPTH` or the cap for constrained
+environments. The driver flags `--depth` and `--cap` control both.
+
+## 8. Post-run symbolication
+
+After `cfree_jit_session_call()` returns, the buffer holds raw PCs.
+Symbolication is a post-processing pass in `src/dbg/prof.c`:
+
+```c
+int cfree_jit_session_prof_collect(
+    CfreeJitSession* session,
+    CfreeProfBuf*    buf,
+    CfreeProfWriter* writer);
+```
+
+`CfreeProfWriter` is a vtable the driver supplies:
+
+```c
+typedef struct CfreeProfWriter {
+  void (*write_sample)(void* user, const char** syms, uint32_t nframes);
+  void* user;
+} CfreeProfWriter;
+```
+
+For each sample, `prof_collect`:
+1. For each PC in `sample.pcs[]`:
+   a. `cfree_jit_addr_to_sym(jit, pc, &sym, &off)`: symbol name, or
+      NULL if outside the JIT image.
+   b. `cfree_dwarf_addr_to_line(dwarf, img_pc, &file, &line)` if DWARF
+      is attached and the symbol was found.
+   c. Format: `"sym+0xOFF"` when no line info; `"sym (file:line)"` when
+      available; `"0xADDR"` when outside the image entirely.
+2. Calls `writer->write_sample(user, syms, nframes)` with the
+   null-terminated string array.
+
+Symbolication touches no async-signal machinery and may allocate freely.
+
+## 9. Output format
+
+### Folded stacks (primary)
+
+One line per sample. Frames outermost-first (root to leaf), separated by
+`;`, followed by a space and the count weight (`1` per sample). This is the
+canonical input for `flamegraph.pl`:
+
+```
+main;compute;inner_loop 1
+main;compute;inner_loop 1
+main;compute 1
+main 1
+```
+
+Written to `--output=FILE` (default: `prof.folded`). The driver emits
+this via a `CfreeProfWriter` that builds strings into a `CfrBuf` and
+flushes lines.
+
+Aggregate identical stacks before writing: sort samples lexicographically
+by their frame sequence, then run-length-encode. This reduces output size
+significantly for programs with tight hot loops.
+
+### Flat report (secondary)
+
+Printed to stdout after the run:
+
+```
+Samples: 10243  (0 dropped)  rate: 1ms
+
+  SELF%  CUMUL%  FUNCTION
+  42.1%   42.1%  inner_loop  (compute.c:17)
+  31.0%   73.1%  compute  (compute.c:45)
+  ...
+```
+
+Self% counts samples where the function is the top (leaf) frame.
+Cumul% counts samples where it appears anywhere in the stack.
+
+## 10. Driver interface
+
+```
+cfree prof [options] [sources/objects] [-- args...]
+```
+
+Options:
+
+```
+--rate=MICROSECONDS   SIGPROF interval, default 1000 (1 ms)
+--depth=N             max frames per sample, default 64
+--cap=N               sample buffer capacity, default 1000000
+--output=FILE         folded-stacks output path, default prof.folded
+--no-folded           suppress folded-stacks file
+--no-flat             suppress flat report to stdout
+```
+
+Input handling mirrors `cfree run`: `.c` / `.o` / `.a` / stdin, compiled
+with `-g` forced on so DWARF is always present for symbolication.
+
+The driver:
+1. Allocates `CfreeProfBuf`.
+2. Creates session with `on_sample` wired.
+3. Arms `setitimer(ITIMER_PROF, rate)`.
+4. Calls `cfree_jit_session_call()`.
+5. Disarms timer after return.
+6. Calls `cfree_jit_session_prof_collect()` to symbolicate and write output.
+7. Prints flat report and dropped-sample count.
+
+## 11. CfreeDbgOs and public API changes
+
+### `include/cfree.h`
+
+`CfreeDbgSignalOps` gains `on_sample`:
+
+```c
+typedef struct CfreeDbgSignalOps {
+  int  (*on_fault)(void* session, int signo, CfreeUnwindFrame* frame);
+  void (*on_sample)(void* session, void* ucontext); /* NEW; NULL = ignore SIGPROF */
+} CfreeDbgSignalOps;
+```
+
+New public entry points:
+
+```c
+/* Attach a pre-allocated sample buffer; must be called before session_call. */
+void cfree_jit_session_prof_attach(CfreeJitSession*, CfreeProfBuf*);
+
+/* Symbolicate buffer and call writer once per sample. Safe after session_call. */
+int  cfree_jit_session_prof_collect(CfreeJitSession*, CfreeProfBuf*,
+                                    CfreeProfWriter*);
+```
+
+`CfreeProfBuf` and `CfreeProfWriter` are declared in `cfree.h`; their
+internals (frame storage, PROF_MAX_DEPTH) live in `src/dbg/prof.c`.
+
+### `driver/env.c`
+
+- Add `SIGPROF` to `g_dbg_signos[]`.
+- In `dbg_signal_handler`: early-out path for `SIGPROF` + non-null
+  `on_sample` that calls the callback and returns without touching the
+  park/unpark events.
+
+No other OS layer changes. `CfreeDbgOs` itself does not need new fields:
+timer arming (`setitimer`) and thread targeting (`pthread_kill`) are both
+driver-side and do not belong behind the vtable.
+
+## 12. Checklist
+
+### Public API: `include/cfree.h`
+
+- [ ] `on_sample` field added to `CfreeDbgSignalOps`
+- [ ] `CfreeProfBuf` and `CfreeProfWriter` structs declared
+- [ ] `cfree_jit_session_prof_attach`
+- [ ] `cfree_jit_session_prof_collect`
+
+### Library: `src/dbg/prof.c`
+
+- [ ] `CfreeProfBuf` alloc/free helpers
+- [ ] `dbg_fp_walk(ucontext, sample)`: aarch64 frame-pointer walk using
+      `guarded_copy`; terminates on NULL/misaligned/non-advancing FP
+- [ ] `on_sample` implementation: check capacity, call `dbg_fp_walk`,
+      append or increment `dropped`
+- [ ] `cfree_jit_session_prof_attach` body
+- [ ] `cfree_jit_session_prof_collect` body: symbolication loop +
+      `CfreeProfWriter` dispatch
+- [ ] `dbg_fp_walk` x64 variant (frame layout identical; FP = rbp)
+- [ ] `dbg_fp_walk` rv64 variant (FP = s0/x8; RA at FP-8, caller FP at FP-16)
+
+### Host adapter: `driver/env.c`
+
+- [ ] `SIGPROF` added to `g_dbg_signos[]`
+- [ ] `on_sample` early-return path in `dbg_signal_handler`
+
+### Driver: `driver/prof.c`
+
+- [ ] Flag parsing (`--rate`, `--depth`, `--cap`, `--output`,
+      `--no-folded`, `--no-flat`)
+- [ ] `setitimer(ITIMER_PROF, ...)` arm before `session_call`, disarm after
+- [ ] `CfreeProfWriter` for folded-stacks output (sort + run-length encode)
+- [ ] Flat report: self% / cumul% table printed to stdout
+- [ ] Dropped-sample warning when `buf.dropped > 0`
+- [ ] Wire into multi-call dispatch in `driver/main.c`
+
+### Tests
+
+- [ ] `test/smoke/prof_hello`: run a simple C program under `cfree prof`,
+      assert `prof.folded` is non-empty, `main` appears in output
+- [ ] `test/dbg/fp_walk_aa64`: canned aarch64 frame chain (stack buffer
+      with crafted FP links); assert `dbg_fp_walk` produces expected PC
+      sequence and terminates correctly on a NULL sentinel
+- [ ] `test/dbg/prof_buf_overflow`: fill buffer to capacity, assert
+      `dropped` increments and count stays at cap
+
+### Bigger follow-ons
+
+- [ ] Per-thread timer via `timer_create(CLOCK_THREAD_CPUTIME_ID)` +
+      `SIGEV_THREAD_ID` (Linux) for reliable delivery in multi-thread guests
+- [ ] `ITIMER_REAL` mode (`--wall`) for profiling I/O-bound programs
+- [ ] Allocation profiling via breakpoint on the allocator entry with
+      `CfreeBreakpointSpec.condition` recording stack traces
+- [ ] SpeedScope / pprof output formats
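
Editorial sketch (not part of the commit): the sort-then-run-length-encode step that section 9 and the driver checklist call for might look roughly like the following. It assumes each sample has already been rendered as one root-first folded string such as `main;compute;inner_loop`. The name `fold_aggregate` and the plain `char` output buffer are hypothetical stand-ins; the plan itself routes output through a `CfreeProfWriter` and a `CfrBuf`, whose interfaces are not shown in the diff.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator over an array of C strings. */
static int cmp_str(const void* a, const void* b) {
  return strcmp(*(const char* const*)a, *(const char* const*)b);
}

/*
 * Hypothetical helper: sorts `lines` in place, then appends one
 * "stack count\n" line per distinct stack to `out`. Stops early if `out`
 * runs out of room. Returns the number of distinct stacks written.
 */
static size_t fold_aggregate(const char** lines, size_t n,
                             char* out, size_t out_cap) {
  size_t distinct = 0, used = 0;
  if (out_cap) out[0] = '\0';
  if (n == 0) return 0;
  qsort(lines, n, sizeof lines[0], cmp_str);
  for (size_t i = 0; i < n;) {
    size_t run = 1;  /* length of the run of identical stacks at i */
    while (i + run < n && strcmp(lines[i], lines[i + run]) == 0) run++;
    int w = snprintf(out + used, out_cap - used, "%s %zu\n", lines[i], run);
    if (w < 0 || (size_t)w >= out_cap - used) {
      if (used < out_cap) out[used] = '\0';  /* drop the partial line */
      break;
    }
    used += (size_t)w;
    distinct++;
    i += run;
  }
  return distinct;
}
```

Sorting first means identical stacks are adjacent, so a single linear pass is enough to count them; the four sample lines from section 9 collapse to three output lines with `main;compute;inner_loop` weighted 2.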