commit 7aecaa468f135bcada791ccc5b7f76aa8a0a3a2e
parent 61ac2c5548bcea594508b8e586154d02ddea4f9e
Author: Ryan Sepassi <rsepassi@gmail.com>
Date: Mon, 11 May 2026 18:36:37 -0700
PROF.md plan
Diffstat:
| A | doc/PROF.md | | | 392 | +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ |
1 file changed, 392 insertions(+), 0 deletions(-)
diff --git a/doc/PROF.md b/doc/PROF.md
@@ -0,0 +1,392 @@
+# cfree prof design
+
+Architecture of `cfree prof`, a sampling CPU profiler for JIT'd code.
+Companion to `DBG.md` and `DESIGN.md`. Scope: how the profiler intercepts
+the worker thread without disturbing it, how samples are recorded cheaply,
+and how PCs are turned into human-readable output after the run. Not a
+tutorial; not implementation notes.
+
+## 1. Goals
+
+- `prof` multi-call subcommand: compile C sources, JIT-link, run under a
+ sampling profiler, and emit a profile report.
+- Statistical CPU profiling: deliver SIGPROF to the worker at a
+ configurable rate, capture the current PC and a frame-pointer-based call
+ stack in the signal handler, and resume the worker without a park/unpark
+ cycle.
+- Post-run symbolication: translate raw PCs to `symbol+offset` and
+ `file:line` using the JIT image and DWARF, after the guest exits.
+- Output in folded-stacks format (Brendan Gregg's `flamegraph.pl` input),
+ and a flat function-by-self-time text report.
+- Reuse `CfreeDbgOs` signal infrastructure from `dbg`; `src/dbg/prof.c`
+ stays freestanding C11.
+- v1 target: aarch64 on macOS and Linux.
+
+## 2. Non-goals (v1)
+
+- Heap / allocation profiling. Separate concern; different instrumentation.
+- Wall-clock profiling (`ITIMER_REAL` instead of `ITIMER_PROF`). It can
+ be added later; v1 samples process CPU time (user plus system) only.
+- Multi-threaded guests. One worker thread per session; per-thread timer
+ targeting is future work.
+- Hardware performance counters (PMU). Pure-software SIGPROF for now.
+- Continuous / streaming profile output. One profile file per run.
+- Instrumented (non-sampling) profiling. Breakpoint-based entry/exit
+ counting is higher overhead and a different tool.
+
+## 3. Layout
+
+```
+include/
+ cfree.h on_sample added to CfreeDbgSignalOps; CfreeProfSpec;
+ cfree_jit_session_prof_attach / prof_collect
+
+src/
+ dbg/
+ prof.c sample ring buffer; frame-pointer walk; symbolication
+ dbg.h CfreeProfSample, CfreeProfBuf (internal)
+
+driver/
+ prof.c new driver entry: parse flags, arm timer, run session,
+ emit folded-stacks and flat report
+ env.c SIGPROF added to g_dbg_signos[]; on_sample dispatch
+ path that returns without parking
+```
+
+## 4. Dataflow
+
+```
+setitimer(ITIMER_PROF, rate)
+ │
+ │ SIGPROF → worker thread
+ ▼
+dbg_signal_handler
+ ├── is worker thread? no → re-raise (default)
+ ├── on_sample set? no → park (normal dbg path)
+ └── yes → dbg_fp_walk(ucontext, buf) → append to ring buffer → return
+ (worker continues)
+ │
+ │ after session exits
+ ▼
+cfree_jit_session_prof_collect(session, buf, writer)
+ │
+ ▼
+symbolicate: cfree_jit_addr_to_sym + cfree_dwarf_addr_to_line
+ │
+ ├── emit folded stacks → output.folded (flamegraph.pl input)
+ └── emit flat report → stdout
+```
+
+The critical property: **the worker is never parked**. SIGPROF fires,
+the handler reads the frame pointer chain and appends to a pre-allocated
+buffer, and returns. No event signals, no mutex acquires, no park/unpark
+handshake. The guest's elapsed time and memory state are unaffected except
+for the signal-delivery overhead.
+
+## 5. Signal handler path
+
+`dbg_signal_handler` (in `driver/env.c`) currently dispatches every
+delivered signal through the park/stop handshake. For profiling a second
+path is needed that returns without parking.
+
+The dispatch condition is: `signo == SIGPROF && on_sample != NULL`.
+
+```c
+if (signo == SIGPROF && g_dbg_ops.on_sample) {
+ g_dbg_ops.on_sample(g_dbg_session, uc);
+ return; /* worker resumes immediately */
+}
+```
+
+`on_sample` is a new field in `CfreeDbgSignalOps` (alongside the existing
+`on_fault`):
+
+```c
+typedef struct CfreeDbgSignalOps {
+ /* existing */
+ int (*on_fault)(void* session, int signo, CfreeUnwindFrame* frame);
+ /* new */
+ void (*on_sample)(void* session, void* ucontext);
+} CfreeDbgSignalOps;
+```
+
+`on_sample` receives the raw `ucontext_t*` rather than a marshalled
+`CfreeUnwindFrame` because it will extract only FP and PC directly —
+marshalling all 32 registers would be unnecessary work in the hot path.
+
+SIGPROF is added to `g_dbg_signos[]` in `driver/env.c` alongside the
+existing fault signals. The `sa_mask` for the handler already blocks the
+cohort of debugger signals during delivery; SIGPROF joins that mask so it
+doesn't recurse.
+
+## 6. Frame-pointer walk
+
+The signal handler calls `dbg_fp_walk(ucontext, sample)` in
+`src/dbg/prof.c`. This is the only code that runs on the signal-delivery
+path outside of the existing dispatcher boilerplate.
+
+Contract:
+- No heap allocation. Writes into a caller-supplied `CfreeProfSample`.
+- No DWARF, no symbol lookup. Raw PCs only.
+- Max depth is a compile-time constant `PROF_MAX_DEPTH` (default 64).
+- Terminates when: FP is NULL, FP is misaligned, the saved FP does not
+ move toward the stack base, a guarded read fails, or the depth limit is reached.
+- Uses `dbg_os->guarded_copy` for each dereference to tolerate a corrupt
+ or partially-initialised frame chain.
+
+AArch64 frame layout (standard AAPCS64 with frame pointer enabled):
+
+```
+[FP + 0] → saved FP of caller
+[FP + 8] → saved LR (return address: the instruction after the call in the caller)
+```
+
+PC at the point of the sample is taken from `uc_mcontext->__ss.__pc`
+(macOS) or `uc_mcontext.pc` (Linux). FP is x29: `__ss.__fp` / `regs[29]`.
+
+Walk pseudocode:
+
+```c
+void dbg_fp_walk(void* uc, CfreeProfSample* s) {
+ uint64_t fp = uc_get_fp(uc);
+ s->pcs[0] = uc_get_pc(uc); /* frame 0 is the sampled PC */
+ s->nframes = 1; /* reset: the slot may hold stale data */
+ while (fp && !(fp & 7) && s->nframes < PROF_MAX_DEPTH) { /* 8-byte aligned */
+ uint64_t saved_fp, saved_lr;
+ if (dbg_os->guarded_copy(&saved_fp, (void*)fp, 8)) break;
+ if (dbg_os->guarded_copy(&saved_lr, (void*)(fp+8), 8)) break;
+ s->pcs[s->nframes++] = saved_lr;
+ if (!saved_fp || saved_fp <= fp) break; /* end of chain or no progress toward stack base */
+ fp = saved_fp;
+ }
+}
+```
+
+cfree controls codegen and ensures `-fno-omit-frame-pointer` is the
+default (or forced for prof mode). Frames from the host's C runtime above
+the JIT entry are recorded as-is; symbolication will simply leave them as
+hex if they fall outside the JIT image.
+
+## 7. Sample buffer
+
+```c
+/* CfreeProfSample — one entry per timer tick */
+typedef struct CfreeProfSample {
+ uint64_t pcs[PROF_MAX_DEPTH];
+ uint32_t nframes;
+} CfreeProfSample;
+
+/* CfreeProfBuf — pre-allocated, fixed-size ring */
+typedef struct CfreeProfBuf {
+ CfreeProfSample* samples;
+ uint32_t cap; /* allocated slots */
+ uint32_t count; /* samples recorded */
+ uint32_t dropped; /* samples dropped when full */
+} CfreeProfBuf;
+```
+
+`CfreeProfBuf` is allocated by the driver before calling
+`cfree_jit_session_prof_attach(session, buf)`. The session stores a
+pointer; `on_sample` appends to it. The buffer never wraps: when
+`count == cap`, the sample is discarded and `dropped` is incremented.
+Both writes are non-atomic, which is safe with a single worker thread.
+
+Default capacity: 1 million samples. At 64 PCs × 8 bytes plus `nframes`
+and padding (520 bytes per sample) that is ~520 MB. For constrained
+environments, lower the depth (`--depth`) or the capacity (`--cap`).
+
+## 8. Post-run symbolication
+
+After `cfree_jit_session_call()` returns, the buffer holds raw PCs.
+Symbolication is a post-processing pass in `src/dbg/prof.c`:
+
+```c
+int cfree_jit_session_prof_collect(
+ CfreeJitSession* session,
+ CfreeProfBuf* buf,
+ CfreeProfWriter* writer);
+```
+
+`CfreeProfWriter` is a vtable the driver supplies:
+
+```c
+typedef struct CfreeProfWriter {
+ void (*write_sample)(void* user, const char** syms, uint32_t nframes);
+ void* user;
+} CfreeProfWriter;
+```
+
+For each sample, `prof_collect`:
+1. For each PC in `sample.pcs[]`:
+ a. `cfree_jit_addr_to_sym(jit, pc, &sym, &off)` — symbol name or
+ NULL if outside the JIT image.
+ b. `cfree_dwarf_addr_to_line(dwarf, img_pc, &file, &line)` if DWARF
+ is attached and the symbol was found.
+ c. Format: `"sym+0xOFF"` when no line info; `"sym (file:line)"` when
+ available; `"0xADDR"` when outside the image entirely.
+2. Calls `writer->write_sample(writer->user, syms, nframes)` with an
+ array of `nframes` NUL-terminated strings.
+
+Symbolication touches no async-signal machinery and may allocate freely.
+
+## 9. Output format
+
+### Folded stacks (primary)
+
+One line per stack. Frames outermost-first (the reverse of `pcs[]`),
+separated by `;`, then a space and the sample count (`1` per raw sample).
+This is the canonical input for `flamegraph.pl`:
+
+```
+main;compute;inner_loop 1
+main;compute;inner_loop 1
+main;compute 1
+main 1
+```
+
+Written to `--output FILE` (default: `prof.folded`). The driver emits
+this via a `CfreeProfWriter` that builds strings into a `CfrBuf` and
+flushes lines.
+
+Aggregate identical stacks before writing: sort samples lexicographically
+by their frame sequence, then run-length-encode. This reduces output size
+significantly for programs with tight hot loops.
+
+### Flat report (secondary)
+
+Printed to stdout after the run:
+
+```
+Samples: 10243 (0 dropped) rate: 1ms
+
+ SELF% CUMUL% FUNCTION
+ 42.1% 42.1% inner_loop (compute.c:17)
+ 31.0% 73.1% compute (compute.c:45)
+ ...
+```
+
+Self% counts samples where the function is the top (leaf) frame.
+Cumul% counts samples where it appears anywhere in the stack.
+
+## 10. Driver interface
+
+```
+cfree prof [options] [sources/objects] [-- args...]
+```
+
+Options:
+
+```
+--rate=MICROSECONDS SIGPROF interval, default 1000 (1 ms)
+--depth=N max frames per sample, default 64
+--cap=N sample buffer capacity, default 1000000
+--output=FILE folded-stacks output path, default prof.folded
+--no-folded suppress folded-stacks file
+--no-flat suppress flat report to stdout
+```
+
+Input handling mirrors `cfree run`: `.c` / `.o` / `.a` / stdin, compiled
+with `-g` forced on so DWARF is always present for symbolication.
+
+The driver:
+1. Allocates `CfreeProfBuf`.
+2. Creates session with `on_sample` wired and attaches the buffer.
+3. Arms `setitimer(ITIMER_PROF, rate)` once the SIGPROF handler is installed.
+4. Calls `cfree_jit_session_call()`.
+5. Disarms timer after return.
+6. Calls `cfree_jit_session_prof_collect()` to symbolicate and write output.
+7. Prints flat report and dropped-sample count.
+
+## 11. CfreeDbgOs and public API changes
+
+### `include/cfree.h`
+
+`CfreeDbgSignalOps` gains `on_sample`:
+
+```c
+typedef struct CfreeDbgSignalOps {
+ int (*on_fault)(void* session, int signo, CfreeUnwindFrame* frame);
+ void (*on_sample)(void* session, void* ucontext); /* NEW; NULL = ignore SIGPROF */
+} CfreeDbgSignalOps;
+```
+
+New public entry points:
+
+```c
+/* Attach a pre-allocated sample buffer; must be called before session_call. */
+void cfree_jit_session_prof_attach(CfreeJitSession*, CfreeProfBuf*);
+
+/* Symbolicate buffer and call writer once per sample. Safe after session_call. */
+int cfree_jit_session_prof_collect(CfreeJitSession*, CfreeProfBuf*,
+ CfreeProfWriter*);
+```
+
+`CfreeProfBuf` and `CfreeProfWriter` are declared in `cfree.h`; their
+internals (frame storage, `PROF_MAX_DEPTH`) live in `src/dbg/prof.c`.
+
+### `driver/env.c`
+
+- Add `SIGPROF` to `g_dbg_signos[]`.
+- In `dbg_signal_handler`: early-out path for `SIGPROF` + non-null
+ `on_sample` that calls the callback and returns without touching the
+ park/unpark events.
+
+No other OS layer changes. `CfreeDbgOs` itself does not need new fields:
+timer arming (`setitimer`) and thread targeting (`pthread_kill`) are both
+driver-side and do not belong behind the vtable.
+
+## 12. Checklist
+
+### Public API — `include/cfree.h`
+
+- [ ] `on_sample` field added to `CfreeDbgSignalOps`
+- [ ] `CfreeProfBuf` and `CfreeProfWriter` structs declared
+- [ ] `cfree_jit_session_prof_attach`
+- [ ] `cfree_jit_session_prof_collect`
+
+### Library — `src/dbg/prof.c`
+
+- [ ] `CfreeProfBuf` alloc/free helpers
+- [ ] `dbg_fp_walk(ucontext, sample)` — aarch64 frame-pointer walk using
+ `guarded_copy`; terminates on NULL/misaligned/non-advancing FP
+- [ ] `on_sample` implementation: check capacity, call `dbg_fp_walk`,
+ append or increment `dropped`
+- [ ] `cfree_jit_session_prof_attach` body
+- [ ] `cfree_jit_session_prof_collect` body: symbolication loop +
+ `CfreeProfWriter` dispatch
+- [ ] `dbg_fp_walk` x64 variant (frame layout identical; FP = rbp)
+- [ ] `dbg_fp_walk` rv64 variant (FP = s0/x8; saved FP at FP-16, RA at FP-8)
+
+### Host adapter — `driver/env.c`
+
+- [ ] `SIGPROF` added to `g_dbg_signos[]`
+- [ ] `on_sample` early-return path in `dbg_signal_handler`
+
+### Driver — `driver/prof.c`
+
+- [ ] Flag parsing (`--rate`, `--depth`, `--cap`, `--output`,
+ `--no-folded`, `--no-flat`)
+- [ ] `setitimer(ITIMER_PROF, ...)` arm before `session_call`, disarm after
+- [ ] `CfreeProfWriter` for folded-stacks output (sort + run-length encode)
+- [ ] Flat report: self% / cumul% table printed to stdout
+- [ ] Dropped-sample warning when `buf.dropped > 0`
+- [ ] Wire into multi-call dispatch in `driver/main.c`
+
+### Tests
+
+- [ ] `test/smoke/prof_hello`: run a simple C program under `cfree prof`,
+ assert `prof.folded` is non-empty, `main` appears in output
+- [ ] `test/dbg/fp_walk_aa64`: canned aarch64 frame chain (stack buffer
+ with crafted FP links); assert `dbg_fp_walk` produces expected PC
+ sequence and terminates correctly on a NULL sentinel
+- [ ] `test/dbg/prof_buf_overflow`: fill buffer to capacity, assert
+ `dropped` increments and count stays at cap
+
+### Bigger follow-ons
+
+- [ ] Per-thread timer via `timer_create(CLOCK_THREAD_CPUTIME_ID)` +
+ `SIGEV_THREAD_ID` (Linux) for reliable delivery in multi-thread guests
+- [ ] `ITIMER_REAL` mode (`--wall`) for profiling I/O-bound programs
+- [ ] Allocation profiling via breakpoint on the allocator entry with
+ `CfreeBreakpointSpec.condition` recording stack traces
+- [ ] SpeedScope / pprof output formats