pprof works internally, and show you how to read flame graphs to locate bottlenecks with confidence.
1. The Performance Optimization Workflow
Before touching any code, follow this four-step process:
1. Understand the code → Clarify the common logic and usage scenarios
↓
2. Write benchmarks → Simulate realistic traffic and workloads
↓
3. Collect data → Use pprof or flame graphs to capture runtime behavior
↓
4. Optimize hotspots → Focus on the functions with the highest relative cost
Rule of thumb: Optimization closer to the application layer (e.g. caching, async logic) typically yields ms-level improvements. Code-level micro-optimizations yield μs-level gains. Always start with the bigger wins.
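Step 2 of the workflow can be sketched as a small benchmark harness. The functions joinNaive and joinBuilder and the 200-element workload are hypothetical stand-ins for a real hot path; only the testing package calls come from the standard library.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// joinNaive is a hypothetical "before" implementation: repeated string
// concatenation copies the accumulated string on every iteration.
func joinNaive(parts []string) string {
	s := ""
	for _, p := range parts {
		s += p
	}
	return s
}

// joinBuilder is the hypothetical "after" implementation using strings.Builder.
func joinBuilder(parts []string) string {
	var b strings.Builder
	for _, p := range parts {
		b.WriteString(p)
	}
	return b.String()
}

// workload builds an input of n short strings (made-up data for this sketch).
func workload(n int) []string {
	parts := make([]string, n)
	for i := range parts {
		parts[i] = "x"
	}
	return parts
}

// benchJoin measures one implementation with the standard benchmark driver.
func benchJoin(join func([]string) string) testing.BenchmarkResult {
	testing.Init() // registers the testing flags; has no effect if already called
	parts := workload(200)
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			_ = join(parts)
		}
	})
}

func main() {
	naive := benchJoin(joinNaive)
	builder := benchJoin(joinBuilder)
	fmt.Println("naive:  ", naive, naive.MemString())
	fmt.Println("builder:", builder, builder.MemString())
}
```

In a real project the benchmark would more commonly live in a _test.go file as a BenchmarkXxx function, so that go test -bench=. -cpuprofile=cpu.prof can collect the data for step 3 in the same run.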
2. What Is pprof?
pprof is Go's built-in tool for visualizing and analyzing performance profiling data. It collects runtime data — goroutine stacks, memory allocations, CPU usage — and lets you identify exactly where your program spends its time and memory.
Two Types of Profilers
| Type | How It Works | Examples |
| --- | --- | --- |
| Sampling profiler | Measures at regular time intervals | Go CPU profiler |
| Tracing profiler | Fires on specific events (function call, lock, GC) | Go execution tracer |
A sampling profiler has two core components:
- Sampler: a callback triggered at fixed intervals that captures the current stack trace
- Data collector: aggregates all captured stack traces into a statistical summary (call counts, memory sizes, etc.)
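The two components can be illustrated with a toy model. This is a simplified sketch, not how the Go runtime implements them: the Collector type, runSampler, and the demo timeline are all invented for illustration, and the "timer" is just a loop over pre-recorded stacks.

```go
package main

import (
	"fmt"
	"strings"
)

// Collector plays the "data collector" role: it merges identical stacks
// into a count per unique stack.
type Collector struct {
	counts map[string]int
	total  int
}

func NewCollector() *Collector {
	return &Collector{counts: make(map[string]int)}
}

// Record stores one captured stack (root frame first).
func (c *Collector) Record(stack []string) {
	c.counts[strings.Join(stack, ";")]++
	c.total++
}

// Share returns the fraction of all samples that hit exactly this stack.
func (c *Collector) Share(stack []string) float64 {
	if c.total == 0 {
		return 0
	}
	return float64(c.counts[strings.Join(stack, ";")]) / float64(c.total)
}

// runSampler plays the "sampler" role. A real sampler is driven by a timer;
// here each element of timeline stands for the stack seen at one tick.
func runSampler(timeline [][]string, c *Collector) {
	for _, stack := range timeline {
		c.Record(stack)
	}
}

// demoTimeline is made-up data: three ticks land in parse, one in render.
func demoTimeline() [][]string {
	parse := []string{"main", "handle", "parse"}
	render := []string{"main", "handle", "render"}
	return [][]string{parse, parse, render, parse}
}

func main() {
	c := NewCollector()
	runSampler(demoTimeline(), c)
	fmt.Printf("parse share:  %.2f\n", c.Share([]string{"main", "handle", "parse"}))
	fmt.Printf("render share: %.2f\n", c.Share([]string{"main", "handle", "render"}))
}
```

The point of the model is that the collector never measures durations directly: a function's cost is inferred from how often the sampler happens to find it on the stack.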
3. How CPU Profiling Works
Go's CPU profiler uses a stack trace + statistics model.
┌──────────────────────────────────────────────────────┐
│ CPU Profiling Pipeline │
│ │
│ pprof.StartCPUProfile() │
│ ↓ │
│ Go runtime sets SIGPROF signal handler │
│ (via setitimer / timer_create / timer_settime) │
│ ↓ │
│ SIGPROF fires every 10ms of CPU time (100Hz) │
│ ↓ │
│ Kernel delivers signal to a thread running Go code │
│ ↓ │
│ sigProfHandler captures goroutine stack trace │
│ ↓ │
│ Stack written to profBuf │
│ (lock-free single-writer / single-reader ring buf) │
│ ↓ │
│ profileWriter goroutine reads profBuf │
│ ↓ │
│ Results aggregated into profMap (hashmap) │
│ ↓ │
│ pprof.StopCPUProfile() → output .prof file │
└──────────────────────────────────────────────────────┘
Key details:
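- Sampling rate: 100Hz (one sample per 10ms of CPU time). pprof.StartCPUProfile hard-codes this rate; runtime.SetCPUProfileRate exists for other rates, but most programs never call it directly
- Only running goroutines are captured. Goroutines blocked on I/O are not counted: they are parked and consume no CPU while waiting (network I/O goes through Go's non-blocking poller)
- Each captured stack can be tagged with custom labels (pprof.Do, pprof.Labels) for later filtering
- The lock-free profBuf structure (runtime/profbuf.go) keeps overhead minimal during signal handling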
Note: Because I/O-waiting goroutines are excluded, CPU profiling alone won't reveal I/O bottlenecks. Use fgprof (which calls runtime.GoroutineProfile) to capture both running and waiting goroutines.
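A minimal sketch of driving this pipeline from code, assuming an in-memory buffer instead of a .prof file. The helper names burn and profileBurn and the label pair phase=burn are made up for this example; StartCPUProfile, StopCPUProfile, Do, and Labels are the standard runtime/pprof calls.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"log"
	"runtime/pprof"
	"time"
)

// burn spins on the CPU for roughly d so the sampler has something to observe.
func burn(d time.Duration) int {
	deadline := time.Now().Add(d)
	n := 0
	for time.Now().Before(deadline) {
		n++
	}
	return n
}

// profileBurn records a CPU profile into memory while burn runs for the
// given number of milliseconds. Samples taken inside pprof.Do carry the
// label phase=burn (a label chosen only for this sketch).
func profileBurn(millis int) ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err // fails if a CPU profile is already running
	}
	labels := pprof.Labels("phase", "burn")
	pprof.Do(context.Background(), labels, func(ctx context.Context) {
		burn(time.Duration(millis) * time.Millisecond)
	})
	pprof.StopCPUProfile() // returns once all profile data has been written
	return buf.Bytes(), nil
}

func main() {
	data, err := profileBurn(300)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("captured %d bytes of profile data\n", len(data))
}
```

Writing the same bytes to a file instead of a buffer produces something go tool pprof can open. With a 300ms busy loop at 100Hz, only a few dozen samples are expected, so short runs give coarse results.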
4. How Heap Profiling Works
Heap profiling also uses a stack trace + statistics model, but instead of a timer, it hooks directly into the memory allocator.
Memory allocation path
↓
Heap profiler intercepts allocation
(samples every 512KB allocated by default)
↓
Captures current stack trace
↓
Aggregates samples → per-function allocation counts
Key metrics:
| Metric | Meaning |
| --- | --- |
| alloc_space | Total bytes allocated (cumulative) |
| alloc_objects | Total objects allocated (cumulative) |
| inuse_space | Bytes currently in use |
| inuse_objects | Objects currently in use |
Formula: inuse = alloc - free
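Because heap profiling is also sampled (by default about one sample per 512KB allocated, controlled by runtime.MemProfileRate), the reported sizes are statistical estimates scaled up from the samples rather than exact totals. Small or infrequent allocations may be over- or under-represented, but the relative proportions are accurate enough to locate hotspots.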
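The inuse = alloc - free relationship can be checked against the runtime's raw records. This is a sketch under demo assumptions: the helper names are invented, and runtime.MemProfileRate is set to 1 only so that a tiny program produces samples; that setting is too expensive for production use.

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"runtime"
	"runtime/pprof"
)

// sink keeps allocations reachable so they count as "in use".
var sink [][]byte

func allocate(n, size int) {
	for i := 0; i < n; i++ {
		sink = append(sink, make([]byte, size))
	}
}

// heapRecords returns the runtime's per-call-site allocation records.
func heapRecords() []runtime.MemProfileRecord {
	runtime.GC() // heap profile statistics are published at GC boundaries
	n, _ := runtime.MemProfile(nil, true)
	for {
		records := make([]runtime.MemProfileRecord, n+50)
		var ok bool
		n, ok = runtime.MemProfile(records, true)
		if ok {
			return records[:n]
		}
		// The profile grew in the meantime; retry with the new size.
	}
}

// consistent reports whether inuse = alloc - free holds for every record.
func consistent(records []runtime.MemProfileRecord) bool {
	for _, r := range records {
		if r.InUseBytes() != r.AllocBytes-r.FreeBytes {
			return false
		}
		if r.InUseObjects() != r.AllocObjects-r.FreeObjects {
			return false
		}
	}
	return true
}

// heapProfile serializes the heap profile in the format go tool pprof reads.
func heapProfile() ([]byte, error) {
	var buf bytes.Buffer
	err := pprof.WriteHeapProfile(&buf)
	return buf.Bytes(), err
}

func main() {
	runtime.MemProfileRate = 1 // demo only: sample every allocation
	allocate(64, 64<<10)

	var alloc, inuse int64
	for _, r := range heapRecords() {
		alloc += r.AllocBytes
		inuse += r.InUseBytes()
	}
	fmt.Printf("alloc_space=%d inuse_space=%d\n", alloc, inuse)

	data, err := heapProfile()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("serialized heap profile: %d bytes\n", len(data))
}
```

MemProfileRate should be set as early as possible in the program, since allocations made before the change are sampled at the old rate.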
5. Other Profiling Types
Goroutine Profiling
Captures the call stack of every live user-created goroutine, whether it is running, runnable, or blocked (goroutines whose entry point is a runtime.* function are excluded).
stop the world
→ iterate allg slice
→ output stack trace for each goroutine
start the world
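From user code, this snapshot is reachable through pprof.Lookup("goroutine"). The sketch below starts a few placeholder goroutines (startWorkers is a made-up helper) and dumps the profile in its human-readable form.

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"runtime/pprof"
)

// startWorkers launches n goroutines that wait until stop is called.
func startWorkers(n int) (stop func()) {
	done := make(chan struct{})
	for i := 0; i < n; i++ {
		go func() { <-done }()
	}
	return func() { close(done) }
}

// goroutineCount returns the number of stacks in the goroutine profile.
func goroutineCount() int {
	return pprof.Lookup("goroutine").Count()
}

// dumpGoroutines renders the goroutine profile as text.
// debug=1 selects the human-readable format instead of the binary one.
func dumpGoroutines() (string, error) {
	var buf bytes.Buffer
	err := pprof.Lookup("goroutine").WriteTo(&buf, 1)
	return buf.String(), err
}

func main() {
	stop := startWorkers(3)
	defer stop()

	fmt.Println("goroutines in profile:", goroutineCount())
	out, err := dumpGoroutines()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(out)
}
```

The text output groups goroutines that share an identical stack, which makes leaks easy to spot: a leaked goroutine pattern shows up as one stack with a steadily growing count.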
Block Profiling
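Samples blocking operations (channel waits, mutex waits) and records how long each one blocked.
- Controlled by runtime.SetBlockProfileRate(rate), where rate is in nanoseconds: events that block for at least rate are always recorded, and shorter ones are sampled proportionally
- Rate: 1 = record every block, 0 (the default) = profiling off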
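A small sketch of enabling block profiling with rate 1 and provoking one blocking channel receive. The helpers waitOnChannel and blockSites are hypothetical names; the runtime and pprof calls are standard.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"time"
)

// waitOnChannel blocks the calling goroutine on a channel receive for
// roughly d, until a helper goroutine sends a value.
func waitOnChannel(d time.Duration) {
	ch := make(chan struct{})
	go func() {
		time.Sleep(d)
		ch <- struct{}{}
	}()
	<-ch // this wait is what the block profiler records
}

// blockSites enables block profiling with rate 1 (record every blocking
// event), blocks once, and returns the number of distinct stacks recorded.
func blockSites(millis int) int {
	runtime.SetBlockProfileRate(1)
	defer runtime.SetBlockProfileRate(0) // turn profiling off again
	waitOnChannel(time.Duration(millis) * time.Millisecond)
	return pprof.Lookup("block").Count()
}

func main() {
	fmt.Println("call sites with recorded blocking:", blockSites(20))
}
```

Rate 1 is convenient for a demo but adds overhead to every blocking operation; a larger rate is the usual choice for long-running services.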
Lock Contention Profiling
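Samples mutex contention: how often locks are contended and for how long.
- Controlled by runtime.SetMutexProfileFraction(rate): on average 1 in rate contention events is reported
- Rate: 1 = record every contention event, 0 (the default) = profiling off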
- Uses the same report/aggregate pattern as block profiling
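The mutex profile is enabled in much the same way. In this sketch, contend, setMutexRate, mutexRate, and contendedSites are invented helper names; the contention is staged by holding a lock while a second goroutine tries to acquire it.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

// contend holds a mutex for roughly hold while a second goroutine
// waits to acquire it, producing one contention event.
func contend(hold time.Duration) {
	var mu sync.Mutex
	var wg sync.WaitGroup
	mu.Lock()
	wg.Add(1)
	go func() {
		defer wg.Done()
		mu.Lock() // waits here until the holder unlocks
		mu.Unlock()
	}()
	time.Sleep(hold)
	mu.Unlock()
	wg.Wait()
}

// setMutexRate sets the sampling fraction and returns the previous value.
func setMutexRate(rate int) int {
	return runtime.SetMutexProfileFraction(rate)
}

// mutexRate reads the current fraction; a negative argument only reads.
func mutexRate() int {
	return runtime.SetMutexProfileFraction(-1)
}

// contendedSites samples every contention event (rate 1), stages one,
// and returns the number of distinct stacks in the mutex profile.
func contendedSites(millis int) int {
	prev := setMutexRate(1)
	defer setMutexRate(prev)
	contend(time.Duration(millis) * time.Millisecond)
	return pprof.Lookup("mutex").Count()
}

func main() {
	fmt.Println("call sites with recorded contention:", contendedSites(20))
}
```

Restoring the previous fraction, as contendedSites does, keeps a temporary diagnostic from silently changing the setting for the rest of the program.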
6. How to Read a Flame Graph
A flame graph is the most intuitive way to visualize profiling data. Here's how to interpret it:
┌─────────────────────────────────────────────────────┐
│ Flame Graph │
│ │
│ [narrow] encodeJSON [narrow] │ ← top: currently on CPU
│ [ processRequest ] │
│ [ handleHTTP ] │
│ [ ServeHTTP ] │
│ [ main ]│ ← bottom: entry point
│ │
│ ← call order: bottom to top │
│ ← width = time proportion (wider = more CPU) │
│ ← color has no special meaning │
└─────────────────────────────────────────────────────┘
Reading rules:
| Axis | Meaning |
| --- | --- |
| Vertical (Y) | Call stack depth: bottom is the entry point, top is what's running on CPU |
| Horizontal (X) | Alphabetically sorted, merged call stacks (not time order) |
| Width of a block | Proportion of samples: wider = more CPU time = likely bottleneck |
| Color | No special meaning, just for visual contrast |
Focus on wide blocks near the top — these are the functions consuming the most CPU and are your primary optimization targets.
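To make the width rule concrete, the sketch below derives widths from folded stack samples. The Sample type, the widths function, and the sample counts are made up for illustration (the function names echo the diagram, plus a hypothetical validate frame). Unlike a real flame graph, it sums a function's samples across all call paths rather than drawing one block per path.

```go
package main

import "fmt"

// Sample is one unique stack (root frame first) and how often it was seen.
type Sample struct {
	Stack []string
	Count int
}

// widths returns, per function, the number of samples in which it appears
// anywhere on the stack (its block width) and the number in which it is
// the top frame (the time it spent on CPU itself).
func widths(samples []Sample) (inclusive, self map[string]int, total int) {
	inclusive = make(map[string]int)
	self = make(map[string]int)
	for _, s := range samples {
		if len(s.Stack) == 0 {
			continue
		}
		total += s.Count
		seen := make(map[string]bool) // count recursive functions once per sample
		for _, fn := range s.Stack {
			if !seen[fn] {
				seen[fn] = true
				inclusive[fn] += s.Count
			}
		}
		self[s.Stack[len(s.Stack)-1]] += s.Count
	}
	return inclusive, self, total
}

// demoSamples is invented data shaped like the diagram above.
func demoSamples() []Sample {
	base := []string{"main", "ServeHTTP", "handleHTTP", "processRequest"}
	withTop := func(fn string) []string {
		return append(append([]string{}, base...), fn)
	}
	return []Sample{
		{Stack: withTop("encodeJSON"), Count: 60},
		{Stack: base, Count: 25},
		{Stack: withTop("validate"), Count: 15},
	}
}

func main() {
	inclusive, self, total := widths(demoSamples())
	for _, fn := range []string{"main", "processRequest", "encodeJSON", "validate"} {
		fmt.Printf("%-15s width=%3d%%  self=%3d%%\n",
			fn, 100*inclusive[fn]/total, 100*self[fn]/total)
	}
}
```

In this data, processRequest is as wide as main because every sample passes through it, but only a quarter of the samples end there; encodeJSON is the wide block at the top and therefore the first optimization candidate.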
7. go tool trace — When pprof Isn't Enough
pprof tells you what is using CPU. But it can't tell you why a goroutine isn't running. For that, use go tool trace.
Possible reasons a goroutine isn't running:
- Blocked on a syscall
- Blocked on a channel or mutex
- Blocked by the GC (STW)
- Not scheduled by the runtime
go tool trace captures these events with nanosecond-level precision:
| Event Category | Examples |
| --- | --- |
| Goroutine lifecycle | create / block / unblock |
| Syscall | enter / exit / block |
| GC events | mark start, STW, sweep |
| Heap | allocation / free size changes |
| Processor | start / stop |
Trace UI panels:
Timeline → execution time axis (zoomable)
─────────────────────────────────────────────
Heap → memory alloc/free over time (line chart)
Goroutines → GCWaiting | Runnable | Running counts
Threads → InSyscall | Running counts
─────────────────────────────────────────────
P0 ~ Pn → one row per virtual processor (GOMAXPROCS)
shows which goroutine ran on each P
click a goroutine → stack trace + related events
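A trace can be captured programmatically with the runtime/trace package. The following is a sketch that writes into a buffer; traceWorkers and the task and region names "demo" and "compute" are arbitrary choices for this example.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"log"
	"runtime/trace"
	"sync"
)

// traceWorkers records an execution trace while a few goroutines hand
// results to the caller over an unbuffered channel, which generates
// goroutine create / block / unblock events.
func traceWorkers(workers int) ([]byte, error) {
	var buf bytes.Buffer
	if err := trace.Start(&buf); err != nil {
		return nil, err // fails if tracing is already enabled
	}

	// Tasks and regions are optional annotations that show up in the UI.
	ctx, task := trace.NewTask(context.Background(), "demo")
	results := make(chan int)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			trace.WithRegion(ctx, "compute", func() {
				results <- id * id
			})
		}(i)
	}
	go func() {
		wg.Wait()
		close(results)
	}()
	for range results {
		// Drain the channel; each receive may block until a worker sends.
	}
	task.End()

	trace.Stop() // returns after all trace data has been written
	return buf.Bytes(), nil
}

func main() {
	data, err := traceWorkers(4)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("captured %d bytes of trace data\n", len(data))
}
```

Writing the same data to a file such as trace.out and running go tool trace trace.out opens the panels described above.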
8. Quick Reference: Profiling Types Summary
| Type | What It Captures | Sampling Rate | Trigger |
| --- | --- | --- | --- |
| CPU | Function call time | 100Hz (10ms) | SIGPROF signal |
| Heap | Alloc/inuse memory | Every 512KB | Memory allocator hook |
| Goroutine | All live goroutine stacks | On demand | STW snapshot |
| ThreadCreate | OS thread creation stacks | On demand | STW snapshot |
| Block | Blocking op duration | Threshold-based | Block event hook |
| Mutex | Lock contention duration | Ratio-based | Lock event hook |
| Trace | All runtime events | Continuous | Event instrumentation |
Summary
- Use pprof CPU profiling to find functions consuming the most CPU time
- Use heap profiling to locate memory allocation hotspots
- Use flame graphs to visually identify wide, top-level blocks as bottlenecks
- Use block/mutex profiling to diagnose concurrency contention
- Use go tool trace when you need to understand why goroutines aren't running
Next in this series: Go Heap Memory Allocation: tcmalloc, Mutator/Allocator & Multi-Level Cache (Part 2)