Bayesian Online Change Point Detection — AVX2 Optimized
A production-grade C implementation of Adams & MacKay (2007) BOCPD, engineered for sub-microsecond latency in high-frequency trading systems.
BOCPD maintains a probability distribution over "run lengths" — time since the last changepoint — updated online with each observation. Unlike batch methods, it processes data sequentially with O(active_len) complexity per step.
Use cases: Volatility regime detection, trend break identification, adaptive filtering, real-time anomaly detection.
| Method | Latency | Detects | Online |
|---|---|---|---|
| CUSUM | ~100ns | Mean shift | ✓ |
| BOCPD | ~0.19μs | Mean + Variance | ✓ |
| HMM | ~50μs | Regimes | ✓ |
| Offline CPD | ~10ms | All | ✗ |
Benchmark Results (Intel Core i9):
| Configuration | Throughput | Latency |
|---|---|---|
| Single detector | 2.43M obs/sec | 0.41 μs |
| Stationary data (100K) | 1.87M obs/sec | 0.53 μs |
| With changepoints (100K) | 5.14M obs/sec | 0.19 μs |
| Pool (100 instruments) | 3.03M obs/sec | 0.33 μs |
| Large scale (380 instruments) | 2.12M obs/sec | 0.47 μs |
| Peak throughput | 5.14M obs/sec | 0.19 μs |
Measured on Intel Core i9, GCC/MSVC, -O3 -mavx2 -mfma:
| Test Scenario | Throughput | Latency | Speedup vs Naive |
|---|---|---|---|
| Single detector | 2.43M obs/sec | 0.41 μs | ×ばつ |
| Stationary data (100K obs) | 1.87M obs/sec | 0.53 μs | ×ばつ |
| With changepoints (100K obs) | 5.14M obs/sec | 0.19 μs | ×ばつ |
| Multi-detector pool (100) | 3.03M obs/sec | 0.33 μs | ×ばつ |
| Large scale (380 ×ばつ 500) | 2.12M obs/sec | 0.47 μs | — |
| max_run | Throughput | avg_active |
|---|---|---|
| 64 | 2.26M obs/sec | 41.3 |
| 128 | 1.90M obs/sec | 53.1 |
| 256 | 1.84M obs/sec | 56.5 |
| 512 | 1.69M obs/sec | 61.4 |
| 1024 | 1.69M obs/sec | 62.8 |
| 2048 | 1.87M obs/sec | 56.0 |
| 4096 | 1.72M obs/sec | 61.2 |
| Implementation | Throughput | Relative |
|---|---|---|
| Python (numpy) | ~1.2K obs/sec | ×ばつ |
| Rust (changepoint) | ~22K obs/sec | ×ばつ |
| Naive C | 52K obs/sec | ×ばつ |
| BOCPD Ultra V3.2 | 2.43M obs/sec | ×ばつ |
| Peak (with changepoints) | 5.14M obs/sec | ×ばつ |
┌─────────────────────────────────────────────────────────────────┐
│ BOCPD Ultra V3.2 │
├─────────────────────────────────────────────────────────────────┤
│ Custom lgamma (Lanczos + Stirling) │
│ ├─ Replaces slow libm lgamma() (~100 cycles → ~15 cycles) │
│ ├─ Lanczos g=4.7421875 for small args │
│ └─ Stirling with Bernoulli coefficients for large args │
├─────────────────────────────────────────────────────────────────┤
│ Native Interleaved Layout │
│ ├─ 256-byte superblocks (4 run lengths ×ばつ 8 parameters) │
│ ├─ Prediction params first (cache-friendly) │
│ └─ Zero-copy: no per-step data transformation │
├─────────────────────────────────────────────────────────────────┤
│ Ping-Pong Double Buffering │
│ ├─ Two interleaved buffers, alternating each step │
│ ├─ Zero memmove operations │
│ └─ Implicit +1 shift via buffer swap │
├─────────────────────────────────────────────────────────────────┤
│ AVX2 ASM Kernel │
│ ├─ 8 elements/iteration (2 blocks ×ばつ 4 lanes) │
│ ├─ Interleaved Block A/B for ILP │
│ ├─ Estrin polynomial evaluation (reduced latency) │
│ ├─ IEEE-754 exponent bit manipulation │
│ └─ bsr-based truncation (no branch mispredicts) │
├─────────────────────────────────────────────────────────────────┤
│ Pool Allocator │
│ ├─ Single malloc for N detectors │
│ ├─ Cache-friendly sequential layout │
│ └─ Reduced TLB pressure │
└─────────────────────────────────────────────────────────────────┘
At each timestep t with observation x_t, BOCPD updates the run-length distribution:
- Predict: Compute P(x_t | run_length = r) via Student-t posterior predictive
- Growth: P(r_t = r) ∝ P(r_{t-1} = r-1) · P(x_t | run) · (1 - H)
- Changepoint: P(r_t = 0) ∝ Σr P(r_{t-1} = r) · P(x_t | run) · H
- Normalize: Ensure Σr P(r_t = r) = 1
Where H = 1/λ is the hazard rate.
For Gaussian data with unknown mean μ and variance σ2:
Prior: p(μ, τ) = Normal(μ | μ0, (κ0τ)−1) · Gamma(τ | α0, β0)
Posterior Predictive (Student-t):
p(x | data) = Student-t(x | μn, σ2n, νn)
where:
νn = 2αn (degrees of freedom)
σ2n = βn(κn+1)/(αnκn) (scale)
The key signal is not r[0] directly, but:
- MAP drop: argmax(r) suddenly falls from high to low
- Short-run probability: P(r < k) spikes at changepoints
Time: 1 2 3 4 5 6 7 8 9 10 11 12
Data: ~~~~~~~~~~~~ normal ~~~~~~~~~~~~ | !! SHIFT !!
MAP rl: 1 2 3 4 5 6 7 8 9 2 3 4
P(r<5): .9 .5 .3 .2 .1 .1 .1 .1 .1 .8 .6 .4
↑
DETECTED
This section details the engineering that achieves 5.14M obs/sec peak throughput.
Custom lgamma Implementation (×ばつ Speedup)
×ばつ Speedup)" href="#custom-lgamma-implementation-2-speedup">Problem: The Student-t log-pdf requires lgamma((ν+1)/2) - lgamma(ν/2) for every run length. The standard library lgamma() is extremely slow (~100+ cycles) and dominates runtime.
Solution: Implement custom lgamma using Lanczos approximation for small arguments and Stirling series for large arguments:
// Lanczos approximation (ν < 40): ~15 cycles // g = 4.7421875 (exactly representable in binary) double lgamma_lanczos(double x) { static const double c[7] = { 0.99999999999980993, 676.5203681218851, -1259.1392167224028, 771.32342877765313, -176.61502916214059, 12.507343278686905, -0.13857109526572012 }; // Horner's method evaluation... } // Stirling series (ν ≥ 40): ~12 cycles // ln Γ(x) ≈ (x-1⁄2)ln(x) - x + 1⁄2ln(2π) + Σ B_{2k}/(2k(2k-1)x^{2k-1}) double lgamma_stirling(double x) { // Bernoulli number coefficients B2, B4, B6, B8, B10, B12 // Achieves ~15 digits precision for x ≥ 40 }
Impact: Replacing libm lgamma() with custom implementation gave ~100% speedup (×ばつ faster). This was the single largest optimization.
Full Student-t log-pdf:
ln p(x) = lgamma((ν+1)/2) - lgamma(ν/2) - 1⁄2ln(νπσ2) - ((ν+1)/2)·ln(1 + z2/(νσ2))
Optimized (constants precomputed when posterior updates):
ln p(x) = C1 - C2 · log1p(z2 · inv_σ2ν)
C1 = lgamma(α+1⁄2) - lgamma(α) - 1⁄2ln(2απσ2) ← precomputed at update time
C2 = α + 1⁄2 ← precomputed
inv_σ2ν = 1/(σ2·2α) ← precomputed
Hot path: 1 multiply, 1 FMA, 1 log1p. No lgamma, no division.
The lgamma calls happen only during posterior updates (once per run length that survives truncation), not during prediction (which runs for all active run lengths).
Problem: V2 required O(n) build_interleaved() transformation every observation to convert separate arrays into SIMD-friendly format.
Solution: Store parameters directly in 256-byte superblocks:
V2 Layout (separate arrays + staging buffer):
mu[]: [μ0 μ1 μ2 μ3 ...] ← separate allocation
C1[]: [c0 c1 c2 c3 ...] ← separate allocation
C2[]: [d0 d1 d2 d3 ...] ← separate allocation
inv_ssn[]: [e0 e1 e2 e3 ...] ← separate allocation
build_interleaved(): O(n) copy every step!
V3 Layout (native 256-byte superblocks):
Block 0 (bytes 0-255):
[μ0 μ1 μ2 μ3] offset 0 (prediction)
[c0 c1 c2 c3] offset 32 (prediction)
[d0 d1 d2 d3] offset 64 (prediction)
[e0 e1 e2 e3] offset 96 (prediction)
[κ0 κ1 κ2 κ3] offset 128 (update)
[α0 α1 α2 α3] offset 160 (update)
[β0 β1 β2 β3] offset 192 (update)
[n0 n1 n2 n3] offset 224 (update)
Block 1 (bytes 256-511):
... next 4 elements ...
Impact: Eliminated O(n) copy, direct SIMD access. ~20% overall speedup.
Problem: BOCPD reads from run length r but writes to r+1. Naive approach requires memmove() after every step to shift the distribution.
Solution: Maintain two buffers, alternate between them:
Step t (cur_buf = 0):
Read: interleaved[0][r] → posterior for run length r
Write: interleaved[1][r+1] → updated posterior at r+1
Flip: cur_buf = 1
Step t+1 (cur_buf = 1):
Read: interleaved[1][r] → now contains what was r+1
Write: interleaved[0][r+1] → write to other buffer
Flip: cur_buf = 0
Benefits:
- Zero
memmove()operations (was O(n) per step) - Implicit +1 index shift via buffer swap
- No data hazards between read and write
Impact: Eliminates ~15-20% overhead from memory copies.
Problem: Ping-pong update reads from index i, writes to index i+1. With interleaved blocks, this crosses block boundaries.
Solution: Use vpermpd to rotate vector, vblendpd to merge with existing blocks:
; Input: [v0, v1, v2, v3] for indices [i, i+1, i+2, i+3] ; Output: write to indices [i+1, i+2, i+3, i+4] spanning two blocks vpermpd ymm_rot, ymm_vals,0x93 ; [v3, v0, v1, v2] vblendpd ymm_k, ymm_exist_k, ymm_rot, 0b1110 ; block k: lanes 1,2,3 vblendpd ymm_k1, ymm_exist_k1, ymm_rot, 0b0001 ; block k+1: lane 0
Performance: 6 cycles vs ~16 cycles for scalar stores.
The Student-t log-pdf requires log(1 + t) where t = z2/σ2ν.
log(1+t) ≈ t · (c1 + t·(c2 + t·(c3 + t·(c4 + t·(c5 + t·c6)))))
Coefficients (Taylor series):
c1 = 1, c2 = -1/2, c3 = 1/3, c4 = -1/4, c5 = 1/5, c6 = -1/6
Horner's method: 6 FMAs, fully pipelined, ~15 cycles vs ~100 for libm log().
Problem: Horner's method for exp() has 6-deep dependency chain.
Solution: Estrin's scheme groups terms to reduce depth:
exp(x) = 2^k · 2^f where k = round(x/ln2), f = frac(x/ln2), |f| ≤ 0.5
2^f ≈ 1 + f·c1 + f2·c2 + f3·c3 + f4·c4 + f5·c5 + f6·c6
Estrin (depth 4):
p01 = 1 + f·c1 }
p23 = c2 + f·c3 } parallel (level 1)
p45 = c4 + f·c5 }
f2 = f · f
q0123 = p01 + f2·p23 } parallel (level 2)
q456 = p45 + f2·c6 }
f4 = f2 · f2
result = q0123 + f4·q456 (level 3)
Impact: ~5% improvement on polynomial evaluation.
Computing 2^k without libm:
vcvtpd2dq xmm0, ymm_k ; k → int32 vpmovsxdqymm0,xmm0 ; sign-extend to int64 vpaddqymm0,ymm0,[bias] ; add exponent bias (1023) vpsllqymm0,ymm0,52 ; shift to exponent field ; ymm0 now contains bit pattern for 2^k as double
Direct bit manipulation: 4 integer ops vs expensive floating-point pow().
Problem: Single 4-wide AVX2 block leaves FMA units underutilized during memory latency.
Solution: Process 8 elements per iteration as two 4-wide blocks:
Block addressing (V3 layout):
Block A: byte_offset = (i / 4) * 256
Block B: byte_offset = (i / 4) * 256 + 256
Iteration structure:
Block A: load → z2 → t → log1p → exp → update → store
Block B: load → z2 → t → log1p → exp → update → store
Advance: i += 8, idx_vec += 8
Problem: vhaddpd is slow (high latency, limited ports).
Solution: Use vunpckhpd + vaddsd:
; Reduce [a, b, c, d] → a+b+c+d vextractf128xmm0,ymm12,1 ; [c, d] vaddpd xmm0,xmm0,xmm12 ; [a+c, b+d] vunpckhpd xmm1,xmm0,xmm0 ; [b+d, b+d] vaddsd xmm0,xmm0,xmm1 ; a+b+c+d
Problem: Chain of bt (bit test) instructions is slow.
Solution: Single bsr (bit scan reverse) finds highest set bit:
vcmppd ymm0, ymm_growth, ymm_thresh,14 vmovmskpd eax,ymm0 testeax,eax jz .skip bsrecx,eax ; Find highest set bit learbx,[rsi+rcx+1] ; last_valid = i + bit + 1
Problem: Broadcasting loop counter is expensive:
// Old: 2 μops, 3-cycle latency per iteration __m256i idx = _mm256_set1_epi64x(i); idx = _mm256_add_epi64(idx, offset_vec);
Solution: Maintain running index vectors:
; Setup (once): vmovapd ymm15,[1.0,2.0,3.0,4.0] ; idx_vec_A vmovapd ymm_b,[5.0,6.0,7.0,8.0] ; idx_vec_B (on stack) ; Per iteration (1 μop each): vaddpd ymm15,ymm15,ymm7 ; ymm7 = [8,8,8,8] vaddpd ymm3,[stack],ymm7 vmovapd [stack],ymm3
Hand-written AVX2 kernel optimized for modern x86-64:
Dedicated constants:
ymm12 = threshold
ymm13 = 1-h
ymm14 = h (hazard rate)
ymm15 = x (observation, broadcast)
Accumulators:
ymm9 = max_growth_B
ymm10 = max_growth_A
ymm11 = r0 accumulator
Scratch (reused per block):
ymm0-8 = computation temporaries
Stack storage:
idx_vec_A, idx_vec_B, max_idx_A, max_idx_B, r_old_B
; Block A: elements [i, i+1, i+2, i+3] movrax,r14 shrrax,2 ; block_index = i / 4 shlrax,8 ; byte_offset = block_index * 256 vmovapd ymm0,[r13+rax] ; mu at offset 0 vmovapd ymm1,[r13+rax+32] ; C1 at offset 32 vmovapd ymm2,[r13+rax+64] ; C2 at offset 64 vmovapd ymm3,[r13+rax+96] ; inv_ssn at offset 96 ; Block B: elements [i+4, i+5, i+6, i+7] movrax,r14 addrax,4 shrrax,2 shlrax,8 ; Next block at +256 bytes
| Parameter | Typical Range | Effect |
|---|---|---|
mu0 |
0 | Prior mean center |
kappa0 |
0.001 - 0.1 | Lower = faster adaptation |
alpha0 |
0.5 - 5 | Variance prior shape |
beta0 |
Scale to data | E[σ2] ≈ β0/(α0-1) |
Daily returns (σ ≈ 1-2%):
bocpd_prior_t prior = { .mu0 = 0.0, .kappa0 = 0.01, .alpha0 = 2.0, .beta0 = 0.0002 // E[σ2] ≈ 0.0001 };
| Data Frequency | λ | Expected regime duration |
|---|---|---|
| Tick data | 1000-10000 | Many ticks per regime |
| 1-minute bars | 200-500 | ~1-2 changes per day |
| Daily bars | 50-200 | ~1 change per month |
#include "bocpd_asm.h" bocpd_asm_t detector; bocpd_prior_t prior = {0.0, 1.0, 1.0, 1.0}; bocpd_ultra_init(&detector, 200.0, prior, 1024); for (int i = 0; i < n_observations; i++) { bocpd_ultra_step(&detector, data[i]); if (detector.p_changepoint > 0.5) { printf("Changepoint detected at t=%zu\n", detector.t); } } bocpd_ultra_free(&detector);
bocpd_pool_t pool; bocpd_prior_t prior = {0.0, 1.0, 1.0, 1.0}; bocpd_pool_init(&pool, 100, 200.0, prior, 1024); for (int t = 0; t < n_steps; t++) { for (int i = 0; i < 100; i++) { bocpd_asm_t *det = bocpd_pool_get(&pool, i); bocpd_ultra_step(det, data[i][t]); } } bocpd_pool_free(&pool);
| Capacity | Memory | Notes |
|---|---|---|
| 64 | ~18 KB | Minimum practical |
| 256 | ~38 KB | Low-latency trading |
| 512 | ~74 KB | General use |
| 1024 | ~148 KB | Long regimes |
| 2048 | ~296 KB | Extended analysis |
This implementation achieves 5.14M obs/sec peak. Faster is possible — but not without sacrificing correctness.
| Rejected Optimization | Potential Gain | Why We Refused |
|---|---|---|
| Low-order polynomials | +7-15% | Tail probabilities span 300+ orders of magnitude |
| Approximate pruning (top-K) | +30-70% | Makes posterior non-Bayesian |
| float16/float32 | +10-20% | Underflows at ~1e-38; run-lengths hit 1e-200 |
| Top-1 path tracking | +3000% | Not Bayesian, can't quantify uncertainty |
| Bit-hack exponentiation | +15-25% | Numerically unsafe for BOCPD ranges |
Anyone claiming 10M+ obs/sec for "BOCPD" is likely doing top-1 path tracking or using garbage polynomials. That's not Bayesian inference.
5.14M obs/sec is the performance ceiling given numerical constraints. For trading systems where wrong signals cost real money, that ceiling is the right place to be.
-
Adams, R. P., & MacKay, D. J. C. (2007). Bayesian Online Changepoint Detection. arXiv:0710.3742. [PDF]
-
Murphy, K. P. (2007). Conjugate Bayesian analysis of the Gaussian distribution. [PDF]
-
Intel Intrinsics Guide. [Link]
-
Agner Fog's Optimization Manuals. [Link]
| Version | Throughput | Key Changes |
|---|---|---|
| V1 (Naive) | 52K obs/sec | Reference implementation |
| V2 | 525K obs/sec | Ping-pong buffers, AVX2 kernel |
| V3 | 1.5M obs/sec | Native interleaved layout |
| V3.2 | 5.14M obs/sec | Optimized ASM kernel, pool allocator |
MIT License. See LICENSE.
Built for speed. Designed for trading. Open for all.