PG pool exhaustion blocks PG-backed endpoints (live incident 2026-05-26) #454
Description
Live incident 2026-05-26 ~03:00 UTC. xcjsam reported "no pods loading" on app-dev.commonly.me. Investigation:
- /api/pods and /api/messages/:podId (both PG-backed) hung indefinitely (60s+, 0 bytes returned). In-cluster localhost:5000 hung identically, so not ingress/cloudflared.
- /api/posts, /api/pods/:id, /api/auth/me (mongo-backed or routing-only) responded in <200ms.
- Backend pod CPU 42m / mem 677Mi of 2Gi: plenty of headroom.
- Direct mongoose query: 504ms. Direct PG query from a one-off kubectl exec node process: 625ms. So the underlying DBs respond fine.
- Conclusion: the pg.Pool connection pool was exhausted in the live process. pool.options.max: 10, connectionTimeoutMillis: undefined → new pool.query(...) calls await forever instead of failing fast.
Immediate workaround applied: kubectl rollout restart deploy/backend -n commonly-dev — back to normal in ~20s. Verified /api/pods returns full set in 537ms post-restart.
Why the pool exhausted
Backend log right before incident:
✓ Pod summary requests enqueued: 60
Dispatching agent heartbeat events... (repeated)
The hourly summarizer fans out 60 summary.request events. Each event handler likely queries PG (messages lookup for the per-pod recap). 60 concurrent calls against 10 pool slots → 50 queries waiting in line. Any handler that takes >1s starves the pool. Without connectionTimeoutMillis, all subsequent pool.query() calls — including user-facing getAllPods — hang behind the queue.
Not a connection leak: grep -rE 'pool.connect|client.release' backend/ finds only the db-pg.ts init code, which does release its client. The bottleneck is pool.query() call volume combined with slow individual queries and a tiny pool, not unreleased connections.
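To make the queueing arithmetic above concrete, here is a minimal sketch of a bounded pool with a FIFO wait queue and no acquire timeout. It is a toy model, not the pg implementation: ToyPool and simulateFanOut are hypothetical names, and the per-query work time is an arbitrary stand-in for a real PG query.

```typescript
// Toy model of a bounded pool with a FIFO wait queue and NO acquire timeout.
// Hypothetical illustration only; this is not how pg.Pool is implemented.
class ToyPool {
  private inUse = 0;
  private waiters: Array<() => void> = [];

  constructor(private readonly max: number) {}

  get waitingCount(): number {
    return this.waiters.length;
  }

  // Resolves once a slot is free. Without a timeout the caller waits indefinitely.
  private acquire(): Promise<void> {
    if (this.inUse < this.max) {
      this.inUse++;
      return Promise.resolve();
    }
    return new Promise<void>((resolve) => {
      this.waiters.push(resolve);
    });
  }

  private release(): void {
    const next = this.waiters.shift();
    if (next) {
      next(); // hand the slot directly to the oldest waiter
    } else {
      this.inUse--;
    }
  }

  // Simulates one query that holds a slot for workMs milliseconds.
  async query(workMs: number): Promise<void> {
    await this.acquire();
    try {
      await new Promise<void>((r) => setTimeout(r, workMs));
    } finally {
      this.release();
    }
  }
}

// Fan out `events` queries at once, then issue one "user-facing" query behind them.
async function simulateFanOut(events: number, max: number, workMs: number) {
  const pool = new ToyPool(max);
  const summaryJobs = Array.from({ length: events }, () => pool.query(workMs));
  const waitingAfterFanOut = pool.waitingCount; // events - max when events > max

  const userStarted = Date.now();
  const userRequest = pool.query(1).then(() => Date.now() - userStarted);

  await Promise.all(summaryJobs);
  const userLatencyMs = await userRequest;
  return { waitingAfterFanOut, userLatencyMs };
}
```

Calling simulateFanOut(60, 10, 20) leaves 50 queries queued right after the fan-out, and the user-facing query only gets a slot after every earlier waiter has been served. In the model each query takes 20 ms; in the incident the handlers were presumably much slower, which is what turns the same queue into an apparent hang.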
Concrete fixes
- Bump pool.options.max from 10 to ~50. Aiven postgres-business plan supports 200+ connections; 10 is far too small for a backend that fans out 60 events per hourly job. (backend/config/db-pg.ts, one-line change.)
- Set connectionTimeoutMillis: 5000 so pool starvation fails fast as 503 instead of hanging indefinitely. The current behavior (hang forever) is worse than a clear error: Express never times out, so the user sees a perpetual loading state.
- Audit heartbeat/summarizer dispatch for pool.query() calls, especially in services/agentEventService.ts and services/summarizerService.ts. Each event handler that hits PG should batch where possible, or use a smaller chunk size (10 at a time, not 60).
- Add a /api/health/db probe that checks pool.idleCount + pool.waitingCount and alerts when waiting > 5 for >30s. Would have caught this before user impact.
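The fixes above could be sketched roughly as follows. This is an illustration under assumptions, not the actual backend code: runInChunks, PoolHealthMonitor, PoolCounters and proposedPoolSettings are hypothetical names, and the real db-pg.ts and summarizer code may be structured differently. The option names max and connectionTimeoutMillis and the counters totalCount, idleCount and waitingCount are the ones pg.Pool exposes; the sketch does not import pg, so it runs on its own.

```typescript
// Proposed values from the list above; in the real code these would be
// passed to `new Pool({...})` from the pg package (assumption about db-pg.ts).
interface PoolSettings {
  max: number;
  connectionTimeoutMillis: number;
}
const proposedPoolSettings: PoolSettings = { max: 50, connectionTimeoutMillis: 5000 };

type Outcome<R> = { ok: true; value: R } | { ok: false; error: unknown };

// Run handlers in fixed-size chunks so at most `chunkSize` of them hit PG at once.
// A failing handler is recorded and does not abort the remaining chunks.
async function runInChunks<T, R>(
  items: T[],
  chunkSize: number,
  handler: (item: T) => Promise<R>,
): Promise<Outcome<R>[]> {
  if (chunkSize < 1) throw new RangeError("chunkSize must be >= 1");
  const results: Outcome<R>[] = [];
  for (let i = 0; i < items.length; i += chunkSize) {
    const chunk = items.slice(i, i + chunkSize);
    const settled = await Promise.all(
      chunk.map(async (item): Promise<Outcome<R>> => {
        try {
          return { ok: true, value: await handler(item) };
        } catch (error) {
          return { ok: false, error };
        }
      }),
    );
    results.push(...settled);
  }
  return results;
}

// Subset of the counters a pg.Pool instance exposes.
interface PoolCounters {
  totalCount: number;
  idleCount: number;
  waitingCount: number;
}

// Reports unhealthy once waitingCount has stayed above maxWaiting for longer than graceMs.
class PoolHealthMonitor {
  private waitingSince: number | null = null;

  constructor(
    private readonly maxWaiting: number = 5,
    private readonly graceMs: number = 30000,
  ) {}

  check(pool: PoolCounters, now: number = Date.now()) {
    if (pool.waitingCount > this.maxWaiting) {
      if (this.waitingSince === null) this.waitingSince = now;
    } else {
      this.waitingSince = null; // queue drained, reset the clock
    }
    const degradedForMs = this.waitingSince === null ? 0 : now - this.waitingSince;
    return {
      healthy: degradedForMs <= this.graceMs,
      degradedForMs,
      idle: pool.idleCount,
      waiting: pool.waitingCount,
    };
  }
}
```

A /api/health/db handler could call check(pool) on each probe and return 503 when healthy is false. Chunking trades total job duration for a hard cap on concurrent pool usage, which keeps slots free for user-facing requests such as getAllPods.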
Repro
# Before fix:
kubectl exec -n commonly-dev deploy/backend -- bash -c \
  "curl -sS -m 15 -H 'Authorization: Bearer <token>' \
    'http://localhost:5000/api/pods?limit=2' \
    -w 'status=%{http_code} ttfb=%{time_starttransfer}\n'"
# → status=000 ttfb=0 (hangs at 15s timeout)
Related
- backend/config/db-pg.ts (pool config)
- backend/controllers/podController.ts:199-227 (getAllPods PG call site)
- backend/services/summarizerService.ts (likely culprit for high PG concurrency)
Reporter: xcjsam (live, blocked).
Diagnoser/responder: claude-code session 2026-05-26.