PG pool exhaustion blocks PG-backed endpoints (live incident 2026-05-26) #454

Open

samxu01 opened on May 26, 2026

Description

Live incident on 2026-05-26 at ~03:00 UTC. xcjsam reported "no pods loading" on app-dev.commonly.me. Investigation:

/api/pods and /api/messages/:podId (both PG-backed) hung indefinitely (60s+, 0 bytes returned). Requests to localhost:5000 from inside the cluster hung identically, which rules out ingress/cloudflared.
/api/posts, /api/pods/:id, /api/auth/me (mongo-backed or routing-only) responded in <200ms.
Backend pod usage was CPU 42m and memory 677Mi of 2Gi, so there was plenty of headroom.
Direct mongoose query: 504ms. Direct PG query from a one-off kubectl exec node process: 625ms. So the underlying DBs respond fine.
Conclusion: the pg.Pool connection pool was exhausted in the live process. With pool.options.max: 10 and connectionTimeoutMillis: undefined, new pool.query(...) calls wait forever for a free connection instead of failing fast.

Immediate workaround applied: kubectl rollout restart deploy/backend -n commonly-dev. Service was back to normal in ~20s, and /api/pods returned the full set in 537ms after the restart.

Why the pool exhausted

Backend log right before incident:

✓ Pod summary requests enqueued: 60
Dispatching agent heartbeat events... (repeated)

The hourly summarizer fans out 60 summary.request events. Each event handler likely queries PG (messages lookup for the per-pod recap). 60 concurrent calls against 10 pool slots → 50 queries waiting in line. Any handler that holds a connection for >1s keeps that queue backed up and starves the pool. Without connectionTimeoutMillis, all subsequent pool.query() calls, including the user-facing getAllPods, hang behind the queue.
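
To make the queueing arithmetic concrete, here is a minimal toy model of a bounded pool. It is not pg.Pool and does not reproduce its internals; ToyPool, demo and the timings are hypothetical names and numbers chosen only to illustrate the 60-callers-versus-10-slots situation and the effect of an acquire timeout.

```typescript
// Toy model only: `max` slots, FIFO waiters, optional acquire timeout.
// It mimics the queueing behavior described above, not the real pg.Pool.
class ToyPool {
  private readonly max: number;
  private readonly acquireTimeoutMs: number | undefined;
  private inUse = 0;
  private waiters: Array<() => void> = [];

  constructor(max: number, acquireTimeoutMs?: number) {
    this.max = max;
    this.acquireTimeoutMs = acquireTimeoutMs;
  }

  get waitingCount(): number {
    return this.waiters.length;
  }

  private acquire(): Promise<void> {
    if (this.inUse < this.max) {
      this.inUse++;
      return Promise.resolve();
    }
    // No free slot: queue the caller. Without a timeout this promise can stay pending forever.
    return new Promise<void>((resolve, reject) => {
      let timer: ReturnType<typeof setTimeout> | undefined;
      const grant = () => {
        if (timer !== undefined) clearTimeout(timer);
        resolve();
      };
      this.waiters.push(grant);
      if (this.acquireTimeoutMs !== undefined) {
        timer = setTimeout(() => {
          this.waiters = this.waiters.filter((w) => w !== grant);
          reject(new Error("timeout waiting for a pool slot"));
        }, this.acquireTimeoutMs);
      }
    });
  }

  private release(): void {
    const next = this.waiters.shift();
    if (next) next(); // hand the slot directly to the oldest waiter
    else this.inUse--;
  }

  async run<T>(work: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await work();
    } finally {
      this.release();
    }
  }
}

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// 60 concurrent "handlers", each holding a slot for 100 ms, against 10 slots.
async function demo(acquireTimeoutMs?: number) {
  const pool = new ToyPool(10, acquireTimeoutMs);
  const jobs = Array.from({ length: 60 }, () =>
    pool.run(() => sleep(100)).then(() => true, () => false)
  );
  const waitingAtPeak = pool.waitingCount; // callers queued behind the 10 busy slots
  const outcomes = await Promise.all(jobs);
  const completed = outcomes.filter((ok) => ok).length;
  return { waitingAtPeak, completed, rejected: outcomes.length - completed };
}

demo(10).then((r) => console.log(r));
```

In this toy setup 50 callers queue immediately. With a 10 ms acquire timeout they are rejected quickly because every slot is held for longer than that; with no timeout they simply wait their turn, which is the hang a user-facing request experiences when the holders are slow.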

Probably not a connection leak: there is no obvious pool.connect()/client.release() leak in the codebase (grep -rE 'pool.connect|client.release' backend/ shows only the db-pg.ts init code, which DOES release its client). The bottleneck is pool.query() call volume + slow individual queries + tiny pool, not unreleased connections.

Concrete fixes

Bump pool.options.max from 10 to ~50. Aiven postgres-business plan supports 200+ connections; 10 is far too small for a backend that fans out 60 events per hourly job. (backend/config/db-pg.ts — one-line change.)
Set connectionTimeoutMillis: 5000 so that pool starvation fails fast: pool.query() rejects after 5s, and the error handler can return a 503 instead of the request hanging indefinitely. The current behavior (hang forever) is worse than a clear error, because Express does not time the request out on its own and the user sees a perpetual loading state.
Audit heartbeat/summarizer dispatch for pool.query() calls, especially in services/agentEventService.ts and services/summarizerService.ts. Each event handler that hits PG should batch where possible, or use a smaller chunk size (10 at a time, not 60).
Add a /api/health/db probe that checks pool.idleCount + pool.waitingCount and alerts when waiting > 5 for >30s. This would have caught the incident before user impact.
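
The fixes above could be sketched roughly as follows. This is an illustration under assumptions, not the actual backend code: poolOptions, mapInChunks, PoolCounters and createWaitingWatch are hypothetical names, and nothing here imports pg. The only things taken from pg are the option names max and connectionTimeoutMillis and the counter names totalCount, idleCount and waitingCount that a pg.Pool instance exposes.

```typescript
// Fixes 1 and 2: options that would be passed to `new Pool(...)` in backend/config/db-pg.ts.
// The values are the ones proposed above; tune them against the server's connection limit.
const poolOptions = {
  max: 50,
  connectionTimeoutMillis: 5000,
};

// Fix 3: run `handler` over `items` with at most `chunkSize` calls in flight at once.
// Hypothetical helper; the real dispatch code may need a different shape.
async function mapInChunks<T, R>(
  items: readonly T[],
  chunkSize: number,
  handler: (item: T) => Promise<R>
): Promise<R[]> {
  if (!Number.isInteger(chunkSize) || chunkSize < 1) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  const results: R[] = [];
  for (let i = 0; i < items.length; i += chunkSize) {
    const chunk = items.slice(i, i + chunkSize);
    // Wait for the whole chunk before starting the next one.
    results.push(...(await Promise.all(chunk.map((item) => handler(item)))));
  }
  return results;
}

// Fix 4: the subset of pool counters the health probe needs.
interface PoolCounters {
  totalCount: number;
  idleCount: number;
  waitingCount: number;
}

// Returns a checker that reports unhealthy (false) once waitingCount has stayed
// above `maxWaiting` for longer than `maxDurationMs`.
function createWaitingWatch(maxWaiting: number, maxDurationMs: number) {
  let overSince: number | null = null;
  return (pool: PoolCounters, nowMs: number): boolean => {
    if (pool.waitingCount <= maxWaiting) {
      overSince = null; // queue drained, reset the clock
      return true;
    }
    if (overSince === null) overSince = nowMs;
    return nowMs - overSince <= maxDurationMs;
  };
}

// Thresholds from the proposal: waiting > 5 for more than 30s.
const isPoolHealthy = createWaitingWatch(5, 30_000);
```

A /api/health/db handler could call isPoolHealthy(pool, Date.now()) on each probe and return 503 when it reports false. Chunking trades some summarizer latency for a bounded number of concurrent PG queries, which leaves pool slots free for user-facing requests.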

Repro

# Before fix:
kubectl exec -n commonly-dev deploy/backend -- bash -c \
 "curl -sS -m 15 -H 'Authorization: Bearer <token>' \
 'http://localhost:5000/api/pods?limit=2' \
 -w 'status=%{http_code} ttfb=%{time_starttransfer}\n'"
# → status=000 ttfb=0 (hangs at 15s timeout)

Related

backend/config/db-pg.ts — pool config
backend/controllers/podController.ts:199-227 — getAllPods PG call site
backend/services/summarizerService.ts — likely culprit for high PG concurrency

Reporter: xcjsam (live, blocked).
Diagnoser/responder: claude-code session 2026-05-26.
