PG pool exhaustion blocks PG-backed endpoints (live incident 2026-05-26) #454

Open

Description

Live incident 2026-05-26 ~03:00 UTC. xcjsam reported "no pods loading" on app-dev.commonly.me. Investigation:

  • /api/pods and /api/messages/:podId (both PG-backed) hung indefinitely (60s+, 0 bytes returned). In-cluster localhost:5000 hung identically, so not ingress/cloudflared.
  • /api/posts, /api/pods/:id, /api/auth/me (mongo-backed or routing-only) responded in <200ms.
  • Backend pod CPU 42m / mem 677Mi of 2Gi — plenty of headroom.
  • Direct mongoose query: 504ms. Direct PG query from a one-off kubectl exec node process: 625ms. So the underlying DBs respond fine.
  • Conclusion: the pg.Pool connection pool was exhausted in the live process. With pool.options.max: 10 and connectionTimeoutMillis: undefined, every further pool.query(...) call waits indefinitely for a free connection instead of failing fast.
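
To make the "waits forever vs. fails fast" distinction concrete, here is a toy model of a bounded pool. It is only a sketch: TinyPool and demo are hypothetical names, and this is not how pg.Pool is implemented internally. It only mirrors the two settings named above (max and connectionTimeoutMillis).

```typescript
// Toy model of a bounded connection pool. NOT pg's implementation; it only
// mirrors the two settings discussed in this issue.
type Release = () => void;

class TinyPool {
  private readonly max: number;
  // undefined mirrors the incident config: waiters are never timed out.
  private readonly connectionTimeoutMillis: number | undefined;
  private inUse = 0;
  private waiters: Array<() => void> = [];

  constructor(max: number, connectionTimeoutMillis?: number) {
    this.max = max;
    this.connectionTimeoutMillis = connectionTimeoutMillis;
  }

  get waitingCount(): number {
    return this.waiters.length;
  }

  acquire(): Promise<Release> {
    if (this.inUse < this.max) {
      this.inUse++;
      return Promise.resolve(() => this.release());
    }
    return new Promise<Release>((resolve, reject) => {
      let timer: ReturnType<typeof setTimeout> | undefined;
      const waiter = () => {
        if (timer !== undefined) clearTimeout(timer);
        resolve(() => this.release());
      };
      this.waiters.push(waiter);
      if (this.connectionTimeoutMillis !== undefined) {
        timer = setTimeout(() => {
          // Give up: leave the queue and surface an error to the caller.
          this.waiters = this.waiters.filter((w) => w !== waiter);
          reject(new Error("timeout waiting for a free connection"));
        }, this.connectionTimeoutMillis);
      }
    });
  }

  private release(): void {
    const next = this.waiters.shift();
    if (next) next(); // hand the slot straight to the oldest waiter
    else this.inUse--;
  }
}

// Occupies the only slot, then reports what happens to the next caller
// within 50 ms: "acquired", "rejected", or still "pending".
async function demo(timeoutMs?: number): Promise<string> {
  const pool = new TinyPool(1, timeoutMs);
  await pool.acquire(); // never released, like a slot held by a slow query
  const second = pool.acquire().then(() => "acquired", () => "rejected");
  const giveUp = new Promise<string>((r) => setTimeout(() => r("pending"), 50));
  return Promise.race([second, giveUp]);
}
```

In this model, demo() with no timeout is still "pending" when the 50 ms observation window ends, while demo(10) settles as "rejected" after 10 ms. That is the behavioral difference fix 2 below is after.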

Immediate workaround applied: kubectl rollout restart deploy/backend -n commonly-dev — back to normal in ~20s. Verified /api/pods returns full set in 537ms post-restart.

Why the pool exhausted

Backend log right before incident:

✓ Pod summary requests enqueued: 60
Dispatching agent heartbeat events... (repeated)

The hourly summarizer fans out 60 summary.request events. Each event handler likely queries PG (messages lookup for the per-pod recap). 60 concurrent calls against 10 pool slots leave 50 queries waiting in line, and any handler that holds a connection for >1s delays everything queued behind it. Without connectionTimeoutMillis, all subsequent pool.query() calls, including the user-facing getAllPods, hang behind that queue.
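
A back-of-envelope estimate of how long a request sits behind that queue, as a sketch. estimateQueueWaitMs is a hypothetical helper, and it assumes every query holds a connection for the same time and that waiters are served in FIFO order. Neither assumption has been verified against the live workload.

```typescript
// Rough estimate only: assumes uniform query duration and FIFO service.
function estimateQueueWaitMs(
  queuedAhead: number, // requests already in front of ours
  poolMax: number,     // pool.options.max
  perQueryMs: number,  // time each query holds a connection
): number {
  if (poolMax <= 0) throw new RangeError("poolMax must be positive");
  // Requests run in "rounds" of poolMax; ours starts after
  // floor(queuedAhead / poolMax) full rounds have finished.
  return Math.floor(queuedAhead / poolMax) * perQueryMs;
}
```

With 60 summarizer queries ahead and 1000 ms per query, a user request waits about 6000 ms at max 10 but about 1000 ms at max 50. This models a single burst only. A hang that never clears also needs queries that are slower than this, or new work arriving faster than the pool drains.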

A connection leak looks unlikely: there is no obvious pool.connect()/client.release() leak in the codebase (grep -rE 'pool.connect|client.release' backend/ shows only the db-pg.ts init code, which does release). The bottleneck is pool.query() call volume combined with slow individual queries and a tiny pool, not unreleased connections.

Concrete fixes

  1. Bump pool.options.max from 10 to ~50. Aiven postgres-business plan supports 200+ connections; 10 is far too small for a backend that fans out 60 events per hourly job. (backend/config/db-pg.ts — one-line change.)
  2. Set connectionTimeoutMillis: 5000 so that pool starvation fails fast with an error the API can return as a 503, instead of hanging indefinitely. The current behavior (hang forever) is worse than a clear error: nothing on the Express side times the request out, so the user sees a perpetual loading state.
  3. Audit heartbeat/summarizer dispatch for pool.query() calls, especially in services/agentEventService.ts and services/summarizerService.ts. Each event handler that hits PG should batch where possible, or use a smaller chunk size (10 at a time, not 60).
  4. Add a /api/health/db probe that reports pool.idleCount and pool.waitingCount, and alerts when waitingCount stays above 5 for more than 30s. This would have caught the incident before user impact.
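
The fixes above could look roughly like the sketch below. It is not the actual backend code: poolSettings, runInChunks, PoolStats and PoolWaitMonitor are hypothetical names. The only things taken from this issue are the option names max and connectionTimeoutMillis, the pool counters idleCount and waitingCount, and the proposed thresholds. The block deliberately avoids importing pg so that it runs standalone; in db-pg.ts the settings object would be handed to the pool constructor.

```typescript
// Fixes 1 and 2: proposed values, to be passed to the pg pool constructor.
const poolSettings = {
  max: 50,                        // was 10
  connectionTimeoutMillis: 5_000, // was undefined (wait forever)
};

// Fix 3: run handlers in fixed-size chunks instead of all 60 at once.
async function runInChunks<T, R>(
  items: readonly T[],
  chunkSize: number,
  handler: (item: T) => Promise<R>,
): Promise<R[]> {
  if (chunkSize < 1) throw new RangeError("chunkSize must be >= 1");
  const results: R[] = [];
  for (let i = 0; i < items.length; i += chunkSize) {
    const chunk = items.slice(i, i + chunkSize);
    // At most `chunkSize` handlers (and their PG queries) are in flight here.
    results.push(...(await Promise.all(chunk.map(handler))));
  }
  return results;
}

// Fix 4: the counters a /api/health/db probe would read from the pool.
interface PoolStats {
  idleCount: number;
  waitingCount: number;
}

class PoolWaitMonitor {
  private readonly maxWaiting: number;
  private readonly graceMs: number;
  private overSinceMs: number | null = null;

  constructor(maxWaiting = 5, graceMs = 30_000) {
    this.maxWaiting = maxWaiting;
    this.graceMs = graceMs;
  }

  // True once waitingCount has stayed above maxWaiting for more than graceMs.
  shouldAlert(stats: PoolStats, nowMs: number): boolean {
    if (stats.waitingCount <= this.maxWaiting) {
      this.overSinceMs = null; // queue recovered, reset the clock
      return false;
    }
    if (this.overSinceMs === null) this.overSinceMs = nowMs;
    return nowMs - this.overSinceMs > this.graceMs;
  }
}
```

Chunking trades total job time for a bounded number of in-flight queries, which leaves pool slots free for user-facing requests. The monitor is kept separate from any HTTP handler so the probe route can stay a thin wrapper that feeds it the pool's counters and the current time.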

Repro

# Before fix:
kubectl exec -n commonly-dev deploy/backend -- bash -c \
 "curl -sS -m 15 -H 'Authorization: Bearer <token>' \
 'http://localhost:5000/api/pods?limit=2' \
 -w 'status=%{http_code} ttfb=%{time_starttransfer}\n'"
# → status=000 ttfb=0 (no response; curl gives up at the 15s -m timeout)

Related

  • backend/config/db-pg.ts — pool config
  • backend/controllers/podController.ts:199-227 — getAllPods PG call site
  • backend/services/summarizerService.ts — likely culprit for high PG concurrency

Reporter: xcjsam (live, blocked).
Diagnoser/responder: claude-code session 2026-05-26.
