Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Add socket-level DB connection resilience#4855

Open
stuartc wants to merge 2 commits into
main from
db-socket-resilience
Open

Add socket-level DB connection resilience #4855
stuartc wants to merge 2 commits into
main from
db-socket-resilience

Conversation

@stuartc

@stuartc stuartc commented Jun 15, 2026
edited
Loading

Copy link
Copy Markdown
Member

Description

This PR changes the Lightning.Repo database connection so that, on Linux, a
connection whose network path has died fails fast instead of blocking on a dead
socket.

The (default) 15s DBConnection timeout guards a slow-but-live query on a healthy
connection, but it does not cover a dead network path: if the route to the
database is silently severed (e.g. a TCP reset during a Kubernetes node-pool
upgrade), the BEAM can block on the dead socket until the kernel's
retransmission budget expires (~15 min). We saw this take down an Oban
Scheduler job that was stuck in executing after a node in k8s was drained; taking the network down underneath a running instance.

On Linux, this sets the TCP_USER_TIMEOUT socket option, which caps how long
transmitted data may go unacknowledged before the socket is dropped:

  • DATABASE_TCP_USER_TIMEOUT (ms) defaults to DATABASE_TIMEOUT + 5s (20s),
    so the application query timeout fires first on a connection that is merely
    slow. Set it to override, or 0 to restore the previous behaviour. A value
    below DATABASE_TIMEOUT logs a warning at boot. Ignored on non-Linux
    platforms.

It does not abort a live query: tcp_user_timeout measures unacknowledged data,
not query duration, so a long read whose bytes were acked at send time never
accumulates against it. A dead idle pooled connection is caught lazily on its
next use within the same budget.

This is on by default because the behavior it replaces — blocking ~15 min on a
dead socket — is a latent failure mode every Linux deployment inherits, and the
deployments most exposed (rolling node upgrades, cloud NAT/LB resets) are the
common shape. The 0 escape hatch and the Linux gate keep it safe.

A server-side statement_timeout was considered and dropped: it shares the
single Repo pool with migrations and the retention purge (which already requests
up to 150s), so a global ceiling is unsafe without per-transaction overrides we
don't yet have.

Validation steps

  1. mix test test/lightning/config/bootstrap_test.exs — covers the
    tcp_user_timeout socket option, the ECTO_IPV6 coexistence case (the prod
    block appends rather than clobbers socket_options), the
    below-DATABASE_TIMEOUT boot warning, and the SSL toggle.
  2. Boot with DATABASE_TCP_USER_TIMEOUT=20000 on Linux and confirm the Repo
    socket_options carry the raw {:raw, 6, 18, ...} entry; on macOS confirm it
    is omitted.

Additional notes for the reviewer

  1. The prod block in configure/0 re-sets socket_options for the ECTO_IPV6
    toggle; it now appends the resilience option (maybe_ipv6 ++ db_socket_options) rather than clobbering it. A unit test guards this.
  2. Default-on is a deliberate behavior change for Linux deployments, hence the
    CHANGELOG entry sits under Changed, not Added. Deploy-side tuning is a
    separate decision and lives in the deployment config, not this repo.

AI Usage

  • I have used Claude Code
  • I have used another model
  • I have not used AI

Pre-submission checklist

  • I have performed an AI review of my code (we recommend using /review
    with Claude Code)
  • I have implemented and tested all related authorization policies
    (n/a — config-only change, no new data access)
  • I have updated the changelog.
  • I have ticked a box in "AI usage" in this PR

Copy link
Copy Markdown

Security Review ✅

  • S0 (project scoping): N/A — the diff only adds OS-level TCP socket options and a startup warning in lib/lightning/config/bootstrap.ex; no queries, controllers, LiveViews, or channels touch project-scoped data.
  • S1 (authorization): N/A — no new web-layer entrypoints, handle_events, or policy clauses are introduced.
  • S2 (audit trail): N/A — no Repo.insert/update/delete on configuration resources; the change configures a runtime socket timeout and adds tests, with no domain writes to audit.

@stuartc stuartc self-assigned this Jun 15, 2026

codecov Bot commented Jun 15, 2026
edited
Loading

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.5%. Comparing base (817e445) to head (a254b07).

Additional details and impacted files
@@ Coverage Diff @@
## main #4855 +/- ##
=======================================
- Coverage 90.5% 90.5% -0.0% 
=======================================
 Files 445 445 
 Lines 22721 22733 +12 
=======================================
+ Hits 20561 20569 +8 
- Misses 2160 2164 +4 

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

stuartc added 2 commits June 15, 2026 13:37
The 15s DBConnection `timeout` guards a slow-but-live query on a healthy
connection, but not a dead network path: if the route to the database is
silently severed (e.g. a TCP reset during a node-pool upgrade), the BEAM
can block on the dead socket until the kernel's retransmission budget
expires (~15 min).
On Linux, set the TCP_USER_TIMEOUT socket option so such a connection
fails fast instead. It caps how long transmitted data may go
unacknowledged before the socket is dropped, and defaults to
DATABASE_TIMEOUT + 5s (set DATABASE_TCP_USER_TIMEOUT to override, or 0 to
disable); it is ignored on other platforms. A value below DATABASE_TIMEOUT
logs a warning, since on a large write it could drop a slow-but-live
connection before the query timeout fires.
This does not abort a live query: tcp_user_timeout measures unacknowledged
data, not query duration, so a long read whose bytes were acked at send
time never accumulates against it. A dead idle pooled connection is caught
lazily on its next use within the same budget.
A server-side statement_timeout was considered but dropped: it shares the
single Repo pool with migrations and the retention purge (which already
requests up to 150s), so a global ceiling is unsafe without per-transaction
overrides we don't yet have.
The prod block also sets `socket_options` for the IPv6 toggle, so it
appends this option rather than clobbering it.
Add DATABASE_KEEPALIVE and DATABASE_TCP_USER_TIMEOUT to the deployment
env var reference, including the Linux-only default and the disable/
override semantics.

Copy link
Copy Markdown
Collaborator

@stuartc I blew my timebox trying to find non-clanker sources on setting raw socket values. TCP socket programming is a deep rabbit hole and I have no prior knowledge to lean on - so chasing it further is probably not cost-effective. The rest of the PR looks fine - I made a comment relating to a cosmetic documentation change.

Comment thread CHANGELOG.md
- On Linux, database connections now fail fast when their network path dies
(e.g. a TCP reset during a node-pool upgrade) instead of blocking on a dead
socket for up to ~15 minutes. `DATABASE_TCP_USER_TIMEOUT` (ms) defaults to
`DATABASE_TIMEOUT + 5s`; set it to `0` to restore the previous behaviour. No

@rorymckinley rorymckinley Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stuartc The wording here and in DEPLOYMENT.md can be interpreted as "it is ok for this to be set on non-Linux systems as it is a no-op". I have not found any documentation to confirm this, so perhaps it may be best to reword it to indicate that Lightning will not apply this setting on non-Linux hosts, to better align with the code?

stuartc commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

@stuartc I blew my timebox trying to find non-clanker sources on setting raw socket values. TCP socket programming is a deep rabbit hole and I have no prior knowledge to lean on - so chasing it further is probably not cost-effective. The rest of the PR looks fine - I made a comment relating to a cosmetic documentation change.

Yeah I'm going to make a test harness to put this through it's paces on a linux vm before merging.

So let not merge until I've done that. And yeah that language needs another pass. It shouldn't read "previous" either - that ages fast.

midigofrank reacted with thumbs up emoji

@midigofrank midigofrank left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. Great learnings here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Reviewers

@rorymckinley rorymckinley rorymckinley approved these changes

@midigofrank midigofrank midigofrank approved these changes

Labels

None yet

Projects

Status: New Issues

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

AltStyle によって変換されたページ (->オリジナル) /