Add socket-level DB connection resilience#4855

Open

stuartc wants to merge 2 commits into

main from

db-socket-resilience

Open

Add socket-level DB connection resilience #4855
stuartc wants to merge 2 commits into
main from
db-socket-resilience

Conversation

@stuartc

@stuartc stuartc commented Jun 15, 2026 •

edited

Loading

Copy link

Copy Markdown

Member

Description

This PR changes the Lightning.Repo database connection so that, on Linux, a
connection whose network path has died fails fast instead of blocking on a dead
socket.

The (default) 15s DBConnection timeout guards a slow-but-live query on a healthy
connection, but it does not cover a dead network path: if the route to the
database is silently severed (e.g. a TCP reset during a Kubernetes node-pool
upgrade), the BEAM can block on the dead socket until the kernel's
retransmission budget expires (~15 min). We saw this take down an Oban
Scheduler job that was stuck in executing after a node in k8s was drained; taking the network down underneath a running instance.

On Linux, this sets the TCP_USER_TIMEOUT socket option, which caps how long
transmitted data may go unacknowledged before the socket is dropped:

DATABASE_TCP_USER_TIMEOUT (ms) defaults to DATABASE_TIMEOUT + 5s (20s),
so the application query timeout fires first on a connection that is merely
slow. Set it to override, or 0 to restore the previous behaviour. A value
below DATABASE_TIMEOUT logs a warning at boot. Ignored on non-Linux
platforms.

It does not abort a live query: tcp_user_timeout measures unacknowledged data,
not query duration, so a long read whose bytes were acked at send time never
accumulates against it. A dead idle pooled connection is caught lazily on its
next use within the same budget.

This is on by default because the behavior it replaces — blocking ~15 min on a
dead socket — is a latent failure mode every Linux deployment inherits, and the
deployments most exposed (rolling node upgrades, cloud NAT/LB resets) are the
common shape. The 0 escape hatch and the Linux gate keep it safe.

A server-side statement_timeout was considered and dropped: it shares the
single Repo pool with migrations and the retention purge (which already requests
up to 150s), so a global ceiling is unsafe without per-transaction overrides we
don't yet have.

Validation steps

mix test test/lightning/config/bootstrap_test.exs — covers the
tcp_user_timeout socket option, the ECTO_IPV6 coexistence case (the prod
block appends rather than clobbers socket_options), the
below-DATABASE_TIMEOUT boot warning, and the SSL toggle.
Boot with DATABASE_TCP_USER_TIMEOUT=20000 on Linux and confirm the Repo
socket_options carry the raw {:raw, 6, 18, ...} entry; on macOS confirm it
is omitted.

Additional notes for the reviewer

The prod block in configure/0 re-sets socket_options for the ECTO_IPV6
toggle; it now appends the resilience option (maybe_ipv6 ++ db_socket_options) rather than clobbering it. A unit test guards this.
Default-on is a deliberate behavior change for Linux deployments, hence the
CHANGELOG entry sits under Changed, not Added. Deploy-side tuning is a
separate decision and lives in the deployment config, not this repo.

AI Usage

I have used Claude Code
I have used another model
I have not used AI

Pre-submission checklist

I have performed an AI review of my code (we recommend using /review
with Claude Code)
I have implemented and tested all related authorization policies
(n/a — config-only change, no new data access)
I have updated the changelog.
I have ticked a box in "AI usage" in this PR

@github-project-automation github-project-automation Bot added this to Core

Jun 15, 2026

@github-project-automation github-project-automation Bot moved this to New Issues in Core

Jun 15, 2026

@stuartc stuartc force-pushed the db-socket-resilience branch from c25fd45 to 0915352 Compare

June 15, 2026 11:20

@github-actions

github-actions Bot commented Jun 15, 2026

Copy link

Copy Markdown

Security Review ✅

S0 (project scoping): N/A — the diff only adds OS-level TCP socket options and a startup warning in lib/lightning/config/bootstrap.ex; no queries, controllers, LiveViews, or channels touch project-scoped data.
S1 (authorization): N/A — no new web-layer entrypoints, handle_events, or policy clauses are introduced.
S2 (audit trail): N/A — no Repo.insert/update/delete on configuration resources; the change configures a runtime socket timeout and adds tests, with no domain writes to audit.

@stuartc stuartc self-assigned this

Jun 15, 2026

@codecov

codecov Bot commented Jun 15, 2026 •

edited

Loading

Copy link

Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.5%. Comparing base (817e445) to head (a254b07).

Additional details and impacted files

@@ Coverage Diff @@
## main #4855 +/- ##
=======================================
- Coverage 90.5% 90.5% -0.0% 
=======================================
 Files 445 445 
 Lines 22721 22733 +12 
=======================================
+ Hits 20561 20569 +8 
- Misses 2160 2164 +4

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

stuartc added 2 commits

June 15, 2026 13:37

@stuartc


 Add socket-level DB connection resilience

0fb3b2b

The 15s DBConnection `timeout` guards a slow-but-live query on a healthy
connection, but not a dead network path: if the route to the database is
silently severed (e.g. a TCP reset during a node-pool upgrade), the BEAM
can block on the dead socket until the kernel's retransmission budget
expires (~15 min).
On Linux, set the TCP_USER_TIMEOUT socket option so such a connection
fails fast instead. It caps how long transmitted data may go
unacknowledged before the socket is dropped, and defaults to
DATABASE_TIMEOUT + 5s (set DATABASE_TCP_USER_TIMEOUT to override, or 0 to
disable); it is ignored on other platforms. A value below DATABASE_TIMEOUT
logs a warning, since on a large write it could drop a slow-but-live
connection before the query timeout fires.
This does not abort a live query: tcp_user_timeout measures unacknowledged
data, not query duration, so a long read whose bytes were acked at send
time never accumulates against it. A dead idle pooled connection is caught
lazily on its next use within the same budget.
A server-side statement_timeout was considered but dropped: it shares the
single Repo pool with migrations and the retention purge (which already
requests up to 150s), so a global ceiling is unsafe without per-transaction
overrides we don't yet have.
The prod block also sets `socket_options` for the IPv6 toggle, so it
appends this option rather than clobbering it.

@stuartc


 Document DB connection resilience env vars

a254b07

Add DATABASE_KEEPALIVE and DATABASE_TCP_USER_TIMEOUT to the deployment
env var reference, including the Linux-only default and the disable/
override semantics.

@stuartc stuartc force-pushed the db-socket-resilience branch from 0915352 to a254b07 Compare

June 15, 2026 11:37

@stuartc stuartc requested review from midigofrank and rorymckinley

June 15, 2026 12:06

@rorymckinley

rorymckinley commented Jun 15, 2026

Copy link

Copy Markdown

Collaborator

@stuartc I blew my timebox trying to find non-clanker sources on setting raw socket values. TCP socket programming is a deep rabbit hole and I have no prior knowledge to lean on - so chasing it further is probably not cost-effective. The rest of the PR looks fine - I made a comment relating to a cosmetic documentation change.

rorymckinley

rorymckinley approved these changes

Jun 15, 2026

View reviewed changes

CHANGELOG.md

- On Linux, database connections now fail fast when their network path dies

(e.g. a TCP reset during a node-pool upgrade) instead of blocking on a dead

socket for up to ~15 minutes. `DATABASE_TCP_USER_TIMEOUT` (ms) defaults to

`DATABASE_TIMEOUT + 5s`; set it to `0` to restore the previous behaviour. No

@rorymckinley rorymckinley Jun 15, 2026

Copy link

Copy Markdown

Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stuartc The wording here and in DEPLOYMENT.md can be interpreted as "it is ok for this to be set on non-Linux systems as it is a no-op". I have not found any documentation to confirm this, so perhaps it may be best to reword it to indicate that Lightning will not apply this setting on non-Linux hosts, to better align with the code?

@stuartc

stuartc commented Jun 16, 2026

Copy link

Copy Markdown

Member Author

@stuartc I blew my timebox trying to find non-clanker sources on setting raw socket values. TCP socket programming is a deep rabbit hole and I have no prior knowledge to lean on - so chasing it further is probably not cost-effective. The rest of the PR looks fine - I made a comment relating to a cosmetic documentation change.

Yeah I'm going to make a test harness to put this through it's paces on a linux vm before merging.

So let not merge until I've done that. And yeah that language needs another pass. It shouldn't read "previous" either - that ages fast.

midigofrank

midigofrank approved these changes

Jun 16, 2026

View reviewed changes

@midigofrank midigofrank left a comment

Copy link

Copy Markdown

Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. Great learnings here

Labels

None yet

3 participants

@stuartc @rorymckinley @midigofrank

Conversation

@stuartc stuartc commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Validation steps

Additional notes for the reviewer

AI Usage

Pre-submission checklist

Uh oh!

github-actions Bot commented Jun 15, 2026

Security Review ✅

Uh oh!

codecov Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

rorymckinley commented Jun 15, 2026

Uh oh!

@rorymckinley rorymckinley Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

stuartc commented Jun 16, 2026

Uh oh!

@midigofrank midigofrank left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

@stuartc stuartc commented Jun 15, 2026 •

edited

Loading

codecov Bot commented Jun 15, 2026 •

edited

Loading