Cluster does not recover from temporary network partition #2140

Unanswered
rpfeifer-swi asked this question in Q&A
Discussion options

Discovered that if a network connectivity issue makes a node in a CouchDB cluster unreachable (routing issue, someone trips over a cable, etc.), after about a minute or so the affected node will disconnect and never attempt to reconnect. This leaves the cluster broken, and the only apparent way to recover is to manually restart CouchDB, which re-establishes the connections.

To duplicate:
I set up a small cluster (3 nodes, CouchDB 2.3.1 on Debian 9) and verified that a database replicates across them. I noted that there was an open TCP socket to port 9100 from each peer.

Disconnected the network (virtual, on a VirtualBox VM) to one of them. After about a minute the sockets involving the affected node closed. I also noticed that an attempt to update a database hung until the socket closed (then completed successfully).

Upon re-connecting the affected node, I noted that that node is no longer synced to the rest of the cluster, and it never recovers. There is apparently no mechanism to re-establish the broken connections. Stopping and re-starting any node's CouchDB will re-establish normal operation. This does not appear to be related to link state or other conditions; a simple loss of routing is confirmed to cause it.

This would seem to be a fairly glaring reliability issue. If there is some mechanism to handle this, it does not appear in the documentation.


Replies: 10 comments 5 replies

Comment options

This is definitely not the case in production systems across hundreds of installs I've seen personally.

Can you describe your cluster setup more completely? Are you using Docker, or cloud instances with private networking, or bare metal installs (e.g. Raspberry Pis)?

Comment options

This is running on a VPN networking appliance based on Debian 9, run in an (amd64) VM (typically VMware) or on real iron. We are using the provided pre-built .deb (stretch) packages.

Our cluster configuration is a bit unusual, in that we adjust sharding (and n) to keep a shard on every node, so that every node always has a full copy of the data. We build the cluster one node at a time; each new node joins the cluster, and the shard metadata is then replicated to the new node. We are maintaining one fairly small database (at most a few thousand small documents), with a cluster size of 1-4 nodes (each of which must always have the data available, hence the resharding). Otherwise, it's pretty standard stuff, I think.
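
As a minimal sketch of this kind of setup (with an assumed address, credentials, database name, and q/n values; not the poster's actual tooling), CouchDB accepts explicit shard settings when a database is created, so n can be made to match the node count:

# Minimal sketch with assumed names/credentials, not the poster's tooling:
# create a database with explicit shard settings so n matches the node count.
import requests  # third-party; pip install requests

COUCH = "http://127.0.0.1:5984"   # assumed node address
AUTH = ("admin", "password")      # assumed admin credentials

# q = number of shard ranges, n = number of copies of each shard
resp = requests.put(f"{COUCH}/mydb", params={"q": 2, "n": 4}, auth=AUTH, timeout=10)
resp.raise_for_status()
print(resp.json())  # {'ok': True}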

Comment options

rnewson Jul 1, 2020
Collaborator

If you create a database on a 3-node cluster, then every node has a full copy of the data anyway; that's the default behaviour and requires no adjustment from you.

Comment options

/cc @nickva ever seen this? This might be an actual bug, but I don't know if it's one we care to fix given the 4.0 plans.

Comment options

@rwpfeifer I'm sorry that we don't have any more information to provide here. The only thing I can think of is that, if you are actually tearing down the network interface itself, epmd - the Erlang Port Mapper Daemon - may be losing the interface it's bound to. This, in turn, would prevent other nodes from reaching CouchDB on that node, since they can't talk to epmd, which is how they find out how to talk to CouchDB on e.g. port 9100/tcp.

When you kill CouchDB, epmd will also terminate. Restarting CouchDB will then automatically restart epmd.

No Erlang distributed process can survive epmd being restarted from underneath it, to my knowledge, so the only workaround for you would be to ensure that when you interrupt networking, you do not also tear down the virtual interface in the VM guest at the same time.
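
As a minimal illustration of that lookup (assuming a local epmd on the default port 4369; this is not code from the thread), the registered node names and their distribution ports can be requested from epmd directly using its documented NAMES_REQ message:

# Minimal sketch: ask epmd which distribution port each Erlang node registered.
# Host/port defaults are assumptions for illustration.
import socket
import struct

def epmd_names(host: str = "127.0.0.1", port: int = 4369) -> str:
    """Return epmd's registered-names listing, e.g. 'name couchdb at port 9100'."""
    with socket.create_connection((host, port), timeout=5) as sock:
        # Request = 2-byte big-endian length, then the NAMES_REQ tag byte (110).
        sock.sendall(struct.pack(">HB", 1, 110))
        data = b""
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    # Response = 4-byte epmd port number, then one text line per registered node.
    return data[4:].decode("ascii", errors="replace")

if __name__ == "__main__":
    print(epmd_names())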

Comment options

Joan, I didn't catch your reply previously, but in my scenario epmd is only opening port 4369, not 9100 -- and there's nothing flowing on that port...

root@io-01:/var/log/couchdb2 # sockstat -4 | grep epm
couchdb epmd 1078 3 tcp4 *:4369 *:*
couchdb epmd 1078 5 tcp4 127.0.0.1:4369 127.0.0.1:14743
root@io-01:/var/log/couchdb2 # netstat -an | grep 4369
tcp4 0 0 127.0.0.1.4369 127.0.0.1.14743 ESTABLISHED
tcp4 0 0 127.0.0.1.14743 127.0.0.1.4369 ESTABLISHED
tcp6 0 0 *.4369 *.* LISTEN
tcp4 0 0 *.4369 *.* LISTEN
root@io-02:/var/log/couchdb2 # sockstat -4 | grep epm
couchdb epmd 1013 3 tcp4 *:4369 *:*
couchdb epmd 1013 5 tcp4 127.0.0.1:4369 127.0.0.1:48258
root@io-02:/var/log/couchdb2 # netstat -an | grep 4369
tcp4 0 0 127.0.0.1.4369 127.0.0.1.48258 ESTABLISHED
tcp4 0 0 127.0.0.1.48258 127.0.0.1.4369 ESTABLISHED
tcp6 0 0 *.4369 *.* LISTEN
tcp4 0 0 *.4369 *.* LISTEN
Comment options

Unfortunately I'm facing the same issue.

I have two VMs on Digital Ocean that are connected through the internal LANs.

After some days or hours of activity, the machines just split-brain. I guess that "behind the curtain" the VM may be moving from one physical host to another.

The logs get filled with things like:

[error] 2020-06-30T14:34:38.754383Z couchdb@10.133.109.142 <0.3087.142> b8e83a37cf fabric_worker_timeout update_docs,'couchdb@10.133.98.18',<<"shards/c0000000-dfffffff/queue.1592984636">>
[error] 2020-06-30T14:34:38.915822Z couchdb@10.133.109.142 <0.1364.142> c0efb5762f fabric_worker_timeout update_docs,'couchdb@10.133.98.18',<<"shards/60000000-7fffffff/queue.1592984636">>
[error] 2020-06-30T14:34:45.850769Z couchdb@10.133.109.142 <0.27551.141> 01ce47afb2 fabric_worker_timeout update_docs,'couchdb@10.133.98.18',<<"shards/20000000-3fffffff/queue.1592984636">>
[error] 2020-06-30T14:34:51.734713Z couchdb@10.133.109.142 <0.30745.141> 852d8f98f9 fabric_worker_timeout update_docs,'couchdb@10.133.98.18',<<"shards/20000000-3fffffff/queue.1592984636">>
[error] 2020-06-30T14:34:56.528303Z couchdb@10.133.109.142 <0.2807.142> 8655186dc3 fabric_worker_timeout update_docs,'couchdb@10.133.98.18',<<"shards/a0000000-bfffffff/queue.1592984636">>

(That is the "remote" machine's IP.)

Restarting CouchDB fixes the problem.

CouchDB 2.3.1 on FreeBSD 12.1

Comment options

This is the "beginning" of the problem, which happened at 14:26 today:

Machine A

[error] 2020-06-30T14:26:42.218066Z couchdb@10.133.109.142 <0.22770.141> -------- fabric_worker_timeout get_all_security,'couchdb@10.133.98.18',<<"shards/20000000-3fffffff/_global_changes.1592994148">>
[error] 2020-06-30T14:26:42.218209Z couchdb@10.133.109.142 <0.22770.141> -------- Error checking security objects for _global_changes :: {error,timeout}
[error] 2020-06-30T14:26:42.218284Z couchdb@10.133.109.142 <0.22775.141> -------- fabric_worker_timeout get_all_security,'couchdb@10.133.98.18',<<"shards/60000000-7fffffff/_global_changes.1592994148">>
[error] 2020-06-30T14:26:42.218350Z couchdb@10.133.109.142 <0.22775.141> -------- Error checking security objects for _global_changes :: {error,timeout}
[error] 2020-06-30T14:26:43.239807Z couchdb@10.133.109.142 <0.19639.141> -------- fabric_worker_timeout get_all_security,'couchdb@10.133.98.18',<<"shards/40000000-5fffffff/_global_changes.1592994148">>
[error] 2020-06-30T14:26:43.239891Z couchdb@10.133.109.142 <0.19639.141> -------- Error checking security objects for _global_changes :: {error,timeout}
[error] 2020-06-30T14:26:43.927694Z couchdb@10.133.109.142 <0.19725.104> fd66a2549d fabric_worker_timeout update_docs,'couchdb@10.133.98.18',<<"shards/a0000000-bfffffff/queue.1592984636">>

Machine B

[error] 2020-06-30T14:26:19.770152Z couchdb@10.133.98.18 <0.15949.176> -------- rexi_server: from: couchdb@10.133.98.18(<0.27190.175>) mfa: fabric_rpc:all_docs/3 error:function_clause [{couch_db,incref,[undefined],[{file,"src/couch_db.erl"},{line,185}]},{couch_server,open,2,[{file,"src/couch_server.erl"},{line,85}]},{fabric_rpc,all_docs,3,[{file,"src/fabric_rpc.erl"},{line,124}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,140}]}]
[error] 2020-06-30T14:26:39.895071Z couchdb@10.133.98.18 <0.30132.0> -------- Replicator, request PUT to "http://127.0.0.1:5984/_users/_local/153910aca337d66bb0901018a8f58206" failed due to error {error,req_timedout}
[error] 2020-06-30T14:26:39.895406Z couchdb@10.133.98.18 <0.30132.0> -------- Replication `153910aca337d66bb0901018a8f58206+continuous` (`http://home-replication:*****@10.133.136.126:5984/_users/` -> `http://127.0.0.1:5984/_users/`) failed: {http_request_failed,"PUT",
[error] 2020-06-30T14:26:39.922133Z couchdb@10.133.98.18 <0.5378.1> -------- Replicator, request GET to "http://home-replication:*****@10.133.136.126:5984/_users/_changes?feed=continuous&style=all_docs&since=%22118783-g1AAAALLeJyl0M0KwjAMAODiBMW7Ht18gbE2rm4n9yban8mQqSdvgr6Jvonii6jv4H127ZynIayHJJCQj5AcIdTPHIlGYrcXmeQJDnwMoIL6mNBczTsMcbcoinXmcITc6Ub1eiuxkiGBxsV_JvdU5vOafQ81iyWRfDprzyYlu6hZb6BZilOIwYJdluzx94S7ZhkGQShuzW67KqOTKko-G3r80nQoCRNpaElfDH2trs41TSDiwYxZ0jdDPyr6oGmIghRkbEk_Df399cRczWIaR6Rxf_0B5GCzpw%22&timeout=10000" failed due to error closing_on_request
[error] 2020-06-30T14:26:43.649857Z couchdb@10.133.98.18 <0.9173.176> -------- fabric_worker_timeout get_all_security,'couchdb@10.133.98.18',<<"shards/40000000-5fffffff/_global_changes.1592994148">>
[error] 2020-06-30T14:26:43.649951Z couchdb@10.133.98.18 <0.9173.176> -------- Error checking security objects for _global_changes :: {error,timeout}
[error] 2020-06-30T14:26:44.159851Z couchdb@10.133.98.18 <0.18146.175> -------- fabric_worker_timeout get_all_security,'couchdb@10.133.98.18',<<"shards/e0000000-ffffffff/_global_changes.1592994148">>
[error] 2020-06-30T14:26:44.159850Z couchdb@10.133.98.18 <0.22160.175> -------- fabric_worker_timeout get_all_security,'couchdb@10.133.98.18',<<"shards/00000000-1fffffff/_global_changes.1592994148">>
[error] 2020-06-30T14:26:44.159910Z couchdb@10.133.98.18 <0.18146.175> -------- Error checking security objects for _global_changes :: {error,timeout}
[error] 2020-06-30T14:26:44.159930Z couchdb@10.133.98.18 <0.22160.175> -------- Error checking security objects for _global_changes :: {error,timeout}
Comment options

Are the IPs changing? What does /_membership show on the various nodes? Are you still reading/writing while this partition is not resolving itself?

Comment options

The IPs are absolutely static and not changing.

When the cluster splits, CouchDB starts behaving strangely. Not all reads work, and writing gets funky as well:

[error] 2020-07-01T14:12:27.228410Z couchdb@10.133.98.18 <0.1645.0> -------- Error getting security objects for <<"userdb-4242544d484c3737453331463833394a">>: {error,no_majority}
[error] 2020-07-01T14:12:27.228883Z couchdb@10.133.98.18 <0.1661.0> -------- Error getting security objects for <<"userdb-4242544d484c3737453331463833394a">>: {error,no_majority}
[error] 2020-07-01T14:12:27.229347Z couchdb@10.133.98.18 <0.1633.0> -------- Error getting security objects for <<"userdb-4242544c4c523636423635433538385a">>: {error,no_majority}
[error] 2020-07-01T14:11:21.739767Z couchdb@10.133.109.142 <0.20260.77> -------- rexi_server: from: couchdb@10.133.109.142(<0.19280.77>) mfa: fabric_rpc:all_docs/3 exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,265}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,205}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,462}]},{couch_mrview,finish_fold,2,[{file,"src/couch_mrview.erl"},{line,682}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,140}]}]
[error] 2020-07-01T14:11:21.739850Z couchdb@10.133.109.142 <0.18854.77> -------- rexi_server: from: couchdb@10.133.109.142(<0.19280.77>) mfa: fabric_rpc:all_docs/3 exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,265}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,205}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,462}]},{couch_mrview,finish_fold,2,[{file,"src/couch_mrview.erl"},{line,682}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,140}]}]
[error] 2020-07-01T14:12:27.062349Z couchdb@10.133.98.18 <0.350.0> -------- Error getting security objects for <<"_global_changes">>: {error,no_majority}
[error] 2020-07-01T14:12:27.062389Z couchdb@10.133.98.18 <0.353.0> -------- Error getting security objects for <<"_global_changes">>: {error,no_majority}
[error] 2020-07-01T14:12:27.062394Z couchdb@10.133.98.18 <0.352.0> -------- Error getting security objects for <<"_global_changes">>: {error,no_majority}
[error] 2020-07-01T14:12:27.062419Z couchdb@10.133.98.18 <0.419.0> -------- Error getting security objects for <<"_global_changes">>: {error,no_majority}
[error] 2020-07-01T14:12:27.062478Z couchdb@10.133.98.18 <0.348.0> -------- Error getting security objects for <<"_global_changes">>: {error,no_majority}

Basic CouchDB requests work, like / or whatever, so it's also very complicated to set up an automated monitor for the problem.

As you can see from the logs, it's happening almost daily.

Just as a hypothesis: I see the cluster keeps a TCP connection between the nodes. Maybe it never times out, or it gets stuck when the backend LAN flaps?

(Today CouchDB 3.1 should become available in the FreeBSD Ports tree, so I plan to upgrade soon. Let's hope it fixes the problem.)

Comment options

I had the same problem happen again half an hour ago on a different couple of machines.

Same kind of symptoms, same kind of logs.

Netstat during the broken condition showed an active TCP connection between the machines, and with tcpdump I could see traffic flowing.

Yet at the same time, in the log I had dozens of entries like:

[error] 2020-07-02T12:29:48.722656Z couchdb@10.133.138.24 <0.31373.307> -------- fabric_worker_timeout open_doc,'couchdb@10.133.136.126',<<"shards/80000000-ffffffff/userdb-42535444474936395032354c37333651.1593462601">>
[error] 2020-07-02T12:29:48.722840Z couchdb@10.133.138.24 <0.31373.307> -------- _all_docs open error: userdb-42535444474936395032354c37333651 05a93f2beee15e@io01.rcovid19.it :: {error,{case_clause,{error,timeout}}} [{fabric_view_all_docs,open_doc_int,4,[{file,[115,114,99,47,102,97,98,114,105,99,95,118,105,101,119,95,97,108,108,95,100,111,99,115,46,101,114,108]},{line,269}]},{fabric_view_all_docs,open_doc,4,[{file,[115,114,99,47,102,97,98,114,105,99,95,118,105,101,119,95,97,108,108,95,100,111,99,115,46,101,114,108]},{line,258}]}]


I'll investigate further; if you have any diagnostic suggestions, feel free to tell me...

Comment options

Just my 2 cents' worth - our solution was to periodically check for disconnected nodes (in _membership / cluster_nodes) and restart CouchDB if any are detected. Ugly, but it works.
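
A minimal sketch of that workaround, assuming a local node at 127.0.0.1:5984, admin credentials, and a systemd-managed couchdb service (all of these are assumptions to adapt for your environment):

# Sketch only: poll /_membership and restart CouchDB when a cluster node has
# dropped out of all_nodes. URL, credentials, and restart command are assumptions.
import subprocess
import time

import requests  # third-party; pip install requests

MEMBERSHIP_URL = "http://127.0.0.1:5984/_membership"  # assumed local node
AUTH = ("admin", "password")                          # assumed admin credentials
RESTART_CMD = ["systemctl", "restart", "couchdb"]     # assumed service name
CHECK_INTERVAL = 60                                   # seconds between checks

def disconnected_nodes():
    """Cluster members (cluster_nodes) that are not currently connected (all_nodes)."""
    resp = requests.get(MEMBERSHIP_URL, auth=AUTH, timeout=10)
    resp.raise_for_status()
    body = resp.json()
    return sorted(set(body["cluster_nodes"]) - set(body["all_nodes"]))

while True:
    try:
        missing = disconnected_nodes()
        if missing:
            print("disconnected nodes detected:", missing)
            subprocess.run(RESTART_CMD, check=True)
    except Exception as exc:  # network errors, HTTP errors, etc.
        print("membership check failed:", exc)
    time.sleep(CHECK_INTERVAL)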

Comment options

Hey guys, this could be what's happening to us; was there any resolution?

Comment options

This problem occurred to us on two CouchDB setups in an Azure Kubernetes cluster.

One setup had run for over a year or two, since before I came to the team, but in February the nodes suddenly partitioned. The cluster recovered after a problematic pod was restarted.

The other setup is a cluster we used to validate the backup of CouchDB:

  • We started a 3-node CouchDB cluster with a seed list, a StatefulSet, and a headless service.
    • We verified it was working.
  • We changed the command to sleep 1d and removed the /_up health checks, and the pods were recreated.
  • We rsync/rclone'd the databases into the volume mount path.
  • We removed the command to let the CouchDB instances start.
    • After all three nodes started, we found that the nodes were not syncing.
  • We restarted each node, one by one.
    • After that, the nodes started to sync and the partition situation was gone.
  • We added the health check back to the StatefulSet.
    • The nodes restarted one by one, and everything worked after the StatefulSet was all up-to-date and ready.

Maybe we should write a script as the health checker to detect the partition situation and let Kubernetes kill the pod; something like the sketch below.
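
A hedged sketch of such a checker (the URL and credentials are placeholders): it exits non-zero when any member of cluster_nodes is missing from all_nodes, so a Kubernetes exec liveness probe can kill the pod:

# Hypothetical liveness check: exit 1 when the node sees a partition, so
# Kubernetes restarts the pod. URL and credentials are placeholders.
import sys

import requests  # third-party; pip install requests

resp = requests.get("http://127.0.0.1:5984/_membership",
                    auth=("admin", "password"), timeout=5)
resp.raise_for_status()
membership = resp.json()
missing = set(membership["cluster_nodes"]) - set(membership["all_nodes"])
if missing:
    print("partitioned; missing nodes:", sorted(missing))
    sys.exit(1)
print("cluster membership OK")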

Comment options

nickva Mar 21, 2024
Collaborator

You can try monitoring the _membership on each node like @rpfeifer-swi suggested.

In the latest 3.3.3 there is a mem3_distribution module which will periodically try to reconnect any disconnected nodes. How often it checks and tries to do that is configured as [cluster] reconnect_interval_sec = $seconds.
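
For reference, a hedged sketch of setting that option through CouchDB's per-node config API (the address, credentials, and 30-second value are assumptions; the same key can also be set under [cluster] in local.ini):

# Sketch only: set [cluster] reconnect_interval_sec via the _config API on a
# 3.3.3+ node. Address, credentials, and the value are assumptions.
import requests  # third-party; pip install requests

resp = requests.put(
    "http://127.0.0.1:5984/_node/_local/_config/cluster/reconnect_interval_sec",
    json="30",                   # config values are JSON strings, here 30 seconds
    auth=("admin", "password"),  # assumed admin credentials
    timeout=10,
)
resp.raise_for_status()
print("previous value:", resp.json())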

This discussion was converted from issue #2140 on June 25, 2020 18:13.
