Cluster does not recover from temporary network partition #2140

Unanswered
rpfeifer-swi asked this question in Q&A
Discussion options

Discovered that if a network connectivity issue makes a node in a CouchDB cluster unreachable (routing issue, someone trips over a cable, etc.), after about a minute or so the affected node will disconnect and never attempt to reconnect. This leaves the cluster broken, and the only apparent way to recover is to manually restart CouchDB, which re-establishes the connections.

To duplicate:
I set up a small cluster (3 nodes, CouchDB 2.3.1 on Debian 9) and verified that a database replicates across them. I noted that there was an open TCP socket to port 9100 from each peer.

Disconnected the network (virtual, on a VirtualBox VM) to one of them. After about a minute the sockets involving the affected node closed. I also noticed that an attempt to update a database hung until the socket closed (then completed successfully).

Upon re-connecting the affected node, I noted that that node is no longer synced to the rest of the cluster, and it never recovers. There is apparently no mechanism to re-establish the broken connections. Stopping and re-starting any node's CouchDB will re-establish normal operation. This does not appear to be related to link state or other conditions; a simple loss of routing is confirmed to cause it.

This would seem to be a fairly glaring reliability issue. If there is some mechanism to handle this, it does not appear in the documentation.


Replies: 10 comments 5 replies

Comment options

This is definitely not the case in production systems across hundreds of installs I've seen personally.

Can you describe your cluster setup more completely? Are you using Docker, or cloud instances with private networking, or bare metal installs (e.g. Raspberry Pis)?

Comment options

This is running on a VPN networking appliance based on Debian 9, run in an (amd64) VM (typically VMware) or on real iron. We are using the provided pre-built .deb (stretch) packages.

Our cluster configuration is a bit unusual, in that we adjust sharding (and n) to keep a shard on every node, so that every node always has a full copy of the data. We build the cluster one node at a time; each new node joins the cluster, and the shard metadata is then replicated to the new node. We are maintaining one fairly small database (at most a few thousand small documents), with a cluster size of 1-4 nodes (each of which must always have the data available, hence the resharding). Otherwise, it's pretty standard stuff, I think.
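
As a minimal sketch of this kind of setup (with an assumed address, credentials, database name, and q/n values; not the poster's actual tooling), CouchDB accepts explicit shard settings when a database is created, so n can be made to match the node count:

# Minimal sketch with assumed names/credentials, not the poster's tooling:
# create a database with explicit shard settings so n matches the node count.
import requests  # third-party; pip install requests

COUCH = "http://127.0.0.1:5984"   # assumed node address
AUTH = ("admin", "password")      # assumed admin credentials

# q = number of shard ranges, n = number of copies of each shard
resp = requests.put(f"{COUCH}/mydb", params={"q": 2, "n": 4}, auth=AUTH, timeout=10)
resp.raise_for_status()
print(resp.json())  # {'ok': True}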

Comment options

rnewson Jul 1, 2020
Collaborator

If you create a database on a 3-node cluster, then every node has a full copy of the data anyway; that's the default behaviour and requires no adjustment from you.

Comment options

/cc @nickva ever seen this? This might be an actual bug, but I don't know if it's one we care to fix given the 4.0 plans.

Comment options

@rwpfeifer I'm sorry that we don't have any more information to provide here. The only thing I can think of is that, if you are actually tearing down the network interface itself, epmd - the Erlang Port Mapper Daemon - may be losing the interface it's bound to. This, in turn, would prevent other nodes from reaching CouchDB on that node, since they can't talk to epmd, which is how they find out how to talk to CouchDB on e.g. port 9100/tcp.

When you kill CouchDB, epmd will also terminate. Restarting CouchDB will then automatically restart epmd.

No Erlang distributed process can survive epmd being restarted from underneath it, to my knowledge, so the only workaround for you would be to ensure that when you interrupt networking, you do not also tear down the virtual interface in the VM guest at the same time.
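
As a minimal illustration of that lookup (assuming a local epmd on the default port 4369; this is not code from the thread), the registered node names and their distribution ports can be requested from epmd directly using its documented NAMES_REQ message:

# Minimal sketch: ask epmd which distribution port each Erlang node registered.
# Host/port defaults are assumptions for illustration.
import socket
import struct

def epmd_names(host: str = "127.0.0.1", port: int = 4369) -> str:
    """Return epmd's registered-names listing, e.g. 'name couchdb at port 9100'."""
    with socket.create_connection((host, port), timeout=5) as sock:
        # Request = 2-byte big-endian length, then the NAMES_REQ tag byte (110).
        sock.sendall(struct.pack(">HB", 1, 110))
        data = b""
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    # Response = 4-byte epmd port number, then one text line per registered node.
    return data[4:].decode("ascii", errors="replace")

if __name__ == "__main__":
    print(epmd_names())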

Comment options

Joan, I didn't catch your reply previously, but in my scenario epmd is only opening port 4369, not 9100 -- and there's nothing flowing on that port...

root@io-01:/var/log/couchdb2 # sockstat -4 | grep epm
couchdb epmd 1078 3 tcp4 *:4369 *:*
couchdb epmd 1078 5 tcp4 127.0.0.1:4369 127.0.0.1:14743
root@io-01:/var/log/couchdb2 # netstat -an | grep 4369
tcp4 0 0 127.0.0.1.4369 127.0.0.1.14743 ESTABLISHED
tcp4 0 0 127.0.0.1.14743 127.0.0.1.4369 ESTABLISHED
tcp6 0 0 *.4369 *.* LISTEN
tcp4 0 0 *.4369 *.* LISTEN
root@io-02:/var/log/couchdb2 # sockstat -4 | grep epm
couchdb epmd 1013 3 tcp4 *:4369 *:*
couchdb epmd 1013 5 tcp4 127.0.0.1:4369 127.0.0.1:48258
root@io-02:/var/log/couchdb2 # netstat -an | grep 4369
tcp4 0 0 127.0.0.1.4369 127.0.0.1.48258 ESTABLISHED
tcp4 0 0 127.0.0.1.48258 127.0.0.1.4369 ESTABLISHED
tcp6 0 0 *.4369 *.* LISTEN
tcp4 0 0 *.4369 *.* LISTEN
Comment options

Unfortunately I'm facing the same issue.

I have two VMs on Digital Ocean that are connected through the internal LANs.

After some days or hours of activity, the machines just split-brain. I guess that "behind the curtain" the VM may be moving from one physical host to another.

The logs get filled with things like:

[error] 2020-06-30T14:34:38.754383Z couchdb@10.133.109.142 <0.3087.142> b8e83a37cf fabric_worker_timeout update_docs,'couchdb@10.133.98.18',<<"shards/c0000000-dfffffff/queue.1592984636">>
[error] 2020-06-30T14:34:38.915822Z couchdb@10.133.109.142 <0.1364.142> c0efb5762f fabric_worker_timeout update_docs,'couchdb@10.133.98.18',<<"shards/60000000-7fffffff/queue.1592984636">>
[error] 2020-06-30T14:34:45.850769Z couchdb@10.133.109.142 <0.27551.141> 01ce47afb2 fabric_worker_timeout update_docs,'couchdb@10.133.98.18',<<"shards/20000000-3fffffff/queue.1592984636">>
[error] 2020-06-30T14:34:51.734713Z couchdb@10.133.109.142 <0.30745.141> 852d8f98f9 fabric_worker_timeout update_docs,'couchdb@10.133.98.18',<<"shards/20000000-3fffffff/queue.1592984636">>
[error] 2020-06-30T14:34:56.528303Z couchdb@10.133.109.142 <0.2807.142> 8655186dc3 fabric_worker_timeout update_docs,'couchdb@10.133.98.18',<<"shards/a0000000-bfffffff/queue.1592984636">>

(That is the "remote" machine's IP.)

Restarting CouchDB fixes the problem.

CouchDB 2.3.1 on FreeBSD 12.1

Comment options

This is the "beginning" of the problem, which happened at 14:26 today:

Machine A

[error] 2020-06-30T14:26:42.218066Z couchdb@10.133.109.142 <0.22770.141> -------- fabric_worker_timeout get_all_security,'couchdb@10.133.98.18',<<"shards/20000000-3fffffff/_global_changes.1592994148">>
[error] 2020-06-30T14:26:42.218209Z couchdb@10.133.109.142 <0.22770.141> -------- Error checking security objects for _global_changes :: {error,timeout}
[error] 2020-06-30T14:26:42.218284Z couchdb@10.133.109.142 <0.22775.141> -------- fabric_worker_timeout get_all_security,'couchdb@10.133.98.18',<<"shards/60000000-7fffffff/_global_changes.1592994148">>
[error] 2020-06-30T14:26:42.218350Z couchdb@10.133.109.142 <0.22775.141> -------- Error checking security objects for _global_changes :: {error,timeout}
[error] 2020-06-30T14:26:43.239807Z couchdb@10.133.109.142 <0.19639.141> -------- fabric_worker_timeout get_all_security,'couchdb@10.133.98.18',<<"shards/40000000-5fffffff/_global_changes.1592994148">>
[error] 2020-06-30T14:26:43.239891Z couchdb@10.133.109.142 <0.19639.141> -------- Error checking security objects for _global_changes :: {error,timeout}
[error] 2020-06-30T14:26:43.927694Z couchdb@10.133.109.142 <0.19725.104> fd66a2549d fabric_worker_timeout update_docs,'couchdb@10.133.98.18',<<"shards/a0000000-bfffffff/queue.1592984636">>

Machine B

[error] 2020-06-30T14:26:19.770152Z couchdb@10.133.98.18 <0.15949.176> -------- rexi_server: from: couchdb@10.133.98.18(<0.27190.175>) mfa: fabric_rpc:all_docs/3 error:function_clause [{couch_db,incref,[undefined],[{file,"src/couch_db.erl"},{line,185}]},{couch_server,open,2,[{file,"src/couch_server.erl"},{line,85}]},{fabric_rpc,all_docs,3,[{file,"src/fabric_rpc.erl"},{line,124}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,140}]}]
[error] 2020-06-30T14:26:39.895071Z couchdb@10.133.98.18 <0.30132.0> -------- Replicator, request PUT to "http://127.0.0.1:5984/_users/_local/153910aca337d66bb0901018a8f58206" failed due to error {error,req_timedout}
[error] 2020-06-30T14:26:39.895406Z couchdb@10.133.98.18 <0.30132.0> -------- Replication `153910aca337d66bb0901018a8f58206+continuous` (`http://home-replication:*****@10.133.136.126:5984/_users/` -> `http://127.0.0.1:5984/_users/`) failed: {http_request_failed,"PUT",
[error] 2020-06-30T14:26:39.922133Z couchdb@10.133.98.18 <0.5378.1> -------- Replicator, request GET to "http://home-replication:*****@10.133.136.126:5984/_users/_changes?feed=continuous&style=all_docs&since=%22118783-g1AAAALLeJyl0M0KwjAMAODiBMW7Ht18gbE2rm4n9yban8mQqSdvgr6Jvonii6jv4H127ZynIayHJJCQj5AcIdTPHIlGYrcXmeQJDnwMoIL6mNBczTsMcbcoinXmcITc6Ub1eiuxkiGBxsV_JvdU5vOafQ81iyWRfDprzyYlu6hZb6BZilOIwYJdluzx94S7ZhkGQShuzW67KqOTKko-G3r80nQoCRNpaElfDH2trs41TSDiwYxZ0jdDPyr6oGmIghRkbEk_Df399cRczWIaR6Rxf_0B5GCzpw%22&timeout=10000" failed due to error closing_on_request
[error] 2020-06-30T14:26:43.649857Z couchdb@10.133.98.18 <0.9173.176> -------- fabric_worker_timeout get_all_security,'couchdb@10.133.98.18',<<"shards/40000000-5fffffff/_global_changes.1592994148">>
[error] 2020-06-30T14:26:43.649951Z couchdb@10.133.98.18 <0.9173.176> -------- Error checking security objects for _global_changes :: {error,timeout}
[error] 2020-06-30T14:26:44.159851Z couchdb@10.133.98.18 <0.18146.175> -------- fabric_worker_timeout get_all_security,'couchdb@10.133.98.18',<<"shards/e0000000-ffffffff/_global_changes.1592994148">>
[error] 2020-06-30T14:26:44.159850Z couchdb@10.133.98.18 <0.22160.175> -------- fabric_worker_timeout get_all_security,'couchdb@10.133.98.18',<<"shards/00000000-1fffffff/_global_changes.1592994148">>
[error] 2020-06-30T14:26:44.159910Z couchdb@10.133.98.18 <0.18146.175> -------- Error checking security objects for _global_changes :: {error,timeout}
[error] 2020-06-30T14:26:44.159930Z couchdb@10.133.98.18 <0.22160.175> -------- Error checking security objects for _global_changes :: {error,timeout}
Comment options

Are the IPs changing? What does /_membership show on the various nodes? Are you still reading/writing while this partition is not resolving itself?

Comment options

The IPs are absolutely static and not changing.

When the cluster splits, CouchDB starts behaving strangely. Not all reads work, and writing gets funky as well:

[error] 2020-07-01T14:12:27.228410Z couchdb@10.133.98.18 <0.1645.0> -------- Error getting security objects for <<"userdb-4242544d484c3737453331463833394a">>: {error,no_majority}
[error] 2020-07-01T14:12:27.228883Z couchdb@10.133.98.18 <0.1661.0> -------- Error getting security objects for <<"userdb-4242544d484c3737453331463833394a">>: {error,no_majority}
[error] 2020-07-01T14:12:27.229347Z couchdb@10.133.98.18 <0.1633.0> -------- Error getting security objects for <<"userdb-4242544c4c523636423635433538385a">>: {error,no_majority}
[error] 2020-07-01T14:11:21.739767Z couchdb@10.133.109.142 <0.20260.77> -------- rexi_server: from: couchdb@10.133.109.142(<0.19280.77>) mfa: fabric_rpc:all_docs/3 exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,265}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,205}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,462}]},{couch_mrview,finish_fold,2,[{file,"src/couch_mrview.erl"},{line,682}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,140}]}]
[error] 2020-07-01T14:11:21.739850Z couchdb@10.133.109.142 <0.18854.77> -------- rexi_server: from: couchdb@10.133.109.142(<0.19280.77>) mfa: fabric_rpc:all_docs/3 exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,265}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,205}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,462}]},{couch_mrview,finish_fold,2,[{file,"src/couch_mrview.erl"},{line,682}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,140}]}]
[error] 2020-07-01T14:12:27.062349Z couchdb@10.133.98.18 <0.350.0> -------- Error getting security objects for <<"_global_changes">>: {error,no_majority}
[error] 2020-07-01T14:12:27.062389Z couchdb@10.133.98.18 <0.353.0> -------- Error getting security objects for <<"_global_changes">>: {error,no_majority}
[error] 2020-07-01T14:12:27.062394Z couchdb@10.133.98.18 <0.352.0> -------- Error getting security objects for <<"_global_changes">>: {error,no_majority}
[error] 2020-07-01T14:12:27.062419Z couchdb@10.133.98.18 <0.419.0> -------- Error getting security objects for <<"_global_changes">>: {error,no_majority}
[error] 2020-07-01T14:12:27.062478Z couchdb@10.133.98.18 <0.348.0> -------- Error getting security objects for <<"_global_changes">>: {error,no_majority}

Basic CouchDB requests work, like / or whatever, so it's also very complicated to set up an automated monitor for the problem.

As you can see from the logs, it's happening almost daily.

Just as a hypothesis: I see the cluster keeps a TCP connection between the nodes. Maybe it never times out, or it gets stuck when the backend LAN flaps?

(Today CouchDB 3.1 should become available in the FreeBSD Ports tree, so I plan to upgrade soon. Let's hope it fixes the problem.)

Comment options

I had the same problem happen again half an hour ago on a different couple of machines.

Same kind of symptoms, same kind of logs.

Netstat during the broken condition showed an active TCP connection between the machines, and with tcpdump I could see traffic flowing.

Yet at the same time, in the log I had dozens of entries like:

[error] 2020-07-02T12:29:48.722656Z couchdb@10.133.138.24 <0.31373.307> -------- fabric_worker_timeout open_doc,'couchdb@10.133.136.126',<<"shards/80000000-ffffffff/userdb-42535444474936395032354c37333651.1593462601">>
[error] 2020-07-02T12:29:48.722840Z couchdb@10.133.138.24 <0.31373.307> -------- _all_docs open error: userdb-42535444474936395032354c37333651 05a93f2beee15e@io01.rcovid19.it :: {error,{case_clause,{error,timeout}}} [{fabric_view_all_docs,open_doc_int,4,[{file,[115,114,99,47,102,97,98,114,105,99,95,118,105,101,119,95,97,108,108,95,100,111,99,115,46,101,114,108]},{line,269}]},{fabric_view_all_docs,open_doc,4,[{file,[115,114,99,47,102,97,98,114,105,99,95,118,105,101,119,95,97,108,108,95,100,111,99,115,46,101,114,108]},{line,258}]}]


I'll investigate further; if you have any diagnostic suggestions, feel free to tell me...

Comment options

Just my 2 cents' worth - our solution was to periodically check for disconnected nodes (in _membership / cluster_nodes) and restart CouchDB if any are detected. Ugly, but it works.
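
A minimal sketch of that workaround, assuming a local node at 127.0.0.1:5984, admin credentials, and a systemd-managed couchdb service (all of these are assumptions to adapt for your environment):

# Sketch only: poll /_membership and restart CouchDB when a cluster node has
# dropped out of all_nodes. URL, credentials, and restart command are assumptions.
import subprocess
import time

import requests  # third-party; pip install requests

MEMBERSHIP_URL = "http://127.0.0.1:5984/_membership"  # assumed local node
AUTH = ("admin", "password")                          # assumed admin credentials
RESTART_CMD = ["systemctl", "restart", "couchdb"]     # assumed service name
CHECK_INTERVAL = 60                                   # seconds between checks

def disconnected_nodes():
    """Cluster members (cluster_nodes) that are not currently connected (all_nodes)."""
    resp = requests.get(MEMBERSHIP_URL, auth=AUTH, timeout=10)
    resp.raise_for_status()
    body = resp.json()
    return sorted(set(body["cluster_nodes"]) - set(body["all_nodes"]))

while True:
    try:
        missing = disconnected_nodes()
        if missing:
            print("disconnected nodes detected:", missing)
            subprocess.run(RESTART_CMD, check=True)
    except Exception as exc:  # network errors, HTTP errors, etc.
        print("membership check failed:", exc)
    time.sleep(CHECK_INTERVAL)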

Comment options

Hey guys, this could be what's happening to us; was there any resolution?

Comment options

This problem occurred to us on two CouchDB setups in an Azure Kubernetes cluster.

One setup had run for over a year or two, since before I came to the team, but in February the nodes suddenly partitioned. The cluster recovered after a problematic pod was restarted.

The other setup is a cluster we used to validate the backup of CouchDB:

  • We started a 3-node CouchDB cluster with a seed list, a StatefulSet, and a headless service.
    • We verified it was working.
  • We changed the command to sleep 1d and removed the /_up health checks, and the pods were recreated.
  • We rsync/rclone'd the databases into the volume mount path.
  • We removed the command to let the CouchDB instances start.
    • After all three nodes started, we found that the nodes were not syncing.
  • We restarted each node, one by one.
    • After that, the nodes started to sync and the partition situation was gone.
  • We added the health check back to the StatefulSet.
    • The nodes restarted one by one, and everything worked after the StatefulSet was all up-to-date and ready.

Maybe we should write a script as the health checker to detect the partition situation and let Kubernetes kill the pod; something like the sketch below.
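
A hedged sketch of such a checker (the URL and credentials are placeholders): it exits non-zero when any member of cluster_nodes is missing from all_nodes, so a Kubernetes exec liveness probe can kill the pod:

# Hypothetical liveness check: exit 1 when the node sees a partition, so
# Kubernetes restarts the pod. URL and credentials are placeholders.
import sys

import requests  # third-party; pip install requests

resp = requests.get("http://127.0.0.1:5984/_membership",
                    auth=("admin", "password"), timeout=5)
resp.raise_for_status()
membership = resp.json()
missing = set(membership["cluster_nodes"]) - set(membership["all_nodes"])
if missing:
    print("partitioned; missing nodes:", sorted(missing))
    sys.exit(1)
print("cluster membership OK")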

Comment options

nickva Mar 21, 2024
Collaborator

You can try monitoring the _membership on each node like @rpfeifer-swi suggested.

In the latest 3.3.3 there is a mem3_distribution module which will periodically try to reconnect any disconnected nodes. How often it checks and tries to do that is configured as [cluster] reconnect_interval_sec = $seconds.
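
For reference, a hedged sketch of setting that option through CouchDB's per-node config API (the address, credentials, and 30-second value are assumptions; the same key can also be set under [cluster] in local.ini):

# Sketch only: set [cluster] reconnect_interval_sec via the _config API on a
# 3.3.3+ node. Address, credentials, and the value are assumptions.
import requests  # third-party; pip install requests

resp = requests.put(
    "http://127.0.0.1:5984/_node/_local/_config/cluster/reconnect_interval_sec",
    json="30",                   # config values are JSON strings, here 30 seconds
    auth=("admin", "password"),  # assumed admin credentials
    timeout=10,
)
resp.raise_for_status()
print("previous value:", resp.json())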

This discussion was converted from issue #2140 on June 25, 2020 18:13.
