I am trying to understand the causes behind a recent outage. Basically, Couch was swamped with traffic and most calls were returning timeout errors (even though the host server had plenty of free resources). This went on for days, and the timeouts continued even after almost all traffic stopped aside from a nightly purge script. Then, one night, things just magically started working again and the timeouts were gone. Now I am trying to piece together what happened...
Specifically, though, I am trying to understand if/how the calls to _purge might have affected the overall performance of the instance. Here is an example from the Couch logs of the indexing for a specific shard over a 24hr period. These are all of the log entries for that specific shard for the complete 24hr period:
```
[info] 2022-08-01T21:53:45.260544Z couchdb@127.0.0.1 <0.12840.4> -------- Starting index update for db: shards/20000000-3fffffff/medic.1630490256 idx: _design/medic
[error] 2022-08-02T02:02:03.733219Z couchdb@127.0.0.1 <0.12372.4503> -------- Invalid purge doc '_local/purge-mrview-2859a456772728483f6ab654a5d72d25' on <<"shards/20000000-3fffffff/medic.1630490256">> with purge_seq '0'
[error] 2022-08-02T02:02:03.734967Z couchdb@127.0.0.1 <0.12372.4503> -------- Invalid purge doc '_local/purge-mrview-2fccf76609d4d962a544d274741fe286' on <<"shards/20000000-3fffffff/medic.1630490256">> with purge_seq '0'
[error] 2022-08-02T02:02:03.736033Z couchdb@127.0.0.1 <0.12372.4503> -------- Invalid purge doc '_local/purge-mrview-4a20d1a4845f3466905fa2b1e39250f5' on <<"shards/20000000-3fffffff/medic.1630490256">> with purge_seq '0'
[error] 2022-08-02T02:02:03.737022Z couchdb@127.0.0.1 <0.12372.4503> -------- Invalid purge doc '_local/purge-mrview-4fe776f15f5867d5a32b0dbdc20852c9' on <<"shards/20000000-3fffffff/medic.1630490256">> with purge_seq '0'
in database <<"shards/20000000-3fffffff/medic.1630490256">>
in database <<"shards/20000000-3fffffff/medic.1630490256">>
[warning] 2022-08-02T02:02:03.745046Z couchdb@127.0.0.1 <0.12372.4503> -------- The purge sequence for 'shards/20000000-3fffffff/medic.1630490256' exceeds configured threshold
[info] 2022-08-02T10:28:58.120033Z couchdb@127.0.0.1 <0.12840.4> -------- Index update finished for db: shards/20000000-3fffffff/medic.1630490256 idx: _design/medic
[info] 2022-08-02T10:28:58.128140Z couchdb@127.0.0.1 <0.12840.4> -------- Starting index update for db: shards/20000000-3fffffff/medic.1630490256 idx: _design/medic
state: {state,<0.31130.2486>,[<<"shards/20000000-3fffffff/medic.1630490256">>]}
[info] 2022-08-02T21:34:22.419520Z couchdb@127.0.0.1 <0.12840.4> -------- Index update finished for db: shards/20000000-3fffffff/medic.1630490256 idx: _design/medic
```
The original message at 2022-08-01T21:53 is right in the middle of the massive batch of _purge calls that were run that night (almost all of which just ended up timing out...). You can see from the log dates that the indexing for that shard takes ~12hrs to complete. Then the indexing immediately restarts and runs for another 11hrs. Is this excessive time spent building the index for the shard caused by the "Invalid purge doc" errors? What does the "The purge sequence for ... exceeds configured threshold" warning mean? Is there some way we can increase that configuration to avoid this issue in the future?
I found some information in the docs about the purge sequence, but did not find any way to configure the limit...
Details:
- Couch 2.3.1
- Size on disk: 222G
- 2141 databases
- 5,000,000+ docs in the largest db
Thank you in advance!
-
Is it possible that somehow the purging is causing the index to have to be completely rebuilt instead of just updated incrementally?
-
The purge info limit can be updated with https://docs.couchdb.org/en/3.2.0/api/database/misc.html#put--db-_purged_infos_limit. It's per database.
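For example, something like this should bump it (a minimal sketch against the clustered port, with placeholder host/credentials and an arbitrary new limit):

```python
import requests

# Raise the purge info limit for one database (placeholder host, creds and limit).
# The endpoint takes a bare JSON integer as the request body; the default is 1000.
resp = requests.put(
    "http://localhost:5984/medic/_purged_infos_limit",
    auth=("admin", "password"),
    json=5000,
)
print(resp.status_code, resp.json())  # expect 200 and {'ok': True}
```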
I think the Invalid purge doc '_local/purge-mrview-4fe776f15f5867d5a32b0dbdc20852c9' on <<"shards/20000000-3fffffff/medic.1630490256">> with purge_seq '0' error indicates the view might have been updated and, due to a race condition, the purge doc with the old view signature was not cleaned up. If you think that 4fe776f15f5867d5a32b0dbdc20852c9 is a stale view signature, try removing the _local/purge-mrview-4fe776f15f5867d5a32b0dbdc20852c9 doc by hand if possible.
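If you do try that, something along these lines might work against the node-local ("backdoor") port — just a rough sketch, assuming CouchDB 2.x with port 5986 reachable and placeholder admin credentials:

```python
import requests
from urllib.parse import quote

# Shard name and checkpoint id are taken from the log lines above; creds are placeholders.
auth = ("admin", "password")
shard = quote("shards/20000000-3fffffff/medic.1630490256", safe="")
doc_url = f"http://localhost:5986/{shard}/_local/purge-mrview-4fe776f15f5867d5a32b0dbdc20852c9"

doc = requests.get(doc_url, auth=auth).json()  # fetch the checkpoint doc to get its _rev
resp = requests.delete(doc_url, params={"rev": doc["_rev"]}, auth=auth)
print(resp.status_code, resp.json())
```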
As for the views being rebuilt, I could not see any place where the view could be reset due to a stale purge sequence. See if there are any crashes or stack traces in the logs. Also make sure the design document is not being updated, since every update generates a new view signature and triggers a full view rebuild.
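A quick way to keep an eye on that (a small sketch with placeholder credentials) is to compare the design doc rev and the view index signature over time; if either changes, a full rebuild will follow:

```python
import requests

auth = ("admin", "password")  # placeholder credentials
ddoc = requests.get("http://localhost:5984/medic/_design/medic", auth=auth).json()
info = requests.get("http://localhost:5984/medic/_design/medic/_info", auth=auth).json()

print("design doc rev:", ddoc["_rev"])
print("view signature:", info["view_index"]["signature"])
```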
-
Thanks for the response @nickva! For better or worse, we cannot really replicate the errors above (so I cannot be sure what effect removing _local/purge-mrview-4fe776f15f5867d5a32b0dbdc20852c9 would have). Once the timeouts magically stopped, both the purging and the indexing continued normally (and without errors).
There is also a distinct lack of interesting errors in the logs. I have spent days sifting through them, and almost all the errors are just timeout errors like this:
```
[error] 2022-08-01T16:36:25.158532Z couchdb@127.0.0.1 <0.7884.3264> fe8372ae91 rexi_server: from: couchdb@127.0.0.1(<0.6708.3264>) mfa: fabric_rpc:purge_docs/3 exit:{timeout,{gen_server,call,[<0.471.0>,{purge_docs,[{<<"37f60125fa241fd853c0457bd54160d2">>,<<"78108249-21ce-4981-b94b-ae07e33e9f9d">>,[{1,<<227,169,6,230,50,105,22,71,12,15,103,190,33,175,120,216>>}]}],[{user_ctx,{user_ctx,<<"*******">>,[<<"*******">>,<<"********">>,<<"******">>],<<"********">>}},{w,"1"}]}]}} [{gen_server,call,2,[{file,"gen_server.erl"},{line,204}]},{fabric_rpc,with_db,3,[{file,"src/fabric_rpc.erl"},{line,334}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,140}]}]
[error] 2022-08-01T16:36:25.158755Z couchdb@127.0.0.1 <0.6708.3264> fe8372ae91 req_err(1108132862) case_clause : {error,[{error,internal_server_error}]}
[<<"chttpd_db:db_req/2 L567">>,<<"chttpd:handle_req_after_auth/2 L320">>,<<"chttpd:process_request/1 L303">>,<<"chttpd:handle_request_int/1 L243">>,<<"mochiweb_http:headers/6 L128">>,<<"proc_lib:init_p_do_apply/3 L247">>]
```
I do not see any updates to the design docs (no PUTs to _design). Is there any other reason you know of, besides a complete rebuild, that an index might take 12+hrs to update?
-
Oh, [error] 2022-08-01T16:36:25.158532Z couchdb@127.0.0.1 <0.7884.3264> fe8372ae91 rexi_server: from: couchdb@127.0.0.1(<0.6708.3264>) mfa: fabric_rpc:purge_docs/3 exit:{timeout,{gen_server,call,[<0.471.0>,{purge_docs, looks like a genuine error. Though I don't know if that would cause the view to reset.
-
This should fix that particular timeout #4143
And to be clear, it just seems like a separate error that we're lucky to find out about from your log report :-)
-
@jkuester would there be a way to reproduce the issue, like a script or series of steps: create db, populate with data, build view, purge, etc.?
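Something along these lines would be a great starting point — just a rough sketch with placeholder host/credentials and arbitrary doc counts, not a definitive repro:

```python
import uuid
import requests

auth = ("admin", "password")              # placeholder credentials
db = "http://localhost:5984/purge_repro"  # throwaway test database

requests.put(db, auth=auth)               # 1. create the db

# 2. populate with data
docs = [{"_id": uuid.uuid4().hex, "value": i} for i in range(50000)]
resp = requests.post(db + "/_bulk_docs", auth=auth, json={"docs": docs}).json()
revs = {row["id"]: row["rev"] for row in resp if "rev" in row}

# 3. build a view
ddoc = {"views": {"by_value": {"map": "function (doc) { emit(doc.value, null); }"}}}
requests.put(db + "/_design/test", auth=auth, json=ddoc)
requests.get(db + "/_design/test/_view/by_value", auth=auth, params={"limit": 1})

# 4. purge the docs in batches (optionally while re-querying the view from another process)
ids = list(revs)
for i in range(0, len(ids), 100):
    batch = {doc_id: [revs[doc_id]] for doc_id in ids[i:i + 100]}
    requests.post(db + "/_purge", auth=auth, json=batch)
```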
-
I am working to try to recreate the issue, but it might be a challenge since the server was experiencing some pretty extreme conditions when we saw this. I will reply here later with my results.
-
Unfortunately, I was unable to recreate this locally. I started with a db with 400,000+ docs and then tried to simultaneously:
- Purge all the docs as fast as possible
- Spam out 10s of new views involving all the documents
- Retrieve the new views
I managed to hang my computer for hours until it crashed 😆, but did not see any errors like we have above. The best I could get were some of these:
```
[error] 2022-08-10T21:59:27.093163Z nonode@nohost <0.11471.34> 138292ce1d fabric_worker_timeout map_view,nonode@nohost,<<"shards/e0000000-ffffffff/medic.1654115212">>
[error] 2022-08-10T22:04:31.554783Z nonode@nohost <0.11718.34> 138292ce1d rexi_server: from: nonode@nohost(<0.11471.34>) mfa: fabric_rpc:map_view/5 exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,265}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,205}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,462}]},{couch_mrview,finish_fold,2,[{file,"src/couch_mrview.erl"},{line,682}]},{couch_mrview,query_view,5,[{file,"src/couch_mrview.erl"},{line,266}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,140}]}]
```
-
Thank you for trying @jkuester. That error looks like a general timeout when building a view.
The error in the log we saw originally, [error] 2022-08-02T02:02:03.736033Z couchdb@127.0.0.1 <0.12372.4503> -------- Invalid purge doc '_local/purge-mrview-4a20d1a4845f3466905fa2b1e39250f5' on <<"shards/20000000-3fffffff/medic.1630490256">> with purge_seq '0', indicates that perhaps one of these operations happened:
- The view was updated
- The view was deleted / recreated
Potentially those operations would have happened during/around the time purge requests are processed.
This log line
The purge sequence for 'shards/20000000-3fffffff/medic.1630490256' exceeds configured threshold
also points to the same thing: if the system encounters a stale purge checkpoint document (it cannot find the existing view it is associated with), it cannot advance the oldest purge sequence beyond that point, so it starts to lag. The large lag is then reported in the log as "exceeds configured threshold". We had noticed this in another user's case but couldn't track down exactly how it happens. But it does confirm that somehow a modified or re-created view may be involved.
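For what it's worth, the lag itself can be watched from the outside — a small sketch with placeholder credentials that just prints the db's purge sequence next to its purged infos limit:

```python
import requests

auth = ("admin", "password")  # placeholder credentials
base = "http://localhost:5984/medic"

info = requests.get(base, auth=auth).json()
limit = requests.get(base + "/_purged_infos_limit", auth=auth).json()

# purge_seq is an opaque string; the leading number is the cluster-wide purge sequence.
print("purge_seq:", info["purge_seq"])
print("purged infos limit:", limit)
```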
-
As an additional note, I did find this discussion with the same fabric timeout in purge_docs. TLDR is:
The performance improved significantly upon recreating the db and having a much smaller view tied to the db that was being purged. So it seems like the size of the view was causing problems
This seems interesting and perhaps related since I guess we are seeing some funky interaction between these timeouts and indexing...
-
Good find @jkuester. Yeah, it looks like in that discussion the user had encountered the same timeout as we saw above. The fix for that was already merged and will be in the next release (likely 3.3): #4143