I am trying to understand the causes behind a recent outage. Basically, Couch was swamped with traffic and most calls were returning timeout errors (even though the host server had plenty of free resources). This went on for days, and the timeouts continued even after almost all traffic stopped aside from a nightly purge script. Then, one night, things just magically started working again and the timeouts were gone. Now I am trying to piece together what happened...
Specifically, though, I am trying to understand if/how the calls to _purge might have affected the overall performance of the instance. Here is an example from the Couch logs of the indexing for a specific shard over a 24hr period. These are all of the log entries for that specific shard for the complete 24hr period:
```
[info] 2022-08-01T21:53:45.260544Z couchdb@127.0.0.1 <0.12840.4> -------- Starting index update for db: shards/20000000-3fffffff/medic.1630490256 idx: _design/medic
[error] 2022-08-02T02:02:03.733219Z couchdb@127.0.0.1 <0.12372.4503> -------- Invalid purge doc '_local/purge-mrview-2859a456772728483f6ab654a5d72d25' on <<"shards/20000000-3fffffff/medic.1630490256">> with purge_seq '0'
[error] 2022-08-02T02:02:03.734967Z couchdb@127.0.0.1 <0.12372.4503> -------- Invalid purge doc '_local/purge-mrview-2fccf76609d4d962a544d274741fe286' on <<"shards/20000000-3fffffff/medic.1630490256">> with purge_seq '0'
[error] 2022-08-02T02:02:03.736033Z couchdb@127.0.0.1 <0.12372.4503> -------- Invalid purge doc '_local/purge-mrview-4a20d1a4845f3466905fa2b1e39250f5' on <<"shards/20000000-3fffffff/medic.1630490256">> with purge_seq '0'
[error] 2022-08-02T02:02:03.737022Z couchdb@127.0.0.1 <0.12372.4503> -------- Invalid purge doc '_local/purge-mrview-4fe776f15f5867d5a32b0dbdc20852c9' on <<"shards/20000000-3fffffff/medic.1630490256">> with purge_seq '0'
in database <<"shards/20000000-3fffffff/medic.1630490256">>
in database <<"shards/20000000-3fffffff/medic.1630490256">>
[warning] 2022-08-02T02:02:03.745046Z couchdb@127.0.0.1 <0.12372.4503> -------- The purge sequence for 'shards/20000000-3fffffff/medic.1630490256' exceeds configured threshold
[info] 2022-08-02T10:28:58.120033Z couchdb@127.0.0.1 <0.12840.4> -------- Index update finished for db: shards/20000000-3fffffff/medic.1630490256 idx: _design/medic
[info] 2022-08-02T10:28:58.128140Z couchdb@127.0.0.1 <0.12840.4> -------- Starting index update for db: shards/20000000-3fffffff/medic.1630490256 idx: _design/medic
state: {state,<0.31130.2486>,[<<"shards/20000000-3fffffff/medic.1630490256">>]}
[info] 2022-08-02T21:34:22.419520Z couchdb@127.0.0.1 <0.12840.4> -------- Index update finished for db: shards/20000000-3fffffff/medic.1630490256 idx: _design/medic
```
The original message at 2022-08-01T21:53 is right in the middle of the massive batch of _purge calls that were run that night (almost all of which just ended up timing out...). You can see from the log dates that the indexing for that shard takes ~12hrs to complete. Then the indexing immediately restarts and runs for another 11hrs. Is this excessive time spent building the index for the shard caused by the "Invalid purge doc" errors? What does the "The purge sequence for ... exceeds configured threshold" warning mean? Is there some way we can increase that configuration to avoid this issue in the future?
I found some information in the docs about the purge sequence, but did not find any way to configure the limit...
Details:
- Couch 2.3.1
- Size on disk: 222G
- 2141 databases
- 5,000,000+ docs in the largest db
Thank you in advance!
-
Is it possible that somehow the purging is causing the index to have to be completely rebuilt instead of just updated incrementally?
-
The purge info limit can be updated with https://docs.couchdb.org/en/3.2.0/api/database/misc.html#put--db-_purged_infos_limit. It's per database.
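For example, something like this should bump it (a minimal sketch against the clustered port, with placeholder host/credentials and an arbitrary new limit):

```python
import requests

# Raise the purge info limit for one database (placeholder host, creds and limit).
# The endpoint takes a bare JSON integer as the request body; the default is 1000.
resp = requests.put(
    "http://localhost:5984/medic/_purged_infos_limit",
    auth=("admin", "password"),
    json=5000,
)
print(resp.status_code, resp.json())  # expect 200 and {'ok': True}
```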
I think the Invalid purge doc '_local/purge-mrview-4fe776f15f5867d5a32b0dbdc20852c9' on <<"shards/20000000-3fffffff/medic.1630490256">> with purge_seq '0' error indicates the view might have been updated and, due to a race condition, the purge doc with the old view signature was not cleaned up. If you think that 4fe776f15f5867d5a32b0dbdc20852c9 is a stale view signature, try removing the _local/purge-mrview-4fe776f15f5867d5a32b0dbdc20852c9 doc by hand if possible.
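If you do try that, something along these lines might work against the node-local ("backdoor") port — just a rough sketch, assuming CouchDB 2.x with port 5986 reachable and placeholder admin credentials:

```python
import requests
from urllib.parse import quote

# Shard name and checkpoint id are taken from the log lines above; creds are placeholders.
auth = ("admin", "password")
shard = quote("shards/20000000-3fffffff/medic.1630490256", safe="")
doc_url = f"http://localhost:5986/{shard}/_local/purge-mrview-4fe776f15f5867d5a32b0dbdc20852c9"

doc = requests.get(doc_url, auth=auth).json()  # fetch the checkpoint doc to get its _rev
resp = requests.delete(doc_url, params={"rev": doc["_rev"]}, auth=auth)
print(resp.status_code, resp.json())
```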
As for the views being rebuilt, I could not see any place where the view could be reset due to a stale purge sequence. See if there are any crashes or stack traces in the logs. Also make sure the design document is not being updated, since every update generates a new view signature and triggers a full view rebuild.
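A quick way to keep an eye on that (a small sketch with placeholder credentials) is to compare the design doc rev and the view index signature over time; if either changes, a full rebuild will follow:

```python
import requests

auth = ("admin", "password")  # placeholder credentials
ddoc = requests.get("http://localhost:5984/medic/_design/medic", auth=auth).json()
info = requests.get("http://localhost:5984/medic/_design/medic/_info", auth=auth).json()

print("design doc rev:", ddoc["_rev"])
print("view signature:", info["view_index"]["signature"])
```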
-
Thanks for the response @nickva! For better or worse, we cannot really replicate the errors above (so I cannot be sure what effect removing _local/purge-mrview-4fe776f15f5867d5a32b0dbdc20852c9 would have). Once the timeouts magically stopped, both the purging and the indexing continued normally (and without errors).
There is also a distinct lack of interesting errors in the logs. I have spent days sifting through them, and almost all the errors are just timeout errors like this:
```
[error] 2022-08-01T16:36:25.158532Z couchdb@127.0.0.1 <0.7884.3264> fe8372ae91 rexi_server: from: couchdb@127.0.0.1(<0.6708.3264>) mfa: fabric_rpc:purge_docs/3 exit:{timeout,{gen_server,call,[<0.471.0>,{purge_docs,[{<<"37f60125fa241fd853c0457bd54160d2">>,<<"78108249-21ce-4981-b94b-ae07e33e9f9d">>,[{1,<<227,169,6,230,50,105,22,71,12,15,103,190,33,175,120,216>>}]}],[{user_ctx,{user_ctx,<<"*******">>,[<<"*******">>,<<"********">>,<<"******">>],<<"********">>}},{w,"1"}]}]}} [{gen_server,call,2,[{file,"gen_server.erl"},{line,204}]},{fabric_rpc,with_db,3,[{file,"src/fabric_rpc.erl"},{line,334}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,140}]}]
[error] 2022-08-01T16:36:25.158755Z couchdb@127.0.0.1 <0.6708.3264> fe8372ae91 req_err(1108132862) case_clause : {error,[{error,internal_server_error}]}
[<<"chttpd_db:db_req/2 L567">>,<<"chttpd:handle_req_after_auth/2 L320">>,<<"chttpd:process_request/1 L303">>,<<"chttpd:handle_request_int/1 L243">>,<<"mochiweb_http:headers/6 L128">>,<<"proc_lib:init_p_do_apply/3 L247">>]
```
I do not see any updates to the design docs (no PUTs to _design). Is there any other reason you know of, besides a complete rebuild, that an index might take 12+hrs to update?
-
Oh, [error] 2022-08-01T16:36:25.158532Z couchdb@127.0.0.1 <0.7884.3264> fe8372ae91 rexi_server: from: couchdb@127.0.0.1(<0.6708.3264>) mfa: fabric_rpc:purge_docs/3 exit:{timeout,{gen_server,call,[<0.471.0>,{purge_docs, looks like a genuine error. Though I don't know if that would cause the view to reset.
-
This should fix that particular timeout #4143
And to be clear, it just seems like a separate error that we're lucky to find out about from your log report :-)
-
@jkuester would there be a way to reproduce the issue, like a script or series of steps: create db, populate with data, build view, purge, etc.?
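Something along these lines would be a great starting point — just a rough sketch with placeholder host/credentials and arbitrary doc counts, not a definitive repro:

```python
import uuid
import requests

auth = ("admin", "password")              # placeholder credentials
db = "http://localhost:5984/purge_repro"  # throwaway test database

requests.put(db, auth=auth)               # 1. create the db

# 2. populate with data
docs = [{"_id": uuid.uuid4().hex, "value": i} for i in range(50000)]
resp = requests.post(db + "/_bulk_docs", auth=auth, json={"docs": docs}).json()
revs = {row["id"]: row["rev"] for row in resp if "rev" in row}

# 3. build a view
ddoc = {"views": {"by_value": {"map": "function (doc) { emit(doc.value, null); }"}}}
requests.put(db + "/_design/test", auth=auth, json=ddoc)
requests.get(db + "/_design/test/_view/by_value", auth=auth, params={"limit": 1})

# 4. purge the docs in batches (optionally while re-querying the view from another process)
ids = list(revs)
for i in range(0, len(ids), 100):
    batch = {doc_id: [revs[doc_id]] for doc_id in ids[i:i + 100]}
    requests.post(db + "/_purge", auth=auth, json=batch)
```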
-
I am working to try to recreate the issue, but it might be a challenge since the server was experiencing some pretty extreme conditions when we saw this. I will reply here later with my results.
-
Unfortunately, I was unable to recreate this locally. I started with a db with 400,000+ docs and then tried to simultaneously:
- Purge all the docs as fast as possible
- Spam out 10s of new views involving all the documents
- Retrieve the new views
I managed to hang my computer for hours until it crashed 😆, but did not see any errors like we have above. The best I could get were some of these:
```
[error] 2022-08-10T21:59:27.093163Z nonode@nohost <0.11471.34> 138292ce1d fabric_worker_timeout map_view,nonode@nohost,<<"shards/e0000000-ffffffff/medic.1654115212">>
[error] 2022-08-10T22:04:31.554783Z nonode@nohost <0.11718.34> 138292ce1d rexi_server: from: nonode@nohost(<0.11471.34>) mfa: fabric_rpc:map_view/5 exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,265}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,205}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,462}]},{couch_mrview,finish_fold,2,[{file,"src/couch_mrview.erl"},{line,682}]},{couch_mrview,query_view,5,[{file,"src/couch_mrview.erl"},{line,266}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,140}]}]
```
-
Thank you for trying @jkuester. That error looks like a general timeout when building a view.
The error in the log we saw originally, [error] 2022-08-02T02:02:03.736033Z couchdb@127.0.0.1 <0.12372.4503> -------- Invalid purge doc '_local/purge-mrview-4a20d1a4845f3466905fa2b1e39250f5' on <<"shards/20000000-3fffffff/medic.1630490256">> with purge_seq '0', indicates that perhaps one of these operations happened:
- The view was updated
- The view was deleted / recreated
Potentially those operations would have happened during/around the time purge requests are processed.
This log line
The purge sequence for 'shards/20000000-3fffffff/medic.1630490256' exceeds configured threshold
also points to the same thing: if the system encounters a stale purge checkpoint document (it cannot find the existing view it is associated with), it cannot advance the oldest purge sequence beyond that point, so it starts to lag. The large lag is then reported in the log as "exceeds configured threshold". We had noticed this in another user's case but couldn't track down exactly how it happens. But it does confirm that somehow a modified or re-created view may be involved.
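For what it's worth, the lag itself can be watched from the outside — a small sketch with placeholder credentials that just prints the db's purge sequence next to its purged infos limit:

```python
import requests

auth = ("admin", "password")  # placeholder credentials
base = "http://localhost:5984/medic"

info = requests.get(base, auth=auth).json()
limit = requests.get(base + "/_purged_infos_limit", auth=auth).json()

# purge_seq is an opaque string; the leading number is the cluster-wide purge sequence.
print("purge_seq:", info["purge_seq"])
print("purged infos limit:", limit)
```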
-
As an additional note, I did find this discussion with the same fabric timeout in purge_docs. TLDR is:
The performance improved significantly upon recreating the db and having a much smaller view tied to the db that was being purged. So it seems like the size of the view was causing problems
This seems interesting and perhaps related since I guess we are seeing some funky interaction between these timeouts and indexing...
-
Good find @jkuester. Yeah, it looks like in that discussion the user had encountered the same timeout as we saw above. The fix for that was already merged and will be in the next release (likely 3.3): #4143