Since the initial question Increasing binary search for mismatch in sorted range contained defects in the code and could confuse readers, here is an updated version of the question.
Is the code below correct and efficiently implemented for the task described? Can it be improved by reusing other C++ Standard Library algorithms, without pulling in additional libraries?
Functional specification
Given a very large vector (gigabytes) of sorted, repeated values whose run lengths follow an exponential distribution (roughly 50% of values occur only once, 25% exactly twice in a row, 12.5% three times in a row, and so on), provide a function that returns an iterator to the next value in the std::vector that differs from the current one.
The algorithm is going to be used with many predicates for the same vector of data.
Performance is key, both in terms of CPU load and memory bandwidth. The actual values in the vector are not integers, and comparison is a time-consuming operation, so plain std::upper_bound is not effective here: it performs about log2(n) comparisons, while the element we need is usually very close to the current one.
Design
The limited memory bandwidth, together with the need to run many predicates over the same data, prevents us from pre-calculating offsets to the needed positions.
Given the value frequency distribution, it is more effective to start with a search range of 1 and double it on every step; this gives about a 30% speed-up over pure std::upper_bound usage.
The code
The fully functional demo:
#include <algorithm>
#include <functional>
#include <iostream>
#include <vector>

// The algorithm works similarly to std::upper_bound, but uses the element pointed to
// by *first* to find the first element in the range that is not equal to it.
template<class RandomIt, class Pred = std::less<>>
constexpr RandomIt incremental_upper_bound(RandomIt first, RandomIt last, Pred comp = {}, size_t starting_search_range = 1)
{
    if (first == last)
        return last;
    const auto current = *first++;
    size_t short_search_range = starting_search_range;
    while (first != last) {
        const auto short_search_end = (size_t)std::distance(first, last) < short_search_range
            ? last
            : std::next(first, short_search_range);
        first = std::upper_bound(first, short_search_end, current, comp);
        if (first == last || comp(current, *first))
            return first;
        short_search_range *= 2;
    }
    return last;
}
int main() {
    std::vector<int> v = { 1, 2, 2, 3, 4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 7 };
    for (auto it = v.begin(); it != v.end(); ) {
        auto it_next = incremental_upper_bound(it, v.end());
        std::cout << "Number of elements with value " << *it << " is " << it_next - it << "\n";
        // Some work on the range
        it = it_next;
    }
    return 0;
}
Measurements
Following G. Sliepen's advice, I measured both the number of comparisons and the running time. To show the trends, I tested on tiny, medium, and large texts.
Suffix arrays for real texts have a peculiarity: the next distinct element is usually found very quickly (in this example, string comparison is limited to 9 lexems), so the real offset statistics look like this:
Offset | Amount in data | % |
---|---|---|
1 | 463682831 | 98.49% |
2 | 6519364 | 1.38% |
3 | 400000 | 0.08% |
4 | 124922 | 0.03% |
5 | 18276 | 0.00% |
6 | 8421 | 0.00% |
7 | 5086 | 0.00% |
8 | 2802 | 0.00% |
9 | 1941 | 0.00% |
10 | 1727 | 0.00% |
Here are the results:
| | std::upper_bound | incremental_upper_bound |
|---|---|---|
| Tiny text (text size: 363 lexems; dictionary size: 153 lexems) | | |
| Amount of comparisons (predicate calls) | 1943 | 551 |
| Medium size text (text size: 69 910 995 lexems; dictionary size: 878 993 lexems) | | |
| Amount of comparisons (predicate calls) | 1 428 018 106 | 115 013 024 |
| Running time, seconds | 33 | 7 |
| Large text (text size: 583 654 363 lexems; dictionary size: 2 791 370 lexems) | | |
| Amount of comparisons (predicate calls) | 13 122 916 104 | 943 337 060 |
| Running time, seconds | 436 | 79 |
Conclusion
The results are as expected. Consider the last case: for the index array, std::upper_bound needs about 28 comparisons on average to find the answer, while incremental_upper_bound needs about two, so the ratio of comparison counts is roughly 14.
Of course, it is the specifics of the data that make the improvement so large: when the next unique element is usually only one or two positions away from the current one, the situation differs from other datasets. Still, on large data std::upper_bound is memory-bound and not cache-friendly, and making huge jumps can be quite expensive for a task like "searching for the next element not equal to the current one". incremental_upper_bound keeps the work local and increases the benefit of caching.
1 Answer
current should be a reference

If comparisons are time-consuming, then the values you are comparing are probably non-trivial. Your declaration of current makes a copy of a value, which can be expensive, or perhaps even impossible if the container's value type is non-copyable. Making current a reference solves this.
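A minimal sketch of the suggested change (only this declaration changes; the rest of the function stays as posted). The reference remains valid here because the range is not modified while the search runs:

```
// Before: copies the pivot element on every call.
const auto current = *first++;

// After: binds a reference to the pivot element instead of copying it.
const auto& current = *first;
++first;
```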
Measure how many comparisons it actually takes

According to cppreference.com, std::upper_bound() uses at most \$\log_2(last - first) + O(1)\$ comparisons. However, the actual number of comparisons depends on the exact implementation of the standard library, and on where in the range you pass it the upper bound really is. For example, there is no guarantee that it does its first comparison with the last element in the range.

So I would create a class with a comparison operator that counts how often it is called. Then you can call your incremental_upper_bound() on a vector with the desired distribution, and check how many comparisons it takes. Then you can also check whether different ways to initialize and update short_search_range are better or worse.
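For example, a counting comparator could look roughly like this (a sketch; the names CountingLess and comparisons are made up for illustration, and it assumes the comparator parameter of incremental_upper_bound accepts any callable):

```
#include <cstddef>
#include <functional>

// Wraps an ordering predicate and counts how many times it is invoked.
template<class Comp = std::less<>>
struct CountingLess {
    Comp comp{};
    std::size_t* comparisons = nullptr;  // external counter, shared across copies

    template<class T, class U>
    bool operator()(const T& a, const U& b) const {
        ++*comparisons;
        return comp(a, b);
    }
};
```

Usage would then be something like `std::size_t n = 0; auto next = incremental_upper_bound(it, v.end(), CountingLess<>{{}, &n});`, after which `n` holds the number of predicate calls for that search.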
It might also make sense to avoid using std::upper_bound() and do the search yourself. That way you have full control over when comparisons are made.
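If you do roll your own, a hand-written "gallop, then binary search" could look roughly like this. This is only a sketch under the same preconditions as the posted function (sorted range, random-access iterators); the name galloping_upper_bound is made up, and the exact probing policy is a tuning knob rather than a recommendation:

```
#include <cstddef>
#include <functional>
#include <iterator>

// Returns an iterator to the first element greater than *first (or last).
// Every comparison is written out explicitly, so it is easy to see and
// count exactly where predicate calls happen.
template<class RandomIt, class Pred = std::less<>>
RandomIt galloping_upper_bound(RandomIt first, RandomIt last, Pred comp = {})
{
    if (first == last)
        return last;
    const auto& current = *first;
    const std::size_t size = (std::size_t)std::distance(first, last);

    // Gallop: probe offsets 1, 2, 4, 8, ... until we pass the end of the
    // run of elements equal to current, or run off the end of the range.
    std::size_t lo = 0;   // first[lo] is known to be not greater than current
    std::size_t hi = 1;   // candidate position for the first greater element
    while (hi < size && !comp(current, first[hi])) {
        lo = hi;
        hi *= 2;
    }
    if (hi > size)
        hi = size;        // past-the-end counts as "greater"

    // Binary search in (lo, hi]: narrow down to the first "greater" index.
    while (hi - lo > 1) {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (comp(current, first[mid]))
            hi = mid;
        else
            lo = mid;
    }
    return first + hi;
}
```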
- On making current a reference: in my case the vector stores int values that represent offsets into a suffix array, which is used to compare the pointed-to substrings; that's why comparison is time-consuming. At the same time, the value stored in the vector is a plain unsigned int (not size_t, to save memory, since so far the test sets don't exceed 4 GB). So at the moment current is copyable and copying is not time-consuming, but since I plan to make the algorithm generic, it is better to reduce the requirements on the data; so a reference is better and the optimizer will do its work on caching. Is this understanding correct? – Damir Tenishev, Jan 28, 2024 at 18:32
- On the amount of comparisons: I measured on a 3 GB text, and with my algorithm the function that uses it to find equal strings works 30% faster. I'm not sure how much faster the algorithm itself is; maybe even more, if the other processing in that function takes time. Wouldn't it be "too much" or "over-focusing" to make extra measurements? Is it worth avoiding the reuse of std::upper_bound when it works fast enough with my addition (on my data)? I remember how many years it took programmers (back in the 1960s-70s) to get the first correct version of a binary search function... – Damir Tenishev, Jan 28, 2024 at 19:19
- A reference is for sure better for large types. And since the call to std::upper_bound() will very likely be inlined, the compiler can optimize things and effectively pass current to it by value. The issue with std::upper_bound() is that there is no guarantee what value it will pick for its first comparison. The first element in the range? Something in the middle? Maybe the last element? What would be better for your situation? You like the 30% increase in efficiency, but what if it could be 50%? – G. Sliepen, Jan 28, 2024 at 20:40
- I can't think of an implementation that could work differently from pure binary search for random-access iterators; I just checked the VS 2022 implementation, and it runs a classic binary search. And in the vast majority of cases the range for std::upper_bound will be way too small for any data-specific optimizations. Anyway, I will do research here as homework for practice. (And, of course, it doesn't start with a comparison against "the last element in the range"; it starts with the element in the middle, to decide whether to continue the search in the first or the second part of the range.) – Damir Tenishev, Jan 28, 2024 at 21:16
- The Measurements and Conclusion sections have been added; please see the results. In my case the data specifics show directly that I should start with starting_search_range = 1. The multiplier is still an open question; it could be tuned based on data specifics, but that would take significant work for less benefit. So, since this isn't low-hanging fruit, let's stay with 2. – Damir Tenishev, Feb 2, 2024 at 13:58
- The main example suggests an ordered RLE data structure of either (val, len) or (val, index) tuples. If your (unseen) values are K-tuples, with K predicates defined on them, then just produce K data structures to assist with navigating a DB cursor over them. This is what happens when you CREATE TABLE with K columns and then issue K CREATE INDEX commands, in order to support rapid GROUP BY queries.