Since the initial question Increasing binary search for mismatch in sorted range contained defects in the code and could confuse readers, here is an updated version of the question.
Is the code below correct and efficiently implemented for the task described? Can it be improved by reusing other C++ Standard Library algorithms, without pulling in additional libraries?
Functional specification
Given a very large vector (gigabytes) of sorted, repeated values whose run lengths follow an exponential distribution (roughly 50% of values occur only once, 25% exactly twice in a row, 12.5% three times in a row, and so on), provide a function that returns an iterator to the next value in the std::vector that differs from the current one.
The algorithm is going to be used with many predicates for the same vector of data.
Performance is key, both in terms of CPU load and memory bandwidth. The actual values in the vector are not integers, and comparison is a time-consuming operation, so plain std::upper_bound is not effective here: it performs about log2(n) comparisons, while the element we need is usually very close to the current one.
Design
The limited memory bandwidth, together with the need to run many predicates over the same data, prevents us from pre-calculating offsets to the needed positions.
Given the value frequency distribution, it is more effective to start with a search range of 1 and double it on every step; this gives about a 30% speed-up over pure std::upper_bound usage.
The code
The fully functional demo:
#include <algorithm>
#include <functional>
#include <iostream>
#include <vector>

// The algorithm works similarly to std::upper_bound, but uses the element pointed to
// by *first* to find the first element in the range that is not equal to it.
template<class RandomIt, class Pred = std::less<>>
constexpr RandomIt incremental_upper_bound(RandomIt first, RandomIt last, Pred comp = {}, size_t starting_search_range = 1)
{
    if (first == last)
        return last;
    const auto current = *first++;
    size_t short_search_range = starting_search_range;
    while (first != last) {
        const auto short_search_end = (size_t)std::distance(first, last) < short_search_range
            ? last
            : std::next(first, short_search_range);
        first = std::upper_bound(first, short_search_end, current, comp);
        if (first == last || comp(current, *first))
            return first;
        short_search_range *= 2;
    }
    return last;
}
int main() {
    std::vector<int> v = { 1, 2, 2, 3, 4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 7 };
    for (auto it = v.begin(); it != v.end(); ) {
        auto it_next = incremental_upper_bound(it, v.end());
        std::cout << "Number of elements with value " << *it << " is " << it_next - it << "\n";
        // Some work on the range
        it = it_next;
    }
    return 0;
}
Measurements
Following G. Sliepen's advice, I measured both the number of comparisons and the running time. To show the trends, I tested on tiny, medium, and large texts.
Suffix arrays for real texts have a peculiarity: the next distinct element is usually found very quickly (in this example, string comparison is limited to 9 lexems), so the real offset statistics look like this:
Offset | Amount in data | % |
---|---|---|
1 | 463682831 | 98.49% |
2 | 6519364 | 1.38% |
3 | 400000 | 0.08% |
4 | 124922 | 0.03% |
5 | 18276 | 0.00% |
6 | 8421 | 0.00% |
7 | 5086 | 0.00% |
8 | 2802 | 0.00% |
9 | 1941 | 0.00% |
10 | 1727 | 0.00% |
Here are the results:
| | std::upper_bound | incremental_upper_bound |
|---|---|---|
| Tiny text (text size: 363 lexems; dictionary size: 153 lexems) | | |
| Amount of comparisons (predicate calls) | 1943 | 551 |
| Medium size text (text size: 69 910 995 lexems; dictionary size: 878 993 lexems) | | |
| Amount of comparisons (predicate calls) | 1 428 018 106 | 115 013 024 |
| Running time, seconds | 33 | 7 |
| Large text (text size: 583 654 363 lexems; dictionary size: 2 791 370 lexems) | | |
| Amount of comparisons (predicate calls) | 13 122 916 104 | 943 337 060 |
| Running time, seconds | 436 | 79 |
Conclusion
The results are as expected. Consider the last case: for the index array, std::upper_bound needs about 28 comparisons on average to find the answer, while incremental_upper_bound needs about two, so the ratio of comparison counts is roughly 14.
Of course, it is the specifics of the data that make the improvement so large: when the next unique element is usually only one or two positions away from the current one, the situation differs from other datasets. Still, on large data std::upper_bound is memory-bound and not cache-friendly, and making huge jumps can be quite expensive for a task like "searching for the next element not equal to the current one". incremental_upper_bound keeps the work local and increases the benefit of caching.
1 Answer
current should be a reference

If comparisons are time-consuming, then the values you are comparing are probably non-trivial. Your declaration of current makes a copy of a value, which can be expensive, or perhaps even impossible if the container's value type is non-copyable. Making current a reference solves this.
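A minimal sketch of the suggested change (only this declaration changes; the rest of the function stays as posted). The reference remains valid here because the range is not modified while the search runs:

```
// Before: copies the pivot element on every call.
const auto current = *first++;

// After: binds a reference to the pivot element instead of copying it.
const auto& current = *first;
++first;
```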
Measure how many comparisons it actually takes

According to cppreference.com, std::upper_bound() uses at most \$\log_2(last - first) + O(1)\$ comparisons. However, the actual number of comparisons depends on the exact implementation of the standard library, and on where in the range you pass it the upper bound really is. For example, there is no guarantee that it does its first comparison with the last element in the range.

So I would create a class with a comparison operator that counts how often it is called. Then you can call your incremental_upper_bound() on a vector with the desired distribution, and check how many comparisons it takes. Then you can also check whether different ways to initialize and update short_search_range are better or worse.
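For example, a counting comparator could look roughly like this (a sketch; the names CountingLess and comparisons are made up for illustration, and it assumes the comparator parameter of incremental_upper_bound accepts any callable):

```
#include <cstddef>
#include <functional>

// Wraps an ordering predicate and counts how many times it is invoked.
template<class Comp = std::less<>>
struct CountingLess {
    Comp comp{};
    std::size_t* comparisons = nullptr;  // external counter, shared across copies

    template<class T, class U>
    bool operator()(const T& a, const U& b) const {
        ++*comparisons;
        return comp(a, b);
    }
};
```

Usage would then be something like `std::size_t n = 0; auto next = incremental_upper_bound(it, v.end(), CountingLess<>{{}, &n});`, after which `n` holds the number of predicate calls for that search.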
It might also make sense to avoid using std::upper_bound() and do the search yourself. That way you have full control over when comparisons are made.
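If you do roll your own, a hand-written "gallop, then binary search" could look roughly like this. This is only a sketch under the same preconditions as the posted function (sorted range, random-access iterators); the name galloping_upper_bound is made up, and the exact probing policy is a tuning knob rather than a recommendation:

```
#include <cstddef>
#include <functional>
#include <iterator>

// Returns an iterator to the first element greater than *first (or last).
// Every comparison is written out explicitly, so it is easy to see and
// count exactly where predicate calls happen.
template<class RandomIt, class Pred = std::less<>>
RandomIt galloping_upper_bound(RandomIt first, RandomIt last, Pred comp = {})
{
    if (first == last)
        return last;
    const auto& current = *first;
    const std::size_t size = (std::size_t)std::distance(first, last);

    // Gallop: probe offsets 1, 2, 4, 8, ... until we pass the end of the
    // run of elements equal to current, or run off the end of the range.
    std::size_t lo = 0;   // first[lo] is known to be not greater than current
    std::size_t hi = 1;   // candidate position for the first greater element
    while (hi < size && !comp(current, first[hi])) {
        lo = hi;
        hi *= 2;
    }
    if (hi > size)
        hi = size;        // past-the-end counts as "greater"

    // Binary search in (lo, hi]: narrow down to the first "greater" index.
    while (hi - lo > 1) {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (comp(current, first[mid]))
            hi = mid;
        else
            lo = mid;
    }
    return first + hi;
}
```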
- On making current a reference: in my case the vector stores int values that represent offsets into a suffix array, which is used to compare the pointed-to substrings; that's why comparison is time-consuming. At the same time, the value stored in the vector is a plain unsigned int (not size_t, to save memory, since so far the test sets don't exceed 4 GB). So at the moment current is copyable and copying is not time-consuming, but since I plan to make the algorithm generic, it is better to reduce the requirements on the data; so a reference is better and the optimizer will do its work on caching. Is this understanding correct? – Damir Tenishev, Jan 28, 2024 at 18:32
- On the amount of comparisons: I measured on a 3 GB text, and with my algorithm the function that uses it to find equal strings works 30% faster. I'm not sure how much faster the algorithm itself is; maybe even more, if the other processing in that function takes time. Wouldn't it be "too much" or "over-focusing" to make extra measurements? Is it worth avoiding the reuse of std::upper_bound when it works fast enough with my addition (on my data)? I remember how many years it took programmers (back in the 1960s-70s) to get the first correct version of a binary search function... – Damir Tenishev, Jan 28, 2024 at 19:19
- A reference is for sure better for large types. And since the call to std::upper_bound() will very likely be inlined, the compiler can optimize things and effectively pass current to it by value. The issue with std::upper_bound() is that there is no guarantee what value it will pick for its first comparison. The first element in the range? Something in the middle? Maybe the last element? What would be better for your situation? You like the 30% increase in efficiency, but what if it could be 50%? – G. Sliepen, Jan 28, 2024 at 20:40
- I can't think of an implementation that could work differently from pure binary search for random-access iterators; I just checked the VS 2022 implementation, and it runs a classic binary search. And in the vast majority of cases the range for std::upper_bound will be way too small for any data-specific optimizations. Anyway, I will do research here as homework for practice. (And, of course, it doesn't start with a comparison against "the last element in the range"; it starts with the element in the middle, to decide whether to continue the search in the first or the second part of the range.) – Damir Tenishev, Jan 28, 2024 at 21:16
- The Measurements and Conclusion sections have been added; please see the results. In my case the data specifics show directly that I should start with starting_search_range = 1. The multiplier is still an open question; it could be tuned based on data specifics, but that would take significant work for less benefit. So, since this isn't low-hanging fruit, let's stay with 2. – Damir Tenishev, Feb 2, 2024 at 13:58
- The main example suggests an ordered RLE data structure of either (val, len) or (val, index) tuples. If your (unseen) values are K-tuples, with K predicates defined on them, then just produce K data structures to assist with navigating a DB cursor over them. This is what happens when you CREATE TABLE with K columns and then issue K CREATE INDEX commands, in order to support rapid GROUP BY queries.