More than once I have claimed that using binary search does not improve the performance of insertion sort (for example, see my answer here and the comments here). Now I have time to substantiate my claim.
The only practical application of insertion sort where we actually care about performance is sorting almost sorted data, that is, data where each element is within a fixed distance of its final position. Only this scenario is benchmarked.
First, the implementations of the insertion sorts (insertion_sort.h):
#include <algorithm>

template<typename It>
void straight_insertion_sort(It first, It last) {
    for (auto cur = first + 1; cur < last; ++cur) {
        auto val = *cur;
        auto it = cur;
        if (val < *first) {
            // new minimum: shift the entire sorted prefix one slot right
            for (it = cur; it > first; --it) {
                *it = *(it - 1);
            }
        } else {
            // *first is not greater than val, so it acts as a sentinel:
            // the inner loop needs no explicit bounds check
            for (it = cur; val < *(it - 1); --it) {
                *it = *(it - 1);
            }
        }
        *it = val;
    }
}
template<typename It>
void binary_insertion_sort(It first, It last) {
    for (auto cur = first + 1; cur < last; ++cur) {
        auto val = *cur;
        auto insertion_point = std::lower_bound(first, cur - 1, *cur);
        std::copy_backward(insertion_point, cur - 1, cur);
        *insertion_point = val;
    }
}
The benchmarks shall run against an almost sorted collection. This is how the test cases are prepared (incomplete_qsort.h; the code is adapted from the std::partition example). The cutoff is added to make the array only almost sorted: after a call to incomplete_qsort, every element is at most cutoff away from where it is supposed to be. NB: this is not really up for review, but included for completeness.
Note: I need C++14 here; C++11 does not allow auto as a lambda parameter.
#include <algorithm>

template<typename It>
void incomplete_qsort(It first, It last, size_t cutoff) {
    // stop recursing on ranges smaller than the cutoff: they stay
    // unsorted internally but are already in their final region
    if (static_cast<size_t>(std::distance(first, last)) < cutoff) {
        return;
    }
    auto pivot = *first;
    auto mid1 = std::partition(first, last,
        [pivot](const auto& em) { return em < pivot; });
    auto mid2 = std::partition(mid1, last,
        [pivot](const auto& em) { return !(pivot < em); });
    incomplete_qsort(first, mid1, cutoff);
    incomplete_qsort(mid2, last, cutoff);
}
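For illustration, a quick check of the claimed property (a sketch, separate from the benchmark and not part of the original files): since the data is a permutation of 0..n−1, each value is also its own sorted index, so the displacement of element i is just |v[i] − i|.

// Sanity check (sketch): after incomplete_qsort, every element of a
// permutation of 0..n-1 should be at most `cutoff` away from its slot.
#include "incomplete_qsort.h"
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <numeric>
#include <random>
#include <vector>

int main() {
    std::vector<int> v(1 << 12);
    std::iota(v.begin(), v.end(), 0);
    std::mt19937 g(std::random_device{}());
    std::shuffle(v.begin(), v.end(), g);

    const size_t cutoff = 16;
    incomplete_qsort(v.begin(), v.end(), cutoff);
    for (int i = 0; i < static_cast<int>(v.size()); ++i) {
        // v[i] equals this element's position after a full sort
        assert(std::abs(v[i] - i) <= static_cast<int>(cutoff));
    }
}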
This is the driver (benchmark.cpp):
#include "incomplete_qsort.h"
#include "insertion_sort.h"
#include <chrono>
#include <iostream>
#include <iomanip>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>
using iter = std::vector<int>::iterator;
using sorter = void (*)(iter, iter);
double run_benchmark(std::vector<int>& data, sorter s) {
auto start = std::chrono::system_clock::now();
s(data.begin(), data.end());
auto end = std::chrono::system_clock::now();
std::chrono::duration<double> diff = end - start;
return diff.count();
}
int main(int argc, char ** argv)
{
std::random_device rd;
std::mt19937 g(rd());
for (int i = 12; i < 25; i++) {
auto size = 1 << i;
std::vector<int> data1(size);
std::vector<int> data2(size);
std::iota(data1.begin(), data1.end(), 0);
std::shuffle(data1.begin(), data1.end(), g);
incomplete_qsort(data1.begin(), data1.end(), 16);
std::copy(data1.begin(), data1.end(), data2.begin());
double duration1 = run_benchmark(data1, straight_insertion_sort);
double duration2 = run_benchmark(data2, binary_insertion_sort);
std::cout << std::setw(8) << size << ": "
<< std::setw(8) << duration1
<< std::setw(8) << duration2
<< " (" << duration2 / duration1 << ")"
<< '\n';
}
}
And finally the results, compiled with -O3 (columns: size, straight time in seconds, binary time in seconds, and the binary/straight ratio):
4096: 5.2e-05 0.000158 (3.03846)
8192: 9.1e-05 0.000269 (2.95604)
16384: 0.000161 0.000494 (3.06832)
32768: 0.000275 0.000968 (3.52)
65536: 0.000555 0.001823 (3.28468)
131072: 0.001171 0.003686 (3.14774)
262144: 0.002084 0.007765 (3.72601)
524288: 0.004457 0.015087 (3.38501)
1048576: 0.008304 0.030951 (3.72724)
2097152: 0.017204 0.063931 (3.71605)
4194304: 0.033697 0.132659 (3.93682)
8388608: 0.06833 0.277166 (4.05629)
16777216: 0.136164 0.569059 (4.17922)
2 Answers
Your initial claim sounds about right to me: in each iteration, the straight version checks at most cutoff elements to find the insertion_point (due to the restriction on the input), which should become increasingly faster than checking logarithmically many in the binary version. Of course there is a lot more to consider, like cache locality, but computational complexity should be the dominating factor in this case. That being said, I see some potential to improve your benchmark.
Benchmarking
Verify that your implementations are correct
A test suite would of course be best practice, but the absolute minimum is to make sure that your algorithms return the same result as std::sort. The binary insertion sort you provided has an off-by-one error, thus rendering your results useless. For the following two lines, the fix is to increase both end iterators by one:
auto insertion_point = std::lower_bound(first, cur, *cur);
std::copy_backward(insertion_point, cur, cur + 1);
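A minimal check of this kind (a sketch of mine, not code from the question) compares each sorter's output against std::sort on the same input; with the original binary_insertion_sort, the assertion should fire almost immediately.

// Minimal correctness check (sketch): sort a copy with std::sort
// and compare the results.
#include <algorithm>
#include <cassert>
#include <vector>

template <typename Sorter>
void check_sorter(std::vector<int> data, Sorter s) {
    auto expected = data;                        // data is taken by value,
    std::sort(expected.begin(), expected.end()); // so both copies are independent
    s(data.begin(), data.end());
    assert(data == expected);
}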
Choose a suitable baseline
Without any generally accepted baseline for the runtime of the algorithms, it is hard to argue whether the results are significant in any way. Again, std::sort does the job.
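Concretely (a sketch; the wrapper name is made up, and it reuses the iter and sorter aliases from benchmark.cpp), since run_benchmark takes a plain function pointer, std::sort needs a small wrapper to serve as the baseline:

// Wrapper matching the `sorter` function-pointer type from the driver;
// taking the address of std::sort directly is not portable.
void std_sort_baseline(iter first, iter last) {
    std::sort(first, last);
}

// in main, on a third copy of the input:
//   double duration0 = run_benchmark(data0, std_sort_baseline);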
Test against (somewhat) equally optimized implementations
I am not an expert in optimization, but I managed to shave off about 30% of the runtime of the binary version by adding an early continue and using std::upper_bound instead of std::lower_bound, both of which indirectly happen in your straight version:
for (auto cur = first + 1; cur < last; ++cur) {
    if (*(cur - 1) < *cur) { continue; }  // already in place
    auto val = *cur;
    auto insertion_point = std::upper_bound(first, cur, *cur);
    std::copy_backward(insertion_point, cur, cur + 1);
    *insertion_point = val;
}
The change from std::lower_bound to std::upper_bound does not change much due to the input format, which leads us to the next section.
Use realistic data
In your benchmark, you simply shuffle the numbers from 0 to n−1 and partially sort them again, which means there are no duplicates in the input. This is a rather strict constraint and probably allows for even more optimized algorithms (e.g. bucket sort). An input vector where each element is drawn from a chosen probability distribution (and then again partially sorted) should yield more representative results, for example as sketched below.
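A sketch of such an input generator (my addition; the distribution and its parameters are arbitrary choices, not taken from the question):

// Sketch: random input with duplicates instead of a shuffled permutation.
// Dividing the value range by 4 yields roughly four duplicates per value.
#include <random>
#include <vector>

std::vector<int> make_input(size_t size, std::mt19937& g) {
    std::uniform_int_distribution<int> dist(0, static_cast<int>(size / 4));
    std::vector<int> data(size);
    for (auto& x : data) {
        x = dist(g);
    }
    return data;  // then partially sort with incomplete_qsort as before
}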
Additionally, you should always put some thought into the type of elements you are sorting: for int, copying is fine, but for larger classes the benchmark needs to be adapted to use std::move.
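A move-aware variant of the straight version could look like this (a sketch only: it trades the sentinel trick for an explicit bounds check to keep the example short, and is not benchmarked here):

#include <utility>  // std::move

// Sketch of a move-aware insertion step: elements are shifted with
// std::move instead of copy assignment, which matters for heavy types.
template <typename It>
void straight_insertion_sort_move(It first, It last) {
    for (auto cur = first + 1; cur < last; ++cur) {
        auto val = std::move(*cur);
        auto it = cur;
        for (; it > first && val < *(it - 1); --it) {
            *it = std::move(*(it - 1));
        }
        *it = std::move(val);
    }
}

In the binary version, std::copy_backward would correspondingly become std::move_backward.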
Run tests multiple times
This is especially important for micro-optimizations, i.e. small size in our case, and the reason why microbenchmark support libraries like google/benchmark exist. If you're not willing to put up with the hassle of integrating it into your project, quick-bench.com allows for easy online benchmarking.
I quickly threw together an example using your code and the fixed algorithm, you can find it here.
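Without pulling in a library, a crude fallback (a sketch, reusing run_benchmark and sorter from the driver) is to repeat each measurement on a fresh copy of the input and keep the minimum:

#include <algorithm>
#include <limits>

// Repeat each measurement (sketch): sorting mutates its input, so every
// run needs a fresh copy; the minimum filters out most scheduling noise.
double run_benchmark_repeated(const std::vector<int>& input, sorter s, int runs = 10) {
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < runs; ++r) {
        auto data = input;
        best = std::min(best, run_benchmark(data, s));
    }
    return best;
}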
Specify your compiler version and hardware
This is not as relevant for proving a general point, but of course results will differ when using compilers of different maturity (or even your own home-crafted one). Here, websites like quick-bench come in handy again.
Code quality
Naming
As mentioned by others, duration1 and duration2 as well as data1 and data2 are quite unhelpful. Also, iterators are usually named begin and end instead of first and last. Other than that, your naming is expressive enough.
Creating the input vector
You initialize two vectors of the needed size, thus default-initializing all the elements. Then you fill the first one and copy the partially sorted result over to the other. Preferably, one would reserve space in an empty vector and then use a custom function like iota_n (example, see the sketch below) to back-insert all the elements. Once they are shuffled and partially sorted, simply use
auto data_copy = initial_data;
instead of calling std::copy.
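A minimal iota_n in the spirit of the linked example (my sketch, not the linked code) could be:

// Sketch of iota_n: back-insert n consecutive values into an empty,
// reserved vector, avoiding the default-initialization of a presized one.
#include <iterator>  // std::back_inserter (for the usage below)

template <typename OutputIt, typename Size, typename T>
OutputIt iota_n(OutputIt out, Size n, T value) {
    for (Size i = 0; i < n; ++i) {
        *out++ = value++;
    }
    return out;
}

// usage:
//   std::vector<int> data;
//   data.reserve(size);
//   iota_n(std::back_inserter(data), size, 0);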
Also, you included <iostream> twice.
Insertion sort
Whereas binary_insertion_sort is readable and reasonably easy to grasp, straight_insertion_sort took me a while longer. The if-branch can only trigger near the beginning of the range for this almost sorted input (once the overall minimum has reached the front, no later element can compare less than *first) and does nothing but catch an edge case. It can be simplified to
for (auto cur = first + 1; cur < last; ++cur) {
    if (*cur < *first) {
        // new minimum: move *cur to the front in one step
        std::rotate(first, cur, cur + 1);
    } else {
        auto val = *cur;
        auto it = cur;
        for (it = cur; val < *(it - 1); --it) {
            *it = *(it - 1);
        }
        *it = val;
    }
}
which actually seems to be a bit faster. I tried making the else-branch more readable while preserving speed by using std::rotate once more, but failed to do so.
For both algorithms, you use < to compare iterators, where usually != is used; see this SO thread. It doesn't make any difference speed-wise.
Comments:
- "Welcome to Code Review. This is a very good first contribution!" – G. Sliepen, May 17, 2020
- "std::is_sorted() would be ideal for the verify improvement." – Toby Speight, Feb 3, 2024
Naming
As pointed out in the comments, duration1 and duration2 are bad names since they lead to confusion. duration_straight and duration_binary seem to be a better choice.