More than once I have claimed that using binary search does not improve the performance of insertion sort (for example, see my answer here and the comments here). Now I have time to substantiate my claim.
The only practical application of insertion sort where we actually care about performance is sorting almost sorted data, that is, data where each element is within a fixed distance of its final position. Only this scenario is benchmarked.
First, the implementations of the insertion sorts (insertion_sort.h):
#include <algorithm>

template<typename It>
void straight_insertion_sort(It first, It last) {
    for (auto cur = first + 1; cur < last; ++cur) {
        auto val = *cur;
        auto it = cur;
        if (val < *first) {
            // new minimum: shift the entire sorted prefix one slot right
            for (it = cur; it > first; --it) {
                *it = *(it - 1);
            }
        } else {
            // *first is not greater than val, so it acts as a sentinel:
            // the inner loop needs no explicit bounds check
            for (it = cur; val < *(it - 1); --it) {
                *it = *(it - 1);
            }
        }
        *it = val;
    }
}
template<typename It>
void binary_insertion_sort(It first, It last) {
    for (auto cur = first + 1; cur < last; ++cur) {
        auto val = *cur;
        auto insertion_point = std::lower_bound(first, cur - 1, *cur);
        std::copy_backward(insertion_point, cur - 1, cur);
        *insertion_point = val;
    }
}
The benchmarks shall run against an almost sorted collection. This is how the test cases are prepared (incomplete_qsort.h; the code is adapted from the std::partition example). The cutoff is added to make the array only almost sorted: after a call to incomplete_qsort, every element is at most cutoff away from where it is supposed to be. NB: this is not really up for review, but included for completeness.
Note: I need C++14 here; C++11 does not allow auto as a lambda parameter.
#include <algorithm>

template<typename It>
void incomplete_qsort(It first, It last, size_t cutoff) {
    // stop recursing on ranges smaller than the cutoff: they stay
    // unsorted internally but are already in their final region
    if (static_cast<size_t>(std::distance(first, last)) < cutoff) {
        return;
    }
    auto pivot = *first;
    auto mid1 = std::partition(first, last,
        [pivot](const auto& em) { return em < pivot; });
    auto mid2 = std::partition(mid1, last,
        [pivot](const auto& em) { return !(pivot < em); });
    incomplete_qsort(first, mid1, cutoff);
    incomplete_qsort(mid2, last, cutoff);
}
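For illustration, a quick check of the claimed property (a sketch, separate from the benchmark and not part of the original files): since the data is a permutation of 0..n−1, each value is also its own sorted index, so the displacement of element i is just |v[i] − i|.

// Sanity check (sketch): after incomplete_qsort, every element of a
// permutation of 0..n-1 should be at most `cutoff` away from its slot.
#include "incomplete_qsort.h"
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <numeric>
#include <random>
#include <vector>

int main() {
    std::vector<int> v(1 << 12);
    std::iota(v.begin(), v.end(), 0);
    std::mt19937 g(std::random_device{}());
    std::shuffle(v.begin(), v.end(), g);

    const size_t cutoff = 16;
    incomplete_qsort(v.begin(), v.end(), cutoff);
    for (int i = 0; i < static_cast<int>(v.size()); ++i) {
        // v[i] equals this element's position after a full sort
        assert(std::abs(v[i] - i) <= static_cast<int>(cutoff));
    }
}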
This is the driver (benchmark.cpp):
#include "incomplete_qsort.h"
#include "insertion_sort.h"
#include <chrono>
#include <iostream>
#include <iomanip>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>
using iter = std::vector<int>::iterator;
using sorter = void (*)(iter, iter);
double run_benchmark(std::vector<int>& data, sorter s) {
auto start = std::chrono::system_clock::now();
s(data.begin(), data.end());
auto end = std::chrono::system_clock::now();
std::chrono::duration<double> diff = end - start;
return diff.count();
}
int main(int argc, char ** argv)
{
std::random_device rd;
std::mt19937 g(rd());
for (int i = 12; i < 25; i++) {
auto size = 1 << i;
std::vector<int> data1(size);
std::vector<int> data2(size);
std::iota(data1.begin(), data1.end(), 0);
std::shuffle(data1.begin(), data1.end(), g);
incomplete_qsort(data1.begin(), data1.end(), 16);
std::copy(data1.begin(), data1.end(), data2.begin());
double duration1 = run_benchmark(data1, straight_insertion_sort);
double duration2 = run_benchmark(data2, binary_insertion_sort);
std::cout << std::setw(8) << size << ": "
<< std::setw(8) << duration1
<< std::setw(8) << duration2
<< " (" << duration2 / duration1 << ")"
<< '\n';
}
}
And finally the results, compiled with -O3 (columns: size, straight time in seconds, binary time in seconds, and the binary/straight ratio):
4096: 5.2e-05 0.000158 (3.03846)
8192: 9.1e-05 0.000269 (2.95604)
16384: 0.000161 0.000494 (3.06832)
32768: 0.000275 0.000968 (3.52)
65536: 0.000555 0.001823 (3.28468)
131072: 0.001171 0.003686 (3.14774)
262144: 0.002084 0.007765 (3.72601)
524288: 0.004457 0.015087 (3.38501)
1048576: 0.008304 0.030951 (3.72724)
2097152: 0.017204 0.063931 (3.71605)
4194304: 0.033697 0.132659 (3.93682)
8388608: 0.06833 0.277166 (4.05629)
16777216: 0.136164 0.569059 (4.17922)
2 Answers
Your initial claim sounds about right to me: in each iteration, the straight version checks at most cutoff elements to find the insertion_point (due to the restriction on the input), which should become increasingly faster than checking logarithmically many in the binary version. Of course there is a lot more to consider, like cache locality, but computational complexity should be the dominating factor in this case. That being said, I see some potential to improve your benchmark.
Benchmarking
Verify that your implementations are correct
A test suite would of course be best practice, but the absolute minimum is to make sure that your algorithms return the same result as std::sort. The binary insertion sort you provided has an off-by-one error, thus rendering your results useless. For the following two lines, the fix is to increase both end iterators by one:
auto insertion_point = std::lower_bound(first, cur, *cur);
std::copy_backward(insertion_point, cur, cur + 1);
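A minimal check of this kind (a sketch of mine, not code from the question) compares each sorter's output against std::sort on the same input; with the original binary_insertion_sort, the assertion should fire almost immediately.

// Minimal correctness check (sketch): sort a copy with std::sort
// and compare the results.
#include <algorithm>
#include <cassert>
#include <vector>

template <typename Sorter>
void check_sorter(std::vector<int> data, Sorter s) {
    auto expected = data;                        // data is taken by value,
    std::sort(expected.begin(), expected.end()); // so both copies are independent
    s(data.begin(), data.end());
    assert(data == expected);
}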
Choose a suitable baseline
Without any generally accepted baseline for the runtime of the algorithms, it is hard to argue whether the results are significant in any way. Again, std::sort does the job.
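Concretely (a sketch; the wrapper name is made up, and it reuses the iter and sorter aliases from benchmark.cpp), since run_benchmark takes a plain function pointer, std::sort needs a small wrapper to serve as the baseline:

// Wrapper matching the `sorter` function-pointer type from the driver;
// taking the address of std::sort directly is not portable.
void std_sort_baseline(iter first, iter last) {
    std::sort(first, last);
}

// in main, on a third copy of the input:
//   double duration0 = run_benchmark(data0, std_sort_baseline);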
Test against (somewhat) equally optimized implementations
I am not an expert in optimization, but I managed to shave off about 30% of the runtime of the binary version by adding an early continue and using std::upper_bound instead of std::lower_bound, both of which indirectly happen in your straight version:
for (auto cur = first + 1; cur < last; ++cur) {
    if (*(cur - 1) < *cur) { continue; }  // already in place
    auto val = *cur;
    auto insertion_point = std::upper_bound(first, cur, *cur);
    std::copy_backward(insertion_point, cur, cur + 1);
    *insertion_point = val;
}
The change from std::lower_bound to std::upper_bound does not change much due to the input format, which leads us to the next section.
Use realistic data
In your benchmark, you simply shuffle the numbers from 0 to n−1 and partially sort them again, which means there are no duplicates in the input. This is a rather strict constraint and probably allows for even more optimized algorithms (e.g. bucket sort). An input vector where each element is drawn from a chosen probability distribution (and then again partially sorted) should yield more representative results, for example as sketched below.
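A sketch of such an input generator (my addition; the distribution and its parameters are arbitrary choices, not taken from the question):

// Sketch: random input with duplicates instead of a shuffled permutation.
// Dividing the value range by 4 yields roughly four duplicates per value.
#include <random>
#include <vector>

std::vector<int> make_input(size_t size, std::mt19937& g) {
    std::uniform_int_distribution<int> dist(0, static_cast<int>(size / 4));
    std::vector<int> data(size);
    for (auto& x : data) {
        x = dist(g);
    }
    return data;  // then partially sort with incomplete_qsort as before
}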
Additionally, you should always put some thought into the type of elements you are sorting: for int, copying is fine, but for larger classes the benchmark needs to be adapted to use std::move.
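A move-aware variant of the straight version could look like this (a sketch only: it trades the sentinel trick for an explicit bounds check to keep the example short, and is not benchmarked here):

#include <utility>  // std::move

// Sketch of a move-aware insertion step: elements are shifted with
// std::move instead of copy assignment, which matters for heavy types.
template <typename It>
void straight_insertion_sort_move(It first, It last) {
    for (auto cur = first + 1; cur < last; ++cur) {
        auto val = std::move(*cur);
        auto it = cur;
        for (; it > first && val < *(it - 1); --it) {
            *it = std::move(*(it - 1));
        }
        *it = std::move(val);
    }
}

In the binary version, std::copy_backward would correspondingly become std::move_backward.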
Run tests multiple times
This is especially important for micro-optimizations, i.e. small size in our case, and the reason why microbenchmark support libraries like google/benchmark exist. If you're not willing to put up with the hassle of integrating it into your project, quick-bench.com allows for easy online benchmarking.
I quickly threw together an example using your code and the fixed algorithm, you can find it here.
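Without pulling in a library, a crude fallback (a sketch, reusing run_benchmark and sorter from the driver) is to repeat each measurement on a fresh copy of the input and keep the minimum:

#include <algorithm>
#include <limits>

// Repeat each measurement (sketch): sorting mutates its input, so every
// run needs a fresh copy; the minimum filters out most scheduling noise.
double run_benchmark_repeated(const std::vector<int>& input, sorter s, int runs = 10) {
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < runs; ++r) {
        auto data = input;
        best = std::min(best, run_benchmark(data, s));
    }
    return best;
}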
Specify your compiler version and hardware
This is not as relevant for proving a general point, but of course results will differ when using compilers of different maturity (or even your own home-crafted one). Here, websites like quick-bench come in handy again.
Code quality
Naming
As mentioned by others, duration1 and duration2 as well as data1 and data2 are quite unhelpful. Also, iterators are usually named begin and end instead of first and last. Other than that, your naming is expressive enough.
Creating the input vector
You initialize two vectors of the needed size, thus default-initializing all the elements. Then you fill the first one and copy the partially sorted result over to the other. Preferably, one would reserve space in an empty vector and then use a custom function like iota_n (example, see the sketch below) to back-insert all the elements. Once they are shuffled and partially sorted, simply use
auto data_copy = initial_data;
instead of calling std::copy.
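A minimal iota_n in the spirit of the linked example (my sketch, not the linked code) could be:

// Sketch of iota_n: back-insert n consecutive values into an empty,
// reserved vector, avoiding the default-initialization of a presized one.
#include <iterator>  // std::back_inserter (for the usage below)

template <typename OutputIt, typename Size, typename T>
OutputIt iota_n(OutputIt out, Size n, T value) {
    for (Size i = 0; i < n; ++i) {
        *out++ = value++;
    }
    return out;
}

// usage:
//   std::vector<int> data;
//   data.reserve(size);
//   iota_n(std::back_inserter(data), size, 0);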
Also, you included <iostream> twice.
Insertion sort
Whereas binary_insertion_sort is readable and reasonably easy to grasp, straight_insertion_sort took me a while longer. The if-branch can only trigger near the beginning of the range for this almost sorted input (once the overall minimum has reached the front, no later element can compare less than *first) and does nothing but catch an edge case. It can be simplified to
for (auto cur = first + 1; cur < last; ++cur) {
    if (*cur < *first) {
        // new minimum: move *cur to the front in one step
        std::rotate(first, cur, cur + 1);
    } else {
        auto val = *cur;
        auto it = cur;
        for (it = cur; val < *(it - 1); --it) {
            *it = *(it - 1);
        }
        *it = val;
    }
}
which actually seems to be a bit faster. I tried making the else-branch more readable while preserving speed by using std::rotate once more, but failed to do so.
For both algorithms, you use < to compare iterators, where usually != is used; see this SO thread. It doesn't make any difference speed-wise.
Comments:
- "Welcome to Code Review. This is a very good first contribution!" – G. Sliepen, May 17, 2020
- "std::is_sorted() would be ideal for the verify improvement." – Toby Speight, Feb 3, 2024
Naming
As pointed out in the comments, duration1 and duration2 are bad names since they lead to confusion. duration_straight and duration_binary seem to be a better choice.