Do you see any inefficient parts in the following code? On the internet I see 30 ms for 100,000 integers, but I can only sort 100,000 integers in 300 ms. The data is random. I am using a late-2013 MacBook Pro, so I don't expect a 10x slowdown from the CPU, but who knows.
void mergesort2(int* list, int begin, int end, int *tmplist=0){
    bool allocated = false;
    if (end-begin < 1) {
        return;
    }
    if (tmplist == 0) {
        allocated = true;
        tmplist = new int[end-begin+1];
    }
    int middle = (begin+end)/2;
    mergesort2(list, begin, middle, tmplist);
    mergesort2(list, middle+1, end, tmplist);
    for (int first=begin, second=middle+1, tmp=0; tmp<end-begin+1; tmp++) {
        if (first > middle) {
            memcpy(tmplist+tmp, list+second, sizeof(int)*(end-begin-tmp+1));
        }
        else if (second > end) {
            memcpy(tmplist+tmp, list+first, sizeof(int)*(end-begin-tmp+1));
        }
        else if (list[first] <= list[second]) {
            tmplist[tmp] = list[first++];
        }else{
            tmplist[tmp] = list[second++];
        }
    }
    memcpy(list+begin, tmplist, (end-begin+1)*sizeof(int));
    if (allocated) {
        delete tmplist;
    }
}
4 Answers
A few things here...

- The format of your input parameters is inconvenient. You should probably just replace the begin and end parameters with a single size. The method can then be called as:

mergesort2(list, middle, tmplist);
mergesort2(list + middle, end - middle, tmplist);

This also fixes a lot of the +1 offsets you have where the end index is not convenient. It will also remove a lot of math where you are doing end-begin+1 or equivalent.

- Consider changing your mid-point calculation to be:

int middle = begin + (end - begin) / 2;

This avoids a bug when begin and end are large values and the sum of them overflows the int. Actually, if it were me, I would be using the size method above, and I would declare middle to be 1 different than you:

int middle = (size + 1) / 2;

This makes the second 'half' the 'smaller' half, and it removes all the '+1' offsets you have to apply later in the code (it also changes some other range checks/limits), but overall it will simplify the code.

- Small nit-pick: I dislike magic numbers, especially 'vagrant' + 1 values... consider this line here:

for (int first=begin, second=middle+1, tmp=0; tmp<end-begin+1; tmp++) {

This +1 could be avoided with (if you have a good size and middle):

for (int first=begin, second=middle, tmp=0; tmp<size; tmp++) {

- Performance/correctness: declare begin and end to be constants in the method signature.

- Performance/correctness: inside your loop you have two conditions:

if (first > middle) { .... } else if (second > end) { .... }

Inside each of those blocks you should break; so that you do not need to repeatedly copy the memory which does not need merging.
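Putting those suggestions together (including the break from the last point), a size-based version might look roughly like the sketch below. This is my own consolidation for illustration, not benchmarked code; it assumes the same headers as the original (e.g. string.h/cstring for memcpy):

void mergesort2(int* list, const int size, int* tmplist = 0) {
    bool allocated = false;
    if (size < 2) {
        return;
    }
    if (tmplist == 0) {
        allocated = true;
        tmplist = new int[size];
    }
    const int middle = (size + 1) / 2;         // second 'half' is the smaller one
    mergesort2(list, middle, tmplist);
    mergesort2(list + middle, size - middle, tmplist);
    for (int first = 0, second = middle, tmp = 0; tmp < size; tmp++) {
        if (first >= middle) {                 // left run exhausted
            memcpy(tmplist + tmp, list + second, sizeof(int) * (size - tmp));
            break;
        } else if (second >= size) {           // right run exhausted
            memcpy(tmplist + tmp, list + first, sizeof(int) * (size - tmp));
            break;
        } else if (list[first] <= list[second]) {
            tmplist[tmp] = list[first++];
        } else {
            tmplist[tmp] = list[second++];
        }
    }
    memcpy(list, tmplist, size * sizeof(int));
    if (allocated) {
        delete[] tmplist;
    }
}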
- I have always struggled with +1 -1s, I will try to follow these guidelines and I hope they will help me in the long run! – EralpB, Mar 1, 2014 at 15:20
- Don't like the idea of using size rather than begin and end. It goes against the normal way C++ code is written. If the code had been tagged C then I would have agreed. – Loki Astari, Mar 4, 2014 at 1:34
- @LokiAstari – yeah, I was trapped by the original question not having a language tag at all, and the OP never really paid much attention at the time – rolfl, Mar 4, 2014 at 1:59
A C++ merge sort in the style of a Standard Library algorithm could be written as
template<class BiDirIt,
         class Compare = std::less<typename std::iterator_traits<BiDirIt>::value_type>>
void merge_sort(BiDirIt first, BiDirIt last, Compare cmp = Compare())
{
    auto const N = std::distance(first, last);
    if (N < 2) return;
    auto middle = std::next(first, N / 2);
    merge_sort(first, middle, cmp);
    merge_sort(middle, last, cmp);
    std::inplace_merge(first, middle, last, cmp);
}
Let's look at how this improves upon your design:
- generic interface: it takes two iterators and a comparison function of arbitrary types, instead of two indices into a pointer to an array of int and the implicit but fixed < comparison.
- in-place: behind the covers it calls std::inplace_merge, which may allocate extra memory, but it does so in a way that users don't have to supply it, unlike your tmplist parameter.
- half-open intervals: it splits the input range [first, last) into [first, middle) and [middle, last), which automatically takes care of easy-to-make "off-by-one" errors in code like middle + 1 (in your code it is correct, but you still have to think about it; half-open intervals are easier to get right).
- named algorithms: it uses std::inplace_merge as a ready-to-use component instead of a somewhat long and certainly tricky final step in your algorithm. Not only does this make your merge_sort more compact, the inplace_merge itself is a reusable component in its own right!
- heavily optimized: Standard Library experts are very likely to know about the state of the art in writing performance-optimized algorithms. A factor of 10 in performance compared to your code seems a lot, but if you forget some hidden copies, or use a hand-written loop that has O(N^2) complexity instead of O(N log N), you can quickly get there.
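For reference, the merge_sort above needs <algorithm> (std::inplace_merge), <iterator> (std::distance, std::next, std::iterator_traits) and <functional> (std::less). A small usage sketch, assuming the template is defined as shown:

#include <algorithm>
#include <functional>
#include <iostream>
#include <iterator>
#include <vector>

// merge_sort as defined above

int main()
{
    std::vector<int> v{5, 2, 9, 1, 7, 3};
    merge_sort(v.begin(), v.end());                       // ascending, default std::less
    merge_sort(v.begin(), v.end(), std::greater<int>());  // descending, via a custom comparison
    for (int x : v) std::cout << x << ' ';
    std::cout << '\n';
}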
Back to the original question: I created a small test fixture for experimenting with your code (after all, I had to confirm it actually worked first!):
#include <cstdio>
#include <cstdlib>

// mergesort2 as defined in the question

#define NUM_OF_INTS 100000
#define DEBUG 0

int main()
{
    int numbers[NUM_OF_INTS];
    int i;
    int *tmplist;
    tmplist = new int[NUM_OF_INTS];
    srand(0);
    for( i = 0; i < NUM_OF_INTS; i++ )
        numbers[i] = rand()%10000;
    if (DEBUG == 1)
        for( i = 0; i < NUM_OF_INTS; i++ )
            printf( "%03i:%04i\n", i, numbers[i] );
    mergesort2( &numbers[0], 0, NUM_OF_INTS - 1, tmplist );  // end is the inclusive last index
    if (DEBUG == 1) printf( "\n" );
    if (DEBUG == 1)
        for( i = 0; i < NUM_OF_INTS; i++ )
            printf( "%03i:%04i\n", i, numbers[i] );
    delete [] tmplist;
    return 0;
}
Running this test fixture in my environment, I see about .7s for 100k ints. I initially thought checking for tmplist in every recursive call might be causing a performance degradation, but it made a negligible change in execution time. I then started down the path of reimplementing mergesort without a temporary array, thinking that the array memcpys might be the root cause of the performance issue, when I stumbled upon what is wrong with your code.
Hint
- Think about why the memcpy's are there inside the for loop. What is their purpose? Why are they necessary? What does it mean algorithmically when those memcpy's are done executing? What should happen next?
Spoiler (really think & test your code before looking at this)
You need to do a break; to quit the for loop after each memcpy for the two degenerate cases (first advances past the midpoint, second advances past the end). Otherwise you are doing repeated memcpys (which, lucky or unlucky, have no effect on the output) until tmp finally reaches the end.
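In code, the fix is just two break statements in the existing loop (only the affected branches shown):

if (first > middle) {
    memcpy(tmplist+tmp, list+second, sizeof(int)*(end-begin-tmp+1));
    break;  // left run exhausted; the rest of the right run was just copied in one go
}
else if (second > end) {
    memcpy(tmplist+tmp, list+first, sizeof(int)*(end-begin-tmp+1));
    break;  // right run exhausted; the rest of the left run was just copied in one go
}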
With the fix, the code can now do 100k ints in .08s in my environment, hence the 10x improvement you were looking for.
After fixing the problem @RichHendricks pointed out, this still has... not a problem exactly, but a design that's less than optimal (at least in my opinion).
In the initial phase, you basically do:

- calculate middle
- recurse on left side
- recurse on right side
Each of those recursive calls does the same thing again. Modern processors have done quite a bit to make calls and returns more efficient, but that still adds quite a bit of overhead when the fundamental operation involved is basically just computing the arithmetic mean of two numbers.
You can avoid that by using a bottom-up merge sort. Instead of recursively splitting the array in half until it gets down to some manageable size, you can simply take the first N elements and sort them with some low-overhead sort (typically an insertion sort). Repeat, N elements at a time, until you reach the end of the collection (though, of course, the last group might not be the same size as the others).
Then take the first two sorted groups of N elements and merge them. Repeat for the next pair of groups until you reach the end. Repeat that process until all the groups have been merged into one.
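To make that shape concrete, here is a rough sketch (my illustration only: the names, the run size of 32, and the use of std::inplace_merge for the merge step are all arbitrary choices, and that merge step is exactly what the next paragraphs discuss replacing):

#include <algorithm>
#include <cstddef>

static const std::size_t RUN = 32;  // small-run size; worth tuning for the target machine

// Sort a[0..n) with a simple insertion sort (low overhead for short runs).
void insertion_sort(int* a, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i) {
        int key = a[i];
        std::size_t j = i;
        while (j > 0 && a[j - 1] > key) { a[j] = a[j - 1]; --j; }
        a[j] = key;
    }
}

void bottom_up_merge_sort(int* a, std::size_t n) {
    // Phase 1: sort each run of RUN elements (the last run may be shorter).
    for (std::size_t i = 0; i < n; i += RUN)
        insertion_sort(a + i, std::min(RUN, n - i));

    // Phase 2: merge adjacent pairs of runs, doubling the run width each pass,
    // until a single sorted run remains.
    for (std::size_t width = RUN; width < n; width *= 2) {
        for (std::size_t i = 0; i + width < n; i += 2 * width) {
            std::size_t mid = i + width;
            std::size_t end = std::min(i + 2 * width, n);
            std::inplace_merge(a + i, a + mid, a + end);
        }
    }
}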
As to how to do the merging: a couple of papers have been written about in-place merging (see the examples linked at the end of this answer). I should add that quite a few algorithms to do this have been devised, but many (most?) are primarily theoretical: in theory they have low enough computational complexity to maintain the O(N log N) complexity of the merge sort as a whole, but in fact they impose so much overhead that other methods are faster for almost any practical amount of data. I've tried to restrict the list of citations below to ones that seem to have at least some chance of being practical, not just theoretical.
One last point: contrary to one widespread belief, this is an area where you actually stand a pretty decent chance of writing code significantly better than what the standard library provides. The C++ standard library does include an inplace_merge algorithm, but it's allowed to be significantly slower than the best algorithms now known. At least, if I recall correctly, the requirements on it haven't been updated since the 1998 standard, which was before many of the best in-place merging algorithms were invented. I don't claim to have looked at every implementation of in-place merging in every standard library, but I have seen at least a few relatively recent ones that used older, substantially less efficient algorithms.
- For a few examples:
  - http://akira.ruc.dk/~keld/teaching/algoritmedesign_f04/Artikler/04/Huang88.pdf
  - http://www.sofsem.cz/sofsem06/data/prezentace/23/A/Kim-OnOptimalandEfficientInPlaceMerging.ppt
  - http://www.dcs.kcl.ac.uk/technical-reports/papers/TR-04-05.pdf
- You allocate tmplist with operator new[] but delete with operator delete. You should clean up tmplist with delete [] tmplist;
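In other words, the cleanup at the end of mergesort2 should read:

if (allocated) {
    delete [] tmplist;  // matches the new int[...] used to allocate the buffer
}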