This is my implementation of "merge sort" in C++. I'm still a newbie at C++ so:
- Have I understood how merge sort works and implemented it in the right way?
- Have I used any bad practices?
- How could the code be improved upon in terms of efficiency?
Criticism is welcome and appreciated!
#include <iostream>
#include <vector>
#include <string>
#include <sstream>
using namespace std;
vector<float> merge(vector<float> firstHalf, vector<float> secondHalf){
vector<float> combined;
for(int i = firstHalf.size() + secondHalf.size(); i > 0; i--){//merge two vectors
if(firstHalf.back() > secondHalf.back() && !firstHalf.empty()){
combined.push_back(firstHalf.back());
firstHalf.pop_back();
}else if(!secondHalf.empty()){
combined.push_back(secondHalf.back());
secondHalf.pop_back();
}
}
vector<float> revCombined;//reverse merged vectors. Vectors don't have pop_front and I didn't want to use lists.
for(int i = 0; i < combined.size(); i++){
revCombined.push_back(combined[combined.size()-i-1]);
}
return revCombined;
}
vector<float> mergeSort(vector<float> &inputArray){//for example [9, 8, 1] as input
if(inputArray.size() > 1){
vector<float> firstHalf;
vector<float> secondHalf;
for(int i = 0; i < inputArray.size()/2; i++){//auto round the input array because size() returns int
firstHalf.push_back(inputArray[i]);
}//first half = [9, 8]
for(int i = inputArray.size()/2; i < inputArray.size(); i++){
secondHalf.push_back(inputArray[i]);
}//second half = [1]
return merge(mergeSort(firstHalf), mergeSort(secondHalf));
}
else{
return inputArray;
}
}
vector<string> split(string str, char delimiter) {
vector<string> internal;
stringstream ss(str); // Turn the string into a stream.
string tok;
while(getline(ss, tok, delimiter)) {
internal.push_back(tok);
}
return internal;
}
vector<float> floatVectorInput(){
string inputString;
getline(cin, inputString);
vector<string> stringArray = split(inputString, ' ');
vector<float> array;
for(int i = 0; i < stringArray.size(); i++){
array.push_back(stof(stringArray[i]));
}
return array;
}
int main(){
cout << "Array to sort (separate by spaces): " << endl;
vector<float> inputArray = floatVectorInput();
vector<float> sorted = mergeSort(inputArray);
cout << endl << "Sorted Array:" << endl;
for(int i = 0; i < sorted.size(); i++){
cout << sorted[i];
if(i == sorted.size()-1){
cout << endl << endl;
}else{
cout << ", ";
}
}
return 0;
}
2 Answers 2
merge
calls back()
on empty vector
The logic in your merge
function has a bug.
if (firstHalf.back() > secondHalf.back() && !firstHalf.empty()) { ... }
The first problem is that you access firstHalf.back()
before the check that !firstHalf.empty()
. The checks should be in reversed order.
The second problem is that you're also accessing secondHalf.back()
without checking whether !secondHalf.empty()
.
While the code could be fixed with less changes, I suggest that you look at the problem again: Merging only really makes sense if neither of the halves is empty. So let's break the problem into two parts:
- Merge two non-empty containers.
- Append the remaining elements to the already merged container.
std::vector<float> combined{};
// Merge two non-empty containers.
while (!firstHalf.empty() && !secondHalf.empty()) {
if (firstHalf.back() > secondHalf.back()) {
combined.push_back(firstHalf.back());
firstHalf.pop_back();
} else {
combined.push_back(secondHalf.back());
secondHalf.pop_back();
}
}
// Append the remaining elements to the already merged container.
while (!firstHalf.empty()) {
combined.push_back(firstHalf.back());
firstHalf.pop_back();
}
while (!secondHalf.empty()) {
combined.push_back(secondHalf.back());
secondHalf.pop_back();
}
This version is not only correct but also simpler to understand.
Input does not handle excess spaces correctly
If you feed your program with input that puts more than one space between two numbers, your program will crash.
While it is a legitimate choice to require exactly one space between subsequent numbers, it turns out that you can make the behavior more user-friendly and simplify the code at the same time.
std::vector<float> floatVectorInput()
{
std::vector<float> numbers{};
std::string line{};
std::getline(std::cin, line);
std::istringstream iss{line};
float x;
while (iss >> x) {
numbers.push_back(x);
}
if (!iss.eof()) {
throw std::runtime_error{"Garbage input"};
}
return numbers;
}
Instead of first splitting into a vector of strings, I'm reading the float
s from the std::istringstream
directly. This will skip over any amount of white-space as desired.
The !iss.eof()
check is there because I want to make sure that we stop because the input is exhausted, not because there was something that is not a floating-point number.
Avoid using namespace std
I know that many C++ tutorials want you to put
using namespace std;
into every file. I don't think that this is good advice and recommend against doing it. Importing namespace
s is a complex operation and you probably don't understand all its implications. For example, what will the following code print with and without the using namespace std;
?
#include <iostream>
#include <utility>
//using namespace std;
void swap(double x, double y)
{
std::clog << "swap(" << x << ", " << y << ")\n";
}
int main()
{
int a = 1;
int b = 2;
swap(a, b);
swap(3, 4);
}
With the offending line commented out, the code will print
swap(1, 2)
andswap(3, 4)
as you'd probably expect. However, withusing namespace std
, it will only print the second line.
What happened?
By
using namespace std
, we've madestd::swap
(defined in<utility>
) visible. Since our ownswap
takesdouble
s as parameters, calling it with an argument of typeint
is not a perfect match. What the C++ compiler does is adding an implicit conversion fromint
todouble
. However, if there is also a function that doesn't require this conversion to happen, the compiler will prefer it. It just so happens thatstd::swap
is atemplate
that will be a better match in this case. So why is only the first call resolved tostd::swap
then? This is becausestd::swap
takes its arguments as mutable references. A temporary object (like an integer literal in this case) doesn't bind to a mutable reference.
I understand that this is complicated stuff for a beginner and you probably shouldn't have to worry about it at this point. But if you're using namespace std
, you'll have to know it or you won't understand your code.
That said, using namespace std
is also frowned upon in production-quality code (written by people who ought to understand the aforementioned language rules) so you're teaching yourself a bad habit by using it. Just be explicit and prefix symbols from the standard library with std::
. It will tell the reader at a glance where the symbol comes from which makes understanding the code easier for readers of any level of experience.
Be const
correct
Your mergeSort
function doesn't modify its argument (sorts the vector in-place) but actually returns a new, sorted vector. Therefore, make its argument const
.
std::vector<float> mergeSort(const std::vector<float>& inputArray);
// ^^^^^
You should do this with any variable that you don't intend to modify.
Use the correct integer types
This comment is actually wrong.
auto round the input array because size() returns int
std::vector::size
is of type std::vector::size_type
which is probably an alias for std::size_t
. In any case, it is an unsigned
integer type. Compiling your code with warnings enabled will warn you about this mixing of signed
and unsigned
types.
Unfortunately, std::size_t
is an unsigned
integer for historic reasons. This meas that you cannot let inidces go below zero which makes the loop conditions a bit more complicated in some places.
Make double
your go-to floating-point type
Unless you have a good reason to use float
, prefer double
for its greater range and precision. On modern hardware, you'll hardly notice a performance difference.
That's not to say there are no valid use-cases for float
. But unless you have such a case, your default-choice should be for double
.
Prefer '\n'
over std::endl
by default
std::endl
does more than outputting a new line. It also flushes the output stream. Which is an expensive operation.
Sometimes, this is just what you want. For example, in this line,
cout << "Array to sort (separate by spaces): " << endl;
you've used it correctly. The text must be visible before the instruction on the subsequent line gets executed or the program might start blocking for user input before the unlucky user was asked to provide some.
However, in all other places in your program, there is no need to flush. And output flushing is slow. It probably doesn't make any measurable difference for your small program but I've seen code where replacing std::endl
with '\n'
literally made the code ten times faster! And as a bonus: it's also less typing.
Consider using auto
type deduction (C++11)
C++11 introduced a great feature: auto
type deduction. This means that if you're declaring a variable and immediately initialize it with some value, you don't have to spell out its type. The compiler will deduce it for you.
So, for example, you could replace
vector<float> sorted = mergeSort(inputArray);
by
auto sorted = mergeSort(inputArray);
and the compiler will figure out for you that sorted
is of type vector<float>
.
Using auto
systematically can help you avoid repeating the same information over and over again and also avoid doing some silly mistakes that happen when you think you know the type.
(Applying the earlier advice, we should use const auto
here, of course.)
Avoid index-based for
loops (C++11)
You're using C-style for
loops throughout your code. While they are not wrong, they're less readable than ideal and allow you to make some avoidable bugs with the index bookkeeping. In some cases, they might also perform worse than ideal.
Since C++11, you can iterate over any standard library container using the following idiom, known as range-based for
loop.
Before:
for (int i = 0; i < stringArray.size(); i++) {
array.push_back(stof(stringArray[i]));
}
After:
for (auto&& s : stringArray) {
array.push_back(stof(s));
}
Notice how I am using auto
type deduction here. The type of s
will be deduced to std::string&
here but you don't necessarily have to worry about it. You can always use auto&&
if you don't care about the actual type. Prefer const auto&
if you don't intend to modify the object, though. Don't use plain auto
as it will make an unnecessary copy.
Prefer standard library algorithms over raw loops (intermediate level)
The previous item notwithstanding, you should try to avoid raw loops in your code at all, if possible. (Search for Sean Parent's great "Better Code" talks and in particular for his "no raw loops" advice.)
For example, the loop from above could become this.
std::transform(
std::begin(stringArray),
std::end(stringArray),
std::back_inserter(array),
[](const std::string& s){ return std::stof(s); }
);
I understand that this might not be very easy to understand for a beginner and it is perfectly acceptable to write your own loops at this point. But once you become more familiar with C++, standard algorithms are definitely something to look into.
(Taking this further, production code would of course use the standard library's std::stable_sort
algorithm instead of rolling its own merge sort implementation. But implementing algorithms yourself for learning is a good exercise.)
Make the code generic (intermediate level)
Your sorting logic can currently only be used for sorting std::vector<float>
. You could generalize it to work with any container of any type that defines a comparison function.
This is not a defect in your code. If you only have to sort std::vector<float>
, that's fine. In fact, it is better to write specific code that is correct than to write generic code that is flawed. Once you've learned about template
s and move semantics, you can attempt to make your code generic.
Avoid unnecessary copies
I liked this comment in your code.
Reverse merged vectors. Vectors don't have pop_front and I didn't want to use lists.
Indeed, using lists would be a bad idea here (and almost everywhere else, too). However, there is no need to create another vector. For one, you can reverse the elements of a vector in-place. In this case, however, an even better solution would be to use a std::deque
(which has pop_front
) instead of a std:vector
.
Of course, you could also re-work your code such that you don't actually pop the items off the front of your vector but merely iterate over it.
There are some other places where your code makes copies that could be avoided. For example, the splitting into the firstHalf
and secondHalf
vectors isn't necessary. You could instead always pass a reference to the original vector and a pair of indices. Alternatively (and preferably) use iterators.
An expert would probably also try to avoid the allocations for the merged intermediary results and instead use a single working array. But that is probably too complicated for now.
The naming could be improved slightly
Your code is generally very readable but at some places, I'd suggest looking carefully at the chosen names. For example, internal
doesn't really tell me anything useful. More as a matter of taste, I'd avoid names like stringArray
in favor of strings
. Especially since it is a std::vector
and not a std::array
.
Use tools to detect easy bugs
Some of the issues pointed out in this review could easily have been found by tools.
The most important tool is your compiler. Always compile with a high level of warnings enabled and treat them as errors. For GCC and Clang, I recommend you use the -Wall
, -Wextra
, -Werror
and-pedantic
options. There are even more warnings available but these should cover a good part.
If your C++ implementation supports it, build debug builds with a "checked STL". That is, a modified standard library that performs more checking than required (or even allowed) by the standard. When using GCC, this is really simple. All you have to do is compile with the -D_GLIBCXX_DEBUG
option.
This cannot find all issues, however. So in addition, I recommend that you compile your code with a sanitizer. GCC and Clang both support the -fsanitize=TOOL
flag where TOOL
is the name of the sanitizer to use. I recommend that you use address
and undefined
routinely. Using these switches will instrument your binaries with some run-time checks.
Alternatively – or, preferably, in addition – you can use Valgrind which is a tool to instrument your binaries on the fly.
Instrumentation of your code is only useful, however, if you have appropriate tests that execute the instrumented code. In your case, testing would be easy. Just write a function that calls mergeSort
with a number of interesting inputs. "Interesting" in this case might be: an empty vector, a vector of a single element, a sorted vector, a vector sorted in reverse, a vector with duplicate elements, a random vector, etc. Of course, the test should also verify that the output is sorted correctly. Re-run these tests whenever you make modifications to your code to be alerted of newly introduced bugs immediately.
-
\$\begingroup\$ Wow!! Fantastic! This is the most insightful thing I've read in a while! What a brilliant answer! Thank You! \$\endgroup\$user2635139– user26351392017年01月21日 22:38:48 +00:00Commented Jan 21, 2017 at 22:38
-
\$\begingroup\$ Here are my improvements so far: github.com/BenjaminThomas1999/Merge-Sort/blob/master/main.cpp \$\endgroup\$user2635139– user26351392017年01月21日 22:40:24 +00:00Commented Jan 21, 2017 at 22:40
- Have I understood how merge sort works and implemented it in the right way?
You understood the algorithm correctly, and your code yields the desired results.
- Have I used any bad practices?
- How could the code be improved upon in terms of efficiency?
See my further points below please:
1. Don't use using namespace std;
While that would work in your particular case, it's considered bad practice. Especially when you move out your code to separate header files.
See more details here please:
Why is "using namespace std;" considered bad practice?
2. Check constraints in order
This code may call undefined behavior, if the constraints aren't checked first (logical boolean arithmetic operations are executed in order):
if(firstHalf.back() > secondHalf.back() && !firstHalf.empty()){
Since there's the possibility calling std::vector::back()
with an empty vector, that code should be
if(!firstHalf.empty() && !secondHalf.empty() && firstHalf.back() > secondHalf.back()){
to avoid calling the offensive statements
3. Simplify your data input processing
Your data input function can be simplified a lot. You don't need another split()
function to do this. Simply use a std::istringstream
to do this:
vector<float> floatVectorInput(){
string inputString;
getline(cin, inputString);
vector<float> array;
std::istringstream iss(inputString);
float val;
while(iss >> val){
array.push_back(val);
}
return array;
}
4. Prefer loops and dynamically allocated stacks over recursion
Recursive function calls like
return merge(mergeSort(firstHalf), mergeSort(secondHalf));
always are limited to the (OS defined) stack size regarding the call stack depth.
On large input lists this may bail out with a Stack Overflow error.
You can simply replace that with a std::stack<std::pair<std::vector<float>,std::vector<float>>>
structure that is handled within a loop.
-
\$\begingroup\$ Vectors are being modified inside, so const won't work. \$\endgroup\$Raziman T V– Raziman T V2017年01月21日 20:07:08 +00:00Commented Jan 21, 2017 at 20:07
-
\$\begingroup\$ @RazimanT.V. Ah, the
pop_back()
stuff, yes. THX, I've been overlooking that. \$\endgroup\$πάντα ῥεῖ– πάντα ῥεῖ2017年01月21日 20:09:06 +00:00Commented Jan 21, 2017 at 20:09
if(firstHalf.back() > secondHalf.back() && !firstHalf.empty())
will cause a segfault. Also, you keep creating and moving a lot of vectors, perhaps it is better to just keep one scratchpad vector and do the merging on it. Finally, you usually sort a vector by modifying it rather than creating another vector with elements sorted. \$\endgroup\$back()
on provably empty vectors. That is undefined behaviour and just waiting for the right input to crash. \$\endgroup\$