Efficiently splitting a string in C++

Question 1

I've made a splitString() function which receives a character and a string and returns a vector<string> containing all the string split by a character in the input string:

vector<string> splitString(string input, char separator) {
 size_t pos;
 vector<string> text;
 while (!input.empty()) {
 pos = input.find(separator); //find separator character position
 if (pos == string::npos) {
 text.push_back(input);
 break;
 }
 text.push_back(input.substr(0, pos));
 input = input.substr(pos + 1);
 }
 return text; //returns a vector with all the strings
}

Example Input(.csv):

name;surname;age

Output would be:

vector<string> contains {name, surname, age}

Is there any efficient way of doing this?

As you can see, the ; was deleted separating the strings.
I was wondering if there is any way of adding an option to NOT delete the separator character, including it in the split strings.

Question 2

Welcome to the Code Review Community, we can provide a better code review if you provide a unit test program that you have written the tests the function.

Question 3

What are the expensive operations you do?

Allocating memory for the copy of a std::string passed into splitString().

This can be fixed by accepting a std::string_view instead, or at least a constant reference instead of a copy.

See How exactly is std::string_view faster than const std::string&? and What is string_view? for details.
Allocating memory for the std::strings stored in the result-vector.

Those might be replacable with std::string_view if the argument is changed, but no guarantees. This would be a significant usage change.
Repeatedly shuffling all the rest of the string to the front when removing a match.

Just find the target and copy / reference from there.
Constructing the result-elements in-place using .emplace_back() can also save some effort.

Adding the option to not remove the separator would only really make sense if you allow multiple separators.

Regarding your code, I'm not sure whether you use the dreaded using namespace std;, or the far more benign using std::string; using std::vector;.
If the former, see Why is "using namespace std;" considered bad practice?.

One behavioral point I changed: You ignore the last entry if empty. I decided to preserve it always, thus even an empty string gets at least one result.

auto splitString(std::string_view in, char sep) {
 std::vector<std::string_view> r;
 r.reserve(std::count(in.begin(), in.end(), sep) + 1); // optional
 for (auto p = in.begin();; ++p) {
 auto q = p;
 p = std::find(p, in.end(), sep);
 r.emplace_back(q, p);
 if (p == in.end())
 return r;
 }
}

Question 4

Good code :) Minor issue: The algorithm in the question returns an empty vector if the input string is empty. In this answer, a vector with a single empty string is returned instead. However, this can be fixed easily.

Question 5

@pschill Good for seeing that. I took the naming more seriously... Imho, treating the last entry as non-existent if empty is a defect. But I should have written that down.

Question 6

I agree. This version is more consistent.

Question 7

There's a danger to using this function. It can't be used with a temporary string. Consider this call: auto vec = splitString(function_returning_string(), ';');. The returned vector will be full of string_streams that will be pointing to destroyed data.

Question 8

Yes, but ensuring no owning temporaries are used to initialize the argument is beyond the scope, as in not impossible but cumbersome and handicapping use.

Question 9

If you don't want to modify the original string, you can use a stringstream:

#include <vector>
#include <sstream>
std::vector<std::string> splitString(const std::string input, const char& delimiter) {
 std::vector<std::string> elements;
 std::stringstream stream(input);
 std::string element;
 while (getline(stream, element, delimiter)) {
 elements.push_back(element);
 }
 return elements;
}

This has the added benefit of not modifying the input string. Because std::stringstream is an input stream, we can use getline to append characters to the element string until the delimiter is reached. This is repeated until the end of the input string.

I also noticed that you're using using namespace std. For smaller projects like this, that's fine. But as you begin to develop bigger applications and programs, you should stick to writing std::, as namespace collisions can happen if you're using the same function names.

Question 10

Your answer is a more inefficient solution when more efficiency is asked for.

Question 11

I mean no offense but this is one of the worst implementations and least efficient implementations of split.

Think what this operation does input = input.substr(pos + 1);. It copies all data past pos and store it in another instance of a string. Basically, if input had N strings each of size K then this algo would take O(N^2 K) operations which is too much. It shouldn't be over O(NK) the total number of elements.

Have you heard of GTA V bug that caused loading times to be several minutes for a dumb reasons? See this link GTA V fix. It turns out that it spent the whole time loading a 12mb json file - for a dumb reason. Something similar to what you have only more C-like.

@Deduplicator wrote a decent fix to the implementation. While @Linny used a stringstream for this purpose... it is more semantic but such text streams are generally considered outdated and better proposals are coming in. Also Deduplicator's implementation is more efficient.

Although, in general, split is fundamentally wrong function to use. It's not an efficient parsing tool. Normally, for efficient parsing one iterates parsing algo on each element instead of running split.

Say, you have a simple algorithm - you convert each string element into a double and expect a vector<double> as output. Is it efficient to first create vector<string> and then convert it to vector<double> or it's preferable to run conversion in-place and return the end result?

Question 12

I have an efficient function that splits a string based on a single character delimiter, and it does not need to allocate memory and copy the substrings, at all!

First, it returns a collection of string_view so it just points to the locations within the original string.

Second, it returns them as a std::array rather than a vector so it doesn't need to allocate memory. Instead, the array is allocated on the stack by the caller, and NVRO prevents the whole thing from being copied when it is returned.

The downside is that you need to give the number of elements expected as a template argument. For my use case (mainly CSV files and multi-part fields such as dates) I know the number of fields expected to split into. Note that in Perl you can give split an optional argument that specifies the number of fields and this acts as a maximum; it stops looking for delimiters after that. Since the array is a fixed size, if it's actually a maximum then you'd have empty string_view items padding the end.

If you can't handle fixed-size at all, then you can split the difference and return a dynamic collection, but still use string_view in it. You can also use a non-standard collection, or vector with an Allocator, so it allocates a certain size on the stack and only has to dynamically allocate on the heap if it exceeds that.

template <size_t N>
auto split (char separator, std::string_view input)
{
 std::array<std::string_view, N> results; 
 auto current = input.begin();
 const auto End = input.end();
 for (auto& part : results)
 {
 if (current == End) {
 const bool is_last_part = &part == &(results.back());
 if (!is_last_part) throw std::invalid_argument ("not enough parts to split");
 }
 auto delim = std::find (current, End, separator);
 part = { &*current, size_t(delim-current) };
 current = delim;
 if (delim != End) ++current;
 }
 if (current != End) throw std::invalid_argument ("too many parts to split");
 return results;
}

Deduplicator Deduplicator 19.6k1 gold badge32 silver badges65 bronze badges · Accepted Answer · 2021-05-08 16:44:51Z

What are the expensive operations you do?

Allocating memory for the copy of a std::string passed into splitString().

This can be fixed by accepting a std::string_view instead, or at least a constant reference instead of a copy.

See How exactly is std::string_view faster than const std::string&? and What is string_view? for details.
Allocating memory for the std::strings stored in the result-vector.

Those might be replacable with std::string_view if the argument is changed, but no guarantees. This would be a significant usage change.
Repeatedly shuffling all the rest of the string to the front when removing a match.

Just find the target and copy / reference from there.
Constructing the result-elements in-place using .emplace_back() can also save some effort.

Adding the option to not remove the separator would only really make sense if you allow multiple separators.

Regarding your code, I'm not sure whether you use the dreaded using namespace std;, or the far more benign using std::string; using std::vector;.
If the former, see Why is "using namespace std;" considered bad practice?.

One behavioral point I changed: You ignore the last entry if empty. I decided to preserve it always, thus even an empty string gets at least one result.

auto splitString(std::string_view in, char sep) {
 std::vector<std::string_view> r;
 r.reserve(std::count(in.begin(), in.end(), sep) + 1); // optional
 for (auto p = in.begin();; ++p) {
 auto q = p;
 p = std::find(p, in.end(), sep);
 r.emplace_back(q, p);
 if (p == in.end())
 return r;
 }
}

Good code :) Minor issue: The algorithm in the question returns an empty vector if the input string is empty. In this answer, a vector with a single empty string is returned instead. However, this can be fixed easily.
@pschill Good for seeing that. I took the naming more seriously... Imho, treating the last entry as non-existent if empty is a defect. But I should have written that down.
There's a danger to using this function. It can't be used with a temporary string. Consider this call: auto vec = splitString(function_returning_string(), ';');. The returned vector will be full of string_streams that will be pointing to destroyed data.
Yes, but ensuring no owning temporaries are used to initialize the argument is beyond the scope, as in not impossible but cumbersome and handicapping use.

Stack Exchange Network

Efficiently splitting a string in C++

4 Answers 4

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Efficiently splitting a string in C++

4 Answers 4

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions