Parsing positive integers from CSV file

Question 1

I Needed to parse some positive integers with the following constraints:

If possible use C++ only (no C API)
Needs to be able to process parsed data after each newline
Should be reasonably fast
Memory usage should be kept low (i.e. don't read the entire file in one go)
CSV file is well formed so error checking can be mostly omitted
Unknown amount of lines in file
Should support both unix (\n) and windows (\r\n) line endings

This code runs fine and is faster than the standard approach of streaming via >>. On average it takes 290ms to parse 10,000,000 lines whereas the version with >> takes 1.1s which I think is not a bad improvement.

However I have some concerns about this:

Are there any obvious mistakes that make this slower than it should be?
Are there any side effects or undefined behavior?
Is the template part applied correctly or is there a better way to do it?
Does the naming make sense?

Please also point out anything else you notice.

Sample input:

1,22,333
4444,55555,666666

Code:

#include <algorithm>
#include <fstream>
#include <string>
#include <vector>
inline void parse_uints(
 const char* buffer_ptr,
 std::vector<uint_fast16_t>& numbers) {
 bool is_eol = false;
 while (!is_eol) {
 uint_fast16_t parsed_number = 0;
 for (;;) {
 if (*buffer_ptr == ',') {
 break;
 }
 if (*buffer_ptr == '\r' || *buffer_ptr == '0円') {
 is_eol = true;
 break;
 }
 parsed_number = (parsed_number * 10) + (*buffer_ptr++ - '0');
 }
 // skip delimiter
 ++buffer_ptr;
 numbers.emplace_back(parsed_number);
 }
}
template<typename T>
void read_line(
 const std::string& filename,
 const uint_fast8_t& line_length,
 const uint_fast8_t& values_per_line,
 T callback) {
 std::ifstream infile{filename};
 if (!infile.good()) {
 return;
 }
 std::vector<uint_fast16_t> numbers;
 numbers.reserve(values_per_line);
 std::string buffer;
 buffer.reserve(line_length);
 while (infile.good() && std::getline(infile, buffer)) {
 parse_uints(buffer.data(), numbers);
 callback(numbers);
 numbers.clear();
 }
}
int main(int argc, char** argv) {
 constexpr uint_fast8_t line_length = 25;
 constexpr uint_fast8_t values_per_line = 3;
 read_line(argv[1], line_length, values_per_line, [](auto& values) {
 // do something with the values here, for example get the max
 auto max = std::max_element(std::begin(values), std::end(values));
 });
}

Question 2

You can't rely on the '\r' character portably. You might need to expect a single '\n' as well.

Question 3

@πάνταῥεῖ Are there issues with this when running on Windows? I only intend to run this on Linux (haven't tested it on Windows) and it worked fine when I tried it.

Question 4

I'd expect issues the other way round. The sequence \r\nline endings are typical for windows: cs.toronto.edu/~krueger/csc209h/tut/line-endings.html

Question 5

@πάνταῥεῖ Yeah, my plan was to run this on Linux but also be able to parse CSV files coming from windows with their exotic line ending.

Question 6

Sorry, I didn't spot that you're using std::getline() to read the lines delimited with '\n' in 1st place. The way you have it your code is portable.

Question 7

Some couple nits.

You function read_line does not what it suggests. It reads a file line by line. So you should find an appropriate name.
Your choice of uint_fast8_t/uint_fast16_t is interesting given that your example csv table hold some larger values. Have you measured the impact vs unsigned int or std::size_t. Be aware that modern cpu can read multiple elements at once so you choice might actually have negative impact.
Every line you clear numbers and build it up again. It will be beneficial to just initialize it once via resize and then just overwrite the old value. That way you can omit the unnecessary clear
Getline takes a third argument which is the delemiter. Also getline stops at newline and end of file so you dont need all that error checking Given that you now the number of entries per line you can simply loop and accumulate
```
while (infile.good()) {
 for (auto&& elem : numbers) {
 std::getline(infile, buffer, ",");
 elem = 0;
 for (auto&& digit : buffer) {
 elem *= 10;
 elem += (digit - '0');
 }
 }
 callback(numbers);
}
```

If you want to be sure you can check getline and return if its bad.

 while (infile.good()) {
 for (auto&& elem : numbers) {
 if (!std::getline(infile, buffer, ",")) {
 return;
 }
 elem = 0;
 for (auto&& digit : buffer) {
 elem *= 10;
 elem += (digit - '0');
 }
 }
 callback(numbers);
 }

Question 8

The values I pass are not actually the exact sizes I expect in the file but rather an estimate to avoid resizing of the buffers while it parses the file. If numbers is not cleared it could possibly contain old values that won't be overwritten if the next line is shorter than the previous one. At least that was my reasoning.

Question 9

Upon further testing it seems supplying getline with a delimiter makes it read until that delimiter even across line endings. So this code will always read the last value of a given line and the first value of the next line because only then it finds the , delimiter again.

Question 10

That is incorrect according to cplusplus.com/reference/string/string/getline So it seems there is something wrong with your line endings

Question 11

Have you tested it? cplusplus.com has been wrong in the past before. Also cppreference does not corroborate the behavior you describe.

miscco miscco 4,36112 silver badges17 bronze badges · Answer 1 · 2018-03-09 19:26:43Z

Some couple nits.

You function read_line does not what it suggests. It reads a file line by line. So you should find an appropriate name.
Your choice of uint_fast8_t/uint_fast16_t is interesting given that your example csv table hold some larger values. Have you measured the impact vs unsigned int or std::size_t. Be aware that modern cpu can read multiple elements at once so you choice might actually have negative impact.
Every line you clear numbers and build it up again. It will be beneficial to just initialize it once via resize and then just overwrite the old value. That way you can omit the unnecessary clear
Getline takes a third argument which is the delemiter. Also getline stops at newline and end of file so you dont need all that error checking Given that you now the number of entries per line you can simply loop and accumulate
```
while (infile.good()) {
 for (auto&& elem : numbers) {
 std::getline(infile, buffer, ",");
 elem = 0;
 for (auto&& digit : buffer) {
 elem *= 10;
 elem += (digit - '0');
 }
 }
 callback(numbers);
}
```

If you want to be sure you can check getline and return if its bad.

 while (infile.good()) {
 for (auto&& elem : numbers) {
 if (!std::getline(infile, buffer, ",")) {
 return;
 }
 elem = 0;
 for (auto&& digit : buffer) {
 elem *= 10;
 elem += (digit - '0');
 }
 }
 callback(numbers);
 }

The values I pass are not actually the exact sizes I expect in the file but rather an estimate to avoid resizing of the buffers while it parses the file. If numbers is not cleared it could possibly contain old values that won't be overwritten if the next line is shorter than the previous one. At least that was my reasoning.
Upon further testing it seems supplying getline with a delimiter makes it read until that delimiter even across line endings. So this code will always read the last value of a given line and the first value of the next line because only then it finds the , delimiter again.
That is incorrect according to cplusplus.com/reference/string/string/getline So it seems there is something wrong with your line endings
Have you tested it? cplusplus.com has been wrong in the past before. Also cppreference does not corroborate the behavior you describe.

Stack Exchange Network

Parsing positive integers from CSV file

Sample input:

Code:

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Parsing positive integers from CSV file

Sample input:

Code:

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions