Parsing positive integers from CSV file Part-2

Question 1

This is a small CSV parser for positive integers mostly intended for personal projects.
I (hopefully) improved the code from Part-1 by condensing some parts and improving others while also keeping performance the same.
My main worries are any obvious pitfalls I might have missed as well as concerns about the general approach. Of course anything else you notice is also welcome.
Please see part-1 for a list of requirements/restrictions I had when writing this.

For anyone who wants to test this against actual input I provide a small Perl script to generate CSV files.

Call it with: perl [scriptname] [how many lines you want] >outputfile

use strict;
use warnings "all";
die "usage: perl 0ドル <amount_of_lines>\n" unless (int(@ARGV) > 0);
for (1 .. $ARGV[0]) {
 my $vals_per_line = int(rand(10));
 my @values;
 for (0 .. $vals_per_line) {
 push(@values, int(rand(1000)));
 }
 print join(",", @values) . "\n";
}

Code:

#include <fstream>
#include <string>
#include <vector>
template<typename T>
void parse_lines(
 const std::string& filename,
 const uint_fast8_t& line_length_estimate,
 const uint_fast8_t& values_per_line_estimate,
 T callback) {
 std::ifstream infile{filename};
 if (!infile.good()) {
 return;
 }
 std::vector<uint_fast16_t> numbers;
 numbers.reserve(values_per_line_estimate);
 std::string buffer;
 buffer.reserve(line_length_estimate);
 while (infile.good() && std::getline(infile, buffer)) {
 if (buffer[buffer.size() - 1] == '\r') {
 buffer[buffer.size() - 1] = ',';
 }
 else {
 buffer.push_back(',');
 }
 uint_fast16_t parsed_number = 0;
 for (const auto& digit : buffer) {
 if (digit == ',') {
 numbers.emplace_back(parsed_number);
 parsed_number = 0;
 }
 else {
 parsed_number = (parsed_number * 10) + (digit - '0');
 }
 }
 callback(numbers);
 numbers.clear();
 }
}
int main(int argc, char** argv) {
 constexpr uint_fast8_t line_length_estimate = 50;
 constexpr uint_fast8_t values_per_line_estimate = 15;
 parse_lines(argv[1], line_length_estimate, values_per_line_estimate, [](auto& values) {
 // ...
 });
}

Question 2

I tried to break the dependency between different states of the parsed number, e.g. doing parsed_number += (digit - '0') * powers[current_power], but it kept ending up slower. I guess improving performance further will require some more intricate techniques.

Question 3

Great that you renamed you function according to what it does. I would suggest, that you add a helper function that reads a single line.

Creating small well encapsulated functions is key in writing maintainable code.

Regarding you code there are some small nits:

The code pattern

container[container.size() - 1]

can alway be replaced by

container.back()

Regarding the parsing you could use std::string::find_first_of and search for the next occurence of it.
```
std::size_t first = 0;
std::size_t second;
while (first != buffer.size()) {
 second = buffer.find_first_of(",\r", first);
 // parse [first, second)
 first = second + 1; 
}
```
That way in the interval [first, second) there would always be the numbers and you would not have to replace the final character or other stuff

Again encapsulation. There is no reason why the this block of code is not inside its own function:

uint_fast16_t parsed_number = 0;
for (const auto& digit : buffer) {
 if (digit == ',') {
 numbers.emplace_back(parsed_number);
 parsed_number = 0;
 }
 else {
 parsed_number = (parsed_number * 10) + (digit - '0');
 }
}

Even better if you encapsulate it you can easily adopt for iterators/positions

uint_fast16_t parse_number(cont std::string& buffer, std::size_t first, std::size_t second) {
 uint_fast16_t parsed_number = 0;
 for (std::size_t i = first; i < second; ++i) {
 parsed_number = (parsed_number * 10) + (buffer[i] - '0');
 }
 return parsed_number;
}

Now you can write the loop as

while (infile.good() && std::getline(infile, buffer)) {
 std::size_t first = 0;
 std::size_t second;
 while (first != buffer.size()) {
 second = buffer.find_first_of(",\r", first);
 numbers.push_back(parse_number(buffer, first,second));
 first = second + 1; 
 }
 callback(numbers);
 numbers.clear();
}

I would even suggest to create another function for parsing the single line:

void parse_single_line(std::string& buffer, std::vector<uint_fast16_t>& numbers) {
 std::size_t first = 0;
 std::size_t second;
 while (first != buffer.size()) {
 second = buffer.find_first_of(",\r", first);
 numbers.push_back(parse_number(buffer, first,second));
 first = second + 1; 
 }

Together this leads to

while (infile.good() && std::getline(infile, buffer)) {
 parse_single_line(buffer, numbers);
 callback(numbers);
 numbers.clear();
}

Question 4

This is an interesting approach. I think you forgot to pass buffer into parse_number? Also if I'm not mistaken this won't work with \n line endings because getline will remove the line ending. This can be fixed by checking second against numeric_limits but performance suffers quite a bit.

Question 5

@yuri buffer is obviously missing, Sorry for that. Regarding line ending find will give you std::string::npos if nothing is found o that should also be fine, as we parese [first,second)

Question 6

Adding if (second == std::string::npos) second = buffer.size() - 1; makes this work for both line ending types. However performance is now nearly twice as bad as before. Any suggestions on how to speed this up?

miscco miscco 4,36112 silver badges17 bronze badges · Answer 1 · 2018-03-11 13:10:29Z

Great that you renamed you function according to what it does. I would suggest, that you add a helper function that reads a single line.

Creating small well encapsulated functions is key in writing maintainable code.

Regarding you code there are some small nits:

The code pattern

container[container.size() - 1]

can alway be replaced by

container.back()

Regarding the parsing you could use std::string::find_first_of and search for the next occurence of it.
```
std::size_t first = 0;
std::size_t second;
while (first != buffer.size()) {
 second = buffer.find_first_of(",\r", first);
 // parse [first, second)
 first = second + 1; 
}
```
That way in the interval [first, second) there would always be the numbers and you would not have to replace the final character or other stuff

Again encapsulation. There is no reason why the this block of code is not inside its own function:

uint_fast16_t parsed_number = 0;
for (const auto& digit : buffer) {
 if (digit == ',') {
 numbers.emplace_back(parsed_number);
 parsed_number = 0;
 }
 else {
 parsed_number = (parsed_number * 10) + (digit - '0');
 }
}

Even better if you encapsulate it you can easily adopt for iterators/positions

uint_fast16_t parse_number(cont std::string& buffer, std::size_t first, std::size_t second) {
 uint_fast16_t parsed_number = 0;
 for (std::size_t i = first; i < second; ++i) {
 parsed_number = (parsed_number * 10) + (buffer[i] - '0');
 }
 return parsed_number;
}

Now you can write the loop as

while (infile.good() && std::getline(infile, buffer)) {
 std::size_t first = 0;
 std::size_t second;
 while (first != buffer.size()) {
 second = buffer.find_first_of(",\r", first);
 numbers.push_back(parse_number(buffer, first,second));
 first = second + 1; 
 }
 callback(numbers);
 numbers.clear();
}

I would even suggest to create another function for parsing the single line:

void parse_single_line(std::string& buffer, std::vector<uint_fast16_t>& numbers) {
 std::size_t first = 0;
 std::size_t second;
 while (first != buffer.size()) {
 second = buffer.find_first_of(",\r", first);
 numbers.push_back(parse_number(buffer, first,second));
 first = second + 1; 
 }

Together this leads to

while (infile.good() && std::getline(infile, buffer)) {
 parse_single_line(buffer, numbers);
 callback(numbers);
 numbers.clear();
}

This is an interesting approach. I think you forgot to pass buffer into parse_number? Also if I'm not mistaken this won't work with \n line endings because getline will remove the line ending. This can be fixed by checking second against numeric_limits but performance suffers quite a bit.
@yuri buffer is obviously missing, Sorry for that. Regarding line ending find will give you std::string::npos if nothing is found o that should also be fine, as we parese [first,second)
Adding if (second == std::string::npos) second = buffer.size() - 1; makes this work for both line ending types. However performance is now nearly twice as bad as before. Any suggestions on how to speed this up?

Stack Exchange Network

Parsing positive integers from CSV file Part-2

Code:

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Parsing positive integers from CSV file Part-2

Code:

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions