This is a small CSV parser for positive integers mostly intended for personal projects.
I (hopefully) improved the code from Part-1 by condensing some parts and improving others while also keeping performance the same.
My main worries are any obvious pitfalls I might have missed as well as concerns about the general approach. Of course anything else you notice is also welcome.
Please see part-1 for a list of requirements/restrictions I had when writing this.
For anyone who wants to test this against actual input I provide a small Perl script to generate CSV files.
Call it with: perl [scriptname] [how many lines you want] >outputfile
use strict;
use warnings "all";
die "usage: perl 0ドル <amount_of_lines>\n" unless (int(@ARGV) > 0);
for (1 .. $ARGV[0]) {
my $vals_per_line = int(rand(10));
my @values;
for (0 .. $vals_per_line) {
push(@values, int(rand(1000)));
}
print join(",", @values) . "\n";
}
Code:
#include <fstream>
#include <string>
#include <vector>
template<typename T>
void parse_lines(
const std::string& filename,
const uint_fast8_t& line_length_estimate,
const uint_fast8_t& values_per_line_estimate,
T callback) {
std::ifstream infile{filename};
if (!infile.good()) {
return;
}
std::vector<uint_fast16_t> numbers;
numbers.reserve(values_per_line_estimate);
std::string buffer;
buffer.reserve(line_length_estimate);
while (infile.good() && std::getline(infile, buffer)) {
if (buffer[buffer.size() - 1] == '\r') {
buffer[buffer.size() - 1] = ',';
}
else {
buffer.push_back(',');
}
uint_fast16_t parsed_number = 0;
for (const auto& digit : buffer) {
if (digit == ',') {
numbers.emplace_back(parsed_number);
parsed_number = 0;
}
else {
parsed_number = (parsed_number * 10) + (digit - '0');
}
}
callback(numbers);
numbers.clear();
}
}
int main(int argc, char** argv) {
constexpr uint_fast8_t line_length_estimate = 50;
constexpr uint_fast8_t values_per_line_estimate = 15;
parse_lines(argv[1], line_length_estimate, values_per_line_estimate, [](auto& values) {
// ...
});
}
1 Answer 1
Great that you renamed you function according to what it does. I would suggest, that you add a helper function that reads a single line.
Creating small well encapsulated functions is key in writing maintainable code.
Regarding you code there are some small nits:
The code pattern
container[container.size() - 1]
can alway be replaced by
container.back()
Regarding the parsing you could use
std::string::find_first_of
and search for the next occurence of it.std::size_t first = 0; std::size_t second; while (first != buffer.size()) { second = buffer.find_first_of(",\r", first); // parse [first, second) first = second + 1; }
That way in the interval [first, second) there would always be the numbers and you would not have to replace the final character or other stuff
Again encapsulation. There is no reason why the this block of code is not inside its own function:
uint_fast16_t parsed_number = 0; for (const auto& digit : buffer) { if (digit == ',') { numbers.emplace_back(parsed_number); parsed_number = 0; } else { parsed_number = (parsed_number * 10) + (digit - '0'); } }
Even better if you encapsulate it you can easily adopt for iterators/positions
uint_fast16_t parse_number(cont std::string& buffer, std::size_t first, std::size_t second) { uint_fast16_t parsed_number = 0; for (std::size_t i = first; i < second; ++i) { parsed_number = (parsed_number * 10) + (buffer[i] - '0'); } return parsed_number; }
Now you can write the loop as
while (infile.good() && std::getline(infile, buffer)) { std::size_t first = 0; std::size_t second; while (first != buffer.size()) { second = buffer.find_first_of(",\r", first); numbers.push_back(parse_number(buffer, first,second)); first = second + 1; } callback(numbers); numbers.clear(); }
I would even suggest to create another function for parsing the single line:
void parse_single_line(std::string& buffer, std::vector<uint_fast16_t>& numbers) { std::size_t first = 0; std::size_t second; while (first != buffer.size()) { second = buffer.find_first_of(",\r", first); numbers.push_back(parse_number(buffer, first,second)); first = second + 1; }
Together this leads to
while (infile.good() && std::getline(infile, buffer)) { parse_single_line(buffer, numbers); callback(numbers); numbers.clear(); }
-
\$\begingroup\$ This is an interesting approach. I think you forgot to pass
buffer
intoparse_number
? Also if I'm not mistaken this won't work with\n
line endings becausegetline
will remove the line ending. This can be fixed by checkingsecond
againstnumeric_limits
but performance suffers quite a bit. \$\endgroup\$yuri– yuri2018年03月12日 09:31:46 +00:00Commented Mar 12, 2018 at 9:31 -
\$\begingroup\$ @yuri buffer is obviously missing, Sorry for that. Regarding line ending find will give you std::string::npos if nothing is found o that should also be fine, as we parese [first,second) \$\endgroup\$miscco– miscco2018年03月12日 13:19:50 +00:00Commented Mar 12, 2018 at 13:19
-
\$\begingroup\$ Adding
if (second == std::string::npos) second = buffer.size() - 1;
makes this work for both line ending types. However performance is now nearly twice as bad as before. Any suggestions on how to speed this up? \$\endgroup\$yuri– yuri2018年03月12日 16:07:37 +00:00Commented Mar 12, 2018 at 16:07
Explore related questions
See similar questions with these tags.
parsed_number += (digit - '0') * powers[current_power]
, but it kept ending up slower. I guess improving performance further will require some more intricate techniques. \$\endgroup\$