I Needed to parse some positive integers with the following constraints:
If possible use C++ only (no C API)
Needs to be able to process parsed data after each newline
Should be reasonably fast
Memory usage should be kept low (i.e. don't read the entire file in one go)
CSV file is well formed so error checking can be mostly omitted
Unknown amount of lines in file
Should support both unix (
\n
) and windows (\r\n
) line endings
This code runs fine and is faster than the standard approach of streaming via >>
.
On average it takes 290ms to parse 10,000,000 lines whereas the version with >>
takes 1.1s which I think is not a bad improvement.
However I have some concerns about this:
Are there any obvious mistakes that make this slower than it should be?
Are there any side effects or undefined behavior?
Is the template part applied correctly or is there a better way to do it?
Does the naming make sense?
Please also point out anything else you notice.
Sample input:
1,22,333
4444,55555,666666
Code:
#include <algorithm>
#include <fstream>
#include <string>
#include <vector>
inline void parse_uints(
const char* buffer_ptr,
std::vector<uint_fast16_t>& numbers) {
bool is_eol = false;
while (!is_eol) {
uint_fast16_t parsed_number = 0;
for (;;) {
if (*buffer_ptr == ',') {
break;
}
if (*buffer_ptr == '\r' || *buffer_ptr == '0円') {
is_eol = true;
break;
}
parsed_number = (parsed_number * 10) + (*buffer_ptr++ - '0');
}
// skip delimiter
++buffer_ptr;
numbers.emplace_back(parsed_number);
}
}
template<typename T>
void read_line(
const std::string& filename,
const uint_fast8_t& line_length,
const uint_fast8_t& values_per_line,
T callback) {
std::ifstream infile{filename};
if (!infile.good()) {
return;
}
std::vector<uint_fast16_t> numbers;
numbers.reserve(values_per_line);
std::string buffer;
buffer.reserve(line_length);
while (infile.good() && std::getline(infile, buffer)) {
parse_uints(buffer.data(), numbers);
callback(numbers);
numbers.clear();
}
}
int main(int argc, char** argv) {
constexpr uint_fast8_t line_length = 25;
constexpr uint_fast8_t values_per_line = 3;
read_line(argv[1], line_length, values_per_line, [](auto& values) {
// do something with the values here, for example get the max
auto max = std::max_element(std::begin(values), std::end(values));
});
}
1 Answer 1
Some couple nits.
You function read_line does not what it suggests. It reads a file line by line. So you should find an appropriate name.
Your choice of uint_fast8_t/uint_fast16_t is interesting given that your example csv table hold some larger values. Have you measured the impact vs unsigned int or std::size_t. Be aware that modern cpu can read multiple elements at once so you choice might actually have negative impact.
Every line you clear numbers and build it up again. It will be beneficial to just initialize it once via
resize
and then just overwrite the old value. That way you can omit the unnecessaryclear
Getline takes a third argument which is the delemiter. Also getline stops at newline and end of file so you dont need all that error checking Given that you now the number of entries per line you can simply loop and accumulate
while (infile.good()) { for (auto&& elem : numbers) { std::getline(infile, buffer, ","); elem = 0; for (auto&& digit : buffer) { elem *= 10; elem += (digit - '0'); } } callback(numbers); }
If you want to be sure you can check getline and return if its bad.
while (infile.good()) {
for (auto&& elem : numbers) {
if (!std::getline(infile, buffer, ",")) {
return;
}
elem = 0;
for (auto&& digit : buffer) {
elem *= 10;
elem += (digit - '0');
}
}
callback(numbers);
}
-
\$\begingroup\$ The values I pass are not actually the exact sizes I expect in the file but rather an estimate to avoid resizing of the buffers while it parses the file. If
numbers
is not cleared it could possibly contain old values that won't be overwritten if the next line is shorter than the previous one. At least that was my reasoning. \$\endgroup\$yuri– yuri2018年03月09日 19:37:46 +00:00Commented Mar 9, 2018 at 19:37 -
\$\begingroup\$ Upon further testing it seems supplying
getline
with a delimiter makes it read until that delimiter even across line endings. So this code will always read the last value of a given line and the first value of the next line because only then it finds the,
delimiter again. \$\endgroup\$yuri– yuri2018年03月10日 10:10:49 +00:00Commented Mar 10, 2018 at 10:10 -
\$\begingroup\$ That is incorrect according to cplusplus.com/reference/string/string/getline So it seems there is something wrong with your line endings \$\endgroup\$miscco– miscco2018年03月10日 12:27:14 +00:00Commented Mar 10, 2018 at 12:27
-
\$\begingroup\$ Have you tested it?
cplusplus.com
has been wrong in the past before. Also cppreference does not corroborate the behavior you describe. \$\endgroup\$yuri– yuri2018年03月10日 12:58:09 +00:00Commented Mar 10, 2018 at 12:58
Explore related questions
See similar questions with these tags.
'\r'
character portably. You might need to expect a single'\n'
as well. \$\endgroup\$\r\n
line endings are typical for windows: cs.toronto.edu/~krueger/csc209h/tut/line-endings.html \$\endgroup\$std::getline()
to read the lines delimited with'\n'
in 1st place. The way you have it your code is portable. \$\endgroup\$