I started to learn C++ after programming for some years in Java, and this is my first (header-only) library. When switching from another language, it is always difficult not to project the old habits into the new medium, and thus I wonder how idiomatic my C++ code is, so it would be awesome if someone reviewed this piece.
#ifndef __FILE_READER_H__
#define __FILE_READER_H__
#include <filesystem>
#include <fstream>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>
// API
namespace octarine {
// Read a CSV file into a vector of vectors of strings.
// Each string is trimmed of whitespace.
inline std::vector<std::vector<std::string>> read_csv(
const std::filesystem::path &filename,
bool has_header = true,
char separator = ',');
// Read a CSV file into a vector of objects of a predefined type.
template <typename T>
inline std::vector<T> read_csv(
const std::filesystem::path &filename,
const std::function<T(const std::vector<std::string>&)>& mapper,
bool has_header = true,
char separator = ','
);
}
// Implementation
namespace octarine {
namespace {
size_t count_items(const std::string& line, char separator);
std::vector<std::string> parse_line(const std::string &line, char separator, size_t num_items, size_t line_number);
const char* k_whitespace = " \t";
}
template <typename T>
std::vector<T> read_csv(
const std::filesystem::path &filename,
const std::function<T(const std::vector<std::string>&)>& mapper,
bool has_header,
char separator) {
const auto& lines = read_csv(filename, has_header, separator);
std::vector<T> result;
result.reserve(lines.size());
for (const auto& line : lines) {
result.emplace_back(mapper(line));
}
return result;
}
std::vector<std::vector<std::string>> read_csv(
const std::filesystem::path &filename,
bool has_header,
char separator) {
std::ifstream f(filename);
if (!f.good()) {
throw std::invalid_argument("unable to read file '" + filename.string() + "'");
}
// read header line
std::string header;
std::getline(f, header);
if (!f.good()) {
throw std::invalid_argument("error reading header line from '" + filename.string() + "'");
}
// count number of items per line
size_t num_items = count_items(header, separator);
// if we don't have the header, add the first line to the results
std::vector<std::vector<std::string>> lines;
size_t line_number = 1;
if (!has_header) {
lines.push_back(parse_line(header, separator, num_items, line_number));
line_number += 1;
}
std::string line;
while (!f.bad()) {
std::getline(f, line);
if (f.eof()) {
break;
}
if (f.fail()) {
throw std::invalid_argument("error reading line from '" + filename.string() + "'");
}
lines.push_back(parse_line(line, separator, num_items, line_number));
line_number += 1;
}
return lines;
}
namespace {
// counts number of comma-separated items in a line from a CSV file
size_t count_items(const std::string &line, char separator) {
size_t count = 1;
for (char c : line) {
if (c == separator) {
++count;
}
}
return count;
}
// splits a line from a CSV file when the number of items per line is known in advance
std::vector<std::string> parse_line(
const std::string &line,
char separator,
size_t num_items,
size_t line_number) {
if (num_items == 0) {
throw std::invalid_argument("number of items must be positive");
}
std::vector<std::string> result(num_items);
size_t item = 0;
size_t offset = 0, end_offset = 0;
size_t max_offset = line.size();
size_t index;
while (end_offset != max_offset) {
if (item >= num_items) {
throw std::length_error(
"line " + std::to_string(line_number) + ": found more items in a line than expected");
}
index = line.find(separator, offset);
end_offset = (index != std::string::npos) ? index : max_offset;
size_t non_space_start = line.find_first_not_of(k_whitespace, offset);
size_t non_space_end = line.find_last_not_of(k_whitespace, end_offset - 1);
if (non_space_start == std::string::npos || non_space_end == std::string::npos ||
non_space_start == index) {
result[item] = "";
} else {
result[item] = line.substr(non_space_start, non_space_end - non_space_start + 1);
}
offset = end_offset + 1;
item += 1;
}
return result;
}
}
}
#endif
2 Answers 2
A few observations:
I would never write
x += 1;
if I wanted to incrementx
. I think this is confusing and more prone to error than writing++x;
. Of course, there should be no difference for the compiler.In
count_items
, do you really want to return 1 even if there are no matches? Be as it may, you don't have to write an explicit loop as this is equivalent to usingstd::count
, i.e., you could just writestd::size_t count_items(const std::string &line, char separator) { return std::count(line.cbegin(), line.cend(), separator) + 1; }
I find your split function
parse_line
quite difficult to read. There are several alternatives to this including many in Boost (Tokenizer, boost::algorithm::split, ...).
-
\$\begingroup\$ Thanks for the review! Regarding count_items, "1" definitely should be returned if there are no commas found (e.g. "item"), and "2" if there is one comma ("item1,item2"), so this behaviour is correct, however using std::count is much better! As for using boost, this library is intended to be easy to include as a header and thus no dependencies is a must. \$\endgroup\$matsurago– matsurago2020年05月17日 03:56:33 +00:00Commented May 17, 2020 at 3:56
Basic review
A couple of simple mistakes that are easily addressed:
- We're using a reserved identifier in the include guard - anything beginning with
_
followed by a letter, or containing__
anywhere, is reserved for the implementation. size_t
is defined in thestd
namespace, so should be writtenstd::size_t
. Implementations are allowed to also define it in the global namespace, but as that's not required, it's a portability error to rely on that.std::invalid_argument
is a misleading exception to throw for I/O failure - usestd::ios_base::failure
for this (and consider using the stream'sexceptions()
method to cause the stream to throw this for itself).
Slightly more involved:
std::vector<std::string> result(num_items)
will constructnum_items
string objects, but we're going to overwrite them all. It's more efficient to create an empty result, thenreserve()
enough space for us topush_back()
(oremplace_back()
) each string.
Testing
This is the kind of code that would benefit from some unit tests to give confidence in its correctness.
It's hard to test as it stands, as the interface only admits named files, but we can improve that by splitting the functions' responsibilities so that we have one function that deals with file access and another that accepts a std::istream&
and does the actual parsing.
We would really like to test parse_line()
in isolation, so I'd be inclined to make that part of the public interface, rather than being hidden in the anonymous namespace.
Interface/usability
The second version would be better if it created T
objects on the fly, rather than constructing an entire std::vector<std::vector<std::string>>
before starting to construct the first T
object. I think we could combine the two functions into one, if we provide a default mapper that just returns the std::vector<std::string>&
that it was given (take care to ensure we have no unnecessary copying, though - Valgrind can help with that).
Consider providing an interface that accepts an output iterator to which it can write the results, rather than storing everything into a std::vector
. That will be more useful to programs that are able to operate on one line at a time (e.g. a cut
-like utility to select particular columns). It's easy to adapt that to give the current behaviour (by providing a back-inserter to create a vector), but impossible to adapt in the opposite direction.
Maintainability
count_items()
duplicates the parsing logic, and must be kept in sync with parse_line()
. If we improve the parsing (to deal with quoted columns, for instance), then we have two places that need to change. We can make this easier for ourselves if we create a utility function that can identify the end of a value and the start of the next value, if there is one.