A header-only library to read CSV

Question 1

I started to learn C++ after programming for some years in Java, and this is my first (header-only) library. When switching from another language, it is always difficult not to project the old habits into the new medium, and thus I wonder how idiomatic my C++ code is, so it would be awesome if someone reviewed this piece.

#ifndef __FILE_READER_H__
#define __FILE_READER_H__
#include <filesystem>
#include <fstream>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>
// API
namespace octarine {
 // Read a CSV file into a vector of vectors of strings.
 // Each string is trimmed of whitespace.
 inline std::vector<std::vector<std::string>> read_csv(
 const std::filesystem::path &filename,
 bool has_header = true,
 char separator = ',');
 // Read a CSV file into a vector of objects of a predefined type.
 template <typename T>
 inline std::vector<T> read_csv(
 const std::filesystem::path &filename,
 const std::function<T(const std::vector<std::string>&)>& mapper,
 bool has_header = true,
 char separator = ','
 );
}
// Implementation
namespace octarine {
 namespace {
 size_t count_items(const std::string& line, char separator);
 std::vector<std::string> parse_line(const std::string &line, char separator, size_t num_items, size_t line_number);
 const char* k_whitespace = " \t";
 }
 template <typename T>
 std::vector<T> read_csv(
 const std::filesystem::path &filename,
 const std::function<T(const std::vector<std::string>&)>& mapper,
 bool has_header,
 char separator) {
 const auto& lines = read_csv(filename, has_header, separator);
 std::vector<T> result;
 result.reserve(lines.size());
 for (const auto& line : lines) {
 result.emplace_back(mapper(line));
 }
 return result;
 }
 std::vector<std::vector<std::string>> read_csv(
 const std::filesystem::path &filename,
 bool has_header,
 char separator) {
 std::ifstream f(filename);
 if (!f.good()) {
 throw std::invalid_argument("unable to read file '" + filename.string() + "'");
 }
 // read header line
 std::string header;
 std::getline(f, header);
 if (!f.good()) {
 throw std::invalid_argument("error reading header line from '" + filename.string() + "'");
 }
 // count number of items per line
 size_t num_items = count_items(header, separator);
 // if we don't have the header, add the first line to the results
 std::vector<std::vector<std::string>> lines;
 size_t line_number = 1;
 if (!has_header) {
 lines.push_back(parse_line(header, separator, num_items, line_number));
 line_number += 1;
 }
 std::string line;
 while (!f.bad()) {
 std::getline(f, line);
 if (f.eof()) {
 break;
 }
 if (f.fail()) {
 throw std::invalid_argument("error reading line from '" + filename.string() + "'");
 }
 lines.push_back(parse_line(line, separator, num_items, line_number));
 line_number += 1;
 }
 return lines;
 }
 namespace {
 // counts number of comma-separated items in a line from a CSV file
 size_t count_items(const std::string &line, char separator) {
 size_t count = 1;
 for (char c : line) {
 if (c == separator) {
 ++count;
 }
 }
 return count;
 }
 // splits a line from a CSV file when the number of items per line is known in advance
 std::vector<std::string> parse_line(
 const std::string &line,
 char separator,
 size_t num_items,
 size_t line_number) {
 if (num_items == 0) {
 throw std::invalid_argument("number of items must be positive");
 }
 std::vector<std::string> result(num_items);
 size_t item = 0;
 size_t offset = 0, end_offset = 0;
 size_t max_offset = line.size();
 size_t index;
 while (end_offset != max_offset) {
 if (item >= num_items) {
 throw std::length_error(
 "line " + std::to_string(line_number) + ": found more items in a line than expected");
 }
 index = line.find(separator, offset);
 end_offset = (index != std::string::npos) ? index : max_offset;
 size_t non_space_start = line.find_first_not_of(k_whitespace, offset);
 size_t non_space_end = line.find_last_not_of(k_whitespace, end_offset - 1);
 if (non_space_start == std::string::npos || non_space_end == std::string::npos ||
 non_space_start == index) {
 result[item] = "";
 } else {
 result[item] = line.substr(non_space_start, non_space_end - non_space_start + 1);
 }
 offset = end_offset + 1;
 item += 1;
 }
 return result;
 }
 }
}
#endif

Question 2

A few observations:

I would never write x += 1; if I wanted to increment x. I think this is confusing and more prone to error than writing ++x;. Of course, there should be no difference for the compiler.
In count_items, do you really want to return 1 even if there are no matches? Be as it may, you don't have to write an explicit loop as this is equivalent to using std::count, i.e., you could just write
```
std::size_t count_items(const std::string &line, char separator) {
 return std::count(line.cbegin(), line.cend(), separator) + 1;
}
```
I find your split function parse_line quite difficult to read. There are several alternatives to this including many in Boost (Tokenizer, boost::algorithm::split, ...).

Question 3

Thanks for the review! Regarding count_items, "1" definitely should be returned if there are no commas found (e.g. "item"), and "2" if there is one comma ("item1,item2"), so this behaviour is correct, however using std::count is much better! As for using boost, this library is intended to be easy to include as a header and thus no dependencies is a must.

Question 4

Basic review

A couple of simple mistakes that are easily addressed:

We're using a reserved identifier in the include guard - anything beginning with _ followed by a letter, or containing __ anywhere, is reserved for the implementation.
size_t is defined in the std namespace, so should be written std::size_t. Implementations are allowed to also define it in the global namespace, but as that's not required, it's a portability error to rely on that.
std::invalid_argument is a misleading exception to throw for I/O failure - use std::ios_base::failure for this (and consider using the stream's exceptions() method to cause the stream to throw this for itself).

Slightly more involved:

std::vector<std::string> result(num_items) will construct num_items string objects, but we're going to overwrite them all. It's more efficient to create an empty result, then reserve() enough space for us to push_back() (or emplace_back()) each string.

Testing

This is the kind of code that would benefit from some unit tests to give confidence in its correctness.

It's hard to test as it stands, as the interface only admits named files, but we can improve that by splitting the functions' responsibilities so that we have one function that deals with file access and another that accepts a std::istream& and does the actual parsing.

We would really like to test parse_line() in isolation, so I'd be inclined to make that part of the public interface, rather than being hidden in the anonymous namespace.

Interface/usability

The second version would be better if it created T objects on the fly, rather than constructing an entire std::vector<std::vector<std::string>> before starting to construct the first T object. I think we could combine the two functions into one, if we provide a default mapper that just returns the std::vector<std::string>& that it was given (take care to ensure we have no unnecessary copying, though - Valgrind can help with that).

Consider providing an interface that accepts an output iterator to which it can write the results, rather than storing everything into a std::vector. That will be more useful to programs that are able to operate on one line at a time (e.g. a cut-like utility to select particular columns). It's easy to adapt that to give the current behaviour (by providing a back-inserter to create a vector), but impossible to adapt in the opposite direction.

Maintainability

count_items() duplicates the parsing logic, and must be kept in sync with parse_line(). If we improve the parsing (to deal with quoted columns, for instance), then we have two places that need to change. We can make this easier for ourselves if we create a utility function that can identify the end of a value and the start of the next value, if there is one.

Juho JuhoJuho 3,63921 silver badges18 bronze badges · Accepted Answer · 2020-05-16 10:56:28Z

A few observations:

I would never write x += 1; if I wanted to increment x. I think this is confusing and more prone to error than writing ++x;. Of course, there should be no difference for the compiler.
In count_items, do you really want to return 1 even if there are no matches? Be as it may, you don't have to write an explicit loop as this is equivalent to using std::count, i.e., you could just write
```
std::size_t count_items(const std::string &line, char separator) {
 return std::count(line.cbegin(), line.cend(), separator) + 1;
}
```
I find your split function parse_line quite difficult to read. There are several alternatives to this including many in Boost (Tokenizer, boost::algorithm::split, ...).

Thanks for the review! Regarding count_items, "1" definitely should be returned if there are no commas found (e.g. "item"), and "2" if there is one comma ("item1,item2"), so this behaviour is correct, however using std::count is much better! As for using boost, this library is intended to be easy to include as a header and thus no dependencies is a must.

Stack Exchange Network

A header-only library to read CSV

2 Answers 2

Basic review

Testing

Interface/usability

Maintainability

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

A header-only library to read CSV

2 Answers 2

Basic review

Testing

Interface/usability

Maintainability

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions