Class to read comma separated data from disk

Question 1

After programming a lot in high level languages such as Python and R, I started working with C++ recently. To get my C++ skills up, I of course read a lot of books. In addition, I try and replicate some functionality from high level languages in C++. My first attempt at this (after Hello World :)) was to create a C++ class which could read comma separated files. In essence the class parses the file and reads it into a boost::MultiArray. I would appreciate any feedback on the code.

The links to the code and example data are listed below, they link to my bitbucket account.

The actual code comprises of a cpp file (csv-reader.cpp):

#include <iostream>
#include <sstream>
#include <fstream>
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>
#include <boost/multi_array.hpp>
#include "csv-test.h"
#include <cassert>
template <class T> class csv_reader {
 boost::multi_array<T, 2> content2d ;
 std::vector<std::string> split_line ; 
 std::string line;
 std::string sep ;
 int ncol ; 
 int nrow ; 
public :
 csv_reader(std::string, std::string) ; // constructor
 ~csv_reader(); // desctructor
 void cout_content() ; // print the contents
 T operator() (unsigned row, unsigned column) ;
} ;
// Constructor
template <class T> csv_reader<T>::csv_reader(std::string path, std::string sep = ",")
{
 // Initializing variables
 ncol = 0 ; // Initialize the number of colums to 0
 nrow = 1 ; // Initialize the number of rows to 1
 content2d = boost::multi_array<T, 2> (boost::extents[0][0]) ;
 std::ifstream data(path.c_str()) ;
 // read the csv data
 while(getline(data, line))
 { 
 boost::split(split_line, line, boost::is_any_of(sep) ) ;
 if(ncol == 0) 
 {
 ncol = split_line.size() ;
 }
 else assert(ncol == split_line.size()) ;
 content2d.resize(boost::extents[nrow][ncol]) ;
 for(int i = 0; i < split_line.size(); i++) 
 {
 content2d[nrow - 1][i] = convertToDouble(split_line[i]) ;
 }
 nrow++ ;
 }
}
// Destructor
template <class T> csv_reader<T>::~csv_reader() { }
template <class T> void csv_reader<T>::cout_content()
{
 for(int row = 0; row < (nrow - 1); row++) 
 {
 for(int col = 0; col < ncol ; col++) 
 {
 std::cout << content2d[row][col] << " ";
 }
 std::cout << "\n" ;
 }
}
// Allow access to the contents
template <class T> T csv_reader<T>::operator() (unsigned row, unsigned column)
{ 
 if (row >= nrow || column >= ncol)
 throw BadIndex("boost::MultiArray subscript out of bounds");
 return(content2d[row][column]) ;
}
int main()
{
 // An integer csv reader
 csv_reader<int> csv_obj_int("test.csv") ;
 csv_obj_int.cout_content() ;
 // A double csv reader
 csv_reader<double> csv_obj_double("test.csv") ;
 csv_obj_double.cout_content() ;
 // It also supports direct access to the content using operator()
 std::cout << csv_obj_double(1,1) << "\n" ;
 std::cout << csv_obj_double(1,1) * 5 << "\n" ;
 // This statement fails with a subscript out of bounds error
 // std::cout << csv_obj_double(10,10) * 5 << "\n" ;
 // Testing a different seperator
 csv_reader<double> csv_obj_double_sep2("test_semicol.csv", ";") ;
 csv_obj_double_sep2.cout_content() ;
}

and a header file (csv-test.h):

 // File: convert.h
 #include <iostream>
 #include <sstream>
 #include <string>
 #include <stdexcept>
 class BadConversion : public std::runtime_error {
 public:
 BadConversion(std::string const& s)
 : std::runtime_error(s)
 { }
 };
 class BadIndex : public std::runtime_error {
 public:
 BadIndex(std::string const& s)
 : std::runtime_error(s)
 { }
 };
 inline double convertToDouble(std::string const& s)
 {
 std::istringstream i(s);
 double x;
 if (!(i >> x))
 throw BadConversion("convertToDouble(\"" + s + "\")");
 return x;
 }

The main() test the class on a number of different csv file examples: test.csv and test_semicol.csv.

Question 2

First of all, boost::split() is just a strtok() so it won't work with REAL csv files. CSV parsing is huge topic and it is better avoid this discussion here.

Second, convertToDouble() is not from C++, seems to be a visitor from C#. Even if I am wrong, it is design problem: template is defined for arbitrary class, but "silently" uses double for operation. It is necessary to create proper type casting, or use template specialization.

Strange usage of assert(). It does something in DEBUG builds only, in release build it is empty. If you need debug aids, just put assert() before content2d.resize(boost::extents[nrow][ncol]) ; and add any if that you need to handle the size problem for release builds. Consider throw an exception.

Destructor. You may drop both declaration and implementation of dtor and let the compiler create one for you, but I have a concern.

I do not know, where boost::multi_array allocates its memory, but if it uses stack, you may have troubles with big arrays. Because of this, it may be better to allocate content2d in heap (content2d = new boost::multi_array<T, 2>; and put delete content2d into destructor.

Constructor. It is better to use initializators list, so it may looks like (I agree with Loki about streams):

csv_reader(std::istream& str, std::string const& sep) : content2d( new boost::multi_array<T, 2> (boost::extents[0][0]))

I assume boost::multi_array<T, 2>* content2d declaration.

You do not need ncol and nrow variables, besides local variables in ctor: array contains information about its dimensions, and if you add separate variables for that, you just add another point of failure in your code.

Question 3

Header Files

Should always be protected by include guards.
Otherwise you are sustainable to infinite recursion and multiple definitions.

// Header.h
#ifndef XXX
#define XXX
/// The header file goes here
#endif

XXX Should be a unique identifier. A lot of build systems will generate a GUID for you. If you are doing your own then make it unique by including the namespace and file name.

While we are talking about it it is worth putting all your stuff into a namespace to avoid clashes with other people code.

Source File

There is some stuff here the should be moved to the header file.

template <class T> class csv_reader {

If you want to re-use this should also be in a header file (note because it is a template you also need to put the definitions of the methods in the header file).

Looking at the class you seem to be keeping too man member variables:

template <class T> class csv_reader {
 boost::multi_array<T, 2> content2d ;
 std::vector<std::string> split_line ; 
 std::string line;
 std::string sep ;
 int ncol ; 
 int nrow ; 
public :
 csv_reader(std::string, std::string) ; // constructor
 ~csv_reader(); // desctructor
 void cout_content() ; // print the contents
 T operator() (unsigned row, unsigned column) ;
} ;

The only member variable you need is content2d. The other variables are really function local variables you do not need to store them as part of the class.

The constructor:

csv_reader(std::string, std::string) ; // constructor

You are passing the name of a file.
This is very limiting. It would be more traditional to pass the file as a stream. Then you can read from a file/string/std::cin etc.

You are passing things by value. This can introduce unneeded copying of the values so pass by const reference.

csv_reader(std::istream& str, std::string const& sep) ; // constructor

If you're destructor does not do anything then do not define one.

// Destructor
template <class T> csv_reader<T>::~csv_reader() { }

Best to just use the compiler generated version (which does the above).

Printing the content: You are using the method void cout_content().

Explicitly encoding the stream into the name is a bad idea it makes it less flexable. Pass the stream as a parameter you can default the stream to std::cout if you want.

void printContent(std::ostream& stream = std::cout); // print the contents

I would then also add an output operator operator<< that calls this method.

template <class T>
std::ostream& operator<<(std::ostream& stream, csv_reader<T> const& data)
{
 data.printContent(stream);
 return stream;
}

Indexing into data: you use operator() which is fine. But with a tiny amount of work you can use [] see https://stackoverflow.com/a/1971207/14065

Question 4

I would use boost::lexical_cast for type conversions. Also I consider that you shouldn't limit class to parse csv values of the single type, so I guess template parameters should specify column types as well, so maybe your multi array should keep something like boost::variant

Michael Michael 5892 silver badges7 bronze badges · Answer 1 · 2012-03-19 13:58:01Z

First of all, boost::split() is just a strtok() so it won't work with REAL csv files. CSV parsing is huge topic and it is better avoid this discussion here.

Second, convertToDouble() is not from C++, seems to be a visitor from C#. Even if I am wrong, it is design problem: template is defined for arbitrary class, but "silently" uses double for operation. It is necessary to create proper type casting, or use template specialization.

Strange usage of assert(). It does something in DEBUG builds only, in release build it is empty. If you need debug aids, just put assert() before content2d.resize(boost::extents[nrow][ncol]) ; and add any if that you need to handle the size problem for release builds. Consider throw an exception.

Destructor. You may drop both declaration and implementation of dtor and let the compiler create one for you, but I have a concern.

I do not know, where boost::multi_array allocates its memory, but if it uses stack, you may have troubles with big arrays. Because of this, it may be better to allocate content2d in heap (content2d = new boost::multi_array<T, 2>; and put delete content2d into destructor.

Constructor. It is better to use initializators list, so it may looks like (I agree with Loki about streams):

csv_reader(std::istream& str, std::string const& sep) : content2d( new boost::multi_array<T, 2> (boost::extents[0][0]))

I assume boost::multi_array<T, 2>* content2d declaration.

You do not need ncol and nrow variables, besides local variables in ctor: array contains information about its dimensions, and if you add separate variables for that, you just add another point of failure in your code.

Loki Astari Loki Astari 97.7k5 gold badges126 silver badges341 bronze badges · Answer 2 · 2012-03-19 03:09:28Z

Header Files

Should always be protected by include guards.
Otherwise you are sustainable to infinite recursion and multiple definitions.

// Header.h
#ifndef XXX
#define XXX
/// The header file goes here
#endif

XXX Should be a unique identifier. A lot of build systems will generate a GUID for you. If you are doing your own then make it unique by including the namespace and file name.

While we are talking about it it is worth putting all your stuff into a namespace to avoid clashes with other people code.

Source File

There is some stuff here the should be moved to the header file.

template <class T> class csv_reader {

If you want to re-use this should also be in a header file (note because it is a template you also need to put the definitions of the methods in the header file).

Looking at the class you seem to be keeping too man member variables:

template <class T> class csv_reader {
 boost::multi_array<T, 2> content2d ;
 std::vector<std::string> split_line ; 
 std::string line;
 std::string sep ;
 int ncol ; 
 int nrow ; 
public :
 csv_reader(std::string, std::string) ; // constructor
 ~csv_reader(); // desctructor
 void cout_content() ; // print the contents
 T operator() (unsigned row, unsigned column) ;
} ;

The only member variable you need is content2d. The other variables are really function local variables you do not need to store them as part of the class.

The constructor:

csv_reader(std::string, std::string) ; // constructor

You are passing the name of a file.
This is very limiting. It would be more traditional to pass the file as a stream. Then you can read from a file/string/std::cin etc.

You are passing things by value. This can introduce unneeded copying of the values so pass by const reference.

csv_reader(std::istream& str, std::string const& sep) ; // constructor

If you're destructor does not do anything then do not define one.

// Destructor
template <class T> csv_reader<T>::~csv_reader() { }

Best to just use the compiler generated version (which does the above).

Printing the content: You are using the method void cout_content().

Explicitly encoding the stream into the name is a bad idea it makes it less flexable. Pass the stream as a parameter you can default the stream to std::cout if you want.

void printContent(std::ostream& stream = std::cout); // print the contents

I would then also add an output operator operator<< that calls this method.

template <class T>
std::ostream& operator<<(std::ostream& stream, csv_reader<T> const& data)
{
 data.printContent(stream);
 return stream;
}

Indexing into data: you use operator() which is fine. But with a tiny amount of work you can use [] see https://stackoverflow.com/a/1971207/14065

Ivan Fateev Ivan Fateev 2011 silver badge4 bronze badges · Answer 3 · 2014-06-15 16:18:47Z

I would use boost::lexical_cast for type conversions. Also I consider that you shouldn't limit class to parse csv values of the single type, so I guess template parameters should specify column types as well, so maybe your multi array should keep something like boost::variant

Stack Exchange Network

Class to read comma separated data from disk

3 Answers 3

Header Files

Source File

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Class to read comma separated data from disk

3 Answers 3

Header Files

Source File

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions