Read numeric data from a text file using C++

Question 1

I need to read numeric data from a text file that looks like this:

2
cell X-cord Y-cord 
1 1.1 2.2
2 2.1 2.0

The first line gives the dimension (2 or 3); the second line is just a header that can be ignored. The file may also be comma separated, in which case it looks like this:

3
cell X-cord Y-Cord Z-Cord
1, 1.1, 2.2, 3.2

Of course this is just a short example. The real file has hundreds of thousands of cells, so I want to improve the performance.

When I have a comma separated file I use this code:

std::istringstream iss; // used to iterate over the elements in a line
while (std::getline(myfile, line))
{
    std::replace(line.begin(), line.end(), ',', ' '); // replace commas with spaces
    iss.str(line); // lets us iterate over the elements of line
    if (dim == 2) // 2D problem
    {
        iss >> cell >> x >> y;
        cell_vector.push_back(cell);
        x_vector.push_back(x);
        y_vector.push_back(y);
    }
    else // 3D problem
    {
        iss >> cell >> x >> y >> z;
        cell_vector.push_back(cell);
        x_vector.push_back(x);
        y_vector.push_back(y);
        z_vector.push_back(z);
    }
    iss.clear(); // reset the stream state so that we can assign a new line to it
}

When the file is not comma separated I use this code:

int i = 0; // used to know where to save a
while (my_file >> a)
{
    if (dim == 2)
    {
        switch (i % 3)
        {
        case 0:
            cell_vector.push_back(a);
            break;
        case 1:
            x_vector.push_back(a);
            break;
        case 2:
            y_vector.push_back(a);
            break;
        }
        i++;
    }
    else
    {
        // similar but for 3 dimensions
    }
}

I have tested both codes independently and they work fine. However, I don't know beforehand if the file is comma separated or not, so I thought of doing something like this:

int dim;
input_v >> dim; // get the dimension
std::getline(input_v, line);
std::getline(input_v, line);
std::getline(input_v, line); // 3 times to skip the "cell X-cord...." part
bool flag_comma = false; // used later to determine whether the file is comma separated
for (std::size_t i = 0; i < line.length(); i++)
{
    if (line[i] == ',')
    {
        flag_comma = true;
        break;
    }
}
if (flag_comma)
{
    // code for comma separated file
}
else
{
    // code for space separated file
}

Everything works fine except that the program only starts to save everything starting from cell 2 onwards... I can't save the data from cell 1.

The only way I found around it is to add this:

std::replace(line.begin(), line.end(), ',', ' '); // replace commas with spaces
iss.str(line); // lets us iterate over the elements of line
if (dim == 2) // 2D problem
{
    iss >> cell >> x >> y;
}
else // 3D problem
{
    iss >> cell >> x >> y >> z;
}
iss.clear(); // reset the stream state so that we can assign a new line to it

I put this before the if (flag_comma) part, regardless of the case (comma or space). It is the same piece of code used inside the while loop for the comma separated part.

It is also worth noting that the "comma separated code" works just as well when the file has no commas (i.e. it is space separated). I wrote the space separated part only because I want to improve performance; I don't really know how expensive the replace function is.

So my questions are:

A) Do I really improve performance this way? Instead of just keeping the comma separated version of the code?

B) How can I improve my code (in general)?

C) Is there a better way to get the data from cell 1?

D) Can I improve something memory-wise?

Question 2

This problem is a perfect fit for std::ctype<char> (note that this specialization is very different from the primary template). Just classify the comma as whitespace.
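
A minimal sketch of the std::ctype<char> idea follows. The names comma_is_space and read_numbers are hypothetical and not from the thread, and the input is taken from a string only to keep the example short. The facet copies the classic classification table and additionally marks ',' as whitespace, so operator>> skips commas the same way it skips blanks:

```cpp
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical facet: classifies ',' as whitespace in addition to the
// characters the classic "C" locale already treats as whitespace.
struct comma_is_space : std::ctype<char> {
    comma_is_space() : std::ctype<char>(make_table()) {}

    static const mask* make_table() {
        // Copy the classic table once and add the space bit for ','.
        static std::vector<mask> table(classic_table(),
                                       classic_table() + table_size);
        table[static_cast<unsigned char>(',')] |= space;
        return table.data();
    }
};

// Hypothetical helper: extracts every number from text, whether the
// numbers are separated by spaces, commas, or both.
std::vector<double> read_numbers(const std::string& text) {
    std::istringstream iss(text);
    // The locale takes ownership of the facet and deletes it when done.
    iss.imbue(std::locale(iss.getloc(), new comma_is_space));

    std::vector<double> values;
    double v;
    while (iss >> v) {
        values.push_back(v);
    }
    return values;
}
```

With this facet, a line such as "1, 1.1, 2.2, 3.2" and a line such as "1 1.1 2.2 3.2" go through the same extraction code, and no pass over the line to replace commas is needed. For a file, the same imbue call would be made on the std::ifstream before reading from it.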

Question 3

Just to be sure, you do reserve a sensible amount of space for your vectors? Otherwise memory reallocations are going to be one of the main slowdowns of your program. If you do not know the number of lines beforehand you should be able to estimate them from the size of the file.
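
As a rough sketch of estimating the row count from the file size: the helpers file_size_bytes and estimate_rows below are hypothetical names, and the estimate assumes that one sampled data line is representative of the others in length. It only yields a figure to pass to reserve, not an exact count:

```cpp
#include <cstddef>
#include <fstream>
#include <ios>
#include <string>

// Hypothetical helper: size of the file in bytes, or 0 if it cannot be opened.
std::size_t file_size_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate); // open at the end
    if (!in) {
        return 0;
    }
    const std::streamoff end = in.tellg();
    return end < 0 ? 0 : static_cast<std::size_t>(end);
}

// Hypothetical helper: approximate number of rows, given the file size and
// the length (including the newline) of one representative data line.
std::size_t estimate_rows(std::size_t file_bytes, std::size_t sample_line_bytes) {
    if (sample_line_bytes == 0) {
        return 0;
    }
    return file_bytes / sample_line_bytes + 1; // round up a little
}
```

The result would then be passed to reserve on each vector before the read loop starts. If the lines vary a lot in length, sampling a shorter line gives a more generous estimate.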

Question 4

Check out "ignore punctuation using manipulator". Basically, you can get the stream to treat the comma as if it were a space by imbuing it with a custom std::codecvt. That way you don't need to run over the line and remove commas from potentially arbitrarily long lines.

Question 5

@miscco Yes, I reserve some space for my vectors. I have another file (related to this one) that tells me exactly how many points there are. That number is only an upper bound, though; the file I am talking about usually has fewer cells.

Question 6

A) I don't think that you improved performance of your code in this way. In both methods you get the single numbers with ">>", which is IMHO the best way to get the numbers. So I think it does not matter much.

B) First, you can improve your code around the "push_back" calls. This command regularly needs to reallocate your data to fit it into a new part of main memory. If you know your total number of lines beforehand, you can call "reserve" with that number before the first push_back. Depending on how many lines you have in total, this might save you a lot of time. There are more opportunities, just look it up here. Second, you can use binary data instead of a textual representation. In your example the code has to read multiple chars for every single number. With binary data every value has a fixed size, independent of the number of decimal digits. Moreover, the numbers are then already in the correct in-memory representation.

C) IMHO your approach is ok.

D) That, I cannot answer. I think there is a way to use streams, if you don't already do this.
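
The binary-data suggestion in B) could look roughly like the sketch below. The layout (a 64-bit element count followed by the raw doubles) and the names write_binary, read_binary and round_trip are assumptions made for illustration, not a format used by the asker. The raw bytes are only portable between machines with the same endianness and floating-point representation:

```cpp
#include <cstddef>
#include <cstdint>
#include <ios>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

// Writes the element count followed by the raw bytes of the values.
void write_binary(std::ostream& out, const std::vector<double>& values) {
    const std::uint64_t n = values.size();
    out.write(reinterpret_cast<const char*>(&n), sizeof n);
    out.write(reinterpret_cast<const char*>(values.data()),
              static_cast<std::streamsize>(n * sizeof(double)));
}

// Reads data written by write_binary; returns an empty vector on failure.
std::vector<double> read_binary(std::istream& in) {
    std::uint64_t n = 0;
    if (!in.read(reinterpret_cast<char*>(&n), sizeof n)) {
        return {};
    }
    std::vector<double> values(static_cast<std::size_t>(n)); // one allocation
    in.read(reinterpret_cast<char*>(values.data()),
            static_cast<std::streamsize>(n * sizeof(double)));
    if (!in) {
        values.clear(); // truncated input
    }
    return values;
}

// In-memory usage example; a real program would use file streams opened
// with std::ios::binary instead of a stringstream.
std::vector<double> round_trip(const std::vector<double>& values) {
    std::stringstream buffer(std::ios::in | std::ios::out | std::ios::binary);
    write_binary(buffer, values);
    return read_binary(buffer);
}
```

Because the count is stored up front, the reader allocates exactly once and no text parsing takes place. This only helps if the file can be converted once and then read many times, since producing the binary file still requires parsing the text version.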

Question 7

push_back() has an amortized complexity of O(1) (i.e. constant). This is explicitly specified by the standard. So saying that it regularly needs to reallocate is not true: the number of reallocations stays small even for a large number of pushes. That said, a single reallocation can be expensive, so using reserve() is totally good advice.
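
To make the difference concrete, here is a small sketch that counts how often the capacity of a vector changes while elements are pushed. The function name count_reallocations is hypothetical, and the exact count without reserve depends on the growth policy of the standard library implementation:

```cpp
#include <cstddef>
#include <vector>

// Counts capacity changes (i.e. reallocations) while pushing n elements,
// optionally reserving the full size up front.
std::size_t count_reallocations(std::size_t n, bool reserve_first) {
    std::vector<double> v;
    if (reserve_first) {
        v.reserve(n);
    }
    std::size_t reallocations = 0;
    std::size_t capacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<double>(i));
        if (v.capacity() != capacity) { // storage was reallocated
            ++reallocations;
            capacity = v.capacity();
        }
    }
    return reallocations;
}
```

With reserve_first set to true the count is zero. Without it, implementations that grow the capacity geometrically reallocate only a few dozen times for hundreds of thousands of elements, which matches the amortized O(1) guarantee, although each of those reallocations copies everything stored so far.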

Question 8

So here are my 2 cents.

You should reserve the memory of your vectors beforehand. If you do not know the actual number of lines, you should still be able to guess it more or less correctly from the size of the file. What you really want is to avoid repeated reallocations, which are bound to happen when pushing hundreds of thousands of elements.

Have a look here: https://stackoverflow.com/questions/11719538/how-to-use-stringstream-to-separate-comma-separated-strings

In short, you can pass an additional delimiter to std::getline when reading from the stringstream, which eliminates the need to modify the string of the line.
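
A minimal sketch of that approach, assuming a comma separated data line. The function name split_numbers is hypothetical. std::getline is given ',' as its delimiter, and each token is converted with std::strtod, which skips leading blanks:

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Splits line at delim and converts every token that starts with a number.
std::vector<double> split_numbers(const std::string& line, char delim = ',') {
    std::vector<double> values;
    std::istringstream iss(line);
    std::string token;
    while (std::getline(iss, token, delim)) {
        char* end = nullptr;
        const double v = std::strtod(token.c_str(), &end);
        if (end != token.c_str()) { // at least one character was converted
            values.push_back(v);
        }
    }
    return values;
}
```

Note that this variant only suits the comma separated layout: a space separated line contains no delimiter, so it arrives as a single token and only its first number is converted.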

Helion · Answer 1 · 2017-04-12 16:25:35Z

miscco · Answer 2 · 2017-04-12 18:05:53Z

