I need to read numeric data from a text file that looks like this:
2
cell X-cord Y-cord
1 1.1 2.2
2 2.1 2.0
The first line indicates the dimension (2 or 3); the second line is a header that can be ignored. The file can also be comma-separated, in which case it looks like this:
3
cell X-cord Y-Cord Z-Cord
1, 1.1, 2.2, 3.2
Of course this is just a short example; the real file has hundreds of thousands of cells, so I want to improve the performance.
When I have a comma-separated file I use this code:
std::istringstream iss; // used to iterate over the elements in line
while (std::getline(myfile,line))
{
    replace(line.begin(),line.end(),',',' '); // replaces commas with spaces
    iss.str(line); // lets us iterate over the elements of line
    if (dim==2) // if it is a 2D problem
    {
        iss>>cell>>x>>y;
        cell_vector.push_back(cell);
        x_vector.push_back(x);
        y_vector.push_back(y);
    }
    else
    {
        iss>>cell>>x>>y>>z;
        cell_vector.push_back(cell);
        x_vector.push_back(x);
        y_vector.push_back(y);
        z_vector.push_back(z);
    }
    iss.clear(); // reset the stream state so that we can assign a new line to it
}
When the file is not comma separated I use this code:
int i=0; // used to know in which vector to save a
while (my_file>>a)
{
    if (dim==2)
    {
        switch(i%3)
        {
            case 0:
                cell_vector.push_back(a);
                break;
            case 1:
                x_vector.push_back(a);
                break;
            case 2:
                y_vector.push_back(a);
                break;
        }
        i++;
    }
    else
    {
        // similar but for 3 dimensions
    }
}
I have tested both codes independently and they work fine. However, I don't know beforehand if the file is comma separated or not, so I thought of doing something like this:
int dim;
input_v>>dim; // to get the information regarding dimension
std::getline(input_v,line);
std::getline(input_v,line);
std::getline(input_v,line); // 3 times to skip the "cell X-cord...." part
bool flag_comma=false; // used to later determine if file is comma separated
for (std::size_t i=0;i<line.length();i++)
{
    if (line[i]==',')
    {
        flag_comma=true;
        break;
    }
}
if (flag_comma)
{
    // code for comma separated file
}
else
{
    // code for space separated file
}
Everything works fine except that the program only starts to save everything starting from cell 2 onwards... I can't save the data from cell 1.
The only way I found around it is to add this:
replace(line.begin(),line.end(),',',' '); // replaces commas with spaces
iss.str(line); // lets us iterate over the elements of line
if (dim==2) // if it is a 2D problem
{
    iss>>cell>>x>>y;
}
else
{
    iss>>cell>>x>>y>>z;
}
iss.clear(); // reset the stream state so that we can assign a new line to it
before the if (flag_comma) part, regardless of the case (comma or space). It is the same piece of code used inside the while loop for the comma-separated part.
It is also worth noting that the "comma-separated code" works just as well when the file has no commas (i.e. it is space separated). The reason I wrote the space-separated part is that I want to improve performance. I don't really know how expensive the replace function is.
So my questions are:
A) Do I really improve performance this way, instead of just keeping the comma-separated version of the code?
B) How can I improve my code (in general)?
C) Is there a better way to get the data from cell 1?
D) Can I improve something memory-wise?
2 Answers
A) I don't think that you improved performance of your code in this way. In both methods you get the single numbers with ">>", which is IMHO the best way to get the numbers. So I think it does not matter much.
B) First, you can improve your code with regard to the "push_back" calls. This command regularly needs to reallocate your data to fit in a specific part of the main memory. If you know the total number of lines beforehand, you can call "reserve" with that number before the first push_back. Depending on how many lines you have in total, this might save you a lot of time. There are more opportunities, just look it up here. Second, you can use binary data instead of a textual representation. In your example, the code has to read multiple chars for every single number. With binary data, every value has a fixed size, independent of its decimal length. Moreover, the numbers are then already in the correct in-memory representation.
C) IMHO your approach is ok.
D) That, I cannot answer. I think there is a way to use streams, if you don't already do this.
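On question C: in the question's detection snippet, the first std::getline only finishes the dimension line and the second already reads the header, so the third one consumes the cell 1 line. A minimal sketch of one way around this, assuming the comma-replacing path is kept for both kinds of file so that no detection line has to be read ahead. The names CellData and read_cells are made up for illustration, and the cell id is assumed to be an integer, which the question does not state:

```cpp
#include <algorithm>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for the parsed columns (not from the question).
struct CellData {
    int dim = 0;
    std::vector<int> cell;          // assumption: cell ids are integers
    std::vector<double> x, y, z;    // z stays empty for 2D files
};

// Reads the format described in the question from any input stream.
CellData read_cells(std::istream& in) {
    CellData data;
    std::string line;

    in >> data.dim;           // first line: dimension (2 or 3)
    std::getline(in, line);   // consume the rest of the dimension line
    std::getline(in, line);   // skip the header line ("cell X-cord ...")

    std::istringstream iss;
    // The loop starts at the first data line, so cell 1 is not skipped.
    while (std::getline(in, line)) {
        std::replace(line.begin(), line.end(), ',', ' ');
        iss.clear();          // reset eof/fail flags left by the previous line
        iss.str(line);

        int c;
        double x, y, z = 0.0;
        if (!(iss >> c >> x >> y)) continue;          // skip blank or malformed lines
        if (data.dim == 3 && !(iss >> z)) continue;   // 3D rows need a z value

        data.cell.push_back(c);
        data.x.push_back(x);
        data.y.push_back(y);
        if (data.dim == 3) data.z.push_back(z);
    }
    return data;
}
```

Because the same loop handles comma-separated and space-separated rows, the comma check and the extra parsing step before the if (flag_comma) part are no longer needed. Whether this is fast enough for the real file would have to be measured.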
push_back() has an amortized complexity of O(1) (i.e. constant). This is explicitly specified by the standard. So saying it regularly needs to re-allocate is not true. The number of times it reallocates is defined to be minimal for a large number of pushes. Though saying it can be expensive and using reserve() is a good idea is totally true and good advice. – Loki Astari, Apr 12, 2017 at 20:06
So here are my 2 cents.
You should reserve the memory of your vectors beforehand. If you do not know the actual number of lines, you should still be able to guess it more or less correctly from the size of the file. What you really want is to avoid repeated reallocations, which are guaranteed when pushing hundreds of thousands of elements.
Have a look here: https://stackoverflow.com/questions/11719538/how-to-use-stringstream-to-separate-comma-separated-strings
In short, you can pass an additional delimiter to std::getline when reading from the stringstream, which eliminates the need to modify the string of the line.
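A small sketch of that idea, assuming a line that really is comma separated. parse_csv_line is a hypothetical helper, and std::stod is my choice for the conversion, not something taken from the linked answer:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Splits one comma-separated line into numbers without modifying the line.
// std::stod skips leading whitespace, so "1, 1.1" works as well as "1,1.1".
// It throws std::invalid_argument if a field is empty or not a number.
std::vector<double> parse_csv_line(const std::string& line) {
    std::vector<double> values;
    std::istringstream iss(line);
    std::string token;
    while (std::getline(iss, token, ',')) {   // ',' is the delimiter
        values.push_back(std::stod(token));
    }
    return values;
}
```

Note that this only splits on commas: a space-separated line such as "1 1.1 2.2" would come back as a single token, so it does not replace the comma detection by itself.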
std::ctype<char> (note that it is very different from its template version). Just set the comma to be space.
std::codecvt. That way you don't need to run over the line and remove commas from potentially arbitrarily long lines.