I have written a function in c++, which accepts an "ifstream" object (file) to read from and a string to match a header token in the file. This way I am counting how many times a token appears in the file and that makes up my frame count. The data file is a repeating unit of the one shown below
ITEM: TIME
0
ITEM: TIMESTEP
0
ITEM: NUMBER OF ATOMS
292
ITEM: BOX BOUNDS pp pp pp
0.0000000000000000e+00 2.5600000000000000e+02
0.0000000000000000e+00 2.5600000000000000e+02
-1.0000000000000001e-01 1.0000000000000001e-01
ITEM: ATOMS id type x y
1 1 44.6432 78.3496
2 1 229.256 99.4456
3 1 43.533 41.526
4 1 4.6025 134.388
...
...
So, in an attempt to count the repeating unit, I simply count the occurrence of the string "TIME", through "matchstr" in the function given below:
template<typename P>
int frames(const std::string& matchstr, P& file)
{
std::string str;
int iframes=0;
if(!file.seekg(0,std::ios_base::beg))return 0;
while(true)
{
file>>str;
if(file.fail())break;
if(str==matchstr)iframes++;
}
return iframes;
}
I want to know the pitfalls attached to this function and also how else I can efficiently count the frames. The second question will be how to make it run faster using OpenMP or MPI?
-
\$\begingroup\$ How big is the file? You may benefit from reading the entire thing into memory first. \$\endgroup\$Reinderien– Reinderien2023年08月24日 00:49:50 +00:00Commented Aug 24, 2023 at 0:49
-
\$\begingroup\$ @Reinderien Sometimes the file could be of 5gb. And that is why I need to do all my calculations on the fly, instead of loading it completely into the memory. \$\endgroup\$Syed Shuja Zaidi P19PH007– Syed Shuja Zaidi P19PH0072023年08月24日 13:05:17 +00:00Commented Aug 24, 2023 at 13:05
1 Answer 1
Don't use a template parameter for std::istream
s
All the stream objects in the standard library are part of a class hierarchy, and all the functions you need are part of the base class std::istream
, from which classes like std::ifstream
inherit. So you don't need to make this function a template, just take a reference to this base class:
int frames(const std::string& matchstr, std::istream& file)
{
...
}
Be more strict in how you parse lines
You are only matching individual words in the file. Words can appear anywhere on the line. So if I call frames("ATOMS", ...)
, it will match both these lines:
ITEM: NUMBER OF ATOMS
ITEM: ATOMS id type x y
Maybe you only care about TIME
(assuming there is only one timestamp associated with each frame), but what if TIME
could appear on lines you didn't think about? Maybe there are some lines where free text could appear. Maybe there is a way to add comments to this file. It probably would be better to match complete lines instead:
while (std::getline(file, str))
{
if (str == matchstr)
iframes++;
}
And call it like frames("ITEM: TIME", ...)
.
Error handling
It's good that you check for errors while reading from the stream, but if you do encounter an error, you just return the number of frames you've seen so far. The caller can then only assume everything went fine, unless it checked the stream itself again, which is easily forgotten. It would be much better to propagate the error to the caller. Consider returning a std::optional
, std::expected
(since C++23), or just throw
on of the standard exceptions.
Performance
It's not trivial to parallelize generic text processing. The best thing to do would be to split your file into separate regions, making sure each region begins and ends on a line boundary, and then have separate threads scan through those regions. You could do that with OpenMP and MPI, or use std::thread
or std::async()
if you want to stay within the standard C++ library.
Even better would be if you didn't need to scan the number of frames. If possible, make the process that generates the data add the number of frames to the file (either at the start somewhere or at the end). Or it could include an ITEM: FRAME
header at the same time it adds an ITEM: TIME
header, and then you only need to scan backward from the end of the file to find the highest frame number.
Why do you need to know the frame number? Usually, that information is not interesting by itself. If you need to reserve memory before you parse the data in the file, I recommend you use an appropriate data structure instead that can resize itself, like std::vector
, or even better std::deque
; the latter doesn't have to move data around every time it resizes itself.
-
\$\begingroup\$ I try to write the code without any values hard coded, so that I do not need to adjust the values of N (number of atoms), number of frames, etc., for different data files. And then I construct averages over frames and/or number of particles on the fly. Using push_back for vectors was okay for all the frames in a single data file, but to do averaging over many such data files does make the push_back facility ineffective. So, if I know the total frames in a file, then I can fix the size of the vector in one dimension, which will be the same for all such data files. \$\endgroup\$Syed Shuja Zaidi P19PH007– Syed Shuja Zaidi P19PH0072023年08月24日 13:16:07 +00:00Commented Aug 24, 2023 at 13:16
-
\$\begingroup\$ Do you need those vectors to begin with? If you are only interested in averages, you just maintain a sum of the values seen so far, and divide by the number of values you saw at the end. \$\endgroup\$G. Sliepen– G. Sliepen2023年08月24日 13:25:03 +00:00Commented Aug 24, 2023 at 13:25