Looping over files in C++ and changing their names to store data into a single file

Question 1

I have a lot of data in separate files generated from a simulation program. All data is clear and formatted via a single space to read in 4 columns. I can open and store data from each file one at a time, but that will get so impossible once I hit more than, say, 10 files of data with the same simulation data columns.

I am just trying to find a better way to open files and store the data using an iterative method instead of typing, for example, "open file, read file data, append to file" over 1000 times.

#include <iostream>
#include <sstream>
#include <fstream>
#include <vector>
using namespace std;
int main ()
{
// INITIALIZE
 int tmax = 80000;
 float time_step[tmax];
 float LEx[tmax];
 float LEy[tmax];
 float LEz[tmax];
 float LE[tmax];
 float AvgLEx;
 float AvgLEy;
 float AvgLEz;
 float AvgLE;
 float Function[tmax];
// FILES
 ifstream infile;
 infile.open("File_1.prn");
// THIS INFILE is where I want to put a loop to change File_1 to File_2 to File_3 and so on and so on to append data from all those files into ONE single file
 if(infile.fail())
 {
 cout << "error reading file" << endl;
 return 1;
 }
 for (int i = 0; i < tmax+1; ++i)
 {
 infile >> time_step[i] >> LEx[i] >> LEy[i] >> LEz[i] >> LE[i];
 AvgLEx = AvgLEx + LEx[i];
 AvgLEy = AvgLEy + LEy[i];
 AvgLEz = AvgLEz + LEz[i];
 AvgLE = AvgLE + LE[i];
 }
 AvgLEx = AvgLEx/tmax;
 AvgLEy = AvgLEy/tmax;
 AvgLEz = AvgLEz/tmax;
 AvgLE = AvgLE/tmax;
 cout << "AvgLE: " << AvgLE << "\n";
 cout << time_step[tmax] << " LEx " << LEx[tmax] << " LEy " << LEy[tmax] << " LEz: " << LEz[tmax] << " LE: " << LE[tmax] << endl;
 cout << "\n";
 infile.close();
 cout << "\nFile read! \n";
// FUNCTIONS
 for (int i = 0; i < tmax+1; ++i)
 {
 Function[i] = LE[i] - AvgLE;
 cout << LE[i] << " " << AvgLE << " " << Function[i] << "\n";
 }
 cout << "\n Function set up \n";
// // CONVOLUTION
 float A_t;
 float A_dt;
 int dt, t;
 float max=0;
 for (int dt = 0; dt < 1 + tmax/10; ++dt)
 {
 //dt = 1;
 cout << "dt: " << dt << "\n";
 //A_dt = 0;
 int window = tmax-dt;
 for (int t = 0; t < window; ++t)
 {
 if (t==0)
 {
 A_t = 0;
 cout << "A_t was set to zero \n";
 }
 A_t+= Function[t] * Function[t + dt];
 cout << "A_t is accumulation ...: " << A_t << "\n" ;
 }
 A_dt = A_t / window;
 if (A_dt > max)
 {
 max = A_dt;
 }
// OUTPUT
 ofstream AutoCor;
 AutoCor.open ("AutoCor_25.prn", ios::app);
 AutoCor << dt << " " << A_dt << "\n";
 AutoCor.close();
 cout << "A_dt: " << A_dt << " " << "window used: " << window << "\n";
 }
 cout << "max used: " << max << "\n";
// RETURN 
 return A_dt;
 return 0;
}

Question 2

I'm sorry, but request regarding adding new implementation is off-topic.

Question 3

I added all my working code. I just am trying to find a better way to open files and store the data using an iterative method instead of typing ex: open file, read file data, append to file. over 1000 times

Question 4

Okay. I'll replace the off-topic explanation with that, then reopen.

Question 5

You can just use the already existing command cat to do this.

Question 6

@Sarah You have to type the file names in somehwere! Unless they have a common prefix then you can use shell expansion cat file* > all.file or cat file{1..3} > file4

Question 7

Creating Filenames

If you know the naming format, you can use std::ostringstream or snprintf to create a filename:

char filename[40];
for (unsigned int i = 0; i < 100; i++)
{
 snprintf(filename, sizeof(filename),
 "file_%04d.prn");
 std::ifstream infile(filename);
 //...
}

Another, platform specific, method is to iterate over each file. One algorithm is:

filename = Get_First_Filename();
process file
filename = Get_Next_Filename();
if (filename != NULL) goto to step 2

An array of structures vs. many arrays of variables

Instead of declaring many arrays like this:

float LEx[tmax];
float LEy[tmax];
float LEz[tmax];
float LE[tmax];

You may want to put them into a structure (record) and use an array of the structures:

struct Record
{
 float LEx;
 float LEy;
 float LEz;
 float LE;
};
std::vector< Record > data(tmax);

You may find that this change will speed up your program because all the relevant variables are next to each other instead of being separated by tmax elements.

Use double not float

In your programs prefer the double type over float. The double is more accurate and most floating point libraries are designed for double. The float should be used in constrained environments where memory space is restricted.

Read all data, then process

In your program, you read a record of data then process it:

for (int i = 0; i < tmax+1; ++i)
{
 infile >> time_step[i] >> LEx[i] >> LEy[i] >> LEz[i] >> LE[i];
 AvgLEx = AvgLEx + LEx[i];
 AvgLEy = AvgLEy + LEy[i];
 AvgLEz = AvgLEz + LEz[i];
 AvgLE = AvgLE + LE[i];
}

Most platforms are most efficient when the reading of data is continuous. This allows the hard drive to keep spinning rather than spinning down and stopping after a read, then spinning up again for the next read (yes, modern drives use caches to smooth this).

A more efficient method is to read all or most of your data, then process it:

struct Input_Record
{
 double time_step;
 double LEx;
 double LEy;
 double LEz;
};
std::vector< Input_Record > input_data(tmax);
for (unsigned int i = 0; i < tmax; ++i)
{
 infile >> input_data[i].time_step
 >> input_data[i].LEx
 >> input_data[i].LEy
 >> input_data[i].LEz;
}
for (unsigned int i = 0; i < tmax; i++)
{
 AvgLEx += input_data[i].LEx;
 AvgLEy += input_data[i].LEy;
 AvgLEz += input_data[i].LEz;
}

Question 8

wow. So cool. This is such a clearer way to understand what is going on 'behind the scenes'. The use of structures is MOST definitely going to be helpful for me.

Question 9

I think that for the file part - I do know the file format because I create it myself in TCL to store the data when I run/call my simulation program. So if I understand what you wrote above correctly in the following -- char filename[40]; for (unsigned int i = 0; i < 100; i++) { snprintf(filename, sizeof(filename), "file_%04d.prn"); std::ifstream infile(filename); //... } .. you are using the "%04d" part to add the '1', '2', '3' ..etc changes to the file names so that it is then possible to loop over them and collect the data from each file?

Question 10

Yes, the snprint function is used to append numbers to a file name. So the first filename is "file_0000.prn", the second is "file_0001.prn", and so on.

Question 11

As you're new to C++, you should be aware that implementing an entire program in main() can make things difficult, especially when the code is already complex.

As it appears, you have declare and initialize variables at the top, open the file, perform the operations, then output things. Even with these comments in place, you should still split this up into functions.

Do not maintain a list of variables as it could make maintenance more difficult, such as if variables are no longer being used. Instead. you should only declare or initialize variables as close to their intended use as possible. That way, if you no longer use that variable, you won't have to search for it at the top. This will also make your code easier to follow.
Consider just having main() do two things: attempt to open the file and call all the necessary functions afterwards. You can also open the file in an other function, but if you do it in main(), then you can easily return from there if the file cannot be opened.
Everything else, especially the file operations, should be in separate functions. Also keep in mind that a function should just have one primary purpose, in order to maintain the Single Responsible Principle (SRP). If a function needs to do something else that another function already does, then it should call that function. You should also name functions, in the verb form, based on what they're supposed to do. This will make the function's intent very clear to others.

Example:
```
void openFile() {}
```

Miscellaneous:

Try not to use using namespace std in global scope as it could cause name-clashing bugs, especially in larger programs. See this post for more info on this.
This variable appears to be a constant:
```
int tmax = 80000;
```
If so, initialize it with const so that it cannot be modified:
```
const int tmax = 80000;
```

For this approach at checking the file:

infile.open("File_1.prn");
if(infile.fail())
{
 cout << "error reading file" << endl;
 return 1;
}

you should use std::ifstream, check for !infile(), print the error message to std::cerr, and then return EXIT_FAILURE:

std::ifstream infile("File_1.prn");
if (!infile)
{
 std::cerr << "error reading file";
 return EXIT_FAILURE;
}

Question 12

ok I will make those changes now and then repost a clearer code

Question 13

@Sarah: You may still wait for more answers. This is mostly a general review, and there may be more issues here. I haven't looked at every bit yet.

Question 14

oh ok! I will check back then in a bit

Question 15

Your current code looks buggy to me. The first (and most obvious) one that I saw was:

for (int i = 0; i < tmax+1; ++i)

...then using i as an index into arrays defined as xxx[tmax]. With tmax as the size of an array, the allowable indices are 0 through tmax-1. Trying to read/write xxx[tmax] and xxx[tmax+1] leads to undefined behavior.

Right now, the code also blindly assumes that each input file contains a pre-set number of rows of data (but the intent seems to be that it will then produce a file for which that assumption is false).

Instead of defining a (fixed-size) array and basing your assumptions on each file containing a fixed number of items, I'd prefer to use std::vector, and read exactly as many items as the file contains.

I agree with @Thomas Matthews' advice to create a struct to hold a single record, and create a vector (he said array, but see above) of those records. I'd go one more step though: I'd overload operator>> (and possibly also operator<<) for that record type to make reading the records convenient and easy as well.

class record {
 float time_step;
 float LEx;
 float LEy;
 float LEz;
 float LE;
 friend std::istream &operator>>(std::istream &is, record &r) { 
 return is >> time_step >> r.LEx >> r.LEy >> r.LEz >> r.LE;
 }
};

Using this, reading one file of data into a vector of records can look like this:

std::vector<record> records{std::istream_iterator<record>(infile),
 std::istream_iterator<record>()};

In your case (reading multiple files into a single vector) it's marginally less convenient, but only marginally so:

std::vector<record> records;
for (std::string const &filename : filenames) {
 std::istream infile(filename);
 std::copy(std::istream_iterator<record>(infile),
 std::istream_iterator<record>(),
 std::back_inserter(records));
}

While it's possible to have the program synthesize the file names, it's generally easier to supply them on the command line, possibly including a wildcard, something like:

process File_*.prn

In this case, your program can simply use argc/argv that are passed to main to process the file names:

int main(int argc, char **argv) { 
 std::vector<std::string> filenames(argv+1, argv+argc-1);

On Unix, passing a wildcard on the command line will normally be handled automatically by the shell. On Windows (at least with MSVC) you can link your file with setargv.obj to get the same effect: cl myfile.cpp setargv.obj.

Alternatively, you could have your code enumerate the files fitting a specified pattern. If you're going to do this, I'd recommend using the Boost FileSystem library, so you can do the job portably across most major OSes--and better still, it's being used as the basis for work on an addition to the standard library, so the code you write now will probably work with the future standard library with little or no extra work.

Although you didn't ask about them, I'd note that your FUNCTIONS and CONVOLUTION sections could be cleaned up quite a bit as well. It looks like std::accumulate could make computing averages quite a bit cleaner, and your:

A_t+= Function[t] * Function[t + dt];

...looks like it could be implemented rather more cleanly with std::inner_product:

A_t = std::inner_product(Function.begin(), Function.begin()+window, Function.begin()+dt, 0.0);

(Note that this replaces the entire loop to calculate A_t, not just the one line quoted above).

I guess I'll stop there (at least for now) since it's not really clear how much you care about cleaning up this part of the code.

Question 16

No that is great ^^^ !!. Although I was mostly interested (desperate?) to find a solution to the file manipulation .. all these extra comments are so very much appreciated AND useful. I figured there was a lot about C++ syntax that I am clearly not exploiting/using/aware of and that was frustrating to me while I was writing the code initially. So thank you @JerryCoffin

Thomas Matthews Thomas Matthews 4312 silver badges4 bronze badges · Accepted Answer · 2014-07-14 20:08:58Z

Creating Filenames

If you know the naming format, you can use std::ostringstream or snprintf to create a filename:

char filename[40];
for (unsigned int i = 0; i < 100; i++)
{
 snprintf(filename, sizeof(filename),
 "file_%04d.prn");
 std::ifstream infile(filename);
 //...
}

Another, platform specific, method is to iterate over each file. One algorithm is:

filename = Get_First_Filename();
process file
filename = Get_Next_Filename();
if (filename != NULL) goto to step 2

An array of structures vs. many arrays of variables

Instead of declaring many arrays like this:

float LEx[tmax];
float LEy[tmax];
float LEz[tmax];
float LE[tmax];

You may want to put them into a structure (record) and use an array of the structures:

struct Record
{
 float LEx;
 float LEy;
 float LEz;
 float LE;
};
std::vector< Record > data(tmax);

You may find that this change will speed up your program because all the relevant variables are next to each other instead of being separated by tmax elements.

Use double not float

In your programs prefer the double type over float. The double is more accurate and most floating point libraries are designed for double. The float should be used in constrained environments where memory space is restricted.

Read all data, then process

In your program, you read a record of data then process it:

for (int i = 0; i < tmax+1; ++i)
{
 infile >> time_step[i] >> LEx[i] >> LEy[i] >> LEz[i] >> LE[i];
 AvgLEx = AvgLEx + LEx[i];
 AvgLEy = AvgLEy + LEy[i];
 AvgLEz = AvgLEz + LEz[i];
 AvgLE = AvgLE + LE[i];
}

Most platforms are most efficient when the reading of data is continuous. This allows the hard drive to keep spinning rather than spinning down and stopping after a read, then spinning up again for the next read (yes, modern drives use caches to smooth this).

A more efficient method is to read all or most of your data, then process it:

struct Input_Record
{
 double time_step;
 double LEx;
 double LEy;
 double LEz;
};
std::vector< Input_Record > input_data(tmax);
for (unsigned int i = 0; i < tmax; ++i)
{
 infile >> input_data[i].time_step
 >> input_data[i].LEx
 >> input_data[i].LEy
 >> input_data[i].LEz;
}
for (unsigned int i = 0; i < tmax; i++)
{
 AvgLEx += input_data[i].LEx;
 AvgLEy += input_data[i].LEy;
 AvgLEz += input_data[i].LEz;
}

wow. So cool. This is such a clearer way to understand what is going on 'behind the scenes'. The use of structures is MOST definitely going to be helpful for me.
I think that for the file part - I do know the file format because I create it myself in TCL to store the data when I run/call my simulation program. So if I understand what you wrote above correctly in the following -- char filename[40]; for (unsigned int i = 0; i < 100; i++) { snprintf(filename, sizeof(filename), "file_%04d.prn"); std::ifstream infile(filename); //... } .. you are using the "%04d" part to add the '1', '2', '3' ..etc changes to the file names so that it is then possible to loop over them and collect the data from each file?
Yes, the snprint function is used to append numbers to a file name. So the first filename is "file_0000.prn", the second is "file_0001.prn", and so on.

Stack Exchange Network

Looping over files in C++ and changing their names to store data into a single file

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Looping over files in C++ and changing their names to store data into a single file

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions