This piece of code takes random samples from two String lists, pos and neg. Based on a specified sampleSize, it writes an equal number of strings into files under four directories (test/pos, test/neg, train/pos, train/neg), one file per string. This code seems slow, perhaps due to the file-creation overhead, since a separate file is created for every single string. Is there a better way to do this?
// pos and neg are ArrayList<String>
Collections.shuffle(pos);
Collections.shuffle(neg);
int i = 0;
int j = 0;
while (i < sampleSize) {
    if (i < sampleSize / 2) {
        BufferedWriter bw1 = new BufferedWriter(new FileWriter("test/neg/" + i + ".txt"));
        bw1.write(neg.get(i));
        bw1.close();
        BufferedWriter bw2 = new BufferedWriter(new FileWriter("test/pos/" + i + ".txt"));
        bw2.write(pos.get(i));
        bw2.close();
    } else {
        BufferedWriter bw1 = new BufferedWriter(new FileWriter("train/neg/" + j + ".txt"));
        bw1.write(neg.get(j));
        bw1.close();
        BufferedWriter bw2 = new BufferedWriter(new FileWriter("train/pos/" + j + ".txt"));
        bw2.write(pos.get(j++));
        bw2.close();
    }
    i++;
    System.out.println(i);
}
Comment: I'd try hard to change the requirements. Using many files is hardly a good idea. Even multi-GB databases working with hundreds of tables stick with a couple of files. – maaartinus, Aug 12, 2017
2 Answers
I don't know what your general application is, but if those files are eventually read back by a Java program, it would probably be best not to write so many small files. For example, do the shuffling once, save the shuffled pos and neg lists in just two files, and recreate your pos/neg test/train data sets by reading those two files in the part of the application that needs them.
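That approach could look roughly like the sketch below. The class and file names are my own invention, and it assumes the strings contain no newlines, since each list entry is stored as one line:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ShuffledDump {
    // Write the already-shuffled list to a single file, one entry per line.
    static void dump(List<String> list, String fileName) throws IOException {
        Files.write(Paths.get(fileName), list);
    }

    // Reload the shuffled entries later; the order in the file is preserved,
    // so the test/train split can be recreated by index.
    static List<String> load(String fileName) throws IOException {
        return Files.readAllLines(Paths.get(fileName));
    }

    public static void main(String[] args) throws IOException {
        List<String> pos = new ArrayList<>(List.of("a", "b", "c"));
        Collections.shuffle(pos);
        dump(pos, "pos-shuffled.txt");
    }
}
```

This replaces thousands of tiny files with two sequential writes, which is the cheapest pattern for most file systems.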
Your code is clearly copy/paste, which is never good. That means you should use a function for the block of code you copied.
I also don't like your while-loop with an if in the middle. I think it would be cleaner to have two for-loops (one for each half). Also note that your train set indexes the lists with j, which restarts at 0, so it writes out the same list entries that already went into the test set, which might be a bug.
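The two-loop refactoring could be sketched as follows. The writeOne helper name is my own, and this version draws the train entries from the second half of the shuffled lists, so the two sets do not share entries and both number their files from 0:

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class SplitWriter {
    // Hypothetical helper replacing the four copied write blocks.
    static void writeOne(String dir, int index, String content) throws IOException {
        Files.createDirectories(Paths.get(dir)); // make sure the target directory exists
        try (BufferedWriter bw = new BufferedWriter(new FileWriter(dir + "/" + index + ".txt"))) {
            bw.write(content);
        }
    }

    static void writeSamples(List<String> pos, List<String> neg, int sampleSize) throws IOException {
        int half = sampleSize / 2;
        // First half of each shuffled list goes into test/, files numbered from 0.
        for (int i = 0; i < half; i++) {
            writeOne("test/neg", i, neg.get(i));
            writeOne("test/pos", i, pos.get(i));
        }
        // Second half goes into train/, files also numbered from 0.
        for (int i = half; i < sampleSize; i++) {
            writeOne("train/neg", i - half, neg.get(i));
            writeOne("train/pos", i - half, pos.get(i));
        }
    }

    public static void main(String[] args) throws IOException {
        writeSamples(List.of("p0", "p1", "p2", "p3"), List.of("n0", "n1", "n2", "n3"), 4);
    }
}
```

The try-with-resources block also guarantees the writer is closed even if write() throws, which the original code does not.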
You can speed up the write speed by using multi-threading. A general example:
IntStream.range(0, 1000).parallel().forEach(i -> {
    try {
        BufferedWriter writer = new BufferedWriter(new FileWriter(i + ".txt"));
        writer.write(Integer.toString(i));
        writer.close();
    } catch (IOException e) {
        // ...
    }
});
On my machine I get a 2 or 3 factor speed up. Using very many threads won't help since all threads access the same disk.
The execution time will be dominated by I/O, since you are writing so many files. The only suggestion I have that might change the I/O performance is to try the NIO API, namely Files.write().
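A minimal sketch of that call (the file name and content are illustrative):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class NioWriteExample {
    public static void main(String[] args) throws IOException {
        String sample = "some sample text";
        // A single call opens the file, writes the bytes, and closes it,
        // replacing the three-line BufferedWriter dance.
        Files.write(Paths.get("0.txt"), sample.getBytes(StandardCharsets.UTF_8));
    }
}
```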
There is another possible inefficiency if sampleSize is much smaller than the size of the collections. In that case, shuffling the entire list would be overkill, and you might be better off writing your own function to perform sampling without replacement. (That would basically be a Fisher-Yates shuffle that terminates early.)
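Such an early-terminating shuffle might look like this sketch (class and method names are my own):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class PartialShuffle {
    // Randomise only the first sampleSize slots of the list, in place.
    // Afterwards those slots hold a uniform sample without replacement.
    static void partialShuffle(List<String> list, int sampleSize, Random rnd) {
        for (int i = 0; i < sampleSize; i++) {
            // Swap a random element from the unprocessed tail into slot i,
            // exactly as a full Fisher-Yates shuffle would, then stop early.
            int j = i + rnd.nextInt(list.size() - i);
            Collections.swap(list, i, j);
        }
    }

    public static void main(String[] args) {
        List<String> items = new ArrayList<>();
        for (int k = 0; k < 100; k++) {
            items.add("s" + k);
        }
        partialShuffle(items, 10, new Random());
        System.out.println(items.subList(0, 10));
    }
}
```

This does sampleSize swaps instead of list.size() swaps, which matters when you sample a few hundred entries from a list of millions.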
There are four similar copies of the code, which you might want to consolidate, especially if it is not important to preserve the behaviour of your System.out.println(i) progress indicator.
My first observation is that the pos and neg lists can be treated independently. My second observation is that you could simplify the loop if you interleave the output of the test and train cases.
I think that this code would make it more obvious that the number of files created is sampleSize for each of the lists neg and pos.
private static void writeSampleFiles(List<String> list, String name, int sampleSize) throws IOException {
    Collections.shuffle(list);
    for (int i = 0; i < sampleSize; i++) {
        // Note the parentheses: & has lower precedence than == in Java,
        // so a bare "i & 1 == 0" would not compile.
        Path path = Paths.get((i & 1) == 0 ? "train" : "test",
                              name,
                              (i >> 1) + ".txt");
        Files.write(path, list.get(i).getBytes());
    }
}
// elsewhere...
writeSampleFiles(neg, "neg", sampleSize);
writeSampleFiles(pos, "pos", sampleSize);
Note that this isn't exactly identical to your original code. The shuffled entries are extracted in a different order, which could be an issue if you are concerned about "dealing from the top of the deck". If the execution aborts prematurely due to an IOException, then the difference in the processing order would be visible to the user.