This piece of code takes random samples from two String lists, pos and neg. Based on a specified sampleSize, it writes an equal number of strings into files under four directories (test/pos, test/neg, train/pos, train/neg), one file per string. This code seems slow, perhaps due to the file-creation overhead, since a separate file is created for every single string. Is there a better way to do this?
// pos and neg are ArrayList<String>
Collections.shuffle(pos);
Collections.shuffle(neg);
int i = 0;
int j = 0;
while (i < sampleSize) {
    if (i < sampleSize / 2) {
        BufferedWriter bw1 = new BufferedWriter(new FileWriter("test/neg/" + i + ".txt"));
        bw1.write(neg.get(i));
        bw1.close();
        BufferedWriter bw2 = new BufferedWriter(new FileWriter("test/pos/" + i + ".txt"));
        bw2.write(pos.get(i));
        bw2.close();
    } else {
        BufferedWriter bw1 = new BufferedWriter(new FileWriter("train/neg/" + j + ".txt"));
        bw1.write(neg.get(j));
        bw1.close();
        BufferedWriter bw2 = new BufferedWriter(new FileWriter("train/pos/" + j + ".txt"));
        bw2.write(pos.get(j++));
        bw2.close();
    }
    i++;
    System.out.println(i);
}
Comment: I'd try hard to change the requirements. Using many files is hardly a good idea. Even multi-GB databases working with hundreds of tables stick with a couple of files. – maaartinus, Aug 12, 2017
2 Answers
I don't know what your general application is, but if those files are eventually read back by a Java program, it would probably be best not to write so many small files. For example, do the shuffling once, save the shuffled pos and neg lists in just two files, and recreate your pos/neg test/train data sets by reading those two files in the part of the application that needs them.
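That approach could look roughly like the sketch below. The class and file names are my own invention, and it assumes the strings contain no newlines, since each list entry is stored as one line:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ShuffledDump {
    // Write the already-shuffled list to a single file, one entry per line.
    static void dump(List<String> list, String fileName) throws IOException {
        Files.write(Paths.get(fileName), list);
    }

    // Reload the shuffled entries later; the order in the file is preserved,
    // so the test/train split can be recreated by index.
    static List<String> load(String fileName) throws IOException {
        return Files.readAllLines(Paths.get(fileName));
    }

    public static void main(String[] args) throws IOException {
        List<String> pos = new ArrayList<>(List.of("a", "b", "c"));
        Collections.shuffle(pos);
        dump(pos, "pos-shuffled.txt");
    }
}
```

This replaces thousands of tiny files with two sequential writes, which is the cheapest pattern for most file systems.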
Your code is clearly copy/paste, which is never good. That means you should use a function for the block of code you copied.
I also don't like your while-loop with an if in the middle. I think it would be cleaner to have two for-loops (one for each half). Also note that your train set indexes the lists with j, which restarts at 0, so it writes out the same list entries that already went into the test set, which might be a bug.
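The two-loop refactoring could be sketched as follows. The writeOne helper name is my own, and this version draws the train entries from the second half of the shuffled lists, so the two sets do not share entries and both number their files from 0:

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class SplitWriter {
    // Hypothetical helper replacing the four copied write blocks.
    static void writeOne(String dir, int index, String content) throws IOException {
        Files.createDirectories(Paths.get(dir)); // make sure the target directory exists
        try (BufferedWriter bw = new BufferedWriter(new FileWriter(dir + "/" + index + ".txt"))) {
            bw.write(content);
        }
    }

    static void writeSamples(List<String> pos, List<String> neg, int sampleSize) throws IOException {
        int half = sampleSize / 2;
        // First half of each shuffled list goes into test/, files numbered from 0.
        for (int i = 0; i < half; i++) {
            writeOne("test/neg", i, neg.get(i));
            writeOne("test/pos", i, pos.get(i));
        }
        // Second half goes into train/, files also numbered from 0.
        for (int i = half; i < sampleSize; i++) {
            writeOne("train/neg", i - half, neg.get(i));
            writeOne("train/pos", i - half, pos.get(i));
        }
    }

    public static void main(String[] args) throws IOException {
        writeSamples(List.of("p0", "p1", "p2", "p3"), List.of("n0", "n1", "n2", "n3"), 4);
    }
}
```

The try-with-resources block also guarantees the writer is closed even if write() throws, which the original code does not.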
You can speed up the write speed by using multi-threading. A general example:
IntStream.range(0, 1000).parallel().forEach(i -> {
    try {
        BufferedWriter writer = new BufferedWriter(new FileWriter(i + ".txt"));
        writer.write(Integer.toString(i));
        writer.close();
    } catch (IOException e) {
        // ...
    }
});
On my machine I get a 2 or 3 factor speed up. Using very many threads won't help since all threads access the same disk.
The execution time will be dominated by I/O, since you are writing so many files. The only suggestion I have that might change the I/O performance is to try the NIO API, namely Files.write().
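A minimal sketch of that call (the file name and content are illustrative):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class NioWriteExample {
    public static void main(String[] args) throws IOException {
        String sample = "some sample text";
        // A single call opens the file, writes the bytes, and closes it,
        // replacing the three-line BufferedWriter dance.
        Files.write(Paths.get("0.txt"), sample.getBytes(StandardCharsets.UTF_8));
    }
}
```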
There is another possible inefficiency if sampleSize is much smaller than the size of the collections. In that case, shuffling the entire list would be overkill, and you might be better off writing your own function to perform sampling without replacement. (That would basically be a Fisher-Yates shuffle that terminates early.)
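Such an early-terminating shuffle might look like this sketch (class and method names are my own):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class PartialShuffle {
    // Randomise only the first sampleSize slots of the list, in place.
    // Afterwards those slots hold a uniform sample without replacement.
    static void partialShuffle(List<String> list, int sampleSize, Random rnd) {
        for (int i = 0; i < sampleSize; i++) {
            // Swap a random element from the unprocessed tail into slot i,
            // exactly as a full Fisher-Yates shuffle would, then stop early.
            int j = i + rnd.nextInt(list.size() - i);
            Collections.swap(list, i, j);
        }
    }

    public static void main(String[] args) {
        List<String> items = new ArrayList<>();
        for (int k = 0; k < 100; k++) {
            items.add("s" + k);
        }
        partialShuffle(items, 10, new Random());
        System.out.println(items.subList(0, 10));
    }
}
```

This does sampleSize swaps instead of list.size() swaps, which matters when you sample a few hundred entries from a list of millions.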
There are four similar copies of the code, which you might want to consolidate, especially if it is not important to preserve the behaviour of your System.out.println(i) progress indicator.
My first observation is that the pos and neg lists can be treated independently. My second observation is that you could simplify the loop if you interleave the output of the test and train cases.
I think that this code would make it more obvious that the number of files created is sampleSize for each of the lists neg and pos.
private static void writeSampleFiles(List<String> list, String name, int sampleSize) throws IOException {
    Collections.shuffle(list);
    for (int i = 0; i < sampleSize; i++) {
        // Note the parentheses: & has lower precedence than == in Java,
        // so a bare "i & 1 == 0" would not compile.
        Path path = Paths.get((i & 1) == 0 ? "train" : "test",
                              name,
                              (i >> 1) + ".txt");
        Files.write(path, list.get(i).getBytes());
    }
}
// elsewhere...
writeSampleFiles(neg, "neg", sampleSize);
writeSampleFiles(pos, "pos", sampleSize);
Note that this isn't exactly identical to your original code. The shuffled entries are extracted in a different order, which could be an issue if you are concerned about "dealing from the top of the deck". If the execution aborts prematurely due to an IOException, then the difference in the processing order would be visible to the user.