I've implemented an external mergesort to sort a file consisting of Java int primitives; however, it is horribly slow (fortunately, it does at least work).
Very little happens in the sort method: it just recursively calls merge, with blockSize doubling on each call and the input and output files swapping each time.
How could I be losing so much time here?
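Roughly, the driver looks like this (a simplified sketch, not my exact code; the starting block size of one int is an assumption):

public static void sort(String fileA, String fileB, long fileSize) throws IOException
{
    // Repeatedly merge, doubling blockSize and swapping the roles
    // of the two files each pass (block sizes measured in bytes).
    for (long blockSize = 4; blockSize < fileSize; blockSize *= 2)
    {
        merge(fileA, fileB, blockSize);
        String tmp = fileA; fileA = fileB; fileB = tmp;
    }
}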
//Merge stage of external mergesort
//Read from input file, already sorted into blocks of size blockSize
//Write to output file, sorted into blocks of 2*blockSize
public static void merge(String inputFile, String outputFile, long blockSize)
        throws IOException
{
    //readers for block1/2
    FileInputStream fis1 = new FileInputStream(inputFile);
    DataInputStream dis1 = new DataInputStream(fis1);
    FileInputStream fis2 = new FileInputStream(inputFile);
    DataInputStream dis2 = new DataInputStream(fis2);
    //writer to output file
    FileOutputStream fos = new FileOutputStream(outputFile);
    DataOutputStream dos = new DataOutputStream(fos);

    // merging 2 sub lists:
    // go along pairs of blocks in inputFile,
    // continue until end of input

    //initialise block2 at the right position
    dis2.skipBytes((int) blockSize);
    //while we haven't reached the end of the file
    while (dis1.available() > 0)
    {
        // if block1 is the last block, copy block1 to output
        if (dis2.available() <= 0)
        {
            while (dis1.available() > 0)
                dos.writeInt(dis1.readInt());
            break;
        }
        // if block1 is not the last block, merge block1 and block2
        else
        {
            long block1Pos = 0;
            long block2Pos = 0;
            boolean block1Over = false;
            boolean block2Over = false;
            //data read from each block
            int e1 = dis1.readInt();
            int e2 = dis2.readInt();
            //keep going until both blocks are fully examined
            while (!block1Over | !block2Over)
            {
                //copy from block1 while:
                //  block1 hasn't been fully examined AND
                //  (block1's element is <= block2's OR block2 has been fully examined)
                while ( !block1Over & ((e1 <= e2) | block2Over) )
                {
                    dos.writeInt(e1); block1Pos += 4;
                    if (block1Pos < blockSize & dis1.available() > 0)
                        e1 = dis1.readInt();
                    else
                        block1Over = true;
                }
                //same for block2
                while ( !block2Over & ((e2 < e1) | block1Over) )
                {
                    dos.writeInt(e2); block2Pos += 4;
                    if (block2Pos < blockSize & dis2.available() > 0)
                        e2 = dis2.readInt();
                    else
                        block2Over = true;
                }
            }
        }
        // skip to the next pair of blocks
        dis1.skipBytes((int) blockSize);
        dis2.skipBytes((int) blockSize);
    }
    dis1.close();
    dis2.close();
    dos.close();
    fos.close();
}
2 Answers
Steve is absolutely right that adding a buffered layer between the Data input/output streams and the File input/output streams will make things work a whole lot better. I would also suggest switching to a try-with-resources statement, which will close and otherwise manage the files more reliably:
try (DataInputStream dis1 = new DataInputStream(new BufferedInputStream(new FileInputStream(inputFile)));
     DataInputStream dis2 = new DataInputStream(new BufferedInputStream(new FileInputStream(inputFile)));
     DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(outputFile)))) {

    //initialise block2 at right position
    dis2.skipBytes((int) blockSize);

    ... do other work ...
}
Now your input and output are buffered, leading to far fewer IO operations and more efficient processing, and the streams are closed cleanly with no leaks.
This will likely make a huge difference in performance, but I suspect that using NIO (ByteBuffer) operations, especially with memory-mapped IO, will be faster still. Consider using FileChannel operations that reduce the number of copies made of the file's data.
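As a rough taste of the approach, here is a minimal read-only mapping sketch (java.nio classes; it assumes the whole file fits in one mapping, i.e. under 2GB):

try (FileChannel ch = FileChannel.open(Paths.get(inputFile), StandardOpenOption.READ)) {
    MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
    while (map.remaining() >= 4) {
        int value = map.getInt(); // reads come straight from the page cache
        // ... use value ...
    }
}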
UPDATE:
I have taken some time to run some tests, applying strategies I am familiar with from high-performance systems. As I suspected, a FileChannel with memory-mapped IO is far faster.
On my computer, a file with 400,000 int values takes about 15 seconds to sort using your system. When I used buffered IO, it took 1.5 seconds (10 times faster).
I then rewrote the system using a few tricks:
- sort small blocks of 32 integers in memory;
- then merge larger and larger blocks;
- use memory-mapped IO for all file access.
The result was a sort in 0.096 seconds, or 150 times faster than your code.
Now, this code is not exactly simple, so be warned that it is a little obscure.
The first thing I did, though, was create a class to abstract away the low-level IO:
private static final class FastFile implements AutoCloseable {

    // Note: these mapping constants were omitted from the original post;
    // the values here (a 256MB mapping window) are an assumed, workable choice.
    private static final int MAPPEDSHIFT = 28;
    private static final long MAPPEDSIZE = 1L << MAPPEDSHIFT;
    private static final long MAPPEDMASK = MAPPEDSIZE - 1;

    private final Path path;
    private final FileChannel channel;
    private final long size;
    private MappedByteBuffer buffer = null;
    private long mapPosition = -1;

    public FastFile(final Path folder, final long size) throws IOException {
        this.size = size;
        this.path = Files.createTempFile(folder, "tmpdata", ".dat");
        this.channel = FileChannel.open(path, StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE, StandardOpenOption.READ);
        // Create the file with the right size.
        channel.write(ByteBuffer.allocate(1), size - 1);
        resetFile();
    }

    @Override
    public void close() throws Exception {
        channel.force(true);
        channel.close();
        buffer = null;
    }

    private final void relocate(final long filepos) throws IOException {
        if (filepos >= size) {
            throw new IllegalArgumentException("Illegal file position " + filepos
                    + " in file of size " + size);
        }
        final long mappos = filepos >>> MAPPEDSHIFT;
        if (mappos != mapPosition) {
            final long pos = mappos << MAPPEDSHIFT;
            final long len = Math.min(size - pos, MAPPEDSIZE);
            buffer = channel.map(MapMode.READ_WRITE, pos, len);
            buffer.load();
            System.out.println("Move to position " + pos + " and length " + len);
            mapPosition = mappos;
        }
    }

    public void resetFile() throws IOException {
        relocate(0);
    }

    public int getInt(final long intPosition) throws IOException {
        final long filepos = intPosition << 2;
        final int offset = (int) (filepos & MAPPEDMASK);
        relocate(filepos);
        return buffer.getInt(offset);
    }

    public void putInt(final long intPosition, final int intValue) throws IOException {
        final long filepos = intPosition << 2;
        final int offset = (int) (filepos & MAPPEDMASK);
        relocate(filepos);
        buffer.putInt(offset, intValue);
    }

    public void rename(String targetName) throws IOException {
        Files.move(path, Paths.get(targetName));
    }

    public void delete() throws IOException {
        Files.delete(path);
    }
}
The above class takes a file, creates it at the right size, and can then read or write an int at any position, so it is random-access for both reads and writes. The expensive part of the operation is relocating the mapped buffer, but that happens seldom.
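For instance, a hypothetical caller (count and valueAt are placeholders, not part of the class) could fill a file with ints and read one back:

final long count = 1_000_000; // example size
try (FastFile ff = new FastFile(Paths.get("/tmp"), 4L * count)) {
    for (long i = 0; i < count; i++) {
        ff.putInt(i, valueAt(i)); // valueAt is a placeholder data source
    }
    int first = ff.getInt(0);
} catch (Exception e) {
    // FastFile.close() is declared to throw Exception
    throw new IOException(e);
}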
Using that file class, I have the following sort code:
private static void mergeSortSmart(String sourceName, String targetName) throws IOException {
    long start = System.currentTimeMillis();
    Path source = Paths.get(sourceName);
    final long size = Files.size(source);
    final long intvals = size >>> 2; // number of actual integer values (4 bytes per int).
    Path target = Paths.get(targetName).toAbsolutePath();
    if (Files.exists(target)) {
        Files.delete(target);
    }
    Path tdir = target.getParent();
    FastFile filea = new FastFile(tdir, size);
    FastFile fileb = new FastFile(tdir, size);
    int blockSize = 32;
    // copy the source data to a fast file, but do 32-size block int sorts
    // before merge-sorting.
    copyAndMicroSort(source, size, filea, blockSize);
    // then do iterative merge sorts.
    // Note: bs counts ints, not bytes, so loop until it covers intvals
    // (looping to size would just do extra full-file copy passes).
    for (long bs = blockSize; bs < intvals; bs *= 2) {
        mergeFast(filea, fileb, intvals, bs);
        FastFile tmp = filea;
        filea = fileb;
        fileb = tmp;
    }
    // rename the sorted file.
    filea.rename(targetName);
    // delete the temp file.
    fileb.delete();
    System.out.printf("Sorted in %.3fs%n", (System.currentTimeMillis() - start) / 1000.0);
}
The copyAndMicroSort is simple:
private static void copyAndMicroSort(Path source, long size, FastFile filea, final int batchSize) throws IOException {
    try (DataInputStream di = new DataInputStream(new BufferedInputStream(new FileInputStream(source.toFile())))) {
        long pos = 0;
        int cnt = 0;
        long vcount = 0;
        int[] data = new int[batchSize];
        while (pos < size) {
            if (cnt == data.length) {
                appendSortedInts(data, cnt, filea, vcount);
                vcount += cnt;
                cnt = 0;
            }
            data[cnt++] = di.readInt();
            pos += 4; // size of int;
        }
        appendSortedInts(data, cnt, filea, vcount);
    }
}

private static void appendSortedInts(final int[] data, final int cnt, final FastFile filea,
        final long vcount) throws IOException {
    Arrays.sort(data, 0, cnt);
    for (int i = 0; i < cnt; i++) {
        filea.putInt(vcount + i, data[i]);
    }
}
And the individual merge sorts are:
private static void mergeFast(final FastFile infile, final FastFile outfile, final long intCount, final long bs) throws IOException {
    long apos = 0;
    long bpos = bs;
    long outpos = 0;
    while (apos < intCount) {
        long alimit = Math.min(bpos, intCount);
        long blimit = Math.min(alimit + bs, intCount);
        while (apos < alimit && bpos < blimit) {
            int aval = infile.getInt(apos);
            int bval = infile.getInt(bpos);
            if (aval <= bval) {
                outfile.putInt(outpos++, aval);
                apos++;
            } else {
                outfile.putInt(outpos++, bval);
                bpos++;
            }
        }
        while (apos < alimit) {
            outfile.putInt(outpos++, infile.getInt(apos++));
        }
        while (bpos < blimit) {
            outfile.putInt(outpos++, infile.getInt(bpos++));
        }
        apos += bs;
        bpos += bs;
    }
}
- "By making the above change, I have changed a 15-second sort on 400KB of data to instead take 2.2 seconds. I expect I can bring it back to less than 0.5 seconds in a bit." – rolfl, Aug 15, 2014 at 21:03
- "Thanks for this, nice to see it's such an easy change to make. One problem I've noticed is if I restrict the buffer size to a small amount, when blockSize > 2*bufferSize, dis.available() returns 7FFFFFFF after the stream runs out, as opposed to 0 or a negative number. What's the reasoning for this?" – Alex, Aug 15, 2014 at 21:42
- "@Alex - good question, and I am not sure." – rolfl, Aug 15, 2014 at 21:44
- "@Alex - I posted a revised version of my results, sorting very fast indeed now." – rolfl, Aug 15, 2014 at 23:32
- "Thanks for this code, lots to learn from it. One problem I'm having is MAPPEDMASK, MAPPEDSIZE etc. give compile errors; what should I import to have access to them (or what are their values)?" – Alex, Aug 16, 2014 at 16:53
You'll be able to make a lot of headway on performance simply by adding BufferedInputStream and BufferedOutputStream to your stream chains.
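For example, each stream chain gains a buffering layer in the middle:

DataInputStream dis1 = new DataInputStream(
        new BufferedInputStream(new FileInputStream(inputFile)));
DataOutputStream dos = new DataOutputStream(
        new BufferedOutputStream(new FileOutputStream(outputFile)));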
You say that you recursively call merge - but I don't see the recursion. Are you just referring to the looping?
- "Sorry, misuse of recursively; I meant repeatedly." – Alex, Aug 16, 2014 at 13:43