I posted this an answer this to a question about drawing a random line from a file too large to put into memory. I hacked the below code together. In essence, this is what Reservoir Sampling does, in pseudo-code:
Scan over the 'tape'
Put the first 'n' samples in a reservoir(of size n)
// samples are lines of a document, numbers, whatever
After the first 'n' :
Pick a random number between 1 and NumberOfLinesCounted
if the number is between 1 and n
replace an existing line with a 1/n chance
Here is the code designed to scan over a document, with said sampling method:
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.List;
public class reservoirSampling {
public static void main(String[] args) throws FileNotFoundException, IOException{
Sampler mySampler = new Sampler();
List<String> myList = mySampler.sampler(10);
for(int index = 0;index<myList.size();index++){
System.out.println(myList.get(index));
}
}
}
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Scanner;
public class Sampler {
public Sampler(){}
public List<String> sampler (int reservoirSize) throws FileNotFoundException, IOException
{
String currentLine=null;
//reservoirList is where our selected lines stored
List <String> reservoirList= new ArrayList<String>(reservoirSize);
// we will use this counter to count the current line number while iterating
int count=0;
Random ra = new Random();
int randomNumber = 0;
Scanner sc = new Scanner(new File("Open_source.html")).useDelimiter("\n");
while (sc.hasNext())
{
currentLine = sc.next();
count ++;
if (count<=reservoirSize)
{
reservoirList.add(currentLine);
}
else if ((randomNumber = (int) ra.nextInt(count))<reservoirSize)
{
reservoirList.set(randomNumber, currentLine);
}
}
return reservoirList;
}
}
I'm using something very similar on a project I am on at the moment (i.e. the code is close enough I can transfer any changes easily by hand).
Is this a) efficient, and b) actually reservoir sampling (with equal odds of any line being drawn)?
3 Answers 3
Reviving an old, but really interesting post, that did not get a great opportunity for review the first time around....
What you've done well:
This code, for the most part, is excellent. As for your specific questions:
Is this:
efficient
Mostly, yes. The inefficiencies in here are all related to the IO, and the Scanner, not your code. A more low-level API may net you some performance benefits, but not likely. You are also (re)using the Random class well, and there should not be a problem there.
actually reservoir sampling (with equal odds of any line being drawn)?
Yes. This is a good implementation of the reservoir algorithm. The random arguments are good, and the substitutions should be fine.
What's concerning
The most significant beef is that you do not close the
Scanner
. Scanners can hold on to open file-handles, and, if the file really is big, this may have system-wide consequences. A try-catch-finally block would be appropriate (or, a try-with-resources if you're on Java7 now).A second (smaller) issue is that reservoir sampling is designed to handle huge data in a streaming way. I know it is perhaps unrealistic to consider in your use-case, but multi-gigabyte files are not unexpected, and the likelihood of overflowing the
count
variable is not unrealistic. That would cause problems....You should consider converting
count
andrandomNumber
to belong
values, and only convert it back to anint
if it is less-thanreservoirSize
Finally....
Nice code to review. Thanks. Obviously, the new language features in more recent Java versions may be interesting to consider....
A few random notes:
FileNotFoundException
is a subclass ofIOException
. Declaring bothFileNotFoundException
andIOException
is redundant.The
Sampler()
constructor can be omitted, since it is just the default constructor that would have been implicitly defined anyway.Your
sampler()
method is just a pure function, and could be madestatic
.The
reservoirSampling
class should be namedReservoirSampling
:Class names should be nouns, in mixed case with the first letter of each internal word capitalized.
Reference: Code Conventions for the Java Programming Language, 9 - Naming Conventions
According to the previous Code Conventions,
sampler()
should be namedsample()
:Methods should be verbs, in mixed case with the first letter lowercase, with the first letter of each internal word capitalized.
It is a bad idea to hard-code the filename from which you will be sampling. The file should be passed in as a parameter. I think it's odd to take samples from an HTML file, which usually doesn't contain line-oriented data.
Variables
currentLine
andrandomNumber
are only used inside the while loop, and should be declared inside it. (See Effective Java, 2nd edition, Item 45: Minimize the scope of local variables)It is not too difficult to change your code produce samples with any type, not just strings:
public static <T> List<T> sample(Iterator<T> pool, int reservoirSize) { //reservoirList is where our selected lines stored List<T> reservoirList = new ArrayList<T>(reservoirSize); // we will use this counter to count the current item number while iterating int count = 0; Random ra = new Random(); while (pool.hasNext()) { T current = pool.next(); count++; // ... } }
(See Effective Java, 2nd edition, Item 27: Favor generic methods)
For an
Iterator<String>
over the lines of a file, you could useFileUtils.lineIterator(new File(...))
from Apache Commons IO.
At the first glance
You can use Scanner.netxLine()
available on all OS.
(randomNumber = (int) ra.nextInt(count))<reservoirSize
EDIT suppressed unnecessary lines after comments
-
\$\begingroup\$ So use sc.nextLine() not Scanner(...).useDelimeter("\n")? Got it. Also, I don't follow how an
else if
would miss some lines? Reservoir sampling takes the firstn
lines then randomly swaps them out. Sometimes it swaps, sometimes it doesn't I thnk I see what you're doing, but that's not quite the point of reservoir sampling I don't think. \$\endgroup\$AncientSwordRage– AncientSwordRage2012年10月04日 07:46:14 +00:00Commented Oct 4, 2012 at 7:46 -
\$\begingroup\$ @Pureferret
else if
whithout a finalelse
block do not catch all conditions. But it is that you want. So removewhile(1) ..
line andbreak
line with a}
so you remove the negative int randomised and avoid a Exception (no negative integer in an array position) \$\endgroup\$cl-r– cl-r2012年10月04日 07:54:38 +00:00Commented Oct 4, 2012 at 7:54 -
\$\begingroup\$ Well the only thing not inside both the
else
and theif
statement is the while and the random number, which the iff relies on - why not make them the same line? Also, I should never get a random number less than 0.... \$\endgroup\$AncientSwordRage– AncientSwordRage2012年10月04日 08:02:53 +00:00Commented Oct 4, 2012 at 8:02 -
\$\begingroup\$ @Pureferret
java.util.Random
return also negative number so you cannot test the two limits in theif( ..)
statement . Maodify with EDIT \$\endgroup\$cl-r– cl-r2012年10月04日 08:30:44 +00:00Commented Oct 4, 2012 at 8:30 -
\$\begingroup\$ From the docs
nextInt(int n): Returns a pseudorandom, uniformly distributed int value between 0 (inclusive) and the specified value (exclusive), drawn from this random number generator's sequence.
So no. It can't return a negative. \$\endgroup\$AncientSwordRage– AncientSwordRage2012年10月04日 08:48:08 +00:00Commented Oct 4, 2012 at 8:48