Implementing feature hashing using guava

Question 1

I am trying to implement feature hashing in Java. For this I am trying to use Hashing functions from guava.

This is how I am doing it now:

//seed is any integer.
HashFunction hf = Hashing.murmur3_32(seed); 
//key is the input int and hd is the hashed dimension.
int hash = Hashing.consistentHash(hf.hashInt(key), hd);

I have another method for sign, not shown in this question.

To measure the performance, I am doing something like below:

int min = 0, max = 2000000, sampleSize = 3000000;
int[] hds = { 2000, 5000, 6000, 10000, 20000, 500000, 1000000, 2000000 };
List<Double> guavaCollisons = new ArrayList<>();
UniformIntegerDistribution runif = new UniformIntegerDistribution(min, max);
Arrays.stream(hds)
 .forEach(hd -> {
 int seed = runif.sample();
 HashFunction hf = Hashing.murmur3_32(seed);
 Map<Integer, Set<Integer>> guava = new HashMap<>();
 Arrays.stream(runif.sample(sampleSize))
 .forEach(key -> {
 int guavaHash = Hashing.consistentHash(hf.hashInt(key), hd);
 if (!guava.containsKey(guavaHash))
 guava.put(guavaHash, new HashSet<Integer>());
 guava.get(guavaHash).add(key);
 });
 double guavaCollisionHd = guava.entrySet()
 .stream()
 .mapToInt(entry -> entry.getValue()
 .size())
 .average()
 .getAsDouble();
 guavaCollisons.add(guavaCollisionHd);
 System.out.println("Available buckets = " + hd 
 + ", unutilized = " + (hd - guava.keySet().size())
 + ", guava collision = " + guavaCollisionHd);
 });
double guavaCollision = Stats.meanOf(guavaCollisons);
System.out.println("Average collision = " + guavaCollision);

My questions are:

Is this a correct way to implement the 'hash' part of feature hashing?
Is this a correct way to measure the collision rate of the hashing technique?

Question 2

If you are downvoting, please provide reason; otherwise I can't improve the quality of this question.

Question 3

Simplified code:

List<Double> guavaCollisions = Arrays.stream(hashDimensions)
 .map(dimension -> {
 HashFunction hashFunction = Hashing.murmur3_32(runif.sample());
 Map<Integer, Long> guavaCounts = Arrays.stream(runif.sample(sampleSize))
 .collect(Collectors.groupingBy(key -> Hashing.consistentHash(hashFunction.hashInt(key), dimension),
 Collectors.counting());
 return guavaCounts.values().stream().mapToLong(id -> id).average().getAsDouble();
 }).collect(Collectors.toList());

Reasoning:

Inlined the seed. Keeping the seed variable there isn't necessary since we don't use it anymore, not even in the logging-statement.
Used collect instead of forEach(el -> collection.add(el)) to create Collections which is more idiomatic.
Dropped the Set<Integer> from the Map, since we don't actually ever use the Integers we keep in memory here, but instead only want the number of unique integers. (Note that currently I don't check for uniqueness, which means the semantics changed)
Simplified the calculation of average from guavaCounts by not streaming the entry-set and then mapping to value, which could be simplified by using Entry::getValue, but directly streaming the values().
Dropped the System.out because it's incredibly slow in comparison, a side-effect that shouldn't be happening in a Stream and doesn't add viable information to the "final" result of the code.
Renamed hds, hd and hf because shortening names is pointless. I'd have preferred to also rename runif, but didn't find a better name, which would remain succinct.

Vogel612 Vogel612 25.5k7 gold badges59 silver badges141 bronze badges · Answer 1 · 2017-03-13 10:12:28Z

Simplified code:

List<Double> guavaCollisions = Arrays.stream(hashDimensions)
 .map(dimension -> {
 HashFunction hashFunction = Hashing.murmur3_32(runif.sample());
 Map<Integer, Long> guavaCounts = Arrays.stream(runif.sample(sampleSize))
 .collect(Collectors.groupingBy(key -> Hashing.consistentHash(hashFunction.hashInt(key), dimension),
 Collectors.counting());
 return guavaCounts.values().stream().mapToLong(id -> id).average().getAsDouble();
 }).collect(Collectors.toList());

Reasoning:

Inlined the seed. Keeping the seed variable there isn't necessary since we don't use it anymore, not even in the logging-statement.
Used collect instead of forEach(el -> collection.add(el)) to create Collections which is more idiomatic.
Dropped the Set<Integer> from the Map, since we don't actually ever use the Integers we keep in memory here, but instead only want the number of unique integers. (Note that currently I don't check for uniqueness, which means the semantics changed)
Simplified the calculation of average from guavaCounts by not streaming the entry-set and then mapping to value, which could be simplified by using Entry::getValue, but directly streaming the values().
Dropped the System.out because it's incredibly slow in comparison, a side-effect that shouldn't be happening in a Stream and doesn't add viable information to the "final" result of the code.
Renamed hds, hd and hf because shortening names is pointless. I'd have preferred to also rename runif, but didn't find a better name, which would remain succinct.

Stack Exchange Network

Implementing feature hashing using guava

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Implementing feature hashing using guava

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions