I am trying to implement feature hashing in Java. For this I am trying to use Hashing functions from guava.
This is how I am doing it now:
//seed is any integer.
HashFunction hf = Hashing.murmur3_32(seed);
//key is the input int and hd is the hashed dimension.
int hash = Hashing.consistentHash(hf.hashInt(key), hd);
I have another method for sign, not shown in this question.
To measure the performance, I am doing something like below:
int min = 0, max = 2000000, sampleSize = 3000000;
int[] hds = { 2000, 5000, 6000, 10000, 20000, 500000, 1000000, 2000000 };
List<Double> guavaCollisons = new ArrayList<>();
UniformIntegerDistribution runif = new UniformIntegerDistribution(min, max);
Arrays.stream(hds)
.forEach(hd -> {
int seed = runif.sample();
HashFunction hf = Hashing.murmur3_32(seed);
Map<Integer, Set<Integer>> guava = new HashMap<>();
Arrays.stream(runif.sample(sampleSize))
.forEach(key -> {
int guavaHash = Hashing.consistentHash(hf.hashInt(key), hd);
if (!guava.containsKey(guavaHash))
guava.put(guavaHash, new HashSet<Integer>());
guava.get(guavaHash).add(key);
});
double guavaCollisionHd = guava.entrySet()
.stream()
.mapToInt(entry -> entry.getValue()
.size())
.average()
.getAsDouble();
guavaCollisons.add(guavaCollisionHd);
System.out.println("Available buckets = " + hd
+ ", unutilized = " + (hd - guava.keySet().size())
+ ", guava collision = " + guavaCollisionHd);
});
double guavaCollision = Stats.meanOf(guavaCollisons);
System.out.println("Average collision = " + guavaCollision);
My questions are:
- Is this a correct way to implement the 'hash' part of feature hashing?
- Is this a correct way to measure the collision rate of the hashing technique?
-
\$\begingroup\$ If you are downvoting, please provide reason; otherwise I can't improve the quality of this question. \$\endgroup\$Sayan Pal– Sayan Pal2017年03月11日 07:56:51 +00:00Commented Mar 11, 2017 at 7:56
1 Answer 1
Simplified code:
List<Double> guavaCollisions = Arrays.stream(hashDimensions)
.map(dimension -> {
HashFunction hashFunction = Hashing.murmur3_32(runif.sample());
Map<Integer, Long> guavaCounts = Arrays.stream(runif.sample(sampleSize))
.collect(Collectors.groupingBy(key -> Hashing.consistentHash(hashFunction.hashInt(key), dimension),
Collectors.counting());
return guavaCounts.values().stream().mapToLong(id -> id).average().getAsDouble();
}).collect(Collectors.toList());
Reasoning:
Inlined the
seed
. Keeping theseed
variable there isn't necessary since we don't use it anymore, not even in the logging-statement.Used
collect
instead offorEach(el -> collection.add(el))
to create Collections which is more idiomatic.Dropped the
Set<Integer>
from theMap
, since we don't actually ever use the Integers we keep in memory here, but instead only want the number of unique integers. (Note that currently I don't check for uniqueness, which means the semantics changed)Simplified the calculation of
average
fromguavaCounts
by not streaming the entry-set and then mapping to value, which could be simplified by usingEntry::getValue
, but directly streaming thevalues()
.Dropped the
System.out
because it's incredibly slow in comparison, a side-effect that shouldn't be happening in aStream
and doesn't add viable information to the "final" result of the code.Renamed
hds
,hd
andhf
because shortening names is pointless. I'd have preferred to also renamerunif
, but didn't find a better name, which would remain succinct.