I am following the Wikipedia definition of the Bloom filter. My implementation uses the BitArray
I created here. Here is an alternative implementation by Ilya Grigorik.
# n = Number of items in the filter
# p = Probability of false positives, float between 0 and 1 or a number indicating 1-in-p
# m = Number of bits in the filter
# k = Number of hash functions
require_relative '../bitarray/bit_array'
require 'digest'
class BloomFilter
def initialize k: 1, m: 10
@k = k
@m = m
@n = 0
@b = BitArray.new
end
# Probability of a false positive based on formula in wikipedia
def false_positive_rate
fp_rate = (1-(1-1/@m.to_f)**(@k.to_f*@n.to_f))**@k.to_f
p fp_rate
fp_rate.round(2)
end
def insert item
@n += 1
@k.times do |n|
hash_val = Digest::MD5.hexdigest( item.to_s + n.to_s ).to_i(16)
position = hash_val % @m
@b.set position
end
self
end
def include? item
@k.times do |n|
hash_val = Digest::MD5.hexdigest( item.to_s + n.to_s ).to_i(16)
position = hash_val % @m
return @b.get(position) == 1
end
return false
end
end
The test:
require 'minitest/autorun'
require_relative 'bloom_filter'
class BloomFilterTest < MiniTest::Unit::TestCase
# Verified probability formula and tests from http://hur.st/bloomfilter?n=1&p=0.01
def test_stats
assert_equal 0.01, BloomFilter.new(k: 7, m: 10).insert("a").false_positive_rate
b = BloomFilter.new(k: 2, m: 29)
10.times { |n| b.insert(n) }
assert_equal 0.25, b.false_positive_rate
end
def test_include
b = BloomFilter.new(k: 2, m: 100)
b.insert "The Trumpet of the Swan"
assert_equal true, b.include?("The Trumpet of the Swan")
refute b.include?("E. B. White")
end
end
1 Answer 1
You're using
assert_equal
to compare something withtrue
- which is exactly what the basicassert
method already does. Just like you userefute
to check that something's false,assert
checks that something's true.You might want to check that \$k ≤ m\$. Doesn't really make sense otherwise. (Of course it doesn't really make sense for \$k\$ to equal \$m\$ either, as adding just 1 item would set every bit right away, but at least that's technically valid.)
In this case, I'm OK with using single-letter variables like
k
andm
, since that appears to be conventional enough, and they're hard to give simple, yet descriptive names.(削除) However, I'd also add copious comments to explain exactly what the heck they mean. I had to go to Wikipedia to find out. (削除ここまで)(Edit: Sorry, comments already there, just not by the#initialize
method, where I expected them to be.)p fp_rate
- don't leave print calls in final code (unless that code's purpose is to print, of course).Why does
#false_positive_rate
round its output? Rounding is a presentation concern; the rate just is what it is, decimals and all. I'm thinking you're maybe rounding so it passes your tests, but in that case you're writing your code backwards.Your
#include?
method is broken. It'll return (from the method; not just the block) at the first bit it checks - it won't check all of them. So if the first bit it checks is zero, you getfalse
- which is correct! But if the first bit is 1, you immediate gettrue
even though none of the other k-1 bits have been checked.
This also means that your false positive rate is at best$1ドル - (1 - \frac{1}{m})^n$$
because \$k\$ is effectively always 1 when checking. But if \$k\$ is higher, then we'd be setting more bits than we're checking, so the bit array fills up faster.
For instance, if \$k = 2\$ and \$m = 4\$ (extreme values; just for illustration), then the first item (call it \$A\$) we insert should set half of the bits in the bit array. That should (if my math is correct) give a false positive rate of ~19% after adding that one item: There's a 19% chance that another item, \$B\,ドル will match the same 2 bits. But if we only check 1 bit, then the false positive rate is essentially 50%, since half the bits in the array are set.Making only only minimal changes, you could do this:
@k.times do |n| hash_val = Digest::MD5.hexdigest( item.to_s + n.to_s ).to_i(16) position = hash_val % @m return false if @b.get(position).zero? end true
Now, it'll only return if a bit is zero, otherwise it'll keep going.
But your hashing logic is flawed too, I'm afraid. There's no guarantee that it'll return \$k\$ unique bit positions (for \$k > 1\$). It's possible (though unlikely) that you'll calculate \$k\$ MD5 hashes that are all cleanly divisible by (i.e. are multiples of) \$m\$. In that case, you'd just get the bit position zero \$k\$ number of times and end up setting a single bit. It's a similar situation to the one above, since again, the false positive rate rises.
If, for example, \$k = 2\$ and \$m = 4\$ again, and we've added item \$A\$ which (correctly) set 2 bits, we should have that 19% false positive rate when checking \$B\$. But if the hash for \$B\$ works out so that it's just 1 bit instead of 2, then we're back to a 50% false positive rate since it's only that 1 bit that would have to match, rather than 2 separate bits.Frankly, I don't know enough about creating robust hashing functions to offer a good solution to this. I'll continue and I'll provide alternative code for different things, but it all includes this basic flaw (except in the addendum at the end). The code I provide here won't fix that; it's there to illustrate other aspects of the code.
You've duplicated the logic for hashing items. Duplication is always bad, but this is a prime example of why: If the two pieces of code aren't exactly identical, the whole thing stops functioning. I'd extract the hashing logic into a separate, private method that returns an array of bit positions
def item_bit_positions item @k.times.map do |n| hash = Digest::MD5.hexdigest("#{item}#{n}").to_i(16) hash % @m end end
You'll note that I'm using string interpolation; IMO it's clearer than manually calling
to_s
and concatenating. The results the same, though, but that result may not be what you want (see below)Using MD5 and then discarding a bunch of stuff really feels like overkill. Ruby already maintains a hash for anything that inherits from
Object
(i.e. everything). It's what's used to check for object equality, and it's hexadecimal number you see printed if you print an object that doesn't have a#to_s
method.
#hash
is just a (signed) integer. You could do something like:def item_bit_positions item @k.times.map { |n| item.hash % (@m-n) } end
(Again, this is just as flawed [edit: Actually, it's more flawed, see below] as your current implementation (it can return fewer than \$k\$ positions); it's just here to illustrate how
#hash
can be used.)The above does not require
digest
. It does however require that \$k ≤ m - 1\,ドル or you'll get a divide-by-zero error.Note, though, that unlike MD5 which will always produce the same output for a given input,
#hash
is simply a number generated at runtime, so it'll be different each time you run the code. This could make testing tricky, since you're also dealing with something that can produce false positives. (One solution to this might be to stub out#hash
on the items you're inserting.)Using
#hash
also changes the behavior somewhat since it's now about object-wise equality - not string-wise equality. In your implementation, the string"foo"
and the symbol:foo
would be treated as one and the same, since both have the same string representation, and you're converting everything to a string to calculate the MD5 hash. But using#hash
there's no such collision.You're using your
BitArray
class, but I'd be tempted to just re-roll some of the same logic here, since it doesn't take much, and the implementation can be more specific. You can do everything with integers and bitwise manipulation.
class BloomFilter
# This reader is mostly here for test purposes,
# which isn't too pretty, I'm afraid. At least
# it's a pretty harmless addition
attr_reader :bit_array
# TODO: Explain what k and m mean!
def initialize(k = 1, m = 10)
k, m = k.to_i, m.to_i
raise RangeError, "k must be smaller than m" if k >= m
raise ArgumentError, "k and m must be positive integers" if k < 1
@k = k
@m = m
@bit_array = 0
end
def insert(item)
@bit_array |= digest(item)
self
end
def include?(item)
hash = digest(item)
@bit_array & hash == hash
end
alias_method :member?, :include?
private
# WARNING: This method is flawed! See answer
def digest(item)
hash = item.hash
@k.times.map do |n|
bit = (hash % (@m-n))
2**bit
end.inject(&:|)
end
end
gem 'minitest'
require 'minitest/autorun'
class BloomFilterTest < Minitest::Test
def test_initialize_with_valid_args
assert BloomFilter.new(1, 2)
end
def test_initialize_with_invalid_args
assert_raises RangeError do
BloomFilter.new(2, 2)
end
end
def test_initialize_with_negative_args
assert_raises ArgumentError do
BloomFilter.new(-2, 2)
end
end
def test_insert_uses_the_items_hash
item = mock_item(123)
instance = BloomFilter.new
instance.insert(item)
item.verify
end
def test_insert_changes_the_bit_array
instance = BloomFilter.new
previous = instance.bit_array
instance.insert(Object.new)
refute_equal previous, instance.bit_array
end
def test_include_uses_the_items_hash
item = mock_item(123)
instance = BloomFilter.new
instance.include?(item)
item.verify
end
def test_include_returns_true_for_member
item = mock_item(123, 2)
instance = BloomFilter.new
instance.insert(item)
assert instance.include?(item)
end
def test_include_returns_false_for_nonmember
member = mock_item(123)
non_member = mock_item(122)
instance = BloomFilter.new(3, 10)
instance.insert(member)
refute instance.include?(non_member)
end
# An intentionally failing test
def test_digest_always_sets_k_bits
k = 3
item = mock_item(720) # multiple of 10, 8 and 9
instance = BloomFilter.new(k, 10)
hash = instance.send(:digest, item) # testing private method
assert_equal k, count_bits(hash), "Note: This test intentionally fails!"
end
# A helper method (mock object factory)
def mock_item(hash, calls = 1)
mock = Minitest::Mock.new
calls.times { mock.expect(:hash, hash) }
mock
end
def count_bits(number)
count = 0
until number == 0
count += (number & 1)
number = number >> 1
end
count
end
end
A thing I've noticed in you past few questions: You're writing tests which is great, but you're writing the barest minimum of tests. For instance, you're not testing #insert
at all, and your test for #include?
doesn't include an edge case as the one mentioned above where 2 items shared the same first bit - but not all of them.
Basically, you should have a higher test-to-code ratio. It's not uncommon to have way more test code than implementation code; in fact that's something to aim for.
Addendum: This is an interesting topic, actually. Not usually my cup of tea (I suck at theoretical stuff like this), but I got nerd sniped.
I've also realized that my hash implementation is even more flawed than I thought: Even if it manages to set \$k\$ bits (which isn't guaranteed), their distribution isn't random. Since I'm using modulo @m - n
, only the first iteration has a chance of setting the highest bit. Boo, me, boo.
Anyway, as I said, I don't know of a way to construct a hash function that will always set exactly \$k\$ bits with random distribution, and I'm not smart enough to work one out - but I can think of a hack.
If \$m\$ is simply defined to be some multiple of \$k\,ドル then we can do something like this:
If we say that \$k = 4\$ and our multiplier \$f = 8\$ (just as an example), then \$m = k \cdot f = 32\$. So we have an array that's 32 bits long, or - to put it another way - we have 4 arrays that are each 8 bits long.
We can then do something like this (using #hash
for simplicity):
mask = 0
k.times do |n|
hash = calc_hash(n, item) # calculate a big 'ol hash (assume this method exists)
mask = mask << f # shift our current mask f places
mask |= 2**(hash % f) # set a bit somewhere in the lower f places
end
mask
will now contain exactly 4 "set" bits - one in each of its 4 "sections". Even if 2**(hash % f)
is the same for each iteration, it'll still end up in different places, since we shift the mask.
This will not have the same false positive rate as a proper hashing function, but at least it'll be predictable. But it will be slightly worse overall.
In essence, what we have are four Bloom filters, each set to \$k = 1\$ and \$m = f\$. So (if my math is correct, which is not something to rely on) our false positive rate should be the regular rate for one such filter, but to the power of 4. Or, in mathematical terms:
$$(1 - (1 - \frac{1}{f})^{n})^{k}$$
(I think that's correct but I may be completely wrong)
-
\$\begingroup\$ I DID consider using ruby's
Hash
but since it seeds a new value for each process I decided that would not be good. Other implementations I have seen use CRC32. Hashes use thedivision method
to fit into an array so that is what I used. I have the documentation for those variables at the top. I added link to another implementation that I was using to understand the functionality. \$\endgroup\$aarti– aarti2014年10月21日 21:19:07 +00:00Commented Oct 21, 2014 at 21:19 -
\$\begingroup\$ @gitarchie Whoops - sorry about missing the comments at the top. Must've scrolled around, and not scrolled all the way back. As for using
#hash
, yours is a fair point. At least with a single hashing method, it's easy to swap out the logic there (especially since it's flawed) \$\endgroup\$Flambino– Flambino2014年10月21日 21:23:43 +00:00Commented Oct 21, 2014 at 21:23 -
\$\begingroup\$ Why do you think digest is flawed, I asked a question here and implemented based on answer and reference implementation I found. programmers.stackexchange.com/questions/260459/… \$\endgroup\$aarti– aarti2014年10月21日 21:44:00 +00:00Commented Oct 21, 2014 at 21:44
-
1\$\begingroup\$ @gitarchie Perhaps my understanding is flawed, but that question pertains reusing hashing functions (w/ seeding); not the output of said hashing function. My point is that if the hash of an item sets fewer than k bits, then you increase the risk of false positives. If e.g. k is 5, but the item hash you're using only sets a single bit, then the question of whether the filter includes the item hinges one just 1 bit - not five. Thus, the false positive rate will increase. The rate will be somewhere between the rate for k = 1 and k = 5 \$\endgroup\$Flambino– Flambino2014年10月21日 21:55:01 +00:00Commented Oct 21, 2014 at 21:55
-
1\$\begingroup\$ @gitarchie I've added a big chunk at the end of my answer, which you might find interesting. It was just something I did for fun, but this is absolutely not my area of expertise, so take it with a grain of salt :) \$\endgroup\$Flambino– Flambino2014年10月22日 14:26:15 +00:00Commented Oct 22, 2014 at 14:26