Bloom filter implementation using a BitArray

Question 1

I am following the Wikipedia definition of the Bloom filter. My implementation uses the BitArray I created here. Here is an alternative implementation by Ilya Grigorik.

# n = Number of items in the filter
# p = Probability of false positives, float between 0 and 1 or a number indicating 1-in-p
# m = Number of bits in the filter
# k = Number of hash functions
require_relative '../bitarray/bit_array'
require 'digest'
class BloomFilter
 def initialize k: 1, m: 10
 @k = k
 @m = m
 @n = 0
 @b = BitArray.new
 end
 # Probability of a false positive based on formula in wikipedia
 def false_positive_rate
 fp_rate = (1-(1-1/@m.to_f)**(@k.to_f*@n.to_f))**@k.to_f
 p fp_rate
 fp_rate.round(2)
 end 
 def insert item
 @n += 1
 @k.times do |n|
 hash_val = Digest::MD5.hexdigest( item.to_s + n.to_s ).to_i(16)
 position = hash_val % @m
 @b.set position
 end 
 self
 end 
 def include? item
 @k.times do |n|
 hash_val = Digest::MD5.hexdigest( item.to_s + n.to_s ).to_i(16)
 position = hash_val % @m
 return @b.get(position) == 1
 end 
 return false
 end 
end

The test:

require 'minitest/autorun'
require_relative 'bloom_filter'
class BloomFilterTest < MiniTest::Unit::TestCase
 # Verified probability formula and tests from http://hur.st/bloomfilter?n=1&p=0.01
 def test_stats
 assert_equal 0.01, BloomFilter.new(k: 7, m: 10).insert("a").false_positive_rate
 b = BloomFilter.new(k: 2, m: 29)
 10.times { |n| b.insert(n) }
 assert_equal 0.25, b.false_positive_rate
 end 
 def test_include
 b = BloomFilter.new(k: 2, m: 100)
 b.insert "The Trumpet of the Swan"
 assert_equal true, b.include?("The Trumpet of the Swan")
 refute b.include?("E. B. White") 
 end 
end

Question 2

You're using assert_equal to compare something with true - which is exactly what the basic assert method already does. Just like you use refute to check that something's false, assert checks that something's true.
You might want to check that \$k ≤ m\$. Doesn't really make sense otherwise. (Of course it doesn't really make sense for \$k\$ to equal \$m\$ either, as adding just 1 item would set every bit right away, but at least that's technically valid.)
In this case, I'm OK with using single-letter variables like k and m, since that appears to be conventional enough, and they're hard to give simple, yet descriptive names. ~~(削除) However, I'd also add copious comments to explain exactly what the heck they mean. I had to go to Wikipedia to find out. (削除ここまで)~~ (Edit: Sorry, comments already there, just not by the #initialize method, where I expected them to be.)
p fp_rate - don't leave print calls in final code (unless that code's purpose is to print, of course).
Why does #false_positive_rate round its output? Rounding is a presentation concern; the rate just is what it is, decimals and all. I'm thinking you're maybe rounding so it passes your tests, but in that case you're writing your code backwards.
Your #include? method is broken. It'll return (from the method; not just the block) at the first bit it checks - it won't check all of them. So if the first bit it checks is zero, you get false - which is correct! But if the first bit is 1, you immediate get true even though none of the other k-1 bits have been checked.
This also means that your false positive rate is at best

$1ドル - (1 - \frac{1}{m})^n$$

because \$k\$ is effectively always 1 when checking. But if \$k\$ is higher, then we'd be setting more bits than we're checking, so the bit array fills up faster.
For instance, if \$k = 2\$ and \$m = 4\$ (extreme values; just for illustration), then the first item (call it \$A\$) we insert should set half of the bits in the bit array. That should (if my math is correct) give a false positive rate of ~19% after adding that one item: There's a 19% chance that another item, \$B\,ドル will match the same 2 bits. But if we only check 1 bit, then the false positive rate is essentially 50%, since half the bits in the array are set.

Making only only minimal changes, you could do this:
```
@k.times do |n|
 hash_val = Digest::MD5.hexdigest( item.to_s + n.to_s ).to_i(16)
 position = hash_val % @m
 return false if @b.get(position).zero?
end 
true
```
Now, it'll only return if a bit is zero, otherwise it'll keep going.
But your hashing logic is flawed too, I'm afraid. There's no guarantee that it'll return \$k\$ unique bit positions (for \$k > 1\$). It's possible (though unlikely) that you'll calculate \$k\$ MD5 hashes that are all cleanly divisible by (i.e. are multiples of) \$m\$. In that case, you'd just get the bit position zero \$k\$ number of times and end up setting a single bit. It's a similar situation to the one above, since again, the false positive rate rises.
If, for example, \$k = 2\$ and \$m = 4\$ again, and we've added item \$A\$ which (correctly) set 2 bits, we should have that 19% false positive rate when checking \$B\$. But if the hash for \$B\$ works out so that it's just 1 bit instead of 2, then we're back to a 50% false positive rate since it's only that 1 bit that would have to match, rather than 2 separate bits.

Frankly, I don't know enough about creating robust hashing functions to offer a good solution to this. I'll continue and I'll provide alternative code for different things, but it all includes this basic flaw (except in the addendum at the end). The code I provide here won't fix that; it's there to illustrate other aspects of the code.
You've duplicated the logic for hashing items. Duplication is always bad, but this is a prime example of why: If the two pieces of code aren't exactly identical, the whole thing stops functioning. I'd extract the hashing logic into a separate, private method that returns an array of bit positions
```
 def item_bit_positions item
 @k.times.map do |n|
 hash = Digest::MD5.hexdigest("#{item}#{n}").to_i(16)
 hash % @m
 end
 end
```
You'll note that I'm using string interpolation; IMO it's clearer than manually calling to_s and concatenating. The results the same, though, but that result may not be what you want (see below)
Using MD5 and then discarding a bunch of stuff really feels like overkill. Ruby already maintains a hash for anything that inherits from Object (i.e. everything). It's what's used to check for object equality, and it's hexadecimal number you see printed if you print an object that doesn't have a #to_s method.
#hash is just a (signed) integer. You could do something like:
```
 def item_bit_positions item
 @k.times.map { |n| item.hash % (@m-n) }
 end
```
(Again, this is just as flawed [edit: Actually, it's more flawed, see below] as your current implementation (it can return fewer than \$k\$ positions); it's just here to illustrate how #hash can be used.)

The above does not require digest. It does however require that \$k ≤ m - 1\,ドル or you'll get a divide-by-zero error.

Note, though, that unlike MD5 which will always produce the same output for a given input, #hash is simply a number generated at runtime, so it'll be different each time you run the code. This could make testing tricky, since you're also dealing with something that can produce false positives. (One solution to this might be to stub out #hash on the items you're inserting.)

Using #hash also changes the behavior somewhat since it's now about object-wise equality - not string-wise equality. In your implementation, the string "foo" and the symbol :foo would be treated as one and the same, since both have the same string representation, and you're converting everything to a string to calculate the MD5 hash. But using #hash there's no such collision.
You're using your BitArray class, but I'd be tempted to just re-roll some of the same logic here, since it doesn't take much, and the implementation can be more specific. You can do everything with integers and bitwise manipulation.

class BloomFilter
 # This reader is mostly here for test purposes,
 # which isn't too pretty, I'm afraid. At least
 # it's a pretty harmless addition
 attr_reader :bit_array
 # TODO: Explain what k and m mean!
 def initialize(k = 1, m = 10)
 k, m = k.to_i, m.to_i
 raise RangeError, "k must be smaller than m" if k >= m
 raise ArgumentError, "k and m must be positive integers" if k < 1
 @k = k
 @m = m
 @bit_array = 0
 end
 def insert(item)
 @bit_array |= digest(item)
 self
 end
 def include?(item)
 hash = digest(item)
 @bit_array & hash == hash
 end
 alias_method :member?, :include?
 private
 # WARNING: This method is flawed! See answer
 def digest(item)
 hash = item.hash
 @k.times.map do |n|
 bit = (hash % (@m-n))
 2**bit
 end.inject(&:|)
 end
end
gem 'minitest'
require 'minitest/autorun'
class BloomFilterTest < Minitest::Test
 def test_initialize_with_valid_args
 assert BloomFilter.new(1, 2)
 end
 def test_initialize_with_invalid_args
 assert_raises RangeError do
 BloomFilter.new(2, 2)
 end
 end
 def test_initialize_with_negative_args
 assert_raises ArgumentError do
 BloomFilter.new(-2, 2)
 end
 end
 def test_insert_uses_the_items_hash
 item = mock_item(123)
 instance = BloomFilter.new
 instance.insert(item)
 item.verify
 end
 def test_insert_changes_the_bit_array
 instance = BloomFilter.new
 previous = instance.bit_array
 instance.insert(Object.new)
 refute_equal previous, instance.bit_array
 end
 def test_include_uses_the_items_hash
 item = mock_item(123)
 instance = BloomFilter.new
 instance.include?(item)
 item.verify
 end
 def test_include_returns_true_for_member
 item = mock_item(123, 2)
 instance = BloomFilter.new
 instance.insert(item)
 assert instance.include?(item)
 end
 def test_include_returns_false_for_nonmember
 member = mock_item(123)
 non_member = mock_item(122)
 instance = BloomFilter.new(3, 10)
 instance.insert(member)
 refute instance.include?(non_member)
 end
 # An intentionally failing test
 def test_digest_always_sets_k_bits
 k = 3
 item = mock_item(720) # multiple of 10, 8 and 9
 instance = BloomFilter.new(k, 10)
 hash = instance.send(:digest, item) # testing private method
 assert_equal k, count_bits(hash), "Note: This test intentionally fails!"
 end
 # A helper method (mock object factory)
 def mock_item(hash, calls = 1)
 mock = Minitest::Mock.new
 calls.times { mock.expect(:hash, hash) }
 mock
 end
 def count_bits(number)
 count = 0
 until number == 0
 count += (number & 1)
 number = number >> 1
 end
 count
 end
end

A thing I've noticed in you past few questions: You're writing tests which is great, but you're writing the barest minimum of tests. For instance, you're not testing #insert at all, and your test for #include? doesn't include an edge case as the one mentioned above where 2 items shared the same first bit - but not all of them.

Basically, you should have a higher test-to-code ratio. It's not uncommon to have way more test code than implementation code; in fact that's something to aim for.

Addendum: This is an interesting topic, actually. Not usually my cup of tea (I suck at theoretical stuff like this), but I got nerd sniped.

I've also realized that my hash implementation is even more flawed than I thought: Even if it manages to set \$k\$ bits (which isn't guaranteed), their distribution isn't random. Since I'm using modulo @m - n, only the first iteration has a chance of setting the highest bit. Boo, me, boo.

Anyway, as I said, I don't know of a way to construct a hash function that will always set exactly \$k\$ bits with random distribution, and I'm not smart enough to work one out - but I can think of a hack.

If \$m\$ is simply defined to be some multiple of \$k\,ドル then we can do something like this:

If we say that \$k = 4\$ and our multiplier \$f = 8\$ (just as an example), then \$m = k \cdot f = 32\$. So we have an array that's 32 bits long, or - to put it another way - we have 4 arrays that are each 8 bits long.

We can then do something like this (using #hash for simplicity):

mask = 0
k.times do |n|
 hash = calc_hash(n, item) # calculate a big 'ol hash (assume this method exists)
 mask = mask << f # shift our current mask f places
 mask |= 2**(hash % f) # set a bit somewhere in the lower f places
end

mask will now contain exactly 4 "set" bits - one in each of its 4 "sections". Even if 2**(hash % f) is the same for each iteration, it'll still end up in different places, since we shift the mask.

This will not have the same false positive rate as a proper hashing function, but at least it'll be predictable. But it will be slightly worse overall.

In essence, what we have are four Bloom filters, each set to \$k = 1\$ and \$m = f\$. So (if my math is correct, which is not something to rely on) our false positive rate should be the regular rate for one such filter, but to the power of 4. Or, in mathematical terms:

$$(1 - (1 - \frac{1}{f})^{n})^{k}$$

(I think that's correct but I may be completely wrong)

Question 3

I DID consider using ruby's Hash but since it seeds a new value for each process I decided that would not be good. Other implementations I have seen use CRC32. Hashes use the division method to fit into an array so that is what I used. I have the documentation for those variables at the top. I added link to another implementation that I was using to understand the functionality.

Question 4

@gitarchie Whoops - sorry about missing the comments at the top. Must've scrolled around, and not scrolled all the way back. As for using #hash, yours is a fair point. At least with a single hashing method, it's easy to swap out the logic there (especially since it's flawed)

Question 5

Why do you think digest is flawed, I asked a question here and implemented based on answer and reference implementation I found. programmers.stackexchange.com/questions/260459/…

Question 6

@gitarchie Perhaps my understanding is flawed, but that question pertains reusing hashing functions (w/ seeding); not the output of said hashing function. My point is that if the hash of an item sets fewer than k bits, then you increase the risk of false positives. If e.g. k is 5, but the item hash you're using only sets a single bit, then the question of whether the filter includes the item hinges one just 1 bit - not five. Thus, the false positive rate will increase. The rate will be somewhere between the rate for k = 1 and k = 5

Question 7

@gitarchie I've added a big chunk at the end of my answer, which you might find interesting. It was just something I did for fun, but this is absolutely not my area of expertise, so take it with a grain of salt :)

Flambino Flambino 33.3k2 gold badges46 silver badges90 bronze badges · Accepted Answer · 2014-10-21 21:00:09Z

You're using assert_equal to compare something with true - which is exactly what the basic assert method already does. Just like you use refute to check that something's false, assert checks that something's true.
You might want to check that \$k ≤ m\$. Doesn't really make sense otherwise. (Of course it doesn't really make sense for \$k\$ to equal \$m\$ either, as adding just 1 item would set every bit right away, but at least that's technically valid.)
In this case, I'm OK with using single-letter variables like k and m, since that appears to be conventional enough, and they're hard to give simple, yet descriptive names. ~~(削除) However, I'd also add copious comments to explain exactly what the heck they mean. I had to go to Wikipedia to find out. (削除ここまで)~~ (Edit: Sorry, comments already there, just not by the #initialize method, where I expected them to be.)
p fp_rate - don't leave print calls in final code (unless that code's purpose is to print, of course).
Why does #false_positive_rate round its output? Rounding is a presentation concern; the rate just is what it is, decimals and all. I'm thinking you're maybe rounding so it passes your tests, but in that case you're writing your code backwards.
Your #include? method is broken. It'll return (from the method; not just the block) at the first bit it checks - it won't check all of them. So if the first bit it checks is zero, you get false - which is correct! But if the first bit is 1, you immediate get true even though none of the other k-1 bits have been checked.
This also means that your false positive rate is at best

$1ドル - (1 - \frac{1}{m})^n$$

because \$k\$ is effectively always 1 when checking. But if \$k\$ is higher, then we'd be setting more bits than we're checking, so the bit array fills up faster.
For instance, if \$k = 2\$ and \$m = 4\$ (extreme values; just for illustration), then the first item (call it \$A\$) we insert should set half of the bits in the bit array. That should (if my math is correct) give a false positive rate of ~19% after adding that one item: There's a 19% chance that another item, \$B\,ドル will match the same 2 bits. But if we only check 1 bit, then the false positive rate is essentially 50%, since half the bits in the array are set.

Making only only minimal changes, you could do this:
```
@k.times do |n|
 hash_val = Digest::MD5.hexdigest( item.to_s + n.to_s ).to_i(16)
 position = hash_val % @m
 return false if @b.get(position).zero?
end 
true
```
Now, it'll only return if a bit is zero, otherwise it'll keep going.
But your hashing logic is flawed too, I'm afraid. There's no guarantee that it'll return \$k\$ unique bit positions (for \$k > 1\$). It's possible (though unlikely) that you'll calculate \$k\$ MD5 hashes that are all cleanly divisible by (i.e. are multiples of) \$m\$. In that case, you'd just get the bit position zero \$k\$ number of times and end up setting a single bit. It's a similar situation to the one above, since again, the false positive rate rises.
If, for example, \$k = 2\$ and \$m = 4\$ again, and we've added item \$A\$ which (correctly) set 2 bits, we should have that 19% false positive rate when checking \$B\$. But if the hash for \$B\$ works out so that it's just 1 bit instead of 2, then we're back to a 50% false positive rate since it's only that 1 bit that would have to match, rather than 2 separate bits.

Frankly, I don't know enough about creating robust hashing functions to offer a good solution to this. I'll continue and I'll provide alternative code for different things, but it all includes this basic flaw (except in the addendum at the end). The code I provide here won't fix that; it's there to illustrate other aspects of the code.
You've duplicated the logic for hashing items. Duplication is always bad, but this is a prime example of why: If the two pieces of code aren't exactly identical, the whole thing stops functioning. I'd extract the hashing logic into a separate, private method that returns an array of bit positions
```
 def item_bit_positions item
 @k.times.map do |n|
 hash = Digest::MD5.hexdigest("#{item}#{n}").to_i(16)
 hash % @m
 end
 end
```
You'll note that I'm using string interpolation; IMO it's clearer than manually calling to_s and concatenating. The results the same, though, but that result may not be what you want (see below)
Using MD5 and then discarding a bunch of stuff really feels like overkill. Ruby already maintains a hash for anything that inherits from Object (i.e. everything). It's what's used to check for object equality, and it's hexadecimal number you see printed if you print an object that doesn't have a #to_s method.
#hash is just a (signed) integer. You could do something like:
```
 def item_bit_positions item
 @k.times.map { |n| item.hash % (@m-n) }
 end
```
(Again, this is just as flawed [edit: Actually, it's more flawed, see below] as your current implementation (it can return fewer than \$k\$ positions); it's just here to illustrate how #hash can be used.)

The above does not require digest. It does however require that \$k ≤ m - 1\,ドル or you'll get a divide-by-zero error.

Note, though, that unlike MD5 which will always produce the same output for a given input, #hash is simply a number generated at runtime, so it'll be different each time you run the code. This could make testing tricky, since you're also dealing with something that can produce false positives. (One solution to this might be to stub out #hash on the items you're inserting.)

Using #hash also changes the behavior somewhat since it's now about object-wise equality - not string-wise equality. In your implementation, the string "foo" and the symbol :foo would be treated as one and the same, since both have the same string representation, and you're converting everything to a string to calculate the MD5 hash. But using #hash there's no such collision.
You're using your BitArray class, but I'd be tempted to just re-roll some of the same logic here, since it doesn't take much, and the implementation can be more specific. You can do everything with integers and bitwise manipulation.

class BloomFilter
 # This reader is mostly here for test purposes,
 # which isn't too pretty, I'm afraid. At least
 # it's a pretty harmless addition
 attr_reader :bit_array
 # TODO: Explain what k and m mean!
 def initialize(k = 1, m = 10)
 k, m = k.to_i, m.to_i
 raise RangeError, "k must be smaller than m" if k >= m
 raise ArgumentError, "k and m must be positive integers" if k < 1
 @k = k
 @m = m
 @bit_array = 0
 end
 def insert(item)
 @bit_array |= digest(item)
 self
 end
 def include?(item)
 hash = digest(item)
 @bit_array & hash == hash
 end
 alias_method :member?, :include?
 private
 # WARNING: This method is flawed! See answer
 def digest(item)
 hash = item.hash
 @k.times.map do |n|
 bit = (hash % (@m-n))
 2**bit
 end.inject(&:|)
 end
end
gem 'minitest'
require 'minitest/autorun'
class BloomFilterTest < Minitest::Test
 def test_initialize_with_valid_args
 assert BloomFilter.new(1, 2)
 end
 def test_initialize_with_invalid_args
 assert_raises RangeError do
 BloomFilter.new(2, 2)
 end
 end
 def test_initialize_with_negative_args
 assert_raises ArgumentError do
 BloomFilter.new(-2, 2)
 end
 end
 def test_insert_uses_the_items_hash
 item = mock_item(123)
 instance = BloomFilter.new
 instance.insert(item)
 item.verify
 end
 def test_insert_changes_the_bit_array
 instance = BloomFilter.new
 previous = instance.bit_array
 instance.insert(Object.new)
 refute_equal previous, instance.bit_array
 end
 def test_include_uses_the_items_hash
 item = mock_item(123)
 instance = BloomFilter.new
 instance.include?(item)
 item.verify
 end
 def test_include_returns_true_for_member
 item = mock_item(123, 2)
 instance = BloomFilter.new
 instance.insert(item)
 assert instance.include?(item)
 end
 def test_include_returns_false_for_nonmember
 member = mock_item(123)
 non_member = mock_item(122)
 instance = BloomFilter.new(3, 10)
 instance.insert(member)
 refute instance.include?(non_member)
 end
 # An intentionally failing test
 def test_digest_always_sets_k_bits
 k = 3
 item = mock_item(720) # multiple of 10, 8 and 9
 instance = BloomFilter.new(k, 10)
 hash = instance.send(:digest, item) # testing private method
 assert_equal k, count_bits(hash), "Note: This test intentionally fails!"
 end
 # A helper method (mock object factory)
 def mock_item(hash, calls = 1)
 mock = Minitest::Mock.new
 calls.times { mock.expect(:hash, hash) }
 mock
 end
 def count_bits(number)
 count = 0
 until number == 0
 count += (number & 1)
 number = number >> 1
 end
 count
 end
end