Background
Bloom-filter is a data structure to which we can insert elements, and check if it already contains a given element. The peculiarity is, that if a contains
query returns true
, then it might still be possible, that in fact, this element was not inserted to the filter. (If, on the other hand, it returns false
, then the element was definitely not inserted previously.)
The implementation consists of a bit-vector of length n
(originally all bits are 0
), and of k
hash functions, which map any input value into the range of ([0...n)
, i.e. 0
inclusive, n
exclusive). When adding an element, we compute its mapped value for all of the n
hash functions, and set the corresponding bits to one, in the bit vector. Similarly, when querying if an element was added, we compute the value for all the hash functions and return true
, if all the corresponding bits are true
, and false
otherwise (i.e., if the corresponding bit for at least one function is zero).
Objectives of the review
While any remark/suggestion is always welcome, I'm mostly interested in the following aspects:
Is this a correct implementation of the data-structure, or do you see any flaws?
Is there a way to make the implementation more optimal? (E.g. is there an elegant way in
bloom-contains
to jump out of thereduce
if we encounter a bit not equal to1
?)Related to the above: is there a way to make this code more idiomatic? (I.e., conforming to Clojure best practices.)
Can you think of any missing tests? Or some other edge-cases which are not covered?
Out of scope
The quality of the hash-functions used for testing is out of scope of this review. (I know there are much better ones, but for now I focused on the data-structure itself.) However, if you know a way to e.g. better organize them, and avoid repetition (but without making that global!), that would be very appreciated.
The code
core.clj
(ns bloom-filter.core)
(defn bloom-create [numbits hash-functions]
(if (or (nil? numbits) (nil? hash-functions)) nil
{:bits (vec (repeat numbits 0)) :hash-functions hash-functions}))
(defn bloom-add [bloom-filter value]
(when-not (nil? bloom-filter)
(let [hash-functions (:hash-functions bloom-filter)
bits (:bits bloom-filter)
new-bits (reduce (fn [actual-bits hash-function] (assoc actual-bits (hash-function value) 1)) bits hash-functions)]
(assoc-in bloom-filter [:bits] new-bits))))
(defn bloom-contains [bloom-filter value]
(let [hash-functions (:hash-functions bloom-filter)
bits (:bits bloom-filter)]
(reduce (fn [actual-value hash-function] (and actual-value (= 1 (bits (hash-function value))))) true hash-functions)))
core-test.clj
(ns bloom-filter.core-test
(:require [clojure.test :refer :all]
[bloom-filter.core :refer :all]))
(defn mod7-fun [num] (mod num 7))
(defn always-zero-fun [dontcare] 0)
(deftest create-test
(testing "create bloom filter"
(is (= nil (bloom-create nil nil)))
(is (= nil (bloom-create 0 nil)))
(is (= nil (bloom-create nil [])))
(is (= {:bits [] :hash-functions []} (bloom-create 0 [])))
(is (= {:bits [0] :hash-functions []} (bloom-create 1 [])))
(is (= {:bits [] :hash-functions [always-zero-fun]} (bloom-create 0 [always-zero-fun])))
))
(deftest add-test
(let [empty-filter (bloom-create 7 [])
single-fun-filter (bloom-create 7 [mod7-fun])
two-fun-filter (bloom-create 7 [mod7-fun always-zero-fun])]
(testing "add to bloom filter"
(is (= nil (bloom-add nil 3)))
(is (= empty-filter (bloom-add empty-filter nil)))
(is (= empty-filter (bloom-add empty-filter 10)))
(is (= {:bits [0 0 0 1 0 0 0] :hash-functions [mod7-fun]}
(bloom-add single-fun-filter 3)))
(is (= {:bits [1 0 0 1 0 0 0] :hash-functions [mod7-fun always-zero-fun]}
(bloom-add two-fun-filter 3)))
)))
(deftest contains-test
(let [empty-filter (bloom-create 7 [])
simple-filter (bloom-create 7 [mod7-fun])
filter-with-element (bloom-add simple-filter 3)]
(testing "bloom filter contains"
(is (true? (bloom-contains empty-filter 0)))
(is (false? (bloom-contains simple-filter 0)))
(is (true? (bloom-contains filter-with-element 3)))
(is (true? (bloom-contains filter-with-element 10)))
)))
GitHub repo
-
\$\begingroup\$ very nice problem statement and question. \$\endgroup\$erdos– erdos2018年02月08日 07:18:09 +00:00Commented Feb 8, 2018 at 7:18
1 Answer 1
First of all, this is a very nice question and problem statement. My review consists mainly of code style improvements buy maybe you will still find it helpful.
bloom-create
function
- Instead of
(if ... nil ...)
you can use(when-not ... ...)
. - Argument
hash-functions
is a collection of functions so you should write(empty? hash-functions)
instead of(nil? hash-functions)
. Be aware that this changes the semantics of your program. - I prefer writing simple argument validation as
assert
expressions. Like:(assert (every? ifn? hash-functions))
and(assert (number? numbits))
. - Instead of a vector of zeros you can use
(vector-of :boolean ...)
for better performance. - I think you should make sure that the hash functions no not return a value out of range. You can do this by composing them with
#(mod % numbits)
.
bloom-add
function
- Instead of
(when-not (nil? bloom-filter) ...) you can write (when bloom-filter ...)
- You can use destructuring to get
the contents of the
bloom-filter
parameter. - You should use
(assoc ... :bits ...)
instead of(assoc-in ... [:bits] ...)
. - You can use the thread last macro for organizing the code.
- You can use transient data structures in the
reduce
to make thinks faster. Unfortunately they do not work withvector-of
at the moment :(
bloom-contains
function
- It is a nice practice to have your predicate method names end with a question mark
- You can use reduced to jump
out of the
reduce
call. - However you might want to consider using every? or some predicate instead of a
reduce
.
Test cases
- Your data structure is immutable, therefore you can reuse
empty-filter
,single-fun-filter
andtwo-fun-filter
betweeb your tests. - In
add-test
you may only check the:bits
part of the result thus making the test cases shorter.
Code
(defn bloom-create [numbits hash-functions]
(when-not (or (nil? numbits) (empty? hash-functions))
(assert (number? numbits))
(assert (every? ifn? hash-functions))
{:bits (apply vector-of :boolean (repeat numbits false))
:hash-functions (mapv (partial comp #(mod % numbits)) hash-functions)}))
(defn bloom-add [{:keys [hash-functions bits] :as bloom-filter} value]
(when-not (nil? bloom-filter)
(->> hash-functions
(reduce (fn [actual-bits hash-function]
(assoc actual-bits (hash-function value) true)) bits)
(assoc bloom-filter :bits))))
(defn bloom-contains? [{:keys [hash-functions bits]} value]
(some (map #(false? (bits (% value))))))
(require '[clojure.test :refer :all])
(defn mod7-fun [num] (mod num 7))
(defn always-zero-fun [dontcare] 0)
(def single-fun-filter (bloom-create 7 [mod7-fun]))
(def two-fun-filter (bloom-create 7 [mod7-fun always-zero-fun]))
(deftest add-test
(testing "add to bloom filter"
(is (= nil (bloom-add nil 3)))
(is (= [false false false true false false false]
(:bits (bloom-add single-fun-filter 3))))
(is (= [true false false true false false false]
(:bits (bloom-add two-fun-filter 3))))))
-
\$\begingroup\$ Thanks for the very detailed review, @erdos. I improved the code based on your suggestions. I have some follow-up questions: What is your opinion on using
assert
s vs:pre
-conditions? What is the difference betweenapply vector-of :boolean
and simply usingvec
? (Btw, with the latter, alsotransient
works :) ) Something is missing from the thebloom-contains
implementation withsome
. (By addinghash-functions
as the second parameter it still does not work for me...) \$\endgroup\$Attilio– Attilio2018年02月11日 21:25:10 +00:00Commented Feb 11, 2018 at 21:25 -
\$\begingroup\$ Another interesting question would be if I can validate that the hash-functions really return numbers between
0
andN
, whereN
is the number of bits. I think this question is so interesting that I asked it on SO. \$\endgroup\$Attilio– Attilio2018年02月11日 21:52:57 +00:00Commented Feb 11, 2018 at 21:52