I recently implemented a Bloom Filter in Haskell and, although I am not new to functional programming, I am a beginner when it comes to Haskell itself.
I'll gladly take any feedback regarding the implementation or the code style. I tried to stick to the Haskell conventions regarding function and variable naming, but I may have gotten some stuff wrong.
You will notice that I'm using my own implementation of a Bitset, you can assume it behaves like a normal one (I sure hope so).
module DataStructures.BloomFilter (empty, insert, member, DataStructures.BloomFilter.null) where
import System.Random (random, StdGen)
import qualified DataStructures.Bitset as Bitset
import Data.Hashable (hashWithSalt)
-- n: expected number of items in the Bloom Filter
-- p: acceptable probability of a false positive
-- m: max number of bits the Bloom Filter will use
-- k: number of hashing functions
data BloomFilter = BloomFilter {
n :: Int, p :: Float, bitset :: Bitset.Bitset, m :: Int, k :: Int, hashSeed :: Int
} deriving (Eq, Show)
getMaxSize :: Int -> Float -> Int
getMaxSize n p = abs $ ceiling $ fromIntegral n * (log p) / (log (1 / (log 2 ^ 2)))
getNumberOfHashFunctions :: Int -> Int -> Int
getNumberOfHashFunctions n m = round $ fromIntegral (m `div` n) * log 2
empty :: Int -> Float -> StdGen -> BloomFilter
empty n p randGen =
let m = getMaxSize n p
k = getNumberOfHashFunctions n m
hashSeed = fst $ random randGen
in BloomFilter n p Bitset.empty m k hashSeed
null :: BloomFilter -> Bool
null = Bitset.null . bitset
getHashes :: Show a => BloomFilter -> a -> [Int]
getHashes bloomFilter elem =
let str = show elem
seed = hashSeed bloomFilter
maxSize = m bloomFilter
in (`mod` maxSize) . abs . (`hashWithSalt` str) . (seed +) <$> [1..(k bloomFilter)]
insert :: Show a => BloomFilter -> a -> BloomFilter
insert bloomFilter elem =
let hashes = getHashes bloomFilter elem
newBitset = Bitset.insertMany (bitset bloomFilter) hashes
in bloomFilter { bitset = newBitset }
-- Returns whether an element *may be* present in the bloom filter.
-- This function can yield false positives, but not false negatives.
member :: Show a => BloomFilter -> a -> Bool
member bloomFilter elem =
let hashes = getHashes bloomFilter elem
bs = bitset bloomFilter
in all (Bitset.member bs) hashes
I would have liked to use a murmur3 hashing function but all Haskell implementations use types that I'm not yet familiar with (Word32, ByteString).
2 Answers 2
Instead of transforming everything to String
using Show
just for the purpose of hashing it, you should constraint element types to be Hashable
instead:
getHashes :: Hashable a => BloomFilter -> a -> [Int]
getHashes bloomFilter elem =
let seed = hashSeed bloomFilter
maxSize = m bloomFilter
in (`mod` maxSize) . abs . (`hashWithSalt` elem) . (seed +) <$> [1..(k bloomFilter)]
I'd also use maxSize
and numFuns
as the field names in BloomFilter
and then use RecordWildCards
:
getHashes :: Hashable a => BloomFilter -> a -> [Int]
getHashes BloomFilter{..} elem = map nthHash [1..numFuns]
where
nthHash n = abs (hashWithSalt (n + hashSeed) elem) `mod` maxSize
Or maybe even nicer:
getHashes :: Hashable a => BloomFilter -> a -> [Int]
getHashes BloomFilter{..} elem =
[ abs (hashWithSalt (i + hashSeed) elem) `mod` maxSize | i <- [1..numFuns] ]
-
\$\begingroup\$ Oh I did not know about the RecordWildCards thing. Very cool, thanks! \$\endgroup\$Bertrand– Bertrand2020年05月26日 14:53:43 +00:00Commented May 26, 2020 at 14:53
I only have one minor nitpick: the definition of getMaxSize
becomes clearer if we use logBase
:
abs $ ceiling $ fromIntegral n * (log p) / (log (1 / (log 2 ^ 2)))
becomes
abs $ ceiling $ fromIntegral n * (- 0.5) * logBase (log 2) p
We can use the identity ceiling (-x) == - floor(x)
to get
abs . floor $ fromIntegral n * 0.5 * logBase (log 2) p
Explore related questions
See similar questions with these tags.