Bloom Filter in Haskell

Question 1

I recently implemented a Bloom Filter in Haskell and, although I am not new to functional programming, I am a beginner when it comes to Haskell itself.

I'll gladly take any feedback regarding the implementation or the code style. I tried to stick to the Haskell conventions regarding function and variable naming, but I may have gotten some stuff wrong.

You will notice that I'm using my own implementation of a Bitset, you can assume it behaves like a normal one (I sure hope so).

module DataStructures.BloomFilter (empty, insert, member, DataStructures.BloomFilter.null) where
import System.Random (random, StdGen)
import qualified DataStructures.Bitset as Bitset
import Data.Hashable (hashWithSalt)
-- n: expected number of items in the Bloom Filter
-- p: acceptable probability of a false positive
-- m: max number of bits the Bloom Filter will use
-- k: number of hashing functions
data BloomFilter = BloomFilter {
 n :: Int, p :: Float, bitset :: Bitset.Bitset, m :: Int, k :: Int, hashSeed :: Int
} deriving (Eq, Show)
getMaxSize :: Int -> Float -> Int
getMaxSize n p = abs $ ceiling $ fromIntegral n * (log p) / (log (1 / (log 2 ^ 2)))
getNumberOfHashFunctions :: Int -> Int -> Int
getNumberOfHashFunctions n m = round $ fromIntegral (m `div` n) * log 2
empty :: Int -> Float -> StdGen -> BloomFilter
empty n p randGen =
 let m = getMaxSize n p
 k = getNumberOfHashFunctions n m
 hashSeed = fst $ random randGen
 in BloomFilter n p Bitset.empty m k hashSeed
null :: BloomFilter -> Bool
null = Bitset.null . bitset
getHashes :: Show a => BloomFilter -> a -> [Int]
getHashes bloomFilter elem =
 let str = show elem
 seed = hashSeed bloomFilter
 maxSize = m bloomFilter
 in (`mod` maxSize) . abs . (`hashWithSalt` str) . (seed +) <$> [1..(k bloomFilter)]
insert :: Show a => BloomFilter -> a -> BloomFilter
insert bloomFilter elem =
 let hashes = getHashes bloomFilter elem
 newBitset = Bitset.insertMany (bitset bloomFilter) hashes
 in bloomFilter { bitset = newBitset }
-- Returns whether an element *may be* present in the bloom filter.
-- This function can yield false positives, but not false negatives.
member :: Show a => BloomFilter -> a -> Bool
member bloomFilter elem =
 let hashes = getHashes bloomFilter elem
 bs = bitset bloomFilter
 in all (Bitset.member bs) hashes

I would have liked to use a murmur3 hashing function but all Haskell implementations use types that I'm not yet familiar with (Word32, ByteString).

Question 2

Instead of transforming everything to String using Show just for the purpose of hashing it, you should constraint element types to be Hashable instead:

getHashes :: Hashable a => BloomFilter -> a -> [Int]
getHashes bloomFilter elem =
 let seed = hashSeed bloomFilter
 maxSize = m bloomFilter
 in (`mod` maxSize) . abs . (`hashWithSalt` elem) . (seed +) <$> [1..(k bloomFilter)]

I'd also use maxSize and numFuns as the field names in BloomFilter and then use RecordWildCards:

getHashes :: Hashable a => BloomFilter -> a -> [Int]
getHashes BloomFilter{..} elem = map nthHash [1..numFuns]
 where
 nthHash n = abs (hashWithSalt (n + hashSeed) elem) `mod` maxSize

Or maybe even nicer:

getHashes :: Hashable a => BloomFilter -> a -> [Int]
getHashes BloomFilter{..} elem = 
 [ abs (hashWithSalt (i + hashSeed) elem) `mod` maxSize | i <- [1..numFuns] ]

Question 3

Oh I did not know about the RecordWildCards thing. Very cool, thanks!

Question 4

I only have one minor nitpick: the definition of getMaxSize becomes clearer if we use logBase:

abs $ ceiling $ fromIntegral n * (log p) / (log (1 / (log 2 ^ 2)))

becomes

abs $ ceiling $ fromIntegral n * (- 0.5) * logBase (log 2) p

We can use the identity ceiling (-x) == - floor(x) to get

abs . floor $ fromIntegral n * 0.5 * logBase (log 2) p

Cactus Cactus 2121 silver badge5 bronze badges · Answer 1 · 2020-05-25 11:40:00Z

Instead of transforming everything to String using Show just for the purpose of hashing it, you should constraint element types to be Hashable instead:

getHashes :: Hashable a => BloomFilter -> a -> [Int]
getHashes bloomFilter elem =
 let seed = hashSeed bloomFilter
 maxSize = m bloomFilter
 in (`mod` maxSize) . abs . (`hashWithSalt` elem) . (seed +) <$> [1..(k bloomFilter)]

I'd also use maxSize and numFuns as the field names in BloomFilter and then use RecordWildCards:

getHashes :: Hashable a => BloomFilter -> a -> [Int]
getHashes BloomFilter{..} elem = map nthHash [1..numFuns]
 where
 nthHash n = abs (hashWithSalt (n + hashSeed) elem) `mod` maxSize

Or maybe even nicer:

getHashes :: Hashable a => BloomFilter -> a -> [Int]
getHashes BloomFilter{..} elem = 
 [ abs (hashWithSalt (i + hashSeed) elem) `mod` maxSize | i <- [1..numFuns] ]

Oh I did not know about the RecordWildCards thing. Very cool, thanks!

Eman Yalpsid Eman Yalpsid 1,56911 silver badges16 bronze badges · Answer 2 · 2020-05-01 12:08:46Z

I only have one minor nitpick: the definition of getMaxSize becomes clearer if we use logBase:

abs $ ceiling $ fromIntegral n * (log p) / (log (1 / (log 2 ^ 2)))

becomes

abs $ ceiling $ fromIntegral n * (- 0.5) * logBase (log 2) p

We can use the identity ceiling (-x) == - floor(x) to get

abs . floor $ fromIntegral n * 0.5 * logBase (log 2) p

Stack Exchange Network

Bloom Filter in Haskell

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Bloom Filter in Haskell

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions