Bloomfilter in Kotlin

Question 1

I would like a code-review. Not so much on if the implementation is good or effecient, it's probably not, more on code style and readability.

import java.lang.Exception
import java.nio.ByteBuffer
import java.security.MessageDigest
import java.util.*
import kotlin.math.abs
fun main() {
 val filterSize = 1_000_000
 val numberOfEntries = 100_000
 val filter = BloomFilter(filterSize, numberOfHashes = 4)
 val entriesInFilter = Array(numberOfEntries) { randomString() }
 val entriesNotInFilter = Array(numberOfEntries) { randomString() }
 for (entry in entriesInFilter)
 filter.add(entry)
 val confusionMatrix = ConfusionMatrix(filter, entriesInFilter, entriesNotInFilter)
 confusionMatrix.printReport()
 if (confusionMatrix.falseNegativeRate > 0.0) {
 throw Exception("This should not happen, if it does the implementation of the bloom filter is wrong.")
 }
}
class BloomFilter(private val size: Int, numberOfHashes: Int) {
 private val flags = BitSet(size)
 private val salts = IntArray(numberOfHashes) { it }.map { it.toString() }
 private val sha = MessageDigest.getInstance("SHA-1")
 fun add(entry: String) {
 for (salt in salts) {
 val index = hashedIndex(entry, salt)
 flags.set(index)
 }
 }
 fun maybeExists(entry: String): Boolean {
 for (salt in salts) {
 val index = hashedIndex(entry, salt)
 if (!flags[index]) {
 return false
 }
 }
 return true
 }
 private fun hashedIndex(entry: String, salt: String): Int {
 val salted = entry + salt
 val hash = sha.digest(salted.toByteArray())
 val wrapped = ByteBuffer.wrap(hash)
 return abs(wrapped.int) % size
 }
}
class ConfusionMatrix(filter: BloomFilter, entriesInFilter: Array<String>, entriesNotInFilter: Array<String>) {
 private val inFilterCount = entriesInFilter.size
 private val notInFilterCount = entriesNotInFilter.size
 private var truePositiveCount = 0
 private var trueNegativeCount = 0
 private var falsePositiveCount = 0
 private var falseNegativeCount = 0
 val accuracyRate by lazy { (truePositiveCount + trueNegativeCount).toDouble() / (notInFilterCount + inFilterCount) }
 val misclassificationRate by lazy { 1.0 - accuracyRate }
 val truePositiveRate by lazy { truePositiveCount.toDouble() / inFilterCount }
 val trueNegativeRate by lazy { trueNegativeCount.toDouble() / notInFilterCount }
 val falsePositiveRate by lazy { falsePositiveCount.toDouble() / notInFilterCount }
 val falseNegativeRate by lazy { falseNegativeCount.toDouble() / inFilterCount }
 init {
 countTruePositiveAndFalseNegative(entriesInFilter, filter)
 countFalsePositiveAndTrueNegative(entriesNotInFilter, filter)
 }
 private fun countTruePositiveAndFalseNegative(entriesInFilter: Array<String>, filter: BloomFilter) {
 for (entryInFilter in entriesInFilter) {
 if (filter.maybeExists(entryInFilter)) {
 truePositiveCount++
 } else {
 falseNegativeCount++
 }
 }
 }
 private fun countFalsePositiveAndTrueNegative(entriesNotInFilter: Array<String>, filter: BloomFilter) {
 for (entryNotInFilter in entriesNotInFilter) {
 if (filter.maybeExists(entryNotInFilter)) {
 falsePositiveCount++
 } else {
 trueNegativeCount++
 }
 }
 }
 fun printReport() {
 val dataRows = mapOf(
 "Accuracy" to accuracyRate,
 "Misclassification rate" to misclassificationRate,
 "True positive rate" to truePositiveRate,
 "True negative rate" to trueNegativeRate,
 "False positive rate" to falsePositiveRate,
 "False negative rate" to falseNegativeRate
 )
 val printer = Printer(dataRows)
 printer.print()
 }
}
class Printer(private val dataRows: Map<String, Double>) {
 private val spacing = 2
 private val longestLabelLength = getLongestString(dataRows.keys, default=50) + spacing
 private val stringBuilder = StringBuilder()
 private fun getLongestString(labels: Set<String>, default: Int): Int {
 return labels.map { it.length }.max() ?: default
 }
 fun print() {
 for ((label, value) in dataRows) {
 printLabel(label)
 printPadding(label)
 printFormattedValue(value)
 println()
 }
 }
 private fun printLabel(label: String) {
 print("$label:")
 }
 private fun printPadding(label: String) {
 val paddingNeeded = longestLabelLength - label.length
 stringBuilder.clear()
 for (x in 0 until paddingNeeded) stringBuilder.append(" ")
 print(stringBuilder.toString())
 }
 private fun printFormattedValue(value: Double) {
 val width6digits2 = "%6.2f"
 val percentage = String.format(width6digits2, value * 100) + "%"
 print(percentage)
 }
}
private fun randomString(): String {
 return UUID.randomUUID().toString()
}

Question 2

Why are you using lazy for trivial division operations?

Question 3

Yes it's probably overkill. I don't like them to be var or even private set as var, as they are only set once which is exactly was lazy does when called and then saves for later. In C# we have the readonly modifier where fields can be set only in object initialization, but then still remain public and communicate clearly that this is constant after initialization. Does Kotlin have similar, what would you suggest?

Question 4

= is uglier than by lazy { }? Or if you just define them as Doubles, you can set them one time in the init block, like you described doing in C#.

Question 5

But then I have to make them var with private set. Which 1) gives two lines instead of one for each property and 2) gives the impression that accuracyRate can change after initialization. var accuracyRate = 0.0 private set

Question 6

You don't have to make it a var. pl.kotl.in/1yNSwYOwh

Question 7

Here's how I'd clean up the ConfusionMatrix class. I don't know anything about this algorithm, but this should be equivalent code. You can calculate and set these read-only values at their declaration sites if you do them in order. So all parameters can be val and you don't need lazy, which wraps your property in a Lazy class. There are no custom getters and there are no setters, so the whole class is immutable and compact with no references to anything else once it's instantiated.

class ConfusionMatrix(filter: BloomFilter, entriesInFilter: Array<String>, entriesNotInFilter: Array<String>) {
 private val inFilterCount = entriesInFilter.size
 private val notInFilterCount = entriesNotInFilter.size
 private val truePositiveCount = entriesInFilter.count { filter.maybeExists(it) }
 private val falseNegativeCount = entriesInFilter.size - truePositiveCount
 private val falsePositiveCount = entriesNotInFilter.count { filter.maybeExists(it) }
 private val trueNegativeCount = entriesNotInFilter.size - truePositiveCount
 val accuracyRate = (truePositiveCount + trueNegativeCount).toDouble() / (notInFilterCount + inFilterCount)
 val misclassificationRate = 1.0 - accuracyRate
 val truePositiveRate = truePositiveCount.toDouble() / inFilterCount 
 val trueNegativeRate = trueNegativeCount.toDouble() / notInFilterCount
 val falsePositiveRate = falsePositiveCount.toDouble() / notInFilterCount
 val falseNegativeRate = falseNegativeCount.toDouble() / inFilterCount
 fun printReport() {
 val dataRows = mapOf(
 "Accuracy" to accuracyRate,
 "Misclassification rate" to misclassificationRate,
 "True positive rate" to truePositiveRate,
 "True negative rate" to trueNegativeRate,
 "False positive rate" to falsePositiveRate,
 "False negative rate" to falseNegativeRate
 )
 val printer = Printer(dataRows)
 printer.print()
 }
}

Knowing nothing of the algorithm, I'd say BloomFilter is pretty clean, but you could more naturally write the declaration of salts like this:

private val salts = (0..numberOfHashes).map { it.toString() }

or

private val salts = (0..numberOfHashes).map(Int::toString)

The second form is usually preferred to lambdas when there's a function that exactly matches the required signature because it shows the type. Not really helpful here, but helpful in a chain of functional calls to make it more readable later.

In your main method, a couple of little tips...

When you want to do some sort of logging type of action without side effects as you are assigning something to a variable, you can use also. It kind of de-emphasizes it for someone reading your code, especially if it's some action that takes a few lines of code. It's not all that useful here since this is so simple, but might be handy for you in other situations.

val confusionMatrix = ConfusionMatrix(filter, entriesInFilter, entriesNotInFilter)
 also { it.printReport() }

And there's a function for asserting something and throwing a runtime exception if it fails, so your last bit can be cleaned up:

require(confusionMatrix.falseNegativeRate > 0.0) {
 "This should not happen, if it does the implementation of the bloom filter is wrong."
}

Question 8

After looking at it a bit

hashedIndex does many things. It salts the input, hashes it, wraps it and makes sure it fits into the size. Could it be split up and more clear what is happening?

The confusion matrix seems like a general mathy thing, why does it have a direct dependency on a BloomFilter and its data? Try to come up with some way of decoupling these so the confusion matrix can be reused for other statistical purposes.

countTruePositiveAndFalseNegative and countFalsePositiveAndTrueNegative looks a lot like repetition, can the logic be moved to a single implementation?

None of the classes implement interfaces or abstract methods, so using them would require a dependency to the concrete implementation, making the depende unnecessarily difficult to test and change.

There is a possible divide by zero issue if inFilterCount or notInFilterCount is zero.

Question 9

I got around the dependency from ConfusionMatrix to Bloomfilter by using a lambda that "predicts", which I think is a good use of lambdas compared to a solution using e.g. adapter pattern, where a lot of boilerplate would be created.

Tenfour04 Tenfour04 8234 silver badges9 bronze badges · Answer 1 · 2020-08-27 20:04:25Z

Here's how I'd clean up the ConfusionMatrix class. I don't know anything about this algorithm, but this should be equivalent code. You can calculate and set these read-only values at their declaration sites if you do them in order. So all parameters can be val and you don't need lazy, which wraps your property in a Lazy class. There are no custom getters and there are no setters, so the whole class is immutable and compact with no references to anything else once it's instantiated.

class ConfusionMatrix(filter: BloomFilter, entriesInFilter: Array<String>, entriesNotInFilter: Array<String>) {
 private val inFilterCount = entriesInFilter.size
 private val notInFilterCount = entriesNotInFilter.size
 private val truePositiveCount = entriesInFilter.count { filter.maybeExists(it) }
 private val falseNegativeCount = entriesInFilter.size - truePositiveCount
 private val falsePositiveCount = entriesNotInFilter.count { filter.maybeExists(it) }
 private val trueNegativeCount = entriesNotInFilter.size - truePositiveCount
 val accuracyRate = (truePositiveCount + trueNegativeCount).toDouble() / (notInFilterCount + inFilterCount)
 val misclassificationRate = 1.0 - accuracyRate
 val truePositiveRate = truePositiveCount.toDouble() / inFilterCount 
 val trueNegativeRate = trueNegativeCount.toDouble() / notInFilterCount
 val falsePositiveRate = falsePositiveCount.toDouble() / notInFilterCount
 val falseNegativeRate = falseNegativeCount.toDouble() / inFilterCount
 fun printReport() {
 val dataRows = mapOf(
 "Accuracy" to accuracyRate,
 "Misclassification rate" to misclassificationRate,
 "True positive rate" to truePositiveRate,
 "True negative rate" to trueNegativeRate,
 "False positive rate" to falsePositiveRate,
 "False negative rate" to falseNegativeRate
 )
 val printer = Printer(dataRows)
 printer.print()
 }
}

Knowing nothing of the algorithm, I'd say BloomFilter is pretty clean, but you could more naturally write the declaration of salts like this:

private val salts = (0..numberOfHashes).map { it.toString() }

or

private val salts = (0..numberOfHashes).map(Int::toString)

The second form is usually preferred to lambdas when there's a function that exactly matches the required signature because it shows the type. Not really helpful here, but helpful in a chain of functional calls to make it more readable later.

In your main method, a couple of little tips...

When you want to do some sort of logging type of action without side effects as you are assigning something to a variable, you can use also. It kind of de-emphasizes it for someone reading your code, especially if it's some action that takes a few lines of code. It's not all that useful here since this is so simple, but might be handy for you in other situations.

val confusionMatrix = ConfusionMatrix(filter, entriesInFilter, entriesNotInFilter)
 also { it.printReport() }

And there's a function for asserting something and throwing a runtime exception if it fails, so your last bit can be cleaned up:

require(confusionMatrix.falseNegativeRate > 0.0) {
 "This should not happen, if it does the implementation of the bloom filter is wrong."
}

Peheje Peheje 1193 bronze badges · Answer 2 · 2020-08-23 16:32:24Z

After looking at it a bit

hashedIndex does many things. It salts the input, hashes it, wraps it and makes sure it fits into the size. Could it be split up and more clear what is happening?

The confusion matrix seems like a general mathy thing, why does it have a direct dependency on a BloomFilter and its data? Try to come up with some way of decoupling these so the confusion matrix can be reused for other statistical purposes.

countTruePositiveAndFalseNegative and countFalsePositiveAndTrueNegative looks a lot like repetition, can the logic be moved to a single implementation?

None of the classes implement interfaces or abstract methods, so using them would require a dependency to the concrete implementation, making the depende unnecessarily difficult to test and change.

There is a possible divide by zero issue if inFilterCount or notInFilterCount is zero.

I got around the dependency from ConfusionMatrix to Bloomfilter by using a lambda that "predicts", which I think is a good use of lambdas compared to a solution using e.g. adapter pattern, where a lot of boilerplate would be created.

Stack Exchange Network

Bloomfilter in Kotlin

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Bloomfilter in Kotlin

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions