I would like a code-review. Not so much on if the implementation is good or effecient, it's probably not, more on code style and readability.
import java.lang.Exception
import java.nio.ByteBuffer
import java.security.MessageDigest
import java.util.*
import kotlin.math.abs
fun main() {
val filterSize = 1_000_000
val numberOfEntries = 100_000
val filter = BloomFilter(filterSize, numberOfHashes = 4)
val entriesInFilter = Array(numberOfEntries) { randomString() }
val entriesNotInFilter = Array(numberOfEntries) { randomString() }
for (entry in entriesInFilter)
filter.add(entry)
val confusionMatrix = ConfusionMatrix(filter, entriesInFilter, entriesNotInFilter)
confusionMatrix.printReport()
if (confusionMatrix.falseNegativeRate > 0.0) {
throw Exception("This should not happen, if it does the implementation of the bloom filter is wrong.")
}
}
class BloomFilter(private val size: Int, numberOfHashes: Int) {
private val flags = BitSet(size)
private val salts = IntArray(numberOfHashes) { it }.map { it.toString() }
private val sha = MessageDigest.getInstance("SHA-1")
fun add(entry: String) {
for (salt in salts) {
val index = hashedIndex(entry, salt)
flags.set(index)
}
}
fun maybeExists(entry: String): Boolean {
for (salt in salts) {
val index = hashedIndex(entry, salt)
if (!flags[index]) {
return false
}
}
return true
}
private fun hashedIndex(entry: String, salt: String): Int {
val salted = entry + salt
val hash = sha.digest(salted.toByteArray())
val wrapped = ByteBuffer.wrap(hash)
return abs(wrapped.int) % size
}
}
class ConfusionMatrix(filter: BloomFilter, entriesInFilter: Array<String>, entriesNotInFilter: Array<String>) {
private val inFilterCount = entriesInFilter.size
private val notInFilterCount = entriesNotInFilter.size
private var truePositiveCount = 0
private var trueNegativeCount = 0
private var falsePositiveCount = 0
private var falseNegativeCount = 0
val accuracyRate by lazy { (truePositiveCount + trueNegativeCount).toDouble() / (notInFilterCount + inFilterCount) }
val misclassificationRate by lazy { 1.0 - accuracyRate }
val truePositiveRate by lazy { truePositiveCount.toDouble() / inFilterCount }
val trueNegativeRate by lazy { trueNegativeCount.toDouble() / notInFilterCount }
val falsePositiveRate by lazy { falsePositiveCount.toDouble() / notInFilterCount }
val falseNegativeRate by lazy { falseNegativeCount.toDouble() / inFilterCount }
init {
countTruePositiveAndFalseNegative(entriesInFilter, filter)
countFalsePositiveAndTrueNegative(entriesNotInFilter, filter)
}
private fun countTruePositiveAndFalseNegative(entriesInFilter: Array<String>, filter: BloomFilter) {
for (entryInFilter in entriesInFilter) {
if (filter.maybeExists(entryInFilter)) {
truePositiveCount++
} else {
falseNegativeCount++
}
}
}
private fun countFalsePositiveAndTrueNegative(entriesNotInFilter: Array<String>, filter: BloomFilter) {
for (entryNotInFilter in entriesNotInFilter) {
if (filter.maybeExists(entryNotInFilter)) {
falsePositiveCount++
} else {
trueNegativeCount++
}
}
}
fun printReport() {
val dataRows = mapOf(
"Accuracy" to accuracyRate,
"Misclassification rate" to misclassificationRate,
"True positive rate" to truePositiveRate,
"True negative rate" to trueNegativeRate,
"False positive rate" to falsePositiveRate,
"False negative rate" to falseNegativeRate
)
val printer = Printer(dataRows)
printer.print()
}
}
class Printer(private val dataRows: Map<String, Double>) {
private val spacing = 2
private val longestLabelLength = getLongestString(dataRows.keys, default=50) + spacing
private val stringBuilder = StringBuilder()
private fun getLongestString(labels: Set<String>, default: Int): Int {
return labels.map { it.length }.max() ?: default
}
fun print() {
for ((label, value) in dataRows) {
printLabel(label)
printPadding(label)
printFormattedValue(value)
println()
}
}
private fun printLabel(label: String) {
print("$label:")
}
private fun printPadding(label: String) {
val paddingNeeded = longestLabelLength - label.length
stringBuilder.clear()
for (x in 0 until paddingNeeded) stringBuilder.append(" ")
print(stringBuilder.toString())
}
private fun printFormattedValue(value: Double) {
val width6digits2 = "%6.2f"
val percentage = String.format(width6digits2, value * 100) + "%"
print(percentage)
}
}
private fun randomString(): String {
return UUID.randomUUID().toString()
}
2 Answers 2
Here's how I'd clean up the ConfusionMatrix class. I don't know anything about this algorithm, but this should be equivalent code. You can calculate and set these read-only values at their declaration sites if you do them in order. So all parameters can be val
and you don't need lazy
, which wraps your property in a Lazy
class. There are no custom getters and there are no setters, so the whole class is immutable and compact with no references to anything else once it's instantiated.
class ConfusionMatrix(filter: BloomFilter, entriesInFilter: Array<String>, entriesNotInFilter: Array<String>) {
private val inFilterCount = entriesInFilter.size
private val notInFilterCount = entriesNotInFilter.size
private val truePositiveCount = entriesInFilter.count { filter.maybeExists(it) }
private val falseNegativeCount = entriesInFilter.size - truePositiveCount
private val falsePositiveCount = entriesNotInFilter.count { filter.maybeExists(it) }
private val trueNegativeCount = entriesNotInFilter.size - truePositiveCount
val accuracyRate = (truePositiveCount + trueNegativeCount).toDouble() / (notInFilterCount + inFilterCount)
val misclassificationRate = 1.0 - accuracyRate
val truePositiveRate = truePositiveCount.toDouble() / inFilterCount
val trueNegativeRate = trueNegativeCount.toDouble() / notInFilterCount
val falsePositiveRate = falsePositiveCount.toDouble() / notInFilterCount
val falseNegativeRate = falseNegativeCount.toDouble() / inFilterCount
fun printReport() {
val dataRows = mapOf(
"Accuracy" to accuracyRate,
"Misclassification rate" to misclassificationRate,
"True positive rate" to truePositiveRate,
"True negative rate" to trueNegativeRate,
"False positive rate" to falsePositiveRate,
"False negative rate" to falseNegativeRate
)
val printer = Printer(dataRows)
printer.print()
}
}
Knowing nothing of the algorithm, I'd say BloomFilter is pretty clean, but you could more naturally write the declaration of salts
like this:
private val salts = (0..numberOfHashes).map { it.toString() }
or
private val salts = (0..numberOfHashes).map(Int::toString)
The second form is usually preferred to lambdas when there's a function that exactly matches the required signature because it shows the type. Not really helpful here, but helpful in a chain of functional calls to make it more readable later.
In your main method, a couple of little tips...
When you want to do some sort of logging type of action without side effects as you are assigning something to a variable, you can use also
. It kind of de-emphasizes it for someone reading your code, especially if it's some action that takes a few lines of code. It's not all that useful here since this is so simple, but might be handy for you in other situations.
val confusionMatrix = ConfusionMatrix(filter, entriesInFilter, entriesNotInFilter)
also { it.printReport() }
And there's a function for asserting something and throwing a runtime exception if it fails, so your last bit can be cleaned up:
require(confusionMatrix.falseNegativeRate > 0.0) {
"This should not happen, if it does the implementation of the bloom filter is wrong."
}
After looking at it a bit
hashedIndex does many things. It salts the input, hashes it, wraps it and makes sure it fits into the size. Could it be split up and more clear what is happening?
The confusion matrix seems like a general mathy thing, why does it have a direct dependency on a BloomFilter and its data? Try to come up with some way of decoupling these so the confusion matrix can be reused for other statistical purposes.
countTruePositiveAndFalseNegative and countFalsePositiveAndTrueNegative looks a lot like repetition, can the logic be moved to a single implementation?
None of the classes implement interfaces or abstract methods, so using them would require a dependency to the concrete implementation, making the depende unnecessarily difficult to test and change.
There is a possible divide by zero issue if inFilterCount or notInFilterCount is zero.
-
\$\begingroup\$ I got around the dependency from ConfusionMatrix to Bloomfilter by using a lambda that "predicts", which I think is a good use of lambdas compared to a solution using e.g. adapter pattern, where a lot of boilerplate would be created. \$\endgroup\$Peheje– Peheje2020年08月25日 19:02:36 +00:00Commented Aug 25, 2020 at 19:02
lazy
for trivial division operations? \$\endgroup\$=
is uglier thanby lazy { }
? Or if you just define them as Doubles, you can set them one time in theinit
block, like you described doing in C#. \$\endgroup\$var accuracyRate = 0.0 private set
\$\endgroup\$