I am trying to implement the code described in this post about how Reddit uses binary search to identify duplicate user names, in Scala. Just to clarify, this is mainly for me figuring out how these things work, i.e. reinventing the wheel. The way I understand it, the steps for binary search are:
You have a new user name you are comparing to an alphabetically sorted list of existing names
You compare that new name to the middle item in the list.
If there is no match, you check if the new name is higher or lower in the alphabet than the middle item, and focus on this side of the list, again checking the middle item.
You repeat until there is a match ("User name taken") or none ("User name available").
Below is what I have. It seems to work when I test it on the names in the list and random other names. It'd cool if someone had some feedback re:
Will this work correctly or is there some kind of bug?
Is there anything really un-Scalaesque in here? I'm just starting out...
Is the
while
loop the right way to go? Are there better options?I think the comparer pattern matching function should be expanded to do more, i.e. cover different cases of non-matching, but I couldn't quite get there..
Any other comments are much appreciated!
-
//this is my test data
val lines = List("ally", "ben", "albert", "zong", "hermit", "??ennis", "123pat", "alfred", "zil", "zol", "zal")
def comparer(name_one:String, name_two:String): Boolean = name_one match
//comparer takes two strings, returns true if identical
{
case `name_two` => true
case _ => false
}
def checker(new_name: String, old_names: List[String]) : Boolean =
//checker takes a new user name, compares to List of names already in use
//returns true if already present or false if not
{
val sortednames:List[String] = old_names.sorted
println(new_name + "input list is length " + sortednames.length)
//set vars, initialize
var found: Boolean = false
var startpoint: Int = 0
var endpoint: Int = sortednames.length-1
//while loop, breaks when something is found
while (found == false && startpoint < endpoint)
{
println("start: "+startpoint+" end: "+endpoint)
//set the midpoint were checking against
var mid: Int = (startpoint + endpoint) / 2
println("mid is " + mid)
//run comparer to set found var initialized abpve
found = comparer (new_name, sortednames(mid))
println("new name " + new_name)
println("ist window " + sortednames(mid))
//check if new_name is higher or lower in alphabet than mid
if (new_name > sortednames(mid)) {
startpoint = mid +1
println("set start to " + startpoint)
}
if (new_name < sortednames(mid)) {
endpoint = mid
println("set end to " + endpoint)
}
}
found
}
val x:Boolean =checker("ally", lines)
println("checker returns ")
println(x)
1 Answer 1
You're doing a number of things that are very un-Scalaesque, chief among them:
- mutable variables - using
var
instead ofval
while
loop - recursion is usually preferred- basic comparison operators can be used on strings -
>
,<
,==
, etc.
There are also efficiency issues to be considered:
- indexing into a
List
is slow -Vector
andArray
collections are better - mixed binary search - many implementations abandon the binary search and go into a straight linear search when the field of candidates becomes small enough.
With a few of these things in mind, here's an alternative approach. (A full binary search, just for simplicity.)
def checker(name: String, names: Seq[String]) : Boolean = {
// sort the input once, use Array for fast indexing
val arr: Array[String] = names.sorted.toArray
// tail-recursive binary search for name in names
def search(start: Int = 0, end: Int = arr.length -1): Boolean = {
val mid = start + (end-start)/2
if (start > end) false // can't be found
else if (arr(mid) == name) true // found
else if (arr(mid) > name) search(start, mid-1) // narrow the field
else search(mid+1, end)
}
search() // do it
}
Result:
val lines = List("ally", "ben", "albert", "zong", "hermit", "??ennis", "123pat", "alfred", "zil", "zol", "zal")
checker("zong", lines) // res0: Boolean = true
checker("zing", lines) // res1: Boolean = false
-
\$\begingroup\$ I think
val mid = (end+start)/2
is better \$\endgroup\$tom10271– tom102712019年04月01日 07:09:57 +00:00Commented Apr 1, 2019 at 7:09 -
\$\begingroup\$ @tom10271; That is more direct and to-the-point, but for sufficiently large values of
start
andend
, the shorter version will encounterInt
overflow and thus incorrect (negative) results. That won't happen with the longer version. Not likely to be an issue in this case but still worth considering. \$\endgroup\$jwvh– jwvh2019年04月01日 18:40:50 +00:00Commented Apr 1, 2019 at 18:40 -
\$\begingroup\$ Good point to consider \$\endgroup\$tom10271– tom102712019年04月02日 00:26:04 +00:00Commented Apr 2, 2019 at 0:26
Explore related questions
See similar questions with these tags.
while
and the twoif
s? And do you have any hints of how to do that better? Thanks! \$\endgroup\$checker()
returnstrue
whennew_name
matches the last value in the sorted list? ("zong" in your example) \$\endgroup\$