0
\$\begingroup\$

I am trying to implement the code described in this post about how Reddit uses binary search to identify duplicate user names, in Scala. Just to clarify, this is mainly for me figuring out how these things work, i.e. reinventing the wheel. The way I understand it, the steps for binary search are:

  1. You have a new user name you are comparing to an alphabetically sorted list of existing names

  2. You compare that new name to the middle item in the list.

  3. If there is no match, you check if the new name is higher or lower in the alphabet than the middle item, and focus on this side of the list, again checking the middle item.

  4. You repeat until there is a match ("User name taken") or none ("User name available").

Below is what I have. It seems to work when I test it on the names in the list and random other names. It'd cool if someone had some feedback re:

  • Will this work correctly or is there some kind of bug?

  • Is there anything really un-Scalaesque in here? I'm just starting out...

  • Is the while loop the right way to go? Are there better options?

  • I think the comparer pattern matching function should be expanded to do more, i.e. cover different cases of non-matching, but I couldn't quite get there..

  • Any other comments are much appreciated!

-

//this is my test data
val lines = List("ally", "ben", "albert", "zong", "hermit", "??ennis", "123pat", "alfred", "zil", "zol", "zal")
def comparer(name_one:String, name_two:String): Boolean = name_one match 
 //comparer takes two strings, returns true if identical
 {
 case `name_two` => true
 case _ => false
 }
def checker(new_name: String, old_names: List[String]) : Boolean =
 //checker takes a new user name, compares to List of names already in use
 //returns true if already present or false if not
 {
 val sortednames:List[String] = old_names.sorted
 println(new_name + "input list is length " + sortednames.length)
 //set vars, initialize
 var found: Boolean = false
 var startpoint: Int = 0
 var endpoint: Int = sortednames.length-1
 //while loop, breaks when something is found
 while (found == false && startpoint < endpoint)
 {
 println("start: "+startpoint+" end: "+endpoint)
 //set the midpoint were checking against
 var mid: Int = (startpoint + endpoint) / 2
 println("mid is " + mid)
 //run comparer to set found var initialized abpve
 found = comparer (new_name, sortednames(mid)) 
 println("new name " + new_name)
 println("ist window " + sortednames(mid))
 //check if new_name is higher or lower in alphabet than mid
 if (new_name > sortednames(mid)) {
 startpoint = mid +1
 println("set start to " + startpoint)
 }
 if (new_name < sortednames(mid)) {
 endpoint = mid
 println("set end to " + endpoint)
 }
 }
 found
 }
val x:Boolean =checker("ally", lines)
println("checker returns ")
println(x)
asked Apr 26, 2017 at 21:40
\$\endgroup\$
4
  • 1
    \$\begingroup\$ For user base of 1 billion users, your function would use approximately 30 billion comparisons, not 30 \$\endgroup\$ Commented Apr 26, 2017 at 22:10
  • \$\begingroup\$ @enedil eh that is unfortunate / terrible! Is it because I do the while and the two ifs? And do you have any hints of how to do that better? Thanks! \$\endgroup\$ Commented Apr 26, 2017 at 22:19
  • 1
    \$\begingroup\$ it's because you sort the list at the beginning of the function. You have to assume it's already sorted - just keep it sorted \$\endgroup\$ Commented Apr 26, 2017 at 22:23
  • \$\begingroup\$ @patrick, it looks like there is a bug in this code. Have you checked if checker() returns true when new_name matches the last value in the sorted list? ("zong" in your example) \$\endgroup\$ Commented Apr 28, 2017 at 9:32

1 Answer 1

1
\$\begingroup\$

You're doing a number of things that are very un-Scalaesque, chief among them:

  • mutable variables - using var instead of val
  • while loop - recursion is usually preferred
  • basic comparison operators can be used on strings - >, <, ==, etc.

There are also efficiency issues to be considered:

  • indexing into a List is slow - Vector and Array collections are better
  • mixed binary search - many implementations abandon the binary search and go into a straight linear search when the field of candidates becomes small enough.

With a few of these things in mind, here's an alternative approach. (A full binary search, just for simplicity.)

def checker(name: String, names: Seq[String]) : Boolean = {
 // sort the input once, use Array for fast indexing
 val arr: Array[String] = names.sorted.toArray
 // tail-recursive binary search for name in names
 def search(start: Int = 0, end: Int = arr.length -1): Boolean = {
 val mid = start + (end-start)/2
 if (start > end) false // can't be found
 else if (arr(mid) == name) true // found
 else if (arr(mid) > name) search(start, mid-1) // narrow the field
 else search(mid+1, end)
 }
 search() // do it
}

Result:

val lines = List("ally", "ben", "albert", "zong", "hermit", "??ennis", "123pat", "alfred", "zil", "zol", "zal")
checker("zong", lines) // res0: Boolean = true
checker("zing", lines) // res1: Boolean = false
answered May 4, 2017 at 6:33
\$\endgroup\$
3
  • \$\begingroup\$ I think val mid = (end+start)/2 is better \$\endgroup\$ Commented Apr 1, 2019 at 7:09
  • \$\begingroup\$ @tom10271; That is more direct and to-the-point, but for sufficiently large values of start and end, the shorter version will encounter Int overflow and thus incorrect (negative) results. That won't happen with the longer version. Not likely to be an issue in this case but still worth considering. \$\endgroup\$ Commented Apr 1, 2019 at 18:40
  • \$\begingroup\$ Good point to consider \$\endgroup\$ Commented Apr 2, 2019 at 0:26

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.