
Again playing around in R with an edX CS50 homework assignment.

Performance relative to %in% is better than I expected, but I still wonder whether I am breaking any good practices, style norms, or common sense, either overall or in the smallest details.

BiSearch <- function(table, key) {
  # Takes sorted (in ascending order) vectors
  stopifnot(is.vector(table), is.numeric(table))
  r <- length(table)
  m <- ceiling(r / 2L) # Midpoint
  if (table[m] > key) {
    if (r == 1L) {
      return(FALSE)
    }
    BiSearch(table[1L:(m - 1L)], key)
  }
  else if (table[m] < key) {
    if (r == 1L) {
      return(FALSE)
    }
    BiSearch(table[(m + 1L):r], key)
  }
  else {
    return(TRUE)
  }
}
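
For example, on a small sorted vector it behaves as I expect:

BiSearch(c(2, 3, 5, 7, 11, 13), 7)
# [1] TRUE
BiSearch(c(2, 3, 5, 7, 11, 13), 6)
# [1] FALSE
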
asked Jan 4, 2017 at 21:55
  • Would be interesting to see the (presumably C) code behind the match() function. – Commented Jan 4, 2017 at 21:56

1 Answer


First, let's point out that %in% and match do not require the input to be sorted, so asymptotically (i.e. as the input becomes large) they should perform far worse than a properly implemented binary search such as yours.

My main concern with your implementation is that it disregards potential floating point errors. See, for example, that

BiSearch(seq(from = 0, to = 1, by = 0.1), 0.3)
# [1] FALSE
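
The comparison fails because the fourth element of that sequence is not exactly 0.3; it carries the usual binary floating point representation error:

seq(from = 0, to = 1, by = 0.1)[4] == 0.3
# [1] FALSE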

To fix that, you need to allow for some very small tolerance. You could mimic all.equal by setting that tolerance to .Machine$double.eps ^ 0.5:

BiSearch <- function(table, key, tol = .Machine$double.eps ^ 0.5) {
  ...
  if (table[m] > key + tol) { ... }
  else if (table[m] < key - tol) { ... }
  ...
}

Next, from a performance point of view, you are wasting a good amount of time and memory by creating and storing a new vector at each recursive call. Instead, you could keep the same initial vector and only pass around start and end indices:

BiSearch2 <- function(table, key, start.idx = 1L, end.idx = length(table),
                      tol = .Machine$double.eps ^ 0.5) {
  # Takes sorted (in ascending order) vectors
  stopifnot(is.vector(table), is.numeric(table))
  m <- as.integer(ceiling((end.idx + start.idx) / 2)) # Midpoint
  if (table[m] > key + tol) {
    if (m == start.idx) return(FALSE) # key is smaller than everything left in range
    Recall(table, key, start.idx = start.idx, end.idx = m - 1L, tol = tol)
  } else if (table[m] < key - tol) {
    if (m == end.idx) return(FALSE)   # key is larger than everything left in range
    Recall(table, key, start.idx = m + 1L, end.idx = end.idx, tol = tol)
  } else return(TRUE)
}
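
As a quick check, the call that failed earlier now succeeds:

BiSearch2(seq(from = 0, to = 1, by = 0.1), 0.3)
# [1] TRUE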

Notice how I also made a few other changes:

  1. ceiling returns a numeric (double), so it needs to be passed through as.integer if you want to preserve an integer class for your index (see the quick check right after this list).
  2. I used Recall rather than the name of the function itself. This is preferred as it makes it easier to rename your function later: you will only have to change the function name in one place instead of three.
  3. Though this has no negative effect within the body of a function, I used the preferred bracing style where else is put on the same line as the previous }. Outside a function, especially when typing code at the console, the original layout would otherwise throw an unexpected else error. This is well explained in section 8.1.43 of The R Inferno: http://www.burns-stat.com/pages/Tutor/R_inferno.pdf
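
To illustrate the first point:

class(ceiling(5 / 2))
# [1] "numeric"
class(as.integer(ceiling(5 / 2)))
# [1] "integer"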

At this point, the code should be a lot faster; so fast, in fact, that the repeated checking via stopifnot becomes relatively expensive, so it makes sense to turn it off within the nested calls:

BiSearch3 <- function(table, key, start.idx = 1L, end.idx = length(table),
                      tol = .Machine$double.eps ^ 0.5,
                      check = TRUE) {
  # Takes sorted (in ascending order) vectors
  if (check) stopifnot(is.vector(table), is.numeric(table))
  m <- as.integer(ceiling((end.idx + start.idx) / 2)) # Midpoint
  if (table[m] > key + tol) {
    if (m == start.idx) return(FALSE) # key is smaller than everything left in range
    Recall(table, key, start.idx = start.idx, end.idx = m - 1L, tol = tol, check = FALSE)
  } else if (table[m] < key - tol) {
    if (m == end.idx) return(FALSE)   # key is larger than everything left in range
    Recall(table, key, start.idx = m + 1L, end.idx = end.idx, tol = tol, check = FALSE)
  } else return(TRUE)
}
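
Before benchmarking, a quick sanity check against %in% on random keys does not hurt (the seed and sizes here are arbitrary):

set.seed(42)
x <- sort(sample(1e6, 1e4))  # sorted table of 10,000 values
keys <- sample(1e6, 100)     # random keys, some present, some not
all(vapply(keys, function(k) BiSearch3(x, k), logical(1)) == (keys %in% x))
# [1] TRUE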

Here are some benchmark comparisons using a large input vector:

library(microbenchmark)
table <- 1:1e7
microbenchmark(BiSearch(table, 1L), BiSearch2(table, 1L), BiSearch3(table, 1L))
# Unit: microseconds
#                  expr        min          lq        mean      median         uq        max neval
#   BiSearch(table, 1L) 146830.845 166891.1690 221590.9946 253923.2490 276508.785 290987.343   100
#  BiSearch2(table, 1L)    342.497    352.9960    376.3591    370.3830    377.602    561.851   100
#  BiSearch3(table, 1L)    119.976    124.0535    136.2610    130.1285    143.567    298.806   100

I leave that up to you, but at this point I think you would probably get even faster computation times if you replaced the recursion with a simple while loop, since time is lost calling the function many times and maintaining a stack of function calls.
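
For what it's worth, here is a minimal sketch of what such a loop could look like (the name BiSearch4 is only for illustration and it is not part of the benchmark above):

BiSearch4 <- function(table, key, tol = .Machine$double.eps ^ 0.5) {
  # Takes sorted (in ascending order) vectors
  stopifnot(is.vector(table), is.numeric(table))
  lo <- 1L
  hi <- length(table)
  while (lo <= hi) {
    m <- lo + (hi - lo) %/% 2L  # Midpoint of the current range
    if (table[m] > key + tol) {
      hi <- m - 1L              # Key, if present, lies to the left of m
    } else if (table[m] < key - tol) {
      lo <- m + 1L              # Key, if present, lies to the right of m
    } else {
      return(TRUE)
    }
  }
  FALSE
}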

answered Jan 5, 2017 at 3:33