I am trying to write a binary search using recursion in C on a sorted vector that we can assume comes from a uniform distribution between the start and end values. What I am trying to write is an algorithm that performs best in expected time. I am having some issue with picking the correct index to 'split' at and while it passes my tests, it feels a bit hacky.
int weighted_binary_split(int *x, int start_limit, int start_idx, int end_idx,
int val, int *depth) {
// For a vector of length `n` with known starting and ending values,
// x[start_idx:end_idx+1] = [start, ..., end]
// searches for `val` in `x`, assuming `x` is uniformly distributed and
// ordered. Note that we always get the first occurence of `val` in `x`.
// I say 'best' because it is simply intuition.
// Note that when initially calling this, we will have
// start_limit = start_idx
// however when we are within the recursion this is no longer the case.
int n = end_idx - start_idx + 1;
int start = x[start_idx];
int end = x[end_idx];
int out;
int i;
// Keep track of how many iterations/depth we have gone in binary search
// tree.
*depth += 1;
// Catch the case where end == start (causing divide by 0 below).
if (end - start == 0) {
return start_idx;
}
// Catch the case where val == start (can just return start_idx).
if (val == start) {
return start_idx;
}
// We want to find
// (val - start)/((end - start)/n) = n*val/(end - start),
// rounded to the nearest integer.
// Note there is no point using
// 1) idx = start_idx
// as if val == x[start_idx] ==> val == start, caught above.
// 2) idx = start_idx + n = end_idx + 1, too big for vector.
// Hence we use (n - 1). THIS PART FEELS HACKY.
int idx = start_idx + ((n-1)*(val - start))/(end - start);
if (idx == start_idx) {
idx += 1;
}
if (x[idx] > val) {
// Search again with idx as the new `end_idx`.
out = weighted_binary_split(x, start_limit, start_idx, idx, val, depth);
} else if (x[idx] < val) {
// Search again with idx as the new `start_idx`.
out = weighted_binary_split(x, start_limit, idx, end_idx, val, depth);
} else {
// In this case we have found val in x at x[idx].
// We finally get the first occurence of `val` in x[start_limit:idx+1].
for (i=1; i<idx-start_limit+1; i++) {
if (x[idx - i] != x[idx]) {
// We have found our first value!
return idx - i + 1;
}
}
// If we reach this point, then x[start_limit] was the first value.
return start_limit;
}
return out;
}
Below is a test template you can use.
int example_split() {
// Test that our weighted binary split works correctly.
int x[10] = {0, 1, 2, 2, 4, 5, 6, 6, 7, 9};
int start_idx = 0;
int end_idx = 9;
int val = 2;
int idx;
int i;
int depth = 0;
idx = weighted_binary_split(x, start_idx, start_idx, end_idx, val, &depth);
printf("\nFirst index into:\n\tx = (");
for (i=0; i<10; i++) {
printf("%d,", x[i]);
}
printf(")");
printf("\nfor:\n\tval = %d", val);
printf("\nis\n\tidx = %d", idx);
printf("\nwith\n\tx[idx]=%d", x[idx]);
printf("\nat\n\tdepth=%d\n", depth);
return 0;
}
-
\$\begingroup\$ Is the code working as expected? We can only review working code on code review. If there is a bug, then it might be better to ask this question on our sister website stackoverflow.com. \$\endgroup\$pacmaninbw– pacmaninbw ♦2018年05月31日 14:56:58 +00:00Commented May 31, 2018 at 14:56
-
1\$\begingroup\$ Yep this is working as expected for all tests I give it. Maybe you are asking for it in a version that can be compiled i.e. with includes and a main? \$\endgroup\$rwolst– rwolst2018年05月31日 14:58:45 +00:00Commented May 31, 2018 at 14:58
-
1\$\begingroup\$ It is nice to get all of the program, but I was just checking to see if it worked due to the rules of this website. \$\endgroup\$pacmaninbw– pacmaninbw ♦2018年05月31日 15:02:03 +00:00Commented May 31, 2018 at 15:02
-
1\$\begingroup\$ This algorithm is known as interpolation search. It was first described by W. W. Peterson in 1957, but is often rediscovered. \$\endgroup\$Gareth Rees– Gareth Rees2018年05月31日 21:12:07 +00:00Commented May 31, 2018 at 21:12
2 Answers 2
Bug
Generally code should prevent stack overflows from occurring. The search implemented here causes a stack overflow when the item searched for is not in the array. I tested the code using 8 as the value to search for which caused a recursive stack overflow.
I searched the internet for Weighted Binary Search
and couldn't find it, instead I found Weighted Binary Tree.
Wikipedia defines a binary search as "In computer science, binary search, also known as half-interval search, logarithmic search, or binary chop, is a search algorithm that finds the position of a target value within a sorted array. ..."
The search implemented is not a binary search. A binary search always divides the search area by two.
I'm not really sure why there is a for loop in the function weighted_binary_split()
, it doesn't comply with a binary search.
Handling Errors
Since the code is keeping track of the depth of the search a possible solution to a stack overflow problem would be to check the value of depth against Log(N) where N is the number of items in the array. A secondary check might be to check the value of depth against N itself because N would be the limit in a linear search.
The function weighted_binary_split()
could return -1 in the case of an error or use setjmp(jmp_buf env) and longjmp(jmp_buf env, int value) if negative values are valid in the search.
Always check for possible errors when developing code.
Array Initializations
In the function example_split()
the following line:
int x[10] = { 0, 1, 2, 2, 4, 5, 6, 6, 7, 9 };
would be easier to modify if it was written as:
int x[] = { 0, 1, 2, 2, 4, 5, 6, 6, 7, 9 };
There is no need to include the size of the array for x because the compiler does it for you.
The size of the array can be calculated:
int arr_size = (sizeof(x) / sizeof(x[0]));
The variable end_idx can then be calculated as well.
int end_idx = (sizeof(x) / sizeof(x[0])) - 1;
For Loop Control Variables
In the function example_split()
there is no need to create the variable i at the top, it can be created inside the for loop itself:
for (int i = 0; i < arr_size; i++) {
printf("%d,", x[i]);
}
It's generally better to create the variables as they are needed. This is a change from the original version of the C Programming Language that was quite helpful.
-
\$\begingroup\$ Thanks for suggestions. The code isn't based on an algorithm that I found but a modification of a binary search where there is an underlying assumption that the vector contains realisations from a uniform distribution. Maybe I could come up with a better name. In particular if
x
is an arithmetic sequence this always finds the correct index after depth 1. \$\endgroup\$rwolst– rwolst2018年05月31日 16:55:08 +00:00Commented May 31, 2018 at 16:55
What I am trying to write is an algorithm that performs best in expected time.
Improvement: Do not search over a known non-matching value.
if (x[idx] > val) {
// out = weighted_binary_split(x, start_limit, start_idx, idx, val, depth);
out = weighted_binary_split(x, start_limit, start_idx, idx - 1, val, depth);
Similar change for } else if (x[idx] < val) {
Avoid 2 checks per loop
Rather than check i
range over and over, check the end once
// for (i=1; i<idx-start_limit+1; i++) {
// if (x[idx - i] != x[idx]) {
// return idx - i + 1;
// }
// }
if (x[start_idx] == val) return start_idx;
while (x[idx - 1] == val) idx--;
return idx;
This part of code I would consider a change, such as to a binary search to avoid O(N)
possibilities.
having some issue with picking the correct index to 'split'
int idx = start_idx + ((n-1)*(val - start))/(end - start);
attempts to do a linear interpolation. This is reasonable if the data "comes from a uniform distribution" yet can progress horribly in sub-ranges that are not so uniformly distributed.
Instead, suggest: alternate between linear interpolation and bisecting the range. This retains much of the benefit of linear interpolation yet not its worst case of O(n)
time.
Another concern about linear interpolation: (n-1)*(val - start)
may overflow. Consider using wider math to form the product.
-
\$\begingroup\$ Very interesting. Is alternating between linear interpolation and bisecting the range a common technique or did you just think of it? Seems like a good idea. \$\endgroup\$rwolst– rwolst2018年05月31日 17:14:26 +00:00Commented May 31, 2018 at 17:14
-
\$\begingroup\$ @rwolst It is an idea i thought of years ago. A further refinement would use LI as long as each range reduction is less than half and else bisect \$\endgroup\$chux– chux2018年05月31日 17:23:53 +00:00Commented May 31, 2018 at 17:23
-
2\$\begingroup\$ A note about this always finds the correct index after depth 1: the test case is too trivial. Recall that unless the distribution is perfectly uniformly distributed, a LI recursive approach will eventually encounter sub-regions that are not uniformly distributed. \$\endgroup\$chux– chux2018年05月31日 17:26:23 +00:00Commented May 31, 2018 at 17:26
-
\$\begingroup\$ I agree, that's what I meant by
x
being an arithmetic sequence. \$\endgroup\$rwolst– rwolst2018年05月31日 21:54:53 +00:00Commented May 31, 2018 at 21:54