Binary search on a uniformly distributed vector

Question 1

I am trying to write a binary search using recursion in C on a sorted vector that we can assume comes from a uniform distribution between the start and end values. What I am trying to write is an algorithm that performs best in expected time. I am having some issue with picking the correct index to 'split' at and while it passes my tests, it feels a bit hacky.

int weighted_binary_split(int *x, int start_limit, int start_idx, int end_idx,
 int val, int *depth) {
 // For a vector of length `n` with known starting and ending values,
 // x[start_idx:end_idx+1] = [start, ..., end]
 // searches for `val` in `x`, assuming `x` is uniformly distributed and
 // ordered. Note that we always get the first occurence of `val` in `x`.
 // I say 'best' because it is simply intuition.
 // Note that when initially calling this, we will have
 // start_limit = start_idx
 // however when we are within the recursion this is no longer the case.
 int n = end_idx - start_idx + 1;
 int start = x[start_idx];
 int end = x[end_idx];
 int out;
 int i;
 // Keep track of how many iterations/depth we have gone in binary search
 // tree.
 *depth += 1;
 // Catch the case where end == start (causing divide by 0 below).
 if (end - start == 0) {
 return start_idx;
 }
 // Catch the case where val == start (can just return start_idx).
 if (val == start) {
 return start_idx;
 }
 // We want to find
 // (val - start)/((end - start)/n) = n*val/(end - start),
 // rounded to the nearest integer.
 // Note there is no point using
 // 1) idx = start_idx 
 // as if val == x[start_idx] ==> val == start, caught above.
 // 2) idx = start_idx + n = end_idx + 1, too big for vector.
 // Hence we use (n - 1). THIS PART FEELS HACKY.
 int idx = start_idx + ((n-1)*(val - start))/(end - start);
 if (idx == start_idx) {
 idx += 1;
 }
 if (x[idx] > val) {
 // Search again with idx as the new `end_idx`.
 out = weighted_binary_split(x, start_limit, start_idx, idx, val, depth);
 } else if (x[idx] < val) {
 // Search again with idx as the new `start_idx`.
 out = weighted_binary_split(x, start_limit, idx, end_idx, val, depth);
 } else {
 // In this case we have found val in x at x[idx].
 // We finally get the first occurence of `val` in x[start_limit:idx+1].
 for (i=1; i<idx-start_limit+1; i++) {
 if (x[idx - i] != x[idx]) {
 // We have found our first value!
 return idx - i + 1;
 }
 }
 // If we reach this point, then x[start_limit] was the first value.
 return start_limit;
 }
 return out;
}

Below is a test template you can use.

int example_split() {
 // Test that our weighted binary split works correctly.
 int x[10] = {0, 1, 2, 2, 4, 5, 6, 6, 7, 9};
 int start_idx = 0;
 int end_idx = 9;
 int val = 2;
 int idx;
 int i;
 int depth = 0;
 idx = weighted_binary_split(x, start_idx, start_idx, end_idx, val, &depth);
 printf("\nFirst index into:\n\tx = (");
 for (i=0; i<10; i++) {
 printf("%d,", x[i]);
 }
 printf(")");
 printf("\nfor:\n\tval = %d", val);
 printf("\nis\n\tidx = %d", idx);
 printf("\nwith\n\tx[idx]=%d", x[idx]);
 printf("\nat\n\tdepth=%d\n", depth);
 return 0;
}

Question 2

Is the code working as expected? We can only review working code on code review. If there is a bug, then it might be better to ask this question on our sister website stackoverflow.com.

Question 3

Yep this is working as expected for all tests I give it. Maybe you are asking for it in a version that can be compiled i.e. with includes and a main?

Question 4

It is nice to get all of the program, but I was just checking to see if it worked due to the rules of this website.

Question 5

This algorithm is known as interpolation search. It was first described by W. W. Peterson in 1957, but is often rediscovered.

Question 6

Bug
Generally code should prevent stack overflows from occurring. The search implemented here causes a stack overflow when the item searched for is not in the array. I tested the code using 8 as the value to search for which caused a recursive stack overflow.

I searched the internet for Weighted Binary Search and couldn't find it, instead I found Weighted Binary Tree.

Wikipedia defines a binary search as "In computer science, binary search, also known as half-interval search, logarithmic search, or binary chop, is a search algorithm that finds the position of a target value within a sorted array. ..."

The search implemented is not a binary search. A binary search always divides the search area by two.

I'm not really sure why there is a for loop in the function weighted_binary_split(), it doesn't comply with a binary search.

Handling Errors
Since the code is keeping track of the depth of the search a possible solution to a stack overflow problem would be to check the value of depth against Log(N) where N is the number of items in the array. A secondary check might be to check the value of depth against N itself because N would be the limit in a linear search.

The function weighted_binary_split() could return -1 in the case of an error or use setjmp(jmp_buf env) and longjmp(jmp_buf env, int value) if negative values are valid in the search.

Always check for possible errors when developing code.

Array Initializations In the function example_split() the following line:

 int x[10] = { 0, 1, 2, 2, 4, 5, 6, 6, 7, 9 };

would be easier to modify if it was written as:

 int x[] = { 0, 1, 2, 2, 4, 5, 6, 6, 7, 9 };

There is no need to include the size of the array for x because the compiler does it for you.

The size of the array can be calculated:

 int arr_size = (sizeof(x) / sizeof(x[0]));

The variable end_idx can then be calculated as well.

 int end_idx = (sizeof(x) / sizeof(x[0])) - 1;

For Loop Control Variables
In the function example_split() there is no need to create the variable i at the top, it can be created inside the for loop itself:

 for (int i = 0; i < arr_size; i++) {
 printf("%d,", x[i]);
 }

It's generally better to create the variables as they are needed. This is a change from the original version of the C Programming Language that was quite helpful.

Question 7

Thanks for suggestions. The code isn't based on an algorithm that I found but a modification of a binary search where there is an underlying assumption that the vector contains realisations from a uniform distribution. Maybe I could come up with a better name. In particular if x is an arithmetic sequence this always finds the correct index after depth 1.

Question 8

What I am trying to write is an algorithm that performs best in expected time.

Improvement: Do not search over a known non-matching value.

if (x[idx] > val) {
 // out = weighted_binary_split(x, start_limit, start_idx, idx, val, depth);
 out = weighted_binary_split(x, start_limit, start_idx, idx - 1, val, depth);

Similar change for } else if (x[idx] < val) {

Avoid 2 checks per loop

Rather than check i range over and over, check the end once

 // for (i=1; i<idx-start_limit+1; i++) {
 // if (x[idx - i] != x[idx]) {
 // return idx - i + 1;
 // }
 // }
 if (x[start_idx] == val) return start_idx;
 while (x[idx - 1] == val) idx--;
 return idx;

This part of code I would consider a change, such as to a binary search to avoid O(N) possibilities.

having some issue with picking the correct index to 'split'

int idx = start_idx + ((n-1)*(val - start))/(end - start); attempts to do a linear interpolation. This is reasonable if the data "comes from a uniform distribution" yet can progress horribly in sub-ranges that are not so uniformly distributed.

Instead, suggest: alternate between linear interpolation and bisecting the range. This retains much of the benefit of linear interpolation yet not its worst case of O(n) time.

Another concern about linear interpolation: (n-1)*(val - start) may overflow. Consider using wider math to form the product.

Question 9

Very interesting. Is alternating between linear interpolation and bisecting the range a common technique or did you just think of it? Seems like a good idea.

Question 10

@rwolst It is an idea i thought of years ago. A further refinement would use LI as long as each range reduction is less than half and else bisect

Question 11

A note about this always finds the correct index after depth 1: the test case is too trivial. Recall that unless the distribution is perfectly uniformly distributed, a LI recursive approach will eventually encounter sub-regions that are not uniformly distributed.

Question 12

I agree, that's what I meant by x being an arithmetic sequence.

pacmaninbw ♦pacmaninbw 26.2k13 gold badges47 silver badges113 bronze badges · Accepted Answer · 2018-05-31 16:43:22Z

Bug
Generally code should prevent stack overflows from occurring. The search implemented here causes a stack overflow when the item searched for is not in the array. I tested the code using 8 as the value to search for which caused a recursive stack overflow.

I searched the internet for Weighted Binary Search and couldn't find it, instead I found Weighted Binary Tree.

Wikipedia defines a binary search as "In computer science, binary search, also known as half-interval search, logarithmic search, or binary chop, is a search algorithm that finds the position of a target value within a sorted array. ..."

The search implemented is not a binary search. A binary search always divides the search area by two.

I'm not really sure why there is a for loop in the function weighted_binary_split(), it doesn't comply with a binary search.

Handling Errors
Since the code is keeping track of the depth of the search a possible solution to a stack overflow problem would be to check the value of depth against Log(N) where N is the number of items in the array. A secondary check might be to check the value of depth against N itself because N would be the limit in a linear search.

The function weighted_binary_split() could return -1 in the case of an error or use setjmp(jmp_buf env) and longjmp(jmp_buf env, int value) if negative values are valid in the search.

Always check for possible errors when developing code.

Array Initializations In the function example_split() the following line:

 int x[10] = { 0, 1, 2, 2, 4, 5, 6, 6, 7, 9 };

would be easier to modify if it was written as:

 int x[] = { 0, 1, 2, 2, 4, 5, 6, 6, 7, 9 };

There is no need to include the size of the array for x because the compiler does it for you.

The size of the array can be calculated:

 int arr_size = (sizeof(x) / sizeof(x[0]));

The variable end_idx can then be calculated as well.

 int end_idx = (sizeof(x) / sizeof(x[0])) - 1;

For Loop Control Variables
In the function example_split() there is no need to create the variable i at the top, it can be created inside the for loop itself:

 for (int i = 0; i < arr_size; i++) {
 printf("%d,", x[i]);
 }

It's generally better to create the variables as they are needed. This is a change from the original version of the C Programming Language that was quite helpful.

Thanks for suggestions. The code isn't based on an algorithm that I found but a modification of a binary search where there is an underlying assumption that the vector contains realisations from a uniform distribution. Maybe I could come up with a better name. In particular if x is an arithmetic sequence this always finds the correct index after depth 1.

Stack Exchange Network

Binary search on a uniformly distributed vector

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Binary search on a uniformly distributed vector

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions