Creating a search function

Question 1

Write an efficient function in the Python language called SmartSearch. This function will receive a sorted arr array of integers of size \$n\$ containing at the beginning \$m\$ "significant numbers" and the rest of the numbers are "fictitious numbers" whose value is 9999. The function also receives a number \$x\$ to be searched for in the array. If the number is in the array, the function must return its index (position). If the number \$x\$ does not exist, it must return -1. Important note: the size \$m\$ is unknown. Required efficiency: \$O(\log m)\$

I tried that, but I don't think it quite answers the question. I would appreciate corrections please. I realized that I should first find in which index the fictitious numbers start, and then refer only to the list without the fictitious numbers. How do you do that?

def smartSearch(arr, x):
 if x == 9999:
 return -1
 low = 0
 high = len(arr) - 1
 while low <= high:
 mid = (high + low) // 2
 if arr[mid] == x:
 return mid
 elif arr[mid] < x:
 low = mid + 1
 else:
 high = mid - 1
 return -1

Question 2

Yes, a programming challenge my lecturer gave me

Question 3

The program works, just not exactly according to the requirements of the challenge

Question 4

You could have an initial loop performed after you have initialized high as follows: while high >= 0 and arr[high] == 999: high -= 1. But what if you had an array of 200 elements of which the last 100 have the value 9999. You would be looping 100 times to set high from its initial value of 199 to 99. If you did nothing you would compute mid initially to be 99. If arr[99] is less than the x value you seek you will discover on your next iteration that the value does not exist. Otherwise you have eliminated the 100 fictitious numbers with one search. So do nothing.

Question 5

You are not sorting anything so why is this title "Creating a sort function" instead of "Creating a search function".

Question 6

There is an ambiguity here. There is no condition that directly states that "significant numbers" have to be lower than 9999. What if: smartSearch([1,88888,88889,9999,9999], 88888):

Question 7

Let's start with the issues in the problem statement itself:

Don't name a function SmartSearch; use smart_search due to PEP8.
It's not helpful to define \$n\$ but then never meaningfully use it. It's implied that either \$m < n\$ or \$m \le n\$, but which is it?
The word array is almost certainly mis-used. There are arrays in Python but this is almost certainly not what the problem statement actually meant; instead it meant a sequence.
It's a bad idea to return -1 on failure for a function that also returns integers on success. This is a C-ism and not how Python is supposed to work. A well-designed Python API would instead throw an exception, possibly an IndexError or ValueError. Otherwise, a naive caller may try to use -1 as an index: which would index to the last element of the sequence and silently produce nonsensical behaviour!

For your purposes let's change the definition to accommodate all but the last point.

As to your implementation,

You violated the spec by writing smartSearch, which (by the original spec) would be called SmartSearch, but shouldn't be either; it should be smart_search.
Don't re-implement a binary search when bisect is part of the standard library.
Write unit tests.

An implementation can look like

from bisect import bisect_left
from typing import Sequence
FICTIONAL = 9999
NOT_FOUND = -1
def smart_search(arr: Sequence[int], x: int) -> int:
 if x == FICTIONAL:
 return NOT_FOUND
 idx = bisect_left(a=arr, x=x)
 if idx >= len(arr) or arr[idx] != x:
 return NOT_FOUND
 return idx
def test() -> None:
 assert smart_search((1, 2, 3), 0) == NOT_FOUND
 assert smart_search((1, 2, 3), 1) == 0
 assert smart_search((1, 2, 3), 2) == 1
 assert smart_search((1, 2, 3), 3) == 2
 assert smart_search((1, 2, 3), 4) == NOT_FOUND
 assert smart_search((1, 2, 3, FICTIONAL), 0) == NOT_FOUND
 assert smart_search((1, 2, 3, FICTIONAL), 1) == 0
 assert smart_search((1, 2, 3, FICTIONAL), 2) == 1
 assert smart_search((1, 2, 3, FICTIONAL), 3) == 2
 assert smart_search((1, 2, 3, FICTIONAL), 4) == NOT_FOUND
 assert smart_search((1, 2, 4), 3) == -1
if __name__ == '__main__':
 test()

Question 8

lint

def smartSearch(arr, x):

Pylint reports:

C0103: Function name "smartSearch" doesn't conform to snake_case naming style (invalid-name)

Please have your lecturer recite these words out loud before the class:

Pep-8 asks that you give functions snake_case names.

magic number

Let us define

SENTINEL = 9999

algorithm

 high = len(arr) - 1

This is just Wrong. It clearly leads to \$O(\log n)\$ cost, given that you assigned \$n\$ to it.

determine m

Write a find_m(arr) helper. It must complete within \$O(\log m)\$ time, but that's easily accomplished.

To return a useful estimate of \$m\$, start with estimate = 1. Keep doubling the estimate until you see arr[estimate] == SENTINEL or it runs off the end of the array, whichever happens first. Return the estimate; it may be an over-estimate, which is fine. It will be within an octave of the true value.

If for some reason you really want to be picky and find the exact value of \$m\$, that is still feasible within the time limit, given integer inputs. Just do binary search, between 0 and estimate, for the value SENTINEL - 1. Do not try searching for that value between estimate and len(arr) - 1, as that would be prohibitively expensive, with \$O(\log n)\$ cost.

determine the index

 high = find_m(arr)

Now you're ready to search in the usual way.

large target value

It appears that x > SENTINEL corresponds to invalid input, for which we could immediately return -1.

That is, the problem speaks of

a sorted array of integers of size n containing at the beginning m "significant numbers"

The combination of "sorted" and "beginning" implies that significant numbers shall always be less than the sentinel value.

test suite

When you write unit tests, be sure to include cases corresponding to

m == 0
m == n

Question 9

Since \$m\$ is unknown, but \$m \le \$n, then determining (or even estimating) \$m\$ is an \$O(\log n)\$ operation, so \$O(\log n + \log m) = O(2 * \log n) = O(\log n)\$. Your "prohibitively expensive, with \$O(\log n)\$ cost" is wrong and misleading; perhaps you meant "do not try a linear search ...".

Question 10

Another variation: while doubling the estimation of \$m\,ドル test for the overrun and for the sentinel value, and also for a 'significant value' greater than the sought one. In the third case a binary search may use a comparator without a check for sentinel values.

Question 11

@AJNeufeld, no, that does not follow. There are two cases, the first being \$m \ll n\,ドル in which the proposed algorithm definitely finds an estimate with \$O(\log m)\$ cost, independent of \$n\,ドル thanks to the great many sentinels. In the other case we have a much bigger budget, so the fact that we examined much of arr is irrelevant, given that the large \$m\$ value grants us permission to do that. // No, I didn't mean a linear search. I meant that we should not try binary search on the wrong half, on the high half, since that is bounded by \$n\$ rather than by \$m\$.

Question 12

@J_H Yes, you are correct. I failed to notice your telescoping estimate is actually \$O(\log m)\,ドル and thus much more efficient when \$m \ll n\$.

Question 13

@CiaPan, when teaching a student, I prefer to keep things as simple as feasible. Often that corresponds to relying on composition of functions, so each one can be tested and reasoned about in isolation. There are certainly ways to implement a function that has several responsibilities and have the Right Thing happen in the end, but sticking to Single Responsibility is easier for fallible humans to get right.

Question 14

0-based

Python is a 0-based language. A sequence of length n has elements running from index 0 to n-1. There is no need to write high = len(arr) - 1 and then test using <= to include the end of the range. Simply drop the - 1 (and the cost of the unnecessary subtraction calculation) and test with <.

 low = 0
 high = len(arr) # no "- 1" here
 while low < high: # use "<" here
 mid = (high + low) // 2
 if arr[mid] == x:
 return mid
 elif arr[mid] < x:
 low = mid + 1
 else:
 high = mid # no "- 1" here

Repeated indexing

You fetch arr[mid] twice in your search loop. Indexing into a random-access array shouldn't be expensive, but why pay any extra cost? Cache the looked up value.

 val = arr[mid]
 if val == x:
 return mid
 elif val < x:
 ...

Reworked code

Using some of the suggestions by J_H and Reinderien, (although we still reinvent-the-wheel, since the requirement is to write the search function) we can rework the code as follows:

FICTIONAL = 9_999
NOT_FOUND = -1
def smart_search(arr: list[int], x: int) -> int:
 """
 Find 'x' in sorted list of integers, in O(log m) time.
 'arr' contains a 'm' actual values, and 'n-m' padding values.
 Returns:
 if present: index of 'x'
 otherwise: NOT_FOUND (-1)
 """
 # Fast fail
 if x >= FICTIONAL:
 return NOT_FOUND
 n = len(arr)
 if n == 0:
 return NOT_FOUND
 # Telescoping estimate of 'm'
 m_est = 1
 while m_est < n and arr[m_est] != FICTIONAL:
 m_est *= 2
 m_est = min(m_est, n)
 # Standard binary search
 low = 0
 high = m_est
 while low < high:
 mid = (low + high) // 2
 val = arr[mid] 
 if val == x:
 return mid
 elif val < x:
 low = mid + 1
 else:
 high = mid
 
 return NOT_FOUND

Testing

Let's mock an array with the sequence 0, 10, 20, ... and count the number of times we index into the array.

class MockArray:
 """
 A mock array of length 'n'.
 The 'array' contains 'm' ascending values, followed by 'n-m'
 fictitious values representing unused entries.
 Accesses to the array are counted, for instrumentation.
 'm' is not returned by any public method.
 """
 def __init__(self, n: int, m: int):
 self._n = n
 self._m = m
 self._gets = 0
 if not (0 <= m <= n):
 raise ValueError("Invalid n and/or m")
 if m >= 1000:
 raise ValueError("m is too big")
 def __len__(self) -> int:
 return self._n
 def __getitem__(self, index: int) -> int:
 if index < 0:
 index += self._n
 if index < 0 or index >= self._n:
 raise IndexError()
 self._gets += 1
 
 if index < self._m:
 return index * 10 # Fake storage of ascending data
 else:
 return FICTIONAL # Fake storage of fictional data
 def num_gets(self):
 return self._gets

Then we can add a small test scaffold ...

def _test_smart_search(n: int, m: int, x: int, expected: int, limit: int):
 
 arr = MockArray(n, m)
 index = smart_search(arr, x)
 assert index == expected, f"{m=}: expected {expected}, got {index}"
 assert arr.num_gets() <= limit, f"{arr.num_gets()} > {limit}"

... which creates the mock array, searches for the given element, and validates if the correct result was returned and no more than the given number of array lookups were made, to validate the time O-requirement.

Finally, we can use this to run a number of tests on a billion element array.

def test_smart_search():
 
 for m in range(1000):
 limit = m.bit_length() * 2 + 2
 target = 50
 expected = target // 10 if m*10 > target else NOT_FOUND
 
 _test_smart_search(1_000_000_000, m, target, expected, limit)
if __name__ == '__main__':
 test_smart_search()

Reinderien Reinderien 70.9k5 gold badges76 silver badges256 bronze badges · Answer 1 · 2024-03-28 12:56:55Z

Let's start with the issues in the problem statement itself:

Don't name a function SmartSearch; use smart_search due to PEP8.
It's not helpful to define \$n\$ but then never meaningfully use it. It's implied that either \$m < n\$ or \$m \le n\$, but which is it?
The word array is almost certainly mis-used. There are arrays in Python but this is almost certainly not what the problem statement actually meant; instead it meant a sequence.
It's a bad idea to return -1 on failure for a function that also returns integers on success. This is a C-ism and not how Python is supposed to work. A well-designed Python API would instead throw an exception, possibly an IndexError or ValueError. Otherwise, a naive caller may try to use -1 as an index: which would index to the last element of the sequence and silently produce nonsensical behaviour!

For your purposes let's change the definition to accommodate all but the last point.

As to your implementation,

You violated the spec by writing smartSearch, which (by the original spec) would be called SmartSearch, but shouldn't be either; it should be smart_search.
Don't re-implement a binary search when bisect is part of the standard library.
Write unit tests.

An implementation can look like

from bisect import bisect_left
from typing import Sequence
FICTIONAL = 9999
NOT_FOUND = -1
def smart_search(arr: Sequence[int], x: int) -> int:
 if x == FICTIONAL:
 return NOT_FOUND
 idx = bisect_left(a=arr, x=x)
 if idx >= len(arr) or arr[idx] != x:
 return NOT_FOUND
 return idx
def test() -> None:
 assert smart_search((1, 2, 3), 0) == NOT_FOUND
 assert smart_search((1, 2, 3), 1) == 0
 assert smart_search((1, 2, 3), 2) == 1
 assert smart_search((1, 2, 3), 3) == 2
 assert smart_search((1, 2, 3), 4) == NOT_FOUND
 assert smart_search((1, 2, 3, FICTIONAL), 0) == NOT_FOUND
 assert smart_search((1, 2, 3, FICTIONAL), 1) == 0
 assert smart_search((1, 2, 3, FICTIONAL), 2) == 1
 assert smart_search((1, 2, 3, FICTIONAL), 3) == 2
 assert smart_search((1, 2, 3, FICTIONAL), 4) == NOT_FOUND
 assert smart_search((1, 2, 4), 3) == -1
if __name__ == '__main__':
 test()

J_H J_H 41.4k3 gold badges38 silver badges157 bronze badges · Answer 2 · 2024-03-27 15:23:45Z

lint

def smartSearch(arr, x):

Pylint reports:

C0103: Function name "smartSearch" doesn't conform to snake_case naming style (invalid-name)

Please have your lecturer recite these words out loud before the class:

Pep-8 asks that you give functions snake_case names.

magic number

Let us define

SENTINEL = 9999

algorithm

 high = len(arr) - 1

This is just Wrong. It clearly leads to \$O(\log n)\$ cost, given that you assigned \$n\$ to it.

determine m

Write a find_m(arr) helper. It must complete within \$O(\log m)\$ time, but that's easily accomplished.

To return a useful estimate of \$m\$, start with estimate = 1. Keep doubling the estimate until you see arr[estimate] == SENTINEL or it runs off the end of the array, whichever happens first. Return the estimate; it may be an over-estimate, which is fine. It will be within an octave of the true value.

If for some reason you really want to be picky and find the exact value of \$m\$, that is still feasible within the time limit, given integer inputs. Just do binary search, between 0 and estimate, for the value SENTINEL - 1. Do not try searching for that value between estimate and len(arr) - 1, as that would be prohibitively expensive, with \$O(\log n)\$ cost.

determine the index

 high = find_m(arr)

Now you're ready to search in the usual way.

large target value

It appears that x > SENTINEL corresponds to invalid input, for which we could immediately return -1.

That is, the problem speaks of

a sorted array of integers of size n containing at the beginning m "significant numbers"

The combination of "sorted" and "beginning" implies that significant numbers shall always be less than the sentinel value.

test suite

When you write unit tests, be sure to include cases corresponding to

m == 0
m == n

Since \$m\$ is unknown, but \$m \le \$n, then determining (or even estimating) \$m\$ is an \$O(\log n)\$ operation, so \$O(\log n + \log m) = O(2 * \log n) = O(\log n)\$. Your "prohibitively expensive, with \$O(\log n)\$ cost" is wrong and misleading; perhaps you meant "do not try a linear search ...".
Another variation: while doubling the estimation of \$m\,ドル test for the overrun and for the sentinel value, and also for a 'significant value' greater than the sought one. In the third case a binary search may use a comparator without a check for sentinel values.
@AJNeufeld, no, that does not follow. There are two cases, the first being \$m \ll n\,ドル in which the proposed algorithm definitely finds an estimate with \$O(\log m)\$ cost, independent of \$n\,ドル thanks to the great many sentinels. In the other case we have a much bigger budget, so the fact that we examined much of arr is irrelevant, given that the large \$m\$ value grants us permission to do that. // No, I didn't mean a linear search. I meant that we should not try binary search on the wrong half, on the high half, since that is bounded by \$n\$ rather than by \$m\$.
@J_H Yes, you are correct. I failed to notice your telescoping estimate is actually \$O(\log m)\,ドル and thus much more efficient when \$m \ll n\$.
@CiaPan, when teaching a student, I prefer to keep things as simple as feasible. Often that corresponds to relying on composition of functions, so each one can be tested and reasoned about in isolation. There are certainly ways to implement a function that has several responsibilities and have the Right Thing happen in the end, but sticking to Single Responsibility is easier for fallible humans to get right.

AJNeufeld AJNeufeld 35.2k5 gold badges41 silver badges103 bronze badges · Answer 3 · 2024-03-29 03:13:50Z

0-based

Python is a 0-based language. A sequence of length n has elements running from index 0 to n-1. There is no need to write high = len(arr) - 1 and then test using <= to include the end of the range. Simply drop the - 1 (and the cost of the unnecessary subtraction calculation) and test with <.

 low = 0
 high = len(arr) # no "- 1" here
 while low < high: # use "<" here
 mid = (high + low) // 2
 if arr[mid] == x:
 return mid
 elif arr[mid] < x:
 low = mid + 1
 else:
 high = mid # no "- 1" here

Repeated indexing

You fetch arr[mid] twice in your search loop. Indexing into a random-access array shouldn't be expensive, but why pay any extra cost? Cache the looked up value.

 val = arr[mid]
 if val == x:
 return mid
 elif val < x:
 ...

Reworked code

Using some of the suggestions by J_H and Reinderien, (although we still reinvent-the-wheel, since the requirement is to write the search function) we can rework the code as follows:

FICTIONAL = 9_999
NOT_FOUND = -1
def smart_search(arr: list[int], x: int) -> int:
 """
 Find 'x' in sorted list of integers, in O(log m) time.
 'arr' contains a 'm' actual values, and 'n-m' padding values.
 Returns:
 if present: index of 'x'
 otherwise: NOT_FOUND (-1)
 """
 # Fast fail
 if x >= FICTIONAL:
 return NOT_FOUND
 n = len(arr)
 if n == 0:
 return NOT_FOUND
 # Telescoping estimate of 'm'
 m_est = 1
 while m_est < n and arr[m_est] != FICTIONAL:
 m_est *= 2
 m_est = min(m_est, n)
 # Standard binary search
 low = 0
 high = m_est
 while low < high:
 mid = (low + high) // 2
 val = arr[mid] 
 if val == x:
 return mid
 elif val < x:
 low = mid + 1
 else:
 high = mid
 
 return NOT_FOUND

Testing

Let's mock an array with the sequence 0, 10, 20, ... and count the number of times we index into the array.

class MockArray:
 """
 A mock array of length 'n'.
 The 'array' contains 'm' ascending values, followed by 'n-m'
 fictitious values representing unused entries.
 Accesses to the array are counted, for instrumentation.
 'm' is not returned by any public method.
 """
 def __init__(self, n: int, m: int):
 self._n = n
 self._m = m
 self._gets = 0
 if not (0 <= m <= n):
 raise ValueError("Invalid n and/or m")
 if m >= 1000:
 raise ValueError("m is too big")
 def __len__(self) -> int:
 return self._n
 def __getitem__(self, index: int) -> int:
 if index < 0:
 index += self._n
 if index < 0 or index >= self._n:
 raise IndexError()
 self._gets += 1
 
 if index < self._m:
 return index * 10 # Fake storage of ascending data
 else:
 return FICTIONAL # Fake storage of fictional data
 def num_gets(self):
 return self._gets

Then we can add a small test scaffold ...

def _test_smart_search(n: int, m: int, x: int, expected: int, limit: int):
 
 arr = MockArray(n, m)
 index = smart_search(arr, x)
 assert index == expected, f"{m=}: expected {expected}, got {index}"
 assert arr.num_gets() <= limit, f"{arr.num_gets()} > {limit}"

... which creates the mock array, searches for the given element, and validates if the correct result was returned and no more than the given number of array lookups were made, to validate the time O-requirement.

Finally, we can use this to run a number of tests on a billion element array.

def test_smart_search():
 
 for m in range(1000):
 limit = m.bit_length() * 2 + 2
 target = 50
 expected = target // 10 if m*10 > target else NOT_FOUND
 
 _test_smart_search(1_000_000_000, m, target, expected, limit)
if __name__ == '__main__':
 test_smart_search()

Stack Exchange Network

Creating a search function

3 Answers 3

lint

magic number

algorithm

determine m

determine the index

large target value

test suite

0-based

Repeated indexing

Reworked code

Testing

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Creating a search function

3 Answers 3

lint

magic number

algorithm

determine m

determine the index

large target value

test suite

0-based

Repeated indexing

Reworked code

Testing

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions