Write an efficient function in the Python language called SmartSearch. This function will receive a sorted
arr
array of integers of size \$n\$ containing at the beginning \$m\$ "significant numbers" and the rest of the numbers are "fictitious numbers" whose value is 9999. The function also receives a number \$x\$ to be searched for in the array. If the number is in the array, the function must return its index (position). If the number \$x\$ does not exist, it must return -1. Important note: the size \$m\$ is unknown. Required efficiency: \$O(\log m)\$
I tried that, but I don't think it quite answers the question. I would appreciate corrections please. I realized that I should first find in which index the fictitious numbers start, and then refer only to the list without the fictitious numbers. How do you do that?
def smartSearch(arr, x):
if x == 9999:
return -1
low = 0
high = len(arr) - 1
while low <= high:
mid = (high + low) // 2
if arr[mid] == x:
return mid
elif arr[mid] < x:
low = mid + 1
else:
high = mid - 1
return -1
3 Answers 3
Let's start with the issues in the problem statement itself:
- Don't name a function
SmartSearch
; usesmart_search
due to PEP8. - It's not helpful to define \$n\$ but then never meaningfully use it. It's implied that either \$m < n\$ or \$m \le n\$, but which is it?
- The word array is almost certainly mis-used. There are arrays in Python but this is almost certainly not what the problem statement actually meant; instead it meant a sequence.
- It's a bad idea to return -1 on failure for a function that also returns integers on success. This is a C-ism and not how Python is supposed to work. A well-designed Python API would instead throw an exception, possibly an
IndexError
orValueError
. Otherwise, a naive caller may try to use -1 as an index: which would index to the last element of the sequence and silently produce nonsensical behaviour!
For your purposes let's change the definition to accommodate all but the last point.
As to your implementation,
- You violated the spec by writing
smartSearch
, which (by the original spec) would be calledSmartSearch
, but shouldn't be either; it should besmart_search
. - Don't re-implement a binary search when bisect is part of the standard library.
- Write unit tests.
An implementation can look like
from bisect import bisect_left
from typing import Sequence
FICTIONAL = 9999
NOT_FOUND = -1
def smart_search(arr: Sequence[int], x: int) -> int:
if x == FICTIONAL:
return NOT_FOUND
idx = bisect_left(a=arr, x=x)
if idx >= len(arr) or arr[idx] != x:
return NOT_FOUND
return idx
def test() -> None:
assert smart_search((1, 2, 3), 0) == NOT_FOUND
assert smart_search((1, 2, 3), 1) == 0
assert smart_search((1, 2, 3), 2) == 1
assert smart_search((1, 2, 3), 3) == 2
assert smart_search((1, 2, 3), 4) == NOT_FOUND
assert smart_search((1, 2, 3, FICTIONAL), 0) == NOT_FOUND
assert smart_search((1, 2, 3, FICTIONAL), 1) == 0
assert smart_search((1, 2, 3, FICTIONAL), 2) == 1
assert smart_search((1, 2, 3, FICTIONAL), 3) == 2
assert smart_search((1, 2, 3, FICTIONAL), 4) == NOT_FOUND
assert smart_search((1, 2, 4), 3) == -1
if __name__ == '__main__':
test()
lint
def smartSearch(arr, x):
Pylint reports:
C0103: Function name "smartSearch" doesn't conform to snake_case naming style (invalid-name)
Please have your lecturer recite these words out loud before the class:
Pep-8 asks that you give functions
snake_case
names.
magic number
Let us define
SENTINEL = 9999
algorithm
high = len(arr) - 1
This is just Wrong. It clearly leads to \$O(\log n)\$ cost, given that you assigned \$n\$ to it.
determine m
Write a find_m(arr)
helper.
It must complete within \$O(\log m)\$ time,
but that's easily accomplished.
To return a useful estimate of \$m\$,
start with estimate = 1
.
Keep doubling the estimate until
you see arr[estimate] == SENTINEL
or it runs off the end of the array,
whichever happens first.
Return
the estimate; it may be an over-estimate, which is fine.
It will be within an octave of the true value.
If for some reason you really want to be picky and
find the exact value of \$m\$,
that is still feasible within the time limit,
given integer inputs.
Just do binary search, between 0
and estimate
,
for the value SENTINEL - 1
.
Do not try searching for that value
between estimate
and len(arr) - 1
,
as that would be prohibitively expensive,
with \$O(\log n)\$ cost.
determine the index
high = find_m(arr)
Now you're ready to search in the usual way.
large target value
It appears that x > SENTINEL
corresponds to invalid input,
for which we could immediately return -1
.
That is, the problem speaks of
a sorted array of integers of size n containing at the beginning m "significant numbers"
The combination of "sorted" and "beginning" implies that significant numbers shall always be less than the sentinel value.
test suite
When you write unit tests, be sure to include cases corresponding to
m == 0
m == n
-
\$\begingroup\$ Since \$m\$ is unknown, but \$m \le \$n, then determining (or even estimating) \$m\$ is an \$O(\log n)\$ operation, so \$O(\log n + \log m) = O(2 * \log n) = O(\log n)\$. Your "prohibitively expensive, with \$O(\log n)\$ cost" is wrong and misleading; perhaps you meant "do not try a linear search ...". \$\endgroup\$AJNeufeld– AJNeufeld2024年03月27日 15:35:20 +00:00Commented Mar 27, 2024 at 15:35
-
\$\begingroup\$ Another variation: while doubling the estimation of \$m\,ドル test for the overrun and for the sentinel value, and also for a 'significant value' greater than the sought one. In the third case a binary search may use a comparator without a check for sentinel values. \$\endgroup\$CiaPan– CiaPan2024年03月27日 15:35:25 +00:00Commented Mar 27, 2024 at 15:35
-
1\$\begingroup\$ @AJNeufeld, no, that does not follow. There are two cases, the first being \$m \ll n\,ドル in which the proposed algorithm definitely finds an estimate with \$O(\log m)\$ cost, independent of \$n\,ドル thanks to the great many sentinels. In the other case we have a much bigger budget, so the fact that we examined much of
arr
is irrelevant, given that the large \$m\$ value grants us permission to do that. // No, I didn't mean a linear search. I meant that we should not try binary search on the wrong half, on the high half, since that is bounded by \$n\$ rather than by \$m\$. \$\endgroup\$J_H– J_H2024年03月27日 15:40:31 +00:00Commented Mar 27, 2024 at 15:40 -
1\$\begingroup\$ @J_H Yes, you are correct. I failed to notice your telescoping
estimate
is actually \$O(\log m)\,ドル and thus much more efficient when \$m \ll n\$. \$\endgroup\$AJNeufeld– AJNeufeld2024年03月27日 15:45:30 +00:00Commented Mar 27, 2024 at 15:45 -
\$\begingroup\$ @CiaPan, when teaching a student, I prefer to keep things as simple as feasible. Often that corresponds to relying on composition of functions, so each one can be tested and reasoned about in isolation. There are certainly ways to implement a function that has several responsibilities and have the Right Thing happen in the end, but sticking to Single Responsibility is easier for fallible humans to get right. \$\endgroup\$J_H– J_H2024年03月27日 15:47:10 +00:00Commented Mar 27, 2024 at 15:47
0-based
Python is a 0-based language. A sequence of length n has elements running from index 0 to n-1. There is no need to write high = len(arr) - 1
and then test using <=
to include the end of the range. Simply drop the - 1
(and the cost of the unnecessary subtraction calculation) and test with <
.
low = 0
high = len(arr) # no "- 1" here
while low < high: # use "<" here
mid = (high + low) // 2
if arr[mid] == x:
return mid
elif arr[mid] < x:
low = mid + 1
else:
high = mid # no "- 1" here
Repeated indexing
You fetch arr[mid]
twice in your search loop. Indexing into a random-access array shouldn't be expensive, but why pay any extra cost? Cache the looked up value.
val = arr[mid]
if val == x:
return mid
elif val < x:
...
Reworked code
Using some of the suggestions by J_H and Reinderien, (although we still reinvent-the-wheel, since the requirement is to write the search function) we can rework the code as follows:
FICTIONAL = 9_999
NOT_FOUND = -1
def smart_search(arr: list[int], x: int) -> int:
"""
Find 'x' in sorted list of integers, in O(log m) time.
'arr' contains a 'm' actual values, and 'n-m' padding values.
Returns:
if present: index of 'x'
otherwise: NOT_FOUND (-1)
"""
# Fast fail
if x >= FICTIONAL:
return NOT_FOUND
n = len(arr)
if n == 0:
return NOT_FOUND
# Telescoping estimate of 'm'
m_est = 1
while m_est < n and arr[m_est] != FICTIONAL:
m_est *= 2
m_est = min(m_est, n)
# Standard binary search
low = 0
high = m_est
while low < high:
mid = (low + high) // 2
val = arr[mid]
if val == x:
return mid
elif val < x:
low = mid + 1
else:
high = mid
return NOT_FOUND
Testing
Let's mock an array with the sequence 0, 10, 20, ... and count the number of times we index into the array.
class MockArray:
"""
A mock array of length 'n'.
The 'array' contains 'm' ascending values, followed by 'n-m'
fictitious values representing unused entries.
Accesses to the array are counted, for instrumentation.
'm' is not returned by any public method.
"""
def __init__(self, n: int, m: int):
self._n = n
self._m = m
self._gets = 0
if not (0 <= m <= n):
raise ValueError("Invalid n and/or m")
if m >= 1000:
raise ValueError("m is too big")
def __len__(self) -> int:
return self._n
def __getitem__(self, index: int) -> int:
if index < 0:
index += self._n
if index < 0 or index >= self._n:
raise IndexError()
self._gets += 1
if index < self._m:
return index * 10 # Fake storage of ascending data
else:
return FICTIONAL # Fake storage of fictional data
def num_gets(self):
return self._gets
Then we can add a small test scaffold ...
def _test_smart_search(n: int, m: int, x: int, expected: int, limit: int):
arr = MockArray(n, m)
index = smart_search(arr, x)
assert index == expected, f"{m=}: expected {expected}, got {index}"
assert arr.num_gets() <= limit, f"{arr.num_gets()} > {limit}"
... which creates the mock array, searches for the given element, and validates if the correct result was returned and no more than the given number of array lookups were made, to validate the time O-requirement.
Finally, we can use this to run a number of tests on a billion element array.
def test_smart_search():
for m in range(1000):
limit = m.bit_length() * 2 + 2
target = 50
expected = target // 10 if m*10 > target else NOT_FOUND
_test_smart_search(1_000_000_000, m, target, expected, limit)
if __name__ == '__main__':
test_smart_search()
high
as follows:while high >= 0 and arr[high] == 999: high -= 1
. But what if you had an array of 200 elements of which the last 100 have the value9999
. You would be looping 100 times to sethigh
from its initial value of 199 to 99. If you did nothing you would computemid
initially to be 99. Ifarr[99]
is less than thex
value you seek you will discover on your next iteration that the value does not exist. Otherwise you have eliminated the 100 fictitious numbers with one search. So do nothing. \$\endgroup\$smartSearch([1,88888,88889,9999,9999], 88888):
\$\endgroup\$