I am doing a Coursera assignment on Hashtable and my solution is taking too long to run. Here is the problem as described in the assignment:
The goal of this problem is to implement a variant of the 2-SUM algorithm covered in this week's lectures.
The file contains 1 million integers, both positive and negative (there might be some repetitions!).This is your array of integers, with the ith row of the file specifying the ith entry of the array.
Your task is to compute the number of target values t in the interval [-10000,10000] (inclusive) such that there are distinct numbers x,y in the input file that satisfy x+y=t
Here is my entire Python code:
from collections import defaultdict
import threading
def hash_fn(n):
M=1999993
return n%M
def two_sum_present(target):
global ans
bool=True
for x in nums:
y = target-x
if y in my_hash_table[hash_fn(y)] and y!=x:
bool=False
ans+=1
return 1
if bool == True:
return 0
f = open('_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt')
nums=[]
my_hash_table = defaultdict(list)
for line in f.readlines():
num = int(line.split('\n')[0])
nums.append(num)
my_hash_table[hash_fn(num)].append(num)
ans=0
for nr in range(-10000, 10001):
thr = threading.Thread(target = two_sum_present, args=(nr,))
thr.start()
print(ans)
Notes:
I found a page for the same problem here, but this solution uses neither hashing nor binary search, so it should take longer than my solution. This solution is essentially this:
ans = sum(any(n-x in nums and 2*x != n for x in nums) for n in range(-10000, 10001))
The text file can be found here. if someone is interested. But think of it as a big text file with 1M lines having 1 integer(+ve or -ve) on each line.
'distinct' as mentioned here only means that x+x=t is not a valid 2-sum pair; but two (x, y) pairs where x and y are not same constitute a valid pair
2 Answers 2
functions
split the work you want to do in logical functions. I try to limit the work that is done in the general script (and thus the global namespace) as much as possible. Lookups of local variable are supposed to be faster, but it make the code easier to read and test as well, so win-win.
hashing
Why device your own hashing? just dumping all the numbers in a set will eliminate all the duplicates, and 1M integers is a limited amount of memory. Certainly a lot less that the defaultdict
you use
global variable ans
Instead of passing the global variable ans
, the easiest way would be to use the fact that in a numeric context (like sum
), True
equals 1 and False
equals 0, so you can just return True or False in two_sum_present
my take on this:
read the file
Using the fact that a file is an iterator, I would do something like this
def read_file(filename):
with open(filename, 'r') as file:
return set(map(int, file))
find the target
About the same as your two_sum_present
with my remarks incorporated
def find_target(numbers, target):
for i in numbers:
if target - i in numbers and 2 * i != target:
return True
return False
iterate over the interval
def find_sums(numbers, start=-10000, end=10000):
for target in range(start, end + 1):
# print(target)
yield find_target(numbers, target)
putting it together
def main(filename, start=-10000, end=10000):
numbers = read_file(filename)
# print(f'numbers found: {len(numbers)}')
return sum(find_sums(numbers, start, end))
if __name__ == '__main__':
filename = 'data/_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt'
print(main(filename, -100, 100))
All in all, it took 43 seconds to scan those 201 numbers on my system, to find 5 matches. Reading the file took about .6s of that time
This solution is essentially the same as sum(any(n-x in numbers and 2*x != n for x in numbers) for n in range(-100, 101))
, but that one took 59s
Multithreading
If you want to, you can divide out all the call to find_target
to different workers/cores/... In this case, a frozenset
instead of a set
might be more appropriate since it's immutable, and will give less problems with concurrency
distinct
As noted by Gareth Rees, the original question is ambiguous on what is mean with distinct numbers x,y
. To cover the second interpretation, You can change the set
to a collections.Counter
, and change the test criterium slightly
from collections import Counter
def read_file_distinct(filename):
with open(filename, 'r') as file:
return Counter(map(int, file))
def find_sums_distinct(numbers, start=-10000, end=10000):
for target in range(start, end + 1):
# print(target)
yield find_target_distinct(numbers, target)
def find_target_distinct(numbers, target):
for i in numbers:
if target - i in numbers and (2 * i != target or numbers[i] > 1):
return True
return False
def main_distinct(filename, start=-10000, end=10000):
numbers = read_file_distinct(filename)
# print(f'numbers found: {len(numbers)}')
return sum(find_sums_distinct(numbers, start, end))
This has an effect on speed, though
%timeit main(filename, -10, 10)
4.51 s ± 32.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit main_distinct(filename, -10, 10)
6.42 s ± 48.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
distinct 2
A slightly different, faster approach can be used to tackle the 2nd interpretation
def find_sums_distinct2(numbers, repetitions,start=-10000, end=10000):
for target in range(start, end + 1):
# print(target)
yield find_target_distinct2(numbers, repetitions, target)
def find_target_distinct2(numbers, repetitions, target):
for i in numbers:
if target - i in numbers and (2 * i != target or i in repetitions):
return True
return False
def main_distinct2(filename, start=-10000, end=10000):
numbers = read_file_distinct(filename)
repetitions = {k for k, v in numbers.items() if ((v > 1) and (start < k < end))}
# print(repetitions)
numbers = set(numbers)
# print(f'{len(numbers)} numbers found, {len(repetitions)} repetitions')
return sum(find_sums_distinct2(numbers, repetitions, start, end))
%timeit main_distinct2(filename, -10, 10)
4.92 s ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
-
\$\begingroup\$ This doesn't seem to handle the possibility of repetitions in the input. \$\endgroup\$Gareth Rees– Gareth Rees2018年04月09日 14:08:22 +00:00Commented Apr 9, 2018 at 14:08
-
\$\begingroup\$ does it need to? The question mentions
such that there are distinct numbers x,y in the input file that satisfy x+y=t
. If instead by distinct they also mean 2 repetitions of the same number, then aCounter
might be more appropriate, with a slightly revised test likeif target - i in numbers and (2 * i != target or numbers[i] > 1)
\$\endgroup\$Maarten Fabré– Maarten Fabré2018年04月09日 14:39:32 +00:00Commented Apr 9, 2018 at 14:39 -
\$\begingroup\$ The question is not exactly clear on this point. From a pragmatic point of view, why note "there might be some repetitions" unless this is relevant somehow? But on the other hand, you're right that "distinct" suggests \$x \ne y\$. \$\endgroup\$Gareth Rees– Gareth Rees2018年04月09日 14:44:04 +00:00Commented Apr 9, 2018 at 14:44
-
\$\begingroup\$ the question is unclear on this point indeed, and I selected one interpretation. I will try to adapt the answer to include both interpretations \$\endgroup\$Maarten Fabré– Maarten Fabré2018年04月09日 14:45:10 +00:00Commented Apr 9, 2018 at 14:45
-
\$\begingroup\$ @MaartenFabré I have added the meaning of 'distinct' in the question. Also in find_target function you should search only in the range [-10000-target, 10000-target], because, that's where any relevant complementary number for that number might lie; if any. Also you don't need to search for 10001 because we are not interested if any number pair add up to 10001. For completion, I am adding my own answer which ran in 2 seconds on my PC. \$\endgroup\$Piyush Singh– Piyush Singh2018年04月09日 17:19:33 +00:00Commented Apr 9, 2018 at 17:19
import bisect
num_set = set()
with open('_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt') as file:
for num in file.read().splitlines():
num = int(num)
num_set.add(num)
num_list = sorted(num_set)
targets = set()
for num in num_list:
low = bisect.bisect_left(num_list, -10000-num)
high = bisect.bisect_right(num_list, 10000-num)
for num_2 in num_list[low:high]:
if num_2 != num:
targets.add(num+num_2)
print(len(targets))
Using Maarten's code, I wrote this; but I limited my search range for each number to [-10000-number, 10000-number] because that's where any relevant number for that number will lie; if any. This helped greatly with the running time.
Explore related questions
See similar questions with these tags.
hash_fn
is misaligned for eg. \$\endgroup\$