Variant of 2-SUM Problem using Hashtable with multi-threading

Question 1

I am doing a Coursera assignment on Hashtable and my solution is taking too long to run. Here is the problem as described in the assignment:

The goal of this problem is to implement a variant of the 2-SUM algorithm covered in this week's lectures.

The file contains 1 million integers, both positive and negative (there might be some repetitions!).This is your array of integers, with the ith row of the file specifying the ith entry of the array.

Your task is to compute the number of target values t in the interval [-10000,10000] (inclusive) such that there are distinct numbers x,y in the input file that satisfy x+y=t

Here is my entire Python code:

from collections import defaultdict
import threading
def hash_fn(n):
 M=1999993
 return n%M
def two_sum_present(target):
 global ans
 bool=True
 for x in nums:
 y = target-x
 if y in my_hash_table[hash_fn(y)] and y!=x:
 bool=False
 ans+=1
 return 1
 if bool == True:
 return 0
f = open('_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt')
nums=[]
my_hash_table = defaultdict(list) 
for line in f.readlines():
 num = int(line.split('\n')[0])
 nums.append(num)
 my_hash_table[hash_fn(num)].append(num)
ans=0
for nr in range(-10000, 10001):
 thr = threading.Thread(target = two_sum_present, args=(nr,))
 thr.start()
print(ans)

Notes:

I found a page for the same problem here, but this solution uses neither hashing nor binary search, so it should take longer than my solution. This solution is essentially this:
```
 ans = sum(any(n-x in nums and 2*x != n for x in nums) for n in range(-10000, 10001))
```
The text file can be found here. if someone is interested. But think of it as a big text file with 1M lines having 1 integer(+ve or -ve) on each line.
'distinct' as mentioned here only means that x+x=t is not a valid 2-sum pair; but two (x, y) pairs where x and y are not same constitute a valid pair

Question 2

Please fix the indentation in your code. The hash_fn is misaligned for eg.

Question 3

Sorry for bad indentation there. Fixed it.

Question 4

Welcome to Code Review! I hope you get some great answers.

Question 5

Your hash table will have collisions (you are not checking for equality before overwriting an existing entry) and your hash function is not really doing much for you. Try this trick, change your hash function to return the number you pass in to it, is it faster? I suspect it will be.

Question 6

@PiyushSingh I removed the link that you changed in your last edit because strange things happened when I clicked the link. Maybe the site is infected.

Question 7

functions

split the work you want to do in logical functions. I try to limit the work that is done in the general script (and thus the global namespace) as much as possible. Lookups of local variable are supposed to be faster, but it make the code easier to read and test as well, so win-win.

hashing

Why device your own hashing? just dumping all the numbers in a set will eliminate all the duplicates, and 1M integers is a limited amount of memory. Certainly a lot less that the defaultdict you use

global variable ans

Instead of passing the global variable ans, the easiest way would be to use the fact that in a numeric context (like sum), True equals 1 and False equals 0, so you can just return True or False in two_sum_present

my take on this:

read the file

Using the fact that a file is an iterator, I would do something like this

def read_file(filename):
 with open(filename, 'r') as file:
 return set(map(int, file))

find the target

About the same as your two_sum_present with my remarks incorporated

def find_target(numbers, target):
 for i in numbers:
 if target - i in numbers and 2 * i != target:
 return True
 return False

iterate over the interval

def find_sums(numbers, start=-10000, end=10000):
 for target in range(start, end + 1):
# print(target)
 yield find_target(numbers, target)

putting it together

def main(filename, start=-10000, end=10000):
 numbers = read_file(filename)
# print(f'numbers found: {len(numbers)}')
 return sum(find_sums(numbers, start, end))
if __name__ == '__main__':
 filename = 'data/_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt'
 print(main(filename, -100, 100))

All in all, it took 43 seconds to scan those 201 numbers on my system, to find 5 matches. Reading the file took about .6s of that time

This solution is essentially the same as sum(any(n-x in numbers and 2*x != n for x in numbers) for n in range(-100, 101)), but that one took 59s

Multithreading

If you want to, you can divide out all the call to find_target to different workers/cores/... In this case, a frozenset instead of a set might be more appropriate since it's immutable, and will give less problems with concurrency

distinct

As noted by Gareth Rees, the original question is ambiguous on what is mean with distinct numbers x,y. To cover the second interpretation, You can change the set to a collections.Counter, and change the test criterium slightly

from collections import Counter
def read_file_distinct(filename):
 with open(filename, 'r') as file:
 return Counter(map(int, file))
def find_sums_distinct(numbers, start=-10000, end=10000):
 for target in range(start, end + 1):
# print(target)
 yield find_target_distinct(numbers, target)
def find_target_distinct(numbers, target):
 for i in numbers:
 if target - i in numbers and (2 * i != target or numbers[i] > 1):
 return True
 return False
def main_distinct(filename, start=-10000, end=10000):
 numbers = read_file_distinct(filename)
# print(f'numbers found: {len(numbers)}')
 return sum(find_sums_distinct(numbers, start, end))

This has an effect on speed, though

%timeit main(filename, -10, 10)

4.51 s ± 32.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit main_distinct(filename, -10, 10)

6.42 s ± 48.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

distinct 2

A slightly different, faster approach can be used to tackle the 2nd interpretation

def find_sums_distinct2(numbers, repetitions,start=-10000, end=10000):
 for target in range(start, end + 1):
# print(target)
 yield find_target_distinct2(numbers, repetitions, target)
def find_target_distinct2(numbers, repetitions, target):
 for i in numbers:
 if target - i in numbers and (2 * i != target or i in repetitions):
 return True
 return False
def main_distinct2(filename, start=-10000, end=10000):
 numbers = read_file_distinct(filename)
 repetitions = {k for k, v in numbers.items() if ((v > 1) and (start < k < end))}
# print(repetitions)
 numbers = set(numbers)
# print(f'{len(numbers)} numbers found, {len(repetitions)} repetitions')
 return sum(find_sums_distinct2(numbers, repetitions, start, end))

%timeit main_distinct2(filename, -10, 10)

4.92 s ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Question 8

This doesn't seem to handle the possibility of repetitions in the input.

Question 9

does it need to? The question mentions such that there are distinct numbers x,y in the input file that satisfy x+y=t. If instead by distinct they also mean 2 repetitions of the same number, then a Counter might be more appropriate, with a slightly revised test like if target - i in numbers and (2 * i != target or numbers[i] > 1)

Question 10

The question is not exactly clear on this point. From a pragmatic point of view, why note "there might be some repetitions" unless this is relevant somehow? But on the other hand, you're right that "distinct" suggests \$x \ne y\$.

Question 11

the question is unclear on this point indeed, and I selected one interpretation. I will try to adapt the answer to include both interpretations

Question 12

@MaartenFabré I have added the meaning of 'distinct' in the question. Also in find_target function you should search only in the range [-10000-target, 10000-target], because, that's where any relevant complementary number for that number might lie; if any. Also you don't need to search for 10001 because we are not interested if any number pair add up to 10001. For completion, I am adding my own answer which ran in 2 seconds on my PC.

Question 13

import bisect
num_set = set()
with open('_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt') as file:
 for num in file.read().splitlines():
 num = int(num)
 num_set.add(num)
num_list = sorted(num_set)
targets = set()
for num in num_list:
 low = bisect.bisect_left(num_list, -10000-num)
 high = bisect.bisect_right(num_list, 10000-num)
 for num_2 in num_list[low:high]:
 if num_2 != num:
 targets.add(num+num_2)
print(len(targets))

Using Maarten's code, I wrote this; but I limited my search range for each number to [-10000-number, 10000-number] because that's where any relevant number for that number will lie; if any. This helped greatly with the running time.

Maarten Fabré Maarten Fabré 9,3901 gold badge15 silver badges27 bronze badges · Answer 1 · 2018-04-09 12:25:54Z

functions

split the work you want to do in logical functions. I try to limit the work that is done in the general script (and thus the global namespace) as much as possible. Lookups of local variable are supposed to be faster, but it make the code easier to read and test as well, so win-win.

hashing

Why device your own hashing? just dumping all the numbers in a set will eliminate all the duplicates, and 1M integers is a limited amount of memory. Certainly a lot less that the defaultdict you use

global variable ans

Instead of passing the global variable ans, the easiest way would be to use the fact that in a numeric context (like sum), True equals 1 and False equals 0, so you can just return True or False in two_sum_present

my take on this:

read the file

Using the fact that a file is an iterator, I would do something like this

def read_file(filename):
 with open(filename, 'r') as file:
 return set(map(int, file))

find the target

About the same as your two_sum_present with my remarks incorporated

def find_target(numbers, target):
 for i in numbers:
 if target - i in numbers and 2 * i != target:
 return True
 return False

iterate over the interval

def find_sums(numbers, start=-10000, end=10000):
 for target in range(start, end + 1):
# print(target)
 yield find_target(numbers, target)

putting it together

def main(filename, start=-10000, end=10000):
 numbers = read_file(filename)
# print(f'numbers found: {len(numbers)}')
 return sum(find_sums(numbers, start, end))
if __name__ == '__main__':
 filename = 'data/_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt'
 print(main(filename, -100, 100))

All in all, it took 43 seconds to scan those 201 numbers on my system, to find 5 matches. Reading the file took about .6s of that time

This solution is essentially the same as sum(any(n-x in numbers and 2*x != n for x in numbers) for n in range(-100, 101)), but that one took 59s

Multithreading

If you want to, you can divide out all the call to find_target to different workers/cores/... In this case, a frozenset instead of a set might be more appropriate since it's immutable, and will give less problems with concurrency

distinct

As noted by Gareth Rees, the original question is ambiguous on what is mean with distinct numbers x,y. To cover the second interpretation, You can change the set to a collections.Counter, and change the test criterium slightly

from collections import Counter
def read_file_distinct(filename):
 with open(filename, 'r') as file:
 return Counter(map(int, file))
def find_sums_distinct(numbers, start=-10000, end=10000):
 for target in range(start, end + 1):
# print(target)
 yield find_target_distinct(numbers, target)
def find_target_distinct(numbers, target):
 for i in numbers:
 if target - i in numbers and (2 * i != target or numbers[i] > 1):
 return True
 return False
def main_distinct(filename, start=-10000, end=10000):
 numbers = read_file_distinct(filename)
# print(f'numbers found: {len(numbers)}')
 return sum(find_sums_distinct(numbers, start, end))

This has an effect on speed, though

%timeit main(filename, -10, 10)

4.51 s ± 32.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit main_distinct(filename, -10, 10)

6.42 s ± 48.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

distinct 2

A slightly different, faster approach can be used to tackle the 2nd interpretation

def find_sums_distinct2(numbers, repetitions,start=-10000, end=10000):
 for target in range(start, end + 1):
# print(target)
 yield find_target_distinct2(numbers, repetitions, target)
def find_target_distinct2(numbers, repetitions, target):
 for i in numbers:
 if target - i in numbers and (2 * i != target or i in repetitions):
 return True
 return False
def main_distinct2(filename, start=-10000, end=10000):
 numbers = read_file_distinct(filename)
 repetitions = {k for k, v in numbers.items() if ((v > 1) and (start < k < end))}
# print(repetitions)
 numbers = set(numbers)
# print(f'{len(numbers)} numbers found, {len(repetitions)} repetitions')
 return sum(find_sums_distinct2(numbers, repetitions, start, end))

%timeit main_distinct2(filename, -10, 10)

4.92 s ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This doesn't seem to handle the possibility of repetitions in the input.
does it need to? The question mentions such that there are distinct numbers x,y in the input file that satisfy x+y=t. If instead by distinct they also mean 2 repetitions of the same number, then a Counter might be more appropriate, with a slightly revised test like if target - i in numbers and (2 * i != target or numbers[i] > 1)
The question is not exactly clear on this point. From a pragmatic point of view, why note "there might be some repetitions" unless this is relevant somehow? But on the other hand, you're right that "distinct" suggests \$x \ne y\$.
the question is unclear on this point indeed, and I selected one interpretation. I will try to adapt the answer to include both interpretations
@MaartenFabré I have added the meaning of 'distinct' in the question. Also in find_target function you should search only in the range [-10000-target, 10000-target], because, that's where any relevant complementary number for that number might lie; if any. Also you don't need to search for 10001 because we are not interested if any number pair add up to 10001. For completion, I am adding my own answer which ran in 2 seconds on my PC.

Piyush Singh Piyush Singh 1518 bronze badges · Answer 2 · 2018-04-09 17:21:40Z

import bisect
num_set = set()
with open('_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt') as file:
 for num in file.read().splitlines():
 num = int(num)
 num_set.add(num)
num_list = sorted(num_set)
targets = set()
for num in num_list:
 low = bisect.bisect_left(num_list, -10000-num)
 high = bisect.bisect_right(num_list, 10000-num)
 for num_2 in num_list[low:high]:
 if num_2 != num:
 targets.add(num+num_2)
print(len(targets))

Using Maarten's code, I wrote this; but I limited my search range for each number to [-10000-number, 10000-number] because that's where any relevant number for that number will lie; if any. This helped greatly with the running time.

Stack Exchange Network

Variant of 2-SUM Problem using Hashtable with multi-threading

2 Answers 2

functions

hashing

global variable ans

my take on this:

read the file

find the target

iterate over the interval

putting it together

Multithreading

distinct

distinct 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Variant of 2-SUM Problem using Hashtable with multi-threading

2 Answers 2

functions

hashing

global variable ans

my take on this:

read the file

find the target

iterate over the interval

putting it together

Multithreading

distinct

distinct 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions