List comparison algorithm

Question 1

I am attempting to create an efficient algorithm to pull all of the similar elements between two lists. The problem is two fold. First, I can not seem to find any similar algorithms online. Second, there should be a more efficient way.

By 'similar', I mean a predesigned way to find all similar elements between two lists in a timely fashion.

Currently, I am taking a greedy approach by:

Sorting the lists that are being compared,
Comparing each element in the shorter list to each element in the larger list,
Since the largeList and smallList are sorted we can save the last index that was visited,
Continue from the previous index (largeIndex).

Currently, the run-time seems to be average of O(nlog(n)). This can be seen by running the test cases listed after this block of code.

This is being run on Python 3.3.

Right now, my code looks as such:

 def compare(small,large,largeStart,largeEnd):
 for i in range(largeStart, largeEnd):
 if small==large[i]:
 return [1,i]
 if small<large[i]:
 if i!=0:
 return [0,i-1]
 else:
 return [0, i]
 return [0,largeStart]
 def determineLongerList(aList, bList):
 if len(aList)>len(bList):
 return (aList, bList)
 elif len(aList)<len(bList):
 return (bList, aList)
 else:
 return (aList, bList)
 def compareElementsInLists(aList, bList):
 import time
 startTime = time.time()
 holder = determineLongerList(aList, bList)
 sameItems = []
 iterations = 0
 ##########################################
 smallList = sorted(holder[1])
 smallLength = len(smallList)
 smallIndex = 0
 largeList = sorted(holder[0])
 largeLength = len(largeList)
 largeIndex = 0
 while (smallIndex<smallLength):
 boolean = compare(smallList[smallIndex],largeList,largeIndex,largeLength)
 if boolean[0]==1:
 #`compare` returns 1 as True
 sameItems.append(smallList[smallIndex])
 oldIndex = largeIndex
 largeIndex = boolean[1]
 else:
 #else no match and possible new index
 oldIndex = largeIndex
 largeIndex = boolean[1]
 smallIndex+=1
 iterations =largeIndex-oldIndex+iterations+1
 print('RAN {it} OUT OF {mathz} POSSIBLE'.format(it=iterations, mathz=smallLength*largeLength))
 print('RATIO:\t\t'+str(iterations/(smallLength*largeLength))+'\n')
 return sameItems

, and here are some test cases:

 def testLargest():
 import time
 from random import randint
 print('\n\n******************************************\n')
 start_time = time.time()
 lis = []
 for i in range(0,1000000):
 ran = randint(0,1000000)
 lis.append(ran)
 lis2 = []
 for i in range(0,1000000):
 ran = randint(0,1000000)
 lis2.append(ran)
 timeTaken = time.time()-start_time 
 print('CREATING LISTS TOOK:\t\t'+str(timeTaken))
 print('\n******************************************')
 start_time = time.time()
 c = compareElementsInLists(lis, lis2)
 timeTaken = time.time()-start_time 
 print('COMPARING LISTS TOOK:\t\t'+str(timeTaken))
 print('NUMBER OF SAME ITEMS:\t\t'+str(len(c)))
 print('\n******************************************')
 #testLargest()
 '''
 One rendition of testLargest:
 ******************************************
 CREATING LISTS TOOK: 21.009342908859253
 ******************************************
 RAN 999998 OUT OF 1000000000000 POSSIBLE
 RATIO: 9.99998e-07
 COMPARING LISTS TOOK: 13.99990701675415
 NUMBER OF SAME ITEMS: 632328
 ******************************************
 '''
 def testLarge():
 import time
 from random import randint
 print('\n\n******************************************\n')
 start_time = time.time()
 lis = []
 for i in range(0,1000000):
 ran = randint(0,100)
 lis.append(ran)
 lis2 = []
 for i in range(0,1000000):
 ran = randint(0,100)
 lis2.append(ran)
 timeTaken = time.time()-start_time 
 print('CREATING LISTS TOOK:\t\t'+str(timeTaken))
 print('\n******************************************')
 start_time = time.time()
 c = compareElementsInLists(lis, lis2)
 timeTaken = time.time()-start_time 
 print('COMPARING LISTS TOOK:\t\t'+str(timeTaken))
 print('NUMBER OF SAME ITEMS:\t\t'+str(len(c)))
 print('\n******************************************')
 testLarge()

Question 2

Why don't you convert both lists to sets and then form the intersection?

Question 3

@spiderpig I was looking into this, and with the randomly generated lists in my test cases, it takes a significant amount of more time for set once the data becomes large enough. Also, set does not return he entire collection, but just similar elements. I am looking into collections.Counter now...

Question 4

It's not 100% clear what data you want as output here. Do you simply want to return all the elements that are common to both lists?

Question 5

I also don't understand what you are trying to do here. Some examples might help.

Question 6

After running extensive tests with varying list lengths I've found the set.intersection() method to be the fastest by far. I coded it like this:

 start_time = time.time()
 if len(lis) > len(lis2):
 lis, lis2 = lis2, lis
 c = list(set(lis).intersection(lis2))
 c.sort()
 timeTaken = time.time() - start_time
 print('COMPARING LISTS TOOK:\t\t' + str(timeTaken))
 print('NUMBER OF SAME ITEMS:\t\t' + str(len(c)))
 print c[:20]

First, the smaller of both lists will be converted to a set. The conversion does take some time, as noted before, and this will minimize the effort. Second, only one list needs to be cast into a set. The intersection will be done with the second, unaltered list. It would take longer to convert the second list as well which is unnecessary. The final sorting can or cannot be omitted depending on your needs.

With random lists of lengths 60.000 and 40.000 resp. OP's algorithm took 20.6 s and set.intersection() took 0.031 s on my (old) notebook (664:1). Both lists of equal length, 60.000, 35.4 s vs. 0.031 s = 1141:1. For identical lists of length 60.000, 23.8 to 0.016 s = 1487:1. With these ratios I don't see any way to optimize the original algorithms to gain 3 orders of magnitude in speed.

user1016274 user1016274 6445 silver badges9 bronze badges · Accepted Answer · 2015-04-06 21:10:38Z

After running extensive tests with varying list lengths I've found the set.intersection() method to be the fastest by far. I coded it like this:

 start_time = time.time()
 if len(lis) > len(lis2):
 lis, lis2 = lis2, lis
 c = list(set(lis).intersection(lis2))
 c.sort()
 timeTaken = time.time() - start_time
 print('COMPARING LISTS TOOK:\t\t' + str(timeTaken))
 print('NUMBER OF SAME ITEMS:\t\t' + str(len(c)))
 print c[:20]

First, the smaller of both lists will be converted to a set. The conversion does take some time, as noted before, and this will minimize the effort. Second, only one list needs to be cast into a set. The intersection will be done with the second, unaltered list. It would take longer to convert the second list as well which is unnecessary. The final sorting can or cannot be omitted depending on your needs.

With random lists of lengths 60.000 and 40.000 resp. OP's algorithm took 20.6 s and set.intersection() took 0.031 s on my (old) notebook (664:1). Both lists of equal length, 60.000, 35.4 s vs. 0.031 s = 1141:1. For identical lists of length 60.000, 23.8 to 0.016 s = 1487:1. With these ratios I don't see any way to optimize the original algorithms to gain 3 orders of magnitude in speed.

Stack Exchange Network

List comparison algorithm

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

List comparison algorithm

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions