I am working on a machine learning problem involving object detection. For now, I am trying to find GPS coordinates that are close to other GPS coordinates; when two points are close, I want to note it by index. In my example below, with test data, the two areas are not actually close to one another, so each one's 'close_points_index' should contain just its own index. My actual data set, however, has ~100k observations.
This code is slow with 100k observations. It produces correct output, but I am looking for help optimizing it and would appreciate someone pointing out any inefficiencies.
My data looks like:
[{'area_name': 'ElephantRock',
  'us_state': 'Colorado',
  'url': 'https://www.mountainproject.com/area/105746486/elephant-rock',
  'lnglat': [38.88463, -106.15182],
  'metadata': {'lnglat_from_parent': False}},
 {'area_name': 'RaspberryBoulders',
  'us_state': 'Colorado',
  'url': 'https://www.mountainproject.com/area/108289128/raspberry-boulders',
  'lnglat': [39.491, -106.0501],
  'metadata': {'lnglat_from_parent': False}}]
My code solution is below. I avoided using two for loops, though I realize that map() is probably just syntactic sugar for a for loop. Note that I assume latLongDistance is fairly well optimized, but if it isn't, I don't mind changing it. My focus is on my findClusters() function.
from math import cos, asin, sqrt, pi
from functools import partial

def latLongDistance(coord1, coord2):
    # Haversine distance between two [lat, lon] points, in kilometers
    lat2 = coord2[0]
    lat1 = coord1[0]
    lon1 = coord1[1]
    lon2 = coord2[1]
    p = pi/180
    a = 0.5 - cos((lat2-lat1)*p)/2 + cos(lat1*p) * cos(lat2*p) * (1-cos((lon2-lon1)*p))/2
    kmDistance = 12742 * asin(sqrt(a))  # 12742 km is the Earth's diameter
    return kmDistance
def findClusters(listOfPoints, thresholdValueM = 800):
    coords = [x['lnglat'] for x in listOfPoints]
    for index, data in enumerate(listOfPoints):
        lngLat = data['lnglat']
        modifiedLLDistance = partial(latLongDistance, coord2=lngLat)
        listOfDistances = list(map(modifiedLLDistance, coords))
        meterDistance = [x*1000 for x in listOfDistances]
        closePoints = [i for i in range(len(meterDistance)) if meterDistance[i] < thresholdValueM]
        listOfPoints[index]['close_points_index'] = closePoints
    return listOfPoints
After the function is run, see below. Note that on the actual data set these lists can contain multiple indices; if I run just these two points, their values should be [0] and [1] respectively.
[{'area_name': 'ElephantRock',
  'us_state': 'Colorado',
  'url': 'https://www.mountainproject.com/area/105746486/elephant-rock',
  'lnglat': [38.88463, -106.15182],
  'metadata': {'lnglat_from_parent': False},
  'close_points_index': [0]},
 {'area_name': 'RaspberryBoulders',
  'us_state': 'Colorado',
  'url': 'https://www.mountainproject.com/area/108289128/raspberry-boulders',
  'lnglat': [39.491, -106.0501],
  'metadata': {'lnglat_from_parent': False},
  'close_points_index': [1]}]
I've experimented with a few things but am coming up short. Primarily, I am a bit inexperienced with finding speed increases, as I am relatively new to Python. Any critical input would be helpful. I have not posted here before, so let me know if more information is needed to make this reproducible.
1 Answer
First, some general comments:
In the dataset, the key lnglat is confusing, because the data is clearly latitude and then longitude. That is just asking for a bug.
lng, lon, and long are all used as abbreviations for longitude in the code; pick one.
Take a look at PEP 8 to see what people expect to see when looking at Python code, e.g., latlongdistance or lat_long_distance rather than latLongDistance.
Use sequence unpacking:
lat1, lon1 = coord1
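Applied to the question's function, the four indexing assignments collapse to two lines (a sketch with the same behavior, renamed per the PEP 8 point above):

from math import cos, asin, sqrt, pi

def lat_long_distance(coord1, coord2):
    # same haversine computation as the question, with unpacking applied
    lat1, lon1 = coord1
    lat2, lon2 = coord2
    p = pi / 180
    a = (0.5 - cos((lat2 - lat1) * p) / 2
         + cos(lat1 * p) * cos(lat2 * p) * (1 - cos((lon2 - lon1) * p)) / 2)
    return 12742 * asin(sqrt(a))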
list(map(....)) is a bit of an anti-pattern. The whole point of using map is to generate values as needed rather than create a list of all the values. If you want a list, many people find a list comprehension clearer.
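For example, the partial/map pair inside findClusters could become one comprehension (equivalent behavior, using the question's names, which also removes the need for functools.partial):

listOfDistances = [latLongDistance(coord, lngLat) for coord in coords]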
enumerate() works in comprehensions too:

closePoints = [i for i, distance in enumerate(meterDistance)
               if distance < thresholdValueM]
Each of listOfDistances and meterDistance creates a long list of distances (100k of them), only to discard most of the distances when creating closePoints. Use a generator expression to avoid creating the lists.
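Concretely, each distance can then be tested as it is generated, so no 100k-element list is ever built (a sketch; threshold_km is the threshold already converted to kilometers, as the next point suggests):

distances = (latLongDistance(coord, lngLat) for coord in coords)  # lazy generator
closePoints = [i for i, d in enumerate(distances) if d < threshold_km]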
Instead of multiplying each distance by 1000, divide thresholdValueM by 1000 just once outside of the for-loop. That's 1 division instead of 10G multiplications (100k loops and 100k multiplies in the list comprehension).
The code calculates each distance twice. For example, in the first loop iteration it calculates the distance from the first coord to the second coord, then on the second loop iteration it calculates the distance from the second to the first.
So something like this would be somewhat more efficient (untested code).
def find_clusters(points, threshold=800):
    coords = [x['lnglat'] for x in points]
    # the distances are in kilometers, so convert the threshold from meters
    threshold /= 1000
    for index, data in enumerate(points):
        lat_lng = data['lnglat']
        # a generator expression: distances are computed one at a time, on demand
        distances = (lat_long_distance(lat_lng, coord) for coord in coords[index:])
        for i, d in enumerate(distances, index):
            if d < threshold:
                points[index].setdefault('close_points_index', []).append(i)
                if i != index:  # don't record the self-pair twice
                    points[i].setdefault('close_points_index', []).append(index)
    return points
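For instance, run on just the two sample areas from the question (with lat_long_distance from above in scope), each area should end up close only to itself:

areas = [
    {'area_name': 'ElephantRock', 'lnglat': [38.88463, -106.15182]},
    {'area_name': 'RaspberryBoulders', 'lnglat': [39.491, -106.0501]},
]
find_clusters(areas)
# The points are roughly 68 km apart, far over the 800 m threshold,
# so close_points_index should come out as [0] and [1].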
But the biggest efficiency issue is that findClusters() has O(n^2) complexity. The for index, data loop runs once for each point in listOfPoints. Inside the loop, each of these lines also loops over the entire list:

listOfDistances = list(map(modifiedLLDistance, coords))
meterDistance = [x*1000 for x in listOfDistances]

That's n * n. To get significant speedups, a different approach is needed. There are various data structures that can be built in O(n * log n) time and then queried to find nearby points. I mentioned KDTrees in a comment to the question, but there are others.
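To make that concrete, here is a rough sketch of the KDTree approach described in the comment below (untested; the function name, EARTH_RADIUS_M constant, and chord conversion are mine): convert each lat/long to x/y/z on a sphere, build a scipy.spatial.KDTree once, and ask it for every pair closer than the straight-line chord that corresponds to 800 m of great-circle distance.

import numpy as np
from scipy.spatial import KDTree

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; the question's 12742 km is the diameter

def find_clusters_kdtree(points, threshold_m=800):
    lat_lng = np.radians(np.array([p['lnglat'] for p in points]))
    lat, lng = lat_lng[:, 0], lat_lng[:, 1]
    # Project each (lat, lng) onto the sphere as an (x, y, z) point in meters.
    xyz = EARTH_RADIUS_M * np.column_stack((
        np.cos(lat) * np.cos(lng),
        np.cos(lat) * np.sin(lng),
        np.sin(lat),
    ))
    # A great-circle distance of threshold_m corresponds to this chord length.
    chord = 2 * EARTH_RADIUS_M * np.sin(threshold_m / (2 * EARTH_RADIUS_M))
    tree = KDTree(xyz)  # built once, in O(n * log n)
    for i, point in enumerate(points):
        point['close_points_index'] = [i]  # every point is close to itself
    for i, j in tree.query_pairs(chord):  # all pairs within the chord distance
        points[i]['close_points_index'].append(j)
        points[j]['close_points_index'].append(i)
    return points

Building the tree and querying all pairs replaces the n * n distance loop, so the run time is dominated by the O(n * log n) construction plus the number of close pairs actually found.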
Comment from jlombard314 (Sep 10, 2021): Thank you. I am writing Python code primarily by myself, so I don't have much of a critical eye on my code. I really appreciate it.
Comment: scipy.spatial.KDTree? Convert lat/long to x/y/z and build the KDTree. Use the query_pairs(d) method to find points that are < d distance apart.