I am working on a machine learning problem involving object detection. For now, I am trying to find GPS coordinates that are close to other GPS coordinates; when two points are close, I want to note it by index. In my example below, with test data, the two areas are not actually close to one another, so each one's 'close_points_index' should contain just its own index. My actual data set, however, has ~100k observations.
This code is slow with 100k observations. It produces correct output, but I am looking for help optimizing it and would appreciate someone pointing out any inefficiencies.
My data looks like:
[{'area_name': 'ElephantRock',
  'us_state': 'Colorado',
  'url': 'https://www.mountainproject.com/area/105746486/elephant-rock',
  'lnglat': [38.88463, -106.15182],
  'metadata': {'lnglat_from_parent': False}},
 {'area_name': 'RaspberryBoulders',
  'us_state': 'Colorado',
  'url': 'https://www.mountainproject.com/area/108289128/raspberry-boulders',
  'lnglat': [39.491, -106.0501],
  'metadata': {'lnglat_from_parent': False}}]
My code solution is below. I avoided using two for loops, though I realize that map() is probably just syntactic sugar for a for loop. Note that I assume latLongDistance is fairly well optimized, but if it isn't, I don't mind changing it. My focus is on my findClusters() function.
from math import cos, asin, sqrt, pi
from functools import partial

def latLongDistance(coord1, coord2):
    # Haversine distance between two [lat, lon] points, in kilometers
    lat2 = coord2[0]
    lat1 = coord1[0]
    lon1 = coord1[1]
    lon2 = coord2[1]
    p = pi/180
    a = 0.5 - cos((lat2-lat1)*p)/2 + cos(lat1*p) * cos(lat2*p) * (1-cos((lon2-lon1)*p))/2
    kmDistance = 12742 * asin(sqrt(a))  # 12742 km is the Earth's diameter
    return kmDistance
def findClusters(listOfPoints, thresholdValueM = 800):
    coords = [x['lnglat'] for x in listOfPoints]
    for index, data in enumerate(listOfPoints):
        lngLat = data['lnglat']
        modifiedLLDistance = partial(latLongDistance, coord2=lngLat)
        listOfDistances = list(map(modifiedLLDistance, coords))
        meterDistance = [x*1000 for x in listOfDistances]
        closePoints = [i for i in range(len(meterDistance)) if meterDistance[i] < thresholdValueM]
        listOfPoints[index]['close_points_index'] = closePoints
    return listOfPoints
After the function is run, see below. Note that on the actual data set these lists can contain multiple indices; if I run just these two points, their values should be [0] and [1] respectively.
[{'area_name': 'ElephantRock',
  'us_state': 'Colorado',
  'url': 'https://www.mountainproject.com/area/105746486/elephant-rock',
  'lnglat': [38.88463, -106.15182],
  'metadata': {'lnglat_from_parent': False},
  'close_points_index': [0]},
 {'area_name': 'RaspberryBoulders',
  'us_state': 'Colorado',
  'url': 'https://www.mountainproject.com/area/108289128/raspberry-boulders',
  'lnglat': [39.491, -106.0501],
  'metadata': {'lnglat_from_parent': False},
  'close_points_index': [1]}]
I've experimented with a few things but am coming up short. Primarily, I am a bit inexperienced with finding speed increases, as I am relatively new to Python. Any critical input would be helpful. I have not posted here before, so let me know if more information is needed to make this reproducible.
1 Answer
First, some general comments:
In the dataset, the key lnglat is confusing, because the data is clearly latitude and then longitude. That is just asking for a bug.
lng, lon, and long are all used as abbreviations for longitude in the code; pick one.
Take a look at PEP 8 to see what people expect to see when looking at Python code, e.g., latlongdistance or lat_long_distance rather than latLongDistance.
Use sequence unpacking:
lat1, lon1 = coord1
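Applied to the question's function, the four indexing assignments collapse to two lines (a sketch with the same behavior, renamed per the PEP 8 point above):

from math import cos, asin, sqrt, pi

def lat_long_distance(coord1, coord2):
    # same haversine computation as the question, with unpacking applied
    lat1, lon1 = coord1
    lat2, lon2 = coord2
    p = pi / 180
    a = (0.5 - cos((lat2 - lat1) * p) / 2
         + cos(lat1 * p) * cos(lat2 * p) * (1 - cos((lon2 - lon1) * p)) / 2)
    return 12742 * asin(sqrt(a))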
list(map(....)) is a bit of an anti-pattern. The whole point of using map is to generate values as needed rather than create a list of all the values. If you want a list, many people find a list comprehension clearer.
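For example, the partial/map pair inside findClusters could become one comprehension (equivalent behavior, using the question's names, which also removes the need for functools.partial):

listOfDistances = [latLongDistance(coord, lngLat) for coord in coords]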
enumerate() works in comprehensions too:

closePoints = [i for i, distance in enumerate(meterDistance)
               if distance < thresholdValueM]
Each of listOfDistances and meterDistance creates a long list of distances (100k of them), only to discard most of the distances when creating closePoints. Use a generator expression to avoid creating the lists.
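Concretely, each distance can then be tested as it is generated, so no 100k-element list is ever built (a sketch; threshold_km is the threshold already converted to kilometers, as the next point suggests):

distances = (latLongDistance(coord, lngLat) for coord in coords)  # lazy generator
closePoints = [i for i, d in enumerate(distances) if d < threshold_km]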
Instead of multiplying each distance by 1000, divide thresholdValueM by 1000 just once outside of the for-loop. That's 1 division instead of 10G multiplications (100k loops and 100k multiplies in the list comprehension).
The code calculates each distance twice. For example, in the first loop iteration it calculates the distance from the first coord to the second coord, then on the second loop iteration it calculates the distance from the second to the first.
So something like this would be somewhat more efficient (untested code).
def find_clusters(points, threshold=800):
    coords = [x['lnglat'] for x in points]
    # the distances are in kilometers, so convert the threshold from meters
    threshold /= 1000
    for index, data in enumerate(points):
        lat_lng = data['lnglat']
        # a generator expression: distances are computed one at a time, on demand
        distances = (lat_long_distance(lat_lng, coord) for coord in coords[index:])
        for i, d in enumerate(distances, index):
            if d < threshold:
                points[index].setdefault('close_points_index', []).append(i)
                if i != index:  # don't record the self-pair twice
                    points[i].setdefault('close_points_index', []).append(index)
    return points
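For instance, run on just the two sample areas from the question (with lat_long_distance from above in scope), each area should end up close only to itself:

areas = [
    {'area_name': 'ElephantRock', 'lnglat': [38.88463, -106.15182]},
    {'area_name': 'RaspberryBoulders', 'lnglat': [39.491, -106.0501]},
]
find_clusters(areas)
# The points are roughly 68 km apart, far over the 800 m threshold,
# so close_points_index should come out as [0] and [1].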
But the biggest efficiency issue is that findClusters() has O(n^2) complexity. The for index, data loop runs once for each point in listOfPoints. Inside the loop, each of these lines also loops over the entire list:

listOfDistances = list(map(modifiedLLDistance, coords))
meterDistance = [x*1000 for x in listOfDistances]

That's n * n. To get significant speedups, a different approach is needed. There are various data structures that can be built in O(n * log n) time and then queried to find nearby points. I mentioned KDTrees in a comment to the question, but there are others.
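To make that concrete, here is a rough sketch of the KDTree approach described in the comment below (untested; the function name, EARTH_RADIUS_M constant, and chord conversion are mine): convert each lat/long to x/y/z on a sphere, build a scipy.spatial.KDTree once, and ask it for every pair closer than the straight-line chord that corresponds to 800 m of great-circle distance.

import numpy as np
from scipy.spatial import KDTree

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; the question's 12742 km is the diameter

def find_clusters_kdtree(points, threshold_m=800):
    lat_lng = np.radians(np.array([p['lnglat'] for p in points]))
    lat, lng = lat_lng[:, 0], lat_lng[:, 1]
    # Project each (lat, lng) onto the sphere as an (x, y, z) point in meters.
    xyz = EARTH_RADIUS_M * np.column_stack((
        np.cos(lat) * np.cos(lng),
        np.cos(lat) * np.sin(lng),
        np.sin(lat),
    ))
    # A great-circle distance of threshold_m corresponds to this chord length.
    chord = 2 * EARTH_RADIUS_M * np.sin(threshold_m / (2 * EARTH_RADIUS_M))
    tree = KDTree(xyz)  # built once, in O(n * log n)
    for i, point in enumerate(points):
        point['close_points_index'] = [i]  # every point is close to itself
    for i, j in tree.query_pairs(chord):  # all pairs within the chord distance
        points[i]['close_points_index'].append(j)
        points[j]['close_points_index'].append(i)
    return points

Building the tree and querying all pairs replaces the n * n distance loop, so the run time is dominated by the O(n * log n) construction plus the number of close pairs actually found.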
Comment from jlombard314 (Sep 10, 2021): Thank you. I am writing Python code primarily by myself, so I don't have much of a critical eye on my code. I really appreciate it.
Comment: scipy.spatial.KDTree? Convert lat/long to x/y/z and build the KDTree. Use the query_pairs(d) method to find points that are < d distance apart.