Grouping sorted coordinates based on proximity to each other

Question 1

I created an algorithm that groups a sorted list of coordinates into buckets based on their proximity to one another (a distance threshold of 30).

Steps:

1. Create a new key with a list value and pop the first point in the list into it.
2. Scan the list of points for points that are close to it. Push matches to the new list and replace their values with None.
3. After the scan, filter the list, removing None values.
4. Go back to step 1 until points is empty.

I can't modify the list by deleting elements because I use indexes.

After I'm done grouping, I take the average of each group of points.

import math
import numpy as np

def group_points(points):
    groups = {}
    groupnum = 0
    while len(points) > 1:
        groupnum += 1
        key = str(groupnum)
        groups[key] = []
        ref = points.pop(0)
        for i, point in enumerate(points):
            d = get_distance(ref, point)
            if d < 30:
                groups[key].append(points[i])
                points[i] = None
        points = list(filter(lambda x: x is not None, points))
    # perform average operation on each group
    return list([[int(np.mean(list([x[0] for x in groups[arr]]))), int(np.mean(list([x[1] for x in groups[arr]])))] for arr in groups])

def get_distance(ref, point):
    # print('ref: {} , point: {}'.format(ref, point))
    x1, y1 = ref[0], ref[1]
    x2, y2 = point[0], point[1]
    return math.hypot(x2 - x1, y2 - y1)

If possible, I would like to reduce the number of variables and the total number of loops over the points array. Do I need to use indexes? Is it possible to achieve this in one pass over points?

Question 2

How many points do you expect to process? Do the points have any special properties or distribution?

Question 3

"a sorted list" - sorted according to what criterion?

Question 4

They are sorted by the x value, then the y value. The number of points varies, but it's almost guaranteed they will all fit in some sort of group with at least one other point.

Question 5

General comments:

You are basically using the groups dict like a list. Might as well just use a list.

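An empty data structure (list, dict, set, tuple) is False in a boolean context, so a test like while len(points) > 0: can be simplified to while points: (note that the original while len(points) > 1: stops with one point still in the list).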

It is generally slower to pop from the front of a list than the back of a list, because after removing the first item all the rest of the items get moved up one spot.

points.pop() actually changes the list passed in. Make sure that's what you want.

filter(None, points) filters out all "False" items.

[ ... ] creates a list. So, list( [ ... ] ) is redundant.

You can just use x1, y1 = ref.
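A quick check of the points above, using only the standard library. This is a sketch: the sample values and the take_last helper are made up for illustration.

```python
from collections import deque

# An empty container is falsy, so `while points:` stops once the list is empty.
assert not [] and not {} and not ()
assert bool([(0, 0)])

# filter(None, ...) drops every falsy item, not only None.
# (0, 0) is a non-empty tuple and therefore truthy, so it survives.
items = [(0, 0), None, (5, 5), None]
kept = list(filter(None, items))
print(kept)  # [(0, 0), (5, 5)]

# pop() mutates the list the caller passed in.
def take_last(points):  # hypothetical helper
    return points.pop()

original = [(1, 2), (3, 4)]
take_last(original)
print(original)  # [(1, 2)]

# If popping from the front is really needed, deque.popleft() is O(1),
# whereas list.pop(0) shifts every remaining element.
queue = deque([(1, 2), (3, 4)])
first = queue.popleft()
print(first)  # (1, 2)

# Tuple unpacking replaces the index lookups.
x1, y1 = first
print(x1, y1)  # 1 2
```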

Put that all together and you get something like:

import math
import numpy as np

def group_points(points):
    groups = []
    while points:
        far_points = []
        ref = points.pop()
        groups.append([ref])
        for point in points:
            d = get_distance(ref, point)
            if d < 30:
                groups[-1].append(point)
            else:
                far_points.append(point)
        points = far_points
    # perform average operation on each group
    # axis=0 averages the x and y columns separately, giving [x_avg, y_avg]
    return [list(np.mean(x, axis=0).astype(int)) for x in groups]

def get_distance(ref, point):
    # print('ref: {} , point: {}'.format(ref, point))
    x1, y1 = ref
    x2, y2 = point
    return math.hypot(x2 - x1, y2 - y1)

You also might want to look at functions in scipy.cluster.

Question 6

return [list(np.mean(x, axis=1).astype(int)) for x in groups] Will this work if each element is an array? [x,y]

Question 7

Also, are lists faster than dicts?

Question 8

@Josh Sharkey, yes, np.mean() works on arrays. The axis parameter tells it how to do the calculation. Without axis it takes the mean of the whole array. And yes, lists can be faster than dicts and likely use less memory. Like many things in programming, there is a tradeoff. In most cases, code that is easier to understand is worth the slight performance tradeoff.

Question 9

so np.mean(axis=1) will average the x values and the y values separately and return a single list [x_avg, y_avg]?

Question 10

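@Josh np.mean(array) returns the mean of the whole array. With axis=0 it averages down the rows and returns the mean of each column; with axis=1 it averages across the columns and returns the mean of each row. For a group of [x, y] points, axis=0 is the one that gives [x_avg, y_avg].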
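To make the axis question concrete without relying on NumPy, here is a sketch of what the two axes compute for a group of [x, y] points. The sample group is hypothetical. np.mean(group, axis=0) corresponds to column_means below, and np.mean(group, axis=1) to row_means.

```python
# Hypothetical group of three [x, y] points (3 rows, 2 columns).
group = [[10, 20], [12, 24], [14, 22]]

# axis=0: collapse the rows, one mean per column -> [x_avg, y_avg]
column_means = [sum(col) / len(col) for col in zip(*group)]

# axis=1: collapse the columns, one mean per row (one value per point)
row_means = [sum(row) / len(row) for row in group]

print(column_means)  # [12.0, 22.0]
print(row_means)     # [15.0, 18.0, 18.0]
```

For averaging a group of points, the per-column result is the one that is wanted.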

Question 11

while len(points) > 1:

Shouldn't it be:

while len(points) > 0:

or else the last point "hangs" unhandled in the points list when finished.
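A minimal sketch of the difference, using a stripped-down loop that only pops reference points (no distance check). The consume helper and the sample data are made up for illustration.

```python
def consume(points, keep_going):
    """Pop reference points while keep_going(points) is true; return (refs, leftover)."""
    points = list(points)  # copy so the caller's list is untouched
    refs = []
    while keep_going(points):
        refs.append(points.pop(0))
    return refs, points

sample = [(0, 0), (100, 100), (200, 200)]

refs_gt1, left_gt1 = consume(sample, lambda p: len(p) > 1)
refs_gt0, left_gt0 = consume(sample, lambda p: len(p) > 0)

print(left_gt1)  # [(200, 200)]  <- the last point is never handled
print(left_gt0)  # []
```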

 ...
 groups[key] = []
 ref = points.pop(0)
 ...

Didn't you forget to insert the ref point itself into the new list?

 ...
 ref = points.pop(0)
 groups[key] = [ ref ] 
 ...

if d < 30:

I would have this distance (30) as a parameter of the function:

def group_points(points, distance):

in order to make it more useful.

for i, point in enumerate(points):
    d = get_distance(ref, point)
    if d < distance:
        groups[key].append(points[i])
        points[i] = None
points = list(filter(lambda x: x is not None, points))

can be simplified to:

for point in points:
    if get_distance(ref, point) < distance:
        groups[key].append(point)
points = list(filter(lambda x: x not in groups[key], points))

But as eric.m points out in his comment, the original may be more efficient than my suggestion.

return list([[int(np.mean(list([x[0] for x in groups[arr]]))), int(np.mean(list([x[1] for x in groups[arr]])))] for arr in groups])

A rather scary statement. Split it up in meaningful parts:

def points_mean(points):
    return list(np.mean(points, axis=0).astype(int))

and then

return list(map(points_mean, groups))

BTW: Why are you working with integers rather than floating-point numbers?

Your method changes the input data set (points.pop()), which you as a client normally don't expect. To avoid that, you can do something like:

import math
import numpy as np

def group_points(points, distance):
    if len(points) == 0 or distance < 0:
        return []
    groups = [[points[0]]]
    for point in points[1:]:
        handled = False
        for group in groups:
            if get_distance(group[0], point) < distance:
                group.append(point)
                handled = True
                break
        if not handled:
            groups.append([point])
    # perform average operation on each group
    # list(...) because map returns a lazy iterator in Python 3
    return list(map(points_mean, groups))

def points_mean(points):
    return list(np.mean(points, axis=0).astype(int))

def get_distance(ref, point):
    x1, y1 = ref
    x2, y2 = point
    return math.hypot(x2 - x1, y2 - y1)

Disclaimer: I'm not that familiar with Python, so something in the above can maybe be done simpler and more succinct, so regard it as an attempt to think along your lines of thoughts and not as a state of the art.
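For comparison, here is a sketch of the same non-mutating approach using only the standard library and keeping the means as floats. It assumes Python 3.8+ for math.dist. The name group_points_float and the sample data are illustrative and not part of the answer above. A for/else takes the place of the handled flag.

```python
import math

def group_points_float(points, distance):
    """Put each point in the first group whose reference point is closer than distance."""
    groups = []
    for point in points:
        for group in groups:
            if math.dist(group[0], point) < distance:
                group.append(point)
                break
        else:
            # no existing group was close enough: start a new one
            groups.append([point])
    # average each coordinate separately, keeping floats
    return [[sum(c) / len(c) for c in zip(*group)] for group in groups]

sample = [(0, 0), (10, 0), (100, 100), (110, 100)]
result = group_points_float(sample, 30)
print(result)  # [[5.0, 0.0], [105.0, 100.0]]
print(sample)  # unchanged: [(0, 0), (10, 0), (100, 100), (110, 100)]
```

As in the version above, each point is only compared against the first point of every group, so the result depends on the order of the input.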

Question 12

Wasn't the points[i] = None approach more efficient? Checking whether x is not None is faster than checking whether x is in a list.

Question 13

@eric.m: you may have a point there. You could maybe use points.remove(point) instead?

Question 14

points.remove modifies the list in-place, making it shorter. That screws up the i indexing.
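A small illustration of that pitfall with made-up numbers: removing from a list while iterating over it shifts the later elements left, so the item right after each removal is skipped. Iterating over a copy is one way to avoid it.

```python
# Removing while iterating over the same list: items get skipped.
skipped = [1, 2, 3, 4, 5, 6]
for v in skipped:
    if v < 5:
        skipped.remove(v)
print(skipped)  # [2, 4, 5, 6]  <- 2 and 4 were never visited

# Iterating over a copy keeps the iteration stable.
safe = [1, 2, 3, 4, 5, 6]
for v in list(safe):
    if v < 5:
        safe.remove(v)
print(safe)  # [5, 6]
```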

Question 15

@dfhwze: needed a little change, and it seems that all C#'ers are on vacation :-)

Question 16

I get that, I recently started answering Ruby questions myself :s

RootTwo · Accepted Answer · 2019-07-23 02:49:32Z

