I created an algotrithm that groups a sorted list
of coordinates into buckets based on their proximity (30) to one another.
Steps:
- Create a new key with a list value and pop the first point in the list into it
- Scan the list of points for points that are close to it. Push matches to the new list and replace values with
None
- After scan, filter the list, removing
None
values - Back to one, until
points
is empty.
I can't modify the list by deleting elements because I use indexes.
After I'm done grouping, I take the average of each group of points.
def group_points(points):
groups = {}
groupnum = 0
while len(points) > 1:
groupnum += 1
key = str(groupnum)
groups[key] = []
ref = points.pop(0)
for i, point in enumerate(points):
d = get_distance(ref, point)
if d < 30:
groups[key].append(points[i])
points[i] = None
points = list(filter(lambda x: x is not None, points))
# perform average operation on each group
return list([[int(np.mean(list([x[0] for x in groups[arr]]))), int(np.mean(list([x[1] for x in groups[arr]])))] for arr in groups])
def get_distance(ref, point):
# print('ref: {} , point: {}'.format(ref, point))
x1, y1 = ref[0], ref[1]
x2, y2 = point[0], point[1]
return math.hypot(x2 - x1, y2 - y1)
If possible, I would like to reduce the amount of variables and total loops over the points array. Do I need to use indexes? Is it possible to achieve this in one pass over points
?
2 Answers 2
General comments:
You are basically using the groups
dict like a list. Might as well just use a list.
An empty data structure (list, dict, set, tuple) is False in a boolean context, so while len(points) > 1:
can be simplified to while points:
It is generally slower to pop from the front of a list than the back of a list, because after removing the first item all the rest of the items get moved up one spot.
points.pop()
actually changes the list passed in. Make sure that's what you want.
filter(None, points)
filters out all "False" items.
[ ... ]
creates a list. So, list( [ ... ] )
is redundant.
You can just use x1, y1 = ref
.
Put that all together and you get something like:
def group_points(points):
groups = []
while points:
far_points = []
ref = points.pop()
groups.append([ref])
for point in points:
d = get_distance(ref, point)
if d < 30:
groups[-1].append(point)
else:
far_points.append(point)
points = far_points
# perform average operation on each group
return [list(np.mean(x, axis=1).astype(int)) for x in groups]
def get_distance(ref, point):
# print('ref: {} , point: {}'.format(ref, point))
x1, y1 = ref
x2, y2 = point
return math.hypot(x2 - x1, y2 - y1)
You also might want to look at functions in scipy.cluster
.
-
\$\begingroup\$
return [list(np.mean(x, axis=1).astype(int)) for x in groups]
Will this work if each element is an array?[x,y]
\$\endgroup\$Josh Sharkey– Josh Sharkey2019年07月24日 13:47:38 +00:00Commented Jul 24, 2019 at 13:47 -
\$\begingroup\$ Also, are lists faster than dicts? \$\endgroup\$Josh Sharkey– Josh Sharkey2019年07月24日 13:55:55 +00:00Commented Jul 24, 2019 at 13:55
-
1\$\begingroup\$ @Josh Sharkey, yes
np.mean()
works on arrays. The axis parameter tells it how to do the calculation. Withoutaxis
is takes the mean of the whole array. And yes, lists can be faster than dicts and likely use less memory. Like many things in programming there is a tradeoff. In most cases code that is easier to understand is worth the slight performance tradeoff. \$\endgroup\$RootTwo– RootTwo2019年07月24日 17:40:13 +00:00Commented Jul 24, 2019 at 17:40 -
\$\begingroup\$ so
np.mean(axis=1)
will average the x values and the y values separately and return a single list[x_avg, y_avg]
? \$\endgroup\$Josh Sharkey– Josh Sharkey2019年07月24日 19:45:54 +00:00Commented Jul 24, 2019 at 19:45 -
1\$\begingroup\$ @Josh yes. np.mean(array) returns the mean of the whole array. with axis=0 it returns the mean for each row. with axis=1 it returns the mean of each column. \$\endgroup\$RootTwo– RootTwo2019年07月24日 22:02:22 +00:00Commented Jul 24, 2019 at 22:02
while len(points) > 1:
Shouldn't it be:
while len(points) > 0:
or else the last point "hangs" unhandled in the points list when finished.
... groups[key] = [] ref = points.pop(0) ...
Don't you forget to insert the ref
point itself into the new list?:
...
ref = points.pop(0)
groups[key] = [ ref ]
...
if d < 30:
I would have this distance (30
) as a parameter of the function:
def group_points(points, distance):
in order to make it more useful.
for i, point in enumerate(points): d = get_distance(ref, point) if d < distance: groups[key].append(points[i]) points[i] = None points = list(filter(lambda x: x is not None, points))
can be simplified to:
for point in points:
if get_distance(ref, point) < distance:
groups[key].append(point)
points = list(filter(lambda x: x not in groups[key], points))
But as eric.m notices in his comment, the original may be more efficient than my suggestion.
return list([[int(np.mean(list([x[0] for x in groups[arr]]))), int(np.mean(list([x[1] for x in groups[arr]])))] for arr in groups])
A rather scary statement. Split it up in meaningful parts:
def points_mean(points):
return list(np.mean(points, axis = 0).astype(int))
and then
return map(points_mean, groups)
BtW: Why are you operating in integers and not in floating points?
Your method changes the input data set (points.pop()
), which you as a client normally don't expect. To avoid that, you can do something like:
def group_points(points, distance):
if len(points) == 0 or distance < 0: return []
groups = [[points[0]]]
for point in points[1:]:
handled = False
for group in groups:
if get_distance(group[0], point) < distance:
group.append(point)
handled = True
break
if not handled:
groups.append([point])
# perform average operation on each group
return map(points_mean, groups)
def points_mean(points):
return list(np.mean(points, axis = 0).astype(int))
def get_distance(ref, point):
x1, y1 = ref
x2, y2 = point
return math.hypot(x2 - x1, y2 - y1)
Disclaimer: I'm not that familiar with Python, so something in the above can maybe be done simpler and more succinct, so regard it as an attempt to think along your lines of thoughts and not as a state of the art.
-
1\$\begingroup\$ Wasn't the
points[i] = None
more efficient? It's faster checking wheterx is not None
than checking if x is inside a list \$\endgroup\$eric.m– eric.m2019年07月23日 10:01:32 +00:00Commented Jul 23, 2019 at 10:01 -
\$\begingroup\$ @eric.m: you can have a point there. You could maybe use
points.remove(point)
instead? \$\endgroup\$user73941– user739412019年07月23日 10:21:38 +00:00Commented Jul 23, 2019 at 10:21 -
\$\begingroup\$ points.remove modifies the list in-place, making it shorter. That screws up the
i
indexing. \$\endgroup\$Josh Sharkey– Josh Sharkey2019年07月23日 13:13:51 +00:00Commented Jul 23, 2019 at 13:13 -
1\$\begingroup\$ @dfhwze: needed a little change, and it seems that all C#'ers are on vacation :-) \$\endgroup\$user73941– user739412019年07月23日 14:51:17 +00:00Commented Jul 23, 2019 at 14:51
-
1\$\begingroup\$ I get that, I recently started answering Ruby questions myself :s \$\endgroup\$dfhwze– dfhwze2019年07月23日 14:52:26 +00:00Commented Jul 23, 2019 at 14:52
list
" - sorted according to what criterium? \$\endgroup\$