Python implementation of the Ramer-Douglas-Peucker Algorithm

Question 1

I recently implemented the RDP polygon approximation algorithm in Python and I'm skeptical of whether or not I implemented it correctly of with the greatest efficiency. The algorithm runs in around 0.0003 seconds for a polygon with 30 points on my computer (an Intel Core i5 with 3.8 GHz of RAM), so I'm worried about how it may run on a slower computer. Also, there seems to be a cap as to the number of points that can be removed by the algorithm, or at least there's a cap in my implementation. No matter how high I set the tolerance, the approximation always caps at about \$\frac{2N}{3}\$ points where \$N\$ is the number of points in the input polygon. Could I be doing something wrong?

NegInf = float('-inf')
def distance(v1, v2):
 """
 Calculate the distance between two points.
 PARAMETERS
 ==========
 v1, v2 >> The first and second vertices respectively.
 """
 dx = v2[0] - v1[0]
 dy = v2[1] - v1[1]
 return math.sqrt(dx*dx + dy*dy)
def perpendicular_distance(point, line_start, line_end):
 """
 Calculate the perpendicular distance from a point to a line.
 PARAMETERS
 ==========
 point >> The point of which to calculate the distance from the line
 (must be an (x, y) tuple).
 line_start, line_end >> Start and end points defining the line respectively
 (must each be an (x, y) tuple).
 """
 x1, y1 = line_start
 x2, y2 = line_end
 vx, vy = point
 if x1 == x2:
 return abs(x1 - vx)
 m = (y2 - y1)/(x2 - x1)
 b = y1 - m*x1
 return abs(m * vx - vy + b)/math.sqrt(m*m + 1)
def _rdp_approx(points, tolerance, depth):
 """
 Internal Function: Recursively perform the RDP algorithm.
 """
 if not points:
 # In case the furthest point index discovered is equal to the length of the
 # list of points, leading to points[furthest:] sending in an empty list.
 return []
 elif len(points) <= 2:
 # BASE CASE:: No points to remove, only the start and the end points of the line.
 # Return it.
 return points
 elif len(points) == 3:
 # BASE CASE:: Our decomposition of the polygon has reached a minimum of 3 points.
 # Now all that is left is to remove the point in the middle (assuming it's distance
 # from the line is greater than the set tolerance).
 dist = perpendicular_distance(points[1],
 points[0],
 points[2]
 )
 if dist < tolerance:
 return [points[0], points[-1]]
 return points
 max_dist = NegInf
 furthest = None
 start = 0
 start_point = points[start]
 if depth == 1:
 # In the initial approximation, we are given an entire polygon to approximate. This
 # means that the start and end points are the same, thus we cannot use the perpendicular
 # distance equation to calculate the distance a point is from the start since the start is
 # not a line. We have to use ordinary distance formula instead.
 get_distance = lambda point: distance(point, start_point)
 else:
 end_point = points[-1]
 get_distance = lambda point: perpendicular_distance(point, start_point, end_point)
 # Find the farthest point from the norm.
 for i, point in enumerate(points[1:], 1):
 dist = get_distance(point)
 if dist > max_dist:
 max_dist = dist
 furthest = i
 # Recursively calculate the RDP approximation of the two polygonal chains formed by
 # slicing at the index of the furthest discovered point.
 prev_points = _rdp_approx(points[:furthest+1], tolerance, depth+1)
 next_points = _rdp_approx(points[furthest:], tolerance, depth+1)
 new_points = []
 for point in prev_points + next_points:
 # Filter out the duplicate points whilst maintaining order.
 # TODO:: There's probably some fancy slicing trick I just haven't figured out
 # that can be applied to prev_points and next_points so that we don't have to
 # do this, but it's not a huge bottleneck so we needn't worry about it now.
 if point not in new_points:
 new_points.append(point)
 return new_points 
def rdp_polygon_approximate(coordinates, tolerance):
 """
 Use the Ramer-Douglas-Peucker algorithm to approximate the points on a polygon.
 The RDP algorithm recursively cuts away parts of a polygon that stray from the
 average of the edges. It is a great algorithm for maintaining the overall form
 of the input polygon, however one should be careful when using this for larger
 polygons as the algorithm has an average complexity of T(n) = 2T(n/2) + O(n) and
 a worst case complexity of O(n^2).
 PARAMETERS
 ==========
 coordinates >> The coordinates of the polygon to approximate.
 tolerance >> The amount of tolerance the algorithm will use. The tolerance
 determines the minimum distance a point has to sway from the average
 before it gets deleted from the polygon. Thus, setting the tolerance to
 be higher should delete more points on the final polygon.
 That said, due to how the algorithm works there is a limit to the number
 of vertices that can be removed on a polygon. Setting the tolerance to
 float('inf') or sys.maxsize will not approximate the polygon to a single
 point. Usually the minimum points an approximated polygon can have if the
 original polygon had N points is between 2N/3 and N/3.
 FURTHER READING
 ===============
 For further reading on the Ramer-Douglas-Peucker algorithm, see
 http://en.wikipedia.org/wiki/Ramer%E2%80%93Douglas%E2%80%93Peucker_algorithm
 """
 return _rdp_approx(coordinates, tolerance, 1)
if __name__ == '__main__':
 poly = [(3, 0), (4, 2), (5, 2), (5.5, 3), (5, 4), (4, 5), (5, 6),
 (7, 5), (7, 3), (8, 2.5), (8, 4), (9, 5), (8, 7), (7, 8), (6, 7),
 (4, 7.75), (3.5, 7.5), (3, 8), (3, 8.5), (2.5, 9), (1, 9), (0, 8),
 (2, 7), (1, 7), (0, 6), (1, 4), (2, 5), (2, 2), (3, 3), (2, 1)]
 print(rdp_polygon_approximate(poly, 3))
 print(rdp_polygon_approximate(poly, float('inf')))

Question 2

You probably forgot to close your parenthesis at the end of your prints at the end of your program. Which version of Python are you using?

Question 3

My bad! Thanks for catching that, I'm using python 3 so that indeed would have failed.

Question 4

Before looking at optimisation, maybe look at this test case ` poly = [(0, 0), (0, 1), (0, 2), (0, 3), (0, 4)]` which does not return the correct [(0, 0), (0, 4)].

Question 5

If you want to calculate the farthest point you don't need to use the square root for 'real' distance calculation. Since you only need the distance for comparison and not the actual distance it is totally sufficient to compare by x*x + y*y instead of sqrt(x*x + y*y)

Question 6

In this case, the perpendicular_distance also probably needs a simpler way to produce a squared distance.

Question 7

That's a good suggestion, I'll start looking into a way to implement a squared perpendicular distance function. Thanks!

Question 8

This is quite an old question, yet I'd like to answer because it comes up if one searches for "Python Ramer-Douglas-Peucker" on the internet.

First, the implementation is incorrect!

The OP suspected the same when they said:

No matter how high I set the tolerance, the approximation always caps at about (2 N / 3) points where N is the number of points in the input polygon. Could I be doing something wrong?

Knowing how the algorithm works, it should be clear that the number of points in the input should not be the driving variable determining the number of output points. Instead, the tolerance should matter.

What is an easy test case? A straight line!

Let's use Hypothesis to generate some test cases. Our properties are easy: The resulting poly-line should have length 2, and the first and last point are the same as the input poly-line.

 import unittest
 from hypothesis import given, example, assume
 from hypothesis.strategies import lists, integers, tuples, floats
 from ramer_douglas_peucker.original import rdp_polygon_approximate, perpendicular_distance
 import numpy as np
 class CompareTestCase(unittest.TestCase):
 @given(min_x=floats(min_value=-1e5, max_value=1e5, allow_nan=False,
 allow_infinity=False),
 max_x = floats(min_value=-1e5, max_value=1e5, allow_nan=False,
 allow_infinity=False))
 def test_straight_line(self, min_x, max_x):
 assume(min_x < max_x)
 m = 2.0
 b = -3.0
 poly = [(x, m*x+b) for x in np.linspace(min_x, max_x, 100)]
 poly_out = rdp_polygon_approximate(poly, tolerance=0.01)
 self.assertEqual(len(poly_out), 2)
 self.assertEqual(poly_out, [poly[0], poly[-1]])

You should see that this fails: AssertionError: 77 != 2 You might get a slightly different length. Thus the length of the returned poly_out is not equal to two.

Next, let's have a look at your implementation.

# Find the farthest point from the norm.
for i, point in enumerate(points[1:], 1):
 dist = get_distance(point)
 if dist > max_dist:
 max_dist = dist
 furthest = i
# Main bug here:
# You were missing this common case: All points are already within
# 'tolerance'. Just return start and end.
if max_dist < tolerance:
 return [points[0], points[-1]]

This bug explains the observation that you can only simplify by (2 N / 3) points. The code only drops the middle point if the (recursive) input has length 3 (special case at the beginning) and the max distance is larger than tolerance.

Next, let's have a look at your 'TODO' comment, where you put the two recursion results together:

new_points = []
for point in prev_points + next_points:
 # Filter out the duplicate points whilst maintaining order.
 # TODO:: There's probably some fancy slicing trick I just haven't figured out
 # that can be applied to prev_points and next_points so that we don't have to
 # do this, but it's not a huge bottleneck so we needn't worry about it now.
 if point not in new_points:
 new_points.append(point)
return new_points

The fancy slicing trick you mentioned in your TODO is: You want to concatenate prev_points and next_points but the point at the pivot (furthest) will be present twice. So just return the first list except for the last point.

return prev_points[:-1] + next_points

This also avoids a bug, where you have the same point multiple times. Like [A, b, c, A, d, e, f]. The second A would've been removed.

More could be said about the fact that you are using recursion (if you care about performance), the special case handling at the beginning of the implementation (not needed) -- however, this question is too old for anyone to care.

Question 9

Though the question is old, I'm not yet! Thanks for doing this, I always appreciate it when old, unfulfilled questions get properly sorted out. It'll serve the next generation of python RDP implementers!

Question 10

You can simplify your distance function with the function math.hypot(x, y):

dx = v2[0] - v1[0]
dy = v2[1] - v1[1]
return math.hypot(dx, dy)

math.hypot(x, y) computes directly math.sqrt(x*x + y*y). Morevoer, the CPython implementation should be based on the underlying C function hypot; therefore, this function is safer than the naive implementation since it does its best to avoid overflows and underflows at an intermediate stage of the computation.

From a design point of view, one thing I would probably do to avoid passing coordinates everywhere is to create at least two classes, Point and Line:

class Point:

def __init__(self, x, y):
 self.x = x
 self.y = y

class Line:

def __init__(self, start, end):
 self.start = start
 self.end = end

Then, I would have done a "generic" distance function that can compute the distance between a point and virtually anything. With Python 3.4 singledispatch decorator, it would be something like this (I did not test the code, it is merely here to give you an idea):

from singledispatch import singledispatch
@singledispatch
def distance(shape, point):
 pass
@distance.register(Point)
def _(arg, point):
 """distance from point ot point"""
 # implementation
@distance.register(Line)
def _(arg, point):
 """distance from point to line)"""
 # implementation

Unfortunately, this is only a design idea, not an optimization one, but it would at least help to write code easier to write, with an API based on actual types, and not on mere coordinates.

Question 11

I did consider using math.hypot, however running some tests with timeit gave some interesting results: distance with math.sqrt: 7.09003095935727e-07s, distance with math.hypot: 8.529042939559872e-07s. I'm pretty sure it has to do something with the fact that math.hypot is not just a square root function in C, it uses a different implementation (as seen at en.wikipedia.org/wiki/Hypot).

Question 12

@user3002473 That's actually rather interesting. It shows again that there may be an actual trade-off between safe vs fast :)

anhoppe anhoppe 2062 silver badges4 bronze badges · Accepted Answer · 2014-05-15 10:59:39Z

9

\$\begingroup\$

If you want to calculate the farthest point you don't need to use the square root for 'real' distance calculation. Since you only need the distance for comparison and not the actual distance it is totally sufficient to compare by x*x + y*y instead of sqrt(x*x + y*y)

Share

edited May 15, 2014 at 11:18

Morwenn's user avatar

Morwenn

20.2k3 gold badges69 silver badges132 bronze badges

answered May 15, 2014 at 10:59

anhoppe's user avatar

anhoppe anhoppe

2062 silver badges4 bronze badges

\$\endgroup\$

2

\$\begingroup\$ In this case, the perpendicular_distance also probably needs a simpler way to produce a squared distance. \$\endgroup\$

Morwenn
– Morwenn

2014年05月15日 11:20:10 +00:00
Commented May 15, 2014 at 11:20
\$\begingroup\$ That's a good suggestion, I'll start looking into a way to implement a squared perpendicular distance function. Thanks! \$\endgroup\$

user3002473
– user3002473

2014年05月15日 22:03:25 +00:00
Commented May 15, 2014 at 22:03

Add a comment |

Stack Exchange Network

Python implementation of the Ramer-Douglas-Peucker Algorithm

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Python implementation of the Ramer-Douglas-Peucker Algorithm

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions