I'm working towards plotting a large GIS dataset; the sample shown above is roughly 1/6 of the data. I'm happy with how quickly the data loads, and bokeh renders the HTML nearly instantaneously. However, I've hit a pretty hot loop in my code that is not scaling well as I increase 1) the number of rows and 2) the resolution of the polygons. I'm getting killed in the # count points loop and am wondering whether there is a better way of doing this.
I found the suggestion for this loop on a GIS readthedocs.io tutorial and was happy with its performance for a few thousand points a couple of months ago. But now the project needs to process a GeoDataFrame with >730,000 rows. Is there a better method I'm supposed to be using to count the number of points in each polygon? I'm doing the computation on a modern desktop, but the project has access to Azure resources, so maybe that's how most people do this sort of computation professionally? I'd prefer to do the computation locally, but that might mean my desktop sits at max CPU overnight or longer, which isn't a thrilling prospect. I'm using Python 3.8.2 & Conda 4.3.2.
from shapely.geometry import Polygon
import pysal.viz.mapclassify as mc
import geopandas as gpd


def count_points(main_df, geo_grid, levels=5):
    """
    Outputs a gdf of polygons with a column of classifiers to be used for color mapping.
    """
    pts = gpd.GeoDataFrame(main_df["geometry"]).copy()

    # Count points in each polygon
    pts_in_polys = []
    for i, poly in geo_grid.iterrows():
        pts_in_this_poly = []
        for j, pt in pts.iterrows():
            if poly.geometry.contains(pt.geometry):
                pts_in_this_poly.append(pt.geometry)
                pts = pts.drop([j])
        nums = len(pts_in_this_poly)
        pts_in_polys.append(nums)
    geo_grid["number of points"] = gpd.GeoSeries(pts_in_polys)  # number of points in each polygon

    # Add quantiles column
    classifier = mc.Quantiles.make(k=levels)
    geo_grid["class"] = geo_grid[["number of points"]].apply(classifier)

    # Add polygon grid coordinates for bokeh plotting
    # (getPolyCoords is defined elsewhere in the project)
    geo_grid["x"] = geo_grid.apply(getPolyCoords, geom="geometry", coord_type="x", axis=1)
    geo_grid["y"] = geo_grid.apply(getPolyCoords, geom="geometry", coord_type="y", axis=1)

    polygons = geo_grid.drop("geometry", axis=1).copy()
    return polygons
- Can you share a small sample of (dummy) data so we can test what happens and see whether alternative approaches result in the same outcome? – Maarten Fabré, Apr 7, 2020 at 7:02
1 Answer
The nested loop compares each poly to each point, so the complexity is O(Npoly * Npts).
If the grid is regular, that is, each square in a row has the same top and bottom coordinates and each square in a column has the same left and right coordinates, then binary searches (bisect in the standard library) could be used to determine which grid row and column a point falls in. That would be O(Npts * log Npoly). I think this could be combined with .groupby() to eliminate any explicit loops. I don't have time to post code now, maybe tomorrow.
(It's tomorrow)
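For what it's worth, a minimal sketch of that bisect-plus-groupby idea could look like the following. It assumes a regular grid whose edge coordinates are already known as ascending arrays (lat_edges / long_edges) and a DataFrame pts with lat/long columns; those names, and the helper count_points_regular_grid, are placeholders rather than anything from the question, and np.searchsorted is used as the vectorized equivalent of bisect.bisect_right.
import numpy as np
import pandas as pd

def count_points_regular_grid(pts, lat_edges, long_edges):
    # For each point, find the index of the grid row/column it falls in
    # (vectorized bisect: edges[i-1] <= value < edges[i] gives bin i-1).
    row = np.searchsorted(lat_edges, pts["lat"].to_numpy(), side="right") - 1
    col = np.searchsorted(long_edges, pts["long"].to_numpy(), side="right") - 1

    # Discard points that fall outside the grid.
    inside = (row >= 0) & (row < len(lat_edges) - 1) & \
             (col >= 0) & (col < len(long_edges) - 1)

    # groupby on the (row, col) pairs counts points per grid cell with no
    # explicit Python loop.
    cells = pd.DataFrame({"row": row[inside], "col": col[inside]})
    return cells.groupby(["row", "col"]).size()
That gives the Npts * log Npoly searches plus a single grouped count, which is the shape of the improvement described above.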
Better yet, use numpy.histogram2d. On my laptop, it takes about 220ms to bin 2M points:
import numpy as np

min_lat, max_lat = 49.22, 49.30
num_lat_bins = 10
min_long, max_long = -123.23, -122.92
num_long_bins = 20

num_points = 2_000_000

# random coordinates for testing
lats = np.random.uniform(min_lat, max_lat, num_points)
longs = np.random.uniform(min_long, max_long, num_points)

h, lat_edges, long_edges = np.histogram2d(
    lats, longs,
    bins=(num_lat_bins, num_long_bins),
    range=[(min_lat, max_lat), (min_long, max_long)],
)
where h is the histogram (the count in each grid square), and lat_edges and long_edges are the grid lines. They look like:
# h (10 rows by 20 columns)
[[ 9982. 10126. 10124. ... 10218. 10140. 9844.]
[ 9971. 10096. 10035. ... 10057. 9844. 9923.]
[ 9940. 10036. 9982. ... 9966. 10034. 9872.]
...
[10025. 10011. 10027. ... 9870. 9942. 9985.]]
# lat_edges
[49.22 , 49.228, 49.236, 49.244, 49.252, 49.26 , 49.268, 49.276,
49.284, 49.292, 49.3 ]
# long_edges
[-123.23 , -123.2145, -123.199 , -123.1835, -123.168 , -123.1525,
-123.137 , -123.1215, -123.106 , -123.0905, -123.075 , -123.0595,
-123.044 , -123.0285, -123.013 , -122.9975, -122.982 , -122.9665,
-122.951 , -122.9355, -122.92 ]
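If you want to slot these counts back into the count_points flow from the question, the flattened histogram can stand in for the nested loop, provided the polygons in geo_grid are ordered row-major to match the histogram cells (latitude bins outer, longitude bins inner). That ordering is an assumption about how the grid was built, not something stated in the question, so treat this as a sketch:
import pysal.viz.mapclassify as mc

# Assumes geo_grid has one row per grid cell, ordered to match h.ravel().
geo_grid["number of points"] = h.ravel()
classifier = mc.Quantiles.make(k=5)  # same classifier as in count_points
geo_grid["class"] = geo_grid[["number of points"]].apply(classifier)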
- A well-deserved checkmark for you. It brought the algorithm down from 1200 min for 200,000 rows to ~15 secs for the full set of 750,000 rows. Nicely done. – BBirdsell, Apr 15, 2020 at 18:46