I'm working towards plotting a large GIS dataset; the sample shown above is roughly 1/6 of the data. I'm happy with how quickly the data loads, and bokeh renders the HTML nearly instantaneously. However, I've hit a pretty hot loop in my code that is not scaling well as I increase 1) the number of rows and 2) the resolution of the polygons. I'm getting killed in the # count points loop and am wondering whether there is a better way of doing this.
I found the suggestion for this loop on a GIS readthedocs.io tutorial and was happy with its performance for a few thousand points a couple of months ago. But now the project needs to process a GeoDataFrame with >730,000 rows. Is there a better method I'm supposed to be using to count the number of points in each polygon? I'm doing the computation on a modern desktop, but the project has access to Azure resources, so maybe that's how most people do this sort of computation professionally? I'd prefer to do the computation locally, but that might mean my desktop sits at max CPU overnight or longer, which isn't a thrilling prospect. I'm using Python 3.8.2 & Conda 4.3.2.
from shapely.geometry import Polygon
import pysal.viz.mapclassify as mc
import geopandas as gpd


def count_points(main_df, geo_grid, levels=5):
    """
    Outputs a gdf of polygons with a column of classifiers to be used for color mapping.
    """
    pts = gpd.GeoDataFrame(main_df["geometry"]).copy()

    # Count points in each polygon
    pts_in_polys = []
    for i, poly in geo_grid.iterrows():
        pts_in_this_poly = []
        for j, pt in pts.iterrows():
            if poly.geometry.contains(pt.geometry):
                pts_in_this_poly.append(pt.geometry)
                pts = pts.drop([j])
        nums = len(pts_in_this_poly)
        pts_in_polys.append(nums)
    geo_grid["number of points"] = gpd.GeoSeries(pts_in_polys)  # number of points in each polygon

    # Add quantiles column
    classifier = mc.Quantiles.make(k=levels)
    geo_grid["class"] = geo_grid[["number of points"]].apply(classifier)

    # Add polygon grid coordinates for bokeh plotting
    # (getPolyCoords is defined elsewhere in the project)
    geo_grid["x"] = geo_grid.apply(getPolyCoords, geom="geometry", coord_type="x", axis=1)
    geo_grid["y"] = geo_grid.apply(getPolyCoords, geom="geometry", coord_type="y", axis=1)

    polygons = geo_grid.drop("geometry", axis=1).copy()
    return polygons
- Can you share a small sample of (dummy) data so we can test what happens and see whether alternative approaches result in the same outcome? – Maarten Fabré, Apr 7, 2020 at 7:02
1 Answer
The nested loop compares each poly to each point, so the complexity is O(Npoly * Npts).
If the grid is regular, that is, each square in a row has the same top and bottom coordinates and each square in a column has the same left and right coordinates, then binary searches (bisect in the standard library) could be used to determine which grid row and column a point falls in. That would be O(Npts * log Npoly). I think this could be combined with .groupby() to eliminate any explicit loops. I don't have time to post code now, maybe tomorrow.
(It's tomorrow)
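For what it's worth, a minimal sketch of that bisect-plus-groupby idea could look like the following. It assumes a regular grid whose edge coordinates are already known as ascending arrays (lat_edges / long_edges) and a DataFrame pts with lat/long columns; those names, and the helper count_points_regular_grid, are placeholders rather than anything from the question, and np.searchsorted is used as the vectorized equivalent of bisect.bisect_right.
import numpy as np
import pandas as pd

def count_points_regular_grid(pts, lat_edges, long_edges):
    # For each point, find the index of the grid row/column it falls in
    # (vectorized bisect: edges[i-1] <= value < edges[i] gives bin i-1).
    row = np.searchsorted(lat_edges, pts["lat"].to_numpy(), side="right") - 1
    col = np.searchsorted(long_edges, pts["long"].to_numpy(), side="right") - 1

    # Discard points that fall outside the grid.
    inside = (row >= 0) & (row < len(lat_edges) - 1) & \
             (col >= 0) & (col < len(long_edges) - 1)

    # groupby on the (row, col) pairs counts points per grid cell with no
    # explicit Python loop.
    cells = pd.DataFrame({"row": row[inside], "col": col[inside]})
    return cells.groupby(["row", "col"]).size()
That gives the Npts * log Npoly searches plus a single grouped count, which is the shape of the improvement described above.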
Better yet, use numpy.histogram2d. On my laptop, it takes about 220ms to bin 2M points:
import numpy as np

min_lat, max_lat = 49.22, 49.30
num_lat_bins = 10
min_long, max_long = -123.23, -122.92
num_long_bins = 20

num_points = 2_000_000

# random coordinates for testing
lats = np.random.uniform(min_lat, max_lat, num_points)
longs = np.random.uniform(min_long, max_long, num_points)

h, lat_edges, long_edges = np.histogram2d(
    lats, longs,
    bins=(num_lat_bins, num_long_bins),
    range=[(min_lat, max_lat), (min_long, max_long)],
)
where h is the histogram (the count in each grid square), and lat_edges and long_edges are the grid lines. They look like:
# h (10 rows by 20 columns)
[[ 9982. 10126. 10124. ... 10218. 10140. 9844.]
[ 9971. 10096. 10035. ... 10057. 9844. 9923.]
[ 9940. 10036. 9982. ... 9966. 10034. 9872.]
...
[10025. 10011. 10027. ... 9870. 9942. 9985.]]
# lat_edges
[49.22 , 49.228, 49.236, 49.244, 49.252, 49.26 , 49.268, 49.276,
49.284, 49.292, 49.3 ]
# long_edges
[-123.23 , -123.2145, -123.199 , -123.1835, -123.168 , -123.1525,
-123.137 , -123.1215, -123.106 , -123.0905, -123.075 , -123.0595,
-123.044 , -123.0285, -123.013 , -122.9975, -122.982 , -122.9665,
-122.951 , -122.9355, -122.92 ]
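If you want to slot these counts back into the count_points flow from the question, the flattened histogram can stand in for the nested loop, provided the polygons in geo_grid are ordered row-major to match the histogram cells (latitude bins outer, longitude bins inner). That ordering is an assumption about how the grid was built, not something stated in the question, so treat this as a sketch:
import pysal.viz.mapclassify as mc

# Assumes geo_grid has one row per grid cell, ordered to match h.ravel().
geo_grid["number of points"] = h.ravel()
classifier = mc.Quantiles.make(k=5)  # same classifier as in count_points
geo_grid["class"] = geo_grid[["number of points"]].apply(classifier)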
- A well-deserved checkmark for you. It brought the algorithm down from 1200 min for 200,000 rows to ~15 secs for the full set of 750,000 rows. Nicely done. – BBirdsell, Apr 15, 2020 at 18:46