I have written a Python snippet that reads lat and long stored in an Excel file. It converts them to a point which is then used to perform multiple geospatial analysis including buffer, intersection, and searching within a shapefile's attribute table (finding count of a specific attribute occurrence and a single row can have multiple attributes in the same column). This code snippet is very slow. I need to traverse around 2000 records and search for around 179 attributes (count) against 10 different radii. It takes around 2.45 minutes to traverse against 1 point for 10 radii and search for the count of only 10 attribute occurrences. Is there any way to speed up this process?
I am attaching the code below.
# importing libraries
from openpyxl import load_workbook
from shapely.geometry import Point
import pandas as pd
import geopandas as gpd
#import matplotlib.pyplot as plt
import csv
#loading input excel
book = load_workbook(r'File path of input file.xlsx')
sheet = book.active
bookOut = r'outputfile.csv'
#outputSheet = bookOut.active
#searching for a specific category and returning its count
def bSearch(lst, cat):
c = 0
for index, row in lst[0:len(lst)].iterrows():
if cat in row['Category_t']:
c = c+1
return (c)
# creates buffer
def createbuf(p,r):
bufp = p['geometry'].buffer(distance = r)
buf = gpd.GeoDataFrame(geometry = bufp, crs = ucs.crs)
return (buf)
# calculates intersection
def calint(b,u):
areaoi = gpd.overlay(u, b, how= "intersection")
return (areaoi)
# shapefiles Input
ucs = gpd.read_file(r'Boundary.shp')
ucs = ucs.to_crs(epsg=32643)
#Business List
business = gpd.read_file(r'List of Categories.shp')
business = business.to_crs(epsg=32643)
businessArray = [
'Cat1','Cat2','Cat3','Cat4','Cat5','Cat6','Cat7','Cat8','Cat9','Cat10'
]
radiusVals=[0.5, 1, 1.5, 2, 2.5, 3.000, 4.000, 5.000, 7.000, 10.000]
# lat long
latlng = []
i = 0
otp = []
for row in range(143, sheet.max_row+1):
detStore = []
ind = 0
for column in "AB":
cell_name = "{}{}".format(column, row)
latlng.append(sheet[cell_name].value)
detStore.append(sheet[cell_name].value)
for column in "CD":
cell_name = "{}{}".format(column, row)
detStore.append(sheet[cell_name].value)
# PCS SRID
p1 = Point((latlng[i],latlng[i+1]))
df = pd.DataFrame({'a':[latlng[i+1],latlng[i]]})
po = gpd.GeoDataFrame(geometry = [p1], crs = ucs.crs)
po['geometry'] = po['geometry'].to_crs(epsg = 32643)
print(i, "--", latlng[i+1], latlng[i])
i = i+2
# calling functions
for r in range (0, len(radiusVals)):
outputStore = []
for d in range (0, len(detStore)):
outputStore.append(detStore[d])
outputStore.append(radiusVals[r])
bufo = createbuf(po,(radiusVals[r]/111))
aoi = calint(bufo ,ucs)
busList = calint(bufo, business)
for b in range (0, len(businessArray)):
buSearch = bSearch (busList,businessArray[b])
outputStore.append(buSearch)
otp.append(outputStore)
print (otp)
fields = ['Location X', 'Location Y', 'Type', 'Name', 'Radius', 'Cat1','Cat2','Cat3','Cat4','Cat5','Cat6','Cat7','Cat8','Cat9','Cat10']
with open(bookOut, 'w', newline='') as f:
write = csv.writer (f)
write.writerow(fields)
write.writerows(otp)
1 Answer 1
DRY
This list is repeated twice in the code:
'Cat1','Cat2','Cat3','Cat4','Cat5','Cat6','Cat7','Cat8','Cat9','Cat10'
You assigned it to the businessArray
variable, which is good, but
you could also use it when assigning to the fields
variable.
Also, there is repetition in the list itself which can be eliminated with a loop, for example.
Comments
You should remove all commented-out code to reduce clutter. For example:
#import matplotlib.pyplot as plt
#outputSheet = bookOut.active
This comment can be removed since it is obvious from the code that you are importing:
# importing libraries
This comment is vague and is not needed if you give the variable a more meaningful name:
# lat long
latlng = []
For example:
coordinates = []
Note that it is common practice to use plural nouns for array variable names.
Documentation
The PEP 8 style guide recommends adding docstrings for functions. For example, you can convert your comments:
#searching for a specific category and returning its count
def bSearch(lst, cat):
into docstrings:
def bSearch(lst, cat):
""" searching for a specific category and returning its count ""
Naming
PEP-8 recommends snake_case for function and variable names.
bSearch
would be b_search
. You should also expand on
what you mean by "b" because it is not obvious from the context.
In the bSearch
function, c
would be better as count
.
calint
might be better as calc_intersect
because "int" is vague.
Layout
Move the functions to the top after the import
lines.
Having them in the middle of the code interrupts the natural flow of the
code (from a human readability standpoint).
It is common to omit parentheses with simple return
statements:
return (c)
Simpler
There is no need for the 0 start with range
:
for r in range (0, len(radiusVals)):
This line can be simplified with an f-string:
cell_name = "{}{}".format(column, row)
The following:
i = i+2
can be simplified using the special assignment operator:
i += 2
The same goes for:
c = c+1
In the calint
function, there is no need for the intermediate areaoi
variable:
areaoi = gpd.overlay(u, b, how= "intersection")
return (areaoi)
this is simpler:
return gpd.overlay(u, b, how= "intersection")
There is no need for range
and len
:
for d in range (0, len(detStore)):
outputStore.append(detStore[d])
This is simpler:
for d_store in detStore:
outputStore.append(d_store)
.shp
file we can use to test your code with? I'd like to see if there's anything I can do for you, even after all this time, but reviewing Python-Excel combos without having a dataset to test with is adding complexity. And possibly the reason your question hasn't been answered yet. \$\endgroup\$