
I am trying to work with some crime data from https://data.police.uk/data/. The data is organised into several .csv files, one for each month, and each crime is geocoded with Lat and Long.

As the file structure might differ from month to month, I cannot merge the csv files with copy *.csv combined.csv at the command prompt, as explained here: https://www.itsupportguides.com/office-2010/how-to-merge-multiple-csv-files-into-one-csv-file-using-cmd/

So I decided to use Python to loop through all the csv files in the folder and create a shapefile for each one, which I will then merge together at a later stage.

This is the code I came up with after looking at this post: Convertion of multiple csv automatically to shp. It works, but it is really slow: in a couple of hours it converted only a handful of tables. Do you have any suggestions to improve my code?

I had to use csvfile.replace('-', '_') because the file names look like 2012-05-metropolitan-street.csv and I cannot use "-" in the output shapefile name.
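That renaming step can be checked in isolation without arcpy; a minimal sketch (shp_name is a hypothetical helper name, not part of the original script):

```python
import os

def shp_name(csv_name):
    # Drop the .csv extension and swap hyphens for underscores,
    # since "-" is not allowed in a shapefile name
    return os.path.splitext(csv_name)[0].replace('-', '_')

print(shp_name("2012-05-metropolitan-street.csv"))  # 2012_05_metropolitan_street
```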

import arcpy
import os

shpworkspace = r"G:\GIS DATA\Crime Data\CSV"
arcpy.env.workspace = shpworkspace
arcpy.env.overwriteOutput = True

csvlist = arcpy.ListFiles("*.csv")
try:
    for csvfile in csvlist:
        outlayer = "CSVEventLayer"
        # WGS 1984 geographic coordinate system
        spatialreference = "GEOGCS['GCS_WGS_1984',DATUM['D_WGS_1984',SPHEROID['WGS_1984',6378137.0,298.257223563]],PRIMEM['Greenwich',0.0],UNIT['Degree',0.0174532925199433]];-400 -400 1000000000;-100000 10000;-100000 10000;8.98315284119522E-09;0.001;0.001;IsHighPrecision"
        # Build a temporary XY event layer from the Longitude/Latitude columns
        arcpy.MakeXYEventLayer_management(csvfile, "Longitude", "Latitude", outlayer, spatialreference, "#")
        # Shapefile names cannot contain "-", so swap them for underscores
        shpfile = os.path.splitext(csvfile.replace('-', '_'))[0]
        arcpy.CopyFeatures_management(outlayer, shpfile)
        del outlayer
except:
    # If an error occurred print the message to the screen
    print arcpy.GetMessages()
asked Nov 20, 2015 at 16:08
  • Your script doesn't have many steps, so it's hard to see where any time savings can be made, but I would suspect that CopyFeatures is fairly slow. Is G:\ a network drive? Moving the shapefile to your local machine would speed things up a bit. Commented Nov 20, 2015 at 17:06
  • It is a network drive, unfortunately I cannot use my local drive but it is still incredibly slow. In 4 hours it converted 7 tables, each table is about 80,000 rows. At the moment I am running this code on just the London metropolitan area but I hope to run it for the entire country, but this would mean having to loop through thousands of csv...It would take forever at this pace. I guess my approach is entirely wrong. Commented Nov 20, 2015 at 17:30
  • I tend to agree with Jon. Test by moving the files to the local machine, and output to the local drive too. Try adding debug timestamps to each step and see how long each specific process is taking. Commented Nov 20, 2015 at 17:45
  • I too tend to agree with @jon_two - your code doesn't look inefficient on its own. I/O is almost always the bottleneck, and I've seen serious speed-ups when I switched from network to local drives on my local university machines. Another possibility is to use a RAM disk (i.e., a temporary hard drive in RAM). You can work there, then save your work somewhere persistent (since RAM drives disappear upon shutdown!). This one (softperfect.com/products/ramdisk) has worked well for me. Need admin rights to use it. Also, local USB drives are slow, but maybe faster than your network. Commented Nov 21, 2015 at 3:22
  • You could try writing to a gdb feature class instead of a shapefile. Commented Nov 21, 2015 at 8:18
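Following the suggestion in the comments to add debug timestamps, one way to find the slow step is a small timing wrapper (a sketch; timed is a hypothetical helper, shown here in pure Python):

```python
import time

def timed(func):
    # Report how long each wrapped call takes, to locate the bottleneck
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print('{0}: {1:.2f}s'.format(func.__name__, time.time() - start))
        return result
    return wrapper

# Usage: wrap the geoprocessing calls you want to time, e.g.
# make_layer = timed(arcpy.MakeXYEventLayer_management)
# copy_features = timed(arcpy.CopyFeatures_management)
```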

2 Answers


As indicated in the comments, and as most commenters suggested, moving the data from the shared drive to local disk eliminated the performance problem:

I finally came back to the office today, moved the files to the local machine, and re-ran the script. It worked! What was taking hours with the data on the network drive now took only a couple of minutes.
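For reference, a minimal sketch of that workflow (paths are illustrative): stage the CSVs from the share on a local scratch folder first, run the conversion there, then move the finished shapefiles back.

```python
import glob
import os
import shutil
import tempfile

network_dir = r"G:\GIS DATA\Crime Data\CSV"        # slow network share
local_dir = tempfile.mkdtemp(prefix="crime_csv_")  # fast local scratch folder

# Copy the inputs to local disk before running the arcpy conversion
for csv_path in glob.glob(os.path.join(network_dir, "*.csv")):
    shutil.copy2(csv_path, local_dir)
```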

answered Nov 26, 2015 at 11:32

Here is a multiprocessing script that speeds up the whole process. I hope this will help you.

import arcpy
import multiprocessing
import os


def process_csv_file(param):
    """
    multiprocessing function to process csv files
    :param param: workspace and csv file information
    :return: shp file path
    """
    try:
        arcpy.env.workspace = param.get('workspace')
        csv_file = param.get('file')
        print 'Processing: {}'.format(csv_file)
        arcpy.env.overwriteOutput = True
        # Shapefile names cannot contain "-", so swap them for underscores
        shpfile = os.path.splitext(csv_file.replace('-', '_'))[0]
        # Build the event layer in the in_memory workspace to avoid disk I/O
        temp_path = os.path.join('in_memory', ''.join([shpfile, '_EventLayer']))
        # WGS 1984 geographic coordinate system
        spatialreference = "GEOGCS['GCS_WGS_1984',DATUM['D_WGS_1984',SPHEROID['WGS_1984',6378137.0,298.257223563]],PRIMEM['Greenwich',0.0],UNIT['Degree',0.0174532925199433]];-400 -400 1000000000;-100000 10000;-100000 10000;8.98315284119522E-09;0.001;0.001;IsHighPrecision"
        evt_lyr = arcpy.MakeXYEventLayer_management(csv_file, "Longitude", "Latitude", temp_path, spatialreference)
        arcpy.CopyFeatures_management(evt_lyr, shpfile)
        del evt_lyr
        return os.path.join(param.get('workspace'), shpfile)
    except Exception as error:
        return error


def main():
    """ main function """
    try:
        # path of csv files and shape files to keep
        workspace = r"C:\Users\surya\Downloads\CrimeData_Aug_Sep15\2015-08"
        # Number of files to process at a time
        process = 4
        params = [{'workspace': workspace, 'file': file_name}
                  for file_name in os.listdir(workspace)
                  if file_name.endswith('.csv')]
        pool = multiprocessing.Pool(processes=process)
        result = pool.map_async(process_csv_file, params)
        pool.close()
        pool.join()
        print result
    except Exception as error:
        # If an error occurred print the message to the screen
        print error


if __name__ == '__main__':
    main()
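On choosing the pool size: a pure-Python sketch (abs stands in for process_csv_file, which needs arcpy) showing how to match the worker count to the machine's cores; for I/O-bound geoprocessing, extra workers beyond that rarely help.

```python
import multiprocessing

# Match the pool size to the number of CPU cores on the machine
n_workers = multiprocessing.cpu_count()

pool = multiprocessing.Pool(processes=n_workers)
results = pool.map(abs, [-3, -1, 2])  # abs stands in for process_csv_file
pool.close()
pool.join()
print(results)  # [3, 1, 2]
```

On Windows, wrap the Pool creation in an if __name__ == '__main__': guard, as the full script above does.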

Happy to help :)

answered Nov 27, 2015 at 8:00
  • @Suria Thank you for providing this solution. I am not a Python expert and know very little about multiprocessing, but I tried to run your script on my locally stored data. The code works, but it does not seem to be any faster than my simple version. I noticed that CPU usage was quite low, about 3%, and only reached 25% (I have 4 cores) when finalizing the creation of the shapefile. Commented Nov 27, 2015 at 13:12
  • Use 64-bit Python if you have ArcGIS Server installed; otherwise 32-bit Python will also work. How are you executing this script? Just change the workspace path to your csv folder path. BTW, it is Surya. Commented Nov 27, 2015 at 13:30
  • I run it from the executable. I have Python 3.4.1 64-bit. Commented Nov 27, 2015 at 13:39
  • Please use the ArcGIS Python, which normally resides in C:\Python27\ArcGISx6410.3 (the ArcGIS version may change). Since you have mentioned that you are new to Python, the steps to execute this script are: 1. Open a command prompt. 2. Give the path of the Python executable, then a space, then the path of the script. This will look like C:\Python27\ArcGISx6410.3\python path_of_the_script_file.py Commented Nov 27, 2015 at 13:49
  • Your multiprocessing code did in fact reduce the computing time from 23 mins to 15, which is indeed something. Thank you for your help! Commented Dec 2, 2015 at 10:20
