
I am trying to work with some crime data from https://data.police.uk/data/. The data is organised into several .csv files, one for each month, and each crime is geocoded with Lat and Long.

As the file structure might differ from month to month, I cannot merge the csv files with copy *.csv combined.csv at the command prompt, as explained here: https://www.itsupportguides.com/office-2010/how-to-merge-multiple-csv-files-into-one-csv-file-using-cmd/

So I decided to use Python to loop through all the csv files in the folder and create a shapefile for each one, which I will then merge together at a later stage.

This is the code I came up with after looking at this post: Convertion of multiple csv automatically to shp. It works, but it is really slow: in a couple of hours it converted only a handful of tables. Do you have any suggestions to improve my code?

I had to use csvfile.replace('-', '_') because the file names look like 2012-05-metropolitan-street.csv and I cannot use "-" in the output shapefile name.
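That renaming step can be checked in isolation without arcpy; a minimal sketch (shp_name is a hypothetical helper name, not part of the original script):

```python
import os

def shp_name(csv_name):
    # Drop the .csv extension and swap hyphens for underscores,
    # since "-" is not allowed in a shapefile name
    return os.path.splitext(csv_name)[0].replace('-', '_')

print(shp_name("2012-05-metropolitan-street.csv"))  # 2012_05_metropolitan_street
```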

import arcpy
import os

shpworkspace = r"G:\GIS DATA\Crime Data\CSV"
arcpy.env.workspace = shpworkspace
arcpy.env.overwriteOutput = True

csvlist = arcpy.ListFiles("*.csv")
try:
    for csvfile in csvlist:
        outlayer = "CSVEventLayer"
        # WGS 1984 geographic coordinate system
        spatialreference = "GEOGCS['GCS_WGS_1984',DATUM['D_WGS_1984',SPHEROID['WGS_1984',6378137.0,298.257223563]],PRIMEM['Greenwich',0.0],UNIT['Degree',0.0174532925199433]];-400 -400 1000000000;-100000 10000;-100000 10000;8.98315284119522E-09;0.001;0.001;IsHighPrecision"
        # Build a temporary XY event layer from the Longitude/Latitude columns
        arcpy.MakeXYEventLayer_management(csvfile, "Longitude", "Latitude", outlayer, spatialreference, "#")
        # Shapefile names cannot contain "-", so swap them for underscores
        shpfile = os.path.splitext(csvfile.replace('-', '_'))[0]
        arcpy.CopyFeatures_management(outlayer, shpfile)
        del outlayer
except:
    # If an error occurred print the message to the screen
    print arcpy.GetMessages()
asked Nov 20, 2015 at 16:08
  • Your script doesn't have many steps, so it's hard to see where any time savings can be made, but I would suspect that CopyFeatures is fairly slow. Is G:\ a network drive? Moving the shapefile to your local machine would speed things up a bit. Commented Nov 20, 2015 at 17:06
  • It is a network drive, unfortunately I cannot use my local drive but it is still incredibly slow. In 4 hours it converted 7 tables, each table is about 80,000 rows. At the moment I am running this code on just the London metropolitan area but I hope to run it for the entire country, but this would mean having to loop through thousands of csv...It would take forever at this pace. I guess my approach is entirely wrong. Commented Nov 20, 2015 at 17:30
  • I tend to agree with Jon. Test by moving the files to the local machine, and output to the local drive too. Try adding debug timestamps to each step and see how long each specific process is taking. Commented Nov 20, 2015 at 17:45
  • I too tend to agree with @jon_two - your code doesn't look inefficient on its own. I/O is almost always the bottleneck, and I've seen serious speed-ups when I switched from network to local drives on my local university machines. Another possibility is to use a RAM disk (i.e., a temporary hard drive in RAM). You can work there, then save your work somewhere persistent (since RAM drives disappear upon shutdown!). This one (softperfect.com/products/ramdisk) has worked well for me. Need admin rights to use it. Also, local USB drives are slow, but maybe faster than your network. Commented Nov 21, 2015 at 3:22
  • You could try writing to a gdb feature class instead of a shapefile. Commented Nov 21, 2015 at 8:18
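Following the suggestion in the comments to add debug timestamps, one way to find the slow step is a small timing wrapper (a sketch; timed is a hypothetical helper, shown here in pure Python):

```python
import time

def timed(func):
    # Report how long each wrapped call takes, to locate the bottleneck
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print('{0}: {1:.2f}s'.format(func.__name__, time.time() - start))
        return result
    return wrapper

# Usage: wrap the geoprocessing calls you want to time, e.g.
# make_layer = timed(arcpy.MakeXYEventLayer_management)
# copy_features = timed(arcpy.CopyFeatures_management)
```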

2 Answers


As indicated in the comments, and as most commenters suggested, moving the data from the shared drive to local disk eliminated the performance problem:

I finally came back to the office today, moved the files to the local machine, and re-ran the script. It worked! What was taking hours with the data on the network drive now took only a couple of minutes.
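For reference, a minimal sketch of that workflow (paths are illustrative): stage the CSVs from the share on a local scratch folder first, run the conversion there, then move the finished shapefiles back.

```python
import glob
import os
import shutil
import tempfile

network_dir = r"G:\GIS DATA\Crime Data\CSV"        # slow network share
local_dir = tempfile.mkdtemp(prefix="crime_csv_")  # fast local scratch folder

# Copy the inputs to local disk before running the arcpy conversion
for csv_path in glob.glob(os.path.join(network_dir, "*.csv")):
    shutil.copy2(csv_path, local_dir)
```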

answered Nov 26, 2015 at 11:32

Here is a multiprocessing script that speeds up the whole process. I hope this will help you.

import arcpy
import multiprocessing
import os


def process_csv_file(param):
    """
    multiprocessing function to process csv files
    :param param: workspace and csv file information
    :return: shp file path
    """
    try:
        arcpy.env.workspace = param.get('workspace')
        csv_file = param.get('file')
        print 'Processing: {}'.format(csv_file)
        arcpy.env.overwriteOutput = True
        # Shapefile names cannot contain "-", so swap them for underscores
        shpfile = os.path.splitext(csv_file.replace('-', '_'))[0]
        # Build the event layer in the in_memory workspace to avoid disk I/O
        temp_path = os.path.join('in_memory', ''.join([shpfile, '_EventLayer']))
        # WGS 1984 geographic coordinate system
        spatialreference = "GEOGCS['GCS_WGS_1984',DATUM['D_WGS_1984',SPHEROID['WGS_1984',6378137.0,298.257223563]],PRIMEM['Greenwich',0.0],UNIT['Degree',0.0174532925199433]];-400 -400 1000000000;-100000 10000;-100000 10000;8.98315284119522E-09;0.001;0.001;IsHighPrecision"
        evt_lyr = arcpy.MakeXYEventLayer_management(csv_file, "Longitude", "Latitude", temp_path, spatialreference)
        arcpy.CopyFeatures_management(evt_lyr, shpfile)
        del evt_lyr
        return os.path.join(param.get('workspace'), shpfile)
    except Exception as error:
        return error


def main():
    """ main function """
    try:
        # path of csv files and shape files to keep
        workspace = r"C:\Users\surya\Downloads\CrimeData_Aug_Sep15\2015-08"
        # Number of files to process at a time
        process = 4
        params = [{'workspace': workspace, 'file': file_name}
                  for file_name in os.listdir(workspace)
                  if file_name.endswith('.csv')]
        pool = multiprocessing.Pool(processes=process)
        result = pool.map_async(process_csv_file, params)
        pool.close()
        pool.join()
        print result
    except Exception as error:
        # If an error occurred print the message to the screen
        print error


if __name__ == '__main__':
    main()
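On choosing the pool size: a pure-Python sketch (abs stands in for process_csv_file, which needs arcpy) showing how to match the worker count to the machine's cores; for I/O-bound geoprocessing, extra workers beyond that rarely help.

```python
import multiprocessing

# Match the pool size to the number of CPU cores on the machine
n_workers = multiprocessing.cpu_count()

pool = multiprocessing.Pool(processes=n_workers)
results = pool.map(abs, [-3, -1, 2])  # abs stands in for process_csv_file
pool.close()
pool.join()
print(results)  # [3, 1, 2]
```

On Windows, wrap the Pool creation in an if __name__ == '__main__': guard, as the full script above does.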

Happy to help :)

answered Nov 27, 2015 at 8:00
  • @Suria Thank you for providing this solution. I am not a Python expert and know very little about multiprocessing, but I tried to run your script on my locally stored data. The code works, but it does not seem to be any faster than my simple version. I noticed that CPU usage was quite low, about 3%, and only reached 25% (I have 4 cores) when finalizing the creation of the shapefile. Commented Nov 27, 2015 at 13:12
  • Use 64-bit Python if you have ArcGIS Server installed; otherwise 32-bit Python will also work. How are you executing this script? Just change the workspace path to your csv folder path. BTW, it is Surya. Commented Nov 27, 2015 at 13:30
  • I run it from the executable. I have Python 3.4.1 64-bit. Commented Nov 27, 2015 at 13:39
  • Please use the ArcGIS Python, which normally resides in C:\Python27\ArcGISx6410.3 (the ArcGIS version may change). Since you have mentioned that you are new to Python, the steps to execute this script are: 1. Open a command prompt. 2. Give the path of the Python executable, then a space, then the path of the script. This will look like C:\Python27\ArcGISx6410.3\python path_of_the_script_file.py Commented Nov 27, 2015 at 13:49
  • Your multiprocessing code did in fact reduce the computing time from 23 mins to 15, which is indeed something. Thank you for your help! Commented Dec 2, 2015 at 10:20
