I have a feature layer that contains about 460,000 records, and it currently takes about 20 minutes to read that table using the arcpy.da.TableToNumPyArray() tool. Is there a more efficient way to read the rows so that I can then manipulate the data? It seems like there should be a more "C-like" way to access the rows. Here is the function in its entirety, though I'm focused on the line near the bottom where I call arcpy.da.TableToNumPyArray() to read out the data:

def import_surface(surface, grid, nrow, ncol, row_col_fields, field_to_convert):
    """
    References raster surface to model grid.
    Returns an array of size nrow, ncol.
    """
    out_table = r'in_memory\{}'.format(surface)
    grid_oid_fieldname = arcpy.Describe(grid).OIDFieldName
    # Compute the mean surface value for each model cell and output a table (~20 seconds)
    arcpy.sa.ZonalStatisticsAsTable(grid, grid_oid_fieldname, surface, out_table, 'DATA', 'MEAN')
    # Build some layers and views
    grid_lyr = r'in_memory\grid_lyr'
    table_vwe = r'in_memory\table_vwe'
    arcpy.MakeFeatureLayer_management(grid, grid_lyr)
    arcpy.MakeTableView_management(out_table, table_vwe)
    grid_lyr_oid_fieldname = arcpy.Describe(grid_lyr).OIDFieldName
    # table_vwe_oid_fieldname = arcpy.Describe(table_vwe).OIDFieldName
    # Join the output zonal stats table with the grid to assign row/col to each value.
    arcpy.AddJoin_management(grid_lyr, grid_lyr_oid_fieldname, table_vwe, 'OID_', 'KEEP_ALL')
    # Take the newly joined grid/zonal stats and read out tuples of (row, col, val) (takes ~20 minutes)
    a = arcpy.da.TableToNumPyArray(grid_lyr, row_col_fields + [field_to_convert], skip_nulls=False)
    # Reshape the 1D array output by TableToNumPyArray into a 2D structured array, sorting by row/col (~0.1 seconds)
    a = np.rec.fromrecords(a.tolist(), names=['row', 'col', 'val'])
    a.sort(order=['row', 'col'])
    b = np.reshape(a.val, (nrow, ncol))
    return b
asked May 11, 2015 at 18:09
  • Do you have a code sample? Are you doing it all in NumPy? There are many NumPy options for querying, which I am sure you are aware of, but the source of the time usage needs to be identified: is it on the arcpy side or the NumPy side? Commented May 11, 2015 at 19:00
  • Why not use the arcpy.da.UpdateCursor? They are pretty fast. Commented May 11, 2015 at 19:41
  • On the 'C' side you would open a cursor, just like in Python, and then build a list (just as crmackey said), which is probably what arcpy.da.TableToNumPyArray does; a sketch of that cursor approach follows these comments. If you want to go faster you're (probably) going to have to go non-Esri to read the table, and that depends on how the data are stored. Commented May 11, 2015 at 21:07
  • First thoughts are the use of in_memory and the use of joined data. I would try making the join permanent by saving to a file on local disk, and if that speeds things up, do the same but save to in_memory. This should be done before converting to an array. It could be the join, the working location, or a combination of both. Commented May 11, 2015 at 21:32
  • It's 64-bit, not background processing, that prohibits the use of JET databases. Dan is definitely on to something there: persist the join to disk/in_memory and then load it. I suspect the problem is that the 'toNumPy' step needs to look up each value in a (possibly non-indexed) table to load it into memory. You could do that faster if you copied your join table to in_memory and indexed it on the join field, then persisted it before loading. How big is your join table? Commented May 11, 2015 at 21:43
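
A minimal sketch of the cursor approach suggested in these comments, assuming the joined grid_lyr layer from the question is in scope and using placeholder field names that would need to match the actual joined schema:

import arcpy
import numpy as np

# Hypothetical field names; substitute the row/col fields and zonal-stats
# value field that actually exist on the joined layer.
fields = ['grid_row', 'grid_col', 'MEAN']

# arcpy.da.SearchCursor streams rows as tuples, so the whole table can be
# pulled into a plain Python list and handed to NumPy in one pass.
with arcpy.da.SearchCursor(grid_lyr, fields) as cursor:
    records = [row for row in cursor]

a = np.rec.fromrecords(records, names=['row', 'col', 'val'])
a.sort(order=['row', 'col'])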

2 Answers


I thought to comment first, but then it got large...

  • Processing that much data in 20 minutes is a reasonable time, in my opinion. I've also tried to speed up some arcpy operations - look here.
  • If performance is an issue, you could read the file geodatabase with the File Geodatabase API, but I doubt it would be any faster than reading the data with arcpy into a pure Python data structure such as a dict or a list of tuples.
  • It also depends a bit on the data structure you have in your gdb table - looking up keys in a Python dict is faster than searching for values, and list comprehensions can in some situations significantly speed up processing. Work out which operations are costly in Python and find more optimal approaches.

  • I always run 64-bit Python for any large data processing - it processes ca. 10% faster on the datasets I've used.

  • You could play with SQLite, loading the data into a database and running some SQL - it might be faster, but that's hard to verify without testing with your data. A rough sketch of that idea follows this list.
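
A rough sketch of the SQLite idea, again assuming the joined grid_lyr layer from the question is in scope and using hypothetical field names; whether it actually beats TableToNumPyArray would need testing:

import sqlite3
import arcpy

fields = ['grid_row', 'grid_col', 'MEAN']  # hypothetical field names

# Load the joined rows into an in-memory SQLite table; a SearchCursor is an
# iterable of tuples, so it can feed executemany directly.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE zstat (row INTEGER, col INTEGER, val REAL)')
with arcpy.da.SearchCursor(grid_lyr, fields) as cursor:
    conn.executemany('INSERT INTO zstat VALUES (?, ?, ?)', cursor)
conn.commit()

# Let SQL handle the row/col ordering before any NumPy reshaping.
vals = [r[0] for r in conn.execute('SELECT val FROM zstat ORDER BY row, col')]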

Finally, if you don't have one yet - get an SSD. After I switched to an SSD everything just started flying (now it feels slow again :) - because you get used to it).

answered May 11, 2015 at 21:44
  • Thanks for the suggestions, Alex. The reason I'm surprised at the length of time it takes to read the table (~20 minutes) is because I can run zonal statistics on the dataset in ~20 seconds. I will have to look into getting an SSD for my system! Commented May 12, 2015 at 13:17
  • I understand... Test with SSD, there should be a notable boost. Commented May 12, 2015 at 17:21
  • If you don't want to spend any money and have heaps of RAM installed, you could create a RAM drive; it's even faster than an SSD. 70 MB is something I could spare even on WinXP 32-bit. If you're at the limit of what you can achieve with code, then your only options are to use a different API (which may or may not help) and/or look at your performance bottlenecks, which usually means hard drive access. Note that a noticeable gain can be had by keeping all your data on a different physical disk from your system drive, so it doesn't share bandwidth with system calls. Commented May 12, 2015 at 21:28

The main bottleneck in my previous code was reading from a GDB feature class. Once I converted to a shapefile, my read time dropped to about 10% of the original. I should have remembered that I had the same problem just a few months ago #facepalm. Additionally, in a previous iteration of the current problem, I tried to copy my features to an in_memory location, which did not yield an improvement in processing time. However, I tried it again, perhaps after resetting the IPython Notebook kernel, and the processing time became much more reasonable.
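
For reference, the shapefile detour described above amounted to something like the following sketch (the output path is a placeholder, and the variable names come from the original function in the question):

# Hypothetical sketch: persist the joined layer to a shapefile on local disk
# before reading it, which is the step that first dropped the read time to
# about 10% of the original.
tmp_shp = r'C:\temp\zstat_grid.shp'  # placeholder path
arcpy.CopyFeatures_management(grid_lyr, tmp_shp)
# Shapefile field names may be truncated, hence the alias lookup in the
# final version below.
a = arcpy.da.TableToNumPyArray(tmp_shp, row_col_fields + [field_to_convert],
                               skip_nulls=False)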

Here is the new code which runs in less than 2 minutes:

def import_surface(surface, grid, nrow, ncol,
                   zstat_field=None,
                   join_field=None,
                   row_col_fields=('grid_row', 'grid_col'),
                   field_aliasname_to_convert='MEAN'):
    """
    References raster surface to model grid and returns an array of size nrow, ncol.
    vars:
    surface: the raster surface to be sampled; raster
    grid: the vector grid upon which the raster will be summarized; feature class
    nrow: number of rows in the grid; int
    ncol: number of columns in the grid; int
    zstat_field: field in the grid that defines the zones; str
    join_field: field in the resulting features to be used for the join; str
    row_col_fields: names of the fields containing row and column numbers; tuple
    field_aliasname_to_convert: alias of resulting zstat field to use; str
    """
    if join_field is None:
        join_field = arcpy.Describe(grid).OIDFieldName
    if zstat_field is None:
        zstat_field = join_field
    zstat = r'in_memory\{}'.format(surface)
    arcpy.sa.ZonalStatisticsAsTable(grid, zstat_field, surface, zstat, 'NODATA', 'MEAN')
    # Create feature layers and table views for joining
    grid_lyr = r'in_memory\grid_lyr'
    arcpy.MakeFeatureLayer_management(grid, grid_lyr)
    zstat_vwe = r'in_memory\zstat_vwe'
    arcpy.MakeTableView_management(zstat, zstat_vwe)
    # Join tables
    arcpy.AddJoin_management(grid_lyr, join_field, zstat_vwe, join_field, 'KEEP_ALL')
    # Write the joined grid features to a new in_memory feature class
    zstat_grid = r'in_memory\zstat_grid'
    arcpy.CopyFeatures_management(grid_lyr, zstat_grid)
    # Ensure we point to the correct zstat field name in case of truncation
    for idx, alias in enumerate([f.aliasName for f in arcpy.ListFields(zstat_grid)]):
        if alias == field_aliasname_to_convert:
            name = [f.name for f in arcpy.ListFields(zstat_grid)][idx]
            break
    # Convert the joined regular-grid polygon feature class to an array
    a = arcpy.da.TableToNumPyArray(zstat_grid, row_col_fields + (name,), skip_nulls=False)
    # Convert to a recarray, sort by row/col, and reshape to the model grid
    a_1 = np.rec.fromrecords(a.tolist(), names=['row', 'col', 'val'])
    a_1.sort(order=['row', 'col'])
    b = np.reshape(a_1.val, (nrow, ncol))
    return b
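
For example, the function might be called along these lines (the workspace, raster, grid, and grid dimensions are hypothetical):

import arcpy

arcpy.CheckOutExtension('Spatial')  # ZonalStatisticsAsTable needs Spatial Analyst
arcpy.env.overwriteOutput = True
arcpy.env.workspace = r'C:\model\model.gdb'  # hypothetical workspace

# Sample a DEM raster onto a hypothetical 400 x 600 polygon model grid.
land_surface = import_surface('dem', 'model_grid', nrow=400, ncol=600)
print(land_surface.shape)  # (400, 600)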
answered May 13, 2015 at 19:21
