I'm interested in learning how to use NumPy arrays to optimize geoprocessing. Much of my work involves "big data", where geoprocessing often takes days to accomplish certain tasks. Needless to say, I am very interested in optimizing these routines. ArcGIS 10.1 has a number of NumPy functions that can be accessed via arcpy, including RasterToNumPyArray.
For example purposes, let's say I want to optimize the following processing intensive workflow utilizing NumPy arrays:
[Workflow diagram: point features passing through vector and raster operations to produce a binary integer raster]
The general idea here is that there are a huge number of vector-based points that move through both vector and raster-based operations resulting in a binary integer raster dataset.
How could I incorporate NumPy arrays to optimize this type of workflow?
- FYI, there is also a NumPyArrayToRaster function and a FeatureClassToNumPyArray function. – blah238, Mar 29, 2013 at 20:39
- The Multiprocessing with ArcGIS blog post has some good information that might apply here. You might also be interested in other multiprocessing questions. – blah238, Mar 29, 2013 at 20:47
- It seems to me that before thinking about using NumPy in ArcPy, you first need to understand what advantages NumPy arrays offer over Python lists. The scope of NumPy is much wider than ArcGIS. – gene, Mar 29, 2013 at 21:09
- @gene, this StackOverflow answer seems to sum it up pretty well. – blah238, Mar 29, 2013 at 21:22
- As an aside, if you are also interested in Hadoop, there are Big (Spatial) Data developments worth checking out in this video and at GIS Tools for Hadoop. – PolyGeo, Mar 29, 2013 at 22:31
1 Answer
I think the crux of the question here is: which tasks in your workflow are not really ArcGIS-dependent? Obvious candidates include tabular and raster operations. If the data must start and end within a geodatabase or some other ESRI format, then you need to figure out how to minimize the cost of that reformatting (i.e., minimize the number of round trips), or even justify it; it might simply be too expensive to rationalize. Another tactic is to modify your workflow to use Python-friendly data models earlier (for instance, how soon could you ditch vector polygons?).
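To make the round-trip idea concrete, here is a minimal sketch of the raster side of that pattern: one read out of the ESRI format, vectorized NumPy work in memory, and one write back. The paths and the reclassification threshold are placeholders, not part of the original question.

```python
import arcpy
import numpy as np

# Placeholder paths -- substitute your own geodatabase rasters.
in_raster = r"C:\data\work.gdb\cost_surface"
out_raster_path = r"C:\data\work.gdb\binary_result"

# Grab the georeferencing once so the result lines up with the input.
desc = arcpy.Describe(in_raster)
lower_left = arcpy.Point(desc.extent.XMin, desc.extent.YMin)
cell_size = desc.meanCellWidth

# Round trip #1: ESRI raster -> NumPy array.
arr = arcpy.RasterToNumPyArray(in_raster, nodata_to_value=-9999)

# Do the heavy lifting with vectorized NumPy instead of per-cell tools,
# e.g. reclassify to a binary integer raster in a single pass.
binary = ((arr > 0) & (arr <= 500)).astype(np.int32)

# Round trip #2: NumPy array -> ESRI raster. Two conversions total.
result = arcpy.NumPyArrayToRaster(binary, lower_left, cell_size, cell_size)
result.save(out_raster_path)
```

The specific reclassification doesn't matter; the point is that only the first and last steps pay the format-conversion cost.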
To echo @gene, while numpy/scipy are really great, don't assume that these are the only approaches available. You can also use lists, sets, and dictionaries as alternative structures (although @blah238's link is pretty clear about the efficiency differentials); there are also generators, iterators, and all kinds of other great, fast, efficient tools for working with these structures in Python. Raymond Hettinger, one of the Python developers, has all kinds of great general Python content out there. This video is a nice example.
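As an illustration of that plain-Python route, here is a small sketch using nothing beyond an arcpy.da cursor (which is itself an iterator) and a dictionary; the feature class and field names are invented for the example.

```python
import arcpy

# Invented feature class and fields -- adjust to your data.
fc = r"C:\data\work.gdb\points"

# The cursor streams rows one at a time, so the whole table is never
# materialized in memory; a dictionary accumulates a per-category total.
totals = {}
with arcpy.da.SearchCursor(fc, ["CATEGORY", "VALUE"]) as cursor:
    for category, value in cursor:
        totals[category] = totals.get(category, 0.0) + value

print(totals)
```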
Also, to add onto @blah238's comment on multiprocessing: if you're writing/executing within IPython (not just the "regular" Python environment), you can use its "parallel" package for exploiting multiple cores. I'm no whiz with this stuff, but I find it a bit more high-level and newbie-friendly than the multiprocessing module. Probably really just an issue of personal religion there, so take that with a grain of salt. There's a good overview of it starting at 2:13:00 in this video. The whole video is great for IPython in general.
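If IPython isn't part of your setup, the standard library's multiprocessing module covers the same ground. A bare-bones sketch follows; the worker logic and chunk names are placeholders, since how you split the data (raster tiles, feature subsets) depends entirely on your workflow.

```python
from multiprocessing import Pool

def process_chunk(chunk):
    # Placeholder worker: in practice, import arcpy here and run the
    # NumPy/geoprocessing work for one independent piece of the data
    # (one raster tile, one subset of features, etc.).
    return chunk

if __name__ == "__main__":
    chunks = ["tile_1", "tile_2", "tile_3", "tile_4"]  # placeholder task list
    pool = Pool(processes=4)  # roughly one worker per core
    results = pool.map(process_chunk, chunks)
    pool.close()
    pool.join()
    print(results)
```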