I'm interested in learning how to use NumPy arrays to optimize geoprocessing. Much of my work involves "big data", where geoprocessing often takes days to accomplish certain tasks. Needless to say, I am very interested in optimizing these routines. ArcGIS 10.1 has a number of NumPy functions that can be accessed via arcpy, including RasterToNumPyArray.
For example purposes, let's say I want to optimize the following processing intensive workflow utilizing NumPy arrays:
[Workflow diagram: point features passing through vector and raster operations to produce a binary integer raster]
The general idea here is that there are a huge number of vector-based points that move through both vector and raster-based operations resulting in a binary integer raster dataset.
How could I incorporate NumPy arrays to optimize this type of workflow?
- FYI, there is also a NumPyArrayToRaster function and a FeatureClassToNumPyArray function. – blah238, Mar 29, 2013 at 20:39
- The Multiprocessing with ArcGIS blog post has some good information that might apply here. You might also be interested in other multiprocessing questions. – blah238, Mar 29, 2013 at 20:47
- It seems to me that before thinking about using NumPy in ArcPy, you first need to understand what advantages NumPy arrays offer over Python lists. The scope of NumPy is much wider than ArcGIS. – gene, Mar 29, 2013 at 21:09
- @gene, this StackOverflow answer seems to sum it up pretty well. – blah238, Mar 29, 2013 at 21:22
- As an aside, if you are also interested in Hadoop, there are Big (Spatial) Data developments worth checking out in this video and at GIS Tools for Hadoop. – PolyGeo, Mar 29, 2013 at 22:31
1 Answer
I think the crux of the question here is: which tasks in your workflow are not really ArcGIS-dependent? Obvious candidates include tabular and raster operations. If the data must start and end within a geodatabase or some other ESRI format, then you need to figure out how to minimize the cost of that reformatting (i.e., minimize the number of round trips), or even justify it; it might simply be too expensive to rationalize. Another tactic is to modify your workflow to use Python-friendly data models earlier (for instance, how soon could you ditch vector polygons?).
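To make the round-trip idea concrete, here is a minimal sketch of the raster side of that pattern: one read out of the ESRI format, vectorized NumPy work in memory, and one write back. The paths and the reclassification threshold are placeholders, not part of the original question.

```python
import arcpy
import numpy as np

# Placeholder paths -- substitute your own geodatabase rasters.
in_raster = r"C:\data\work.gdb\cost_surface"
out_raster_path = r"C:\data\work.gdb\binary_result"

# Grab the georeferencing once so the result lines up with the input.
desc = arcpy.Describe(in_raster)
lower_left = arcpy.Point(desc.extent.XMin, desc.extent.YMin)
cell_size = desc.meanCellWidth

# Round trip #1: ESRI raster -> NumPy array.
arr = arcpy.RasterToNumPyArray(in_raster, nodata_to_value=-9999)

# Do the heavy lifting with vectorized NumPy instead of per-cell tools,
# e.g. reclassify to a binary integer raster in a single pass.
binary = ((arr > 0) & (arr <= 500)).astype(np.int32)

# Round trip #2: NumPy array -> ESRI raster. Two conversions total.
result = arcpy.NumPyArrayToRaster(binary, lower_left, cell_size, cell_size)
result.save(out_raster_path)
```

The specific reclassification doesn't matter; the point is that only the first and last steps pay the format-conversion cost.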
To echo @gene, while numpy/scipy are really great, don't assume that these are the only approaches available. You can also use lists, sets, and dictionaries as alternative structures (although @blah238's link is pretty clear about the efficiency differentials); there are also generators, iterators, and all kinds of other great, fast, efficient tools for working with these structures in Python. Raymond Hettinger, one of the Python developers, has all kinds of great general Python content out there. This video is a nice example.
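As an illustration of that plain-Python route, here is a small sketch using nothing beyond an arcpy.da cursor (which is itself an iterator) and a dictionary; the feature class and field names are invented for the example.

```python
import arcpy

# Invented feature class and fields -- adjust to your data.
fc = r"C:\data\work.gdb\points"

# The cursor streams rows one at a time, so the whole table is never
# materialized in memory; a dictionary accumulates a per-category total.
totals = {}
with arcpy.da.SearchCursor(fc, ["CATEGORY", "VALUE"]) as cursor:
    for category, value in cursor:
        totals[category] = totals.get(category, 0.0) + value

print(totals)
```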
Also, to add onto @blah238's comment on multiprocessing: if you're writing/executing within IPython (not just the "regular" Python environment), you can use its "parallel" package for exploiting multiple cores. I'm no whiz with this stuff, but I find it a bit more high-level and newbie-friendly than the multiprocessing module. Probably really just an issue of personal religion there, so take that with a grain of salt. There's a good overview of it starting at 2:13:00 in this video. The whole video is great for IPython in general.
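If IPython isn't part of your setup, the standard library's multiprocessing module covers the same ground. A bare-bones sketch follows; the worker logic and chunk names are placeholders, since how you split the data (raster tiles, feature subsets) depends entirely on your workflow.

```python
from multiprocessing import Pool

def process_chunk(chunk):
    # Placeholder worker: in practice, import arcpy here and run the
    # NumPy/geoprocessing work for one independent piece of the data
    # (one raster tile, one subset of features, etc.).
    return chunk

if __name__ == "__main__":
    chunks = ["tile_1", "tile_2", "tile_3", "tile_4"]  # placeholder task list
    pool = Pool(processes=4)  # roughly one worker per core
    results = pool.map(process_chunk, chunks)
    pool.close()
    pool.join()
    print(results)
```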