
For a rectangular area of about 10 by 10 km, I have to generate "displacement vectors" at a 1 meter resolution (i.e. a delta_x and delta_y value for every 1 by 1 meter cell). I will end up with 100 million tuples (x, y, delta_x, delta_y). In a plain text file, this would look like:

x y delta_x delta_y
------ ------ ------- -------
228682 184554 0.429 1.221 
228682 184555 0.428 1.222 
228682 184556 0.428 1.223 
228682 184557 0.427 1.224 
228682 184558 0.427 1.225 
...
229526 184822 -0.084 0.593 
229526 184823 0.185 0.002 
229526 184824 0.185 0.001 
229526 184825 0.186 -0.001 
229526 184826 0.186 -0.002 
...

I was wondering if there is an efficient way to store these values as a raster/coverage-type dataset. The x and y would be the pixel or cell position, and delta_x and delta_y would be the values. Is HDF5 the way to go? I have no experience using HDF5.

I need to store the values in Python and retrieve them in Python in an efficient way (more efficient than first reading 100 million tuples from a text file, then storing them in an R-tree index, and finally using the index to do intersects).

I am looking for some guidance, tips, best practices, small examples or use cases where this has been done before.

asked Jan 13, 2016 at 14:02

2 Answers


Your first chance at reducing the size of the data is to remove the need to store x and y explicitly. Since your data is regularly gridded, you can store these values implicitly as grid indices.

This way you can store the data as a three-dimensional NumPy array with shape (2, n_y, n_x), i.e. (value, row, column).

delta_x and delta_y for the location x = 20 and y = 10 would then be stored at array[:, 10, 20].
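
For example, with the grid origin taken from the question's sample data, converting world coordinates to grid indices is a simple subtraction (a small sketch; the origin values come from the first rows above):

x0, y0 = 228682, 184554  # grid origin, from the question's sample data

def cell_indices(x, y):
    # map 1 m world coordinates to (row, col) grid indices
    return y - y0, x - x0

row, col = cell_indices(228682, 184558)  # -> (4, 0)
# delta_x, delta_y = array[:, row, col]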

There are a number of options to store such an array on disk - numpy.savez and h5py probably being the most popular ones.

import numpy as np
import h5py

# 2 bands of 10000 x 10000 float32 values, ~800 MB uncompressed
test_data = np.arange(200000000).reshape(2, 10000, 10000).astype(np.float32)

# plain and compressed NumPy archives
np.savez("/data/test_tmp/numpy.npz", test_data)
np.savez_compressed("/data/test_tmp/numpy_compressed.npz", test_data)

# HDF5 dataset with gzip compression
with h5py.File("/data/test_tmp/h5py.h5", "w") as hdf5:
    hdf5.create_dataset("innit", data=test_data, compression="gzip")
$ ls -al /data/test_tmp/
-rw-rw-r-- 1 dl dl  77886268 Jan 13 17:31 h5py.h5
-rw-rw-r-- 1 dl dl  72289983 Jan 13 17:31 numpy_compressed.npz
-rw-rw-r-- 1 dl dl 800000212 Jan 13 17:30 numpy.npz

On this test dataset you can see the effect of compression: the uncompressed archive numpy.npz is ~800 MB, while both compressed files are roughly a tenth of that at ~75 MB.

I prefer numpy.savez for its ease of use, but h5py definitely gives you a lot more options with regard to compression and to storing multiple datasets in one container.
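
For retrieval, both formats load straight back into NumPy; with h5py you can additionally slice the dataset on disk without reading the whole array into memory. A minimal sketch using the files written above:

import numpy as np
import h5py

# npz: loads the whole array into memory
with np.load("/data/test_tmp/numpy.npz") as npz:
    data = npz["arr_0"]       # default key for an array saved without a name
    dx, dy = data[:, 10, 20]  # delta_x, delta_y at row 10, column 20

# HDF5: slicing reads only the requested cells from disk
with h5py.File("/data/test_tmp/h5py.h5", "r") as hdf5:
    dx, dy = hdf5["innit"][:, 10, 20]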

answered Jan 13, 2016 at 16:47
  • your solution seems perfect, and thank you for the elaborate answer. In the meantime, I was working on a similar solution using NumPy and storing the values in a GeoTiff with 2 bands, one for delta_x and one for delta_y. The resulting uncompressed GeoTiff is about 10% of the size of my text version (storing the deltas in centimeters, hence integers, not floats). I still have to do the reading part. What's your opinion of my solution compared to yours, in particular with regard to reading the values? Commented Jan 13, 2016 at 18:47
  • Your solution basically achieves the same results, although I would say with unnecessary complication (you don't really need GDAL, since you don't need the spatial information for storage - or do you?). Can you provide a subset? Then we would be able to do a performance comparison. Commented Jan 13, 2016 at 20:46
  • Your solution is marked as the answer because it was very much to the point and more generic than what I am doing. However, I will go with the GeoTiff solution. It's working now, and fast. Moreover, this was not mentioned, but a colleague of mine needs to use the dataset in Smallworld, coding in Magik. I have no idea what his capabilities are, but my guess is that consulting the values of a GeoTiff is a safer option than an output format he might not be able to read. But thanks a bunch for the help! Commented Jan 13, 2016 at 21:27

This alternative answer stores the delta_x and delta_y values in a GeoTiff, using NumPy and GDAL. Here is the outline of the solution.

The implementation for creating the GeoTiff uses the code in this recipe:

https://pcjericks.github.io/py-gdalogr-cookbook/raster_layers.html#create-raster-from-array

The delta_x and delta_y values are stored in separate bands. I chose to store the displacements in centimeters to reduce the size of the GeoTiff (signed 16-bit integers).

I read the values from the GeoTiff in Python using the code in this recipe:

https://gist.github.com/stefanocudini/5201689

The GeoTiff solution is less generic because of its dependency on GDAL, but that comes in handy if the output needs to be read by other GIS software.
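
A minimal sketch of this approach, assuming a hypothetical output path and a reduced 1000 x 1000 grid for brevity; the geotransform below uses the question's sample origin and 1 m cells:

import numpy as np
from osgeo import gdal

rows, cols = 1000, 1000  # reduced grid for illustration; the real grid is 10000 x 10000

# hypothetical displacement arrays, already converted to centimeters (int16)
delta_x_cm = np.zeros((rows, cols), dtype=np.int16)
delta_y_cm = np.zeros((rows, cols), dtype=np.int16)

# write a two-band Int16 GeoTiff; origin and 1 m cell size follow the question's sample data
driver = gdal.GetDriverByName("GTiff")
ds = driver.Create("/data/test_tmp/displacements.tif", cols, rows, 2, gdal.GDT_Int16)
ds.SetGeoTransform((228682, 1, 0, 184554 + rows, 0, -1))  # top-left x, pixel width, 0, top-left y, 0, pixel height
ds.GetRasterBand(1).WriteArray(delta_x_cm)  # band 1: delta_x
ds.GetRasterBand(2).WriteArray(delta_y_cm)  # band 2: delta_y
ds = None  # closing the dataset flushes it to disk

# read a single cell back without loading the whole raster
ds = gdal.Open("/data/test_tmp/displacements.tif")
dx_cm = ds.GetRasterBand(1).ReadAsArray(20, 10, 1, 1)[0, 0]  # xoff=20, yoff=10, 1x1 window
dy_cm = ds.GetRasterBand(2).ReadAsArray(20, 10, 1, 1)[0, 0]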

answered Jan 13, 2016 at 21:52
  • +1, this is indeed a nice solution if someone needs to load the data into a GIS or related geospatial software. Commented Jan 13, 2016 at 22:02
