For a rectangular area of about 10 by 10 km, I have to generate "displacement vectors" with a 1 meter resolution (i.e. a delta-x and delta-y value for every cell of 1 by 1 meter). I will end up with 100 million tuples (x, y, delta_x, delta_y). In a plain text file, this would look like:
x y delta_x delta_y
------ ------ ------- -------
228682 184554 0.429 1.221
228682 184555 0.428 1.222
228682 184556 0.428 1.223
228682 184557 0.427 1.224
228682 184558 0.427 1.225
...
229526 184822 -0.084 0.593
229526 184823 0.185 0.002
229526 184824 0.185 0.001
229526 184825 0.186 -0.001
229526 184826 0.186 -0.002
...
I was wondering if there is an efficient way to store these values as a dataset of type raster/coverage...? The x, y would be the pixel or cell position, and the delta_x and delta_y would be the values. Is HDF5 the way to go? I have no experience using HDF5.
I need to store the values in Python and retrieve them in Python in an efficient way (more efficient than first reading 100 million tuples from a text file, storing them in an R-Tree index, and finally using the index to do intersects).
I am looking for some guidance, tips, best practices, small examples or use cases where this has been done before.
2 Answers
Your first chance at reducing the size of the data is to remove the need to explicitly store x and y. Since your data is regularly gridded, you can store these values implicitly as the grid indices. This way you can generate a 3-dimensional numpy array with the shape (2, n_y, n_x).
delta_x and delta_y for the location x = 20 and y = 10 would then be stored at array[:, 10, 20].
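For real-world coordinates you only need to subtract the grid origin to get the indices. A minimal sketch, assuming a 1 m resolution and taking the origin values from the sample data in the question:
import numpy as np

# Grid origin taken from the sample data in the question;
# adjust to your dataset's actual extent.
x_min, y_min = 228682, 184554

# (2, n_y, n_x): plane 0 holds delta_x, plane 1 holds delta_y
array = np.zeros((2, 10000, 10000), dtype=np.float32)

def lookup(x, y):
    """Return (delta_x, delta_y) for the 1 m cell at map coordinates (x, y)."""
    # With a 1 m resolution the coordinate offsets are the indices.
    return array[:, y - y_min, x - x_min]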
There are a number of options to store such an array on disk, numpy.savez and h5py probably being the most popular ones.
import numpy as np
import h5py

# 2 x 10000 x 10000 float32 values, matching the 10 x 10 km grid
test_data = np.arange(200000000).reshape(2, 10000, 10000).astype(np.float32)

# Plain and compressed .npz archives
np.savez("/data/test_tmp/numpy.npz", test_data)
np.savez_compressed("/data/test_tmp/numpy_compressed.npz", test_data)

# HDF5 dataset with gzip compression
with h5py.File("/data/test_tmp/h5py.h5", "w") as hdf5:
    hdf5.create_dataset("innit", data=test_data, compression="gzip")
$ ls -al /data/test_tmp/
-rw-rw-r-- 1 dl dl 77886268 Jan 13 17:31 h5py.h5
-rw-rw-r-- 1 dl dl 72289983 Jan 13 17:31 numpy_compressed.npz
-rw-rw-r-- 1 dl dl 800000212 Jan 13 17:30 numpy.npz
On this test dataset you can see the effect of compression: the uncompressed array numpy.npz has a size of ~800 MB, while both compressed arrays are around a tenth of that at ~75 MB.
I prefer numpy.savez for its ease of use, but h5py definitely gives you a lot more options with regard to compression and multiple datasets in one container.
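Reading the data back in was not shown above; here is a minimal sketch using the file paths and dataset name from the example (note that arrays passed positionally to np.savez end up under the key arr_0):
import numpy as np
import h5py

# .npz: arrays passed positionally to savez are stored as "arr_0", "arr_1", ...
npz = np.load("/data/test_tmp/numpy_compressed.npz")
data = npz["arr_0"]

# HDF5: slicing the dataset reads only the requested region from disk,
# which is handy if you only need a small window of the 10 x 10 km grid.
with h5py.File("/data/test_tmp/h5py.h5", "r") as hdf5:
    window = hdf5["innit"][:, 100:200, 100:200]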
- Your solution seems perfect, and thank you for the elaborate answer. In the meantime, I was working on a similar solution using Numpy and storing the values in a GeoTiff with 2 bands, one for delta_x and one for delta_y. The resulting uncompressed GeoTiff is about 10 % of the size of my text version (storing the deltas in centimeters, hence integers, not floats). I still have to do the reading part. What's your opinion of my solution compared to yours, in particular with regard to reading the values? – user52724, Jan 13, 2016 at 18:47
- Your solution basically achieves the same results, although I would say with unnecessary complication (you don't really need GDAL, since you don't really need the spatial information for storage - or do you?). Can you provide a subset? Then we would be able to do a performance comparison. – Kersten, Jan 13, 2016 at 20:46
- Your solution is marked as the answer because it was very much to the point and more generic than what I am doing. However, I will go with the GeoTiff solution; it's working now and fast. Moreover, this was not mentioned, but a colleague of mine needs to use the dataset in Smallworld, coding in Magik. I have no idea what his capabilities are, but my guess is that consulting the values of a GeoTiff is a safer option than your output, which he might not be able to read. But thanks a bunch for the help! – user52724, Jan 13, 2016 at 21:27
This alternative answer stores the delta_x and delta_y values in a GeoTiff, using Numpy and GDAL. Here is the outline of the solution.
The implementation for creating the GeoTiff uses the code in this recipe:
https://pcjericks.github.io/py-gdalogr-cookbook/raster_layers.html#create-raster-from-array
The delta_x and delta_y values are stored in separate bands. I chose to store the displacements in centimeters to reduce the size of the GeoTiff (signed 16-bit integers).
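As a rough illustration of that write step (not the exact code from the recipe), here is a minimal sketch; the file name, grid origin, and EPSG code are assumptions for illustration:
import numpy as np
from osgeo import gdal, osr

# Hypothetical displacement grids in centimeters (signed 16-bit integers)
rows, cols = 10000, 10000
delta_x_cm = np.zeros((rows, cols), dtype=np.int16)
delta_y_cm = np.zeros((rows, cols), dtype=np.int16)

driver = gdal.GetDriverByName("GTiff")
out = driver.Create("displacements.tif", cols, rows, 2, gdal.GDT_Int16)

# Geotransform: upper-left x, pixel width, 0, upper-left y, 0, -pixel height.
# Origin values are assumptions based on the sample data; 1 m resolution.
out.SetGeoTransform((228682, 1, 0, 184554 + rows, 0, -1))

srs = osr.SpatialReference()
srs.ImportFromEPSG(31370)  # hypothetical CRS; use your dataset's EPSG code
out.SetProjection(srs.ExportToWkt())

out.GetRasterBand(1).WriteArray(delta_x_cm)  # band 1: delta_x (band order is a choice)
out.GetRasterBand(2).WriteArray(delta_y_cm)  # band 2: delta_y
out.FlushCache()
out = None  # close the dataset and flush to disk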
I read the values from the GeoTiff in Python using the code in this recipe:
https://gist.github.com/stefanocudini/5201689
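Again as a hedged sketch rather than the exact recipe code, reading the bands back looks roughly like this; ReadAsArray also accepts a pixel window, which helps when you only need part of the grid:
from osgeo import gdal

ds = gdal.Open("displacements.tif")

# Read full bands as numpy arrays (int16, in centimeters)
delta_x_cm = ds.GetRasterBand(1).ReadAsArray()
delta_y_cm = ds.GetRasterBand(2).ReadAsArray()

# Or read just a window: (xoff, yoff, xsize, ysize) in pixels
window = ds.GetRasterBand(1).ReadAsArray(100, 100, 200, 200)

# Convert back to meters if needed
delta_x = delta_x_cm / 100.0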
The GeoTiff solution is less generic because of its dependency on GDAL, but that dependency comes in handy if the output needs to be read in other GIS software.
- +1, this is indeed a nice solution if someone needs to load the data into a GIS or related geospatial software. – Kersten, Jan 13, 2016 at 22:02