Moving from Earth Engine image to array for use in sklearn

Question 1

I've produced a very large multiband image in EE with the goal of classifying it using the classifiers implemented in sklearn (the native ones implemented in EE don't provide enough flexibility for my purposes). sklearn uses 2-D arrays, so minimally I would need to convert each band to a 2D array and feed them in separately as explanatory variables. That's all fine.

Here's my problem: With a raster covering>150k km2, it is beyond tedious and cumbersome to Export.image.toDrive for each band, only to then re-import them to a python environment using rasterio. Ideally there would be some way to convert EE image objects to sklearn-readable NumPy arrays directly using the EE Python API (Google seems to tease as much with their documentation touting the advantages of using EE in Colab: "Seamless integration with Python data science libraries").

Is there a straightforward way to do this that I'm missing?

Question 2

Ideally there would be some way to convert EE image objects to sklearn-readable NumPy arrays directly using the EE Python API.

ee.Image.sampleRectangle() does this.

However, there is a limit of 262144 pixels that can be transferred. The interactive data transfer limit is in place to protect your system from hanging (it is easy to request terabytes of data without realizing it).

So in the case of a large area, your options are to export images to Google Drive or Google Cloud Storage and then import to Earth Engine Python API. Using Google Colab makes this easy - EE is installed by default and there is integration with GDrive and GCS. The Earth Engine batch task export methods are better equipped for dealing with large data (breaks up large exports into manageable sized GeoTIFFs).

Even though ee.Image.sampleRectangle() may not be useful for your application, here is a demo in case it helps others.

The following Python script transfers three Landsat 8 bands for a rectangular region to the Python client and converts the EE arrays to numpy arrays and then stacks the arrays and displays the 3-D array as an RGB image representation of the region.

IPython Notebook

import ee
import numpy as np
import matplotlib.pyplot as plt
ee.Authenticate()
ee.Initialize()
# Define an image.
img = ee.Image('LANDSAT/LC08/C01/T1_SR/LC08_038029_20180810') \
 .select(['B4', 'B5', 'B6'])
# Define an area of interest.
aoi = ee.Geometry.Polygon(
 [[[-110.8, 44.7],
 [-110.8, 44.6],
 [-110.6, 44.6],
 [-110.6, 44.7]]], None, False)
# Get 2-d pixel array for AOI - returns feature with 2-D pixel array as property per band.
band_arrs = img.sampleRectangle(region=aoi)
# Get individual band arrays.
band_arr_b4 = band_arrs.get('B4')
band_arr_b5 = band_arrs.get('B5')
band_arr_b6 = band_arrs.get('B6')
# Transfer the arrays from server to client and cast as np array.
np_arr_b4 = np.array(band_arr_b4.getInfo())
np_arr_b5 = np.array(band_arr_b5.getInfo())
np_arr_b6 = np.array(band_arr_b6.getInfo())
print(np_arr_b4.shape)
print(np_arr_b5.shape)
print(np_arr_b6.shape)
# Expand the dimensions of the images so they can be concatenated into 3-D.
np_arr_b4 = np.expand_dims(np_arr_b4, 2)
np_arr_b5 = np.expand_dims(np_arr_b5, 2)
np_arr_b6 = np.expand_dims(np_arr_b6, 2)
print(np_arr_b4.shape)
print(np_arr_b5.shape)
print(np_arr_b6.shape)
# Stack the individual bands to make a 3-D array.
rgb_img = np.concatenate((np_arr_b6, np_arr_b5, np_arr_b4), 2)
print(rgb_img.shape)
# Scale the data to [0, 255] to show as an RGB image.
rgb_img_test = (255*((rgb_img - 100)/3500)).astype('uint8')
plt.imshow(rgb_img_test)
plt.show()

Question 3

Thank you for this nice example! I found a numpy function np.dstack(), which is essentially the combination of np.expand_dims() and np.concatenate(). You can replace those four lines with one line: rgb_img = np.dstack(np_arr_b6, np_arr_b5, np_arr_b4)

Question 4

Hello Im trying use the code above, but when I change the aoi, this error occurs: EEException: Image.sampleRectangle: Fully masked pixels / pixels outside of the image footprint when sampling band 'B4' with no default value set. Note that calling sampleRectangle() on an image after ee.Image.clip() may result in a sampling bounding box outside the geometry passed to clip(). anyone know how can I solve this? here its the aoi used aoi = ee.Geometry.Polygon( [[[-53.90,-25.0], [-53.90,-25.1], [-53.87,-25.1], [-53.87,-25.0]]], None, False)

Question 5

@Newmar, you might want to try defining your AOI in a projected coordinate system (e.g., UTM), then reprojecting the image to that same projection before calling sampleRectangle. I was running into this same issue trying to extract arrays of Sentinel-2 imagery and this workaround seemed to fix this error for me.

Question 6

Any idea how to extent this example when first taking an imageCollection then reducing it? (See post here: gis.stackexchange.com/questions/401356/…)

Question 7

Why not using sampleRegions ? it is specific for the AOI geometry as oppose to sampleRectangle

Question 8

What I've done is download the images as tifs from GEE (something you might have to do in pieces given the size). I used the getDownloadURL() function because it is faster, For larger images use Export.image.toDrive().

As my bands are in separate tifs, I stack them into one tif using rasterio/GDAL.

I keep them in the output zip file to save on space.

 # Collect path names of the single-band tifs in the folder and
 # convert name into a format readable by rasterio.open()
 import rasterio
 import numpy as np
 from zipfile import Zipfile
 file_list = []
 stack_path = 'C:\Users\stack.tif'
 img_file = 'C:\Users\LC08_023036_20130429'
 with ZipFile(str(img_file.with_suffix('.zip')), 'r') as f:
 names = f.namelist()
 names = [str(img_file.with_suffix('.zip!')) + name for name in names]
 names = ['zip://' + name for name in names]
 for file in names:
 if file.endswith('.tif'):
 file_list.append(file)
 # Read each layer, convert to float and write it to stack
 with rasterio.open(stack_path, 'w', **meta) as dst:
 for id, layer in enumerate(file_list, start=0):
 with rasterio.open(layer) as src1:
 dst.write_band(id + 1, src1.read(1).astype('float32'))

As sklearn requires a 2D matrix, I just reshape it.

The data must be transposed for scikit-image. See rasterio interoperability

 with rasterio.open(str(stack_path), 'r') as ds:
 data = ds.read()
 # rasterio.read output is (Depth, Width, Height). 
 data = data.transpose((1, -1, 0)) 
 # Convert GeoTIFF NoData values in the image to np.nan
 data[data == -999999] = np.nan 
 data[np.isneginf(data)] = np.nan
 # Reshape into a 2D array, where rows = pixels and cols = features/bands
 data_vector = data.reshape([data.shape[0] * data.shape[1], data.shape[2]])
 # Remove NaNs
 data_vector = data_vector[~np.isnan(data_vector).any(axis=1)]

Although downloading the files is cumbersome, if you create a tif stacking and reshaping pipeline for all of your files the process is greatly streamlined.

Justin Braaten Justin Braaten 6,2731 gold badge22 silver badges43 bronze badges · Accepted Answer · 2020-02-18 19:41:02Z

Ideally there would be some way to convert EE image objects to sklearn-readable NumPy arrays directly using the EE Python API.

ee.Image.sampleRectangle() does this.

However, there is a limit of 262144 pixels that can be transferred. The interactive data transfer limit is in place to protect your system from hanging (it is easy to request terabytes of data without realizing it).

So in the case of a large area, your options are to export images to Google Drive or Google Cloud Storage and then import to Earth Engine Python API. Using Google Colab makes this easy - EE is installed by default and there is integration with GDrive and GCS. The Earth Engine batch task export methods are better equipped for dealing with large data (breaks up large exports into manageable sized GeoTIFFs).

Even though ee.Image.sampleRectangle() may not be useful for your application, here is a demo in case it helps others.

The following Python script transfers three Landsat 8 bands for a rectangular region to the Python client and converts the EE arrays to numpy arrays and then stacks the arrays and displays the 3-D array as an RGB image representation of the region.

IPython Notebook

import ee
import numpy as np
import matplotlib.pyplot as plt
ee.Authenticate()
ee.Initialize()
# Define an image.
img = ee.Image('LANDSAT/LC08/C01/T1_SR/LC08_038029_20180810') \
 .select(['B4', 'B5', 'B6'])
# Define an area of interest.
aoi = ee.Geometry.Polygon(
 [[[-110.8, 44.7],
 [-110.8, 44.6],
 [-110.6, 44.6],
 [-110.6, 44.7]]], None, False)
# Get 2-d pixel array for AOI - returns feature with 2-D pixel array as property per band.
band_arrs = img.sampleRectangle(region=aoi)
# Get individual band arrays.
band_arr_b4 = band_arrs.get('B4')
band_arr_b5 = band_arrs.get('B5')
band_arr_b6 = band_arrs.get('B6')
# Transfer the arrays from server to client and cast as np array.
np_arr_b4 = np.array(band_arr_b4.getInfo())
np_arr_b5 = np.array(band_arr_b5.getInfo())
np_arr_b6 = np.array(band_arr_b6.getInfo())
print(np_arr_b4.shape)
print(np_arr_b5.shape)
print(np_arr_b6.shape)
# Expand the dimensions of the images so they can be concatenated into 3-D.
np_arr_b4 = np.expand_dims(np_arr_b4, 2)
np_arr_b5 = np.expand_dims(np_arr_b5, 2)
np_arr_b6 = np.expand_dims(np_arr_b6, 2)
print(np_arr_b4.shape)
print(np_arr_b5.shape)
print(np_arr_b6.shape)
# Stack the individual bands to make a 3-D array.
rgb_img = np.concatenate((np_arr_b6, np_arr_b5, np_arr_b4), 2)
print(rgb_img.shape)
# Scale the data to [0, 255] to show as an RGB image.
rgb_img_test = (255*((rgb_img - 100)/3500)).astype('uint8')
plt.imshow(rgb_img_test)
plt.show()

Thank you for this nice example! I found a numpy function np.dstack(), which is essentially the combination of np.expand_dims() and np.concatenate(). You can replace those four lines with one line: rgb_img = np.dstack(np_arr_b6, np_arr_b5, np_arr_b4)
Hello Im trying use the code above, but when I change the aoi, this error occurs: EEException: Image.sampleRectangle: Fully masked pixels / pixels outside of the image footprint when sampling band 'B4' with no default value set. Note that calling sampleRectangle() on an image after ee.Image.clip() may result in a sampling bounding box outside the geometry passed to clip(). anyone know how can I solve this? here its the aoi used aoi = ee.Geometry.Polygon( [[[-53.90,-25.0], [-53.90,-25.1], [-53.87,-25.1], [-53.87,-25.0]]], None, False)
@Newmar, you might want to try defining your AOI in a projected coordinate system (e.g., UTM), then reprojecting the image to that same projection before calling sampleRectangle. I was running into this same issue trying to extract arrays of Sentinel-2 imagery and this workaround seemed to fix this error for me.
Any idea how to extent this example when first taking an imageCollection then reducing it? (See post here: gis.stackexchange.com/questions/401356/…)
Why not using sampleRegions ? it is specific for the AOI geometry as oppose to sampleRectangle

Stack Exchange Network

Moving from Earth Engine image to array for use in sklearn

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Moving from Earth Engine image to array for use in sklearn

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions