Raster reclassify using python, gdal and numpy

Question 1

I would like to reclassify a raster file from a raster with 10 classes to a raster with 8 classes using Python, GDAL and/or numpy. The classes are represented as integers. I have tried following the steps from this post Reclassifying rasters using GDAL and Python? , the numpy.equal doc and also gdal_calc doc. However, to no avail.

The raster file to be reclassified has integer values ranging from 0 to 11 and also include values 100 and 255. The following show the reclass (from value : to value):

nodata : 4, 0 : 4, 1 : 1, 2 : 2, 3 : 3, 4 : 3, 5 : 4, 6 : 5, 7 : 5, 8 : 6, 9 : 7, 10 : 8, 100 : nodata, 255 : nodata,

What I have been able to do is select the raster file to be reclassified using tkinter.FileDialog and get the raster info such as geotransform, and pixel size with reclass = gdal.Open(raster, GA_ReadOnly).

How do I go about solving the above.

It might be worth mentioning that the rasters to be reclassified can be fairly large in some cases (500mb to 5gb).

Question 2

There is another example on the GeoExamples Blog

Question 3

@bennos, tried the script on the blog but it returns a memory error when unpacking the array.

Question 4

I suggest you discuss this problem with Roger Veciana i Rovira, the author of the post, as he knows his code better than I do and maybe knows how to resolve the issue

Question 5

Changing the input raster from 16Bit unsigned to 8Bit unsigned solved the memory issue. However, it takes about the same amount of time to reclassify as the dmh126 script below.

Question 6

Instead of doing the reclassification as a double for loop described by dmh126, do it using np.where:

# reclassification 
lista[np.where( lista < 200 )] = 1
lista[np.where((200 < lista) & (lista < 400)) ] = 2
lista[np.where((400 < lista) & (lista < 600)) ] = 3
lista[np.where((600 < lista) & (lista < 800)) ] = 4
lista[np.where( lista > 800 )] = 5

On an array of 6163 by 3537 pixels (41.6mb) the classification is done in 1.59 seconds, where it takes 12min 41s using the double for loop. In total just a speedup of 478x.

Bottomline, never use a double for loop using numpy

Question 7

Thanks for the hint, but I think that will lead into a problem if the input classes overlap with the output classes. I don't want my new value to be changed by the next rule.

Question 8

@Gustry - Just ran into that problem here.

Question 9

So check my answer below : gis.stackexchange.com/questions/163007/…

Question 10

Here's a basic example using rasterio and numpy:

import rasterio as rio
import numpy as np
with rio.open('~/rasterio/tests/data/rgb1.tif') as src:
 # Read the raster into a (rows, cols, depth) array,
 # dstack this into a (depth, rows, cols) array,
 # the sum along the last axis (~= grayscale)
 grey = np.mean(np.dstack(src.read()), axis=2)
 # Read the file profile
 srcprof = src.profile.copy()
classes = 5
# Breaks is an array of the class breaks: [ 0. 51. 102. 153. 204.]
breaks = (np.arange(classes) / float(classes)) * grey.max()
# classify the raster
classified = np.sum(np.dstack([(grey < b) for b in breaks]), axis=2).reshape(1, 400, 400).astype(np.int32)
# Update the file opts to one band
srcprof.update(count=1, nodata=None, dtype=classified.dtype)
with rio.open('/tmp/output.tif', 'w', **srcprof) as dst:
 # Write the output
 dst.write(classified)

Question 11

Useful answer that works fine! Any idea how can the classification be reversed? I mean lower values in the bin getting high re-classified values instead, which will be reduced as the bin increases. So instead of having a product going from 1 to 5, we have the reverse from 5 to 1. Thanks!

Question 12

Just to complete the answer from @Mattijn, I think that will lead into a problem if the input classes overlap with the output classes. I don't want my new value to be changed by the next rule.

I don't know if I loose speed but I should do a deep copy :

list_dest = lista.copy()
list_dest[np.where( lista < 0 )] = 0
list_dest[np.where((0 <= lista) & (lista <= 1)) ] = 1
list_dest[np.where((1 < lista) & (lista <= 5)) ] = 2
list_dest[np.where( 5 < lista )] = 3

Question 13

Here you have a simple python script for reclassification, I wrote it and it works for me:

from osgeo import gdal
driver = gdal.GetDriverByName('GTiff')
file = gdal.Open('/home/user/workspace/raster.tif')
band = file.GetRasterBand(1)
lista = band.ReadAsArray()
# reclassification
for j in range(file.RasterXSize):
 for i in range(file.RasterYSize):
 if lista[i,j] < 200:
 lista[i,j] = 1
 elif 200 < lista[i,j] < 400:
 lista[i,j] = 2
 elif 400 < lista[i,j] < 600:
 lista[i,j] = 3
 elif 600 < lista[i,j] < 800:
 lista[i,j] = 4
 else:
 lista[i,j] = 5
# create new file
file2 = driver.Create( 'raster2.tif', file.RasterXSize , file.RasterYSize , 1)
file2.GetRasterBand(1).WriteArray(lista)
# spatial ref system
proj = file.GetProjection()
georef = file.GetGeoTransform()
file2.SetProjection(proj)
file2.SetGeoTransform(georef)
file2.FlushCache()

Just change the ranges.

I hope it will help.

Question 14

You should close file2 either with del file2 or file2 = None to make sure it gets written to disk. .FlushCache() only influences GDALs internal tile cache.

Question 15

@dmh126, I modified the ranges and the script works. However, it is not very quick (quick being debatable). The input raster was about 120mb and took about 15 min to complete.With the aid of a commercial remote sensing package it takes seconds. Any recommendations on decreasing processing time?

Question 16

I think that multithreading can help. You can try to use all of your cores, there is a question gis.stackexchange.com/questions/162978/…

Question 17

Doesn't make sense to use a double for loop, see answer below

Question 18

Right, the double loop and per-element reclassification is the slowest of all possible ways to do this. Use the powerful parts of numpy like ufuncs: docs.scipy.org/doc/numpy-1.10.1/reference/ufuncs.html.

Question 19

In some cases, numpy digitize can be useful to quickly go from ranges to bins.

import rasterio
import numpy as np
with rasterio.open('my_raster.tif') as src: 
 array = src.read()
 profile = src.profile
 bins = np.array([-1.,-0.7,-0.4, 0.2, 1]) 
 inds = np.digitize(array, bins)
with rasterio.open('output_raster.tif', 'w', **profile) as dst:
 dst.write(inds)

Question 20

Here is another Rasterio approach that I hacked together using the Rasterio Cookbook and @Mattijn's answer.

import rasterio
import numpy as np
with rasterio.open('input_raster.tif') as src: 
 # Read as numpy array
 array = src.read()
 profile = src.profile
 # Reclassify
 array[np.where(array == 0)] = 4 
 array[np.where(array == 2)] = 1
 # and so on ... 
with rasterio.open('output_raster.tif', 'w', **profile) as dst:
 # Write to disk
 dst.write(array)

Question 21

Apply a dictionary directly to array: https://stackoverflow.com/questions/16992713/translate-every-element-in-numpy-array-according-to-key

source_raster = rio.open(r'source_raster.tif')
my_dict = {source_raster.nodata: 4, 
0:4, 
1:1,
2:2,
3:3,
4:3, 
5:4,
6:5,
7:5,
8:6,
9:7,
10:8,
100: source_raster.nodata,
255: source_raster.nodata}
source_array = source_raster.read(1)
target_array = np.vectorize(my_dict.__getitem__)(source_array)
profile = source_raster.profile()
with rio.open(r'target_raster.tif', 'w', **profile) as dest:
 dest.write(target_array, 1)

If the source raster is too large for reading into memory, use rasterio windows to serialize the process. Read a windowed portion from the source raster, apply the dict and write the window-content to the target file.

Question 22

I liked the idea of using np.vectorize but it turns out for key, value in reclass_dct.items(): array[array == key] = value is a fair amount faster.

Question 23

With raster RGB color table support :

import numpy as np
from osgeo import gdal
path_inDs = "/data/OCS_2016.extract.tif"
path_outDs = "/data/OCS_2016.postpython.tif"
driver = gdal.GetDriverByName('GTiff')
file = gdal.Open(path_inDs)
if file is None:
 print ('Could not open image file')
 sys.exit(1)
band = file.GetRasterBand(1)
lista = band.ReadAsArray()
# reclassification
lista[np.where(lista == 31)] = 16
# create new file
file2 = driver.Create(path_outDs, file.RasterXSize , file.RasterYSize , 1, gdal.GPI_RGB)
file2.GetRasterBand(1).WriteArray(lista)
# spatial ref system
proj = file.GetProjection()
georef = file.GetGeoTransform()
meta = file.GetMetadata()
colors = file.GetRasterBand(1).GetRasterColorTable()
file2.SetProjection(proj)
file2.SetGeoTransform(georef)
file2.SetMetadata(meta)
file2.GetRasterBand(1).SetRasterColorTable(colors)
file2.FlushCache()
del file2

Question 24

A slightly different alternative could be the following:

import numpy as np
from osgeo import gdal
original = gdal.Open('**PATH**\\origianl_raster.tif')
# read the original file
band = original.GetRasterBand(1) # assuming that the file has only one band
band_array = band.ReadAsArray()
#create a new array with reclassified values
new_array = np.where(band_array == np.nan, 4, 
 np.where(band_array == 0, 4,
 np.where(band_array == 1, 1,
 np.where(band_array == 2, 2,
 np.where(band_array == 3, 3,
 np.where(band_array == 4, 3, 
 np.where(band_array == 5, 4,
 np.where(band_array == 6, 5,
 np.where(band_array == 7, 5,
 np.where(band_array == 8, 6, 
 np.where(band_array == 9, 7,
 np.where(band_array == 10, 8,
 np.where(band_array == 100, np.nan, np.nan))))))))))))) 
 # the last line also includes values equal to 255, as they are the only ones left
# create and save reclassified raster as a new file
outDs = gdal.GetDriverByName('GTiff').Create("**PATH**\\reclassified_raster.tif", original.RasterXSize, original.RasterYSize, 1, gdal.GDT_Float32)
outBand = outDs.GetRasterBand(1)
outBand.WriteArray(new_array)
outDs.SetGeoTransform(original.GetGeoTransform())
outDs.SetProjection(original.GetProjection())
# flush cache
outDs.FlushCache()

This script plays with numpy.where (https://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html): in all the steps apart from the last one, instead of returning a value when the condition is not satisfied, it returns another np.where. And it keeps going until all possible values of the raster are considered.

Question 25

Here's an approach for arbitrary reclassification of integer rasters that avoids using a million calls to np.where. Rasterio bits taken from @Aaron's answer:

import rasterio
import numpy as np
# Build a "lookup array" where the index is the original value and the value
# is the reclassified value. Setting all of the reclassified values is cheap 
# because the memory is only allocated once for the lookup array.
lookup = np.arange(255, dtype=np.uint8)
lookup[5] = 10
lookup[6] = 100
lookup[7] = 200
with rasterio.open('input_raster.tif') as src:
 # Read as numpy array
 array = src.read()
 profile = src.profile
 # Reclassify in a single operation using broadcasting
 array = lookup[array]
with rasterio.open('output_raster.tif', 'w', **profile) as dst:
 # Write to disk
 dst.write(array)

Question 26

I combined @dmh126, @mattijn, and @etrimaille way

import numpy as np
from osgeo import gdal
driver = gdal.GetDriverByName('GTiff')
file = gdal.Open('example.tif')
band = file.GetRasterBand(1)
lista = band.ReadAsArray()
# reclassification
list_dest = lista.copy()
list_dest[np.where( lista < 0 )] = 0
list_dest[np.where((0 <= lista) & (lista <= 1)) ] = 1
list_dest[np.where((1 < lista) & (lista <= 5)) ] = 2
list_dest[np.where( 5 < lista )] = 3
# create new file
file2 = driver.Create( 'example_reclassed.tif', file.RasterXSize , file.RasterYSize , 1)
file2.GetRasterBand(1).WriteArray(list_dest)
# spatial ref system
proj = file.GetProjection()
georef = file.GetGeoTransform()
file2.SetProjection(proj)
file2.SetGeoTransform(georef)
file2.FlushCache()

Mattijn Mattijn 1,0369 silver badges15 bronze badges · Accepted Answer · 2016-01-21 08:56:39Z

Instead of doing the reclassification as a double for loop described by dmh126, do it using np.where:

# reclassification 
lista[np.where( lista < 200 )] = 1
lista[np.where((200 < lista) & (lista < 400)) ] = 2
lista[np.where((400 < lista) & (lista < 600)) ] = 3
lista[np.where((600 < lista) & (lista < 800)) ] = 4
lista[np.where( lista > 800 )] = 5

On an array of 6163 by 3537 pixels (41.6mb) the classification is done in 1.59 seconds, where it takes 12min 41s using the double for loop. In total just a speedup of 478x.

Bottomline, never use a double for loop using numpy

Thanks for the hint, but I think that will lead into a problem if the input classes overlap with the output classes. I don't want my new value to be changed by the next rule.
So check my answer below : gis.stackexchange.com/questions/163007/…

Stack Exchange Network

Raster reclassify using python, gdal and numpy

11 Answers 11

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Raster reclassify using python, gdal and numpy

11 Answers 11

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions