2

I am building a simple random forest classification model. This requires samples of data points, in this case, the stacked band values at different coordinate (x, y) points. I am using a GeoPandas GeoDataFrame, which contains the band values for a given coordinate as a sample (row).

The following steps are taken to achieve this:

  1. Load the raster (.tiff) using Rasterio [bands: b02, b03, ... b8a, b11, b12]

  2. Load the vector (.shp) using GeoPandas [columns: "geometry", "field_id", "field_type"]

  3. Extract the coordinates of each point in the vector file to a list

  4. Extract the band value at each coordinate point in the raster file, then add to a GeoDataFrame

  5. Combine the vector and raster data (now a geodataframe) into a single geodataframe.

This is the code I have created:

We first grab the coordinates from our vector file (point geometry type - generated via the centroid generation tool)

def geo_point_extraction(gdf: gpd.GeoDataFrame, xy_coord_column: str):
 # This function will return a list of (x,y) coordinates extracted from the given gdf argument
 if xy_coord_column not in gdf.columns:
 raise ValueError(f"The given xy_coord_column {xy_coord_column} is not present in the given gdf.")
 # Check the type of the first element in the specified column
 
 gp_list = gdf[xy_coord_column].tolist()
 
 return gp_list
test_list = geo_point_extraction(gdf=gdf, xy_coord_column='geometry')

Then we use the list of coordinate points to grab the raster values for each band and save this in a GeoDataFrame.

def raster_data_extraction(raster_file_path: str, gp_list: list):
 """
 Definition:
 This function extracts (band) data from raster objects and returns the data as a geodataframe.
 Parameters (rstr, gp_list):
 raster_file_path:
 A raster file path to the raster that should have data extracted.
 gp_list:
 A list containing the coordinate points from which the data should be extracted.
 Returns:
 combi_raster_vector_gdf:
 A geodataframe containing the data of the desired raster points.
 """
 # Gdf to hold values
 gdf = gpd.GeoDataFrame()
 gdf['geometry'] = gp_list
 with rasterio.open(raster_file_path) as rstr:
 #### Need update_tags function
 # Count # of bands
 band_count = rstr.count
 for band in range(1, band_count+1):
 #for band in range(1, 2):
 print(f"Currently on band: {band}")
 band_values = []
 # Extract coordinates for extracting band values
 coordinates = [(geom.x, geom.y) for geom in gdf['geometry']]
 
 # Extract and append band values
 for crd in coordinates:
 
 x_crd, y_crd = crd
 band_data = rstr.read(band)
 row, col = rasterio.transform.rowcol(rstr.transform, x_crd, y_crd)
 
 band_values.append(band_data[row, col])
 # Update gdf with new band and corresponding values 
 gdf[band_names[band-1]] = band_values
 return gdf
band_names = src.descriptions
combined_raster_vector_data = raster_data_extraction(raster_file_path=raster_path, gp_list=test_list)
combined_raster_vector_data.head()

While this system works, it seems like a rather roundabout way to generate the geodataframe. Is there a package in Python or a tool/plugin in QGIS that will create a GeoDataFrame/vector layer containing rows for a given raster value, where each row represents a pixel in the raster and the columns are the individual band values?

Taras
35.8k5 gold badges77 silver badges151 bronze badges
asked Mar 27 at 14:42
3
  • 1
    Or are you asking how to create a point df where each raster pixel is a point? Commented Mar 31 at 18:07
  • That was my original intent, however I am not sure this is the best approach. As an example: I have a raster file with 10 bands. I also have a random forest classifier model, and I wish to have it predict a class for each pixel coordinate (x,y) using the each of the 10 bands as a feature. To accomplish this, I need to convert each pixel into a row in a dataframe, with each column representing a band. My current effort is now involving rioxarray, as I thought this would be an easy way to obtain the pixel values of each pixel. I will make a new post on this topic. Thanks 4 followin up! Commented Apr 1 at 9:05
  • 1
    Can you keep the data in numpy array format? You can apply the random forest model on the array, and get the results as an array? Then you can save the array to raster. If not this might help: gis.stackexchange.com/questions/388047/… Commented Apr 1 at 9:35

1 Answer 1

3

Try the rasterio.sample.sample_gen function

Sample pixels from a dataset

I have a point dataframe and a raster with four bands.

import rasterio
import geopandas as gpd
raster_file = r"C:\GIS\GIStest\rasters\Sentinel2_B2348.tif"
point_shape = r"C:\GIS\GIStest\rasters1円k_random_points.shp"
point_df = gpd.read_file(point_shape)
point_coords = list(zip(point_df.geometry.x, point_df.geometry.y)) #List of coordinate tuples
#Sample the raster at the coordinate locations
with rasterio.open(raster_file) as src:
 samples = list(rasterio.sample.sample_gen(dataset=src,
 xy=point_coords,
 indexes=[1,2,3,4]))
# samples[:2] #A list of numpy arrays
# [array([1161., 1153., 1161., 1270.], dtype=float32),
# array([1574., 1708., 1797., 3620.], dtype=float32)]
samples_df = gpd.pd.DataFrame(data=samples, columns=["band2", "band3", "band4", "band8"])
# samples_df.head(2)
# band2 band3 band4 band8
# 0 1161.0 1153.0 1161.0 1270.0
# 1 1574.0 1708.0 1797.0 3620.0
point_df = gpd.pd.concat([point_df, samples_df], axis=1)
point_df.head(2)
# id geometry band2 band3 band4 band8
# 0 0.0 POINT (508046.105 6789971.712) 1161.0 1153.0 1161.0 1270.0
# 1 1.0 POINT (521668.296 6784963.290) 1574.0 1708.0 1797.0 3620.0

But you can keep your data in raster format like this. I have a point dataset with the attribute class which is 1 for water, 2 for Sparse vegetation, 3 for Dense vegetation and 4 for clearcut forest which I use to train a model on a Sentinel2 satellite image.

import geopandas as gpd
import rasterio
from rasterio.sample import sample_gen
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
truth_df = gpd.read_file(r"C:\GIS\GIStest\sentinel2\truth_data.gpkg")
dataset = rasterio.open(r"C:\GIS\GIStest\sentinel2\T33WWM_b2348_sample.tif")
samples = list(sample_gen(dataset=dataset, 
 xy=zip(truth_df.geometry.x, truth_df.geometry.y)))
df = gpd.pd.concat([truth_df, gpd.pd.DataFrame(data=samples, columns=["b2","b3","b4","b8"])], axis=1)
# df.head(2)
# class geometry b2 b3 b4 b8
# 0 1 POINT (528531.814 7107983.983) 1096 1123 1087 1084
# 1 1 POINT (528523.835 7107968.024) 1094 1084 1084 1083
# Split the data into features (X) and target (y)
X = df[["b2","b3","b4","b8"]]
y = df["class"]
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Fit the model
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
#print("Accuracy:", accuracy)
#Accuracy: 0.9516908212560387

Classify the image:

arr = dataset.read()
#Reshape the raster to 2D
#arr.shape
#(4, 435, 678)
arr = arr.transpose(1, 2, 0).reshape(-1, 4)
predicted = rf.predict(arr)
predicted = predicted.reshape((dataset.height, dataset.width))
out_file = r"C:\GIS\GIStest\sentinel2\predicted.tif"
with rasterio.open(out_file, mode="w", driver="Gtiff",
 width=predicted.shape[1], height=predicted.shape[0],
 count=1, crs=dataset.crs, transform=dataset.transform,
 dtype=predicted.dtype) as dst:
 dst.write(predicted, 1)
del(dataset)

enter image description here

answered Mar 28 at 6:16

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.