I am building a simple random forest classification model. This requires samples of data points, in this case, the stacked band values at different coordinate (x, y) points. I am using a GeoPandas GeoDataFrame, which contains the band values for a given coordinate as a sample (row).
The following steps are taken to achieve this:
Load the raster (.tiff) using Rasterio [bands: b02, b03, ... b8a, b11, b12]
Load the vector (.shp) using GeoPandas [columns:
"geometry"
,"field_id"
,"field_type"
]Extract the coordinates of each point in the vector file to a list
Extract the band value at each coordinate point in the raster file, then add to a GeoDataFrame
Combine the vector and raster data (now a geodataframe) into a single geodataframe.
This is the code I have created:
We first grab the coordinates from our vector file (point geometry type - generated via the centroid generation tool)
def geo_point_extraction(gdf: gpd.GeoDataFrame, xy_coord_column: str):
# This function will return a list of (x,y) coordinates extracted from the given gdf argument
if xy_coord_column not in gdf.columns:
raise ValueError(f"The given xy_coord_column {xy_coord_column} is not present in the given gdf.")
# Check the type of the first element in the specified column
gp_list = gdf[xy_coord_column].tolist()
return gp_list
test_list = geo_point_extraction(gdf=gdf, xy_coord_column='geometry')
Then we use the list of coordinate points to grab the raster values for each band and save this in a GeoDataFrame.
def raster_data_extraction(raster_file_path: str, gp_list: list):
"""
Definition:
This function extracts (band) data from raster objects and returns the data as a geodataframe.
Parameters (rstr, gp_list):
raster_file_path:
A raster file path to the raster that should have data extracted.
gp_list:
A list containing the coordinate points from which the data should be extracted.
Returns:
combi_raster_vector_gdf:
A geodataframe containing the data of the desired raster points.
"""
# Gdf to hold values
gdf = gpd.GeoDataFrame()
gdf['geometry'] = gp_list
with rasterio.open(raster_file_path) as rstr:
#### Need update_tags function
# Count # of bands
band_count = rstr.count
for band in range(1, band_count+1):
#for band in range(1, 2):
print(f"Currently on band: {band}")
band_values = []
# Extract coordinates for extracting band values
coordinates = [(geom.x, geom.y) for geom in gdf['geometry']]
# Extract and append band values
for crd in coordinates:
x_crd, y_crd = crd
band_data = rstr.read(band)
row, col = rasterio.transform.rowcol(rstr.transform, x_crd, y_crd)
band_values.append(band_data[row, col])
# Update gdf with new band and corresponding values
gdf[band_names[band-1]] = band_values
return gdf
band_names = src.descriptions
combined_raster_vector_data = raster_data_extraction(raster_file_path=raster_path, gp_list=test_list)
combined_raster_vector_data.head()
While this system works, it seems like a rather roundabout way to generate the geodataframe. Is there a package in Python or a tool/plugin in QGIS that will create a GeoDataFrame/vector layer containing rows for a given raster value, where each row represents a pixel in the raster and the columns are the individual band values?
-
1Or are you asking how to create a point df where each raster pixel is a point?Bera– Bera2025年03月31日 18:07:32 +00:00Commented Mar 31 at 18:07
-
That was my original intent, however I am not sure this is the best approach. As an example: I have a raster file with 10 bands. I also have a random forest classifier model, and I wish to have it predict a class for each pixel coordinate (x,y) using the each of the 10 bands as a feature. To accomplish this, I need to convert each pixel into a row in a dataframe, with each column representing a band. My current effort is now involving rioxarray, as I thought this would be an easy way to obtain the pixel values of each pixel. I will make a new post on this topic. Thanks 4 followin up!tds– tds2025年04月01日 09:05:26 +00:00Commented Apr 1 at 9:05
-
1Can you keep the data in numpy array format? You can apply the random forest model on the array, and get the results as an array? Then you can save the array to raster. If not this might help: gis.stackexchange.com/questions/388047/…Bera– Bera2025年04月01日 09:35:10 +00:00Commented Apr 1 at 9:35
1 Answer 1
Try the rasterio.sample.sample_gen
function
Sample pixels from a dataset
I have a point dataframe and a raster with four bands.
import rasterio
import geopandas as gpd
raster_file = r"C:\GIS\GIStest\rasters\Sentinel2_B2348.tif"
point_shape = r"C:\GIS\GIStest\rasters1円k_random_points.shp"
point_df = gpd.read_file(point_shape)
point_coords = list(zip(point_df.geometry.x, point_df.geometry.y)) #List of coordinate tuples
#Sample the raster at the coordinate locations
with rasterio.open(raster_file) as src:
samples = list(rasterio.sample.sample_gen(dataset=src,
xy=point_coords,
indexes=[1,2,3,4]))
# samples[:2] #A list of numpy arrays
# [array([1161., 1153., 1161., 1270.], dtype=float32),
# array([1574., 1708., 1797., 3620.], dtype=float32)]
samples_df = gpd.pd.DataFrame(data=samples, columns=["band2", "band3", "band4", "band8"])
# samples_df.head(2)
# band2 band3 band4 band8
# 0 1161.0 1153.0 1161.0 1270.0
# 1 1574.0 1708.0 1797.0 3620.0
point_df = gpd.pd.concat([point_df, samples_df], axis=1)
point_df.head(2)
# id geometry band2 band3 band4 band8
# 0 0.0 POINT (508046.105 6789971.712) 1161.0 1153.0 1161.0 1270.0
# 1 1.0 POINT (521668.296 6784963.290) 1574.0 1708.0 1797.0 3620.0
But you can keep your data in raster format like this. I have a point dataset with the attribute class
which is
1 for water, 2 for Sparse vegetation, 3 for Dense vegetation and 4 for clearcut forest which I use to train a model on a Sentinel2 satellite image.
import geopandas as gpd
import rasterio
from rasterio.sample import sample_gen
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
truth_df = gpd.read_file(r"C:\GIS\GIStest\sentinel2\truth_data.gpkg")
dataset = rasterio.open(r"C:\GIS\GIStest\sentinel2\T33WWM_b2348_sample.tif")
samples = list(sample_gen(dataset=dataset,
xy=zip(truth_df.geometry.x, truth_df.geometry.y)))
df = gpd.pd.concat([truth_df, gpd.pd.DataFrame(data=samples, columns=["b2","b3","b4","b8"])], axis=1)
# df.head(2)
# class geometry b2 b3 b4 b8
# 0 1 POINT (528531.814 7107983.983) 1096 1123 1087 1084
# 1 1 POINT (528523.835 7107968.024) 1094 1084 1084 1083
# Split the data into features (X) and target (y)
X = df[["b2","b3","b4","b8"]]
y = df["class"]
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Fit the model
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
#print("Accuracy:", accuracy)
#Accuracy: 0.9516908212560387
Classify the image:
arr = dataset.read()
#Reshape the raster to 2D
#arr.shape
#(4, 435, 678)
arr = arr.transpose(1, 2, 0).reshape(-1, 4)
predicted = rf.predict(arr)
predicted = predicted.reshape((dataset.height, dataset.width))
out_file = r"C:\GIS\GIStest\sentinel2\predicted.tif"
with rasterio.open(out_file, mode="w", driver="Gtiff",
width=predicted.shape[1], height=predicted.shape[0],
count=1, crs=dataset.crs, transform=dataset.transform,
dtype=predicted.dtype) as dst:
dst.write(predicted, 1)
del(dataset)
Explore related questions
See similar questions with these tags.