Problem statement
I am aware that geopandas (due to the limitations of shapely and pygeos) does not read geometries with M-values - when reading a layer that has them, geopandas now just drops those values.
(For historical reference: Until recently, when reading 2D features with M-values, geopandas would mistake the M-value for Z coordinate values - it was a known shortcoming related to a bug way upstream in the dependency tree: GEOS->pygeos->shapely->geopandas. That bug has been recently fixed in GEOS and the update has already been pushed downstream to pygeos. The folks at shapely are still working on updates to make M-values work with shapely, as seen here and here).
That being the case, how can I quickly extract the geometries from a feature layer in a way that retains the M-values? My goal is to be able to easily access the geometry's coordinates and M-values. I'm working on linear referencing-related stuff, in case that matters.
To put some constraints on the problem: suppose that I'm working exclusively with Feature Layers inside a GeoDataBase (.gdb) and that the layers contain exclusively either LineString
or MultiLineString
geometries.
What I've done so far
I've developed a function/method called get_geoms
(described below) that uses the osgeo/ogr library to read in a layer's rows and store their WKT, WKB and OGR geometries in a pandas DataFrame:
from osgeo import ogr
import pandas as pd
def get_geoms(inGDBPath,
inLayerName):
'''
Parameters
----------
inGDBPath : str or pathlib.Path
Path to the GDB file where the feature layer is stored.
inLayerName : str
name of the layer where the features are stored.
Returns
-------
features_df : pandas.DataFrame
DataFrame that contains four columns:
FID : a "Field ID" column. The name of this column actually depends
on the column name used in the original layer. The column
will typically contain integers from 1 to N.
WKT : a column of STR values containing the well-known-text (WKT)
version of the geometry.
WKB : a column of BYTEARRAY values containing the well-known-binary
(WKB) version of the geometry.
OGR_Geom : a column of osgeo.ogr.Geometry values containing the
OGR geometries of each feature in the layer. This also
includes the M-values in cases where they are present.
'''
# use OGR specific exceptions
ogr.UseExceptions()
# Definitions for input file name and layer name
inDriverName = "OpenFileGDB"
inDriver = ogr.GetDriverByName(inDriverName)
inDataSource = inDriver.Open(inGDBPath, 0)
inLayer = inDataSource.GetLayerByName(inLayerName)
inLayerIDColname = inLayer.GetFIDColumn()
# Making sure the feature reader is reset
inLayer.ResetReading()
# Generating an empty list to hold all the features
features_list = []
# Iterating over every feature in the input layer
for this_inFeature in inLayer:
# Extracting the input feature's ID, geometry, WKT and WKB
this_FID = this_inFeature.GetFID()
this_inGeom = this_inFeature.GetGeometryRef()
this_inWkt = this_inGeom.ExportToIsoWkt()
this_inWkb = this_inGeom.ExportToIsoWkb()
# Copying the geometry
this_inGeom_copy = ogr.CreateGeometryFromWkb(this_inWkb)
# Appending this feature to the list
features_list.append({inLayerIDColname:this_FID,
'WKT':this_inWkt,
'WKB':this_inWkb,
'OGR_Geom':this_inGeom_copy})
# Releasing the input Feature
this_inFeature = None
# Releasing the input file
inDataSource = None
inLayer = None
features_df = pd.DataFrame(features_list).set_index(inLayerIDColname)
return(features_df)
Applying my method to test data
Here is the get_geoms
function being applied to this data (which is a subset of this significantly larger dataset):
import geopandas as gpd
inGDBPath = 'C:/Temp/mytemp.gdb'
inLayerName = 'I20_Temp'
# Loading the GIS layer into a GeoPandas GeoDataFrame
my_gdf = gpd.read_file(filename=inGDBPath,
layer=inLayerName)
# Loading the extra data (WKT, WKB and OGR geometries)
extra_data = get_geoms(inGDBPath=inGDBPath,
inLayerName=inLayerName)
# Merging them into one big GeoDataFrame
full_gdf = my_gdf.merge(extra_data,
how='left',
left_index=True,
right_index=True)
The problems with my approach
First of all, the approach I show above is slow. For this small layer which has 1,354 rows and 202 columns, it takes about 114 ms (timed with the %timeit
magic command). But when applied to the full original dataset which contains 883,837 rows and 202 columns, it takes almost 100 seconds. This is because my implementation relies on slow Python-based loops to read in the data from the GIS layer.
The second problem is that there is no great way to manipulate the data that I've extracted. WKTs and WKBs are only really useful to convert geometries (for example, from OGR geometry to shapely geometry), but they aren't very useful on their own. And it's a pain to use the OGR geometries because they're structured in such an odd way (at least to me). For example, navigating the coordinates of multi-part features is annoying as heck. I even created a function that extracts the (X,Y) coordinates and M-values from the OGR geometries, but my implementation is super clunky.
Back to the main question
So, is there a well-established Python library I could use to quickly extract and easily manipulate geometries from a GIS layer in a way that preserves its M-values?
3 Answers 3
Why not make it simpler (with a list of wkt strings)
my_gdf = gpd.read_file(filename=inGDBPath,layer=inLayerName)
wkt =[]
for this_inFeature in inLayer:
this_inGeom = this_inFeature.GetGeometryRef()
this_inWkt = this_inGeom.ExportToIsoWkt()
wkt.append(this_inWkt)
my_gdf['WKT'] = wkt
my_gdf.WKT.head()
0 MULTILINESTRING M ((-95.999231733 32.62464663 ...
1 MULTILINESTRING M ((-97.934444075 32.710178812...
2 MULTILINESTRING M ((-100.39850042 32.450091716...
3 MULTILINESTRING M ((-96.840064431 32.642864073...
4 MULTILINESTRING M ((-99.677214454 32.460975142...
my_gdf.geometry.head()
0 MULTILINESTRING ((-95.99923 32.62465, -95.9974...
1 MULTILINESTRING ((-97.93444 32.71018, -97.9264...
2 MULTILINESTRING ((-100.39850 32.45009, -100.39...
3 MULTILINESTRING ((-96.84006 32.64286, -96.8400...
4 MULTILINESTRING ((-99.67721 32.46098, -99.6751...
The M values are defined for each point in the geometry (multiple values per line).
inLayer.ResetReading()
# first feature
first = inLayer.GetNextFeature()
mline_geom = first.GetGeometryRef()
print(mline_geom.ExportToIsoWkt())
MULTILINESTRING M ((-95.999231733 32.62464663 516.996,-95.99740274 32.6240395400001 517.111,-95.9889207699999 32.6211937 517.642,-95.988217821 32.620957828 517.686))
for j in range(mline_geom.GetGeometryCount()):
geom = mline_geom.GetGeometryRef(j)
for i, point in enumerate(geom.GetPoints()):
print(geom.GetM(i), point)
516.9959999999992 (-95.99923173299999, 32.62464663000003)
517.1110000000044 (-95.99740273999998, 32.62403954000007)
517.6420000000071 (-95.98892076999994, 32.62119370000005)
517.6860000000015 (-95.98821782099998, 32.62095782800003)
LineString.GetM() will return the value for the first point( point=0), by default.
geomi = mline_geom.GetGeometryRef(0)
geomi.GetM()
516.9959999999992
Building on gene's answer, I've gone ahead and made a couple of more complete functions that do everything that's needed by just using the OGR geometries.
First, I created a function called extract_geometry_info
that takes in one OGR geometry and extracts a bunch of important/valuable information from it. That includes a set of tuples that contain the (X,Y) coordinates and the M-values.
Then, I modified the get_geoms
function so that it applies the get_geoms()
function to every OGR geometry in a given layer.
def extract_geometry_info(ogr_geom):
'''
Extracts all of the coordinates and M-values of a given OGR geometry
Parameters
----------
ogr_geom : osgeo.ogr.geometry
OGR Geometry to be analyzed.
Returns
-------
out_dict : dict
Dictionary containing multiple elements:
geom_type : STR that describes the type of geometry
num_sub_geoms : INT that describes how many sub-geometries are in
the full geometry (i.e., how many parts are in the
multi-part geometry). For LineStrings, this will
default to zero.
num_points : INT or TUPLE that describes the number of points in
the full geometry. For LineStrings, it is just an INT.
For MultiLineStrings, it is a TUPLE of INTs.
geom_coords : TUPLE with the geometry's (x,y) coordinates
geom_mvals : TUPLE with the geomety's M-values
geom_coords_and_mvals : TUPLE with the geometry's (x, y, M-value)
data
upcast_geom_data : TUPLE similar to the one in the
geom_coords_and_mvals variable, but when
LineStrings are found, this geometry is "upcast"
to a MultiLineString containing just one
sub-geometry
'''
geom_type = ogr_geom.GetGeometryName()
geom_wkt = ogr_geom.ExportToIsoWkt()
geom_wkb = ogr_geom.ExportToIsoWkb()
if geom_type.lower()[:len('multi')] == 'multi':
geom_coords = []
geom_mvals = []
geom_coords_and_mvals = []
num_sub_geoms = ogr_geom.GetGeometryCount()
num_points = []
for j in range(ogr_geom.GetGeometryCount()):
sub_geom = ogr_geom.GetGeometryRef(j)
sub_geom_coords = tuple([pts for pts in sub_geom.GetPoints()])
sub_geom_mvals = tuple([sub_geom.GetM(i) for i in
range(len(sub_geom_coords))])
sub_geom_coords_and_mvals = tuple([(*sub_geom_coords[i],
sub_geom_mvals[i]) for i in
range(len(sub_geom_coords))])
geom_coords.append(sub_geom_coords)
geom_mvals.append(sub_geom_mvals)
geom_coords_and_mvals.append(sub_geom_coords_and_mvals)
num_points.append(len(sub_geom_mvals))
geom_coords = tuple(geom_coords)
geom_mvals = tuple(geom_mvals)
geom_coords_and_mvals = tuple(geom_coords_and_mvals)
num_points = tuple(num_points)
upcast_geom_data = geom_coords_and_mvals
else:
geom_coords = tuple([pts for pts in ogr_geom.GetPoints()])
geom_mvals = tuple([ogr_geom.GetM(i) for i in range(len(geom_coords))])
geom_coords_and_mvals = tuple([(*geom_coords[i],
geom_mvals[i]) for i in
range(len(geom_coords))])
num_sub_geoms = 0
num_points = len(geom_mvals)
upcast_geom_data = tuple([geom_coords_and_mvals])
out_dict = {'geom_type':geom_type,
'WKB':geom_wkb,
'WKT':geom_wkt,
'num_sub_geoms':num_sub_geoms,
'num_points':num_points,
'geom_coords':geom_coords,
'geom_mvals':geom_mvals,
'geom_coords_and_mvals':geom_coords_and_mvals,
'upcast_geom_data':upcast_geom_data}
return(out_dict)
def get_geoms(inGDBPath,
inLayerName):
'''
Parameters
----------
inGDBPath : str or pathlib.Path
Path to the GDB file where the feature layer is stored.
inLayerName : str
name of the layer where the features are stored.
Returns
-------
features_df : pandas.DataFrame
DataFrame that contains four columns:
FID : a "Field ID" column. The name of this column actually depends
on the column name used in the original layer. The column
will typically contain integers from 1 to N.
WKT : a column of STR values containing the well-known-text (WKT)
version of the geometry.
WKB : a column of BYTEARRAY values containing the well-known-binary
(WKB) version of the geometry.
OGR_Geom : a column of osgeo.ogr.Geometry values containing the
OGR geometries of each feature in the layer. This also
includes the M-values in cases where they are present.
geom_type : a column of STR values that describes the type of
geometry
num_sub_geoms : a column of INT values that describes how many
sub-geometries are in the full geometry
num_points : a column of INT or TUPLE values that describes the
number of points in the full geometries. For
LineStrings, each row contains INTs. For
MultiLineStrings, each row contains a TUPLE of INTs.
geom_coords : a column of tuples with the geometries' (x,y)
coordinates
geom_mvals : a column of tuples with the geometries' M-values
geom_coords_and_mvals : a column of tuples with the geometries'
(x, y, M-value) data
upcast_geom_data : column of tuples similar to the ones in the
geom_coords_and_mvals variable, but when
LineStrings are found, this geometry is "upcast"
to a MultiLineString containing just one
sub-geometry.
'''
# use OGR specific exceptions
ogr.UseExceptions()
# Definitions for input file name and layer name
inDriverName = "OpenFileGDB"
inDriver = ogr.GetDriverByName(inDriverName)
inDataSource = inDriver.Open(inGDBPath, 0)
inLayer = inDataSource.GetLayerByName(inLayerName)
inLayerIDColname = inLayer.GetFIDColumn()
# Making sure the feature reader is reset
inLayer.ResetReading()
# Generating an empty list to hold all the features
features_list = []
# Iterating over every feature in the input layer
for this_inFeature in inLayer:
# Extracting the input feature's ID and geometry
this_FID = this_inFeature.GetFID()
this_geom = this_inFeature.GetGeometryRef()
# Extracting the rest of the geometry's data
this_geom_info = extract_geometry_info(this_geom)
# Appending this feature to the list
features_list.append({
inLayerIDColname:this_FID,
'OGR_Geom':ogr.CreateGeometryFromWkb(this_geom_info['WKB']),
'WKB':this_geom_info['WKB'],
'WKT':this_geom_info['WKT'],
'geom_type':this_geom_info['geom_type'],
'num_sub_geoms':this_geom_info['num_sub_geoms'],
'num_points':this_geom_info['num_points'],
'geom_coords':this_geom_info['geom_coords'],
'geom_mvals':this_geom_info['geom_mvals'],
'geom_coords_and_mvals':this_geom_info['geom_coords_and_mvals'],
'upcast_geom_data':this_geom_info['upcast_geom_data']})
# Releasing the input Feature
this_inFeature = None
# Releasing the input file
inDataSource = None
inLayer = None
# Generating a Pandas DataFrame and setting the Feature ID as the index
features_df = pd.DataFrame(features_list).set_index(inLayerIDColname)
return(features_df)
Example of the extract_geometry_info
function
Here's an example where I apply the extract_geometry_info
described above to OGR geometries containing a LineString
and a MultiLineString
:
# Creating the OGR Geometry
ogr_geom = ogr.CreateGeometryFromWkt('LINESTRING M (100 100 25, 120 105 50, 140 98 55, 160 103 100)')
# Extracting all of the info from the OGR geometry
extract_geometry_info(ogr_geom)
print(extract_geometry_info)
# {'geom_type': 'LINESTRING',
# 'WKB': bytearray(b'\x01\xd2\x07\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00Y@\x00\x00\x00\x00\x00\x00Y@\x00\x00\x00\x00\x00\x009@\x00\x00\x00\x00\x00\x00^@\x00\x00\x00\x00\x00@Z@\x00\x00\x00\x00\x00\x00I@\x00\x00\x00\x00\x00\x80a@\x00\x00\x00\x00\x00\x80X@\x00\x00\x00\x00\x00\x80K@\x00\x00\x00\x00\x00\x00d@\x00\x00\x00\x00\x00\xc0Y@\x00\x00\x00\x00\x00\x00Y@'),
# 'WKT': 'LINESTRING M (100 100 25,120 105 50,140 98 55,160 103 100)',
# 'num_sub_geoms': 0,
# 'num_points': 4,
# 'geom_coords': ((100.0, 100.0),
# (120.0, 105.0),
# (140.0, 98.0),
# (160.0, 103.0)),
# 'geom_mvals': (25.0, 50.0, 55.0, 100.0),
# 'geom_coords_and_mvals': ((100.0, 100.0, 25.0),
# (120.0, 105.0, 50.0),
# (140.0, 98.0, 55.0),
# (160.0, 103.0, 100.0)),
# 'upcast_geom_data': (((100.0, 100.0, 25.0),
# (120.0, 105.0, 50.0),
# (140.0, 98.0, 55.0),
# (160.0, 103.0, 100.0)),)}
# Creating the OGR Geometry
ogr_geom = ogr.CreateGeometryFromWkt('MULTILINESTRING M ((100 100 25, 120 105 50), (140 98 55, 160 103 100))')
# Extracting all of the info from the OGR geometry
extract_geometry_info(ogr_geom)
# {'geom_type': 'MULTILINESTRING',
# 'WKB': bytearray(b'\x01\xd5\x07\x00\x00\x02\x00\x00\x00\x01\xd2\x07\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00Y@\x00\x00\x00\x00\x00\x00Y@\x00\x00\x00\x00\x00\x009@\x00\x00\x00\x00\x00\x00^@\x00\x00\x00\x00\x00@Z@\x00\x00\x00\x00\x00\x00I@\x01\xd2\x07\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x80a@\x00\x00\x00\x00\x00\x80X@\x00\x00\x00\x00\x00\x80K@\x00\x00\x00\x00\x00\x00d@\x00\x00\x00\x00\x00\xc0Y@\x00\x00\x00\x00\x00\x00Y@'),
# 'WKT': 'MULTILINESTRING M ((100 100 25,120 105 50),(140 98 55,160 103 100))',
# 'num_sub_geoms': 2,
# 'num_points': (2, 2),
# 'geom_coords': (((100.0, 100.0),
# (120.0, 105.0)),
# ((140.0, 98.0),
# (160.0, 103.0))),
# 'geom_mvals': ((25.0, 50.0),
# (55.0, 100.0)),
# 'geom_coords_and_mvals': (((100.0, 100.0, 25.0),
# (120.0, 105.0, 50.0)),
# ((140.0, 98.0, 55.0),
# (160.0, 103.0, 100.0))),
# 'upcast_geom_data': (((100.0, 100.0, 25.0),
# (120.0, 105.0, 50.0)),
# ((140.0, 98.0, 55.0),
# (160.0, 103.0, 100.0)))}
This is still not super fast, but it gets the job done.
Warning: no checks are being carried out!
It's worth pointing out that the functions I provide here are not running ANY kind of checks. They simply ASSUME that the input data will be exactly the way it should.
When using this in real world cases, we might need to consider expanding these a bit to check all the inputs. For example, we need to ensure that the geometries are all LineString
or MultiLineString
.
I was following this thread for a while and modified Felipe's code to fit my use case with GeoPandas
and Shapely
.
WARNING
this code works with my work at the moment, but I have not conducted any significant testing, might be buggy. I did my best to validate the input geometries are only LINESTRING
or MULTILINESTRING
import geopandas as gpd
import numpy as np
import shapely
from osgeo import ogr
def get_linestrings_m(gdb_path: str,
layername: str,
file_driver: str = 'OpenFileGdb' # using the open-source filedriver, can use alternatives
)->gpd.GeoDataFrame:
# error checking
driver = ogr.GetDriverByName(file_driver)
if driver is None:
raise ValueError("Driver not found!")
data_source = driver.Open(gdb_path, 0) # 0 is read-mode, 1 write
feature_layer = data_source.GetLayerByName(layername)
if feature_layer is None:
raise ValueError(f"Feature layer named {layername} not found!")
layer_ID_columnname = feature_layer.GetFIDColumn()
layer_spatial_ref = feature_layer.GetSpatialRef() # fetch the spatial reference object for the file
crs = f"EPSG:{layer_spatial_ref.GetAuthorityCode(None)}" # create the EPSG code
feature_layer.ResetReading()
features_list = []
for this_feature in feature_layer:
this_feature_id = this_feature.GetFID()
this_geometry = this_feature.GetGeometryRef() # should be either LINESTRING or MULTILINESTRING
this_line_id = this_feature.GetField('column_name') # update if this name has changed
this_geom_info = get_geometry_info(ogr_geom=this_geometry)
features_list.append({layer_ID_columnname:this_feature_id,
'LINE_ID': this_line_id,
'geometry': this_geom_info['geom_shapely_repr'],
'geometry M-values': this_geom_info['geom_m_values'],
'num_points': this_geom_info['num_points']
})
this_feature = None # reset variable for next iteration
data_source = None # releasing resources
feature_layer = None # releasing resources
features_df = pd.DataFrame(features_list).explode(['geometry', 'geometry M-values']).reset_index(drop=True) # explode the geometry and geometry M-values, these should be the same shape
features_gdf = gpd.GeoDataFrame(data=features_df, geometry=features_df['geometry'], crs=crs)
return features_gdf
# called for each feature in a layer
def get_geometry_info(ogr_geom: ogr.Geometry)->dict:
geom_type = ogr_geom.GetGeometryName() # Geometry Attributes
# error checking
if 'linestring' not in geom_type.lower():
raise ValueError(f"Error! Found {geom_type}\n Only linestring/ multi-linestring supported at this time!")
if 'multi' in geom_type.lower(): # MULTILINESTRING
geom_m_values = []
num_points = []
geom_shapely_repr = []
num_sub_geometries = ogr_geom.GetGeometryCount()
# extend() rather than append() so we only explode once
for i in range(num_sub_geometries):
linestring_info_i = process_linestring_info(ogr_geom.GetGeometryRef(i))
geom_m_values.extend(linestring_info_i['geom_m_values'])
num_points.extend(linestring_info_i['num_points'])
geom_shapely_repr.extend(linestring_info_i['geom_shapely_repr'])
out_dict = {'num_points':num_points,
'geom_shapely_repr':geom_shapely_repr,
'geom_m_values': geom_m_values
}
else: # LINESTRING
out_dict = process_linestring_info(ogr_geom)
return out_dict
# called for each linestring in a feature (linestring or multilinestring)
def process_linestring_info(linestring_geom: ogr.Geometry)->dict:
geom_n_points = linestring_geom.GetPointCount()
geom_coords = np.array(linestring_geom.GetPoints(nCoordDimension=2)) # selecting only x,y dimensions and casting to numpy array (https://gdal.org/java/org/gdal/ogr/Geometry.html#GetPoints(int))
geom_shapely_points = shapely.points(geom_coords) # list of shapely points (https://shapely.readthedocs.io/en/stable/reference/shapely.points.html#shapely.points)
geom_m_values = [linestring_geom.GetM(idx) for idx in range(geom_n_points)] # list of M-valuess
out_dict = {'num_points':[geom_n_points], # casting to a list for compatability with extend
'geom_shapely_repr':geom_shapely_points,
'geom_m_values': geom_m_values
}
return out_dict