I have a dataset of roads that I am trying to clean up, and I am looking for suggestions on workflows that will reduce the manual find-and-remove process.
The dataset is the legacy result of merging a few datasets, some of which shared the same roads but with slight offsets. My guess is that coordinate system differences in the original sources led to linework that is very close but not identical; see the image below for an example.
(image: the same road with a slight offset)
Therefore I cannot use tools like selecting identical features or removing duplicates without somehow accounting for that offset. The attributes are also no help, as they do not contain identical road names or other fields I could use to identify duplicate records. Their shape_length values are usually within a couple of metres of each other, so again similar but not identical.
Any suggestions on how to first select the records in this dataset that are likely offset duplicates? I could then at least flag just those records to weed through, or ideally use another process in QGIS to remove one copy and leave the other.
I am not very Python savvy, so I am hoping for a process focused on geoprocessing tools, or something like exporting the table to Excel, identifying candidates for removal there, joining back to the spatial table, and using that to check candidates.
So far my approach is to add "StartN" and "StartE" fields to the data and calculate geometry on both as Long integers in UTM. The start points of the lines are usually within 1 m of each other, giving me identical coordinate values to identify in Excel and flag, then I rejoin the Excel table to the spatial table, filter for the flag, and assess the shorter list of roads that may be duplicates. It seems to be working OK so far, but any other suggestions on improving or automating it are welcome. – user25644, Mar 12, 2025
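A minimal SpatiaLite sketch of that start-point flagging step, runnable from the QGIS DB Manager so the Excel round trip can be skipped; the layer name roads, the id field and the 1 m rounding are assumptions, so adjust them to your data:

-- Pair up features whose line start points round to the same metre
-- (roads, id and the rounding step are placeholders, not the poster's exact setup)
select a.id as aid,
       b.id as bid
from roads a
join roads b
  on a.id < b.id
 and round(ST_X(ST_StartPoint(a.geometry))) = round(ST_X(ST_StartPoint(b.geometry)))
 and round(ST_Y(ST_StartPoint(a.geometry))) = round(ST_Y(ST_StartPoint(b.geometry)))

As FelixIP notes below, start points miss pairs digitized in opposite directions, and rounding can split two nearly identical coordinates across a metre boundary, so treat the output as candidates to review rather than confirmed duplicates.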
I'd do the same thing, but use midpoints and run Generate Near Table on the layer against itself. That accounts for lines digitized in different directions, and no Excel is needed. – FelixIP, Mar 12, 2025
If you wish to also ask about ArcGIS Pro then please do that in a separate question. – PolyGeo ♦, Mar 12, 2025
@FelixIP how does this work when the features are all in the same feature class? I get a table with a bunch of "0" values when comparing the layer against itself, since everything overlaps. – user25644, Mar 13, 2025
Have a good look at the Generate Near Table options. It is Esri I am talking about. – FelixIP, Mar 13, 2025
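Generate Near Table is an ArcGIS tool, but the same midpoint idea can be sketched in QGIS with a SpatiaLite query in the DB Manager; the layer name roads, the id field and the 5 m cut-off below are assumptions, not part of the commenters' workflow:

-- Distance between line midpoints for nearby feature pairs
select a.id as aid,
       b.id as bid,
       ST_Distance(ST_Line_Interpolate_Point(a.geometry, 0.5),
                   ST_Line_Interpolate_Point(b.geometry, 0.5)) as middist
from roads a
join roads b
  on a.id < b.id
 and ST_Distance(ST_Line_Interpolate_Point(a.geometry, 0.5),
                 ST_Line_Interpolate_Point(b.geometry, 0.5)) < 5
order by middist

Because a.id < b.id is used, each pair appears once and a feature is never compared with itself, which avoids the all-zero distances mentioned above.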
1 Answer
A bit complicated, but you can try comparing the geometries with the Hausdorff distance using SQL in the DB Manager. Similar geometries will get a small value.
Join the layer to itself on features that are within a certain distance of each other. You need a unique id field; mine is named id, and my layer is named Merged.
select
    row_number() over() as id,                                -- running id for the result rows
    a.id as aid,
    b.id as bid,
    HausdorffDistance(a.geometry, b.geometry) as hausdiff,    -- small value = near-duplicate geometry
    a.geometry
from "Merged" a
join "Merged" b
    on st_distance(a.geometry, b.geometry) < 20               -- only compare features within 20 m
   and a.id < b.id                                            -- skip self-pairs and mirrored pairs
order by a.id, b.id, hausdiff
When lines 1 and 2 are compared (green arrows) they get a value of 31; from the map I can see they are not duplicates, but 1 and 281 are, and their distance is just 2.8.
By checking a sample of lines I conclude that lines with a value below 10 are duplicates. I add a where clause to the query, where hausdiff < 10, and load the layer into the project.
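For reference, the filtered version of the query looks roughly like this; the Hausdorff expression is repeated in the where clause rather than reusing the hausdiff alias, which keeps it valid across SQL dialects, and the 10 m cut-off is the sample-checked value from above:

select
    row_number() over() as id,
    a.id as aid,
    b.id as bid,
    HausdorffDistance(a.geometry, b.geometry) as hausdiff,
    a.geometry
from "Merged" a
join "Merged" b
    on st_distance(a.geometry, b.geometry) < 20
   and a.id < b.id
where HausdorffDistance(a.geometry, b.geometry) < 10
order by a.id, b.id, hausdiff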
Then select by attributes on your original line layer the features that are in the loaded query layer:
array_contains(
    array := aggregate(layer := 'QueryLayer', aggregate := 'array_agg', expression := "aid"),
    value := "id"
)
Start editing and delete the selected features. Because only the aid (lower-id) member of each pair is selected, the other copy of each road is left in place.