I have two Pandas DataFrames containing "lat" and "long" coordinates. I'd like to do a spatial join and merge columns from one DataFrame into another.
import pandas as pd

df1 = pd.DataFrame(data={
    'name': ['post', 'sutter', 'oak'],
    'Lat': [37.788151, 37.789551, 37.815730],
    'Long': [-122.407570, -122.408302, -122.288810]
})
df2 = pd.DataFrame(data={
    'id': [0, 1, 2],
    'col1': ['xx', 'yy', 'zz'],
    'Lat': [37.787994, 37.789575, 37.813122],
    'Long': [-122.407419, -122.408312, -122.288810]
})
When a match is found based on the "lat", "long" coordinates, the join output would look like this:
name Lat Long col1
post 37.788151 -122.407570 xx
sutter 37.789551 -122.408302 NaN
oak 37.815730 -122.288810 NaN
I'm open to ideas on how to implement this. Should I use a spatial join, or maybe a reverse geocoding API to get addresses from "Lat" and "Long" and then join on them?
2 Answers
- Create Pandas DataFrame
- Create GeoPandas DataFrame using #1
- Create Buffer for Points
- sjoin both GeoDataFrames
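import geopandas as gpd
import pandas as pd

df1 = pd.DataFrame({
    'name': ['post', 'sutter', 'oak'],
    'Lat': [37.788151, 37.789551, 37.815730],
    'Long': [-122.407570, -122.408302, -122.288810]
})
df2 = pd.DataFrame({
    'id': [0, 1, 2],
    'col1': ['xx', 'yy', 'zz'],
    'Lat': [37.787994, 37.789575, 37.813122],
    'Long': [-122.407419, -122.408312, -122.288810]
})
# Build point geometries from the longitude (x) and latitude (y) columns
gdf1 = gpd.GeoDataFrame(
    df1, geometry=gpd.points_from_xy(df1['Long'], df1['Lat']))
gdf2 = gpd.GeoDataFrame(
    df2, geometry=gpd.points_from_xy(df2['Long'], df2['Lat']))
# Replace each df2 point with a circle around it; the radius is in
# coordinate units, which are degrees here
gdf2['geometry'] = gdf2.geometry.buffer(0.001)
# Left spatial join: keep every df1 row and attach the df2 rows whose
# buffer intersects the df1 point
gdf1.sjoin(gdf2, how="left")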
Note that the join is completely dependent on your buffer size. Make sure to tune it according to your needs.
Working copy can be found here
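To illustrate how much the outcome depends on the chosen radius, here is a minimal sketch that repeats the join for several radii given in metres. It is not the GeoPandas approach from this answer: it swaps in a plain pandas cross join and a haversine distance on a spherical Earth, which is only practical for small frames. The names haversine_m, join_within and EARTH_RADIUS_M are made up for this illustration.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius in metres (spherical model)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between points given in degrees."""
    lat1, lon1, lat2, lon2 = (np.radians(v) for v in (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

df1 = pd.DataFrame({
    'name': ['post', 'sutter', 'oak'],
    'Lat': [37.788151, 37.789551, 37.815730],
    'Long': [-122.407570, -122.408302, -122.288810]
})
df2 = pd.DataFrame({
    'id': [0, 1, 2],
    'col1': ['xx', 'yy', 'zz'],
    'Lat': [37.787994, 37.789575, 37.813122],
    'Long': [-122.407419, -122.408312, -122.288810]
})

# Pair every df1 row with every df2 row (quadratic, so small frames only)
pairs = df1.merge(df2, how="cross", suffixes=("", "_2"))
pairs["dist_m"] = haversine_m(pairs["Lat"], pairs["Long"],
                              pairs["Lat_2"], pairs["Long_2"])

def join_within(radius_m):
    """Left-join df1 to every df2 row that lies within radius_m metres."""
    hits = pairs.loc[pairs["dist_m"] <= radius_m, ["name", "col1", "dist_m"]]
    return df1.merge(hits, on="name", how="left")

# Compare the matches obtained with a tight, a medium and a loose radius
for radius in (10, 50, 300):
    print(radius, join_within(radius)[["name", "col1"]].to_dict("records"))
```

With this sample data, a 10 m radius should match only "sutter", 50 m should match "post" and "sutter", and 300 m should also match "oak" while giving "post" and "sutter" two matches each, so a loose radius produces duplicate rows.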
- For the working space, a person will need a Google account. Commented Apr 6, 2022 at 6:53
- My df1 contains 15M entries and has a size of 700MB. Just trying to perform gdf1 = gpd.GeoDataFrame(df1, geometry=gpd.points_from_xy(df1['Long'], df1['Lat'])) crashes my kernel. Any suggestions for a more memory-efficient way to execute this task? Thanks. – NeStack, Commented Nov 15, 2022 at 19:54
Another solution, already mentioned by the OP, is to use reverse geocoding. One caveat: the quality of the result depends strongly on the geocoder.
Here the Nominatim geocoder from the GeoPy geocoding Python library was used (any other geocoder could be chosen); for more details, please check the documentation. Also, the coordinates of point features should be passed as a pair, otherwise you may get this error: ValueError: Must be a coordinate pair or Point. Therefore from geopy.point import Point was additionally imported.
When using this code:
import numpy as np
import pandas as pd
from geopy.geocoders import Nominatim
from geopy.point import Point

geolocator = Nominatim(user_agent="test")

def reverse_geocoding(lat, lon):
    # Return the place_id of the reverse-geocoded point, or None on failure
    try:
        location = geolocator.reverse(Point(lat, lon))
        return location.raw['place_id']
    except Exception:
        return None

df1 = pd.DataFrame(data={
    'name': ['post', 'sutter', 'oak'],
    'Lat': [37.788151, 37.789551, 37.815730],
    'Long': [-122.407570, -122.408302, -122.288810]
})
df2 = pd.DataFrame(data={
    'id': [0, 1, 2],
    'col1': ['xx', 'yy', 'zz'],
    'Lat': [37.787994, 37.789575, 37.813122],
    'Long': [-122.407419, -122.408312, -122.288810]
})
# One geocoder request per row; the 'address' column holds the place_id
df1['address'] = np.vectorize(reverse_geocoding)(df1['Lat'], df1['Long'])
df2['address'] = np.vectorize(reverse_geocoding)(df2['Lat'], df2['Long'])
# Attribute join on the geocoded key
result = pd.merge(df1, df2, how='left', left_on='address', right_on='address')
print(result)
it produces this result:
name Lat_x Long_x address id col1 Lat_y Long_y
0 post 37.788151 -122.407570 127113751 NaN NaN NaN NaN
1 sutter 37.789551 -122.408302 110481100 1.0 yy 37.789575 -122.408312
2 oak 37.815730 -122.288810 114898877 NaN NaN NaN NaN
Note: the join was done on the "place_id" attribute, which is different from "osm_id".
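Each lookup in this approach is a separate request to the geocoding service, so repeated coordinates cost time for no gain. One possible mitigation, sketched under assumptions, is to cache lookups so every distinct coordinate pair is resolved only once. In the sketch, cached_place_id is a hypothetical stand-in for reverse_geocoding that returns a fake identifier instead of contacting Nominatim, and network_calls is a counter added only to show the effect.

```python
from functools import lru_cache

import pandas as pd

network_calls = 0  # counts how often the simulated service is hit

@lru_cache(maxsize=None)
def cached_place_id(lat, lon):
    """Hypothetical stand-in for reverse_geocoding(lat, lon); no network access."""
    global network_calls
    network_calls += 1
    return f"{lat:.6f},{lon:.6f}"  # fake identifier, not a real place_id

# Sample frame with repeated coordinates (assumed data for illustration)
df = pd.DataFrame({
    'Lat': [37.788151, 37.789551, 37.788151, 37.789551],
    'Long': [-122.407570, -122.408302, -122.407570, -122.408302],
})

# Repeated pairs are served from the cache instead of triggering a new lookup
df['address'] = [cached_place_id(lat, lon)
                 for lat, lon in zip(df['Lat'], df['Long'])]
print(network_calls)
```

In real use the body of cached_place_id would call the geocoder; the cache only helps when coordinates repeat exactly, so it does nothing for frames of all-distinct points.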
- @Taras I am trying to implement this on my data; however, it's taking too long to run. My DataFrame has a shape of 2000*80. – kms, Commented Apr 7, 2022 at 5:00
- Is each of the DataFrames 2000*80? Commented Apr 7, 2022 at 5:26
- What about your solution with mapbox? The same efficiency? Commented Apr 7, 2022 at 5:27
- For performance efficiency, I suggest asking a new question. Commented Apr 11, 2022 at 11:15