
I have two Pandas DataFrames, each with "Lat" and "Long" coordinate columns. I'd like to do a spatial join and merge columns from one DataFrame into the other.

import pandas as pd

df1 = pd.DataFrame(data={
    'name': ['post', 'sutter', 'oak'],
    'Lat': [37.788151, 37.789551, 37.815730],
    'Long': [-122.407570, -122.408302, -122.288810]
})
df2 = pd.DataFrame(data={
    'id': [0, 1, 2],
    'col1': ['xx', 'yy', 'zz'],
    'Lat': [37.787994, 37.789575, 37.813122],
    'Long': [-122.407419, -122.408312, -122.288810]
})

When a match is found on the "Lat" and "Long" coordinates, the joined output would look like this:

name    Lat        Long         col1
post    37.788151  -122.407570  xx
sutter  37.789551  -122.408302  NaN
oak     37.815730  -122.288810  NaN

I'm open to ideas on how to implement this. Would a spatial join work, or should I use a reverse geocoding API to get addresses from "Lat" and "Long" and then join on those?

Taras
asked Apr 6, 2022 at 1:45
  • Create a GeoDataFrame using points_from_xy, create a buffer around the points (in either GeoDataFrame), and sjoin. Commented Apr 6, 2022 at 4:20
  • I found the best (low error) way to do this was to reverse geocode Lat, Long and join on the address. Commented Apr 6, 2022 at 15:21

2 Answers

  1. Create the Pandas DataFrames
  2. Create GeoPandas GeoDataFrames from the DataFrames in #1
  3. Create a buffer around the points of one GeoDataFrame
  4. sjoin the two GeoDataFrames
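import geopandas as gpd
import pandas as pd

df1 = pd.DataFrame({
    'name': ['post', 'sutter', 'oak'],
    'Lat': [37.788151, 37.789551, 37.815730],
    'Long': [-122.407570, -122.408302, -122.288810]
})
df2 = pd.DataFrame({
    'id': [0, 1, 2],
    'col1': ['xx', 'yy', 'zz'],
    'Lat': [37.787994, 37.789575, 37.813122],
    'Long': [-122.407419, -122.408312, -122.288810]
})

# Build point geometries; x is the longitude, y is the latitude
gdf1 = gpd.GeoDataFrame(
    df1, geometry=gpd.points_from_xy(df1['Long'], df1['Lat']))
gdf2 = gpd.GeoDataFrame(
    df2, geometry=gpd.points_from_xy(df2['Long'], df2['Lat']))

# Buffer the points of one GeoDataFrame; no CRS is set, so 0.001 is in degrees
gdf2['geometry'] = gdf2.geometry.buffer(0.001)

# Left join: keep every row of gdf1 and attach the gdf2 rows whose buffer intersects the point
result = gdf1.sjoin(gdf2, how="left")
print(result)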

Note that the join is completely dependent on your buffer size. Make sure to tune according to your needs.
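
To make the tolerance easier to reason about: the 0.001 buffer above is in degrees, which is roughly 111 m north-south but only about 88 m east-west at this latitude. Below is a minimal sketch of an alternative that expresses the tolerance in metres using the haversine formula with plain NumPy and Pandas. The names haversine_m, nearest_within and the 50 m threshold are my own illustrative choices, not part of GeoPandas, and the full pairwise distance matrix only makes sense for small DataFrames.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; a spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres; inputs are in degrees and broadcastable."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))


def nearest_within(left, right, max_m, cols):
    """Attach `cols` from the nearest row of `right` if it lies within max_m metres."""
    # Pairwise distance matrix of shape (len(left), len(right))
    d = haversine_m(
        left['Lat'].to_numpy()[:, None], left['Long'].to_numpy()[:, None],
        right['Lat'].to_numpy()[None, :], right['Long'].to_numpy()[None, :],
    )
    idx = d.argmin(axis=1)                    # position of the nearest right row
    dist = d[np.arange(len(left)), idx]       # distance to that row
    out = left.copy()
    out['dist_m'] = dist
    for c in cols:
        nearest_vals = pd.Series(right[c].to_numpy()[idx], index=out.index)
        out[c] = nearest_vals.where(dist <= max_m)  # NaN when too far away
    return out


df1 = pd.DataFrame({
    'name': ['post', 'sutter', 'oak'],
    'Lat': [37.788151, 37.789551, 37.815730],
    'Long': [-122.407570, -122.408302, -122.288810],
})
df2 = pd.DataFrame({
    'id': [0, 1, 2],
    'col1': ['xx', 'yy', 'zz'],
    'Lat': [37.787994, 37.789575, 37.813122],
    'Long': [-122.407419, -122.408312, -122.288810],
})

result = nearest_within(df1, df2, max_m=50, cols=['col1'])
print(result)
```

With a 50 m threshold, "post" and "sutter" each pick up their nearest neighbour, while "oak" stays unmatched because its nearest point in df2 is about 290 m away.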

Working copy can be found here

answered Apr 6, 2022 at 4:32
  • To open the working copy, a person will need a Google account. Commented Apr 6, 2022 at 6:53
  • My df1 contains 15M entries and has a size of 700MB. Just trying to perform gdf1 = gpd.GeoDataFrame(df1, geometry=gpd.points_from_xy(df1['Long'], df1['Lat'])) crashes my kernel. Any suggestions for a more memory-efficient way to execute this task? Thanks. Commented Nov 15, 2022 at 19:54

Another solution, already mentioned by the OP, is to use reverse geocoding. One caveat: the quality of the result depends strongly on the geocoder.

Here the Nominatim geocoder from the GeoPy geocoding library is used (any other geocoder can be chosen); for more details, please check the GeoPy documentation. The coordinates of each point must be passed as a pair, otherwise you may get the error ValueError: Must be a coordinate pair or Point. That is why from geopy.point import Point is also imported.

When using this code:

import numpy as np
import pandas as pd
from geopy.geocoders import Nominatim
from geopy.point import Point

geolocator = Nominatim(user_agent="test")


def reverse_geocoding(lat, lon):
    """Return the Nominatim place_id for a coordinate pair, or None on failure."""
    try:
        location = geolocator.reverse(Point(lat, lon))
        return location.raw['place_id']
    except Exception:
        # Covers network errors and the case where no location is found
        return None


df1 = pd.DataFrame(data={
    'name': ['post', 'sutter', 'oak'],
    'Lat': [37.788151, 37.789551, 37.815730],
    'Long': [-122.407570, -122.408302, -122.288810]
})
df2 = pd.DataFrame(data={
    'id': [0, 1, 2],
    'col1': ['xx', 'yy', 'zz'],
    'Lat': [37.787994, 37.789575, 37.813122],
    'Long': [-122.407419, -122.408312, -122.288810]
})

# One reverse-geocoding request per row
df1['address'] = np.vectorize(reverse_geocoding)(df1['Lat'], df1['Long'])
df2['address'] = np.vectorize(reverse_geocoding)(df2['Lat'], df2['Long'])

# Join on the geocoded identifier instead of the raw coordinates
result = pd.merge(df1, df2, how='left', left_on='address', right_on='address')
print(result)

it produces this result:

     name      Lat_x      Long_x    address   id col1      Lat_y      Long_y
0    post  37.788151 -122.407570  127113751  NaN  NaN        NaN         NaN
1  sutter  37.789551 -122.408302  110481100  1.0   yy  37.789575 -122.408312
2     oak  37.815730 -122.288810  114898877  NaN  NaN        NaN         NaN

Note: the join was done on the "place_id" attribute, which is different from "osm_id".
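
On larger DataFrames this gets slow, because every row triggers its own network request and the public Nominatim service asks for no more than one request per second. One thing that may help when coordinates repeat is to geocode each distinct pair only once. This is only a sketch: geocode_unique is a hypothetical helper name, fake_reverse is a stand-in used so the example runs offline (in practice the reverse-geocoding function would be passed instead), and rows are assumed to have no missing coordinates.

```python
import pandas as pd


def geocode_unique(df, reverse_fn, lat_col='Lat', lon_col='Long'):
    """Call reverse_fn once per distinct (lat, lon) pair and map the results back to rows."""
    pairs = df[[lat_col, lon_col]].drop_duplicates()
    lookup = {
        (lat, lon): reverse_fn(lat, lon)
        for lat, lon in pairs.itertuples(index=False, name=None)
    }
    return pd.Series(
        [lookup[(lat, lon)] for lat, lon in zip(df[lat_col], df[lon_col])],
        index=df.index,
    )


# Stand-in for a real geocoder call; it records how often it is invoked
calls = []


def fake_reverse(lat, lon):
    calls.append((lat, lon))
    return f"{lat:.3f},{lon:.3f}"


df = pd.DataFrame({
    'Lat': [37.788151, 37.788151, 37.815730],
    'Long': [-122.407570, -122.407570, -122.288810],
})
df['address'] = geocode_unique(df, fake_reverse)
print(df)
print(len(calls), "lookups for", len(df), "rows")
```

This does not help if every coordinate pair is unique; in that case the request rate of the geocoding service remains the limiting factor.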

answered Apr 6, 2022 at 6:50
  • @Taras I am trying to implement this on my data, however, it's taking too long to run. My DataFrame has a shape of 2000*80. Commented Apr 7, 2022 at 5:00
  • Is each of the DataFrames 2000*80? Commented Apr 7, 2022 at 5:26
  • What about your solution with Mapbox? Does it have the same efficiency? Commented Apr 7, 2022 at 5:27
  • For performance and efficiency issues, I suggest asking a new question. Commented Apr 11, 2022 at 11:15
