Joining two Pandas DataFrames based on lat, long as fields

Question 1

I have two Pandas DataFrames containing "lat" and "long" coordinates. I'd like to do a spatial join and merge columns from one DataFrame into another.

import pandas as pd
df1 = pd.DataFrame(data={
 'name': ['post', 'sutter', 'oak'],
 'Lat': [37.788151, 37.789551, 37.815730],
 'Long': [-122.407570, -122.408302, -122.288810]
 })
df2 = pd.DataFrame(data={
 'id': [0, 1, 2],
 'col1': ['xx','yy','zz'],
 'Lat': [37.787994, 37.789575, 37.813122],
 'Long': [-122.407419, -122.408312, -122.288810]
 })

When a match is found based on the "lat", "long" coordinates, the join / output would look like this:

name    Lat        Long         col1
post    37.788151  -122.407570  xx
sutter  37.789551  -122.408302  NaN
oak     37.815730  -122.288810  NaN

I'm open to ideas on how to implement this. Should I use a spatial join, or use a reverse geocoding API to get addresses from "Lat" and "Long" and then join on them?

Question 2

Create a GeoDataFrame from each DataFrame using points_from_xy, create a buffer around the points (in either GeoDataFrame), and then use sjoin.

Question 3

I found that the best (lowest-error) way to do this was to reverse geocode Lat, Long and join on the address.

Question 4

1. Create the Pandas DataFrames
2. Create GeoPandas GeoDataFrames using #1
3. Create a buffer around the points
4. sjoin both GeoDataFrames

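import geopandas as gpd
import pandas as pd

df1 = pd.DataFrame({
    'name': ['post', 'sutter', 'oak'],
    'Lat': [37.788151, 37.789551, 37.815730],
    'Long': [-122.407570, -122.408302, -122.288810]
})
df2 = pd.DataFrame({
    'id': [0, 1, 2],
    'col1': ['xx', 'yy', 'zz'],
    'Lat': [37.787994, 37.789575, 37.813122],
    'Long': [-122.407419, -122.408312, -122.288810]
})

# Build point geometries; x is longitude, y is latitude
gdf1 = gpd.GeoDataFrame(
    df1, geometry=gpd.points_from_xy(df1['Long'], df1['Lat']))
gdf2 = gpd.GeoDataFrame(
    df2, geometry=gpd.points_from_xy(df2['Long'], df2['Lat']))

# Turn the points of one frame into small polygons (buffer is in degrees here)
gdf2['geometry'] = gdf2.geometry.buffer(0.001)

# Left spatial join: keep every row of gdf1, attach matching gdf2 columns
result = gdf1.sjoin(gdf2, how="left")
print(result)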

Note that the join result depends entirely on the buffer size (here 0.001, which is in degrees because the coordinates are latitude/longitude). Make sure to tune it according to your needs.
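To get a feel for what a buffer size means on the ground, the following minimal sketch estimates the distance in metres between the sample pairs from the question, using only the standard library. The function name `haversine_m` and the mean Earth radius of 6,371,000 m are assumptions made for illustration; they are not part of the answer above.

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Sample points from df1 and df2 in the question, paired row by row
pairs = {
    'post':   ((37.788151, -122.407570), (37.787994, -122.407419)),
    'sutter': ((37.789551, -122.408302), (37.789575, -122.408312)),
    'oak':    ((37.815730, -122.288810), (37.813122, -122.288810)),
}

distances = {name: haversine_m(*p1, *p2) for name, (p1, p2) in pairs.items()}
for name, d in distances.items():
    print(f"{name}: {d:.1f} m")
```

For these sample points, "post" and "sutter" lie within a few tens of metres of their counterparts, while "oak" is a few hundred metres away. Since 0.001 degrees of latitude is roughly 111 m, a buffer of that size should pick up the first two pairs but not the third.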

Working copy can be found here

Question 5

To open the working copy, you will need a Google account.

Question 6

My df1 contains 15M entries and has a size of 700 MB. Just trying to run gdf1 = gpd.GeoDataFrame(df1, geometry=gpd.points_from_xy(df1['Long'], df1['Lat'])) crashes my kernel. Any suggestions for a more memory-efficient way to do this? Thanks.
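One possible direction for large inputs, offered as a sketch and not something proposed in this thread: skip building geometries altogether, index the smaller table in a dictionary keyed by grid cell, and stream the larger table against that index. The names `cell_of`, `build_grid`, and `match_within` are hypothetical, the tolerance is in degrees (like the buffer above), and distances are compared in plain degree space, which is only an approximation of ground distance.

```python
import math
from collections import defaultdict


def cell_of(lat, lon, size):
    """Grid cell (row, col) that contains the point, for a cell size in degrees."""
    return (math.floor(lat / size), math.floor(lon / size))


def build_grid(points, size):
    """Index (lat, lon, payload) tuples by the grid cell they fall into."""
    grid = defaultdict(list)
    for lat, lon, payload in points:
        grid[cell_of(lat, lon, size)].append((lat, lon, payload))
    return grid


def match_within(lat, lon, grid, size):
    """Payload of the nearest indexed point within `size` degrees, else None."""
    row, col = cell_of(lat, lon, size)
    best, best_d2 = None, size * size
    # Any point within `size` degrees must sit in the 3x3 block of cells around us
    for r in (row - 1, row, row + 1):
        for c in (col - 1, col, col + 1):
            for plat, plon, payload in grid.get((r, c), ()):
                d2 = (plat - lat) ** 2 + (plon - lon) ** 2
                if d2 <= best_d2:
                    best, best_d2 = payload, d2
    return best


TOLERANCE_DEG = 0.001  # same magnitude as the buffer in the answer above

# Sample rows from the question: df2 is indexed, df1 is streamed against it
right = [(37.787994, -122.407419, 'xx'),
         (37.789575, -122.408312, 'yy'),
         (37.813122, -122.288810, 'zz')]
left = [('post', 37.788151, -122.407570),
        ('sutter', 37.789551, -122.408302),
        ('oak', 37.815730, -122.288810)]

grid = build_grid(right, TOLERANCE_DEG)
matches = {name: match_within(lat, lon, grid, TOLERANCE_DEG)
           for name, lat, lon in left}
print(matches)
```

Only the index of the smaller table has to fit in memory; the larger table could then be read in pieces (for example with the `chunksize` argument of `pd.read_csv`) and matched chunk by chunk.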

Question 7

Another solution, already mentioned by the OP, is to use reverse geocoding. One caveat: the quality of the result depends strongly on the geocoder used.

Here the Nominatim geocoder from the GeoPy geocoding library was used (you are free to choose another one); for more details, please check the documentation. Also, the coordinates of each point must be passed as a pair, otherwise you may get this error: ValueError: Must be a coordinate pair or Point. That is why from geopy.point import Point is additionally imported.

When using this code:

import numpy as np
import pandas as pd
from geopy.geocoders import Nominatim
from geopy.point import Point
geolocator = Nominatim(user_agent="test")
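
def reverse_geocoding(lat, lon):
    """Return the Nominatim place_id for a coordinate pair, or None on failure."""
    try:
        location = geolocator.reverse(Point(lat, lon))
        return location.raw['place_id']
    except Exception:
        # Covers request errors and the case where no location is returned
        return None
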
df1 = pd.DataFrame(data={
 'name': ['post', 'sutter', 'oak'],
 'Lat': [37.788151, 37.789551, 37.815730],
 'Long': [-122.407570, -122.408302, -122.288810]
 })
df2 = pd.DataFrame(data={
 'id': [0, 1, 2],
 'col1': ['xx','yy','zz'],
 'Lat': [37.787994, 37.789575, 37.813122],
 'Long': [-122.407419, -122.408312, -122.288810]
 })
df1['address'] = np.vectorize(reverse_geocoding)(df1['Lat'], df1['Long'])
df2['address'] = np.vectorize(reverse_geocoding)(df2['Lat'], df2['Long'])
result = pd.merge(df1, df2, how='left', left_on='address', right_on='address')
print(result)

it produces the following result:

     name      Lat_x      Long_x    address   id col1      Lat_y      Long_y
0    post  37.788151 -122.407570  127113751  NaN  NaN        NaN         NaN
1  sutter  37.789551 -122.408302  110481100  1.0   yy  37.789575 -122.408312
2     oak  37.815730 -122.288810  114898877  NaN  NaN        NaN         NaN

Note: the join was done on the "place_id" attribute, which is not the same as "osm_id".

Question 8

@Taras I am trying to implement this on my data, but it's taking too long to run. My DataFrame has a shape of 2000*80.

Question 9

Is each of the DataFrames 2000*80?

Question 10

What about your solution with Mapbox? Is it equally efficient?

Question 11

For performance issues, I suggest asking a new question.

Aman Bagrecha · Accepted Answer · 2022-04-06 04:32:27Z

