I have a dataframe with 3 columns: Latitude, Longitude and Median_Income. I need to get the average median income for all points within x km of the original point into a 4th column, and I need to do this for each observation.
I have tried writing 3 functions, which I call via apply, to do this quickly. However, the dataframe takes forever to process (hours). I haven't seen an error yet, so it appears to be working okay.
The haversine function below, which I found on here, calculates the great circle distance between 2 points from their lat/lon.
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)."""
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * asin(sqrt(a))
    r = 6371  # Radius of earth in kilometers. Use 3956 for miles
    return c * r
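As a quick sanity check (my addition; the coordinates are approximate values for San Francisco and Los Angeles, not from the dataset), this should print roughly 559 km:

# Approximate SF and LA coordinates; the great-circle distance is ~559 km.
print(haversine(-122.4194, 37.7749, -118.2437, 34.0522))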
My hav_checker function computes the haversine distance of the current row to a given point; applied over the whole dataframe, it yields a column of distances.
def hav_checker(row, lon, lat):
    hav = haversine(row['longitude'], row['latitude'], lon, lat)
    return hav
My value_grabber function uses the distances computed by hav_checker to return the mean of my target column (median_income) over all points within the threshold.
For reference, I am using the California housing dataset to build this out.
def value_grabber(row, frame, threshold, target_col):
    frame = frame.copy()
    frame['hav'] = frame.apply(hav_checker, lon=row['longitude'], lat=row['latitude'], axis=1)
    mean_tar = frame.loc[frame.loc[:, 'hav'] <= threshold, target_col].mean()
    return mean_tar
I am trying to add these 3 columns to my original dataframe for a feature-engineering step within a larger class project.
df['MedianIncomeWithin3KM'] = df.apply(value_grabber, frame=df, threshold=3, target_col='median_income', axis=1)
df['MedianIncomeWithin1KM'] = df.apply(value_grabber, frame=df, threshold=1, target_col='median_income', axis=1)
df['MedianIncomeWithinHalfKM'] = df.apply(value_grabber, frame=df, threshold=.5, target_col='median_income', axis=1)
I have been able to do this successfully with looping, but it is extremely time intensive and I need a faster solution.
Comment (Graipher, Apr 16, 2019): Can you include some example data, so we can actually run the code?
1 Answer
vectorization
You are doing all the calculations in normal Python space. Try to do as much as possible in numpy space.
dummy data
import numpy as np
import pandas as pd

N = 10   # number of dummy points (inferred from the 10x10 outputs below)
dim = 2  # latitude and longitude

np.random.seed(0)
coords = (np.random.random(size=(N, dim)) - 0.5) * 360
median_income = np.random.normal(size=N) * 10000 + 5000  # left unused; df below draws its own
df = pd.DataFrame(
    {
        "lat": coords[:, 0],
        "lon": coords[:, 1],
        "median_income": np.random.normal(size=N) * 10000 + 30000,
    }
)
Instead of using math.radians, use np.radians to convert the whole matrix at once:
coords_rad = np.radians(df[["lat", "lon"]].values)
select only the upper triangle
For this section, I borrowed a bit from this SO post
p1, p2 = np.triu_indices(N, k=1)  # k=1 excludes the diagonal indices
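To see what this produces, here is a hypothetical example for N = 4 (not part of the original answer); each (p1[i], p2[i]) pair indexes one unique pair of points above the diagonal:

i, j = np.triu_indices(4, k=1)
# i -> array([0, 0, 0, 1, 1, 2])
# j -> array([1, 2, 3, 2, 3, 3])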
haversine distances
lat1, lon1 = coords_rad[p1].T
lat2, lon2 = coords_rad[p2].T
d_lat = lat2 - lat1
d_lon = lon2 - lon1
r = 6371  # radius of earth in kilometers

distances = 2 * r * np.arcsin(
    np.sqrt(
        np.sin(d_lat / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin(d_lon / 2) ** 2
    )
)
array([ 6318.56953693, 5685.87555152, 8221.15833653, 6489.20595509, 8755.09024969, 7805.61189508, 6919.53162119, 15295.76892719, 8706.83662262, 8113.95651365, 14532.71048537, 11780.39186778, 7556.99686671, 11832.44825307, 7137.04783302, 9306.23652045, 5446.80037496, 8740.28196777, 10242.77405649, 14237.95015622, 12225.48901658, 2112.82250374, 11830.45390613, 13194.16431067, 3966.47195107, 11375.98162917, 5385.20026834, 10745.8851006 , 15975.57051313, 13621.58550369, 7573.94148257, 2037.20795034, 12284.11555433, 17912.47114836, 9676.18614574, 6000.06279665, 14392.65091451, 11339.26110213, 2465.57715011, 14204.32921867, 15974.00480201, 8347.16187191, 9820.5895048 , 12576.27804606, 9720.35934264])
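Note that distances is the condensed form of the pairwise distance matrix: one entry per unordered pair of points, so N * (N - 1) / 2 values (45 for N = 10). A quick check, added here for illustration:

assert distances.shape == (N * (N - 1) // 2,)  # 45 pairs for N = 10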
A way to minimize the memory footprint of this is by choosing the correct dtype, by adding .astype("e") for example. The correct dtype for this application is the smallest one that still delivers the necessary resolution, so it needs to be chosen with your data taken into consideration.
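For illustration (my addition, not from the original answer): "e" is numpy's float16, which quarters the memory relative to the default float64 but only resolves about 8 km near values of 10,000 km; float32 ("f") is often the safer compromise:

d16 = distances.astype("e")  # float16: 2 bytes per value instead of 8
d32 = distances.astype("f")  # float32: usually plenty for km-scale distances
print(distances.nbytes, d32.nbytes, d16.nbytes)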
Distance matrix
You can assemble a full distance matrix:

distance_matrix = np.zeros((N, N))
distance_matrix[p1, p2] = distances
distance_matrix[p2, p1] = distances
array([[     0.        ,  6318.56953693,  5685.87555152,  8221.15833653,  6489.20595509,  8755.09024969,  7805.61189508,  6919.53162119, 15295.76892719,  8706.83662262],
       [ 6318.56953693,      0.        ,  8113.95651365, 14532.71048537, 11780.39186778,  7556.99686671, 11832.44825307,  7137.04783302,  9306.23652045,  5446.80037496],
       [ 5685.87555152,  8113.95651365,      0.        ,  8740.28196777, 10242.77405649, 14237.95015622, 12225.48901658,  2112.82250374, 11830.45390613, 13194.16431067],
       [ 8221.15833653, 14532.71048537,  8740.28196777,      0.        ,  3966.47195107, 11375.98162917,  5385.20026834, 10745.8851006 , 15975.57051313, 13621.58550369],
       [ 6489.20595509, 11780.39186778, 10242.77405649,  3966.47195107,      0.        ,  7573.94148257,  2037.20795034, 12284.11555433, 17912.47114836,  9676.18614574],
       [ 8755.09024969,  7556.99686671, 14237.95015622, 11375.98162917,  7573.94148257,      0.        ,  6000.06279665, 14392.65091451, 11339.26110213,  2465.57715011],
       [ 7805.61189508, 11832.44825307, 12225.48901658,  5385.20026834,  2037.20795034,  6000.06279665,      0.        , 14204.32921867, 15974.00480201,  8347.16187191],
       [ 6919.53162119,  7137.04783302,  2112.82250374, 10745.8851006 , 12284.11555433, 14392.65091451, 14204.32921867,      0.        ,  9820.5895048 , 12576.27804606],
       [15295.76892719,  9306.23652045, 11830.45390613, 15975.57051313, 17912.47114836, 11339.26110213, 15974.00480201,  9820.5895048 ,      0.        ,  9720.35934264],
       [ 8706.83662262,  5446.80037496, 13194.16431067, 13621.58550369,  9676.18614574,  2465.57715011,  8347.16187191, 12576.27804606,  9720.35934264,      0.        ]])
Then you can use:

d_crit = 10_000  # critical distance in km: 3, 1 and 0.5 in your case; 10000 for this dummy data
close_points = pd.DataFrame(np.where((distance_matrix < d_crit) & (0 < distance_matrix)), index=["p1", "p2"]).T

to get the points which are closer than the critical distance.
Another way to get the close points, without assembling the full distance_matrix, is this:

point_combinations = np.array((p1, p2)).T
close = distances < d_crit  # boolean mask over the condensed distances

close_points = pd.DataFrame(
    np.concatenate(
        (
            point_combinations[np.ix_(close, [0, 1])],
            point_combinations[np.ix_(close, [1, 0])],  # if A is close to B, B is close to A
        )
    ),
    columns=["p1", "p2"],
)
Then, to get the mean of the close median incomes, you could use DataFrame.groupby:
df["neighbours_mean"] = close_points.groupby("p1").apply(
lambda x: (df.loc[x["p2"], "median_income"]).mean()
)
     lat                  lon                  median_income       neighbours_mean
0    17.57286141383691    77.468171894071      30457.58517301446   30794.78854097742
1    36.994815385791796   16.15794587888287    28128.161499741665  29640.567671359968
2   -27.484272237994304   52.5218807039962     45327.79214358458   28367.842422379927
3   -22.468603945430697   141.0382802815487    44693.58769900285   32114.24852431677
4    166.91859378037054   -41.961053222720025  31549.474256969163  32323.686056555547
5    105.02101370975925   10.402171111045613   33781.625196021734  28564.170628892793
6    24.49604199381563    153.21478978535797   21122.142523698873  34409.152403209606
7   -154.4270190487607    -148.63345210744535  10192.035317760732  32608.604330769795
8   -172.72137692148274   119.74314439725768   26520.878506738474  23294.56216951406
9    100.13643034194618   133.2043733688549    31563.489691039802  28593.31119269739
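One caveat (my addition, not from the original answer): any point with no neighbours within d_crit never appears in close_points, so its neighbours_mean comes out as NaN after the groupby assignment. Depending on the feature you want, you might fall back to the point's own value:

# Hypothetical fallback: isolated points keep their own median income.
df["neighbours_mean"] = df["neighbours_mean"].fillna(df["median_income"])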
Please test this against a sample set of your data.
memory
If you still run into memory problems, you'll need to start calculating the distances in chunks and concatenating the results later; a minimal sketch follows below. An alternative is to use dask instead of pandas and numpy.
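Here is a minimal sketch of the chunked approach (my illustration, not from the original answer; neighbours_mean_chunked and chunk_size are names I made up). For each block of rows it broadcasts the haversine formula against all points, masks out points beyond the threshold as well as each point itself, and averages the incomes, so no full N x N matrix is ever held in memory at once:

def neighbours_mean_chunked(coords_rad, income, d_crit, chunk_size=1000):
    # coords_rad: (N, 2) array of [lat, lon] in radians; income: (N,) array
    r = 6371  # radius of earth in kilometers
    lat, lon = coords_rad[:, 0], coords_rad[:, 1]
    out = np.empty(len(income))
    for start in range(0, len(income), chunk_size):
        stop = min(start + chunk_size, len(income))
        # broadcast (chunk, 1) against (N,) -> a (chunk, N) block of distances
        d_lat = lat[np.newaxis, :] - lat[start:stop, np.newaxis]
        d_lon = lon[np.newaxis, :] - lon[start:stop, np.newaxis]
        a = (np.sin(d_lat / 2) ** 2
             + np.cos(lat[start:stop, np.newaxis]) * np.cos(lat) * np.sin(d_lon / 2) ** 2)
        dist = 2 * r * np.arcsin(np.sqrt(a))
        close = dist < d_crit
        close[np.arange(stop - start), np.arange(start, stop)] = False  # drop self-distances
        with np.errstate(invalid="ignore", divide="ignore"):
            out[start:stop] = np.where(
                close.any(axis=1),
                (close * income).sum(axis=1) / close.sum(axis=1),
                np.nan,  # no neighbours within d_crit
            )
    return out

df["neighbours_mean"] = neighbours_mean_chunked(coords_rad, df["median_income"].to_numpy(), d_crit)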
Comment (krewsayder, Apr 16, 2019): I was able to get this solution to work. It's many times faster; however, I did run into some memory issues along the way.