
I have a dataframe with 3 columns: Latitude, Longitude and Median_Income. For each observation I need to get the average median income of all points within x km of that point into a 4th column.

I have tried writing 3 functions that I use with apply to do this quickly. However, the dataframes take forever to process (hours). I haven't seen an error yet, so the logic appears to be working okay.

I found the Haversine formula on here and am using it to calculate the distance between two points given their lat/lon.

from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """Calculate the great-circle distance in kilometres between two points
    on the earth (specified in decimal degrees)."""
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * asin(sqrt(a))
    r = 6371  # radius of earth in kilometres; use 3956 for miles
    return c * r

My hav_checker function computes the haversine distance between the current row and a given point; applying it across the frame gives a column with the distance of every row to that point.

def hav_checker(row, lon, lat):
    hav = haversine(row['longitude'], row['latitude'], lon, lat)
    return hav

My value_grabber function uses the distances produced by hav_checker to return the mean value of my target column (median_income) over all points within the threshold.

For reference, I am using the California housing dataset to build this out.

def value_grabber(row, frame, threshold, target_col):
    frame = frame.copy()
    frame['hav'] = frame.apply(hav_checker, lon=row['longitude'], lat=row['latitude'], axis=1)
    mean_tar = frame.loc[frame['hav'] <= threshold, target_col].mean()
    return mean_tar

I am trying to add these 3 columns to my original dataframe for a feature-engineering step within a larger class project.

df['MedianIncomeWithin3KM'] = df.apply(value_grabber, frame=df, threshold=3, target_col='median_income', axis=1)
df['MedianIncomeWithin1KM'] = df.apply(value_grabber, frame=df, threshold=1, target_col='median_income', axis=1)
df['MedianIncomeWithinHalfKM'] = df.apply(value_grabber, frame=df, threshold=.5, target_col='median_income', axis=1)

I have been able to do this successfully with looping, but it is extremely time-intensive and I need a faster solution.

asked Apr 16, 2019 at 14:04
  • Can you include some example data, so we can actually run the code? – Commented Apr 16, 2019 at 15:29

1 Answer


vectorization

You are doing all the calculations in plain Python. Try to do as much as possible in numpy space.

dummy data

import numpy as np
import pandas as pd

N, dim = 10, 2

np.random.seed(0)
coords = (np.random.random(size=(N, dim)) - 0.5) * 360
median_income = np.random.normal(size=N) * 10000 + 5000  # unused below, kept so the seed lines up with the output shown
df = pd.DataFrame(
    {
        "lat": coords[:, 0],
        "lon": coords[:, 1],
        "median_income": np.random.normal(size=N) * 10000 + 30000,
    }
)

Instead of using math.radians, use np.radians to convert the whole array at once:

coords_rad = np.radians(df[["lat", "lon"]].values)

select only the upper triangle

For this section, I borrowed a bit from this SO post

p1, p2 = np.triu_indices(N, k=1)  # k=1 eliminates diagonal indices
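For a small N this is easy to inspect; with N = 4, for instance, each unordered pair of points appears exactly once:

np.triu_indices(4, k=1)
# (array([0, 0, 0, 1, 1, 2]), array([1, 2, 3, 2, 3, 3]))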

haversine distances

lat1, lon1 = coords_rad[p1].T
lat2, lon2 = coords_rad[p2].T
d_lat = lat2 - lat1
d_lon = lon2 - lon1
r = 6371
distances = 2 * r * np.arcsin(
    np.sqrt(
        np.sin(d_lat / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin(d_lon / 2) ** 2
    )
)
array([ 6318.56953693, 5685.87555152, 8221.15833653, 6489.20595509,
 8755.09024969, 7805.61189508, 6919.53162119, 15295.76892719,
 8706.83662262, 8113.95651365, 14532.71048537, 11780.39186778,
 7556.99686671, 11832.44825307, 7137.04783302, 9306.23652045,
 5446.80037496, 8740.28196777, 10242.77405649, 14237.95015622,
 12225.48901658, 2112.82250374, 11830.45390613, 13194.16431067,
 3966.47195107, 11375.98162917, 5385.20026834, 10745.8851006 ,
 15975.57051313, 13621.58550369, 7573.94148257, 2037.20795034,
 12284.11555433, 17912.47114836, 9676.18614574, 6000.06279665,
 14392.65091451, 11339.26110213, 2465.57715011, 14204.32921867,
 15974.00480201, 8347.16187191, 9820.5895048 , 12576.27804606,
 9720.35934264])

A way to minimize the memory footprint of this is choosing the right dtype, for example by adding .astype("e") (float16). The correct dtype for this application is the smallest one that still delivers the necessary resolution, so it needs to be chosen with your data in mind.
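As a rough illustration with the dummy data above (float64 uses 8 bytes per value, float16 uses 2, at roughly 3 significant digits of precision):

distances_16 = distances.astype("e")
print(distances.nbytes)     # 360 bytes for the 45 pairwise distances (float64)
print(distances_16.nbytes)  # 90 bytes (float16)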

Distance matrix

You can assemble a distance matrix

distance_matrix = np.zeros((N, N))
distance_matrix[p1, p2] = distances
distance_matrix[p2, p1] = distances
 array([[ 0. , 6318.56953693, 5685.87555152, 8221.15833653, 6489.20595509, 8755.09024969, 7805.61189508, 6919.53162119, 15295.76892719, 8706.83662262],
 [ 6318.56953693, 0. , 8113.95651365, 14532.71048537, 11780.39186778, 7556.99686671, 11832.44825307, 7137.04783302, 9306.23652045, 5446.80037496],
 [ 5685.87555152, 8113.95651365, 0. , 8740.28196777, 10242.77405649, 14237.95015622, 12225.48901658, 2112.82250374, 11830.45390613, 13194.16431067],
 [ 8221.15833653, 14532.71048537, 8740.28196777, 0. , 3966.47195107, 11375.98162917, 5385.20026834, 10745.8851006 , 15975.57051313, 13621.58550369],
 [ 6489.20595509, 11780.39186778, 10242.77405649, 3966.47195107, 0. , 7573.94148257, 2037.20795034, 12284.11555433, 17912.47114836, 9676.18614574],
 [ 8755.09024969, 7556.99686671, 14237.95015622, 11375.98162917, 7573.94148257, 0. , 6000.06279665, 14392.65091451, 11339.26110213, 2465.57715011],
 [ 7805.61189508, 11832.44825307, 12225.48901658, 5385.20026834, 2037.20795034, 6000.06279665, 0. , 14204.32921867, 15974.00480201, 8347.16187191],
 [ 6919.53162119, 7137.04783302, 2112.82250374, 10745.8851006 , 12284.11555433, 14392.65091451, 14204.32921867, 0. , 9820.5895048 , 12576.27804606],
 [15295.76892719, 9306.23652045, 11830.45390613, 15975.57051313, 17912.47114836, 11339.26110213, 15974.00480201, 9820.5895048 , 0. , 9720.35934264],
 [ 8706.83662262, 5446.80037496, 13194.16431067, 13621.58550369, 9676.18614574, 2465.57715011, 8347.16187191, 12576.27804606, 9720.35934264, 0. ]])

Then you can use

d_crit = 10_000  # 10000 km for this dummy data; 3, 1 or 0.5 km in your case
close_points = pd.DataFrame(
    np.where((distance_matrix < d_crit) & (0 < distance_matrix)),
    index=["p1", "p2"],
).T

to get the points which are closer than the critical distance.

Another way to get the close points without assembling the distance_matrix is this:

close = distances < d_crit  # boolean mask over the pairwise distances
point_combinations = np.array((p1, p2)).T
close_points = pd.DataFrame(
    np.concatenate(  # if A is close to B, B is close to A
        (
            point_combinations[np.ix_(close, [0, 1])],
            point_combinations[np.ix_(close, [1, 0])],
        )
    ),
    columns=["p1", "p2"],
)

Then, to get the mean of the close median incomes, you could use DataFrame.groupby:

df["neighbours_mean"] = close_points.groupby("p1").apply(
 lambda x: (df.loc[x["p2"], "median_income"]).mean()
)
 lat lon median_income neighbours_mean
0 17.57286141383691 77.468171894071 30457.58517301446 30794.78854097742
1 36.994815385791796 16.15794587888287 28128.161499741665 29640.567671359968
2 -27.484272237994304 52.5218807039962 45327.79214358458 28367.842422379927
3 -22.468603945430697 141.0382802815487 44693.58769900285 32114.24852431677
4 166.91859378037054 -41.961053222720025 31549.474256969163 32323.686056555547
5 105.02101370975925 10.402171111045613 33781.625196021734 28564.170628892793
6 24.49604199381563 153.21478978535797 21122.142523698873 34409.152403209606
7 -154.4270190487607 -148.63345210744535 10192.035317760732 32608.604330769795
8 -172.72137692148274 119.74314439725768 26520.878506738474 23294.56216951406
9 100.13643034194618 133.2043733688549 31563.489691039802 28593.31119269739

Please test this against a sample set of your data.
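To tie this back to your columns, the pieces above could be wrapped into a helper roughly like the following. This is a sketch, not tested against the real California housing data; it uses your column names (latitude, longitude, median_income) and thresholds, and assumes df has a default RangeIndex:

def mean_income_within(df, threshold_km, r=6371):
    # pairwise haversine distances over the upper triangle, exactly as above
    coords_rad = np.radians(df[["latitude", "longitude"]].values)
    p1, p2 = np.triu_indices(len(df), k=1)
    lat1, lon1 = coords_rad[p1].T
    lat2, lon2 = coords_rad[p2].T
    distances = 2 * r * np.arcsin(
        np.sqrt(
            np.sin((lat2 - lat1) / 2) ** 2
            + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
        )
    )
    close = distances <= threshold_km
    pairs = np.array((p1, p2)).T
    close_points = pd.DataFrame(
        np.concatenate(
            (pairs[np.ix_(close, [0, 1])], pairs[np.ix_(close, [1, 0])])
        ),
        columns=["p1", "p2"],
    )
    # rows with no neighbour within the threshold come back as NaN
    return close_points.groupby("p1").apply(
        lambda x: df["median_income"].iloc[x["p2"]].mean()
    )

df["MedianIncomeWithin3KM"] = mean_income_within(df, 3)
df["MedianIncomeWithin1KM"] = mean_income_within(df, 1)
df["MedianIncomeWithinHalfKM"] = mean_income_within(df, 0.5)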


memory

If you still run into memory problems, you'll need to start calculating the distances in chunks and concatenating the results afterwards. An alternative is to use dask instead of pandas and numpy.
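A minimal sketch of the chunked idea (the chunk size is an arbitrary assumption, and the column names are again yours): compute the distances from one block of rows to all rows, reduce that block to neighbour means straight away, and move on, so the full N x N matrix never exists in memory at once.

def neighbour_mean_chunked(df, threshold_km, chunk_size=1000, r=6371):
    lat, lon = np.radians(df[["latitude", "longitude"]].values).T
    income = df["median_income"].values
    out = np.full(len(df), np.nan)
    for start in range(0, len(df), chunk_size):
        stop = start + chunk_size
        # broadcasting a (chunk, 1) slice against the full (N,) arrays
        # gives a (chunk, N) block of pairwise distances
        d_lat = lat - lat[start:stop, np.newaxis]
        d_lon = lon - lon[start:stop, np.newaxis]
        a = (np.sin(d_lat / 2) ** 2
             + np.cos(lat[start:stop, np.newaxis]) * np.cos(lat) * np.sin(d_lon / 2) ** 2)
        block = 2 * r * np.arcsin(np.sqrt(a))
        # block > 0 drops each point's distance to itself (and any exact duplicates)
        close = (block <= threshold_km) & (block > 0)
        counts = close.sum(axis=1)
        sums = (close * income).sum(axis=1)
        out[start:stop] = np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)
    return out

df["MedianIncomeWithin3KM"] = neighbour_mean_chunked(df, 3)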

answered Apr 16, 2019 at 15:29
  • I was able to get this solution to work. It's many times faster, however I did run into some memory issues along the way. – Commented Apr 16, 2019 at 20:23
