PANDAS code for calculating distance between waypoints

Question 1

I've written some Python code that takes a CSV of waypoints for a series of trips and calculates the distance of each trip as the sum of the distances between consecutive waypoints.

An example CSV might be:

9e77d54918dd25c3f9d2e5354ec86666,0,2015-10-01T14:14:15.000Z,45.0988,7.5811,,
9e77d54918dd25c3f9d2e5354ec86666,1,2015-10-01T14:17:15.000Z,45.0967,7.5793,,
9e77d54918dd25c3f9d2e5354ec86666,2,2015-10-01T14:20:15.000Z,45.1012,7.6144,,
9e77d54918dd25c3f9d2e5354ec86666,3,2015-10-01T14:23:15.000Z,45.0883,7.6479,,
9e77d54918dd25c3f9d2e5354ec86666,4,2015-10-01T14:26:15.000Z,45.0774,7.6444,,
etc.

I've got the code working using pandas and NumPy. However, I'm entirely self-taught, and I want to know whether I'm making any serious or obvious mistakes that might make my code inefficient. It currently takes quite a while to run, I'm guessing because of my for loop. The code I'm using is:

import pandas as pd
import numpy as np
from math import radians, cos, sqrt

def dist(lat1, lon1, lat2, lon2): #short distances using Equirectangular approximation
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    x = (lon2 - lon1) * cos( 0.5*(lat2+lat1) )
    y = lat2 - lat1
    D = 6371 * sqrt(x**2 + y**2)
    return D

waypoint = pd.read_csv('TripRecordsReportWaypoints.csv',sep=',',header=None, usecols=[0,3,4], names=['TripID','Lat','Lon'])
output = pd.DataFrame(columns = ['TripID','Distance','No. of Waypoints'])
tripList = waypoint['TripID'].tolist() #creates list of tripids
tripList = list(set(tripList)) #makes list unique
for ID in tripList:
    temp = waypoint.loc[waypoint['TripID'] == ID] #creates a temporary dataframe with all waypoint for each trip
    temp['endLat'] = temp['Lat'].shift(periods=-1) #adds two columns with next waypoints lat and lon
    temp['endLon'] = temp['Lon'].shift(periods=-1)
    temp['Distance']=np.vectorize(dist)(temp['Lat'],temp['Lon'],temp['endLat'],temp['endLon']) #calculates distance, can change function 'dist' for more accuracy
    SumDist = temp['Distance'].sum() #calculates the total distance
    trpId = temp['TripID'].iloc[0] #takes the tripid
    wpcount = temp.shape[0] #length of dataframe
    temp2 = pd.DataFrame([[trpId,SumDist,wpcount]],columns=['TripID','Distance','No. of Waypoints']) #creates a single row dataframe with the total distance
    output = pd.concat([output,temp2]) #adds the row to the output
output.to_csv('TripDistances.csv',sep=',')

Answer

Your code can be greatly simplified by using pandas.DataFrame.groupby. It groups a dataframe by one or more keys and then lets you either run a function on each whole sub-dataframe (henceforth called a group) using apply, or run an aggregating function on single columns of each group using aggregate.
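As a minimal sketch of the difference between the two, here is a toy dataframe with two made-up trips. The data, the column selection, and the helper lat_lon_span are assumptions for illustration only, not part of the original code.

```python
import pandas as pd

# Toy data: two hypothetical trips, "a" and "b" (not from the original CSV)
df = pd.DataFrame({
    "TripID": ["a", "a", "a", "b", "b"],
    "Lat": [45.0, 45.1, 45.2, 46.0, 46.5],
    "Lon": [7.0, 7.1, 7.3, 8.0, 8.5],
})

# Group by trip and keep only the columns the functions below need
grouped = df.groupby("TripID")[["Lat", "Lon"]]

# aggregate: reduce single columns of each group with a named function
per_column = grouped.agg({"Lat": "max", "Lon": "min"})

# apply: the function receives each group as a whole sub-dataframe,
# so it can combine several columns at once
def lat_lon_span(group):
    lat_span = group["Lat"].max() - group["Lat"].min()
    lon_span = group["Lon"].max() - group["Lon"].min()
    return lat_span + lon_span

per_group = grouped.apply(lat_lon_span)

print(per_column)
print(per_group)
```

Both results are indexed by TripID, with one row per trip.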

For this to work, we need a dist function that takes a DataFrame and calculates the total distance over its rows. Note that I will assume the waypoints of each trip are already sorted within the CSV; otherwise you will have to add a sorting step as well.
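If the rows are not already in order, one possible sorting step is sketched below. It assumes that the second CSV column is the waypoint sequence number (the sample rows count 0, 1, 2, ...); the column name Seq and the shuffled in-memory CSV are made up for the example.

```python
import io
import pandas as pd

# Three rows modeled on the sample CSV, deliberately out of order
csv_text = """\
9e77d54918dd25c3f9d2e5354ec86666,2,2015-10-01T14:20:15.000Z,45.1012,7.6144,,
9e77d54918dd25c3f9d2e5354ec86666,0,2015-10-01T14:14:15.000Z,45.0988,7.5811,,
9e77d54918dd25c3f9d2e5354ec86666,1,2015-10-01T14:17:15.000Z,45.0967,7.5793,,
"""

# Also read column 1, assumed here to be the waypoint sequence number
waypoint = pd.read_csv(io.StringIO(csv_text), header=None,
                       usecols=[0, 1, 3, 4],
                       names=["TripID", "Seq", "Lat", "Lon"])

# Sort by trip, then by position within the trip, before grouping
waypoint = waypoint.sort_values(["TripID", "Seq"]).reset_index(drop=True)

print(waypoint)
```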

To write this function, we need to make sure every function it uses is vectorized, so I will be using np.cos and np.sqrt instead of the math ones. I also defined a constant, EARTH_RADIUS, because you might want to change its precision at some point. In any case, it was a magic number in your code, and giving it a name helps a lot. Or maybe you move to Mars at some point and need to use a different radius :)

import pandas as pd
import numpy as np

EARTH_RADIUS = 6371 # km

def total_dist(group):
    lat = np.radians(group.Lat)
    lon = np.radians(group.Lon)
    endLon = lon.shift(-1)
    endLat = lat.shift(-1)
    x = (endLon - lon) * np.cos(0.5 * (endLat + lat))
    y = endLat - lat
    D = EARTH_RADIUS * np.sqrt(x**2 + y**2)
    return D.sum()

Note that the conversion to radians is not as nice anymore, but we gain the ability to process a whole trip at a time!
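A quick way to gain confidence in the vectorized version is to compare it with the scalar dist from the question on the five sample waypoints. This is only a sanity-check sketch; the variable names trip, vectorized, and looped are made up for the example.

```python
from math import radians, cos, sqrt

import numpy as np
import pandas as pd

EARTH_RADIUS = 6371  # km

def dist(lat1, lon1, lat2, lon2):
    # Scalar equirectangular approximation, as in the question
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    x = (lon2 - lon1) * cos(0.5 * (lat2 + lat1))
    y = lat2 - lat1
    return EARTH_RADIUS * sqrt(x**2 + y**2)

def total_dist(group):
    # Vectorized version: works on whole columns at once
    lat = np.radians(group.Lat)
    lon = np.radians(group.Lon)
    endLon = lon.shift(-1)
    endLat = lat.shift(-1)
    x = (endLon - lon) * np.cos(0.5 * (endLat + lat))
    y = endLat - lat
    D = EARTH_RADIUS * np.sqrt(x**2 + y**2)
    return D.sum()  # the last row is NaN (no next waypoint); sum() skips it

# The five sample waypoints from the question
trip = pd.DataFrame({
    "Lat": [45.0988, 45.0967, 45.1012, 45.0883, 45.0774],
    "Lon": [7.5811, 7.5793, 7.6144, 7.6479, 7.6444],
})

vectorized = total_dist(trip)

# Reference value: explicit Python loop over consecutive waypoint pairs
rows = list(zip(trip.Lat, trip.Lon))
looped = sum(dist(a[0], a[1], b[0], b[1]) for a, b in zip(rows, rows[1:]))

print(vectorized, looped)
```

The two totals should agree up to floating-point rounding.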

Now we define a helper function to output not only the dist, but also the number of waypoints (by using len):

def trip_statistics(trip):
    return pd.Series({"Distance": total_dist(trip),
                      "No. of Waypoints": len(trip)})

Now the only thing left to do is apply this function to all groups and reset the index to get TripID back as a column and not just as an index:

waypoint = pd.read_csv('TripRecordsReportWaypoints.csv', sep=',',
                       header=None, usecols=[0, 3, 4],
                       names=['TripID', 'Lat', 'Lon'])
output = waypoint.groupby("TripID").apply(trip_statistics)
output.reset_index().to_csv('TripDistances.csv', sep=',', index=False)

Note that I added index=False to avoid writing the row index to the output file.
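As a possible variation that is not part of the answer above, the per-trip apply can be avoided altogether by shifting within each group and summing the legs with groupby. The function name trip_distances and the toy input are assumptions for illustration; whether this is actually faster on your data would need to be measured.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS = 6371  # km

def trip_distances(waypoint):
    """Total distance and waypoint count per trip, without groupby.apply."""
    trip_id = waypoint["TripID"]
    lat = np.radians(waypoint["Lat"])
    lon = np.radians(waypoint["Lon"])

    # Shift inside each trip so the last waypoint of one trip is never
    # paired with the first waypoint of the next trip
    end_lat = lat.groupby(trip_id).shift(-1)
    end_lon = lon.groupby(trip_id).shift(-1)

    x = (end_lon - lon) * np.cos(0.5 * (end_lat + lat))
    y = end_lat - lat
    leg = EARTH_RADIUS * np.sqrt(x**2 + y**2)  # NaN for each trip's last row

    out = pd.DataFrame({
        "Distance": leg.groupby(trip_id).sum(),  # sum() skips the NaN legs
        "No. of Waypoints": waypoint.groupby("TripID").size(),
    })
    return out.rename_axis("TripID").reset_index()

# Toy input: trip "a" has two waypoints one degree of longitude apart
# on the equator, trip "b" has a single waypoint
toy = pd.DataFrame({
    "TripID": ["a", "a", "b"],
    "Lat": [0.0, 0.0, 10.0],
    "Lon": [0.0, 1.0, 10.0],
})
result = trip_distances(toy)
print(result)
```

Like the apply-based version, this assumes the waypoints of each trip are already in order.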

Comment 1

Thank you a lot. I'm not sure what you mean by vectorized functions though. What differences would using math make?

Comment 2

@JoshuaKidd math.cos can take only a float (or any other single number) as its argument. np.cos takes a vector/numpy.array of floats and acts on all of them at the same time. With the math version you would have to write an explicit loop (e.g. lat = np.array([math.radians(x) for x in group.Lat])) instead of what I wrote in the answer. The NumPy implementation is written in C, whereas the explicit loop is (mostly) written in Python. The NumPy implementation is a lot faster (especially as the length of the vector grows).
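The difference can be seen in a small sketch. The array angles and the flag math_accepts_array are made up for the example.

```python
import math

import numpy as np

angles = np.array([0.0, 0.5, 1.0])

# math.cos expects a single number, so a multi-element array is rejected
try:
    math.cos(angles)
    math_accepts_array = True
except TypeError:
    math_accepts_array = False

# With math, an explicit Python-level loop is needed
looped = np.array([math.cos(a) for a in angles])

# np.cos handles the whole array in one call
vectorized = np.cos(angles)

print(math_accepts_array)
print(np.allclose(looped, vectorized))
```

Note that np.vectorize, as used in the question, does not change this picture: the NumPy documentation describes it as a convenience that is essentially a for loop, so it still calls the Python function once per element.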

Graipher · Accepted Answer · 2017-02-07 21:31:54Z
