I've written some Python code designed to take a CSV of waypoints for a series of trips and calculate the distance of each trip as the sum of the distances between consecutive waypoints.
An example CSV might be:
9e77d54918dd25c3f9d2e5354ec86666,0,2015-10-01T14:14:15.000Z,45.0988,7.5811,,
9e77d54918dd25c3f9d2e5354ec86666,1,2015-10-01T14:17:15.000Z,45.0967,7.5793,,
9e77d54918dd25c3f9d2e5354ec86666,2,2015-10-01T14:20:15.000Z,45.1012,7.6144,,
9e77d54918dd25c3f9d2e5354ec86666,3,2015-10-01T14:23:15.000Z,45.0883,7.6479,,
9e77d54918dd25c3f9d2e5354ec86666,4,2015-10-01T14:26:15.000Z,45.0774,7.6444,,
etc.
I've got the code working using pandas and numpy. However, I'm entirely self-taught, and I want to know if there are any serious or obvious mistakes I'm making that might make my code inefficient. It currently takes quite a while to run, I'm guessing because of my for loop. The code I'm using is:
import pandas as pd
import numpy as np
from math import radians, cos, sqrt
def dist(lat1, lon1, lat2, lon2): #short distances using Equirectangular approximation
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    x = (lon2 - lon1) * cos( 0.5*(lat2+lat1) )
    y = lat2 - lat1
    D = 6371 * sqrt(x**2 + y**2)
    return D
waypoint = pd.read_csv('TripRecordsReportWaypoints.csv',sep=',',header=None, usecols=[0,3,4], names=['TripID','Lat','Lon'])
output = pd.DataFrame(columns = ['TripID','Distance','No. of Waypoints'])
tripList = waypoint['TripID'].tolist() #creates list of tripids
tripList = list(set(tripList)) #makes list unique
for ID in tripList:
    temp = waypoint.loc[waypoint['TripID'] == ID] #creates a temporary dataframe with all waypoint for each trip
    temp['endLat'] = temp['Lat'].shift(periods=-1) #adds two columns with next waypoints lat and lon
    temp['endLon'] = temp['Lon'].shift(periods=-1)
    temp['Distance']=np.vectorize(dist)(temp['Lat'],temp['Lon'],temp['endLat'],temp['endLon']) #calculates distance, can change function 'dist' for more accuracy
    SumDist = temp['Distance'].sum() #calculates the total distance
    trpId = temp['TripID'].iloc[0] #takes the tripid
    wpcount = temp.shape[0] #length of dataframe
    temp2 = pd.DataFrame([[trpId,SumDist,wpcount]],columns=['TripID','Distance','No. of Waypoints']) #creates a single row dataframe with the total distance
    output = pd.concat([output,temp2]) #adds the row to the output
output.to_csv('TripDistances.csv',sep=',')
1 Answer
Your code can be greatly simplified by using pandas.DataFrame.groupby. This function groups a dataframe by some key(s) and then lets you either run a function on each whole sub-dataframe (henceforth called a group) using apply, or apply an aggregating function to single columns of that group using aggregate.
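As a rough illustration of the difference between the two, here is a minimal sketch on made-up toy data (the trip IDs, coordinates and the `bounding_box_area` helper are all hypothetical, not part of the question's data set):

```python
import pandas as pd

# Made-up toy data: two trips with a few waypoints each.
df = pd.DataFrame({
    "TripID": ["a", "a", "a", "b", "b"],
    "Lat": [45.0, 45.1, 45.3, 10.0, 10.5],
    "Lon": [7.0, 7.2, 7.1, 3.0, 3.5],
})

grouped = df.groupby("TripID")

# aggregate: reduce a single column of each group to one value.
waypoint_counts = grouped["Lat"].aggregate("count")


def bounding_box_area(group):
    """Hypothetical helper that needs several columns of a group at once."""
    return ((group.Lat.max() - group.Lat.min())
            * (group.Lon.max() - group.Lon.min()))


# apply: call a function once per group, passing the group as a DataFrame.
areas = grouped[["Lat", "Lon"]].apply(bounding_box_area)
```

Both results are indexed by TripID, with one entry per trip.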
For this to work, we need to define a dist function that can take a DataFrame and calculate the total distance of that DataFrame. Note that I will assume that the waypoints of each trip are sorted within the CSV; otherwise you will have to add a sorting step as well.
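If the rows are not already in order, one possible way to sort them is sketched below. It assumes that the second CSV column is the waypoint's sequence number within its trip (it looks that way in the sample, but that is a guess), and the column name `Seq` and the `tripA` ID are made up for the example:

```python
import io

import pandas as pd

# Stand-in for the real file: three waypoints of one trip, out of order.
csv_text = (
    "tripA,2,2015-10-01T14:20:15.000Z,45.1012,7.6144,,\n"
    "tripA,0,2015-10-01T14:14:15.000Z,45.0988,7.5811,,\n"
    "tripA,1,2015-10-01T14:17:15.000Z,45.0967,7.5793,,\n"
)

# Also read column 1, assumed here to be the waypoint sequence number.
waypoint = pd.read_csv(io.StringIO(csv_text), sep=",", header=None,
                       usecols=[0, 1, 3, 4],
                       names=["TripID", "Seq", "Lat", "Lon"])

# Sort by trip first, then by position within the trip.
waypoint = waypoint.sort_values(["TripID", "Seq"]).reset_index(drop=True)
```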
To make the dist function, we need to make sure all functions it uses are vectorized, so I will be using np.cos and np.sqrt instead of the math ones. I also defined a constant, EARTH_RADIUS, because you might want to change the precision on that at some point. In any case, it is currently a magic number, and giving it a name helps a lot. Or maybe you move to Mars at some point and need to use a different radius :)
import pandas as pd
import numpy as np
EARTH_RADIUS = 6371 # km
def total_dist(group):
    lat = np.radians(group.Lat)
    lon = np.radians(group.Lon)
    endLon = lon.shift(-1)
    endLat = lat.shift(-1)
    x = (endLon - lon) * np.cos(0.5 * (endLat + lat))
    y = endLat - lat
    D = EARTH_RADIUS * np.sqrt(x**2 + y**2)
    return D.sum()
Note that the conversion to radians is not as nice anymore, but we gain the ability to process a whole trip at a time!
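As a quick sanity check, the sketch below compares the vectorized function with the scalar dist from the question on the five sample waypoints. Both functions are repeated so that the block runs on its own; the variable names `trip`, `points`, `vectorized_km` and `scalar_km` are only for illustration:

```python
from math import radians, cos, sqrt

import numpy as np
import pandas as pd

EARTH_RADIUS = 6371  # km


def dist(lat1, lon1, lat2, lon2):
    """Scalar equirectangular distance between two points, as in the question."""
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    x = (lon2 - lon1) * cos(0.5 * (lat2 + lat1))
    y = lat2 - lat1
    return EARTH_RADIUS * sqrt(x**2 + y**2)


def total_dist(group):
    """Vectorized total distance of one trip, as defined above."""
    lat = np.radians(group.Lat)
    lon = np.radians(group.Lon)
    endLon = lon.shift(-1)
    endLat = lat.shift(-1)
    x = (endLon - lon) * np.cos(0.5 * (endLat + lat))
    y = endLat - lat
    D = EARTH_RADIUS * np.sqrt(x**2 + y**2)
    # The last waypoint has no successor, so its entry is NaN;
    # Series.sum() skips NaN by default.
    return D.sum()


# The five sample waypoints from the question.
trip = pd.DataFrame({"Lat": [45.0988, 45.0967, 45.1012, 45.0883, 45.0774],
                     "Lon": [7.5811, 7.5793, 7.6144, 7.6479, 7.6444]})

vectorized_km = total_dist(trip)

# Same total computed pair by pair with the scalar function.
points = list(zip(trip.Lat, trip.Lon))
scalar_km = sum(dist(lat1, lon1, lat2, lon2)
                for (lat1, lon1), (lat2, lon2) in zip(points, points[1:]))
```

The two totals should agree up to floating-point rounding.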
Now we define a helper function to output not only the distance, but also the number of waypoints (by using len):
def trip_statistics(trip):
    return pd.Series({"Distance": total_dist(trip),
                      "No. of Waypoints": len(trip)})
Now the only thing left to do is apply this function to all groups and reset the index to get TripID back as a column and not just as an index:
waypoint = pd.read_csv('TripRecordsReportWaypoints.csv', sep=',',
                       header=None, usecols=[0, 3, 4],
                       names=['TripID', 'Lat', 'Lon'])
output = waypoint.groupby("TripID").apply(trip_statistics)
output.reset_index().to_csv('TripDistances.csv', sep=',', index=False)
Note that I added index=False to avoid writing the row index to the output file.
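To try the whole pipeline without the real file, here is a self-contained sketch that reads a small in-memory CSV instead. The trip IDs `tripA` and `tripB` and their rows are made up, and selecting the Lat and Lon columns before apply is my own addition so that the function only ever sees those two columns, whatever the pandas version does with the grouping column:

```python
import io

import numpy as np
import pandas as pd

EARTH_RADIUS = 6371  # km


def total_dist(group):
    """Total equirectangular distance over consecutive waypoints of one trip."""
    lat = np.radians(group.Lat)
    lon = np.radians(group.Lon)
    endLon = lon.shift(-1)
    endLat = lat.shift(-1)
    x = (endLon - lon) * np.cos(0.5 * (endLat + lat))
    y = endLat - lat
    return (EARTH_RADIUS * np.sqrt(x**2 + y**2)).sum()


def trip_statistics(trip):
    return pd.Series({"Distance": total_dist(trip),
                      "No. of Waypoints": len(trip)})


# Made-up stand-in for TripRecordsReportWaypoints.csv.
csv_text = (
    "tripA,0,2015-10-01T14:14:15.000Z,45.0988,7.5811,,\n"
    "tripA,1,2015-10-01T14:17:15.000Z,45.0967,7.5793,,\n"
    "tripA,2,2015-10-01T14:20:15.000Z,45.1012,7.6144,,\n"
    "tripB,0,2015-10-01T15:00:00.000Z,45.0883,7.6479,,\n"
    "tripB,1,2015-10-01T15:03:00.000Z,45.0774,7.6444,,\n"
)

waypoint = pd.read_csv(io.StringIO(csv_text), sep=",", header=None,
                       usecols=[0, 3, 4], names=["TripID", "Lat", "Lon"])

# One row per trip; reset_index turns TripID back into a column.
output = (waypoint.groupby("TripID")[["Lat", "Lon"]]
          .apply(trip_statistics)
          .reset_index())
```

One detail to be aware of: because both statistics are returned in a single Series, the waypoint count ends up stored as a float. If that matters for the output file, it can be cast back with astype(int) before writing.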
- Thank you a lot. I'm not sure what you mean by vectorized functions, though. What difference would using math make? – Joshua Kidd, Feb 8, 2017 at 11:06
- @JoshuaKidd math.cos can take only a float (or any other single number) as its argument. np.cos takes a vector/numpy.array of floats and acts on all of them at the same time. For the math one you would have to write an explicit loop (e.g. lat = np.array([math.radians(x) for x in group.Lat])) instead of what I wrote in the answer. The numpy implementation is written in C, whereas the explicit loop is (mostly) written in Python. The numpy implementation is a lot faster (especially as the length of the vector grows). – Graipher, Feb 8, 2017 at 11:09
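A small sketch of the difference described in this comment, using three latitudes from the question's sample (the variable names are only for illustration):

```python
import math

import numpy as np

lat_deg = np.array([45.0988, 45.0967, 45.1012])

# numpy functions take the whole array and work element-wise.
vectorized = np.cos(np.radians(lat_deg))

# The math functions handle one number at a time, so a loop is needed.
looped = np.array([math.cos(math.radians(x)) for x in lat_deg])

# Handing the whole array to math.cos is rejected with a TypeError.
try:
    math.cos(lat_deg)
    math_accepts_arrays = True
except TypeError:
    math_accepts_arrays = False
```

Both approaches give the same numbers; the difference is only in where the loop runs.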