I am trying to find a vectorized (or more efficient) solution to an iteration problem, where the only solution I found requires row by row iteration of a DataFrame with multiple loops. The actual data file is huge, so my current solution is practically unfeasible. I included line profiler outputs at the very end, if you'd like to have a look.
The main issue seems to be the increased overhead due to frequent Pandas access for filtering operations, especially when creating the temporary DataFrame dfTemp
. Any ideas on modifying the algorithm and minimizing Pandas usage are welcome. I am also open to Numpy/Cython/Numba implementations.
Problem Statement
An airport has two landing strips side by side. Each plane lands (arrival time), taxis on one of the landing strips for a while, then takes off (departure time). Looking for solutions to two cases:
Build a list of events where there are more than one plane present on a single strip at a time. Do not include subsets of events (e.g. do not show
[3,4]
if there is a valid[3,4,5]
case). The list should store the indices of the actual DataFrame rows. See functionfindSingleEvents()
for a solution for this case (runs around 5 ms).Build a list of events where there is at least one plane on each strip at a time. Do not count subsets of an event, only record the event with maximum number of planes. (e.g. do not show
[3,4]
if there is a[3,4,5]
case). Do not count events that fully occur on a single strip. The list should store the indices of the actual DataFrame rows. See functionfindMultiEvents()
for a solution for this case (runs around 15 ms).
Code with sample input
import numpy as np
import pandas as pd
import itertools
from __future__ import division
data = [{'PLANE':0, 'STRIP':1, 'ARRIVAL':85.00, 'DEPARTURE':86.00},
{'PLANE':1, 'STRIP':1, 'ARRIVAL':87.87, 'DEPARTURE':92.76},
{'PLANE':2, 'STRIP':2, 'ARRIVAL':88.34, 'DEPARTURE':89.72},
{'PLANE':3, 'STRIP':1, 'ARRIVAL':88.92, 'DEPARTURE':90.88},
{'PLANE':4, 'STRIP':2, 'ARRIVAL':90.03, 'DEPARTURE':92.77},
{'PLANE':5, 'STRIP':2, 'ARRIVAL':90.27, 'DEPARTURE':91.95},
{'PLANE':6, 'STRIP':2, 'ARRIVAL':92.42, 'DEPARTURE':93.58},
{'PLANE':7, 'STRIP':2, 'ARRIVAL':94.42, 'DEPARTURE':95.58}]
df = pd.DataFrame(data, columns = ['PLANE','STRIP','ARRIVAL','DEPARTURE'])
def findSingleEvents(df):
events = []
for row in df.itertuples():
#Create temporary dataframe for each main iteration
dfTemp = df[(row.DEPARTURE>df.ARRIVAL) & (row.ARRIVAL<df.DEPARTURE)]
if len(dfTemp)>1:
#convert index values to integers from long
current_event = [int(v) for v in dfTemp.index.tolist()]
#loop backwards to remove elements that do not comply
for i in reversed(current_event):
if (dfTemp.loc[i].ARRIVAL > dfTemp.DEPARTURE).any():
current_event.remove(i)
events.append(current_event)
#remove duplicate events
events = map(list, set(map(tuple, events)))
return events
def findMultiEvents(df):
events = []
for row in df.itertuples():
#Create temporary dataframe for each main iteration
dfTemp = df[(row.DEPARTURE>df.ARRIVAL) & (row.ARRIVAL<df.DEPARTURE)]
if len(dfTemp)>1:
#convert index values to integers from long
current_event = [int(v) for v in dfTemp.index.tolist()]
#loop backwards to remove elements that do not comply
for i in reversed(current_event):
if (dfTemp.loc[i].ARRIVAL > dfTemp.DEPARTURE).any():
current_event.remove(i)
#remove elements only on 1 strip
if len(df.iloc[current_event].STRIP.unique()) > 1:
events.append(current_event)
#remove duplicate events
events = map(list, set(map(tuple, events)))
return events
print findSingleEvents(df[df.STRIP==1])
print findSingleEvents(df[df.STRIP==2])
print findMultiEvents(df)
Output
[[1, 3]]
[[4, 5], [4, 6]]
[[1, 3, 4, 5], [1, 4, 6], [1, 2, 3]]
Line Profiler Logs
%lprun -f findSingleEvents findSingleEvents(df[df.STRIP==1])
Timer unit: 2.85099e-07 s
Total time: 0.0172055 s
File: <ipython-input-33-220dd9d5b99b>
Function: findSingleEvents at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def findSingleEvents(df):
2 1 9.0 9.0 0.0 events = []
3 4 8702.0 2175.5 14.4 for row in df.itertuples():
4 3 31604.0 10534.7 52.4 dfTemp = df[(row.DEPARTURE>df.ARRIVAL) & (row.ARRIVAL<df.DEPARTURE)]
5 3 65.0 21.7 0.1 if len(dfTemp)>1:
6 6 334.0 55.7 0.6 current_event = [int(v) for v in dfTemp.index.tolist()]
7 6 50.0 8.3 0.1 for i in reversed(current_event):
8 4 19537.0 4884.2 32.4 if (dfTemp.loc[i].ARRIVAL > dfTemp.DEPARTURE).any():
9 current_event.remove(i)
10 2 12.0 6.0 0.0 events.append(current_event)
11 1 33.0 33.0 0.1 events = map(list, set(map(tuple, events)))
12 1 3.0 3.0 0.0 return events
%lprun -f findMultiEvents findMultiEvents(df)
Timer unit: 2.85099e-07 s
Total time: 0.0532152 s
File: <ipython-input-28-97265d757453>
Function: findMultiEvents at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def findMultiEvents(df):
2 1 18.0 18.0 0.0 events = []
3 9 21661.0 2406.8 11.6 for row in df.itertuples():
4 8 60694.0 7586.8 32.5 dfTemp = df[(row.DEPARTURE>df.ARRIVAL) & (row.ARRIVAL<df.DEPARTURE)]
5 8 145.0 18.1 0.1 if len(dfTemp)>1:
6 32 1208.0 37.8 0.6 current_event = [int(v) for v in dfTemp.index.tolist()]
7 32 152.0 4.8 0.1 for i in reversed(current_event):
8 26 87007.0 3346.4 46.6 if (dfTemp.loc[i].ARRIVAL > dfTemp.DEPARTURE).any():
9 6 67.0 11.2 0.0 current_event.remove(i)
10 6 15636.0 2606.0 8.4 if len(df.iloc[current_event].STRIP.unique()) > 1:
11 6 38.0 6.3 0.0 events.append(current_event)
12 1 27.0 27.0 0.0 events = map(list, set(map(tuple, events)))
13 1 2.0 2.0 0.0 return events
1 Answer 1
I suggest using a simpler approach here: just keep track of which planes are where, as you go chronologically through time. This way, you only go through the data once, and you don't have to create dfTemp
.
def findEvents(df):
# An event: a list of indices of the df rows representing planes involved
# in the event.
# Planes present on the strips.
strips = {
1: [],
2: []
}
single_events = {
1: [],
2: []
}
multi_events = []
for row in df.sort_values(by=['ARRIVAL']).itertuples():
# Check for departed planes first
for planes in strips.values():
remove_departed_planes(row.ARRIVAL, planes)
strips[row.STRIP].append(row)
# Single Event:
# Build a list of events where there are more than one plane present on
# a single strip at a time. Do not include subsets of events.
for strip_name in (1, 2):
if len(strips[strip_name]) > 1:
indexes = frozenset(int(row.Index) for row in strips[strip_name])
add_event(single_events[strip_name], indexes)
# Multi Event
# Build a list of events where there is at least one plane on each strip
# at a time. Do not count subsets of an event.
# Do not count events that fully occur on a single strip. (???)
if strips[1] and strips[2]:
indexes = frozenset(int(row.Index) for row in strips[1]) | \
frozenset(int(row.Index) for row in strips[2])
add_event(multi_events, indexes)
# Sort the sets, for consistent output
return (
{
1: [sorted(event) for event in single_events[1] ],
2: [sorted(event) for event in single_events[2] ],
},
[sorted(event) for event in multi_events]
)
def remove_departed_planes(time, planes):
i = 0
while i < len(planes):
if planes[i].DEPARTURE < time:
del planes[i]
else:
i += 1
def add_event(events, event):
""" Adds an event (list of indexes) to the events list.
If the event is a subset of the previous or vice versa,
the biggest set is kept.
"""
if events and events[-1] < event:
# Replace
events[-1] = event
elif events and events[-1] >= event:
# Previous event already covers this event
pass
else:
# Neither is a subset of each other
events.append(event)
print "findSingleEvents and findMultiEvents"
t1 = time.time()
print findSingleEvents(df[df.STRIP==1])
print findSingleEvents(df[df.STRIP==2])
print findMultiEvents(df)
t2 = time.time()
print t2 - t1
print
print "findEvents"
t1 = time.time()
single, multiple = findEvents(df)
print single[1]
print single[2]
print multiple
t2 = time.time()
print t2 - t1
Output:
findSingleEvents and findMultiEvents
[[1, 3]]
[[4, 5], [4, 6]]
[[1, 3, 4, 5], [1, 4, 6], [1, 2, 3]]
0.0490000247955
findEvents
[[1, 3]]
[[4, 5], [4, 6]]
[[1, 2, 3], [1, 3, 4, 5], [1, 4, 6]]
0.00300002098083
In my opinion, this is also easier to read, test and debug than the more pandas-oriented approach.
You can make this even faster by accessing row.Index
and row.Arrival
as row[0]
and row[2]
: tuple-like access is faster than dictionary-like access, but less readable.
If I understand you correctly I have "Do not count events that fully occur on a single strip." covered. If not, you will have to adjust the code for this yourself. Good luck!
for i in reversed...
as once you find an ARRIVAL smaller than all DEPARTURE ofdfTemp
, all the one after in this loop will be smaller, so you can addelse: break
after theif (dfTemp.loc[i].ARRIVAL ...
. Should save some times \$\endgroup\$