Analysis of airport utilization using PANDAS

Question 1

I am trying to find a vectorized (or more efficient) solution to an iteration problem, where the only solution I found requires row by row iteration of a DataFrame with multiple loops. The actual data file is huge, so my current solution is practically unfeasible. I included line profiler outputs at the very end, if you'd like to have a look.

The main issue seems to be the increased overhead due to frequent Pandas access for filtering operations, especially when creating the temporary DataFrame dfTemp. Any ideas on modifying the algorithm and minimizing Pandas usage are welcome. I am also open to Numpy/Cython/Numba implementations.

Problem Statement

An airport has two landing strips side by side. Each plane lands (arrival time), taxis on one of the landing strips for a while, then takes off (departure time). Looking for solutions to two cases:

Build a list of events where there are more than one plane present on a single strip at a time. Do not include subsets of events (e.g. do not show [3,4] if there is a valid [3,4,5] case). The list should store the indices of the actual DataFrame rows. See function findSingleEvents() for a solution for this case (runs around 5 ms).
Build a list of events where there is at least one plane on each strip at a time. Do not count subsets of an event, only record the event with maximum number of planes. (e.g. do not show [3,4] if there is a [3,4,5] case). Do not count events that fully occur on a single strip. The list should store the indices of the actual DataFrame rows. See function findMultiEvents() for a solution for this case (runs around 15 ms).

Code with sample input

import numpy as np
import pandas as pd
import itertools
from __future__ import division
data = [{'PLANE':0, 'STRIP':1, 'ARRIVAL':85.00, 'DEPARTURE':86.00},
 {'PLANE':1, 'STRIP':1, 'ARRIVAL':87.87, 'DEPARTURE':92.76},
 {'PLANE':2, 'STRIP':2, 'ARRIVAL':88.34, 'DEPARTURE':89.72},
 {'PLANE':3, 'STRIP':1, 'ARRIVAL':88.92, 'DEPARTURE':90.88},
 {'PLANE':4, 'STRIP':2, 'ARRIVAL':90.03, 'DEPARTURE':92.77},
 {'PLANE':5, 'STRIP':2, 'ARRIVAL':90.27, 'DEPARTURE':91.95},
 {'PLANE':6, 'STRIP':2, 'ARRIVAL':92.42, 'DEPARTURE':93.58},
 {'PLANE':7, 'STRIP':2, 'ARRIVAL':94.42, 'DEPARTURE':95.58}]
df = pd.DataFrame(data, columns = ['PLANE','STRIP','ARRIVAL','DEPARTURE'])
def findSingleEvents(df):
 events = []
 for row in df.itertuples():
 #Create temporary dataframe for each main iteration
 dfTemp = df[(row.DEPARTURE>df.ARRIVAL) & (row.ARRIVAL<df.DEPARTURE)]
 if len(dfTemp)>1:
 #convert index values to integers from long
 current_event = [int(v) for v in dfTemp.index.tolist()]
 #loop backwards to remove elements that do not comply
 for i in reversed(current_event):
 if (dfTemp.loc[i].ARRIVAL > dfTemp.DEPARTURE).any():
 current_event.remove(i)
 events.append(current_event)
 #remove duplicate events
 events = map(list, set(map(tuple, events)))
 return events
def findMultiEvents(df):
 events = []
 for row in df.itertuples():
 #Create temporary dataframe for each main iteration
 dfTemp = df[(row.DEPARTURE>df.ARRIVAL) & (row.ARRIVAL<df.DEPARTURE)]
 if len(dfTemp)>1:
 #convert index values to integers from long
 current_event = [int(v) for v in dfTemp.index.tolist()]
 #loop backwards to remove elements that do not comply
 for i in reversed(current_event):
 if (dfTemp.loc[i].ARRIVAL > dfTemp.DEPARTURE).any():
 current_event.remove(i)
 #remove elements only on 1 strip
 if len(df.iloc[current_event].STRIP.unique()) > 1:
 events.append(current_event)
 #remove duplicate events
 events = map(list, set(map(tuple, events)))
 return events
print findSingleEvents(df[df.STRIP==1])
print findSingleEvents(df[df.STRIP==2])
print findMultiEvents(df)

Output

[[1, 3]]
[[4, 5], [4, 6]]
[[1, 3, 4, 5], [1, 4, 6], [1, 2, 3]]

Line Profiler Logs

%lprun -f findSingleEvents findSingleEvents(df[df.STRIP==1])
Timer unit: 2.85099e-07 s
Total time: 0.0172055 s
File: <ipython-input-33-220dd9d5b99b>
Function: findSingleEvents at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
 1 def findSingleEvents(df):
 2 1 9.0 9.0 0.0 events = []
 3 4 8702.0 2175.5 14.4 for row in df.itertuples():
 4 3 31604.0 10534.7 52.4 dfTemp = df[(row.DEPARTURE>df.ARRIVAL) & (row.ARRIVAL<df.DEPARTURE)]
 5 3 65.0 21.7 0.1 if len(dfTemp)>1:
 6 6 334.0 55.7 0.6 current_event = [int(v) for v in dfTemp.index.tolist()]
 7 6 50.0 8.3 0.1 for i in reversed(current_event):
 8 4 19537.0 4884.2 32.4 if (dfTemp.loc[i].ARRIVAL > dfTemp.DEPARTURE).any():
 9 current_event.remove(i)
 10 2 12.0 6.0 0.0 events.append(current_event)
 11 1 33.0 33.0 0.1 events = map(list, set(map(tuple, events)))
 12 1 3.0 3.0 0.0 return events
%lprun -f findMultiEvents findMultiEvents(df)
Timer unit: 2.85099e-07 s
Total time: 0.0532152 s
File: <ipython-input-28-97265d757453>
Function: findMultiEvents at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
 1 def findMultiEvents(df):
 2 1 18.0 18.0 0.0 events = []
 3 9 21661.0 2406.8 11.6 for row in df.itertuples():
 4 8 60694.0 7586.8 32.5 dfTemp = df[(row.DEPARTURE>df.ARRIVAL) & (row.ARRIVAL<df.DEPARTURE)]
 5 8 145.0 18.1 0.1 if len(dfTemp)>1:
 6 32 1208.0 37.8 0.6 current_event = [int(v) for v in dfTemp.index.tolist()]
 7 32 152.0 4.8 0.1 for i in reversed(current_event):
 8 26 87007.0 3346.4 46.6 if (dfTemp.loc[i].ARRIVAL > dfTemp.DEPARTURE).any():
 9 6 67.0 11.2 0.0 current_event.remove(i)
 10 6 15636.0 2606.0 8.4 if len(df.iloc[current_event].STRIP.unique()) > 1:
 11 6 38.0 6.3 0.0 events.append(current_event)
 12 1 27.0 27.0 0.0 events = map(list, set(map(tuple, events)))
 13 1 2.0 2.0 0.0 return events

Question 2

your data seems sort by ARRIVAL, if it's the case, at least you can use a break in thefor i in reversed... as once you find an ARRIVAL smaller than all DEPARTURE of dfTemp, all the one after in this loop will be smaller, so you can add else: break after the if (dfTemp.loc[i].ARRIVAL .... Should save some times

Question 3

what do you mean by your data is huge?

Question 4

I suggest using a simpler approach here: just keep track of which planes are where, as you go chronologically through time. This way, you only go through the data once, and you don't have to create dfTemp.

def findEvents(df):
 # An event: a list of indices of the df rows representing planes involved 
 # in the event.
 # Planes present on the strips. 
 strips = {
 1: [],
 2: []
 }
 single_events = {
 1: [],
 2: []
 }
 multi_events = []
 for row in df.sort_values(by=['ARRIVAL']).itertuples():
 # Check for departed planes first
 for planes in strips.values():
 remove_departed_planes(row.ARRIVAL, planes)
 strips[row.STRIP].append(row)
 # Single Event: 
 # Build a list of events where there are more than one plane present on 
 # a single strip at a time. Do not include subsets of events.
 for strip_name in (1, 2):
 if len(strips[strip_name]) > 1:
 indexes = frozenset(int(row.Index) for row in strips[strip_name])
 add_event(single_events[strip_name], indexes)
 # Multi Event
 # Build a list of events where there is at least one plane on each strip 
 # at a time. Do not count subsets of an event. 
 # Do not count events that fully occur on a single strip. (???) 
 if strips[1] and strips[2]:
 indexes = frozenset(int(row.Index) for row in strips[1]) | \
 frozenset(int(row.Index) for row in strips[2])
 add_event(multi_events, indexes)
 # Sort the sets, for consistent output 
 return (
 {
 1: [sorted(event) for event in single_events[1] ],
 2: [sorted(event) for event in single_events[2] ],
 },
 [sorted(event) for event in multi_events]
 )
def remove_departed_planes(time, planes):
 i = 0
 while i < len(planes):
 if planes[i].DEPARTURE < time:
 del planes[i]
 else:
 i += 1
def add_event(events, event):
 """ Adds an event (list of indexes) to the events list.
 If the event is a subset of the previous or vice versa,
 the biggest set is kept.
 """
 if events and events[-1] < event:
 # Replace
 events[-1] = event
 elif events and events[-1] >= event:
 # Previous event already covers this event
 pass
 else:
 # Neither is a subset of each other
 events.append(event)
print "findSingleEvents and findMultiEvents"
t1 = time.time()
print findSingleEvents(df[df.STRIP==1])
print findSingleEvents(df[df.STRIP==2])
print findMultiEvents(df)
t2 = time.time()
print t2 - t1
print
print "findEvents"
t1 = time.time()
single, multiple = findEvents(df)
print single[1]
print single[2]
print multiple
t2 = time.time()
print t2 - t1

Output:

findSingleEvents and findMultiEvents
[[1, 3]]
[[4, 5], [4, 6]]
[[1, 3, 4, 5], [1, 4, 6], [1, 2, 3]]
0.0490000247955
findEvents
[[1, 3]]
[[4, 5], [4, 6]]
[[1, 2, 3], [1, 3, 4, 5], [1, 4, 6]]
0.00300002098083

In my opinion, this is also easier to read, test and debug than the more pandas-oriented approach.

You can make this even faster by accessing row.Index and row.Arrival as row[0] and row[2]: tuple-like access is faster than dictionary-like access, but less readable.

If I understand you correctly I have "Do not count events that fully occur on a single strip." covered. If not, you will have to adjust the code for this yourself. Good luck!

pj.dewitte pj.dewitte 1764 bronze badges · Answer 1 · 2019-01-20 00:23:40Z

I suggest using a simpler approach here: just keep track of which planes are where, as you go chronologically through time. This way, you only go through the data once, and you don't have to create dfTemp.

def findEvents(df):
 # An event: a list of indices of the df rows representing planes involved 
 # in the event.
 # Planes present on the strips. 
 strips = {
 1: [],
 2: []
 }
 single_events = {
 1: [],
 2: []
 }
 multi_events = []
 for row in df.sort_values(by=['ARRIVAL']).itertuples():
 # Check for departed planes first
 for planes in strips.values():
 remove_departed_planes(row.ARRIVAL, planes)
 strips[row.STRIP].append(row)
 # Single Event: 
 # Build a list of events where there are more than one plane present on 
 # a single strip at a time. Do not include subsets of events.
 for strip_name in (1, 2):
 if len(strips[strip_name]) > 1:
 indexes = frozenset(int(row.Index) for row in strips[strip_name])
 add_event(single_events[strip_name], indexes)
 # Multi Event
 # Build a list of events where there is at least one plane on each strip 
 # at a time. Do not count subsets of an event. 
 # Do not count events that fully occur on a single strip. (???) 
 if strips[1] and strips[2]:
 indexes = frozenset(int(row.Index) for row in strips[1]) | \
 frozenset(int(row.Index) for row in strips[2])
 add_event(multi_events, indexes)
 # Sort the sets, for consistent output 
 return (
 {
 1: [sorted(event) for event in single_events[1] ],
 2: [sorted(event) for event in single_events[2] ],
 },
 [sorted(event) for event in multi_events]
 )
def remove_departed_planes(time, planes):
 i = 0
 while i < len(planes):
 if planes[i].DEPARTURE < time:
 del planes[i]
 else:
 i += 1
def add_event(events, event):
 """ Adds an event (list of indexes) to the events list.
 If the event is a subset of the previous or vice versa,
 the biggest set is kept.
 """
 if events and events[-1] < event:
 # Replace
 events[-1] = event
 elif events and events[-1] >= event:
 # Previous event already covers this event
 pass
 else:
 # Neither is a subset of each other
 events.append(event)
print "findSingleEvents and findMultiEvents"
t1 = time.time()
print findSingleEvents(df[df.STRIP==1])
print findSingleEvents(df[df.STRIP==2])
print findMultiEvents(df)
t2 = time.time()
print t2 - t1
print
print "findEvents"
t1 = time.time()
single, multiple = findEvents(df)
print single[1]
print single[2]
print multiple
t2 = time.time()
print t2 - t1

Output:

findSingleEvents and findMultiEvents
[[1, 3]]
[[4, 5], [4, 6]]
[[1, 3, 4, 5], [1, 4, 6], [1, 2, 3]]
0.0490000247955
findEvents
[[1, 3]]
[[4, 5], [4, 6]]
[[1, 2, 3], [1, 3, 4, 5], [1, 4, 6]]
0.00300002098083

In my opinion, this is also easier to read, test and debug than the more pandas-oriented approach.

You can make this even faster by accessing row.Index and row.Arrival as row[0] and row[2]: tuple-like access is faster than dictionary-like access, but less readable.

If I understand you correctly I have "Do not count events that fully occur on a single strip." covered. If not, you will have to adjust the code for this yourself. Good luck!

Stack Exchange Network

Analysis of airport utilization using PANDAS

Problem Statement

Code with sample input

Output

Line Profiler Logs

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Analysis of airport utilization using PANDAS

Problem Statement

Code with sample input

Output

Line Profiler Logs

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions