I am currently processing some large time-series data with pandas, and I have a function that is intolerably slow; I'm sure it could be done faster.

The problem is: I'm studying a factory that produces things. It runs continuously throughout the week, but on weekends it shuts down. Before the end of the week, and at the start of a new one, the factory behaves differently, which interferes with the analysis that I'm doing, and so I want to filter out a time window around these weekends.

I have a large dataframe, call it df, whose rows are the articles produced and whose columns are their various attributes, one of which is the time at which each article was produced, df['timeProduced']. These articles are produced at irregularly-spaced points in time. I want to discard the rows in the table whose timeProduced entry is near one of these shutdown periods. The actual data is confidential, but it looks similar to this:

index  partId        colour  timeProduced         ...
1      '026531|352'  Red     2017-02-01 00:00:02
2      '026531|353'  Blue    2017-02-01 00:00:03
3      '026531|354'  Blue    2017-02-01 00:00:05
4      '026531|355'  Green   2017-02-01 00:00:09

This takes tens of minutes to crunch through a million entries. I know it's slow because it isn't at all vectorized, but I'm not sure how to do a pure vectorized numpy/pandas implementation. Any ideas?

def dropIrregularGaps(series, gapLength, runIn):
    '''
    Designed for time-series data where points are sampled at irregular
    time intervals. Detects adjacent points that are sampled too far apart,
    and then removes points on either side of the gap which are within a
    defined runIn period.
    Assumes the time-series data is already sorted. If not, it will deliver garbage.
    series is a pandas Series whose values are pandas datetime objects.
    gapLength is the amount of time that is considered to be a shutdown.
    runIn is the length of time to remove on either side of the gap.
    Returns a list of indices that are valid.
    '''
    samples = list(series)
    indices = list(series.index)
    prev = samples[0]
    ind = indices[0]
    allGoodIndices = []
    currentGoodIndices = [ind]
    currentGoodTimes = [prev]
    skipPoint = None
    for new, ind in zip(samples[1:], indices[1:]):
        if skipPoint:
            if new - skipPoint >= runIn:
                # a gap was detected earlier: skip over all points until the
                # current point is past the run-in period.
                skipPoint = None
                currentGoodIndices = [ind]
                currentGoodTimes = [new]
        elif new - prev > gapLength:
            # a gap is detected: cut the cooldown period out of the list,
            # and add what remains to the list of good indices.
            endPoint = currentGoodTimes[-1]
            while currentGoodTimes and (endPoint - currentGoodTimes[-1] < runIn):
                del currentGoodTimes[-1]
                del currentGoodIndices[-1]
            allGoodIndices += currentGoodIndices
            currentGoodIndices = []
            currentGoodTimes = []
            skipPoint = new
        else:
            currentGoodIndices += [ind]
            currentGoodTimes += [new]
        prev = new
    allGoodIndices += currentGoodIndices
    return allGoodIndices

I call this function on my dataframe by running:

result = dropIrregularGaps(df['timeProduced'], pd.Timedelta('4 hours'), pd.Timedelta('8 hours 0 minutes'))

I then use the result to index into the dataframe, giving me the dataframe without the startup/cooldown periods.

df = df.loc[result]
asked Apr 9, 2018 at 16:24
  • I suspect you want to look into dataframe.resample(), though I'm not sure because I don't fully understand what you're trying to do. Commented Apr 10, 2018 at 3:15
  • Can you share some sample data to experiment on? Commented Apr 10, 2018 at 8:12
  • Why exactly are you not filtering the data with a specialized timeseries database or similar before crunching it with Python? Commented Apr 11, 2018 at 10:28

1 Answer
Some general tips

functions

Split the work into functions, each doing one specific job, so every part can be verified and tested individually.

PEP-8

Try to follow the guidelines.
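
For instance, your function could be brought in line with the naming guidelines like this (just a sketch; only the names change, not the behaviour):

def drop_irregular_gaps(series, gap_length, run_in):
    # same body as dropIrregularGaps, with snake_case names throughout
    ...

result = drop_irregular_gaps(
    df['timeProduced'],
    gap_length=pd.Timedelta('4 hours'),
    run_in=pd.Timedelta('8 hours'),
)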

My algorithm

If your pandas version is >= 0.20, you can use pandas.merge_asof: if you have a series with the start and end of each work week, you can merge it onto the data with a tolerance.

dummy data

import numpy as np
import pandas as pd

np.random.seed(1)
gap_max, run_in = 3, 2
time = [0, 1, 2, 3, 7, 8, 9, 13, 15, 16, 17, 18]
values = np.random.random(size=len(time))
data = pd.DataFrame({'time': time, 'values': values})
    time    values
0      0  0.417022
1      1  0.720324
2      2  0.000114
3      3  0.302333
4      7  0.146756
5      8  0.092339
6      9  0.186260
7     13  0.345561
8     15  0.396767
9     16  0.538817
10    17  0.419195
11    18  0.685220

so for this data, we expect the rows with times 1, 2, 3, 7, 8, 9, 13 and 15 to be dropped

Finding the gaps

The gap can be found by using DataFrame.shift.

def find_weekend(times, gap_max):
    gap = times - times.shift(1) > gap_max
    week_start = times[gap]
    weekend_start = times[gap.shift(-1).fillna(False)]
    return weekend_start, week_start

find_weekend(data['time'], gap_max)
(3    3
 6    9
 Name: time, dtype: int64,
 4     7
 7    13
 Name: time, dtype: int64)

Marking the start of the data as the beginning of a week can be done by adding gap.iloc[0] = True as the second line. Marking the end of the data as the start of a weekend can be done by changing the fillna(False) to fillna(True), as in the sketch below.
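
Putting both suggestions together (a sketch; the function name is my own):

def find_weekend_with_endpoints(times, gap_max):
    gap = times - times.shift(1) > gap_max
    gap.iloc[0] = True  # treat the start of the data as the start of a week
    week_start = times[gap]
    # fillna(True) treats the end of the data as the start of a weekend
    weekend_start = times[gap.shift(-1).fillna(True)]
    return weekend_start, week_start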

merging with the data

Since merge_asof expects DataFrames, we first need to do some transformations:

def drop_irregular_gaps(data, gap_max, run_in, time_label='time'):
    weekend_start, week_start = find_weekend(data[time_label], gap_max)
    df_week_end = weekend_start.to_frame(name=time_label).assign(run_out=True)
    df_week_start = week_start.to_frame(name=time_label).assign(run_in=True)
    df_data = data[[time_label]]

Then we can use two merges: one forward to mark the end of the week, one backward to mark the beginning of the week.

    before_weekend = pd.merge_asof(
        df_data, df_week_end,
        on=time_label, direction='forward', tolerance=run_in,
    ).set_index(time_label)['run_out'].fillna(False).values
    after_weekend = pd.merge_asof(
        df_data, df_week_start,
        on=time_label, direction='backward', tolerance=run_in,
    ).set_index(time_label)['run_in'].fillna(False).values

These are two boolean arrays that are True for points inside a run-in or run-out period:

(array([False,  True,  True,  True,  True,  True,  True, False, False,
        False, False, False], dtype=bool),
 array([False, False, False, False,  True,  True,  True,  True,  True,
        False, False, False], dtype=bool))

Next we just use an or (|) and a not (~) for the boolean indexing:

    to_drop = before_weekend | after_weekend
    return data[~to_drop]

drop_irregular_gaps(data, gap_max, run_in)

    time    values
0      0  0.417022
9     16  0.538817
10    17  0.419195
11    18  0.685220

This can easily be adapted to use two separate tolerances for the run-in and run-out periods, as sketched below.
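
A sketch of that adaptation (the run_out parameter name and function name are my own):

def drop_irregular_gaps_asym(data, gap_max, run_in, run_out, time_label='time'):
    weekend_start, week_start = find_weekend(data[time_label], gap_max)
    df_week_end = weekend_start.to_frame(name=time_label).assign(run_out=True)
    df_week_start = week_start.to_frame(name=time_label).assign(run_in=True)
    df_data = data[[time_label]]
    before_weekend = pd.merge_asof(
        df_data, df_week_end,
        on=time_label, direction='forward', tolerance=run_out,  # run_out before the gap
    ).set_index(time_label)['run_out'].fillna(False).values
    after_weekend = pd.merge_asof(
        df_data, df_week_start,
        on=time_label, direction='backward', tolerance=run_in,  # run_in after the gap
    ).set_index(time_label)['run_in'].fillna(False).values
    return data[~(before_weekend | after_weekend)]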

Datetime data

The algorithm should be agnostic about whether the time_label column is numeric or datetime. I verified that it also works with this dummy data:

data_start = pd.Timestamp('20180101')
time = data_start + pd.to_timedelta([0, 1, 2, 3, 7, 8, 9, 13, 15, 16, 17, 18], unit='day')
gap_max, run_in = pd.to_timedelta(3, unit='day'), pd.to_timedelta(2, unit='day')
values = np.random.random(size=len(time))
data = pd.DataFrame({'time': time, 'values': values})
drop_irregular_gaps(data, gap_max, run_in)
         time    values
0  2018-01-01  0.417022
9  2018-01-17  0.538817
10 2018-01-18  0.419195
11 2018-01-19  0.685220

alternative without merge_asof

Since apparently merge_asof doesn't work as well with duplicate data, here is a variant with a loop. If there are a lot of weekends this might be slower, but I reckon it will still be faster than the original code.

def mark_runin(time, week_endpoints, run_in, direction='backward'):
    mask = np.zeros_like(time, dtype=bool)
    for point in week_endpoints:
        interval = (point, point + run_in) if direction == 'forward' else (point - run_in, point)
        mask |= time.between(*interval).values
    return mask

mark_runin(data['time'], weekend_start, run_in)

array([False,  True,  True,  True,  True,  True,  True, False, False,
       False, False, False], dtype=bool)
def drop_irregular_gaps2(data, gap_max, run_in, time_label='time'):
    times = data[time_label]
    weekend_start, week_start = find_weekend(times, gap_max)
    before_weekend = mark_runin(times, weekend_start, run_in, direction='backward')
    after_weekend = mark_runin(times, week_start, run_in, direction='forward')
    to_drop = before_weekend | after_weekend
    return data[~to_drop]

drop_irregular_gaps2(data, gap_max, run_in)

         time    values
0  2018-01-01  0.417022
9  2018-01-17  0.538817
10 2018-01-18  0.419195
11 2018-01-19  0.685220
answered Apr 10, 2018 at 9:43
  • This solution is really good! It's almost perfect, but when the dataset contains rows that have exactly the same timestamps, it delivers a ValueError: left keys must be sorted. Is there an easy way around this? Commented Apr 10, 2018 at 14:35
  • Never mind, the ValueError came from somewhere else. Commented Apr 11, 2018 at 15:13
