I am currently processing some large time-series data with pandas, and I have a function that is intolerably slow. I'm sure it could be done faster.
The problem is: I'm studying a factory that produces things. It runs continuously throughout the week, but on weekends it shuts down. Before the end of the week and at the start of a new one, the factory behaves differently, which interferes with my analysis, so I want to filter out a time window around these weekends.
I have a large dataframe, call it df, whose rows are the articles produced and whose columns are their various attributes, one of which is the time at which each article was produced, df['timeProduced']. These articles are produced at irregularly-spaced points in time. I want to discard the rows in the table whose timeProduced entry is near one of these shutdown periods. The actual data is confidential, but it looks similar to this:
index  partId        colour  timeProduced
1      '026531|352'  Red     2017-02-01 00:00:02
2      '026531|353'  Blue    2017-02-01 00:00:03
3      '026531|354'  Blue    2017-02-01 00:00:05
4      '026531|355'  Green   2017-02-01 00:00:09
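For anyone who wants to experiment, a stand-in with the same shape can be built like this (all values invented, since the real data is confidential):

import pandas as pd

# Hypothetical sample mirroring the table above; partId and colour are
# made up, and timeProduced is parsed into pandas Timestamps.
df = pd.DataFrame({
    'partId': ['026531|352', '026531|353', '026531|354', '026531|355'],
    'colour': ['Red', 'Blue', 'Blue', 'Green'],
    'timeProduced': pd.to_datetime([
        '2017-02-01 00:00:02', '2017-02-01 00:00:03',
        '2017-02-01 00:00:05', '2017-02-01 00:00:09',
    ]),
})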
This takes tens of minutes to crunch through a million entries. I know it's slow because it isn't at all vectorized, but I'm not sure how to do a pure vectorized numpy/pandas implementation. Any ideas?
def dropIrregularGaps(series, gapLength, runIn):
    '''
    Designed for time-series data where points are sampled at irregular
    time intervals. Detects adjacent points that are sampled too far apart,
    and then removes points on either side of the gap which are within a
    defined runIn period.

    Assumes the time-series data is already sorted. If not, it will deliver garbage.

    series is a pandas Series, with values that are pandas Timestamps.
    gapLength is the amount of time that is considered to be a shutdown.
    runIn is the length of time to remove on either side of the gap.

    Returns a list of indices that are valid.
    '''
    samples = list(series)
    indices = list(series.index)
    prev = samples[0]
    ind = indices[0]
    allGoodIndices = []
    currentGoodIndices = [ind]
    currentGoodTimes = [prev]
    skipPoint = None
    for new, ind in zip(samples[1:], indices[1:]):
        if skipPoint:
            if new - skipPoint >= runIn:
                # a gap has been detected; skip over all points until the
                # current point is past the run-in period
                skipPoint = None
                currentGoodIndices = [ind]
                currentGoodTimes = [new]
        elif new - prev > gapLength:
            # a gap is detected: cut the cooldown period out of the list,
            # and add what remains to the list of good indices
            endPoint = currentGoodTimes[-1]
            while currentGoodTimes and (endPoint - currentGoodTimes[-1] < runIn):
                del currentGoodTimes[-1]
                del currentGoodIndices[-1]
            allGoodIndices += currentGoodIndices
            currentGoodIndices = []
            currentGoodTimes = []
            skipPoint = new
        else:
            currentGoodIndices += [ind]
            currentGoodTimes += [new]
        prev = new
    allGoodIndices += currentGoodIndices
    return allGoodIndices
I call this function on my dataframe by running:
result = dropIrregularGaps(df['timeProduced'], pd.Timedelta('4 hours'), pd.Timedelta('8 hours 0 minutes'))
I then use the result to index into the dataframe, giving me the dataframe without the startup/cooldown periods.
df = df.loc[result]
- pjz (Apr 10, 2018): I suspect you want to look into dataframe.resample(), though I'm not sure because I don't fully understand what you're trying to do.
- Maarten Fabré (Apr 10, 2018): Can you share some sample data to experiment on?
- Vogel612 (Apr 11, 2018): Why exactly are you not filtering the data with a specialized timeseries database or similar before crunching it with python?
1 Answer
Some general tips
functions
Split the work into functions, each doing a specific job, so that each part can be verified and tested individually.
PEP-8
Try to follow the PEP-8 guidelines.
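For example, the question's camelCase names in PEP-8 style (just a sketch of the renaming, not a behaviour change):

# sketch: snake_case function and argument names, spaces after commas
def drop_irregular_gaps(series, gap_length, run_in):
    ...

result = drop_irregular_gaps(df['timeProduced'],
                             pd.Timedelta('4 hours'), pd.Timedelta('8 hours'))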
My algorithm
If your pandas version is >= 0.20, you can use pandas.merge_asof, once you have a series with the end and the start of each work week.
dummy data

import numpy as np
import pandas as pd

np.random.seed(1)
gap_max, run_in = 3, 2
time = [0, 1, 2, 3, 7, 8, 9, 13, 15, 16, 17, 18]
values = np.random.random(size=len(time))
data = pd.DataFrame({'time': time, 'values': values})
    time                   values
0      0       0.417022004702574
1      1       0.7203244934421581
2      2  0.00011437481734488664
3      3      0.30233257263183977
4      7      0.14675589081711304
5      8       0.0923385947687978
6      9       0.1862602113776709
7     13      0.34556072704304774
8     15      0.39676747423066994
9     16        0.538816734003357
10    17       0.4191945144032948
11    18       0.6852195003967595
so for this data, we expect the rows with time values 1, 2, 3, 7, 8, 9, 13 and 15 to be dropped
Finding the gaps
The gaps can be found by using Series.shift.
def find_weekend(times, gap_max):
    gap = times - times.shift(1) > gap_max
    week_start = times[gap]
    weekend_start = times[gap.shift(-1).fillna(False)]
    return weekend_start, week_start

find_weekend(data['time'], gap_max)
(3    3
 6    9
 Name: time, dtype: int64,
 4     7
 7    13
 Name: time, dtype: int64)
Marking the start of the data as the beginning of a week can be done by adding gap.iloc[0] = True as a second line. Marking the end of the data as the end of a week can be done by changing the .fillna(False) to .fillna(True).
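Combining both tweaks, a sketch of a boundary-aware variant (the name find_weekend_with_bounds is mine):

def find_weekend_with_bounds(times, gap_max):
    # as find_weekend, but also treats the first sample as a week start
    # and the last sample as a weekend start
    gap = times - times.shift(1) > gap_max
    gap.iloc[0] = True
    week_start = times[gap]
    weekend_start = times[gap.shift(-1).fillna(True)]
    return weekend_start, week_start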
merging with the data
Since merge_asof expects DataFrames, we first need to do some transformation.

def drop_irregular_gaps(data, gap_max, run_in, time_label='time'):
    weekend_start, week_start = find_weekend(data[time_label], gap_max)
    df_week_end = weekend_start.to_frame(name=time_label).assign(run_out=True)
    df_week_start = week_start.to_frame(name=time_label).assign(run_in=True)
    df_data = data[[time_label]]
Then we can use 2 merges, one forward to mark the end of the week, one backward to mark the beginning of the week
    before_weekend = pd.merge_asof(
        df_data, df_week_end,
        on=time_label, direction='forward', tolerance=run_in,
    ).set_index(time_label)['run_out'].fillna(False).values
    after_weekend = pd.merge_asof(
        df_data, df_week_start,
        on=time_label, direction='backward', tolerance=run_in,
    ).set_index(time_label)['run_in'].fillna(False).values
These are two boolean arrays, with True for the rows that fall in a run_in or run_out period:
(array([False,  True,  True,  True,  True,  True,  True, False, False, False, False, False], dtype=bool),
 array([False, False, False, False,  True,  True,  True,  True,  True, False, False, False], dtype=bool))
Next we just use an or and a not for the boolean indexing:

    to_drop = before_weekend | after_weekend
    return data[~to_drop]
drop_irregular_gaps(data, gap_max, run_in)
    time              values
0      0   0.417022004702574
9     16   0.538816734003357
10    17  0.4191945144032948
11    18  0.6852195003967595
This can be easily adapted to use two separate values for the run-in and run-out periods.
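For instance, a sketch with separate tolerances (the run_out parameter name and function name are mine; otherwise this mirrors drop_irregular_gaps above):

def drop_irregular_gaps_asymmetric(data, gap_max, run_out, run_in,
                                   time_label='time'):
    # as drop_irregular_gaps, but the window removed before a weekend
    # (run_out) and the one after it (run_in) may differ
    weekend_start, week_start = find_weekend(data[time_label], gap_max)
    df_week_end = weekend_start.to_frame(name=time_label).assign(run_out=True)
    df_week_start = week_start.to_frame(name=time_label).assign(run_in=True)
    df_data = data[[time_label]]
    before_weekend = pd.merge_asof(
        df_data, df_week_end,
        on=time_label, direction='forward', tolerance=run_out,
    ).set_index(time_label)['run_out'].fillna(False).values
    after_weekend = pd.merge_asof(
        df_data, df_week_start,
        on=time_label, direction='backward', tolerance=run_in,
    ).set_index(time_label)['run_in'].fillna(False).values
    return data[~(before_weekend | after_weekend)]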
Datetime data
The algorithm should be agnostic about whether the time_label data is numeric or datetime. I verified that this algorithm also works with this dummy data:
data_start = pd.Timestamp('20180101')
time = data_start + pd.to_timedelta([0, 1, 2, 3, 7, 8, 9, 13, 15, 16, 17, 18], unit='day')
gap_max, run_in = pd.to_timedelta(3, unit='day'), pd.to_timedelta(2, unit='day')
values = np.random.random(size=len(time))
data = pd.DataFrame({'time': time, 'values': values})
drop_irregular_gaps(data, gap_max, run_in)
         time              values
0  2018-01-01   0.417022004702574
9  2018-01-17   0.538816734003357
10 2018-01-18  0.4191945144032948
11 2018-01-19  0.6852195003967595
alternative without merge_asof
Since merge_asof apparently doesn't work as well with duplicate data, here is a variant with a loop. If there are a lot of weekends, this might be slower, but I reckon it will still be faster than the original code.
def mark_runin(time, week_endpoints, run_in, direction='backward'):
    mask = np.zeros_like(time, dtype=bool)
    for point in week_endpoints:
        interval = (point, point + run_in) if direction == 'forward' else (point - run_in, point)
        mask |= time.between(*interval).values
    return mask

mark_runin(time, weekend_start, run_in)
array([False, True, True, True, True, True, True, False, False, False, False, False], dtype=bool)
def drop_irregular_gaps2(data, gap_max, run_in, time_label='time'):
    times = data[time_label]
    weekend_start, week_start = find_weekend(times, gap_max)
    before_weekend = mark_runin(times, weekend_start, run_in, direction='backward')
    after_weekend = mark_runin(times, week_start, run_in, direction='forward')
    to_drop = before_weekend | after_weekend
    return data[~to_drop]

drop_irregular_gaps2(data, gap_max, run_in)
         time              values
0  2018-01-01   0.417022004702574
9  2018-01-17   0.538816734003357
10 2018-01-18  0.4191945144032948
11 2018-01-19  0.6852195003967595
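To sanity-check the speed claim, a rough benchmark could look like this (a sketch: the synthetic data and sizes are invented, and it assumes dropIrregularGaps from the question and drop_irregular_gaps2 from above are in scope):

import time as timer  # aliased to avoid clashing with the 'time' series above

import numpy as np
import pandas as pd

# a million monotonically increasing, irregularly spaced timestamps
n = 1_000_000
start = pd.Timestamp('2017-01-01')
offsets = np.sort(np.random.randint(0, 10_000_000, size=n))
data = pd.DataFrame({'time': start + pd.to_timedelta(offsets, unit='s')})

gap_max = pd.Timedelta('4 hours')
run_in = pd.Timedelta('8 hours')

t0 = timer.perf_counter()
drop_irregular_gaps2(data, gap_max, run_in)
print('vectorized:', timer.perf_counter() - t0, 's')

t0 = timer.perf_counter()
dropIrregularGaps(data['time'], gap_max, run_in)
print('original loop:', timer.perf_counter() - t0, 's')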
- Mdupont (Apr 10, 2018): This solution is really good! It's almost perfect, but when the dataset contains rows that have exactly the same timestamps, it delivers a 'ValueError: left keys must be sorted'. Is there an easy way around this?
- Mdupont (Apr 11, 2018): Never mind, the ValueError came from somewhere else.