I have a Pandas DataFrame that contains one row per member per day, expressing member interaction with a website. Members interact only on some days, and each member is identified by an ID. Here is a simulated example:
import pandas as pd
import numpy as np
# Generate data.
ids = np.repeat(np.arange(100), np.random.randint(100, size = 100))
test = (
    pd.Series(
        ids,
        index=pd.Series(pd.date_range('2020-01-01', '2020-02-01').values)
                .sample(ids.shape[0], replace=True),
    )
    .sort_index()
)
print(test.head())
Gives:
2020-01-01     4
2020-01-01    65
2020-01-01    95
2020-01-01    40
2020-01-01    88
dtype: int32
I'd like to calculate a unique count of members within a 7 day rolling window. After some experimentation and research on Stack Overflow (including Pandas rolling / groupby function), I arrived at an explicit loop and slicing.
# Calculate rolling 7-day unique count.
unique_count = {}
for k in test.index.unique():
    unique_count[k] = test.loc[k - pd.Timedelta('7 days'):k].nunique()
# Store as a DataFrame and truncate to a minimum period.
unique_count = pd.DataFrame.from_dict(
    unique_count, orient='index', columns=['7_day_unique_count']
).iloc[7:]
print(unique_count.tail())
Gives:
            7_day_unique_count
2020-01-28                  98
2020-01-29                  98
2020-01-30                  98
2020-01-31                  97
2020-02-01                  97
This seems to work correctly and performs OK. But is it possible to do this without the explicit loop, such as with resample / groupby / rolling functions? If so, is that more efficient?
- greybeard (Jul 2, 2020): Welcome to CodeReview@SE. This seems at the very border of what's on- and off-topic here: looking for information and opinions on one implementation alternative not written (yet) rather than a review of the code presented.
- Peilonrayz ♦ (Jul 2, 2020): @greybeard This is on-topic. However I'm interested, what train of thought is making you think that this is potentially off-topic?
- Peilonrayz ♦ (Jul 2, 2020): @greybeard thanks for taking the time to reply. I hope you have a nice holiday!
- Juho (Jul 30, 2020): I don't know if you've seen the discussion and issues on GitHub, but you might like this one and the discussion around it.
- smj (Aug 2, 2020): Thanks @Juho, that looks very similar to my issue. It sounds like this is a gap in pandas.
1 Answer
How about something like this:
test.rolling('7d').apply(lambda s: s.nunique()).groupby(level=0).max()
rolling('7d')
is the rolling window. The window is determined for each row. So the first window starts from the row "2020-01-01  4" and extends 7 days into the past. The second window starts from the row "2020-01-01  65" and extends 7 days into the past.
.apply(lambda s: s.nunique())
determines the number of unique items in the window. But, because of the way rolling works, we get multiple results for the same day.
.groupby(level=0)
groups the results by the date.
.max()
takes the maximum nunique value for that date.
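To make the "multiple results per day" behaviour concrete, here's a small sketch I've added (toy data, not from the question) showing that offset-based rolling emits one result per row, so a duplicated date produces duplicate entries that the groupby/max step then collapses:

```python
import pandas as pd

# Toy series with a duplicated date, mimicking the question's data:
# two members on 2020-01-01, one each on the next two days.
s = pd.Series(
    [1, 2, 1, 3],
    index=pd.to_datetime(['2020-01-01', '2020-01-01',
                          '2020-01-02', '2020-01-03']),
)

# Offset-based rolling emits one result per ROW, so the duplicated
# 2020-01-01 produces two entries rather than one per day.
per_row = s.rolling('2d').apply(lambda w: w.nunique())
print(per_row)

# Collapsing by date keeps the value for the complete day's window.
per_day = per_row.groupby(level=0).max()
print(per_day)
```

Note that the window's left edge is open by default (`closed='right'` for offset windows), so a "2d" window ending on 2020-01-03 excludes rows stamped exactly 2020-01-01.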
The above approach seemed rather slow, so here's a different one. Basically, use a Counter as a multiset and a deque as a FIFO queue. For each day, update the Counter with that day's ids and subtract the ids of the day that falls out of the window; the len of the Counter is then the number of unique ids still in the window.
from collections import Counter, deque

WINDOW = 7
fifo = deque(maxlen=WINDOW)
uniq = Counter()

def unique_in_window(x):
    # Assumes one groupby group per calendar day, with no gaps.
    if len(fifo) == WINDOW:
        uniq.subtract(fifo.popleft())
        # Purge zero counts: Counter.subtract keeps zero-count keys,
        # which would otherwise inflate len(uniq).
        for k in [k for k, v in uniq.items() if v <= 0]:
            del uniq[k]
    uniq.update(x)
    fifo.append(x)
    return len(uniq)

test.groupby(level=0).apply(unique_in_window)
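As a quick sanity check (my own sketch, not part of the original answer, with seeded hypothetical data), the Counter/deque version can be compared against a brute-force per-day slice. It counts the last 7 *groups*, so it matches a 7-calendar-day window only when every day appears in the index with no gaps; the zero-count purge is needed because Counter.subtract keeps keys whose count has dropped to zero:

```python
import numpy as np
import pandas as pd
from collections import Counter, deque

# Simulated data: one group per calendar day, five member ids per day.
rng = np.random.default_rng(0)
idx = pd.date_range('2020-01-01', '2020-01-20')
test = pd.Series(rng.integers(0, 10, size=len(idx) * 5),
                 index=idx.repeat(5))

WINDOW = 7
fifo = deque(maxlen=WINDOW)
uniq = Counter()

def unique_in_window(x):
    if len(fifo) == WINDOW:
        uniq.subtract(fifo.popleft())
        # Purge zero counts so len(uniq) only counts live ids.
        for k in [k for k, v in uniq.items() if v <= 0]:
            del uniq[k]
    uniq.update(x)
    fifo.append(x)
    return len(uniq)

fast = test.groupby(level=0).apply(unique_in_window)

# Brute force: unique count over the last 7 calendar days (inclusive).
slow = pd.Series({k: test.loc[k - pd.Timedelta('6 days'):k].nunique()
                  for k in test.index.unique()})
assert (fast == slow).all()
```

The brute-force slice uses '6 days' so that the window spans exactly 7 calendar days inclusive, matching the 7-group deque.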
- David Davó (May 13, 2023): The Counter approach assumes that there's one day per datum. What if there are days without a datum, or two datums in the same day? It is not equivalent to the first approach.
- David Davó (May 13, 2023): The first approach tells me 'no numeric types to aggregate'; if I somehow cast it to an integer, it would be converted to float, perhaps with some unintended collisions.