8
\$\begingroup\$

I have a Pandas DataFrame that contains a row per member per day, expressing member interaction with a website. Members interact only on some days, each member is identified with an ID. Here is a simulated example:

import pandas as pd
import numpy as np
# Generate data.
ids = np.repeat(np.arange(100), np.random.randint(100, size = 100))
test = (
 pd.Series(
 ids,
 index = pd.Series(pd.date_range('2020-01-01', '2020-02-01').values).sample(ids.shape[0], replace = True)
 )
 .sort_index()
)
print(test.head())

Gives:

2020年01月01日 4
2020年01月01日 65
2020年01月01日 95
2020年01月01日 40
2020年01月01日 88
dtype: int32

I'd like to calculate a unique count of members within a 7 day rolling window. After some experimentation and research on Stack Overflow (including Pandas rolling / groupby function), I arrived at an explicit loop and slicing.

# Calculate rolling 7 day unique count.
unique_count = {}
for k in test.index.unique():
 unique_count[k] = test.loc[k - pd.Timedelta('7 days'):k].nunique()
# Store as a dataframe and truncate to a minimum period.
unique_count = pd.DataFrame.from_dict(unique_count, orient = 'index', columns = ['7_day_unique_count']).iloc[7:]
print(unique_count.tail())

Gives:

 7_day_unique_count
2020年01月28日 98
2020年01月29日 98
2020年01月30日 98
2020年01月31日 97
2020年02月01日 97

This seems to work correctly and performs OK. But is it possible to do this without the explicit loop, such as with resample / groupby / rolling functions? If so, is that more efficient?

Peilonrayz
44.4k7 gold badges80 silver badges157 bronze badges
asked Jul 2, 2020 at 4:27
\$\endgroup\$
5
  • \$\begingroup\$ Welcome to CodeReview@SE. This seems at the very border of what's on- and off-topic here: looking for information and opinions on one implementation alternative not written (yet) rather than a review of the code presented. \$\endgroup\$ Commented Jul 2, 2020 at 4:49
  • 2
    \$\begingroup\$ @greybeard This is on-topic. However I'm interested, what train of thought is making you think that this is potentially off-topic? \$\endgroup\$ Commented Jul 2, 2020 at 7:48
  • \$\begingroup\$ @greybeard thanks for taking the time to reply. I hope you have a nice holiday! \$\endgroup\$ Commented Jul 2, 2020 at 22:08
  • \$\begingroup\$ I don't know if you've seen the discussion and issues on Github, but you might like this one and the discussion around it. \$\endgroup\$ Commented Jul 30, 2020 at 8:15
  • \$\begingroup\$ Thanks @Juho, that looks very similar to my issue. It sounds like this is a gap in pandas. \$\endgroup\$ Commented Aug 2, 2020 at 23:20

1 Answer 1

3
\$\begingroup\$

How about something like this:

test.rolling('7d').apply(lambda s:s.nunique()).groupby(level=0).max()

rolling('7d') is the rolling window. The window is determined for each row. So the first window starts from the row "2020年01月01日 4" and extends 7 days in the past. The second window starts from the row "2020年01月01日 65" and extends 7 days in the past.

.apply(lambda s:s.nunique()) determines the number of unique items in the window. But, because of the way rolling works, we get multiple results for the same day.

.groupby(level=0) groups the results by the date.

.max() takes the maximum nunique value for that date.

The above approach seemed rather slow, so here's a different approach. Basically use a Counter as a multiset and use a deque as a FIFO queue. For each day update the Counter with the id's for that day and subtract the ones for the day at the beginning of the window. The len of the Counter is then the number of unique ids.

from collections import Counter, deque
WINDOW = 7
fifo = deque(maxlen=WINDOW)
uniq = Counter()
def unique_in_window(x):
 if len(fifo) == WINDOW:
 uniq.subtract(fifo.popleft())
 
 uniq.update(x)
 fifo.append(x)
 
 return len(uniq)
test.groupby(level=0).apply(unique_in_window)
answered Sep 2, 2020 at 1:05
\$\endgroup\$
2
  • \$\begingroup\$ The Counter approach assumes that there's one day per datum. What if there are days without datum, or two datums in the same day? It is not equivalent to the first approach. \$\endgroup\$ Commented May 13, 2023 at 18:50
  • \$\begingroup\$ The first approach tells me 'no numeric types to aggregate', if I somehow cast it to an integer, it would be converted to float, perhaps with some unintended collisions. \$\endgroup\$ Commented May 13, 2023 at 18:51

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.