I have a Pandas DataFrame that contains one row per member per day, expressing member interaction with a website. Members interact only on some days, and each member is identified by an ID. Here is a simulated example:
import pandas as pd
import numpy as np
# Generate data.
ids = np.repeat(np.arange(100), np.random.randint(100, size = 100))
test = (
    pd.Series(
        ids,
        index=pd.Series(pd.date_range('2020-01-01', '2020-02-01').values)
                .sample(ids.shape[0], replace=True),
    )
    .sort_index()
)
print(test.head())
Gives:
2020-01-01     4
2020-01-01    65
2020-01-01    95
2020-01-01    40
2020-01-01    88
dtype: int32
I'd like to calculate a unique count of members within a 7 day rolling window. After some experimentation and research on Stack Overflow (including Pandas rolling / groupby function), I arrived at an explicit loop and slicing.
# Calculate rolling 7-day unique count.
unique_count = {}
for k in test.index.unique():
    unique_count[k] = test.loc[k - pd.Timedelta('7 days'):k].nunique()
# Store as a DataFrame and truncate to a minimum period.
unique_count = pd.DataFrame.from_dict(
    unique_count, orient='index', columns=['7_day_unique_count']
).iloc[7:]
print(unique_count.tail())
Gives:
            7_day_unique_count
2020-01-28                  98
2020-01-29                  98
2020-01-30                  98
2020-01-31                  97
2020-02-01                  97
This seems to work correctly and performs OK. But is it possible to do this without the explicit loop, such as with resample / groupby / rolling functions? If so, is that more efficient?
- greybeard (Jul 2, 2020): Welcome to CodeReview@SE. This seems at the very border of what's on- and off-topic here: looking for information and opinions on one implementation alternative not written (yet) rather than a review of the code presented.
- Peilonrayz ♦ (Jul 2, 2020): @greybeard This is on-topic. However I'm interested, what train of thought is making you think that this is potentially off-topic?
- Peilonrayz ♦ (Jul 2, 2020): @greybeard thanks for taking the time to reply. I hope you have a nice holiday!
- Juho (Jul 30, 2020): I don't know if you've seen the discussion and issues on GitHub, but you might like this one and the discussion around it.
- smj (Aug 2, 2020): Thanks @Juho, that looks very similar to my issue. It sounds like this is a gap in pandas.
1 Answer
How about something like this:
test.rolling('7d').apply(lambda s: s.nunique()).groupby(level=0).max()
rolling('7d')
is the rolling window. The window is determined for each row. So the first window starts from the row "2020-01-01  4" and extends 7 days into the past. The second window starts from the row "2020-01-01  65" and extends 7 days into the past.
.apply(lambda s: s.nunique())
determines the number of unique items in the window. But, because of the way rolling works, we get multiple results for the same day.
.groupby(level=0)
groups the results by the date.
.max()
takes the maximum nunique value for that date.
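To make the "multiple results per day" behaviour concrete, here's a small sketch I've added (toy data, not from the question) showing that offset-based rolling emits one result per row, so a duplicated date produces duplicate entries that the groupby/max step then collapses:

```python
import pandas as pd

# Toy series with a duplicated date, mimicking the question's data:
# two members on 2020-01-01, one each on the next two days.
s = pd.Series(
    [1, 2, 1, 3],
    index=pd.to_datetime(['2020-01-01', '2020-01-01',
                          '2020-01-02', '2020-01-03']),
)

# Offset-based rolling emits one result per ROW, so the duplicated
# 2020-01-01 produces two entries rather than one per day.
per_row = s.rolling('2d').apply(lambda w: w.nunique())
print(per_row)

# Collapsing by date keeps the value for the complete day's window.
per_day = per_row.groupby(level=0).max()
print(per_day)
```

Note that the window's left edge is open by default (`closed='right'` for offset windows), so a "2d" window ending on 2020-01-03 excludes rows stamped exactly 2020-01-01.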
The above approach seemed rather slow, so here's a different one. Basically, use a Counter as a multiset and a deque as a FIFO queue. For each day, update the Counter with that day's ids and subtract the ids of the day that falls out of the window; the len of the Counter is then the number of unique ids still in the window.
from collections import Counter, deque

WINDOW = 7
fifo = deque(maxlen=WINDOW)
uniq = Counter()

def unique_in_window(x):
    # Assumes one groupby group per calendar day, with no gaps.
    if len(fifo) == WINDOW:
        uniq.subtract(fifo.popleft())
        # Purge zero counts: Counter.subtract keeps zero-count keys,
        # which would otherwise inflate len(uniq).
        for k in [k for k, v in uniq.items() if v <= 0]:
            del uniq[k]
    uniq.update(x)
    fifo.append(x)
    return len(uniq)

test.groupby(level=0).apply(unique_in_window)
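As a quick sanity check (my own sketch, not part of the original answer, with seeded hypothetical data), the Counter/deque version can be compared against a brute-force per-day slice. It counts the last 7 *groups*, so it matches a 7-calendar-day window only when every day appears in the index with no gaps; the zero-count purge is needed because Counter.subtract keeps keys whose count has dropped to zero:

```python
import numpy as np
import pandas as pd
from collections import Counter, deque

# Simulated data: one group per calendar day, five member ids per day.
rng = np.random.default_rng(0)
idx = pd.date_range('2020-01-01', '2020-01-20')
test = pd.Series(rng.integers(0, 10, size=len(idx) * 5),
                 index=idx.repeat(5))

WINDOW = 7
fifo = deque(maxlen=WINDOW)
uniq = Counter()

def unique_in_window(x):
    if len(fifo) == WINDOW:
        uniq.subtract(fifo.popleft())
        # Purge zero counts so len(uniq) only counts live ids.
        for k in [k for k, v in uniq.items() if v <= 0]:
            del uniq[k]
    uniq.update(x)
    fifo.append(x)
    return len(uniq)

fast = test.groupby(level=0).apply(unique_in_window)

# Brute force: unique count over the last 7 calendar days (inclusive).
slow = pd.Series({k: test.loc[k - pd.Timedelta('6 days'):k].nunique()
                  for k in test.index.unique()})
assert (fast == slow).all()
```

The brute-force slice uses '6 days' so that the window spans exactly 7 calendar days inclusive, matching the 7-group deque.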
- David Davó (May 13, 2023): The Counter approach assumes that there's one day per datum. What if there are days without a datum, or two datums in the same day? It is not equivalent to the first approach.
- David Davó (May 13, 2023): The first approach tells me 'no numeric types to aggregate'; if I somehow cast it to an integer, it would be converted to float, perhaps with some unintended collisions.