I have a Pandas DataFrame of users' subscription dates in the following format:
UserId, StartingDate, EndingDate
I am trying to calculate the Churn Rate metric for every day.
What Churn Rate is:
The churn rate, also known as the rate of attrition, is the percentage of subscribers to a service who discontinue their subscriptions to that service within a given time period.
So, for every day, I go back 1 month, get the list of unique users that had an active subscription then, and check how many of them no longer have one.
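In other words, writing active(d) for the set of users with an active subscription on day d, the number I want for each day is:

churn(d) = |active(d - 1 month) - active(d)| / |active(d - 1 month)|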
I wrote the code, but it takes ages to finish, so I am looking for any performance issues it might have.
import pandas as pd
from datetime import datetime
from datetime import timedelta

df = pd.read_csv("subscriptions.csv")

# make sure both columns are of datetime type
df['StartingDate'] = pd.to_datetime(df['StartingDate'])
df['EndingDate'] = pd.to_datetime(df['EndingDate'])

# get the first date of the dataframe to start the loop with, and set the stop date to today
start = pd.to_datetime(df.StartingDate.min())
minDate = start
stop = datetime.now()

def getUsersFromADate(df, date):
    return df.loc[(df['StartingDate'] <= date) & (df['EndingDate'] >= date)].UserId.unique()

churn = []
while start <= stop:
    # the first 30 days don't have a churn rate, so just append a 0 value
    if start < minDate + pd.DateOffset(months=1):
        churn.append(0)
    else:
        usersBefore = getUsersFromADate(df, start - pd.DateOffset(months=1))
        usersNow = getUsersFromADate(df, start)
        lost = 0
        for u in usersBefore:
            if u not in usersNow:
                lost += 1
        churn.append(lost / len(usersBefore))
    start = start + timedelta(days=1)  # increase the day one by one
Example of my data:
   UserId StartingDate EndingDate
0       1   2013-05-09 2015-04-24
1       1   2015-04-29 2017-04-02
2       1   2017-04-05 2017-12-06
3       2   2014-02-13 2018-02-07
4       3   2013-04-25 2018-04-19
- Making a query for every day can get very expensive if the range gets large. How large can this range be? And how many records are expected in the csv? – juvian
- The first date is in mid-2013. The whole file is about 7 MB. I'm not at my laptop now to check the exact number of lines, but it should be less than 500k. – Tasos
2 Answers
You could do the entire thing in pandas and numpy, since their Cython implementations will be much faster than iterating over objects in Python.
First, simulate some data:
import pandas as pd, numpy as np
from datetime import datetime
num_samples = 50000
user_ids = np.arange(num_samples)
dates = pd.date_range("2012-01-01","2015-01-01")
start_dates = dates[np.random.randint(0, len(dates), num_samples)]
data = pd.DataFrame(data={"user_id": user_ids, "start_date":start_dates})
data["end_date"] = data.start_date.apply(lambda x: x + pd.DateOffset(days=np.random.randint(0,300)))
which results in data:
  start_date  user_id   end_date
  2013-12-15        0 2014-09-24
  2013-12-13        1 2014-01-17
  2014-08-29        2 2015-03-25
  2014-04-13        3 2015-01-04
  2014-01-21        4 2014-06-22
We need an output dataframe so that we don't have to painfully iterate over each date one by one:
output = pd.DataFrame({"date":pd.date_range(data.start_date.min(), datetime.now())})
and now define your functions to calculate the churn rate and apply it over the entire data frame:
def get_users_from_date(df, date):
    return df[(date >= df.start_date) & (date <= df.end_date)].user_id

def calc_churn(df, date):
    users_before = get_users_from_date(df, date - pd.DateOffset(months=1))
    users_now = get_users_from_date(df, date)
    if len(users_before):
        return len(np.setdiff1d(users_before, users_now)) / len(users_before)
    else:
        return 0

output["churn"] = output.date.apply(lambda x: calc_churn(data, x))
This finishes for me in just:
5.42 s ± 65.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Since we are using np.setdiff1d to find the difference, we don't need to deduplicate each subset of user ids beforehand. Finding the unique subsets in get_users_from_date and then passing them in, even with assume_unique=True, gives a time of:
5.98 s ± 75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
while using Python set difference instead of numpy arrays gets a time of:
7.5 s ± 114 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
and finally, combining Python sets with list comprehensions, similar to your current implementation, never finished running.
The main bottleneck is in comparing the sets of user ids, so doing that in either numpy or pandas instead of iterating over Python objects will improve performance.
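To illustrate the difference on a toy example (the arrays below are made-up stand-ins for the user-id subsets):

import numpy as np

users_before = np.array([1, 2, 3, 4, 5])  # active a month ago
users_now = np.array([2, 3, 5])           # still active today

# ids present a month ago but gone today, computed in C rather than a Python loop
lost = np.setdiff1d(users_before, users_now)      # array([1, 4])
churn = len(lost) / len(np.unique(users_before))  # 2 / 5 = 0.4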
On a different note, your code looks good and is really easy to read, but you should use snake_case instead of camelCase to fit Python convention.
- I get this error when I try to apply your code: TypeError: Cannot compare type 'Timestamp' with type 'str' – Tasos
- @Tasos I did find a typo, but that was unrelated. Copy-pasting the code from here and running it also produced no errors. Do you know what line you're getting that error on? – mochi
- I think it is this line: return df[(date >= df.StartingDate) & (date <= df.EndingDate)].UserId ... I renamed the columns based on my data. – Tasos
- @Tasos Can you check whether output.date, data.StartingDate and data.EndingDate are all datetime objects? What parts did you edit exactly? – mochi
- You are right! My original dataframe didn't have datetime column types. – Tasos
Here is an alternative way of calculating the churn rates that would be optimal efficiency-wise, without altering the data you have. Note that pandas queries may outperform the plain Python loops used here, so your solution could still run faster even though it is algorithmically less efficient.
Also, I am assuming that no user has multiple subscriptions on the same day (a user's subscription intervals don't overlap) and that each user's StartingDate to EndingDate range spans at least 1 month.
- For every row in the csv, generate 2 events: subscribed with date = start date, and unsubscribed with date = end date
- Sort all these events by date in ascending order
- Set currentDate = event[0].date, usersSubscribedByDay = {currentDate: 0} and usersUnsubscribedByDay = {currentDate: 0}
Preprocess the data: keep a running total of unsubscriptions up to each day and the number of actively subscribed users on each day:
for event in events:
    while event.date != currentDate:  # we reached a new day (check only date, not datetime)
        usersSubscribedByDay[currentDate + 1 day] = usersSubscribedByDay[currentDate]
        usersUnsubscribedByDay[currentDate + 1 day] = usersUnsubscribedByDay[currentDate]
        currentDate = currentDate + 1 day
    if event.type == 'subscribed':
        usersSubscribedByDay[currentDate] += 1
    else:
        usersUnsubscribedByDay[currentDate] += 1
        usersSubscribedByDay[currentDate] -= 1
Calculate the churn rate:
for day in sorted(usersSubscribedByDay.keys()):
    if day is on first month:
        churn.append(0)
    else:
        # number of users that were subscribed one month before this day
        subscribers = usersSubscribedByDay[day - 1 month]
        # unsubscriptions during that month; here we are assuming the
        # unsubscriptions are not of a subscription that started within the last month
        lostSubscribers = usersUnsubscribedByDay[day] - usersUnsubscribedByDay[day - 1 month]
        churn.append(lostSubscribers / subscribers)
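For concreteness, here is a minimal, untested pandas sketch of the same event-counting idea (the variable names are mine, and, following the pseudocode above, a user counts as unsubscribed on their EndingDate itself):

import pandas as pd

df = pd.read_csv("subscriptions.csv", parse_dates=["StartingDate", "EndingDate"])

# bucket subscribe/unsubscribe events by calendar day
starts = df["StartingDate"].dt.normalize().value_counts()
ends = df["EndingDate"].dt.normalize().value_counts()

days = pd.date_range(starts.index.min(), pd.Timestamp.now().normalize())
subscribed = starts.reindex(days, fill_value=0)
unsubscribed = ends.reindex(days, fill_value=0)

# running totals: active subscriptions and cumulative unsubscriptions per day
active_by_day = (subscribed - unsubscribed).cumsum()
unsubscribed_by_day = unsubscribed.cumsum()

churn = []
for day in days:
    month_ago = day - pd.DateOffset(months=1)
    if month_ago < days[0]:
        churn.append(0)  # no churn rate during the first month
    else:
        subscribers = active_by_day.loc[month_ago]
        lost = unsubscribed_by_day.loc[day] - unsubscribed_by_day.loc[month_ago]
        churn.append(lost / subscribers if subscribers else 0)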